Loki Ingester in 0/1 fails to connect to outdated IPs in the gossip ring in RHOCP 4
Environment
- Red Hat OpenShift Container Platform (RHOCP)
  - 4
- Red Hat OpenShift Logging (RHOL)
  - 5
  - 6
- LokiStack
- Loki
Issue
- LokiStack operator fails to start the logging-loki-ingester pod due to a connection timeout to an outdated IP in the gossip ring.
- IP addresses in the gossip ring endpoint list that are no longer in use cause the issue.
- The logs indicate an error connecting to the outdated IP address:
WriteTo failed" addr=<IP>:7946 err="dial tcp <IP>:7946: i/o timeout"
- The IP addresses are not present in the pod network or in the list of addresses for the logging-loki-gossip-ring endpoint.
- Loki Ingester is in 0/1 even when there is no issue with the Loki storage as described in the Red Hat Knowledge Article "Loki ingesters 0/1 in RHOCP 4".
- Loki Ingester pod throws the error:
msg="found an existing instance(s) with a problem in the ring, this instance cannot become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=ingester err="instance 10.x.x.x:9095 past heartbeat timeout"
Resolution
Red Hat investigated this issue in the following bug reports:
| RHOL Release | Bug | Fixed version | Errata |
|---|---|---|---|
| 6.3 | LOG-6987 | 6.3.0 | RHBA-2025:11336 |
| 6.2 | LOG-6992 | 6.2.1 | RHBA-2025:3908 |
If this issue still occurs in the environment after updating, open a support case in the Red Hat Customer Portal referring to this solution.
Workaround
Note: the variable LOGGING_LOKI_DISTRIBUTOR_HTTP_PORT_3100_TCP_ADDR can differ between environments because it has two parts:
- Variable part: the LokiStack CR name in upper case, with hyphens replaced by underscores. In this example, LOGGING_LOKI
- Fixed part: _DISTRIBUTOR_HTTP_PORT_3100_TCP_ADDR
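The naming rule in the note can be sketched in shell. This is a minimal sketch, assuming the standard Kubernetes convention for service environment variables (service name in upper case, hyphens replaced by underscores). The helper name distributor_addr_var is hypothetical and not part of any Red Hat tooling.

```shell
# Hypothetical helper: build the name of the distributor address variable
# from the LokiStack CR name (upper case, "-" replaced by "_").
distributor_addr_var() {
  prefix=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]' | tr '-' '_')
  printf '%s_DISTRIBUTOR_HTTP_PORT_3100_TCP_ADDR\n' "$prefix"
}

distributor_addr_var "logging-loki"
# prints LOGGING_LOKI_DISTRIBUTOR_HTTP_PORT_3100_TCP_ADDR
```

Inside the debug pod, running env | grep _DISTRIBUTOR_HTTP_PORT_3100_TCP_ADDR lists the variable that is actually present, which is a quick way to confirm the derived name.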
- Set environment variables
$ cr="logging-loki" ### LokiStack CR name, change if the name is different
$ ns="openshift-logging" ### change if Loki runs in a different namespace
- Start a Loki Distributor debug pod with the UBI8 image
$ oc -n ${ns} debug --image=registry.redhat.io/ubi8:latest deployment/${cr}-distributor
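- Get the "Unhealthy" Loki Ingester members
sh-4.4$ curl -k --cert /var/run/tls/http/server/tls.crt --key /var/run/tls/http/server/tls.key https://${LOGGING_LOKI_DISTRIBUTOR_HTTP_PORT_3100_TCP_ADDR}:3100/ring -H "Accept: application/json"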
- Forget the "Unhealthy" Loki Ingester members
If not using a cluster-wide proxy:
sh-4.4$ curl -k --cert /var/run/tls/http/server/tls.crt --key /var/run/tls/http/server/tls.key https://${LOGGING_LOKI_DISTRIBUTOR_HTTP_PORT_3100_TCP_ADDR}:3100/ring -X POST --data-raw 'forget=<UNHEALTHY_POD_FROM_EARLIER_COMMAND>'
If using a [cluster-wide proxy](https://docs.openshift.com/container-platform/4.16/networking/enable-cluster-wide-proxy.html):
sh-4.4$ curl -k --cert /var/run/tls/http/server/tls.crt --key /var/run/tls/http/server/tls.key --noproxy "${LOGGING_LOKI_DISTRIBUTOR_HTTP_PORT_3100_TCP_ADDR}" https://${LOGGING_LOKI_DISTRIBUTOR_HTTP_PORT_3100_TCP_ADDR}:3100/ring -X POST --data-raw 'forget=<UNHEALTHY_POD_FROM_EARLIER_COMMAND>'
- Restart the Loki Ingester pods
$ oc delete pods -l app.kubernetes.io/component=ingester -n ${ns}
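The get and forget steps above can be combined into one loop run inside the debug pod. This is a sketch under stated assumptions, not a supported tool: the function names unhealthy_ids and forget_unhealthy are hypothetical, the parser assumes that "id" is immediately followed by "state" in the /ring JSON (as in the sample output under Diagnostic Steps), and jq is avoided because it is not assumed to be present in the UBI8 image.

```shell
# Print the id of every ring member whose state is UNHEALTHY.
# Assumption: each member is serialized as {"id":"...","state":"...",...}.
unhealthy_ids() {
  grep -o '"id":"[^"]*","state":"UNHEALTHY"' |
    sed -e 's/^"id":"//' -e 's/","state":"UNHEALTHY"$//'
}

# Hypothetical wrapper: query the ring, then forget each unhealthy member.
# $1 is the distributor address (value of the *_DISTRIBUTOR_HTTP_PORT_3100_TCP_ADDR variable).
forget_unhealthy() {
  addr="$1"
  cert=/var/run/tls/http/server/tls.crt
  key=/var/run/tls/http/server/tls.key
  ring_url="https://${addr}:3100/ring"

  curl -sk --cert "$cert" --key "$key" --noproxy "$addr" \
       "$ring_url" -H "Accept: application/json" |
    unhealthy_ids |
    while read -r id; do
      echo "forgetting ${id}"
      # Same POST as the manual step, with the member id filled in.
      curl -sk --cert "$cert" --key "$key" --noproxy "$addr" \
           "$ring_url" -X POST --data-raw "forget=${id}"
    done
}
```

A possible invocation inside the debug pod is forget_unhealthy "${LOGGING_LOKI_DISTRIBUTOR_HTTP_PORT_3100_TCP_ADDR}". The --noproxy option is passed in both calls so that the same sketch applies with or without a cluster-wide proxy. Restarting the Loki Ingester pods afterwards is still a separate step.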
Root Cause
Loki ingesters that entered an Unhealthy state due to networking issues remained in that state even after the network recovered.
Diagnostic Steps
- Verify that the Loki Ingester pod is 0/1
$ oc get pods -n [namespace_name] | grep ingester
logging-loki-ingester-1 0/1 Running 0 1d
- Check the logs of the Loki Ingester pod for the issue with the ring
$ oc logs logging-loki-ingester-1 -n [namespace_name]
[...]
2023-01-01T00:00:00.000000000Z level=warn ts=2023-01-01T00:00:00.0000000002Z caller=lifecycler.go:241 msg="found an existing instance(s) with a problem in the ring, this instance cannot become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=ingester err="instance 10.x.x.x:9095 past heartbeat timeout"
- Verify that the Loki Ingester is "UNHEALTHY". In the next example, logging-loki-ingester-1 is UNHEALTHY
$ oc -n [namespace_name] debug --image=registry.redhat.io/ubi8:latest deployment/logging-loki-distributor
sh-4.4$ curl -k --cert /var/run/tls/http/server/tls.crt --key /var/run/tls/http/server/tls.key https://${LOGGING_LOKI_DISTRIBUTOR_HTTP_PORT_3100_TCP_ADDR}:3100/ring -H "Accept: application/json"
[...]
{"shards":[{"id":"logging-loki-ingester-0","state":"ACTIVE",[...]},{"id":"logging-loki-ingester-1","state":"UNHEALTHY",[...]}
- Verify that the Loki gossip ring contains IP addresses that no longer exist
$ oc describe endpoints logging-loki-gossip-ring -n [namespace_name]
[...]
Subsets:
  Addresses: 10.128.1.44,10.128.1.46,10.129.0.73,10.129.0.76,10.129.0.77,10.129.0.78,10.129.0.79,10.130.1.10,10.130.1.11,10.130.1.12,10.130.1.13
  NotReadyAddresses: 10.128.1.45,10.129.0.80
  Ports:
    Name Port Protocol
    ---- ---- --------
    gossip-ring 7946 TCP
- Verify that the environment is not hitting the issue described in the Red Hat Knowledge Base article "Loki ingesters 0/1 in RHOCP 4"
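The comparison behind the gossip ring check can be scripted: collect the addresses from the "WriteTo failed" log lines and report the ones that are missing from the logging-loki-gossip-ring endpoint list. This is a minimal sketch, assuming the log line format quoted in the Issue section and gossip port 7946. The helper name stale_gossip_ips is hypothetical.

```shell
# Hypothetical helper: read pod logs on stdin and print every address that
# appears in a "WriteTo failed" line but not in the given endpoint IP list.
# $1 is a space-separated list of endpoint IPs.
stale_gossip_ips() {
  endpoint_ips="$1"
  grep 'WriteTo failed' |
    grep -o 'addr=[0-9.]*:7946' |
    sed -e 's/^addr=//' -e 's/:7946$//' |
    sort -u |
    while read -r ip; do
      case " ${endpoint_ips} " in
        *" ${ip} "*) ;;        # still a known endpoint address
        *) echo "${ip}" ;;     # not in the endpoint list: likely outdated
      esac
    done
}

# Possible usage from a workstation (namespace and pod name are placeholders):
#   eps=$(oc get endpoints logging-loki-gossip-ring -n [namespace_name] \
#           -o jsonpath='{.subsets[*].addresses[*].ip}')
#   oc logs <pod> -n [namespace_name] | stale_gossip_ips "$eps"
```

The jsonpath expression returns only the ready addresses. Whether NotReadyAddresses should also count as known is a judgment call left to the reader.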