Loki Ingester 0/1 fails to Connect to Outdated IPs in Gossip Ring in RHOCP 4

Solution Verified - Updated

Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 4
  • Red Hat OpenShift Logging (RHOL)
    • 5
    • 6
  • LokiStack
  • Loki

Issue

  • The LokiStack operator fails to start the logging-loki-ingester pod due to a connection timeout to an outdated IP address in the gossip ring.

  • IP addresses in the gossip ring endpoint list that are no longer in use are causing the issue.

  • The logs indicate an error connecting to the outdated IP address: "WriteTo failed" addr=<IP>:7946 err="dial tcp <IP>:7946: i/o timeout".

  • The IP addresses are not present in the pod network or in the list of addresses for the logging-loki-gossip-ring endpoint.

  • The Loki Ingester pod is 0/1 even though there is no issue with the Loki storage as described in the Red Hat Knowledge Article "Loki ingesters 0/1 in RHOCP 4".

  • Loki Ingester pod throws the error:

    msg="found an existing instance(s) with a problem in the ring, this instance cannot become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=ingester err="instance 10.x.x.x:9095 past heartbeat timeout"
    

Resolution

Red Hat investigated this issue in the following bug reports:

RHOL Release   Bug        Fixed version   Errata
6.3            LOG-6987   6.3.0           RHBA-2025:11336
6.2            LOG-6992   6.2.1           RHBA-2025:3908

If this issue still occurs in the environment after updating, open a support case in the Red Hat Customer Portal referring to this solution.

Workaround

Note: The name of the variable LOGGING_LOKI_DISTRIBUTOR_HTTP_PORT_3100_TCP_ADDR depends on the environment, because it has two parts:

  • Variable part: the LokiStack CR name in upper case, with dashes replaced by underscores. In this example, LOGGING_LOKI
  • Fixed part: _DISTRIBUTOR_HTTP_PORT_3100_TCP_ADDR
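
The variable name follows the Kubernetes convention for service environment variables: the service name in upper case with dashes replaced by underscores. The following is a minimal sketch for deriving the name from the LokiStack CR name. The helper name ring_addr_var is hypothetical and not part of any product tooling, and the sketch assumes the fixed part is exactly the one described in the note above.

```shell
# Hypothetical helper: print the name of the environment variable that holds
# the distributor service address for a given LokiStack CR name.
ring_addr_var() {
  # Variable part: CR name in upper case, dashes replaced by underscores.
  # Fixed part: _DISTRIBUTOR_HTTP_PORT_3100_TCP_ADDR
  printf '%s_DISTRIBUTOR_HTTP_PORT_3100_TCP_ADDR\n' \
    "$(printf '%s' "$1" | tr 'a-z-' 'A-Z_')"
}

ring_addr_var "logging-loki"
```

Inside the debug pod, running env | grep DISTRIBUTOR_HTTP_PORT_3100_TCP_ADDR should show the variable that is actually set, which is a quick way to confirm the derived name.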
  1. Set environment variables
$ cr="logging-loki"      ### LokiStack CR name, change if the name is different
$ ns="openshift-logging" ### change if Loki runs in a different namespace
  2. Start a Loki Distributor debug pod with the UBI8 image
$ oc -n ${ns} debug --image=registry.redhat.io/ubi8:latest deployment/${cr}-distributor
  3. Get the "Unhealthy" Loki Ingester members
  sh-4.4$ curl -k --cert /var/run/tls/http/server/tls.crt --key /var/run/tls/http/server/tls.key https://${LOGGING_LOKI_DISTRIBUTOR_HTTP_PORT_3100_TCP_ADDR}:3100/ring -H "Accept: application/json"

  4. Forget the "Unhealthy" Loki Ingester members
  // If not using cluster-wide proxy
  sh-4.4$ curl -k --cert /var/run/tls/http/server/tls.crt --key /var/run/tls/http/server/tls.key https://${LOGGING_LOKI_DISTRIBUTOR_HTTP_PORT_3100_TCP_ADDR}:3100/ring -X POST --data-raw 'forget=<UNHEALTHY_POD_FROM_EARLIER_COMMAND>'

  // If using [cluster-wide proxy](https://docs.openshift.com/container-platform/4.16/networking/enable-cluster-wide-proxy.html)
  sh-4.4$ curl -k --cert /var/run/tls/http/server/tls.crt --key /var/run/tls/http/server/tls.key --noproxy "${LOGGING_LOKI_DISTRIBUTOR_HTTP_PORT_3100_TCP_ADDR}" https://${LOGGING_LOKI_DISTRIBUTOR_HTTP_PORT_3100_TCP_ADDR}:3100/ring -X POST --data-raw 'forget=<UNHEALTHY_POD_FROM_EARLIER_COMMAND>'
  5. Exit the debug pod and restart the Loki Ingester pods
$ oc delete pods -l app.kubernetes.io/component=ingester -n ${ns}
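
When the ring has many members, picking the unhealthy ones out of the /ring JSON by eye is error-prone. The following is a minimal sketch that extracts their ids with grep and sed only, since jq may not be available in the UBI8 debug pod. The helper name unhealthy_ids is hypothetical, and the sketch assumes that "state" immediately follows "id" in each member object, as in the sample output in the Diagnostic Steps.

```shell
# Hypothetical helper: read the /ring JSON on stdin and print the id of every
# member whose state is UNHEALTHY, one id per line.
# Assumption: "state" immediately follows "id" in each member object.
unhealthy_ids() {
  grep -o '"id":"[^"]*","state":"UNHEALTHY"' |
    sed 's/^"id":"\([^"]*\)".*$/\1/'
}

# Shortened, illustrative sample of the ring output (not real cluster data).
sample='{"shards":[{"id":"logging-loki-ingester-0","state":"ACTIVE"},{"id":"logging-loki-ingester-1","state":"UNHEALTHY"}]}'
printf '%s\n' "$sample" | unhealthy_ids
```

In the debug pod, the output of the curl command that requests /ring with the Accept: application/json header could be piped into the helper, and each printed id used as the forget= value in the POST request.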

Root Cause

Loki ingesters that got into an Unhealthy state due to networking issues stayed in that state even when the network recovered.

Diagnostic Steps

  1. Verify that the Loki Ingester pod is 0/1

    $ oc get pods -n [namespace_name] | grep ingester
    logging-loki-ingester-1                        0/1     Running   0          1d
    
  2. Check the logs of the Loki Ingester pod for the issue with the ring

    $ oc logs logging-loki-ingester-1 -n [namespace_name]
    [...]
    2023-01-01T00:00:00.000000000Z level=warn ts=2023-01-01T00:00:00.0000000002Z caller=lifecycler.go:241 msg="found an existing instance(s) with a problem in the ring, this instance cannot become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=ingester err="instance 10.x.x.x:9095 past heartbeat timeout"
    
  3. Verify that a Loki Ingester is reported as "UNHEALTHY" in the ring. In the following example, logging-loki-ingester-1 is UNHEALTHY

    $ oc -n [namespace_name] debug --image=registry.redhat.io/ubi8:latest deployment/logging-loki-distributor

    sh-4.4$ curl -k --cert /var/run/tls/http/server/tls.crt --key /var/run/tls/http/server/tls.key https://${LOGGING_LOKI_DISTRIBUTOR_HTTP_PORT_3100_TCP_ADDR}:3100/ring -H "Accept: application/json"
    [...]
    {"shards":[{"id":"logging-loki-ingester-0","state":"ACTIVE",[...]},{"id":"logging-loki-ingester-1","state":"UNHEALTHY",[...]}
    
  4. Verify that the Loki gossip ring contains IP addresses that no longer exist

    $ oc describe endpoints logging-loki-gossip-ring -n [namespace_name]
    [...]
    Subsets:
      Addresses:          10.128.1.44,10.128.1.46,10.129.0.73,10.129.0.76,10.129.0.77,10.129.0.78,10.129.0.79,10.130.1.10,10.130.1.11,10.130.1.12,10.130.1.13
      NotReadyAddresses:  10.128.1.45,10.129.0.80
      Ports:
        Name         Port  Protocol
        ----         ----  --------
        gossip-ring  7946  TCP
    
  5. Verify that the environment is not hitting the issue described in the Red Hat Knowledge Base article "Loki ingesters 0/1 in RHOCP 4"
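
To support the check of outdated addresses, the IP addresses reported in the "WriteTo failed" log lines can be compared with the IP addresses of the pods that currently exist. The following is a minimal sketch of that comparison. The helper name stale_gossip_ips and the sample log lines are hypothetical; the log format is assumed to match the error shown in the Issue section.

```shell
# Hypothetical helper: read ingester log lines on stdin and print every IP from
# a "WriteTo failed" line that is NOT listed in the file of current pod IPs ($1).
stale_gossip_ips() {
  sed -n 's/.*WriteTo failed.*addr=\([0-9.]*\):7946.*/\1/p' |
    sort -u |
    grep -vxF -f "$1"
}

# Illustrative run with made-up data (not real cluster output).
pod_ips="$(mktemp)"
printf '10.128.1.44\n10.129.0.73\n' > "$pod_ips"
printf '%s\n' \
  'msg="WriteTo failed" addr=10.128.1.45:7946 err="dial tcp 10.128.1.45:7946: i/o timeout"' \
  'msg="WriteTo failed" addr=10.128.1.44:7946 err="dial tcp 10.128.1.44:7946: i/o timeout"' |
  stale_gossip_ips "$pod_ips"
rm -f "$pod_ips"

# Usage sketch against a cluster (placeholders as elsewhere in this article):
#   oc get pods -n [namespace_name] -o jsonpath='{range .items[*]}{.status.podIP}{"\n"}{end}' > pod-ips.txt
#   oc logs logging-loki-ingester-1 -n [namespace_name] | stale_gossip_ips pod-ips.txt
```

Any address printed by the helper appears in the connection errors but does not belong to a running pod in the namespace, which is consistent with the outdated gossip ring entries described in this solution.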

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.