Loki ingesters 0/1 in RHOCP 4

Solution Verified - Updated

Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 4
  • Red Hat OpenShift Logging (RHOL)
    • 5
    • 6
  • Red Hat Network Observability
  • LokiStack

Issue

  • A logging-loki-ingester or netobserv-loki-ingester pod is unable to initialize because stale loki-ingester entries are present in the LokiStack hash ring:

    netobserv-loki-ingester-0                        0/1     Running   0          1d
    
    logging-loki-ingester-0                        0/1     Running   0          1d
    
  • The loki-ingester logs show an InvalidBucketState error or an error about the ring:

    level=error ts=2024-08-14T09:25:53.576834111Z caller=flush.go:143 org_id=infrastructure msg="failed to flush" err="failed to flush chunks: store put chunk: InvalidBucketState: The request is not valid with the current state of the bucket.\n\tstatus code: 409, request id: xxxx-xxxx-xxxx, host id: xxxx-xxxx-xxx, num_chunks: 1, labels: {kubernetes_host=\"example.ocp.com\", log_type=\"infrastructure\"}"
    
    msg="found an existing instance(s) with a problem in the ring, this instance cannot become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=ingester err="instance 10.x.x.x:9095 past heartbeat timeout"
    
  • Logs cannot be viewed in the web console; the log section shows the error "Too Many Unhealthy Instances In The Ring".

Resolution

1. For non-"production" sizes

Red Hat investigated this issue and fixed it in the following versions:

RHOL Release   Bug        Fixed version   Errata
RHOL 5.9       LOG-5614   5.9.3           RHBA-2024:3736
RHOL 5.8       LOG-5615   5.8.8           RHBA-2024:3738
RHOL 5.7       LOG-5616   5.7.15          RHBA-2024:3739
RHOL 5.6       LOG-5617   5.6.20          RHBA-2024:3740
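
To judge whether an installed RHOL version already contains the fix, its version can be compared with the "Fixed version" column. The following is a minimal sketch, not an official tool: the function names fixed_version_for and has_fix are illustrative, and it assumes a sort command that supports -V (version sort), as GNU coreutils does.

```shell
# Sketch: decide whether an installed RHOL version already contains the fix.
# The fixed versions are copied from the table in this article; the function
# names are illustrative only.

# Print the first fixed version for a given RHOL minor stream (for example 5.9).
fixed_version_for() {
  case "$1" in
    5.9) echo "5.9.3" ;;
    5.8) echo "5.8.8" ;;
    5.7) echo "5.7.15" ;;
    5.6) echo "5.6.20" ;;
    *)   return 1 ;;   # stream not listed in the table
  esac
}

# Succeed when the installed version is at or above the fixed version of its stream.
# Returns 2 when the stream is not listed in the table.
has_fix() {
  local installed stream fixed
  installed="$1"
  stream="${installed%.*}"                 # 5.8.10 -> 5.8
  fixed="$(fixed_version_for "$stream")" || return 2
  # sort -V orders version strings component by component (assumes GNU sort)
  [ "$(printf '%s\n%s\n' "$fixed" "$installed" | sort -V | head -n1)" = "$fixed" ]
}
```

For example, "has_fix 5.8.10 && echo fixed" prints "fixed", while "has_fix 5.9.2" fails. The installed operator version can be read from the output of "oc get csv" in the namespace where the operator is installed.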

If this issue still occurs in your environment after updating, open a support case in the Red Hat Customer Portal referring to this solution.

After upgrading, restart the Loki ingester pods if the Loki Operator has not restarted them automatically:

$ oc delete pods -l app.kubernetes.io/component=ingester -n [namespace_name]

2. When the ingester is not able to write data to the storage

Review and fix the problem that prevents the data from being persisted. This is usually caused by an incorrect definition for accessing the backend storage, bad credentials, missing connectivity, or a backend storage that is not working as expected.

If the issue was:

Once the storage issue is fixed, restart the Loki ingester pods:

$ oc delete pods -l app.kubernetes.io/component=ingester -n [namespace_name]
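
After a restart, it can help to confirm that no ingester pod is still stuck at 0/1. The following is a minimal sketch that filters the output of "oc get pods"; the function name not_ready_ingesters is illustrative and is not part of any Red Hat tooling.

```shell
# Sketch: list ingester pods whose READY column is not n/n.
# Reads "oc get pods" output on stdin; the function name is illustrative.
not_ready_ingesters() {
  awk '$1 ~ /ingester/ {
         split($2, r, "/")            # READY column, for example 0/1
         if (r[1] != r[2]) print $1   # ready count differs from the total
       }'
}
```

A possible usage is "oc get pods -n [namespace_name] | not_ready_ingesters". An empty result suggests that all ingester pods report ready; any pod name printed is still not ready.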

Root Cause

Red Hat investigated this issue in LOG-4840 and detected two different causes:

  • In most cases, the Loki ingester is not able to write data to the storage and remains Unhealthy. The storage problem must be reviewed and fixed so that the data can be persisted.
  • In non-"production" sizes, a configuration issue was present: the replay_memory_ceiling value was 0.
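
The second cause can be checked by looking at the replay_memory_ceiling value in the Loki configuration that the operator generated. The following is a minimal sketch that extracts the value from configuration YAML read on stdin; the function name replay_memory_ceiling_value is illustrative.

```shell
# Sketch: print the replay_memory_ceiling value found in a Loki configuration.
# Reads the configuration YAML on stdin; the function name is illustrative.
replay_memory_ceiling_value() {
  awk -F': *' '/^[[:space:]]*replay_memory_ceiling:/ {
                 gsub(/[[:space:]]/, "", $2)   # trim whitespace around the value
                 print $2
                 exit                          # only the first match is reported
               }'
}
```

As an assumption that is not stated in this article, the generated configuration is expected in a ConfigMap named [lokistack_name]-config under the key config.yaml. Under that assumption, a possible usage is "oc get configmap [lokistack_name]-config -n [namespace_name] -o jsonpath='{.data.config\.yaml}' | replay_memory_ceiling_value". An output of 0 would match the configuration issue described above.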

Diagnostic Steps

Note: Replace [namespace_name] with openshift-logging or netobserv as needed.

  • Check the status of the loki-ingester pods:

    $ oc get pods -n [namespace_name] | grep ingester
    logging-loki-ingester-0                        0/1     Running   0          1d
    
  • Check the logs of the loki-ingester pod for the issue with the ring:

    $ oc logs [loki_ingester_pod] -n [namespace_name]
    [...]
    2023-01-01T00:00:00.000000000Z level=warn ts=2023-01-01T00:00:00.0000000002Z caller=lifecycler.go:241 msg="found an existing instance(s) with a problem in the ring, this instance cannot become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=ingester err="instance 10.x.x.x:9095 past heartbeat timeout"
    

Note: If the issue is not related to the storage, check the Red Hat Knowledge Article "Loki Ingester 0/1 fails to Connect to Outdated IPs in Gossip Ring in RHOCP 4".
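
To tell the storage problem apart from a ring-only problem, the ingester logs can be searched for the two error messages quoted in this article. The following is a minimal sketch; the function name classify_ingester_log and its output wording are illustrative, and it only recognizes the two messages shown here.

```shell
# Sketch: report which of the two error signatures quoted in this article
# appear in ingester logs read from stdin. The function name is illustrative.
classify_ingester_log() {
  awk '/failed to flush/        { storage = 1 }
       /past heartbeat timeout/ { ring = 1 }
       END {
         if (storage) print "storage: chunks could not be flushed to the object store"
         if (ring)    print "ring: an instance in the ring is past its heartbeat timeout"
         if (!storage && !ring) print "none: no known signature found"
       }'
}
```

A possible usage is "oc logs [loki_ingester_pod] -n [namespace_name] | classify_ingester_log", where the pod name placeholder stands for the affected ingester pod. A "storage" line points to the storage resolution; a "ring" line without a "storage" line points to the Knowledge Article mentioned in the note above.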

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.