Vector collector ignores audit logs or duplicates data after restart due to short ignore_older_secs default

Solution Unverified - Updated

Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 4
  • Red Hat OpenShift Logging
    • 6.0 and later
  • Vector collector

Issue

  • Audit logs are lost or duplicated when Vector restarts, because the default one-hour ignore_older_secs threshold is frequently exceeded in environments with a low audit event frequency.

Resolution

The ignore_older_secs parameter is not exposed in the ClusterLogForwarder (CLF) API, so it cannot currently be set to a higher value.

This issue has been reported to Red Hat engineering and is being tracked in LOG-9359. For more information, open a new support case on the Red Hat Customer Portal and refer to this solution.

Root Cause

Disclaimer: Links contained herein to external website(s) are provided for convenience only. Red Hat has not reviewed the links and is not responsible for the content or its availability. The inclusion of any link to an external website does not imply endorsement by Red Hat of the website or their entities, products or services. You agree that Red Hat is not responsible or liable for any loss or expenses that may result due to your use of (or reliance on) the external site or content.

The input that reads the audit logs is configured with the option ignore_older_secs: 3600 (1 hour). When a collector pod running Vector is restarted, Vector applies ignore_older_secs when it resumes reading the logs from its checkpoint.

This issue is compounded by a known upstream issue in Vector (#17208), where the collector restarts reading from the beginning instead of resuming from a checkpoint when the checkpoint is older than the ignore_older_secs threshold. While fixing this upstream bug would prevent duplication, the current short threshold would still cause Vector to "ignore" the file, leading to silent data loss.
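
The threshold comparison described above can be illustrated with a minimal sketch. This is a simplified model for reasoning about exposure, not Vector's actual implementation: the function name exceeds_ignore_older is hypothetical, the idle time is assumed to be the time between the last write to the audit log and the collector restart, and treating an idle time exactly equal to the threshold as exceeded is also an assumption.

```python
# Value quoted in this article for the audit log input.
DEFAULT_IGNORE_OLDER_SECS = 3600


def exceeds_ignore_older(last_write_epoch: float,
                         restart_epoch: float,
                         ignore_older_secs: int = DEFAULT_IGNORE_OLDER_SECS) -> bool:
    """Return True when the audit log was idle for at least ignore_older_secs.

    Hypothetical helper: it only models the comparison described in the
    Root Cause section and does not call or reproduce any Vector code.
    """
    idle_secs = restart_epoch - last_write_epoch
    return idle_secs >= ignore_older_secs


# A node whose last audit record was written 90 minutes before the restart
# falls outside the one-hour threshold.
print(exceeds_ignore_older(1700000000, 1700000000 + 90 * 60))
```

A result of True only indicates that the restart falls into the situation described in this section. It does not by itself confirm that records were lost or duplicated.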

Diagnostic Steps

  1. Check the vector.yaml configuration inside the collector pod to verify the current threshold:

    [sources.input_audit_host]
    type = "file"
    include = ["/var/log/audit/audit.log"]
    ignore_older_secs = 3600
    
  2. Run the following command on a node to identify if there are gaps in the audit logs exceeding one hour:

    awk 'match($0, /audit\(([0-9]+\.[0-9]+)/, m){ts=m[1]; if(pts && ts-pts>=3600){print pl; print $0; print "---"}; pts=ts; pl=$0}' /var/log/audit/audit.log
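
Where the three-argument form of match() is not available (it is a gawk extension), the same gap check from step 2 can be sketched in Python. This is an illustrative equivalent under stated assumptions, not a supported tool: the function name find_gaps and the sample records are hypothetical, and the regular expression assumes audit records carry a timestamp in the form audit(EPOCH.MILLIS:SERIAL).

```python
import re
from typing import Iterable, List, Tuple

# Matches the epoch timestamp in records such as "msg=audit(1700000000.123:456):".
AUDIT_TS = re.compile(r"audit\((\d+\.\d+)")


def find_gaps(lines: Iterable[str],
              threshold_secs: float = 3600.0) -> List[Tuple[float, float]]:
    """Return (previous_ts, current_ts) pairs spaced at least threshold_secs apart."""
    gaps = []
    prev = None
    for line in lines:
        m = AUDIT_TS.search(line)
        if m is None:
            continue  # skip lines without an audit timestamp
        ts = float(m.group(1))
        if prev is not None and ts - prev >= threshold_secs:
            gaps.append((prev, ts))
        prev = ts
    return gaps


# Hypothetical sample records; on a node, pass an open file handle for
# /var/log/audit/audit.log instead of this list.
sample = [
    "type=SYSCALL msg=audit(1700000000.000:1): example",
    "type=SYSCALL msg=audit(1700000100.000:2): example",
    "type=SYSCALL msg=audit(1700004000.000:3): example",
]
for before, after in find_gaps(sample):
    print(f"gap between {before} and {after}")
```

Each reported pair marks a period in which the audit log stayed idle for at least the threshold, which is the condition under which a collector restart is exposed to this issue.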
    

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.