PodDisruptionBudget (PDB) could cause Machine Config Operator to be degraded in OpenShift 4


Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 4
  • Machine Config Operator (MCO)
  • PodDisruptionBudget (PDB)

Issue

  • An OpenShift 4 upgrade fails because the machine-config Cluster Operator is degraded.

  • The MachineConfigPool (MCP) is degraded with the following message:

    pool is degraded because nodes fail with "1 nodes are reporting degraded
      status on sync": "Node [node_name]
      is reporting: \"failed to drain node: [node_name]
      after 1 hour. Please see machine-config-controller logs for more information\""
    
  • The machine-config-controller pod logs show errors like the following:

    error when evicting pods/"[pod_name]" -n "[namespace_name]" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
    

Resolution

All of the procedures below delete the offending pod(s), so first make sure that those pods can safely be deleted at the time the procedure runs.

IMPORTANT: If the PDB blocking the upgrade belongs to OCS, refer to "cannot evict rook-ceph-mon pod due to pod violating PodDisruptionBudget in OCS 4" instead, because deleting the rook-ceph-mon pod while the OCS cluster is not healthy could cause data loss.

IMPORTANT: If the PDB blocking the upgrade belongs to OpenShift Virtualization (virt-launcher pods), do NOT follow this solution: deleting the PDB kills the virtual machine ungracefully, which can cause data loss and unintended service disruption. The most common cause is a RAM dirty rate that is higher than the network bandwidth; refer to "OpenShift upgrade delayed due to Virtual Machines failing to drain from nodes" for how to deal with that situation. Other issues can also prevent live migration.

Check for similar pods on the same node

$ oc get pods -n <pod_namespace> -o wide
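To narrow the listing to the node that is being drained, the pods can be filtered by node. A minimal sketch, assuming the standard spec.nodeName field selector; the function name pods_on_node is hypothetical and not part of oc:

```shell
# List the pods of one namespace that are scheduled on a given node.
# Arguments: <pod_namespace> <node_name>
pods_on_node() {
  oc get pods -n "$1" -o wide --field-selector "spec.nodeName=$2"
}
```

For example, `pods_on_node <pod_namespace> <node_name>` shows only the pods that the drain of that node has to move.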

Apply one of the following procedures

If the affected PDB belongs to an OpenShift component, first troubleshoot why the other pods covered by that PDB are failing. For example, a configuration issue with the console can leave the console PDB unable to allow a console pod to be evicted.

1- Disable eviction for draining the node

Use the --disable-eviction option to drain the node manually, as explained in "drain with PodDisruptionBudget blocks in OpenShift 4".
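As a rough illustration only (the referenced solution remains the authoritative procedure), a manual drain that bypasses the PDB check might look like the sketch below. The additional flags --ignore-daemonsets, --delete-emptydir-data and --force are assumptions about what a typical drain needs, older oc clients name the second one --delete-local-data, and drain_without_eviction is a hypothetical helper name:

```shell
# Drain a node by deleting pods instead of evicting them, which skips
# the PodDisruptionBudget check. Pods are removed even if that violates a PDB.
# Argument: <node_name>
drain_without_eviction() {
  oc adm drain "$1" \
    --disable-eviction \
    --ignore-daemonsets \
    --delete-emptydir-data \
    --force
}
```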

2- Delete the pod(s) that cannot be evicted

  • Manually delete the pod(s) that cannot be evicted so that they are recreated on other nodes:

    $ oc delete pod <pod_name> -n <pod_namespace>
    
  • Wait until the upgrade has finished and the MCO is available:

    $ watch -n10 "oc get clusterversion; echo; oc get mcp; echo; oc get nodes -o wide; echo; oc get co"
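For scripted checks, oc wait can block until the pools report the Updated condition instead of polling with watch. A minimal sketch, assuming a default timeout of 60 minutes is acceptable (the value and the name wait_for_mcp_updated are illustrative):

```shell
# Block until every MachineConfigPool reports condition Updated=True,
# or fail once the timeout (default 60m, an arbitrary choice) expires.
# Optional argument: <timeout>, for example 30m
wait_for_mcp_updated() {
  oc wait mcp --all --for=condition=Updated --timeout="${1:-60m}"
}
```

This covers only the MachineConfigPools; oc get clusterversion and oc get co still need to be checked for the overall upgrade status.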
    

3- If the pod cannot be manually removed

  • Check whether the PodDisruptionBudget is configured with minAvailable: 1, which blocks pod eviction during an OCP 4 upgrade when only one matching pod is healthy:

    $ oc get pdb <pdb_name> -n <pod_namespace> 
    NAME         MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
    <pdb_name>   1               N/A               0                     18h
    
  • If the workload runs only 1 replica, patch the PodDisruptionBudget as follows:

    $ oc patch pdb <pdb_name> -n <pod_namespace> --type=merge -p '{"spec":{"minAvailable":0}}'
    
  • Wait until the upgrade has finished and the MCO is available:

    $ watch -n10 "oc get clusterversion; echo; oc get mcp; echo; oc get nodes -o wide; echo; oc get co"
    
  • Restore the PodDisruptionBudget:

    $ oc patch pdb <pdb_name> -n <pod_namespace> --type=merge -p '{"spec":{"minAvailable":1}}'
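The restore command above hard-codes minAvailable: 1. If the original value might differ, it can be recorded before patching and reused afterwards. A minimal sketch, assuming the PDB uses an integer minAvailable (not a percentage and not maxUnavailable); the helper names and the state file path are hypothetical:

```shell
# File used to remember the original value; the path is an arbitrary choice.
PDB_STATE_FILE=${PDB_STATE_FILE:-./pdb-minavailable.orig}

# Save the current minAvailable, then relax the PDB to 0.
# Arguments: <pdb_name> <pod_namespace>
relax_pdb() {
  current=$(oc get pdb "$1" -n "$2" -o jsonpath='{.spec.minAvailable}') || return 1
  # Stop if there is no integer minAvailable that could be restored later.
  case $current in
    ''|*[!0-9]*) echo "unsupported minAvailable: '$current'" >&2; return 1 ;;
  esac
  printf '%s\n' "$current" > "$PDB_STATE_FILE"
  oc patch pdb "$1" -n "$2" --type=merge -p '{"spec":{"minAvailable":0}}'
}

# Put back the value saved by relax_pdb.
# Arguments: <pdb_name> <pod_namespace>
restore_pdb() {
  original=$(cat "$PDB_STATE_FILE") || return 1
  oc patch pdb "$1" -n "$2" --type=merge -p "{\"spec\":{\"minAvailable\":$original}}"
}
```

Run relax_pdb before the drain is retried and restore_pdb once the upgrade has finished.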
    

4- If the PodDisruptionBudget can't be patched

Use this procedure if patching the PodDisruptionBudget fails with the error PodDisruptionBudget.policy "<pdb_name>" is invalid: spec: Forbidden: updates to poddisruptionbudget spec are forbidden.

  • Back up the PodDisruptionBudget:

    $ oc get pdb <pdb_name> -n <pod_namespace> -o yaml > <pdb_name>.yaml
    
  • Delete the PodDisruptionBudget that is configured with minAvailable: 1:

    $ oc delete pdb <pdb_name> -n <pod_namespace>
    
  • Wait until the upgrade has finished and the MCO is available:

    $ watch -n10 "oc get clusterversion; echo; oc get mcp; echo; oc get nodes -o wide; echo; oc get co"
    
  • Edit the backup file and remove the status section and the server-generated metadata (see the PodDisruptionBudget example below).

  • Create the PodDisruptionBudget again:

    $ oc create -f <pdb_name>.yaml -n <pod_namespace>
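The manual cleanup of the backup file can also be scripted. A rough sketch with awk, assuming the two-space indentation that oc get -o yaml produces; the list of dropped fields is a judgment call and clean_pdb_yaml is a hypothetical name. Review the result before recreating the PDB.

```shell
# Read an exported PDB manifest on stdin and print it without the status
# block and without server-generated metadata fields.
clean_pdb_yaml() {
  awk '
    /^status:/          { skip = 1; next }   # top-level status block
    /^  managedFields:/ { skip = 2; next }   # metadata.managedFields block
    # The status block ends at the next top-level key.
    skip == 1 && /^[^ ]/ { skip = 0 }
    # managedFields ends at the next top-level key or the next metadata key.
    skip == 2 && (/^[^ ]/ || /^  [^ -]/) { skip = 0 }
    skip { next }
    /^  (creationTimestamp|generation|resourceVersion|selfLink|uid):/ { next }
    { print }
  '
}
```

For example, `clean_pdb_yaml < <pdb_name>.yaml > <pdb_name>-clean.yaml` writes the reduced manifest to a second file and leaves the backup untouched.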
    

Example of PodDisruptionBudget with minAvailable: 1

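# Use policy/v1beta1 only on clusters older than OpenShift 4.8 (Kubernetes 1.21); it was removed in Kubernetes 1.25.
apiVersion: policy/v1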
kind: PodDisruptionBudget
metadata:
  name: <pdb_name>
  namespace: <pod_namespace>
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: <my-app>

Note: If the workload is scaled to 0 replicas, the rolling update is not affected even when ALLOWED DISRUPTIONS is 0 in the PDB, because there is no pod that needs to be evicted. Another workaround, instead of modifying or deleting the PDB, is therefore to scale the workload to 0 replicas before the upgrade and scale it back to the expected number of replicas after the upgrade has finished.
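The scale-down workaround could be scripted roughly as follows. This is a sketch that assumes the pods are owned by a Deployment (a DeploymentConfig or StatefulSet would need a different resource type); the helper names and the state file path are hypothetical. Scaling to 0 takes the application offline until it is scaled back.

```shell
# File used to remember the original replica count; the path is arbitrary.
REPLICAS_FILE=${REPLICAS_FILE:-./replicas.orig}

# Record the current replica count, then scale the Deployment to 0.
# Arguments: <deployment_name> <pod_namespace>
scale_down() {
  oc get deployment "$1" -n "$2" -o jsonpath='{.spec.replicas}' > "$REPLICAS_FILE" || return 1
  # Stop if nothing was recorded, so that scale_back cannot run blind.
  [ -s "$REPLICAS_FILE" ] || return 1
  oc scale deployment "$1" -n "$2" --replicas=0
}

# Scale the Deployment back to the recorded replica count.
# Arguments: <deployment_name> <pod_namespace>
scale_back() {
  oc scale deployment "$1" -n "$2" --replicas="$(cat "$REPLICAS_FILE")"
}
```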

Root Cause

An incorrectly configured PodDisruptionBudget can prevent a node from being drained, which blocks the upgrade:

  • A PodDisruptionBudget with minAvailable: 1 can block the eviction process while an OCP 4 upgrade proceeds.
  • If several nodes are rebooted, all the pods can end up running on a single node, and the PodDisruptionBudget can then prevent that node from being drained.
  • The PodDisruptionBudget only prevents the eviction of pods; pods covered by a PodDisruptionBudget can still be deleted manually.
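The arithmetic behind a blocking PDB can be made concrete. The number of allowed disruptions is the number of currently healthy pods minus the number that must stay healthy, never below 0; with an integer minAvailable, the number that must stay healthy is minAvailable itself. A small sketch (the function name is hypothetical):

```shell
# Number of evictions that a PDB with an integer minAvailable permits.
# Arguments: <healthy_pods> <minAvailable>
allowed_disruptions() {
  allowed=$(($1 - $2))
  if [ "$allowed" -lt 0 ]; then
    allowed=0
  fi
  echo "$allowed"
}

allowed_disruptions 1 1   # one healthy replica, minAvailable: 1 -> prints 0, the drain is blocked
allowed_disruptions 3 1   # three healthy replicas, minAvailable: 1 -> prints 2
```

This is why a single-replica workload with minAvailable: 1 always shows ALLOWED DISRUPTIONS 0.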

Diagnostic Steps

Output when the upgrade is failing:

  • The machine-config Cluster Operator is degraded and shows a message similar to the following:

        $ oc get co machine-config
        NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
        machine-config   4.12.45   True        False         True       2h
    
        $ oc get co machine-config -o yaml
        [...]
          extension:
            master: all 3 nodes are at latest configuration rendered-master-b0201cffb3e33e8504ca4cd06644be41
            worker: 'pool is degraded because nodes fail with "1 nodes are reporting degraded
              status on sync": "Node [node_name]
              is reporting: \"failed to drain node: [node_name]
              after 1 hour. Please see machine-config-controller logs for more information\""'
        [...]
    
  • Some nodes will be SchedulingDisabled:

    $ oc get node
    NAME                          STATUS                     ROLES    AGE   VERSION
    ip-10-0-111-11.example.com    Ready,SchedulingDisabled   master   42h   v1.25.14+a52e8df
    ip-10-0-222-22.example.com    Ready                      worker   41h   v1.25.14+a52e8df
    ip-10-0-333-33.example.com    Ready                      master   42h   v1.25.14+a52e8df
    ip-10-0-444-44.example.com    Ready                      worker   41h   v1.25.14+a52e8df
    ip-10-0-555-55.example.com    Ready                      master   42h   v1.25.14+a52e8df
    ip-10-0-666-66.example.com    Ready,SchedulingDisabled   worker   41h   v1.25.14+a52e8df
    
  • The machine-config-controller pod logs show the following messages:

    $ oc logs -n openshift-machine-config-operator machine-config-controller-xxxxx -c machine-config-controller
    ...
    I0220 04:14:18.029980   49566 update.go:89] error when evicting pod "test-1-xxxxx" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
    I0220 04:14:23.055546   49566 update.go:89] error when evicting pod "test-1-xxxxx" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
    I0220 04:14:28.073188   49566 update.go:89] error when evicting pod "test-1-xxxxx" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
    
  • The pod shown in the logs above is still running and has not been evicted:

    $ oc get pod -o wide
    NAME            READY   STATUS      RESTARTS   AGE   IP            NODE                         NOMINATED NODE   READINESS GATES
    test-1-xxxxx    1/1     Running     0          47m   10.131.0.23   ip-10-0-111-11.example.com   <none>           <none>
    
  • Several pods can be running on a single node:

    $ oc get pod -o wide
    NAME            READY   STATUS      RESTARTS   AGE   IP            NODE                         NOMINATED NODE   READINESS GATES
    test-1-xxxxx    1/1     Running     0          47m   10.131.0.23   ip-10-0-111-11.example.com   <none>           <none>
    test-1-yyyyy    1/1     Running     0          47m   10.131.0.24   ip-10-0-111-11.example.com   <none>           <none>
    test-1-zzzzz    1/1     Running     0          47m   10.131.0.25   ip-10-0-111-11.example.com   <none>           <none>
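Two of the checks above can be condensed into small filters. A sketch, assuming the log format shown in this article and, for the second filter, the column layout that oc get pdb -A prints (NAMESPACE, NAME, MIN AVAILABLE, MAX UNAVAILABLE, ALLOWED DISRUPTIONS, AGE); both function names are hypothetical:

```shell
# Print the unique pod names that the controller failed to evict.
# Reads machine-config-controller log lines on stdin.
blocked_pods() {
  sed -n 's|.*error when evicting pods\{0,1\}[/ ]"\([^"]*\)".*|\1|p' | sort -u
}

# Print namespace/name of every PDB whose ALLOWED DISRUPTIONS value is 0.
# Reads the table printed by "oc get pdb -A" on stdin; the fifth
# whitespace-separated field of each data row is the allowed disruptions.
blocking_pdbs() {
  awk 'NR > 1 && $5 == 0 { print $1 "/" $2 }'
}

# Usage:
#   oc logs -n openshift-machine-config-operator machine-config-controller-xxxxx \
#     -c machine-config-controller | blocked_pods
#   oc get pdb -A | blocking_pdbs
```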
    

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.