Pod disruption budget slows down cluster update

Solution Verified - Updated

Environment

  • Red Hat OpenShift Service on AWS (ROSA)
  • Red Hat OpenShift Dedicated (OSD)

Issue

  • Updates to compute nodes on OSD or ROSA clusters are slow, or the update appears to be stalled.
  • Pod disruption budgets increase the time taken for a cluster update to complete.

Resolution

  • Consider relaxing minAvailable on the PodDisruptionBudget resources defined for workloads for the duration of the update. This reduces the time that a PodDisruptionBudget blocks the draining of pods from compute nodes.
  • Specify a shorter Node draining Grace period in Cluster Settings, matched to a reasonable termination time for the application pods.
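
As a minimal sketch of the first option, the helper below builds a JSON merge patch for spec.minAvailable and applies it with oc patch. The namespace my-app, the PodDisruptionBudget name my-app-pdb, and the values 1 and 2 are hypothetical placeholders, and the sketch assumes the budget uses an integer minAvailable rather than maxUnavailable or a percentage.

```shell
# Sketch only: temporarily relax a PodDisruptionBudget for the duration of an update.
# Function names, namespace, PDB name and values are illustrative assumptions.

# Build a JSON merge patch that sets spec.minAvailable to an integer value.
pdb_min_available_patch() {
  printf '{"spec":{"minAvailable":%d}}' "$1"
}

# Apply the patch to one PodDisruptionBudget.
# Arguments: <namespace> <pdb-name> <minAvailable>
# Requires an oc session logged in with permission to edit the resource.
set_pdb_min_available() {
  oc patch poddisruptionbudget "$2" -n "$1" \
    --type=merge -p "$(pdb_min_available_patch "$3")"
}

# Example usage (not executed here):
#   oc get poddisruptionbudget my-app-pdb -n my-app -o yaml > my-app-pdb.backup.yaml
#   set_pdb_min_available my-app my-app-pdb 1   # before the update
#   set_pdb_min_available my-app my-app-pdb 2   # after the update, restore the original value
```

Saving the original specification first makes it straightforward to restore the intended budget once the update has completed.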

Root Cause

  • A pod disruption budget specified for a workload may, if it is not defined correctly, block the draining of pods from compute nodes during the Machine Config Operator part of the cluster update process.
  • Long grace periods effectively pause the update of compute nodes whose pods are covered by a restrictive PodDisruptionBudget specification, because the update process waits for the specified grace period before forcibly evicting pods from nodes.
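
To illustrate why a restrictive budget blocks draining, the sketch below computes the number of voluntary evictions a minAvailable-style budget permits: the number of healthy pods minus the required minimum, never less than zero. The function name and the replica counts are hypothetical and chosen only for illustration.

```shell
# Sketch only: allowed disruptions for a PodDisruptionBudget that uses minAvailable.
# Arguments: <healthy-pods> <minAvailable>
allowed_disruptions() {
  allowed=$(( $1 - $2 ))
  # The budget never reports a negative number of allowed disruptions.
  if [ "$allowed" -lt 0 ]; then
    allowed=0
  fi
  echo "$allowed"
}

allowed_disruptions 3 3   # prints 0: no pod can be evicted, so the node drain waits
allowed_disruptions 3 2   # prints 1: one pod at a time can be evicted
```

With three replicas and minAvailable set to 3, no eviction is ever permitted, so the drain can only proceed once the grace period expires.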

Diagnostic Steps

Check the pod disruption budgets in a workload namespace as follows:

oc get poddisruptionbudget -n <namespace>

For all namespaces:

oc get poddisruptionbudget --all-namespaces

Note: Some pod disruption budgets already exist in control plane namespaces. These are required for normal operation of a managed OpenShift cluster and must not be modified.
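
To narrow the list to budgets that currently permit no evictions, the following sketch reads the status.disruptionsAllowed field of each PodDisruptionBudget and keeps only the entries where it is 0. The function names are hypothetical; the output still includes control plane namespaces, which should be left unchanged.

```shell
# Sketch only: find PodDisruptionBudgets that currently allow zero disruptions.

# Print namespace, name and status.disruptionsAllowed as tab-separated columns.
list_pdb_disruptions() {
  oc get poddisruptionbudget --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.disruptionsAllowed}{"\n"}{end}'
}

# Keep only the rows whose third column is 0 and print them as namespace/name.
filter_blocking_pdbs() {
  awk -F'\t' '$3 == "0" { print $1 "/" $2 }'
}

# Example usage (not executed here; requires a logged-in oc session):
#   list_pdb_disruptions | filter_blocking_pdbs
```

Any workload budget reported here is a candidate for blocking node drains during the update.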

Verify the Node drain Grace period in Cluster settings in the Red Hat Hybrid Cloud Console, for example:

[Image: OpenShift Cluster Manager settings]

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.