HCPNodepoolUpgradeDelay alert in the worker node

Solution Verified - Updated

Environment

  • Red Hat OpenShift on AWS Hosted Control Planes
    • 4.12+

Issue

  • A worker node is stuck in the Ready,SchedulingDisabled status.
  • Pods in customer-owned namespaces are preventing the worker node from being drained successfully.

Resolution

Since these workloads are running in customer namespaces on worker nodes, customer action is required.

The blocking pods must be deleted to allow the node drain operation to complete.

  1. Confirm that the blocking pods are still present.
$ oc get pod -n <namespace> -o wide

NAME              READY   STATUS        RESTARTS   AGE   IP            NODE                                         
pod-xxxxx         0/1     Terminating   0          3h    10.xxx.4.xx   ip-10-xxx-xxx-6.us-west-2.compute.internal   
  2. Check whether a PDB (Pod Disruption Budget) is blocking the eviction of the pods.
$ oc get pdb -n <namespace>

NAME         MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
<pdb_name>   1               N/A               0                     18h

NOTE: A PDB blocks eviction whenever ALLOWED DISRUPTIONS is 0, for example when maxUnavailable is set to 0 or when minAvailable equals the number of healthy replicas. In that case, modify or delete the PDB to allow the eviction to proceed.

  3. Delete the pods that are blocking the worker node from draining.
$ oc delete pod <pod_name> -n <namespace> --grace-period=0 --force
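
In a namespace with many PDBs, the blocking ones can be picked out by filtering on the ALLOWED DISRUPTIONS column. The following is a minimal sketch, not an official tool: the function name blocking_pdbs is hypothetical, and it assumes the default table layout of `oc get pdb` shown above, where ALLOWED DISRUPTIONS is the second-to-last column.

```shell
# Hypothetical helper: print the PDB rows whose ALLOWED DISRUPTIONS value is 0.
# Reads the default table output of `oc get pdb` from stdin.
blocking_pdbs() {
  # Skip the header row. AGE is the last column in each data row,
  # so ALLOWED DISRUPTIONS is the second-to-last field.
  awk 'NR > 1 && $(NF-1) == "0" { print }'
}

# Example usage (requires a logged-in oc session):
#   oc get pdb -n <namespace> | blocking_pdbs
```

Because the filter counts columns from the right, it should also work on the output of `oc get pdb -A`, which adds a leading NAMESPACE column.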

NOTE: For other causes of a pending NodePool upgrade, see also https://access.redhat.com/articles/7094348.

Root Cause

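When a NodePool upgrade is triggered in a Hosted Control Plane environment, the operator cordons and drains the worker nodes sequentially so that they can be replaced with updated machine instances. If pods cannot be evicted, for example because a PDB allows zero disruptions or because pods remain stuck in Terminating, the drain does not complete and the node stays in Ready,SchedulingDisabled. Once the drain exceeds the default timeout, the HCPNodepoolUpgradeDelay alert fires.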
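
To see which pods are still scheduled on a cordoned node, and are therefore candidates for blocking the drain, the pod list can be filtered by node name. This is a minimal sketch: the function name pods_on_node is hypothetical, while `--field-selector spec.nodeName=...` is a standard pod field selector. The result also includes DaemonSet pods, which are normally not the cause of a blocked drain.

```shell
# Hypothetical helper: list all pods, in every namespace, that are
# currently scheduled on the given node.
pods_on_node() {
  # Abort with a usage message if no node name is given.
  node="${1:?usage: pods_on_node <node-name>}"
  oc get pods --all-namespaces --field-selector "spec.nodeName=${node}" -o wide
}

# Example usage (requires a logged-in oc session):
#   pods_on_node ip-10-xxx-xxx-6.us-west-2.compute.internal
```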

Diagnostic Steps

  1. Check the status of the worker node.
$ oc get nodes | grep "worker"

NAME                                         STATUS                     ROLES    AGE   VERSION
ip-10-xxx-xxx-6.us-west-2.compute.internal   Ready,SchedulingDisabled   worker   29h   v1.34.x
  2. Check the worker node events.
$ oc describe node ip-10-xxx-xxx-6.us-west-2.compute.internal | grep -A 20 "Events:"

Events:
  Type     Reason               Age    From               Message
  ----     ------               ----   ----               -------
  Warning  ErrorAddingResource  92m    ovnk-controlplane  adding or updating remote node IC resources ip-10-xxx-xxx-6.us-west-2.compute.internal failed, err - ensuring transit switch for remote zone node ip-10-xxx-xxx-6.us-west-2.compute.internal for the network default failed : err - failed to create/update transit switch transit_switch: error in transact with ops
  Normal   NodeNotSchedulable   84m    kubelet            Node ip-10-xxx-xxx-6.us-west-2.compute.internal status is now: NodeNotSchedulable
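
On clusters with many workers, the cordoned nodes can be extracted from the node list directly. This is a minimal sketch, assuming the default column layout of `oc get nodes` shown in step 1; the function name cordoned_nodes is hypothetical.

```shell
# Hypothetical helper: print the names of nodes whose STATUS column
# contains SchedulingDisabled. Reads `oc get nodes` output from stdin.
cordoned_nodes() {
  # Skip the header row; column 1 is NAME, column 2 is STATUS.
  awk 'NR > 1 && $2 ~ /SchedulingDisabled/ { print $1 }'
}

# Example usage (requires a logged-in oc session):
#   oc get nodes | cordoned_nodes
```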