HCPNodepoolUpgradeDelay alert in the worker node
Environment
- Red Hat OpenShift on AWS Hosted Control Planes
- 4.12+
Issue
- A worker node is stuck in the Ready,SchedulingDisabled status.
- A number of pods in customer-owned namespaces are preventing the worker node from being drained successfully.
Resolution
Since these workloads are running in customer namespaces on worker nodes, customer action is required.
The blocking pods must be deleted to allow the node drain operation to complete.
- Run the following command to confirm that the blocking pods are still present.
$ oc get pod -n <namespace> -o wide
NAME READY STATUS RESTARTS AGE IP NODE
pod-xxxxx 0/1 Terminating 0 3h 10.xxx.4.xx ip-10-xxx-xxx-6.us-west-2.compute.internal
- Check if a PDB (Pod Disruption Budget) is actively blocking the eviction of the pods.
$ oc get pdb -n <namespace>
NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
<pdb_name> 1 N/A 0 18h
NOTE: If ALLOWED DISRUPTIONS is 0 (for example, because the PDB is configured with maxUnavailable: 0, or because minAvailable equals the number of running replicas), modify or delete the PDB to allow the eviction to proceed.
- Delete the pods that are blocking the worker node from draining.
$ oc delete pod <pod_name> -n <namespace> --grace-period=0 --force
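When a namespace holds many PDBs, the PDB check in the steps above can be filtered rather than read by eye. The following is a minimal sketch and not part of the original procedure: pdbs_blocking_eviction is a hypothetical helper name, and it assumes the five-column layout of oc get pdb shown above (NAME, MIN AVAILABLE, MAX UNAVAILABLE, ALLOWED DISRUPTIONS, AGE) with the header suppressed by --no-headers.

```shell
# Hypothetical helper (name is an assumption, not an oc subcommand).
# Reads `oc get pdb --no-headers` output on stdin and prints the name of
# every PDB whose ALLOWED DISRUPTIONS column (field 4) is exactly 0.
pdbs_blocking_eviction() {
  # NF >= 5 skips blank or truncated lines so a missing field is never
  # mistaken for a zero.
  awk 'NF >= 5 && $4 == "0" { print $1 }'
}

# Usage against a live cluster (replace <namespace>):
#   oc get pdb -n <namespace> --no-headers | pdbs_blocking_eviction
```

Any PDB name printed by the helper currently permits no voluntary evictions, so a drain that touches its pods will wait until the PDB is modified or deleted.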
Note: For other reasons that might cause a NodePool upgrade to remain pending, also see this KCS article: https://access.redhat.com/articles/7094348
Root Cause
When a NodePool upgrade is triggered in a Hosted Control Plane environment, the operator attempts to cordon and drain nodes sequentially to replace them with updated machine instances. If the drain process exceeds the default timeout configuration, the HCPNodepoolUpgradeDelay alert fires.
Diagnostic Steps
- Check the status of the worker node.
$ oc get nodes | grep "worker"
NAME STATUS ROLES AGE VERSION
ip-10-xxx-xxx-6.us-west-2.compute.internal Ready,SchedulingDisabled worker 29h v1.34.x
- Check the worker node events.
$ oc describe node ip-10-xxx-xxx-6.us-west-2.compute.internal | grep -A 20 "Events:"
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ErrorAddingResource 92m ovnk-controlplane adding or updating remote node IC resources ip-10-xxx-xxx-6.us-west-2.compute.internal failed, err - ensuring transit switch for remote zone node ip-10-xxx-xxx-6.us-west-2.compute.internal for the network default failed : err - failed to create/update transit switch transit_switch: error in transact with ops
Normal NodeNotSchedulable 84m kubelet Node ip-10-xxx-xxx-6.us-west-2.compute.internal status is now: NodeNotSchedulable
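To list every cordoned node at once, the node status check above can be filtered. This is a minimal sketch under stated assumptions: cordoned_nodes is a hypothetical helper name, and it relies on STATUS being the second column of oc get nodes, as in the sample output above, with the header suppressed by --no-headers.

```shell
# Hypothetical helper (name is an assumption, not an oc subcommand).
# Reads `oc get nodes --no-headers` output on stdin and prints the name of
# every node whose STATUS column (field 2) contains SchedulingDisabled.
cordoned_nodes() {
  awk 'NF >= 2 && $2 ~ /SchedulingDisabled/ { print $1 }'
}

# Usage against a live cluster:
#   oc get nodes --no-headers | cordoned_nodes
#
# To see which pods are still scheduled on a cordoned node (replace <node_name>):
#   oc get pods -A -o wide --field-selector spec.nodeName=<node_name>
```

The pod listing helps identify which customer namespaces still hold pods on the node that is waiting to drain.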