Node pool (machine pool) upgrade troubleshooting guide for ROSA HCP clusters
Overview
- Updates for a ROSA HCP cluster involve updating the control plane and the node pools (machine pools). This document focuses on troubleshooting issues during the machine pool upgrade. For control plane upgrade troubleshooting, refer to the KCS article https://access.redhat.com/articles/7093782.
- For instructions on upgrading a ROSA HCP machine pool, see the Red Hat documentation: Upgrading the machine pool.
Troubleshooting
- Upgrade SL (service log)
  - During the upgrade, several SLs (service logs) like the following are generated in the OCM console (console.redhat.com):
    NodePool <nodepool name> upgrade maintenance scheduled
    NodePool <nodepool name> upgrade maintenance rescheduled
    NodePool <nodepool name> upgrade maintenance beginning
    NodePool <nodepool name> upgrade maintenance delayed
    NodePool <nodepool name> upgrade maintenance completed
    NodePool <nodepool name> upgrade maintenance cancelled
    NodePool <nodepool name> upgrade maintenance failed
- A PHC (pre-flight check) runs before the upgrade begins
  - If the PHC fails, the upgrade state moves to aborted.
    The PHC checks the following conditions:
    1. Critical alerts are firing.
       [solution:] Before upgrading, open the OpenShift console -> Alerts and check whether any critical alert exists.
    2. Node pools have no replicas.
       [solution:] Make sure all worker nodes are running and not cordoned.
    3. Cluster operator degraded.
       [solution:]
       1) Check the operator error information and use it to search existing KCS articles:
          $ oc get co/<operator name> -o yaml
       2) Check whether any AWS-side resources were changed recently (you can check this using AWS CloudTrail):
          . subnet tag
          . KMS policy
          . DHCP setting
          . security group
    4. Another upgrade is ongoing.
- If the upgrade passes the PHC (pre-flight check) and begins, you'll receive an email saying that the upgrade has begun. If the upgrade takes a long time to complete, you can open a support ticket with Red Hat Support to have it checked.
  - If the upgrade completes successfully, you'll receive an email confirming that the upgrade completed successfully.
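Two of the pre-flight conditions above (degraded cluster operators, and worker nodes that are not running or are cordoned) can also be checked from the command line before starting an upgrade. The following is a minimal sketch, not an official tool: the helper names degraded_operators and unhealthy_nodes are hypothetical, and the filters assume the default table output of `oc get clusteroperators` and `oc get nodes` with every column populated.

```shell
#!/usr/bin/env bash
# Sketch only: the function names below are hypothetical helpers,
# not part of the oc or rosa CLIs.

# Print the names of cluster operators whose DEGRADED column is "True".
# Reads the default table output of `oc get clusteroperators` on stdin.
# Assumed columns: NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
degraded_operators() {
  awk 'NR > 1 && $5 == "True" { print $1 }'
}

# Print the names of nodes whose STATUS is anything other than plain
# "Ready", for example "NotReady" or "Ready,SchedulingDisabled" (cordoned).
# Reads the default table output of `oc get nodes` on stdin.
# Assumed columns: NAME STATUS ROLES AGE VERSION
unhealthy_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Typical use against a live cluster:
#   oc get clusteroperators | degraded_operators
#   oc get nodes | unhealthy_nodes
```

If either command prints anything, inspect the listed operator with `oc get co/<operator name> -o yaml`, or the listed node, before scheduling the upgrade. Empty output from this sketch does not guarantee that the PHC will pass, since it does not cover firing alerts or an upgrade that is already in progress.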
- Checking the SLs can help identify what happened during the upgrade.
- Check the SL message to understand and verify why the upgrade failed.
  If you have questions about the message, please raise a support ticket for further assistance.
- Below are some sample SLs that show reasons why upgrades fail:
  NodePool 'xxxxxx' upgrade maintenance failed
  node pool upgrade failed due to error: Cluster operator operator-lifecycle-manager should not be upgraded between minor versions: ClusterServiceVersions blocking cluster upgrade: openshift-operators/openshift-pipelines-operator-rh.v1.13.1 is incompatible with OpenShift minor versions greater than 4.15

  NodePool 'xxxxxx' upgrade maintenance failed
  node pool upgrade failed due to error: found 1 critical alerts
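When many service logs have been copied out of the OCM console as text, the fixed wording shown in this guide makes them easy to filter. The sketch below assumes the SL text has exactly the form shown in this document ("NodePool <nodepool name> upgrade maintenance <state>" and "node pool upgrade failed due to error: <reason>"). The function names sl_state and sl_failure_reason are hypothetical helpers, not part of any Red Hat CLI.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers for SL text copied from the OCM console.

# For each stdin line shaped like
#   NodePool <nodepool name> upgrade maintenance <state>
# print "<nodepool name> <state>". Optional single quotes around the
# node pool name are dropped. Lines that do not match are ignored.
sl_state() {
  sed -n -E "s/^NodePool '?([^' ]+)'? upgrade maintenance ([a-z]+).*/\1 \2/p"
}

# For each stdin line containing
#   node pool upgrade failed due to error: <reason>
# print only <reason>. Lines that do not match are ignored.
sl_failure_reason() {
  sed -n -E 's/^.*node pool upgrade failed due to error: (.*)$/\1/p'
}

# Typical use, with the SL text saved to a local file of your choosing:
#   sl_state < service-logs.txt
#   sl_failure_reason < service-logs.txt
```

The extracted reason (for example, a blocking ClusterServiceVersion or a critical alert count) is usually the most useful text to search for in existing KCS articles or to include in a support ticket.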
Known Issues
- PDB settings on worker nodes can cause the HCP node pool upgrade to fail
  KCS: https://access.redhat.com/solutions/7092997
- Changing the machine pool settings during an HCP node pool upgrade can leave the upgrade pending or the version not upgraded
  KCS: https://access.redhat.com/solutions/7092998
- Critical alerts are firing
  KCS: https://access.redhat.com/solutions/7094130
- The cluster's worker nodes were stopped by a manual action, causing the HCP node pool upgrade to fail
  KCS: https://access.redhat.com/solutions/7061653
- AWS-side tags on the subnet were changed or removed
  KCS: https://access.redhat.com/solutions/7059601
- The policy attached to the KMS key was changed on a cluster with custom KMS enabled
  KCS: https://access.redhat.com/solutions/7093002
  https://access.redhat.com/articles/6155612
- HCPNodepoolUpgradeDelay alert in the worker node
  KCS: https://access.redhat.com/solutions/7143023
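For the PDB-related known issue listed first above, a quick way to look for candidate PDBs is sketched below. This guide does not describe the exact failure mode, so the sketch rests on an assumption: the problematic PDBs are those that currently allow zero disruptions, which prevents pods from being evicted when a node is drained. The function name blocking_pdbs is hypothetical, and the filter assumes the default table output of `oc get pdb -A`.

```shell
#!/usr/bin/env bash
# Sketch only: blocking_pdbs is a hypothetical helper, not an oc subcommand.

# Print "<namespace>/<name>" for every PodDisruptionBudget that currently
# allows 0 disruptions. Reads the default table output of `oc get pdb -A`
# on stdin. Assumed data columns, in order:
#   NAMESPACE NAME MIN-AVAILABLE MAX-UNAVAILABLE ALLOWED-DISRUPTIONS AGE
blocking_pdbs() {
  awk 'NR > 1 && $5 == "0" { print $1 "/" $2 }'
}

# Typical use against a live cluster:
#   oc get pdb -A | blocking_pdbs
```

Any PDB printed here is only a candidate; confirm against the linked KCS before changing it.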
Existing Bugs
- OCM-11702 Prevent NP version from being updated when NP upgrade is in progress
- OCM-11700 Patch upgrade policy state does not validate current state
- OCM-7971 HCP NodePool upgrade policy failed while the NodePool upgrade is still in progress
- XCMSTRAT-747 HCP Upgrade Improvement - Machine Pool Upgrade