Node pool (machine pool) upgrade troubleshooting guide for ROSA HCP clusters


Overview

Troubleshooting

  • Upgrade SL (service log)

    • During the upgrade, several SLs (service logs) like the ones below are generated in the OCM console (console.redhat.com):
    NodePool <nodepool name> upgrade maintenance scheduled
    NodePool <nodepool name> upgrade maintenance rescheduled
    NodePool <nodepool name> upgrade maintenance beginning
    NodePool <nodepool name> upgrade maintenance delayed
    NodePool <nodepool name> upgrade maintenance completed
    NodePool <nodepool name> upgrade maintenance cancelled
    NodePool <nodepool name> upgrade maintenance failed
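
If the service log summaries are copied or exported from the OCM console into a text file, a small helper can show the latest state reached by each node pool. This is only a sketch: the function name `sl_latest_state` and the input format (one summary per line, oldest first) are assumptions made for illustration and are not part of any Red Hat tooling.

```shell
# Hypothetical helper: reads service log summaries on stdin (one per line,
# oldest first) and prints "<nodepool name> <latest state>" per node pool.
sl_latest_state() {
  awk '
    # Match lines of the form: NodePool <name> upgrade maintenance <state>
    $1 == "NodePool" && $3 == "upgrade" && $4 == "maintenance" {
      name = $2
      gsub("\047", "", name)            # drop single quotes around the name
      if (!(name in state)) order[++n] = name
      state[name] = $5                  # later lines overwrite earlier states
    }
    END { for (i = 1; i <= n; i++) print order[i], state[order[i]] }
  '
}

# Usage (sl.txt is a hypothetical file holding the copied summaries):
#   sl_latest_state < sl.txt
```

A node pool whose latest state is `failed`, `delayed`, or `cancelled` is the one to look at first.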
    
  • A PHC (pre-flight check) runs before the upgrade begins

    • If the PHC fails, the upgrade state moves to aborted.
        The PHC checks for the conditions below:
        1. Critical alerts are firing.
           [solution:] Before starting an upgrade, open the OpenShift console -> alert and check whether any critical alert is firing
    
        2. Node pools have no replicas.
           [solution:] Make sure all worker nodes are running and not cordoned
    
        3. A cluster operator is degraded.
           [solution:] 
           1) Check the operator's error information and use it to search for an existing KCS
              $ oc get co/<operator name> -o yaml
    
           2) Check whether any AWS-side resources were changed recently (you can check this with AWS CloudTrail)
              .subnet tag 
              .KMS policy 
              .DHCP setting
              .security group
    
        4. Another upgrade is in progress.
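
Two of the conditions above (degraded cluster operators and worker nodes that are not running or are cordoned) can be checked from the command line before scheduling an upgrade. The sketch below filters the plain-text output of `oc get co` and `oc get nodes`. The function names are hypothetical, and the filters assume the default column layout of those commands with every column populated.

```shell
# Hypothetical helpers that read `oc` output on stdin.

# Prints cluster operators whose DEGRADED column is True.
# Expected input: `oc get co --no-headers`
# Assumed columns: NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE [MESSAGE]
degraded_operators() {
  awk '$5 == "True" { print $1 }'
}

# Prints nodes whose STATUS is anything other than plain "Ready",
# for example NotReady or Ready,SchedulingDisabled (cordoned).
# Expected input: `oc get nodes --no-headers`
# Assumed columns: NAME STATUS ROLES AGE VERSION
unready_nodes() {
  awk '$2 != "Ready" { print $1, $2 }'
}

# Usage against a live cluster:
#   oc get co --no-headers | degraded_operators
#   oc get nodes --no-headers | unready_nodes
```

Empty output from both helpers suggests that these two conditions are clear. It says nothing about critical alerts or about another upgrade being in progress.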
    
  • If the upgrade passes the PHC (pre-flight check) and begins, you'll receive an email stating that the upgrade has begun. If the upgrade takes a long time to complete, you can open a support ticket with Red Hat Support to have it checked.

  • If the upgrade completes successfully, you'll receive an email that the upgrade has been completed successfully.
  • Checking the SLs can help identify what happened during the upgrade.

    • Check the SL message to understand and verify why the upgrade failed.
      If you have questions about it, please raise a support ticket for further assistance.

    • Below are some sample SLs that show why upgrades fail

    NodePool 'xxxxxx' upgrade maintenance failed
    node pool upgrade failed due to error: Cluster operator operator-lifecycle-manager should not be upgraded
    between minor versions: ClusterServiceVersions blocking cluster upgrade:
    openshift-operators/openshift-pipelines-operator-rh.v1.13.1 is incompatible with OpenShift minor versions greater than 4.15
    
    NodePool 'xxxxxx' upgrade maintenance failed
    node pool upgrade failed due to error: found 1 critical alerts
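
Both samples carry the actual reason after the fixed prefix `node pool upgrade failed due to error:`. As a rough illustration, the helpers below pull that reason out of saved SL text and sort it into a coarse category. The function names and the category labels are hypothetical, and the patterns are derived only from the two sample messages above, so other failure texts fall into `other`.

```shell
# Hypothetical helper: prints the text that follows the failure prefix,
# one reason per matching input line.
sl_failure_reason() {
  sed -n 's/^.*node pool upgrade failed due to error: //p'
}

# Hypothetical helper: maps each reason line to a coarse category.
sl_failure_category() {
  while IFS= read -r reason; do
    case "$reason" in
      *"critical alerts"*)  echo "critical-alerts" ;;
      *"Cluster operator"*) echo "cluster-operator" ;;
      *)                    echo "other" ;;
    esac
  done
}

# Usage (sl.txt is a hypothetical file holding the copied SL text):
#   sl_failure_reason < sl.txt | sl_failure_category
```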
    

Known Issues

  1. PDB settings on worker nodes can cause an HCP node pool upgrade to fail
    KCS: https://access.redhat.com/solutions/7092997

  2. Changing the machine pool settings during an HCP node pool upgrade can leave the upgrade pending or the version not upgraded
    KCS: https://access.redhat.com/solutions/7092998

  3. Critical alerts are firing
    KCS: https://access.redhat.com/solutions/7094130

  4. Cluster worker nodes stopped by a manual action cause the HCP node pool upgrade to fail
    KCS: https://access.redhat.com/solutions/7061653

  5. AWS-side tags on the subnet were changed or removed
    KCS: https://access.redhat.com/solutions/7059601

  6. The policy attached to the KMS key was changed on a cluster with custom KMS enabled
    KCS: https://access.redhat.com/solutions/7093002
    https://access.redhat.com/articles/6155612

  7. HCPNodepoolUpgradeDelay alert in the worker node
    KCS: https://access.redhat.com/solutions/7143023
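
For known issue 1, one common way a PDB (PodDisruptionBudget) gets in the way of node replacement is by allowing zero disruptions, which blocks node drain. Whether this is the exact scenario covered by the KCS is an assumption. The sketch below lists such PDBs from the plain-text output of `oc get pdb`. The function name `blocking_pdbs` is hypothetical, and the filter assumes the default column layout.

```shell
# Hypothetical helper: prints "<namespace>/<name>" for every PDB whose
# ALLOWED DISRUPTIONS value is 0.
# Expected input: `oc get pdb -A --no-headers`
# Assumed columns: NAMESPACE NAME MIN-AVAILABLE MAX-UNAVAILABLE ALLOWED-DISRUPTIONS AGE
blocking_pdbs() {
  awk '$5 == "0" { print $1 "/" $2 }'
}

# Usage against a live cluster:
#   oc get pdb -A --no-headers | blocking_pdbs
```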

Existing Bugs

  1. OCM-11702 Prevent NP version from being updated when NP upgrade is in progress
  2. OCM-11700 Patch upgrade policy state does not validate current state
  3. OCM-7971 HCP NodePool upgrade policy failed while the NodePool upgrade is still in progress
  4. XCMSTRAT-747 HCP Upgrade Improvement - Machine Pool Upgrade