Control plane upgrade troubleshooting guide for ROSA clusters

Overview

Troubleshooting

  • Only the cluster owner or the user who installed the cluster can upgrade it. For details, follow the resolution steps in ROSA upgrade fails with "Forbidden access to update resource".

  • During the upgrade process, the events below can be observed in the OpenShift Cluster Manager Console:

    Control Plane upgrade maintenance scheduled
    Control Plane upgrade maintenance rescheduled
    Control Plane upgrade maintenance beginning
    Control Plane upgrade maintenance delayed
    Control Plane upgrade maintenance completed
    Control Plane upgrade maintenance cancelled
    Control Plane upgrade maintenance failed
    
  • Pre-flight checks also run before the upgrade begins. If any check fails, the upgrade state moves to aborted. The following factors are evaluated:

    1. Critical alerts are firing
       Solution: Open the OpenShift Console -> Alert and check whether any critical alerts exist before upgrading.

    2. Node pools have no replicas
       Solution: Make sure all worker nodes are running and not cordoned.

    3. Cluster Operator degraded
       Solution:
       1. Check the cluster operator error messages under "Conditions". They can be used to search our Knowledgebase for known issues:

          $ oc get co/<operator name> -o yaml

       2. Check whether any AWS resources have been changed recently. You can do this by searching [AWS CloudTrail](https://aws.amazon.com/cloudtrail/) for the resources below:

          • subnet tag
          • KMS policy
          • DHCP setting
          • security group

    4. Another upgrade is ongoing
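
The node and cluster operator checks in the list above can be scripted before scheduling an upgrade. The following is a minimal sketch, not an official procedure. It assumes the default table output of `oc get clusteroperators` (columns NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED) and `oc get nodes` (columns NAME, STATUS). The function names `unhealthy_operators` and `unready_nodes` are hypothetical helpers introduced here for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: the helper names below are illustrative, not part of oc or rosa.

# Reads `oc get clusteroperators --no-headers` output on stdin and prints the
# name of every operator that is degraded (DEGRADED=True) or unavailable
# (AVAILABLE=False). Assumes the VERSION column is populated for every row.
unhealthy_operators() {
  awk '$5 == "True" || $3 == "False" { print $1 }'
}

# Reads `oc get nodes --no-headers` output on stdin and prints every node
# whose STATUS is not exactly "Ready", for example "NotReady" or
# "Ready,SchedulingDisabled" (a cordoned node).
unready_nodes() {
  awk '$2 != "Ready" { print $1 " " $2 }'
}

# Usage against a live cluster (requires a logged-in oc session):
#   oc get clusteroperators --no-headers | unhealthy_operators
#   oc get nodes --no-headers | unready_nodes
```

If both commands print nothing, no operator is reported as degraded or unavailable and no node is reported as not ready or cordoned. Replica counts per node pool can be reviewed separately with `rosa list machinepools --cluster=<cluster_name>`.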

  • If the upgrade passes the pre-flight checks and starts, you will receive an email saying that the upgrade has begun:

    [Image: upgrade pending]

  • If the upgrade completes successfully, you will receive an email saying that the upgrade has been completed successfully:

    [Image: upgrade successful]

  • Check the events in the Cluster history tab in the OpenShift Cluster Manager Console. They can help identify what happened during the upgrade. The samples below show reasons why upgrades may fail:

    Control Plane upgrade maintenance failed 
    Control plane upgrade failed: found 2 critical alerts
    
    Control Plane upgrade maintenance failed
    Control plane upgrade failed: Cluster 'xxxxx' is not upgradable as it has the following node pool upgrades started 'workers-0,workers-1'
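
When triaging several clusters, the failure text from the Cluster history events can be mapped to a likely cause. The following is a minimal sketch that matches only the two sample messages quoted above. `classify_upgrade_failure` and its output labels are hypothetical names introduced here, and real events may use other wording.

```shell
#!/usr/bin/env bash
# Sketch only: classify_upgrade_failure is a hypothetical helper, and the
# patterns cover just the two sample messages quoted in this article.
classify_upgrade_failure() {
  case "$1" in
    *"critical alerts"*)
      # Pre-flight check failed because critical alerts are firing.
      echo "critical-alerts" ;;
    *"node pool upgrades started"*)
      # Node pool upgrades are already in progress on the cluster.
      echo "node-pool-upgrade-in-progress" ;;
    *)
      echo "unknown" ;;
  esac
}

# Usage:
#   classify_upgrade_failure "Control plane upgrade failed: found 2 critical alerts"
```

A result of `unknown` means the message matches neither sample and the event text should be read in full.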
    

For further troubleshooting, please contact Red Hat Support.

Known Issues

  1. Critical alerts are firing
    Reference: Critical alerts are firing causing ROSA HCP cluster upgrade fail
  2. Worker nodes that were stopped by a manual action cause the HCP control plane upgrade to fail
    Reference: Is it possible to scale down ROSA HCP worker nodes to zero?
  3. An IDP issue might cause the HCP upgrade to fail
    Reference: ROSA HCP cluster upgrade fail because of OpenID authentication IDP error
  4. AWS tags on the subnet have been changed or removed
    Reference: Cluster operator Console is degraded in ROSA
  5. On a cluster with a custom KMS key enabled, the policy attached to the KMS key has been changed. Check that the KMS settings are correct
    Reference: Creating ROSA with HCP clusters using a custom AWS KMS encryption key.
    Additional reference: Enabling bring your own key (BYOK) with KMS in OSD and ROSA
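
For the AWS-side changes listed in this article (subnet tags, KMS policy, DHCP settings, security groups), AWS CloudTrail can be searched by API event name. The following is a minimal sketch, assuming that the relevant changes were made through the AWS API actions listed. `cloudtrail_event_names` and its resource keywords are hypothetical names introduced here, and the event list is not exhaustive.

```shell
#!/usr/bin/env bash
# Sketch only: cloudtrail_event_names is a hypothetical helper. The event
# names are AWS API actions recorded by CloudTrail; the list is not exhaustive.
cloudtrail_event_names() {
  case "$1" in
    subnet-tag)     echo "CreateTags DeleteTags" ;;
    kms-policy)     echo "PutKeyPolicy" ;;
    dhcp)           echo "AssociateDhcpOptions CreateDhcpOptions DeleteDhcpOptions" ;;
    security-group) echo "AuthorizeSecurityGroupIngress RevokeSecurityGroupIngress AuthorizeSecurityGroupEgress RevokeSecurityGroupEgress" ;;
    *)              echo "unknown resource type: $1" >&2; return 1 ;;
  esac
}

# Usage (requires AWS CLI credentials for the account that hosts the cluster):
#   for event in $(cloudtrail_event_names kms-policy); do
#     aws cloudtrail lookup-events \
#       --lookup-attributes AttributeKey=EventName,AttributeValue="$event" \
#       --max-results 20
#   done
```

Each returned event includes the time and the identity that made the change, which can then be compared with the time the upgrade failed.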

Existing Bugs

  1. OCPBUGS-43267 pause_image content mismatch for file "/etc/crio/crio.conf.d/00-default"
  2. HOSTEDCP-1517 Control plane upgrades should succeed regardless of data plane state
  3. OCPBUGS-38132 OIDC IDP validation check should not be fatal to CPO reconcilation