Pre-upgrade health checks for ROSA HCP

Environment

  • Red Hat OpenShift Service on AWS with hosted control planes (ROSA HCP) 4

Issue

  • Perform pre-upgrade health checks on a ROSA HCP Cluster
  • How to check the health of a cluster prior to an upgrade

Resolution

  • ROSA HCP clusters consist of a hosted control plane in a Red Hat‑managed AWS account and node pools in your AWS account. Upgrades are performed in two phases: the control plane must be upgraded first, followed by machine pools. Both phases include automated pre‑flight checks (PHC) that run immediately when an upgrade is scheduled.

  • You can also use oc commands, such as oc adm upgrade, to check cluster health and status details.

  • Review the Proactive OSD/ROSA article for steps and information on how to open a proactive cluster maintenance ticket with Red Hat Support, if necessary.

Diagnostic Steps

  1. Download the latest oc (OpenShift CLI) and rosa (ROSA CLI) binaries from mirror.openshift.com, if you have not already done so.

  2. View the cluster status.

  • Run the command:
$ oc adm upgrade
  • Note: This is a read-only check for any Degraded states of cluster operators and does not initiate any changes to your cluster.
  • If the output includes Failing=True, please create a Support Case in the Red Hat Customer Portal.
  3. Confirm the available versions that the cluster can be upgraded to, and note the recommended version.
  • For the control plane
$ rosa list upgrades --cluster=<cluster_name OR cluster_id>
  • For a specific machine pool
$ rosa list upgrades --cluster=<cluster_name OR cluster_id> --machinepool=<machinepool_name>
  4. ROSA HCP performs automated pre-flight checks when a control plane or machine pool upgrade is scheduled. If any check fails, the upgrade is aborted and a service log is generated. You can monitor the results in the Cluster History tab of the Hybrid Cloud Console and in the service logs. After a failed check, resolve the underlying issue and reschedule the upgrade.
  • Example Service Log for control plane:
    Control Plane upgrade maintenance failed
    Control plane upgrade failed: found 2 critical alerts

  • Example Service Log for machine pool:
    NodePool 'xxxxxx' upgrade maintenance failed
    node pool upgrade failed due to error: found 1 critical alerts
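
The read-only status check and the service log messages above can be inspected from a script. The following is a minimal sketch, assuming only what this article states: that a problem appears as the string Failing=True in the oc adm upgrade output, and that a failed pre-flight check logs "found N critical alerts". The function names upgrade_status_ok and critical_alert_count are hypothetical. Both functions only read text on standard input and make no changes to the cluster.

```shell
# Hypothetical helper: reads `oc adm upgrade` output on stdin.
# Returns 0 when the text does not contain "Failing=True", 1 otherwise.
upgrade_status_ok() {
    if grep -q 'Failing=True'; then
        echo "Failing=True found: create a Support Case before upgrading" >&2
        return 1
    fi
    echo "No Failing=True condition reported"
}

# Hypothetical helper: reads a pre-flight service log message on stdin and
# prints the number of critical alerts it reports.
# Assumes the "found N critical alerts" wording shown in the examples above.
critical_alert_count() {
    sed -n 's/.*found \([0-9][0-9]*\) critical alerts.*/\1/p'
}

# Usage (run against a live cluster):
#   oc adm upgrade | upgrade_status_ok
```

The service log wording may differ between releases, so treat critical_alert_count as an illustration rather than a stable parser.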

Scheduling and Canceling a Cluster Upgrade

  1. To schedule an upgrade, run the following command with the date and time of your maintenance window.
  • Control Plane Upgrade
$ rosa upgrade cluster -c <cluster_name OR cluster_id> --version <version_number> --schedule-date 2024-05-18 --schedule-time 09:00
  • Machine Pools Upgrade
$ rosa upgrade machinepool -c <cluster_name OR cluster_id> <machinepool_name> --version <version_number> --schedule-date 2024-05-18 --schedule-time 09:00
  2. Before canceling a scheduled upgrade, verify that the upgrade has not already started by running the following command. An upgrade that has already started cannot be stopped or canceled and must run to completion.
  • Control Plane Upgrade
$ rosa list upgrades --cluster=<cluster_name OR cluster_id>
Example output:
VERSION  NOTES
4.15.14  recommended - scheduled for 2024-06-02 15:00 UTC
4.15.13
  • If the upgrade has not started, please run the following command to delete its schedule and cancel the upgrade.
$ rosa delete upgrade --cluster=<cluster_name OR cluster_id>
Confirm the deletion by entering 'Yes' when prompted
  • Machine Pools Upgrade
$ rosa list upgrades --cluster=<cluster_name OR cluster_id> --machinepool=<machinepool_name>
Example output:
VERSION  NOTES
4.15.14  recommended - scheduled for 2024-06-02 15:00 UTC
4.15.13
  • If the upgrade has not started, please run the following command to delete its schedule and cancel the upgrade.
$ rosa delete upgrade --cluster=<cluster_name OR cluster_id> --machinepool=<machinepool_name>
Confirm the deletion by entering 'Yes' when prompted
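
To confirm from a script whether an upgrade is still scheduled, the rosa list upgrades output can be filtered for the scheduled row. This is a minimal sketch that assumes the two-column layout and the "scheduled for" wording shown in the example output above. The function name scheduled_upgrade is hypothetical.

```shell
# Hypothetical helper: reads `rosa list upgrades` output on stdin and prints
# "<version> <schedule>" for each row whose NOTES column mentions a schedule.
# Prints nothing when no upgrade is scheduled.
scheduled_upgrade() {
    awk '/scheduled for/ {
        version = $1                    # first column is the version
        sub(/.*scheduled for /, "")     # keep only the date, time and zone
        print version, $0
    }'
}

# Usage (run against a live cluster, substituting your own cluster name):
#   rosa list upgrades --cluster=my-cluster | scheduled_upgrade
```

An empty result suggests that no upgrade is currently scheduled. It does not show whether a previously scheduled upgrade has already started.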

Performing pre-upgrade checks manually

  1. Check all cluster operators for any that are in the DEGRADED=True state.
$ oc get co
  2. Check the status of Machine Pools.
$ rosa list machinepools --cluster=<cluster_name>
  3. Check that no restrictive Pod Disruption Budgets (PDBs) are defined in workload namespaces.
$ oc get pdb -A
  4. Check operator compatibility.
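
The operator and PDB checks above can be narrowed down with small filters. The following is a minimal sketch under stated assumptions: degraded_operators and blocking_pdbs are hypothetical helper names, the first assumes the default oc get co table with no empty columns before DEGRADED, and the second treats a PDB that currently allows zero disruptions as "restrictive", which is an interpretation and not a definition from this article.

```shell
# Hypothetical helper: reads `oc get co` output on stdin and prints the name
# of every cluster operator whose DEGRADED column is True. The column is
# located from the header row instead of a fixed position.
degraded_operators() {
    awk 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "DEGRADED") col = i; next }
         col && $col == "True" { print $1 }'
}

# Hypothetical helper: reads lines of "<namespace> <name> <allowed disruptions>"
# on stdin and prints namespace/name for PDBs that allow zero disruptions.
blocking_pdbs() {
    awk 'NF >= 3 && $3 == "0" { print $1 "/" $2 }'
}

# Usage (run against a live cluster):
#   oc get co | degraded_operators
#   oc get pdb -A --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed \
#     | blocking_pdbs
```

A PDB that allows zero disruptions can prevent nodes from draining during a machine pool upgrade, so any entries printed by blocking_pdbs are worth reviewing before scheduling.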