ROSA HCP cluster machine pools not upgraded to expected version

Solution Unverified - Updated 26 Oct 2024

Environment

Red Hat OpenShift Service on AWS (ROSA) HCP
- 4

Issue

The service log (SL) in the OCM console shows that the workers-1 machine pool has already been upgraded to 4.15.33:

NodePool 'workers-1' upgrade maintenance scheduled	Cluster Updates	Info	xxxxxxxxx	4 Oct 2024, 11:00 UTC
Cluster is scheduled for NodePool upgrade maintenance to version '4.15.33' on 2024-10-04 at 11:06 UTC

NodePool 'workers-1' upgrade maintenance beginning	Cluster Updates	Info	xxxxxxxxx	4 Oct 2024, 11:09 UTC
NodePool is currently being upgraded to version '4.15.33'

........

NodePool 'workers-1' upgrade maintenance completed	Cluster Updates	xxxxxxxxx	9 Oct 2024, 02:54 UTC
NodePool has been successfully upgraded to version '4.15.33'

However, the machine pool version reported in OCM still shows the older version, 4.15.14:

NAME        CLUSTER   DESIRED NODES   CURRENT NODES   AUTOSCALING   AUTOREPAIR   VERSION   UPDATINGVERSION   UPDATINGCONFIG   MESSAGE
workers-1   xxxxxx    2               2               False         True         4.15.14   False             False

Resolution

Do not modify the machine pool while it is being upgraded. Check the service log (SL) in the OCM console to see when the upgrade began and whether it has finished:

NodePool 'xxxxxx' upgrade maintenance beginning
NodePool is currently being upgraded to version 'xxxxx'
......
NodePool 'xxxxxx' upgrade maintenance completed	
NodePool has been successfully upgraded to version 'xxxxx'
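
The check above can be sketched as a small shell helper. This is a minimal sketch, not an official tool: it assumes the service log summaries have been copied from the OCM console into plain text, one entry per line, with the wording shown in this article. The function name `upgrade_state` is hypothetical.

```shell
# Hypothetical helper: read service log summary lines on stdin and print the
# upgrade state of one machine pool: not-started, in-progress, or completed.
# Assumes the summary wording shown in this article; line order does not matter.
upgrade_state() {
  local pool="$1"
  # \047 is the octal escape for a single quote inside an awk string.
  awk -v pool="$pool" '
    index($0, "NodePool \047" pool "\047 upgrade maintenance beginning") { began = 1 }
    index($0, "NodePool \047" pool "\047 upgrade maintenance completed") { completed = 1 }
    END {
      if (completed) { print "completed" }
      else if (began) { print "in-progress" }
      else { print "not-started" }
    }
  '
}
```

For example, with the entries saved to a file (the name servicelog.txt is an assumption), run `upgrade_state workers-1 < servicelog.txt` and only change the machine pool when the result is not `in-progress`.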

If a machine pool setting was changed while the upgrade was in progress, and the service log shows the upgrade as completed but the machine pool is still on the old version, open a support case with Red Hat Support.

Root Cause

An internal bug has been reported: modifying an HCP machine pool while its upgrade is still in progress (for example, adding a machine to the pool) might leave the upgrade pending. A permanent fix is still in development.

Prevent NP version from being updated when NP upgrade is in progress

Diagnostic Steps

Check the service log (SL) in the OCM console. In this case, the cluster owner manually changed the machine pool after the upgrade started and before it finished, as the following service log entry shows:

Node pool 'workers-1' has been updated in cluster 'xxxxxx'	Cluster Scaling	Info	xxxxxx	7 Oct 2024, 03:29 UTC
Node pool ID: 'workers-1', Replicas: '2', Instance Type: 'm7i.xlarge', EC2 Metadata Http Tokens: 'optional', 'no labels', 'no taints', Tags: 'red-hat-managed: true, red-hat-clustertype: rosa, api.openshift.com/id: xxxxxxxx, api.openshift.com/name: xxxxxxxx, api.openshift.com/environment: production, api.openshift.com/nodepool-ocm: workers-1, api.openshift.com/legal-entity-id: xxxxxx, api.openshift.com/nodepool-hypershift: xxxxxx-workers-1'
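
To spot this pattern in a longer service log, the following sketch prints every machine pool change that falls between the "beginning" and "completed" entries of an upgrade. It assumes the service log entries have been copied into plain text, one entry per line, oldest first, with the wording shown in this article. The function name `changes_during_upgrade` is hypothetical.

```shell
# Hypothetical helper: read service log lines (oldest first) on stdin and
# print each "has been updated" entry for the given machine pool that was
# logged while an upgrade of that pool was in progress.
changes_during_upgrade() {
  local pool="$1"
  # \047 is the octal escape for a single quote inside an awk string.
  awk -v pool="$pool" '
    index($0, "NodePool \047" pool "\047 upgrade maintenance beginning") { upgrading = 1; next }
    index($0, "NodePool \047" pool "\047 upgrade maintenance completed") { upgrading = 0; next }
    upgrading && index($0, "Node pool \047" pool "\047 has been updated") { print }
  '
}
```

Any output indicates that the machine pool was modified during the upgrade window, which matches the condition described in this article. No output means no such change appears in the supplied entries.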

Attempting to upgrade the machine pool again with the same command reports that an upgrade is already in progress:

$ rosa upgrade machinepool -c xxxxxxx workers-1 --version 4.15.33 

WARN: There is already a started upgrade to version 4.15.33 on 2024-10-04 11:06 UTC
INFO: An upgrade already exists for machine pool 'workers-1' in cluster 'xxxxxxx'

SBR

Shift Hosted

Product(s)

Red Hat OpenShift Service on AWS

Components

upgrade

Category

Upgrade

Tags

upgrade

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.