Red Hat OpenShift 4 disk performance degradation on Azure
Environment
- Red Hat OpenShift Container Platform (RHOCP)
- 4
- Microsoft Azure (Azure)
Issue
- Disk performance issues on Red Hat OpenShift 4 clusters on Azure.
- The etcd operator becomes Degraded, and related messages appear in the etcd logs:
  etcdserver: read-only range request ... took too long to execute
  embed: rejected connection from "X.X.X.X:X" (error "EOF", ServerName "")
Resolution
Disclaimer: Links contained herein to external website(s) are provided for convenience only. Red Hat has not reviewed the links and is not responsible for the content or its availability. The inclusion of any link to an external website does not imply endorsement by Red Hat of the website or their entities, products or services. You agree that Red Hat is not responsible or liable for any loss or expenses that may result due to your use of (or reliance on) the external site or content.
- The default disk size for the control plane for both Installer-Provisioned Infrastructure (IPI) and User-Provisioned Infrastructure (UPI) installations is 1024 GB, as explained in why is the minimum recommended size of disk for control plane nodes 1024 GB when installing Red Hat OpenShift 4 on Azure.
IMPORTANT NOTE:
- The host caching option for the Azure machines must be set to ReadOnly, especially for control plane nodes, as explained in optimizing storage performance for Microsoft Azure.
- To avoid most of the disk performance issues in Azure (including poor etcd performance), the recommendation is to move etcd to a different disk (when possible, a Premium SSD v2 disk). However, that procedure is not supported when using control plane machine sets, as it requires the disk to be available when the machine is created. For IPI installations not using control plane machine sets, it may be possible to use that procedure if the additional disk is added when the machine is created.
Note: starting with Red Hat OpenShift v4.20, there is a Technology Preview feature that allows configuring additional etcd disks on Azure IPI installations, as shown in dedicated disk for etcd on Microsoft Azure (Technology Preview) and detailed in configuring a dedicated disk for etcd. Unfortunately, that feature is only available for newly installed clusters. For clusters already installed, there is an internal RFE-5214, still under discussion.
- When it is not possible to move etcd to a different disk, the minimal recommendation for the control plane nodes is to use 1 TB Premium SSD disks (P30), which give 5000 IOPS / 200 MBps, in combination with at minimum Standard_D8s_v3 instances, which support a throughput for uncached writes of 12800 IOPS / 192 MBps (note that the MBps supported by the DSv3 instance type is lower than that supported by the P30 disks).
As Standard_D8s_v3 is an old generation instance type no longer available for new machines, and newer instance types with better performance have been released, using a newer instance type is recommended (for example, at the time of writing, the Standard_D8s_v6 instances, which support 12800 IOPS / 424 MBps for uncached writes). Note that faster disks than P30 could be needed to take advantage of the 424 MBps supported by newer instance types.
ADDITIONAL NOTES:
- The disk performance could vary between Azure regions, as shown in Azure Disk performance by region (note the data there is not up to date), and in some cases the minimal recommendation above might not be enough for good disk performance (including etcd performance). In those cases, check the etcd metrics and contact Azure Support about the disk performance if the metric values do not match the documented performance of the instance and disk. It is also possible to follow the Diagnostic Steps section of that linked solution to test the performance in a specific Azure region, but note that several tests should be run, not only one.
- For other Red Hat OpenShift components, such as StorageClasses or OpenShift Data Foundation, it could be beneficial to use more performant disks like Premium SSD v2 disks or Ultra Disks. Refer to Azure managed disk types for the limitations of each kind of disk (for example, Premium SSD v2 disks and Ultra Disks cannot be used as OS disks).
Root Cause
- In Azure, disk performance is directly dependent on SSD disk sizes and Service Quotas (and they are layered). The Service Quotas to be considered must always be the lower ones. When limits are exceeded, the service itself (storage, compute, networking, etc.) throttles the offending entity.
- Many of the issues observed are related to clusters reaching disk or VM I/O limits and saturation, which leads to Azure throttling. The OS disk then becomes slow or unresponsive from the kernel perspective, blocked on IOWait, leading to failures in fstat() and to different parts of the cluster stack failing or degrading, sometimes in the most surprising ways.
- Because of the Service Quotas layering mentioned before, it is very important to get the IOPS numbers correctly aligned between the VM OS disk quotas and the quotas of the attached disk. The recommendations for control plane VM sizes and disk sizes take into account the hardware requirements for disks in etcd, with a target of 5000 concurrent (likely ~500 sequential) IOPS.
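The quota layering can be illustrated with a short calculation. The following is only a sketch: it uses the figures quoted in the Resolution section, and the `Limits` class and `effective_limits` helper are hypothetical names introduced for illustration, not part of any Azure or OpenShift tooling. It takes the lower value of each layered limit, which is the point at which throttling would be expected to start.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    """Documented ceiling for one layer (a disk, or a VM's uncached writes)."""
    iops: int
    mbps: int


def effective_limits(*layers: Limits) -> Limits:
    """Return the lowest IOPS and the lowest MBps across all layered limits."""
    return Limits(
        iops=min(layer.iops for layer in layers),
        mbps=min(layer.mbps for layer in layers),
    )


# Figures quoted in the Resolution section of this article.
P30_DISK = Limits(iops=5000, mbps=200)
STANDARD_D8S_V3 = Limits(iops=12800, mbps=192)
STANDARD_D8S_V6 = Limits(iops=12800, mbps=424)

if __name__ == "__main__":
    for name, vm in (("Standard_D8s_v3", STANDARD_D8S_V3),
                     ("Standard_D8s_v6", STANDARD_D8S_V6)):
        eff = effective_limits(P30_DISK, vm)
        print(f"P30 + {name}: {eff.iops} IOPS / {eff.mbps} MBps")
```

With these inputs, the P30 disk is the IOPS ceiling in both combinations (5000). The throughput ceiling is the VM with Standard_D8s_v3 (192 MBps) and the disk with Standard_D8s_v6 (200 MBps), which is why a faster disk would be needed to benefit from the newer instance type.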
Diagnostic Steps
After some time, the cluster becomes very slow and many operators become unhealthy, with errors such as the following in the etcd logs:
W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-kube-scheduler/scheduler-kubeconfig\" " with result "range_response_count:1 size:*" took too long to execute
W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-kube-apiserver-operator/kube-apiserver-operator-lock\" " with result "range_response_count:1 size:*" took too long to execute
W | etcdserver: read-only range request "key:\"/kubernetes.io/leases/knative-serving/hpaautoscaler\" " with result "range_response_count:1 size:*" took too long to execute
W | etcdserver: read-only range request "key:\"/kubernetes.io/operators.coreos.com/clusterserviceversions/openshift-operators/nfd.4.4.0-202005252114\" " with result "range_response_count:1 size:*" took too long to execute
W | etcdserver: read-only range request "key:\"/kubernetes.io/eventing.knative.dev/triggers\" range_end:\"/kubernetes.io/eventing.knative.dev/triggert\" count_only:true " with result "range_response_count:0 size:*" took too long to execute
W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-cloud-credential-operator/cloud-credential-operator-leader\" " with result "range_response_count:1 size:*" took too long to execute
W | etcdserver: read-only range request "key:\"/kubernetes.io/secrets/openshift-config-managed/\" range_end:\"/kubernetes.io/secrets/openshift-config-managed0\" " with result "range_response_count:27 size:*" took too long to execute
I | embed: rejected connection from "X.X.X.X:X" (error "EOF", ServerName "")
etcdserver: failed to send out heartbeat on time
etcdserver: server is likely overloaded
wal: sync duration of xxxx s, expected less than 1s
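When reviewing a large volume of etcd logs, it can help to count how often each of these messages occurs. The following is a minimal sketch, assuming the logs have already been saved to a text file. The pattern labels and the `count_symptoms` helper are hypothetical names introduced here, and the patterns are derived only from the excerpts above, so they may need adjusting for other etcd versions.

```python
import re
import sys
from collections import Counter

# Patterns taken from the log excerpts above; the labels are hypothetical.
PATTERNS = {
    "slow_range_request": re.compile(
        r"read-only range request .* took too long to execute"),
    "rejected_connection": re.compile(r"embed: rejected connection from"),
    "late_heartbeat": re.compile(r"failed to send out heartbeat on time"),
    "overloaded": re.compile(r"server is likely overloaded"),
    "slow_wal_sync": re.compile(r"wal: sync duration of"),
}


def count_symptoms(lines):
    """Count how many lines match each disk-latency symptom pattern."""
    counts = Counter()
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python count_symptoms.py <path-to-saved-etcd-log>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log_file:
        for label, n in count_symptoms(log_file).most_common():
            print(f"{label}: {n}")
```

A high and growing count of slow WAL syncs or late heartbeats over several samples would be consistent with the throttling described in the Root Cause section, but the counts alone do not prove it. The etcd metrics and the documented instance and disk limits should still be checked.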