Red Hat OpenShift 4 disk performance degradation on Azure

Solution Verified - Updated

Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 4
  • Microsoft Azure (Azure)

Issue

  • Disk performance issues on Red Hat OpenShift 4 clusters on Azure.

  • The etcd operator becomes Degraded, and related messages appear in the etcd logs:

    etcdserver: read-only range request ... took too long to execute
    
    embed: rejected connection from "X.X.X.X:X" (error "EOF", ServerName "")
    

Resolution

Disclaimer: Links contained herein to external website(s) are provided for convenience only. Red Hat has not reviewed the links and is not responsible for the content or its availability. The inclusion of any link to an external website does not imply endorsement by Red Hat of the website or their entities, products or services. You agree that Red Hat is not responsible or liable for any loss or expenses that may result due to your use of (or reliance on) the external site or content.

IMPORTANT NOTE:
  • The host caching option for the Azure machines must be set to ReadOnly, as explained in optimizing storage performance for Microsoft Azure, especially for control plane nodes.

  • To avoid most disk performance issues in Azure (including poor etcd performance), the recommendation is to move etcd to a different disk (a Premium SSD v2 disk when possible). That procedure is not supported when using control plane machine sets, because it requires the disk to be available when the machine is created. For IPI installations that do not use control plane machine sets, the procedure may be possible if the additional disk is added when the machine is created.

Note: Starting with Red Hat OpenShift v4.20, a Technology Preview feature allows configuring additional etcd disks on Azure IPI installations, as shown in dedicated disk for etcd on Microsoft Azure (Technology Preview) and detailed in configuring a dedicated disk for etcd. Unfortunately, that feature is only available for newly installed clusters. For clusters that are already installed, there is an internal RFE-5214, still under discussion.

ADDITIONAL NOTES:

  • Disk performance can vary between Azure regions, as shown in Azure Disk performance by region (note that the data there is not up to date), and in some cases the minimal recommendation above may not be enough for good disk performance (including etcd performance). In those cases, check the etcd metrics and contact Azure Support about the disk performance if the metric values do not match the documented performance of the instance and disk. It is also possible to follow the Diagnostic Steps section of that linked solution to test the performance in a specific Azure region, but note that several tests should be run, not just one.

  • For other Red Hat OpenShift components, such as StorageClasses or OpenShift Data Foundation, it could be beneficial to use more performant disks such as Premium SSD v2 disks or Ultra Disks. Refer to Azure managed disk types for the limitations of each kind of disk (for example, Premium SSD v2 disks and Ultra Disks cannot be used as OS disks).
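
The etcd metrics check mentioned in the first note above can be sketched as follows. This is a minimal illustration, not a supported tool: it assumes that WAL fsync latency samples (in seconds, for example derived from the etcd metric etcd_disk_wal_fsync_duration_seconds) have already been collected into a list, and it assumes the commonly cited upstream etcd guidance of a 99th percentile below 10 ms, which is not stated in this solution. The function names are hypothetical.

```python
import math

# Assumption: 10 ms is the commonly cited upstream etcd guidance for the
# 99th percentile of WAL fsync duration; this solution does not state it.
WAL_FSYNC_P99_LIMIT_SECONDS = 0.010


def percentile(samples, pct):
    """Return the nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest-rank method: the smallest value with at least pct% of the
    # samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def wal_fsync_looks_healthy(samples, limit=WAL_FSYNC_P99_LIMIT_SECONDS):
    """Return True when the p99 of the fsync samples is within the limit."""
    return percentile(samples, 99) <= limit
```

Because disk performance can fluctuate, a check like this is more meaningful when repeated over several sampling windows rather than run once.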

Root Cause

  • In Azure, disk performance depends directly on SSD disk sizes and Service Quotas, which are layered. The effective limit is always the lowest of the applicable Service Quotas. When limits are exceeded, the service itself (storage, compute, networking, etc.) throttles the offending entity.

  • Many of the observed issues are related to clusters reaching disk or VM I/O limits and saturating them, which leads to Azure throttling. The OS disk then becomes slow or unresponsive from the kernel's perspective, with processes blocked in IOWait. This leads to failures in fstat() and causes different parts of the cluster stack to fail or degrade, sometimes in the most surprising ways.

  • Because of the Service Quotas layering mentioned above, it is very important to align the IOPS numbers correctly between the VM OS disk quotas and the quotas of the attached disk. The recommendations for control plane VM sizes and disk sizes take into account the etcd hardware requirements for disks, with a target of 5000 concurrent (likely ~500 sequential) IOPS.
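
The quota layering described above can be illustrated with a small sketch. It assumes only two layers (a VM-level cap and a disk-level cap), and the class, field, and function names are hypothetical. Real values must be taken from the Azure documentation for the specific VM size and disk. The 5000 IOPS default is the target mentioned in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IopsLimits:
    """Hypothetical container for the two quota layers discussed above."""
    vm_iops: int    # IOPS cap imposed by the VM size
    disk_iops: int  # IOPS cap imposed by the disk size and type


def effective_iops(limits: IopsLimits) -> int:
    """With layered quotas, the lowest applicable limit is the one that applies."""
    return min(limits.vm_iops, limits.disk_iops)


def meets_etcd_target(limits: IopsLimits, target_iops: int = 5000) -> bool:
    """Check the effective limit against the 5000 IOPS target from this section."""
    return effective_iops(limits) >= target_iops
```

For example, with illustrative (not real) numbers, a disk rated at 7500 IOPS attached to a VM capped at 3200 IOPS is effectively limited to 3200 IOPS. Enlarging the disk alone would not help in that case.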

Diagnostic Steps

After some time, the cluster becomes very slow and many operators become unhealthy, with errors such as the following in the etcd logs:

           W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-kube-scheduler/scheduler-kubeconfig\" " with result "range_response_count:1 size:*" took too long to execute
           W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-kube-apiserver-operator/kube-apiserver-operator-lock\" " with result "range_response_count:1 size:*" took too long to execute
           W | etcdserver: read-only range request "key:\"/kubernetes.io/leases/knative-serving/hpaautoscaler\" " with result "range_response_count:1 size:*" took too long to execute
           W | etcdserver: read-only range request "key:\"/kubernetes.io/operators.coreos.com/clusterserviceversions/openshift-operators/nfd.4.4.0-202005252114\" " with result "range_response_count:1 size:*" took too long to execute
           W | etcdserver: read-only range request "key:\"/kubernetes.io/eventing.knative.dev/triggers\" range_end:\"/kubernetes.io/eventing.knative.dev/triggert\" count_only:true " with result "range_response_count:0 size:*" took too long to execute
           W | etcdserver: read-only range request "key:\"/kubernetes.io/configmaps/openshift-cloud-credential-operator/cloud-credential-operator-leader\" " with result "range_response_count:1 size:*" took too long to execute
           W | etcdserver: read-only range request "key:\"/kubernetes.io/secrets/openshift-config-managed/\" range_end:\"/kubernetes.io/secrets/openshift-config-managed0\" " with result "range_response_count:27 size:*" took too long to execute
           I | embed: rejected connection from "X.X.X.X:X" (error "EOF", ServerName "")
           etcdserver: failed to send out heartbeat on time
           etcdserver: server is likely overloaded
           wal: sync duration of xxxx s, expected less than 1s
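
To see how often these messages occur, the etcd log lines can be tallied by symptom. The following is a minimal sketch that assumes the log lines have already been saved and read into a list of strings. The pattern names and the function name are hypothetical, and the patterns are derived only from the messages shown above.

```python
import re
from collections import Counter

# Patterns derived from the log messages listed in this section.
SYMPTOM_PATTERNS = {
    "slow_range_request": re.compile(
        r"read-only range request .* took too long to execute"
    ),
    "rejected_connection": re.compile(r"embed: rejected connection from"),
    "late_heartbeat": re.compile(r"failed to send out heartbeat on time"),
    "server_overloaded": re.compile(r"server is likely overloaded"),
    "slow_wal_sync": re.compile(r"wal: sync duration of"),
}


def count_symptoms(lines):
    """Count how many log lines match each known disk-pressure symptom."""
    counts = Counter()
    for line in lines:
        for name, pattern in SYMPTOM_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

A high count of slow range requests together with slow WAL syncs is consistent with the disk throttling described in the Root Cause section, although the counts alone do not prove it.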

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.