It takes 2 minutes 15 seconds to fail over from the primary EgressIP to the secondary EgressIP

Solution Verified - Updated

Environment

  • Red Hat OpenShift Container Platform
    • 4.x

Issue

  • Failover from the primary EgressIP to the secondary EgressIP takes approximately 2 minutes 15 seconds.
  • When rebooting my AWS node, the egress IP does not fail over.

Resolution

Using Curl as a validator

curl has two options that affect how long the command waits: --connect-timeout and --max-time.
Running curl with --connect-timeout 2 makes it give up if it cannot reach the server within 2 seconds, instead of waiting for the default 2-minute timeout. The loop therefore retries the connection sooner, and the failover should be visible within about 15 seconds.
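
A minimal sketch of the test loop with the timeout applied, wrapped in a function so that the number of requests can be bounded. The function name probe, the optional count argument, and the --max-time value of 5 seconds are assumptions made for illustration; only the --connect-timeout value of 2 seconds comes from the text above.

```shell
# probe URL [COUNT]: request URL once per second, COUNT times.
# With COUNT omitted (or 0) the loop runs until interrupted.
probe() {
  url=$1
  count=${2:-0}
  i=0
  while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
    # --connect-timeout 2: give up if the server is unreachable after 2 seconds.
    # --max-time 5: cap the whole request (assumed value, adjust as needed).
    curl -s --connect-timeout 2 --max-time 5 "$url" >/dev/null
    i=$((i + 1))
    sleep 1
  done
}
```

For example, probe remotehost.example.com/healthz runs the same endless loop as in the Diagnostic Steps, but each failed connection attempt is abandoned after 2 seconds.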

10.0.181.51 - - [08/Dec/2020:23:20:35 +0000] "GET /healthz HTTP/1.1" 200 3 "-" "curl/7.29.0"
10.0.181.51 - - [08/Dec/2020:23:20:36 +0000] "GET /healthz HTTP/1.1" 200 3 "-" "curl/7.29.0"
10.0.223.51 - - [08/Dec/2020:23:20:49 +0000] "GET /healthz HTTP/1.1" 200 3 "-" "curl/7.29.0"
10.0.223.51 - - [08/Dec/2020:23:20:50 +0000] "GET /healthz HTTP/1.1" 200 3 "-" "curl/7.29.0"

In this case, the failover took ~15 seconds to complete.

Root Cause

An HTTP client usually has several timeouts, typically a connection timeout in addition to read and write timeouts.
When validating an EgressIP failover, the client's connection timeout can delay the next request, which distorts the test.

As an example, curl has a default connection timeout of 2 minutes.
This skews the result by inflating the measured EgressIP failover time.

Diagnostic Steps

Using Curl as a validator

Example of curl usage to test connectivity from a pod in the desired namespace:

while true; do curl -s remotehost.example.com/healthz >/dev/null ; sleep 1; done
  • After deploying egress IPs as described in the OpenShift documentation [1], the curl requests appear to hang for approximately 2 minutes 15 seconds during a failover.
    In reality, the manual failover/fallback processes complete within a second.
    Example of configuration:
$ oc get hostsubnet
NAME                                              HOST                                              HOST IP        SUBNET          EGRESS CIDRS   EGRESS IPS
ip-10-0-181-17.ap-northeast-1.compute.internal    ip-10-0-181-17.ap-northeast-1.compute.internal    10.0.181.17    10.128.2.0/23   []             ["10.0.181.51"]
ip-10-0-223-103.ap-northeast-1.compute.internal   ip-10-0-223-103.ap-northeast-1.compute.internal   10.0.223.103   10.129.2.0/23   []             ["10.0.223.51"]
[...]
$ oc get netnamespace egress-testing
NAME             NETID      EGRESS IPS
egress-testing   11840413   ["10.0.223.51","10.0.181.51"]
  • In a bare-metal environment, when an egress node is rebooted, the failover takes ~2 minutes 15 seconds to complete. The node reboot takes almost 5 minutes. Once the primary node is back online, the internal fallback process completes within a second.
    In an AWS environment, when an egress node is rebooted, the failover is not triggered. The node reboots faster than a bare-metal server (~2 minutes 20 seconds) and the failover does not occur at all, but the requests hang.

Example of the log on the HTTP server side:

10.0.181.51 - - [07/Dec/2020:03:27:32 +0000] "GET /healthz HTTP/1.1" 200 3 "-" "curl/7.29.0"
10.0.181.51 - - [07/Dec/2020:03:27:33 +0000] "GET /healthz HTTP/1.1" 200 3 "-" "curl/7.29.0"
10.0.223.51 - - [07/Dec/2020:03:29:48 +0000] "GET /healthz HTTP/1.1" 200 3 "-" "curl/7.29.0"
10.0.223.51 - - [07/Dec/2020:03:29:49 +0000] "GET /healthz HTTP/1.1" 200 3 "-" "curl/7.29.0"

In this case, the failover took ~2 minutes 15 seconds to complete.
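
The gap can also be computed from the access log instead of being read by eye. The sketch below is a hypothetical helper (failover_gap is not part of any existing tool) that assumes the log format shown above and that all entries fall on the same day. It prints the number of seconds between the last request from the first source IP and the first request from a different source IP.

```shell
# failover_gap: read access-log lines on stdin and print the gap in seconds
# between the last request from the first source IP seen and the first
# request from any other source IP.
failover_gap() {
  awk '
    {
      # $4 looks like "[07/Dec/2020:03:27:33"; fields 2-4 are hour, minute, second.
      split($4, t, ":")
      secs = t[2] * 3600 + t[3] * 60 + t[4]
      if (first_ip == "") first_ip = $1
      if ($1 == first_ip) {
        last = secs              # still on the primary egress IP
      } else {
        print secs - last        # first request from another egress IP
        exit
      }
    }
  '
}
```

Fed the four log lines above on standard input, it prints 135, that is, 2 minutes 15 seconds.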

Reference:
[1] https://docs.openshift.com/container-platform/4.4/networking/openshift_sdn/assigning-egress-ips.html


This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.