Recovering an OpenShift 4 IPI Cluster from Complete etcd Quorum Loss
Environment
- Red Hat OpenShift Container Platform 4
Issue
- An OpenShift IPI cluster becomes completely unreachable: the API server returns EOF or connection refused on port 6443. The cluster was either freshly installed and never fully bootstrapped, or experienced a failure that left etcd running on only one of three control plane nodes. The etcd operator cannot self-recover because it refuses to make changes when quorum is not fault-tolerant.
Resolution
Step 1: Identify the recovery node
The recovery node is the master that has:
- An etcd-pod.yaml manifest in /etc/kubernetes/manifests/
- Data in /var/lib/etcd/member/
- An etcd container (even if crash-looping)
In the example, this is master-2 (10.0.1.12).
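The first two criteria can be checked with a small helper run as root on each master. This is a minimal sketch: the function name is_recovery_candidate and its optional root-prefix argument (useful only for trying the check against a scratch directory) are hypothetical, and the third criterion, the etcd container, is not checked here.

```shell
# Succeeds when the etcd static pod manifest exists and the etcd
# member directory contains data. The optional first argument is a
# path prefix; leave it empty to check the real filesystem.
is_recovery_candidate() {
  local root=${1:-}
  local manifest="$root/etc/kubernetes/manifests/etcd-pod.yaml"
  local data_dir="$root/var/lib/etcd/member"

  [ -f "$manifest" ] || { echo "missing $manifest" >&2; return 1; }
  [ -n "$(ls -A "$data_dir" 2>/dev/null)" ] || { echo "no data in $data_dir" >&2; return 1; }
  echo "candidate: manifest and etcd data present"
}
```

One way to run it remotely from a bash session, assuming passwordless sudo for the core user, is { declare -f is_recovery_candidate; echo is_recovery_candidate; } | ssh core@10.0.1.12 sudo bash.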
Step 2: Stop etcd and kube-apiserver on the recovery node
Move the static pod manifests out to stop the pods:
ssh core@10.0.1.12 "sudo mv /etc/kubernetes/manifests/etcd-pod.yaml /tmp/etcd-pod.yaml"
ssh core@10.0.1.12 "sudo mv /etc/kubernetes/manifests/kube-apiserver-pod.yaml /tmp/kube-apiserver-pod.yaml"
Wait for the pods to stop and ports to be released (15-30 seconds):
ssh core@10.0.1.12 "ss -tln | grep -E '2379|2380|6443'"
All three ports should show no listeners.
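A plain grep for 2379 also matches unrelated ports such as 22379, so an exact match on the port field can be safer. The following is a sketch, assuming the default ss -tln column layout in which the fourth field is the local address and port; the function name control_plane_listeners is illustrative.

```shell
# Print which of 2379, 2380 and 6443 still have a listener.
# Reads `ss -tln` output on stdin; assumes field 4 is "address:port".
control_plane_listeners() {
  awk '{
    n = split($4, parts, ":")   # the port is the text after the last colon
    port = parts[n]
    if (port == "2379" || port == "2380" || port == "6443") print port
  }' | sort -u
}
```

For example, ssh core@10.0.1.12 "ss -tln" | control_plane_listeners prints nothing once all three ports have been released.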
Step 3: Back up the etcd data directory
ssh core@10.0.1.12 "sudo cp -a /var/lib/etcd/member /var/lib/etcd/member.bak"
Step 4: Force etcd to a single-member cluster
Use the etcd container image (from the etcd-pod.yaml manifest) to run --force-new-cluster. This rewrites the etcd WAL to remove all other members and makes
Find the etcd image:
grep -o 'quay.io/openshift-release-dev/[^"]*' /tmp/etcd-pod.yaml | head -1
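The manifest was moved to /tmp on the recovery node, so if ETCD_IMAGE is set on the bastion the image reference has to be read over SSH. A sketch under that assumption; extract_etcd_image is a hypothetical helper that reads the manifest on stdin and, like the grep above, prints the first matching image reference.

```shell
# Print the first quay.io/openshift-release-dev image reference found
# on stdin. Stops at a double quote or whitespace so it works whether
# or not the manifest quotes the value.
extract_etcd_image() {
  grep -o 'quay\.io/openshift-release-dev/[^"[:space:]]*' | head -n 1
}
```

Usage: ETCD_IMAGE=$(ssh core@10.0.1.12 "sudo cat /tmp/etcd-pod.yaml" | extract_etcd_image), then echo "$ETCD_IMAGE" to confirm the value is not empty before continuing.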
Important: Use --entrypoint etcd — the default container entrypoint includes etcd, so passing etcd as an argument would cause a `'etcd' is not a val
Run force-new-cluster:
ETCD_IMAGE="<image-from-above>"
ssh core@10.0.1.12 "sudo podman run -d --name etcd-force --rm \
-v /var/lib/etcd:/var/lib/etcd:Z \
--network=host --privileged \
--entrypoint etcd ${ETCD_IMAGE} \
--force-new-cluster \
--data-dir /var/lib/etcd \
--name <cluster-infra-id>-master-2"
Wait ~10 seconds, then verify etcd is running as a single member:
ssh core@10.0.1.12 "sudo podman run --rm --network=host \
--entrypoint etcdctl ${ETCD_IMAGE} \
--endpoints=http://127.0.0.1:2379 member list -w table"
Expected output — a single member with status started:
+------------------+---------+-----------------------+----------------------------+-----------------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS |
+------------------+---------+-----------------------+----------------------------+-----------------------+
| 131cfa19f51cfb79 | started | <infra-id>-master-2 | https://10.0.1.12:2380 | http://localhost:2379 |
+------------------+---------+-----------------------+----------------------------+-----------------------+
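The comparison against the expected table can be scripted. A minimal sketch that reads member list -w table output on stdin and succeeds only when there is exactly one member row and its status is started; the function name is illustrative.

```shell
# Exit 0 only if the table on stdin has exactly one member row
# and that member is in the "started" state.
is_single_started_member() {
  awk -F'|' '
    /^\|/ {
      id = $2; gsub(/ /, "", id)
      if (id == "ID") next          # skip the header row
      rows++
      status = $3; gsub(/ /, "", status)
      if (status == "started") started++
    }
    END { exit !(rows == 1 && started == 1) }
  '
}
```

Pipe the output of the member list command above into is_single_started_member and test its exit status, for example with && echo "single member OK".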
If the member list is empty or the command times out, the force-new-cluster failed. Restore the backup and retry:
ssh core@10.0.1.12 "sudo podman stop etcd-force 2>/dev/null; \
sudo rm -rf /var/lib/etcd/member; \
sudo cp -a /var/lib/etcd/member.bak /var/lib/etcd/member"
Step 5: Stop the temporary etcd and restore manifests
ssh core@10.0.1.12 "sudo podman stop etcd-force"
Wait for ports to be free, then restore the manifests. Restore etcd first, wait a few seconds for it to start, then restore the API server:
ssh core@10.0.1.12 "sudo mv /tmp/etcd-pod.yaml /etc/kubernetes/manifests/etcd-pod.yaml"
sleep 3
ssh core@10.0.1.12 "sudo mv /tmp/kube-apiserver-pod.yaml /etc/kubernetes/manifests/kube-apiserver-pod.yaml"
Step 6: Verify etcd elected itself leader
Check the etcd logs to confirm it successfully became leader:
ssh core@10.0.1.12 "sudo crictl ps | grep 'etcd '"
ssh core@10.0.1.12 "sudo crictl logs <new-etcd-container-id> 2>&1 | grep -E 'became leader|elected leader|ready to serve'"
Expected:
{"msg":"131cfa19f51cfb79 became leader at term 11"}
{"msg":"raft.node: 131cfa19f51cfb79 elected leader 131cfa19f51cfb79 at term 11"}
{"msg":"ready to serve client requests"}
Step 7: Wait for API server to come up
The kube-apiserver may take 30-90 seconds to start after etcd. It may fail once on the first attempt (race condition — etcd not yet ready when apiserver start
# Watch for port 6443 to start listening
ssh core@10.0.1.12 "ss -tln | grep 6443"
# Test API access from the bastion or local machine
export KUBECONFIG=/path/to/auth/kubeconfig
oc get nodes
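Rather than re-running the checks by hand, a small polling helper can wait for the API. A sketch; retry_until is a hypothetical helper, and the 180-second timeout in the usage line is an assumption chosen to cover the 30-90 second startup window with some margin.

```shell
# Run a command repeatedly until it succeeds or the timeout elapses.
# Usage: retry_until TIMEOUT_SECONDS INTERVAL_SECONDS command [args...]
retry_until() {
  local timeout_s=$1 interval_s=$2 elapsed=0
  shift 2
  until "$@" >/dev/null 2>&1; do
    [ "$elapsed" -ge "$timeout_s" ] && return 1
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
}
```

For example: retry_until 180 5 oc get nodes && oc get nodes. The helper only counts time spent sleeping, so the real wait can be somewhat longer than the timeout when each attempt is slow to fail.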
If the API server keeps crashing, check its logs for the specific error. Common issues at this stage:
- etcd not yet ready: Wait longer, kubelet will retry
- Certificate issues: The force-new-cluster may have invalidated some certs — the cert-regeneration sidecar should handle this
Step 8: Enable unsafe single-member mode in the etcd operator
Once the API is reachable, the etcd operator will detect quorum=1 and refuse to make any changes (deploy etcd to other masters, generate certificates) because quorum is not fault-tolerant. This safety check protects against data loss but creates a deadlock during recovery.
Override this safety check:
oc patch etcd/cluster --type=merge -p \
'{"spec":{"unsupportedConfigOverrides":{"useUnsupportedUnsafeNonHANonProductionUnstableEtcd":true}}}'
Check the operator logs to confirm it is now reconciling:
oc logs -n openshift-etcd-operator deployment/etcd-operator --tail=20
The TargetConfigController and EtcdCertSignerController errors should stop, and the operator should begin deploying etcd pods to the other masters.
Step 9: Monitor etcd scaling to all masters
Watch the etcd operator deploy etcd pods to all three masters:
watch 'oc get pods -n openshift-etcd -l app=etcd -o wide; echo; oc get co etcd'
Wait until all three etcd pods show 4/4 Running. This typically takes 2-5 minutes.
You can also verify the static pod manifests are deployed to all masters:
for ip in 10.0.1.10 10.0.1.11 10.0.1.12; do
echo "$ip: $(ssh -o StrictHostKeyChecking=no core@${ip} 'ls /etc/kubernetes/manifests/' 2>/dev/null | tr '\n' ' ')"
done
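The wait condition can also be expressed as a check on the pod listing. A sketch, assuming the oc get pods --no-headers column order NAME, READY, STATUS; it compares ready containers to total containers instead of hard-coding 4/4, and the function name is illustrative.

```shell
# Exit 0 only if stdin lists exactly EXPECTED pods (default 3) and
# every one is Running with all of its containers ready.
etcd_pods_ready() {
  local expected=${1:-3}
  awk -v expected="$expected" '
    {
      total++
      split($2, ready, "/")       # READY column looks like "4/4"
      if (ready[1] == ready[2] && $3 == "Running") ok++
    }
    END { exit !(total == expected && ok == expected) }
  '
}
```

Usage: oc get pods -n openshift-etcd -l app=etcd --no-headers | etcd_pods_ready 3 && echo "all etcd pods running".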
Step 10: Remove the unsafe override
Once all three etcd members are healthy and running, remove the override:
oc patch etcd/cluster --type=merge -p '{"spec":{"unsupportedConfigOverrides":null}}'
Important: Do not leave this override in place permanently. It disables quorum safety checks that protect against data loss.
Step 11: Approve pending CSRs
After etcd recovery, kubelet client and serving certificates may have expired or been invalidated. The kubelets will request new certificates but they need ap
Check for and approve pending CSRs:
# Check for pending CSRs
oc get csr | grep Pending
# Approve all pending CSRs
oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | \
xargs -r oc adm certificate approve
Signs that CSR approval is needed:
- Nodes show NotReady status
- Kubelet logs show Unable to register node with API server: nodes is forbidden: User "system:anonymous"; the kubelet's certificate is invalid and it's fal
- Kubelet logs show no serving certificate available for the kubelet
You may need to run the approval command multiple times — new CSRs appear as kubelets re-register and request both client and serving certificates.
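The repeated approval can be wrapped in a loop. A sketch, assuming CONDITION is the last column of oc get csr --no-headers output; both function names and the default pass count and interval are illustrative. Review the pending requests before approving them in bulk on a cluster you do not fully control.

```shell
# Print the names of CSRs whose last column is exactly "Pending".
# Reads `oc get csr --no-headers` output on stdin.
pending_csr_names() {
  awk '$NF == "Pending" { print $1 }'
}

# Approve pending CSRs in several passes, pausing between passes so
# that kubelets have time to submit their follow-up requests.
approve_pending_csrs() {
  local passes=${1:-5} interval_s=${2:-30} i
  for i in $(seq 1 "$passes"); do
    oc get csr --no-headers | pending_csr_names | xargs -r oc adm certificate approve
    sleep "$interval_s"
  done
}
```

Calling approve_pending_csrs with no arguments runs five passes 30 seconds apart; approve_pending_csrs 10 15 runs ten passes 15 seconds apart.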
Step 12: Wait for full cluster recovery
Monitor all cluster operators:
watch 'oc get co | grep -vE "True.*False.*False"'
The remaining operators will progressively recover:
- etcd — completes revision rollout across all masters (5-10 min)
- kube-apiserver — deploys static pods to master-0 and master-1 (5-10 min)
- kube-controller-manager — deploys static pods to missing masters (5 min)
- kube-scheduler — deploys static pods to missing masters (5 min)
- authentication, console, ingress, monitoring — restart after API stabilizes (2-5 min)
- openshift-apiserver — rolls out updated pods (2-5 min)
Total recovery time after API comes back: 10-20 minutes.
A fully recovered cluster shows all operators as Available=True, Progressing=False, Degraded=False.
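That condition can be checked column by column instead of with a regular expression. A sketch, assuming the oc get co --no-headers column order NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED and that the VERSION column is populated for every operator (an empty VERSION would shift the columns); the function name is illustrative.

```shell
# Print the name of every cluster operator that is not
# Available=True, Progressing=False, Degraded=False.
# Reads `oc get co --no-headers` output on stdin.
unhealthy_operators() {
  awk '!($3 == "True" && $4 == "False" && $5 == "False") { print $1 }'
}
```

Usage: oc get co --no-headers | unhealthy_operators. Empty output means every operator has reached the healthy state.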
Root Cause
etcd requires a majority (2 out of 3) of members to form quorum and elect a leader. When only one member is running, it cannot elect itself leader, so all reads and writes fail. Without etcd the kube-apiserver cannot start, and without the API server the etcd-operator cannot deploy etcd to the other masters, creating a deadlock.
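The majority rule is integer arithmetic: quorum is floor(n/2) + 1 and fault tolerance is n minus quorum. A short sketch of that calculation (the helper names are illustrative):

```shell
# Members required for quorum in a cluster of $1 members.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Members that can fail while the cluster keeps quorum.
fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}
```

For a three-member cluster, quorum_size 3 prints 2 and fault_tolerance 3 prints 1. For a single-member cluster, quorum_size 1 prints 1 and fault_tolerance 1 prints 0, which is why a cluster forced down to one member can elect a leader again yet is still treated by the operator as not fault-tolerant.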
Common causes:
- Installation/bootstrap did not complete on all masters (network issues, resource limits such as vCPU per-host caps, image pull failures)
- Two masters lost their etcd data simultaneously (disk failure, accidental deletion)
- Node failures during an etcd operator rollout
- ESXi host resource limits (e.g., 512 vCPU cap) preventing VMs from powering on during cluster deployment
Diagnostic Steps
Since the API is down, all diagnosis must be done by SSH-ing directly to the RHCOS nodes as the core user.
Step 1: Check which static pod manifests exist on each master
A healthy master should have etcd-pod.yaml, kube-apiserver-pod.yaml, kube-controller-manager-pod.yaml, and kube-scheduler-pod.yaml:
for ip in 10.0.1.10 10.0.1.11 10.0.1.12; do
echo "=== $ip ==="
ssh -o StrictHostKeyChecking=no core@${ip} "ls /etc/kubernetes/manifests/"
done
Example output showing the problem — master-0 and master-1 are missing control plane manifests:
=== 10.0.1.10 (master-0) ===
coredns.yaml
haproxy.yaml
keepalived.yaml
=== 10.0.1.11 (master-1) ===
coredns.yaml
haproxy.yaml
keepalived.yaml
kube-controller-manager-pod.yaml
kube-scheduler-pod.yaml
=== 10.0.1.12 (master-2) ===
coredns.yaml
etcd-pod.yaml
haproxy.yaml
keepalived.yaml
kube-apiserver-pod.yaml
kube-controller-manager-pod.yaml
kube-scheduler-pod.yaml
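Comparing each listing against the four expected control plane manifests can be automated. A minimal sketch; missing_manifests is a hypothetical helper that reads a directory listing on stdin and prints every expected manifest that is absent.

```shell
# Print each expected control plane manifest that does not appear
# in the listing read from stdin (one file name per line).
missing_manifests() {
  local listing m
  listing=$(cat)
  for m in etcd-pod.yaml kube-apiserver-pod.yaml \
           kube-controller-manager-pod.yaml kube-scheduler-pod.yaml; do
    printf '%s\n' "$listing" | grep -qxF "$m" || printf '%s\n' "$m"
  done
}
```

Usage: ssh core@10.0.1.11 "ls /etc/kubernetes/manifests/" | missing_manifests. Against the master-1 listing above this prints etcd-pod.yaml and kube-apiserver-pod.yaml; against master-2 it prints nothing.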
Step 2: Check container status on each master
Use crictl to identify which control plane components are running or crash-looping on each master:
for ip in 10.0.1.10 10.0.1.11 10.0.1.12; do
echo "=== $ip ==="
ssh -o StrictHostKeyChecking=no core@${ip} \
"sudo crictl ps -a 2>/dev/null | grep -E 'kube-apiserver|etcd|kube-controller|kube-sched' | head -10"
done
Look for:
- etcd container with high restart count or Exited status
- kube-apiserver container repeatedly exiting
- Missing containers on nodes that should have them
Step 3: Check etcd logs on the running member
On the master that has etcd running (master-2 in the example), get the etcd container ID and check its logs:
# Find the etcd container
ssh core@10.0.1.12 "sudo crictl ps -a 2>/dev/null | grep -i etcd"
# Get the last 30 lines of logs (replace <container-id> with actual ID)
ssh core@10.0.1.12 "sudo crictl logs --tail 30 <container-id> 2>&1"
Key indicators of quorum loss:
- Repeated MsgPreVote requests to unreachable peers
- dial tcp <peer-ip>:2380: connect: connection refused errors
- The member never transitions from pre-candidate to leader
Step 4: Verify etcd cannot respond to commands
Try running etcdctl member list inside the etcd container:
ssh core@10.0.1.12 "sudo crictl exec <etcd-container-id> etcdctl member list --write-out=table 2>&1"
If this times out with context deadline exceeded, etcd has no quorum and cannot serve any requests.
Step 5: Check kube-apiserver crash reason
On the master where kube-apiserver exists, check its last crash logs:
# Find the exited kube-apiserver container
ssh core@10.0.1.12 "sudo crictl ps -a 2>/dev/null | grep kube-apiserver | head -5"
# Get crash logs
ssh core@10.0.1.12 "sudo crictl logs --tail 20 <apiserver-container-id> 2>&1"
Common crash messages:
- PostStartHook "start-service-ip-repair-controllers" failed (etcd unreachable, cannot verify service IP allocations)
- dial tcp <ip>:2379: connect: connection refused (cannot reach any etcd endpoint)
- context deadline exceeded (etcd connections timing out)
Step 6: Verify etcd data directory exists on the recovery node
Before proceeding with recovery, confirm that the etcd data directory exists and has data:
ssh core@10.0.1.12 "sudo ls -la /var/lib/etcd/member/ && sudo du -sh /var/lib/etcd/member/"
You should see snap/ and wal/ subdirectories. If the directory is empty or missing, this node cannot be used for recovery — check the other masters.
Verification
# All nodes Ready
oc get nodes
# All cluster operators healthy (only the header row printed = all healthy)
oc get co | grep -vE "True.*False.*False"
# etcd cluster has 3 healthy members
oc get pods -n openshift-etcd -l app=etcd -o wide
# Cluster version not reporting errors
oc get clusterversion
# No pending CSRs
oc get csr | grep Pending
Troubleshooting Recovery Failures
etcd operator still stuck after applying the override
Check if the etcd-all-certs secret exists in the openshift-etcd namespace:
oc get secret -n openshift-etcd etcd-all-certs
If missing, the EtcdCertSignerController needs to generate it. Check the operator logs:
oc logs -n openshift-etcd-operator deployment/etcd-operator --tail=50 2>&1 | grep -E "CertSigner|InstallerController"
The InstallerControllerDegraded: missing required resources: [secrets: etcd-all-certs] error should resolve after the cert signer runs successfully.
Nodes stuck in NotReady after CSR approval
If nodes remain NotReady after approving CSRs, check kubelet status on the affected node:
ssh core@10.0.1.10 "sudo systemctl status kubelet"
ssh core@10.0.1.10 "sudo journalctl -u kubelet --no-pager -n 20"
If kubelet is cycling with system:anonymous errors, it needs a fresh bootstrap token. Restart kubelet to force re-bootstrap:
ssh core@10.0.1.10 "sudo systemctl restart kubelet"
Then approve the new CSRs that appear.
kube-apiserver not deploying to other masters
If kube-apiserver remains degraded with Missing operand on node master-X, the kube-apiserver-operator needs the API to be stable before it can roll out. C
oc get pods -n openshift-kube-apiserver -o wide | grep -v Completed
If installer pods are stuck, check their logs:
oc logs -n openshift-kube-apiserver <installer-pod-name>