A Kubernetes Cluster Troubleshooting Case: etcd Unable to Elect a Leader, Causing the Kubernetes API Server to Fail to Start

1. Overview

1) Cluster information

| Name  | IP           | Role                 |
|-------|--------------|----------------------|
| Node1 | 172.17.1.120 | Control-plane node 1 |
| Node2 | 172.17.1.121 | Control-plane node 2 |
| k8s-2 | 172.17.1.131 | Worker node 1        |
| k8s-3 | 172.17.1.132 | Worker node 2        |

2) Failure symptoms

I followed the normal procedure for removing a worker node; everything went smoothly, and nothing looked abnormal at the time.

The operations that triggered the failure:

```bash
root@node1:~# kubectl drain node2 --ignore-daemonsets --delete-emptydir-data
root@node1:~# kubectl delete node node2

root@node2:~# kubeadm reset --force
```

Root cause: this was a Kubernetes cluster with two control-plane nodes (an abnormal setup in itself, but outside the scope of this post). After the Node2 control-plane node was removed, the remaining control plane stopped working: etcd failed to come up, which in turn caused the Kubernetes api-server to fail to start.

```bash
journalctl -u kubelet -n 100 --no-pager | less

Jul 22 23:17:05 node1 kubelet[3301]: E0722 23:17:05.135471    3301 controller.go:145] "Failed to ensure lease exists, will retry" err="Get \"https://172.17.1.120:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/node1?timeout=10s\": dial tcp 172.17.1.120:6443: connect: connection refused" interval="7s"
```
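A quick, low-risk check at this point (a sketch; 6443 and 2379 are the kubeadm default ports) is to see whether anything is listening on the api-server and etcd client ports at all:

```bash
# Are the api-server (6443) and etcd client (2379) ports being served?
# Here neither was: 6443 is refused in the kubelet log above, and 2379 is
# confirmed as not listening in the etcd analysis below.
ss -tlnp | grep -E ':(6443|2379)\b' || echo "6443/2379 not listening"
```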

2. Failure Analysis

Checking the container status on the control-plane node showed that all containers were in the Running state, but the etcd container's logs showed it could not connect to the other control-plane node.

1) Check the control-plane component status

```bash
root@node1:~# crictl --runtime-endpoint=unix:///var/run/containerd/containerd.sock ps | grep -e apiserver -e etcd -e controller-manager -e scheduler
6dd841c1bdcc3   9ea0bd82ed4f6   About an hour ago   Running   kube-scheduler            53   9477ef18cb630
cda7709fabb7f   b0cdcf76ac8e9   About an hour ago   Running   kube-controller-manager   54   7a3368070af64
78f4ae23ef1e0   a9e7e6b294baf   About an hour ago   Running   etcd                      54   583d4b926dc80
526d7fbe05632   f44c6888a2d24   12 hours ago        Running   kube-apiserver            0    e21825618af02
```

So the services were all running, yet the etcd logs showed it could not reach the other control-plane node, and the api-server could not connect to etcd.

```bash
root@node1:~# crictl --runtime-endpoint=unix:///var/run/containerd/containerd.sock logs -f 78f4ae23ef1e0

{"level":"warn","ts":"2025-07-23T05:31:03.896215Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"39011656e166436e","rtt":"0s","error":"dial tcp 172.17.1.121:2380: connect: connection refused"}
{"level":"info","ts":"2025-07-23T05:31:04.416899Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e6c9d72c757dea1 is starting a new election at term 62"}
{"level":"info","ts":"2025-07-23T05:31:04.416978Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e6c9d72c757dea1 became pre-candidate at term 62"}
{"level":"info","ts":"2025-07-23T05:31:04.417053Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e6c9d72c757dea1 received MsgPreVoteResp from e6c9d72c757dea1 at term 62"}
{"level":"info","ts":"2025-07-23T05:31:04.417147Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e6c9d72c757dea1 [logterm: 62, index: 234052257] sent MsgPreVote request to 39011656e166436e at term 62"}
```

The logs above show that etcd was stuck in a perpetual (pre-)election: with only 1 of 2 members alive, Raft's majority (quorum) requirement cannot be satisfied, so no leader can be elected and the client port 2379 is never served. As a result etcd cannot be operated on at all, and neither restarting kubelet nor restarting the etcd container helps.
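This "no leader" state can also be confirmed directly (a sketch based on the `--listen-metrics-urls` flag inspected in the next section, which exposes a plain-HTTP health listener on 127.0.0.1:2381):

```bash
# The metrics listener serves /health over plain HTTP even while the client
# port 2379 is not being served.
curl -s http://127.0.0.1:2381/health
# While quorum is lost this typically reports something like:
# {"health":"false","reason":"RAFT NO LEADER"}
```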

3. Remediation

Since etcd cannot elect a leader, and we only need a single etcd member anyway, the best option is to force-start etcd as a new single-member cluster and let things recover from there.

1) Inspect the etcd container's startup arguments

```bash
root@node1:~# crictl --runtime-endpoint=unix:///var/run/containerd/containerd.sock ps | grep etcd | awk '{print $1}' | xargs crictl --runtime-endpoint=unix:///var/run/containerd/containerd.sock inspect

# Locate the image and args sections in the output
"image": {
  "image": "registry.k8s.io/etcd:3.5.16-0"
}
--- output truncated ---
"args": [
  "etcd",
  "--advertise-client-urls=https://172.17.1.120:2379",
  "--cert-file=/etc/kubernetes/pki/etcd/server.crt",
  "--client-cert-auth=true",
  "--data-dir=/var/lib/etcd",
  "--experimental-initial-corrupt-check=true",
  "--experimental-watch-progress-notify-interval=5s",
  "--initial-advertise-peer-urls=https://172.17.1.120:2380",
  "--initial-cluster=node1=https://172.17.1.120:2380",
  "--key-file=/etc/kubernetes/pki/etcd/server.key",
  "--listen-client-urls=https://127.0.0.1:2379,https://172.17.1.120:2379",
  "--listen-metrics-urls=http://127.0.0.1:2381",
  "--listen-peer-urls=https://172.17.1.120:2380",
  "--name=node1",
  "--peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt",
  "--peer-client-cert-auth=true",
  "--peer-key-file=/etc/kubernetes/pki/etcd/peer.key",
  "--peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt",
  "--snapshot-count=10000",
  "--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt"
],
```
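Since this etcd runs as a kubeadm static pod, the same information can also be read from the static pod manifest, which is often easier than parsing the `crictl inspect` JSON (a sketch; the path is the kubeadm default):

```bash
# Print the image tag and all etcd command-line flags from the static pod manifest
grep -E 'image:|--' /etc/kubernetes/manifests/etcd.yaml
```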

2) Force-start etcd
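One practical detail the original record skips over (an assumption, consistent with the later `systemctl start kubelet` step): stop kubelet first and stop the existing etcd container, so nothing keeps competing for ports 2379/2380 with the container we are about to run.

```bash
# Stop kubelet so the static-pod etcd is no longer restarted
systemctl stop kubelet
# The already-running etcd container keeps running under containerd, so stop it too
# (<etcd-container-id> is a placeholder for the ID shown by crictl ps)
crictl --runtime-endpoint=unix:///var/run/containerd/containerd.sock ps | grep etcd
crictl --runtime-endpoint=unix:///var/run/containerd/containerd.sock stop <etcd-container-id>
```

With the ports free, etcd can now be force-started as a single-member cluster using the flags collected above: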

```bash
docker  run --rm --network=host -p 2379:2379 -p 2380:2380 -v /etc/kubernetes/pki/:/etc/kubernetes/pki/ -v /var/lib/etcd:/var/lib/etcd  registry.k8s.io/etcd:3.5.16-0 \
etcd \
--advertise-client-urls=https://172.17.1.120:2379 \
--cert-file=/etc/kubernetes/pki/etcd/server.crt \
--client-cert-auth=true \
--data-dir=/var/lib/etcd \
--experimental-initial-corrupt-check=true \
--experimental-watch-progress-notify-interval=5s \
--initial-advertise-peer-urls=https://172.17.1.120:2380 \
--initial-cluster=node1=https://172.17.1.120:2380 \
--key-file=/etc/kubernetes/pki/etcd/server.key \
--listen-client-urls=https://127.0.0.1:2379,https://172.17.1.120:2379 \
--listen-metrics-urls=http://127.0.0.1:2381 \
--listen-peer-urls=https://172.17.1.120:2380 \
--name=node1 \
--peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt \
--peer-client-cert-auth=true \
--peer-key-file=/etc/kubernetes/pki/etcd/peer.key \
--peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
--snapshot-count=10000 \
--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
--force-new-cluster
```
A few key points deserve attention here:
- Force a new cluster: `--force-new-cluster`. Since only one member is kept, cross-member data consistency is no longer a concern; the flag does not delete the existing data, it only rewrites the cluster membership metadata (see the backup sketch below).
- The etcd certificates and keys (`-v /etc/kubernetes/pki/:/etc/kubernetes/pki/`) and the etcd data directory (`-v /var/lib/etcd:/var/lib/etcd`) just need to match the mounts of the containerd container started by kubelet.
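Even though the flag only rewrites membership metadata, a prudent extra step before the force-start (not part of the original record) is to back up the data directory:

```bash
# Optional safety net: copy the etcd data directory before rewriting its metadata
cp -a /var/lib/etcd /var/lib/etcd.bak.$(date +%Y%m%d-%H%M)
```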

The etcd logs show that etcd started normally and that node2 was removed from the membership.
```bash
{"level":"info","ts":"2025-07-23T05:35:03.300893Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e6c9d72c757dea1 switched to configuration voters=(1039378730311999137)"}
{"level":"info","ts":"2025-07-23T05:35:03.301000Z","caller":"membership/cluster.go:472","msg":"removed member","cluster-id":"4a0015d70b3f3c63","local-member-id":"e6c9d72c757dea1","removed-remote-peer-id":"39011656e166436e","removed-remote-peer-urls":["https://172.17.1.121:2380"]}

建议采用docker启动修复etcd数据,体验比crictl要更好。

3) Restart kubelet

Start the service with `systemctl start kubelet`. After roughly five minutes of self-recovery, once the logs looked healthy, the control-plane node was back to normal.
With the control plane healthy again, the worker nodes were brought back one by one and the cluster returned to normal.
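A minimal sketch of this recovery step (the pod names are the kubeadm defaults):

```bash
systemctl start kubelet
# Watch the control-plane static pods come back up and become Ready
kubectl -n kube-system get pods | grep -E 'etcd|kube-apiserver|kube-controller-manager|kube-scheduler'
```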

```bash
root@node1:/data# kubectl get nodes
NAME    STATUS   ROLES           AGE    VERSION
k8s-2   Ready    <none>          139d   v1.29.15
k8s-3   Ready    <none>          140d   v1.29.15
node1   Ready    control-plane   728d   v1.29.15
```

As the `kubectl get nodes` output shows, the control plane has recovered.

4. Post-mortem

1) Number of control-plane nodes

An even number of members has always been a weak point under the Raft protocol; in fact, virtually no distributed consensus system recommends an even member count. With two members the quorum is still two, so losing either one makes the cluster unavailable, which is exactly what happened here. For `high availability`, deploy an odd number of nodes such as 3, 5, or 7 (see the table below).
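For reference, Raft needs a quorum of ⌊n/2⌋+1 members, so an even member count buys no additional fault tolerance:

| Members | Quorum | Tolerated failures |
|---------|--------|--------------------|
| 1       | 1      | 0                  |
| 2       | 2      | 0                  |
| 3       | 2      | 1                  |
| 4       | 3      | 1                  |
| 5       | 3      | 2                  |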

2) Could this have been avoided?

Yes: removing the etcd member for node2 before running `kubeadm reset` would have avoided the failure described above.

In other words, the complete procedure for the operations in section 1.2 should have been:

```bash
root@node1:~# kubectl drain node2 --ignore-daemonsets --delete-emptydir-data
root@node1:~# kubectl delete node node2

# Find node2's etcd member ID
root@node1:~# ETCDCTL_API=3 etcdctl --cert=/etc/kubernetes/pki/etcd/peer.crt \
--key=/etc/kubernetes/pki/etcd/peer.key \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--endpoints=https://127.0.0.1:2379 \
member list

# Remove node2's etcd member
root@node1:~# ETCDCTL_API=3 etcdctl --cert=/etc/kubernetes/pki/etcd/peer.crt \
--key=/etc/kubernetes/pki/etcd/peer.key \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--endpoints=https://127.0.0.1:2379 \
member remove <member-id>

# Reset the node on node2
root@node2:~# kubeadm reset --force
# Follow-up cleanup (iptables, ipvs, etc.) omitted...
```
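Optionally, a quick check after the removal (same etcdctl flags as above) confirms that the remaining single-member cluster is healthy before resetting node2:

```bash
root@node1:~# ETCDCTL_API=3 etcdctl --cert=/etc/kubernetes/pki/etcd/peer.crt \
--key=/etc/kubernetes/pki/etcd/peer.key \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--endpoints=https://127.0.0.1:2379 \
endpoint health
```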