## 1. Overview

### 1. Cluster information

| Name  | IP           | Role                 |
|-------|--------------|----------------------|
| Node1 | 172.17.1.120 | Control-plane node 1 |
| Node2 | 172.17.1.121 | Control-plane node 2 |
| k8s-2 | 172.17.1.131 | Worker node 1        |
| k8s-3 | 172.17.1.132 | Worker node 2        |
### 2. Fault symptoms

I followed the normal procedure for removing a worker node. Everything went smoothly, and nothing looked abnormal at the time.

The operations that led to the failure:

```bash
root@node1:~# kubectl drain node2 --ignore-daemonsets --delete-emptydir-data
root@node1:~# kubectl delete node node2
root@node2:~# kubeadm reset --force
```
Root cause: this was a Kubernetes cluster with two control-plane nodes (an abnormal topology in itself, but out of scope here). After the Node2 control-plane node was deleted, the remaining control plane stopped working: etcd failed to start, which in turn caused the Kubernetes api-server to fail.
```bash
journalctl -u kubelet -n 100 --no-pager | less
Jul 22 23:17:05 node1 kubelet[3301]: E0722 23:17:05.135471    3301 controller.go:145] "Failed to ensure lease exists, will retry" err="Get \"https://172.17.1.120:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/node1?timeout=10s\": dial tcp 172.17.1.120:6443: connect: connection refused" interval="7s"
```
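The "connection refused" on port 6443 indicates that kube-apiserver itself is not serving. A minimal sketch of how to confirm this from the node (the IP and port are taken from the log above):

```bash
# Nothing should be listening on 6443 while the api-server is down
ss -tlnp | grep 6443

# Probing the health endpoint directly should likewise fail with "connection refused"
curl -k https://172.17.1.120:6443/healthz
```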
## 2. Fault analysis

Checking the container status on the control-plane node showed that all containers were running normally, but the etcd container logs showed it could not connect to the other control-plane node.
### 1. Check control-plane service status

```bash
root@node1:~# crictl --runtime-endpoint=unix:///var/run/containerd/containerd.sock ps | grep -e apiserver -e etcd -e controller-manager -e scheduler
6dd841c1bdcc3   9ea0bd82ed4f6   About an hour ago   Running   kube-scheduler            53   9477ef18cb630
cda7709fabb7f   b0cdcf76ac8e9   About an hour ago   Running   kube-controller-manager   54   7a3368070af64
78f4ae23ef1e0   a9e7e6b294baf   About an hour ago   Running   etcd                      54   583d4b926dc80
526d7fbe05632   f44c6888a2d24   12 hours ago        Running   kube-apiserver            0    e21825618af02
```
All of the services are running, but the etcd logs show that it cannot connect to the other control-plane node, and the api-server cannot connect to etcd.
```bash
root@node1:~# crictl --runtime-endpoint=unix:///var/run/containerd/containerd.sock logs -f 78f4ae23ef1e0
{"level":"warn","ts":"2025-07-23T05:31:03.896215Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"39011656e166436e","rtt":"0s","error":"dial tcp 172.17.1.121:2380: connect: connection refused"}
{"level":"info","ts":"2025-07-23T05:31:04.416899Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e6c9d72c757dea1 is starting a new election at term 62"}
{"level":"info","ts":"2025-07-23T05:31:04.416978Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e6c9d72c757dea1 became pre-candidate at term 62"}
{"level":"info","ts":"2025-07-23T05:31:04.417053Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e6c9d72c757dea1 received MsgPreVoteResp from e6c9d72c757dea1 at term 62"}
{"level":"info","ts":"2025-07-23T05:31:04.417147Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e6c9d72c757dea1 [logterm: 62, index: 234052257] sent MsgPreVote request to 39011656e166436e at term 62"}
```
The logs make it clear that etcd is stuck in a perpetual election: with only 1 of 2 members alive, Raft's majority requirement cannot be satisfied, so no leader can be elected and the client port 2379 never starts listening. As a result etcd cannot be operated on at all, and neither restarting kubelet nor restarting the etcd container fixes the problem.
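To see for yourself that etcd is not serving clients, you can check whether port 2379 is listening and probe endpoint health. A minimal sketch, assuming the default kubeadm certificate paths used elsewhere in this post:

```bash
# The client port never opens while no leader can be elected
ss -tlnp | grep 2379

# The health probe is expected to fail in this state
ETCDCTL_API=3 etcdctl --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --endpoints=https://127.0.0.1:2379 \
  endpoint health
```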
## 3. Fault handling

Since etcd cannot elect a leader, and we only need a single etcd member anyway, the best option is to force-start etcd as a new single-member cluster and let the fault recover itself.
### 1. Check the etcd container's startup arguments

```bash
root@node1:~# crictl --runtime-endpoint=unix:///var/run/containerd/containerd.sock ps | grep etcd | awk '{print $1}' | xargs crictl --runtime-endpoint=unix:///var/run/containerd/containerd.sock inspect
"image": {
  "image": "registry.k8s.io/etcd:3.5.16-0"
}
--- output truncated ---
"args": [
  "etcd",
  "--advertise-client-urls=https://172.17.1.120:2379",
  "--cert-file=/etc/kubernetes/pki/etcd/server.crt",
  "--client-cert-auth=true",
  "--data-dir=/var/lib/etcd",
  "--experimental-initial-corrupt-check=true",
  "--experimental-watch-progress-notify-interval=5s",
  "--initial-advertise-peer-urls=https://172.17.1.120:2380",
  "--initial-cluster=node1=https://172.17.1.120:2380",
  "--key-file=/etc/kubernetes/pki/etcd/server.key",
  "--listen-client-urls=https://127.0.0.1:2379,https://172.17.1.120:2379",
  "--listen-metrics-urls=http://127.0.0.1:2381",
  "--listen-peer-urls=https://172.17.1.120:2380",
  "--name=node1",
  "--peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt",
  "--peer-client-cert-auth=true",
  "--peer-key-file=/etc/kubernetes/pki/etcd/peer.key",
  "--peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt",
  "--snapshot-count=10000",
  "--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt"
],
```
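On a kubeadm-managed control plane these arguments come from the etcd static Pod manifest, so (assuming the default kubeadm layout) the same information can also be read straight from the manifest file:

```bash
# Static Pod manifest that kubelet uses to run etcd on kubeadm control-plane nodes
cat /etc/kubernetes/manifests/etcd.yaml
```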
### 2. Force-start etcd

```bash
docker run --rm --network=host -p 2379:2379 -p 2380:2380 \
  -v /etc/kubernetes/pki/:/etc/kubernetes/pki/ \
  -v /var/lib/etcd:/var/lib/etcd \
  registry.k8s.io/etcd:3.5.16-0 \
  etcd \
  --advertise-client-urls=https://172.17.1.120:2379 \
  --cert-file=/etc/kubernetes/pki/etcd/server.crt \
  --client-cert-auth=true \
  --data-dir=/var/lib/etcd \
  --experimental-initial-corrupt-check=true \
  --experimental-watch-progress-notify-interval=5s \
  --initial-advertise-peer-urls=https://172.17.1.120:2380 \
  --initial-cluster=node1=https://172.17.1.120:2380 \
  --key-file=/etc/kubernetes/pki/etcd/server.key \
  --listen-client-urls=https://127.0.0.1:2379,https://172.17.1.120:2379 \
  --listen-metrics-urls=http://127.0.0.1:2381 \
  --listen-peer-urls=https://172.17.1.120:2380 \
  --name=node1 \
  --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt \
  --peer-client-cert-auth=true \
  --peer-key-file=/etc/kubernetes/pki/etcd/peer.key \
  --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
  --snapshot-count=10000 \
  --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
  --force-new-cluster
```

A few key points to note:

- Force a new cluster with `--force-new-cluster`: since only one member is kept and consistency with the removed member no longer matters, this does not delete the existing data; it only rewrites the cluster membership metadata.
- The etcd certificates (`-v /etc/kubernetes/pki/:/etc/kubernetes/pki/`) and the etcd data directory (`-v /var/lib/etcd:/var/lib/etcd`) just need to match the ones used by the kubelet-managed containerd container.

The etcd logs show that etcd has started normally and node2 has been removed from the membership:

```bash
{"level":"info","ts":"2025-07-23T05:35:03.300893Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"e6c9d72c757dea1 switched to configuration voters=(1039378730311999137)"}
{"level":"info","ts":"2025-07-23T05:35:03.301000Z","caller":"membership/cluster.go:472","msg":"removed member","cluster-id":"4a0015d70b3f3c63","local-member-id":"e6c9d72c757dea1","removed-remote-peer-id":"39011656e166436e","removed-remote-peer-urls":["https://172.17.1.121:2380"]}
```
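Once the forced single-member cluster is up, it is worth confirming that node2 is really gone from the membership. A sketch, assuming the default kubeadm certificate paths (the same etcdctl flags appear in the corrected procedure at the end of this post):

```bash
# Only node1 should remain in the member list after --force-new-cluster
ETCDCTL_API=3 etcdctl --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --endpoints=https://127.0.0.1:2379 \
  member list
```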
I recommend using docker to start etcd for this data repair; the experience is better than with crictl.
### 3. Restart kubelet

Start the service with `systemctl start kubelet`. After roughly five minutes of self-recovery, the logs looked normal again and the control-plane node was back in service. Once the control plane was healthy, the worker nodes were started one by one and the cluster returned to normal.
```bash
root@node1:/data# kubectl get nodes
NAME    STATUS   ROLES           AGE    VERSION
k8s-2   Ready    <none>          139d   v1.29.15
k8s-3   Ready    <none>          140d   v1.29.15
node1   Ready    control-plane   728d   v1.29.15
```
Running `kubectl get nodes` shows that the control plane has recovered.
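Beyond `kubectl get nodes`, a couple of extra checks (a sketch; the exact output will vary) can confirm that the control-plane components and etcd are healthy again:

```bash
# Aggregated readiness of the api-server, including its etcd check
kubectl get --raw='/readyz?verbose'

# All control-plane Pods in kube-system should be Running
kubectl -n kube-system get pods
```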
## 4. Retrospective

### 1. Number of control-plane nodes

An even number of nodes has always been a weak point of the Raft protocol; in fact, no distributed system recommends an even-sized cluster. For `high availability`, deploy an odd number of nodes such as 3, 5, or 7.
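As a reminder of why an even-sized cluster adds no resilience: Raft needs a quorum of floor(n/2) + 1 members, so going from 1 to 2 members (or 3 to 4) raises the quorum without raising the number of failures that can be tolerated. The table below is a generic illustration, not specific to this cluster:

| Members | Quorum | Failures tolerated |
|---------|--------|--------------------|
| 1       | 1      | 0                  |
| 2       | 2      | 0                  |
| 3       | 2      | 1                  |
| 4       | 3      | 1                  |
| 5       | 3      | 2                  |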
### 2. Could this have been avoided?

Yes. If the etcd member had been removed before running `kubeadm reset`, the failure above would not have happened. In other words, the complete sequence of operations for the steps in section 1.2 should have been:
The corrected operation record:

```bash
root@node1:~# kubectl drain node2 --ignore-daemonsets --delete-emptydir-data
root@node1:~# kubectl delete node node2
root@node1:~# ETCDCTL_API=3 etcdctl --cert=/etc/kubernetes/pki/etcd/peer.crt \
    --key=/etc/kubernetes/pki/etcd/peer.key \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --endpoints=https://127.0.0.1:2379 \
    member list
root@node1:~# ETCDCTL_API=3 etcdctl --cert=/etc/kubernetes/pki/etcd/peer.crt \
    --key=/etc/kubernetes/pki/etcd/peer.key \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --endpoints=https://127.0.0.1:2379 \
    member remove <member-id>
root@node2:~# kubeadm reset --force
```

Cleaning up iptables, ipvs, and other follow-up steps are omitted here...
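After removing the member, one can also check that the remaining endpoint reports itself as leader before resetting node2. A sketch; `--write-out=table` just makes the output easier to read:

```bash
# The IS LEADER column should show true for the surviving node1 member
ETCDCTL_API=3 etcdctl --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --endpoints=https://127.0.0.1:2379 \
  --write-out=table endpoint status
```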