K8s Backup and Upgrade
I. Preparation
Check the cluster version:
# kubectl get node -o wide
NAME STATUS ROLES AGE VERSION EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k8s-master Ready master 1y v1.12.1 <none> Asianux Server 7 (Lotus) 3.10.0-514.axs7.x86_64 docker://18.6.1
k8s-node1 Ready <none> 1y v1.12.1 <none> Asianux Server 7 (Lotus) 3.10.0-514.axs7.x86_64 docker://18.6.1
k8s-node2 Ready <none> 1y v1.12.1 <none> Asianux Server 7 (Lotus) 3.10.0-514.axs7.x86_64 docker://18.6.1
Check the etcd pod in kube-system:
# kubectl get pod -n kube-system
# kubectl describe pod etcd-k8s-master -n kube-system > /opt/k8s.bak/etcd.txt ## etcd-k8s-master is the etcd pod name; replace it with the name in your cluster
# cat /opt/k8s.bak/etcd.txt
Name: etcd-k8s-master
Namespace: kube-system
Priority: 2000000000
PriorityClassName: system-cluster-critical
Node: k8s-master/192.101.10.80
Start Time: Fri, 31 Jul 2020 18:51:15 +0800
Labels: component=etcd
tier=control-plane
Annotations: kubernetes.io/config.hash=c6ac110cbbe80b7156d7f1bb985f7e90
kubernetes.io/config.mirror=c6ac110cbbe80b7156d7f1bb985f7e90
kubernetes.io/config.seen=2019-08-01T16:36:33.33000287+08:00
kubernetes.io/config.source=file
scheduler.alpha.kubernetes.io/critical-pod=
Status: Running
IP: 192.101.10.80
Containers:
etcd:
Container ID: docker://4b90b88609a5f5f2c5c686698a1938cb2829709c07d31f21cc5796b1ee2abd9f
Image: k8s.gcr.io/etcd:3.2.24
Image ID: docker://sha256:3cab8e1b9802cbe23a2703c2750ac4baa90b049b65e2a9e0a83e9e2c29f0724f
Port: <none>
Host Port: <none>
Command:
etcd
--advertise-client-urls=https://127.0.0.1:2379
--cert-file=/etc/kubernetes/pki/etcd/server.crt
--client-cert-auth=true
--data-dir=/var/lib/etcd
--initial-advertise-peer-urls=https://127.0.0.1:2380
--initial-cluster=k8s-master=https://127.0.0.1:2380
--key-file=/etc/kubernetes/pki/etcd/server.key
--listen-client-urls=https://127.0.0.1:2379
--listen-peer-urls=https://127.0.0.1:2380
--name=k8s-master
--peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
--peer-client-cert-auth=true
--peer-key-file=/etc/kubernetes/pki/etcd/peer.key
--peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
--snapshot-count=10000
--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
State: Running
Started: Sat, 01 Aug 2020 19:28:18 +0800
Ready: True
Restart Count: 2
Liveness: exec [/bin/sh -ec ETCDCTL_API=3 etcdctl --endpoints=https://[127.0.0.1]:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key get foo] delay=15s timeout=15s period=10s #success=1 #failure=8
Environment: <none>
Mounts:
/etc/kubernetes/pki/etcd from etcd-certs (rw)
/var/lib/etcd from etcd-data (rw)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
etcd-data:
Type: HostPath (bare host directory volume)
Path: /var/lib/etcd
HostPathType: DirectoryOrCreate
etcd-certs:
Type: HostPath (bare host directory volume)
Path: /etc/kubernetes/pki/etcd
HostPathType: DirectoryOrCreate
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: :NoExecute
Events: <none>
Note: the etcd data directory is mounted from the host at /var/lib/etcd, and the certificate files come from /etc/kubernetes/pki/etcd.
Back up the cluster data on the master:
/etc/kubernetes/ - all files (certificates, static pod manifests)
/var/lib/kubelet/ - all files (kubelet plugin and container credentials)
/var/lib/etcd - all files (etcd data)
# mkdir -p /opt/k8s.bak/etc/kubernetes
# mkdir -p /opt/k8s.bak/var/lib/kubelet
# mkdir -p /opt/k8s.bak/var/lib/etcd
# cp -r /etc/kubernetes/* /opt/k8s.bak/etc/kubernetes/
# cp -r /var/lib/kubelet/* /opt/k8s.bak/var/lib/kubelet/
# cp -r /var/lib/etcd/* /opt/k8s.bak/var/lib/etcd/ ## only valid for a single-node etcd
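The same three directories can also be packed into a single dated archive so that several backup generations can coexist; a minimal sketch, assuming the /opt/k8s.bak layout used above:
# pack /etc/kubernetes, /var/lib/kubelet and /var/lib/etcd into one dated archive
tar czf /opt/k8s.bak/k8s-master-$(date +%F).tar.gz \
    /etc/kubernetes /var/lib/kubelet /var/lib/etcd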
II. etcd Data Backup
# tree /var/lib/etcd/
/var/lib/etcd/
└── member
    ├── snap
    │   ├── 0000000000000003-0000000000b9def0.snap
    │   ├── 0000000000000003-0000000000ba0601.snap
    │   ├── 0000000000000003-0000000000ba2d12.snap
    │   ├── 0000000000000003-0000000000ba5423.snap
    │   ├── 0000000000000003-0000000000ba7b34.snap
    │   └── db
    └── wal
        ├── 000000000000006b-0000000000b35df1.wal
        ├── 000000000000006c-0000000000b506e9.wal
        ├── 000000000000006d-0000000000b6afd2.wal
        ├── 000000000000006e-0000000000b858bf.wal
        ├── 000000000000006f-0000000000ba012f.wal
        └── 0.tmp

3 directories, 12 files
Kubernetes disaster recovery is built on backing up and restoring etcd. By default etcd keeps its data in the working directory of the etcd process (on the master this is /var/lib/etcd/), which is split into two subdirectories, snap and wal:
snap: snapshot data. etcd takes snapshots to keep the WAL from growing without bound; a snapshot captures the current etcd state.
wal: the write-ahead log, which records the complete history of changes. In etcd, every modification is written to the WAL before it is committed.
There are three ways to back up:
1) Back up the files under /etc/kubernetes/pki/etcd and /var/lib/etcd directly.
For a single-node etcd, backup and restore can be purely file based. A default kubeadm installation stores the etcd data as files under /var/lib/etcd/ on the host; back this directory up regularly, and when the data needs to be recovered, copy the files back into the directory and let a fresh etcd load them - that rebuilds the single-node database.
For a multi-node etcd cluster, this direct copy-and-restore of the data directory cannot be used.
Stop the etcd container with docker stop before copying the files, then start it again afterwards.
While etcd is stopped, this backup method interrupts service.
With the default configuration, etcd writes a new snapshot every 10,000 changes (--snapshot-count).
If you only back up the files under /var/lib/etcd/member/snap, the service does not need to be stopped.
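A minimal sketch of this file-level backup for a single-node, kubeadm-installed etcd with the docker runtime (the container-name filter and paths are assumptions based on kubeadm defaults):
#!/bin/bash
# file-level backup of a single-node etcd; stopping the container interrupts the API server
BACKUP_DIR=/opt/k8s.bak/etcd-files/$(date +%F)
mkdir -p "${BACKUP_DIR}"
# kubelet names the etcd container k8s_etcd_etcd-<node>_kube-system_...
ETCD_CID=$(docker ps -q --filter "name=k8s_etcd" | head -n 1)
docker stop "${ETCD_CID}"                                             # stop etcd so the files are consistent
tar czf "${BACKUP_DIR}/etcd-data.tar.gz" -C /var/lib etcd             # /var/lib/etcd
tar czf "${BACKUP_DIR}/etcd-pki.tar.gz"  -C /etc/kubernetes/pki etcd  # /etc/kubernetes/pki/etcd
docker start "${ETCD_CID}"                                            # kubelet would also restart it by itself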
2) Back up with the etcdctl client
The following commands take a snapshot of the etcd database.
# Inspect the etcd cluster:
# show etcdctl help
# etcdctl help
# etcdctl version
etcdctl version: 3.3.11
API version: 3.3
# List the members:
# etcdctl --endpoints=https://192.168.105.92:2379,https://192.168.105.93:2379,https://192.168.105.94:2379 --cert-file=/etc/kubernetes/pki/etcd/server.crt --key-file=/etc/kubernetes/pki/etcd/server.key --ca-file=/etc/kubernetes/pki/etcd/ca.crt member list
# etcdctl --endpoints=https://192.101.11.160:2379 --cert-file=/etc/kubernetes/pki/etcd/server.crt --key-file=/etc/kubernetes/pki/etcd/server.key --ca-file=/etc/kubernetes/pki/etcd/ca.crt member list
330027b50fd1daa4: name=hadoop008 peerURLs=https://192.101.11.160:2380 clientURLs=https://192.101.11.160:2379 isLeader=true
# List the Kubernetes keys:
# export ETCDCTL_API=3
# etcdctl get / --prefix --keys-only --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key --cacert=/etc/kubernetes/pki/etcd/ca.crt
# Take a snapshot:
# ETCDCTL_API=3 etcdctl --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key --cacert=/etc/kubernetes/pki/etcd/ca.crt snapshot save /opt/k8s-backup/data/etcd-snapshot/$(date +%F)-k8s-snapshot.db
Note: because the cluster talks to etcd over HTTPS, the certificate, key and CA must always be passed (--cert-file/--key-file/--ca-file with the v2 API, --cert/--key/--cacert when ETCDCTL_API=3); all three files live under /etc/kubernetes/pki/etcd.
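After taking a snapshot it is worth checking that the file is usable and pruning old copies; a minimal sketch using the same /opt/k8s-backup path as above (the seven-day retention is an arbitrary choice):
SNAP=/opt/k8s-backup/data/etcd-snapshot/$(date +%F)-k8s-snapshot.db
# print hash, revision, total keys and size of the snapshot just written
ETCDCTL_API=3 etcdctl snapshot status "${SNAP}" --write-out=table
# keep only the last 7 days of snapshots
find /opt/k8s-backup/data/etcd-snapshot/ -name '*-k8s-snapshot.db' -mtime +7 -delete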
3) Automate periodic backups with a Kubernetes CronJob
Running the backup as a CronJob needs a few adjustments to the image and the startup arguments; the YAML manifest is as follows:
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: etcd-disaster-recovery
  namespace: cronjob
spec:
  schedule: "0 22 * * *"
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            app: etcd-disaster-recovery
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: kubernetes.io/hostname
                    operator: In
                    values:
                    - podc01
          containers:
          - name: etcd
            image: k8s.gcr.io/etcd:3.2.24
            imagePullPolicy: "IfNotPresent"
            command:
            - sh
            - -c
            - "export ETCDCTL_API=3; \
              etcdctl --endpoints=$ENDPOINT \
                --cert=/etc/kubernetes/pki/etcd/server.crt \
                --key=/etc/kubernetes/pki/etcd/server.key \
                --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                snapshot save /snapshot/$(date +%Y%m%d_%H%M%S)_snapshot.db; \
              echo etcd backup success"
            env:
            - name: ENDPOINT
              value: "https://127.0.0.1:2379"
            volumeMounts:
            - mountPath: "/etc/kubernetes/pki/etcd"
              name: etcd-certs
            - mountPath: "/var/lib/etcd"
              name: etcd-data
            - mountPath: "/snapshot"
              name: snapshot
              subPath: data/etcd-snapshot
            - mountPath: /etc/localtime
              name: lt-config
            - mountPath: /etc/timezone
              name: tz-config
          restartPolicy: OnFailure
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
          - name: etcd-data
            hostPath:
              path: /var/lib/etcd
          - name: snapshot
            hostPath:
              path: /home/supermap/k8s-backup
          - name: lt-config
            hostPath:
              path: /etc/localtime
          - name: tz-config
            hostPath:
              path: /etc/timezone
          hostNetwork: true
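A short usage sketch: the manifest targets the cronjob namespace, which must exist, and the nodeAffinity rule pins the Job to node podc01; the manifest file name below is an assumption.
kubectl create namespace cronjob
kubectl apply -f etcd-backup-cronjob.yaml    # file name is an assumption
kubectl get cronjob,job -n cronjob           # confirm that the nightly runs are being created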
III. etcd Data Restore
Note: the restore procedure stops all workloads and all access to the cluster!
First stop kube-apiserver on every master machine and make sure it has fully stopped.
# mv /etc/kubernetes/manifests /etc/kubernetes/manifests.bak
# docker ps | grep k8s_   ## check whether etcd and the apiserver are still up; wait until they have all stopped
# mv /var/lib/etcd /var/lib/etcd.bak
Restore every member of the etcd cluster from the same snapshot:
# distribute the snapshot file
cd /tmp
rsync -avz 2018-09-18-k8s-snapshot.db 192.168.105.93:/tmp/
rsync -avz 2018-09-18-k8s-snapshot.db 192.168.105.94:/tmp/
Run on Master1:
cd /tmp/
export ETCDCTL_API=3
etcdctl snapshot restore 2018-09-18-k8s-snapshot.db \
--endpoints=192.168.105.92:2379 \
--name=lab1 \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--initial-advertise-peer-urls=https://192.168.105.92:2380 \
--initial-cluster-token=etcd-cluster-0 \
--initial-cluster=lab1=https://192.168.105.92:2380,lab2=https://192.168.105.93:2380,lab3=https://192.168.105.94:2380 \
--data-dir=/var/lib/etcd
Run on Master2:
cd /tmp/
export ETCDCTL_API=3
etcdctl snapshot restore 2018-09-18-k8s-snapshot.db \
--endpoints=192.168.105.93:2379 \
--name=lab2 \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--initial-advertise-peer-urls=https://192.168.105.93:2380 \
--initial-cluster-token=etcd-cluster-0 \
--initial-cluster=lab1=https://192.168.105.92:2380,lab2=https://192.168.105.93:2380,lab3=https://192.168.105.94:2380 \
--data-dir=/var/lib/etcd
Run on Master3:
cd /tmp/
export ETCDCTL_API=3
etcdctl snapshot restore 2018-09-18-k8s-snapshot.db \
--endpoints=192.168.105.94:2379 \
--name=lab3 \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--initial-advertise-peer-urls=https://192.168.105.94:2380 \
--initial-cluster-token=etcd-cluster-0 \
--initial-cluster=lab1=https://192.168.105.92:2380,lab2=https://192.168.105.93:2380,lab3=https://192.168.105.94:2380 \
--data-dir=/var/lib/etcd
After all three restores have finished, put the manifests back on the three master machines.
mv /etc/kubernetes/manifests.bak /etc/kubernetes/manifests
Finally, verify:
# list the keys again
# etcdctl get / --prefix --keys-only --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key --cacert=/etc/kubernetes/pki/etcd/ca.crt
registry/apiextensions.k8s.io/customresourcedefinitions/apprepositories.kubeapps.com
/registry/apiregistration.k8s.io/apiservices/v1.
/registry/apiregistration.k8s.io/apiservices/v1.apps
/registry/apiregistration.k8s.io/apiservices/v1.authentication.k8s.io
........ output omitted ..........
# kubectl get pod -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-777d78ff6f-m5chm 1/1 Running 1 18h
coredns-777d78ff6f-xm7q8 1/1 Running 1 18h
dashboard-kubernetes-dashboard-7cfc6c7bf5-hr96q 1/1 Running 0 13h
dashboard-kubernetes-dashboard-7cfc6c7bf5-x9p7j 1/1 Running 0 13h
etcd-lab1 1/1 Running 0 18h
etcd-lab2 1/1 Running 0 1m
etcd-lab3 1/1 Running 0 18h
kube-apiserver-lab1 1/1 Running 0 18h
kube-apiserver-lab2 1/1 Running 0 1m
kube-apiserver-lab3 1/1 Running 0 18h
kube-controller-manager-lab1 1/1 Running 0 18h
kube-controller-manager-lab2 1/1 Running 0 1m
kube-controller-manager-lab3 1/1 Running 0 18h
kube-flannel-ds-7w6rl 1/1 Running 2 18h
kube-flannel-ds-b9pkf 1/1 Running 2 18h
kube-flannel-ds-fck8t 1/1 Running 1 18h
kube-flannel-ds-kklxs 1/1 Running 1 18h
kube-flannel-ds-lxxx9 1/1 Running 2 18h
kube-flannel-ds-q7lpg 1/1 Running 1 18h
kube-flannel-ds-tlqqn 1/1 Running 1 18h
kube-proxy-85j7g 1/1 Running 1 18h
kube-proxy-gdvkk 1/1 Running 1 18h
kube-proxy-jw5gh 1/1 Running 1 18h
kube-proxy-pgfxf 1/1 Running 1 18h
kube-proxy-qx62g 1/1 Running 1 18h
kube-proxy-rlbdb 1/1 Running 1 18h
kube-proxy-whhcv 1/1 Running 1 18h
kube-scheduler-lab1 1/1 Running 0 18h
kube-scheduler-lab2 1/1 Running 0 1m
kube-scheduler-lab3 1/1 Running 0 18h
kubernetes-dashboard-754f4d5f69-7npk5 1/1 Running 0 13h
kubernetes-dashboard-754f4d5f69-whtg9 1/1 Running 0 13h
Summary:
Whether Kubernetes was installed from binaries or with kubeadm, backup essentially means backing up etcd. For a restore, what matters most is the order: stop kube-apiserver, stop etcd, restore the data, start etcd, start kube-apiserver.
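For the single-node etcd case described in part II, that order collapses into a few commands; a minimal sketch, assuming a kubeadm single-master layout and a previously saved snapshot (the file name reuses the example from this section, and the name/peer-URL flags mirror the etcd static pod flags shown in part I):
mv /etc/kubernetes/manifests /etc/kubernetes/manifests.bak   # kubelet stops the apiserver and etcd static pods
mv /var/lib/etcd /var/lib/etcd.bak
ETCDCTL_API=3 etcdctl snapshot restore /tmp/2018-09-18-k8s-snapshot.db \
  --name=k8s-master \
  --initial-cluster=k8s-master=https://127.0.0.1:2380 \
  --initial-advertise-peer-urls=https://127.0.0.1:2380 \
  --data-dir=/var/lib/etcd
mv /etc/kubernetes/manifests.bak /etc/kubernetes/manifests   # kubelet starts etcd again, then the apiserver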
IV. Backing Up and Restoring the Master Control-Plane Components
Generally, when a master node has to be restored it is rarely just an accidental change or deletion; more likely the whole machine has failed, so the etcd data usually has to be restored at the same time.
The restore has one precondition: the replacement machine must use exactly the same hostname and IP address as the crashed master, because that configuration is stored inside etcd itself.
A. Master data backup
The master backup consists of three parts:
1. all files under /etc/kubernetes/ (certificates, static pod manifests)
2. .kube/config in the user's home directory (kubectl credentials)
3. all files under /var/lib/kubelet/ (kubelet plugin and container credentials)
B. Master component restore
The master components can be restored with the following steps (a shell sketch follows the list):
1. Do a clean reinstall with the original installation scripts (kubeadm reset, iptables -X, ...).
2. Stop the kubelet: systemctl stop kubelet.service.
3. Delete the add-on containers (coredns, flannel, dashboard).
4. Restore the etcd data (see part III).
5. Restore the three backed-up directories.
6. Start the kubelet again: systemctl start kubelet.service.
7. Wait a moment until all components have started, then verify.
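A minimal sketch of steps 2, 5 and 6, assuming the backups were copied to /opt/k8s.bak as in part I (the kubectl-config backup path is an assumption; the etcd restore from part III runs between stopping and starting the kubelet):
systemctl stop kubelet.service
cp -r /opt/k8s.bak/etc/kubernetes/*  /etc/kubernetes/
cp -r /opt/k8s.bak/var/lib/kubelet/* /var/lib/kubelet/
cp /opt/k8s.bak/kube-config          /root/.kube/config   # kubectl credentials; backup path is an assumption
systemctl start kubelet.service
kubectl get pod -n kube-system       # wait until the control-plane pods are Running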
V. Master Node Upgrade
Note: this upgrade goes from 1.12.1 to 1.16.3. kubeadm only supports upgrading one minor version at a time, so in practice the steps below have to be repeated for each intermediate release (1.12 -> 1.13 -> 1.14 -> 1.15 -> 1.16).
1. Upgrade kubeadm on the master node
# list the available package versions
yum list --showduplicates kubeadm --disableexcludes=kubernetes
# install the target kubeadm version
yum install -y kubeadm-1.16.3-0 --disableexcludes=kubernetes ### on all nodes
# check the kubeadm version
kubeadm version
# show the upgrade plan
kubeadm upgrade plan ### shows whether the cluster can be upgraded and the target version of each component
# use the plan output together with `kubeadm version` to confirm that kubeadm itself was upgraded
List the required images and pull them with a script:
# list the required images
kubeadm config images list
# edit the pull script according to the image names printed above (a sketch of pull_images.sh follows)
# run the script to pull the images
./pull_images.sh
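A sketch of what pull_images.sh might look like: it pulls every image that kubeadm config images list printed from a reachable mirror and retags it as k8s.gcr.io. The mirror registry.aliyuncs.com/google_containers is an assumption; substitute whatever registry you can reach.
#!/bin/bash
# pull_images.sh (sketch): pull the kubeadm images via a mirror and retag them
MIRROR=registry.aliyuncs.com/google_containers   # assumption: an accessible mirror
for image in $(kubeadm config images list 2>/dev/null); do
    name=${image#k8s.gcr.io/}                    # e.g. kube-apiserver:v1.16.3, coredns:1.6.2
    docker pull "${MIRROR}/${name}"
    docker tag  "${MIRROR}/${name}" "${image}"   # retag with the name kubeadm expects
    docker rmi  "${MIRROR}/${name}"
done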
2. Upgrade the master node
kubeadm upgrade apply v1.16.3    # apply the control-plane upgrade
yum install -y kubelet-1.16.3-0 kubectl-1.16.3-0 --disableexcludes=kubernetes    # upgrade kubelet and kubectl
sudo systemctl restart kubelet
VI. Worker Node Upgrade
# pull the required images first, mainly kube-proxy and coredns
# upgrade kubeadm
yum install -y kubeadm-1.16.3-0 --disableexcludes=kubernetes
# drain the node (evict its workloads); run this on the master
# if the master itself also runs pods as a node, the same command evicts those pods and cordons the node (scheduling disabled)
# replace cp-node-name with the master's node name
kubectl drain cp-node-name --ignore-daemonsets
kubectl drain $NODE --ignore-daemonsets
# upgrade the kubelet
sudo kubeadm upgrade node
yum install -y kubelet-1.16.3-0 kubectl-1.16.3-0 --disableexcludes=kubernetes
sudo systemctl restart kubelet
kubectl uncordon $NODE
Verify that the upgrade succeeded:
kubectl get nodes
Note: check the kubelet logs for problems: journalctl -f -u kubelet