
K8S Backup and Upgrade

1. Preparation
Check the cluster version:
# kubectl get node -o wide
NAME         STATUS    ROLES     AGE       VERSION   EXTERNAL-IP   OS-IMAGE                   KERNEL-VERSION           CONTAINER-RUNTIME
k8s-master   Ready     master    1y        v1.12.1   <none>        Asianux Server 7 (Lotus)   3.10.0-514.axs7.x86_64   docker://18.6.1
k8s-node1    Ready     <none>    1y        v1.12.1   <none>        Asianux Server 7 (Lotus)   3.10.0-514.axs7.x86_64   docker://18.6.1
k8s-node2    Ready     <none>    1y        v1.12.1   <none>        Asianux Server 7 (Lotus)   3.10.0-514.axs7.x86_64   docker://18.6.1


Check the etcd pod information:
# kubectl get pod -n kube-system
# kubectl describe pod etcd-k8s-master -n kube-system > /opt/k8s.bak/etcd.txt   ## etcd-k8s-master is the etcd pod name; replace it with the actual pod name in your cluster
# cat /opt/k8s.bak/etcd.txt
Name:               etcd-k8s-master
Namespace:          kube-system
Priority:           2000000000
PriorityClassName:  system-cluster-critical
Node:               k8s-master/192.101.10.80
Start Time:         Fri, 31 Jul 2020 18:51:15 +0800
Labels:             component=etcd
                    tier=control-plane
Annotations:        kubernetes.io/config.hash=c6ac110cbbe80b7156d7f1bb985f7e90
                    kubernetes.io/config.mirror=c6ac110cbbe80b7156d7f1bb985f7e90
                    kubernetes.io/config.seen=2019-08-01T16:36:33.33000287+08:00
                    kubernetes.io/config.source=file
                    scheduler.alpha.kubernetes.io/critical-pod=
Status:             Running
IP:                 192.101.10.80
Containers:
  etcd:
    Container ID:  docker://4b90b88609a5f5f2c5c686698a1938cb2829709c07d31f21cc5796b1ee2abd9f
    Image:         k8s.gcr.io/etcd:3.2.24
    Image ID:      docker://sha256:3cab8e1b9802cbe23a2703c2750ac4baa90b049b65e2a9e0a83e9e2c29f0724f
    Port:          <none>
    Host Port:     <none>
    Command:
      etcd
      --advertise-client-urls=https://127.0.0.1:2379
      --cert-file=/etc/kubernetes/pki/etcd/server.crt
      --client-cert-auth=true
      --data-dir=/var/lib/etcd
      --initial-advertise-peer-urls=https://127.0.0.1:2380
      --initial-cluster=k8s-master=https://127.0.0.1:2380
      --key-file=/etc/kubernetes/pki/etcd/server.key
      --listen-client-urls=https://127.0.0.1:2379
      --listen-peer-urls=https://127.0.0.1:2380
      --name=k8s-master
      --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
      --peer-client-cert-auth=true
      --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
      --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
      --snapshot-count=10000
      --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    State:          Running
      Started:      Sat, 01 Aug 2020 19:28:18 +0800
    Ready:          True
    Restart Count:  2
    Liveness:       exec [/bin/sh -ec ETCDCTL_API=3 etcdctl --endpoints=https://[127.0.0.1]:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key get foo] delay=15s timeout=15s period=10s #success=1 #failure=8
    Environment:    <none>
    Mounts:
      /etc/kubernetes/pki/etcd from etcd-certs (rw)
      /var/lib/etcd from etcd-data (rw)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  etcd-data:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/etcd
    HostPathType:  DirectoryOrCreate
  etcd-certs:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/kubernetes/pki/etcd
    HostPathType:  DirectoryOrCreate
QoS Class:         BestEffort
Node-Selectors:    <none>
Tolerations:       :NoExecute
Events:            <none>


Note: as the output above shows, the etcd data directory is mounted from /var/lib/etcd on the host, and the certificate files from /etc/kubernetes/pki/etcd.

k8s cluster data to back up:
/etc/kubernetes/  all files in this directory (certificates, manifest files)
/var/lib/kubelet/ all files in this directory (plugin/container authentication)
/var/lib/etcd     all files in this directory (etcd API data)

# mkdir -p /opt/k8s.bak/etc/kubernetes
# mkdir -p /opt/k8s.bak/var/lib/kubelet
# mkdir -p /opt/k8s.bak/var/lib/etcd
# cp -r  /etc/kubernetes/* /opt/k8s.bak/etc/kubernetes/
# cp -r /var/lib/kubelet/* /opt/k8s.bak/var/lib/kubelet/
# cp -r /var/lib/etcd/* /opt/k8s.bak/var/lib/etcd/    ## only for a single-node etcd service
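
For regular backups it can be more convenient to archive the same three directories into a dated tarball rather than copying them in place; a minimal sketch (the /opt/k8s.bak path is only an example):

# BACKUP_DIR=/opt/k8s.bak/$(date +%F)
# mkdir -p $BACKUP_DIR
# tar -czf $BACKUP_DIR/etc-kubernetes.tar.gz  /etc/kubernetes
# tar -czf $BACKUP_DIR/var-lib-kubelet.tar.gz /var/lib/kubelet
# tar -czf $BACKUP_DIR/var-lib-etcd.tar.gz    /var/lib/etcd    ## single-node etcd only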

2. etcd Data Backup
# tree /var/lib/etcd/
/var/lib/etcd/
└── member
    ├── snap
    │   ├── 0000000000000003-0000000000b9def0.snap
    │   ├── 0000000000000003-0000000000ba0601.snap
    │   ├── 0000000000000003-0000000000ba2d12.snap
    │   ├── 0000000000000003-0000000000ba5423.snap
    │   ├── 0000000000000003-0000000000ba7b34.snap
    │   └── db
    └── wal
        ├── 000000000000006b-0000000000b35df1.wal
        ├── 000000000000006c-0000000000b506e9.wal
        ├── 000000000000006d-0000000000b6afd2.wal
        ├── 000000000000006e-0000000000b858bf.wal
        ├── 000000000000006f-0000000000ba012f.wal
        └── 0.tmp

3 directories, 12 files

Disaster recovery of a k8s cluster is built on disaster recovery of etcd. By default, etcd stores its data in the working directory of the etcd process (on the master this is /var/lib/etcd/). The data directory is split into two folders, snap and wal:

snap: snapshot data. etcd takes snapshots to keep the WAL files from growing too large; a snapshot records the state of the etcd data.
wal: the write-ahead log, which records the complete history of every data change. In etcd, every modification must be written to the WAL before it is committed.

There are three ways to back up:
1) Back up the contents of /etc/kubernetes/pki/etcd and /var/lib/etcd directly.
For a single-node etcd service, backup and restore can be done at the level of the data files. In a default kubeadm installation, etcd stores its data as files under /var/lib/etcd/ on the host. Back this directory up regularly; when the etcd data needs to be restored, copy the files back into this directory and let the new instance load them, which rebuilds the single-node etcd database.
For a multi-node etcd cluster, the directory files cannot be backed up and restored directly in this way.
Stop the corresponding service with docker stop before the backup and start it again afterwards (see the sketch below).
    If the etcd service is stopped, it is unavailable while the backup runs.
With the default configuration, etcd creates a snapshot every 10000 changes.
   If you only back up the files under /var/lib/etcd/member/snap, the service does not need to be stopped.
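
A minimal sketch of method 1 on a single-node, kubeadm-installed cluster (the container-name filter and the backup path are assumptions; note that the kubelet will restart the static etcd pod on its own, so keep the copy window short):

# ETCD_ID=$(docker ps --filter name=k8s_etcd --format '{{.ID}}')
# docker stop $ETCD_ID
# cp -r /var/lib/etcd /opt/k8s.bak/var/lib/etcd-$(date +%F)
# docker start $ETCD_ID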
  
2) Back up with the etcdctl client
The etcd database can be snapshotted with the commands below.
# Inspect the etcd cluster:
# Show etcdctl help
# etcdctl help
# etcdctl version
etcdctl version: 3.3.11
API version: 3.3

# List the members:
# etcdctl --endpoints=https://192.168.105.92:2379,https://192.168.105.93:2379,https://192.168.105.94:2379 --cert-file=/etc/kubernetes/pki/etcd/server.crt  --key-file=/etc/kubernetes/pki/etcd/server.key --ca-file=/etc/kubernetes/pki/etcd/ca.crt member list
# etcdctl --endpoints=https://192.101.11.160:2379 --cert-file=/etc/kubernetes/pki/etcd/server.crt  --key-file=/etc/kubernetes/pki/etcd/server.key --ca-file=/etc/kubernetes/pki/etcd/ca.crt member list
330027b50fd1daa4: name=hadoop008 peerURLs=https://192.101.11.160:2380 clientURLs=https://192.101.11.160:2379 isLeader=true

# List the kubernetes keys:
# export ETCDCTL_API=3
# etcdctl get / --prefix --keys-only --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key --cacert=/etc/kubernetes/pki/etcd/ca.crt

# Back up the data:
# ETCDCTL_API=3  etcdctl --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key --cacert=/etc/kubernetes/pki/etcd/ca.crt snapshot save /opt/k8s-backup/data/etcd-snapshot/$(date +%F)-k8s-snapshot.db

Note: because the Kubernetes cluster uses HTTPS, the certificate parameters must be supplied. With ETCDCTL_API=3 these are --cert, --key and --cacert; with the v2 API they are --cert-file, --key-file and --ca-file. All of these files live under /etc/kubernetes/pki/etcd.
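
After saving a snapshot it is worth checking that the file is usable; assuming the same snapshot path as above:

# ETCDCTL_API=3 etcdctl snapshot status /opt/k8s-backup/data/etcd-snapshot/$(date +%F)-k8s-snapshot.db --write-out=table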

3) Use a Kubernetes CronJob for scheduled, automated backups
Using a Kubernetes CronJob for scheduled automated backups requires some adjustments to the image and the startup arguments; the yaml file is as follows:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: etcd-disaster-recovery
  namespace: cronjob
spec:
  schedule: "0 22 * * *"
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            app: etcd-disaster-recovery
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: kubernetes.io/hostname
                    operator: In
                    values:
                    - podc01
          containers:
          - name: etcd
            image: k8s.gcr.io/etcd:3.2.24
            imagePullPolicy: "IfNotPresent"
            command:
            - sh
            - -c
            - "export ETCDCTL_API=3; \
               etcdctl --endpoints=$ENDPOINT \
               --cert=/etc/kubernetes/pki/etcd/server.crt \
               --key=/etc/kubernetes/pki/etcd/server.key \
               --cacert=/etc/kubernetes/pki/etcd/ca.crt \
               snapshot save /snapshot/$(date +%Y%m%d_%H%M%S)_snapshot.db; \
               echo etcd backup success"
            env:
            - name: ENDPOINT
              value: "https://127.0.0.1:2379"
            volumeMounts:
            - mountPath: "/etc/kubernetes/pki/etcd"
              name: etcd-certs
            - mountPath: "/var/lib/etcd"
              name: etcd-data
            - mountPath: "/snapshot"
              name: snapshot
              subPath: data/etcd-snapshot
            - mountPath: /etc/localtime
              name: lt-config
            - mountPath: /etc/timezone
              name: tz-config
          restartPolicy: OnFailure
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
          - name: etcd-data
            hostPath:
              path: /var/lib/etcd
          - name: snapshot
            hostPath:
              path: /home/supermap/k8s-backup
          - name: lt-config
            hostPath:
              path: /etc/localtime
          - name: tz-config
            hostPath:
              path: /etc/timezone
          hostNetwork: true
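
Assuming the manifest above is saved as etcd-backup-cronjob.yaml (the file name is an example; the cronjob namespace is the one declared in the manifest), it can be applied and checked with:

# kubectl create namespace cronjob
# kubectl apply -f etcd-backup-cronjob.yaml
# kubectl get cronjob -n cronjob
# kubectl get job,pod -n cronjob    ## after 22:00, confirm the job ran and the .db file exists on the chosen node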


3. etcd Data Restore

Note: the restore operation will stop all applications and interrupt all access!!!
First stop kube-apiserver on every master machine and make sure it has fully stopped.

# mv /etc/kubernetes/manifests /etc/kubernetes/manifests.bak
# docker ps|grep k8s_  # check whether the etcd and kube-apiserver containers are still up; wait until they have all stopped
# mv /var/lib/etcd /var/lib/etcd.bak

Restore the etcd cluster from the same snapshot:
# Prepare the restore file
cd /tmp
rsync -avz 2018-09-18-k8s-snapshot.db 192.168.105.93:/tmp/
rsync -avz 2018-09-18-k8s-snapshot.db 192.168.105.94:/tmp/

Run on Master1:
cd /tmp/
export ETCDCTL_API=3
etcdctl snapshot restore 2018-09-18-k8s-snapshot.db \
    --endpoints=192.168.105.92:2379 \
    --name=lab1 \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --initial-advertise-peer-urls=https://192.168.105.92:2380 \
    --initial-cluster-token=etcd-cluster-0 \
    --initial-cluster=lab1=https://192.168.105.92:2380,lab2=https://192.168.105.93:2380,lab3=https://192.168.105.94:2380 \
    --data-dir=/var/lib/etcd

Run on Master2:
cd /tmp/
export ETCDCTL_API=3
etcdctl snapshot restore 2018-09-18-k8s-snapshot.db \
    --endpoints=192.168.105.93:2379 \
    --name=lab2 \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --initial-advertise-peer-urls=https://192.168.105.93:2380 \
    --initial-cluster-token=etcd-cluster-0 \
    --initial-cluster=lab1=https://192.168.105.92:2380,lab2=https://192.168.105.93:2380,lab3=https://192.168.105.94:2380 \
    --data-dir=/var/lib/etcd

Run on Master3:
cd /tmp/
export ETCDCTL_API=3
etcdctl snapshot restore 2018-09-18-k8s-snapshot.db \
    --endpoints=192.168.105.94:2379 \
    --name=lab3 \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --initial-advertise-peer-urls=https://192.168.105.94:2380 \
    --initial-cluster-token=etcd-cluster-0 \
    --initial-cluster=lab1=https://192.168.105.92:2380,lab2=https://192.168.105.93:2380,lab3=https://192.168.105.94:2380 \
    --data-dir=/var/lib/etcd

After all three restores are complete, restore the manifests on the three master machines.
mv /etc/kubernetes/manifests.bak /etc/kubernetes/manifests
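
Before checking the keys it can help to confirm that the restored etcd members are healthy; a sketch using the same endpoints and certificates as above:

ETCDCTL_API=3 etcdctl endpoint health \
    --endpoints=https://192.168.105.92:2379,https://192.168.105.93:2379,https://192.168.105.94:2379 \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt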

Final check:
# Check the keys again
# etcdctl get / --prefix --keys-only --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key --cacert=/etc/kubernetes/pki/etcd/ca.crt
registry/apiextensions.k8s.io/customresourcedefinitions/apprepositories.kubeapps.com

/registry/apiregistration.k8s.io/apiservices/v1.

/registry/apiregistration.k8s.io/apiservices/v1.apps

/registry/apiregistration.k8s.io/apiservices/v1.authentication.k8s.io

           ........ output omitted ..........

# kubectl get pod -n kube-system
NAME                                              READY     STATUS    RESTARTS   AGE
coredns-777d78ff6f-m5chm                          1/1       Running   1          18h
coredns-777d78ff6f-xm7q8                          1/1       Running   1          18h
dashboard-kubernetes-dashboard-7cfc6c7bf5-hr96q   1/1       Running   0          13h
dashboard-kubernetes-dashboard-7cfc6c7bf5-x9p7j   1/1       Running   0          13h
etcd-lab1                                         1/1       Running   0          18h
etcd-lab2                                         1/1       Running   0          1m
etcd-lab3                                         1/1       Running   0          18h
kube-apiserver-lab1                               1/1       Running   0          18h
kube-apiserver-lab2                               1/1       Running   0          1m
kube-apiserver-lab3                               1/1       Running   0          18h
kube-controller-manager-lab1                      1/1       Running   0          18h
kube-controller-manager-lab2                      1/1       Running   0          1m
kube-controller-manager-lab3                      1/1       Running   0          18h
kube-flannel-ds-7w6rl                             1/1       Running   2          18h
kube-flannel-ds-b9pkf                             1/1       Running   2          18h
kube-flannel-ds-fck8t                             1/1       Running   1          18h
kube-flannel-ds-kklxs                             1/1       Running   1          18h
kube-flannel-ds-lxxx9                             1/1       Running   2          18h
kube-flannel-ds-q7lpg                             1/1       Running   1          18h
kube-flannel-ds-tlqqn                             1/1       Running   1          18h
kube-proxy-85j7g                                  1/1       Running   1          18h
kube-proxy-gdvkk                                  1/1       Running   1          18h
kube-proxy-jw5gh                                  1/1       Running   1          18h
kube-proxy-pgfxf                                  1/1       Running   1          18h
kube-proxy-qx62g                                  1/1       Running   1          18h
kube-proxy-rlbdb                                  1/1       Running   1          18h
kube-proxy-whhcv                                  1/1       Running   1          18h
kube-scheduler-lab1                               1/1       Running   0          18h
kube-scheduler-lab2                               1/1       Running   0          1m
kube-scheduler-lab3                               1/1       Running   0          18h
kubernetes-dashboard-754f4d5f69-7npk5             1/1       Running   0          13h
kubernetes-dashboard-754f4d5f69-whtg9             1/1       Running   0          13h

Summary:
Whether Kubernetes was installed from binaries or with kubeadm, backup is essentially done through the etcd backup. For restore, the key is the order of operations: stop kube-apiserver, stop etcd, restore the data, start etcd, start kube-apiserver.

4. Backup and Restore of the Master Node Control Components
Generally speaking, if a master node needs to be restored from backup then, apart from accidental operations or deletions, it is very likely that the whole machine has failed, so the etcd data will probably need to be restored at the same time.

There is one precondition for the restore: on the machine being restored, the hostname and IP address must be exactly the same as they were before the crash, because this configuration is written into the etcd data store.

A. Master node data backup
The master node backup consists of three parts:

1. All files under /etc/kubernetes/ (certificates, manifest files)

2. The .kube/config file in the user's home directory (kubectl authentication)

3. All files under /var/lib/kubelet/ (plugin/container authentication)

B. Master node component restore
    The master node components can be restored with the following steps (a sketch of steps 2-6 follows the list):

        1. Do a clean reinstall using the previous installation script (kubeadm reset, iptables -X, ...)

        2. Stop the system service: systemctl stop kubelet.service

        3. Delete the add-on containers (coredns, flannel, dashboard).

        4. Restore the etcd data (see the etcd restore section above).

        5. Restore the three backed-up directories one by one.

        6. Restart the system service: systemctl start kubelet.service

        7. Wait a moment, and verify once all components have started successfully.
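
A minimal sketch of steps 2 through 6 on a single-master cluster, assuming the directories were backed up to /opt/k8s.bak as in section 1 (the paths and the container-name filter are examples):

# systemctl stop kubelet.service
# docker rm -f $(docker ps -aq --filter name=k8s_)        ## remove leftover add-on / control-plane containers
# cp -r /opt/k8s.bak/etc/kubernetes/*  /etc/kubernetes/
# cp -r /opt/k8s.bak/var/lib/kubelet/* /var/lib/kubelet/
# cp -r /opt/k8s.bak/var/lib/etcd/*    /var/lib/etcd/      ## single-node etcd only; otherwise use etcdctl snapshot restore
# systemctl start kubelet.service
# kubectl get nodes && kubectl get pod -n kube-system      ## verify once the components are back up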

5. Master Node Upgrade

Note: this upgrade goes from v1.12.1 to v1.16.3. Keep in mind that kubeadm only supports upgrading one minor version at a time, so in practice a jump like this has to be stepped through the intermediate minor versions with the same procedure.

1. Upgrade kubeadm on the master node
# Check the available package versions
yum list --showduplicates kubeadm --disableexcludes=kubernetes
# Install the target version of kubeadm
yum install -y kubeadm-1.16.3-0 --disableexcludes=kubernetes  ### on all nodes

# Check the kubeadm version
kubeadm version
# Check the upgrade plan
kubeadm upgrade plan  ### shows whether the cluster can be upgraded and the component versions after the upgrade
# Use the upgrade plan and the version output to confirm that kubeadm was upgraded successfully

Check the required images and pull them with a script (a sketch of pull_images.sh follows)
# List the required images
kubeadm config images list
# Edit the script according to the image names listed above, then use it to pull the images
# Run the script to pull the images
./pull_images.sh
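
pull_images.sh is not included in the original article; a minimal sketch, assuming the images are pulled from the registry.aliyuncs.com/google_containers mirror and retagged as k8s.gcr.io/* (replace the image list with the actual output of kubeadm config images list):

#!/bin/bash
# Pull the v1.16.3 control-plane images from a mirror registry and retag them as k8s.gcr.io/*.
MIRROR=registry.aliyuncs.com/google_containers
images=(
  kube-apiserver:v1.16.3
  kube-controller-manager:v1.16.3
  kube-scheduler:v1.16.3
  kube-proxy:v1.16.3
  pause:3.1
  etcd:3.3.15-0
  coredns:1.6.2
)
for img in "${images[@]}"; do
  docker pull "$MIRROR/$img"
  docker tag  "$MIRROR/$img" "k8s.gcr.io/$img"
  docker rmi  "$MIRROR/$img"
done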

2. Upgrade the master node
# kubeadm upgrade apply v1.16.3

yum install -y kubelet-1.16.3-0 kubectl-1.16.3-0 --disableexcludes=kubernetes

sudo systemctl restart kubelet
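
After the kubelet restarts, the master node should report the new version:

kubectl get nodes   ## the master should now show VERSION v1.16.3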

6. Worker Node Upgrade
# First pull the required images on the node, mainly kube-proxy and coredns

# Upgrade kubeadm
yum install -y kubeadm-1.16.3-0 --disableexcludes=kubernetes
# Drain the node: evict its workloads and mark it unschedulable (run on the master, or wherever kubectl is configured)
# If the master node also runs pods as a worker, run the following command to evict those pods and put the node into maintenance mode (unschedulable).
# Replace cp-node-name with the master node name
kubectl drain cp-node-name --ignore-daemonsets

# Replace $NODE with the name of the worker node being upgraded
kubectl drain $NODE --ignore-daemonsets
# Upgrade the kubelet configuration and packages (run on the worker node)
sudo kubeadm upgrade node
yum install -y kubelet-1.16.3-0 kubectl-1.16.3-0 --disableexcludes=kubernetes
sudo systemctl restart kubelet
# Allow the node to be scheduled again
kubectl uncordon $NODE

Verify that the upgrade succeeded:
kubectl get nodes

Note: check the kubelet logs for any problems: journalctl -f -u kubelet










