Troubleshooting notes: errors hit while setting up docker + rancher

程序员文章站 2022-03-01 17:17:08

Error notes from building k8s with docker + rancher

CentOS 7
docker v1.20.x
rancher v2.3.5

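Internal network environment with no access to the Internet.

PS: Nexus3 serves as the docker image registry proxy; the installation and configuration of the Nexus3 proxy are covered in a separate post.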

ETCD cannot be created

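With no Internet access, docker image pulls fail frequently. After rancher started normally, I logged in to the web UI and began creating a k8s cluster, only to hit an error: etcd could not be created.
The rancher container log showed the following: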

2022/02/14 13:34:14 [WARNING] Failed to create Docker container [etcd] on host [192.168.1.1]: Error response from daemon: No such image: rancher/coreos-etcd:v3.4.3-rancher1
2022/02/14 13:34:14 [ERROR] cluster [c-rc4nk] provisioning: [etcd] Failed to bring up Etcd Plane: Failed to create [etcd] container on host [192.168.1.1]: Failed to create Docker container [etcd] on host [192.168.1.1]: <nil>
2022/02/14 13:34:14 [INFO] kontainerdriver rancherkubernetesengine stopped
2022/02/14 13:34:14 [ERROR] ClusterController c-rc4nk [cluster-provisioner-controller] failed with : [etcd] Failed to bring up Etcd Plane: Failed to create [etcd] container on host [192.168.1.1]: Failed to create Docker container [etcd] on host [192.168.1.1]: <nil>
2022/02/14 13:37:44 [ERROR] Error parsing max age Error parsing auth refresh max age: time: invalid duration s

The docker image rancher/coreos-etcd:v3.4.3-rancher1 could not be pulled. With no alternative, I pulled the image from somewhere else and then pushed it to Nexus.
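The "pull elsewhere, then push to Nexus" step can be sketched roughly as below. This is only an illustration under assumptions: the registry address nexus.example.internal:8082 and the function names are hypothetical, the target Nexus repository is assumed to accept pushes, and docker login against it is assumed to have been done already.

```shell
# Hypothetical registry address; replace with the real Nexus3 docker endpoint.
REGISTRY="${REGISTRY:-nexus.example.internal:8082}"

# Build the registry-qualified name for an image reference.
mirror_ref() {
  printf '%s/%s\n' "$1" "$2"
}

# On a machine WITH Internet access: pull the image and export it to a tar file.
export_image() {
  image=$1
  tarball=$2
  docker pull "$image" && docker save -o "$tarball" "$image"
}

# On a machine that can reach the internal registry: load, retag, push.
import_image() {
  image=$1
  tarball=$2
  target=$(mirror_ref "$REGISTRY" "$image")
  docker load -i "$tarball" &&
    docker tag "$image" "$target" &&
    docker push "$target"
}

# Example (not executed here):
#   export_image rancher/coreos-etcd:v3.4.3-rancher1 etcd.tar
#   # copy etcd.tar into the internal network, then:
#   import_image rancher/coreos-etcd:v3.4.3-rancher1 etcd.tar
```

The tar file is used as the transfer medium because the cluster hosts cannot reach the Internet; if a machine can see both networks, the save/load pair can be skipped.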

[etcd] Failed to bring up Etcd Plane

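etcd fails to start. This is a classic problem with plenty of tutorials online; the fix is to wipe everything clean again and restart the docker service.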

[etcd] Failed to bring up Etcd Plane: etcd cluster is unhealthy: hosts [192.168.154.231] failed to report healthy. Check etcd container logs on each host for more information

The cleanup commands are as follows:

# Stop every container on the host
docker stop $(docker ps -aq)
# Caution: this removes all (now stopped) containers
docker system prune -f
# Caution: this removes all volumes
docker volume rm $(docker volume ls -q)
# Caution: this removes all images
docker image rm $(docker image ls -q)
# Remove leftover cluster state from the host
rm -rf /etc/ceph \
       /etc/cni \
       /etc/kubernetes \
       /opt/cni \
       /opt/rke \
       /run/secrets/kubernetes.io \
       /run/calico \
       /run/flannel \
       /var/lib/calico \
       /var/lib/etcd \
       /var/lib/cni \
       /var/lib/kubelet \
       /var/lib/rancher/rke/log \
       /var/log/containers \
       /var/log/pods
# Restart the docker service
systemctl restart docker

After repeated restarts I thought victory was finally in sight, but the worst was still to come.

Failed to get job complete status for job rke-network-plugin-deploy-job in namespace kube-system

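I searched everywhere for a solution to this one and spent a long time on it; every attempt failed.
Eventually I found a fairly reliable approach.

Check the kubelet container logs: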

docker logs kubelet

E0215 14:34:53.376690   25851 kubelet.go:2183] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
W0215 14:34:54.249077   25851 cni.go:237] Unable to update cni config: no networks found in /etc/cni/net.d
E0215 14:34:55.238592   25851 aws_credentials.go:77] while getting AWS credentials NoCredentialProviders: no valid providers in chain. Deprecated.
	For verbose messaging see aws.Config.CredentialsChainVerboseErrors
I0215 14:34:55.238747   25851 provider.go:98] Refreshing cache for provider: *credentialprovider.defaultDockerConfigProvider
I0215 14:34:55.788537   25851 kube_docker_client.go:345] Stop pulling image "rancher/pause:3.1": "7675586df687: Downloading "
E0215 14:34:55.788569   25851 remote_runtime.go:105] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed pulling image "rancher/pause:3.1": Get "https://registry-1.docker.io/v2/": dial tcp: lookup registry-1.docker.io on [::1]:53: read udp [::1]:52359->[::1]:53: read: connection refused
E0215 14:34:55.788592   25851 kuberuntime_sandbox.go:68] CreatePodSandbox for pod "rke-network-plugin-deploy-job-lpm7j_kube-system(36f21b3d-16b8-4ca6-9b62-8f96be849d6c)" failed: rpc error: code = Unknown desc = failed pulling image "rancher/pause:3.1": Get "https://registry-1.docker.io/v2/": dial tcp: lookup registry-1.docker.io on [::1]:53: read udp [::1]:52359->[::1]:53: read: connection refused
E0215 14:34:55.788601   25851 kuberuntime_manager.go:729] createPodSandbox for pod "rke-network-plugin-deploy-job-lpm7j_kube-system(36f21b3d-16b8-4ca6-9b62-8f96be849d6c)" failed: rpc error: code = Unknown desc = failed pulling image "rancher/pause:3.1": Get "https://registry-1.docker.io/v2/": dial tcp: lookup registry-1.docker.io on [::1]:53: read udp [::1]:52359->[::1]:53: read: connection refused
E0215 14:34:55.788624   25851 pod_workers.go:191] Error syncing pod 36f21b3d-16b8-4ca6-9b62-8f96be849d6c ("rke-network-plugin-deploy-job-lpm7j_kube-system(36f21b3d-16b8-4ca6-9b62-8f96be849d6c)"), skipping: failed to "CreatePodSandbox" for "rke-network-plugin-deploy-job-lpm7j_kube-system(36f21b3d-16b8-4ca6-9b62-8f96be849d6c)" with CreatePodSandboxError: "CreatePodSandbox for pod \"rke-network-plugin-deploy-job-lpm7j_kube-system(36f21b3d-16b8-4ca6-9b62-8f96be849d6c)\" failed: rpc error: code = Unknown desc = failed pulling image \"rancher/pause:3.1\": Get \"https://registry-1.docker.io/v2/\": dial tcp: lookup registry-1.docker.io on [::1]:53: read udp [::1]:52359->[::1]:53: read: connection refused"
E0215 14:34:58.384759   25851 kubelet.go:2183] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
W0215 14:34:58.917882   25851 container.go:412] Failed to create summary reader for "/docker/24b652688425c7ce066f2b48c152e3517cefc5b6e6ddd6c678d0f3690ce85343": none of the resources are being tracked.
W0215 14:34:59.249222   25851 cni.go:237] Unable to update cni config: no networks found in /etc/cni/net.d
W0215 14:35:01.876641   25851 container.go:412] Failed to create summary reader for "/docker/6c119f3554c2826383b27ada71887a6a7b91549440a2fc908a991fb6e99cca83": none of the resources are being tracked.
E0215 14:35:03.389468   25851 kubelet.go:2183] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
W0215 14:35:03.878133   25851 container.go:412] Failed to create summary reader for "/docker/f1caf907cd832aa331d4421c745891abfd0ab0b4aa250be963797ddb49ede164": none of the resources are being tracked.
W0215 14:35:04.249318   25851 cni.go:237] Unable to update cni config: no networks found in /etc/cni/net.d
W0215 14:35:06.514104   25851 container.go:412] Failed to create summary reader for "/docker/0a2415cd7536b2c252698e4c59ff5d2fb0b8f5888d11c4ccfb189082c850178e": none of the resources are being tracked.
E0215 14:35:08.395996   25851 kubelet.go:2183] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
I0215 14:35:09.073937   25851 container_manager_linux.go:469] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
W0215 14:35:09.213734   25851 container.go:412] Failed to create summary reader for "/docker/7b9a4127f215057a403ff4747f5ee82e99bf58fe3db4da468573f987087cc73c": none of the resources are being tracked.
W0215 14:35:09.249559   25851 cni.go:237] Unable to update cni config: no networks found in /etc/cni/net.d
I0215 14:35:11.008180   25851 kuberuntime_manager.go:424] No sandbox for pod "rke-network-plugin-deploy-job-lpm7j_kube-system(36f21b3d-16b8-4ca6-9b62-8f96be849d6c)" can be found. Need to start a new one
W0215 14:35:11.676849   25851 container.go:412] Failed to create summary reader for "/docker/49aa66c13dd22d740981bfd6cea750b674c5bcf6dc2b1486095723bf661439c0": none of the resources are being tracked.
E0215 14:35:13.401485   25851 kubelet.go:2183] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
W0215 14:35:14.249673   25851 cni.go:237] Unable to update cni config: no networks found in /etc/cni/net.d
W0215 14:35:14.609182   25851 container.go:412] Failed to create summary reader for "/docker/0c05b25ff9b0f72ccbd8be4dd7767aac3c56cb279521d91a3ac4f703473227aa": none of the resources are being tracked.

I took the key error to be:

Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

I tried the following.

Add a configuration file:

mkdir -p /etc/cni/net.d
vi /etc/cni/net.d/10-flannel.conflist

The cniVersion below is the latest flannel-cni version found in rancher's official docker repository.

{
   "name": "cbr0",
   "cniVersion": "0.3.0",
   "plugins": [
       {
           "type": "flannel",
           "delegate": {
               "hairpinMode": true,
               "isDefaultGateway": true
           }
       },
       {
           "type": "portmap",
           "capabilities": {
               "portMappings": true
           }
       }
   ]
}

After a restart, the rancher web UI still reported Failed to get job complete status for job rke-network-plugin-deploy-job in namespace kube-system

However, checking docker logs kubelet again revealed something new:


E0215 15:15:52.939226   25851 pod_workers.go:191] Error syncing pod ea4ac90a-352a-4edc-8a4b-347f5ce5405c ("rke-network-plugin-deploy-job-jdt6t_kube-system(ea4ac90a-352a-4edc-8a4b-347f5ce5405c)"), skipping: failed to "CreatePodSandbox" for "rke-network-plugin-deploy-job-jdt6t_kube-system(ea4ac90a-352a-4edc-8a4b-347f5ce5405c)" with CreatePodSandboxError: "CreatePodSandbox for pod \"rke-network-plugin-deploy-job-jdt6t_kube-system(ea4ac90a-352a-4edc-8a4b-347f5ce5405c)\" failed: rpc error: code = Unknown desc = failed pulling image \"rancher/pause:3.1\": Get \"https://registry-1.docker.io/v2/\": dial tcp: lookup registry-1.docker.io on [::1]:53: read udp [::1]:46815->[::1]:53: read: connection refused"
W0215 15:15:54.359391   25851 cni.go:202] Error validating CNI config list {
   "name": "cbr0",
   "cniVersion": "0.3.0",
   "plugins": [
       {
           "type": "flannel",
           "delegate": {
               "hairpinMode": true,
               "isDefaultGateway": true
           }
       },
       {
           "type": "portmap",
           "capabilities": {
               "portMappings": true
           }
       }
   ]
}
: [failed to find plugin "flannel" in path [/opt/cni/bin] failed to find plugin "portmap" in path [/opt/cni/bin]]
W0215 15:15:54.359428   25851 cni.go:237] Unable to update cni config: no valid networks found in /etc/cni/net.d

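This was probably caused by the docker cleanup commands run earlier: /opt/cni/bin contained no programs and had not been recreated. I then copied the files under /opt/cni/bin from another container of the same image into /opt/cni/bin on the host.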
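A quick way to confirm which plugins are missing before restarting is sketched below. It is a rough check, not a real JSON parser: it assumes each plugin appears as a "type": "name" pair in the conflist, and the function name missing_cni_plugins is hypothetical.

```shell
# Print every plugin named in a CNI conflist that has no executable
# of the same name in the plugin directory (default /opt/cni/bin).
missing_cni_plugins() {
  conf=$1
  bin_dir=${2:-/opt/cni/bin}
  # Extract the value of every "type": "..." pair with plain text tools.
  grep -oE '"type"[[:space:]]*:[[:space:]]*"[^"]+"' "$conf" |
    sed -E 's/.*"([^"]+)"$/\1/' |
    sort -u |
    while read -r plugin; do
      [ -x "$bin_dir/$plugin" ] || echo "$plugin"
    done
}

# Example (not executed here):
#   missing_cni_plugins /etc/cni/net.d/10-flannel.conflist
```

An empty result means every plugin referenced by the file has a matching executable in the directory.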

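After another restart the UI still showed the error. Checking docker logs kubelet turned up something new yet again, and I finally narrowed it down to the error below: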

E0215 15:47:53.338808    6762 remote_runtime.go:105] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed pulling image "rancher/pause:3.1": Get "https://registry-1.docker.io/v2/": dial tcp: lookup registry-1.docker.io on [::1]:53: read udp [::1]:33648->[::1]:53: read: connection refused

rancher/pause:3.1 could not be pulled. Once this image pull problem was solved, everything finally ran normally.
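Pull failures like this one are easy to miss among the CNI warnings. A minimal sketch for listing the images kubelet could not pull, assuming the log lines keep the "failed pulling image" wording shown above (the function name failed_images is hypothetical):

```shell
# Read log text on stdin and print each image named in a
# 'failed pulling image "..."' message, once per image.
# Handles both plain quotes and the escaped \" form seen in nested messages.
failed_images() {
  grep -oE 'failed pulling image \\?"[^"\\]+' |
    sed -E 's/^failed pulling image \\?"//' |
    sort -u
}

# Example (not executed here):
#   docker logs kubelet 2>&1 | failed_images
```

Each image printed this way is a candidate to load into the internal registry before retrying the cluster creation.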

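Make sure the server environment is clean: data left over from earlier attempts will affect the cluster.

  • The kubelet container mounts the /etc/cni and /opt/cni directories
  • etcd mounts the /var/lib/etcd directory

Finally, a few references: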
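Before recreating a cluster, leftover state can be checked with something like the sketch below. The path list is a subset of the directories named in these notes, and both the function name leftover_paths and its optional root-prefix argument (used only to try the function against a scratch directory) are hypothetical.

```shell
# Print each known state directory that still exists on the host.
# An optional first argument is prepended to every path.
leftover_paths() {
  root=${1:-}
  for p in /etc/cni /etc/kubernetes /opt/cni /var/lib/etcd /var/lib/kubelet; do
    if [ -e "$root$p" ]; then
      echo "$p"
    fi
  done
}

# Example (not executed here):
#   leftover_paths
```

No output means none of the listed directories is present. Note that /opt/cni/bin must hold the plugin binaries again before kubelet can bring the network up.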

https://www.jianshu.com/p/b6a5a233f117
https://www.bookstack.cn/read/rancher-v2.x/db0dcb78c29a3817.md
https://kubernetes.io/docs/tasks/tools/
https://docs.rancher.cn/