Traefik中诡异的502和504问题

程序员文章站 2022-04-18 18:58:08

...

我们都知道在 Kubernetes 集群中通常会使用 Ingress 方案来统一代理集群内部的流量，而常用的 Ingress 方案为 traefik 和 nginx，和传统的 Nginx 作为企业内部的反向代理以及负载设备一样，在不同的场景下可能需要专有的配置才能满足需求，否则会出现奇奇怪怪的异常状态码。本篇文章一起来看下，我们在 traefik 中遇到的 500 和 502 异常。

一、前言

在开始之前，我们先来看几个在 Nginx 作为反向代理工具中经常遇到的几个异常状态码：

499: 客户端主动断开连接。通常为一次请求未在客户端指定时间内返回，客户端主动关闭连接，客户端无数据返回 (在 Nginx 中会记录 499)。一般是客户端超时
500: 服务器内部错误。服务器遇到未知错误导致无法完成请求处理，通常由于后端业务逻辑异常导致 (本身 Bug)
502: 网关错误。通常为网关未从上游服务中获取期望的响应 (上游未返回数据或未按照协议约定返回数据)，网关感觉自己没用了，返回了网关错误。一般是后端服务器宕机或业务逻辑超时
504: 网关超时。表示网关没有及时从上游获取响应数据。一般是 Nginx 网关作为客户端去向上游服务请求响应的过程中，Nginx 网关超时导致，但此时对于上游服务器来将，它会继续执行，直到结束。(Nginx网关作为客户端时的超时)

# 499 的实际情况就是，客户端指定超时时间为N秒，但是该请求在服务端实际需要执行M秒(M>N秒)，客户端等的不耐烦了就关闭了
# 对于499状态来讲，解决方式是优化后端代码逻辑或者修改nginx参数
$ cat nginx.conf
proxy_ignore_client_abort    on;
$ curl -i -m 3 http://127.0.0.1/hello.php


# 502的实际情况通常是Nginx网关后端的服务器直接宕机了(所以就拿不到上游的响应了)
# 当然也有可能是上游服务器真正的执行逻辑超过了上游服务器的超时时间限制(比如php-fpm.conf设置request_terminate_timeout5s，但是实际的业务逻辑需要7s才能完成)，此时上游服务器端出现`业务逻辑超时`，给Nginx网关返回了异常的数据造成的
# 502时后端的几种错误日志
 recv() failed (104: Connection reset by peer) while reading response header from upstream
 upstream prematurely closed connection while reading response header from upstream
 connect() failed (111: Connection refused) while connecting to upstream
# 整体来说502出现的问题通常是因为后端挂了，或者因为后端负载太高，暂时不可响应
# 可以在nginx侧增加proxy_read_timeout来暂时缓解
$ cat nginx.conf
proxy_read_timeout  20s;

# 504的实际情况就是客户端->Nginx->Backend，在过程中Nginx需要作为客户端访问Backend服务，但是在Backend还没用执行完成时，Nginx首先超过了自己的客户端超时时间，此时就会出现504的异常(但是对于客户端来说返回什么呢?)
# 对于504场景而言，通常的做法就是优化Backend的逻辑，适当减少执行时间；另外也可以适当的增加Nginx作为客户端时的超时时间
# 要知道，当Nginx作为客户端时，是以一个Proxy的角色存在的，配置如下参数即可
$ cat nginx.conf
uwsgi_connect_timeout 5;
uwsgi_send_timeout 5;
uwsgi_read_timeout 5;
fastcgi_read_timeout 5;
fastcgi_send_timeout 5;
proxy_connect_timeout      90;
proxy_send_timeout         90;
proxy_read_timeout         90;

二、traefik 中诡异的 500 和 502

traefik 在 Kubernetes 集群中的部署配置

我们当前集群的 traefik 的配置如下:

# traefik的configmap配置文件
$ cat traefik-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: traefik-config
  namespace: kube-system
data:
  traefik.toml: |
    defaultEntryPoints = ["http","https"]
    debug = false
    logLevel = "INFO"

    InsecureSkipVerify = true
    [entryPoints]
      [entryPoints.http]
      address = ":80"
      compress = true
      [entryPoints.https]
      address = ":443"
        [entryPoints.https.tls]
    [web]
      address = ":8080"
    [kubernetes]
    [metrics]
      [metrics.prometheus]
      buckets=[0.1,0.3,1.2,5.0]
      entryPoint = "traefik"
    [ping]
    entryPoint = "http"

# traefik的DaemonSet配置
$ cat traefik-ds-v1.7.16.yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: traefik-ingress-controller
  namespace: kube-system
---
kind: DaemonSet
apiVersion: extensions/v1beta1
metadata:
  name: traefik-ingress-controller
  namespace: kube-system
  labels:
    k8s-app: traefik-ingress-lb
spec:
  template:
    metadata:
      labels:
        k8s-app: traefik-ingress-lb
        name: traefik-ingress-lb
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-role.kubernetes.io/master
                operator: DoesNotExist
      serviceAccountName: traefik-ingress-controller
      terminationGracePeriodSeconds: 30
      hostNetwork: true
      containers:
      - image: traefik:v1.7.16
        name: traefik-ingress-lb
        ports:
        - name: http
          containerPort: 80
          hostPort: 80
        - name: admin
          containerPort: 8080
        securityContext:
          capabilities:
            drop:
            - ALL
            add:
            - NET_BIND_SERVICE
        args:
        - --api
        - --kubernetes
        - --logLevel=INFO
        - --traefikLog.filePath=/logdata/traefik.log
        - --configfile=/config/traefik.toml
        - --accesslog.filepath=/logdata/access.log
        - --accesslog.bufferingsize=100
        volumeMounts:
        - mountPath: /config
          name: config
        - mountPath: /logdata
          name: access-log
      volumes:
      - configMap:
          name: traefik-config
        name: config
      - name: access-log
        hostPath:
          path: /opt/logs/ingress/
---
kind: Service
apiVersion: v1
metadata:
  name: traefik-ingress-service
  namespace: kube-system
  labels:
    k8s-app: traefik-ingress-lb
spec:
  selector:
    k8s-app: traefik-ingress-lb
  ports:
    - protocol: TCP
      port: 80
      name: web
    - protocol: TCP
      port: 8080
      name: admin

Python 的对外 API 接口

# 接口对外的ingress
$ kubectl  get ingress -n s-data
NAME                     HOSTS                    ADDRESS   PORTS   AGE
data-api.bgbiao.cn   data-api.bgbiao.cn             80      236d
ops.bgbiao.cn       ops.bgbiao.cn                 80      236d

# 测试对外接口
$ curl data-api.bgbiao.cn  -i
HTTP/1.1 401 Unauthorized
Access-Control-Allow-Headers: Content-Type, X-TOKEN
Access-Control-Allow-Origin: *
Content-Length: 58
Content-Type: application/json
Vary: Accept-Encoding
Date: Sun, 28 Jun 2020 14:55:00 GMT

# 接口需要登录，那么我们对登录接口进行压测来模拟问题
$ curl -X POST  --data '@/root/login.json' -H 'Content-type:application/json' http://data-api.bgbiao.cn/account/users/login/   -i
HTTP/1.1 200 OK
Access-Control-Allow-Headers: Content-Type, X-TOKEN
Access-Control-Allow-Origin: *
Content-Length: 250
Content-Type: application/json
Vary: Accept-Encoding
Date: Sun, 28 Jun 2020 14:56:33 GMT

诡异的 500 和 502

在服务部署完成后，一切皆正常，但是简单压测后发现服务出现部分请求失败。

# 使用ab工具进行压测
# 由压测结果可以发现，20个并发共压测200个请求，期间出现了7次失败请求
$ ab -c 20 -n 200   -T 'application/json' -p /root/login.json http://data-api.bgbiao.cn/account/users/login/
...
Benchmarking data-api.bgbiao.cn (be patient)
Completed 100 requests
Completed 200 requests
Finished 200 requests


Server Software:
Server Hostname:        data-api.bgbiao.cn
Server Port:            80

Document Path:          /account/users/login/
Document Length:        250 bytes

Concurrency Level:      20
Time taken for tests:   1.340 seconds
Complete requests:      200
Failed requests:        7
   (Connect: 0, Receive: 0, Length: 7, Exceptions: 0)
Write errors:           0
Non-2xx responses:      7
Total transferred:      91371 bytes
Total body sent:        46400
HTML transferred:       48387 bytes
Requests per second:    149.21 [#/sec] (mean)
Time per request:       134.035 [ms] (mean)
Time per request:       6.702 [ms] (mean, across all concurrent requests)
Transfer rate:          66.57 [Kbytes/sec] received
                        33.81 kb/s sent
                        100.38 kb/s total

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        1    1   0.1      1       1
Processing:     2  116  27.8    114     179
Waiting:        2  116  27.8    114     179
Total:          3  117  27.8    116     180

Percentage of the requests served within a certain time (ms)
  50%    116
  66%    121
  75%    125
  80%    129
  90%    154
  95%    167
  98%    173
  99%    175
 100%    180 (longest request)


# 将压测结果保存到文本中进行简单分析
# 简单分析在200个压测请求中，有4个请求失败，分别为2次500错误和2次502错误
$ ab -v 10 -c 20 -n 200   -T 'application/json' -p /root/login.json http://data-api.bgbiao.cn/account/users/login/  > ab-log.txt
$ cat ab-log.txt | grep HTTP | sort| uniq -c
    196 HTTP/1.0 200 OK
      2 HTTP/1.0 500 Internal Server Error
      2 HTTP/1.0 502 Bad Gateway
      1 POST /account/users/login/ HTTP/1.0

traefik 中 500 和 502 问题排查

前面我们提到了在 Nginx 的场景中的 500 和 502 状态码的原因以及相关的解决方式，那么在 Kubernetes 集群中，traefik 起到的作用和 Nginx 的作用也是相似的。

在开始的时候，我们提到了集群内的 traefik 的配置信息，对于 SRE 来讲，任何生产的服务都必须有相关的可观测性数据，因此，我们也默认将 traefik 的访问日志和进程日志进行了持久化 (分别对应 access.log和traefik.log)，同时也暴露了 traefik 的 prometheus 的 metrics 接口。

对于上面的压测请求，我们在访问日志里抓到了如下的异常日志:

$ tail -f access.log | grep data-api.bgbiao.cn | grep -v 'HTTP/1.0" 200'
192.168.0.23 - - [28/Jun/2020:14:57:38 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122267376 "data-api.bgbiao.cn/" "http://20.0.41.8:8080" 0ms
192.168.0.23 - - [28/Jun/2020:14:57:38 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122267385 "data-api.bgbiao.cn/" "http://20.0.26.9:8080" 1ms
192.168.0.23 - - [28/Jun/2020:14:57:38 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122267410 "data-api.bgbiao.cn/" "http://20.0.41.8:8080" 1ms
192.168.0.23 - - [28/Jun/2020:14:57:38 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122267418 "data-api.bgbiao.cn/" "http://20.0.41.8:8080" 1ms
192.168.0.23 - - [28/Jun/2020:14:57:38 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122267484 "data-api.bgbiao.cn/" "http://20.0.26.9:8080" 1ms
192.168.0.23 - - [28/Jun/2020:14:57:38 +0000] "POST /account/users/login/ HTTP/1.0" 502 11 "-" "ApacheBench/2.3" 122267518 "data-api.bgbiao.cn/" "http://20.0.26.9:8080" 1ms
192.168.0.23 - - [28/Jun/2020:14:57:39 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122267550 "data-api.bgbiao.cn/" "http://20.0.26.9:8080" 4ms
192.168.0.23 - - [28/Jun/2020:15:02:06 +0000] "POST /account/users/login/ HTTP/1.0" 502 11 "-" "ApacheBench/2.3" 122272696 "data-api.bgbiao.cn/" "http://20.0.41.8:8080" 2ms
192.168.0.23 - - [28/Jun/2020:15:02:06 +0000] "POST /account/users/login/ HTTP/1.0" 502 11 "-" "ApacheBench/2.3" 122272711 "data-api.bgbiao.cn/" "http://20.0.41.8:8080" 1ms
192.168.0.23 - - [28/Jun/2020:15:02:06 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122272836 "data-api.bgbiao.cn/" "http://20.0.26.9:8080" 0ms
192.168.0.23 - - [28/Jun/2020:15:02:06 +0000] "POST /account/users/login/ HTTP/1.0" 500 21 "-" "ApacheBench/2.3" 122272837 "data-api.bgbiao.cn/" "http://20.0.41.8:8080" 0ms

可以看到，和我们压测结果里失败请求的状态码相同，都是 500 和 502，虽然数量上不止 4 个，但这暂时不太重要。

通常对于大多数人来讲，在代理层看到 500 或者 502 都会下意识的想，肯定是上游服务的问题，不过这种猜测也能很快进行排除，排除的方法基本如下:

在压测出现 500 和 502 期间，持续去访问上游服务器
使用同样的压测参数直接压上游服务器
将上游服务部署在独立的 ECS 上并使用同样参数进行压测

经过上述三种方式的测试后，我们基本排除了上游服务的问题，因此正式怀疑 traefik 本身是否有性能或者其他参数上的问题。

修改 traefki 中的日志级别为 DEBUG:

$ cat traefik-ds-v1.7.6.yaml
....
        args:
        - --api
        - --kubernetes
        - --logLevel=DEBUG
....

然后在日志中看到如下相关信息:

# 500相关的日志
time="2020-06-28T15:35:05Z" level=debug msg="'500 Internal Server Error' caused by: http: server closed idle connection"
time="2020-06-28T15:35:05Z" level=debug msg="vulcand/oxy/forward/http: Round trip: http://20.0.26.9:8080, code: 500, Length: 21, duration: 1.486276ms"

# 502相关的日志
time="2020-06-28T15:35:05Z" level=debug msg="'502 Bad Gateway' caused by: EOF"
time="2020-06-28T15:35:05Z" level=debug msg="vulcand/oxy/forward/http: Round trip: http://20.0.26.9:8080, code: 502, Length: 11, duration: 1.530677ms"

上面的 500 信息，大概可以看出来是 traefik 的服务进程主动关闭了空闲链接导致的，而下面的 502 是因为 EOF，感觉像是没有读取完响应数据，就被断开了，导致 traefik 返回 502。通常这种情况在 Nginx 中很常见，也很容易调整相关的配置参数 (文章开始有提到)，但是 traefik 的部署模式以及参数调整，还是需要花心思去看相关文档的。

然后在 github 上去翻 traefik 相关的 issues，发现该问题曾经出现过很多次。

一个是 500 的 issues，一个是 502 的 issues，但通常这两个问题都是成双出现的。

500 和 502 问题解决方案

在上面第一个 issue 中提到，traefik 在 http 的反向代理功能中默认开启了 http 的 keep-alive 功能，但是 python 的应用中未开启 http 的 keep-alive，因为我们上面的测试程序其实也是使用的 python 开发的，先对该参数进行调整。

# 在uwsgi.ini中增加keep-alive参数即可
$ cat uwsgi.ini
[uwsgi]
http = 0.0.0.0:8080
http-keepalive = 1
chdir = /opt/app/
wsgi-file = /opt/app/main.py
callable = app
stats = 0.0.0.0:8081
processes = 2
threads = 10

# 重启应用后，再次进行压测
# 之前出现的502和500错误基本都消失了
# 并发200，共1万个请求，失败的请求数为0，总耗时1min(qps才到160????)
$ ab  -c 200 -n 10000   -T 'application/json' -p /root/login.json http://data-api.bgbiao.cn/account/users/login/
....
....

Concurrency Level:      200
Time taken for tests:   59.323 seconds
Complete requests:      10000
Failed requests:        0
Write errors:           0
Total transferred:      4670000 bytes
Total body sent:        2320000
HTML transferred:       2500000 bytes
Requests per second:    168.57 [#/sec] (mean)
Time per request:       1186.454 [ms] (mean)
Time per request:       5.932 [ms] (mean, across all concurrent requests)
Transfer rate:          76.88 [Kbytes/sec] received
                        38.19 kb/s sent
                        115.07 kb/s total

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        1    3  38.6      1    1035
Processing:   101  942 1457.7    857   32684
Waiting:      101  942 1457.7    857   32684
Total:        102  945 1458.0    861   32685

# p99达到7.3s
Percentage of the requests served within a certain time (ms)
  50%    861
  66%   1033
  75%   1136
  80%   1191
  90%   1886
  95%   2281
  98%   4209
  99%   7399
 100%  32685 (longest request)

通过对业务层的 http 的 keep-alive 参数的开启，来暂时解决了 500 和 502 的问题，那能否通过 traefik 层的参数来优化该问题呢，上面第二个 issue 中其实也提到了。

即，通过修改 traefik 的如下几个参数，并重新部署整个 traefik 集群即可:

# 关闭traefik的keep-alive参数，参数默认为200，如果参数为0，则使用go标准库中的DefaultMaxIdleConnsPerHost参数
# keep-alive主要是用来复用链接来减少open files的，但是对于大量的短连接来将这种链接复用就可能出现上述情况
--maxidleconnsperhost=-1
# 即通过设置重试次数，增加空闲链接的超时时间，增加转发响应的超时时间,默认是0次
--retry.attempts=10
# 该参数已经替换为--respondingtimeouts.idletimeout参数了，默认为3m0s
--idletimeout=60s
# 默认是0s
--forwardingtimeouts.responseheadertimeout=60s

# traefik 空闲链接超时
$ ./traefik --help | grep idletimeout
    --idletimeout                                 (Deprecated) maximum amount of time an idle (keep-alive) connection will remain  (default "0s")
    --respondingtimeouts.idletimeout              IdleTimeout is the maximum amount duration an idle (keep-alive) connection will  (default "3m0s")

# 响应超时相关参数
 $ ./traefik --help | grep respondingtimeouts
    --respondingtimeouts                          Timeouts for incoming requests to the Traefik instance                           (default "true")
    --respondingtimeouts.idletimeout              IdleTimeout is the maximum amount duration an idle (keep-alive) connection will  (default "3m0s")
    --respondingtimeouts.readtimeout              ReadTimeout is the maximum duration for reading the entire request, including    (default "0s")
    --respondingtimeouts.writetimeout             WriteTimeout is the maximum duration before timing out writes of the response.   (default "0s")


# 转发的超时
$ ./traefik --help | grep forwardingtimeouts
    --forwardingtimeouts                          Timeouts for requests forwarded to the backend servers                           (default "true")
    --forwardingtimeouts.dialtimeout              The amount of time to wait until a connection to a backend server can be         (default "30s")
    --forwardingtimeouts.responseheadertimeout    The amount of time to wait for a server's response headers after fully writing   (default "0s")

最终修改后的 traefik 的参数如下:

# 可以根据实际情况考虑是否要关闭keep-alive 即增加参数: --maxidleconnsperhost=-1
        - --api
        - --kubernetes
        - --logLevel=INFO
        - --traefikLog.filePath=/logdata/traefik.log
        - --configfile=/config/traefik.toml
        - --accesslog.filepath=/logdata/access.log
        - --accesslog.bufferingsize=100
        - --forwardingtimeouts.responseheadertimeout=60s
        - --respondingtimeouts.idletimeout=180s
        - --retry.attempts=10
        - --idletimeout=180s

三、SLB 中诡异的 503

在发现域名直接解析到 traefik 节点上不再出现 502 和 500 后，我们将 traefik 的节点挂载到阿里云的内网 slb 中，但是又开始出现了诡异的 503 问题。

接入 slb 后的简单压测情况 (内网使用的是免费低配的 slb????)

# client->ali-slb->traefik->pods
$ ab -v 10 -c 200 -n 2000   -T 'application/json' -p postfile.json http://data-api.soulapp.cn/get_doc > slb-log.txt
$ cat slb-log.txt | grep  'HTTP/1.1 200' | wc -l
1322
$ cat slb-log.txt | grep  'HTTP/1.1 503' | wc -l
678


# client->traefik->pods
$ ab -v 10 -c 200 -n 2000   -T 'application/json' -p postfile.json http://data-api.c.bgbiao.cn/get_doc > traefik-log.txt
Completed 200 requests
Completed 400 requests
Completed 600 requests
Completed 800 requests
Completed 1000 requests
Completed 1200 requests
Completed 1400 requests
Completed 1600 requests
Completed 1800 requests
Completed 2000 requests
Finished 2000 requests

$ cat traefik-log.txt  | grep  'HTTP/1.0 200' | wc -l
2000

后来在阿里云文档中看到如下文档，基本上就是在阿里云的 SLB 侧对不同规格的 SLB 做了一定的限流策略，此时会向客户端返回 503。

阿里云 TPS-503