TCP Congestion Window Validation
If the congestion window has not been fully used within one RTO, the TCP sender reduces it: a window that went unused may no longer reflect the current network conditions. Per RFC 2861, ssthresh is set to the maximum of its current value and 3/4 of the congestion window, while the congestion window is set to half the sum of the amount actually used and the current congestion window value.
In the send function tcp_write_xmit below, if any packets were actually sent (sent_pkts is non-zero), tcp_cwnd_validate is called afterwards to validate the congestion window. Its parameter is_cwnd_limited indicates whether sending was limited by the congestion window; it is determined by two sources ORed together: the assignment made inside tcp_tso_should_defer, and the check of whether the number of packets in flight has reached the congestion window.
static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
			   int push_one, gfp_t gfp)
{
	max_segs = tcp_tso_segs(sk, mss_now);
	while ((skb = tcp_send_head(sk))) {
		...
		tso_segs = tcp_init_tso_segs(skb, mss_now);
		if (tso_segs == 1) {
			...
		} else {
			if (!push_one &&
			    tcp_tso_should_defer(sk, skb, &is_cwnd_limited,
						 &is_rwnd_limited, max_segs))
				break;
		}
		...
	}
	...
	if (likely(sent_pkts)) {
		...
		is_cwnd_limited |= (tcp_packets_in_flight(tp) >= tp->snd_cwnd);
		tcp_cwnd_validate(sk, is_cwnd_limited);
		return false;
	}
The tcp_tso_should_defer function below is skipped when pushing a single packet (push_one). If the remaining congestion window is smaller than the send window, and is no larger than the packet length, the packet cannot be sent now, so the cwnd-limited flag is_cwnd_limited is set.
static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb,
				 bool *is_cwnd_limited, bool *is_rwnd_limited,
				 u32 max_segs)
{
	send_win = tcp_wnd_end(tp) - TCP_SKB_CB(skb)->seq;

	/* From in_flight test above, we know that cwnd > in_flight. */
	cong_win = (tp->snd_cwnd - in_flight) * tp->mss_cache;
	...
	/* Ok, it looks like it is advisable to defer.
	 * Three cases are tracked :
	 * 1) We are cwnd-limited
	 * 2) We are rwnd-limited
	 * 3) We are application limited.
	 */
	if (cong_win < send_win) {
		if (cong_win <= skb->len) {
			*is_cwnd_limited = true;
			return true;
		}
	} else {
		...
	}
	...
}
In the congestion window validation function below, on the first call max_packets_out and max_packets_seq are both still unset, and are initialized to packets_out and SND.NXT respectively. On subsequent calls they are updated only when a new send window period has begun (snd_una has reached max_packets_seq) or when more packets are outstanding than the recorded value. Thus max_packets_out records the maximum number of packets outstanding during the previous window, and max_packets_seq records the highest sequence number sent.
static void tcp_cwnd_validate(struct sock *sk, bool is_cwnd_limited)
{
	const struct tcp_congestion_ops *ca_ops = inet_csk(sk)->icsk_ca_ops;
	struct tcp_sock *tp = tcp_sk(sk);

	/* Track the maximum number of outstanding packets in each
	 * window, and remember whether we were cwnd-limited then.
	 */
	if (!before(tp->snd_una, tp->max_packets_seq) ||
	    tp->packets_out > tp->max_packets_out) {
		tp->max_packets_out = tp->packets_out;
		tp->max_packets_seq = tp->snd_nxt;
		tp->is_cwnd_limited = is_cwnd_limited;
	}
The parameter is_cwnd_limited records whether the previous send window period was limited by the congestion window. The function tcp_is_cwnd_limited checks whether the connection's sending is limited by the congestion window: true means sending is using all available network capacity; false means there is spare capacity.
In the latter case, the current number of packets in the network is recorded in snd_cwnd_used. If the kernel is configured to reset the congestion window after an idle period longer than the RTO (tcp_slow_start_after_idle is enabled), the idle time is at least the RTO, and the congestion control algorithm does not define its own cong_control handler, tcp_cwnd_application_limited is called to handle the application-limited case.
	if (tcp_is_cwnd_limited(sk)) {
		/* Network is feed fully. */
		tp->snd_cwnd_used = 0;
		tp->snd_cwnd_stamp = tcp_jiffies32;
	} else {
		/* Network starves. */
		if (tp->packets_out > tp->snd_cwnd_used)
			tp->snd_cwnd_used = tp->packets_out;

		if (sock_net(sk)->ipv4.sysctl_tcp_slow_start_after_idle &&
		    (s32)(tcp_jiffies32 - tp->snd_cwnd_stamp) >= inet_csk(sk)->icsk_rto &&
		    !ca_ops->cong_control)
			tcp_cwnd_application_limited(sk);
The following checks whether the starvation is caused by an insufficient send buffer: the congestion window has already been ruled out (this is the else branch), the send queue is empty, and the application has hit the buffer limit (SOCK_NOSPACE). If so, the TCP_CHRONO_SNDBUF_LIMITED flag is recorded.
		/* The following conditions together indicate the starvation
		 * is caused by insufficient sender buffer:
		 * 1) just sent some data (see tcp_write_xmit)
		 * 2) not cwnd limited (this else condition)
		 * 3) no more data to send (tcp_write_queue_empty())
		 * 4) application is hitting buffer limit (SOCK_NOSPACE)
		 */
		if (tcp_write_queue_empty(sk) && sk->sk_socket &&
		    test_bit(SOCK_NOSPACE, &sk->sk_socket->flags) &&
		    (1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT))
			tcp_chrono_start(sk, TCP_CHRONO_SNDBUF_LIMITED);
	}
}
The kernel's notion of cwnd-limited differs slightly from RFC 2861. The RFC suggests that cwnd should not be increased if it was not fully used, which is exactly what the kernel does in congestion avoidance. In slow start, however, the kernel allows the congestion window to grow to twice the amount actually used. See the comments around tcp_is_cwnd_limited: with an initial window of 10, after sending 9 packets, once they are all ACKed the window may grow to 18. This lets rate-limited applications probe the network bandwidth more aggressively.
/* We follow the spirit of RFC2861 to validate cwnd but implement a more
* flexible approach. The RFC suggests cwnd should not be raised unless
* it was fully used previously. And that's exactly what we do in
* congestion avoidance mode. But in slow start we allow cwnd to grow
* as long as the application has used half the cwnd.
* Example :
* cwnd is 10 (IW10), but application sends 9 frames.
* We allow cwnd to reach 18 when all frames are ACKed.
* This check is safe because it's as aggressive as slow start which already
* risks 100% overshoot. The advantage is that we discourage application to
* either send more filler packets or data to artificially blow up the cwnd
* usage, and allow application-limited process to probe bw more aggressively.
*/
static inline bool tcp_is_cwnd_limited(const struct sock *sk)
{
	const struct tcp_sock *tp = tcp_sk(sk);

	/* If in slow start, ensure cwnd grows to twice what was ACKed. */
	if (tcp_in_slow_start(tp))
		return tp->snd_cwnd < 2 * tp->max_packets_out;

	return tp->is_cwnd_limited;
}
The function tcp_cwnd_application_limited below adjusts the congestion window after the network has been idle (underused) for an RTO. The adjustment is skipped during retransmission phases and when the application has recently hit its send buffer limit. It first computes the window usage as the larger of the initial window and the snd_cwnd_used value recorded in tcp_cwnd_validate; the congestion window is then set to half the sum of the old window and the usage.
/* RFC2861, slow part. Adjust cwnd, after it was not full during one rto.
* As additional protections, we do not touch cwnd in retransmission phases,
* and if application hit its sndbuf limit recently.
*/
static void tcp_cwnd_application_limited(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);

	if (inet_csk(sk)->icsk_ca_state == TCP_CA_Open &&
	    sk->sk_socket && !test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) {
		/* Limited by application or receiver window. */
		u32 init_win = tcp_init_cwnd(tp, __sk_dst_get(sk));
		u32 win_used = max(tp->snd_cwnd_used, init_win);

		if (win_used < tp->snd_cwnd) {
			tp->snd_ssthresh = tcp_current_ssthresh(sk);
			tp->snd_cwnd = (tp->snd_cwnd + win_used) >> 1;
		}
		tp->snd_cwnd_used = 0;
	}
	tp->snd_cwnd_stamp = tcp_jiffies32;
}
Kernel version: 5.0