Message ID | 20221102132811.70858-1-luwei32@huawei.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a5d:6687:0:0:0:0:0 with SMTP id l7csp3591278wru; Wed, 2 Nov 2022 05:35:07 -0700 (PDT) X-Google-Smtp-Source: AMsMyM5+jKcGWjwkbode/3Uj6ltRHbfxsfgHPEMbezYproFh50FP28Ty6/HPfmwkAOjzRnZMtlUw X-Received: by 2002:a17:907:2c72:b0:7a5:47da:5893 with SMTP id ib18-20020a1709072c7200b007a547da5893mr23336708ejc.612.1667392507216; Wed, 02 Nov 2022 05:35:07 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1667392507; cv=none; d=google.com; s=arc-20160816; b=JuQUGHB4Ej4K42BwTUZSMos4LVACBsWLizgXyUT6MCsBtfKcTwZN28VMH4M0lmrF4o s4TyLgwykL2dGQce1cguYpvtyuak8k/fCKzFE/HDysLUD6CaKtjDBFDqRAAozu95vi8K jbTNBdsKorHOLuc8DEoLRZk7MenysviOTcna0wNoyp5StVvtdUWHe6tE/RFbqIbtQLt2 GihFLnfBxeVyW3E0VYoXoIOcEQkcF+DBkuwKx1gKb2CfBsxPVdrCbwROlfh0DuXOp88u O4AmHWopp1z5V9pWRxAHT1n9DUsx/LNhjlqMmA/1CbF15xIq272oB2aFHYW/tZ4Hf8zQ Wbog== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :message-id:date:subject:to:from; bh=bRKqiFuIHHgtqd923ieyNKvo9YXiVU0jHgC2f8bhWmQ=; b=cCdyIPODKoIq+Cc6f9cuc68aiwUwBgbgBiV+n2qs0l1206EQGUn/zoZU8Nyt2vndXo ETzRw6APqcRbucy6L3KwrXUjLv0ZhbXq5Lp2nBTHmAZ7Rz04mgYSkWw0qDyuWVM3qGek jygVeZnUN0iYLbkn+AUkdK+j9USSbtjXvqeufKEFogwi0WwG6wewmBsagui0qprDnXTv 0pWVUHUuVegO/wa7LYg3af7gXGQNBwSRwXNV3OYLxR7RznC4wC69GaaHSsXGAWKAwn98 Ii+rwV3eEfc1GNmecPQejIeIxcmriQvjlTDI6ZfNaAzc1x2mCGXhBTYkipttaofeadVD uFtg== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=huawei.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id k15-20020a1709061c0f00b006feb76dbd51si11687560ejg.289.2022.11.02.05.34.36; Wed, 02 Nov 2022 05:35:07 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=huawei.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230366AbiKBMXV (ORCPT <rfc822;yves.mi.zy@gmail.com> + 99 others); Wed, 2 Nov 2022 08:23:21 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41746 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230139AbiKBMXT (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Wed, 2 Nov 2022 08:23:19 -0400 Received: from szxga01-in.huawei.com (szxga01-in.huawei.com [45.249.212.187]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id AEE6228729; Wed, 2 Nov 2022 05:23:18 -0700 (PDT) Received: from kwepemi500015.china.huawei.com (unknown [172.30.72.55]) by szxga01-in.huawei.com (SkyGuard) with ESMTP id 4N2Qym37bnzbc7c; Wed, 2 Nov 2022 20:23:12 +0800 (CST) Received: from huawei.com (10.175.101.6) by kwepemi500015.china.huawei.com (7.221.188.92) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2375.31; Wed, 2 Nov 2022 20:23:15 +0800 From: Lu Wei <luwei32@huawei.com> To: <edumazet@google.com>, <davem@davemloft.net>, <yoshfuji@linux-ipv6.org>, <dsahern@kernel.org>, <kuba@kernel.org>, <pabeni@redhat.com>, <xemul@parallels.com>, <netdev@vger.kernel.org>, <linux-kernel@vger.kernel.org> Subject: [patch net v3] tcp: prohibit TCP_REPAIR_OPTIONS if data was already sent Date: Wed, 2 Nov 2022 21:28:11 +0800 Message-ID: <20221102132811.70858-1-luwei32@huawei.com> X-Mailer: git-send-email 2.31.1 MIME-Version: 1.0 Content-Transfer-Encoding: 7BIT Content-Type: text/plain; charset=US-ASCII X-Originating-IP: [10.175.101.6] X-ClientProxiedBy: dggems701-chm.china.huawei.com (10.3.19.178) To kwepemi500015.china.huawei.com (7.221.188.92) X-CFilter-Loop: Reflected X-Spam-Status: No, score=-4.2 required=5.0 tests=BAYES_00,RCVD_IN_DNSWL_MED, SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1748387765493848795?= X-GMAIL-MSGID: =?utf-8?q?1748387765493848795?= |
Series |
[net,v3] tcp: prohibit TCP_REPAIR_OPTIONS if data was already sent
|
|
Commit Message
Lu Wei
Nov. 2, 2022, 1:28 p.m. UTC
If setsockopt with option name of TCP_REPAIR_OPTIONS and opt_code
of TCPOPT_SACK_PERM is called to enable sack after data is sent
and before data is acked, it will trigger a warning in function
tcp_verify_left_out() as follows:
============================================
WARNING: CPU: 8 PID: 0 at net/ipv4/tcp_input.c:2132
tcp_timeout_mark_lost+0x154/0x160
tcp_enter_loss+0x2b/0x290
tcp_retransmit_timer+0x50b/0x640
tcp_write_timer_handler+0x1c8/0x340
tcp_write_timer+0xe5/0x140
call_timer_fn+0x3a/0x1b0
__run_timers.part.0+0x1bf/0x2d0
run_timer_softirq+0x43/0xb0
__do_softirq+0xfd/0x373
__irq_exit_rcu+0xf6/0x140
The warning is caused in the following steps:
1. a socket named socketA is created
2. socketA enters repair mode without build a connection
3. socketA calls connect() and its state is changed to TCP_ESTABLISHED
directly
4. socketA leaves repair mode
5. socketA calls sendmsg() to send data, packets_out and sack_outs(dup
ack receives) increase
6. socketA enters repair mode again
7. socketA calls setsockopt with TCPOPT_SACK_PERM to enable sack
8. retransmit timer expires, it calls tcp_timeout_mark_lost(), lost_out
increases
9. sack_outs + lost_out > packets_out triggers since lost_out and
sack_outs increase repeatly
In function tcp_timeout_mark_lost(), tp->sacked_out will be cleared if
Step7 not happen and the warning will not be triggered. As suggested by
Denis and Eric, TCP_REPAIR_OPTIONS should be prohibited if data was
already sent. So this patch checks tp->segs_out, only TCP_REPAIR_OPTIONS
can be set only if tp->segs_out is 0.
socket-tcp tests in CRIU has been tested as follows:
$ sudo ./test/zdtm.py run -t zdtm/static/socket-tcp* --keep-going \
--ignore-taint
socket-tcp* represent all socket-tcp tests in test/zdtm/static/.
Fixes: b139ba4e90dc ("tcp: Repair connection-time negotiated parameters")
Signed-off-by: Lu Wei <luwei32@huawei.com>
---
net/ipv4/tcp.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Comments
On Wed, Nov 2, 2022 at 8:23 AM Lu Wei <luwei32@huawei.com> wrote: > > If setsockopt with option name of TCP_REPAIR_OPTIONS and opt_code > of TCPOPT_SACK_PERM is called to enable sack after data is sent > and before data is acked, ... This "before data is acked" phrase does not quite seem to match the sequence below, AFAICT? How about something like: If setsockopt TCP_REPAIR_OPTIONS with opt_code TCPOPT_SACK_PERM is called to enable SACK after data is sent and the data sender receives a dupack, ... > ... it will trigger a warning in function > tcp_verify_left_out() as follows: > > ============================================ > WARNING: CPU: 8 PID: 0 at net/ipv4/tcp_input.c:2132 > tcp_timeout_mark_lost+0x154/0x160 > tcp_enter_loss+0x2b/0x290 > tcp_retransmit_timer+0x50b/0x640 > tcp_write_timer_handler+0x1c8/0x340 > tcp_write_timer+0xe5/0x140 > call_timer_fn+0x3a/0x1b0 > __run_timers.part.0+0x1bf/0x2d0 > run_timer_softirq+0x43/0xb0 > __do_softirq+0xfd/0x373 > __irq_exit_rcu+0xf6/0x140 > > The warning is caused in the following steps: > 1. a socket named socketA is created > 2. socketA enters repair mode without build a connection > 3. socketA calls connect() and its state is changed to TCP_ESTABLISHED > directly > 4. socketA leaves repair mode > 5. socketA calls sendmsg() to send data, packets_out and sack_outs(dup > ack receives) increase > 6. socketA enters repair mode again > 7. socketA calls setsockopt with TCPOPT_SACK_PERM to enable sack > 8. retransmit timer expires, it calls tcp_timeout_mark_lost(), lost_out > increases > 9. sack_outs + lost_out > packets_out triggers since lost_out and > sack_outs increase repeatly > > In function tcp_timeout_mark_lost(), tp->sacked_out will be cleared if > Step7 not happen and the warning will not be triggered. As suggested by > Denis and Eric, TCP_REPAIR_OPTIONS should be prohibited if data was > already sent. So this patch checks tp->segs_out, only TCP_REPAIR_OPTIONS > can be set only if tp->segs_out is 0. > > socket-tcp tests in CRIU has been tested as follows: > $ sudo ./test/zdtm.py run -t zdtm/static/socket-tcp* --keep-going \ > --ignore-taint > > socket-tcp* represent all socket-tcp tests in test/zdtm/static/. > > Fixes: b139ba4e90dc ("tcp: Repair connection-time negotiated parameters") > Signed-off-by: Lu Wei <luwei32@huawei.com> > --- > net/ipv4/tcp.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > index ef14efa1fb70..1f5cc32cf0cc 100644 > --- a/net/ipv4/tcp.c > +++ b/net/ipv4/tcp.c > @@ -3647,7 +3647,7 @@ int do_tcp_setsockopt(struct sock *sk, int level, int optname, > case TCP_REPAIR_OPTIONS: > if (!tp->repair) > err = -EINVAL; > - else if (sk->sk_state == TCP_ESTABLISHED) > + else if (sk->sk_state == TCP_ESTABLISHED && !tp->segs_out) The tp->segs_out field is only 32 bits wide. By my math, at 200 Gbit/sec with 1500 byte MTU it can wrap roughly every 260 secs. So a caller could get unlucky or carefully sequence its call to TCP_REPAIR_OPTIONS (based on packets sent so far) to mess up the accounting and trigger the kernel warning. How about using some other method to determine if this is safe? Perhaps using tp->bytes_sent, which is a 64-bit field, which by my math would take 23 years to wrap at 200 Gbit/sec? If we're more paranoid about wrapping we could also check tp->packets_out, and refuse to allow TCP_REPAIR_OPTIONS if either tp->bytes_sent or tp->packets_out are non-zero. (Or if we're even more paranoid I suppose we could have a special new bit to track whether we've ever sent something, but that probably seems like overkill?) neal
在 2022/11/2 10:46 PM, Neal Cardwell 写道: > On Wed, Nov 2, 2022 at 8:23 AM Lu Wei <luwei32@huawei.com> wrote: >> If setsockopt with option name of TCP_REPAIR_OPTIONS and opt_code >> of TCPOPT_SACK_PERM is called to enable sack after data is sent >> and before data is acked, ... > This "before data is acked" phrase does not quite seem to match the > sequence below, AFAICT? > > How about something like: > > If setsockopt TCP_REPAIR_OPTIONS with opt_code TCPOPT_SACK_PERM > is called to enable SACK after data is sent and the data sender receives a > dupack, ... yes, thanks for suggestion > > >> ... it will trigger a warning in function >> tcp_verify_left_out() as follows: >> >> ============================================ >> WARNING: CPU: 8 PID: 0 at net/ipv4/tcp_input.c:2132 >> tcp_timeout_mark_lost+0x154/0x160 >> tcp_enter_loss+0x2b/0x290 >> tcp_retransmit_timer+0x50b/0x640 >> tcp_write_timer_handler+0x1c8/0x340 >> tcp_write_timer+0xe5/0x140 >> call_timer_fn+0x3a/0x1b0 >> __run_timers.part.0+0x1bf/0x2d0 >> run_timer_softirq+0x43/0xb0 >> __do_softirq+0xfd/0x373 >> __irq_exit_rcu+0xf6/0x140 >> >> The warning is caused in the following steps: >> 1. a socket named socketA is created >> 2. socketA enters repair mode without build a connection >> 3. socketA calls connect() and its state is changed to TCP_ESTABLISHED >> directly >> 4. socketA leaves repair mode >> 5. socketA calls sendmsg() to send data, packets_out and sack_outs(dup >> ack receives) increase >> 6. socketA enters repair mode again >> 7. socketA calls setsockopt with TCPOPT_SACK_PERM to enable sack >> 8. retransmit timer expires, it calls tcp_timeout_mark_lost(), lost_out >> increases >> 9. sack_outs + lost_out > packets_out triggers since lost_out and >> sack_outs increase repeatly >> >> In function tcp_timeout_mark_lost(), tp->sacked_out will be cleared if >> Step7 not happen and the warning will not be triggered. As suggested by >> Denis and Eric, TCP_REPAIR_OPTIONS should be prohibited if data was >> already sent. So this patch checks tp->segs_out, only TCP_REPAIR_OPTIONS >> can be set only if tp->segs_out is 0. >> >> socket-tcp tests in CRIU has been tested as follows: >> $ sudo ./test/zdtm.py run -t zdtm/static/socket-tcp* --keep-going \ >> --ignore-taint >> >> socket-tcp* represent all socket-tcp tests in test/zdtm/static/. >> >> Fixes: b139ba4e90dc ("tcp: Repair connection-time negotiated parameters") >> Signed-off-by: Lu Wei <luwei32@huawei.com> >> --- >> net/ipv4/tcp.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c >> index ef14efa1fb70..1f5cc32cf0cc 100644 >> --- a/net/ipv4/tcp.c >> +++ b/net/ipv4/tcp.c >> @@ -3647,7 +3647,7 @@ int do_tcp_setsockopt(struct sock *sk, int level, int optname, >> case TCP_REPAIR_OPTIONS: >> if (!tp->repair) >> err = -EINVAL; >> - else if (sk->sk_state == TCP_ESTABLISHED) >> + else if (sk->sk_state == TCP_ESTABLISHED && !tp->segs_out) > The tp->segs_out field is only 32 bits wide. By my math, at 200 > Gbit/sec with 1500 byte MTU it can wrap roughly every 260 secs. So a > caller could get unlucky or carefully sequence its call to > TCP_REPAIR_OPTIONS (based on packets sent so far) to mess up the > accounting and trigger the kernel warning. > > How about using some other method to determine if this is safe? > Perhaps using tp->bytes_sent, which is a 64-bit field, which by my > math would take 23 years to wrap at 200 Gbit/sec? > > If we're more paranoid about wrapping we could also check > tp->packets_out, and refuse to allow TCP_REPAIR_OPTIONS if either > tp->bytes_sent or tp->packets_out are non-zero. (Or if we're even more > paranoid I suppose we could have a special new bit to track whether > we've ever sent something, but that probably seems like overkill?) > > neal > . I didn't notice that u32 will be easily wrapped in huge network throughput, thank you neal. But tcp->packets_out shoud not be used because tp->packets_out can decrease when expected ack is received, so it can decrease to 0 and this is the common condition.
On Wed, Nov 2, 2022 at 7:11 PM luwei (O) <luwei32@huawei.com> wrote: > > > 在 2022/11/2 10:46 PM, Neal Cardwell 写道: > > On Wed, Nov 2, 2022 at 8:23 AM Lu Wei <luwei32@huawei.com> wrote: > >> If setsockopt with option name of TCP_REPAIR_OPTIONS and opt_code > >> of TCPOPT_SACK_PERM is called to enable sack after data is sent > >> and before data is acked, ... > > This "before data is acked" phrase does not quite seem to match the > > sequence below, AFAICT? > > > > How about something like: > > > > If setsockopt TCP_REPAIR_OPTIONS with opt_code TCPOPT_SACK_PERM > > is called to enable SACK after data is sent and the data sender receives a > > dupack, ... > yes, thanks for suggestion > > > > > >> ... it will trigger a warning in function > >> tcp_verify_left_out() as follows: > >> > >> ============================================ > >> WARNING: CPU: 8 PID: 0 at net/ipv4/tcp_input.c:2132 > >> tcp_timeout_mark_lost+0x154/0x160 > >> tcp_enter_loss+0x2b/0x290 > >> tcp_retransmit_timer+0x50b/0x640 > >> tcp_write_timer_handler+0x1c8/0x340 > >> tcp_write_timer+0xe5/0x140 > >> call_timer_fn+0x3a/0x1b0 > >> __run_timers.part.0+0x1bf/0x2d0 > >> run_timer_softirq+0x43/0xb0 > >> __do_softirq+0xfd/0x373 > >> __irq_exit_rcu+0xf6/0x140 > >> > >> The warning is caused in the following steps: > >> 1. a socket named socketA is created > >> 2. socketA enters repair mode without build a connection > >> 3. socketA calls connect() and its state is changed to TCP_ESTABLISHED > >> directly > >> 4. socketA leaves repair mode > >> 5. socketA calls sendmsg() to send data, packets_out and sack_outs(dup > >> ack receives) increase > >> 6. socketA enters repair mode again > >> 7. socketA calls setsockopt with TCPOPT_SACK_PERM to enable sack > >> 8. retransmit timer expires, it calls tcp_timeout_mark_lost(), lost_out > >> increases > >> 9. sack_outs + lost_out > packets_out triggers since lost_out and > >> sack_outs increase repeatly > >> > >> In function tcp_timeout_mark_lost(), tp->sacked_out will be cleared if > >> Step7 not happen and the warning will not be triggered. As suggested by > >> Denis and Eric, TCP_REPAIR_OPTIONS should be prohibited if data was > >> already sent. So this patch checks tp->segs_out, only TCP_REPAIR_OPTIONS > >> can be set only if tp->segs_out is 0. > >> > >> socket-tcp tests in CRIU has been tested as follows: > >> $ sudo ./test/zdtm.py run -t zdtm/static/socket-tcp* --keep-going \ > >> --ignore-taint > >> > >> socket-tcp* represent all socket-tcp tests in test/zdtm/static/. > >> > >> Fixes: b139ba4e90dc ("tcp: Repair connection-time negotiated parameters") > >> Signed-off-by: Lu Wei <luwei32@huawei.com> > >> --- > >> net/ipv4/tcp.c | 2 +- > >> 1 file changed, 1 insertion(+), 1 deletion(-) > >> > >> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > >> index ef14efa1fb70..1f5cc32cf0cc 100644 > >> --- a/net/ipv4/tcp.c > >> +++ b/net/ipv4/tcp.c > >> @@ -3647,7 +3647,7 @@ int do_tcp_setsockopt(struct sock *sk, int level, int optname, > >> case TCP_REPAIR_OPTIONS: > >> if (!tp->repair) > >> err = -EINVAL; > >> - else if (sk->sk_state == TCP_ESTABLISHED) > >> + else if (sk->sk_state == TCP_ESTABLISHED && !tp->segs_out) > > The tp->segs_out field is only 32 bits wide. By my math, at 200 > > Gbit/sec with 1500 byte MTU it can wrap roughly every 260 secs. So a > > caller could get unlucky or carefully sequence its call to > > TCP_REPAIR_OPTIONS (based on packets sent so far) to mess up the > > accounting and trigger the kernel warning. > > > > How about using some other method to determine if this is safe? > > Perhaps using tp->bytes_sent, which is a 64-bit field, which by my > > math would take 23 years to wrap at 200 Gbit/sec? > > > > If we're more paranoid about wrapping we could also check > > tp->packets_out, and refuse to allow TCP_REPAIR_OPTIONS if either > > tp->bytes_sent or tp->packets_out are non-zero. (Or if we're even more > > paranoid I suppose we could have a special new bit to track whether > > we've ever sent something, but that probably seems like overkill?) > > > > neal > > . > > I didn't notice that u32 will be easily wrapped in huge network throughput, > thank you neal. > > But tcp->packets_out shoud not be used because tp->packets_out can decrease > when expected ack is received, so it can decrease to 0 and this is the common > condition. Right, so just use tp->bytes_sent I doubt syzbot will be patient enough to wait for an overflow.
On Wed, Nov 2, 2022 at 10:11 PM luwei (O) <luwei32@huawei.com> wrote: > > > 在 2022/11/2 10:46 PM, Neal Cardwell 写道: > > On Wed, Nov 2, 2022 at 8:23 AM Lu Wei <luwei32@huawei.com> wrote: > >> If setsockopt with option name of TCP_REPAIR_OPTIONS and opt_code > >> of TCPOPT_SACK_PERM is called to enable sack after data is sent > >> and before data is acked, ... > > This "before data is acked" phrase does not quite seem to match the > > sequence below, AFAICT? > > > > How about something like: > > > > If setsockopt TCP_REPAIR_OPTIONS with opt_code TCPOPT_SACK_PERM > > is called to enable SACK after data is sent and the data sender receives a > > dupack, ... > yes, thanks for suggestion > > > > > >> ... it will trigger a warning in function > >> tcp_verify_left_out() as follows: > >> > >> ============================================ > >> WARNING: CPU: 8 PID: 0 at net/ipv4/tcp_input.c:2132 > >> tcp_timeout_mark_lost+0x154/0x160 > >> tcp_enter_loss+0x2b/0x290 > >> tcp_retransmit_timer+0x50b/0x640 > >> tcp_write_timer_handler+0x1c8/0x340 > >> tcp_write_timer+0xe5/0x140 > >> call_timer_fn+0x3a/0x1b0 > >> __run_timers.part.0+0x1bf/0x2d0 > >> run_timer_softirq+0x43/0xb0 > >> __do_softirq+0xfd/0x373 > >> __irq_exit_rcu+0xf6/0x140 > >> > >> The warning is caused in the following steps: > >> 1. a socket named socketA is created > >> 2. socketA enters repair mode without build a connection > >> 3. socketA calls connect() and its state is changed to TCP_ESTABLISHED > >> directly > >> 4. socketA leaves repair mode > >> 5. socketA calls sendmsg() to send data, packets_out and sack_outs(dup > >> ack receives) increase > >> 6. socketA enters repair mode again > >> 7. socketA calls setsockopt with TCPOPT_SACK_PERM to enable sack > >> 8. retransmit timer expires, it calls tcp_timeout_mark_lost(), lost_out > >> increases > >> 9. sack_outs + lost_out > packets_out triggers since lost_out and > >> sack_outs increase repeatly > >> > >> In function tcp_timeout_mark_lost(), tp->sacked_out will be cleared if > >> Step7 not happen and the warning will not be triggered. As suggested by > >> Denis and Eric, TCP_REPAIR_OPTIONS should be prohibited if data was > >> already sent. So this patch checks tp->segs_out, only TCP_REPAIR_OPTIONS > >> can be set only if tp->segs_out is 0. > >> > >> socket-tcp tests in CRIU has been tested as follows: > >> $ sudo ./test/zdtm.py run -t zdtm/static/socket-tcp* --keep-going \ > >> --ignore-taint > >> > >> socket-tcp* represent all socket-tcp tests in test/zdtm/static/. > >> > >> Fixes: b139ba4e90dc ("tcp: Repair connection-time negotiated parameters") > >> Signed-off-by: Lu Wei <luwei32@huawei.com> > >> --- > >> net/ipv4/tcp.c | 2 +- > >> 1 file changed, 1 insertion(+), 1 deletion(-) > >> > >> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > >> index ef14efa1fb70..1f5cc32cf0cc 100644 > >> --- a/net/ipv4/tcp.c > >> +++ b/net/ipv4/tcp.c > >> @@ -3647,7 +3647,7 @@ int do_tcp_setsockopt(struct sock *sk, int level, int optname, > >> case TCP_REPAIR_OPTIONS: > >> if (!tp->repair) > >> err = -EINVAL; > >> - else if (sk->sk_state == TCP_ESTABLISHED) > >> + else if (sk->sk_state == TCP_ESTABLISHED && !tp->segs_out) > > The tp->segs_out field is only 32 bits wide. By my math, at 200 > > Gbit/sec with 1500 byte MTU it can wrap roughly every 260 secs. So a > > caller could get unlucky or carefully sequence its call to > > TCP_REPAIR_OPTIONS (based on packets sent so far) to mess up the > > accounting and trigger the kernel warning. > > > > How about using some other method to determine if this is safe? > > Perhaps using tp->bytes_sent, which is a 64-bit field, which by my > > math would take 23 years to wrap at 200 Gbit/sec? > > > > If we're more paranoid about wrapping we could also check > > tp->packets_out, and refuse to allow TCP_REPAIR_OPTIONS if either > > tp->bytes_sent or tp->packets_out are non-zero. (Or if we're even more > > paranoid I suppose we could have a special new bit to track whether > > we've ever sent something, but that probably seems like overkill?) > > > > neal > > . > > I didn't notice that u32 will be easily wrapped in huge network throughput, > thank you neal. > > But tcp->packets_out shoud not be used because tp->packets_out can decrease > when expected ack is received, so it can decrease to 0 and this is the common > condition. To say tp->packets_out should not be used is a bit strong. :-) Obviously packets_out decreases when packets are ACKed. The point of checking both tp->bytes_sent and tp->packets_out would be if we are paranoid enough that we want to prevent this warning in the case where tp->bytes_sent wraps and becomes zero. If tp->bytes_sent wraps and is zero, we will be saved from hitting this warning if we deny the request to set TCP_REPAIR_OPTIONS if tp->packets_out is non-zero. (Because we can only hit this warning if tp->sacked_out is non-zero, and tp->sacked_out should only be non-zero if tp->packets_out is non-zero.) neal
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index ef14efa1fb70..1f5cc32cf0cc 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -3647,7 +3647,7 @@ int do_tcp_setsockopt(struct sock *sk, int level, int optname, case TCP_REPAIR_OPTIONS: if (!tp->repair) err = -EINVAL; - else if (sk->sk_state == TCP_ESTABLISHED) + else if (sk->sk_state == TCP_ESTABLISHED && !tp->segs_out) err = tcp_repair_options_est(sk, optval, optlen); else err = -EPERM;