[net,v3] tcp: prohibit TCP_REPAIR_OPTIONS if data was already sent

Message ID 20221102132811.70858-1-luwei32@huawei.com
State New
Headers
Series [net,v3] tcp: prohibit TCP_REPAIR_OPTIONS if data was already sent |

Commit Message

Lu Wei Nov. 2, 2022, 1:28 p.m. UTC
  If setsockopt with option name of TCP_REPAIR_OPTIONS and opt_code
of TCPOPT_SACK_PERM is called to enable sack after data is sent
and before data is acked, it will trigger a warning in function
tcp_verify_left_out() as follows:

============================================
WARNING: CPU: 8 PID: 0 at net/ipv4/tcp_input.c:2132
tcp_timeout_mark_lost+0x154/0x160
tcp_enter_loss+0x2b/0x290
tcp_retransmit_timer+0x50b/0x640
tcp_write_timer_handler+0x1c8/0x340
tcp_write_timer+0xe5/0x140
call_timer_fn+0x3a/0x1b0
__run_timers.part.0+0x1bf/0x2d0
run_timer_softirq+0x43/0xb0
__do_softirq+0xfd/0x373
__irq_exit_rcu+0xf6/0x140

The warning is caused in the following steps:
1. a socket named socketA is created
2. socketA enters repair mode without build a connection
3. socketA calls connect() and its state is changed to TCP_ESTABLISHED
   directly
4. socketA leaves repair mode
5. socketA calls sendmsg() to send data, packets_out and sack_outs(dup
   ack receives) increase
6. socketA enters repair mode again
7. socketA calls setsockopt with TCPOPT_SACK_PERM to enable sack
8. retransmit timer expires, it calls tcp_timeout_mark_lost(), lost_out
   increases
9. sack_outs + lost_out > packets_out triggers since lost_out and
   sack_outs increase repeatly

In function tcp_timeout_mark_lost(), tp->sacked_out will be cleared if
Step7 not happen and the warning will not be triggered. As suggested by
Denis and Eric, TCP_REPAIR_OPTIONS should be prohibited if data was
already sent. So this patch checks tp->segs_out, only TCP_REPAIR_OPTIONS
can be set only if tp->segs_out is 0.

socket-tcp tests in CRIU has been tested as follows:
$ sudo ./test/zdtm.py run -t zdtm/static/socket-tcp*  --keep-going \
       --ignore-taint

socket-tcp* represent all socket-tcp tests in test/zdtm/static/.

Fixes: b139ba4e90dc ("tcp: Repair connection-time negotiated parameters")
Signed-off-by: Lu Wei <luwei32@huawei.com>
---
 net/ipv4/tcp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
  

Comments

Neal Cardwell Nov. 2, 2022, 2:46 p.m. UTC | #1
On Wed, Nov 2, 2022 at 8:23 AM Lu Wei <luwei32@huawei.com> wrote:
>
> If setsockopt with option name of TCP_REPAIR_OPTIONS and opt_code
> of TCPOPT_SACK_PERM is called to enable sack after data is sent
> and before data is acked, ...

This "before data is acked" phrase does not quite seem to match the
sequence below, AFAICT?

How about something like:

 If setsockopt TCP_REPAIR_OPTIONS with opt_code TCPOPT_SACK_PERM
 is called to enable SACK after data is sent and the data sender receives a
 dupack, ...


> ... it will trigger a warning in function
> tcp_verify_left_out() as follows:
>
> ============================================
> WARNING: CPU: 8 PID: 0 at net/ipv4/tcp_input.c:2132
> tcp_timeout_mark_lost+0x154/0x160
> tcp_enter_loss+0x2b/0x290
> tcp_retransmit_timer+0x50b/0x640
> tcp_write_timer_handler+0x1c8/0x340
> tcp_write_timer+0xe5/0x140
> call_timer_fn+0x3a/0x1b0
> __run_timers.part.0+0x1bf/0x2d0
> run_timer_softirq+0x43/0xb0
> __do_softirq+0xfd/0x373
> __irq_exit_rcu+0xf6/0x140
>
> The warning is caused in the following steps:
> 1. a socket named socketA is created
> 2. socketA enters repair mode without build a connection
> 3. socketA calls connect() and its state is changed to TCP_ESTABLISHED
>    directly
> 4. socketA leaves repair mode
> 5. socketA calls sendmsg() to send data, packets_out and sack_outs(dup
>    ack receives) increase
> 6. socketA enters repair mode again
> 7. socketA calls setsockopt with TCPOPT_SACK_PERM to enable sack
> 8. retransmit timer expires, it calls tcp_timeout_mark_lost(), lost_out
>    increases
> 9. sack_outs + lost_out > packets_out triggers since lost_out and
>    sack_outs increase repeatly
>
> In function tcp_timeout_mark_lost(), tp->sacked_out will be cleared if
> Step7 not happen and the warning will not be triggered. As suggested by
> Denis and Eric, TCP_REPAIR_OPTIONS should be prohibited if data was
> already sent. So this patch checks tp->segs_out, only TCP_REPAIR_OPTIONS
> can be set only if tp->segs_out is 0.
>
> socket-tcp tests in CRIU has been tested as follows:
> $ sudo ./test/zdtm.py run -t zdtm/static/socket-tcp*  --keep-going \
>        --ignore-taint
>
> socket-tcp* represent all socket-tcp tests in test/zdtm/static/.
>
> Fixes: b139ba4e90dc ("tcp: Repair connection-time negotiated parameters")
> Signed-off-by: Lu Wei <luwei32@huawei.com>
> ---
>  net/ipv4/tcp.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index ef14efa1fb70..1f5cc32cf0cc 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -3647,7 +3647,7 @@ int do_tcp_setsockopt(struct sock *sk, int level, int optname,
>         case TCP_REPAIR_OPTIONS:
>                 if (!tp->repair)
>                         err = -EINVAL;
> -               else if (sk->sk_state == TCP_ESTABLISHED)
> +               else if (sk->sk_state == TCP_ESTABLISHED && !tp->segs_out)

The tp->segs_out field is only 32 bits wide. By my math, at 200
Gbit/sec with 1500 byte MTU it can wrap roughly every 260 secs. So a
caller could get unlucky or carefully sequence its call to
TCP_REPAIR_OPTIONS (based on packets sent so far) to mess up the
accounting and trigger the kernel warning.

How about using some other method to determine if this is safe?
Perhaps using tp->bytes_sent, which is a 64-bit field, which by my
math would take 23 years to wrap at 200 Gbit/sec?

If we're more paranoid about wrapping we could also check
tp->packets_out, and refuse to allow TCP_REPAIR_OPTIONS if either
tp->bytes_sent or tp->packets_out are non-zero. (Or if we're even more
paranoid I suppose we could have a special new bit to track whether
we've ever sent something, but that probably seems like overkill?)

neal
  
Lu Wei Nov. 3, 2022, 2:11 a.m. UTC | #2
在 2022/11/2 10:46 PM, Neal Cardwell 写道:
> On Wed, Nov 2, 2022 at 8:23 AM Lu Wei <luwei32@huawei.com> wrote:
>> If setsockopt with option name of TCP_REPAIR_OPTIONS and opt_code
>> of TCPOPT_SACK_PERM is called to enable sack after data is sent
>> and before data is acked, ...
> This "before data is acked" phrase does not quite seem to match the
> sequence below, AFAICT?
>
> How about something like:
>
>   If setsockopt TCP_REPAIR_OPTIONS with opt_code TCPOPT_SACK_PERM
>   is called to enable SACK after data is sent and the data sender receives a
>   dupack, ...
      yes, thanks for suggestion
>
>
>> ... it will trigger a warning in function
>> tcp_verify_left_out() as follows:
>>
>> ============================================
>> WARNING: CPU: 8 PID: 0 at net/ipv4/tcp_input.c:2132
>> tcp_timeout_mark_lost+0x154/0x160
>> tcp_enter_loss+0x2b/0x290
>> tcp_retransmit_timer+0x50b/0x640
>> tcp_write_timer_handler+0x1c8/0x340
>> tcp_write_timer+0xe5/0x140
>> call_timer_fn+0x3a/0x1b0
>> __run_timers.part.0+0x1bf/0x2d0
>> run_timer_softirq+0x43/0xb0
>> __do_softirq+0xfd/0x373
>> __irq_exit_rcu+0xf6/0x140
>>
>> The warning is caused in the following steps:
>> 1. a socket named socketA is created
>> 2. socketA enters repair mode without build a connection
>> 3. socketA calls connect() and its state is changed to TCP_ESTABLISHED
>>     directly
>> 4. socketA leaves repair mode
>> 5. socketA calls sendmsg() to send data, packets_out and sack_outs(dup
>>     ack receives) increase
>> 6. socketA enters repair mode again
>> 7. socketA calls setsockopt with TCPOPT_SACK_PERM to enable sack
>> 8. retransmit timer expires, it calls tcp_timeout_mark_lost(), lost_out
>>     increases
>> 9. sack_outs + lost_out > packets_out triggers since lost_out and
>>     sack_outs increase repeatly
>>
>> In function tcp_timeout_mark_lost(), tp->sacked_out will be cleared if
>> Step7 not happen and the warning will not be triggered. As suggested by
>> Denis and Eric, TCP_REPAIR_OPTIONS should be prohibited if data was
>> already sent. So this patch checks tp->segs_out, only TCP_REPAIR_OPTIONS
>> can be set only if tp->segs_out is 0.
>>
>> socket-tcp tests in CRIU has been tested as follows:
>> $ sudo ./test/zdtm.py run -t zdtm/static/socket-tcp*  --keep-going \
>>         --ignore-taint
>>
>> socket-tcp* represent all socket-tcp tests in test/zdtm/static/.
>>
>> Fixes: b139ba4e90dc ("tcp: Repair connection-time negotiated parameters")
>> Signed-off-by: Lu Wei <luwei32@huawei.com>
>> ---
>>   net/ipv4/tcp.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
>> index ef14efa1fb70..1f5cc32cf0cc 100644
>> --- a/net/ipv4/tcp.c
>> +++ b/net/ipv4/tcp.c
>> @@ -3647,7 +3647,7 @@ int do_tcp_setsockopt(struct sock *sk, int level, int optname,
>>          case TCP_REPAIR_OPTIONS:
>>                  if (!tp->repair)
>>                          err = -EINVAL;
>> -               else if (sk->sk_state == TCP_ESTABLISHED)
>> +               else if (sk->sk_state == TCP_ESTABLISHED && !tp->segs_out)
> The tp->segs_out field is only 32 bits wide. By my math, at 200
> Gbit/sec with 1500 byte MTU it can wrap roughly every 260 secs. So a
> caller could get unlucky or carefully sequence its call to
> TCP_REPAIR_OPTIONS (based on packets sent so far) to mess up the
> accounting and trigger the kernel warning.
>
> How about using some other method to determine if this is safe?
> Perhaps using tp->bytes_sent, which is a 64-bit field, which by my
> math would take 23 years to wrap at 200 Gbit/sec?
>
> If we're more paranoid about wrapping we could also check
> tp->packets_out, and refuse to allow TCP_REPAIR_OPTIONS if either
> tp->bytes_sent or tp->packets_out are non-zero. (Or if we're even more
> paranoid I suppose we could have a special new bit to track whether
> we've ever sent something, but that probably seems like overkill?)
>
> neal
> .

I didn't notice that u32 will be easily wrapped in huge network throughput,
thank you neal.

But tcp->packets_out shoud not be used because tp->packets_out can decrease
when expected ack is received, so it can decrease to 0 and this is the common
condition.
  
Eric Dumazet Nov. 3, 2022, 2:27 a.m. UTC | #3
On Wed, Nov 2, 2022 at 7:11 PM luwei (O) <luwei32@huawei.com> wrote:
>
>
> 在 2022/11/2 10:46 PM, Neal Cardwell 写道:
> > On Wed, Nov 2, 2022 at 8:23 AM Lu Wei <luwei32@huawei.com> wrote:
> >> If setsockopt with option name of TCP_REPAIR_OPTIONS and opt_code
> >> of TCPOPT_SACK_PERM is called to enable sack after data is sent
> >> and before data is acked, ...
> > This "before data is acked" phrase does not quite seem to match the
> > sequence below, AFAICT?
> >
> > How about something like:
> >
> >   If setsockopt TCP_REPAIR_OPTIONS with opt_code TCPOPT_SACK_PERM
> >   is called to enable SACK after data is sent and the data sender receives a
> >   dupack, ...
>       yes, thanks for suggestion
> >
> >
> >> ... it will trigger a warning in function
> >> tcp_verify_left_out() as follows:
> >>
> >> ============================================
> >> WARNING: CPU: 8 PID: 0 at net/ipv4/tcp_input.c:2132
> >> tcp_timeout_mark_lost+0x154/0x160
> >> tcp_enter_loss+0x2b/0x290
> >> tcp_retransmit_timer+0x50b/0x640
> >> tcp_write_timer_handler+0x1c8/0x340
> >> tcp_write_timer+0xe5/0x140
> >> call_timer_fn+0x3a/0x1b0
> >> __run_timers.part.0+0x1bf/0x2d0
> >> run_timer_softirq+0x43/0xb0
> >> __do_softirq+0xfd/0x373
> >> __irq_exit_rcu+0xf6/0x140
> >>
> >> The warning is caused in the following steps:
> >> 1. a socket named socketA is created
> >> 2. socketA enters repair mode without build a connection
> >> 3. socketA calls connect() and its state is changed to TCP_ESTABLISHED
> >>     directly
> >> 4. socketA leaves repair mode
> >> 5. socketA calls sendmsg() to send data, packets_out and sack_outs(dup
> >>     ack receives) increase
> >> 6. socketA enters repair mode again
> >> 7. socketA calls setsockopt with TCPOPT_SACK_PERM to enable sack
> >> 8. retransmit timer expires, it calls tcp_timeout_mark_lost(), lost_out
> >>     increases
> >> 9. sack_outs + lost_out > packets_out triggers since lost_out and
> >>     sack_outs increase repeatly
> >>
> >> In function tcp_timeout_mark_lost(), tp->sacked_out will be cleared if
> >> Step7 not happen and the warning will not be triggered. As suggested by
> >> Denis and Eric, TCP_REPAIR_OPTIONS should be prohibited if data was
> >> already sent. So this patch checks tp->segs_out, only TCP_REPAIR_OPTIONS
> >> can be set only if tp->segs_out is 0.
> >>
> >> socket-tcp tests in CRIU has been tested as follows:
> >> $ sudo ./test/zdtm.py run -t zdtm/static/socket-tcp*  --keep-going \
> >>         --ignore-taint
> >>
> >> socket-tcp* represent all socket-tcp tests in test/zdtm/static/.
> >>
> >> Fixes: b139ba4e90dc ("tcp: Repair connection-time negotiated parameters")
> >> Signed-off-by: Lu Wei <luwei32@huawei.com>
> >> ---
> >>   net/ipv4/tcp.c | 2 +-
> >>   1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> >> index ef14efa1fb70..1f5cc32cf0cc 100644
> >> --- a/net/ipv4/tcp.c
> >> +++ b/net/ipv4/tcp.c
> >> @@ -3647,7 +3647,7 @@ int do_tcp_setsockopt(struct sock *sk, int level, int optname,
> >>          case TCP_REPAIR_OPTIONS:
> >>                  if (!tp->repair)
> >>                          err = -EINVAL;
> >> -               else if (sk->sk_state == TCP_ESTABLISHED)
> >> +               else if (sk->sk_state == TCP_ESTABLISHED && !tp->segs_out)
> > The tp->segs_out field is only 32 bits wide. By my math, at 200
> > Gbit/sec with 1500 byte MTU it can wrap roughly every 260 secs. So a
> > caller could get unlucky or carefully sequence its call to
> > TCP_REPAIR_OPTIONS (based on packets sent so far) to mess up the
> > accounting and trigger the kernel warning.
> >
> > How about using some other method to determine if this is safe?
> > Perhaps using tp->bytes_sent, which is a 64-bit field, which by my
> > math would take 23 years to wrap at 200 Gbit/sec?
> >
> > If we're more paranoid about wrapping we could also check
> > tp->packets_out, and refuse to allow TCP_REPAIR_OPTIONS if either
> > tp->bytes_sent or tp->packets_out are non-zero. (Or if we're even more
> > paranoid I suppose we could have a special new bit to track whether
> > we've ever sent something, but that probably seems like overkill?)
> >
> > neal
> > .
>
> I didn't notice that u32 will be easily wrapped in huge network throughput,
> thank you neal.
>
> But tcp->packets_out shoud not be used because tp->packets_out can decrease
> when expected ack is received, so it can decrease to 0 and this is the common
> condition.

Right, so just use tp->bytes_sent

I doubt syzbot will be patient enough to wait for an overflow.
  
Neal Cardwell Nov. 3, 2022, 1:54 p.m. UTC | #4
On Wed, Nov 2, 2022 at 10:11 PM luwei (O) <luwei32@huawei.com> wrote:
>
>
> 在 2022/11/2 10:46 PM, Neal Cardwell 写道:
> > On Wed, Nov 2, 2022 at 8:23 AM Lu Wei <luwei32@huawei.com> wrote:
> >> If setsockopt with option name of TCP_REPAIR_OPTIONS and opt_code
> >> of TCPOPT_SACK_PERM is called to enable sack after data is sent
> >> and before data is acked, ...
> > This "before data is acked" phrase does not quite seem to match the
> > sequence below, AFAICT?
> >
> > How about something like:
> >
> >   If setsockopt TCP_REPAIR_OPTIONS with opt_code TCPOPT_SACK_PERM
> >   is called to enable SACK after data is sent and the data sender receives a
> >   dupack, ...
>       yes, thanks for suggestion
> >
> >
> >> ... it will trigger a warning in function
> >> tcp_verify_left_out() as follows:
> >>
> >> ============================================
> >> WARNING: CPU: 8 PID: 0 at net/ipv4/tcp_input.c:2132
> >> tcp_timeout_mark_lost+0x154/0x160
> >> tcp_enter_loss+0x2b/0x290
> >> tcp_retransmit_timer+0x50b/0x640
> >> tcp_write_timer_handler+0x1c8/0x340
> >> tcp_write_timer+0xe5/0x140
> >> call_timer_fn+0x3a/0x1b0
> >> __run_timers.part.0+0x1bf/0x2d0
> >> run_timer_softirq+0x43/0xb0
> >> __do_softirq+0xfd/0x373
> >> __irq_exit_rcu+0xf6/0x140
> >>
> >> The warning is caused in the following steps:
> >> 1. a socket named socketA is created
> >> 2. socketA enters repair mode without build a connection
> >> 3. socketA calls connect() and its state is changed to TCP_ESTABLISHED
> >>     directly
> >> 4. socketA leaves repair mode
> >> 5. socketA calls sendmsg() to send data, packets_out and sack_outs(dup
> >>     ack receives) increase
> >> 6. socketA enters repair mode again
> >> 7. socketA calls setsockopt with TCPOPT_SACK_PERM to enable sack
> >> 8. retransmit timer expires, it calls tcp_timeout_mark_lost(), lost_out
> >>     increases
> >> 9. sack_outs + lost_out > packets_out triggers since lost_out and
> >>     sack_outs increase repeatly
> >>
> >> In function tcp_timeout_mark_lost(), tp->sacked_out will be cleared if
> >> Step7 not happen and the warning will not be triggered. As suggested by
> >> Denis and Eric, TCP_REPAIR_OPTIONS should be prohibited if data was
> >> already sent. So this patch checks tp->segs_out, only TCP_REPAIR_OPTIONS
> >> can be set only if tp->segs_out is 0.
> >>
> >> socket-tcp tests in CRIU has been tested as follows:
> >> $ sudo ./test/zdtm.py run -t zdtm/static/socket-tcp*  --keep-going \
> >>         --ignore-taint
> >>
> >> socket-tcp* represent all socket-tcp tests in test/zdtm/static/.
> >>
> >> Fixes: b139ba4e90dc ("tcp: Repair connection-time negotiated parameters")
> >> Signed-off-by: Lu Wei <luwei32@huawei.com>
> >> ---
> >>   net/ipv4/tcp.c | 2 +-
> >>   1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> >> index ef14efa1fb70..1f5cc32cf0cc 100644
> >> --- a/net/ipv4/tcp.c
> >> +++ b/net/ipv4/tcp.c
> >> @@ -3647,7 +3647,7 @@ int do_tcp_setsockopt(struct sock *sk, int level, int optname,
> >>          case TCP_REPAIR_OPTIONS:
> >>                  if (!tp->repair)
> >>                          err = -EINVAL;
> >> -               else if (sk->sk_state == TCP_ESTABLISHED)
> >> +               else if (sk->sk_state == TCP_ESTABLISHED && !tp->segs_out)
> > The tp->segs_out field is only 32 bits wide. By my math, at 200
> > Gbit/sec with 1500 byte MTU it can wrap roughly every 260 secs. So a
> > caller could get unlucky or carefully sequence its call to
> > TCP_REPAIR_OPTIONS (based on packets sent so far) to mess up the
> > accounting and trigger the kernel warning.
> >
> > How about using some other method to determine if this is safe?
> > Perhaps using tp->bytes_sent, which is a 64-bit field, which by my
> > math would take 23 years to wrap at 200 Gbit/sec?
> >
> > If we're more paranoid about wrapping we could also check
> > tp->packets_out, and refuse to allow TCP_REPAIR_OPTIONS if either
> > tp->bytes_sent or tp->packets_out are non-zero. (Or if we're even more
> > paranoid I suppose we could have a special new bit to track whether
> > we've ever sent something, but that probably seems like overkill?)
> >
> > neal
> > .
>
> I didn't notice that u32 will be easily wrapped in huge network throughput,
> thank you neal.
>
> But tcp->packets_out shoud not be used because tp->packets_out can decrease
> when expected ack is received, so it can decrease to 0 and this is the common
> condition.

To say tp->packets_out should not be used is a bit strong. :-)
Obviously packets_out decreases when packets are ACKed. The point of
checking both tp->bytes_sent and tp->packets_out would be if we are
paranoid enough that we want to prevent this warning in the case where
tp->bytes_sent wraps and becomes zero. If tp->bytes_sent wraps and is
zero, we will be saved from hitting this warning if we deny the
request to set TCP_REPAIR_OPTIONS if tp->packets_out is non-zero.
(Because we can only hit this warning if tp->sacked_out is non-zero,
and tp->sacked_out should only be non-zero if tp->packets_out is
non-zero.)

neal
  

Patch

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ef14efa1fb70..1f5cc32cf0cc 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3647,7 +3647,7 @@  int do_tcp_setsockopt(struct sock *sk, int level, int optname,
 	case TCP_REPAIR_OPTIONS:
 		if (!tp->repair)
 			err = -EINVAL;
-		else if (sk->sk_state == TCP_ESTABLISHED)
+		else if (sk->sk_state == TCP_ESTABLISHED && !tp->segs_out)
 			err = tcp_repair_options_est(sk, optval, optlen);
 		else
 			err = -EPERM;