[net-next,2/3] net: tcp: send zero-window when no memory

Message ID 20230517124201.441634-3-imagedong@tencent.com
State New
Series net: tcp: add support of window shrink

Commit Message

Menglong Dong May 17, 2023, 12:42 p.m. UTC
  From: Menglong Dong <imagedong@tencent.com>

For now, the skb will be dropped when there is no memory, which makes the
client keep retransmitting until timeout, and it's not friendly to the users.

Therefore, we now force the current socket to receive one packet when
protocol memory is over its limit. Then, this socket will stay in
'no mem' status until protocol memory is available again.

When a socket is in 'no mem' status, its receive window becomes 0,
which means a window shrink happens. The sender needs to handle such
a window shrink properly, which is done in the next commit.

Signed-off-by: Menglong Dong <imagedong@tencent.com>
---
 include/net/sock.h    |  1 +
 net/ipv4/tcp_input.c  | 12 ++++++++++++
 net/ipv4/tcp_output.c |  7 +++++++
 3 files changed, 20 insertions(+)
  

Comments

Eric Dumazet May 17, 2023, 2:44 p.m. UTC | #1
On Wed, May 17, 2023 at 2:42 PM <menglong8.dong@gmail.com> wrote:
>
> From: Menglong Dong <imagedong@tencent.com>
>
> For now, the skb will be dropped when there is no memory, which makes the
> client keep retransmitting until timeout, and it's not friendly to the users.

Yes, networking needs memory. Trying to deny it is a recipe for OOM.

>
> Therefore, we now force the current socket to receive one packet when
> protocol memory is over its limit. Then, this socket will stay in
> 'no mem' status until protocol memory is available again.
>

I think you missed one old patch.

commit ba3bb0e76ccd464bb66665a1941fabe55dadb3ba ("tcp: fix
SO_RCVLOWAT possible hangs under high mem pressure")



> When a socket is in 'no mem' status, its receive window becomes 0,
> which means a window shrink happens. The sender needs to handle such
> a window shrink properly, which is done in the next commit.
>
> Signed-off-by: Menglong Dong <imagedong@tencent.com>
> ---
>  include/net/sock.h    |  1 +
>  net/ipv4/tcp_input.c  | 12 ++++++++++++
>  net/ipv4/tcp_output.c |  7 +++++++
>  3 files changed, 20 insertions(+)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 5edf0038867c..90db8a1d7f31 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -957,6 +957,7 @@ enum sock_flags {
>         SOCK_XDP, /* XDP is attached */
>         SOCK_TSTAMP_NEW, /* Indicates 64 bit timestamps always */
>         SOCK_RCVMARK, /* Receive SO_MARK  ancillary data with packet */
> +       SOCK_NO_MEM, /* protocol memory limitation happened */
>  };
>
>  #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index a057330d6f59..56e395cb4554 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -5047,10 +5047,22 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
>                 if (skb_queue_len(&sk->sk_receive_queue) == 0)
>                         sk_forced_mem_schedule(sk, skb->truesize);

I think you missed this part: we accept at least one packet,
regardless of memory pressure, if the queue is empty.

So your changelog is misleading.

>                 else if (tcp_try_rmem_schedule(sk, skb, skb->truesize)) {
> +                       if (sysctl_tcp_wnd_shrink)

We no longer add global sysctls for TCP. All new sysctls must be per net-ns.

> +                               goto do_wnd_shrink;
> +
>                         reason = SKB_DROP_REASON_PROTO_MEM;
>                         NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRCVQDROP);
>                         sk->sk_data_ready(sk);
>                         goto drop;
> +do_wnd_shrink:
> +                       if (sock_flag(sk, SOCK_NO_MEM)) {
> +                               NET_INC_STATS(sock_net(sk),
> +                                             LINUX_MIB_TCPRCVQDROP);
> +                               sk->sk_data_ready(sk);
> +                               goto out_of_window;
> +                       }
> +                       sk_forced_mem_schedule(sk, skb->truesize);

So now we would accept two packets per TCP socket, and yet EPOLLIN
will not be sent in time?

Packets can consume about 45*4K each; I do not think it is wise to
double receive queue sizes.

What you want instead is simply to send EPOLLIN sooner (when the first
packet is queued instead of when the second packet is dropped)
by changing sk_forced_mem_schedule() a bit.

This might matter for applications using SO_RCVLOWAT, but not for
other applications.
  
Menglong Dong May 18, 2023, 2:14 a.m. UTC | #2
On Wed, May 17, 2023 at 10:45 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, May 17, 2023 at 2:42 PM <menglong8.dong@gmail.com> wrote:
> >
> > From: Menglong Dong <imagedong@tencent.com>
> >
> > For now, the skb will be dropped when there is no memory, which makes the
> > client keep retransmitting until timeout, and it's not friendly to the users.
>
> Yes, networking needs memory. Trying to deny it is a recipe for OOM.
>
> >
> > Therefore, we now force the current socket to receive one packet when
> > protocol memory is over its limit. Then, this socket will stay in
> > 'no mem' status until protocol memory is available again.
> >
>
> I think you missed one old patch.
>
> commit ba3bb0e76ccd464bb66665a1941fabe55dadb3ba ("tcp: fix
> SO_RCVLOWAT possible hangs under high mem pressure")
>
>
>
> > When a socket is in 'no mem' status, its receive window becomes 0,
> > which means a window shrink happens. The sender needs to handle such
> > a window shrink properly, which is done in the next commit.
> >
> > Signed-off-by: Menglong Dong <imagedong@tencent.com>
> > ---
> >  include/net/sock.h    |  1 +
> >  net/ipv4/tcp_input.c  | 12 ++++++++++++
> >  net/ipv4/tcp_output.c |  7 +++++++
> >  3 files changed, 20 insertions(+)
> >
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index 5edf0038867c..90db8a1d7f31 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -957,6 +957,7 @@ enum sock_flags {
> >         SOCK_XDP, /* XDP is attached */
> >         SOCK_TSTAMP_NEW, /* Indicates 64 bit timestamps always */
> >         SOCK_RCVMARK, /* Receive SO_MARK  ancillary data with packet */
> > +       SOCK_NO_MEM, /* protocol memory limitation happened */
> >  };
> >
> >  #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index a057330d6f59..56e395cb4554 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -5047,10 +5047,22 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
> >                 if (skb_queue_len(&sk->sk_receive_queue) == 0)
> >                         sk_forced_mem_schedule(sk, skb->truesize);
>
> I think you missed this part: we accept at least one packet,
> regardless of memory pressure, if the queue is empty.
>
> So your changelog is misleading.

Sorry that I didn't describe the problem clearly enough. The problem
covers two cases.

Case 1:

The tcp_mem[2] limit causes packet drops. In some cases, applications
may not read the data in the socket receive queue quickly enough.
In my case, recv() is called every 5 minutes, and there are a lot of such
sockets. The tcp_mem[2] limit can be hit easily in such a case, and once
this happens, the skb is dropped (the receive queue is not empty) and
the sender retransmits the skb until timeout and the connection breaks.

Case 2:

The sender keeps sending small packets and makes the receive buffer full.
Meanwhile, the window is not zero, and the sender keeps retransmitting
until timeout, as the skbs are dropped by the receiver.

>
> >                 else if (tcp_try_rmem_schedule(sk, skb, skb->truesize)) {
> > +                       if (sysctl_tcp_wnd_shrink)
>
> We no longer add global sysctls for TCP. All new sysctls must be per net-ns.
>
> > +                               goto do_wnd_shrink;
> > +
> >                         reason = SKB_DROP_REASON_PROTO_MEM;
> >                         NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRCVQDROP);
> >                         sk->sk_data_ready(sk);
> >                         goto drop;
> > +do_wnd_shrink:
> > +                       if (sock_flag(sk, SOCK_NO_MEM)) {
> > +                               NET_INC_STATS(sock_net(sk),
> > +                                             LINUX_MIB_TCPRCVQDROP);
> > +                               sk->sk_data_ready(sk);
> > +                               goto out_of_window;
> > +                       }
> > +                       sk_forced_mem_schedule(sk, skb->truesize);
>
> So now we would accept two packets per TCP socket, and yet EPOLLIN
> will not be sent in time?
>
> Packets can consume about 45*4K each; I do not think it is wise to
> double receive queue sizes.
>

What we want to do here is to send an ACK with a zero window. It
may not be necessary to force receiving new data here, but this stays
consistent with the logic of 'tcp_may_update_window()': only a newer
'ack' in an ACK packet can shrink the window.

If we didn't receive the new data and sent a zero-window ACK directly
here, it would be weird, as the previous ACK with the same 'seq' and
'ack' had a non-zero window.

Thanks!
Menglong Dong

> What you want instead is simply to send EPOLLIN sooner (when the first
> packet is queued instead of when the second packet is dropped)
> by changing sk_forced_mem_schedule() a bit.
>
> This might matter for applications using SO_RCVLOWAT, but not for
> other applications.
  
Menglong Dong May 18, 2023, 2:25 p.m. UTC | #3
On Wed, May 17, 2023 at 10:45 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, May 17, 2023 at 2:42 PM <menglong8.dong@gmail.com> wrote:
> >
> > From: Menglong Dong <imagedong@tencent.com>
> >
> > For now, the skb will be dropped when there is no memory, which makes the
> > client keep retransmitting until timeout, and it's not friendly to the users.
>
> Yes, networking needs memory. Trying to deny it is a recipe for OOM.
>
> >
> > Therefore, we now force the current socket to receive one packet when
> > protocol memory is over its limit. Then, this socket will stay in
> > 'no mem' status until protocol memory is available again.
> >
>
> I think you missed one old patch.
>
> commit ba3bb0e76ccd464bb66665a1941fabe55dadb3ba ("tcp: fix
> SO_RCVLOWAT possible hangs under high mem pressure")
>
>
>
> > When a socket is in 'no mem' status, its receive window becomes 0,
> > which means a window shrink happens. The sender needs to handle such
> > a window shrink properly, which is done in the next commit.
> >
> > Signed-off-by: Menglong Dong <imagedong@tencent.com>
> > ---
> >  include/net/sock.h    |  1 +
> >  net/ipv4/tcp_input.c  | 12 ++++++++++++
> >  net/ipv4/tcp_output.c |  7 +++++++
> >  3 files changed, 20 insertions(+)
> >
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index 5edf0038867c..90db8a1d7f31 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -957,6 +957,7 @@ enum sock_flags {
> >         SOCK_XDP, /* XDP is attached */
> >         SOCK_TSTAMP_NEW, /* Indicates 64 bit timestamps always */
> >         SOCK_RCVMARK, /* Receive SO_MARK  ancillary data with packet */
> > +       SOCK_NO_MEM, /* protocol memory limitation happened */
> >  };
> >
> >  #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index a057330d6f59..56e395cb4554 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -5047,10 +5047,22 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
> >                 if (skb_queue_len(&sk->sk_receive_queue) == 0)
> >                         sk_forced_mem_schedule(sk, skb->truesize);
>
> I think you missed this part: we accept at least one packet,
> regardless of memory pressure, if the queue is empty.
>
> So your changelog is misleading.
>
> >                 else if (tcp_try_rmem_schedule(sk, skb, skb->truesize)) {
> > +                       if (sysctl_tcp_wnd_shrink)
>
> We no longer add global sysctls for TCP. All new sysctls must be per net-ns.
>
> > +                               goto do_wnd_shrink;
> > +
> >                         reason = SKB_DROP_REASON_PROTO_MEM;
> >                         NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRCVQDROP);
> >                         sk->sk_data_ready(sk);
> >                         goto drop;
> > +do_wnd_shrink:
> > +                       if (sock_flag(sk, SOCK_NO_MEM)) {
> > +                               NET_INC_STATS(sock_net(sk),
> > +                                             LINUX_MIB_TCPRCVQDROP);
> > +                               sk->sk_data_ready(sk);
> > +                               goto out_of_window;
> > +                       }
> > +                       sk_forced_mem_schedule(sk, skb->truesize);
>
> So now we would accept two packets per TCP socket, and yet EPOLLIN
> will not be sent in time?
>
> Packets can consume about 45*4K each; I do not think it is wise to
> double receive queue sizes.
>
> What you want instead is simply to send EPOLLIN sooner (when the first
> packet is queued instead of when the second packet is dropped)
> by changing sk_forced_mem_schedule() a bit.
>
> This might matter for applications using SO_RCVLOWAT, but not for
> other applications.

To be more clear, what I am talking about here is not sending EPOLLIN
sooner, but trying to make a TCP connection that has a "hung" receiver
and is under TCP protocol memory pressure enter the 0-probe state.
This commit is the first step: make the receiver shrink the window by
sending a zero-window ACK.

Thanks!
Menglong Dong
  

Patch

diff --git a/include/net/sock.h b/include/net/sock.h
index 5edf0038867c..90db8a1d7f31 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -957,6 +957,7 @@ enum sock_flags {
 	SOCK_XDP, /* XDP is attached */
 	SOCK_TSTAMP_NEW, /* Indicates 64 bit timestamps always */
 	SOCK_RCVMARK, /* Receive SO_MARK  ancillary data with packet */
+	SOCK_NO_MEM, /* protocol memory limitation happened */
 };
 
 #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a057330d6f59..56e395cb4554 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5047,10 +5047,22 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 		if (skb_queue_len(&sk->sk_receive_queue) == 0)
 			sk_forced_mem_schedule(sk, skb->truesize);
 		else if (tcp_try_rmem_schedule(sk, skb, skb->truesize)) {
+			if (sysctl_tcp_wnd_shrink)
+				goto do_wnd_shrink;
+
 			reason = SKB_DROP_REASON_PROTO_MEM;
 			NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRCVQDROP);
 			sk->sk_data_ready(sk);
 			goto drop;
+do_wnd_shrink:
+			if (sock_flag(sk, SOCK_NO_MEM)) {
+				NET_INC_STATS(sock_net(sk),
+					      LINUX_MIB_TCPRCVQDROP);
+				sk->sk_data_ready(sk);
+				goto out_of_window;
+			}
+			sk_forced_mem_schedule(sk, skb->truesize);
+			sock_set_flag(sk, SOCK_NO_MEM);
 		}
 
 		eaten = tcp_queue_rcv(sk, skb, &fragstolen);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index cfe128b81a01..21dc4f7e0a12 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -300,6 +300,13 @@ static u16 tcp_select_window(struct sock *sk)
 		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPFROMZEROWINDOWADV);
 	}
 
+	if (sock_flag(sk, SOCK_NO_MEM)) {
+		if (sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 2))
+			sock_reset_flag(sk, SOCK_NO_MEM);
+		else
+			new_win = 0;
+	}
+
 	return new_win;
 }