[net-next] mptcp: fix rcv buffer auto-tuning

Message ID 20230720-upstream-net-next-20230720-mptcp-fix-rcv-buffer-auto-tuning-v1-1-175ef12b8380@tessares.net
State New
Headers
Series [net-next] mptcp: fix rcv buffer auto-tuning |

Commit Message

Matthieu Baerts July 20, 2023, 6:47 p.m. UTC
  From: Paolo Abeni <pabeni@redhat.com>

The MPTCP code uses the assumption that the tcp_win_from_space() helper
does not use any TCP-specific field, and thus works correctly operating
on an MPTCP socket.

The commit dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale")
broke such assumption, and as a consequence most MPTCP connections stall
on zero-window event due to auto-tuning changing the rcv buffer size
quite randomly.

Address the issue syncing again the MPTCP auto-tuning code with the TCP
one. To achieve that, factor out the windows size logic in socket
independent helpers, and reuse them in mptcp_rcv_space_adjust(). The
MPTCP level scaling_ratio is selected as the minimum one from the all
the subflows, as a worst-case estimate.

Fixes: dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
---
 include/net/tcp.h    | 20 +++++++++++++++-----
 net/mptcp/protocol.c | 15 +++++++--------
 net/mptcp/protocol.h |  8 +++++++-
 net/mptcp/subflow.c  |  2 +-
 4 files changed, 30 insertions(+), 15 deletions(-)


---
base-commit: b44693495af8f309b8ddec4b30833085d1c2d0c4
change-id: 20230720-upstream-net-next-20230720-mptcp-fix-rcv-buffer-auto-tuning-bb759d4a6111

Best regards,
  

Comments

Eric Dumazet July 20, 2023, 7:07 p.m. UTC | #1
On Thu, Jul 20, 2023 at 8:48 PM Matthieu Baerts
<matthieu.baerts@tessares.net> wrote:
>
> From: Paolo Abeni <pabeni@redhat.com>
>
> The MPTCP code uses the assumption that the tcp_win_from_space() helper
> does not use any TCP-specific field, and thus works correctly operating
> on an MPTCP socket.
>
> The commit dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale")
> broke such assumption, and as a consequence most MPTCP connections stall
> on zero-window event due to auto-tuning changing the rcv buffer size
> quite randomly.
>
> Address the issue syncing again the MPTCP auto-tuning code with the TCP
> one. To achieve that, factor out the windows size logic in socket
> independent helpers, and reuse them in mptcp_rcv_space_adjust(). The
> MPTCP level scaling_ratio is selected as the minimum one from the all
> the subflows, as a worst-case estimate.
>
> Fixes: dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale")
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
> ---
>  include/net/tcp.h    | 20 +++++++++++++++-----
>  net/mptcp/protocol.c | 15 +++++++--------
>  net/mptcp/protocol.h |  8 +++++++-
>  net/mptcp/subflow.c  |  2 +-
>  4 files changed, 30 insertions(+), 15 deletions(-)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index c5fb90079920..794642fbd724 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -1430,22 +1430,32 @@ void tcp_select_initial_window(const struct sock *sk, int __space,
>                                __u32 *window_clamp, int wscale_ok,
>                                __u8 *rcv_wscale, __u32 init_rcv_wnd);
>
> -static inline int tcp_win_from_space(const struct sock *sk, int space)
> +static inline int __tcp_win_from_space(u8 scaling_ratio, int space)
>  {
> -       s64 scaled_space = (s64)space * tcp_sk(sk)->scaling_ratio;
> +       s64 scaled_space = (s64)space * scaling_ratio;
>
>         return scaled_space >> TCP_RMEM_TO_WIN_SCALE;
>  }
>
> -/* inverse of tcp_win_from_space() */
> -static inline int tcp_space_from_win(const struct sock *sk, int win)
> +static inline int tcp_win_from_space(const struct sock *sk, int space)

Maybe in a follow up patch we could change the prototype of this helper
to avoid future mis use :)

static inline int tcp_win_from_space(const struct tcp_sock *tp, int space)
{
}

Reviewed-by: Eric Dumazet <edumazet@google.com>


> +{
> +       return __tcp_win_from_space(tcp_sk(sk)->scaling_ratio, space);
> +}
> +
> +/* inverse of __tcp_win_from_space() */
> +static inline int __tcp_space_from_win(u8 scaling_ratio, int win)
>  {
>         u64 val = (u64)win << TCP_RMEM_TO_WIN_SCALE;
>
> -       do_div(val, tcp_sk(sk)->scaling_ratio);
> +       do_div(val, scaling_ratio);
>         return val;
>  }
>
> +static inline int tcp_space_from_win(const struct sock *sk, int win

Same remark here.

Thanks for the fix !
  
Soheil Hassas Yeganeh July 20, 2023, 7:14 p.m. UTC | #2
On Thu, Jul 20, 2023 at 3:07 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Jul 20, 2023 at 8:48 PM Matthieu Baerts
> <matthieu.baerts@tessares.net> wrote:
> >
> > From: Paolo Abeni <pabeni@redhat.com>
> >
> > The MPTCP code uses the assumption that the tcp_win_from_space() helper
> > does not use any TCP-specific field, and thus works correctly operating
> > on an MPTCP socket.
> >
> > The commit dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale")
> > broke such assumption, and as a consequence most MPTCP connections stall
> > on zero-window event due to auto-tuning changing the rcv buffer size
> > quite randomly.
> >
> > Address the issue syncing again the MPTCP auto-tuning code with the TCP
> > one. To achieve that, factor out the windows size logic in socket
> > independent helpers, and reuse them in mptcp_rcv_space_adjust(). The
> > MPTCP level scaling_ratio is selected as the minimum one from the all
> > the subflows, as a worst-case estimate.
> >
> > Fixes: dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale")
> > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
> > Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
> > ---
> >  include/net/tcp.h    | 20 +++++++++++++++-----
> >  net/mptcp/protocol.c | 15 +++++++--------
> >  net/mptcp/protocol.h |  8 +++++++-
> >  net/mptcp/subflow.c  |  2 +-
> >  4 files changed, 30 insertions(+), 15 deletions(-)
> >
> > diff --git a/include/net/tcp.h b/include/net/tcp.h
> > index c5fb90079920..794642fbd724 100644
> > --- a/include/net/tcp.h
> > +++ b/include/net/tcp.h
> > @@ -1430,22 +1430,32 @@ void tcp_select_initial_window(const struct sock *sk, int __space,
> >                                __u32 *window_clamp, int wscale_ok,
> >                                __u8 *rcv_wscale, __u32 init_rcv_wnd);
> >
> > -static inline int tcp_win_from_space(const struct sock *sk, int space)
> > +static inline int __tcp_win_from_space(u8 scaling_ratio, int space)
> >  {
> > -       s64 scaled_space = (s64)space * tcp_sk(sk)->scaling_ratio;
> > +       s64 scaled_space = (s64)space * scaling_ratio;
> >
> >         return scaled_space >> TCP_RMEM_TO_WIN_SCALE;
> >  }
> >
> > -/* inverse of tcp_win_from_space() */
> > -static inline int tcp_space_from_win(const struct sock *sk, int win)
> > +static inline int tcp_win_from_space(const struct sock *sk, int space)
>
> Maybe in a follow up patch we could change the prototype of this helper
> to avoid future mis use :)
>
> static inline int tcp_win_from_space(const struct tcp_sock *tp, int space)
> {
> }
>
> Reviewed-by: Eric Dumazet <edumazet@google.com>

Acked-by: Soheil Hassas Yeganeh <soheil@google.com>

Thank you for the fix!

> > +{
> > +       return __tcp_win_from_space(tcp_sk(sk)->scaling_ratio, space);
> > +}
> > +
> > +/* inverse of __tcp_win_from_space() */
> > +static inline int __tcp_space_from_win(u8 scaling_ratio, int win)
> >  {
> >         u64 val = (u64)win << TCP_RMEM_TO_WIN_SCALE;
> >
> > -       do_div(val, tcp_sk(sk)->scaling_ratio);
> > +       do_div(val, scaling_ratio);
> >         return val;
> >  }
> >
> > +static inline int tcp_space_from_win(const struct sock *sk, int win
>
> Same remark here.
>
> Thanks for the fix !
  
patchwork-bot+netdevbpf@kernel.org July 25, 2023, 12:20 a.m. UTC | #3
Hello:

This patch was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Thu, 20 Jul 2023 20:47:50 +0200 you wrote:
> From: Paolo Abeni <pabeni@redhat.com>
> 
> The MPTCP code uses the assumption that the tcp_win_from_space() helper
> does not use any TCP-specific field, and thus works correctly operating
> on an MPTCP socket.
> 
> The commit dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale")
> broke such assumption, and as a consequence most MPTCP connections stall
> on zero-window event due to auto-tuning changing the rcv buffer size
> quite randomly.
> 
> [...]

Here is the summary with links:
  - [net-next] mptcp: fix rcv buffer auto-tuning
    https://git.kernel.org/netdev/net-next/c/b8dc6d6ce931

You are awesome, thank you!
  

Patch

diff --git a/include/net/tcp.h b/include/net/tcp.h
index c5fb90079920..794642fbd724 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1430,22 +1430,32 @@  void tcp_select_initial_window(const struct sock *sk, int __space,
 			       __u32 *window_clamp, int wscale_ok,
 			       __u8 *rcv_wscale, __u32 init_rcv_wnd);
 
-static inline int tcp_win_from_space(const struct sock *sk, int space)
+static inline int __tcp_win_from_space(u8 scaling_ratio, int space)
 {
-	s64 scaled_space = (s64)space * tcp_sk(sk)->scaling_ratio;
+	s64 scaled_space = (s64)space * scaling_ratio;
 
 	return scaled_space >> TCP_RMEM_TO_WIN_SCALE;
 }
 
-/* inverse of tcp_win_from_space() */
-static inline int tcp_space_from_win(const struct sock *sk, int win)
+static inline int tcp_win_from_space(const struct sock *sk, int space)
+{
+	return __tcp_win_from_space(tcp_sk(sk)->scaling_ratio, space);
+}
+
+/* inverse of __tcp_win_from_space() */
+static inline int __tcp_space_from_win(u8 scaling_ratio, int win)
 {
 	u64 val = (u64)win << TCP_RMEM_TO_WIN_SCALE;
 
-	do_div(val, tcp_sk(sk)->scaling_ratio);
+	do_div(val, scaling_ratio);
 	return val;
 }
 
+static inline int tcp_space_from_win(const struct sock *sk, int win)
+{
+	return __tcp_space_from_win(tcp_sk(sk)->scaling_ratio, win);
+}
+
 static inline void tcp_scaling_ratio_init(struct sock *sk)
 {
 	/* Assume a conservative default of 1200 bytes of payload per 4K page.
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 3613489eb6e3..553828ef62df 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -90,6 +90,7 @@  static int __mptcp_socket_create(struct mptcp_sock *msk)
 	if (err)
 		return err;
 
+	msk->scaling_ratio = tcp_sk(ssock->sk)->scaling_ratio;
 	WRITE_ONCE(msk->first, ssock->sk);
 	WRITE_ONCE(msk->subflow, ssock);
 	subflow = mptcp_subflow_ctx(ssock->sk);
@@ -1881,6 +1882,7 @@  static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
 {
 	struct mptcp_subflow_context *subflow;
 	struct sock *sk = (struct sock *)msk;
+	u8 scaling_ratio = U8_MAX;
 	u32 time, advmss = 1;
 	u64 rtt_us, mstamp;
 
@@ -1911,9 +1913,11 @@  static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
 
 		rtt_us = max(sf_rtt_us, rtt_us);
 		advmss = max(sf_advmss, advmss);
+		scaling_ratio = min(tp->scaling_ratio, scaling_ratio);
 	}
 
 	msk->rcvq_space.rtt_us = rtt_us;
+	msk->scaling_ratio = scaling_ratio;
 	if (time < (rtt_us >> 3) || rtt_us == 0)
 		return;
 
@@ -1922,8 +1926,8 @@  static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
 
 	if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_moderate_rcvbuf) &&
 	    !(sk->sk_userlocks & SOCK_RCVBUF_LOCK)) {
-		int rcvmem, rcvbuf;
 		u64 rcvwin, grow;
+		int rcvbuf;
 
 		rcvwin = ((u64)msk->rcvq_space.copied << 1) + 16 * advmss;
 
@@ -1932,18 +1936,13 @@  static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
 		do_div(grow, msk->rcvq_space.space);
 		rcvwin += (grow << 1);
 
-		rcvmem = SKB_TRUESIZE(advmss + MAX_TCP_HEADER);
-		while (tcp_win_from_space(sk, rcvmem) < advmss)
-			rcvmem += 128;
-
-		do_div(rcvwin, advmss);
-		rcvbuf = min_t(u64, rcvwin * rcvmem,
+		rcvbuf = min_t(u64, __tcp_space_from_win(scaling_ratio, rcvwin),
 			       READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_rmem[2]));
 
 		if (rcvbuf > sk->sk_rcvbuf) {
 			u32 window_clamp;
 
-			window_clamp = tcp_win_from_space(sk, rcvbuf);
+			window_clamp = __tcp_win_from_space(scaling_ratio, rcvbuf);
 			WRITE_ONCE(sk->sk_rcvbuf, rcvbuf);
 
 			/* Make subflows follow along.  If we do not do this, we
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 37fbe22e2433..795f422e8597 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -321,6 +321,7 @@  struct mptcp_sock {
 		u64	time;	/* start time of measurement window */
 		u64	rtt_us; /* last maximum rtt of subflows */
 	} rcvq_space;
+	u8		scaling_ratio;
 
 	u32		subflow_id;
 	u32		setsockopt_seq;
@@ -351,9 +352,14 @@  static inline int __mptcp_rmem(const struct sock *sk)
 	return atomic_read(&sk->sk_rmem_alloc) - READ_ONCE(mptcp_sk(sk)->rmem_released);
 }
 
+static inline int mptcp_win_from_space(const struct sock *sk, int space)
+{
+	return __tcp_win_from_space(mptcp_sk(sk)->scaling_ratio, space);
+}
+
 static inline int __mptcp_space(const struct sock *sk)
 {
-	return tcp_win_from_space(sk, READ_ONCE(sk->sk_rcvbuf) - __mptcp_rmem(sk));
+	return mptcp_win_from_space(sk, READ_ONCE(sk->sk_rcvbuf) - __mptcp_rmem(sk));
 }
 
 static inline struct mptcp_data_frag *mptcp_send_head(const struct sock *sk)
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 9ee3b7abbaf6..ad7080fabb2f 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -1359,7 +1359,7 @@  void mptcp_space(const struct sock *ssk, int *space, int *full_space)
 	const struct sock *sk = subflow->conn;
 
 	*space = __mptcp_space(sk);
-	*full_space = tcp_full_space(sk);
+	*full_space = mptcp_win_from_space(sk, READ_ONCE(sk->sk_rcvbuf));
 }
 
 void __mptcp_error_report(struct sock *sk)