[RFC,03/28] tcp: Support MSG_SPLICE_PAGES

Message ID 20230316152618.711970-4-dhowells@redhat.com
State New
Headers
Series splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES) |

Commit Message

David Howells March 16, 2023, 3:25 p.m. UTC
  Make TCP's sendmsg() support MSG_SPLICE_PAGES.  This causes pages to be
spliced from the source iterator if possible (the iterator must be
ITER_BVEC and the pages must be spliceable).

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
---
 net/ipv4/tcp.c | 59 +++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 53 insertions(+), 6 deletions(-)
  

Comments

Willem de Bruijn March 16, 2023, 6:37 p.m. UTC | #1
David Howells wrote:
> Make TCP's sendmsg() support MSG_SPLICE_PAGES.  This causes pages to be
> spliced from the source iterator if possible (the iterator must be
> ITER_BVEC and the pages must be spliceable).
> 
> This allows ->sendpage() to be replaced by something that can handle
> multiple multipage folios in a single transaction.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Eric Dumazet <edumazet@google.com>
> cc: "David S. Miller" <davem@davemloft.net>
> cc: Jakub Kicinski <kuba@kernel.org>
> cc: Paolo Abeni <pabeni@redhat.com>
> cc: Jens Axboe <axboe@kernel.dk>
> cc: Matthew Wilcox <willy@infradead.org>
> cc: netdev@vger.kernel.org
> ---
>  net/ipv4/tcp.c | 59 +++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 53 insertions(+), 6 deletions(-)
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 288693981b00..77c0c69208a5 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1220,7 +1220,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>  	int flags, err, copied = 0;
>  	int mss_now = 0, size_goal, copied_syn = 0;
>  	int process_backlog = 0;
> -	bool zc = false;
> +	int zc = 0;
>  	long timeo;
>  
>  	flags = msg->msg_flags;
> @@ -1231,17 +1231,24 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>  		if (msg->msg_ubuf) {
>  			uarg = msg->msg_ubuf;
>  			net_zcopy_get(uarg);
> -			zc = sk->sk_route_caps & NETIF_F_SG;
> +			if (sk->sk_route_caps & NETIF_F_SG)
> +				zc = 1;
>  		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
>  			uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
>  			if (!uarg) {
>  				err = -ENOBUFS;
>  				goto out_err;
>  			}
> -			zc = sk->sk_route_caps & NETIF_F_SG;
> -			if (!zc)
> +			if (sk->sk_route_caps & NETIF_F_SG)
> +				zc = 1;
> +			else
>  				uarg_to_msgzc(uarg)->zerocopy = 0;
>  		}
> +	} else if (unlikely(flags & MSG_SPLICE_PAGES) && size) {
> +		if (!iov_iter_is_bvec(&msg->msg_iter))
> +			return -EINVAL;
> +		if (sk->sk_route_caps & NETIF_F_SG)
> +			zc = 2;
>  	}

The commit message mentions MSG_SPLICE_PAGES as an internal flag.

It can be passed from userspace. The code anticipates that and checks
preconditions.

A side effect is that legacy applications that may already be setting
this bit in the flags now start failing. Most socket types are
historically permissive and simply ignore undefined flags.

With MSG_ZEROCOPY we chose to be extra cautious and added
SOCK_ZEROCOPY, only testing the MSG_ZEROCOPY bit if this socket option
is explicitly enabled. Perhaps more cautious than necessary, but FYI.
  
David Howells March 16, 2023, 6:44 p.m. UTC | #2
Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:

> The commit message mentions MSG_SPLICE_PAGES as an internal flag.
> 
> It can be passed from userspace. The code anticipates that and checks
> preconditions.

Should I add a separate field in the in-kernel msghdr struct for such internal
flags?  That would also avoid putting an internal flag in the same space as
the uapi flags.

David
  
Willem de Bruijn March 16, 2023, 7 p.m. UTC | #3
David Howells wrote:
> Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
> 
> > The commit message mentions MSG_SPLICE_PAGES as an internal flag.
> > 
> > It can be passed from userspace. The code anticipates that and checks
> > preconditions.
> 
> Should I add a separate field in the in-kernel msghdr struct for such internal
> flags?  That would also avoid putting an internal flag in the same space as
> the uapi flags.

That would work, if no cost to common paths that don't need it.

A not very pretty alternative would be to add an an extra arg to each
sendmsg handler that is used only when called from sendpage.

There are a few other internal MSG_.. flags, such as
MSG_SENDPAGE_NOPOLICY. Those are all limited to sendpage, and ignored
in sendmsg, I think. Which would explain why it was clearly safe to
add them.
  
David Howells March 21, 2023, 12:38 a.m. UTC | #4
Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:

> David Howells wrote:
> > Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
> > 
> > > The commit message mentions MSG_SPLICE_PAGES as an internal flag.
> > > 
> > > It can be passed from userspace. The code anticipates that and checks
> > > preconditions.
> > 
> > Should I add a separate field in the in-kernel msghdr struct for such internal
> > flags?  That would also avoid putting an internal flag in the same space as
> > the uapi flags.
> 
> That would work, if no cost to common paths that don't need it.

Actually, it might be tricky.  __ip_append_data() doesn't take a msghdr struct
pointer per se.  The "void *from" argument *might* point to one - but it
depends on seeing a MSG_SPLICE_PAGES or MSG_ZEROCOPY flag, otherwise we don't
know.

Possibly this changes if sendpage goes away.

> A not very pretty alternative would be to add an an extra arg to each
> sendmsg handler that is used only when called from sendpage.
> 
> There are a few other internal MSG_.. flags, such as
> MSG_SENDPAGE_NOPOLICY. Those are all limited to sendpage, and ignored
> in sendmsg, I think. Which would explain why it was clearly safe to
> add them.

Should those be moved across to the internal flags with MSG_SPLICE_PAGES?

David
  
Willem de Bruijn March 21, 2023, 2:22 p.m. UTC | #5
David Howells wrote:
> Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
> 
> > David Howells wrote:
> > > Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
> > > 
> > > > The commit message mentions MSG_SPLICE_PAGES as an internal flag.
> > > > 
> > > > It can be passed from userspace. The code anticipates that and checks
> > > > preconditions.
> > > 
> > > Should I add a separate field in the in-kernel msghdr struct for such internal
> > > flags?  That would also avoid putting an internal flag in the same space as
> > > the uapi flags.
> > 
> > That would work, if no cost to common paths that don't need it.
> 
> Actually, it might be tricky.  __ip_append_data() doesn't take a msghdr struct
> pointer per se.  The "void *from" argument *might* point to one - but it
> depends on seeing a MSG_SPLICE_PAGES or MSG_ZEROCOPY flag, otherwise we don't
> know.
> 
> Possibly this changes if sendpage goes away.

Is it sufficient to mask out this bit in tcp_sendmsg_locked and
udp_sendmsg if passed from userspace (and should be ignored), and pass
it through flags to callees like ip_append_data?
> 
> > A not very pretty alternative would be to add an an extra arg to each
> > sendmsg handler that is used only when called from sendpage.
> > 
> > There are a few other internal MSG_.. flags, such as
> > MSG_SENDPAGE_NOPOLICY. Those are all limited to sendpage, and ignored
> > in sendmsg, I think. Which would explain why it was clearly safe to
> > add them.
> 
> Should those be moved across to the internal flags with MSG_SPLICE_PAGES?

I would not include that in this patch series.
  
Willem de Bruijn March 23, 2023, 1:17 a.m. UTC | #6
David Howells wrote:
> Hi Willem,
> 
> Here's another option to passing MSG_SPLICE_PAGES into sendmsg()[1] without
> polluting the flags in msg->msg_flags.  The idea here is to put the flag
> into a new field in msghdr, msg_kflags, that holds internal kernel flags
> that aren't available to userspace.
> 
> What I've done here is:
> 
>  (1) Pass msg down to __ip_append_data() and __ip6_append_data() so that
>      they can access the extra flags.
> 
>  (2) In order to avoid adding extra arguments to these functions and the
>      functions in their call chains (such as ip_make_skb()), remove the
>      size and flags arguments as these values are redundant if msg is
>      passed in.
> 
>  (3) msg is then passed into getfrag().  I would like to get rid of the
>      "from" argument also in favour of using something in msghdr, but I'm
>      not sure how best to do that.
> 
>  (4) The size parameter to ->sendmsg() seems to be redundant; indeed
>      sock_sendmsg() doesn't actually take it, but rather gets the count
>      from msg_iter - so remove this parameter.
> 
>      kernel_sendmsg() will still take a size, but it sets it on the
>      iterator and then calls sock_sendmsg().
> 
>  (5) Protocol sendmsg implementations then extract the length and the flags
>      from the iterator.
> 
>  (6) Illustrate the addition of msg_kflags and MSG_SPLICE_PAGES.  I think
>      that, at some point in the future, some of the other flags could be
>      moved from msg_flags to msg_kflags.
> 
> David
> 
> Link: https://lore.kernel.org/r/20230316152618.711970-1-dhowells@redhat.com/ [1]
> 
> David Howells (3):
>   net: Drop the size argument from ->sendmsg()
>   ip: Make __ip{,6}_append_data() and co. take a msghdr*
>   net: Declare MSG_SPLICE_PAGES internal sendmsg() flag
> 
>  crypto/af_alg.c                               | 12 +--
>  crypto/algif_aead.c                           |  9 +--
>  crypto/algif_hash.c                           |  8 +-
>  crypto/algif_rng.c                            |  3 +-
>  crypto/algif_skcipher.c                       | 10 +--
>  drivers/isdn/mISDN/socket.c                   |  3 +-
>  .../chelsio/inline_crypto/chtls/chtls.h       |  2 +-
>  .../chelsio/inline_crypto/chtls/chtls_io.c    | 15 ++--
>  drivers/net/ppp/pppoe.c                       |  4 +-
>  drivers/net/tap.c                             |  3 +-
>  drivers/net/tun.c                             |  3 +-
>  drivers/vhost/net.c                           |  6 +-
>  drivers/xen/pvcalls-back.c                    |  2 +-
>  drivers/xen/pvcalls-front.c                   |  4 +-
>  drivers/xen/pvcalls-front.h                   |  3 +-
>  fs/afs/rxrpc.c                                |  8 +-
>  include/crypto/if_alg.h                       |  3 +-
>  include/linux/lsm_hook_defs.h                 |  3 +-
>  include/linux/lsm_hooks.h                     |  1 -
>  include/linux/net.h                           |  6 +-
>  include/linux/security.h                      |  4 +-
>  include/linux/socket.h                        |  3 +
>  include/net/af_rxrpc.h                        |  3 +-
>  include/net/inet_common.h                     |  2 +-
>  include/net/ip.h                              | 24 +++---
>  include/net/ipv6.h                            | 22 +++---
>  include/net/ping.h                            |  7 +-
>  include/net/sock.h                            |  7 +-
>  include/net/tcp.h                             |  8 +-
>  include/net/udp.h                             |  2 +-
>  include/net/udplite.h                         |  4 +-
>  net/appletalk/ddp.c                           |  3 +-
>  net/atm/common.c                              |  3 +-
>  net/atm/common.h                              |  2 +-
>  net/ax25/af_ax25.c                            |  4 +-
>  net/bluetooth/hci_sock.c                      |  4 +-
>  net/bluetooth/iso.c                           |  4 +-
>  net/bluetooth/l2cap_sock.c                    |  5 +-
>  net/bluetooth/rfcomm/sock.c                   |  7 +-
>  net/bluetooth/sco.c                           |  4 +-
>  net/caif/caif_socket.c                        | 13 ++--
>  net/can/bcm.c                                 |  3 +-
>  net/can/isotp.c                               |  3 +-
>  net/can/j1939/socket.c                        |  4 +-
>  net/can/raw.c                                 |  3 +-
>  net/core/sock.c                               |  4 +-
>  net/dccp/dccp.h                               |  2 +-
>  net/dccp/proto.c                              |  3 +-
>  net/ieee802154/socket.c                       | 11 +--
>  net/ipv4/af_inet.c                            |  4 +-
>  net/ipv4/icmp.c                               | 14 ++--
>  net/ipv4/ip_output.c                          | 73 ++++++++++---------
>  net/ipv4/ping.c                               | 18 ++---
>  net/ipv4/raw.c                                | 23 +++---
>  net/ipv4/tcp.c                                | 17 +++--
>  net/ipv4/tcp_bpf.c                            |  5 +-
>  net/ipv4/tcp_input.c                          |  3 +-
>  net/ipv4/udp.c                                | 24 +++---
>  net/ipv6/af_inet6.c                           |  7 +-
>  net/ipv6/icmp.c                               | 21 ++++--
>  net/ipv6/ip6_output.c                         | 57 +++++++--------
>  net/ipv6/ping.c                               | 12 +--
>  net/ipv6/raw.c                                | 25 +++----
>  net/ipv6/udp.c                                | 26 ++++---
>  net/ipv6/udp_impl.h                           |  2 +-
>  net/iucv/af_iucv.c                            |  4 +-
>  net/kcm/kcmsock.c                             |  2 +-
>  net/key/af_key.c                              |  3 +-
>  net/l2tp/l2tp_ip.c                            |  3 +-
>  net/l2tp/l2tp_ip6.c                           |  3 +-
>  net/l2tp/l2tp_ppp.c                           |  4 +-
>  net/llc/af_llc.c                              |  5 +-
>  net/mctp/af_mctp.c                            |  3 +-
>  net/mptcp/protocol.c                          |  8 +-
>  net/netlink/af_netlink.c                      | 11 +--
>  net/netrom/af_netrom.c                        |  3 +-
>  net/nfc/llcp_sock.c                           |  7 +-
>  net/nfc/rawsock.c                             |  3 +-
>  net/packet/af_packet.c                        | 11 +--
>  net/phonet/datagram.c                         |  3 +-
>  net/phonet/pep.c                              |  3 +-
>  net/phonet/socket.c                           |  5 +-
>  net/qrtr/af_qrtr.c                            |  4 +-
>  net/rds/rds.h                                 |  2 +-
>  net/rds/send.c                                |  3 +-
>  net/rose/af_rose.c                            |  3 +-
>  net/rxrpc/af_rxrpc.c                          |  6 +-
>  net/rxrpc/ar-internal.h                       |  2 +-
>  net/rxrpc/output.c                            | 22 +++---
>  net/rxrpc/rxperf.c                            |  4 +-
>  net/rxrpc/sendmsg.c                           | 15 ++--
>  net/sctp/socket.c                             |  3 +-
>  net/smc/af_smc.c                              |  5 +-
>  net/socket.c                                  | 16 ++--
>  net/tipc/socket.c                             | 34 ++++-----
>  net/tls/tls.h                                 |  4 +-
>  net/tls/tls_device.c                          |  5 +-
>  net/tls/tls_sw.c                              |  2 +-
>  net/unix/af_unix.c                            | 19 +++--
>  net/vmw_vsock/af_vsock.c                      | 16 ++--
>  net/x25/af_x25.c                              |  3 +-
>  net/xdp/xsk.c                                 |  6 +-
>  net/xfrm/espintcp.c                           |  8 +-
>  security/apparmor/lsm.c                       |  6 +-
>  security/security.c                           |  4 +-
>  security/selinux/hooks.c                      |  3 +-
>  security/smack/smack_lsm.c                    |  4 +-
>  security/tomoyo/common.h                      |  3 +-
>  security/tomoyo/network.c                     |  4 +-
>  security/tomoyo/tomoyo.c                      |  6 +-
>  110 files changed, 444 insertions(+), 456 deletions(-)

That's a significant code change if only for this purpose.

If this bit is undefined and ignored by all socket families today,
masking it out in sock_sendmsg should be enough to start using it
safely as an internal flag.
  

Patch

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 288693981b00..77c0c69208a5 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1220,7 +1220,7 @@  int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 	int flags, err, copied = 0;
 	int mss_now = 0, size_goal, copied_syn = 0;
 	int process_backlog = 0;
-	bool zc = false;
+	int zc = 0;
 	long timeo;
 
 	flags = msg->msg_flags;
@@ -1231,17 +1231,24 @@  int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 		if (msg->msg_ubuf) {
 			uarg = msg->msg_ubuf;
 			net_zcopy_get(uarg);
-			zc = sk->sk_route_caps & NETIF_F_SG;
+			if (sk->sk_route_caps & NETIF_F_SG)
+				zc = 1;
 		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
 			uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
 			if (!uarg) {
 				err = -ENOBUFS;
 				goto out_err;
 			}
-			zc = sk->sk_route_caps & NETIF_F_SG;
-			if (!zc)
+			if (sk->sk_route_caps & NETIF_F_SG)
+				zc = 1;
+			else
 				uarg_to_msgzc(uarg)->zerocopy = 0;
 		}
+	} else if (unlikely(flags & MSG_SPLICE_PAGES) && size) {
+		if (!iov_iter_is_bvec(&msg->msg_iter))
+			return -EINVAL;
+		if (sk->sk_route_caps & NETIF_F_SG)
+			zc = 2;
 	}
 
 	if (unlikely(flags & MSG_FASTOPEN || inet_sk(sk)->defer_connect) &&
@@ -1345,7 +1352,7 @@  int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 		if (copy > msg_data_left(msg))
 			copy = msg_data_left(msg);
 
-		if (!zc) {
+		if (zc == 0) {
 			bool merge = true;
 			int i = skb_shinfo(skb)->nr_frags;
 			struct page_frag *pfrag = sk_page_frag(sk);
@@ -1390,7 +1397,7 @@  int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 				page_ref_inc(pfrag->page);
 			}
 			pfrag->offset += copy;
-		} else {
+		} else if (zc == 1)  {
 			/* First append to a fragless skb builds initial
 			 * pure zerocopy skb
 			 */
@@ -1411,6 +1418,46 @@  int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 			if (err < 0)
 				goto do_error;
 			copy = err;
+		} else if (zc == 2) {
+			/* Splice in data. */
+			const struct bio_vec *bv = msg->msg_iter.bvec;
+			size_t seg = iov_iter_single_seg_count(&msg->msg_iter);
+			size_t off = bv->bv_offset + msg->msg_iter.iov_offset;
+			bool can_coalesce;
+			int i = skb_shinfo(skb)->nr_frags;
+
+			if (copy > seg)
+				copy = seg;
+
+			can_coalesce = skb_can_coalesce(skb, i, bv->bv_page, off);
+			if (!can_coalesce && i >= READ_ONCE(sysctl_max_skb_frags)) {
+				tcp_mark_push(tp, skb);
+				goto new_segment;
+			}
+			if (tcp_downgrade_zcopy_pure(sk, skb))
+				goto wait_for_space;
+
+			copy = tcp_wmem_schedule(sk, copy);
+			if (!copy)
+				goto wait_for_space;
+
+			if (can_coalesce) {
+				skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
+			} else {
+				get_page(bv->bv_page);
+				skb_fill_page_desc_noacc(skb, i, bv->bv_page, off, copy);
+			}
+			iov_iter_advance(&msg->msg_iter, copy);
+
+			if (!(flags & MSG_NO_SHARED_FRAGS))
+				skb_shinfo(skb)->flags |= SKBFL_SHARED_FRAG;
+
+			skb->len += copy;
+			skb->data_len += copy;
+			skb->truesize += copy;
+			sk_wmem_queued_add(sk, copy);
+			sk_mem_charge(sk, copy);
+
 		}
 
 		if (!copied)