Message ID | 20230316152618.711970-4-dhowells@redhat.com |
---|---|
State | New |
Headers |
From: David Howells <dhowells@redhat.com>
To: Matthew Wilcox <willy@infradead.org>, "David S. Miller" <davem@davemloft.net>, Eric Dumazet <edumazet@google.com>, Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>
Cc: David Howells <dhowells@redhat.com>, Al Viro <viro@zeniv.linux.org.uk>, Christoph Hellwig <hch@infradead.org>, Jens Axboe <axboe@kernel.dk>, Jeff Layton <jlayton@kernel.org>, Christian Brauner <brauner@kernel.org>, Linus Torvalds <torvalds@linux-foundation.org>, netdev@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [RFC PATCH 03/28] tcp: Support MSG_SPLICE_PAGES
Date: Thu, 16 Mar 2023 15:25:53 +0000
Message-Id: <20230316152618.711970-4-dhowells@redhat.com>
In-Reply-To: <20230316152618.711970-1-dhowells@redhat.com>
References: <20230316152618.711970-1-dhowells@redhat.com>
List-ID: <linux-kernel.vger.kernel.org> |
Series |
splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES)
|
|
Commit Message
David Howells
March 16, 2023, 3:25 p.m. UTC
Make TCP's sendmsg() support MSG_SPLICE_PAGES. This causes pages to be
spliced from the source iterator if possible (the iterator must be
ITER_BVEC and the pages must be spliceable).
This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
---
net/ipv4/tcp.c | 59 +++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 53 insertions(+), 6 deletions(-)
Comments
David Howells wrote:
> Make TCP's sendmsg() support MSG_SPLICE_PAGES. This causes pages to be
> spliced from the source iterator if possible (the iterator must be
> ITER_BVEC and the pages must be spliceable).
>
> This allows ->sendpage() to be replaced by something that can handle
> multiple multipage folios in a single transaction.
>
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Eric Dumazet <edumazet@google.com>
> cc: "David S. Miller" <davem@davemloft.net>
> cc: Jakub Kicinski <kuba@kernel.org>
> cc: Paolo Abeni <pabeni@redhat.com>
> cc: Jens Axboe <axboe@kernel.dk>
> cc: Matthew Wilcox <willy@infradead.org>
> cc: netdev@vger.kernel.org
> ---
>  net/ipv4/tcp.c | 59 +++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 53 insertions(+), 6 deletions(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 288693981b00..77c0c69208a5 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1220,7 +1220,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>  	int flags, err, copied = 0;
>  	int mss_now = 0, size_goal, copied_syn = 0;
>  	int process_backlog = 0;
> -	bool zc = false;
> +	int zc = 0;
>  	long timeo;
>
>  	flags = msg->msg_flags;
> @@ -1231,17 +1231,24 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>  		if (msg->msg_ubuf) {
>  			uarg = msg->msg_ubuf;
>  			net_zcopy_get(uarg);
> -			zc = sk->sk_route_caps & NETIF_F_SG;
> +			if (sk->sk_route_caps & NETIF_F_SG)
> +				zc = 1;
>  		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
>  			uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
>  			if (!uarg) {
>  				err = -ENOBUFS;
>  				goto out_err;
>  			}
> -			zc = sk->sk_route_caps & NETIF_F_SG;
> -			if (!zc)
> +			if (sk->sk_route_caps & NETIF_F_SG)
> +				zc = 1;
> +			else
>  				uarg_to_msgzc(uarg)->zerocopy = 0;
>  		}
> +	} else if (unlikely(flags & MSG_SPLICE_PAGES) && size) {
> +		if (!iov_iter_is_bvec(&msg->msg_iter))
> +			return -EINVAL;
> +		if (sk->sk_route_caps & NETIF_F_SG)
> +			zc = 2;
>  	}

The commit message mentions MSG_SPLICE_PAGES as an internal flag.

It can be passed from userspace. The code anticipates that and checks
preconditions.

A side effect is that legacy applications that may already be setting
this bit in the flags now start failing. Most socket types are
historically permissive and simply ignore undefined flags.

With MSG_ZEROCOPY we chose to be extra cautious and added
SOCK_ZEROCOPY, only testing the MSG_ZEROCOPY bit if this socket option
is explicitly enabled. Perhaps more cautious than necessary, but FYI.
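
To make the contrast in the comment above concrete, here is a minimal userspace sketch of the two policies: opt-in gating (the MSG_ZEROCOPY/SOCK_ZEROCOPY approach) versus always inspecting the bit and failing on an unmet precondition (what the quoted hunk does for MSG_SPLICE_PAGES). It is a model, not kernel code. Every demo_* and DEMO_* name is hypothetical, and the bit values are chosen only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative bit values for this sketch only; not taken from the patch. */
#define DEMO_MSG_ZEROCOPY     0x4000000u
#define DEMO_MSG_SPLICE_PAGES 0x8000000u

/* Hypothetical stand-in for the per-socket state that matters here. */
struct demo_sock {
	bool zerocopy_opt_in;	/* models SOCK_ZEROCOPY having been enabled */
};

/*
 * Opt-in gating: the flag bit is honoured only after an explicit opt-in.
 * Without the opt-in the bit is silently ignored, so a legacy caller that
 * happens to set it keeps working.
 */
static bool demo_use_zerocopy(const struct demo_sock *sk, unsigned int flags)
{
	return (flags & DEMO_MSG_ZEROCOPY) && sk->zerocopy_opt_in;
}

/*
 * Strict checking: the bit is always inspected, and a non-empty send with
 * an unsuitable iterator is rejected. A caller that previously set the bit
 * by accident now sees -EINVAL.
 */
static int demo_check_splice(unsigned int flags, bool iter_is_bvec, size_t size)
{
	if ((flags & DEMO_MSG_SPLICE_PAGES) && size && !iter_is_bvec)
		return -EINVAL;
	return 0;
}
```

With the opt-in model, a stray flag bit changes nothing for a socket that never enabled the option. With the strict model, the same stray bit turns a previously successful send into an error, which is the compatibility concern raised here.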
Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:

> The commit message mentions MSG_SPLICE_PAGES as an internal flag.
>
> It can be passed from userspace. The code anticipates that and checks
> preconditions.

Should I add a separate field in the in-kernel msghdr struct for such internal
flags? That would also avoid putting an internal flag in the same space as
the uapi flags.

David
David Howells wrote:
> Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
>
> > The commit message mentions MSG_SPLICE_PAGES as an internal flag.
> >
> > It can be passed from userspace. The code anticipates that and checks
> > preconditions.
>
> Should I add a separate field in the in-kernel msghdr struct for such internal
> flags? That would also avoid putting an internal flag in the same space as
> the uapi flags.

That would work, if no cost to common paths that don't need it.

A not very pretty alternative would be to add an extra arg to each
sendmsg handler that is used only when called from sendpage.

There are a few other internal MSG_.. flags, such as
MSG_SENDPAGE_NOPOLICY. Those are all limited to sendpage, and ignored
in sendmsg, I think. Which would explain why it was clearly safe to
add them.
Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:

> David Howells wrote:
> > Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
> >
> > > The commit message mentions MSG_SPLICE_PAGES as an internal flag.
> > >
> > > It can be passed from userspace. The code anticipates that and checks
> > > preconditions.
> >
> > Should I add a separate field in the in-kernel msghdr struct for such internal
> > flags? That would also avoid putting an internal flag in the same space as
> > the uapi flags.
>
> That would work, if no cost to common paths that don't need it.

Actually, it might be tricky. __ip_append_data() doesn't take a msghdr struct
pointer per se. The "void *from" argument *might* point to one - but it
depends on seeing a MSG_SPLICE_PAGES or MSG_ZEROCOPY flag, otherwise we don't
know.

Possibly this changes if sendpage goes away.

> A not very pretty alternative would be to add an extra arg to each
> sendmsg handler that is used only when called from sendpage.
>
> There are a few other internal MSG_.. flags, such as
> MSG_SENDPAGE_NOPOLICY. Those are all limited to sendpage, and ignored
> in sendmsg, I think. Which would explain why it was clearly safe to
> add them.

Should those be moved across to the internal flags with MSG_SPLICE_PAGES?

David
David Howells wrote:
> Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
>
> > David Howells wrote:
> > > Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
> > >
> > > > The commit message mentions MSG_SPLICE_PAGES as an internal flag.
> > > >
> > > > It can be passed from userspace. The code anticipates that and checks
> > > > preconditions.
> > >
> > > Should I add a separate field in the in-kernel msghdr struct for such internal
> > > flags? That would also avoid putting an internal flag in the same space as
> > > the uapi flags.
> >
> > That would work, if no cost to common paths that don't need it.
>
> Actually, it might be tricky. __ip_append_data() doesn't take a msghdr struct
> pointer per se. The "void *from" argument *might* point to one - but it
> depends on seeing a MSG_SPLICE_PAGES or MSG_ZEROCOPY flag, otherwise we don't
> know.
>
> Possibly this changes if sendpage goes away.

Is it sufficient to mask out this bit in tcp_sendmsg_locked and
udp_sendmsg if passed from userspace (and should be ignored), and pass
it through flags to callees like ip_append_data?

> > A not very pretty alternative would be to add an extra arg to each
> > sendmsg handler that is used only when called from sendpage.
> >
> > There are a few other internal MSG_.. flags, such as
> > MSG_SENDPAGE_NOPOLICY. Those are all limited to sendpage, and ignored
> > in sendmsg, I think. Which would explain why it was clearly safe to
> > add them.
>
> Should those be moved across to the internal flags with MSG_SPLICE_PAGES?

I would not include that in this patch series.
David Howells wrote:
> Hi Willem,
>
> Here's another option to passing MSG_SPLICE_PAGES into sendmsg()[1] without
> polluting the flags in msg->msg_flags. The idea here is to put the flag
> into a new field in msghdr, msg_kflags, that holds internal kernel flags
> that aren't available to userspace.
>
> What I've done here is:
>
>  (1) Pass msg down to __ip_append_data() and __ip6_append_data() so that
>      they can access the extra flags.
>
>  (2) In order to avoid adding extra arguments to these functions and the
>      functions in their call chains (such as ip_make_skb()), remove the
>      size and flags arguments as these values are redundant if msg is
>      passed in.
>
>  (3) msg is then passed into getfrag(). I would like to get rid of the
>      "from" argument also in favour of using something in msghdr, but I'm
>      not sure how best to do that.
>
>  (4) The size parameter to ->sendmsg() seems to be redundant; indeed
>      sock_sendmsg() doesn't actually take it, but rather gets the count
>      from msg_iter - so remove this parameter.
>
>      kernel_sendmsg() will still take a size, but it sets it on the
>      iterator and then calls sock_sendmsg().
>
>  (5) Protocol sendmsg implementations then extract the length and the flags
>      from the iterator.
>
>  (6) Illustrate the addition of msg_kflags and MSG_SPLICE_PAGES. I think
>      that, at some point in the future, some of the other flags could be
>      moved from msg_flags to msg_kflags.
>
> David
>
> Link: https://lore.kernel.org/r/20230316152618.711970-1-dhowells@redhat.com/ [1]
>
> David Howells (3):
>   net: Drop the size argument from ->sendmsg()
>   ip: Make __ip{,6}_append_data() and co. take a msghdr*
>   net: Declare MSG_SPLICE_PAGES internal sendmsg() flag
>
>  crypto/af_alg.c | 12 +--
>  crypto/algif_aead.c | 9 +--
>  crypto/algif_hash.c | 8 +-
>  crypto/algif_rng.c | 3 +-
>  crypto/algif_skcipher.c | 10 +--
>  drivers/isdn/mISDN/socket.c | 3 +-
>  .../chelsio/inline_crypto/chtls/chtls.h | 2 +-
>  .../chelsio/inline_crypto/chtls/chtls_io.c | 15 ++--
>  drivers/net/ppp/pppoe.c | 4 +-
>  drivers/net/tap.c | 3 +-
>  drivers/net/tun.c | 3 +-
>  drivers/vhost/net.c | 6 +-
>  drivers/xen/pvcalls-back.c | 2 +-
>  drivers/xen/pvcalls-front.c | 4 +-
>  drivers/xen/pvcalls-front.h | 3 +-
>  fs/afs/rxrpc.c | 8 +-
>  include/crypto/if_alg.h | 3 +-
>  include/linux/lsm_hook_defs.h | 3 +-
>  include/linux/lsm_hooks.h | 1 -
>  include/linux/net.h | 6 +-
>  include/linux/security.h | 4 +-
>  include/linux/socket.h | 3 +
>  include/net/af_rxrpc.h | 3 +-
>  include/net/inet_common.h | 2 +-
>  include/net/ip.h | 24 +++---
>  include/net/ipv6.h | 22 +++---
>  include/net/ping.h | 7 +-
>  include/net/sock.h | 7 +-
>  include/net/tcp.h | 8 +-
>  include/net/udp.h | 2 +-
>  include/net/udplite.h | 4 +-
>  net/appletalk/ddp.c | 3 +-
>  net/atm/common.c | 3 +-
>  net/atm/common.h | 2 +-
>  net/ax25/af_ax25.c | 4 +-
>  net/bluetooth/hci_sock.c | 4 +-
>  net/bluetooth/iso.c | 4 +-
>  net/bluetooth/l2cap_sock.c | 5 +-
>  net/bluetooth/rfcomm/sock.c | 7 +-
>  net/bluetooth/sco.c | 4 +-
>  net/caif/caif_socket.c | 13 ++--
>  net/can/bcm.c | 3 +-
>  net/can/isotp.c | 3 +-
>  net/can/j1939/socket.c | 4 +-
>  net/can/raw.c | 3 +-
>  net/core/sock.c | 4 +-
>  net/dccp/dccp.h | 2 +-
>  net/dccp/proto.c | 3 +-
>  net/ieee802154/socket.c | 11 +--
>  net/ipv4/af_inet.c | 4 +-
>  net/ipv4/icmp.c | 14 ++--
>  net/ipv4/ip_output.c | 73 ++++++++++---------
>  net/ipv4/ping.c | 18 ++---
>  net/ipv4/raw.c | 23 +++---
>  net/ipv4/tcp.c | 17 +++--
>  net/ipv4/tcp_bpf.c | 5 +-
>  net/ipv4/tcp_input.c | 3 +-
>  net/ipv4/udp.c | 24 +++---
>  net/ipv6/af_inet6.c | 7 +-
>  net/ipv6/icmp.c | 21 ++++--
>  net/ipv6/ip6_output.c | 57 +++++++--------
>  net/ipv6/ping.c | 12 +--
>  net/ipv6/raw.c | 25 +++----
>  net/ipv6/udp.c | 26 ++++---
>  net/ipv6/udp_impl.h | 2 +-
>  net/iucv/af_iucv.c | 4 +-
>  net/kcm/kcmsock.c | 2 +-
>  net/key/af_key.c | 3 +-
>  net/l2tp/l2tp_ip.c | 3 +-
>  net/l2tp/l2tp_ip6.c | 3 +-
>  net/l2tp/l2tp_ppp.c | 4 +-
>  net/llc/af_llc.c | 5 +-
>  net/mctp/af_mctp.c | 3 +-
>  net/mptcp/protocol.c | 8 +-
>  net/netlink/af_netlink.c | 11 +--
>  net/netrom/af_netrom.c | 3 +-
>  net/nfc/llcp_sock.c | 7 +-
>  net/nfc/rawsock.c | 3 +-
>  net/packet/af_packet.c | 11 +--
>  net/phonet/datagram.c | 3 +-
>  net/phonet/pep.c | 3 +-
>  net/phonet/socket.c | 5 +-
>  net/qrtr/af_qrtr.c | 4 +-
>  net/rds/rds.h | 2 +-
>  net/rds/send.c | 3 +-
>  net/rose/af_rose.c | 3 +-
>  net/rxrpc/af_rxrpc.c | 6 +-
>  net/rxrpc/ar-internal.h | 2 +-
>  net/rxrpc/output.c | 22 +++---
>  net/rxrpc/rxperf.c | 4 +-
>  net/rxrpc/sendmsg.c | 15 ++--
>  net/sctp/socket.c | 3 +-
>  net/smc/af_smc.c | 5 +-
>  net/socket.c | 16 ++--
>  net/tipc/socket.c | 34 ++++-----
>  net/tls/tls.h | 4 +-
>  net/tls/tls_device.c | 5 +-
>  net/tls/tls_sw.c | 2 +-
>  net/unix/af_unix.c | 19 +++--
>  net/vmw_vsock/af_vsock.c | 16 ++--
>  net/x25/af_x25.c | 3 +-
>  net/xdp/xsk.c | 6 +-
>  net/xfrm/espintcp.c | 8 +-
>  security/apparmor/lsm.c | 6 +-
>  security/security.c | 4 +-
>  security/selinux/hooks.c | 3 +-
>  security/smack/smack_lsm.c | 4 +-
>  security/tomoyo/common.h | 3 +-
>  security/tomoyo/network.c | 4 +-
>  security/tomoyo/tomoyo.c | 6 +-
>  110 files changed, 444 insertions(+), 456 deletions(-)

That's a significant code change if only for this purpose.

If this bit is undefined and ignored by all socket families today,
masking it out in sock_sendmsg should be enough to start using it
safely as an internal flag.
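
The masking idea can be sketched in userspace as follows. This is only a model of the suggestion, under the assumption that requests from userspace and from in-kernel callers pass through distinguishable entry points. All demo_* and DEMO_* names are hypothetical, the flag value is illustrative, and where the mask would actually be applied in the kernel is exactly what the thread is discussing.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values for this sketch only. */
#define DEMO_MSG_DONTWAIT       0x40u
#define DEMO_MSG_SPLICE_PAGES   0x8000000u
/* Hypothetical set of flags that userspace must never be able to pass. */
#define DEMO_MSG_INTERNAL_FLAGS (DEMO_MSG_SPLICE_PAGES)

/* Hypothetical cut-down message header: only the flags matter here. */
struct demo_msghdr {
	unsigned int msg_flags;
};

/* Models a protocol handler: reports whether it would try to splice. */
static bool demo_proto_sendmsg(const struct demo_msghdr *msg)
{
	return (msg->msg_flags & DEMO_MSG_SPLICE_PAGES) != 0;
}

/* Entry point for userspace requests: strip internal bits, then dispatch. */
static bool demo_user_sendmsg(struct demo_msghdr *msg)
{
	msg->msg_flags &= ~DEMO_MSG_INTERNAL_FLAGS;
	return demo_proto_sendmsg(msg);
}

/* Entry point for in-kernel callers: internal bits pass through untouched. */
static bool demo_kernel_sendmsg(struct demo_msghdr *msg)
{
	return demo_proto_sendmsg(msg);
}
```

In this model a userspace caller that sets the bit gets the historical behaviour (the bit is ignored), while an in-kernel caller can still request splicing through the ordinary flags word, so no new msghdr field or handler argument is needed.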
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 288693981b00..77c0c69208a5 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1220,7 +1220,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 	int flags, err, copied = 0;
 	int mss_now = 0, size_goal, copied_syn = 0;
 	int process_backlog = 0;
-	bool zc = false;
+	int zc = 0;
 	long timeo;
 
 	flags = msg->msg_flags;
@@ -1231,17 +1231,24 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 		if (msg->msg_ubuf) {
 			uarg = msg->msg_ubuf;
 			net_zcopy_get(uarg);
-			zc = sk->sk_route_caps & NETIF_F_SG;
+			if (sk->sk_route_caps & NETIF_F_SG)
+				zc = 1;
 		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
 			uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
 			if (!uarg) {
 				err = -ENOBUFS;
 				goto out_err;
 			}
-			zc = sk->sk_route_caps & NETIF_F_SG;
-			if (!zc)
+			if (sk->sk_route_caps & NETIF_F_SG)
+				zc = 1;
+			else
 				uarg_to_msgzc(uarg)->zerocopy = 0;
 		}
+	} else if (unlikely(flags & MSG_SPLICE_PAGES) && size) {
+		if (!iov_iter_is_bvec(&msg->msg_iter))
+			return -EINVAL;
+		if (sk->sk_route_caps & NETIF_F_SG)
+			zc = 2;
 	}
 
 	if (unlikely(flags & MSG_FASTOPEN || inet_sk(sk)->defer_connect) &&
@@ -1345,7 +1352,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 		if (copy > msg_data_left(msg))
 			copy = msg_data_left(msg);
 
-		if (!zc) {
+		if (zc == 0) {
 			bool merge = true;
 			int i = skb_shinfo(skb)->nr_frags;
 			struct page_frag *pfrag = sk_page_frag(sk);
@@ -1390,7 +1397,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 				page_ref_inc(pfrag->page);
 			}
 			pfrag->offset += copy;
-		} else {
+		} else if (zc == 1) {
 			/* First append to a fragless skb builds initial
 			 * pure zerocopy skb
 			 */
@@ -1411,6 +1418,46 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 			if (err < 0)
 				goto do_error;
 			copy = err;
+		} else if (zc == 2) {
+			/* Splice in data. */
+			const struct bio_vec *bv = msg->msg_iter.bvec;
+			size_t seg = iov_iter_single_seg_count(&msg->msg_iter);
+			size_t off = bv->bv_offset + msg->msg_iter.iov_offset;
+			bool can_coalesce;
+			int i = skb_shinfo(skb)->nr_frags;
+
+			if (copy > seg)
+				copy = seg;
+
+			can_coalesce = skb_can_coalesce(skb, i, bv->bv_page, off);
+			if (!can_coalesce && i >= READ_ONCE(sysctl_max_skb_frags)) {
+				tcp_mark_push(tp, skb);
+				goto new_segment;
+			}
+			if (tcp_downgrade_zcopy_pure(sk, skb))
+				goto wait_for_space;
+
+			copy = tcp_wmem_schedule(sk, copy);
+			if (!copy)
+				goto wait_for_space;
+
+			if (can_coalesce) {
+				skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
+			} else {
+				get_page(bv->bv_page);
+				skb_fill_page_desc_noacc(skb, i, bv->bv_page, off, copy);
+			}
+			iov_iter_advance(&msg->msg_iter, copy);
+
+			if (!(flags & MSG_NO_SHARED_FRAGS))
+				skb_shinfo(skb)->flags |= SKBFL_SHARED_FRAG;
+
+			skb->len += copy;
+			skb->data_len += copy;
+			skb->truesize += copy;
+			sk_wmem_queued_add(sk, copy);
+			sk_mem_charge(sk, copy);
+
 		}
 
 		if (!copied)
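
The patch turns zc from a bool into a three-valued selector (0 = copy, 1 = MSG_ZEROCOPY, 2 = splice). The userspace sketch below models only that selection step, to make the fallback behaviour easier to see. It assumes the zerocopy branch is guarded by "(flags & MSG_ZEROCOPY) && size", which lies outside the hunk context shown, and it leaves out the -ENOBUFS path. The enum, struct, function names and flag values are all hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative bit values for this sketch only. */
#define DEMO_MSG_ZEROCOPY     0x4000000u
#define DEMO_MSG_SPLICE_PAGES 0x8000000u

/* Named equivalents of the patch's zc = 0 / 1 / 2. */
enum demo_zc_mode {
	DEMO_ZC_COPY = 0,	/* copy data into page fragments */
	DEMO_ZC_ZEROCOPY = 1,	/* MSG_ZEROCOPY path */
	DEMO_ZC_SPLICE = 2,	/* splice pages from a bvec iterator */
};

/* Hypothetical summary of the state the selection depends on. */
struct demo_ctx {
	bool has_ubuf;		/* models msg->msg_ubuf being set */
	bool sock_zerocopy;	/* models sock_flag(sk, SOCK_ZEROCOPY) */
	bool route_sg;		/* models sk_route_caps & NETIF_F_SG */
	bool iter_is_bvec;	/* models iov_iter_is_bvec(&msg->msg_iter) */
	size_t size;		/* bytes requested */
};

/* Returns 0 and sets *mode, or -EINVAL for a splice from a non-bvec iterator. */
static int demo_pick_zc(unsigned int flags, const struct demo_ctx *c,
			enum demo_zc_mode *mode)
{
	*mode = DEMO_ZC_COPY;

	if ((flags & DEMO_MSG_ZEROCOPY) && c->size) {
		if ((c->has_ubuf || c->sock_zerocopy) && c->route_sg)
			*mode = DEMO_ZC_ZEROCOPY;
	} else if ((flags & DEMO_MSG_SPLICE_PAGES) && c->size) {
		if (!c->iter_is_bvec)
			return -EINVAL;
		if (c->route_sg)
			*mode = DEMO_ZC_SPLICE;
	}
	return 0;
}
```

Two properties of the hunk show up in the model: a splice request on a route without scatter-gather quietly falls back to copying rather than failing, and because the splice test is an else-if, MSG_ZEROCOPY takes precedence when both bits are set.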