[net-next,v5,00/19] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 1
Message ID | 20230406094245.3633290-1-dhowells@redhat.com |
---|---|
Headers | From: David Howells <dhowells@redhat.com>; To: netdev@vger.kernel.org; Cc: David Howells <dhowells@redhat.com>, "David S. Miller" <davem@davemloft.net>, Eric Dumazet <edumazet@google.com>, Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>, Willem de Bruijn <willemdebruijn.kernel@gmail.com>, Matthew Wilcox <willy@infradead.org>, Al Viro <viro@zeniv.linux.org.uk>, Christoph Hellwig <hch@infradead.org>, Jens Axboe <axboe@kernel.dk>, Jeff Layton <jlayton@kernel.org>, Christian Brauner <brauner@kernel.org>, Chuck Lever III <chuck.lever@oracle.com>, Linus Torvalds <torvalds@linux-foundation.org>, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org; Subject: [PATCH net-next v5 00/19] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 1; Date: Thu, 6 Apr 2023 10:42:26 +0100 |
Series | splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 1 |
Message
David Howells
April 6, 2023, 9:42 a.m. UTC
Here's the first tranche of patches towards providing a MSG_SPLICE_PAGES
internal sendmsg flag that is intended to replace the ->sendpage() op with
calls to sendmsg().  MSG_SPLICE_PAGES is a hint that tells the protocol that
it should splice the pages supplied if it can and copy them if not.

This will allow splice to pass multiple pages in a single call and allow
certain parts of higher protocols (e.g. sunrpc, iwarp) to pass an entire
message in one go rather than having to send them piecemeal.  This should
also make it easier to handle the splicing of multipage folios.

This set consists of the following parts:

 (1) Define the MSG_SPLICE_PAGES flag and prevent sys_sendmsg() from being
     able to set it.

 (2) Overhaul the page_frag_alloc_align() allocator:

     (a) Split it out from mm/page_alloc.c into its own file,
         mm/page_frag_alloc.c.

     (b) Make it use multipage folios rather than compound pages.

     (c) Give it per-cpu buckets to allocate from so no locking is
         required.

     (d) The netdev_alloc_cache and the napi fragment cache are then cast
         in terms of this and some private allocators are removed.

     I'm not sure that the existing allocator is 100% thread safe.

 (3) Implement MSG_SPLICE_PAGES support in TCP.

 (4) Make do_tcp_sendpages() just wrap sendmsg() and then fold it in to its
     various callers.

 (5) Implement MSG_SPLICE_PAGES support in IP and make udp_sendpage() just
     a wrapper around sendmsg().

 (6) Implement MSG_SPLICE_PAGES support in IP6/UDP6.

 (7) Implement MSG_SPLICE_PAGES support in AF_UNIX.

 (8) Make AF_UNIX copy unspliceable pages.

I've pushed the patches here also:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=sendpage-1

The follow-on patches are on branch iov-sendpage on the same tree.

David

Changes
=======
ver #5)
 - Dropped the samples patch as it causes lots of failures in the patchwork
   32-bit builds due to apparent libc userspace header issues.
 - Made the pagefrag alloc patches alter the Google gve driver too.
 - Rearranged the patches to put the support in IP before altering UDP.

ver #4)
 - Added some sample socket-I/O programs into samples/net/.
 - Fix a missing page-get in AF_KCM.
 - Init the sgtable and mark the end in AF_ALG when calling
   netfs_extract_iter_to_sg().
 - Add a destructor func for page frag caches prior to generalising it and
   making it per-cpu.

ver #3)
 - Dropped the iterator-of-iterators patch.
 - Only expunge MSG_SPLICE_PAGES in sys_send[m]msg, not sys_recv[m]msg.
 - Split MSG_SPLICE_PAGES code in __ip_append_data() out into helper
   functions.
 - Implement MSG_SPLICE_PAGES support in __ip6_append_data() using the
   above helper functions.
 - Rename 'xlength' to 'initial_length'.
 - Minimise the changes to sunrpc for the moment.
 - Don't give -EOPNOTSUPP if NETIF_F_SG not available, just copy instead.
 - Implemented MSG_SPLICE_PAGES support in the TLS, Chelsio-TLS and AF_KCM
   code.

ver #2)
 - Overhauled the page_frag_alloc() allocator: large folios and per-cpu.
   - Got rid of my own zerocopy allocator.
 - Use iov_iter_extract_pages() rather than poking in iter->bvec.
 - Made page splicing fall back to page copying on a page-by-page basis.
 - Made splice_to_socket() pass 16 pipe buffers at a time.
 - Made AF_ALG/hash use finup/digest where possible in sendmsg.
 - Added an iterator-of-iterators, ITER_ITERLIST.
 - Made sunrpc use the iterator-of-iterators.
 - Converted more drivers.

Link: https://lore.kernel.org/r/20230316152618.711970-1-dhowells@redhat.com/ # v1
Link: https://lore.kernel.org/r/20230329141354.516864-1-dhowells@redhat.com/ # v2
Link: https://lore.kernel.org/r/20230331160914.1608208-1-dhowells@redhat.com/ # v3
Link: https://lore.kernel.org/r/20230405165339.3468808-1-dhowells@redhat.com/ # v4

David Howells (19):
  net: Declare MSG_SPLICE_PAGES internal sendmsg() flag
  mm: Move the page fragment allocator from page_alloc.c into its own file
  mm: Make the page_frag_cache allocator use multipage folios
  mm: Make the page_frag_cache allocator use per-cpu
  tcp: Support MSG_SPLICE_PAGES
  tcp: Make sendmsg(MSG_SPLICE_PAGES) copy unspliceable data
  tcp: Convert do_tcp_sendpages() to use MSG_SPLICE_PAGES
  tcp_bpf: Inline do_tcp_sendpages as it's now a wrapper around tcp_sendmsg
  espintcp: Inline do_tcp_sendpages()
  tls: Inline do_tcp_sendpages()
  siw: Inline do_tcp_sendpages()
  tcp: Fold do_tcp_sendpages() into tcp_sendpage_locked()
  ip, udp: Support MSG_SPLICE_PAGES
  ip, udp: Make sendmsg(MSG_SPLICE_PAGES) copy unspliceable data
  ip6, udp6: Support MSG_SPLICE_PAGES
  udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES
  ip: Remove ip_append_page()
  af_unix: Support MSG_SPLICE_PAGES
  af_unix: Make sendmsg(MSG_SPLICE_PAGES) copy unspliceable data

 drivers/infiniband/sw/siw/siw_qp_tx.c      |  17 +-
 drivers/net/ethernet/google/gve/gve.h      |   1 -
 drivers/net/ethernet/google/gve/gve_main.c |  16 --
 drivers/net/ethernet/google/gve/gve_rx.c   |   2 +-
 drivers/net/ethernet/mediatek/mtk_wed_wo.c |  19 +-
 drivers/net/ethernet/mediatek/mtk_wed_wo.h |   2 -
 drivers/nvme/host/tcp.c                    |  19 +-
 drivers/nvme/target/tcp.c                  |  22 +-
 include/linux/gfp.h                        |  17 +-
 include/linux/mm_types.h                   |  13 +-
 include/linux/socket.h                     |   3 +
 include/net/ip.h                           |   3 +-
 include/net/tcp.h                          |   2 -
 include/net/tls.h                          |   2 +-
 mm/Makefile                                |   2 +-
 mm/page_alloc.c                            | 126 ----------
 mm/page_frag_alloc.c                       | 201 ++++++++++++++++
 net/core/skbuff.c                          |  32 +--
 net/ipv4/ip_output.c                       | 202 ++++++----------
 net/ipv4/tcp.c                             | 260 ++++++++-------------
 net/ipv4/tcp_bpf.c                         |  20 +-
 net/ipv4/udp.c                             |  50 +---
 net/ipv6/ip6_output.c                      |  12 +
 net/socket.c                               |   2 +
 net/tls/tls_main.c                         |  24 +-
 net/unix/af_unix.c                         | 115 +++++++--
 net/xfrm/espintcp.c                        |  10 +-
 27 files changed, 577 insertions(+), 617 deletions(-)
 create mode 100644 mm/page_frag_alloc.c
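
[Editorial illustration, not part of the posted series.]  The "splice the
pages if it can, copy them if not" hint, the page-by-page fallback and the
stripping of the internal flag from userspace-supplied flags can be modelled
in plain userspace C.  Everything below is a toy under stated assumptions:
MODEL_MSG_SPLICE_PAGES, struct model_page, struct model_frag,
model_sanitise_flags() and model_append() are hypothetical names invented
for this sketch and do not correspond to kernel symbols or values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag value, used by this userspace model only. */
#define MODEL_MSG_SPLICE_PAGES	0x1u
#define MODEL_PAGE_SIZE		16

/* A source "page"; spliceable is false for, say, slab or stack memory. */
struct model_page {
	char data[MODEL_PAGE_SIZE];
	bool spliceable;
};

/* A fragment attached to the outgoing message. */
struct model_frag {
	const struct model_page *page;	/* Non-NULL if spliced by reference */
	char copy[MODEL_PAGE_SIZE];	/* Holds the data if it was copied */
};

/* Model of the syscall boundary: userspace may not set the internal flag. */
unsigned int model_sanitise_flags(unsigned int flags)
{
	return flags & ~MODEL_MSG_SPLICE_PAGES;
}

/*
 * Attach n pages to a message.  Each page is spliced if the hint is set
 * and that page allows it; otherwise that one page is copied.  Returns
 * the number of pages that were spliced.
 */
size_t model_append(struct model_frag *frags, const struct model_page *pages,
		    size_t n, unsigned int flags)
{
	size_t spliced = 0;

	assert(frags != NULL && pages != NULL);

	for (size_t i = 0; i < n; i++) {
		if ((flags & MODEL_MSG_SPLICE_PAGES) && pages[i].spliceable) {
			frags[i].page = &pages[i];
			spliced++;
		} else {
			frags[i].page = NULL;
			memcpy(frags[i].copy, pages[i].data, MODEL_PAGE_SIZE);
		}
	}
	return spliced;
}
```

In this model the decision is made per page, so one unspliceable page in the
middle of a message does not force the rest of the message to be copied.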
Comments
On Thu, Apr 6, 2023 at 11:42 AM David Howells <dhowells@redhat.com> wrote:
>
> Here's the first tranche of patches towards providing a MSG_SPLICE_PAGES
> internal sendmsg flag that is intended to replace the ->sendpage() op with
> calls to sendmsg(). MSG_SPLICE is a hint that tells the protocol that it
> should splice the pages supplied if it can and copy them if not.
>

I find this patch series quite big/risky for 6.4

Can you spell out why we need "unspliceable pages support" ?
This seems to add quite a lot of code in fast paths.

Thanks.
Eric Dumazet <edumazet@google.com> wrote:

> > Here's the first tranche of patches towards providing a MSG_SPLICE_PAGES
> > internal sendmsg flag that is intended to replace the ->sendpage() op with
> > calls to sendmsg(). MSG_SPLICE is a hint that tells the protocol that it
> > should splice the pages supplied if it can and copy them if not.
> >
>
> I find this patch series quite big/risky for 6.4

If you want me to hold this till after the merge window, that's fine.

> Can you spell out why we need "unspliceable pages support" ?
> This seems to add quite a lot of code in fast paths.

The patches to copy unspliceable pages (patches 6, 14 and 19) only really
add to the MSG_SPLICE_PAGES path - I don't know whether you count this as a
fast path or not.  (Or are you objecting to MSG_SPLICE_PAGES and getting rid
of sendpage in general?)

What I'm trying to do with this aspect is twofold:

Firstly, I'm trying to make it such that the layer above can send each
message in a single sendmsg() if possible.  This is possible with sunrpc and
siw, for example, but currently they make a whole bunch of separate calls
into the transport layer - typically at least three for header, body,
trailer.

Secondly, I'm trying to avoid a double copy.  The layer above TCP/UDP/etc
(sunrpc[*], siw, etc.) needs to glue protocol bits on either end of the
message body and it may have this data in the slab or on the stack - which
it would then need to copy into a page fragment so that it can be
zero-copied.  However, if the device can't handle this or we don't have
sufficient frags, the network layer may decide to copy it anyway - I'm not
sure how the higher layer can determine this.  It just seems there are fewer
places this is required if it can be done in the network protocol.

Note that userspace cannot make use of this since it's not allowed to set
MSG_SPLICE_PAGES.

However, I have kept these bits separate and can discard them if it's
considered a bad idea and that MSG_SPLICE_PAGES should, say, give an error
in such a case.

David

[*] sunrpc, at least, seems to store the header and trailer in zerocopyable
    pages, but has an additional bit on the front that's not.
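
[Editorial illustration, not part of the thread.]  The "header, body,
trailer in a single sendmsg()" idea has a familiar userspace analogue: an
iovec array lets one sendmsg() call describe several separate buffers.  The
sketch below uses only the standard sendmsg() interface; send_framed() is a
hypothetical helper named for this example.  It copies data as usual and
does not use MSG_SPLICE_PAGES, which, as noted above, userspace is not
allowed to set.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Queue header, body and trailer with one sendmsg() call rather than
 * three separate sends.  Returns the number of bytes queued, or -1 on
 * error with errno set by sendmsg().
 */
ssize_t send_framed(int fd, const char *hdr, const char *body,
		    const char *trl)
{
	struct iovec iov[3];
	struct msghdr msg;

	assert(hdr != NULL && body != NULL && trl != NULL);

	/* One iovec element per piece of the message. */
	iov[0].iov_base = (void *)hdr;
	iov[0].iov_len = strlen(hdr);
	iov[1].iov_base = (void *)body;
	iov[1].iov_len = strlen(body);
	iov[2].iov_base = (void *)trl;
	iov[2].iov_len = strlen(trl);

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = iov;
	msg.msg_iovlen = 3;

	return sendmsg(fd, &msg, 0);
}
```

A caller could try this on one end of a socketpair(AF_UNIX, SOCK_STREAM, 0,
sv) and read the concatenated bytes from the other end.  The in-kernel case
discussed in the thread differs in that the pieces are handed over as pages
to be spliced where possible, not as user buffers to be copied.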