Message ID: <20230522205744.2825689-1-dhowells@redhat.com>

Headers:
From: David Howells <dhowells@redhat.com>
To: Jens Axboe <axboe@kernel.dk>, Al Viro <viro@zeniv.linux.org.uk>, Christoph Hellwig <hch@infradead.org>
Cc: David Howells <dhowells@redhat.com>, Matthew Wilcox <willy@infradead.org>, Jan Kara <jack@suse.cz>, Jeff Layton <jlayton@kernel.org>, David Hildenbrand <david@redhat.com>, Jason Gunthorpe <jgg@nvidia.com>, Logan Gunthorpe <logang@deltatee.com>, Hillf Danton <hdanton@sina.com>, Christian Brauner <brauner@kernel.org>, Linus Torvalds <torvalds@linux-foundation.org>, linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [PATCH v21 0/6] block: Use page pinning
Date: Mon, 22 May 2023 21:57:38 +0100
Series: block: Use page pinning
Message
David Howells
May 22, 2023, 8:57 p.m. UTC
Hi Jens, Al, Christoph,

This patchset rolls page-pinning out to the bio struct and the block layer,
using iov_iter_extract_pages() to get pages and noting with BIO_PAGE_PINNED
if the data pages attached to a bio are pinned.  If the data pages come from
a non-user-backed iterator, then the pages are left unpinned and unref'd,
relying on whoever set up the I/O to do the retaining.

This requires the splice-read patchset to have been applied first, otherwise
reversion of the ITER_PAGE iterator can race with truncate and return pages
to the allocator whilst they're still undergoing DMA [2].

The patches in this series:

 (1) Don't hold a ref on ZERO_PAGE in iomap_dio_zero().

 (2) Fix bio_flagged() so that it doesn't prevent a gcc optimisation.

 (3) Make the bio struct carry a pair of flags to indicate the cleanup mode.
     BIO_NO_PAGE_REF is replaced with BIO_PAGE_REFFED (indicating FOLL_GET
     was used) and BIO_PAGE_PINNED (indicating FOLL_PIN was used) is added.
     BIO_PAGE_REFFED will go away, but at the moment fs/direct-io.c sets it
     and this series does not fully address that file.

 (4) Add a function, bio_release_page(), to release a page appropriately to
     the cleanup mode indicated by the BIO_PAGE_* flags.

 (5) Make bio_iov_iter_get_pages() use iov_iter_extract_pages() to retain
     the pages appropriately and clean them up later.

 (6) Make bio_map_user_iov() also use iov_iter_extract_pages().

I've pushed the patches here also:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-extract

David

Changes:
========
ver #21)
 - Split off the splice-read patchset to reduce the patch count.

ver #20)
 - Make direct_splice_read() limit the read to eof for regular files and
   blockdevs.
 - Check against s_maxbytes on the backing store, not a devnode inode.
 - Provide stubs for afs, ceph, ecryptfs, ext4, f2fs, nfs, ntfs3, ocfs2,
   orangefs, xfs and zonefs.
 - Always use direct_splice_read() for 9p, trace and sockets.

ver #19)
 - Remove a missed get_page() on the zeropage in shmem_splice_read().

ver #18)
 - Split out the cifs bits from the patch that switches
   generic_file_splice_read() over to using the non-ITER_PIPE splicing.
 - Don't get/put refs on the zeropage in shmem_splice_read().

ver #17)
 - Rename do_splice_to() to vfs_splice_read() and export it so that it can
   be a helper and make overlayfs and coda use it, allowing duplicate checks
   to be removed.

ver #16)
 - The filemap_get_pages() changes are now upstream.
 - filemap_splice_read() and direct_splice_read() are now upstream.
 - iov_iter_extract_pages() is now upstream.

ver #15)
 - Fixed up some errors in overlayfs_splice_read().

ver #14)
 - Some changes to generic_file_buffered_splice_read():
   - Rename to filemap_splice_read() and move to mm/filemap.c.
   - Create a helper, pipe_head_buf().
   - Use init_sync_kiocb().
 - Some changes to generic_file_direct_splice_read():
   - Use alloc_pages_bulk_array() rather than alloc_pages_bulk_list().
   - Use release_pages() instead of __free_page() in a loop.
   - Rename to direct_splice_read().
 - Rearrange the patches to implement filemap_splice_read() and
   direct_splice_read() separately to changing generic_file_splice_read().
 - Don't call generic_file_splice_read() when there isn't a ->read_folio().
 - Insert patches to fix read_folio-less cases:
   - Make tty, procfs, kernfs and (u)random use direct_splice_read().
   - Make overlayfs and coda call down to a lower layer.
   - Give shmem its own splice-read that doesn't insert missing pages.
 - Fixed a min() with mixed type args on some arches.

ver #13)
 - Only use allocation in advance and ITER_BVEC for DIO read-splice.
 - Make buffered read-splice get pages directly from the pagecache.
 - Alter filemap_get_pages() & co. so that it doesn't need an iterator.

ver #12)
 - Added the missing __bitwise on the iov_iter_extraction_t typedef.
 - Rebased on -rc7.
 - Don't specify FOLL_PIN to pin_user_pages_fast().
 - Inserted patch at front to fix race between DIO read and truncation that
   caused memory corruption when iov_iter_revert() got called on an
   ITER_PIPE iterator [2].
 - Inserted a patch after that to remove the now-unused ITER_PIPE and its
   helper functions.
 - Removed the ITER_PIPE bits from iov_iter_extract_pages().

ver #11)
 - Fix iov_iter_extract_kvec_pages() to include the offset into the page in
   the returned starting offset.
 - Use __bitwise for the extraction flags.

ver #10)
 - Fix use of i->kvec in iov_iter_extract_bvec_pages() to be i->bvec.
 - Drop bio_set_cleanup_mode(), open coding it instead.

ver #9)
 - It's now not permitted to use FOLL_PIN outside of mm/, so:
   - Change iov_iter_extract_mode() into iov_iter_extract_will_pin() and
     return true/false instead of FOLL_PIN/0.
   - Drop folio_put_unpin() and page_put_unpin() and instead call
     unpin_user_page() (and put_page()) directly as necessary.
   - Make __bio_release_pages() call bio_release_page() instead of
     unpin_user_page() as there's no BIO_* -> FOLL_* translation to do.
   - Drop the FOLL_* renumbering patch.
   - Change extract_flags to extraction_flags.

ver #8)
 - Import Christoph Hellwig's changes.
   - Split the conversion-to-extraction patch.
   - Drop the extract_flags arg from iov_iter_extract_mode().
   - Don't default bios to BIO_PAGE_REFFED, but set explicitly.
 - Switch FOLL_PIN and FOLL_GET when renumbering so PIN is at bit 0.
 - Switch BIO_PAGE_PINNED and BIO_PAGE_REFFED so PINNED is at bit 0.
 - We should always be using FOLL_PIN (not FOLL_GET) for DIO, so adjust the
   patches for that.

ver #7)
 - For now, drop the parts to pass the I/O direction to iov_iter_*pages*()
   as it turned out to be a lot more complicated, with places not setting
   IOCB_WRITE when they should, for example.
 - Drop all the patches that changed things other than the block layer's
   bio handling.  The netfslib and cifs changes can go into a separate
   patchset.
 - Add support for extracting pages from KVEC-type iterators.
 - When extracting from BVEC/KVEC, skip over empty vecs at the front.

ver #6)
 - Fix write() syscall and co. not setting IOCB_WRITE.
 - Added iocb_is_read() and iocb_is_write() to check IOCB_WRITE.
 - Use op_is_write() in bio_copy_user_iov().
 - Drop the iterator direction checks from smbd_recv().
 - Define FOLL_SOURCE_BUF and FOLL_DEST_BUF and pass them in as part of
   gup_flags to iov_iter_get/extract_pages*().
 - Replace iov_iter_get_pages*2() with iov_iter_get_pages*() and remove.
 - Add back the function to indicate the cleanup mode.
 - Drop the cleanup_mode return arg to iov_iter_extract_pages().
 - Provide a helper to clean up a page.
 - Renumbered FOLL_GET and FOLL_PIN and made BIO_PAGE_REFFED/PINNED have
   the same numerical values, enforced with an assertion.
 - Converted AF_ALG, SCSI vhost, generic DIO, FUSE, splice to pipe, 9P and
   NFS.
 - Added in the patches to make CIFS do top-to-bottom iterators and use
   various of the added extraction functions.
 - Added a pair of work-in-progress patches to make sk_buff fragments store
   FOLL_GET and FOLL_PIN.

ver #5)
 - Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED and split into own patch.
 - Transcribe FOLL_GET/PIN into BIO_PAGE_REFFED/PINNED flags.
 - Add patch to allow bio_flagged() to be combined by gcc.

ver #4)
 - Drop the patch to move the FOLL_* flags to linux/mm_types.h as they're
   no longer referenced by linux/uio.h.
 - Add ITER_SOURCE/DEST cleanup patches.
 - Make iov_iter/netfslib iter extraction patches use ITER_SOURCE/DEST.
 - Allow additional gup_flags to be passed into iov_iter_extract_pages().
 - Add struct bio patch.

ver #3)
 - Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
   to get/pin_user_pages_fast() [1].

ver #2)
 - Rolled the extraction cleanup mode query function into the extraction
   function, returning the indication through the argument list.
 - Fixed patch 4 (extract to scatterlist) to actually use the new
   extraction API.

Link: https://lore.kernel.org/r/Y3zFzdWnWlEJ8X8/@infradead.org/ [1]
Link: https://lore.kernel.org/r/000000000000b0b3c005f3a09383@google.com/ [2]
Link: https://lore.kernel.org/r/166697254399.61150.1256557652599252121.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166722777223.2555743.162508599131141451.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732024173.3186319.18204305072070871546.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166869687556.3723671.10061142538708346995.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166920902005.1461876.2786264600108839814.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/166997419665.9475.15014699817597102032.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/167305160937.1521586.133299343565358971.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/167344725490.2425628.13771289553670112965.stgit@warthog.procyon.org.uk/ # v5
Link: https://lore.kernel.org/r/167391047703.2311931.8115712773222260073.stgit@warthog.procyon.org.uk/ # v6
Link: https://lore.kernel.org/r/20230120175556.3556978-1-dhowells@redhat.com/ # v7
Link: https://lore.kernel.org/r/20230123173007.325544-1-dhowells@redhat.com/ # v8
Link: https://lore.kernel.org/r/20230124170108.1070389-1-dhowells@redhat.com/ # v9
Link: https://lore.kernel.org/r/20230125210657.2335748-1-dhowells@redhat.com/ # v10
Link: https://lore.kernel.org/r/20230126141626.2809643-1-dhowells@redhat.com/ # v11
Link: https://lore.kernel.org/r/20230207171305.3716974-1-dhowells@redhat.com/ # v12
Link: https://lore.kernel.org/r/20230209102954.528942-1-dhowells@redhat.com/ # v13
Link: https://lore.kernel.org/r/20230214171330.2722188-1-dhowells@redhat.com/ # v14
Link: https://lore.kernel.org/r/20230308143754.1976726-1-dhowells@redhat.com/ # v16
Link: https://lore.kernel.org/r/20230308165251.2078898-1-dhowells@redhat.com/ # v17
Link: https://lore.kernel.org/r/20230314220757.3827941-1-dhowells@redhat.com/ # v18
Link: https://lore.kernel.org/r/20230315163549.295454-1-dhowells@redhat.com/ # v19
Link: https://lore.kernel.org/r/20230519074047.1739879-1-dhowells@redhat.com/ # v20

Splice-read patch subset:
Link: https://lore.kernel.org/r/20230520000049.2226926-1-dhowells@redhat.com/ # v21
Link: https://lore.kernel.org/r/20230522135018.2742245-1-dhowells@redhat.com/ # v22

Additional patches that got folded in:
Link: https://lore.kernel.org/r/20230213134619.2198965-1-dhowells@redhat.com/ # v1
Link: https://lore.kernel.org/r/20230213153301.2338806-1-dhowells@redhat.com/ # v2
Link: https://lore.kernel.org/r/20230214083710.2547248-1-dhowells@redhat.com/ # v3

Christoph Hellwig (1):
  block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted logic

David Howells (5):
  iomap: Don't get an reference on ZERO_PAGE for direct I/O block zeroing
  block: Fix bio_flagged() so that gcc can better optimise it
  block: Add BIO_PAGE_PINNED and associated infrastructure
  block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages
  block: convert bio_map_user_iov to use iov_iter_extract_pages

 block/bio.c               | 29 +++++++++++++++--------------
 block/blk-map.c           | 22 +++++++++++-----------
 block/blk.h               | 12 ++++++++++++
 fs/direct-io.c            |  2 ++
 fs/iomap/direct-io.c      |  1 -
 include/linux/bio.h       |  5 +++--
 include/linux/blk_types.h |  3 ++-
 7 files changed, 45 insertions(+), 29 deletions(-)
Comments
On Mon, May 22, 2023 at 09:57:38PM +0100, David Howells wrote:
> Hi Jens, Al, Christoph,
>
> This patchset rolls page-pinning out to the bio struct and the block layer,
> using iov_iter_extract_pages() to get pages and noting with BIO_PAGE_PINNED
> if the data pages attached to a bio are pinned.  If the data pages come
> from a non-user-backed iterator, then the pages are left unpinned and
> unref'd, relying on whoever set up the I/O to do the retaining.

I think I already reviewed the patches, so nothing new here.  But can you
please also take care of the legacy direct I/O code?  I'd really hate to
leave yet another unfinished transition around.
Christoph Hellwig <hch@infradead.org> wrote:
> But can you please also take care of the legacy direct I/O code?  I'd really
> hate to leave yet another unfinished transition around.

I've been poking at it this afternoon, but it doesn't look like it's going
to be straightforward, unfortunately.  The mm folks have been withdrawing
access to the pinning API behind the ramparts of the mm/ dir.  Further, the
dio code will (I think), under some circumstances, arbitrarily insert the
zero_page into a list of things that are maybe pinned or maybe unpinned, but
I can (I think) also be given a pinned zero_page from the GUP code if the
page tables point to one and a DIO-write is requested - so just doing "if
page == zero_page" isn't sufficient.

What I'd like to do is to make the GUP code not take a ref on the zero_page
if, say, FOLL_DONT_PIN_ZEROPAGE is passed in, and then make the bio cleanup
code always ignore the zero_page.

Alternatively, I can drop the pin immediately if I get given one on the
zero_page - it's not going anywhere, after all.

I also need to be able to take an additional pin on a folio that gets split
across multiple bio submissions to replace the get_page() that's there now.
Alternatively to that, I can decide how much data I'm willing to read/write
in one batch, call something like netfs_extract_user_iter() to decant that
portion of the parameter iterator into a bvec[] and let that look up the
overlapping page multiple times.  However, I'm not sure if this would work
well, for a couple of reasons: does a single bio have to refer to a
contiguous range of disk blocks?  And we might expend time on getting pages
we then have to give up because we hit a hole.

Something that I noticed is that the dio code seems to wangle the page bits
on the target pages for a DIO-read, which seems odd, but I'm not sure I
fully understand the code yet.

David
On Mon, 22 May 2023 21:57:38 +0100, David Howells wrote:
> This patchset rolls page-pinning out to the bio struct and the block layer,
> using iov_iter_extract_pages() to get pages and noting with BIO_PAGE_PINNED
> if the data pages attached to a bio are pinned.  If the data pages come
> from a non-user-backed iterator, then the pages are left unpinned and
> unref'd, relying on whoever set up the I/O to do the retaining.
>
> This requires the splice-read patchset to have been applied first,
> otherwise reversion of the ITER_PAGE iterator can race with truncate and
> return pages to the allocator whilst they're still undergoing DMA[2].
>
> [...]

Applied, thanks!

[1/6] iomap: Don't get an reference on ZERO_PAGE for direct I/O block zeroing
      commit: 9e73bb36b189ec73c7062ec974e0ff287c1aa152
[2/6] block: Fix bio_flagged() so that gcc can better optimise it
      commit: b9cc607a7f722c374540b2a7c973382592196549
[3/6] block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted logic
      commit: 100ae68dac60a0688082dcaf3e436606ec0fd51f
[4/6] block: Add BIO_PAGE_PINNED and associated infrastructure
      commit: 84d9fe8b7ea6a53fd93506583ff33a408f95ac60
[5/6] block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages
      commit: b7c96963925fe08d4ef175b7d438c0017155807c
[6/6] block: convert bio_map_user_iov to use iov_iter_extract_pages
      commit: 36b61bb07963b13de4cc03a945aa25b9ffc7d003

Best regards,
On Tue, May 23, 2023 at 03:38:31PM -0600, Jens Axboe wrote:
> Applied, thanks!
This ended up on the for-6.5/block branch, but I think it needs to be
on the splice one, as that is pre-requisite unless I'm missing
something.
On Tue, May 23, 2023 at 09:16:11PM +0100, David Howells wrote:
> I've been poking at it this afternoon, but it doesn't look like it's going to
> be straightforward, unfortunately.  The mm folks have been withdrawing access
> to the pinning API behind the ramparts of the mm/ dir.  Further, the dio code
> will (I think), under some circumstances, arbitrarily insert the zero_page
> into a list of things that are maybe pinned or maybe unpinned, but I can (I
> think) also be given a pinned zero_page from the GUP code if the page tables
> point to one and a DIO-write is requested - so just doing if page == zero_page
> isn't sufficient.

Yes.  I think the proper workaround is to add a MM helper that just pins a
single page and make it available to direct-io.c.  It should not be exported
and clearly marked to not be used in new code.

> What I'd like to do is to make the GUP code not take a ref on the zero_page
> if, say, FOLL_DONT_PIN_ZEROPAGE is passed in, and then make the bio cleanup
> code always ignore the zero_page.

I don't think that'll work, as we can't mix different pin vs get types in a
bio.  And that's really a good thing.

> Something that I noticed is that the dio code seems to wangle to page bits on
> the target pages for a DIO-read, which seems odd, but I'm not sure I fully
> understand the code yet.

I don't understand this sentence.
On 23.05.23 22:16, David Howells wrote:
> Christoph Hellwig <hch@infradead.org> wrote:
>
>> But can you please also take care of the legacy direct I/O code?  I'd really
>> hate to leave yet another unfinished transition around.
>
> I've been poking at it this afternoon, but it doesn't look like it's going to
> be straightforward, unfortunately.  The mm folks have been withdrawing access
> to the pinning API behind the ramparts of the mm/ dir.  Further, the dio code
> will (I think), under some circumstances, arbitrarily insert the zero_page
> into a list of things that are maybe pinned or maybe unpinned, but I can (I
> think) also be given a pinned zero_page from the GUP code if the page tables
> point to one and a DIO-write is requested - so just doing if page == zero_page
> isn't sufficient.
>
> What I'd like to do is to make the GUP code not take a ref on the zero_page
> if, say, FOLL_DONT_PIN_ZEROPAGE is passed in, and then make the bio cleanup
> code always ignore the zero_page.

We discussed doing that unconditionally in the context of vfio (below), but
vfio decided to add a workaround suitable for stable.

In case of FOLL_PIN it's simple: if we detect the zeropage, don't mess with
the refcount when pinning and don't mess with the refcount when unpinning
(esp. unpin_user_pages).  FOLL_GET is a different story but we don't have to
mess with that.

So there shouldn't be need for a FOLL_DONT_PIN_ZEROPAGE; we could just do it
unconditionally.

> Alternatively, I can drop the pin immediately if I get given one on the
> zero_page - it's not going anywhere, after all.

That's what vfio did in

commit 873aefb376bbc0ed1dd2381ea1d6ec88106fdbd4
Author: Alex Williamson <alex.williamson@redhat.com>
Date:   Mon Aug 29 21:05:40 2022 -0600

    vfio/type1: Unpin zero pages

    There's currently a reference count leak on the zero page.  We increment
    the reference via pin_user_pages_remote(), but the page is later handled
    as an invalid/reserved page, therefore it's not accounted against the
    user and not unpinned by our put_pfn().

    Introducing special zero page handling in put_pfn() would resolve the
    leak, but without accounting of the zero page, a single user could still
    create enough mappings to generate a reference count overflow.

    The zero page is always resident, so for our purposes there's no reason
    to keep it pinned.  Therefore, add a loop to walk pages returned from
    pin_user_pages_remote() and unpin any zero pages.

For vfio that handling is no longer required, because FOLL_LONGTERM will
never pin the shared zeropage.
Christoph Hellwig <hch@infradead.org> wrote:
> > Applied, thanks!
>
> This ended up on the for-6.5/block branch, but I think it needs to be
> on the splice one, as that is pre-requisite unless I'm missing
> something.

Indeed.  As I noted in the cover note:

    This requires the splice-read patchset to have been applied first,
    otherwise reversion of the ITER_PAGE iterator can race with truncate and
    return pages to the allocator whilst they're still undergoing DMA [2].

David
Christoph Hellwig <hch@infradead.org> wrote:
> > What I'd like to do is to make the GUP code not take a ref on the zero_page
> > if, say, FOLL_DONT_PIN_ZEROPAGE is passed in, and then make the bio cleanup
> > code always ignore the zero_page.
>
> I don't think that'll work, as we can't mix different pin vs get types
> in a bio.  And that's really a good thing.

True - but I was thinking of just treating the zero_page specially and never
hold a pin or a ref on it.  It can be checked by address, e.g.:

	static inline void bio_release_page(struct bio *bio, struct page *page)
	{
		if (page == ZERO_PAGE(0))
			return;
		if (bio_flagged(bio, BIO_PAGE_PINNED))
			unpin_user_page(page);
		else if (bio_flagged(bio, BIO_PAGE_REFFED))
			put_page(page);
	}

I'm slightly concerned about the possibility of overflowing the refcount.
The problem is that it only takes about 2 million pins to do that (because
the zero_page isn't a large folio) - which is within reach of userspace.
Create an 8GiB anon mmap and do a bunch of async DIO writes from it.  You
won't hit ENOMEM because it will stick ~2 million pointers to zero_page into
the page tables.

> > Something that I noticed is that the dio code seems to wangle to page bits on
> > the target pages for a DIO-read, which seems odd, but I'm not sure I fully
> > understand the code yet.
>
> I don't understand this sentence.

I was looking at this:

	static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
	{
		...
		if (dio->is_async && dio_op == REQ_OP_READ && dio->should_dirty)
			bio_set_pages_dirty(bio);
		...
	}

but looking again, the lock is taken briefly and the dirty bit is set -
which is reasonable.  However, should we be doing it before starting the
I/O?

David
On 5/23/23 11:52 PM, Christoph Hellwig wrote:
> On Tue, May 23, 2023 at 03:38:31PM -0600, Jens Axboe wrote:
>> Applied, thanks!
>
> This ended up on the for-6.5/block branch, but I think it needs to be
> on the splice one, as that is pre-requisite unless I'm missing
> something.

Oops yes, that's my bad.  I've reshuffled things now so that they should
make more sense.
On Wed, May 24, 2023 at 09:47:10AM +0100, David Howells wrote:
> True - but I was thinking of just treating the zero_page specially and never
> hold a pin or a ref on it.  It can be checked by address, e.g.:
>
> 	static inline void bio_release_page(struct bio *bio, struct page *page)
> 	{
> 		if (page == ZERO_PAGE(0))
> 			return;
> 		if (bio_flagged(bio, BIO_PAGE_PINNED))
> 			unpin_user_page(page);
> 		else if (bio_flagged(bio, BIO_PAGE_REFFED))
> 			put_page(page);
> 	}

That does sound good as well to me.

> I was looking at this:
>
> 	static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
> 	{
> 		...
> 		if (dio->is_async && dio_op == REQ_OP_READ && dio->should_dirty)
> 			bio_set_pages_dirty(bio);
> 		...
> 	}
>
> but looking again, the lock is taken briefly and the dirty bit is set - which
> is reasonable.  However, should we be doing it before starting the I/O?

It is done before starting the I/O - the submit_bio is just below this.
On Wed, May 24, 2023 at 1:47 AM David Howells <dhowells@redhat.com> wrote:
>
> True - but I was thinking of just treating the zero_page specially and never
> hold a pin or a ref on it.  It can be checked by address, e.g.:
>
>         static inline void bio_release_page(struct bio *bio, struct page *page)
>         {
>                 if (page == ZERO_PAGE(0))
>                         return;

That won't actually work.

We do have cases that try to use the page coloring that we support.

Admittedly it seems to be only rdma that does it directly with something
like this:

        vmf->page = ZERO_PAGE(vmf->address);

but you can get arbitrary zero pages by pinning or GUPing them from user
space mappings.

Now, the only architectures that *use* multiple zero pages are - I think -
MIPS (including Loongarch) and s390.

So it's rare, but it does happen.

               Linus
On 25.05.23 18:31, Linus Torvalds wrote:
> On Wed, May 24, 2023 at 1:47 AM David Howells <dhowells@redhat.com> wrote:
>>
>> True - but I was thinking of just treating the zero_page specially and never
>> hold a pin or a ref on it.  It can be checked by address, e.g.:
>>
>> static inline void bio_release_page(struct bio *bio, struct page *page)
>> {
>>         if (page == ZERO_PAGE(0))
>>                 return;
>
> That won't actually work.
>
> We do have cases that try to use the page coloring that we support.
>
> Admittedly it seems to be only rdma that does it directly with
> something like this:
>
>     vmf->page = ZERO_PAGE(vmf->address);
>
> but you can get arbitrary zero pages by pinning or GUPing them from
> user space mappings.
>
> Now, the only architectures that *use* multiple zero pages are - I
> think - MIPS (including Loongarch) and s390.
>
> So it's rare, but it does happen.

I think the correct way to test for a zero page is
is_zero_pfn(page_to_pfn(page)).

Using my_zero_pfn(vmf->address) in do_anonymous_page(), these can easily
end up in any process.
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> We do have cases that try to use the page coloring that we support.
What do we gain from it? Presumably since nothing is supposed to write to
that page, it can be shared in all the caches.
David
On Thu, May 25, 2023 at 9:45 AM David Hildenbrand <david@redhat.com> wrote:
>
> I think the correct way to test for a zero page is
> is_zero_pfn(page_to_pfn(page)).

Yeah.  Except it's really ugly and strange, and we should probably add a
helper for that pattern.

The reason it has that odd "look at pfn" is just because I think the first
users were in the page table code, which had the pfn already, and the test
is basically based on the zero_page_mask thing that the affected
architectures have.

So I suspect we should add that

    is_zero_pfn(page_to_pfn(page))

as a helper inline function rather than write it out even more times (that
"is this 'struct page' a zero page" pattern already exists in /proc and a
few other places).

is_longterm_pinnable_page() already has it, so adding it as a helper there
in <linux/mm.h> is probably a good idea.

             Linus
David Hildenbrand <david@redhat.com> wrote:
> I think the correct way to test for a zero page is
> is_zero_pfn(page_to_pfn(page)).
>
> Using my_zero_pfn(vmf->address) in do_anonymous_page() these can easily end
> up in any process.

Should everywhere that is using ZERO_PAGE(0) actually be using
my_zero_pfn()?  ZERO_PAGE() could do with a kdoc comment saying how to use
it.

David
On Thu, May 25, 2023 at 10:01 AM David Howells <dhowells@redhat.com> wrote:
>
> What do we gain from it?  Presumably since nothing is supposed to write to
> that page, it can be shared in all the caches.

I don't remember the details, but they went something like "broken purely
virtually indexed cache avoids physical aliases by cacheline exclusion at
fill time".

Which then meant that if you walk a zero mapping, you'll invalidate the
caches of the previous page when you walk the next one.  Causing
horrendously bad performance.  Unless it's colored.

Something like that.  I probably got all the details wrong.

            Linus
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> So I suspect we should add that
>
>     is_zero_pfn(page_to_pfn(page))
>
> as a helper inline function rather than write it out even more times
> (that "is this 'struct page' a zero page" pattern already exists in
> /proc and a few other places.
>
> is_longterm_pinnable_page() already has it, so adding it as a helper
> there in <linux/mm.h> is probably a good idea.

I just added:

	static inline bool IS_ZERO_PAGE(const struct page *page)
	{
		return is_zero_pfn(page_to_pfn(page));
	}

	static inline bool IS_ZERO_FOLIO(const struct folio *folio)
	{
		return is_zero_pfn(page_to_pfn((const struct page *)folio));
	}

to include/linux/pgtable.h.  It doesn't seem I can add it to mm.h as an
inline function.

David
On Thu, May 25, 2023 at 10:07 AM David Howells <dhowells@redhat.com> wrote:
>
> Should everywhere that is using ZERO_PAGE(0) actually be using my_zero_pfn()?

No, that would just make code uglier for no reason, because then you have to
turn that pfn into a virtual address.

So if what you *want* is a pfn to begin with, then use my_zero_pfn().

But if what you want is just the virtual address, use ZERO_PAGE().  And if
you are going to map it at some address, give it the address you're going to
use, otherwise just do zero for "whatever".

The only thing you can't use ZERO_PAGE(0) for is literally that "is this a
zero page" address comparison, because ZERO_PAGE(0) is just _one_ address.

           Linus
On Thu, May 25, 2023 at 10:15 AM David Howells <dhowells@redhat.com> wrote:
>
> It doesn't seem I can add it to mm.h as an inline function.

What?  We already have that pattern inside is_longterm_pinnable_page(), so
that's really strange.

But regardless, please don't duplicate that odd conditional for no reason,
and don't scream.  So regardless of where it is, make that "is_zero_folio()"
just do "is_zero_page(&folio->page)" rather than repeat the question.

I also wonder whether we shouldn't just use the "transparent union"
argument thing more aggressively.  Something like

	typedef union {
		struct page *page;
		struct folio *folio;
	} page_or_folio_t __attribute__ ((__transparent_union__));

and then you should be able to do something like this:

	static inline bool is_zero_page(const page_or_folio_t arg)
	{
		return is_zero_pfn(page_to_pfn(arg.page));
	}

and we don't have to keep generating the two versions over and over and
over again.

            Linus