Message ID: <20230120175556.3556978-1-dhowells@redhat.com>
Headers:
From: David Howells <dhowells@redhat.com>
To: Al Viro <viro@zeniv.linux.org.uk>, Christoph Hellwig <hch@infradead.org>
Cc: David Howells <dhowells@redhat.com>, Matthew Wilcox <willy@infradead.org>, Jens Axboe <axboe@kernel.dk>, Jan Kara <jack@suse.cz>, Jeff Layton <jlayton@kernel.org>, Logan Gunthorpe <logang@deltatee.com>, linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list)
Date: Fri, 20 Jan 2023 17:55:48 +0000
Message-Id: <20230120175556.3556978-1-dhowells@redhat.com>
Series: iov_iter: Improve page extraction (ref, pin or just list)
Message
David Howells
Jan. 20, 2023, 5:55 p.m. UTC
Hi Al, Christoph,

Here are patches to provide support for extracting pages from an iov_iter and a patch to use the primary extraction function in the block layer bio code. The patches make the following changes:

 (1) Add a function, iov_iter_extract_pages(), to replace iov_iter_get_pages*() that gets refs, pins or just lists the pages as appropriate to the iterator type and the I/O direction.

     Add a function, iov_iter_extract_mode(), that will indicate from the iterator type and the I/O direction how the cleanup is to be performed, returning FOLL_GET, FOLL_PIN or 0.

 (2) Add a function, folio_put_unpin(), and a wrapper, page_put_unpin(), that take a page and the return from iov_iter_extract_mode() and do the right thing to clean up the page.

 (3) Make the bio struct carry a pair of flags to indicate the cleanup mode. BIO_NO_PAGE_REF is replaced with BIO_PAGE_REFFED (equivalent to FOLL_GET) and BIO_PAGE_PINNED (equivalent to FOLL_PIN) is added.

 (4) Add a function, bio_release_page(), to release a page appropriately to the cleanup mode indicated by the BIO_PAGE_* flags.

 (5) Make the iter-to-bio code use iov_iter_extract_pages() to retain the pages appropriately and clean them up later.

 (6) Fix bio_flagged() so that it doesn't prevent a gcc optimisation.

 (7) Renumber FOLL_GET and FOLL_PIN down so that they're at bits 0 and 1 and coincident with BIO_PAGE_REFFED and BIO_PAGE_PINNED. The compiler can then optimise on that. Also, it's probably going to be necessary to embed these in the page pointer in sk_buff fragments. This patch can go independently through the mm tree.

Changes:
========
ver #7)
 - For now, drop the parts to pass the I/O direction to iov_iter_*pages*() as it turned out to be a lot more complicated, with places not setting IOCB_WRITE when they should, for example.
 - Drop all the patches that changed things other than the block layer's bio handling. The netfslib and cifs changes can go into a separate patchset.
 - Add support for extracting pages from KVEC-type iterators.
 - When extracting from BVEC/KVEC, skip over empty vecs at the front.

ver #6)
 - Fix write() syscall and co. not setting IOCB_WRITE.
 - Added iocb_is_read() and iocb_is_write() to check IOCB_WRITE.
 - Use op_is_write() in bio_copy_user_iov().
 - Drop the iterator direction checks from smbd_recv().
 - Define FOLL_SOURCE_BUF and FOLL_DEST_BUF and pass them in as part of gup_flags to iov_iter_get/extract_pages*().
 - Replace iov_iter_get_pages*2() with iov_iter_get_pages*() and remove.
 - Add back the function to indicate the cleanup mode.
 - Drop the cleanup_mode return arg to iov_iter_extract_pages().
 - Provide a helper to clean up a page.
 - Renumbered FOLL_GET and FOLL_PIN and made BIO_PAGE_REFFED/PINNED have the same numerical values, enforced with an assertion.
 - Converted AF_ALG, SCSI vhost, generic DIO, FUSE, splice to pipe, 9P and NFS.
 - Added in the patches to make CIFS do top-to-bottom iterators and use various of the added extraction functions.
 - Added a pair of work-in-progress patches to make sk_buff fragments store FOLL_GET and FOLL_PIN.

ver #5)
 - Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED and split into own patch.
 - Transcribe FOLL_GET/PIN into BIO_PAGE_REFFED/PINNED flags.
 - Add patch to allow bio_flagged() to be combined by gcc.

ver #4)
 - Drop the patch to move the FOLL_* flags to linux/mm_types.h as they're no longer referenced by linux/uio.h.
 - Add ITER_SOURCE/DEST cleanup patches.
 - Make iov_iter/netfslib iter extraction patches use ITER_SOURCE/DEST.
 - Allow additional gup_flags to be passed into iov_iter_extract_pages().
 - Add struct bio patch.

ver #3)
 - Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access to get/pin_user_pages_fast()[1].

ver #2)
 - Rolled the extraction cleanup mode query function into the extraction function, returning the indication through the argument list.
 - Fixed patch 4 (extract to scatterlist) to actually use the new extraction API.
I've pushed the patches (excluding the two WIP networking patches) here also:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-extract

David

Link: https://lore.kernel.org/r/Y3zFzdWnWlEJ8X8/@infradead.org/ [1]
Link: https://lore.kernel.org/r/166697254399.61150.1256557652599252121.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166722777223.2555743.162508599131141451.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732024173.3186319.18204305072070871546.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166869687556.3723671.10061142538708346995.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166920902005.1461876.2786264600108839814.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/166997419665.9475.15014699817597102032.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/167305160937.1521586.133299343565358971.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/167344725490.2425628.13771289553670112965.stgit@warthog.procyon.org.uk/ # v5
Link: https://lore.kernel.org/r/167391047703.2311931.8115712773222260073.stgit@warthog.procyon.org.uk/ # v6

David Howells (8):
  iov_iter: Define flags to qualify page extraction.
  iov_iter: Add a function to extract a page list from an iterator
  mm: Provide a helper to drop a pin/ref on a page
  block: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning
  block: Add BIO_PAGE_PINNED
  block: Make bio structs pin pages rather than ref'ing if appropriate
  block: Fix bio_flagged() so that gcc can better optimise it
  mm: Renumber FOLL_GET and FOLL_PIN down

 block/bio.c               |  43 ++--
 block/blk-map.c           |  26 +--
 block/blk.h               |  29 +++
 fs/iomap/direct-io.c      |   1 -
 include/linux/bio.h       |   5 +-
 include/linux/blk_types.h |   3 +-
 include/linux/mm.h        |  17 +-
 include/linux/uio.h       |  35 ++-
 lib/iov_iter.c            | 438 +++++++++++++++++++++++++++++++++++++-
 mm/gup.c                  |  22 ++
 10 files changed, 571 insertions(+), 48 deletions(-)
Comments
On Fri, Jan 20, 2023 at 05:55:48PM +0000, David Howells wrote:
> (3) Make the bio struct carry a pair of flags to indicate the cleanup
>     mode. BIO_NO_PAGE_REF is replaced with BIO_PAGE_REFFED (equivalent to
>     FOLL_GET) and BIO_PAGE_PINNED (equivalent to BIO_PAGE_PINNED) is
>     added.

I think there's a simpler solution than all of this.

As I understand the fundamental problem here, the question is when to copy a page on fork. We have the optimisation of COW, but O_DIRECT/RDMA/... breaks it. So all this page pinning is to indicate to the fork code "You can't do COW to this page".

Why do we want to track that information on a per-page basis? Wouldn't it be easier to have a VM_NOCOW flag in vma->vm_flags? Set it the first time somebody does an O_DIRECT read or RDMA pin. That's it. Pages in that VMA will now never be COWed, regardless of their refcount/mapcount. And the whole "did we pin or get this page" problem goes away. Along with folio->pincount.
Matthew Wilcox <willy@infradead.org> wrote:

> Why do we want to track that information on a per-page basis? Wouldn't it be easier to have a VM_NOCOW flag in vma->vm_flags? Set it the first time somebody does an O_DIRECT read or RDMA pin. That's it. Pages in that VMA will now never be COWed, regardless of their refcount/mapcount. And the whole "did we pin or get this page" problem goes away. Along with folio->pincount.

Wouldn't that potentially make someone's entire malloc() heap entirely NOCOW if they did a single DIO to/from it?

Also you only mention DIO read - but what about "start DIO write; fork(); touch buffer" in the parent - now the write buffer belongs to the child and they can affect the parent's write.

David
On Mon 23-01-23 16:31:32, Matthew Wilcox wrote:
> On Fri, Jan 20, 2023 at 05:55:48PM +0000, David Howells wrote:
> > (3) Make the bio struct carry a pair of flags to indicate the cleanup
> >     mode. BIO_NO_PAGE_REF is replaced with BIO_PAGE_REFFED (equivalent to
> >     FOLL_GET) and BIO_PAGE_PINNED (equivalent to BIO_PAGE_PINNED) is
> >     added.
>
> I think there's a simpler solution than all of this.
>
> As I understand the fundamental problem here, the question is when to copy a page on fork. We have the optimisation of COW, but O_DIRECT/RDMA/... breaks it. So all this page pinning is to indicate to the fork code "You can't do COW to this page".
>
> Why do we want to track that information on a per-page basis? Wouldn't it be easier to have a VM_NOCOW flag in vma->vm_flags? Set it the first time somebody does an O_DIRECT read or RDMA pin. That's it. Pages in that VMA will now never be COWed, regardless of their refcount/mapcount. And the whole "did we pin or get this page" problem goes away. Along with folio->pincount.

Well, but anon COW code is not the only (planned) consumer of the pincount. Filesystems also need to know whether a (shared pagecache) page is pinned and can thus be modified behind their backs. And for that VMA tracking isn't really an option.

Honza
On Mon, Jan 23, 2023 at 04:38:47PM +0000, David Howells wrote:
> Matthew Wilcox <willy@infradead.org> wrote:
>
> > Why do we want to track that information on a per-page basis? Wouldn't it be easier to have a VM_NOCOW flag in vma->vm_flags? Set it the first time somebody does an O_DIRECT read or RDMA pin. That's it. Pages in that VMA will now never be COWed, regardless of their refcount/mapcount. And the whole "did we pin or get this page" problem goes away. Along with folio->pincount.
>
> Wouldn't that potentially make someone's entire malloc() heap entirely NOCOW if they did a single DIO to/from it.

Yes. Would that be an actual problem for any real application?

We could do this with a vm_pincount if it's essential to be able to count how often it's happened and be able to fork() without COW if it's something that happened in the past and is now not happening.

> Also you only mention DIO read - but what about "start DIO write; fork(); touch buffer" in the parent - now the write buffer belongs to the child and they can affect the parent's write.

I'm struggling to see the problem here. If the child hasn't exec'd, the parent and child are still in the same security domain. The parent could have modified the buffer before calling fork().
Matthew Wilcox <willy@infradead.org> wrote:

> > Wouldn't that potentially make someone's entire malloc() heap entirely NOCOW if they did a single DIO to/from it.
>
> Yes. Would that be an actual problem for any real application?

Without auditing all applications that do direct I/O writes, it's hard to say - but a big database engine, Oracle for example, forking off a process, say, could cause a massive slowdown as fork suddenly has to copy a huge amount of malloc'd data unnecessarily[*].

[*] I'm making wild assumptions about how Oracle's DB engine works.

> > Also you only mention DIO read - but what about "start DIO write; fork(); touch buffer" in the parent - now the write buffer belongs to the child and they can affect the parent's write.
>
> I'm struggling to see the problem here. If the child hasn't exec'd, the parent and child are still in the same security domain. The parent could have modified the buffer before calling fork().

It could still inadvertently change the data its parent set to write out. The child *shouldn't* be able to change the parent's in-progress write. The most obvious problem would be in something that does DIO from a stack buffer, I think.

David
On Mon 23-01-23 16:42:56, Matthew Wilcox wrote:
> On Mon, Jan 23, 2023 at 04:38:47PM +0000, David Howells wrote:
> > Matthew Wilcox <willy@infradead.org> wrote:
> > Also you only mention DIO read - but what about "start DIO write; fork(); touch buffer" in the parent - now the write buffer belongs to the child and they can affect the parent's write.
>
> I'm struggling to see the problem here. If the child hasn't exec'd, the parent and child are still in the same security domain. The parent could have modified the buffer before calling fork().

Sadly they are not. Android in particular starts applications by forking one big binary (zygote) that has multiple apps linked together and relies on the fact the child cannot influence the parent after the fork. We've already had CVEs with GUP & COW & fork due to this. David Hildenbrand has a lot of memories regarding this I believe ;)

Honza
On Mon, Jan 23, 2023 at 05:42:18PM +0100, Jan Kara wrote:
> On Mon 23-01-23 16:31:32, Matthew Wilcox wrote:
> > On Fri, Jan 20, 2023 at 05:55:48PM +0000, David Howells wrote:
> > > (3) Make the bio struct carry a pair of flags to indicate the cleanup mode. BIO_NO_PAGE_REF is replaced with BIO_PAGE_REFFED (equivalent to FOLL_GET) and BIO_PAGE_PINNED (equivalent to BIO_PAGE_PINNED) is added.
> >
> > I think there's a simpler solution than all of this.
> >
> > As I understand the fundamental problem here, the question is when to copy a page on fork. We have the optimisation of COW, but O_DIRECT/RDMA/... breaks it. So all this page pinning is to indicate to the fork code "You can't do COW to this page".
> >
> > Why do we want to track that information on a per-page basis? Wouldn't it be easier to have a VM_NOCOW flag in vma->vm_flags? Set it the first time somebody does an O_DIRECT read or RDMA pin. That's it. Pages in that VMA will now never be COWed, regardless of their refcount/mapcount. And the whole "did we pin or get this page" problem goes away. Along with folio->pincount.
>
> Well, but anon COW code is not the only (planned) consumer of the pincount. Filesystems also need to know whether a (shared pagecache) page is pinned and can thus be modified behind their backs. And for that VMA tracking isn't really an option.

Bleh, I'd forgotten about that problem. We really do need to keep track of which pages are under I/O for this case, because we need to tell the filesystem that they are now available for writeback.

That said, I don't know that we need to keep track of it in the pages themselves. Can't we have something similar to rmap which keeps track of a range of pinned pages, and have it taken care of at a higher level (ie unpin the pages in the dio_iodone_t rather than in the BIO completion handler)?

I'm not even sure why pinned pagecache pages remain on the LRU. They should probably go onto the unevictable list with the mlocked pages, then on unpin get marked dirty and placed on the active list. There's no point in writing back a pinned page since it can be written to at any instant without any part of the kernel knowing.
On Mon, Jan 23, 2023 at 05:19:51PM +0000, David Howells wrote:
> Matthew Wilcox <willy@infradead.org> wrote:
>
> > > Wouldn't that potentially make someone's entire malloc() heap entirely NOCOW if they did a single DIO to/from it.
> >
> > Yes. Would that be an actual problem for any real application?
>
> Without auditing all applications that do direct I/O writes, it's hard to say - but a big database engine, Oracle for example, forking off a process, say, could cause a massive slow down as fork suddenly has to copy a huge amount of malloc'd data unnecessarily[*].
>
> [*] I'm making wild assumptions about how Oracle's DB engine works.

Yes. The cache is shared between all Oracle processes, so it's not COWed. Indeed (as the mshare patches show), what Oracle wants is _more_ sharing between the processes, not _less_.

> > > Also you only mention DIO read - but what about "start DIO write; fork(); touch buffer" in the parent - now the write buffer belongs to the child and they can affect the parent's write.
> >
> > I'm struggling to see the problem here. If the child hasn't exec'd, the parent and child are still in the same security domain. The parent could have modified the buffer before calling fork().
>
> It could still inadvertently change the data its parent set to write out. The child *shouldn't* be able to change the parent's in-progress write. The most obvious problem would be in something that does DIO from a stack buffer, I think.

If it's a problem then O_DIRECT writes can also set the NOCOW flag.
On 1/23/23 09:33, Matthew Wilcox wrote:
...
> Bleh, I'd forgotten about that problem. We really do need to keep track of which pages are under I/O for this case, because we need to tell the filesystem that they are now available for writeback.
>
> That said, I don't know that we need to keep track of it in the pages themselves. Can't we have something similar to rmap which keeps track of a range of pinned pages, and have it taken care of at a higher level (ie unpin the pages in the dio_iodone_t rather than in the BIO completion handler)?
>
> I'm not even sure why pinned pagecache pages remain on the LRU. They should probably go onto the unevictable list with the mlocked

This is an intriguing idea, but...

> pages, then on unpin get marked dirty and placed on the active list. There's no point in writing back a pinned page since it can be written to at any instant without any part of the kernel knowing.

There have been filesystem discussions about this: if a page goes unwritten for "too long", it's not good. To address that, bounce buffers were proposed for periodic writeback of pinned pages.

The idea with bounce buffers is: even though the page is never guaranteed up to date (because DMA can instantly make it effectively dirty), it is at least much less out of date after a periodic writeback than it was before. And that is important for some file system use cases.

thanks,
On 23.01.23 18:25, Jan Kara wrote:
> On Mon 23-01-23 16:42:56, Matthew Wilcox wrote:
>> On Mon, Jan 23, 2023 at 04:38:47PM +0000, David Howells wrote:
>>> Matthew Wilcox <willy@infradead.org> wrote:
>>> Also you only mention DIO read - but what about "start DIO write; fork(); touch buffer" in the parent - now the write buffer belongs to the child and they can affect the parent's write.
>>
>> I'm struggling to see the problem here. If the child hasn't exec'd, the parent and child are still in the same security domain. The parent could have modified the buffer before calling fork().
>
> Sadly they are not. Android in particular starts applications by forking one big binary (zygote) that has multiple apps linked together and relies on the fact the child cannot influence the parent after the fork. We've already had CVEs with GUP & COW & fork due to this. David Hildenbrand has a lot of memories regarding this I believe ;)

:)

Once FOLL_PIN is used most of the issues go away and we don't have to play any games with VM flags or similar ... With FOLL_PIN, I consider anon a solved problem and not worth any new fancy ideas.
On Mon 23-01-23 17:33:24, Matthew Wilcox wrote:
> On Mon, Jan 23, 2023 at 05:42:18PM +0100, Jan Kara wrote:
> > Well, but anon COW code is not the only (planned) consumer of the pincount. Filesystems also need to know whether a (shared pagecache) page is pinned and can thus be modified behind their backs. And for that VMA tracking isn't really an option.
>
> Bleh, I'd forgotten about that problem. We really do need to keep track of which pages are under I/O for this case, because we need to tell the filesystem that they are now available for writeback.
>
> That said, I don't know that we need to keep track of it in the pages themselves. Can't we have something similar to rmap which keeps track of a range of pinned pages, and have it taken care of at a higher level (ie unpin the pages in the dio_iodone_t rather than in the BIO completion handler)?

We could, but bear in mind that there are more places than just (the two) direct IO paths. There are many GUP users that can be operating on mmapped pagecache and that can modify page data. Also note that e.g. direct IO is rather performance sensitive so any CPU overhead that is added to that path gets noticed. That was actually the reason why we keep pinned pages on the LRU - John tried to remove them while pinning but the overhead was not acceptable. Finally, since we need to handle pinning for anon pages as well, just (ab)using the page refcount looks like the least painful solution.

> I'm not even sure why pinned pagecache pages remain on the LRU. They should probably go onto the unevictable list with the mlocked pages, then on unpin get marked dirty and placed on the active list. There's no point in writing back a pinned page since it can be written to at any instant without any part of the kernel knowing.

True, but as John said, sometimes we need to write out even a pinned page - e.g. on fsync(2). For some RDMA users which keep pages pinned for days or months, this is actually crucial...

Honza
On Tue, Jan 24, 2023 at 11:29:31AM +0100, Jan Kara wrote:
> True but as John said sometimes we need to writeout even pinned page - e.g. on fsync(2). For some RDMA users which keep pages pinned for days or months, this is actually crutial...

I think we have to distinguish between short term (just FOLL_PIN) and long term (FOLL_PIN | FOLL_LONGTERM) pins. For short term ones the proper thing to do in data integrity writeback is to simply wait for the unpin. For the long term pins that obviously can't work.

The right answer for that is complicated and I don't have a good answer yet. The best one would probably be to prohibit them on MAP_SHARED file backed mappings - this might break some existing software, but that software is already so broken that this might be best.