[v7,0/8] iov_iter: Improve page extraction (ref, pin or just list)

Message ID 20230120175556.3556978-1-dhowells@redhat.com

David Howells Jan. 20, 2023, 5:55 p.m. UTC
  Hi Al, Christoph,

Here are patches to provide support for extracting pages from an iov_iter
and a patch to use the primary extraction function in the block layer bio
code.

The patches make the following changes:

 (1) Add a function, iov_iter_extract_pages() to replace
     iov_iter_get_pages*() that gets refs, pins or just lists the pages as
     appropriate to the iterator type and the I/O direction.

     Add a function, iov_iter_extract_mode() that will indicate from the
     iterator type and the I/O direction how the cleanup is to be
     performed, returning FOLL_GET, FOLL_PIN or 0.
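In userspace terms, the type-to-mode mapping being described might be sketched like this. All names and flag values here are mock-ups, and the pipe case is one plausible reading of the cover letter's "refs, pins or just lists" wording - the authoritative logic is in the patch itself:

```c
#include <assert.h>

/* Mocked-up flag values and iterator types, for illustration only. */
#define FOLL_GET 0x01
#define FOLL_PIN 0x02

enum iter_type { ITER_UBUF, ITER_IOVEC, ITER_KVEC, ITER_BVEC,
		 ITER_XARRAY, ITER_PIPE };

/*
 * Sketch of the cleanup-mode decision: pages extracted from user-backed
 * iterators get pinned, pipe pages take a ref, and kernel-backed
 * iterators need no retention at all (the caller just gets a list).
 */
static unsigned int mock_extract_mode(enum iter_type type)
{
	switch (type) {
	case ITER_UBUF:
	case ITER_IOVEC:
		return FOLL_PIN;
	case ITER_PIPE:
		return FOLL_GET;
	default:	/* KVEC, BVEC, XARRAY: pages already held elsewhere */
		return 0;
	}
}
```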

 (2) Add a function, folio_put_unpin(), and a wrapper, page_put_unpin(),
     that take a page and the return from iov_iter_extract_mode() and do
     the right thing to clean up the page.
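A minimal userspace sketch of the dispatch such a helper performs, with put_page() and unpin_user_page() stubbed out as counters (everything here is a mock-up, not the kernel API):

```c
#include <assert.h>

#define FOLL_GET 0x01
#define FOLL_PIN 0x02

static int puts_done, unpins_done;

static void mock_put_page(void)   { puts_done++; }   /* stands in for put_page() */
static void mock_unpin_page(void) { unpins_done++; } /* stands in for unpin_user_page() */

/*
 * Sketch of page_put_unpin(): drop whatever retention was taken when the
 * page was extracted, as indicated by the iov_iter_extract_mode() result.
 */
static void mock_page_put_unpin(unsigned int cleanup_mode)
{
	if (cleanup_mode & FOLL_PIN)
		mock_unpin_page();
	else if (cleanup_mode & FOLL_GET)
		mock_put_page();
	/* cleanup_mode == 0: nothing to release */
}
```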

 (3) Make the bio struct carry a pair of flags to indicate the cleanup
     mode.  BIO_NO_PAGE_REF is replaced with BIO_PAGE_REFFED (equivalent to
     FOLL_GET) and BIO_PAGE_PINNED (equivalent to FOLL_PIN) is added.

 (4) Add a function, bio_release_page(), to release a page appropriately to
     the cleanup mode indicated by the BIO_PAGE_* flags.

 (5) Make the iter-to-bio code use iov_iter_extract_pages() to retain the
     pages appropriately and clean them up later.

 (6) Fix bio_flagged() so that it doesn't prevent a gcc optimisation.
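The gist of that fix, paraphrased in a standalone mock-up (the struct and function names are stand-ins): dropping the explicit "!= 0" lets gcc merge adjacent flag tests such as flagged(bio, A) && flagged(bio, B) into a single mask-and-compare, while the returned value is unchanged.

```c
#include <assert.h>
#include <stdbool.h>

struct mock_bio { unsigned int bi_flags; };

/* Old form: the explicit "!= 0" forces each test to bool separately. */
static bool flagged_old(const struct mock_bio *bio, unsigned int bit)
{
	return (bio->bi_flags & (1U << bit)) != 0;
}

/* New form: behaviourally identical (the bool conversion is implicit),
 * but easier for gcc to combine with neighbouring tests. */
static bool flagged_new(const struct mock_bio *bio, unsigned int bit)
{
	return bio->bi_flags & (1U << bit);
}
```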

 (7) Renumber FOLL_GET and FOLL_PIN down so that they're at bits 0 and 1
     and coincident with BIO_PAGE_REFFED and BIO_PAGE_PINNED.  The compiler
     can then optimise on that.  Also, it's probably going to be necessary
     to embed these in the page pointer in sk_buff fragments.  This patch
     can go independently through the mm tree.
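Sketched standalone, assuming the renumbered values (the exact constants are defined by the patches, not here): the coincidence can be enforced at compile time, and transcribing the cleanup mode into bio flags then becomes a plain mask with no conditional branches for the compiler to emit.

```c
#include <assert.h>

/* Renumbered values per patch (7): bits 0 and 1. */
#define FOLL_GET	0x01
#define FOLL_PIN	0x02
/* Block-layer flags that are meant to coincide numerically. */
#define BIO_PAGE_REFFED	0x01
#define BIO_PAGE_PINNED	0x02

/* With the values coincident, the compile-time check is trivial ... */
_Static_assert(FOLL_GET == BIO_PAGE_REFFED, "FOLL_GET != BIO_PAGE_REFFED");
_Static_assert(FOLL_PIN == BIO_PAGE_PINNED, "FOLL_PIN != BIO_PAGE_PINNED");

/* ... and the transcription is a single AND rather than a translation. */
static unsigned int cleanup_to_bio_flags(unsigned int cleanup_mode)
{
	return cleanup_mode & (FOLL_GET | FOLL_PIN);
}
```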

Changes:
========
ver #7)
 - For now, drop the parts to pass the I/O direction to iov_iter_*pages*()
   as it turned out to be a lot more complicated, with places not setting
   IOCB_WRITE when they should, for example.
 - Drop all the patches that changed things other than the block layer's
   bio handling.  The netfslib and cifs changes can go into a separate
   patchset.
 - Add support for extracting pages from KVEC-type iterators.
 - When extracting from BVEC/KVEC, skip over empty vecs at the front.

ver #6)
 - Fix write() syscall and co. not setting IOCB_WRITE.
 - Added iocb_is_read() and iocb_is_write() to check IOCB_WRITE.
 - Use op_is_write() in bio_copy_user_iov().
 - Drop the iterator direction checks from smbd_recv().
 - Define FOLL_SOURCE_BUF and FOLL_DEST_BUF and pass them in as part of
   gup_flags to iov_iter_get/extract_pages*().
 - Replace iov_iter_get_pages*2() with iov_iter_get_pages*() and remove.
 - Add back the function to indicate the cleanup mode.
 - Drop the cleanup_mode return arg to iov_iter_extract_pages().
 - Provide a helper to clean up a page.
 - Renumbered FOLL_GET and FOLL_PIN and made BIO_PAGE_REFFED/PINNED have
   the same numerical values, enforced with an assertion.
 - Converted AF_ALG, SCSI vhost, generic DIO, FUSE, splice to pipe, 9P and
   NFS.
 - Added in the patches to make CIFS do top-to-bottom iterators and use
   various of the added extraction functions.
 - Added a pair of work-in-progress patches to make sk_buff fragments store
   FOLL_GET and FOLL_PIN.

ver #5)
 - Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED and split into own patch.
 - Transcribe FOLL_GET/PIN into BIO_PAGE_REFFED/PINNED flags.
 - Add patch to allow bio_flagged() to be combined by gcc.

ver #4)
 - Drop the patch to move the FOLL_* flags to linux/mm_types.h as they're
   no longer referenced by linux/uio.h.
 - Add ITER_SOURCE/DEST cleanup patches.
 - Make iov_iter/netfslib iter extraction patches use ITER_SOURCE/DEST.
 - Allow additional gup_flags to be passed into iov_iter_extract_pages().
 - Add struct bio patch.

ver #3)
 - Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
   to get/pin_user_pages_fast()[1].

ver #2)
 - Rolled the extraction cleanup mode query function into the extraction
   function, returning the indication through the argument list.
 - Fixed patch 4 (extract to scatterlist) to actually use the new
   extraction API.

I've pushed the patches (excluding the two WIP networking patches) here
also:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-extract

David

Link: https://lore.kernel.org/r/Y3zFzdWnWlEJ8X8/@infradead.org/ [1]
Link: https://lore.kernel.org/r/166697254399.61150.1256557652599252121.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166722777223.2555743.162508599131141451.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732024173.3186319.18204305072070871546.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166869687556.3723671.10061142538708346995.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166920902005.1461876.2786264600108839814.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/166997419665.9475.15014699817597102032.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/167305160937.1521586.133299343565358971.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/167344725490.2425628.13771289553670112965.stgit@warthog.procyon.org.uk/ # v5
Link: https://lore.kernel.org/r/167391047703.2311931.8115712773222260073.stgit@warthog.procyon.org.uk/ # v6

David Howells (8):
  iov_iter: Define flags to qualify page extraction.
  iov_iter: Add a function to extract a page list from an iterator
  mm: Provide a helper to drop a pin/ref on a page
  block: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the
    meaning
  block: Add BIO_PAGE_PINNED
  block: Make bio structs pin pages rather than ref'ing if appropriate
  block: Fix bio_flagged() so that gcc can better optimise it
  mm: Renumber FOLL_GET and FOLL_PIN down

 block/bio.c               |  43 ++--
 block/blk-map.c           |  26 +--
 block/blk.h               |  29 +++
 fs/iomap/direct-io.c      |   1 -
 include/linux/bio.h       |   5 +-
 include/linux/blk_types.h |   3 +-
 include/linux/mm.h        |  17 +-
 include/linux/uio.h       |  35 ++-
 lib/iov_iter.c            | 438 +++++++++++++++++++++++++++++++++++++-
 mm/gup.c                  |  22 ++
 10 files changed, 571 insertions(+), 48 deletions(-)
  

Comments

Matthew Wilcox Jan. 23, 2023, 4:31 p.m. UTC | #1
On Fri, Jan 20, 2023 at 05:55:48PM +0000, David Howells wrote:
>  (3) Make the bio struct carry a pair of flags to indicate the cleanup
>      mode.  BIO_NO_PAGE_REF is replaced with BIO_PAGE_REFFED (equivalent to
>      FOLL_GET) and BIO_PAGE_PINNED (equivalent to BIO_PAGE_PINNED) is
>      added.

I think there's a simpler solution than all of this.

As I understand the fundamental problem here, the question is
when to copy a page on fork.  We have the optimisation of COW, but
O_DIRECT/RDMA/... breaks it.  So all this page pinning is to indicate
to the fork code "You can't do COW to this page".

Why do we want to track that information on a per-page basis?  Wouldn't it
be easier to have a VM_NOCOW flag in vma->vm_flags?  Set it the first
time somebody does an O_DIRECT read or RDMA pin.  That's it.  Pages in
that VMA will now never be COWed, regardless of their refcount/mapcount.
And the whole "did we pin or get this page" problem goes away.  Along
with folio->pincount.
  
David Howells Jan. 23, 2023, 4:38 p.m. UTC | #2
Matthew Wilcox <willy@infradead.org> wrote:

> Why do we want to track that information on a per-page basis?  Wouldn't it
> be easier to have a VM_NOCOW flag in vma->vm_flags?  Set it the first
> time somebody does an O_DIRECT read or RDMA pin.  That's it.  Pages in
> that VMA will now never be COWed, regardless of their refcount/mapcount.
> And the whole "did we pin or get this page" problem goes away.  Along
> with folio->pincount.

Wouldn't that potentially make someone's entire malloc() heap NOCOW if
they did a single DIO to/from it?

Also you only mention DIO read - but what about "start DIO write; fork(); touch
buffer" in the parent - now the write buffer belongs to the child and they can
affect the parent's write.

David
  
Jan Kara Jan. 23, 2023, 4:42 p.m. UTC | #3
On Mon 23-01-23 16:31:32, Matthew Wilcox wrote:
> On Fri, Jan 20, 2023 at 05:55:48PM +0000, David Howells wrote:
> >  (3) Make the bio struct carry a pair of flags to indicate the cleanup
> >      mode.  BIO_NO_PAGE_REF is replaced with BIO_PAGE_REFFED (equivalent to
> >      FOLL_GET) and BIO_PAGE_PINNED (equivalent to BIO_PAGE_PINNED) is
> >      added.
> 
> I think there's a simpler solution than all of this.
> 
> As I understand the fundamental problem here, the question is
> when to copy a page on fork.  We have the optimisation of COW, but
> O_DIRECT/RDMA/... breaks it.  So all this page pinning is to indicate
> to the fork code "You can't do COW to this page".
> 
> Why do we want to track that information on a per-page basis?  Wouldn't it
> be easier to have a VM_NOCOW flag in vma->vm_flags?  Set it the first
> time somebody does an O_DIRECT read or RDMA pin.  That's it.  Pages in
> that VMA will now never be COWed, regardless of their refcount/mapcount.
> And the whole "did we pin or get this page" problem goes away.  Along
> with folio->pincount.

Well, but anon COW code is not the only (planned) consumer of the pincount.
Filesystems also need to know whether a (shared pagecache) page is pinned
and can thus be modified behind their backs. And for that VMA tracking
isn't really an option.

								Honza
  
Matthew Wilcox Jan. 23, 2023, 4:42 p.m. UTC | #4
On Mon, Jan 23, 2023 at 04:38:47PM +0000, David Howells wrote:
> Matthew Wilcox <willy@infradead.org> wrote:
> 
> > Why do we want to track that information on a per-page basis?  Wouldn't it
> > be easier to have a VM_NOCOW flag in vma->vm_flags?  Set it the first
> > time somebody does an O_DIRECT read or RDMA pin.  That's it.  Pages in
> > that VMA will now never be COWed, regardless of their refcount/mapcount.
> > And the whole "did we pin or get this page" problem goes away.  Along
> > with folio->pincount.
> 
> Wouldn't that potentially make someone's entire malloc() heap entirely NOCOW
> if they did a single DIO to/from it.

Yes.  Would that be an actual problem for any real application?

We could do this with a vm_pincount if it's essential to be able to
count how often it's happened and be able to fork() without COW if it's
something that happened in the past and is now not happening.

> Also you only mention DIO read - but what about "start DIO write; fork(); touch
> buffer" in the parent - now the write buffer belongs to the child and they can
> affect the parent's write.

I'm struggling to see the problem here.  If the child hasn't exec'd, the
parent and child are still in the same security domain.  The parent
could have modified the buffer before calling fork().
  
David Howells Jan. 23, 2023, 5:19 p.m. UTC | #5
Matthew Wilcox <willy@infradead.org> wrote:

> > Wouldn't that potentially make someone's entire malloc() heap entirely NOCOW
> > if they did a single DIO to/from it.
> 
> Yes.  Would that be an actual problem for any real application?

Without auditing all applications that do direct I/O writes, it's hard to
say - but a big database engine, Oracle for example, forking off a process,
say, could cause a massive slowdown as fork suddenly has to copy a huge
amount of malloc'd data unnecessarily[*].

[*] I'm making wild assumptions about how Oracle's DB engine works.

> > Also you only mention DIO read - but what about "start DIO write; fork();
> > touch buffer" in the parent - now the write buffer belongs to the child
> > and they can affect the parent's write.
> 
> I'm struggling to see the problem here.  If the child hasn't exec'd, the
> parent and child are still in the same security domain.  The parent
> could have modified the buffer before calling fork().

It could still inadvertently change the data its parent set to write out.  The
child *shouldn't* be able to change the parent's in-progress write.  The most
obvious problem would be in something that does DIO from a stack buffer, I
think.

David
  
Jan Kara Jan. 23, 2023, 5:25 p.m. UTC | #6
On Mon 23-01-23 16:42:56, Matthew Wilcox wrote:
> On Mon, Jan 23, 2023 at 04:38:47PM +0000, David Howells wrote:
> > Matthew Wilcox <willy@infradead.org> wrote:
> > Also you only mention DIO read - but what about "start DIO write; fork(); touch
> > buffer" in the parent - now the write buffer belongs to the child and they can
> > affect the parent's write.
> 
> I'm struggling to see the problem here.  If the child hasn't exec'd, the
> parent and child are still in the same security domain.  The parent
> could have modified the buffer before calling fork().

Sadly they are not. Android in particular starts applications by forking
one big binary (zygote) that has multiple apps linked together and relies
on the fact the child cannot influence the parent after the fork. We've
already had CVEs with GUP & COW & fork due to this. David Hildenbrand has a
lot of memories regarding this I believe ;)

								Honza
  
Matthew Wilcox Jan. 23, 2023, 5:33 p.m. UTC | #7
On Mon, Jan 23, 2023 at 05:42:18PM +0100, Jan Kara wrote:
> On Mon 23-01-23 16:31:32, Matthew Wilcox wrote:
> > On Fri, Jan 20, 2023 at 05:55:48PM +0000, David Howells wrote:
> > >  (3) Make the bio struct carry a pair of flags to indicate the cleanup
> > >      mode.  BIO_NO_PAGE_REF is replaced with BIO_PAGE_REFFED (equivalent to
> > >      FOLL_GET) and BIO_PAGE_PINNED (equivalent to BIO_PAGE_PINNED) is
> > >      added.
> > 
> > I think there's a simpler solution than all of this.
> > 
> > As I understand the fundamental problem here, the question is
> > when to copy a page on fork.  We have the optimisation of COW, but
> > O_DIRECT/RDMA/... breaks it.  So all this page pinning is to indicate
> > to the fork code "You can't do COW to this page".
> > 
> > Why do we want to track that information on a per-page basis?  Wouldn't it
> > be easier to have a VM_NOCOW flag in vma->vm_flags?  Set it the first
> > time somebody does an O_DIRECT read or RDMA pin.  That's it.  Pages in
> > that VMA will now never be COWed, regardless of their refcount/mapcount.
> > And the whole "did we pin or get this page" problem goes away.  Along
> > with folio->pincount.
> 
> Well, but anon COW code is not the only (planned) consumer of the pincount.
> Filesystems also need to know whether a (shared pagecache) page is pinned
> and can thus be modified behind their backs. And for that VMA tracking
> isn't really an option.

Bleh, I'd forgotten about that problem.  We really do need to keep
track of which pages are under I/O for this case, because we need to
tell the filesystem that they are now available for writeback.

That said, I don't know that we need to keep track of it in the
pages themselves.  Can't we have something similar to rmap which
keeps track of a range of pinned pages, and have it taken care of
at a higher level (ie unpin the pages in the dio_iodone_t rather
than in the BIO completion handler)?

I'm not even sure why pinned pagecache pages remain on the LRU.
They should probably go onto the unevictable list with the mlocked
pages, then on unpin get marked dirty and placed on the active list.
There's no point in writing back a pinned page since it can be
written to at any instant without any part of the kernel knowing.
  
Matthew Wilcox Jan. 23, 2023, 6:04 p.m. UTC | #8
On Mon, Jan 23, 2023 at 05:19:51PM +0000, David Howells wrote:
> Matthew Wilcox <willy@infradead.org> wrote:
> 
> > > Wouldn't that potentially make someone's entire malloc() heap entirely NOCOW
> > > if they did a single DIO to/from it.
> > 
> > Yes.  Would that be an actual problem for any real application?
> 
> Without auditing all applications that do direct I/O writes, it's hard to
> say - but a big database engine, Oracle for example, forking off a process,
> say, could cause a massive slow down as fork suddenly has to copy a huge
> amount of malloc'd data unnecessarily[*].
> 
> [*] I'm making wild assumptions about how Oracle's DB engine works.

Yes.  The cache is shared between all Oracle processes, so it's not COWed.
Indeed (as the mshare patches show), what Oracle wants is _more_ sharing
between the processes, not _less_.

> > > Also you only mention DIO read - but what about "start DIO write; fork();
> > > touch buffer" in the parent - now the write buffer belongs to the child
> > > and they can affect the parent's write.
> > 
> > I'm struggling to see the problem here.  If the child hasn't exec'd, the
> > parent and child are still in the same security domain.  The parent
> > could have modified the buffer before calling fork().
> 
> It could still inadvertently change the data its parent set to write out.  The
> child *shouldn't* be able to change the parent's in-progress write.  The most
> obvious problem would be in something that does DIO from a stack buffer, I
> think.

If it's a problem then O_DIRECT writes can also set the NOCOW flag.
  
John Hubbard Jan. 23, 2023, 10:53 p.m. UTC | #9
On 1/23/23 09:33, Matthew Wilcox wrote:
...
> Bleh, I'd forgotten about that problem.  We really do need to keep
> track of which pages are under I/O for this case, because we need to
> tell the filesystem that they are now available for writeback.
> 
> That said, I don't know that we need to keep track of it in the
> pages themselves.  Can't we have something similar to rmap which
> keeps track of a range of pinned pages, and have it taken care of
> at a higher level (ie unpin the pages in the dio_iodone_t rather
> than in the BIO completion handler)?
> 
> I'm not even sure why pinned pagecache pages remain on the LRU.
> They should probably go onto the unevictable list with the mlocked

This is an intriguing idea, but...

> pages, then on unpin get marked dirty and placed on the active list.
> There's no point in writing back a pinned page since it can be
> written to at any instant without any part of the kernel knowing.
> 

There have been filesystems discussions about this: if a page goes
unwritten for "too long", it's not good. To address that, bounce buffers
were proposed for periodic writeback of pinned pages. The idea with
bounce buffers is: even though the page is never guaranteed up to date
(because DMA can instantly make it effectively dirty), it is at least
much less out of date after a periodic writeback, than it was before.

And that is important for some file system use cases.


thanks,
  
David Hildenbrand Jan. 24, 2023, 10:24 a.m. UTC | #10
On 23.01.23 18:25, Jan Kara wrote:
> On Mon 23-01-23 16:42:56, Matthew Wilcox wrote:
>> On Mon, Jan 23, 2023 at 04:38:47PM +0000, David Howells wrote:
>>> Matthew Wilcox <willy@infradead.org> wrote:
>>> Also you only mention DIO read - but what about "start DIO write; fork(); touch
>>> buffer" in the parent - now the write buffer belongs to the child and they can
>>> affect the parent's write.
>>
>> I'm struggling to see the problem here.  If the child hasn't exec'd, the
>> parent and child are still in the same security domain.  The parent
>> could have modified the buffer before calling fork().
> 
> Sadly they are not. Android in particular starts applications by forking
> one big binary (zygote) that has multiple apps linked together and relies
> on the fact the child cannot influence the parent after the fork. We've
> already had CVEs with GUP & COW & fork due to this. David Hildebrand has a
> lot of memories regarding this I believe ;)

:)

Once FOLL_PIN is used most of the issues go away and we don't have to 
play any games with VM flags or similar ...

With FOLL_PIN, I consider anon a solved problem and not worth any new 
fancy ideas.
  
Jan Kara Jan. 24, 2023, 10:29 a.m. UTC | #11
On Mon 23-01-23 17:33:24, Matthew Wilcox wrote:
> On Mon, Jan 23, 2023 at 05:42:18PM +0100, Jan Kara wrote:
> > On Mon 23-01-23 16:31:32, Matthew Wilcox wrote:
> > > On Fri, Jan 20, 2023 at 05:55:48PM +0000, David Howells wrote:
> > > >  (3) Make the bio struct carry a pair of flags to indicate the cleanup
> > > >      mode.  BIO_NO_PAGE_REF is replaced with BIO_PAGE_REFFED (equivalent to
> > > >      FOLL_GET) and BIO_PAGE_PINNED (equivalent to BIO_PAGE_PINNED) is
> > > >      added.
> > > 
> > > I think there's a simpler solution than all of this.
> > > 
> > > As I understand the fundamental problem here, the question is
> > > when to copy a page on fork.  We have the optimisation of COW, but
> > > O_DIRECT/RDMA/... breaks it.  So all this page pinning is to indicate
> > > to the fork code "You can't do COW to this page".
> > > 
> > > Why do we want to track that information on a per-page basis?  Wouldn't it
> > > be easier to have a VM_NOCOW flag in vma->vm_flags?  Set it the first
> > > time somebody does an O_DIRECT read or RDMA pin.  That's it.  Pages in
> > > that VMA will now never be COWed, regardless of their refcount/mapcount.
> > > And the whole "did we pin or get this page" problem goes away.  Along
> > > with folio->pincount.
> > 
> > Well, but anon COW code is not the only (planned) consumer of the pincount.
> > Filesystems also need to know whether a (shared pagecache) page is pinned
> > and can thus be modified behind their backs. And for that VMA tracking
> > isn't really an option.
> 
> Bleh, I'd forgotten about that problem.  We really do need to keep
> track of which pages are under I/O for this case, because we need to
> tell the filesystem that they are now available for writeback.
> 
> That said, I don't know that we need to keep track of it in the
> pages themselves.  Can't we have something similar to rmap which
> keeps track of a range of pinned pages, and have it taken care of
> at a higher level (ie unpin the pages in the dio_iodone_t rather
> than in the BIO completion handler)?

We could but bear in mind that there are more places than just (the two)
direct IO paths. There are many GUP users that can be operating on mmapped
pagecache and that can modify page data. Also note that e.g. direct IO is
rather performance sensitive so any CPU overhead that is added to that path
gets noticed. That was actually the reason why we keep pinned pages on the
LRU - John tried to remove them while pinning but the overhead was not
acceptable.

Finally, since we need to handle pinning for anon pages as well, just
(ab)using the page refcount looks like the least painful solution.

> I'm not even sure why pinned pagecache pages remain on the LRU.
> They should probably go onto the unevictable list with the mlocked
> pages, then on unpin get marked dirty and placed on the active list.
> There's no point in writing back a pinned page since it can be
> written to at any instant without any part of the kernel knowing.

True but as John said sometimes we need to write out even a pinned page -
e.g. on fsync(2). For some RDMA users which keep pages pinned for days or
months, this is actually crucial...

								Honza
  
Christoph Hellwig Jan. 24, 2023, 1:21 p.m. UTC | #12
On Tue, Jan 24, 2023 at 11:29:31AM +0100, Jan Kara wrote:
> True but as John said sometimes we need to writeout even pinned page - e.g.
> on fsync(2). For some RDMA users which keep pages pinned for days or
> months, this is actually crutial...

I think we have to distinguish between short term (just FOLL_PIN)
and long term (FOLL_PIN | FOLL_LONGTERM) pins.

For short term ones the proper thing to do in data integrity writeback
is to simply wait for the unpin.  For the long term pins that obviously
can't work.  The right answer for that is complicated and I don't have
a good answer yet.  The best one would probably be to prohibit them on
MAP_SHARED file backed mappings - this might break some existing
software, but that software is already so broken that this might be
best.
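
As a hypothetical sketch of that policy split (none of these names exist in the series, and FOLL_LONGTERM's value here is illustrative only):

```c
#include <assert.h>

#define FOLL_PIN	0x02
#define FOLL_LONGTERM	0x04	/* illustrative value */

enum wb_action { WB_WRITE_NOW, WB_WAIT_FOR_UNPIN, WB_NEEDS_POLICY };

/*
 * Hypothetical sketch of the data-integrity writeback decision discussed
 * above: unpinned pages can be written immediately, short-term pins can
 * simply be waited out, and long-term pins have no good answer yet
 * (e.g. they might be prohibited on MAP_SHARED file-backed mappings).
 */
static enum wb_action mock_wb_decision(unsigned int pin_flags)
{
	if (!(pin_flags & FOLL_PIN))
		return WB_WRITE_NOW;
	if (pin_flags & FOLL_LONGTERM)
		return WB_NEEDS_POLICY;
	return WB_WAIT_FOR_UNPIN;
}
```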