[v7,3/3] mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed mappings

Message ID b3a4441cade9770e00d24f5ecb75c8f4481785a4.1683044162.git.lstoakes@gmail.com
State New
Headers
Series mm/gup: disallow GUP writing to file-backed mappings by default |

Commit Message

Lorenzo Stoakes May 2, 2023, 4:34 p.m. UTC
  Writing to file-backed dirty-tracked mappings via GUP is inherently broken
as we cannot rule out folios being cleaned and then a GUP user writing to
them again and possibly marking them dirty unexpectedly.

This is especially egregious for long-term mappings (as indicated by the
use of the FOLL_LONGTERM flag), so we disallow this case in GUP-fast as
we have already done in the slow path.

We have access to less information in the fast path as we cannot examine
the VMA containing the mapping, however we can determine whether the folio
is anonymous and then whitelist known-good mappings - specifically hugetlb
and shmem mappings.

While we obtain a stable folio for this check, the mapping might not be, as
a truncate could nullify it at any time. Since doing so requires mappings
to be zapped, we can synchronise against a TLB shootdown operation.

For some architectures TLB shootdown is synchronised by IPI, against which
we are protected as the GUP-fast operation is performed with interrupts
disabled. Equally, we are protected from architectures which specify
CONFIG_MMU_GATHER_RCU_TABLE_FREE as the interrupts being disabled imply an
RCU lock as well.

We whitelist anonymous mappings (and those which otherwise do not have a
valid mapping), shmem and hugetlb mappings, none of which require dirty
tracking so are safe to long-term pin.

It's important to note that there are no APIs allowing users to specify
FOLL_FAST_ONLY for a PUP-fast let alone with FOLL_LONGTERM, so we can
always rely on the fact that if we fail to pin on the fast path, the code
will fall back to the slow path which can perform the more thorough check.

Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Kirill A . Shutemov <kirill@shutemov.name>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
---
 mm/gup.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 60 insertions(+), 2 deletions(-)
  

Comments

David Hildenbrand May 2, 2023, 5:13 p.m. UTC | #1
[...]

> +{
> +	struct address_space *mapping;
> +
> +	/*
> +	 * GUP-fast disables IRQs - this prevents IPIs from causing page tables
> +	 * to disappear from under us, as well as preventing RCU grace periods
> +	 * from making progress (i.e. implying rcu_read_lock()).
> +	 *
> +	 * This means we can rely on the folio remaining stable for all
> +	 * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE
> +	 * and those that do not.
> +	 *
> +	 * We get the added benefit that given inodes, and thus address_space,
> +	 * objects are RCU freed, we can rely on the mapping remaining stable
> +	 * here with no risk of a truncation or similar race.
> +	 */
> +	lockdep_assert_irqs_disabled();
> +
> +	/*
> +	 * If no mapping can be found, this implies an anonymous or otherwise
> +	 * non-file backed folio so in this instance we permit the pin.
> +	 *
> +	 * shmem and hugetlb mappings do not require dirty-tracking so we
> +	 * explicitly whitelist these.
> +	 *
> +	 * Other non dirty-tracked folios will be picked up on the slow path.
> +	 */
> +	mapping = folio_mapping(folio);
> +	return !mapping || shmem_mapping(mapping) || folio_test_hugetlb(folio);

"Folios in the swap cache return the swap mapping" -- you might disallow 
pinning anonymous pages that are in the swap cache.

I recall that there are corner cases where we can end up with an anon 
page that's mapped writable but still in the swap cache ... so you'd 
fallback to the GUP slow path (acceptable for these corner cases, I 
guess), however especially the comment is a bit misleading then.

So I'd suggest not dropping the folio_test_anon() check, or open-coding 
it ... which will make this piece of code most certainly easier to get 
when staring at folio_mapping(). Or to spell it out in the comment 
(usually I prefer code over comments).

> +}
> +
>   /**
>    * try_grab_folio() - Attempt to get or pin a folio.
>    * @page:  pointer to page to be grabbed
> @@ -123,6 +170,8 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
>    */
>   struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>   {
> +	bool is_longterm = flags & FOLL_LONGTERM;
> +
>   	if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
>   		return NULL;
>   
> @@ -136,8 +185,7 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>   		 * right zone, so fail and let the caller fall back to the slow
>   		 * path.
>   		 */
> -		if (unlikely((flags & FOLL_LONGTERM) &&
> -			     !is_longterm_pinnable_page(page)))
> +		if (unlikely(is_longterm && !is_longterm_pinnable_page(page)))
>   			return NULL;
>   
>   		/*
> @@ -148,6 +196,16 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>   		if (!folio)
>   			return NULL;
>   
> +		/*
> +		 * Can this folio be safely pinned? We need to perform this
> +		 * check after the folio is stabilised.
> +		 */
> +		if ((flags & FOLL_WRITE) && is_longterm &&
> +		    !folio_longterm_write_pin_allowed(folio)) {
> +			folio_put_refs(folio, refs);
> +			return NULL;
> +		}

So we perform this change before validating whether the PTE changed.

Hmm, naturally, I would have done it afterwards.

IIRC, without IPI syncs during TLB flush (i.e., 
CONFIG_MMU_GATHER_RCU_TABLE_FREE), there is the possibility that
(1) We lookup the pte
(2) The page was unmapped and free
(3) The page gets reallocated and used
(4) We pin the page
(5) We dereference page->mapping

If we then de-reference page->mapping that gets used by whoever 
allocated it for something completely different (not a pointer to 
something reasonable), I wonder if we might be in trouble.

Checking first, whether the PTE changed makes sure that what we pinned 
and what we're looking at is what we expected.

... I can spot that the page_is_secretmem() check is also done before 
that. But it at least makes sure that it's still an LRU page before 
staring at the mapping (making it a little safer?).

BUT, I keep messing up this part of the story. Maybe it all works as 
expected because we will be synchronizing RCU somehow before actually 
freeing the page in the !IPI case. ... but I think that's only true for 
page tables with CONFIG_MMU_GATHER_RCU_TABLE_FREE.
  
Peter Zijlstra May 2, 2023, 5:22 p.m. UTC | #2
On Tue, May 02, 2023 at 07:13:49PM +0200, David Hildenbrand wrote:
> [...]
> 
> > +{
> > +	struct address_space *mapping;
> > +
> > +	/*
> > +	 * GUP-fast disables IRQs - this prevents IPIs from causing page tables
> > +	 * to disappear from under us, as well as preventing RCU grace periods
> > +	 * from making progress (i.e. implying rcu_read_lock()).
> > +	 *
> > +	 * This means we can rely on the folio remaining stable for all
> > +	 * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE
> > +	 * and those that do not.
> > +	 *
> > +	 * We get the added benefit that given inodes, and thus address_space,
> > +	 * objects are RCU freed, we can rely on the mapping remaining stable
> > +	 * here with no risk of a truncation or similar race.
> > +	 */
> > +	lockdep_assert_irqs_disabled();
> > +
> > +	/*
> > +	 * If no mapping can be found, this implies an anonymous or otherwise
> > +	 * non-file backed folio so in this instance we permit the pin.
> > +	 *
> > +	 * shmem and hugetlb mappings do not require dirty-tracking so we
> > +	 * explicitly whitelist these.
> > +	 *
> > +	 * Other non dirty-tracked folios will be picked up on the slow path.
> > +	 */
> > +	mapping = folio_mapping(folio);
> > +	return !mapping || shmem_mapping(mapping) || folio_test_hugetlb(folio);
> 
> "Folios in the swap cache return the swap mapping" -- you might disallow
> pinning anonymous pages that are in the swap cache.
> 
> I recall that there are corner cases where we can end up with an anon page
> that's mapped writable but still in the swap cache ... so you'd fallback to
> the GUP slow path (acceptable for these corner cases, I guess), however
> especially the comment is a bit misleading then.
> 
> So I'd suggest not dropping the folio_test_anon() check, or open-coding it
> ... which will make this piece of code most certainly easier to get when
> staring at folio_mapping(). Or to spell it out in the comment (usually I
> prefer code over comments).

So how stable is folio->mapping at this point? Can two subsequent reads
get different values? (eg. an actual mapping and NULL)

If so, folio_mapping() itself seems to be missing a READ_ONCE() to avoid
the compiler from emitting the load multiple times.
  
Lorenzo Stoakes May 2, 2023, 5:31 p.m. UTC | #3
On Tue, May 02, 2023 at 07:13:49PM +0200, David Hildenbrand wrote:
> [...]
>
> > +{
> > +	struct address_space *mapping;
> > +
> > +	/*
> > +	 * GUP-fast disables IRQs - this prevents IPIs from causing page tables
> > +	 * to disappear from under us, as well as preventing RCU grace periods
> > +	 * from making progress (i.e. implying rcu_read_lock()).
> > +	 *
> > +	 * This means we can rely on the folio remaining stable for all
> > +	 * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE
> > +	 * and those that do not.
> > +	 *
> > +	 * We get the added benefit that given inodes, and thus address_space,
> > +	 * objects are RCU freed, we can rely on the mapping remaining stable
> > +	 * here with no risk of a truncation or similar race.
> > +	 */
> > +	lockdep_assert_irqs_disabled();
> > +
> > +	/*
> > +	 * If no mapping can be found, this implies an anonymous or otherwise
> > +	 * non-file backed folio so in this instance we permit the pin.
> > +	 *
> > +	 * shmem and hugetlb mappings do not require dirty-tracking so we
> > +	 * explicitly whitelist these.
> > +	 *
> > +	 * Other non dirty-tracked folios will be picked up on the slow path.
> > +	 */
> > +	mapping = folio_mapping(folio);
> > +	return !mapping || shmem_mapping(mapping) || folio_test_hugetlb(folio);
>
> "Folios in the swap cache return the swap mapping" -- you might disallow
> pinning anonymous pages that are in the swap cache.
>
> I recall that there are corner cases where we can end up with an anon page
> that's mapped writable but still in the swap cache ... so you'd fallback to
> the GUP slow path (acceptable for these corner cases, I guess), however
> especially the comment is a bit misleading then.

How could that happen?

>
> So I'd suggest not dropping the folio_test_anon() check, or open-coding it
> ... which will make this piece of code most certainly easier to get when
> staring at folio_mapping(). Or to spell it out in the comment (usually I
> prefer code over comments).

I literally made this change based on your suggestion :) but perhaps I
misinterpreted what you meant.

I do spell it out in the comment that the page can be anonymous, But perhaps
explicitly checking the mapping flags is the way to go.

>
> > +}
> > +
> >   /**
> >    * try_grab_folio() - Attempt to get or pin a folio.
> >    * @page:  pointer to page to be grabbed
> > @@ -123,6 +170,8 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> >    */
> >   struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> >   {
> > +	bool is_longterm = flags & FOLL_LONGTERM;
> > +
> >   	if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
> >   		return NULL;
> > @@ -136,8 +185,7 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> >   		 * right zone, so fail and let the caller fall back to the slow
> >   		 * path.
> >   		 */
> > -		if (unlikely((flags & FOLL_LONGTERM) &&
> > -			     !is_longterm_pinnable_page(page)))
> > +		if (unlikely(is_longterm && !is_longterm_pinnable_page(page)))
> >   			return NULL;
> >   		/*
> > @@ -148,6 +196,16 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> >   		if (!folio)
> >   			return NULL;
> > +		/*
> > +		 * Can this folio be safely pinned? We need to perform this
> > +		 * check after the folio is stabilised.
> > +		 */
> > +		if ((flags & FOLL_WRITE) && is_longterm &&
> > +		    !folio_longterm_write_pin_allowed(folio)) {
> > +			folio_put_refs(folio, refs);
> > +			return NULL;
> > +		}
>
> So we perform this change before validating whether the PTE changed.
>
> Hmm, naturally, I would have done it afterwards.
>
> IIRC, without IPI syncs during TLB flush (i.e.,
> CONFIG_MMU_GATHER_RCU_TABLE_FREE), there is the possibility that
> (1) We lookup the pte
> (2) The page was unmapped and free
> (3) The page gets reallocated and used
> (4) We pin the page
> (5) We dereference page->mapping

But we have an implied RCU lock from disabled IRQs right? Unless that CONFIG
option does something odd (I've not really dug into its brehaviour). It feels
like that would break GUP-fast as a whole.

>
> If we then de-reference page->mapping that gets used by whoever allocated it
> for something completely different (not a pointer to something reasonable),
> I wonder if we might be in trouble.
>
> Checking first, whether the PTE changed makes sure that what we pinned and
> what we're looking at is what we expected.
>
> ... I can spot that the page_is_secretmem() check is also done before that.
> But it at least makes sure that it's still an LRU page before staring at the
> mapping (making it a little safer?).

As do we :)

We also via try_get_folio() check to ensure that we aren't subject to a split.

>
> BUT, I keep messing up this part of the story. Maybe it all works as
> expected because we will be synchronizing RCU somehow before actually
> freeing the page in the !IPI case. ... but I think that's only true for page
> tables with CONFIG_MMU_GATHER_RCU_TABLE_FREE.

My understanding based on what Peter said is that the IRQs being disabled should
prevent anything bad from happening here.

>
> --
> Thanks,
>
> David / dhildenb
>
  
David Hildenbrand May 2, 2023, 5:34 p.m. UTC | #4
On 02.05.23 19:22, Peter Zijlstra wrote:
> On Tue, May 02, 2023 at 07:13:49PM +0200, David Hildenbrand wrote:
>> [...]
>>
>>> +{
>>> +	struct address_space *mapping;
>>> +
>>> +	/*
>>> +	 * GUP-fast disables IRQs - this prevents IPIs from causing page tables
>>> +	 * to disappear from under us, as well as preventing RCU grace periods
>>> +	 * from making progress (i.e. implying rcu_read_lock()).
>>> +	 *
>>> +	 * This means we can rely on the folio remaining stable for all
>>> +	 * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE
>>> +	 * and those that do not.
>>> +	 *
>>> +	 * We get the added benefit that given inodes, and thus address_space,
>>> +	 * objects are RCU freed, we can rely on the mapping remaining stable
>>> +	 * here with no risk of a truncation or similar race.
>>> +	 */
>>> +	lockdep_assert_irqs_disabled();
>>> +
>>> +	/*
>>> +	 * If no mapping can be found, this implies an anonymous or otherwise
>>> +	 * non-file backed folio so in this instance we permit the pin.
>>> +	 *
>>> +	 * shmem and hugetlb mappings do not require dirty-tracking so we
>>> +	 * explicitly whitelist these.
>>> +	 *
>>> +	 * Other non dirty-tracked folios will be picked up on the slow path.
>>> +	 */
>>> +	mapping = folio_mapping(folio);
>>> +	return !mapping || shmem_mapping(mapping) || folio_test_hugetlb(folio);
>>
>> "Folios in the swap cache return the swap mapping" -- you might disallow
>> pinning anonymous pages that are in the swap cache.
>>
>> I recall that there are corner cases where we can end up with an anon page
>> that's mapped writable but still in the swap cache ... so you'd fallback to
>> the GUP slow path (acceptable for these corner cases, I guess), however
>> especially the comment is a bit misleading then.
>>
>> So I'd suggest not dropping the folio_test_anon() check, or open-coding it
>> ... which will make this piece of code most certainly easier to get when
>> staring at folio_mapping(). Or to spell it out in the comment (usually I
>> prefer code over comments).
> 
> So how stable is folio->mapping at this point? Can two subsequent reads
> get different values? (eg. an actual mapping and NULL)
> 
> If so, folio_mapping() itself seems to be missing a READ_ONCE() to avoid
> the compiler from emitting the load multiple times.

I can only talk about anon pages in this specific call order here (check 
first, then test if the PTE changed in the meantime): we don't care if 
we get two different values. If we get a different value the second 
time, surely we (temporarily) pinned an anon page that is no longer 
mapped (freed in the meantime). But in that case (even if we read 
garbage folio->mapping and made the wrong call here), we'll detect 
afterwards that the PTE changed, and unpin what we (temporarily) pinned. 
As folio_test_anon() only checks two bits in folio->mapping it's fine, 
because we won't dereference garbage folio->mapping.

With folio_mapping() on !anon and READ_ONCE() ... good question. Kirill 
said it would be fairly stable, but I suspect that it could change 
(especially if we call it before validating if the PTE changed as I 
described further below).

Now, if we read folio->mapping after checking if the page we pinned is 
still mapped (PTE unchanged), at least the page we pinned cannot be 
reused in the meantime. I suspect that we can still read "NULL" on the 
second read. But whatever we dereference from the first read should 
still be valid, even if the second read would have returned NULL ("rcu 
freeing").
  
Lorenzo Stoakes May 2, 2023, 5:34 p.m. UTC | #5
On Tue, May 02, 2023 at 07:22:32PM +0200, Peter Zijlstra wrote:
> On Tue, May 02, 2023 at 07:13:49PM +0200, David Hildenbrand wrote:
> > [...]
> >
> > > +{
> > > +	struct address_space *mapping;
> > > +
> > > +	/*
> > > +	 * GUP-fast disables IRQs - this prevents IPIs from causing page tables
> > > +	 * to disappear from under us, as well as preventing RCU grace periods
> > > +	 * from making progress (i.e. implying rcu_read_lock()).
> > > +	 *
> > > +	 * This means we can rely on the folio remaining stable for all
> > > +	 * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE
> > > +	 * and those that do not.
> > > +	 *
> > > +	 * We get the added benefit that given inodes, and thus address_space,
> > > +	 * objects are RCU freed, we can rely on the mapping remaining stable
> > > +	 * here with no risk of a truncation or similar race.
> > > +	 */
> > > +	lockdep_assert_irqs_disabled();
> > > +
> > > +	/*
> > > +	 * If no mapping can be found, this implies an anonymous or otherwise
> > > +	 * non-file backed folio so in this instance we permit the pin.
> > > +	 *
> > > +	 * shmem and hugetlb mappings do not require dirty-tracking so we
> > > +	 * explicitly whitelist these.
> > > +	 *
> > > +	 * Other non dirty-tracked folios will be picked up on the slow path.
> > > +	 */
> > > +	mapping = folio_mapping(folio);
> > > +	return !mapping || shmem_mapping(mapping) || folio_test_hugetlb(folio);
> >
> > "Folios in the swap cache return the swap mapping" -- you might disallow
> > pinning anonymous pages that are in the swap cache.
> >
> > I recall that there are corner cases where we can end up with an anon page
> > that's mapped writable but still in the swap cache ... so you'd fallback to
> > the GUP slow path (acceptable for these corner cases, I guess), however
> > especially the comment is a bit misleading then.
> >
> > So I'd suggest not dropping the folio_test_anon() check, or open-coding it
> > ... which will make this piece of code most certainly easier to get when
> > staring at folio_mapping(). Or to spell it out in the comment (usually I
> > prefer code over comments).
>
> So how stable is folio->mapping at this point? Can two subsequent reads
> get different values? (eg. an actual mapping and NULL)
>
> If so, folio_mapping() itself seems to be missing a READ_ONCE() to avoid
> the compiler from emitting the load multiple times.

Yes that actually feels like a bit of a flaw in folio_mapping(). I suppose
we run the risk of mapping being reset (e.g. to NULL) even if any mapping
we get being guaranteed to be safe due to the RCU thing.

Based on David's feedback I think I'll recode to something more direct
where we READ_ONCE() the mapping, then check mapping flags for anon, avoid
the swap case + check shmem.
  
David Hildenbrand May 2, 2023, 5:38 p.m. UTC | #6
On 02.05.23 19:31, Lorenzo Stoakes wrote:
> On Tue, May 02, 2023 at 07:13:49PM +0200, David Hildenbrand wrote:
>> [...]
>>
>>> +{
>>> +	struct address_space *mapping;
>>> +
>>> +	/*
>>> +	 * GUP-fast disables IRQs - this prevents IPIs from causing page tables
>>> +	 * to disappear from under us, as well as preventing RCU grace periods
>>> +	 * from making progress (i.e. implying rcu_read_lock()).
>>> +	 *
>>> +	 * This means we can rely on the folio remaining stable for all
>>> +	 * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE
>>> +	 * and those that do not.
>>> +	 *
>>> +	 * We get the added benefit that given inodes, and thus address_space,
>>> +	 * objects are RCU freed, we can rely on the mapping remaining stable
>>> +	 * here with no risk of a truncation or similar race.
>>> +	 */
>>> +	lockdep_assert_irqs_disabled();
>>> +
>>> +	/*
>>> +	 * If no mapping can be found, this implies an anonymous or otherwise
>>> +	 * non-file backed folio so in this instance we permit the pin.
>>> +	 *
>>> +	 * shmem and hugetlb mappings do not require dirty-tracking so we
>>> +	 * explicitly whitelist these.
>>> +	 *
>>> +	 * Other non dirty-tracked folios will be picked up on the slow path.
>>> +	 */
>>> +	mapping = folio_mapping(folio);
>>> +	return !mapping || shmem_mapping(mapping) || folio_test_hugetlb(folio);
>>
>> "Folios in the swap cache return the swap mapping" -- you might disallow
>> pinning anonymous pages that are in the swap cache.
>>
>> I recall that there are corner cases where we can end up with an anon page
>> that's mapped writable but still in the swap cache ... so you'd fallback to
>> the GUP slow path (acceptable for these corner cases, I guess), however
>> especially the comment is a bit misleading then.
> 
> How could that happen?
> 
>>
>> So I'd suggest not dropping the folio_test_anon() check, or open-coding it
>> ... which will make this piece of code most certainly easier to get when
>> staring at folio_mapping(). Or to spell it out in the comment (usually I
>> prefer code over comments).
> 
> I literally made this change based on your suggestion :) but perhaps I
> misinterpreted what you meant.
> 
> I do spell it out in the comment that the page can be anonymous, But perhaps
> explicitly checking the mapping flags is the way to go.
> 
>>
>>> +}
>>> +
>>>    /**
>>>     * try_grab_folio() - Attempt to get or pin a folio.
>>>     * @page:  pointer to page to be grabbed
>>> @@ -123,6 +170,8 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
>>>     */
>>>    struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>>>    {
>>> +	bool is_longterm = flags & FOLL_LONGTERM;
>>> +
>>>    	if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
>>>    		return NULL;
>>> @@ -136,8 +185,7 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>>>    		 * right zone, so fail and let the caller fall back to the slow
>>>    		 * path.
>>>    		 */
>>> -		if (unlikely((flags & FOLL_LONGTERM) &&
>>> -			     !is_longterm_pinnable_page(page)))
>>> +		if (unlikely(is_longterm && !is_longterm_pinnable_page(page)))
>>>    			return NULL;
>>>    		/*
>>> @@ -148,6 +196,16 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
>>>    		if (!folio)
>>>    			return NULL;
>>> +		/*
>>> +		 * Can this folio be safely pinned? We need to perform this
>>> +		 * check after the folio is stabilised.
>>> +		 */
>>> +		if ((flags & FOLL_WRITE) && is_longterm &&
>>> +		    !folio_longterm_write_pin_allowed(folio)) {
>>> +			folio_put_refs(folio, refs);
>>> +			return NULL;
>>> +		}
>>
>> So we perform this change before validating whether the PTE changed.
>>
>> Hmm, naturally, I would have done it afterwards.
>>
>> IIRC, without IPI syncs during TLB flush (i.e.,
>> CONFIG_MMU_GATHER_RCU_TABLE_FREE), there is the possibility that
>> (1) We lookup the pte
>> (2) The page was unmapped and free
>> (3) The page gets reallocated and used
>> (4) We pin the page
>> (5) We dereference page->mapping
> 
> But we have an implied RCU lock from disabled IRQs right? Unless that CONFIG
> option does something odd (I've not really dug into its brehaviour). It feels
> like that would break GUP-fast as a whole.
> 
>>
>> If we then de-reference page->mapping that gets used by whoever allocated it
>> for something completely different (not a pointer to something reasonable),
>> I wonder if we might be in trouble.
>>
>> Checking first, whether the PTE changed makes sure that what we pinned and
>> what we're looking at is what we expected.
>>
>> ... I can spot that the page_is_secretmem() check is also done before that.
>> But it at least makes sure that it's still an LRU page before staring at the
>> mapping (making it a little safer?).
> 
> As do we :)
> 
> We also via try_get_folio() check to ensure that we aren't subject to a split.
> 
>>
>> BUT, I keep messing up this part of the story. Maybe it all works as
>> expected because we will be synchronizing RCU somehow before actually
>> freeing the page in the !IPI case. ... but I think that's only true for page
>> tables with CONFIG_MMU_GATHER_RCU_TABLE_FREE.
> 
> My understanding based on what Peter said is that the IRQs being disabled should
> prevent anything bad from happening here.


... only if we verified that the PTE didn't change IIUC. IRQs disabled 
only protect you from the mapping getting freed and reused (because 
mappings are freed via RCU IIUC).

But as far as I can tell, it doesn't protect you from the page itself 
getting freed and reused, and whoever freed the page uses page->mapping 
to store something completely different.

But, again, it's all complicated and confusing to me.


page_is_secretmem() also doesn't use a READ_ONCE() ...
  
Lorenzo Stoakes May 2, 2023, 5:45 p.m. UTC | #7
On Tue, May 02, 2023 at 07:38:27PM +0200, David Hildenbrand wrote:
> On 02.05.23 19:31, Lorenzo Stoakes wrote:
> > On Tue, May 02, 2023 at 07:13:49PM +0200, David Hildenbrand wrote:
> > > [...]
> > >
> > > > +{
> > > > +	struct address_space *mapping;
> > > > +
> > > > +	/*
> > > > +	 * GUP-fast disables IRQs - this prevents IPIs from causing page tables
> > > > +	 * to disappear from under us, as well as preventing RCU grace periods
> > > > +	 * from making progress (i.e. implying rcu_read_lock()).
> > > > +	 *
> > > > +	 * This means we can rely on the folio remaining stable for all
> > > > +	 * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE
> > > > +	 * and those that do not.
> > > > +	 *
> > > > +	 * We get the added benefit that given inodes, and thus address_space,
> > > > +	 * objects are RCU freed, we can rely on the mapping remaining stable
> > > > +	 * here with no risk of a truncation or similar race.
> > > > +	 */
> > > > +	lockdep_assert_irqs_disabled();
> > > > +
> > > > +	/*
> > > > +	 * If no mapping can be found, this implies an anonymous or otherwise
> > > > +	 * non-file backed folio so in this instance we permit the pin.
> > > > +	 *
> > > > +	 * shmem and hugetlb mappings do not require dirty-tracking so we
> > > > +	 * explicitly whitelist these.
> > > > +	 *
> > > > +	 * Other non dirty-tracked folios will be picked up on the slow path.
> > > > +	 */
> > > > +	mapping = folio_mapping(folio);
> > > > +	return !mapping || shmem_mapping(mapping) || folio_test_hugetlb(folio);
> > >
> > > "Folios in the swap cache return the swap mapping" -- you might disallow
> > > pinning anonymous pages that are in the swap cache.
> > >
> > > I recall that there are corner cases where we can end up with an anon page
> > > that's mapped writable but still in the swap cache ... so you'd fallback to
> > > the GUP slow path (acceptable for these corner cases, I guess), however
> > > especially the comment is a bit misleading then.
> >
> > How could that happen?
> >
> > >
> > > So I'd suggest not dropping the folio_test_anon() check, or open-coding it
> > > ... which will make this piece of code most certainly easier to get when
> > > staring at folio_mapping(). Or to spell it out in the comment (usually I
> > > prefer code over comments).
> >
> > I literally made this change based on your suggestion :) but perhaps I
> > misinterpreted what you meant.
> >
> > I do spell it out in the comment that the page can be anonymous, But perhaps
> > explicitly checking the mapping flags is the way to go.
> >
> > >
> > > > +}
> > > > +
> > > >    /**
> > > >     * try_grab_folio() - Attempt to get or pin a folio.
> > > >     * @page:  pointer to page to be grabbed
> > > > @@ -123,6 +170,8 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> > > >     */
> > > >    struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> > > >    {
> > > > +	bool is_longterm = flags & FOLL_LONGTERM;
> > > > +
> > > >    	if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
> > > >    		return NULL;
> > > > @@ -136,8 +185,7 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> > > >    		 * right zone, so fail and let the caller fall back to the slow
> > > >    		 * path.
> > > >    		 */
> > > > -		if (unlikely((flags & FOLL_LONGTERM) &&
> > > > -			     !is_longterm_pinnable_page(page)))
> > > > +		if (unlikely(is_longterm && !is_longterm_pinnable_page(page)))
> > > >    			return NULL;
> > > >    		/*
> > > > @@ -148,6 +196,16 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
> > > >    		if (!folio)
> > > >    			return NULL;
> > > > +		/*
> > > > +		 * Can this folio be safely pinned? We need to perform this
> > > > +		 * check after the folio is stabilised.
> > > > +		 */
> > > > +		if ((flags & FOLL_WRITE) && is_longterm &&
> > > > +		    !folio_longterm_write_pin_allowed(folio)) {
> > > > +			folio_put_refs(folio, refs);
> > > > +			return NULL;
> > > > +		}
> > >
> > > So we perform this change before validating whether the PTE changed.
> > >
> > > Hmm, naturally, I would have done it afterwards.
> > >
> > > IIRC, without IPI syncs during TLB flush (i.e.,
> > > CONFIG_MMU_GATHER_RCU_TABLE_FREE), there is the possibility that
> > > (1) We lookup the pte
> > > (2) The page was unmapped and free
> > > (3) The page gets reallocated and used
> > > (4) We pin the page
> > > (5) We dereference page->mapping
> >
> > But we have an implied RCU lock from disabled IRQs right? Unless that CONFIG
> > option does something odd (I've not really dug into its brehaviour). It feels
> > like that would break GUP-fast as a whole.
> >
> > >
> > > If we then de-reference page->mapping that gets used by whoever allocated it
> > > for something completely different (not a pointer to something reasonable),
> > > I wonder if we might be in trouble.
> > >
> > > Checking first, whether the PTE changed makes sure that what we pinned and
> > > what we're looking at is what we expected.
> > >
> > > ... I can spot that the page_is_secretmem() check is also done before that.
> > > But it at least makes sure that it's still an LRU page before staring at the
> > > mapping (making it a little safer?).
> >
> > As do we :)
> >
> > We also via try_get_folio() check to ensure that we aren't subject to a split.
> >
> > >
> > > BUT, I keep messing up this part of the story. Maybe it all works as
> > > expected because we will be synchronizing RCU somehow before actually
> > > freeing the page in the !IPI case. ... but I think that's only true for page
> > > tables with CONFIG_MMU_GATHER_RCU_TABLE_FREE.
> >
> > My understanding based on what Peter said is that the IRQs being disabled should
> > prevent anything bad from happening here.
>
>
> ... only if we verified that the PTE didn't change IIUC. IRQs disabled only
> protect you from the mapping getting freed and reused (because mappings are
> freed via RCU IIUC).
>
> But as far as I can tell, it doesn't protect you from the page itself
> getting freed and reused, and whoever freed the page uses page->mapping to
> store something completely different.

Ack, and we'd not have mapping->inode to save us in an anon case either.

I'd rather be as cautious as we can possibly be, so let's move this to after the
'PTE is the same' check then, will fix on respin.

>
> But, again, it's all complicated and confusing to me.
>

It's just a fiddly, complicated, delicate area I feel :) hence why I
endeavour to take on board the community's views on this series to ensure
we end up with the best possible implementation.

>
> page_is_secretmem() also doesn't use a READ_ONCE() ...

Perhaps one for a follow up patch...

>
> --
> Thanks,
>
> David / dhildenb
>
  
Lorenzo Stoakes May 2, 2023, 6:17 p.m. UTC | #8
On Tue, May 02, 2023 at 07:34:06PM +0200, David Hildenbrand wrote:
> On 02.05.23 19:22, Peter Zijlstra wrote:
> > On Tue, May 02, 2023 at 07:13:49PM +0200, David Hildenbrand wrote:
> > > [...]
> > >
> > > > +{
> > > > +	struct address_space *mapping;
> > > > +
> > > > +	/*
> > > > +	 * GUP-fast disables IRQs - this prevents IPIs from causing page tables
> > > > +	 * to disappear from under us, as well as preventing RCU grace periods
> > > > +	 * from making progress (i.e. implying rcu_read_lock()).
> > > > +	 *
> > > > +	 * This means we can rely on the folio remaining stable for all
> > > > +	 * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE
> > > > +	 * and those that do not.
> > > > +	 *
> > > > +	 * We get the added benefit that given inodes, and thus address_space,
> > > > +	 * objects are RCU freed, we can rely on the mapping remaining stable
> > > > +	 * here with no risk of a truncation or similar race.
> > > > +	 */
> > > > +	lockdep_assert_irqs_disabled();
> > > > +
> > > > +	/*
> > > > +	 * If no mapping can be found, this implies an anonymous or otherwise
> > > > +	 * non-file backed folio so in this instance we permit the pin.
> > > > +	 *
> > > > +	 * shmem and hugetlb mappings do not require dirty-tracking so we
> > > > +	 * explicitly whitelist these.
> > > > +	 *
> > > > +	 * Other non dirty-tracked folios will be picked up on the slow path.
> > > > +	 */
> > > > +	mapping = folio_mapping(folio);
> > > > +	return !mapping || shmem_mapping(mapping) || folio_test_hugetlb(folio);
> > >
> > > "Folios in the swap cache return the swap mapping" -- you might disallow
> > > pinning anonymous pages that are in the swap cache.
> > >
> > > I recall that there are corner cases where we can end up with an anon page
> > > that's mapped writable but still in the swap cache ... so you'd fallback to
> > > the GUP slow path (acceptable for these corner cases, I guess), however
> > > especially the comment is a bit misleading then.
> > >
> > > So I'd suggest not dropping the folio_test_anon() check, or open-coding it
> > > ... which will make this piece of code most certainly easier to get when
> > > staring at folio_mapping(). Or to spell it out in the comment (usually I
> > > prefer code over comments).
> >
> > So how stable is folio->mapping at this point? Can two subsequent reads
> > get different values? (eg. an actual mapping and NULL)
> >
> > If so, folio_mapping() itself seems to be missing a READ_ONCE() to avoid
> > the compiler from emitting the load multiple times.
>
> I can only talk about anon pages in this specific call order here (check
> first, then test if the PTE changed in the meantime): we don't care if we
> get two different values. If we get a different value the second time,
> surely we (temporarily) pinned an anon page that is no longer mapped (freed
> in the meantime). But in that case (even if we read garbage folio->mapping
> and made the wrong call here), we'll detect afterwards that the PTE changed,
> and unpin what we (temporarily) pinned. As folio_test_anon() only checks two
> bits in folio->mapping it's fine, because we won't dereference garbage
> folio->mapping.
>
> With folio_mapping() on !anon and READ_ONCE() ... good question. Kirill said
> it would be fairly stable, but I suspect that it could change (especially if
> we call it before validating if the PTE changed as I described further
> below).
>
> Now, if we read folio->mapping after checking if the page we pinned is still
> mapped (PTE unchanged), at least the page we pinned cannot be reused in the
> meantime. I suspect that we can still read "NULL" on the second read. But
> whatever we dereference from the first read should still be valid, even if
> the second read would have returned NULL ("rcu freeing").
>

On a specific point - if mapping turns out to be NULL after we confirm
stable PTE, I'd be inclined to reject and let the slow path take care of
it, would you agree that that's the correct approach?

I guess you could take that to mean that the page is no longer mapped so is
safe, but it feels like refusing it would be the safe course.


> --
> Thanks,
>
> David / dhildenb
>
  
Peter Zijlstra May 2, 2023, 6:59 p.m. UTC | #9
On Tue, May 02, 2023 at 07:34:06PM +0200, David Hildenbrand wrote:
> Now, if we read folio->mapping after checking if the page we pinned is still
> mapped (PTE unchanged), at least the page we pinned cannot be reused in the
> meantime. I suspect that we can still read "NULL" on the second read. But
> whatever we dereference from the first read should still be valid, even if
> the second read would have returned NULL ("rcu freeing").

Right, but given it's the compiler adding loads we're not sure what if
anything it uses and it gets very hard to reason about the code.

This is where READ_ONCE() helps, we instruct the compiler to only do a
single load and we can still reason about the code.
  
David Hildenbrand May 2, 2023, 7:05 p.m. UTC | #10
On 02.05.23 20:59, Peter Zijlstra wrote:
> On Tue, May 02, 2023 at 07:34:06PM +0200, David Hildenbrand wrote:
>> Now, if we read folio->mapping after checking if the page we pinned is still
>> mapped (PTE unchanged), at least the page we pinned cannot be reused in the
>> meantime. I suspect that we can still read "NULL" on the second read. But
>> whatever we dereference from the first read should still be valid, even if
>> the second read would have returned NULL ("rcu freeing").
> 
> Right, but given it's the compiler adding loads we're not sure what if
> anything it uses and it gets very hard to reason about the code.
> 
> This is where READ_ONCE() helps, we instruct the compiler to only do a
> single load and we can still reason about the code.

I completely agree, and I think we should fix that in 
page_is_secretmem() as well.
  
David Hildenbrand May 2, 2023, 7:07 p.m. UTC | #11
On 02.05.23 20:17, Lorenzo Stoakes wrote:
> On Tue, May 02, 2023 at 07:34:06PM +0200, David Hildenbrand wrote:
>> On 02.05.23 19:22, Peter Zijlstra wrote:
>>> On Tue, May 02, 2023 at 07:13:49PM +0200, David Hildenbrand wrote:
>>>> [...]
>>>>
>>>>> +{
>>>>> +	struct address_space *mapping;
>>>>> +
>>>>> +	/*
>>>>> +	 * GUP-fast disables IRQs - this prevents IPIs from causing page tables
>>>>> +	 * to disappear from under us, as well as preventing RCU grace periods
>>>>> +	 * from making progress (i.e. implying rcu_read_lock()).
>>>>> +	 *
>>>>> +	 * This means we can rely on the folio remaining stable for all
>>>>> +	 * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE
>>>>> +	 * and those that do not.
>>>>> +	 *
>>>>> +	 * We get the added benefit that given inodes, and thus address_space,
>>>>> +	 * objects are RCU freed, we can rely on the mapping remaining stable
>>>>> +	 * here with no risk of a truncation or similar race.
>>>>> +	 */
>>>>> +	lockdep_assert_irqs_disabled();
>>>>> +
>>>>> +	/*
>>>>> +	 * If no mapping can be found, this implies an anonymous or otherwise
>>>>> +	 * non-file backed folio so in this instance we permit the pin.
>>>>> +	 *
>>>>> +	 * shmem and hugetlb mappings do not require dirty-tracking so we
>>>>> +	 * explicitly whitelist these.
>>>>> +	 *
>>>>> +	 * Other non dirty-tracked folios will be picked up on the slow path.
>>>>> +	 */
>>>>> +	mapping = folio_mapping(folio);
>>>>> +	return !mapping || shmem_mapping(mapping) || folio_test_hugetlb(folio);
>>>>
>>>> "Folios in the swap cache return the swap mapping" -- you might disallow
>>>> pinning anonymous pages that are in the swap cache.
>>>>
>>>> I recall that there are corner cases where we can end up with an anon page
>>>> that's mapped writable but still in the swap cache ... so you'd fallback to
>>>> the GUP slow path (acceptable for these corner cases, I guess), however
>>>> especially the comment is a bit misleading then.
>>>>
>>>> So I'd suggest not dropping the folio_test_anon() check, or open-coding it
>>>> ... which will make this piece of code most certainly easier to get when
>>>> staring at folio_mapping(). Or to spell it out in the comment (usually I
>>>> prefer code over comments).
>>>
>>> So how stable is folio->mapping at this point? Can two subsequent reads
>>> get different values? (eg. an actual mapping and NULL)
>>>
>>> If so, folio_mapping() itself seems to be missing a READ_ONCE() to avoid
>>> the compiler from emitting the load multiple times.
>>
>> I can only talk about anon pages in this specific call order here (check
>> first, then test if the PTE changed in the meantime): we don't care if we
>> get two different values. If we get a different value the second time,
>> surely we (temporarily) pinned an anon page that is no longer mapped (freed
>> in the meantime). But in that case (even if we read garbage folio->mapping
>> and made the wrong call here), we'll detect afterwards that the PTE changed,
>> and unpin what we (temporarily) pinned. As folio_test_anon() only checks two
>> bits in folio->mapping it's fine, because we won't dereference garbage
>> folio->mapping.
>>
>> With folio_mapping() on !anon and READ_ONCE() ... good question. Kirill said
>> it would be fairly stable, but I suspect that it could change (especially if
>> we call it before validating if the PTE changed as I described further
>> below).
>>
>> Now, if we read folio->mapping after checking if the page we pinned is still
>> mapped (PTE unchanged), at least the page we pinned cannot be reused in the
>> meantime. I suspect that we can still read "NULL" on the second read. But
>> whatever we dereference from the first read should still be valid, even if
>> the second read would have returned NULL ("rcu freeing").
>>
> 
> On a specific point - if mapping turns out to be NULL after we confirm
> stable PTE, I'd be inclined to reject and let the slow path take care of
> it, would you agree that that's the correct approach?

If it's not an anon page and the mapping is NULL, I'd say simply 
fallback to the slow path.
  
Jason Gunthorpe May 2, 2023, 7:07 p.m. UTC | #12
On Tue, May 02, 2023 at 07:17:14PM +0100, Lorenzo Stoakes wrote:

> On a specific point - if mapping turns out to be NULL after we confirm
> stable PTE, I'd be inclined to reject and let the slow path take care of
> it, would you agree that that's the correct approach?

I think in general if GUP fast detects any kind of race it should bail
to the slow path.

The races it tries to resolve itself should have really safe and
obvious solutions.

I think this comment is misleading:

> +	/*
> +	 * GUP-fast disables IRQs - this prevents IPIs from causing page tables
> +	 * to disappear from under us, as well as preventing RCU grace periods
> +	 * from making progress (i.e. implying rcu_read_lock()).

True, but that is not important here since we are not reading page
tables

> +	 * This means we can rely on the folio remaining stable for all
> +	 * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE
> +	 * and those that do not.

Not really clear. We have a valid folio refcount here, that is all.

> +	 * We get the added benefit that given inodes, and thus address_space,
> +	 * objects are RCU freed, we can rely on the mapping remaining stable
> +	 * here with no risk of a truncation or similar race.

Which is the real point:

1) GUP-fast disables IRQs which means this is the same context as rcu_read_lock()
2) We have a valid ref on the folio due to normal GUP fast operation
   Thus derefing struct folio is OK
3) folio->mapping can be deref'd safely under RCU since mapping is RCU free'd
   It may be zero if we are racing a page free path
   Can it be zero for other reasons?

If it can't be zero for any other reason then go to GUP slow and let
it sort it out

Otherwise you have to treat NULL as a success.

Really what you are trying to do here is remove the folio lock which
would normally protect folio->mapping. Ie this test really boils down
to just 'folio_get_mapping_a_ops_rcu() == shmem_aops'

The hugetlb test is done on a page flag which should be stable under
the pageref.

So.. Your function really ought to be doing this logic:

        // Should be impossible for a slab page to be in a VMA
	if (unlikely(folio_test_slab(folio)))
	   return do gup slow;

	// Can a present PTE even be a swap cache?
   	if (unlikely(folio_test_swapcache(folio)))
	   return do gup slow;

	if (folio_test_hugetlb(folio))
	   return safe for fast

	// Safe without the folio lock ?
   	struct address_space *mapping = READ_ONCE(folio->mapping)
	if ((mapping & PAGE_MAPPING_FLAGS) == PAGE_MAPPING_ANON)
	   return safe for fast
	if ((mapping & PAGE_MAPPING_FLAGS) == 0 && mapping)
	   return mapping->a_ops == shmem_aops;

        // Depends on what mapping = NULL means
	return do gup slow

Jason
  
David Hildenbrand May 2, 2023, 7:17 p.m. UTC | #13
> +static bool folio_longterm_write_pin_allowed(struct folio *folio)
> +{
> +	struct address_space *mapping;
> +
> +	/*
> +	 * GUP-fast disables IRQs - this prevents IPIs from causing page tables
> +	 * to disappear from under us, as well as preventing RCU grace periods
> +	 * from making progress (i.e. implying rcu_read_lock()).
> +	 *
> +	 * This means we can rely on the folio remaining stable for all
> +	 * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE
> +	 * and those that do not.
> +	 *
> +	 * We get the added benefit that given inodes, and thus address_space,
> +	 * objects are RCU freed, we can rely on the mapping remaining stable
> +	 * here with no risk of a truncation or similar race.
> +	 */
> +	lockdep_assert_irqs_disabled();
> +
> +	/*
> +	 * If no mapping can be found, this implies an anonymous or otherwise
> +	 * non-file backed folio so in this instance we permit the pin.
> +	 *
> +	 * shmem and hugetlb mappings do not require dirty-tracking so we
> +	 * explicitly whitelist these.
> +	 *
> +	 * Other non dirty-tracked folios will be picked up on the slow path.
> +	 */
> +	mapping = folio_mapping(folio);
> +	return !mapping || shmem_mapping(mapping) || folio_test_hugetlb(folio);
> +}

BTW, try_grab_folio() is also called from follow_hugetlb_page(), which 
is ordinary GUP and has interrupts enabled if I am not wrong.
  
Lorenzo Stoakes May 2, 2023, 7:25 p.m. UTC | #14
On Tue, May 02, 2023 at 04:07:50PM -0300, Jason Gunthorpe wrote:
> On Tue, May 02, 2023 at 07:17:14PM +0100, Lorenzo Stoakes wrote:
>
> > On a specific point - if mapping turns out to be NULL after we confirm
> > stable PTE, I'd be inclined to reject and let the slow path take care of
> > it, would you agree that that's the correct approach?
>
> I think in general if GUP fast detects any kind of race it should bail
> to the slow path.
>
> The races it tries to resolve itself should have really safe and
> obvious solutions.
>
> I think this comment is misleading:
>
> > +	/*
> > +	 * GUP-fast disables IRQs - this prevents IPIs from causing page tables
> > +	 * to disappear from under us, as well as preventing RCU grace periods
> > +	 * from making progress (i.e. implying rcu_read_lock()).
>
> True, but that is not important here since we are not reading page
> tables
>
> > +	 * This means we can rely on the folio remaining stable for all
> > +	 * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE
> > +	 * and those that do not.
>
> Not really clear. We have a valid folio refcount here, that is all.

Some of this is a product of mixed signals from different commenters and
my being perhaps a little _too_ willing to just go with the flow.

With interrupts disabled and IPI blocked, plus the assurances that
interrupts being disabled implied the RCU version of page table
manipulation is also blocked, my understanding was that remapping in this
process to another page could not occur.

Of course the folio is 'stable' in the sense we have a refcount on it, but
it is unlocked so things can change.

I'm guessing the RCU guarantees in the TLB logic are not as solid as IPI,
because in the IPI case it seems to me you couldn't even clear the PTE
entry before getting to the page table case.

Otherwise, I'm a bit uncertain actually as to how we can get to the point
where the folio->mapping is being manipulated. Is this why?

>
> > +	 * We get the added benefit that given inodes, and thus address_space,
> > +	 * objects are RCU freed, we can rely on the mapping remaining stable
> > +	 * here with no risk of a truncation or similar race.
>
> Which is the real point:
>
> 1) GUP-fast disables IRQs which means this is the same context as rcu_read_lock()
> 2) We have a valid ref on the folio due to normal GUP fast operation
>    Thus derefing struct folio is OK
> 3) folio->mapping can be deref'd safely under RCU since mapping is RCU free'd
>    It may be zero if we are racing a page free path
>    Can it be zero for other reasons?

Zero? You mean NULL?

In any case, I will try to clarify these, I do agree the _key_ point is
that we can rely on safely derefing the mapping, at least READ_ONCE()'d, as
use-after-free or dereffing garbage is the fear here.

>
> If it can't be zero for any other reason then go to GUP slow and let
> it sort it out
>
> Otherwise you have to treat NULL as a success.
>

Well that was literally the question :) and I've got somewhat contradictory
feedback. My instinct aligns with yours in that, just fallback to slow
path, so that's what I'll do. But just wanted to confirm.

> Really what you are trying to do here is remove the folio lock which
> would normally protect folio->mapping. Ie this test really boils down
> to just 'folio_get_mapping_a_ops_rcu() == shmem_aops'
>
> The hugetlb test is done on a page flag which should be stable under
> the pageref.
>
> So.. Your function really ought to be doing this logic:
>
>         // Should be impossible for a slab page to be in a VMA
> 	if (unlikely(folio_test_slab(folio)))
> 	   return do gup slow;
>
> 	// Can a present PTE even be a swap cache?
>    	if (unlikely(folio_test_swapcache(folio)))
> 	   return do gup slow;
>
> 	if (folio_test_hugetlb(folio))
> 	   return safe for fast
>
> 	// Safe without the folio lock ?
>    	struct address_space *mapping = READ_ONCE(folio->mapping)
> 	if ((mapping & PAGE_MAPPING_FLAGS) == PAGE_MAPPING_ANON)
> 	   return safe for fast
> 	if ((mapping & PAGE_MAPPING_FLAGS) == 0 && mapping)
> 	   return mapping->a_ops == shmem_aops;
>
>         // Depends on what mapping = NULL means
> 	return do gup slow
>

Yeah this is how I was planning to implement it, or something along these
lines. The only question was whether my own view aligned with others to
avoid more spam :)

The READ_ONCE() approach is precisely how I wanted to do it in thet first
instance, but feared feedback about duplication and wondered if it made
much difference, but now it's clear this is ther ight way.

> Jason
  
David Hildenbrand May 2, 2023, 7:33 p.m. UTC | #15
On 02.05.23 21:25, Lorenzo Stoakes wrote:
> On Tue, May 02, 2023 at 04:07:50PM -0300, Jason Gunthorpe wrote:
>> On Tue, May 02, 2023 at 07:17:14PM +0100, Lorenzo Stoakes wrote:
>>
>>> On a specific point - if mapping turns out to be NULL after we confirm
>>> stable PTE, I'd be inclined to reject and let the slow path take care of
>>> it, would you agree that that's the correct approach?
>>
>> I think in general if GUP fast detects any kind of race it should bail
>> to the slow path.
>>
>> The races it tries to resolve itself should have really safe and
>> obvious solutions.
>>
>> I think this comment is misleading:
>>
>>> +	/*
>>> +	 * GUP-fast disables IRQs - this prevents IPIs from causing page tables
>>> +	 * to disappear from under us, as well as preventing RCU grace periods
>>> +	 * from making progress (i.e. implying rcu_read_lock()).
>>
>> True, but that is not important here since we are not reading page
>> tables
>>
>>> +	 * This means we can rely on the folio remaining stable for all
>>> +	 * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE
>>> +	 * and those that do not.
>>
>> Not really clear. We have a valid folio refcount here, that is all.
> 
> Some of this is a product of mixed signals from different commenters and
> my being perhaps a little _too_ willing to just go with the flow.
> 
> With interrupts disabled and IPI blocked, plus the assurances that
> interrupts being disabled implied the RCU version of page table
> manipulation is also blocked, my understanding was that remapping in this
> process to another page could not occur.
> 
> Of course the folio is 'stable' in the sense we have a refcount on it, but
> it is unlocked so things can change.
> 
> I'm guessing the RCU guarantees in the TLB logic are not as solid as IPI,
> because in the IPI case it seems to me you couldn't even clear the PTE
> entry before getting to the page table case.
> 
> Otherwise, I'm a bit uncertain actually as to how we can get to the point
> where the folio->mapping is being manipulated. Is this why?

I'll just stress again that I think there are cases where we unmap and 
free a page without synchronizing against GUP-fast using an IPI or RCU.

That's one of the reasons why we recheck if the PTE changed to back off, 
so I've been told.

I'm happy if someone proves me wrong and a page we just (temporarily) 
pinned cannot have been freed+reused in the meantime.
  
Lorenzo Stoakes May 2, 2023, 7:37 p.m. UTC | #16
On Tue, May 02, 2023 at 09:33:45PM +0200, David Hildenbrand wrote:
> On 02.05.23 21:25, Lorenzo Stoakes wrote:
> > On Tue, May 02, 2023 at 04:07:50PM -0300, Jason Gunthorpe wrote:
> > > On Tue, May 02, 2023 at 07:17:14PM +0100, Lorenzo Stoakes wrote:
> > >
> > > > On a specific point - if mapping turns out to be NULL after we confirm
> > > > stable PTE, I'd be inclined to reject and let the slow path take care of
> > > > it, would you agree that that's the correct approach?
> > >
> > > I think in general if GUP fast detects any kind of race it should bail
> > > to the slow path.
> > >
> > > The races it tries to resolve itself should have really safe and
> > > obvious solutions.
> > >
> > > I think this comment is misleading:
> > >
> > > > +	/*
> > > > +	 * GUP-fast disables IRQs - this prevents IPIs from causing page tables
> > > > +	 * to disappear from under us, as well as preventing RCU grace periods
> > > > +	 * from making progress (i.e. implying rcu_read_lock()).
> > >
> > > True, but that is not important here since we are not reading page
> > > tables
> > >
> > > > +	 * This means we can rely on the folio remaining stable for all
> > > > +	 * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE
> > > > +	 * and those that do not.
> > >
> > > Not really clear. We have a valid folio refcount here, that is all.
> >
> > Some of this is a product of mixed signals from different commenters and
> > my being perhaps a little _too_ willing to just go with the flow.
> >
> > With interrupts disabled and IPI blocked, plus the assurances that
> > interrupts being disabled implied the RCU version of page table
> > manipulation is also blocked, my understanding was that remapping in this
> > process to another page could not occur.
> >
> > Of course the folio is 'stable' in the sense we have a refcount on it, but
> > it is unlocked so things can change.
> >
> > I'm guessing the RCU guarantees in the TLB logic are not as solid as IPI,
> > because in the IPI case it seems to me you couldn't even clear the PTE
> > entry before getting to the page table case.
> >
> > Otherwise, I'm a bit uncertain actually as to how we can get to the point
> > where the folio->mapping is being manipulated. Is this why?
>
> I'll just stress again that I think there are cases where we unmap and free
> a page without synchronizing against GUP-fast using an IPI or RCU.

OK that explains why we need to be careful. Don't worry I am going to move the
check after we confirm PTE entry hasn't changed.

>
> That's one of the reasons why we recheck if the PTE changed to back off, so
> I've been told.
>
> I'm happy if someone proves me wrong and a page we just (temporarily) pinned
> cannot have been freed+reused in the meantime.

Let's play it safe for now :)

>
> --
> Thanks,
>
> David / dhildenb
>
  
Lorenzo Stoakes May 2, 2023, 7:45 p.m. UTC | #17
On Tue, May 02, 2023 at 09:17:53PM +0200, David Hildenbrand wrote:
> > +static bool folio_longterm_write_pin_allowed(struct folio *folio)
> > +{
> > +	struct address_space *mapping;
> > +
> > +	/*
> > +	 * GUP-fast disables IRQs - this prevents IPIs from causing page tables
> > +	 * to disappear from under us, as well as preventing RCU grace periods
> > +	 * from making progress (i.e. implying rcu_read_lock()).
> > +	 *
> > +	 * This means we can rely on the folio remaining stable for all
> > +	 * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE
> > +	 * and those that do not.
> > +	 *
> > +	 * We get the added benefit that given inodes, and thus address_space,
> > +	 * objects are RCU freed, we can rely on the mapping remaining stable
> > +	 * here with no risk of a truncation or similar race.
> > +	 */
> > +	lockdep_assert_irqs_disabled();
> > +
> > +	/*
> > +	 * If no mapping can be found, this implies an anonymous or otherwise
> > +	 * non-file backed folio so in this instance we permit the pin.
> > +	 *
> > +	 * shmem and hugetlb mappings do not require dirty-tracking so we
> > +	 * explicitly whitelist these.
> > +	 *
> > +	 * Other non dirty-tracked folios will be picked up on the slow path.
> > +	 */
> > +	mapping = folio_mapping(folio);
> > +	return !mapping || shmem_mapping(mapping) || folio_test_hugetlb(folio);
> > +}
>
> BTW, try_grab_folio() is also called from follow_hugetlb_page(), which is
> ordinary GUP and has interrupts enabled if I am not wrong.

It does hold the PTL though, so can't fiddle with the entry.

But that does suggest folio_test_hugetlb() should be put _before_ the irq
disabled assertion then :)

>
> --
> Thanks,
>
> David / dhildenb
>
  
Jason Gunthorpe May 2, 2023, 11:51 p.m. UTC | #18
On Tue, May 02, 2023 at 09:33:45PM +0200, David Hildenbrand wrote:

> I'll just stress again that I think there are cases where we unmap and free
> a page without synchronizing against GUP-fast using an IPI or RCU.

AFAIK this is true on ARM64 and other arches that do not use IPIs for
TLB shootdown.

Eg the broadcast TLBI described here:

https://developer.arm.com/documentation/101811/0102/Translation-Lookaside-Buffer-maintenance

TLB invalidation of remote CPUs Is done at the interconnect level and
does not trigger any interrupt.

So, arches that don't use IPI have to RCU free the page table entires
to work with GUP fast. They will set MMU_GATHER_RCU_TABLE_FREE. The
whole interrupt disable thing in GUP turns into nothing more than a
hacky RCU on those arches.

The ugly bit is the comment:

 * We do not adopt an rcu_read_lock() here as we also want to block IPIs
 * that come from THPs splitting.

Which, I think, today can be summarized in today's code base as
serializing with split_huge_page_to_list().

I don't know this code well, but the comment there says "Only caller
must hold pin on the @page" which is only possible if all the PTEs
have been replaced with migration entries or otherwise before we get
here. So the IPI the comment talks about, I suppose, is from the
installation of the migration entry?

However gup_huge_pmd() does the usual read, ref, check pattern, and
split_huge_page_to_list() uses page_ref_freeze() which is cmpxchg

So the three races seem to resolve themselves without IPI magic
 - GUP sees the THP folio before the migration entry and +1's the ref
   split_huge_page_to_list() bails beacuse the ref is elevated
 - GUP fails to +1 the ref because it is zero and bails,
   split_huge_page_to_list() made it zero, so it runs
 - GUP sees the THP folio, split_huge_page_to_list() ran to
   completion, and then GUP +1's a 4k page. The recheck of pmd_val
   will see the migration entry, or the new PTE table pointer, but
   never the original THP folio. So GUP will bail. The speculative
   ref on the 4k page is harmless.

I can't figure out what this 2014 comment means in today's code.

Or where ARM64 hid the "IPI broadcast on THP split" :)

See commit 2667f50e8b81457fcb4a3dbe6aff3e81ea009e13

> That's one of the reasons why we recheck if the PTE changed to back off, so
> I've been told.

Yes, with no synchronizing point the refcount in GUP fast could be
taken on the page after it has been free'd and reallocated, but this
is only possible on RCU

> I'm happy if someone proves me wrong and a page we just (temporarily) pinned
> cannot have been freed+reused in the meantime.

Sadly I think no.. We'd need to RCU free the page itself as well to
make this true.

Jason
  
Jason Gunthorpe May 3, 2023, 12:22 a.m. UTC | #19
On Tue, May 02, 2023 at 08:25:57PM +0100, Lorenzo Stoakes wrote:

> Otherwise, I'm a bit uncertain actually as to how we can get to the point
> where the folio->mapping is being manipulated. Is this why?

On RCU architectures another thread zap's the PTEs and proceeds to
teardown and free the page. Part of that is clearing folio->mapping
under the folio lock.

The IPI approach would not get there, but we can't think in terms of
IPI since we have RCU architectures.

> In any case, I will try to clarify these, I do agree the _key_ point is
> that we can rely on safely derefing the mapping, at least READ_ONCE()'d, as
> use-after-free or dereffing garbage is the fear here.

I didn't chase it down, but as long as folio->mapping is always set to
something that is RCU free'd then this would be OK.
 
> Well that was literally the question :) and I've got somewhat contradictory
> feedback. My instinct aligns with yours in that, just fallback to slow
> path, so that's what I'll do. But just wanted to confirm.

I don't know it well enough to say, it looks like folio->mapping is
almost always set to something from what I can see.. At least NULL
pointer and the anon bits set for instance.

In any case fall back to fast is always safe, I'd just go to the
trouble to check with testing that in normal process cases we don't
hit it.

> The READ_ONCE() approach is precisely how I wanted to do it in thet first
> instance, but feared feedback about duplication and wondered if it made
> much difference, but now it's clear this is ther ight way.

The duplication is bad, and maybe we need more functions to counter
it, but GUP needs to know some of the details a little more, eg a NULL
return from folio_mapping() could inidicate the anon bits being set
which should not flow down to slow.

Jason
  

Patch

diff --git a/mm/gup.c b/mm/gup.c
index 6e209ca10967..93b4aa39e5a5 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -18,6 +18,7 @@ 
 #include <linux/migrate.h>
 #include <linux/mm_inline.h>
 #include <linux/sched/mm.h>
+#include <linux/shmem_fs.h>
 
 #include <asm/mmu_context.h>
 #include <asm/tlbflush.h>
@@ -95,6 +96,52 @@  static inline struct folio *try_get_folio(struct page *page, int refs)
 	return folio;
 }
 
+/*
+ * Used in the GUP-fast path to determine whether a FOLL_PIN | FOLL_LONGTERM |
+ * FOLL_WRITE pin is permitted for a specific folio.
+ *
+ * This assumes the folio is stable and pinned.
+ *
+ * Writing to pinned file-backed dirty tracked folios is inherently problematic
+ * (see comment describing the writeable_file_mapping_allowed() function). We
+ * therefore try to avoid the most egregious case of a long-term mapping doing
+ * so.
+ *
+ * This function cannot be as thorough as that one as the VMA is not available
+ * in the fast path, so instead we whitelist known good cases.
+ */
+static bool folio_longterm_write_pin_allowed(struct folio *folio)
+{
+	struct address_space *mapping;
+
+	/*
+	 * GUP-fast disables IRQs - this prevents IPIs from causing page tables
+	 * to disappear from under us, as well as preventing RCU grace periods
+	 * from making progress (i.e. implying rcu_read_lock()).
+	 *
+	 * This means we can rely on the folio remaining stable for all
+	 * architectures, both those that set CONFIG_MMU_GATHER_RCU_TABLE_FREE
+	 * and those that do not.
+	 *
+	 * We get the added benefit that given inodes, and thus address_space,
+	 * objects are RCU freed, we can rely on the mapping remaining stable
+	 * here with no risk of a truncation or similar race.
+	 */
+	lockdep_assert_irqs_disabled();
+
+	/*
+	 * If no mapping can be found, this implies an anonymous or otherwise
+	 * non-file backed folio so in this instance we permit the pin.
+	 *
+	 * shmem and hugetlb mappings do not require dirty-tracking so we
+	 * explicitly whitelist these.
+	 *
+	 * Other non dirty-tracked folios will be picked up on the slow path.
+	 */
+	mapping = folio_mapping(folio);
+	return !mapping || shmem_mapping(mapping) || folio_test_hugetlb(folio);
+}
+
 /**
  * try_grab_folio() - Attempt to get or pin a folio.
  * @page:  pointer to page to be grabbed
@@ -123,6 +170,8 @@  static inline struct folio *try_get_folio(struct page *page, int refs)
  */
 struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
 {
+	bool is_longterm = flags & FOLL_LONGTERM;
+
 	if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
 		return NULL;
 
@@ -136,8 +185,7 @@  struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
 		 * right zone, so fail and let the caller fall back to the slow
 		 * path.
 		 */
-		if (unlikely((flags & FOLL_LONGTERM) &&
-			     !is_longterm_pinnable_page(page)))
+		if (unlikely(is_longterm && !is_longterm_pinnable_page(page)))
 			return NULL;
 
 		/*
@@ -148,6 +196,16 @@  struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
 		if (!folio)
 			return NULL;
 
+		/*
+		 * Can this folio be safely pinned? We need to perform this
+		 * check after the folio is stabilised.
+		 */
+		if ((flags & FOLL_WRITE) && is_longterm &&
+		    !folio_longterm_write_pin_allowed(folio)) {
+			folio_put_refs(folio, refs);
+			return NULL;
+		}
+
 		/*
 		 * When pinning a large folio, use an exact count to track it.
 		 *