[01/13] mm: Update ptep_get_lockless()s comment

Message ID 20221022114424.515572025@infradead.org
State New
Headers
Series Clean up pmd_get_atomic() and i386-PAE |

Commit Message

Peter Zijlstra Oct. 22, 2022, 11:14 a.m. UTC
  Improve the comment.

Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/pgtable.h |   17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)
  

Comments

John Hubbard Oct. 24, 2022, 5:42 a.m. UTC | #1
On 10/22/22 04:14, Peter Zijlstra wrote:
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -260,15 +260,12 @@ static inline pte_t ptep_get(pte_t *ptep
>  
>  #ifdef CONFIG_GUP_GET_PTE_LOW_HIGH
>  /*
> - * WARNING: only to be used in the get_user_pages_fast() implementation.
> - *
> - * With get_user_pages_fast(), we walk down the pagetables without taking any
> - * locks.  For this we would like to load the pointers atomically, but sometimes
> - * that is not possible (e.g. without expensive cmpxchg8b on x86_32 PAE).  What
> - * we do have is the guarantee that a PTE will only either go from not present
> - * to present, or present to not present or both -- it will not switch to a
> - * completely different present page without a TLB flush in between; something
> - * that we are blocking by holding interrupts off.
> + * For walking the pagetables without holding any locks.  Some architectures
> + * (eg x86-32 PAE) cannot load the entries atomically without using expensive
> + * instructions.  We are guaranteed that a PTE will only either go from not
> + * present to present, or present to not present -- it will not switch to a
> + * completely different present page without a TLB flush inbetween; which we
> + * are blocking by holding interrupts off.


This is getting interesting. My latest understanding of this story is
that both the "before" and "after" versions of that comment are
incorrect! Because, as Jann Horn noticed recently [1], there might not
be any IPIs involved in a TLB flush, if x86 is running under a
hypervisor, and that breaks the chain of reasoning here.


[1] https://lore.kernel.org/all/CAG48ez3h-mnp9ZFC10v+-BW_8NQvxbwBsMYJFP8JX31o0B17Pg@mail.gmail.com/


thanks,
  
Peter Zijlstra Oct. 24, 2022, 8 a.m. UTC | #2
On Sun, Oct 23, 2022 at 10:42:49PM -0700, John Hubbard wrote:
> On 10/22/22 04:14, Peter Zijlstra wrote:
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -260,15 +260,12 @@ static inline pte_t ptep_get(pte_t *ptep
> >  
> >  #ifdef CONFIG_GUP_GET_PTE_LOW_HIGH
> >  /*
> > - * WARNING: only to be used in the get_user_pages_fast() implementation.
> > - *
> > - * With get_user_pages_fast(), we walk down the pagetables without taking any
> > - * locks.  For this we would like to load the pointers atomically, but sometimes
> > - * that is not possible (e.g. without expensive cmpxchg8b on x86_32 PAE).  What
> > - * we do have is the guarantee that a PTE will only either go from not present
> > - * to present, or present to not present or both -- it will not switch to a
> > - * completely different present page without a TLB flush in between; something
> > - * that we are blocking by holding interrupts off.
> > + * For walking the pagetables without holding any locks.  Some architectures
> > + * (eg x86-32 PAE) cannot load the entries atomically without using expensive
> > + * instructions.  We are guaranteed that a PTE will only either go from not
> > + * present to present, or present to not present -- it will not switch to a
> > + * completely different present page without a TLB flush inbetween; which we
> > + * are blocking by holding interrupts off.
> 
> 
> This is getting interesting. My latest understanding of this story is
> that both the "before" and "after" versions of that comment are
> incorrect! Because, as Jann Horn noticed recently [1], there might not
> be any IPIs involved in a TLB flush, if x86 is running under a
> hypervisor, and that breaks the chain of reasoning here.

That mail doesn't really include enough detail. The way x86 HV TLB
flushing is supposed to work is by making use of
MMU_GATHER_RCU_TABLE_FREE. Specifically, something like:


	vCPU0				vCPU1

					tlb_gather_mmut(&tlb, mm);

					....

	local_irq_disable();
	... starts page-table walk ...

	<schedules out; sets KVM_VCPU_PREEMPTED>

					tlb_finish_mmu(&tlb)
					  ...
					  kvm_flush_tlb_multi()
					    if (state & KVM_VCPU_PREEMPTED)
					      if (try_cmpxchg(,&state, state | KVM_VCPU_FLUSH_TLB))
						__cpumask_clear_cpu(cpu, flushmask);


					  tlb_remove_table_sync_one() / call_rcu()


	<schedules back in>

	... continues page-table walk ...
	local_irq_enable();

If mmu gather is forced into tlb_remove_talbe_sync_one() (by memory
pressure), then you've got your IPI back, otherwise it does call_rcu()
and RCU itself will need vCPU0 to enable IRQs in order to make progress.

Either way around, the actual freeing of the pages is delayed until the
page-table walk is finished.

What am I missing?
  
Jann Horn Oct. 24, 2022, 7:58 p.m. UTC | #3
On Mon, Oct 24, 2022 at 10:01 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sun, Oct 23, 2022 at 10:42:49PM -0700, John Hubbard wrote:
> > On 10/22/22 04:14, Peter Zijlstra wrote:
> > > --- a/include/linux/pgtable.h
> > > +++ b/include/linux/pgtable.h
> > > @@ -260,15 +260,12 @@ static inline pte_t ptep_get(pte_t *ptep
> > >
> > >  #ifdef CONFIG_GUP_GET_PTE_LOW_HIGH
> > >  /*
> > > - * WARNING: only to be used in the get_user_pages_fast() implementation.
> > > - *
> > > - * With get_user_pages_fast(), we walk down the pagetables without taking any
> > > - * locks.  For this we would like to load the pointers atomically, but sometimes
> > > - * that is not possible (e.g. without expensive cmpxchg8b on x86_32 PAE).  What
> > > - * we do have is the guarantee that a PTE will only either go from not present
> > > - * to present, or present to not present or both -- it will not switch to a
> > > - * completely different present page without a TLB flush in between; something
> > > - * that we are blocking by holding interrupts off.
> > > + * For walking the pagetables without holding any locks.  Some architectures
> > > + * (eg x86-32 PAE) cannot load the entries atomically without using expensive
> > > + * instructions.  We are guaranteed that a PTE will only either go from not
> > > + * present to present, or present to not present -- it will not switch to a
> > > + * completely different present page without a TLB flush inbetween; which we
> > > + * are blocking by holding interrupts off.
> >
> >
> > This is getting interesting. My latest understanding of this story is
> > that both the "before" and "after" versions of that comment are
> > incorrect! Because, as Jann Horn noticed recently [1], there might not
> > be any IPIs involved in a TLB flush, if x86 is running under a
> > hypervisor, and that breaks the chain of reasoning here.
>
> That mail doesn't really include enough detail. The way x86 HV TLB
> flushing is supposed to work is by making use of
> MMU_GATHER_RCU_TABLE_FREE. Specifically, something like:
>
>
>         vCPU0                           vCPU1
>
>                                         tlb_gather_mmut(&tlb, mm);
>
>                                         ....
>
>         local_irq_disable();
>         ... starts page-table walk ...
>
>         <schedules out; sets KVM_VCPU_PREEMPTED>
>
>                                         tlb_finish_mmu(&tlb)
>                                           ...
>                                           kvm_flush_tlb_multi()
>                                             if (state & KVM_VCPU_PREEMPTED)
>                                               if (try_cmpxchg(,&state, state | KVM_VCPU_FLUSH_TLB))
>                                                 __cpumask_clear_cpu(cpu, flushmask);
>
>
>                                           tlb_remove_table_sync_one() / call_rcu()
>
>
>         <schedules back in>
>
>         ... continues page-table walk ...
>         local_irq_enable();
>
> If mmu gather is forced into tlb_remove_talbe_sync_one() (by memory
> pressure), then you've got your IPI back, otherwise it does call_rcu()
> and RCU itself will need vCPU0 to enable IRQs in order to make progress.
>
> Either way around, the actual freeing of the pages is delayed until the
> page-table walk is finished.
>
> What am I missing?

Unless I'm completely misunderstanding what's going on here, the whole
"remove_table" thing only happens when you "remove a table", meaning
you free an entire *pagetable*. Just zapping PTEs doesn't trigger that
logic.
  
Linus Torvalds Oct. 24, 2022, 8:19 p.m. UTC | #4
On Mon, Oct 24, 2022 at 12:58 PM Jann Horn <jannh@google.com> wrote:
>
> Unless I'm completely misunderstanding what's going on here, the whole
> "remove_table" thing only happens when you "remove a table", meaning
> you free an entire *pagetable*. Just zapping PTEs doesn't trigger that
> logic.

I do have to admit that I'd be happier if this code - and the GUP code
that also relies on "interrupts off" behavior - would just use a
sequence counter instead.

Relying on blocking IPI's is clever, but also clearly very subtle and
somewhat dangerous.

I think our GUP code is a *lot* more important than some "legacy
x86-32 has problems in case you have an incredibly unlikely race that
re-populates the page table with a different page that just happens to
be exactly the same MOD-4GB", so honestly, I don't think the
load-tearing is even worth worrying about - if you have hardware that
is good enough at virtualizing things, it's almost certainly already
64-bit, and running 32-bit virtual machines with PAE you really only
have yourself to blame.

So I can't find it in myself to care about the 32-bit tearing thing,
but this discussion makes me worried about Fast GUP.

Note that even with proper atomic

                pte_t pte = ptep_get_lockless(ptep);

in gup_pte_range(), and even if the page tables are RCU-free'd, that
just means that the 'ptep' access itself is safe.

But then you have the whole "the lookup of the page pointer is not
atomic" wrt that. And right now that GUP code does rely on the "block
IPI" to make it basically valid.

I don't think it matters if GUP races with munmap or madvise() or
something like that - if you get the old page, that's still a valid
page, and the user only has himself to blame.

But if we have memory pressure that causes vmscan to push out a page,
and it gets replaced with a new page, and GUP gets the old page with
no serialization, that sounds like a possible source of data
inconsistency.

I don't know if this can happen, but the whole "interrupts disabled
doesn't actually block IPI's and synchronize with TLB flushes" really
sounds like it would affect GUP too. And be much more serious there
than on some x86-32 platform that nobody should be using anyway.

               Linus
  
Jann Horn Oct. 24, 2022, 8:23 p.m. UTC | #5
On Mon, Oct 24, 2022 at 10:19 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, Oct 24, 2022 at 12:58 PM Jann Horn <jannh@google.com> wrote:
> >
> > Unless I'm completely misunderstanding what's going on here, the whole
> > "remove_table" thing only happens when you "remove a table", meaning
> > you free an entire *pagetable*. Just zapping PTEs doesn't trigger that
> > logic.
>
> I do have to admit that I'd be happier if this code - and the GUP code
> that also relies on "interrupts off" behavior - would just use a
> sequence counter instead.
>
> Relying on blocking IPI's is clever, but also clearly very subtle and
> somewhat dangerous.
>
> I think our GUP code is a *lot* more important than some "legacy
> x86-32 has problems in case you have an incredibly unlikely race that
> re-populates the page table with a different page that just happens to
> be exactly the same MOD-4GB", so honestly, I don't think the
> load-tearing is even worth worrying about - if you have hardware that
> is good enough at virtualizing things, it's almost certainly already
> 64-bit, and running 32-bit virtual machines with PAE you really only
> have yourself to blame.
>
> So I can't find it in myself to care about the 32-bit tearing thing,
> but this discussion makes me worried about Fast GUP.
>
> Note that even with proper atomic
>
>                 pte_t pte = ptep_get_lockless(ptep);
>
> in gup_pte_range(), and even if the page tables are RCU-free'd, that
> just means that the 'ptep' access itself is safe.
>
> But then you have the whole "the lookup of the page pointer is not
> atomic" wrt that. And right now that GUP code does rely on the "block
> IPI" to make it basically valid.
>
> I don't think it matters if GUP races with munmap or madvise() or
> something like that - if you get the old page, that's still a valid
> page, and the user only has himself to blame.
>
> But if we have memory pressure that causes vmscan to push out a page,
> and it gets replaced with a new page, and GUP gets the old page with
> no serialization, that sounds like a possible source of data
> inconsistency.
>
> I don't know if this can happen, but the whole "interrupts disabled
> doesn't actually block IPI's and synchronize with TLB flushes" really
> sounds like it would affect GUP too. And be much more serious there
> than on some x86-32 platform that nobody should be using anyway.

That's why GUP-fast re-checks the PTE after it has grabbed a reference
to the page. Pasting from a longer writeup that I plan to publish
soon, describing how gup_pte_range() works:

"""
This guarantees that the page tables that are being walked
aren't freed concurrently, but at the end of the walk, we
have to grab a stable reference to the referenced page.
For this we use the grab-reference-and-revalidate trick
from above again: First we (locklessly) load the page
table entry, then we grab a reference to the page that it
points to (which can fail if the refcount is zero, in that
case we bail), then we recheck that the page table entry
is still the same, and if it changed in between, we drop the
page reference and bail.
This can, again, grab a reference to a page after it has
already been freed and reallocated. The reason why this is
fine is that the metadata structure that holds this refcount,
`struct folio` (or `struct page`, depending on which kernel
version you're looking at; in current kernels it's `folio`
but `struct page` and `struct folio` are actually aliases for
the same memory, basically, though that is supposed to maybe
change at some point) is never freed; even when a page is
freed and reallocated, the corresponding `struct folio`
stays. This does have the fun consequence that whenever a
page/folio has a non-zero refcount, the refcount can
spuriously go up and then back down for a little bit.
(Also it's technically not as simple as I just described it,
because the `struct page` that the PTE points to might be
a "tail page" of a `struct folio`.
So actually we first read the PTE, the PTE gives us the
`page*`, then from that we go to the `folio*`, then we
try to grab a reference to the `folio`, then if that worked
we check that the `page` still points to the same `folio`,
and then we recheck that the PTE is still the same.)
"""
  
Linus Torvalds Oct. 24, 2022, 8:36 p.m. UTC | #6
On Mon, Oct 24, 2022 at 1:24 PM Jann Horn <jannh@google.com> wrote:
>
> That's why GUP-fast re-checks the PTE after it has grabbed a reference
> to the page.

Bah. I should have known that. We got that through the PPC version
that never had the whole IPI serialization thing (just the RCU
freeing).

I haven't looked at that code in much too long.

Actually, by "much too long" I really mean "thankfully I haven't needed to" ;)

                 Linus
  
Matthew Wilcox Oct. 25, 2022, 3:21 a.m. UTC | #7
On Mon, Oct 24, 2022 at 10:23:51PM +0200, Jann Horn wrote:
> """
> This guarantees that the page tables that are being walked
> aren't freed concurrently, but at the end of the walk, we
> have to grab a stable reference to the referenced page.
> For this we use the grab-reference-and-revalidate trick
> from above again:
> First we (locklessly) load the page
> table entry, then we grab a reference to the page that it
> points to (which can fail if the refcount is zero, in that
> case we bail), then we recheck that the page table entry
> is still the same, and if it changed in between, we drop the
> page reference and bail.
> This can, again, grab a reference to a page after it has
> already been freed and reallocated. The reason why this is
> fine is that the metadata structure that holds this refcount,
> `struct folio` (or `struct page`, depending on which kernel
> version you're looking at; in current kernels it's `folio`
> but `struct page` and `struct folio` are actually aliases for
> the same memory, basically, though that is supposed to maybe
> change at some point) is never freed; even when a page is
> freed and reallocated, the corresponding `struct folio`
> stays. This does have the fun consequence that whenever a
> page/folio has a non-zero refcount, the refcount can
> spuriously go up and then back down for a little bit.
> (Also it's technically not as simple as I just described it,
> because the `struct page` that the PTE points to might be
> a "tail page" of a `struct folio`.
> So actually we first read the PTE, the PTE gives us the
> `page*`, then from that we go to the `folio*`, then we
> try to grab a reference to the `folio`, then if that worked
> we check that the `page` still points to the same `folio`,
> and then we recheck that the PTE is still the same.)
> """

Nngh.  In trying to make this description fit all kernels (with
both pages and folios), you've complicated it maximally.  Let's
try a more simple explanation:

First we (locklessly) load the page table entry, then we grab a
reference to the folio that contains it (which can fail if the
refcount is zero, in that case we bail), then we recheck that the
page table entry is still the same, and if it changed in between,
we drop the folio reference and bail.
This can, again, grab a reference to a folio after it has
already been freed and reallocated. The reason why this is
fine is that the metadata structure that holds this refcount,
`struct folio` is never freed; even when a folio is
freed and reallocated, the corresponding `struct folio`
stays. This does have the fun consequence that whenever a
folio has a non-zero refcount, the refcount can
spuriously go up and then back down for a little bit.
(Also it's slightly more complex than I just described,
because the PTE that we just loaded might be in the middle of
being reallocated into a different folio.
So actually we first read the PTE, translate the PTE into the
`page*`, then from that we go to the `folio*`, then we
try to grab a reference to the `folio`, then if that worked
we check that the `page` is still in the same `folio`,
and then we recheck that the PTE is still the same.  Older kernels
did not make a clear distinction between pages and folios, so
it was even more confusing.)


Better?
  
Alistair Popple Oct. 25, 2022, 7:54 a.m. UTC | #8
Matthew Wilcox <willy@infradead.org> writes:

> On Mon, Oct 24, 2022 at 10:23:51PM +0200, Jann Horn wrote:
>> """
>> This guarantees that the page tables that are being walked
>> aren't freed concurrently, but at the end of the walk, we
>> have to grab a stable reference to the referenced page.
>> For this we use the grab-reference-and-revalidate trick
>> from above again:
>> First we (locklessly) load the page
>> table entry, then we grab a reference to the page that it
>> points to (which can fail if the refcount is zero, in that
>> case we bail), then we recheck that the page table entry
>> is still the same, and if it changed in between, we drop the
>> page reference and bail.
>> This can, again, grab a reference to a page after it has
>> already been freed and reallocated. The reason why this is
>> fine is that the metadata structure that holds this refcount,
>> `struct folio` (or `struct page`, depending on which kernel
>> version you're looking at; in current kernels it's `folio`
>> but `struct page` and `struct folio` are actually aliases for
>> the same memory, basically, though that is supposed to maybe
>> change at some point) is never freed; even when a page is
>> freed and reallocated, the corresponding `struct folio`
>> stays. This does have the fun consequence that whenever a
>> page/folio has a non-zero refcount, the refcount can
>> spuriously go up and then back down for a little bit.
>> (Also it's technically not as simple as I just described it,
>> because the `struct page` that the PTE points to might be
>> a "tail page" of a `struct folio`.
>> So actually we first read the PTE, the PTE gives us the
>> `page*`, then from that we go to the `folio*`, then we
>> try to grab a reference to the `folio`, then if that worked
>> we check that the `page` still points to the same `folio`,
>> and then we recheck that the PTE is still the same.)
>> """
>
> Nngh.  In trying to make this description fit all kernels (with
> both pages and folios), you've complicated it maximally.  Let's
> try a more simple explanation:
>
> First we (locklessly) load the page table entry, then we grab a
> reference to the folio that contains it (which can fail if the
> refcount is zero, in that case we bail), then we recheck that the
> page table entry is still the same, and if it changed in between,
> we drop the folio reference and bail.
> This can, again, grab a reference to a folio after it has
> already been freed and reallocated. The reason why this is
> fine is that the metadata structure that holds this refcount,
> `struct folio` is never freed; even when a folio is
> freed and reallocated, the corresponding `struct folio`
> stays.

I'm probably missing something obvious but how is that synchronised
against memory hotplug? AFAICT if it isn't couldn't the pages be freed
and memory removed? In that case the above would no longer hold because
(I think) the metadata structure could have been freed.

> This does have the fun consequence that whenever a
> folio has a non-zero refcount, the refcount can
> spuriously go up and then back down for a little bit.
> (Also it's slightly more complex than I just described,
> because the PTE that we just loaded might be in the middle of
> being reallocated into a different folio.
> So actually we first read the PTE, translate the PTE into the
> `page*`, then from that we go to the `folio*`, then we
> try to grab a reference to the `folio`, then if that worked
> we check that the `page` is still in the same `folio`,
> and then we recheck that the PTE is still the same.  Older kernels
> did not make a clear distinction between pages and folios, so
> it was even more confusing.)
>
>
> Better?
  
Peter Zijlstra Oct. 25, 2022, 1:33 p.m. UTC | #9
On Tue, Oct 25, 2022 at 06:54:10PM +1100, Alistair Popple wrote:

> > First we (locklessly) load the page table entry, then we grab a
> > reference to the folio that contains it (which can fail if the
> > refcount is zero, in that case we bail), then we recheck that the
> > page table entry is still the same, and if it changed in between,
> > we drop the folio reference and bail.
> > This can, again, grab a reference to a folio after it has
> > already been freed and reallocated. The reason why this is
> > fine is that the metadata structure that holds this refcount,
> > `struct folio` is never freed; even when a folio is
> > freed and reallocated, the corresponding `struct folio`
> > stays.
> 
> I'm probably missing something obvious but how is that synchronised
> against memory hotplug? AFAICT if it isn't couldn't the pages be freed
> and memory removed? In that case the above would no longer hold because
> (I think) the metadata structure could have been freed.

Note, this scheme is older than memory hot-plug, so if anybody is to
blame it's the memory hotplug code.

Anyway, since all that is done with IRQs disabled, all the hotplug stuff
needs to do is rcu_synchronize() in order to ensure all active
IRQ-disabled regions are finshed (between ensuring the memory is unused
and taking out the struct page).
  
Jann Horn Oct. 25, 2022, 1:44 p.m. UTC | #10
On Tue, Oct 25, 2022 at 10:11 AM Alistair Popple <apopple@nvidia.com> wrote:
>
>
> Matthew Wilcox <willy@infradead.org> writes:
>
> > On Mon, Oct 24, 2022 at 10:23:51PM +0200, Jann Horn wrote:
> >> """
> >> This guarantees that the page tables that are being walked
> >> aren't freed concurrently, but at the end of the walk, we
> >> have to grab a stable reference to the referenced page.
> >> For this we use the grab-reference-and-revalidate trick
> >> from above again:
> >> First we (locklessly) load the page
> >> table entry, then we grab a reference to the page that it
> >> points to (which can fail if the refcount is zero, in that
> >> case we bail), then we recheck that the page table entry
> >> is still the same, and if it changed in between, we drop the
> >> page reference and bail.
> >> This can, again, grab a reference to a page after it has
> >> already been freed and reallocated. The reason why this is
> >> fine is that the metadata structure that holds this refcount,
> >> `struct folio` (or `struct page`, depending on which kernel
> >> version you're looking at; in current kernels it's `folio`
> >> but `struct page` and `struct folio` are actually aliases for
> >> the same memory, basically, though that is supposed to maybe
> >> change at some point) is never freed; even when a page is
> >> freed and reallocated, the corresponding `struct folio`
> >> stays. This does have the fun consequence that whenever a
> >> page/folio has a non-zero refcount, the refcount can
> >> spuriously go up and then back down for a little bit.
> >> (Also it's technically not as simple as I just described it,
> >> because the `struct page` that the PTE points to might be
> >> a "tail page" of a `struct folio`.
> >> So actually we first read the PTE, the PTE gives us the
> >> `page*`, then from that we go to the `folio*`, then we
> >> try to grab a reference to the `folio`, then if that worked
> >> we check that the `page` still points to the same `folio`,
> >> and then we recheck that the PTE is still the same.)
> >> """
> >
> > Nngh.  In trying to make this description fit all kernels (with
> > both pages and folios), you've complicated it maximally.  Let's
> > try a more simple explanation:
> >
> > First we (locklessly) load the page table entry, then we grab a
> > reference to the folio that contains it (which can fail if the
> > refcount is zero, in that case we bail), then we recheck that the
> > page table entry is still the same, and if it changed in between,
> > we drop the folio reference and bail.
> > This can, again, grab a reference to a folio after it has
> > already been freed and reallocated. The reason why this is
> > fine is that the metadata structure that holds this refcount,
> > `struct folio` is never freed; even when a folio is
> > freed and reallocated, the corresponding `struct folio`
> > stays.

Oh, thanks. You're right, trying to talk about kernels with folios
made it unnecessarily complicated...

> I'm probably missing something obvious but how is that synchronised
> against memory hotplug? AFAICT if it isn't couldn't the pages be freed
> and memory removed? In that case the above would no longer hold because
> (I think) the metadata structure could have been freed.

Hm... that's this codepath?

arch_remove_memory -> __remove_pages -> __remove_section ->
sparse_remove_section -> section_deactivate ->
depopulate_section_memmap -> vmemmap_free -> remove_pagetable

which then walks down the page tables and ends up freeing individual
pages in remove_pte_table() using the confusingly-named
free_pagetable()?

I'm not sure what the synchronization against hotplug is - GUP-fast is
running with IRQs disabled, but other codepaths might not, like
get_ksm_page()? I don't know if that's holding something else for protection...
  
Peter Zijlstra Oct. 25, 2022, 2:02 p.m. UTC | #11
On Mon, Oct 24, 2022 at 09:58:07PM +0200, Jann Horn wrote:

> Unless I'm completely misunderstanding what's going on here, the whole
> "remove_table" thing only happens when you "remove a table", meaning
> you free an entire *pagetable*. Just zapping PTEs doesn't trigger that
> logic.

Aah; yes true. OTOH even it that were not so, I think it would still be
broken because the current code relies on the TLB flush to have
completed, whereas the RCU scheme is effectively async and can be
considered pending until the callback runs.

Hurmph... easiest fix is probably to dis-allow kvm_flush_tlb_multi()
for i386-pae builds.

Something like so... nobody in his right mind should care about i386-pae
virt performance much.

---
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 95fb85bea111..cbfb84e88251 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -473,6 +473,12 @@ static DEFINE_PER_CPU(cpumask_var_t, __pv_cpu_mask);
 
 static bool pv_tlb_flush_supported(void)
 {
+	/*
+	 * i386-PAE split loads are incompatible with optimized TLB flushes.
+	 */
+	if (IS_ENABLED(CONFIG_GUP_GET_PTE_LOW_HIGH))
+		return false;
+
 	return (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) &&
 		!kvm_para_has_hint(KVM_HINTS_REALTIME) &&
 		kvm_para_has_feature(KVM_FEATURE_STEAL_TIME) &&
  
Jann Horn Oct. 25, 2022, 2:18 p.m. UTC | #12
On Tue, Oct 25, 2022 at 4:02 PM Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Oct 24, 2022 at 09:58:07PM +0200, Jann Horn wrote:
>
> > Unless I'm completely misunderstanding what's going on here, the whole
> > "remove_table" thing only happens when you "remove a table", meaning
> > you free an entire *pagetable*. Just zapping PTEs doesn't trigger that
> > logic.
>
> Aah; yes true. OTOH even it that were not so, I think it would still be
> broken because the current code relies on the TLB flush to have
> completed, whereas the RCU scheme is effectively async and can be
> considered pending until the callback runs.
>
> Hurmph... easiest fix is probably to dis-allow kvm_flush_tlb_multi()
> for i386-pae builds.
>
> Something like so... nobody in his right mind should care about i386-pae
> virt performance much.

I think Xen and HyperV have similar codepaths.
hyperv_flush_tlb_multi() looks like it uses remote flush hypercalls,
xen_flush_tlb_multi() too.

On top of that, I think that theoretically, Linux doesn't even ensure
that you have a TLB flush in between tearing down one PTE and
installing another PTE (see
https://lore.kernel.org/all/CAG48ez1Oz4tT-N2Y=Zs6jumu=zOp7SQRZ=V2c+b5bT9P4retJA@mail.gmail.com/),
but I haven't tested that, and if it is true, I'm also not entirely
sure if it's correct (in the sense that it only creates incoherent-TLB
states when userspace is doing something stupid like racing
MADV_DONTNEED and page faults on the same region).

I think the more clearly correct fix would be to get rid of the split
loads and use CMPXCHG16B instead (probably destroying the performance
of GUP-fast completely), but that's complicated because some of the
architectures that use the split loads path don't have cmpxchg_double
(or at least don't have it wired up).
  
Peter Zijlstra Oct. 25, 2022, 3:06 p.m. UTC | #13
On Tue, Oct 25, 2022 at 04:18:20PM +0200, Jann Horn wrote:
> On Tue, Oct 25, 2022 at 4:02 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > On Mon, Oct 24, 2022 at 09:58:07PM +0200, Jann Horn wrote:
> >
> > > Unless I'm completely misunderstanding what's going on here, the whole
> > > "remove_table" thing only happens when you "remove a table", meaning
> > > you free an entire *pagetable*. Just zapping PTEs doesn't trigger that
> > > logic.
> >
> > Aah; yes true. OTOH even it that were not so, I think it would still be
> > broken because the current code relies on the TLB flush to have
> > completed, whereas the RCU scheme is effectively async and can be
> > considered pending until the callback runs.
> >
> > Hurmph... easiest fix is probably to dis-allow kvm_flush_tlb_multi()
> > for i386-pae builds.
> >
> > Something like so... nobody in his right mind should care about i386-pae
> > virt performance much.
> 
> I think Xen and HyperV have similar codepaths.
> hyperv_flush_tlb_multi() looks like it uses remote flush hypercalls,
> xen_flush_tlb_multi() too.

Sure (not updated).

> On top of that, I think that theoretically, Linux doesn't even ensure
> that you have a TLB flush in between tearing down one PTE and
> installing another PTE (see
> https://lore.kernel.org/all/CAG48ez1Oz4tT-N2Y=Zs6jumu=zOp7SQRZ=V2c+b5bT9P4retJA@mail.gmail.com/),
> but I haven't tested that, and if it is true, I'm also not entirely
> sure if it's correct (in the sense that it only creates incoherent-TLB
> states when userspace is doing something stupid like racing
> MADV_DONTNEED and page faults on the same region).
> 
> I think the more clearly correct fix would be to get rid of the split
> loads and use CMPXCHG16B instead (probably destroying the performance
> of GUP-fast completely), but that's complicated because some of the
> architectures that use the split loads path don't have cmpxchg_double
> (or at least don't have it wired up).

cmpxchg8b; but no, I think we want to fix MADV_DONTNEED, incoherent TLB
states are a pain nobody needs.

Something like so should force TLB flushes before dropping pte_lock (not
looked at the various pmd level things yet).

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 95fb85bea111..cbfb84e88251 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -473,6 +473,12 @@ static DEFINE_PER_CPU(cpumask_var_t, __pv_cpu_mask);
 
 static bool pv_tlb_flush_supported(void)
 {
+	/*
+	 * i386-PAE split loads are incompatible with optimized TLB flushes.
+	 */
+	if (IS_ENABLED(CONFIG_GUP_GET_PTE_LOW_HIGH))
+		return false;
+
 	return (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) &&
 		!kvm_para_has_hint(KVM_HINTS_REALTIME) &&
 		kvm_para_has_feature(KVM_FEATURE_STEAL_TIME) &&
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8bbcccbc5565..397bc04e2d82 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3474,5 +3474,6 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
  * default, the flag is not set.
  */
 #define  ZAP_FLAG_DROP_MARKER        ((__force zap_flags_t) BIT(0))
+#define  ZAP_FLAG_FORCE_FLUSH	     ((__force zap_flags_t) BIT(1))
 
 #endif /* _LINUX_MM_H */
diff --git a/mm/memory.c b/mm/memory.c
index f88c351aecd4..9bb63b3fbee1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1440,6 +1440,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			tlb_remove_tlb_entry(tlb, pte, addr);
 			zap_install_uffd_wp_if_needed(vma, addr, pte, details,
 						      ptent);
+
+			if (!force_flush && !tlb->fullmm && details &&
+			    details->zap_flags & ZAP_FLAG_FORCE_FLUSH)
+				force_flush = 1;
+
 			if (unlikely(!page))
 				continue;
 
@@ -1749,6 +1754,9 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 	struct maple_tree *mt = &vma->vm_mm->mm_mt;
 	unsigned long end = start + size;
 	struct mmu_notifier_range range;
+	struct zap_details details = {
+		.zap_flags = ZAP_FLAG_FORCE_FLUSH,
+	};
 	struct mmu_gather tlb;
 	MA_STATE(mas, mt, vma->vm_end, vma->vm_end);
 
@@ -1759,7 +1767,7 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 	update_hiwater_rss(vma->vm_mm);
 	mmu_notifier_invalidate_range_start(&range);
 	do {
-		unmap_single_vma(&tlb, vma, start, range.end, NULL);
+		unmap_single_vma(&tlb, vma, start, range.end, &details);
 	} while ((vma = mas_find(&mas, end - 1)) != NULL);
 	mmu_notifier_invalidate_range_end(&range);
 	tlb_finish_mmu(&tlb);
@@ -1806,7 +1814,7 @@ void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
 		unsigned long size)
 {
 	if (!range_in_vma(vma, address, address + size) ||
-	    		!(vma->vm_flags & VM_PFNMAP))
+	    !(vma->vm_flags & VM_PFNMAP))
 		return;
 
 	zap_page_range_single(vma, address, size, NULL);
  
Alistair Popple Oct. 26, 2022, 12:45 a.m. UTC | #14
Jann Horn <jannh@google.com> writes:

> On Tue, Oct 25, 2022 at 10:11 AM Alistair Popple <apopple@nvidia.com> wrote:
>>
>>
>> Matthew Wilcox <willy@infradead.org> writes:
>>
>> > On Mon, Oct 24, 2022 at 10:23:51PM +0200, Jann Horn wrote:
>> >> """
>> >> This guarantees that the page tables that are being walked
>> >> aren't freed concurrently, but at the end of the walk, we
>> >> have to grab a stable reference to the referenced page.
>> >> For this we use the grab-reference-and-revalidate trick
>> >> from above again:
>> >> First we (locklessly) load the page
>> >> table entry, then we grab a reference to the page that it
>> >> points to (which can fail if the refcount is zero, in that
>> >> case we bail), then we recheck that the page table entry
>> >> is still the same, and if it changed in between, we drop the
>> >> page reference and bail.
>> >> This can, again, grab a reference to a page after it has
>> >> already been freed and reallocated. The reason why this is
>> >> fine is that the metadata structure that holds this refcount,
>> >> `struct folio` (or `struct page`, depending on which kernel
>> >> version you're looking at; in current kernels it's `folio`
>> >> but `struct page` and `struct folio` are actually aliases for
>> >> the same memory, basically, though that is supposed to maybe
>> >> change at some point) is never freed; even when a page is
>> >> freed and reallocated, the corresponding `struct folio`
>> >> stays. This does have the fun consequence that whenever a
>> >> page/folio has a non-zero refcount, the refcount can
>> >> spuriously go up and then back down for a little bit.
>> >> (Also it's technically not as simple as I just described it,
>> >> because the `struct page` that the PTE points to might be
>> >> a "tail page" of a `struct folio`.
>> >> So actually we first read the PTE, the PTE gives us the
>> >> `page*`, then from that we go to the `folio*`, then we
>> >> try to grab a reference to the `folio`, then if that worked
>> >> we check that the `page` still points to the same `folio`,
>> >> and then we recheck that the PTE is still the same.)
>> >> """
>> >
>> > Nngh.  In trying to make this description fit all kernels (with
>> > both pages and folios), you've complicated it maximally.  Let's
>> > try a more simple explanation:
>> >
>> > First we (locklessly) load the page table entry, then we grab a
>> > reference to the folio that contains it (which can fail if the
>> > refcount is zero, in that case we bail), then we recheck that the
>> > page table entry is still the same, and if it changed in between,
>> > we drop the folio reference and bail.
>> > This can, again, grab a reference to a folio after it has
>> > already been freed and reallocated. The reason why this is
>> > fine is that the metadata structure that holds this refcount,
>> > `struct folio` is never freed; even when a folio is
>> > freed and reallocated, the corresponding `struct folio`
>> > stays.
>
> Oh, thanks. You're right, trying to talk about kernels with folios
> made it unnecessarily complicated...
>
>> I'm probably missing something obvious but how is that synchronised
>> against memory hotplug? AFAICT if it isn't couldn't the pages be freed
>> and memory removed? In that case the above would no longer hold because
>> (I think) the metadata structure could have been freed.
>
> Hm... that's this codepath?
>
> arch_remove_memory -> __remove_pages -> __remove_section ->
> sparse_remove_section -> section_deactivate ->
> depopulate_section_memmap -> vmemmap_free -> remove_pagetable
> which then walks down the page tables and ends up freeing individual
> pages in remove_pte_table() using the confusingly-named
> free_pagetable()?

Right. section_deactivate() will also clear SECTION_HAS_MEM_MAP which
would trigger VM_BUG_ON(!pfn_valid(pte_pfn(pte))) in gup_pte_range().

> I'm not sure what the synchronization against hotplug is - GUP-fast is
> running with IRQs disabled, but other codepaths might not, like
> get_ksm_page()? I don't know if that's holding something else for protection...

I was thinking about this from the ZONE_DEVICE perspective (ie.
memunmap_pages -> pageunmap_range -> arch_remove_memory -> ...). That
runs with IRQs enabled, and I couldn't see any other synchronization.
pageunmap_range() does call mem_hotplug_begin() which takes hotplug
locks but GUP-fast doesn't take those locks. So based on Peter's
response I think I need to add a rcu_synchronize() call to
pageunmap_range() right after calling mem_hotplug_begin().

I could just add it to mem_hotplug_begin() but offline_pages() calls
that too early (before pages have been isolated) so will need a separate
change.
  
Jann Horn Oct. 26, 2022, 4:45 p.m. UTC | #15
On Tue, Oct 25, 2022 at 5:06 PM Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Oct 25, 2022 at 04:18:20PM +0200, Jann Horn wrote:
> > On Tue, Oct 25, 2022 at 4:02 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Mon, Oct 24, 2022 at 09:58:07PM +0200, Jann Horn wrote:
> > >
> > > > Unless I'm completely misunderstanding what's going on here, the whole
> > > > "remove_table" thing only happens when you "remove a table", meaning
> > > > you free an entire *pagetable*. Just zapping PTEs doesn't trigger that
> > > > logic.
> > >
> > > Aah; yes true. OTOH even it that were not so, I think it would still be
> > > broken because the current code relies on the TLB flush to have
> > > completed, whereas the RCU scheme is effectively async and can be
> > > considered pending until the callback runs.
> > >
> > > Hurmph... easiest fix is probably to dis-allow kvm_flush_tlb_multi()
> > > for i386-pae builds.
> > >
> > > Something like so... nobody in his right mind should care about i386-pae
> > > virt performance much.
> >
> > I think Xen and HyperV have similar codepaths.
> > hyperv_flush_tlb_multi() looks like it uses remote flush hypercalls,
> > xen_flush_tlb_multi() too.
>
> Sure (not updated).
>
> > On top of that, I think that theoretically, Linux doesn't even ensure
> > that you have a TLB flush in between tearing down one PTE and
> > installing another PTE (see
> > https://lore.kernel.org/all/CAG48ez1Oz4tT-N2Y=Zs6jumu=zOp7SQRZ=V2c+b5bT9P4retJA@mail.gmail.com/),
> > but I haven't tested that, and if it is true, I'm also not entirely
> > sure if it's correct (in the sense that it only creates incoherent-TLB
> > states when userspace is doing something stupid like racing
> > MADV_DONTNEED and page faults on the same region).
> >
> > I think the more clearly correct fix would be to get rid of the split
> > loads and use CMPXCHG16B instead (probably destroying the performance
> > of GUP-fast completely), but that's complicated because some of the
> > architectures that use the split loads path don't have cmpxchg_double
> > (or at least don't have it wired up).
>
> cmpxchg8b; but no, I think we want to fix MADV_DONTNEED, incoherent TLB
> states are a pain nobody needs.
>
> Something like so should force TLB flushes before dropping pte_lock (not
> looked at the various pmd level things yet).
[...]
>  #endif /* _LINUX_MM_H */
> diff --git a/mm/memory.c b/mm/memory.c
> index f88c351aecd4..9bb63b3fbee1 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1440,6 +1440,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>                         tlb_remove_tlb_entry(tlb, pte, addr);
>                         zap_install_uffd_wp_if_needed(vma, addr, pte, details,
>                                                       ptent);
> +
> +                       if (!force_flush && !tlb->fullmm && details &&
> +                           details->zap_flags & ZAP_FLAG_FORCE_FLUSH)
> +                               force_flush = 1;
> +

Hmm... I guess that might work, assuming that there is no other
codepath we might race with that first turns the present PTE into a
non-present PTE but keeps the flush queued for later. At least
codepaths that use the tlb_batched infrastructure are unproblematic...
  
Nadav Amit Oct. 26, 2022, 7:43 p.m. UTC | #16
On Oct 25, 2022, at 6:06 PM, Peter Zijlstra <peterz@infradead.org> wrote:

> 			if (!force_flush && !tlb->fullmm && details &&
> +			    details->zap_flags & ZAP_FLAG_FORCE_FLUSH)
> +				force_flush = 1;

Isn’t it too big of a hammer?

At the same time, the whole reasoning about TLB flushes is not getting any
simpler. We had cases in which MADV_DONTNEED and another concurrent
operation that effectively zapped PTEs (e.g., another MADV_DONTNEED) caused
the zap_pte_range() to skip entries since pte_none() was true. To resolve
these cases we relied on tlb_finish_mmu() to flush the range when needed
(i.e., flush the whole range when mm_tlb_flush_nested()).

Now, I do not have a specific broken scenario in mind following this change,
but it is all sounds to me a bit dangerous and at same time can potentially
introduce new overheads.

One alternative may be using mm_tlb_flush_pending() when setting a new PTE
to check for pending flushes and flushing the TLB if that is the case. This
is somewhat similar to what ptep_clear_flush() does. Anyhow, I guess this
might induce some overheads. As noted before, it is possible to track
pending TLB flushes in VMA/page-table granularity, with different tradeoffs
of overheads.
  
Peter Zijlstra Oct. 27, 2022, 7:08 a.m. UTC | #17
On Wed, Oct 26, 2022 at 06:45:16PM +0200, Jann Horn wrote:

> >  #endif /* _LINUX_MM_H */
> > diff --git a/mm/memory.c b/mm/memory.c
> > index f88c351aecd4..9bb63b3fbee1 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1440,6 +1440,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> >                         tlb_remove_tlb_entry(tlb, pte, addr);
> >                         zap_install_uffd_wp_if_needed(vma, addr, pte, details,
> >                                                       ptent);
> > +
> > +                       if (!force_flush && !tlb->fullmm && details &&
> > +                           details->zap_flags & ZAP_FLAG_FORCE_FLUSH)
> > +                               force_flush = 1;
> > +
> 
> Hmm... I guess that might work, assuming that there is no other
> codepath we might race with that first turns the present PTE into a
> non-present PTE but keeps the flush queued for later. At least
> codepaths that use the tlb_batched infrastructure are unproblematic...

So I thought the general rule was that if you modify a PTE and have not
unmapped things -- IOW, there's actual concurrency possible on the
thing, then the TLB invalidate needs to happen under pte_lock, since
that is what controls concurrency at the pte level.

As it stands MADV_DONTNEED seems to blatatly violate that general rule.

Then again; I could've missed something and the rules changed?
  
Peter Zijlstra Oct. 27, 2022, 7:27 a.m. UTC | #18
On Wed, Oct 26, 2022 at 10:43:21PM +0300, Nadav Amit wrote:
> On Oct 25, 2022, at 6:06 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > 			if (!force_flush && !tlb->fullmm && details &&
> > +			    details->zap_flags & ZAP_FLAG_FORCE_FLUSH)
> > +				force_flush = 1;
> 
> Isn’t it too big of a hammer?

It is the obvious hammer :-) TLB invalidate under pte_lock when
clearing.

> At the same time, the whole reasoning about TLB flushes is not getting any
> simpler. We had cases in which MADV_DONTNEED and another concurrent
> operation that effectively zapped PTEs (e.g., another MADV_DONTNEED) caused
> the zap_pte_range() to skip entries since pte_none() was true. To resolve
> these cases we relied on tlb_finish_mmu() to flush the range when needed
> (i.e., flush the whole range when mm_tlb_flush_nested()).

Yeah, whoever thought that allowing concurrency there was a great idea :/

And I must admit to hating the pending thing with a passion. And that
mm_tlb_flush_nested() thing in tlb_finish_mmu() is a giant hack at the
best of times.

Also; I feel it's part of the problem here; it violates the basic rules
we've had for a very long time.

> Now, I do not have a specific broken scenario in mind following this change,
> but it is all sounds to me a bit dangerous and at same time can potentially
> introduce new overheads.

I'll take correctness over being fast. As you say, this whole TLB thing
is getting out of hand.

> One alternative may be using mm_tlb_flush_pending() when setting a new PTE
> to check for pending flushes and flushing the TLB if that is the case. This
> is somewhat similar to what ptep_clear_flush() does. Anyhow, I guess this
> might induce some overheads. As noted before, it is possible to track
> pending TLB flushes in VMA/page-table granularity, with different tradeoffs
> of overheads.

Right; I just don't believe in VMAs for this, they're *waaay* to big.
  
Nadav Amit Oct. 27, 2022, 5:30 p.m. UTC | #19
On Oct 27, 2022, at 12:27 AM, Peter Zijlstra <peterz@infradead.org> wrote:

>> One alternative may be using mm_tlb_flush_pending() when setting a new PTE
>> to check for pending flushes and flushing the TLB if that is the case. This
>> is somewhat similar to what ptep_clear_flush() does. Anyhow, I guess this
>> might induce some overheads. As noted before, it is possible to track
>> pending TLB flushes in VMA/page-table granularity, with different tradeoffs
>> of overheads.
> 
> Right; I just don't believe in VMAs for this, they're *waaay* to big.

Well, I did it for VMA in an RFC only because I was pushed. I thought and do
think that page-table granularity is the right one.
  
Linus Torvalds Oct. 27, 2022, 6:13 p.m. UTC | #20
On Thu, Oct 27, 2022 at 12:08 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> So I thought the general rule was that if you modify a PTE and have not
> unmapped things -- IOW, there's actual concurrency possible on the
> thing, then the TLB invalidate needs to happen under pte_lock, since
> that is what controls concurrency at the pte level.

Yeah, I think we should really think about the TLB issue and make the
rules very clear.

A lot of our thinking behind "fix TLB races" have been about
"use-after-free wrt TLB on another CPU", and the design in
zap_page_ranges() is almost entirely about that: making sure that the
TLB flush happens before the underlying pages are free'd.

I say "almost entirely", because you can see the effects of the
*other* kind of race in

                        if (!PageAnon(page)) {
                                if (pte_dirty(ptent)) {
                                        force_flush = 1;
                                        set_page_dirty(page);
                                }

where it isn't about the lifetime of the page, but about the state
coherency wrt other users.

But I think even that code is quite questionable, and we really should
write down the rules much more strictly somewhere.

For example, wrt that pte_dirty() causing force_flush, I'd love to
have some core set of rules that would explain

 (a) why it is only relevant for shared pages

 (b) do even shared pages need it for the "fullmm" case when we tear
down the whole VM?

 (c) what are the force_flush interactions wrt holding the mmap lock
for writing vs reading?

In the above, I think (b)/(c) are related - I suspect "fullmm" is
mostly equivalent to "mmap held for writing", in that it guarantees
that there should be no concurrent faults or other active sw ops on
the table.

But "fullmm" is probably even stronger than "mmap write-lock" in that
it should also mean "no other CPU can be actively using this" - either
for hardware page table walking, or for GUP.

Both both have the vmscan code and rmap that can still get to the
pages - and the vmscan code clearly only cares about the page table
lock. But the vmscan code itself also shouldn't care about some stale
TLB value, so ...

> As it stands MADV_DONTNEED seems to blatatly violate that general rule.

So let's talk about exactly what and why the TLB would need to be
flushed before dropping the page table lock.

For example, MADV_DONTNEED does this all with just the mmap lock held
for reading, so we *unless* we have that 'force_flush', we can

 (a) have another CPU continue to use the old stale TLB entry for quite a while

 (b) yet another CPU (that didn't have a TLB entry, or wanted to write
to a read-only one ) could take a page fault, and install a *new* PTE
entry in the same slot, all at the same time.

Now, that's clearly *very* confusing. But being confusing may not mean
"wrong" - we're still delaying the free of the old entry, so there's
no use-after-free.

The biggest problem I can see is that this means that user space
memory ordering might be screwed up: the CPU in (a) will see not just
an old TLB entry, but simply old *data*, when the CPU in (b) may be
writing to that same address with new data.

So I think we clearly do *NOT* serialize as much as we could for
MADV_DONTNEED, and I think the above is a real semantic result of
that.

But equally arguably if you do this kind of unserialized MADV_DONTNEED
in user space, that's a case of "you asked for the breakage - you get
to keep both pieces".

So is that ever an actual problem? If the worst issue is that some
CPU's may see old ("pre-DONTNEED") data, while other CPU's then see
new data while the MADV_DONTNEED is executing, I think maybe *THAT*
part is fine.

But because it's so very confusing, maybe we have other problems in
this area, which is why I think it would be really good to have an
actual set of real hard documented rules about this all, and exactly
when we need to flush TLBs synchronously with the page table lock, and
when we do not.

Anybody willing to try to write up the rules (and have each rule
document *why* it's a rule - not just "by fiat", but an actual "these
are the rules and this is *why* they are the rules").

Because right now I think all of our rules are almost entirely just
encoded in the code, with a couple of comments, and a few people who
just remember why we do what we do.

                   Linus
  
Peter Zijlstra Oct. 27, 2022, 7:35 p.m. UTC | #21
Just two quick remarks; it's far to late to really think :-)

On Thu, Oct 27, 2022 at 11:13:55AM -0700, Linus Torvalds wrote:

> But "fullmm" is probably even stronger than "mmap write-lock" in that
> it should also mean "no other CPU can be actively using this" - either
> for hardware page table walking, or for GUP.

IIRC fullmm is really: this is the last user and we're taking the whole
mm down -- IOW exit().

> For example, MADV_DONTNEED does this all with just the mmap lock held
> for reading, so we *unless* we have that 'force_flush', we can
> 
>  (a) have another CPU continue to use the old stale TLB entry for quite a while
> 
>  (b) yet another CPU (that didn't have a TLB entry, or wanted to write
> to a read-only one ) could take a page fault, and install a *new* PTE
> entry in the same slot, all at the same time.
> 
> Now, that's clearly *very* confusing. But being confusing may not mean
> "wrong" - we're still delaying the free of the old entry, so there's
> no use-after-free.

Do we worry about CPU errata where things go side-ways if multiple CPUs
have inconsistent TLB state?
  
Linus Torvalds Oct. 27, 2022, 7:43 p.m. UTC | #22
On Thu, Oct 27, 2022 at 12:35 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Oct 27, 2022 at 11:13:55AM -0700, Linus Torvalds wrote:
>
> > But "fullmm" is probably even stronger than "mmap write-lock" in that
> > it should also mean "no other CPU can be actively using this" - either
> > for hardware page table walking, or for GUP.
>
> IIRC fullmm is really: this is the last user and we're taking the whole
> mm down -- IOW exit().

Yes.

But that doesn't mean that it's entirely "just ours" - vmscan can
still see the entries due to rmap, I think. So there can still be some
concurrency concerns, but it's limited.

> Do we worry about CPU errata where things go side-ways if multiple CPUs
> have inconsistent TLB state?

Yeah, we should definitely worry about those, since I think they have
been known to cause machine checks etc, which then crashes the machine
because the machine check architecture is broken garbage.

"User gets the odd memory ordering they asked for" is different from
"user can crash machine because of bad machine check architecture" ;)

That said, I don't think this is a real worry here. Because if I
recall the errata correctly, they are not about "different TLB
contents", but "different cacheability for the same physical page".

Because different TLB contents are normal and even expected, I think.
Things like kmap_local etc already end up doing some lazy TLB
flushing. No?

I think it's only "somebody did an UC access to a cacheline I have"
that ends up being bad.

Note the *WILD HANDWAVING* above - I didn't actually look up the
errata. The above is from my dim memories of the issues we had, and I
might just be wrong.

               Linus
  
Nadav Amit Oct. 27, 2022, 8:15 p.m. UTC | #23
On Oct 27, 2022, at 11:13 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Anybody willing to try to write up the rules (and have each rule
> document *why* it's a rule - not just "by fiat", but an actual "these
> are the rules and this is *why* they are the rules").
> 
> Because right now I think all of our rules are almost entirely just
> encoded in the code, with a couple of comments, and a few people who
> just remember why we do what we do.

I think it might be easier to come up with new rules instead of phrasing the
existing ones.

The approach I suggested before [1] is something like:

1. Turn x86’s TLB-generation mechanism to be generic. Turn the
   TLB-generation into “pending TLB-generation”.

2. For each mm track “completed TLB-generation”, whenever an actual flush
   takes place.

3. When you defer a TLB-flush, while holding the PTL:
  a. Increase the TLB-generation.
  b. Save the updated “table generation" in a new field in the
     page-table’s page-struct.

4. When you are about to rely on a PTE value that is read from a page-table,
   first check if a TLB flush is needed. The check is performed by comparing
   the “table generation” with the “completed generation”. If the “table
   generation” is behind, a TLB flush is needed.

   [ You rely on the PTE value when you install new PTEs or change them ]

That’s about it. I might have not covered some issues with fast-GUP. But in
general I think it is a simple scheme. The thing I like about this scheme
the most is that it avoids relying on almost all the OS data-structures
(e.g., PageAnon()), making it much easier to grasp.

I can revive the patch-set if the overall approach is agreeable.


[1] https://lore.kernel.org/lkml/20210131001132.3368247-1-namit@vmware.com/
  
Linus Torvalds Oct. 27, 2022, 8:31 p.m. UTC | #24
On Thu, Oct 27, 2022 at 1:15 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> I think it might be easier to come up with new rules instead of phrasing the
> existing ones.

I'm ok with that, but I think you are missing a very important issue:
all the cases where we can short-circuit TLB invalidations *entirely*.

You don't mention those at all.

Those optimizations are *very* important. Process exit is one of the
most performance-critical pieces of code in the kernel on some loads,
because a lot of traditional unix loads have a *ton* of small
fork/exec/exit sequences, and the whole "do just one TLB flush" was at
least historically quite a big deal.

So one very big issue here is when zap_page_tables() can end up
skipping TLB flushes entirely, because nobody cares.

And no, the fix is not to turn it into some "just increment a
generation number".

We want to avoid *even that* cost for the whole "we don't actually
need a TLB flush at all, until we actually free the pages".

So there are two levels of tlb flush optimizations

 (a) avoiding them entirely in the first place

 (b) the whole "once you have to flush, keep track of lazy modes and
TLB generations, and flush ranges"

And honestly, I think you ignored (a), and that's where we do exactly
those kinds of "this case doesn't need to flush AT ALL" things.

So when you say

>       The thing I like about this scheme
> the most is that it avoids relying on almost all the OS data-structures
> (e.g., PageAnon()), making it much easier to grasp.

I think it's because you've ignored a big part of the whole issue.

               Linus
  
Nadav Amit Oct. 27, 2022, 9:44 p.m. UTC | #25
On Oct 27, 2022, at 1:31 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> So there are two levels of tlb flush optimizations
> 
> (a) avoiding them entirely in the first place
> 
> (b) the whole "once you have to flush, keep track of lazy modes and
> TLB generations, and flush ranges"
> 
> And honestly, I think you ignored (a), and that's where we do exactly
> those kinds of "this case doesn't need to flush AT ALL" things.

I did try to avoid TLB flushes by introducing pte_needs_flush() and avoiding
flushes based on the architectural PTE changes. There are even more x86
arch-based opportunities to further avoid TLB flushes (and then only flush
the TLB if spurious #PF occurs).

Personally, I still think that making decisions on flushes based on (mostly)
only the arch state makes the code more robust against misuse (e.g., see
various confusions between mmu_gather’s fullmm and need_flush_all).

Having said that, I will follow your feedback that the extra complexity
worth the extra performance.

Anyhow, admittedly, I need to give it more thought. For instance, in respect
to the code that you mentioned (in zap_pte_range), after reading it again,
it seems strange: how is ok to defer the TLB flush after the rmap was
removed, even if it is done while the PTL is held.
folio_clear_dirty_for_io() would not sync on the PTL afterwards, so the page
might be later re-dirtied using a stale cached PTE. Oh well.
  
Nadav Amit Oct. 28, 2022, 11:57 p.m. UTC | #26
On Oct 27, 2022, at 2:44 PM, Nadav Amit <nadav.amit@gmail.com> wrote:

> Anyhow, admittedly, I need to give it more thought. For instance, in respect
> to the code that you mentioned (in zap_pte_range), after reading it again,
> it seems strange: how is ok to defer the TLB flush after the rmap was
> removed, even if it is done while the PTL is held.
> folio_clear_dirty_for_io() would not sync on the PTL afterwards, so the page
> might be later re-dirtied using a stale cached PTE. Oh well.

Peter,

So it appears to be a problem - flushing after removing the rmap. I attach a
PoC that shows the problem.

The problem is in the following code of zap_pte_range():

	if (!PageAnon(page)) {
		if (pte_dirty(ptent)) {
			force_flush = 1;
			set_page_dirty(page);
		}
                …
	}
	page_remove_rmap(page, vma, false);

Once we remove the rmap, rmap_walk() would not acquire the page-table lock
anymore. As a result, nothing prevents the kernel from performing writeback
and cleaning the page-struct dirty-bit before the TLB is actually flushed.
As a result, if there is an additional thread that has the dirty-PTE cached
in the TLB, it can keep writing to the page and nothing (PTE/page-struct)
will keep track that the page has been dirtied.

In other words, writes to the memory mapped file after
munmap()/MADV_DONTNEED started can be lost.

This affects both munmap() and MADV_DONTNEED. One might argue that if you
don’t keep writing after munmap()/MADV_DONTNEED it’s your fault
(questionable), but if that’s the case why do we bother with force_flush at
all?

If we want it to behave correctly - i.e., writes after munmap/MADV_DONTNEED
to propagate to the file - we need to collect dirty pages and remove their
rmap only after we flush the TLB (and before we release the ptl). mmu_gather
would probably need to be changed for this matter.

Thoughts?

---

Some details about the PoC (below):

We got 3 threads that use 2MB range:

1. Maintains a counter in the first 4KB of the 2MB range and checks it
   actually updates the memory.

2. Dirties pages in 2MB range (just to make the race more likely).

3. Either (i) maps the same mapping at the first 4KB (to test munmap
   indirectly); or (ii) runs MADV_DONTNEED on the first 4KB.

In addition, a child process runs msync and fdatasync to writeback the first
4KB.

The PoC should be run with a file on a block RAM device. It manages to
trigger the issue relatively reliably (within 60 seconds) with munmap() and
slightly less reliably with MADV_DONTNEED. I have no idea if it works in VM,
deep C-states, etc..

-- >8 --

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>

#define handle_error(msg) \
   do { perror(msg); exit(EXIT_FAILURE); } while (0)

void *p;
volatile bool stop = false;
pid_t flusher_pid;
int fd;

#define PAGE_SIZE	(4096ul)
#define PAGES_PER_PMD	(512)
#define HPAGE_SIZE	(PAGE_SIZE * PAGES_PER_PMD)

// Comment MUNMAP_TEST for MADV_DONTNEED test
#define MUNMAP_TEST

void *dirtying_thread(void *arg)
{
	int i;

	while (!stop) {
		for (i = 1; i < PAGES_PER_PMD; i++) {
			*(volatile char *)(p + (i * PAGE_SIZE) + 64) = 5;
		}
	}
	return NULL;
}

void *checking_thread(void *arg)
{
	volatile unsigned long *ul_p = (volatile unsigned long*)p;
	unsigned long cnt = 0;

	while (!stop) {
		*ul_p = cnt;
		if (*ul_p != cnt) {
			printf("FAILED: expected %ld, got %ld\n", cnt, *ul_p);
			kill(flusher_pid, SIGTERM);
			exit(0);
		}
		cnt++;
	}
	return NULL;
}

void *remap_thread(void *arg)
{
	void *ptr;
	struct timespec t = {
		.tv_nsec = 10000,
	};

	while (!stop) {
#ifdef MUNMAP_TEST
		ptr = mmap(p, PAGE_SIZE, PROT_READ|PROT_WRITE,
			   MAP_SHARED|MAP_FIXED|MAP_POPULATE, fd, 0);
		if (ptr == MAP_FAILED)
			handle_error("remap_thread");
#else
		if (madvise(p, PAGE_SIZE, MADV_DONTNEED) < 0)
			handle_error("MADV_DONTNEED");
		nanosleep(&t, NULL);
#endif
	}
	return NULL;
}

void flushing_process(void)
{
	// Remove the pages to speed up rmap_walk and allow to drop caches.
	if (madvise(p, HPAGE_SIZE, MADV_DONTNEED) < 0)
		handle_error("MADV_DONTNEED");

	while (true) {
		if (msync(p, PAGE_SIZE, MS_SYNC))
			handle_error("msync");
		if (posix_fadvise(fd, 0, PAGE_SIZE, POSIX_FADV_DONTNEED))
			handle_error("posix_fadvise");
	}
}

int main(int argc, char *argv[])
{
	void *(*thread_funcs[])(void*) = {
		&dirtying_thread,
		&checking_thread,
		&remap_thread,
	};	
	int r, i;
	int rc1, rc2;
	unsigned long addr;
	void *ptr;
	char *page = malloc(PAGE_SIZE);
	int n_threads = sizeof(thread_funcs) / sizeof(*thread_funcs);
	pthread_t *threads = malloc(sizeof(pthread_t) * n_threads);
	pid_t pid;
	
	if (argc < 2) {
		fprintf(stderr, "usages: %s [filename]\n", argv[0]);
		exit(EXIT_FAILURE);
	}

	fd = open(argv[1], O_RDWR|O_CREAT, 0666);
	if (fd == -1)
		handle_error("open fd");

	for (i = 0; i < PAGES_PER_PMD; i++) {
		if (write(fd, page, PAGE_SIZE) != PAGE_SIZE)
			handle_error("write");
	}
	free(page);

	ptr = mmap(NULL, HPAGE_SIZE * 2, PROT_NONE, MAP_PRIVATE|MAP_ANON,
                   -1, 0);

	if (ptr == MAP_FAILED)
		handle_error("mmap anon");

	addr = (unsigned long)(ptr + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
	printf("starting...\n");

	ptr = mmap((void *)addr, HPAGE_SIZE, PROT_READ|PROT_WRITE,
		   MAP_SHARED|MAP_FIXED|MAP_POPULATE, fd, 0);

	if (ptr == MAP_FAILED)
		handle_error("mmap file - start");
	
	p = ptr;

	for (i = 0; i < n_threads; i++) {
		r = pthread_create(&threads[i], NULL, thread_funcs[i], NULL);
		if (r)
			handle_error("pthread_create");
	}

	// Run the flushing process in a different process, so msync() would
	// not require mmap_lock.
	pid = fork();
	if (pid == 0)
		flushing_process();
	flusher_pid = pid;

	sleep(60);

	stop = true;
	for (i = 0; i < n_threads; i++)
		pthread_join(threads[i], NULL);
	kill(flusher_pid, SIGTERM);
	printf("Finished without an error");

	exit(0);
}
  
Linus Torvalds Oct. 29, 2022, 12:42 a.m. UTC | #27
On Fri, Oct 28, 2022 at 4:57 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> The problem is in the following code of zap_pte_range():
>
>         if (!PageAnon(page)) {
>                 if (pte_dirty(ptent)) {
>                         force_flush = 1;
>                         set_page_dirty(page);
>                 }
>                 …
>         }
>         page_remove_rmap(page, vma, false);
>
> Once we remove the rmap, rmap_walk() would not acquire the page-table lock
> anymore. As a result, nothing prevents the kernel from performing writeback
> and cleaning the page-struct dirty-bit before the TLB is actually flushed.

Hah.

The original reason for force_flush there was similar, with a race wrt
page_mkclean() because this code doesn't take the page lock that would
normally serialize things, because the page lock is so painful and
ends up having nasty nasty interactions with slow IO operations.

So all the page-dirty handling really *wants* to use the page lock,
and for the IO side (writeback) that ends up being acceptable and
works well, but from that "serialize VM state" it's horrendous.

So instead the code intentionally serialized on the rmap data
structures which page_mkclean() also walks, and as you point out,
that's broken. It's not broken at the point where we do
set_page_dirty(), but it *comes* broken when we drop the rmap, and the
problem is exactly that "we still have the dirty bit hidden in the TLB
state" issue that you pinpoint.

I think the proper fix (or at least _a_ proper fix) would be to
actually carry the dirty bit along to the __tlb_remove_page() point,
and actually treat it exactly the same way as the page pointer itself
- set the page dirty after the TLB flush, the same way we can free the
page after the TLB flush.

We could easiy hide said dirty bit in the low bits of the
"batch->pages[]" array or something like that. We'd just have to add
the 'dirty' argument to __tlb_remove_page_size() and friends.

Hmm?

Your idea of "do the page_remove_rmap() late instead" would also work,
but the reason I think just squirrelling away the dirty bit is the
"proper" fix is that it would get rid of the whole need for
'force_flush' in this area entirely. So we'd not only fix that race
you noticed, we'd actually do so and reduce the number of TLB flushes
too.

I don't know. Maybe I'm missing something fundamental, and my idea is
just stupid.

               Linus
  
Nadav Amit Oct. 29, 2022, 6:05 p.m. UTC | #28
On Oct 28, 2022, at 5:42 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> I think the proper fix (or at least _a_ proper fix) would be to
> actually carry the dirty bit along to the __tlb_remove_page() point,
> and actually treat it exactly the same way as the page pointer itself
> - set the page dirty after the TLB flush, the same way we can free the
> page after the TLB flush.
> 
> We could easiy hide said dirty bit in the low bits of the
> "batch->pages[]" array or something like that. We'd just have to add
> the 'dirty' argument to __tlb_remove_page_size() and friends.

Thank you for your quick response. I was slow to respond due to a jet lag.

Anyhow, I am not sure whether the solution that you propose would work.
Please let me know if my understanding makes sense.

Let’s assume that we do not call set_page_dirty() before we remove the rmap
but only after we invalidate the page [*]. Let’s assume that
shrink_page_list() is called after the page’s rmap is removed and the page
is no longer mapped, but before set_page_dirty() was actually called.

In such a case, shrink_page_list() would consider the page clean, and would
indeed keep the page (since __remove_mapping() would find elevated page
refcount), which appears to give us a chance to mark the page as dirty
later.

However, IIUC, in this case shrink_page_list() might still call
filemap_release_folio() and release the buffers, so calling set_page_dirty()
afterwards - after the actual TLB invalidation took place - would fail.

> Your idea of "do the page_remove_rmap() late instead" would also work,
> but the reason I think just squirrelling away the dirty bit is the
> "proper" fix is that it would get rid of the whole need for
> 'force_flush' in this area entirely. So we'd not only fix that race
> you noticed, we'd actually do so and reduce the number of TLB flushes
> too.

I’m all for reducing the number of TLB flushes, and your solution does sound
better in general. I proposed something that I considered having the path of
least resistance (i.e., least chance of breaking something). I can do what
you propsosed, but I am not sure how to deal with the buffers being removed.

One more note: This issue, I think, also affects migrate_vma_collect_pmd().
Alistair recently addressed an issue there, but in my prior feedback to him
I missed this issue.


[*] Note that this would be for our scenario pretty much the same if we also
called set_page_dirty() before removing the rmap, but the page was cleaned
while the TLB invalidation has still not been performed.
  
Linus Torvalds Oct. 29, 2022, 6:36 p.m. UTC | #29
On Sat, Oct 29, 2022 at 11:05 AM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> Anyhow, I am not sure whether the solution that you propose would work.
> Please let me know if my understanding makes sense.

Let me include my (UNTESTED! PROBABLY GARBAGE!) patch here just as a
"I meant something like this".

Note that it's untested, but I tried to make it intentionally use
different names and an opaque type in the 'mmu_gather' data structure
so that at least these bit games end up being type-safe and more
likely to be correct if it compiles.

And I will say that it does compile for me - but only in my normal
x86-64 config. I haven't tested anything else, and I guarantee it
won't build on s390, for example, because s390 has its own mmu_gather
functions.

> Let’s assume that we do not call set_page_dirty() before we remove the rmap
> but only after we invalidate the page [*]. Let’s assume that
> shrink_page_list() is called after the page’s rmap is removed and the page
> is no longer mapped, but before set_page_dirty() was actually called.
>
> In such a case, shrink_page_list() would consider the page clean, and would
> indeed keep the page (since __remove_mapping() would find elevated page
> refcount), which appears to give us a chance to mark the page as dirty
> later.

Right. That is not different to any other function (like "write()"
having looked up the page.

> However, IIUC, in this case shrink_page_list() might still call
> filemap_release_folio() and release the buffers, so calling set_page_dirty()
> afterwards - after the actual TLB invalidation took place - would fail.

I'm not seeing why.

That would imply that any "look up page, do set_page_dirty()" is
broken. They don't have rmap either. And we have a number of them all
over (eg think "GUP users" etc).

But hey, you were right about the stale TLB case, so I may just be
missing something.

I *think* the important thing is just that we need to mark the page
dirtty from the page tables _after_ we've flushed the TLB, just to
make sure that there can be no subsequent dirtiers that then get lost.

Anyway, I think the best documentation for "this is what I meant" is
simply the patch. Does this affect your PoC on your setup?

I tried to run your program on my machine (WITHOUT this patch - I have
compiled this patch, but I haven't even booted it - it really could be
horrible broken).

But it doesn't fail for me even without the patch and I just get
"Finished without an error" over and over again - but you said it had
to be run on a RAM block device, which I didn't do, so my testing is
very suspect,.

Again: THIS PATCH IS UNTESTED. I feel like it might actually work, if
only because I tried to be so careful with the type system.

But that "might actually work" is probably me being wildly optimistic,
and also depends on the whole concept being ok in the first place.

So realistically, think of this patch more as a "document in code what
Linus meant with his incoherent ramblings" rather than anything else.

Hmm?

               Linus
  
Linus Torvalds Oct. 29, 2022, 6:58 p.m. UTC | #30
On Sat, Oct 29, 2022 at 11:36 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Anyway, I think the best documentation for "this is what I meant" is
> simply the patch. Does this affect your PoC on your setup?

Here's a slightly cleaned up set with preliminary commit messages, and
an explanation for why some of the 'struct page' declarations were
moved around a bit in case you wondered about that part of the change
in the full patch.

The end result should be the same, so if you already looked at the
previous unified patch, never mind. But this one tries to make for a
better patch series.

Still not tested in any way, shape, or form. I decided I wanted to
send this one before booting into this and possibly blowing up ;^)

                   Linus
  
Linus Torvalds Oct. 29, 2022, 7:14 p.m. UTC | #31
On Sat, Oct 29, 2022 at 11:58 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Still not tested in any way, shape, or form. I decided I wanted to
> send this one before booting into this and possibly blowing up ;^)

Well, it boots, and I see no difference with your PoC code.

It didn't fail for me before, it doesn't fail for me with those patches.

Again, the "it doesn't fail for me" is probably because I'm running it
incorrectly, although for all I know there can also be hardware
differences.

I'm testing on an older AMD threadripper, and as I'm sure you are very
aware, some AMD cores used to have special support for keeping the TLB
coherent with the actual page table contents in order to then avoid
TLB flushes entirely.

Those things ended up being buggy and disabled, but my point is that
hardware differences can obviously actively hide this issue by making
the TLB contents track page table changes.

So even if I were to run it the same way you do, I might not see the
failure due to just running it on different hardware with different
TLB and timing.

Anyway, the patches don't seem to cause any *obvious* problems. That's
not to say that they are correct, or that they fix anything, but it's
certainly a fairly simple and straightforward patch, and it "feels
right" to me.

Sadly, reality doesn't always agree with my feelings. Damn.

                Linus
  
Nadav Amit Oct. 29, 2022, 7:28 p.m. UTC | #32
On Oct 29, 2022, at 12:14 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, Oct 29, 2022 at 11:58 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> Still not tested in any way, shape, or form. I decided I wanted to
>> send this one before booting into this and possibly blowing up ;^)
> 
> Well, it boots, and I see no difference with your PoC code.
> 
> It didn't fail for me before, it doesn't fail for me with those patches.
> 
> Again, the "it doesn't fail for me" is probably because I'm running it
> incorrectly, although for all I know there can also be hardware
> differences.

Please give me some time to test it. I presume you ran it with block ram
device (not tmpfs) and not on a virtual machine (which can affect besides
Intel/AMD implementation differences).

But even if your patches work and the tests pass, I am not sure it means
that everything is fine. I did not try to trigger a race with
shrink_page_list(), and doing that might be harder than the race I tried to
create before. I need to do some tracing to understand what I was missing in
my understanding of the shrink_page_list() - assuming that I am mistaken
about the buffers being potentially released.

I would note that my concern about releasing the buffers is partially driven
by to issues that were reported before [1]. I am actually not sure how they
were resolved.


[1] https://lore.kernel.org/all/20180103100430.GE4911@quack2.suse.cz/
  
John Hubbard Oct. 29, 2022, 7:39 p.m. UTC | #33
On 10/29/22 11:36, Linus Torvalds wrote:
>> In such a case, shrink_page_list() would consider the page clean, and would
>> indeed keep the page (since __remove_mapping() would find elevated page
>> refcount), which appears to give us a chance to mark the page as dirty
>> later.
> 
> Right. That is not different to any other function (like "write()"
> having looked up the page.
> 
>> However, IIUC, in this case shrink_page_list() might still call
>> filemap_release_folio() and release the buffers, so calling set_page_dirty()
>> afterwards - after the actual TLB invalidation took place - would fail.
> 
> I'm not seeing why.
> 
> That would imply that any "look up page, do set_page_dirty()" is
> broken. They don't have rmap either. And we have a number of them all
> over (eg think "GUP users" etc).

Yes, we do have a bunch of "look up page, do set_page_dirty()" cases.
And I think that many (most?) of them are in fact broken!

Because: the dirtiness of a page is something that the filesystem
believes that it is managing, and so filesystem coordination is, in
general, required in order to mark a page as dirty.

Jan Kara's 2018 analysis [1] (which launched the pin_user_pages()
effort) shows a nice clear example. And since then, I've come to believe
that most of the gup/pup call sites have it wrong:

    a) pin_user_pages() b) /* access page contents */ c)
    set_page_dirty() or set_page_dirty_lock() // PROBLEM HERE d)
    unpin_user_page()

ext4 has since papered over the problem, by soldiering on if it finds a
page without writeback buffers when it expected to be able to writeback
a dirty page. But you get the idea.

And I think that applies beyond the gup/pup situation.


[1] https://lore.kernel.org/linux-mm/20180103100430.GE4911@quack2.suse.cz/T/#u


thanks,
  
Linus Torvalds Oct. 29, 2022, 8:15 p.m. UTC | #34
On Sat, Oct 29, 2022 at 12:39 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> ext4 has since papered over the problem, by soldiering on if it finds a
> page without writeback buffers when it expected to be able to writeback
> a dirty page. But you get the idea.

I suspect that "soldiering on" is the right thing to do, but yes, our
'mkdirty' vs 'mkclean' thing has always been problematic.

I think we always needed a page lock for it, but PG_lock itself
doesn't work (as mentioned earlier) because the VM can't serialize
with IO, and needs the lock to basically be a spinlock.

The page table lock kind of took its place, and then the rmap removal
makes for problems (since it is what basically ends up being the
shared place to look it upo).

I can think of three options:

 (a) filesystems just deal with it

 (b) we could move the "page_remove_rmap()" into the "flush-and-free" path too

 (c) we could actually add a spinlock (hashed on the page?) for this

I think (a) is basically our current expectation.

And (b) would be fairly easy - same model as that dirty bit patch,
just a 'do page_remove_rmap too' - except page_remove_rmap() wants the
vma as well (and we delay the TLB flush over multiple vma's, so it's
not just a "save vma in mmu_gather").

Doing (c) doesn't look hard, except for the "new lock" thing, which is
always a potential huge disaster. If it's only across set_page_dirty()
and page_mkclean(), though, and uses some simple page-based hash, it
sounds fairly benign.

                  Linus



              Linus
  
Linus Torvalds Oct. 29, 2022, 8:30 p.m. UTC | #35
On Sat, Oct 29, 2022 at 1:15 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I can think of three options:
>
>  (a) filesystems just deal with it
>
>  (b) we could move the "page_remove_rmap()" into the "flush-and-free" path too
>
>  (c) we could actually add a spinlock (hashed on the page?) for this
>
> I think (a) is basically our current expectation.

Side note: anybody doing gup + set_page_dirty() won't be fixed by b/c
anyway, so I think (a) is basically the only thing.

And that's true even if you do a page pinning gup, since the source of
the gup may be actively unmapped after the gup.

So a filesystem that thinks that only write, or a rmap-accessible mmap
can turn the page dirty really seems to be fundamentally broken.

And I think that has always been the case, it's just that filesystem
writers may not have been happy with it, and may not have had
test-cases for it.

It's not surprising that the filesystem people then try to blame users.

          Linus
  
John Hubbard Oct. 29, 2022, 8:42 p.m. UTC | #36
On 10/29/22 13:30, Linus Torvalds wrote:
>> I can think of three options:
>>
>>  (a) filesystems just deal with it
>>
>>  (b) we could move the "page_remove_rmap()" into the "flush-and-free" path too
>>
>>  (c) we could actually add a spinlock (hashed on the page?) for this
>>
>> I think (a) is basically our current expectation.
> 
> Side note: anybody doing gup + set_page_dirty() won't be fixed by b/c
> anyway, so I think (a) is basically the only thing.
> 
> And that's true even if you do a page pinning gup, since the source of
> the gup may be actively unmapped after the gup.

I was just now writing a response that favored (c) over (b), precisely
because of that, yes. :)

> 
> So a filesystem that thinks that only write, or a rmap-accessible mmap
> can turn the page dirty really seems to be fundamentally broken.
> 
> And I think that has always been the case, it's just that filesystem
> writers may not have been happy with it, and may not have had
> test-cases for it.
> 
> It's not surprising that the filesystem people then try to blame users.
> 
>           Linus

Yes, lots of unhappy debates about this over the years.

However, I remain intrigued by (c), because if we had a "dirty page lock"
that is looked up by page (much like looking up the ptl), it seems like
a building block that would potentially help solve the whole thing.

The above points about "file system needs to coordinate with mm about
what's allowed to be dirtied, including gup/dma cases", those are still
true and not yet solved, yes. But having a solid point of synchronization
for this, definitely looks interesting.

Of course, without working through this more thoroughly, it's not fair
to impose this constraint on the current discussion, understood. :)

thanks,
  
Nadav Amit Oct. 29, 2022, 8:56 p.m. UTC | #37
On Oct 29, 2022, at 1:15 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> (b) we could move the "page_remove_rmap()" into the "flush-and-free" path too
> 
> And (b) would be fairly easy - same model as that dirty bit patch,
> just a 'do page_remove_rmap too' - except page_remove_rmap() wants the
> vma as well (and we delay the TLB flush over multiple vma's, so it's
> not just a "save vma in mmu_gather”).

(b) sounds reasonable and may potentially allow future performance
improvements (batching, doing stuff without locks).

It does appear to break a potential hidden assumption that rmap is removed
while the ptl is acquired (at least in the several instances I samples).
Yet, anyhow page_vma_mapped_walk() checks the PTE before calling the
function, so it should be fine.

I’ll give it a try.
  
Theodore Ts'o Oct. 29, 2022, 8:59 p.m. UTC | #38
On Sat, Oct 29, 2022 at 01:15:26PM -0700, Linus Torvalds wrote:
> On Sat, Oct 29, 2022 at 12:39 PM John Hubbard <jhubbard@nvidia.com> wrote:
> >
> > ext4 has since papered over the problem, by soldiering on if it finds a
> > page without writeback buffers when it expected to be able to writeback
> > a dirty page. But you get the idea.
> 
> I suspect that "soldiering on" is the right thing to do, but yes, our
> 'mkdirty' vs 'mkclean' thing has always been problematic.
>
> ...
>
>  (a) filesystems just deal with it

It should be noted that "soldiering on" just means that the kernel
will not crash or BUG.  It may mean that the dirty page will not
gotten written back (since at the time when it is discovered we are in
a context we may not allocate memory or block if there is a need to
allocate blocks if the file system uses delayed allocation).

Furthermore, since the file system does not know that one or more
pages have dirtied behind it's back, if the file system is almost
full, some writes may silently fail --- including writes where the
usesrspace application was implicitly promised that the write would
succeed by having the write(2) system call return without errors.

If people are OK with that, it's fine.  Just don't complain to the
file system maintainers.  :-)

						- Ted

P.S.  The reason why this isn't an utter disaster is because normally
users of remote RMA use preallocated and pre-written/initialized
files.  And there aren't _that_ many other users of gup.  So long as
this remains the case, we might be happy to let sleeping canines lie.
Just please dear $DEITY, let's not have any additional users of gup
until we have a better solution.
  
Nadav Amit Oct. 29, 2022, 9:03 p.m. UTC | #39
On Oct 29, 2022, at 1:56 PM, Nadav Amit <nadav.amit@gmail.com> wrote:

> On Oct 29, 2022, at 1:15 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
>> (b) we could move the "page_remove_rmap()" into the "flush-and-free" path too
>> 
>> And (b) would be fairly easy - same model as that dirty bit patch,
>> just a 'do page_remove_rmap too' - except page_remove_rmap() wants the
>> vma as well (and we delay the TLB flush over multiple vma's, so it's
>> not just a "save vma in mmu_gather”).
> 
> (b) sounds reasonable and may potentially allow future performance
> improvements (batching, doing stuff without locks).
> 
> It does appear to break a potential hidden assumption that rmap is removed
> while the ptl is acquired (at least in the several instances I samples).
> Yet, anyhow page_vma_mapped_walk() checks the PTE before calling the
> function, so it should be fine.
> 
> I’ll give it a try.

I have just seen John’s and your emails. It seems (b) fell off. (a) is out
of my “zone”, and anyhow assuming it would not be solved soon, deferring
page_remove_rmap() might cause regressions.

(c) might be more intrusive and potentially induce overheads. If we need a
small backportable solution, I think the approach that I proposed (marking
the page dirty after the invalidation, before the PTL is released) is the
simplest one.

Please advise how to proceed.
  
Linus Torvalds Oct. 29, 2022, 9:12 p.m. UTC | #40
On Sat, Oct 29, 2022 at 1:56 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> On Oct 29, 2022, at 1:15 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > (b) we could move the "page_remove_rmap()" into the "flush-and-free" path too
> >
> > And (b) would be fairly easy - same model as that dirty bit patch,
> > just a 'do page_remove_rmap too' - except page_remove_rmap() wants the
> > vma as well (and we delay the TLB flush over multiple vma's, so it's
> > not just a "save vma in mmu_gather”).
>
> (b) sounds reasonable and may potentially allow future performance
> improvements (batching, doing stuff without locks).

So the thing is, I think (b) makes sense from a TLB flush standpoint,
but as mentioned, if filesystems _really_ want to feel in control,
none of these locks fundamentally help, because the whole "gup +
set_page_dirty()" situation still exists.

And that fundamentally does not hold any locks between the lookup and
the set_page_dirty(), and fundamentally doesn't stop an rmap from
happening in between, so while the zap_page_range() and dirty TLB case
case be dealt with that way, you don't actually fix any fundamental
issues.

Now, neither does my (c) case on its own, but as John Hubbard alluded
to, *if* we had some sane serialization for a struct page, maybe that
could then be used to at least avoid the issue with "rmap no longer
exists", and make the filesystem handling case a bit better.

IOW, I really do think that not only is (a) the current solution, it's
the *correct* solution.

But it is possible that (c) could then be used as a way to make (a)
more palatable to filesystems, in that at least then there would be
some way to serialize with "set_page_dirty()".

But (b) cannot be used for that - because GUP fundamentally breaks
that rmap association.

It's all nasty.

                Linus
  
Nadav Amit Oct. 30, 2022, 12:18 a.m. UTC | #41
On Oct 29, 2022, at 12:14 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> It didn't fail for me before, it doesn't fail for me with those patches.

For the record, I tried to run the PoC on another machine, and it indeed did
not fail.

Turns out I had a small bug in one of the mechanisms that were intended to
make the failure more likely (I should have mapped again or madvised
HPAGE_SIZE to increase the time zap_pte_range spends to increase the
probability of the race).

I am still trying to figure out how to address this issue, and whether the
fact that some rmap_walk(), which do not use PVMW_SYNC are of an issue.

---

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>

#define handle_error(msg) \
   do { perror(msg); exit(EXIT_FAILURE); } while (0)

void *p;
volatile bool stop = false;
pid_t flusher_pid;
int fd;

#define PAGE_SIZE	(4096ul)
#define PAGES_PER_PMD	(512)
#define HPAGE_SIZE	(PAGE_SIZE * PAGES_PER_PMD)

// Comment MUNMAP_TEST for MADV_DONTNEED test
#define MUNMAP_TEST

void *dirtying_thread(void *arg)
{
	int i;

	while (!stop) {
		for (i = 1; i < PAGES_PER_PMD; i++) {
			*(volatile char *)(p + (i * PAGE_SIZE) + 64) = 5;
		}
	}
	return NULL;
}

void *checking_thread(void *arg)
{
	volatile unsigned long *ul_p = (volatile unsigned long*)p;
	unsigned long cnt = 0;

	while (!stop) {
		*ul_p = cnt;
		if (*ul_p != cnt) {
			printf("FAILED: expected %ld, got %ld\n", cnt, *ul_p);
			kill(flusher_pid, SIGTERM);
			exit(0);
		}
		cnt++;
	}
	return NULL;
}

void *remap_thread(void *arg)
{
	void *ptr;
	struct timespec t = {
		.tv_nsec = 10000,
	};

	while (!stop) {
#ifdef MUNMAP_TEST
		ptr = mmap(p, HPAGE_SIZE, PROT_READ|PROT_WRITE,
			   MAP_SHARED|MAP_FIXED|MAP_POPULATE, fd, 0);
		if (ptr == MAP_FAILED)
			handle_error("remap_thread");
#else
		if (madvise(p, HPAGE_SIZE, MADV_DONTNEED) < 0)
			handle_error("MADV_DONTNEED");
		nanosleep(&t, NULL);
#endif
	}
	return NULL;
}

void flushing_process(void)
{
	// Remove the pages to speed up rmap_walk and allow to drop caches.
	if (madvise(p, HPAGE_SIZE, MADV_DONTNEED) < 0)
		handle_error("MADV_DONTNEED");

	while (true) {
		if (msync(p, PAGE_SIZE, MS_SYNC))
			handle_error("msync");
		if (posix_fadvise(fd, 0, PAGE_SIZE, POSIX_FADV_DONTNEED))
			handle_error("posix_fadvise");
	}
}

int main(int argc, char *argv[])
{
	void *(*thread_funcs[])(void*) = {
		&dirtying_thread,
		&checking_thread,
		&remap_thread,
	};
	int r, i;
	int rc1, rc2;
	unsigned long addr;
	void *ptr;
	char *page = malloc(PAGE_SIZE);
	int n_threads = sizeof(thread_funcs) / sizeof(*thread_funcs);
	pthread_t *threads = malloc(sizeof(pthread_t) * n_threads);
	pid_t pid;

	if (argc < 2) {
		fprintf(stderr, "usages: %s [filename]\n", argv[0]);
		exit(EXIT_FAILURE);
	}

	fd = open(argv[1], O_RDWR|O_CREAT, 0666);
	if (fd == -1)
		handle_error("open fd");

	for (i = 0; i < PAGES_PER_PMD; i++) {
		if (write(fd, page, PAGE_SIZE) != PAGE_SIZE)
			handle_error("write");
	}
	free(page);

	ptr = mmap(NULL, HPAGE_SIZE * 2, PROT_NONE, MAP_PRIVATE|MAP_ANON,
                   -1, 0);

	if (ptr == MAP_FAILED)
		handle_error("mmap anon");

	addr = (unsigned long)(ptr + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
	printf("starting...\n");

	ptr = mmap((void *)addr, HPAGE_SIZE, PROT_READ|PROT_WRITE,
		   MAP_SHARED|MAP_FIXED|MAP_POPULATE, fd, 0);

	if (ptr == MAP_FAILED)
		handle_error("mmap file - start");

	p = ptr;

	for (i = 0; i < n_threads; i++) {
		r = pthread_create(&threads[i], NULL, thread_funcs[i], NULL);
		if (r)
			handle_error("pthread_create");
	}

	// Run the flushing process in a different process, so msync() would
	// not require mmap_lock.
	pid = fork();
	if (pid == 0)
		flushing_process();
	flusher_pid = pid;

	sleep(60);

	stop = true;
	for (i = 0; i < n_threads; i++)
		pthread_join(threads[i], NULL);
	kill(flusher_pid, SIGTERM);
	printf("Finished without an error\n");

	exit(0);
}
  
Nadav Amit Oct. 30, 2022, 2:17 a.m. UTC | #42
On Oct 29, 2022, at 11:58 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, Oct 29, 2022 at 11:36 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> Anyway, I think the best documentation for "this is what I meant" is
>> simply the patch. Does this affect your PoC on your setup?
> 
> Here's a slightly cleaned up set with preliminary commit messages, and
> an explanation for why some of the 'struct page' declarations were
> moved around a bit in case you wondered about that part of the change
> in the full patch.
> 
> The end result should be the same, so if you already looked at the
> previous unified patch, never mind. But this one tries to make for a
> better patch series.
> 
> Still not tested in any way, shape, or form. I decided I wanted to
> send this one before booting into this and possibly blowing up ;^)

Running the PoC on Linux 6.0.6 with these patches caused the following splat
on the following line:

	WARN_ON_ONCE(!folio_test_locked(folio) && !folio_test_dirty(folio));

Although I did not hit the warning on the next line (!folio_buffers(folio)),
the commit log for the warning that actually triggered also leads to the
same patch of Jan Kara that is intended to check if a page is dirtied
without buffers (the scenario we are concerned about).


  Author: Jan Kara <jack@suse.cz>
  Date:   Thu Dec 1 11:46:40 2016 -0500

    ext4: warn when page is dirtied without buffers
    
    Warn when a page is dirtied without buffers (as that will likely lead to
    a crash in ext4_writepages()) or when it gets newly dirtied without the
    page being locked (as there is nothing that prevents buffers to get
    stripped just before calling set_page_dirty() under memory pressure). 



[  908.444806] ------------[ cut here ]------------
[  908.451010] WARNING: CPU: 16 PID: 2113 at fs/ext4/inode.c:3634 ext4_dirty_folio+0x74/0x80
[  908.460343] Modules linked in:
[  908.463856] CPU: 16 PID: 2113 Comm: poc Not tainted 6.0.6+ #21
[  908.470521] Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 2.13.0 05/14/2021
[  908.479202] RIP: 0010:ext4_dirty_folio+0x74/0x80
[  908.484489] Code: d5 ee ff 41 5c 41 5d 5d c3 cc cc cc cc be 08 00 00 00 4c 89 e7 e8 bc 03 e0 ff 4c 89 e7 e8 f4 f8 df ff 49 8b 04 24 a8 08 75 bc <0f> 0b eb b8 0f 0b eb c6 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41
[  908.505851] RSP: 0018:ffff88a1197df9a8 EFLAGS: 00010246
[  908.511826] RAX: 0057ffffc0002014 RBX: ffffffff83414b60 RCX: ffffffff818ceafc
[  908.519964] RDX: dffffc0000000000 RSI: 0000000000000008 RDI: ffffea00fffd9f40
[  908.528103] RBP: ffff88a1197df9b8 R08: 0000000000000001 R09: fffff9401fffb3e9
[  908.536239] R10: ffffea00fffd9f47 R11: fffff9401fffb3e8 R12: ffffea00fffd9f40
[  908.544376] R13: ffff88a087d368d8 R14: ffff88a1197dfb08 R15: ffff88a1197dfb00
[  908.552509] FS:  00007ff7caa68700(0000) GS:ffff8897edc00000(0000) knlGS:0000000000000000
[  908.561731] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  908.568299] CR2: 00007ff7caa67ed8 CR3: 00000020cc970001 CR4: 00000000001706e0
[  908.576437] Call Trace:
[  908.579252]  <TASK>
[  908.581683]  folio_mark_dirty+0x69/0xa0
[  908.586097]  set_page_dirty+0x2a/0x90
[  908.590301]  tlb_flush_mmu+0xc1/0x320
[  908.594517]  tlb_finish_mmu+0x49/0x190
[  908.598822]  unmap_region+0x1fa/0x250
[  908.603029]  ? anon_vma_compatible+0x120/0x120
[  908.608110]  ? __kasan_check_read+0x11/0x20
[  908.612926]  ? __vma_rb_erase+0x38a/0x610
[  908.617547]  __do_munmap+0x313/0x770
[  908.621669]  mmap_region+0x227/0xa50
[  908.625774]  ? down_read+0x320/0x320
[  908.629874]  ? lock_acquire+0x19a/0x450
[  908.634285]  ? __x64_sys_brk+0x4e0/0x4e0
[  908.641552]  ? thp_get_unmapped_area+0xca/0x150
[  908.649404]  ? cap_mmap_addr+0x1d/0x90
[  908.656373]  ? security_mmap_addr+0x3c/0x50
[  908.663781]  ? get_unmapped_area+0x173/0x1f0
[  908.671248]  ? arch_get_unmapped_area+0x330/0x330
[  908.679231]  do_mmap+0x3c3/0x610
[  908.685519]  vm_mmap_pgoff+0x177/0x230
[  908.692303]  ? randomize_page+0x70/0x70
[  908.699133]  ksys_mmap_pgoff+0x241/0x2a0
[  908.706011]  __x64_sys_mmap+0x8d/0xb0
[  908.712594]  do_syscall_64+0x3b/0x90
[  908.719090]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  908.727201] RIP: 0033:0x7ff7cbf868e6
[  908.733559] Code: 00 00 00 00 f3 0f 1e fa 41 f7 c1 ff 0f 00 00 75 2b 55 48 89 fd 53 89 cb 48 85 ff 74 37 41 89 da 48 89 ef b8 09 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 62 5b 5d c3 0f 1f 80 00 00 00 00 48 8b 05 71
[  908.759522] RSP: 002b:00007ff7caa67ea8 EFLAGS: 00000206 ORIG_RAX: 0000000000000009
[  908.770475] RAX: ffffffffffffffda RBX: 0000000000008011 RCX: 00007ff7cbf868e6
[  908.780919] RDX: 0000000000000003 RSI: 0000000000200000 RDI: 00007ff7cbc00000
[  908.791344] RBP: 00007ff7cbc00000 R08: 0000000000000003 R09: 0000000000000000
[  908.801751] R10: 0000000000008011 R11: 0000000000000206 R12: 00007ffed51cbc4e
[  908.812118] R13: 00007ffed51cbc4f R14: 00007ffed51cbc50 R15: 00007ff7caa67fc0
[  908.822523]  </TASK>
[  908.827213] irq event stamp: 4169
[  908.833101] hardirqs last  enabled at (4183): [<ffffffff8133f028>] __up_console_sem+0x68/0x80
[  908.844884] hardirqs last disabled at (4194): [<ffffffff8133f00d>] __up_console_sem+0x4d/0x80
[  908.856622] softirqs last  enabled at (4154): [<ffffffff83000430>] __do_softirq+0x430/0x5db
[  908.868167] softirqs last disabled at (4149): [<ffffffff8125fd89>] irq_exit_rcu+0xe9/0x120
[  908.879611] ---[ end trace 0000000000000000 ]---
  
Linus Torvalds Oct. 30, 2022, 6:19 p.m. UTC | #43
On Sat, Oct 29, 2022 at 7:17 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> Running the PoC on Linux 6.0.6 with these patches caused the following splat
> on the following line:
>
>         WARN_ON_ONCE(!folio_test_locked(folio) && !folio_test_dirty(folio));

Yeah, this is a sign of that "folio_mkclean() serializes with
folio_mark_dirty using rmap and the page table lock".

And page_remove_rmap() could *almost* be called later, but it does
have code that also depends on the page table lock, although it looks
like realistically that's just because it "knows" that means that
preemption is disabled, so it uses non-atomic statistics update.

I say "knows" in quotes, because that's what the comment says, but it
turns out that __mod_node_page_state() has to deal with CONFIG_RT
anyway and does that

        preempt_disable_nested();
        ...
        preempt_enable_nested();

thing.

And then it wants to see the vma, although that's actually only to see
if it's 'mlock'ed, so we could just squirrel that away.

So we *could* move page_remove_rmap() later into the TLB flush region,
but then we would have lost the page table lock anyway, so then
folio_mkclean() can come in regardless.

So that doesn't even help.

End result: we do want to do the page_set_dirty() and the
remove_rmap() under the paeg table lock, because it's what serializes
folio_mkclean().

And we'd _like_ to do the TLB flush before the remove_rmap(), but we
*really* don't want to do that for every page.

So my current gut feel is that we should just say that if you do
"MADV_DONTNEED or do a munmap() (which includes the "re-mmap() over
the area", while some other thread is still writing to that memory
region, you may lose writes.

IOW, just accept the behavior that Nadav's test-program tries to show,
and say "look, you're doing insane things, we've never given you any
other semantics, it's your problem" to any user program that does
that.

If a user program does MADV_DONTNEED on an area that it is actively
using at the same time in another thread, that sounds really really
bogus. Same goes doubly for 'munmap()' or over-mapping.

              Linus
  
Linus Torvalds Oct. 30, 2022, 6:51 p.m. UTC | #44
On Sun, Oct 30, 2022 at 11:19 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And we'd _like_ to do the TLB flush before the remove_rmap(), but we
> *really* don't want to do that for every page.

Hmm. I have yet another crazy idea.

We could keep the current placement of the TLB flush, to just before
we drop the page table lock.

And we could do all the things we do in 'page_remove_rmap()' right now
*except* for the mapcount stuff.

And only move the mapcount code to the page freeing stage.

Because all the rmap() walk synchronization really needs is that
'page->_mapcount' is still elevated, and if it is it will serialize
with the page table lock.

And it turns out that 'page_remove_rmap()' already treats the case we
care about differently, and all it does is

        lock_page_memcg(page);

        if (!PageAnon(page)) {
                page_remove_file_rmap(page, compound);
                goto out;
        }
       ...
out:
        unlock_page_memcg(page);

        munlock_vma_page(page, vma, compound);

for that case.

And that 'page_remove_file_rmap()' is literally the code that modifies
the _mapcount.

Annoyingly, this is all complicated by that 'compound' argument, but
that's always false in that zap_page_range() case.

So what we *could* do, is make a new version of page_remove_rmap(),
which is specialized for this case: no 'compound' argument (always
false), and doesn't call 'page_remove_file_rmap()', because we'll do
that for the !PageAnon(page) case later after the TLB flush.

That would keep the existing TLB flush logic, keep the existing 'mark
page dirty' and would just make sure that 'folio_mkclean()' ends up
being serialized with the TLB flush simply because it will take the
page table lock because we delay the '._mapcount' update until
afterwards.

Annoyingly, the organization of 'page_remove_rmap()' is a bit ugly,
and we have several other callers that want the existing logic, so
while the above sounds conceptually simple, I think the patch would be
a bit messy.

                  Linus
  
Nadav Amit Oct. 30, 2022, 7:34 p.m. UTC | #45
On Oct 30, 2022, at 11:19 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> And page_remove_rmap() could *almost* be called later, but it does
> have code that also depends on the page table lock, although it looks
> like realistically that's just because it "knows" that means that
> preemption is disabled, so it uses non-atomic statistics update.
> 
> I say "knows" in quotes, because that's what the comment says, but it
> turns out that __mod_node_page_state() has to deal with CONFIG_RT
> anyway and does that
> 
>        preempt_disable_nested();
>        ...
>        preempt_enable_nested();
> 
> thing.
> 
> And then it wants to see the vma, although that's actually only to see
> if it's 'mlock'ed, so we could just squirrel that away.
> 
> So we *could* move page_remove_rmap() later into the TLB flush region,
> but then we would have lost the page table lock anyway, so then
> folio_mkclean() can come in regardless.
> 
> So that doesn't even help.

Well, if you combine it with the per-page-table stale TLB detection
mechanism that I proposed, I think this could work.

Reminder (feel free to skip): you would have per-mm “completed
TLB-generation” in addition to the current one, which would be renamed to
“pending TLB-generation”. Whenever you update the page-tables in a manner
that might require a TLB flush, you would increase the “pending
TLB-generation” and save the pending TLB-generation in the page-table’s
page-struct. All of that is done once under the page-table lock. When you
finish a TLB-flush, you update the “completed TLB-generation”.

Then on page_vma_mkclean_one(), you would check if the page-table’s
TLB-generation is greater than the completed TLB-generation, which would
indicate that TLB entries for PTEs in this table might be stale. In that
case you would just flush the TLB. [ Of course you can instead just flush if
mm_tlb_flush_pending(), but nobody likes this mechanism that has a very
coarse granularity, and therefore can lead to many unnecessary TLB flushes.
]

Indeed, there would be potentially some overhead in extreme cases, since
mm's TLB-generation since its cache is already highly-contended in extreme
cases. But I think it worth it to have simple logic that allows to reason
about correctness.

My intuition is that although you appear to be right that we can just mark
this case as “extreme case nobody cares about”, it might have now or in the
future some other implications that are hard to predict and prevent.
  
Linus Torvalds Oct. 30, 2022, 10:47 p.m. UTC | #46
On Sun, Oct 30, 2022 at 11:51 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> We could keep the current placement of the TLB flush, to just before
> we drop the page table lock.
>
> And we could do all the things we do in 'page_remove_rmap()' right now
> *except* for the mapcount stuff.
>
> And only move the mapcount code to the page freeing stage.

So I actually have a three-commit series to do the rmap
simplification, but let me just post the end result of that series,
because the end result is actually smaller than the individual commits
(I did it as three incremental commits just to make it more obvious to
me how to get to that end result).

The three commits end up being

      mm: introduce simplified versions of 'page_remove_rmap()'
      mm: inline simpler case of page_remove_file_rmap()
      mm: re-unify the simplified page_zap_*_rmap() function

and the end result of them is this attached patch.

I'm *claiming* that the attached patch is semantically identical to
what we do before it, just _hugely_ simplified.

Basically, that new 'page_zap_pte_rmap()' does the same things that
'page_remove_rmap()' did, except it is limited to only last-level PTE
entries (and that munlock_vma_page() has to be called separately).

The simplification comes from 'compound' being false, from it always
being about small pages, and from the atomic mapcount decrement having
been moved outside the memcg lock, since it is independent of it.

Anyway, this simplification patch basically means that the *next* step
could be to just move that ipage_zap_pte_rmap()' after the TLB flush,
and now it's trivial and no longer scary.

I did *not* do that yet, because it still needs that "encoded_page[]"
array - except now it doesn't encode the 'dirty' bit, now it would
encode the 'do a page->_mapcount decrement' bit.

I didn't do that part, because needed to do the rc3 release, plus I'd
like to have somebody look at this introductory patch first.

              Linus
  
Linus Torvalds Oct. 31, 2022, 1:47 a.m. UTC | #47
On Sun, Oct 30, 2022 at 3:47 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Anyway, this simplification patch basically means that the *next* step
> could be to just move that ipage_zap_pte_rmap()' after the TLB flush,
> and now it's trivial and no longer scary.

So here's that next patch: it's patch 4/4 in this series.

Patches 1-3 are the patches that I already sent out as one (smaller)
combined patch. I'm including them here as the whole series in case
somebody else wants to follow along with how I did the simplified
version of page_remove_rmap().

So if you already looked at the previous patch and don't have any
questions about that one, you could just apply PATCH 4/4 on top of
that one.

Or you can do the whole series with commit messages and (hopefully)
clearer individual steps to the simplified version of
page_remove_rmap().

I haven't actually tested 4/4 yet, so this is yet another of my "I
think this should work" patches.

The reason I haven't actually tested it is partly because I never
recreated the original problem Navav reported, and partly because the
meat of patch 4/4 is just the same "encode an extra flag bit in the
low bit of the page pointer" that I _did_ test, just doing the "remove
rmap" instead of "set dirty".

In other words, I *think* this should make Nadav's test-case happy,
and avoid the warning he saw.

If somebody wants a git branch, I guess I can push that too, but it
would be a non-stable branch only for testing.

Also, it's worth noting that zap_pte_range() does this sanity test:

                        if (unlikely(page_mapcount(page) < 0))
                                print_bad_pte(vma, addr, ptent, page);

and that is likely worthless now (because it hasn't actually
decremented the mapcount yet). I didn't remove it, because I wasn't
sure which option was best:

 (a) just remove it entirely

 (b) change the "< 0" to "<= 0"

 (c) move it to clean_and_free_pages_and_swap_cache() that actually
does the page_zap_pte_rmap() now.

so I'm in no way claiming this series is any kind of final word, but I
do mostly like how the code ended up looking.

Now, this whole "remove rmap after flush" would probably make sense
for some of the largepage cases too, and there's room for another bit
in there (or more, if you align 'struct page') and that whole code
could use page size hints etc too. But I suspect that the main case we
really care is for individual small pages, just because that's the
case where I think it would be much worse to do any fancy TLB
tracking. The largepage cases presumably aren't as critical, since
there by definition is fewer of those.

Comments?

                 Linus
  
Nadav Amit Oct. 31, 2022, 4:09 a.m. UTC | #48
On Oct 30, 2022, at 6:47 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> The reason I haven't actually tested it is partly because I never
> recreated the original problem Navav reported, and partly because the
> meat of patch 4/4 is just the same "encode an extra flag bit in the
> low bit of the page pointer" that I _did_ test, just doing the "remove
> rmap" instead of "set dirty".
> 
> In other words, I *think* this should make Nadav's test-case happy,
> and avoid the warning he saw.

I am sorry for not managing to make it reproducible on your system. The fact
that you did not get the warning that I got means that it is not a
hardware-TLB differences issue (at least not only that), but the race does
not happen on your system (assuming you used ext4 on the BRD).

Anyhow, I ran the tests with the patches and there are no failures.
Thanks for addressing this issue.

I understand from the code that you decided to drop the deferring of
set_page_dirty(), which could - at least for the munmap case (where
mmap_lock is taken for write) - prevent the need for “force_flush” and
potentially save TLB flushes.

I was just wondering whether the reason for that is that you wanted
to have small backportable and conservative patches, or whether you
changed your mind about it.
  
Nadav Amit Oct. 31, 2022, 4:55 a.m. UTC | #49
On Oct 30, 2022, at 9:09 PM, Nadav Amit <nadav.amit@gmail.com> wrote:

> I understand from the code that you decided to drop the deferring of
> set_page_dirty(), which could - at least for the munmap case (where
> mmap_lock is taken for write) - prevent the need for “force_flush” and
> potentially save TLB flushes.
> 
> I was just wondering whether the reason for that is that you wanted
> to have small backportable and conservative patches, or whether you
> changed your mind about it.

Please ignore this silly question. I understand - the buffers might still be
dropped.
  
Linus Torvalds Oct. 31, 2022, 5 a.m. UTC | #50
On Sun, Oct 30, 2022 at 9:09 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> I am sorry for not managing to make it reproducible on your system.

Heh, that's very much *not* your fault. Honestly, I didn't try very
much or very hard.

I felt like I understood the problem cause sufficiently that I didn't
really need to have a reproducer, and I much prefer to just think the
solution through and try to make it really robust.

Or, put another way - I'm just lazy.

> Anyhow, I ran the tests with the patches and there are no failures.

Lovely.

> Thanks for addressing this issue.

Well, I'm not sure the issue is "addressed" yet. I think the patch
series is likely the right thing to do, but others may disagree with
this approach.

And regardless of that, this still leaves some questions open.

 (a) there's the issue of s390, which does its own version of
__tlb_remove_page_size.

I *think* s390 basically does the TLB flush synchronously in
zap_pte_range(), and that it would be for that reason trivial to just
add that 'flags' argument to the s390 __tlb_remove_page_size(), and
make it do

        if (flags & TLB_ZAP_RMAP)
                page_zap_pte_rmap(page);

at the top synchronously too. But some s390 person would need to look at it.

I *think* the issue is literally that straightforward and not a big
deal, but it's probably not even worth bothering the s390 people until
VM people have decided "yes, this makes sense".

 (b) the issue I mentioned with the currently useless
"page_mapcount(page) < 0" test with that patch.

Again, this is mostly just janitorial stuff associated with that patch series.

 (c) whether to worry about back-porting

I don't *think* this is worth backporting, but if it causes other
changes, then maybe..

> I understand from the code that you decided to drop the deferring of
> set_page_dirty(), which could - at least for the munmap case (where
> mmap_lock is taken for write) - prevent the need for “force_flush” and
> potentially save TLB flushes.

I really liked my dirty patch, but your warning case really made it
obvious that it was just broken.

The thing is, moving the "set_page_dirty()" to later is really nice,
and really makes a *lot* of sense from a conceptual standpoint: only
after that TLB flush do we really have no more people who can dirty
it.

BUT.

Even if we just used another bit for in the array for "dirty", and did
the set_page_dirty() later (but still before getting rid of the rmap),
that wouldn't actually *work*.

Why? Because the race with folio_mkclean() would just come back. Yes,
now we'd have the rmap data, so mkclean would be forced to serialize
with the page table lock.

But if we get rid of the "force_flush" for the dirty bit, that
serialization won't help, simply because we've *dropped* the page
table lock before we actually then do the set_page_dirty() again.

So the mkclean serialization needs *both* the late rmap dropping _and_
the page table lock being kept.

So deferring set_page_dirty() is conceptually the right thing to do
from a pure "just track dirty bit" standpoint, but it doesn't work
with the way we currently expect mkclean to work.

> I was just wondering whether the reason for that is that you wanted
> to have small backportable and conservative patches, or whether you
> changed your mind about it.

See above: I still think it would be the right thing in a perfect world.

But with the current folio_mkclean(), we just can't do it. I had
completely forgotten / repressed that horror-show.

So the current ordering rules are basically that we need to do
set_page_dirty() *and* we need to flush the TLB's before dropping the
page table lock. That's what gets us serialized with "mkclean".

The whole "drop rmap" can then happen at any later time, the only
important thing was that it was kept to at least after the TLB flush.

We could do the rmap drop still inside the page table lock, but
honestly, it just makes more sense to do as we free the batched pages
anyway.

Am I missing something still?

And again, this is about our horrid serialization between
folio_mkclean and set_page_dirty(). It's related to how GUP +
set_page_dirty() is also fundamentally problematic. So that dirty bit
situation *may* change if the rules for folio_mkclean() change...

                Linus
  
Peter Zijlstra Oct. 31, 2022, 9:28 a.m. UTC | #51
On Sun, Oct 30, 2022 at 03:47:36PM -0700, Linus Torvalds wrote:

>  include/linux/rmap.h |  1 +
>  mm/memory.c          |  3 ++-
>  mm/rmap.c            | 24 ++++++++++++++++++++++++
>  3 files changed, 27 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index bd3504d11b15..f62af001707c 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -196,6 +196,7 @@ void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
>  		unsigned long address);
>  void page_add_file_rmap(struct page *, struct vm_area_struct *,
>  		bool compound);
> +void page_zap_pte_rmap(struct page *);
>  void page_remove_rmap(struct page *, struct vm_area_struct *,
>  		bool compound);
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index f88c351aecd4..c893f5ffc5a8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1452,8 +1452,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  				    likely(!(vma->vm_flags & VM_SEQ_READ)))
>  					mark_page_accessed(page);
>  			}
> +			page_zap_pte_rmap(page);
> +			munlock_vma_page(page, vma, false);
>  			rss[mm_counter(page)]--;
> -			page_remove_rmap(page, vma, false);
>  			if (unlikely(page_mapcount(page) < 0))
>  				print_bad_pte(vma, addr, ptent, page);
>  			if (unlikely(__tlb_remove_page(tlb, page))) {
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 2ec925e5fa6a..28b51a31ebb0 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1412,6 +1412,30 @@ static void page_remove_anon_compound_rmap(struct page *page)
>  		__mod_lruvec_page_state(page, NR_ANON_MAPPED, -nr);
>  }
>  
> +/**
> + * page_zap_pte_rmap - take down a pte mapping from a page
> + * @page:	page to remove mapping from
> + *
> + * This is the simplified form of page_remove_rmap(), that only
> + * deals with last-level pages, so 'compound' is always false,
> + * and the caller does 'munlock_vma_page(page, vma, compound)'
> + * separately.
> + *
> + * This allows for a much simpler calling convention and code.
> + *
> + * The caller holds the pte lock.
> + */
> +void page_zap_pte_rmap(struct page *page)
> +{

One could consider adding something like:

#ifdef USE_SPLIT_PTE_PTLOCKS
	lockdep_assert_held(ptlock_ptr(page))
#endif


> +	if (!atomic_add_negative(-1, &page->_mapcount))
> +		return;
> +
> +	lock_page_memcg(page);
> +	__dec_lruvec_page_state(page,
> +		PageAnon(page) ? NR_ANON_MAPPED : NR_FILE_MAPPED);
> +	unlock_page_memcg(page);
> +}

Took me a little while, but yes, .compound=false seems to reduce to
this.
  
Peter Zijlstra Oct. 31, 2022, 9:36 a.m. UTC | #52
On Sun, Oct 30, 2022 at 06:47:23PM -0700, Linus Torvalds wrote:

> Also, it's worth noting that zap_pte_range() does this sanity test:
> 
>                         if (unlikely(page_mapcount(page) < 0))
>                                 print_bad_pte(vma, addr, ptent, page);
> 
> and that is likely worthless now (because it hasn't actually
> decremented the mapcount yet). I didn't remove it, because I wasn't
> sure which option was best:
> 
>  (a) just remove it entirely
> 
>  (b) change the "< 0" to "<= 0"
> 
>  (c) move it to clean_and_free_pages_and_swap_cache() that actually
> does the page_zap_pte_rmap() now.

I'm leaning towards (c); simply because the error case is so terrifying
I feel we should check for it (and I do have vague memories of us
actually hitting something like this in the very distant past).
  
Peter Zijlstra Oct. 31, 2022, 9:39 a.m. UTC | #53
On Sun, Oct 30, 2022 at 06:47:23PM -0700, Linus Torvalds wrote:

>  - there's no 'compund' argument any more, because it was always false
> 
>  - we can get rid of the tests for 'compund' and 'PageAnon()' since we
>    know that they are

You're making up new words ;-)

  s/compund/compound/g
  
Peter Zijlstra Oct. 31, 2022, 9:46 a.m. UTC | #54
On Sun, Oct 30, 2022 at 06:47:23PM -0700, Linus Torvalds wrote:

> diff --git a/mm/memory.c b/mm/memory.c
> index ba1d08a908a4..c893f5ffc5a8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1451,9 +1451,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  				if (pte_young(ptent) &&
>  				    likely(!(vma->vm_flags & VM_SEQ_READ)))
>  					mark_page_accessed(page);
> +			}
> +			page_zap_pte_rmap(page);
>  			munlock_vma_page(page, vma, false);
>  			rss[mm_counter(page)]--;
>  			if (unlikely(page_mapcount(page) < 0))
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 69de6c833d5c..28b51a31ebb0 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1413,47 +1413,26 @@ static void page_remove_anon_compound_rmap(struct page *page)
>  }
>  
>  /**
> + * page_zap_pte_rmap - take down a pte mapping from a page
>   * @page:	page to remove mapping from
>   *
> + * This is the simplified form of page_remove_rmap(), that only
> + * deals with last-level pages, so 'compound' is always false,
> + * and the caller does 'munlock_vma_page(page, vma, compound)'
> + * separately.
>   *
> + * This allows for a much simpler calling convention and code.
>   *
>   * The caller holds the pte lock.
>   */
> +void page_zap_pte_rmap(struct page *page)
>  {
>  	if (!atomic_add_negative(-1, &page->_mapcount))
>  		return;
>  
>  	lock_page_memcg(page);
> +	__dec_lruvec_page_state(page,
> +		PageAnon(page) ? NR_ANON_MAPPED : NR_FILE_MAPPED);
>  	unlock_page_memcg(page);
>  }

So we *could* use atomic_add_return() and include the print_bad_pte()
thing in this function -- however that turns the whole thing into a mess
again :/
  
Nadav Amit Oct. 31, 2022, 3:43 p.m. UTC | #55
On Oct 30, 2022, at 10:00 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:

> So the current ordering rules are basically that we need to do
> set_page_dirty() *and* we need to flush the TLB's before dropping the
> page table lock. That's what gets us serialized with "mkclean”.

I understand. I am still not sure whether ordering the set_page_dirty() and
dropping the mapcount reference cannot suffice for the reclaim logic not to
free the buffers if the page is dirtied.

According to the code, shrink_page_list() first checks for folio_mapped()
and then for folio_test_dirty() to check whether pageout() is necessary.
IIUC, the buffers are not dropped up to this point and set_page_dirty()
would always set the page-struct dirty bit.

IOW: In shrink_page_list(), when we decide on whether to pageout(), we
should see whether the page is dirty (give or take smp_rmb()).

But this is an optimization and I do not know all the cases in which buffers
might be dropped. My intuition says that they cannot be dropped while
mapcount != 0, but I need to further explore it.
  
Linus Torvalds Oct. 31, 2022, 5:19 p.m. UTC | #56
On Mon, Oct 31, 2022 at 2:29 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sun, Oct 30, 2022 at 03:47:36PM -0700, Linus Torvalds wrote:
>
> > + * This is the simplified form of page_remove_rmap(), that only
> > + * deals with last-level pages, so 'compound' is always false,
> > + * and the caller does 'munlock_vma_page(page, vma, compound)'
> > + * separately.
> > + *
> > + * This allows for a much simpler calling convention and code.
> > + *
> > + * The caller holds the pte lock.
> > + */
> > +void page_zap_pte_rmap(struct page *page)
> > +{
>
> One could consider adding something like:
>
> #ifdef USE_SPLIT_PTE_PTLOCKS
>         lockdep_assert_held(ptlock_ptr(page))
> #endif

Yes, except that the page lock goes away in the next few patches and
gets replaced by just using the safe dec_lruvec_page_state() instead,
so it's not really worth it.

> > +     if (!atomic_add_negative(-1, &page->_mapcount))
> > +             return;
> > +
> > +     lock_page_memcg(page);
> > +     __dec_lruvec_page_state(page,
> > +             PageAnon(page) ? NR_ANON_MAPPED : NR_FILE_MAPPED);
> > +     unlock_page_memcg(page);
> > +}
>
> Took me a little while, but yes, .compound=false seems to reduce to
> this.

Yeah - it's why I kept that thing as three separate patches, because
even if each of the patches isn't "immediately obvious", you can at
least go back and follow along and see what I did.

The *full* simplification end result just looks like magic.

Admittedly, I think a lot of that "looks like magic" is because the
rmap code has seriously cruftified over the years. We had that time
when we actually

Go back a decade, and we literally used to do pretty much exactly what
the simplified form does. The transformation to complexity hell starts
with commit 89c06bd52fb9 ("memcg: use new logic for page stat
accounting"), but just look at what it looked like before that:

  git show 89c06bd52fb9^:mm/rmap.c

gets you the state back when it was simple. And look at what it did:

        void page_remove_rmap(struct page *page)
        {
                /* page still mapped by someone else? */
                if (!atomic_add_negative(-1, &page->_mapcount))
                        return;
                ... statistics go here ..

so in the end the simplified form of this page_zap_pte_rmap() really
isn't *that* surprising.

In fact, that old code handled PageHuge() fairly naturally too, and
almost all the mess comes from the memcg accounting - and locking -
changes.

And I actually really wanted to one step further, and try to then
batch up the page state accounting too. It's kind of stupid how we do
all that memcg locking for each page, and with this new setup we have
one nice array of pages that we *could* try to just batch things with.

The pages in normal situations *probably* mostly all share the same
memcg and node, so just optimistically trying to do something like "as
long as it's the same memcg as last time, just keep the lock".

Instead of locking and unlocking for every single page.

But just looking at it exhausted me enough that I don't think I'll go there.

Put another way: after this series, it's not the 'rmap' code that
makes me go Ugh, it's the memcg tracking..

(But the hugepage rmap code is incredibly nasty stuff still, and I
think the whole 'compound=true' case would be better with somebody
taking a look at that case too, but that somebody won't be me).

                  Linus
  
Linus Torvalds Oct. 31, 2022, 5:22 p.m. UTC | #57
On Mon, Oct 31, 2022 at 2:39 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> You're making up new words ;-)
>
>   s/compund/compound/g

Oops.

Fixed. Thanks,

              Linus
  
Linus Torvalds Oct. 31, 2022, 5:28 p.m. UTC | #58
On Mon, Oct 31, 2022 at 2:36 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> >  (c) move it to clean_and_free_pages_and_swap_cache() that actually
> > does the page_zap_pte_rmap() now.
>
> I'm leaning towards (c); simply because the error case is so terrifying
> I feel we should check for it (and I do have vague memories of us
> actually hitting something like this in the very distant past).

Ok. At that point we no longer have the pte or the virtual address, so
it's not goign to be exactly the same debug output.

But I think it ends up being fairly natural to do

        VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page);

instead, and I've fixed that last patch up to do that.

                      Linus
  
Linus Torvalds Oct. 31, 2022, 5:32 p.m. UTC | #59
On Mon, Oct 31, 2022 at 8:43 AM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> On Oct 30, 2022, at 10:00 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > So the current ordering rules are basically that we need to do
> > set_page_dirty() *and* we need to flush the TLB's before dropping the
> > page table lock. That's what gets us serialized with "mkclean”.
>
> I understand. I am still not sure whether ordering the set_page_dirty() and
> dropping the mapcount reference cannot suffice for the reclaim logic not to
> free the buffers if the page is dirtied.

Ahh, ok.

> According to the code, shrink_page_list() first checks for folio_mapped()
> and then for folio_test_dirty() to check whether pageout() is necessary.
> IIUC, the buffers are not dropped up to this point and set_page_dirty()
> would always set the page-struct dirty bit.
>
> IOW: In shrink_page_list(), when we decide on whether to pageout(), we
> should see whether the page is dirty (give or take smp_rmb()).
>
> But this is an optimization and I do not know all the cases in which buffers
> might be dropped. My intuition says that they cannot be dropped while
> mapcount != 0, but I need to further explore it.

Yes, the above sounds like one fairly good way out of the whole forced
TLB flushing for dirty pages, while still keeping the filesystem code
happy.

But at this point it's an independent issue.

And I really would like any fix like that to also fix the whole issue
with GUP too, which seems related.

           Linus
  
Linus Torvalds Oct. 31, 2022, 6:43 p.m. UTC | #60
Updated subject line, and here's the link to the original discussion
for new people:

    https://lore.kernel.org/all/B88D3073-440A-41C7-95F4-895D3F657EF2@gmail.com/

On Mon, Oct 31, 2022 at 10:28 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Ok. At that point we no longer have the pte or the virtual address, so
> it's not going to be exactly the same debug output.
>
> But I think it ends up being fairly natural to do
>
>         VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page);
>
> instead, and I've fixed that last patch up to do that.

Ok, so I've got a fixed set of patches based on the feedback from
PeterZ, and also tried to do the s390 updates for this blindly, and
pushed them out into a git branch:

    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?h=mmu_gather-race-fix

If people really want to see the patches in email again, I can do
that, but most of you already have, and the changes are either trivial
fixes or the s390 updates.

For the s390 people that I've now added to the participant list maybe
the git tree is fine - and the fundamental explanation of the problem
is in that top-most commit (with the three preceding commits being
prep-work). Or that link to the thread about this all.

That top-most commit is also where I tried to fix things up for s390
that uses its own non-gathering TLB flush due to
CONFIG_MMU_GATHER_NO_GATHER.

NOTE NOTE NOTE! Unlike my regular git branch, this one may end up
rebased etc for further comments and fixes. So don't consider that
stable, it's still more of an RFC branch.

At a minimum I'll update it with Ack's etc, assuming I get those, and
my s390 changes are entirely untested and probably won't work.

As far as I can tell, s390 doesn't actually *have* the problem that
causes this change, because of its synchronous TLB flush, but it
obviously needs to deal with the change of rmap zapping logic.

Also added a few people who are explicitly listed as being mmu_gather
maintainers. Maybe people saw the discussion on the linux-mm list, but
let's make it explicit.

Do people have any objections to this approach, or other suggestions?

I do *not* consider this critical, so it's a "queue for 6.2" issue for me.

It probably makes most sense to queue in the -MM tree (after the thing
is acked and people agree), but I can keep that branch alive too and
just deal with it all myself as well.

Anybody?

                     Linus
  
Christian Borntraeger Nov. 2, 2022, 9:14 a.m. UTC | #61
Am 31.10.22 um 19:43 schrieb Linus Torvalds:
> Updated subject line, and here's the link to the original discussion
> for new people:
> 
>      https://lore.kernel.org/all/B88D3073-440A-41C7-95F4-895D3F657EF2@gmail.com/
> 
> On Mon, Oct 31, 2022 at 10:28 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Ok. At that point we no longer have the pte or the virtual address, so
>> it's not going to be exactly the same debug output.
>>
>> But I think it ends up being fairly natural to do
>>
>>          VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page);
>>
>> instead, and I've fixed that last patch up to do that.
> 
> Ok, so I've got a fixed set of patches based on the feedback from
> PeterZ, and also tried to do the s390 updates for this blindly, and
> pushed them out into a git branch:
> 
>      https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?h=mmu_gather-race-fix
> 
> If people really want to see the patches in email again, I can do
> that, but most of you already have, and the changes are either trivial
> fixes or the s390 updates.
> 
> For the s390 people that I've now added to the participant list maybe
> the git tree is fine - and the fundamental explanation of the problem
> is in that top-most commit (with the three preceding commits being
> prep-work). Or that link to the thread about this all.

Adding Gerald.

> 
> That top-most commit is also where I tried to fix things up for s390
> that uses its own non-gathering TLB flush due to
> CONFIG_MMU_GATHER_NO_GATHER.
> 
> NOTE NOTE NOTE! Unlike my regular git branch, this one may end up
> rebased etc for further comments and fixes. So don't consider that
> stable, it's still more of an RFC branch.
> 
> At a minimum I'll update it with Ack's etc, assuming I get those, and
> my s390 changes are entirely untested and probably won't work.
> 
> As far as I can tell, s390 doesn't actually *have* the problem that
> causes this change, because of its synchronous TLB flush, but it
> obviously needs to deal with the change of rmap zapping logic.
> 
> Also added a few people who are explicitly listed as being mmu_gather
> maintainers. Maybe people saw the discussion on the linux-mm list, but
> let's make it explicit.
> 
> Do people have any objections to this approach, or other suggestions?
> 
> I do *not* consider this critical, so it's a "queue for 6.2" issue for me.
> 
> It probably makes most sense to queue in the -MM tree (after the thing
> is acked and people agree), but I can keep that branch alive too and
> just deal with it all myself as well.
> 
> Anybody?
> 
>                       Linus

It certainly needs a build fix for s390:


In file included from kernel/sched/core.c:78:
./arch/s390/include/asm/tlb.h: In function '__tlb_remove_page_size':
./arch/s390/include/asm/tlb.h:50:17: error: implicit declaration of function 'page_zap_pte_rmap' [-Werror=implicit-function-declaration]
    50 |                 page_zap_pte_rmap(page);
       |                 ^~~~~~~~~~~~~~~~~
  
Christian Borntraeger Nov. 2, 2022, 9:23 a.m. UTC | #62
Am 02.11.22 um 10:14 schrieb Christian Borntraeger:
> Am 31.10.22 um 19:43 schrieb Linus Torvalds:
>> Updated subject line, and here's the link to the original discussion
>> for new people:
>>
>>      https://lore.kernel.org/all/B88D3073-440A-41C7-95F4-895D3F657EF2@gmail.com/
>>
>> On Mon, Oct 31, 2022 at 10:28 AM Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>>
>>> Ok. At that point we no longer have the pte or the virtual address, so
>>> it's not going to be exactly the same debug output.
>>>
>>> But I think it ends up being fairly natural to do
>>>
>>>          VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page);
>>>
>>> instead, and I've fixed that last patch up to do that.
>>
>> Ok, so I've got a fixed set of patches based on the feedback from
>> PeterZ, and also tried to do the s390 updates for this blindly, and
>> pushed them out into a git branch:
>>
>>      https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?h=mmu_gather-race-fix
>>
>> If people really want to see the patches in email again, I can do
>> that, but most of you already have, and the changes are either trivial
>> fixes or the s390 updates.
>>
>> For the s390 people that I've now added to the participant list maybe
>> the git tree is fine - and the fundamental explanation of the problem
>> is in that top-most commit (with the three preceding commits being
>> prep-work). Or that link to the thread about this all.
> 
> Adding Gerald.

now the correct Gerald....
> 
>>
>> That top-most commit is also where I tried to fix things up for s390
>> that uses its own non-gathering TLB flush due to
>> CONFIG_MMU_GATHER_NO_GATHER.
>>
>> NOTE NOTE NOTE! Unlike my regular git branch, this one may end up
>> rebased etc for further comments and fixes. So don't consider that
>> stable, it's still more of an RFC branch.
>>
>> At a minimum I'll update it with Ack's etc, assuming I get those, and
>> my s390 changes are entirely untested and probably won't work.
>>
>> As far as I can tell, s390 doesn't actually *have* the problem that
>> causes this change, because of its synchronous TLB flush, but it
>> obviously needs to deal with the change of rmap zapping logic.
>>
>> Also added a few people who are explicitly listed as being mmu_gather
>> maintainers. Maybe people saw the discussion on the linux-mm list, but
>> let's make it explicit.
>>
>> Do people have any objections to this approach, or other suggestions?
>>
>> I do *not* consider this critical, so it's a "queue for 6.2" issue for me.
>>
>> It probably makes most sense to queue in the -MM tree (after the thing
>> is acked and people agree), but I can keep that branch alive too and
>> just deal with it all myself as well.
>>
>> Anybody?
>>
>>                       Linus
> 
> It certainly needs a build fix for s390:
> 
> 
> In file included from kernel/sched/core.c:78:
> ./arch/s390/include/asm/tlb.h: In function '__tlb_remove_page_size':
> ./arch/s390/include/asm/tlb.h:50:17: error: implicit declaration of function 'page_zap_pte_rmap' [-Werror=implicit-function-declaration]
>     50 |                 page_zap_pte_rmap(page);
>        |                 ^~~~~~~~~~~~~~~~~
  
Peter Zijlstra Nov. 2, 2022, 12:45 p.m. UTC | #63
On Mon, Oct 31, 2022 at 11:43:30AM -0700, Linus Torvalds wrote:

>     https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?h=mmu_gather-race-fix


Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
  
Linus Torvalds Nov. 2, 2022, 5:55 p.m. UTC | #64
On Wed, Nov 2, 2022 at 2:15 AM Christian Borntraeger
<borntraeger@linux.ibm.com> wrote:
>
> It certainly needs a build fix for s390:
>
> In file included from kernel/sched/core.c:78:
> ./arch/s390/include/asm/tlb.h: In function '__tlb_remove_page_size':
> ./arch/s390/include/asm/tlb.h:50:17: error: implicit declaration of function 'page_zap_pte_rmap' [-Werror=implicit-function-declaration]
>     50 |                 page_zap_pte_rmap(page);
>        |                 ^~~~~~~~~~~~~~~~~

Hmm. I'm not sure if I can add a

   #include <linux/rmap.h>

to that s390 asm header file without causing more issues.

The minimal damage would probably be to duplicate the declaration of
page_zap_pte_rmap() in the s390 asm/tlb.h header where it is used.

Not pretty to have two different declarations of that thing, but
anything that then includes both <asm/tlb.h> and <linux/rmap.h> (which
is much of mm) would then verify the consistency of  them.

So I'll do that minimal fix and update that branch, but if s390 people
end up having a better fix, please holler.

                Linus
  
Linus Torvalds Nov. 2, 2022, 6:28 p.m. UTC | #65
On Wed, Nov 2, 2022 at 10:55 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So I'll do that minimal fix and update that branch, but if s390 people
> end up having a better fix, please holler.

I've updated the branch with that, so hopefully s390 builds now.

I also fixed a typo in the commit message and added Peter's ack. Other
than that it's all the same it was before.

                 Linus
  
Gerald Schaefer Nov. 2, 2022, 10:29 p.m. UTC | #66
On Wed, 2 Nov 2022 10:55:10 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, Nov 2, 2022 at 2:15 AM Christian Borntraeger
> <borntraeger@linux.ibm.com> wrote:
> >
> > It certainly needs a build fix for s390:
> >
> > In file included from kernel/sched/core.c:78:
> > ./arch/s390/include/asm/tlb.h: In function '__tlb_remove_page_size':
> > ./arch/s390/include/asm/tlb.h:50:17: error: implicit declaration of function 'page_zap_pte_rmap' [-Werror=implicit-function-declaration]
> >     50 |                 page_zap_pte_rmap(page);
> >        |                 ^~~~~~~~~~~~~~~~~
> 
> Hmm. I'm not sure if I can add a
> 
>    #include <linux/rmap.h>
> 
> to that s390 asm header file without causing more issues.
> 
> The minimal damage would probably be to duplicate the declaration of
> page_zap_pte_rmap() in the s390 asm/tlb.h header where it is used.
> 
> Not pretty to have two different declarations of that thing, but
> anything that then includes both <asm/tlb.h> and <linux/rmap.h> (which
> is much of mm) would then verify the consistency of  them.
> 
> So I'll do that minimal fix and update that branch, but if s390 people
> end up having a better fix, please holler.

It compiles now with your duplicate declaration, but adding the #include
also did not cause any damage, so that should also be OK.
  
Gerald Schaefer Nov. 2, 2022, 10:31 p.m. UTC | #67
On Mon, 31 Oct 2022 11:43:30 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Updated subject line, and here's the link to the original discussion
> for new people:
> 
>     https://lore.kernel.org/all/B88D3073-440A-41C7-95F4-895D3F657EF2@gmail.com/
> 
> On Mon, Oct 31, 2022 at 10:28 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Ok. At that point we no longer have the pte or the virtual address, so
> > it's not going to be exactly the same debug output.
> >
> > But I think it ends up being fairly natural to do
> >
> >         VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page);
> >
> > instead, and I've fixed that last patch up to do that.
> 
> Ok, so I've got a fixed set of patches based on the feedback from
> PeterZ, and also tried to do the s390 updates for this blindly, and
> pushed them out into a git branch:
> 
>     https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?h=mmu_gather-race-fix
> 
> If people really want to see the patches in email again, I can do
> that, but most of you already have, and the changes are either trivial
> fixes or the s390 updates.
> 
> For the s390 people that I've now added to the participant list maybe
> the git tree is fine - and the fundamental explanation of the problem
> is in that top-most commit (with the three preceding commits being
> prep-work). Or that link to the thread about this all.
> 
> That top-most commit is also where I tried to fix things up for s390
> that uses its own non-gathering TLB flush due to
> CONFIG_MMU_GATHER_NO_GATHER.
> 
> NOTE NOTE NOTE! Unlike my regular git branch, this one may end up
> rebased etc for further comments and fixes. So don't consider that
> stable, it's still more of an RFC branch.
> 
> At a minimum I'll update it with Ack's etc, assuming I get those, and
> my s390 changes are entirely untested and probably won't work.
> 
> As far as I can tell, s390 doesn't actually *have* the problem that
> causes this change, because of its synchronous TLB flush, but it
> obviously needs to deal with the change of rmap zapping logic.

Correct, we need to flush already when we change a PTE, which is
done in ptep_get_and_clear() etc. Only exception would be lazy
flushing when only one active thread is attached, then we would
flush later in flush_tlb_mm/range(), or as soon as another thread
is attached (IIRC).

So it seems straight forward to just call page_zap_pte_rmap()
from our private __tlb_remove_page_size() implementation.

Just wondering a bit why you did not also add the
VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page), like
in the generic change.

Acked-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> # s390
  
Linus Torvalds Nov. 2, 2022, 11:13 p.m. UTC | #68
On Wed, Nov 2, 2022 at 3:31 PM Gerald Schaefer
<gerald.schaefer@linux.ibm.com> wrote:
>
> Just wondering a bit why you did not also add the
> VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page), like
> in the generic change.

Heh, I had considered dropping it entirely even from the generic code,
since I don't remember seeing that ever trigger, but PeterZ convinced
me otherwise.

For the s390 side I really wanted to keep things minimal since I
(obviously) didn't even built-test it, so..

I'm perfectly happy with s390 people adding it later, of course.

                Linus
  
David Hildenbrand Nov. 3, 2022, 9:52 a.m. UTC | #69
On 31.10.22 19:43, Linus Torvalds wrote:
> Updated subject line, and here's the link to the original discussion
> for new people:
> 
>      https://lore.kernel.org/all/B88D3073-440A-41C7-95F4-895D3F657EF2@gmail.com/
> 
> On Mon, Oct 31, 2022 at 10:28 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Ok. At that point we no longer have the pte or the virtual address, so
>> it's not going to be exactly the same debug output.
>>
>> But I think it ends up being fairly natural to do
>>
>>          VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page);
>>
>> instead, and I've fixed that last patch up to do that.
> 
> Ok, so I've got a fixed set of patches based on the feedback from
> PeterZ, and also tried to do the s390 updates for this blindly, and
> pushed them out into a git branch:
> 
>      https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?h=mmu_gather-race-fix
> 
> If people really want to see the patches in email again, I can do
> that, but most of you already have, and the changes are either trivial
> fixes or the s390 updates.
> 
> For the s390 people that I've now added to the participant list maybe
> the git tree is fine - and the fundamental explanation of the problem
> is in that top-most commit (with the three preceding commits being
> prep-work). Or that link to the thread about this all.
> 
> That top-most commit is also where I tried to fix things up for s390
> that uses its own non-gathering TLB flush due to
> CONFIG_MMU_GATHER_NO_GATHER.
> 
> NOTE NOTE NOTE! Unlike my regular git branch, this one may end up
> rebased etc for further comments and fixes. So don't consider that
> stable, it's still more of an RFC branch.
> 
> At a minimum I'll update it with Ack's etc, assuming I get those, and
> my s390 changes are entirely untested and probably won't work.
> 
> As far as I can tell, s390 doesn't actually *have* the problem that
> causes this change, because of its synchronous TLB flush, but it
> obviously needs to deal with the change of rmap zapping logic.
> 
> Also added a few people who are explicitly listed as being mmu_gather
> maintainers. Maybe people saw the discussion on the linux-mm list, but
> let's make it explicit.
> 
> Do people have any objections to this approach, or other suggestions?
> 
> I do *not* consider this critical, so it's a "queue for 6.2" issue for me.
> 
> It probably makes most sense to queue in the -MM tree (after the thing
> is acked and people agree), but I can keep that branch alive too and
> just deal with it all myself as well.
> 
> Anybody?

Happy to see that we're still decrementing the mapcount before 
decrementingthe refcount, I was briefly concerned.

I was not able to come up quickly with something that would be 
fundamentally wrong here, but devil is in the detail.

Some minor things could be improved IMHO (ENCODE_PAGE_BITS naming is 
unfortunate, TLB_ZAP_RMAP could be a __bitwise type, using VM_WARN_ON 
instead of VM_BUG_ON).

I agree that 6.2 is good enough and that upstreaming this via the -MM 
tree would be a good way to move forward.
  
Linus Torvalds Nov. 3, 2022, 4:54 p.m. UTC | #70
On Thu, Nov 3, 2022 at 2:52 AM David Hildenbrand <david@redhat.com> wrote:
>
> Happy to see that we're still decrementing the mapcount before
> decrementingthe refcount, I was briefly concerned.

Oh, that would have been horribly wrong.

> I was not able to come up quickly with something that would be
> fundamentally wrong here, but devil is in the detail.

So I tried to be very careful.

The biggest change in the whole series (visible in last patch, but
there in the prep-patches too) is how it narrows down some lock
coverage.

Now, that locking didn't *do* anything valid, but I did try to point
it out when it happens - how first the mapcount is decremented outside
the memcg lock (in preparatory patches), and then later on the
__dec_lruvec_page_state() turns into a dec_lruvec_page_state() because
it's then done outside the page table lock.

The locking in the second case did do something - not locking-wise,
but simply the "running under the spinlock means we are not
preemptable without RT".

And in the memcg case it was just plain overly wide lock regions.

None of the other changes should have any real semantic meaning
*apart* from just keeping ->_mapcount elevated slightly longer.

> Some minor things could be improved IMHO (ENCODE_PAGE_BITS naming is
> unfortunate, TLB_ZAP_RMAP could be a __bitwise type, using VM_WARN_ON
> instead of VM_BUG_ON).

That VM_BUG_ON() is a case of "what to do if this ever triggers?" So a
WARN_ON() would be fatal too, it's some seriously bogus stuff to try
to put bits in that won't fit.

It probably should be removed, since the value should always be pretty
much a simple constant. It was more of a "let's be careful with new
code, not for production".

Probably like pretty much all VM_BUG_ON's are - test code that just
got left around.

I considered just getting rid of ENCODE_PAGE_BITS entirely, since
there is only one bit. But it was always "let's keep the option open
for dirty bits etc", so I kept it, but I agree that the name isn't
wonderful.

And in fact I wanted the encoding to really be done by the caller (so
that TLB_ZAP_RMAP wouldn't be a new argument, but the 'page' argument
to __tlb_remove_page_*() would simply be an 'encoded page' pointer,
but that would have caused the patch to be much bigger (and expanded
the s390 side too). Which I didn't want to do.

Long-term that's probably still the right thing to do, including
passing the encoded pointers all the way to
free_pages_and_swap_cache().

Because it's pretty disgusting how it cleans up that array in place
and then does that cast to a new array type, but it is also disgusting
how it traverses that array multiple times (ie
free_pages_and_swap_cache() will just have *another* loop).

But again, those changes would have made the patch bigger, which I
didn't want at this point (and 'release_pages()' would need that
clean-in-place anyway, unless we changed *that* too and made the whole
page encoding be something widely available).

That's also why I then went with that fairly generic
"ENCODE_PAGE_BITS" name. The *use* of it right now is very very
specific to just the TLB gather, and the TLB_ZAP_RMAP bit shows that
in the name. But then I went for a fairly generic "encode extra bits
in the page pointer" name because it felt like it might expand beyond
the initial intentionally small patch in the long run.

So it's a combination of "we might want to expand on this in the
future" and yet also "I really want to keep the changes localized in
this patch".

And the two are kind of inverses of each other, which hopefully
explains the seemingly disparate naming logic.

                 Linus
  
Linus Torvalds Nov. 3, 2022, 5:09 p.m. UTC | #71
On Thu, Nov 3, 2022 at 9:54 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But again, those changes would have made the patch bigger, which I
> didn't want at this point (and 'release_pages()' would need that
> clean-in-place anyway, unless we changed *that* too and made the whole
> page encoding be something widely available).

And just to clarify: this is not just me trying to expand the reach of my patch.

I'd suggest people look at mlock_pagevec(), and realize that LRU_PAGE
and NEW_PAGE are both *exactly* the same kind of "encoded_page" bits
that TLB_ZAP_RMAP is.

Except the mlock code does *not* show that in the type system, and
instead just passes a "struct page **" array around in pvec->pages,
and then you'd just better know that "oh, it's not *really* just a
page pointer".

So I really think that the "array of encoded page pointers" thing is a
generic notion that we *already* have.

It's just that we've done it disgustingly in the past, and I didn't
want to do that disgusting thing again.

So I would hope that the nasty things that the mlock code would some
day use the same page pointer encoding logic to actually make the
whole "this is not a page pointer that you can use directly, it has
low bits set for flags" very explicit.

I am *not* sure if then the actual encoded bits would be unified.
Probably not - you might have very different and distinct uses of the
encode_page() thing where the bits mean different things in different
contexts.

Anyway, this is me just explaining the thinking behind it all. The
page bit encoding is a very generic thing (well, "very generic" in
this case means "has at least one other independent user"), explaining
the very generic naming.

But at the same time, the particular _patch_ was meant to be very targeted.

So slightly schizophrenic name choices as a result.

             Linus
  
David Hildenbrand Nov. 3, 2022, 5:36 p.m. UTC | #72
On 03.11.22 18:09, Linus Torvalds wrote:
> On Thu, Nov 3, 2022 at 9:54 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> But again, those changes would have made the patch bigger, which I
>> didn't want at this point (and 'release_pages()' would need that
>> clean-in-place anyway, unless we changed *that* too and made the whole
>> page encoding be something widely available).
> 
> And just to clarify: this is not just me trying to expand the reach of my patch.
> 
> I'd suggest people look at mlock_pagevec(), and realize that LRU_PAGE
> and NEW_PAGE are both *exactly* the same kind of "encoded_page" bits
> that TLB_ZAP_RMAP is.
> 
> Except the mlock code does *not* show that in the type system, and
> instead just passes a "struct page **" array around in pvec->pages,
> and then you'd just better know that "oh, it's not *really* just a
> page pointer".
> 
> So I really think that the "array of encoded page pointers" thing is a
> generic notion that we *already* have.
> 
> It's just that we've done it disgustingly in the past, and I didn't
> want to do that disgusting thing again.
> 
> So I would hope that the nasty things that the mlock code would some
> day use the same page pointer encoding logic to actually make the
> whole "this is not a page pointer that you can use directly, it has
> low bits set for flags" very explicit.
> 
> I am *not* sure if then the actual encoded bits would be unified.
> Probably not - you might have very different and distinct uses of the
> encode_page() thing where the bits mean different things in different
> contexts.
> 
> Anyway, this is me just explaining the thinking behind it all. The
> page bit encoding is a very generic thing (well, "very generic" in
> this case means "has at least one other independent user"), explaining
> the very generic naming.
> 
> But at the same time, the particular _patch_ was meant to be very targeted.
> 
> So slightly schizophrenic name choices as a result.

Thanks for the explanation. I brought it up because the generic name 
somehow felt weird in include/asm-generic/tlb.h. Skimming over the code 
I'd have expected something like TLB_ENCODE_PAGE_BITS, so making the 
"very generic" things "very specific" as long as it lives in tlb.h :)
  
Alexander Gordeev Nov. 4, 2022, 6:33 a.m. UTC | #73
On Mon, Oct 31, 2022 at 11:43:30AM -0700, Linus Torvalds wrote:
[...]
> If people really want to see the patches in email again, I can do
> that, but most of you already have, and the changes are either trivial
> fixes or the s390 updates.
> 
> For the s390 people that I've now added to the participant list maybe
> the git tree is fine - and the fundamental explanation of the problem
> is in that top-most commit (with the three preceding commits being
> prep-work). Or that link to the thread about this all.

I rather have a question to the generic part (had to master the code quotting).

> static void clean_and_free_pages_and_swap_cache(struct encoded_page **pages, unsigned int nr)
> {
> 	for (unsigned int i = 0; i < nr; i++) {
> 		struct encoded_page *encoded = pages[i];
> 		unsigned int flags = encoded_page_flags(encoded);
> 		if (flags) {
> 			/* Clean the flagged pointer in-place */
> 			struct page *page = encoded_page_ptr(encoded);
> 			pages[i] = encode_page(page, 0);
> 
> 			/* The flag bit being set means that we should zap the rmap */

Why TLB_ZAP_RMAP bit is not checked explicitly here, like in s390 version?
(I assume, when/if ENCODE_PAGE_BITS is not TLB_ZAP_RMAP only, calling
page_zap_pte_rmap() without such a check would be a bug).

> 			page_zap_pte_rmap(page);
> 			VM_WARN_ON_ONCE_PAGE(page_mapcount(page) < 0, page);
> 		}
> 	}
> 
> 	/*
> 	 * Now all entries have been un-encoded, and changed to plain
> 	 * page pointers, so we can cast the 'encoded_page' array to
> 	 * a plain page array and free them
> 	 */
> 	free_pages_and_swap_cache((struct page **)pages, nr);
> }

Thanks!
  
Linus Torvalds Nov. 4, 2022, 5:35 p.m. UTC | #74
On Thu, Nov 3, 2022 at 11:33 PM Alexander Gordeev
<agordeev@linux.ibm.com> wrote:
>
> I rather have a question to the generic part (had to master the code quotting).

Sure.

Although now I think the series in in Andrew's -mm tree, or just about
to get moved in there, so I'm not going to touch my actual branch any
more.

> > static void clean_and_free_pages_and_swap_cache(struct encoded_page **pages, unsigned int nr)
> > {
> >       for (unsigned int i = 0; i < nr; i++) {
> >               struct encoded_page *encoded = pages[i];
> >               unsigned int flags = encoded_page_flags(encoded);
> >               if (flags) {
> >                       /* Clean the flagged pointer in-place */
> >                       struct page *page = encoded_page_ptr(encoded);
> >                       pages[i] = encode_page(page, 0);
> >
> >                       /* The flag bit being set means that we should zap the rmap */
>
> Why TLB_ZAP_RMAP bit is not checked explicitly here, like in s390 version?
> (I assume, when/if ENCODE_PAGE_BITS is not TLB_ZAP_RMAP only, calling
> page_zap_pte_rmap() without such a check would be a bug).

No major reason. This is basically the same issue as the naming, which
I touched on in

  https://lore.kernel.org/all/CAHk-=wiDg_1up8K4PhK4+kzPN7xJG297=nw+tvgrGn7aVgZdqw@mail.gmail.com/

and the follow-up note about how I hope the "encoded page pointer with
flags" thing gets used by the mlock code some day too.

IOW, there's kind of a generic "I have extra flags associated with the
pointer", and then the specific "this case uses this flag", and
depending on which mindset you have at the time, you might do one or
the other.

So in that clean_and_free_pages_and_swap_cache(), the code basically
knows "I have a pointer with extra flags", and it's written that way.
And that's partly historical, because it actually started with the
original code tracking the dirty bit as the extra piece of
information, and then transformed into this "no, the information is
TLB_ZAP_RMAP".

So "unsigned int flags" at one point was "bool dirty" instead, but
then became more of a "I think this whole page pointer with flags is
general", and the naming changed, and I had both cases in mind, and
then the code is perhaps not so specifically named. I'm not sure the
zap_page_range() case will ever use more than one flag, but the mlock
case already has two different flags. So the "encode_page" thing is
very much written to be about more than just the zap_page_range()
case.

But yes, that code could (and maybe should) use "if (flags &
TLB_ZAP_RMAP)" to make it clear that in this case, the single flags
bit is that one bit.

But the "if ()" there actually does two conceptually *separate*
things: it needs to clean the pointer in-place (which is regardless of
"which" flag bit is set, and then it does that page_zap_pte_rmap(),
which is just for the TLB_ZAP_RMAP bit.

So to be really explicit about it, you'd have two different tests: one
for "do I have flags that need to be cleaned up" and then an inner
test for each flag. And since there is only one flag in this
particular use case, it's essentially that inner test that I dropped
as pointless.

In contrast, in the s390 version, that bit was never encoded as a a
general "flags associated with a page pointer" in the first place, so
there was never any such duality. There is only TLB_ZAP_RMAP.

Hope that explains the thinking.

                 Linus
  
Hugh Dickins Nov. 6, 2022, 9:06 p.m. UTC | #75
Adding Stephen to Cc, for next-20221107 alert.
Adding Johannes to Cc, particularly for lock_page_memcg discussion.

On Fri, 4 Nov 2022, Linus Torvalds wrote:
> On Thu, Nov 3, 2022 at 11:33 PM Alexander Gordeev
> <agordeev@linux.ibm.com> wrote:
> >
> > I rather have a question to the generic part (had to master the code quotting).
> 
> Sure.
> 
> Although now I think the series in in Andrew's -mm tree, or just about
> to get moved in there, so I'm not going to touch my actual branch any
> more.

Linus, we've been exchanging about my mm/rmap.c mods in private mail,
I need to raise some points about your mods here in public mail.

Four topics - munlock (good), anon_vma (bad), mm-unstable (bad),
lock_page_memcg (uncertain).  I'm asking Andrew here to back your
patchset out of mm-unstable for now, until we have its fixes in:
otherwise I'm worried that next-20221107 will waste everyone's time.

munlock (good)
--------------
You've separated out the munlock_vma_page() from the rest of PTE
remove rmap work.  Worried me at first, but I think that's fine:
bundling it into page_remove_rmap() was mainly so that we wouldn't
forget to do it anywhere (and other rmap funcs already took vma arg).

Certainly during development, I had been using page mapcount somewhere
inside munlock_vma_page(), either for optimization or for sanity check,
I forget; but gave up on that as unnecessary complication, and I think
it became impossible with the pagevec; so not an issue now.

If it were my change, I'd certainly do it by changing the name of the
vma arg in page_remove_rmap() from vma to munlock_vma, not doing the
munlock when that is NULL, and avoid the need for duplicating code:
whereas you're very pleased with cutting out the unnecessary stuff
in slimmed-down page_zap_pte_rmap().  Differing tastes (or perhaps
you'll say taste versus no taste).

anon_vma (bad)
--------------
My first reaction to your patchset was, that it wasn't obvious to me
that delaying the decrementation of mapcount is safe: I needed time to
think about it.  But happily, testing saved me from needing much thought.

The first problem, came immediately in the first iteration of my
huge tmpfs kbuild swapping loads, was BUG at mm/migrate.c:1105!
VM_BUG_ON_FOLIO(folio_test_anon(src) && !folio_test_ksm(src) && !anon_vma, src);
Okay, that's interesting but not necessarily fatal, we can very easily
make it a recoverable condition.  I didn't bother to think on that one,
just patched around it.

The second problem was more significant: came after nearly five hours
of load, BUG NULL dereference (address 8) in down_read_trylock() in
rmap_walk_anon(), while kswapd was doing a folio_referenced().

That one goes right to the heart of the matter, and instinct had been
correct to worry about delaying the decrementation of mapcount, that is,
extending the life of page_mapped() or folio_mapped().

See folio_lock_anon_vma_read(): folio_mapped() plays a key role in
establishing the continued validity of an anon_vma.  See comments
above folio_get_anon_vma(), some by me but most by PeterZ IIRC.

I believe what has happened is that your patchset has, very intentionally,
kept the page as "folio_mapped" until after free_pgtables() does its
unlink_anon_vmas(); but that is telling folio_lock_anon_vma_read()
that the anon_vma is safe to use when actually it has been freed.
(It looked like a page table when I peeped at it.)

I'm not certain, but I think that you made page_zap_pte_rmap() handle
anon as well as file, just for the righteous additional simplification;
but I'm afraid that (without opening a huge anon_vma refcounting can of
worms) that unification has to be reverted, and anon left to go the
same old way it did before.

I didn't look into whether reverting one of your patches would achieve
that, I just adjusted the code in zap_pte_range() to go the old way for
PageAnon; and that has been running successfully, hitting neither BUG,
for 15 hours now.

mm-unstable (bad)
-----------------
Aside from that PageAnon issue, mm-unstable is in an understandably bad
state because you could not have foreseen my subpages_mapcount addition
to page_remove_rmap().  page_zap_pte_rmap() now needs to handle the
PageCompound (but not the "compound") case too.  I rushed you and akpm
an emergency patch for that on Friday night, but you, let's say, had
reservations about it.  So I haven't posted it, and while the PageAnon
issue remains, I think your patchset has to be removed from mm-unstable
and linux-next anyway.

What happens to mm-unstable with page_zap_pte_rmap() not handling
subpages_mapcount?  In my CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y case,
I get a "Bad page state" even before reaching first login after boot,
"page dumped because: nonzero subpages_mapcount".  Yes, I make it
worse for myself by having a BUG() in bad_page(); others will limp
along with more and more of those "Bad page" messages.

lock_page_memcg (uncertain)
---------------------------
Johannes, please help! Linus has got quite upset by lock_page_memcg(),
its calls from mm/rmap.c anyway, and most particularly by the way
in which it is called at the start of page_remove_rmap(), before
anyone's critical atomic_add_negative(), yet its use is to guarantee
the stability of page memcg while doing the stats updates, done only
when atomic_add_negative() says so.

I do have one relevant insight on this.  It (or its antecedents under
other names) date from the days when we did "reparenting" of memcg
charges from an LRU: and in those days the lock_page_memcg() before
mapcount adjustment was vital, to pair with the uses of folio_mapped()
or page_mapped() in mem_cgroup_move_account() - those "mapped" checks
are precisely around the stats which the rmap functions affect.

But nowadays mem_cgroup_move_account() is only called, with page table
lock held, on matching pages found in a task's page table: so its
"mapped" checks are redundant - I've sometimes thought in the past of
removing them, but held back, because I always have the notion (not
hope!) that "reparenting" may one day be brought back from the grave.
I'm too out of touch with memcg to know where that notion stands today.

I've gone through a multiverse of opinions on those lock_page_memcg()s
in the last day: I currently believe that Linus is right, that the
lock_page_memcg()s could and should be moved just before the stats
updates.  But I am not 100% certain of that - is there still some
reason why it's important that the page memcg at the instant of the
critical mapcount transition be kept unchanged until the stats are
updated?  I've tried running scenarios through my mind but given up.

(And note that the answer might be different with Linus's changes
than without them: since he delays the mapcount decrementation
until long after pte was removed and page table lock dropped).

And I do wish that we could settle this lock_page_memcg() question
in an entirely separate patch: as it stands, page_zap_pte_rmap()
gets to benefit from Linus's insight (or not), and all the other rmap
functions carry on with the mis?placed lock_page_memcg() as before.

Let's press Send,
Hugh
  
Linus Torvalds Nov. 6, 2022, 10:34 p.m. UTC | #76
[ Editing down to just the bare-bones problem cases ]

On Sun, Nov 6, 2022 at 1:06 PM Hugh Dickins <hughd@google.com> wrote:
>
> anon_vma (bad)
> --------------
>
> See folio_lock_anon_vma_read(): folio_mapped() plays a key role in
> establishing the continued validity of an anon_vma.  See comments
> above folio_get_anon_vma(), some by me but most by PeterZ IIRC.
>
> I believe what has happened is that your patchset has, very intentionally,
> kept the page as "folio_mapped" until after free_pgtables() does its
> unlink_anon_vmas(); but that is telling folio_lock_anon_vma_read()
> that the anon_vma is safe to use when actually it has been freed.
> (It looked like a page table when I peeped at it.)
>
> I'm not certain, but I think that you made page_zap_pte_rmap() handle
> anon as well as file, just for the righteous additional simplification;
> but I'm afraid that (without opening a huge anon_vma refcounting can of
> worms) that unification has to be reverted, and anon left to go the
> same old way it did before.

Indeed. I made them separate initially, just because the only case
that mattered for the dirty bit was the file-mapped case.

But then the two functions ended up being basically the identical
function, so I unified them again.

But the anonvma lifetime issue looks very real, and so doing the
"delay rmap only for file mappings" seems sane.

In fact, I wonder if we should delay it only for *dirty* file
mappings, since it doesn't matter for the clean case.

Hmm.

I already threw away my branch (since Andrew picked the patches up),
so a question for Andrew: do you want me to re-do the branch entirely,
or do you want me to just send you an incremental patch?

To make for minimal changes, I'd drop the 're-unification' patch, and
then small updates to the zap_pte_range() code to keep the anon (and
possibly non-dirty) case synchronous.

And btw, this one is interesting: for anonymous (and non-dirty
file-mapped) patches, we actually can end up delaying the final page
free (and the rmap zapping) all the way to "tlb_finish_mmu()".

Normally we still have the vma's all available, but yes,
free_pgtables() can and does happen before the final TLB flush.

The file-mapped dirty case doesn't have that issue - not just because
it doesn't have an anonvma at all, but because it also does that
"force_flush" thing that just measn that the page freeign never gets
delayed that far in the first place.

> mm-unstable (bad)
> -----------------
> Aside from that PageAnon issue, mm-unstable is in an understandably bad
> state because you could not have foreseen my subpages_mapcount addition
> to page_remove_rmap().  page_zap_pte_rmap() now needs to handle the
> PageCompound (but not the "compound") case too.  I rushed you and akpm
> an emergency patch for that on Friday night, but you, let's say, had
> reservations about it.  So I haven't posted it, and while the PageAnon
> issue remains, I think your patchset has to be removed from mm-unstable
> and linux-next anyway.

So I think I'm fine with your patch, I just want to move the memcg
accounting to outside of it.

I can re-do my series on top of mm-unstable, I guess. That's probably
the easiest way to handle this all.

Andrew - can you remove those patches again, and I'll create a new
series for you?

                 Linus
  
Andrew Morton Nov. 6, 2022, 11:14 p.m. UTC | #77
On Sun, 6 Nov 2022 14:34:51 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > mm-unstable (bad)
> > -----------------
> > Aside from that PageAnon issue, mm-unstable is in an understandably bad
> > state because you could not have foreseen my subpages_mapcount addition
> > to page_remove_rmap().  page_zap_pte_rmap() now needs to handle the
> > PageCompound (but not the "compound") case too.  I rushed you and akpm
> > an emergency patch for that on Friday night, but you, let's say, had
> > reservations about it.  So I haven't posted it, and while the PageAnon
> > issue remains, I think your patchset has to be removed from mm-unstable
> > and linux-next anyway.
> 
> So I think I'm fine with your patch, I just want to move the memcg
> accounting to outside of it.
> 
> I can re-do my series on top of mm-unstable, I guess. That's probably
> the easiest way to handle this all.
> 
> Andrew - can you remove those patches again, and I'll create a new
> series for you?

Yes, I've removed both serieses and shall push the tree out within half
an hour from now.
  
Stephen Rothwell Nov. 7, 2022, 12:06 a.m. UTC | #78
Hi all,

On Sun, 6 Nov 2022 15:14:16 -0800 Andrew Morton <akpm@linux-foundation.org> wrote:
>
> Yes, I've removed both serieses and shall push the tree out within half
> an hour from now.

And I have picked up the new version for today's linux-next.  Thanks all.
  
Peter Zijlstra Nov. 7, 2022, 9:12 a.m. UTC | #79
Hi Hugh!

On Sun, Nov 06, 2022 at 01:06:19PM -0800, Hugh Dickins wrote:
> See folio_lock_anon_vma_read(): folio_mapped() plays a key role in
> establishing the continued validity of an anon_vma.  See comments
> above folio_get_anon_vma(), some by me but most by PeterZ IIRC.

Ohhh, you're quite right. Unfortunately I seem to have completely
forgotten about that :-(
  
Linus Torvalds Nov. 7, 2022, 4:19 p.m. UTC | #80
On Sun, Nov 6, 2022 at 3:14 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> Yes, I've removed both serieses and shall push the tree out within half
> an hour from now.

Oh, can you please just put Hugh's series back in?

I don't love the mapcount changes, but I'm going to re-do my parts so
that there is no clash with them.

And while I'd much rather see the mapcounts done another way, that's a
completely separate issue.

                Linus
  
Johannes Weiner Nov. 7, 2022, 8:07 p.m. UTC | #81
Hello,

On Sun, Nov 06, 2022 at 01:06:19PM -0800, Hugh Dickins wrote:
> lock_page_memcg (uncertain)
> ---------------------------
> Johannes, please help! Linus has got quite upset by lock_page_memcg(),
> its calls from mm/rmap.c anyway, and most particularly by the way
> in which it is called at the start of page_remove_rmap(), before
> anyone's critical atomic_add_negative(), yet its use is to guarantee
> the stability of page memcg while doing the stats updates, done only
> when atomic_add_negative() says so.

As you mentioned, the pte lock historically wasn't always taken on the
move side. And so the reason lock_page_memcg() covers the mapcount
update is that move_account() needs to atomically see either a) the
page is mapped and counted, or b) unmapped and uncounted. If we lock
after mapcountdec, move_account() could miss pending updates that need
transferred, and it would break the scheme breaks thusly:

memcg1->nr_mapped = 1
memcg2->nr_mapped = 0

page_remove_rmap:                              mem_cgroup_move_account():
  if atomic_add_negative(page->mapcount):
                                                 lock_page_memcg()
                                                 if page->mapcount: // NOT TAKEN
                                                   memcg1->nr_mapped--
                                                   memcg2->nr_mapped++
                                                 page->memcg = memcg2
                                                 unlock_page_memcg()
    lock_page_memcg()
    page->memcg->nr_mapped-- // UNDERFLOW memcg2->nr_mapped
    unlock_page_memcg()

> I do have one relevant insight on this.  It (or its antecedents under
> other names) date from the days when we did "reparenting" of memcg
> charges from an LRU: and in those days the lock_page_memcg() before
> mapcount adjustment was vital, to pair with the uses of folio_mapped()
> or page_mapped() in mem_cgroup_move_account() - those "mapped" checks
> are precisely around the stats which the rmap functions affect.
> 
> But nowadays mem_cgroup_move_account() is only called, with page table
> lock held, on matching pages found in a task's page table: so its
> "mapped" checks are redundant - I've sometimes thought in the past of
> removing them, but held back, because I always have the notion (not
> hope!) that "reparenting" may one day be brought back from the grave.
> I'm too out of touch with memcg to know where that notion stands today.
>
> I've gone through a multiverse of opinions on those lock_page_memcg()s
> in the last day: I currently believe that Linus is right, that the
> lock_page_memcg()s could and should be moved just before the stats
> updates.  But I am not 100% certain of that - is there still some
> reason why it's important that the page memcg at the instant of the
> critical mapcount transition be kept unchanged until the stats are
> updated?  I've tried running scenarios through my mind but given up.

Okay, I think there are two options.

- If we don't want to codify the pte lock requirement on the move
  side, then moving the lock_page_memcg() like that would break the
  locking scheme, as per above.

- If we DO want to codify the pte lock requirement, we should just
  remove the lock_page_memcg() altogether, as it's fully redundant.

I'm leaning toward the second option. If somebody brings back
reparenting they can bring the lock back along with it.

[ If it's even still necessary by then.

  It's conceivable reparenting is brought back only for cgroup2, where
  the race wouldn't matter because of the hierarchical stats. The
  reparenting side wouldn't have to move page state to the parent -
  it's already there. And whether rmap would see the dying child or
  the parent doesn't matter much either: the parent will see the
  update anyway, directly or recursively, and we likely don't care to
  balance the books on a dying cgroup.

  It's then just a matter of lifetime - which should be guaranteed
  also, as long as the pte lock prevents an RCU quiescent state. ]

So how about something like below?

UNTESTED, just for illustration. This is cgroup1 code, which I haven't
looked at too closely in a while. If you can't spot an immediate hole
in it, I'd go ahead and test it and send a proper patch.

---
From 88a32b1b5737630fb981114f6333d8fd057bd8e9 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Mon, 7 Nov 2022 12:05:09 -0500
Subject: [PATCH] mm: remove lock_page_memcg() from rmap

rmap changes (mapping and unmapping) of a page currently take
lock_page_memcg() to serialize 1) update of the mapcount and the
cgroup mapped counter with 2) cgroup moving the page and updating the
old cgroup and the new cgroup counters based on page_mapped().

Before b2052564e66d ("mm: memcontrol: continue cache reclaim from
offlined groups"), we used to reassign all pages that could be found
on a cgroup's LRU list on deletion - something that rmap didn't
naturally serialize against. Since that commit, however, the only
pages that get moved are those mapped into page tables of a task
that's being migrated. In that case, the pte lock is always held (and
we know the page is mapped), which keeps rmap changes at bay already.

The additional lock_page_memcg() by rmap is redundant. Remove it.

NOT-Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 27 ++++++++++++---------------
 mm/rmap.c       | 30 ++++++++++++------------------
 2 files changed, 24 insertions(+), 33 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2d8549ae1b30..f7716e9038e9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5666,7 +5666,10 @@ static struct page *mc_handle_file_pte(struct vm_area_struct *vma,
  * @from: mem_cgroup which the page is moved from.
  * @to:	mem_cgroup which the page is moved to. @from != @to.
  *
- * The caller must make sure the page is not on LRU (isolate_page() is useful.)
+ * This function acquires folio_lock() and folio_lock_memcg(). The
+ * caller must exclude all other possible ways of accessing
+ * page->memcg, such as LRU isolation (to lock out isolation) and
+ * having the page mapped and pte-locked (to lock out rmap).
  *
  * This function doesn't do "charge" to new cgroup and doesn't do "uncharge"
  * from old cgroup.
@@ -5685,6 +5688,7 @@ static int mem_cgroup_move_account(struct page *page,
 	VM_BUG_ON(from == to);
 	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
 	VM_BUG_ON(compound && !folio_test_large(folio));
+	VM_WARN_ON_ONCE(!folio_mapped(folio));
 
 	/*
 	 * Prevent mem_cgroup_migrate() from looking at
@@ -5705,30 +5709,23 @@ static int mem_cgroup_move_account(struct page *page,
 	folio_memcg_lock(folio);
 
 	if (folio_test_anon(folio)) {
-		if (folio_mapped(folio)) {
-			__mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages);
-			__mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages);
-			if (folio_test_transhuge(folio)) {
-				__mod_lruvec_state(from_vec, NR_ANON_THPS,
-						   -nr_pages);
-				__mod_lruvec_state(to_vec, NR_ANON_THPS,
-						   nr_pages);
-			}
+		__mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages);
+		__mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages);
+		if (folio_test_transhuge(folio)) {
+			__mod_lruvec_state(from_vec, NR_ANON_THPS, -nr_pages);
+			__mod_lruvec_state(to_vec, NR_ANON_THPS, nr_pages);
 		}
 	} else {
 		__mod_lruvec_state(from_vec, NR_FILE_PAGES, -nr_pages);
 		__mod_lruvec_state(to_vec, NR_FILE_PAGES, nr_pages);
+		__mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages);
+		__mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages);
 
 		if (folio_test_swapbacked(folio)) {
 			__mod_lruvec_state(from_vec, NR_SHMEM, -nr_pages);
 			__mod_lruvec_state(to_vec, NR_SHMEM, nr_pages);
 		}
 
-		if (folio_mapped(folio)) {
-			__mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages);
-			__mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages);
-		}
-
 		if (folio_test_dirty(folio)) {
 			struct address_space *mapping = folio_mapping(folio);
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 2ec925e5fa6a..60c31375f274 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1197,11 +1197,6 @@ void page_add_anon_rmap(struct page *page,
 	bool compound = flags & RMAP_COMPOUND;
 	bool first;
 
-	if (unlikely(PageKsm(page)))
-		lock_page_memcg(page);
-	else
-		VM_BUG_ON_PAGE(!PageLocked(page), page);
-
 	if (compound) {
 		atomic_t *mapcount;
 		VM_BUG_ON_PAGE(!PageLocked(page), page);
@@ -1217,19 +1212,19 @@ void page_add_anon_rmap(struct page *page,
 	if (first) {
 		int nr = compound ? thp_nr_pages(page) : 1;
 		/*
-		 * We use the irq-unsafe __{inc|mod}_zone_page_stat because
-		 * these counters are not modified in interrupt context, and
-		 * pte lock(a spinlock) is held, which implies preemption
-		 * disabled.
+		 * We use the irq-unsafe __{inc|mod}_zone_page_stat
+		 * because these counters are not modified in
+		 * interrupt context, and pte lock(a spinlock) is
+		 * held, which implies preemption disabled.
+		 *
+		 * The pte lock also stabilizes page->memcg wrt
+		 * mem_cgroup_move_account().
 		 */
 		if (compound)
 			__mod_lruvec_page_state(page, NR_ANON_THPS, nr);
 		__mod_lruvec_page_state(page, NR_ANON_MAPPED, nr);
 	}
 
-	if (unlikely(PageKsm(page)))
-		unlock_page_memcg(page);
-
 	/* address might be in next vma when migration races vma_adjust */
 	else if (first)
 		__page_set_anon_rmap(page, vma, address,
@@ -1290,7 +1285,6 @@ void page_add_file_rmap(struct page *page,
 	int i, nr = 0;
 
 	VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
-	lock_page_memcg(page);
 	if (compound && PageTransHuge(page)) {
 		int nr_pages = thp_nr_pages(page);
 
@@ -1311,6 +1305,7 @@ void page_add_file_rmap(struct page *page,
 		if (nr == nr_pages && PageDoubleMap(page))
 			ClearPageDoubleMap(page);
 
+		/* The pte lock stabilizes page->memcg */
 		if (PageSwapBacked(page))
 			__mod_lruvec_page_state(page, NR_SHMEM_PMDMAPPED,
 						nr_pages);
@@ -1328,7 +1323,6 @@ void page_add_file_rmap(struct page *page,
 out:
 	if (nr)
 		__mod_lruvec_page_state(page, NR_FILE_MAPPED, nr);
-	unlock_page_memcg(page);
 
 	mlock_vma_page(page, vma, compound);
 }
@@ -1356,6 +1350,7 @@ static void page_remove_file_rmap(struct page *page, bool compound)
 		}
 		if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
 			goto out;
+		/* The pte lock stabilizes page->memcg */
 		if (PageSwapBacked(page))
 			__mod_lruvec_page_state(page, NR_SHMEM_PMDMAPPED,
 						-nr_pages);
@@ -1423,8 +1418,6 @@ static void page_remove_anon_compound_rmap(struct page *page)
 void page_remove_rmap(struct page *page,
 	struct vm_area_struct *vma, bool compound)
 {
-	lock_page_memcg(page);
-
 	if (!PageAnon(page)) {
 		page_remove_file_rmap(page, compound);
 		goto out;
@@ -1443,6 +1436,9 @@ void page_remove_rmap(struct page *page,
 	 * We use the irq-unsafe __{inc|mod}_zone_page_stat because
 	 * these counters are not modified in interrupt context, and
 	 * pte lock(a spinlock) is held, which implies preemption disabled.
+	 *
+	 * The pte lock also stabilizes page->memcg wrt
+	 * mem_cgroup_move_account().
 	 */
 	__dec_lruvec_page_state(page, NR_ANON_MAPPED);
 
@@ -1459,8 +1455,6 @@ void page_remove_rmap(struct page *page,
 	 * faster for those pages still in swapcache.
 	 */
 out:
-	unlock_page_memcg(page);
-
 	munlock_vma_page(page, vma, compound);
 }
  
Linus Torvalds Nov. 7, 2022, 8:29 p.m. UTC | #82
On Mon, Nov 7, 2022 at 12:07 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> - If we DO want to codify the pte lock requirement, we should just
>   remove the lock_page_memcg() altogether, as it's fully redundant.
>
> I'm leaning toward that second option.

The thing is, that's very much the case we do *not* want.

We need to delay the rmap removal until at least after the TLB flush.
At least for dirty filemapped pages - because the page cleaning needs
to see that they exists as mapped entities until all CPU's have
*actually* dropped them.

Now, we do the TLB flush still under the page table lock, so we could
still then do the rmap removal before dropping the lock.

But it would be much cleaner from the TLB flushing standpoint to delay
it until the page freeing, which ends up being delayed until after the
lock is dropped.

That said, if always doing the rmap removal under the page table lock
means that that memcg lock can just be deleted in that whole path, I
will certainly bow to _that_ simplification instead, and just handle
the dirty pages after the TLB flush but before the page table drop.

              Linus
  
Andrew Morton Nov. 7, 2022, 11:02 p.m. UTC | #83
On Mon, 7 Nov 2022 08:19:24 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sun, Nov 6, 2022 at 3:14 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > Yes, I've removed both serieses and shall push the tree out within half
> > an hour from now.
> 
> Oh, can you please just put Hugh's series back in?
> 

Done, all pushed out to
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm mm-unstable.
  
Stephen Rothwell Nov. 7, 2022, 11:44 p.m. UTC | #84
Hi all,

On Mon, 7 Nov 2022 15:02:42 -0800 Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Mon, 7 Nov 2022 08:19:24 -0800 Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> Done, all pushed out to
> git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm mm-unstable.

Will be in today's linux-next.
  
Linus Torvalds Nov. 7, 2022, 11:47 p.m. UTC | #85
On Mon, Nov 7, 2022 at 12:29 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> That said, if always doing the rmap removal under the page table lock
> means that that memcg lock can just be deleted in that whole path, I
> will certainly bow to _that_ simplification instead, and just handle
> the dirty pages after the TLB flush but before the page table drop.

Ok, so I think I have a fairly clean way to do this.

Let me try to make that series look reasonable, although it might be
until tomorrow. I'll need to massage my mess into not just prettier
code, but a sane history.

               Linus
  
Linus Torvalds Nov. 8, 2022, 4:28 a.m. UTC | #86
On Mon, Nov 7, 2022 at 3:47 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Ok, so I think I have a fairly clean way to do this.
>
> Let me try to make that series look reasonable, although it might be
> until tomorrow. I'll need to massage my mess into not just prettier
> code, but a sane history.

Ugh. Ok, so massaging it into a saner form and splitting it out into a
pretty history took a lot longer than writing the initial ugly
"something like this".

Anyway, the end result looks very different from the previous series,
since I very consciously tried to keep away from rmap changes to not
clash with Hugh's work, and also made sure we still do the
page_remove_rmap() under the page table lock, just *after* the TLB
flush.

It does share some basic stuff - I'm still using that "struct
encoded_page" thing, I just moved it into a separate commit, and moved
it to where it conceptually belongs (I'd love to say that I also made
the mlock.c code use it, but I didn't - that 'pagevec' thing is a
pretty pointless abstraction and I didn't want to fight it).

So you'll see some familiar support structures and scaffolding, but
the actual approach to zap_pte_range() is very different.

The approach taken to the "s390 is very different" issue is also
completely different. It should actually be better for s390: we used
to cause that "force_flush" whenever we hit a dirty shared page, and
s390 doesn't care. The new model makes that "s390 doesn't care" be
part of the whole design, so now s390 treats dirty shared pages
basically as if they weren't anything special at all. Which they
aren't on s390.

But, as with the previous case, I don't even try to cross-compile for
s390, so my attempts at handling s390 well are just that: attempts.
The code may be broken.

Of course, the code may be broken on x86-64 too, but at least there
I've compiled it and am running it right now.

Oh, and because I copied large parts of the commit message from the
previous approach (because the problem description is the same), I
noticed that I also kept the "Acked-by:". Those are bogus, because the
code is sufficiently different that any previous acks are just not
valid any more, but I just hadn't fixed that yet.

The meat of it all is in that last patch, the rest is just type system
cleanups to prep for it. But it was also that last patch that I spent
hours on just tweaking it to look sensible. The *code* was pretty
easy. Making it have sensible naming and a sensible abstraction
interface that worked well for s390 too, that was 90% of the effort.

But also I hope the end result is reasonably easy to review as a result.

I'm sending this out because I'm stepping away from the keyboard,
because that whole "let's massage it into something lagible" was
really somewhat exhausting. You don't see all the small side turns it
took only to go "that's ugly, let's try again" ;)

Anybody interested in giving this a peek?

(Patch 2/4 might make some people pause. It's fairly small and simple.
It's effective and makes it easy to do some of the later changes. And
it's also quite different from our usual model. It was "inspired" by
the signed-vs-unsigned char thread from a few weeks ago. But patch 4/4
is the one that matters).

                  Linus
  
Linus Torvalds Nov. 8, 2022, 7:56 p.m. UTC | #87
On Mon, Nov 7, 2022 at 8:28 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I'm sending this out because I'm stepping away from the keyboard,
> because that whole "let's massage it into something legible" was
> really somewhat exhausting. You don't see all the small side turns it
> took only to go "that's ugly, let's try again" ;)

Ok, I actually sent the individual patches with 'git-send-email',
although I only sent them to the mailing list and to people that were
mentioned in the commit descriptions.

I hope that makes review easier.

See

   https://lore.kernel.org/all/20221108194139.57604-1-torvalds@linux-foundation.org

for the series if you weren't mentioned and are interested.

Oh, and because I decided to just use the email in this thread as the
reference and cover letter, it turns out that this all confuses 'b4',
because it actually walks up the whole thread all the way to the
original 13-patch series by PeterZ that started this whole discussion.

I've seen that before with other peoples patch series, but now that it
happened to my own, I'm cc'ing Konstantine here too to see if there's
some magic for b4 to say "look, I pointed you to a msg-id that is
clearly a new series, don't walk all the way up and then take patches
from a completely different one.

Oh well. I guess I should just have not been lazy and done a
cover-letter and a whole new thread.

My bad.

Konstantin, please help me look like less of a tool?

                    Linus
  
Konstantin Ryabitsev Nov. 8, 2022, 8:03 p.m. UTC | #88
On Tue, Nov 08, 2022 at 11:56:13AM -0800, Linus Torvalds wrote:
> On Mon, Nov 7, 2022 at 8:28 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > I'm sending this out because I'm stepping away from the keyboard,
> > because that whole "let's massage it into something legible" was
> > really somewhat exhausting. You don't see all the small side turns it
> > took only to go "that's ugly, let's try again" ;)
> 
> Ok, I actually sent the individual patches with 'git-send-email',
> although I only sent them to the mailing list and to people that were
> mentioned in the commit descriptions.
> 
> I hope that makes review easier.
> 
> See
> 
>    https://lore.kernel.org/all/20221108194139.57604-1-torvalds@linux-foundation.org
> 
> for the series if you weren't mentioned and are interested.
> 
> Oh, and because I decided to just use the email in this thread as the
> reference and cover letter, it turns out that this all confuses 'b4',
> because it actually walks up the whole thread all the way to the
> original 13-patch series by PeterZ that started this whole discussion.
> 
> I've seen that before with other peoples patch series, but now that it
> happened to my own, I'm cc'ing Konstantine here too to see if there's
> some magic for b4 to say "look, I pointed you to a msg-id that is
> clearly a new series, don't walk all the way up and then take patches
> from a completely different one.

Yes, --no-parent.

It's slightly more complicated in your case because the patches aren't
threaded to the first patch/cover letter, but you can choose an arbitrary
msgid upthread and tell b4 to ignore anything that came before it. E.g.:

b4 am -o/tmp --no-parent 20221108194139.57604-1-torvalds@linux-foundation.org

-K
  
Linus Torvalds Nov. 8, 2022, 8:18 p.m. UTC | #89
On Tue, Nov 8, 2022 at 12:03 PM Konstantin Ryabitsev <mricon@kernel.org> wrote:
>
> Yes, --no-parent.

Ahh, that's new. I guess I need to update my ancient b4-0.8.0 install..

But yes, with that, and the manual parent lookup (because otherwise
"--no-parent" will fetch *just* the patch itself, not even walking up
the single parent chain). it works.

Maybe a "--single-parent" or "--deadbeat-parent" option would be a good idea?

Anyway, with a more recent b4 version, the command

   b4 am --no-parent
CAHk-=wh6MxaCA4pXpt1F5Bn2__6MxCq0Dr-rES4i=MOL9ibjpg@mail.gmail.com

gets that series and only that series.

              Linus
  

Patch

--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -260,15 +260,12 @@  static inline pte_t ptep_get(pte_t *ptep
 
 #ifdef CONFIG_GUP_GET_PTE_LOW_HIGH
 /*
- * WARNING: only to be used in the get_user_pages_fast() implementation.
- *
- * With get_user_pages_fast(), we walk down the pagetables without taking any
- * locks.  For this we would like to load the pointers atomically, but sometimes
- * that is not possible (e.g. without expensive cmpxchg8b on x86_32 PAE).  What
- * we do have is the guarantee that a PTE will only either go from not present
- * to present, or present to not present or both -- it will not switch to a
- * completely different present page without a TLB flush in between; something
- * that we are blocking by holding interrupts off.
+ * For walking the pagetables without holding any locks.  Some architectures
+ * (eg x86-32 PAE) cannot load the entries atomically without using expensive
+ * instructions.  We are guaranteed that a PTE will only either go from not
+ * present to present, or present to not present -- it will not switch to a
+ * completely different present page without a TLB flush inbetween; which we
+ * are blocking by holding interrupts off.
  *
  * Setting ptes from not present to present goes:
  *