[v2,06/13] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

Message ID 20240103091423.400294-7-peterx@redhat.com
State New
Series mm/gup: Unify hugetlb, part 2

Commit Message

Peter Xu Jan. 3, 2024, 9:14 a.m. UTC
  From: Peter Xu <peterx@redhat.com>

The hugepd format for GUP is only used on PowerPC with hugetlbfs.  There is
some kernel usage of hugepd (see hugepd_populate_kernel() for
PPC_8XX), however those pages are not candidates for GUP.

Commit a6e79df92e4a ("mm/gup: disallow FOLL_LONGTERM GUP-fast writing to
file-backed mappings") added a check to fail gup-fast if there's a potential
risk of violating GUP over writeback file systems.  That should never apply
to hugepd.  Considering that hugepd is an old format (and even
software-only), there's no plan to extend hugepd to other file-backed
memory that is prone to the same issue.

Drop that check, not only because it'll never be true for hugepd per any
known plan, but also because it paves the way for reusing the function
outside fast-gup.

To make sure we'll still remember this issue just in case hugepd is ever
extended to support non-hugetlbfs memory, add a rich comment above
gup_huge_pd(), explaining the issue with proper references.

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/gup.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)
  

Comments

Jason Gunthorpe Jan. 15, 2024, 6:37 p.m. UTC | #1
On Wed, Jan 03, 2024 at 05:14:16PM +0800, peterx@redhat.com wrote:
> From: Peter Xu <peterx@redhat.com>
> 
> Hugepd format for GUP is only used in PowerPC with hugetlbfs.  There are
> some kernel usage of hugepd (can refer to hugepd_populate_kernel() for
> PPC_8XX), however those pages are not candidates for GUP.
> 
> Commit a6e79df92e4a ("mm/gup: disallow FOLL_LONGTERM GUP-fast writing to
> file-backed mappings") added a check to fail gup-fast if there's potential
> risk of violating GUP over writeback file systems.  That should never apply
> to hugepd.  Considering that hugepd is an old format (and even
> software-only), there's no plan to extend hugepd into other file typed
> memories that is prone to the same issue.

I didn't dig into the ppc stuff too deeply, but this looks to me like
it is the same thing as ARM's contig bits?

ie a chunk of PMD/etc entries are all managed together as though they
are a virtual larger entry and we use the hugepte_addr_end() stuff to
iterate over each sub entry.

But WHY is GUP doing this or caring about this? GUP should have no
problem handling the super-size entry (eg 8M on nohash) as a single
thing. It seems we only lack an API to get this out of the arch code?

It seems to me we should see ARM and PPC agree on what the API is for
this and then get rid of hugepd by making both use the same page table
walker API. Is that too hopeful?

> Drop that check, not only because it'll never be true for hugepd per any
> known plan, but also it paves way for reusing the function outside
> fast-gup.

I didn't see any other caller of this function in this series? When
does this re-use happen??

Jason
  
Christophe Leroy Jan. 16, 2024, 6:30 a.m. UTC | #2
On 15/01/2024 at 19:37, Jason Gunthorpe wrote:
> On Wed, Jan 03, 2024 at 05:14:16PM +0800, peterx@redhat.com wrote:
>> From: Peter Xu <peterx@redhat.com>
>>
>> Hugepd format for GUP is only used in PowerPC with hugetlbfs.  There are
>> some kernel usage of hugepd (can refer to hugepd_populate_kernel() for
>> PPC_8XX), however those pages are not candidates for GUP.
>>
>> Commit a6e79df92e4a ("mm/gup: disallow FOLL_LONGTERM GUP-fast writing to
>> file-backed mappings") added a check to fail gup-fast if there's potential
>> risk of violating GUP over writeback file systems.  That should never apply
>> to hugepd.  Considering that hugepd is an old format (and even
>> software-only), there's no plan to extend hugepd into other file typed
>> memories that is prone to the same issue.
> 
> I didn't dig into the ppc stuff too deeply, but this looks to me like
> it is the same thing as ARM's contig bits?
> 
> ie a chunk of PMD/etc entries are all managed together as though they
> are a virtual larger entry and we use the hugepte_addr_end() stuff to
> iterate over each sub entry.

As far as I understand ARM's contig stuff, hugepd on powerpc is 
something different.

hugepd is a page directory dedicated to huge pages, where huge pages are
listed instead of regular pages. For instance, on powerpc 32 with each
PGD entry covering 4 Mbytes, a regular page table has 1024 PTEs. A
hugepd for 512k pages is a page table with 8 entries.

And for 8 Mbyte pages, the hugepd is a page table with only one entry,
and 2 consecutive PGD entries point to the same hugepd to cover the
entire 8 Mbytes.
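
The entry counts above can be checked with a small sketch. This is plain
userspace C rather than kernel code; the helper names and the
PGD_ENTRY_SPAN constant are hypothetical and only assume the powerpc 32
layout described here, where one PGD entry covers 4 Mbytes.

```c
#include <assert.h>

/* Assumption: one PGD entry covers 4 Mbytes, as in the example above. */
#define PGD_ENTRY_SPAN (4UL << 20)

/* Hypothetical helper: entries in the table behind one PGD entry. */
static unsigned long hugepd_nr_entries(unsigned long page_size)
{
	assert(page_size != 0);
	if (page_size >= PGD_ENTRY_SPAN)
		return 1;	/* a single entry describes the whole page */
	return PGD_ENTRY_SPAN / page_size;
}

/* Hypothetical helper: consecutive PGD entries sharing one hugepd. */
static unsigned long hugepd_nr_pgd_entries(unsigned long page_size)
{
	assert(page_size != 0);
	if (page_size <= PGD_ENTRY_SPAN)
		return 1;
	return page_size / PGD_ENTRY_SPAN;
}
```

With 4k pages this gives the regular 1024-PTE table, with 512k pages it
gives 8 entries, and with 8M pages it gives one entry reached from 2 PGD
entries.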

> 
> But WHY is GUP doing this or caring about this? GUP should have no
> problem handling the super-size entry (eg 8M on nohash) as a single
> thing. It seems we only lack an API to get this out of the arch code?
> 
> It seems to me we should see ARM and PPC agree on what the API is for
> this and then get rid of hugepd by making both use the same page table
> walker API. Is that too hopeful?

Can't see the similarity between ARM contig PTE and PPC huge page 
directories.

> 
>> Drop that check, not only because it'll never be true for hugepd per any
>> known plan, but also it paves way for reusing the function outside
>> fast-gup.
> 
> I didn't see any other caller of this function in this series? When
> does this re-use happen??
> 
> Jason


Christophe
  
Jason Gunthorpe Jan. 16, 2024, 12:31 p.m. UTC | #3
On Tue, Jan 16, 2024 at 06:30:39AM +0000, Christophe Leroy wrote:
> 
> 
> On 15/01/2024 at 19:37, Jason Gunthorpe wrote:
> > On Wed, Jan 03, 2024 at 05:14:16PM +0800, peterx@redhat.com wrote:
> >> From: Peter Xu <peterx@redhat.com>
> >>
> >> Hugepd format for GUP is only used in PowerPC with hugetlbfs.  There are
> >> some kernel usage of hugepd (can refer to hugepd_populate_kernel() for
> >> PPC_8XX), however those pages are not candidates for GUP.
> >>
> >> Commit a6e79df92e4a ("mm/gup: disallow FOLL_LONGTERM GUP-fast writing to
> >> file-backed mappings") added a check to fail gup-fast if there's potential
> >> risk of violating GUP over writeback file systems.  That should never apply
> >> to hugepd.  Considering that hugepd is an old format (and even
> >> software-only), there's no plan to extend hugepd into other file typed
> >> memories that is prone to the same issue.
> > 
> > I didn't dig into the ppc stuff too deeply, but this looks to me like
> > it is the same thing as ARM's contig bits?
> > 
> > ie a chunk of PMD/etc entries are all managed together as though they
> > are a virtual larger entry and we use the hugepte_addr_end() stuff to
> > iterate over each sub entry.
> 
> As far as I understand ARM's contig stuff, hugepd on powerpc is 
> something different.
> 
> hugepd is a page directory dedicated to huge pages, where you have huge 
> pages listed instead of regular pages. For instance, on powerpc 32 with 
> each PGD entries covering 4Mbytes, a regular page table has 1024 PTEs. A 
> hugepd for 512k is a page table with 8 entries.
> 
> And for 8Mbytes entries, the hugepd is a page table with only one entry. 
> And 2 consecutive PGD entries will point to the same hugepd to cover the 
> entire 8Mbytes.

That still sounds a lot like the ARM thing - except ARM replicates the
entry, you also said PPC replicates the entry like ARM to get to the
8M?

I guess the difference is in how the table memory is laid out? ARM
marks the size in the same entry that has the physical address so the
entries are self describing and then replicated. It kind of sounds
like PPC is marking the size in prior level and then reconfiguring the
layout of the lower level? Otherwise it surely must do the same
replication to make a radix index work..

If yes, I guess that is the main problem, the mm APIs don't have a way
today to convey data from the pgd level to understand how to parse the
pmd level?

> > It seems to me we should see ARM and PPC agree on what the API is for
> > this and then get rid of hugepd by making both use the same page table
> > walker API. Is that too hopeful?
> 
> Can't see the similarity between ARM contig PTE and PPC huge page 
> directories.

Well, they are both variable sized entries.

So if you imagine a pmd_leaf(), pmd_leaf_size() and a pte_leaf_size()
that would return enough information for both.

Jason
  
Christophe Leroy Jan. 16, 2024, 6:32 p.m. UTC | #4
On 16/01/2024 at 13:31, Jason Gunthorpe wrote:
> On Tue, Jan 16, 2024 at 06:30:39AM +0000, Christophe Leroy wrote:
>>
>>
>> On 15/01/2024 at 19:37, Jason Gunthorpe wrote:
>>> On Wed, Jan 03, 2024 at 05:14:16PM +0800, peterx@redhat.com wrote:
>>>> From: Peter Xu <peterx@redhat.com>
>>>>
>>>> Hugepd format for GUP is only used in PowerPC with hugetlbfs.  There are
>>>> some kernel usage of hugepd (can refer to hugepd_populate_kernel() for
>>>> PPC_8XX), however those pages are not candidates for GUP.
>>>>
>>>> Commit a6e79df92e4a ("mm/gup: disallow FOLL_LONGTERM GUP-fast writing to
>>>> file-backed mappings") added a check to fail gup-fast if there's potential
>>>> risk of violating GUP over writeback file systems.  That should never apply
>>>> to hugepd.  Considering that hugepd is an old format (and even
>>>> software-only), there's no plan to extend hugepd into other file typed
>>>> memories that is prone to the same issue.
>>>
>>> I didn't dig into the ppc stuff too deeply, but this looks to me like
>>> it is the same thing as ARM's contig bits?
>>>
>>> ie a chunk of PMD/etc entries are all managed together as though they
>>> are a virtual larger entry and we use the hugepte_addr_end() stuff to
>>> iterate over each sub entry.
>>
>> As far as I understand ARM's contig stuff, hugepd on powerpc is
>> something different.
>>
>> hugepd is a page directory dedicated to huge pages, where you have huge
>> pages listed instead of regular pages. For instance, on powerpc 32 with
>> each PGD entries covering 4Mbytes, a regular page table has 1024 PTEs. A
>> hugepd for 512k is a page table with 8 entries.
>>
>> And for 8Mbytes entries, the hugepd is a page table with only one entry.
>> And 2 consecutive PGD entries will point to the same hugepd to cover the
>> entire 8Mbytes.
> 
> That still sounds a lot like the ARM thing - except ARM replicates the
> entry, you also said PPC replicates the entry like ARM to get to the
> 8M?

Is it like ARM? Not sure. The PTE is not in the PGD; it must be in an L2
directory, even for 8M.

You can see in the attached picture what the hardware expects.

> 
> I guess the difference is in how the table memory is laid out? ARM
> marks the size in the same entry that has the physical address so the
> entries are self describing and then replicated. It kind of sounds
> like PPC is marking the size in prior level and then reconfiguring the
> layout of the lower level? Otherwise it surely must do the same
> replication to make a radix index work..

Yes, that's how it works on powerpc. For 8xx we used to do that for both
8M and 512k pages. Now for 512k pages we do something like ARM (which
means replicating the entry 128 times), as that's needed to allow mixing
different page sizes for a given PGD entry.

But for 8M pages that would mean replicating the entry 2048 times.
That's a bit too much, isn't it?
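
For reference, the replication counts quoted above follow directly from
the page sizes. A minimal sketch, assuming a 4k base page;
nr_replicated_entries() is a hypothetical name used only for
illustration:

```c
#include <assert.h>

/* Assumption: the base page size is 4k, so each table slot maps 4k. */
#define BASE_PAGE_SIZE (4UL << 10)

/* Hypothetical helper: identical entries needed to cover one huge page. */
static unsigned long nr_replicated_entries(unsigned long huge_page_size)
{
	assert(huge_page_size >= BASE_PAGE_SIZE);
	assert(huge_page_size % BASE_PAGE_SIZE == 0);
	return huge_page_size / BASE_PAGE_SIZE;
}
```

A 512k page needs 128 slots and an 8M page would need 2048, which is
more than the 1024 slots of a single table behind one 4 Mbyte PGD entry.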

> 
> If yes, I guess that is the main problem, the mm APIs don't have a way
> today to convey data from the pgd level to understand how to parse the
> pmd level?
> 
>>> It seems to me we should see ARM and PPC agree on what the API is for
>>> this and then get rid of hugepd by making both use the same page table
>>> walker API. Is that too hopeful?
>>
>> Can't see the similarity between ARM contig PTE and PPC huge page
>> directories.
> 
> Well, they are both variable sized entries.
> 
> So if you imagine a pmd_leaf(), pmd_leaf_size() and a pte_leaf_size()
> that would return enough information for both.

pmd_leaf()? Unless I'm missing something, I can't do a leaf at PMD (PGD)
level. It must be a two-level process even for pages bigger than a PMD
entry.

Christophe
  
Jason Gunthorpe Jan. 17, 2024, 1:22 p.m. UTC | #5
On Tue, Jan 16, 2024 at 06:32:32PM +0000, Christophe Leroy wrote:
> >> hugepd is a page directory dedicated to huge pages, where you have huge
> >> pages listed instead of regular pages. For instance, on powerpc 32 with
> >> each PGD entries covering 4Mbytes, a regular page table has 1024 PTEs. A
> >> hugepd for 512k is a page table with 8 entries.
> >>
> >> And for 8Mbytes entries, the hugepd is a page table with only one entry.
> >> And 2 consecutive PGD entries will point to the same hugepd to cover the
> >> entire 8Mbytes.
> > 
> > That still sounds a lot like the ARM thing - except ARM replicates the
> > entry, you also said PPC replicates the entry like ARM to get to the
> > 8M?
> 
> Is it like ARM ? Not sure. The PTE is not in the PGD it must be in a L2 
> directory, even for 8M.

Your diagram looks almost exactly like ARM to me.

The key thing is that the address for the L2 Table is *always* formed as:

   L2 Table Base << 12 + L2 Index << 2 + 00

Then the L2 Descriptor must contain bits indicating the page
size. The L2 Descriptor is replicated to every 4k entry that the page
size covers.
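
Spelled out in C, the address formation above might look like the sketch
below. Note that in C `+` binds tighter than `<<`, so the shifts need
explicit parentheses. The function name and the 10-bit index width (1024
four-byte descriptors per 4k table) are assumptions for illustration, not
taken from the hardware manual.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 1024 four-byte descriptors fit in one 4k L2 table. */
#define L2_INDEX_BITS 10u

/* Hypothetical helper: address of one L2 descriptor. */
static uint32_t l2_descriptor_addr(uint32_t l2_table_base, uint32_t l2_index)
{
	assert(l2_table_base < (1u << 20));	/* 20-bit table frame number */
	assert(l2_index < (1u << L2_INDEX_BITS));
	/* Base in bits 31..12, index in bits 11..2, low two bits zero. */
	return (l2_table_base << 12) | (l2_index << 2);
}
```

The lookup always indexes in 4k steps regardless of the page size, which
is why a larger page needs its descriptor in every slot it covers.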

The only difference I see is the 8M case which has a page size greater
than a single L1 entry.

> Yes that's how it works on powerpc. For 8xx we used to do that for both 
> 8M and 512k pages. Now for 512k pages we do kind of like ARM (which 
> means replicating the entry 128 times) as that's needed to allow mixing 
> different page sizes for a given PGD entry.

Right, you want to have granular page sizes or it becomes unusable in
the general case.
 
> But for 8M pages that would mean replicating the entry 2048 times. 
> That's a bit too much isn't it ?

Indeed, de-duplicating the L2 Table is a neat optimization.

> > So if you imagine a pmd_leaf(), pmd_leaf_size() and a pte_leaf_size()
> > that would return enough information for both.
> 
> pmd_leaf() ? Unless I'm missing something I can't do leaf at PMD (PGD) 
> level. It must be a two-level process even for pages bigger than a PMD 
> entry.

Right, this is the normal THP/hugetlb situation on x86/etc. It
wouldn't apply here since it seems the HW doesn't have a bit in the L1
descriptor to indicate leaf.

Instead for PPC this hugepd stuff should start to follow Ryan's
generic work for ARM contig:

https://lore.kernel.org/all/20231218105100.172635-1-ryan.roberts@arm.com/

Specifically the arch implementation:

https://lore.kernel.org/linux-mm/20231218105100.172635-15-ryan.roberts@arm.com/

Ie the arch should ultimately wire up the replication and variable
page size bits within its implementation of set_ptes(). set_ptes()
gets a contiguous run of addresses and should install it with maximum
use of the variable page sizes. The core code will start to call
set_ptes() in more cases as Ryan gets along with his project.

For the purposes of GUP, where we are today and where we are going,
it would be much better to not have a special PPC specific "hugepd"
parser. Just process each of the 4k replicates one by one like ARM is
starting with.

The arch would still have to return the correct page address from
pte_phys() which I think Ryan is doing by having the replicates encode
the full 4k based address in each entry. The HW will ignore those low
bits and pte_phys() then works properly. This would work for PPC as
well, excluding the 8M optimization.

Going forward I'd expect to see some pte_page_size() that returns the
size bits and GUP can have logic to skip reading replicates.
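
To make the "skip reading replicates" idea concrete, here is a toy
userspace model. It is not GUP code: struct toy_pte,
toy_pte_page_size() and toy_walk() are all hypothetical, and the model
assumes each replicate stores its own 4k-based physical address and that
huge pages are naturally aligned.

```c
#include <assert.h>
#include <stddef.h>

#define BASE_PAGE_SIZE 4096UL	/* assumed 4k base page */

/* Toy entry: each replicate carries its own 4k-based address. */
struct toy_pte {
	unsigned long phys;		/* physical address of this 4k slot */
	unsigned long page_size;	/* decoded size bits, in bytes */
};

/* Hypothetical stand-in for a pte_page_size() style accessor. */
static unsigned long toy_pte_page_size(const struct toy_pte *pte)
{
	return pte->page_size;
}

/* Walk nr_slots entries; return how many entries were actually read. */
static size_t toy_walk(const struct toy_pte *table, size_t nr_slots)
{
	size_t i = 0, reads = 0;

	while (i < nr_slots) {
		unsigned long sz = toy_pte_page_size(&table[i]);
		size_t slots, off;

		assert(sz >= BASE_PAGE_SIZE && sz % BASE_PAGE_SIZE == 0);
		reads++;
		slots = sz / BASE_PAGE_SIZE;
		/* Position of this slot inside its (aligned) huge page. */
		off = (table[i].phys % sz) / BASE_PAGE_SIZE;
		/* Jump over the remaining replicates of the same page. */
		i += slots - off;
	}
	return reads;
}
```

A walk over one 512k page followed by two 4k pages reads 3 entries
instead of 130, and the same holds when the walk starts in the middle of
the huge page.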

The advantage of all this is that it stops making the feature special
and the work Ryan is doing to generically push larger folios into
set_ptes will become usable on these PPC platforms as well. And we can
kill the PPC specific hugepd.

Jason
  
Ryan Roberts Jan. 18, 2024, 3:15 p.m. UTC | #6
On 17/01/2024 13:22, Jason Gunthorpe wrote:
> On Tue, Jan 16, 2024 at 06:32:32PM +0000, Christophe Leroy wrote:
>>>> hugepd is a page directory dedicated to huge pages, where you have huge
>>>> pages listed instead of regular pages. For instance, on powerpc 32 with
>>>> each PGD entries covering 4Mbytes, a regular page table has 1024 PTEs. A
>>>> hugepd for 512k is a page table with 8 entries.
>>>>
>>>> And for 8Mbytes entries, the hugepd is a page table with only one entry.
>>>> And 2 consecutive PGD entries will point to the same hugepd to cover the
>>>> entire 8Mbytes.
>>>
>>> That still sounds a lot like the ARM thing - except ARM replicates the
>>> entry, you also said PPC replicates the entry like ARM to get to the
>>> 8M?
>>
>> Is it like ARM ? Not sure. The PTE is not in the PGD it must be in a L2 
>> directory, even for 8M.
> 
> Your diagram looks almost exactly like ARM to me.
> 
> The key thing is that the address for the L2 Table is *always* formed as:
> 
>    L2 Table Base << 12 + L2 Index << 2 + 00
> 
> Then the L2 Descriptor must contain bits indicating the page
> size. The L2 Descriptor is replicated to every 4k entry that the page
> size covers.
> 
> The only difference I see is the 8M case which has a page size greater
> than a single L1 entry.
> 
>> Yes that's how it works on powerpc. For 8xx we used to do that for both 
>> 8M and 512k pages. Now for 512k pages we do kind of like ARM (which 
>> means replicating the entry 128 times) as that's needed to allow mixing 
>> different page sizes for a given PGD entry.
> 
> Right, you want to have granular page sizes or it becomes unusable in
> the general case
>  
>> But for 8M pages that would mean replicating the entry 2048 times. 
>> That's a bit too much isn't it ?
> 
> Indeed, de-duplicating the L2 Table is a neat optimization.
> 
>>> So if you imagine a pmd_leaf(), pmd_leaf_size() and a pte_leaf_size()
>>> that would return enough information for both.
>>
>> pmd_leaf() ? Unless I'm missing something I can't do leaf at PMD (PGD) 
>> level. It must be a two-level process even for pages bigger than a PMD 
>> entry.
> 
> Right, this is the normal THP/hugetlb situation on x86/etc. It
> wouldn't apply here since it seems the HW doesn't have a bit in the L1
> descriptor to indicate leaf.
> 
> Instead for PPC this hugepd stuff should start to follow Ryan's
> generic work for ARM contig:
> 
> https://lore.kernel.org/all/20231218105100.172635-1-ryan.roberts@arm.com/
> 
> Specifically the arch implementation:
> 
> https://lore.kernel.org/linux-mm/20231218105100.172635-15-ryan.roberts@arm.com/
> 
> Ie the arch should ultimately wire up the replication and variable
> page size bits within its implementation of set_ptes(). set_ptes()
> gets a contiguous run of address and should install it with maximum
> use of the variable page sizes. The core code will start to call
> set_ptes() in more cases as Ryan gets along his project.

Note that it's not just set_ptes() that you want to batch; there are other calls
that can benefit too. See patches 2 and 3 in the series you linked. (although
I'm working with DavidH on this and the details are going to change a little).

> 
> For the purposes of GUP, where we are today and where we are going,
> it would be much better to not have a special PPC specific "hugepd"
> parser. Just process each of the 4k replicates one by one like ARM is
> starting with.
> 
> The arch would still have to return the correct page address from
> pte_phys() which I think Ryan is doing by having the replicates encode
> the full 4k based address in each entry.

Yes; although it's actually also a requirement of the arm architecture. Since the
contig bit is just a hint that the HW may or may not take any notice of, the
page tables have to be correct for the case where the HW just reads them in base
pages. Fixing up the bottom bits should be trivial using the PTE pointer, if
needed for ppc.
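
One way to picture that fix-up, as a toy sketch only: if every replicate
stored just the huge page base, the 4k offset could be recovered from
where the entry sits in its table. toy_fixup_phys() and its parameters
are hypothetical, and the sketch assumes 4k slots and a naturally aligned
huge page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BASE_PAGE_SIZE 4096UL	/* assumed 4k base page */

/*
 * Hypothetical helper: derive the 4k-granular physical address from the
 * position of ptep within its table, given the huge page base and size.
 */
static unsigned long toy_fixup_phys(const uint32_t *table, const uint32_t *ptep,
				    unsigned long huge_base,
				    unsigned long huge_size)
{
	size_t slot, slots_per_page;

	assert(huge_size >= BASE_PAGE_SIZE && huge_size % BASE_PAGE_SIZE == 0);
	assert(huge_base % huge_size == 0);

	slot = (size_t)(ptep - table);
	slots_per_page = huge_size / BASE_PAGE_SIZE;
	/* The slot's index within the huge page supplies the low bits. */
	return huge_base + (slot % slots_per_page) * BASE_PAGE_SIZE;
}
```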

> The HW will ignore those low
> bits and pte_phys() then works properly. This would work for PPC as
> well, excluding the 8M optimization.
> 
> Going forward I'd expect to see some pte_page_size() that returns the
> size bits and GUP can have logic to skip reading replicates.

Yes; pte_batch_remaining() in patch 2 is an attempt at this. But as I said the
details will likely change a little.

> 
> The advantage of all this is that it stops making the feature special
> and the work Ryan is doing to generically push larger folios into
> set_ptes will become usable on these PPC platforms as well. And we can
> kill the PPC specific hugepd.
> 
> Jason
  
Peter Xu Feb. 21, 2024, 11:55 a.m. UTC | #7
On Mon, Jan 15, 2024 at 02:37:48PM -0400, Jason Gunthorpe wrote:
> > Drop that check, not only because it'll never be true for hugepd per any
> > known plan, but also it paves way for reusing the function outside
> > fast-gup.
> 
> I didn't see any other caller of this function in this series? When
> does this re-use happen??

It's reused in patch 12 ("mm/gup: Handle hugepd for follow_page()").

Thanks,
  

Patch

diff --git a/mm/gup.c b/mm/gup.c
index eebae70d2465..fa93e14b7fca 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2820,11 +2820,6 @@  static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 		return 0;
 	}
 
-	if (!folio_fast_pin_allowed(folio, flags)) {
-		gup_put_folio(folio, refs, flags);
-		return 0;
-	}
-
 	if (!pte_write(pte) && gup_must_unshare(NULL, flags, &folio->page)) {
 		gup_put_folio(folio, refs, flags);
 		return 0;
@@ -2835,6 +2830,14 @@  static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 	return 1;
 }
 
+/*
+ * NOTE: currently GUP for a hugepd is only possible on hugetlbfs file
+ * systems on Power, which does not have issue with folio writeback against
+ * GUP updates.  When hugepd will be extended to support non-hugetlbfs or
+ * even anonymous memory, we need to do extra check as what we do with most
+ * of the other folios. See writable_file_mapping_allowed() and
+ * folio_fast_pin_allowed() for more information.
+ */
 static int gup_huge_pd(hugepd_t hugepd, unsigned long addr,
 		unsigned int pdshift, unsigned long end, unsigned int flags,
 		struct page **pages, int *nr)