[21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

Message ID 20230105101844.1893104-22-jthoughton@google.com
State New
Series Based on latest mm-unstable (85b44c25cd1e).

Commit Message

James Houghton Jan. 5, 2023, 10:18 a.m. UTC
  The main change in this commit is to walk_hugetlb_range, to support
walking HGM mappings, but all walk_hugetlb_range callers must also be
updated to use the new API and take the correct action.

Listing all the changes to the callers:

For s390 changes, we simply ignore HGM PTEs (we don't support s390 yet).

For smaps, shared_hugetlb (and private_hugetlb, although private
mappings don't support HGM) may now not be divisible by the hugepage
size. The appropriate changes have been made to support analyzing HGM
PTEs.

For pagemap, we ignore non-leaf PTEs by treating them as if they were
none PTEs. We can only end up with non-leaf PTEs if they have just been
updated from a none PTE.

For show_numa_map, the challenge is that, if any part of a hugepage is
mapped, we have to count that entire page exactly once, as the results
are given in units of hugepages. To support HGM mappings, we keep track
of the last page that we looked at. If the hugepage we are currently
looking at is the same as the last one, then we must be looking at a
hugepage that has been mapped at high granularity, and we've already
accounted for it.

For DAMON, we treat non-leaf PTEs as if they were none PTEs, for the same
reason as pagemap.

For hwpoison, we proactively update the logic to support the case when
hpte is pointing to a subpage within the poisoned hugepage.

For queue_pages_hugetlb/migration, we ignore all HGM-enabled VMAs for
now.

For mincore, we ignore non-leaf PTEs for the same reason as pagemap.

For mprotect/prot_none_hugetlb_entry, we retry the walk when we get a
non-leaf PTE.
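
As a rough illustration of the pattern most of these callers follow, a
hypothetical hugetlb_entry callback might look like the sketch below. This
is not taken from the patch itself: the exact callback signature and the
hugetlb_pte_present_leaf() helper are assumptions based on the hunks quoted
later in this thread.

	/*
	 * Hypothetical walk_hugetlb_range callback using the new API.
	 * hugetlb_pte_size(hpte) may now be smaller than the hstate size.
	 */
	static int example_hugetlb_entry(struct hugetlb_pte *hpte,
					 unsigned long addr,
					 struct mm_walk *walk)
	{
		pte_t pte = huge_ptep_get(hpte->ptep);

		if (!pte_present(pte))
			return 0;	/* skip none/non-present PTEs in this sketch */
		if (!hugetlb_pte_present_leaf(hpte, pte))
			return 0;	/* non-leaf PTE: treat as none (pagemap/mincore/DAMON) */

		/* account the range addr .. addr + hugetlb_pte_size(hpte) here */
		return 0;
	}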

Signed-off-by: James Houghton <jthoughton@google.com>
---
 arch/s390/mm/gmap.c      | 20 ++++++++--
 fs/proc/task_mmu.c       | 83 +++++++++++++++++++++++++++++-----------
 include/linux/pagewalk.h | 10 +++--
 mm/damon/vaddr.c         | 42 +++++++++++++-------
 mm/hmm.c                 | 20 +++++++---
 mm/memory-failure.c      | 17 ++++----
 mm/mempolicy.c           | 12 ++++--
 mm/mincore.c             | 17 ++++++--
 mm/mprotect.c            | 18 ++++++---
 mm/pagewalk.c            | 20 +++++-----
 10 files changed, 180 insertions(+), 79 deletions(-)
  

Comments

Peter Xu Jan. 5, 2023, 10:42 p.m. UTC | #1
On Thu, Jan 05, 2023 at 10:18:19AM +0000, James Houghton wrote:
> -static void damon_hugetlb_mkold(pte_t *pte, struct mm_struct *mm,
> +static void damon_hugetlb_mkold(struct hugetlb_pte *hpte, pte_t entry,
> +				struct mm_struct *mm,
>  				struct vm_area_struct *vma, unsigned long addr)
>  {
>  	bool referenced = false;
> -	pte_t entry = huge_ptep_get(pte);
> +	pte_t entry = huge_ptep_get(hpte->ptep);

My compiler throws me:

mm/damon/vaddr.c: In function ‘damon_hugetlb_mkold’:
mm/damon/vaddr.c:338:15: error: ‘entry’ redeclared as different kind of symbol
  338 |         pte_t entry = huge_ptep_get(hpte->ptep);
      |               ^~~~~   

I guess this line can just be dropped.

>  	struct folio *folio = pfn_folio(pte_pfn(entry));
>  
>  	folio_get(folio);
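
In other words, since 'entry' is now a function parameter, the fixed
prologue would presumably just drop the local redeclaration (a minimal
sketch, not a hunk from the series):

	static void damon_hugetlb_mkold(struct hugetlb_pte *hpte, pte_t entry,
					struct mm_struct *mm,
					struct vm_area_struct *vma, unsigned long addr)
	{
		bool referenced = false;
		/* 'entry' comes in as a parameter; no local redeclaration */
		struct folio *folio = pfn_folio(pte_pfn(entry));

		folio_get(folio);
		...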
  
Peter Xu Jan. 11, 2023, 10:58 p.m. UTC | #2
James,

On Thu, Jan 05, 2023 at 10:18:19AM +0000, James Houghton wrote:
> @@ -751,9 +761,9 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
>  		int mapcount = page_mapcount(page);
>  
>  		if (mapcount >= 2)
> -			mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
> +			mss->shared_hugetlb += hugetlb_pte_size(hpte);
>  		else
> -			mss->private_hugetlb += huge_page_size(hstate_vma(vma));
> +			mss->private_hugetlb += hugetlb_pte_size(hpte);
>  	}
>  	return 0;

One interesting thing I found with hgm right now is that mostly everything
will be counted as "shared" here. I think it's because mapcount is always
accounted to the huge page even if it's mapped in smaller sizes, so
page_mapcount() on a small page will be huge too because the head page
mapcount is huge.  I'm curious about the reasons behind the mapcount
decision.

For example, would that risk overflow with head_compound_mapcount?  One 1G
page mapping all 4K takes 0.25M counts, while the limit should be 2G for
atomic_t.  Looks like it's possible.

Btw, are the small page* pointers still needed in the latest HGM design?
Is there code taking care of disabling of hugetlb vmemmap optimization for
HGM?  Or maybe it's not needed anymore for the current design?

Thanks,
  
James Houghton Jan. 12, 2023, 2:06 p.m. UTC | #3
On Wed, Jan 11, 2023 at 5:58 PM Peter Xu <peterx@redhat.com> wrote:
>
> James,
>
> On Thu, Jan 05, 2023 at 10:18:19AM +0000, James Houghton wrote:
> > @@ -751,9 +761,9 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
> >               int mapcount = page_mapcount(page);
> >
> >               if (mapcount >= 2)
> > -                     mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
> > +                     mss->shared_hugetlb += hugetlb_pte_size(hpte);
> >               else
> > -                     mss->private_hugetlb += huge_page_size(hstate_vma(vma));
> > +                     mss->private_hugetlb += hugetlb_pte_size(hpte);
> >       }
> >       return 0;
>
> One thing interesting I found with hgm right now is mostly everything will
> be counted as "shared" here, I think it's because mapcount is accounted
> always to the huge page even if mapped in smaller sizes, so page_mapcount()
> to a small page should be huge too because the head page mapcount should be
> huge.  I'm curious the reasons behind the mapcount decision.
>
> For example, would that risk overflow with head_compound_mapcount?  One 1G
> page mapping all 4K takes 0.25M counts, while the limit should be 2G for
> atomic_t.  Looks like it's possible.

The original mapcount approach was "if the hstate-level PTE is
present, increment the compound_mapcount by 1" (basically "if any of
the hugepage is mapped, increment the compound_mapcount by 1"), but
this was painful to implement, so I changed it to what it is now (each
new PT mapping increments the compound_mapcount by 1). I think you're
right, there is absolutely an overflow risk. :( I'm not sure what the
best solution is. I could just go back to the old approach.

Regarding when things are accounted in private_hugetlb vs.
shared_hugetlb, HGM shouldn't change that, because it only applies to
shared mappings (right now anyway). It seems like "private_hugetlb"
can include cases where the page is shared but has only one mapping,
in which case HGM will change it from "private" to "shared".

>
> Btw, are the small page* pointers still needed in the latest HGM design?
> Is there code taking care of disabling of hugetlb vmemmap optimization for
> HGM?  Or maybe it's not needed anymore for the current design?

The hugetlb vmemmap optimization can still be used with HGM, so there
is no code to disable it. We don't need small page* pointers either,
except for when we're dealing with mapping subpages, like in
hugetlb_no_page. Everything else can handle the hugetlb page as a
folio.

I hope we can keep compatibility with the vmemmap optimization while
solving the mapcount overflow risk.

Thanks Peter.
- James
  
Peter Xu Jan. 12, 2023, 3:29 p.m. UTC | #4
On Thu, Jan 12, 2023 at 09:06:48AM -0500, James Houghton wrote:
> On Wed, Jan 11, 2023 at 5:58 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > James,
> >
> > On Thu, Jan 05, 2023 at 10:18:19AM +0000, James Houghton wrote:
> > > @@ -751,9 +761,9 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
> > >               int mapcount = page_mapcount(page);
> > >
> > >               if (mapcount >= 2)
> > > -                     mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
> > > +                     mss->shared_hugetlb += hugetlb_pte_size(hpte);
> > >               else
> > > -                     mss->private_hugetlb += huge_page_size(hstate_vma(vma));
> > > +                     mss->private_hugetlb += hugetlb_pte_size(hpte);
> > >       }
> > >       return 0;
> >
> > One thing interesting I found with hgm right now is mostly everything will
> > be counted as "shared" here, I think it's because mapcount is accounted
> > always to the huge page even if mapped in smaller sizes, so page_mapcount()
> > to a small page should be huge too because the head page mapcount should be
> > huge.  I'm curious the reasons behind the mapcount decision.
> >
> > For example, would that risk overflow with head_compound_mapcount?  One 1G
> > page mapping all 4K takes 0.25M counts, while the limit should be 2G for
> > atomic_t.  Looks like it's possible.
> 
> The original mapcount approach was "if the hstate-level PTE is
> present, increment the compound_mapcount by 1" (basically "if any of
> the hugepage is mapped, increment the compound_mapcount by 1"), but
> this was painful to implement,

Any more info here on why it was painful?  What is the major blocker?

> so I changed it to what it is now (each new PT mapping increments the
> compound_mapcount by 1). I think you're right, there is absolutely an
> overflow risk. :( I'm not sure what the best solution is. I could just go
> back to the old approach.

No rush on that; let's discuss it thoroughly before doing anything.  We
have more context than when it was discussed initially in the calls, so
maybe a good time to revisit.

> 
> Regarding when things are accounted in private_hugetlb vs.
> shared_hugetlb, HGM shouldn't change that, because it only applies to
> shared mappings (right now anyway). It seems like "private_hugetlb"
> can include cases where the page is shared but has only one mapping,
> in which case HGM will change it from "private" to "shared".

The two fields are not defined against VM_SHARED, it seems.  At least not
with current code base.

Let me quote the code again just to be clear:

		int mapcount = page_mapcount(page);    <------------- [1]

		if (mapcount >= 2)
			mss->shared_hugetlb += hugetlb_pte_size(hpte);
		else
			mss->private_hugetlb += hugetlb_pte_size(hpte);

		smaps_hugetlb_hgm_account(mss, hpte);

So that information (for some reason) is only relevant to how many mappings
there are.  If we have one 1G page and only two 4K pieces mapped, with the
existing logic we should see 8K private_hugetlb, while in fact I think it
should be 8K shared_hugetlb due to page_mapcount() taking account of both
4K mappings (as they all go back to the head).

I have no idea whether violating that will be a problem or not, I suppose
at least it needs justification if it will be violated, or hopefully it can
be kept as-is.

> 
> >
> > Btw, are the small page* pointers still needed in the latest HGM design?
> > Is there code taking care of disabling of hugetlb vmemmap optimization for
> > HGM?  Or maybe it's not needed anymore for the current design?
> 
> The hugetlb vmemmap optimization can still be used with HGM, so there
> is no code to disable it. We don't need small page* pointers either,
> except for when we're dealing with mapping subpages, like in
> hugetlb_no_page. Everything else can handle the hugetlb page as a
> folio.
> 
> I hope we can keep compatibility with the vmemmap optimization while
> solving the mapcount overflow risk.

Yeh that'll be perfect if it works.  But afaiu even with your current
design (ignoring all the issues on either smaps accounting or overflow
risks), we are already referencing the small pages, aren't we?  See:

static inline int page_mapcount(struct page *page)
{
	int mapcount = atomic_read(&page->_mapcount) + 1;  <-------- here

	if (likely(!PageCompound(page)))
		return mapcount;
	page = compound_head(page);
	return head_compound_mapcount(page) + mapcount;
}

Even if we assume small page->_mapcount should always be zero in this case,
we may need to take special care of hugetlb pages in page_mapcount() to not
reference it at all.  Otherwise I think it's reading random values, and
someday it could be non-zero.

The other approach is probably using the thp approach.  After Hugh's rework
on the thp accounting I assumed it would be even cleaner (at least no
DoubleMap complexity anymore.. even though I can't say I fully digested the
whole history of that).  It's all about what the major challenges are of
using the same approach here as with thp.  You may have more knowledge on
that aspect, so I'd like to know.

Thanks,
  
James Houghton Jan. 12, 2023, 4:45 p.m. UTC | #5
On Thu, Jan 12, 2023 at 10:29 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Jan 12, 2023 at 09:06:48AM -0500, James Houghton wrote:
> > On Wed, Jan 11, 2023 at 5:58 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > James,
> > >
> > > On Thu, Jan 05, 2023 at 10:18:19AM +0000, James Houghton wrote:
> > > > @@ -751,9 +761,9 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
> > > >               int mapcount = page_mapcount(page);
> > > >
> > > >               if (mapcount >= 2)
> > > > -                     mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
> > > > +                     mss->shared_hugetlb += hugetlb_pte_size(hpte);
> > > >               else
> > > > -                     mss->private_hugetlb += huge_page_size(hstate_vma(vma));
> > > > +                     mss->private_hugetlb += hugetlb_pte_size(hpte);
> > > >       }
> > > >       return 0;
> > >
> > > One thing interesting I found with hgm right now is mostly everything will
> > > be counted as "shared" here, I think it's because mapcount is accounted
> > > always to the huge page even if mapped in smaller sizes, so page_mapcount()
> > > to a small page should be huge too because the head page mapcount should be
> > > huge.  I'm curious the reasons behind the mapcount decision.
> > >
> > > For example, would that risk overflow with head_compound_mapcount?  One 1G
> > > page mapping all 4K takes 0.25M counts, while the limit should be 2G for
> > > atomic_t.  Looks like it's possible.
> >
> > The original mapcount approach was "if the hstate-level PTE is
> > present, increment the compound_mapcount by 1" (basically "if any of
> > the hugepage is mapped, increment the compound_mapcount by 1"), but
> > this was painful to implement,
>
> Any more info here on why it was painful?  What is the major blocker?

The original approach was implemented in RFC v1, but the
implementation was broken: the way refcount was handled was wrong; it
was incremented once for each new page table mapping. (How?
find_lock_page(), called once per hugetlb_no_page/UFFDIO_CONTINUE
would increment refcount and we wouldn't drop it, and in
__unmap_hugepage_range(), the mmu_gather bits would decrement the
refcount once per mapping.)

At the time, I figured that handling mapcount AND refcount correctly in
the original approach would be quite complex, so I switched to the new
one.

1. In places that already change the mapcount, check that we're
installing the hstate-level PTE, not a high-granularity PTE. Adjust
mapcount AND refcount appropriately.
2. In the HGM walking bits, indicate to the caller if we made the hstate-level
PTE present. (hugetlb_[pmd,pte]_alloc is the source of truth.) Need to
keep track of this until we figure out which page we're allocating
PTEs for, then change mapcount/refcount appropriately.
3. In unmapping bits, change mmu_gather/tlb bits to drop refcount only
once per hugepage. (This is probably the hardest of these three things
to get right.)

>
> > so I changed it to what it is now (each new PT mapping increments the
> > compound_mapcount by 1). I think you're right, there is absolutely an
> > overflow risk. :( I'm not sure what the best solution is. I could just go
> > back to the old approach.
>
> No rush on that; let's discuss it thoroughly before doing anything.  We
> have more context than when it was discussed initially in the calls, so
> maybe a good time to revisit.
>
> >
> > Regarding when things are accounted in private_hugetlb vs.
> > shared_hugetlb, HGM shouldn't change that, because it only applies to
> > shared mappings (right now anyway). It seems like "private_hugetlb"
> > can include cases where the page is shared but has only one mapping,
> > in which case HGM will change it from "private" to "shared".
>
> The two fields are not defined against VM_SHARED, it seems.  At least not
> with current code base.
>
> Let me quote the code again just to be clear:
>
>                 int mapcount = page_mapcount(page);    <------------- [1]
>
>                 if (mapcount >= 2)
>                         mss->shared_hugetlb += hugetlb_pte_size(hpte);
>                 else
>                         mss->private_hugetlb += hugetlb_pte_size(hpte);
>
>                 smaps_hugetlb_hgm_account(mss, hpte);
>
> So that information (for some reason) is only relevant to how many mapcount
> is there.  If we have one 1G page and only two 4K mapped, with the existing
> logic we should see 8K private_hugetlb while in fact I think it should be
> 8K shared_hugetlb due to page_mapcount() taking account of both 4K mappings
> (as they all goes back to the head).
>
> I have no idea whether violating that will be a problem or not, I suppose
> at least it needs justification if it will be violated, or hopefully it can
> be kept as-is.

Agreed that this is a problem. I'm not sure what should be done here.
It seems like the current upstream implementation is incorrect (surely
MAP_SHARED with only one mapping should still be shared_hugetlb not
private_hugetlb); the check should really be `if (vma->vm_flags &
VM_MAYSHARE)` instead of `mapcount >= 2`. If that change can be taken,
we don't have a problem here.
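
Concretely, the suggested check would be something along these lines (a
sketch of the idea, not a patch that has been posted):

		if (vma->vm_flags & VM_MAYSHARE)
			mss->shared_hugetlb += hugetlb_pte_size(hpte);
		else
			mss->private_hugetlb += hugetlb_pte_size(hpte);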

>
> >
> > >
> > > Btw, are the small page* pointers still needed in the latest HGM design?
> > > Is there code taking care of disabling of hugetlb vmemmap optimization for
> > > HGM?  Or maybe it's not needed anymore for the current design?
> >
> > The hugetlb vmemmap optimization can still be used with HGM, so there
> > is no code to disable it. We don't need small page* pointers either,
> > except for when we're dealing with mapping subpages, like in
> > hugetlb_no_page. Everything else can handle the hugetlb page as a
> > folio.
> >
> > I hope we can keep compatibility with the vmemmap optimization while
> > solving the mapcount overflow risk.
>
> Yeh that'll be perfect if it works.  But afaiu even with your current
> design (ignoring all the issues on either smaps accounting or overflow
> risks), we already referenced the small pages, aren't we?  See:
>
> static inline int page_mapcount(struct page *page)
> {
>         int mapcount = atomic_read(&page->_mapcount) + 1;  <-------- here
>
>         if (likely(!PageCompound(page)))
>                 return mapcount;
>         page = compound_head(page);
>         return head_compound_mapcount(page) + mapcount;
> }
>
> Even if we assume small page->_mapcount should always be zero in this case,
> we may need to take special care of hugetlb pages in page_mapcount() to not
> reference it at all.  Or I think it's reading random values and some days
> it can be non-zero.

IIUC, it's ok to read from all the hugetlb subpage structs, you just
can't *write* to them (except the first few). The first page of page
structs is mapped RW; all the others are mapped RO to a single
physical page.

>
> The other approach is probably using the thp approach.  After Hugh's rework
> on the thp accounting I assumed it would be even cleaner (at least no
> DoubleMap complexity anymore.. even though I can't say I fully digested the
> whole history of that).  It's all about what's the major challenges of
> using the same approach there with thp.  You may have more knowledge on
> that aspect so I'd be willing to know.

I need to take a closer look at Hugh's approach to see if we can do it
the same way. (I wonder if the 1G THP series has some ideas too.)

A really simple solution could be just to prevent userspace from doing
MADV_SPLIT (or, if we are enabling HGM due to hwpoison, ignore the
poison) if it could result in a mapcount overflow. For 1G pages,
userspace would need 8192 mappings to overflow mapcount/refcount.
  
James Houghton Jan. 12, 2023, 4:55 p.m. UTC | #6
> The original approach was implemented in RFC v1, but the
> implementation was broken: the way refcount was handled was wrong; it
> was incremented once for each new page table mapping. (How?
> find_lock_page(), called once per hugetlb_no_page/UFFDIO_CONTINUE
> would increment refcount and we wouldn't drop it, and in
> __unmap_hugepage_range(), the mmu_gather bits would decrement the
> refcount once per mapping.)
>
> At the time, I figured the complexity of handling mapcount AND
> refcount correctly in the original approach would be quite complex, so
> I switched to the new one.

Sorry I didn't make this clear... the following steps are how we could
correctly implement the original approach.

> 1. In places that already change the mapcount, check that we're
> installing the hstate-level PTE, not a high-granularity PTE. Adjust
> mapcount AND refcount appropriately.
> 2. In the HGM walking bits, to the caller if we made the hstate-level
> PTE present. (hugetlb_[pmd,pte]_alloc is the source of truth.) Need to
> keep track of this until we figure out which page we're allocating
> PTEs for, then change mapcount/refcount appropriately.
> 3. In unmapping bits, change mmu_gather/tlb bits to drop refcount only
> once per hugepage. (This is probably the hardest of these three things
> to get right.)
  
Peter Xu Jan. 12, 2023, 8:27 p.m. UTC | #7
On Thu, Jan 12, 2023 at 11:45:40AM -0500, James Houghton wrote:
> On Thu, Jan 12, 2023 at 10:29 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Thu, Jan 12, 2023 at 09:06:48AM -0500, James Houghton wrote:
> > > On Wed, Jan 11, 2023 at 5:58 PM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > James,
> > > >
> > > > On Thu, Jan 05, 2023 at 10:18:19AM +0000, James Houghton wrote:
> > > > > @@ -751,9 +761,9 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
> > > > >               int mapcount = page_mapcount(page);
> > > > >
> > > > >               if (mapcount >= 2)
> > > > > -                     mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
> > > > > +                     mss->shared_hugetlb += hugetlb_pte_size(hpte);
> > > > >               else
> > > > > -                     mss->private_hugetlb += huge_page_size(hstate_vma(vma));
> > > > > +                     mss->private_hugetlb += hugetlb_pte_size(hpte);
> > > > >       }
> > > > >       return 0;
> > > >
> > > > One thing interesting I found with hgm right now is mostly everything will
> > > > be counted as "shared" here, I think it's because mapcount is accounted
> > > > always to the huge page even if mapped in smaller sizes, so page_mapcount()
> > > > to a small page should be huge too because the head page mapcount should be
> > > > huge.  I'm curious the reasons behind the mapcount decision.
> > > >
> > > > For example, would that risk overflow with head_compound_mapcount?  One 1G
> > > > page mapping all 4K takes 0.25M counts, while the limit should be 2G for
> > > > atomic_t.  Looks like it's possible.
> > >
> > > The original mapcount approach was "if the hstate-level PTE is
> > > present, increment the compound_mapcount by 1" (basically "if any of
> > > the hugepage is mapped, increment the compound_mapcount by 1"), but
> > > this was painful to implement,
> >
> > Any more info here on why it was painful?  What is the major blocker?
> 
> The original approach was implemented in RFC v1, but the
> implementation was broken: the way refcount was handled was wrong; it
> was incremented once for each new page table mapping. (How?
> find_lock_page(), called once per hugetlb_no_page/UFFDIO_CONTINUE
> would increment refcount and we wouldn't drop it, and in
> __unmap_hugepage_range(), the mmu_gather bits would decrement the
> refcount once per mapping.)

I'm not sure I fully get the point here, but perhaps it's mostly about how
hugetlb page cache is managed (in hstate size not PAGE_SIZE)?

static inline struct page *folio_file_page(struct folio *folio, pgoff_t index)
{
	/* HugeTLBfs indexes the page cache in units of hpage_size */
	if (folio_test_hugetlb(folio))
		return &folio->page;
	return folio_page(folio, index & (folio_nr_pages(folio) - 1));
}

I haven't thought that through either.  Is it possible that we switch
the pgcache layout to be in PAGE_SIZE granularity too when HGM is enabled
(e.g. a simple scheme would be to fail MADV_SPLIT if the hugetlb pgcache
already contains any page)?

If we keep using the same pgcache scheme (hpage_size stepped indexes),
find_lock_page() will also easily contend on the head page lock, so we may
not be able to handle concurrent page faults on small mappings on the same
page as efficiently as thp.

> 
> At the time, I figured the complexity of handling mapcount AND
> refcount correctly in the original approach would be quite complex, so
> I switched to the new one.
> 
> 1. In places that already change the mapcount, check that we're
> installing the hstate-level PTE, not a high-granularity PTE. Adjust
> mapcount AND refcount appropriately.
> 2. In the HGM walking bits, to the caller if we made the hstate-level
> PTE present. (hugetlb_[pmd,pte]_alloc is the source of truth.) Need to
> keep track of this until we figure out which page we're allocating
> PTEs for, then change mapcount/refcount appropriately.
> 3. In unmapping bits, change mmu_gather/tlb bits to drop refcount only
> once per hugepage. (This is probably the hardest of these three things
> to get right.)
> 
> >
> > > so I changed it to what it is now (each new PT mapping increments the
> > > compound_mapcount by 1). I think you're right, there is absolutely an
> > > overflow risk. :( I'm not sure what the best solution is. I could just go
> > > back to the old approach.
> >
> > No rush on that; let's discuss it thoroughly before doing anything.  We
> > have more context than when it was discussed initially in the calls, so
> > maybe a good time to revisit.
> >
> > >
> > > Regarding when things are accounted in private_hugetlb vs.
> > > shared_hugetlb, HGM shouldn't change that, because it only applies to
> > > shared mappings (right now anyway). It seems like "private_hugetlb"
> > > can include cases where the page is shared but has only one mapping,
> > > in which case HGM will change it from "private" to "shared".
> >
> > The two fields are not defined against VM_SHARED, it seems.  At least not
> > with current code base.
> >
> > Let me quote the code again just to be clear:
> >
> >                 int mapcount = page_mapcount(page);    <------------- [1]
> >
> >                 if (mapcount >= 2)
> >                         mss->shared_hugetlb += hugetlb_pte_size(hpte);
> >                 else
> >                         mss->private_hugetlb += hugetlb_pte_size(hpte);
> >
> >                 smaps_hugetlb_hgm_account(mss, hpte);
> >
> > So that information (for some reason) is only relevant to how many mapcount
> > is there.  If we have one 1G page and only two 4K mapped, with the existing
> > logic we should see 8K private_hugetlb while in fact I think it should be
> > 8K shared_hugetlb due to page_mapcount() taking account of both 4K mappings
> > (as they all goes back to the head).
> >
> > I have no idea whether violating that will be a problem or not, I suppose
> > at least it needs justification if it will be violated, or hopefully it can
> > be kept as-is.
> 
> Agreed that this is a problem. I'm not sure what should be done here.
> It seems like the current upstream implementation is incorrect (surely
> MAP_SHARED with only one mapping should still be shared_hugetlb not
> private_hugetlb); the check should really be `if (vma->vm_flags &
> VM_MAYSHARE)` instead of `mapcount >= 2`. If that change can be taken,
> we don't have a problem here.

Naoya added relevant code.  Not sure whether he'll chime in.

commit 25ee01a2fca02dfb5a3ce316e77910c468108199
Author: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Date:   Thu Nov 5 18:47:11 2015 -0800

    mm: hugetlb: proc: add hugetlb-related fields to /proc/PID/smaps

In all cases, it'll still be a slight ABI change, so it's worth considering
the effects on existing users.

> 
> >
> > >
> > > >
> > > > Btw, are the small page* pointers still needed in the latest HGM design?
> > > > Is there code taking care of disabling of hugetlb vmemmap optimization for
> > > > HGM?  Or maybe it's not needed anymore for the current design?
> > >
> > > The hugetlb vmemmap optimization can still be used with HGM, so there
> > > is no code to disable it. We don't need small page* pointers either,
> > > except for when we're dealing with mapping subpages, like in
> > > hugetlb_no_page. Everything else can handle the hugetlb page as a
> > > folio.
> > >
> > > I hope we can keep compatibility with the vmemmap optimization while
> > > solving the mapcount overflow risk.
> >
> > Yeh that'll be perfect if it works.  But afaiu even with your current
> > design (ignoring all the issues on either smaps accounting or overflow
> > risks), we already referenced the small pages, aren't we?  See:
> >
> > static inline int page_mapcount(struct page *page)
> > {
> >         int mapcount = atomic_read(&page->_mapcount) + 1;  <-------- here
> >
> >         if (likely(!PageCompound(page)))
> >                 return mapcount;
> >         page = compound_head(page);
> >         return head_compound_mapcount(page) + mapcount;
> > }
> >
> > Even if we assume small page->_mapcount should always be zero in this case,
> > we may need to take special care of hugetlb pages in page_mapcount() to not
> > reference it at all.  Or I think it's reading random values and some days
> > it can be non-zero.
> 
> IIUC, it's ok to read from all the hugetlb subpage structs, you just
> can't *write* to them (except the first few). The first page of page
> structs is mapped RW; all the others are mapped RO to a single
> physical page.

I'm not familiar with vmemmap work, but I saw this:

	/*
	 * Remap the vmemmap virtual address range [@vmemmap_start, @vmemmap_end)
	 * to the page which @vmemmap_reuse is mapped to, then free the pages
	 * which the range [@vmemmap_start, @vmemmap_end] is mapped to.
	 */
	if (vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse))

It seems it'll reuse the 1st page of the huge page* rather than backing the
rest of the vmemmap with zero pages.  Would that be a problem (as I think
some small page* can actually point to e.g. the head page* if referenced)?

> 
> >
> > The other approach is probably using the thp approach.  After Hugh's rework
> > on the thp accounting I assumed it would be even cleaner (at least no
> > DoubleMap complexity anymore.. even though I can't say I fully digested the
> > whole history of that).  It's all about what's the major challenges of
> > using the same approach there with thp.  You may have more knowledge on
> > that aspect so I'd be willing to know.
> 
> I need to take a closer look at Hugh's approach to see if we can do it
> the same way. (I wonder if the 1G THP series has some ideas too.)

https://lore.kernel.org/all/47ad693-717-79c8-e1ba-46c3a6602e48@google.com/

It's already in the mainline.  I think it's mostly internally implemented
under the rmap code APIs.  For the HGM effort, I'd start with simply
passing around compound=false when using the rmap APIs, and see what'll
start to fail.

> 
> A really simple solution could be just to prevent userspace from doing
> MADV_SPLIT (or, if we are enabling HGM due to hwpoison, ignore the
> poison) if it could result in a mapcount overflow. For 1G pages,
> userspace would need 8192 mappings to overflow mapcount/refcount.

I'm not sure you can calculate it; consider fork()s along the way when pmd
sharing is disabled, or whatever other reasons lead to the 1G pages being
mapped in multiple places with more than one mmap().

Thanks,
  
James Houghton Jan. 12, 2023, 9:17 p.m. UTC | #8
On Thu, Jan 12, 2023 at 3:27 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Jan 12, 2023 at 11:45:40AM -0500, James Houghton wrote:
> > On Thu, Jan 12, 2023 at 10:29 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Thu, Jan 12, 2023 at 09:06:48AM -0500, James Houghton wrote:
> > > > On Wed, Jan 11, 2023 at 5:58 PM Peter Xu <peterx@redhat.com> wrote:
> > > > >
> > > > > James,
> > > > >
> > > > > On Thu, Jan 05, 2023 at 10:18:19AM +0000, James Houghton wrote:
> > > > > > @@ -751,9 +761,9 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
> > > > > >               int mapcount = page_mapcount(page);
> > > > > >
> > > > > >               if (mapcount >= 2)
> > > > > > -                     mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
> > > > > > +                     mss->shared_hugetlb += hugetlb_pte_size(hpte);
> > > > > >               else
> > > > > > -                     mss->private_hugetlb += huge_page_size(hstate_vma(vma));
> > > > > > +                     mss->private_hugetlb += hugetlb_pte_size(hpte);
> > > > > >       }
> > > > > >       return 0;
> > > > >
> > > > > One thing interesting I found with hgm right now is mostly everything will
> > > > > be counted as "shared" here, I think it's because mapcount is accounted
> > > > > always to the huge page even if mapped in smaller sizes, so page_mapcount()
> > > > > to a small page should be huge too because the head page mapcount should be
> > > > > huge.  I'm curious the reasons behind the mapcount decision.
> > > > >
> > > > > For example, would that risk overflow with head_compound_mapcount?  One 1G
> > > > > page mapping all 4K takes 0.25M counts, while the limit should be 2G for
> > > > > atomic_t.  Looks like it's possible.
> > > >
> > > > The original mapcount approach was "if the hstate-level PTE is
> > > > present, increment the compound_mapcount by 1" (basically "if any of
> > > > the hugepage is mapped, increment the compound_mapcount by 1"), but
> > > > this was painful to implement,
> > >
> > > Any more info here on why it was painful?  What is the major blocker?
> >
> > The original approach was implemented in RFC v1, but the
> > implementation was broken: the way refcount was handled was wrong; it
> > was incremented once for each new page table mapping. (How?
> > find_lock_page(), called once per hugetlb_no_page/UFFDIO_CONTINUE
> > would increment refcount and we wouldn't drop it, and in
> > __unmap_hugepage_range(), the mmu_gather bits would decrement the
> > refcount once per mapping.)
>
> I'm not sure I fully get the point here, but perhaps it's mostly about how
> hugetlb page cache is managed (in hstate size not PAGE_SIZE)?
>
> static inline struct page *folio_file_page(struct folio *folio, pgoff_t index)
> {
>         /* HugeTLBfs indexes the page cache in units of hpage_size */
>         if (folio_test_hugetlb(folio))
>                 return &folio->page;
>         return folio_page(folio, index & (folio_nr_pages(folio) - 1));
> }
>
> I haven't thought through on that either.  Is it possible that we switche
> the pgcache layout to be in PAGE_SIZE granule too when HGM enabled (e.g. a
> simple scheme is we can fail MADV_SPLIT if hugetlb pgcache contains any
> page already).
>
> If keep using the same pgcache scheme (hpage_size stepped indexes),
> find_lock_page() will also easily content on the head page lock, so we may
> not be able to handle concurrent page faults on small mappings on the same
> page as efficient as thp.

We keep the pgcache scheme the same: hpage_size stepped indexes. The
refcount and mapcount are both stored in the head page. The problems
with the implementation in RFC v1 were (1) refcount became much much
larger than mapcount, and (2) refcount would have the same overflow
problem we're discussing.

You're right -- find_lock_page() in hugetlb_no_page() and
hugetlb_mcopy_atomic_pte() will contend on the head page lock, but I
think it's possible to improve things here (i.e., replace them with
find_get_page() somehow).

>
> >
> > At the time, I figured the complexity of handling mapcount AND
> > refcount correctly in the original approach would be quite complex, so
> > I switched to the new one.
> >
> > 1. In places that already change the mapcount, check that we're
> > installing the hstate-level PTE, not a high-granularity PTE. Adjust
> > mapcount AND refcount appropriately.
> > 2. In the HGM walking bits, to the caller if we made the hstate-level
> > PTE present. (hugetlb_[pmd,pte]_alloc is the source of truth.) Need to
> > keep track of this until we figure out which page we're allocating
> > PTEs for, then change mapcount/refcount appropriately.
> > 3. In unmapping bits, change mmu_gather/tlb bits to drop refcount only
> > once per hugepage. (This is probably the hardest of these three things
> > to get right.)
> >
> > >
> > > > so I changed it to what it is now (each new PT mapping increments the
> > > > compound_mapcount by 1). I think you're right, there is absolutely an
> > > > overflow risk. :( I'm not sure what the best solution is. I could just go
> > > > back to the old approach.
> > >
> > > No rush on that; let's discuss it thoroughly before doing anything.  We
> > > have more context than when it was discussed initially in the calls, so
> > > maybe a good time to revisit.
> > >
> > > >
> > > > Regarding when things are accounted in private_hugetlb vs.
> > > > shared_hugetlb, HGM shouldn't change that, because it only applies to
> > > > shared mappings (right now anyway). It seems like "private_hugetlb"
> > > > can include cases where the page is shared but has only one mapping,
> > > > in which case HGM will change it from "private" to "shared".
> > >
> > > The two fields are not defined against VM_SHARED, it seems.  At least not
> > > with current code base.
> > >
> > > Let me quote the code again just to be clear:
> > >
> > >                 int mapcount = page_mapcount(page);    <------------- [1]
> > >
> > >                 if (mapcount >= 2)
> > >                         mss->shared_hugetlb += hugetlb_pte_size(hpte);
> > >                 else
> > >                         mss->private_hugetlb += hugetlb_pte_size(hpte);
> > >
> > >                 smaps_hugetlb_hgm_account(mss, hpte);
> > >
> > > So that information (for some reason) is only relevant to how many mapcount
> > > is there.  If we have one 1G page and only two 4K mapped, with the existing
> > > logic we should see 8K private_hugetlb while in fact I think it should be
> > > 8K shared_hugetlb due to page_mapcount() taking account of both 4K mappings
> > > (as they all goes back to the head).
> > >
> > > I have no idea whether violating that will be a problem or not, I suppose
> > > at least it needs justification if it will be violated, or hopefully it can
> > > be kept as-is.
> >
> > Agreed that this is a problem. I'm not sure what should be done here.
> > It seems like the current upstream implementation is incorrect (surely
> > MAP_SHARED with only one mapping should still be shared_hugetlb not
> > private_hugetlb); the check should really be `if (vma->vm_flags &
> > VM_MAYSHARE)` instead of `mapcount >= 2`. If that change can be taken,
> > we don't have a problem here.
>
> Naoya added relevant code.  Not sure whether he'll chim in.
>
> commit 25ee01a2fca02dfb5a3ce316e77910c468108199
> Author: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Date:   Thu Nov 5 18:47:11 2015 -0800
>
>     mm: hugetlb: proc: add hugetlb-related fields to /proc/PID/smaps
>
> In all cases, it'll still be a slight ABI change, so worth considering the
> effects to existing users.
>
> >
> > >
> > > >
> > > > >
> > > > > Btw, are the small page* pointers still needed in the latest HGM design?
> > > > > Is there code taking care of disabling of hugetlb vmemmap optimization for
> > > > > HGM?  Or maybe it's not needed anymore for the current design?
> > > >
> > > > The hugetlb vmemmap optimization can still be used with HGM, so there
> > > > is no code to disable it. We don't need small page* pointers either,
> > > > except for when we're dealing with mapping subpages, like in
> > > > hugetlb_no_page. Everything else can handle the hugetlb page as a
> > > > folio.
> > > >
> > > > I hope we can keep compatibility with the vmemmap optimization while
> > > > solving the mapcount overflow risk.
> > >
> > > Yeh that'll be perfect if it works.  But afaiu even with your current
> > > design (ignoring all the issues on either smaps accounting or overflow
> > > risks), we already referenced the small pages, aren't we?  See:
> > >
> > > static inline int page_mapcount(struct page *page)
> > > {
> > >         int mapcount = atomic_read(&page->_mapcount) + 1;  <-------- here
> > >
> > >         if (likely(!PageCompound(page)))
> > >                 return mapcount;
> > >         page = compound_head(page);
> > >         return head_compound_mapcount(page) + mapcount;
> > > }
> > >
> > > Even if we assume small page->_mapcount should always be zero in this case,
> > > we may need to take special care of hugetlb pages in page_mapcount() to not
> > > reference it at all.  Or I think it's reading random values and some days
> > > it can be non-zero.
> >
> > IIUC, it's ok to read from all the hugetlb subpage structs, you just
> > can't *write* to them (except the first few). The first page of page
> > structs is mapped RW; all the others are mapped RO to a single
> > physical page.
>
> I'm not familiar with vmemmap work, but I saw this:
>
>         /*
>          * Remap the vmemmap virtual address range [@vmemmap_start, @vmemmap_end)
>          * to the page which @vmemmap_reuse is mapped to, then free the pages
>          * which the range [@vmemmap_start, @vmemmap_end] is mapped to.
>          */
>         if (vmemmap_remap_free(vmemmap_start, vmemmap_end, vmemmap_reuse))
>
> It seems it'll reuse the 1st page of the huge page* rather than backing the
> rest vmemmap with zero pages.  Would that be a problem (as I think some
> small page* can actually points to the e.g. head page* if referenced)?

It shouldn't be a problem if we don't use _mapcount and always use
compound_mapcount, which is how mapcount is handled in this version of
this series.

>
> >
> > >
> > > The other approach is probably using the thp approach.  After Hugh's rework
> > > on the thp accounting I assumed it would be even cleaner (at least no
> > > DoubleMap complexity anymore.. even though I can't say I fully digested the
> > > whole history of that).  It's all about what's the major challenges of
> > > using the same approach there with thp.  You may have more knowledge on
> > > that aspect so I'd be willing to know.
> >
> > I need to take a closer look at Hugh's approach to see if we can do it
> > the same way. (I wonder if the 1G THP series has some ideas too.)
>
> https://lore.kernel.org/all/47ad693-717-79c8-e1ba-46c3a6602e48@google.com/
>
> It's already in the mainline.  I think it's mostly internally implemented
> under the rmap code APIs.  For the HGM effort, I'd start with simply
> passing around compound=false when using the rmap APIs, and see what'll
> start to fail.

I'll look into it, but doing it this way will use _mapcount, so we
won't be able to use the vmemmap optimization. I think even if we do
use Hugh's approach, refcount is still being kept on the head page, so
there's still an overflow risk there (but maybe I am
misunderstanding).

>
> >
> > A really simple solution could be just to prevent userspace from doing
> > MADV_SPLIT (or, if we are enabling HGM due to hwpoison, ignore the
> > poison) if it could result in a mapcount overflow. For 1G pages,
> > userspace would need 8192 mappings to overflow mapcount/refcount.
>
> I'm not sure you can calculate it; consider fork()s along the way when pmd
> sharing disabled, or whatever reason when the 1G pages mapped at multiple
> places with more than one mmap()s.

Yeah you're right. :(
  
Peter Xu Jan. 12, 2023, 9:33 p.m. UTC | #9
On Thu, Jan 12, 2023 at 04:17:52PM -0500, James Houghton wrote:
> I'll look into it, but doing it this way will use _mapcount, so we
> won't be able to use the vmemmap optimization. I think even if we do
> use Hugh's approach, refcount is still being kept on the head page, so
> there's still an overflow risk there (but maybe I am
> misunderstanding).

Could you remind me what's the issue if using refcount on the small pages
rather than the head (assuming vmemmap still can be disabled)?

Thanks,
  
David Hildenbrand Jan. 16, 2023, 10:17 a.m. UTC | #10
On 12.01.23 22:33, Peter Xu wrote:
> On Thu, Jan 12, 2023 at 04:17:52PM -0500, James Houghton wrote:
>> I'll look into it, but doing it this way will use _mapcount, so we
>> won't be able to use the vmemmap optimization. I think even if we do
>> use Hugh's approach, refcount is still being kept on the head page, so
>> there's still an overflow risk there (but maybe I am
>> misunderstanding).
> 
> Could you remind me what's the issue if using refcount on the small pages
> rather than the head (assuming vmemmap still can be disabled)?

The THP-way of doing things is refcounting on the head page. All folios 
use a single refcount on the head.

There has to be a pretty good reason to do it differently.
  
James Houghton Jan. 17, 2023, 11:11 p.m. UTC | #11
On Mon, Jan 16, 2023 at 2:17 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 12.01.23 22:33, Peter Xu wrote:
> > On Thu, Jan 12, 2023 at 04:17:52PM -0500, James Houghton wrote:
> >> I'll look into it, but doing it this way will use _mapcount, so we
> >> won't be able to use the vmemmap optimization. I think even if we do
> >> use Hugh's approach, refcount is still being kept on the head page, so
> >> there's still an overflow risk there (but maybe I am
> >> misunderstanding).
> >
> > Could you remind me what's the issue if using refcount on the small pages
> > rather than the head (assuming vmemmap still can be disabled)?
>
> The THP-way of doing things is refcounting on the head page. All folios
> use a single refcount on the head.
>
> There has to be a pretty good reason to do it differently.

Peter and I have discussed this a lot offline. There are two main problems here:

1. Refcount overflow

Refcount is always kept on the head page (before and after this
series). IIUC, this means that if THPs could be 1G in size, they too
would be susceptible to the same potential overflow. How easy is the
overflow? [1]

To deal with this, the best solution we've been able to come up with
is to check if refcount is > INT_MAX/2 (similar to try_get_page()),
and if it is, stop the operation (UFFDIO_CONTINUE or a page fault)
from proceeding. In the UFFDIO_CONTINUE case, return ENOMEM. In the
page fault case, return VM_FAULT_SIGBUS (not VM_FAULT_OOM; we don't
want to kill a random process).

(So David, I think this answers your question. Refcount should be
handled just like THPs.)
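
To sketch what such a guard could look like (illustrative only;
hugetlb_hgm_refcount_ok() is a made-up name, and the exact threshold and
placement would need discussion):

	static inline bool hugetlb_hgm_refcount_ok(struct folio *folio)
	{
		/* refuse new high-granularity mappings once the refcount is huge */
		return folio_ref_count(folio) <= INT_MAX / 2;
	}

	/* UFFDIO_CONTINUE path: */
	if (!hugetlb_hgm_refcount_ok(folio))
		return -ENOMEM;

	/* page fault path: */
	if (!hugetlb_hgm_refcount_ok(folio))
		return VM_FAULT_SIGBUS;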

2. page_mapcount() API differences

In this series, page_mapcount() returns the total number of page table
references for the compound page. For example, if you have a
PTE-mapped 2M page (with no other mappings), page_mapcount() for each
4K page will be 512. This is not the same as a THP: page_mapcount()
would return 1 for each page. Because of the difference in
page_mapcount(), we have 4 problems:

i. Smaps uses page_mapcount() >= 2 to determine if hugetlb memory is
"private_hugetlb" or "shared_hugetlb".
ii. Migration with MPOL_MF_MOVE will check page_mapcount() to see if
the hugepage is shared or not. Pages that would otherwise be migrated
now require MPOL_MF_MOVE_ALL to be migrated.
[Really both of the above are checking how many VMAs are mapping our hugepage.]
iii. CoW. This isn't a problem right now because CoW is only possible
with MAP_PRIVATE VMAs and HGM can only be enabled for MAP_SHARED VMAs.
iv. The hwpoison handling code will check if it successfully unmapped
the poisoned page. This isn't a problem right now, as hwpoison will
unmap all the mappings for the hugepage, not just the 4K where the
poison was found.

Doing it this way allows HGM to remain compatible with the hugetlb
vmemmap optimization. None of the above problems strike me as
particularly major, but it's unclear to me how important it is to have
page_mapcount() have a consistent meaning for hugetlb vs non-hugetlb.

The other way page_mapcount() (let's say the "THP-like way") could be
done is like this: increment compound mapcount if we're mapping a
hugetlb page normally (e.g., 1G page with a PUD). If we're mapping at
high-granularity, increment the mapcount for each 4K page that is
getting mapped (e.g., PMD within a 1G page: increment the mapcount for
the 512 pages that are now mapped). This yields the same
page_mapcount() API we had before, but we lose the hugetlb vmemmap
optimization.
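
As a sketch of that alternative (purely illustrative, not code from the
series; inc_compound_mapcount()/inc_subpage_mapcount() stand in for
whatever rmap helpers would actually be used):

	static void hgm_account_new_mapping(struct page *hpage,
					    struct page *subpage,
					    struct hugetlb_pte *hpte,
					    unsigned long hpage_size)
	{
		unsigned long i, nr = hugetlb_pte_size(hpte) / PAGE_SIZE;

		if (hugetlb_pte_size(hpte) == hpage_size) {
			/* normal hstate-sized mapping: one compound mapcount */
			inc_compound_mapcount(hpage);
			return;
		}

		/* high-granularity mapping: count each mapped base page, THP-style */
		for (i = 0; i < nr; i++)
			inc_subpage_mapcount(subpage + i);
	}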

We could introduce an API like hugetlb_vma_mapcount() that would, for
hugetlb, give us the number of VMAs that map a hugepage, but I don't
think people would like this.

I'm curious what others think (Mike, Matthew?). I'm guessing the
THP-like way is probably what most people would want, though it would
be a real shame to lose the vmemmap optimization.

- James

[1] INT_MAX is 2^31. We increment the refcount once for each 4K
mapping in 1G: that's 512 * 512 (2^18). That means we only need 8192
(2^13) VMAs for the same 1G page to overflow refcount. I think it's
safe to say that if userspace is doing this, they are attempting to
overflow refcount.
  
David Hildenbrand Jan. 18, 2023, 9:43 a.m. UTC | #12
On 18.01.23 00:11, James Houghton wrote:
> On Mon, Jan 16, 2023 at 2:17 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 12.01.23 22:33, Peter Xu wrote:
>>> On Thu, Jan 12, 2023 at 04:17:52PM -0500, James Houghton wrote:
>>>> I'll look into it, but doing it this way will use _mapcount, so we
>>>> won't be able to use the vmemmap optimization. I think even if we do
>>>> use Hugh's approach, refcount is still being kept on the head page, so
>>>> there's still an overflow risk there (but maybe I am
>>>> misunderstanding).
>>>
>>> Could you remind me what's the issue if using refcount on the small pages
>>> rather than the head (assuming vmemmap still can be disabled)?
>>
>> The THP-way of doing things is refcounting on the head page. All folios
>> use a single refcount on the head.
>>
>> There has to be a pretty good reason to do it differently.
> 
> Peter and I have discussed this a lot offline. There are two main problems here:
> 
> 1. Refcount overflow
> 
> Refcount is always kept on the head page (before and after this
> series). IIUC, this means that if THPs could be 1G in size, they too
> would be susceptible to the same potential overflow. How easy is the
> overflow? [1]

Right. You'd need 8k VMAs. With 2 MiB THP you'd need 4096k VMAs. So ~64 
processes with 64k VMAs. Not impossible to achieve if one really wants 
to break the system ...

Side note: a long long time ago, we used to have sub-page refcounts for 
THP. IIRC, that was even before we had sub-page mapcounts and was used 
to make COW decisions.

> 
> To deal with this, the best solution we've been able to come up with
> is to check if refcount is > INT_MAX/2 (similar to try_get_page()),
> and if it is, stop the operation (UFFDIO_CONTINUE or a page fault)
> from proceeding. In the UFFDIO_CONTINUE case, return ENOMEM. In the
> page fault cause, return VM_FAULT_SIGBUS (not VM_FAULT_OOM; we don't
> want to kill a random process).

You'd have to also make sure that fork() won't do the same. At least 
with uffd-wp, Peter also added page table copying during fork() for 
MAP_SHARED mappings, which would have to be handled.

Of course, one can just disallow fork() with any HGM right from the 
start and keep it all simpler to not open up a can of worms there.

Is it reasonable to have more than one (or a handful of) VMAs mapping a
huge page via HGM? Restricting it to a single one would make handling
much easier.

If there is ever demand for more HGM mappings, that whole problem (and 
complexity) could be dealt with later. ... but I assume it will already 
be a requirement for VMs (e.g., under QEMU) that share memory with other 
processes (virtiofsd and friends?)


> 
> (So David, I think this answers your question. Refcount should be
> handled just like THPs.)
> 
> 2. page_mapcount() API differences
> 
> In this series, page_mapcount() returns the total number of page table
> references for the compound page. For example, if you have a
> PTE-mapped 2M page (with no other mappings), page_mapcount() for each
> 4K page will be 512. This is not the same as a THP: page_mapcount()
> would return 1 for each page. Because of the difference in
> page_mapcount(), we have 4 problems:

IMHO, it would actually be great to just be able to remove the sub-page 
mapcounts for THP and make it all simpler.

Right now, the sub-page mapcount is mostly required for making COW 
decisions, but only for accounting purposes IIRC (NR_ANON_THPS, 
NR_SHMEM_PMDMAPPED, NR_FILE_PMDMAPPED) and mlock handling IIRC. See 
page_remove_rmap().

If we can avoid that complexity right from the start for hugetlb, great, ..

> 
> i. Smaps uses page_mapcount() >= 2 to determine if hugetlb memory is
> "private_hugetlb" or "shared_hugetlb".
> ii. Migration with MPOL_MF_MOVE will check page_mapcount() to see if
> the hugepage is shared or not. Pages that would otherwise be migrated
> now require MPOL_MF_MOVE_ALL to be migrated.
> [Really both of the above are checking how many VMAs are mapping our hugepage.]
> iii. CoW. This isn't a problem right now because CoW is only possible
> with MAP_PRIVATE VMAs and HGM can only be enabled for MAP_SHARED VMAs.
> iv. The hwpoison handling code will check if it successfully unmapped
> the poisoned page. This isn't a problem right now, as hwpoison will
> unmap all the mappings for the hugepage, not just the 4K where the
> poison was found.
> 
> Doing it this way allows HGM to remain compatible with the hugetlb
> vmemmap optimization. None of the above problems strike me as
> particularly major, but it's unclear to me how important it is to have
> page_mapcount() have a consistent meaning for hugetlb vs non-hugetlb.

See below, maybe we should tackle HGM from a different direction.

> 
> The other way page_mapcount() (let's say the "THP-like way") could be
> done is like this: increment compound mapcount if we're mapping a
> hugetlb page normally (e.g., 1G page with a PUD). If we're mapping at
> high-granularity, increment the mapcount for each 4K page that is
> getting mapped (e.g., PMD within a 1G page: increment the mapcount for
> the 512 pages that are now mapped). This yields the same
> page_mapcount() API we had before, but we lose the hugetlb vmemmap
> optimization.
> 
> We could introduce an API like hugetlb_vma_mapcount() that would, for
> hugetlb, give us the number of VMAs that map a hugepage, but I don't
> think people would like this.
> 
> I'm curious what others think (Mike, Matthew?). I'm guessing the
> THP-like way is probably what most people would want, though it would
> be a real shame to lose the vmemmap optimization.

Heh, not me :) Having a single mapcount is certainly much cleaner. ... 
and if we're dealing with refcount overflows already, mapcount overflows 
are not an issue.


I wonder if the following crazy idea has already been discussed: treat 
the whole mapping as a single large logical mapping. One reference and 
one mapping, no matter how the individual parts are mapped into the 
assigned page table sub-tree.

Because for hugetlb with MAP_SHARED, we know that the complete assigned 
sub-tree of page tables can only map the given hugetlb page, no 
fragments of something else. That's very different to THP in private 
mappings ...

So as soon as the first piece gets mapped, we increment 
refcount+mapcount. Other pieces in the same subtree don't do that.

Once the last piece is unmapped (or simpler: once the complete subtree 
of page tables is gone), we decrement refcount+mapcount. Might require 
some brain power to do this tracking, but I wouldn't call it impossible 
right from the start.

Would such a design violate other design aspects that are important?
  
Peter Xu Jan. 18, 2023, 3:35 p.m. UTC | #13
On Wed, Jan 18, 2023 at 10:43:47AM +0100, David Hildenbrand wrote:
> On 18.01.23 00:11, James Houghton wrote:
> > On Mon, Jan 16, 2023 at 2:17 AM David Hildenbrand <david@redhat.com> wrote:
> > > 
> > > On 12.01.23 22:33, Peter Xu wrote:
> > > > On Thu, Jan 12, 2023 at 04:17:52PM -0500, James Houghton wrote:
> > > > > I'll look into it, but doing it this way will use _mapcount, so we
> > > > > won't be able to use the vmemmap optimization. I think even if we do
> > > > > use Hugh's approach, refcount is still being kept on the head page, so
> > > > > there's still an overflow risk there (but maybe I am
> > > > > misunderstanding).
> > > > 
> > > > Could you remind me what's the issue if using refcount on the small pages
> > > > rather than the head (assuming vmemmap still can be disabled)?
> > > 
> > > The THP-way of doing things is refcounting on the head page. All folios
> > > use a single refcount on the head.
> > > 
> > > There has to be a pretty good reason to do it differently.
> > 
> > Peter and I have discussed this a lot offline. There are two main problems here:
> > 
> > 1. Refcount overflow
> > 
> > Refcount is always kept on the head page (before and after this
> > series). IIUC, this means that if THPs could be 1G in size, they too
> > would be susceptible to the same potential overflow. How easy is the
> > overflow? [1]
> 
> Right. You'd need 8k VMAs. With 2 MiB THP you'd need 4096k VMAs. So ~64
> processes with 64k VMAs. Not impossible to achieve if one really wants to
> break the system ...
> 
> Side note: a long long time ago, we used to have sub-page refcounts for THP.
> IIRC, that was even before we had sub-page mapcounts and was used to make
> COW decisions.
> 
> > 
> > To deal with this, the best solution we've been able to come up with
> > is to check if refcount is > INT_MAX/2 (similar to try_get_page()),
> > and if it is, stop the operation (UFFDIO_CONTINUE or a page fault)
> > from proceeding. In the UFFDIO_CONTINUE case, return ENOMEM. In the
> > page fault cause, return VM_FAULT_SIGBUS (not VM_FAULT_OOM; we don't
> > want to kill a random process).
> 
> You'd have to also make sure that fork() won't do the same. At least with
> uffd-wp, Peter also added page table copying during fork() for MAP_SHARED
> mappings, which would have to be handled.

If we want such a check to make a real difference, IIUC we may want to
consider having a similar check in:

        page_ref_add
        page_ref_inc
        page_ref_inc_return
        page_ref_add_unless

But it's unfortunate that almost all the callers of these functions
(especially the first two) do not have a retval at all.  Considering how
unlikely an overflow is so far, maybe that can also be left for later (and
I think checking for a negative count, as try_get_page() does, would
suffice too).
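
Just to illustrate what I mean (the helper name is made up, not a
proposal of the exact form), a checked variant could look like:

/*
 * Illustration only: a saturation check in the same spirit as
 * try_get_page(), for the callers that do have an error path.
 */
static inline bool page_ref_add_checked(struct page *page, int nr)
{
        if (page_ref_count(page) > INT_MAX / 2)
                return false;   /* too close to overflowing, refuse */
        page_ref_add(page, nr);
        return true;
}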

> 
> Of course, one can just disallow fork() with any HGM right from the start
> and keep it all simpler to not open up a can of worms there.
> 
> Is it reasonable, to have more than one (or a handful) of VMAs mapping a
> huge page via a HGM? Restricting it to a single one, would make handling
> much easier.
> 
> If there is ever demand for more HGM mappings, that whole problem (and
> complexity) could be dealt with later. ... but I assume it will already be a
> requirement for VMs (e.g., under QEMU) that share memory with other
> processes (virtiofsd and friends?)

Yes, any form of multi-proc QEMU will need that for supporting HGM
postcopy.

> 
> 
> > 
> > (So David, I think this answers your question. Refcount should be
> > handled just like THPs.)
> > 
> > 2. page_mapcount() API differences
> > 
> > In this series, page_mapcount() returns the total number of page table
> > references for the compound page. For example, if you have a
> > PTE-mapped 2M page (with no other mappings), page_mapcount() for each
> > 4K page will be 512. This is not the same as a THP: page_mapcount()
> > would return 1 for each page. Because of the difference in
> > page_mapcount(), we have 4 problems:
> 
> IMHO, it would actually be great to just be able to remove the sub-page
> mapcounts for THP and make it all simpler.
> 
> Right now, the sub-page mapcount is mostly required for making COW
> decisions, but only for accounting purposes IIRC (NR_ANON_THPS,
> NR_SHMEM_PMDMAPPED, NR_FILE_PMDMAPPED) and mlock handling IIRC. See
> page_remove_rmap().
> 
> If we can avoid that complexity right from the start for hugetlb, great, ..
> 
> > 
> > i. Smaps uses page_mapcount() >= 2 to determine if hugetlb memory is
> > "private_hugetlb" or "shared_hugetlb".
> > ii. Migration with MPOL_MF_MOVE will check page_mapcount() to see if
> > the hugepage is shared or not. Pages that would otherwise be migrated
> > now require MPOL_MF_MOVE_ALL to be migrated.
> > [Really both of the above are checking how many VMAs are mapping our hugepage.]
> > iii. CoW. This isn't a problem right now because CoW is only possible
> > with MAP_PRIVATE VMAs and HGM can only be enabled for MAP_SHARED VMAs.
> > iv. The hwpoison handling code will check if it successfully unmapped
> > the poisoned page. This isn't a problem right now, as hwpoison will
> > unmap all the mappings for the hugepage, not just the 4K where the
> > poison was found.
> > 
> > Doing it this way allows HGM to remain compatible with the hugetlb
> > vmemmap optimization. None of the above problems strike me as
> > particularly major, but it's unclear to me how important it is to have
> > page_mapcount() have a consistent meaning for hugetlb vs non-hugetlb.
> 
> See below, maybe we should tackle HGM from a different direction.
> 
> > 
> > The other way page_mapcount() (let's say the "THP-like way") could be
> > done is like this: increment compound mapcount if we're mapping a
> > hugetlb page normally (e.g., 1G page with a PUD). If we're mapping at
> > high-granularity, increment the mapcount for each 4K page that is
> > getting mapped (e.g., PMD within a 1G page: increment the mapcount for
> > the 512 pages that are now mapped). This yields the same
> > page_mapcount() API we had before, but we lose the hugetlb vmemmap
> > optimization.
> > 
> > We could introduce an API like hugetlb_vma_mapcount() that would, for
> > hugetlb, give us the number of VMAs that map a hugepage, but I don't
> > think people would like this.
> > 
> > I'm curious what others think (Mike, Matthew?). I'm guessing the
> > THP-like way is probably what most people would want, though it would
> > be a real shame to lose the vmemmap optimization.
> 
> Heh, not me :) Having a single mapcount is certainly much cleaner. ... and
> if we're dealing with refcount overflows already, mapcount overflows are not
> an issue.
> 
> 
> I wonder if the following crazy idea has already been discussed: treat the
> whole mapping as a single large logical mapping. One reference and one
> mapping, no matter how the individual parts are mapped into the assigned
> page table sub-tree.
> 
> Because for hugetlb with MAP_SHARED, we know that the complete assigned
> sub-tree of page tables can only map the given hugetlb page, no fragments of
> something else. That's very different to THP in private mappings ...
> 
> So as soon as the first piece gets mapped, we increment refcount+mapcount.
> Other pieces in the same subtree don't do that.
> 
> Once the last piece is unmapped (or simpler: once the complete subtree of
> page tables is gone), we decrement refcount+mapcount. Might require some
> brain power to do this tracking, but I wouldn't call it impossible right
> from the start.
> 
> Would such a design violate other design aspects that are important?

The question is how to maintain the above information.

It needs to be per-map (so one page mapped multiple times can be accounted
differently), and per-page (so one mapping/vma can contain multiple pages).
So far I think that's exactly the pgtable.  If we can squeeze information
into the pgtable it'll work out, but definitely not trivial.  Or we can
maintain separate allocations for such information, but that can be extra
overhead too.

So far I'd still consider going with reusing thp mapcounts, which will
mostly be what James mentioned above. The only difference is I'm not sure
whether we should allow mapping e.g. 2M ranges for 1G pages.  THP mapcount
doesn't have an intermediate layer to maintain mapcount information like 2M,
so to me it's easier if we start with only mapping either the hpage size or
PAGE_SIZE, with no intermediate sizes allowed.

Allowing intermediate mapping sizes at least seems error prone to me.
One example: if some pgtable walker finds a 2M entry, it may simply fetch
the PFN out of it, assuming it's a compound page that satisfies
PageHead(pfn)==true, but that starts to break here, because the 2M PFN
will only be a small-page pfn somewhere inside the 1G huge page.

To me, intermediate sized mappings are good to have but not required to
resolve the HGM problems, at least so far.  That said, I'm fine with looking
at what it'll look like if James would like to keep pursuing that direction.

Thanks,
  
James Houghton Jan. 18, 2023, 4:39 p.m. UTC | #14
> > > To deal with this, the best solution we've been able to come up with
> > > is to check if refcount is > INT_MAX/2 (similar to try_get_page()),
> > > and if it is, stop the operation (UFFDIO_CONTINUE or a page fault)
> > > from proceeding. In the UFFDIO_CONTINUE case, return ENOMEM. In the
> > > page fault cause, return VM_FAULT_SIGBUS (not VM_FAULT_OOM; we don't
> > > want to kill a random process).
> >
> > You'd have to also make sure that fork() won't do the same. At least with
> > uffd-wp, Peter also added page table copying during fork() for MAP_SHARED
> > mappings, which would have to be handled.

Indeed, thank you! copy_hugetlb_page_range() (and therefore fork() in
some cases) would need this check too.

>
> If we want such a check to make a real difference, IIUC we may want to
> consider having similar check in:
>
>         page_ref_add
>         page_ref_inc
>         page_ref_inc_return
>         page_ref_add_unless
>
> But it's unfortunate that mostly all the callers to these functions
> (especially the first two) do not have a retval yet at all.  Considering
> the low possibility so far on having it overflow, maybe it can also be done
> for later (and I think checking negative as try_get_page would suffice too).

I think as long as we annotate the few cases that userspace can
exploit to overflow refcount, we should be ok. I think this was the
same idea with try_get_page(): it is supposed to be used in places
that userspace could reasonably exploit to overflow refcount.

>
> >
> > Of course, one can just disallow fork() with any HGM right from the start
> > and keep it all simpler to not open up a can of worms there.
> >
> > Is it reasonable, to have more than one (or a handful) of VMAs mapping a
> > huge page via a HGM? Restricting it to a single one, would make handling
> > much easier.
> >
> > If there is ever demand for more HGM mappings, that whole problem (and
> > complexity) could be dealt with later. ... but I assume it will already be a
> > requirement for VMs (e.g., under QEMU) that share memory with other
> > processes (virtiofsd and friends?)
>
> Yes, any form of multi-proc QEMU will need that for supporting HGM
> postcopy.

+1. Disallowing fork() entirely is quite a restrictive limitation.

[snip]
> > > I'm curious what others think (Mike, Matthew?). I'm guessing the
> > > THP-like way is probably what most people would want, though it would
> > > be a real shame to lose the vmemmap optimization.
> >
> > Heh, not me :) Having a single mapcount is certainly much cleaner. ... and
> > if we're dealing with refcount overflows already, mapcount overflows are not
> > an issue.
> >
> >
> > I wonder if the following crazy idea has already been discussed: treat the
> > whole mapping as a single large logical mapping. One reference and one
> > mapping, no matter how the individual parts are mapped into the assigned
> > page table sub-tree.
> >
> > Because for hugetlb with MAP_SHARED, we know that the complete assigned
> > sub-tree of page tables can only map the given hugetlb page, no fragments of
> > something else. That's very different to THP in private mappings ...
> >
> > So as soon as the first piece gets mapped, we increment refcount+mapcount.
> > Other pieces in the same subtree don't do that.
> >
> > Once the last piece is unmapped (or simpler: once the complete subtree of
> > page tables is gone), we decrement refcount+mapcount. Might require some
> > brain power to do this tracking, but I wouldn't call it impossible right
> > from the start.
> >
> > Would such a design violate other design aspects that are important?

This is actually how mapcount was treated in HGM RFC v1 (though not
refcount); it is doable for both [2].

One caveat here: if a page is unmapped in small pieces, it is
difficult to know if the page is legitimately completely unmapped (we
would have to check all the PTEs in the page table). In RFC v1, I
sidestepped this caveat by saying that "page_mapcount() is incremented
if the hstate-level PTE is present". A single unmap on the whole
hugepage will clear the hstate-level PTE, thus decrementing the
mapcount.
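
In (rough, illustrative) code, the RFC v1 rule was essentially the
following -- variable names are placeholders, this is not the actual v1
code:

/* Map side: only the none -> present transition of the hstate-level
 * entry counts as "the hugepage got mapped". */
if (huge_pte_none(huge_ptep_get(hstate_ptep)))
        page_add_file_rmap(hpage, vma, true);
/* ... then install the leaf PTE(s) at whatever granularity below ... */

/* Unmap side: clearing the hstate-level entry is what decrements. */
huge_ptep_get_and_clear(mm, haddr, hstate_ptep);
page_remove_rmap(hpage, vma, true);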

On a related note, there still exists an (albeit minor) API difference
vs. THPs: a piece of a page that is legitimately unmapped can still
have a positive page_mapcount().

Given that this approach allows us to retain the hugetlb vmemmap
optimization (and it wouldn't require a horrible amount of
complexity), I prefer this approach over the THP-like approach.

>
> The question is how to maintaining above information.
>
> It needs to be per-map (so one page mapped multiple times can be accounted
> differently), and per-page (so one mapping/vma can contain multiple pages).
> So far I think that's exactly the pgtable.  If we can squeeze information
> into the pgtable it'll work out, but definitely not trivial.  Or we can
> maintain seperate allocates for such information, but that can be extra
> overheads too.

I don't think we necessarily need to check the page table if we allow
for the limitations stated above.

>
> So far I'd still consider going with reusing thp mapcounts, which will
> mostly be what James mentioned above. The only difference is I'm not sure
> whether we should allow mapping e.g. 2M ranges for 1G pages.  THP mapcount
> doesn't have intermediate layer to maintain mapcount information like 2M,
> so to me it's easier we start with only mapping either the hpage size or
> PAGE_SIZE, not any intermediate size allowed.
>
> Having intermediate size mapping allowed can at least be error prone to
> me.  One example is if some pgtable walker found a 2M page, it may easily
> fetch the PFN out of it, assuming it's a compound page and it should
> satisfy PageHead(pfn)==true but it'll start to break here, because the 2M
> PFN will only be a small page pfn for the 1G huge page in this case.

I assume you mean PageHuge(page). There are cases where assumptions
are made about hugetlb pages and VMAs; they need to be corrected. It
sounds like you're really talking about errors wrt missing a
compound_head()/page_folio(), but it can be even more general than
that. For example, the arm64 KVM MMU assumes that hugetlb HGM doesn't
exist, and so it needs to be fixed before HGM can be enabled for
arm64.

If a page is HGM-mapped, I think it's more error-prone to somehow make
PageHuge() come back with false. So the only solution I see here is
careful auditing and testing.

I don't really see how allowing/disallowing intermediate sizes changes
this problem either.

>
> To me, intermediate sized mappings are good to have but not required to
> resolve HGM problems, at least so far.  Said that, I'm fine with looking at
> what it'll look like if James would like to keep persuing that direction.

I've mentioned this to Peter already, but I don't think discarding
intermediate mapping levels really reduces complexity all that much.

[2]: Using the names of functions as of v1 (Peter has since suggested
a name change): hugetlb_alloc_{pte,pmd} will tell us if we really
allocated the PTE. That info can be passed up through
hugetlb_walk_step()=>hugetlb_hgm_walk() (where we only care if the
*hstate-level* PTE got allocated) and hugetlb_full_walk*() for the
caller to determine if refcount/mapcount should be incremented.

Thank you both for your thoughts so far. :)

- James
  
David Hildenbrand Jan. 18, 2023, 5:08 p.m. UTC | #15
On 18.01.23 16:35, Peter Xu wrote:
> On Wed, Jan 18, 2023 at 10:43:47AM +0100, David Hildenbrand wrote:
>> On 18.01.23 00:11, James Houghton wrote:
>>> On Mon, Jan 16, 2023 at 2:17 AM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 12.01.23 22:33, Peter Xu wrote:
>>>>> On Thu, Jan 12, 2023 at 04:17:52PM -0500, James Houghton wrote:
>>>>>> I'll look into it, but doing it this way will use _mapcount, so we
>>>>>> won't be able to use the vmemmap optimization. I think even if we do
>>>>>> use Hugh's approach, refcount is still being kept on the head page, so
>>>>>> there's still an overflow risk there (but maybe I am
>>>>>> misunderstanding).
>>>>>
>>>>> Could you remind me what's the issue if using refcount on the small pages
>>>>> rather than the head (assuming vmemmap still can be disabled)?
>>>>
>>>> The THP-way of doing things is refcounting on the head page. All folios
>>>> use a single refcount on the head.
>>>>
>>>> There has to be a pretty good reason to do it differently.
>>>
>>> Peter and I have discussed this a lot offline. There are two main problems here:
>>>
>>> 1. Refcount overflow
>>>
>>> Refcount is always kept on the head page (before and after this
>>> series). IIUC, this means that if THPs could be 1G in size, they too
>>> would be susceptible to the same potential overflow. How easy is the
>>> overflow? [1]
>>
>> Right. You'd need 8k VMAs. With 2 MiB THP you'd need 4096k VMAs. So ~64
>> processes with 64k VMAs. Not impossible to achieve if one really wants to
>> break the system ...
>>
>> Side note: a long long time ago, we used to have sub-page refcounts for THP.
>> IIRC, that was even before we had sub-page mapcounts and was used to make
>> COW decisions.
>>
>>>
>>> To deal with this, the best solution we've been able to come up with
>>> is to check if refcount is > INT_MAX/2 (similar to try_get_page()),
>>> and if it is, stop the operation (UFFDIO_CONTINUE or a page fault)
>>> from proceeding. In the UFFDIO_CONTINUE case, return ENOMEM. In the
>>> page fault cause, return VM_FAULT_SIGBUS (not VM_FAULT_OOM; we don't
>>> want to kill a random process).
>>
>> You'd have to also make sure that fork() won't do the same. At least with
>> uffd-wp, Peter also added page table copying during fork() for MAP_SHARED
>> mappings, which would have to be handled.
> 
> If we want such a check to make a real difference, IIUC we may want to
> consider having similar check in:
> 
>          page_ref_add
>          page_ref_inc
>          page_ref_inc_return
>          page_ref_add_unless
> 
> But it's unfortunate that mostly all the callers to these functions
> (especially the first two) do not have a retval yet at all.  Considering
> the low possibility so far on having it overflow, maybe it can also be done
> for later (and I think checking negative as try_get_page would suffice too).
> 
>>
>> Of course, one can just disallow fork() with any HGM right from the start
>> and keep it all simpler to not open up a can of worms there.
>>
>> Is it reasonable, to have more than one (or a handful) of VMAs mapping a
>> huge page via a HGM? Restricting it to a single one, would make handling
>> much easier.
>>
>> If there is ever demand for more HGM mappings, that whole problem (and
>> complexity) could be dealt with later. ... but I assume it will already be a
>> requirement for VMs (e.g., under QEMU) that share memory with other
>> processes (virtiofsd and friends?)
> 
> Yes, any form of multi-proc QEMU will need that for supporting HGM
> postcopy.
> 
>>
>>
>>>
>>> (So David, I think this answers your question. Refcount should be
>>> handled just like THPs.)
>>>
>>> 2. page_mapcount() API differences
>>>
>>> In this series, page_mapcount() returns the total number of page table
>>> references for the compound page. For example, if you have a
>>> PTE-mapped 2M page (with no other mappings), page_mapcount() for each
>>> 4K page will be 512. This is not the same as a THP: page_mapcount()
>>> would return 1 for each page. Because of the difference in
>>> page_mapcount(), we have 4 problems:
>>
>> IMHO, it would actually be great to just be able to remove the sub-page
>> mapcounts for THP and make it all simpler.
>>
>> Right now, the sub-page mapcount is mostly required for making COW
>> decisions, but only for accounting purposes IIRC (NR_ANON_THPS,
>> NR_SHMEM_PMDMAPPED, NR_FILE_PMDMAPPED) and mlock handling IIRC. See
>> page_remove_rmap().
>>
>> If we can avoid that complexity right from the start for hugetlb, great, ..
>>
>>>
>>> i. Smaps uses page_mapcount() >= 2 to determine if hugetlb memory is
>>> "private_hugetlb" or "shared_hugetlb".
>>> ii. Migration with MPOL_MF_MOVE will check page_mapcount() to see if
>>> the hugepage is shared or not. Pages that would otherwise be migrated
>>> now require MPOL_MF_MOVE_ALL to be migrated.
>>> [Really both of the above are checking how many VMAs are mapping our hugepage.]
>>> iii. CoW. This isn't a problem right now because CoW is only possible
>>> with MAP_PRIVATE VMAs and HGM can only be enabled for MAP_SHARED VMAs.
>>> iv. The hwpoison handling code will check if it successfully unmapped
>>> the poisoned page. This isn't a problem right now, as hwpoison will
>>> unmap all the mappings for the hugepage, not just the 4K where the
>>> poison was found.
>>>
>>> Doing it this way allows HGM to remain compatible with the hugetlb
>>> vmemmap optimization. None of the above problems strike me as
>>> particularly major, but it's unclear to me how important it is to have
>>> page_mapcount() have a consistent meaning for hugetlb vs non-hugetlb.
>>
>> See below, maybe we should tackle HGM from a different direction.
>>
>>>
>>> The other way page_mapcount() (let's say the "THP-like way") could be
>>> done is like this: increment compound mapcount if we're mapping a
>>> hugetlb page normally (e.g., 1G page with a PUD). If we're mapping at
>>> high-granularity, increment the mapcount for each 4K page that is
>>> getting mapped (e.g., PMD within a 1G page: increment the mapcount for
>>> the 512 pages that are now mapped). This yields the same
>>> page_mapcount() API we had before, but we lose the hugetlb vmemmap
>>> optimization.
>>>
>>> We could introduce an API like hugetlb_vma_mapcount() that would, for
>>> hugetlb, give us the number of VMAs that map a hugepage, but I don't
>>> think people would like this.
>>>
>>> I'm curious what others think (Mike, Matthew?). I'm guessing the
>>> THP-like way is probably what most people would want, though it would
>>> be a real shame to lose the vmemmap optimization.
>>
>> Heh, not me :) Having a single mapcount is certainly much cleaner. ... and
>> if we're dealing with refcount overflows already, mapcount overflows are not
>> an issue.
>>
>>
>> I wonder if the following crazy idea has already been discussed: treat the
>> whole mapping as a single large logical mapping. One reference and one
>> mapping, no matter how the individual parts are mapped into the assigned
>> page table sub-tree.
>>
>> Because for hugetlb with MAP_SHARED, we know that the complete assigned
>> sub-tree of page tables can only map the given hugetlb page, no fragments of
>> something else. That's very different to THP in private mappings ...
>>
>> So as soon as the first piece gets mapped, we increment refcount+mapcount.
>> Other pieces in the same subtree don't do that.
>>
>> Once the last piece is unmapped (or simpler: once the complete subtree of
>> page tables is gone), we decrement refcount+mapcount. Might require some
>> brain power to do this tracking, but I wouldn't call it impossible right
>> from the start.
>>
>> Would such a design violate other design aspects that are important?
> 
> The question is how to maintaining above information.

Right.

> 
> It needs to be per-map (so one page mapped multiple times can be accounted
> differently), and per-page (so one mapping/vma can contain multiple pages).
> So far I think that's exactly the pgtable.  If we can squeeze information
> into the pgtable it'll work out, but definitely not trivial.  Or we can
> maintain seperate allocates for such information, but that can be extra
> overheads too.

If there is no sub-pgtable level, there is certainly no HGM. If there is 
a sub-pgtable, we can most probably store that information in that 
pgtable's memmap. Maybe simply a pointer to the hugetlb page. As long as 
the pointer is there, we increment the mapcount/refcount.


Either directly, or via some additional metadata. Metadata should be 
small and most probably "nothing relevant in size" compared to the actual 
1 GiB page or the 2 MiB+ of page tables to cover 1 GiB.

We could even teach most pgtable walkers to just assume that "logically" 
there is simply a hugetlb page mapped, without traversing the actual 
sub-pgtables. IIUC, only pgtable walkers that actually want to access 
page content (page faults, pinning) or change PTEs (mprotect, uffd) 
would really care. Maybe stuff like smaps could just say "well, there is 
a hugetlb page mapped" and continue. Just a thought.




> 
> So far I'd still consider going with reusing thp mapcounts, which will
> mostly be what James mentioned above. The only difference is I'm not sure
> whether we should allow mapping e.g. 2M ranges for 1G pages.  THP mapcount
> doesn't have intermediate layer to maintain mapcount information like 2M,
> so to me it's easier we start with only mapping either the hpage size or
> PAGE_SIZE, not any intermediate size allowed.
> 
> Having intermediate size mapping allowed can at least be error prone to
> me.  One example is if some pgtable walker found a 2M page, it may easily
> fetch the PFN out of it, assuming it's a compound page and it should
> satisfy PageHead(pfn)==true but it'll start to break here, because the 2M
> PFN will only be a small page pfn for the 1G huge page in this case.
> 
> To me, intermediate sized mappings are good to have but not required to
> resolve HGM problems, at least so far.  Said that, I'm fine with looking at
> what it'll look like if James would like to keep persuing that direction.

Yeah, was just an idea from my side to avoid most of the refcount and 
mapcount issues -- in theory :)

Let me think about that all a bit more ...
  
David Hildenbrand Jan. 18, 2023, 6:21 p.m. UTC | #16
>>> Once the last piece is unmapped (or simpler: once the complete subtree of
>>> page tables is gone), we decrement refcount+mapcount. Might require some
>>> brain power to do this tracking, but I wouldn't call it impossible right
>>> from the start.
>>>
>>> Would such a design violate other design aspects that are important?
> 
> This is actually how mapcount was treated in HGM RFC v1 (though not
> refcount); it is doable for both [2].
> 
> One caveat here: if a page is unmapped in small pieces, it is
> difficult to know if the page is legitimately completely unmapped (we
> would have to check all the PTEs in the page table). In RFC v1, I
> sidestepped this caveat by saying that "page_mapcount() is incremented
> if the hstate-level PTE is present". A single unmap on the whole
> hugepage will clear the hstate-level PTE, thus decrementing the
> mapcount.
> 
> On a related note, there still exists an (albeit minor) API difference
> vs. THPs: a piece of a page that is legitimately unmapped can still
> have a positive page_mapcount().
> 
> Given that this approach allows us to retain the hugetlb vmemmap
> optimization (and it wouldn't require a horrible amount of
> complexity), I prefer this approach over the THP-like approach.

If we can store (directly/indirectly) metadata in the highest pgtable 
that HGM-maps a hugetlb page, I guess the following would be reasonable:

* hugetlb page pointer
* mapped size

Whenever mapping/unmapping sub-parts, we'd have to update that information.

Once "mapped size" dropped to 0, we know that the hugetlb page was 
completely unmapped and we can drop the refcount+mapcount, clear 
metadata (including hugetlb page pointer) [+ remove the page tables?].

Similarly, once "mapped size" corresponds to the hugetlb size, we can 
immediately spot that everything is mapped.
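
As pseudo-code (the struct and where it would live -- the sub-pgtable 
memmap? -- are invented, just to sketch the bookkeeping):

struct hgm_meta {
	struct page	*hpage;		/* hugetlb page, NULL if nothing mapped */
	unsigned long	mapped;		/* bytes of it mapped in the sub-tree */
};

/*
 * mapped: 0 -> non-0           => take the one refcount + mapcount
 * mapped: non-0 -> 0           => drop them, clear hpage (+ free pgtables?)
 * mapped == huge page size     => the complete page is mapped
 */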

Again, just a high-level idea.
  
Mike Kravetz Jan. 18, 2023, 7:28 p.m. UTC | #17
On 01/18/23 08:39, James Houghton wrote:
> > > > To deal with this, the best solution we've been able to come up with
> > > > is to check if refcount is > INT_MAX/2 (similar to try_get_page()),
> > > > and if it is, stop the operation (UFFDIO_CONTINUE or a page fault)
> > > > from proceeding. In the UFFDIO_CONTINUE case, return ENOMEM. In the
> > > > page fault cause, return VM_FAULT_SIGBUS (not VM_FAULT_OOM; we don't
> > > > want to kill a random process).
> > >
> > > You'd have to also make sure that fork() won't do the same. At least with
> > > uffd-wp, Peter also added page table copying during fork() for MAP_SHARED
> > > mappings, which would have to be handled.
> 
> Indeed, thank you! copy_hugetlb_page_range() (and therefore fork() in
> some cases) would need this check too.
> 
> >
> > If we want such a check to make a real difference, IIUC we may want to
> > consider having similar check in:
> >
> >         page_ref_add
> >         page_ref_inc
> >         page_ref_inc_return
> >         page_ref_add_unless
> >
> > But it's unfortunate that mostly all the callers to these functions
> > (especially the first two) do not have a retval yet at all.  Considering
> > the low possibility so far on having it overflow, maybe it can also be done
> > for later (and I think checking negative as try_get_page would suffice too).
> 
> I think as long as we annotate the few cases that userspace can
> exploit to overflow refcount, we should be ok. I think this was the
> same idea with try_get_page(): it is supposed to be used in places
> that userspace could reasonably exploit to overflow refcount.
> 
> >
> > >
> > > Of course, one can just disallow fork() with any HGM right from the start
> > > and keep it all simpler to not open up a can of worms there.
> > >
> > > Is it reasonable, to have more than one (or a handful) of VMAs mapping a
> > > huge page via a HGM? Restricting it to a single one, would make handling
> > > much easier.
> > >
> > > If there is ever demand for more HGM mappings, that whole problem (and
> > > complexity) could be dealt with later. ... but I assume it will already be a
> > > requirement for VMs (e.g., under QEMU) that share memory with other
> > > processes (virtiofsd and friends?)
> >
> > Yes, any form of multi-proc QEMU will need that for supporting HGM
> > postcopy.
> 
> +1. Disallowing fork() entirely is quite a restrictive limitation.
> 
> [snip]
> > > > I'm curious what others think (Mike, Matthew?). I'm guessing the
> > > > THP-like way is probably what most people would want, though it would
> > > > be a real shame to lose the vmemmap optimization.
> > >
> > > Heh, not me :) Having a single mapcount is certainly much cleaner. ... and
> > > if we're dealing with refcount overflows already, mapcount overflows are not
> > > an issue.
> > >
> > >
> > > I wonder if the following crazy idea has already been discussed: treat the
> > > whole mapping as a single large logical mapping. One reference and one
> > > mapping, no matter how the individual parts are mapped into the assigned
> > > page table sub-tree.
> > >
> > > Because for hugetlb with MAP_SHARED, we know that the complete assigned
> > > sub-tree of page tables can only map the given hugetlb page, no fragments of
> > > something else. That's very different to THP in private mappings ...
> > >
> > > So as soon as the first piece gets mapped, we increment refcount+mapcount.
> > > Other pieces in the same subtree don't do that.
> > >
> > > Once the last piece is unmapped (or simpler: once the complete subtree of
> > > page tables is gone), we decrement refcount+mapcount. Might require some
> > > brain power to do this tracking, but I wouldn't call it impossible right
> > > from the start.
> > >
> > > Would such a design violate other design aspects that are important?
> 
> This is actually how mapcount was treated in HGM RFC v1 (though not
> refcount); it is doable for both [2].

My apologies for being late to the party :)

When Peter first brought up the issue with ref/map_count overflows I was
thinking that we should use a scheme like David describes above.  As
James points out, this was the approach taken in the first RFC.

> One caveat here: if a page is unmapped in small pieces, it is
> difficult to know if the page is legitimately completely unmapped (we
> would have to check all the PTEs in the page table).

Are we allowing unmapping of small (non-huge page sized) areas with HGM?
We must be if you are concerned with it.  What API would cause this?
I just do not remember this discussion.

> would have to check all the PTEs in the page table). In RFC v1, I
> sidestepped this caveat by saying that "page_mapcount() is incremented
> if the hstate-level PTE is present". A single unmap on the whole
> hugepage will clear the hstate-level PTE, thus decrementing the
> mapcount.
> 
> On a related note, there still exists an (albeit minor) API difference
> vs. THPs: a piece of a page that is legitimately unmapped can still
> have a positive page_mapcount().
> 
> Given that this approach allows us to retain the hugetlb vmemmap
> optimization (and it wouldn't require a horrible amount of
> complexity), I prefer this approach over the THP-like approach.

Me too.

> >
> > The question is how to maintaining above information.
> >
> > It needs to be per-map (so one page mapped multiple times can be accounted
> > differently), and per-page (so one mapping/vma can contain multiple pages).
> > So far I think that's exactly the pgtable.  If we can squeeze information
> > into the pgtable it'll work out, but definitely not trivial.  Or we can
> > maintain seperate allocates for such information, but that can be extra
> > overheads too.
> 
> I don't think we necessarily need to check the page table if we allow
> for the limitations stated above.
> 

When I was thinking about this I was a bit concerned about having enough
information to know exactly when to inc or dec counts.  I was actually
worried about knowing when to do the increment.  I don't recall how it was
done in the first RFC, but from a high level it would need to be done
when the first hstate level PTE is allocated/added to the page table.
Right?  My concern was with all the places where we could 'error out'
after allocating the PTE, but before initializing it.  I was just thinking
that we might need to scan the page table or keep metadata for better
or easier accounting.

I think Peter mentioned it elsewhere, we should come up with a workable
scheme for HGM ref/map counting.  This can be done somewhat independently.
  
James Houghton Jan. 19, 2023, 4:57 p.m. UTC | #18
> > > > I wonder if the following crazy idea has already been discussed: treat the
> > > > whole mapping as a single large logical mapping. One reference and one
> > > > mapping, no matter how the individual parts are mapped into the assigned
> > > > page table sub-tree.
> > > >
> > > > Because for hugetlb with MAP_SHARED, we know that the complete assigned
> > > > sub-tree of page tables can only map the given hugetlb page, no fragments of
> > > > something else. That's very different to THP in private mappings ...
> > > >
> > > > So as soon as the first piece gets mapped, we increment refcount+mapcount.
> > > > Other pieces in the same subtree don't do that.
> > > >
> > > > Once the last piece is unmapped (or simpler: once the complete subtree of
> > > > page tables is gone), we decrement refcount+mapcount. Might require some
> > > > brain power to do this tracking, but I wouldn't call it impossible right
> > > > from the start.
> > > >
> > > > Would such a design violate other design aspects that are important?
> >
> > This is actually how mapcount was treated in HGM RFC v1 (though not
> > refcount); it is doable for both [2].
>
> My apologies for being late to the party :)
>
> When Peter first brought up the issue with ref/map_count overflows I was
> thinking that we should use a scheme like David describes above.  As
> James points out, this was the approach taken in the first RFC.
>
> > One caveat here: if a page is unmapped in small pieces, it is
> > difficult to know if the page is legitimately completely unmapped (we
> > would have to check all the PTEs in the page table).
>
> Are we allowing unmapping of small (non-huge page sized) areas with HGM?
> We must be if you are concerned with it.  What API would cause this?
> I just do not remember this discussion.

There was some discussion about allowing MADV_DONTNEED on
less-than-hugepage pieces [3] (it actually motivated the switch from
UFFD_FEATURE_MINOR_HUGETLBFS_HGM to MADV_SPLIT). It isn't implemented
in this series, but it could be implemented in the future.

In the more immediate future, we want hwpoison to unmap at 4K, so
MADV_HWPOISON would be one mechanism userspace could use to do this.

>
> > would have to check all the PTEs in the page table). In RFC v1, I
> > sidestepped this caveat by saying that "page_mapcount() is incremented
> > if the hstate-level PTE is present". A single unmap on the whole
> > hugepage will clear the hstate-level PTE, thus decrementing the
> > mapcount.
> >
> > On a related note, there still exists an (albeit minor) API difference
> > vs. THPs: a piece of a page that is legitimately unmapped can still
> > have a positive page_mapcount().
> >
> > Given that this approach allows us to retain the hugetlb vmemmap
> > optimization (and it wouldn't require a horrible amount of
> > complexity), I prefer this approach over the THP-like approach.
>
> Me too.
>
> > >
> > > The question is how to maintaining above information.
> > >
> > > It needs to be per-map (so one page mapped multiple times can be accounted
> > > differently), and per-page (so one mapping/vma can contain multiple pages).
> > > So far I think that's exactly the pgtable.  If we can squeeze information
> > > into the pgtable it'll work out, but definitely not trivial.  Or we can
> > > maintain seperate allocates for such information, but that can be extra
> > > overheads too.
> >
> > I don't think we necessarily need to check the page table if we allow
> > for the limitations stated above.
> >
>
> When I was thinking about this I was a bit concerned about having enough
> information to know exactly when to inc or dec counts.  I was actually
> worried about knowing to do the increment.  I don't recall how it was
> done in the first RFC, but from a high level it would need to be done
> when the first hstate level PTE is allocated/added to the page table.
> Right?  My concern was with all the places where we could 'error out'
> after allocating the PTE, but before initializing it.  I was just thinking
> that we might need to scan the page table or keep metadata for better
> or easier accounting.

The only two places where we can *create* a high-granularity page
table are: __mcopy_atomic_hugetlb (UFFDIO_CONTINUE) and
copy_hugetlb_page_range. RFC v1 did not properly deal with the cases
where we error out. To correctly handle these cases, we basically have
to do the pagecache lookup before touching the page table.

1. For __mcopy_atomic_hugetlb, we can look up the page before doing the
PT walk/alloc. If the PT walk tells us to inc the page ref/mapcount, we do
so immediately. We can easily pass the page into
hugetlb_mcopy_atomic_pte() (via 'pagep'); see the rough ordering sketch
after item 2.

2. For copy_hugetlb_page_range() for VM_MAYSHARE, we can also do the
lookup before we do the page table walk. I'm not sure how to support
non-shared HGM mappings with this scheme (in this series, we also
don't support non-shared; we return -EINVAL).
NB: The only case where high-granularity mappings for !VM_MAYSHARE
VMAs would come up is as a result of hwpoison.
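
Roughly, for case 1, the ordering would be as follows (pseudocode;
hugetlb_full_walk_alloc() and the new_hstate_pte flag are stand-ins for
whatever the final API ends up looking like):

/* 1. Page cache lookup first. */
folio = filemap_lock_folio(mapping, idx);
if (!folio)
        return -EFAULT;

/* 2. Walk/allocate the page table down to the target granularity. */
ret = hugetlb_full_walk_alloc(&hpte, vma, dst_addr, target_sz,
                              &new_hstate_pte);
if (ret)
        goto out_unlock;        /* nothing to undo yet */

/* 3. Only creating the hstate-level entry takes the ref + mapcount. */
if (new_hstate_pte) {
        folio_get(folio);
        page_add_file_rmap(&folio->page, vma, true);
}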

So we can avoid keeping additional metadata for what this series is
trying to accomplish, but if the above isn't acceptable, then I/we can
try to come up with a scheme that would be acceptable.

There is also the possibility that the scheme implemented in this
version of the series is acceptable (i.e., the page_mapcount() API
difference, which results in slightly modified page migration behavior
and smaps output, is ok... assuming we have the refcount overflow
check).

>
> I think Peter mentioned it elsewhere, we should come up with a workable
> scheme for HGM ref/map counting.  This can be done somewhat independently.

FWIW, what makes the most sense to me right now is to implement the
THP-like scheme and mark HGM as mutually exclusive with the vmemmap
optimization. We can later come up with a scheme that lets us retain
compatibility. (Is that what you mean by "this can be done somewhat
independently", Mike?)

[3]: https://lore.kernel.org/linux-mm/20221021163703.3218176-1-jthoughton@google.com/T/#m9a1090108b61d32c04b68a1f3f2577644823a999

- James
  
Mike Kravetz Jan. 19, 2023, 5:31 p.m. UTC | #19
On 01/19/23 08:57, James Houghton wrote:
> > > > > I wonder if the following crazy idea has already been discussed: treat the
> > > > > whole mapping as a single large logical mapping. One reference and one
> > > > > mapping, no matter how the individual parts are mapped into the assigned
> > > > > page table sub-tree.
> > > > >
> > > > > Because for hugetlb with MAP_SHARED, we know that the complete assigned
> > > > > sub-tree of page tables can only map the given hugetlb page, no fragments of
> > > > > something else. That's very different to THP in private mappings ...
> > > > >
> > > > > So as soon as the first piece gets mapped, we increment refcount+mapcount.
> > > > > Other pieces in the same subtree don't do that.
> > > > >
> > > > > Once the last piece is unmapped (or simpler: once the complete subtree of
> > > > > page tables is gone), we decrement refcount+mapcount. Might require some
> > > > > brain power to do this tracking, but I wouldn't call it impossible right
> > > > > from the start.
> > > > >
> > > > > Would such a design violate other design aspects that are important?
> > >
> > > This is actually how mapcount was treated in HGM RFC v1 (though not
> > > refcount); it is doable for both [2].
> >
> > My apologies for being late to the party :)
> >
> > When Peter first brought up the issue with ref/map_count overflows I was
> > thinking that we should use a scheme like David describes above.  As
> > James points out, this was the approach taken in the first RFC.
> >
> > > One caveat here: if a page is unmapped in small pieces, it is
> > > difficult to know if the page is legitimately completely unmapped (we
> > > would have to check all the PTEs in the page table).
> >
> > Are we allowing unmapping of small (non-huge page sized) areas with HGM?
> > We must be if you are concerned with it.  What API would cause this?
> > I just do not remember this discussion.
> 
> There was some discussion about allowing MADV_DONTNEED on
> less-than-hugepage pieces [3] (it actually motivated the switch from
> UFFD_FEATURE_MINOR_HUGETLBFS_HGM to MADV_SPLIT). It isn't implemented
> in this series, but it could be implemented in the future.

OK, so we do not actually create HGM mappings until a uffd operation is
done at a less than huge page size granularity.  MADV_SPLIT just says
that HGM mappings are 'possible' for this vma.  Hopefully, my understanding
is correct.
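
IOW, from userspace it would look roughly like this?  (MADV_SPLIT and
hugetlb MADV_COLLAPSE are from this series; uffd registration and error
handling omitted.)

/* HGM is only 'armed' here; mappings are still huge page sized. */
madvise(addr, len, MADV_SPLIT);

/* A minor fault resolved at 4K granularity is what actually creates
 * a high-granularity mapping. */
struct uffdio_continue cont = {
        .range = { .start = fault_addr, .len = 4096 },
        .mode = 0,
};
ioctl(uffd, UFFDIO_CONTINUE, &cont);

/* When done (e.g. postcopy finished), collapse back to huge mappings. */
madvise(addr, len, MADV_COLLAPSE);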

I was concerned about things like the page fault path, but in that case
we have already 'entered HGM mode' via a uffd operation.

Both David and Peter have asked whether eliminating intermediate mapping
levels would be a simplification.  I trust your response that it would
not help much in the current design/implementation.  But, it did get me
thinking about something else.

Perhaps we have discussed this before, and perhaps it does not meet all
user needs, but one way to possibly simplify this is:

- 'Enable HGM' via MADV_SPLIT.  Must be done at huge page (hstate)
  granularity.
- MADV_SPLIT implicitly unmaps everything within the range.
- MADV_SPLIT says all mappings for this vma will now be done at a base
  (4K) page size granularity.  vma would be marked some way.
- I think this eliminates the need for hugetlb_pte's as we KNOW the
  mapping size.
- We still use huge pages to back 4K mappings, and we still have to deal
  with the ref/map_count issues.
- Code touching hugetlb page tables would KNOW the mapping size up front.

Again, apologies if we talked about and previously dismissed this type
of approach.

> > When I was thinking about this I was a bit concerned about having enough
> > information to know exactly when to inc or dec counts.  I was actually
> > worried about knowing to do the increment.  I don't recall how it was
> > done in the first RFC, but from a high level it would need to be done
> > when the first hstate level PTE is allocated/added to the page table.
> > Right?  My concern was with all the places where we could 'error out'
> > after allocating the PTE, but before initializing it.  I was just thinking
> > that we might need to scan the page table or keep metadata for better
> > or easier accounting.
> 
> The only two places where we can *create* a high-granularity page
> table are: __mcopy_atomic_hugetlb (UFFDIO_CONTINUE) and
> copy_hugetlb_page_range. RFC v1 did not properly deal with the cases
> where we error out. To correctly handle these cases, we basically have
> to do the pagecache lookup before touching the page table.
> 
> 1. For __mcopy_atomic_hugetlb, we can lookup the page before doing the
> PT walk/alloc. If PT walk tells us to inc the page ref/mapcount, we do
> so immediately. We can easily pass the page into
> hugetlb_mcopy_atomic_pte() (via 'pagep') .
> 
> 2. For copy_hugetlb_page_range() for VM_MAYSHARE, we can also do the
> lookup before we do the page table walk. I'm not sure how to support
> non-shared HGM mappings with this scheme (in this series, we also
> don't support non-shared; we return -EINVAL).
> NB: The only case where high-granularity mappings for !VM_MAYSHARE
> VMAs would come up is as a result of hwpoison.
> 
> So we can avoid keeping additional metadata for what this series is
> trying to accomplish, but if the above isn't acceptable, then I/we can
> try to come up with a scheme that would be acceptable.

Ok, I was thinking we had to deal with other code paths such as page
fault.  But, now I understand that is not the case with this design.

> There is also the possibility that the scheme implemented in this
> version of the series is acceptable (i.e., the page_mapcount() API
> difference, which results in slightly modified page migration behavior
> and smaps output, is ok... assuming we have the refcount overflow
> check).
> 
> >
> > I think Peter mentioned it elsewhere, we should come up with a workable
> > scheme for HGM ref/map counting.  This can be done somewhat independently.
> 
> FWIW, what makes the most sense to me right now is to implement the
> THP-like scheme and mark HGM as mutually exclusive with the vmemmap
> optimization. We can later come up with a scheme that lets us retain
> compatibility. (Is that what you mean by "this can be done somewhat
> independently", Mike?)

Sort of, I was only saying that getting the ref/map counting right seems
like a task that can be worked on independently.  Using the THP-like scheme
is good.
  
James Houghton Jan. 19, 2023, 7:42 p.m. UTC | #20
On Thu, Jan 19, 2023 at 9:32 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 01/19/23 08:57, James Houghton wrote:
> > > > > > I wonder if the following crazy idea has already been discussed: treat the
> > > > > > whole mapping as a single large logical mapping. One reference and one
> > > > > > mapping, no matter how the individual parts are mapped into the assigned
> > > > > > page table sub-tree.
> > > > > >
> > > > > > Because for hugetlb with MAP_SHARED, we know that the complete assigned
> > > > > > sub-tree of page tables can only map the given hugetlb page, no fragments of
> > > > > > something else. That's very different to THP in private mappings ...
> > > > > >
> > > > > > So as soon as the first piece gets mapped, we increment refcount+mapcount.
> > > > > > Other pieces in the same subtree don't do that.
> > > > > >
> > > > > > Once the last piece is unmapped (or simpler: once the complete subtree of
> > > > > > page tables is gone), we decrement refcount+mapcount. Might require some
> > > > > > brain power to do this tracking, but I wouldn't call it impossible right
> > > > > > from the start.
> > > > > >
> > > > > > Would such a design violate other design aspects that are important?
> > > >
> > > > This is actually how mapcount was treated in HGM RFC v1 (though not
> > > > refcount); it is doable for both [2].
> > >
> > > My apologies for being late to the party :)
> > >
> > > When Peter first brought up the issue with ref/map_count overflows I was
> > > thinking that we should use a scheme like David describes above.  As
> > > James points out, this was the approach taken in the first RFC.
> > >
> > > > One caveat here: if a page is unmapped in small pieces, it is
> > > > difficult to know if the page is legitimately completely unmapped (we
> > > > would have to check all the PTEs in the page table).
> > >
> > > Are we allowing unmapping of small (non-huge page sized) areas with HGM?
> > > We must be if you are concerned with it.  What API would cause this?
> > > I just do not remember this discussion.
> >
> > There was some discussion about allowing MADV_DONTNEED on
> > less-than-hugepage pieces [3] (it actually motivated the switch from
> > UFFD_FEATURE_MINOR_HUGETLBFS_HGM to MADV_SPLIT). It isn't implemented
> > in this series, but it could be implemented in the future.
>
> OK, so we do not actually create HGM mappings until a uffd operation is
> done at a less than huge page size granularity.  MADV_SPLIT just says
> that HGM mappings are 'possible' for this vma.  Hopefully, my understanding
> is correct.

Right, that's the current meaning of MADV_SPLIT for hugetlb.

> I was concerned about things like the page fault path, but in that case
> we have already 'entered HGM mode' via a uffd operation.
>
> Both David and Peter have asked whether eliminating intermediate mapping
> levels would be a simplification.  I trust your response that it would
> not help much in the current design/implementation.  But, it did get me
> thinking about something else.
>
> Perhaps we have discussed this before, and perhaps it does not meet all
> user needs, but one way possibly simplify this is:
>
> - 'Enable HGM' via MADV_SPLIT.  Must be done at huge page (hstate)
>   granularity.
> - MADV_SPLIT implicitly unmaps everything with in the range.
> - MADV_SPLIT says all mappings for this vma will now be done at a base
>   (4K) page size granularity.  vma would be marked some way.
> - I think this eliminates the need for hugetlb_pte's as we KNOW the
>   mapping size.
> - We still use huge pages to back 4K mappings, and we still have to deal
>   with the ref/map_count issues.
> - Code touching hugetlb page tables would KNOW the mapping size up front.
>
> Again, apologies if we talked about and previously dismissed this type
> of approach.

I think Peter was the one who originally suggested an approach like
this, and it meets my needs. However, I still think the way that
things are currently implemented is the right way to go.

Assuming we want decent performance, we can't get away with the same
strategy of just passing pte_t*s everywhere. The PTL for a 4K PTE
should be based on the PMD above the PTE, so we need to either pass
around the PMD above our PTE, or we need to pass around the PTL. This
is something that hugetlb_pte does for us, so, in some sense, even
going with this simpler approach, we still need a hugetlb_pte-like
construct.
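
Roughly (the fields here are approximate, not necessarily the exact
definition in this series):

struct hugetlb_pte {
        pte_t           *ptep;  /* the PTE we operate on, at any level */
        unsigned int    shift;  /* how much that PTE maps */
        spinlock_t      *ptl;   /* for 4K, derived from the PMD above */
};

static inline spinlock_t *hugetlb_pte_lock(struct hugetlb_pte *hpte)
{
        spin_lock(hpte->ptl);
        return hpte->ptl;
}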

Although most of the other complexity that HGM introduces would have
to be introduced either way (like having to deal with putting
compound_head()/page_folio() in more places and doing some
per-architecture updates), there are some complexities that the
simpler approach avoids:

- We avoid problems related to compound PTEs (the problem being: two
threads racing to populate a contiguous and non-contiguous PTE that
take up the same space could lead to user-detectable incorrect
behavior. This isn't hard to fix; it will be fixed when I send the arm64
patches up.)

- We don't need to check if PTEs get split from under us in PT walks.
(In a lot of cases, the appropriate action is just to treat the PTE as
if it were pte_none().) In the same vein, we don't need
hugetlb_pte_present_leaf() at all, because PTEs we find will always be
leaves.

- We don't have to deal with sorting hstates or implementing
for_each_hgm_shift()/hugetlb_alloc_largest_pte().

None of these complexities are particularly major in my opinion.

This might seem kind of contrived, but let's say you have a VM with 1T
of memory, and you find 100 memory errors all in different 1G pages
over the life of this VM (years, potentially). Having 10% of your
memory be 4K-mapped is definitely worse than having 10% be 2M-mapped
(lost performance and increased memory overhead). There might be other
cases in the future where being able to have intermediate mapping
sizes could be helpful.

> > > When I was thinking about this I was a bit concerned about having enough
> > > information to know exactly when to inc or dec counts.  I was actually
> > > worried about knowing to do the increment.  I don't recall how it was
> > > done in the first RFC, but from a high level it would need to be done
> > > when the first hstate level PTE is allocated/added to the page table.
> > > Right?  My concern was with all the places where we could 'error out'
> > > after allocating the PTE, but before initializing it.  I was just thinking
> > > that we might need to scan the page table or keep metadata for better
> > > or easier accounting.
> >
> > The only two places where we can *create* a high-granularity page
> > table are: __mcopy_atomic_hugetlb (UFFDIO_CONTINUE) and
> > copy_hugetlb_page_range. RFC v1 did not properly deal with the cases
> > where we error out. To correctly handle these cases, we basically have
> > to do the pagecache lookup before touching the page table.
> >
> > 1. For __mcopy_atomic_hugetlb, we can lookup the page before doing the
> > PT walk/alloc. If PT walk tells us to inc the page ref/mapcount, we do
> > so immediately. We can easily pass the page into
> > hugetlb_mcopy_atomic_pte() (via 'pagep') .
> >
> > 2. For copy_hugetlb_page_range() for VM_MAYSHARE, we can also do the
> > lookup before we do the page table walk. I'm not sure how to support
> > non-shared HGM mappings with this scheme (in this series, we also
> > don't support non-shared; we return -EINVAL).
> > NB: The only case where high-granularity mappings for !VM_MAYSHARE
> > VMAs would come up is as a result of hwpoison.
> >
> > So we can avoid keeping additional metadata for what this series is
> > trying to accomplish, but if the above isn't acceptable, then I/we can
> > try to come up with a scheme that would be acceptable.
>
> Ok, I was thinking we had to deal with other code paths such as page
> fault.  But, now I understand that is not the case with this design.
>
> > There is also the possibility that the scheme implemented in this
> > version of the series is acceptable (i.e., the page_mapcount() API
> > difference, which results in slightly modified page migration behavior
> > and smaps output, is ok... assuming we have the refcount overflow
> > check).
> >
> > >
> > > I think Peter mentioned it elsewhere, we should come up with a workable
> > > scheme for HGM ref/map counting.  This can be done somewhat independently.
> >
> > FWIW, what makes the most sense to me right now is to implement the
> > THP-like scheme and mark HGM as mutually exclusive with the vmemmap
> > optimization. We can later come up with a scheme that lets us retain
> > compatibility. (Is that what you mean by "this can be done somewhat
> > independently", Mike?)
>
> Sort of, I was only saying that getting the ref/map counting right seems
> like a task than can be independently worked.  Using the THP-like scheme
> is good.

Ok! So if you're ok with the intermediate mapping sizes, it sounds
like I should go ahead and implement the THP-like scheme.

- James
  
Peter Xu Jan. 19, 2023, 8:53 p.m. UTC | #21
On Thu, Jan 19, 2023 at 11:42:26AM -0800, James Houghton wrote:
> - We avoid problems related to compound PTEs (the problem being: two
> threads racing to populate a contiguous and non-contiguous PTE that
> take up the same space could lead to user-detectable incorrect
> behavior. This isn't hard to fix; it will be when I send the arm64
> patches up.)

Could you elaborate on this one a bit more?

> This might seem kind of contrived, but let's say you have a VM with 1T
> of memory, and you find 100 memory errors all in different 1G pages
> over the life of this VM (years, potentially). Having 10% of your
> memory be 4K-mapped is definitely worse than having 10% be 2M-mapped
> (lost performance and increased memory overhead). There might be other
> cases in the future where being able to have intermediate mapping
> sizes could be helpful.

This is not the norm, or is it?  How does the possibility of bad pages
distribute over hosts over years?  This can definitely affect how we should
target the intermediate level mappings.

Thanks,
  
Mike Kravetz Jan. 19, 2023, 10 p.m. UTC | #22
On 01/19/23 11:42, James Houghton wrote:
> On Thu, Jan 19, 2023 at 9:32 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > On 01/19/23 08:57, James Houghton wrote:
> >
> > OK, so we do not actually create HGM mappings until a uffd operation is
> > done at a less than huge page size granularity.  MADV_SPLIT just says
> > that HGM mappings are 'possible' for this vma.  Hopefully, my understanding
> > is correct.
> 
> Right, that's the current meaning of MADV_SPLIT for hugetlb.
> 
> > I was concerned about things like the page fault path, but in that case
> > we have already 'entered HGM mode' via a uffd operation.
> >
> > Both David and Peter have asked whether eliminating intermediate mapping
> > levels would be a simplification.  I trust your response that it would
> > not help much in the current design/implementation.  But, it did get me
> > thinking about something else.
> >
> > Perhaps we have discussed this before, and perhaps it does not meet all
> > user needs, but one way possibly simplify this is:
> >
> > - 'Enable HGM' via MADV_SPLIT.  Must be done at huge page (hstate)
> >   granularity.
> > - MADV_SPLIT implicitly unmaps everything with in the range.
> > - MADV_SPLIT says all mappings for this vma will now be done at a base
> >   (4K) page size granularity.  vma would be marked some way.
> > - I think this eliminates the need for hugetlb_pte's as we KNOW the
> >   mapping size.
> > - We still use huge pages to back 4K mappings, and we still have to deal
> >   with the ref/map_count issues.
> > - Code touching hugetlb page tables would KNOW the mapping size up front.
> >
> > Again, apologies if we talked about and previously dismissed this type
> > of approach.
> 
> I think Peter was the one who originally suggested an approach like
> this, and it meets my needs. However, I still think the way that
> things are currently implemented is the right way to go.
> 
> Assuming we want decent performance, we can't get away with the same
> strategy of just passing pte_t*s everywhere. The PTL for a 4K PTE
> should be based on the PMD above the PTE, so we need to either pass
> around the PMD above our PTE, or we need to pass around the PTL. This
> is something that hugetlb_pte does for us, so, in some sense, even
> going with this simpler approach, we still need a hugetlb_pte-like
> construct.

Agree there is this performance hit.  However, the 'simplest' approach
would be to just use the page table lock as is done by default for 4K
PTEs.

I do not know much about the (primary) live migration use case.  My
guess is that page table lock contention may be an issue?  In this use
case, HGM is only enabled for the duration the live migration operation,
then a MADV_COLLAPSE is performed.  If contention is likely to be an
issue during this time, then yes we would need to pass around with
something like hugetlb_pte.

> Although most of the other complexity that HGM introduces would have
> to be introduced either way (like having to deal with putting
> compound_head()/page_folio() in more places and doing some
> per-architecture updates), there are some complexities that the
> simpler approach avoids:
> 
> - We avoid problems related to compound PTEs (the problem being: two
> threads racing to populate a contiguous and non-contiguous PTE that
> take up the same space could lead to user-detectable incorrect
> behavior. This isn't hard to fix; it will be when I send the arm64
> patches up.)
> 
> - We don't need to check if PTEs get split from under us in PT walks.
> (In a lot of cases, the appropriate action is just to treat the PTE as
> if it were pte_none().) In the same vein, we don't need
> hugetlb_pte_present_leaf() at all, because PTEs we find will always be
> leaves.
> 
> - We don't have to deal with sorting hstates or implementing
> for_each_hgm_shift()/hugetlb_alloc_largest_pte().
> 
> None of these complexities are particularly major in my opinion.

Perhaps not.  I was just thinking about the overall complexity of the
hugetlb code after HGM.  Currently, it is 'relatively simple' with
fixed huge page sizes.  IMO, much simpler than THP with two possible
mapping sizes.  With HGM and intermediate mapping sizes, it seems
things could get more complicated than THP.  Perhaps it is just me.
I am just too familiar with the current code and a bit anxious about
added complexity.  But, I felt the same about vmemmap optimizations. :)

> This might seem kind of contrived, but let's say you have a VM with 1T
> of memory, and you find 100 memory errors all in different 1G pages
> over the life of this VM (years, potentially). Having 10% of your
> memory be 4K-mapped is definitely worse than having 10% be 2M-mapped
> (lost performance and increased memory overhead). There might be other
> cases in the future where being able to have intermediate mapping
> sizes could be helpful.

That may be a bit contrived.  We know memory error handling is a future
use case, but I believe there is work outside of HGM that needs to be
done to handle such situations.  For example, HGM will allow the 1G
mapping to isolate the 4K page with error.  This prevents errors if you
fault almost anywhere within the 1G page.  But, there still remains the
possibility of accessing that 4K page with error.  IMO, it will
require user space/application intervention to address this as only the
application knows about the potentially lost data.  This is still something
that needs to be designed.  It would then make sense for the application
to also determine how it wants to proceed WRT mapping the 1G area.
Perhaps they will want (and there will exist a mechanism) to migrate the
data to a new 1G page without error.

> > > > I think Peter mentioned it elsewhere, we should come up with a workable
> > > > scheme for HGM ref/map counting.  This can be done somewhat independently.
> > >
> > > FWIW, what makes the most sense to me right now is to implement the
> > > THP-like scheme and mark HGM as mutually exclusive with the vmemmap
> > > optimization. We can later come up with a scheme that lets us retain
> > > compatibility. (Is that what you mean by "this can be done somewhat
> > > independently", Mike?)
> >
> > Sort of, I was only saying that getting the ref/map counting right seems
> > like a task that can be independently worked.  Using the THP-like scheme
> > is good.
> 
> Ok! So if you're ok with the intermediate mapping sizes, it sounds
> like I should go ahead and implement the THP-like scheme.

Yes, I am OK with it.  Just expressed a bit of concern above.
  
Peter Xu Jan. 19, 2023, 10:23 p.m. UTC | #23
On Thu, Jan 19, 2023 at 02:00:32PM -0800, Mike Kravetz wrote:
> I do not know much about the (primary) live migration use case.  My
> guess is that page table lock contention may be an issue?  In this use
> case, HGM is only enabled for the duration the live migration operation,
> then a MADV_COLLAPSE is performed.  If contention is likely to be an
> issue during this time, then yes we would need to pass around with
> something like hugetlb_pte.

I'm not aware of any such contention issue.  IMHO the migration problem is
majorly about being too slow transferring a page being so large.  Shrinking
the page size should resolve the major problem already here IIUC.

AFAIU 4K-only solution should only reduce any lock contention because locks
will always be pte-level if VM_HUGETLB_HGM set.  When walking and creating
the intermediate pgtable entries we can use atomic ops just like generic
mm, so no lock needed at all.  With uncertainty on the size of mappings,
we'll need to take any of the multiple layers of locks.

[...]

> > None of these complexities are particularly major in my opinion.
> 
> Perhaps not.  I was just thinking about the overall complexity of the
> hugetlb code after HGM.  Currently, it is 'relatively simple' with
> fixed huge page sizes.  IMO, much simpler than THP with two possible
> mapping sizes.  With HGM and intermediate mapping sizes, it seems
> things could get more complicated than THP.  Perhaps it is just me.

Please count me in. :) I'm just still happy to see what it'll look like if
James thinks having that complexity doesn't greatly affect the whole design.

Thanks,
  
James Houghton Jan. 19, 2023, 10:35 p.m. UTC | #24
On Thu, Jan 19, 2023 at 2:23 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Jan 19, 2023 at 02:00:32PM -0800, Mike Kravetz wrote:
> > I do not know much about the (primary) live migration use case.  My
> > guess is that page table lock contention may be an issue?  In this use
> > case, HGM is only enabled for the duration the live migration operation,
> > then a MADV_COLLAPSE is performed.  If contention is likely to be an
> > issue during this time, then yes we would need to pass around with
> > something like hugetlb_pte.
>
> I'm not aware of any such contention issue.  IMHO the migration problem is
> majorly about being too slow transferring a page being so large.  Shrinking
> the page size should resolve the major problem already here IIUC.

This will be problematic if you scale up VMs to be quite large. Google
upstreamed the "TDP MMU" for KVM/x86 that removed the need to take the
MMU lock for writing in the EPT violation path. We found that this
change is required for VMs >200 or so vCPUs to consistently avoid CPU
soft lockups in the guest.

Requiring each UFFDIO_CONTINUE (in the post-copy path) to serialize on
the same PTL would be problematic in the same way.

>
> AFAIU 4K-only solution should only reduce any lock contention because locks
> will always be pte-level if VM_HUGETLB_HGM set.  When walking and creating
> the intermediate pgtable entries we can use atomic ops just like generic
> mm, so no lock needed at all.  With uncertainty on the size of mappings,
> we'll need to take any of the multiple layers of locks.
>

Other than taking the HugeTLB VMA lock for reading, walking/allocating
page tables won't need any additional locking.

We take the PTL to allocate the next level down, but so does generic
mm (look at __pud_alloc, __pmd_alloc for example). Maybe I am
misunderstanding.

- James
  
James Houghton Jan. 19, 2023, 10:45 p.m. UTC | #25
On Thu, Jan 19, 2023 at 12:53 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Jan 19, 2023 at 11:42:26AM -0800, James Houghton wrote:
> > - We avoid problems related to compound PTEs (the problem being: two
> > threads racing to populate a contiguous and non-contiguous PTE that
> > take up the same space could lead to user-detectable incorrect
> > behavior. This isn't hard to fix; it will be when I send the arm64
> > patches up.)
>
> Could you elaborate this one a bit more?

In hugetlb_mcopy_atomic_pte(), we check that the PTE we're about to
overwrite is pte_none() before overwriting it. For contiguous PTEs,
this only checks the first PTE in the bunch.

If someone came around and populated one of the PTEs that lay in the
middle of a potentially contiguous group of PTEs, we could end up
overwriting that PTE if we later UFFDIO_CONTINUEd in such a way to
create a contiguous PTE.

We would expect to get EEXIST here, but in this case the operation
would succeed. To fix this, we can just check that ALL the PTEs in the
contiguous bunch have the value that we're expecting, not just the
first one.

hugetlb_no_page() has the same problem, but it's not immediately clear
to me how it would result in incorrect behavior.
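
To make the fix concrete, this is roughly the shape of the check I have in
mind (just a sketch, not the actual hugetlb_mcopy_atomic_pte() change; the
helper name and the way nr_entries is obtained are made up here):

static bool hugetlb_cont_ptes_none(pte_t *ptep, unsigned int nr_entries)
{
        unsigned int i;

        /* Check every constituent entry, not just the first one. */
        for (i = 0; i < nr_entries; i++)
                if (!pte_none(ptep_get(ptep + i)))
                        return false;   /* a racing thread filled a sub-entry */

        return true;
}

If this returns false, the caller can bail out with -EEXIST, which is what
userspace expects to see.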

>
> > This might seem kind of contrived, but let's say you have a VM with 1T
> > of memory, and you find 100 memory errors all in different 1G pages
> > over the life of this VM (years, potentially). Having 10% of your
> > memory be 4K-mapped is definitely worse than having 10% be 2M-mapped
> > (lost performance and increased memory overhead). There might be other
> > cases in the future where being able to have intermediate mapping
> > sizes could be helpful.
>
> This is not the norm, or is it?  How the possibility of bad pages can
> distribute over hosts over years?  This can definitely affect how we should
> target the intermediate level mappings.

I can't really speak for norms generally, but I can try to speak for
Google Cloud. Google Cloud hasn't had memory error virtualization for
very long (only about a year), but we've seen cases where VMs can pick
up several memory errors over a few days/weeks. IMO, 100 errors in
separate 1G pages over a few years isn't completely nonsensical,
especially if the memory that you're using isn't so reliable or was
damaged in shipping (like if it was flown over the poles or
something!).

Now there is the concern about how an application would handle it. In
a VMM's case, we can virtualize the error for the guest. In the guest,
it's possible that a good chunk of the errors lie in unused pages and
so can be easily marked as poisoned. It's possible that recovery is
much more difficult. It's not unreasonable for an application to
recover from a lot of memory errors.

- James
  
Peter Xu Jan. 19, 2023, 11:07 p.m. UTC | #26
On Thu, Jan 19, 2023 at 02:35:12PM -0800, James Houghton wrote:
> On Thu, Jan 19, 2023 at 2:23 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Thu, Jan 19, 2023 at 02:00:32PM -0800, Mike Kravetz wrote:
> > > I do not know much about the (primary) live migration use case.  My
> > > guess is that page table lock contention may be an issue?  In this use
> > > case, HGM is only enabled for the duration the live migration operation,
> > > then a MADV_COLLAPSE is performed.  If contention is likely to be an
> > > issue during this time, then yes we would need to pass around with
> > > something like hugetlb_pte.
> >
> > I'm not aware of any such contention issue.  IMHO the migration problem is
> > majorly about being too slow transferring a page being so large.  Shrinking
> > the page size should resolve the major problem already here IIUC.
> 
> This will be problematic if you scale up VMs to be quite large.

Do you mean that for the postcopy use case one can leverage e.g. 2M
mappings (over 1G) to avoid lock contentions when VM is large?  I agree it
should be more efficient than having 512 4K pages installed, but I think
it'll make the page fault resolution slower too if some thread is only
looking for a 4k portion of it.

> Google upstreamed the "TDP MMU" for KVM/x86 that removed the need to take
> the MMU lock for writing in the EPT violation path. We found that this
> change is required for VMs >200 or so vCPUs to consistently avoid CPU
> soft lockups in the guest.

After the kvm mmu rwlock conversion, it'll allow concurrent page faults
even if only 4K pages are used, so it seems not directly relevant to what
we're discussing here, no?

> 
> Requiring each UFFDIO_CONTINUE (in the post-copy path) to serialize on
> the same PTL would be problematic in the same way.

Pte-level pgtable lock only covers a 2M range, so I think it depends on
which address the vcpu faulted on?  IIUC the major case should be
that the faulted threads are not falling upon the same 2M range.

> 
> >
> > AFAIU 4K-only solution should only reduce any lock contention because locks
> > will always be pte-level if VM_HUGETLB_HGM set.  When walking and creating
> > the intermediate pgtable entries we can use atomic ops just like generic
> > mm, so no lock needed at all.  With uncertainty on the size of mappings,
> > we'll need to take any of the multiple layers of locks.
> >
> 
> Other than taking the HugeTLB VMA lock for reading, walking/allocating
> page tables won't need any additional locking.

Actually when revisiting the locks I'm getting a bit confused on whether
the vma lock is needed if pmd sharing is anyway forbidden for HGM.  I
raised a question in the other patch of MADV_COLLAPSE, maybe they're
related questions so we can keep it there.

> 
> We take the PTL to allocate the next level down, but so does generic
> mm (look at __pud_alloc, __pmd_alloc for example). Maybe I am
> misunderstanding.

Sorry you're right, please ignore that.  I don't know why I had that
impression that spinlocks are not needed in that process.

Actually I am also curious why atomics won't work (by holding mmap read
lock, then do cmpxchg(old_entry=0, new_entry) upon the pgtable entries).  I
think it's possible I just missed something else.
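
To illustrate the idea (a standalone sketch of the race I have in mind, not
kernel code; pgtable_slot_t and install_table_atomically() are made up):

#include <errno.h>
#include <stdatomic.h>
#include <stdlib.h>

typedef _Atomic(unsigned long) pgtable_slot_t;

/*
 * Try to link a freshly allocated lower-level table into an empty slot
 * without taking a lock; the loser of the race frees its allocation.
 */
static int install_table_atomically(pgtable_slot_t *slot, void *new_table)
{
        unsigned long expected = 0;     /* old_entry == 0, i.e. slot still empty */

        if (atomic_compare_exchange_strong(slot, &expected,
                                           (unsigned long)new_table))
                return 0;               /* we won: our table is now visible */

        free(new_table);                /* somebody else populated the slot */
        return -EEXIST;
}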
  
James Houghton Jan. 19, 2023, 11:26 p.m. UTC | #27
On Thu, Jan 19, 2023 at 3:07 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Jan 19, 2023 at 02:35:12PM -0800, James Houghton wrote:
> > On Thu, Jan 19, 2023 at 2:23 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Thu, Jan 19, 2023 at 02:00:32PM -0800, Mike Kravetz wrote:
> > > > I do not know much about the (primary) live migration use case.  My
> > > > guess is that page table lock contention may be an issue?  In this use
> > > > case, HGM is only enabled for the duration the live migration operation,
> > > > then a MADV_COLLAPSE is performed.  If contention is likely to be an
> > > > issue during this time, then yes we would need to pass around with
> > > > something like hugetlb_pte.
> > >
> > > I'm not aware of any such contention issue.  IMHO the migration problem is
> > > majorly about being too slow transferring a page being so large.  Shrinking
> > > the page size should resolve the major problem already here IIUC.
> >
> > This will be problematic if you scale up VMs to be quite large.
>
> Do you mean that for the postcopy use case one can leverage e.g. 2M
> mappings (over 1G) to avoid lock contentions when VM is large?  I agree it
> should be more efficient than having 512 4K pages installed, but I think
> it'll make the page fault resolution slower too if some thread is only
> looking for a 4k portion of it.

No, that's not what I meant. Sorry. If you can use the PTL that is
normally used for 4K PTEs, then you're right, there is no contention
problem. However, this PTL is determined by the value of the PMD, so
you need a pointer to the PMD to determine what the PTL should be (or
a pointer to the PTL itself).

In hugetlb, we only ever pass around the PTE pointer, and we rely on
huge_pte_lockptr() to find the PTL for us (and it does so
appropriately for everything except 4K PTEs). We would need to add the
complexity of passing around a PMD or PTL everywhere, and that's what
hugetlb_pte does for us. So that complexity is basically unavoidable,
unless you're ok with 4K PTEs taking mm->page_table_lock (I'm
not).
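
For context, this is roughly what I mean by carrying that context around
(this is not the exact struct from the series -- only ->ptep is confirmed
there; the other fields here are illustrative):

/*
 * The kind of context hugetlb_pte is meant to carry along with the PTE
 * pointer, so callers don't have to pass a PMD or PTL by hand.
 */
struct hugetlb_pte_ctx {
        pte_t *ptep;            /* the PTE itself, possibly a 4K entry */
        unsigned int shift;     /* how much memory this entry maps */
        spinlock_t *ptl;        /* for 4K entries, the split PTL of the
                                 * PMD sitting above this PTE */
};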

>
> > Google upstreamed the "TDP MMU" for KVM/x86 that removed the need to take
> > the MMU lock for writing in the EPT violation path. We found that this
> > change is required for VMs >200 or so vCPUs to consistently avoid CPU
> > soft lockups in the guest.
>
> After the kvm mmu rwlock conversion, it'll allow concurrent page faults
> even if only 4K pages are used, so it seems not directly relevant to what
> we're discussing here, no?

Right. I was just bringing it up to say that if 4K PTLs were
mm->page_table_lock, we would have a problem.

>
> >
> > Requiring each UFFDIO_CONTINUE (in the post-copy path) to serialize on
> > the same PTL would be problematic in the same way.
>
> Pte-level pgtable lock only covers 2M range, so I think it depends on which
> is the address that the vcpu is faulted on?  IIUC the major case should be
> that the faulted threads are not falling upon the same 2M range.

Right. I think my comment should make more sense with the above clarification.

>
> >
> > >
> > > AFAIU 4K-only solution should only reduce any lock contention because locks
> > > will always be pte-level if VM_HUGETLB_HGM set.  When walking and creating
> > > the intermediate pgtable entries we can use atomic ops just like generic
> > > mm, so no lock needed at all.  With uncertainty on the size of mappings,
> > > we'll need to take any of the multiple layers of locks.
> > >
> >
> > Other than taking the HugeTLB VMA lock for reading, walking/allocating
> > page tables won't need any additional locking.
>
> Actually when revisiting the locks I'm getting a bit confused on whether
> the vma lock is needed if pmd sharing is anyway forbidden for HGM.  I
> raised a question in the other patch of MADV_COLLAPSE, maybe they're
> related questions so we can keep it there.

We can discuss there. :) I take both the VMA lock and mapping lock so
that it can stay in sync with huge_pmd_unshare(), and so HGM walks
have the same synchronization as regular hugetlb PT walks.

>
> >
> > We take the PTL to allocate the next level down, but so does generic
> > mm (look at __pud_alloc, __pmd_alloc for example). Maybe I am
> > misunderstanding.
>
> Sorry you're right, please ignore that.  I don't know why I had that
> impression that spinlocks are not needed in that process.
>
> Actually I am also curious why atomics won't work (by holding mmap read
> lock, then do cmpxchg(old_entry=0, new_entry) upon the pgtable entries).  I
> think it's possible I just missed something else.

I think there are cases where we need to make sure the value of a PTE
isn't going to change from under us while we're doing some kind of
other operation, and so a compare-and-swap won't really be what we
need.
  
Mike Kravetz Jan. 19, 2023, 11:44 p.m. UTC | #28
On 01/19/23 18:07, Peter Xu wrote:
> 
> Actually when revisiting the locks I'm getting a bit confused on whether
> the vma lock is needed if pmd sharing is anyway forbidden for HGM.  I
> raised a question in the other patch of MADV_COLLAPSE, maybe they're
> related questions so we can keep it there.

I can quickly answer that.  Yes.  The vma lock is also being used for
fault/truncation synchronization.  Commit e700898fa075 makes sure it is
even used on architectures that do not support PMD sharing.

I had come up with a rather ugly method of using the fault mutex for
fault/truncation synchronization, but using the vma lock was more
elegant.
  
Peter Xu Jan. 20, 2023, 5:23 p.m. UTC | #29
On Thu, Jan 19, 2023 at 03:26:14PM -0800, James Houghton wrote:
> On Thu, Jan 19, 2023 at 3:07 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Thu, Jan 19, 2023 at 02:35:12PM -0800, James Houghton wrote:
> > > On Thu, Jan 19, 2023 at 2:23 PM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > On Thu, Jan 19, 2023 at 02:00:32PM -0800, Mike Kravetz wrote:
> > > > > I do not know much about the (primary) live migration use case.  My
> > > > > guess is that page table lock contention may be an issue?  In this use
> > > > > case, HGM is only enabled for the duration the live migration operation,
> > > > > then a MADV_COLLAPSE is performed.  If contention is likely to be an
> > > > > issue during this time, then yes we would need to pass around with
> > > > > something like hugetlb_pte.
> > > >
> > > > I'm not aware of any such contention issue.  IMHO the migration problem is
> > > > majorly about being too slow transferring a page being so large.  Shrinking
> > > > the page size should resolve the major problem already here IIUC.
> > >
> > > This will be problematic if you scale up VMs to be quite large.
> >
> > Do you mean that for the postcopy use case one can leverage e.g. 2M
> > mappings (over 1G) to avoid lock contentions when VM is large?  I agree it
> > should be more efficient than having 512 4K pages installed, but I think
> > it'll make the page fault resolution slower too if some thread is only
> > looking for a 4k portion of it.
> 
> No, that's not what I meant. Sorry. If you can use the PTL that is
> normally used for 4K PTEs, then you're right, there is no contention
> problem. However, this PTL is determined by the value of the PMD, so
> you need a pointer to the PMD to determine what the PTL should be (or
> a pointer to the PTL itself).
> 
> In hugetlb, we only ever pass around the PTE pointer, and we rely on
> huge_pte_lockptr() to find the PTL for us (and it does so
> appropriately for everything except 4K PTEs). We would need to add the
> complexity of passing around a PMD or PTL everywhere, and that's what
> hugetlb_pte does for us. So that complexity is basically unavoidable,
> unless you're ok with 4K PTEs with taking mm->page_table_lock (I'm
> not).
> 
> >
> > > Google upstreamed the "TDP MMU" for KVM/x86 that removed the need to take
> > > the MMU lock for writing in the EPT violation path. We found that this
> > > change is required for VMs >200 or so vCPUs to consistently avoid CPU
> > > soft lockups in the guest.
> >
> > After the kvm mmu rwlock conversion, it'll allow concurrent page faults
> > even if only 4K pages are used, so it seems not directly relevant to what
> > we're discussing here, no?
> 
> Right. I was just bringing it up to say that if 4K PTLs were
> mm->page_table_lock, we would have a problem.

Ah I see what you meant.  We definitely don't want to use the
page_table_lock for sure.

So if it's about keeping hugetlb_pte I'm fine with it, no matter what the
final version will look like.

> 
> >
> > >
> > > Requiring each UFFDIO_CONTINUE (in the post-copy path) to serialize on
> > > the same PTL would be problematic in the same way.
> >
> > Pte-level pgtable lock only covers 2M range, so I think it depends on which
> > is the address that the vcpu is faulted on?  IIUC the major case should be
> > that the faulted threads are not falling upon the same 2M range.
> 
> Right. I think my comment should make more sense with the above clarification.
> 
> >
> > >
> > > >
> > > > AFAIU 4K-only solution should only reduce any lock contention because locks
> > > > will always be pte-level if VM_HUGETLB_HGM set.  When walking and creating
> > > > the intermediate pgtable entries we can use atomic ops just like generic
> > > > mm, so no lock needed at all.  With uncertainty on the size of mappings,
> > > > we'll need to take any of the multiple layers of locks.
> > > >
> > >
> > > Other than taking the HugeTLB VMA lock for reading, walking/allocating
> > > page tables won't need any additional locking.
> >
> > Actually when revisiting the locks I'm getting a bit confused on whether
> > the vma lock is needed if pmd sharing is anyway forbidden for HGM.  I
> > raised a question in the other patch of MADV_COLLAPSE, maybe they're
> > related questions so we can keep it there.
> 
> We can discuss there. :) I take both the VMA lock and mapping lock so
> that it can stay in sync with huge_pmd_unshare(), and so HGM walks
> have the same synchronization as regular hugetlb PT walks.

Sure. :)

Now after a 2nd thought I don't think it's unsafe to take the vma write
lock here, especially for VM_SHARED.  I can't think of anything that will
go wrong.  It's because we need the vma lock anywhere we'll be walking the
pgtables while holding mmap_sem for read, I think, out of concern that pmd
sharing may be possible.

But I'm not sure whether this is the cleanest way to do it.

IMHO the major special part of hugetlb, compared to generic mm, is pgtable
thread safety.  I worry that complicating this lock can potentially make
the hugetlb code even more specific, which is not good for the long term if
we still have a hope of merging more hugetlb codes with the generic paths.

Here, since pmd sharing is impossible for HGM, the original vma lock is not
needed.  Meanwhile, what we want to guard is the pgtable walkers.
They're logically being protected by either mmap lock or the mapping lock
(for rmap walkers).  Fast-gup is another thing but so far I think it's all
safe when you're following the mmu gather facilities.

Somehow I had a feeling that the hugetlb vma lock (along with the pgtable
sharing explorations in the hugetlb world keeping going..) can keep
evolving in the future, and it should be helpful to keep its semantics
simple too.

So to summarize: I wonder whether we can use mmap write lock and
i_mmap_rwsem write lock to protect collapsing for hugetlb, just like what
we do with THP collapsing (after Jann's fix).

madvise_need_mmap_write() is not easily feasible because it's before the
vma scanning so we can't take conditional write lock only for hugetlb, but
that's the next question to ask only if we can reach a consensus on the
lock scheme first for HGM in general.

> 
> >
> > >
> > > We take the PTL to allocate the next level down, but so does generic
> > > mm (look at __pud_alloc, __pmd_alloc for example). Maybe I am
> > > misunderstanding.
> >
> > Sorry you're right, please ignore that.  I don't know why I had that
> > impression that spinlocks are not needed in that process.
> >
> > Actually I am also curious why atomics won't work (by holding mmap read
> > lock, then do cmpxchg(old_entry=0, new_entry) upon the pgtable entries).  I
> > think it's possible I just missed something else.
> 
> I think there are cases where we need to make sure the value of a PTE
> isn't going to change from under us while we're doing some kind of
> other operation, and so a compare-and-swap won't really be what we
> need.

Currently the pgtable spinlock is only taken while populating the
pgtables.  If it can happen, then it can happen too right after we release
the spinlock in e.g. __pmd_alloc().

One thing I can think of is that we may need more done than just the
pgtable entry installation, so atomics would stop working in that case.  E.g. on
x86 we have paravirt_alloc_pmd().  But I'm not sure whether that's the only
reason.
  
Peter Xu Jan. 23, 2023, 3:19 p.m. UTC | #30
Hi, Mike,

On Thu, Jan 19, 2023 at 03:44:25PM -0800, Mike Kravetz wrote:
> On 01/19/23 18:07, Peter Xu wrote:
> > 
> > Actually when revisiting the locks I'm getting a bit confused on whether
> > the vma lock is needed if pmd sharing is anyway forbidden for HGM.  I
> > raised a question in the other patch of MADV_COLLAPSE, maybe they're
> > related questions so we can keep it there.
> 
> I can quickly answer that.  Yes.  The vma lock is also being used for
> fault/truncation synchronization.  Commit e700898fa075 make sure it is
> even used on architectures that do not support PMD sharing.
> 
> I had come up with a rather ugly method of using the fault mutex for
> fault/truncation synchronization, but using the vma lock was more
> elegant.

Thanks for answering, I'll need to read some more on truncation later.
Before that, since COLLAPSE will already require the i_mmap_rwsem write
lock, does it mean it is naturally race-free against truncation
even without vma lock?
  
Mike Kravetz Jan. 23, 2023, 5:49 p.m. UTC | #31
On 01/23/23 10:19, Peter Xu wrote:
> Hi, Mike,
> 
> On Thu, Jan 19, 2023 at 03:44:25PM -0800, Mike Kravetz wrote:
> > On 01/19/23 18:07, Peter Xu wrote:
> > > 
> > > Actually when revisiting the locks I'm getting a bit confused on whether
> > > the vma lock is needed if pmd sharing is anyway forbidden for HGM.  I
> > > raised a question in the other patch of MADV_COLLAPSE, maybe they're
> > > related questions so we can keep it there.
> > 
> > I can quickly answer that.  Yes.  The vma lock is also being used for
> > fault/truncation synchronization.  Commit e700898fa075 make sure it is
> > even used on architectures that do not support PMD sharing.
> > 
> > I had come up with a rather ugly method of using the fault mutex for
> > fault/truncation synchronization, but using the vma lock was more
> > elegant.
> 
> Thanks for answering, I'll need to read some more on truncation later.
> Before that, since COLLAPSE will already require the i_mmap_rwsem write
> lock already, does it mean it is naturally race-free against truncation
> even without vma lock?

Yes, and thanks for making me take a closer look at COLLAPSE. :)
  
James Houghton Jan. 26, 2023, 4:58 p.m. UTC | #32
On Thu, Jan 19, 2023 at 11:42 AM James Houghton <jthoughton@google.com> wrote:
>
> On Thu, Jan 19, 2023 at 9:32 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >
> > On 01/19/23 08:57, James Houghton wrote:
> > > FWIW, what makes the most sense to me right now is to implement the
> > > THP-like scheme and mark HGM as mutually exclusive with the vmemmap
> > > optimization. We can later come up with a scheme that lets us retain
> > > compatibility. (Is that what you mean by "this can be done somewhat
> > > independently", Mike?)
> >
> > Sort of, I was only saying that getting the ref/map counting right seems
> > like a task that can be independently worked.  Using the THP-like scheme
> > is good.
>
> Ok! So if you're ok with the intermediate mapping sizes, it sounds
> like I should go ahead and implement the THP-like scheme.

It turns out that the THP-like scheme significantly slows down
MADV_COLLAPSE: decrementing the mapcounts for the 4K subpages becomes
the vast majority of the time spent in MADV_COLLAPSE when collapsing
1G mappings. It is doing 262k atomic decrements, so this makes sense.

This is only really a problem because this is done between
mmu_notifier_invalidate_range_start() and
mmu_notifier_invalidate_range_end(), so KVM won't allow vCPUs to
access any of the 1G page while we're doing this (and it can take like
~1 second for each 1G, at least on the x86 server I was testing on).

- James
  
Peter Xu Jan. 26, 2023, 8:30 p.m. UTC | #33
James,

On Thu, Jan 26, 2023 at 08:58:51AM -0800, James Houghton wrote:
> It turns out that the THP-like scheme significantly slows down
> MADV_COLLAPSE: decrementing the mapcounts for the 4K subpages becomes
> the vast majority of the time spent in MADV_COLLAPSE when collapsing
> 1G mappings. It is doing 262k atomic decrements, so this makes sense.
> 
> This is only really a problem because this is done between
> mmu_notifier_invalidate_range_start() and
> mmu_notifier_invalidate_range_end(), so KVM won't allow vCPUs to
> access any of the 1G page while we're doing this (and it can take like
> ~1 second for each 1G, at least on the x86 server I was testing on).

Did you try to measure the time, or it's a quick observation from perf?

IIRC I used to measure some atomic ops, it is not as drastic as I thought.
But maybe it depends on many things.

I'm curious how the 1sec is provisioned between the procedures.  E.g., I
would expect mmu_notifier_invalidate_range_start() to also take some time
too as it should walk the smally mapped EPT pgtables.

Since we'll still keep the intermediate levels around - from application
POV, one other thing to remedy this is to further shrink the size of COLLAPSE
so potentially for a very large page we can start with building 2M layers.
But then collapse will need to be run for at least two rounds.
  
James Houghton Jan. 27, 2023, 9:02 p.m. UTC | #34
On Thu, Jan 26, 2023 at 12:31 PM Peter Xu <peterx@redhat.com> wrote:
>
> James,
>
> On Thu, Jan 26, 2023 at 08:58:51AM -0800, James Houghton wrote:
> > It turns out that the THP-like scheme significantly slows down
> > MADV_COLLAPSE: decrementing the mapcounts for the 4K subpages becomes
> > the vast majority of the time spent in MADV_COLLAPSE when collapsing
> > 1G mappings. It is doing 262k atomic decrements, so this makes sense.
> >
> > This is only really a problem because this is done between
> > mmu_notifier_invalidate_range_start() and
> > mmu_notifier_invalidate_range_end(), so KVM won't allow vCPUs to
> > access any of the 1G page while we're doing this (and it can take like
> > ~1 second for each 1G, at least on the x86 server I was testing on).
>
> Did you try to measure the time, or it's a quick observation from perf?

I put some ktime_get()s in.

>
> IIRC I used to measure some atomic ops, it is not as drastic as I thought.
> But maybe it depends on many things.
>
> I'm curious how the 1sec is provisioned between the procedures.  E.g., I
> would expect mmu_notifier_invalidate_range_start() to also take some time
> too as it should walk the smally mapped EPT pgtables.

Somehow this doesn't take all that long (only like 10-30ms when
collapsing from 4K -> 1G) compared to hugetlb_collapse().

>
> Since we'll still keep the intermediate levels around - from application
> POV, one other thing to remedy this is further shrink the size of COLLAPSE
> so potentially for a very large page we can start with building 2M layers.
> But then collapse will need to be run at least two rounds.

That's exactly what I thought to do. :) I realized, too, that this is
actually how userspace *should* collapse things to avoid holding up
vCPUs too long. I think this is a good reason to keep intermediate
page sizes.

When collapsing 4K -> 1G, the mapcount scheme doesn't actually make a
huge difference: the THP-like scheme is about 30% slower overall.

When collapsing 4K -> 2M -> 1G, the mapcount scheme makes a HUGE
difference. For the THP-like scheme, collapsing 4K -> 2M requires
decrementing and then re-incrementing subpage->_mapcount, and then
from 2M -> 1G, we have to decrement all 262k subpages->_mapcount. For
the head-only scheme, for each 2M in the 4K -> 2M collapse, we
decrement the compound_mapcount 512 times (once per PTE), then
increment it once. And then for 2M -> 1G, for each 1G, we decrement
mapcount again by 512 (once per PMD), incrementing it once.

The mapcount decrements are about on par with how long it takes to do
other things, like updating page tables. The main problem is, with the
THP-like scheme (implemented like this [1]), there isn't a way to
avoid the 262k decrements when collapsing 1G. So if we want
MADV_COLLAPSE to be fast and we want a THP-like page_mapcount() API,
then I think something more clever needs to be implemented.

[1]: https://github.com/48ca/linux/blob/hgmv2-jan24/mm/hugetlb.c#L127-L178


- James
  
Peter Xu Jan. 30, 2023, 5:29 p.m. UTC | #35
On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote:
> On Thu, Jan 26, 2023 at 12:31 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > James,
> >
> > On Thu, Jan 26, 2023 at 08:58:51AM -0800, James Houghton wrote:
> > > It turns out that the THP-like scheme significantly slows down
> > > MADV_COLLAPSE: decrementing the mapcounts for the 4K subpages becomes
> > > the vast majority of the time spent in MADV_COLLAPSE when collapsing
> > > 1G mappings. It is doing 262k atomic decrements, so this makes sense.
> > >
> > > This is only really a problem because this is done between
> > > mmu_notifier_invalidate_range_start() and
> > > mmu_notifier_invalidate_range_end(), so KVM won't allow vCPUs to
> > > access any of the 1G page while we're doing this (and it can take like
> > > ~1 second for each 1G, at least on the x86 server I was testing on).
> >
> > Did you try to measure the time, or it's a quick observation from perf?
> 
> I put some ktime_get()s in.
> 
> >
> > IIRC I used to measure some atomic ops, it is not as drastic as I thought.
> > But maybe it depends on many things.
> >
> > I'm curious how the 1sec is provisioned between the procedures.  E.g., I
> > would expect mmu_notifier_invalidate_range_start() to also take some time
> > too as it should walk the smally mapped EPT pgtables.
> 
> Somehow this doesn't take all that long (only like 10-30ms when
> collapsing from 4K -> 1G) compared to hugetlb_collapse().

Did you populate as much the EPT pgtable when measuring this?

IIUC this number should be pretty much relevant to how many pages are
shadowed to the kvm pgtables.  If the EPT table is mostly empty it should
be super fast, but OTOH it can be much slower when it's populated,
because tdp mmu should need to handle the pgtable leaves one by one.

E.g. it should be fully populated if you have a program busy dirtying most
of the guest pages during test migration.

Write op should be the worst case here since it'll require the atomic op
being applied; see kvm_tdp_mmu_write_spte().

> 
> >
> > Since we'll still keep the intermediate levels around - from application
> > POV, one other thing to remedy this is further shrink the size of COLLAPSE
> > so potentially for a very large page we can start with building 2M layers.
> > But then collapse will need to be run at least two rounds.
> 
> That's exactly what I thought to do. :) I realized, too, that this is
> actually how userspace *should* collapse things to avoid holding up
> vCPUs too long. I think this is a good reason to keep intermediate
> page sizes.
> 
> When collapsing 4K -> 1G, the mapcount scheme doesn't actually make a
> huge difference: the THP-like scheme is about 30% slower overall.
> 
> When collapsing 4K -> 2M -> 1G, the mapcount scheme makes a HUGE
> difference. For the THP-like scheme, collapsing 4K -> 2M requires
> decrementing and then re-incrementing subpage->_mapcount, and then
> from 2M -> 1G, we have to decrement all 262k subpages->_mapcount. For
> the head-only scheme, for each 2M in the 4K -> 2M collapse, we
> decrement the compound_mapcount 512 times (once per PTE), then
> increment it once. And then for 2M -> 1G, for each 1G, we decrement
> mapcount again by 512 (once per PMD), incrementing it once.

Did you have quantified numbers (with your ktime tweak) to compare these?
If we want to go the other route, I think these will be materials to
justify any other approach on mapcount handling.

> 
> The mapcount decrements are about on par with how long it takes to do
> other things, like updating page tables. The main problem is, with the
> THP-like scheme (implemented like this [1]), there isn't a way to
> avoid the 262k decrements when collapsing 1G. So if we want
> MADV_COLLAPSE to be fast and we want a THP-like page_mapcount() API,
> then I think something more clever needs to be implemented.
> 
> [1]: https://github.com/48ca/linux/blob/hgmv2-jan24/mm/hugetlb.c#L127-L178

I believe the whole goal of HGM is trying to face the same challenge if
we'll allow 1G THP to exist and be able to split for anon.

I don't remember whether we discussed below, maybe we did?  Anyway...

Another way to not use the thp mapcount, nor break smaps and similar calls
to page_mapcount() on small pages, is to increase the hpage mapcount only
when the hstate pXd (in case of 1G it's the PUD) entry becomes populated (no
matter as a leaf or a non-leaf), and decrease the mapcount when the pXd
entry is removed (for a leaf, it's the same as for now; for HGM, it's when
freeing the pgtable under the PUD entry).
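
As a self-contained illustration of that counting rule (names here are
placeholders, not existing kernel APIs):

/*
 * One mapcount reference per hstate-level (e.g. PUD for 1G) entry,
 * leaf or non-leaf alike.
 */
struct hgm_hpage_sketch {
        int mapcount;           /* stands in for the hugepage's compound mapcount */
};

static void pud_entry_installed(struct hgm_hpage_sketch *hp)
{
        hp->mapcount++;         /* PUD entry appears: leaf *or* HGM page table */
}

static void pud_entry_removed(struct hgm_hpage_sketch *hp)
{
        hp->mapcount--;         /* leaf unmapped, or the HGM page table freed */
}

If I'm counting right, that would bound the collapse cost to at most one
mapcount update per pXd entry instead of one per 4K subpage.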

Again, in all cases I think some solid measurements would definitely be
helpful (as commented above) to see how much overhead there will be and
whether that'll start to become a problem at least for the current
motivations of the whole HGM idea.

Thanks,
  
James Houghton Jan. 30, 2023, 6:38 p.m. UTC | #36
On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote:
> > On Thu, Jan 26, 2023 at 12:31 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > James,
> > >
> > > On Thu, Jan 26, 2023 at 08:58:51AM -0800, James Houghton wrote:
> > > > It turns out that the THP-like scheme significantly slows down
> > > > MADV_COLLAPSE: decrementing the mapcounts for the 4K subpages becomes
> > > > the vast majority of the time spent in MADV_COLLAPSE when collapsing
> > > > 1G mappings. It is doing 262k atomic decrements, so this makes sense.
> > > >
> > > > This is only really a problem because this is done between
> > > > mmu_notifier_invalidate_range_start() and
> > > > mmu_notifier_invalidate_range_end(), so KVM won't allow vCPUs to
> > > > access any of the 1G page while we're doing this (and it can take like
> > > > ~1 second for each 1G, at least on the x86 server I was testing on).
> > >
> > > Did you try to measure the time, or it's a quick observation from perf?
> >
> > I put some ktime_get()s in.
> >
> > >
> > > IIRC I used to measure some atomic ops, it is not as drastic as I thought.
> > > But maybe it depends on many things.
> > >
> > > I'm curious how the 1sec is provisioned between the procedures.  E.g., I
> > > would expect mmu_notifier_invalidate_range_start() to also take some time
> > > too as it should walk the smally mapped EPT pgtables.
> >
> > Somehow this doesn't take all that long (only like 10-30ms when
> > collapsing from 4K -> 1G) compared to hugetlb_collapse().
>
> Did you populate as much the EPT pgtable when measuring this?
>
> IIUC this number should be pretty much relevant to how many pages are
> shadowed to the kvm pgtables.  If the EPT table is mostly empty it should
> be super fast, but OTOH it can be much slower if when it's populated,
> because tdp mmu should need to handle the pgtable leaves one by one.
>
> E.g. it should be fully populated if you have a program busy dirtying most
> of the guest pages during test migration.

That's what I was doing. I was running a workload in the guest that
just writes 8 bytes to a page and jumps ahead a few pages on all
vCPUs, touching most of its memory.

But there is more to understand; I'll collect more results. I'm not
sure why the EPT can be unmapped/collapsed so quickly.

>
> Write op should be the worst here case since it'll require the atomic op
> being applied; see kvm_tdp_mmu_write_spte().
>
> >
> > >
> > > Since we'll still keep the intermediate levels around - from application
> > > POV, one other thing to remedy this is further shrink the size of COLLAPSE
> > > so potentially for a very large page we can start with building 2M layers.
> > > But then collapse will need to be run at least two rounds.
> >
> > That's exactly what I thought to do. :) I realized, too, that this is
> > actually how userspace *should* collapse things to avoid holding up
> > vCPUs too long. I think this is a good reason to keep intermediate
> > page sizes.
> >
> > When collapsing 4K -> 1G, the mapcount scheme doesn't actually make a
> > huge difference: the THP-like scheme is about 30% slower overall.
> >
> > When collapsing 4K -> 2M -> 1G, the mapcount scheme makes a HUGE
> > difference. For the THP-like scheme, collapsing 4K -> 2M requires
> > decrementing and then re-incrementing subpage->_mapcount, and then
> > from 2M -> 1G, we have to decrement all 262k subpages->_mapcount. For
> > the head-only scheme, for each 2M in the 4K -> 2M collapse, we
> > decrement the compound_mapcount 512 times (once per PTE), then
> > increment it once. And then for 2M -> 1G, for each 1G, we decrement
> > mapcount again by 512 (once per PMD), incrementing it once.
>
> Did you have quantified numbers (with your ktime tweak) to compare these?
> If we want to go the other route, I think these will be materials to
> justify any other approach on mapcount handling.

Ok, I can do that. Give me a couple of days to collect more results and
organize them in a helpful way.

(If it's helpful at all, here are some results I collected last week:
[2]. Please ignore it if it's not helpful.)

>
> >
> > The mapcount decrements are about on par with how long it takes to do
> > other things, like updating page tables. The main problem is, with the
> > THP-like scheme (implemented like this [1]), there isn't a way to
> > avoid the 262k decrements when collapsing 1G. So if we want
> > MADV_COLLAPSE to be fast and we want a THP-like page_mapcount() API,
> > then I think something more clever needs to be implemented.
> >
> > [1]: https://github.com/48ca/linux/blob/hgmv2-jan24/mm/hugetlb.c#L127-L178
>
> I believe the whole goal of HGM is trying to face the same challenge if
> we'll allow 1G THP exist and being able to split for anon.
>
> I don't remember whether we discussed below, maybe we did?  Anyway...
>
> Another way to not use thp mapcount, nor break smaps and similar calls to
> page_mapcount() on small page, is to only increase the hpage mapcount only
> when hstate pXd (in case of 1G it's PUD) entry being populated (no matter
> as leaf or a non-leaf), and the mapcount can be decreased when the pXd
> entry is removed (for leaf, it's the same as for now; for HGM, it's when
> freeing pgtable of the PUD entry).

Right, and this is doable. Also it seems like this is pretty close to
the direction Matthew Wilcox wants to go with THPs.

Something I noticed though, from the implementation of
folio_referenced()/folio_referenced_one(), is that folio_mapcount()
ought to report the total number of PTEs that are pointing on the page
(or the number of times page_vma_mapped_walk returns true). FWIW,
folio_referenced() is never called for hugetlb folios.

>
> Again, in all cases I think some solid measurements would definitely be
> helpful (as commented above) to see how much overhead will there be and
> whether that'll start to become a problem at least for the current
> motivations of the whole HGM idea.
>
> Thanks,
>
> --
> Peter Xu
>

Thanks, Peter!

[2]: https://pastebin.com/raw/DVfNFi2m

- James
  
Peter Xu Jan. 30, 2023, 9:14 p.m. UTC | #37
On Mon, Jan 30, 2023 at 10:38:41AM -0800, James Houghton wrote:
> On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote:
> > > On Thu, Jan 26, 2023 at 12:31 PM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > James,
> > > >
> > > > On Thu, Jan 26, 2023 at 08:58:51AM -0800, James Houghton wrote:
> > > > > It turns out that the THP-like scheme significantly slows down
> > > > > MADV_COLLAPSE: decrementing the mapcounts for the 4K subpages becomes
> > > > > the vast majority of the time spent in MADV_COLLAPSE when collapsing
> > > > > 1G mappings. It is doing 262k atomic decrements, so this makes sense.
> > > > >
> > > > > This is only really a problem because this is done between
> > > > > mmu_notifier_invalidate_range_start() and
> > > > > mmu_notifier_invalidate_range_end(), so KVM won't allow vCPUs to
> > > > > access any of the 1G page while we're doing this (and it can take like
> > > > > ~1 second for each 1G, at least on the x86 server I was testing on).
> > > >
> > > > Did you try to measure the time, or it's a quick observation from perf?
> > >
> > > I put some ktime_get()s in.
> > >
> > > >
> > > > IIRC I used to measure some atomic ops, it is not as drastic as I thought.
> > > > But maybe it depends on many things.
> > > >
> > > > I'm curious how the 1sec is provisioned between the procedures.  E.g., I
> > > > would expect mmu_notifier_invalidate_range_start() to also take some time
> > > > too as it should walk the smally mapped EPT pgtables.
> > >
> > > Somehow this doesn't take all that long (only like 10-30ms when
> > > collapsing from 4K -> 1G) compared to hugetlb_collapse().
> >
> > Did you populate as much the EPT pgtable when measuring this?
> >
> > IIUC this number should be pretty much relevant to how many pages are
> > shadowed to the kvm pgtables.  If the EPT table is mostly empty it should
> > be super fast, but OTOH it can be much slower if when it's populated,
> > because tdp mmu should need to handle the pgtable leaves one by one.
> >
> > E.g. it should be fully populated if you have a program busy dirtying most
> > of the guest pages during test migration.
> 
> That's what I was doing. I was running a workload in the guest that
> just writes 8 bytes to a page and jumps ahead a few pages on all
> vCPUs, touching most of its memory.
> 
> But there is more to understand; I'll collect more results. I'm not
> sure why the EPT can be unmapped/collapsed so quickly.

Maybe something smart done by the hypervisor?

> 
> >
> > Write op should be the worst here case since it'll require the atomic op
> > being applied; see kvm_tdp_mmu_write_spte().
> >
> > >
> > > >
> > > > Since we'll still keep the intermediate levels around - from application
> > > > POV, one other thing to remedy this is further shrink the size of COLLAPSE
> > > > so potentially for a very large page we can start with building 2M layers.
> > > > But then collapse will need to be run at least two rounds.
> > >
> > > That's exactly what I thought to do. :) I realized, too, that this is
> > > actually how userspace *should* collapse things to avoid holding up
> > > vCPUs too long. I think this is a good reason to keep intermediate
> > > page sizes.
> > >
> > > When collapsing 4K -> 1G, the mapcount scheme doesn't actually make a
> > > huge difference: the THP-like scheme is about 30% slower overall.
> > >
> > > When collapsing 4K -> 2M -> 1G, the mapcount scheme makes a HUGE
> > > difference. For the THP-like scheme, collapsing 4K -> 2M requires
> > > decrementing and then re-incrementing subpage->_mapcount, and then
> > > from 2M -> 1G, we have to decrement all 262k subpages->_mapcount. For
> > > the head-only scheme, for each 2M in the 4K -> 2M collapse, we
> > > decrement the compound_mapcount 512 times (once per PTE), then
> > > increment it once. And then for 2M -> 1G, for each 1G, we decrement
> > > mapcount again by 512 (once per PMD), incrementing it once.
> >
> > Did you have quantified numbers (with your ktime tweak) to compare these?
> > If we want to go the other route, I think these will be materials to
> > justify any other approach on mapcount handling.
> 
> Ok, I can do that. GIve me a couple days to collect more results and
> organize them in a helpful way.
> 
> (If it's helpful at all, here are some results I collected last week:
> [2]. Please ignore it if it's not helpful.)

It's helpful already at least to me, thanks.  Yes the change is drastic.

> 
> >
> > >
> > > The mapcount decrements are about on par with how long it takes to do
> > > other things, like updating page tables. The main problem is, with the
> > > THP-like scheme (implemented like this [1]), there isn't a way to
> > > avoid the 262k decrements when collapsing 1G. So if we want
> > > MADV_COLLAPSE to be fast and we want a THP-like page_mapcount() API,
> > > then I think something more clever needs to be implemented.
> > >
> > > [1]: https://github.com/48ca/linux/blob/hgmv2-jan24/mm/hugetlb.c#L127-L178
> >
> > I believe the whole goal of HGM is trying to face the same challenge if
> > we'll allow 1G THP exist and being able to split for anon.
> >
> > I don't remember whether we discussed below, maybe we did?  Anyway...
> >
> > Another way to not use thp mapcount, nor break smaps and similar calls to
> > page_mapcount() on small page, is to only increase the hpage mapcount only
> > when hstate pXd (in case of 1G it's PUD) entry being populated (no matter
> > as leaf or a non-leaf), and the mapcount can be decreased when the pXd
> > entry is removed (for leaf, it's the same as for now; for HGM, it's when
> > freeing pgtable of the PUD entry).
> 
> Right, and this is doable. Also it seems like this is pretty close to
> the direction Matthew Wilcox wants to go with THPs.

I may not be familiar with it, do you mean this one?

https://lore.kernel.org/all/Y9Afwds%2FJl39UjEp@casper.infradead.org/

For hugetlb I think it should be easier to maintain than any-sized
folios, because there's the pgtable non-leaf entry to track rmap
information and the folio size is static (the hpage size).

It'll be different for folios, where it can be a randomly sized chunk of
pages, so it needs to be managed by batching the ptes on install/zap.

> 
> Something I noticed though, from the implementation of
> folio_referenced()/folio_referenced_one(), is that folio_mapcount()
> ought to report the total number of PTEs that are pointing on the page
> (or the number of times page_vma_mapped_walk returns true). FWIW,
> folio_referenced() is never called for hugetlb folios.

FWIU folio_mapcount is the thing it needs for now to do the rmap walks -
it'll walk every leaf page being mapped, big or small, so IIUC that number
should match with what it expects to see later, more or less.

But I agree the mapcount/referenced value itself is debatable to me, just
like what you raised in the other thread on page migration.  Meanwhile, I
am not certain whether the mapcount is accurate either because AFAICT the
mapcount can be modified if e.g. a new page mapping is established any time
before taking the page lock later in folio_referenced().

It's just that I don't see any severe issue either due to any of above, as
long as that information is only used as a hint for next steps, e.g., which
page to swap out.
  
James Houghton Feb. 1, 2023, 12:24 a.m. UTC | #38
On Mon, Jan 30, 2023 at 1:14 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Jan 30, 2023 at 10:38:41AM -0800, James Houghton wrote:
> > On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote:
> > > > On Thu, Jan 26, 2023 at 12:31 PM Peter Xu <peterx@redhat.com> wrote:
> > > > >
> > > > > James,
> > > > >
> > > > > On Thu, Jan 26, 2023 at 08:58:51AM -0800, James Houghton wrote:
> > > > > > It turns out that the THP-like scheme significantly slows down
> > > > > > MADV_COLLAPSE: decrementing the mapcounts for the 4K subpages becomes
> > > > > > the vast majority of the time spent in MADV_COLLAPSE when collapsing
> > > > > > 1G mappings. It is doing 262k atomic decrements, so this makes sense.
> > > > > >
> > > > > > This is only really a problem because this is done between
> > > > > > mmu_notifier_invalidate_range_start() and
> > > > > > mmu_notifier_invalidate_range_end(), so KVM won't allow vCPUs to
> > > > > > access any of the 1G page while we're doing this (and it can take like
> > > > > > ~1 second for each 1G, at least on the x86 server I was testing on).
> > > > >
> > > > > Did you try to measure the time, or it's a quick observation from perf?
> > > >
> > > > I put some ktime_get()s in.
> > > >
> > > > >
> > > > > IIRC I used to measure some atomic ops, it is not as drastic as I thought.
> > > > > But maybe it depends on many things.
> > > > >
> > > > > I'm curious how the 1sec is provisioned between the procedures.  E.g., I
> > > > > would expect mmu_notifier_invalidate_range_start() to also take some time
> > > > > too as it should walk the smally mapped EPT pgtables.
> > > >
> > > > Somehow this doesn't take all that long (only like 10-30ms when
> > > > collapsing from 4K -> 1G) compared to hugetlb_collapse().
> > >
> > > Did you populate as much the EPT pgtable when measuring this?
> > >
> > > IIUC this number should be pretty much relevant to how many pages are
> > > shadowed to the kvm pgtables.  If the EPT table is mostly empty it should
> > > be super fast, but OTOH it can be much slower if when it's populated,
> > > because tdp mmu should need to handle the pgtable leaves one by one.
> > >
> > > E.g. it should be fully populated if you have a program busy dirtying most
> > > of the guest pages during test migration.
> >
> > That's what I was doing. I was running a workload in the guest that
> > just writes 8 bytes to a page and jumps ahead a few pages on all
> > vCPUs, touching most of its memory.
> >
> > But there is more to understand; I'll collect more results. I'm not
> > sure why the EPT can be unmapped/collapsed so quickly.
>
> Maybe something smart done by the hypervisor?

Doing a little bit more digging, it looks like the
invalidate_range_start notifier clears the sptes, and then later on
(on the next EPT violation), the page tables are freed. I still need
to look at how they end up being so much faster still, but I thought
that was interesting.

>
> >
> > >
> > > Write op should be the worst here case since it'll require the atomic op
> > > being applied; see kvm_tdp_mmu_write_spte().
> > >
> > > >
> > > > >
> > > > > Since we'll still keep the intermediate levels around - from application
> > > > > POV, one other thing to remedy this is further shrink the size of COLLAPSE
> > > > > so potentially for a very large page we can start with building 2M layers.
> > > > > But then collapse will need to be run at least two rounds.
> > > >
> > > > That's exactly what I thought to do. :) I realized, too, that this is
> > > > actually how userspace *should* collapse things to avoid holding up
> > > > vCPUs too long. I think this is a good reason to keep intermediate
> > > > page sizes.
> > > >
> > > > When collapsing 4K -> 1G, the mapcount scheme doesn't actually make a
> > > > huge difference: the THP-like scheme is about 30% slower overall.
> > > >
> > > > When collapsing 4K -> 2M -> 1G, the mapcount scheme makes a HUGE
> > > > difference. For the THP-like scheme, collapsing 4K -> 2M requires
> > > > decrementing and then re-incrementing subpage->_mapcount, and then
> > > > from 2M -> 1G, we have to decrement all 262k subpages->_mapcount. For
> > > > the head-only scheme, for each 2M in the 4K -> 2M collapse, we
> > > > decrement the compound_mapcount 512 times (once per PTE), then
> > > > increment it once. And then for 2M -> 1G, for each 1G, we decrement
> > > > mapcount again by 512 (once per PMD), incrementing it once.
> > >
> > > Did you have quantified numbers (with your ktime tweak) to compare these?
> > > If we want to go the other route, I think these will be materials to
> > > justify any other approach on mapcount handling.
> >
> > Ok, I can do that. Give me a couple days to collect more results and
> > organize them in a helpful way.
> >
> > (If it's helpful at all, here are some results I collected last week:
> > [2]. Please ignore it if it's not helpful.)
>
> It's helpful already at least to me, thanks.  Yes the change is drastic.

That data only contains THP-like mapcount performance, no performance
for the head-only way. But the head-only scheme makes the 2M -> 1G
very good ("56" comes down to about the same everything else, instead
of being ~100-500x bigger).

>
> >
> > >
> > > >
> > > > The mapcount decrements are about on par with how long it takes to do
> > > > other things, like updating page tables. The main problem is, with the
> > > > THP-like scheme (implemented like this [1]), there isn't a way to
> > > > avoid the 262k decrements when collapsing 1G. So if we want
> > > > MADV_COLLAPSE to be fast and we want a THP-like page_mapcount() API,
> > > > then I think something more clever needs to be implemented.
> > > >
> > > > [1]: https://github.com/48ca/linux/blob/hgmv2-jan24/mm/hugetlb.c#L127-L178
> > >
> > > I believe the whole goal of HGM is trying to face the same challenge if
> > > we'll allow 1G THP exist and being able to split for anon.
> > >
> > > I don't remember whether we discussed below, maybe we did?  Anyway...
> > >
> > > Another way to not use thp mapcount, nor break smaps and similar calls to
> > > page_mapcount() on small page, is to only increase the hpage mapcount only
> > > when hstate pXd (in case of 1G it's PUD) entry being populated (no matter
> > > as leaf or a non-leaf), and the mapcount can be decreased when the pXd
> > > entry is removed (for leaf, it's the same as for now; for HGM, it's when
> > > freeing pgtable of the PUD entry).
> >
> > Right, and this is doable. Also it seems like this is pretty close to
> > the direction Matthew Wilcox wants to go with THPs.
>
> I may not be familiar with it, do you mean this one?
>
> https://lore.kernel.org/all/Y9Afwds%2FJl39UjEp@casper.infradead.org/

Yep that's it.

>
> For hugetlb I think it should be easier to maintain rather than any-sized
> folios, because there's the pgtable non-leaf entry to track rmap
> information and the folio size being static to hpage size.
>
> It'll be different to folios where it can be random sized pages chunk, so
> it needs to be managed by batching the ptes when install/zap.

Agreed. It's probably easier for HugeTLB because they're always
"naturally aligned" and yeah they can't change sizes.

>
> >
> > Something I noticed though, from the implementation of
> > folio_referenced()/folio_referenced_one(), is that folio_mapcount()
> > ought to report the total number of PTEs that are pointing on the page
> > (or the number of times page_vma_mapped_walk returns true). FWIW,
> > folio_referenced() is never called for hugetlb folios.
>
> FWIU folio_mapcount is the thing it needs for now to do the rmap walks -
> it'll walk every leaf page being mapped, big or small, so IIUC that number
> should match with what it expects to see later, more or less.

I don't fully understand what you mean here.

>
> But I agree the mapcount/referenced value itself is debatable to me, just
> like what you raised in the other thread on page migration.  Meanwhile, I
> am not certain whether the mapcount is accurate either because AFAICT the
> mapcount can be modified if e.g. new page mapping established as long as
> before taking the page lock later in folio_referenced().
>
> It's just that I don't see any severe issue either due to any of above, as
> long as that information is only used as a hint for next steps, e.g., to
> swap which page out.

I also don't see a big problem with folio_referenced() (and you're
right that folio_mapcount() can be stale by the time it takes the
folio lock). It still seems like folio_mapcount() should return the
total number of PTEs that map the page though. Are you saying that
breaking this would be ok?
  
Peter Xu Feb. 1, 2023, 1:24 a.m. UTC | #39
On Tue, Jan 31, 2023 at 04:24:15PM -0800, James Houghton wrote:
> On Mon, Jan 30, 2023 at 1:14 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Mon, Jan 30, 2023 at 10:38:41AM -0800, James Houghton wrote:
> > > On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote:
> > > > > On Thu, Jan 26, 2023 at 12:31 PM Peter Xu <peterx@redhat.com> wrote:
> > > > > >
> > > > > > James,
> > > > > >
> > > > > > On Thu, Jan 26, 2023 at 08:58:51AM -0800, James Houghton wrote:
> > > > > > > It turns out that the THP-like scheme significantly slows down
> > > > > > > MADV_COLLAPSE: decrementing the mapcounts for the 4K subpages becomes
> > > > > > > the vast majority of the time spent in MADV_COLLAPSE when collapsing
> > > > > > > 1G mappings. It is doing 262k atomic decrements, so this makes sense.
> > > > > > >
> > > > > > > This is only really a problem because this is done between
> > > > > > > mmu_notifier_invalidate_range_start() and
> > > > > > > mmu_notifier_invalidate_range_end(), so KVM won't allow vCPUs to
> > > > > > > access any of the 1G page while we're doing this (and it can take like
> > > > > > > ~1 second for each 1G, at least on the x86 server I was testing on).
> > > > > >
> > > > > > Did you try to measure the time, or it's a quick observation from perf?
> > > > >
> > > > > I put some ktime_get()s in.
> > > > >
> > > > > >
> > > > > > IIRC I used to measure some atomic ops, it is not as drastic as I thought.
> > > > > > But maybe it depends on many things.
> > > > > >
> > > > > > I'm curious how the 1sec is provisioned between the procedures.  E.g., I
> > > > > > would expect mmu_notifier_invalidate_range_start() to also take some time
> > > > > > too as it should walk the smally mapped EPT pgtables.
> > > > >
> > > > > Somehow this doesn't take all that long (only like 10-30ms when
> > > > > collapsing from 4K -> 1G) compared to hugetlb_collapse().
> > > >
> > > > Did you populate as much the EPT pgtable when measuring this?
> > > >
> > > > IIUC this number should be pretty much relevant to how many pages are
> > > > shadowed to the kvm pgtables.  If the EPT table is mostly empty it should
> > > > be super fast, but OTOH it can be much slower if when it's populated,
> > > > because tdp mmu should need to handle the pgtable leaves one by one.
> > > >
> > > > E.g. it should be fully populated if you have a program busy dirtying most
> > > > of the guest pages during test migration.
> > >
> > > That's what I was doing. I was running a workload in the guest that
> > > just writes 8 bytes to a page and jumps ahead a few pages on all
> > > vCPUs, touching most of its memory.
> > >
> > > But there is more to understand; I'll collect more results. I'm not
> > > sure why the EPT can be unmapped/collapsed so quickly.
> >
> > Maybe something smart done by the hypervisor?
> 
> Doing a little bit more digging, it looks like the
> invalidate_range_start notifier clears the sptes, and then later on
> (on the next EPT violation), the page tables are freed. I still need
> to look at how they end up being so much faster still, but I thought
> that was interesting.
> 
> >
> > >
> > > >
> > > > Write op should be the worst here case since it'll require the atomic op
> > > > being applied; see kvm_tdp_mmu_write_spte().
> > > >
> > > > >
> > > > > >
> > > > > > Since we'll still keep the intermediate levels around - from application
> > > > > > POV, one other thing to remedy this is further shrink the size of COLLAPSE
> > > > > > so potentially for a very large page we can start with building 2M layers.
> > > > > > But then collapse will need to be run at least two rounds.
> > > > >
> > > > > That's exactly what I thought to do. :) I realized, too, that this is
> > > > > actually how userspace *should* collapse things to avoid holding up
> > > > > vCPUs too long. I think this is a good reason to keep intermediate
> > > > > page sizes.
> > > > >
> > > > > When collapsing 4K -> 1G, the mapcount scheme doesn't actually make a
> > > > > huge difference: the THP-like scheme is about 30% slower overall.
> > > > >
> > > > > When collapsing 4K -> 2M -> 1G, the mapcount scheme makes a HUGE
> > > > > difference. For the THP-like scheme, collapsing 4K -> 2M requires
> > > > > decrementing and then re-incrementing subpage->_mapcount, and then
> > > > > from 2M -> 1G, we have to decrement all 262k subpages->_mapcount. For
> > > > > the head-only scheme, for each 2M in the 4K -> 2M collapse, we
> > > > > decrement the compound_mapcount 512 times (once per PTE), then
> > > > > increment it once. And then for 2M -> 1G, for each 1G, we decrement
> > > > > mapcount again by 512 (once per PMD), incrementing it once.
> > > >
> > > > Did you have quantified numbers (with your ktime tweak) to compare these?
> > > > If we want to go the other route, I think these will be materials to
> > > > justify any other approach on mapcount handling.
> > >
> > > Ok, I can do that. Give me a couple days to collect more results and
> > > organize them in a helpful way.
> > >
> > > (If it's helpful at all, here are some results I collected last week:
> > > [2]. Please ignore it if it's not helpful.)
> >
> > It's helpful already at least to me, thanks.  Yes the change is drastic.
> 
> That data only contains THP-like mapcount performance, no performance
> for the head-only way. But the head-only scheme makes the 2M -> 1G
> very good ("56" comes down to about the same everything else, instead
> of being ~100-500x bigger).

Oops, I think I misread those.  Yeah please keep sharing information if you
come up with any.

> 
> >
> > >
> > > >
> > > > >
> > > > > The mapcount decrements are about on par with how long it takes to do
> > > > > other things, like updating page tables. The main problem is, with the
> > > > > THP-like scheme (implemented like this [1]), there isn't a way to
> > > > > avoid the 262k decrements when collapsing 1G. So if we want
> > > > > MADV_COLLAPSE to be fast and we want a THP-like page_mapcount() API,
> > > > > then I think something more clever needs to be implemented.
> > > > >
> > > > > [1]: https://github.com/48ca/linux/blob/hgmv2-jan24/mm/hugetlb.c#L127-L178
> > > >
> > > > I believe the whole goal of HGM is trying to face the same challenge if
> > > > we'll allow 1G THP exist and being able to split for anon.
> > > >
> > > > I don't remember whether we discussed below, maybe we did?  Anyway...
> > > >
> > > > Another way to not use thp mapcount, nor break smaps and similar calls to
> > > > page_mapcount() on small page, is to only increase the hpage mapcount only
> > > > when hstate pXd (in case of 1G it's PUD) entry being populated (no matter
> > > > as leaf or a non-leaf), and the mapcount can be decreased when the pXd
> > > > entry is removed (for leaf, it's the same as for now; for HGM, it's when
> > > > freeing pgtable of the PUD entry).
> > >
> > > Right, and this is doable. Also it seems like this is pretty close to
> > > the direction Matthew Wilcox wants to go with THPs.
> >
> > I may not be familiar with it, do you mean this one?
> >
> > https://lore.kernel.org/all/Y9Afwds%2FJl39UjEp@casper.infradead.org/
> 
> Yep that's it.
> 
> >
> > For hugetlb I think it should be easier to maintain rather than any-sized
> > folios, because there's the pgtable non-leaf entry to track rmap
> > information and the folio size being static to hpage size.
> >
> > It'll be different to folios where it can be random sized pages chunk, so
> > it needs to be managed by batching the ptes when install/zap.
> 
> Agreed. It's probably easier for HugeTLB because they're always
> "naturally aligned" and yeah they can't change sizes.
> 
> >
> > >
> > > Something I noticed though, from the implementation of
> > > folio_referenced()/folio_referenced_one(), is that folio_mapcount()
> > > ought to report the total number of PTEs that are pointing on the page
> > > (or the number of times page_vma_mapped_walk returns true). FWIW,
> > > folio_referenced() is never called for hugetlb folios.
> >
> > FWIU folio_mapcount is the thing it needs for now to do the rmap walks -
> > it'll walk every leaf page being mapped, big or small, so IIUC that number
> > should match with what it expects to see later, more or less.
> 
> I don't fully understand what you mean here.

I meant the rmap_walk pairing with folio_referenced_one() will walk all the
leaves for the folio, big or small.  I think that will match the number
with what got returned from folio_mapcount().

> 
> >
> > But I agree the mapcount/referenced value itself is debatable to me, just
> > like what you raised in the other thread on page migration.  Meanwhile, I
> > am not certain whether the mapcount is accurate either because AFAICT the
> > mapcount can be modified if e.g. new page mapping established as long as
> > before taking the page lock later in folio_referenced().
> >
> > It's just that I don't see any severe issue either due to any of above, as
> > long as that information is only used as a hint for next steps, e.g., to
> > swap which page out.
> 
> I also don't see a big problem with folio_referenced() (and you're
> right that folio_mapcount() can be stale by the time it takes the
> folio lock). It still seems like folio_mapcount() should return the
> total number of PTEs that map the page though. Are you saying that
> breaking this would be ok?

I didn't quite follow - isn't that already doing so?

folio_mapcount() is total_compound_mapcount() here, IIUC it is an
accumulated value of all possible PTEs or PMDs being mapped as long as it's
all or part of the folio being mapped.
  
James Houghton Feb. 1, 2023, 3:45 p.m. UTC | #40
On Tue, Jan 31, 2023 at 5:24 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Jan 31, 2023 at 04:24:15PM -0800, James Houghton wrote:
> > On Mon, Jan 30, 2023 at 1:14 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Mon, Jan 30, 2023 at 10:38:41AM -0800, James Houghton wrote:
> > > > On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <peterx@redhat.com> wrote:
> > > > >
> > > > > On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote:
[snip]
> > > > > Another way to not use thp mapcount, nor break smaps and similar calls to
> > > > > page_mapcount() on small page, is to only increase the hpage mapcount only
> > > > > when hstate pXd (in case of 1G it's PUD) entry being populated (no matter
> > > > > as leaf or a non-leaf), and the mapcount can be decreased when the pXd
> > > > > entry is removed (for leaf, it's the same as for now; for HGM, it's when
> > > > > freeing pgtable of the PUD entry).
> > > >
> > > > Right, and this is doable. Also it seems like this is pretty close to
> > > > the direction Matthew Wilcox wants to go with THPs.
> > >
> > > I may not be familiar with it, do you mean this one?
> > >
> > > https://lore.kernel.org/all/Y9Afwds%2FJl39UjEp@casper.infradead.org/
> >
> > Yep that's it.
> >
> > >
> > > For hugetlb I think it should be easier to maintain rather than any-sized
> > > folios, because there's the pgtable non-leaf entry to track rmap
> > > information and the folio size being static to hpage size.
> > >
> > > It'll be different to folios where it can be random sized pages chunk, so
> > > it needs to be managed by batching the ptes when install/zap.
> >
> > Agreed. It's probably easier for HugeTLB because they're always
> > "naturally aligned" and yeah they can't change sizes.
> >
> > >
> > > >
> > > > Something I noticed though, from the implementation of
> > > > folio_referenced()/folio_referenced_one(), is that folio_mapcount()
> > > > ought to report the total number of PTEs that are pointing on the page
> > > > (or the number of times page_vma_mapped_walk returns true). FWIW,
> > > > folio_referenced() is never called for hugetlb folios.
> > >
> > > FWIU folio_mapcount is the thing it needs for now to do the rmap walks -
> > > it'll walk every leaf page being mapped, big or small, so IIUC that number
> > > should match with what it expects to see later, more or less.
> >
> > I don't fully understand what you mean here.
>
> I meant the rmap_walk pairing with folio_referenced_one() will walk all the
> leaves for the folio, big or small.  I think that will match the number
> with what got returned from folio_mapcount().

See below.

>
> >
> > >
> > > But I agree the mapcount/referenced value itself is debatable to me, just
> > > like what you raised in the other thread on page migration.  Meanwhile, I
> > > am not certain whether the mapcount is accurate either because AFAICT the
> > > mapcount can be modified if e.g. new page mapping established as long as
> > > before taking the page lock later in folio_referenced().
> > >
> > > It's just that I don't see any severe issue either due to any of above, as
> > > long as that information is only used as a hint for next steps, e.g., to
> > > swap which page out.
> >
> > I also don't see a big problem with folio_referenced() (and you're
> > right that folio_mapcount() can be stale by the time it takes the
> > folio lock). It still seems like folio_mapcount() should return the
> > total number of PTEs that map the page though. Are you saying that
> > breaking this would be ok?
>
> I didn't quite follow - isn't that already doing so?
>
> folio_mapcount() is total_compound_mapcount() here, IIUC it is an
> accumulated value of all possible PTEs or PMDs being mapped as long as it's
> all or part of the folio being mapped.

We've talked about 3 ways of handling mapcount:

1. The RFC v2 way, which is head-only, and we increment the compound
mapcount for each PT mapping we have. So a PTE-mapped 2M page,
compound_mapcount=512, subpage->_mapcount=0 (ignoring the -1 bias).
2. The THP-like way. If we are fully mapping the hugetlb page with the
hstate-level PTE, we increment the compound mapcount, otherwise we
increment subpage->_mapcount.
3. The RFC v1 way (the way you have suggested above), which is
head-only, and we increment the compound mapcount if the hstate-level
PTE is made present.

With #1 and #2, there is no concern with folio_mapcount(). But with
#3, folio_mapcount() for a PTE-mapped 2M page mapped in a single VMA
would yield 1 instead of 512 (right?). That's what I mean.
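
To make the "1 instead of 512" point concrete, here is a tiny userspace
model of the three schemes for a single VMA that HGM-maps one 2M page with
512 4K PTEs (all names invented for illustration; obviously not the real
patches):

	#include <stdio.h>

	#define NR_SUBPAGES 512	/* 4K PTEs per 2M hugepage on x86 */

	struct toy_folio {
		long compound_mapcount;
		long subpage_mapcount[NR_SUBPAGES];
	};

	static long toy_folio_mapcount(struct toy_folio *f)
	{
		long total = f->compound_mapcount;

		for (int i = 0; i < NR_SUBPAGES; i++)
			total += f->subpage_mapcount[i];
		return total;
	}

	int main(void)
	{
		struct toy_folio f1 = {0}, f2 = {0}, f3 = {0};
		int i;

		/* #1 (RFC v2): compound mapcount once per PT mapping. */
		for (i = 0; i < NR_SUBPAGES; i++)
			f1.compound_mapcount++;

		/* #2 (THP-like): no hstate-level leaf, so bump the subpages. */
		for (i = 0; i < NR_SUBPAGES; i++)
			f2.subpage_mapcount[i]++;

		/*
		 * #3 (RFC v1): compound mapcount only when the hstate-level
		 * entry (the PMD for a 2M page) goes from none to populated.
		 */
		f3.compound_mapcount++;

		printf("#1: %ld, #2: %ld, #3: %ld\n",
		       toy_folio_mapcount(&f1),
		       toy_folio_mapcount(&f2),
		       toy_folio_mapcount(&f3));	/* 512, 512, 1 */
		return 0;
	}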

#1 has problems wrt smaps and migration (though there were other
problems with those anyway that Mike has fixed), and #2 makes
MADV_COLLAPSE slow to the point of being unusable for some
applications.

It seems like the least bad option is #1, but maybe we should have a
face-to-face discussion about it? I'm still collecting some more
performance numbers.

- James
  
David Hildenbrand Feb. 1, 2023, 3:56 p.m. UTC | #41
On 01.02.23 16:45, James Houghton wrote:
> On Tue, Jan 31, 2023 at 5:24 PM Peter Xu <peterx@redhat.com> wrote:
>>
>> On Tue, Jan 31, 2023 at 04:24:15PM -0800, James Houghton wrote:
>>> On Mon, Jan 30, 2023 at 1:14 PM Peter Xu <peterx@redhat.com> wrote:
>>>>
>>>> On Mon, Jan 30, 2023 at 10:38:41AM -0800, James Houghton wrote:
>>>>> On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <peterx@redhat.com> wrote:
>>>>>>
>>>>>> On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote:
> [snip]
>>>>>> Another way to not use thp mapcount, nor break smaps and similar calls to
>>>>>> page_mapcount() on small page, is to only increase the hpage mapcount only
>>>>>> when hstate pXd (in case of 1G it's PUD) entry being populated (no matter
>>>>>> as leaf or a non-leaf), and the mapcount can be decreased when the pXd
>>>>>> entry is removed (for leaf, it's the same as for now; for HGM, it's when
>>>>>> freeing pgtable of the PUD entry).
>>>>>
>>>>> Right, and this is doable. Also it seems like this is pretty close to
>>>>> the direction Matthew Wilcox wants to go with THPs.
>>>>
>>>> I may not be familiar with it, do you mean this one?
>>>>
>>>> https://lore.kernel.org/all/Y9Afwds%2FJl39UjEp@casper.infradead.org/
>>>
>>> Yep that's it.
>>>
>>>>
>>>> For hugetlb I think it should be easier to maintain rather than any-sized
>>>> folios, because there's the pgtable non-leaf entry to track rmap
>>>> information and the folio size being static to hpage size.
>>>>
>>>> It'll be different to folios where it can be random sized pages chunk, so
>>>> it needs to be managed by batching the ptes when install/zap.
>>>
>>> Agreed. It's probably easier for HugeTLB because they're always
>>> "naturally aligned" and yeah they can't change sizes.
>>>
>>>>
>>>>>
>>>>> Something I noticed though, from the implementation of
>>>>> folio_referenced()/folio_referenced_one(), is that folio_mapcount()
>>>>> ought to report the total number of PTEs that are pointing on the page
>>>>> (or the number of times page_vma_mapped_walk returns true). FWIW,
>>>>> folio_referenced() is never called for hugetlb folios.
>>>>
>>>> FWIU folio_mapcount is the thing it needs for now to do the rmap walks -
>>>> it'll walk every leaf page being mapped, big or small, so IIUC that number
>>>> should match with what it expects to see later, more or less.
>>>
>>> I don't fully understand what you mean here.
>>
>> I meant the rmap_walk pairing with folio_referenced_one() will walk all the
>> leaves for the folio, big or small.  I think that will match the number
>> with what got returned from folio_mapcount().
> 
> See below.
> 
>>
>>>
>>>>
>>>> But I agree the mapcount/referenced value itself is debatable to me, just
>>>> like what you raised in the other thread on page migration.  Meanwhile, I
>>>> am not certain whether the mapcount is accurate either because AFAICT the
>>>> mapcount can be modified if e.g. new page mapping established as long as
>>>> before taking the page lock later in folio_referenced().
>>>>
>>>> It's just that I don't see any severe issue either due to any of above, as
>>>> long as that information is only used as a hint for next steps, e.g., to
>>>> swap which page out.
>>>
>>> I also don't see a big problem with folio_referenced() (and you're
>>> right that folio_mapcount() can be stale by the time it takes the
>>> folio lock). It still seems like folio_mapcount() should return the
>>> total number of PTEs that map the page though. Are you saying that
>>> breaking this would be ok?
>>
>> I didn't quite follow - isn't that already doing so?
>>
>> folio_mapcount() is total_compound_mapcount() here, IIUC it is an
>> accumulated value of all possible PTEs or PMDs being mapped as long as it's
>> all or part of the folio being mapped.
> 
> We've talked about 3 ways of handling mapcount:
> 
> 1. The RFC v2 way, which is head-only, and we increment the compound
> mapcount for each PT mapping we have. So a PTE-mapped 2M page,
> compound_mapcount=512, subpage->_mapcount=0 (ignoring the -1 bias).
> 2. The THP-like way. If we are fully mapping the hugetlb page with the
> hstate-level PTE, we increment the compound mapcount, otherwise we
> increment subpage->_mapcount.
> 3. The RFC v1 way (the way you have suggested above), which is
> head-only, and we increment the compound mapcount if the hstate-level
> PTE is made present.
> 
> With #1 and #2, there is no concern with folio_mapcount(). But with
> #3, folio_mapcount() for a PTE-mapped 2M page mapped in a single VMA
> would yield 1 instead of 512 (right?). That's what I mean.

My 2 cents:

The mapcount is primarily used (in hugetlb context) to

(a) Detect if a page might be shared. mapcount > 1 implies that two 
independent page table hierarchies are mapping the page. We care about 
mapcount == 1 vs. mapcount != 1.

(b) Detect if unmapping was successful. We care about mapcount == 0 vs. 
mapcount != 0.

For hugetlb, I don't see why we should care about the subpage mapcount 
at all.

For (a) it's even good to count "somehow mapped into a single page table 
structure" as "mapcount == 1" For (b), we don't care as long as "still 
mapped" implies "mapcount != 0".
  
Peter Xu Feb. 1, 2023, 4:22 p.m. UTC | #42
On Wed, Feb 01, 2023 at 07:45:17AM -0800, James Houghton wrote:
> On Tue, Jan 31, 2023 at 5:24 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Tue, Jan 31, 2023 at 04:24:15PM -0800, James Houghton wrote:
> > > On Mon, Jan 30, 2023 at 1:14 PM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > On Mon, Jan 30, 2023 at 10:38:41AM -0800, James Houghton wrote:
> > > > > On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <peterx@redhat.com> wrote:
> > > > > >
> > > > > > On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote:
> [snip]
> > > > > > Another way to not use thp mapcount, nor break smaps and similar calls to
> > > > > > page_mapcount() on small page, is to only increase the hpage mapcount only
> > > > > > when hstate pXd (in case of 1G it's PUD) entry being populated (no matter
> > > > > > as leaf or a non-leaf), and the mapcount can be decreased when the pXd
> > > > > > entry is removed (for leaf, it's the same as for now; for HGM, it's when
> > > > > > freeing pgtable of the PUD entry).
> > > > >
> > > > > Right, and this is doable. Also it seems like this is pretty close to
> > > > > the direction Matthew Wilcox wants to go with THPs.
> > > >
> > > > I may not be familiar with it, do you mean this one?
> > > >
> > > > https://lore.kernel.org/all/Y9Afwds%2FJl39UjEp@casper.infradead.org/
> > >
> > > Yep that's it.
> > >
> > > >
> > > > For hugetlb I think it should be easier to maintain rather than any-sized
> > > > folios, because there's the pgtable non-leaf entry to track rmap
> > > > information and the folio size being static to hpage size.
> > > >
> > > > It'll be different to folios where it can be random sized pages chunk, so
> > > > it needs to be managed by batching the ptes when install/zap.
> > >
> > > Agreed. It's probably easier for HugeTLB because they're always
> > > "naturally aligned" and yeah they can't change sizes.
> > >
> > > >
> > > > >
> > > > > Something I noticed though, from the implementation of
> > > > > folio_referenced()/folio_referenced_one(), is that folio_mapcount()
> > > > > ought to report the total number of PTEs that are pointing on the page
> > > > > (or the number of times page_vma_mapped_walk returns true). FWIW,
> > > > > folio_referenced() is never called for hugetlb folios.
> > > >
> > > > FWIU folio_mapcount is the thing it needs for now to do the rmap walks -
> > > > it'll walk every leaf page being mapped, big or small, so IIUC that number
> > > > should match with what it expects to see later, more or less.
> > >
> > > I don't fully understand what you mean here.
> >
> > I meant the rmap_walk pairing with folio_referenced_one() will walk all the
> > leaves for the folio, big or small.  I think that will match the number
> > with what got returned from folio_mapcount().
> 
> See below.
> 
> >
> > >
> > > >
> > > > But I agree the mapcount/referenced value itself is debatable to me, just
> > > > like what you raised in the other thread on page migration.  Meanwhile, I
> > > > am not certain whether the mapcount is accurate either because AFAICT the
> > > > mapcount can be modified if e.g. new page mapping established as long as
> > > > before taking the page lock later in folio_referenced().
> > > >
> > > > It's just that I don't see any severe issue either due to any of above, as
> > > > long as that information is only used as a hint for next steps, e.g., to
> > > > swap which page out.
> > >
> > > I also don't see a big problem with folio_referenced() (and you're
> > > right that folio_mapcount() can be stale by the time it takes the
> > > folio lock). It still seems like folio_mapcount() should return the
> > > total number of PTEs that map the page though. Are you saying that
> > > breaking this would be ok?
> >
> > I didn't quite follow - isn't that already doing so?
> >
> > folio_mapcount() is total_compound_mapcount() here, IIUC it is an
> > accumulated value of all possible PTEs or PMDs being mapped as long as it's
> > all or part of the folio being mapped.
> 
> We've talked about 3 ways of handling mapcount:
> 
> 1. The RFC v2 way, which is head-only, and we increment the compound
> mapcount for each PT mapping we have. So a PTE-mapped 2M page,
> compound_mapcount=512, subpage->_mapcount=0 (ignoring the -1 bias).
> 2. The THP-like way. If we are fully mapping the hugetlb page with the
> hstate-level PTE, we increment the compound mapcount, otherwise we
> increment subpage->_mapcount.
> 3. The RFC v1 way (the way you have suggested above), which is
> head-only, and we increment the compound mapcount if the hstate-level
> PTE is made present.

Oh that's where it came from!  It took quite some months going through all
these, I can hardly remember the details.

> 
> With #1 and #2, there is no concern with folio_mapcount(). But with
> #3, folio_mapcount() for a PTE-mapped 2M page mapped in a single VMA
> would yield 1 instead of 512 (right?). That's what I mean.
> 
> #1 has problems wrt smaps and migration (though there were other
> problems with those anyway that Mike has fixed), and #2 makes
> MADV_COLLAPSE slow to the point of being unusable for some
> applications.

Ah so you're talking about after HGM being applied..  while I was only
talking about THPs.

If to apply the logic here with idea 3), the worst case is we'll need to
have special care of HGM hugetlb in folio_referenced_one(), so the default
page_vma_mapped_walk() may not apply anymore - the resource is always in
hstate sized, so counting small ptes do not help too - we can just walk
until the hstate entry and do referenced++ if it's not none, at the
entrance of folio_referenced_one().
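
Something like this at the entrance (rough sketch, untested; locking and
the pra->mapcount bookkeeping elided):

	if (is_vm_hugetlb_page(vma)) {
		/*
		 * The resource is hstate-sized, so counting the small HGM
		 * ptes doesn't add information: one non-none check at the
		 * hstate-level entry is enough to say "referenced".
		 */
		pte_t *hptep = huge_pte_offset(vma->vm_mm, address,
					       huge_page_size(hstate_vma(vma)));

		if (hptep && !huge_pte_none(huge_ptep_get(hptep)))
			pra->referenced++;
		return true;	/* keep walking the other vmas */
	}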

But I'm not sure whether that'll be necessary at all, as I'm not sure
whether that path can be triggered at all in any form (where from the top
it should always be shrink_page_list()).  In that sense maybe we can also
consider adding a WARN_ON_ONCE() in folio_referenced() where it is a
hugetlb page that got passed in?  Meanwhile, adding a TODO comment
explaining that current walk won't work easily for HGM only, so when it
will be applicable to hugetlb we need to rework?

I confess that's not pretty, though.  But that'll make 3) with no major
defect from function-wise.

Side note: did we finish folio conversion on hugetlb at all?  I think at
least we need some helper like folio_test_huge().  It seems still missing.
Maybe it's another clue that hugetlb is not important to folio_referenced()
because it's already fully converted?

> 
> It seems like the least bad option is #1, but maybe we should have a
> face-to-face discussion about it? I'm still collecting some more
> performance numbers.

Let's see how it goes..

Thanks,
  
James Houghton Feb. 1, 2023, 5:58 p.m. UTC | #43
On Wed, Feb 1, 2023 at 7:56 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 01.02.23 16:45, James Houghton wrote:
> > On Tue, Jan 31, 2023 at 5:24 PM Peter Xu <peterx@redhat.com> wrote:
> >>
> >> On Tue, Jan 31, 2023 at 04:24:15PM -0800, James Houghton wrote:
> >>> On Mon, Jan 30, 2023 at 1:14 PM Peter Xu <peterx@redhat.com> wrote:
> >>>>
> >>>> On Mon, Jan 30, 2023 at 10:38:41AM -0800, James Houghton wrote:
> >>>>> On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <peterx@redhat.com> wrote:
> >>>>>>
> >>>>>> On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote:
> > [snip]
> >>>>>> Another way to not use thp mapcount, nor break smaps and similar calls to
> >>>>>> page_mapcount() on small page, is to only increase the hpage mapcount only
> >>>>>> when hstate pXd (in case of 1G it's PUD) entry being populated (no matter
> >>>>>> as leaf or a non-leaf), and the mapcount can be decreased when the pXd
> >>>>>> entry is removed (for leaf, it's the same as for now; for HGM, it's when
> >>>>>> freeing pgtable of the PUD entry).
> >>>>>
> >>>>> Right, and this is doable. Also it seems like this is pretty close to
> >>>>> the direction Matthew Wilcox wants to go with THPs.
> >>>>
> >>>> I may not be familiar with it, do you mean this one?
> >>>>
> >>>> https://lore.kernel.org/all/Y9Afwds%2FJl39UjEp@casper.infradead.org/
> >>>
> >>> Yep that's it.
> >>>
> >>>>
> >>>> For hugetlb I think it should be easier to maintain rather than any-sized
> >>>> folios, because there's the pgtable non-leaf entry to track rmap
> >>>> information and the folio size being static to hpage size.
> >>>>
> >>>> It'll be different to folios where it can be random sized pages chunk, so
> >>>> it needs to be managed by batching the ptes when install/zap.
> >>>
> >>> Agreed. It's probably easier for HugeTLB because they're always
> >>> "naturally aligned" and yeah they can't change sizes.
> >>>
> >>>>
> >>>>>
> >>>>> Something I noticed though, from the implementation of
> >>>>> folio_referenced()/folio_referenced_one(), is that folio_mapcount()
> >>>>> ought to report the total number of PTEs that are pointing on the page
> >>>>> (or the number of times page_vma_mapped_walk returns true). FWIW,
> >>>>> folio_referenced() is never called for hugetlb folios.
> >>>>
> >>>> FWIU folio_mapcount is the thing it needs for now to do the rmap walks -
> >>>> it'll walk every leaf page being mapped, big or small, so IIUC that number
> >>>> should match with what it expects to see later, more or less.
> >>>
> >>> I don't fully understand what you mean here.
> >>
> >> I meant the rmap_walk pairing with folio_referenced_one() will walk all the
> >> leaves for the folio, big or small.  I think that will match the number
> >> with what got returned from folio_mapcount().
> >
> > See below.
> >
> >>
> >>>
> >>>>
> >>>> But I agree the mapcount/referenced value itself is debatable to me, just
> >>>> like what you raised in the other thread on page migration.  Meanwhile, I
> >>>> am not certain whether the mapcount is accurate either because AFAICT the
> >>>> mapcount can be modified if e.g. new page mapping established as long as
> >>>> before taking the page lock later in folio_referenced().
> >>>>
> >>>> It's just that I don't see any severe issue either due to any of above, as
> >>>> long as that information is only used as a hint for next steps, e.g., to
> >>>> swap which page out.
> >>>
> >>> I also don't see a big problem with folio_referenced() (and you're
> >>> right that folio_mapcount() can be stale by the time it takes the
> >>> folio lock). It still seems like folio_mapcount() should return the
> >>> total number of PTEs that map the page though. Are you saying that
> >>> breaking this would be ok?
> >>
> >> I didn't quite follow - isn't that already doing so?
> >>
> >> folio_mapcount() is total_compound_mapcount() here, IIUC it is an
> >> accumulated value of all possible PTEs or PMDs being mapped as long as it's
> >> all or part of the folio being mapped.
> >
> > We've talked about 3 ways of handling mapcount:
> >
> > 1. The RFC v2 way, which is head-only, and we increment the compound
> > mapcount for each PT mapping we have. So a PTE-mapped 2M page,
> > compound_mapcount=512, subpage->_mapcount=0 (ignoring the -1 bias).
> > 2. The THP-like way. If we are fully mapping the hugetlb page with the
> > hstate-level PTE, we increment the compound mapcount, otherwise we
> > increment subpage->_mapcount.
> > 3. The RFC v1 way (the way you have suggested above), which is
> > head-only, and we increment the compound mapcount if the hstate-level
> > PTE is made present.
> >
> > With #1 and #2, there is no concern with folio_mapcount(). But with
> > #3, folio_mapcount() for a PTE-mapped 2M page mapped in a single VMA
> > would yield 1 instead of 512 (right?). That's what I mean.
>
> My 2 cents:
>
> The mapcount is primarily used (in hugetlb context) to
>
> (a) Detect if a page might be shared. mapcount > 1 implies that two
> independent page table hierarchies are mapping the page. We care about
> mapcount == 1 vs. mapcount != 1.
>
> (b) Detect if unmapping was successful. We care about mapcount == 0 vs.
> mapcount != 0.
>
> For hugetlb, I don't see why we should care about the subpage mapcount
> at all.

Agreed -- it shouldn't really matter all that much.

>
> For (a) it's even good to count "somehow mapped into a single page table
> structure" as "mapcount == 1" For (b), we don't care as long as "still
> mapped" implies "mapcount != 0".

Thanks for your thoughts, David. So it sounds like you're still
squarely in the #3 camp. :)
  
David Hildenbrand Feb. 1, 2023, 6:01 p.m. UTC | #44
>>> 1. The RFC v2 way, which is head-only, and we increment the compound
>>> mapcount for each PT mapping we have. So a PTE-mapped 2M page,
>>> compound_mapcount=512, subpage->_mapcount=0 (ignoring the -1 bias).
>>> 2. The THP-like way. If we are fully mapping the hugetlb page with the
>>> hstate-level PTE, we increment the compound mapcount, otherwise we
>>> increment subpage->_mapcount.
>>> 3. The RFC v1 way (the way you have suggested above), which is
>>> head-only, and we increment the compound mapcount if the hstate-level
>>> PTE is made present.
>>>
>>> With #1 and #2, there is no concern with folio_mapcount(). But with
>>> #3, folio_mapcount() for a PTE-mapped 2M page mapped in a single VMA
>>> would yield 1 instead of 512 (right?). That's what I mean.
>>
>> My 2 cents:
>>
>> The mapcount is primarily used (in hugetlb context) to
>>
>> (a) Detect if a page might be shared. mapcount > 1 implies that two
>> independent page table hierarchies are mapping the page. We care about
>> mapcount == 1 vs. mapcount != 1.
>>
>> (b) Detect if unmapping was successful. We care about mapcount == 0 vs.
>> mapcount != 0.
>>
>> For hugetlb, I don't see why we should care about the subpage mapcount
>> at all.
> 
> Agreed -- it shouldn't really matter all that much.
> 
>>
>> For (a) it's even good to count "somehow mapped into a single page table
>> structure" as "mapcount == 1" For (b), we don't care as long as "still
>> mapped" implies "mapcount != 0".
> 
> Thanks for your thoughts, David. So it sounds like you're still
> squarely in the #3 camp. :)

Well, yes. As long as we can get it implemented in a clean way ... :)
  
James Houghton Feb. 1, 2023, 9:32 p.m. UTC | #45
On Wed, Feb 1, 2023 at 8:22 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Feb 01, 2023 at 07:45:17AM -0800, James Houghton wrote:
> > On Tue, Jan 31, 2023 at 5:24 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Tue, Jan 31, 2023 at 04:24:15PM -0800, James Houghton wrote:
> > > > On Mon, Jan 30, 2023 at 1:14 PM Peter Xu <peterx@redhat.com> wrote:
> > > > >
> > > > > On Mon, Jan 30, 2023 at 10:38:41AM -0800, James Houghton wrote:
> > > > > > On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <peterx@redhat.com> wrote:
> > > > > > >
> > > > > > > On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote:
> > [snip]
> > > > > > > Another way to not use thp mapcount, nor break smaps and similar calls to
> > > > > > > page_mapcount() on small page, is to only increase the hpage mapcount only
> > > > > > > when hstate pXd (in case of 1G it's PUD) entry being populated (no matter
> > > > > > > as leaf or a non-leaf), and the mapcount can be decreased when the pXd
> > > > > > > entry is removed (for leaf, it's the same as for now; for HGM, it's when
> > > > > > > freeing pgtable of the PUD entry).
> > > > > >
> > > > > > Right, and this is doable. Also it seems like this is pretty close to
> > > > > > the direction Matthew Wilcox wants to go with THPs.
> > > > >
> > > > > I may not be familiar with it, do you mean this one?
> > > > >
> > > > > https://lore.kernel.org/all/Y9Afwds%2FJl39UjEp@casper.infradead.org/
> > > >
> > > > Yep that's it.
> > > >
> > > > >
> > > > > For hugetlb I think it should be easier to maintain rather than any-sized
> > > > > folios, because there's the pgtable non-leaf entry to track rmap
> > > > > information and the folio size being static to hpage size.
> > > > >
> > > > > It'll be different to folios where it can be random sized pages chunk, so
> > > > > it needs to be managed by batching the ptes when install/zap.
> > > >
> > > > Agreed. It's probably easier for HugeTLB because they're always
> > > > "naturally aligned" and yeah they can't change sizes.
> > > >
> > > > >
> > > > > >
> > > > > > Something I noticed though, from the implementation of
> > > > > > folio_referenced()/folio_referenced_one(), is that folio_mapcount()
> > > > > > ought to report the total number of PTEs that are pointing on the page
> > > > > > (or the number of times page_vma_mapped_walk returns true). FWIW,
> > > > > > folio_referenced() is never called for hugetlb folios.
> > > > >
> > > > > FWIU folio_mapcount is the thing it needs for now to do the rmap walks -
> > > > > it'll walk every leaf page being mapped, big or small, so IIUC that number
> > > > > should match with what it expects to see later, more or less.
> > > >
> > > > I don't fully understand what you mean here.
> > >
> > > I meant the rmap_walk pairing with folio_referenced_one() will walk all the
> > > leaves for the folio, big or small.  I think that will match the number
> > > with what got returned from folio_mapcount().
> >
> > See below.
> >
> > >
> > > >
> > > > >
> > > > > But I agree the mapcount/referenced value itself is debatable to me, just
> > > > > like what you raised in the other thread on page migration.  Meanwhile, I
> > > > > am not certain whether the mapcount is accurate either because AFAICT the
> > > > > mapcount can be modified if e.g. new page mapping established as long as
> > > > > before taking the page lock later in folio_referenced().
> > > > >
> > > > > It's just that I don't see any severe issue either due to any of above, as
> > > > > long as that information is only used as a hint for next steps, e.g., to
> > > > > swap which page out.
> > > >
> > > > I also don't see a big problem with folio_referenced() (and you're
> > > > right that folio_mapcount() can be stale by the time it takes the
> > > > folio lock). It still seems like folio_mapcount() should return the
> > > > total number of PTEs that map the page though. Are you saying that
> > > > breaking this would be ok?
> > >
> > > I didn't quite follow - isn't that already doing so?
> > >
> > > folio_mapcount() is total_compound_mapcount() here, IIUC it is an
> > > accumulated value of all possible PTEs or PMDs being mapped as long as it's
> > > all or part of the folio being mapped.
> >
> > We've talked about 3 ways of handling mapcount:
> >
> > 1. The RFC v2 way, which is head-only, and we increment the compound
> > mapcount for each PT mapping we have. So a PTE-mapped 2M page,
> > compound_mapcount=512, subpage->_mapcount=0 (ignoring the -1 bias).
> > 2. The THP-like way. If we are fully mapping the hugetlb page with the
> > hstate-level PTE, we increment the compound mapcount, otherwise we
> > increment subpage->_mapcount.
> > 3. The RFC v1 way (the way you have suggested above), which is
> > head-only, and we increment the compound mapcount if the hstate-level
> > PTE is made present.
>
> Oh that's where it came from!  It took quite some months going through all
> these, I can hardly remember the details.
>
> >
> > With #1 and #2, there is no concern with folio_mapcount(). But with
> > #3, folio_mapcount() for a PTE-mapped 2M page mapped in a single VMA
> > would yield 1 instead of 512 (right?). That's what I mean.
> >
> > #1 has problems wrt smaps and migration (though there were other
> > problems with those anyway that Mike has fixed), and #2 makes
> > MADV_COLLAPSE slow to the point of being unusable for some
> > applications.
>
> Ah so you're talking about after HGM being applied..  while I was only
> talking about THPs.
>
> If to apply the logic here with idea 3), the worst case is we'll need to
> have special care of HGM hugetlb in folio_referenced_one(), so the default
> page_vma_mapped_walk() may not apply anymore - the resource is always in
> hstate sized, so counting small ptes do not help too - we can just walk
> until the hstate entry and do referenced++ if it's not none, at the
> entrance of folio_referenced_one().
>
> But I'm not sure whether that'll be necessary at all, as I'm not sure
> whether that path can be triggered at all in any form (where from the top
> it should always be shrink_page_list()).  In that sense maybe we can also
> consider adding a WARN_ON_ONCE() in folio_referenced() where it is a
> hugetlb page that got passed in?  Meanwhile, adding a TODO comment
> explaining that current walk won't work easily for HGM only, so when it
> will be applicable to hugetlb we need to rework?
>
> I confess that's not pretty, though.  But that'll make 3) with no major
> defect from function-wise.

Another potential idea would be to add something like page_vmacount().
For non-HugeTLB pages, page_vmacount() == page_mapcount(). Then for
HugeTLB pages, we could keep a separate count (in one of the tail
pages, I guess). And then in the places that matter (so smaps,
migration, and maybe CoW and hwpoison), potentially change their calls
to page_vmacount() instead of page_mapcount().

Then to implement page_vmacount(), we do the RFC v1 mapcount approach
(but like.... correctly this time). And then for page_mapcount(), we
do the RFC v2 mapcount approach (head-only, once per PTE).
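
Rough sketch of what I mean (all names invented here; hugetlb_vmacount()
would read a new counter kept in one of the tail pages):

	/* "how many page table hierarchies map (part of) this page?" */
	static inline int page_vmacount(struct page *page)
	{
		if (PageHuge(page))
			return hugetlb_vmacount(page);	/* hypothetical tail-page counter */
		return page_mapcount(page);
	}

The new counter would follow the RFC v1 rule (bump it when the hstate-level
entry goes from none to populated, drop it when that entry is cleared),
while page_mapcount() keeps the RFC v2 head-only, once-per-PTE behavior.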

Then we fix folio_referenced() without needing to special-case it for
HugeTLB. :) Or we could just special-case it. *shrug*

Does that sound reasonable? We still have the problem where a series
of partial unmaps could leave page_vmacount() incremented, but I
don't think that's a big problem.

>
> Side note: did we finish folio conversion on hugetlb at all?  I think at
> least we need some helper like folio_test_huge().  It seems still missing.
> Maybe it's another clue that hugetlb is not important to folio_referenced()
> because it's already fully converted?

I'm not sure. A lot of work was done pretty recently, so I bet
there's probably some work left to do.
  
Peter Xu Feb. 1, 2023, 9:51 p.m. UTC | #46
On Wed, Feb 01, 2023 at 01:32:21PM -0800, James Houghton wrote:
> On Wed, Feb 1, 2023 at 8:22 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Wed, Feb 01, 2023 at 07:45:17AM -0800, James Houghton wrote:
> > > On Tue, Jan 31, 2023 at 5:24 PM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > On Tue, Jan 31, 2023 at 04:24:15PM -0800, James Houghton wrote:
> > > > > On Mon, Jan 30, 2023 at 1:14 PM Peter Xu <peterx@redhat.com> wrote:
> > > > > >
> > > > > > On Mon, Jan 30, 2023 at 10:38:41AM -0800, James Houghton wrote:
> > > > > > > On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <peterx@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote:
> > > [snip]
> > > > > > > > Another way to not use thp mapcount, nor break smaps and similar calls to
> > > > > > > > page_mapcount() on small page, is to only increase the hpage mapcount only
> > > > > > > > when hstate pXd (in case of 1G it's PUD) entry being populated (no matter
> > > > > > > > as leaf or a non-leaf), and the mapcount can be decreased when the pXd
> > > > > > > > entry is removed (for leaf, it's the same as for now; for HGM, it's when
> > > > > > > > freeing pgtable of the PUD entry).
> > > > > > >
> > > > > > > Right, and this is doable. Also it seems like this is pretty close to
> > > > > > > the direction Matthew Wilcox wants to go with THPs.
> > > > > >
> > > > > > I may not be familiar with it, do you mean this one?
> > > > > >
> > > > > > https://lore.kernel.org/all/Y9Afwds%2FJl39UjEp@casper.infradead.org/
> > > > >
> > > > > Yep that's it.
> > > > >
> > > > > >
> > > > > > For hugetlb I think it should be easier to maintain rather than any-sized
> > > > > > folios, because there's the pgtable non-leaf entry to track rmap
> > > > > > information and the folio size being static to hpage size.
> > > > > >
> > > > > > It'll be different to folios where it can be random sized pages chunk, so
> > > > > > it needs to be managed by batching the ptes when install/zap.
> > > > >
> > > > > Agreed. It's probably easier for HugeTLB because they're always
> > > > > "naturally aligned" and yeah they can't change sizes.
> > > > >
> > > > > >
> > > > > > >
> > > > > > > Something I noticed though, from the implementation of
> > > > > > > folio_referenced()/folio_referenced_one(), is that folio_mapcount()
> > > > > > > ought to report the total number of PTEs that are pointing on the page
> > > > > > > (or the number of times page_vma_mapped_walk returns true). FWIW,
> > > > > > > folio_referenced() is never called for hugetlb folios.
> > > > > >
> > > > > > FWIU folio_mapcount is the thing it needs for now to do the rmap walks -
> > > > > > it'll walk every leaf page being mapped, big or small, so IIUC that number
> > > > > > should match with what it expects to see later, more or less.
> > > > >
> > > > > I don't fully understand what you mean here.
> > > >
> > > > I meant the rmap_walk pairing with folio_referenced_one() will walk all the
> > > > leaves for the folio, big or small.  I think that will match the number
> > > > with what got returned from folio_mapcount().
> > >
> > > See below.
> > >
> > > >
> > > > >
> > > > > >
> > > > > > But I agree the mapcount/referenced value itself is debatable to me, just
> > > > > > like what you raised in the other thread on page migration.  Meanwhile, I
> > > > > > am not certain whether the mapcount is accurate either because AFAICT the
> > > > > > mapcount can be modified if e.g. new page mapping established as long as
> > > > > > before taking the page lock later in folio_referenced().
> > > > > >
> > > > > > It's just that I don't see any severe issue either due to any of above, as
> > > > > > long as that information is only used as a hint for next steps, e.g., to
> > > > > > swap which page out.
> > > > >
> > > > > I also don't see a big problem with folio_referenced() (and you're
> > > > > right that folio_mapcount() can be stale by the time it takes the
> > > > > folio lock). It still seems like folio_mapcount() should return the
> > > > > total number of PTEs that map the page though. Are you saying that
> > > > > breaking this would be ok?
> > > >
> > > > I didn't quite follow - isn't that already doing so?
> > > >
> > > > folio_mapcount() is total_compound_mapcount() here, IIUC it is an
> > > > accumulated value of all possible PTEs or PMDs being mapped as long as it's
> > > > all or part of the folio being mapped.
> > >
> > > We've talked about 3 ways of handling mapcount:
> > >
> > > 1. The RFC v2 way, which is head-only, and we increment the compound
> > > mapcount for each PT mapping we have. So a PTE-mapped 2M page,
> > > compound_mapcount=512, subpage->_mapcount=0 (ignoring the -1 bias).
> > > 2. The THP-like way. If we are fully mapping the hugetlb page with the
> > > hstate-level PTE, we increment the compound mapcount, otherwise we
> > > increment subpage->_mapcount.
> > > 3. The RFC v1 way (the way you have suggested above), which is
> > > head-only, and we increment the compound mapcount if the hstate-level
> > > PTE is made present.
> >
> > Oh that's where it came from!  It took quite some months going through all
> > these, I can hardly remember the details.
> >
> > >
> > > With #1 and #2, there is no concern with folio_mapcount(). But with
> > > #3, folio_mapcount() for a PTE-mapped 2M page mapped in a single VMA
> > > would yield 1 instead of 512 (right?). That's what I mean.
> > >
> > > #1 has problems wrt smaps and migration (though there were other
> > > problems with those anyway that Mike has fixed), and #2 makes
> > > MADV_COLLAPSE slow to the point of being unusable for some
> > > applications.
> >
> > Ah so you're talking about after HGM being applied..  while I was only
> > talking about THPs.
> >
> > If to apply the logic here with idea 3), the worst case is we'll need to
> > have special care of HGM hugetlb in folio_referenced_one(), so the default
> > page_vma_mapped_walk() may not apply anymore - the resource is always in
> > hstate sized, so counting small ptes do not help too - we can just walk
> > until the hstate entry and do referenced++ if it's not none, at the
> > entrance of folio_referenced_one().
> >
> > But I'm not sure whether that'll be necessary at all, as I'm not sure
> > whether that path can be triggered at all in any form (where from the top
> > it should always be shrink_page_list()).  In that sense maybe we can also
> > consider adding a WARN_ON_ONCE() in folio_referenced() where it is a
> > hugetlb page that got passed in?  Meanwhile, adding a TODO comment
> > explaining that current walk won't work easily for HGM only, so when it
> > will be applicable to hugetlb we need to rework?
> >
> > I confess that's not pretty, though.  But that'll make 3) with no major
> > defect from function-wise.
> 
> Another potential idea would be to add something like page_vmacount().
> For non-HugeTLB pages, page_vmacount() == page_mapcount(). Then for
> HugeTLB pages, we could keep a separate count (in one of the tail
> pages, I guess). And then in the places that matter (so smaps,
> migration, and maybe CoW and hwpoison), potentially change their calls
> to page_vmacount() instead of page_mapcount().
> 
> Then to implement page_vmacount(), we do the RFC v1 mapcount approach
> (but like.... correctly this time). And then for page_mapcount(), we
> do the RFC v2 mapcount approach (head-only, once per PTE).
> 
> Then we fix folio_referenced() without needing to special-case it for
> HugeTLB. :) Or we could just special-case it. *shrug*
> 
> Does that sound reasonable? We still have the problem where a series
> of partially unmaps could leave page_vmacount() incremented, but I
> don't think that's a big problem.

I'm afraid someone will stop you from introducing yet another definition of
mapcount, where others are trying to remove it. :)

Or, can we just drop folio_referenced_arg.mapcount?  We need to keep:

	if (!pra.mapcount)
		return 0;

Replacing it with folio_mapcount() there is definitely worthwhile, but
what about the rest?

If it can be dropped, afaict it'll naturally work with HGM again.

IIUC that's an optimization where we want to stop the rmap walk as long as
we found all the pages, however (1) IIUC it's not required to function, and
(2) it's not guaranteed to work that solidly anyway.  As we've discussed
before: right after it reads mapcount (but before taking the page lock),
the mapcount can get decreased by 1, then it'll still need to loop over all
the vmas just to find that there's one "mysterious" mapcount lost.

Personally I really have no idea on how much that optimization can help.
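To make that concrete, a rough sketch of what would remain (illustration
only, not a tested change): folio_referenced_arg.mapcount and the
pra->mapcount bookkeeping in folio_referenced_one() go away, and the early
exit in folio_referenced() becomes something like

	/* mm/rmap.c, folio_referenced(): hypothetical form of the early exit */
	if (!folio_mapcount(folio))
		return 0;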
  
James Houghton Feb. 2, 2023, 12:24 a.m. UTC | #47
On Wed, Feb 1, 2023 at 1:51 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Feb 01, 2023 at 01:32:21PM -0800, James Houghton wrote:
> > On Wed, Feb 1, 2023 at 8:22 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Wed, Feb 01, 2023 at 07:45:17AM -0800, James Houghton wrote:
> > > > On Tue, Jan 31, 2023 at 5:24 PM Peter Xu <peterx@redhat.com> wrote:
> > > > >
> > > > > On Tue, Jan 31, 2023 at 04:24:15PM -0800, James Houghton wrote:
> > > > > > On Mon, Jan 30, 2023 at 1:14 PM Peter Xu <peterx@redhat.com> wrote:
> > > > > > >
> > > > > > > On Mon, Jan 30, 2023 at 10:38:41AM -0800, James Houghton wrote:
> > > > > > > > On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <peterx@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote:
> > > > [snip]
> > > > > > > > > Another way to not use thp mapcount, nor break smaps and similar calls to
> > > > > > > > > page_mapcount() on small page, is to only increase the hpage mapcount only
> > > > > > > > > when hstate pXd (in case of 1G it's PUD) entry being populated (no matter
> > > > > > > > > as leaf or a non-leaf), and the mapcount can be decreased when the pXd
> > > > > > > > > entry is removed (for leaf, it's the same as for now; for HGM, it's when
> > > > > > > > > freeing pgtable of the PUD entry).
> > > > > > > >
> > > > > > > > Right, and this is doable. Also it seems like this is pretty close to
> > > > > > > > the direction Matthew Wilcox wants to go with THPs.
> > > > > > >
> > > > > > > I may not be familiar with it, do you mean this one?
> > > > > > >
> > > > > > > https://lore.kernel.org/all/Y9Afwds%2FJl39UjEp@casper.infradead.org/
> > > > > >
> > > > > > Yep that's it.
> > > > > >
> > > > > > >
> > > > > > > For hugetlb I think it should be easier to maintain rather than any-sized
> > > > > > > folios, because there's the pgtable non-leaf entry to track rmap
> > > > > > > information and the folio size being static to hpage size.
> > > > > > >
> > > > > > > It'll be different to folios where it can be random sized pages chunk, so
> > > > > > > it needs to be managed by batching the ptes when install/zap.
> > > > > >
> > > > > > Agreed. It's probably easier for HugeTLB because they're always
> > > > > > "naturally aligned" and yeah they can't change sizes.
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > Something I noticed though, from the implementation of
> > > > > > > > folio_referenced()/folio_referenced_one(), is that folio_mapcount()
> > > > > > > > ought to report the total number of PTEs that are pointing on the page
> > > > > > > > (or the number of times page_vma_mapped_walk returns true). FWIW,
> > > > > > > > folio_referenced() is never called for hugetlb folios.
> > > > > > >
> > > > > > > FWIU folio_mapcount is the thing it needs for now to do the rmap walks -
> > > > > > > it'll walk every leaf page being mapped, big or small, so IIUC that number
> > > > > > > should match with what it expects to see later, more or less.
> > > > > >
> > > > > > I don't fully understand what you mean here.
> > > > >
> > > > > I meant the rmap_walk pairing with folio_referenced_one() will walk all the
> > > > > leaves for the folio, big or small.  I think that will match the number
> > > > > with what got returned from folio_mapcount().
> > > >
> > > > See below.
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > But I agree the mapcount/referenced value itself is debatable to me, just
> > > > > > > like what you raised in the other thread on page migration.  Meanwhile, I
> > > > > > > am not certain whether the mapcount is accurate either because AFAICT the
> > > > > > > mapcount can be modified if e.g. new page mapping established as long as
> > > > > > > before taking the page lock later in folio_referenced().
> > > > > > >
> > > > > > > It's just that I don't see any severe issue either due to any of above, as
> > > > > > > long as that information is only used as a hint for next steps, e.g., to
> > > > > > > swap which page out.
> > > > > >
> > > > > > I also don't see a big problem with folio_referenced() (and you're
> > > > > > right that folio_mapcount() can be stale by the time it takes the
> > > > > > folio lock). It still seems like folio_mapcount() should return the
> > > > > > total number of PTEs that map the page though. Are you saying that
> > > > > > breaking this would be ok?
> > > > >
> > > > > I didn't quite follow - isn't that already doing so?
> > > > >
> > > > > folio_mapcount() is total_compound_mapcount() here, IIUC it is an
> > > > > accumulated value of all possible PTEs or PMDs being mapped as long as it's
> > > > > all or part of the folio being mapped.
> > > >
> > > > We've talked about 3 ways of handling mapcount:
> > > >
> > > > 1. The RFC v2 way, which is head-only, and we increment the compound
> > > > mapcount for each PT mapping we have. So a PTE-mapped 2M page,
> > > > compound_mapcount=512, subpage->_mapcount=0 (ignoring the -1 bias).
> > > > 2. The THP-like way. If we are fully mapping the hugetlb page with the
> > > > hstate-level PTE, we increment the compound mapcount, otherwise we
> > > > increment subpage->_mapcount.
> > > > 3. The RFC v1 way (the way you have suggested above), which is
> > > > head-only, and we increment the compound mapcount if the hstate-level
> > > > PTE is made present.
> > >
> > > Oh that's where it come from!  It took quite some months going through all
> > > these, I can hardly remember the details.
> > >
> > > >
> > > > With #1 and #2, there is no concern with folio_mapcount(). But with
> > > > #3, folio_mapcount() for a PTE-mapped 2M page mapped in a single VMA
> > > > would yield 1 instead of 512 (right?). That's what I mean.
> > > >
> > > > #1 has problems wrt smaps and migration (though there were other
> > > > problems with those anyway that Mike has fixed), and #2 makes
> > > > MADV_COLLAPSE slow to the point of being unusable for some
> > > > applications.
> > >
> > > Ah so you're talking about after HGM being applied..  while I was only
> > > talking about THPs.
> > >
> > > If to apply the logic here with idea 3), the worst case is we'll need to
> > > have special care of HGM hugetlb in folio_referenced_one(), so the default
> > > page_vma_mapped_walk() may not apply anymore - the resource is always in
> > > hstate sized, so counting small ptes do not help too - we can just walk
> > > until the hstate entry and do referenced++ if it's not none, at the
> > > entrance of folio_referenced_one().
> > >
> > > But I'm not sure whether that'll be necessary at all, as I'm not sure
> > > whether that path can be triggered at all in any form (where from the top
> > > it should always be shrink_page_list()).  In that sense maybe we can also
> > > consider adding a WARN_ON_ONCE() in folio_referenced() where it is a
> > > hugetlb page that got passed in?  Meanwhile, adding a TODO comment
> > > explaining that current walk won't work easily for HGM only, so when it
> > > will be applicable to hugetlb we need to rework?
> > >
> > > I confess that's not pretty, though.  But that'll make 3) with no major
> > > defect from function-wise.
> >
> > Another potential idea would be to add something like page_vmacount().
> > For non-HugeTLB pages, page_vmacount() == page_mapcount(). Then for
> > HugeTLB pages, we could keep a separate count (in one of the tail
> > pages, I guess). And then in the places that matter (so smaps,
> > migration, and maybe CoW and hwpoison), potentially change their calls
> > to page_vmacount() instead of page_mapcount().
> >
> > Then to implement page_vmacount(), we do the RFC v1 mapcount approach
> > (but like.... correctly this time). And then for page_mapcount(), we
> > do the RFC v2 mapcount approach (head-only, once per PTE).
> >
> > Then we fix folio_referenced() without needing to special-case it for
> > HugeTLB. :) Or we could just special-case it. *shrug*
> >
> > Does that sound reasonable? We still have the problem where a series
> > of partially unmaps could leave page_vmacount() incremented, but I
> > don't think that's a big problem.
>
> I'm afraid someone will stop you from introducing yet another definition of
> mapcount, where others are trying to remove it. :)
>
> Or, can we just drop folio_referenced_arg.mapcount?  We need to keep:
>
>         if (!pra.mapcount)
>                 return 0;
>
> By replacing it with folio_mapcount which is definitely something
> worthwhile, but what about the rest?
>
> If it can be dropped, afaict it'll naturally work with HGM again.
>
> IIUC that's an optimization where we want to stop the rmap walk as long as
> we found all the pages, however (1) IIUC it's not required to function, and
> (2) it's not guaranteed to work as solid anyway.. As we've discussed
> before: right after it reads mapcount (but before taking the page lock),
> the mapcount can get decreased by 1, then it'll still need to loop over all
> the vmas just to find that there's one "misterious" mapcount lost.
>
> Personally I really have no idea on how much that optimization can help.

Ok, yeah, I think pra.mapcount can be removed too. (And we replace
!pra.mapcount with !folio_mapcount().)

I don't see any other existing users of folio_mapcount() and
total_mapcount() that are problematic. We do need to make sure to keep
refcount and mapcount in sync though; it can be done.

So I'll compare this "RFC v1" way with the THP-like way and get you a
performance comparison.


- James
  
James Houghton Feb. 7, 2023, 4:30 p.m. UTC | #48
On Wed, Feb 1, 2023 at 4:24 PM James Houghton <jthoughton@google.com> wrote:
>
> On Wed, Feb 1, 2023 at 1:51 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Wed, Feb 01, 2023 at 01:32:21PM -0800, James Houghton wrote:
> > > On Wed, Feb 1, 2023 at 8:22 AM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > On Wed, Feb 01, 2023 at 07:45:17AM -0800, James Houghton wrote:
> > > > > On Tue, Jan 31, 2023 at 5:24 PM Peter Xu <peterx@redhat.com> wrote:
> > > > > >
> > > > > > On Tue, Jan 31, 2023 at 04:24:15PM -0800, James Houghton wrote:
> > > > > > > On Mon, Jan 30, 2023 at 1:14 PM Peter Xu <peterx@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Mon, Jan 30, 2023 at 10:38:41AM -0800, James Houghton wrote:
> > > > > > > > > On Mon, Jan 30, 2023 at 9:29 AM Peter Xu <peterx@redhat.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Fri, Jan 27, 2023 at 01:02:02PM -0800, James Houghton wrote:
> > > > > [snip]
> > > > > > > > > > Another way to not use thp mapcount, nor break smaps and similar calls to
> > > > > > > > > > page_mapcount() on small page, is to only increase the hpage mapcount only
> > > > > > > > > > when hstate pXd (in case of 1G it's PUD) entry being populated (no matter
> > > > > > > > > > as leaf or a non-leaf), and the mapcount can be decreased when the pXd
> > > > > > > > > > entry is removed (for leaf, it's the same as for now; for HGM, it's when
> > > > > > > > > > freeing pgtable of the PUD entry).
> > > > > > > > >
> > > > > > > > > Right, and this is doable. Also it seems like this is pretty close to
> > > > > > > > > the direction Matthew Wilcox wants to go with THPs.
> > > > > > > >
> > > > > > > > I may not be familiar with it, do you mean this one?
> > > > > > > >
> > > > > > > > https://lore.kernel.org/all/Y9Afwds%2FJl39UjEp@casper.infradead.org/
> > > > > > >
> > > > > > > Yep that's it.
> > > > > > >
> > > > > > > >
> > > > > > > > For hugetlb I think it should be easier to maintain rather than any-sized
> > > > > > > > folios, because there's the pgtable non-leaf entry to track rmap
> > > > > > > > information and the folio size being static to hpage size.
> > > > > > > >
> > > > > > > > It'll be different to folios where it can be random sized pages chunk, so
> > > > > > > > it needs to be managed by batching the ptes when install/zap.
> > > > > > >
> > > > > > > Agreed. It's probably easier for HugeTLB because they're always
> > > > > > > "naturally aligned" and yeah they can't change sizes.
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Something I noticed though, from the implementation of
> > > > > > > > > folio_referenced()/folio_referenced_one(), is that folio_mapcount()
> > > > > > > > > ought to report the total number of PTEs that are pointing on the page
> > > > > > > > > (or the number of times page_vma_mapped_walk returns true). FWIW,
> > > > > > > > > folio_referenced() is never called for hugetlb folios.
> > > > > > > >
> > > > > > > > FWIU folio_mapcount is the thing it needs for now to do the rmap walks -
> > > > > > > > it'll walk every leaf page being mapped, big or small, so IIUC that number
> > > > > > > > should match with what it expects to see later, more or less.
> > > > > > >
> > > > > > > I don't fully understand what you mean here.
> > > > > >
> > > > > > I meant the rmap_walk pairing with folio_referenced_one() will walk all the
> > > > > > leaves for the folio, big or small.  I think that will match the number
> > > > > > with what got returned from folio_mapcount().
> > > > >
> > > > > See below.
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > But I agree the mapcount/referenced value itself is debatable to me, just
> > > > > > > > like what you raised in the other thread on page migration.  Meanwhile, I
> > > > > > > > am not certain whether the mapcount is accurate either because AFAICT the
> > > > > > > > mapcount can be modified if e.g. new page mapping established as long as
> > > > > > > > before taking the page lock later in folio_referenced().
> > > > > > > >
> > > > > > > > It's just that I don't see any severe issue either due to any of above, as
> > > > > > > > long as that information is only used as a hint for next steps, e.g., to
> > > > > > > > swap which page out.
> > > > > > >
> > > > > > > I also don't see a big problem with folio_referenced() (and you're
> > > > > > > right that folio_mapcount() can be stale by the time it takes the
> > > > > > > folio lock). It still seems like folio_mapcount() should return the
> > > > > > > total number of PTEs that map the page though. Are you saying that
> > > > > > > breaking this would be ok?
> > > > > >
> > > > > > I didn't quite follow - isn't that already doing so?
> > > > > >
> > > > > > folio_mapcount() is total_compound_mapcount() here, IIUC it is an
> > > > > > accumulated value of all possible PTEs or PMDs being mapped as long as it's
> > > > > > all or part of the folio being mapped.
> > > > >
> > > > > We've talked about 3 ways of handling mapcount:
> > > > >
> > > > > 1. The RFC v2 way, which is head-only, and we increment the compound
> > > > > mapcount for each PT mapping we have. So a PTE-mapped 2M page,
> > > > > compound_mapcount=512, subpage->_mapcount=0 (ignoring the -1 bias).
> > > > > 2. The THP-like way. If we are fully mapping the hugetlb page with the
> > > > > hstate-level PTE, we increment the compound mapcount, otherwise we
> > > > > increment subpage->_mapcount.
> > > > > 3. The RFC v1 way (the way you have suggested above), which is
> > > > > head-only, and we increment the compound mapcount if the hstate-level
> > > > > PTE is made present.
> > > >
> > > > Oh that's where it come from!  It took quite some months going through all
> > > > these, I can hardly remember the details.
> > > >
> > > > >
> > > > > With #1 and #2, there is no concern with folio_mapcount(). But with
> > > > > #3, folio_mapcount() for a PTE-mapped 2M page mapped in a single VMA
> > > > > would yield 1 instead of 512 (right?). That's what I mean.
> > > > >
> > > > > #1 has problems wrt smaps and migration (though there were other
> > > > > problems with those anyway that Mike has fixed), and #2 makes
> > > > > MADV_COLLAPSE slow to the point of being unusable for some
> > > > > applications.
> > > >
> > > > Ah so you're talking about after HGM being applied..  while I was only
> > > > talking about THPs.
> > > >
> > > > If to apply the logic here with idea 3), the worst case is we'll need to
> > > > have special care of HGM hugetlb in folio_referenced_one(), so the default
> > > > page_vma_mapped_walk() may not apply anymore - the resource is always in
> > > > hstate sized, so counting small ptes do not help too - we can just walk
> > > > until the hstate entry and do referenced++ if it's not none, at the
> > > > entrance of folio_referenced_one().
> > > >
> > > > But I'm not sure whether that'll be necessary at all, as I'm not sure
> > > > whether that path can be triggered at all in any form (where from the top
> > > > it should always be shrink_page_list()).  In that sense maybe we can also
> > > > consider adding a WARN_ON_ONCE() in folio_referenced() where it is a
> > > > hugetlb page that got passed in?  Meanwhile, adding a TODO comment
> > > > explaining that current walk won't work easily for HGM only, so when it
> > > > will be applicable to hugetlb we need to rework?
> > > >
> > > > I confess that's not pretty, though.  But that'll make 3) with no major
> > > > defect from function-wise.
> > >
> > > Another potential idea would be to add something like page_vmacount().
> > > For non-HugeTLB pages, page_vmacount() == page_mapcount(). Then for
> > > HugeTLB pages, we could keep a separate count (in one of the tail
> > > pages, I guess). And then in the places that matter (so smaps,
> > > migration, and maybe CoW and hwpoison), potentially change their calls
> > > to page_vmacount() instead of page_mapcount().
> > >
> > > Then to implement page_vmacount(), we do the RFC v1 mapcount approach
> > > (but like.... correctly this time). And then for page_mapcount(), we
> > > do the RFC v2 mapcount approach (head-only, once per PTE).
> > >
> > > Then we fix folio_referenced() without needing to special-case it for
> > > HugeTLB. :) Or we could just special-case it. *shrug*
> > >
> > > Does that sound reasonable? We still have the problem where a series
> > > of partially unmaps could leave page_vmacount() incremented, but I
> > > don't think that's a big problem.
> >
> > I'm afraid someone will stop you from introducing yet another definition of
> > mapcount, where others are trying to remove it. :)
> >
> > Or, can we just drop folio_referenced_arg.mapcount?  We need to keep:
> >
> >         if (!pra.mapcount)
> >                 return 0;
> >
> > By replacing it with folio_mapcount which is definitely something
> > worthwhile, but what about the rest?
> >
> > If it can be dropped, afaict it'll naturally work with HGM again.
> >
> > IIUC that's an optimization where we want to stop the rmap walk as long as
> > we found all the pages, however (1) IIUC it's not required to function, and
> > (2) it's not guaranteed to work as solid anyway.. As we've discussed
> > before: right after it reads mapcount (but before taking the page lock),
> > the mapcount can get decreased by 1, then it'll still need to loop over all
> > the vmas just to find that there's one "misterious" mapcount lost.
> >
> > Personally I really have no idea on how much that optimization can help.
>
> Ok, yeah, I think pra.mapcount can be removed too. (And we replace
> !pra.mapcount with !folio_mapcount().)
>
> I don't see any other existing users of folio_mapcount() and
> total_mapcount() that are problematic. We do need to make sure to keep
> refcount and mapcount in sync though; it can be done.
>
> So I'll compare this "RFC v1" way with the THP-like way and get you a
> performance comparison.

Here is the result: [1] (sorry it took a little while heh). The
implementation of the "RFC v1" way is pretty horrible[2] (and this
implementation probably has bugs anyway; it doesn't account for the
folio_referenced() problem).

Matthew is trying to solve the same problem with THPs right now: [3].
I haven't figured out how we can apply Matthew's approach to HGM
right now, but there probably is a way. (If we left the mapcount
increment bits in the same place, we couldn't just check the
hstate-level PTE; it would have already been made present.)

We could:
- use the THP-like way and tolerate ~1 second collapses
- use the (non-RFC) v1 way and tolerate the migration/smaps differences
- use the RFC v1 way and tolerate the complicated mapcount accounting
- flesh out [3] and see if it can be applied to HGM nicely

I'm happy to go with any of these approaches.

[1]: https://pastebin.com/raw/hJzFJHiD
[2]: https://github.com/48ca/linux/commit/4495f16a09b660aff44b3edcc125aa3a3df85976
[3]: https://lore.kernel.org/linux-mm/Y+FkV4fBxHlp6FTH@casper.infradead.org/

- James
  
James Houghton Feb. 7, 2023, 10:46 p.m. UTC | #49
> Here is the result: [1] (sorry it took a little while heh). The
> implementation of the "RFC v1" way is pretty horrible[2] (and this
> implementation probably has bugs anyway; it doesn't account for the
> folio_referenced() problem).
>
> Matthew is trying to solve the same problem with THPs right now: [3].
> I haven't figured out how we can apply Matthews's approach to HGM
> right now, but there probably is a way. (If we left the mapcount
> increment bits in the same place, we couldn't just check the
> hstate-level PTE; it would have already been made present.)
>
> We could:
> - use the THP-like way and tolerate ~1 second collapses

Another thought here. We don't necessarily *need* to collapse the page
table mappings in between mmu_notifier_invalidate_range_start() and
mmu_notifier_invalidate_range_end(), as the pfns aren't changing,
we aren't punching any holes, and we aren't changing permission bits.
If we had an MMU notifier that simply informed KVM that we collapsed
the page tables *after* we finished collapsing, then it would be ok
for hugetlb_collapse() to be slow.

If this MMU notifier is something that makes sense, it probably
applies to MADV_COLLAPSE for THPs as well.
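To sketch the idea (the callback and helper names below are made up, not
an existing notifier API): a one-shot notification that fires after the
collapse completes, instead of an invalidate_range_start()/end() pair
around it:

	/* Hypothetical addition to include/linux/mmu_notifier.h */
	struct mmu_notifier_ops {
		/* ... existing callbacks elided ... */

		/*
		 * Called after the primary MMU has collapsed the page
		 * tables covering [start, end) in place: pfns, holes, and
		 * permission bits are unchanged, so secondary MMUs (e.g.
		 * KVM) can simply rebuild their mappings lazily.
		 */
		void (*collapse_completed)(struct mmu_notifier *subscription,
					   struct mm_struct *mm,
					   unsigned long start,
					   unsigned long end);
	};

	/* Hypothetical call at the very end of hugetlb_collapse() */
	mmu_notifier_collapse_completed(mm, start, end);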


> - use the (non-RFC) v1 way and tolerate the migration/smaps differences
> - use the RFC v1 way and tolerate the complicated mapcount accounting
> - flesh out [3] and see if it can be applied to HGM nicely
>
> I'm happy to go with any of these approaches.
>
> [1]: https://pastebin.com/raw/hJzFJHiD
> [2]: https://github.com/48ca/linux/commit/4495f16a09b660aff44b3edcc125aa3a3df85976
> [3]: https://lore.kernel.org/linux-mm/Y+FkV4fBxHlp6FTH@casper.infradead.org/

- James
  
Peter Xu Feb. 7, 2023, 11:13 p.m. UTC | #50
James,

On Tue, Feb 07, 2023 at 02:46:04PM -0800, James Houghton wrote:
> > Here is the result: [1] (sorry it took a little while heh). The

Thanks.  From what I can tell, that number shows that it'll be great we
start with your rfcv1 mapcount approach, which mimics what's proposed by
Matthew for generic folio.

> > implementation of the "RFC v1" way is pretty horrible[2] (and this

Any more information on why it's horrible? :)

A quick comment: I'm wondering whether that "whether we should boost the
mapcount" value can be hidden in the hugetlb_pte* so you don't need to
pass a lot of bool* arguments deep into the HGM walk routines.
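E.g. something like the below; the field name is made up and the other
members of the struct from earlier in the series are elided:

	struct hugetlb_pte {
		pte_t *ptep;
		/* ... the other fields from this series ... */

		/*
		 * Made-up field: records whether this mapping still needs
		 * its hpage mapcount boosted, so the HGM walk helpers do
		 * not need an extra bool* parameter threaded through.
		 */
		bool boost_mapcount;
	};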

> > implementation probably has bugs anyway; it doesn't account for the
> > folio_referenced() problem).

I thought we reached a consensus on the resolution, by a proposal to remove
folio_referenced_arg.mapcount.  Is it not working for some reason?

> >
> > Matthew is trying to solve the same problem with THPs right now: [3].
> > I haven't figured out how we can apply Matthews's approach to HGM
> > right now, but there probably is a way. (If we left the mapcount
> > increment bits in the same place, we couldn't just check the
> > hstate-level PTE; it would have already been made present.)

I'm just worried that (1) this may add yet another dependency to your work
which is still during discussion phase, and (2) whether the folio approach
is easily applicable here, e.g., we may not want to populate all the ptes
for hugetlb HGMs by default.

> >
> > We could:
> > - use the THP-like way and tolerate ~1 second collapses
> 
> Another thought here. We don't necessarily *need* to collapse the page
> table mappings in between mmu_notifier_invalidate_range_start() and
> mmu_notifier_invalidate_range_end(), as the pfns aren't changing,
> we aren't punching any holes, and we aren't changing permission bits.
> If we had an MMU notifier that simply informed KVM that we collapsed
> the page tables *after* we finished collapsing, then it would be ok
> for hugetlb_collapse() to be slow.

That's a great point!  It'll definitely apply to either approach.

> 
> If this MMU notifier is something that makes sense, it probably
> applies to MADV_COLLAPSE for THPs as well.

THPs are definitely different, mmu notifiers should be required there,
afaict.  Isn't that what the current code does?

See collapse_and_free_pmd() for shmem and collapse_huge_page() for anon.

> 
> 
> > - use the (non-RFC) v1 way and tolerate the migration/smaps differences
> > - use the RFC v1 way and tolerate the complicated mapcount accounting
> > - flesh out [3] and see if it can be applied to HGM nicely
> >
> > I'm happy to go with any of these approaches.
> >
> > [1]: https://pastebin.com/raw/hJzFJHiD
> > [2]: https://github.com/48ca/linux/commit/4495f16a09b660aff44b3edcc125aa3a3df85976
> > [3]: https://lore.kernel.org/linux-mm/Y+FkV4fBxHlp6FTH@casper.infradead.org/
> 
> - James
>
  
James Houghton Feb. 8, 2023, 12:26 a.m. UTC | #51
On Tue, Feb 7, 2023 at 3:13 PM Peter Xu <peterx@redhat.com> wrote:
>
> James,
>
> On Tue, Feb 07, 2023 at 02:46:04PM -0800, James Houghton wrote:
> > > Here is the result: [1] (sorry it took a little while heh). The
>
> Thanks.  From what I can tell, that number shows that it'll be great we
> start with your rfcv1 mapcount approach, which mimics what's proposed by
> Matthew for generic folio.

Do you think the RFC v1 way is better than doing the THP-like way
*with the additional MMU notifier*?

>
> > > implementation of the "RFC v1" way is pretty horrible[2] (and this
>
> Any more information on why it's horrible? :)

I figured the code would speak for itself, heh. It's quite complicated.

I really didn't like:
1. The 'inc' business in copy_hugetlb_page_range.
2. How/where I call put_page()/folio_put() to keep the refcount and
mapcount synced up.
3. Having to check the page cache in UFFDIO_CONTINUE.

>
> A quick comment is I'm wondering whether that "whether we should boost the
> mapcount" value can be hidden in hugetlb_pte* so you don't need to pass
> over a lot of bool* deep into the hgm walk routines.

Oh yeah, that's a great idea.

>
> > > implementation probably has bugs anyway; it doesn't account for the
> > > folio_referenced() problem).
>
> I thought we reached a consensus on the resolution, by a proposal to remove
> folio_referenced_arg.mapcount.  Is it not working for some reason?

I think that works, I just didn't bother here. I just wanted to show
you approximately what it would look like to implement the RFC v1
approach.

>
> > >
> > > Matthew is trying to solve the same problem with THPs right now: [3].
> > > I haven't figured out how we can apply Matthews's approach to HGM
> > > right now, but there probably is a way. (If we left the mapcount
> > > increment bits in the same place, we couldn't just check the
> > > hstate-level PTE; it would have already been made present.)
>
> I'm just worried that (1) this may add yet another dependency to your work
> which is still during discussion phase, and (2) whether the folio approach
> is easily applicable here, e.g., we may not want to populate all the ptes
> for hugetlb HGMs by default.

That's true. I definitely don't want to wait for this either. It seems
like Matthew's approach won't work very well for us -- when doing a
lot of high-granularity UFFDIO_CONTINUEs on a 1G page, checking all
the PTEs to see if any of them are mapped would get really slow.

>
> > >
> > > We could:
> > > - use the THP-like way and tolerate ~1 second collapses
> >
> > Another thought here. We don't necessarily *need* to collapse the page
> > table mappings in between mmu_notifier_invalidate_range_start() and
> > mmu_notifier_invalidate_range_end(), as the pfns aren't changing,
> > we aren't punching any holes, and we aren't changing permission bits.
> > If we had an MMU notifier that simply informed KVM that we collapsed
> > the page tables *after* we finished collapsing, then it would be ok
> > for hugetlb_collapse() to be slow.
>
> That's a great point!  It'll definitely apply to either approach.
>
> >
> > If this MMU notifier is something that makes sense, it probably
> > applies to MADV_COLLAPSE for THPs as well.
>
> THPs are definitely different, mmu notifiers should be required there,
> afaict.  Isn't that what the current code does?
>
> See collapse_and_free_pmd() for shmem and collapse_huge_page() for anon.

Oh, yes, of course, MADV_COLLAPSE can actually move things around and
properly make THPs. Thanks. But it would apply if we were only
collapsing PTE-mapped THPs, I think?


- James
  
Peter Xu Feb. 8, 2023, 4:16 p.m. UTC | #52
On Tue, Feb 07, 2023 at 04:26:02PM -0800, James Houghton wrote:
> On Tue, Feb 7, 2023 at 3:13 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > James,
> >
> > On Tue, Feb 07, 2023 at 02:46:04PM -0800, James Houghton wrote:
> > > > Here is the result: [1] (sorry it took a little while heh). The
> >
> > Thanks.  From what I can tell, that number shows that it'll be great we
> > start with your rfcv1 mapcount approach, which mimics what's proposed by
> > Matthew for generic folio.
> 
> Do you think the RFC v1 way is better than doing the THP-like way
> *with the additional MMU notifier*?

What's the additional MMU notifier you're referring to?

> 
> >
> > > > implementation of the "RFC v1" way is pretty horrible[2] (and this
> >
> > Any more information on why it's horrible? :)
> 
> I figured the code would speak for itself, heh. It's quite complicated.
> 
> I really didn't like:
> 1. The 'inc' business in copy_hugetlb_page_range.
> 2. How/where I call put_page()/folio_put() to keep the refcount and
> mapcount synced up.
> 3. Having to check the page cache in UFFDIO_CONTINUE.

I think the complexity is one thing which I'm fine with so far.  However
when I think again about the things behind that complexity, I noticed there
may be at least one flaw that may not be trivial to work around.

It's about truncation.  The problem is that we now use the pgtable entry to
represent the mapcount, but the pgtable entry cannot be zapped easily
unless the vma is unmapped or collapsed.

It means e.g. truncate_inode_folio() may stop working for hugetlb (of
course, with page lock held).  The mappings will be removed for real, but
not the mapcount for HGM anymore, because unmap_mapping_folio() only zaps
the pgtable leaves, not the ones that we used to account for mapcounts.

So the kernel may see weird things, like mapcount>0 after
truncate_inode_folio() being finished completely.
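To illustrate the sequence I have in mind (not code from the series, and
assuming the mapcount is accounted on the non-leaf hstate-level entry):

	/*
	 *   truncate_inode_folio(mapping, folio)     <- folio lock held
	 *     unmap_mapping_folio(folio)
	 *       ... zaps only the leaf entries below the hstate entry ...
	 *
	 * The non-leaf hstate-level entry, and the mapcount it accounts
	 * for, survive, so folio_mapcount(folio) can still be >0 after the
	 * truncation has completely finished.
	 */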

For HGM to do the right thing, we may want to also remove the non-leaf
entries when truncating or doing similar things like a rmap walk to drop
any mappings for a page/folio.  Though that's not doable for now because
the locks that truncate_inode_folio() holds are weaker than what we need to
free the pgtable non-leaf entries - we'll need the mmap write lock for that, the
same as when we unmap or collapse.

Matthew's design doesn't have such an issue if the ptes need to be populated,
because mapcount is still with the leaves; not the case for us here.

If that's the case, _maybe_ we still need to start with the stupid but
working approach of subpage mapcounts.

[...]

> > > > Matthew is trying to solve the same problem with THPs right now: [3].
> > > > I haven't figured out how we can apply Matthews's approach to HGM
> > > > right now, but there probably is a way. (If we left the mapcount
> > > > increment bits in the same place, we couldn't just check the
> > > > hstate-level PTE; it would have already been made present.)
> >
> > I'm just worried that (1) this may add yet another dependency to your work
> > which is still during discussion phase, and (2) whether the folio approach
> > is easily applicable here, e.g., we may not want to populate all the ptes
> > for hugetlb HGMs by default.
> 
> That's true. I definitely don't want to wait for this either. It seems
> like Matthew's approach won't work very well for us -- when doing a
> lot of high-granularity UFFDIO_CONTINUEs on a 1G page, checking all
> the PTEs to see if any of them are mapped would get really slow.

I think it'll be a common problem for userfaultfd when it comes, e.g.,
userfaultfd is PAGE_SIZE based by design so far.  It needs page-size
granularity for pgtable manipulations, unless we extend the userfaultfd
protocol to support folios, iiuc.

> 
> >
> > > >
> > > > We could:
> > > > - use the THP-like way and tolerate ~1 second collapses
> > >
> > > Another thought here. We don't necessarily *need* to collapse the page
> > > table mappings in between mmu_notifier_invalidate_range_start() and
> > > mmu_notifier_invalidate_range_end(), as the pfns aren't changing,
> > > we aren't punching any holes, and we aren't changing permission bits.
> > > If we had an MMU notifier that simply informed KVM that we collapsed
> > > the page tables *after* we finished collapsing, then it would be ok
> > > for hugetlb_collapse() to be slow.
> >
> > That's a great point!  It'll definitely apply to either approach.
> >
> > >
> > > If this MMU notifier is something that makes sense, it probably
> > > applies to MADV_COLLAPSE for THPs as well.
> >
> > THPs are definitely different, mmu notifiers should be required there,
> > afaict.  Isn't that what the current code does?
> >
> > See collapse_and_free_pmd() for shmem and collapse_huge_page() for anon.
> 
> Oh, yes, of course, MADV_COLLAPSE can actually move things around and
> properly make THPs. Thanks. But it would apply if we were only
> collapsing PTE-mapped THPs, I think?

Yes it applies I think.  And if I'm not wrong it's also doing so. :) See
collapse_pte_mapped_thp().

While for anon we always allocate a new page, hence not applicable.
  
James Houghton Feb. 9, 2023, 4:43 p.m. UTC | #53
On Wed, Feb 8, 2023 at 8:16 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Feb 07, 2023 at 04:26:02PM -0800, James Houghton wrote:
> > On Tue, Feb 7, 2023 at 3:13 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > James,
> > >
> > > On Tue, Feb 07, 2023 at 02:46:04PM -0800, James Houghton wrote:
> > > > > Here is the result: [1] (sorry it took a little while heh). The
> > >
> > > Thanks.  From what I can tell, that number shows that it'll be great we
> > > start with your rfcv1 mapcount approach, which mimics what's proposed by
> > > Matthew for generic folio.
> >
> > Do you think the RFC v1 way is better than doing the THP-like way
> > *with the additional MMU notifier*?
>
> What's the additional MMU notifier you're referring?

An MMU notifier that informs KVM that a collapse has happened without
having to invalidate_range_start() and invalidate_range_end(), the one
you're replying to lower down in the email. :) [ see below... ]

>
> >
> > >
> > > > > implementation of the "RFC v1" way is pretty horrible[2] (and this
> > >
> > > Any more information on why it's horrible? :)
> >
> > I figured the code would speak for itself, heh. It's quite complicated.
> >
> > I really didn't like:
> > 1. The 'inc' business in copy_hugetlb_page_range.
> > 2. How/where I call put_page()/folio_put() to keep the refcount and
> > mapcount synced up.
> > 3. Having to check the page cache in UFFDIO_CONTINUE.
>
> I think the complexity is one thing which I'm fine with so far.  However
> when I think again about the things behind that complexity, I noticed there
> may be at least one flaw that may not be trivial to work around.
>
> It's about truncation.  The problem is now we use the pgtable entry to
> represent the mapcount, but the pgtable entry cannot be zapped easily,
> unless vma unmapped or collapsed.
>
> It means e.g. truncate_inode_folio() may stop working for hugetlb (of
> course, with page lock held).  The mappings will be removed for real, but
> not the mapcount for HGM anymore, because unmap_mapping_folio() only zaps
> the pgtable leaves, not the ones that we used to account for mapcounts.
>
> So the kernel may see weird things, like mapcount>0 after
> truncate_inode_folio() being finished completely.
>
> For HGM to do the right thing, we may want to also remove the non-leaf
> entries when truncating or doing similar things like a rmap walk to drop
> any mappings for a page/folio.  Though that's not doable for now because
> the locks that truncate_inode_folio() is weaker than what we need to free
> the pgtable non-leaf entries - we'll need mmap write lock for that, the
> same as when we unmap or collapse.
>
> Matthew's design doesn't have such issue if the ptes need to be populated,
> because mapcount is still with the leaves; not the case for us here.
>
> If that's the case, _maybe_ we still need to start with the stupid but
> working approach of subpage mapcounts.

Good point. I can't immediately think of a solution. I would prefer to
go with the subpage mapcount approach to simplify HGM for now;
optimizing mapcount for HugeTLB can then be handled separately. If
you're ok with this, I'll go ahead and send v2.

One way that might be possible: using the PAGE_SPECIAL bit on the
hstate-level PTE to indicate if mapcount has been incremented or not
(if the PTE is pointing to page tables). As far as I can tell,
PAGE_SPECIAL doesn't carry any meaning for HugeTLB PTEs, but we would
need to be careful with existing PTE examination code as to not
misinterpret these PTEs.
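Roughly what I'm picturing (the helper names are made up, and this glosses
over exactly where the bit would be set and cleared):

	/*
	 * Made-up helpers: use the special bit on the non-leaf hstate-level
	 * PTE (which currently has no meaning for HugeTLB) to record that
	 * this mapping has already been counted in the hpage mapcount.
	 */
	static bool hugetlb_hgm_mapcount_recorded(pte_t *hstate_ptep)
	{
		pte_t pte = huge_ptep_get(hstate_ptep);

		return pte_present(pte) && pte_special(pte);
	}

	static void hugetlb_hgm_record_mapcount(struct mm_struct *mm,
						unsigned long addr,
						pte_t *hstate_ptep)
	{
		pte_t pte = huge_ptep_get(hstate_ptep);

		set_huge_pte_at(mm, addr, hstate_ptep, pte_mkspecial(pte));
	}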

>
> [...]
>
> > > > > Matthew is trying to solve the same problem with THPs right now: [3].
> > > > > I haven't figured out how we can apply Matthews's approach to HGM
> > > > > right now, but there probably is a way. (If we left the mapcount
> > > > > increment bits in the same place, we couldn't just check the
> > > > > hstate-level PTE; it would have already been made present.)
> > >
> > > I'm just worried that (1) this may add yet another dependency to your work
> > > which is still during discussion phase, and (2) whether the folio approach
> > > is easily applicable here, e.g., we may not want to populate all the ptes
> > > for hugetlb HGMs by default.
> >
> > That's true. I definitely don't want to wait for this either. It seems
> > like Matthew's approach won't work very well for us -- when doing a
> > lot of high-granularity UFFDIO_CONTINUEs on a 1G page, checking all
> > the PTEs to see if any of them are mapped would get really slow.
>
> I think it'll be a common problem to userfaultfd when it comes, e.g.,
> userfaultfd by design is PAGE_SIZE based so far.  It needs page size
> granule on pgtable manipulations, unless we extend the userfaultfd protocol
> to support folios, iiuc.
>
> >
> > >
> > > > >
> > > > > We could:
> > > > > - use the THP-like way and tolerate ~1 second collapses
> > > >
> > > > Another thought here. We don't necessarily *need* to collapse the page
> > > > table mappings in between mmu_notifier_invalidate_range_start() and
> > > > mmu_notifier_invalidate_range_end(), as the pfns aren't changing,
> > > > we aren't punching any holes, and we aren't changing permission bits.
> > > > If we had an MMU notifier that simply informed KVM that we collapsed
> > > > the page tables *after* we finished collapsing, then it would be ok
> > > > for hugetlb_collapse() to be slow.

[ from above... ] This MMU notifier. :)

> > >
> > > That's a great point!  It'll definitely apply to either approach.
> > >
> > > >
> > > > If this MMU notifier is something that makes sense, it probably
> > > > applies to MADV_COLLAPSE for THPs as well.
> > >
> > > THPs are definitely different, mmu notifiers should be required there,
> > > afaict.  Isn't that what the current code does?
> > >
> > > See collapse_and_free_pmd() for shmem and collapse_huge_page() for anon.
> >
> > Oh, yes, of course, MADV_COLLAPSE can actually move things around and
> > properly make THPs. Thanks. But it would apply if we were only
> > collapsing PTE-mapped THPs, I think?
>
> Yes it applies I think.  And if I'm not wrong it's also doing so. :) See
> collapse_pte_mapped_thp().
>
> While for anon we always allocate a new page, hence not applicable.
>
> --
> Peter Xu

Thanks Peter!
- James
  
Peter Xu Feb. 9, 2023, 7:10 p.m. UTC | #54
On Thu, Feb 09, 2023 at 08:43:45AM -0800, James Houghton wrote:
> On Wed, Feb 8, 2023 at 8:16 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Tue, Feb 07, 2023 at 04:26:02PM -0800, James Houghton wrote:
> > > On Tue, Feb 7, 2023 at 3:13 PM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > James,
> > > >
> > > > On Tue, Feb 07, 2023 at 02:46:04PM -0800, James Houghton wrote:
> > > > > > Here is the result: [1] (sorry it took a little while heh). The
> > > >
> > > > Thanks.  From what I can tell, that number shows that it'll be great we
> > > > start with your rfcv1 mapcount approach, which mimics what's proposed by
> > > > Matthew for generic folio.
> > >
> > > Do you think the RFC v1 way is better than doing the THP-like way
> > > *with the additional MMU notifier*?
> >
> > What's the additional MMU notifier you're referring?
> 
> An MMU notifier that informs KVM that a collapse has happened without
> having to invalidate_range_start() and invalidate_range_end(), the one
> you're replying to lower down in the email. :) [ see below... ]

Isn't that something that is needed no matter what mapcount approach we'll
go for?  Did I miss something?

> 
> >
> > >
> > > >
> > > > > > implementation of the "RFC v1" way is pretty horrible[2] (and this
> > > >
> > > > Any more information on why it's horrible? :)
> > >
> > > I figured the code would speak for itself, heh. It's quite complicated.
> > >
> > > I really didn't like:
> > > 1. The 'inc' business in copy_hugetlb_page_range.
> > > 2. How/where I call put_page()/folio_put() to keep the refcount and
> > > mapcount synced up.
> > > 3. Having to check the page cache in UFFDIO_CONTINUE.
> >
> > I think the complexity is one thing which I'm fine with so far.  However
> > when I think again about the things behind that complexity, I noticed there
> > may be at least one flaw that may not be trivial to work around.
> >
> > It's about truncation.  The problem is now we use the pgtable entry to
> > represent the mapcount, but the pgtable entry cannot be zapped easily,
> > unless vma unmapped or collapsed.
> >
> > It means e.g. truncate_inode_folio() may stop working for hugetlb (of
> > course, with page lock held).  The mappings will be removed for real, but
> > not the mapcount for HGM anymore, because unmap_mapping_folio() only zaps
> > the pgtable leaves, not the ones that we used to account for mapcounts.
> >
> > So the kernel may see weird things, like mapcount>0 after
> > truncate_inode_folio() being finished completely.
> >
> > For HGM to do the right thing, we may want to also remove the non-leaf
> > entries when truncating or doing similar things like a rmap walk to drop
> > any mappings for a page/folio.  Though that's not doable for now because
> > the locks that truncate_inode_folio() is weaker than what we need to free
> > the pgtable non-leaf entries - we'll need mmap write lock for that, the
> > same as when we unmap or collapse.
> >
> > Matthew's design doesn't have such issue if the ptes need to be populated,
> > because mapcount is still with the leaves; not the case for us here.
> >
> > If that's the case, _maybe_ we still need to start with the stupid but
> > working approach of subpage mapcounts.
> 
> Good point. I can't immediately think of a solution. I would prefer to
> go with the subpage mapcount approach to simplify HGM for now;
> optimizing mapcount for HugeTLB can then be handled separately. If
> you're ok with this, I'll go ahead and send v2.

I'm okay with it, but I suggest we wait for at least another day or two to
see whether Mike or others have any comments.

> 
> One way that might be possible: using the PAGE_SPECIAL bit on the
> hstate-level PTE to indicate if mapcount has been incremented or not
> (if the PTE is pointing to page tables). As far as I can tell,
> PAGE_SPECIAL doesn't carry any meaning for HugeTLB PTEs, but we would
> need to be careful with existing PTE examination code as to not
> misinterpret these PTEs.

This is an interesting idea. :) Yes I don't see it being used at all in any
pgtable non-leaves.

Then it's about how to let the zap code know when to remove the special
bit, hence the mapcount, because not all of them should.

Maybe it can be passed over as a new zap_flags_t bit?

Thanks,
  
James Houghton Feb. 9, 2023, 7:49 p.m. UTC | #55
On Thu, Feb 9, 2023 at 11:11 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Feb 09, 2023 at 08:43:45AM -0800, James Houghton wrote:
> > On Wed, Feb 8, 2023 at 8:16 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Tue, Feb 07, 2023 at 04:26:02PM -0800, James Houghton wrote:
> > > > On Tue, Feb 7, 2023 at 3:13 PM Peter Xu <peterx@redhat.com> wrote:
> > > > >
> > > > > James,
> > > > >
> > > > > On Tue, Feb 07, 2023 at 02:46:04PM -0800, James Houghton wrote:
> > > > > > > Here is the result: [1] (sorry it took a little while heh). The
> > > > >
> > > > > Thanks.  From what I can tell, that number shows that it'll be great we
> > > > > start with your rfcv1 mapcount approach, which mimics what's proposed by
> > > > > Matthew for generic folio.
> > > >
> > > > Do you think the RFC v1 way is better than doing the THP-like way
> > > > *with the additional MMU notifier*?
> > >
> > > What's the additional MMU notifier you're referring?
> >
> > An MMU notifier that informs KVM that a collapse has happened without
> > having to invalidate_range_start() and invalidate_range_end(), the one
> > you're replying to lower down in the email. :) [ see below... ]
>
> Isn't that something that is needed no matter what mapcount approach we'll
> go for?  Did I miss something?

It's not really needed for anything, but it could be an optimization
for both approaches. However, for the subpage-mapcount approach, it
would have a *huge* impact. That's what I mean.

>
> >
> > >
> > > >
> > > > >
> > > > > > > implementation of the "RFC v1" way is pretty horrible[2] (and this
> > > > >
> > > > > Any more information on why it's horrible? :)
> > > >
> > > > I figured the code would speak for itself, heh. It's quite complicated.
> > > >
> > > > I really didn't like:
> > > > 1. The 'inc' business in copy_hugetlb_page_range.
> > > > 2. How/where I call put_page()/folio_put() to keep the refcount and
> > > > mapcount synced up.
> > > > 3. Having to check the page cache in UFFDIO_CONTINUE.
> > >
> > > I think the complexity is one thing which I'm fine with so far.  However
> > > when I think again about the things behind that complexity, I noticed there
> > > may be at least one flaw that may not be trivial to work around.
> > >
> > > It's about truncation.  The problem is now we use the pgtable entry to
> > > represent the mapcount, but the pgtable entry cannot be zapped easily,
> > > unless vma unmapped or collapsed.
> > >
> > > It means e.g. truncate_inode_folio() may stop working for hugetlb (of
> > > course, with page lock held).  The mappings will be removed for real, but
> > > not the mapcount for HGM anymore, because unmap_mapping_folio() only zaps
> > > the pgtable leaves, not the ones that we used to account for mapcounts.
> > >
> > > So the kernel may see weird things, like mapcount>0 after
> > > truncate_inode_folio() being finished completely.
> > >
> > > For HGM to do the right thing, we may want to also remove the non-leaf
> > > entries when truncating or doing similar things like a rmap walk to drop
> > > any mappings for a page/folio.  Though that's not doable for now because
> > > the locks that truncate_inode_folio() is weaker than what we need to free
> > > the pgtable non-leaf entries - we'll need mmap write lock for that, the
> > > same as when we unmap or collapse.
> > >
> > > Matthew's design doesn't have such issue if the ptes need to be populated,
> > > because mapcount is still with the leaves; not the case for us here.
> > >
> > > If that's the case, _maybe_ we still need to start with the stupid but
> > > working approach of subpage mapcounts.
> >
> > Good point. I can't immediately think of a solution. I would prefer to
> > go with the subpage mapcount approach to simplify HGM for now;
> > optimizing mapcount for HugeTLB can then be handled separately. If
> > you're ok with this, I'll go ahead and send v2.
>
> I'm okay with it, but I suggest wait for at least another one day or two to
> see whether Mike or others have any comments.

Ok. :)

>
> >
> > One way that might be possible: using the PAGE_SPECIAL bit on the
> > hstate-level PTE to indicate if mapcount has been incremented or not
> > (if the PTE is pointing to page tables). As far as I can tell,
> > PAGE_SPECIAL doesn't carry any meaning for HugeTLB PTEs, but we would
> > need to be careful with existing PTE examination code as to not
> > misinterpret these PTEs.
>
> This is an interesting idea. :) Yes I don't see it being used at all in any
> pgtable non-leaves.
>
> Then it's about how to let the zap code know when to remove the special
> bit, hence the mapcount, because not all of them should.
>
> Maybe it can be passed over as a new zap_flags_t bit?

Here[1] is one way it could be done (it doesn't work 100% correctly,
it's just approximately what we could do). Basically we pass in the
entire range that we are unmapping ("floor" and "ceil"), and
hugetlb_remove_rmap only drops the hpage mapcount when it finds that
we're doing the final removal of a page that we are entirely unmapping
(i.e., floor <= (addr & huge_page_mask(h))). Having a zap flag would
probably work too.
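Roughly the shape of that check (names and signature are approximate; [1]
below has the actual attempt):

	/*
	 * Sketch only: decide whether this rmap removal should also drop
	 * the hpage-level mapcount.  "floor"/"ceil" describe the whole
	 * range the caller is unmapping.
	 */
	static bool hugetlb_hgm_drop_hpage_mapcount(struct hstate *h,
						    unsigned long addr,
						    unsigned long floor,
						    unsigned long ceil,
						    bool last_removal)
	{
		unsigned long hpage_start = addr & huge_page_mask(h);
		unsigned long hpage_end = hpage_start + huge_page_size(h);

		/*
		 * Only when the caller is tearing down the entire hugepage
		 * (floor/ceil cover it) and this is the last rmap removal
		 * for the page.
		 */
		return last_removal &&
		       floor <= hpage_start && hpage_end <= ceil;
	}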

I think something like [1] ought to go in its own series. :)

[1]: https://github.com/48ca/linux/commit/de884eaaadf61b8dcfb1defd99bbf487667e46f4

- James
  
Peter Xu Feb. 9, 2023, 8:22 p.m. UTC | #56
On Thu, Feb 09, 2023 at 11:49:25AM -0800, James Houghton wrote:
> On Thu, Feb 9, 2023 at 11:11 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Thu, Feb 09, 2023 at 08:43:45AM -0800, James Houghton wrote:
> > > On Wed, Feb 8, 2023 at 8:16 AM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > On Tue, Feb 07, 2023 at 04:26:02PM -0800, James Houghton wrote:
> > > > > On Tue, Feb 7, 2023 at 3:13 PM Peter Xu <peterx@redhat.com> wrote:
> > > > > >
> > > > > > James,
> > > > > >
> > > > > > On Tue, Feb 07, 2023 at 02:46:04PM -0800, James Houghton wrote:
> > > > > > > > Here is the result: [1] (sorry it took a little while heh). The
> > > > > >
> > > > > > Thanks.  From what I can tell, that number shows that it'll be great we
> > > > > > start with your rfcv1 mapcount approach, which mimics what's proposed by
> > > > > > Matthew for generic folio.
> > > > >
> > > > > Do you think the RFC v1 way is better than doing the THP-like way
> > > > > *with the additional MMU notifier*?
> > > >
> > > > What's the additional MMU notifier you're referring?
> > >
> > > An MMU notifier that informs KVM that a collapse has happened without
> > > having to invalidate_range_start() and invalidate_range_end(), the one
> > > you're replying to lower down in the email. :) [ see below... ]
> >
> > Isn't that something that is needed no matter what mapcount approach we'll
> > go for?  Did I miss something?
> 
> It's not really needed for anything, but it could be an optimization
> for both approaches. However, for the subpage-mapcount approach, it
> would have a *huge* impact. That's what I mean.

Ah, okay.

> 
> >
> > >
> > > >
> > > > >
> > > > > >
> > > > > > > > implementation of the "RFC v1" way is pretty horrible[2] (and this
> > > > > >
> > > > > > Any more information on why it's horrible? :)
> > > > >
> > > > > I figured the code would speak for itself, heh. It's quite complicated.
> > > > >
> > > > > I really didn't like:
> > > > > 1. The 'inc' business in copy_hugetlb_page_range.
> > > > > 2. How/where I call put_page()/folio_put() to keep the refcount and
> > > > > mapcount synced up.
> > > > > 3. Having to check the page cache in UFFDIO_CONTINUE.
> > > >
> > > > I think the complexity is one thing which I'm fine with so far.  However
> > > > when I think again about the things behind that complexity, I noticed there
> > > > may be at least one flaw that may not be trivial to work around.
> > > >
> > > > It's about truncation.  The problem is now we use the pgtable entry to
> > > > represent the mapcount, but the pgtable entry cannot be zapped easily,
> > > > unless vma unmapped or collapsed.
> > > >
> > > > It means e.g. truncate_inode_folio() may stop working for hugetlb (of
> > > > course, with page lock held).  The mappings will be removed for real, but
> > > > not the mapcount for HGM anymore, because unmap_mapping_folio() only zaps
> > > > the pgtable leaves, not the ones that we used to account for mapcounts.
> > > >
> > > > So the kernel may see weird things, like mapcount>0 after
> > > > truncate_inode_folio() being finished completely.
> > > >
> > > > For HGM to do the right thing, we may want to also remove the non-leaf
> > > > entries when truncating or doing similar things like a rmap walk to drop
> > > > any mappings for a page/folio.  Though that's not doable for now because
> > > > the locks that truncate_inode_folio() is weaker than what we need to free
> > > > the pgtable non-leaf entries - we'll need mmap write lock for that, the
> > > > same as when we unmap or collapse.
> > > >
> > > > Matthew's design doesn't have such issue if the ptes need to be populated,
> > > > because mapcount is still with the leaves; not the case for us here.
> > > >
> > > > If that's the case, _maybe_ we still need to start with the stupid but
> > > > working approach of subpage mapcounts.
> > >
> > > Good point. I can't immediately think of a solution. I would prefer to
> > > go with the subpage mapcount approach to simplify HGM for now;
> > > optimizing mapcount for HugeTLB can then be handled separately. If
> > > you're ok with this, I'll go ahead and send v2.
> >
> > I'm okay with it, but I suggest wait for at least another one day or two to
> > see whether Mike or others have any comments.
> 
> Ok. :)
> 
> >
> > >
> > > One way that might be possible: using the PAGE_SPECIAL bit on the
> > > hstate-level PTE to indicate if mapcount has been incremented or not
> > > (if the PTE is pointing to page tables). As far as I can tell,
> > > PAGE_SPECIAL doesn't carry any meaning for HugeTLB PTEs, but we would
> > > need to be careful with existing PTE examination code as to not
> > > misinterpret these PTEs.
> >
> > This is an interesting idea. :) Yes I don't see it being used at all in any
> > pgtable non-leaves.
> >
> > Then it's about how to let the zap code know when to remove the special
> > bit, hence the mapcount, because not all of them should.
> >
> > Maybe it can be passed over as a new zap_flags_t bit?
> 
> Here[1] is one way it could be done (it doesn't work 100% correctly,
> it's just approximately what we could do). Basically we pass in the
> entire range that we are unmapping ("floor" and "ceil"), and if
> hugetlb_remove_rmap finds that we're doing the final removal of a page
> that we are entirely unmapping (i.e., floor <= addr &
> huge_page_mask(h)). Having a zap flag would probably work too.

Yeah maybe flags are not needed at all.  I had a quick glance, looks good
in general.

I think the trick is when it's not unmapped in a single shot.  Consider
someone zapping the first half of an HGM-mapped hpage and then the other
half.  The range may not always tell the whole story, so the rmap might be
left over in some cases.

But maybe it is not a big deal.  The only case I can think of so far is a
partial DONTNEED, but I think it's fine to leave the mapcount there until a
more serious request comes along to either truncate or unmap it.  At least
all rmap walks should work as expected.

> 
> I think something like [1] ought to go in its own series. :)
> 
> [1]: https://github.com/48ca/linux/commit/de884eaaadf61b8dcfb1defd99bbf487667e46f4

Yes I agree it can be worked on top.
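
To make the idea in [1] concrete, here is a minimal sketch of the "final
removal" check being discussed, assuming a hypothetical helper name (it is
not taken from [1] or from this patch): the zap path passes the floor/ceil
of the full range being unmapped, and the hugepage-level mapcount is only
dropped when that range covers the whole hugepage.

/*
 * Illustrative sketch only; hypothetical helper, not part of this series.
 * Decide whether zapping the mapping at @addr should also do the final
 * hugepage-level rmap/mapcount removal.  @floor and @ceil describe the
 * full range being unmapped in this operation.
 */
static bool zap_covers_whole_hugepage(struct hstate *h, unsigned long addr,
				      unsigned long floor, unsigned long ceil)
{
	unsigned long start = addr & huge_page_mask(h);
	unsigned long end = start + huge_page_size(h);

	return floor <= start && end <= ceil;
}

With a check like this, a partial DONTNEED of an HGM-mapped hugepage leaves
the mapcount (and hence the rmap) in place, which is the leftover case
discussed above.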
  

Patch

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 74e1d873dce0..284466bf4f25 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2626,13 +2626,25 @@  static int __s390_enable_skey_pmd(pmd_t *pmd, unsigned long addr,
 	return 0;
 }
 
-static int __s390_enable_skey_hugetlb(pte_t *pte, unsigned long addr,
-				      unsigned long hmask, unsigned long next,
+static int __s390_enable_skey_hugetlb(struct hugetlb_pte *hpte,
+				      unsigned long addr,
 				      struct mm_walk *walk)
 {
-	pmd_t *pmd = (pmd_t *)pte;
+	struct hstate *h = hstate_vma(walk->vma);
+	pmd_t *pmd;
 	unsigned long start, end;
-	struct page *page = pmd_page(*pmd);
+	struct page *page;
+
+	if (huge_page_size(h) != hugetlb_pte_size(hpte))
+		/* Ignore high-granularity PTEs. */
+		return 0;
+
+	if (!pte_present(huge_ptep_get(hpte->ptep)))
+		/* Ignore non-present PTEs. */
+		return 0;
+
+	pmd = (pmd_t *)hpte->ptep;
+	page = pmd_page(*pmd);
 
 	/*
 	 * The write check makes sure we do not set a key on shared
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 41b5509bde0e..c353cab11eee 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -731,18 +731,28 @@  static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
-static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
-				 unsigned long addr, unsigned long end,
-				 struct mm_walk *walk)
+static int smaps_hugetlb_range(struct hugetlb_pte *hpte,
+				unsigned long addr,
+				struct mm_walk *walk)
 {
 	struct mem_size_stats *mss = walk->private;
 	struct vm_area_struct *vma = walk->vma;
 	struct page *page = NULL;
+	pte_t pte = huge_ptep_get(hpte->ptep);
 
-	if (pte_present(*pte)) {
-		page = vm_normal_page(vma, addr, *pte);
-	} else if (is_swap_pte(*pte)) {
-		swp_entry_t swpent = pte_to_swp_entry(*pte);
+	if (pte_present(pte)) {
+		/* We only care about leaf-level PTEs. */
+		if (!hugetlb_pte_present_leaf(hpte, pte))
+			/*
+			 * The only case where hpte is not a leaf is that
+			 * it was originally none, but it was split from
+			 * under us. It was originally none, so exclude it.
+			 */
+			return 0;
+
+		page = vm_normal_page(vma, addr, pte);
+	} else if (is_swap_pte(pte)) {
+		swp_entry_t swpent = pte_to_swp_entry(pte);
 
 		if (is_pfn_swap_entry(swpent))
 			page = pfn_swap_entry_to_page(swpent);
@@ -751,9 +761,9 @@  static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
 		int mapcount = page_mapcount(page);
 
 		if (mapcount >= 2)
-			mss->shared_hugetlb += huge_page_size(hstate_vma(vma));
+			mss->shared_hugetlb += hugetlb_pte_size(hpte);
 		else
-			mss->private_hugetlb += huge_page_size(hstate_vma(vma));
+			mss->private_hugetlb += hugetlb_pte_size(hpte);
 	}
 	return 0;
 }
@@ -1572,22 +1582,31 @@  static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 
 #ifdef CONFIG_HUGETLB_PAGE
 /* This function walks within one hugetlb entry in the single call */
-static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
-				 unsigned long addr, unsigned long end,
+static int pagemap_hugetlb_range(struct hugetlb_pte *hpte,
+				 unsigned long addr,
 				 struct mm_walk *walk)
 {
 	struct pagemapread *pm = walk->private;
 	struct vm_area_struct *vma = walk->vma;
 	u64 flags = 0, frame = 0;
 	int err = 0;
-	pte_t pte;
+	unsigned long hmask = hugetlb_pte_mask(hpte);
+	unsigned long end = addr + hugetlb_pte_size(hpte);
+	pte_t pte = huge_ptep_get(hpte->ptep);
+	struct page *page;
 
 	if (vma->vm_flags & VM_SOFTDIRTY)
 		flags |= PM_SOFT_DIRTY;
 
-	pte = huge_ptep_get(ptep);
 	if (pte_present(pte)) {
-		struct page *page = pte_page(pte);
+		/*
+		 * We raced with this PTE being split, which can only happen if
+		 * it was blank before. Treat it as if it were blank.
+		 */
+		if (!hugetlb_pte_present_leaf(hpte, pte))
+			return 0;
+
+		page = pte_page(pte);
 
 		if (!PageAnon(page))
 			flags |= PM_FILE;
@@ -1868,10 +1887,16 @@  static struct page *can_gather_numa_stats_pmd(pmd_t pmd,
 }
 #endif
 
+struct show_numa_map_private {
+	struct numa_maps *md;
+	struct page *last_page;
+};
+
 static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 		unsigned long end, struct mm_walk *walk)
 {
-	struct numa_maps *md = walk->private;
+	struct show_numa_map_private *priv = walk->private;
+	struct numa_maps *md = priv->md;
 	struct vm_area_struct *vma = walk->vma;
 	spinlock_t *ptl;
 	pte_t *orig_pte;
@@ -1883,6 +1908,7 @@  static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 		struct page *page;
 
 		page = can_gather_numa_stats_pmd(*pmd, vma, addr);
+		priv->last_page = page;
 		if (page)
 			gather_stats(page, md, pmd_dirty(*pmd),
 				     HPAGE_PMD_SIZE/PAGE_SIZE);
@@ -1896,6 +1922,7 @@  static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 	orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
 	do {
 		struct page *page = can_gather_numa_stats(*pte, vma, addr);
+		priv->last_page = page;
 		if (!page)
 			continue;
 		gather_stats(page, md, pte_dirty(*pte), 1);
@@ -1906,19 +1933,25 @@  static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 	return 0;
 }
 #ifdef CONFIG_HUGETLB_PAGE
-static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask,
-		unsigned long addr, unsigned long end, struct mm_walk *walk)
+static int gather_hugetlb_stats(struct hugetlb_pte *hpte, unsigned long addr,
+		struct mm_walk *walk)
 {
-	pte_t huge_pte = huge_ptep_get(pte);
+	struct show_numa_map_private *priv = walk->private;
+	pte_t huge_pte = huge_ptep_get(hpte->ptep);
 	struct numa_maps *md;
 	struct page *page;
 
-	if (!pte_present(huge_pte))
+	if (!hugetlb_pte_present_leaf(hpte, huge_pte))
+		return 0;
+
+	page = compound_head(pte_page(huge_pte));
+	if (priv->last_page == page)
+		/* we've already accounted for this page */
 		return 0;
 
-	page = pte_page(huge_pte);
+	priv->last_page = page;
 
-	md = walk->private;
+	md = priv->md;
 	gather_stats(page, md, pte_dirty(huge_pte), 1);
 	return 0;
 }
@@ -1948,9 +1981,15 @@  static int show_numa_map(struct seq_file *m, void *v)
 	struct file *file = vma->vm_file;
 	struct mm_struct *mm = vma->vm_mm;
 	struct mempolicy *pol;
+
 	char buffer[64];
 	int nid;
 
+	struct show_numa_map_private numa_map_private;
+
+	numa_map_private.md = md;
+	numa_map_private.last_page = NULL;
+
 	if (!mm)
 		return 0;
 
@@ -1980,7 +2019,7 @@  static int show_numa_map(struct seq_file *m, void *v)
 		seq_puts(m, " huge");
 
 	/* mmap_lock is held by m_start */
-	walk_page_vma(vma, &show_numa_ops, md);
+	walk_page_vma(vma, &show_numa_ops, &numa_map_private);
 
 	if (!md->pages)
 		goto out;
diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index 27a6df448ee5..f4bddad615c2 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -3,6 +3,7 @@ 
 #define _LINUX_PAGEWALK_H
 
 #include <linux/mm.h>
+#include <linux/hugetlb.h>
 
 struct mm_walk;
 
@@ -31,6 +32,10 @@  struct mm_walk;
  *			ptl after dropping the vma lock, or else revalidate
  *			those items after re-acquiring the vma lock and before
  *			accessing them.
+ *			In the presence of high-granularity hugetlb entries,
+ *			@hugetlb_entry is called only for leaf-level entries
+ *			(hstate-level entries are ignored if they are not
+ *			leaves).
  * @test_walk:		caller specific callback function to determine whether
  *			we walk over the current vma or not. Returning 0 means
  *			"do page table walk over the current vma", returning
@@ -58,9 +63,8 @@  struct mm_walk_ops {
 			 unsigned long next, struct mm_walk *walk);
 	int (*pte_hole)(unsigned long addr, unsigned long next,
 			int depth, struct mm_walk *walk);
-	int (*hugetlb_entry)(pte_t *pte, unsigned long hmask,
-			     unsigned long addr, unsigned long next,
-			     struct mm_walk *walk);
+	int (*hugetlb_entry)(struct hugetlb_pte *hpte,
+			     unsigned long addr, struct mm_walk *walk);
 	int (*test_walk)(unsigned long addr, unsigned long next,
 			struct mm_walk *walk);
 	int (*pre_vma)(unsigned long start, unsigned long end,
diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index 9d92c5eb3a1f..2383f647f202 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -330,11 +330,12 @@  static int damon_mkold_pmd_entry(pmd_t *pmd, unsigned long addr,
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
-static void damon_hugetlb_mkold(pte_t *pte, struct mm_struct *mm,
+static void damon_hugetlb_mkold(struct hugetlb_pte *hpte, pte_t entry,
+				struct mm_struct *mm,
 				struct vm_area_struct *vma, unsigned long addr)
 {
 	bool referenced = false;
-	pte_t entry = huge_ptep_get(pte);
+	pte_t entry = huge_ptep_get(hpte->ptep);
 	struct folio *folio = pfn_folio(pte_pfn(entry));
 
 	folio_get(folio);
@@ -342,12 +343,12 @@  static void damon_hugetlb_mkold(pte_t *pte, struct mm_struct *mm,
 	if (pte_young(entry)) {
 		referenced = true;
 		entry = pte_mkold(entry);
-		set_huge_pte_at(mm, addr, pte, entry);
+		set_huge_pte_at(mm, addr, hpte->ptep, entry);
 	}
 
 #ifdef CONFIG_MMU_NOTIFIER
 	if (mmu_notifier_clear_young(mm, addr,
-				     addr + huge_page_size(hstate_vma(vma))))
+				     addr + hugetlb_pte_size(hpte)))
 		referenced = true;
 #endif /* CONFIG_MMU_NOTIFIER */
 
@@ -358,20 +359,26 @@  static void damon_hugetlb_mkold(pte_t *pte, struct mm_struct *mm,
 	folio_put(folio);
 }
 
-static int damon_mkold_hugetlb_entry(pte_t *pte, unsigned long hmask,
-				     unsigned long addr, unsigned long end,
+static int damon_mkold_hugetlb_entry(struct hugetlb_pte *hpte,
+				     unsigned long addr,
 				     struct mm_walk *walk)
 {
-	struct hstate *h = hstate_vma(walk->vma);
 	spinlock_t *ptl;
 	pte_t entry;
 
-	ptl = huge_pte_lock(h, walk->mm, pte);
-	entry = huge_ptep_get(pte);
+	ptl = hugetlb_pte_lock(hpte);
+	entry = huge_ptep_get(hpte->ptep);
 	if (!pte_present(entry))
 		goto out;
 
-	damon_hugetlb_mkold(pte, walk->mm, walk->vma, addr);
+	if (!hugetlb_pte_present_leaf(hpte, entry))
+		/*
+		 * We raced with someone splitting a blank PTE. Treat this PTE
+		 * as if it were blank.
+		 */
+		goto out;
+
+	damon_hugetlb_mkold(hpte, entry, walk->mm, walk->vma, addr);
 
 out:
 	spin_unlock(ptl);
@@ -484,8 +491,8 @@  static int damon_young_pmd_entry(pmd_t *pmd, unsigned long addr,
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
-static int damon_young_hugetlb_entry(pte_t *pte, unsigned long hmask,
-				     unsigned long addr, unsigned long end,
+static int damon_young_hugetlb_entry(struct hugetlb_pte *hpte,
+				     unsigned long addr,
 				     struct mm_walk *walk)
 {
 	struct damon_young_walk_private *priv = walk->private;
@@ -494,11 +501,18 @@  static int damon_young_hugetlb_entry(pte_t *pte, unsigned long hmask,
 	spinlock_t *ptl;
 	pte_t entry;
 
-	ptl = huge_pte_lock(h, walk->mm, pte);
-	entry = huge_ptep_get(pte);
+	ptl = hugetlb_pte_lock(hpte);
+	entry = huge_ptep_get(hpte->ptep);
 	if (!pte_present(entry))
 		goto out;
 
+	if (!hugetlb_pte_present_leaf(hpte, entry))
+		/*
+		 * We raced with someone splitting a blank PTE. Treat this PTE
+		 * as if it were blank.
+		 */
+		goto out;
+
 	folio = pfn_folio(pte_pfn(entry));
 	folio_get(folio);
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 6a151c09de5e..d3e40cfdd4cb 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -468,8 +468,8 @@  static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
 #endif
 
 #ifdef CONFIG_HUGETLB_PAGE
-static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
-				      unsigned long start, unsigned long end,
+static int hmm_vma_walk_hugetlb_entry(struct hugetlb_pte *hpte,
+				      unsigned long start,
 				      struct mm_walk *walk)
 {
 	unsigned long addr = start, i, pfn;
@@ -479,16 +479,24 @@  static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
 	unsigned int required_fault;
 	unsigned long pfn_req_flags;
 	unsigned long cpu_flags;
+	unsigned long hmask = hugetlb_pte_mask(hpte);
+	unsigned int order = hpte->shift - PAGE_SHIFT;
+	unsigned long end = start + hugetlb_pte_size(hpte);
 	spinlock_t *ptl;
 	pte_t entry;
 
-	ptl = huge_pte_lock(hstate_vma(vma), walk->mm, pte);
-	entry = huge_ptep_get(pte);
+	ptl = hugetlb_pte_lock(hpte);
+	entry = huge_ptep_get(hpte->ptep);
+
+	if (!hugetlb_pte_present_leaf(hpte, entry)) {
+		spin_unlock(ptl);
+		return -EAGAIN;
+	}
 
 	i = (start - range->start) >> PAGE_SHIFT;
 	pfn_req_flags = range->hmm_pfns[i];
 	cpu_flags = pte_to_hmm_pfn_flags(range, entry) |
-		    hmm_pfn_flags_order(huge_page_order(hstate_vma(vma)));
+		    hmm_pfn_flags_order(order);
 	required_fault =
 		hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, cpu_flags);
 	if (required_fault) {
@@ -605,7 +613,7 @@  int hmm_range_fault(struct hmm_range *range)
 		 * in pfns. All entries < last in the pfn array are set to their
 		 * output, and all >= are still at their input values.
 		 */
-	} while (ret == -EBUSY);
+	} while (ret == -EBUSY || ret == -EAGAIN);
 	return ret;
 }
 EXPORT_SYMBOL(hmm_range_fault);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index c77a9e37e27e..e7e56298d305 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -641,6 +641,7 @@  static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
 				unsigned long poisoned_pfn, struct to_kill *tk)
 {
 	unsigned long pfn = 0;
+	unsigned long base_pages_poisoned = (1UL << shift) / PAGE_SIZE;
 
 	if (pte_present(pte)) {
 		pfn = pte_pfn(pte);
@@ -651,7 +652,8 @@  static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
 			pfn = swp_offset_pfn(swp);
 	}
 
-	if (!pfn || pfn != poisoned_pfn)
+	if (!pfn || pfn < poisoned_pfn ||
+			pfn >= poisoned_pfn + base_pages_poisoned)
 		return 0;
 
 	set_to_kill(tk, addr, shift);
@@ -717,16 +719,15 @@  static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr,
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
-static int hwpoison_hugetlb_range(pte_t *ptep, unsigned long hmask,
-			    unsigned long addr, unsigned long end,
-			    struct mm_walk *walk)
+static int hwpoison_hugetlb_range(struct hugetlb_pte *hpte,
+				  unsigned long addr,
+				  struct mm_walk *walk)
 {
 	struct hwp_walk *hwp = walk->private;
-	pte_t pte = huge_ptep_get(ptep);
-	struct hstate *h = hstate_vma(walk->vma);
+	pte_t pte = huge_ptep_get(hpte->ptep);
 
-	return check_hwpoisoned_entry(pte, addr, huge_page_shift(h),
-				      hwp->pfn, &hwp->tk);
+	return check_hwpoisoned_entry(pte, addr & hugetlb_pte_mask(hpte),
+			hpte->shift, hwp->pfn, &hwp->tk);
 }
 #else
 #define hwpoison_hugetlb_range	NULL
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d3558248a0f0..e5859ed34e90 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -558,8 +558,8 @@  static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 	return addr != end ? -EIO : 0;
 }
 
-static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
-			       unsigned long addr, unsigned long end,
+static int queue_pages_hugetlb(struct hugetlb_pte *hpte,
+			       unsigned long addr,
 			       struct mm_walk *walk)
 {
 	int ret = 0;
@@ -570,8 +570,12 @@  static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask,
 	spinlock_t *ptl;
 	pte_t entry;
 
-	ptl = huge_pte_lock(hstate_vma(walk->vma), walk->mm, pte);
-	entry = huge_ptep_get(pte);
+	/* We don't migrate high-granularity HugeTLB mappings for now. */
+	if (hugetlb_hgm_enabled(walk->vma))
+		return -EINVAL;
+
+	ptl = hugetlb_pte_lock(hpte);
+	entry = huge_ptep_get(hpte->ptep);
 	if (!pte_present(entry))
 		goto unlock;
 	page = pte_page(entry);
diff --git a/mm/mincore.c b/mm/mincore.c
index a085a2aeabd8..0894965b3944 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -22,18 +22,29 @@ 
 #include <linux/uaccess.h>
 #include "swap.h"
 
-static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
-			unsigned long end, struct mm_walk *walk)
+static int mincore_hugetlb(struct hugetlb_pte *hpte, unsigned long addr,
+			   struct mm_walk *walk)
 {
 #ifdef CONFIG_HUGETLB_PAGE
 	unsigned char present;
+	unsigned long end = addr + hugetlb_pte_size(hpte);
 	unsigned char *vec = walk->private;
+	pte_t pte = huge_ptep_get(hpte->ptep);
 
 	/*
 	 * Hugepages under user process are always in RAM and never
 	 * swapped out, but theoretically it needs to be checked.
 	 */
-	present = pte && !huge_pte_none(huge_ptep_get(pte));
+	present = !huge_pte_none(pte);
+
+	/*
+	 * If the pte is present but not a leaf, we raced with someone
+	 * splitting it. For someone to have split it, it must have been
+	 * huge_pte_none before, so treat it as such.
+	 */
+	if (pte_present(pte) && !hugetlb_pte_present_leaf(hpte, pte))
+		present = false;
+
 	for (; addr != end; vec++, addr += PAGE_SIZE)
 		*vec = present;
 	walk->private = vec;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 71358e45a742..62d8c5f7bc92 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -543,12 +543,16 @@  static int prot_none_pte_entry(pte_t *pte, unsigned long addr,
 		0 : -EACCES;
 }
 
-static int prot_none_hugetlb_entry(pte_t *pte, unsigned long hmask,
-				   unsigned long addr, unsigned long next,
+static int prot_none_hugetlb_entry(struct hugetlb_pte *hpte,
+				   unsigned long addr,
 				   struct mm_walk *walk)
 {
-	return pfn_modify_allowed(pte_pfn(*pte), *(pgprot_t *)(walk->private)) ?
-		0 : -EACCES;
+	pte_t pte = huge_ptep_get(hpte->ptep);
+
+	if (!hugetlb_pte_present_leaf(hpte, pte))
+		return -EAGAIN;
+	return pfn_modify_allowed(pte_pfn(pte),
+			*(pgprot_t *)(walk->private)) ? 0 : -EACCES;
 }
 
 static int prot_none_test(unsigned long addr, unsigned long next,
@@ -591,8 +595,10 @@  mprotect_fixup(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	    (newflags & VM_ACCESS_FLAGS) == 0) {
 		pgprot_t new_pgprot = vm_get_page_prot(newflags);
 
-		error = walk_page_range(current->mm, start, end,
-				&prot_none_walk_ops, &new_pgprot);
+		do {
+			error = walk_page_range(current->mm, start, end,
+					&prot_none_walk_ops, &new_pgprot);
+		} while (error == -EAGAIN);
 		if (error)
 			return error;
 	}
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index cb23f8a15c13..05ce242f8b7e 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -3,6 +3,7 @@ 
 #include <linux/highmem.h>
 #include <linux/sched.h>
 #include <linux/hugetlb.h>
+#include <linux/minmax.h>
 
 /*
  * We want to know the real level where a entry is located ignoring any
@@ -296,20 +297,21 @@  static int walk_hugetlb_range(unsigned long addr, unsigned long end,
 	struct vm_area_struct *vma = walk->vma;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long next;
-	unsigned long hmask = huge_page_mask(h);
-	unsigned long sz = huge_page_size(h);
-	pte_t *pte;
 	const struct mm_walk_ops *ops = walk->ops;
 	int err = 0;
+	struct hugetlb_pte hpte;
 
 	hugetlb_vma_lock_read(vma);
 	do {
-		next = hugetlb_entry_end(h, addr, end);
-		pte = hugetlb_walk(vma, addr & hmask, sz);
-		if (pte)
-			err = ops->hugetlb_entry(pte, hmask, addr, next, walk);
-		else if (ops->pte_hole)
-			err = ops->pte_hole(addr, next, -1, walk);
+		if (hugetlb_full_walk(&hpte, vma, addr)) {
+			next = hugetlb_entry_end(h, addr, end);
+			if (ops->pte_hole)
+				err = ops->pte_hole(addr, next, -1, walk);
+		} else {
+			err = ops->hugetlb_entry(
+					&hpte, addr, walk);
+			next = min(addr + hugetlb_pte_size(&hpte), end);
+		}
 		if (err)
 			break;
 	} while (addr = next, addr != end);
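
For reference, a minimal sketch of what a walker callback looks like under
the new hugetlb_entry prototype (the callback name is made up; the structure
mirrors the conversions above: take the hugetlb_pte lock, re-read the entry,
and skip non-leaf entries that raced with the split of a none PTE):

/* Illustrative only; not part of this patch. */
static int example_hugetlb_entry(struct hugetlb_pte *hpte,
				 unsigned long addr,
				 struct mm_walk *walk)
{
	spinlock_t *ptl = hugetlb_pte_lock(hpte);
	pte_t entry = huge_ptep_get(hpte->ptep);
	int err = 0;

	/*
	 * Non-present, or a non-leaf entry that raced with the split of a
	 * none PTE: treat [addr, addr + hugetlb_pte_size(hpte)) as unmapped.
	 */
	if (!pte_present(entry) || !hugetlb_pte_present_leaf(hpte, entry))
		goto out;

	/* ... act on the leaf mapping of size hugetlb_pte_size(hpte) ... */
out:
	spin_unlock(ptl);
	return err;
}

Compared with the old prototype, hmask and end are no longer passed in; they
can be derived with hugetlb_pte_mask() and hugetlb_pte_size(), as the pagemap
and mincore conversions above do.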