[RFC,v2,10/47] hugetlb: add hugetlb_pte to track HugeTLB page table entries

Message ID 20221021163703.3218176-11-jthoughton@google.com
State New
Headers
Series hugetlb: introduce HugeTLB high-granularity mapping |

Commit Message

James Houghton Oct. 21, 2022, 4:36 p.m. UTC
  After high-granularity mapping, page table entries for HugeTLB pages can
be of any size/type. (For example, we can have a 1G page mapped with a
mix of PMDs and PTEs.) This struct is to help keep track of a HugeTLB
PTE after we have done a page table walk.

Without this, we'd have to pass around the "size" of the PTE everywhere.
We effectively did this before; it could be fetched from the hstate,
which we pass around pretty much everywhere.

hugetlb_pte_present_leaf is included here as a helper function that will
be used frequently later on.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 include/linux/hugetlb.h | 88 +++++++++++++++++++++++++++++++++++++++++
 mm/hugetlb.c            | 29 ++++++++++++++
 2 files changed, 117 insertions(+)
  

Comments

Peter Xu Nov. 16, 2022, 10:17 p.m. UTC | #1
On Fri, Oct 21, 2022 at 04:36:26PM +0000, James Houghton wrote:
> +struct hugetlb_pte {
> +	pte_t *ptep;
> +	unsigned int shift;
> +	enum hugetlb_level level;
> +	spinlock_t *ptl;
> +};

Do we need both shift + level?  Maybe it's only meaningful for ARM where
the shift may not be directly calculcated from level?

I'm wondering whether we can just maintain "shift" then we calculate
"level" realtime.  It just reads a bit weird to have these two fields, also
a burden to most of the call sites where shift and level exactly match..

> +
> +static inline
> +void hugetlb_pte_populate(struct hugetlb_pte *hpte, pte_t *ptep,
> +			  unsigned int shift, enum hugetlb_level level)

I'd think it's nicer to replace "populate" with something else, as populate
is definitely a meaningful word in vm world for "making something appear if
it wasn't".  Maybe hugetlb_pte_setup()?

Even one step back, on the naming of hugetlb_pte..  Sorry to comment on
namings especially on this one, I really don't like to do that normally..
but here hugetlb_pte only walks the sub-page level of pgtables, meanwhile
it's not really a pte but an iterator.  How about hugetlb_hgm_iter?  "hgm"
tells that it only walks sub-level, and "iter" tells that it is an
iterator, being updated for each stepping downwards.

Then hugetlb_pte_populate() can be hugetlb_hgm_iter_init().

Take these comments with a grain of salt, and it never hurts to wait for a
2nd opinion before anything.

> +{
> +	WARN_ON_ONCE(!ptep);
> +	hpte->ptep = ptep;
> +	hpte->shift = shift;
> +	hpte->level = level;
> +	hpte->ptl = NULL;
> +}
> +
> +static inline
> +unsigned long hugetlb_pte_size(const struct hugetlb_pte *hpte)
> +{
> +	WARN_ON_ONCE(!hpte->ptep);
> +	return 1UL << hpte->shift;
> +}
> +
> +static inline
> +unsigned long hugetlb_pte_mask(const struct hugetlb_pte *hpte)
> +{
> +	WARN_ON_ONCE(!hpte->ptep);
> +	return ~(hugetlb_pte_size(hpte) - 1);
> +}
> +
> +static inline
> +unsigned int hugetlb_pte_shift(const struct hugetlb_pte *hpte)
> +{
> +	WARN_ON_ONCE(!hpte->ptep);
> +	return hpte->shift;
> +}
> +
> +static inline
> +enum hugetlb_level hugetlb_pte_level(const struct hugetlb_pte *hpte)
> +{
> +	WARN_ON_ONCE(!hpte->ptep);

There're definitely a bunch of hpte->ptep WARN_ON_ONCE()s..  AFAIK the
hugetlb_pte* will be setup once with valid ptep and then it should always
be.  I rem someone commented on these helpers look not useful, which I must
confess I had the same feeling.  But besides that, I'd rather drop all
these WARN_ON_ONCE()s but only check it when init() the iterator/pte.

> +	return hpte->level;
> +}
> +
> +static inline
> +void hugetlb_pte_copy(struct hugetlb_pte *dest, const struct hugetlb_pte *src)
> +{
> +	dest->ptep = src->ptep;
> +	dest->shift = src->shift;
> +	dest->level = src->level;
> +	dest->ptl = src->ptl;
> +}
> +
> +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte);
> +
>  struct hugepage_subpool {
>  	spinlock_t lock;
>  	long count;
> @@ -1210,6 +1279,25 @@ static inline spinlock_t *huge_pte_lock(struct hstate *h,
>  	return ptl;
>  }
>  
> +static inline
> +spinlock_t *hugetlb_pte_lockptr(struct mm_struct *mm, struct hugetlb_pte *hpte)
> +{
> +
> +	BUG_ON(!hpte->ptep);

Another BUG_ON(); better be dropped too.

> +	if (hpte->ptl)
> +		return hpte->ptl;
> +	return huge_pte_lockptr(hugetlb_pte_shift(hpte), mm, hpte->ptep);

I'm curious whether we can always have hpte->ptl set for a valid
hugetlb_pte.  I think that means we'll need to also init the ptl in the
init() fn of the iterator.  Then it'll be clear on which lock to take for
each valid hugetlb_pte.

> +}
  
James Houghton Nov. 17, 2022, 1 a.m. UTC | #2
On Wed, Nov 16, 2022 at 2:18 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Oct 21, 2022 at 04:36:26PM +0000, James Houghton wrote:
> > +struct hugetlb_pte {
> > +     pte_t *ptep;
> > +     unsigned int shift;
> > +     enum hugetlb_level level;
> > +     spinlock_t *ptl;
> > +};
>
> Do we need both shift + level?  Maybe it's only meaningful for ARM where
> the shift may not be directly calculcated from level?
>
> I'm wondering whether we can just maintain "shift" then we calculate
> "level" realtime.  It just reads a bit weird to have these two fields, also
> a burden to most of the call sites where shift and level exactly match..

My main concern is interaction with folded levels. For example, if
PUD_SIZE and PMD_SIZE are the same, we want to do something like this:

pud = pud_offset(p4d, addr)
pmd = pmd_offset(pud, addr) /* this is just pmd = (pmd_t *) pud */
pte = pte_offset(pmd, addr)

and I think we should avoid quietly skipping the folded level, which
could happen:

pud = pud_offset(p4d, addr)
/* Each time, we go back to pte_t *, so if we stored PUD_SHIFT here,
it is impossible to know that `pud` came from `pud_offset` and not
`pmd_offset`. We must assume the deeper level so that we don't get
stuck in a loop. */
pte = pte_offset(pud, addr) /* pud is cast from (pud_t * -> pte_t * ->
pmd_t *) */

Quietly dropping p*d_offset for folded levels is safe; it's just a
cast that we're doing anyway. If you think this is fine, then I can
remove `level`. It might also be that this is a non-issue and that
there will never be a folded level underneath a hugepage level.

We could also change `ptep` to a union eventually (to clean up
"hugetlb casts everything to pte_t *" messiness), and having an
explicit `level` as a tag for the union would be nice help. In the
same way: I like having `level` explicitly so that we know for sure
where `ptep` came from.

I can try to reduce the burden at the callsite while keeping `level`:
hpage_size_to_level() is really annoying to have everywhere.

>
> > +
> > +static inline
> > +void hugetlb_pte_populate(struct hugetlb_pte *hpte, pte_t *ptep,
> > +                       unsigned int shift, enum hugetlb_level level)
>
> I'd think it's nicer to replace "populate" with something else, as populate
> is definitely a meaningful word in vm world for "making something appear if
> it wasn't".  Maybe hugetlb_pte_setup()?
>
> Even one step back, on the naming of hugetlb_pte..  Sorry to comment on
> namings especially on this one, I really don't like to do that normally..
> but here hugetlb_pte only walks the sub-page level of pgtables, meanwhile
> it's not really a pte but an iterator.  How about hugetlb_hgm_iter?  "hgm"
> tells that it only walks sub-level, and "iter" tells that it is an
> iterator, being updated for each stepping downwards.
>
> Then hugetlb_pte_populate() can be hugetlb_hgm_iter_init().
>
> Take these comments with a grain of salt, and it never hurts to wait for a
> 2nd opinion before anything.

I think this is a great idea. :) Thank you! I'll make this change for
v1 unless someone has a better suggestion.

>
> > +{
> > +     WARN_ON_ONCE(!ptep);
> > +     hpte->ptep = ptep;
> > +     hpte->shift = shift;
> > +     hpte->level = level;
> > +     hpte->ptl = NULL;
> > +}
> > +
> > +static inline
> > +unsigned long hugetlb_pte_size(const struct hugetlb_pte *hpte)
> > +{
> > +     WARN_ON_ONCE(!hpte->ptep);
> > +     return 1UL << hpte->shift;
> > +}
> > +
> > +static inline
> > +unsigned long hugetlb_pte_mask(const struct hugetlb_pte *hpte)
> > +{
> > +     WARN_ON_ONCE(!hpte->ptep);
> > +     return ~(hugetlb_pte_size(hpte) - 1);
> > +}
> > +
> > +static inline
> > +unsigned int hugetlb_pte_shift(const struct hugetlb_pte *hpte)
> > +{
> > +     WARN_ON_ONCE(!hpte->ptep);
> > +     return hpte->shift;
> > +}
> > +
> > +static inline
> > +enum hugetlb_level hugetlb_pte_level(const struct hugetlb_pte *hpte)
> > +{
> > +     WARN_ON_ONCE(!hpte->ptep);
>
> There're definitely a bunch of hpte->ptep WARN_ON_ONCE()s..  AFAIK the
> hugetlb_pte* will be setup once with valid ptep and then it should always
> be.  I rem someone commented on these helpers look not useful, which I must
> confess I had the same feeling.  But besides that, I'd rather drop all
> these WARN_ON_ONCE()s but only check it when init() the iterator/pte.

The idea with these WARN_ON_ONCE()s is that it WARNs for the case that
`hpte` was never populated/initialized, but I realize that we can't
even rely on hpte->ptep == NULL. I'll remove the WARN_ON_ONCE()s, and
I'll drop hugetlb_pte_shift and hugetlb_pte_level entirely.

I'll keep the hugetlb_pte_{size,mask,copy,present_leaf} helpers as
they are legitimately helpful.

>
> > +     return hpte->level;
> > +}
> > +
> > +static inline
> > +void hugetlb_pte_copy(struct hugetlb_pte *dest, const struct hugetlb_pte *src)
> > +{
> > +     dest->ptep = src->ptep;
> > +     dest->shift = src->shift;
> > +     dest->level = src->level;
> > +     dest->ptl = src->ptl;
> > +}
> > +
> > +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte);
> > +
> >  struct hugepage_subpool {
> >       spinlock_t lock;
> >       long count;
> > @@ -1210,6 +1279,25 @@ static inline spinlock_t *huge_pte_lock(struct hstate *h,
> >       return ptl;
> >  }
> >
> > +static inline
> > +spinlock_t *hugetlb_pte_lockptr(struct mm_struct *mm, struct hugetlb_pte *hpte)
> > +{
> > +
> > +     BUG_ON(!hpte->ptep);
>
> Another BUG_ON(); better be dropped too.

Can do.

>
> > +     if (hpte->ptl)
> > +             return hpte->ptl;
> > +     return huge_pte_lockptr(hugetlb_pte_shift(hpte), mm, hpte->ptep);
>
> I'm curious whether we can always have hpte->ptl set for a valid
> hugetlb_pte.  I think that means we'll need to also init the ptl in the
> init() fn of the iterator.  Then it'll be clear on which lock to take for
> each valid hugetlb_pte.

I can work on this for v1. Right now it's not very good: for 4K PTEs,
we manually set ->ptl while walking. I'll make it so that ->ptl is
always populated so the code is easier to read.

- James

>
> > +}

>
> --
> Peter Xu
>
  
Peter Xu Nov. 17, 2022, 4:27 p.m. UTC | #3
On Wed, Nov 16, 2022 at 05:00:08PM -0800, James Houghton wrote:
> On Wed, Nov 16, 2022 at 2:18 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Fri, Oct 21, 2022 at 04:36:26PM +0000, James Houghton wrote:
> > > +struct hugetlb_pte {
> > > +     pte_t *ptep;
> > > +     unsigned int shift;
> > > +     enum hugetlb_level level;
> > > +     spinlock_t *ptl;
> > > +};
> >
> > Do we need both shift + level?  Maybe it's only meaningful for ARM where
> > the shift may not be directly calculcated from level?
> >
> > I'm wondering whether we can just maintain "shift" then we calculate
> > "level" realtime.  It just reads a bit weird to have these two fields, also
> > a burden to most of the call sites where shift and level exactly match..
> 
> My main concern is interaction with folded levels. For example, if
> PUD_SIZE and PMD_SIZE are the same, we want to do something like this:
> 
> pud = pud_offset(p4d, addr)
> pmd = pmd_offset(pud, addr) /* this is just pmd = (pmd_t *) pud */
> pte = pte_offset(pmd, addr)
> 
> and I think we should avoid quietly skipping the folded level, which
> could happen:
> 
> pud = pud_offset(p4d, addr)
> /* Each time, we go back to pte_t *, so if we stored PUD_SHIFT here,
> it is impossible to know that `pud` came from `pud_offset` and not
> `pmd_offset`. We must assume the deeper level so that we don't get
> stuck in a loop. */
> pte = pte_offset(pud, addr) /* pud is cast from (pud_t * -> pte_t * ->
> pmd_t *) */
> 
> Quietly dropping p*d_offset for folded levels is safe; it's just a
> cast that we're doing anyway. If you think this is fine, then I can
> remove `level`. It might also be that this is a non-issue and that
> there will never be a folded level underneath a hugepage level.
> 
> We could also change `ptep` to a union eventually (to clean up
> "hugetlb casts everything to pte_t *" messiness), and having an
> explicit `level` as a tag for the union would be nice help. In the
> same way: I like having `level` explicitly so that we know for sure
> where `ptep` came from.

Makes sense.

> 
> I can try to reduce the burden at the callsite while keeping `level`:
> hpage_size_to_level() is really annoying to have everywhere.

Yeah this would be nice.

Per what I see most callers are having paired level/shift, so maybe we can
make hugetlb_hgm_iter_init() to only take one of them and calculate the
other. Then there can also be the __hugetlb_hgm_iter_init() which takes
both, only used in the few places where necessary to have explicit
level/shift.  hugetlb_hgm_iter_init() calls __hugetlb_hgm_iter_init().

Thanks,
  
Mina Almasry Dec. 8, 2022, 12:46 a.m. UTC | #4
On Fri, Oct 21, 2022 at 9:37 AM James Houghton <jthoughton@google.com> wrote:
>
> After high-granularity mapping, page table entries for HugeTLB pages can
> be of any size/type. (For example, we can have a 1G page mapped with a
> mix of PMDs and PTEs.) This struct is to help keep track of a HugeTLB
> PTE after we have done a page table walk.
>
> Without this, we'd have to pass around the "size" of the PTE everywhere.
> We effectively did this before; it could be fetched from the hstate,
> which we pass around pretty much everywhere.
>
> hugetlb_pte_present_leaf is included here as a helper function that will
> be used frequently later on.
>
> Signed-off-by: James Houghton <jthoughton@google.com>
> ---
>  include/linux/hugetlb.h | 88 +++++++++++++++++++++++++++++++++++++++++
>  mm/hugetlb.c            | 29 ++++++++++++++
>  2 files changed, 117 insertions(+)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index db3ed6095b1c..d30322108b34 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -50,6 +50,75 @@ enum {
>         __NR_USED_SUBPAGE,
>  };
>
> +enum hugetlb_level {
> +       HUGETLB_LEVEL_PTE = 1,
> +       /*
> +        * We always include PMD, PUD, and P4D in this enum definition so that,
> +        * when logged as an integer, we can easily tell which level it is.
> +        */
> +       HUGETLB_LEVEL_PMD,
> +       HUGETLB_LEVEL_PUD,
> +       HUGETLB_LEVEL_P4D,
> +       HUGETLB_LEVEL_PGD,
> +};
> +

Don't we need to support CONTIG_PTE/PMD levels here for ARM64?

> +struct hugetlb_pte {
> +       pte_t *ptep;
> +       unsigned int shift;
> +       enum hugetlb_level level;

Is shift + level redundant? When would those diverge?

> +       spinlock_t *ptl;
> +};
> +
> +static inline
> +void hugetlb_pte_populate(struct hugetlb_pte *hpte, pte_t *ptep,
> +                         unsigned int shift, enum hugetlb_level level)
> +{
> +       WARN_ON_ONCE(!ptep);
> +       hpte->ptep = ptep;
> +       hpte->shift = shift;
> +       hpte->level = level;
> +       hpte->ptl = NULL;
> +}
> +
> +static inline
> +unsigned long hugetlb_pte_size(const struct hugetlb_pte *hpte)
> +{
> +       WARN_ON_ONCE(!hpte->ptep);
> +       return 1UL << hpte->shift;
> +}
> +
> +static inline
> +unsigned long hugetlb_pte_mask(const struct hugetlb_pte *hpte)
> +{
> +       WARN_ON_ONCE(!hpte->ptep);
> +       return ~(hugetlb_pte_size(hpte) - 1);
> +}
> +
> +static inline
> +unsigned int hugetlb_pte_shift(const struct hugetlb_pte *hpte)
> +{
> +       WARN_ON_ONCE(!hpte->ptep);
> +       return hpte->shift;
> +}
> +
> +static inline
> +enum hugetlb_level hugetlb_pte_level(const struct hugetlb_pte *hpte)
> +{
> +       WARN_ON_ONCE(!hpte->ptep);
> +       return hpte->level;
> +}
> +
> +static inline
> +void hugetlb_pte_copy(struct hugetlb_pte *dest, const struct hugetlb_pte *src)
> +{
> +       dest->ptep = src->ptep;
> +       dest->shift = src->shift;
> +       dest->level = src->level;
> +       dest->ptl = src->ptl;
> +}
> +
> +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte);
> +
>  struct hugepage_subpool {
>         spinlock_t lock;
>         long count;
> @@ -1210,6 +1279,25 @@ static inline spinlock_t *huge_pte_lock(struct hstate *h,
>         return ptl;
>  }
>
> +static inline
> +spinlock_t *hugetlb_pte_lockptr(struct mm_struct *mm, struct hugetlb_pte *hpte)
> +{
> +
> +       BUG_ON(!hpte->ptep);

I think BUG_ON()s will be frowned upon. This function also doesn't
really need ptep. Maybe let hugetlb_pte_shift() decide to BUG_ON() if
necessary.


> +       if (hpte->ptl)
> +               return hpte->ptl;
> +       return huge_pte_lockptr(hugetlb_pte_shift(hpte), mm, hpte->ptep);

I don't know if this fallback to huge_pte_lockptr() should be obivous
to the reader. If not, a comment would help.

> +}
> +
> +static inline
> +spinlock_t *hugetlb_pte_lock(struct mm_struct *mm, struct hugetlb_pte *hpte)
> +{
> +       spinlock_t *ptl = hugetlb_pte_lockptr(mm, hpte);
> +
> +       spin_lock(ptl);
> +       return ptl;
> +}
> +
>  #if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_CMA)
>  extern void __init hugetlb_cma_reserve(int order);
>  #else
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ef7662bd0068..a0e46d35dabc 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1127,6 +1127,35 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
>         return false;
>  }
>
> +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte)

I also don't know if this is obvious to other readers, but I'm quite
confused that we pass both hugetlb_pte and pte_t here, especially when
hpte has a pte_t inside of it. Maybe a comment would help.

> +{
> +       pgd_t pgd;
> +       p4d_t p4d;
> +       pud_t pud;
> +       pmd_t pmd;
> +
> +       WARN_ON_ONCE(!hpte->ptep);
> +       switch (hugetlb_pte_level(hpte)) {
> +       case HUGETLB_LEVEL_PGD:
> +               pgd = __pgd(pte_val(pte));
> +               return pgd_present(pgd) && pgd_leaf(pgd);
> +       case HUGETLB_LEVEL_P4D:
> +               p4d = __p4d(pte_val(pte));
> +               return p4d_present(p4d) && p4d_leaf(p4d);
> +       case HUGETLB_LEVEL_PUD:
> +               pud = __pud(pte_val(pte));
> +               return pud_present(pud) && pud_leaf(pud);
> +       case HUGETLB_LEVEL_PMD:
> +               pmd = __pmd(pte_val(pte));
> +               return pmd_present(pmd) && pmd_leaf(pmd);
> +       case HUGETLB_LEVEL_PTE:
> +               return pte_present(pte);
> +       default:
> +               WARN_ON_ONCE(1);
> +               return false;
> +       }
> +}
> +
>  static void enqueue_huge_page(struct hstate *h, struct page *page)
>  {
>         int nid = page_to_nid(page);
> --
> 2.38.0.135.g90850a2211-goog
>
  
James Houghton Dec. 9, 2022, 4:02 p.m. UTC | #5
On Wed, Dec 7, 2022 at 7:46 PM Mina Almasry <almasrymina@google.com> wrote:
>
> On Fri, Oct 21, 2022 at 9:37 AM James Houghton <jthoughton@google.com> wrote:
> >
> > After high-granularity mapping, page table entries for HugeTLB pages can
> > be of any size/type. (For example, we can have a 1G page mapped with a
> > mix of PMDs and PTEs.) This struct is to help keep track of a HugeTLB
> > PTE after we have done a page table walk.
> >
> > Without this, we'd have to pass around the "size" of the PTE everywhere.
> > We effectively did this before; it could be fetched from the hstate,
> > which we pass around pretty much everywhere.
> >
> > hugetlb_pte_present_leaf is included here as a helper function that will
> > be used frequently later on.
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> > ---
> >  include/linux/hugetlb.h | 88 +++++++++++++++++++++++++++++++++++++++++
> >  mm/hugetlb.c            | 29 ++++++++++++++
> >  2 files changed, 117 insertions(+)
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index db3ed6095b1c..d30322108b34 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -50,6 +50,75 @@ enum {
> >         __NR_USED_SUBPAGE,
> >  };
> >
> > +enum hugetlb_level {
> > +       HUGETLB_LEVEL_PTE = 1,
> > +       /*
> > +        * We always include PMD, PUD, and P4D in this enum definition so that,
> > +        * when logged as an integer, we can easily tell which level it is.
> > +        */
> > +       HUGETLB_LEVEL_PMD,
> > +       HUGETLB_LEVEL_PUD,
> > +       HUGETLB_LEVEL_P4D,
> > +       HUGETLB_LEVEL_PGD,
> > +};
> > +
>
> Don't we need to support CONTIG_PTE/PMD levels here for ARM64?

Yeah, which is why shift and level aren't quite the same thing.
Contiguous PMDs would be HUGETLB_LEVEL_PMD but have shift =
CONT_PMD_SHIFT, whereas regular PMDs would have shift = PMD_SHIFT.

>
> > +struct hugetlb_pte {
> > +       pte_t *ptep;
> > +       unsigned int shift;
> > +       enum hugetlb_level level;
>
> Is shift + level redundant? When would those diverge?

Peter asked a very similar question. `shift` can be used to determine
`level` if no levels are being folded. In the case of folded levels,
you might have a single shift that corresponds to multiple "levels".
That isn't necessarily a problem, as folding a level just means
casting your p?d_t* differently, but I think it's good to be able to
*know* if the hugetlb_pte was populated with a pud_t* that we treat it
like a pud_t* always.

If `ptep` was instead a union, then `level` would be the tag. Perhaps
it should be written that way.

>
> > +       spinlock_t *ptl;
> > +};
> > +
> > +static inline
> > +void hugetlb_pte_populate(struct hugetlb_pte *hpte, pte_t *ptep,
> > +                         unsigned int shift, enum hugetlb_level level)
> > +{
> > +       WARN_ON_ONCE(!ptep);
> > +       hpte->ptep = ptep;
> > +       hpte->shift = shift;
> > +       hpte->level = level;
> > +       hpte->ptl = NULL;
> > +}
> > +
> > +static inline
> > +unsigned long hugetlb_pte_size(const struct hugetlb_pte *hpte)
> > +{
> > +       WARN_ON_ONCE(!hpte->ptep);
> > +       return 1UL << hpte->shift;
> > +}
> > +
> > +static inline
> > +unsigned long hugetlb_pte_mask(const struct hugetlb_pte *hpte)
> > +{
> > +       WARN_ON_ONCE(!hpte->ptep);
> > +       return ~(hugetlb_pte_size(hpte) - 1);
> > +}
> > +
> > +static inline
> > +unsigned int hugetlb_pte_shift(const struct hugetlb_pte *hpte)
> > +{
> > +       WARN_ON_ONCE(!hpte->ptep);
> > +       return hpte->shift;
> > +}
> > +
> > +static inline
> > +enum hugetlb_level hugetlb_pte_level(const struct hugetlb_pte *hpte)
> > +{
> > +       WARN_ON_ONCE(!hpte->ptep);
> > +       return hpte->level;
> > +}
> > +
> > +static inline
> > +void hugetlb_pte_copy(struct hugetlb_pte *dest, const struct hugetlb_pte *src)
> > +{
> > +       dest->ptep = src->ptep;
> > +       dest->shift = src->shift;
> > +       dest->level = src->level;
> > +       dest->ptl = src->ptl;
> > +}
> > +
> > +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte);
> > +
> >  struct hugepage_subpool {
> >         spinlock_t lock;
> >         long count;
> > @@ -1210,6 +1279,25 @@ static inline spinlock_t *huge_pte_lock(struct hstate *h,
> >         return ptl;
> >  }
> >
> > +static inline
> > +spinlock_t *hugetlb_pte_lockptr(struct mm_struct *mm, struct hugetlb_pte *hpte)
> > +{
> > +
> > +       BUG_ON(!hpte->ptep);
>
> I think BUG_ON()s will be frowned upon. This function also doesn't
> really need ptep. Maybe let hugetlb_pte_shift() decide to BUG_ON() if
> necessary.

Right. I'll remove this (and others that aren't really necessary).
Peter's suggestion to just let the kernel take a #pf and crash
(thereby logging more info) SGTM.

>
>
> > +       if (hpte->ptl)
> > +               return hpte->ptl;
> > +       return huge_pte_lockptr(hugetlb_pte_shift(hpte), mm, hpte->ptep);
>
> I don't know if this fallback to huge_pte_lockptr() should be obivous
> to the reader. If not, a comment would help.

I'll clean this up a little for the next version. If something like
this branch stays, I'll add a comment.

>
> > +}
> > +
> > +static inline
> > +spinlock_t *hugetlb_pte_lock(struct mm_struct *mm, struct hugetlb_pte *hpte)
> > +{
> > +       spinlock_t *ptl = hugetlb_pte_lockptr(mm, hpte);
> > +
> > +       spin_lock(ptl);
> > +       return ptl;
> > +}
> > +
> >  #if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_CMA)
> >  extern void __init hugetlb_cma_reserve(int order);
> >  #else
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index ef7662bd0068..a0e46d35dabc 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1127,6 +1127,35 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
> >         return false;
> >  }
> >
> > +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte)
>
> I also don't know if this is obvious to other readers, but I'm quite
> confused that we pass both hugetlb_pte and pte_t here, especially when
> hpte has a pte_t inside of it. Maybe a comment would help.

It's possible for the value of the pte to change if we haven't locked
the PTL; we only store a pte_t* in hugetlb_pte, not the value itself.

Thinking about this... we *do* store `shift` which technically depends
on the value of the PTE. If the PTE is pte_none, the true `shift` of
the PTE is ambiguous, and so we just provide what the user asked for.
That could lead to a scenario where UFFDIO_CONTINUE(some 4K page) then
UFFDIO_CONTINUE(CONT_PTE_SIZE range around that page) can both succeed
because we merely check if the *first* PTE in the contiguous bunch is
none/has changed.

So, in the case of a contiguous PTE where we *think* we're overwriting
a bunch of none PTEs, we need to check that each PTE we're overwriting
is still none while holding the PTL. That means that the PTL we use
for cont PTEs and non-cont PTEs of the same level must be the same.

So for the next version, I'll:
- add some requirement that contiguous and non-contiguous PTEs on the
same level must use the same PTL
- think up some kind of API like all_contig_ptes_none(), but it only
really applies for arm64, so I think actually putting it in can wait.
I'll at least put a comment in hugetlb_mcopy_atomic_pte and
hugetlb_no_page (near the final huge_pte_none() and pte_same()
checks).


>
> > +{
> > +       pgd_t pgd;
> > +       p4d_t p4d;
> > +       pud_t pud;
> > +       pmd_t pmd;
> > +
> > +       WARN_ON_ONCE(!hpte->ptep);
> > +       switch (hugetlb_pte_level(hpte)) {
> > +       case HUGETLB_LEVEL_PGD:
> > +               pgd = __pgd(pte_val(pte));
> > +               return pgd_present(pgd) && pgd_leaf(pgd);
> > +       case HUGETLB_LEVEL_P4D:
> > +               p4d = __p4d(pte_val(pte));
> > +               return p4d_present(p4d) && p4d_leaf(p4d);
> > +       case HUGETLB_LEVEL_PUD:
> > +               pud = __pud(pte_val(pte));
> > +               return pud_present(pud) && pud_leaf(pud);
> > +       case HUGETLB_LEVEL_PMD:
> > +               pmd = __pmd(pte_val(pte));
> > +               return pmd_present(pmd) && pmd_leaf(pmd);
> > +       case HUGETLB_LEVEL_PTE:
> > +               return pte_present(pte);
> > +       default:
> > +               WARN_ON_ONCE(1);
> > +               return false;
> > +       }
> > +}
> > +
> >  static void enqueue_huge_page(struct hstate *h, struct page *page)
> >  {
> >         int nid = page_to_nid(page);
> > --
> > 2.38.0.135.g90850a2211-goog
> >
  
Mike Kravetz Dec. 13, 2022, 6:44 p.m. UTC | #6
On 12/09/22 11:02, James Houghton wrote:
> On Wed, Dec 7, 2022 at 7:46 PM Mina Almasry <almasrymina@google.com> wrote:
> > On Fri, Oct 21, 2022 at 9:37 AM James Houghton <jthoughton@google.com> wrote:
> > >
> > > +bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte)
> >
> > I also don't know if this is obvious to other readers, but I'm quite
> > confused that we pass both hugetlb_pte and pte_t here, especially when
> > hpte has a pte_t inside of it. Maybe a comment would help.
> 
> It's possible for the value of the pte to change if we haven't locked
> the PTL; we only store a pte_t* in hugetlb_pte, not the value itself.

I had comments similar to Mina and Peter on other parts of this patch.  Calling
this without some type of locking is 'interesting'.  I have not yet looked at
callers (without locking), but I assume such callers can handle stale results.

> Thinking about this... we *do* store `shift` which technically depends
> on the value of the PTE. If the PTE is pte_none, the true `shift` of
> the PTE is ambiguous, and so we just provide what the user asked for.
> That could lead to a scenario where UFFDIO_CONTINUE(some 4K page) then
> UFFDIO_CONTINUE(CONT_PTE_SIZE range around that page) can both succeed
> because we merely check if the *first* PTE in the contiguous bunch is
> none/has changed.

Right, Yuck!

> 
> So, in the case of a contiguous PTE where we *think* we're overwriting
> a bunch of none PTEs, we need to check that each PTE we're overwriting
> is still none while holding the PTL. That means that the PTL we use
> for cont PTEs and non-cont PTEs of the same level must be the same.
> 
> So for the next version, I'll:
> - add some requirement that contiguous and non-contiguous PTEs on the
> same level must use the same PTL
> - think up some kind of API like all_contig_ptes_none(), but it only
> really applies for arm64, so I think actually putting it in can wait.
> I'll at least put a comment in hugetlb_mcopy_atomic_pte and
> hugetlb_no_page (near the final huge_pte_none() and pte_same()
> checks).
>
  

Patch

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index db3ed6095b1c..d30322108b34 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -50,6 +50,75 @@  enum {
 	__NR_USED_SUBPAGE,
 };
 
+enum hugetlb_level {
+	HUGETLB_LEVEL_PTE = 1,
+	/*
+	 * We always include PMD, PUD, and P4D in this enum definition so that,
+	 * when logged as an integer, we can easily tell which level it is.
+	 */
+	HUGETLB_LEVEL_PMD,
+	HUGETLB_LEVEL_PUD,
+	HUGETLB_LEVEL_P4D,
+	HUGETLB_LEVEL_PGD,
+};
+
+struct hugetlb_pte {
+	pte_t *ptep;
+	unsigned int shift;
+	enum hugetlb_level level;
+	spinlock_t *ptl;
+};
+
+static inline
+void hugetlb_pte_populate(struct hugetlb_pte *hpte, pte_t *ptep,
+			  unsigned int shift, enum hugetlb_level level)
+{
+	WARN_ON_ONCE(!ptep);
+	hpte->ptep = ptep;
+	hpte->shift = shift;
+	hpte->level = level;
+	hpte->ptl = NULL;
+}
+
+static inline
+unsigned long hugetlb_pte_size(const struct hugetlb_pte *hpte)
+{
+	WARN_ON_ONCE(!hpte->ptep);
+	return 1UL << hpte->shift;
+}
+
+static inline
+unsigned long hugetlb_pte_mask(const struct hugetlb_pte *hpte)
+{
+	WARN_ON_ONCE(!hpte->ptep);
+	return ~(hugetlb_pte_size(hpte) - 1);
+}
+
+static inline
+unsigned int hugetlb_pte_shift(const struct hugetlb_pte *hpte)
+{
+	WARN_ON_ONCE(!hpte->ptep);
+	return hpte->shift;
+}
+
+static inline
+enum hugetlb_level hugetlb_pte_level(const struct hugetlb_pte *hpte)
+{
+	WARN_ON_ONCE(!hpte->ptep);
+	return hpte->level;
+}
+
+static inline
+void hugetlb_pte_copy(struct hugetlb_pte *dest, const struct hugetlb_pte *src)
+{
+	dest->ptep = src->ptep;
+	dest->shift = src->shift;
+	dest->level = src->level;
+	dest->ptl = src->ptl;
+}
+
+bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte);
+
 struct hugepage_subpool {
 	spinlock_t lock;
 	long count;
@@ -1210,6 +1279,25 @@  static inline spinlock_t *huge_pte_lock(struct hstate *h,
 	return ptl;
 }
 
+static inline
+spinlock_t *hugetlb_pte_lockptr(struct mm_struct *mm, struct hugetlb_pte *hpte)
+{
+
+	BUG_ON(!hpte->ptep);
+	if (hpte->ptl)
+		return hpte->ptl;
+	return huge_pte_lockptr(hugetlb_pte_shift(hpte), mm, hpte->ptep);
+}
+
+static inline
+spinlock_t *hugetlb_pte_lock(struct mm_struct *mm, struct hugetlb_pte *hpte)
+{
+	spinlock_t *ptl = hugetlb_pte_lockptr(mm, hpte);
+
+	spin_lock(ptl);
+	return ptl;
+}
+
 #if defined(CONFIG_HUGETLB_PAGE) && defined(CONFIG_CMA)
 extern void __init hugetlb_cma_reserve(int order);
 #else
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ef7662bd0068..a0e46d35dabc 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1127,6 +1127,35 @@  static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
 	return false;
 }
 
+bool hugetlb_pte_present_leaf(const struct hugetlb_pte *hpte, pte_t pte)
+{
+	pgd_t pgd;
+	p4d_t p4d;
+	pud_t pud;
+	pmd_t pmd;
+
+	WARN_ON_ONCE(!hpte->ptep);
+	switch (hugetlb_pte_level(hpte)) {
+	case HUGETLB_LEVEL_PGD:
+		pgd = __pgd(pte_val(pte));
+		return pgd_present(pgd) && pgd_leaf(pgd);
+	case HUGETLB_LEVEL_P4D:
+		p4d = __p4d(pte_val(pte));
+		return p4d_present(p4d) && p4d_leaf(p4d);
+	case HUGETLB_LEVEL_PUD:
+		pud = __pud(pte_val(pte));
+		return pud_present(pud) && pud_leaf(pud);
+	case HUGETLB_LEVEL_PMD:
+		pmd = __pmd(pte_val(pte));
+		return pmd_present(pmd) && pmd_leaf(pmd);
+	case HUGETLB_LEVEL_PTE:
+		return pte_present(pte);
+	default:
+		WARN_ON_ONCE(1);
+		return false;
+	}
+}
+
 static void enqueue_huge_page(struct hstate *h, struct page *page)
 {
 	int nid = page_to_nid(page);