[v2,13/46] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step
Commit Message
hugetlb_hgm_walk implements high-granularity page table walks for
HugeTLB. It is safe to call on non-HGM enabled VMAs; it will return
immediately.
hugetlb_walk_step implements how we step forward in the walk.
Architectures that don't use GENERAL_HUGETLB will need to provide their
own implementation.
The broader API that should be used is
hugetlb_full_walk[,alloc|,continue].
Signed-off-by: James Houghton <jthoughton@google.com>
Comments
Hi James,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on next-20230217]
[cannot apply to kvm/queue shuah-kselftest/next shuah-kselftest/fixes arnd-asm-generic/master linus/master kvm/linux-next v6.2-rc8 v6.2-rc7 v6.2-rc6 v6.2-rc8]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/James-Houghton/hugetlb-don-t-set-PageUptodate-for-UFFDIO_CONTINUE/20230218-083216
patch link: https://lore.kernel.org/r/20230218002819.1486479-14-jthoughton%40google.com
patch subject: [PATCH v2 13/46] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step
config: loongarch-allyesconfig (https://download.01.org/0day-ci/archive/20230218/202302181558.6o0zw4Cl-lkp@intel.com/config)
compiler: loongarch64-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/7e55fe945a1b5f042746277050390bdeba9e22d2
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review James-Houghton/hugetlb-don-t-set-PageUptodate-for-UFFDIO_CONTINUE/20230218-083216
git checkout 7e55fe945a1b5f042746277050390bdeba9e22d2
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=loongarch olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=loongarch SHELL=/bin/bash
If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202302181558.6o0zw4Cl-lkp@intel.com/
All errors (new ones prefixed by >>):
loongarch64-linux-ld: mm/hugetlb.o: in function `.L142':
>> hugetlb.c:(.text+0x9ec): undefined reference to `hugetlb_walk_step'
Hi James,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on next-20230217]
[cannot apply to kvm/queue shuah-kselftest/next shuah-kselftest/fixes arnd-asm-generic/master linus/master kvm/linux-next v6.2-rc8 v6.2-rc7 v6.2-rc6 v6.2-rc8]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/James-Houghton/hugetlb-don-t-set-PageUptodate-for-UFFDIO_CONTINUE/20230218-083216
patch link: https://lore.kernel.org/r/20230218002819.1486479-14-jthoughton%40google.com
patch subject: [PATCH v2 13/46] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step
config: arm64-randconfig-r005-20230217 (https://download.01.org/0day-ci/archive/20230219/202302190101.aoXrbN26-lkp@intel.com/config)
compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project db89896bbbd2251fff457699635acbbedeead27f)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# install arm64 cross compiling tool for clang build
# apt-get install binutils-aarch64-linux-gnu
# https://github.com/intel-lab-lkp/linux/commit/7e55fe945a1b5f042746277050390bdeba9e22d2
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review James-Houghton/hugetlb-don-t-set-PageUptodate-for-UFFDIO_CONTINUE/20230218-083216
git checkout 7e55fe945a1b5f042746277050390bdeba9e22d2
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm64 olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm64 SHELL=/bin/bash
If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202302190101.aoXrbN26-lkp@intel.com/
All errors (new ones prefixed by >>):
>> ld.lld: error: undefined symbol: hugetlb_walk_step
>>> referenced by hugetlb.c
>>> mm/hugetlb.o:(__hugetlb_hgm_walk) in archive vmlinux.a
On Sat, Feb 18, 2023 at 10:08 AM kernel test robot <lkp@intel.com> wrote:
>
> Hi James,
>
> Thank you for the patch! Yet something to improve:
>
> [auto build test ERROR on next-20230217]
> [cannot apply to kvm/queue shuah-kselftest/next shuah-kselftest/fixes arnd-asm-generic/master linus/master kvm/linux-next v6.2-rc8 v6.2-rc7 v6.2-rc6 v6.2-rc8]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url: https://github.com/intel-lab-lkp/linux/commits/James-Houghton/hugetlb-don-t-set-PageUptodate-for-UFFDIO_CONTINUE/20230218-083216
> patch link: https://lore.kernel.org/r/20230218002819.1486479-14-jthoughton%40google.com
> patch subject: [PATCH v2 13/46] hugetlb: add hugetlb_hgm_walk and hugetlb_walk_step
> config: arm64-randconfig-r005-20230217 (https://download.01.org/0day-ci/archive/20230219/202302190101.aoXrbN26-lkp@intel.com/config)
> compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project db89896bbbd2251fff457699635acbbedeead27f)
> reproduce (this is a W=1 build):
> wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # install arm64 cross compiling tool for clang build
> # apt-get install binutils-aarch64-linux-gnu
> # https://github.com/intel-lab-lkp/linux/commit/7e55fe945a1b5f042746277050390bdeba9e22d2
> git remote add linux-review https://github.com/intel-lab-lkp/linux
> git fetch --no-tags linux-review James-Houghton/hugetlb-don-t-set-PageUptodate-for-UFFDIO_CONTINUE/20230218-083216
> git checkout 7e55fe945a1b5f042746277050390bdeba9e22d2
> # save the config file
> mkdir build_dir && cp config build_dir/.config
> COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm64 olddefconfig
> COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm64 SHELL=/bin/bash
>
> If you fix the issue, kindly add following tag where applicable
> | Reported-by: kernel test robot <lkp@intel.com>
> | Link: https://lore.kernel.org/oe-kbuild-all/202302190101.aoXrbN26-lkp@intel.com/
>
> All errors (new ones prefixed by >>):
>
> >> ld.lld: error: undefined symbol: hugetlb_walk_step
> >>> referenced by hugetlb.c
> >>> mm/hugetlb.o:(__hugetlb_hgm_walk) in archive vmlinux.a
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests
>
This is fixed by providing a trivial definition of __hugetlb_hgm_walk
when !CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING. Will be done for v3.
hugetlb_walk_step() is only defined by architectures that support HGM.
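Roughly something like this (untested sketch; the existing definition would
move under #ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING):

#ifndef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
static int __hugetlb_hgm_walk(struct mm_struct *mm, struct vm_area_struct *vma,
			      struct hugetlb_pte *hpte, unsigned long addr,
			      unsigned long sz, bool alloc)
{
	/*
	 * Without HGM, the hstate-level PTE that @hpte starts at is already
	 * the leaf, so there is nothing further to walk (or allocate), and we
	 * never need to call hugetlb_walk_step().
	 */
	return 0;
}
#endif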
On 02/18/23 00:27, James Houghton wrote:
> hugetlb_hgm_walk implements high-granularity page table walks for
> HugeTLB. It is safe to call on non-HGM enabled VMAs; it will return
> immediately.
>
> hugetlb_walk_step implements how we step forwards in the walk. For
> architectures that don't use GENERAL_HUGETLB, they will need to provide
> their own implementation.
>
> The broader API that should be used is
> hugetlb_full_walk[,alloc|,continue].
I guess 'full' in the name implies walking to the PTE (PAGE_SIZE) level.
It could just be me and my over-familiarity with the existing hugetlb
walking code, but that was not obvious.
Again, perhaps it is just how familiar I am with the existing code, but
I found the routines difficult to follow. Nothing looks obviously wrong.
Just a couple comments/questions below.
>
> Signed-off-by: James Houghton <jthoughton@google.com>
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 9d839519c875..726d581158b1 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -223,6 +223,14 @@ u32 hugetlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx);
> pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long addr, pud_t *pud);
>
> +int hugetlb_full_walk(struct hugetlb_pte *hpte, struct vm_area_struct *vma,
> + unsigned long addr);
> +void hugetlb_full_walk_continue(struct hugetlb_pte *hpte,
> + struct vm_area_struct *vma, unsigned long addr);
> +int hugetlb_full_walk_alloc(struct hugetlb_pte *hpte,
> + struct vm_area_struct *vma, unsigned long addr,
> + unsigned long target_sz);
> +
> struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage);
>
> extern int sysctl_hugetlb_shm_group;
> @@ -272,6 +280,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
> pte_t *huge_pte_offset(struct mm_struct *mm,
> unsigned long addr, unsigned long sz);
> unsigned long hugetlb_mask_last_page(struct hstate *h);
> +int hugetlb_walk_step(struct mm_struct *mm, struct hugetlb_pte *hpte,
> + unsigned long addr, unsigned long sz);
> int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long addr, pte_t *ptep);
> void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
> @@ -1054,6 +1064,8 @@ void hugetlb_register_node(struct node *node);
> void hugetlb_unregister_node(struct node *node);
> #endif
>
> +enum hugetlb_level hpage_size_to_level(unsigned long sz);
> +
> #else /* CONFIG_HUGETLB_PAGE */
> struct hstate {};
>
> @@ -1246,6 +1258,11 @@ static inline void hugetlb_register_node(struct node *node)
> static inline void hugetlb_unregister_node(struct node *node)
> {
> }
> +
> +static inline enum hugetlb_level hpage_size_to_level(unsigned long sz)
> +{
> + return HUGETLB_LEVEL_PTE;
> +}
> #endif /* CONFIG_HUGETLB_PAGE */
>
> #ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index bb424cdf79e4..810c05feb41f 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -97,6 +97,29 @@ static void __hugetlb_vma_unlock_write_free(struct vm_area_struct *vma);
> static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
> unsigned long start, unsigned long end);
>
> +/*
> + * hpage_size_to_level() - convert @sz to the corresponding page table level
> + *
> + * @sz must be less than or equal to a valid hugepage size.
> + */
> +enum hugetlb_level hpage_size_to_level(unsigned long sz)
> +{
> + /*
> + * We order the conditionals from smallest to largest to pick the
> + * smallest level when multiple levels have the same size (i.e.,
> + * when levels are folded).
> + */
> + if (sz < PMD_SIZE)
> + return HUGETLB_LEVEL_PTE;
> + if (sz < PUD_SIZE)
> + return HUGETLB_LEVEL_PMD;
> + if (sz < P4D_SIZE)
> + return HUGETLB_LEVEL_PUD;
> + if (sz < PGDIR_SIZE)
> + return HUGETLB_LEVEL_P4D;
> + return HUGETLB_LEVEL_PGD;
> +}
> +
> static inline bool subpool_is_free(struct hugepage_subpool *spool)
> {
> if (spool->count)
> @@ -7315,6 +7338,154 @@ bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr)
> }
> #endif /* CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
>
> +/* __hugetlb_hgm_walk - walks a high-granularity HugeTLB page table to resolve
> + * the page table entry for @addr. We might allocate new PTEs.
> + *
> + * @hpte must always be pointing at an hstate-level PTE or deeper.
> + *
> + * This function will never walk further if it encounters a PTE of a size
> + * less than or equal to @sz.
> + *
> + * @alloc determines what we do when we encounter an empty PTE. If false,
> + * we stop walking. If true and @sz is less than the current PTE's size,
> + * we make that PTE point to the next level down, going until @sz is the same
> + * as our current PTE.
> + *
> + * If @alloc is false and @sz is PAGE_SIZE, this function will always
> + * succeed, but that does not guarantee that hugetlb_pte_size(hpte) is @sz.
> + *
> + * Return:
> + * -ENOMEM if we couldn't allocate new PTEs.
> + * -EEXIST if the caller wanted to walk further than a migration PTE,
> + * poison PTE, or a PTE marker. The caller needs to manually deal
> + * with this scenario.
> + * -EINVAL if called with invalid arguments (@sz invalid, @hpte not
> + * initialized).
> + * 0 otherwise.
> + *
> + * Even if this function fails, @hpte is guaranteed to always remain
> + * valid.
> + */
> +static int __hugetlb_hgm_walk(struct mm_struct *mm, struct vm_area_struct *vma,
> + struct hugetlb_pte *hpte, unsigned long addr,
> + unsigned long sz, bool alloc)
> +{
> + int ret = 0;
> + pte_t pte;
> +
> + if (WARN_ON_ONCE(sz < PAGE_SIZE))
> + return -EINVAL;
> +
> + if (WARN_ON_ONCE(!hpte->ptep))
> + return -EINVAL;
> +
> + while (hugetlb_pte_size(hpte) > sz && !ret) {
> + pte = huge_ptep_get(hpte->ptep);
> + if (!pte_present(pte)) {
> + if (!alloc)
> + return 0;
> + if (unlikely(!huge_pte_none(pte)))
> + return -EEXIST;
> + } else if (hugetlb_pte_present_leaf(hpte, pte))
> + return 0;
> + ret = hugetlb_walk_step(mm, hpte, addr, sz);
> + }
> +
> + return ret;
> +}
> +
> +/*
> + * hugetlb_hgm_walk - Has the same behavior as __hugetlb_hgm_walk but will
> + * initialize @hpte with hstate-level PTE pointer @ptep.
> + */
> +static int hugetlb_hgm_walk(struct hugetlb_pte *hpte,
> + pte_t *ptep,
> + struct vm_area_struct *vma,
> + unsigned long addr,
> + unsigned long target_sz,
> + bool alloc)
> +{
> + struct hstate *h = hstate_vma(vma);
> +
> + hugetlb_pte_init(vma->vm_mm, hpte, ptep, huge_page_shift(h),
> + hpage_size_to_level(huge_page_size(h)));
> + return __hugetlb_hgm_walk(vma->vm_mm, vma, hpte, addr, target_sz,
> + alloc);
> +}
> +
> +/*
> + * hugetlb_full_walk_continue - continue a high-granularity page-table walk.
> + *
> + * If a user has a valid @hpte but knows that @hpte is not a leaf, they can
> + * attempt to continue walking by calling this function.
> + *
> + * This function will never fail, but @hpte might not change.
> + *
> + * If @hpte hasn't been initialized, then this function's behavior is
> + * undefined.
> + */
> +void hugetlb_full_walk_continue(struct hugetlb_pte *hpte,
> + struct vm_area_struct *vma,
> + unsigned long addr)
> +{
> + /* __hugetlb_hgm_walk will never fail with these arguments. */
> + WARN_ON_ONCE(__hugetlb_hgm_walk(vma->vm_mm, vma, hpte, addr,
> + PAGE_SIZE, false));
> +}
> +
> +/*
> + * hugetlb_full_walk - do a high-granularity page-table walk; never allocate.
> + *
> + * This function can only fail if we find that the hstate-level PTE is not
> + * allocated. Callers can take advantage of this fact to skip address regions
> + * that cannot be mapped in that case.
> + *
> + * If this function succeeds, @hpte is guaranteed to be valid.
> + */
> +int hugetlb_full_walk(struct hugetlb_pte *hpte,
> + struct vm_area_struct *vma,
> + unsigned long addr)
> +{
> + struct hstate *h = hstate_vma(vma);
> + unsigned long sz = huge_page_size(h);
> + /*
> + * We must mask the address appropriately so that we pick up the first
> + * PTE in a contiguous group.
> + */
> + pte_t *ptep = hugetlb_walk(vma, addr & huge_page_mask(h), sz);
> +
> + if (!ptep)
> + return -ENOMEM;
-ENOMEM does not seem appropriate, but I cannot think of anything
better. -ENOENT perhaps?
> +
> + /* hugetlb_hgm_walk will never fail with these arguments. */
> + WARN_ON_ONCE(hugetlb_hgm_walk(hpte, ptep, vma, addr, PAGE_SIZE, false));
> + return 0;
> +}
> +
> +/*
> + * hugetlb_full_walk_alloc - do a high-granularity walk, potentially allocate
> + * new PTEs.
> + */
> +int hugetlb_full_walk_alloc(struct hugetlb_pte *hpte,
> + struct vm_area_struct *vma,
> + unsigned long addr,
> + unsigned long target_sz)
> +{
> + struct hstate *h = hstate_vma(vma);
> + unsigned long sz = huge_page_size(h);
> + /*
> + * We must mask the address appropriately so that we pick up the first
> + * PTE in a contiguous group.
> + */
> + pte_t *ptep = huge_pte_alloc(vma->vm_mm, vma, addr & huge_page_mask(h),
> + sz);
> +
> + if (!ptep)
> + return -ENOMEM;
> +
> + return hugetlb_hgm_walk(hpte, ptep, vma, addr, target_sz, true);
> +}
> +
> #ifdef CONFIG_ARCH_WANT_GENERAL_HUGETLB
> pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long addr, unsigned long sz)
> @@ -7382,6 +7553,48 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
> return (pte_t *)pmd;
> }
>
> +/*
> + * hugetlb_walk_step() - Walk the page table one step to resolve the page
> + * (hugepage or subpage) entry at address @addr.
> + *
> + * @sz always points at the final target PTE size (e.g. PAGE_SIZE for the
> + * lowest level PTE).
> + *
> + * @hpte will always remain valid, even if this function fails.
> + *
> + * Architectures that implement this function must ensure that if @hpte does
> + * not change levels, then its PTL must also stay the same.
> + */
> +int hugetlb_walk_step(struct mm_struct *mm, struct hugetlb_pte *hpte,
> + unsigned long addr, unsigned long sz)
> +{
> + pte_t *ptep;
> + spinlock_t *ptl;
> +
> + switch (hpte->level) {
> + case HUGETLB_LEVEL_PUD:
> + ptep = (pte_t *)hugetlb_alloc_pmd(mm, hpte, addr);
> + if (IS_ERR(ptep))
> + return PTR_ERR(ptep);
> + hugetlb_pte_init(mm, hpte, ptep, PMD_SHIFT,
> + HUGETLB_LEVEL_PMD);
> + break;
> + case HUGETLB_LEVEL_PMD:
> + ptep = hugetlb_alloc_pte(mm, hpte, addr);
> + if (IS_ERR(ptep))
> + return PTR_ERR(ptep);
> + ptl = pte_lockptr(mm, (pmd_t *)hpte->ptep);
Is that right? hpte->ptep is the PMD-level entry. It seems
pte_lockptr() -> ptlock_ptr(pmd_page(*pmd)) -> return page->ptl.
But I would think we want mm->page_table_lock for the newly
allocated PTE.
On Tue, Feb 28, 2023 at 2:15 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 02/18/23 00:27, James Houghton wrote:
> > hugetlb_hgm_walk implements high-granularity page table walks for
> > HugeTLB. It is safe to call on non-HGM enabled VMAs; it will return
> > immediately.
> >
> > hugetlb_walk_step implements how we step forwards in the walk. For
> > architectures that don't use GENERAL_HUGETLB, they will need to provide
> > their own implementation.
> >
> > The broader API that should be used is
> > hugetlb_full_walk[,alloc|,continue].
>
> I guess 'full' in the name implies walking to the PTE (PAGE_SIZE) level.
> It could just be me and my over-familiarity with the existing hugetlb
> walking code, but that was not obvious.
Yeah "full" means it walks all the way down to the leaf. "alloc" means
it may allocate if required to reach some target PTE size (target_sz).
I'll try to be clearer in the comments for hugetlb_full_walk and
hugetlb_full_walk_alloc.
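To make that concrete, the intended calling pattern is something like this
(example_lookup/example_populate are made up, just to illustrate; not code
from this series):

/* Lookup-only path: never allocates page tables. */
static void example_lookup(struct vm_area_struct *vma, unsigned long addr)
{
	struct hugetlb_pte hpte;

	/* Fails only if no hstate-level PTE exists for this range. */
	if (hugetlb_full_walk(&hpte, vma, addr))
		return;

	/* hpte now describes the deepest PTE currently covering addr. */
}

/* Fault-like path: may allocate page tables down to a PAGE_SIZE entry. */
static int example_populate(struct vm_area_struct *vma, unsigned long addr)
{
	struct hugetlb_pte hpte;

	/*
	 * On success, hpte is either a present leaf or a PAGE_SIZE-level
	 * entry; on failure, -ENOMEM or -EEXIST is returned.
	 */
	return hugetlb_full_walk_alloc(&hpte, vma, addr, PAGE_SIZE);
}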
>
> Again, perhaps it is just how familiar I am with the existing code, but
> I found the routines difficult to follow. Nothing looks obviously wrong.
>
> Just a couple comments/questions below.
>
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 9d839519c875..726d581158b1 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -223,6 +223,14 @@ u32 hugetlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx);
> > pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
> > unsigned long addr, pud_t *pud);
> >
> > +int hugetlb_full_walk(struct hugetlb_pte *hpte, struct vm_area_struct *vma,
> > + unsigned long addr);
> > +void hugetlb_full_walk_continue(struct hugetlb_pte *hpte,
> > + struct vm_area_struct *vma, unsigned long addr);
> > +int hugetlb_full_walk_alloc(struct hugetlb_pte *hpte,
> > + struct vm_area_struct *vma, unsigned long addr,
> > + unsigned long target_sz);
> > +
> > struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage);
> >
> > extern int sysctl_hugetlb_shm_group;
> > @@ -272,6 +280,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
> > pte_t *huge_pte_offset(struct mm_struct *mm,
> > unsigned long addr, unsigned long sz);
> > unsigned long hugetlb_mask_last_page(struct hstate *h);
> > +int hugetlb_walk_step(struct mm_struct *mm, struct hugetlb_pte *hpte,
> > + unsigned long addr, unsigned long sz);
> > int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
> > unsigned long addr, pte_t *ptep);
> > void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
> > @@ -1054,6 +1064,8 @@ void hugetlb_register_node(struct node *node);
> > void hugetlb_unregister_node(struct node *node);
> > #endif
> >
> > +enum hugetlb_level hpage_size_to_level(unsigned long sz);
> > +
> > #else /* CONFIG_HUGETLB_PAGE */
> > struct hstate {};
> >
> > @@ -1246,6 +1258,11 @@ static inline void hugetlb_register_node(struct node *node)
> > static inline void hugetlb_unregister_node(struct node *node)
> > {
> > }
> > +
> > +static inline enum hugetlb_level hpage_size_to_level(unsigned long sz)
> > +{
> > + return HUGETLB_LEVEL_PTE;
> > +}
> > #endif /* CONFIG_HUGETLB_PAGE */
> >
> > #ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index bb424cdf79e4..810c05feb41f 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -97,6 +97,29 @@ static void __hugetlb_vma_unlock_write_free(struct vm_area_struct *vma);
> > static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
> > unsigned long start, unsigned long end);
> >
> > +/*
> > + * hpage_size_to_level() - convert @sz to the corresponding page table level
> > + *
> > + * @sz must be less than or equal to a valid hugepage size.
> > + */
> > +enum hugetlb_level hpage_size_to_level(unsigned long sz)
> > +{
> > + /*
> > + * We order the conditionals from smallest to largest to pick the
> > + * smallest level when multiple levels have the same size (i.e.,
> > + * when levels are folded).
> > + */
> > + if (sz < PMD_SIZE)
> > + return HUGETLB_LEVEL_PTE;
> > + if (sz < PUD_SIZE)
> > + return HUGETLB_LEVEL_PMD;
> > + if (sz < P4D_SIZE)
> > + return HUGETLB_LEVEL_PUD;
> > + if (sz < PGDIR_SIZE)
> > + return HUGETLB_LEVEL_P4D;
> > + return HUGETLB_LEVEL_PGD;
> > +}
> > +
> > static inline bool subpool_is_free(struct hugepage_subpool *spool)
> > {
> > if (spool->count)
> > @@ -7315,6 +7338,154 @@ bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr)
> > }
> > #endif /* CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
> >
> > +/* __hugetlb_hgm_walk - walks a high-granularity HugeTLB page table to resolve
> > + * the page table entry for @addr. We might allocate new PTEs.
> > + *
> > + * @hpte must always be pointing at an hstate-level PTE or deeper.
> > + *
> > + * This function will never walk further if it encounters a PTE of a size
> > + * less than or equal to @sz.
> > + *
> > + * @alloc determines what we do when we encounter an empty PTE. If false,
> > + * we stop walking. If true and @sz is less than the current PTE's size,
> > + * we make that PTE point to the next level down, going until @sz is the same
> > + * as our current PTE.
> > + *
> > + * If @alloc is false and @sz is PAGE_SIZE, this function will always
> > + * succeed, but that does not guarantee that hugetlb_pte_size(hpte) is @sz.
> > + *
> > + * Return:
> > + * -ENOMEM if we couldn't allocate new PTEs.
> > + * -EEXIST if the caller wanted to walk further than a migration PTE,
> > + * poison PTE, or a PTE marker. The caller needs to manually deal
> > + * with this scenario.
> > + * -EINVAL if called with invalid arguments (@sz invalid, @hpte not
> > + * initialized).
> > + * 0 otherwise.
> > + *
> > + * Even if this function fails, @hpte is guaranteed to always remain
> > + * valid.
> > + */
> > +static int __hugetlb_hgm_walk(struct mm_struct *mm, struct vm_area_struct *vma,
> > + struct hugetlb_pte *hpte, unsigned long addr,
> > + unsigned long sz, bool alloc)
> > +{
> > + int ret = 0;
> > + pte_t pte;
> > +
> > + if (WARN_ON_ONCE(sz < PAGE_SIZE))
> > + return -EINVAL;
> > +
> > + if (WARN_ON_ONCE(!hpte->ptep))
> > + return -EINVAL;
> > +
> > + while (hugetlb_pte_size(hpte) > sz && !ret) {
> > + pte = huge_ptep_get(hpte->ptep);
> > + if (!pte_present(pte)) {
> > + if (!alloc)
> > + return 0;
> > + if (unlikely(!huge_pte_none(pte)))
> > + return -EEXIST;
> > + } else if (hugetlb_pte_present_leaf(hpte, pte))
> > + return 0;
> > + ret = hugetlb_walk_step(mm, hpte, addr, sz);
> > + }
> > +
> > + return ret;
> > +}
> > +
> > +/*
> > + * hugetlb_hgm_walk - Has the same behavior as __hugetlb_hgm_walk but will
> > + * initialize @hpte with hstate-level PTE pointer @ptep.
> > + */
> > +static int hugetlb_hgm_walk(struct hugetlb_pte *hpte,
> > + pte_t *ptep,
> > + struct vm_area_struct *vma,
> > + unsigned long addr,
> > + unsigned long target_sz,
> > + bool alloc)
> > +{
> > + struct hstate *h = hstate_vma(vma);
> > +
> > + hugetlb_pte_init(vma->vm_mm, hpte, ptep, huge_page_shift(h),
> > + hpage_size_to_level(huge_page_size(h)));
> > + return __hugetlb_hgm_walk(vma->vm_mm, vma, hpte, addr, target_sz,
> > + alloc);
> > +}
> > +
> > +/*
> > + * hugetlb_full_walk_continue - continue a high-granularity page-table walk.
> > + *
> > + * If a user has a valid @hpte but knows that @hpte is not a leaf, they can
> > + * attempt to continue walking by calling this function.
> > + *
> > + * This function will never fail, but @hpte might not change.
> > + *
> > + * If @hpte hasn't been initialized, then this function's behavior is
> > + * undefined.
> > + */
> > +void hugetlb_full_walk_continue(struct hugetlb_pte *hpte,
> > + struct vm_area_struct *vma,
> > + unsigned long addr)
> > +{
> > + /* __hugetlb_hgm_walk will never fail with these arguments. */
> > + WARN_ON_ONCE(__hugetlb_hgm_walk(vma->vm_mm, vma, hpte, addr,
> > + PAGE_SIZE, false));
> > +}
> > +
> > +/*
> > + * hugetlb_full_walk - do a high-granularity page-table walk; never allocate.
> > + *
> > + * This function can only fail if we find that the hstate-level PTE is not
> > + * allocated. Callers can take advantage of this fact to skip address regions
> > + * that cannot be mapped in that case.
> > + *
> > + * If this function succeeds, @hpte is guaranteed to be valid.
> > + */
> > +int hugetlb_full_walk(struct hugetlb_pte *hpte,
> > + struct vm_area_struct *vma,
> > + unsigned long addr)
> > +{
> > + struct hstate *h = hstate_vma(vma);
> > + unsigned long sz = huge_page_size(h);
> > + /*
> > + * We must mask the address appropriately so that we pick up the first
> > + * PTE in a contiguous group.
> > + */
> > + pte_t *ptep = hugetlb_walk(vma, addr & huge_page_mask(h), sz);
> > +
> > + if (!ptep)
> > + return -ENOMEM;
>
> -ENOMEM does not seem appropriate, but I cannot think of anything
> better. -ENOENT perhaps?
The callers only ever check if hugetlb_full_walk() is 0 or not, so
returning 1 here could be a fine solution too. What do you think?
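Concretely, the check in hugetlb_full_walk() would just become (sketch):

	pte_t *ptep = hugetlb_walk(vma, addr & huge_page_mask(h), sz);

	if (!ptep)
		/* No hstate-level PTE; not a real error, callers only check for 0. */
		return 1;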
>
> > +
> > + /* hugetlb_hgm_walk will never fail with these arguments. */
> > + WARN_ON_ONCE(hugetlb_hgm_walk(hpte, ptep, vma, addr, PAGE_SIZE, false));
> > + return 0;
> > +}
> > +
> > +/*
> > + * hugetlb_full_walk_alloc - do a high-granularity walk, potentially allocate
> > + * new PTEs.
> > + */
> > +int hugetlb_full_walk_alloc(struct hugetlb_pte *hpte,
> > + struct vm_area_struct *vma,
> > + unsigned long addr,
> > + unsigned long target_sz)
> > +{
> > + struct hstate *h = hstate_vma(vma);
> > + unsigned long sz = huge_page_size(h);
> > + /*
> > + * We must mask the address appropriately so that we pick up the first
> > + * PTE in a contiguous group.
> > + */
> > + pte_t *ptep = huge_pte_alloc(vma->vm_mm, vma, addr & huge_page_mask(h),
> > + sz);
> > +
> > + if (!ptep)
> > + return -ENOMEM;
> > +
> > + return hugetlb_hgm_walk(hpte, ptep, vma, addr, target_sz, true);
> > +}
> > +
> > #ifdef CONFIG_ARCH_WANT_GENERAL_HUGETLB
> > pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
> > unsigned long addr, unsigned long sz)
> > @@ -7382,6 +7553,48 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
> > return (pte_t *)pmd;
> > }
> >
> > +/*
> > + * hugetlb_walk_step() - Walk the page table one step to resolve the page
> > + * (hugepage or subpage) entry at address @addr.
> > + *
> > + * @sz always points at the final target PTE size (e.g. PAGE_SIZE for the
> > + * lowest level PTE).
> > + *
> > + * @hpte will always remain valid, even if this function fails.
> > + *
> > + * Architectures that implement this function must ensure that if @hpte does
> > + * not change levels, then its PTL must also stay the same.
> > + */
> > +int hugetlb_walk_step(struct mm_struct *mm, struct hugetlb_pte *hpte,
> > + unsigned long addr, unsigned long sz)
> > +{
> > + pte_t *ptep;
> > + spinlock_t *ptl;
> > +
> > + switch (hpte->level) {
> > + case HUGETLB_LEVEL_PUD:
> > + ptep = (pte_t *)hugetlb_alloc_pmd(mm, hpte, addr);
> > + if (IS_ERR(ptep))
> > + return PTR_ERR(ptep);
> > + hugetlb_pte_init(mm, hpte, ptep, PMD_SHIFT,
> > + HUGETLB_LEVEL_PMD);
> > + break;
> > + case HUGETLB_LEVEL_PMD:
> > + ptep = hugetlb_alloc_pte(mm, hpte, addr);
> > + if (IS_ERR(ptep))
> > + return PTR_ERR(ptep);
> > + ptl = pte_lockptr(mm, (pmd_t *)hpte->ptep);
>
> Is that right? hpte->ptep is the PMD-level entry. It seems
> pte_lockptr() -> ptlock_ptr(pmd_page(*pmd)) -> return page->ptl.
> But I would think we want mm->page_table_lock for the newly
> allocated PTE.
If we used mm->page_table_lock for 4K PTEs, that would be a
performance nightmare (right?). The PTL we set here will be used for
page faults, UFFDIO_CONTINUEs, etc.
This code should be right. It's doing:
(1) Overwrite our leaf-level PMD with a non-leaf PMD (returning the 4K
PTE we should populate the hpte with).
(2) Find the PTL for the non-leaf PMD (the same way that generic mm does).
(3) Update hpte to point to the PTE we are supposed to use.
It's really important that (2) happens after (1), otherwise pmd_page()
won't be pointing to a page table page (and will change shortly), so
ptlock_ptr() won't give us what we want.
This is probably worth a comment; I'll write something like this here.
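Roughly (wording not final):

	case HUGETLB_LEVEL_PMD:
		/*
		 * (1) Replace the PMD entry with a non-leaf PMD pointing at a
		 *     newly allocated PTE table; ptep is the 4K PTE for @addr
		 *     within that table.
		 */
		ptep = hugetlb_alloc_pte(mm, hpte, addr);
		if (IS_ERR(ptep))
			return PTR_ERR(ptep);
		/*
		 * (2) Only now does pmd_page(*hpte->ptep) refer to a page
		 *     table page, so pte_lockptr() hands back the PTL that
		 *     covers the new PTE table. This must come after (1).
		 */
		ptl = pte_lockptr(mm, (pmd_t *)hpte->ptep);
		/* (3) Point hpte at the 4K PTE, using that PTL. */
		__hugetlb_pte_init(hpte, ptep, PAGE_SHIFT,
				   HUGETLB_LEVEL_PTE, ptl);
		break;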
Thanks Mike. Hopefully the rest of the series isn't too confusing.
- James
>
> --
> Mike Kravetz
>
> > + __hugetlb_pte_init(hpte, ptep, PAGE_SHIFT,
> > + HUGETLB_LEVEL_PTE, ptl);
> > + break;
> > + default:
> > + WARN_ONCE(1, "%s: got invalid level: %d (shift: %d)\n",
> > + __func__, hpte->level, hpte->shift);
> > + return -EINVAL;
> > + }
> > + return 0;
> > +}
> > +
> > /*
> > * Return a mask that can be used to update an address to the last huge
> > * page in a page table page mapping size. Used to skip non-present
> > --
> > 2.39.2.637.g21b0678d19-goog
> >