[v2,09/46] mm: add MADV_SPLIT to enable HugeTLB HGM

Message ID 20230218002819.1486479-10-jthoughton@google.com
State New
Series hugetlb: introduce HugeTLB high-granularity mapping

Commit Message

James Houghton Feb. 18, 2023, 12:27 a.m. UTC
Issuing madvise(MADV_SPLIT) on a HugeTLB address range will enable
HugeTLB HGM. MADV_SPLIT was chosen for the name so that this API can be
applied to non-HugeTLB memory in the future, should such a use case
arise.

MADV_SPLIT provides several API changes for some syscalls on HugeTLB
address ranges:
1. UFFDIO_CONTINUE is allowed for MAP_SHARED VMAs at PAGE_SIZE
   alignment.
2. read()ing a page fault event from a userfaultfd will yield a
   PAGE_SIZE-rounded address, instead of a huge-page-size-rounded
   address (unless UFFD_FEATURE_EXACT_ADDRESS is used).
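
To illustrate item 2, here is a minimal userspace sketch of the two
rounding behaviours. The helper names and the 4 KiB / 2 MiB sizes used
in the note below are assumptions made for illustration; none of this
is code from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr down to a power-of-two boundary. */
static uint64_t round_down_pow2(uint64_t addr, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return addr & ~(align - 1);
}

/*
 * Hypothetical model of the address a userfaultfd reader would see for
 * a fault at `addr`, as described in the commit message. It ignores
 * UFFD_FEATURE_EXACT_ADDRESS, which reports the address unmodified.
 */
static uint64_t reported_fault_addr(uint64_t addr, int hgm_enabled,
				    uint64_t page_size, uint64_t hpage_size)
{
	return round_down_pow2(addr, hgm_enabled ? page_size : hpage_size);
}
```

For a fault at 0x40201234 inside a 2 MiB HugeTLB page with 4 KiB base
pages, this model yields 0x40200000 without MADV_SPLIT and 0x40201000
with it.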

The API changes that come with MADV_SPLIT cannot be disabled once
enabled. MADV_COLLAPSE can be used to collapse the high-granularity page
table mappings created through this extended functionality, but it does
not undo the API changes.

For post-copy live migration, the expected use-case is:
1. mmap(MAP_SHARED, some_fd) primary mapping
2. mmap(MAP_SHARED, some_fd) alias mapping
3. MADV_SPLIT the primary mapping
4. UFFDIO_REGISTER/etc. the primary mapping
5. Copy memory contents into alias mapping and UFFDIO_CONTINUE the
   corresponding PAGE_SIZE sections in the primary mapping.
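
Steps 3 and 5 above might look roughly like the following userspace
sketch. It assumes a kernel carrying this series; MADV_SPLIT is defined
locally with the value this patch proposes for most architectures, and
the helper names enable_hgm() and install_chunk() are hypothetical. On
a kernel without the series, the madvise() call is expected to fail.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/userfaultfd.h>

/* Value proposed by this patch (parisc uses 74 instead). */
#ifndef MADV_SPLIT
#define MADV_SPLIT 26
#endif

/* Step 3: opt the primary mapping into HGM. Returns 0, or -1 with errno set. */
static int enable_hgm(void *primary, size_t len)
{
	return madvise(primary, len, MADV_SPLIT);
}

/*
 * Step 5: install one PAGE_SIZE chunk. `primary` and `alias` map the
 * same file and `off` is PAGE_SIZE-aligned. The data is written through
 * the alias mapping, then UFFDIO_CONTINUE asks the kernel to map the
 * now-populated page into the primary mapping.
 */
static int install_chunk(int uffd, char *primary, char *alias, size_t off,
			 const void *src, size_t page_size)
{
	struct uffdio_continue cont = {
		.range = {
			.start = (uint64_t)(uintptr_t)(primary + off),
			.len = page_size,
		},
		.mode = 0,
	};

	assert(page_size != 0 && off % page_size == 0);
	memcpy(alias + off, src, page_size);
	return ioctl(uffd, UFFDIO_CONTINUE, &cont);
}
```

Writing through the alias rather than the primary mapping matters here:
the primary mapping is registered with userfaultfd, so touching it
directly would raise a fault instead of populating the page.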

More API changes may be added in the future.

Signed-off-by: James Houghton <jthoughton@google.com>
  

Comments

Mina Almasry Feb. 18, 2023, 1:58 a.m. UTC | #1
On Fri, Feb 17, 2023 at 4:28 PM James Houghton <jthoughton@google.com> wrote:
>
> Issuing madvise(MADV_SPLIT) on a HugeTLB address range will enable
> HugeTLB HGM. MADV_SPLIT was chosen for the name so that this API can be
> applied to non-HugeTLB memory in the future, if such an application is
> to arise.
>
> MADV_SPLIT provides several API changes for some syscalls on HugeTLB
> address ranges:
> 1. UFFDIO_CONTINUE is allowed for MAP_SHARED VMAs at PAGE_SIZE
>    alignment.
> 2. read()ing a page fault event from a userfaultfd will yield a
>    PAGE_SIZE-rounded address, instead of a huge-page-size-rounded
>    address (unless UFFD_FEATURE_EXACT_ADDRESS is used).
>
> There is no way to disable the API changes that come with issuing
> MADV_SPLIT. MADV_COLLAPSE can be used to collapse high-granularity page
> table mappings that come from the extended functionality that comes with
> using MADV_SPLIT.
>

So is a hugetlb page or VMA that has been MADV_SPLIT + MADV_COLLAPSE
distinct from a hugetlb page or VMA that has not been? I thought
COLLAPSE would reverse the effects of SPLIT completely.

> For post-copy live migration, the expected use-case is:
> 1. mmap(MAP_SHARED, some_fd) primary mapping
> 2. mmap(MAP_SHARED, some_fd) alias mapping
> 3. MADV_SPLIT the primary mapping
> 4. UFFDIO_REGISTER/etc. the primary mapping
> 5. Copy memory contents into alias mapping and UFFDIO_CONTINUE the
>    corresponding PAGE_SIZE sections in the primary mapping.
>

Huh, so MADV_SPLIT doesn't actually split an existing PMD mapping into
high-granularity mappings. Instead, it says that future mappings may be
high granularity? I assume they may not even be high granularity: for
example, if the alias mapping faulted in a full hugetlb page (without
UFFDIO_CONTINUE), that page would be mapped normally, not at high
granularity.

This may be bikeshedding but I do think a clearer name is warranted.
Maybe MADV_MAY_SPLIT or something.

> More API changes may be added in the future.
>
> Signed-off-by: James Houghton <jthoughton@google.com>
>
> diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> index 763929e814e9..7a26f3648b90 100644
> --- a/arch/alpha/include/uapi/asm/mman.h
> +++ b/arch/alpha/include/uapi/asm/mman.h
> @@ -78,6 +78,8 @@
>
>  #define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
>
> +#define MADV_SPLIT     26              /* Enable hugepage high-granularity APIs */
> +
>  /* compatibility flags */
>  #define MAP_FILE       0
>
> diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> index c6e1fc77c996..f8a74a3a0928 100644
> --- a/arch/mips/include/uapi/asm/mman.h
> +++ b/arch/mips/include/uapi/asm/mman.h
> @@ -105,6 +105,8 @@
>
>  #define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
>
> +#define MADV_SPLIT     26              /* Enable hugepage high-granularity APIs */
> +
>  /* compatibility flags */
>  #define MAP_FILE       0
>
> diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> index 68c44f99bc93..a6dc6a56c941 100644
> --- a/arch/parisc/include/uapi/asm/mman.h
> +++ b/arch/parisc/include/uapi/asm/mman.h
> @@ -72,6 +72,8 @@
>
>  #define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
>
> +#define MADV_SPLIT     74              /* Enable hugepage high-granularity APIs */
> +
>  #define MADV_HWPOISON     100          /* poison a page for testing */
>  #define MADV_SOFT_OFFLINE 101          /* soft offline page for testing */
>
> diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> index 1ff0c858544f..f98a77c430a9 100644
> --- a/arch/xtensa/include/uapi/asm/mman.h
> +++ b/arch/xtensa/include/uapi/asm/mman.h
> @@ -113,6 +113,8 @@
>
>  #define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
>
> +#define MADV_SPLIT     26              /* Enable hugepage high-granularity APIs */
> +
>  /* compatibility flags */
>  #define MAP_FILE       0
>
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index 6ce1f1ceb432..996e8ded092f 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -79,6 +79,8 @@
>
>  #define MADV_COLLAPSE  25              /* Synchronous hugepage collapse */
>
> +#define MADV_SPLIT     26              /* Enable hugepage high-granularity APIs */
> +
>  /* compatibility flags */
>  #define MAP_FILE       0
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index c2202f51e9dd..8c004c678262 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1006,6 +1006,28 @@ static long madvise_remove(struct vm_area_struct *vma,
>         return error;
>  }
>
> +static int madvise_split(struct vm_area_struct *vma,
> +                        unsigned long *new_flags)
> +{
> +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> +       if (!is_vm_hugetlb_page(vma) || !hugetlb_hgm_eligible(vma))
> +               return -EINVAL;
> +
> +       /*
> +        * PMD sharing doesn't work with HGM. If this MADV_SPLIT is on part
> +        * of a VMA, then we will split the VMA. Here, we're unsharing before
> +        * splitting because it's simpler, although we may be unsharing more
> +        * than we need.
> +        */
> +       hugetlb_unshare_all_pmds(vma);
> +
> +       *new_flags |= VM_HUGETLB_HGM;
> +       return 0;
> +#else
> +       return -EINVAL;
> +#endif
> +}
> +
>  /*
>   * Apply an madvise behavior to a region of a vma.  madvise_update_vma
>   * will handle splitting a vm area into separate areas, each area with its own
> @@ -1084,6 +1106,11 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
>                 break;
>         case MADV_COLLAPSE:
>                 return madvise_collapse(vma, prev, start, end);
> +       case MADV_SPLIT:
> +               error = madvise_split(vma, &new_flags);
> +               if (error)
> +                       goto out;
> +               break;
>         }
>
>         anon_name = anon_vma_name(vma);
> @@ -1178,6 +1205,9 @@ madvise_behavior_valid(int behavior)
>         case MADV_HUGEPAGE:
>         case MADV_NOHUGEPAGE:
>         case MADV_COLLAPSE:
> +#endif
> +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> +       case MADV_SPLIT:
>  #endif
>         case MADV_DONTDUMP:
>         case MADV_DODUMP:
> @@ -1368,6 +1398,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
>   *             transparent huge pages so the existing pages will not be
>   *             coalesced into THP and new pages will not be allocated as THP.
>   *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
> + *  MADV_SPLIT - allow HugeTLB pages to be mapped at PAGE_SIZE. This allows
> + *             UFFDIO_CONTINUE to accept PAGE_SIZE-aligned regions.
>   *  MADV_DONTDUMP - the application wants to prevent pages in the given range
>   *             from being included in its core dump.
>   *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> --
> 2.39.2.637.g21b0678d19-goog
>
  
James Houghton Feb. 21, 2023, 4:33 p.m. UTC | #2
On Fri, Feb 17, 2023 at 5:58 PM Mina Almasry <almasrymina@google.com> wrote:
>
> On Fri, Feb 17, 2023 at 4:28 PM James Houghton <jthoughton@google.com> wrote:
> >
> > Issuing madvise(MADV_SPLIT) on a HugeTLB address range will enable
> > HugeTLB HGM. MADV_SPLIT was chosen for the name so that this API can be
> > applied to non-HugeTLB memory in the future, if such an application is
> > to arise.
> >
> > MADV_SPLIT provides several API changes for some syscalls on HugeTLB
> > address ranges:
> > 1. UFFDIO_CONTINUE is allowed for MAP_SHARED VMAs at PAGE_SIZE
> >    alignment.
> > 2. read()ing a page fault event from a userfaultfd will yield a
> >    PAGE_SIZE-rounded address, instead of a huge-page-size-rounded
> >    address (unless UFFD_FEATURE_EXACT_ADDRESS is used).
> >
> > There is no way to disable the API changes that come with issuing
> > MADV_SPLIT. MADV_COLLAPSE can be used to collapse high-granularity page
> > table mappings that come from the extended functionality that comes with
> > using MADV_SPLIT.
> >
>
> So is a hugetlb page or VMA that has been MADV_SPLIT + MADV_COLLAPSE
> distinct from a hugetlb page or vma that has not been? I thought
> COLLAPSE would reverse the effects on SPLIT completely.

Right now, MADV_COLLAPSE does *not* completely undo the effects of an
MADV_SPLIT. The API changes that come from MADV_SPLIT aren't undone
with an MADV_COLLAPSE.

>
> > For post-copy live migration, the expected use-case is:
> > 1. mmap(MAP_SHARED, some_fd) primary mapping
> > 2. mmap(MAP_SHARED, some_fd) alias mapping
> > 3. MADV_SPLIT the primary mapping
> > 4. UFFDIO_REGISTER/etc. the primary mapping
> > 5. Copy memory contents into alias mapping and UFFDIO_CONTINUE the
> >    corresponding PAGE_SIZE sections in the primary mapping.
> >
>
> Huh, so MADV_SPLIT doesn't actually split an existing PMD mapping into
> high granularity mappings. Instead it says that future mappings may be
> high granularity? I assume they may not even be high granularity, like
> if the alias mapping faulted in a full hugetlb page (without
> UFFDIO_CONTINUE) that page would be regular mapped not high
> granularity mapped.

MADV_SPLIT just means "userspace is aware that they are able to start
mapping HugeTLB pages at high-granularity". Right now the only way to
get high-granularity mappings is with UFFDIO_CONTINUE, but there may
be other ways in the future.

As of this series, if you MADV_SPLIT a HugeTLB VMA and you aren't
using userfaultfd minor faults, it's basically a no-op. The mappings
that are created will still be huge. I could change this, but I don't
really see a reason to right now.

>
> This may be bikeshedding but I do think a clearer name is warranted.
> Maybe MADV_MAY_SPLIT or something.

I agree -- MADV_MAY_SPLIT more accurately describes the HugeTLB
functionality. I really don't mind what the MADV is called.

I think enabling the high-granularity userfaultfd bits with a
userfaultfd feature[1] worked reasonably well. There is some API
discussion in that thread[1].

[1]: https://lore.kernel.org/linux-mm/20221021163703.3218176-34-jthoughton@google.com/
  
Mike Kravetz Feb. 24, 2023, 11:25 p.m. UTC | #3
On 02/18/23 00:27, James Houghton wrote:
> Issuing madvise(MADV_SPLIT) on a HugeTLB address range will enable
> HugeTLB HGM. MADV_SPLIT was chosen for the name so that this API can be
> applied to non-HugeTLB memory in the future, if such an application is
> to arise.
> 
> MADV_SPLIT provides several API changes for some syscalls on HugeTLB
> address ranges:
> 1. UFFDIO_CONTINUE is allowed for MAP_SHARED VMAs at PAGE_SIZE
>    alignment.
> 2. read()ing a page fault event from a userfaultfd will yield a
>    PAGE_SIZE-rounded address, instead of a huge-page-size-rounded
>    address (unless UFFD_FEATURE_EXACT_ADDRESS is used).
> 
> There is no way to disable the API changes that come with issuing
> MADV_SPLIT. MADV_COLLAPSE can be used to collapse high-granularity page
> table mappings that come from the extended functionality that comes with
> using MADV_SPLIT.
> 
> For post-copy live migration, the expected use-case is:
> 1. mmap(MAP_SHARED, some_fd) primary mapping
> 2. mmap(MAP_SHARED, some_fd) alias mapping
> 3. MADV_SPLIT the primary mapping
> 4. UFFDIO_REGISTER/etc. the primary mapping
> 5. Copy memory contents into alias mapping and UFFDIO_CONTINUE the
>    corresponding PAGE_SIZE sections in the primary mapping.
> 
> More API changes may be added in the future.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>
> 
> diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> index 763929e814e9..7a26f3648b90 100644
> --- a/arch/alpha/include/uapi/asm/mman.h
> +++ b/arch/alpha/include/uapi/asm/mman.h
> @@ -78,6 +78,8 @@
>  
>  #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
>  
> +#define MADV_SPLIT	26		/* Enable hugepage high-granularity APIs */
> +
>  /* compatibility flags */
>  #define MAP_FILE	0
>  
> diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> index c6e1fc77c996..f8a74a3a0928 100644
> --- a/arch/mips/include/uapi/asm/mman.h
> +++ b/arch/mips/include/uapi/asm/mman.h
> @@ -105,6 +105,8 @@
>  
>  #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
>  
> +#define MADV_SPLIT	26		/* Enable hugepage high-granularity APIs */
> +
>  /* compatibility flags */
>  #define MAP_FILE	0
>  
> diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> index 68c44f99bc93..a6dc6a56c941 100644
> --- a/arch/parisc/include/uapi/asm/mman.h
> +++ b/arch/parisc/include/uapi/asm/mman.h
> @@ -72,6 +72,8 @@
>  
>  #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
>  
> +#define MADV_SPLIT	74		/* Enable hugepage high-granularity APIs */
> +
>  #define MADV_HWPOISON     100		/* poison a page for testing */
>  #define MADV_SOFT_OFFLINE 101		/* soft offline page for testing */
>  
> diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> index 1ff0c858544f..f98a77c430a9 100644
> --- a/arch/xtensa/include/uapi/asm/mman.h
> +++ b/arch/xtensa/include/uapi/asm/mman.h
> @@ -113,6 +113,8 @@
>  
>  #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
>  
> +#define MADV_SPLIT	26		/* Enable hugepage high-granularity APIs */
> +
>  /* compatibility flags */
>  #define MAP_FILE	0
>  
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index 6ce1f1ceb432..996e8ded092f 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -79,6 +79,8 @@
>  
>  #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
>  
> +#define MADV_SPLIT	26		/* Enable hugepage high-granularity APIs */
> +
>  /* compatibility flags */
>  #define MAP_FILE	0
>  
> diff --git a/mm/madvise.c b/mm/madvise.c
> index c2202f51e9dd..8c004c678262 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1006,6 +1006,28 @@ static long madvise_remove(struct vm_area_struct *vma,
>  	return error;
>  }
>  
> +static int madvise_split(struct vm_area_struct *vma,
> +			 unsigned long *new_flags)
> +{
> +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> +	if (!is_vm_hugetlb_page(vma) || !hugetlb_hgm_eligible(vma))
> +		return -EINVAL;
> +
> +	/*
> +	 * PMD sharing doesn't work with HGM. If this MADV_SPLIT is on part
> +	 * of a VMA, then we will split the VMA. Here, we're unsharing before
> +	 * splitting because it's simpler, although we may be unsharing more
> +	 * than we need.
> +	 */
> +	hugetlb_unshare_all_pmds(vma);

I think we should just unshare the (appropriately aligned) range within the
vma that is the target of MADV_SPLIT.  No need to unshare the entire vma.

> +
> +	*new_flags |= VM_HUGETLB_HGM;
> +	return 0;
> +#else
> +	return -EINVAL;
> +#endif
> +}
> +
>  /*
>   * Apply an madvise behavior to a region of a vma.  madvise_update_vma
>   * will handle splitting a vm area into separate areas, each area with its own
> @@ -1084,6 +1106,11 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
>  		break;
>  	case MADV_COLLAPSE:
>  		return madvise_collapse(vma, prev, start, end);
> +	case MADV_SPLIT:
> +		error = madvise_split(vma, &new_flags);
> +		if (error)
> +			goto out;

Not a huge deal, but if one passes an invalid range (such as one that is
not huge page size aligned) to MADV_SPLIT, then we will not notice the
error until later in madvise_update_vma(), when the vma split fails. By
then, we will have unshared all pmds in the entire vma (or just the range
if you agree with my suggestion above).
  
James Houghton Feb. 27, 2023, 3:14 p.m. UTC | #4
On Fri, Feb 24, 2023 at 3:25 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 02/18/23 00:27, James Houghton wrote:
> > Issuing madvise(MADV_SPLIT) on a HugeTLB address range will enable
> > HugeTLB HGM. MADV_SPLIT was chosen for the name so that this API can be
> > applied to non-HugeTLB memory in the future, if such an application is
> > to arise.
> >
> > MADV_SPLIT provides several API changes for some syscalls on HugeTLB
> > address ranges:
> > 1. UFFDIO_CONTINUE is allowed for MAP_SHARED VMAs at PAGE_SIZE
> >    alignment.
> > 2. read()ing a page fault event from a userfaultfd will yield a
> >    PAGE_SIZE-rounded address, instead of a huge-page-size-rounded
> >    address (unless UFFD_FEATURE_EXACT_ADDRESS is used).
> >
> > There is no way to disable the API changes that come with issuing
> > MADV_SPLIT. MADV_COLLAPSE can be used to collapse high-granularity page
> > table mappings that come from the extended functionality that comes with
> > using MADV_SPLIT.
> >
> > For post-copy live migration, the expected use-case is:
> > 1. mmap(MAP_SHARED, some_fd) primary mapping
> > 2. mmap(MAP_SHARED, some_fd) alias mapping
> > 3. MADV_SPLIT the primary mapping
> > 4. UFFDIO_REGISTER/etc. the primary mapping
> > 5. Copy memory contents into alias mapping and UFFDIO_CONTINUE the
> >    corresponding PAGE_SIZE sections in the primary mapping.
> >
> > More API changes may be added in the future.
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
> >
> > diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> > index 763929e814e9..7a26f3648b90 100644
> > --- a/arch/alpha/include/uapi/asm/mman.h
> > +++ b/arch/alpha/include/uapi/asm/mman.h
> > @@ -78,6 +78,8 @@
> >
> >  #define MADV_COLLAPSE        25              /* Synchronous hugepage collapse */
> >
> > +#define MADV_SPLIT   26              /* Enable hugepage high-granularity APIs */
> > +
> >  /* compatibility flags */
> >  #define MAP_FILE     0
> >
> > diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> > index c6e1fc77c996..f8a74a3a0928 100644
> > --- a/arch/mips/include/uapi/asm/mman.h
> > +++ b/arch/mips/include/uapi/asm/mman.h
> > @@ -105,6 +105,8 @@
> >
> >  #define MADV_COLLAPSE        25              /* Synchronous hugepage collapse */
> >
> > +#define MADV_SPLIT   26              /* Enable hugepage high-granularity APIs */
> > +
> >  /* compatibility flags */
> >  #define MAP_FILE     0
> >
> > diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> > index 68c44f99bc93..a6dc6a56c941 100644
> > --- a/arch/parisc/include/uapi/asm/mman.h
> > +++ b/arch/parisc/include/uapi/asm/mman.h
> > @@ -72,6 +72,8 @@
> >
> >  #define MADV_COLLAPSE        25              /* Synchronous hugepage collapse */
> >
> > +#define MADV_SPLIT   74              /* Enable hugepage high-granularity APIs */
> > +
> >  #define MADV_HWPOISON     100                /* poison a page for testing */
> >  #define MADV_SOFT_OFFLINE 101                /* soft offline page for testing */
> >
> > diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> > index 1ff0c858544f..f98a77c430a9 100644
> > --- a/arch/xtensa/include/uapi/asm/mman.h
> > +++ b/arch/xtensa/include/uapi/asm/mman.h
> > @@ -113,6 +113,8 @@
> >
> >  #define MADV_COLLAPSE        25              /* Synchronous hugepage collapse */
> >
> > +#define MADV_SPLIT   26              /* Enable hugepage high-granularity APIs */
> > +
> >  /* compatibility flags */
> >  #define MAP_FILE     0
> >
> > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > index 6ce1f1ceb432..996e8ded092f 100644
> > --- a/include/uapi/asm-generic/mman-common.h
> > +++ b/include/uapi/asm-generic/mman-common.h
> > @@ -79,6 +79,8 @@
> >
> >  #define MADV_COLLAPSE        25              /* Synchronous hugepage collapse */
> >
> > +#define MADV_SPLIT   26              /* Enable hugepage high-granularity APIs */
> > +
> >  /* compatibility flags */
> >  #define MAP_FILE     0
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index c2202f51e9dd..8c004c678262 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -1006,6 +1006,28 @@ static long madvise_remove(struct vm_area_struct *vma,
> >       return error;
> >  }
> >
> > +static int madvise_split(struct vm_area_struct *vma,
> > +                      unsigned long *new_flags)
> > +{
> > +#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
> > +     if (!is_vm_hugetlb_page(vma) || !hugetlb_hgm_eligible(vma))
> > +             return -EINVAL;
> > +
> > +     /*
> > +      * PMD sharing doesn't work with HGM. If this MADV_SPLIT is on part
> > +      * of a VMA, then we will split the VMA. Here, we're unsharing before
> > +      * splitting because it's simpler, although we may be unsharing more
> > +      * than we need.
> > +      */
> > +     hugetlb_unshare_all_pmds(vma);
>
> I think we should just unshare the (appropriately aligned) range within the
> vma that is the target of MADV_SPLIT.  No need to unshare the entire vma.

Right, I can do that, and I can check for appropriate alignment here
(else fail with -EINVAL).
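
For illustration only, an up-front alignment check of this kind could
look roughly like the userspace sketch below. The helper name
check_split_range() is hypothetical, the huge page size is assumed to
be a power of two, and this is not the v3 code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical check: accept a MADV_SPLIT range only if both ends are
 * aligned to the huge page size. Returns 0 if acceptable and -EINVAL
 * otherwise, following the kernel's negative-errno convention.
 */
static int check_split_range(uint64_t start, uint64_t end,
			     uint64_t hpage_size)
{
	uint64_t mask = hpage_size - 1;

	/* The mask trick below is only valid for power-of-two sizes. */
	assert(hpage_size != 0 && (hpage_size & mask) == 0);

	if ((start & mask) || (end & mask))
		return -EINVAL;
	return 0;
}
```

Running such a check before any PMDs are unshared would mean a
misaligned request fails without side effects.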

>
> > +
> > +     *new_flags |= VM_HUGETLB_HGM;
> > +     return 0;
> > +#else
> > +     return -EINVAL;
> > +#endif
> > +}
> > +
> >  /*
> >   * Apply an madvise behavior to a region of a vma.  madvise_update_vma
> >   * will handle splitting a vm area into separate areas, each area with its own
> > @@ -1084,6 +1106,11 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
> >               break;
> >       case MADV_COLLAPSE:
> >               return madvise_collapse(vma, prev, start, end);
> > +     case MADV_SPLIT:
> > +             error = madvise_split(vma, &new_flags);
> > +             if (error)
> > +                     goto out;
>
> Not a huge deal, but if one passes an invalid range (such as not huge page
> size aligned) to MADV_SPLIT, then we will not notice the error until
> later in madvise_update_vma() when the vma split fails.  By then, we will
> have unshared all pmds in the entire vma (or just the range if you agree
> with my suggestion above).

Good point. I'll fix this for v3. :) Thanks Mike.
  

Patch

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 763929e814e9..7a26f3648b90 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -78,6 +78,8 @@ 
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
 
+#define MADV_SPLIT	26		/* Enable hugepage high-granularity APIs */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index c6e1fc77c996..f8a74a3a0928 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -105,6 +105,8 @@ 
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
 
+#define MADV_SPLIT	26		/* Enable hugepage high-granularity APIs */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 68c44f99bc93..a6dc6a56c941 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -72,6 +72,8 @@ 
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
 
+#define MADV_SPLIT	74		/* Enable hugepage high-granularity APIs */
+
 #define MADV_HWPOISON     100		/* poison a page for testing */
 #define MADV_SOFT_OFFLINE 101		/* soft offline page for testing */
 
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 1ff0c858544f..f98a77c430a9 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -113,6 +113,8 @@ 
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
 
+#define MADV_SPLIT	26		/* Enable hugepage high-granularity APIs */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..996e8ded092f 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -79,6 +79,8 @@ 
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
 
+#define MADV_SPLIT	26		/* Enable hugepage high-granularity APIs */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/mm/madvise.c b/mm/madvise.c
index c2202f51e9dd..8c004c678262 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1006,6 +1006,28 @@  static long madvise_remove(struct vm_area_struct *vma,
 	return error;
 }
 
+static int madvise_split(struct vm_area_struct *vma,
+			 unsigned long *new_flags)
+{
+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+	if (!is_vm_hugetlb_page(vma) || !hugetlb_hgm_eligible(vma))
+		return -EINVAL;
+
+	/*
+	 * PMD sharing doesn't work with HGM. If this MADV_SPLIT is on part
+	 * of a VMA, then we will split the VMA. Here, we're unsharing before
+	 * splitting because it's simpler, although we may be unsharing more
+	 * than we need.
+	 */
+	hugetlb_unshare_all_pmds(vma);
+
+	*new_flags |= VM_HUGETLB_HGM;
+	return 0;
+#else
+	return -EINVAL;
+#endif
+}
+
 /*
  * Apply an madvise behavior to a region of a vma.  madvise_update_vma
  * will handle splitting a vm area into separate areas, each area with its own
@@ -1084,6 +1106,11 @@  static int madvise_vma_behavior(struct vm_area_struct *vma,
 		break;
 	case MADV_COLLAPSE:
 		return madvise_collapse(vma, prev, start, end);
+	case MADV_SPLIT:
+		error = madvise_split(vma, &new_flags);
+		if (error)
+			goto out;
+		break;
 	}
 
 	anon_name = anon_vma_name(vma);
@@ -1178,6 +1205,9 @@  madvise_behavior_valid(int behavior)
 	case MADV_HUGEPAGE:
 	case MADV_NOHUGEPAGE:
 	case MADV_COLLAPSE:
+#endif
+#ifdef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+	case MADV_SPLIT:
 #endif
 	case MADV_DONTDUMP:
 	case MADV_DODUMP:
@@ -1368,6 +1398,8 @@  int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
  *		transparent huge pages so the existing pages will not be
  *		coalesced into THP and new pages will not be allocated as THP.
  *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
+ *  MADV_SPLIT - allow HugeTLB pages to be mapped at PAGE_SIZE. This allows
+ *		UFFDIO_CONTINUE to accept PAGE_SIZE-aligned regions.
  *  MADV_DONTDUMP - the application wants to prevent pages in the given range
  *		from being included in its core dump.
  *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.