[RFC,v2,33/47] userfaultfd: add UFFD_FEATURE_MINOR_HUGETLBFS_HGM

Message ID 20221021163703.3218176-34-jthoughton@google.com
State New
Headers
Series hugetlb: introduce HugeTLB high-granularity mapping |

Commit Message

James Houghton Oct. 21, 2022, 4:36 p.m. UTC
  Userspace must provide this new feature when it calls UFFDIO_API to
enable HGM. Userspace can check if the feature exists in
uffdio_api.features, and if it does not exist, the kernel does not
support and therefore did not enable HGM.

Signed-off-by: James Houghton <jthoughton@google.com>
---
 fs/userfaultfd.c                 | 12 +++++++++++-
 include/linux/userfaultfd_k.h    |  7 +++++++
 include/uapi/linux/userfaultfd.h |  2 ++
 3 files changed, 20 insertions(+), 1 deletion(-)
  

Comments

Peter Xu Nov. 16, 2022, 10:28 p.m. UTC | #1
On Fri, Oct 21, 2022 at 04:36:49PM +0000, James Houghton wrote:
> Userspace must provide this new feature when it calls UFFDIO_API to
> enable HGM. Userspace can check if the feature exists in
> uffdio_api.features, and if it does not exist, the kernel does not
> support and therefore did not enable HGM.
> 
> Signed-off-by: James Houghton <jthoughton@google.com>

It's still slightly a pity that this can only be enabled by an uffd context
plus a minor fault, so generic hugetlb users cannot directly leverage this.

The patch itself looks good.
  
James Houghton Nov. 16, 2022, 11:30 p.m. UTC | #2
On Wed, Nov 16, 2022 at 2:28 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Oct 21, 2022 at 04:36:49PM +0000, James Houghton wrote:
> > Userspace must provide this new feature when it calls UFFDIO_API to
> > enable HGM. Userspace can check if the feature exists in
> > uffdio_api.features, and if it does not exist, the kernel does not
> > support and therefore did not enable HGM.
> >
> > Signed-off-by: James Houghton <jthoughton@google.com>
>
> It's still slightly a pity that this can only be enabled by an uffd context
> plus a minor fault, so generic hugetlb users cannot directly leverage this.

The idea here is that, for applications that can conceivably benefit
from HGM, we have a mechanism for enabling it for that application. So
this patch creates that mechanism for userfaultfd/UFFDIO_CONTINUE. I
prefer this approach over something more general like MADV_ENABLE_HGM
or something.

For hwpoison, HGM will be automatically enabled, but that isn't
implemented in this series. We could also extend MADV_DONTNEED to do
high-granularity unmapping in some way, but that also isn't attempted
here. I'm sure that if there are other cases where HGM may be useful,
we can add/change some uapi to make it possible to take advantage HGM.

- James

>
> The patch itself looks good.
>
> --
> Peter Xu
>
  
Peter Xu Dec. 21, 2022, 7:23 p.m. UTC | #3
James,

On Wed, Nov 16, 2022 at 03:30:00PM -0800, James Houghton wrote:
> On Wed, Nov 16, 2022 at 2:28 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Fri, Oct 21, 2022 at 04:36:49PM +0000, James Houghton wrote:
> > > Userspace must provide this new feature when it calls UFFDIO_API to
> > > enable HGM. Userspace can check if the feature exists in
> > > uffdio_api.features, and if it does not exist, the kernel does not
> > > support and therefore did not enable HGM.
> > >
> > > Signed-off-by: James Houghton <jthoughton@google.com>
> >
> > It's still slightly a pity that this can only be enabled by an uffd context
> > plus a minor fault, so generic hugetlb users cannot directly leverage this.
> 
> The idea here is that, for applications that can conceivably benefit
> from HGM, we have a mechanism for enabling it for that application. So
> this patch creates that mechanism for userfaultfd/UFFDIO_CONTINUE. I
> prefer this approach over something more general like MADV_ENABLE_HGM
> or something.

Sorry to get back to this very late - I know this has been discussed since
the very early stage of the feature, but is there any reasoning behind?

When I start to think seriously on applying this to process snapshot with
uffd-wp I found that the minor mode trick won't easily play - normally
that's a case where all the pages were there mapped huge, but when the app
wants UFFDIO_WRITEPROTECT it may want to remap the huge pages into smaller
pages, probably some size that the user can specify.  It'll be non-trivial
to enable HGM during that phase using MINOR mode because in that case the
pages are all mapped.

For the long term, I am just still worried the current interface is still
not as flexible.
  
James Houghton Dec. 21, 2022, 8:21 p.m. UTC | #4
On Wed, Dec 21, 2022 at 2:23 PM Peter Xu <peterx@redhat.com> wrote:
>
> James,
>
> On Wed, Nov 16, 2022 at 03:30:00PM -0800, James Houghton wrote:
> > On Wed, Nov 16, 2022 at 2:28 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Fri, Oct 21, 2022 at 04:36:49PM +0000, James Houghton wrote:
> > > > Userspace must provide this new feature when it calls UFFDIO_API to
> > > > enable HGM. Userspace can check if the feature exists in
> > > > uffdio_api.features, and if it does not exist, the kernel does not
> > > > support and therefore did not enable HGM.
> > > >
> > > > Signed-off-by: James Houghton <jthoughton@google.com>
> > >
> > > It's still slightly a pity that this can only be enabled by an uffd context
> > > plus a minor fault, so generic hugetlb users cannot directly leverage this.
> >
> > The idea here is that, for applications that can conceivably benefit
> > from HGM, we have a mechanism for enabling it for that application. So
> > this patch creates that mechanism for userfaultfd/UFFDIO_CONTINUE. I
> > prefer this approach over something more general like MADV_ENABLE_HGM
> > or something.
>
> Sorry to get back to this very late - I know this has been discussed since
> the very early stage of the feature, but is there any reasoning behind?
>
> When I start to think seriously on applying this to process snapshot with
> uffd-wp I found that the minor mode trick won't easily play - normally
> that's a case where all the pages were there mapped huge, but when the app
> wants UFFDIO_WRITEPROTECT it may want to remap the huge pages into smaller
> pages, probably some size that the user can specify.  It'll be non-trivial
> to enable HGM during that phase using MINOR mode because in that case the
> pages are all mapped.
>
> For the long term, I am just still worried the current interface is still
> not as flexible.

Thanks for bringing this up, Peter. I think the main reason was:
having separate UFFD_FEATUREs clearly indicates to userspace what is
and is not supported.

For UFFDIO_WRITEPROTECT, a user could remap huge pages into smaller
pages by issuing a high-granularity UFFDIO_WRITEPROTECT. That isn't
allowed as of this patch series, but it could be allowed in the
future. To add support in the same way as this series, we would add
another feature, say UFFD_FEATURE_WP_HUGETLBFS_HGM. I agree that
having to add another feature isn't great; is this what you're
concerned about?

Considering MADV_ENABLE_HUGETLB...
1. If a user provides this, then the contract becomes: "the kernel may
allow UFFDIO_CONTINUE and UFFDIO_WRITEPROTECT for HugeTLB at
high-granularities, provided the support exists", but it becomes
unclear to userspace to know what's supported and what isn't.
2. We would then need to keep track if a user explicitly enabled it,
or if it got enabled automatically in response to memory poison, for
example. Not a big problem, just a complication. (Otherwise, if HGM
got enabled for poison, suddenly userspace would be allowed to do
things it wasn't allowed to do before.)
3. This API makes sense for enabling HGM for something outside of
userfaultfd, like MADV_DONTNEED.

Maybe (1) is solvable if we provide a bit field that describes what's
supported, or maybe (1) isn't even a problem.

Another possibility is to have a feature like
UFFD_FEATURE_HUGETLB_HGM, which will enable the possibility of HGM for
all relevant userfaultfd ioctls, but we have the same problem where
it's unclear what's supported and what isn't.

I'm happy to change the API to whatever you think makes the most sense.

Thanks!
- James

>
> --
> Peter Xu
>
  
Mike Kravetz Dec. 21, 2022, 9:39 p.m. UTC | #5
On 12/21/22 15:21, James Houghton wrote:
> On Wed, Dec 21, 2022 at 2:23 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > James,
> >
> > On Wed, Nov 16, 2022 at 03:30:00PM -0800, James Houghton wrote:
> > > On Wed, Nov 16, 2022 at 2:28 PM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > On Fri, Oct 21, 2022 at 04:36:49PM +0000, James Houghton wrote:
> > > > > Userspace must provide this new feature when it calls UFFDIO_API to
> > > > > enable HGM. Userspace can check if the feature exists in
> > > > > uffdio_api.features, and if it does not exist, the kernel does not
> > > > > support and therefore did not enable HGM.
> > > > >
> > > > > Signed-off-by: James Houghton <jthoughton@google.com>
> > > >
> > > > It's still slightly a pity that this can only be enabled by an uffd context
> > > > plus a minor fault, so generic hugetlb users cannot directly leverage this.
> > >
> > > The idea here is that, for applications that can conceivably benefit
> > > from HGM, we have a mechanism for enabling it for that application. So
> > > this patch creates that mechanism for userfaultfd/UFFDIO_CONTINUE. I
> > > prefer this approach over something more general like MADV_ENABLE_HGM
> > > or something.
> >
> > Sorry to get back to this very late - I know this has been discussed since
> > the very early stage of the feature, but is there any reasoning behind?
> >
> > When I start to think seriously on applying this to process snapshot with
> > uffd-wp I found that the minor mode trick won't easily play - normally
> > that's a case where all the pages were there mapped huge, but when the app
> > wants UFFDIO_WRITEPROTECT it may want to remap the huge pages into smaller
> > pages, probably some size that the user can specify.  It'll be non-trivial
> > to enable HGM during that phase using MINOR mode because in that case the
> > pages are all mapped.
> >
> > For the long term, I am just still worried the current interface is still
> > not as flexible.
> 
> Thanks for bringing this up, Peter. I think the main reason was:
> having separate UFFD_FEATUREs clearly indicates to userspace what is
> and is not supported.

IIRC, I think we wanted to initially limit the usage to the very
specific use case (live migration).  The idea is that we could then
expand usage as more use cases came to light.

Another good thing is that userfaultfd has versioning built into the
API.  Thus a user can determine if HGM is enabled in their running
kernel.

> For UFFDIO_WRITEPROTECT, a user could remap huge pages into smaller
> pages by issuing a high-granularity UFFDIO_WRITEPROTECT. That isn't
> allowed as of this patch series, but it could be allowed in the
> future. To add support in the same way as this series, we would add
> another feature, say UFFD_FEATURE_WP_HUGETLBFS_HGM. I agree that
> having to add another feature isn't great; is this what you're
> concerned about?
> 
> Considering MADV_ENABLE_HUGETLB...
> 1. If a user provides this, then the contract becomes: "the kernel may
> allow UFFDIO_CONTINUE and UFFDIO_WRITEPROTECT for HugeTLB at
> high-granularities, provided the support exists", but it becomes
> unclear to userspace to know what's supported and what isn't.
> 2. We would then need to keep track if a user explicitly enabled it,
> or if it got enabled automatically in response to memory poison, for
> example. Not a big problem, just a complication. (Otherwise, if HGM
> got enabled for poison, suddenly userspace would be allowed to do
> things it wasn't allowed to do before.)
> 3. This API makes sense for enabling HGM for something outside of
> userfaultfd, like MADV_DONTNEED.

I think #3 is key here.  Once we start applying HGM to things outside
userfaultfd, then more thought will be required on APIs.  The API is
somewhat limited by design until the basic functionality is in place.
  
Peter Xu Dec. 21, 2022, 10:10 p.m. UTC | #6
On Wed, Dec 21, 2022 at 01:39:39PM -0800, Mike Kravetz wrote:
> On 12/21/22 15:21, James Houghton wrote:
> > On Wed, Dec 21, 2022 at 2:23 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > James,
> > >
> > > On Wed, Nov 16, 2022 at 03:30:00PM -0800, James Houghton wrote:
> > > > On Wed, Nov 16, 2022 at 2:28 PM Peter Xu <peterx@redhat.com> wrote:
> > > > >
> > > > > On Fri, Oct 21, 2022 at 04:36:49PM +0000, James Houghton wrote:
> > > > > > Userspace must provide this new feature when it calls UFFDIO_API to
> > > > > > enable HGM. Userspace can check if the feature exists in
> > > > > > uffdio_api.features, and if it does not exist, the kernel does not
> > > > > > support and therefore did not enable HGM.
> > > > > >
> > > > > > Signed-off-by: James Houghton <jthoughton@google.com>
> > > > >
> > > > > It's still slightly a pity that this can only be enabled by an uffd context
> > > > > plus a minor fault, so generic hugetlb users cannot directly leverage this.
> > > >
> > > > The idea here is that, for applications that can conceivably benefit
> > > > from HGM, we have a mechanism for enabling it for that application. So
> > > > this patch creates that mechanism for userfaultfd/UFFDIO_CONTINUE. I
> > > > prefer this approach over something more general like MADV_ENABLE_HGM
> > > > or something.
> > >
> > > Sorry to get back to this very late - I know this has been discussed since
> > > the very early stage of the feature, but is there any reasoning behind?
> > >
> > > When I start to think seriously on applying this to process snapshot with
> > > uffd-wp I found that the minor mode trick won't easily play - normally
> > > that's a case where all the pages were there mapped huge, but when the app
> > > wants UFFDIO_WRITEPROTECT it may want to remap the huge pages into smaller
> > > pages, probably some size that the user can specify.  It'll be non-trivial
> > > to enable HGM during that phase using MINOR mode because in that case the
> > > pages are all mapped.
> > >
> > > For the long term, I am just still worried the current interface is still
> > > not as flexible.
> > 
> > Thanks for bringing this up, Peter. I think the main reason was:
> > having separate UFFD_FEATUREs clearly indicates to userspace what is
> > and is not supported.
> 
> IIRC, I think we wanted to initially limit the usage to the very
> specific use case (live migration).  The idea is that we could then
> expand usage as more use cases came to light.
> 
> Another good thing is that userfaultfd has versioning built into the
> API.  Thus a user can determine if HGM is enabled in their running
> kernel.

I don't worry much on this one, afaiu if we have any way to enable hgm then
the user can just try enabling it on a test vma, just like when an app
wants to detect whether a new madvise() is present on the current host OS.

Besides, I'm wondering whether something like /sys/kernel/vm/hugepages/hgm
would work too.

> 
> > For UFFDIO_WRITEPROTECT, a user could remap huge pages into smaller
> > pages by issuing a high-granularity UFFDIO_WRITEPROTECT. That isn't
> > allowed as of this patch series, but it could be allowed in the
> > future. To add support in the same way as this series, we would add
> > another feature, say UFFD_FEATURE_WP_HUGETLBFS_HGM. I agree that
> > having to add another feature isn't great; is this what you're
> > concerned about?
> > 
> > Considering MADV_ENABLE_HUGETLB...
> > 1. If a user provides this, then the contract becomes: "the kernel may
> > allow UFFDIO_CONTINUE and UFFDIO_WRITEPROTECT for HugeTLB at
> > high-granularities, provided the support exists", but it becomes
> > unclear to userspace to know what's supported and what isn't.
> > 2. We would then need to keep track if a user explicitly enabled it,
> > or if it got enabled automatically in response to memory poison, for
> > example. Not a big problem, just a complication. (Otherwise, if HGM
> > got enabled for poison, suddenly userspace would be allowed to do
> > things it wasn't allowed to do before.)

We could alternatively have two flags for each vma: (a) hgm_advised and (b)
hgm_enabled.  (a) always sets (b) but not vice versa.  We can limit poison
to set (b) only.  For this patchset, it can be all about (a).

> > 3. This API makes sense for enabling HGM for something outside of
> > userfaultfd, like MADV_DONTNEED.
> 
> I think #3 is key here.  Once we start applying HGM to things outside
> userfaultfd, then more thought will be required on APIs.  The API is
> somewhat limited by design until the basic functionality is in place.

Mike, could you elaborate what's the major concern of having hgm used
outside uffd and live migration use cases?

I feel like I miss something here.  I can understand we want to limit the
usage only when the user specifies using hgm because we want to keep the
old behavior intact.  However if we want another way to enable hgm it'll
still need one knob anyway even outside uffd, and I thought that'll service
the same purpose, or maybe not?

Thanks,
  
Mike Kravetz Dec. 21, 2022, 10:31 p.m. UTC | #7
On 12/21/22 17:10, Peter Xu wrote:
> On Wed, Dec 21, 2022 at 01:39:39PM -0800, Mike Kravetz wrote:
> > On 12/21/22 15:21, James Houghton wrote:
> > > On Wed, Dec 21, 2022 at 2:23 PM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > James,
> > > >
> > > > On Wed, Nov 16, 2022 at 03:30:00PM -0800, James Houghton wrote:
> > > > > On Wed, Nov 16, 2022 at 2:28 PM Peter Xu <peterx@redhat.com> wrote:
> > > > > >
> > > > > > On Fri, Oct 21, 2022 at 04:36:49PM +0000, James Houghton wrote:
> > > > > > > Userspace must provide this new feature when it calls UFFDIO_API to
> > > > > > > enable HGM. Userspace can check if the feature exists in
> > > > > > > uffdio_api.features, and if it does not exist, the kernel does not
> > > > > > > support and therefore did not enable HGM.
> > > > > > >
> > > > > > > Signed-off-by: James Houghton <jthoughton@google.com>
> > > > > >
> > > > > > It's still slightly a pity that this can only be enabled by an uffd context
> > > > > > plus a minor fault, so generic hugetlb users cannot directly leverage this.
> > > > >
> > > > > The idea here is that, for applications that can conceivably benefit
> > > > > from HGM, we have a mechanism for enabling it for that application. So
> > > > > this patch creates that mechanism for userfaultfd/UFFDIO_CONTINUE. I
> > > > > prefer this approach over something more general like MADV_ENABLE_HGM
> > > > > or something.
> > > >
> > > > Sorry to get back to this very late - I know this has been discussed since
> > > > the very early stage of the feature, but is there any reasoning behind?
> > > >
> > > > When I start to think seriously on applying this to process snapshot with
> > > > uffd-wp I found that the minor mode trick won't easily play - normally
> > > > that's a case where all the pages were there mapped huge, but when the app
> > > > wants UFFDIO_WRITEPROTECT it may want to remap the huge pages into smaller
> > > > pages, probably some size that the user can specify.  It'll be non-trivial
> > > > to enable HGM during that phase using MINOR mode because in that case the
> > > > pages are all mapped.
> > > >
> > > > For the long term, I am just still worried the current interface is still
> > > > not as flexible.
> > > 
> > > Thanks for bringing this up, Peter. I think the main reason was:
> > > having separate UFFD_FEATUREs clearly indicates to userspace what is
> > > and is not supported.
> > 
> > IIRC, I think we wanted to initially limit the usage to the very
> > specific use case (live migration).  The idea is that we could then
> > expand usage as more use cases came to light.
> > 
> > Another good thing is that userfaultfd has versioning built into the
> > API.  Thus a user can determine if HGM is enabled in their running
> > kernel.
> 
> I don't worry much on this one, afaiu if we have any way to enable hgm then
> the user can just try enabling it on a test vma, just like when an app
> wants to detect whether a new madvise() is present on the current host OS.
> 
> Besides, I'm wondering whether something like /sys/kernel/vm/hugepages/hgm
> would work too.
> 
> > 
> > > For UFFDIO_WRITEPROTECT, a user could remap huge pages into smaller
> > > pages by issuing a high-granularity UFFDIO_WRITEPROTECT. That isn't
> > > allowed as of this patch series, but it could be allowed in the
> > > future. To add support in the same way as this series, we would add
> > > another feature, say UFFD_FEATURE_WP_HUGETLBFS_HGM. I agree that
> > > having to add another feature isn't great; is this what you're
> > > concerned about?
> > > 
> > > Considering MADV_ENABLE_HUGETLB...
> > > 1. If a user provides this, then the contract becomes: "the kernel may
> > > allow UFFDIO_CONTINUE and UFFDIO_WRITEPROTECT for HugeTLB at
> > > high-granularities, provided the support exists", but it becomes
> > > unclear to userspace to know what's supported and what isn't.
> > > 2. We would then need to keep track if a user explicitly enabled it,
> > > or if it got enabled automatically in response to memory poison, for
> > > example. Not a big problem, just a complication. (Otherwise, if HGM
> > > got enabled for poison, suddenly userspace would be allowed to do
> > > things it wasn't allowed to do before.)
> 
> We could alternatively have two flags for each vma: (a) hgm_advised and (b)
> hgm_enabled.  (a) always sets (b) but not vice versa.  We can limit poison
> to set (b) only.  For this patchset, it can be all about (a).
> 
> > > 3. This API makes sense for enabling HGM for something outside of
> > > userfaultfd, like MADV_DONTNEED.
> > 
> > I think #3 is key here.  Once we start applying HGM to things outside
> > userfaultfd, then more thought will be required on APIs.  The API is
> > somewhat limited by design until the basic functionality is in place.
> 
> Mike, could you elaborate what's the major concern of having hgm used
> outside uffd and live migration use cases?
> 
> I feel like I miss something here.  I can understand we want to limit the
> usage only when the user specifies using hgm because we want to keep the
> old behavior intact.  However if we want another way to enable hgm it'll
> still need one knob anyway even outside uffd, and I thought that'll service
> the same purpose, or maybe not?

I am not opposed to using hgm outside the use cases targeted by this series.

It seems that when we were previously discussing the API we spent a bunch of
time going around in circles trying to get the API correct.  That is expected
as it is more difficult to take all users/uses/abuses of the API into account.

Since the initial use case was fairly limited, it seemed like a good idea to
limit the API to userfaultfd.  In this way we could focus on the underlying
code/implementation and then expand as needed.  Of course, with an eye on
anything that may be a limiting factor in the future.

I was not aware of the uffd-wp use case, and am more than happy to discuss
expanding the API.
  
James Houghton Dec. 22, 2022, 12:02 a.m. UTC | #8
On Wed, Dec 21, 2022 at 5:32 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 12/21/22 17:10, Peter Xu wrote:
> > On Wed, Dec 21, 2022 at 01:39:39PM -0800, Mike Kravetz wrote:
> > > On 12/21/22 15:21, James Houghton wrote:
> > > > Thanks for bringing this up, Peter. I think the main reason was:
> > > > having separate UFFD_FEATUREs clearly indicates to userspace what is
> > > > and is not supported.
> > >
> > > IIRC, I think we wanted to initially limit the usage to the very
> > > specific use case (live migration).  The idea is that we could then
> > > expand usage as more use cases came to light.
> > >
> > > Another good thing is that userfaultfd has versioning built into the
> > > API.  Thus a user can determine if HGM is enabled in their running
> > > kernel.
> >
> > I don't worry much on this one, afaiu if we have any way to enable hgm then
> > the user can just try enabling it on a test vma, just like when an app
> > wants to detect whether a new madvise() is present on the current host OS.

That would be enough to test if HGM was merely present, but if
specific features like 4K UFFDIO_CONTINUEs or 4K UFFDIO_WRITEPROTECTs
were available. You could always check these by making a HugeTLB VMA
and setting it up correctly for userfaultfd/etc., but that's a little
messy.

> >
> > Besides, I'm wondering whether something like /sys/kernel/vm/hugepages/hgm
> > would work too.

I'm not opposed to this.

> >
> > >
> > > > For UFFDIO_WRITEPROTECT, a user could remap huge pages into smaller
> > > > pages by issuing a high-granularity UFFDIO_WRITEPROTECT. That isn't
> > > > allowed as of this patch series, but it could be allowed in the
> > > > future. To add support in the same way as this series, we would add
> > > > another feature, say UFFD_FEATURE_WP_HUGETLBFS_HGM. I agree that
> > > > having to add another feature isn't great; is this what you're
> > > > concerned about?
> > > >
> > > > Considering MADV_ENABLE_HUGETLB...
> > > > 1. If a user provides this, then the contract becomes: "the kernel may
> > > > allow UFFDIO_CONTINUE and UFFDIO_WRITEPROTECT for HugeTLB at
> > > > high-granularities, provided the support exists", but it becomes
> > > > unclear to userspace to know what's supported and what isn't.
> > > > 2. We would then need to keep track if a user explicitly enabled it,
> > > > or if it got enabled automatically in response to memory poison, for
> > > > example. Not a big problem, just a complication. (Otherwise, if HGM
> > > > got enabled for poison, suddenly userspace would be allowed to do
> > > > things it wasn't allowed to do before.)
> >
> > We could alternatively have two flags for each vma: (a) hgm_advised and (b)
> > hgm_enabled.  (a) always sets (b) but not vice versa.  We can limit poison
> > to set (b) only.  For this patchset, it can be all about (a).

My thoughts exactly. :)

> >
> > > > 3. This API makes sense for enabling HGM for something outside of
> > > > userfaultfd, like MADV_DONTNEED.
> > >
> > > I think #3 is key here.  Once we start applying HGM to things outside
> > > userfaultfd, then more thought will be required on APIs.  The API is
> > > somewhat limited by design until the basic functionality is in place.
> >
> > Mike, could you elaborate what's the major concern of having hgm used
> > outside uffd and live migration use cases?
> >
> > I feel like I miss something here.  I can understand we want to limit the
> > usage only when the user specifies using hgm because we want to keep the
> > old behavior intact.  However if we want another way to enable hgm it'll
> > still need one knob anyway even outside uffd, and I thought that'll service
> > the same purpose, or maybe not?
>
> I am not opposed to using hgm outside the use cases targeted by this series.
>
> It seems that when we were previously discussing the API we spent a bunch of
> time going around in circles trying to get the API correct.  That is expected
> as it is more difficult to take all users/uses/abuses of the API into account.
>
> Since the initial use case was fairly limited, it seemed like a good idea to
> limit the API to userfaultfd.  In this way we could focus on the underlying
> code/implementation and then expand as needed.  Of course, with an eye on
> anything that may be a limiting factor in the future.
>
> I was not aware of the uffd-wp use case, and am more than happy to discuss
> expanding the API.

So considering two API choices:

1. What we have now: UFFD_FEATURE_MINOR_HUGETLBFS_HGM for
UFFDIO_CONTINUE, and later UFFD_FEATURE_WP_HUGETLBFS_HGM for
UFFDIO_WRITEPROTECT. For MADV_DONTNEED, we could just suddenly start
allowing high-granularity choices (not sure if this is bad; we started
allowing it for HugeTLB recently with no other API change, AFAIA).

2. MADV_ENABLE_HGM or something similar. The changes to
UFFDIO_CONTINUE/UFFDIO_WRITEPROTECT/MADV_DONTNEED come automatically,
provided they are implemented.

I don't mind one way or the other. Peter, I assume you prefer #2.
Mike, what about you? If we decide on something other than #1, I'll
make the change before sending v1 out.

- James
  
Mike Kravetz Dec. 22, 2022, 12:38 a.m. UTC | #9
On 12/21/22 19:02, James Houghton wrote:
> On Wed, Dec 21, 2022 at 5:32 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >
> > On 12/21/22 17:10, Peter Xu wrote:
> > > On Wed, Dec 21, 2022 at 01:39:39PM -0800, Mike Kravetz wrote:
> > > > On 12/21/22 15:21, James Houghton wrote:
> > > > > Thanks for bringing this up, Peter. I think the main reason was:
> > > > > having separate UFFD_FEATUREs clearly indicates to userspace what is
> > > > > and is not supported.
> > > >
> > > > IIRC, I think we wanted to initially limit the usage to the very
> > > > specific use case (live migration).  The idea is that we could then
> > > > expand usage as more use cases came to light.
> > > >
> > > > Another good thing is that userfaultfd has versioning built into the
> > > > API.  Thus a user can determine if HGM is enabled in their running
> > > > kernel.
> > >
> > > I don't worry much on this one, afaiu if we have any way to enable hgm then
> > > the user can just try enabling it on a test vma, just like when an app
> > > wants to detect whether a new madvise() is present on the current host OS.
> 
> That would be enough to test if HGM was merely present, but if
> specific features like 4K UFFDIO_CONTINUEs or 4K UFFDIO_WRITEPROTECTs
> were available. You could always check these by making a HugeTLB VMA
> and setting it up correctly for userfaultfd/etc., but that's a little
> messy.
> 
> > >
> > > Besides, I'm wondering whether something like /sys/kernel/vm/hugepages/hgm
> > > would work too.
> 
> I'm not opposed to this.
> 
> > >
> > > >
> > > > > For UFFDIO_WRITEPROTECT, a user could remap huge pages into smaller
> > > > > pages by issuing a high-granularity UFFDIO_WRITEPROTECT. That isn't
> > > > > allowed as of this patch series, but it could be allowed in the
> > > > > future. To add support in the same way as this series, we would add
> > > > > another feature, say UFFD_FEATURE_WP_HUGETLBFS_HGM. I agree that
> > > > > having to add another feature isn't great; is this what you're
> > > > > concerned about?
> > > > >
> > > > > Considering MADV_ENABLE_HUGETLB...
> > > > > 1. If a user provides this, then the contract becomes: "the kernel may
> > > > > allow UFFDIO_CONTINUE and UFFDIO_WRITEPROTECT for HugeTLB at
> > > > > high-granularities, provided the support exists", but it becomes
> > > > > unclear to userspace to know what's supported and what isn't.
> > > > > 2. We would then need to keep track if a user explicitly enabled it,
> > > > > or if it got enabled automatically in response to memory poison, for
> > > > > example. Not a big problem, just a complication. (Otherwise, if HGM
> > > > > got enabled for poison, suddenly userspace would be allowed to do
> > > > > things it wasn't allowed to do before.)
> > >
> > > We could alternatively have two flags for each vma: (a) hgm_advised and (b)
> > > hgm_enabled.  (a) always sets (b) but not vice versa.  We can limit poison
> > > to set (b) only.  For this patchset, it can be all about (a).
> 
> My thoughts exactly. :)
> 
> > >
> > > > > 3. This API makes sense for enabling HGM for something outside of
> > > > > userfaultfd, like MADV_DONTNEED.
> > > >
> > > > I think #3 is key here.  Once we start applying HGM to things outside
> > > > userfaultfd, then more thought will be required on APIs.  The API is
> > > > somewhat limited by design until the basic functionality is in place.
> > >
> > > Mike, could you elaborate what's the major concern of having hgm used
> > > outside uffd and live migration use cases?
> > >
> > > I feel like I miss something here.  I can understand we want to limit the
> > > usage only when the user specifies using hgm because we want to keep the
> > > old behavior intact.  However if we want another way to enable hgm it'll
> > > still need one knob anyway even outside uffd, and I thought that'll service
> > > the same purpose, or maybe not?
> >
> > I am not opposed to using hgm outside the use cases targeted by this series.
> >
> > It seems that when we were previously discussing the API we spent a bunch of
> > time going around in circles trying to get the API correct.  That is expected
> > as it is more difficult to take all users/uses/abuses of the API into account.
> >
> > Since the initial use case was fairly limited, it seemed like a good idea to
> > limit the API to userfaultfd.  In this way we could focus on the underlying
> > code/implementation and then expand as needed.  Of course, with an eye on
> > anything that may be a limiting factor in the future.
> >
> > I was not aware of the uffd-wp use case, and am more than happy to discuss
> > expanding the API.
> 
> So considering two API choices:
> 
> 1. What we have now: UFFD_FEATURE_MINOR_HUGETLBFS_HGM for
> UFFDIO_CONTINUE, and later UFFD_FEATURE_WP_HUGETLBFS_HGM for
> UFFDIO_WRITEPROTECT. For MADV_DONTNEED, we could just suddenly start
> allowing high-granularity choices (not sure if this is bad; we started
> allowing it for HugeTLB recently with no other API change, AFAIA).

I don't think we can just start allowing HGM for MADV_DONTNEED without
some type of user interaction/request.  Otherwise, a user that passes
in non-hugetlb page size requests may get unexpected results.  And, one
of the threads about MADV_DONTNEED points out a valid use cases where
the caller may not know the mapping is hugetlb or not and is likely to
pass in non-hugetlb page size requests.

> 2. MADV_ENABLE_HGM or something similar. The changes to
> UFFDIO_CONTINUE/UFFDIO_WRITEPROTECT/MADV_DONTNEED come automatically,
> provided they are implemented.
> 
> I don't mind one way or the other. Peter, I assume you prefer #2.
> Mike, what about you? If we decide on something other than #1, I'll
> make the change before sending v1 out.

Since I do not believe 1) is an option, MADV_ENABLE_HGM might be the way
to go.  Any thoughts about MADV_ENABLE_HGM?  I'm thinking:
- Make it have same restrictions as other madvise hugetlb calls,
  . addr must be huge page aligned
  . length is rounded down to a multiple of huge page size
- We split the vma as required
- Flags carrying HGM state reside in the hugetlb_shared_vma_data struct
  
James Houghton Dec. 22, 2022, 1:24 a.m. UTC | #10
> > So considering two API choices:
> >
> > 1. What we have now: UFFD_FEATURE_MINOR_HUGETLBFS_HGM for
> > UFFDIO_CONTINUE, and later UFFD_FEATURE_WP_HUGETLBFS_HGM for
> > UFFDIO_WRITEPROTECT. For MADV_DONTNEED, we could just suddenly start
> > allowing high-granularity choices (not sure if this is bad; we started
> > allowing it for HugeTLB recently with no other API change, AFAIA).
>
> I don't think we can just start allowing HGM for MADV_DONTNEED without
> some type of user interaction/request.  Otherwise, a user that passes
> in non-hugetlb page size requests may get unexpected results.  And, one
> of the threads about MADV_DONTNEED points out a valid use cases where
> the caller may not know the mapping is hugetlb or not and is likely to
> pass in non-hugetlb page size requests.
>
> > 2. MADV_ENABLE_HGM or something similar. The changes to
> > UFFDIO_CONTINUE/UFFDIO_WRITEPROTECT/MADV_DONTNEED come automatically,
> > provided they are implemented.
> >
> > I don't mind one way or the other. Peter, I assume you prefer #2.
> > Mike, what about you? If we decide on something other than #1, I'll
> > make the change before sending v1 out.
>
> Since I do not believe 1) is an option, MADV_ENABLE_HGM might be the way
> to go.  Any thoughts about MADV_ENABLE_HGM?  I'm thinking:
> - Make it have same restrictions as other madvise hugetlb calls,
>   . addr must be huge page aligned
>   . length is rounded down to a multiple of huge page size
> - We split the vma as required
I agree with these.
> - Flags carrying HGM state reside in the hugetlb_shared_vma_data struct
I actually changed this in v1 to storing HGM state as a VMA flag to
avoid problems with splitting VMAs (like, when we split a VMA, it's
possible the VMA data/lock struct doesn't get allocated). It seems
better to me; I can change it back if you disagree.

Not sure what the best name for this flag is either. MADV_ENABLE_HGM
sounds ok. MADV_HUGETLB_HGM or MADV_HUGETLB_SMALL_PAGES could work
too. No need to figure it out now.

Thanks Mike and Peter :) I'll make this change for v1 and send it out
sometime soon.

- James
  
Peter Xu Dec. 22, 2022, 2:30 p.m. UTC | #11
On Wed, Dec 21, 2022 at 08:24:45PM -0500, James Houghton wrote:
> Not sure what the best name for this flag is either. MADV_ENABLE_HGM
> sounds ok. MADV_HUGETLB_HGM or MADV_HUGETLB_SMALL_PAGES could work
> too. No need to figure it out now.

One more option to consider is MADV_SPLIT (hopefully to be more generic).

We already decided to reuse thp MADV_COLLAPSE, we can also introduce
MADV_SPLIT and leave thp for later if it can be anything helpful (I
remember we used to discuss this for thp split).

For hugetlb one SPLIT should enable hgm advise bit on the vma forever.

Thanks,
  
James Houghton Dec. 27, 2022, 5:02 p.m. UTC | #12
On Thu, Dec 22, 2022 at 9:30 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Dec 21, 2022 at 08:24:45PM -0500, James Houghton wrote:
> > Not sure what the best name for this flag is either. MADV_ENABLE_HGM
> > sounds ok. MADV_HUGETLB_HGM or MADV_HUGETLB_SMALL_PAGES could work
> > too. No need to figure it out now.
>
> One more option to consider is MADV_SPLIT (hopefully to be more generic).
>
> We already decided to reuse thp MADV_COLLAPSE, we can also introduce
> MADV_SPLIT and leave thp for later if it can be anything helpful (I
> remember we used to discuss this for thp split).
>
> For hugetlb one SPLIT should enable hgm advise bit on the vma forever.

MADV_SPLIT sounds okay to me -- we'll see how it turns out when I send
v1. However, there's an interesting API question regarding what
address userfaultfd provides. We previously required
UFFD_FEATURE_EXACT_ADDRESS when you specified
UFFD_FEATURE_MINOR_HUGETLBFS_HGM so that there was no ambiguity. Now,
we can do:

1. When MADV_SPLIT is given, userfaultfd will now round addresses to
PAGE_SIZE instead of huge_page_size(hstate), and
UFFD_FEATURE_EXACT_ADDRESS is not needed.
2. Don't change anything. A user must know to provide
UFFD_FEATURE_EXACT_ADDRESS to get the real address, otherwise they get
an (unusable) hugepage-aligned address.

I think #1 sounds fine; let me know if you disagree.

Thanks!
- James
  
Peter Xu Jan. 3, 2023, 5:06 p.m. UTC | #13
On Tue, Dec 27, 2022 at 12:02:52PM -0500, James Houghton wrote:
> On Thu, Dec 22, 2022 at 9:30 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Wed, Dec 21, 2022 at 08:24:45PM -0500, James Houghton wrote:
> > > Not sure what the best name for this flag is either. MADV_ENABLE_HGM
> > > sounds ok. MADV_HUGETLB_HGM or MADV_HUGETLB_SMALL_PAGES could work
> > > too. No need to figure it out now.
> >
> > One more option to consider is MADV_SPLIT (hopefully to be more generic).
> >
> > We already decided to reuse thp MADV_COLLAPSE, we can also introduce
> > MADV_SPLIT and leave thp for later if it can be anything helpful (I
> > remember we used to discuss this for thp split).
> >
> > For hugetlb one SPLIT should enable hgm advise bit on the vma forever.
> 
> MADV_SPLIT sounds okay to me -- we'll see how it turns out when I send
> v1. However, there's an interesting API question regarding what
> address userfaultfd provides. We previously required
> UFFD_FEATURE_EXACT_ADDRESS when you specified
> UFFD_FEATURE_MINOR_HUGETLBFS_HGM so that there was no ambiguity. Now,
> we can do:
> 
> 1. When MADV_SPLIT is given, userfaultfd will now round addresses to
> PAGE_SIZE instead of huge_page_size(hstate), and
> UFFD_FEATURE_EXACT_ADDRESS is not needed.
> 2. Don't change anything. A user must know to provide
> UFFD_FEATURE_EXACT_ADDRESS to get the real address, otherwise they get
> an (unusable) hugepage-aligned address.
> 
> I think #1 sounds fine; let me know if you disagree.

Sounds good to me, thanks!
  

Patch

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 07c81ab3fd4d..3a3e9ef74dab 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -226,6 +226,11 @@  static inline struct uffd_msg userfault_msg(unsigned long address,
 	return msg;
 }
 
+bool uffd_ctx_has_hgm(struct vm_userfaultfd_ctx *ctx)
+{
+	return ctx->ctx->features & UFFD_FEATURE_MINOR_HUGETLBFS_HGM;
+}
+
 #ifdef CONFIG_HUGETLB_PAGE
 /*
  * Same functionality as userfaultfd_must_wait below with modifications for
@@ -1954,10 +1959,15 @@  static int userfaultfd_api(struct userfaultfd_ctx *ctx,
 		goto err_out;
 	/* report all available features and ioctls to userland */
 	uffdio_api.features = UFFD_API_FEATURES;
+
 #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
 	uffdio_api.features &=
 		~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM);
-#endif
+#ifndef CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING
+	uffdio_api.features &= ~UFFD_FEATURE_MINOR_HUGETLBFS_HGM;
+#endif  /* CONFIG_HUGETLB_HIGH_GRANULARITY_MAPPING */
+#endif  /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
+
 #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 	uffdio_api.features &= ~UFFD_FEATURE_PAGEFAULT_FLAG_WP;
 #endif
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index f07e6998bb68..d8fa37f308f7 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -162,6 +162,8 @@  static inline bool vma_can_userfault(struct vm_area_struct *vma,
 	    vma_is_shmem(vma);
 }
 
+extern bool uffd_ctx_has_hgm(struct vm_userfaultfd_ctx *);
+
 extern int dup_userfaultfd(struct vm_area_struct *, struct list_head *);
 extern void dup_userfaultfd_complete(struct list_head *);
 
@@ -228,6 +230,11 @@  static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 	return false;
 }
 
+static inline bool uffd_ctx_has_hgm(struct vm_userfaultfd_ctx *ctx)
+{
+	return false;
+}
+
 static inline int dup_userfaultfd(struct vm_area_struct *vma,
 				  struct list_head *l)
 {
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 005e5e306266..ae8080003560 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -36,6 +36,7 @@ 
 			   UFFD_FEATURE_SIGBUS |		\
 			   UFFD_FEATURE_THREAD_ID |		\
 			   UFFD_FEATURE_MINOR_HUGETLBFS |	\
+			   UFFD_FEATURE_MINOR_HUGETLBFS_HGM |	\
 			   UFFD_FEATURE_MINOR_SHMEM |		\
 			   UFFD_FEATURE_EXACT_ADDRESS |		\
 			   UFFD_FEATURE_WP_HUGETLBFS_SHMEM)
@@ -217,6 +218,7 @@  struct uffdio_api {
 #define UFFD_FEATURE_MINOR_SHMEM		(1<<10)
 #define UFFD_FEATURE_EXACT_ADDRESS		(1<<11)
 #define UFFD_FEATURE_WP_HUGETLBFS_SHMEM		(1<<12)
+#define UFFD_FEATURE_MINOR_HUGETLBFS_HGM	(1<<13)
 	__u64 features;
 
 	__u64 ioctls;