[RFC,00/19] hugetlb support for KVM guest_mem

Message ID

cover.1686077275.git.ackerleytng@google.com

Headers

Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 2620:137:e000::1:20 as permitted sender)
 client-ip=2620:137:e000::1:20;
Date: Tue,  6 Jun 2023 19:03:45 +0000
Mime-Version: 1.0
Message-ID: <cover.1686077275.git.ackerleytng@google.com>
Subject: [RFC PATCH 00/19] hugetlb support for KVM guest_mem
From: Ackerley Tng <ackerleytng@google.com>
To: akpm@linux-foundation.org, mike.kravetz@oracle.com,
        muchun.song@linux.dev, pbonzini@redhat.com, seanjc@google.com,
        shuah@kernel.org, willy@infradead.org
Cc: brauner@kernel.org, chao.p.peng@linux.intel.com,
        coltonlewis@google.com, david@redhat.com, dhildenb@redhat.com,
        dmatlack@google.com, erdemaktas@google.com, hughd@google.com,
        isaku.yamahata@gmail.com, jarkko@kernel.org, jmattson@google.com,
        joro@8bytes.org, jthoughton@google.com, jun.nakajima@intel.com,
        kirill.shutemov@linux.intel.com, liam.merwick@oracle.com,
        mail@maciej.szmigiero.name, mhocko@suse.com, michael.roth@amd.com,
        qperret@google.com, rientjes@google.com, rppt@kernel.org,
        steven.price@arm.com, tabba@google.com, vannapurve@google.com,
        vbabka@suse.cz, vipinsh@google.com, vkuznets@redhat.com,
        wei.w.wang@intel.com, yu.c.zhang@linux.intel.com,
        kvm@vger.kernel.org, linux-api@vger.kernel.org,
        linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
        linux-kselftest@vger.kernel.org, linux-mm@kvack.org,
        qemu-devel@nongnu.org, x86@kernel.org,
        Ackerley Tng <ackerleytng@google.com>
Content-Type: text/plain; charset="UTF-8"
Precedence: bulk

Series

hugetlb support for KVM guest_mem

Message

Ackerley Tng June 6, 2023, 7:03 p.m. UTC

  Hello,

This patchset builds upon a soon-to-be-published WIP patchset that Sean
published at https://github.com/sean-jc/linux/tree/x86/kvm_gmem_solo, mentioned
at [1].

The tree can be found at:
https://github.com/googleprodkernel/linux-cc/tree/gmem-hugetlb-rfc-v1

In this patchset, hugetlb support for KVM's guest_mem (aka gmem) is introduced,
allowing VM private memory (for confidential computing) to be backed by hugetlb
pages.

guest_mem provides userspace with a handle, with which userspace can allocate
and deallocate memory for confidential VMs without mapping the memory into
userspace.

Why use hugetlb instead of introducing a new allocator, like gmem does for 4K
and transparent hugepages?

+ hugetlb provides the following useful functionality, which would otherwise
  have to be reimplemented:
    + Allocation of hugetlb pages at boot time, including
        + Parsing of kernel boot parameters to configure hugetlb
        + Tracking of usage in hstate
        + gmem will share the same system-wide pool of hugetlb pages, so users
          don't have to have separate pools for hugetlb and gmem
    + Page accounting with subpools
        + hugetlb pages are tracked in subpools, which gmem uses to reserve
          pages from the global hstate
    + Memory charging
        + hugetlb provides code that charges memory to cgroups
    + Reporting: hugetlb usage and availability are reported in /proc/meminfo,
      etc.
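
As an illustration of the reporting point above, here is a minimal userspace
sketch that extracts hugetlb counters from /proc/meminfo-formatted text. The
helper names meminfo_value() and read_meminfo() are made up for this example and
are not part of this patchset; HugePages_Total, HugePages_Free and Hugepagesize
are the field names the kernel prints in /proc/meminfo.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: return the integer that follows "key:" in
 * /proc/meminfo-formatted text, or -1 if the key is absent or malformed.
 */
long meminfo_value(const char *text, const char *key)
{
	size_t klen = strlen(key);
	const char *line = text;

	while (line && *line) {
		/* Require an exact key match followed by ':'. */
		if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
			long val;

			if (sscanf(line + klen + 1, "%ld", &val) == 1)
				return val;
			return -1;
		}
		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/*
 * Hypothetical helper: read /proc/meminfo into buf (len must be > 0).
 * Returns 0 on success, -1 if the file cannot be opened.
 */
int read_meminfo(char *buf, size_t len)
{
	FILE *f = fopen("/proc/meminfo", "r");
	size_t n;

	if (!f)
		return -1;
	n = fread(buf, 1, len - 1, f);
	buf[n] = '\0';
	fclose(f);
	return 0;
}
```

A caller would fill a buffer with read_meminfo() and then query, for example,
meminfo_value(buf, "HugePages_Free"). Because gmem draws from the same
system-wide pool, the same counters would cover both hugetlbfs and gmem usage.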

The first 11 patches in this patchset are a series of refactorings to decouple
hugetlb and hugetlbfs.

The central thread binding the refactoring is that some functions (like
inode_resv_map(), inode_subpool(), inode_hstate(), etc.) rely on a hugetlbfs
concept: that the resv_map, subpool, and hstate are stored in specific fields of
a hugetlb inode.

Refactoring to parametrize functions by hstate, subpool, resv_map will allow
hugetlb to be used by gmem and in other places where these data structures
aren't necessarily stored in the same positions in the inode.
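
To illustrate the idea of parametrizing by hstate and subpool, here is a toy
userspace model. It is only a sketch: every struct and function name below
(toy_*, and the simplified struct hstate and struct subpool) is invented for
this example and does not match the real kernel definitions or the functions
touched by this series.

```c
/* Toy stand-ins for the kernel structures; not the real definitions. */
struct hstate {
	unsigned long free_huge_pages;	/* pages left in the global pool */
};

struct subpool {
	long used_hpages;		/* pages handed out via this subpool */
	long max_hpages;		/* limit for this subpool */
};

/* hugetlbfs-style owner: metadata is reachable from the inode. */
struct toy_hugetlbfs_inode {
	struct hstate *h;
	struct subpool *spool;
};

/* gmem-style owner: the same metadata lives in an unrelated structure. */
struct toy_gmem_file {
	int flags;
	struct subpool *spool;
	struct hstate *h;
};

/*
 * Core allocator, parametrized by hstate and subpool. It never looks at
 * an inode, so any owner can call it. Returns 0 on success, -1 when the
 * subpool limit is reached or the global pool is empty.
 */
int toy_alloc_hugepage(struct hstate *h, struct subpool *spool)
{
	if (spool->used_hpages >= spool->max_hpages)
		return -1;
	if (h->free_huge_pages == 0)
		return -1;
	spool->used_hpages++;
	h->free_huge_pages--;
	return 0;
}

/* Thin hugetlbfs wrapper: look up the fields, then delegate. */
int toy_hugetlbfs_alloc(struct toy_hugetlbfs_inode *inode)
{
	return toy_alloc_hugepage(inode->h, inode->spool);
}

/* Thin gmem wrapper: different layout, same core allocator. */
int toy_gmem_alloc(struct toy_gmem_file *file)
{
	return toy_alloc_hugepage(file->h, file->spool);
}
```

In this model only the two thin wrappers know where their owner keeps the
hstate and subpool; the core function is shared, and both owners draw from the
same global pool.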

The refactoring proposed here is just the minimum required to get a
proof-of-concept working with gmem. I would like to get opinions on this
approach before doing further refactoring. (See TODOs)

TODOs:

+ hugetlb/hugetlbfs refactoring
    + remove_inode_hugepages() no longer needs to be exposed; it is
      hugetlbfs-specific and used only in inode.c
    + remove_mapping_hugepages(), remove_inode_single_folio(),
      hugetlb_unreserve_pages() shouldn't need to take inode as a parameter
        + Updating inode->i_blocks can be refactored to a separate function and
          called from hugetlbfs and gmem
    + alloc_hugetlb_folio_from_subpool() shouldn't need to be parametrized by
      vma
    + hugetlb_reserve_pages() should be refactored to be symmetric with
      hugetlb_unreserve_pages()
        + It should be parametrized by resv_map
        + alloc_hugetlb_folio_from_subpool() could perhaps use
          hugetlb_reserve_pages()?
+ gmem
    + Figure out if resv_map should be used by gmem at all
        + Probably needs more refactoring to decouple resv_map from hugetlb
          functions

Questions for the community:

1. In this patchset, every gmem file backed with hugetlb is given a new
   subpool. Is that desirable?
    + In hugetlbfs, a subpool always belongs to a mount, and hugetlbfs has one
      mount per hugetlb size (2M, 1G, etc)
    + memfd_create(MFD_HUGETLB) effectively returns a full hugetlbfs file, so it
      (rightfully) uses the hugetlbfs kernel mounts and their subpools
    + I gave each file a subpool mostly to speed up implementation and still be
      able to reserve hugetlb pages from the global hstate based on the gmem
      file size.
    + gmem, unlike hugetlbfs, isn't meant to be a full filesystem, so
        + Should there be multiple mounts, one for each hugetlb size?
        + Will the mounts be initialized on boot or on first gmem file creation?
        + Or is one subpool per gmem file fine?
2. Should resv_map be used for gmem at all, since gmem doesn't allow userspace
   reservations?
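
To make question 1 concrete, here is a toy userspace model of "one subpool per
gmem file", where each file reserves its full size from a shared global pool at
creation time. This is a sketch under simplifying assumptions, not the kernel
implementation: struct toy_hstate, struct toy_subpool, toy_subpool_create() and
toy_subpool_destroy() are all names invented for this example.

```c
/* Toy model of the global pool for one huge page size. */
struct toy_hstate {
	unsigned long page_size;	/* bytes per huge page */
	unsigned long free_pages;	/* pages in the global pool */
	unsigned long resv_pages;	/* pages already promised to subpools */
};

/* Toy model of a per-file subpool. */
struct toy_subpool {
	struct toy_hstate *h;
	unsigned long reserved;		/* pages reserved when the file was created */
};

/*
 * Create a subpool sized to the file and reserve all of its pages up
 * front. Returns 0 on success, -1 if file_size is zero, is not a
 * multiple of the huge page size, or exceeds the unreserved pages.
 */
int toy_subpool_create(struct toy_subpool *sp, struct toy_hstate *h,
		       unsigned long file_size)
{
	unsigned long npages;

	if (file_size == 0 || file_size % h->page_size)
		return -1;
	npages = file_size / h->page_size;
	/* resv_pages never exceeds free_pages, so this cannot underflow. */
	if (npages > h->free_pages - h->resv_pages)
		return -1;
	h->resv_pages += npages;
	sp->h = h;
	sp->reserved = npages;
	return 0;
}

/* Return the file's reservation to the global pool. */
void toy_subpool_destroy(struct toy_subpool *sp)
{
	sp->h->resv_pages -= sp->reserved;
	sp->reserved = 0;
}
```

In this model a file either gets its whole reservation when it is created or
creation fails, which is the property the per-file subpool is meant to provide.
A per-mount subpool would instead share one limit across all files of the same
huge page size.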

[1] https://lore.kernel.org/lkml/ZEM5Zq8oo+xnApW9@google.com/

---

Ackerley Tng (19):
  mm: hugetlb: Expose get_hstate_idx()
  mm: hugetlb: Move and expose hugetlbfs_zero_partial_page
  mm: hugetlb: Expose remove_inode_hugepages
  mm: hugetlb: Decouple hstate, subpool from inode
  mm: hugetlb: Allow alloc_hugetlb_folio() to be parametrized by subpool
    and hstate
  mm: hugetlb: Provide hugetlb_filemap_add_folio()
  mm: hugetlb: Refactor vma_*_reservation functions
  mm: hugetlb: Refactor restore_reserve_on_error
  mm: hugetlb: Use restore_reserve_on_error directly in filesystems
  mm: hugetlb: Parametrize alloc_hugetlb_folio_from_subpool() by
    resv_map
  mm: hugetlb: Parametrize hugetlb functions by resv_map
  mm: truncate: Expose preparation steps for truncate_inode_pages_final
  KVM: guest_mem: Refactor kvm_gmem fd creation to be in layers
  KVM: guest_mem: Refactor cleanup to separate inode and file cleanup
  KVM: guest_mem: hugetlb: initialization and cleanup
  KVM: guest_mem: hugetlb: allocate and truncate from hugetlb
  KVM: selftests: Add basic selftests for hugetlbfs-backed guest_mem
  KVM: selftests: Support various types of backing sources for private
    memory
  KVM: selftests: Update test for various private memory backing source
    types

 fs/hugetlbfs/inode.c                          | 102 ++--
 include/linux/hugetlb.h                       |  86 ++-
 include/linux/mm.h                            |   1 +
 include/uapi/linux/kvm.h                      |  25 +
 mm/hugetlb.c                                  | 324 +++++++-----
 mm/truncate.c                                 |  24 +-
 .../testing/selftests/kvm/guest_memfd_test.c  |  33 +-
 .../testing/selftests/kvm/include/test_util.h |  14 +
 tools/testing/selftests/kvm/lib/test_util.c   |  74 +++
 .../kvm/x86_64/private_mem_conversions_test.c |  38 +-
 virt/kvm/guest_mem.c                          | 488 ++++++++++++++----
 11 files changed, 882 insertions(+), 327 deletions(-)

--
2.41.0.rc0.172.g3f132b7071-goog

Comments

Isaku Yamahata June 8, 2023, 4:38 a.m. UTC | #1

On Tue, Jun 06, 2023 at 07:03:45PM +0000,
Ackerley Tng <ackerleytng@google.com> wrote:

> Hello,
> 
> This patchset builds upon a soon-to-be-published WIP patchset that Sean
> published at https://github.com/sean-jc/linux/tree/x86/kvm_gmem_solo, mentioned
> at [1].
> 
> The tree can be found at:
> https://github.com/googleprodkernel/linux-cc/tree/gmem-hugetlb-rfc-v1
> 
> In this patchset, hugetlb support for KVM's guest_mem (aka gmem) is introduced,
> allowing VM private memory (for confidential computing) to be backed by hugetlb
> pages.
> 
> guest_mem provides userspace with a handle, with which userspace can allocate
> and deallocate memory for confidential VMs without mapping the memory into
> userspace.
> 
> Why use hugetlb instead of introducing a new allocator, like gmem does for 4K
> and transparent hugepages?
> 
> + hugetlb provides the following useful functionality, which would otherwise
>   have to be reimplemented:
>     + Allocation of hugetlb pages at boot time, including
>         + Parsing of kernel boot parameters to configure hugetlb
>         + Tracking of usage in hstate
>         + gmem will share the same system-wide pool of hugetlb pages, so users
>           don't have to have separate pools for hugetlb and gmem
>     + Page accounting with subpools
>         + hugetlb pages are tracked in subpools, which gmem uses to reserve
>           pages from the global hstate
>     + Memory charging
>         + hugetlb provides code that charges memory to cgroups
>     + Reporting: hugetlb usage and availability are available at /proc/meminfo,
>       etc
> 
> The first 11 patches in this patchset is a series of refactoring to decouple
> hugetlb and hugetlbfs.
> 
> The central thread binding the refactoring is that some functions (like
> inode_resv_map(), inode_subpool(), inode_hstate(), etc) rely on a hugetlbfs
> concept, that the resv_map, subpool, hstate, are in a specific field in a
> hugetlb inode.
> 
> Refactoring to parametrize functions by hstate, subpool, resv_map will allow
> hugetlb to be used by gmem and in other places where these data structures
> aren't necessarily stored in the same positions in the inode.
> 
> The refactoring proposed here is just the minimum required to get a
> proof-of-concept working with gmem. I would like to get opinions on this
> approach before doing further refactoring. (See TODOs)
> 
> TODOs:
> 
> + hugetlb/hugetlbfs refactoring
>     + remove_inode_hugepages() no longer needs to be exposed, it is hugetlbfs
>       specific and used only in inode.c
>     + remove_mapping_hugepages(), remove_inode_single_folio(),
>       hugetlb_unreserve_pages() shouldn't need to take inode as a parameter
>         + Updating inode->i_blocks can be refactored to a separate function and
>           called from hugetlbfs and gmem
>     + alloc_hugetlb_folio_from_subpool() shouldn't need to be parametrized by
>       vma
>     + hugetlb_reserve_pages() should be refactored to be symmetric with
>       hugetlb_unreserve_pages()
>         + It should be parametrized by resv_map
>         + alloc_hugetlb_folio_from_subpool() could perhaps use
>           hugetlb_reserve_pages()?
> + gmem
>     + Figure out if resv_map should be used by gmem at all
>         + Probably needs more refactoring to decouple resv_map from hugetlb
>           functions

Hi. If KVM gmem is compiled as a kernel module, many symbols fail to link.
You need to add EXPORT_SYMBOL{,_GPL} for the exposed symbols.
Or should it be compiled into the kernel instead of as a module?

Thanks,

Mike Kravetz June 16, 2023, 6:28 p.m. UTC | #2

On 06/06/23 19:03, Ackerley Tng wrote:
> Hello,
> 
> This patchset builds upon a soon-to-be-published WIP patchset that Sean
> published at https://github.com/sean-jc/linux/tree/x86/kvm_gmem_solo, mentioned
> at [1].
> 
> The tree can be found at:
> https://github.com/googleprodkernel/linux-cc/tree/gmem-hugetlb-rfc-v1
> 
> In this patchset, hugetlb support for KVM's guest_mem (aka gmem) is introduced,
> allowing VM private memory (for confidential computing) to be backed by hugetlb
> pages.
> 
> guest_mem provides userspace with a handle, with which userspace can allocate
> and deallocate memory for confidential VMs without mapping the memory into
> userspace.

Hello Ackerley,

I am not sure if you are aware of, or have been following, the hugetlb HGM
discussion in this thread:
https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey/

There we are trying to decide if HGM should be added to hugetlb, or if
perhaps a new filesystem/driver/allocator should be created.  The concern
is added complexity to hugetlb as well as core mm special casing.  Note
that HGM is addressing issues faced by existing hugetlb users.

Your proposal here suggests modifying hugetlb so that it can be used in
a new way (use case) by KVM's guest_mem.  As such it really seems like
something that should be done in a separate filesystem/driver/allocator.
You will likely not get much support for modifying hugetlb.

Vishal Annapurve June 21, 2023, 9:01 a.m. UTC | #3

On Fri, Jun 16, 2023 at 11:28 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 06/06/23 19:03, Ackerley Tng wrote:
> > Hello,
> >
> > This patchset builds upon a soon-to-be-published WIP patchset that Sean
> > published at https://github.com/sean-jc/linux/tree/x86/kvm_gmem_solo, mentioned
> > at [1].
> >
> > The tree can be found at:
> > https://github.com/googleprodkernel/linux-cc/tree/gmem-hugetlb-rfc-v1
> >
> > In this patchset, hugetlb support for KVM's guest_mem (aka gmem) is introduced,
> > allowing VM private memory (for confidential computing) to be backed by hugetlb
> > pages.
> >
> > guest_mem provides userspace with a handle, with which userspace can allocate
> > and deallocate memory for confidential VMs without mapping the memory into
> > userspace.
>
> Hello Ackerley,
>
> I am not sure if you are aware or, have been following the hugetlb HGM
> discussion in this thread:
> https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey/
>
> There we are trying to decide if HGM should be added to hugetlb, or if
> perhaps a new filesystem/driver/allocator should be created.  The concern
> is added complexity to hugetlb as well as core mm special casing.  Note
> that HGM is addressing issues faced by existing hugetlb users.
>
> Your proposal here suggests modifying hugetlb so that it can be used in
> a new way (use case) by KVM's guest_mem.  As such it really seems like
> something that should be done in a separate filesystem/driver/allocator.
> You will likely not get much support for modifying hugetlb.
>
> --
> Mike Kravetz
>

IIUC, mm/hugetlb.c implements the memory manager for hugetlb pages and
fs/hugetlbfs/inode.c implements the filesystem logic for hugetlbfs.

This series implements a new filesystem with limited operations,
parallel to the hugetlbfs filesystem, but tries to reuse the hugetlb
memory manager. The effort here is not to add any new feature to the
hugetlb memory manager, but to clean it up so that it can be used by a
new filesystem.

guest_mem warrants a new filesystem since it supports limited
operations on the underlying files, but there is no additional
restriction on the underlying memory management. One could argue,
though, that memory management for guest_mem files could be very
simple, in line with the limited operations on the files.

If this series were to go the separate way of implementing a new memory
manager, one immediate requirement that might spring up would be
converting memory between hugetlb-managed memory and memory managed by
the newly introduced memory manager, in both directions, at runtime,
since there could be a mix of VMs on the same platform using guest_mem
and hugetlb.
Maybe this can be satisfied by having a separate global pool for
reservation that's consumed by both, which would need more changes in
my understanding.

Using guest_mem for all VMs by default would be future work,
contingent on all existing use cases/requirements being satisfied.

Regards,
Vishal