[RFC,0/5] Prototype for direct map awareness in page allocator

Message ID 20230308094106.227365-1-rppt@kernel.org

Message

Mike Rapoport March 8, 2023, 9:41 a.m. UTC
  From: "Mike Rapoport (IBM)" <rppt@kernel.org>

Hi,

This is a third attempt to make the page allocator aware of the direct
map layout and to allow grouping of the pages that must be unmapped
from the direct map.

This is a new implementation of __GFP_UNMAPPED, a follow-up to this set:

https://lore.kernel.org/all/20220127085608.306306-1-rppt@kernel.org

but instead of using a migrate type to cache the unmapped pages, the
current implementation adds a dedicated cache to serve __GFP_UNMAPPED
allocations.

The last two patches in the series demonstrate how __GFP_UNMAPPED can be
used in two in-tree use cases.

The first one switches secretmem to the new mechanism, which is a
straightforward optimization.

The second use-case enables __GFP_UNMAPPED in x86::module_alloc(),
which is essentially used to allocate code pages and thus requires
permission changes for base pages in the direct map.

This set is x86-specific at the moment because other architectures
either do not support set_memory APIs that split the direct^w linear
map (e.g. PowerPC) or only enable set_memory APIs when the linear map
uses base page size (like arm64).

The patches are only lightly tested.

== Motivation ==

There are use-cases that need to remove pages from the direct map, or
at least map them with 4K granularity. Whenever this is done, e.g. with
the set_memory/set_direct_map APIs, the PUD- and PMD-sized mappings in
the direct map are split into smaller pages.

To reduce the performance hit caused by fragmentation of the direct
map, it makes sense to group and/or cache the pages removed from it so
that the split large pages won't be scattered all over the place.

There were RFCs for grouped page allocations for vmalloc permissions
[1] and for using PKS to protect page tables [2], as well as an attempt
to use a pool of large pages in secretmem [3]. But these suggestions
each address a single use case, while a common mechanism at the core mm
level could serve all of them.

== Implementation overview ==

The pages that need to be removed from the direct map are grouped in a
dedicated cache. When a page allocation request has __GFP_UNMAPPED set,
it is redirected from __alloc_pages() to that cache via a new
unmapped_alloc() function.

The cache is implemented as a buddy allocator, so it can handle
high-order requests.

The cache starts empty. Whenever it does not have enough pages to
satisfy an allocation request, it attempts to allocate a PMD_SIZE page
to replenish itself. If a PMD_SIZE page cannot be allocated, the cache
is replenished with a page of the highest order available. That page is
removed from the direct map and added to the local buddy allocator.

There is also a shrinker that releases pages from the unmapped cache
when there is memory pressure in the system. When the shrinker releases
a page, it is mapped back into the direct map.

[1] https://lore.kernel.org/lkml/20210405203711.1095940-1-rick.p.edgecombe@intel.com
[2] https://lore.kernel.org/lkml/20210505003032.489164-1-rick.p.edgecombe@intel.com
[3] https://lore.kernel.org/lkml/20210121122723.3446-8-rppt@kernel.org

Mike Rapoport (IBM) (5):
  mm: introduce __GFP_UNMAPPED and unmapped_alloc()
  mm/unmapped_alloc: add debugfs file similar to /proc/pagetypeinfo
  mm/unmapped_alloc: add shrinker
  EXPERIMENTAL: x86: use __GFP_UNMAPPED for module_alloc()
  EXPERIMENTAL: mm/secretmem: use __GFP_UNMAPPED

 arch/x86/Kconfig                |   3 +
 arch/x86/kernel/module.c        |   2 +-
 include/linux/gfp_types.h       |  11 +-
 include/linux/page-flags.h      |   6 +
 include/linux/pageblock-flags.h |  28 +++
 include/trace/events/mmflags.h  |  10 +-
 mm/Kconfig                      |   4 +
 mm/Makefile                     |   1 +
 mm/internal.h                   |  24 +++
 mm/page_alloc.c                 |  39 +++-
 mm/secretmem.c                  |  26 +--
 mm/unmapped-alloc.c             | 334 ++++++++++++++++++++++++++++++++
 mm/vmalloc.c                    |   2 +-
 13 files changed, 459 insertions(+), 31 deletions(-)
 create mode 100644 mm/unmapped-alloc.c


base-commit: fe15c26ee26efa11741a7b632e9f23b01aca4cc6
  

Comments

Edgecombe, Rick P March 9, 2023, 1:59 a.m. UTC | #1
On Wed, 2023-03-08 at 11:41 +0200, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> 
> Hi,
> 
> This is a third attempt to make page allocator aware of the direct
> map
> layout and allow grouping of the pages that must be unmapped from
> the direct map.
> 
> This a new implementation of __GFP_UNMAPPED, kinda a follow up for
> this set:
> 
> https://lore.kernel.org/all/20220127085608.306306-1-rppt@kernel.org
> 
> but instead of using a migrate type to cache the unmapped pages, the
> current implementation adds a dedicated cache to serve __GFP_UNMAPPED
> allocations.

It seems a downside to having a page allocator outside of _the_ page
allocator is that you don't get all of the features that are baked in
there. For example, does secretmem care about NUMA? I guess in this
implementation there is just one big cache for all nodes.

Probably most users would want __GFP_ZERO. Would secretmem care about
__GFP_ACCOUNT? I'm sure there is more, but I guess the question is, is
the idea that these features all get built into unmapped-alloc at some
point? The alternate approach is to have little caches for each usage
like the grouped pages, which is probably less efficient when you have
a bunch of them. Or solve it just for modules like the bpf allocator.
Those are the tradeoffs for the approaches that have been explored,
right?
  
Mike Rapoport March 9, 2023, 3:14 p.m. UTC | #2
On Thu, Mar 09, 2023 at 01:59:00AM +0000, Edgecombe, Rick P wrote:
> On Wed, 2023-03-08 at 11:41 +0200, Mike Rapoport wrote:
> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> > 
> > Hi,
> > 
> > This is a third attempt to make page allocator aware of the direct
> > map
> > layout and allow grouping of the pages that must be unmapped from
> > the direct map.
> > 
> > This a new implementation of __GFP_UNMAPPED, kinda a follow up for
> > this set:
> > 
> > https://lore.kernel.org/all/20220127085608.306306-1-rppt@kernel.org
> > 
> > but instead of using a migrate type to cache the unmapped pages, the
> > current implementation adds a dedicated cache to serve __GFP_UNMAPPED
> > allocations.
> 
> It seems a downside to having a page allocator outside of _the_ page
> allocator is you don't get all of the features that are baked in there.
> For example does secretmem care about numa? I guess in this
> implementation there is just one big cache for all nodes.
> 
> Probably most users would want __GFP_ZERO. Would secretmem care about
> __GFP_ACCOUNT?

The intention was that the pages in the cache are always zeroed, so
__GFP_ZERO is always implicitly there, or at least it should have been.
__GFP_ACCOUNT is respected in this implementation: if you look at the
changes to __alloc_pages(), after getting pages from the unmapped cache
there is a 'goto out' to the point where the accounting is handled.

> I'm sure there is more, but I guess the question is, is
> the idea that these features all get built into unmapped-alloc at some
> point? The alternate approach is to have little caches for each usage
> like the grouped pages, which is probably less efficient when you have
> a bunch of them. Or solve it just for modules like the bpf allocator.
> Those are the tradeoffs for the approaches that have been explored,
> right?

I think that no matter what cache we use, it won't be able to support
all the features _the_ page allocator has. Indeed, with a per-use-case
cache implementation we could tune each cache to the features of
interest for that use case, but then we'd be less efficient at reducing
splits of the large pages, not to mention the increase in complexity
with several caches doing similar yet different things.

This POC mostly targets secretmem and modules, so it was pretty much
about GFP_KERNEL without considerations for NUMA. I think extending
unmapped alloc for NUMA should be simple enough, but it would increase
the memory overhead even more.
  
Christoph Hellwig March 10, 2023, 7:27 a.m. UTC | #3
On Wed, Mar 08, 2023 at 11:41:01AM +0200, Mike Rapoport wrote:
> The last two patches in the series demonstrate how __GFP_UNMAPPED can be
> used in two in-tree use cases.

dma_alloc_attrs(DMA_ATTR_NO_KERNEL_MAPPING) would be another easy one.
  
Mike Rapoport March 27, 2023, 2:27 p.m. UTC | #4
(adding Mel)

On Wed, Mar 08, 2023 at 11:41:01AM +0200, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> 
> Hi,
> 
> This is a third attempt to make page allocator aware of the direct map
> layout and allow grouping of the pages that must be unmapped from
> the direct map.
> 
> This a new implementation of __GFP_UNMAPPED, kinda a follow up for this set:
> 
> https://lore.kernel.org/all/20220127085608.306306-1-rppt@kernel.org
> 
> but instead of using a migrate type to cache the unmapped pages, the
> current implementation adds a dedicated cache to serve __GFP_UNMAPPED
> allocations.
> 
> The last two patches in the series demonstrate how __GFP_UNMAPPED can be
> used in two in-tree use cases.
> 
> First one is to switch secretmem to use the new mechanism, which is
> straight forward optimization.
> 
> The second use-case is to enable __GFP_UNMAPPED in x86::module_alloc()
> that is essentially used as a method to allocate code pages and thus
> requires permission changes for basic pages in the direct map.
> 
> This set is x86 specific at the moment because other architectures either
> do not support set_memory APIs that split the direct^w linear map (e.g.
> PowerPC) or only enable set_memory APIs when the linear map uses basic page
> size (like arm64).
> 
> The patches are only lightly tested.
> 
> == Motivation ==
> 
> There are use-cases that need to remove pages from the direct map or at
> least map them with 4K granularity. Whenever this is done e.g. with
> set_memory/set_direct_map APIs, the PUD and PMD sized mappings in the
> direct map are split into smaller pages.
> 
> To reduce the performance hit caused by the fragmentation of the direct map
> it makes sense to group and/or cache the pages removed from the direct
> map so that the split large pages won't be all over the place. 
> 
> There were RFCs for grouped page allocations for vmalloc permissions [1]
> and for using PKS to protect page tables [2] as well as an attempt to use a
> pool of large pages in secretmtm [3], but these suggestions address each
> use case separately, while having a common mechanism at the core mm level
> could be used by all use cases.
> 
> == Implementation overview ==
> 
> The pages that need to be removed from the direct map are grouped in a
> dedicated cache. When there is a page allocation request with
> __GFP_UNMAPPED set, it is redirected from __alloc_pages() to that cache
> using a new unmapped_alloc() function.
> 
> The cache is implemented as a buddy allocator and it can handle high order
> requests.
> 
> The cache starts empty and whenever it does not have enough pages to
> satisfy an allocation request the cache attempts to allocate PMD_SIZE page
> to replenish the cache. If PMD_SIZE page cannot be allocated, the cache is
> replenished with a page of the highest order available. That page is
> removed from the direct map and added to the local buddy allocator.
> 
> There is also a shrinker that releases pages from the unmapped cache when
> there us a memory pressure in the system. When shrinker releases a page it
> is mapped back into the direct map.
> 
> [1] https://lore.kernel.org/lkml/20210405203711.1095940-1-rick.p.edgecombe@intel.com
> [2] https://lore.kernel.org/lkml/20210505003032.489164-1-rick.p.edgecombe@intel.com
> [3] https://lore.kernel.org/lkml/20210121122723.3446-8-rppt@kernel.org
> 
> Mike Rapoport (IBM) (5):
>   mm: intorduce __GFP_UNMAPPED and unmapped_alloc()
>   mm/unmapped_alloc: add debugfs file similar to /proc/pagetypeinfo
>   mm/unmapped_alloc: add shrinker
>   EXPERIMENTAL: x86: use __GFP_UNMAPPED for modele_alloc()
>   EXPERIMENTAL: mm/secretmem: use __GFP_UNMAPPED
> 
>  arch/x86/Kconfig                |   3 +
>  arch/x86/kernel/module.c        |   2 +-
>  include/linux/gfp_types.h       |  11 +-
>  include/linux/page-flags.h      |   6 +
>  include/linux/pageblock-flags.h |  28 +++
>  include/trace/events/mmflags.h  |  10 +-
>  mm/Kconfig                      |   4 +
>  mm/Makefile                     |   1 +
>  mm/internal.h                   |  24 +++
>  mm/page_alloc.c                 |  39 +++-
>  mm/secretmem.c                  |  26 +--
>  mm/unmapped-alloc.c             | 334 ++++++++++++++++++++++++++++++++
>  mm/vmalloc.c                    |   2 +-
>  13 files changed, 459 insertions(+), 31 deletions(-)
>  create mode 100644 mm/unmapped-alloc.c
> 
> 
> base-commit: fe15c26ee26efa11741a7b632e9f23b01aca4cc6
> -- 
> 2.35.1
>
  
Sean Christopherson May 19, 2023, 3:40 p.m. UTC | #5
On Thu, Mar 09, 2023, Mike Rapoport wrote:
> On Thu, Mar 09, 2023 at 01:59:00AM +0000, Edgecombe, Rick P wrote:
> > On Wed, 2023-03-08 at 11:41 +0200, Mike Rapoport wrote:
> > > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> > > 
> > > Hi,
> > > 
> > > This is a third attempt to make page allocator aware of the direct
> > > map
> > > layout and allow grouping of the pages that must be unmapped from
> > > the direct map.
> > > 
> > > This a new implementation of __GFP_UNMAPPED, kinda a follow up for
> > > this set:
> > > 
> > > https://lore.kernel.org/all/20220127085608.306306-1-rppt@kernel.org
> > > 
> > > but instead of using a migrate type to cache the unmapped pages, the
> > > current implementation adds a dedicated cache to serve __GFP_UNMAPPED
> > > allocations.
> > 
> > It seems a downside to having a page allocator outside of _the_ page
> > allocator is you don't get all of the features that are baked in there.
> > For example does secretmem care about numa? I guess in this
> > implementation there is just one big cache for all nodes.
> > 
> > Probably most users would want __GFP_ZERO. Would secretmem care about
> > __GFP_ACCOUNT?
> 
> The intention was that the pages in cache are always zeroed, so __GFP_ZERO
> is always implicitly there, at least should have been.

Would it be possible to drop that assumption/requirement, i.e. allow allocation of
__GFP_UNMAPPED without __GFP_ZERO?  At a glance, __GFP_UNMAPPED looks like it would
be a great fit for backing guest memory, in particular for confidential VMs.  And
for some flavors of CoCo, i.e. TDX, the trusted intermediary is responsible for
zeroing/initializing guest memory as the untrusted host (kernel/KVM) doesn't have
access to the guest's encryption key.  In other words, zeroing in the kernel would
be unnecessary work.
  
Mike Rapoport May 19, 2023, 4:24 p.m. UTC | #6
On Fri, May 19, 2023 at 08:40:48AM -0700, Sean Christopherson wrote:
> On Thu, Mar 09, 2023, Mike Rapoport wrote:
> > On Thu, Mar 09, 2023 at 01:59:00AM +0000, Edgecombe, Rick P wrote:
> > > On Wed, 2023-03-08 at 11:41 +0200, Mike Rapoport wrote:
> > > > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> > > > 
> > > > Hi,
> > > > 
> > > > This is a third attempt to make page allocator aware of the direct
> > > > map
> > > > layout and allow grouping of the pages that must be unmapped from
> > > > the direct map.
> > > > 
> > > > This a new implementation of __GFP_UNMAPPED, kinda a follow up for
> > > > this set:
> > > > 
> > > > https://lore.kernel.org/all/20220127085608.306306-1-rppt@kernel.org
> > > > 
> > > > but instead of using a migrate type to cache the unmapped pages, the
> > > > current implementation adds a dedicated cache to serve __GFP_UNMAPPED
> > > > allocations.
> > > 
> > > It seems a downside to having a page allocator outside of _the_ page
> > > allocator is you don't get all of the features that are baked in there.
> > > For example does secretmem care about numa? I guess in this
> > > implementation there is just one big cache for all nodes.
> > > 
> > > Probably most users would want __GFP_ZERO. Would secretmem care about
> > > __GFP_ACCOUNT?
> > 
> > The intention was that the pages in cache are always zeroed, so __GFP_ZERO
> > is always implicitly there, at least should have been.
> 
> Would it be possible to drop that assumption/requirement, i.e. allow allocation of
> __GFP_UNMAPPED without __GFP_ZERO?  At a glance, __GFP_UNMAPPED looks like it would
> be a great fit for backing guest memory, in particular for confidential VMs.  And
> for some flavors of CoCo, i.e. TDX, the trusted intermediary is responsible for
> zeroing/initializing guest memory as the untrusted host (kernel/KVM) doesn't have
> access to the guest's encryption key.  In other words, zeroing in the kernel would
> be unnecessary work.

Making an unmapped allocation without __GFP_ZERO shouldn't be a problem.

However, using a gfp flag and hooking into the free path of the page
allocator have issues and should preferably be avoided.

Will something like unmapped_alloc() and unmapped_free() work for your
usecase?
  
Sean Christopherson May 19, 2023, 6:25 p.m. UTC | #7
On Fri, May 19, 2023, Mike Rapoport wrote:
> On Fri, May 19, 2023 at 08:40:48AM -0700, Sean Christopherson wrote:
> > On Thu, Mar 09, 2023, Mike Rapoport wrote:
> > > On Thu, Mar 09, 2023 at 01:59:00AM +0000, Edgecombe, Rick P wrote:
> > > > On Wed, 2023-03-08 at 11:41 +0200, Mike Rapoport wrote:
> > > > > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> > > > > 
> > > > > Hi,
> > > > > 
> > > > > This is a third attempt to make page allocator aware of the direct
> > > > > map
> > > > > layout and allow grouping of the pages that must be unmapped from
> > > > > the direct map.
> > > > > 
> > > > > This a new implementation of __GFP_UNMAPPED, kinda a follow up for
> > > > > this set:
> > > > > 
> > > > > https://lore.kernel.org/all/20220127085608.306306-1-rppt@kernel.org
> > > > > 
> > > > > but instead of using a migrate type to cache the unmapped pages, the
> > > > > current implementation adds a dedicated cache to serve __GFP_UNMAPPED
> > > > > allocations.
> > > > 
> > > > It seems a downside to having a page allocator outside of _the_ page
> > > > allocator is you don't get all of the features that are baked in there.
> > > > For example does secretmem care about numa? I guess in this
> > > > implementation there is just one big cache for all nodes.
> > > > 
> > > > Probably most users would want __GFP_ZERO. Would secretmem care about
> > > > __GFP_ACCOUNT?
> > > 
> > > The intention was that the pages in cache are always zeroed, so __GFP_ZERO
> > > is always implicitly there, at least should have been.
> > 
> > Would it be possible to drop that assumption/requirement, i.e. allow allocation of
> > __GFP_UNMAPPED without __GFP_ZERO?  At a glance, __GFP_UNMAPPED looks like it would
> > be a great fit for backing guest memory, in particular for confidential VMs.  And
> > for some flavors of CoCo, i.e. TDX, the trusted intermediary is responsible for
> > zeroing/initializing guest memory as the untrusted host (kernel/KVM) doesn't have
> > access to the guest's encryption key.  In other words, zeroing in the kernel would
> > be unnecessary work.
> 
> Making and unmapped allocation without __GFP_ZERO shouldn't be a problem. 
> 
> However, using a gfp flag and hooking up into the free path in page
> allocator have issues and preferably should be avoided.
> 
> Will something like unmapped_alloc() and unmapped_free() work for your
> usecase?

Yep, I'm leaning more and more towards having KVM implement its own ioctl() for
managing this type of memory.  Wiring that up to use dedicated APIs should be no
problem.

Thanks!
  
Mike Rapoport May 25, 2023, 8:37 p.m. UTC | #8
On Fri, May 19, 2023 at 11:25:53AM -0700, Sean Christopherson wrote:
> On Fri, May 19, 2023, Mike Rapoport wrote:
> > On Fri, May 19, 2023 at 08:40:48AM -0700, Sean Christopherson wrote:
> > > On Thu, Mar 09, 2023, Mike Rapoport wrote:
> > > > On Thu, Mar 09, 2023 at 01:59:00AM +0000, Edgecombe, Rick P wrote:
> > > 
> > > Would it be possible to drop that assumption/requirement, i.e. allow allocation of
> > > __GFP_UNMAPPED without __GFP_ZERO?  At a glance, __GFP_UNMAPPED looks like it would
> > > be a great fit for backing guest memory, in particular for confidential VMs.  And
> > > for some flavors of CoCo, i.e. TDX, the trusted intermediary is responsible for
> > > zeroing/initializing guest memory as the untrusted host (kernel/KVM) doesn't have
> > > access to the guest's encryption key.  In other words, zeroing in the kernel would
> > > be unnecessary work.
> > 
> > Making and unmapped allocation without __GFP_ZERO shouldn't be a problem. 
> > 
> > However, using a gfp flag and hooking up into the free path in page
> > allocator have issues and preferably should be avoided.
> > 
> > Will something like unmapped_alloc() and unmapped_free() work for your
> > usecase?
> 
> Yep, I'm leaning more and more towards having KVM implement its own ioctl() for
> managing this type of memory.  Wiring that up to use dedicated APIs should be no
> problem.

Ok, I've dropped the GFP flag and made unmapped_pages_{alloc,free} the only
APIs.

Totally untested version of what I've got is here:
https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=unmapped-alloc/rfc-v2

I have some thoughts about adding support for 1G pages, but this is still
really vague at the moment.
 
> Thanks!