diff mbox series

[RFC,1/5] mm: intorduce __GFP_UNMAPPED and unmapped_alloc()

Message ID	20230308094106.227365-2-rppt@kernel.org
State	New
Headers	Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; From: Mike Rapoport <rppt@kernel.org> To: linux-mm@kvack.org Cc: Andrew Morton <akpm@linux-foundation.org>, Dave Hansen <dave.hansen@linux.intel.com>, Mike Rapoport <rppt@kernel.org>, Peter Zijlstra <peterz@infradead.org>, Rick Edgecombe <rick.p.edgecombe@intel.com>, Song Liu <song@kernel.org>, Thomas Gleixner <tglx@linutronix.de>, Vlastimil Babka <vbabka@suse.cz>, linux-kernel@vger.kernel.org, x86@kernel.org Subject: [RFC PATCH 1/5] mm: intorduce __GFP_UNMAPPED and unmapped_alloc() Date: Wed, 8 Mar 2023 11:41:02 +0200 Message-Id: <20230308094106.227365-2-rppt@kernel.org> In-Reply-To: <20230308094106.227365-1-rppt@kernel.org> References: <20230308094106.227365-1-rppt@kernel.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	Prototype for direct map awareness in page allocator \| [RFC,0/5] Prototype for direct map awareness in page allocator [RFC,1/5] mm: intorduce __GFP_UNMAPPED and unmapped_alloc() [RFC,2/5] mm/unmapped_alloc: add debugfs file similar to /proc/pagetypeinfo [RFC,3/5] mm/unmapped_alloc: add shrinker [RFC,4/5] EXPERIMENTAL: x86: use __GFP_UNMAPPED for modele_alloc() [RFC,5/5] EXPERIMENTAL: mm/secretmem: use __GFP_UNMAPPED

Commit Message

Mike Rapoport March 8, 2023, 9:41 a.m. UTC

  From: "Mike Rapoport (IBM)" <rppt@kernel.org>

When set_memory or set_direct_map APIs used to change attribute or
permissions for chunks of several pages, the large PMD that maps these
pages in the direct map must be split. Fragmenting the direct map in such
manner causes TLB pressure and, eventually, performance degradation.

To avoid excessive direct map fragmentation, add ability to allocate
"unmapped" pages with __GFP_UNMAPPED flag that will cause removal of the
allocated pages from the direct map and use a cache of the unmapped pages.

This cache is replenished with higher order pages with preference for
PMD_SIZE pages when possible so that there will be fewer splits of large
pages in the direct map.

The cache is implemented as a buddy allocator, so it can serve high order
allocations of unmapped pages.

Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
---
 arch/x86/Kconfig                |   3 +
 include/linux/gfp_types.h       |  11 +-
 include/linux/page-flags.h      |   6 +
 include/linux/pageblock-flags.h |  28 +++++
 include/trace/events/mmflags.h  |  10 +-
 mm/Kconfig                      |   4 +
 mm/Makefile                     |   1 +
 mm/internal.h                   |  22 ++++
 mm/page_alloc.c                 |  29 ++++-
 mm/unmapped-alloc.c             | 215 ++++++++++++++++++++++++++++++++
 10 files changed, 325 insertions(+), 4 deletions(-)
 create mode 100644 mm/unmapped-alloc.c

Comments

Edgecombe, Rick P March 9, 2023, 1:56 a.m. UTC | #1

On Wed, 2023-03-08 at 11:41 +0200, Mike Rapoport wrote:
> +
> +static inline void __free_one_page(struct page *page, unsigned int
> order,
> +                                  bool cache_refill)
> +{
> +       unsigned long pfn = page_to_pfn(page);
> +       unsigned long buddy_pfn;
> +       unsigned long combined_pfn;
> +       struct page *buddy;
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(&free_area->lock, flags);
> +
> +       if (cache_refill) {
> +               set_pageblock_unmapped(page);
> +               free_area[order].nr_cached++;
> +       }
> +
> +       while (order < MAX_ORDER - 1) {
> +               buddy = find_unmapped_buddy_page_pfn(page, pfn,
> order,
> +                                                    &buddy_pfn);
> +               if (!buddy)
> +                       break;
> +
> +               del_page_from_free_list(buddy, order);
> +               combined_pfn = buddy_pfn & pfn;
> +               page = page + (combined_pfn - pfn);
> +               pfn = combined_pfn;
> +               order++;
> +       }
> +
> +       set_unmapped_order(page, order);
> +       add_to_free_list(page, order);
> +       spin_unlock_irqrestore(&free_area->lock, flags);
> +}
> +

The page has to be zeroed before it goes back on the list, right? I
didn't see it.

Hyeonggon Yoo March 9, 2023, 6:31 a.m. UTC | #2

On Wed, Mar 08, 2023 at 11:41:02AM +0200, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> 
> When set_memory or set_direct_map APIs used to change attribute or
> permissions for chunks of several pages, the large PMD that maps these
> pages in the direct map must be split. Fragmenting the direct map in such
> manner causes TLB pressure and, eventually, performance degradation.
> 
> To avoid excessive direct map fragmentation, add ability to allocate
> "unmapped" pages with __GFP_UNMAPPED flag that will cause removal of the
> allocated pages from the direct map and use a cache of the unmapped pages.
> 
> This cache is replenished with higher order pages with preference for
> PMD_SIZE pages when possible so that there will be fewer splits of large
> pages in the direct map.
> 
> The cache is implemented as a buddy allocator, so it can serve high order
> allocations of unmapped pages.

Hello,

To me it seems unnecessary to increase pageblock bitmap size just to
distinguish if it is allocated with __GFP_UNMAPPED.

I think this can be implemented as an independent cache on top of
buddy allocator, and introducing new API for unmapped page
allocation and freeing?

Thanks,
Hyeonggon

Mike Rapoport March 9, 2023, 2:39 p.m. UTC | #3

On Thu, Mar 09, 2023 at 01:56:37AM +0000, Edgecombe, Rick P wrote:
> On Wed, 2023-03-08 at 11:41 +0200, Mike Rapoport wrote:
> > +
> > +static inline void __free_one_page(struct page *page, unsigned int
> > order,
> > +                                  bool cache_refill)
> > +{
> > +       unsigned long pfn = page_to_pfn(page);
> > +       unsigned long buddy_pfn;
> > +       unsigned long combined_pfn;
> > +       struct page *buddy;
> > +       unsigned long flags;
> > +
> > +       spin_lock_irqsave(&free_area->lock, flags);
> > +
> > +       if (cache_refill) {
> > +               set_pageblock_unmapped(page);
> > +               free_area[order].nr_cached++;
> > +       }
> > +
> > +       while (order < MAX_ORDER - 1) {
> > +               buddy = find_unmapped_buddy_page_pfn(page, pfn,
> > order,
> > +                                                    &buddy_pfn);
> > +               if (!buddy)
> > +                       break;
> > +
> > +               del_page_from_free_list(buddy, order);
> > +               combined_pfn = buddy_pfn & pfn;
> > +               page = page + (combined_pfn - pfn);
> > +               pfn = combined_pfn;
> > +               order++;
> > +       }
> > +
> > +       set_unmapped_order(page, order);
> > +       add_to_free_list(page, order);
> > +       spin_unlock_irqrestore(&free_area->lock, flags);
> > +}
> > +
> 
> The page has to be zeroed before it goes back on the list, right? I
> didn't see it.

You are right, I missed it.

Mike Rapoport March 9, 2023, 3:27 p.m. UTC | #4

On Thu, Mar 09, 2023 at 06:31:25AM +0000, Hyeonggon Yoo wrote:
> On Wed, Mar 08, 2023 at 11:41:02AM +0200, Mike Rapoport wrote:
> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> > 
> > When set_memory or set_direct_map APIs used to change attribute or
> > permissions for chunks of several pages, the large PMD that maps these
> > pages in the direct map must be split. Fragmenting the direct map in such
> > manner causes TLB pressure and, eventually, performance degradation.
> > 
> > To avoid excessive direct map fragmentation, add ability to allocate
> > "unmapped" pages with __GFP_UNMAPPED flag that will cause removal of the
> > allocated pages from the direct map and use a cache of the unmapped pages.
> > 
> > This cache is replenished with higher order pages with preference for
> > PMD_SIZE pages when possible so that there will be fewer splits of large
> > pages in the direct map.
> > 
> > The cache is implemented as a buddy allocator, so it can serve high order
> > allocations of unmapped pages.
> 
> Hello,
> 
> To me it seems unnecessary to increase pageblock bitmap size just to
> distinguish if it is allocated with __GFP_UNMAPPED.
> 
> I think this can be implemented as an independent cache on top of
> buddy allocator, and introducing new API for unmapped page
> allocation and freeing?

Yes, it could. But with such API we'd also need to take care of
mapping/unmapping these pages for the modules use case which is not that
difficult, but still more complex than a flag to vmalloc().
As for secretmem, freeing of secretmem pages happens in the page cache, so
changing that to use, say, unmapped_free() would be quite intrusive.

Another reason to mark the entire pageblock as unmapped is possibility to
steal pages from that pageblock into the unmmaped cache when they become
free. When we allocate pages to replenish the unmapped cache, we split the
large mapping in the direct map, so even the pages in the same pageblock
that do not get into the cache are still mapped at PTE level. When they are
freed, they will be put in the cache and so fewer explicit cache refills
that split the large mappings will be required.

> Thanks,
> Hyeonggon

Edgecombe, Rick P March 9, 2023, 3:34 p.m. UTC | #5

On Thu, 2023-03-09 at 16:39 +0200, Mike Rapoport wrote:
> On Thu, Mar 09, 2023 at 01:56:37AM +0000, Edgecombe, Rick P wrote:
> > On Wed, 2023-03-08 at 11:41 +0200, Mike Rapoport wrote:
> > > +
> > > +static inline void __free_one_page(struct page *page, unsigned
> > > int
> > > order,
> > > +                                  bool cache_refill)
> > > +{
> > > +       unsigned long pfn = page_to_pfn(page);
> > > +       unsigned long buddy_pfn;
> > > +       unsigned long combined_pfn;
> > > +       struct page *buddy;
> > > +       unsigned long flags;
> > > +
> > > +       spin_lock_irqsave(&free_area->lock, flags);
> > > +
> > > +       if (cache_refill) {
> > > +               set_pageblock_unmapped(page);
> > > +               free_area[order].nr_cached++;
> > > +       }
> > > +
> > > +       while (order < MAX_ORDER - 1) {
> > > +               buddy = find_unmapped_buddy_page_pfn(page, pfn,
> > > order,
> > > +                                                    &buddy_pfn);
> > > +               if (!buddy)
> > > +                       break;
> > > +
> > > +               del_page_from_free_list(buddy, order);
> > > +               combined_pfn = buddy_pfn & pfn;
> > > +               page = page + (combined_pfn - pfn);
> > > +               pfn = combined_pfn;
> > > +               order++;
> > > +       }
> > > +
> > > +       set_unmapped_order(page, order);
> > > +       add_to_free_list(page, order);
> > > +       spin_unlock_irqrestore(&free_area->lock, flags);
> > > +}
> > > +
> > 
> > The page has to be zeroed before it goes back on the list, right? I
> > didn't see it.
> 
> You are right, I missed it.

The text_poke() family is probably the way on x86, but I don't think
there is a cross-arch way that doesn't involve kernel shootdowns...

Michal Hocko March 24, 2023, 8:37 a.m. UTC | #6

On Wed 08-03-23 11:41:02, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> 
> When set_memory or set_direct_map APIs used to change attribute or
> permissions for chunks of several pages, the large PMD that maps these
> pages in the direct map must be split. Fragmenting the direct map in such
> manner causes TLB pressure and, eventually, performance degradation.
> 
> To avoid excessive direct map fragmentation, add ability to allocate
> "unmapped" pages with __GFP_UNMAPPED flag that will cause removal of the
> allocated pages from the direct map and use a cache of the unmapped pages.
> 
> This cache is replenished with higher order pages with preference for
> PMD_SIZE pages when possible so that there will be fewer splits of large
> pages in the direct map.
> 
> The cache is implemented as a buddy allocator, so it can serve high order
> allocations of unmapped pages.

Why do we need a dedicated gfp flag for all this when a dedicated
allocator is used anyway. What prevents users to call unmapped_pages_{alloc,free}?

Mike Rapoport March 25, 2023, 6:38 a.m. UTC | #7

On Fri, Mar 24, 2023 at 09:37:31AM +0100, Michal Hocko wrote:
> On Wed 08-03-23 11:41:02, Mike Rapoport wrote:
> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> > 
> > When set_memory or set_direct_map APIs used to change attribute or
> > permissions for chunks of several pages, the large PMD that maps these
> > pages in the direct map must be split. Fragmenting the direct map in such
> > manner causes TLB pressure and, eventually, performance degradation.
> > 
> > To avoid excessive direct map fragmentation, add ability to allocate
> > "unmapped" pages with __GFP_UNMAPPED flag that will cause removal of the
> > allocated pages from the direct map and use a cache of the unmapped pages.
> > 
> > This cache is replenished with higher order pages with preference for
> > PMD_SIZE pages when possible so that there will be fewer splits of large
> > pages in the direct map.
> > 
> > The cache is implemented as a buddy allocator, so it can serve high order
> > allocations of unmapped pages.
> 
> Why do we need a dedicated gfp flag for all this when a dedicated
> allocator is used anyway. What prevents users to call unmapped_pages_{alloc,free}?

Using unmapped_pages_{alloc,free} adds complexity to the users which IMO
outweighs the cost of a dedicated gfp flag.

For modules we'd have to make x86::module_{alloc,free}() take care of
mapping and unmapping the allocated pages in the modules virtual address
range. This also might become relevant for another architectures in future
and than we'll have several complex module_alloc()s. 

And for secretmem while using unmapped_pages_alloc() is easy, the free path
becomes really complex because actual page freeing for fd-based memory is
deeply buried in the page cache code.

My gut feeling is that for PKS using a gfp flag would save a lot of hassle
as well.

I also think that some of the core buddy allocator code might be reused and
unmapped_pages_{alloc,free} could be statics in mm/page_alloc.c and won't be
exposed at all. For now I've just copied free list helpers to a separate
file out of laziness.

> -- 
> Michal Hocko
> SUSE Labs

Michal Hocko March 27, 2023, 1:43 p.m. UTC | #8

On Sat 25-03-23 09:38:12, Mike Rapoport wrote:
> On Fri, Mar 24, 2023 at 09:37:31AM +0100, Michal Hocko wrote:
> > On Wed 08-03-23 11:41:02, Mike Rapoport wrote:
> > > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> > > 
> > > When set_memory or set_direct_map APIs used to change attribute or
> > > permissions for chunks of several pages, the large PMD that maps these
> > > pages in the direct map must be split. Fragmenting the direct map in such
> > > manner causes TLB pressure and, eventually, performance degradation.
> > > 
> > > To avoid excessive direct map fragmentation, add ability to allocate
> > > "unmapped" pages with __GFP_UNMAPPED flag that will cause removal of the
> > > allocated pages from the direct map and use a cache of the unmapped pages.
> > > 
> > > This cache is replenished with higher order pages with preference for
> > > PMD_SIZE pages when possible so that there will be fewer splits of large
> > > pages in the direct map.
> > > 
> > > The cache is implemented as a buddy allocator, so it can serve high order
> > > allocations of unmapped pages.
> > 
> > Why do we need a dedicated gfp flag for all this when a dedicated
> > allocator is used anyway. What prevents users to call unmapped_pages_{alloc,free}?
> 
> Using unmapped_pages_{alloc,free} adds complexity to the users which IMO
> outweighs the cost of a dedicated gfp flag.

Aren't those users rare and very special anyway?

> For modules we'd have to make x86::module_{alloc,free}() take care of
> mapping and unmapping the allocated pages in the modules virtual address
> range. This also might become relevant for another architectures in future
> and than we'll have several complex module_alloc()s. 

The module_alloc use is lacking any justification. More context would be
more than useful. Also vmalloc support for the proposed __GFP_UNMAPPED
likely needs more explanation as well.

> And for secretmem while using unmapped_pages_alloc() is easy, the free path
> becomes really complex because actual page freeing for fd-based memory is
> deeply buried in the page cache code.

Why is that a problem? You already hook into the page freeing path and
special case unmapped memory.

> My gut feeling is that for PKS using a gfp flag would save a lot of hassle
> as well.

Well, my take on this is that this is not a generic page allocator
functionality. It is clearly an allocator on top of the page allocator.
In general gfp flags are scarce and convenience argument usually fires
back later on in hard to predict ways. So I've learned to be careful
here. I am not saying this is a no-go but right now I do not see any
acutal advantage. The vmalloc usecase could be interesting in that
regards but it is not really clear to me whether this is a good idea in
the first place.

Vlastimil Babka March 27, 2023, 2:31 p.m. UTC | #9

On 3/27/23 15:43, Michal Hocko wrote:
> On Sat 25-03-23 09:38:12, Mike Rapoport wrote:
>> On Fri, Mar 24, 2023 at 09:37:31AM +0100, Michal Hocko wrote:
>> > On Wed 08-03-23 11:41:02, Mike Rapoport wrote:
>> > > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
>> > > 
>> > > When set_memory or set_direct_map APIs used to change attribute or
>> > > permissions for chunks of several pages, the large PMD that maps these
>> > > pages in the direct map must be split. Fragmenting the direct map in such
>> > > manner causes TLB pressure and, eventually, performance degradation.
>> > > 
>> > > To avoid excessive direct map fragmentation, add ability to allocate
>> > > "unmapped" pages with __GFP_UNMAPPED flag that will cause removal of the
>> > > allocated pages from the direct map and use a cache of the unmapped pages.
>> > > 
>> > > This cache is replenished with higher order pages with preference for
>> > > PMD_SIZE pages when possible so that there will be fewer splits of large
>> > > pages in the direct map.
>> > > 
>> > > The cache is implemented as a buddy allocator, so it can serve high order
>> > > allocations of unmapped pages.
>> > 
>> > Why do we need a dedicated gfp flag for all this when a dedicated
>> > allocator is used anyway. What prevents users to call unmapped_pages_{alloc,free}?
>> 
>> Using unmapped_pages_{alloc,free} adds complexity to the users which IMO
>> outweighs the cost of a dedicated gfp flag.
> 
> Aren't those users rare and very special anyway?

I think it's mostly about the freeing that can happen from a generic context
not aware of the special allocation, so it's not about how rare it is, but
how complex would be to determine exhaustively those contexts and do
something in them.

>> For modules we'd have to make x86::module_{alloc,free}() take care of
>> mapping and unmapping the allocated pages in the modules virtual address
>> range. This also might become relevant for another architectures in future
>> and than we'll have several complex module_alloc()s. 
> 
> The module_alloc use is lacking any justification. More context would be
> more than useful. Also vmalloc support for the proposed __GFP_UNMAPPED
> likely needs more explanation as well.
> 
>> And for secretmem while using unmapped_pages_alloc() is easy, the free path
>> becomes really complex because actual page freeing for fd-based memory is
>> deeply buried in the page cache code.
> 
> Why is that a problem? You already hook into the page freeing path and
> special case unmapped memory.

But the proposal of unmapped_pages_free() would suggest this would no longer
be the case?

But maybe we could, as a compromise, provide unmapped_pages_alloc() to get
rid of the new __GFP flag, provide unmapped_pages_free() to annotate places
that are known to free unmapped memory explicitly, but the generic page
freeing would also keep the hook?

>> My gut feeling is that for PKS using a gfp flag would save a lot of hassle
>> as well.
> 
> Well, my take on this is that this is not a generic page allocator
> functionality. It is clearly an allocator on top of the page allocator.
> In general gfp flags are scarce and convenience argument usually fires
> back later on in hard to predict ways. So I've learned to be careful
> here. I am not saying this is a no-go but right now I do not see any
> acutal advantage. The vmalloc usecase could be interesting in that
> regards but it is not really clear to me whether this is a good idea in
> the first place.
>

Michal Hocko March 27, 2023, 3:10 p.m. UTC | #10

On Mon 27-03-23 16:31:45, Vlastimil Babka wrote:
> On 3/27/23 15:43, Michal Hocko wrote:
> > On Sat 25-03-23 09:38:12, Mike Rapoport wrote:
> >> On Fri, Mar 24, 2023 at 09:37:31AM +0100, Michal Hocko wrote:
> >> > On Wed 08-03-23 11:41:02, Mike Rapoport wrote:
> >> > > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> >> > > 
> >> > > When set_memory or set_direct_map APIs used to change attribute or
> >> > > permissions for chunks of several pages, the large PMD that maps these
> >> > > pages in the direct map must be split. Fragmenting the direct map in such
> >> > > manner causes TLB pressure and, eventually, performance degradation.
> >> > > 
> >> > > To avoid excessive direct map fragmentation, add ability to allocate
> >> > > "unmapped" pages with __GFP_UNMAPPED flag that will cause removal of the
> >> > > allocated pages from the direct map and use a cache of the unmapped pages.
> >> > > 
> >> > > This cache is replenished with higher order pages with preference for
> >> > > PMD_SIZE pages when possible so that there will be fewer splits of large
> >> > > pages in the direct map.
> >> > > 
> >> > > The cache is implemented as a buddy allocator, so it can serve high order
> >> > > allocations of unmapped pages.
> >> > 
> >> > Why do we need a dedicated gfp flag for all this when a dedicated
> >> > allocator is used anyway. What prevents users to call unmapped_pages_{alloc,free}?
> >> 
> >> Using unmapped_pages_{alloc,free} adds complexity to the users which IMO
> >> outweighs the cost of a dedicated gfp flag.
> > 
> > Aren't those users rare and very special anyway?
> 
> I think it's mostly about the freeing that can happen from a generic context
> not aware of the special allocation, so it's not about how rare it is, but
> how complex would be to determine exhaustively those contexts and do
> something in them.

Yes, I can see a challenge with put_page users but that is not really
related to the gfp flag as those are only relevant for the allocation
context.
 
> >> For modules we'd have to make x86::module_{alloc,free}() take care of
> >> mapping and unmapping the allocated pages in the modules virtual address
> >> range. This also might become relevant for another architectures in future
> >> and than we'll have several complex module_alloc()s. 
> > 
> > The module_alloc use is lacking any justification. More context would be
> > more than useful. Also vmalloc support for the proposed __GFP_UNMAPPED
> > likely needs more explanation as well.
> > 
> >> And for secretmem while using unmapped_pages_alloc() is easy, the free path
> >> becomes really complex because actual page freeing for fd-based memory is
> >> deeply buried in the page cache code.
> > 
> > Why is that a problem? You already hook into the page freeing path and
> > special case unmapped memory.
> 
> But the proposal of unmapped_pages_free() would suggest this would no longer
> be the case?

I can see a check in the freeing path.

> But maybe we could, as a compromise, provide unmapped_pages_alloc() to get
> rid of the new __GFP flag, provide unmapped_pages_free() to annotate places
> that are known to free unmapped memory explicitly, but the generic page
> freeing would also keep the hook?

Honestly I do not see a different option if those pages are to be
reference counted. Unless they can use a destructor concept like hugetlb
pages. At least secret mem usecase cannot AFAICS.

Mike Rapoport March 28, 2023, 6:25 a.m. UTC | #11

On Mon, Mar 27, 2023 at 03:43:27PM +0200, Michal Hocko wrote:
> On Sat 25-03-23 09:38:12, Mike Rapoport wrote:
> > On Fri, Mar 24, 2023 at 09:37:31AM +0100, Michal Hocko wrote:
> > > On Wed 08-03-23 11:41:02, Mike Rapoport wrote:
> > > > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> > > > 
> > > > When set_memory or set_direct_map APIs used to change attribute or
> > > > permissions for chunks of several pages, the large PMD that maps these
> > > > pages in the direct map must be split. Fragmenting the direct map in such
> > > > manner causes TLB pressure and, eventually, performance degradation.
> > > > 
> > > > To avoid excessive direct map fragmentation, add ability to allocate
> > > > "unmapped" pages with __GFP_UNMAPPED flag that will cause removal of the
> > > > allocated pages from the direct map and use a cache of the unmapped pages.
> > > > 
> > > > This cache is replenished with higher order pages with preference for
> > > > PMD_SIZE pages when possible so that there will be fewer splits of large
> > > > pages in the direct map.
> > > > 
> > > > The cache is implemented as a buddy allocator, so it can serve high order
> > > > allocations of unmapped pages.
> > > 
> > > Why do we need a dedicated gfp flag for all this when a dedicated
> > > allocator is used anyway. What prevents users to call unmapped_pages_{alloc,free}?
> > 
> > Using unmapped_pages_{alloc,free} adds complexity to the users which IMO
> > outweighs the cost of a dedicated gfp flag.
> 
> Aren't those users rare and very special anyway?
> 
> > For modules we'd have to make x86::module_{alloc,free}() take care of
> > mapping and unmapping the allocated pages in the modules virtual address
> > range. This also might become relevant for another architectures in future
> > and than we'll have several complex module_alloc()s. 
> 
> The module_alloc use is lacking any justification. More context would be
> more than useful. Also vmalloc support for the proposed __GFP_UNMAPPED
> likely needs more explanation as well.
 
Right now module_alloc() boils down to vmalloc() with the virtual range
limited to the modules area. The allocated chunk contains both code and
data. When CONFIG_STRICT_MODULE_RWX is set, parts of the memory allocated
with module_alloc() remapped with different permissions both in vmalloc
address space and in the direct map. The change of permissions for small
ranges causes splits of large pages in the direct map.

If we were to use unmapped_pages_alloc() in modules_alloc(), we would have
to implement the part of vmalloc() that reserves the virtual addresses and
maps the allocated memory there in module_alloc().

> > And for secretmem while using unmapped_pages_alloc() is easy, the free path
> > becomes really complex because actual page freeing for fd-based memory is
> > deeply buried in the page cache code.
> 
> Why is that a problem? You already hook into the page freeing path and
> special case unmapped memory.

I didn't say there is a problem with unmapped_pages_alloc() in secretmem, I
said there is a problem with unmapped_pages_free() and hence are the
special case for unmapped memory in the freeing path.
 
> > My gut feeling is that for PKS using a gfp flag would save a lot of hassle
> > as well.
> 
> Well, my take on this is that this is not a generic page allocator
> functionality. It is clearly an allocator on top of the page allocator.
> In general gfp flags are scarce and convenience argument usually fires
> back later on in hard to predict ways. So I've learned to be careful
> here. I am not saying this is a no-go but right now I do not see any
> acutal advantage. The vmalloc usecase could be interesting in that
> regards but it is not really clear to me whether this is a good idea in
> the first place.

I don't see the usage of a gfp flag as a convenience argument, but rather
it feels for me that a gfp flag will cause less maintenance burden. Of
course this is subjective.

And although this is an allocator on top of the page allocator, it is still
very tightly coupled with the core page allocator. I'm still think that
using a migrate type for this would have been more elegant, but I realize
that a migrate type would have more impact on the allocation path.
 
> -- 
> Michal Hocko
> SUSE Labs

Michal Hocko March 28, 2023, 7:39 a.m. UTC | #12

On Tue 28-03-23 09:25:35, Mike Rapoport wrote:
> On Mon, Mar 27, 2023 at 03:43:27PM +0200, Michal Hocko wrote:
> > On Sat 25-03-23 09:38:12, Mike Rapoport wrote:
> > > On Fri, Mar 24, 2023 at 09:37:31AM +0100, Michal Hocko wrote:
> > > > On Wed 08-03-23 11:41:02, Mike Rapoport wrote:
> > > > > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> > > > > 
> > > > > When set_memory or set_direct_map APIs used to change attribute or
> > > > > permissions for chunks of several pages, the large PMD that maps these
> > > > > pages in the direct map must be split. Fragmenting the direct map in such
> > > > > manner causes TLB pressure and, eventually, performance degradation.
> > > > > 
> > > > > To avoid excessive direct map fragmentation, add ability to allocate
> > > > > "unmapped" pages with __GFP_UNMAPPED flag that will cause removal of the
> > > > > allocated pages from the direct map and use a cache of the unmapped pages.
> > > > > 
> > > > > This cache is replenished with higher order pages with preference for
> > > > > PMD_SIZE pages when possible so that there will be fewer splits of large
> > > > > pages in the direct map.
> > > > > 
> > > > > The cache is implemented as a buddy allocator, so it can serve high order
> > > > > allocations of unmapped pages.
> > > > 
> > > > Why do we need a dedicated gfp flag for all this when a dedicated
> > > > allocator is used anyway. What prevents users to call unmapped_pages_{alloc,free}?
> > > 
> > > Using unmapped_pages_{alloc,free} adds complexity to the users which IMO
> > > outweighs the cost of a dedicated gfp flag.
> > 
> > Aren't those users rare and very special anyway?
> > 
> > > For modules we'd have to make x86::module_{alloc,free}() take care of
> > > mapping and unmapping the allocated pages in the modules virtual address
> > > range. This also might become relevant for another architectures in future
> > > and than we'll have several complex module_alloc()s. 
> > 
> > The module_alloc use is lacking any justification. More context would be
> > more than useful. Also vmalloc support for the proposed __GFP_UNMAPPED
> > likely needs more explanation as well.
>  
> Right now module_alloc() boils down to vmalloc() with the virtual range
> limited to the modules area. The allocated chunk contains both code and
> data. When CONFIG_STRICT_MODULE_RWX is set, parts of the memory allocated
> with module_alloc() remapped with different permissions both in vmalloc
> address space and in the direct map. The change of permissions for small
> ranges causes splits of large pages in the direct map.

OK, so you want to reduce that direct map fragmentation? Is that a real
problem? My impression is that modules are mostly static thing. BPF
might be a different thing though. I have a recollection that BPF guys
were dealing with direct map fragmention as well.

> If we were to use unmapped_pages_alloc() in modules_alloc(), we would have
> to implement the part of vmalloc() that reserves the virtual addresses and
> maps the allocated memory there in module_alloc().

Another option would be to provide an allocator for the backing pages to
vmalloc. But I do agree that a gfp flag is a less laborous way to
achieve the same. So the primary question really is whether we really
need vmalloc support for unmapped memory.

Mike Rapoport March 28, 2023, 3:11 p.m. UTC | #13

On Tue, Mar 28, 2023 at 09:39:37AM +0200, Michal Hocko wrote:
> On Tue 28-03-23 09:25:35, Mike Rapoport wrote:
> > On Mon, Mar 27, 2023 at 03:43:27PM +0200, Michal Hocko wrote:
> > > On Sat 25-03-23 09:38:12, Mike Rapoport wrote:
> > > > On Fri, Mar 24, 2023 at 09:37:31AM +0100, Michal Hocko wrote:
> > > > > On Wed 08-03-23 11:41:02, Mike Rapoport wrote:
> > > > > > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> > > > > > 
> > > > > > When set_memory or set_direct_map APIs used to change attribute or
> > > > > > permissions for chunks of several pages, the large PMD that maps these
> > > > > > pages in the direct map must be split. Fragmenting the direct map in such
> > > > > > manner causes TLB pressure and, eventually, performance degradation.
> > > > > > 
> > > > > > To avoid excessive direct map fragmentation, add ability to allocate
> > > > > > "unmapped" pages with __GFP_UNMAPPED flag that will cause removal of the
> > > > > > allocated pages from the direct map and use a cache of the unmapped pages.
> > > > > > 
> > > > > > This cache is replenished with higher order pages with preference for
> > > > > > PMD_SIZE pages when possible so that there will be fewer splits of large
> > > > > > pages in the direct map.
> > > > > > 
> > > > > > The cache is implemented as a buddy allocator, so it can serve high order
> > > > > > allocations of unmapped pages.
> > > > > 
> > > > > Why do we need a dedicated gfp flag for all this when a dedicated
> > > > > allocator is used anyway. What prevents users to call unmapped_pages_{alloc,free}?
> > > > 
> > > > Using unmapped_pages_{alloc,free} adds complexity to the users which IMO
> > > > outweighs the cost of a dedicated gfp flag.
> > > 
> > > Aren't those users rare and very special anyway?
> > > 
> > > > For modules we'd have to make x86::module_{alloc,free}() take care of
> > > > mapping and unmapping the allocated pages in the modules virtual address
> > > > range. This also might become relevant for another architectures in future
> > > > and than we'll have several complex module_alloc()s. 
> > > 
> > > The module_alloc use is lacking any justification. More context would be
> > > more than useful. Also vmalloc support for the proposed __GFP_UNMAPPED
> > > likely needs more explanation as well.
> >  
> > Right now module_alloc() boils down to vmalloc() with the virtual range
> > limited to the modules area. The allocated chunk contains both code and
> > data. When CONFIG_STRICT_MODULE_RWX is set, parts of the memory allocated
> > with module_alloc() remapped with different permissions both in vmalloc
> > address space and in the direct map. The change of permissions for small
> > ranges causes splits of large pages in the direct map.
> 
> OK, so you want to reduce that direct map fragmentation?

Yes.

> Is that a real problem?

A while ago Intel folks published report [1] that showed better performance
with large pages in the direct map for majority of benchmarks.

> My impression is that modules are mostly static thing. BPF
> might be a different thing though. I have a recollection that BPF guys
> were dealing with direct map fragmention as well.

Modules are indeed static, but module_alloc() used by anything that
allocates code pages, e.g. kprobes, ftrace and BPF. Besides, Thomas
mentioned that having code in 2M pages reduces iTLB pressure [2], but
that's not only about avoiding the splits in the direct map but also about
using large mappings in the modules address space.

BPF guys suggested an allocator for executable memory [3] mainly because
they've seen performance improvement of 0.6% - 0.9% in their setups [4].
 
> > If we were to use unmapped_pages_alloc() in modules_alloc(), we would have
> > to implement the part of vmalloc() that reserves the virtual addresses and
> > maps the allocated memory there in module_alloc().
> 
> Another option would be to provide an allocator for the backing pages to
> vmalloc. But I do agree that a gfp flag is a less laborous way to
> achieve the same. So the primary question really is whether we really
> need vmalloc support for unmapped memory.

I'm not sure I follow here. module_alloc() is essentially an alias to
vmalloc(), so to reduce direct map fragmentation caused by code allocations
the most sensible way IMO is to support unmapped memory in vmalloc().

I also think vmalloc with unmmapped pages can provide backing pages for
execmem_alloc() Song proposed.

> -- 
> Michal Hocko
> SUSE Labs

[1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/
[2] https://lore.kernel.org/all/87mt86rbvy.ffs@tglx/
[3] https://lore.kernel.org/all/20221107223921.3451913-1-song@kernel.org/
[4] https://lore.kernel.org/bpf/20220707223546.4124919-1-song@kernel.org/

Michal Hocko March 28, 2023, 3:24 p.m. UTC | #14

On Tue 28-03-23 18:11:20, Mike Rapoport wrote:
> On Tue, Mar 28, 2023 at 09:39:37AM +0200, Michal Hocko wrote:
[...]
> > OK, so you want to reduce that direct map fragmentation?
> 
> Yes.
> 
> > Is that a real problem?
> 
> A while ago Intel folks published report [1] that showed better performance
> with large pages in the direct map for majority of benchmarks.
> 
> > My impression is that modules are mostly static thing. BPF
> > might be a different thing though. I have a recollection that BPF guys
> > were dealing with direct map fragmention as well.
> 
> Modules are indeed static, but module_alloc() used by anything that
> allocates code pages, e.g. kprobes, ftrace and BPF. Besides, Thomas
> mentioned that having code in 2M pages reduces iTLB pressure [2], but
> that's not only about avoiding the splits in the direct map but also about
> using large mappings in the modules address space.
> 
> BPF guys suggested an allocator for executable memory [3] mainly because
> they've seen performance improvement of 0.6% - 0.9% in their setups [4].

These are fair arguments and it would have been better to have them in
the RFC. Also it is not really clear to me what is the actual benefit of
the unmapping for those usecases. I do get they want to benefit from
caching on the same permission setup but do they need unmapping as well?

> > > If we were to use unmapped_pages_alloc() in modules_alloc(), we would have
> > > to implement the part of vmalloc() that reserves the virtual addresses and
> > > maps the allocated memory there in module_alloc().
> > 
> > Another option would be to provide an allocator for the backing pages to
> > vmalloc. But I do agree that a gfp flag is a less laborous way to
> > achieve the same. So the primary question really is whether we really
> > need vmalloc support for unmapped memory.
> 
> I'm not sure I follow here. module_alloc() is essentially an alias to
> vmalloc(), so to reduce direct map fragmentation caused by code allocations
> the most sensible way IMO is to support unmapped memory in vmalloc().

What I meant to say is that vmalloc is currently using the page
allocator (resp bulk allocator) for the backing storage. I can imagine
an extension to replace this default allocator by something else (e.g. a
given allocation function). This would be more generic and it would
allow different usecases (e.g. benefit from caching withtout doing the
actual unmapping).
 
> I also think vmalloc with unmmapped pages can provide backing pages for
> execmem_alloc() Song proposed.

Why would you need to have execmem_alloc have its memory virtually
mapped into vmalloc space?

> [1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/
> [2] https://lore.kernel.org/all/87mt86rbvy.ffs@tglx/
> [3] https://lore.kernel.org/all/20221107223921.3451913-1-song@kernel.org/
> [4] https://lore.kernel.org/bpf/20220707223546.4124919-1-song@kernel.org/
> 
> -- 
> Sincerely yours,
> Mike.

Luis Chamberlain March 28, 2023, 5:18 p.m. UTC | #15

Mike, please Cc linux-modules if you want modules folks' input as well ;)

On Tue, Mar 28, 2023 at 8:11 AM Mike Rapoport <rppt@kernel.org> wrote:
> On Tue, Mar 28, 2023 at 09:39:37AM +0200, Michal Hocko wrote:
> > OK, so you want to reduce that direct map fragmentation?
>
> Yes.
>
> > Is that a real problem?
>
> A while ago Intel folks published report [1] that showed better performance
> with large pages in the direct map for majority of benchmarks.
>
> > My impression is that modules are mostly static thing. BPF
> > might be a different thing though. I have a recollection that BPF guys
> > were dealing with direct map fragmention as well.
>
> Modules are indeed static, but module_alloc() used by anything that
> allocates code pages, e.g. kprobes, ftrace and BPF. Besides, Thomas
> mentioned that having code in 2M pages reduces iTLB pressure [2], but
> that's not only about avoiding the splits in the direct map but also about
> using large mappings in the modules address space.

It is easily overlooked why such things create direct fragmentation --
it's because of the special permission stuff done, module_alloc()
targets memory which can be executed somehow.

> BPF guys suggested an allocator for executable memory [3] mainly because
> they've seen performance improvement of 0.6% - 0.9% in their setups [4].

The performance metrics were completely opaque to some synthetic
environment and our goal is to showcase real value with reproducible
performance benchmarks. Since now Song is convinced that modules need
to be a first class citizen in order to generalize a special allocator
we may sooner rather than later real reproducible performance data to
show the benefit of such a special allocator. One of the big
differences with eBPF programs is that modules *can* be rather large
in size. What is the average size of modules? Well let's take a look:

mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./
-name \*.ko| wc -l
9173
mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./
-name \*.ko|  xargs stat -c "%s - %n" | awk 'BEGIN {sum=0} {sum+=$1}
END {print sum/NR/1024}'
175.1

175 MiB is pretty large.

Ignoring the top 5 piggies:

mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./
-name \*.ko|  xargs stat -c "%s - %n" | sort -n -k 1 -r |head -5
58315248 - ./drivers/gpu/drm/amd/amdgpu/amdgpu.ko
29605592 - ./drivers/gpu/drm/i915/i915.ko
18591256 - ./drivers/gpu/drm/nouveau/nouveau.ko
16867048 - ./fs/xfs/xfs.ko
14209440 - ./fs/btrfs/btrfs.ko

mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./
-name \*.ko|  xargs stat -c "%s - %n" | sort -n -k 1 -r | tail
-$((9173-5)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
160.54

Ignoring the top 10 largest modules:

mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./
-name \*.ko|  xargs stat -c "%s - %n" | sort -n -k 1 -r | tail
-$((9173-10)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
154.656

Ignoring the top 20 piggies:
mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./
-name \*.ko|  xargs stat -c "%s - %n" | sort -n -k 1 -r | tail
-$((9173-20)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
146.384

Ignoring the top 100 bloated modules:
mcgrof@bigtwin mirror/code/mcgrof/linux-next (git::master)$ find ./
-name \*.ko|  xargs stat -c "%s - %n" | sort -n -k 1 -r | tail
-$((9173-100)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
117.97

Ignoring the top 1000 bloated modules:
mcgrof@bigtwin mirror/code/mcgrof/linux-next (git::master)$ find ./
-name \*.ko|  xargs stat -c "%s - %n" | sort -n -k 1 -r | tail
-$((9173-1000)) | awk 'BEGIN {sum=0} {sum+=$1} END {print
sum/NR/1024}'
64.4686

Ignoring the top 2000 bloated modules:
mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./
-name \*.ko|  xargs stat -c "%s - %n" | sort -n -k 1 -r | tail
-$((9173-2000)) | awk 'BEGIN {sum=0} {sum+=$1} END {print
sum/NR/1024}'
48.7869

Ignoring top 3000
mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./
-name \*.ko|  xargs stat -c "%s - %n" | sort -n -k 1 -r | tail
-$((9173-3000)) | awk 'BEGIN {sum=0} {sum+=$1} END {print
sum/NR/1024}'
39.6037

Ignoring top 4000
mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./
-name \*.ko|  xargs stat -c "%s - %n" | sort -n -k 1 -r | tail
-$((9173-4000)) | awk 'BEGIN {sum=0} {sum+=$1} END {print
sum/NR/1024}'
33.106

Ignoring top 5000
mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./
-name \*.ko|  xargs stat -c "%s - %n" | sort -n -k 1 -r | tail
-$((9173-5000)) | awk 'BEGIN {sum=0} {sum+=$1} END {print
sum/NR/1024}'
28.0925

But at least on the driver front we know we won't always have loaded
*all* GPU drivers, but *one*. So the math needs to consider what
module sizes are for modules actually used.

Let's see what the average module size is on on a big system:

mcgrof@bigtwin ~ $ lsmod | grep -v Module| awk '{print $1}' | xargs
sudo modinfo --field filename |  xargs stat -c "%s - %n" | awk 'BEGIN
{sum=0} {sum+=$1} END {print sum/NR/1024}'
313.432

On a small desktop:
mcgrof@desktop ~ $ lsmod | grep -v Module| awk '{print $1}' | xargs
sudo modinfo --field filename |  xargs stat -c "%s - %n" | awk 'BEGIN
{sum=0} {sum+=$1} END {print sum/NR/1024}'
292.786

For each finit_module we also do a vmalloc twice actually, one for the
kernel_read_*() which lets us read the file from userspace, and then a
second set of allocations where we copy data over, each step either
using vzalloc() or module_alloc() (vmalloc() with special
permissions), this is more easily reflected and visible now on
linux-next with Song's new module_memory_alloc().

However -- for the context of *this* effort, we're only talking about
the executable sections, the areas we'd use module_alloc() for.

On a big system:

mcgrof@bigtwin ~ $ lsmod | grep -v Module| awk '{print $1}' | xargs
sudo modinfo --field filename | xargs size --radix=10 | awk '{print
$1}'| grep -v text| awk 'BEGIN {sum=0} {sum+=$1} END {print
sum/NR/1024}'
88.7964

On a small desktop:
mcgrof@desktop ~ $ lsmod | grep -v Module| awk '{print $1}' | xargs
sudo modinfo --field filename | xargs size --radix=10 | awk '{print
$1}'| grep -v text| awk 'BEGIN {sum=0} {sum+=$1} END {print
sum/NR/1024}'
92.1473

Regardless, the amount of memory allocated is pretty significant, to
the point a 400 CPU system easily run out of vmap space (yes the
kernel does some stupid stuff, which we're correcting over time):

https://lkml.kernel.org/r/20221013180518.217405-1-david@redhat.com

To help with this we have some recent efforts to help with this
pressure on vmap space:

https://lkml.kernel.org/r/20230311051712.4095040-1-mcgrof@kernel.org

> > > If we were to use unmapped_pages_alloc() in modules_alloc(), we would have
> > > to implement the part of vmalloc() that reserves the virtual addresses and
> > > maps the allocated memory there in module_alloc().
> >
> > Another option would be to provide an allocator for the backing pages to
> > vmalloc. But I do agree that a gfp flag is a less laborous way to
> > achieve the same. So the primary question really is whether we really
> > need vmalloc support for unmapped memory.
>
> I'm not sure I follow here. module_alloc() is essentially an alias to
> vmalloc(), so to reduce direct map fragmentation caused by code allocations
> the most sensible way IMO is to support unmapped memory in vmalloc().

The big win also would be to use huge pages from a performance point
of view, specially reducing iTLB pressure, however that is *slightly*
orthogonal to your goal of reducing direct map fragmentation. However,
we should not ignore that strategy as it is a special use case which
*could* be leveraged in other ways too. And so I'd prefer to see that
over just a flag to hide those allocations. It would not only reduce
that fragmentation, but reduce iTLB pressure which I think *is* what
experimentation revealed to show more gains.

> I also think vmalloc with unmmapped pages can provide backing pages for
> execmem_alloc() Song proposed.

Only if you consider huge pages. I don't see that here.

  Luis

Matthew Wilcox March 28, 2023, 5:37 p.m. UTC | #16

On Tue, Mar 28, 2023 at 10:18:50AM -0700, Luis Chamberlain wrote:
> differences with eBPF programs is that modules *can* be rather large
> in size. What is the average size of modules? Well let's take a look:
> 
> mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./
> -name \*.ko| wc -l
> 9173

ummm ... wc -c, surely?

> mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./
> -name \*.ko|  xargs stat -c "%s - %n" | sort -n -k 1 -r | tail
> -$((9173-5)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
> 160.54

... which invalidates all of these.

Luis Chamberlain March 28, 2023, 5:52 p.m. UTC | #17

On Tue, Mar 28, 2023 at 06:37:13PM +0100, Matthew Wilcox wrote:
> On Tue, Mar 28, 2023 at 10:18:50AM -0700, Luis Chamberlain wrote:
> > differences with eBPF programs is that modules *can* be rather large
> > in size. What is the average size of modules? Well let's take a look:
> > 
> > mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./
> > -name \*.ko| wc -l
> > 9173
> 
> ummm ... wc -c, surely?

That's the number of allmodconfig modules found.

mcgrof@fulton ~/linux (git::sysctl-next)$ find ./ -name \*.ko| head -2
./arch/x86/crypto/twofish-x86_64.ko
./arch/x86/crypto/serpent-avx2.ko
mcgrof@fulton ~/linux (git::sysctl-next)$ find ./ -name \*.ko| head -2 |
wc -l
2
mcgrof@fulton ~/linux (git::sysctl-next)$ find ./ -name \*.ko| head -2 |
wc -c
70

wc -c would give a lot more. wc -l gives me the module count.

> > mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./
> > -name \*.ko|  xargs stat -c "%s - %n" | sort -n -k 1 -r | tail
> > -$((9173-5)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
> > 160.54
> 
> ... which invalidates all of these.

Not sure ? But regardless the *.text* lookup is what we care for though
which was later.

  Luis

Luis Chamberlain March 28, 2023, 5:55 p.m. UTC | #18

On Tue, Mar 28, 2023 at 10:52:08AM -0700, Luis Chamberlain wrote:
> On Tue, Mar 28, 2023 at 06:37:13PM +0100, Matthew Wilcox wrote:
> > On Tue, Mar 28, 2023 at 10:18:50AM -0700, Luis Chamberlain wrote:
> > > differences with eBPF programs is that modules *can* be rather large
> > > in size. What is the average size of modules? Well let's take a look:
> > > 
> > > mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./
> > > -name \*.ko| wc -l
> > > 9173
> > 
> > ummm ... wc -c, surely?
> 
> That's the number of allmodconfig modules found.
> 
> mcgrof@fulton ~/linux (git::sysctl-next)$ find ./ -name \*.ko| head -2
> ./arch/x86/crypto/twofish-x86_64.ko
> ./arch/x86/crypto/serpent-avx2.ko
> mcgrof@fulton ~/linux (git::sysctl-next)$ find ./ -name \*.ko| head -2 |
> wc -l
> 2
> mcgrof@fulton ~/linux (git::sysctl-next)$ find ./ -name \*.ko| head -2 |
> wc -c
> 70
> 
> wc -c would give a lot more. wc -l gives me the module count.
> 
> > > mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./
> > > -name \*.ko|  xargs stat -c "%s - %n" | sort -n -k 1 -r | tail
> > > -$((9173-5)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}'
> > > 160.54
> > 
> > ... which invalidates all of these.
> 
> Not sure ? But regardless the *.text* lookup is what we care for though
> which was later.

Which gets me thinking it'd be super nice if kmod tools supported
querying this for us, then no fat finger could mess up the math:

For all modules available:
 * Average module size
 * Average .text module size

For only modules loaded:
 * Average module size
 * Average .text module size

  Luis

Mike Rapoport March 29, 2023, 7:28 a.m. UTC | #19

On Tue, Mar 28, 2023 at 05:24:49PM +0200, Michal Hocko wrote:
> On Tue 28-03-23 18:11:20, Mike Rapoport wrote:
> > On Tue, Mar 28, 2023 at 09:39:37AM +0200, Michal Hocko wrote:
> [...]
> > > OK, so you want to reduce that direct map fragmentation?
> > 
> > Yes.
> > 
> > > Is that a real problem?
> > 
> > A while ago Intel folks published report [1] that showed better performance
> > with large pages in the direct map for majority of benchmarks.
> > 
> > > My impression is that modules are mostly static thing. BPF
> > > might be a different thing though. I have a recollection that BPF guys
> > > were dealing with direct map fragmention as well.
> > 
> > Modules are indeed static, but module_alloc() used by anything that
> > allocates code pages, e.g. kprobes, ftrace and BPF. Besides, Thomas
> > mentioned that having code in 2M pages reduces iTLB pressure [2], but
> > that's not only about avoiding the splits in the direct map but also about
> > using large mappings in the modules address space.
> > 
> > BPF guys suggested an allocator for executable memory [3] mainly because
> > they've seen performance improvement of 0.6% - 0.9% in their setups [4].
> 
> These are fair arguments and it would have been better to have them in
> the RFC. Also it is not really clear to me what is the actual benefit of
> the unmapping for those usecases. I do get they want to benefit from
> caching on the same permission setup but do they need unmapping as well?

The pages allocated with module_alloc() get different permissions depending
on whether they belong to text, rodata, data etc. The permissions are
updated in both vmalloc address space and in the direct map. The updates to
the direct map cause splits of the large pages. If we cache large pages as
unmapped we take out the entire 2M page from the direct map and then
if/when it becomes free it can be returned to the direct map as a 2M page.

Generally, the unmapped allocations are intended for use-cases that anyway
map the memory elsewhere than direct map and need to modify direct mappings
of the memory, be it modules_alloc(), secretmem, PKS page tables or maybe
even some of the encrypted VM memory.

> > > > If we were to use unmapped_pages_alloc() in modules_alloc(), we would have
> > > > to implement the part of vmalloc() that reserves the virtual addresses and
> > > > maps the allocated memory there in module_alloc().
> > > 
> > > Another option would be to provide an allocator for the backing pages to
> > > vmalloc. But I do agree that a gfp flag is a less laborous way to
> > > achieve the same. So the primary question really is whether we really
> > > need vmalloc support for unmapped memory.
> > 
> > I'm not sure I follow here. module_alloc() is essentially an alias to
> > vmalloc(), so to reduce direct map fragmentation caused by code allocations
> > the most sensible way IMO is to support unmapped memory in vmalloc().
> 
> What I meant to say is that vmalloc is currently using the page
> allocator (resp bulk allocator) for the backing storage. I can imagine
> an extension to replace this default allocator by something else (e.g. a
> given allocation function). This would be more generic and it would
> allow different usecases (e.g. benefit from caching withtout doing the
> actual unmapping).

The whole point of unmapped cache is to allow non-default permissions in
the direct map without splitting large pages there. So the cache that
unmaps large pages in the direct map and then handles out subpages of that
page seems like the most logical thing to do. With the current use cases
the callers anyway map these pages at different virtual addresses, i.e. 
page cache or vmalloc.

> > I also think vmalloc with unmmapped pages can provide backing pages for
> > execmem_alloc() Song proposed.
> 
> Why would you need to have execmem_alloc have its memory virtually
> mapped into vmalloc space?

Currently all the memory allocated for code is managed in a subset of
vmalloc() space. The intention of execmem_alloc() was to replace
module_alloc() for the code pages, so it's natural that it will use the
same virtual ranges.

But anyway, execmem_alloc() is a long shot as it also requires quite a
refactoring of modules loading.

> > [1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/
> > [2] https://lore.kernel.org/all/87mt86rbvy.ffs@tglx/
> > [3] https://lore.kernel.org/all/20221107223921.3451913-1-song@kernel.org/
> > [4] https://lore.kernel.org/bpf/20220707223546.4124919-1-song@kernel.org/
> > 
> > -- 
> > Sincerely yours,
> > Mike.
> 
> -- 
> Michal Hocko
> SUSE Labs

Michal Hocko March 29, 2023, 8:13 a.m. UTC | #20

On Wed 29-03-23 10:28:02, Mike Rapoport wrote:
> On Tue, Mar 28, 2023 at 05:24:49PM +0200, Michal Hocko wrote:
> > On Tue 28-03-23 18:11:20, Mike Rapoport wrote:
> > > On Tue, Mar 28, 2023 at 09:39:37AM +0200, Michal Hocko wrote:
> > [...]
> > > > OK, so you want to reduce that direct map fragmentation?
> > > 
> > > Yes.
> > > 
> > > > Is that a real problem?
> > > 
> > > A while ago Intel folks published report [1] that showed better performance
> > > with large pages in the direct map for majority of benchmarks.
> > > 
> > > > My impression is that modules are mostly static thing. BPF
> > > > might be a different thing though. I have a recollection that BPF guys
> > > > were dealing with direct map fragmention as well.
> > > 
> > > Modules are indeed static, but module_alloc() used by anything that
> > > allocates code pages, e.g. kprobes, ftrace and BPF. Besides, Thomas
> > > mentioned that having code in 2M pages reduces iTLB pressure [2], but
> > > that's not only about avoiding the splits in the direct map but also about
> > > using large mappings in the modules address space.
> > > 
> > > BPF guys suggested an allocator for executable memory [3] mainly because
> > > they've seen performance improvement of 0.6% - 0.9% in their setups [4].
> > 
> > These are fair arguments and it would have been better to have them in
> > the RFC. Also it is not really clear to me what is the actual benefit of
> > the unmapping for those usecases. I do get they want to benefit from
> > caching on the same permission setup but do they need unmapping as well?
> 
> The pages allocated with module_alloc() get different permissions depending
> on whether they belong to text, rodata, data etc. The permissions are
> updated in both vmalloc address space and in the direct map. The updates to
> the direct map cause splits of the large pages.

That much is clear (wouldn't hurt to mention that in the changelog
though).

> If we cache large pages as
> unmapped we take out the entire 2M page from the direct map and then
> if/when it becomes free it can be returned to the direct map as a 2M page.
> 
> Generally, the unmapped allocations are intended for use-cases that anyway
> map the memory elsewhere than direct map and need to modify direct mappings
> of the memory, be it modules_alloc(), secretmem, PKS page tables or maybe
> even some of the encrypted VM memory.

I believe we are still not on the same page here. I do understand that
you want to re-use the caching capability of the unmapped_pages_alloc
for modules allocations as well. My question is whether module_alloc
really benefits from the fact that the memory is unmapped?

I guess you want to say that it does because it wouldn't have to change
the permission for the direct map but I do not see that anywhere in the
patch... Also consinder the slow path where module_alloc needs to
allocate a fresh (huge)page and unmap it. Does the overhead of the
unmapping suits needs of all module_alloc users? Module loader is likely
not interesting as it tends to be rather static but what about BPF
programs?

[...]
> > > I also think vmalloc with unmmapped pages can provide backing pages for
> > > execmem_alloc() Song proposed.
> > 
> > Why would you need to have execmem_alloc have its memory virtually
> > mapped into vmalloc space?
> 
> Currently all the memory allocated for code is managed in a subset of
> vmalloc() space. The intention of execmem_alloc() was to replace
> module_alloc() for the code pages, so it's natural that it will use the
> same virtual ranges.

Ohh, sorry my bad. I have only now realized I have misread and thought
about secretmem_alloc...

Mike Rapoport March 30, 2023, 5:13 a.m. UTC | #21

On Wed, Mar 29, 2023 at 10:13:23AM +0200, Michal Hocko wrote:
> On Wed 29-03-23 10:28:02, Mike Rapoport wrote:
> > On Tue, Mar 28, 2023 at 05:24:49PM +0200, Michal Hocko wrote:
> > > On Tue 28-03-23 18:11:20, Mike Rapoport wrote:
> > > > On Tue, Mar 28, 2023 at 09:39:37AM +0200, Michal Hocko wrote:
> > > [...]
> > > > > OK, so you want to reduce that direct map fragmentation?
> > > > 
> > > > Yes.
> > > > 
> > > > > Is that a real problem?
> > > > 
> > > > A while ago Intel folks published report [1] that showed better performance
> > > > with large pages in the direct map for majority of benchmarks.
> > > > 
> > > > > My impression is that modules are mostly static thing. BPF
> > > > > might be a different thing though. I have a recollection that BPF guys
> > > > > were dealing with direct map fragmention as well.
> > > > 
> > > > Modules are indeed static, but module_alloc() used by anything that
> > > > allocates code pages, e.g. kprobes, ftrace and BPF. Besides, Thomas
> > > > mentioned that having code in 2M pages reduces iTLB pressure [2], but
> > > > that's not only about avoiding the splits in the direct map but also about
> > > > using large mappings in the modules address space.
> > > > 
> > > > BPF guys suggested an allocator for executable memory [3] mainly because
> > > > they've seen performance improvement of 0.6% - 0.9% in their setups [4].
> > > 
> > > These are fair arguments and it would have been better to have them in
> > > the RFC. Also it is not really clear to me what is the actual benefit of
> > > the unmapping for those usecases. I do get they want to benefit from
> > > caching on the same permission setup but do they need unmapping as well?
> > 
> > The pages allocated with module_alloc() get different permissions depending
> > on whether they belong to text, rodata, data etc. The permissions are
> > updated in both vmalloc address space and in the direct map. The updates to
> > the direct map cause splits of the large pages.
> 
> That much is clear (wouldn't hurt to mention that in the changelog
> though).
> 
> > If we cache large pages as
> > unmapped we take out the entire 2M page from the direct map and then
> > if/when it becomes free it can be returned to the direct map as a 2M page.
> > 
> > Generally, the unmapped allocations are intended for use-cases that anyway
> > map the memory elsewhere than direct map and need to modify direct mappings
> > of the memory, be it modules_alloc(), secretmem, PKS page tables or maybe
> > even some of the encrypted VM memory.
> 
> I believe we are still not on the same page here. I do understand that
> you want to re-use the caching capability of the unmapped_pages_alloc
> for modules allocations as well. My question is whether module_alloc
> really benefits from the fact that the memory is unmapped?
> 
> I guess you want to say that it does because it wouldn't have to change
> the permission for the direct map but I do not see that anywhere in the
> patch...

This happens automagically outside the patch :)

Currently, to change memory permissions modules code calls set_memory APIs
and passes vmalloced address to them. set_memory functions lookup the
direct map alias and update the permissions there as well.
If the memory allocated with module_alloc() is unmapped in the direct map,
there won't be an alias for set_memory APIs to update.

> Also consinder the slow path where module_alloc needs to
> allocate a fresh (huge)page and unmap it. Does the overhead of the
> unmapping suits needs of all module_alloc users? Module loader is likely
> not interesting as it tends to be rather static but what about BPF
> programs?

The overhead of unmapping pages in the direct map on allocation path will
be offset by reduced overhead of updating permissions in the direct map
after the allocation. Both are using the same APIs and if today the
permission update causes a split of a large page, unmapping of a large page
won't.

Of course in a loaded system unmapped_alloc() won't be able to always
allocated large pages to replenish the cache, but still there will be fewer
updates to the direct map.
 
> -- 
> Michal Hocko
> SUSE Labs

Michal Hocko March 30, 2023, 8:11 a.m. UTC | #22

On Thu 30-03-23 08:13:48, Mike Rapoport wrote:
> On Wed, Mar 29, 2023 at 10:13:23AM +0200, Michal Hocko wrote:
> > On Wed 29-03-23 10:28:02, Mike Rapoport wrote:
> > > On Tue, Mar 28, 2023 at 05:24:49PM +0200, Michal Hocko wrote:
> > > > On Tue 28-03-23 18:11:20, Mike Rapoport wrote:
> > > > > On Tue, Mar 28, 2023 at 09:39:37AM +0200, Michal Hocko wrote:
> > > > [...]
> > > > > > OK, so you want to reduce that direct map fragmentation?
> > > > > 
> > > > > Yes.
> > > > > 
> > > > > > Is that a real problem?
> > > > > 
> > > > > A while ago Intel folks published report [1] that showed better performance
> > > > > with large pages in the direct map for majority of benchmarks.
> > > > > 
> > > > > > My impression is that modules are mostly static thing. BPF
> > > > > > might be a different thing though. I have a recollection that BPF guys
> > > > > > were dealing with direct map fragmention as well.
> > > > > 
> > > > > Modules are indeed static, but module_alloc() used by anything that
> > > > > allocates code pages, e.g. kprobes, ftrace and BPF. Besides, Thomas
> > > > > mentioned that having code in 2M pages reduces iTLB pressure [2], but
> > > > > that's not only about avoiding the splits in the direct map but also about
> > > > > using large mappings in the modules address space.
> > > > > 
> > > > > BPF guys suggested an allocator for executable memory [3] mainly because
> > > > > they've seen performance improvement of 0.6% - 0.9% in their setups [4].
> > > > 
> > > > These are fair arguments and it would have been better to have them in
> > > > the RFC. Also it is not really clear to me what is the actual benefit of
> > > > the unmapping for those usecases. I do get they want to benefit from
> > > > caching on the same permission setup but do they need unmapping as well?
> > > 
> > > The pages allocated with module_alloc() get different permissions depending
> > > on whether they belong to text, rodata, data etc. The permissions are
> > > updated in both vmalloc address space and in the direct map. The updates to
> > > the direct map cause splits of the large pages.
> > 
> > That much is clear (wouldn't hurt to mention that in the changelog
> > though).
> > 
> > > If we cache large pages as
> > > unmapped we take out the entire 2M page from the direct map and then
> > > if/when it becomes free it can be returned to the direct map as a 2M page.
> > > 
> > > Generally, the unmapped allocations are intended for use-cases that anyway
> > > map the memory elsewhere than direct map and need to modify direct mappings
> > > of the memory, be it modules_alloc(), secretmem, PKS page tables or maybe
> > > even some of the encrypted VM memory.
> > 
> > I believe we are still not on the same page here. I do understand that
> > you want to re-use the caching capability of the unmapped_pages_alloc
> > for modules allocations as well. My question is whether module_alloc
> > really benefits from the fact that the memory is unmapped?
> > 
> > I guess you want to say that it does because it wouldn't have to change
> > the permission for the direct map but I do not see that anywhere in the
> > patch...
> 
> This happens automagically outside the patch :)
> 
> Currently, to change memory permissions modules code calls set_memory APIs
> and passes vmalloced address to them. set_memory functions lookup the
> direct map alias and update the permissions there as well.
> If the memory allocated with module_alloc() is unmapped in the direct map,
> there won't be an alias for set_memory APIs to update.
> 
> > Also consinder the slow path where module_alloc needs to
> > allocate a fresh (huge)page and unmap it. Does the overhead of the
> > unmapping suits needs of all module_alloc users? Module loader is likely
> > not interesting as it tends to be rather static but what about BPF
> > programs?
> 
> The overhead of unmapping pages in the direct map on allocation path will
> be offset by reduced overhead of updating permissions in the direct map
> after the allocation. Both are using the same APIs and if today the
> permission update causes a split of a large page, unmapping of a large page
> won't.
> 
> Of course in a loaded system unmapped_alloc() won't be able to always
> allocated large pages to replenish the cache, but still there will be fewer
> updates to the direct map.
 
Ok, all of that is a changelog material. I would recommend to go this
way. Start by the simple unmapped_pages_alloc interface and use it for
the secret memory. There shouldn't be anything controversial there. In a
follow up patch add a support for the vmalloc which would add a new gfp
flag with a justification that this is the simplest way to support
module_alloc usecase and do not feel shy of providing as much context as
you can. Ideally with some numbers for the best/worst/avg cases.

Thanks

Kent Overstreet May 18, 2023, 3:35 a.m. UTC | #23

On Wed, Mar 08, 2023 at 11:41:02AM +0200, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> 
> When set_memory or set_direct_map APIs used to change attribute or
> permissions for chunks of several pages, the large PMD that maps these
> pages in the direct map must be split. Fragmenting the direct map in such
> manner causes TLB pressure and, eventually, performance degradation.
> 
> To avoid excessive direct map fragmentation, add ability to allocate
> "unmapped" pages with __GFP_UNMAPPED flag that will cause removal of the
> allocated pages from the direct map and use a cache of the unmapped pages.
> 
> This cache is replenished with higher order pages with preference for
> PMD_SIZE pages when possible so that there will be fewer splits of large
> pages in the direct map.
> 
> The cache is implemented as a buddy allocator, so it can serve high order
> allocations of unmapped pages.

So I'm late to this discussion, I stumbled in because of my own run in
with executable memory allocation.

I understand that post LSF this patchset seems to not be going anywhere,
but OTOH there's also been a desire for better executable memory
allocation; as noted by tglx and elsewhere, there _is_ a definite
performance impact on page size with kernel text - I've seen numbers in
the multiple single digit percentage range in the past.

This patchset does seem to me to be roughly the right approach for that,
and coupled with the slab allocator for sub-page sized allocations it
seems there's the potential for getting a nice interface that spans the
full range of allocation sizes, from small bpf/trampoline allocations up
to modules.

Is this patchset worth reviving/continuing with? Was it really just the
needed module refactoring that was the blocker?

Mike Rapoport May 18, 2023, 3:23 p.m. UTC | #24

On Wed, May 17, 2023 at 11:35:56PM -0400, Kent Overstreet wrote:
> On Wed, Mar 08, 2023 at 11:41:02AM +0200, Mike Rapoport wrote:
> > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> > 
> > When set_memory or set_direct_map APIs used to change attribute or
> > permissions for chunks of several pages, the large PMD that maps these
> > pages in the direct map must be split. Fragmenting the direct map in such
> > manner causes TLB pressure and, eventually, performance degradation.
> > 
> > To avoid excessive direct map fragmentation, add ability to allocate
> > "unmapped" pages with __GFP_UNMAPPED flag that will cause removal of the
> > allocated pages from the direct map and use a cache of the unmapped pages.
> > 
> > This cache is replenished with higher order pages with preference for
> > PMD_SIZE pages when possible so that there will be fewer splits of large
> > pages in the direct map.
> > 
> > The cache is implemented as a buddy allocator, so it can serve high order
> > allocations of unmapped pages.
> 
> So I'm late to this discussion, I stumbled in because of my own run in
> with executable memory allocation.
> 
> I understand that post LSF this patchset seems to not be going anywhere,
> but OTOH there's also been a desire for better executable memory
> allocation; as noted by tglx and elsewhere, there _is_ a definite
> performance impact on page size with kernel text - I've seen numbers in
> the multiple single digit percentage range in the past.
> 
> This patchset does seem to me to be roughly the right approach for that,
> and coupled with the slab allocator for sub-page sized allocations it
> seems there's the potential for getting a nice interface that spans the
> full range of allocation sizes, from small bpf/trampoline allocations up
> to modules.
> 
> Is this patchset worth reviving/continuing with? Was it really just the
> needed module refactoring that was the blocker?

As I see it, this patchset only one building block out of three? four?
If we are to repurpose it for code allocations it should be something like

1) allocate 2M page to fill the cache
2) remove this page from the direct map
3) map the 2M page ROX in module address space (usually some part of
   vmalloc address space)
4) allocate a smaller chunk of that page to the actual caller (bpf,
   modules, whatever)

Right now (3) and (4) won't work for modules because they mix code and data
in a single allocation.

So module refactoring is a blocker and another blocker is to teach vmalloc
to map the areas for the executable memory with 2M pages and probably
something else.

I remember there was an attempt to add VM_ALLOW_HUGE_VMAP to
x86::module_alloc(), but it caused problems and was reverted. Sorry, I
could not find the lore link.

There was a related discussion here:

https://lore.kernel.org/all/14D6DBA0-0572-44FB-A566-464B1FF541E0@fb.com/

Song Liu May 18, 2023, 4:33 p.m. UTC | #25

On Thu, May 18, 2023 at 8:24 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Wed, May 17, 2023 at 11:35:56PM -0400, Kent Overstreet wrote:
> > On Wed, Mar 08, 2023 at 11:41:02AM +0200, Mike Rapoport wrote:
> > > From: "Mike Rapoport (IBM)" <rppt@kernel.org>
> > >
> > > When set_memory or set_direct_map APIs used to change attribute or
> > > permissions for chunks of several pages, the large PMD that maps these
> > > pages in the direct map must be split. Fragmenting the direct map in such
> > > manner causes TLB pressure and, eventually, performance degradation.
> > >
> > > To avoid excessive direct map fragmentation, add ability to allocate
> > > "unmapped" pages with __GFP_UNMAPPED flag that will cause removal of the
> > > allocated pages from the direct map and use a cache of the unmapped pages.
> > >
> > > This cache is replenished with higher order pages with preference for
> > > PMD_SIZE pages when possible so that there will be fewer splits of large
> > > pages in the direct map.
> > >
> > > The cache is implemented as a buddy allocator, so it can serve high order
> > > allocations of unmapped pages.
> >
> > So I'm late to this discussion, I stumbled in because of my own run in
> > with executable memory allocation.
> >
> > I understand that post LSF this patchset seems to not be going anywhere,
> > but OTOH there's also been a desire for better executable memory
> > allocation; as noted by tglx and elsewhere, there _is_ a definite
> > performance impact on page size with kernel text - I've seen numbers in
> > the multiple single digit percentage range in the past.
> >
> > This patchset does seem to me to be roughly the right approach for that,
> > and coupled with the slab allocator for sub-page sized allocations it
> > seems there's the potential for getting a nice interface that spans the
> > full range of allocation sizes, from small bpf/trampoline allocations up
> > to modules.
> >
> > Is this patchset worth reviving/continuing with? Was it really just the
> > needed module refactoring that was the blocker?
>
> As I see it, this patchset only one building block out of three? four?
> If we are to repurpose it for code allocations it should be something like
>
> 1) allocate 2M page to fill the cache
> 2) remove this page from the direct map
> 3) map the 2M page ROX in module address space (usually some part of
>    vmalloc address space)
> 4) allocate a smaller chunk of that page to the actual caller (bpf,
>    modules, whatever)
>
> Right now (3) and (4) won't work for modules because they mix code and data
> in a single allocation.

I am working on patches based on the discussion in [1]. I am planning to
send v1 for review in a week or so.

Thanks,
Song

[1] https://lore.kernel.org/linux-mm/20221107223921.3451913-1-song@kernel.org/

[...]

Kent Overstreet May 18, 2023, 4:48 p.m. UTC | #26

On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote:
> I am working on patches based on the discussion in [1]. I am planning to
> send v1 for review in a week or so.

Hey Song, I was reviewing that thread too,

Are you taking a different approach based on Thomas's feedback? I think
he had some fair points in that thread.

My own feeling is that the buddy allocator is our tool for allocating
larger variable sized physically contiguous allocations, so I'd like to
see something based on that - I think we could do a hybrid buddy/slab
allocator approach, like we have for regular memory allocations.

I started on a slab allocator for executable memory allocations the
other day (very minimal, but tested it for bcachefs and it works).

But I'd love to hear more about your current approach as well.

Cheers,
Kent

Kent Overstreet May 18, 2023, 4:58 p.m. UTC | #27

On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote:
> I am working on patches based on the discussion in [1]. I am planning to
> send v1 for review in a week or so.

For reference, here's my own (early, but functioning :) slab allocator:

Look forward to comparing!
-->--
From 6eeb6b8ef4271ea1a8d9cac7fbaeeb7704951976 Mon Sep 17 00:00:00 2001
From: Kent Overstreet <kent.overstreet@linux.dev>
Date: Wed, 17 May 2023 01:22:06 -0400
Subject: [PATCH] mm: jit/text allocator

This provides a new, very simple slab allocator for jit/text, i.e. bpf,
ftrace trampolines, or bcachefs unpack functions.

With this API we can avoid ever mapping pages both writeable and
executable (not implemented in this patch: need to tweak
module_alloc()), and it also supports sub-page sized allocations.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

diff --git a/include/linux/jitalloc.h b/include/linux/jitalloc.h
new file mode 100644
index 0000000000..f1549d60e8
--- /dev/null
+++ b/include/linux/jitalloc.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_JITALLOC_H
+#define _LINUX_JITALLOC_H
+
+void jit_update(void *buf, void *new_buf, size_t len);
+void jit_free(void *buf);
+void *jit_alloc(void *buf, size_t len);
+
+#endif /* _LINUX_JITALLOC_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 4751031f3f..ff26a4f0c9 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1202,6 +1202,9 @@ config LRU_GEN_STATS
 	  This option has a per-memcg and per-node memory overhead.
 # }
 
+config JITALLOC
+	bool
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index c03e1e5859..25e82db9e8 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -138,3 +138,4 @@ obj-$(CONFIG_IO_MAPPING) += io-mapping.o
 obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
 obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
 obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
+obj-$(CONFIG_JITALLOC) += jitalloc.o
diff --git a/mm/jitalloc.c b/mm/jitalloc.c
new file mode 100644
index 0000000000..7c4d621802
--- /dev/null
+++ b/mm/jitalloc.c
@@ -0,0 +1,187 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/gfp.h>
+#include <linux/highmem.h>
+#include <linux/jitalloc.h>
+#include <linux/mm.h>
+#include <linux/moduleloader.h>
+#include <linux/mutex.h>
+#include <linux/set_memory.h>
+#include <linux/vmalloc.h>
+
+#include <asm/text-patching.h>
+
+static DEFINE_MUTEX(jit_alloc_lock);
+
+struct jit_cache {
+	unsigned		obj_size_bits;
+	unsigned		objs_per_slab;
+	struct list_head	partial;
+};
+
+#define JITALLOC_MIN_SIZE	16
+#define NR_JIT_CACHES		ilog2(PAGE_SIZE / JITALLOC_MIN_SIZE)
+
+static struct jit_cache jit_caches[NR_JIT_CACHES];
+
+struct jit_slab {
+	unsigned long		__page_flags;
+
+	struct jit_cache	*cache;
+	void			*executably_mapped;;
+	unsigned long		*objs_allocated; /* bitmap of free objects */
+	struct list_head	list;
+};
+
+#define folio_jit_slab(folio)		(_Generic((folio),			\
+	const struct folio *:		(const struct jit_slab *)(folio),	\
+	struct folio *:			(struct jit_slab *)(folio)))
+
+#define jit_slab_folio(s)		(_Generic((s),				\
+	const struct jit_slab *:	(const struct folio *)s,		\
+	struct jit_slab *:		(struct folio *)s))
+
+static struct jit_slab *jit_slab_alloc(struct jit_cache *cache)
+{
+	void *executably_mapped = module_alloc(PAGE_SIZE);
+	struct page *page;
+	struct folio *folio;
+	struct jit_slab *slab;
+	unsigned long *objs_allocated;
+
+	if (!executably_mapped)
+		return NULL;
+
+	objs_allocated = kcalloc(BITS_TO_LONGS(cache->objs_per_slab), sizeof(unsigned long), GFP_KERNEL);
+	if (!objs_allocated ) {
+		vfree(executably_mapped);
+		return NULL;
+	}
+
+	set_vm_flush_reset_perms(executably_mapped);
+	set_memory_rox((unsigned long) executably_mapped, 1);
+
+	page = vmalloc_to_page(executably_mapped);
+	folio = page_folio(page);
+
+	__folio_set_slab(folio);
+	slab			= folio_jit_slab(folio);
+	slab->cache		= cache;
+	slab->executably_mapped	= executably_mapped;
+	slab->objs_allocated = objs_allocated;
+	INIT_LIST_HEAD(&slab->list);
+
+	return slab;
+}
+
+static void *jit_cache_alloc(void *buf, size_t len, struct jit_cache *cache)
+{
+	struct jit_slab *s =
+		list_first_entry_or_null(&cache->partial, struct jit_slab, list) ?:
+		jit_slab_alloc(cache);
+	unsigned obj_idx, nr_allocated;
+
+	if (!s)
+		return NULL;
+
+	obj_idx = find_first_zero_bit(s->objs_allocated, cache->objs_per_slab);
+
+	BUG_ON(obj_idx >= cache->objs_per_slab);
+	__set_bit(obj_idx, s->objs_allocated);
+
+	nr_allocated = bitmap_weight(s->objs_allocated, s->cache->objs_per_slab);
+
+	if (nr_allocated == s->cache->objs_per_slab) {
+		list_del_init(&s->list);
+	} else if (nr_allocated == 1) {
+		list_del(&s->list);
+		list_add(&s->list, &s->cache->partial);
+	}
+
+	return s->executably_mapped + (obj_idx << cache->obj_size_bits);
+}
+
+void jit_update(void *buf, void *new_buf, size_t len)
+{
+	text_poke_copy(buf, new_buf, len);
+}
+EXPORT_SYMBOL_GPL(jit_update);
+
+void jit_free(void *buf)
+{
+	struct page *page;
+	struct folio *folio;
+	struct jit_slab *s;
+	unsigned obj_idx, nr_allocated;
+	size_t offset;
+
+	if (!buf)
+		return;
+
+	page	= vmalloc_to_page(buf);
+	folio	= page_folio(page);
+	offset	= offset_in_folio(folio, buf);
+
+	if (!folio_test_slab(folio)) {
+		vfree(buf);
+		return;
+	}
+
+	s = folio_jit_slab(folio);
+
+	mutex_lock(&jit_alloc_lock);
+	obj_idx = offset >> s->cache->obj_size_bits;
+
+	__clear_bit(obj_idx, s->objs_allocated);
+
+	nr_allocated = bitmap_weight(s->objs_allocated, s->cache->objs_per_slab);
+
+	if (nr_allocated == 0) {
+		list_del(&s->list);
+		kfree(s->objs_allocated);
+		folio_put(folio);
+	} else if (nr_allocated + 1 == s->cache->objs_per_slab) {
+		list_del(&s->list);
+		list_add(&s->list, &s->cache->partial);
+	}
+
+	mutex_unlock(&jit_alloc_lock);
+}
+EXPORT_SYMBOL_GPL(jit_free);
+
+void *jit_alloc(void *buf, size_t len)
+{
+	unsigned jit_cache_idx = ilog2(roundup_pow_of_two(len) / 16);
+	void *p;
+
+	if (jit_cache_idx < NR_JIT_CACHES) {
+		mutex_lock(&jit_alloc_lock);
+		p = jit_cache_alloc(buf, len, &jit_caches[jit_cache_idx]);
+		mutex_unlock(&jit_alloc_lock);
+	} else {
+		p = module_alloc(len);
+		if (p) {
+			set_vm_flush_reset_perms(p);
+			set_memory_rox((unsigned long) p, DIV_ROUND_UP(len, PAGE_SIZE));
+		}
+	}
+
+	if (p && buf)
+		jit_update(p, buf, len);
+
+	return p;
+}
+EXPORT_SYMBOL_GPL(jit_alloc);
+
+static int __init jit_alloc_init(void)
+{
+	for (unsigned i = 0; i < ARRAY_SIZE(jit_caches); i++) {
+		jit_caches[i].obj_size_bits	= ilog2(JITALLOC_MIN_SIZE) + i;
+		jit_caches[i].objs_per_slab	= PAGE_SIZE >> jit_caches[i].obj_size_bits;
+
+		INIT_LIST_HEAD(&jit_caches[i].partial);
+	}
+
+	return 0;
+}
+core_initcall(jit_alloc_init);

Song Liu May 18, 2023, 5 p.m. UTC | #28

On Thu, May 18, 2023 at 9:48 AM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote:
> > I am working on patches based on the discussion in [1]. I am planning to
> > send v1 for review in a week or so.
>
> Hey Song, I was reviewing that thread too,
>
> Are you taking a different approach based on Thomas's feedback? I think
> he had some fair points in that thread.

Yes, the API is based on Thomas's suggestion, like 90% from the discussions.

>
> My own feeling is that the buddy allocator is our tool for allocating
> larger variable sized physically contiguous allocations, so I'd like to
> see something based on that - I think we could do a hybrid buddy/slab
> allocator approach, like we have for regular memory allocations.

I am planning to implement the allocator based on this (reuse
vmap_area logic):

https://lore.kernel.org/linux-mm/20221107223921.3451913-2-song@kernel.org/

Thanks,
Song

>
> I started on a slab allocator for executable memory allocations the
> other day (very minimal, but tested it for bcachefs and it works).
>
> But I'd love to hear more about your current approach as well.
>
> Cheers,
> Kent

Song Liu May 18, 2023, 5:15 p.m. UTC | #29

On Thu, May 18, 2023 at 9:58 AM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote:
> > I am working on patches based on the discussion in [1]. I am planning to
> > send v1 for review in a week or so.
>
> For reference, here's my own (early, but functioning :) slab allocator:
>
> Look forward to comparing!

This looks similar to the bpf_prog_pack allocator we use for BPF at the
moment. (search for bpf_prog_pack in kernel/bpf/core.c). I guess we
can also use bpf_prog_pack for bcachefs for now.

Thanks,
Song

> -->--
> From 6eeb6b8ef4271ea1a8d9cac7fbaeeb7704951976 Mon Sep 17 00:00:00 2001
> From: Kent Overstreet <kent.overstreet@linux.dev>
> Date: Wed, 17 May 2023 01:22:06 -0400
> Subject: [PATCH] mm: jit/text allocator
>
> This provides a new, very simple slab allocator for jit/text, i.e. bpf,
> ftrace trampolines, or bcachefs unpack functions.
>
> With this API we can avoid ever mapping pages both writeable and
> executable (not implemented in this patch: need to tweak
> module_alloc()), and it also supports sub-page sized allocations.
>
> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
>
> diff --git a/include/linux/jitalloc.h b/include/linux/jitalloc.h
> new file mode 100644
> index 0000000000..f1549d60e8
> --- /dev/null
> +++ b/include/linux/jitalloc.h
> @@ -0,0 +1,9 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_JITALLOC_H
> +#define _LINUX_JITALLOC_H
> +
> +void jit_update(void *buf, void *new_buf, size_t len);
> +void jit_free(void *buf);
> +void *jit_alloc(void *buf, size_t len);
> +
> +#endif /* _LINUX_JITALLOC_H */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 4751031f3f..ff26a4f0c9 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1202,6 +1202,9 @@ config LRU_GEN_STATS
>           This option has a per-memcg and per-node memory overhead.
>  # }
>
> +config JITALLOC
> +       bool
> +
>  source "mm/damon/Kconfig"
>
>  endmenu
> diff --git a/mm/Makefile b/mm/Makefile
> index c03e1e5859..25e82db9e8 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -138,3 +138,4 @@ obj-$(CONFIG_IO_MAPPING) += io-mapping.o
>  obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
>  obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
>  obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
> +obj-$(CONFIG_JITALLOC) += jitalloc.o
> diff --git a/mm/jitalloc.c b/mm/jitalloc.c
> new file mode 100644
> index 0000000000..7c4d621802
> --- /dev/null
> +++ b/mm/jitalloc.c
> @@ -0,0 +1,187 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/gfp.h>
> +#include <linux/highmem.h>
> +#include <linux/jitalloc.h>
> +#include <linux/mm.h>
> +#include <linux/moduleloader.h>
> +#include <linux/mutex.h>
> +#include <linux/set_memory.h>
> +#include <linux/vmalloc.h>
> +
> +#include <asm/text-patching.h>
> +
> +static DEFINE_MUTEX(jit_alloc_lock);
> +
> +struct jit_cache {
> +       unsigned                obj_size_bits;
> +       unsigned                objs_per_slab;
> +       struct list_head        partial;
> +};
> +
> +#define JITALLOC_MIN_SIZE      16
> +#define NR_JIT_CACHES          ilog2(PAGE_SIZE / JITALLOC_MIN_SIZE)
> +
> +static struct jit_cache jit_caches[NR_JIT_CACHES];
> +
> +struct jit_slab {
> +       unsigned long           __page_flags;
> +
> +       struct jit_cache        *cache;
> +       void                    *executably_mapped;;
> +       unsigned long           *objs_allocated; /* bitmap of free objects */
> +       struct list_head        list;
> +};
> +
> +#define folio_jit_slab(folio)          (_Generic((folio),                      \
> +       const struct folio *:           (const struct jit_slab *)(folio),       \
> +       struct folio *:                 (struct jit_slab *)(folio)))
> +
> +#define jit_slab_folio(s)              (_Generic((s),                          \
> +       const struct jit_slab *:        (const struct folio *)s,                \
> +       struct jit_slab *:              (struct folio *)s))
> +
> +static struct jit_slab *jit_slab_alloc(struct jit_cache *cache)
> +{
> +       void *executably_mapped = module_alloc(PAGE_SIZE);
> +       struct page *page;
> +       struct folio *folio;
> +       struct jit_slab *slab;
> +       unsigned long *objs_allocated;
> +
> +       if (!executably_mapped)
> +               return NULL;
> +
> +       objs_allocated = kcalloc(BITS_TO_LONGS(cache->objs_per_slab), sizeof(unsigned long), GFP_KERNEL);
> +       if (!objs_allocated ) {
> +               vfree(executably_mapped);
> +               return NULL;
> +       }
> +
> +       set_vm_flush_reset_perms(executably_mapped);
> +       set_memory_rox((unsigned long) executably_mapped, 1);
> +
> +       page = vmalloc_to_page(executably_mapped);
> +       folio = page_folio(page);
> +
> +       __folio_set_slab(folio);
> +       slab                    = folio_jit_slab(folio);
> +       slab->cache             = cache;
> +       slab->executably_mapped = executably_mapped;
> +       slab->objs_allocated = objs_allocated;
> +       INIT_LIST_HEAD(&slab->list);
> +
> +       return slab;
> +}
> +
> +static void *jit_cache_alloc(void *buf, size_t len, struct jit_cache *cache)
> +{
> +       struct jit_slab *s =
> +               list_first_entry_or_null(&cache->partial, struct jit_slab, list) ?:
> +               jit_slab_alloc(cache);
> +       unsigned obj_idx, nr_allocated;
> +
> +       if (!s)
> +               return NULL;
> +
> +       obj_idx = find_first_zero_bit(s->objs_allocated, cache->objs_per_slab);
> +
> +       BUG_ON(obj_idx >= cache->objs_per_slab);
> +       __set_bit(obj_idx, s->objs_allocated);
> +
> +       nr_allocated = bitmap_weight(s->objs_allocated, s->cache->objs_per_slab);
> +
> +       if (nr_allocated == s->cache->objs_per_slab) {
> +               list_del_init(&s->list);
> +       } else if (nr_allocated == 1) {
> +               list_del(&s->list);
> +               list_add(&s->list, &s->cache->partial);
> +       }
> +
> +       return s->executably_mapped + (obj_idx << cache->obj_size_bits);
> +}
> +
> +void jit_update(void *buf, void *new_buf, size_t len)
> +{
> +       text_poke_copy(buf, new_buf, len);
> +}
> +EXPORT_SYMBOL_GPL(jit_update);
> +
> +void jit_free(void *buf)
> +{
> +       struct page *page;
> +       struct folio *folio;
> +       struct jit_slab *s;
> +       unsigned obj_idx, nr_allocated;
> +       size_t offset;
> +
> +       if (!buf)
> +               return;
> +
> +       page    = vmalloc_to_page(buf);
> +       folio   = page_folio(page);
> +       offset  = offset_in_folio(folio, buf);
> +
> +       if (!folio_test_slab(folio)) {
> +               vfree(buf);
> +               return;
> +       }
> +
> +       s = folio_jit_slab(folio);
> +
> +       mutex_lock(&jit_alloc_lock);
> +       obj_idx = offset >> s->cache->obj_size_bits;
> +
> +       __clear_bit(obj_idx, s->objs_allocated);
> +
> +       nr_allocated = bitmap_weight(s->objs_allocated, s->cache->objs_per_slab);
> +
> +       if (nr_allocated == 0) {
> +               list_del(&s->list);
> +               kfree(s->objs_allocated);
> +               folio_put(folio);
> +       } else if (nr_allocated + 1 == s->cache->objs_per_slab) {
> +               list_del(&s->list);
> +               list_add(&s->list, &s->cache->partial);
> +       }
> +
> +       mutex_unlock(&jit_alloc_lock);
> +}
> +EXPORT_SYMBOL_GPL(jit_free);
> +
> +void *jit_alloc(void *buf, size_t len)
> +{
> +       unsigned jit_cache_idx = ilog2(roundup_pow_of_two(len) / 16);
> +       void *p;
> +
> +       if (jit_cache_idx < NR_JIT_CACHES) {
> +               mutex_lock(&jit_alloc_lock);
> +               p = jit_cache_alloc(buf, len, &jit_caches[jit_cache_idx]);
> +               mutex_unlock(&jit_alloc_lock);
> +       } else {
> +               p = module_alloc(len);
> +               if (p) {
> +                       set_vm_flush_reset_perms(p);
> +                       set_memory_rox((unsigned long) p, DIV_ROUND_UP(len, PAGE_SIZE));
> +               }
> +       }
> +
> +       if (p && buf)
> +               jit_update(p, buf, len);
> +
> +       return p;
> +}
> +EXPORT_SYMBOL_GPL(jit_alloc);
> +
> +static int __init jit_alloc_init(void)
> +{
> +       for (unsigned i = 0; i < ARRAY_SIZE(jit_caches); i++) {
> +               jit_caches[i].obj_size_bits     = ilog2(JITALLOC_MIN_SIZE) + i;
> +               jit_caches[i].objs_per_slab     = PAGE_SIZE >> jit_caches[i].obj_size_bits;
> +
> +               INIT_LIST_HEAD(&jit_caches[i].partial);
> +       }
> +
> +       return 0;
> +}
> +core_initcall(jit_alloc_init);
>

Kent Overstreet May 18, 2023, 5:23 p.m. UTC | #30

On Thu, May 18, 2023 at 10:00:39AM -0700, Song Liu wrote:
> On Thu, May 18, 2023 at 9:48 AM Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
> >
> > On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote:
> > > I am working on patches based on the discussion in [1]. I am planning to
> > > send v1 for review in a week or so.
> >
> > Hey Song, I was reviewing that thread too,
> >
> > Are you taking a different approach based on Thomas's feedback? I think
> > he had some fair points in that thread.
> 
> Yes, the API is based on Thomas's suggestion, like 90% from the discussions.
> 
> >
> > My own feeling is that the buddy allocator is our tool for allocating
> > larger variable sized physically contiguous allocations, so I'd like to
> > see something based on that - I think we could do a hybrid buddy/slab
> > allocator approach, like we have for regular memory allocations.
> 
> I am planning to implement the allocator based on this (reuse
> vmap_area logic):

Ah, you're still doing vmap_area approach.

Mike's approach looks like it'll be _much_ lighter weight and higher
performance, to me. vmalloc is known to be slow compared to the buddy
allocator, and with Mike's approach we're only modifying mappings once
per 2 MB chunk.

I don't see anything in your code for sub-page sized allocations too, so
perhaps I should keep going with my slab allocator.

Could you share your thoughts on your approach vs. Mike's? I'm newer to
this area of the code than you two so maybe there's an angle I've missed
:)

Kent Overstreet May 18, 2023, 5:25 p.m. UTC | #31

On Thu, May 18, 2023 at 10:15:59AM -0700, Song Liu wrote:
> On Thu, May 18, 2023 at 9:58 AM Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
> >
> > On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote:
> > > I am working on patches based on the discussion in [1]. I am planning to
> > > send v1 for review in a week or so.
> >
> > For reference, here's my own (early, but functioning :) slab allocator:
> >
> > Look forward to comparing!
> 
> This looks similar to the bpf_prog_pack allocator we use for BPF at the
> moment. (search for bpf_prog_pack in kernel/bpf/core.c). I guess we
> can also use bpf_prog_pack for bcachefs for now.

The code in bpf/core.c does a _lot_, my preference would be to split
that up and refactor the bpf code to use something generic :)

Song Liu May 18, 2023, 6:47 p.m. UTC | #32

On Thu, May 18, 2023 at 10:24 AM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Thu, May 18, 2023 at 10:00:39AM -0700, Song Liu wrote:
> > On Thu, May 18, 2023 at 9:48 AM Kent Overstreet
> > <kent.overstreet@linux.dev> wrote:
> > >
> > > On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote:
> > > > I am working on patches based on the discussion in [1]. I am planning to
> > > > send v1 for review in a week or so.
> > >
> > > Hey Song, I was reviewing that thread too,
> > >
> > > Are you taking a different approach based on Thomas's feedback? I think
> > > he had some fair points in that thread.
> >
> > Yes, the API is based on Thomas's suggestion, like 90% from the discussions.
> >
> > >
> > > My own feeling is that the buddy allocator is our tool for allocating
> > > larger variable sized physically contiguous allocations, so I'd like to
> > > see something based on that - I think we could do a hybrid buddy/slab
> > > allocator approach, like we have for regular memory allocations.
> >
> > I am planning to implement the allocator based on this (reuse
> > vmap_area logic):
>
> Ah, you're still doing vmap_area approach.
>
> Mike's approach looks like it'll be _much_ lighter weight and higher
> performance, to me. vmalloc is known to be slow compared to the buddy
> allocator, and with Mike's approach we're only modifying mappings once
> per 2 MB chunk.
>
> I don't see anything in your code for sub-page sized allocations too, so
> perhaps I should keep going with my slab allocator.

The vmap_area approach handles sub-page allocations. In 5/5 of set [2],
we showed that multiple BPF programs share the same page with some
kernel text (_etext).

> Could you share your thoughts on your approach vs. Mike's? I'm newer to
> this area of the code than you two so maybe there's an angle I've missed
> :)

AFAICT, tree based solution (vmap_area) is more efficient than bitmap
based solution.

First, for 2MiB page with 64B chunk size, we need a bitmap of
     2MiB / 64B = 32k bit = 4k bytes
While the tree based solution can adapt to the number of allocations within
This 2MiB page. Also, searching a free range within 4kB of bitmap may
actually be slower than searching in the tree.

Second, bitmap based solution cannot handle > 2MiB allocation cleanly,
while tree based solution can. For example, if a big driver uses 3MiB, the
tree based allocator can allocate 4MiB for it, and use the rest 1MiB for
smaller allocations.

Thanks,
Song

[2] https://lore.kernel.org/linux-mm/20221107223921.3451913-6-song@kernel.org/

Song Liu May 18, 2023, 6:54 p.m. UTC | #33

On Thu, May 18, 2023 at 10:25 AM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Thu, May 18, 2023 at 10:15:59AM -0700, Song Liu wrote:
> > On Thu, May 18, 2023 at 9:58 AM Kent Overstreet
> > <kent.overstreet@linux.dev> wrote:
> > >
> > > On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote:
> > > > I am working on patches based on the discussion in [1]. I am planning to
> > > > send v1 for review in a week or so.
> > >
> > > For reference, here's my own (early, but functioning :) slab allocator:
> > >
> > > Look forward to comparing!
> >
> > This looks similar to the bpf_prog_pack allocator we use for BPF at the
> > moment. (search for bpf_prog_pack in kernel/bpf/core.c). I guess we
> > can also use bpf_prog_pack for bcachefs for now.
>
> The code in bpf/core.c does a _lot_, my preference would be to split
> that up and refactor the bpf code to use something generic :)

The code is actually splitted into two parts:

   bpf_prog_pack_[alloc|free] vs. bpf_jit_binary_pack_[alloc|free].

The latter has logic just for BPF. But the former can be reused by
other users (maybe with a little refactoring).

Once the vmap_area based solution is committed, we will replace the
former with the new allocator.

Thanks,
Song

Song Liu May 18, 2023, 7:01 p.m. UTC | #34

On Thu, May 18, 2023 at 9:58 AM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote:
> > I am working on patches based on the discussion in [1]. I am planning to
> > send v1 for review in a week or so.
>
> For reference, here's my own (early, but functioning :) slab allocator:
>
> Look forward to comparing!
> -->--
> From 6eeb6b8ef4271ea1a8d9cac7fbaeeb7704951976 Mon Sep 17 00:00:00 2001
> From: Kent Overstreet <kent.overstreet@linux.dev>
> Date: Wed, 17 May 2023 01:22:06 -0400
> Subject: [PATCH] mm: jit/text allocator
>
> This provides a new, very simple slab allocator for jit/text, i.e. bpf,
> ftrace trampolines, or bcachefs unpack functions.
>
> With this API we can avoid ever mapping pages both writeable and
> executable (not implemented in this patch: need to tweak
> module_alloc()), and it also supports sub-page sized allocations.
>
> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

[...]

> +static void *jit_cache_alloc(void *buf, size_t len, struct jit_cache *cache)
> +{
> +       struct jit_slab *s =
> +               list_first_entry_or_null(&cache->partial, struct jit_slab, list) ?:
> +               jit_slab_alloc(cache);
> +       unsigned obj_idx, nr_allocated;
> +
> +       if (!s)
> +               return NULL;
> +
> +       obj_idx = find_first_zero_bit(s->objs_allocated, cache->objs_per_slab);
> +
> +       BUG_ON(obj_idx >= cache->objs_per_slab);
> +       __set_bit(obj_idx, s->objs_allocated);
> +
> +       nr_allocated = bitmap_weight(s->objs_allocated, s->cache->objs_per_slab);
> +
> +       if (nr_allocated == s->cache->objs_per_slab) {
> +               list_del_init(&s->list);
> +       } else if (nr_allocated == 1) {
> +               list_del(&s->list);
> +               list_add(&s->list, &s->cache->partial);
> +       }
> +
> +       return s->executably_mapped + (obj_idx << cache->obj_size_bits);
> +}

IIUC, "len" is ignored in jit_cache_alloc(), so it can only handle
<=16 byte allocations?

Thanks,
Song

Song Liu May 18, 2023, 7:03 p.m. UTC | #35

On Thu, May 18, 2023 at 11:47 AM Song Liu <song@kernel.org> wrote:
>
> On Thu, May 18, 2023 at 10:24 AM Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
> >
> > On Thu, May 18, 2023 at 10:00:39AM -0700, Song Liu wrote:
> > > On Thu, May 18, 2023 at 9:48 AM Kent Overstreet
> > > <kent.overstreet@linux.dev> wrote:
> > > >
> > > > On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote:
> > > > > I am working on patches based on the discussion in [1]. I am planning to
> > > > > send v1 for review in a week or so.
> > > >
> > > > Hey Song, I was reviewing that thread too,
> > > >
> > > > Are you taking a different approach based on Thomas's feedback? I think
> > > > he had some fair points in that thread.
> > >
> > > Yes, the API is based on Thomas's suggestion, like 90% from the discussions.
> > >
> > > >
> > > > My own feeling is that the buddy allocator is our tool for allocating
> > > > larger variable sized physically contiguous allocations, so I'd like to
> > > > see something based on that - I think we could do a hybrid buddy/slab
> > > > allocator approach, like we have for regular memory allocations.
> > >
> > > I am planning to implement the allocator based on this (reuse
> > > vmap_area logic):
> >
> > Ah, you're still doing vmap_area approach.
> >
> > Mike's approach looks like it'll be _much_ lighter weight and higher
> > performance, to me. vmalloc is known to be slow compared to the buddy
> > allocator, and with Mike's approach we're only modifying mappings once
> > per 2 MB chunk.
> >
> > I don't see anything in your code for sub-page sized allocations too, so
> > perhaps I should keep going with my slab allocator.
>
> The vmap_area approach handles sub-page allocations. In 5/5 of set [2],
> we showed that multiple BPF programs share the same page with some
> kernel text (_etext).
>
> > Could you share your thoughts on your approach vs. Mike's? I'm newer to
> > this area of the code than you two so maybe there's an angle I've missed
> > :)
>
> AFAICT, tree based solution (vmap_area) is more efficient than bitmap
> based solution.
>
> First, for 2MiB page with 64B chunk size, we need a bitmap of
>      2MiB / 64B = 32k bit = 4k bytes
> While the tree based solution can adapt to the number of allocations within
> This 2MiB page. Also, searching a free range within 4kB of bitmap may
> actually be slower than searching in the tree.
>
> Second, bitmap based solution cannot handle > 2MiB allocation cleanly,
> while tree based solution can. For example, if a big driver uses 3MiB, the
> tree based allocator can allocate 4MiB for it, and use the rest 1MiB for
> smaller allocations.

 Missed one:

Third, bitmap based solution requires a "size" parameter in free(). It is an
overhead for the user. Tree based solution doesn't have this issue.

Thanks,
Song

>
> [2] https://lore.kernel.org/linux-mm/20221107223921.3451913-6-song@kernel.org/

Kent Overstreet May 18, 2023, 7:10 p.m. UTC | #36

On Thu, May 18, 2023 at 12:01:26PM -0700, Song Liu wrote:
> On Thu, May 18, 2023 at 9:58 AM Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
> >
> > On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote:
> > > I am working on patches based on the discussion in [1]. I am planning to
> > > send v1 for review in a week or so.
> >
> > For reference, here's my own (early, but functioning :) slab allocator:
> >
> > Look forward to comparing!
> > -->--
> > From 6eeb6b8ef4271ea1a8d9cac7fbaeeb7704951976 Mon Sep 17 00:00:00 2001
> > From: Kent Overstreet <kent.overstreet@linux.dev>
> > Date: Wed, 17 May 2023 01:22:06 -0400
> > Subject: [PATCH] mm: jit/text allocator
> >
> > This provides a new, very simple slab allocator for jit/text, i.e. bpf,
> > ftrace trampolines, or bcachefs unpack functions.
> >
> > With this API we can avoid ever mapping pages both writeable and
> > executable (not implemented in this patch: need to tweak
> > module_alloc()), and it also supports sub-page sized allocations.
> >
> > Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> 
> [...]
> 
> > +static void *jit_cache_alloc(void *buf, size_t len, struct jit_cache *cache)
> > +{
> > +       struct jit_slab *s =
> > +               list_first_entry_or_null(&cache->partial, struct jit_slab, list) ?:
> > +               jit_slab_alloc(cache);
> > +       unsigned obj_idx, nr_allocated;
> > +
> > +       if (!s)
> > +               return NULL;
> > +
> > +       obj_idx = find_first_zero_bit(s->objs_allocated, cache->objs_per_slab);
> > +
> > +       BUG_ON(obj_idx >= cache->objs_per_slab);
> > +       __set_bit(obj_idx, s->objs_allocated);
> > +
> > +       nr_allocated = bitmap_weight(s->objs_allocated, s->cache->objs_per_slab);
> > +
> > +       if (nr_allocated == s->cache->objs_per_slab) {
> > +               list_del_init(&s->list);
> > +       } else if (nr_allocated == 1) {
> > +               list_del(&s->list);
> > +               list_add(&s->list, &s->cache->partial);
> > +       }
> > +
> > +       return s->executably_mapped + (obj_idx << cache->obj_size_bits);
> > +}
> 
> IIUC, "len" is ignored in jit_cache_alloc(), so it can only handle
> <=16 byte allocations?

len is a redundant parameter (good catch); at that point we've picked a
cache for the specific allocation size.

Since there's multiple caches for each power of two size, it can handle
allocations up to PAGE_SIZE.

Kent Overstreet May 18, 2023, 7:15 p.m. UTC | #37

On Thu, May 18, 2023 at 12:03:03PM -0700, Song Liu wrote:
> On Thu, May 18, 2023 at 11:47 AM Song Liu <song@kernel.org> wrote:
> >
> > On Thu, May 18, 2023 at 10:24 AM Kent Overstreet
> > <kent.overstreet@linux.dev> wrote:
> > >
> > > On Thu, May 18, 2023 at 10:00:39AM -0700, Song Liu wrote:
> > > > On Thu, May 18, 2023 at 9:48 AM Kent Overstreet
> > > > <kent.overstreet@linux.dev> wrote:
> > > > >
> > > > > On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote:
> > > > > > I am working on patches based on the discussion in [1]. I am planning to
> > > > > > send v1 for review in a week or so.
> > > > >
> > > > > Hey Song, I was reviewing that thread too,
> > > > >
> > > > > Are you taking a different approach based on Thomas's feedback? I think
> > > > > he had some fair points in that thread.
> > > >
> > > > Yes, the API is based on Thomas's suggestion, like 90% from the discussions.
> > > >
> > > > >
> > > > > My own feeling is that the buddy allocator is our tool for allocating
> > > > > larger variable sized physically contiguous allocations, so I'd like to
> > > > > see something based on that - I think we could do a hybrid buddy/slab
> > > > > allocator approach, like we have for regular memory allocations.
> > > >
> > > > I am planning to implement the allocator based on this (reuse
> > > > vmap_area logic):
> > >
> > > Ah, you're still doing vmap_area approach.
> > >
> > > Mike's approach looks like it'll be _much_ lighter weight and higher
> > > performance, to me. vmalloc is known to be slow compared to the buddy
> > > allocator, and with Mike's approach we're only modifying mappings once
> > > per 2 MB chunk.
> > >
> > > I don't see anything in your code for sub-page sized allocations too, so
> > > perhaps I should keep going with my slab allocator.
> >
> > The vmap_area approach handles sub-page allocations. In 5/5 of set [2],
> > we showed that multiple BPF programs share the same page with some
> > kernel text (_etext).
> >
> > > Could you share your thoughts on your approach vs. Mike's? I'm newer to
> > > this area of the code than you two so maybe there's an angle I've missed
> > > :)
> >
> > AFAICT, tree based solution (vmap_area) is more efficient than bitmap
> > based solution.

Tree based requires quite a bit of overhead for the rbtree pointers, and
additional vmap_area structs.

With a buddy allocator based approach, there's no additional state that
needs to be allocated, since it all fits in struct page.

> > First, for 2MiB page with 64B chunk size, we need a bitmap of
> >      2MiB / 64B = 32k bit = 4k bytes
> > While the tree based solution can adapt to the number of allocations within
> > This 2MiB page. Also, searching a free range within 4kB of bitmap may
> > actually be slower than searching in the tree.
> >
> > Second, bitmap based solution cannot handle > 2MiB allocation cleanly,
> > while tree based solution can. For example, if a big driver uses 3MiB, the
> > tree based allocator can allocate 4MiB for it, and use the rest 1MiB for
> > smaller allocations.

We're not talking about a bitmap based solution for >= PAGE_SIZE
allocations, the alternative is a buddy allocator - so no searching,
just per power-of-two freelists.

> 
>  Missed one:
> 
> Third, bitmap based solution requires a "size" parameter in free(). It is an
> overhead for the user. Tree based solution doesn't have this issue.

No, we can recover the size of the allocation via compound_order() -
hasn't historically been done for alloc_pages() allocations to avoid
setting up the state in each page for compound head/tail, but it perhaps
should be (and is with folios, which we've generally been switching to).

Kent Overstreet May 18, 2023, 7:16 p.m. UTC | #38

On Thu, May 18, 2023 at 12:03:03PM -0700, Song Liu wrote:
> > Second, bitmap based solution cannot handle > 2MiB allocation cleanly,
> > while tree based solution can. For example, if a big driver uses 3MiB, the
> > tree based allocator can allocate 4MiB for it, and use the rest 1MiB for
> > smaller allocations.

2 MB is just a good default size for the initial unmapped allocations if
the main consideration is avoiding mapping fragmentation. The approach
can work with larger allocations if required.

Song Liu May 18, 2023, 8:03 p.m. UTC | #39

On Thu, May 18, 2023 at 12:15 PM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Thu, May 18, 2023 at 12:03:03PM -0700, Song Liu wrote:
> > On Thu, May 18, 2023 at 11:47 AM Song Liu <song@kernel.org> wrote:
> > >
> > > On Thu, May 18, 2023 at 10:24 AM Kent Overstreet
> > > <kent.overstreet@linux.dev> wrote:
> > > >
> > > > On Thu, May 18, 2023 at 10:00:39AM -0700, Song Liu wrote:
> > > > > On Thu, May 18, 2023 at 9:48 AM Kent Overstreet
> > > > > <kent.overstreet@linux.dev> wrote:
> > > > > >
> > > > > > On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote:
> > > > > > > I am working on patches based on the discussion in [1]. I am planning to
> > > > > > > send v1 for review in a week or so.
> > > > > >
> > > > > > Hey Song, I was reviewing that thread too,
> > > > > >
> > > > > > Are you taking a different approach based on Thomas's feedback? I think
> > > > > > he had some fair points in that thread.
> > > > >
> > > > > Yes, the API is based on Thomas's suggestion, like 90% from the discussions.
> > > > >
> > > > > >
> > > > > > My own feeling is that the buddy allocator is our tool for allocating
> > > > > > larger variable sized physically contiguous allocations, so I'd like to
> > > > > > see something based on that - I think we could do a hybrid buddy/slab
> > > > > > allocator approach, like we have for regular memory allocations.
> > > > >
> > > > > I am planning to implement the allocator based on this (reuse
> > > > > vmap_area logic):
> > > >
> > > > Ah, you're still doing vmap_area approach.
> > > >
> > > > Mike's approach looks like it'll be _much_ lighter weight and higher
> > > > performance, to me. vmalloc is known to be slow compared to the buddy
> > > > allocator, and with Mike's approach we're only modifying mappings once
> > > > per 2 MB chunk.
> > > >
> > > > I don't see anything in your code for sub-page sized allocations too, so
> > > > perhaps I should keep going with my slab allocator.
> > >
> > > The vmap_area approach handles sub-page allocations. In 5/5 of set [2],
> > > we showed that multiple BPF programs share the same page with some
> > > kernel text (_etext).
> > >
> > > > Could you share your thoughts on your approach vs. Mike's? I'm newer to
> > > > this area of the code than you two so maybe there's an angle I've missed
> > > > :)
> > >
> > > AFAICT, tree based solution (vmap_area) is more efficient than bitmap
> > > based solution.
>
> Tree based requires quite a bit of overhead for the rbtree pointers, and
> additional vmap_area structs.
>
> With a buddy allocator based approach, there's no additional state that
> needs to be allocated, since it all fits in struct page.
>
> > > First, for 2MiB page with 64B chunk size, we need a bitmap of
> > >      2MiB / 64B = 32k bit = 4k bytes
> > > While the tree based solution can adapt to the number of allocations within
> > > This 2MiB page. Also, searching a free range within 4kB of bitmap may
> > > actually be slower than searching in the tree.
> > >
> > > Second, bitmap based solution cannot handle > 2MiB allocation cleanly,
> > > while tree based solution can. For example, if a big driver uses 3MiB, the
> > > tree based allocator can allocate 4MiB for it, and use the rest 1MiB for
> > > smaller allocations.
>
> We're not talking about a bitmap based solution for >= PAGE_SIZE
> allocations, the alternative is a buddy allocator - so no searching,
> just per power-of-two freelists.
>
> >
> >  Missed one:
> >
> > Third, bitmap based solution requires a "size" parameter in free(). It is an
> > overhead for the user. Tree based solution doesn't have this issue.
>
> No, we can recover the size of the allocation via compound_order() -
> hasn't historically been done for alloc_pages() allocations to avoid
> setting up the state in each page for compound head/tail, but it perhaps
> should be (and is with folios, which we've generally been switching to).

If we use compound_order(), we will round up to power of 2 for all
allocations. Does this mean we will use 4MiB for a 2.1MiB allocation?

Thanks,
Song

Kent Overstreet May 18, 2023, 8:13 p.m. UTC | #40

On Thu, May 18, 2023 at 01:03:28PM -0700, Song Liu wrote:
> If we use compound_order(), we will round up to power of 2 for all
> allocations. Does this mean we will use 4MiB for a 2.1MiB allocation?

Yes. This means we lose on average 33% - I believe, someone with better
statistics might correct me - to internal fragmentation. But the buddy
allocator will be better at avoiding and dealing with external
fragmentation over time.

Song Liu May 18, 2023, 8:51 p.m. UTC | #41

On Thu, May 18, 2023 at 12:15 PM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Thu, May 18, 2023 at 12:03:03PM -0700, Song Liu wrote:
> > On Thu, May 18, 2023 at 11:47 AM Song Liu <song@kernel.org> wrote:
> > >
> > > On Thu, May 18, 2023 at 10:24 AM Kent Overstreet
> > > <kent.overstreet@linux.dev> wrote:
> > > >
> > > > On Thu, May 18, 2023 at 10:00:39AM -0700, Song Liu wrote:
> > > > > On Thu, May 18, 2023 at 9:48 AM Kent Overstreet
> > > > > <kent.overstreet@linux.dev> wrote:
> > > > > >
> > > > > > On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote:
> > > > > > > I am working on patches based on the discussion in [1]. I am planning to
> > > > > > > send v1 for review in a week or so.
> > > > > >
> > > > > > Hey Song, I was reviewing that thread too,
> > > > > >
> > > > > > Are you taking a different approach based on Thomas's feedback? I think
> > > > > > he had some fair points in that thread.
> > > > >
> > > > > Yes, the API is based on Thomas's suggestion, like 90% from the discussions.
> > > > >
> > > > > >
> > > > > > My own feeling is that the buddy allocator is our tool for allocating
> > > > > > larger variable sized physically contiguous allocations, so I'd like to
> > > > > > see something based on that - I think we could do a hybrid buddy/slab
> > > > > > allocator approach, like we have for regular memory allocations.
> > > > >
> > > > > I am planning to implement the allocator based on this (reuse
> > > > > vmap_area logic):
> > > >
> > > > Ah, you're still doing vmap_area approach.
> > > >
> > > > Mike's approach looks like it'll be _much_ lighter weight and higher
> > > > performance, to me. vmalloc is known to be slow compared to the buddy
> > > > allocator, and with Mike's approach we're only modifying mappings once
> > > > per 2 MB chunk.
> > > >
> > > > I don't see anything in your code for sub-page sized allocations too, so
> > > > perhaps I should keep going with my slab allocator.
> > >
> > > The vmap_area approach handles sub-page allocations. In 5/5 of set [2],
> > > we showed that multiple BPF programs share the same page with some
> > > kernel text (_etext).
> > >
> > > > Could you share your thoughts on your approach vs. Mike's? I'm newer to
> > > > this area of the code than you two so maybe there's an angle I've missed
> > > > :)
> > >
> > > AFAICT, tree based solution (vmap_area) is more efficient than bitmap
> > > based solution.
>
> Tree based requires quite a bit of overhead for the rbtree pointers, and
> additional vmap_area structs.
>
> With a buddy allocator based approach, there's no additional state that
> needs to be allocated, since it all fits in struct page.

To allocate memory for text, we will allocate 2MiB, make it ROX, and then
use it for many small allocations. IIUC, buddy allocator will use unallocated
parts of this page for metadata. I guess this may be a problem, as the
whole page is ROX now, and we have to use text_poke to write to it.

OTOH, if we allocate extra memory for metadata (tree based solution),
all the metadata operations can be regular read/write.

Thanks,
Song

Kent Overstreet May 19, 2023, 1:24 a.m. UTC | #42

On Thu, May 18, 2023 at 01:51:08PM -0700, Song Liu wrote:

> To allocate memory for text, we will allocate 2MiB, make it ROX, and then
> use it for many small allocations. IIUC, buddy allocator will use unallocated
> parts of this page for metadata. I guess this may be a problem, as the
> whole page is ROX now, and we have to use text_poke to write to it.

The standard kernel buddy allocator does _not_ store anything in the
page itself - because the page might be a highmem page.

That's also why I went with the bitmap for my slab allocator; standard
kernel slab allocator stores a freelist ptr in free objects.

Mike Rapoport May 19, 2023, 8:29 a.m. UTC | #43

Hi Kent,

On Thu, May 18, 2023 at 01:23:56PM -0400, Kent Overstreet wrote:
> On Thu, May 18, 2023 at 10:00:39AM -0700, Song Liu wrote:
> > On Thu, May 18, 2023 at 9:48 AM Kent Overstreet
> > <kent.overstreet@linux.dev> wrote:
> > >
> > > On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote:
> > > > I am working on patches based on the discussion in [1]. I am planning to
> > > > send v1 for review in a week or so.
> > >
> > > Hey Song, I was reviewing that thread too,
> > >
> > > Are you taking a different approach based on Thomas's feedback? I think
> > > he had some fair points in that thread.
> > 
> > Yes, the API is based on Thomas's suggestion, like 90% from the discussions.
> > 
> > >
> > > My own feeling is that the buddy allocator is our tool for allocating
> > > larger variable sized physically contiguous allocations, so I'd like to
> > > see something based on that - I think we could do a hybrid buddy/slab
> > > allocator approach, like we have for regular memory allocations.
> > 
> > I am planning to implement the allocator based on this (reuse
> > vmap_area logic):
> 
> Ah, you're still doing vmap_area approach.
> 
> Mike's approach looks like it'll be _much_ lighter weight and higher
> performance, to me. vmalloc is known to be slow compared to the buddy
> allocator, and with Mike's approach we're only modifying mappings once
> per 2 MB chunk.
> 
> I don't see anything in your code for sub-page sized allocations too, so
> perhaps I should keep going with my slab allocator.

Your allocator implicitly relies on vmalloc because of module_alloc ;-)

What I was thinking is that we can replace module_alloc() calls in your
allocator with something based on my unmapped_alloc(). If we make the part
that refills the cache also take care of creating the mapping in the
module address space, that should cover everything.
 
> Could you share your thoughts on your approach vs. Mike's? I'm newer to
> this area of the code than you two so maybe there's an angle I've missed
> :)
>

Song Liu May 19, 2023, 3:08 p.m. UTC | #44

On Thu, May 18, 2023 at 6:24 PM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Thu, May 18, 2023 at 01:51:08PM -0700, Song Liu wrote:
>
> > To allocate memory for text, we will allocate 2MiB, make it ROX, and then
> > use it for many small allocations. IIUC, buddy allocator will use unallocated
> > parts of this page for metadata. I guess this may be a problem, as the
> > whole page is ROX now, and we have to use text_poke to write to it.
>
> The standard kernel buddy allocator does _not_ store anything in the
> page itself - because the page might be a highmem page.

Thanks for the correction.
Song
>
> That's also why I went with the bitmap for my slab allocator; standard
> kernel slab allocator stores a freelist ptr in free objects.

Song Liu May 19, 2023, 3:42 p.m. UTC | #45

On Fri, May 19, 2023 at 1:30 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> Hi Kent,
>
> On Thu, May 18, 2023 at 01:23:56PM -0400, Kent Overstreet wrote:
> > On Thu, May 18, 2023 at 10:00:39AM -0700, Song Liu wrote:
> > > On Thu, May 18, 2023 at 9:48 AM Kent Overstreet
> > > <kent.overstreet@linux.dev> wrote:
> > > >
> > > > On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote:
> > > > > I am working on patches based on the discussion in [1]. I am planning to
> > > > > send v1 for review in a week or so.
> > > >
> > > > Hey Song, I was reviewing that thread too,
> > > >
> > > > Are you taking a different approach based on Thomas's feedback? I think
> > > > he had some fair points in that thread.
> > >
> > > Yes, the API is based on Thomas's suggestion, like 90% from the discussions.
> > >
> > > >
> > > > My own feeling is that the buddy allocator is our tool for allocating
> > > > larger variable sized physically contiguous allocations, so I'd like to
> > > > see something based on that - I think we could do a hybrid buddy/slab
> > > > allocator approach, like we have for regular memory allocations.
> > >
> > > I am planning to implement the allocator based on this (reuse
> > > vmap_area logic):
> >
> > Ah, you're still doing vmap_area approach.
> >
> > Mike's approach looks like it'll be _much_ lighter weight and higher
> > performance, to me. vmalloc is known to be slow compared to the buddy
> > allocator, and with Mike's approach we're only modifying mappings once
> > per 2 MB chunk.
> >
> > I don't see anything in your code for sub-page sized allocations too, so
> > perhaps I should keep going with my slab allocator.
>
> Your allocator implicitly relies on vmalloc because of module_alloc ;-)
>
> What I was thinking is that we can replace module_alloc() calls in your
> allocator with something based on my unmapped_alloc(). If we make the part
> that refills the cache also take care of creating the mapping in the
> module address space, that should cover everything.

Here are what I found as I work more on the code:

1. It takes quite some work to create a clean interface and make sure
    all the users of module_alloc can use the new allocator on all archs.
    (archs with text poke need to work with ROX memory from the
    allocator; archs without text poke need to have a clean fall back
    mechanism, etc.). Most of this work is independent of the actual
    allocator, so we can do this part and plug in whatever allocator we
    want (buddy+slab or vmap-based or any other solutions).

2. vmap_area based solution will work. And it will be one solution for
    both < PAGE_SIZE and > PAGE_SIZE allocations. Given
    module_alloc is not in any hot path (AFAIK), I don't see any
    practical issues with this solution. It will be a little tricky to place
    and name the code, as it uses vmalloc logic, but it is technically a
    module allocator.

I will prioritize building the interface, and migrating users to it. If we
do this part right, changing the underlying allocator should be
straightforward.

Thanks,
Song

Kent Overstreet May 19, 2023, 3:47 p.m. UTC | #46

On Fri, May 19, 2023 at 11:29:45AM +0300, Mike Rapoport wrote:
> Your allocator implicitly relies on vmalloc because of module_alloc ;-)
> 
> What I was thinking is that we can replace module_alloc() calls in your
> allocator with something based on my unmapped_alloc(). If we make the part
> that refills the cache also take care of creating the mapping in the
> module address space, that should cover everything.

Yeah, that's exactly what I was thinking :)

Liam was also just mentioning on IRC vmalloc lock contention came up
again at LSF, and that's historically always been an isuse - going with
your patchset for the backend nicely avoids that.

If I have time (hah! big if :) I'll see if I can cook up a patchset that
combines our two approaches over the weekend.

Mike Rapoport May 19, 2023, 4:14 p.m. UTC | #47

On Fri, May 19, 2023 at 11:47:42AM -0400, Kent Overstreet wrote:
> On Fri, May 19, 2023 at 11:29:45AM +0300, Mike Rapoport wrote:
> > Your allocator implicitly relies on vmalloc because of module_alloc ;-)
> > 
> > What I was thinking is that we can replace module_alloc() calls in your
> > allocator with something based on my unmapped_alloc(). If we make the part
> > that refills the cache also take care of creating the mapping in the
> > module address space, that should cover everything.
> 
> Yeah, that's exactly what I was thinking :)
> 
> Liam was also just mentioning on IRC vmalloc lock contention came up
> again at LSF, and that's historically always been an isuse - going with
> your patchset for the backend nicely avoids that.

Unfortunately not because we still need to map the pages in the modules
area which is essentially a subset of vmalloc address space.
 
> If I have time (hah! big if :) I'll see if I can cook up a patchset that
> combines our two approaches over the weekend.

Now there is also an interest about unmapped allocations from KVM folks, so
I might continue pursuing unmapped allocator, probably just without a new
GFP flag and hooks into page allocator.

Kent Overstreet May 19, 2023, 4:21 p.m. UTC | #48

On Fri, May 19, 2023 at 07:14:14PM +0300, Mike Rapoport wrote:
> Unfortunately not because we still need to map the pages in the modules
> area which is essentially a subset of vmalloc address space.

Why wouldn't we be doing that once per 2 MB page?

>  
> > If I have time (hah! big if :) I'll see if I can cook up a patchset that
> > combines our two approaches over the weekend.
> 
> Now there is also an interest about unmapped allocations from KVM folks, so
> I might continue pursuing unmapped allocator, probably just without a new
> GFP flag and hooks into page allocator.

Cool - it does seem like a generally valuable capability :)

Thomas Gleixner May 22, 2023, 10:05 p.m. UTC | #49

On Fri, May 19 2023 at 08:42, Song Liu wrote:
> On Fri, May 19, 2023 at 1:30 AM Mike Rapoport <rppt@kernel.org> wrote:
>> What I was thinking is that we can replace module_alloc() calls in your
>> allocator with something based on my unmapped_alloc(). If we make the part
>> that refills the cache also take care of creating the mapping in the
>> module address space, that should cover everything.
>
> Here are what I found as I work more on the code:
>
> 1. It takes quite some work to create a clean interface and make sure
>     all the users of module_alloc can use the new allocator on all archs.
>     (archs with text poke need to work with ROX memory from the
>     allocator; archs without text poke need to have a clean fall back
>     mechanism, etc.). Most of this work is independent of the actual
>     allocator, so we can do this part and plug in whatever allocator we
>     want (buddy+slab or vmap-based or any other solutions).
>
> 2. vmap_area based solution will work. And it will be one solution for
>     both < PAGE_SIZE and > PAGE_SIZE allocations. Given
>     module_alloc is not in any hot path (AFAIK), I don't see any
>     practical issues with this solution. It will be a little tricky to place
>     and name the code, as it uses vmalloc logic, but it is technically a
>     module allocator.
>
> I will prioritize building the interface, and migrating users to it. If we
> do this part right, changing the underlying allocator should be
> straightforward.

Correct, that's the only workable approach.

Once you have solved #1 the mechanism of the underlying allocator (#2)
is pretty much exchangeable.

Any attempt to solve #2 before #1 is doomed by definition.

Thanks,

        tglx

diff mbox series

Patch

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a825bf031f49..735f691d449c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -35,6 +35,9 @@  config X86_64
 	select ARCH_HAS_ELFCORE_COMPAT
 	select ZONE_DMA32
 
+config ARCH_WANTS_GFP_UNMAPPED
+	def_bool y if X86_64
+
 config FORCE_DYNAMIC_FTRACE
 	def_bool y
 	depends on X86_32
diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
index 5088637fe5c2..234122b434dd 100644
--- a/include/linux/gfp_types.h
+++ b/include/linux/gfp_types.h
@@ -60,6 +60,12 @@  typedef unsigned int __bitwise gfp_t;
 #else
 #define ___GFP_NOLOCKDEP	0
 #endif
+#ifdef CONFIG_UNMAPPED_ALLOC
+#define ___GFP_UNMAPPED		0x10000000u
+#else
+#define ___GFP_UNMAPPED		0
+#endif
+
 /* If the above are modified, __GFP_BITS_SHIFT may need updating */
 
 /*
@@ -101,12 +107,15 @@  typedef unsigned int __bitwise gfp_t;
  * node with no fallbacks or placement policy enforcements.
  *
  * %__GFP_ACCOUNT causes the allocation to be accounted to kmemcg.
+ *
+ * %__GFP_UNMAPPED removes the allocated pages from the direct map.
  */
 #define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE)
 #define __GFP_WRITE	((__force gfp_t)___GFP_WRITE)
 #define __GFP_HARDWALL   ((__force gfp_t)___GFP_HARDWALL)
 #define __GFP_THISNODE	((__force gfp_t)___GFP_THISNODE)
 #define __GFP_ACCOUNT	((__force gfp_t)___GFP_ACCOUNT)
+#define __GFP_UNMAPPED	((__force gfp_t)___GFP_UNMAPPED)
 
 /**
  * DOC: Watermark modifiers
@@ -252,7 +261,7 @@  typedef unsigned int __bitwise gfp_t;
 #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
 
 /* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT (27 + IS_ENABLED(CONFIG_LOCKDEP))
+#define __GFP_BITS_SHIFT (28 + IS_ENABLED(CONFIG_LOCKDEP))
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /**
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index a7e3a3405520..e66dbfdba8f2 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -922,6 +922,7 @@  static inline bool is_page_hwpoison(struct page *page)
 #define PG_offline	0x00000100
 #define PG_table	0x00000200
 #define PG_guard	0x00000400
+#define PG_unmapped	0x00000800
 
 #define PageType(page, flag)						\
 	((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
@@ -992,6 +993,11 @@  PAGE_TYPE_OPS(Table, table)
  */
 PAGE_TYPE_OPS(Guard, guard)
 
+/*
+ * Marks pages in free lists of unmapped allocator.
+ */
+PAGE_TYPE_OPS(Unmapped, unmapped)
+
 extern bool is_free_buddy_page(struct page *page);
 
 PAGEFLAG(Isolated, isolated, PF_ANY);
diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index 5f1ae07d724b..bb6f5ec83881 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -21,6 +21,9 @@  enum pageblock_bits {
 			/* 3 bits required for migrate types */
 	PB_migrate_skip,/* If set the block is skipped by compaction */
 
+	PB_unmapped = 7,/* if set the block has pages unmapped in the direct
+			   map */
+
 	/*
 	 * Assume the bits will always align on a word. If this assumption
 	 * changes then get/set pageblock needs updating.
@@ -95,4 +98,29 @@  static inline void set_pageblock_skip(struct page *page)
 }
 #endif /* CONFIG_COMPACTION */
 
+#ifdef CONFIG_UNMAPPED_ALLOC
+#define get_pageblock_unmapped(page)                         \
+       get_pfnblock_flags_mask(page, page_to_pfn(page),        \
+                       (1 << (PB_unmapped)))
+#define clear_pageblock_unmapped(page) \
+       set_pfnblock_flags_mask(page, 0, page_to_pfn(page),     \
+                       (1 << PB_unmapped))
+#define set_pageblock_unmapped(page) \
+       set_pfnblock_flags_mask(page, (1 << PB_unmapped),     \
+                       page_to_pfn(page),                      \
+                       (1 << PB_unmapped))
+#else /* CONFIG_UNMAPPED_ALLOC */
+static inline bool get_pageblock_unmapped(struct page *page)
+{
+       return false;
+}
+static inline void clear_pageblock_unmapped(struct page *page)
+{
+}
+static inline void set_pageblock_unmapped(struct page *page)
+{
+}
+#endif /* CONFIG_UNMAPPED_ALLOC */
+
+
 #endif	/* PAGEBLOCK_FLAGS_H */
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 9db52bc4ce19..951d294a3840 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -61,9 +61,17 @@ 
 #define __def_gfpflag_names_kasan
 #endif
 
+#ifdef CONFIG_UNMAPPED_ALLOC
+#define __def_gfpflag_names_unmapped			\
+	, gfpflag_string(__GFP_UNMAPPED)
+#else
+#define __def_gfpflag_names_unmapped
+#endif
+
 #define show_gfp_flags(flags)						\
 	(flags) ? __print_flags(flags, "|",				\
-	__def_gfpflag_names __def_gfpflag_names_kasan			\
+	__def_gfpflag_names __def_gfpflag_names_kasan 			\
+	__def_gfpflag_names_unmapped					\
 	) : "none"
 
 #ifdef CONFIG_MMU
diff --git a/mm/Kconfig b/mm/Kconfig
index 4751031f3f05..404be73e00e8 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1202,6 +1202,10 @@  config LRU_GEN_STATS
 	  This option has a per-memcg and per-node memory overhead.
 # }
 
+config UNMAPPED_ALLOC
+	def_bool y
+	depends on ARCH_WANTS_GFP_UNMAPPED
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 8e105e5b3e29..9143303295af 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -138,3 +138,4 @@  obj-$(CONFIG_IO_MAPPING) += io-mapping.o
 obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
 obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
 obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
+obj-$(CONFIG_UNMAPPED_ALLOC) += unmapped-alloc.o
diff --git a/mm/internal.h b/mm/internal.h
index 7920a8b7982e..8d84cceab467 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1042,4 +1042,26 @@  struct vma_prepare {
 	struct vm_area_struct *remove;
 	struct vm_area_struct *remove2;
 };
+
+/*
+ * mm/unmapped-alloc.c
+ */
+#ifdef CONFIG_UNMAPPED_ALLOC
+int unmapped_alloc_init(void);
+struct page *unmapped_pages_alloc(gfp_t gfp, int order);
+void unmapped_pages_free(struct page *page, int order);
+#else
+static inline int unmapped_alloc_init(void)
+{
+	return 0;
+}
+
+static inline struct page *unmapped_pages_alloc(gfp_t gfp, int order)
+{
+	return NULL;
+}
+
+static inline void unmapped_pages_free(struct page *page, int order) {}
+#endif
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ac1fc986af44..01f18e7529b0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -589,7 +589,7 @@  void set_pfnblock_flags_mask(struct page *page, unsigned long flags,
 	unsigned long bitidx, word_bitidx;
 	unsigned long word;
 
-	BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 4);
+	BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 8);
 	BUILD_BUG_ON(MIGRATE_TYPES > (1 << PB_migratetype_bits));
 
 	bitmap = get_pageblock_bitmap(page, pfn);
@@ -746,6 +746,13 @@  static inline bool pcp_allowed_order(unsigned int order)
 	return false;
 }
 
+static inline bool pcp_allowed(unsigned int order, gfp_t gfp)
+{
+	if (unlikely(gfp & __GFP_UNMAPPED))
+		return false;
+	return pcp_allowed_order(order);
+}
+
 static inline void free_the_page(struct page *page, unsigned int order)
 {
 	if (pcp_allowed_order(order))		/* Via pcp? */
@@ -1460,6 +1467,11 @@  static __always_inline bool free_pages_prepare(struct page *page,
 					   PAGE_SIZE << order);
 	}
 
+	if (get_pageblock_unmapped(page)) {
+		unmapped_pages_free(page, order);
+		return false;
+	}
+
 	kernel_poison_pages(page, 1 << order);
 
 	/*
@@ -3525,6 +3537,13 @@  void free_unref_page_list(struct list_head *list)
 	/* Prepare pages for freeing */
 	list_for_each_entry_safe(page, next, list, lru) {
 		unsigned long pfn = page_to_pfn(page);
+
+		if (get_pageblock_unmapped(page)) {
+			list_del(&page->lru);
+			unmapped_pages_free(page, 0);
+			continue;
+		}
+
 		if (!free_unref_page_prepare(page, pfn, 0)) {
 			list_del(&page->lru);
 			continue;
@@ -3856,7 +3875,7 @@  struct page *rmqueue(struct zone *preferred_zone,
 	 */
 	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
 
-	if (likely(pcp_allowed_order(order))) {
+	if (likely(pcp_allowed(order, gfp_flags))) {
 		/*
 		 * MIGRATE_MOVABLE pcplist could have the pages on CMA area and
 		 * we need to skip it when CMA area isn't allowed.
@@ -5581,6 +5600,11 @@  struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
 			&alloc_gfp, &alloc_flags))
 		return NULL;
 
+	if (alloc_gfp & __GFP_UNMAPPED) {
+		page = unmapped_pages_alloc(gfp, order);
+		goto out;
+	}
+
 	/*
 	 * Forbid the first pass from falling back to types that fragment
 	 * memory until all local zones are considered.
@@ -8437,6 +8461,7 @@  void __init free_area_init(unsigned long *max_zone_pfn)
 	}
 
 	memmap_init();
+	unmapped_alloc_init();
 }
 
 static int __init cmdline_parse_core(char *p, unsigned long *core,
diff --git a/mm/unmapped-alloc.c b/mm/unmapped-alloc.c
new file mode 100644
index 000000000000..fb2d54204a3c
--- /dev/null
+++ b/mm/unmapped-alloc.c
@@ -0,0 +1,215 @@ 
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/gfp.h>
+#include <linux/mmzone.h>
+#include <linux/printk.h>
+#include <linux/spinlock.h>
+#include <linux/set_memory.h>
+
+#include <asm/tlbflush.h>
+
+#include "internal.h"
+
+struct unmapped_free_area {
+	struct list_head	free_list;
+	spinlock_t		lock;
+	unsigned long		nr_free;
+	unsigned long		nr_cached;
+};
+
+static struct unmapped_free_area free_area[MAX_ORDER];
+
+static inline void add_to_free_list(struct page *page, unsigned int order)
+{
+	struct unmapped_free_area *area = &free_area[order];
+
+	list_add(&page->buddy_list, &area->free_list);
+	area->nr_free++;
+}
+
+static inline void del_page_from_free_list(struct page *page, unsigned int order)
+{
+	list_del(&page->buddy_list);
+	__ClearPageUnmapped(page);
+	set_page_private(page, 0);
+	free_area[order].nr_free--;
+}
+
+static inline void set_unmapped_order(struct page *page, unsigned int order)
+{
+	set_page_private(page, order);
+	__SetPageUnmapped(page);
+}
+
+static inline bool page_is_unmapped_buddy(struct page *page, struct page *buddy,
+					  unsigned int order)
+{
+	if (!PageUnmapped(buddy))
+		return false;
+
+	if (buddy_order(buddy) != order)
+		return false;
+
+	return true;
+}
+
+static struct page *find_unmapped_buddy_page_pfn(struct page *page,
+						 unsigned long pfn,
+						 unsigned int order,
+						 unsigned long *buddy_pfn)
+{
+	unsigned long __buddy_pfn = __find_buddy_pfn(pfn, order);
+	struct page *buddy;
+
+	buddy = page + (__buddy_pfn - pfn);
+	if (buddy_pfn)
+		*buddy_pfn = __buddy_pfn;
+
+	if (page_is_unmapped_buddy(page, buddy, order))
+		return buddy;
+
+	return NULL;
+}
+
+static inline void __free_one_page(struct page *page, unsigned int order,
+				   bool cache_refill)
+{
+	unsigned long pfn = page_to_pfn(page);
+	unsigned long buddy_pfn;
+	unsigned long combined_pfn;
+	struct page *buddy;
+	unsigned long flags;
+
+	spin_lock_irqsave(&free_area->lock, flags);
+
+	if (cache_refill) {
+		set_pageblock_unmapped(page);
+		free_area[order].nr_cached++;
+	}
+
+	while (order < MAX_ORDER - 1) {
+		buddy = find_unmapped_buddy_page_pfn(page, pfn, order,
+						     &buddy_pfn);
+		if (!buddy)
+			break;
+
+		del_page_from_free_list(buddy, order);
+		combined_pfn = buddy_pfn & pfn;
+		page = page + (combined_pfn - pfn);
+		pfn = combined_pfn;
+		order++;
+	}
+
+	set_unmapped_order(page, order);
+	add_to_free_list(page, order);
+	spin_unlock_irqrestore(&free_area->lock, flags);
+}
+
+static inline void expand(struct page *page, int low, int high)
+{
+	unsigned long size = 1 << high;
+
+	while (high > low) {
+		high--;
+		size >>= 1;
+
+		add_to_free_list(&page[size], high);
+		set_unmapped_order(&page[size], high);
+	}
+}
+
+static struct page *__rmqueue_smallest(unsigned int order)
+{
+	unsigned int current_order;
+	struct unmapped_free_area *area;
+	struct page *page = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&free_area->lock, flags);
+
+	/* Find a page of the appropriate size in the preferred list */
+	for (current_order = order; current_order < MAX_ORDER; ++current_order) {
+		area = &free_area[current_order];
+		page = list_first_entry_or_null(&area->free_list, struct page,
+						lru);
+		if (!page)
+			continue;
+
+		del_page_from_free_list(page, current_order);
+		expand(page, order, current_order);
+
+		break;
+	}
+
+	spin_unlock_irqrestore(&free_area->lock, flags);
+
+	return page;
+}
+
+/* FIXME: have PMD_ORDER at last available in include/linux */
+#define PMD_ORDER	(PMD_SHIFT - PAGE_SHIFT)
+
+struct page *unmapped_pages_alloc(gfp_t gfp, int order)
+{
+
+	int cache_order = PMD_ORDER;
+	struct page *page;
+
+	page = __rmqueue_smallest(order);
+	if (page)
+		goto out;
+
+	gfp &= ~__GFP_UNMAPPED;
+	while (cache_order >= order) {
+		page = alloc_pages(gfp | __GFP_ZERO, cache_order);
+		if (page)
+			break;
+		cache_order--;
+	}
+
+	if (page) {
+		unsigned long addr = (unsigned long)page_address(page);
+		unsigned long nr_pages = (1 << order);
+		unsigned long size = PAGE_SIZE * nr_pages;
+
+		/* FIXME: set_direct_map_invalid_noflush may fail */
+		for (int i = 0; i < nr_pages; i++)
+			set_direct_map_invalid_noflush(page + i);
+
+		flush_tlb_kernel_range(addr, addr + size);
+
+		/*
+		 * FIXME: have this under lock so that allocation running
+		 * in parallel won't steal all pages from the newly cached
+		 * ones
+		 */
+		__free_one_page(page, cache_order, true);
+		page = __rmqueue_smallest(order);
+
+	}
+
+out:
+	if (page) {
+		/* FIXME: __prep_new_page() expects page count of 0 */
+		set_page_count(page, 0);
+		post_alloc_hook(page, order, GFP_KERNEL);
+	}
+
+	return page;
+}
+
+void unmapped_pages_free(struct page *page, int order)
+{
+	__free_one_page(page, order, false);
+}
+
+int unmapped_alloc_init(void)
+{
+	for (int order = 0; order < MAX_ORDER; order++) {
+		struct unmapped_free_area *area = &free_area[order];
+		INIT_LIST_HEAD(&area->free_list);
+		spin_lock_init(&area->lock);
+	}
+
+	return 0;
+}