Message ID | 20230308094106.227365-2-rppt@kernel.org |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a5d:5915:0:0:0:0:0 with SMTP id v21csp237983wrd; Wed, 8 Mar 2023 01:57:30 -0800 (PST) X-Google-Smtp-Source: AK7set9eJkqBnw83ltcFRK5gaVqBczr4oItrcFwMPAdvwohAURuTlFS6An9sOBKmyFH+5l7wZySh X-Received: by 2002:a17:906:9f21:b0:88a:7037:855e with SMTP id fy33-20020a1709069f2100b0088a7037855emr19642544ejc.9.1678269450376; Wed, 08 Mar 2023 01:57:30 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1678269450; cv=none; d=google.com; s=arc-20160816; b=NmuYh+w/k0DD8ggzSTlnQ+b7P8jWHrU/CfoQWhYNz3wTphbxk+RYj98hxKE7H3qXgM j6NRatUtjbIgh/8H5/7IKry9Bb1uy2Pc7gHfp/uGZcPVkpeMRdlrs29wPixZiYhI7d8G c+AEMsHV0Q+VNiAxh8zUM0xS4C+XXA2dEpCSIdICg5SDrfMRCI7cO/+ns1/lHSX2CrNx CBMs28Rf1GGp6z44B/813UMYR7Sq9YNPHUC2xzfcVI09/NaaJfb2jZ+151gGT7M99xeQ bAbmDqD0kYqwmOvlPeVm7rpPxjVAriJs1I/nmmrg/MHgabQhZ32F2fdxybCpfHGY54NM s33Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=7mWerlB7iI5JwLACExHKsBgwf4sqipB5x9+UyIghBDI=; b=RpFo6oHl4+J81PeFlq6W736M7Cipw9RhP9KIHbXuNv0ChLMWXwpims0MjldkEbnkfw RfcHsAabF/F6vc7W831fKT1vHxzpK9DPi9fMWlrPOI6vgYHe0FeeYddD+7+4R94l02JZ NM9tUjwY83GEQuSYkAqRrvKgpX2pDNtQkGpUDYeZwtBUhZl0vJNOJr17lvJ927fsEQ02 jAoRTdP6KOjc/B5ytCIkZcH1UywIJqqJUA4IhMhVDmCNVKK3Yv9VCVevl/t7Mi8/F0Yr OuSKMZbmTLyl8y0MsjmX24KA238xqzQ26KCZAYs1Qq0CTqijSwk5UwahfXGdPJ0LUXI8 WS+g== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=g2gtD1xh; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id qk6-20020a1709077f8600b008e01978a238si6501928ejc.52.2023.03.08.01.57.07; Wed, 08 Mar 2023 01:57:30 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=g2gtD1xh; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230122AbjCHJlx (ORCPT <rfc822;toshivichauhan@gmail.com> + 99 others); Wed, 8 Mar 2023 04:41:53 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51864 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229929AbjCHJlh (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Wed, 8 Mar 2023 04:41:37 -0500 Received: from sin.source.kernel.org (sin.source.kernel.org [IPv6:2604:1380:40e1:4800::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 62B71B1ED1 for <linux-kernel@vger.kernel.org>; Wed, 8 Mar 2023 01:41:28 -0800 (PST) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by sin.source.kernel.org (Postfix) with ESMTPS id 7B9C2CE1F10 for <linux-kernel@vger.kernel.org>; Wed, 8 Mar 2023 09:41:26 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 2E1AAC4339B; Wed, 8 Mar 2023 09:41:20 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1678268484; bh=yWy9YBr3QRwlwUhGDvVMLg0rTuhOsa1avYZumFur5Fc=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=g2gtD1xhW3u8pPRbZe/MH1SYrnQyB3Pg3cb3uhQHhN6UnNrqhop4CTR6FmDAcS2St YCjcuOxw+R2wN9J/DVCCiwFM1JQcWLMMdg1T7gVK2z2YpXDqJruWy/AlgmPjaCy147 JGJvvZPr9xvET4zf6lp6RmFzW7BMjIPm/A2p+scS/WAQF9qM78NbN+qaLFs0wooc6t H+Ixc0EkDbJU229EW9sLUolO7nPNsXja4Aq4l/mDMesoTliZWImHdvrCLOpMy+NyMW VntaGV1oKqnqu8ipRu6ukpj4mvrmJREPGAWAm85CMkMG5ERTOgugoBXvyVkB9UWYpg bkJ8fmxQCrpmg== From: Mike Rapoport <rppt@kernel.org> To: linux-mm@kvack.org Cc: Andrew Morton <akpm@linux-foundation.org>, Dave Hansen <dave.hansen@linux.intel.com>, Mike Rapoport <rppt@kernel.org>, Peter Zijlstra <peterz@infradead.org>, Rick Edgecombe <rick.p.edgecombe@intel.com>, Song Liu <song@kernel.org>, Thomas Gleixner <tglx@linutronix.de>, Vlastimil Babka <vbabka@suse.cz>, linux-kernel@vger.kernel.org, x86@kernel.org Subject: [RFC PATCH 1/5] mm: intorduce __GFP_UNMAPPED and unmapped_alloc() Date: Wed, 8 Mar 2023 11:41:02 +0200 Message-Id: <20230308094106.227365-2-rppt@kernel.org> X-Mailer: git-send-email 2.35.1 In-Reply-To: <20230308094106.227365-1-rppt@kernel.org> References: <20230308094106.227365-1-rppt@kernel.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-4.4 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_MED, SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1759793067494279991?= X-GMAIL-MSGID: =?utf-8?q?1759793067494279991?= |
Series |
Prototype for direct map awareness in page allocator
|
|
Commit Message
Mike Rapoport
March 8, 2023, 9:41 a.m. UTC
From: "Mike Rapoport (IBM)" <rppt@kernel.org> When set_memory or set_direct_map APIs used to change attribute or permissions for chunks of several pages, the large PMD that maps these pages in the direct map must be split. Fragmenting the direct map in such manner causes TLB pressure and, eventually, performance degradation. To avoid excessive direct map fragmentation, add ability to allocate "unmapped" pages with __GFP_UNMAPPED flag that will cause removal of the allocated pages from the direct map and use a cache of the unmapped pages. This cache is replenished with higher order pages with preference for PMD_SIZE pages when possible so that there will be fewer splits of large pages in the direct map. The cache is implemented as a buddy allocator, so it can serve high order allocations of unmapped pages. Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> --- arch/x86/Kconfig | 3 + include/linux/gfp_types.h | 11 +- include/linux/page-flags.h | 6 + include/linux/pageblock-flags.h | 28 +++++ include/trace/events/mmflags.h | 10 +- mm/Kconfig | 4 + mm/Makefile | 1 + mm/internal.h | 22 ++++ mm/page_alloc.c | 29 ++++- mm/unmapped-alloc.c | 215 ++++++++++++++++++++++++++++++++ 10 files changed, 325 insertions(+), 4 deletions(-) create mode 100644 mm/unmapped-alloc.c
Comments
On Wed, 2023-03-08 at 11:41 +0200, Mike Rapoport wrote: > + > +static inline void __free_one_page(struct page *page, unsigned int > order, > + bool cache_refill) > +{ > + unsigned long pfn = page_to_pfn(page); > + unsigned long buddy_pfn; > + unsigned long combined_pfn; > + struct page *buddy; > + unsigned long flags; > + > + spin_lock_irqsave(&free_area->lock, flags); > + > + if (cache_refill) { > + set_pageblock_unmapped(page); > + free_area[order].nr_cached++; > + } > + > + while (order < MAX_ORDER - 1) { > + buddy = find_unmapped_buddy_page_pfn(page, pfn, > order, > + &buddy_pfn); > + if (!buddy) > + break; > + > + del_page_from_free_list(buddy, order); > + combined_pfn = buddy_pfn & pfn; > + page = page + (combined_pfn - pfn); > + pfn = combined_pfn; > + order++; > + } > + > + set_unmapped_order(page, order); > + add_to_free_list(page, order); > + spin_unlock_irqrestore(&free_area->lock, flags); > +} > + The page has to be zeroed before it goes back on the list, right? I didn't see it.
On Wed, Mar 08, 2023 at 11:41:02AM +0200, Mike Rapoport wrote: > From: "Mike Rapoport (IBM)" <rppt@kernel.org> > > When set_memory or set_direct_map APIs used to change attribute or > permissions for chunks of several pages, the large PMD that maps these > pages in the direct map must be split. Fragmenting the direct map in such > manner causes TLB pressure and, eventually, performance degradation. > > To avoid excessive direct map fragmentation, add ability to allocate > "unmapped" pages with __GFP_UNMAPPED flag that will cause removal of the > allocated pages from the direct map and use a cache of the unmapped pages. > > This cache is replenished with higher order pages with preference for > PMD_SIZE pages when possible so that there will be fewer splits of large > pages in the direct map. > > The cache is implemented as a buddy allocator, so it can serve high order > allocations of unmapped pages. Hello, To me it seems unnecessary to increase pageblock bitmap size just to distinguish if it is allocated with __GFP_UNMAPPED. I think this can be implemented as an independent cache on top of buddy allocator, and introducing new API for unmapped page allocation and freeing? Thanks, Hyeonggon
On Thu, Mar 09, 2023 at 01:56:37AM +0000, Edgecombe, Rick P wrote: > On Wed, 2023-03-08 at 11:41 +0200, Mike Rapoport wrote: > > + > > +static inline void __free_one_page(struct page *page, unsigned int > > order, > > + bool cache_refill) > > +{ > > + unsigned long pfn = page_to_pfn(page); > > + unsigned long buddy_pfn; > > + unsigned long combined_pfn; > > + struct page *buddy; > > + unsigned long flags; > > + > > + spin_lock_irqsave(&free_area->lock, flags); > > + > > + if (cache_refill) { > > + set_pageblock_unmapped(page); > > + free_area[order].nr_cached++; > > + } > > + > > + while (order < MAX_ORDER - 1) { > > + buddy = find_unmapped_buddy_page_pfn(page, pfn, > > order, > > + &buddy_pfn); > > + if (!buddy) > > + break; > > + > > + del_page_from_free_list(buddy, order); > > + combined_pfn = buddy_pfn & pfn; > > + page = page + (combined_pfn - pfn); > > + pfn = combined_pfn; > > + order++; > > + } > > + > > + set_unmapped_order(page, order); > > + add_to_free_list(page, order); > > + spin_unlock_irqrestore(&free_area->lock, flags); > > +} > > + > > The page has to be zeroed before it goes back on the list, right? I > didn't see it. You are right, I missed it.
On Thu, Mar 09, 2023 at 06:31:25AM +0000, Hyeonggon Yoo wrote: > On Wed, Mar 08, 2023 at 11:41:02AM +0200, Mike Rapoport wrote: > > From: "Mike Rapoport (IBM)" <rppt@kernel.org> > > > > When set_memory or set_direct_map APIs used to change attribute or > > permissions for chunks of several pages, the large PMD that maps these > > pages in the direct map must be split. Fragmenting the direct map in such > > manner causes TLB pressure and, eventually, performance degradation. > > > > To avoid excessive direct map fragmentation, add ability to allocate > > "unmapped" pages with __GFP_UNMAPPED flag that will cause removal of the > > allocated pages from the direct map and use a cache of the unmapped pages. > > > > This cache is replenished with higher order pages with preference for > > PMD_SIZE pages when possible so that there will be fewer splits of large > > pages in the direct map. > > > > The cache is implemented as a buddy allocator, so it can serve high order > > allocations of unmapped pages. > > Hello, > > To me it seems unnecessary to increase pageblock bitmap size just to > distinguish if it is allocated with __GFP_UNMAPPED. > > I think this can be implemented as an independent cache on top of > buddy allocator, and introducing new API for unmapped page > allocation and freeing? Yes, it could. But with such API we'd also need to take care of mapping/unmapping these pages for the modules use case which is not that difficult, but still more complex than a flag to vmalloc(). As for secretmem, freeing of secretmem pages happens in the page cache, so changing that to use, say, unmapped_free() would be quite intrusive. Another reason to mark the entire pageblock as unmapped is possibility to steal pages from that pageblock into the unmmaped cache when they become free. When we allocate pages to replenish the unmapped cache, we split the large mapping in the direct map, so even the pages in the same pageblock that do not get into the cache are still mapped at PTE level. When they are freed, they will be put in the cache and so fewer explicit cache refills that split the large mappings will be required. > Thanks, > Hyeonggon
On Thu, 2023-03-09 at 16:39 +0200, Mike Rapoport wrote: > On Thu, Mar 09, 2023 at 01:56:37AM +0000, Edgecombe, Rick P wrote: > > On Wed, 2023-03-08 at 11:41 +0200, Mike Rapoport wrote: > > > + > > > +static inline void __free_one_page(struct page *page, unsigned > > > int > > > order, > > > + bool cache_refill) > > > +{ > > > + unsigned long pfn = page_to_pfn(page); > > > + unsigned long buddy_pfn; > > > + unsigned long combined_pfn; > > > + struct page *buddy; > > > + unsigned long flags; > > > + > > > + spin_lock_irqsave(&free_area->lock, flags); > > > + > > > + if (cache_refill) { > > > + set_pageblock_unmapped(page); > > > + free_area[order].nr_cached++; > > > + } > > > + > > > + while (order < MAX_ORDER - 1) { > > > + buddy = find_unmapped_buddy_page_pfn(page, pfn, > > > order, > > > + &buddy_pfn); > > > + if (!buddy) > > > + break; > > > + > > > + del_page_from_free_list(buddy, order); > > > + combined_pfn = buddy_pfn & pfn; > > > + page = page + (combined_pfn - pfn); > > > + pfn = combined_pfn; > > > + order++; > > > + } > > > + > > > + set_unmapped_order(page, order); > > > + add_to_free_list(page, order); > > > + spin_unlock_irqrestore(&free_area->lock, flags); > > > +} > > > + > > > > The page has to be zeroed before it goes back on the list, right? I > > didn't see it. > > You are right, I missed it. The text_poke() family is probably the way on x86, but I don't think there is a cross-arch way that doesn't involve kernel shootdowns...
On Wed 08-03-23 11:41:02, Mike Rapoport wrote: > From: "Mike Rapoport (IBM)" <rppt@kernel.org> > > When set_memory or set_direct_map APIs used to change attribute or > permissions for chunks of several pages, the large PMD that maps these > pages in the direct map must be split. Fragmenting the direct map in such > manner causes TLB pressure and, eventually, performance degradation. > > To avoid excessive direct map fragmentation, add ability to allocate > "unmapped" pages with __GFP_UNMAPPED flag that will cause removal of the > allocated pages from the direct map and use a cache of the unmapped pages. > > This cache is replenished with higher order pages with preference for > PMD_SIZE pages when possible so that there will be fewer splits of large > pages in the direct map. > > The cache is implemented as a buddy allocator, so it can serve high order > allocations of unmapped pages. Why do we need a dedicated gfp flag for all this when a dedicated allocator is used anyway. What prevents users to call unmapped_pages_{alloc,free}?
On Fri, Mar 24, 2023 at 09:37:31AM +0100, Michal Hocko wrote: > On Wed 08-03-23 11:41:02, Mike Rapoport wrote: > > From: "Mike Rapoport (IBM)" <rppt@kernel.org> > > > > When set_memory or set_direct_map APIs used to change attribute or > > permissions for chunks of several pages, the large PMD that maps these > > pages in the direct map must be split. Fragmenting the direct map in such > > manner causes TLB pressure and, eventually, performance degradation. > > > > To avoid excessive direct map fragmentation, add ability to allocate > > "unmapped" pages with __GFP_UNMAPPED flag that will cause removal of the > > allocated pages from the direct map and use a cache of the unmapped pages. > > > > This cache is replenished with higher order pages with preference for > > PMD_SIZE pages when possible so that there will be fewer splits of large > > pages in the direct map. > > > > The cache is implemented as a buddy allocator, so it can serve high order > > allocations of unmapped pages. > > Why do we need a dedicated gfp flag for all this when a dedicated > allocator is used anyway. What prevents users to call unmapped_pages_{alloc,free}? Using unmapped_pages_{alloc,free} adds complexity to the users which IMO outweighs the cost of a dedicated gfp flag. For modules we'd have to make x86::module_{alloc,free}() take care of mapping and unmapping the allocated pages in the modules virtual address range. This also might become relevant for another architectures in future and than we'll have several complex module_alloc()s. And for secretmem while using unmapped_pages_alloc() is easy, the free path becomes really complex because actual page freeing for fd-based memory is deeply buried in the page cache code. My gut feeling is that for PKS using a gfp flag would save a lot of hassle as well. I also think that some of the core buddy allocator code might be reused and unmapped_pages_{alloc,free} could be statics in mm/page_alloc.c and won't be exposed at all. For now I've just copied free list helpers to a separate file out of laziness. > -- > Michal Hocko > SUSE Labs
On Sat 25-03-23 09:38:12, Mike Rapoport wrote: > On Fri, Mar 24, 2023 at 09:37:31AM +0100, Michal Hocko wrote: > > On Wed 08-03-23 11:41:02, Mike Rapoport wrote: > > > From: "Mike Rapoport (IBM)" <rppt@kernel.org> > > > > > > When set_memory or set_direct_map APIs used to change attribute or > > > permissions for chunks of several pages, the large PMD that maps these > > > pages in the direct map must be split. Fragmenting the direct map in such > > > manner causes TLB pressure and, eventually, performance degradation. > > > > > > To avoid excessive direct map fragmentation, add ability to allocate > > > "unmapped" pages with __GFP_UNMAPPED flag that will cause removal of the > > > allocated pages from the direct map and use a cache of the unmapped pages. > > > > > > This cache is replenished with higher order pages with preference for > > > PMD_SIZE pages when possible so that there will be fewer splits of large > > > pages in the direct map. > > > > > > The cache is implemented as a buddy allocator, so it can serve high order > > > allocations of unmapped pages. > > > > Why do we need a dedicated gfp flag for all this when a dedicated > > allocator is used anyway. What prevents users to call unmapped_pages_{alloc,free}? > > Using unmapped_pages_{alloc,free} adds complexity to the users which IMO > outweighs the cost of a dedicated gfp flag. Aren't those users rare and very special anyway? > For modules we'd have to make x86::module_{alloc,free}() take care of > mapping and unmapping the allocated pages in the modules virtual address > range. This also might become relevant for another architectures in future > and than we'll have several complex module_alloc()s. The module_alloc use is lacking any justification. More context would be more than useful. Also vmalloc support for the proposed __GFP_UNMAPPED likely needs more explanation as well. > And for secretmem while using unmapped_pages_alloc() is easy, the free path > becomes really complex because actual page freeing for fd-based memory is > deeply buried in the page cache code. Why is that a problem? You already hook into the page freeing path and special case unmapped memory. > My gut feeling is that for PKS using a gfp flag would save a lot of hassle > as well. Well, my take on this is that this is not a generic page allocator functionality. It is clearly an allocator on top of the page allocator. In general gfp flags are scarce and convenience argument usually fires back later on in hard to predict ways. So I've learned to be careful here. I am not saying this is a no-go but right now I do not see any acutal advantage. The vmalloc usecase could be interesting in that regards but it is not really clear to me whether this is a good idea in the first place.
On 3/27/23 15:43, Michal Hocko wrote: > On Sat 25-03-23 09:38:12, Mike Rapoport wrote: >> On Fri, Mar 24, 2023 at 09:37:31AM +0100, Michal Hocko wrote: >> > On Wed 08-03-23 11:41:02, Mike Rapoport wrote: >> > > From: "Mike Rapoport (IBM)" <rppt@kernel.org> >> > > >> > > When set_memory or set_direct_map APIs used to change attribute or >> > > permissions for chunks of several pages, the large PMD that maps these >> > > pages in the direct map must be split. Fragmenting the direct map in such >> > > manner causes TLB pressure and, eventually, performance degradation. >> > > >> > > To avoid excessive direct map fragmentation, add ability to allocate >> > > "unmapped" pages with __GFP_UNMAPPED flag that will cause removal of the >> > > allocated pages from the direct map and use a cache of the unmapped pages. >> > > >> > > This cache is replenished with higher order pages with preference for >> > > PMD_SIZE pages when possible so that there will be fewer splits of large >> > > pages in the direct map. >> > > >> > > The cache is implemented as a buddy allocator, so it can serve high order >> > > allocations of unmapped pages. >> > >> > Why do we need a dedicated gfp flag for all this when a dedicated >> > allocator is used anyway. What prevents users to call unmapped_pages_{alloc,free}? >> >> Using unmapped_pages_{alloc,free} adds complexity to the users which IMO >> outweighs the cost of a dedicated gfp flag. > > Aren't those users rare and very special anyway? I think it's mostly about the freeing that can happen from a generic context not aware of the special allocation, so it's not about how rare it is, but how complex would be to determine exhaustively those contexts and do something in them. >> For modules we'd have to make x86::module_{alloc,free}() take care of >> mapping and unmapping the allocated pages in the modules virtual address >> range. This also might become relevant for another architectures in future >> and than we'll have several complex module_alloc()s. > > The module_alloc use is lacking any justification. More context would be > more than useful. Also vmalloc support for the proposed __GFP_UNMAPPED > likely needs more explanation as well. > >> And for secretmem while using unmapped_pages_alloc() is easy, the free path >> becomes really complex because actual page freeing for fd-based memory is >> deeply buried in the page cache code. > > Why is that a problem? You already hook into the page freeing path and > special case unmapped memory. But the proposal of unmapped_pages_free() would suggest this would no longer be the case? But maybe we could, as a compromise, provide unmapped_pages_alloc() to get rid of the new __GFP flag, provide unmapped_pages_free() to annotate places that are known to free unmapped memory explicitly, but the generic page freeing would also keep the hook? >> My gut feeling is that for PKS using a gfp flag would save a lot of hassle >> as well. > > Well, my take on this is that this is not a generic page allocator > functionality. It is clearly an allocator on top of the page allocator. > In general gfp flags are scarce and convenience argument usually fires > back later on in hard to predict ways. So I've learned to be careful > here. I am not saying this is a no-go but right now I do not see any > acutal advantage. The vmalloc usecase could be interesting in that > regards but it is not really clear to me whether this is a good idea in > the first place. >
On Mon 27-03-23 16:31:45, Vlastimil Babka wrote: > On 3/27/23 15:43, Michal Hocko wrote: > > On Sat 25-03-23 09:38:12, Mike Rapoport wrote: > >> On Fri, Mar 24, 2023 at 09:37:31AM +0100, Michal Hocko wrote: > >> > On Wed 08-03-23 11:41:02, Mike Rapoport wrote: > >> > > From: "Mike Rapoport (IBM)" <rppt@kernel.org> > >> > > > >> > > When set_memory or set_direct_map APIs used to change attribute or > >> > > permissions for chunks of several pages, the large PMD that maps these > >> > > pages in the direct map must be split. Fragmenting the direct map in such > >> > > manner causes TLB pressure and, eventually, performance degradation. > >> > > > >> > > To avoid excessive direct map fragmentation, add ability to allocate > >> > > "unmapped" pages with __GFP_UNMAPPED flag that will cause removal of the > >> > > allocated pages from the direct map and use a cache of the unmapped pages. > >> > > > >> > > This cache is replenished with higher order pages with preference for > >> > > PMD_SIZE pages when possible so that there will be fewer splits of large > >> > > pages in the direct map. > >> > > > >> > > The cache is implemented as a buddy allocator, so it can serve high order > >> > > allocations of unmapped pages. > >> > > >> > Why do we need a dedicated gfp flag for all this when a dedicated > >> > allocator is used anyway. What prevents users to call unmapped_pages_{alloc,free}? > >> > >> Using unmapped_pages_{alloc,free} adds complexity to the users which IMO > >> outweighs the cost of a dedicated gfp flag. > > > > Aren't those users rare and very special anyway? > > I think it's mostly about the freeing that can happen from a generic context > not aware of the special allocation, so it's not about how rare it is, but > how complex would be to determine exhaustively those contexts and do > something in them. Yes, I can see a challenge with put_page users but that is not really related to the gfp flag as those are only relevant for the allocation context. > >> For modules we'd have to make x86::module_{alloc,free}() take care of > >> mapping and unmapping the allocated pages in the modules virtual address > >> range. This also might become relevant for another architectures in future > >> and than we'll have several complex module_alloc()s. > > > > The module_alloc use is lacking any justification. More context would be > > more than useful. Also vmalloc support for the proposed __GFP_UNMAPPED > > likely needs more explanation as well. > > > >> And for secretmem while using unmapped_pages_alloc() is easy, the free path > >> becomes really complex because actual page freeing for fd-based memory is > >> deeply buried in the page cache code. > > > > Why is that a problem? You already hook into the page freeing path and > > special case unmapped memory. > > But the proposal of unmapped_pages_free() would suggest this would no longer > be the case? I can see a check in the freeing path. > But maybe we could, as a compromise, provide unmapped_pages_alloc() to get > rid of the new __GFP flag, provide unmapped_pages_free() to annotate places > that are known to free unmapped memory explicitly, but the generic page > freeing would also keep the hook? Honestly I do not see a different option if those pages are to be reference counted. Unless they can use a destructor concept like hugetlb pages. At least secret mem usecase cannot AFAICS.
On Mon, Mar 27, 2023 at 03:43:27PM +0200, Michal Hocko wrote: > On Sat 25-03-23 09:38:12, Mike Rapoport wrote: > > On Fri, Mar 24, 2023 at 09:37:31AM +0100, Michal Hocko wrote: > > > On Wed 08-03-23 11:41:02, Mike Rapoport wrote: > > > > From: "Mike Rapoport (IBM)" <rppt@kernel.org> > > > > > > > > When set_memory or set_direct_map APIs used to change attribute or > > > > permissions for chunks of several pages, the large PMD that maps these > > > > pages in the direct map must be split. Fragmenting the direct map in such > > > > manner causes TLB pressure and, eventually, performance degradation. > > > > > > > > To avoid excessive direct map fragmentation, add ability to allocate > > > > "unmapped" pages with __GFP_UNMAPPED flag that will cause removal of the > > > > allocated pages from the direct map and use a cache of the unmapped pages. > > > > > > > > This cache is replenished with higher order pages with preference for > > > > PMD_SIZE pages when possible so that there will be fewer splits of large > > > > pages in the direct map. > > > > > > > > The cache is implemented as a buddy allocator, so it can serve high order > > > > allocations of unmapped pages. > > > > > > Why do we need a dedicated gfp flag for all this when a dedicated > > > allocator is used anyway. What prevents users to call unmapped_pages_{alloc,free}? > > > > Using unmapped_pages_{alloc,free} adds complexity to the users which IMO > > outweighs the cost of a dedicated gfp flag. > > Aren't those users rare and very special anyway? > > > For modules we'd have to make x86::module_{alloc,free}() take care of > > mapping and unmapping the allocated pages in the modules virtual address > > range. This also might become relevant for another architectures in future > > and than we'll have several complex module_alloc()s. > > The module_alloc use is lacking any justification. More context would be > more than useful. Also vmalloc support for the proposed __GFP_UNMAPPED > likely needs more explanation as well. Right now module_alloc() boils down to vmalloc() with the virtual range limited to the modules area. The allocated chunk contains both code and data. When CONFIG_STRICT_MODULE_RWX is set, parts of the memory allocated with module_alloc() remapped with different permissions both in vmalloc address space and in the direct map. The change of permissions for small ranges causes splits of large pages in the direct map. If we were to use unmapped_pages_alloc() in modules_alloc(), we would have to implement the part of vmalloc() that reserves the virtual addresses and maps the allocated memory there in module_alloc(). > > And for secretmem while using unmapped_pages_alloc() is easy, the free path > > becomes really complex because actual page freeing for fd-based memory is > > deeply buried in the page cache code. > > Why is that a problem? You already hook into the page freeing path and > special case unmapped memory. I didn't say there is a problem with unmapped_pages_alloc() in secretmem, I said there is a problem with unmapped_pages_free() and hence are the special case for unmapped memory in the freeing path. > > My gut feeling is that for PKS using a gfp flag would save a lot of hassle > > as well. > > Well, my take on this is that this is not a generic page allocator > functionality. It is clearly an allocator on top of the page allocator. > In general gfp flags are scarce and convenience argument usually fires > back later on in hard to predict ways. So I've learned to be careful > here. I am not saying this is a no-go but right now I do not see any > acutal advantage. The vmalloc usecase could be interesting in that > regards but it is not really clear to me whether this is a good idea in > the first place. I don't see the usage of a gfp flag as a convenience argument, but rather it feels for me that a gfp flag will cause less maintenance burden. Of course this is subjective. And although this is an allocator on top of the page allocator, it is still very tightly coupled with the core page allocator. I'm still think that using a migrate type for this would have been more elegant, but I realize that a migrate type would have more impact on the allocation path. > -- > Michal Hocko > SUSE Labs
On Tue 28-03-23 09:25:35, Mike Rapoport wrote: > On Mon, Mar 27, 2023 at 03:43:27PM +0200, Michal Hocko wrote: > > On Sat 25-03-23 09:38:12, Mike Rapoport wrote: > > > On Fri, Mar 24, 2023 at 09:37:31AM +0100, Michal Hocko wrote: > > > > On Wed 08-03-23 11:41:02, Mike Rapoport wrote: > > > > > From: "Mike Rapoport (IBM)" <rppt@kernel.org> > > > > > > > > > > When set_memory or set_direct_map APIs used to change attribute or > > > > > permissions for chunks of several pages, the large PMD that maps these > > > > > pages in the direct map must be split. Fragmenting the direct map in such > > > > > manner causes TLB pressure and, eventually, performance degradation. > > > > > > > > > > To avoid excessive direct map fragmentation, add ability to allocate > > > > > "unmapped" pages with __GFP_UNMAPPED flag that will cause removal of the > > > > > allocated pages from the direct map and use a cache of the unmapped pages. > > > > > > > > > > This cache is replenished with higher order pages with preference for > > > > > PMD_SIZE pages when possible so that there will be fewer splits of large > > > > > pages in the direct map. > > > > > > > > > > The cache is implemented as a buddy allocator, so it can serve high order > > > > > allocations of unmapped pages. > > > > > > > > Why do we need a dedicated gfp flag for all this when a dedicated > > > > allocator is used anyway. What prevents users to call unmapped_pages_{alloc,free}? > > > > > > Using unmapped_pages_{alloc,free} adds complexity to the users which IMO > > > outweighs the cost of a dedicated gfp flag. > > > > Aren't those users rare and very special anyway? > > > > > For modules we'd have to make x86::module_{alloc,free}() take care of > > > mapping and unmapping the allocated pages in the modules virtual address > > > range. This also might become relevant for another architectures in future > > > and than we'll have several complex module_alloc()s. > > > > The module_alloc use is lacking any justification. More context would be > > more than useful. Also vmalloc support for the proposed __GFP_UNMAPPED > > likely needs more explanation as well. > > Right now module_alloc() boils down to vmalloc() with the virtual range > limited to the modules area. The allocated chunk contains both code and > data. When CONFIG_STRICT_MODULE_RWX is set, parts of the memory allocated > with module_alloc() remapped with different permissions both in vmalloc > address space and in the direct map. The change of permissions for small > ranges causes splits of large pages in the direct map. OK, so you want to reduce that direct map fragmentation? Is that a real problem? My impression is that modules are mostly static thing. BPF might be a different thing though. I have a recollection that BPF guys were dealing with direct map fragmention as well. > If we were to use unmapped_pages_alloc() in modules_alloc(), we would have > to implement the part of vmalloc() that reserves the virtual addresses and > maps the allocated memory there in module_alloc(). Another option would be to provide an allocator for the backing pages to vmalloc. But I do agree that a gfp flag is a less laborous way to achieve the same. So the primary question really is whether we really need vmalloc support for unmapped memory.
On Tue, Mar 28, 2023 at 09:39:37AM +0200, Michal Hocko wrote: > On Tue 28-03-23 09:25:35, Mike Rapoport wrote: > > On Mon, Mar 27, 2023 at 03:43:27PM +0200, Michal Hocko wrote: > > > On Sat 25-03-23 09:38:12, Mike Rapoport wrote: > > > > On Fri, Mar 24, 2023 at 09:37:31AM +0100, Michal Hocko wrote: > > > > > On Wed 08-03-23 11:41:02, Mike Rapoport wrote: > > > > > > From: "Mike Rapoport (IBM)" <rppt@kernel.org> > > > > > > > > > > > > When set_memory or set_direct_map APIs used to change attribute or > > > > > > permissions for chunks of several pages, the large PMD that maps these > > > > > > pages in the direct map must be split. Fragmenting the direct map in such > > > > > > manner causes TLB pressure and, eventually, performance degradation. > > > > > > > > > > > > To avoid excessive direct map fragmentation, add ability to allocate > > > > > > "unmapped" pages with __GFP_UNMAPPED flag that will cause removal of the > > > > > > allocated pages from the direct map and use a cache of the unmapped pages. > > > > > > > > > > > > This cache is replenished with higher order pages with preference for > > > > > > PMD_SIZE pages when possible so that there will be fewer splits of large > > > > > > pages in the direct map. > > > > > > > > > > > > The cache is implemented as a buddy allocator, so it can serve high order > > > > > > allocations of unmapped pages. > > > > > > > > > > Why do we need a dedicated gfp flag for all this when a dedicated > > > > > allocator is used anyway. What prevents users to call unmapped_pages_{alloc,free}? > > > > > > > > Using unmapped_pages_{alloc,free} adds complexity to the users which IMO > > > > outweighs the cost of a dedicated gfp flag. > > > > > > Aren't those users rare and very special anyway? > > > > > > > For modules we'd have to make x86::module_{alloc,free}() take care of > > > > mapping and unmapping the allocated pages in the modules virtual address > > > > range. This also might become relevant for another architectures in future > > > > and than we'll have several complex module_alloc()s. > > > > > > The module_alloc use is lacking any justification. More context would be > > > more than useful. Also vmalloc support for the proposed __GFP_UNMAPPED > > > likely needs more explanation as well. > > > > Right now module_alloc() boils down to vmalloc() with the virtual range > > limited to the modules area. The allocated chunk contains both code and > > data. When CONFIG_STRICT_MODULE_RWX is set, parts of the memory allocated > > with module_alloc() remapped with different permissions both in vmalloc > > address space and in the direct map. The change of permissions for small > > ranges causes splits of large pages in the direct map. > > OK, so you want to reduce that direct map fragmentation? Yes. > Is that a real problem? A while ago Intel folks published report [1] that showed better performance with large pages in the direct map for majority of benchmarks. > My impression is that modules are mostly static thing. BPF > might be a different thing though. I have a recollection that BPF guys > were dealing with direct map fragmention as well. Modules are indeed static, but module_alloc() used by anything that allocates code pages, e.g. kprobes, ftrace and BPF. Besides, Thomas mentioned that having code in 2M pages reduces iTLB pressure [2], but that's not only about avoiding the splits in the direct map but also about using large mappings in the modules address space. BPF guys suggested an allocator for executable memory [3] mainly because they've seen performance improvement of 0.6% - 0.9% in their setups [4]. > > If we were to use unmapped_pages_alloc() in modules_alloc(), we would have > > to implement the part of vmalloc() that reserves the virtual addresses and > > maps the allocated memory there in module_alloc(). > > Another option would be to provide an allocator for the backing pages to > vmalloc. But I do agree that a gfp flag is a less laborous way to > achieve the same. So the primary question really is whether we really > need vmalloc support for unmapped memory. I'm not sure I follow here. module_alloc() is essentially an alias to vmalloc(), so to reduce direct map fragmentation caused by code allocations the most sensible way IMO is to support unmapped memory in vmalloc(). I also think vmalloc with unmmapped pages can provide backing pages for execmem_alloc() Song proposed. > -- > Michal Hocko > SUSE Labs [1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/ [2] https://lore.kernel.org/all/87mt86rbvy.ffs@tglx/ [3] https://lore.kernel.org/all/20221107223921.3451913-1-song@kernel.org/ [4] https://lore.kernel.org/bpf/20220707223546.4124919-1-song@kernel.org/
On Tue 28-03-23 18:11:20, Mike Rapoport wrote: > On Tue, Mar 28, 2023 at 09:39:37AM +0200, Michal Hocko wrote: [...] > > OK, so you want to reduce that direct map fragmentation? > > Yes. > > > Is that a real problem? > > A while ago Intel folks published report [1] that showed better performance > with large pages in the direct map for majority of benchmarks. > > > My impression is that modules are mostly static thing. BPF > > might be a different thing though. I have a recollection that BPF guys > > were dealing with direct map fragmention as well. > > Modules are indeed static, but module_alloc() used by anything that > allocates code pages, e.g. kprobes, ftrace and BPF. Besides, Thomas > mentioned that having code in 2M pages reduces iTLB pressure [2], but > that's not only about avoiding the splits in the direct map but also about > using large mappings in the modules address space. > > BPF guys suggested an allocator for executable memory [3] mainly because > they've seen performance improvement of 0.6% - 0.9% in their setups [4]. These are fair arguments and it would have been better to have them in the RFC. Also it is not really clear to me what is the actual benefit of the unmapping for those usecases. I do get they want to benefit from caching on the same permission setup but do they need unmapping as well? > > > If we were to use unmapped_pages_alloc() in modules_alloc(), we would have > > > to implement the part of vmalloc() that reserves the virtual addresses and > > > maps the allocated memory there in module_alloc(). > > > > Another option would be to provide an allocator for the backing pages to > > vmalloc. But I do agree that a gfp flag is a less laborous way to > > achieve the same. So the primary question really is whether we really > > need vmalloc support for unmapped memory. > > I'm not sure I follow here. module_alloc() is essentially an alias to > vmalloc(), so to reduce direct map fragmentation caused by code allocations > the most sensible way IMO is to support unmapped memory in vmalloc(). What I meant to say is that vmalloc is currently using the page allocator (resp bulk allocator) for the backing storage. I can imagine an extension to replace this default allocator by something else (e.g. a given allocation function). This would be more generic and it would allow different usecases (e.g. benefit from caching withtout doing the actual unmapping). > I also think vmalloc with unmmapped pages can provide backing pages for > execmem_alloc() Song proposed. Why would you need to have execmem_alloc have its memory virtually mapped into vmalloc space? > [1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/ > [2] https://lore.kernel.org/all/87mt86rbvy.ffs@tglx/ > [3] https://lore.kernel.org/all/20221107223921.3451913-1-song@kernel.org/ > [4] https://lore.kernel.org/bpf/20220707223546.4124919-1-song@kernel.org/ > > -- > Sincerely yours, > Mike.
Mike, please Cc linux-modules if you want modules folks' input as well ;) On Tue, Mar 28, 2023 at 8:11 AM Mike Rapoport <rppt@kernel.org> wrote: > On Tue, Mar 28, 2023 at 09:39:37AM +0200, Michal Hocko wrote: > > OK, so you want to reduce that direct map fragmentation? > > Yes. > > > Is that a real problem? > > A while ago Intel folks published report [1] that showed better performance > with large pages in the direct map for majority of benchmarks. > > > My impression is that modules are mostly static thing. BPF > > might be a different thing though. I have a recollection that BPF guys > > were dealing with direct map fragmention as well. > > Modules are indeed static, but module_alloc() used by anything that > allocates code pages, e.g. kprobes, ftrace and BPF. Besides, Thomas > mentioned that having code in 2M pages reduces iTLB pressure [2], but > that's not only about avoiding the splits in the direct map but also about > using large mappings in the modules address space. It is easily overlooked why such things create direct fragmentation -- it's because of the special permission stuff done, module_alloc() targets memory which can be executed somehow. > BPF guys suggested an allocator for executable memory [3] mainly because > they've seen performance improvement of 0.6% - 0.9% in their setups [4]. The performance metrics were completely opaque to some synthetic environment and our goal is to showcase real value with reproducible performance benchmarks. Since now Song is convinced that modules need to be a first class citizen in order to generalize a special allocator we may sooner rather than later real reproducible performance data to show the benefit of such a special allocator. One of the big differences with eBPF programs is that modules *can* be rather large in size. What is the average size of modules? Well let's take a look: mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./ -name \*.ko| wc -l 9173 mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./ -name \*.ko| xargs stat -c "%s - %n" | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}' 175.1 175 MiB is pretty large. Ignoring the top 5 piggies: mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./ -name \*.ko| xargs stat -c "%s - %n" | sort -n -k 1 -r |head -5 58315248 - ./drivers/gpu/drm/amd/amdgpu/amdgpu.ko 29605592 - ./drivers/gpu/drm/i915/i915.ko 18591256 - ./drivers/gpu/drm/nouveau/nouveau.ko 16867048 - ./fs/xfs/xfs.ko 14209440 - ./fs/btrfs/btrfs.ko mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./ -name \*.ko| xargs stat -c "%s - %n" | sort -n -k 1 -r | tail -$((9173-5)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}' 160.54 Ignoring the top 10 largest modules: mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./ -name \*.ko| xargs stat -c "%s - %n" | sort -n -k 1 -r | tail -$((9173-10)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}' 154.656 Ignoring the top 20 piggies: mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./ -name \*.ko| xargs stat -c "%s - %n" | sort -n -k 1 -r | tail -$((9173-20)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}' 146.384 Ignoring the top 100 bloated modules: mcgrof@bigtwin mirror/code/mcgrof/linux-next (git::master)$ find ./ -name \*.ko| xargs stat -c "%s - %n" | sort -n -k 1 -r | tail -$((9173-100)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}' 117.97 Ignoring the top 1000 bloated modules: mcgrof@bigtwin mirror/code/mcgrof/linux-next (git::master)$ find ./ -name \*.ko| xargs stat -c "%s - %n" | sort -n -k 1 -r | tail -$((9173-1000)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}' 64.4686 Ignoring the top 2000 bloated modules: mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./ -name \*.ko| xargs stat -c "%s - %n" | sort -n -k 1 -r | tail -$((9173-2000)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}' 48.7869 Ignoring top 3000 mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./ -name \*.ko| xargs stat -c "%s - %n" | sort -n -k 1 -r | tail -$((9173-3000)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}' 39.6037 Ignoring top 4000 mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./ -name \*.ko| xargs stat -c "%s - %n" | sort -n -k 1 -r | tail -$((9173-4000)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}' 33.106 Ignoring top 5000 mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./ -name \*.ko| xargs stat -c "%s - %n" | sort -n -k 1 -r | tail -$((9173-5000)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}' 28.0925 But at least on the driver front we know we won't always have loaded *all* GPU drivers, but *one*. So the math needs to consider what module sizes are for modules actually used. Let's see what the average module size is on on a big system: mcgrof@bigtwin ~ $ lsmod | grep -v Module| awk '{print $1}' | xargs sudo modinfo --field filename | xargs stat -c "%s - %n" | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}' 313.432 On a small desktop: mcgrof@desktop ~ $ lsmod | grep -v Module| awk '{print $1}' | xargs sudo modinfo --field filename | xargs stat -c "%s - %n" | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}' 292.786 For each finit_module we also do a vmalloc twice actually, one for the kernel_read_*() which lets us read the file from userspace, and then a second set of allocations where we copy data over, each step either using vzalloc() or module_alloc() (vmalloc() with special permissions), this is more easily reflected and visible now on linux-next with Song's new module_memory_alloc(). However -- for the context of *this* effort, we're only talking about the executable sections, the areas we'd use module_alloc() for. On a big system: mcgrof@bigtwin ~ $ lsmod | grep -v Module| awk '{print $1}' | xargs sudo modinfo --field filename | xargs size --radix=10 | awk '{print $1}'| grep -v text| awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}' 88.7964 On a small desktop: mcgrof@desktop ~ $ lsmod | grep -v Module| awk '{print $1}' | xargs sudo modinfo --field filename | xargs size --radix=10 | awk '{print $1}'| grep -v text| awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}' 92.1473 Regardless, the amount of memory allocated is pretty significant, to the point a 400 CPU system easily run out of vmap space (yes the kernel does some stupid stuff, which we're correcting over time): https://lkml.kernel.org/r/20221013180518.217405-1-david@redhat.com To help with this we have some recent efforts to help with this pressure on vmap space: https://lkml.kernel.org/r/20230311051712.4095040-1-mcgrof@kernel.org > > > If we were to use unmapped_pages_alloc() in modules_alloc(), we would have > > > to implement the part of vmalloc() that reserves the virtual addresses and > > > maps the allocated memory there in module_alloc(). > > > > Another option would be to provide an allocator for the backing pages to > > vmalloc. But I do agree that a gfp flag is a less laborous way to > > achieve the same. So the primary question really is whether we really > > need vmalloc support for unmapped memory. > > I'm not sure I follow here. module_alloc() is essentially an alias to > vmalloc(), so to reduce direct map fragmentation caused by code allocations > the most sensible way IMO is to support unmapped memory in vmalloc(). The big win also would be to use huge pages from a performance point of view, specially reducing iTLB pressure, however that is *slightly* orthogonal to your goal of reducing direct map fragmentation. However, we should not ignore that strategy as it is a special use case which *could* be leveraged in other ways too. And so I'd prefer to see that over just a flag to hide those allocations. It would not only reduce that fragmentation, but reduce iTLB pressure which I think *is* what experimentation revealed to show more gains. > I also think vmalloc with unmmapped pages can provide backing pages for > execmem_alloc() Song proposed. Only if you consider huge pages. I don't see that here. Luis
On Tue, Mar 28, 2023 at 10:18:50AM -0700, Luis Chamberlain wrote: > differences with eBPF programs is that modules *can* be rather large > in size. What is the average size of modules? Well let's take a look: > > mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./ > -name \*.ko| wc -l > 9173 ummm ... wc -c, surely? > mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./ > -name \*.ko| xargs stat -c "%s - %n" | sort -n -k 1 -r | tail > -$((9173-5)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}' > 160.54 ... which invalidates all of these.
On Tue, Mar 28, 2023 at 06:37:13PM +0100, Matthew Wilcox wrote: > On Tue, Mar 28, 2023 at 10:18:50AM -0700, Luis Chamberlain wrote: > > differences with eBPF programs is that modules *can* be rather large > > in size. What is the average size of modules? Well let's take a look: > > > > mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./ > > -name \*.ko| wc -l > > 9173 > > ummm ... wc -c, surely? That's the number of allmodconfig modules found. mcgrof@fulton ~/linux (git::sysctl-next)$ find ./ -name \*.ko| head -2 ./arch/x86/crypto/twofish-x86_64.ko ./arch/x86/crypto/serpent-avx2.ko mcgrof@fulton ~/linux (git::sysctl-next)$ find ./ -name \*.ko| head -2 | wc -l 2 mcgrof@fulton ~/linux (git::sysctl-next)$ find ./ -name \*.ko| head -2 | wc -c 70 wc -c would give a lot more. wc -l gives me the module count. > > mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./ > > -name \*.ko| xargs stat -c "%s - %n" | sort -n -k 1 -r | tail > > -$((9173-5)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}' > > 160.54 > > ... which invalidates all of these. Not sure ? But regardless the *.text* lookup is what we care for though which was later. Luis
On Tue, Mar 28, 2023 at 10:52:08AM -0700, Luis Chamberlain wrote: > On Tue, Mar 28, 2023 at 06:37:13PM +0100, Matthew Wilcox wrote: > > On Tue, Mar 28, 2023 at 10:18:50AM -0700, Luis Chamberlain wrote: > > > differences with eBPF programs is that modules *can* be rather large > > > in size. What is the average size of modules? Well let's take a look: > > > > > > mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./ > > > -name \*.ko| wc -l > > > 9173 > > > > ummm ... wc -c, surely? > > That's the number of allmodconfig modules found. > > mcgrof@fulton ~/linux (git::sysctl-next)$ find ./ -name \*.ko| head -2 > ./arch/x86/crypto/twofish-x86_64.ko > ./arch/x86/crypto/serpent-avx2.ko > mcgrof@fulton ~/linux (git::sysctl-next)$ find ./ -name \*.ko| head -2 | > wc -l > 2 > mcgrof@fulton ~/linux (git::sysctl-next)$ find ./ -name \*.ko| head -2 | > wc -c > 70 > > wc -c would give a lot more. wc -l gives me the module count. > > > > mcgrof@bigtwin /mirror/code/mcgrof/linux-next (git::master)$ find ./ > > > -name \*.ko| xargs stat -c "%s - %n" | sort -n -k 1 -r | tail > > > -$((9173-5)) | awk 'BEGIN {sum=0} {sum+=$1} END {print sum/NR/1024}' > > > 160.54 > > > > ... which invalidates all of these. > > Not sure ? But regardless the *.text* lookup is what we care for though > which was later. Which gets me thinking it'd be super nice if kmod tools supported querying this for us, then no fat finger could mess up the math: For all modules available: * Average module size * Average .text module size For only modules loaded: * Average module size * Average .text module size Luis
On Tue, Mar 28, 2023 at 05:24:49PM +0200, Michal Hocko wrote: > On Tue 28-03-23 18:11:20, Mike Rapoport wrote: > > On Tue, Mar 28, 2023 at 09:39:37AM +0200, Michal Hocko wrote: > [...] > > > OK, so you want to reduce that direct map fragmentation? > > > > Yes. > > > > > Is that a real problem? > > > > A while ago Intel folks published report [1] that showed better performance > > with large pages in the direct map for majority of benchmarks. > > > > > My impression is that modules are mostly static thing. BPF > > > might be a different thing though. I have a recollection that BPF guys > > > were dealing with direct map fragmention as well. > > > > Modules are indeed static, but module_alloc() used by anything that > > allocates code pages, e.g. kprobes, ftrace and BPF. Besides, Thomas > > mentioned that having code in 2M pages reduces iTLB pressure [2], but > > that's not only about avoiding the splits in the direct map but also about > > using large mappings in the modules address space. > > > > BPF guys suggested an allocator for executable memory [3] mainly because > > they've seen performance improvement of 0.6% - 0.9% in their setups [4]. > > These are fair arguments and it would have been better to have them in > the RFC. Also it is not really clear to me what is the actual benefit of > the unmapping for those usecases. I do get they want to benefit from > caching on the same permission setup but do they need unmapping as well? The pages allocated with module_alloc() get different permissions depending on whether they belong to text, rodata, data etc. The permissions are updated in both vmalloc address space and in the direct map. The updates to the direct map cause splits of the large pages. If we cache large pages as unmapped we take out the entire 2M page from the direct map and then if/when it becomes free it can be returned to the direct map as a 2M page. Generally, the unmapped allocations are intended for use-cases that anyway map the memory elsewhere than direct map and need to modify direct mappings of the memory, be it modules_alloc(), secretmem, PKS page tables or maybe even some of the encrypted VM memory. > > > > If we were to use unmapped_pages_alloc() in modules_alloc(), we would have > > > > to implement the part of vmalloc() that reserves the virtual addresses and > > > > maps the allocated memory there in module_alloc(). > > > > > > Another option would be to provide an allocator for the backing pages to > > > vmalloc. But I do agree that a gfp flag is a less laborous way to > > > achieve the same. So the primary question really is whether we really > > > need vmalloc support for unmapped memory. > > > > I'm not sure I follow here. module_alloc() is essentially an alias to > > vmalloc(), so to reduce direct map fragmentation caused by code allocations > > the most sensible way IMO is to support unmapped memory in vmalloc(). > > What I meant to say is that vmalloc is currently using the page > allocator (resp bulk allocator) for the backing storage. I can imagine > an extension to replace this default allocator by something else (e.g. a > given allocation function). This would be more generic and it would > allow different usecases (e.g. benefit from caching withtout doing the > actual unmapping). The whole point of unmapped cache is to allow non-default permissions in the direct map without splitting large pages there. So the cache that unmaps large pages in the direct map and then handles out subpages of that page seems like the most logical thing to do. With the current use cases the callers anyway map these pages at different virtual addresses, i.e. page cache or vmalloc. > > I also think vmalloc with unmmapped pages can provide backing pages for > > execmem_alloc() Song proposed. > > Why would you need to have execmem_alloc have its memory virtually > mapped into vmalloc space? Currently all the memory allocated for code is managed in a subset of vmalloc() space. The intention of execmem_alloc() was to replace module_alloc() for the code pages, so it's natural that it will use the same virtual ranges. But anyway, execmem_alloc() is a long shot as it also requires quite a refactoring of modules loading. > > [1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/ > > [2] https://lore.kernel.org/all/87mt86rbvy.ffs@tglx/ > > [3] https://lore.kernel.org/all/20221107223921.3451913-1-song@kernel.org/ > > [4] https://lore.kernel.org/bpf/20220707223546.4124919-1-song@kernel.org/ > > > > -- > > Sincerely yours, > > Mike. > > -- > Michal Hocko > SUSE Labs
On Wed 29-03-23 10:28:02, Mike Rapoport wrote: > On Tue, Mar 28, 2023 at 05:24:49PM +0200, Michal Hocko wrote: > > On Tue 28-03-23 18:11:20, Mike Rapoport wrote: > > > On Tue, Mar 28, 2023 at 09:39:37AM +0200, Michal Hocko wrote: > > [...] > > > > OK, so you want to reduce that direct map fragmentation? > > > > > > Yes. > > > > > > > Is that a real problem? > > > > > > A while ago Intel folks published report [1] that showed better performance > > > with large pages in the direct map for majority of benchmarks. > > > > > > > My impression is that modules are mostly static thing. BPF > > > > might be a different thing though. I have a recollection that BPF guys > > > > were dealing with direct map fragmention as well. > > > > > > Modules are indeed static, but module_alloc() used by anything that > > > allocates code pages, e.g. kprobes, ftrace and BPF. Besides, Thomas > > > mentioned that having code in 2M pages reduces iTLB pressure [2], but > > > that's not only about avoiding the splits in the direct map but also about > > > using large mappings in the modules address space. > > > > > > BPF guys suggested an allocator for executable memory [3] mainly because > > > they've seen performance improvement of 0.6% - 0.9% in their setups [4]. > > > > These are fair arguments and it would have been better to have them in > > the RFC. Also it is not really clear to me what is the actual benefit of > > the unmapping for those usecases. I do get they want to benefit from > > caching on the same permission setup but do they need unmapping as well? > > The pages allocated with module_alloc() get different permissions depending > on whether they belong to text, rodata, data etc. The permissions are > updated in both vmalloc address space and in the direct map. The updates to > the direct map cause splits of the large pages. That much is clear (wouldn't hurt to mention that in the changelog though). > If we cache large pages as > unmapped we take out the entire 2M page from the direct map and then > if/when it becomes free it can be returned to the direct map as a 2M page. > > Generally, the unmapped allocations are intended for use-cases that anyway > map the memory elsewhere than direct map and need to modify direct mappings > of the memory, be it modules_alloc(), secretmem, PKS page tables or maybe > even some of the encrypted VM memory. I believe we are still not on the same page here. I do understand that you want to re-use the caching capability of the unmapped_pages_alloc for modules allocations as well. My question is whether module_alloc really benefits from the fact that the memory is unmapped? I guess you want to say that it does because it wouldn't have to change the permission for the direct map but I do not see that anywhere in the patch... Also consinder the slow path where module_alloc needs to allocate a fresh (huge)page and unmap it. Does the overhead of the unmapping suits needs of all module_alloc users? Module loader is likely not interesting as it tends to be rather static but what about BPF programs? [...] > > > I also think vmalloc with unmmapped pages can provide backing pages for > > > execmem_alloc() Song proposed. > > > > Why would you need to have execmem_alloc have its memory virtually > > mapped into vmalloc space? > > Currently all the memory allocated for code is managed in a subset of > vmalloc() space. The intention of execmem_alloc() was to replace > module_alloc() for the code pages, so it's natural that it will use the > same virtual ranges. Ohh, sorry my bad. I have only now realized I have misread and thought about secretmem_alloc...
On Wed, Mar 29, 2023 at 10:13:23AM +0200, Michal Hocko wrote: > On Wed 29-03-23 10:28:02, Mike Rapoport wrote: > > On Tue, Mar 28, 2023 at 05:24:49PM +0200, Michal Hocko wrote: > > > On Tue 28-03-23 18:11:20, Mike Rapoport wrote: > > > > On Tue, Mar 28, 2023 at 09:39:37AM +0200, Michal Hocko wrote: > > > [...] > > > > > OK, so you want to reduce that direct map fragmentation? > > > > > > > > Yes. > > > > > > > > > Is that a real problem? > > > > > > > > A while ago Intel folks published report [1] that showed better performance > > > > with large pages in the direct map for majority of benchmarks. > > > > > > > > > My impression is that modules are mostly static thing. BPF > > > > > might be a different thing though. I have a recollection that BPF guys > > > > > were dealing with direct map fragmention as well. > > > > > > > > Modules are indeed static, but module_alloc() used by anything that > > > > allocates code pages, e.g. kprobes, ftrace and BPF. Besides, Thomas > > > > mentioned that having code in 2M pages reduces iTLB pressure [2], but > > > > that's not only about avoiding the splits in the direct map but also about > > > > using large mappings in the modules address space. > > > > > > > > BPF guys suggested an allocator for executable memory [3] mainly because > > > > they've seen performance improvement of 0.6% - 0.9% in their setups [4]. > > > > > > These are fair arguments and it would have been better to have them in > > > the RFC. Also it is not really clear to me what is the actual benefit of > > > the unmapping for those usecases. I do get they want to benefit from > > > caching on the same permission setup but do they need unmapping as well? > > > > The pages allocated with module_alloc() get different permissions depending > > on whether they belong to text, rodata, data etc. The permissions are > > updated in both vmalloc address space and in the direct map. The updates to > > the direct map cause splits of the large pages. > > That much is clear (wouldn't hurt to mention that in the changelog > though). > > > If we cache large pages as > > unmapped we take out the entire 2M page from the direct map and then > > if/when it becomes free it can be returned to the direct map as a 2M page. > > > > Generally, the unmapped allocations are intended for use-cases that anyway > > map the memory elsewhere than direct map and need to modify direct mappings > > of the memory, be it modules_alloc(), secretmem, PKS page tables or maybe > > even some of the encrypted VM memory. > > I believe we are still not on the same page here. I do understand that > you want to re-use the caching capability of the unmapped_pages_alloc > for modules allocations as well. My question is whether module_alloc > really benefits from the fact that the memory is unmapped? > > I guess you want to say that it does because it wouldn't have to change > the permission for the direct map but I do not see that anywhere in the > patch... This happens automagically outside the patch :) Currently, to change memory permissions modules code calls set_memory APIs and passes vmalloced address to them. set_memory functions lookup the direct map alias and update the permissions there as well. If the memory allocated with module_alloc() is unmapped in the direct map, there won't be an alias for set_memory APIs to update. > Also consinder the slow path where module_alloc needs to > allocate a fresh (huge)page and unmap it. Does the overhead of the > unmapping suits needs of all module_alloc users? Module loader is likely > not interesting as it tends to be rather static but what about BPF > programs? The overhead of unmapping pages in the direct map on allocation path will be offset by reduced overhead of updating permissions in the direct map after the allocation. Both are using the same APIs and if today the permission update causes a split of a large page, unmapping of a large page won't. Of course in a loaded system unmapped_alloc() won't be able to always allocated large pages to replenish the cache, but still there will be fewer updates to the direct map. > -- > Michal Hocko > SUSE Labs
On Thu 30-03-23 08:13:48, Mike Rapoport wrote: > On Wed, Mar 29, 2023 at 10:13:23AM +0200, Michal Hocko wrote: > > On Wed 29-03-23 10:28:02, Mike Rapoport wrote: > > > On Tue, Mar 28, 2023 at 05:24:49PM +0200, Michal Hocko wrote: > > > > On Tue 28-03-23 18:11:20, Mike Rapoport wrote: > > > > > On Tue, Mar 28, 2023 at 09:39:37AM +0200, Michal Hocko wrote: > > > > [...] > > > > > > OK, so you want to reduce that direct map fragmentation? > > > > > > > > > > Yes. > > > > > > > > > > > Is that a real problem? > > > > > > > > > > A while ago Intel folks published report [1] that showed better performance > > > > > with large pages in the direct map for majority of benchmarks. > > > > > > > > > > > My impression is that modules are mostly static thing. BPF > > > > > > might be a different thing though. I have a recollection that BPF guys > > > > > > were dealing with direct map fragmention as well. > > > > > > > > > > Modules are indeed static, but module_alloc() used by anything that > > > > > allocates code pages, e.g. kprobes, ftrace and BPF. Besides, Thomas > > > > > mentioned that having code in 2M pages reduces iTLB pressure [2], but > > > > > that's not only about avoiding the splits in the direct map but also about > > > > > using large mappings in the modules address space. > > > > > > > > > > BPF guys suggested an allocator for executable memory [3] mainly because > > > > > they've seen performance improvement of 0.6% - 0.9% in their setups [4]. > > > > > > > > These are fair arguments and it would have been better to have them in > > > > the RFC. Also it is not really clear to me what is the actual benefit of > > > > the unmapping for those usecases. I do get they want to benefit from > > > > caching on the same permission setup but do they need unmapping as well? > > > > > > The pages allocated with module_alloc() get different permissions depending > > > on whether they belong to text, rodata, data etc. The permissions are > > > updated in both vmalloc address space and in the direct map. The updates to > > > the direct map cause splits of the large pages. > > > > That much is clear (wouldn't hurt to mention that in the changelog > > though). > > > > > If we cache large pages as > > > unmapped we take out the entire 2M page from the direct map and then > > > if/when it becomes free it can be returned to the direct map as a 2M page. > > > > > > Generally, the unmapped allocations are intended for use-cases that anyway > > > map the memory elsewhere than direct map and need to modify direct mappings > > > of the memory, be it modules_alloc(), secretmem, PKS page tables or maybe > > > even some of the encrypted VM memory. > > > > I believe we are still not on the same page here. I do understand that > > you want to re-use the caching capability of the unmapped_pages_alloc > > for modules allocations as well. My question is whether module_alloc > > really benefits from the fact that the memory is unmapped? > > > > I guess you want to say that it does because it wouldn't have to change > > the permission for the direct map but I do not see that anywhere in the > > patch... > > This happens automagically outside the patch :) > > Currently, to change memory permissions modules code calls set_memory APIs > and passes vmalloced address to them. set_memory functions lookup the > direct map alias and update the permissions there as well. > If the memory allocated with module_alloc() is unmapped in the direct map, > there won't be an alias for set_memory APIs to update. > > > Also consinder the slow path where module_alloc needs to > > allocate a fresh (huge)page and unmap it. Does the overhead of the > > unmapping suits needs of all module_alloc users? Module loader is likely > > not interesting as it tends to be rather static but what about BPF > > programs? > > The overhead of unmapping pages in the direct map on allocation path will > be offset by reduced overhead of updating permissions in the direct map > after the allocation. Both are using the same APIs and if today the > permission update causes a split of a large page, unmapping of a large page > won't. > > Of course in a loaded system unmapped_alloc() won't be able to always > allocated large pages to replenish the cache, but still there will be fewer > updates to the direct map. Ok, all of that is a changelog material. I would recommend to go this way. Start by the simple unmapped_pages_alloc interface and use it for the secret memory. There shouldn't be anything controversial there. In a follow up patch add a support for the vmalloc which would add a new gfp flag with a justification that this is the simplest way to support module_alloc usecase and do not feel shy of providing as much context as you can. Ideally with some numbers for the best/worst/avg cases. Thanks
On Wed, Mar 08, 2023 at 11:41:02AM +0200, Mike Rapoport wrote: > From: "Mike Rapoport (IBM)" <rppt@kernel.org> > > When set_memory or set_direct_map APIs used to change attribute or > permissions for chunks of several pages, the large PMD that maps these > pages in the direct map must be split. Fragmenting the direct map in such > manner causes TLB pressure and, eventually, performance degradation. > > To avoid excessive direct map fragmentation, add ability to allocate > "unmapped" pages with __GFP_UNMAPPED flag that will cause removal of the > allocated pages from the direct map and use a cache of the unmapped pages. > > This cache is replenished with higher order pages with preference for > PMD_SIZE pages when possible so that there will be fewer splits of large > pages in the direct map. > > The cache is implemented as a buddy allocator, so it can serve high order > allocations of unmapped pages. So I'm late to this discussion, I stumbled in because of my own run in with executable memory allocation. I understand that post LSF this patchset seems to not be going anywhere, but OTOH there's also been a desire for better executable memory allocation; as noted by tglx and elsewhere, there _is_ a definite performance impact on page size with kernel text - I've seen numbers in the multiple single digit percentage range in the past. This patchset does seem to me to be roughly the right approach for that, and coupled with the slab allocator for sub-page sized allocations it seems there's the potential for getting a nice interface that spans the full range of allocation sizes, from small bpf/trampoline allocations up to modules. Is this patchset worth reviving/continuing with? Was it really just the needed module refactoring that was the blocker?
On Wed, May 17, 2023 at 11:35:56PM -0400, Kent Overstreet wrote: > On Wed, Mar 08, 2023 at 11:41:02AM +0200, Mike Rapoport wrote: > > From: "Mike Rapoport (IBM)" <rppt@kernel.org> > > > > When set_memory or set_direct_map APIs used to change attribute or > > permissions for chunks of several pages, the large PMD that maps these > > pages in the direct map must be split. Fragmenting the direct map in such > > manner causes TLB pressure and, eventually, performance degradation. > > > > To avoid excessive direct map fragmentation, add ability to allocate > > "unmapped" pages with __GFP_UNMAPPED flag that will cause removal of the > > allocated pages from the direct map and use a cache of the unmapped pages. > > > > This cache is replenished with higher order pages with preference for > > PMD_SIZE pages when possible so that there will be fewer splits of large > > pages in the direct map. > > > > The cache is implemented as a buddy allocator, so it can serve high order > > allocations of unmapped pages. > > So I'm late to this discussion, I stumbled in because of my own run in > with executable memory allocation. > > I understand that post LSF this patchset seems to not be going anywhere, > but OTOH there's also been a desire for better executable memory > allocation; as noted by tglx and elsewhere, there _is_ a definite > performance impact on page size with kernel text - I've seen numbers in > the multiple single digit percentage range in the past. > > This patchset does seem to me to be roughly the right approach for that, > and coupled with the slab allocator for sub-page sized allocations it > seems there's the potential for getting a nice interface that spans the > full range of allocation sizes, from small bpf/trampoline allocations up > to modules. > > Is this patchset worth reviving/continuing with? Was it really just the > needed module refactoring that was the blocker? As I see it, this patchset only one building block out of three? four? If we are to repurpose it for code allocations it should be something like 1) allocate 2M page to fill the cache 2) remove this page from the direct map 3) map the 2M page ROX in module address space (usually some part of vmalloc address space) 4) allocate a smaller chunk of that page to the actual caller (bpf, modules, whatever) Right now (3) and (4) won't work for modules because they mix code and data in a single allocation. So module refactoring is a blocker and another blocker is to teach vmalloc to map the areas for the executable memory with 2M pages and probably something else. I remember there was an attempt to add VM_ALLOW_HUGE_VMAP to x86::module_alloc(), but it caused problems and was reverted. Sorry, I could not find the lore link. There was a related discussion here: https://lore.kernel.org/all/14D6DBA0-0572-44FB-A566-464B1FF541E0@fb.com/
On Thu, May 18, 2023 at 8:24 AM Mike Rapoport <rppt@kernel.org> wrote: > > On Wed, May 17, 2023 at 11:35:56PM -0400, Kent Overstreet wrote: > > On Wed, Mar 08, 2023 at 11:41:02AM +0200, Mike Rapoport wrote: > > > From: "Mike Rapoport (IBM)" <rppt@kernel.org> > > > > > > When set_memory or set_direct_map APIs used to change attribute or > > > permissions for chunks of several pages, the large PMD that maps these > > > pages in the direct map must be split. Fragmenting the direct map in such > > > manner causes TLB pressure and, eventually, performance degradation. > > > > > > To avoid excessive direct map fragmentation, add ability to allocate > > > "unmapped" pages with __GFP_UNMAPPED flag that will cause removal of the > > > allocated pages from the direct map and use a cache of the unmapped pages. > > > > > > This cache is replenished with higher order pages with preference for > > > PMD_SIZE pages when possible so that there will be fewer splits of large > > > pages in the direct map. > > > > > > The cache is implemented as a buddy allocator, so it can serve high order > > > allocations of unmapped pages. > > > > So I'm late to this discussion, I stumbled in because of my own run in > > with executable memory allocation. > > > > I understand that post LSF this patchset seems to not be going anywhere, > > but OTOH there's also been a desire for better executable memory > > allocation; as noted by tglx and elsewhere, there _is_ a definite > > performance impact on page size with kernel text - I've seen numbers in > > the multiple single digit percentage range in the past. > > > > This patchset does seem to me to be roughly the right approach for that, > > and coupled with the slab allocator for sub-page sized allocations it > > seems there's the potential for getting a nice interface that spans the > > full range of allocation sizes, from small bpf/trampoline allocations up > > to modules. > > > > Is this patchset worth reviving/continuing with? Was it really just the > > needed module refactoring that was the blocker? > > As I see it, this patchset only one building block out of three? four? > If we are to repurpose it for code allocations it should be something like > > 1) allocate 2M page to fill the cache > 2) remove this page from the direct map > 3) map the 2M page ROX in module address space (usually some part of > vmalloc address space) > 4) allocate a smaller chunk of that page to the actual caller (bpf, > modules, whatever) > > Right now (3) and (4) won't work for modules because they mix code and data > in a single allocation. I am working on patches based on the discussion in [1]. I am planning to send v1 for review in a week or so. Thanks, Song [1] https://lore.kernel.org/linux-mm/20221107223921.3451913-1-song@kernel.org/ [...]
On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote: > I am working on patches based on the discussion in [1]. I am planning to > send v1 for review in a week or so. Hey Song, I was reviewing that thread too, Are you taking a different approach based on Thomas's feedback? I think he had some fair points in that thread. My own feeling is that the buddy allocator is our tool for allocating larger variable sized physically contiguous allocations, so I'd like to see something based on that - I think we could do a hybrid buddy/slab allocator approach, like we have for regular memory allocations. I started on a slab allocator for executable memory allocations the other day (very minimal, but tested it for bcachefs and it works). But I'd love to hear more about your current approach as well. Cheers, Kent
On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote: > I am working on patches based on the discussion in [1]. I am planning to > send v1 for review in a week or so. For reference, here's my own (early, but functioning :) slab allocator: Look forward to comparing! -->-- From 6eeb6b8ef4271ea1a8d9cac7fbaeeb7704951976 Mon Sep 17 00:00:00 2001 From: Kent Overstreet <kent.overstreet@linux.dev> Date: Wed, 17 May 2023 01:22:06 -0400 Subject: [PATCH] mm: jit/text allocator This provides a new, very simple slab allocator for jit/text, i.e. bpf, ftrace trampolines, or bcachefs unpack functions. With this API we can avoid ever mapping pages both writeable and executable (not implemented in this patch: need to tweak module_alloc()), and it also supports sub-page sized allocations. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> diff --git a/include/linux/jitalloc.h b/include/linux/jitalloc.h new file mode 100644 index 0000000000..f1549d60e8 --- /dev/null +++ b/include/linux/jitalloc.h @@ -0,0 +1,9 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_JITALLOC_H +#define _LINUX_JITALLOC_H + +void jit_update(void *buf, void *new_buf, size_t len); +void jit_free(void *buf); +void *jit_alloc(void *buf, size_t len); + +#endif /* _LINUX_JITALLOC_H */ diff --git a/mm/Kconfig b/mm/Kconfig index 4751031f3f..ff26a4f0c9 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1202,6 +1202,9 @@ config LRU_GEN_STATS This option has a per-memcg and per-node memory overhead. # } +config JITALLOC + bool + source "mm/damon/Kconfig" endmenu diff --git a/mm/Makefile b/mm/Makefile index c03e1e5859..25e82db9e8 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -138,3 +138,4 @@ obj-$(CONFIG_IO_MAPPING) += io-mapping.o obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o +obj-$(CONFIG_JITALLOC) += jitalloc.o diff --git a/mm/jitalloc.c b/mm/jitalloc.c new file mode 100644 index 0000000000..7c4d621802 --- /dev/null +++ b/mm/jitalloc.c @@ -0,0 +1,187 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include <linux/gfp.h> +#include <linux/highmem.h> +#include <linux/jitalloc.h> +#include <linux/mm.h> +#include <linux/moduleloader.h> +#include <linux/mutex.h> +#include <linux/set_memory.h> +#include <linux/vmalloc.h> + +#include <asm/text-patching.h> + +static DEFINE_MUTEX(jit_alloc_lock); + +struct jit_cache { + unsigned obj_size_bits; + unsigned objs_per_slab; + struct list_head partial; +}; + +#define JITALLOC_MIN_SIZE 16 +#define NR_JIT_CACHES ilog2(PAGE_SIZE / JITALLOC_MIN_SIZE) + +static struct jit_cache jit_caches[NR_JIT_CACHES]; + +struct jit_slab { + unsigned long __page_flags; + + struct jit_cache *cache; + void *executably_mapped;; + unsigned long *objs_allocated; /* bitmap of free objects */ + struct list_head list; +}; + +#define folio_jit_slab(folio) (_Generic((folio), \ + const struct folio *: (const struct jit_slab *)(folio), \ + struct folio *: (struct jit_slab *)(folio))) + +#define jit_slab_folio(s) (_Generic((s), \ + const struct jit_slab *: (const struct folio *)s, \ + struct jit_slab *: (struct folio *)s)) + +static struct jit_slab *jit_slab_alloc(struct jit_cache *cache) +{ + void *executably_mapped = module_alloc(PAGE_SIZE); + struct page *page; + struct folio *folio; + struct jit_slab *slab; + unsigned long *objs_allocated; + + if (!executably_mapped) + return NULL; + + objs_allocated = kcalloc(BITS_TO_LONGS(cache->objs_per_slab), sizeof(unsigned long), GFP_KERNEL); + if (!objs_allocated ) { + vfree(executably_mapped); + return NULL; + } + + set_vm_flush_reset_perms(executably_mapped); + set_memory_rox((unsigned long) executably_mapped, 1); + + page = vmalloc_to_page(executably_mapped); + folio = page_folio(page); + + __folio_set_slab(folio); + slab = folio_jit_slab(folio); + slab->cache = cache; + slab->executably_mapped = executably_mapped; + slab->objs_allocated = objs_allocated; + INIT_LIST_HEAD(&slab->list); + + return slab; +} + +static void *jit_cache_alloc(void *buf, size_t len, struct jit_cache *cache) +{ + struct jit_slab *s = + list_first_entry_or_null(&cache->partial, struct jit_slab, list) ?: + jit_slab_alloc(cache); + unsigned obj_idx, nr_allocated; + + if (!s) + return NULL; + + obj_idx = find_first_zero_bit(s->objs_allocated, cache->objs_per_slab); + + BUG_ON(obj_idx >= cache->objs_per_slab); + __set_bit(obj_idx, s->objs_allocated); + + nr_allocated = bitmap_weight(s->objs_allocated, s->cache->objs_per_slab); + + if (nr_allocated == s->cache->objs_per_slab) { + list_del_init(&s->list); + } else if (nr_allocated == 1) { + list_del(&s->list); + list_add(&s->list, &s->cache->partial); + } + + return s->executably_mapped + (obj_idx << cache->obj_size_bits); +} + +void jit_update(void *buf, void *new_buf, size_t len) +{ + text_poke_copy(buf, new_buf, len); +} +EXPORT_SYMBOL_GPL(jit_update); + +void jit_free(void *buf) +{ + struct page *page; + struct folio *folio; + struct jit_slab *s; + unsigned obj_idx, nr_allocated; + size_t offset; + + if (!buf) + return; + + page = vmalloc_to_page(buf); + folio = page_folio(page); + offset = offset_in_folio(folio, buf); + + if (!folio_test_slab(folio)) { + vfree(buf); + return; + } + + s = folio_jit_slab(folio); + + mutex_lock(&jit_alloc_lock); + obj_idx = offset >> s->cache->obj_size_bits; + + __clear_bit(obj_idx, s->objs_allocated); + + nr_allocated = bitmap_weight(s->objs_allocated, s->cache->objs_per_slab); + + if (nr_allocated == 0) { + list_del(&s->list); + kfree(s->objs_allocated); + folio_put(folio); + } else if (nr_allocated + 1 == s->cache->objs_per_slab) { + list_del(&s->list); + list_add(&s->list, &s->cache->partial); + } + + mutex_unlock(&jit_alloc_lock); +} +EXPORT_SYMBOL_GPL(jit_free); + +void *jit_alloc(void *buf, size_t len) +{ + unsigned jit_cache_idx = ilog2(roundup_pow_of_two(len) / 16); + void *p; + + if (jit_cache_idx < NR_JIT_CACHES) { + mutex_lock(&jit_alloc_lock); + p = jit_cache_alloc(buf, len, &jit_caches[jit_cache_idx]); + mutex_unlock(&jit_alloc_lock); + } else { + p = module_alloc(len); + if (p) { + set_vm_flush_reset_perms(p); + set_memory_rox((unsigned long) p, DIV_ROUND_UP(len, PAGE_SIZE)); + } + } + + if (p && buf) + jit_update(p, buf, len); + + return p; +} +EXPORT_SYMBOL_GPL(jit_alloc); + +static int __init jit_alloc_init(void) +{ + for (unsigned i = 0; i < ARRAY_SIZE(jit_caches); i++) { + jit_caches[i].obj_size_bits = ilog2(JITALLOC_MIN_SIZE) + i; + jit_caches[i].objs_per_slab = PAGE_SIZE >> jit_caches[i].obj_size_bits; + + INIT_LIST_HEAD(&jit_caches[i].partial); + } + + return 0; +} +core_initcall(jit_alloc_init);
On Thu, May 18, 2023 at 9:48 AM Kent Overstreet <kent.overstreet@linux.dev> wrote: > > On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote: > > I am working on patches based on the discussion in [1]. I am planning to > > send v1 for review in a week or so. > > Hey Song, I was reviewing that thread too, > > Are you taking a different approach based on Thomas's feedback? I think > he had some fair points in that thread. Yes, the API is based on Thomas's suggestion, like 90% from the discussions. > > My own feeling is that the buddy allocator is our tool for allocating > larger variable sized physically contiguous allocations, so I'd like to > see something based on that - I think we could do a hybrid buddy/slab > allocator approach, like we have for regular memory allocations. I am planning to implement the allocator based on this (reuse vmap_area logic): https://lore.kernel.org/linux-mm/20221107223921.3451913-2-song@kernel.org/ Thanks, Song > > I started on a slab allocator for executable memory allocations the > other day (very minimal, but tested it for bcachefs and it works). > > But I'd love to hear more about your current approach as well. > > Cheers, > Kent
On Thu, May 18, 2023 at 9:58 AM Kent Overstreet <kent.overstreet@linux.dev> wrote: > > On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote: > > I am working on patches based on the discussion in [1]. I am planning to > > send v1 for review in a week or so. > > For reference, here's my own (early, but functioning :) slab allocator: > > Look forward to comparing! This looks similar to the bpf_prog_pack allocator we use for BPF at the moment. (search for bpf_prog_pack in kernel/bpf/core.c). I guess we can also use bpf_prog_pack for bcachefs for now. Thanks, Song > -->-- > From 6eeb6b8ef4271ea1a8d9cac7fbaeeb7704951976 Mon Sep 17 00:00:00 2001 > From: Kent Overstreet <kent.overstreet@linux.dev> > Date: Wed, 17 May 2023 01:22:06 -0400 > Subject: [PATCH] mm: jit/text allocator > > This provides a new, very simple slab allocator for jit/text, i.e. bpf, > ftrace trampolines, or bcachefs unpack functions. > > With this API we can avoid ever mapping pages both writeable and > executable (not implemented in this patch: need to tweak > module_alloc()), and it also supports sub-page sized allocations. > > Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> > > diff --git a/include/linux/jitalloc.h b/include/linux/jitalloc.h > new file mode 100644 > index 0000000000..f1549d60e8 > --- /dev/null > +++ b/include/linux/jitalloc.h > @@ -0,0 +1,9 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +#ifndef _LINUX_JITALLOC_H > +#define _LINUX_JITALLOC_H > + > +void jit_update(void *buf, void *new_buf, size_t len); > +void jit_free(void *buf); > +void *jit_alloc(void *buf, size_t len); > + > +#endif /* _LINUX_JITALLOC_H */ > diff --git a/mm/Kconfig b/mm/Kconfig > index 4751031f3f..ff26a4f0c9 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -1202,6 +1202,9 @@ config LRU_GEN_STATS > This option has a per-memcg and per-node memory overhead. > # } > > +config JITALLOC > + bool > + > source "mm/damon/Kconfig" > > endmenu > diff --git a/mm/Makefile b/mm/Makefile > index c03e1e5859..25e82db9e8 100644 > --- a/mm/Makefile > +++ b/mm/Makefile > @@ -138,3 +138,4 @@ obj-$(CONFIG_IO_MAPPING) += io-mapping.o > obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o > obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o > obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o > +obj-$(CONFIG_JITALLOC) += jitalloc.o > diff --git a/mm/jitalloc.c b/mm/jitalloc.c > new file mode 100644 > index 0000000000..7c4d621802 > --- /dev/null > +++ b/mm/jitalloc.c > @@ -0,0 +1,187 @@ > +// SPDX-License-Identifier: GPL-2.0 > + > +#include <linux/gfp.h> > +#include <linux/highmem.h> > +#include <linux/jitalloc.h> > +#include <linux/mm.h> > +#include <linux/moduleloader.h> > +#include <linux/mutex.h> > +#include <linux/set_memory.h> > +#include <linux/vmalloc.h> > + > +#include <asm/text-patching.h> > + > +static DEFINE_MUTEX(jit_alloc_lock); > + > +struct jit_cache { > + unsigned obj_size_bits; > + unsigned objs_per_slab; > + struct list_head partial; > +}; > + > +#define JITALLOC_MIN_SIZE 16 > +#define NR_JIT_CACHES ilog2(PAGE_SIZE / JITALLOC_MIN_SIZE) > + > +static struct jit_cache jit_caches[NR_JIT_CACHES]; > + > +struct jit_slab { > + unsigned long __page_flags; > + > + struct jit_cache *cache; > + void *executably_mapped;; > + unsigned long *objs_allocated; /* bitmap of free objects */ > + struct list_head list; > +}; > + > +#define folio_jit_slab(folio) (_Generic((folio), \ > + const struct folio *: (const struct jit_slab *)(folio), \ > + struct folio *: (struct jit_slab *)(folio))) > + > +#define jit_slab_folio(s) (_Generic((s), \ > + const struct jit_slab *: (const struct folio *)s, \ > + struct jit_slab *: (struct folio *)s)) > + > +static struct jit_slab *jit_slab_alloc(struct jit_cache *cache) > +{ > + void *executably_mapped = module_alloc(PAGE_SIZE); > + struct page *page; > + struct folio *folio; > + struct jit_slab *slab; > + unsigned long *objs_allocated; > + > + if (!executably_mapped) > + return NULL; > + > + objs_allocated = kcalloc(BITS_TO_LONGS(cache->objs_per_slab), sizeof(unsigned long), GFP_KERNEL); > + if (!objs_allocated ) { > + vfree(executably_mapped); > + return NULL; > + } > + > + set_vm_flush_reset_perms(executably_mapped); > + set_memory_rox((unsigned long) executably_mapped, 1); > + > + page = vmalloc_to_page(executably_mapped); > + folio = page_folio(page); > + > + __folio_set_slab(folio); > + slab = folio_jit_slab(folio); > + slab->cache = cache; > + slab->executably_mapped = executably_mapped; > + slab->objs_allocated = objs_allocated; > + INIT_LIST_HEAD(&slab->list); > + > + return slab; > +} > + > +static void *jit_cache_alloc(void *buf, size_t len, struct jit_cache *cache) > +{ > + struct jit_slab *s = > + list_first_entry_or_null(&cache->partial, struct jit_slab, list) ?: > + jit_slab_alloc(cache); > + unsigned obj_idx, nr_allocated; > + > + if (!s) > + return NULL; > + > + obj_idx = find_first_zero_bit(s->objs_allocated, cache->objs_per_slab); > + > + BUG_ON(obj_idx >= cache->objs_per_slab); > + __set_bit(obj_idx, s->objs_allocated); > + > + nr_allocated = bitmap_weight(s->objs_allocated, s->cache->objs_per_slab); > + > + if (nr_allocated == s->cache->objs_per_slab) { > + list_del_init(&s->list); > + } else if (nr_allocated == 1) { > + list_del(&s->list); > + list_add(&s->list, &s->cache->partial); > + } > + > + return s->executably_mapped + (obj_idx << cache->obj_size_bits); > +} > + > +void jit_update(void *buf, void *new_buf, size_t len) > +{ > + text_poke_copy(buf, new_buf, len); > +} > +EXPORT_SYMBOL_GPL(jit_update); > + > +void jit_free(void *buf) > +{ > + struct page *page; > + struct folio *folio; > + struct jit_slab *s; > + unsigned obj_idx, nr_allocated; > + size_t offset; > + > + if (!buf) > + return; > + > + page = vmalloc_to_page(buf); > + folio = page_folio(page); > + offset = offset_in_folio(folio, buf); > + > + if (!folio_test_slab(folio)) { > + vfree(buf); > + return; > + } > + > + s = folio_jit_slab(folio); > + > + mutex_lock(&jit_alloc_lock); > + obj_idx = offset >> s->cache->obj_size_bits; > + > + __clear_bit(obj_idx, s->objs_allocated); > + > + nr_allocated = bitmap_weight(s->objs_allocated, s->cache->objs_per_slab); > + > + if (nr_allocated == 0) { > + list_del(&s->list); > + kfree(s->objs_allocated); > + folio_put(folio); > + } else if (nr_allocated + 1 == s->cache->objs_per_slab) { > + list_del(&s->list); > + list_add(&s->list, &s->cache->partial); > + } > + > + mutex_unlock(&jit_alloc_lock); > +} > +EXPORT_SYMBOL_GPL(jit_free); > + > +void *jit_alloc(void *buf, size_t len) > +{ > + unsigned jit_cache_idx = ilog2(roundup_pow_of_two(len) / 16); > + void *p; > + > + if (jit_cache_idx < NR_JIT_CACHES) { > + mutex_lock(&jit_alloc_lock); > + p = jit_cache_alloc(buf, len, &jit_caches[jit_cache_idx]); > + mutex_unlock(&jit_alloc_lock); > + } else { > + p = module_alloc(len); > + if (p) { > + set_vm_flush_reset_perms(p); > + set_memory_rox((unsigned long) p, DIV_ROUND_UP(len, PAGE_SIZE)); > + } > + } > + > + if (p && buf) > + jit_update(p, buf, len); > + > + return p; > +} > +EXPORT_SYMBOL_GPL(jit_alloc); > + > +static int __init jit_alloc_init(void) > +{ > + for (unsigned i = 0; i < ARRAY_SIZE(jit_caches); i++) { > + jit_caches[i].obj_size_bits = ilog2(JITALLOC_MIN_SIZE) + i; > + jit_caches[i].objs_per_slab = PAGE_SIZE >> jit_caches[i].obj_size_bits; > + > + INIT_LIST_HEAD(&jit_caches[i].partial); > + } > + > + return 0; > +} > +core_initcall(jit_alloc_init); >
On Thu, May 18, 2023 at 10:00:39AM -0700, Song Liu wrote: > On Thu, May 18, 2023 at 9:48 AM Kent Overstreet > <kent.overstreet@linux.dev> wrote: > > > > On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote: > > > I am working on patches based on the discussion in [1]. I am planning to > > > send v1 for review in a week or so. > > > > Hey Song, I was reviewing that thread too, > > > > Are you taking a different approach based on Thomas's feedback? I think > > he had some fair points in that thread. > > Yes, the API is based on Thomas's suggestion, like 90% from the discussions. > > > > > My own feeling is that the buddy allocator is our tool for allocating > > larger variable sized physically contiguous allocations, so I'd like to > > see something based on that - I think we could do a hybrid buddy/slab > > allocator approach, like we have for regular memory allocations. > > I am planning to implement the allocator based on this (reuse > vmap_area logic): Ah, you're still doing vmap_area approach. Mike's approach looks like it'll be _much_ lighter weight and higher performance, to me. vmalloc is known to be slow compared to the buddy allocator, and with Mike's approach we're only modifying mappings once per 2 MB chunk. I don't see anything in your code for sub-page sized allocations too, so perhaps I should keep going with my slab allocator. Could you share your thoughts on your approach vs. Mike's? I'm newer to this area of the code than you two so maybe there's an angle I've missed :)
On Thu, May 18, 2023 at 10:15:59AM -0700, Song Liu wrote: > On Thu, May 18, 2023 at 9:58 AM Kent Overstreet > <kent.overstreet@linux.dev> wrote: > > > > On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote: > > > I am working on patches based on the discussion in [1]. I am planning to > > > send v1 for review in a week or so. > > > > For reference, here's my own (early, but functioning :) slab allocator: > > > > Look forward to comparing! > > This looks similar to the bpf_prog_pack allocator we use for BPF at the > moment. (search for bpf_prog_pack in kernel/bpf/core.c). I guess we > can also use bpf_prog_pack for bcachefs for now. The code in bpf/core.c does a _lot_, my preference would be to split that up and refactor the bpf code to use something generic :)
On Thu, May 18, 2023 at 10:24 AM Kent Overstreet <kent.overstreet@linux.dev> wrote: > > On Thu, May 18, 2023 at 10:00:39AM -0700, Song Liu wrote: > > On Thu, May 18, 2023 at 9:48 AM Kent Overstreet > > <kent.overstreet@linux.dev> wrote: > > > > > > On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote: > > > > I am working on patches based on the discussion in [1]. I am planning to > > > > send v1 for review in a week or so. > > > > > > Hey Song, I was reviewing that thread too, > > > > > > Are you taking a different approach based on Thomas's feedback? I think > > > he had some fair points in that thread. > > > > Yes, the API is based on Thomas's suggestion, like 90% from the discussions. > > > > > > > > My own feeling is that the buddy allocator is our tool for allocating > > > larger variable sized physically contiguous allocations, so I'd like to > > > see something based on that - I think we could do a hybrid buddy/slab > > > allocator approach, like we have for regular memory allocations. > > > > I am planning to implement the allocator based on this (reuse > > vmap_area logic): > > Ah, you're still doing vmap_area approach. > > Mike's approach looks like it'll be _much_ lighter weight and higher > performance, to me. vmalloc is known to be slow compared to the buddy > allocator, and with Mike's approach we're only modifying mappings once > per 2 MB chunk. > > I don't see anything in your code for sub-page sized allocations too, so > perhaps I should keep going with my slab allocator. The vmap_area approach handles sub-page allocations. In 5/5 of set [2], we showed that multiple BPF programs share the same page with some kernel text (_etext). > Could you share your thoughts on your approach vs. Mike's? I'm newer to > this area of the code than you two so maybe there's an angle I've missed > :) AFAICT, tree based solution (vmap_area) is more efficient than bitmap based solution. First, for 2MiB page with 64B chunk size, we need a bitmap of 2MiB / 64B = 32k bit = 4k bytes While the tree based solution can adapt to the number of allocations within This 2MiB page. Also, searching a free range within 4kB of bitmap may actually be slower than searching in the tree. Second, bitmap based solution cannot handle > 2MiB allocation cleanly, while tree based solution can. For example, if a big driver uses 3MiB, the tree based allocator can allocate 4MiB for it, and use the rest 1MiB for smaller allocations. Thanks, Song [2] https://lore.kernel.org/linux-mm/20221107223921.3451913-6-song@kernel.org/
On Thu, May 18, 2023 at 10:25 AM Kent Overstreet <kent.overstreet@linux.dev> wrote: > > On Thu, May 18, 2023 at 10:15:59AM -0700, Song Liu wrote: > > On Thu, May 18, 2023 at 9:58 AM Kent Overstreet > > <kent.overstreet@linux.dev> wrote: > > > > > > On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote: > > > > I am working on patches based on the discussion in [1]. I am planning to > > > > send v1 for review in a week or so. > > > > > > For reference, here's my own (early, but functioning :) slab allocator: > > > > > > Look forward to comparing! > > > > This looks similar to the bpf_prog_pack allocator we use for BPF at the > > moment. (search for bpf_prog_pack in kernel/bpf/core.c). I guess we > > can also use bpf_prog_pack for bcachefs for now. > > The code in bpf/core.c does a _lot_, my preference would be to split > that up and refactor the bpf code to use something generic :) The code is actually splitted into two parts: bpf_prog_pack_[alloc|free] vs. bpf_jit_binary_pack_[alloc|free]. The latter has logic just for BPF. But the former can be reused by other users (maybe with a little refactoring). Once the vmap_area based solution is committed, we will replace the former with the new allocator. Thanks, Song
On Thu, May 18, 2023 at 9:58 AM Kent Overstreet <kent.overstreet@linux.dev> wrote: > > On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote: > > I am working on patches based on the discussion in [1]. I am planning to > > send v1 for review in a week or so. > > For reference, here's my own (early, but functioning :) slab allocator: > > Look forward to comparing! > -->-- > From 6eeb6b8ef4271ea1a8d9cac7fbaeeb7704951976 Mon Sep 17 00:00:00 2001 > From: Kent Overstreet <kent.overstreet@linux.dev> > Date: Wed, 17 May 2023 01:22:06 -0400 > Subject: [PATCH] mm: jit/text allocator > > This provides a new, very simple slab allocator for jit/text, i.e. bpf, > ftrace trampolines, or bcachefs unpack functions. > > With this API we can avoid ever mapping pages both writeable and > executable (not implemented in this patch: need to tweak > module_alloc()), and it also supports sub-page sized allocations. > > Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> [...] > +static void *jit_cache_alloc(void *buf, size_t len, struct jit_cache *cache) > +{ > + struct jit_slab *s = > + list_first_entry_or_null(&cache->partial, struct jit_slab, list) ?: > + jit_slab_alloc(cache); > + unsigned obj_idx, nr_allocated; > + > + if (!s) > + return NULL; > + > + obj_idx = find_first_zero_bit(s->objs_allocated, cache->objs_per_slab); > + > + BUG_ON(obj_idx >= cache->objs_per_slab); > + __set_bit(obj_idx, s->objs_allocated); > + > + nr_allocated = bitmap_weight(s->objs_allocated, s->cache->objs_per_slab); > + > + if (nr_allocated == s->cache->objs_per_slab) { > + list_del_init(&s->list); > + } else if (nr_allocated == 1) { > + list_del(&s->list); > + list_add(&s->list, &s->cache->partial); > + } > + > + return s->executably_mapped + (obj_idx << cache->obj_size_bits); > +} IIUC, "len" is ignored in jit_cache_alloc(), so it can only handle <=16 byte allocations? Thanks, Song
On Thu, May 18, 2023 at 11:47 AM Song Liu <song@kernel.org> wrote: > > On Thu, May 18, 2023 at 10:24 AM Kent Overstreet > <kent.overstreet@linux.dev> wrote: > > > > On Thu, May 18, 2023 at 10:00:39AM -0700, Song Liu wrote: > > > On Thu, May 18, 2023 at 9:48 AM Kent Overstreet > > > <kent.overstreet@linux.dev> wrote: > > > > > > > > On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote: > > > > > I am working on patches based on the discussion in [1]. I am planning to > > > > > send v1 for review in a week or so. > > > > > > > > Hey Song, I was reviewing that thread too, > > > > > > > > Are you taking a different approach based on Thomas's feedback? I think > > > > he had some fair points in that thread. > > > > > > Yes, the API is based on Thomas's suggestion, like 90% from the discussions. > > > > > > > > > > > My own feeling is that the buddy allocator is our tool for allocating > > > > larger variable sized physically contiguous allocations, so I'd like to > > > > see something based on that - I think we could do a hybrid buddy/slab > > > > allocator approach, like we have for regular memory allocations. > > > > > > I am planning to implement the allocator based on this (reuse > > > vmap_area logic): > > > > Ah, you're still doing vmap_area approach. > > > > Mike's approach looks like it'll be _much_ lighter weight and higher > > performance, to me. vmalloc is known to be slow compared to the buddy > > allocator, and with Mike's approach we're only modifying mappings once > > per 2 MB chunk. > > > > I don't see anything in your code for sub-page sized allocations too, so > > perhaps I should keep going with my slab allocator. > > The vmap_area approach handles sub-page allocations. In 5/5 of set [2], > we showed that multiple BPF programs share the same page with some > kernel text (_etext). > > > Could you share your thoughts on your approach vs. Mike's? I'm newer to > > this area of the code than you two so maybe there's an angle I've missed > > :) > > AFAICT, tree based solution (vmap_area) is more efficient than bitmap > based solution. > > First, for 2MiB page with 64B chunk size, we need a bitmap of > 2MiB / 64B = 32k bit = 4k bytes > While the tree based solution can adapt to the number of allocations within > This 2MiB page. Also, searching a free range within 4kB of bitmap may > actually be slower than searching in the tree. > > Second, bitmap based solution cannot handle > 2MiB allocation cleanly, > while tree based solution can. For example, if a big driver uses 3MiB, the > tree based allocator can allocate 4MiB for it, and use the rest 1MiB for > smaller allocations. Missed one: Third, bitmap based solution requires a "size" parameter in free(). It is an overhead for the user. Tree based solution doesn't have this issue. Thanks, Song > > [2] https://lore.kernel.org/linux-mm/20221107223921.3451913-6-song@kernel.org/
On Thu, May 18, 2023 at 12:01:26PM -0700, Song Liu wrote: > On Thu, May 18, 2023 at 9:58 AM Kent Overstreet > <kent.overstreet@linux.dev> wrote: > > > > On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote: > > > I am working on patches based on the discussion in [1]. I am planning to > > > send v1 for review in a week or so. > > > > For reference, here's my own (early, but functioning :) slab allocator: > > > > Look forward to comparing! > > -->-- > > From 6eeb6b8ef4271ea1a8d9cac7fbaeeb7704951976 Mon Sep 17 00:00:00 2001 > > From: Kent Overstreet <kent.overstreet@linux.dev> > > Date: Wed, 17 May 2023 01:22:06 -0400 > > Subject: [PATCH] mm: jit/text allocator > > > > This provides a new, very simple slab allocator for jit/text, i.e. bpf, > > ftrace trampolines, or bcachefs unpack functions. > > > > With this API we can avoid ever mapping pages both writeable and > > executable (not implemented in this patch: need to tweak > > module_alloc()), and it also supports sub-page sized allocations. > > > > Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> > > [...] > > > +static void *jit_cache_alloc(void *buf, size_t len, struct jit_cache *cache) > > +{ > > + struct jit_slab *s = > > + list_first_entry_or_null(&cache->partial, struct jit_slab, list) ?: > > + jit_slab_alloc(cache); > > + unsigned obj_idx, nr_allocated; > > + > > + if (!s) > > + return NULL; > > + > > + obj_idx = find_first_zero_bit(s->objs_allocated, cache->objs_per_slab); > > + > > + BUG_ON(obj_idx >= cache->objs_per_slab); > > + __set_bit(obj_idx, s->objs_allocated); > > + > > + nr_allocated = bitmap_weight(s->objs_allocated, s->cache->objs_per_slab); > > + > > + if (nr_allocated == s->cache->objs_per_slab) { > > + list_del_init(&s->list); > > + } else if (nr_allocated == 1) { > > + list_del(&s->list); > > + list_add(&s->list, &s->cache->partial); > > + } > > + > > + return s->executably_mapped + (obj_idx << cache->obj_size_bits); > > +} > > IIUC, "len" is ignored in jit_cache_alloc(), so it can only handle > <=16 byte allocations? len is a redundant parameter (good catch); at that point we've picked a cache for the specific allocation size. Since there's multiple caches for each power of two size, it can handle allocations up to PAGE_SIZE.
On Thu, May 18, 2023 at 12:03:03PM -0700, Song Liu wrote: > On Thu, May 18, 2023 at 11:47 AM Song Liu <song@kernel.org> wrote: > > > > On Thu, May 18, 2023 at 10:24 AM Kent Overstreet > > <kent.overstreet@linux.dev> wrote: > > > > > > On Thu, May 18, 2023 at 10:00:39AM -0700, Song Liu wrote: > > > > On Thu, May 18, 2023 at 9:48 AM Kent Overstreet > > > > <kent.overstreet@linux.dev> wrote: > > > > > > > > > > On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote: > > > > > > I am working on patches based on the discussion in [1]. I am planning to > > > > > > send v1 for review in a week or so. > > > > > > > > > > Hey Song, I was reviewing that thread too, > > > > > > > > > > Are you taking a different approach based on Thomas's feedback? I think > > > > > he had some fair points in that thread. > > > > > > > > Yes, the API is based on Thomas's suggestion, like 90% from the discussions. > > > > > > > > > > > > > > My own feeling is that the buddy allocator is our tool for allocating > > > > > larger variable sized physically contiguous allocations, so I'd like to > > > > > see something based on that - I think we could do a hybrid buddy/slab > > > > > allocator approach, like we have for regular memory allocations. > > > > > > > > I am planning to implement the allocator based on this (reuse > > > > vmap_area logic): > > > > > > Ah, you're still doing vmap_area approach. > > > > > > Mike's approach looks like it'll be _much_ lighter weight and higher > > > performance, to me. vmalloc is known to be slow compared to the buddy > > > allocator, and with Mike's approach we're only modifying mappings once > > > per 2 MB chunk. > > > > > > I don't see anything in your code for sub-page sized allocations too, so > > > perhaps I should keep going with my slab allocator. > > > > The vmap_area approach handles sub-page allocations. In 5/5 of set [2], > > we showed that multiple BPF programs share the same page with some > > kernel text (_etext). > > > > > Could you share your thoughts on your approach vs. Mike's? I'm newer to > > > this area of the code than you two so maybe there's an angle I've missed > > > :) > > > > AFAICT, tree based solution (vmap_area) is more efficient than bitmap > > based solution. Tree based requires quite a bit of overhead for the rbtree pointers, and additional vmap_area structs. With a buddy allocator based approach, there's no additional state that needs to be allocated, since it all fits in struct page. > > First, for 2MiB page with 64B chunk size, we need a bitmap of > > 2MiB / 64B = 32k bit = 4k bytes > > While the tree based solution can adapt to the number of allocations within > > This 2MiB page. Also, searching a free range within 4kB of bitmap may > > actually be slower than searching in the tree. > > > > Second, bitmap based solution cannot handle > 2MiB allocation cleanly, > > while tree based solution can. For example, if a big driver uses 3MiB, the > > tree based allocator can allocate 4MiB for it, and use the rest 1MiB for > > smaller allocations. We're not talking about a bitmap based solution for >= PAGE_SIZE allocations, the alternative is a buddy allocator - so no searching, just per power-of-two freelists. > > Missed one: > > Third, bitmap based solution requires a "size" parameter in free(). It is an > overhead for the user. Tree based solution doesn't have this issue. No, we can recover the size of the allocation via compound_order() - hasn't historically been done for alloc_pages() allocations to avoid setting up the state in each page for compound head/tail, but it perhaps should be (and is with folios, which we've generally been switching to).
On Thu, May 18, 2023 at 12:03:03PM -0700, Song Liu wrote: > > Second, bitmap based solution cannot handle > 2MiB allocation cleanly, > > while tree based solution can. For example, if a big driver uses 3MiB, the > > tree based allocator can allocate 4MiB for it, and use the rest 1MiB for > > smaller allocations. 2 MB is just a good default size for the initial unmapped allocations if the main consideration is avoiding mapping fragmentation. The approach can work with larger allocations if required.
On Thu, May 18, 2023 at 12:15 PM Kent Overstreet <kent.overstreet@linux.dev> wrote: > > On Thu, May 18, 2023 at 12:03:03PM -0700, Song Liu wrote: > > On Thu, May 18, 2023 at 11:47 AM Song Liu <song@kernel.org> wrote: > > > > > > On Thu, May 18, 2023 at 10:24 AM Kent Overstreet > > > <kent.overstreet@linux.dev> wrote: > > > > > > > > On Thu, May 18, 2023 at 10:00:39AM -0700, Song Liu wrote: > > > > > On Thu, May 18, 2023 at 9:48 AM Kent Overstreet > > > > > <kent.overstreet@linux.dev> wrote: > > > > > > > > > > > > On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote: > > > > > > > I am working on patches based on the discussion in [1]. I am planning to > > > > > > > send v1 for review in a week or so. > > > > > > > > > > > > Hey Song, I was reviewing that thread too, > > > > > > > > > > > > Are you taking a different approach based on Thomas's feedback? I think > > > > > > he had some fair points in that thread. > > > > > > > > > > Yes, the API is based on Thomas's suggestion, like 90% from the discussions. > > > > > > > > > > > > > > > > > My own feeling is that the buddy allocator is our tool for allocating > > > > > > larger variable sized physically contiguous allocations, so I'd like to > > > > > > see something based on that - I think we could do a hybrid buddy/slab > > > > > > allocator approach, like we have for regular memory allocations. > > > > > > > > > > I am planning to implement the allocator based on this (reuse > > > > > vmap_area logic): > > > > > > > > Ah, you're still doing vmap_area approach. > > > > > > > > Mike's approach looks like it'll be _much_ lighter weight and higher > > > > performance, to me. vmalloc is known to be slow compared to the buddy > > > > allocator, and with Mike's approach we're only modifying mappings once > > > > per 2 MB chunk. > > > > > > > > I don't see anything in your code for sub-page sized allocations too, so > > > > perhaps I should keep going with my slab allocator. > > > > > > The vmap_area approach handles sub-page allocations. In 5/5 of set [2], > > > we showed that multiple BPF programs share the same page with some > > > kernel text (_etext). > > > > > > > Could you share your thoughts on your approach vs. Mike's? I'm newer to > > > > this area of the code than you two so maybe there's an angle I've missed > > > > :) > > > > > > AFAICT, tree based solution (vmap_area) is more efficient than bitmap > > > based solution. > > Tree based requires quite a bit of overhead for the rbtree pointers, and > additional vmap_area structs. > > With a buddy allocator based approach, there's no additional state that > needs to be allocated, since it all fits in struct page. > > > > First, for 2MiB page with 64B chunk size, we need a bitmap of > > > 2MiB / 64B = 32k bit = 4k bytes > > > While the tree based solution can adapt to the number of allocations within > > > This 2MiB page. Also, searching a free range within 4kB of bitmap may > > > actually be slower than searching in the tree. > > > > > > Second, bitmap based solution cannot handle > 2MiB allocation cleanly, > > > while tree based solution can. For example, if a big driver uses 3MiB, the > > > tree based allocator can allocate 4MiB for it, and use the rest 1MiB for > > > smaller allocations. > > We're not talking about a bitmap based solution for >= PAGE_SIZE > allocations, the alternative is a buddy allocator - so no searching, > just per power-of-two freelists. > > > > > Missed one: > > > > Third, bitmap based solution requires a "size" parameter in free(). It is an > > overhead for the user. Tree based solution doesn't have this issue. > > No, we can recover the size of the allocation via compound_order() - > hasn't historically been done for alloc_pages() allocations to avoid > setting up the state in each page for compound head/tail, but it perhaps > should be (and is with folios, which we've generally been switching to). If we use compound_order(), we will round up to power of 2 for all allocations. Does this mean we will use 4MiB for a 2.1MiB allocation? Thanks, Song
On Thu, May 18, 2023 at 01:03:28PM -0700, Song Liu wrote: > If we use compound_order(), we will round up to power of 2 for all > allocations. Does this mean we will use 4MiB for a 2.1MiB allocation? Yes. This means we lose on average 33% - I believe, someone with better statistics might correct me - to internal fragmentation. But the buddy allocator will be better at avoiding and dealing with external fragmentation over time.
On Thu, May 18, 2023 at 12:15 PM Kent Overstreet <kent.overstreet@linux.dev> wrote: > > On Thu, May 18, 2023 at 12:03:03PM -0700, Song Liu wrote: > > On Thu, May 18, 2023 at 11:47 AM Song Liu <song@kernel.org> wrote: > > > > > > On Thu, May 18, 2023 at 10:24 AM Kent Overstreet > > > <kent.overstreet@linux.dev> wrote: > > > > > > > > On Thu, May 18, 2023 at 10:00:39AM -0700, Song Liu wrote: > > > > > On Thu, May 18, 2023 at 9:48 AM Kent Overstreet > > > > > <kent.overstreet@linux.dev> wrote: > > > > > > > > > > > > On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote: > > > > > > > I am working on patches based on the discussion in [1]. I am planning to > > > > > > > send v1 for review in a week or so. > > > > > > > > > > > > Hey Song, I was reviewing that thread too, > > > > > > > > > > > > Are you taking a different approach based on Thomas's feedback? I think > > > > > > he had some fair points in that thread. > > > > > > > > > > Yes, the API is based on Thomas's suggestion, like 90% from the discussions. > > > > > > > > > > > > > > > > > My own feeling is that the buddy allocator is our tool for allocating > > > > > > larger variable sized physically contiguous allocations, so I'd like to > > > > > > see something based on that - I think we could do a hybrid buddy/slab > > > > > > allocator approach, like we have for regular memory allocations. > > > > > > > > > > I am planning to implement the allocator based on this (reuse > > > > > vmap_area logic): > > > > > > > > Ah, you're still doing vmap_area approach. > > > > > > > > Mike's approach looks like it'll be _much_ lighter weight and higher > > > > performance, to me. vmalloc is known to be slow compared to the buddy > > > > allocator, and with Mike's approach we're only modifying mappings once > > > > per 2 MB chunk. > > > > > > > > I don't see anything in your code for sub-page sized allocations too, so > > > > perhaps I should keep going with my slab allocator. > > > > > > The vmap_area approach handles sub-page allocations. In 5/5 of set [2], > > > we showed that multiple BPF programs share the same page with some > > > kernel text (_etext). > > > > > > > Could you share your thoughts on your approach vs. Mike's? I'm newer to > > > > this area of the code than you two so maybe there's an angle I've missed > > > > :) > > > > > > AFAICT, tree based solution (vmap_area) is more efficient than bitmap > > > based solution. > > Tree based requires quite a bit of overhead for the rbtree pointers, and > additional vmap_area structs. > > With a buddy allocator based approach, there's no additional state that > needs to be allocated, since it all fits in struct page. To allocate memory for text, we will allocate 2MiB, make it ROX, and then use it for many small allocations. IIUC, buddy allocator will use unallocated parts of this page for metadata. I guess this may be a problem, as the whole page is ROX now, and we have to use text_poke to write to it. OTOH, if we allocate extra memory for metadata (tree based solution), all the metadata operations can be regular read/write. Thanks, Song
On Thu, May 18, 2023 at 01:51:08PM -0700, Song Liu wrote: > To allocate memory for text, we will allocate 2MiB, make it ROX, and then > use it for many small allocations. IIUC, buddy allocator will use unallocated > parts of this page for metadata. I guess this may be a problem, as the > whole page is ROX now, and we have to use text_poke to write to it. The standard kernel buddy allocator does _not_ store anything in the page itself - because the page might be a highmem page. That's also why I went with the bitmap for my slab allocator; standard kernel slab allocator stores a freelist ptr in free objects.
Hi Kent, On Thu, May 18, 2023 at 01:23:56PM -0400, Kent Overstreet wrote: > On Thu, May 18, 2023 at 10:00:39AM -0700, Song Liu wrote: > > On Thu, May 18, 2023 at 9:48 AM Kent Overstreet > > <kent.overstreet@linux.dev> wrote: > > > > > > On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote: > > > > I am working on patches based on the discussion in [1]. I am planning to > > > > send v1 for review in a week or so. > > > > > > Hey Song, I was reviewing that thread too, > > > > > > Are you taking a different approach based on Thomas's feedback? I think > > > he had some fair points in that thread. > > > > Yes, the API is based on Thomas's suggestion, like 90% from the discussions. > > > > > > > > My own feeling is that the buddy allocator is our tool for allocating > > > larger variable sized physically contiguous allocations, so I'd like to > > > see something based on that - I think we could do a hybrid buddy/slab > > > allocator approach, like we have for regular memory allocations. > > > > I am planning to implement the allocator based on this (reuse > > vmap_area logic): > > Ah, you're still doing vmap_area approach. > > Mike's approach looks like it'll be _much_ lighter weight and higher > performance, to me. vmalloc is known to be slow compared to the buddy > allocator, and with Mike's approach we're only modifying mappings once > per 2 MB chunk. > > I don't see anything in your code for sub-page sized allocations too, so > perhaps I should keep going with my slab allocator. Your allocator implicitly relies on vmalloc because of module_alloc ;-) What I was thinking is that we can replace module_alloc() calls in your allocator with something based on my unmapped_alloc(). If we make the part that refills the cache also take care of creating the mapping in the module address space, that should cover everything. > Could you share your thoughts on your approach vs. Mike's? I'm newer to > this area of the code than you two so maybe there's an angle I've missed > :) >
On Thu, May 18, 2023 at 6:24 PM Kent Overstreet <kent.overstreet@linux.dev> wrote: > > On Thu, May 18, 2023 at 01:51:08PM -0700, Song Liu wrote: > > > To allocate memory for text, we will allocate 2MiB, make it ROX, and then > > use it for many small allocations. IIUC, buddy allocator will use unallocated > > parts of this page for metadata. I guess this may be a problem, as the > > whole page is ROX now, and we have to use text_poke to write to it. > > The standard kernel buddy allocator does _not_ store anything in the > page itself - because the page might be a highmem page. Thanks for the correction. Song > > That's also why I went with the bitmap for my slab allocator; standard > kernel slab allocator stores a freelist ptr in free objects.
On Fri, May 19, 2023 at 1:30 AM Mike Rapoport <rppt@kernel.org> wrote: > > Hi Kent, > > On Thu, May 18, 2023 at 01:23:56PM -0400, Kent Overstreet wrote: > > On Thu, May 18, 2023 at 10:00:39AM -0700, Song Liu wrote: > > > On Thu, May 18, 2023 at 9:48 AM Kent Overstreet > > > <kent.overstreet@linux.dev> wrote: > > > > > > > > On Thu, May 18, 2023 at 09:33:20AM -0700, Song Liu wrote: > > > > > I am working on patches based on the discussion in [1]. I am planning to > > > > > send v1 for review in a week or so. > > > > > > > > Hey Song, I was reviewing that thread too, > > > > > > > > Are you taking a different approach based on Thomas's feedback? I think > > > > he had some fair points in that thread. > > > > > > Yes, the API is based on Thomas's suggestion, like 90% from the discussions. > > > > > > > > > > > My own feeling is that the buddy allocator is our tool for allocating > > > > larger variable sized physically contiguous allocations, so I'd like to > > > > see something based on that - I think we could do a hybrid buddy/slab > > > > allocator approach, like we have for regular memory allocations. > > > > > > I am planning to implement the allocator based on this (reuse > > > vmap_area logic): > > > > Ah, you're still doing vmap_area approach. > > > > Mike's approach looks like it'll be _much_ lighter weight and higher > > performance, to me. vmalloc is known to be slow compared to the buddy > > allocator, and with Mike's approach we're only modifying mappings once > > per 2 MB chunk. > > > > I don't see anything in your code for sub-page sized allocations too, so > > perhaps I should keep going with my slab allocator. > > Your allocator implicitly relies on vmalloc because of module_alloc ;-) > > What I was thinking is that we can replace module_alloc() calls in your > allocator with something based on my unmapped_alloc(). If we make the part > that refills the cache also take care of creating the mapping in the > module address space, that should cover everything. Here are what I found as I work more on the code: 1. It takes quite some work to create a clean interface and make sure all the users of module_alloc can use the new allocator on all archs. (archs with text poke need to work with ROX memory from the allocator; archs without text poke need to have a clean fall back mechanism, etc.). Most of this work is independent of the actual allocator, so we can do this part and plug in whatever allocator we want (buddy+slab or vmap-based or any other solutions). 2. vmap_area based solution will work. And it will be one solution for both < PAGE_SIZE and > PAGE_SIZE allocations. Given module_alloc is not in any hot path (AFAIK), I don't see any practical issues with this solution. It will be a little tricky to place and name the code, as it uses vmalloc logic, but it is technically a module allocator. I will prioritize building the interface, and migrating users to it. If we do this part right, changing the underlying allocator should be straightforward. Thanks, Song
On Fri, May 19, 2023 at 11:29:45AM +0300, Mike Rapoport wrote: > Your allocator implicitly relies on vmalloc because of module_alloc ;-) > > What I was thinking is that we can replace module_alloc() calls in your > allocator with something based on my unmapped_alloc(). If we make the part > that refills the cache also take care of creating the mapping in the > module address space, that should cover everything. Yeah, that's exactly what I was thinking :) Liam was also just mentioning on IRC vmalloc lock contention came up again at LSF, and that's historically always been an isuse - going with your patchset for the backend nicely avoids that. If I have time (hah! big if :) I'll see if I can cook up a patchset that combines our two approaches over the weekend.
On Fri, May 19, 2023 at 11:47:42AM -0400, Kent Overstreet wrote: > On Fri, May 19, 2023 at 11:29:45AM +0300, Mike Rapoport wrote: > > Your allocator implicitly relies on vmalloc because of module_alloc ;-) > > > > What I was thinking is that we can replace module_alloc() calls in your > > allocator with something based on my unmapped_alloc(). If we make the part > > that refills the cache also take care of creating the mapping in the > > module address space, that should cover everything. > > Yeah, that's exactly what I was thinking :) > > Liam was also just mentioning on IRC vmalloc lock contention came up > again at LSF, and that's historically always been an isuse - going with > your patchset for the backend nicely avoids that. Unfortunately not because we still need to map the pages in the modules area which is essentially a subset of vmalloc address space. > If I have time (hah! big if :) I'll see if I can cook up a patchset that > combines our two approaches over the weekend. Now there is also an interest about unmapped allocations from KVM folks, so I might continue pursuing unmapped allocator, probably just without a new GFP flag and hooks into page allocator.
On Fri, May 19, 2023 at 07:14:14PM +0300, Mike Rapoport wrote: > Unfortunately not because we still need to map the pages in the modules > area which is essentially a subset of vmalloc address space. Why wouldn't we be doing that once per 2 MB page? > > > If I have time (hah! big if :) I'll see if I can cook up a patchset that > > combines our two approaches over the weekend. > > Now there is also an interest about unmapped allocations from KVM folks, so > I might continue pursuing unmapped allocator, probably just without a new > GFP flag and hooks into page allocator. Cool - it does seem like a generally valuable capability :)
On Fri, May 19 2023 at 08:42, Song Liu wrote: > On Fri, May 19, 2023 at 1:30 AM Mike Rapoport <rppt@kernel.org> wrote: >> What I was thinking is that we can replace module_alloc() calls in your >> allocator with something based on my unmapped_alloc(). If we make the part >> that refills the cache also take care of creating the mapping in the >> module address space, that should cover everything. > > Here are what I found as I work more on the code: > > 1. It takes quite some work to create a clean interface and make sure > all the users of module_alloc can use the new allocator on all archs. > (archs with text poke need to work with ROX memory from the > allocator; archs without text poke need to have a clean fall back > mechanism, etc.). Most of this work is independent of the actual > allocator, so we can do this part and plug in whatever allocator we > want (buddy+slab or vmap-based or any other solutions). > > 2. vmap_area based solution will work. And it will be one solution for > both < PAGE_SIZE and > PAGE_SIZE allocations. Given > module_alloc is not in any hot path (AFAIK), I don't see any > practical issues with this solution. It will be a little tricky to place > and name the code, as it uses vmalloc logic, but it is technically a > module allocator. > > I will prioritize building the interface, and migrating users to it. If we > do this part right, changing the underlying allocator should be > straightforward. Correct, that's the only workable approach. Once you have solved #1 the mechanism of the underlying allocator (#2) is pretty much exchangeable. Any attempt to solve #2 before #1 is doomed by definition. Thanks, tglx
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index a825bf031f49..735f691d449c 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -35,6 +35,9 @@ config X86_64 select ARCH_HAS_ELFCORE_COMPAT select ZONE_DMA32 +config ARCH_WANTS_GFP_UNMAPPED + def_bool y if X86_64 + config FORCE_DYNAMIC_FTRACE def_bool y depends on X86_32 diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h index 5088637fe5c2..234122b434dd 100644 --- a/include/linux/gfp_types.h +++ b/include/linux/gfp_types.h @@ -60,6 +60,12 @@ typedef unsigned int __bitwise gfp_t; #else #define ___GFP_NOLOCKDEP 0 #endif +#ifdef CONFIG_UNMAPPED_ALLOC +#define ___GFP_UNMAPPED 0x10000000u +#else +#define ___GFP_UNMAPPED 0 +#endif + /* If the above are modified, __GFP_BITS_SHIFT may need updating */ /* @@ -101,12 +107,15 @@ typedef unsigned int __bitwise gfp_t; * node with no fallbacks or placement policy enforcements. * * %__GFP_ACCOUNT causes the allocation to be accounted to kmemcg. + * + * %__GFP_UNMAPPED removes the allocated pages from the direct map. */ #define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE) #define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) #define __GFP_HARDWALL ((__force gfp_t)___GFP_HARDWALL) #define __GFP_THISNODE ((__force gfp_t)___GFP_THISNODE) #define __GFP_ACCOUNT ((__force gfp_t)___GFP_ACCOUNT) +#define __GFP_UNMAPPED ((__force gfp_t)___GFP_UNMAPPED) /** * DOC: Watermark modifiers @@ -252,7 +261,7 @@ typedef unsigned int __bitwise gfp_t; #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP) /* Room for N __GFP_FOO bits */ -#define __GFP_BITS_SHIFT (27 + IS_ENABLED(CONFIG_LOCKDEP)) +#define __GFP_BITS_SHIFT (28 + IS_ENABLED(CONFIG_LOCKDEP)) #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1)) /** diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index a7e3a3405520..e66dbfdba8f2 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -922,6 +922,7 @@ static inline bool is_page_hwpoison(struct page *page) #define PG_offline 0x00000100 #define PG_table 0x00000200 #define PG_guard 0x00000400 +#define PG_unmapped 0x00000800 #define PageType(page, flag) \ ((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE) @@ -992,6 +993,11 @@ PAGE_TYPE_OPS(Table, table) */ PAGE_TYPE_OPS(Guard, guard) +/* + * Marks pages in free lists of unmapped allocator. + */ +PAGE_TYPE_OPS(Unmapped, unmapped) + extern bool is_free_buddy_page(struct page *page); PAGEFLAG(Isolated, isolated, PF_ANY); diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h index 5f1ae07d724b..bb6f5ec83881 100644 --- a/include/linux/pageblock-flags.h +++ b/include/linux/pageblock-flags.h @@ -21,6 +21,9 @@ enum pageblock_bits { /* 3 bits required for migrate types */ PB_migrate_skip,/* If set the block is skipped by compaction */ + PB_unmapped = 7,/* if set the block has pages unmapped in the direct + map */ + /* * Assume the bits will always align on a word. If this assumption * changes then get/set pageblock needs updating. @@ -95,4 +98,29 @@ static inline void set_pageblock_skip(struct page *page) } #endif /* CONFIG_COMPACTION */ +#ifdef CONFIG_UNMAPPED_ALLOC +#define get_pageblock_unmapped(page) \ + get_pfnblock_flags_mask(page, page_to_pfn(page), \ + (1 << (PB_unmapped))) +#define clear_pageblock_unmapped(page) \ + set_pfnblock_flags_mask(page, 0, page_to_pfn(page), \ + (1 << PB_unmapped)) +#define set_pageblock_unmapped(page) \ + set_pfnblock_flags_mask(page, (1 << PB_unmapped), \ + page_to_pfn(page), \ + (1 << PB_unmapped)) +#else /* CONFIG_UNMAPPED_ALLOC */ +static inline bool get_pageblock_unmapped(struct page *page) +{ + return false; +} +static inline void clear_pageblock_unmapped(struct page *page) +{ +} +static inline void set_pageblock_unmapped(struct page *page) +{ +} +#endif /* CONFIG_UNMAPPED_ALLOC */ + + #endif /* PAGEBLOCK_FLAGS_H */ diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index 9db52bc4ce19..951d294a3840 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -61,9 +61,17 @@ #define __def_gfpflag_names_kasan #endif +#ifdef CONFIG_UNMAPPED_ALLOC +#define __def_gfpflag_names_unmapped \ + , gfpflag_string(__GFP_UNMAPPED) +#else +#define __def_gfpflag_names_unmapped +#endif + #define show_gfp_flags(flags) \ (flags) ? __print_flags(flags, "|", \ - __def_gfpflag_names __def_gfpflag_names_kasan \ + __def_gfpflag_names __def_gfpflag_names_kasan \ + __def_gfpflag_names_unmapped \ ) : "none" #ifdef CONFIG_MMU diff --git a/mm/Kconfig b/mm/Kconfig index 4751031f3f05..404be73e00e8 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1202,6 +1202,10 @@ config LRU_GEN_STATS This option has a per-memcg and per-node memory overhead. # } +config UNMAPPED_ALLOC + def_bool y + depends on ARCH_WANTS_GFP_UNMAPPED + source "mm/damon/Kconfig" endmenu diff --git a/mm/Makefile b/mm/Makefile index 8e105e5b3e29..9143303295af 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -138,3 +138,4 @@ obj-$(CONFIG_IO_MAPPING) += io-mapping.o obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o +obj-$(CONFIG_UNMAPPED_ALLOC) += unmapped-alloc.o diff --git a/mm/internal.h b/mm/internal.h index 7920a8b7982e..8d84cceab467 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -1042,4 +1042,26 @@ struct vma_prepare { struct vm_area_struct *remove; struct vm_area_struct *remove2; }; + +/* + * mm/unmapped-alloc.c + */ +#ifdef CONFIG_UNMAPPED_ALLOC +int unmapped_alloc_init(void); +struct page *unmapped_pages_alloc(gfp_t gfp, int order); +void unmapped_pages_free(struct page *page, int order); +#else +static inline int unmapped_alloc_init(void) +{ + return 0; +} + +static inline struct page *unmapped_pages_alloc(gfp_t gfp, int order) +{ + return NULL; +} + +static inline void unmapped_pages_free(struct page *page, int order) {} +#endif + #endif /* __MM_INTERNAL_H */ diff --git a/mm/page_alloc.c b/mm/page_alloc.c index ac1fc986af44..01f18e7529b0 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -589,7 +589,7 @@ void set_pfnblock_flags_mask(struct page *page, unsigned long flags, unsigned long bitidx, word_bitidx; unsigned long word; - BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 4); + BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 8); BUILD_BUG_ON(MIGRATE_TYPES > (1 << PB_migratetype_bits)); bitmap = get_pageblock_bitmap(page, pfn); @@ -746,6 +746,13 @@ static inline bool pcp_allowed_order(unsigned int order) return false; } +static inline bool pcp_allowed(unsigned int order, gfp_t gfp) +{ + if (unlikely(gfp & __GFP_UNMAPPED)) + return false; + return pcp_allowed_order(order); +} + static inline void free_the_page(struct page *page, unsigned int order) { if (pcp_allowed_order(order)) /* Via pcp? */ @@ -1460,6 +1467,11 @@ static __always_inline bool free_pages_prepare(struct page *page, PAGE_SIZE << order); } + if (get_pageblock_unmapped(page)) { + unmapped_pages_free(page, order); + return false; + } + kernel_poison_pages(page, 1 << order); /* @@ -3525,6 +3537,13 @@ void free_unref_page_list(struct list_head *list) /* Prepare pages for freeing */ list_for_each_entry_safe(page, next, list, lru) { unsigned long pfn = page_to_pfn(page); + + if (get_pageblock_unmapped(page)) { + list_del(&page->lru); + unmapped_pages_free(page, 0); + continue; + } + if (!free_unref_page_prepare(page, pfn, 0)) { list_del(&page->lru); continue; @@ -3856,7 +3875,7 @@ struct page *rmqueue(struct zone *preferred_zone, */ WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1)); - if (likely(pcp_allowed_order(order))) { + if (likely(pcp_allowed(order, gfp_flags))) { /* * MIGRATE_MOVABLE pcplist could have the pages on CMA area and * we need to skip it when CMA area isn't allowed. @@ -5581,6 +5600,11 @@ struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid, &alloc_gfp, &alloc_flags)) return NULL; + if (alloc_gfp & __GFP_UNMAPPED) { + page = unmapped_pages_alloc(gfp, order); + goto out; + } + /* * Forbid the first pass from falling back to types that fragment * memory until all local zones are considered. @@ -8437,6 +8461,7 @@ void __init free_area_init(unsigned long *max_zone_pfn) } memmap_init(); + unmapped_alloc_init(); } static int __init cmdline_parse_core(char *p, unsigned long *core, diff --git a/mm/unmapped-alloc.c b/mm/unmapped-alloc.c new file mode 100644 index 000000000000..fb2d54204a3c --- /dev/null +++ b/mm/unmapped-alloc.c @@ -0,0 +1,215 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include <linux/gfp.h> +#include <linux/mmzone.h> +#include <linux/printk.h> +#include <linux/spinlock.h> +#include <linux/set_memory.h> + +#include <asm/tlbflush.h> + +#include "internal.h" + +struct unmapped_free_area { + struct list_head free_list; + spinlock_t lock; + unsigned long nr_free; + unsigned long nr_cached; +}; + +static struct unmapped_free_area free_area[MAX_ORDER]; + +static inline void add_to_free_list(struct page *page, unsigned int order) +{ + struct unmapped_free_area *area = &free_area[order]; + + list_add(&page->buddy_list, &area->free_list); + area->nr_free++; +} + +static inline void del_page_from_free_list(struct page *page, unsigned int order) +{ + list_del(&page->buddy_list); + __ClearPageUnmapped(page); + set_page_private(page, 0); + free_area[order].nr_free--; +} + +static inline void set_unmapped_order(struct page *page, unsigned int order) +{ + set_page_private(page, order); + __SetPageUnmapped(page); +} + +static inline bool page_is_unmapped_buddy(struct page *page, struct page *buddy, + unsigned int order) +{ + if (!PageUnmapped(buddy)) + return false; + + if (buddy_order(buddy) != order) + return false; + + return true; +} + +static struct page *find_unmapped_buddy_page_pfn(struct page *page, + unsigned long pfn, + unsigned int order, + unsigned long *buddy_pfn) +{ + unsigned long __buddy_pfn = __find_buddy_pfn(pfn, order); + struct page *buddy; + + buddy = page + (__buddy_pfn - pfn); + if (buddy_pfn) + *buddy_pfn = __buddy_pfn; + + if (page_is_unmapped_buddy(page, buddy, order)) + return buddy; + + return NULL; +} + +static inline void __free_one_page(struct page *page, unsigned int order, + bool cache_refill) +{ + unsigned long pfn = page_to_pfn(page); + unsigned long buddy_pfn; + unsigned long combined_pfn; + struct page *buddy; + unsigned long flags; + + spin_lock_irqsave(&free_area->lock, flags); + + if (cache_refill) { + set_pageblock_unmapped(page); + free_area[order].nr_cached++; + } + + while (order < MAX_ORDER - 1) { + buddy = find_unmapped_buddy_page_pfn(page, pfn, order, + &buddy_pfn); + if (!buddy) + break; + + del_page_from_free_list(buddy, order); + combined_pfn = buddy_pfn & pfn; + page = page + (combined_pfn - pfn); + pfn = combined_pfn; + order++; + } + + set_unmapped_order(page, order); + add_to_free_list(page, order); + spin_unlock_irqrestore(&free_area->lock, flags); +} + +static inline void expand(struct page *page, int low, int high) +{ + unsigned long size = 1 << high; + + while (high > low) { + high--; + size >>= 1; + + add_to_free_list(&page[size], high); + set_unmapped_order(&page[size], high); + } +} + +static struct page *__rmqueue_smallest(unsigned int order) +{ + unsigned int current_order; + struct unmapped_free_area *area; + struct page *page = NULL; + unsigned long flags; + + spin_lock_irqsave(&free_area->lock, flags); + + /* Find a page of the appropriate size in the preferred list */ + for (current_order = order; current_order < MAX_ORDER; ++current_order) { + area = &free_area[current_order]; + page = list_first_entry_or_null(&area->free_list, struct page, + lru); + if (!page) + continue; + + del_page_from_free_list(page, current_order); + expand(page, order, current_order); + + break; + } + + spin_unlock_irqrestore(&free_area->lock, flags); + + return page; +} + +/* FIXME: have PMD_ORDER at last available in include/linux */ +#define PMD_ORDER (PMD_SHIFT - PAGE_SHIFT) + +struct page *unmapped_pages_alloc(gfp_t gfp, int order) +{ + + int cache_order = PMD_ORDER; + struct page *page; + + page = __rmqueue_smallest(order); + if (page) + goto out; + + gfp &= ~__GFP_UNMAPPED; + while (cache_order >= order) { + page = alloc_pages(gfp | __GFP_ZERO, cache_order); + if (page) + break; + cache_order--; + } + + if (page) { + unsigned long addr = (unsigned long)page_address(page); + unsigned long nr_pages = (1 << order); + unsigned long size = PAGE_SIZE * nr_pages; + + /* FIXME: set_direct_map_invalid_noflush may fail */ + for (int i = 0; i < nr_pages; i++) + set_direct_map_invalid_noflush(page + i); + + flush_tlb_kernel_range(addr, addr + size); + + /* + * FIXME: have this under lock so that allocation running + * in parallel won't steal all pages from the newly cached + * ones + */ + __free_one_page(page, cache_order, true); + page = __rmqueue_smallest(order); + + } + +out: + if (page) { + /* FIXME: __prep_new_page() expects page count of 0 */ + set_page_count(page, 0); + post_alloc_hook(page, order, GFP_KERNEL); + } + + return page; +} + +void unmapped_pages_free(struct page *page, int order) +{ + __free_one_page(page, order, false); +} + +int unmapped_alloc_init(void) +{ + for (int order = 0; order < MAX_ORDER; order++) { + struct unmapped_free_area *area = &free_area[order]; + INIT_LIST_HEAD(&area->free_list); + spin_lock_init(&area->lock); + } + + return 0; +}