Message ID | 770f9f61472b24b6bc89adbd71a77d9cf62bb54f.1686646361.git.baolin.wang@linux.alibaba.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a59:994d:0:b0:3d9:f83d:47d9 with SMTP id k13csp412852vqr; Tue, 13 Jun 2023 02:32:26 -0700 (PDT) X-Google-Smtp-Source: ACHHUZ7WFfYs8IVv74DfK093AE5rt45mqJVse/5sGc6s4d2jFQd7joE34gvAmpW17O2vRafm3hYi X-Received: by 2002:a05:6a20:3d8a:b0:10c:c407:92e5 with SMTP id s10-20020a056a203d8a00b0010cc40792e5mr16108748pzi.22.1686648745808; Tue, 13 Jun 2023 02:32:25 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1686648745; cv=none; d=google.com; s=arc-20160816; b=SRqMjKRoQVQ6NochJQnqNqG84HpnuhobGLHAjbS1hRdN5iVJL/iIFNXcbAupLSOFA/ Io3KgfEvOmP9a5eVVOzA932vdP5AemFVyBqQd/CxiapWa8fCrd9PrH6UzdPDPwqleh9u mjyJrDoPTRUvpjgrxOxeYukkKgdt7ePH1EWN8wGKW22lyA34zhwEYO/JHBzkS1DB0jNm 4wAoTOrsgDAdCTn5nWqL2+ftHXRGEa7E4NU1txz6YH40/Ufp6FHdBlcYHxnvJktn1yNX aYBQp/dPqy5ittxAGb4PQT28SgkjxWweuxQNnE8KgVIlofi41Z9L2jt49I6OgUDYfKP8 iImw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :message-id:date:subject:cc:to:from; bh=wMcBRFqfTE0STr9JCRRLkDjxGqHSrWjNxvDD1SWW8XA=; b=sS30vy0hpUhZfxJUBiUmkDohyiuttR84dwbfdu9OeYxrg832H8EOLenIeKB6gjBZ7V wMAlC0HMN2CSFRu4afzc3L7xf7vIlrc+Tdyd1CJlVMZfdNbGTWRukm7EUJMJydYwYQ0x pLF+n2l1MUku9DOIa40476dSgn9Cye5fDkDCAaLjwQsOk3e3MrOyy6uz9+7VHxR2+XWe 2fe8MkGpdVgLkibHbE2e9hgfVL/4c5tApx9tdyChCtrNdP8cABS+fXKsjKQ847+e5e1g FTJtOAAtXSdf9c9qzWg3BG6Si1GW3drzUlhbY/n78dwmpae+FDUi+gHHNplljZG7xCxe Nr/Q== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=alibaba.com Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id i18-20020a633c52000000b0053f4a727dacsi8572009pgn.251.2023.06.13.02.32.12; Tue, 13 Jun 2023 02:32:25 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=alibaba.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239549AbjFMIzd (ORCPT <rfc822;rust.linux@gmail.com> + 99 others); Tue, 13 Jun 2023 04:55:33 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60196 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S241849AbjFMIzS (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Tue, 13 Jun 2023 04:55:18 -0400 Received: from out30-100.freemail.mail.aliyun.com (out30-100.freemail.mail.aliyun.com [115.124.30.100]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 590F5C0 for <linux-kernel@vger.kernel.org>; Tue, 13 Jun 2023 01:55:17 -0700 (PDT) X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R961e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018046049;MF=baolin.wang@linux.alibaba.com;NM=1;PH=DS;RN=8;SR=0;TI=SMTPD_---0Vl1OoaE_1686646513; Received: from localhost(mailfrom:baolin.wang@linux.alibaba.com fp:SMTPD_---0Vl1OoaE_1686646513) by smtp.aliyun-inc.com; Tue, 13 Jun 2023 16:55:13 +0800 From: Baolin Wang <baolin.wang@linux.alibaba.com> To: akpm@linux-foundation.org Cc: mgorman@techsingularity.net, vbabka@suse.cz, david@redhat.com, ying.huang@intel.com, baolin.wang@linux.alibaba.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: [PATCH v2] mm: compaction: skip memory hole rapidly when isolating migratable pages Date: Tue, 13 Jun 2023 16:55:04 +0800 Message-Id: <770f9f61472b24b6bc89adbd71a77d9cf62bb54f.1686646361.git.baolin.wang@linux.alibaba.com> X-Mailer: git-send-email 2.27.0 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-9.9 required=5.0 tests=BAYES_00, ENV_AND_HDR_SPF_MATCH,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS, T_SCC_BODY_TEXT_LINE,UNPARSEABLE_RELAY,USER_IN_DEF_SPF_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1768579395073252396?= X-GMAIL-MSGID: =?utf-8?q?1768579395073252396?= |
Series |
[v2] mm: compaction: skip memory hole rapidly when isolating migratable pages
|
|
Commit Message
Baolin Wang
June 13, 2023, 8:55 a.m. UTC
On some machines, the normal zone can have a large memory hole like
below memory layout, and we can see the range from 0x100000000 to
0x1800000000 is a hole. So when isolating some migratable pages, the
scanner can meet the hole and it will take more time to skip the large
hole. From my measurement, I can see the isolation scanner will take
80us ~ 100us to skip the large hole [0x100000000 - 0x1800000000].
So adding a new helper to fast search next online memory section
to skip the large hole can help to find next suitable pageblock
efficiently. With this patch, I can see the large hole scanning only
takes < 1us.
[ 0.000000] Zone ranges:
[ 0.000000] DMA [mem 0x0000000040000000-0x00000000ffffffff]
[ 0.000000] DMA32 empty
[ 0.000000] Normal [mem 0x0000000100000000-0x0000001fa7ffffff]
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x0000000040000000-0x0000000fffffffff]
[ 0.000000] node 0: [mem 0x0000001800000000-0x0000001fa3c7ffff]
[ 0.000000] node 0: [mem 0x0000001fa3c80000-0x0000001fa3ffffff]
[ 0.000000] node 0: [mem 0x0000001fa4000000-0x0000001fa402ffff]
[ 0.000000] node 0: [mem 0x0000001fa4030000-0x0000001fa40effff]
[ 0.000000] node 0: [mem 0x0000001fa40f0000-0x0000001fa73cffff]
[ 0.000000] node 0: [mem 0x0000001fa73d0000-0x0000001fa745ffff]
[ 0.000000] node 0: [mem 0x0000001fa7460000-0x0000001fa746ffff]
[ 0.000000] node 0: [mem 0x0000001fa7470000-0x0000001fa758ffff]
[ 0.000000] node 0: [mem 0x0000001fa7590000-0x0000001fa7ffffff]
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
Changes from v1:
- Fix building errors if CONFIG_SPARSEMEM is not selected.
- Use NR_MEM_SECTIONS instead of '-1' per Huang Ying.
---
include/linux/mmzone.h | 10 ++++++++++
mm/compaction.c | 30 +++++++++++++++++++++++++++++-
2 files changed, 39 insertions(+), 1 deletion(-)
Comments
On 13.06.23 10:55, Baolin Wang wrote: > On some machines, the normal zone can have a large memory hole like > below memory layout, and we can see the range from 0x100000000 to > 0x1800000000 is a hole. So when isolating some migratable pages, the > scanner can meet the hole and it will take more time to skip the large > hole. From my measurement, I can see the isolation scanner will take > 80us ~ 100us to skip the large hole [0x100000000 - 0x1800000000]. > > So adding a new helper to fast search next online memory section > to skip the large hole can help to find next suitable pageblock > efficiently. With this patch, I can see the large hole scanning only > takes < 1us. > > [ 0.000000] Zone ranges: > [ 0.000000] DMA [mem 0x0000000040000000-0x00000000ffffffff] > [ 0.000000] DMA32 empty > [ 0.000000] Normal [mem 0x0000000100000000-0x0000001fa7ffffff] > [ 0.000000] Movable zone start for each node > [ 0.000000] Early memory node ranges > [ 0.000000] node 0: [mem 0x0000000040000000-0x0000000fffffffff] > [ 0.000000] node 0: [mem 0x0000001800000000-0x0000001fa3c7ffff] > [ 0.000000] node 0: [mem 0x0000001fa3c80000-0x0000001fa3ffffff] > [ 0.000000] node 0: [mem 0x0000001fa4000000-0x0000001fa402ffff] > [ 0.000000] node 0: [mem 0x0000001fa4030000-0x0000001fa40effff] > [ 0.000000] node 0: [mem 0x0000001fa40f0000-0x0000001fa73cffff] > [ 0.000000] node 0: [mem 0x0000001fa73d0000-0x0000001fa745ffff] > [ 0.000000] node 0: [mem 0x0000001fa7460000-0x0000001fa746ffff] > [ 0.000000] node 0: [mem 0x0000001fa7470000-0x0000001fa758ffff] > [ 0.000000] node 0: [mem 0x0000001fa7590000-0x0000001fa7ffffff] > > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> > --- > Changes from v1: > - Fix building errors if CONFIG_SPARSEMEM is not selected. > - Use NR_MEM_SECTIONS instead of '-1' per Huang Ying. > --- > include/linux/mmzone.h | 10 ++++++++++ > mm/compaction.c | 30 +++++++++++++++++++++++++++++- > 2 files changed, 39 insertions(+), 1 deletion(-) > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 5a7ada0413da..5ff1fa2efe28 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -2000,6 +2000,16 @@ static inline unsigned long next_present_section_nr(unsigned long section_nr) > return -1; > } > > +static inline unsigned long next_online_section_nr(unsigned long section_nr) > +{ > + while (++section_nr <= __highest_present_section_nr) { > + if (online_section_nr(section_nr)) > + return section_nr; > + } > + > + return NR_MEM_SECTIONS; > +} > + > /* > * These are _only_ used during initialisation, therefore they > * can use __initdata ... They could have names to indicate > diff --git a/mm/compaction.c b/mm/compaction.c > index 3398ef3a55fe..c31ff6123891 100644 > --- a/mm/compaction.c > +++ b/mm/compaction.c > @@ -229,6 +229,28 @@ static void reset_cached_positions(struct zone *zone) > pageblock_start_pfn(zone_end_pfn(zone) - 1); > } > > +#ifdef CONFIG_SPARSEMEM > +static unsigned long skip_hole_pageblock(unsigned long start_pfn) > +{ > + unsigned long next_online_nr; > + unsigned long start_nr = pfn_to_section_nr(start_pfn); > + > + if (online_section_nr(start_nr)) > + return 0; > + > + next_online_nr = next_online_section_nr(start_nr); > + if (next_online_nr < NR_MEM_SECTIONS) > + return section_nr_to_pfn(next_online_nr); > + I would simply inline next_online_section_nr and simplify (and add a comment): /* * If the PFN falls into an offline section, return the start PFN of the * next online section. If the PFN falls into an online section or if * there is no next online section, return 0. */ static unsigned long skip_hole_pageblock(unsigned long start_pfn) { unsigned long nr = pfn_to_section_nr(start_pfn); if (online_section_nr(nr)) return 0; while (++nr <= __highest_present_section_nr) { if (online_section_nr(nr)) return section_nr_to_pfn(nr); } return 0 } Easier, no? And maybe just call that function "skip_offline_sections()" then? Because we're not operating on pageblocks. > + return 0; > +} > +#else > +static unsigned long skip_hole_pageblock(unsigned long start_pfn) > +{ > + return 0; > +} > +#endif > + > /* > * Compound pages of >= pageblock_order should consistently be skipped until > * released. It is always pointless to compact pages of such order (if they are > @@ -1991,8 +2013,14 @@ static isolate_migrate_t isolate_migratepages(struct compact_control *cc) > > page = pageblock_pfn_to_page(block_start_pfn, > block_end_pfn, cc->zone); > - if (!page) > + if (!page) { > + unsigned long next_pfn; > + > + next_pfn = skip_hole_pageblock(block_start_pfn); > + if (next_pfn != 0) if (next_pfn) > + block_end_pfn = next_pfn; > continue; > + } > > /* > * If isolation recently failed, do not retry. Only check the
On 6/13/2023 5:56 PM, David Hildenbrand wrote: > On 13.06.23 10:55, Baolin Wang wrote: >> On some machines, the normal zone can have a large memory hole like >> below memory layout, and we can see the range from 0x100000000 to >> 0x1800000000 is a hole. So when isolating some migratable pages, the >> scanner can meet the hole and it will take more time to skip the large >> hole. From my measurement, I can see the isolation scanner will take >> 80us ~ 100us to skip the large hole [0x100000000 - 0x1800000000]. >> >> So adding a new helper to fast search next online memory section >> to skip the large hole can help to find next suitable pageblock >> efficiently. With this patch, I can see the large hole scanning only >> takes < 1us. >> >> [ 0.000000] Zone ranges: >> [ 0.000000] DMA [mem 0x0000000040000000-0x00000000ffffffff] >> [ 0.000000] DMA32 empty >> [ 0.000000] Normal [mem 0x0000000100000000-0x0000001fa7ffffff] >> [ 0.000000] Movable zone start for each node >> [ 0.000000] Early memory node ranges >> [ 0.000000] node 0: [mem 0x0000000040000000-0x0000000fffffffff] >> [ 0.000000] node 0: [mem 0x0000001800000000-0x0000001fa3c7ffff] >> [ 0.000000] node 0: [mem 0x0000001fa3c80000-0x0000001fa3ffffff] >> [ 0.000000] node 0: [mem 0x0000001fa4000000-0x0000001fa402ffff] >> [ 0.000000] node 0: [mem 0x0000001fa4030000-0x0000001fa40effff] >> [ 0.000000] node 0: [mem 0x0000001fa40f0000-0x0000001fa73cffff] >> [ 0.000000] node 0: [mem 0x0000001fa73d0000-0x0000001fa745ffff] >> [ 0.000000] node 0: [mem 0x0000001fa7460000-0x0000001fa746ffff] >> [ 0.000000] node 0: [mem 0x0000001fa7470000-0x0000001fa758ffff] >> [ 0.000000] node 0: [mem 0x0000001fa7590000-0x0000001fa7ffffff] >> >> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> >> --- >> Changes from v1: >> - Fix building errors if CONFIG_SPARSEMEM is not selected. >> - Use NR_MEM_SECTIONS instead of '-1' per Huang Ying. >> --- >> include/linux/mmzone.h | 10 ++++++++++ >> mm/compaction.c | 30 +++++++++++++++++++++++++++++- >> 2 files changed, 39 insertions(+), 1 deletion(-) >> >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h >> index 5a7ada0413da..5ff1fa2efe28 100644 >> --- a/include/linux/mmzone.h >> +++ b/include/linux/mmzone.h >> @@ -2000,6 +2000,16 @@ static inline unsigned long >> next_present_section_nr(unsigned long section_nr) >> return -1; >> } >> +static inline unsigned long next_online_section_nr(unsigned long >> section_nr) >> +{ >> + while (++section_nr <= __highest_present_section_nr) { >> + if (online_section_nr(section_nr)) >> + return section_nr; >> + } >> + >> + return NR_MEM_SECTIONS; >> +} >> + >> /* >> * These are _only_ used during initialisation, therefore they >> * can use __initdata ... They could have names to indicate >> diff --git a/mm/compaction.c b/mm/compaction.c >> index 3398ef3a55fe..c31ff6123891 100644 >> --- a/mm/compaction.c >> +++ b/mm/compaction.c >> @@ -229,6 +229,28 @@ static void reset_cached_positions(struct zone >> *zone) >> pageblock_start_pfn(zone_end_pfn(zone) - 1); >> } >> +#ifdef CONFIG_SPARSEMEM >> +static unsigned long skip_hole_pageblock(unsigned long start_pfn) >> +{ >> + unsigned long next_online_nr; >> + unsigned long start_nr = pfn_to_section_nr(start_pfn); >> + >> + if (online_section_nr(start_nr)) >> + return 0; >> + >> + next_online_nr = next_online_section_nr(start_nr); >> + if (next_online_nr < NR_MEM_SECTIONS) >> + return section_nr_to_pfn(next_online_nr); >> + > > I would simply inline next_online_section_nr and simplify (and add a > comment): > > /* > * If the PFN falls into an offline section, return the start PFN of the > * next online section. If the PFN falls into an online section or if > * there is no next online section, return 0. > */ > static unsigned long skip_hole_pageblock(unsigned long start_pfn) > { > unsigned long nr = pfn_to_section_nr(start_pfn); > > if (online_section_nr(nr)) > return 0; > > while (++nr <= __highest_present_section_nr) { > if (online_section_nr(nr)) > return section_nr_to_pfn(nr); > } > return 0 > } > > Easier, no? Originally I want to add a common helper like next_present_section_nr(), which can be used by other users. But yes, your suggestion is easier, and I am fine with that. > And maybe just call that function "skip_offline_sections()" then? > Because we're not operating on pageblocks. OK. Thanks.
On 13.06.23 13:13, Baolin Wang wrote: > > > On 6/13/2023 5:56 PM, David Hildenbrand wrote: >> On 13.06.23 10:55, Baolin Wang wrote: >>> On some machines, the normal zone can have a large memory hole like >>> below memory layout, and we can see the range from 0x100000000 to >>> 0x1800000000 is a hole. So when isolating some migratable pages, the >>> scanner can meet the hole and it will take more time to skip the large >>> hole. From my measurement, I can see the isolation scanner will take >>> 80us ~ 100us to skip the large hole [0x100000000 - 0x1800000000]. >>> >>> So adding a new helper to fast search next online memory section >>> to skip the large hole can help to find next suitable pageblock >>> efficiently. With this patch, I can see the large hole scanning only >>> takes < 1us. >>> >>> [ 0.000000] Zone ranges: >>> [ 0.000000] DMA [mem 0x0000000040000000-0x00000000ffffffff] >>> [ 0.000000] DMA32 empty >>> [ 0.000000] Normal [mem 0x0000000100000000-0x0000001fa7ffffff] >>> [ 0.000000] Movable zone start for each node >>> [ 0.000000] Early memory node ranges >>> [ 0.000000] node 0: [mem 0x0000000040000000-0x0000000fffffffff] >>> [ 0.000000] node 0: [mem 0x0000001800000000-0x0000001fa3c7ffff] >>> [ 0.000000] node 0: [mem 0x0000001fa3c80000-0x0000001fa3ffffff] >>> [ 0.000000] node 0: [mem 0x0000001fa4000000-0x0000001fa402ffff] >>> [ 0.000000] node 0: [mem 0x0000001fa4030000-0x0000001fa40effff] >>> [ 0.000000] node 0: [mem 0x0000001fa40f0000-0x0000001fa73cffff] >>> [ 0.000000] node 0: [mem 0x0000001fa73d0000-0x0000001fa745ffff] >>> [ 0.000000] node 0: [mem 0x0000001fa7460000-0x0000001fa746ffff] >>> [ 0.000000] node 0: [mem 0x0000001fa7470000-0x0000001fa758ffff] >>> [ 0.000000] node 0: [mem 0x0000001fa7590000-0x0000001fa7ffffff] >>> >>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> >>> --- >>> Changes from v1: >>> - Fix building errors if CONFIG_SPARSEMEM is not selected. >>> - Use NR_MEM_SECTIONS instead of '-1' per Huang Ying. >>> --- >>> include/linux/mmzone.h | 10 ++++++++++ >>> mm/compaction.c | 30 +++++++++++++++++++++++++++++- >>> 2 files changed, 39 insertions(+), 1 deletion(-) >>> >>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h >>> index 5a7ada0413da..5ff1fa2efe28 100644 >>> --- a/include/linux/mmzone.h >>> +++ b/include/linux/mmzone.h >>> @@ -2000,6 +2000,16 @@ static inline unsigned long >>> next_present_section_nr(unsigned long section_nr) >>> return -1; >>> } >>> +static inline unsigned long next_online_section_nr(unsigned long >>> section_nr) >>> +{ >>> + while (++section_nr <= __highest_present_section_nr) { >>> + if (online_section_nr(section_nr)) >>> + return section_nr; >>> + } >>> + >>> + return NR_MEM_SECTIONS; >>> +} >>> + >>> /* >>> * These are _only_ used during initialisation, therefore they >>> * can use __initdata ... They could have names to indicate >>> diff --git a/mm/compaction.c b/mm/compaction.c >>> index 3398ef3a55fe..c31ff6123891 100644 >>> --- a/mm/compaction.c >>> +++ b/mm/compaction.c >>> @@ -229,6 +229,28 @@ static void reset_cached_positions(struct zone >>> *zone) >>> pageblock_start_pfn(zone_end_pfn(zone) - 1); >>> } >>> +#ifdef CONFIG_SPARSEMEM >>> +static unsigned long skip_hole_pageblock(unsigned long start_pfn) >>> +{ >>> + unsigned long next_online_nr; >>> + unsigned long start_nr = pfn_to_section_nr(start_pfn); >>> + >>> + if (online_section_nr(start_nr)) >>> + return 0; >>> + >>> + next_online_nr = next_online_section_nr(start_nr); >>> + if (next_online_nr < NR_MEM_SECTIONS) >>> + return section_nr_to_pfn(next_online_nr); >>> + >> >> I would simply inline next_online_section_nr and simplify (and add a >> comment): >> >> /* >> * If the PFN falls into an offline section, return the start PFN of the >> * next online section. If the PFN falls into an online section or if >> * there is no next online section, return 0. >> */ >> static unsigned long skip_hole_pageblock(unsigned long start_pfn) >> { >> unsigned long nr = pfn_to_section_nr(start_pfn); >> >> if (online_section_nr(nr)) >> return 0; >> >> while (++nr <= __highest_present_section_nr) { >> if (online_section_nr(nr)) >> return section_nr_to_pfn(nr); >> } >> return 0 >> } >> >> Easier, no? > > Originally I want to add a common helper like next_present_section_nr(), > which can be used by other users. But yes, your suggestion is easier, > and I am fine with that. > >> And maybe just call that function "skip_offline_sections()" then? >> Because we're not operating on pageblocks. > > OK. Thanks. > Feel free to add to the simplified version Acked-by: David Hildenbrand <david@redhat.com>
David Hildenbrand <david@redhat.com> writes: > On 13.06.23 13:13, Baolin Wang wrote: >> On 6/13/2023 5:56 PM, David Hildenbrand wrote: >>> On 13.06.23 10:55, Baolin Wang wrote: >>>> On some machines, the normal zone can have a large memory hole like >>>> below memory layout, and we can see the range from 0x100000000 to >>>> 0x1800000000 is a hole. So when isolating some migratable pages, the >>>> scanner can meet the hole and it will take more time to skip the large >>>> hole. From my measurement, I can see the isolation scanner will take >>>> 80us ~ 100us to skip the large hole [0x100000000 - 0x1800000000]. >>>> >>>> So adding a new helper to fast search next online memory section >>>> to skip the large hole can help to find next suitable pageblock >>>> efficiently. With this patch, I can see the large hole scanning only >>>> takes < 1us. >>>> >>>> [ 0.000000] Zone ranges: >>>> [ 0.000000] DMA [mem 0x0000000040000000-0x00000000ffffffff] >>>> [ 0.000000] DMA32 empty >>>> [ 0.000000] Normal [mem 0x0000000100000000-0x0000001fa7ffffff] >>>> [ 0.000000] Movable zone start for each node >>>> [ 0.000000] Early memory node ranges >>>> [ 0.000000] node 0: [mem 0x0000000040000000-0x0000000fffffffff] >>>> [ 0.000000] node 0: [mem 0x0000001800000000-0x0000001fa3c7ffff] >>>> [ 0.000000] node 0: [mem 0x0000001fa3c80000-0x0000001fa3ffffff] >>>> [ 0.000000] node 0: [mem 0x0000001fa4000000-0x0000001fa402ffff] >>>> [ 0.000000] node 0: [mem 0x0000001fa4030000-0x0000001fa40effff] >>>> [ 0.000000] node 0: [mem 0x0000001fa40f0000-0x0000001fa73cffff] >>>> [ 0.000000] node 0: [mem 0x0000001fa73d0000-0x0000001fa745ffff] >>>> [ 0.000000] node 0: [mem 0x0000001fa7460000-0x0000001fa746ffff] >>>> [ 0.000000] node 0: [mem 0x0000001fa7470000-0x0000001fa758ffff] >>>> [ 0.000000] node 0: [mem 0x0000001fa7590000-0x0000001fa7ffffff] >>>> >>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> >>>> --- >>>> Changes from v1: >>>> - Fix building errors if CONFIG_SPARSEMEM is not selected. >>>> - Use NR_MEM_SECTIONS instead of '-1' per Huang Ying. >>>> --- >>>> include/linux/mmzone.h | 10 ++++++++++ >>>> mm/compaction.c | 30 +++++++++++++++++++++++++++++- >>>> 2 files changed, 39 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h >>>> index 5a7ada0413da..5ff1fa2efe28 100644 >>>> --- a/include/linux/mmzone.h >>>> +++ b/include/linux/mmzone.h >>>> @@ -2000,6 +2000,16 @@ static inline unsigned long >>>> next_present_section_nr(unsigned long section_nr) >>>> return -1; >>>> } >>>> +static inline unsigned long next_online_section_nr(unsigned long >>>> section_nr) >>>> +{ >>>> + while (++section_nr <= __highest_present_section_nr) { >>>> + if (online_section_nr(section_nr)) >>>> + return section_nr; >>>> + } >>>> + >>>> + return NR_MEM_SECTIONS; >>>> +} >>>> + >>>> /* >>>> * These are _only_ used during initialisation, therefore they >>>> * can use __initdata ... They could have names to indicate >>>> diff --git a/mm/compaction.c b/mm/compaction.c >>>> index 3398ef3a55fe..c31ff6123891 100644 >>>> --- a/mm/compaction.c >>>> +++ b/mm/compaction.c >>>> @@ -229,6 +229,28 @@ static void reset_cached_positions(struct zone >>>> *zone) >>>> pageblock_start_pfn(zone_end_pfn(zone) - 1); >>>> } >>>> +#ifdef CONFIG_SPARSEMEM >>>> +static unsigned long skip_hole_pageblock(unsigned long start_pfn) >>>> +{ >>>> + unsigned long next_online_nr; >>>> + unsigned long start_nr = pfn_to_section_nr(start_pfn); >>>> + >>>> + if (online_section_nr(start_nr)) >>>> + return 0; >>>> + >>>> + next_online_nr = next_online_section_nr(start_nr); >>>> + if (next_online_nr < NR_MEM_SECTIONS) >>>> + return section_nr_to_pfn(next_online_nr); >>>> + >>> >>> I would simply inline next_online_section_nr and simplify (and add a >>> comment): >>> >>> /* >>> * If the PFN falls into an offline section, return the start PFN of the >>> * next online section. If the PFN falls into an online section or if >>> * there is no next online section, return 0. >>> */ >>> static unsigned long skip_hole_pageblock(unsigned long start_pfn) >>> { >>> unsigned long nr = pfn_to_section_nr(start_pfn); >>> >>> if (online_section_nr(nr)) >>> return 0; >>> >>> while (++nr <= __highest_present_section_nr) { >>> if (online_section_nr(nr)) >>> return section_nr_to_pfn(nr); >>> } >>> return 0 >>> } >>> >>> Easier, no? >> Originally I want to add a common helper like >> next_present_section_nr(), >> which can be used by other users. But yes, your suggestion is easier, >> and I am fine with that. >> >>> And maybe just call that function "skip_offline_sections()" then? >>> Because we're not operating on pageblocks. >> OK. Thanks. >> > > Feel free to add to the simplified version > > Acked-by: David Hildenbrand <david@redhat.com> With David's above comments addressed, feel free to add Acked-by: "Huang, Ying" <ying.huang@intel.com>
On Tue, Jun 13, 2023 at 04:55:04PM +0800, Baolin Wang wrote: > On some machines, the normal zone can have a large memory hole like > below memory layout, and we can see the range from 0x100000000 to > 0x1800000000 is a hole. So when isolating some migratable pages, the > scanner can meet the hole and it will take more time to skip the large > hole. From my measurement, I can see the isolation scanner will take > 80us ~ 100us to skip the large hole [0x100000000 - 0x1800000000]. > > So adding a new helper to fast search next online memory section > to skip the large hole can help to find next suitable pageblock > efficiently. With this patch, I can see the large hole scanning only > takes < 1us. > > [ 0.000000] Zone ranges: > [ 0.000000] DMA [mem 0x0000000040000000-0x00000000ffffffff] > [ 0.000000] DMA32 empty > [ 0.000000] Normal [mem 0x0000000100000000-0x0000001fa7ffffff] > [ 0.000000] Movable zone start for each node > [ 0.000000] Early memory node ranges > [ 0.000000] node 0: [mem 0x0000000040000000-0x0000000fffffffff] > [ 0.000000] node 0: [mem 0x0000001800000000-0x0000001fa3c7ffff] > [ 0.000000] node 0: [mem 0x0000001fa3c80000-0x0000001fa3ffffff] > [ 0.000000] node 0: [mem 0x0000001fa4000000-0x0000001fa402ffff] > [ 0.000000] node 0: [mem 0x0000001fa4030000-0x0000001fa40effff] > [ 0.000000] node 0: [mem 0x0000001fa40f0000-0x0000001fa73cffff] > [ 0.000000] node 0: [mem 0x0000001fa73d0000-0x0000001fa745ffff] > [ 0.000000] node 0: [mem 0x0000001fa7460000-0x0000001fa746ffff] > [ 0.000000] node 0: [mem 0x0000001fa7470000-0x0000001fa758ffff] > [ 0.000000] node 0: [mem 0x0000001fa7590000-0x0000001fa7ffffff] > > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> This may only be necessary for non-contiguous zones so a check for zone_contiguous could be made but I suspect the saving, if any, would be marginal. However, it's subtle that block_end_pfn can end up in an arbirary location past the end of the zone or past cc->free_pfn. As the "continue" will update cc->migrate_pfn, that might lead to errors in the future. It would be a lot safer to pass in cc->free_pfn and do two things with the value. First, there is no point scanning for a valid online section past cc->free_pfn so terminating after cc->free_pfn may save some cycles. Second, cc->migrate_pfn does not end up with an arbitrary value which is a more defensive approach to any future programming errors.
On 6/14/2023 5:55 PM, Mel Gorman wrote: > On Tue, Jun 13, 2023 at 04:55:04PM +0800, Baolin Wang wrote: >> On some machines, the normal zone can have a large memory hole like >> below memory layout, and we can see the range from 0x100000000 to >> 0x1800000000 is a hole. So when isolating some migratable pages, the >> scanner can meet the hole and it will take more time to skip the large >> hole. From my measurement, I can see the isolation scanner will take >> 80us ~ 100us to skip the large hole [0x100000000 - 0x1800000000]. >> >> So adding a new helper to fast search next online memory section >> to skip the large hole can help to find next suitable pageblock >> efficiently. With this patch, I can see the large hole scanning only >> takes < 1us. >> >> [ 0.000000] Zone ranges: >> [ 0.000000] DMA [mem 0x0000000040000000-0x00000000ffffffff] >> [ 0.000000] DMA32 empty >> [ 0.000000] Normal [mem 0x0000000100000000-0x0000001fa7ffffff] >> [ 0.000000] Movable zone start for each node >> [ 0.000000] Early memory node ranges >> [ 0.000000] node 0: [mem 0x0000000040000000-0x0000000fffffffff] >> [ 0.000000] node 0: [mem 0x0000001800000000-0x0000001fa3c7ffff] >> [ 0.000000] node 0: [mem 0x0000001fa3c80000-0x0000001fa3ffffff] >> [ 0.000000] node 0: [mem 0x0000001fa4000000-0x0000001fa402ffff] >> [ 0.000000] node 0: [mem 0x0000001fa4030000-0x0000001fa40effff] >> [ 0.000000] node 0: [mem 0x0000001fa40f0000-0x0000001fa73cffff] >> [ 0.000000] node 0: [mem 0x0000001fa73d0000-0x0000001fa745ffff] >> [ 0.000000] node 0: [mem 0x0000001fa7460000-0x0000001fa746ffff] >> [ 0.000000] node 0: [mem 0x0000001fa7470000-0x0000001fa758ffff] >> [ 0.000000] node 0: [mem 0x0000001fa7590000-0x0000001fa7ffffff] >> >> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> > > This may only be necessary for non-contiguous zones so a check for > zone_contiguous could be made but I suspect the saving, if any, would be > marginal. Right. But the pageblock_pfn_to_page() have considered the contiguous case, and will not return NULL page for a contiguous zone. > However, it's subtle that block_end_pfn can end up in an arbirary location > past the end of the zone or past cc->free_pfn. As the "continue" will update > cc->migrate_pfn, that might lead to errors in the future. It would be a Ah, yes, thanks for pointing this out that I missed before. > lot safer to pass in cc->free_pfn and do two things with the value. First, > there is no point scanning for a valid online section past cc->free_pfn so > terminating after cc->free_pfn may save some cycles. Second, cc->migrate_pfn The skipping function introduced in this patch will only scan the first online section, so it can not terminate the scanning early by comparing if it is greater than cc->free_pfn. It can only compare the first online section with cc->free_pfn. > does not end up with an arbitrary value which is a more defensive approach > to any future programming errors. Right. So I think I should make sure the cc->migrate_pfn is not larger than cc->free_pfn with below change: @@ -1965,7 +1965,7 @@ static isolate_migrate_t isolate_migratepages(struct compact_control *cc) next_pfn = skip_offline_sections(block_start_pfn); if (next_pfn) - block_end_pfn = next_pfn; + block_end_pfn = min(next_pfn, cc->free_pfn); continue; }
Hi, Mel, Mel Gorman <mgorman@techsingularity.net> writes: > On Tue, Jun 13, 2023 at 04:55:04PM +0800, Baolin Wang wrote: >> On some machines, the normal zone can have a large memory hole like >> below memory layout, and we can see the range from 0x100000000 to >> 0x1800000000 is a hole. So when isolating some migratable pages, the >> scanner can meet the hole and it will take more time to skip the large >> hole. From my measurement, I can see the isolation scanner will take >> 80us ~ 100us to skip the large hole [0x100000000 - 0x1800000000]. >> >> So adding a new helper to fast search next online memory section >> to skip the large hole can help to find next suitable pageblock >> efficiently. With this patch, I can see the large hole scanning only >> takes < 1us. >> >> [ 0.000000] Zone ranges: >> [ 0.000000] DMA [mem 0x0000000040000000-0x00000000ffffffff] >> [ 0.000000] DMA32 empty >> [ 0.000000] Normal [mem 0x0000000100000000-0x0000001fa7ffffff] >> [ 0.000000] Movable zone start for each node >> [ 0.000000] Early memory node ranges >> [ 0.000000] node 0: [mem 0x0000000040000000-0x0000000fffffffff] >> [ 0.000000] node 0: [mem 0x0000001800000000-0x0000001fa3c7ffff] >> [ 0.000000] node 0: [mem 0x0000001fa3c80000-0x0000001fa3ffffff] >> [ 0.000000] node 0: [mem 0x0000001fa4000000-0x0000001fa402ffff] >> [ 0.000000] node 0: [mem 0x0000001fa4030000-0x0000001fa40effff] >> [ 0.000000] node 0: [mem 0x0000001fa40f0000-0x0000001fa73cffff] >> [ 0.000000] node 0: [mem 0x0000001fa73d0000-0x0000001fa745ffff] >> [ 0.000000] node 0: [mem 0x0000001fa7460000-0x0000001fa746ffff] >> [ 0.000000] node 0: [mem 0x0000001fa7470000-0x0000001fa758ffff] >> [ 0.000000] node 0: [mem 0x0000001fa7590000-0x0000001fa7ffffff] >> >> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> > > This may only be necessary for non-contiguous zones so a check for > zone_contiguous could be made but I suspect the saving, if any, would be > marginal. > > However, it's subtle that block_end_pfn can end up in an arbirary location > past the end of the zone or past cc->free_pfn. As the "continue" will update > cc->migrate_pfn, that might lead to errors in the future. It would be a > lot safer to pass in cc->free_pfn and do two things with the value. First, > there is no point scanning for a valid online section past cc->free_pfn so > terminating after cc->free_pfn may save some cycles. Second, cc->migrate_pfn > does not end up with an arbitrary value which is a more defensive approach > to any future programming errors. I have thought about this before. Originally, I had thought that we were safe because cc->free_pfn should be in a online section and block_end_pfn should reach cc->free_pfn before the end of zone. But after checking more code and thinking about it again, I found that the underlying sections may go offline under us during compaction. So that, cc->free_pfn may be in a offline section or after the end of zone. So, you are right, we need to consider the range of block_end_pfn. But, if we thought in this way (memory online/offline at any time), it appears that we need to check whether the underlying section was offlined. For example, is it safe to use "pfn_to_page()" in "isolate_migratepages_block()"? Is it possible for the underlying section to be offlined under us? Hi, David, can you teach me on this too? Best Regards, Huang, Ying
On 6/15/2023 11:22 AM, Huang, Ying wrote: > Hi, Mel, > > Mel Gorman <mgorman@techsingularity.net> writes: > >> On Tue, Jun 13, 2023 at 04:55:04PM +0800, Baolin Wang wrote: >>> On some machines, the normal zone can have a large memory hole like >>> below memory layout, and we can see the range from 0x100000000 to >>> 0x1800000000 is a hole. So when isolating some migratable pages, the >>> scanner can meet the hole and it will take more time to skip the large >>> hole. From my measurement, I can see the isolation scanner will take >>> 80us ~ 100us to skip the large hole [0x100000000 - 0x1800000000]. >>> >>> So adding a new helper to fast search next online memory section >>> to skip the large hole can help to find next suitable pageblock >>> efficiently. With this patch, I can see the large hole scanning only >>> takes < 1us. >>> >>> [ 0.000000] Zone ranges: >>> [ 0.000000] DMA [mem 0x0000000040000000-0x00000000ffffffff] >>> [ 0.000000] DMA32 empty >>> [ 0.000000] Normal [mem 0x0000000100000000-0x0000001fa7ffffff] >>> [ 0.000000] Movable zone start for each node >>> [ 0.000000] Early memory node ranges >>> [ 0.000000] node 0: [mem 0x0000000040000000-0x0000000fffffffff] >>> [ 0.000000] node 0: [mem 0x0000001800000000-0x0000001fa3c7ffff] >>> [ 0.000000] node 0: [mem 0x0000001fa3c80000-0x0000001fa3ffffff] >>> [ 0.000000] node 0: [mem 0x0000001fa4000000-0x0000001fa402ffff] >>> [ 0.000000] node 0: [mem 0x0000001fa4030000-0x0000001fa40effff] >>> [ 0.000000] node 0: [mem 0x0000001fa40f0000-0x0000001fa73cffff] >>> [ 0.000000] node 0: [mem 0x0000001fa73d0000-0x0000001fa745ffff] >>> [ 0.000000] node 0: [mem 0x0000001fa7460000-0x0000001fa746ffff] >>> [ 0.000000] node 0: [mem 0x0000001fa7470000-0x0000001fa758ffff] >>> [ 0.000000] node 0: [mem 0x0000001fa7590000-0x0000001fa7ffffff] >>> >>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> >> >> This may only be necessary for non-contiguous zones so a check for >> zone_contiguous could be made but I suspect the saving, if any, would be >> marginal. >> >> However, it's subtle that block_end_pfn can end up in an arbirary location >> past the end of the zone or past cc->free_pfn. As the "continue" will update >> cc->migrate_pfn, that might lead to errors in the future. It would be a >> lot safer to pass in cc->free_pfn and do two things with the value. First, >> there is no point scanning for a valid online section past cc->free_pfn so >> terminating after cc->free_pfn may save some cycles. Second, cc->migrate_pfn >> does not end up with an arbitrary value which is a more defensive approach >> to any future programming errors. > > I have thought about this before. Originally, I had thought that we > were safe because cc->free_pfn should be in a online section and > block_end_pfn should reach cc->free_pfn before the end of zone. But > after checking more code and thinking about it again, I found that the > underlying sections may go offline under us during compaction. So that, > cc->free_pfn may be in a offline section or after the end of zone. So, > you are right, we need to consider the range of block_end_pfn. > > But, if we thought in this way (memory online/offline at any time), it > appears that we need to check whether the underlying section was > offlined. For example, is it safe to use "pfn_to_page()" in > "isolate_migratepages_block()"? Is it possible for the underlying > section to be offlined under us? It is possible. There is a previous discussion[1] about the race between pfn_to_online_page() and memory offline. [1] https://lore.kernel.org/lkml/87zgc6buoq.fsf@nvidia.com/T/#m642d91bcc726437e1848b295bc57ce249c7ca399
Baolin Wang <baolin.wang@linux.alibaba.com> writes: > On 6/15/2023 11:22 AM, Huang, Ying wrote: >> Hi, Mel, >> Mel Gorman <mgorman@techsingularity.net> writes: >> >>> On Tue, Jun 13, 2023 at 04:55:04PM +0800, Baolin Wang wrote: >>>> On some machines, the normal zone can have a large memory hole like >>>> below memory layout, and we can see the range from 0x100000000 to >>>> 0x1800000000 is a hole. So when isolating some migratable pages, the >>>> scanner can meet the hole and it will take more time to skip the large >>>> hole. From my measurement, I can see the isolation scanner will take >>>> 80us ~ 100us to skip the large hole [0x100000000 - 0x1800000000]. >>>> >>>> So adding a new helper to fast search next online memory section >>>> to skip the large hole can help to find next suitable pageblock >>>> efficiently. With this patch, I can see the large hole scanning only >>>> takes < 1us. >>>> >>>> [ 0.000000] Zone ranges: >>>> [ 0.000000] DMA [mem 0x0000000040000000-0x00000000ffffffff] >>>> [ 0.000000] DMA32 empty >>>> [ 0.000000] Normal [mem 0x0000000100000000-0x0000001fa7ffffff] >>>> [ 0.000000] Movable zone start for each node >>>> [ 0.000000] Early memory node ranges >>>> [ 0.000000] node 0: [mem 0x0000000040000000-0x0000000fffffffff] >>>> [ 0.000000] node 0: [mem 0x0000001800000000-0x0000001fa3c7ffff] >>>> [ 0.000000] node 0: [mem 0x0000001fa3c80000-0x0000001fa3ffffff] >>>> [ 0.000000] node 0: [mem 0x0000001fa4000000-0x0000001fa402ffff] >>>> [ 0.000000] node 0: [mem 0x0000001fa4030000-0x0000001fa40effff] >>>> [ 0.000000] node 0: [mem 0x0000001fa40f0000-0x0000001fa73cffff] >>>> [ 0.000000] node 0: [mem 0x0000001fa73d0000-0x0000001fa745ffff] >>>> [ 0.000000] node 0: [mem 0x0000001fa7460000-0x0000001fa746ffff] >>>> [ 0.000000] node 0: [mem 0x0000001fa7470000-0x0000001fa758ffff] >>>> [ 0.000000] node 0: [mem 0x0000001fa7590000-0x0000001fa7ffffff] >>>> >>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> >>> >>> This may only be necessary for non-contiguous zones so a check for >>> zone_contiguous could be made but I suspect the saving, if any, would be >>> marginal. >>> >>> However, it's subtle that block_end_pfn can end up in an arbirary location >>> past the end of the zone or past cc->free_pfn. As the "continue" will update >>> cc->migrate_pfn, that might lead to errors in the future. It would be a >>> lot safer to pass in cc->free_pfn and do two things with the value. First, >>> there is no point scanning for a valid online section past cc->free_pfn so >>> terminating after cc->free_pfn may save some cycles. Second, cc->migrate_pfn >>> does not end up with an arbitrary value which is a more defensive approach >>> to any future programming errors. >> I have thought about this before. Originally, I had thought that we >> were safe because cc->free_pfn should be in a online section and >> block_end_pfn should reach cc->free_pfn before the end of zone. But >> after checking more code and thinking about it again, I found that the >> underlying sections may go offline under us during compaction. So that, >> cc->free_pfn may be in a offline section or after the end of zone. So, >> you are right, we need to consider the range of block_end_pfn. >> But, if we thought in this way (memory online/offline at any time), >> it >> appears that we need to check whether the underlying section was >> offlined. For example, is it safe to use "pfn_to_page()" in >> "isolate_migratepages_block()"? Is it possible for the underlying >> section to be offlined under us? > > It is possible. There is a previous discussion[1] about the race > between pfn_to_online_page() and memory offline. > > [1] > https://lore.kernel.org/lkml/87zgc6buoq.fsf@nvidia.com/T/#m642d91bcc726437e1848b295bc57ce249c7ca399 Thank you very much for sharing! That answers my questions directly! Best Regards, Huang, Ying
On 15.06.23 09:22, Huang, Ying wrote: > Baolin Wang <baolin.wang@linux.alibaba.com> writes: > >> On 6/15/2023 11:22 AM, Huang, Ying wrote: >>> Hi, Mel, >>> Mel Gorman <mgorman@techsingularity.net> writes: >>> >>>> On Tue, Jun 13, 2023 at 04:55:04PM +0800, Baolin Wang wrote: >>>>> On some machines, the normal zone can have a large memory hole like >>>>> below memory layout, and we can see the range from 0x100000000 to >>>>> 0x1800000000 is a hole. So when isolating some migratable pages, the >>>>> scanner can meet the hole and it will take more time to skip the large >>>>> hole. From my measurement, I can see the isolation scanner will take >>>>> 80us ~ 100us to skip the large hole [0x100000000 - 0x1800000000]. >>>>> >>>>> So adding a new helper to fast search next online memory section >>>>> to skip the large hole can help to find next suitable pageblock >>>>> efficiently. With this patch, I can see the large hole scanning only >>>>> takes < 1us. >>>>> >>>>> [ 0.000000] Zone ranges: >>>>> [ 0.000000] DMA [mem 0x0000000040000000-0x00000000ffffffff] >>>>> [ 0.000000] DMA32 empty >>>>> [ 0.000000] Normal [mem 0x0000000100000000-0x0000001fa7ffffff] >>>>> [ 0.000000] Movable zone start for each node >>>>> [ 0.000000] Early memory node ranges >>>>> [ 0.000000] node 0: [mem 0x0000000040000000-0x0000000fffffffff] >>>>> [ 0.000000] node 0: [mem 0x0000001800000000-0x0000001fa3c7ffff] >>>>> [ 0.000000] node 0: [mem 0x0000001fa3c80000-0x0000001fa3ffffff] >>>>> [ 0.000000] node 0: [mem 0x0000001fa4000000-0x0000001fa402ffff] >>>>> [ 0.000000] node 0: [mem 0x0000001fa4030000-0x0000001fa40effff] >>>>> [ 0.000000] node 0: [mem 0x0000001fa40f0000-0x0000001fa73cffff] >>>>> [ 0.000000] node 0: [mem 0x0000001fa73d0000-0x0000001fa745ffff] >>>>> [ 0.000000] node 0: [mem 0x0000001fa7460000-0x0000001fa746ffff] >>>>> [ 0.000000] node 0: [mem 0x0000001fa7470000-0x0000001fa758ffff] >>>>> [ 0.000000] node 0: [mem 0x0000001fa7590000-0x0000001fa7ffffff] >>>>> >>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> >>>> >>>> This may only be necessary for non-contiguous zones so a check for >>>> zone_contiguous could be made but I suspect the saving, if any, would be >>>> marginal. >>>> >>>> However, it's subtle that block_end_pfn can end up in an arbirary location >>>> past the end of the zone or past cc->free_pfn. As the "continue" will update >>>> cc->migrate_pfn, that might lead to errors in the future. It would be a >>>> lot safer to pass in cc->free_pfn and do two things with the value. First, >>>> there is no point scanning for a valid online section past cc->free_pfn so >>>> terminating after cc->free_pfn may save some cycles. Second, cc->migrate_pfn >>>> does not end up with an arbitrary value which is a more defensive approach >>>> to any future programming errors. >>> I have thought about this before. Originally, I had thought that we >>> were safe because cc->free_pfn should be in a online section and >>> block_end_pfn should reach cc->free_pfn before the end of zone. But >>> after checking more code and thinking about it again, I found that the >>> underlying sections may go offline under us during compaction. So that, >>> cc->free_pfn may be in a offline section or after the end of zone. So, >>> you are right, we need to consider the range of block_end_pfn. >>> But, if we thought in this way (memory online/offline at any time), >>> it >>> appears that we need to check whether the underlying section was >>> offlined. For example, is it safe to use "pfn_to_page()" in >>> "isolate_migratepages_block()"? Is it possible for the underlying >>> section to be offlined under us? >> >> It is possible. There is a previous discussion[1] about the race >> between pfn_to_online_page() and memory offline. >> >> [1] >> https://lore.kernel.org/lkml/87zgc6buoq.fsf@nvidia.com/T/#m642d91bcc726437e1848b295bc57ce249c7ca399 > > Thank you very much for sharing! That answers my questions directly! I remember another discussion (but can't find it) regarding why memory compaction can get away without pfn_to_online_page() all over the place. The use is limited to __reset_isolation_pfn(). But yes, in theory pfn_to_online_page() can race with memory offlining.
David Hildenbrand <david@redhat.com> writes: > On 15.06.23 09:22, Huang, Ying wrote: >> Baolin Wang <baolin.wang@linux.alibaba.com> writes: >> >>> On 6/15/2023 11:22 AM, Huang, Ying wrote: >>>> Hi, Mel, >>>> Mel Gorman <mgorman@techsingularity.net> writes: >>>> >>>>> On Tue, Jun 13, 2023 at 04:55:04PM +0800, Baolin Wang wrote: >>>>>> On some machines, the normal zone can have a large memory hole like >>>>>> below memory layout, and we can see the range from 0x100000000 to >>>>>> 0x1800000000 is a hole. So when isolating some migratable pages, the >>>>>> scanner can meet the hole and it will take more time to skip the large >>>>>> hole. From my measurement, I can see the isolation scanner will take >>>>>> 80us ~ 100us to skip the large hole [0x100000000 - 0x1800000000]. >>>>>> >>>>>> So adding a new helper to fast search next online memory section >>>>>> to skip the large hole can help to find next suitable pageblock >>>>>> efficiently. With this patch, I can see the large hole scanning only >>>>>> takes < 1us. >>>>>> >>>>>> [ 0.000000] Zone ranges: >>>>>> [ 0.000000] DMA [mem 0x0000000040000000-0x00000000ffffffff] >>>>>> [ 0.000000] DMA32 empty >>>>>> [ 0.000000] Normal [mem 0x0000000100000000-0x0000001fa7ffffff] >>>>>> [ 0.000000] Movable zone start for each node >>>>>> [ 0.000000] Early memory node ranges >>>>>> [ 0.000000] node 0: [mem 0x0000000040000000-0x0000000fffffffff] >>>>>> [ 0.000000] node 0: [mem 0x0000001800000000-0x0000001fa3c7ffff] >>>>>> [ 0.000000] node 0: [mem 0x0000001fa3c80000-0x0000001fa3ffffff] >>>>>> [ 0.000000] node 0: [mem 0x0000001fa4000000-0x0000001fa402ffff] >>>>>> [ 0.000000] node 0: [mem 0x0000001fa4030000-0x0000001fa40effff] >>>>>> [ 0.000000] node 0: [mem 0x0000001fa40f0000-0x0000001fa73cffff] >>>>>> [ 0.000000] node 0: [mem 0x0000001fa73d0000-0x0000001fa745ffff] >>>>>> [ 0.000000] node 0: [mem 0x0000001fa7460000-0x0000001fa746ffff] >>>>>> [ 0.000000] node 0: [mem 0x0000001fa7470000-0x0000001fa758ffff] >>>>>> [ 0.000000] node 0: [mem 0x0000001fa7590000-0x0000001fa7ffffff] >>>>>> >>>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> >>>>> >>>>> This may only be necessary for non-contiguous zones so a check for >>>>> zone_contiguous could be made but I suspect the saving, if any, would be >>>>> marginal. >>>>> >>>>> However, it's subtle that block_end_pfn can end up in an arbirary location >>>>> past the end of the zone or past cc->free_pfn. As the "continue" will update >>>>> cc->migrate_pfn, that might lead to errors in the future. It would be a >>>>> lot safer to pass in cc->free_pfn and do two things with the value. First, >>>>> there is no point scanning for a valid online section past cc->free_pfn so >>>>> terminating after cc->free_pfn may save some cycles. Second, cc->migrate_pfn >>>>> does not end up with an arbitrary value which is a more defensive approach >>>>> to any future programming errors. >>>> I have thought about this before. Originally, I had thought that we >>>> were safe because cc->free_pfn should be in a online section and >>>> block_end_pfn should reach cc->free_pfn before the end of zone. But >>>> after checking more code and thinking about it again, I found that the >>>> underlying sections may go offline under us during compaction. So that, >>>> cc->free_pfn may be in a offline section or after the end of zone. So, >>>> you are right, we need to consider the range of block_end_pfn. >>>> But, if we thought in this way (memory online/offline at any time), >>>> it >>>> appears that we need to check whether the underlying section was >>>> offlined. For example, is it safe to use "pfn_to_page()" in >>>> "isolate_migratepages_block()"? Is it possible for the underlying >>>> section to be offlined under us? >>> >>> It is possible. There is a previous discussion[1] about the race >>> between pfn_to_online_page() and memory offline. >>> >>> [1] >>> https://lore.kernel.org/lkml/87zgc6buoq.fsf@nvidia.com/T/#m642d91bcc726437e1848b295bc57ce249c7ca399 >> Thank you very much for sharing! That answers my questions >> directly! > > I remember another discussion (but can't find it) regarding why memory > compaction can get away without pfn_to_online_page() all over the > place. The use is limited to __reset_isolation_pfn(). Per my understanding, isolate_migratepages() -> pageblock_pfn_to_page() will check whether the pageblock is online. So if the pageblock isn't offlined afterwards, we can use pfn_to_page(). > But yes, in theory pfn_to_online_page() can race with memory offlining. Thanks for confirmation. Best Regards, Huang, Ying
On 15.06.23 10:38, Huang, Ying wrote: > David Hildenbrand <david@redhat.com> writes: > >> On 15.06.23 09:22, Huang, Ying wrote: >>> Baolin Wang <baolin.wang@linux.alibaba.com> writes: >>> >>>> On 6/15/2023 11:22 AM, Huang, Ying wrote: >>>>> Hi, Mel, >>>>> Mel Gorman <mgorman@techsingularity.net> writes: >>>>> >>>>>> On Tue, Jun 13, 2023 at 04:55:04PM +0800, Baolin Wang wrote: >>>>>>> On some machines, the normal zone can have a large memory hole like >>>>>>> below memory layout, and we can see the range from 0x100000000 to >>>>>>> 0x1800000000 is a hole. So when isolating some migratable pages, the >>>>>>> scanner can meet the hole and it will take more time to skip the large >>>>>>> hole. From my measurement, I can see the isolation scanner will take >>>>>>> 80us ~ 100us to skip the large hole [0x100000000 - 0x1800000000]. >>>>>>> >>>>>>> So adding a new helper to fast search next online memory section >>>>>>> to skip the large hole can help to find next suitable pageblock >>>>>>> efficiently. With this patch, I can see the large hole scanning only >>>>>>> takes < 1us. >>>>>>> >>>>>>> [ 0.000000] Zone ranges: >>>>>>> [ 0.000000] DMA [mem 0x0000000040000000-0x00000000ffffffff] >>>>>>> [ 0.000000] DMA32 empty >>>>>>> [ 0.000000] Normal [mem 0x0000000100000000-0x0000001fa7ffffff] >>>>>>> [ 0.000000] Movable zone start for each node >>>>>>> [ 0.000000] Early memory node ranges >>>>>>> [ 0.000000] node 0: [mem 0x0000000040000000-0x0000000fffffffff] >>>>>>> [ 0.000000] node 0: [mem 0x0000001800000000-0x0000001fa3c7ffff] >>>>>>> [ 0.000000] node 0: [mem 0x0000001fa3c80000-0x0000001fa3ffffff] >>>>>>> [ 0.000000] node 0: [mem 0x0000001fa4000000-0x0000001fa402ffff] >>>>>>> [ 0.000000] node 0: [mem 0x0000001fa4030000-0x0000001fa40effff] >>>>>>> [ 0.000000] node 0: [mem 0x0000001fa40f0000-0x0000001fa73cffff] >>>>>>> [ 0.000000] node 0: [mem 0x0000001fa73d0000-0x0000001fa745ffff] >>>>>>> [ 0.000000] node 0: [mem 0x0000001fa7460000-0x0000001fa746ffff] >>>>>>> [ 0.000000] node 0: [mem 0x0000001fa7470000-0x0000001fa758ffff] >>>>>>> [ 0.000000] node 0: [mem 0x0000001fa7590000-0x0000001fa7ffffff] >>>>>>> >>>>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> >>>>>> >>>>>> This may only be necessary for non-contiguous zones so a check for >>>>>> zone_contiguous could be made but I suspect the saving, if any, would be >>>>>> marginal. >>>>>> >>>>>> However, it's subtle that block_end_pfn can end up in an arbirary location >>>>>> past the end of the zone or past cc->free_pfn. As the "continue" will update >>>>>> cc->migrate_pfn, that might lead to errors in the future. It would be a >>>>>> lot safer to pass in cc->free_pfn and do two things with the value. First, >>>>>> there is no point scanning for a valid online section past cc->free_pfn so >>>>>> terminating after cc->free_pfn may save some cycles. Second, cc->migrate_pfn >>>>>> does not end up with an arbitrary value which is a more defensive approach >>>>>> to any future programming errors. >>>>> I have thought about this before. Originally, I had thought that we >>>>> were safe because cc->free_pfn should be in a online section and >>>>> block_end_pfn should reach cc->free_pfn before the end of zone. But >>>>> after checking more code and thinking about it again, I found that the >>>>> underlying sections may go offline under us during compaction. So that, >>>>> cc->free_pfn may be in a offline section or after the end of zone. So, >>>>> you are right, we need to consider the range of block_end_pfn. >>>>> But, if we thought in this way (memory online/offline at any time), >>>>> it >>>>> appears that we need to check whether the underlying section was >>>>> offlined. For example, is it safe to use "pfn_to_page()" in >>>>> "isolate_migratepages_block()"? Is it possible for the underlying >>>>> section to be offlined under us? >>>> >>>> It is possible. There is a previous discussion[1] about the race >>>> between pfn_to_online_page() and memory offline. >>>> >>>> [1] >>>> https://lore.kernel.org/lkml/87zgc6buoq.fsf@nvidia.com/T/#m642d91bcc726437e1848b295bc57ce249c7ca399 >>> Thank you very much for sharing! That answers my questions >>> directly! >> >> I remember another discussion (but can't find it) regarding why memory >> compaction can get away without pfn_to_online_page() all over the >> place. The use is limited to __reset_isolation_pfn(). > > Per my understanding, isolate_migratepages() -> pageblock_pfn_to_page() > will check whether the pageblock is online. So if the pageblock isn't > offlined afterwards, we can use pfn_to_page(). Oh, indeed, that was the magic bit, thanks!
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 5a7ada0413da..5ff1fa2efe28 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -2000,6 +2000,16 @@ static inline unsigned long next_present_section_nr(unsigned long section_nr) return -1; } +static inline unsigned long next_online_section_nr(unsigned long section_nr) +{ + while (++section_nr <= __highest_present_section_nr) { + if (online_section_nr(section_nr)) + return section_nr; + } + + return NR_MEM_SECTIONS; +} + /* * These are _only_ used during initialisation, therefore they * can use __initdata ... They could have names to indicate diff --git a/mm/compaction.c b/mm/compaction.c index 3398ef3a55fe..c31ff6123891 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -229,6 +229,28 @@ static void reset_cached_positions(struct zone *zone) pageblock_start_pfn(zone_end_pfn(zone) - 1); } +#ifdef CONFIG_SPARSEMEM +static unsigned long skip_hole_pageblock(unsigned long start_pfn) +{ + unsigned long next_online_nr; + unsigned long start_nr = pfn_to_section_nr(start_pfn); + + if (online_section_nr(start_nr)) + return 0; + + next_online_nr = next_online_section_nr(start_nr); + if (next_online_nr < NR_MEM_SECTIONS) + return section_nr_to_pfn(next_online_nr); + + return 0; +} +#else +static unsigned long skip_hole_pageblock(unsigned long start_pfn) +{ + return 0; +} +#endif + /* * Compound pages of >= pageblock_order should consistently be skipped until * released. It is always pointless to compact pages of such order (if they are @@ -1991,8 +2013,14 @@ static isolate_migrate_t isolate_migratepages(struct compact_control *cc) page = pageblock_pfn_to_page(block_start_pfn, block_end_pfn, cc->zone); - if (!page) + if (!page) { + unsigned long next_pfn; + + next_pfn = skip_hole_pageblock(block_start_pfn); + if (next_pfn != 0) + block_end_pfn = next_pfn; continue; + } /* * If isolation recently failed, do not retry. Only check the