Message ID | 20230920061856.257597-2-ying.huang@intel.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:612c:172:b0:3f2:4152:657d with SMTP id h50csp4173425vqi; Wed, 20 Sep 2023 07:15:53 -0700 (PDT) X-Google-Smtp-Source: AGHT+IGf0cC98s6bvCBCilnsMWkb1kwpsxZZ+Ix1Q7TvVWUivE45TPMyMUVAcC9M4v1BVMDuJnNQ X-Received: by 2002:a05:6a00:2d19:b0:68e:3eab:9e18 with SMTP id fa25-20020a056a002d1900b0068e3eab9e18mr2614683pfb.12.1695219352790; Wed, 20 Sep 2023 07:15:52 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1695219352; cv=none; d=google.com; s=arc-20160816; b=ub73UFADPsBfiilxc7Sk+HBMfZ7J9MfXRpbS/UWrxfknw0d22Hx6lkYBhVII97JLU8 s+ZTOfuQtCRt4JmRz7ITngp1/rixGJ2xiU1Jy+9QXjjwV98gt7/A2PXr6GjCE6IOyw0p F5/2DQHOSlaI+7UnT6Ik31rGaZWeg4rDeNqz5eTCGglsYKriOsbNguPLFXC1jvueNNYJ y6sWuTNriivazbCOR6HwtzHz34ynv4FY7rYy0/+YBlXR9R5TmHMkSzaqmkiYoJxWToLl Bj6Av6d7nMZ8ypWYqa5Re/vxXpmBZfyfntjfnaxHECKsTvRqdvHBfbXxihgesgVKQftL PykQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=GPl4E/VNt1Dv9MdbE8wBSTxkzdvu+TPDhPll3vaYXgs=; fh=OlKm7LKbIdgbzv7m6ivtVBS9u5zco/nrHpeuJnEjCeg=; b=hGBYt5PUSPbee0PHJZiou0sSIa6CkE8QgzmVbvvR/i2CH1ipYxU7uyT1iDGAPxxCgF r9VdKAGxj2emCJgC5MYQdB0WByVphKC+ReuQTooksKVsRn0+Vhp0/rbudFyQpnNIwyYV TOB9Yw6qixgtc3lJ9w4EuoigI/AvemR/Y5tWYWRwDHqam6DzmKzDjPqPpCpawq+1Z1rM t9vhDMyM46i+sSkiCwhbSabJNus4dIo8P1znTaLlta/V8w0Vyssua953/JPdjKvohI3/ 5G3t9eOyaR7bZpF+B7Xexrdbb/cQVBcoAVx3r/eTT7VBQQ3RZ9GRvQNPpNYuLdAORP19 2Dcw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=JscwNsJ7; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:1 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: from morse.vger.email (morse.vger.email. 
[2620:137:e000::3:1]) by mx.google.com with ESMTPS id r20-20020a6560d4000000b00578b6e32b5dsi3126252pgv.405.2023.09.20.07.15.52 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 20 Sep 2023 07:15:52 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:1 as permitted sender) client-ip=2620:137:e000::3:1; Authentication-Results: mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=JscwNsJ7; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:1 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by morse.vger.email (Postfix) with ESMTP id 8C9058020B5B; Tue, 19 Sep 2023 23:20:00 -0700 (PDT) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.10 at morse.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233175AbjITGTr (ORCPT <rfc822;toshivichauhan@gmail.com> + 26 others); Wed, 20 Sep 2023 02:19:47 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:50840 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233150AbjITGTn (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Wed, 20 Sep 2023 02:19:43 -0400 Received: from mgamail.intel.com (mgamail.intel.com [134.134.136.126]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BCDF69D for <linux-kernel@vger.kernel.org>; Tue, 19 Sep 2023 23:19:37 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1695190777; x=1726726777; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=y5KqaVwntn1O5nTc80IjgYebQMrM7BtNCmGfR5qYkwg=; b=JscwNsJ7pqDILWogYR/nvN0sLD+OCVPadr9V0lRIcJP5vJrgX4zEShqS teJ4pVWjU/AmN1Z5uMKIjRtkXitW2QdP0iwXPyO9OdPGvVvfz2/rRz8fz w8j0JJSVZ51bdSTiSkb/LmI2P7/ZOB/7vPz7YNX5BVM3S5SQ7UUWpuuDQ iRVx+Do1eniOy7v/KREL5E+pYa7sRODKw8GkKAGgi7s2A26Tz4Mm1OBaA G5+QV7OqWMSN+512nZdWijau83Okth2Arz5taU88FlomqgpjTVfJBPOmB Yddkrr6j0MUXfUgqglSjWBDhRHkKSbTs9hqhoC3mvTWnefZ+b1XgaWPvM g==; X-IronPort-AV: E=McAfee;i="6600,9927,10838"; a="365187579" X-IronPort-AV: E=Sophos;i="6.02,161,1688454000"; d="scan'208";a="365187579" Received: from orsmga007.jf.intel.com ([10.7.209.58]) by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 19 Sep 2023 23:19:37 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10838"; a="740060503" X-IronPort-AV: E=Sophos;i="6.02,161,1688454000"; d="scan'208";a="740060503" Received: from yhuang6-mobl2.sh.intel.com ([10.238.6.133]) by orsmga007-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 19 Sep 2023 23:19:33 -0700 From: Huang Ying <ying.huang@intel.com> To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, Arjan Van De Ven <arjan@linux.intel.com>, Huang Ying <ying.huang@intel.com>, Andrew Morton <akpm@linux-foundation.org>, Mel Gorman <mgorman@techsingularity.net>, Vlastimil Babka <vbabka@suse.cz>, David Hildenbrand <david@redhat.com>, Johannes Weiner <jweiner@redhat.com>, Dave Hansen <dave.hansen@linux.intel.com>, Michal Hocko <mhocko@suse.com>, Pavel Tatashin <pasha.tatashin@soleen.com>, Matthew Wilcox <willy@infradead.org>, Christoph Lameter <cl@linux.com> Subject: [PATCH 01/10] mm, pcp: avoid to drain PCP when process exit Date: Wed, 20 Sep 2023 14:18:47 +0800 Message-Id: <20230920061856.257597-2-ying.huang@intel.com> 
X-Mailer: git-send-email 2.39.2 In-Reply-To: <20230920061856.257597-1-ying.huang@intel.com> References: <20230920061856.257597-1-ying.huang@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on morse.vger.email Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (morse.vger.email [0.0.0.0]); Tue, 19 Sep 2023 23:20:00 -0700 (PDT) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1777566328066656481 X-GMAIL-MSGID: 1777566328066656481 |
Series | mm: PCP high auto-tuning |
Commit Message
Huang, Ying
Sept. 20, 2023, 6:18 a.m. UTC
In commit f26b3fa04611 ("mm/page_alloc: limit number of high-order pages on PCP during bulk free"), the PCP (Per-CPU Pageset) is drained when the PCP is mostly used for freeing high-order pages, to improve cache-hot page reuse between the allocating and freeing CPUs.

But the PCP draining mechanism may be triggered unexpectedly when a process exits. With some customized trace points, it was found that PCP draining (free_high == true) was triggered by an order-1 page freeing with the following call stack:

 => free_unref_page_commit
 => free_unref_page
 => __mmdrop
 => exit_mm
 => do_exit
 => do_group_exit
 => __x64_sys_exit_group
 => do_syscall_64

Checking the source code, this is the page table PGD freeing (mm_free_pgd()). It is an order-1 page freeing if CONFIG_PAGE_TABLE_ISOLATION=y, which is a common configuration for security.

Just before that, page freeing with the following call stack was found:

 => free_unref_page_commit
 => free_unref_page_list
 => release_pages
 => tlb_batch_pages_flush
 => tlb_finish_mmu
 => exit_mmap
 => __mmput
 => exit_mm
 => do_exit
 => do_group_exit
 => __x64_sys_exit_group
 => do_syscall_64

So, when a process exits:

- a large number of user pages of the process will be freed without page allocation, so it is highly possible that pcp->free_factor becomes > 0;

- after freeing all user pages, the PGD will be freed, which is an order-1 page freeing, and the PCP will be drained.

All in all, when a process exits, it is highly possible that the PCP will be drained. This is an unexpected behavior.

To avoid this, with this patch, PCP draining will only be triggered by two consecutive high-order page freeings.

On a 2-socket Intel server with 224 logical CPUs, we tested kbuild on one socket with `make -j 112`. With the patch, the build time decreases 3.4% (from 206s to 199s). The cycles% of the spinlock contention (mostly for zone lock) decreases from 43.6% to 40.3% (with PCP size == 361). The number of PCP drainings for high-order page freeing (free_high) decreases 50.8%.

This helps network workloads too, via reduced zone lock contention. On a 2-socket Intel server with 128 logical CPUs, with the patch, the network bandwidth of the UNIX (AF_UNIX) test case of the lmbench test suite with 16-pair processes increases by 17.1%. The cycles% of the spinlock contention (mostly for zone lock) decreases from 50.0% to 45.8%. The number of PCP drainings for high-order page freeing (free_high) decreases 27.4%. The cache miss rate stays at 0.3%.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
---
 include/linux/mmzone.h |  5 ++++-
 mm/page_alloc.c        | 11 ++++++++---
 2 files changed, 12 insertions(+), 4 deletions(-)
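To make the new heuristic concrete, below is a minimal user-space C sketch of the state machine this patch adds to free_unref_page_commit(). The names mirror the kernel's but the struct and function are simplified stand-ins, and the non-high-order branch is collapsed to an unconditional flag clear (the kernel version guards the store); this is an illustration, not kernel code.

```c
#include <stdbool.h>
#include <stdio.h>

#define PAGE_ALLOC_COSTLY_ORDER   3
#define PCPF_PREV_FREE_HIGH_ORDER 0x01

struct pcp_sim {
	unsigned char flags;       /* remembers whether the previous free was high-order */
	unsigned char free_factor; /* > 0 after a burst of frees without allocation */
};

/* Returns true when this free would drain the PCP (free_high). */
static bool free_is_high(struct pcp_sim *pcp, int order)
{
	bool free_high = false;

	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
		/* Drain only if the previous free was also high-order. */
		free_high = pcp->free_factor &&
			    (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER);
		pcp->flags |= PCPF_PREV_FREE_HIGH_ORDER;
	} else {
		pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
	}
	return free_high;
}

int main(void)
{
	struct pcp_sim pcp = { .flags = 0, .free_factor = 1 };
	/* Process exit: many order-0 user-page frees, then one order-1 PGD free. */
	int exit_frees[] = { 0, 0, 0, 1 };
	/* Genuine high-order bulk free: consecutive order-1 frees. */
	int bulk_frees[] = { 1, 1, 1 };

	for (size_t i = 0; i < sizeof(exit_frees) / sizeof(int); i++)
		printf("exit: order %d -> drain %d\n", exit_frees[i],
		       free_is_high(&pcp, exit_frees[i]));

	pcp.flags = 0;
	for (size_t i = 0; i < sizeof(bulk_frees) / sizeof(int); i++)
		printf("bulk: order %d -> drain %d\n", bulk_frees[i],
		       free_is_high(&pcp, bulk_frees[i]));
	return 0;
}
```

Running it shows the exit pattern never drains (the order-0 frees keep clearing the flag, so the lone PGD free sees the flag unset), while the bulk pattern drains from the second consecutive high-order free onward.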
Comments
On Wed, Sep 20, 2023 at 02:18:47PM +0800, Huang Ying wrote:
> In commit f26b3fa04611 ("mm/page_alloc: limit number of high-order
> pages on PCP during bulk free"), the PCP (Per-CPU Pageset) will be
> drained when PCP is mostly used for high-order pages freeing to
> improve the cache-hot pages reusing between page allocation and
> freeing CPUs.
>
> [...]
>
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>

Acked-by: Mel Gorman <mgorman@techsingularity.net>

However, I want to note that batching on exit is not necessarily
unexpected. For processes that are multi-TB in size, the time to exit
can actually be quite large and batching is of benefit, but optimising
for exit is rarely a winning strategy. The pattern of "all allocs on
CPU B and all frees on CPU B" or "short-lived tasks triggering a
premature drain" is a bit more compelling, but not worth a changelog
rewrite.
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 4106fbc5b4b3..64d5ed2bb724 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -676,12 +676,15 @@ enum zone_watermarks {
>  #define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost)
>  #define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost)
>
> +#define PCPF_PREV_FREE_HIGH_ORDER 0x01
> +

The meaning of the flag and its intent should have been documented.
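For illustration, a documentation comment along the lines Mel is asking for might read as follows; the wording here is a suggestion, not the comment that eventually landed upstream.

```c
/*
 * Flags for pcp->flags (protected by pcp->lock).
 *
 * PCPF_PREV_FREE_HIGH_ORDER: the previous page freed to this PCP was a
 * high-order page (0 < order <= PAGE_ALLOC_COSTLY_ORDER).  Checked in
 * free_unref_page_commit() so that a free is treated as "free_high"
 * (draining the PCP) only after two consecutive high-order frees,
 * rather than on a lone high-order free such as the PGD freed at
 * process exit.
 */
#define PCPF_PREV_FREE_HIGH_ORDER	0x01
```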
On Wed, 11 Oct 2023 13:46:10 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:

> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -676,12 +676,15 @@ enum zone_watermarks {
> >  #define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost)
> >  #define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost)
> >
> > +#define PCPF_PREV_FREE_HIGH_ORDER 0x01
> > +
>
> The meaning of the flag and its intent should have been documented.

I need to rebase mm-stable for other reasons. So let's please decide
(soon) whether Mel's review comments can be addressed via add-on
patches or whether I should drop this version of this series
altogether, during that rebase.
Mel Gorman <mgorman@techsingularity.net> writes:

> On Wed, Sep 20, 2023 at 02:18:47PM +0800, Huang Ying wrote:
>> In commit f26b3fa04611 ("mm/page_alloc: limit number of high-order
>> pages on PCP during bulk free"), the PCP (Per-CPU Pageset) will be
>> drained when PCP is mostly used for high-order pages freeing to
>> improve the cache-hot pages reusing between page allocation and
>> freeing CPUs.
>>
>> [...]
>>
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>
> Acked-by: Mel Gorman <mgorman@techsingularity.net>
>
> However, I want to note that batching on exit is not necessarily
> unexpected. [...]
>
>> [...]
>>
>> +#define PCPF_PREV_FREE_HIGH_ORDER 0x01
>> +
>
> The meaning of the flag and its intent should have been documented.

Sure. Will add comments for the flags.

--
Best Regards,
Huang, Ying
On Wed, Oct 11, 2023 at 10:16:17AM -0700, Andrew Morton wrote:
> On Wed, 11 Oct 2023 13:46:10 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:
>
> > > +#define PCPF_PREV_FREE_HIGH_ORDER 0x01
> > > +
> >
> > The meaning of the flag and its intent should have been documented.
>
> I need to rebase mm-stable for other reasons. So let's please
> decide (soon) whether Mel's review comments can be addressed
> via add-on patches or whether I should drop this version of this
> series altogether, during that rebase.

The cache slice calculation is the only change I think may deserve a
respin, as it may have a material impact on the performance figures if
the "size_data" value changes by too much. Huang, what do you think, and
how long do you think it would take to update the performance figures?
As it may require multiple tests for each patch in the series, I would
also be OK with a follow-on patch like "mm: page_alloc: Simplify cache
size estimation for PCP tuning" that documents the limitation of summing
the unified caches and the impact, if any, on performance. It makes for
a messy history *but* it would also record the reasons why summing
hierarchies is not necessarily the best approach, which also has value.

I think patch 9 should be dropped, as it has no impact on headline
performance while adding a relatively tricky heuristic that updates
within a fast path. Again, I'd like to give Huang a chance to respond
and to evaluate whether it materially impacts patch 10 -- I don't think
it does, but I didn't think very hard about it. Even if patches 9 and 10
had to be dropped, it would not take much from the overall value of the
series.

Comments and documentation alone are not grounds for pulling the series,
but I hope they do get addressed in follow-on patches. I think requiring
them for accepting the series is unfair, even if the only reason is that
I took too long to review.
Mel Gorman <mgorman@techsingularity.net> writes:

> On Wed, Oct 11, 2023 at 10:16:17AM -0700, Andrew Morton wrote:
>> [...] So let's please decide (soon) whether Mel's review comments
>> can be addressed via add-on patches or whether I should drop this
>> version of this series altogether, during that rebase.
>
> The cache slice calculation is the only change I think may deserve a
> respin as it may have a material impact on the performance figures if
> the "size_data" value changes by too much. Huang, what do you think and
> how long do you think it would take to update the performance figures?
> [...]

I am OK to respin the series. It will take 3-4 days to update the
performance figures.

> I think patch 9 should be dropped as it has no impact on headline
> performance while adding a relatively tricky heuristic that updates
> within a fast path. [...] Even if patch 9+10 had to be dropped, it
> would not take much from the overall value of the series.

I am OK to drop patch 9, at least for now. In the future we may revisit
it when we find workloads that benefit more from it. It's not too hard
to rebase patch 10.

> Comments and documentation alone are not grounds for pulling the series
> but I hope they do get addressed in follow-on patches. I think requiring
> them for accepting the series is unfair even if the only reason is I took
> too long to review.

No worries, your reviews are very valuable!

--
Best Regards,
Huang, Ying
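For readers unfamiliar with the "cache slice calculation" being debated (it belongs to later patches in this series, which size the PCP from each CPU's share of the CPU caches), here is a rough user-space sketch of the general idea: sum the unified caches a CPU can use, dividing each by the number of CPUs sharing it. The sysfs layout used is standard Linux, but the heuristic itself is only an assumption-laden illustration of the concept under discussion, not the series' in-kernel code.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Count CPUs in a cpulist string such as "0-3,8". */
static int count_cpulist(const char *s)
{
	int count = 0;
	char *end;

	while (*s && *s != '\n') {
		long a = strtol(s, &end, 10), b = a;

		if (*end == '-')
			b = strtol(end + 1, &end, 10);
		count += (int)(b - a + 1);
		s = (*end == ',') ? end + 1 : end;
	}
	return count;
}

/* Read one line from a sysfs file into buf; returns 0 on success. */
static int read_line(const char *path, char *buf, size_t len)
{
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, len, f))
		buf[0] = '\0';
	fclose(f);
	return 0;
}

/* Sum this CPU's per-CPU slice of all unified caches, in KB. */
static long cache_slice_kb(int cpu)
{
	long slice = 0;

	for (int idx = 0; ; idx++) {
		char path[160], buf[64];
		long size_kb;
		int sharers;

		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu%d/cache/index%d/type",
			 cpu, idx);
		if (read_line(path, buf, sizeof(buf)))
			break;			/* no more cache indices */
		if (strncmp(buf, "Unified", 7))
			continue;		/* skip I/D-only caches */

		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu%d/cache/index%d/size",
			 cpu, idx);
		if (read_line(path, buf, sizeof(buf)))
			continue;
		size_kb = strtol(buf, NULL, 10);	/* e.g. "2048K" */

		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu%d/cache/index%d/shared_cpu_list",
			 cpu, idx);
		if (read_line(path, buf, sizeof(buf)))
			continue;
		sharers = count_cpulist(buf);

		if (sharers > 0)
			slice += size_kb / sharers;
	}
	return slice;
}

int main(void)
{
	printf("cpu0 unified cache slice: ~%ld KB\n", cache_slice_kb(0));
	return 0;
}
```

One reading of Mel's point about "summing the unified caches": adding an L2 slice to an L3 slice can over- or under-state the cache effectively available, depending on inclusivity, which is why the estimate may deserve either a respin or a follow-on patch documenting the limitation.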
```diff
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4106fbc5b4b3..64d5ed2bb724 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -676,12 +676,15 @@ enum zone_watermarks {
 #define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost)
 #define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost)
 
+#define PCPF_PREV_FREE_HIGH_ORDER	0x01
+
 struct per_cpu_pages {
 	spinlock_t lock;	/* Protects lists field */
 	int count;		/* number of pages in the list */
 	int high;		/* high watermark, emptying needed */
 	int batch;		/* chunk size for buddy add/remove */
-	short free_factor;	/* batch scaling factor during free */
+	u8 flags;		/* protected by pcp->lock */
+	u8 free_factor;		/* batch scaling factor during free */
 #ifdef CONFIG_NUMA
 	short expire;		/* When 0, remote pagesets are drained */
 #endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0c5be12f9336..828dcc24b030 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2370,7 +2370,7 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 {
 	int high;
 	int pindex;
-	bool free_high;
+	bool free_high = false;
 
 	__count_vm_events(PGFREE, 1 << order);
 	pindex = order_to_pindex(migratetype, order);
@@ -2383,8 +2383,13 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	 * freeing without allocation. The remainder after bulk freeing
 	 * stops will be drained from vmstat refresh context.
 	 */
-	free_high = (pcp->free_factor && order && order <= PAGE_ALLOC_COSTLY_ORDER);
-
+	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
+		free_high = (pcp->free_factor &&
+			     (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER));
+		pcp->flags |= PCPF_PREV_FREE_HIGH_ORDER;
+	} else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) {
+		pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
+	}
 	high = nr_pcp_high(pcp, zone, free_high);
 	if (pcp->count >= high) {
 		free_pcppages_bulk(zone, nr_pcp_free(pcp, high, free_high), pcp, pindex);
```
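One detail worth noting in the mmzone.h hunk: free_factor is narrowed from short to u8 so that the new flags byte fits without growing struct per_cpu_pages. A tiny stand-alone check of that packing argument (field names are borrowed from the kernel; otherwise this is just an illustration of C struct layout on typical ABIs):

```c
#include <stdint.h>
#include <stdio.h>

/* Before the patch: one 2-byte field after the ints. */
struct pcp_old {
	int count, high, batch;
	short free_factor;
};

/* After the patch: two 1-byte fields occupy the same 2 bytes. */
struct pcp_new {
	int count, high, batch;
	uint8_t flags;
	uint8_t free_factor;
};

int main(void)
{
	/* Both print the same size, so the flag costs no extra space. */
	printf("old: %zu bytes, new: %zu bytes\n",
	       sizeof(struct pcp_old), sizeof(struct pcp_new));
	return 0;
}
```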