Message ID | 20230920061856.257597-10-ying.huang@intel.com |
---|---|
State | New |
Headers | From: Huang Ying <ying.huang@intel.com> To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org Subject: [PATCH 09/10] mm, pcp: avoid to reduce PCP high unnecessarily Date: Wed, 20 Sep 2023 14:18:55 +0800 Message-Id: <20230920061856.257597-10-ying.huang@intel.com> In-Reply-To: <20230920061856.257597-1-ying.huang@intel.com> References: <20230920061856.257597-1-ying.huang@intel.com> |
Series | mm: PCP high auto-tuning |
Commit Message
Huang, Ying
Sept. 20, 2023, 6:18 a.m. UTC
In the PCP high auto-tuning algorithm, to minimize idle pages in the PCP,
the periodic vmstat updating kworker (via refresh_cpu_vm_stats())
decreases PCP high to try to free possible idle PCP pages.  One issue is
that even if the page allocating/freeing depth is larger than the maximal
PCP high, we may reduce PCP high unnecessarily.

To avoid the above issue, this patch tracks the minimal PCP page count.
The periodic PCP high decrement is then no more than the recent minimal
PCP page count, so only pages detected as idle are freed.

On a 2-socket Intel server with 224 logical CPUs, we tested kbuild on
one socket with `make -j 112`.  With the patch, the number of pages
allocated from the zone (instead of from the PCP) decreases by 25.8%.
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
---
 include/linux/mmzone.h |  1 +
 mm/page_alloc.c        | 15 ++++++++++-----
 2 files changed, 11 insertions(+), 5 deletions(-)
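Before the review discussion below, here is a minimal, stand-alone userspace sketch of the idea the changelog describes: the allocation path records the lowest PCP page count seen since the last decay pass, and the periodic decay never reduces the high watermark by more than that observed idle slack. The struct and helpers are simplified stand-ins for the kernel's per_cpu_pages, __rmqueue_pcplist() and decay_pcp_high(); locking, the batch clamp, and the vmstat kworker are deliberately omitted.

#include <stdio.h>

/* Simplified stand-ins for the kernel's per_cpu_pages fields. */
struct pcp_sim {
        int count;      /* pages currently in the PCP list */
        int count_min;  /* minimal count seen since the last decay */
        int high;       /* current high watermark */
        int high_min;   /* lower bound for high */
};

/* Allocation path: shrink count and remember the minimum. */
static void pcp_alloc(struct pcp_sim *pcp, int nr)
{
        pcp->count -= nr;
        if (pcp->count < pcp->count_min)
                pcp->count_min = pcp->count;
}

/*
 * Periodic decay: never reduce "high" by more than the recent minimal
 * page count, so only pages that stayed idle are targeted.  The 20%
 * cap mirrors the pcp->high / 5 limit in the patch.
 */
static void pcp_decay_high(struct pcp_sim *pcp)
{
        int decrease = pcp->count_min < pcp->high / 5 ?
                       pcp->count_min : pcp->high / 5;

        if (pcp->high - decrease > pcp->high_min)
                pcp->high -= decrease;
        else
                pcp->high = pcp->high_min;
        pcp->count_min = pcp->count;    /* restart tracking */
}

int main(void)
{
        struct pcp_sim pcp = { .count = 600, .count_min = 600,
                               .high = 1000, .high_min = 100 };

        pcp_alloc(&pcp, 550);   /* deep allocation burst: count_min drops to 50 */
        pcp.count = 600;        /* pages freed back before the next decay */
        pcp_decay_high(&pcp);
        /* high only drops by 50 (the idle part), not by 20% (200). */
        printf("high after decay: %d\n", pcp.high);
        return 0;
}

With these numbers the decay trims only the 50 pages that actually sat idle, instead of the flat 20% (200 pages) the old rule would have removed.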
Comments
On Wed, Sep 20, 2023 at 02:18:55PM +0800, Huang Ying wrote:
> In PCP high auto-tuning algorithm, to minimize idle pages in PCP, in
> periodic vmstat updating kworker (via refresh_cpu_vm_stats()), we will
> decrease PCP high to try to free possible idle PCP pages.  One issue
> is that even if the page allocating/freeing depth is larger than
> maximal PCP high, we may reduce PCP high unnecessarily.
>
> To avoid the above issue, in this patch, we will track the minimal PCP
> page count.  And, the periodic PCP high decrement will not more than
> the recent minimal PCP page count.  So, only detected idle pages will
> be freed.
>
> On a 2-socket Intel server with 224 logical CPU, we tested kbuild on
> one socket with `make -j 112`.  With the patch, The number of pages
> allocated from zone (instead of from PCP) decreases 25.8%.
>
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Johannes Weiner <jweiner@redhat.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Christoph Lameter <cl@linux.com>
> ---
>  include/linux/mmzone.h |  1 +
>  mm/page_alloc.c        | 15 ++++++++++-----
>  2 files changed, 11 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 8a19e2af89df..35b78c7522a7 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -682,6 +682,7 @@ enum zone_watermarks {
>  struct per_cpu_pages {
>  	spinlock_t lock;	/* Protects lists field */
>  	int count;		/* number of pages in the list */
> +	int count_min;		/* minimal number of pages in the list recently */
>  	int high;		/* high watermark, emptying needed */
>  	int high_min;		/* min high watermark */
>  	int high_max;		/* max high watermark */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3f8c7dfeed23..77e9b7b51688 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2166,19 +2166,20 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>   */
>  int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
>  {
> -	int high_min, to_drain, batch;
> +	int high_min, decrease, to_drain, batch;
>  	int todo = 0;
>
>  	high_min = READ_ONCE(pcp->high_min);
>  	batch = READ_ONCE(pcp->batch);
>  	/*
> -	 * Decrease pcp->high periodically to try to free possible
> -	 * idle PCP pages.  And, avoid to free too many pages to
> -	 * control latency.
> +	 * Decrease pcp->high periodically to free idle PCP pages counted
> +	 * via pcp->count_min.  And, avoid to free too many pages to
> +	 * control latency.  This caps pcp->high decrement too.
>  	 */
>  	if (pcp->high > high_min) {
> +		decrease = min(pcp->count_min, pcp->high / 5);

Not directly related to this patch but why 20%, it seems a bit
arbitrary. While this is not a fast path, using a divide rather than a
shift seems unnecessarily expensive.

>  		pcp->high = max3(pcp->count - (batch << PCP_BATCH_SCALE_MAX),
> -				 pcp->high * 4 / 5, high_min);
> +				 pcp->high - decrease, high_min);
>  		if (pcp->high > high_min)
>  			todo++;
>  	}
> @@ -2191,6 +2192,8 @@ int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
>  		todo++;
>  	}
>
> +	pcp->count_min = pcp->count;
> +
>  	return todo;
>  }
>
> @@ -2828,6 +2831,8 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
>  		page = list_first_entry(list, struct page, pcp_list);
>  		list_del(&page->pcp_list);
>  		pcp->count -= 1 << order;
> +		if (pcp->count < pcp->count_min)
> +			pcp->count_min = pcp->count;

While the accounting for this is in a relatively fast path.

At the moment I don't have a better suggestion but I'm not as keen on
this patch. It seems like it would have been more appropriate to decay if
there was no recent allocation activity tracked via pcp->flags. The major
caveat there is tracking a bit and clearing it may very well be in a fast
path unless it was tied to refills but that is subject to timing issues
and the allocation request stream :(

While you noted the difference in buddy allocations which may tie into
lock contention issues, how much difference does it make to the actual
performance of the workload?
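For reference, the alternative Mel describes (decaying only when there has been no recent allocation activity) might look roughly like the stand-alone sketch below. The PCPF_RECENT_ALLOC bit, both helpers, and the decay step are hypothetical, invented here purely for illustration; they are not part of the posted series.

#include <stdbool.h>
#include <stdio.h>

/*
 * Stand-alone sketch of the alternative described above: decay the high
 * watermark only when no allocation touched the PCP since the last
 * periodic pass.  The flag bit, both helpers, and the decay step are
 * hypothetical, for illustration only.
 */
#define PCPF_RECENT_ALLOC 0x1   /* hypothetical "recently allocated" bit */

struct pcp_sketch {
        int high;
        int high_min;
        unsigned int flags;
};

static void pcp_note_alloc(struct pcp_sketch *pcp)
{
        pcp->flags |= PCPF_RECENT_ALLOC;        /* would sit in the allocation fast path */
}

static void pcp_periodic_decay(struct pcp_sketch *pcp)
{
        bool idle = !(pcp->flags & PCPF_RECENT_ALLOC);

        pcp->flags &= ~PCPF_RECENT_ALLOC;       /* re-arm for the next period */
        if (idle && pcp->high > pcp->high_min)
                pcp->high -= (pcp->high - pcp->high_min) / 5;   /* arbitrary decay step */
}

int main(void)
{
        struct pcp_sketch pcp = { .high = 1000, .high_min = 100, .flags = 0 };

        pcp_note_alloc(&pcp);
        pcp_periodic_decay(&pcp);       /* skipped: allocations were recent */
        printf("high = %d\n", pcp.high);        /* 1000 */
        pcp_periodic_decay(&pcp);       /* idle period: high decays */
        printf("high = %d\n", pcp.high);        /* 820 */
        return 0;
}

As Mel notes, setting and clearing such a bit would itself live in a fast path, which is the main drawback of this variant.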
Mel Gorman <mgorman@techsingularity.net> writes:

> On Wed, Sep 20, 2023 at 02:18:55PM +0800, Huang Ying wrote:
>> In PCP high auto-tuning algorithm, to minimize idle pages in PCP, in
>> periodic vmstat updating kworker (via refresh_cpu_vm_stats()), we will
>> decrease PCP high to try to free possible idle PCP pages.  One issue
>> is that even if the page allocating/freeing depth is larger than
>> maximal PCP high, we may reduce PCP high unnecessarily.
>>
>> To avoid the above issue, in this patch, we will track the minimal PCP
>> page count.  And, the periodic PCP high decrement will not more than
>> the recent minimal PCP page count.  So, only detected idle pages will
>> be freed.
>>
>> On a 2-socket Intel server with 224 logical CPU, we tested kbuild on
>> one socket with `make -j 112`.  With the patch, The number of pages
>> allocated from zone (instead of from PCP) decreases 25.8%.
>>
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Mel Gorman <mgorman@techsingularity.net>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Johannes Weiner <jweiner@redhat.com>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Christoph Lameter <cl@linux.com>
>> ---
>>  include/linux/mmzone.h |  1 +
>>  mm/page_alloc.c        | 15 ++++++++++-----
>>  2 files changed, 11 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 8a19e2af89df..35b78c7522a7 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -682,6 +682,7 @@ enum zone_watermarks {
>>  struct per_cpu_pages {
>>  	spinlock_t lock;	/* Protects lists field */
>>  	int count;		/* number of pages in the list */
>> +	int count_min;		/* minimal number of pages in the list recently */
>>  	int high;		/* high watermark, emptying needed */
>>  	int high_min;		/* min high watermark */
>>  	int high_max;		/* max high watermark */
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 3f8c7dfeed23..77e9b7b51688 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -2166,19 +2166,20 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>>   */
>>  int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
>>  {
>> -	int high_min, to_drain, batch;
>> +	int high_min, decrease, to_drain, batch;
>>  	int todo = 0;
>>
>>  	high_min = READ_ONCE(pcp->high_min);
>>  	batch = READ_ONCE(pcp->batch);
>>  	/*
>> -	 * Decrease pcp->high periodically to try to free possible
>> -	 * idle PCP pages.  And, avoid to free too many pages to
>> -	 * control latency.
>> +	 * Decrease pcp->high periodically to free idle PCP pages counted
>> +	 * via pcp->count_min.  And, avoid to free too many pages to
>> +	 * control latency.  This caps pcp->high decrement too.
>>  	 */
>>  	if (pcp->high > high_min) {
>> +		decrease = min(pcp->count_min, pcp->high / 5);
>
> Not directly related to this patch but why 20%, it seems a bit
> arbitrary. While this is not a fast path, using a divide rather than a
> shift seems unnecessarily expensive.

Yes.  The number chosen is kind of arbitrary.  Will use ">> 3" (/ 8).

>>  		pcp->high = max3(pcp->count - (batch << PCP_BATCH_SCALE_MAX),
>> -				 pcp->high * 4 / 5, high_min);
>> +				 pcp->high - decrease, high_min);
>>  		if (pcp->high > high_min)
>>  			todo++;
>>  	}
>> @@ -2191,6 +2192,8 @@ int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
>>  		todo++;
>>  	}
>>
>> +	pcp->count_min = pcp->count;
>> +
>>  	return todo;
>>  }
>>
>> @@ -2828,6 +2831,8 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
>>  		page = list_first_entry(list, struct page, pcp_list);
>>  		list_del(&page->pcp_list);
>>  		pcp->count -= 1 << order;
>> +		if (pcp->count < pcp->count_min)
>> +			pcp->count_min = pcp->count;
>
> While the accounting for this is in a relatively fast path.
>
> At the moment I don't have a better suggestion but I'm not as keen on
> this patch. It seems like it would have been more appropriate to decay if
> there was no recent allocation activity tracked via pcp->flags. The major
> caveat there is tracking a bit and clearing it may very well be in a fast
> path unless it was tied to refills but that is subject to timing issues
> and the allocation request stream :(
>
> While you noted the difference in buddy allocations which may tie into
> lock contention issues, how much difference does it make to the actual
> performance of the workload?

Thanks to Andrew for his reminder about test results.  I found that I had
used an uncommon configuration to test kbuild in V1 of the patchset, so I
sent out V2 of the patchset as follows, with only the test results and
documentation changed.

https://lore.kernel.org/linux-mm/20230926060911.266511-1-ying.huang@intel.com/

So, for performance data, please refer to V2 of the patchset.  For this
patch, the performance data are,

"
On a 2-socket Intel server with 224 logical CPU, we run 8 kbuild
instances in parallel (each with `make -j 28`) in 8 cgroup.  This
simulates the kbuild server that is used by 0-Day kbuild service.
With the patch, The number of pages allocated from zone (instead of
from PCP) decreases 21.4%.
"

I also showed the performance number for each step of optimization as
follows (copied from the above patchset V2 link).

"
          build time   lock contend%   free_high   alloc_zone
          ----------   -------------   ---------   ----------
base           100.0            13.5       100.0        100.0
patch1          99.2            10.6        19.2         95.6
patch3          99.2            11.7         7.1         95.6
patch5          98.4            10.0         8.2         97.1
patch7          94.9             0.7         3.0         19.0
patch9          94.9             0.6         2.7         15.0    <-- this patch
patch10         94.9             0.9         8.8         18.6
"

Although I think the patch is helpful by avoiding the unnecessary
pcp->high decay and thus reducing zone lock contention, there is no
visible benchmark score change for this patch.

--
Best Regards,
Huang, Ying
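As an aside, the shift-based cap agreed on above might look like the sketch below. This is a stand-alone illustration, not code from any posted version of the series.

#include <stdio.h>

/*
 * Sketch of the revised cap discussed above: limit each periodic decay
 * to at most 1/8 of pcp->high (a shift instead of a divide), still
 * bounded by the recently observed minimal page count.
 */
static int capped_decrease(int count_min, int high)
{
        int cap = high >> 3;    /* was: high / 5 */

        return count_min < cap ? count_min : cap;
}

int main(void)
{
        printf("%d\n", capped_decrease(40, 1000));      /* 40: bounded by idle pages */
        printf("%d\n", capped_decrease(500, 1000));     /* 125: bounded by high >> 3 */
        return 0;
}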
On Thu, Oct 12, 2023 at 03:48:04PM +0800, Huang, Ying wrote:
> "
> On a 2-socket Intel server with 224 logical CPU, we run 8 kbuild
> instances in parallel (each with `make -j 28`) in 8 cgroup.  This
> simulates the kbuild server that is used by 0-Day kbuild service.
> With the patch, The number of pages allocated from zone (instead of
> from PCP) decreases 21.4%.
> "
>
> I also showed the performance number for each step of optimization as
> follows (copied from the above patchset V2 link).
>
> "
>           build time   lock contend%   free_high   alloc_zone
>           ----------   -------------   ---------   ----------
> base           100.0            13.5       100.0        100.0
> patch1          99.2            10.6        19.2         95.6
> patch3          99.2            11.7         7.1         95.6
> patch5          98.4            10.0         8.2         97.1
> patch7          94.9             0.7         3.0         19.0
> patch9          94.9             0.6         2.7         15.0    <-- this patch
> patch10         94.9             0.9         8.8         18.6
> "
>
> Although I think the patch is helpful by avoiding the unnecessary
> pcp->high decay and thus reducing zone lock contention, there is no
> visible benchmark score change for this patch.
>

Thanks!

Given that it's another PCP field with an update in a relatively hot
path, I would suggest dropping this patch entirely if it does not affect
performance. It has the risk of becoming a magical heuristic where we
later forget whether it's even worthwhile.
Mel Gorman <mgorman@techsingularity.net> writes:

> On Thu, Oct 12, 2023 at 03:48:04PM +0800, Huang, Ying wrote:
>> "
>> On a 2-socket Intel server with 224 logical CPU, we run 8 kbuild
>> instances in parallel (each with `make -j 28`) in 8 cgroup.  This
>> simulates the kbuild server that is used by 0-Day kbuild service.
>> With the patch, The number of pages allocated from zone (instead of
>> from PCP) decreases 21.4%.
>> "
>>
>> I also showed the performance number for each step of optimization as
>> follows (copied from the above patchset V2 link).
>>
>> "
>>           build time   lock contend%   free_high   alloc_zone
>>           ----------   -------------   ---------   ----------
>> base           100.0            13.5       100.0        100.0
>> patch1          99.2            10.6        19.2         95.6
>> patch3          99.2            11.7         7.1         95.6
>> patch5          98.4            10.0         8.2         97.1
>> patch7          94.9             0.7         3.0         19.0
>> patch9          94.9             0.6         2.7         15.0    <-- this patch
>> patch10         94.9             0.9         8.8         18.6
>> "
>>
>> Although I think the patch is helpful by avoiding the unnecessary
>> pcp->high decay and thus reducing zone lock contention, there is no
>> visible benchmark score change for this patch.
>>
>
> Thanks!
>
> Given that it's another PCP field with an update in a relatively hot
> path, I would suggest dropping this patch entirely if it does not affect
> performance. It has the risk of becoming a magical heuristic where we
> later forget whether it's even worthwhile.

OK.  I hope we can find some workloads that can benefit from the patch in
the future.

--
Best Regards,
Huang, Ying
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8a19e2af89df..35b78c7522a7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -682,6 +682,7 @@ enum zone_watermarks {
 struct per_cpu_pages {
 	spinlock_t lock;	/* Protects lists field */
 	int count;		/* number of pages in the list */
+	int count_min;		/* minimal number of pages in the list recently */
 	int high;		/* high watermark, emptying needed */
 	int high_min;		/* min high watermark */
 	int high_max;		/* max high watermark */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3f8c7dfeed23..77e9b7b51688 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2166,19 +2166,20 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
  */
 int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
 {
-	int high_min, to_drain, batch;
+	int high_min, decrease, to_drain, batch;
 	int todo = 0;
 
 	high_min = READ_ONCE(pcp->high_min);
 	batch = READ_ONCE(pcp->batch);
 	/*
-	 * Decrease pcp->high periodically to try to free possible
-	 * idle PCP pages.  And, avoid to free too many pages to
-	 * control latency.
+	 * Decrease pcp->high periodically to free idle PCP pages counted
+	 * via pcp->count_min.  And, avoid to free too many pages to
+	 * control latency.  This caps pcp->high decrement too.
 	 */
 	if (pcp->high > high_min) {
+		decrease = min(pcp->count_min, pcp->high / 5);
 		pcp->high = max3(pcp->count - (batch << PCP_BATCH_SCALE_MAX),
-				 pcp->high * 4 / 5, high_min);
+				 pcp->high - decrease, high_min);
 		if (pcp->high > high_min)
 			todo++;
 	}
@@ -2191,6 +2192,8 @@ int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
 		todo++;
 	}
 
+	pcp->count_min = pcp->count;
+
 	return todo;
 }
 
@@ -2828,6 +2831,8 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 		page = list_first_entry(list, struct page, pcp_list);
 		list_del(&page->pcp_list);
 		pcp->count -= 1 << order;
+		if (pcp->count < pcp->count_min)
+			pcp->count_min = pcp->count;
 	} while (check_new_pages(page, order));
 
 	return page;
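As a worked example of the clamp in the new decay_pcp_high(), the stand-alone sketch below plugs concrete numbers into the max3() expression. max3() is re-implemented locally, and the batch and PCP_BATCH_SCALE_MAX values are assumptions for illustration only.

#include <stdio.h>

#define PCP_BATCH_SCALE_MAX 5   /* assumed value for illustration */

static int max3(int a, int b, int c)
{
        int m = a > b ? a : b;
        return m > c ? m : c;
}

int main(void)
{
        int count = 900, batch = 63, high = 1000, high_min = 100;
        int count_min = 40;     /* only 40 pages stayed idle since last decay */
        int decrease = count_min < high / 5 ? count_min : high / 5;

        /*
         * The first term limits how many pages one decay pass may free
         * (at most batch << PCP_BATCH_SCALE_MAX), the second applies the
         * capped decrement, and the third keeps high above high_min.
         */
        high = max3(count - (batch << PCP_BATCH_SCALE_MAX),
                    high - decrease, high_min);
        printf("new high = %d\n", high);        /* 960: only the idle slack is trimmed */
        return 0;
}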