Message ID | 20230710065325.290366-3-ying.huang@intel.com |
---|---|
State | New |
Headers |
From: Huang Ying <ying.huang@intel.com>
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, Arjan Van De Ven <arjan@linux.intel.com>, Huang Ying <ying.huang@intel.com>, Andrew Morton <akpm@linux-foundation.org>, Mel Gorman <mgorman@techsingularity.net>, Vlastimil Babka <vbabka@suse.cz>, David Hildenbrand <david@redhat.com>, Johannes Weiner <jweiner@redhat.com>, Dave Hansen <dave.hansen@linux.intel.com>, Michal Hocko <mhocko@suse.com>, Pavel Tatashin <pasha.tatashin@soleen.com>, Matthew Wilcox <willy@infradead.org>
Subject: [RFC 2/2] mm: alloc/free depth based PCP high auto-tuning
Date: Mon, 10 Jul 2023 14:53:25 +0800
Message-Id: <20230710065325.290366-3-ying.huang@intel.com>
In-Reply-To: <20230710065325.290366-1-ying.huang@intel.com>
References: <20230710065325.290366-1-ying.huang@intel.com> |
Series |
mm: PCP high auto-tuning
|
|
Commit Message
Huang, Ying
July 10, 2023, 6:53 a.m. UTC
To auto-tune PCP high for each CPU, an allocation/freeing depth based
PCP high auto-tuning algorithm is implemented in this patch.
The basic idea behind the algorithm is to detect the repetitive
allocation and freeing pattern with a short enough period (about 1
second). The period needs to be short to respond quickly to changes
in the allocation and freeing pattern and to control the memory
wasted by unnecessary caching.
To detect the repetitive allocation and freeing pattern, the
alloc/free depth is calculated for each tuning period (1 second) on
each CPU. To calculate the alloc/free depth, we track the alloc
count, which increases for page allocation from PCP and decreases for
page freeing to PCP. The alloc depth is the maximum rise of the alloc
count from an earlier low value to a later high value, while the free
depth is the maximum fall of the alloc count from an earlier high
value to a later low value.
Then, the average alloc/free depth over multiple tuning periods is
calculated, with the old alloc/free depth decaying gradually in the
average.
Finally, the PCP high is set to the smaller of the average alloc
depth and the average free depth, clamped between the default and the
maximum PCP high. In this way, pure allocation or pure freeing will
not enlarge the PCP high, because the PCP doesn't help there.
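The following is a minimal sketch of the depth tracking and tuning
described above, not the patch itself; the structure, the function
names (pcp_depth_state, pcp_track_alloc, pcp_tune_high) and the 3/4
decay factor are invented for illustration only:

/* Illustrative per-CPU state for the alloc/free depth heuristic. */
struct pcp_depth_state {
	int alloc_count;     /* ++ on alloc from PCP, -- on free to PCP */
	int count_min;       /* lowest alloc_count seen this period */
	int count_max;       /* highest alloc_count seen this period */
	int alloc_depth;     /* max rise from an earlier low */
	int free_depth;      /* max fall from an earlier high */
	int alloc_depth_avg; /* decayed averages over periods */
	int free_depth_avg;
};

/* Called on every PCP allocation/free in the sketch. */
static void pcp_track_alloc(struct pcp_depth_state *s, int nr, bool is_alloc)
{
	if (is_alloc) {
		s->alloc_count += nr;
		if (s->alloc_count > s->count_max)
			s->count_max = s->alloc_count;
		/* rise above the earlier minimum contributes to alloc depth */
		if (s->alloc_count - s->count_min > s->alloc_depth)
			s->alloc_depth = s->alloc_count - s->count_min;
	} else {
		s->alloc_count -= nr;
		if (s->alloc_count < s->count_min)
			s->count_min = s->alloc_count;
		/* fall below the earlier maximum contributes to free depth */
		if (s->count_max - s->alloc_count > s->free_depth)
			s->free_depth = s->count_max - s->alloc_count;
	}
}

/* Called once per tuning period (~1s) to update the decayed averages
 * and derive the new PCP high, clamped to [high_def, high_max]. */
static int pcp_tune_high(struct pcp_depth_state *s, int high_def, int high_max)
{
	int high;

	s->alloc_depth_avg = (s->alloc_depth_avg * 3 + s->alloc_depth) / 4;
	s->free_depth_avg = (s->free_depth_avg * 3 + s->free_depth) / 4;

	high = min(s->alloc_depth_avg, s->free_depth_avg);
	high = clamp(high, high_def, high_max);

	/* reset per-period tracking */
	s->count_min = s->count_max = s->alloc_count;
	s->alloc_depth = s->free_depth = 0;

	return high;
}

Because the new high is min(alloc depth, free depth), a pure
allocation stream (free depth near 0) or a pure freeing stream (alloc
depth near 0) keeps the result small, which is how the "PCP doesn't
help" case above avoids growing the cache.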
We have tested the algorithm with several workloads on Intel's
2-socket server machines.
Will-it-scale/page_fault1
=========================
On one socket of the system with 56 cores, 56 workload processes are
run to stress the kernel memory allocator. Each workload process is
put in a different memcg to eliminate the LRU lock contention.
                                   base   optimized
                                   ----   ---------
Throughput (GB/s)                  34.3   75.0
native_queued_spin_lock_slowpath%  60.9   0.2
This is a simple workload in which each process repeatedly allocates
128MB of pages and then frees them. So, it's quite easy to detect its
allocation and freeing pattern and adjust PCP high accordingly. The
optimized kernel almost eliminates the lock contention cycles% (from
60.9% to 0.2%), and its benchmark score increases by 118.7%.
Kbuild
======
"make -j 224" is used to build the kernel in parallel on the 2-socket
server system with 224 logical CPUs.
                                   base     optimized
                                   ----     ---------
Build time (s)                     162.67   162.67
native_queued_spin_lock_slowpath%  17.00    12.28
rmqueue%                           11.53    8.33
free_unref_page_list%              3.85     0.54
folio_lruvec_lock_irqsave%         1.21     1.96
The optimized kernel reduces the cycles% of the page
allocation/freeing functions from ~15.38% to ~8.87% by enlarging the
PCP high when necessary. The lock contention cycles% decreases too.
But the benchmark score shows no visible change; there should be
other bottlenecks.
We also captured /proc/meminfo during the test. After a short
while (about 10s) from the beginning of the test, the
Memused (MemTotal - MemFree) of the optimized kernel is higher than
that of the base kernel because of the increased PCP high. But in the
second half of the test, the Memused of the optimized kernel
decreases to the same level as that of the base kernel. That is, PCP
high is decreased effectively when the page allocation requirements
decrease.
Netperf/SCTP_STREAM_MANY
========================
On another 2-socket server with 128 logical CPUs, 64 netperf
processes are run with netserver running on the same machine (that
is, the loopback network is used).
                               base       optimized
                               ----       ---------
Throughput (MB/s)              7136       8489       +19.0%
vmstat.cpu.id%                 73.05      63.73      -9.3
vmstat.procs.r                 34.1       45.6       +33.5%
meminfo.Memused                5479861    8492875    +55.0%
perf-stat.ps.cpu-cycles        1.04e+11   1.38e+11   +32.3%
perf-stat.ps.instructions      0.96e+11   1.14e+11   +17.8%
perf-profile.free_unref_page%  2.46       1.65       -0.8
latency.99%.__alloc_pages      4.28       2.21       -48.4%
latency.99%.__free_unref_page  4.11       0.87       -78.8%
From the test results, the throughput of the benchmark increases by
19.0%. That comes from the increased CPU cycles and instructions per
second (perf-stat.ps.cpu-cycles and perf-stat.ps.instructions), that
is, reduced CPU idle. And perf-profile shows that the page allocator
cycles% isn't high. So, the reduced CPU idle may come from the
reduction in page allocation/freeing latency, which influences the
network behavior.
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Matthew Wilcox <willy@infradead.org>
---
include/linux/mmzone.h | 8 +++++++
mm/page_alloc.c | 50 ++++++++++++++++++++++++++++++++++++++++--
2 files changed, 56 insertions(+), 2 deletions(-)
Comments
On Mon 10-07-23 14:53:25, Huang Ying wrote: > To auto-tune PCP high for each CPU automatically, an > allocation/freeing depth based PCP high auto-tuning algorithm is > implemented in this patch. > > The basic idea behind the algorithm is to detect the repetitive > allocation and freeing pattern with short enough period (about 1 > second). The period needs to be short to respond to allocation and > freeing pattern changes quickly and control the memory wasted by > unnecessary caching. 1s is an ethernity from the allocation POV. Is a time based sampling really a good choice? I would have expected a natural allocation/freeing feedback mechanism. I.e. double the batch size when the batch is consumed and it requires to be refilled and shrink it under memory pressure (GFP_NOWAIT allocation fails) or when the surplus grows too high over batch (e.g. twice as much). Have you considered something as simple as that? Quite honestly I am not sure time based approach is a good choice because memory consumptions tends to be quite bulky (e.g. application starts or workload transitions based on requests). > To detect the repetitive allocation and freeing pattern, the > alloc/free depth is calculated for each tuning period (1 second) on > each CPU. To calculate the alloc/free depth, we track the alloc > count. Which increases for page allocation from PCP and decreases for > page freeing to PCP. The alloc depth is the maximum alloc count > difference between the later large value and former small value. > While, the free depth is the maximum alloc count difference between > the former large value and the later small value. > > Then, the average alloc/free depth in multiple tuning periods is > calculated, with the old alloc/free depth decay in the average > gradually. > > Finally, the PCP high is set to be the smaller value of average alloc > depth and average free depth, after clamped between the default and > the max PCP high. In this way, pure allocation or freeing will not > enlarge the PCP high because PCP doesn't help. > > We have tested the algorithm with several workloads on Intel's > 2-socket server machines. How does this scheme deal with memory pressure?
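For reference, the simple event-driven feedback Michal sketches above
could look roughly like this; pcp_on_refill(), pcp_on_pressure() and
pcp_on_free() are invented names for the sketch, not existing kernel
interfaces, and the growth/shrink factors are only placeholders:

/* Grow the batch when the PCP ran dry and had to be refilled from buddy. */
static void pcp_on_refill(struct per_cpu_pages *pcp, int batch_max)
{
	pcp->batch = min(pcp->batch * 2, batch_max);
}

/* Shrink under memory pressure, e.g. when a GFP_NOWAIT allocation fails. */
static void pcp_on_pressure(struct per_cpu_pages *pcp, int batch_min)
{
	pcp->batch = max(pcp->batch / 2, batch_min);
}

/* Also shrink when the cached surplus grows well beyond the batch. */
static void pcp_on_free(struct per_cpu_pages *pcp, int batch_min)
{
	if (pcp->count > 2 * pcp->batch)
		pcp_on_pressure(pcp, batch_min);
}

No timers are involved: growth is driven by refill events and
shrinking by pressure or surplus events.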
On Tue, Jul 11, 2023 at 01:19:46PM +0200, Michal Hocko wrote: > On Mon 10-07-23 14:53:25, Huang Ying wrote: > > To auto-tune PCP high for each CPU automatically, an > > allocation/freeing depth based PCP high auto-tuning algorithm is > > implemented in this patch. > > > > The basic idea behind the algorithm is to detect the repetitive > > allocation and freeing pattern with short enough period (about 1 > > second). The period needs to be short to respond to allocation and > > freeing pattern changes quickly and control the memory wasted by > > unnecessary caching. > > 1s is an ethernity from the allocation POV. Is a time based sampling > really a good choice? I would have expected a natural allocation/freeing > feedback mechanism. I.e. double the batch size when the batch is > consumed and it requires to be refilled and shrink it under memory > pressure (GFP_NOWAIT allocation fails) or when the surplus grows too > high over batch (e.g. twice as much). Have you considered something as > simple as that? > Quite honestly I am not sure time based approach is a good choice > because memory consumptions tends to be quite bulky (e.g. application > starts or workload transitions based on requests). > I tend to agree. Tuning based on the recent allocation pattern without frees would make more sense and also be symmetric with how free_factor works. I suspect that time-based may be heavily orientated around the will-it-scale benchmark. While I only glanced at this, a few things jumped out 1. Time-based heuristics are not ideal. congestion_wait() and friends was an obvious case where time-based heuristics fell apart even before the event it waited on was removed. For congestion, it happened to work for slow storage for a while but that was about it. For allocation stream detection, it has a similar problem. If a process is allocating heavily, then fine, if it's in bursts of less than a second more than one second apart then it will not adapt. While I do not think it is explicitly mentioned anywhere, my understanding was that heuristics like this within mm/ should be driven by explicit events as much as possible and not time. 2. If time was to be used, it would be cheaper to have the simpliest possible state tracking in the fast paths and decay any resizing of the PCP within the vmstat updates (reuse pcp->expire except it applies to local pcps). Even this is less than ideal as the PCP may be too large for short periods of time but it may also act as a backstop for worst-case behaviour 3. free_factor is an existing mechanism for detecting recent patterns and adapting the PCP sizes. The allocation side should be symmetric and the events that should drive it are "refills" on the alloc side and "drains" on the free side. Initially it might be easier to have a single parameter that scales batch and high up to a limit 4. The amount of state tracked seems excessive and increases the size of the per-cpu structure by more than 1 cache line. That in itself may not be a problem but the state is tracked on every page alloc/free that goes through the fast path and it's relatively complex to track. That is a constant penalty in fast paths that may not may not be relevant to the workload and only sustained bursty allocation streams may offset the cost. 5. Memory pressure and reclaim activity does not appear to be accounted for and it's not clear if pcp->high is bounded or if it's possible for a single PCP to hide a large number of pages from other CPUs sharing the same node. 
The max size of the PCP should probably be explicitly clamped.
Michal Hocko <mhocko@suse.com> writes: > On Mon 10-07-23 14:53:25, Huang Ying wrote: >> To auto-tune PCP high for each CPU automatically, an >> allocation/freeing depth based PCP high auto-tuning algorithm is >> implemented in this patch. >> >> The basic idea behind the algorithm is to detect the repetitive >> allocation and freeing pattern with short enough period (about 1 >> second). The period needs to be short to respond to allocation and >> freeing pattern changes quickly and control the memory wasted by >> unnecessary caching. > > 1s is an ethernity from the allocation POV. Is a time based sampling > really a good choice? I would have expected a natural allocation/freeing > feedback mechanism. I.e. double the batch size when the batch is > consumed and it requires to be refilled and shrink it under memory > pressure (GFP_NOWAIT allocation fails) or when the surplus grows too > high over batch (e.g. twice as much). Have you considered something as > simple as that? > Quite honestly I am not sure time based approach is a good choice > because memory consumptions tends to be quite bulky (e.g. application > starts or workload transitions based on requests). If my understanding were correct, we are targeting different allocation and freeing patterns. Your target pattern (A): - allocate several GB or more (application starts) in short time - allocate and free small number of pages in long time - free several GB or more (application exits) in short time My target pattern (B): - allocate several hundreds MB, free small number, in short time - free several hundreds MB, allocate small number, in short time - repeat the above pattern for long time For pattern A, my patchset will almost not help it. It may be helpful to increase PCP batch (I will do some test to verify that). But, it may be not a good idea to increase PCP high too much? Because we need to put more than several GB in PCP for long time. At the same time, these memory may be needed by other CPUs. It's hard to determine how long should we keep these pages in PCP. My patchset will help pattern B via identify the allocation/freeing depth and increasing PCP high to hold several hundreds MB in PCP. Then most allocation and freeing needs not to call buddy. And when the pattern completes, the PCP high will be restored and pages will be put back in buddy. I can observe the pattern B in kernel building and netperf/SCTP_STREAM_MANY workload. So, the zone lock contention or page allocation/freeing latency reduction can be observed. For example, the alloc_count for kbuild is as follows, +-------------------------------------------------------------------+ 90000 |-+ + + + + + + + + + E| | mm_pcp_alloc_count.0.2.alloc_count ***E****| 85000 |-+ **E * *| | EE E EE E** * E *| 80000 |-+ EE ** * * ** * * *| | E EEEE E * E E * * * E * *| 75000 |-+ E* EE * E * ** E * EE * E* E *| | E * E * * * * * * * E E * * * E* *| 70000 |-+ EE E EE *E E * E *E* E * * * E E| | E E EE *EE E *E E*EEEE E EE *EE | 65000 |EEE**EE EEE EE* EE EE E EEE EEEE EE | | E * E EE EEE E | 60000 |-+ *E* | | EE | 55000 |-+ + + EE + + + + + + | +-------------------------------------------------------------------+ 74 75 76 77 78 79 80 81 82 83 84 The X axis is time in second, the Y axis is alloc count (start from 0, ++ for allocation, -- for freeing). From the figure you can find that from 75s on, the allocation depth and freeing depth is about 30000, while repetitive period is about 1s. IMHO, both pattern A and pattern B exists in reality. 
>> To detect the repetitive allocation and freeing pattern, the >> alloc/free depth is calculated for each tuning period (1 second) on >> each CPU. To calculate the alloc/free depth, we track the alloc >> count. Which increases for page allocation from PCP and decreases for >> page freeing to PCP. The alloc depth is the maximum alloc count >> difference between the later large value and former small value. >> While, the free depth is the maximum alloc count difference between >> the former large value and the later small value. >> >> Then, the average alloc/free depth in multiple tuning periods is >> calculated, with the old alloc/free depth decay in the average >> gradually. >> >> Finally, the PCP high is set to be the smaller value of average alloc >> depth and average free depth, after clamped between the default and >> the max PCP high. In this way, pure allocation or freeing will not >> enlarge the PCP high because PCP doesn't help. >> >> We have tested the algorithm with several workloads on Intel's >> 2-socket server machines. > > How does this scheme deal with memory pressure? If my understanding were correct, ZONE_RECLAIM_ACTIVE will be set when kswapd is waken up (balance_pgdat()->set_reclaim_active()). Then (pcp->batch * 4) will be used as PCP high (free_unref_page_commit() -> nr_pcp_high()). So, the pages in PCP will be reduced. Best Regards, Huang, Ying
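For context, the existing reclaim-time clamp that Ying refers to here
looks roughly like the following, simplified from mm/page_alloc.c of
that era (the exact signature and details differ between kernel
versions):

static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone)
{
	int high = READ_ONCE(pcp->high);

	if (unlikely(!high))
		return 0;

	/* No reclaim running on this zone: the full high applies. */
	if (!test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
		return high;

	/*
	 * Reclaim is active: cap the number of pages that can be
	 * stored on the PCP lists to batch * 4.
	 */
	return min(READ_ONCE(pcp->batch) << 2, high);
}

So while kswapd is active on the zone, free_unref_page_commit()
effectively treats batch * 4 as the high mark regardless of any
auto-tuned value.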
Mel Gorman <mgorman@techsingularity.net> writes: > On Tue, Jul 11, 2023 at 01:19:46PM +0200, Michal Hocko wrote: >> On Mon 10-07-23 14:53:25, Huang Ying wrote: >> > To auto-tune PCP high for each CPU automatically, an >> > allocation/freeing depth based PCP high auto-tuning algorithm is >> > implemented in this patch. >> > >> > The basic idea behind the algorithm is to detect the repetitive >> > allocation and freeing pattern with short enough period (about 1 >> > second). The period needs to be short to respond to allocation and >> > freeing pattern changes quickly and control the memory wasted by >> > unnecessary caching. >> >> 1s is an ethernity from the allocation POV. Is a time based sampling >> really a good choice? I would have expected a natural allocation/freeing >> feedback mechanism. I.e. double the batch size when the batch is >> consumed and it requires to be refilled and shrink it under memory >> pressure (GFP_NOWAIT allocation fails) or when the surplus grows too >> high over batch (e.g. twice as much). Have you considered something as >> simple as that? >> Quite honestly I am not sure time based approach is a good choice >> because memory consumptions tends to be quite bulky (e.g. application >> starts or workload transitions based on requests). >> > > I tend to agree. Tuning based on the recent allocation pattern without > frees would make more sense and also be symmetric with how free_factor > works. This sounds good to me. I will give it a try to tune PCP batch. Have some questions about how to tune PCP high based on that. Details are in the following. > I suspect that time-based may be heavily orientated around the > will-it-scale benchmark. I totally agree that will-it-scale isn't a real workload. So, I tried to find some more practical ones. I found that a repetitive allocation/freeing several hundreds MB pages pattern exists in kernel building and netperf/SCTP_STREAM_MANY workloads. More details can be found in my reply to Michal as follows, https://lore.kernel.org/linux-mm/877cr4dydo.fsf@yhuang6-desk2.ccr.corp.intel.com/ > While I only glanced at this, a few things jumped out > > 1. Time-based heuristics are not ideal. congestion_wait() and > friends was an obvious case where time-based heuristics fell apart even > before the event it waited on was removed. For congestion, it happened to > work for slow storage for a while but that was about it. For allocation > stream detection, it has a similar problem. If a process is allocating > heavily, then fine, if it's in bursts of less than a second more than one > second apart then it will not adapt. While I do not think it is explicitly > mentioned anywhere, my understanding was that heuristics like this within > mm/ should be driven by explicit events as much as possible and not time. The proposed heuristics can be changed to be not time-based. When it is found that the allocation/freeing depth is larger than previous value, the PCP high can be increased immediately. We use time-based implementation to try to reduce the overhead. And, we mainly targeted long-time allocation pattern before. > 2. If time was to be used, it would be cheaper to have the simpliest possible > state tracking in the fast paths and decay any resizing of the PCP > within the vmstat updates (reuse pcp->expire except it applies to local > pcps). Even this is less than ideal as the PCP may be too large for short > periods of time but it may also act as a backstop for worst-case behaviour This sounds reasonable. Thanks! 
We will try this if we choose to use time-based implementation. > 3. free_factor is an existing mechanism for detecting recent patterns > and adapting the PCP sizes. The allocation side should be symmetric > and the events that should drive it are "refills" on the alloc side and > "drains" on the free side. Initially it might be easier to have a single > parameter that scales batch and high up to a limit For example, when a workload is started, several GB pages will be allocated. We will observe many "refills", and almost no "drains". So, we will scales batch and high up to a limit. When the workload exits, large number of pages of the workload will be put in PCP because the PCP high is increased. When should we decrease the PCP batch and high? > 4. The amount of state tracked seems excessive and increases the size of > the per-cpu structure by more than 1 cache line. That in itself may not > be a problem but the state is tracked on every page alloc/free that goes > through the fast path and it's relatively complex to track. That is > a constant penalty in fast paths that may not may not be relevant to the > workload and only sustained bursty allocation streams may offset the > cost. Yes. Thanks for pointing this out. We will optimize this if the other aspects of the basic idea could be accepted. > 5. Memory pressure and reclaim activity does not appear to be accounted > for and it's not clear if pcp->high is bounded or if it's possible for > a single PCP to hide a large number of pages from other CPUs sharing the > same node. The max size of the PCP should probably be explicitly clamped. As in my reply to Michal's email, previously I thought ZONE_RECLAIM_ACTIVE will be set for kswap reclaiming, and PCP high will be decreased to (batch * 4) effectively. Or I miss something? There's a upper limit for PCP high, which comes from MIN_PERCPU_PAGELIST_HIGH_FRACTION. That is, we calculate the PCP high as if percpu_pagelist_high_fraction is set to MIN_PERCPU_PAGELIST_HIGH_FRACTION (8). Then, we use that value as the upper limit for PCP high. Best Regards, Huang, Ying
On Wed 12-07-23 10:05:26, Mel Gorman wrote: > On Tue, Jul 11, 2023 at 01:19:46PM +0200, Michal Hocko wrote: > > On Mon 10-07-23 14:53:25, Huang Ying wrote: > > > To auto-tune PCP high for each CPU automatically, an > > > allocation/freeing depth based PCP high auto-tuning algorithm is > > > implemented in this patch. > > > > > > The basic idea behind the algorithm is to detect the repetitive > > > allocation and freeing pattern with short enough period (about 1 > > > second). The period needs to be short to respond to allocation and > > > freeing pattern changes quickly and control the memory wasted by > > > unnecessary caching. > > > > 1s is an ethernity from the allocation POV. Is a time based sampling > > really a good choice? I would have expected a natural allocation/freeing > > feedback mechanism. I.e. double the batch size when the batch is > > consumed and it requires to be refilled and shrink it under memory > > pressure (GFP_NOWAIT allocation fails) or when the surplus grows too > > high over batch (e.g. twice as much). Have you considered something as > > simple as that? > > Quite honestly I am not sure time based approach is a good choice > > because memory consumptions tends to be quite bulky (e.g. application > > starts or workload transitions based on requests). > > > > I tend to agree. Tuning based on the recent allocation pattern without frees > would make more sense and also be symmetric with how free_factor works. I > suspect that time-based may be heavily orientated around the will-it-scale > benchmark. While I only glanced at this, a few things jumped out > > 1. Time-based heuristics are not ideal. congestion_wait() and > friends was an obvious case where time-based heuristics fell apart even > before the event it waited on was removed. For congestion, it happened to > work for slow storage for a while but that was about it. For allocation > stream detection, it has a similar problem. If a process is allocating > heavily, then fine, if it's in bursts of less than a second more than one > second apart then it will not adapt. While I do not think it is explicitly > mentioned anywhere, my understanding was that heuristics like this within > mm/ should be driven by explicit events as much as possible and not time. Agreed. I would also like to point out that it is also important to realize those events that we should care about. Remember the primary motivation of the tuning is to reduce the lock contention. That being said, it is less of a problem to have stream or bursty demand for memory if that doesn't really cause the said contention, right? So any auto-tuning should consider that as well and do not inflate the batch in an absense of the contention. That of course means that a solely deallocation based monitoring.
On Thu, Jul 13, 2023 at 04:56:54PM +0800, Huang, Ying wrote: > Mel Gorman <mgorman@techsingularity.net> writes: > > > On Tue, Jul 11, 2023 at 01:19:46PM +0200, Michal Hocko wrote: > >> On Mon 10-07-23 14:53:25, Huang Ying wrote: > >> > To auto-tune PCP high for each CPU automatically, an > >> > allocation/freeing depth based PCP high auto-tuning algorithm is > >> > implemented in this patch. > >> > > >> > The basic idea behind the algorithm is to detect the repetitive > >> > allocation and freeing pattern with short enough period (about 1 > >> > second). The period needs to be short to respond to allocation and > >> > freeing pattern changes quickly and control the memory wasted by > >> > unnecessary caching. > >> > >> 1s is an ethernity from the allocation POV. Is a time based sampling > >> really a good choice? I would have expected a natural allocation/freeing > >> feedback mechanism. I.e. double the batch size when the batch is > >> consumed and it requires to be refilled and shrink it under memory > >> pressure (GFP_NOWAIT allocation fails) or when the surplus grows too > >> high over batch (e.g. twice as much). Have you considered something as > >> simple as that? > >> Quite honestly I am not sure time based approach is a good choice > >> because memory consumptions tends to be quite bulky (e.g. application > >> starts or workload transitions based on requests). > >> > > > > I tend to agree. Tuning based on the recent allocation pattern without > > frees would make more sense and also be symmetric with how free_factor > > works. > > This sounds good to me. I will give it a try to tune PCP batch. Have > some questions about how to tune PCP high based on that. Details are in > the following. > > > I suspect that time-based may be heavily orientated around the > > will-it-scale benchmark. > > I totally agree that will-it-scale isn't a real workload. So, I tried > to find some more practical ones. I found that a repetitive > allocation/freeing several hundreds MB pages pattern exists in kernel > building and netperf/SCTP_STREAM_MANY workloads. More details can be > found in my reply to Michal as follows, > > https://lore.kernel.org/linux-mm/877cr4dydo.fsf@yhuang6-desk2.ccr.corp.intel.com/ > > > While I only glanced at this, a few things jumped out > > > > 1. Time-based heuristics are not ideal. congestion_wait() and > > friends was an obvious case where time-based heuristics fell apart even > > before the event it waited on was removed. For congestion, it happened to > > work for slow storage for a while but that was about it. For allocation > > stream detection, it has a similar problem. If a process is allocating > > heavily, then fine, if it's in bursts of less than a second more than one > > second apart then it will not adapt. While I do not think it is explicitly > > mentioned anywhere, my understanding was that heuristics like this within > > mm/ should be driven by explicit events as much as possible and not time. > > The proposed heuristics can be changed to be not time-based. When it is > found that the allocation/freeing depth is larger than previous value, > the PCP high can be increased immediately. We use time-based > implementation to try to reduce the overhead. And, we mainly targeted > long-time allocation pattern before. > Time simply has too many corner cases. When it's reset for example, all state is lost so patterns that are longer than the time window are unpredictable. 
It tends to work slightly better than time decays state rather than resets but it gets very hand-wavey. > > 2. If time was to be used, it would be cheaper to have the simpliest possible > > state tracking in the fast paths and decay any resizing of the PCP > > within the vmstat updates (reuse pcp->expire except it applies to local > > pcps). Even this is less than ideal as the PCP may be too large for short > > periods of time but it may also act as a backstop for worst-case behaviour > > This sounds reasonable. Thanks! We will try this if we choose to use > time-based implementation. > > > 3. free_factor is an existing mechanism for detecting recent patterns > > and adapting the PCP sizes. The allocation side should be symmetric > > and the events that should drive it are "refills" on the alloc side and > > "drains" on the free side. Initially it might be easier to have a single > > parameter that scales batch and high up to a limit > > For example, when a workload is started, several GB pages will be > allocated. We will observe many "refills", and almost no "drains". So, > we will scales batch and high up to a limit. When the workload exits, > large number of pages of the workload will be put in PCP because the PCP > high is increased. When should we decrease the PCP batch and high? > Honestly, I'm not 100% certain as I haven't spent time with paper to sketch out the different combinations with "all allocs" at one end, "all frees" at the other and "ratio of alloc:free" in the middle. Intuitively I would expect the time to shrink is when there is a mix. All allocs -- maximise batch and high All frees -- maximise batch and high Mix -- adjust high to appoximate the minimum value of high such that a drain/refill does not occur or rarely occurs Batch should have a much lower maximum than high because it's a deferred cost that gets assigned to an arbitrary task. The worst case is where a process that is a light user of the allocator incurs the full cost of a refill/drain. Again, intuitively this may be PID Control problem for the "Mix" case to estimate the size of high required to minimise drains/allocs as each drain/alloc is potentially a lock contention. The catchall for corner cases would be to decay high from vmstat context based on pcp->expires. The decay would prevent the "high" being pinned at an artifically high value without any zone lock contention for prolonged periods of time and also mitigate worst-case due to state being per-cpu. The downside is that "high" would also oscillate for a continuous steady allocation pattern as the PID control might pick an ideal value suitable for a long period of time with the "decay" disrupting that ideal value. > > 4. The amount of state tracked seems excessive and increases the size of > > the per-cpu structure by more than 1 cache line. That in itself may not > > be a problem but the state is tracked on every page alloc/free that goes > > through the fast path and it's relatively complex to track. That is > > a constant penalty in fast paths that may not may not be relevant to the > > workload and only sustained bursty allocation streams may offset the > > cost. > > Yes. Thanks for pointing this out. We will optimize this if the other > aspects of the basic idea could be accepted. > I'm not opposed to having an adaptive pcp->high in concept. I think it would be best to disable adaptive tuning if percpu_pagelist_high_fraction is set though. 
I expect that users of that tunable are rare and that if it *is* used that there is a very good reason for it. > > 5. Memory pressure and reclaim activity does not appear to be accounted > > for and it's not clear if pcp->high is bounded or if it's possible for > > a single PCP to hide a large number of pages from other CPUs sharing the > > same node. The max size of the PCP should probably be explicitly clamped. > > As in my reply to Michal's email, previously I thought > ZONE_RECLAIM_ACTIVE will be set for kswap reclaiming, and PCP high will > be decreased to (batch * 4) effectively. Or I miss something? > I don't think you did, but it deserves a big comment in the tuning at minimum and potentially even disabling adaptive tuning entirely if reclaim is active.
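A rough sketch of the vmstat-side decay Mel suggests above; it is
illustrative only, the helper name and the halving rate are invented,
and hooking it into the periodic vmstat update (alongside the
existing pcp->expire handling) is an assumption:

/*
 * Called for the local CPU from the periodic vmstat update. Any boost
 * applied to pcp->high decays back toward the default over a few
 * intervals once refills/drains stop renewing it.
 */
static void decay_pcp_high(struct per_cpu_pages *pcp, int high_default)
{
	if (pcp->high <= high_default)
		return;

	/* halve the boost each interval instead of resetting it outright */
	pcp->high = high_default + (pcp->high - high_default) / 2;
}

The decay acts as the backstop described above: a sustained mix of
allocations and frees keeps renewing the boost, while a CPU that has
gone quiet hands its cached pages back to buddy within a few vmstat
intervals.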
Mel Gorman <mgorman@techsingularity.net> writes: > On Thu, Jul 13, 2023 at 04:56:54PM +0800, Huang, Ying wrote: >> Mel Gorman <mgorman@techsingularity.net> writes: >> >> > On Tue, Jul 11, 2023 at 01:19:46PM +0200, Michal Hocko wrote: >> >> On Mon 10-07-23 14:53:25, Huang Ying wrote: >> >> > To auto-tune PCP high for each CPU automatically, an >> >> > allocation/freeing depth based PCP high auto-tuning algorithm is >> >> > implemented in this patch. >> >> > >> >> > The basic idea behind the algorithm is to detect the repetitive >> >> > allocation and freeing pattern with short enough period (about 1 >> >> > second). The period needs to be short to respond to allocation and >> >> > freeing pattern changes quickly and control the memory wasted by >> >> > unnecessary caching. >> >> >> >> 1s is an ethernity from the allocation POV. Is a time based sampling >> >> really a good choice? I would have expected a natural allocation/freeing >> >> feedback mechanism. I.e. double the batch size when the batch is >> >> consumed and it requires to be refilled and shrink it under memory >> >> pressure (GFP_NOWAIT allocation fails) or when the surplus grows too >> >> high over batch (e.g. twice as much). Have you considered something as >> >> simple as that? >> >> Quite honestly I am not sure time based approach is a good choice >> >> because memory consumptions tends to be quite bulky (e.g. application >> >> starts or workload transitions based on requests). >> >> >> > >> > I tend to agree. Tuning based on the recent allocation pattern without >> > frees would make more sense and also be symmetric with how free_factor >> > works. >> >> This sounds good to me. I will give it a try to tune PCP batch. Have >> some questions about how to tune PCP high based on that. Details are in >> the following. >> >> > I suspect that time-based may be heavily orientated around the >> > will-it-scale benchmark. >> >> I totally agree that will-it-scale isn't a real workload. So, I tried >> to find some more practical ones. I found that a repetitive >> allocation/freeing several hundreds MB pages pattern exists in kernel >> building and netperf/SCTP_STREAM_MANY workloads. More details can be >> found in my reply to Michal as follows, >> >> https://lore.kernel.org/linux-mm/877cr4dydo.fsf@yhuang6-desk2.ccr.corp.intel.com/ >> >> > While I only glanced at this, a few things jumped out >> > >> > 1. Time-based heuristics are not ideal. congestion_wait() and >> > friends was an obvious case where time-based heuristics fell apart even >> > before the event it waited on was removed. For congestion, it happened to >> > work for slow storage for a while but that was about it. For allocation >> > stream detection, it has a similar problem. If a process is allocating >> > heavily, then fine, if it's in bursts of less than a second more than one >> > second apart then it will not adapt. While I do not think it is explicitly >> > mentioned anywhere, my understanding was that heuristics like this within >> > mm/ should be driven by explicit events as much as possible and not time. >> >> The proposed heuristics can be changed to be not time-based. When it is >> found that the allocation/freeing depth is larger than previous value, >> the PCP high can be increased immediately. We use time-based >> implementation to try to reduce the overhead. And, we mainly targeted >> long-time allocation pattern before. >> > > Time simply has too many corner cases. 
When it's reset for example, all > state is lost so patterns that are longer than the time window are > unpredictable. It tends to work slightly better than time decays state > rather than resets but it gets very hand-wavey. > >> > 2. If time was to be used, it would be cheaper to have the simpliest possible >> > state tracking in the fast paths and decay any resizing of the PCP >> > within the vmstat updates (reuse pcp->expire except it applies to local >> > pcps). Even this is less than ideal as the PCP may be too large for short >> > periods of time but it may also act as a backstop for worst-case behaviour >> >> This sounds reasonable. Thanks! We will try this if we choose to use >> time-based implementation. >> >> > 3. free_factor is an existing mechanism for detecting recent patterns >> > and adapting the PCP sizes. The allocation side should be symmetric >> > and the events that should drive it are "refills" on the alloc side and >> > "drains" on the free side. Initially it might be easier to have a single >> > parameter that scales batch and high up to a limit >> >> For example, when a workload is started, several GB pages will be >> allocated. We will observe many "refills", and almost no "drains". So, >> we will scales batch and high up to a limit. When the workload exits, >> large number of pages of the workload will be put in PCP because the PCP >> high is increased. When should we decrease the PCP batch and high? >> > > Honestly, I'm not 100% certain as I haven't spent time with paper to > sketch out the different combinations with "all allocs" at one end, "all > frees" at the other and "ratio of alloc:free" in the middle. Intuitively > I would expect the time to shrink is when there is a mix. > > All allocs -- maximise batch and high > All frees -- maximise batch and high > Mix -- adjust high to appoximate the minimum value of high such > that a drain/refill does not occur or rarely occurs > > Batch should have a much lower maximum than high because it's a deferred cost > that gets assigned to an arbitrary task. The worst case is where a process > that is a light user of the allocator incurs the full cost of a refill/drain. > > Again, intuitively this may be PID Control problem for the "Mix" case > to estimate the size of high required to minimise drains/allocs as each > drain/alloc is potentially a lock contention. The catchall for corner > cases would be to decay high from vmstat context based on pcp->expires. The > decay would prevent the "high" being pinned at an artifically high value > without any zone lock contention for prolonged periods of time and also > mitigate worst-case due to state being per-cpu. The downside is that "high" > would also oscillate for a continuous steady allocation pattern as the PID > control might pick an ideal value suitable for a long period of time with > the "decay" disrupting that ideal value. Maybe we can track the minimal value of pcp->count. If it's small enough recently, we can avoid to decay pcp->high. Because the pages in PCP are used for allocations instead of idle. Another question is as follows. For example, on CPU A, a large number of pages are freed, and we maximize batch and high. So, a large number of pages are put in PCP. Then, the possible situations may be, a) a large number of pages are allocated on CPU A after some time b) a large number of pages are allocated on another CPU B For a), we want the pages are kept in PCP of CPU A as long as possible. For b), we want the pages are kept in PCP of CPU A as short as possible. 
I think that we need to balance between them. What is the reasonable time to keep pages in PCP without many allocations? >> > 4. The amount of state tracked seems excessive and increases the size of >> > the per-cpu structure by more than 1 cache line. That in itself may not >> > be a problem but the state is tracked on every page alloc/free that goes >> > through the fast path and it's relatively complex to track. That is >> > a constant penalty in fast paths that may not may not be relevant to the >> > workload and only sustained bursty allocation streams may offset the >> > cost. >> >> Yes. Thanks for pointing this out. We will optimize this if the other >> aspects of the basic idea could be accepted. >> > > I'm not opposed to having an adaptive pcp->high in concept. I think it would > be best to disable adaptive tuning if percpu_pagelist_high_fraction is set > though. I expect that users of that tunable are rare and that if it *is* > used that there is a very good reason for it. OK. Will do that in the future version. >> > 5. Memory pressure and reclaim activity does not appear to be accounted >> > for and it's not clear if pcp->high is bounded or if it's possible for >> > a single PCP to hide a large number of pages from other CPUs sharing the >> > same node. The max size of the PCP should probably be explicitly clamped. >> >> As in my reply to Michal's email, previously I thought >> ZONE_RECLAIM_ACTIVE will be set for kswap reclaiming, and PCP high will >> be decreased to (batch * 4) effectively. Or I miss something? >> > > I don't think you did, but it deserves a big comment in the tuning at minimum > and potentially even disabling adaptive tuning entirely if reclaim is active. Sure. Will do that. Best Regards, Huang, Ying
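A small sketch of the "minimal pcp->count" idea from the message
above (illustrative; the count_min state and both helpers are
invented, not part of struct per_cpu_pages or this series):

/* Track the low-water mark of the PCP page count within a decay interval. */
static inline void pcp_note_count(struct per_cpu_pages *pcp, int *count_min)
{
	if (pcp->count < *count_min)
		*count_min = pcp->count;
}

/*
 * Skip the decay if the cache nearly emptied recently: the cached
 * pages were being consumed, not sitting idle. Returns true when the
 * decay should proceed, and restarts the tracking either way.
 */
static bool pcp_high_decay_needed(struct per_cpu_pages *pcp, int *count_min,
				  int idle_threshold)
{
	bool idle = *count_min > idle_threshold;

	*count_min = pcp->count;
	return idle;
}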
On Mon, Jul 17, 2023 at 05:16:11PM +0800, Huang, Ying wrote: > Mel Gorman <mgorman@techsingularity.net> writes: > > > Batch should have a much lower maximum than high because it's a deferred cost > > that gets assigned to an arbitrary task. The worst case is where a process > > that is a light user of the allocator incurs the full cost of a refill/drain. > > > > Again, intuitively this may be PID Control problem for the "Mix" case > > to estimate the size of high required to minimise drains/allocs as each > > drain/alloc is potentially a lock contention. The catchall for corner > > cases would be to decay high from vmstat context based on pcp->expires. The > > decay would prevent the "high" being pinned at an artifically high value > > without any zone lock contention for prolonged periods of time and also > > mitigate worst-case due to state being per-cpu. The downside is that "high" > > would also oscillate for a continuous steady allocation pattern as the PID > > control might pick an ideal value suitable for a long period of time with > > the "decay" disrupting that ideal value. > > Maybe we can track the minimal value of pcp->count. If it's small > enough recently, we can avoid to decay pcp->high. Because the pages in > PCP are used for allocations instead of idle. Implement as a separate patch. I suspect this type of heuristic will be very benchmark specific and the complexity may not be worth it in the general case. > > Another question is as follows. > > For example, on CPU A, a large number of pages are freed, and we > maximize batch and high. So, a large number of pages are put in PCP. > Then, the possible situations may be, > > a) a large number of pages are allocated on CPU A after some time > b) a large number of pages are allocated on another CPU B > > For a), we want the pages are kept in PCP of CPU A as long as possible. > For b), we want the pages are kept in PCP of CPU A as short as possible. > I think that we need to balance between them. What is the reasonable > time to keep pages in PCP without many allocations? > This would be a case where you're relying on vmstat to drain the PCP after a period of time as it is a corner case. You cannot reasonably detect the pattern on two separate per-cpu lists without either inspecting remote CPU state or maintaining global state. Either would incur cache miss penalties that probably cost more than the heuristic saves.
Mel Gorman <mgorman@techsingularity.net> writes: > On Mon, Jul 17, 2023 at 05:16:11PM +0800, Huang, Ying wrote: >> Mel Gorman <mgorman@techsingularity.net> writes: >> >> > Batch should have a much lower maximum than high because it's a deferred cost >> > that gets assigned to an arbitrary task. The worst case is where a process >> > that is a light user of the allocator incurs the full cost of a refill/drain. >> > >> > Again, intuitively this may be PID Control problem for the "Mix" case >> > to estimate the size of high required to minimise drains/allocs as each >> > drain/alloc is potentially a lock contention. The catchall for corner >> > cases would be to decay high from vmstat context based on pcp->expires. The >> > decay would prevent the "high" being pinned at an artifically high value >> > without any zone lock contention for prolonged periods of time and also >> > mitigate worst-case due to state being per-cpu. The downside is that "high" >> > would also oscillate for a continuous steady allocation pattern as the PID >> > control might pick an ideal value suitable for a long period of time with >> > the "decay" disrupting that ideal value. >> >> Maybe we can track the minimal value of pcp->count. If it's small >> enough recently, we can avoid to decay pcp->high. Because the pages in >> PCP are used for allocations instead of idle. > > Implement as a separate patch. I suspect this type of heuristic will be > very benchmark specific and the complexity may not be worth it in the > general case. OK. >> Another question is as follows. >> >> For example, on CPU A, a large number of pages are freed, and we >> maximize batch and high. So, a large number of pages are put in PCP. >> Then, the possible situations may be, >> >> a) a large number of pages are allocated on CPU A after some time >> b) a large number of pages are allocated on another CPU B >> >> For a), we want the pages are kept in PCP of CPU A as long as possible. >> For b), we want the pages are kept in PCP of CPU A as short as possible. >> I think that we need to balance between them. What is the reasonable >> time to keep pages in PCP without many allocations? >> > > This would be a case where you're relying on vmstat to drain the PCP after > a period of time as it is a corner case. Yes. The remaining question is how long should "a period of time" be? If it's long, the pages in PCP can be used for allocation after some time. If it's short the pages can be put in buddy, so can be used by other workloads if needed. Anyway, I will do some experiment for that. > You cannot reasonably detect the pattern on two separate per-cpu lists > without either inspecting remote CPU state or maintaining global > state. Either would incur cache miss penalties that probably cost more > than the heuristic saves. Yes. Totally agree. Best Regards, Huang, Ying
On Tue, Jul 18, 2023 at 08:55:16AM +0800, Huang, Ying wrote: > Mel Gorman <mgorman@techsingularity.net> writes: > > > On Mon, Jul 17, 2023 at 05:16:11PM +0800, Huang, Ying wrote: > >> Mel Gorman <mgorman@techsingularity.net> writes: > >> > >> > Batch should have a much lower maximum than high because it's a deferred cost > >> > that gets assigned to an arbitrary task. The worst case is where a process > >> > that is a light user of the allocator incurs the full cost of a refill/drain. > >> > > >> > Again, intuitively this may be PID Control problem for the "Mix" case > >> > to estimate the size of high required to minimise drains/allocs as each > >> > drain/alloc is potentially a lock contention. The catchall for corner > >> > cases would be to decay high from vmstat context based on pcp->expires. The > >> > decay would prevent the "high" being pinned at an artifically high value > >> > without any zone lock contention for prolonged periods of time and also > >> > mitigate worst-case due to state being per-cpu. The downside is that "high" > >> > would also oscillate for a continuous steady allocation pattern as the PID > >> > control might pick an ideal value suitable for a long period of time with > >> > the "decay" disrupting that ideal value. > >> > >> Maybe we can track the minimal value of pcp->count. If it's small > >> enough recently, we can avoid to decay pcp->high. Because the pages in > >> PCP are used for allocations instead of idle. > > > > Implement as a separate patch. I suspect this type of heuristic will be > > very benchmark specific and the complexity may not be worth it in the > > general case. > > OK. > > >> Another question is as follows. > >> > >> For example, on CPU A, a large number of pages are freed, and we > >> maximize batch and high. So, a large number of pages are put in PCP. > >> Then, the possible situations may be, > >> > >> a) a large number of pages are allocated on CPU A after some time > >> b) a large number of pages are allocated on another CPU B > >> > >> For a), we want the pages are kept in PCP of CPU A as long as possible. > >> For b), we want the pages are kept in PCP of CPU A as short as possible. > >> I think that we need to balance between them. What is the reasonable > >> time to keep pages in PCP without many allocations? > >> > > > > This would be a case where you're relying on vmstat to drain the PCP after > > a period of time as it is a corner case. > > Yes. The remaining question is how long should "a period of time" be? Match the time used for draining "remote" pages from the PCP lists. The choice is arbitrary and no matter what value is chosen, it'll be possible to build an adverse workload. > If it's long, the pages in PCP can be used for allocation after some > time. If it's short the pages can be put in buddy, so can be used by > other workloads if needed. > Assume that the main reason to expire pages and put them back on the buddy list is to avoid premature allocation failures due to pages pinned on the PCP. Once pages are going back onto the buddy list and the expiry is hit, it might as well be assumed that the pages are cache-cold. Some bad corner cases should be mitigated by disabling the adapative sizing when reclaim is active. The big remaaining corner case to watch out for is where the sum of the boosted pcp->high exceeds the low watermark. If that should ever happen then potentially a premature OOM happens because the watermarks are fine so no reclaim is active but no pages are available. 
It may even be the case that the sum of pcp->high should not exceed *min* as that corner case means that processes may prematurely enter direct reclaim (not as bad as OOM but still bad). > Anyway, I will do some experiment for that. > > > You cannot reasonably detect the pattern on two separate per-cpu lists > > without either inspecting remote CPU state or maintaining global > > state. Either would incur cache miss penalties that probably cost more > > than the heuristic saves. > > Yes. Totally agree. > > Best Regards, > Huang, Ying
Mel Gorman <mgorman@techsingularity.net> writes: > On Tue, Jul 18, 2023 at 08:55:16AM +0800, Huang, Ying wrote: >> Mel Gorman <mgorman@techsingularity.net> writes: >> >> > On Mon, Jul 17, 2023 at 05:16:11PM +0800, Huang, Ying wrote: >> >> Mel Gorman <mgorman@techsingularity.net> writes: >> >> >> >> > Batch should have a much lower maximum than high because it's a deferred cost >> >> > that gets assigned to an arbitrary task. The worst case is where a process >> >> > that is a light user of the allocator incurs the full cost of a refill/drain. >> >> > >> >> > Again, intuitively this may be PID Control problem for the "Mix" case >> >> > to estimate the size of high required to minimise drains/allocs as each >> >> > drain/alloc is potentially a lock contention. The catchall for corner >> >> > cases would be to decay high from vmstat context based on pcp->expires. The >> >> > decay would prevent the "high" being pinned at an artifically high value >> >> > without any zone lock contention for prolonged periods of time and also >> >> > mitigate worst-case due to state being per-cpu. The downside is that "high" >> >> > would also oscillate for a continuous steady allocation pattern as the PID >> >> > control might pick an ideal value suitable for a long period of time with >> >> > the "decay" disrupting that ideal value. >> >> >> >> Maybe we can track the minimal value of pcp->count. If it's small >> >> enough recently, we can avoid to decay pcp->high. Because the pages in >> >> PCP are used for allocations instead of idle. >> > >> > Implement as a separate patch. I suspect this type of heuristic will be >> > very benchmark specific and the complexity may not be worth it in the >> > general case. >> >> OK. >> >> >> Another question is as follows. >> >> >> >> For example, on CPU A, a large number of pages are freed, and we >> >> maximize batch and high. So, a large number of pages are put in PCP. >> >> Then, the possible situations may be, >> >> >> >> a) a large number of pages are allocated on CPU A after some time >> >> b) a large number of pages are allocated on another CPU B >> >> >> >> For a), we want the pages are kept in PCP of CPU A as long as possible. >> >> For b), we want the pages are kept in PCP of CPU A as short as possible. >> >> I think that we need to balance between them. What is the reasonable >> >> time to keep pages in PCP without many allocations? >> >> >> > >> > This would be a case where you're relying on vmstat to drain the PCP after >> > a period of time as it is a corner case. >> >> Yes. The remaining question is how long should "a period of time" be? > > Match the time used for draining "remote" pages from the PCP lists. The > choice is arbitrary and no matter what value is chosen, it'll be possible > to build an adverse workload. OK. >> If it's long, the pages in PCP can be used for allocation after some >> time. If it's short the pages can be put in buddy, so can be used by >> other workloads if needed. >> > > Assume that the main reason to expire pages and put them back on the buddy > list is to avoid premature allocation failures due to pages pinned on the > PCP. Once pages are going back onto the buddy list and the expiry is hit, > it might as well be assumed that the pages are cache-cold. Some bad corner > cases should be mitigated by disabling the adapative sizing when reclaim is > active. Yes. This can be mitigated, but the page allocation performance may be hurt. 
> The big remaining corner case to watch out for is where the sum
> of the boosted pcp->high exceeds the low watermark. If that should ever
> happen then potentially a premature OOM happens because the watermarks
> are fine so no reclaim is active but no pages are available. It may even
> be the case that the sum of pcp->high should not exceed *min* as that
> corner case means that processes may prematurely enter direct reclaim
> (not as bad as OOM but still bad).

Sorry, I don't understand this. When pages are moved from the buddy
allocator to the PCP, the zone's NR_FREE_PAGES is decreased in
rmqueue_bulk(). That is, pages in the PCP are counted as used instead of
free. And zone_watermark_ok*() and zone_watermark_fast() use the zone's
NR_FREE_PAGES to check the watermark. So, if my understanding is correct,
even if the number of pages in the PCP is larger than the low/min
watermark, we can still trigger reclaim. Is my understanding correct?

>> Anyway, I will do some experiment for that.
>>
>> > You cannot reasonably detect the pattern on two separate per-cpu lists
>> > without either inspecting remote CPU state or maintaining global
>> > state. Either would incur cache miss penalties that probably cost more
>> > than the heuristic saves.
>>
>> Yes. Totally agree.

Best Regards,
Huang, Ying
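A minimal model of the accounting described in this question (plain userspace C, not the mm/ code; the struct and helper names are simplified stand-ins): pages refilled into a PCP are subtracted from the zone's free counter, so the watermark check can fail and reclaim can be triggered even while the PCP still holds plenty of usable pages.

#include <stdbool.h>
#include <stdio.h>

/* Simplified zone: free_pages plays the role of NR_FREE_PAGES. */
struct toy_zone {
	long free_pages;
	long min_wmark;
};

/* rmqueue_bulk()-like refill: pages move buddy -> PCP and stop being "free". */
static long refill_pcp(struct toy_zone *z, long nr)
{
	z->free_pages -= nr;
	return nr;		/* pages now sitting on the per-CPU list */
}

/* zone_watermark_ok()-like check against the free counter only. */
static bool watermark_ok(const struct toy_zone *z)
{
	return z->free_pages > z->min_wmark;
}

int main(void)
{
	struct toy_zone z = { .free_pages = 20000, .min_wmark = 11264 };
	long pcp_pages = 0;

	pcp_pages += refill_pcp(&z, 10000);	/* big PCP refill on one CPU */

	printf("free=%ld pcp=%ld watermark_ok=%d\n",
	       z.free_pages, pcp_pages, watermark_ok(&z));
	/* watermark_ok is now 0: reclaim would be triggered even though
	 * 10000 perfectly usable pages are sitting on the PCP list. */
	return 0;
}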
On Wed, Jul 19, 2023 at 01:59:00PM +0800, Huang, Ying wrote:
> > The big remaining corner case to watch out for is where the sum
> > of the boosted pcp->high exceeds the low watermark. If that should ever
> > happen then potentially a premature OOM happens because the watermarks
> > are fine so no reclaim is active but no pages are available. It may even
> > be the case that the sum of pcp->high should not exceed *min* as that
> > corner case means that processes may prematurely enter direct reclaim
> > (not as bad as OOM but still bad).
>
> Sorry, I don't understand this. When pages are moved from the buddy
> allocator to the PCP, the zone's NR_FREE_PAGES is decreased in
> rmqueue_bulk(). That is, pages in the PCP are counted as used instead of
> free. And zone_watermark_ok*() and zone_watermark_fast() use the zone's
> NR_FREE_PAGES to check the watermark. So, if my understanding is correct,
> even if the number of pages in the PCP is larger than the low/min
> watermark, we can still trigger reclaim. Is my understanding correct?
>

You're right, I didn't check the timing of the accounting and all that
occurred to me was "the timing of when watermarks trigger kswapd or direct
reclaim may change as a result of PCP adaptive resizing". Even though I got
the timing wrong, the shape of the problem just changes. I suspect that
excessively large PCP high relative to the watermarks may mean that reclaim
happens prematurely if too many pages are pinned in PCP lists as the zone's
free pages approach the watermark.

While disabling the adaptive resizing during reclaim will limit the worst
of the problem, it may still be the case that kswapd is woken early simply
because there are enough CPUs pinning pages in PCP lists. Similarly,
depending on the size of pcp->high and the gap between the watermarks, it's
possible for direct reclaim to happen prematurely. I could still be wrong
because I'm not thinking the problem through fully, examining the code or
thinking about the implementation. It's simply worth keeping in mind the
impact elevated PCP high values have on the timing of watermarks failing.
If it's complex enough, it may be necessary to have a separate patch
dealing with the impact of elevated pcp->high on watermarks.
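For a rough sense of when "kswapd woken early" can turn into "direct reclaim happens prematurely", the interesting quantity is the gap between the low and min watermarks. The sketch below compares that gap with the worst-case number of PCP-held pages; it is only an approximation (the gap is modelled here as max(min/4, 0.1% of the zone), mirroring the default watermark_scale_factor of 10 in recent kernels), and the zone size, watermark and pcp->high values are all invented:

#include <stdio.h>

int main(void)
{
	const long managed_pages = 16L << 20;	/* ~64 GiB zone with 4 KiB pages */
	const long min_wmark = 11264;		/* example zone min watermark */
	const long wmark_scale_factor = 10;	/* default watermark_scale_factor */
	const long nr_cpus = 64;
	const long pcp_high = 1024;		/* boosted per-CPU pcp->high */

	/* Approximate step between watermarks: max(min/4, 0.1% of the zone). */
	long step = managed_pages * wmark_scale_factor / 10000;
	if (step < min_wmark / 4)
		step = min_wmark / 4;

	long low_to_min_gap = step;
	long pcp_capacity = nr_cpus * pcp_high;

	printf("low->min gap:        %ld pages\n", low_to_min_gap);
	printf("worst-case PCP-held: %ld pages\n", pcp_capacity);
	if (pcp_capacity > low_to_min_gap)
		printf("PCP lists alone can take a zone from 'kswapd not yet "
		       "needed' straight into direct-reclaim territory\n");
	return 0;
}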
Mel Gorman <mgorman@techsingularity.net> writes:

> On Wed, Jul 19, 2023 at 01:59:00PM +0800, Huang, Ying wrote:
>> > The big remaining corner case to watch out for is where the sum
>> > of the boosted pcp->high exceeds the low watermark. If that should ever
>> > happen then potentially a premature OOM happens because the watermarks
>> > are fine so no reclaim is active but no pages are available. It may even
>> > be the case that the sum of pcp->high should not exceed *min* as that
>> > corner case means that processes may prematurely enter direct reclaim
>> > (not as bad as OOM but still bad).
>>
>> Sorry, I don't understand this. When pages are moved from the buddy
>> allocator to the PCP, the zone's NR_FREE_PAGES is decreased in
>> rmqueue_bulk(). That is, pages in the PCP are counted as used instead of
>> free. And zone_watermark_ok*() and zone_watermark_fast() use the zone's
>> NR_FREE_PAGES to check the watermark. So, if my understanding is correct,
>> even if the number of pages in the PCP is larger than the low/min
>> watermark, we can still trigger reclaim. Is my understanding correct?
>>
>
> You're right, I didn't check the timing of the accounting and all that
> occurred to me was "the timing of when watermarks trigger kswapd or direct
> reclaim may change as a result of PCP adaptive resizing". Even though I got
> the timing wrong, the shape of the problem just changes. I suspect that
> excessively large PCP high relative to the watermarks may mean that reclaim
> happens prematurely if too many pages are pinned in PCP lists as the zone's
> free pages approach the watermark.

Yes. I think so too. In addition to reclaim, falling back to a remote NUMA
node may happen prematurely too.

> While disabling the adaptive resizing during reclaim will limit the worst
> of the problem, it may still be the case that kswapd is woken early simply
> because there are enough CPUs pinning pages in PCP lists. Similarly,
> depending on the size of pcp->high and the gap between the watermarks, it's
> possible for direct reclaim to happen prematurely. I could still be wrong
> because I'm not thinking the problem through fully, examining the code or
> thinking about the implementation. It's simply worth keeping in mind the
> impact elevated PCP high values have on the timing of watermarks failing.
> If it's complex enough, it may be necessary to have a separate patch
> dealing with the impact of elevated pcp->high on watermarks.

Sure. I will keep this in mind. We may need to check the zone watermark
when tuning pcp->high and free some pages from the PCP before falling back
to another node or reclaiming.

--
Best Regards,
Huang, Ying
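One possible shape for "check the zone watermark when tuning pcp->high" is to clamp the tuned value by the zone's free-page headroom above the low watermark, split across CPUs. The following is only a sketch of that idea, not the posted patch, and every function name and number in it is hypothetical:

#include <stdio.h>

/* Toy model: clamp a proposed per-CPU high by the number of pages the zone
 * can spare above its low watermark, shared across all CPUs. */
static long clamp_pcp_high(long proposed_high, long high_def,
			   long zone_free, long low_wmark, long nr_cpus)
{
	long headroom = zone_free - low_wmark;
	long budget = headroom > 0 ? headroom / nr_cpus : 0;

	if (proposed_high > budget)
		proposed_high = budget;
	if (proposed_high < high_def)
		proposed_high = high_def;	/* never go below the default */
	return proposed_high;
}

int main(void)
{
	/* Plenty of memory: the auto-tuned value is used as-is. */
	printf("roomy zone: high = %ld\n",
	       clamp_pcp_high(1024, 200, 4000000, 50000, 64));
	/* Near the watermark: fall back towards the default behaviour. */
	printf("tight zone: high = %ld\n",
	       clamp_pcp_high(1024, 200, 60000, 50000, 64));
	return 0;
}

The design intent the sketch tries to capture is that the adaptive sizing only spends memory the zone can afford, and degrades back towards the default pcp->high as the watermarks get close.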
On Fri, Jul 21, 2023 at 03:28:43PM +0800, Huang, Ying wrote:
> Mel Gorman <mgorman@techsingularity.net> writes:
>
> > On Wed, Jul 19, 2023 at 01:59:00PM +0800, Huang, Ying wrote:
> >> > The big remaining corner case to watch out for is where the sum
> >> > of the boosted pcp->high exceeds the low watermark. If that should ever
> >> > happen then potentially a premature OOM happens because the watermarks
> >> > are fine so no reclaim is active but no pages are available. It may even
> >> > be the case that the sum of pcp->high should not exceed *min* as that
> >> > corner case means that processes may prematurely enter direct reclaim
> >> > (not as bad as OOM but still bad).
> >>
> >> Sorry, I don't understand this. When pages are moved from the buddy
> >> allocator to the PCP, the zone's NR_FREE_PAGES is decreased in
> >> rmqueue_bulk(). That is, pages in the PCP are counted as used instead of
> >> free. And zone_watermark_ok*() and zone_watermark_fast() use the zone's
> >> NR_FREE_PAGES to check the watermark. So, if my understanding is correct,
> >> even if the number of pages in the PCP is larger than the low/min
> >> watermark, we can still trigger reclaim. Is my understanding correct?
> >>
> >
> > You're right, I didn't check the timing of the accounting and all that
> > occurred to me was "the timing of when watermarks trigger kswapd or direct
> > reclaim may change as a result of PCP adaptive resizing". Even though I got
> > the timing wrong, the shape of the problem just changes. I suspect that
> > excessively large PCP high relative to the watermarks may mean that reclaim
> > happens prematurely if too many pages are pinned in PCP lists as the zone's
> > free pages approach the watermark.
>
> Yes. I think so too. In addition to reclaim, falling back to a remote
> NUMA node may happen prematurely too.
>

Yes, with the added bonus that this is relatively easy to detect from the
NUMA miss stats. I say "relative" because in a lot of cases, it'll be
difficult to distinguish from the noise. Hence, it's better to be explicit
in the change log that the potential problem is known and has been
considered. That way, if bisect points the finger at adaptive resizing,
there will be some notes on how to investigate the bug.

> > While disabling the adaptive resizing during reclaim will limit the worst
> > of the problem, it may still be the case that kswapd is woken early simply
> > because there are enough CPUs pinning pages in PCP lists. Similarly,
> > depending on the size of pcp->high and the gap between the watermarks, it's
> > possible for direct reclaim to happen prematurely. I could still be wrong
> > because I'm not thinking the problem through fully, examining the code or
> > thinking about the implementation. It's simply worth keeping in mind the
> > impact elevated PCP high values have on the timing of watermarks failing.
> > If it's complex enough, it may be necessary to have a separate patch
> > dealing with the impact of elevated pcp->high on watermarks.
>
> Sure. I will keep this in mind. We may need to check the zone watermark
> when tuning pcp->high and free some pages from the PCP before falling back
> to another node or reclaiming.
>

That would certainly be one option, a cap on adaptive resizing as memory
gets lower. It's not perfect but ideally the worst-case behaviour would be
that PCP adaptive sizing returns to existing behaviour when memory usage is
persistently high and near watermarks within a zone.
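On the observability point: per-node NUMA hit/miss counters are exported in /sys/devices/system/node/node<N>/numastat (and summarised by the numastat tool), so premature fall-back to a remote node shows up as numa_miss/numa_foreign growing on an otherwise lightly loaded machine. A small reader, purely as an illustration of where to look (node0 and the minimal error handling are assumptions of the example):

#include <stdio.h>
#include <string.h>

/* Print the numa_miss and numa_foreign counters for node 0. */
int main(void)
{
	FILE *f = fopen("/sys/devices/system/node/node0/numastat", "r");
	char name[64];
	unsigned long long val;

	if (!f) {
		perror("numastat");
		return 1;
	}
	while (fscanf(f, "%63s %llu", name, &val) == 2) {
		if (!strcmp(name, "numa_miss") || !strcmp(name, "numa_foreign"))
			printf("%s %llu\n", name, val);
	}
	fclose(f);
	return 0;
}

Sampling these counters before and after a benchmark run and diffing them is usually enough to see whether remote fall-backs increased, though, as noted above, the signal can be hard to separate from normal noise.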
Mel Gorman <mgorman@techsingularity.net> writes:

> On Fri, Jul 21, 2023 at 03:28:43PM +0800, Huang, Ying wrote:
>> Mel Gorman <mgorman@techsingularity.net> writes:
>>
>> > On Wed, Jul 19, 2023 at 01:59:00PM +0800, Huang, Ying wrote:
>> >> > The big remaining corner case to watch out for is where the sum
>> >> > of the boosted pcp->high exceeds the low watermark. If that should ever
>> >> > happen then potentially a premature OOM happens because the watermarks
>> >> > are fine so no reclaim is active but no pages are available. It may even
>> >> > be the case that the sum of pcp->high should not exceed *min* as that
>> >> > corner case means that processes may prematurely enter direct reclaim
>> >> > (not as bad as OOM but still bad).
>> >>
>> >> Sorry, I don't understand this. When pages are moved from the buddy
>> >> allocator to the PCP, the zone's NR_FREE_PAGES is decreased in
>> >> rmqueue_bulk(). That is, pages in the PCP are counted as used instead of
>> >> free. And zone_watermark_ok*() and zone_watermark_fast() use the zone's
>> >> NR_FREE_PAGES to check the watermark. So, if my understanding is correct,
>> >> even if the number of pages in the PCP is larger than the low/min
>> >> watermark, we can still trigger reclaim. Is my understanding correct?
>> >>
>> >
>> > You're right, I didn't check the timing of the accounting and all that
>> > occurred to me was "the timing of when watermarks trigger kswapd or direct
>> > reclaim may change as a result of PCP adaptive resizing". Even though I got
>> > the timing wrong, the shape of the problem just changes. I suspect that
>> > excessively large PCP high relative to the watermarks may mean that reclaim
>> > happens prematurely if too many pages are pinned in PCP lists as the zone's
>> > free pages approach the watermark.
>>
>> Yes. I think so too. In addition to reclaim, falling back to a remote
>> NUMA node may happen prematurely too.
>>
>
> Yes, with the added bonus that this is relatively easy to detect from the
> NUMA miss stats. I say "relative" because in a lot of cases, it'll be
> difficult to distinguish from the noise. Hence, it's better to be explicit
> in the change log that the potential problem is known and has been
> considered. That way, if bisect points the finger at adaptive resizing,
> there will be some notes on how to investigate the bug.

Sure. Will do that.

>> > While disabling the adaptive resizing during reclaim will limit the worst
>> > of the problem, it may still be the case that kswapd is woken early simply
>> > because there are enough CPUs pinning pages in PCP lists. Similarly,
>> > depending on the size of pcp->high and the gap between the watermarks, it's
>> > possible for direct reclaim to happen prematurely. I could still be wrong
>> > because I'm not thinking the problem through fully, examining the code or
>> > thinking about the implementation. It's simply worth keeping in mind the
>> > impact elevated PCP high values have on the timing of watermarks failing.
>> > If it's complex enough, it may be necessary to have a separate patch
>> > dealing with the impact of elevated pcp->high on watermarks.
>>
>> Sure. I will keep this in mind. We may need to check the zone watermark
>> when tuning pcp->high and free some pages from the PCP before falling back
>> to another node or reclaiming.
>>
>
> That would certainly be one option, a cap on adaptive resizing as memory
> gets lower.
> It's not perfect but ideally the worst-case behaviour would be that PCP
> adaptive sizing returns to existing behaviour when memory usage is
> persistently high and near watermarks within a zone.

--
Best Regards,
Huang, Ying
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7e2c1864a9ea..cd9b497cd596 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -670,6 +670,14 @@ struct per_cpu_pages {
 #ifdef CONFIG_NUMA
 	short expire;			/* When 0, remote pagesets are drained */
 #endif
+	int alloc_count;		/* alloc/free count from tune period start */
+	int alloc_high;			/* max alloc count from tune period start */
+	int alloc_low;			/* min alloc count from tune period start */
+	int alloc_depth;		/* alloc depth from tune period start */
+	int free_depth;			/* free depth from tune period start */
+	int avg_alloc_depth;		/* average alloc depth */
+	int avg_free_depth;		/* average free depth */
+	unsigned long tune_start;	/* tune period start timestamp */
 
 	/* Lists of pages, one per migrate type stored on the pcp-lists */
 	struct list_head lists[NR_PCP_LISTS];
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dd83c19f25c6..4d627d96e41a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2616,9 +2616,38 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
 	return min(READ_ONCE(pcp->batch) << 2, high);
 }
 
+#define PCP_AUTO_TUNE_PERIOD	HZ
+
 static void tune_pcp_high(struct per_cpu_pages *pcp, int high_def)
 {
-	pcp->high = high_def;
+	unsigned long now = jiffies;
+	int high_max, high;
+
+	if (likely(now - pcp->tune_start <= PCP_AUTO_TUNE_PERIOD))
+		return;
+
+	/* No alloc/free in last 2 tune period, reset */
+	if (now - pcp->tune_start > 2 * PCP_AUTO_TUNE_PERIOD) {
+		pcp->tune_start = now;
+		pcp->alloc_high = pcp->alloc_low = pcp->alloc_count = 0;
+		pcp->alloc_depth = pcp->free_depth = 0;
+		pcp->avg_alloc_depth = pcp->avg_free_depth = 0;
+		pcp->high = high_def;
+		return;
+	}
+
+	/* End of tune period, try to tune PCP high automatically */
+	pcp->tune_start = now;
+	/* The old alloc/free depth decay with time */
+	pcp->avg_alloc_depth = (pcp->avg_alloc_depth + pcp->alloc_depth) / 2;
+	pcp->avg_free_depth = (pcp->avg_free_depth + pcp->free_depth) / 2;
+	/* Reset for next tune period */
+	pcp->alloc_high = pcp->alloc_low = pcp->alloc_count = 0;
+	pcp->alloc_depth = pcp->free_depth = 0;
+	/* Pure alloc/free will not increase PCP high */
+	high = min(pcp->avg_alloc_depth, pcp->avg_free_depth);
+	high_max = READ_ONCE(pcp->high_max);
+	pcp->high = clamp(high, high_def, high_max);
 }
 
 static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
@@ -2630,7 +2659,19 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	bool free_high;
 
 	high_def = READ_ONCE(pcp->high_def);
-	tune_pcp_high(pcp, high_def);
+	/* PCP is disabled or boot pageset */
+	if (unlikely(!high_def)) {
+		pcp->high = high_def;
+		pcp->tune_start = 0;
+	} else {
+		/* free count as negative allocation */
+		pcp->alloc_count -= (1 << order);
+		pcp->alloc_low = min(pcp->alloc_low, pcp->alloc_count);
+		/* max free depth from the start of current tune period */
+		pcp->free_depth = max(pcp->free_depth,
+				      pcp->alloc_high - pcp->alloc_count);
+		tune_pcp_high(pcp, high_def);
+	}
 
 	__count_vm_events(PGFREE, 1 << order);
 	pindex = order_to_pindex(migratetype, order);
@@ -2998,6 +3039,11 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 		return NULL;
 	}
 
+	pcp->alloc_count += (1 << order);
+	pcp->alloc_high = max(pcp->alloc_high, pcp->alloc_count);
+	/* max alloc depth from the start of current tune period */
+	pcp->alloc_depth = max(pcp->alloc_depth, pcp->alloc_count - pcp->alloc_low);
+
 	/*
 	 * On allocation, reduce the number of pages that are batch freed.
 	 * See nr_pcp_free() where free_factor is increased for subsequent