Message ID: 20230920061856.257597-1-ying.huang@intel.com
Series: mm: PCP high auto-tuning
Message
Huang, Ying
Sept. 20, 2023, 6:18 a.m. UTC
The page allocation performance requirements of different workloads
are often different. So, we need to tune the PCP (Per-CPU Pageset)
high on each CPU automatically to optimize the page allocation
performance.

The list of patches in series is as follows,

1 mm, pcp: avoid to drain PCP when process exit
2 cacheinfo: calculate per-CPU data cache size
3 mm, pcp: reduce lock contention for draining high-order pages
4 mm: restrict the pcp batch scale factor to avoid too long latency
5 mm, page_alloc: scale the number of pages that are batch allocated
6 mm: add framework for PCP high auto-tuning
7 mm: tune PCP high automatically
8 mm, pcp: decrease PCP high if free pages < high watermark
9 mm, pcp: avoid to reduce PCP high unnecessarily
10 mm, pcp: reduce detecting time of consecutive high order page freeing

Patch 1/2/3 optimize the PCP draining for consecutive high-order pages
freeing.

Patch 4/5 optimize batch freeing and allocating.

Patch 6/7/8/9 implement and optimize a PCP high auto-tuning method.

Patch 10 optimize the PCP draining for consecutive high order page
freeing based on PCP high auto-tuning.

The test results for patches with performance impact are as follows,

kbuild
======

On a 2-socket Intel server with 224 logical CPU, we tested kbuild on
one socket with `make -j 112`.

          build time   zone lock%   free_high   alloc_zone
          ----------   ----------   ---------   ----------
base           100.0         43.6       100.0        100.0
patch1          96.6         40.3        49.2         95.2
patch3          96.4         40.5        11.3         95.1
patch5          96.1         37.9        13.3         96.8
patch7          86.4          9.8         6.2         22.0
patch9          85.9          9.4         4.8         16.3
patch10         87.7         12.6        29.0         32.3

The PCP draining optimization (patch 1/3) improves performance a
little. The PCP batch allocation optimization (patch 5) reduces zone
lock contention a little. The PCP high auto-tuning (patch 7/9)
improves performance greatly: the number of pages allocated from the
zone (the tuning target) drops sharply, so the zone lock contention
cycles% drops with it.
The further PCP draining optimization (patch 10) based on PCP tuning
reduces the performance a little here, but it benefits the network
workloads below.

With the PCP tuning patches (patch 7/9/10), the maximum memory used
during the test increases by up to 50.6% because more pages are cached
in the PCP. But finally, the used memory decreases to the same level
as that of the base kernel. That is, the pages cached in the PCP are
released back to the zone once they are no longer used actively.

netperf SCTP_STREAM_MANY
========================

On a 2-socket Intel server with 128 logical CPU, we tested the
SCTP_STREAM_MANY test case of the netperf test suite with 64-pair
processes.

          score   zone lock%   free_high   alloc_zone   cache miss rate%
          -----   ----------   ---------   ----------   ----------------
base      100.0          2.0       100.0        100.0                1.3
patch1     99.7          2.0        99.7         99.7                1.3
patch3    105.5          1.2        13.2        105.4                1.2
patch5    106.9          1.2        13.4        106.9                1.3
patch7    103.5          1.8         6.8         90.8                7.6
patch9    103.7          1.8         6.6         89.8                7.7
patch10   106.9          1.2        13.5        106.9                1.2

The PCP draining optimization (patch 1/3) improves performance. The
PCP high auto-tuning (patch 7/9) reduces performance a little because
PCP draining cannot always be triggered in time, so the cache miss
rate% increases. The further PCP draining optimization (patch 10)
based on PCP tuning restores the performance.

lmbench3 UNIX (AF_UNIX)
=======================

On a 2-socket Intel server with 128 logical CPU, we tested the UNIX
(AF_UNIX socket) test case of the lmbench3 test suite with 16-pair
processes.

          score   zone lock%   free_high   alloc_zone   cache miss rate%
          -----   ----------   ---------   ----------   ----------------
base      100.0         50.0       100.0        100.0                0.3
patch1    117.1         45.8        72.6        108.9                0.2
patch3    201.6         21.2         7.4        111.5                0.2
patch5    201.9         20.9         7.5        112.7                0.3
patch7    194.2         19.3         7.3        111.5                2.9
patch9    193.1         19.2         7.2        110.4                2.9
patch10   196.8         21.0         7.4        111.2                2.1

The PCP draining optimization (patch 1/3) improves performance much.
The PCP tuning (patch 7/9) reduces performance a little because PCP
draining cannot always be triggered in time. The further PCP draining
optimization (patch 10) based on PCP tuning restores the performance
partly.

The patchset adds several fields in struct per_cpu_pages. The struct
layout before/after the patchset is as follows,

base
====

struct per_cpu_pages {
	spinlock_t                 lock;                 /*     0     4 */
	int                        count;                /*     4     4 */
	int                        high;                 /*     8     4 */
	int                        batch;                /*    12     4 */
	short int                  free_factor;          /*    16     2 */
	short int                  expire;               /*    18     2 */

	/* XXX 4 bytes hole, try to pack */

	struct list_head           lists[13];            /*    24   208 */

	/* size: 256, cachelines: 4, members: 7 */
	/* sum members: 228, holes: 1, sum holes: 4 */
	/* padding: 24 */
} __attribute__((__aligned__(64)));

patched
=======

struct per_cpu_pages {
	spinlock_t                 lock;                 /*     0     4 */
	int                        count;                /*     4     4 */
	int                        count_min;            /*     8     4 */
	int                        high;                 /*    12     4 */
	int                        high_min;             /*    16     4 */
	int                        high_max;             /*    20     4 */
	int                        batch;                /*    24     4 */
	u8                         flags;                /*    28     1 */
	u8                         alloc_factor;         /*    29     1 */
	u8                         expire;               /*    30     1 */

	/* XXX 1 byte hole, try to pack */

	short int                  free_count;           /*    32     2 */

	/* XXX 6 bytes hole, try to pack */

	struct list_head           lists[13];            /*    40   208 */

	/* size: 256, cachelines: 4, members: 12 */
	/* sum members: 241, holes: 2, sum holes: 7 */
	/* padding: 8 */
} __attribute__((__aligned__(64)));

The size of the struct doesn't change with the patchset: the new
fields fit into what was previously hole and padding space.

Best Regards,
Huang, Ying
Comments
On Wed, 20 Sep 2023 14:18:46 +0800 Huang Ying <ying.huang@intel.com> wrote:

> The page allocation performance requirements of different workloads
> are often different. So, we need to tune the PCP (Per-CPU Pageset)
> high on each CPU automatically to optimize the page allocation
> performance.

Some of the performance changes here are downright scary.

I've never been very sure that percpu pages was very beneficial (and
hey, I invented the thing back in the Mesozoic era). But these numbers
make me think it's very important and we should have been paying more
attention.

> The list of patches in series is as follows,
>
> 1 mm, pcp: avoid to drain PCP when process exit
> 2 cacheinfo: calculate per-CPU data cache size
> 3 mm, pcp: reduce lock contention for draining high-order pages
> 4 mm: restrict the pcp batch scale factor to avoid too long latency
> 5 mm, page_alloc: scale the number of pages that are batch allocated
> 6 mm: add framework for PCP high auto-tuning
> 7 mm: tune PCP high automatically
> 8 mm, pcp: decrease PCP high if free pages < high watermark
> 9 mm, pcp: avoid to reduce PCP high unnecessarily
> 10 mm, pcp: reduce detecting time of consecutive high order page freeing
>
> Patch 1/2/3 optimize the PCP draining for consecutive high-order pages
> freeing.
>
> Patch 4/5 optimize batch freeing and allocating.
>
> Patch 6/7/8/9 implement and optimize a PCP high auto-tuning method.
>
> Patch 10 optimize the PCP draining for consecutive high order page
> freeing based on PCP high auto-tuning.
>
> The test results for patches with performance impact are as follows,
>
> kbuild
> ======
>
> On a 2-socket Intel server with 224 logical CPU, we tested kbuild on
> one socket with `make -j 112`.
>
>           build time   zone lock%   free_high   alloc_zone
>           ----------   ----------   ---------   ----------
> base           100.0         43.6       100.0        100.0
> patch1          96.6         40.3        49.2         95.2
> patch3          96.4         40.5        11.3         95.1
> patch5          96.1         37.9        13.3         96.8
> patch7          86.4          9.8         6.2         22.0
> patch9          85.9          9.4         4.8         16.3
> patch10         87.7         12.6        29.0         32.3

You're seriously saying that kbuild got 12% faster?

I see that [07/10] (autotuning) alone sped up kbuild by 10%?

Other thoughts:

- What if any facilities are provided to permit users/developers to
  monitor the operation of the autotuning algorithm?

- I'm not seeing any Documentation/ updates. Surely there are things
  we can tell users?

- This:

  : It's possible that PCP high auto-tuning doesn't work well for some
  : workloads. So, when PCP high is tuned by hand via the sysctl knob,
  : the auto-tuning will be disabled. The PCP high set by hand will be
  : used instead.

  Is it a bit hacky to disable autotuning when the user alters
  pcp-high? Would it be cleaner to have a separate on/off knob for
  autotuning?

  And how is the user to determine that "PCP high auto-tuning doesn't
  work well" for their workload?
Hi, Andrew,

Andrew Morton <akpm@linux-foundation.org> writes:

> On Wed, 20 Sep 2023 14:18:46 +0800 Huang Ying <ying.huang@intel.com> wrote:
>
>> The page allocation performance requirements of different workloads
>> are often different. So, we need to tune the PCP (Per-CPU Pageset)
>> high on each CPU automatically to optimize the page allocation
>> performance.
>
> Some of the performance changes here are downright scary.
>
> I've never been very sure that percpu pages was very beneficial (and
> hey, I invented the thing back in the Mesozoic era). But these numbers
> make me think it's very important and we should have been paying more
> attention.
>
>> The list of patches in series is as follows,
>>
>> 1 mm, pcp: avoid to drain PCP when process exit
>> 2 cacheinfo: calculate per-CPU data cache size
>> 3 mm, pcp: reduce lock contention for draining high-order pages
>> 4 mm: restrict the pcp batch scale factor to avoid too long latency
>> 5 mm, page_alloc: scale the number of pages that are batch allocated
>> 6 mm: add framework for PCP high auto-tuning
>> 7 mm: tune PCP high automatically
>> 8 mm, pcp: decrease PCP high if free pages < high watermark
>> 9 mm, pcp: avoid to reduce PCP high unnecessarily
>> 10 mm, pcp: reduce detecting time of consecutive high order page freeing
>>
>> Patch 1/2/3 optimize the PCP draining for consecutive high-order pages
>> freeing.
>>
>> Patch 4/5 optimize batch freeing and allocating.
>>
>> Patch 6/7/8/9 implement and optimize a PCP high auto-tuning method.
>>
>> Patch 10 optimize the PCP draining for consecutive high order page
>> freeing based on PCP high auto-tuning.
>>
>> The test results for patches with performance impact are as follows,
>>
>> kbuild
>> ======
>>
>> On a 2-socket Intel server with 224 logical CPU, we tested kbuild on
>> one socket with `make -j 112`.
>>
>>           build time   zone lock%   free_high   alloc_zone
>>           ----------   ----------   ---------   ----------
>> base           100.0         43.6       100.0        100.0
>> patch1          96.6         40.3        49.2         95.2
>> patch3          96.4         40.5        11.3         95.1
>> patch5          96.1         37.9        13.3         96.8
>> patch7          86.4          9.8         6.2         22.0
>> patch9          85.9          9.4         4.8         16.3
>> patch10         87.7         12.6        29.0         32.3
>
> You're seriously saying that kbuild got 12% faster?
>
> I see that [07/10] (autotuning) alone sped up kbuild by 10%?

Thank you very much for questioning! I double-checked my test results
and configuration and found that I had used an uncommon configuration,
so the description of the test should have been,

  On a 2-socket Intel server with 224 logical CPU, we tested kbuild
  with `numactl -m 1 -- make -j 112`.

This makes the processes running on socket 0 use the normal zone of
socket 1. The remote access to zone->lock causes heavy lock
contention. I apologize for any confusion caused by the above test
results.

If we test kbuild with `make -j 224` on the machine, the test results
become,

          build time   lock%   free_high   alloc_zone
          ----------   -----   ---------   ----------
base           100.0    16.8       100.0        100.0
patch5          99.2    13.9         9.5         97.0
patch7          98.5     5.4         4.8         19.2

Although the lock contention cycles%, the PCP draining for high-order
freeing, and the allocation from the zone all reduce greatly, the
build time almost doesn't change.

We also tested kbuild in another way: we created 8 cgroups and ran
`make -j 28` in each cgroup. That is, the total parallelism is the
same, but the LRU lock contention can be eliminated via cgroups, and
the single-process link stage takes a smaller proportion relative to
the parallel compiling stage. This isn't common for personal usage,
but it can be used by something like a 0Day kbuild service. The test
result is as follows,

          build time   lock%   free_high   alloc_zone
          ----------   -----   ---------   ----------
base           100.0    14.2       100.0        100.0
patch5          98.5     8.5         8.1         97.1
patch7          95.0     0.7         3.0         19.0

The lock contention cycles% reduces to nearly 0, because the LRU lock
contention is eliminated too.
The build time reduction becomes visible too. We will continue to do a
full test with this configuration.

> Other thoughts:
>
> - What if any facilities are provided to permit users/developers to
>   monitor the operation of the autotuning algorithm?

/proc/zoneinfo can be used to observe PCP high and count for each CPU.

> - I'm not seeing any Documentation/ updates. Surely there are things
>   we can tell users?

I will think about that.

> - This:
>
> : It's possible that PCP high auto-tuning doesn't work well for some
> : workloads. So, when PCP high is tuned by hand via the sysctl knob,
> : the auto-tuning will be disabled. The PCP high set by hand will be
> : used instead.
>
> Is it a bit hacky to disable autotuning when the user alters
> pcp-high? Would it be cleaner to have a separate on/off knob for
> autotuning?

This was suggested by Mel Gorman,

https://lore.kernel.org/linux-mm/20230714140710.5xbesq6xguhcbyvi@techsingularity.net/

"
I'm not opposed to having an adaptive pcp->high in concept. I think it would
be best to disable adaptive tuning if percpu_pagelist_high_fraction is set
though. I expect that users of that tunable are rare and that if it *is*
used that there is a very good reason for it.
"

Do you think that this is reasonable?

> And how is the user to determine that "PCP high auto-tuning doesn't work
> well" for their workload?

One way is to check the perf profiling results. If there is heavy zone
lock contention, the PCP high auto-tuning doesn't work well enough to
eliminate it. Users may then try to tune PCP high by hand.

--
Best Regards,
Huang, Ying
On Thu, 21 Sep 2023 21:32:35 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:

> > : It's possible that PCP high auto-tuning doesn't work well for some
> > : workloads. So, when PCP high is tuned by hand via the sysctl knob,
> > : the auto-tuning will be disabled. The PCP high set by hand will be
> > : used instead.
> >
> > Is it a bit hacky to disable autotuning when the user alters
> > pcp-high? Would it be cleaner to have a separate on/off knob for
> > autotuning?
>
> This was suggested by Mel Gorman,
>
> https://lore.kernel.org/linux-mm/20230714140710.5xbesq6xguhcbyvi@techsingularity.net/
>
> "
> I'm not opposed to having an adaptive pcp->high in concept. I think it would
> be best to disable adaptive tuning if percpu_pagelist_high_fraction is set
> though. I expect that users of that tunable are rare and that if it *is*
> used that there is a very good reason for it.
> "
>
> Do you think that this is reasonable?

I suppose so, if it's documented!

Documentation/admin-guide/sysctl/vm.rst describes
percpu_pagelist_high_fraction.
Andrew Morton <akpm@linux-foundation.org> writes:

> On Thu, 21 Sep 2023 21:32:35 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:
>
>> > : It's possible that PCP high auto-tuning doesn't work well for some
>> > : workloads. So, when PCP high is tuned by hand via the sysctl knob,
>> > : the auto-tuning will be disabled. The PCP high set by hand will be
>> > : used instead.
>> >
>> > Is it a bit hacky to disable autotuning when the user alters
>> > pcp-high? Would it be cleaner to have a separate on/off knob for
>> > autotuning?
>>
>> This was suggested by Mel Gorman,
>>
>> https://lore.kernel.org/linux-mm/20230714140710.5xbesq6xguhcbyvi@techsingularity.net/
>>
>> "
>> I'm not opposed to having an adaptive pcp->high in concept. I think it would
>> be best to disable adaptive tuning if percpu_pagelist_high_fraction is set
>> though. I expect that users of that tunable are rare and that if it *is*
>> used that there is a very good reason for it.
>> "
>>
>> Do you think that this is reasonable?
>
> I suppose so, if it's documented!
>
> Documentation/admin-guide/sysctl/vm.rst describes
> percpu_pagelist_high_fraction.

Sure. I will add documentation about the auto-tuning behavior to that
document.

--
Best Regards,
Huang, Ying
On Wed, Sep 20, 2023 at 09:41:18AM -0700, Andrew Morton wrote:
> On Wed, 20 Sep 2023 14:18:46 +0800 Huang Ying <ying.huang@intel.com> wrote:
>
> > The page allocation performance requirements of different workloads
> > are often different. So, we need to tune the PCP (Per-CPU Pageset)
> > high on each CPU automatically to optimize the page allocation
> > performance.
>
> Some of the performance changes here are downright scary.
>
> I've never been very sure that percpu pages was very beneficial (and
> hey, I invented the thing back in the Mesozoic era). But these numbers
> make me think it's very important and we should have been paying more
> attention.
>

FWIW, it is because not only does it avoid lock contention issues, it
avoids excessive splitting/merging of buddies as well as the slower
paths of the allocator. It is not very satisfactory and frankly, the
whole page allocator needs a revisit to account for very large zones,
but it is far from a trivial project. PCP just masks the worst of the
issues and replacing it is far harder than tweaking it.

> > The list of patches in series is as follows,
> >
> > 1 mm, pcp: avoid to drain PCP when process exit
> > 2 cacheinfo: calculate per-CPU data cache size
> > 3 mm, pcp: reduce lock contention for draining high-order pages
> > 4 mm: restrict the pcp batch scale factor to avoid too long latency
> > 5 mm, page_alloc: scale the number of pages that are batch allocated
> > 6 mm: add framework for PCP high auto-tuning
> > 7 mm: tune PCP high automatically
> > 8 mm, pcp: decrease PCP high if free pages < high watermark
> > 9 mm, pcp: avoid to reduce PCP high unnecessarily
> > 10 mm, pcp: reduce detecting time of consecutive high order page freeing
> >
> > Patch 1/2/3 optimize the PCP draining for consecutive high-order pages
> > freeing.
> >
> > Patch 4/5 optimize batch freeing and allocating.
> >
> > Patch 6/7/8/9 implement and optimize a PCP high auto-tuning method.
> >
> > Patch 10 optimize the PCP draining for consecutive high order page
> > freeing based on PCP high auto-tuning.
> >
> > The test results for patches with performance impact are as follows,
> >
> > kbuild
> > ======
> >
> > On a 2-socket Intel server with 224 logical CPU, we tested kbuild on
> > one socket with `make -j 112`.
> >
> >           build time   zone lock%   free_high   alloc_zone
> >           ----------   ----------   ---------   ----------
> > base           100.0         43.6       100.0        100.0
> > patch1          96.6         40.3        49.2         95.2
> > patch3          96.4         40.5        11.3         95.1
> > patch5          96.1         37.9        13.3         96.8
> > patch7          86.4          9.8         6.2         22.0
> > patch9          85.9          9.4         4.8         16.3
> > patch10         87.7         12.6        29.0         32.3
>
> You're seriously saying that kbuild got 12% faster?
>
> I see that [07/10] (autotuning) alone sped up kbuild by 10%?
>
> Other thoughts:
>
> - What if any facilities are provided to permit users/developers to
>   monitor the operation of the autotuning algorithm?
>

Not that I've seen yet, but I'm still partway through the series. It
could be monitored with tracepoints, but it can also be inferred from
lock contention issues. I think it would only be meaningful to
developers to monitor this closely, at least that's what I think now.
Honestly, I'm more worried about potential changes in behaviour
depending on the exact CPU and cache implementation than I am about
being able to actively monitor it.

> - I'm not seeing any Documentation/ updates. Surely there are things
>   we can tell users?
>
> - This:
>
> : It's possible that PCP high auto-tuning doesn't work well for some
> : workloads. So, when PCP high is tuned by hand via the sysctl knob,
> : the auto-tuning will be disabled. The PCP high set by hand will be
> : used instead.
>
> Is it a bit hacky to disable autotuning when the user alters
> pcp-high? Would it be cleaner to have a separate on/off knob for
> autotuning?
>

It might be, but tuning the allocator is very specific and once we
introduce that tunable, we're probably stuck with it.
I would prefer to see it introduced if and only if we have to.

> And how is the user to determine that "PCP high auto-tuning doesn't work
> well" for their workload?

Not easily. It may manifest as variable lock contention issues when the
workload is at a steady state, but that would increase the pressure to
split the allocator away from being zone-based entirely instead of
tweaking PCP further.