From patchwork Wed Sep 20 06:18:47 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Huang, Ying" <ying.huang@intel.com>
X-Patchwork-Id: 142424
From: Huang Ying <ying.huang@intel.com>
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org,
        Arjan Van De Ven <arjan@linux.intel.com>,
        Huang Ying <ying.huang@intel.com>,
        Andrew Morton <akpm@linux-foundation.org>,
        Mel Gorman <mgorman@techsingularity.net>,
        Vlastimil Babka <vbabka@suse.cz>,
        David Hildenbrand <david@redhat.com>,
        Johannes Weiner <jweiner@redhat.com>,
        Dave Hansen <dave.hansen@linux.intel.com>,
        Michal Hocko <mhocko@suse.com>,
        Pavel Tatashin <pasha.tatashin@soleen.com>,
        Matthew Wilcox <willy@infradead.org>,
        Christoph Lameter <cl@linux.com>
Subject: [PATCH 01/10] mm, pcp: avoid to drain PCP when process exit
Date: Wed, 20 Sep 2023 14:18:47 +0800
Message-Id: <20230920061856.257597-2-ying.huang@intel.com>
X-Mailer: git-send-email 2.39.2
In-Reply-To: <20230920061856.257597-1-ying.huang@intel.com>
References: <20230920061856.257597-1-ying.huang@intel.com>
MIME-Version: 1.0
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

Since commit f26b3fa04611 ("mm/page_alloc: limit number of high-order
pages on PCP during bulk free"), the PCP (Per-CPU Pageset) is drained
when it is mostly used for freeing high-order pages, to improve the
reuse of cache-hot pages between the page-allocating and page-freeing
CPUs.

However, the PCP draining mechanism may be triggered unexpectedly when
a process exits.  With a custom tracepoint, it was found that PCP
draining (free_high == true) was triggered by an order-1 page free
with the following call stack,

 => free_unref_page_commit
 => free_unref_page
 => __mmdrop
 => exit_mm
 => do_exit
 => do_group_exit
 => __x64_sys_exit_group
 => do_syscall_64

According to the source code, this is the freeing of the page table
PGD (mm_free_pgd()).  It is an order-1 page free if
CONFIG_PAGE_TABLE_ISOLATION=y, which is a common configuration for
security.

Just before that, a page free with the following call stack was
found,

 => free_unref_page_commit
 => free_unref_page_list
 => release_pages
 => tlb_batch_pages_flush
 => tlb_finish_mmu
 => exit_mmap
 => __mmput
 => exit_mm
 => do_exit
 => do_group_exit
 => __x64_sys_exit_group
 => do_syscall_64

So, when a process exits,

- a large number of the process's user pages are freed without any
  page allocation, so it is highly likely that pcp->free_factor
  becomes > 0.

- after all user pages are freed, the PGD is freed.  This is an
  order-1 page free, so the PCP is drained.

All in all, when a process exits, it is highly likely that the PCP
will be drained.  This is unexpected behavior.

To avoid this, with this patch, PCP draining is only triggered by 2
consecutive high-order page frees.
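
The rule described above can be modelled outside the kernel.  The
following is only a minimal sketch, not the kernel code: struct
pcp_model and the two function names are hypothetical stand-ins, and
the locking and the rest of free_unref_page_commit() are left out.  It
contrasts the old condition with the "2 consecutive high-order frees"
condition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_ALLOC_COSTLY_ORDER   3	/* same value as in the kernel */
#define PCPF_PREV_FREE_HIGH_ORDER 0x01

/* Hypothetical, stripped-down stand-in for struct per_cpu_pages. */
struct pcp_model {
	uint8_t flags;
	uint8_t free_factor;
};

/* Old rule: any high-order free drains once free_factor is non-zero. */
bool free_high_old(const struct pcp_model *pcp, unsigned int order)
{
	assert(pcp);
	return pcp->free_factor && order && order <= PAGE_ALLOC_COSTLY_ORDER;
}

/*
 * New rule: drain only if the previous free was high-order as well.
 * The flag records whether the current free was high-order, so that
 * the next call can check it.
 */
bool free_high_new(struct pcp_model *pcp, unsigned int order)
{
	bool free_high = false;

	assert(pcp);
	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
		free_high = pcp->free_factor &&
			    (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER);
		pcp->flags |= PCPF_PREV_FREE_HIGH_ORDER;
	} else {
		pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
	}
	return free_high;
}
```

In this model, a run of order-0 frees followed by a single order-1
free (the process exit pattern) sets free_high under the old rule but
not under the new one; only a second high-order free in a row does.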

On a 2-socket Intel server with 224 logical CPUs, we tested kbuild on
one socket with `make -j 112`.  With the patch, the build time
decreases by 3.4% (from 206s to 199s).  The cycles% of the spinlock
contention (mostly for the zone lock) decreases from 43.6% to 40.3%
(with PCP size == 361).  The number of PCP drains for high-order page
freeing (free_high) decreases by 50.8%.

This helps network workloads too, through reduced zone lock
contention.  On a 2-socket Intel server with 128 logical CPUs, with
the patch, the network bandwidth of the UNIX (AF_UNIX) test case of
the lmbench test suite with 16-pair processes increases by 17.1%.  The
cycles% of the spinlock contention (mostly for the zone lock)
decreases from 50.0% to 45.8%.  The number of PCP drains for
high-order page freeing (free_high) decreases by 27.4%.  The cache
miss rate stays at 0.3%.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mmzone.h |  5 ++++-
 mm/page_alloc.c        | 11 ++++++++---
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4106fbc5b4b3..64d5ed2bb724 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -676,12 +676,15 @@ enum zone_watermarks {
 #define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost)
 #define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost)
 
+#define	PCPF_PREV_FREE_HIGH_ORDER	0x01
+
 struct per_cpu_pages {
 	spinlock_t lock;	/* Protects lists field */
 	int count;		/* number of pages in the list */
 	int high;		/* high watermark, emptying needed */
 	int batch;		/* chunk size for buddy add/remove */
-	short free_factor;	/* batch scaling factor during free */
+	u8 flags;		/* protected by pcp->lock */
+	u8 free_factor;		/* batch scaling factor during free */
 #ifdef CONFIG_NUMA
 	short expire;		/* When 0, remote pagesets are drained */
 #endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0c5be12f9336..828dcc24b030 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2370,7 +2370,7 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 {
 	int high;
 	int pindex;
-	bool free_high;
+	bool free_high = false;
 
 	__count_vm_events(PGFREE, 1 << order);
 	pindex = order_to_pindex(migratetype, order);
@@ -2383,8 +2383,13 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	 * freeing without allocation. The remainder after bulk freeing
 	 * stops will be drained from vmstat refresh context.
 	 */
-	free_high = (pcp->free_factor && order && order <= PAGE_ALLOC_COSTLY_ORDER);
-
+	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
+		free_high = (pcp->free_factor &&
+			     (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER));
+		pcp->flags |= PCPF_PREV_FREE_HIGH_ORDER;
+	} else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) {
+		pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
+	}
 	high = nr_pcp_high(pcp, zone, free_high);
 	if (pcp->count >= high) {
 		free_pcppages_bulk(zone, nr_pcp_free(pcp, high, free_high), pcp, pindex);

From patchwork Wed Sep 20 06:18:48 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Huang, Ying" <ying.huang@intel.com>
X-Patchwork-Id: 142242
From: Huang Ying <ying.huang@intel.com>
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org,
        Arjan Van De Ven <arjan@linux.intel.com>,
        Huang Ying <ying.huang@intel.com>,
        Sudeep Holla <sudeep.holla@arm.com>,
        Andrew Morton <akpm@linux-foundation.org>,
        Mel Gorman <mgorman@techsingularity.net>,
        Vlastimil Babka <vbabka@suse.cz>,
        David Hildenbrand <david@redhat.com>,
        Johannes Weiner <jweiner@redhat.com>,
        Dave Hansen <dave.hansen@linux.intel.com>,
        Michal Hocko <mhocko@suse.com>,
        Pavel Tatashin <pasha.tatashin@soleen.com>,
        Matthew Wilcox <willy@infradead.org>,
        Christoph Lameter <cl@linux.com>
Subject: [PATCH 02/10] cacheinfo: calculate per-CPU data cache size
Date: Wed, 20 Sep 2023 14:18:48 +0800
Message-Id: <20230920061856.257597-3-ying.huang@intel.com>
X-Mailer: git-send-email 2.39.2
In-Reply-To: <20230920061856.257597-1-ying.huang@intel.com>
References: <20230920061856.257597-1-ying.huang@intel.com>
MIME-Version: 1.0
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

Per-CPU data cache size is useful information.  For example, it can be
used to tune per-CPU caching such as the PCP.  So, in this patch, the
data cache size for each CPU is calculated via data_cache_size /
shared_cpu_weight.
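
As a rough illustration of the calculation above, here is a minimal
user-space sketch.  struct cache_leaf_model and
per_cpu_data_cache_size() are hypothetical names that do not exist in
the kernel; the sketch only mirrors the idea of dividing each data or
unified cache's size by the number of CPUs sharing it and summing the
results over all cache levels.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of one cache leaf of a CPU. */
struct cache_leaf_model {
	unsigned int size;	/* cache size in bytes */
	unsigned int nr_shared;	/* number of CPUs sharing this cache */
	bool holds_data;	/* true for data or unified caches */
};

/* Sum the per-CPU share of every data/unified cache leaf. */
unsigned int per_cpu_data_cache_size(const struct cache_leaf_model *leaves,
				     size_t nr_leaves)
{
	unsigned int size_data = 0;
	size_t i;

	assert(leaves || !nr_leaves);
	for (i = 0; i < nr_leaves; i++) {
		/* Skip instruction caches and leaves without sharing info. */
		if (!leaves[i].holds_data || !leaves[i].nr_shared)
			continue;
		size_data += leaves[i].size / leaves[i].nr_shared;
	}
	return size_data;
}
```

For example, a CPU with a data L1 shared by 2 hardware threads, an L2
shared by the same 2 threads, and an L3 shared by 16 threads would be
credited with half of the L1, half of the L2, and one sixteenth of the
L3; its instruction L1 would not count.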

A brute-force algorithm that iterates over all online CPUs is used to
avoid allocating an extra cpumask, especially in the offline callback.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
---
 drivers/base/cacheinfo.c  | 42 ++++++++++++++++++++++++++++++++++++++-
 include/linux/cacheinfo.h |  1 +
 2 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
index cbae8be1fe52..3e8951a3fbab 100644
--- a/drivers/base/cacheinfo.c
+++ b/drivers/base/cacheinfo.c
@@ -898,6 +898,41 @@ static int cache_add_dev(unsigned int cpu)
 	return rc;
 }
 
+static void update_data_cache_size_cpu(unsigned int cpu)
+{
+	struct cpu_cacheinfo *ci;
+	struct cacheinfo *leaf;
+	unsigned int i, nr_shared;
+	unsigned int size_data = 0;
+
+	if (!per_cpu_cacheinfo(cpu))
+		return;
+
+	ci = ci_cacheinfo(cpu);
+	for (i = 0; i < cache_leaves(cpu); i++) {
+		leaf = per_cpu_cacheinfo_idx(cpu, i);
+		if (leaf->type != CACHE_TYPE_DATA &&
+		    leaf->type != CACHE_TYPE_UNIFIED)
+			continue;
+		nr_shared = cpumask_weight(&leaf->shared_cpu_map);
+		if (!nr_shared)
+			continue;
+		size_data += leaf->size / nr_shared;
+	}
+	ci->size_data = size_data;
+}
+
+static void update_data_cache_size(bool cpu_online, unsigned int cpu)
+{
+	unsigned int icpu;
+
+	for_each_online_cpu(icpu) {
+		if (!cpu_online && icpu == cpu)
+			continue;
+		update_data_cache_size_cpu(icpu);
+	}
+}
+
 static int cacheinfo_cpu_online(unsigned int cpu)
 {
 	int rc = detect_cache_attributes(cpu);
@@ -906,7 +941,11 @@ static int cacheinfo_cpu_online(unsigned int cpu)
 		return rc;
 	rc = cache_add_dev(cpu);
 	if (rc)
-		free_cache_attributes(cpu);
+		goto err;
+	update_data_cache_size(true, cpu);
+	return 0;
+err:
+	free_cache_attributes(cpu);
 	return rc;
 }
 
@@ -916,6 +955,7 @@ static int cacheinfo_cpu_pre_down(unsigned int cpu)
 		cpu_cache_sysfs_exit(cpu);
 
 	free_cache_attributes(cpu);
+	update_data_cache_size(false, cpu);
 	return 0;
 }
 
diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
index a5cfd44fab45..4e7ccfa0c36d 100644
--- a/include/linux/cacheinfo.h
+++ b/include/linux/cacheinfo.h
@@ -73,6 +73,7 @@ struct cacheinfo {
 
 struct cpu_cacheinfo {
 	struct cacheinfo *info_list;
+	unsigned int size_data;
 	unsigned int num_levels;
 	unsigned int num_leaves;
 	bool cpu_map_populated;

From patchwork Wed Sep 20 06:18:49 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Huang, Ying" <ying.huang@intel.com>
X-Patchwork-Id: 142240
From: Huang Ying <ying.huang@intel.com>
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org,
        Arjan Van De Ven <arjan@linux.intel.com>,
        Huang Ying <ying.huang@intel.com>,
        Andrew Morton <akpm@linux-foundation.org>,
        Mel Gorman <mgorman@techsingularity.net>,
        Vlastimil Babka <vbabka@suse.cz>,
        David Hildenbrand <david@redhat.com>,
        Johannes Weiner <jweiner@redhat.com>,
        Dave Hansen <dave.hansen@linux.intel.com>,
        Michal Hocko <mhocko@suse.com>,
        Pavel Tatashin <pasha.tatashin@soleen.com>,
        Matthew Wilcox <willy@infradead.org>,
        Christoph Lameter <cl@linux.com>
Subject: [PATCH 03/10] mm, pcp: reduce lock contention for draining high-order pages
Date: Wed, 20 Sep 2023 14:18:49 +0800
Message-Id: <20230920061856.257597-4-ying.huang@intel.com>
X-Mailer: git-send-email 2.39.2
In-Reply-To: <20230920061856.257597-1-ying.huang@intel.com>
References: <20230920061856.257597-1-ying.huang@intel.com>
MIME-Version: 1.0
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

Since commit f26b3fa04611 ("mm/page_alloc: limit number of high-order
pages on PCP during bulk free"), the PCP (Per-CPU Pageset) is drained
when it is mostly used for freeing high-order pages, to improve the
reuse of cache-hot pages between the page-allocating and page-freeing
CPUs.

On a system with a small per-CPU data cache, pages should not be
cached before draining, so that the freed pages stay cache-hot.  But
on a system with a large per-CPU data cache, more pages can be cached
before draining to reduce zone lock contention.

So, in this patch, instead of draining without any caching, "batch"
pages will be cached in the PCP before draining if the per-CPU data
cache size is more than "4 * batch" pages.
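
The threshold above can be sketched as a small stand-alone check.
This is only an illustration under stated assumptions:
PAGE_SHIFT_MODEL assumes 4 KiB pages, and both function names are
hypothetical rather than kernel APIs.  The first function mirrors the
"cache size in pages > 4 * batch" test; the second mirrors the
extended free_high condition, where draining is delayed until at least
"batch" pages are held in the PCP.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SHIFT_MODEL 12	/* assumption: 4 KiB pages */

/* True if the per-CPU data cache holds more than 4 * batch pages. */
bool pcp_can_batch_high_order(unsigned int size_data, int batch)
{
	assert(batch >= 0);
	return (size_data >> PAGE_SHIFT_MODEL) > 4u * (unsigned int)batch;
}

/*
 * Drain decision for a high-order free.  prev_high says whether the
 * previous free was high-order too; can_batch is the result of the
 * check above; count is the number of pages currently in the PCP.
 */
bool free_high_batched(bool free_factor_set, bool prev_high,
		       bool can_batch, int count, int batch)
{
	return free_factor_set && prev_high &&
	       (!can_batch || count >= batch);
}
```

With batch == 63, for instance, a per-CPU data cache must exceed
252 pages before batching is enabled; once it is, consecutive
high-order frees only trigger draining after 63 pages have
accumulated.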

On a 2-socket Intel server with 128 logical CPUs, with the patch, the
network bandwidth of the UNIX (AF_UNIX) test case of the lmbench test
suite with 16-pair processes increases by 72.2%.  The cycles% of the
spinlock contention (mostly for the zone lock) decreases from 45.8% to
21.2%.  The number of PCP drains for high-order page freeing
(free_high) decreases by 89.8%.  The cache miss rate stays at 0.3%.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
---
 drivers/base/cacheinfo.c |  2 ++
 include/linux/gfp.h      |  1 +
 include/linux/mmzone.h   |  1 +
 mm/page_alloc.c          | 37 ++++++++++++++++++++++++++++++++++++-
 4 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
index 3e8951a3fbab..a55b2f83958b 100644
--- a/drivers/base/cacheinfo.c
+++ b/drivers/base/cacheinfo.c
@@ -943,6 +943,7 @@ static int cacheinfo_cpu_online(unsigned int cpu)
 	if (rc)
 		goto err;
 	update_data_cache_size(true, cpu);
+	setup_pcp_cacheinfo();
 	return 0;
 err:
 	free_cache_attributes(cpu);
@@ -956,6 +957,7 @@ static int cacheinfo_cpu_pre_down(unsigned int cpu)
 
 	free_cache_attributes(cpu);
 	update_data_cache_size(false, cpu);
+	setup_pcp_cacheinfo();
 	return 0;
 }
 
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 665f06675c83..665edc11fb9f 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -325,6 +325,7 @@ void drain_all_pages(struct zone *zone);
 void drain_local_pages(struct zone *zone);
 
 void page_alloc_init_late(void);
+void setup_pcp_cacheinfo(void);
 
 /*
  * gfp_allowed_mask is set to GFP_BOOT_MASK during early boot to restrict what
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 64d5ed2bb724..4132e7490b49 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -677,6 +677,7 @@ enum zone_watermarks {
 #define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost)
 
 #define	PCPF_PREV_FREE_HIGH_ORDER	0x01
+#define	PCPF_FREE_HIGH_BATCH		0x02
 
 struct per_cpu_pages {
 	spinlock_t lock;	/* Protects lists field */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 828dcc24b030..06aa9c5687e0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -52,6 +52,7 @@
 #include <linux/psi.h>
 #include <linux/khugepaged.h>
 #include <linux/delayacct.h>
+#include <linux/cacheinfo.h>
 #include <asm/div64.h>
 #include "internal.h"
 #include "shuffle.h"
@@ -2385,7 +2386,9 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	 */
 	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
 		free_high = (pcp->free_factor &&
-			     (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER));
+			     (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) &&
+			     (!(pcp->flags & PCPF_FREE_HIGH_BATCH) ||
+			      pcp->count >= READ_ONCE(pcp->batch)));
 		pcp->flags |= PCPF_PREV_FREE_HIGH_ORDER;
 	} else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) {
 		pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
@@ -5418,6 +5421,38 @@ static void zone_pcp_update(struct zone *zone, int cpu_online)
 	mutex_unlock(&pcp_batch_high_lock);
 }
 
+static void zone_pcp_update_cacheinfo(struct zone *zone)
+{
+	int cpu;
+	struct per_cpu_pages *pcp;
+	struct cpu_cacheinfo *cci;
+
+	for_each_online_cpu(cpu) {
+		pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+		cci = get_cpu_cacheinfo(cpu);
+		/*
+		 * If per-CPU data cache is large enough, up to
+		 * "batch" high-order pages can be cached in PCP for
+		 * consecutive freeing.  This can reduce zone lock
+		 * contention without hurting cache-hot pages sharing.
+		 */
+		spin_lock(&pcp->lock);
+		if ((cci->size_data >> PAGE_SHIFT) > 4 * pcp->batch)
+			pcp->flags |= PCPF_FREE_HIGH_BATCH;
+		else
+			pcp->flags &= ~PCPF_FREE_HIGH_BATCH;
+		spin_unlock(&pcp->lock);
+	}
+}
+
+void setup_pcp_cacheinfo(void)
+{
+	struct zone *zone;
+
+	for_each_populated_zone(zone)
+		zone_pcp_update_cacheinfo(zone);
+}
+
 /*
  * Allocate per cpu pagesets and initialize them.
  * Before this call only boot pagesets were available.

From patchwork Wed Sep 20 06:18:50 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Huang, Ying" <ying.huang@intel.com>
X-Patchwork-Id: 142247
Return-Path: <linux-kernel-owner@vger.kernel.org>
Delivered-To: ouuuleilei@gmail.com
Received: by 2002:a05:612c:172:b0:3f2:4152:657d with SMTP id h50csp3923143vqi;
        Tue, 19 Sep 2023 23:47:08 -0700 (PDT)
X-Google-Smtp-Source: 
 AGHT+IEC5Qn9OYOgtqPB/xZNf7b0er9K9hLl0RhUjVzl6FGQ/cXZPQN8m8teb+HzLFtr5v/x2Qhk
X-Received: by 2002:a17:90b:1190:b0:274:7b6a:4358 with SMTP id
 gk16-20020a17090b119000b002747b6a4358mr1743729pjb.6.1695192428235;
        Tue, 19 Sep 2023 23:47:08 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1695192428; cv=none;
        d=google.com; s=arc-20160816;
        b=rP2M5TiZZesACDI9MhzCf7u+/X95Em4qSaBbFWIXN4olE+eFAqAvbZNPDRrrndigo7
         vI+FIebWJF3jbyC4rW7+z+2+JbRCrDaE8oZNmuUheAyNzzCWi2t0z3HdyeHR4WvKFbre
         gawbqEHEl/obgHKAAnAVhzQPLclhOs/XwElxGTS2xUJVbwDUuvqOxSDjGmi/LpArT9yx
         bj/wMGMgAX3TVgt0nu50R5a+/yEtQW1eVxQDu8E3iwFPprXpM8a3GfYYSO2I/fsF+GMM
         Ixeijl6Cj8VviOuPRVN1vrXJN5aqR0ReXrygiDr8mCNpn4QXuHrKh7cmDEcGqOhU5JW6
         h97w==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=list-id:precedence:content-transfer-encoding:mime-version
         :references:in-reply-to:message-id:date:subject:cc:to:from
         :dkim-signature;
        bh=l3q5SV++ZWPRlY3DCX2JHyc1VEp3GoxWhBxm7mEJMtw=;
        fh=OlKm7LKbIdgbzv7m6ivtVBS9u5zco/nrHpeuJnEjCeg=;
        b=ew3S/j3+m0GDrJ8NgwKCIvHznaFQe9r6X1OzeXON9z/U73FQ6wXybis7U+ij3VAtjw
         XIKn5kNE5GWKOrEi1Qn+T0H/VNjgKdDYALpd4kK7uocYeTxpYZ/ag/iM+BUXFqS1iIIS
         Au9iih4dtL8t5jS1SHgkcsmhfju2JvtgeAsJ0DxvmOGlOymzni3+FVFmtcw+6YCXCPLS
         7fOwz14WHQxwglbLyddDfX3yOFU05GdKk6SOF3vLboRxWZ3qKpGUI6LETUBKqP6kfH6h
         Jfq0SzLcBMk8y2V5qa1Q3APHVZ1Tkt2emWpyo4egk5Ck2K0g2BULH5SzVF1OiN7ZKdC5
         MewA==
ARC-Authentication-Results: i=1; mx.google.com;
       dkim=pass header.i=@intel.com header.s=Intel header.b=JzvwP1Yg;
       spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 2620:137:e000::3:6 as permitted sender)
 smtp.mailfrom=linux-kernel-owner@vger.kernel.org;
       dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com
Received: from pete.vger.email (pete.vger.email. [2620:137:e000::3:6])
        by mx.google.com with ESMTPS id
 r5-20020a17090a690500b00273515e8968si916218pjj.127.2023.09.19.23.47.07
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Tue, 19 Sep 2023 23:47:08 -0700 (PDT)
Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 2620:137:e000::3:6 as permitted sender)
 client-ip=2620:137:e000::3:6;
Authentication-Results: mx.google.com;
       dkim=pass header.i=@intel.com header.s=Intel header.b=JzvwP1Yg;
       spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 2620:137:e000::3:6 as permitted sender)
 smtp.mailfrom=linux-kernel-owner@vger.kernel.org;
       dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com
Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0])
	by pete.vger.email (Postfix) with ESMTP id B86B082A2DA7;
	Tue, 19 Sep 2023 23:20:38 -0700 (PDT)
X-Virus-Status: Clean
X-Virus-Scanned: clamav-milter 0.103.10 at pete.vger.email
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S233273AbjITGUF (ORCPT <rfc822;toshivichauhan@gmail.com>
        + 26 others); Wed, 20 Sep 2023 02:20:05 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:45590 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S233217AbjITGTz (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Wed, 20 Sep 2023 02:19:55 -0400
Received: from mgamail.intel.com (mgamail.intel.com [134.134.136.126])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 69DF9EA
        for <linux-kernel@vger.kernel.org>;
 Tue, 19 Sep 2023 23:19:49 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1695190789; x=1726726789;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=edWaWFJ5ehaM/hA/BBV9GI7aeyW0raEHXFDMtgquXCc=;
  b=JzvwP1YgcPilvouTtGEPj9VCq0hdGPXUqQS2Kdkre/2wb/tQ2TfwUMeu
   MfVW/OV1lelkQ9ISX16zfeVjBiHxAwtH/Scj5Jx7VVcVGfN38J17opI/p
   E7+W4X7/EMwisAZlnus309GXVEutPw4c5HM74+SX7VtbhgeWNoBstTlZc
   WbDoScppN1Vd1XUDLtwvuCEwJV0QsOw+HgBeT1pbtYUcos9zPTw27RNIV
   +iB6ZqMT6x/3+lGGj5zyKaahQmF+zrtLlS59kbbHbvfrfBkj5cVdI67l5
   TcTDfvKnbzmcJhlLmoSRqcg6nDZ5xdV0N/3vhonOkCeK6tcpPnj2d5Xau
   A==;
X-IronPort-AV: E=McAfee;i="6600,9927,10838"; a="365187663"
X-IronPort-AV: E=Sophos;i="6.02,161,1688454000";
   d="scan'208";a="365187663"
Received: from orsmga007.jf.intel.com ([10.7.209.58])
  by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 19 Sep 2023 23:19:49 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=McAfee;i="6600,9927,10838"; a="740060591"
X-IronPort-AV: E=Sophos;i="6.02,161,1688454000";
   d="scan'208";a="740060591"
Received: from yhuang6-mobl2.sh.intel.com ([10.238.6.133])
  by orsmga007-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 19 Sep 2023 23:19:45 -0700
From: Huang Ying <ying.huang@intel.com>
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org,
        Arjan Van De Ven <arjan@linux.intel.com>,
        Huang Ying <ying.huang@intel.com>,
        Andrew Morton <akpm@linux-foundation.org>,
        Mel Gorman <mgorman@techsingularity.net>,
        Vlastimil Babka <vbabka@suse.cz>,
        David Hildenbrand <david@redhat.com>,
        Johannes Weiner <jweiner@redhat.com>,
        Dave Hansen <dave.hansen@linux.intel.com>,
        Michal Hocko <mhocko@suse.com>,
        Pavel Tatashin <pasha.tatashin@soleen.com>,
        Matthew Wilcox <willy@infradead.org>,
        Christoph Lameter <cl@linux.com>
Subject: [PATCH 04/10] mm: restrict the pcp batch scale factor to avoid too
 long latency
Date: Wed, 20 Sep 2023 14:18:50 +0800
Message-Id: <20230920061856.257597-5-ying.huang@intel.com>
X-Mailer: git-send-email 2.39.2
In-Reply-To: <20230920061856.257597-1-ying.huang@intel.com>
References: <20230920061856.257597-1-ying.huang@intel.com>
MIME-Version: 1.0
X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED,
	DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,
	MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=unavailable
	autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on pete.vger.email
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org
X-Greylist: Sender passed SPF test,
 not delayed by milter-greylist-4.6.4 (pete.vger.email [0.0.0.0]);
 Tue, 19 Sep 2023 23:20:38 -0700 (PDT)
X-getmail-retrieved-from-mailbox: INBOX
X-GMAIL-THRID: 1777538095432517380
X-GMAIL-MSGID: 1777538095432517380

In the page allocator, the PCP (Per-CPU Pageset) is refilled and
drained in batches to increase page allocation throughput, reduce
per-page allocation/freeing latency, and reduce zone lock contention.
But a batch size that is too large makes the maximum
allocation/freeing latency too long, which may penalize arbitrary
users.  So the default batch size is chosen carefully (in
zone_batchsize(), the value is 63 for zones > 1GB) to avoid that.

In commit 3b12e7e97938 ("mm/page_alloc: scale the number of pages that
are batch freed"), the batch size is scaled up when a large number of
pages are freed, to improve page freeing performance and reduce zone
lock contention.  A similar optimization can be used when a large
number of pages are allocated.

To find a suitable max batch scale factor (that is, max effective
batch size), some tests and measurements were done on several
machines, as follows.

A set of debug patches was implemented as follows:

- Set PCP high to be 2 * batch to reduce the effect of PCP high

- Disable free batch size scaling to get the raw performance.

- The code with zone lock held is extracted from rmqueue_bulk() and
  free_pcppages_bulk() to 2 separate functions to make it easy to
  measure the function run time with ftrace function_graph tracer.

- The batch size is hard coded to be 63 (default), 127, 255, 511,
  1023, 2047, 4095.

Then will-it-scale/page_fault1 is used to generate the page
allocation/freeing workload.  The page allocation/freeing throughput
(page/s) is measured via will-it-scale.  The average page
allocation/freeing latency (alloc/free latency avg, in us) and the
99th-percentile allocation/freeing latency (alloc/free latency 99%, in
us) are measured with the ftrace function_graph tracer.

The test results are as follows:

Sapphire Rapids Server
======================
Batch	throughput	free latency	free latency	alloc latency	alloc latency
	page/s		avg / us	99% / us	avg / us	99% / us
-----	----------	------------	------------	-------------	-------------
  63	513633.4	 2.33		 3.57		 2.67		  6.83
 127	517616.7	 4.35		 6.65		 4.22		 13.03
 255	520822.8	 8.29		13.32		 7.52		 25.24
 511	524122.0	15.79		23.42		14.02		 49.35
1023	525980.5	30.25		44.19		25.36		 94.88
2047	526793.6	59.39		84.50		45.22		140.81

Ice Lake Server
===============
Batch	throughput	free latency	free latency	alloc latency	alloc latency
	page/s		avg / us	99% / us	avg / us	99% / us
-----	----------	------------	------------	-------------	-------------
  63	620210.3	 2.21		 3.68		 2.02		 4.35
 127	627003.0	 4.09		 6.86		 3.51		 8.28
 255	630777.5	 7.70		13.50		 6.17		15.97
 511	633651.5	14.85		22.62		11.66		31.08
1023	637071.1	28.55		42.02		20.81		54.36
2047	638089.7	56.54		84.06		39.28		91.68

Cascade Lake Server
===================
Batch	throughput	free latency	free latency	alloc latency	alloc latency
	page/s		avg / us	99% / us	avg / us	99% / us
-----	----------	------------	------------	-------------	-------------
  63	404706.7	 3.29		  5.03		 3.53		  4.75
 127	422475.2	 6.12		  9.09		 6.36		  8.76
 255	411522.2	11.68		 16.97		10.90		 16.39
 511	428124.1	22.54		 31.28		19.86		 32.25
1023	414718.4	43.39		 62.52		40.00		 66.33
2047	429848.7	86.64		120.34		71.14		106.08

Comet Lake Desktop
==================
Batch	throughput	free latency	free latency	alloc latency	alloc latency
	page/s		avg / us	99% / us	avg / us	99% / us
-----	----------	------------	------------	-------------	-------------
  63	795183.13	 2.18		 3.55		 2.03		 3.05
 127	803067.85	 3.91		 6.56		 3.85		 5.52
 255	812771.10	 7.35		10.80		 7.14		10.20
 511	817723.48	14.17		27.54		13.43		30.31
1023	818870.19	27.72		40.10		27.89		46.28

Coffee Lake Desktop
===================
Batch	throughput	free latency	free latency	alloc latency	alloc latency
	page/s		avg / us	99% / us	avg / us	99% / us
-----	----------	------------	------------	-------------	-------------
  63	510542.8	 3.13		  4.40		 2.48		 3.43
 127	514288.6	 5.97		  7.89		 4.65		 6.04
 255	516889.7	11.86		 15.58		 8.96		12.55
 511	519802.4	23.10		 28.81		16.95		26.19
1023	520802.7	45.30		 52.51		33.19		45.95
2047	519997.1	90.63		104.00		65.26		81.74

From the above data, to keep the allocation/freeing latency below
100 us most of the time, the max batch scale factor needs to be less
than or equal to 5.

So, in this patch, the batch scale factor is restricted to be less
than or equal to 5.
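
To make the relation between the scale factor and the measured batch
sizes concrete, here is a small sketch of the arithmetic.  The helper
effective_batch() is hypothetical and only models "batch << factor"
with the factor capped; the kernel code additionally clamps the result
against the PCP limits.

```c
/* Same value as the macro added by this patch. */
#define PCP_BATCH_SCALE_MAX 5

/*
 * Hypothetical helper: effective batch size for a base batch and a
 * scale factor, with the factor capped at PCP_BATCH_SCALE_MAX.
 */
static int effective_batch(int batch, int factor)
{
	if (factor > PCP_BATCH_SCALE_MAX)
		factor = PCP_BATCH_SCALE_MAX;
	return batch << factor;
}
```

With the default batch of 63, a factor of 5 gives 63 << 5 = 2016 pages,
which is close to the 2047 rows in the tables above.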

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 06aa9c5687e0..30554c674349 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -86,6 +86,9 @@ typedef int __bitwise fpi_t;
  */
 #define FPI_TO_TAIL		((__force fpi_t)BIT(1))
 
+/* Maximum PCP batch scale factor to restrict max allocation/freeing latency */
+#define PCP_BATCH_SCALE_MAX	5
+
 /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
 static DEFINE_MUTEX(pcp_batch_high_lock);
 #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
@@ -2340,7 +2343,7 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high)
 	 * freeing of pages without any allocation.
 	 */
 	batch <<= pcp->free_factor;
-	if (batch < max_nr_free)
+	if (batch < max_nr_free && pcp->free_factor < PCP_BATCH_SCALE_MAX)
 		pcp->free_factor++;
 	batch = clamp(batch, min_nr_free, max_nr_free);
 

From patchwork Wed Sep 20 06:18:51 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Huang, Ying" <ying.huang@intel.com>
X-Patchwork-Id: 142492
Return-Path: <linux-kernel-owner@vger.kernel.org>
Delivered-To: ouuuleilei@gmail.com
Received: by 2002:a05:612c:172:b0:3f2:4152:657d with SMTP id h50csp4223510vqi;
        Wed, 20 Sep 2023 08:29:27 -0700 (PDT)
X-Google-Smtp-Source: 
 AGHT+IE9ig99awUrokE7xn2A52HBBXDTCaqi1yJVheRUPIFTc0txE1Ixps3H9tiYR2xGsTBt698m
X-Received: by 2002:a05:6a20:1450:b0:14d:382c:f908 with SMTP id
 a16-20020a056a20145000b0014d382cf908mr3001639pzi.32.1695223766790;
        Wed, 20 Sep 2023 08:29:26 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1695223766; cv=none;
        d=google.com; s=arc-20160816;
        b=WoSAelzKdlYe0IYxlT62YT7yBAZIRRNg9Sw7Wf1n+BGk99YhdCtoZgpI7iAsSjZZrI
         /YtCITtZmJ43jkdUWUL6RZiJsmHz3msE9MxApu9juz3CknlZuSMdEi9XYimyFRwyEZl6
         xzg5piwneje612oiZ29fgvVTWy41n3yGfra4mfhjzxzCZux4vpacsMGR0ObpVQN16Ppf
         hIiX5EkOs/o70jAGJkxzkYiYUCGm7Q16DFnNW54Pd6NG55KCn+YyQGVtXso7Juydi/RB
         O37UFkcah5IvTJiGeNWyUxei5pal2H000o9URFlMPmSklDB2Y84vZ2KAYzvzN/TZD9Cq
         cSdA==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=list-id:precedence:content-transfer-encoding:mime-version
         :references:in-reply-to:message-id:date:subject:cc:to:from
         :dkim-signature;
        bh=B04gbBYfG2Z+ZG2AaIcVwZBgZPaKp1IkffSjK1YB3eQ=;
        fh=itp5/nRcK58KRx3/xKC9CX073VYzlzJbf2ljAGnHiM4=;
        b=fwwb2KbJwn6MqlTbthq4x2m6Ai2tHcPeChvH8/exDrqC1RGDpSTHbHj7L2kh/JtgjA
         pvJzsUXeFsgbo0RoaHIc6KwddPb/MdJ/9EljXaCFL3IuY5l76FcUJ28CH0e913KRsccK
         xEeAuVyGScXeJfsvwNGo1XbyKGxTZLkrwO13uz4xfOqCj4XKqi0X5q18GkaF2smswgP6
         Uqm6jpSByvCA7Q+ajv/F7jwQNIZtN3zkkhnwE53Xgo0o1QtoCMXKGO03vWdH6m9uQixX
         kS3P5F01lF7BJC+1GoRSqgL2Y6ihnkHU41plVJ2S2gCfJHlOGyq72zMMmN1patYyn8eI
         Gmgg==
ARC-Authentication-Results: i=1; mx.google.com;
       dkim=pass header.i=@intel.com header.s=Intel header.b=Gp7YBX0n;
       spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 2620:137:e000::3:8 as permitted sender)
 smtp.mailfrom=linux-kernel-owner@vger.kernel.org;
       dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com
Received: from fry.vger.email (fry.vger.email. [2620:137:e000::3:8])
        by mx.google.com with ESMTPS id
 a1-20020a17090a8c0100b00268515ce449si1748272pjo.94.2023.09.20.08.29.26
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Wed, 20 Sep 2023 08:29:26 -0700 (PDT)
Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 2620:137:e000::3:8 as permitted sender)
 client-ip=2620:137:e000::3:8;
Authentication-Results: mx.google.com;
       dkim=pass header.i=@intel.com header.s=Intel header.b=Gp7YBX0n;
       spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 2620:137:e000::3:8 as permitted sender)
 smtp.mailfrom=linux-kernel-owner@vger.kernel.org;
       dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com
Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0])
	by fry.vger.email (Postfix) with ESMTP id 100C78021900;
	Tue, 19 Sep 2023 23:20:26 -0700 (PDT)
X-Virus-Status: Clean
X-Virus-Scanned: clamav-milter 0.103.10 at fry.vger.email
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S233335AbjITGUR (ORCPT <rfc822;toshivichauhan@gmail.com>
        + 26 others); Wed, 20 Sep 2023 02:20:17 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35276 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S233281AbjITGUM (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Wed, 20 Sep 2023 02:20:12 -0400
Received: from mgamail.intel.com (mgamail.intel.com [134.134.136.126])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5A061DE
        for <linux-kernel@vger.kernel.org>;
 Tue, 19 Sep 2023 23:19:53 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1695190793; x=1726726793;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=34wdsgvd2w4CDiR/vU6T5wdv61OCb5iOnSh2X+G58Mc=;
  b=Gp7YBX0nMiS5QX/TBV6p3G9xBLqXObK3Vvn25jn0OFU44htoSWmGhBO8
   xv+t8h02FgEdi0dLRwcRuNl+mmOHsEUyEh19kd0qiPcLb1h9X9r+m6g+P
   46KrzmOYM6RpgA/E5m+Ss6UN+SmSG2y2EZXtAFsP+YNEKXaVLb8pIKvQw
   f2vNEubh7v0tTgZd06YUAYshMqmz01kfTsWsgDY/ZE8yUaBoZ8b39gR26
   yvYxuPJhy/kgH6wXcGB43jfeTboGe301YHenTKyzBvXwNivj3kxFNzKF9
   Ot6gSC5Ku1KrPsELcyzePifubGKcLVYcgHfzN3Q2K/gww5copN/jP0P2G
   Q==;
X-IronPort-AV: E=McAfee;i="6600,9927,10838"; a="365187681"
X-IronPort-AV: E=Sophos;i="6.02,161,1688454000";
   d="scan'208";a="365187681"
Received: from orsmga007.jf.intel.com ([10.7.209.58])
  by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 19 Sep 2023 23:19:52 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=McAfee;i="6600,9927,10838"; a="740060606"
X-IronPort-AV: E=Sophos;i="6.02,161,1688454000";
   d="scan'208";a="740060606"
Received: from yhuang6-mobl2.sh.intel.com ([10.238.6.133])
  by orsmga007-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 19 Sep 2023 23:19:49 -0700
From: Huang Ying <ying.huang@intel.com>
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org,
        Arjan Van De Ven <arjan@linux.intel.com>,
        Huang Ying <ying.huang@intel.com>,
        Mel Gorman <mgorman@techsingularity.net>,
        Andrew Morton <akpm@linux-foundation.org>,
        Vlastimil Babka <vbabka@suse.cz>,
        David Hildenbrand <david@redhat.com>,
        Johannes Weiner <jweiner@redhat.com>,
        Dave Hansen <dave.hansen@linux.intel.com>,
        Michal Hocko <mhocko@suse.com>,
        Pavel Tatashin <pasha.tatashin@soleen.com>,
        Matthew Wilcox <willy@infradead.org>,
        Christoph Lameter <cl@linux.com>
Subject: [PATCH 05/10] mm,
 page_alloc: scale the number of pages that are batch allocated
Date: Wed, 20 Sep 2023 14:18:51 +0800
Message-Id: <20230920061856.257597-6-ying.huang@intel.com>
X-Mailer: git-send-email 2.39.2
In-Reply-To: <20230920061856.257597-1-ying.huang@intel.com>
References: <20230920061856.257597-1-ying.huang@intel.com>
MIME-Version: 1.0
X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED,
	DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,
	MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=unavailable
	autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on fry.vger.email
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org
X-Greylist: Sender passed SPF test,
 not delayed by milter-greylist-4.6.4 (fry.vger.email [0.0.0.0]);
 Tue, 19 Sep 2023 23:20:26 -0700 (PDT)
X-getmail-retrieved-from-mailbox: INBOX
X-GMAIL-THRID: 1777570956695843435
X-GMAIL-MSGID: 1777570956695843435

When a task is allocating a large number of order-0 pages, it may
acquire the zone->lock multiple times while allocating pages in
batches.  This may cause unnecessary contention on the zone lock when
allocating a very large number of pages.  This patch adapts the batch
size based on the recent allocation pattern, scaling it up for
subsequent allocations.

On a 2-socket Intel server with 224 logical CPUs, we tested kbuild on
one socket with `make -j 112`.  With the patch, the cycles% of the
spinlock contention (mostly for zone lock) decreases from 40.5% to
37.9% (with PCP size == 361).
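
The scaling behaviour can be sketched in userspace as follows.  This is
a simplified model of nr_pcp_alloc() from the diff below, not the
kernel code itself: struct pcp_model, model_nr_pcp_alloc() and
model_free() are names invented for the example, and locking and
READ_ONCE() are left out.

```c
#define PCP_BATCH_SCALE_MAX 5

/* Hypothetical, reduced stand-in for struct per_cpu_pages. */
struct pcp_model {
	int high;
	int batch;
	int count;
	unsigned char alloc_factor;
};

static int max_int(int a, int b) { return a > b ? a : b; }
static int min_int(int a, int b) { return a < b ? a : b; }

/* Number of pages to refill, following the logic of nr_pcp_alloc(). */
static int model_nr_pcp_alloc(struct pcp_model *pcp, int order)
{
	int high = pcp->high;
	int batch = pcp->batch;
	int max_nr_alloc;

	/* PCP disabled or boot pageset */
	if (high < batch)
		return 1;

	/* Double the order-0 refill size on each refill without a free. */
	if (!order) {
		max_nr_alloc = max_int(high - pcp->count - batch, batch);
		batch <<= pcp->alloc_factor;
		if (batch <= max_nr_alloc &&
		    pcp->alloc_factor < PCP_BATCH_SCALE_MAX)
			pcp->alloc_factor++;
		batch = min_int(batch, max_nr_alloc);
	}

	/* Scale batch relative to order, keeping at least 2 pages. */
	if (batch > 1)
		batch = max_int(batch >> order, 2);

	return batch;
}

/* A free halves the factor, as in free_unref_page_commit(). */
static void model_free(struct pcp_model *pcp)
{
	pcp->alloc_factor >>= 1;
}
```

In this model, with high = 2000, batch = 63 and an empty PCP, successive
order-0 refills return 63, 126, 252, 504 and 1008 pages; the next one is
capped at 1937 by high - count - batch, and a single free drops the
factor from 5 back to 2.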

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mmzone.h |  3 ++-
 mm/page_alloc.c        | 52 ++++++++++++++++++++++++++++++++++--------
 2 files changed, 44 insertions(+), 11 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4132e7490b49..4f7420e35fbb 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -685,9 +685,10 @@ struct per_cpu_pages {
 	int high;		/* high watermark, emptying needed */
 	int batch;		/* chunk size for buddy add/remove */
 	u8 flags;		/* protected by pcp->lock */
+	u8 alloc_factor;	/* batch scaling factor during allocate */
 	u8 free_factor;		/* batch scaling factor during free */
 #ifdef CONFIG_NUMA
-	short expire;		/* When 0, remote pagesets are drained */
+	u8 expire;		/* When 0, remote pagesets are drained */
 #endif
 
 	/* Lists of pages, one per migrate type stored on the pcp-lists */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 30554c674349..30bb05fa5353 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2376,6 +2376,12 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	int pindex;
 	bool free_high = false;
 
+	/*
+	 * On freeing, reduce the number of pages that are batch allocated.
+	 * See nr_pcp_alloc() where alloc_factor is increased for subsequent
+	 * allocations.
+	 */
+	pcp->alloc_factor >>= 1;
 	__count_vm_events(PGFREE, 1 << order);
 	pindex = order_to_pindex(migratetype, order);
 	list_add(&page->pcp_list, &pcp->lists[pindex]);
@@ -2682,6 +2688,41 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
 	return page;
 }
 
+static int nr_pcp_alloc(struct per_cpu_pages *pcp, int order)
+{
+	int high, batch, max_nr_alloc;
+
+	high = READ_ONCE(pcp->high);
+	batch = READ_ONCE(pcp->batch);
+
+	/* Check for PCP disabled or boot pageset */
+	if (unlikely(high < batch))
+		return 1;
+
+	/*
+	 * Double the number of pages allocated each time there is subsequent
+	 * refilling of order-0 pages without drain.
+	 */
+	if (!order) {
+		max_nr_alloc = max(high - pcp->count - batch, batch);
+		batch <<= pcp->alloc_factor;
+		if (batch <= max_nr_alloc && pcp->alloc_factor < PCP_BATCH_SCALE_MAX)
+			pcp->alloc_factor++;
+		batch = min(batch, max_nr_alloc);
+	}
+
+	/*
+	 * Scale batch relative to order if batch implies free pages
+	 * can be stored on the PCP. Batch can be 1 for small zones or
+	 * for boot pagesets which should never store free pages as
+	 * the pages may belong to arbitrary zones.
+	 */
+	if (batch > 1)
+		batch = max(batch >> order, 2);
+
+	return batch;
+}
+
 /* Remove page from the per-cpu list, caller must protect the list */
 static inline
 struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
@@ -2694,18 +2735,9 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 
 	do {
 		if (list_empty(list)) {
-			int batch = READ_ONCE(pcp->batch);
+			int batch = nr_pcp_alloc(pcp, order);
 			int alloced;
 
-			/*
-			 * Scale batch relative to order if batch implies
-			 * free pages can be stored on the PCP. Batch can
-			 * be 1 for small zones or for boot pagesets which
-			 * should never store free pages as the pages may
-			 * belong to arbitrary zones.
-			 */
-			if (batch > 1)
-				batch = max(batch >> order, 2);
 			alloced = rmqueue_bulk(zone, order,
 					batch, list,
 					migratetype, alloc_flags);

From patchwork Wed Sep 20 06:18:52 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Huang, Ying" <ying.huang@intel.com>
X-Patchwork-Id: 142496
Return-Path: <linux-kernel-owner@vger.kernel.org>
Delivered-To: ouuuleilei@gmail.com
Received: by 2002:a05:612c:172:b0:3f2:4152:657d with SMTP id h50csp4234689vqi;
        Wed, 20 Sep 2023 08:46:48 -0700 (PDT)
X-Google-Smtp-Source: 
 AGHT+IEbDK6zvUf1MpeC7IC+gkPest+2BPbnIcFFE6rjnmYyizq4jXyH8fI4TnbE/eq0+bWIe/WC
X-Received: by 2002:a17:903:2310:b0:1c3:bfb8:8c1 with SMTP id
 d16-20020a170903231000b001c3bfb808c1mr2691902plh.65.1695224808605;
        Wed, 20 Sep 2023 08:46:48 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1695224808; cv=none;
        d=google.com; s=arc-20160816;
        b=Dkduol8NgXsY4sH99qZXNycbbndnf7AnDSIOXDOeV7hL1B5puWhJ2xSoWsB/4e4+6Y
         m2DmdeIonJQKRa3fJFAr1lv1Bc4nO7Fe8euxZbAKkqFb+g/pAD+cus8G2Qbvid+1yEKh
         LicUFBI/X9HymKWGbC+4dWu+g+AczGvrP8FZ51ohtS+pNryW8XiGPakxeLDClBl5QCuW
         FrfC5pO4N023WqyQU4Yg49gQkR82JAKS3V19hpbVePX76/i0PHsCM25Y2NTHicSRw45E
         fGPg7cqg8VBQ9H1iw4dTxOfKQreyA6Yu+8hsDcWWkOBn2kgi7PgFZ9E61q3u6Ya0KlJ2
         604Q==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=list-id:precedence:content-transfer-encoding:mime-version
         :references:in-reply-to:message-id:date:subject:cc:to:from
         :dkim-signature;
        bh=SdcYhsok387cDjvXpV2YEVBEz71gokOEV+r5pxgghIk=;
        fh=OlKm7LKbIdgbzv7m6ivtVBS9u5zco/nrHpeuJnEjCeg=;
        b=LwrIU9d2QMtuWypU/PK2KKaiBeMoKF9vWhocJcnW9fHD9MwP0PIBJyhitAeBUUq4fO
         IigcYvqc4ODOuHcNMaXIIISycX9/TBuc6/ZhshirNsZ7F7/C0XA+xLwWGIt4xZeevsU7
         1mlx9se3y+z4bJB0TTsftF5tfyhshugrBatoBadnGhhP8dTdJp8w7Dg939R3aRj3XIkY
         aMgEqZGIfnob42gS7wwmgOZsF8xVOdbKkiabGbiujHlX9x1ZbnjSkBg9s0QQ4Ubo8sVA
         0YyGhKadq9AzOydig8qAQprmtGDcsWMv48o4rmzIMmqBSl3RPBDK3YuY1UmmSN1Idm5q
         sfUA==
ARC-Authentication-Results: i=1; mx.google.com;
       dkim=pass header.i=@intel.com header.s=Intel header.b=AYL8fzfH;
       spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 23.128.96.38 as permitted sender)
 smtp.mailfrom=linux-kernel-owner@vger.kernel.org;
       dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com
Received: from fry.vger.email (fry.vger.email. [23.128.96.38])
        by mx.google.com with ESMTPS id
 m7-20020a170902db0700b001b845157b69si12678445plx.414.2023.09.20.08.46.48
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Wed, 20 Sep 2023 08:46:48 -0700 (PDT)
Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 23.128.96.38 as permitted sender) client-ip=23.128.96.38;
Authentication-Results: mx.google.com;
       dkim=pass header.i=@intel.com header.s=Intel header.b=AYL8fzfH;
       spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 23.128.96.38 as permitted sender)
 smtp.mailfrom=linux-kernel-owner@vger.kernel.org;
       dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com
Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0])
	by fry.vger.email (Postfix) with ESMTP id 18D92825F150;
	Tue, 19 Sep 2023 23:20:34 -0700 (PDT)
X-Virus-Status: Clean
X-Virus-Scanned: clamav-milter 0.103.10 at fry.vger.email
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S233293AbjITGUW (ORCPT <rfc822;toshivichauhan@gmail.com>
        + 26 others); Wed, 20 Sep 2023 02:20:22 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:45742 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S233308AbjITGUQ (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Wed, 20 Sep 2023 02:20:16 -0400
Received: from mgamail.intel.com (mgamail.intel.com [134.134.136.126])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id EBE45134
        for <linux-kernel@vger.kernel.org>;
 Tue, 19 Sep 2023 23:19:56 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1695190797; x=1726726797;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=mpRZqHQlbPyOoPdSIFxq6ijVhcRIT0jLaqXdR2d5F6E=;
  b=AYL8fzfHjOESLE0NQxAgJ5zTca3qnmYfV3n1q/p24G9FwQkonPew90gz
   QF7IJKLZ7M8I6izXYhRV6qo5tBFvX4TFwBQmU90Jg8q7lEw+hNf00gp0K
   dQfjCtKadZwIfgZhKwkm/Pof6j56kZW5n4/dxrzP/MtCSvW+y52KGpnMO
   3/92lJwVufPag6kMjnqAl/aukSzbMwXojUWUbGyMOBxU5QPxISKrl/hzV
   rOjdhYeHk/q5HO86Ch2MfJVgspcyJOxZ06rcjOwmt+lEbhPeQR04f2WmE
   PTFgkvSfpFj3slaIYBw00AQZnF+3KM6CRkx8h/Jp/ODNGYdITMWV6iwLM
   w==;
X-IronPort-AV: E=McAfee;i="6600,9927,10838"; a="365187706"
X-IronPort-AV: E=Sophos;i="6.02,161,1688454000";
   d="scan'208";a="365187706"
Received: from orsmga007.jf.intel.com ([10.7.209.58])
  by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 19 Sep 2023 23:19:56 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=McAfee;i="6600,9927,10838"; a="740060623"
X-IronPort-AV: E=Sophos;i="6.02,161,1688454000";
   d="scan'208";a="740060623"
Received: from yhuang6-mobl2.sh.intel.com ([10.238.6.133])
  by orsmga007-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 19 Sep 2023 23:19:52 -0700
From: Huang Ying <ying.huang@intel.com>
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org,
        Arjan Van De Ven <arjan@linux.intel.com>,
        Huang Ying <ying.huang@intel.com>,
        Andrew Morton <akpm@linux-foundation.org>,
        Mel Gorman <mgorman@techsingularity.net>,
        Vlastimil Babka <vbabka@suse.cz>,
        David Hildenbrand <david@redhat.com>,
        Johannes Weiner <jweiner@redhat.com>,
        Dave Hansen <dave.hansen@linux.intel.com>,
        Michal Hocko <mhocko@suse.com>,
        Pavel Tatashin <pasha.tatashin@soleen.com>,
        Matthew Wilcox <willy@infradead.org>,
        Christoph Lameter <cl@linux.com>
Subject: [PATCH 06/10] mm: add framework for PCP high auto-tuning
Date: Wed, 20 Sep 2023 14:18:52 +0800
Message-Id: <20230920061856.257597-7-ying.huang@intel.com>
X-Mailer: git-send-email 2.39.2
In-Reply-To: <20230920061856.257597-1-ying.huang@intel.com>
References: <20230920061856.257597-1-ying.huang@intel.com>
MIME-Version: 1.0
X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED,
	DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,
	MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=unavailable
	autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on fry.vger.email
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org
X-Greylist: Sender passed SPF test,
 not delayed by milter-greylist-4.6.4 (fry.vger.email [0.0.0.0]);
 Tue, 19 Sep 2023 23:20:34 -0700 (PDT)
X-getmail-retrieved-from-mailbox: INBOX
X-GMAIL-THRID: 1777572048759479030
X-GMAIL-MSGID: 1777572048759479030

Different workloads usually have different page allocation performance
requirements.  So, PCP (per-CPU pageset) high needs to be tuned to
optimize the page allocation performance of each workload.  Currently,
there is a system-wide sysctl knob (percpu_pagelist_high_fraction) for
tuning PCP high by hand.  But it is hard to find the best value
manually, and one global configuration may not work best for
different workloads running on the same system.  One solution to these
issues is to tune the PCP high of each CPU automatically.

This patch adds the framework for PCP high auto-tuning.  With it,
pcp->high of each CPU will be changed automatically by a tuning
algorithm at runtime.  The minimal high (pcp->high_min) is the
original PCP high value, calculated based on the low watermark pages.
The maximal high (pcp->high_max) is the PCP high value when the
percpu_pagelist_high_fraction sysctl knob is set to
MIN_PERCPU_PAGELIST_HIGH_FRACTION, that is, the maximal pcp->high
that can be set by hand via the sysctl knob.

It's possible that PCP high auto-tuning doesn't work well for some
workloads.  So, when PCP high is tuned by hand via the sysctl knob,
auto-tuning is disabled by setting both pcp->high_min and
pcp->high_max to the manual value, and the PCP high set by hand is
used instead.
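
The selection of the two bounds described above can be sketched in
user-space C as follows.  This is only an illustration, not the kernel
code: sketch_highsize() is a hypothetical, simplified stand-in for
zone_highsize(), SKETCH_MIN_HIGH_FRACTION assumes that
MIN_PERCPU_PAGELIST_HIGH_FRACTION is 8, and all parameter names are
made up for the example.

```c
#include <assert.h>

/* Assumption: stand-in for the kernel's MIN_PERCPU_PAGELIST_HIGH_FRACTION. */
#define SKETCH_MIN_HIGH_FRACTION 8

struct pcp_bounds {
	int high_min;	/* lower bound for the auto-tuned pcp->high */
	int high_max;	/* upper bound for the auto-tuned pcp->high */
};

/*
 * Hypothetical, simplified model of zone_highsize(): with fraction == 0
 * the budget is the low watermark, otherwise it is managed_pages / fraction.
 * The budget is split across the CPUs and kept at no less than 4 batches.
 */
static int sketch_highsize(long managed_pages, long low_wmark_pages,
			   int nr_cpus, int batch, int fraction)
{
	long total, high;

	assert(nr_cpus > 0 && batch > 0);
	total = fraction ? managed_pages / fraction : low_wmark_pages;
	high = total / nr_cpus;
	if (high < 4L * batch)
		high = 4L * batch;
	return (int)high;
}

/* sysctl_fraction == 0 means the sysctl knob has not been set by hand. */
static struct pcp_bounds sketch_pcp_bounds(long managed_pages,
					   long low_wmark_pages, int nr_cpus,
					   int batch, int sysctl_fraction)
{
	struct pcp_bounds b;

	if (sysctl_fraction) {
		/* Manual tuning: min == max leaves no room for auto-tuning. */
		b.high_min = sketch_highsize(managed_pages, low_wmark_pages,
					     nr_cpus, batch, sysctl_fraction);
		b.high_max = b.high_min;
	} else {
		b.high_min = sketch_highsize(managed_pages, low_wmark_pages,
					     nr_cpus, batch, 0);
		b.high_max = sketch_highsize(managed_pages, low_wmark_pages,
					     nr_cpus, batch,
					     SKETCH_MIN_HIGH_FRACTION);
	}
	return b;
}
```

In this model, a caller can detect that auto-tuning is disabled simply
by checking high_min == high_max, which is the same check the later
patches in the series use.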

This patch only adds the framework, so the effective PCP high is
always pcp->high_min (the original default).  The actual auto-tuning
algorithm will be added in the following patches in the series.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
---
 include/linux/mmzone.h |  5 ++-
 mm/page_alloc.c        | 71 +++++++++++++++++++++++++++---------------
 2 files changed, 50 insertions(+), 26 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4f7420e35fbb..d6cfb5023f3e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -683,6 +683,8 @@ struct per_cpu_pages {
 	spinlock_t lock;	/* Protects lists field */
 	int count;		/* number of pages in the list */
 	int high;		/* high watermark, emptying needed */
+	int high_min;		/* min high watermark */
+	int high_max;		/* max high watermark */
 	int batch;		/* chunk size for buddy add/remove */
 	u8 flags;		/* protected by pcp->lock */
 	u8 alloc_factor;	/* batch scaling factor during allocate */
@@ -842,7 +844,8 @@ struct zone {
 	 * the high and batch values are copied to individual pagesets for
 	 * faster access
 	 */
-	int pageset_high;
+	int pageset_high_min;
+	int pageset_high_max;
 	int pageset_batch;
 
 #ifndef CONFIG_SPARSEMEM
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 30bb05fa5353..38bfab562b44 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2353,7 +2353,7 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high)
 static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
 		       bool free_high)
 {
-	int high = READ_ONCE(pcp->high);
+	int high = READ_ONCE(pcp->high_min);
 
 	if (unlikely(!high || free_high))
 		return 0;
@@ -2692,7 +2692,7 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, int order)
 {
 	int high, batch, max_nr_alloc;
 
-	high = READ_ONCE(pcp->high);
+	high = READ_ONCE(pcp->high_min);
 	batch = READ_ONCE(pcp->batch);
 
 	/* Check for PCP disabled or boot pageset */
@@ -5298,14 +5298,15 @@ static int zone_batchsize(struct zone *zone)
 }
 
 static int percpu_pagelist_high_fraction;
-static int zone_highsize(struct zone *zone, int batch, int cpu_online)
+static int zone_highsize(struct zone *zone, int batch, int cpu_online,
+			 int high_fraction)
 {
 #ifdef CONFIG_MMU
 	int high;
 	int nr_split_cpus;
 	unsigned long total_pages;
 
-	if (!percpu_pagelist_high_fraction) {
+	if (!high_fraction) {
 		/*
 		 * By default, the high value of the pcp is based on the zone
 		 * low watermark so that if they are full then background
@@ -5318,15 +5319,15 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online)
 		 * value is based on a fraction of the managed pages in the
 		 * zone.
 		 */
-		total_pages = zone_managed_pages(zone) / percpu_pagelist_high_fraction;
+		total_pages = zone_managed_pages(zone) / high_fraction;
 	}
 
 	/*
 	 * Split the high value across all online CPUs local to the zone. Note
 	 * that early in boot that CPUs may not be online yet and that during
 	 * CPU hotplug that the cpumask is not yet updated when a CPU is being
-	 * onlined. For memory nodes that have no CPUs, split pcp->high across
-	 * all online CPUs to mitigate the risk that reclaim is triggered
+	 * onlined. For memory nodes that have no CPUs, split the high value
+	 * across all online CPUs to mitigate the risk that reclaim is triggered
 	 * prematurely due to pages stored on pcp lists.
 	 */
 	nr_split_cpus = cpumask_weight(cpumask_of_node(zone_to_nid(zone))) + cpu_online;
@@ -5354,19 +5355,21 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online)
  * However, guaranteeing these relations at all times would require e.g. write
  * barriers here but also careful usage of read barriers at the read side, and
  * thus be prone to error and bad for performance. Thus the update only prevents
- * store tearing. Any new users of pcp->batch and pcp->high should ensure they
- * can cope with those fields changing asynchronously, and fully trust only the
- * pcp->count field on the local CPU with interrupts disabled.
+ * store tearing. Any new users of pcp->batch, pcp->high_min and pcp->high_max
+ * should ensure they can cope with those fields changing asynchronously, and
+ * fully trust only the pcp->count field on the local CPU with interrupts
+ * disabled.
  *
  * mutex_is_locked(&pcp_batch_high_lock) required when calling this function
  * outside of boot time (or some other assurance that no concurrent updaters
  * exist).
  */
-static void pageset_update(struct per_cpu_pages *pcp, unsigned long high,
-		unsigned long batch)
+static void pageset_update(struct per_cpu_pages *pcp, unsigned long high_min,
+			   unsigned long high_max, unsigned long batch)
 {
 	WRITE_ONCE(pcp->batch, batch);
-	WRITE_ONCE(pcp->high, high);
+	WRITE_ONCE(pcp->high_min, high_min);
+	WRITE_ONCE(pcp->high_max, high_max);
 }
 
 static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonestat *pzstats)
@@ -5386,20 +5389,21 @@ static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonesta
 	 * need to be as careful as pageset_update() as nobody can access the
 	 * pageset yet.
 	 */
-	pcp->high = BOOT_PAGESET_HIGH;
+	pcp->high_min = BOOT_PAGESET_HIGH;
+	pcp->high_max = BOOT_PAGESET_HIGH;
 	pcp->batch = BOOT_PAGESET_BATCH;
 	pcp->free_factor = 0;
 }
 
-static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high,
-		unsigned long batch)
+static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high_min,
+					      unsigned long high_max, unsigned long batch)
 {
 	struct per_cpu_pages *pcp;
 	int cpu;
 
 	for_each_possible_cpu(cpu) {
 		pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
-		pageset_update(pcp, high, batch);
+		pageset_update(pcp, high_min, high_max, batch);
 	}
 }
 
@@ -5409,19 +5413,34 @@ static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long h
  */
 static void zone_set_pageset_high_and_batch(struct zone *zone, int cpu_online)
 {
-	int new_high, new_batch;
+	int new_high_min, new_high_max, new_batch;
 
 	new_batch = max(1, zone_batchsize(zone));
-	new_high = zone_highsize(zone, new_batch, cpu_online);
+	if (percpu_pagelist_high_fraction) {
+		new_high_min = zone_highsize(zone, new_batch, cpu_online,
+					     percpu_pagelist_high_fraction);
+		/*
+		 * PCP high is tuned manually; disable auto-tuning by
+		 * setting high_min and high_max to the manual value.
+		 */
+		new_high_max = new_high_min;
+	} else {
+		new_high_min = zone_highsize(zone, new_batch, cpu_online, 0);
+		new_high_max = zone_highsize(zone, new_batch, cpu_online,
+					     MIN_PERCPU_PAGELIST_HIGH_FRACTION);
+	}
 
-	if (zone->pageset_high == new_high &&
+	if (zone->pageset_high_min == new_high_min &&
+	    zone->pageset_high_max == new_high_max &&
 	    zone->pageset_batch == new_batch)
 		return;
 
-	zone->pageset_high = new_high;
+	zone->pageset_high_min = new_high_min;
+	zone->pageset_high_max = new_high_max;
 	zone->pageset_batch = new_batch;
 
-	__zone_set_pageset_high_and_batch(zone, new_high, new_batch);
+	__zone_set_pageset_high_and_batch(zone, new_high_min, new_high_max,
+					  new_batch);
 }
 
 void __meminit setup_zone_pageset(struct zone *zone)
@@ -5529,7 +5548,8 @@ __meminit void zone_pcp_init(struct zone *zone)
 	 */
 	zone->per_cpu_pageset = &boot_pageset;
 	zone->per_cpu_zonestats = &boot_zonestats;
-	zone->pageset_high = BOOT_PAGESET_HIGH;
+	zone->pageset_high_min = BOOT_PAGESET_HIGH;
+	zone->pageset_high_max = BOOT_PAGESET_HIGH;
 	zone->pageset_batch = BOOT_PAGESET_BATCH;
 
 	if (populated_zone(zone))
@@ -6431,13 +6451,14 @@ EXPORT_SYMBOL(free_contig_range);
 void zone_pcp_disable(struct zone *zone)
 {
 	mutex_lock(&pcp_batch_high_lock);
-	__zone_set_pageset_high_and_batch(zone, 0, 1);
+	__zone_set_pageset_high_and_batch(zone, 0, 0, 1);
 	__drain_all_pages(zone, true);
 }
 
 void zone_pcp_enable(struct zone *zone)
 {
-	__zone_set_pageset_high_and_batch(zone, zone->pageset_high, zone->pageset_batch);
+	__zone_set_pageset_high_and_batch(zone, zone->pageset_high_min,
+		zone->pageset_high_max, zone->pageset_batch);
 	mutex_unlock(&pcp_batch_high_lock);
 }
 

From patchwork Wed Sep 20 06:18:53 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Huang, Ying" <ying.huang@intel.com>
X-Patchwork-Id: 142554
Return-Path: <linux-kernel-owner@vger.kernel.org>
Delivered-To: ouuuleilei@gmail.com
Received: by 2002:a05:612c:172:b0:3f2:4152:657d with SMTP id h50csp4359810vqi;
        Wed, 20 Sep 2023 12:02:47 -0700 (PDT)
X-Google-Smtp-Source: 
 AGHT+IEemM0FoLjeO1WEN9a0F21i3qBYChTi6WrGdSkfFQavIrWAQgoxgpypLugV2k0SYvZRNXW1
X-Received: by 2002:a05:6e02:1d0f:b0:345:d58d:9ae5 with SMTP id
 i15-20020a056e021d0f00b00345d58d9ae5mr4740963ila.7.1695236567675;
        Wed, 20 Sep 2023 12:02:47 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1695236567; cv=none;
        d=google.com; s=arc-20160816;
        b=HgNLVQ5mMgzBoMtGwjhSDt4SvcGnt7qEPPzHHSJkLIBqoJH9eaZYNQvZq1M+VScvFd
         zvatn1UKF87HybBTnrc4QlY4Y7Y7yMI+0q1zNIp6ev4UJJZ2Pz0R7VqIjE3Xvwmg/Jdl
         gq1ZlfSTVxnut3xobwYHUQX8pPgWLsBhNgnb7lCQ6Tnbz6lo7TgdmOPaaSo4I/dgFw9c
         Ji6YoKvKoA2lIC0dF986J9LcOOLt5fXEBwLZkMk9Qzp5WtbmGbMQUs0eyKDVG+/5qIi0
         9c3Hxy9YVOVzupJPj8O0hcWmZyKO2Ied5wtP58+xkS8ATHVbWT6QJNMvbC4GANCVR3as
         3txQ==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=list-id:precedence:content-transfer-encoding:mime-version
         :references:in-reply-to:message-id:date:subject:cc:to:from
         :dkim-signature;
        bh=M/fe6gvTJ5IvajuIXHKt80mPtHGXVD+KvoXX8vweENs=;
        fh=OKxYoYdw2w7fkY50MinRAdhvts7yonK0SE0KpytVvTs=;
        b=HQiEMKZnl/BJ+KvdtPBHJfhQ9pqO+pkr/oqSz5C4BNi/zOFTin7qiwp0s+7zWIf6pD
         IaVVKBOBQspxwHacRfTUKQtbFwgr1ZK5WtnJPoeAU6V0XLsZ718TX/0pT/3lPntt/OHt
         XyJ2LuctE4FGxv2aVSN80XebqVLmLQRyQzC5UanFtw15iRjcIqXSU7Ih1fLx0iWuLYgz
         3b2sZDC2l8dpXVTvGXCBdwVIhx2ktwrBqjAeXHnWlOlOMz9Qr4QhtYSg4QULjKARy/iK
         FBYQfgoJ3Eux1cVUdnJHsd6qPhsxFBIuFzxCKGTJ3N43dI4a3MIMkQbqS9Cuw4eGa+ZP
         DVHA==
ARC-Authentication-Results: i=1; mx.google.com;
       dkim=pass header.i=@intel.com header.s=Intel header.b=WfeXs7Pr;
       spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 23.128.96.38 as permitted sender)
 smtp.mailfrom=linux-kernel-owner@vger.kernel.org;
       dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com
Received: from fry.vger.email (fry.vger.email. [23.128.96.38])
        by mx.google.com with ESMTPS id
 j64-20020a638043000000b00578b5364e80si3728338pgd.557.2023.09.20.12.02.47
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Wed, 20 Sep 2023 12:02:47 -0700 (PDT)
Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 23.128.96.38 as permitted sender) client-ip=23.128.96.38;
Authentication-Results: mx.google.com;
       dkim=pass header.i=@intel.com header.s=Intel header.b=WfeXs7Pr;
       spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 23.128.96.38 as permitted sender)
 smtp.mailfrom=linux-kernel-owner@vger.kernel.org;
       dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com
Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0])
	by fry.vger.email (Postfix) with ESMTP id 391DB82D87A8;
	Tue, 19 Sep 2023 23:21:20 -0700 (PDT)
X-Virus-Status: Clean
X-Virus-Scanned: clamav-milter 0.103.10 at fry.vger.email
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S233218AbjITGUg (ORCPT <rfc822;toshivichauhan@gmail.com>
        + 26 others); Wed, 20 Sep 2023 02:20:36 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:53356 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S233350AbjITGUR (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Wed, 20 Sep 2023 02:20:17 -0400
Received: from mgamail.intel.com (mgamail.intel.com [134.134.136.126])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id AC0171A8
        for <linux-kernel@vger.kernel.org>;
 Tue, 19 Sep 2023 23:20:01 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1695190801; x=1726726801;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=hq7IYYC8nVzeFGqZQrv5zeua36AUxmv2g1nXbA7LL/w=;
  b=WfeXs7PrdGSr3f+XV/Uc1r9LwyeMX1Cw4QnURWL8lbYlu9+v5goZsLbj
   k2oezLdBJhx/WyXGMsdlSxkrH7aiz8f0KFHMOs8oO6+fGLx6C7xjjZLMM
   lko0OYwm5S3Ykconu0RM2xs4V8BNMUZaAUl6F4Tv2DjGwOG6we/zFTWBh
   aC7UcZqc2D5l45YWEr7V3jBgjb1XtJ1di3Yd5omRFHOnqOAq85MI0eYmn
   SZEZ1DqnQFC/mghhRfFuxwILn+vH41pxAty2RugV6YgnsnXJq2Lohd1Un
   G2It5WZvQY0gD1beKoMDAzJZIS56oGxReI9eBkYKeBxkR3iApefk0ltZ1
   g==;
X-IronPort-AV: E=McAfee;i="6600,9927,10838"; a="365187731"
X-IronPort-AV: E=Sophos;i="6.02,161,1688454000";
   d="scan'208";a="365187731"
Received: from orsmga007.jf.intel.com ([10.7.209.58])
  by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 19 Sep 2023 23:20:01 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=McAfee;i="6600,9927,10838"; a="740060638"
X-IronPort-AV: E=Sophos;i="6.02,161,1688454000";
   d="scan'208";a="740060638"
Received: from yhuang6-mobl2.sh.intel.com ([10.238.6.133])
  by orsmga007-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 19 Sep 2023 23:19:56 -0700
From: Huang Ying <ying.huang@intel.com>
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org,
        Arjan Van De Ven <arjan@linux.intel.com>,
        Huang Ying <ying.huang@intel.com>,
        Mel Gorman <mgorman@techsingularity.net>,
        Michal Hocko <mhocko@suse.com>,
        Andrew Morton <akpm@linux-foundation.org>,
        Vlastimil Babka <vbabka@suse.cz>,
        David Hildenbrand <david@redhat.com>,
        Johannes Weiner <jweiner@redhat.com>,
        Dave Hansen <dave.hansen@linux.intel.com>,
        Pavel Tatashin <pasha.tatashin@soleen.com>,
        Matthew Wilcox <willy@infradead.org>,
        Christoph Lameter <cl@linux.com>
Subject: [PATCH 07/10] mm: tune PCP high automatically
Date: Wed, 20 Sep 2023 14:18:53 +0800
Message-Id: <20230920061856.257597-8-ying.huang@intel.com>
X-Mailer: git-send-email 2.39.2
In-Reply-To: <20230920061856.257597-1-ying.huang@intel.com>
References: <20230920061856.257597-1-ying.huang@intel.com>
MIME-Version: 1.0
X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED,
	DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,
	MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=unavailable
	autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on fry.vger.email
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org
X-Greylist: Sender passed SPF test,
 not delayed by milter-greylist-4.6.4 (fry.vger.email [0.0.0.0]);
 Tue, 19 Sep 2023 23:21:20 -0700 (PDT)
X-getmail-retrieved-from-mailbox: INBOX
X-GMAIL-THRID: 1777584379120769240
X-GMAIL-MSGID: 1777584379120769240

The goals of tuning PCP high automatically are as follows:

- Minimize allocation/freeing from/to shared zone

- Minimize idle pages in PCP

- Minimize pages in PCP if the system has too few free pages

To reach these goals, the following tuning algorithm is designed:

- When we refill PCP by allocating from the zone, increase PCP high,
  because with a larger PCP we could have avoided allocating from the
  zone.

- In the periodic vmstat updating kworker (via refresh_cpu_vm_stats()),
  decrease PCP high to try to free possibly idle PCP pages.

- When page reclaiming is active for the zone, stop increasing PCP
  high in the allocating path, and decrease PCP high and free some
  pages in the freeing path.

So, the PCP high can eventually be tuned to the page
allocating/freeing depth of the workloads.
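
The three rules above can be modelled in user-space C roughly as
follows.  This is a minimal sketch, not the kernel implementation:
struct sketch_pcp and the sketch_*() helpers are hypothetical names,
locking and the actual page freeing are left out, and
SKETCH_BATCH_SCALE_MAX assumes that PCP_BATCH_SCALE_MAX is 5.

```c
#include <assert.h>

/* Assumption: stand-in for the kernel's PCP_BATCH_SCALE_MAX. */
#define SKETCH_BATCH_SCALE_MAX 5

/* Hypothetical, reduced model of struct per_cpu_pages. */
struct sketch_pcp {
	int count;	/* pages currently on the PCP lists */
	int high;	/* current, auto-tuned high */
	int high_min;	/* lower bound */
	int high_max;	/* upper bound */
	int batch;	/* base batch size */
};

static int max_int(int a, int b) { return a > b ? a : b; }
static int min_int(int a, int b) { return a < b ? a : b; }

/* Refill path: grow high by one batch unless reclaim is active. */
static void sketch_on_refill(struct sketch_pcp *pcp, int reclaim_active)
{
	assert(pcp->high_min <= pcp->high_max);
	if (pcp->high_min != pcp->high_max && !reclaim_active)
		pcp->high = min_int(pcp->high + pcp->batch, pcp->high_max);
}

/*
 * Periodic decay: shrink high to 4/5 of its value, but not below
 * high_min and not so far that more than (batch << SCALE_MAX) pages
 * would have to be drained at once.  Returns the number of pages a
 * caller would drain from the PCP lists.
 */
static int sketch_decay(struct sketch_pcp *pcp)
{
	int to_drain;

	if (pcp->high > pcp->high_min) {
		int by_count = pcp->count -
			       (pcp->batch << SKETCH_BATCH_SCALE_MAX);
		int by_ratio = pcp->high * 4 / 5;

		pcp->high = max_int(max_int(by_count, by_ratio),
				    pcp->high_min);
	}
	to_drain = pcp->count - pcp->high;
	return to_drain > 0 ? to_drain : 0;
}

/* Freeing path while reclaim is active: shrink high, bounded by high_min. */
static void sketch_on_free_under_reclaim(struct sketch_pcp *pcp,
					 int free_factor)
{
	pcp->high = max_int(pcp->high - (pcp->batch << free_factor),
			    pcp->high_min);
}
```

In the sketch, growth is additive (one batch per refill) while decay
is multiplicative (4/5 per period), so high should follow a bursty
workload quickly and fall back towards high_min once the CPU goes
idle.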

One issue with the algorithm is that if the number of pages allocated
on a CPU is much larger than the number of pages freed, the PCP high
may reach the maximal value even if the allocating/freeing depth is
small.  But this isn't a severe issue, because there are no idle pages
in this case.
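
To illustrate how quickly this can happen, the following minimal
sketch counts the refills needed for PCP high to climb from high_min
to high_max on an allocation-only CPU.  It is a simplified model under
stated assumptions: the function name is hypothetical, every refill
adds exactly one base batch (the real code may add a scaled batch for
order-0 pages), and the periodic decay is ignored.

```c
#include <assert.h>

/*
 * Hypothetical model: each refill raises high by one batch, capped at
 * high_max.  Returns how many refills it takes to reach high_max.
 */
static int sketch_refills_to_reach_max(int high_min, int high_max, int batch)
{
	int high = high_min;
	int refills = 0;

	assert(batch > 0 && high_min <= high_max);
	while (high < high_max) {
		high += batch;
		if (high > high_max)
			high = high_max;
		refills++;
	}
	return refills;
}
```

For example, with high_min = 200, high_max = 2000 and batch = 63, the
model reaches high_max after 29 refills.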

One alternative is to increase PCP high when we drain PCP by freeing
pages to the zone, but not during PCP refilling.  This avoids the
issue above.  But if the number of pages allocated on a CPU is much
smaller than the number of pages freed, there will be many idle pages
in PCP and it may be hard to free them.

On a 2-socket Intel server with 224 logical CPUs, we tested kbuild on
one socket with `make -j 112`.  With the patch, the build time
decreases by 10.1%.  The cycles% of the spinlock contention (mostly
for zone lock) decreases from 37.9% to 9.8% (with PCP size == 361).
The number of PCP drains for high-order page freeing (free_high)
decreases by 53.4%.  The number of pages allocated from the zone
(instead of from PCP) decreases by 77.3%.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Mel Gorman <mgorman@techsingularity.net>
Suggested-by: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
---
 include/linux/gfp.h |   1 +
 mm/page_alloc.c     | 118 ++++++++++++++++++++++++++++++++++----------
 mm/vmstat.c         |   8 +--
 3 files changed, 98 insertions(+), 29 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 665edc11fb9f..5b917e5b9350 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -320,6 +320,7 @@ extern void page_frag_free(void *addr);
 #define free_page(addr) free_pages((addr), 0)
 
 void page_alloc_init_cpuhp(void);
+int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp);
 void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
 void drain_all_pages(struct zone *zone);
 void drain_local_pages(struct zone *zone);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 38bfab562b44..225abe56752c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2160,6 +2160,40 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 	return i;
 }
 
+/*
+ * Called from the vmstat counter updater to decay the PCP high.
+ * Return nonzero if there is additional work to do.
+ */
+int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
+{
+	int high_min, to_drain, batch;
+	int todo = 0;
+
+	high_min = READ_ONCE(pcp->high_min);
+	batch = READ_ONCE(pcp->batch);
+	/*
+	 * Decrease pcp->high periodically to try to free possibly
+	 * idle PCP pages.  Avoid freeing too many pages at once, to
+	 * control latency.
+	 */
+	if (pcp->high > high_min) {
+		pcp->high = max3(pcp->count - (batch << PCP_BATCH_SCALE_MAX),
+				 pcp->high * 4 / 5, high_min);
+		if (pcp->high > high_min)
+			todo++;
+	}
+
+	to_drain = pcp->count - pcp->high;
+	if (to_drain > 0) {
+		spin_lock(&pcp->lock);
+		free_pcppages_bulk(zone, to_drain, pcp, 0);
+		spin_unlock(&pcp->lock);
+		todo++;
+	}
+
+	return todo;
+}
+
 #ifdef CONFIG_NUMA
 /*
  * Called from the vmstat counter updater to drain pagesets of this
@@ -2321,14 +2355,13 @@ static bool free_unref_page_prepare(struct page *page, unsigned long pfn,
 	return true;
 }
 
-static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high)
+static int nr_pcp_free(struct per_cpu_pages *pcp, int batch, int high, bool free_high)
 {
 	int min_nr_free, max_nr_free;
-	int batch = READ_ONCE(pcp->batch);
 
-	/* Free everything if batch freeing high-order pages. */
+	/* Free as much as possible if batch freeing high-order pages. */
 	if (unlikely(free_high))
-		return pcp->count;
+		return min(pcp->count, batch << PCP_BATCH_SCALE_MAX);
 
 	/* Check for PCP disabled or boot pageset */
 	if (unlikely(high < batch))
@@ -2343,7 +2376,7 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high)
 	 * freeing of pages without any allocation.
 	 */
 	batch <<= pcp->free_factor;
-	if (batch < max_nr_free && pcp->free_factor < PCP_BATCH_SCALE_MAX)
+	if (batch <= max_nr_free && pcp->free_factor < PCP_BATCH_SCALE_MAX)
 		pcp->free_factor++;
 	batch = clamp(batch, min_nr_free, max_nr_free);
 
@@ -2351,28 +2384,47 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high)
 }
 
 static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
-		       bool free_high)
+		       int batch, bool free_high)
 {
-	int high = READ_ONCE(pcp->high_min);
+	int high, high_min, high_max;
 
-	if (unlikely(!high || free_high))
+	high_min = READ_ONCE(pcp->high_min);
+	high_max = READ_ONCE(pcp->high_max);
+	high = pcp->high = clamp(pcp->high, high_min, high_max);
+
+	if (unlikely(!high))
 		return 0;
 
-	if (!test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
-		return high;
+	if (unlikely(free_high)) {
+		pcp->high = max(high - (batch << PCP_BATCH_SCALE_MAX), high_min);
+		return 0;
+	}
 
 	/*
 	 * If reclaim is active, limit the number of pages that can be
 	 * stored on pcp lists
 	 */
-	return min(READ_ONCE(pcp->batch) << 2, high);
+	if (test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags)) {
+		pcp->high = max(high - (batch << pcp->free_factor), high_min);
+		return min(batch << 2, pcp->high);
+	}
+
+	if (pcp->count >= high && high_min != high_max) {
+		int need_high = (batch << pcp->free_factor) + batch;
+
+		/* pcp->high should be large enough to hold batch freed pages */
+		if (pcp->high < need_high)
+			pcp->high = clamp(need_high, high_min, high_max);
+	}
+
+	return high;
 }
 
 static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 				   struct page *page, int migratetype,
 				   unsigned int order)
 {
-	int high;
+	int high, batch;
 	int pindex;
 	bool free_high = false;
 
@@ -2387,6 +2439,7 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	list_add(&page->pcp_list, &pcp->lists[pindex]);
 	pcp->count += 1 << order;
 
+	batch = READ_ONCE(pcp->batch);
 	/*
 	 * As high-order pages other than THP's stored on PCP can contribute
 	 * to fragmentation, limit the number stored when PCP is heavily
@@ -2397,14 +2450,15 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 		free_high = (pcp->free_factor &&
 			     (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) &&
 			     (!(pcp->flags & PCPF_FREE_HIGH_BATCH) ||
-			      pcp->count >= READ_ONCE(pcp->batch)));
+			      pcp->count >= batch));
 		pcp->flags |= PCPF_PREV_FREE_HIGH_ORDER;
 	} else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) {
 		pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
 	}
-	high = nr_pcp_high(pcp, zone, free_high);
+	high = nr_pcp_high(pcp, zone, batch, free_high);
 	if (pcp->count >= high) {
-		free_pcppages_bulk(zone, nr_pcp_free(pcp, high, free_high), pcp, pindex);
+		free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
+				   pcp, pindex);
 	}
 }
 
@@ -2688,24 +2742,38 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
 	return page;
 }
 
-static int nr_pcp_alloc(struct per_cpu_pages *pcp, int order)
+static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order)
 {
-	int high, batch, max_nr_alloc;
+	int high, base_batch, batch, max_nr_alloc;
+	int high_max, high_min;
 
-	high = READ_ONCE(pcp->high_min);
-	batch = READ_ONCE(pcp->batch);
+	base_batch = READ_ONCE(pcp->batch);
+	high_min = READ_ONCE(pcp->high_min);
+	high_max = READ_ONCE(pcp->high_max);
+	high = pcp->high = clamp(pcp->high, high_min, high_max);
 
 	/* Check for PCP disabled or boot pageset */
-	if (unlikely(high < batch))
+	if (unlikely(high < base_batch))
 		return 1;
 
+	if (order)
+		batch = base_batch;
+	else
+		batch = (base_batch << pcp->alloc_factor);
+
 	/*
-	 * Double the number of pages allocated each time there is subsequent
-	 * refiling of order-0 pages without drain.
+	 * If we had a larger pcp->high, we could avoid allocating from
+	 * the zone.
 	 */
+	if (high_min != high_max && !test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
+		high = pcp->high = min(high + batch, high_max);
+
 	if (!order) {
-		max_nr_alloc = max(high - pcp->count - batch, batch);
-		batch <<= pcp->alloc_factor;
+		max_nr_alloc = max(high - pcp->count - base_batch, base_batch);
+		/*
+		 * Double the number of pages allocated each time there is
+		 * subsequent refilling of order-0 pages without drain.
+		 */
 		if (batch <= max_nr_alloc && pcp->alloc_factor < PCP_BATCH_SCALE_MAX)
 			pcp->alloc_factor++;
 		batch = min(batch, max_nr_alloc);
@@ -2735,7 +2803,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 
 	do {
 		if (list_empty(list)) {
-			int batch = nr_pcp_alloc(pcp, order);
+			int batch = nr_pcp_alloc(pcp, zone, order);
 			int alloced;
 
 			alloced = rmqueue_bulk(zone, order,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 00e81e99c6ee..2f716ad14168 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -814,9 +814,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 
 	for_each_populated_zone(zone) {
 		struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
-#ifdef CONFIG_NUMA
 		struct per_cpu_pages __percpu *pcp = zone->per_cpu_pageset;
-#endif
 
 		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
 			int v;
@@ -832,10 +830,12 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 #endif
 			}
 		}
-#ifdef CONFIG_NUMA
 
 		if (do_pagesets) {
 			cond_resched();
+
+			changes += decay_pcp_high(zone, this_cpu_ptr(pcp));
+#ifdef CONFIG_NUMA
 			/*
 			 * Deal with draining the remote pageset of this
 			 * processor
@@ -862,8 +862,8 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 				drain_zone_pages(zone, this_cpu_ptr(pcp));
 				changes++;
 			}
-		}
 #endif
+		}
 	}
 
 	for_each_online_pgdat(pgdat) {

From patchwork Wed Sep 20 06:18:54 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Huang, Ying" <ying.huang@intel.com>
X-Patchwork-Id: 142286
Return-Path: <linux-kernel-owner@vger.kernel.org>
Delivered-To: ouuuleilei@gmail.com
Received: by 2002:a05:612c:172:b0:3f2:4152:657d with SMTP id h50csp3972358vqi;
        Wed, 20 Sep 2023 01:34:54 -0700 (PDT)
X-Google-Smtp-Source: 
 AGHT+IGaSxizVLqd/UQY81bPoehQRdkB+pgtTMJiCpf6q7XFXktcmCFqzEZaPnHsgK7m9LCli7WL
X-Received: by 2002:a05:6a00:13a8:b0:68f:e121:b37c with SMTP id
 t40-20020a056a0013a800b0068fe121b37cmr2228984pfg.4.1695198894523;
        Wed, 20 Sep 2023 01:34:54 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1695198894; cv=none;
        d=google.com; s=arc-20160816;
        b=TbnVqKPtD9Gwu1E56h9vnhhoVk57BiieqiUA1boZWeZ9jmypRI+FnWEEv9ScHTExxm
         et2k4gW39Rfp+KDy3PjHQY1uwmPDsHUf/IkS/d8SOAlh3DGl9Lz7r1LxbCBIguuYxDEc
         OiNcOkyEIC4R+llLpY3xui5RjpKS9oNhaxVbGq89GwgVvuJiOmtW6Tekc0ui/HbMYd8q
         GfdLVTPqjQ8pMEMwAH1WaUWeHwkk3xe7YVdEhWIa7/7tI2fZTRovKceuR75KGzj3l/ji
         NPGmxDZQPg1PgYTG/18Kv0zIfeP/n9nIFq2juLlUfDmLOJKyT+vJKaT2V1bqkKKJoUru
         DPVQ==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=list-id:precedence:content-transfer-encoding:mime-version
         :references:in-reply-to:message-id:date:subject:cc:to:from
         :dkim-signature;
        bh=UIyhaTtMVzkd1JNxNHUm6Tdf0oFtWg8xOLzRvYoLm7Q=;
        fh=OlKm7LKbIdgbzv7m6ivtVBS9u5zco/nrHpeuJnEjCeg=;
        b=HOtpVRkhkfENZxxAgDQrlhJByHW+mmOR1VdmYvFnpO1CQgtTJnWiszPIVbt51Q3/XZ
         1FFLqQUuoIw0W3Mrh0sE4oHCD+UIO9PuIaLL7OqqJja7GU7UMIAb1jFRg5Ycinehinp8
         nY2uSu0bb2yNtP8ricbejlCe8i2uDnPOpAzDMM/snOxHJT73/3cFGxpG8qWZzvcHAgxd
         wQnelZkmE/q4wnFKX9frDpiIRPlcGbh8vBmPz/QCzndyOFFuDdlv5j9Y1/OyGVODYRQO
         yhbuaHHju5n0YBTPJLl4QHFH5LZAAw+ta8Tk0kIIXhFjOhRZe7ZOKTnMjEJwsJskb8D2
         6t1w==
ARC-Authentication-Results: i=1; mx.google.com;
       dkim=pass header.i=@intel.com header.s=Intel header.b=EJur15Bd;
       spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 23.128.96.31 as permitted sender)
 smtp.mailfrom=linux-kernel-owner@vger.kernel.org;
       dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com
Received: from morse.vger.email (morse.vger.email. [23.128.96.31])
        by mx.google.com with ESMTPS id
 az1-20020a056a02004100b005783f4fa3a5si7996795pgb.300.2023.09.20.01.34.54
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Wed, 20 Sep 2023 01:34:54 -0700 (PDT)
Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 23.128.96.31 as permitted sender) client-ip=23.128.96.31;
Authentication-Results: mx.google.com;
       dkim=pass header.i=@intel.com header.s=Intel header.b=EJur15Bd;
       spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 23.128.96.31 as permitted sender)
 smtp.mailfrom=linux-kernel-owner@vger.kernel.org;
       dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com
Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0])
	by morse.vger.email (Postfix) with ESMTP id 4DC60801F9AD;
	Tue, 19 Sep 2023 23:21:14 -0700 (PDT)
X-Virus-Status: Clean
X-Virus-Scanned: clamav-milter 0.103.10 at morse.vger.email
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S233284AbjITGUj (ORCPT <rfc822;toshivichauhan@gmail.com>
        + 26 others); Wed, 20 Sep 2023 02:20:39 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:53250 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S233373AbjITGUS (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Wed, 20 Sep 2023 02:20:18 -0400
Received: from mgamail.intel.com (mgamail.intel.com [134.134.136.126])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 60A0CCC9
        for <linux-kernel@vger.kernel.org>;
 Tue, 19 Sep 2023 23:20:05 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1695190805; x=1726726805;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=6W6lJyCPE2vK1mxwG+RBHDFX6xtr7yfsoTDfDpxDR10=;
  b=EJur15BdL1AofMJxC1dgVPzOu6jkqI1uiiSu9QDQNm5rjdI5ghObtCnU
   B7sqjN0uTlMAwW35J2+xf9TapE76cyd799uZ/I6PdNc2PEJoj6TAVyfnO
   Ev2Y/Rs7a4OZhQOfH/qL44GVwcHdPY+ZXJ1v3VKSw7kbP++jX+B67BQTR
   3uKUdKzTzerKd44k2sBrekpfIlEqJsKSM7QNcl+lCMkUIBTOuRJj/pk0S
   V/VbAVmsiWm2BvCp4kjK6iNRslW0sdw+bq7jpfOECMJkW8GxBbWi9gunP
   hMhDqMRb9HQ2TDSV2gdwpk2KfweKTC+H+cdR87sSmEtW2VKULP2Y3EEZM
   g==;
X-IronPort-AV: E=McAfee;i="6600,9927,10838"; a="365187763"
X-IronPort-AV: E=Sophos;i="6.02,161,1688454000";
   d="scan'208";a="365187763"
Received: from orsmga007.jf.intel.com ([10.7.209.58])
  by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 19 Sep 2023 23:20:05 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=McAfee;i="6600,9927,10838"; a="740060665"
X-IronPort-AV: E=Sophos;i="6.02,161,1688454000";
   d="scan'208";a="740060665"
Received: from yhuang6-mobl2.sh.intel.com ([10.238.6.133])
  by orsmga007-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 19 Sep 2023 23:20:01 -0700
From: Huang Ying <ying.huang@intel.com>
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org,
        Arjan Van De Ven <arjan@linux.intel.com>,
        Huang Ying <ying.huang@intel.com>,
        Andrew Morton <akpm@linux-foundation.org>,
        Mel Gorman <mgorman@techsingularity.net>,
        Vlastimil Babka <vbabka@suse.cz>,
        David Hildenbrand <david@redhat.com>,
        Johannes Weiner <jweiner@redhat.com>,
        Dave Hansen <dave.hansen@linux.intel.com>,
        Michal Hocko <mhocko@suse.com>,
        Pavel Tatashin <pasha.tatashin@soleen.com>,
        Matthew Wilcox <willy@infradead.org>,
        Christoph Lameter <cl@linux.com>
Subject: [PATCH 08/10] mm,
 pcp: decrease PCP high if free pages < high watermark
Date: Wed, 20 Sep 2023 14:18:54 +0800
Message-Id: <20230920061856.257597-9-ying.huang@intel.com>
X-Mailer: git-send-email 2.39.2
In-Reply-To: <20230920061856.257597-1-ying.huang@intel.com>
References: <20230920061856.257597-1-ying.huang@intel.com>
MIME-Version: 1.0
X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED,
	DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,
	MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=unavailable
	autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on morse.vger.email
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org
X-Greylist: Sender passed SPF test,
 not delayed by milter-greylist-4.6.4 (morse.vger.email [0.0.0.0]);
 Tue, 19 Sep 2023 23:21:14 -0700 (PDT)
X-getmail-retrieved-from-mailbox: INBOX
X-GMAIL-THRID: 1777544876019014976
X-GMAIL-MSGID: 1777544876019014976

One goal of PCP is to keep few pages in PCP when the system is short
of free pages.  To reach that goal, when page reclaim is active for
the zone (ZONE_RECLAIM_ACTIVE), we stop increasing PCP high in the
allocation path, and we decrease PCP high and free some pages in the
freeing path.  But this may be too late, because background page
reclaim may introduce latency for some workloads.  So, in this patch,
we detect during page allocation whether the number of free pages in
the zone is below the high watermark.  If so, we stop increasing PCP
high in the allocation path, and we decrease PCP high and free some
pages in the freeing path.  This reduces the chance of premature
background page reclaim caused by a too-large PCP.

The high watermark check is done in the allocation path to reduce
overhead in the hotter freeing path.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
---
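Note for readers (below the cut line, so not part of the commit
message): the behaviour described above can be sketched as a small
userspace model.  This is only an illustration under stated
assumptions: struct pcp_model, model_free_limit() and
model_alloc_grow() are hypothetical names, the below_high argument
stands in for test_bit(ZONE_BELOW_HIGH, &zone->flags), and the branch
that grows pcp->high in the freeing path is left out.

```c
#include <stdbool.h>

/*
 * Hypothetical userspace stand-in for the fields of struct
 * per_cpu_pages that this patch touches; not the kernel structure.
 */
struct pcp_model {
	int count;		/* pages currently held in the PCP */
	int high;		/* current PCP high */
	int high_min;
	int high_max;
	int batch;
	int free_factor;
};

static int max_int(int a, int b) { return a > b ? a : b; }
static int min_int(int a, int b) { return a < b ? a : b; }

/*
 * Freeing path: return the limit that pcp->count is compared against.
 * While the zone is below the high watermark, shrink pcp->high and
 * use the current count as the limit so that pages are freed at once.
 */
static int model_free_limit(struct pcp_model *pcp, bool below_high)
{
	int high = pcp->high;

	if (pcp->high_min == pcp->high_max)
		return high;	/* auto-tuning disabled */

	if (below_high) {
		pcp->high = max_int(high - (pcp->batch << pcp->free_factor),
				    pcp->high_min);
		high = max_int(pcp->count, pcp->high_min);
	}
	return high;
}

/* Allocation path: grow pcp->high only if the zone is not below high. */
static void model_alloc_grow(struct pcp_model *pcp, bool below_high)
{
	if (pcp->high_min != pcp->high_max && !below_high)
		pcp->high = min_int(pcp->high + pcp->batch, pcp->high_max);
}
```

With count = 100, high = 500, high_min = 64, batch = 63 and
free_factor = 1, one call on the freeing path with below_high set
lowers high by 126 and returns the current count as the limit.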
 include/linux/mmzone.h |  1 +
 mm/page_alloc.c        | 22 ++++++++++++++++++++--
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d6cfb5023f3e..8a19e2af89df 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1006,6 +1006,7 @@ enum zone_flags {
 					 * Cleared when kswapd is woken.
 					 */
 	ZONE_RECLAIM_ACTIVE,		/* kswapd may be scanning the zone. */
+	ZONE_BELOW_HIGH,		/* zone is below high watermark. */
 };
 
 static inline unsigned long zone_managed_pages(struct zone *zone)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 225abe56752c..3f8c7dfeed23 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2409,7 +2409,13 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
 		return min(batch << 2, pcp->high);
 	}
 
-	if (pcp->count >= high && high_min != high_max) {
+	if (high_min == high_max)
+		return high;
+
+	if (test_bit(ZONE_BELOW_HIGH, &zone->flags)) {
+		pcp->high = max(high - (batch << pcp->free_factor), high_min);
+		high = max(pcp->count, high_min);
+	} else if (pcp->count >= high) {
 		int need_high = (batch << pcp->free_factor) + batch;
 
 		/* pcp->high should be large enough to hold batch freed pages */
@@ -2459,6 +2465,10 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	if (pcp->count >= high) {
 		free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
 				   pcp, pindex);
+		if (test_bit(ZONE_BELOW_HIGH, &zone->flags) &&
+		    zone_watermark_ok(zone, 0, high_wmark_pages(zone),
+				      ZONE_MOVABLE, 0))
+			clear_bit(ZONE_BELOW_HIGH, &zone->flags);
 	}
 }
 
@@ -2765,7 +2775,7 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order)
 	 * If we had larger pcp->high, we could avoid to allocate from
 	 * zone.
 	 */
-	if (high_min != high_max && !test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
+	if (high_min != high_max && !test_bit(ZONE_BELOW_HIGH, &zone->flags))
 		high = pcp->high = min(high + batch, high_max);
 
 	if (!order) {
@@ -3226,6 +3236,14 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 			}
 		}
 
+		mark = high_wmark_pages(zone);
+		if (zone_watermark_fast(zone, order, mark,
+					ac->highest_zoneidx, alloc_flags,
+					gfp_mask))
+			goto try_this_zone;
+		else if (!test_bit(ZONE_BELOW_HIGH, &zone->flags))
+			set_bit(ZONE_BELOW_HIGH, &zone->flags);
+
 		mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
 		if (!zone_watermark_fast(zone, order, mark,
 				       ac->highest_zoneidx, alloc_flags,

From patchwork Wed Sep 20 06:18:55 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Huang, Ying" <ying.huang@intel.com>
X-Patchwork-Id: 142417
Return-Path: <linux-kernel-owner@vger.kernel.org>
Delivered-To: ouuuleilei@gmail.com
Received: by 2002:a05:612c:172:b0:3f2:4152:657d with SMTP id h50csp4155417vqi;
        Wed, 20 Sep 2023 06:52:34 -0700 (PDT)
X-Google-Smtp-Source: 
 AGHT+IFRkNWb8DUJrbn1JYA5Em5J1lh0EXY+d14yO5h0OTP0TmxOK+DvGfR6aNedqZB5URdsrH1s
X-Received: by 2002:a05:6a20:3d09:b0:14c:c393:692 with SMTP id
 y9-20020a056a203d0900b0014cc3930692mr3276401pzi.7.1695217953989;
        Wed, 20 Sep 2023 06:52:33 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1695217953; cv=none;
        d=google.com; s=arc-20160816;
        b=n4HOlPW8vJL9eTdh7Jk20R9LwcucgtSMoV7dMz3cfJ7Va0gk+/DWb+uyjJTX/9CdxH
         8dLJfGvVeXutQ46TgsyK01F618lNSPBpW9fUIxMBYQo3Dyql7/VCTtg8y5qUg7vE+Lv2
         M4g4qoMm2T6e5y73ikMiHwJJeixk0ONHUazqTJhIGKw8RktzHezt9jxADFHmlJEZxZxu
         8BhIkGlGNqiuIGrBGdz2rk10aySNrrAKxpgiBQsyE84+sUfYU0DdGZwUezOF5CZJpNWD
         HV+nZ3vRx2lXDo3QbGVwyqSf7CWcusZpU/M9vjCGviMp5yKgKWxSUWtIURJHjfoZWrG7
         nDCw==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=list-id:precedence:content-transfer-encoding:mime-version
         :references:in-reply-to:message-id:date:subject:cc:to:from
         :dkim-signature;
        bh=kQYc5N1rnjtiHtbCX15sF/J52hBb9PsZp7Xb3r8dfqY=;
        fh=OlKm7LKbIdgbzv7m6ivtVBS9u5zco/nrHpeuJnEjCeg=;
        b=B5j9wDuwf2pr6hBQWJ3TzXIqWD0Cgawg6N16tdQnVHMCx8lWpkHgqllZ7gaKKPB40q
         MADn19ckOo96ljcQqpHq5AZacEfz4m3oftLLuSvt037SWRIiXr+zK/NAbsjy8bhb9vLC
         nUVM1JYHCzE38Sp9RZq1aNxJ5/b+omNQJrfnlQDSlrKxVJ7sBEqTBLscZv/LLSgqoOUX
         4+mfY0jNuPY+vhYZH+IKObh3tgrigQaSzpzFTT601EVHVpUpFWUg+RSJ43ic2N6tbZKA
         kx3vtCHjbFffxgy/c5pmayGp3X/Es06VqY4ZWIvHk2McOL5R2Pqmxq9gR1NwAgvtOZDu
         1e1Q==
ARC-Authentication-Results: i=1; mx.google.com;
       dkim=pass header.i=@intel.com header.s=Intel header.b=e97HTkmW;
       spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 23.128.96.36 as permitted sender)
 smtp.mailfrom=linux-kernel-owner@vger.kernel.org;
       dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com
Received: from pete.vger.email (pete.vger.email. [23.128.96.36])
        by mx.google.com with ESMTPS id
 a9-20020a17090abe0900b00276a288f4ebsi1593846pjs.91.2023.09.20.06.52.33
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Wed, 20 Sep 2023 06:52:33 -0700 (PDT)
Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 23.128.96.36 as permitted sender) client-ip=23.128.96.36;
Authentication-Results: mx.google.com;
       dkim=pass header.i=@intel.com header.s=Intel header.b=e97HTkmW;
       spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 23.128.96.36 as permitted sender)
 smtp.mailfrom=linux-kernel-owner@vger.kernel.org;
       dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com
Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0])
	by pete.vger.email (Postfix) with ESMTP id DA2A881A6C03;
	Tue, 19 Sep 2023 23:22:00 -0700 (PDT)
X-Virus-Status: Clean
X-Virus-Scanned: clamav-milter 0.103.10 at pete.vger.email
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S233325AbjITGUq (ORCPT <rfc822;toshivichauhan@gmail.com>
        + 26 others); Wed, 20 Sep 2023 02:20:46 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:55050 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S233162AbjITGUZ (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Wed, 20 Sep 2023 02:20:25 -0400
Received: from mgamail.intel.com (mgamail.intel.com [134.134.136.126])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2F4F6AB
        for <linux-kernel@vger.kernel.org>;
 Tue, 19 Sep 2023 23:20:09 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1695190809; x=1726726809;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=+FoiCfoy4TLCRTojH1ubjvlhPyG87O12FA/RyIO19/k=;
  b=e97HTkmWjwmBWqYJ2BvfTxhgtOMn93iCwx8Wx/VoQdW619wARiAfUR+h
   PBFq9LJAByZADaQTe9s5h5LVYgifqsNhBvBPLff0dGi6K+AtbyKJ7dOxx
   xPIEZFzUvgPijRIgA4VCGalczuY+dSqpQehUZ+SUyTyGz9+2vynsNX7fj
   yA19bEJukBC7qwrxZRL24HKQUqbz8wsTRJbusqboeIlFVMZOje/w3m4LN
   wKlrcoBHEoYXfIKNNaqbITjcLjHnbF4luZn5Rq/CXee7hLn4sEZtApt9q
   RT3a1Xt5xzOB+faZTQN6gcecHJ9ktYRoemR6vADnkC3eP0cKPEFMq5xgn
   g==;
X-IronPort-AV: E=McAfee;i="6600,9927,10838"; a="365187785"
X-IronPort-AV: E=Sophos;i="6.02,161,1688454000";
   d="scan'208";a="365187785"
Received: from orsmga007.jf.intel.com ([10.7.209.58])
  by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 19 Sep 2023 23:20:08 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=McAfee;i="6600,9927,10838"; a="740060679"
X-IronPort-AV: E=Sophos;i="6.02,161,1688454000";
   d="scan'208";a="740060679"
Received: from yhuang6-mobl2.sh.intel.com ([10.238.6.133])
  by orsmga007-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 19 Sep 2023 23:20:05 -0700
From: Huang Ying <ying.huang@intel.com>
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org,
        Arjan Van De Ven <arjan@linux.intel.com>,
        Huang Ying <ying.huang@intel.com>,
        Andrew Morton <akpm@linux-foundation.org>,
        Mel Gorman <mgorman@techsingularity.net>,
        Vlastimil Babka <vbabka@suse.cz>,
        David Hildenbrand <david@redhat.com>,
        Johannes Weiner <jweiner@redhat.com>,
        Dave Hansen <dave.hansen@linux.intel.com>,
        Michal Hocko <mhocko@suse.com>,
        Pavel Tatashin <pasha.tatashin@soleen.com>,
        Matthew Wilcox <willy@infradead.org>,
        Christoph Lameter <cl@linux.com>
Subject: [PATCH 09/10] mm, pcp: avoid to reduce PCP high unnecessarily
Date: Wed, 20 Sep 2023 14:18:55 +0800
Message-Id: <20230920061856.257597-10-ying.huang@intel.com>
X-Mailer: git-send-email 2.39.2
In-Reply-To: <20230920061856.257597-1-ying.huang@intel.com>
References: <20230920061856.257597-1-ying.huang@intel.com>
MIME-Version: 1.0
X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED,
	DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,
	MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=unavailable
	autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on pete.vger.email
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org
X-Greylist: Sender passed SPF test,
 not delayed by milter-greylist-4.6.4 (pete.vger.email [0.0.0.0]);
 Tue, 19 Sep 2023 23:22:01 -0700 (PDT)
X-getmail-retrieved-from-mailbox: INBOX
X-GMAIL-THRID: 1777564861384996661
X-GMAIL-MSGID: 1777564861384996661

In the PCP high auto-tuning algorithm, to minimize idle pages in PCP,
the periodic vmstat updating kworker (via refresh_cpu_vm_stats())
decreases PCP high to try to free possibly idle PCP pages.  One issue
is that we may reduce PCP high unnecessarily, even if the page
allocating/freeing depth is larger than the maximal PCP high.

To avoid the above issue, in this patch, we track the minimal PCP
page count.  The periodic PCP high decrement will not exceed the
recent minimal PCP page count.  So, only detected idle pages will be
freed.

On a 2-socket Intel server with 224 logical CPUs, we tested kbuild on
one socket with `make -j 112`.  With the patch, the number of pages
allocated from zone (instead of from PCP) decreases by 25.8%.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
---
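Note for readers (below the cut line, so not part of the commit
message): the capped decay described above can be sketched as a small
userspace model.  It is an illustration only: struct pcp_model,
model_alloc() and model_decay_high() are hypothetical names, and
MODEL_BATCH_SCALE_MAX is an assumed stand-in for PCP_BATCH_SCALE_MAX
whose value here is chosen for the example.

```c
/* Assumed stand-in for PCP_BATCH_SCALE_MAX; the value is illustrative. */
#define MODEL_BATCH_SCALE_MAX 5

/* Hypothetical userspace stand-in; not the kernel's per_cpu_pages. */
struct pcp_model {
	int count;	/* pages currently held in the PCP */
	int count_min;	/* lowest count seen since the last decay */
	int high;
	int high_min;
	int batch;
};

static int min_int(int a, int b) { return a < b ? a : b; }
static int max_int(int a, int b) { return a > b ? a : b; }
static int max3_int(int a, int b, int c) { return max_int(max_int(a, b), c); }

/* Allocation from the PCP: lower count and remember the low point. */
static void model_alloc(struct pcp_model *pcp, int order)
{
	pcp->count -= 1 << order;
	if (pcp->count < pcp->count_min)
		pcp->count_min = pcp->count;
}

/*
 * Periodic decay: reduce high by at most one fifth, and by no more
 * than the pages that stayed unused (count_min) in the last period.
 */
static void model_decay_high(struct pcp_model *pcp)
{
	if (pcp->high > pcp->high_min) {
		int decrease = min_int(pcp->count_min, pcp->high / 5);

		pcp->high = max3_int(pcp->count -
				     (pcp->batch << MODEL_BATCH_SCALE_MAX),
				     pcp->high - decrease, pcp->high_min);
	}
	/* Start a new observation period. */
	pcp->count_min = pcp->count;
}
```

If count_min is 0 when the decay runs, meaning the PCP was fully
drained at some point in the period, high is left unchanged in this
model.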
 include/linux/mmzone.h |  1 +
 mm/page_alloc.c        | 15 ++++++++++-----
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8a19e2af89df..35b78c7522a7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -682,6 +682,7 @@ enum zone_watermarks {
 struct per_cpu_pages {
 	spinlock_t lock;	/* Protects lists field */
 	int count;		/* number of pages in the list */
+	int count_min;		/* minimal number of pages in the list recently */
 	int high;		/* high watermark, emptying needed */
 	int high_min;		/* min high watermark */
 	int high_max;		/* max high watermark */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3f8c7dfeed23..77e9b7b51688 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2166,19 +2166,20 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
  */
 int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
 {
-	int high_min, to_drain, batch;
+	int high_min, decrease, to_drain, batch;
 	int todo = 0;
 
 	high_min = READ_ONCE(pcp->high_min);
 	batch = READ_ONCE(pcp->batch);
 	/*
-	 * Decrease pcp->high periodically to try to free possible
-	 * idle PCP pages.  And, avoid to free too many pages to
-	 * control latency.
+	 * Decrease pcp->high periodically to free idle PCP pages counted
+	 * via pcp->count_min.  And, avoid to free too many pages to
+	 * control latency.  This caps pcp->high decrement too.
 	 */
 	if (pcp->high > high_min) {
+		decrease = min(pcp->count_min, pcp->high / 5);
 		pcp->high = max3(pcp->count - (batch << PCP_BATCH_SCALE_MAX),
-				 pcp->high * 4 / 5, high_min);
+				 pcp->high - decrease, high_min);
 		if (pcp->high > high_min)
 			todo++;
 	}
@@ -2191,6 +2192,8 @@ int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
 		todo++;
 	}
 
+	pcp->count_min = pcp->count;
+
 	return todo;
 }
 
@@ -2828,6 +2831,8 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 		page = list_first_entry(list, struct page, pcp_list);
 		list_del(&page->pcp_list);
 		pcp->count -= 1 << order;
+		if (pcp->count < pcp->count_min)
+			pcp->count_min = pcp->count;
 	} while (check_new_pages(page, order));
 
 	return page;

From patchwork Wed Sep 20 06:18:56 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Huang, Ying" <ying.huang@intel.com>
X-Patchwork-Id: 142480
Return-Path: <linux-kernel-owner@vger.kernel.org>
Delivered-To: ouuuleilei@gmail.com
Received: by 2002:a05:612c:172:b0:3f2:4152:657d with SMTP id h50csp4210786vqi;
        Wed, 20 Sep 2023 08:10:13 -0700 (PDT)
X-Google-Smtp-Source: 
 AGHT+IHPpxbewlup9dXF66uLiGqyZILlXKmfMv26RtaT/NbTAHVnB11pQrBe2pCIn5rpCeSDUh9L
X-Received: by 2002:a05:6a20:914a:b0:154:a9bc:12ca with SMTP id
 x10-20020a056a20914a00b00154a9bc12camr3064180pzc.26.1695222613630;
        Wed, 20 Sep 2023 08:10:13 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1695222613; cv=none;
        d=google.com; s=arc-20160816;
        b=dzigoErqJ8ZxAf+OaUnjPFhv5pcmB3RYVTOS+jXzxCiYWGGv95nHNmpK7f5ikP4PN4
         9VQ/lf/P8r9c2lCM4V1hG5cRBJdu/3yNKBjvHqJRPfE+xW8ik6WgWzUc4qTd9tKvr5zd
         w4ubEXOFyx8pGGbzdLnPxMnoldXuF6cxJomyW4RBG3GefJPd8fNGlkGBDxt0zRI33BZE
         363a/G3OLQylfVd+XE+XyjtTHca447nJZPbkZ8wrJCAk2+pGCNje8bBHJmN8tj/Hg5mP
         JRJpj1jwVpf6QHoht1itEXV0w7O7aI3F182L8aCWtNEvHgqyHpLTrREUxg6cJougmC0S
         hiCQ==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816;
        h=list-id:precedence:content-transfer-encoding:mime-version
         :references:in-reply-to:message-id:date:subject:cc:to:from
         :dkim-signature;
        bh=UvpAserKTFQazQgvaQV8AWmN5BTD454Je6S/+Aq412A=;
        fh=OlKm7LKbIdgbzv7m6ivtVBS9u5zco/nrHpeuJnEjCeg=;
        b=SNTRD2Wfio0NyUojPmOWjbB1uReCJs87Qdn3dUt4ZAO1vEX4S3UiVqHk/Fh9rYXVmX
         SewvwJsiNcIgN81RdKyTgeiM1asIy5LnE/hxtRG5XPH05xEiZso7P6FdJpprXiiOBMaY
         HVBnClOknBXptsK/2hDGF+TDzNEXV5JoMZ2bWxb7kWZT3pVNHrb0LN2BoOtqyUsfYbMT
         uWicciTt/5hkp+/4CDWEWORHj7qsJiaYzW/og7R+qWTgtS9vxnmkV1LiWKpjt/8+Pvyd
         v4N50SYSmoib8EJXuQw6zUKAPMijmipCsOonLvcE0tQ66lgQsp+8UQoZ8dl+Oa5ldIp0
         VbLA==
ARC-Authentication-Results: i=1; mx.google.com;
       dkim=pass header.i=@intel.com header.s=Intel header.b=jXr2YEz5;
       spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 2620:137:e000::3:1 as permitted sender)
 smtp.mailfrom=linux-kernel-owner@vger.kernel.org;
       dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com
Received: from morse.vger.email (morse.vger.email. [2620:137:e000::3:1])
        by mx.google.com with ESMTPS id
 ds2-20020a056a004ac200b00690cff4a2b6si2616910pfb.112.2023.09.20.08.10.13
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Wed, 20 Sep 2023 08:10:13 -0700 (PDT)
Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 2620:137:e000::3:1 as permitted sender)
 client-ip=2620:137:e000::3:1;
Authentication-Results: mx.google.com;
       dkim=pass header.i=@intel.com header.s=Intel header.b=jXr2YEz5;
       spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 2620:137:e000::3:1 as permitted sender)
 smtp.mailfrom=linux-kernel-owner@vger.kernel.org;
       dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com
Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0])
	by morse.vger.email (Postfix) with ESMTP id DA84B8020D9D;
	Tue, 19 Sep 2023 23:21:40 -0700 (PDT)
X-Virus-Status: Clean
X-Virus-Scanned: clamav-milter 0.103.10 at morse.vger.email
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S233427AbjITGU5 (ORCPT <rfc822;toshivichauhan@gmail.com>
        + 26 others); Wed, 20 Sep 2023 02:20:57 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35276 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S233412AbjITGUe (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Wed, 20 Sep 2023 02:20:34 -0400
Received: from mgamail.intel.com (mgamail.intel.com [134.134.136.126])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 779BCF2
        for <linux-kernel@vger.kernel.org>;
 Tue, 19 Sep 2023 23:20:12 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
  d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
  t=1695190813; x=1726726813;
  h=from:to:cc:subject:date:message-id:in-reply-to:
   references:mime-version:content-transfer-encoding;
  bh=S/60qEX1jhBiS6ZWzLs5VHrBM2EfBeFasIo8hPl4DFY=;
  b=jXr2YEz5bQqQJL6tBRUhfqJ/2om3UBPX4ttJKoUkB70zGed/KbsMcjHt
   GwnxWs9nTzfNbNZZ8aVqrelL+1PLcNHDXec5ADbkRCIr/H878JtAIyvyQ
   rqqY2P9OtbEC4qBmPsIBXZGFNeWmkMFEIG/872CrNvbyl4qfMi/N744JH
   geBo1qcaNYIHzWcrBdskweFXpGVbidnoZ/5BCMTXkSfQgwdIgQhIQs1q/
   rYoApNGtTJZ3+u+BX6I4QU+Zqsukkhn7pbd9ooxywB0RRuRbdbGr3+V+4
   76tCdHwLfb9vUncuIaFZTyubtr7qFVNmnGDu9WOn+9XYzJzFlKK4rDeeJ
   Q==;
X-IronPort-AV: E=McAfee;i="6600,9927,10838"; a="365187807"
X-IronPort-AV: E=Sophos;i="6.02,161,1688454000";
   d="scan'208";a="365187807"
Received: from orsmga007.jf.intel.com ([10.7.209.58])
  by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 19 Sep 2023 23:20:12 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=McAfee;i="6600,9927,10838"; a="740060689"
X-IronPort-AV: E=Sophos;i="6.02,161,1688454000";
   d="scan'208";a="740060689"
Received: from yhuang6-mobl2.sh.intel.com ([10.238.6.133])
  by orsmga007-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 19 Sep 2023 23:20:08 -0700
From: Huang Ying <ying.huang@intel.com>
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org,
        Arjan Van De Ven <arjan@linux.intel.com>,
        Huang Ying <ying.huang@intel.com>,
        Andrew Morton <akpm@linux-foundation.org>,
        Mel Gorman <mgorman@techsingularity.net>,
        Vlastimil Babka <vbabka@suse.cz>,
        David Hildenbrand <david@redhat.com>,
        Johannes Weiner <jweiner@redhat.com>,
        Dave Hansen <dave.hansen@linux.intel.com>,
        Michal Hocko <mhocko@suse.com>,
        Pavel Tatashin <pasha.tatashin@soleen.com>,
        Matthew Wilcox <willy@infradead.org>,
        Christoph Lameter <cl@linux.com>
Subject: [PATCH 10/10] mm,
 pcp: reduce detecting time of consecutive high order page freeing
Date: Wed, 20 Sep 2023 14:18:56 +0800
Message-Id: <20230920061856.257597-11-ying.huang@intel.com>
X-Mailer: git-send-email 2.39.2
In-Reply-To: <20230920061856.257597-1-ying.huang@intel.com>
References: <20230920061856.257597-1-ying.huang@intel.com>
MIME-Version: 1.0
X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED,
	DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,
	MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=unavailable
	autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on morse.vger.email
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org
X-Greylist: Sender passed SPF test,
 not delayed by milter-greylist-4.6.4 (morse.vger.email [0.0.0.0]);
 Tue, 19 Sep 2023 23:21:40 -0700 (PDT)
X-getmail-retrieved-from-mailbox: INBOX
X-GMAIL-THRID: 1777569747340975079
X-GMAIL-MSGID: 1777569747340975079

In the current PCP auto-tuning design, if the number of pages
allocated is much larger than the number of pages freed on a CPU, the
PCP high may reach its maximal value even if the allocating/freeing
depth is small, for example, in the sender of network workloads.  If
a CPU was originally used as the sender and is then used as the
receiver after context switching, we need to fill the whole PCP up to
the maximal high before PCP draining is triggered for consecutive
high order freeing.  This hurts the performance of some network
workloads.

To solve the issue, in this patch, we track consecutive page freeing
with a counter instead of relying on PCP draining.  So, we can detect
consecutive page freeing much earlier.

On a 2-socket Intel server with 128 logical CPUs, we tested the
SCTP_STREAM_MANY test case of the netperf test suite with 64-pair
processes.  With the patch, the network bandwidth improves by 3.1%.
This restores the performance drop caused by PCP auto-tuning.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
---
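Note for readers (below the cut line, so not part of the commit
message): the consecutive-free counter described above can be
sketched as a small userspace model.  It is an illustration only:
struct pcp_model and the model_*() helpers are hypothetical names,
MODEL_BATCH_SCALE_MAX is an assumed stand-in for PCP_BATCH_SCALE_MAX
with an illustrative value, and the free_high and high < batch
special cases of nr_pcp_free() are left out.

```c
/* Assumed stand-in for PCP_BATCH_SCALE_MAX; the value is illustrative. */
#define MODEL_BATCH_SCALE_MAX 5

/* Hypothetical userspace stand-in; not the kernel's per_cpu_pages. */
struct pcp_model {
	int batch;
	short free_count;	/* pages freed without intervening decay */
};

/* Same semantics as the kernel clamp(): min(max(val, lo), hi). */
static int clamp_int(int val, int lo, int hi)
{
	if (val < lo)
		val = lo;
	return val > hi ? hi : val;
}

/* Freeing path: count freed pages, saturating near batch << scale. */
static void model_note_free(struct pcp_model *pcp, int order)
{
	if (pcp->free_count < (pcp->batch << MODEL_BATCH_SCALE_MAX))
		pcp->free_count += 1 << order;
}

/* Allocation path: halve the counter, as the patch does with >>= 1. */
static void model_note_alloc(struct pcp_model *pcp)
{
	pcp->free_count >>= 1;
}

/* Number of pages to free to the zone when the PCP is over its limit. */
static int model_nr_free(const struct pcp_model *pcp, int high)
{
	return clamp_int(pcp->free_count, pcp->batch, high - pcp->batch);
}
```

Because the counter advances on every free rather than on every PCP
drain, a run of frees is visible after the first batch worth of pages
instead of after the PCP has filled up to high.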
 include/linux/mmzone.h |  2 +-
 mm/page_alloc.c        | 23 +++++++++++------------
 2 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 35b78c7522a7..44f6dc3cdeeb 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -689,10 +689,10 @@ struct per_cpu_pages {
 	int batch;		/* chunk size for buddy add/remove */
 	u8 flags;		/* protected by pcp->lock */
 	u8 alloc_factor;	/* batch scaling factor during allocate */
-	u8 free_factor;		/* batch scaling factor during free */
 #ifdef CONFIG_NUMA
 	u8 expire;		/* When 0, remote pagesets are drained */
 #endif
+	short free_count;	/* consecutive free count */
 
 	/* Lists of pages, one per migrate type stored on the pcp-lists */
 	struct list_head lists[NR_PCP_LISTS];
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 77e9b7b51688..6ae2a5ebf7a4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2375,13 +2375,10 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int batch, int high, bool free
 	max_nr_free = high - batch;
 
 	/*
-	 * Double the number of pages freed each time there is subsequent
-	 * freeing of pages without any allocation.
+	 * Increase the batch number to the number of the consecutive
+	 * freed pages to reduce zone lock contention.
 	 */
-	batch <<= pcp->free_factor;
-	if (batch <= max_nr_free && pcp->free_factor < PCP_BATCH_SCALE_MAX)
-		pcp->free_factor++;
-	batch = clamp(batch, min_nr_free, max_nr_free);
+	batch = clamp_t(int, pcp->free_count, min_nr_free, max_nr_free);
 
 	return batch;
 }
@@ -2408,7 +2405,7 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
 	 * stored on pcp lists
 	 */
 	if (test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags)) {
-		pcp->high = max(high - (batch << pcp->free_factor), high_min);
+		pcp->high = max(high - pcp->free_count, high_min);
 		return min(batch << 2, pcp->high);
 	}
 
@@ -2416,10 +2413,10 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
 		return high;
 
 	if (test_bit(ZONE_BELOW_HIGH, &zone->flags)) {
-		pcp->high = max(high - (batch << pcp->free_factor), high_min);
+		pcp->high = max(high - pcp->free_count, high_min);
 		high = max(pcp->count, high_min);
 	} else if (pcp->count >= high) {
-		int need_high = (batch << pcp->free_factor) + batch;
+		int need_high = pcp->free_count + batch;
 
 		/* pcp->high should be large enough to hold batch freed pages */
 		if (pcp->high < need_high)
@@ -2456,7 +2453,7 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	 * stops will be drained from vmstat refresh context.
 	 */
 	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
-		free_high = (pcp->free_factor &&
+		free_high = (pcp->free_count >= batch &&
 			     (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) &&
 			     (!(pcp->flags & PCPF_FREE_HIGH_BATCH) ||
 			      pcp->count >= READ_ONCE(batch)));
@@ -2464,6 +2461,8 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	} else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) {
 		pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
 	}
+	if (pcp->free_count < (batch << PCP_BATCH_SCALE_MAX))
+		pcp->free_count += (1 << order);
 	high = nr_pcp_high(pcp, zone, batch, free_high);
 	if (pcp->count >= high) {
 		free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
@@ -2861,7 +2860,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	 * See nr_pcp_free() where free_factor is increased for subsequent
 	 * frees.
 	 */
-	pcp->free_factor >>= 1;
+	pcp->free_count >>= 1;
 	list = &pcp->lists[order_to_pindex(migratetype, order)];
 	page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
 	pcp_spin_unlock(pcp);
@@ -5483,7 +5482,7 @@ static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonesta
 	pcp->high_min = BOOT_PAGESET_HIGH;
 	pcp->high_max = BOOT_PAGESET_HIGH;
 	pcp->batch = BOOT_PAGESET_BATCH;
-	pcp->free_factor = 0;
+	pcp->free_count = 0;
 }
 
 static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high_min,