Message ID | 20240131162442.3487473-1-tjmercier@google.com |
---|---|
State | New |
Headers |
Date: Wed, 31 Jan 2024 16:24:41 +0000
From: "T.J. Mercier" <tjmercier@google.com>
To: tjmercier@google.com, Johannes Weiner <hannes@cmpxchg.org>, Michal Hocko <mhocko@kernel.org>, Roman Gushchin <roman.gushchin@linux.dev>, Shakeel Butt <shakeelb@google.com>, Muchun Song <muchun.song@linux.dev>, Andrew Morton <akpm@linux-foundation.org>, Efly Young <yangyifei03@kuaishou.com>
Cc: android-mm@google.com, yuzhao@google.com, cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [PATCH] mm: memcg: Use larger chunks for proactive reclaim
Message-ID: <20240131162442.3487473-1-tjmercier@google.com>
X-Mailing-List: linux-kernel@vger.kernel.org
Series | mm: memcg: Use larger chunks for proactive reclaim |
Commit Message
T.J. Mercier
Jan. 31, 2024, 4:24 p.m. UTC
Before 0388536ac291 ("mm:vmscan: fix inaccurate reclaim during proactive
reclaim") we passed the number of pages for the reclaim request directly
to try_to_free_mem_cgroup_pages(), which could lead to significant
overreclaim in order to achieve fairness. After 0388536ac291 the number
of pages was limited to a maximum of 32 (SWAP_CLUSTER_MAX) to reduce
the amount of overreclaim. However, such a small chunk size caused a
regression in reclaim performance due to many more reclaim start/stop
cycles inside memory_reclaim().
Instead of limiting the reclaim chunk size to the SWAP_CLUSTER_MAX
constant, adjust the chunk size proportionally to the number of pages
left to reclaim. This allows for higher reclaim efficiency with large
chunk sizes at the beginning of memory_reclaim(), and reduces the
amount of potential overreclaim by using small chunk sizes as the total
reclaim amount is approached. Using 1/4 of the amount left to reclaim
as the chunk size gives a good compromise between reclaim performance
and overreclaim:
root - full reclaim           pages/sec   time (sec)
pre-0388536ac291      :           68047        10.46
post-0388536ac291     :           13742          inf
(reclaim-reclaimed)/4 :           67352        10.51

/uid_0 - 1G reclaim           pages/sec   time (sec)   overreclaim (MiB)
pre-0388536ac291      :          258822         1.12               107.8
post-0388536ac291     :          105174         2.49                 3.5
(reclaim-reclaimed)/4 :          233396         1.12                -7.4

/uid_0 - full reclaim         pages/sec   time (sec)
pre-0388536ac291      :           72334         7.09
post-0388536ac291     :           38105        14.45
(reclaim-reclaimed)/4 :           72914         6.96
Fixes: 0388536ac291 ("mm:vmscan: fix inaccurate reclaim during proactive reclaim")
Signed-off-by: T.J. Mercier <tjmercier@google.com>
---
mm/memcontrol.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
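To make the decaying-chunk behavior concrete, here is a small userspace
model of the memory_reclaim() loop (an illustrative sketch only, not
kernel code: the SWAP_CLUSTER_MAX floor mirrors the clamp inside
try_to_free_mem_cgroup_pages(), and the assumption that reclaim returns
exactly what was requested is a simplification):

#include <stdio.h>

#define SWAP_CLUSTER_MAX 32UL

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

int main(void)
{
	unsigned long nr_to_reclaim = 262144;	/* 1G in 4K pages */
	unsigned long nr_reclaimed = 0;
	int iterations = 0;

	while (nr_reclaimed < nr_to_reclaim) {
		unsigned long delta = nr_to_reclaim - nr_reclaimed;
		/* the patch: request 1/4 of what is left (%4 avoids a 0 request) */
		unsigned long request = max_ul(delta / 4, delta % 4);

		/* stand-in for the SWAP_CLUSTER_MAX floor that
		 * try_to_free_mem_cgroup_pages() applies internally */
		request = max_ul(request, SWAP_CLUSTER_MAX);

		/* assume reclaim returns exactly what was asked for;
		 * real reclaim can return more, so this is only a model */
		nr_reclaimed += request;
		iterations++;
	}
	printf("converged in %d iterations, reclaimed %lu pages\n",
	       iterations, nr_reclaimed);
	return 0;
}

The request starts large (65536 pages for a 1G target) and shrinks
geometrically, which is where both the throughput and the reduced
overreclaim in the tables above come from.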
Comments
On Wed, Jan 31, 2024 at 04:24:41PM +0000, T.J. Mercier wrote:
> <snip>
> 		reclaimed = try_to_free_mem_cgroup_pages(memcg,
> -					min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> +					max((nr_to_reclaim - nr_reclaimed) / 4,
> +					    (nr_to_reclaim - nr_reclaimed) % 4),

I don't see why the % 4 is needed. It only kicks in when the delta
drops below 4, but try_to_free_mem_cgroup_pages() already has

	.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),

so it looks like dead code.
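Johannes's dead-code observation is easy to check with a few lines of
standalone C (an editor's sketch; clamp_request() stands in for the
internal max(nr_pages, SWAP_CLUSTER_MAX) floor quoted above):

#include <stdio.h>

#define SWAP_CLUSTER_MAX 32UL

/* stand-in for the floor try_to_free_mem_cgroup_pages() applies */
static unsigned long clamp_request(unsigned long nr_pages)
{
	return nr_pages > SWAP_CLUSTER_MAX ? nr_pages : SWAP_CLUSTER_MAX;
}

int main(void)
{
	/* the % 4 term can only differ from plain delta/4 when delta < 16,
	 * and in that whole range both results are below SWAP_CLUSTER_MAX,
	 * so the clamp makes the effective requests identical */
	for (unsigned long delta = 1; delta <= 16; delta++) {
		unsigned long with_mod = delta / 4 > delta % 4 ? delta / 4 : delta % 4;
		unsigned long plain = delta / 4;

		printf("delta=%2lu: with %%4 -> %lu, without -> %lu\n",
		       delta, clamp_request(with_mod), clamp_request(plain));
	}
	return 0;
}

Every line prints the same clamped value (32) for both expressions,
which is the sense in which the modulo is dead code.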
On Wed, Jan 31, 2024 at 9:51 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Wed, Jan 31, 2024 at 04:24:41PM +0000, T.J. Mercier wrote:
> > <snip>
> > +					max((nr_to_reclaim - nr_reclaimed) / 4,
> > +					    (nr_to_reclaim - nr_reclaimed) % 4),
>
> I don't see why the % 4 is needed. It only kicks in when the delta
> drops below 4, but try_to_free_mem_cgroup_pages() already has
>
> 	.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
>
> so it looks like dead code.

That's right, it's only there for when the integer division reaches
zero. I didn't want to assume anything about the implementation of
try_to_free_mem_cgroup_pages, but I can just remove it entirely if
you'd like.
On Wed, Jan 31, 2024 at 10:01:27AM -0800, T.J. Mercier wrote:
> On Wed, Jan 31, 2024 at 9:51 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > <snip>
> > I don't see why the % 4 is needed. It only kicks in when the delta
> > drops below 4, but try_to_free_mem_cgroup_pages() already has
> >
> > 	.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
> >
> > so it looks like dead code.
>
> That's right, it's only there for when the integer division reaches
> zero. I didn't want to assume anything about the implementation of
> try_to_free_mem_cgroup_pages, but I can just remove it entirely if
> you'd like.

What do others think? We rely on the rounding up in a few other places
and it's been doing that for a decade. Maybe lampshade it for the
benefit of the reader:

	/* Will converge on zero, but reclaim enforces a minimum */

but otherwise there is IMO no need to have defensive extra code.
Hello.

On Wed, Jan 31, 2024 at 04:24:41PM +0000, "T.J. Mercier" <tjmercier@google.com> wrote:
> 		reclaimed = try_to_free_mem_cgroup_pages(memcg,
> -					min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> +					max((nr_to_reclaim - nr_reclaimed) / 4,
> +					    (nr_to_reclaim - nr_reclaimed) % 4),

The 1/4 factor looks like magic. Commit 0388536ac291 says:

| In theory, the amount of reclaimed would be in [request, 2 * request).

Doesn't this suggest 1/2 as a better option? (I didn't pursue the
theory.)

Also IMO importantly, when nr_to_reclaim - nr_reclaimed is less than 8,
the formula gives arbitrary (unrelated to delta's magnitude) values.

Regards,
Michal
On Thu, Feb 01, 2024 at 02:57:22PM +0100, Michal Koutný wrote:
> The 1/4 factor looks like magic.

It's just cutting the work into quarters to balance throughput with
goal accuracy. It's no more or less magic than DEF_PRIORITY being 12,
or SWAP_CLUSTER_MAX being 32.

> Commit 0388536ac291 says:
> | In theory, the amount of reclaimed would be in [request, 2 * request).

Looking at the code, I'm not quite sure if this can be read this
literally. Efly might be able to elaborate, but we do a full loop of
all nodes and cgroups in the tree before checking nr_reclaimed, and
rely on priority level for granularity. So request size and complexity
of the cgroup tree play a role. I don't know where the exact factor
two would come from.

IMO it's more accurate to phrase it like this:

Reclaim tries to balance nr_to_reclaim fidelity with fairness across
nodes and cgroups over which the pages are spread. As such, the bigger
the request, the bigger the absolute overreclaim error. Historic
in-kernel users of reclaim have used fixed, small request batches to
approach an appropriate reclaim rate over time. When we reclaim a user
request of arbitrary size, use decaying batches to manage error while
maintaining reasonable throughput.

> Doesn't this suggest 1/2 as a better option? (I didn't pursue the
> theory.)

That was TJ's first suggestion as well, but as per above I suggested
quartering as a safer option.

> Also IMO importantly, when nr_to_reclaim - nr_reclaimed is less than 8,
> the formula gives arbitrary (unrelated to delta's magnitude) values.

try_to_free_mem_cgroup_pages() rounds up to SWAP_CLUSTER_MAX. So the
error margin is much higher at the smaller end of requests anyway.
But practically speaking, users care much less if you reclaim 32 pages
when 16 were requested than if you reclaim 2G when 1G was requested.
On Thu, Feb 1, 2024 at 7:34 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Thu, Feb 01, 2024 at 02:57:22PM +0100, Michal Koutný wrote:
> > The 1/4 factor looks like magic.
>
> It's just cutting the work into quarters to balance throughput with
> goal accuracy. It's no more or less magic than DEF_PRIORITY being 12,
> or SWAP_CLUSTER_MAX being 32.

Using SWAP_CLUSTER_MAX is sort of like having a really large divisor
instead of 4 (or 1 like before). I recorded the average number of
iterations required to complete the 1G reclaim for the measurements I
took and it looks like this:

pre-0388536ac291      :    1
post-0388536ac291     : 1814
(reclaim-reclaimed)/4 :   17

Given the results with /4, I don't think the perf we get here is
particularly sensitive to the number we choose, but it's definitely a
tradeoff.

<snip>

> > Also IMO importantly, when nr_to_reclaim - nr_reclaimed is less than 8,
> > the formula gives arbitrary (unrelated to delta's magnitude) values.
>
> try_to_free_mem_cgroup_pages() rounds up to SWAP_CLUSTER_MAX. So the
> error margin is much higher at the smaller end of requests anyway.
> But practically speaking, users care much less if you reclaim 32 pages
> when 16 were requested than if you reclaim 2G when 1G was requested.

I like Johannes's suggestion of just a comment instead of the mod op.
It's easier to read, slightly less generated code, and even if we
didn't have the .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX) in
try_to_free_mem_cgroup_pages, memory_reclaim would still get very
close to the target before running out of nr_retries.
> On Thu, Feb 01, 2024 at 02:57:22PM +0100, Michal Koutný wrote:
> > Commit 0388536ac291 says:
> > | In theory, the amount of reclaimed would be in [request, 2 * request).
>
> Looking at the code, I'm not quite sure if this can be read this
> literally. Efly might be able to elaborate, but we do a full loop of
> all nodes and cgroups in the tree before checking nr_reclaimed, and
> rely on priority level for granularity. So request size and complexity
> of the cgroup tree play a role. I don't know where the exact factor
> two would come from.

I'm sorry that this conclusion may be arbitrary. It might only suit my
case. In my case, I traced it looping twice every time before checking
nr_reclaimed, and it reclaimed less than my request size (1G) every
time. So I thought the upper bound was 2 * request. But now it seems
that this is related to the cgroup tree I constructed, my system
status, and my request size (a relatively large chunk). So there are
many influencing factors, and a specific upper bound is not accurate.

> try_to_free_mem_cgroup_pages() rounds up to SWAP_CLUSTER_MAX. So the
> error margin is much higher at the smaller end of requests anyway.
> But practically speaking, users care much less if you reclaim 32 pages
> when 16 were requested than if you reclaim 2G when 1G was requested.

Yes, I agree completely that the bigger the request, the bigger the
absolute overreclaim error. The focus now is the tradeoff between
accurate reclaim and efficient reclaim. I think TJ's test is
suggestive.
On Fri, Feb 02, 2024 at 01:02:47PM +0800, Efly Young <yangyifei03@kuaishou.com> wrote:
> <snip>
> So there are many influencing factors, and a specific upper bound is
> not accurate.

Alright, thanks for the background.

> > IMO it's more accurate to phrase it like this:
> >
> > Reclaim tries to balance nr_to_reclaim fidelity with fairness across
> > nodes and cgroups over which the pages are spread. As such, the bigger
> > the request, the bigger the absolute overreclaim error. Historic
> > in-kernel users of reclaim have used fixed, small request batches to
> > approach an appropriate reclaim rate over time. When we reclaim a user
> > request of arbitrary size, use decaying batches to manage error while
> > maintaining reasonable throughput.

Hm, decay... So shouldn't the formula be

	nr_pages = delta <= SWAP_CLUSTER_MAX ? delta : (delta + 3*SWAP_CLUSTER_MAX) / 4

where

	delta = nr_to_reclaim - nr_reclaimed

? (So that convergence for smaller deltas is the same as in the
original and other reclaim paths, while the conservative factor is
applied for effectiveness with larger user requests.)

Thanks,
Michal
On Fri, Feb 2, 2024 at 2:15 AM Michal Koutný <mkoutny@suse.com> wrote:
> Hm, decay... So shouldn't the formula be
>
> 	nr_pages = delta <= SWAP_CLUSTER_MAX ? delta : (delta + 3*SWAP_CLUSTER_MAX) / 4
>
> where
>
> 	delta = nr_to_reclaim - nr_reclaimed
>
> ? (So that convergence for smaller deltas is the same as in the
> original and other reclaim paths, while the conservative factor is
> applied for effectiveness with larger user requests.)

Tapering out at 32 instead of 4 doesn't make much difference in
practice because of how far off the actually reclaimed amount can be
from the request size. We're talking thousands of pages of error for a
request size of a few megs, and hundreds of pages of error for
requests less than 100 pages.

So all of these should be more or less equivalent:

	delta <= SWAP_CLUSTER_MAX ? delta : (delta + 3*SWAP_CLUSTER_MAX) / 4
	max((nr_to_reclaim - nr_reclaimed) / 4, (nr_to_reclaim - nr_reclaimed) % 4)
	(nr_to_reclaim - nr_reclaimed) / 4 + 4
	(nr_to_reclaim - nr_reclaimed) / 4

I was just trying to avoid putting in a 0 for the request size with
the mod.
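Evaluating the four variants side by side supports this equivalence
claim. The following standalone sketch (added here for illustration,
not part of the thread; floor32() models the clamp inside
try_to_free_mem_cgroup_pages()) prints the effective request size each
variant produces for a range of deltas:

#include <stdio.h>

#define SWAP_CLUSTER_MAX 32UL

/* models the max(nr_pages, SWAP_CLUSTER_MAX) floor in the kernel */
static unsigned long floor32(unsigned long n)
{
	return n < SWAP_CLUSTER_MAX ? SWAP_CLUSTER_MAX : n;
}

int main(void)
{
	unsigned long deltas[] = { 1, 7, 31, 33, 1000, 262144 };
	size_t i;

	printf("%8s %8s %8s %8s %8s\n", "delta", "v1", "v2", "v3", "v4");
	for (i = 0; i < sizeof(deltas) / sizeof(deltas[0]); i++) {
		unsigned long d = deltas[i];
		unsigned long v1 = d <= SWAP_CLUSTER_MAX ? d : (d + 3 * SWAP_CLUSTER_MAX) / 4;
		unsigned long v2 = d / 4 > d % 4 ? d / 4 : d % 4;
		unsigned long v3 = d / 4 + 4;
		unsigned long v4 = d / 4;

		printf("%8lu %8lu %8lu %8lu %8lu\n", d, floor32(v1),
		       floor32(v2), floor32(v3), floor32(v4));
	}
	return 0;
}

For small deltas all four collapse to 32; for a 1G delta (262144 pages)
they agree within a couple dozen pages, which is well inside the
reclaim error margins discussed above.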
On Fri, Feb 02, 2024 at 10:22:34AM -0800, "T.J. Mercier" <tjmercier@google.com> wrote: > So all of these should be more or less equivalent: > delta <= SWAP_CLUSTER_MAX ? delta : (delta + 3*SWAP_CLUSTER_MAX) / 4 > max((nr_to_reclaim - nr_reclaimed) / 4, (nr_to_reclaim - nr_reclaimed) % 4) > (nr_to_reclaim - nr_reclaimed) / 4 + 4 > (nr_to_reclaim - nr_reclaimed) / 4 > > I was just trying to avoid putting in a 0 for the request size with the mod. The third variant would be simpler then. Modulo looks weird. Oh, and I just realized that try_to_free_mem_cgroup_pages() does max(nr_pages, SWAP_CLUSTER_MAX). Then I'd vote for the fourth variant + possible comment about harmless 0. (I'm sorry if this was discussed before.) Michal
On Fri, Feb 2, 2024 at 11:46 AM Michal Koutný <mkoutny@suse.com> wrote:
> The third variant would be simpler then. Modulo looks weird.
>
> Oh, and I just realized that try_to_free_mem_cgroup_pages() does
> max(nr_pages, SWAP_CLUSTER_MAX). Then I'd vote for the fourth variant
> + a possible comment about the harmless 0.

Ok great, let's do that. Thanks for your input.
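The shape agreed on above (the fourth variant plus Johannes's
suggested comment) would presumably look something like the following;
this is a sketch of the consensus in the thread, not the committed
follow-up:

		reclaimed = try_to_free_mem_cgroup_pages(memcg,
					/* Will converge on zero, but reclaim enforces a minimum */
					(nr_to_reclaim - nr_reclaimed) / 4,
					GFP_KERNEL, reclaim_options);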
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 46d8d02114cf..d68fb89eadd2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6977,7 +6977,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 		lru_add_drain_all();
 
 		reclaimed = try_to_free_mem_cgroup_pages(memcg,
-					min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
+					max((nr_to_reclaim - nr_reclaimed) / 4,
+					    (nr_to_reclaim - nr_reclaimed) % 4),
 					GFP_KERNEL, reclaim_options);
 
 		if (!reclaimed && !nr_retries--)
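For context, memory_reclaim() is the write handler for a cgroup's
memory.reclaim file, so the patched path is driven from userspace like
this (a minimal sketch; the cgroup path is an example, and sizes
accept suffixes such as "1G"):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* example cgroup path; adjust to an existing cgroup on your system */
	const char *path = "/sys/fs/cgroup/uid_0/memory.reclaim";
	const char *amount = "1G";	/* proactively reclaim 1G from this cgroup */
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* the kernel side of this write is memory_reclaim(), whose
	 * chunking behavior the patch above changes */
	if (write(fd, amount, strlen(amount)) != (ssize_t)strlen(amount))
		perror("write");
	close(fd);
	return 0;
}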