Message ID | 20240202233855.1236422-1-tjmercier@google.com |
---|---|
State | New |
Series | [v3] mm: memcg: Use larger batches for proactive reclaim |
Commit Message
T.J. Mercier
Feb. 2, 2024, 11:38 p.m. UTC
Before 0388536ac291 ("mm:vmscan: fix inaccurate reclaim during proactive
reclaim") we passed the number of pages for the reclaim request directly
to try_to_free_mem_cgroup_pages, which could lead to significant
overreclaim. After 0388536ac291 the number of pages was limited to a
maximum of 32 (SWAP_CLUSTER_MAX) to reduce the amount of overreclaim.
However such a small batch size caused a regression in reclaim
performance due to many more reclaim start/stop cycles inside
memory_reclaim.

Reclaim tries to balance nr_to_reclaim fidelity with fairness across
nodes and cgroups over which the pages are spread. As such, the bigger
the request, the bigger the absolute overreclaim error. Historic
in-kernel users of reclaim have used fixed, small sized requests to
approach an appropriate reclaim rate over time. When we reclaim a user
request of arbitrary size, use decaying batch sizes to manage error
while maintaining reasonable throughput.

root - full reclaim       pages/sec   time (sec)
pre-0388536ac291       :    68047        10.46
post-0388536ac291      :    13742         inf
(reclaim-reclaimed)/4  :    67352        10.51

/uid_0 - 1G reclaim       pages/sec   time (sec)   overreclaim (MiB)
pre-0388536ac291       :   258822         1.12          107.8
post-0388536ac291      :   105174         2.49            3.5
(reclaim-reclaimed)/4  :   233396         1.12           -7.4

/uid_0 - full reclaim     pages/sec   time (sec)
pre-0388536ac291       :    72334         7.09
post-0388536ac291      :    38105        14.45
(reclaim-reclaimed)/4  :    72914         6.96

Fixes: 0388536ac291 ("mm:vmscan: fix inaccurate reclaim during proactive reclaim")
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
v3: Formatting fixes per Yosry Ahmed and Johannes Weiner. No functional changes.
v2: Simplify the request size calculation per Johannes Weiner and Michal Koutný

 mm/memcontrol.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
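To illustrate the decaying batch sizes, here is a small userspace simulation (an illustrative sketch only: it assumes reclaim frees exactly what is requested, and models the enforced minimum as SWAP_CLUSTER_MAX = 32 pages):

#include <stdio.h>

int main(void)
{
	/* Simulate the decaying batches: each pass requests a quarter of
	 * the remaining target, so requests shrink as the target nears. */
	unsigned long nr_to_reclaim = 262144;	/* 1G of 4K pages */
	unsigned long nr_reclaimed = 0;

	while (nr_reclaimed < nr_to_reclaim) {
		unsigned long batch = (nr_to_reclaim - nr_reclaimed) / 4;

		if (batch < 32)
			batch = 32;	/* reclaim enforces a minimum */
		/* assume reclaim frees exactly the requested batch */
		nr_reclaimed += batch;
		printf("batch=%lu total=%lu\n", batch, nr_reclaimed);
	}
	return 0;
}

The first batch for a 1G request is 64K pages, so throughput stays close to the pre-0388536ac291 behavior, while the later, smaller batches keep the final overreclaim error near what a fixed SWAP_CLUSTER_MAX batch would give.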
Comments
On Fri, Feb 2, 2024 at 3:39 PM T.J. Mercier <tjmercier@google.com> wrote:
>
> Before 0388536ac291 ("mm:vmscan: fix inaccurate reclaim during proactive
> reclaim") we passed the number of pages for the reclaim request directly
> to try_to_free_mem_cgroup_pages, which could lead to significant
> overreclaim.
> [... full patch quoted ...]
> Fixes: 0388536ac291 ("mm:vmscan: fix inaccurate reclaim during proactive reclaim")
> Signed-off-by: T.J. Mercier <tjmercier@google.com>
> Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Shakeel Butt <shakeelb@google.com>
On Fri, Feb 02, 2024 at 11:38:54PM +0000, "T.J. Mercier" <tjmercier@google.com> wrote:
>  mm/memcontrol.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)

Reviewed-by: Michal Koutný <mkoutny@suse.com>

Thanks!
On Fri 02-02-24 23:38:54, T.J. Mercier wrote:
> Before 0388536ac291 ("mm:vmscan: fix inaccurate reclaim during proactive
> reclaim") we passed the number of pages for the reclaim request directly
> to try_to_free_mem_cgroup_pages, which could lead to significant
> overreclaim. After 0388536ac291 the number of pages was limited to a
> maximum of 32 (SWAP_CLUSTER_MAX) to reduce the amount of overreclaim.
> However such a small batch size caused a regression in reclaim
> performance due to many more reclaim start/stop cycles inside
> memory_reclaim.

You have mentioned that in one of the previous emails but it is good to
mention what is the source of that overhead for the future reference.

> Reclaim tries to balance nr_to_reclaim fidelity with fairness across
> nodes and cgroups over which the pages are spread. As such, the bigger
> the request, the bigger the absolute overreclaim error. Historic
> in-kernel users of reclaim have used fixed, small sized requests to
> approach an appropriate reclaim rate over time. When we reclaim a user
> request of arbitrary size, use decaying batch sizes to manage error
> while maintaining reasonable throughput.

These numbers are with MGLRU or the default reclaim implementation?

> [... benchmark tables, tags, and changelog quoted ...]
>
> +		/* Will converge on zero, but reclaim enforces a minimum */
> +		unsigned long batch_size = (nr_to_reclaim - nr_reclaimed) / 4;

This doesn't fit into the existing coding style. I do not think there is
a strong reason to go against it here.

> +
>  		reclaimed = try_to_free_mem_cgroup_pages(memcg,
> -					min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> -					GFP_KERNEL, reclaim_options);
> +					batch_size, GFP_KERNEL, reclaim_options);

Also with the increased reclaim target do we need something like this?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4f9c854ce6cc..94794cf5ee9f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1889,7 +1889,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,

 		/* We are about to die and free our memory. Return now. */
 		if (fatal_signal_pending(current))
-			return SWAP_CLUSTER_MAX;
+			return sc->nr_to_reclaim;
 	}

 	lru_add_drain();

>
>  		if (!reclaimed && !nr_retries--)
>  			return -EAGAIN;
On Mon, Feb 5, 2024 at 2:40 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 02-02-24 23:38:54, T.J. Mercier wrote:
> [...]
> You have mentioned that in one of the previous emails but it is good to
> mention what is the source of that overhead for the future reference.

I can add a sentence about the restart cost being amortized over more
pages with a large batch size. It covers things like repeatedly
flushing stats, walking the tree, evaluating protection limits, etc.

> These numbers are with MGLRU or the default reclaim implementation?

These numbers are for both. root uses the memcg LRU (MGLRU was
enabled), and /uid_0 does not.

> > +		/* Will converge on zero, but reclaim enforces a minimum */
> > +		unsigned long batch_size = (nr_to_reclaim - nr_reclaimed) / 4;
>
> This doesn't fit into the existing coding style. I do not think there is
> a strong reason to go against it here.

There's been some back and forth here. You'd prefer to move this to
the top of the while loop, under the declaration of reclaimed? It's
farther from its use there, but it does match the existing style in
the file better.

> Also with the increased reclaim target do we need something like this?
>
> [... proposed vmscan.c change: return sc->nr_to_reclaim instead of
> SWAP_CLUSTER_MAX from shrink_inactive_list() on a pending fatal
> signal ...]

This is interesting, but I don't think it's closely related to this
change. This section looks like it was added to delay OOM kills due to
apparent lack of reclaim progress when pages are isolated and the
direct reclaimer is scheduled out. A couple things:

In the context of proactive reclaim, current is not really undergoing
reclaim due to memory pressure. It's initiated from userspace. So
whether it has a fatal signal pending or not doesn't seem like it
should influence the return value of shrink_inactive_list for some
probably unrelated process. It seems more straightforward to me to
return 0, and add another fatal signal pending check to the caller
(shrink_lruvec) to bail out early (dealing with OOM kill avoidance
there if necessary) instead of waiting to accumulate fake
SWAP_CLUSTER_MAX values from shrink_inactive_list.

As far as changing the value, SWAP_CLUSTER_MAX puts the final value of
sc->nr_reclaimed pretty close to sc->nr_to_reclaim. Since there's a
loop for each evictable lru in shrink_lruvec, we could end up with 4 *
sc->nr_to_reclaim in sc->nr_reclaimed if we switched to
sc->nr_to_reclaim from SWAP_CLUSTER_MAX... an even bigger lie. So I
don't think we'd want to do that.
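To make the 4x concern concrete, a toy userspace model (illustrative only; the four evictable LRU lists are inactive/active anon and inactive/active file):

#include <stdio.h>

int main(void)
{
	/* Toy model: shrink_lruvec() walks the four evictable LRU lists.
	 * If shrink_inactive_list() reported the full sc->nr_to_reclaim
	 * for each of them on a pending fatal signal, the accumulated
	 * total would be 4x the target. */
	unsigned long nr_to_reclaim = 1UL << 20;	/* example target */
	unsigned long nr_reclaimed = 0;

	for (int lru = 0; lru < 4; lru++)
		nr_reclaimed += nr_to_reclaim;

	printf("%lux overreport\n", nr_reclaimed / nr_to_reclaim);	/* 4x */
	return 0;
}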
On Mon 05-02-24 11:29:49, T.J. Mercier wrote:
> On Mon, Feb 5, 2024 at 2:40 AM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> These numbers are for both. root uses the memcg LRU (MGLRU was
> enabled), and /uid_0 does not.

Thanks it would be nice to outline that in the changelog.

> There's been some back and forth here. You'd prefer to move this to
> the top of the while loop, under the declaration of reclaimed? It's
> farther from its use there, but it does match the existing style in
> the file better.

This is not something I deeply care about but generally it is better to
not mix styles unless that is a clear win. If you want to save one LOC
you can just move it up - just couple of lines up, or you can keep the
definition closer and have a separate declaration.

> This is interesting, but I don't think it's closely related to this
> change. [...] It seems more straightforward to me to return 0, and add
> another fatal signal pending check to the caller (shrink_lruvec) to
> bail out early (dealing with OOM kill avoidance there if necessary)
> instead of waiting to accumulate fake SWAP_CLUSTER_MAX values from
> shrink_inactive_list.

The point of this code is to bail out early if the caller has fatal
signals pending. That could be SIGTERM sent to the process performing
the reclaim for whatever reason. The bail out is tuned for
SWAP_CLUSTER_MAX as you can see and your patch is increasing the reclaim
target which means that bailout wouldn't work properly and you wouldn't
get any useful work done but not really bail out.

> As far as changing the value, SWAP_CLUSTER_MAX puts the final value of
> sc->nr_reclaimed pretty close to sc->nr_to_reclaim. Since there's a
> loop for each evictable lru in shrink_lruvec, we could end up with 4 *
> sc->nr_to_reclaim in sc->nr_reclaimed if we switched to
> sc->nr_to_reclaim from SWAP_CLUSTER_MAX... an even bigger lie. So I
> don't think we'd want to do that.

The actual number returned from the reclaim is not really important
because memory_reclaim would break out of the loop and userspace would
never see the result.
On Mon, Feb 5, 2024 at 11:40 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 05-02-24 11:29:49, T.J. Mercier wrote:
> [...]
> Thanks it would be nice to outline that in the changelog.

Ok, I'll update the table below for each case.

> This is not something I deeply care about but generally it is better to
> not mix styles unless that is a clear win. If you want to save one LOC
> you can just move it up - just couple of lines up, or you can keep the
> definition closer and have a separate declaration.

I find it nicer to have to search as little as possible for both the
declaration (type) and definition, but I am not attached to it either
and it's not worth annoying anyone over here. Let's move it up like
Yosry suggested initially.

> The point of this code is to bail out early if the caller has fatal
> signals pending. That could be SIGTERM sent to the process performing
> the reclaim for whatever reason. The bail out is tuned for
> SWAP_CLUSTER_MAX as you can see and your patch is increasing the reclaim
> target which means that bailout wouldn't work properly and you wouldn't
> get any useful work done but not really bail out.

It's increasing to 1/4 of what it was 6 months ago before 0388536ac291
("mm:vmscan: fix inaccurate reclaim during proactive reclaim") and this
hasn't changed since then, so if anything the bailout should happen
quicker than originally tuned for.

> The actual number returned from the reclaim is not really important
> because memory_reclaim would break out of the loop and userspace would
> never see the result.

This makes sense, but it makes me uneasy. I can't point to anywhere
this would cause a problem currently (except maybe super unlikely
overflow of nr_reclaimed), but it feels like a setup for future
unintended consequences.
On Mon 05-02-24 12:26:10, T.J. Mercier wrote:
> On Mon, Feb 5, 2024 at 11:40 AM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> It's increasing to 1/4 of what it was 6 months ago before 0388536ac291
> ("mm:vmscan: fix inaccurate reclaim during proactive reclaim") and this
> hasn't changed since then, so if anything the bailout should happen
> quicker than originally tuned for.

Yes, this wasn't handled properly back then either.

> This makes sense, but it makes me uneasy. I can't point to anywhere
> this would cause a problem currently (except maybe super unlikely
> overflow of nr_reclaimed), but it feels like a setup for future
> unintended consequences.

Think of something like

	timeout $TIMEOUT echo $TARGET > $MEMCG_PATH/memory.reclaim

where timeout acts as a stop gap if the reclaim cannot finish in
TIMEOUT.
On Mon, Feb 5, 2024 at 12:36 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 05-02-24 12:26:10, T.J. Mercier wrote:
> [...]
> Yes, this wasn't handled properly back then either.
>
> > This makes sense, but it makes me uneasy. I can't point to anywhere
> > this would cause a problem currently (except maybe super unlikely
> > overflow of nr_reclaimed), but it feels like a setup for future
> > unintended consequences.
>
> Think of something like
>	timeout $TIMEOUT echo $TARGET > $MEMCG_PATH/memory.reclaim
> where timeout acts as a stop gap if the reclaim cannot finish in
> TIMEOUT.

Yeah I get the desired behavior, but using sc->nr_reclaimed to achieve
it is what's bothering me. It's already wired up that way though, so if
you want to make this change now then I can try to test for the
difference using really large reclaim targets.
On Mon 05-02-24 12:47:47, T.J. Mercier wrote:
> On Mon, Feb 5, 2024 at 12:36 PM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> Yeah I get the desired behavior, but using sc->nr_reclaimed to achieve
> it is what's bothering me.

I am not really happy about this subtlety. If we have a better way then
let's do it. Better in its own patch, though.

> It's already wired up that way though, so if you want to make this
> change now then I can try to test for the difference using really
> large reclaim targets.

Yes, please. If you want it a separate patch then no objection from me
of course. If you do not like the nr_to_reclaim bailout then maybe we
can go with a simple break out flag in scan_control.

Thanks!
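A rough sketch of the break-out-flag alternative mentioned above (the field name and placement are hypothetical, not a posted patch):

/* Hypothetical: record the abort in scan_control instead of faking
 * reclaim progress. */
struct scan_control {
	/* ... existing fields ... */
	bool aborted;	/* reclaimer received a fatal signal */
};

/* in shrink_inactive_list(), replacing the SWAP_CLUSTER_MAX return: */
if (fatal_signal_pending(current)) {
	sc->aborted = true;
	return 0;	/* report no progress; signal the abort out of band */
}

/* callers such as shrink_lruvec() would then check sc->aborted and
 * bail out early without inflating sc->nr_reclaimed. */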
On Mon, Feb 5, 2024 at 1:16 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 05-02-24 12:47:47, T.J. Mercier wrote:
> [...]
> Yes, please. If you want it a separate patch then no objection from me
> of course. If you do not like the nr_to_reclaim bailout then maybe we
> can go with a simple break out flag in scan_control.
>
> Thanks!

It's a bit difficult to test under the too_many_isolated check, so I
moved the fatal_signal_pending check outside and tried with that.
Performing full reclaim on the /uid_0 cgroup with a 250ms delay before
SIGKILL, I got an average of 16ms better latency with sc->nr_to_reclaim
across 20 runs, ignoring one 1s outlier with SWAP_CLUSTER_MAX. The
return values from memory_reclaim are different since with
sc->nr_to_reclaim we "succeed" and don't reach the signal_pending check
to return -EINTR, but I don't think it matters since the return code is
137 (SIGKILL) in both cases.

With SWAP_CLUSTER_MAX there was an outlier at nearly 1s, and in general
the latency numbers were noisier: 2% RSD vs 13% RSD. I'm guessing
that's a function of nr_to_scan being occasionally much less than
SWAP_CLUSTER_MAX causing nr[lru] to drain slowly. But it could also
have simply been scheduled out more often at the cond_resched in
shrink_lruvec, and that would help explain the 1s outlier. I don't have
enough debug info on the outlier to say much more.

With sc->nr_to_reclaim, the largest sc->nr_reclaimed value I saw was
about 2^53 for a sc->nr_to_reclaim of 2^51, but for large memcg
hierarchies I think it's possible to get more than that. There were
only 15 cgroups under /uid_0. This is the only thing that gives me
pause, since we could touch more than 2k cgroups in shrink_node_memcgs,
each one adding 4 * 2^51, potentially overflowing sc->nr_reclaimed.
Looks testable but I didn't get to it.
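A back-of-envelope check of that overflow bound (userspace arithmetic only, using the 2^51 target observed in the test above):

#include <stdio.h>

int main(void)
{
	/* Each cgroup visited could add 4 (evictable LRUs) times the
	 * observed sc->nr_to_reclaim of 2^51, i.e. 2^53 per cgroup. */
	unsigned long nr_to_reclaim = 1UL << 51;
	unsigned long per_cgroup = 4 * nr_to_reclaim;	/* 2^53 */

	/* ULONG_MAX / 2^53 = 2047, so ~2k cgroups wrap a 64-bit
	 * sc->nr_reclaimed, matching the concern above. */
	printf("cgroups to overflow: %lu\n",
	       (unsigned long)-1 / per_cgroup + 1);
	return 0;
}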
On Mon 05-02-24 20:01:40, T.J. Mercier wrote:
> [...]
> It's a bit difficult to test under the too_many_isolated check, so I
> moved the fatal_signal_pending check outside and tried with that.
> Performing full reclaim on the /uid_0 cgroup with a 250ms delay before
> SIGKILL, I got an average of 16ms better latency with sc->nr_to_reclaim
> across 20 runs, ignoring one 1s outlier with SWAP_CLUSTER_MAX.

This will obviously scale with the number of memcgs in the hierarchy but
you are right that too_many_isolated makes the whole fatal_signal_pending
check rather inefficient. I haven't missed that. The reclaim path is
rather convoluted so this will likely be more complex than I
anticipated. I will think about that some more.

In order to not delay your patch, please repost with suggested updates
to the changelog. This needs addressing IMO but I do not think this is
critical at this stage.
On Tue 06-02-24 09:58:41, Michal Hocko wrote:
> [...]
> In order to not delay your patch, please repost with suggested updates
> to the changelog. This needs addressing IMO but I do not think this is
> critical at this stage.

Has there been a new version or a proposal to refine the changelog
posted?
On Mon 19-02-24 08:39:19, T.J. Mercier wrote:
> On Mon, Feb 19, 2024 at 4:11 AM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > Has there been a new version or a proposal to refine the changelog
> > posted?
>
> Hi Michal,
>
> I updated the commit message in V4 to include a sentence about restart
> cost, and added a line above each reclaim test to note the MGLRU
> config and whether the memcg LRU was used or not.
>
> https://lore.kernel.org/all/20240206175251.3364296-1-tjmercier@google.com/

Hmm, missed that one for some reason.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 46d8d02114cf..f6ab61128869 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6976,9 +6976,11 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 		if (!nr_retries)
 			lru_add_drain_all();

+		/* Will converge on zero, but reclaim enforces a minimum */
+		unsigned long batch_size = (nr_to_reclaim - nr_reclaimed) / 4;
+
 		reclaimed = try_to_free_mem_cgroup_pages(memcg,
-					min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
-					GFP_KERNEL, reclaim_options);
+					batch_size, GFP_KERNEL, reclaim_options);

 		if (!reclaimed && !nr_retries--)
 			return -EAGAIN;
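For completeness, a minimal userspace sketch of exercising this interface (the cgroup path is an example; memory.reclaim parses a byte count with an optional size suffix, as in the "1G reclaim" tests above):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* Example path; adjust for your cgroup v2 hierarchy. */
	const char *path = "/sys/fs/cgroup/uid_0/memory.reclaim";
	const char *req = "1G";	/* ask for 1G of proactive reclaim */
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* The write can fail with EAGAIN once reclaim retries are
	 * exhausted, per the -EAGAIN path in the patch above. */
	if (write(fd, req, strlen(req)) < 0)
		perror("write");
	close(fd);
	return 0;
}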