Message ID | 20231228073055.4046430-1-shakeelb@google.com |
---|---|
State | New |
Headers |
Date: Thu, 28 Dec 2023 07:30:55 +0000
Message-ID: <20231228073055.4046430-1-shakeelb@google.com>
Subject: [PATCH] mm: ratelimit stat flush from workingset shrinker
From: Shakeel Butt <shakeelb@google.com>
To: Andrew Morton <akpm@linux-foundation.org>, Yosry Ahmed <yosryahmed@google.com>, Johannes Weiner <hannes@cmpxchg.org>, Yu Zhao <yuzhao@google.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Shakeel Butt <shakeelb@google.com>
X-Mailer: git-send-email 2.43.0.472.g3155946c3a-goog |
Series | mm: ratelimit stat flush from workingset shrinker |
Commit Message
Shakeel Butt
Dec. 28, 2023, 7:30 a.m. UTC
One of our internal workloads regressed on a newer upstream kernel, and on
further investigation the cause appears to be the always-synchronous rstat
flush in count_shadow_nodes() added by commit f82e6bf9bb9b ("mm: memcg:
use rstat for non-hierarchical stats"). On closer inspection, we don't
really need accurate stats in this function, as it was already
approximating the number of shadow entries to keep for maintaining the
refault information. Since there is already a periodic rstat flush every
2 seconds, we don't need exact stats here. Let's ratelimit the rstat
flush in this code path.
Fixes: f82e6bf9bb9b ("mm: memcg: use rstat for non-hierarchical stats")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
---
mm/workingset.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Comments
On Thu, Dec 28, 2023 at 12:31 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> One of our internal workload regressed on newer upstream kernel

Not really internal -- it's Postgres 14 + sysbench OLTP.

> and on
> further investigation, it seems like the cause is the always synchronous
> rstat flush in the count_shadow_nodes() added by the commit f82e6bf9bb9b
> ("mm: memcg: use rstat for non-hierarchical stats"). On further
> inspection it seems like we don't really need accurate stats in this
> function as it was already approximating the amount of appropriate
> shadow entried to keep for maintaining the refault information. Since
> there is already 2 sec periodic rstat flush, we don't need exact stats
> here. Let's ratelimit the rstat flush in this code path.
>
> Fixes: f82e6bf9bb9b ("mm: memcg: use rstat for non-hierarchical stats")
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> ---
>  mm/workingset.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/workingset.c b/mm/workingset.c
> index 2a2a34234df9..226012974328 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -680,7 +680,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
>  	struct lruvec *lruvec;
>  	int i;
>
> -	mem_cgroup_flush_stats(sc->memcg);
> +	mem_cgroup_flush_stats_ratelimited(sc->memcg);
>  	lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
>  	for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
>  		pages += lruvec_page_state_local(lruvec,

LGTM.
On Wed, Dec 27, 2023 at 11:31 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> One of our internal workload regressed on newer upstream kernel and on
> further investigation, it seems like the cause is the always synchronous
> rstat flush in the count_shadow_nodes() added by the commit f82e6bf9bb9b
> ("mm: memcg: use rstat for non-hierarchical stats"). On further
> inspection it seems like we don't really need accurate stats in this
> function as it was already approximating the amount of appropriate
> shadow entried to keep for maintaining the refault information. Since

s/entried/entries

> there is already 2 sec periodic rstat flush, we don't need exact stats
> here. Let's ratelimit the rstat flush in this code path.

Is the regression observed even with commit 7d7ef0a4686a ("mm: memcg:
restore subtree stats flushing")? I think the answer is yes based on
internal discussions, but this really surprises me.

Commit f82e6bf9bb9b removed the percpu loop in lruvec_page_state_local(),
and added a flush call. With 7d7ef0a4686a, the flush call is only
effective if there are pending updates in the cgroup subtree exceeding
MEMCG_CHARGE_BATCH * num_online_cpus(). IOW, we should only be doing
work when actually needed, whereas before we used to have multiple
percpu loops in count_shadow_nodes() regardless of pending updates.

It seems like the cgroup subtree is very active that we continuously
need to flush in count_shadow_nodes()? If that's the case, do we still
think it's okay not to flush when we know there are pending updates? I
don't have enough background about the workingset heuristics to judge
this.

I am not objecting to this change, I am just trying to understand what's
happening. Thanks!
>
> Fixes: f82e6bf9bb9b ("mm: memcg: use rstat for non-hierarchical stats")
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> ---
>  mm/workingset.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/workingset.c b/mm/workingset.c
> index 2a2a34234df9..226012974328 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -680,7 +680,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
>  	struct lruvec *lruvec;
>  	int i;
>
> -	mem_cgroup_flush_stats(sc->memcg);
> +	mem_cgroup_flush_stats_ratelimited(sc->memcg);
>  	lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
>  	for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
>  		pages += lruvec_page_state_local(lruvec,
> --
> 2.43.0.472.g3155946c3a-goog
>
On Thu, Dec 28, 2023 at 07:13:23AM -0800, Yosry Ahmed wrote:
> On Wed, Dec 27, 2023 at 11:31 PM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > One of our internal workload regressed on newer upstream kernel and on
> > further investigation, it seems like the cause is the always synchronous
> > rstat flush in the count_shadow_nodes() added by the commit f82e6bf9bb9b
> > ("mm: memcg: use rstat for non-hierarchical stats"). On further
> > inspection it seems like we don't really need accurate stats in this
> > function as it was already approximating the amount of appropriate
> > shadow entried to keep for maintaining the refault information. Since
>
> s/entried/entries
>
> > there is already 2 sec periodic rstat flush, we don't need exact stats
> > here. Let's ratelimit the rstat flush in this code path.
>
> Is the regression observed even with commit 7d7ef0a4686a ("mm: memcg:
> restore subtree stats flushing")? I think the answer is yes based on
> internal discussions, but this really surprises me.
>

Yes, the regression was on latest mm-stable branch of Andrew's mm tree.

> Commit f82e6bf9bb9b removed the percpu loop in
> lruvec_page_state_local(), and added a flush call. With 7d7ef0a4686a,
> the flush call is only effective if there are pending updates in the
> cgroup subtree exceeding MEMCG_CHARGE_BATCH * num_online_cpus(). IOW,
> we should only be doing work when actually needed, whereas before we
> used to have multiple percpu loops in count_shadow_nodes() regardless
> of pending updates.
>
> It seems like the cgroup subtree is very active that we continuously
> need to flush in count_shadow_nodes()? If that's the case, do we still
> think it's okay not to flush when we know there are pending updates? I
> don't have enough background about the workingset heuristics to judge
> this.

Not all updates might be related to the stats being read here. Also the
read value is further divided by 8 and manipulated more in
do_shrink_slab(). So, I don't think we need less than 2 seconds accuracy
for these stats here.
diff --git a/mm/workingset.c b/mm/workingset.c
index 2a2a34234df9..226012974328 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -680,7 +680,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
 	struct lruvec *lruvec;
 	int i;
 
-	mem_cgroup_flush_stats(sc->memcg);
+	mem_cgroup_flush_stats_ratelimited(sc->memcg);
 	lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
 	for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
 		pages += lruvec_page_state_local(lruvec,