Message ID: 20230418095852.RFC.1.I53bf7f0c7d48fe7af13c5dd3ad581d3bcfd9d1bd@changeid
State: New
Headers:
From: Douglas Anderson <dianders@chromium.org>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>, Mel Gorman <mgorman@techsingularity.net>, Yu Zhao <yuzhao@google.com>, Ying <ying.huang@intel.com>, Johannes Weiner <hannes@cmpxchg.org>, Douglas Anderson <dianders@chromium.org>, "Peter Zijlstra (Intel)" <peterz@infradead.org>, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [RFC PATCH] mm, compaction: kcompactd work shouldn't count towards memory PSI
Date: Tue, 18 Apr 2023 09:58:54 -0700
Message-ID: <20230418095852.RFC.1.I53bf7f0c7d48fe7af13c5dd3ad581d3bcfd9d1bd@changeid>
Series: [RFC] mm, compaction: kcompactd work shouldn't count towards memory PSI
Commit Message
Doug Anderson
April 18, 2023, 4:58 p.m. UTC
When the main kcompactd thread is doing compaction, it's always
proactive compaction. This is a little confusing because kcompactd has
two phases, and one of them is called the "proactive" phase.
Specifically:
* Phase 1 (the "non-proactive" phase): we've been told by someone else
that it would be a good idea to try to compact memory.
* Phase 2 (the "proactive" phase): we analyze memory fragmentation
ourselves and compact if it looks fragmented.
From the context of kcompactd, the above naming makes sense. However,
from the context of the kernel as a whole, both phases are "proactive"
because in both cases we're trying to compact memory ahead of time and
we're not actually blocking (stalling) any task that is trying to use
memory.
Specifically, if any task is actually blocked needing memory to be
compacted then it will be in direct reclaim. It won't block waiting
on the kcompactd task but will instead call try_to_compact_pages() directly.
The caller of that direct compaction, __alloc_pages_direct_compact(),
already marks itself as counting towards PSI.
As a sanity check from another perspective, we can look at everyone
who explicitly asks kcompactd to compact memory by calling
wakeup_kcompactd(). That leads us to 3 places in vmscan.c, all of them
requests from kswapd, which is also a "proactive" mechanism in the
kernel (tasks aren't blocked waiting for it).
Fixes: eb414681d5a0 ("psi: pressure stall information for CPU, memory, and IO")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
---
I stumbled upon this while researching for a different patch [1].
Although both the other patch and this one affect kcompactd, they are
otherwise unrelated. Note that ${SUBJECT} patch was created solely by
code inspection. I don't have any specific test cases that are made
better by it; the code just didn't seem quite right to me.
My knowledge of the memory subsystem is shaky at best, so please take
this patch with a grain of salt. If you're a memory system expert and
this patch looks totally misguided to you then it probably is. ;-)
[1] https://lore.kernel.org/r/20230413182313.RFC.1.Ia86ccac02a303154a0b8bc60567e7a95d34c96d3@changeid
mm/compaction.c | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)
Comments
On Tue, Apr 18, 2023 at 09:58:54AM -0700, Douglas Anderson wrote:
> [ Doug's commit message quoted in full; see above ]

There is a reason behind annotating kswapd/kcompactd like this, it's
in the longish comment in psi.c:

 * The time in which a task can execute on a CPU is our baseline for
 * productivity. Pressure expresses the amount of time in which this
 * potential cannot be realized due to resource contention.
 *
 * This concept of productivity has two components: the workload and
 * the CPU. To measure the impact of pressure on both, we define two
 * contention states for a resource: SOME and FULL.
 *
 * In the SOME state of a given resource, one or more tasks are
 * delayed on that resource. This affects the workload's ability to
 * perform work, but the CPU may still be executing other tasks.
 *
 * In the FULL state of a given resource, all non-idle tasks are
 * delayed on that resource such that nobody is advancing and the CPU
 * goes idle. This leaves both workload and CPU unproductive.
 *
 * SOME = nr_delayed_tasks != 0
 * FULL = nr_delayed_tasks != 0 && nr_productive_tasks == 0
 *
 * What it means for a task to be productive is defined differently
 * for each resource. For IO, productive means a running task. For
 * memory, productive means a running task that isn't a reclaimer. For
 * CPU, productive means an oncpu task.

So when you have a CPU that's running reclaim/compaction work, that
CPU isn't available to execute the workload.

Say you only have one CPU shared between an allocating thread and
kswapd. Even if the allocating thread never has to do reclaim on its
own, if it has to wait for the CPU behind kswapd 50% of the time, that
workload is positively under memory pressure.

I don't think the distinction between proactive and reactive is all
that meaningful. It's generally assumed that all the work done by
these background threads is work that later doesn't have to be done by
an allocating thread. It might matter from a latency perspective, but
otherwise the work is fungible as it relates to memory pressure.

HTH
Hi,

On Tue, Apr 18, 2023 at 12:53 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> There is a reason behind annotating kswapd/kcompactd like this, it's
> in the longish comment in psi.c:
> [ psi.c comment quoted in full; see above ]

Ah, thanks for the pointer!

> So when you have a CPU that's running reclaim/compaction work, that
> CPU isn't available to execute the workload.
>
> Say you only have one CPU shared between an allocating thread and
> kswapd. Even if the allocating thread never has to do reclaim on its
> own, if it has to wait for the CPU behind kswapd 50% of the time, that
> workload is positively under memory pressure.

I guess I'm so much in the mindset of having 2-8 CPUs that I didn't
think as much about the single CPU case. What you say makes a lot of
sense for the single CPU case or for very parallel workloads that take
up all available CPUs, but what about when you've got extra CPUs
sitting there idling? In that case we're really not taking any CPU
cycles away from someone by having one of those CPUs doing compaction
in the background.

I'm sure this has been discussed before somewhere, but I'd also wonder
if there's ever a reason why we should prioritize kswapd/kcompactd
over user programs. AKA: why don't kswapd/kcompactd run with a big
"sched_nice" value? If kswapd/kcompactd were low enough priority then
even in the single CPU case (or the multiple CPU case with a very
parallel workload) kswapd/kcompactd would never be taking away
significant time from a real program.

> I don't think the distinction between proactive and reactive is all
> that meaningful. It's generally assumed that all the work done by
> these background threads is work that later doesn't have to be done by
> an allocating thread. It might matter from a latency perspective, but
> otherwise the work is fungible as it relates to memory pressure.

If all compaction should count towards PSI then it feels like we have
a different bug. The proactive_compact_node() function should also be
marked as counting towards PSI, shouldn't it?

-Doug
diff --git a/mm/compaction.c b/mm/compaction.c
index 5a9501e0ae01..5a8d78b506e4 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -22,7 +22,6 @@
 #include <linux/kthread.h>
 #include <linux/freezer.h>
 #include <linux/page_owner.h>
-#include <linux/psi.h>
 #include "internal.h"
 
 #ifdef CONFIG_COMPACTION
@@ -2954,8 +2953,6 @@ static int kcompactd(void *p)
 	pgdat->kcompactd_highest_zoneidx = pgdat->nr_zones - 1;
 
 	while (!kthread_should_stop()) {
-		unsigned long pflags;
-
 		/*
 		 * Avoid the unnecessary wakeup for proactive compaction
 		 * when it is disabled.
@@ -2967,9 +2964,8 @@ static int kcompactd(void *p)
 			kcompactd_work_requested(pgdat), timeout) &&
 			!pgdat->proactive_compact_trigger) {
 
-			psi_memstall_enter(&pflags);
 			kcompactd_do_work(pgdat);
-			psi_memstall_leave(&pflags);
+
 			/*
 			 * Reset the timeout value. The defer timeout from
 			 * proactive compaction is lost here but that is fine