Message ID | 20221025170519.314511-1-hannes@cmpxchg.org |
---|---|
State | New |
Headers |
From: Johannes Weiner <hannes@cmpxchg.org>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Eric Bergen <ebergen@meta.com>
Subject: [PATCH] mm: vmscan: split khugepaged stats from direct reclaim stats
Date: Tue, 25 Oct 2022 13:05:19 -0400
Message-Id: <20221025170519.314511-1-hannes@cmpxchg.org> |
Series |
mm: vmscan: split khugepaged stats from direct reclaim stats
|
|
Commit Message
Johannes Weiner
Oct. 25, 2022, 5:05 p.m. UTC
Direct reclaim stats are useful for identifying a potential source for
application latency, as well as spotting issues with kswapd. However,
khugepaged currently distorts the picture: as a kernel thread it
doesn't impose allocation latencies on userspace, and it explicitly
opts out of kswapd reclaim. Its activity showing up in the direct
reclaim stats is misleading. Counting it as kswapd reclaim could also
cause confusion when trying to understand actual kswapd behavior.
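
For context, the kswapd opt-out mentioned above comes from khugepaged allocating with a GFP_TRANSHUGE-style mask, which requests direct reclaim but clears the kswapd-wakeup bit. A sketch of the relevant definitions, paraphrased from include/linux/gfp_types.h around this kernel version (not part of this patch):

	/*
	 * __GFP_RECLAIM covers both reclaim bits; GFP_TRANSHUGE masks
	 * them out and then adds back only direct reclaim, so THP
	 * allocations never wake kswapd.
	 */
	#define __GFP_RECLAIM ((__force gfp_t)(___GFP_DIRECT_RECLAIM | \
					       ___GFP_KSWAPD_RECLAIM))

	#define GFP_TRANSHUGE_LIGHT	((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
					  __GFP_NOMEMALLOC | __GFP_NOWARN) & \
					 ~__GFP_RECLAIM)
	#define GFP_TRANSHUGE		(GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)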
Break out khugepaged from the direct reclaim counters into new
pgsteal_khugepaged, pgdemote_khugepaged, pgscan_khugepaged counters.
Test with a huge executable (CONFIG_READ_ONLY_THP_FOR_FS):
pgsteal_kswapd 1342185
pgsteal_direct 0
pgsteal_khugepaged 3623
pgscan_kswapd 1345025
pgscan_direct 0
pgscan_khugepaged 3623
Reported-by: Eric Bergen <ebergen@meta.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 Documentation/admin-guide/cgroup-v2.rst |  6 +++++
 include/linux/khugepaged.h              |  6 +++++
 include/linux/vm_event_item.h           |  3 +++
 mm/khugepaged.c                         |  5 +++++
 mm/memcontrol.c                         |  8 +++++--
 mm/vmscan.c                             | 30 ++++++++++++++++++-------
 mm/vmstat.c                             |  3 +++
 7 files changed, 51 insertions(+), 10 deletions(-)
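
With the patch applied, the new counters show up as ordinary /proc/vmstat fields next to their kswapd and direct siblings, which is how the test output above was gathered. A minimal userspace sketch for pulling them out (illustrative only, not part of the patch):

	/* Print all pgsteal_*/pgscan_*/pgdemote_* lines from /proc/vmstat,
	 * including the new *_khugepaged counters. */
	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		char line[256];
		FILE *f = fopen("/proc/vmstat", "r");

		if (!f) {
			perror("fopen");
			return 1;
		}
		while (fgets(line, sizeof(line), f)) {
			if (!strncmp(line, "pgsteal_", 8) ||
			    !strncmp(line, "pgscan_", 7) ||
			    !strncmp(line, "pgdemote_", 9))
				fputs(line, stdout);
		}
		fclose(f);
		return 0;
	}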
Comments
On Tue, Oct 25, 2022 at 01:05:19PM -0400, Johannes Weiner wrote:
> +static int reclaimer_offset(void)
> +{
> +	BUILD_BUG_ON(PGSTEAL_DIRECT - PGSTEAL_KSWAPD != 1);
> +	BUILD_BUG_ON(PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD != 2);
> +	BUILD_BUG_ON(PGDEMOTE_DIRECT - PGDEMOTE_KSWAPD != 1);
> +	BUILD_BUG_ON(PGDEMOTE_KHUGEPAGED - PGDEMOTE_KSWAPD != 2);
> +	BUILD_BUG_ON(PGSCAN_DIRECT - PGSCAN_KSWAPD != 1);
> +	BUILD_BUG_ON(PGSCAN_KHUGEPAGED - PGSCAN_KSWAPD != 2);
> +
> +	if (current_is_kswapd())
> +		return 0;
> +	if (current_is_khugepaged())
> +		return 2;
> +	return 1;
> +}

Would this be simpler as ...

	BUILD_BUG_ON(PGSTEAL_DIRECT - PGSTEAL_KSWAPD !=
		     PGDEMOTE_DIRECT - PGDEMOTE_KSWAPD);
	BUILD_BUG_ON(PGSTEAL_DIRECT - PGSTEAL_KSWAPD !=
		     PGSCAN_DIRECT - PGSCAN_KSWAPD);
	BUILD_BUG_ON(PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD !=
		     PGDEMOTE_KHUGEPAGED - PGDEMOTE_KSWAPD);
	BUILD_BUG_ON(PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD !=
		     PGSCAN_KHUGEPAGED - PGDEMOTE_KSWAPD);

	if (current_is_kswapd())
		return 0;
	if (current_is_khugepaged())
		return PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD;
	return PGSTEAL_DIRECT - PGSTEAL_KSWAPD;

Not that I think we'd ever want to separate them, but it is perhaps a
bit less magic?
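
Assembled as a complete function, the suggested variant would look roughly like this. This is a sketch, not the final v2. Note that the last BUILD_BUG_ON in the quoted suggestion appears to contain a typo (PGDEMOTE_KSWAPD where PGSCAN_KSWAPD was presumably meant); it is corrected here:

	static int reclaimer_offset(void)
	{
		/* All three counter families must share the same layout. */
		BUILD_BUG_ON(PGSTEAL_DIRECT - PGSTEAL_KSWAPD !=
			     PGDEMOTE_DIRECT - PGDEMOTE_KSWAPD);
		BUILD_BUG_ON(PGSTEAL_DIRECT - PGSTEAL_KSWAPD !=
			     PGSCAN_DIRECT - PGSCAN_KSWAPD);
		BUILD_BUG_ON(PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD !=
			     PGDEMOTE_KHUGEPAGED - PGDEMOTE_KSWAPD);
		BUILD_BUG_ON(PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD !=
			     PGSCAN_KHUGEPAGED - PGSCAN_KSWAPD);

		if (current_is_kswapd())
			return 0;
		if (current_is_khugepaged())
			return PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD;
		return PGSTEAL_DIRECT - PGSTEAL_KSWAPD;
	}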
On Tue, Oct 25, 2022 at 10:05 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Direct reclaim stats are useful for identifying a potential source for
> application latency, as well as spotting issues with kswapd. However,
> khugepaged currently distorts the picture: as a kernel thread it
> doesn't impose allocation latencies on userspace, and it explicitly
> opts out of kswapd reclaim. Its activity showing up in the direct
> reclaim stats is misleading. Counting it as kswapd reclaim could also
> cause confusion when trying to understand actual kswapd behavior.
>
> Break out khugepaged from the direct reclaim counters into new
> pgsteal_khugepaged, pgdemote_khugepaged, pgscan_khugepaged counters.
>
> Test with a huge executable (CONFIG_READ_ONLY_THP_FOR_FS):
>
> pgsteal_kswapd 1342185
> pgsteal_direct 0
> pgsteal_khugepaged 3623
> pgscan_kswapd 1345025
> pgscan_direct 0
> pgscan_khugepaged 3623

There are other kernel threads or works that may allocate memory and
then trigger memory reclaim; there may be similar problems for them,
and someone may try to add a new stat. So how about we make the stats
more general, for example, call it "pg{steal|scan}_kthread"?

> Reported-by: Eric Bergen <ebergen@meta.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> [...]
On Tue, Oct 25, 2022 at 06:16:53PM +0100, Matthew Wilcox wrote:
> On Tue, Oct 25, 2022 at 01:05:19PM -0400, Johannes Weiner wrote:
> > +static int reclaimer_offset(void)
> > [...]
>
> Would this be simpler as ...
>
> 	BUILD_BUG_ON(PGSTEAL_DIRECT - PGSTEAL_KSWAPD !=
> 		     PGDEMOTE_DIRECT - PGDEMOTE_KSWAPD);
> 	BUILD_BUG_ON(PGSTEAL_DIRECT - PGSTEAL_KSWAPD !=
> 		     PGSCAN_DIRECT - PGSCAN_KSWAPD);
> 	BUILD_BUG_ON(PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD !=
> 		     PGDEMOTE_KHUGEPAGED - PGDEMOTE_KSWAPD);
> 	BUILD_BUG_ON(PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD !=
> 		     PGSCAN_KHUGEPAGED - PGDEMOTE_KSWAPD);
>
> 	if (current_is_kswapd())
> 		return 0;
> 	if (current_is_khugepaged())
> 		return PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD;
> 	return PGSTEAL_DIRECT - PGSTEAL_KSWAPD;
>
> Not that I think we'd ever want to separate them, but it is perhaps a
> bit less magic?

Yeah that looks better. I'll do that in v2, thanks!
On Tue, Oct 25, 2022 at 12:40:15PM -0700, Yang Shi wrote:
> On Tue, Oct 25, 2022 at 10:05 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > [...]
>
> There are other kernel threads or works that may allocate memory and
> then trigger memory reclaim; there may be similar problems for them,
> and someone may try to add a new stat. So how about we make the stats
> more general, for example, call it "pg{steal|scan}_kthread"?

I'm not convinced that's a good idea.

Can you generally say that userspace isn't indirectly waiting for one
of those allocating threads? With khugepaged, we know.

And those other allocations are usually ___GFP_KSWAPD_RECLAIM, so if
they do direct reclaim, we'd probably want to know that kswapd is
failing to keep up (doubly so if userspace is waiting). In a shared
kthread counter, khugepaged would again muddy the waters.
On Tue, Oct 25, 2022 at 1:54 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Oct 25, 2022 at 12:40:15PM -0700, Yang Shi wrote:
> > [...]
> > So how about we make the stats more general, for example, call it
> > "pg{steal|scan}_kthread"?
>
> I'm not convinced that's a good idea.
>
> Can you generally say that userspace isn't indirectly waiting for one
> of those allocating threads? With khugepaged, we know.

AFAIK, ksm may do slab allocation with __GFP_DIRECT_RECLAIM. Some
device mapper drivers may do heavy lifting in the work queue, for
example, dm-crypt, particularly for writing.

> And those other allocations are usually ___GFP_KSWAPD_RECLAIM, so if
> they do direct reclaim, we'd probably want to know that kswapd is
> failing to keep up (doubly so if userspace is waiting). In a shared
> kthread counter, khugepaged would again muddy the waters.
On Tue, Oct 25, 2022 at 02:53:01PM -0700, Yang Shi wrote:
> On Tue, Oct 25, 2022 at 1:54 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > [...]
> > Can you generally say that userspace isn't indirectly waiting for one
> > of those allocating threads? With khugepaged, we know.
>
> AFAIK, ksm may do slab allocation with __GFP_DIRECT_RECLAIM.

Right, but ksm also uses __GFP_KSWAPD_RECLAIM. So while userspace
isn't directly waiting for ksm, when ksm enters direct reclaim it's
because kswapd failed. This is of interest to kernel developers.
Userspace will likely see direct reclaim in that scenario as well, so
the ksm direct reclaim counts aren't liable to confuse users.

Khugepaged on the other hand will *always* reclaim directly, even if
there is no memory pressure or kswapd failure. The direct reclaim
counts there are misleading to both developers and users.

What it really should be is pgscan_nokswapd_nouserprocesswaiting, but
that just seems kind of long ;-)

I'm also not sure anybody but khugepaged is doing direct reclaim
without kswapd reclaim. It seems unlikely we'll get more of those.

> Some device mapper drivers may do heavy lifting in the work queue, for
> example, dm-crypt, particularly for writing.

Userspace will wait for those through dirty throttling. We'd want to
know about kswapd failures in that case - again, without them being
muddied by khugepaged.
On Wed, Oct 26, 2022 at 10:32 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Oct 25, 2022 at 02:53:01PM -0700, Yang Shi wrote:
> > [...]
>
> Right, but ksm also uses __GFP_KSWAPD_RECLAIM. So while userspace
> isn't directly waiting for ksm, when ksm enters direct reclaim it's
> because kswapd failed. This is of interest to kernel developers.
> Userspace will likely see direct reclaim in that scenario as well, so
> the ksm direct reclaim counts aren't liable to confuse users.
>
> Khugepaged on the other hand will *always* reclaim directly, even if
> there is no memory pressure or kswapd failure. The direct reclaim
> counts there are misleading to both developers and users.
>
> What it really should be is pgscan_nokswapd_nouserprocesswaiting, but
> that just seems kind of long ;-)
>
> I'm also not sure anybody but khugepaged is doing direct reclaim
> without kswapd reclaim. It seems unlikely we'll get more of those.

IIUC you actually don't care about how many direct reclaims are
triggered by khugepaged; rather, you would like to separate the direct
reclaim stats between those that are triggered directly by userspace
actions, which may stall userspace, and those that aren't, which don't
stall userspace. If so, it doesn't sound that important to distinguish
whether the direct reclaim was triggered by khugepaged or by other
kernel threads, even though other kthreads are not liable to confuse
users IMHO.

> > Some device mapper drivers may do heavy lifting in the work queue, for
> > example, dm-crypt, particularly for writing.
>
> Userspace will wait for those through dirty throttling. We'd want to
> know about kswapd failures in that case - again, without them being
> muddied by khugepaged.

Not guaranteed.

Anyway, I just thought pgscan_khugepaged might be more confusing for
the users, even the developers who are not familiar with
THP/khugepaged.
On Wed, Oct 26, 2022 at 1:51 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Wed, Oct 26, 2022 at 10:32 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > [...]
>
> IIUC you actually don't care about how many direct reclaims are
> triggered by khugepaged; rather, you would like to separate the direct
> reclaim stats between those that are triggered directly by userspace
> actions, which may stall userspace, and those that aren't, which don't
> stall userspace. If so, it doesn't sound that important to distinguish
> whether the direct reclaim was triggered by khugepaged or by other
> kernel threads, even though other kthreads are not liable to confuse
> users IMHO.

My 2c, if we care about direct reclaim as in reclaim that may stall
user space application allocations, then there are other reclaim
contexts that may pollute the direct reclaim stats. For instance,
proactive reclaim, or reclaim done by writing a limit lower than the
current usage to memory.max or memory.high, as they are not done in
the context of the application allocating memory.

At Google, we have some internal direct reclaim memcg statistics, and
the way we handle this is by passing a flag from such contexts to
try_to_free_mem_cgroup_pages() in the reclaim_options arg. This flag
is echoed into a scan_struct bit, which we then use to filter out
direct reclaim operations that actually cause latencies in user space
allocations.

Perhaps something similar might be more generic here? I am not sure
what context khugepaged reclaims memory from, but I think it's not a
memcg context, so maybe we want to generalize the reclaim_options arg
to try_to_free_pages() or whatever interface khugepaged uses to free
memory.
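
A rough sketch of the mechanism being described, for readers following along. The flag name, its bit value, and the scan_control field below are assumptions made for illustration; they are neither upstream code nor the Google-internal implementation (MEMCG_RECLAIM_MAY_SWAP is the existing may-swap bit in reclaim_options):

	/* Hypothetical flag next to the existing reclaim_options bits. */
	#define MEMCG_RECLAIM_USER_TRIGGERED	(1 << 7)	/* assumed */

	/*
	 * Callers acting on behalf of userspace (a memory.reclaim write,
	 * lowering memory.max, ...) would pass the flag; vmscan would
	 * echo it into a scan_control bit and skip the direct reclaim
	 * counters when it is clear.
	 */
	static unsigned long user_triggered_reclaim(struct mem_cgroup *memcg,
						    unsigned long nr_pages)
	{
		return try_to_free_mem_cgroup_pages(memcg, nr_pages,
						    GFP_KERNEL,
						    MEMCG_RECLAIM_MAY_SWAP |
						    MEMCG_RECLAIM_USER_TRIGGERED);
	}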
On Wed, Oct 26, 2022 at 07:41:21PM -0700, Yosry Ahmed wrote:
> On Wed, Oct 26, 2022 at 1:51 PM Yang Shi <shy828301@gmail.com> wrote:
> > [...]
> > If so, it doesn't sound that important to distinguish
> > whether the direct reclaim was triggered by khugepaged or by other
> > kernel threads, even though other kthreads are not liable to confuse
> > users IMHO.

I feel like I've sufficiently explained my reason for wanting to
separate out the __GFP_KSWAPD_RECLAIM special case from other sites.

> My 2c, if we care about direct reclaim as in reclaim that may stall
> user space application allocations, then there are other reclaim
> contexts that may pollute the direct reclaim stats. For instance,
> proactive reclaim, or reclaim done by writing a limit lower than the
> current usage to memory.max or memory.high, as they are not done in
> the context of the application allocating memory.
>
> At Google, we have some internal direct reclaim memcg statistics, and
> the way we handle this is by passing a flag from such contexts to
> try_to_free_mem_cgroup_pages() in the reclaim_options arg. This flag
> is echoed into a scan_struct bit, which we then use to filter out
> direct reclaim operations that actually cause latencies in user space
> allocations.
>
> Perhaps something similar might be more generic here? I am not sure
> what context khugepaged reclaims memory from, but I think it's not a
> memcg context, so maybe we want to generalize the reclaim_options arg
> to try_to_free_pages() or whatever interface khugepaged uses to free
> memory.

So at the /proc/vmstat level, I'm not sure it matters much because it
doesn't count any cgroup_reclaim() activity.

But at the cgroup level, it sure would be nice to split out proactive
reclaim churn. Both in terms of not polluting direct reclaim counts,
but also for *knowing* how much proactive reclaim is doing.

Do you have separate counters for this?
On Thu, Oct 27, 2022 at 7:15 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Oct 26, 2022 at 07:41:21PM -0700, Yosry Ahmed wrote:
> > [...]
> > Perhaps something similar might be more generic here? I am not sure
> > what context khugepaged reclaims memory from, but I think it's not a
> > memcg context, so maybe we want to generalize the reclaim_options arg
> > to try_to_free_pages() or whatever interface khugepaged uses to free
> > memory.
>
> So at the /proc/vmstat level, I'm not sure it matters much because it
> doesn't count any cgroup_reclaim() activity.
>
> But at the cgroup level, it sure would be nice to split out proactive
> reclaim churn. Both in terms of not polluting direct reclaim counts,
> but also for *knowing* how much proactive reclaim is doing.
>
> Do you have separate counters for this?

Not yet. Currently we only have the first part, not polluting direct
reclaim counts.

We basically exclude reclaim coming from memory.reclaim, setting
memory.max/memory.limit_in_bytes, memory.high (on write, not hitting
the high limit), and memory.force_empty from direct reclaim stats.

As for having a separate counter for proactive reclaim, do you think
it should be limited to reclaim coming from memory.reclaim (and
potentially memory.force_empty), or should it include reclaim coming
from limit-setting as well?
On Thu, Oct 27, 2022 at 01:43:24PM -0700, Yosry Ahmed wrote:
> On Thu, Oct 27, 2022 at 7:15 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > [...]
> > Do you have separate counters for this?
>
> Not yet. Currently we only have the first part, not polluting direct
> reclaim counts.
>
> We basically exclude reclaim coming from memory.reclaim, setting
> memory.max/memory.limit_in_bytes, memory.high (on write, not hitting
> the high limit), and memory.force_empty from direct reclaim stats.
>
> As for having a separate counter for proactive reclaim, do you think
> it should be limited to reclaim coming from memory.reclaim (and
> potentially memory.force_empty), or should it include reclaim coming
> from limit-setting as well?

A combined counter seems reasonable to me. We *have* used the limit
knobs to drive proactive reclaim in production in the past, so it's
not a stretch. And I can't think of a scenario where you'd like them
to be separate.

I could think of two ways of describing it:

pgscan_user: User-requested reclaim. Could be confusing if we ever
have an in-kernel proactive reclaim driver - unless that would then go
to another counter (new or kswapd).

pgscan_ext: Reclaim activity from extraordinary/external
requests. External as in: outside the allocation context.
On Fri, Oct 28, 2022 at 7:39 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Oct 27, 2022 at 01:43:24PM -0700, Yosry Ahmed wrote:
> > [...]
>
> A combined counter seems reasonable to me. We *have* used the limit
> knobs to drive proactive reclaim in production in the past, so it's
> not a stretch. And I can't think of a scenario where you'd like them
> to be separate.
>
> I could think of two ways of describing it:
>
> pgscan_user: User-requested reclaim. Could be confusing if we ever
> have an in-kernel proactive reclaim driver - unless that would then go
> to another counter (new or kswapd).
>
> pgscan_ext: Reclaim activity from extraordinary/external
> requests. External as in: outside the allocation context.

I imagine if the kernel is doing proactive reclaim on its own, we
might want a separate counter for that anyway to monitor what the
kernel is doing. So maybe pgscan_user sounds nice for now, but I also
like that the latter explicitly says "this is external to the
allocation context". But we can just go with pgscan_user and document
it properly.

How would khugepaged fit in this story? Seems like it would be part of
pgscan_ext but not pgscan_user. I imagine we also don't want to
pollute proactive reclaim counters with khugepaged reclaim (or other
non-direct reclaim).

Maybe pgscan_user and pgscan_kernel/pgscan_indirect for things like
khugepaged? The problem with pgscan_kernel/indirect is that if we add
a proactive reclaim kthread in the future it would technically fit
there but we would want a separate counter for it.

I am honestly not sure where to put khugepaged. The reasons I don't
like a dedicated counter for khugepaged are:
- What if other kthreads like khugepaged start doing the same, do we
  add one counter per-thread?
- What if we deprecate khugepaged (or such threads)? Seems more likely
  than deprecating kswapd.

Looks like we want a stat that would group all of this reclaim coming
from non-direct kthreads, but would not include a future proactive
reclaim kthread.
On Fri, Oct 28, 2022 at 10:41:17AM -0700, Yosry Ahmed wrote:
> On Fri, Oct 28, 2022 at 7:39 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > [...]
>
> I imagine if the kernel is doing proactive reclaim on its own, we
> might want a separate counter for that anyway to monitor what the
> kernel is doing. So maybe pgscan_user sounds nice for now, but I also
> like that the latter explicitly says "this is external to the
> allocation context". But we can just go with pgscan_user and document
> it properly.

Yes, I think you're right. pgscan_user sounds good to me.

> How would khugepaged fit in this story? Seems like it would be part of
> pgscan_ext but not pgscan_user. I imagine we also don't want to
> pollute proactive reclaim counters with khugepaged reclaim (or other
> non-direct reclaim).
>
> Maybe pgscan_user and pgscan_kernel/pgscan_indirect for things like
> khugepaged? The problem with pgscan_kernel/indirect is that if we add
> a proactive reclaim kthread in the future it would technically fit
> there but we would want a separate counter for it.
>
> I am honestly not sure where to put khugepaged. The reasons I don't
> like a dedicated counter for khugepaged are:
> - What if other kthreads like khugepaged start doing the same, do we
>   add one counter per-thread?

It's unlikely there will be more.

The reason khugepaged doesn't rely on kswapd is unique to THP
allocations: they can require an exorbitant amount of work to
assemble, but due to fragmentation those requests may fail
permanently. We don't want to burden a shared facility like kswapd
with large amounts of speculative work on behalf of what are (still*)
cornercase requests.

This isn't true for other allocations. We do have __GFP_NORETRY sites
here and there that rather fall back early than put in the full amount
of work; but overall we expect allocations to succeed - and kswapd to
be able to balance for them!!** - because the alternative tends to be
OOMs, or drivers and workloads aborting on -ENOMEM.

(* As we evolve the allocator and normalize huge page requests
   (folios), kswapd may also eventually balance for THPs again. IOW,
   it's more likely for this exception to disappear again than it is
   that we'll see more of them.)

(** This is also why it's no big deal if other kthreads that rely on
    kswapd contribute to direct reclaim stats. First, it's highly
    error prone to determine on a case by case basis whether userspace
    could be waiting behind that direct reclaim - as Yang Shi's
    writeback example demonstrates. Second, if kswapd is overwhelmed,
    it's likely to impact userspace *anyway*! The benefit of this
    classification work is questionable.)

> - What if we deprecate khugepaged (or such threads)? Seems more likely
>   than deprecating kswapd.

If that happens, we can remove the counter again. The bar isn't as
high for vmstat as it is for other ABI, and we've updated it plenty of
times to reflect changes in the MM implementation.

> Looks like we want a stat that would group all of this reclaim coming
> from non-direct kthreads, but would not include a future proactive
> reclaim kthread.

I think the desire to generalize overcomplicates things here in a way
that isn't actually meaningful.

Think of direct reclaim stats as a signal that either a) kswapd is
broken or b) memory pressure is high enough to cause latencies in the
class of requests that are of interest to userspace. This is true for
all cases but khugepaged.
On Mon, Oct 31, 2022 at 9:00 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Oct 28, 2022 at 10:41:17AM -0700, Yosry Ahmed wrote:
> > On Fri, Oct 28, 2022 at 7:39 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > pgscan_user: User-requested reclaim. Could be confusing if we ever
> > > have an in-kernel proactive reclaim driver - unless that would then go
> > > to another counter (new or kswapd).
> > >
> > > pgscan_ext: Reclaim activity from extraordinary/external
> > > requests. External as in: outside the allocation context.
> >
> > I imagine if the kernel is doing proactive reclaim on its own, we
> > might want a separate counter for that anyway to monitor what the
> > kernel is doing. So maybe pgscan_user sounds nice for now, but I also
> > like that the latter explicitly says "this is external to the
> > allocation context". But we can just go with pgscan_user and document
> > it properly.
>
> Yes, I think you're right. pgscan_user sounds good to me.
>
> > How would khugepaged fit in this story? Seems like it would be part of
> > pgscan_ext but not pgscan_user. I imagine we also don't want to
> > pollute proactive reclaim counters with khugepaged reclaim (or other
> > non-direct reclaim).
> >
> > Maybe pgscan_user and pgscan_kernel/pgscan_indirect for things like khugepaged?
> > The problem with pgscan_kernel/indirect is that if we add a proactive
> > reclaim kthread in the future it would technically fit there but we
> > would want a separate counter for it.
> >
> > I am honestly not sure where to put khugepaged. The reasons I don't
> > like a dedicated counter for khugepaged are:
> > - What if other kthreads like khugepaged start doing the same, do we
> > add one counter per-thread?
>
> It's unlikely there will be more.
>
> The reason khugepaged doesn't rely on kswapd is unique to THP
> allocations: they can require an exorbitant amount of work to
> assemble, but due to fragmentation those requests may fail
> permanently. We don't want to burden a shared facility like kswapd
> with large amounts of speculative work on behalf of what are (still*)
> cornercase requests.
>
> This isn't true for other allocations. We do have __GFP_NORETRY sites
> here and there that rather fall back early than put in the full amount
> of work; but overall we expect allocations to succeed - and kswapd to
> be able to balance for them!** - because the alternative tends to be
> OOMs, or drivers and workloads aborting on -ENOMEM.
>
> (* As we evolve the allocator and normalize huge page requests
> (folios), kswapd may also eventually balance for THPs again. IOW,
> it's more likely for this exception to disappear again than it is
> that we'll see more of them.)
>
> (** This is also why it's no big deal if other kthreads that rely on
> kswapd contribute to direct reclaim stats. First, it's highly
> error prone to determine on a case by case basis whether userspace
> could be waiting behind that direct reclaim - as Yang Shi's
> writeback example demonstrates. Second, if kswapd is overwhelmed,
> it's likely to impact userspace *anyway*! The benefit of this
> classification work is questionable.)

Thanks for the explanation :)

> > - What if we deprecate khugepaged (or such threads)? Seems more likely
> > than deprecating kswapd.
>
> If that happens, we can remove the counter again. The bar isn't as
> high for vmstat as it is for other ABI, and we've updated it plenty of
> times to reflect changes in the MM implementation.

Good to know! I thought we'd be stuck with it forever.

>
> > Looks like we want a stat that would group all of this reclaim coming
> > from non-direct kthreads, but would not include a future proactive
> > reclaim kthread.
>
> I think the desire to generalize overcomplicates things here in a way
> that isn't actually meaningful.
>
> Think of direct reclaim stats as a signal that either a) kswapd is
> broken or b) memory pressure is high enough to cause latencies in the
> class of requests that are of interest to userspace. This is true for
> all cases but khugepaged.

Agreed. I believe moving forward with pgscan_user and pgscan_khugepaged
style stats makes sense.

Thanks, Johannes!
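The patch below implements the split by keeping the kswapd, direct, and
khugepaged variants of each event adjacent in the vm_event_item enum, so
that one base item plus a small per-reclaimer offset selects the right
counter. As a rough standalone sketch of that indexing pattern - plain
userspace C with made-up MY_* names, and _Static_assert standing in for
the kernel's BUILD_BUG_ON - not the kernel code itself:

/*
 * Standalone illustration (not kernel code) of the counter-indexing
 * pattern used by reclaimer_offset() in the patch below.  Names are
 * made up; _Static_assert stands in for BUILD_BUG_ON.
 */
#include <stdio.h>

enum stat_item {
	MY_PGSTEAL_KSWAPD,	/* base: offset 0 */
	MY_PGSTEAL_DIRECT,	/* base + 1 */
	MY_PGSTEAL_KHUGEPAGED,	/* base + 2 */
	NR_STAT_ITEMS,
};

static unsigned long events[NR_STAT_ITEMS];

/* stand-ins for current_is_kswapd()/current_is_khugepaged() */
static int is_kswapd, is_khugepaged;

static int reclaimer_offset(void)
{
	/* layout checks: the build fails if the enum is ever reordered */
	_Static_assert(MY_PGSTEAL_DIRECT - MY_PGSTEAL_KSWAPD == 1, "");
	_Static_assert(MY_PGSTEAL_KHUGEPAGED - MY_PGSTEAL_KSWAPD == 2, "");

	if (is_kswapd)
		return 0;
	if (is_khugepaged)
		return 2;
	return 1;	/* direct reclaim */
}

int main(void)
{
	is_khugepaged = 1;
	/* one counting site covers all three reclaimer types */
	events[MY_PGSTEAL_KSWAPD + reclaimer_offset()] += 32;
	printf("khugepaged steals: %lu\n", events[MY_PGSTEAL_KHUGEPAGED]);
	return 0;
}

The compile-time checks mean a future reordering of the enum breaks the
build instead of silently misattributing events.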
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index dc254a3cb956..74cec76be9f2 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1488,12 +1488,18 @@ PAGE_SIZE multiple when read back.
 	  pgscan_direct (npn)
 		Amount of scanned pages directly (in an inactive LRU list)
 
+	  pgscan_khugepaged (npn)
+		Amount of scanned pages by khugepaged (in an inactive LRU list)
+
 	  pgsteal_kswapd (npn)
 		Amount of reclaimed pages by kswapd
 
 	  pgsteal_direct (npn)
 		Amount of reclaimed pages directly
 
+	  pgsteal_khugepaged (npn)
+		Amount of reclaimed pages by khugepaged
+
 	  pgfault (npn)
 		Total number of page faults incurred
 
diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index 70162d707caf..f68865e19b0b 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -15,6 +15,7 @@ extern void __khugepaged_exit(struct mm_struct *mm);
 extern void khugepaged_enter_vma(struct vm_area_struct *vma,
 				 unsigned long vm_flags);
 extern void khugepaged_min_free_kbytes_update(void);
+extern bool current_is_khugepaged(void);
 #ifdef CONFIG_SHMEM
 extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 				   bool install_pmd);
@@ -57,6 +58,11 @@ static inline int collapse_pte_mapped_thp(struct mm_struct *mm,
 static inline void khugepaged_min_free_kbytes_update(void)
 {
 }
+
+static inline bool current_is_khugepaged(void)
+{
+	return false;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #endif /* _LINUX_KHUGEPAGED_H */
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 3518dba1e02f..7f5d1caf5890 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -40,10 +40,13 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		PGREUSE,
 		PGSTEAL_KSWAPD,
 		PGSTEAL_DIRECT,
+		PGSTEAL_KHUGEPAGED,
 		PGDEMOTE_KSWAPD,
 		PGDEMOTE_DIRECT,
+		PGDEMOTE_KHUGEPAGED,
 		PGSCAN_KSWAPD,
 		PGSCAN_DIRECT,
+		PGSCAN_KHUGEPAGED,
 		PGSCAN_DIRECT_THROTTLE,
 		PGSCAN_ANON,
 		PGSCAN_FILE,
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 4734315f7940..36318ebbf50d 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2528,6 +2528,11 @@ void khugepaged_min_free_kbytes_update(void)
 	mutex_unlock(&khugepaged_mutex);
 }
 
+bool current_is_khugepaged(void)
+{
+	return kthread_func(current) == khugepaged;
+}
+
 static int madvise_collapse_errno(enum scan_result r)
 {
 	/*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2d8549ae1b30..a17a5cfa6a55 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -661,8 +661,10 @@ static const unsigned int memcg_vm_event_stat[] = {
 	PGPGOUT,
 	PGSCAN_KSWAPD,
 	PGSCAN_DIRECT,
+	PGSCAN_KHUGEPAGED,
 	PGSTEAL_KSWAPD,
 	PGSTEAL_DIRECT,
+	PGSTEAL_KHUGEPAGED,
 	PGFAULT,
 	PGMAJFAULT,
 	PGREFILL,
@@ -1574,10 +1576,12 @@ static void memory_stat_format(struct mem_cgroup *memcg, char *buf, int bufsize)
 	/* Accumulated memory events */
 	seq_buf_printf(&s, "pgscan %lu\n",
 		       memcg_events(memcg, PGSCAN_KSWAPD) +
-		       memcg_events(memcg, PGSCAN_DIRECT));
+		       memcg_events(memcg, PGSCAN_DIRECT) +
+		       memcg_events(memcg, PGSCAN_KHUGEPAGED));
 	seq_buf_printf(&s, "pgsteal %lu\n",
 		       memcg_events(memcg, PGSTEAL_KSWAPD) +
-		       memcg_events(memcg, PGSTEAL_DIRECT));
+		       memcg_events(memcg, PGSTEAL_DIRECT) +
+		       memcg_events(memcg, PGSTEAL_KHUGEPAGED));
 
 	for (i = 0; i < ARRAY_SIZE(memcg_vm_event_stat); i++) {
 		if (memcg_vm_event_stat[i] == PGPGIN ||
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 04d8b88e5216..8ceae125bbf7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -54,6 +54,7 @@
 #include <linux/shmem_fs.h>
 #include <linux/ctype.h>
 #include <linux/debugfs.h>
+#include <linux/khugepaged.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -1047,6 +1048,22 @@ void drop_slab(void)
 		drop_slab_node(nid);
 }
 
+static int reclaimer_offset(void)
+{
+	BUILD_BUG_ON(PGSTEAL_DIRECT - PGSTEAL_KSWAPD != 1);
+	BUILD_BUG_ON(PGSTEAL_KHUGEPAGED - PGSTEAL_KSWAPD != 2);
+	BUILD_BUG_ON(PGDEMOTE_DIRECT - PGDEMOTE_KSWAPD != 1);
+	BUILD_BUG_ON(PGDEMOTE_KHUGEPAGED - PGDEMOTE_KSWAPD != 2);
+	BUILD_BUG_ON(PGSCAN_DIRECT - PGSCAN_KSWAPD != 1);
+	BUILD_BUG_ON(PGSCAN_KHUGEPAGED - PGSCAN_KSWAPD != 2);
+
+	if (current_is_kswapd())
+		return 0;
+	if (current_is_khugepaged())
+		return 2;
+	return 1;
+}
+
 static inline int is_page_cache_freeable(struct folio *folio)
 {
 	/*
@@ -1599,10 +1616,7 @@ static unsigned int demote_folio_list(struct list_head *demote_folios,
 		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
 		      &nr_succeeded);
 
-	if (current_is_kswapd())
-		__count_vm_events(PGDEMOTE_KSWAPD, nr_succeeded);
-	else
-		__count_vm_events(PGDEMOTE_DIRECT, nr_succeeded);
+	__count_vm_events(PGDEMOTE_KSWAPD + reclaimer_offset(), nr_succeeded);
 
 	return nr_succeeded;
 }
@@ -2475,7 +2489,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 				     &nr_scanned, sc, lru);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
-	item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
+	item = PGSCAN_KSWAPD + reclaimer_offset();
 	if (!cgroup_reclaim(sc))
 		__count_vm_events(item, nr_scanned);
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
@@ -2492,7 +2506,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 	move_folios_to_lru(lruvec, &folio_list);
 
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
-	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
+	item = PGSTEAL_KSWAPD + reclaimer_offset();
 	if (!cgroup_reclaim(sc))
 		__count_vm_events(item, nr_reclaimed);
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
@@ -4857,7 +4871,7 @@ static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
 			break;
 	}
 
-	item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
+	item = PGSCAN_KSWAPD + reclaimer_offset();
 	if (!cgroup_reclaim(sc)) {
 		__count_vm_events(item, isolated);
 		__count_vm_events(PGREFILL, sorted);
@@ -5015,7 +5029,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
 	if (walk && walk->batched)
 		reset_batch_size(lruvec, walk);
 
-	item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
+	item = PGSTEAL_KSWAPD + reclaimer_offset();
 	if (!cgroup_reclaim(sc))
 		__count_vm_events(item, reclaimed);
 	__count_memcg_events(memcg, item, reclaimed);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b2371d745e00..1ea6a5ce1c41 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1271,10 +1271,13 @@ const char * const vmstat_text[] = {
 	"pgreuse",
 	"pgsteal_kswapd",
 	"pgsteal_direct",
+	"pgsteal_khugepaged",
 	"pgdemote_kswapd",
 	"pgdemote_direct",
+	"pgdemote_khugepaged",
 	"pgscan_kswapd",
 	"pgscan_direct",
+	"pgscan_khugepaged",
 	"pgscan_direct_throttle",
 	"pgscan_anon",
 	"pgscan_file",
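On a kernel carrying this patch, the new counters appear as separate rows
in /proc/vmstat (named per the vmstat_text[] hunk above), and the memcg
variants additionally feed the aggregate pgscan/pgsteal lines in
memory.stat per the memcontrol.c hunk. A hypothetical userspace
spot-check, assuming only the usual one "name value" pair per line format
of /proc/vmstat:

/*
 * Hypothetical spot-check: print the khugepaged reclaim counters from
 * /proc/vmstat.  Counter names are taken from the vmstat_text[] hunk
 * above; everything else is ordinary stdio.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f) {
		perror("/proc/vmstat");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, "pgscan_khugepaged ", 18) ||
		    !strncmp(line, "pgsteal_khugepaged ", 19) ||
		    !strncmp(line, "pgdemote_khugepaged ", 20))
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}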