Message ID | 20221024052841.3291983-1-shakeelb@google.com |
---|---|
State | New |
Headers |
Series | mm: convert mm's rss stats into percpu_counter |
Commit Message
Shakeel Butt
Oct. 24, 2022, 5:28 a.m. UTC
Currently mm_struct maintains rss_stats which are updated on the page
fault and unmapping codepaths. On the page fault codepath the updates
are cached per thread with a batch size of TASK_RSS_EVENTS_THRESH,
which is 64. The reason for the caching is performance: without it the
rss_stats updates may become a hotspot for multithreaded applications.

However this optimization comes at the cost of an error margin in the
rss stats. The rss_stats of applications with a large number of
threads can be very skewed. At worst the error margin is
(nr_threads * 64), and we have a lot of applications with 100s of
threads, so the error can be very high. Internally we had to reduce
TASK_RSS_EVENTS_THRESH to 32.

Recently we started seeing unbounded rss_stats errors for specific
applications which use TCP rx zerocopy. It seems the vm_insert_pages()
codepath does not sync rss_stats at all.

This patch converts the rss_stats into percpu_counter, which changes
the error margin from (nr_threads * 64) to approximately (nr_cpus ^ 2).
In addition, the conversion makes it possible to get accurate stats in
situations where accuracy is more important than the cpu cost, though
this patch does not make that tradeoff.
Signed-off-by: Shakeel Butt <shakeelb@google.com>
---
include/linux/mm.h | 26 ++++--------
include/linux/mm_types.h | 7 +---
include/linux/mm_types_task.h | 13 ------
include/linux/percpu_counter.h | 1 -
include/linux/sched.h | 3 --
include/trace/events/kmem.h | 8 ++--
kernel/fork.c | 16 +++++++-
mm/memory.c | 73 +++++-----------------------------
8 files changed, 40 insertions(+), 107 deletions(-)
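The heart of the conversion (condensed from the diff quoted further down the thread) is that each RSS counter becomes a percpu_counter, so readers take the cheap approximate value and writers batch per cpu rather than per thread:

    static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
    {
            /* cheap, approximate read: only the folded central count */
            return percpu_counter_read_positive(&mm->rss_stat[member]);
    }

    static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
    {
            /* per-cpu cached update, folded once the batch threshold is hit */
            percpu_counter_add(&mm->rss_stat[member], value);
            mm_trace_rss_stat(mm, member);
    }

The counters are allocated per mm in mm_init() with GFP_KERNEL_ACCOUNT and torn down in __mmdrop(), which is the per-process setup/teardown cost that Jan Kara's later report in this thread measures.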
Comments
On Mon, 24 Oct 2022 05:28:41 +0000 Shakeel Butt <shakeelb@google.com> wrote: > Currently mm_struct maintains rss_stats which are updated on page fault > and the unmapping codepaths. For page fault codepath the updates are > cached per thread with the batch of TASK_RSS_EVENTS_THRESH which is 64. > The reason for caching is performance for multithreaded applications > otherwise the rss_stats updates may become hotspot for such > applications. > > However this optimization comes with the cost of error margin in the rss > stats. The rss_stats for applications with large number of threads can > be very skewed. At worst the error margin is (nr_threads * 64) and we > have a lot of applications with 100s of threads, so the error margin can > be very high. Internally we had to reduce TASK_RSS_EVENTS_THRESH to 32. > > Recently we started seeing the unbounded errors for rss_stats for > specific applications which use TCP rx0cp. It seems like > vm_insert_pages() codepath does not sync rss_stats at all. > > This patch converts the rss_stats into percpu_counter to convert the > error margin from (nr_threads * 64) to approximately (nr_cpus ^ 2). Confused. The max error should be O(nr_cpus)? > However this conversion enable us to get the accurate stats for > situations where accuracy is more important than the cpu cost. Though > this patch does not make such tradeoffs. Curiousity. Can you expand on the final sentence here? > 8 files changed, 40 insertions(+), 107 deletions(-) There's that, too.
On Mon, Oct 24, 2022 at 03:30:22PM -0700, Andrew Morton wrote: > On Mon, 24 Oct 2022 05:28:41 +0000 Shakeel Butt <shakeelb@google.com> wrote: > > > Currently mm_struct maintains rss_stats which are updated on page fault > > and the unmapping codepaths. For page fault codepath the updates are > > cached per thread with the batch of TASK_RSS_EVENTS_THRESH which is 64. > > The reason for caching is performance for multithreaded applications > > otherwise the rss_stats updates may become hotspot for such > > applications. > > > > However this optimization comes with the cost of error margin in the rss > > stats. The rss_stats for applications with large number of threads can > > be very skewed. At worst the error margin is (nr_threads * 64) and we > > have a lot of applications with 100s of threads, so the error margin can > > be very high. Internally we had to reduce TASK_RSS_EVENTS_THRESH to 32. > > > > Recently we started seeing the unbounded errors for rss_stats for > > specific applications which use TCP rx0cp. It seems like > > vm_insert_pages() codepath does not sync rss_stats at all. > > > > This patch converts the rss_stats into percpu_counter to convert the > > error margin from (nr_threads * 64) to approximately (nr_cpus ^ 2). > > Confused. The max error should be O(nr_cpus)? > So, percpu_counter code sets the percpu batch in the following way: static int compute_batch_value(unsigned int cpu) { int nr = num_online_cpus(); percpu_counter_batch = max(32, nr*2); return 0; } This means each cpu can cache (nr_cpus*2) updates. Practically the number of cpus do not change and are usually much less than the number of threads of large applications, so error margin is lower. > > However this conversion enable us to get the accurate stats for > > situations where accuracy is more important than the cpu cost. Though > > this patch does not make such tradeoffs. > > Curiousity. Can you expand on the final sentence here? > Basically we can just use percpu_counter_add_local() for the updates and percpu_counter_sum() (or percpu_counter_sync() + percpu_counter_read) for the readers. At the moment the readers are either procfs interface, oom_killer and memory reclaim which I think are not performance critical and should be ok with slow read. However I think we can make that change in a separate patch. thanks, Shakeel
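In other words, with batch = max(32, 2 * nr_cpus) each cpu can hold up to roughly 2 * nr_cpus uncounted pages per counter, so the worst-case drift is about nr_cpus * 2 * nr_cpus, which is where the ~nr_cpus^2 figure in the changelog comes from. A rough sketch of the accuracy-over-cost mode described above, using hypothetical helper names that are not part of the posted patch:

    /* Hypothetical accuracy-first variants; the posted patch does not add these. */
    static inline void add_mm_counter_exact(struct mm_struct *mm, int member,
                                            long value)
    {
            /* keep the delta in the per-cpu cache, never fold it eagerly */
            percpu_counter_add_local(&mm->rss_stat[member], value);
    }

    static inline unsigned long get_mm_counter_exact(struct mm_struct *mm, int member)
    {
            /* fold every per-cpu delta: exact, but O(nr_cpus) and takes fbc->lock */
            return percpu_counter_sum_positive(&mm->rss_stat[member]);
    }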
Hi On 24.10.2022 07:28, Shakeel Butt wrote: > Currently mm_struct maintains rss_stats which are updated on page fault > and the unmapping codepaths. For page fault codepath the updates are > cached per thread with the batch of TASK_RSS_EVENTS_THRESH which is 64. > The reason for caching is performance for multithreaded applications > otherwise the rss_stats updates may become hotspot for such > applications. > > However this optimization comes with the cost of error margin in the rss > stats. The rss_stats for applications with large number of threads can > be very skewed. At worst the error margin is (nr_threads * 64) and we > have a lot of applications with 100s of threads, so the error margin can > be very high. Internally we had to reduce TASK_RSS_EVENTS_THRESH to 32. > > Recently we started seeing the unbounded errors for rss_stats for > specific applications which use TCP rx0cp. It seems like > vm_insert_pages() codepath does not sync rss_stats at all. > > This patch converts the rss_stats into percpu_counter to convert the > error margin from (nr_threads * 64) to approximately (nr_cpus ^ 2). > However this conversion enable us to get the accurate stats for > situations where accuracy is more important than the cpu cost. Though > this patch does not make such tradeoffs. > > Signed-off-by: Shakeel Butt <shakeelb@google.com> This patch landed recently in linux-next as commit d59f19a7a068 ("mm: convert mm's rss stats into percpu_counter"). Unfortunately it causes a regression on my test systems. I've noticed that it triggers a 'BUG: Bad rss-counter state' warning from time to time for random processes. This is somehow related to CPU hot-plug and/or system suspend/resume. The easiest way to reproduce this issue (although not always) on my test systems (ARM or ARM64 based) is to run the following commands: root@target:~# for i in /sys/devices/system/cpu/cpu[1-9]; do echo 0 >$i/online; BUG: Bad rss-counter state mm:f04c7160 type:MM_FILEPAGES val:1 BUG: Bad rss-counter state mm:50f1f502 type:MM_FILEPAGES val:2 BUG: Bad rss-counter state mm:50f1f502 type:MM_ANONPAGES val:15 BUG: Bad rss-counter state mm:63660fd0 type:MM_FILEPAGES val:2 BUG: Bad rss-counter state mm:63660fd0 type:MM_ANONPAGES val:15 Let me know if I can help debugging this somehow or testing a fix. > --- > include/linux/mm.h | 26 ++++-------- > include/linux/mm_types.h | 7 +--- > include/linux/mm_types_task.h | 13 ------ > include/linux/percpu_counter.h | 1 - > include/linux/sched.h | 3 -- > include/trace/events/kmem.h | 8 ++-- > kernel/fork.c | 16 +++++++- > mm/memory.c | 73 +++++----------------------------- > 8 files changed, 40 insertions(+), 107 deletions(-) > > ... Best regards
On Wed, Nov 02, 2022 at 10:09:57PM +0100, Marek Szyprowski wrote: > Hi > > On 24.10.2022 07:28, Shakeel Butt wrote: > > Currently mm_struct maintains rss_stats which are updated on page fault > > and the unmapping codepaths. For page fault codepath the updates are > > cached per thread with the batch of TASK_RSS_EVENTS_THRESH which is 64. > > The reason for caching is performance for multithreaded applications > > otherwise the rss_stats updates may become hotspot for such > > applications. > > > > However this optimization comes with the cost of error margin in the rss > > stats. The rss_stats for applications with large number of threads can > > be very skewed. At worst the error margin is (nr_threads * 64) and we > > have a lot of applications with 100s of threads, so the error margin can > > be very high. Internally we had to reduce TASK_RSS_EVENTS_THRESH to 32. > > > > Recently we started seeing the unbounded errors for rss_stats for > > specific applications which use TCP rx0cp. It seems like > > vm_insert_pages() codepath does not sync rss_stats at all. > > > > This patch converts the rss_stats into percpu_counter to convert the > > error margin from (nr_threads * 64) to approximately (nr_cpus ^ 2). > > However this conversion enable us to get the accurate stats for > > situations where accuracy is more important than the cpu cost. Though > > this patch does not make such tradeoffs. > > > > Signed-off-by: Shakeel Butt <shakeelb@google.com> > > This patch landed recently in linux-next as commit d59f19a7a068 ("mm: > convert mm's rss stats into percpu_counter"). Unfortunately it causes a > regression on my test systems. I've noticed that it triggers a 'BUG: Bad > rss-counter state' warning from time to time for random processes. This > is somehow related to CPU hot-plug and/or system suspend/resume. The > easiest way to reproduce this issue (although not always) on my test > systems (ARM or ARM64 based) is to run the following commands: > > root@target:~# for i in /sys/devices/system/cpu/cpu[1-9]; do echo 0 > >$i/online; > BUG: Bad rss-counter state mm:f04c7160 type:MM_FILEPAGES val:1 > BUG: Bad rss-counter state mm:50f1f502 type:MM_FILEPAGES val:2 > BUG: Bad rss-counter state mm:50f1f502 type:MM_ANONPAGES val:15 > BUG: Bad rss-counter state mm:63660fd0 type:MM_FILEPAGES val:2 > BUG: Bad rss-counter state mm:63660fd0 type:MM_ANONPAGES val:15 > > Let me know if I can help debugging this somehow or testing a fix. > Hi Marek, Thanks for the report. It seems like there is a race between for_each_online_cpu() in __percpu_counter_sum() and percpu_counter_cpu_dead()/cpu-offlining. Normally this race is fine for percpu_counter users but for check_mm() is not happy with this race. Can you please try the following patch: From: Shakeel Butt <shakeelb@google.com> Date: Thu, 3 Nov 2022 06:05:13 +0000 Subject: [PATCH] mm: percpu_counter: use race free percpu_counter sum interface percpu_counter_sum can race with cpu offlining. Add a new interface which does not race with it and use that for check_mm(). 
--- include/linux/percpu_counter.h | 11 +++++++++++ kernel/fork.c | 2 +- lib/percpu_counter.c | 24 ++++++++++++++++++------ 3 files changed, 30 insertions(+), 7 deletions(-) diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h index bde6c4c1f405..3070c1043acf 100644 --- a/include/linux/percpu_counter.h +++ b/include/linux/percpu_counter.h @@ -45,6 +45,7 @@ void percpu_counter_set(struct percpu_counter *fbc, s64 amount); void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 batch); s64 __percpu_counter_sum(struct percpu_counter *fbc); +s64 __percpu_counter_sum_all(struct percpu_counter *fbc); int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch); void percpu_counter_sync(struct percpu_counter *fbc); @@ -85,6 +86,11 @@ static inline s64 percpu_counter_sum(struct percpu_counter *fbc) return __percpu_counter_sum(fbc); } +static inline s64 percpu_counter_sum_all(struct percpu_counter *fbc) +{ + return __percpu_counter_sum_all(fbc); +} + static inline s64 percpu_counter_read(struct percpu_counter *fbc) { return fbc->count; @@ -193,6 +199,11 @@ static inline s64 percpu_counter_sum(struct percpu_counter *fbc) return percpu_counter_read(fbc); } +static inline s64 percpu_counter_sum_all(struct percpu_counter *fbc) +{ + return percpu_counter_read(fbc); +} + static inline bool percpu_counter_initialized(struct percpu_counter *fbc) { return true; diff --git a/kernel/fork.c b/kernel/fork.c index 9c32f593ef11..7d6f510cf397 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -756,7 +756,7 @@ static void check_mm(struct mm_struct *mm) "Please make sure 'struct resident_page_types[]' is updated as well"); for (i = 0; i < NR_MM_COUNTERS; i++) { - long x = percpu_counter_sum(&mm->rss_stat[i]); + long x = percpu_counter_sum_all(&mm->rss_stat[i]); if (unlikely(x)) pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld\n", diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c index ed610b75dc32..f26a1a5df399 100644 --- a/lib/percpu_counter.c +++ b/lib/percpu_counter.c @@ -117,11 +117,8 @@ void percpu_counter_sync(struct percpu_counter *fbc) } EXPORT_SYMBOL(percpu_counter_sync); -/* - * Add up all the per-cpu counts, return the result. This is a more accurate - * but much slower version of percpu_counter_read_positive() - */ -s64 __percpu_counter_sum(struct percpu_counter *fbc) +static s64 __percpu_counter_sum_mask(struct percpu_counter *fbc, + const struct cpumask *cpu_mask) { s64 ret; int cpu; @@ -129,15 +126,30 @@ s64 __percpu_counter_sum(struct percpu_counter *fbc) raw_spin_lock_irqsave(&fbc->lock, flags); ret = fbc->count; - for_each_online_cpu(cpu) { + for_each_cpu(cpu, cpu_mask) { s32 *pcount = per_cpu_ptr(fbc->counters, cpu); ret += *pcount; } raw_spin_unlock_irqrestore(&fbc->lock, flags); return ret; } + +/* + * Add up all the per-cpu counts, return the result. This is a more accurate + * but much slower version of percpu_counter_read_positive() + */ +s64 __percpu_counter_sum(struct percpu_counter *fbc) +{ + return __percpu_counter_sum_mask(fbc, cpu_online_mask); +} EXPORT_SYMBOL(__percpu_counter_sum); +s64 __percpu_counter_sum_all(struct percpu_counter *fbc) +{ + return __percpu_counter_sum_mask(fbc, cpu_possible_mask); +} +EXPORT_SYMBOL(__percpu_counter_sum_all); + int __percpu_counter_init(struct percpu_counter *fbc, s64 amount, gfp_t gfp, struct lock_class_key *key) {
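For reference, the window Shakeel describes comes from the percpu_counter hotplug callback: a CPU leaves cpu_online_mask before its per-cpu delta is folded into fbc->count, so a concurrent sum over the online mask can miss that contribution. An approximate, abridged rendering of that callback in lib/percpu_counter.c:

    /* Abridged sketch of lib/percpu_counter.c's CPU-dead handling
     * (CONFIG_HOTPLUG_CPU). The dead CPU's delta reaches fbc->count only
     * here, after the CPU has already dropped out of cpu_online_mask --
     * the window check_mm() was tripping over. */
    static int percpu_counter_cpu_dead(unsigned int cpu)
    {
            struct percpu_counter *fbc;

            spin_lock_irq(&percpu_counters_lock);
            list_for_each_entry(fbc, &percpu_counters, list) {
                    s32 *pcount;

                    raw_spin_lock(&fbc->lock);
                    pcount = per_cpu_ptr(fbc->counters, cpu);
                    fbc->count += *pcount;
                    *pcount = 0;
                    raw_spin_unlock(&fbc->lock);
            }
            spin_unlock_irq(&percpu_counters_lock);
            return 0;
    }

Summing over cpu_possible_mask, as the patch above does for check_mm(), reads the not-yet-folded per-cpu slot directly and so is immune to that window.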
Hi, On 03.11.2022 18:14, Shakeel Butt wrote: > On Wed, Nov 02, 2022 at 10:09:57PM +0100, Marek Szyprowski wrote: >> On 24.10.2022 07:28, Shakeel Butt wrote: >>> Currently mm_struct maintains rss_stats which are updated on page fault >>> and the unmapping codepaths. For page fault codepath the updates are >>> cached per thread with the batch of TASK_RSS_EVENTS_THRESH which is 64. >>> The reason for caching is performance for multithreaded applications >>> otherwise the rss_stats updates may become hotspot for such >>> applications. >>> >>> However this optimization comes with the cost of error margin in the rss >>> stats. The rss_stats for applications with large number of threads can >>> be very skewed. At worst the error margin is (nr_threads * 64) and we >>> have a lot of applications with 100s of threads, so the error margin can >>> be very high. Internally we had to reduce TASK_RSS_EVENTS_THRESH to 32. >>> >>> Recently we started seeing the unbounded errors for rss_stats for >>> specific applications which use TCP rx0cp. It seems like >>> vm_insert_pages() codepath does not sync rss_stats at all. >>> >>> This patch converts the rss_stats into percpu_counter to convert the >>> error margin from (nr_threads * 64) to approximately (nr_cpus ^ 2). >>> However this conversion enable us to get the accurate stats for >>> situations where accuracy is more important than the cpu cost. Though >>> this patch does not make such tradeoffs. >>> >>> Signed-off-by: Shakeel Butt <shakeelb@google.com> >> This patch landed recently in linux-next as commit d59f19a7a068 ("mm: >> convert mm's rss stats into percpu_counter"). Unfortunately it causes a >> regression on my test systems. I've noticed that it triggers a 'BUG: Bad >> rss-counter state' warning from time to time for random processes. This >> is somehow related to CPU hot-plug and/or system suspend/resume. The >> easiest way to reproduce this issue (although not always) on my test >> systems (ARM or ARM64 based) is to run the following commands: >> >> root@target:~# for i in /sys/devices/system/cpu/cpu[1-9]; do echo 0 >> >$i/online; >> BUG: Bad rss-counter state mm:f04c7160 type:MM_FILEPAGES val:1 >> BUG: Bad rss-counter state mm:50f1f502 type:MM_FILEPAGES val:2 >> BUG: Bad rss-counter state mm:50f1f502 type:MM_ANONPAGES val:15 >> BUG: Bad rss-counter state mm:63660fd0 type:MM_FILEPAGES val:2 >> BUG: Bad rss-counter state mm:63660fd0 type:MM_ANONPAGES val:15 >> >> Let me know if I can help debugging this somehow or testing a fix. >> > Hi Marek, > > Thanks for the report. It seems like there is a race between > for_each_online_cpu() in __percpu_counter_sum() and > percpu_counter_cpu_dead()/cpu-offlining. Normally this race is fine for > percpu_counter users but for check_mm() is not happy with this race. Can > you please try the following patch: > > > From: Shakeel Butt <shakeelb@google.com> > Date: Thu, 3 Nov 2022 06:05:13 +0000 > Subject: [PATCH] mm: percpu_counter: use race free percpu_counter sum > interface > > percpu_counter_sum can race with cpu offlining. Add a new interface > which does not race with it and use that for check_mm(). > --- > include/linux/percpu_counter.h | 11 +++++++++++ > kernel/fork.c | 2 +- > lib/percpu_counter.c | 24 ++++++++++++++++++------ > 3 files changed, 30 insertions(+), 7 deletions(-) Yes, this seems to fix the issue I've reported. 
Feel free to add: Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> > diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h > index bde6c4c1f405..3070c1043acf 100644 > --- a/include/linux/percpu_counter.h > +++ b/include/linux/percpu_counter.h > @@ -45,6 +45,7 @@ void percpu_counter_set(struct percpu_counter *fbc, s64 amount); > void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, > s32 batch); > s64 __percpu_counter_sum(struct percpu_counter *fbc); > +s64 __percpu_counter_sum_all(struct percpu_counter *fbc); > int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch); > void percpu_counter_sync(struct percpu_counter *fbc); > > @@ -85,6 +86,11 @@ static inline s64 percpu_counter_sum(struct percpu_counter *fbc) > return __percpu_counter_sum(fbc); > } > > +static inline s64 percpu_counter_sum_all(struct percpu_counter *fbc) > +{ > + return __percpu_counter_sum_all(fbc); > +} > + > static inline s64 percpu_counter_read(struct percpu_counter *fbc) > { > return fbc->count; > @@ -193,6 +199,11 @@ static inline s64 percpu_counter_sum(struct percpu_counter *fbc) > return percpu_counter_read(fbc); > } > > +static inline s64 percpu_counter_sum_all(struct percpu_counter *fbc) > +{ > + return percpu_counter_read(fbc); > +} > + > static inline bool percpu_counter_initialized(struct percpu_counter *fbc) > { > return true; > diff --git a/kernel/fork.c b/kernel/fork.c > index 9c32f593ef11..7d6f510cf397 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -756,7 +756,7 @@ static void check_mm(struct mm_struct *mm) > "Please make sure 'struct resident_page_types[]' is updated as well"); > > for (i = 0; i < NR_MM_COUNTERS; i++) { > - long x = percpu_counter_sum(&mm->rss_stat[i]); > + long x = percpu_counter_sum_all(&mm->rss_stat[i]); > > if (unlikely(x)) > pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld\n", > diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c > index ed610b75dc32..f26a1a5df399 100644 > --- a/lib/percpu_counter.c > +++ b/lib/percpu_counter.c > @@ -117,11 +117,8 @@ void percpu_counter_sync(struct percpu_counter *fbc) > } > EXPORT_SYMBOL(percpu_counter_sync); > > -/* > - * Add up all the per-cpu counts, return the result. This is a more accurate > - * but much slower version of percpu_counter_read_positive() > - */ > -s64 __percpu_counter_sum(struct percpu_counter *fbc) > +static s64 __percpu_counter_sum_mask(struct percpu_counter *fbc, > + const struct cpumask *cpu_mask) > { > s64 ret; > int cpu; > @@ -129,15 +126,30 @@ s64 __percpu_counter_sum(struct percpu_counter *fbc) > > raw_spin_lock_irqsave(&fbc->lock, flags); > ret = fbc->count; > - for_each_online_cpu(cpu) { > + for_each_cpu(cpu, cpu_mask) { > s32 *pcount = per_cpu_ptr(fbc->counters, cpu); > ret += *pcount; > } > raw_spin_unlock_irqrestore(&fbc->lock, flags); > return ret; > } > + > +/* > + * Add up all the per-cpu counts, return the result. 
This is a more accurate > + * but much slower version of percpu_counter_read_positive() > + */ > +s64 __percpu_counter_sum(struct percpu_counter *fbc) > +{ > + return __percpu_counter_sum_mask(fbc, cpu_online_mask); > +} > EXPORT_SYMBOL(__percpu_counter_sum); > > +s64 __percpu_counter_sum_all(struct percpu_counter *fbc) > +{ > + return __percpu_counter_sum_mask(fbc, cpu_possible_mask); > +} > +EXPORT_SYMBOL(__percpu_counter_sum_all); > + > int __percpu_counter_init(struct percpu_counter *fbc, s64 amount, gfp_t gfp, > struct lock_class_key *key) > { Best regards
On Thu, Nov 3, 2022 at 4:02 PM Marek Szyprowski <m.szyprowski@samsung.com> wrote: > > Hi, > > On 03.11.2022 18:14, Shakeel Butt wrote: > > On Wed, Nov 02, 2022 at 10:09:57PM +0100, Marek Szyprowski wrote: > >> On 24.10.2022 07:28, Shakeel Butt wrote: > >>> Currently mm_struct maintains rss_stats which are updated on page fault > >>> and the unmapping codepaths. For page fault codepath the updates are > >>> cached per thread with the batch of TASK_RSS_EVENTS_THRESH which is 64. > >>> The reason for caching is performance for multithreaded applications > >>> otherwise the rss_stats updates may become hotspot for such > >>> applications. > >>> > >>> However this optimization comes with the cost of error margin in the rss > >>> stats. The rss_stats for applications with large number of threads can > >>> be very skewed. At worst the error margin is (nr_threads * 64) and we > >>> have a lot of applications with 100s of threads, so the error margin can > >>> be very high. Internally we had to reduce TASK_RSS_EVENTS_THRESH to 32. > >>> > >>> Recently we started seeing the unbounded errors for rss_stats for > >>> specific applications which use TCP rx0cp. It seems like > >>> vm_insert_pages() codepath does not sync rss_stats at all. > >>> > >>> This patch converts the rss_stats into percpu_counter to convert the > >>> error margin from (nr_threads * 64) to approximately (nr_cpus ^ 2). > >>> However this conversion enable us to get the accurate stats for > >>> situations where accuracy is more important than the cpu cost. Though > >>> this patch does not make such tradeoffs. > >>> > >>> Signed-off-by: Shakeel Butt <shakeelb@google.com> > >> This patch landed recently in linux-next as commit d59f19a7a068 ("mm: > >> convert mm's rss stats into percpu_counter"). Unfortunately it causes a > >> regression on my test systems. I've noticed that it triggers a 'BUG: Bad > >> rss-counter state' warning from time to time for random processes. This > >> is somehow related to CPU hot-plug and/or system suspend/resume. The > >> easiest way to reproduce this issue (although not always) on my test > >> systems (ARM or ARM64 based) is to run the following commands: > >> > >> root@target:~# for i in /sys/devices/system/cpu/cpu[1-9]; do echo 0 > >> >$i/online; > >> BUG: Bad rss-counter state mm:f04c7160 type:MM_FILEPAGES val:1 > >> BUG: Bad rss-counter state mm:50f1f502 type:MM_FILEPAGES val:2 > >> BUG: Bad rss-counter state mm:50f1f502 type:MM_ANONPAGES val:15 > >> BUG: Bad rss-counter state mm:63660fd0 type:MM_FILEPAGES val:2 > >> BUG: Bad rss-counter state mm:63660fd0 type:MM_ANONPAGES val:15 > >> > >> Let me know if I can help debugging this somehow or testing a fix. > >> > > Hi Marek, > > > > Thanks for the report. It seems like there is a race between > > for_each_online_cpu() in __percpu_counter_sum() and > > percpu_counter_cpu_dead()/cpu-offlining. Normally this race is fine for > > percpu_counter users but for check_mm() is not happy with this race. Can > > you please try the following patch: > > > > > > From: Shakeel Butt <shakeelb@google.com> > > Date: Thu, 3 Nov 2022 06:05:13 +0000 > > Subject: [PATCH] mm: percpu_counter: use race free percpu_counter sum > > interface > > > > percpu_counter_sum can race with cpu offlining. Add a new interface > > which does not race with it and use that for check_mm(). 
> > --- > > include/linux/percpu_counter.h | 11 +++++++++++ > > kernel/fork.c | 2 +- > > lib/percpu_counter.c | 24 ++++++++++++++++++------ > > 3 files changed, 30 insertions(+), 7 deletions(-) > > > Yes, this seems to fix the issue I've reported. Feel free to add: > > Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> > > Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> > > Thanks a lot Marek. I will send out a formal patch later with your reported-by and tested-by tags.
On Thu, 3 Nov 2022 17:14:07 +0000 Shakeel Butt <shakeelb@google.com> wrote: > > ... > > Thanks for the report. It seems like there is a race between > for_each_online_cpu() in __percpu_counter_sum() and > percpu_counter_cpu_dead()/cpu-offlining. Normally this race is fine for > percpu_counter users but for check_mm() is not happy with this race. Can > you please try the following patch: percpu-counters supposedly avoid such races via the hotplup notifier. So can you please fully describe the race and let's see if it can be fixed at the percpu_counter level? > > From: Shakeel Butt <shakeelb@google.com> > Date: Thu, 3 Nov 2022 06:05:13 +0000 > Subject: [PATCH] mm: percpu_counter: use race free percpu_counter sum > interface > > percpu_counter_sum can race with cpu offlining. Add a new interface > which does not race with it and use that for check_mm(). I'll grab this version for now, as others might be seeing this issue. > --- > include/linux/percpu_counter.h | 11 +++++++++++ > kernel/fork.c | 2 +- > lib/percpu_counter.c | 24 ++++++++++++++++++------ > 3 files changed, 30 insertions(+), 7 deletions(-) > > diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h > index bde6c4c1f405..3070c1043acf 100644 > --- a/include/linux/percpu_counter.h > +++ b/include/linux/percpu_counter.h > @@ -45,6 +45,7 @@ void percpu_counter_set(struct percpu_counter *fbc, s64 amount); > void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, > s32 batch); > s64 __percpu_counter_sum(struct percpu_counter *fbc); > +s64 __percpu_counter_sum_all(struct percpu_counter *fbc); > int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch); > void percpu_counter_sync(struct percpu_counter *fbc); > > @@ -85,6 +86,11 @@ static inline s64 percpu_counter_sum(struct percpu_counter *fbc) > return __percpu_counter_sum(fbc); > } > > +static inline s64 percpu_counter_sum_all(struct percpu_counter *fbc) > +{ > + return __percpu_counter_sum_all(fbc); > +} We haven't been good about documenting these interfaces. Can we please start now? ;) > > ... > > + > +/* > + * Add up all the per-cpu counts, return the result. This is a more accurate > + * but much slower version of percpu_counter_read_positive() > + */ > +s64 __percpu_counter_sum(struct percpu_counter *fbc) > +{ > + return __percpu_counter_sum_mask(fbc, cpu_online_mask); > +} > EXPORT_SYMBOL(__percpu_counter_sum); > > +s64 __percpu_counter_sum_all(struct percpu_counter *fbc) > +{ > + return __percpu_counter_sum_mask(fbc, cpu_possible_mask); > +} > +EXPORT_SYMBOL(__percpu_counter_sum_all); Probably here is a good place to document it. Is there any point in having the percpu_counter_sum_all()->__percpu_counter_sum_all() inlined wrapper? Why not name this percpu_counter_sum_all() directly? > int __percpu_counter_init(struct percpu_counter *fbc, s64 amount, gfp_t gfp, > struct lock_class_key *key) > { > -- > 2.38.1.431.g37b22c650d-goog
On Fri, Nov 4, 2022 at 4:05 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > On Thu, 3 Nov 2022 17:14:07 +0000 Shakeel Butt <shakeelb@google.com> wrote: > > > > > ... > > > > Thanks for the report. It seems like there is a race between > > for_each_online_cpu() in __percpu_counter_sum() and > > percpu_counter_cpu_dead()/cpu-offlining. Normally this race is fine for > > percpu_counter users but for check_mm() is not happy with this race. Can > > you please try the following patch: > > percpu-counters supposedly avoid such races via the hotplup notifier. > So can you please fully describe the race and let's see if it can be > fixed at the percpu_counter level? > Yes, I am writing a more detailed commit message explaining the race and why it is not really an issue for current users. > > > > From: Shakeel Butt <shakeelb@google.com> > > Date: Thu, 3 Nov 2022 06:05:13 +0000 > > Subject: [PATCH] mm: percpu_counter: use race free percpu_counter sum > > interface > > > > percpu_counter_sum can race with cpu offlining. Add a new interface > > which does not race with it and use that for check_mm(). > > I'll grab this version for now, as others might be seeing this issue. > Thanks. > > > --- > > include/linux/percpu_counter.h | 11 +++++++++++ > > kernel/fork.c | 2 +- > > lib/percpu_counter.c | 24 ++++++++++++++++++------ > > 3 files changed, 30 insertions(+), 7 deletions(-) > > > > diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h > > index bde6c4c1f405..3070c1043acf 100644 > > --- a/include/linux/percpu_counter.h > > +++ b/include/linux/percpu_counter.h > > @@ -45,6 +45,7 @@ void percpu_counter_set(struct percpu_counter *fbc, s64 amount); > > void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, > > s32 batch); > > s64 __percpu_counter_sum(struct percpu_counter *fbc); > > +s64 __percpu_counter_sum_all(struct percpu_counter *fbc); > > int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch); > > void percpu_counter_sync(struct percpu_counter *fbc); > > > > @@ -85,6 +86,11 @@ static inline s64 percpu_counter_sum(struct percpu_counter *fbc) > > return __percpu_counter_sum(fbc); > > } > > > > +static inline s64 percpu_counter_sum_all(struct percpu_counter *fbc) > > +{ > > + return __percpu_counter_sum_all(fbc); > > +} > > We haven't been good about documenting these interfaces. Can we please > start now? ;) > Yup will do. > > > > ... > > > > + > > +/* > > + * Add up all the per-cpu counts, return the result. This is a more accurate > > + * but much slower version of percpu_counter_read_positive() > > + */ > > +s64 __percpu_counter_sum(struct percpu_counter *fbc) > > +{ > > + return __percpu_counter_sum_mask(fbc, cpu_online_mask); > > +} > > EXPORT_SYMBOL(__percpu_counter_sum); > > > > +s64 __percpu_counter_sum_all(struct percpu_counter *fbc) > > +{ > > + return __percpu_counter_sum_mask(fbc, cpu_possible_mask); > > +} > > +EXPORT_SYMBOL(__percpu_counter_sum_all); > > Probably here is a good place to document it. > > Is there any point in having the > percpu_counter_sum_all()->__percpu_counter_sum_all() inlined wrapper? > Why not name this percpu_counter_sum_all() directly? > Ack. thanks, Shakeel
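The kernel-doc Andrew is asking for might read roughly as follows (illustrative wording only, not taken from any posted patch):

    /**
     * percpu_counter_sum_all - add up the counter over all possible CPUs
     * @fbc: counter to sum
     *
     * Like percpu_counter_sum(), but walks cpu_possible_mask instead of
     * cpu_online_mask, so it cannot race with CPU offlining folding a dead
     * CPU's delta into fbc->count.  Intended for teardown paths such as
     * check_mm() where the total is expected to be exactly zero.
     */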
On Mon 24-10-22 05:28:41, Shakeel Butt wrote: > Currently mm_struct maintains rss_stats which are updated on page fault > and the unmapping codepaths. For page fault codepath the updates are > cached per thread with the batch of TASK_RSS_EVENTS_THRESH which is 64. > The reason for caching is performance for multithreaded applications > otherwise the rss_stats updates may become hotspot for such > applications. > > However this optimization comes with the cost of error margin in the rss > stats. The rss_stats for applications with large number of threads can > be very skewed. At worst the error margin is (nr_threads * 64) and we > have a lot of applications with 100s of threads, so the error margin can > be very high. Internally we had to reduce TASK_RSS_EVENTS_THRESH to 32. > > Recently we started seeing the unbounded errors for rss_stats for > specific applications which use TCP rx0cp. It seems like > vm_insert_pages() codepath does not sync rss_stats at all. > > This patch converts the rss_stats into percpu_counter to convert the > error margin from (nr_threads * 64) to approximately (nr_cpus ^ 2). > However this conversion enable us to get the accurate stats for > situations where accuracy is more important than the cpu cost. Though > this patch does not make such tradeoffs. > > Signed-off-by: Shakeel Butt <shakeelb@google.com> Somewhat late to the game but our performance testing grid has noticed this commit causes a performance regression on shell-heavy workloads. For example running 'make test' in git sources on our test machine with 192 CPUs takes about 4% longer, system time is increased by about 9%: before (9cd6ffa6025) after (f1a7941243c1) Amean User 471.12 * 0.30%* 481.77 * -1.96%* Amean System 244.47 * 0.90%* 269.13 * -9.09%* Amean Elapsed 709.22 * 0.45%* 742.27 * -4.19%* Amean CPU 100.00 ( 0.20%) 101.00 * -0.80%* Essentially this workload spawns in sequence a lot of short-lived tasks and the task startup + teardown cost is what this patch increases. To demonstrate this more clearly, I've written trivial (and somewhat stupid) benchmark shell_bench.sh: for (( i = 0; i < 20000; i++ )); do /bin/true done And when run like: numactl -C 1 ./shell_bench.sh (I've forced physical CPU binding to avoid task migrating over the machine and cpu frequency scaling interfering which makes the numbers much more noisy) I get the following elapsed times: 9cd6ffa6025 f1a7941243c1 Avg 6.807429 7.631571 Stddev 0.021797 0.016483 So some 12% regression in elapsed time. Just to be sure I've verified that per-cpu allocator patch [1] does not improve these numbers in any significant way. Where do we go from here? I think in principle the problem could be fixed by being clever and when the task has only a single thread, we don't bother with allocating pcpu counter (and summing it at the end) and just account directly in mm_struct. When the second thread is spawned, we bite the bullet, allocate pcpu counter and start with more scalable accounting. These shortlived tasks in shell workloads or similar don't spawn any threads so this should fix the regression. But this is obviously easier said than done... 
Honza [1] https://lore.kernel.org/all/20230606125404.95256-1-yu.ma@intel.com/ > --- > include/linux/mm.h | 26 ++++-------- > include/linux/mm_types.h | 7 +--- > include/linux/mm_types_task.h | 13 ------ > include/linux/percpu_counter.h | 1 - > include/linux/sched.h | 3 -- > include/trace/events/kmem.h | 8 ++-- > kernel/fork.c | 16 +++++++- > mm/memory.c | 73 +++++----------------------------- > 8 files changed, 40 insertions(+), 107 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 9dec25c7d631..a8a9c3a20534 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -2000,40 +2000,30 @@ static inline bool get_user_page_fast_only(unsigned long addr, > */ > static inline unsigned long get_mm_counter(struct mm_struct *mm, int member) > { > - long val = atomic_long_read(&mm->rss_stat.count[member]); > - > -#ifdef SPLIT_RSS_COUNTING > - /* > - * counter is updated in asynchronous manner and may go to minus. > - * But it's never be expected number for users. > - */ > - if (val < 0) > - val = 0; > -#endif > - return (unsigned long)val; > + return percpu_counter_read_positive(&mm->rss_stat[member]); > } > > -void mm_trace_rss_stat(struct mm_struct *mm, int member, long count); > +void mm_trace_rss_stat(struct mm_struct *mm, int member); > > static inline void add_mm_counter(struct mm_struct *mm, int member, long value) > { > - long count = atomic_long_add_return(value, &mm->rss_stat.count[member]); > + percpu_counter_add(&mm->rss_stat[member], value); > > - mm_trace_rss_stat(mm, member, count); > + mm_trace_rss_stat(mm, member); > } > > static inline void inc_mm_counter(struct mm_struct *mm, int member) > { > - long count = atomic_long_inc_return(&mm->rss_stat.count[member]); > + percpu_counter_inc(&mm->rss_stat[member]); > > - mm_trace_rss_stat(mm, member, count); > + mm_trace_rss_stat(mm, member); > } > > static inline void dec_mm_counter(struct mm_struct *mm, int member) > { > - long count = atomic_long_dec_return(&mm->rss_stat.count[member]); > + percpu_counter_dec(&mm->rss_stat[member]); > > - mm_trace_rss_stat(mm, member, count); > + mm_trace_rss_stat(mm, member); > } > > /* Optimized variant when page is already known not to be PageAnon */ > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index a82f06ab18a1..834022721bc6 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -18,6 +18,7 @@ > #include <linux/page-flags-layout.h> > #include <linux/workqueue.h> > #include <linux/seqlock.h> > +#include <linux/percpu_counter.h> > > #include <asm/mmu.h> > > @@ -626,11 +627,7 @@ struct mm_struct { > > unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */ > > - /* > - * Special counters, in some configurations protected by the > - * page_table_lock, in other configurations by being atomic. 
> - */ > - struct mm_rss_stat rss_stat; > + struct percpu_counter rss_stat[NR_MM_COUNTERS]; > > struct linux_binfmt *binfmt; > > diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h > index 0bb4b6da9993..5414b5c6a103 100644 > --- a/include/linux/mm_types_task.h > +++ b/include/linux/mm_types_task.h > @@ -36,19 +36,6 @@ enum { > NR_MM_COUNTERS > }; > > -#if USE_SPLIT_PTE_PTLOCKS && defined(CONFIG_MMU) > -#define SPLIT_RSS_COUNTING > -/* per-thread cached information, */ > -struct task_rss_stat { > - int events; /* for synchronization threshold */ > - int count[NR_MM_COUNTERS]; > -}; > -#endif /* USE_SPLIT_PTE_PTLOCKS */ > - > -struct mm_rss_stat { > - atomic_long_t count[NR_MM_COUNTERS]; > -}; > - > struct page_frag { > struct page *page; > #if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536) > diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h > index 8ed5fba6d156..bde6c4c1f405 100644 > --- a/include/linux/percpu_counter.h > +++ b/include/linux/percpu_counter.h > @@ -13,7 +13,6 @@ > #include <linux/threads.h> > #include <linux/percpu.h> > #include <linux/types.h> > -#include <linux/gfp.h> > > /* percpu_counter batch for local add or sub */ > #define PERCPU_COUNTER_LOCAL_BATCH INT_MAX > diff --git a/include/linux/sched.h b/include/linux/sched.h > index ffb6eb55cd13..079d299fa465 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -870,9 +870,6 @@ struct task_struct { > struct mm_struct *mm; > struct mm_struct *active_mm; > > -#ifdef SPLIT_RSS_COUNTING > - struct task_rss_stat rss_stat; > -#endif > int exit_state; > int exit_code; > int exit_signal; > diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h > index 243073cfc29d..58688768ef0f 100644 > --- a/include/trace/events/kmem.h > +++ b/include/trace/events/kmem.h > @@ -346,10 +346,9 @@ TRACE_MM_PAGES > TRACE_EVENT(rss_stat, > > TP_PROTO(struct mm_struct *mm, > - int member, > - long count), > + int member), > > - TP_ARGS(mm, member, count), > + TP_ARGS(mm, member), > > TP_STRUCT__entry( > __field(unsigned int, mm_id) > @@ -362,7 +361,8 @@ TRACE_EVENT(rss_stat, > __entry->mm_id = mm_ptr_to_hash(mm); > __entry->curr = !!(current->mm == mm); > __entry->member = member; > - __entry->size = (count << PAGE_SHIFT); > + __entry->size = (percpu_counter_sum_positive(&mm->rss_stat[member]) > + << PAGE_SHIFT); > ), > > TP_printk("mm_id=%u curr=%d type=%s size=%ldB", > diff --git a/kernel/fork.c b/kernel/fork.c > index cfb09ca1b1bc..f56ad06240e1 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -756,7 +756,7 @@ static void check_mm(struct mm_struct *mm) > "Please make sure 'struct resident_page_types[]' is updated as well"); > > for (i = 0; i < NR_MM_COUNTERS; i++) { > - long x = atomic_long_read(&mm->rss_stat.count[i]); > + long x = percpu_counter_sum(&mm->rss_stat[i]); > > if (unlikely(x)) > pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld\n", > @@ -782,6 +782,8 @@ static void check_mm(struct mm_struct *mm) > */ > void __mmdrop(struct mm_struct *mm) > { > + int i; > + > BUG_ON(mm == &init_mm); > WARN_ON_ONCE(mm == current->mm); > WARN_ON_ONCE(mm == current->active_mm); > @@ -791,6 +793,9 @@ void __mmdrop(struct mm_struct *mm) > check_mm(mm); > put_user_ns(mm->user_ns); > mm_pasid_drop(mm); > + > + for (i = 0; i < NR_MM_COUNTERS; i++) > + percpu_counter_destroy(&mm->rss_stat[i]); > free_mm(mm); > } > EXPORT_SYMBOL_GPL(__mmdrop); > @@ -1110,6 +1115,8 @@ static void mm_init_uprobes_state(struct mm_struct *mm) > static struct mm_struct *mm_init(struct 
mm_struct *mm, struct task_struct *p, > struct user_namespace *user_ns) > { > + int i; > + > mt_init_flags(&mm->mm_mt, MM_MT_FLAGS); > mt_set_external_lock(&mm->mm_mt, &mm->mmap_lock); > atomic_set(&mm->mm_users, 1); > @@ -1151,10 +1158,17 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, > if (init_new_context(p, mm)) > goto fail_nocontext; > > + for (i = 0; i < NR_MM_COUNTERS; i++) > + if (percpu_counter_init(&mm->rss_stat[i], 0, GFP_KERNEL_ACCOUNT)) > + goto fail_pcpu; > + > mm->user_ns = get_user_ns(user_ns); > lru_gen_init_mm(mm); > return mm; > > +fail_pcpu: > + while (i > 0) > + percpu_counter_destroy(&mm->rss_stat[--i]); > fail_nocontext: > mm_free_pgd(mm); > fail_nopgd: > diff --git a/mm/memory.c b/mm/memory.c > index 8e72f703ed99..fea8d737e8c0 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -162,58 +162,11 @@ static int __init init_zero_pfn(void) > } > early_initcall(init_zero_pfn); > > -void mm_trace_rss_stat(struct mm_struct *mm, int member, long count) > +void mm_trace_rss_stat(struct mm_struct *mm, int member) > { > - trace_rss_stat(mm, member, count); > + trace_rss_stat(mm, member); > } > > -#if defined(SPLIT_RSS_COUNTING) > - > -void sync_mm_rss(struct mm_struct *mm) > -{ > - int i; > - > - for (i = 0; i < NR_MM_COUNTERS; i++) { > - if (current->rss_stat.count[i]) { > - add_mm_counter(mm, i, current->rss_stat.count[i]); > - current->rss_stat.count[i] = 0; > - } > - } > - current->rss_stat.events = 0; > -} > - > -static void add_mm_counter_fast(struct mm_struct *mm, int member, int val) > -{ > - struct task_struct *task = current; > - > - if (likely(task->mm == mm)) > - task->rss_stat.count[member] += val; > - else > - add_mm_counter(mm, member, val); > -} > -#define inc_mm_counter_fast(mm, member) add_mm_counter_fast(mm, member, 1) > -#define dec_mm_counter_fast(mm, member) add_mm_counter_fast(mm, member, -1) > - > -/* sync counter once per 64 page faults */ > -#define TASK_RSS_EVENTS_THRESH (64) > -static void check_sync_rss_stat(struct task_struct *task) > -{ > - if (unlikely(task != current)) > - return; > - if (unlikely(task->rss_stat.events++ > TASK_RSS_EVENTS_THRESH)) > - sync_mm_rss(task->mm); > -} > -#else /* SPLIT_RSS_COUNTING */ > - > -#define inc_mm_counter_fast(mm, member) inc_mm_counter(mm, member) > -#define dec_mm_counter_fast(mm, member) dec_mm_counter(mm, member) > - > -static void check_sync_rss_stat(struct task_struct *task) > -{ > -} > - > -#endif /* SPLIT_RSS_COUNTING */ > - > /* > * Note: this doesn't free the actual pages themselves. That > * has been handled earlier when unmapping all the memory regions. > @@ -1860,7 +1813,7 @@ static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte, > return -EBUSY; > /* Ok, finally just insert the thing.. 
*/ > get_page(page); > - inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page)); > + inc_mm_counter(vma->vm_mm, mm_counter_file(page)); > page_add_file_rmap(page, vma, false); > set_pte_at(vma->vm_mm, addr, pte, mk_pte(page, prot)); > return 0; > @@ -3156,12 +3109,11 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) > if (likely(pte_same(*vmf->pte, vmf->orig_pte))) { > if (old_page) { > if (!PageAnon(old_page)) { > - dec_mm_counter_fast(mm, > - mm_counter_file(old_page)); > - inc_mm_counter_fast(mm, MM_ANONPAGES); > + dec_mm_counter(mm, mm_counter_file(old_page)); > + inc_mm_counter(mm, MM_ANONPAGES); > } > } else { > - inc_mm_counter_fast(mm, MM_ANONPAGES); > + inc_mm_counter(mm, MM_ANONPAGES); > } > flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte)); > entry = mk_pte(new_page, vma->vm_page_prot); > @@ -3968,8 +3920,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > if (should_try_to_free_swap(folio, vma, vmf->flags)) > folio_free_swap(folio); > > - inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES); > - dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS); > + inc_mm_counter(vma->vm_mm, MM_ANONPAGES); > + dec_mm_counter(vma->vm_mm, MM_SWAPENTS); > pte = mk_pte(page, vma->vm_page_prot); > > /* > @@ -4148,7 +4100,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) > return handle_userfault(vmf, VM_UFFD_MISSING); > } > > - inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES); > + inc_mm_counter(vma->vm_mm, MM_ANONPAGES); > page_add_new_anon_rmap(page, vma, vmf->address); > lru_cache_add_inactive_or_unevictable(page, vma); > setpte: > @@ -4338,11 +4290,11 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr) > entry = pte_mkuffd_wp(pte_wrprotect(entry)); > /* copy-on-write page */ > if (write && !(vma->vm_flags & VM_SHARED)) { > - inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES); > + inc_mm_counter(vma->vm_mm, MM_ANONPAGES); > page_add_new_anon_rmap(page, vma, addr); > lru_cache_add_inactive_or_unevictable(page, vma); > } else { > - inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page)); > + inc_mm_counter(vma->vm_mm, mm_counter_file(page)); > page_add_file_rmap(page, vma, false); > } > set_pte_at(vma->vm_mm, addr, vmf->pte, entry); > @@ -5194,9 +5146,6 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address, > count_vm_event(PGFAULT); > count_memcg_event_mm(vma->vm_mm, PGFAULT); > > - /* do counter updates before entering really critical section. */ > - check_sync_rss_stat(current); > - > if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE, > flags & FAULT_FLAG_INSTRUCTION, > flags & FAULT_FLAG_REMOTE)) > -- > 2.38.0.135.g90850a2211-goog >
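Jan's single-threaded-mm idea is sketched very loosely below; the field and flag names are hypothetical, and making the plain-to-percpu switchover safe against concurrent updaters is exactly the "easier said than done" part:

    /* Hypothetical: plain atomics while the mm has a single user,
     * switch to percpu_counter when a second thread/CLONE_VM user appears. */
    static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
    {
            if (!READ_ONCE(mm->rss_stat_percpu))            /* hypothetical flag */
                    atomic_long_add(value, &mm->rss_stat_plain[member]);
            else
                    percpu_counter_add(&mm->rss_stat[member], value);
    }

    /* On the first CLONE_THREAD/CLONE_VM: allocate the percpu counters,
     * transfer the plain totals into them, then flip rss_stat_percpu.
     * Fork/exit of single-threaded shell commands would then never pay the
     * percpu allocation and summing cost that the regression above measures. */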
On Thu, Jun 8, 2023 at 5:14 AM Jan Kara <jack@suse.cz> wrote: > > On Mon 24-10-22 05:28:41, Shakeel Butt wrote: > > Currently mm_struct maintains rss_stats which are updated on page fault > > and the unmapping codepaths. For page fault codepath the updates are > > cached per thread with the batch of TASK_RSS_EVENTS_THRESH which is 64. > > The reason for caching is performance for multithreaded applications > > otherwise the rss_stats updates may become hotspot for such > > applications. > > > > However this optimization comes with the cost of error margin in the rss > > stats. The rss_stats for applications with large number of threads can > > be very skewed. At worst the error margin is (nr_threads * 64) and we > > have a lot of applications with 100s of threads, so the error margin can > > be very high. Internally we had to reduce TASK_RSS_EVENTS_THRESH to 32. > > > > Recently we started seeing the unbounded errors for rss_stats for > > specific applications which use TCP rx0cp. It seems like > > vm_insert_pages() codepath does not sync rss_stats at all. > > > > This patch converts the rss_stats into percpu_counter to convert the > > error margin from (nr_threads * 64) to approximately (nr_cpus ^ 2). > > However this conversion enable us to get the accurate stats for > > situations where accuracy is more important than the cpu cost. Though > > this patch does not make such tradeoffs. > > > > Signed-off-by: Shakeel Butt <shakeelb@google.com> > > Somewhat late to the game but our performance testing grid has noticed this > commit causes a performance regression on shell-heavy workloads. For > example running 'make test' in git sources on our test machine with 192 > CPUs takes about 4% longer, system time is increased by about 9%: > > before (9cd6ffa6025) after (f1a7941243c1) > Amean User 471.12 * 0.30%* 481.77 * -1.96%* > Amean System 244.47 * 0.90%* 269.13 * -9.09%* > Amean Elapsed 709.22 * 0.45%* 742.27 * -4.19%* > Amean CPU 100.00 ( 0.20%) 101.00 * -0.80%* > > Essentially this workload spawns in sequence a lot of short-lived tasks and > the task startup + teardown cost is what this patch increases. To > demonstrate this more clearly, I've written trivial (and somewhat stupid) > benchmark shell_bench.sh: > > for (( i = 0; i < 20000; i++ )); do > /bin/true > done > > And when run like: > > numactl -C 1 ./shell_bench.sh > > (I've forced physical CPU binding to avoid task migrating over the machine > and cpu frequency scaling interfering which makes the numbers much more > noisy) I get the following elapsed times: > > 9cd6ffa6025 f1a7941243c1 > Avg 6.807429 7.631571 > Stddev 0.021797 0.016483 > > So some 12% regression in elapsed time. Just to be sure I've verified that > per-cpu allocator patch [1] does not improve these numbers in any > significant way. > > Where do we go from here? I think in principle the problem could be fixed > by being clever and when the task has only a single thread, we don't bother > with allocating pcpu counter (and summing it at the end) and just account > directly in mm_struct. When the second thread is spawned, we bite the > bullet, allocate pcpu counter and start with more scalable accounting. > These shortlived tasks in shell workloads or similar don't spawn any > threads so this should fix the regression. But this is obviously easier > said than done... > > Honza > > [1] https://lore.kernel.org/all/20230606125404.95256-1-yu.ma@intel.com/ Another regression reported earlier: https://lore.kernel.org/linux-mm/202301301057.e55dad5b-oliver.sang@intel.com/
On Thu, Jun 08, 2023 at 01:14:08PM +0200, Jan Kara wrote: [...] > > Somewhat late to the game but our performance testing grid has noticed this > commit causes a performance regression on shell-heavy workloads. For > example running 'make test' in git sources on our test machine with 192 > CPUs takes about 4% longer, system time is increased by about 9%: > > before (9cd6ffa6025) after (f1a7941243c1) > Amean User 471.12 * 0.30%* 481.77 * -1.96%* > Amean System 244.47 * 0.90%* 269.13 * -9.09%* > Amean Elapsed 709.22 * 0.45%* 742.27 * -4.19%* > Amean CPU 100.00 ( 0.20%) 101.00 * -0.80%* > > Essentially this workload spawns in sequence a lot of short-lived tasks and > the task startup + teardown cost is what this patch increases. To > demonstrate this more clearly, I've written trivial (and somewhat stupid) > benchmark shell_bench.sh: > > for (( i = 0; i < 20000; i++ )); do > /bin/true > done > > And when run like: > > numactl -C 1 ./shell_bench.sh > > (I've forced physical CPU binding to avoid task migrating over the machine > and cpu frequency scaling interfering which makes the numbers much more > noisy) I get the following elapsed times: > > 9cd6ffa6025 f1a7941243c1 > Avg 6.807429 7.631571 > Stddev 0.021797 0.016483 > > So some 12% regression in elapsed time. Just to be sure I've verified that > per-cpu allocator patch [1] does not improve these numbers in any > significant way. > > Where do we go from here? I think in principle the problem could be fixed > by being clever and when the task has only a single thread, we don't bother > with allocating pcpu counter (and summing it at the end) and just account > directly in mm_struct. When the second thread is spawned, we bite the > bullet, allocate pcpu counter and start with more scalable accounting. > These shortlived tasks in shell workloads or similar don't spawn any > threads so this should fix the regression. But this is obviously easier > said than done... > Thanks Jan for the report. I wanted to improve the percpu allocation to eliminate this regression as it was reported by intel test bot as well. However your suggestion seems seems targetted and reasonable as well. At the moment I am travelling, so not sure when I will get to this. Do you want to take a stab at it or you want me to do it? Also how urgent and sensitive this regression is for you? thanks, Shakeel
On Thu 08-06-23 17:37:00, Shakeel Butt wrote:
> On Thu, Jun 08, 2023 at 01:14:08PM +0200, Jan Kara wrote:
> [...]
>
> Thanks Jan for the report. I wanted to improve the percpu allocation to
> eliminate this regression, as it was reported by the Intel test bot as well.
> However, your suggestion seems targeted and reasonable as well. At the
> moment I am travelling, so I am not sure when I will get to this. Do you
> want to take a stab at it, or do you want me to do it? Also, how urgent and
> sensitive is this regression for you?

It is not really urgent, but eventually we'd like to get this fixed (like
within a couple of months). I currently have other stuff in progress, so if
you could get to it that would be nice; otherwise I should be able to look
into this in a week or two.

								Honza
Hi Shakeel and Jan,

On Thu, Jun 08, 2023 at 05:37:00PM +0000, Shakeel Butt wrote:
> On Thu, Jun 08, 2023 at 01:14:08PM +0200, Jan Kara wrote:
> [...]
>
> Thanks Jan for the report. I wanted to improve the percpu allocation to
> eliminate this regression, as it was reported by the Intel test bot as well.
> However, your suggestion seems targeted and reasonable as well. At the
> moment I am travelling, so I am not sure when I will get to this. Do you
> want to take a stab at it, or do you want me to do it? Also, how urgent and
> sensitive is this regression for you?
>
> thanks,
> Shakeel
>

I _think_ I could probably spin you a percpu_alloc_bulk() series in a
couple of days for percpu_counters. Let me try and find some time, unless
you had something different in mind.

Thanks,
Dennis
On Fri, Jun 9, 2023 at 12:10 AM Dennis Zhou <dennisszhou@gmail.com> wrote:
>
> Hi Shakeel and Jan,
>
> On Thu, Jun 08, 2023 at 05:37:00PM +0000, Shakeel Butt wrote:
> [...]
>
> I _think_ I could probably spin you a percpu_alloc_bulk() series in a
> couple of days for percpu_counters. Let me try and find some time, unless
> you had something different in mind.
>

That would be awesome and thanks a lot.
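[Editorial note: for readers wondering what a "bulk" approach along the lines Dennis mentions could look like, the sketch below backs an array of counters with a single percpu allocation instead of one allocation per counter, so fork pays the percpu allocator cost once. It is only an illustration of the shape of such an API: rss_stat_init_many()/rss_stat_destroy_many() are made-up names, and the sketch omits the lockdep key and CONFIG_HOTPLUG_CPU list handling that the real code in lib/percpu_counter.c performs.]

/*
 * Illustrative only: batch-initialize an array of percpu_counters with
 * one percpu allocation covering all of them.
 */
static int rss_stat_init_many(struct percpu_counter *fbc, u32 nr, gfp_t gfp)
{
	s32 __percpu *counters;
	u32 i;

	/* one percpu allocation shared by all nr counters */
	counters = __alloc_percpu_gfp(nr * sizeof(*counters),
				      sizeof(*counters), gfp);
	if (!counters)
		return -ENOMEM;

	for (i = 0; i < nr; i++) {
		raw_spin_lock_init(&fbc[i].lock);
		fbc[i].count = 0;
		fbc[i].counters = &counters[i];
	}
	return 0;
}

static void rss_stat_destroy_many(struct percpu_counter *fbc, u32 nr)
{
	/* the whole array shares one percpu region, so free it once */
	free_percpu(fbc[0].counters);
}

With something like this, the per-counter loops that the patch adds in mm_init() and __mmdrop() (see the kernel/fork.c hunks below) would collapse into a single call each.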
Linux regression tracking (Thorsten Leemhuis)
June 14, 2023, 8:37 a.m. UTC
[CCing the regression list, as it should be in the loop for regressions:
https://docs.kernel.org/admin-guide/reporting-regressions.html]

[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few template paragraphs
you might have encountered already in similar form. See link in footer if
these mails annoy you.]

On 08.06.23 13:14, Jan Kara wrote:
> On Mon 24-10-22 05:28:41, Shakeel Butt wrote:
>> [...]
>
> Somewhat late to the game, but our performance testing grid has noticed that
> this commit causes a performance regression on shell-heavy workloads. For
> example, running 'make test' in the git sources on our test machine with 192
> CPUs takes about 4% longer, and system time is increased by about 9%:

Thanks for the report. I noticed this is nothing urgent. Nevertheless, to be
sure the issue doesn't fall through the cracks unnoticed, I'm adding it to
regzbot, the Linux kernel regression tracking bot:

#regzbot ^introduced f1a7941243c
#regzbot title mm: performance regression on shell-heavy workloads
#regzbot backburner: not urgent according to reporter
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it are already discussed
somewhere else? It was fixed already? You want to clarify when the
regression started to happen? Or point out I got the title or something else
totally wrong? Then just reply and tell me -- ideally while also telling
regzbot about it, as explained by the page listed in the footer of this
mail.

Developers: When fixing the issue, remember to add 'Link:' tags pointing to
the report (the parent of this mail). See page linked in footer for details.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.
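[Editorial note: as background for the "approximately (nr_cpus ^ 2)" error bound in the commit message quoted earlier in the thread: a percpu_counter lets each CPU drift by up to the batch value before folding into the shared count, and on larger machines the default batch is scaled up with the number of online CPUs, so the O(1) read can lag by roughly nr_cpus * batch pages. Callers that need precision sum over all CPUs instead. The helper below is a minimal illustration, assuming a counter that is already initialized; mm_anon_pages() is not part of the patch.]

/*
 * Reading an mm's anonymous RSS two ways:
 *  - percpu_counter_read_positive() is O(1) but may lag behind by up to
 *    nr_cpus * percpu_counter_batch pages;
 *  - percpu_counter_sum_positive() takes the counter lock and walks every
 *    CPU's partial count, trading cpu time for accuracy.
 */
static unsigned long mm_anon_pages(struct mm_struct *mm, bool exact)
{
	if (exact)
		return percpu_counter_sum_positive(&mm->rss_stat[MM_ANONPAGES]);

	return percpu_counter_read_positive(&mm->rss_stat[MM_ANONPAGES]);
}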
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9dec25c7d631..a8a9c3a20534 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2000,40 +2000,30 @@ static inline bool get_user_page_fast_only(unsigned long addr,
  */
 static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
 {
-	long val = atomic_long_read(&mm->rss_stat.count[member]);
-
-#ifdef SPLIT_RSS_COUNTING
-	/*
-	 * counter is updated in asynchronous manner and may go to minus.
-	 * But it's never be expected number for users.
-	 */
-	if (val < 0)
-		val = 0;
-#endif
-	return (unsigned long)val;
+	return percpu_counter_read_positive(&mm->rss_stat[member]);
 }
 
-void mm_trace_rss_stat(struct mm_struct *mm, int member, long count);
+void mm_trace_rss_stat(struct mm_struct *mm, int member);
 
 static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
 {
-	long count = atomic_long_add_return(value, &mm->rss_stat.count[member]);
+	percpu_counter_add(&mm->rss_stat[member], value);
 
-	mm_trace_rss_stat(mm, member, count);
+	mm_trace_rss_stat(mm, member);
 }
 
 static inline void inc_mm_counter(struct mm_struct *mm, int member)
 {
-	long count = atomic_long_inc_return(&mm->rss_stat.count[member]);
+	percpu_counter_inc(&mm->rss_stat[member]);
 
-	mm_trace_rss_stat(mm, member, count);
+	mm_trace_rss_stat(mm, member);
 }
 
 static inline void dec_mm_counter(struct mm_struct *mm, int member)
 {
-	long count = atomic_long_dec_return(&mm->rss_stat.count[member]);
+	percpu_counter_dec(&mm->rss_stat[member]);
 
-	mm_trace_rss_stat(mm, member, count);
+	mm_trace_rss_stat(mm, member);
 }
 
 /* Optimized variant when page is already known not to be PageAnon */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index a82f06ab18a1..834022721bc6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -18,6 +18,7 @@
 #include <linux/page-flags-layout.h>
 #include <linux/workqueue.h>
 #include <linux/seqlock.h>
+#include <linux/percpu_counter.h>
 
 #include <asm/mmu.h>
 
@@ -626,11 +627,7 @@ struct mm_struct {
 
 		unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
 
-		/*
-		 * Special counters, in some configurations protected by the
-		 * page_table_lock, in other configurations by being atomic.
-		 */
-		struct mm_rss_stat rss_stat;
+		struct percpu_counter rss_stat[NR_MM_COUNTERS];
 
 		struct linux_binfmt *binfmt;
 
diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h
index 0bb4b6da9993..5414b5c6a103 100644
--- a/include/linux/mm_types_task.h
+++ b/include/linux/mm_types_task.h
@@ -36,19 +36,6 @@ enum {
 	NR_MM_COUNTERS
 };
 
-#if USE_SPLIT_PTE_PTLOCKS && defined(CONFIG_MMU)
-#define SPLIT_RSS_COUNTING
-/* per-thread cached information, */
-struct task_rss_stat {
-	int events;	/* for synchronization threshold */
-	int count[NR_MM_COUNTERS];
-};
-#endif /* USE_SPLIT_PTE_PTLOCKS */
-
-struct mm_rss_stat {
-	atomic_long_t count[NR_MM_COUNTERS];
-};
-
 struct page_frag {
 	struct page *page;
 #if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index 8ed5fba6d156..bde6c4c1f405 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -13,7 +13,6 @@
 #include <linux/threads.h>
 #include <linux/percpu.h>
 #include <linux/types.h>
-#include <linux/gfp.h>
 
 /* percpu_counter batch for local add or sub */
 #define PERCPU_COUNTER_LOCAL_BATCH	INT_MAX
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ffb6eb55cd13..079d299fa465 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -870,9 +870,6 @@ struct task_struct {
 	struct mm_struct		*mm;
 	struct mm_struct		*active_mm;
 
-#ifdef SPLIT_RSS_COUNTING
-	struct task_rss_stat		rss_stat;
-#endif
 	int				exit_state;
 	int				exit_code;
 	int				exit_signal;
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 243073cfc29d..58688768ef0f 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -346,10 +346,9 @@ TRACE_MM_PAGES
 TRACE_EVENT(rss_stat,
 
 	TP_PROTO(struct mm_struct *mm,
-		int member,
-		long count),
+		int member),
 
-	TP_ARGS(mm, member, count),
+	TP_ARGS(mm, member),
 
 	TP_STRUCT__entry(
 		__field(unsigned int, mm_id)
@@ -362,7 +361,8 @@ TRACE_EVENT(rss_stat,
 		__entry->mm_id = mm_ptr_to_hash(mm);
 		__entry->curr = !!(current->mm == mm);
 		__entry->member = member;
-		__entry->size = (count << PAGE_SHIFT);
+		__entry->size = (percpu_counter_sum_positive(&mm->rss_stat[member])
+							    << PAGE_SHIFT);
 	),
 
 	TP_printk("mm_id=%u curr=%d type=%s size=%ldB",
diff --git a/kernel/fork.c b/kernel/fork.c
index cfb09ca1b1bc..f56ad06240e1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -756,7 +756,7 @@ static void check_mm(struct mm_struct *mm)
			 "Please make sure 'struct resident_page_types[]' is updated as well");
 
 	for (i = 0; i < NR_MM_COUNTERS; i++) {
-		long x = atomic_long_read(&mm->rss_stat.count[i]);
+		long x = percpu_counter_sum(&mm->rss_stat[i]);
 
 		if (unlikely(x))
 			pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld\n",
@@ -782,6 +782,8 @@ static void check_mm(struct mm_struct *mm)
  */
 void __mmdrop(struct mm_struct *mm)
 {
+	int i;
+
 	BUG_ON(mm == &init_mm);
 	WARN_ON_ONCE(mm == current->mm);
 	WARN_ON_ONCE(mm == current->active_mm);
@@ -791,6 +793,9 @@ void __mmdrop(struct mm_struct *mm)
 	check_mm(mm);
 	put_user_ns(mm->user_ns);
 	mm_pasid_drop(mm);
+
+	for (i = 0; i < NR_MM_COUNTERS; i++)
+		percpu_counter_destroy(&mm->rss_stat[i]);
 	free_mm(mm);
 }
 EXPORT_SYMBOL_GPL(__mmdrop);
@@ -1110,6 +1115,8 @@ static void mm_init_uprobes_state(struct mm_struct *mm)
 static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
	struct user_namespace *user_ns)
 {
+	int i;
+
 	mt_init_flags(&mm->mm_mt, MM_MT_FLAGS);
 	mt_set_external_lock(&mm->mm_mt, &mm->mmap_lock);
 	atomic_set(&mm->mm_users, 1);
@@ -1151,10 +1158,17 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	if (init_new_context(p, mm))
 		goto fail_nocontext;
 
+	for (i = 0; i < NR_MM_COUNTERS; i++)
+		if (percpu_counter_init(&mm->rss_stat[i], 0, GFP_KERNEL_ACCOUNT))
+			goto fail_pcpu;
+
 	mm->user_ns = get_user_ns(user_ns);
 	lru_gen_init_mm(mm);
 	return mm;
 
+fail_pcpu:
+	while (i > 0)
+		percpu_counter_destroy(&mm->rss_stat[--i]);
 fail_nocontext:
 	mm_free_pgd(mm);
 fail_nopgd:
diff --git a/mm/memory.c b/mm/memory.c
index 8e72f703ed99..fea8d737e8c0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -162,58 +162,11 @@ static int __init init_zero_pfn(void)
 }
 early_initcall(init_zero_pfn);
 
-void mm_trace_rss_stat(struct mm_struct *mm, int member, long count)
+void mm_trace_rss_stat(struct mm_struct *mm, int member)
 {
-	trace_rss_stat(mm, member, count);
+	trace_rss_stat(mm, member);
 }
 
-#if defined(SPLIT_RSS_COUNTING)
-
-void sync_mm_rss(struct mm_struct *mm)
-{
-	int i;
-
-	for (i = 0; i < NR_MM_COUNTERS; i++) {
-		if (current->rss_stat.count[i]) {
-			add_mm_counter(mm, i, current->rss_stat.count[i]);
-			current->rss_stat.count[i] = 0;
-		}
-	}
-	current->rss_stat.events = 0;
-}
-
-static void add_mm_counter_fast(struct mm_struct *mm, int member, int val)
-{
-	struct task_struct *task = current;
-
-	if (likely(task->mm == mm))
-		task->rss_stat.count[member] += val;
-	else
-		add_mm_counter(mm, member, val);
-}
-#define inc_mm_counter_fast(mm, member) add_mm_counter_fast(mm, member, 1)
-#define dec_mm_counter_fast(mm, member) add_mm_counter_fast(mm, member, -1)
-
-/* sync counter once per 64 page faults */
-#define TASK_RSS_EVENTS_THRESH	(64)
-static void check_sync_rss_stat(struct task_struct *task)
-{
-	if (unlikely(task != current))
-		return;
-	if (unlikely(task->rss_stat.events++ > TASK_RSS_EVENTS_THRESH))
-		sync_mm_rss(task->mm);
-}
-#else /* SPLIT_RSS_COUNTING */
-
-#define inc_mm_counter_fast(mm, member) inc_mm_counter(mm, member)
-#define dec_mm_counter_fast(mm, member) dec_mm_counter(mm, member)
-
-static void check_sync_rss_stat(struct task_struct *task)
-{
-}
-
-#endif /* SPLIT_RSS_COUNTING */
-
 /*
  * Note: this doesn't free the actual pages themselves. That
  * has been handled earlier when unmapping all the memory regions.
@@ -1860,7 +1813,7 @@ static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte,
 		return -EBUSY;
 	/* Ok, finally just insert the thing.. */
 	get_page(page);
-	inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
+	inc_mm_counter(vma->vm_mm, mm_counter_file(page));
 	page_add_file_rmap(page, vma, false);
 	set_pte_at(vma->vm_mm, addr, pte, mk_pte(page, prot));
 	return 0;
@@ -3156,12 +3109,11 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 	if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
 		if (old_page) {
 			if (!PageAnon(old_page)) {
-				dec_mm_counter_fast(mm,
-						mm_counter_file(old_page));
-				inc_mm_counter_fast(mm, MM_ANONPAGES);
+				dec_mm_counter(mm, mm_counter_file(old_page));
+				inc_mm_counter(mm, MM_ANONPAGES);
 			}
 		} else {
-			inc_mm_counter_fast(mm, MM_ANONPAGES);
+			inc_mm_counter(mm, MM_ANONPAGES);
 		}
 		flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
 		entry = mk_pte(new_page, vma->vm_page_prot);
@@ -3968,8 +3920,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (should_try_to_free_swap(folio, vma, vmf->flags))
 		folio_free_swap(folio);
 
-	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
-	dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
+	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
+	dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
 	pte = mk_pte(page, vma->vm_page_prot);
 
 	/*
@@ -4148,7 +4100,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 		return handle_userfault(vmf, VM_UFFD_MISSING);
 	}
 
-	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
+	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
 	page_add_new_anon_rmap(page, vma, vmf->address);
 	lru_cache_add_inactive_or_unevictable(page, vma);
 setpte:
@@ -4338,11 +4290,11 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
 		entry = pte_mkuffd_wp(pte_wrprotect(entry));
 	/* copy-on-write page */
 	if (write && !(vma->vm_flags & VM_SHARED)) {
-		inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
+		inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
 		page_add_new_anon_rmap(page, vma, addr);
 		lru_cache_add_inactive_or_unevictable(page, vma);
 	} else {
-		inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
+		inc_mm_counter(vma->vm_mm, mm_counter_file(page));
 		page_add_file_rmap(page, vma, false);
 	}
 	set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
@@ -5194,9 +5146,6 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	count_vm_event(PGFAULT);
 	count_memcg_event_mm(vma->vm_mm, PGFAULT);
 
-	/* do counter updates before entering really critical section. */
-	check_sync_rss_stat(current);
-
 	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
					    flags & FAULT_FLAG_INSTRUCTION,
					    flags & FAULT_FLAG_REMOTE))