Message ID | 20230616180718.17725-1-mkoutny@suse.com |
---|---|
State | New |
Headers |
From: Michal Koutný <mkoutny@suse.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>, Christian Brauner <brauner@kernel.org>, Suren Baghdasaryan <surenb@google.com>, "Liam R . Howlett" <Liam.Howlett@oracle.com>, Mike Christie <michael.christie@oracle.com>, Mathieu Desnoyers <mathieu.desnoyers@efficios.com>, Peter Zijlstra <peterz@infradead.org>, Nicholas Piggin <npiggin@gmail.com>, Andrei Vagin <avagin@gmail.com>, Shakeel Butt <shakeelb@google.com>, Adam Majer <amajer@suse.com>, Jan Kara <jack@suse.cz>, Michal Hocko <mhocko@kernel.org>
Subject: [PATCH] mm: Sync percpu mm RSS counters before querying
Date: Fri, 16 Jun 2023 20:07:18 +0200
Message-ID: <20230616180718.17725-1-mkoutny@suse.com> |
Series |
mm: Sync percpu mm RSS counters before querying
|
|
Commit Message
Michal Koutný
June 16, 2023, 6:07 p.m. UTC
An issue was observed with stats collected in struct rusage on ppc64le
with 64kB pages. The percpu counters use batching with
	percpu_counter_batch = max(32, nr_cpus*2)	# in units of PAGE_SIZE
i.e. with larger pages but similar RSS consumption (in bytes), there are
fewer flushes and the error is more noticeable.
In this given case (getting the consumption of an exited child), we can
request a flush of the percpu counters without worrying about contention
with updaters.
Fortunately, commit f1a7941243c1 ("mm: convert mm's rss stats into
percpu_counter") didn't eradicate all traces of SPLIT_RSS_COUNTING, and
this mechanism already provided some synchronization points before
reading the stats.
Therefore, use sync_mm_rss as the carrier for percpu counter refreshes
and retire the SPLIT_RSS_COUNTING macro for good.
Impact of summing on an 8 CPU machine:
Benchmark 1: taskset -c 1 ./shell-bench.sh
Before
Time (mean ± σ): 9.950 s ± 0.052 s [User: 7.773 s, System: 2.023 s]
After
Time (mean ± σ): 9.990 s ± 0.070 s [User: 7.825 s, System: 2.011 s]
cat >shell-bench.sh <<EOD
for (( i = 0; i < 20000; i++ )); do
/bin/true
done
EOD
The script is meant to stress the fork-exit path (exit is where
sync_mm_rss is called most; add_mm_rss_vec should be covered in fork).
Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
Reported-by: Adam Majer <amajer@suse.com>
Signed-off-by: Michal Koutný <mkoutny@suse.com>
---
Notes:
- dummy approach to fix correctness, performance should be treated separately
- add percpu_counter_set to really update the value
- RFC https://lore.kernel.org/r/20230608171256.17827-1-mkoutny@suse.com/
include/linux/mm.h | 7 +++----
kernel/fork.c | 4 ----
2 files changed, 3 insertions(+), 8 deletions(-)
Comments
On Fri, 16 Jun 2023 20:07:18 +0200 Michal Koutný <mkoutny@suse.com> wrote:

> An issue was observed with stats collected in struct rusage on ppc64le
> with 64kB pages. The percpu counters use batching with
>     percpu_counter_batch = max(32, nr*2) # in PAGE_SIZE
> i.e. with larger pages but similar RSS consumption (bytes), there'll be
> less flushes and error more noticeable.

A fully detailed description of the issue would be helpful. Obviously
"inaccuracy", but how bad?

> In this given case (getting consumption of exited child), we can request
> percpu counter's flush without worrying about contention with updaters.
>
> Fortunately, the commit f1a7941243c1 ("mm: convert mm's rss stats into
> percpu_counter") didn't eradicate all traces of SPLIT_RSS_COUNTING and
> this mechanism already provided some synchronization points before
> reading stats.
> Therefore, use sync_mm_rss as carrier for percpu counters refreshes and
> forget SPLIT_RSS_COUNTING macro for good.
>
> Impact of summing on a 8 CPU machine:
> Benchmark 1: taskset -c 1 ./shell-bench.sh
>
> Before
> Time (mean ± σ): 9.950 s ± 0.052 s [User: 7.773 s, System: 2.023 s]
>
> After
> Time (mean ± σ): 9.990 s ± 0.070 s [User: 7.825 s, System: 2.011 s]
>
> cat >shell-bench.sh <<EOD
> for (( i = 0; i < 20000; i++ )); do
> /bin/true
> done
> EOD
>
> The script is meant to stress fork-exit path (exit is where sync_mm_rss
> is most called, add_mm_rss_vec should be covered in fork).
>
> ...
>
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2547,13 +2547,12 @@ static inline void setmax_mm_hiwater_rss(unsigned long *maxrss,
>  	*maxrss = hiwater_rss;
>  }
>
> -#if defined(SPLIT_RSS_COUNTING)
> -void sync_mm_rss(struct mm_struct *mm);
> -#else
>  static inline void sync_mm_rss(struct mm_struct *mm)
>  {
> +	for (int i = 0; i < NR_MM_COUNTERS; ++i)
> +		percpu_counter_set(&mm->rss_stat[i],
> +				   percpu_counter_sum(&mm->rss_stat[i]));
>  }
> -#endif

Far too large to be inlined! For six callsites it adds 1kb of text.

Why even modify the counter? Can't <whatever this issue is> be
addressed by using percpu_counter_sum() in an appropriate place?

For unknown reasons percpu_counter_set() uses for_each_possible_cpu().
Probably just a mistake - percpu_counters are hotplug-aware and
for_each_online_cpu should suffice.

I'm really not liking percpu_counter_set(). It's only safe in
situations where the caller knows that no other CPU can be modifying
the counter. I wonder if all the callers know that. This situation
isn't aided by the lack of any documentation.
On Fri, Jun 16, 2023 at 12:31:44PM -0700, Andrew Morton <akpm@linux-foundation.org> wrote:
> On Fri, 16 Jun 2023 20:07:18 +0200 Michal Koutný <mkoutny@suse.com> wrote:
>
> > An issue was observed with stats collected in struct rusage on ppc64le
> > with 64kB pages. The percpu counters use batching with
> >     percpu_counter_batch = max(32, nr*2) # in PAGE_SIZE
> > i.e. with larger pages but similar RSS consumption (bytes), there'll be
> > less flushes and error more noticeable.
>
> A fully detailed description of the issue would be helpful. Obviously
> "inaccuracy", but how bad?

Any reader of get_mm_counter() could see the inaccuracy given by the
formula. In this particular case it is detected by a testsuite of the
time(1) utility that feeds from rusage:

> FAIL: tests/time-max-rss
> ========================
>
> time(1) failed to detect 5MB allcoation.
> mem-baseline(kb): 0
> mem-5MB(kb): 4096
> delta(kb): 4096
> FAIL tests/time-max-rss.sh (exit status: 1)

(i.e. 1MB missing)

> Far too large to be inlined! For six callsites it adds 1kb of text.

Ah, thanks, I can change that.

> Why even modify the counter? Can't <whatever this issue is> be
> addressed by using percpu_counter_sum() in an appropriate place?

I considered modifying get_mm_counter(), however, I decided not to put
the per-cpu summing there as it'd incur the impact to many more places
than sync_mm_rss().

> For unknown reasons percpu_counter_set() uses for_each_possible_cpu().
> Probably just a mistake - percpu_counters are hotplug-aware and
> for_each_online_cpu should suffice.

Yeah, that could be cleaned up in another patch (cf the mask in
__percpu_counter_sum).

> I'm really not liking percpu_counter_set(). It's only safe in
> situations where the caller knows that no other CPU can be modifying
> the counter. I wonder if all the callers know that.

I admit I only considered the do_exit() path (and even that isn't
granted in a multithreaded process) -- so I don't like
percpu_counter_set() in this current form either.

I will need to review effects of parallel updates more.

Thanks,
Michal
On Fri 16-06-23 20:07:18, Michal Koutny wrote:
[...]
> Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
Please note there is a different regression from f1a7941243c1 reported
by Jan http://lkml.kernel.org/r/20230608111408.s2minsenlcjow7q3@quack3
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 27ce77080c79..c7ac1cbc6fb3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2547,13 +2547,12 @@ static inline void setmax_mm_hiwater_rss(unsigned long *maxrss,
 	*maxrss = hiwater_rss;
 }
 
-#if defined(SPLIT_RSS_COUNTING)
-void sync_mm_rss(struct mm_struct *mm);
-#else
 static inline void sync_mm_rss(struct mm_struct *mm)
 {
+	for (int i = 0; i < NR_MM_COUNTERS; ++i)
+		percpu_counter_set(&mm->rss_stat[i],
+				   percpu_counter_sum(&mm->rss_stat[i]));
 }
-#endif
 
 #ifndef CONFIG_ARCH_HAS_PTE_SPECIAL
 static inline int pte_special(pte_t pte)
diff --git a/kernel/fork.c b/kernel/fork.c
index 81cba91f30bb..e030eb902e4b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2412,10 +2412,6 @@ __latent_entropy struct task_struct *copy_process(
 	p->io_uring = NULL;
 #endif
 
-#if defined(SPLIT_RSS_COUNTING)
-	memset(&p->rss_stat, 0, sizeof(p->rss_stat));
-#endif
-
 	p->default_timer_slack_ns = current->timer_slack_ns;
 
 #ifdef CONFIG_PSI