Message ID | 20230330191801.1967435-5-yosryahmed@google.com |
---|---|
State | New |
Headers |
Date: Thu, 30 Mar 2023 19:17:57 +0000 In-Reply-To: <20230330191801.1967435-1-yosryahmed@google.com> References: <20230330191801.1967435-1-yosryahmed@google.com> Message-ID: <20230330191801.1967435-5-yosryahmed@google.com> Subject: [PATCH v3 4/8] memcg: replace stats_flush_lock with an atomic From: Yosry Ahmed <yosryahmed@google.com> To: Tejun Heo <tj@kernel.org>, Josef Bacik <josef@toxicpanda.com>, Jens Axboe <axboe@kernel.dk>, Zefan Li <lizefan.x@bytedance.com>, Johannes Weiner <hannes@cmpxchg.org>, Michal Hocko <mhocko@kernel.org>, Roman Gushchin <roman.gushchin@linux.dev>, Shakeel Butt <shakeelb@google.com>, Muchun Song <muchun.song@linux.dev>, Andrew Morton <akpm@linux-foundation.org>, Michal Koutný <mkoutny@suse.com> Cc: Vasily Averin <vasily.averin@linux.dev>, cgroups@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, bpf@vger.kernel.org, Yosry Ahmed <yosryahmed@google.com>, Michal Hocko <mhocko@suse.com> List-ID: <linux-kernel.vger.kernel.org> |
Series |
memcg: avoid flushing stats atomically where possible
|
|
Commit Message
Yosry Ahmed
March 30, 2023, 7:17 p.m. UTC
As Johannes notes in [1], stats_flush_lock is currently used to:
(a) Protect updates to stats_flush_threshold.
(b) Protect updates to flush_next_time.
(c) Serialize calls to cgroup_rstat_flush() based on those ratelimits.

However:

1. stats_flush_threshold is already an atomic.

2. flush_next_time is not atomic. The writer is locked, but the reader
   is lockless. If the reader races with a flush, you could see this:

                                        if (time_after(jiffies, flush_next_time))
        spin_trylock()
        flush_next_time = now + delay
        flush()
        spin_unlock()
                                        spin_trylock()
                                        flush_next_time = now + delay
                                        flush()
                                        spin_unlock()

   which means we already can get flushes at a higher frequency than
   FLUSH_TIME during races. But it isn't really a problem.

   The reader could also see garbled partial updates if the compiler
   decides to split the write, so it needs at least READ_ONCE and
   WRITE_ONCE protection.

3. Serializing cgroup_rstat_flush() calls against the ratelimit
   factors is currently broken because of the race in 2. But the race
   is actually harmless, all we might get is the occasional earlier
   flush. If there is no delta, the flush won't do much. And if there
   is, the flush is justified.

So the lock can be removed altogether. However, the lock also served
the purpose of preventing a thundering herd problem for concurrent
flushers, see [2]. Use an atomic instead to serve the purpose of
unifying concurrent flushers.

[1]https://lore.kernel.org/lkml/20230323172732.GE739026@cmpxchg.org/
[2]https://lore.kernel.org/lkml/20210716212137.1391164-2-shakeelb@google.com/

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/memcontrol.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)
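Point 2 of the commit message can be pictured with a small userspace sketch. This is not the kernel code: relaxed C11 atomics stand in for READ_ONCE/WRITE_ONCE (they likewise rule out torn accesses), and `FLUSH_DELAY`, `flush_count`, `flush()` and `flush_ratelimited()` are hypothetical names chosen for illustration. The caller passes the current time in place of jiffies_64.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical delay in ticks; stands in for 2*FLUSH_TIME. */
#define FLUSH_DELAY 2000u

/* Written by the flusher, read locklessly by rate-limited callers. */
static _Atomic uint64_t flush_next_time;
static unsigned int flush_count;

/* Flusher side: publish the next deadline with a single untorn store. */
static void flush(uint64_t now)
{
    atomic_store_explicit(&flush_next_time, now + FLUSH_DELAY,
                          memory_order_relaxed);
    flush_count++; /* placeholder for the real flush work */
}

/*
 * Reader side: one untorn load, then a wraparound-safe "now is after
 * next" comparison. A stale value can only cause an earlier flush.
 */
static bool flush_ratelimited(uint64_t now)
{
    uint64_t next = atomic_load_explicit(&flush_next_time,
                                         memory_order_relaxed);

    if ((int64_t)(next - now) < 0) {
        flush(now);
        return true;
    }
    return false;
}
```

A reader that loads the deadline just before a concurrent flush updates it will simply flush once more than strictly needed, which is the harmless outcome the commit message describes.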
Comments
Hello.

On Thu, Mar 30, 2023 at 07:17:57PM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
>  static void __mem_cgroup_flush_stats(void)
>  {
> -	unsigned long flag;
> -
> -	if (!spin_trylock_irqsave(&stats_flush_lock, flag))
> +	/*
> +	 * We always flush the entire tree, so concurrent flushers can just
> +	 * skip. This avoids a thundering herd problem on the rstat global lock
> +	 * from memcg flushers (e.g. reclaim, refault, etc).
> +	 */
> +	if (atomic_read(&stats_flush_ongoing) ||
> +	    atomic_xchg(&stats_flush_ongoing, 1))
>  		return;

I'm curious about why this instead of

	if (atomic_xchg(&stats_flush_ongoing, 1))
		return;

Is that some microarchitectural cleverness?

Thanks,
Michal
On Tue, Apr 4, 2023 at 9:53 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> Hello.
>
> On Thu, Mar 30, 2023 at 07:17:57PM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
> >  static void __mem_cgroup_flush_stats(void)
> >  {
> > -	unsigned long flag;
> > -
> > -	if (!spin_trylock_irqsave(&stats_flush_lock, flag))
> > +	/*
> > +	 * We always flush the entire tree, so concurrent flushers can just
> > +	 * skip. This avoids a thundering herd problem on the rstat global lock
> > +	 * from memcg flushers (e.g. reclaim, refault, etc).
> > +	 */
> > +	if (atomic_read(&stats_flush_ongoing) ||
> > +	    atomic_xchg(&stats_flush_ongoing, 1))
> >  		return;
>
> I'm curious about why this instead of
>
>	if (atomic_xchg(&stats_flush_ongoing, 1))
>		return;
>
> Is that some microarchitectural cleverness?
>

Yes indeed it is. Basically we want to avoid unconditional cache
dirtying. This pattern is also used at other places in the kernel like
qspinlock.
On Tue, Apr 4, 2023 at 10:13 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Tue, Apr 4, 2023 at 9:53 AM Michal Koutný <mkoutny@suse.com> wrote:
> >
> > Hello.
> >
> > On Thu, Mar 30, 2023 at 07:17:57PM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
> > >  static void __mem_cgroup_flush_stats(void)
> > >  {
> > > -	unsigned long flag;
> > > -
> > > -	if (!spin_trylock_irqsave(&stats_flush_lock, flag))
> > > +	/*
> > > +	 * We always flush the entire tree, so concurrent flushers can just
> > > +	 * skip. This avoids a thundering herd problem on the rstat global lock
> > > +	 * from memcg flushers (e.g. reclaim, refault, etc).
> > > +	 */
> > > +	if (atomic_read(&stats_flush_ongoing) ||
> > > +	    atomic_xchg(&stats_flush_ongoing, 1))
> > >  		return;
> >
> > I'm curious about why this instead of
> >
> >	if (atomic_xchg(&stats_flush_ongoing, 1))
> >		return;
> >
> > Is that some microarchitectural cleverness?
> >
>
> Yes indeed it is. Basically we want to avoid unconditional cache
> dirtying. This pattern is also used at other places in the kernel like
> qspinlock.

Oh also take a look at
https://lore.kernel.org/all/20230404052228.15788-1-feng.tang@intel.com/
On Tue, Apr 04, 2023 at 10:21:33AM -0700, Shakeel Butt <shakeelb@google.com> wrote:
> > Yes indeed it is. Basically we want to avoid unconditional cache
> > dirtying. This pattern is also used at other places in the kernel like
> > qspinlock.

Thanks for confirmation. (I remembered the commit 873f64b791a2
("mm/memcontrol.c: remove the redundant updating of
stats_flush_threshold"). But was slightly confused why would it be
open-coded every time.)

> Oh also take a look at
> https://lore.kernel.org/all/20230404052228.15788-1-feng.tang@intel.com/

Thanks for the link.

Michal
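The read-before-exchange idiom discussed in this thread is often called test-and-test-and-set. Below is a minimal userspace sketch of it with C11 `<stdatomic.h>`, assuming `atomic_load_explicit()` and `atomic_exchange()` as rough stand-ins for the kernel's atomic_read() and atomic_xchg(). The names `flush_ongoing`, `flushes_done`, `try_begin_flush()` and `end_flush()` are hypothetical and do not appear in the patch.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int flush_ongoing; /* 0 = idle, 1 = a flush is in progress */
static int flushes_done;         /* touched only by the current flusher */

/* Returns true if the caller became the flusher, false if it should skip. */
static bool try_begin_flush(void)
{
    /*
     * Cheap check first: a plain load does not take the cache line
     * exclusive, so callers that see an ongoing flush never dirty it.
     */
    if (atomic_load_explicit(&flush_ongoing, memory_order_relaxed))
        return false;

    /*
     * Only now pay for the read-modify-write. A nonzero old value
     * means another caller won the race in between.
     */
    if (atomic_exchange(&flush_ongoing, 1))
        return false;

    return true;
}

/* Called by the winning flusher once its work is complete. */
static void end_flush(void)
{
    flushes_done++; /* placeholder for the real flush work */
    atomic_store(&flush_ongoing, 0);
}
```

With the exchange alone the result would be the same, but every skipping caller would perform a write to the shared line; the leading load lets the common "someone else is flushing" case stay read-only.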
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ff39f78f962e..65750f8b8259 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -585,8 +585,8 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
  */
 static void flush_memcg_stats_dwork(struct work_struct *w);
 static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
-static DEFINE_SPINLOCK(stats_flush_lock);
 static DEFINE_PER_CPU(unsigned int, stats_updates);
+static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);
 static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
 static u64 flush_next_time;
 
@@ -636,15 +636,19 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 
 static void __mem_cgroup_flush_stats(void)
 {
-	unsigned long flag;
-
-	if (!spin_trylock_irqsave(&stats_flush_lock, flag))
+	/*
+	 * We always flush the entire tree, so concurrent flushers can just
+	 * skip. This avoids a thundering herd problem on the rstat global lock
+	 * from memcg flushers (e.g. reclaim, refault, etc).
+	 */
+	if (atomic_read(&stats_flush_ongoing) ||
+	    atomic_xchg(&stats_flush_ongoing, 1))
 		return;
 
-	flush_next_time = jiffies_64 + 2*FLUSH_TIME;
+	WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
 	cgroup_rstat_flush_atomic(root_mem_cgroup->css.cgroup);
 	atomic_set(&stats_flush_threshold, 0);
-	spin_unlock_irqrestore(&stats_flush_lock, flag);
+	atomic_set(&stats_flush_ongoing, 0);
 }
 
 void mem_cgroup_flush_stats(void)
@@ -655,7 +659,7 @@ void mem_cgroup_flush_stats(void)
 
 void mem_cgroup_flush_stats_ratelimited(void)
 {
-	if (time_after64(jiffies_64, flush_next_time))
+	if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
 		mem_cgroup_flush_stats();
 }