Message ID | 20230322194456.2331527-2-frederic@kernel.org |
---|---|
State | New |
Headers |
From: Frederic Weisbecker <frederic@kernel.org>
To: "Paul E . McKenney" <paulmck@kernel.org>
Cc: LKML <linux-kernel@vger.kernel.org>, Frederic Weisbecker <frederic@kernel.org>, rcu <rcu@vger.kernel.org>, Uladzislau Rezki <urezki@gmail.com>, Neeraj Upadhyay <quic_neeraju@quicinc.com>, Boqun Feng <boqun.feng@gmail.com>, Joel Fernandes <joel@joelfernandes.org>
Subject: [PATCH 1/4] rcu/nocb: Protect lazy shrinker against concurrent (de-)offloading
Date: Wed, 22 Mar 2023 20:44:53 +0100
Message-Id: <20230322194456.2331527-2-frederic@kernel.org>
In-Reply-To: <20230322194456.2331527-1-frederic@kernel.org>
References: <20230322194456.2331527-1-frederic@kernel.org>
Series | rcu/nocb: Shrinker related boring fixes |
Commit Message
Frederic Weisbecker
March 22, 2023, 7:44 p.m. UTC
The shrinker may run concurrently with callbacks (de-)offloading. As
such, calling rcu_nocb_lock() is very dangerous because it performs
conditional locking. The worst outcome is that rcu_nocb_lock() doesn't
lock but rcu_nocb_unlock() eventually unlocks, or the reverse, creating
an imbalance.
Fix this by protecting against (de-)offloading using the barrier mutex.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
kernel/rcu/tree_nocb.h | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)
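The hazard described above boils down to a lock/unlock pair whose condition can change in between. As a stand-alone illustration (a hypothetical user-space model, not code from the patch or the kernel; the real flip comes from another CPU under proper synchronization, here it is inlined for brevity), the same pattern can be modeled with a flag-guarded pthread mutex:

/* Hypothetical user-space model of the conditional-locking hazard. */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t nocb_lock = PTHREAD_MUTEX_INITIALIZER;
static bool offloaded;	/* stands in for the rdp offloaded state */

/* Only take the lock when "offloaded", like rcu_nocb_lock(). */
static void cond_lock(void)
{
	if (offloaded)
		pthread_mutex_lock(&nocb_lock);
}

/* Only release the lock when "offloaded", like rcu_nocb_unlock(). */
static void cond_unlock(void)
{
	if (offloaded)
		pthread_mutex_unlock(&nocb_lock);
}

int main(void)
{
	offloaded = true;
	cond_lock();		/* condition true: lock acquired */
	offloaded = false;	/* models a concurrent de-offload */
	cond_unlock();		/* condition now false: lock never released */

	/* The reverse flip would unlock a mutex that was never locked. */
	return 0;
}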
Comments
On Wed, Mar 22, 2023 at 08:44:53PM +0100, Frederic Weisbecker wrote:
> The shrinker may run concurrently with callbacks (de-)offloading. As
> such, calling rcu_nocb_lock() is very dangerous because it does a
> conditional locking. The worst outcome is that rcu_nocb_lock() doesn't
> lock but rcu_nocb_unlock() eventually unlocks, or the reverse, creating
> an imbalance.
>
> Fix this with protecting against (de-)offloading using the barrier mutex.
>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>

Good catch!!! A few questions, comments, and speculations below.

							Thanx, Paul

> ---
>  kernel/rcu/tree_nocb.h | 17 ++++++++++++++++-
>  1 file changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
> index f2280616f9d5..dd9b655ae533 100644
> --- a/kernel/rcu/tree_nocb.h
> +++ b/kernel/rcu/tree_nocb.h
> @@ -1336,13 +1336,25 @@ lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
>  	unsigned long flags;
>  	unsigned long count = 0;
>
> +	/*
> +	 * Protect against concurrent (de-)offloading. Otherwise nocb locking
> +	 * may be ignored or imbalanced.
> +	 */
> +	mutex_lock(&rcu_state.barrier_mutex);

I was worried about this possibly leading to out-of-memory deadlock,
but if I recall correctly, the (de-)offloading process never allocates
memory, so this should be OK?

The other concern was that the (de-)offloading operation might take a
long time, but the usual cause for that is huge numbers of callbacks,
in which case letting them free their memory is not necessarily a bad
strategy.

> +
>  	/* Snapshot count of all CPUs */
>  	for_each_possible_cpu(cpu) {
>  		struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
> -		int _count = READ_ONCE(rdp->lazy_len);
> +		int _count;
> +
> +		if (!rcu_rdp_is_offloaded(rdp))
> +			continue;

If the CPU is offloaded, isn't ->lazy_len guaranteed to be zero?

Or can it contain garbage after a de-offloading operation?

> +		_count = READ_ONCE(rdp->lazy_len);
>
>  		if (_count == 0)
>  			continue;
> +
>  		rcu_nocb_lock_irqsave(rdp, flags);
>  		WRITE_ONCE(rdp->lazy_len, 0);
>  		rcu_nocb_unlock_irqrestore(rdp, flags);
> @@ -1352,6 +1364,9 @@ lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
>  		if (sc->nr_to_scan <= 0)
>  			break;
>  	}
> +
> +	mutex_unlock(&rcu_state.barrier_mutex);
> +
>  	return count ? count : SHRINK_STOP;
>  }
>
> --
> 2.34.1
>
On Wed, Mar 22, 2023 at 04:18:24PM -0700, Paul E. McKenney wrote:
> On Wed, Mar 22, 2023 at 08:44:53PM +0100, Frederic Weisbecker wrote:
> > The shrinker may run concurrently with callbacks (de-)offloading. As
> > such, calling rcu_nocb_lock() is very dangerous because it does a
> > conditional locking. The worst outcome is that rcu_nocb_lock() doesn't
> > lock but rcu_nocb_unlock() eventually unlocks, or the reverse, creating
> > an imbalance.
> >
> > Fix this with protecting against (de-)offloading using the barrier mutex.
> >
> > Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
>
> Good catch!!! A few questions, comments, and speculations below.

Added a few more. ;)

> > ---
> >  kernel/rcu/tree_nocb.h | 17 ++++++++++++++++-
> >  1 file changed, 16 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
> > index f2280616f9d5..dd9b655ae533 100644
> > --- a/kernel/rcu/tree_nocb.h
> > +++ b/kernel/rcu/tree_nocb.h
> > @@ -1336,13 +1336,25 @@ lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> >  	unsigned long flags;
> >  	unsigned long count = 0;
> >
> > +	/*
> > +	 * Protect against concurrent (de-)offloading. Otherwise nocb locking
> > +	 * may be ignored or imbalanced.
> > +	 */
> > +	mutex_lock(&rcu_state.barrier_mutex);
>
> I was worried about this possibly leading to out-of-memory deadlock,
> but if I recall correctly, the (de-)offloading process never allocates
> memory, so this should be OK?

Maybe trylock is better then? If we can't make progress, may be better to let
kswapd free memory by other means than blocking on the mutex.

ISTR, from my Android days that there are weird lockdep issues that happen
when locking in a shrinker (due to the 'fake lock' dependency added during
reclaim).

> The other concern was that the (de-)offloading operation might take a
> long time, but the usual cause for that is huge numbers of callbacks,
> in which case letting them free their memory is not necessarily a bad
> strategy.
>
> > +
> >  	/* Snapshot count of all CPUs */
> >  	for_each_possible_cpu(cpu) {
> >  		struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
> > -		int _count = READ_ONCE(rdp->lazy_len);
> > +		int _count;
> > +
> > +		if (!rcu_rdp_is_offloaded(rdp))
> > +			continue;
>
> If the CPU is offloaded, isn't ->lazy_len guaranteed to be zero?

Did you mean de-offloaded? If it is offloaded, that means nocb is active so
there could be lazy CBs queued. Or did I miss something?

thanks,

- Joel

> Or can it contain garbage after a de-offloading operation?
>
> > +		_count = READ_ONCE(rdp->lazy_len);
> >
> >  		if (_count == 0)
> >  			continue;
> > +
> >  		rcu_nocb_lock_irqsave(rdp, flags);
> >  		WRITE_ONCE(rdp->lazy_len, 0);
> >  		rcu_nocb_unlock_irqrestore(rdp, flags);
> > @@ -1352,6 +1364,9 @@ lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> >  		if (sc->nr_to_scan <= 0)
> >  			break;
> >  	}
> > +
> > +	mutex_unlock(&rcu_state.barrier_mutex);
> > +
> >  	return count ? count : SHRINK_STOP;
> >  }
> >
> > --
> > 2.34.1
> >
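For reference, Joel's mutex_trylock() suggestion might look roughly like the sketch below (untested; returning SHRINK_STOP when the mutex is contended is an assumption, not something settled in this thread):

	/*
	 * Sketch: best-effort protection against (de-)offloading. If the
	 * barrier mutex is contended, a (de-)offloading operation is likely
	 * in flight, so back off rather than block reclaim on the mutex.
	 */
	if (!mutex_trylock(&rcu_state.barrier_mutex))
		return SHRINK_STOP;

	/* ... snapshot and flush each offloaded rdp's ->lazy_len as in the patch ... */

	mutex_unlock(&rcu_state.barrier_mutex);
	return count ? count : SHRINK_STOP;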
On Fri, Mar 24, 2023 at 12:55:23AM +0000, Joel Fernandes wrote:
> On Wed, Mar 22, 2023 at 04:18:24PM -0700, Paul E. McKenney wrote:
> > On Wed, Mar 22, 2023 at 08:44:53PM +0100, Frederic Weisbecker wrote:
> > > The shrinker may run concurrently with callbacks (de-)offloading. As
> > > such, calling rcu_nocb_lock() is very dangerous because it does a
> > > conditional locking. The worst outcome is that rcu_nocb_lock() doesn't
> > > lock but rcu_nocb_unlock() eventually unlocks, or the reverse, creating
> > > an imbalance.
> > >
> > > Fix this with protecting against (de-)offloading using the barrier mutex.
> > >
> > > Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> >
> > Good catch!!! A few questions, comments, and speculations below.
>
> Added a few more. ;)
>
> > > ---
> > >  kernel/rcu/tree_nocb.h | 17 ++++++++++++++++-
> > >  1 file changed, 16 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
> > > index f2280616f9d5..dd9b655ae533 100644
> > > --- a/kernel/rcu/tree_nocb.h
> > > +++ b/kernel/rcu/tree_nocb.h
> > > @@ -1336,13 +1336,25 @@ lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> > >  	unsigned long flags;
> > >  	unsigned long count = 0;
> > >
> > > +	/*
> > > +	 * Protect against concurrent (de-)offloading. Otherwise nocb locking
> > > +	 * may be ignored or imbalanced.
> > > +	 */
> > > +	mutex_lock(&rcu_state.barrier_mutex);
> >
> > I was worried about this possibly leading to out-of-memory deadlock,
> > but if I recall correctly, the (de-)offloading process never allocates
> > memory, so this should be OK?
>
> Maybe trylock is better then? If we can't make progress, may be better to let
> kswapd free memory by other means than blocking on the mutex.
>
> ISTR, from my Android days that there are weird lockdep issues that happen
> when locking in a shrinker (due to the 'fake lock' dependency added during
> reclaim).

This stuff gets tricky quickly. ;-)

> > The other concern was that the (de-)offloading operation might take a
> > long time, but the usual cause for that is huge numbers of callbacks,
> > in which case letting them free their memory is not necessarily a bad
> > strategy.
> >
> > > +
> > >  	/* Snapshot count of all CPUs */
> > >  	for_each_possible_cpu(cpu) {
> > >  		struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
> > > -		int _count = READ_ONCE(rdp->lazy_len);
> > > +		int _count;
> > > +
> > > +		if (!rcu_rdp_is_offloaded(rdp))
> > > +			continue;
> >
> > If the CPU is offloaded, isn't ->lazy_len guaranteed to be zero?
>
> Did you mean de-offloaded? If it is offloaded, that means nocb is active so
> there could be lazy CBs queued. Or did I miss something?

You are quite right, offloaded for ->lazy_len to be zero.

							Thanx, Paul

> thanks,
>
> - Joel
>
> > Or can it contain garbage after a de-offloading operation?
> >
> > > +		_count = READ_ONCE(rdp->lazy_len);
> > >
> > >  		if (_count == 0)
> > >  			continue;
> > > +
> > >  		rcu_nocb_lock_irqsave(rdp, flags);
> > >  		WRITE_ONCE(rdp->lazy_len, 0);
> > >  		rcu_nocb_unlock_irqrestore(rdp, flags);
> > > @@ -1352,6 +1364,9 @@ lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> > >  		if (sc->nr_to_scan <= 0)
> > >  			break;
> > >  	}
> > > +
> > > +	mutex_unlock(&rcu_state.barrier_mutex);
> > > +
> > >  	return count ? count : SHRINK_STOP;
> > >  }
> > >
> > > --
> > > 2.34.1
> > >
Le Wed, Mar 22, 2023 at 04:18:24PM -0700, Paul E. McKenney a écrit :
> > @@ -1336,13 +1336,25 @@ lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> >  	unsigned long flags;
> >  	unsigned long count = 0;
> >
> > +	/*
> > +	 * Protect against concurrent (de-)offloading. Otherwise nocb locking
> > +	 * may be ignored or imbalanced.
> > +	 */
> > +	mutex_lock(&rcu_state.barrier_mutex);
>
> I was worried about this possibly leading to out-of-memory deadlock,
> but if I recall correctly, the (de-)offloading process never allocates
> memory, so this should be OK?

Good point. It _should_ be fine but like you, Joel and Hillf pointed out
it's asking for trouble.

We could try Joel's idea to use mutex_trylock() as a best effort, which
should be fine as it's mostly uncontended.

The alternative is to force nocb locking and check the offloading state
right after. So instead of:

	rcu_nocb_lock_irqsave(rdp, flags);
	//flush stuff
	rcu_nocb_unlock_irqrestore(rdp, flags);

Have:

	raw_spin_lock_irqsave(rdp->nocb_lock, flags);
	if (!rcu_rdp_is_offloaded(rdp))
		raw_spin_unlock_irqrestore(rdp->nocb_lock, flags);
		continue;
	}
	//flush stuff
	rcu_nocb_unlock_irqrestore(rdp, flags);

But it's not pretty and also disqualifies the last two patches as
rcu_nocb_mask can't be iterated safely anymore.

What do you think?

> >  	/* Snapshot count of all CPUs */
> >  	for_each_possible_cpu(cpu) {
> >  		struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
> > -		int _count = READ_ONCE(rdp->lazy_len);
> > +		int _count;
> > +
> > +		if (!rcu_rdp_is_offloaded(rdp))
> > +			continue;
>
> If the CPU is offloaded, isn't ->lazy_len guaranteed to be zero?
>
> Or can it contain garbage after a de-offloading operation?

If it's deoffloaded, ->lazy_len is indeed (supposed to be) guaranteed to be zero.
Bypass is flushed and disabled atomically early on de-offloading and the
flush resets ->lazy_len.

Thanks.
On Fri, Mar 24, 2023 at 11:09:08PM +0100, Frederic Weisbecker wrote:
> Le Wed, Mar 22, 2023 at 04:18:24PM -0700, Paul E. McKenney a écrit :
> > > @@ -1336,13 +1336,25 @@ lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> > >  	unsigned long flags;
> > >  	unsigned long count = 0;
> > >
> > > +	/*
> > > +	 * Protect against concurrent (de-)offloading. Otherwise nocb locking
> > > +	 * may be ignored or imbalanced.
> > > +	 */
> > > +	mutex_lock(&rcu_state.barrier_mutex);
> >
> > I was worried about this possibly leading to out-of-memory deadlock,
> > but if I recall correctly, the (de-)offloading process never allocates
> > memory, so this should be OK?
>
> Good point. It _should_ be fine but like you, Joel and Hillf pointed out
> it's asking for trouble.
>
> We could try Joel's idea to use mutex_trylock() as a best effort, which
> should be fine as it's mostly uncontended.
>
> The alternative is to force nocb locking and check the offloading state
> right after. So instead of:
>
> 	rcu_nocb_lock_irqsave(rdp, flags);
> 	//flush stuff
> 	rcu_nocb_unlock_irqrestore(rdp, flags);
>
> Have:
>
> 	raw_spin_lock_irqsave(rdp->nocb_lock, flags);
> 	if (!rcu_rdp_is_offloaded(rdp))
> 		raw_spin_unlock_irqrestore(rdp->nocb_lock, flags);
> 		continue;
> 	}
> 	//flush stuff
> 	rcu_nocb_unlock_irqrestore(rdp, flags);
>
> But it's not pretty and also disqualifies the last two patches as
> rcu_nocb_mask can't be iterated safely anymore.
>
> What do you think?

The mutex_trylock() approach does have the advantage of simplicity,
and as you say should do well given low contention.

Which reminds me, what sort of test strategy did you have in mind?
Memory exhaustion can have surprising effects.

> > >  	/* Snapshot count of all CPUs */
> > >  	for_each_possible_cpu(cpu) {
> > >  		struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
> > > -		int _count = READ_ONCE(rdp->lazy_len);
> > > +		int _count;
> > > +
> > > +		if (!rcu_rdp_is_offloaded(rdp))
> > > +			continue;
> >
> > If the CPU is offloaded, isn't ->lazy_len guaranteed to be zero?
> >
> > Or can it contain garbage after a de-offloading operation?
>
> If it's deoffloaded, ->lazy_len is indeed (supposed to be) guaranteed to be zero.
> Bypass is flushed and disabled atomically early on de-offloading and the
> flush resets ->lazy_len.

Whew! At the moment, I don't feel strongly about whether or not
the following code should (1) read the value, (2) warn on non-zero,
(3) assume zero without reading, or (4) some other option that is not
occurring to me. Your choice!

							Thanx, Paul
Le Fri, Mar 24, 2023 at 03:51:54PM -0700, Paul E. McKenney a écrit :
> On Fri, Mar 24, 2023 at 11:09:08PM +0100, Frederic Weisbecker wrote:
> > Le Wed, Mar 22, 2023 at 04:18:24PM -0700, Paul E. McKenney a écrit :
> > > > @@ -1336,13 +1336,25 @@ lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> > > >  	unsigned long flags;
> > > >  	unsigned long count = 0;
> > > >
> > > > +	/*
> > > > +	 * Protect against concurrent (de-)offloading. Otherwise nocb locking
> > > > +	 * may be ignored or imbalanced.
> > > > +	 */
> > > > +	mutex_lock(&rcu_state.barrier_mutex);
> > >
> > > I was worried about this possibly leading to out-of-memory deadlock,
> > > but if I recall correctly, the (de-)offloading process never allocates
> > > memory, so this should be OK?
> >
> > Good point. It _should_ be fine but like you, Joel and Hillf pointed out
> > it's asking for trouble.
> >
> > We could try Joel's idea to use mutex_trylock() as a best effort, which
> > should be fine as it's mostly uncontended.
> >
> > The alternative is to force nocb locking and check the offloading state
> > right after. So instead of:
> >
> > 	rcu_nocb_lock_irqsave(rdp, flags);
> > 	//flush stuff
> > 	rcu_nocb_unlock_irqrestore(rdp, flags);
> >
> > Have:
> >
> > 	raw_spin_lock_irqsave(rdp->nocb_lock, flags);
> > 	if (!rcu_rdp_is_offloaded(rdp))
> > 		raw_spin_unlock_irqrestore(rdp->nocb_lock, flags);
> > 		continue;
> > 	}
> > 	//flush stuff
> > 	rcu_nocb_unlock_irqrestore(rdp, flags);
> >
> > But it's not pretty and also disqualifies the last two patches as
> > rcu_nocb_mask can't be iterated safely anymore.
> >
> > What do you think?
>
> The mutex_trylock() approach does have the advantage of simplicity,
> and as you say should do well given low contention.
>
> Which reminds me, what sort of test strategy did you have in mind?
> Memory exhaustion can have surprising effects.

The best I can do is to trigger the count and scan callbacks through
the shrinker debugfs and see if it crashes or not :-)

>
> > > >  	/* Snapshot count of all CPUs */
> > > >  	for_each_possible_cpu(cpu) {
> > > >  		struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
> > > > -		int _count = READ_ONCE(rdp->lazy_len);
> > > > +		int _count;
> > > > +
> > > > +		if (!rcu_rdp_is_offloaded(rdp))
> > > > +			continue;
> > >
> > > If the CPU is offloaded, isn't ->lazy_len guaranteed to be zero?
> > >
> > > Or can it contain garbage after a de-offloading operation?
> >
> > If it's deoffloaded, ->lazy_len is indeed (supposed to be) guaranteed to be zero.
> > Bypass is flushed and disabled atomically early on de-offloading and the
> > flush resets ->lazy_len.
>
> Whew! At the moment, I don't feel strongly about whether or not
> the following code should (1) read the value, (2) warn on non-zero,
> (3) assume zero without reading, or (4) some other option that is not
> occurring to me. Your choice!

(2) looks like a good idea!

Thanks.
On Sun, Mar 26, 2023 at 10:01:34PM +0200, Frederic Weisbecker wrote:
> Le Fri, Mar 24, 2023 at 03:51:54PM -0700, Paul E. McKenney a écrit :
> > On Fri, Mar 24, 2023 at 11:09:08PM +0100, Frederic Weisbecker wrote:
> > > Le Wed, Mar 22, 2023 at 04:18:24PM -0700, Paul E. McKenney a écrit :
> > > > > @@ -1336,13 +1336,25 @@ lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> > > > >  	unsigned long flags;
> > > > >  	unsigned long count = 0;
> > > > >
> > > > > +	/*
> > > > > +	 * Protect against concurrent (de-)offloading. Otherwise nocb locking
> > > > > +	 * may be ignored or imbalanced.
> > > > > +	 */
> > > > > +	mutex_lock(&rcu_state.barrier_mutex);
> > > >
> > > > I was worried about this possibly leading to out-of-memory deadlock,
> > > > but if I recall correctly, the (de-)offloading process never allocates
> > > > memory, so this should be OK?
> > >
> > > Good point. It _should_ be fine but like you, Joel and Hillf pointed out
> > > it's asking for trouble.
> > >
> > > We could try Joel's idea to use mutex_trylock() as a best effort, which
> > > should be fine as it's mostly uncontended.
> > >
> > > The alternative is to force nocb locking and check the offloading state
> > > right after. So instead of:
> > >
> > > 	rcu_nocb_lock_irqsave(rdp, flags);
> > > 	//flush stuff
> > > 	rcu_nocb_unlock_irqrestore(rdp, flags);
> > >
> > > Have:
> > >
> > > 	raw_spin_lock_irqsave(rdp->nocb_lock, flags);
> > > 	if (!rcu_rdp_is_offloaded(rdp))
> > > 		raw_spin_unlock_irqrestore(rdp->nocb_lock, flags);
> > > 		continue;
> > > 	}
> > > 	//flush stuff
> > > 	rcu_nocb_unlock_irqrestore(rdp, flags);
> > >
> > > But it's not pretty and also disqualifies the last two patches as
> > > rcu_nocb_mask can't be iterated safely anymore.
> > >
> > > What do you think?
> >
> > The mutex_trylock() approach does have the advantage of simplicity,
> > and as you say should do well given low contention.
> >
> > Which reminds me, what sort of test strategy did you have in mind?
> > Memory exhaustion can have surprising effects.
>
> The best I can do is to trigger the count and scan callbacks through
> the shrinker debugfs and see if it crashes or not :-)

Sounds like a good start. Maybe also a good finish? ;-)

> > > > > 	/* Snapshot count of all CPUs */
> > > > > 	for_each_possible_cpu(cpu) {
> > > > > 		struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
> > > > > -		int _count = READ_ONCE(rdp->lazy_len);
> > > > > +		int _count;
> > > > > +
> > > > > +		if (!rcu_rdp_is_offloaded(rdp))
> > > > > +			continue;
> > > >
> > > > If the CPU is offloaded, isn't ->lazy_len guaranteed to be zero?
> > > >
> > > > Or can it contain garbage after a de-offloading operation?
> > >
> > > If it's deoffloaded, ->lazy_len is indeed (supposed to be) guaranteed to be zero.
> > > Bypass is flushed and disabled atomically early on de-offloading and the
> > > flush resets ->lazy_len.
> >
> > Whew! At the moment, I don't feel strongly about whether or not
> > the following code should (1) read the value, (2) warn on non-zero,
> > (3) assume zero without reading, or (4) some other option that is not
> > occurring to me. Your choice!
>
> (2) looks like a good idea!

Sounds good to me!

							Thanx, Paul
On Sun, Mar 26, 2023 at 02:45:18PM -0700, Paul E. McKenney wrote:
> On Sun, Mar 26, 2023 at 10:01:34PM +0200, Frederic Weisbecker wrote:
> > > > > >  	/* Snapshot count of all CPUs */
> > > > > >  	for_each_possible_cpu(cpu) {
> > > > > >  		struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
> > > > > > -		int _count = READ_ONCE(rdp->lazy_len);
> > > > > > +		int _count;
> > > > > > +
> > > > > > +		if (!rcu_rdp_is_offloaded(rdp))
> > > > > > +			continue;
> > > > >
> > > > > If the CPU is offloaded, isn't ->lazy_len guaranteed to be zero?
> > > > >
> > > > > Or can it contain garbage after a de-offloading operation?
> > > >
> > > > If it's deoffloaded, ->lazy_len is indeed (supposed to be) guaranteed to be zero.
> > > > Bypass is flushed and disabled atomically early on de-offloading and the
> > > > flush resets ->lazy_len.
> > >
> > > Whew! At the moment, I don't feel strongly about whether or not
> > > the following code should (1) read the value, (2) warn on non-zero,
> > > (3) assume zero without reading, or (4) some other option that is not
> > > occurring to me. Your choice!
> >
> > (2) looks like a good idea!
>
> Sounds good to me!

So since we now iterate rcu_nocb_mask after the patchset, there is no more
deoffloaded rdp to check. Meanwhile I put a WARN in the new series making
sure that an rdp in rcu_nocb_mask is also offloaded (heh!)
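The WARN mentioned here presumably looks something like the sketch below (hypothetical, reconstructed from the discussion: iterate only the CPUs in rcu_nocb_mask and warn if one of them is unexpectedly not offloaded):

	/* Sketch: with rcu_nocb_mask iteration, a non-offloaded rdp is a bug. */
	for_each_cpu(cpu, rcu_nocb_mask) {
		struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);

		if (WARN_ON_ONCE(!rcu_rdp_is_offloaded(rdp)))
			continue;

		/* ... snapshot and flush rdp->lazy_len as before ... */
	}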
On Wed, Mar 29, 2023 at 06:07:58PM +0200, Frederic Weisbecker wrote:
> On Sun, Mar 26, 2023 at 02:45:18PM -0700, Paul E. McKenney wrote:
> > On Sun, Mar 26, 2023 at 10:01:34PM +0200, Frederic Weisbecker wrote:
> > > > > > >  	/* Snapshot count of all CPUs */
> > > > > > >  	for_each_possible_cpu(cpu) {
> > > > > > >  		struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
> > > > > > > -		int _count = READ_ONCE(rdp->lazy_len);
> > > > > > > +		int _count;
> > > > > > > +
> > > > > > > +		if (!rcu_rdp_is_offloaded(rdp))
> > > > > > > +			continue;
> > > > > >
> > > > > > If the CPU is offloaded, isn't ->lazy_len guaranteed to be zero?
> > > > > >
> > > > > > Or can it contain garbage after a de-offloading operation?
> > > > >
> > > > > If it's deoffloaded, ->lazy_len is indeed (supposed to be) guaranteed to be zero.
> > > > > Bypass is flushed and disabled atomically early on de-offloading and the
> > > > > flush resets ->lazy_len.
> > > >
> > > > Whew! At the moment, I don't feel strongly about whether or not
> > > > the following code should (1) read the value, (2) warn on non-zero,
> > > > (3) assume zero without reading, or (4) some other option that is not
> > > > occurring to me. Your choice!
> > >
> > > (2) looks like a good idea!
> >
> > Sounds good to me!
>
> So since we now iterate rcu_nocb_mask after the patchset, there is no more
> deoffloaded rdp to check. Meanwhile I put a WARN in the new series making
> sure that an rdp in rcu_nocb_mask is also offloaded (heh!)

Sounds good, thank you!

							Thanx, Paul
diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index f2280616f9d5..dd9b655ae533 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -1336,13 +1336,25 @@ lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 	unsigned long flags;
 	unsigned long count = 0;
 
+	/*
+	 * Protect against concurrent (de-)offloading. Otherwise nocb locking
+	 * may be ignored or imbalanced.
+	 */
+	mutex_lock(&rcu_state.barrier_mutex);
+
 	/* Snapshot count of all CPUs */
 	for_each_possible_cpu(cpu) {
 		struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
-		int _count = READ_ONCE(rdp->lazy_len);
+		int _count;
+
+		if (!rcu_rdp_is_offloaded(rdp))
+			continue;
+
+		_count = READ_ONCE(rdp->lazy_len);
 
 		if (_count == 0)
 			continue;
+
 		rcu_nocb_lock_irqsave(rdp, flags);
 		WRITE_ONCE(rdp->lazy_len, 0);
 		rcu_nocb_unlock_irqrestore(rdp, flags);
@@ -1352,6 +1364,9 @@ lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 		if (sc->nr_to_scan <= 0)
 			break;
 	}
+
+	mutex_unlock(&rcu_state.barrier_mutex);
+
 	return count ? count : SHRINK_STOP;
 }
 