Message ID | 20230406015629.1804722-3-yebin@huaweicloud.com |
---|---|
State | New |
Headers |
From: Ye Bin <yebin@huaweicloud.com>
To: dennis@kernel.org, tj@kernel.org, cl@linux.com, linux-mm@kvack.org, yury.norov@gmail.com, andriy.shevchenko@linux.intel.com, linux@rasmusvillemoes.dk
Cc: linux-kernel@vger.kernel.org, dchinner@redhat.com, yebin10@huawei.com, yebin@huaweicloud.com
Subject: [PATCH v2 2/2] lib/percpu_counter: fix dying cpu compare race
Date: Thu, 6 Apr 2023 09:56:29 +0800
Message-Id: <20230406015629.1804722-3-yebin@huaweicloud.com>
In-Reply-To: <20230406015629.1804722-1-yebin@huaweicloud.com>
References: <20230406015629.1804722-1-yebin@huaweicloud.com>
X-Mailing-List: linux-kernel@vger.kernel.org |
Series | fix dying cpu compare race |
Commit Message
Ye Bin
April 6, 2023, 1:56 a.m. UTC
From: Ye Bin <yebin10@huawei.com>

Commit 8b57b11cca88 ("pcpcntrs: fix dying cpu summation race") identified a
race condition between a CPU dying and percpu_counter_sum() iterating over
online CPUs. The same race exists between a CPU dying and
__percpu_counter_compare(), which uses num_online_cpus() for its quick
comparison. Since num_online_cpus() is decreased before
percpu_counter_cpu_dead() folds the dying CPU's count back into the global
counter, the quick comparison may return an incorrect result.

To close this race, also account for dying CPUs when doing the quick
comparison in __percpu_counter_compare().

Signed-off-by: Ye Bin <yebin10@huawei.com>
---
 lib/percpu_counter.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)
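For illustration, a minimal user-space sketch of the fast-path bound being
discussed (the function and parameter names are made up and the types are
simplified; this is not lib/percpu_counter.c): each CPU may hold up to
'batch' counts that have not yet been folded into the global counter, so
the cheap global read can only decide the comparison when the gap to 'rhs'
exceeds that worst-case error.

	/* Simplified sketch, not the kernel implementation. */
	#include <stdbool.h>
	#include <stdint.h>

	static bool rough_compare_is_conclusive(int64_t count, int64_t rhs,
						int32_t batch, unsigned int ncpus)
	{
		int64_t max_error = (int64_t)batch * ncpus;
		int64_t diff = count > rhs ? count - rhs : rhs - count;

		/* Only when the gap exceeds the worst-case per-CPU error can
		 * the cheap global read decide the comparison without doing
		 * a full percpu_counter_sum(). */
		return diff > max_error;
	}

The race described above is that 'ncpus' comes from num_online_cpus(),
which already excludes a dying CPU whose per-CPU delta has not yet been
folded back by percpu_counter_cpu_dead(), so 'max_error' is briefly one
'batch' too small.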
Comments
Hello,

On Thu, Apr 06, 2023 at 09:56:29AM +0800, Ye Bin wrote:
> From: Ye Bin <yebin10@huawei.com>
>
> Commit 8b57b11cca88 ("pcpcntrs: fix dying cpu summation race") identified a
> race condition between a CPU dying and percpu_counter_sum() iterating over
> online CPUs. The same race exists between a CPU dying and
> __percpu_counter_compare(), which uses num_online_cpus() for its quick
> comparison. Since num_online_cpus() is decreased before
> percpu_counter_cpu_dead() folds the dying CPU's count back into the global
> counter, the quick comparison may return an incorrect result.
>
> To close this race, also account for dying CPUs when doing the quick
> comparison in __percpu_counter_compare().
>

I've thought a lot about this since you sent v1. For the general problem,
you haven't addressed Dave's concerns from [1]. I agree you've found a
valid race condition, but as Dave mentioned, there's no synchronization in
__percpu_counter_compare() and consequently no guarantees about the
accuracy of the value.

However, I might be missing something, but I do think the use case in
5825bea05265 ("xfs: __percpu_counter_compare() inode count debug too
expensive") is potentially valid. If the rhs is an expected lower bound or
upper bound (depending on whether you're counting up or down, but not
both) and the count you're accounting has the same expectations as
percpu_refcount (you can only subtract what you've already added), then
should percpu_counter_sum() ever be on the wrong side of rhs, that should
be an error and visible via percpu_counter_compare().

I need to think a little longer, but my initial thought is that while you
close a race condition, the function itself is inherently vulnerable.

[1] https://lore.kernel.org/lkml/ZCu9LtdA+NMrfG9x@rh/

Thanks,
Dennis
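As a hedged illustration of the lower-bound use case Dennis describes (this
is not the XFS code from 5825bea05265; the counter name and helper are
invented for the example), a counter that only ever gives back what was
previously added should never compare below zero:

	#include <linux/bug.h>
	#include <linux/percpu_counter.h>

	/* Illustrative only: 'free_count' is only ever decremented by
	 * amounts that were previously added, so its exact sum can never
	 * legitimately go negative. */
	static void sanity_check_free_count(struct percpu_counter *free_count)
	{
		/* A result of -1 means the sum fell below the expected lower
		 * bound: either an accounting bug, or (per this thread) a
		 * dying-CPU race in the compare fast path. */
		WARN_ON_ONCE(percpu_counter_compare(free_count, 0) < 0);
	}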
On Thu, Apr 06 2023 at 09:56, Ye Bin wrote:
> From: Ye Bin <yebin10@huawei.com>
>
> Commit 8b57b11cca88 ("pcpcntrs: fix dying cpu summation race") identified a
> race condition between a CPU dying and percpu_counter_sum() iterating over
> online CPUs. The same race exists between a CPU dying and
> __percpu_counter_compare(), which uses num_online_cpus() for its quick
> comparison. Since num_online_cpus() is decreased before
> percpu_counter_cpu_dead() folds the dying CPU's count back into the global
> counter, the quick comparison may return an incorrect result.
>
> To close this race, also account for dying CPUs when doing the quick
> comparison in __percpu_counter_compare().

This is all bogus, including the above commit.

All CPU masks, including cpu_online_mask and cpu_dying_mask, are only
valid when the context is:

  - a CPU hotplug callback, or

  - any other code which holds cpu_hotplug_lock read or write locked.

cpu_online_mask is special in that regard. It is also protected against
modification in any preemption-disabled region, but that's a pure
implementation detail.

cpu_dying_mask is a random number generator w/o cpu_hotplug_lock being
held, and even with that lock held any cpumask operation on it is silly.
The mask is a core detail:

  commit e40f74c535b8 "cpumask: Introduce DYING mask"

    Introduce a cpumask that indicates (for each CPU) what direction the
    CPU hotplug is currently going. Notably, it tracks rollbacks. Eg. when
    an up fails and we do a roll-back down, it will accurately reflect the
    direction.

It does not tell anything to a user which is not aware of the actual
hotplug state machine state.

The real reason for this percpu counter issue is how the percpu counter
hotplug callbacks are implemented: they are asymmetric and in the
completely wrong place.

The above 8b57b11cca88 ("pcpcntrs: fix dying cpu summation race") was done
via XFS and without the courtesy of CC'ing the people who care about the
CPU hotplug core. The lengthy analysis in that commit is all shiny, but
fundamentally wrong. See above.

I'm way too tired to come up with a proper fix for this mess, but I'm
going to stare at it tomorrow morning with brain awake.

Thanks,

        tglx
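A minimal sketch of the locking rule Thomas states (illustrative only; the
helper name is invented): outside a hotplug callback, a stable view of the
CPU masks requires holding cpu_hotplug_lock on the read side, e.g. via
cpus_read_lock().

	#include <linux/cpu.h>
	#include <linux/cpumask.h>

	static unsigned int stable_online_cpus(void)
	{
		unsigned int n;

		cpus_read_lock();	/* holds off CPU hotplug transitions */
		n = num_online_cpus();	/* consistent only until the unlock below */
		cpus_read_unlock();

		return n;
	}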
On Mon, Apr 10 2023 at 22:53, Thomas Gleixner wrote:
> On Thu, Apr 06 2023 at 09:56, Ye Bin wrote:
>
> cpu_dying_mask is a random number generator w/o cpu_hotplug_lock being
> held, and even with that lock held any cpumask operation on it is silly.
> The mask is a core detail:
>
>   commit e40f74c535b8 "cpumask: Introduce DYING mask"
>
>     Introduce a cpumask that indicates (for each CPU) what direction the
>     CPU hotplug is currently going. Notably, it tracks rollbacks. Eg. when
>     an up fails and we do a roll-back down, it will accurately reflect the
>     direction.
>
> It does not tell anything to a user which is not aware of the actual
> hotplug state machine state.

Even if the mask is most of the time stable, it's a total disaster
performance wise.

The bits in cpu_dying_mask are sticky until the next online operation. So
for a system which has SMT enabled in BIOS, but SMT is disabled on the
kernel command line or later via the sysfs knob, this means that the loop
in __percpu_counter_sum() will iterate over all shutdown SMT siblings
forever.

IOW, it will touch nr_of_shutdown_cpus() cachelines for absolutely zero
reason. The same applies for the proposed counter.

Thanks,

        tglx
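A sketch of the overhead being described (illustrative only, not the
percpu_counter code; the function name is invented): a sum that also walks
cpu_dying_mask keeps touching the per-CPU cachelines of CPUs whose sticky
dying bits were never cleared, such as SMT siblings shut down via the
sysfs knob.

	#include <linux/cpumask.h>
	#include <linux/percpu.h>
	#include <linux/percpu_counter.h>

	static s64 sum_online_and_dying(struct percpu_counter *fbc)
	{
		s64 sum = fbc->count;
		int cpu;

		for_each_possible_cpu(cpu) {
			if (!cpumask_test_cpu(cpu, cpu_online_mask) &&
			    !cpumask_test_cpu(cpu, cpu_dying_mask))
				continue;
			/* For a long-offline SMT sibling whose dying bit is
			 * still set, this read pulls in a cacheline for no
			 * benefit. */
			sum += *per_cpu_ptr(fbc->counters, cpu);
		}
		return sum;
	}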
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index 5004463c4f9f..399840cb0012 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -227,6 +227,15 @@ static int percpu_counter_cpu_dead(unsigned int cpu)
 	return 0;
 }
 
+static __always_inline unsigned int num_count_cpus(void)
+{
+#ifdef CONFIG_HOTPLUG_CPU
+	return (num_online_cpus() + num_dying_cpus());
+#else
+	return num_online_cpus();
+#endif
+}
+
 /*
  * Compare counter against given value.
  * Return 1 if greater, 0 if equal and -1 if less
@@ -237,7 +246,7 @@ int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch)
 
 	count = percpu_counter_read(fbc);
 	/* Check to see if rough count will be sufficient for comparison */
-	if (abs(count - rhs) > (batch * num_online_cpus())) {
+	if (abs(count - rhs) > (batch * num_count_cpus())) {
 		if (count > rhs)
 			return 1;
 		else