Message ID | 20230124173126.3492345-1-joel@joelfernandes.org |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp2282929wrn; Tue, 24 Jan 2023 09:38:44 -0800 (PST) X-Google-Smtp-Source: AMrXdXuosyHfuDOBSr5lCc/yQsZkz1vdY27bVL9rHZwZt3JQpAZ57raLWdHVLJGD5zukNFAKYxCE X-Received: by 2002:aa7:dc03:0:b0:49e:1d4f:a9f0 with SMTP id b3-20020aa7dc03000000b0049e1d4fa9f0mr29783341edu.42.1674581924631; Tue, 24 Jan 2023 09:38:44 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1674581924; cv=none; d=google.com; s=arc-20160816; b=ns6/a1ubsvtK4BffS39Qnr/NJPfjP48asN/auBxx1vdMl3nTp7tWmGTttd83jKK+sy SLHJnybphbzFh/FF9DhCJc1M/G6oF4rA20lac+wGQ03XMak8H7ypkXkD/Wxry9XrYny0 MjhD5ZBUsw1RmUh7HUhqzyW+hu+1UlEBdvJhFN5v6BkCY5vwKqMyAqbYF82LHEdSSdLA jvcvn6dN6+4BHuZ2TZA4Gg+iKBXXu+03EH8oHsXF6dm1YksiKvqkYOmFbdM7fbthNs46 AoZbv8GI8csgDHgDWXDROek5PYeleihfBgyYnLE2CHy+ISzoo5a7ibOzXFFUt2Zt3wzk NJ4A== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :message-id:date:subject:cc:to:from:dkim-signature; bh=QxC8g2x7UGo1iIB34KSAj3wYyngwomKHsPwBL9NxZa8=; b=RzORP9eD95kocBrFkPniAUnVQwnkb0vS3DoddWsvfmfvPiti03XMYAjIcErmgosNbd y6W9L3E/XpSTS00PUpG9Ix7Y/jk331Y4RH84WSqJVhZgFqPLeBMHuuvabQWqPScGlvx6 iVxT7zXee4am8mlf3B9g8grNBwUpIPjbQJl+sH/6IA5TJxLWRI8Lppdh8zGXC3mBwXQa RovOzEYR40ERPXe+m7f84cOpzxCiNZ+fwGgz3sooLsYwT6y7HSyahZ39jQHhCj68afWW 1gltndyuEgBGWI49F6G2AaQFFLdQtvxN5vkuOUqfS6xFFnbpAeeHil2VMHPvB2chM0f0 F4rg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@joelfernandes.org header.s=google header.b=KfrA5Bb0; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id l10-20020aa7c30a000000b004a08bb3338asi2512925edq.605.2023.01.24.09.38.13; Tue, 24 Jan 2023 09:38:44 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@joelfernandes.org header.s=google header.b=KfrA5Bb0; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234589AbjAXRbv (ORCPT <rfc822;rust.linux@gmail.com> + 99 others); Tue, 24 Jan 2023 12:31:51 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34224 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234544AbjAXRbo (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Tue, 24 Jan 2023 12:31:44 -0500 Received: from mail-vk1-xa29.google.com (mail-vk1-xa29.google.com [IPv6:2607:f8b0:4864:20::a29]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 68DD94E522 for <linux-kernel@vger.kernel.org>; Tue, 24 Jan 2023 09:31:36 -0800 (PST) Received: by mail-vk1-xa29.google.com with SMTP id t2so7961451vkk.9 for <linux-kernel@vger.kernel.org>; Tue, 24 Jan 2023 09:31:36 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=joelfernandes.org; s=google; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:from:to:cc:subject:date:message-id:reply-to; bh=QxC8g2x7UGo1iIB34KSAj3wYyngwomKHsPwBL9NxZa8=; b=KfrA5Bb0OO3a9PdjN+skVpLoXDxc0nIfU6VmYVEyRX2iUjwYvf6ObVP96UCWp57YgY hYimpfA/fLUMfX1ab8HNx3qqtuXZAJp1mnSCfbvrt63wwLugnSO2Vrc8HuHQqBvZoJPt 2OPlwOX0QaDtXErT9GVLWQJXIqnTVA4I/ba6o= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=QxC8g2x7UGo1iIB34KSAj3wYyngwomKHsPwBL9NxZa8=; b=EIWz+CO8Rl+8fd+qRmC2bzgTcXuz7jsNOx0YEYXMz0hxT8cuS2sn+oarx1m9Sbi/CI l0Xmh2an88HuRSY1fVs7IWKnfjFbSPV9ggODWobWEyyhnrafdTNX8dIOw2rMAmr3Q8F7 au/nJE2bGY99oXtoLWh6eoFjrMiTl4l3xMyVyjt8BfHFBKhCcFVQa8/SX3iGUrl/CqI4 7DH49XTOXZpex4qKhaxzWnl4qcb5l3vGSetIH0asG6NTYocI0EcHVlZAs5tbo3KESt1F nXs6hvzdRn806SJC73/EPTKQP0rqPiT7ep0uhIJzjhE5ox3sBrEq3TUBJbelVU+jtzV/ /Tcg== X-Gm-Message-State: AFqh2kr4Cr8SsKG0W6I41zpQKZhxslKBUjnb/lauWzDvBwJ3nnwgbHRk v5F29DTGpcBi7Lnk6z3HeW1O8z8CdXL17TEc X-Received: by 2002:ac5:cb7a:0:b0:3e1:69c4:65e7 with SMTP id l26-20020ac5cb7a000000b003e169c465e7mr17790577vkn.14.1674581494792; Tue, 24 Jan 2023 09:31:34 -0800 (PST) Received: from joelboxx.c.googlers.com.com (129.239.188.35.bc.googleusercontent.com. [35.188.239.129]) by smtp.gmail.com with ESMTPSA id x11-20020a05620a098b00b007055dce4cecsm1741073qkx.97.2023.01.24.09.31.33 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 24 Jan 2023 09:31:34 -0800 (PST) From: "Joel Fernandes (Google)" <joel@joelfernandes.org> To: linux-kernel@vger.kernel.org Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>, Greg Kroah-Hartman <gregkh@linuxfoundation.org>, Frederic Weisbecker <frederic@kernel.org>, "Paul E. McKenney" <paulmck@kernel.org>, Zhouyi Zhou <zhouzhouyi@gmail.com>, Will Deacon <will@kernel.org>, Marc Zyngier <maz@kernel.org>, rcu <rcu@vger.kernel.org>, stable@vger.kernel.org, Frederic Weisbecker <fweisbec@gmail.com>, Ingo Molnar <mingo@kernel.org>, "Rafael J. Wysocki" <rafael@kernel.org>, Thomas Gleixner <tglx@linutronix.de> Subject: [PATCH v2 RESEND] tick/nohz: Fix cpu_is_hotpluggable() by checking with nohz subsystem Date: Tue, 24 Jan 2023 17:31:26 +0000 Message-Id: <20230124173126.3492345-1-joel@joelfernandes.org> X-Mailer: git-send-email 2.39.1.405.gd4c25cc71f-goog MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-0.4 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,URIBL_BLACK autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1755926255730836202?= X-GMAIL-MSGID: =?utf-8?q?1755926416074269821?= |
Series |
[v2,RESEND] tick/nohz: Fix cpu_is_hotpluggable() by checking with nohz subsystem
|
|
Commit Message
Joel Fernandes
Jan. 24, 2023, 5:31 p.m. UTC
For CONFIG_NO_HZ_FULL systems, the tick_do_timer_cpu cannot be offlined. However, cpu_is_hotpluggable() still returns true for those CPUs. This causes torture tests that do offlining to end up trying to offline this CPU causing test failures. Such failure happens on all architectures. Fix it by asking the opinion of the nohz subsystem on whether the CPU can be hotplugged. [ Apply Frederic Weisbecker feedback on refactoring tick_nohz_cpu_down(). ] For drivers/base/ portion: Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Zhouyi Zhou <zhouzhouyi@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Marc Zyngier <maz@kernel.org> Cc: rcu <rcu@vger.kernel.org> Cc: stable@vger.kernel.org Fixes: 2987557f52b9 ("driver-core/cpu: Expose hotpluggability to the rest of the kernel") Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> --- Sorry, resending with CC to stable. drivers/base/cpu.c | 3 ++- include/linux/tick.h | 2 ++ kernel/time/tick-sched.c | 11 ++++++++--- 3 files changed, 12 insertions(+), 4 deletions(-)
Comments
On Tue, Jan 24, 2023 at 05:31:26PM +0000, Joel Fernandes (Google) wrote: > For CONFIG_NO_HZ_FULL systems, the tick_do_timer_cpu cannot be offlined. > However, cpu_is_hotpluggable() still returns true for those CPUs. This causes > torture tests that do offlining to end up trying to offline this CPU causing > test failures. Such failure happens on all architectures. > > Fix it by asking the opinion of the nohz subsystem on whether the CPU can > be hotplugged. > > [ Apply Frederic Weisbecker feedback on refactoring tick_nohz_cpu_down(). ] > > For drivers/base/ portion: > Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > > Cc: Frederic Weisbecker <frederic@kernel.org> > Cc: "Paul E. McKenney" <paulmck@kernel.org> > Cc: Zhouyi Zhou <zhouzhouyi@gmail.com> > Cc: Will Deacon <will@kernel.org> > Cc: Marc Zyngier <maz@kernel.org> > Cc: rcu <rcu@vger.kernel.org> > Cc: stable@vger.kernel.org > Fixes: 2987557f52b9 ("driver-core/cpu: Expose hotpluggability to the rest of the kernel") > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Queued for further review and testing, thank you both! It might be a few hours until it becomes publicly visible, but it will get there. Thanx, Paul > --- > Sorry, resending with CC to stable. > > drivers/base/cpu.c | 3 ++- > include/linux/tick.h | 2 ++ > kernel/time/tick-sched.c | 11 ++++++++--- > 3 files changed, 12 insertions(+), 4 deletions(-) > > diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c > index 55405ebf23ab..450dca235a2f 100644 > --- a/drivers/base/cpu.c > +++ b/drivers/base/cpu.c > @@ -487,7 +487,8 @@ static const struct attribute_group *cpu_root_attr_groups[] = { > bool cpu_is_hotpluggable(unsigned int cpu) > { > struct device *dev = get_cpu_device(cpu); > - return dev && container_of(dev, struct cpu, dev)->hotpluggable; > + return dev && container_of(dev, struct cpu, dev)->hotpluggable > + && tick_nohz_cpu_hotpluggable(cpu); > } > EXPORT_SYMBOL_GPL(cpu_is_hotpluggable); > > diff --git a/include/linux/tick.h b/include/linux/tick.h > index bfd571f18cfd..9459fef5b857 100644 > --- a/include/linux/tick.h > +++ b/include/linux/tick.h > @@ -216,6 +216,7 @@ extern void tick_nohz_dep_set_signal(struct task_struct *tsk, > enum tick_dep_bits bit); > extern void tick_nohz_dep_clear_signal(struct signal_struct *signal, > enum tick_dep_bits bit); > +extern bool tick_nohz_cpu_hotpluggable(unsigned int cpu); > > /* > * The below are tick_nohz_[set,clear]_dep() wrappers that optimize off-cases > @@ -280,6 +281,7 @@ static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask) { } > > static inline void tick_nohz_dep_set_cpu(int cpu, enum tick_dep_bits bit) { } > static inline void tick_nohz_dep_clear_cpu(int cpu, enum tick_dep_bits bit) { } > +static inline bool tick_nohz_cpu_hotpluggable(unsigned int cpu) { return true; } > > static inline void tick_dep_set(enum tick_dep_bits bit) { } > static inline void tick_dep_clear(enum tick_dep_bits bit) { } > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c > index 9c6f661fb436..63e3e8ebcd64 100644 > --- a/kernel/time/tick-sched.c > +++ b/kernel/time/tick-sched.c > @@ -510,7 +510,7 @@ void __init tick_nohz_full_setup(cpumask_var_t cpumask) > tick_nohz_full_running = true; > } > > -static int tick_nohz_cpu_down(unsigned int cpu) > +bool tick_nohz_cpu_hotpluggable(unsigned int cpu) > { > /* > * The tick_do_timer_cpu CPU handles housekeeping duty (unbound > @@ -518,8 +518,13 @@ static int tick_nohz_cpu_down(unsigned int cpu) > * CPUs. It must remain online when nohz full is enabled. > */ > if (tick_nohz_full_running && tick_do_timer_cpu == cpu) > - return -EBUSY; > - return 0; > + return false; > + return true; > +} > + > +static int tick_nohz_cpu_down(unsigned int cpu) > +{ > + return tick_nohz_cpu_hotpluggable(cpu) ? 0 : -EBUSY; > } > > void __init tick_nohz_init(void) > -- > 2.39.1.405.gd4c25cc71f-goog >
On Wed, Jan 25, 2023 at 6:56 AM Paul E. McKenney <paulmck@kernel.org> wrote: > > On Tue, Jan 24, 2023 at 05:31:26PM +0000, Joel Fernandes (Google) wrote: > > For CONFIG_NO_HZ_FULL systems, the tick_do_timer_cpu cannot be offlined. > > However, cpu_is_hotpluggable() still returns true for those CPUs. This causes > > torture tests that do offlining to end up trying to offline this CPU causing > > test failures. Such failure happens on all architectures. > > > > Fix it by asking the opinion of the nohz subsystem on whether the CPU can > > be hotplugged. > > > > [ Apply Frederic Weisbecker feedback on refactoring tick_nohz_cpu_down(). ] > > > > For drivers/base/ portion: > > Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > > > > Cc: Frederic Weisbecker <frederic@kernel.org> > > Cc: "Paul E. McKenney" <paulmck@kernel.org> > > Cc: Zhouyi Zhou <zhouzhouyi@gmail.com> > > Cc: Will Deacon <will@kernel.org> > > Cc: Marc Zyngier <maz@kernel.org> > > Cc: rcu <rcu@vger.kernel.org> > > Cc: stable@vger.kernel.org > > Fixes: 2987557f52b9 ("driver-core/cpu: Expose hotpluggability to the rest of the kernel") > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > Queued for further review and testing, thank you both! > > It might be a few hours until it becomes publicly visible, but it will > get there. A new round of rcutorture test on fixed linux-5.15.y[3] has been performed in the PPC VM of Open Source Lab of Oregon State University [1], which will last about 29 hours. The test result on original linux-5.15.y is at [2]. [1] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/res/2023.01.25-00.04.30-torture/ [2] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/res/2023.01.18-13.22.39-torture/ [3] http://140.211.169.189/linux-stable-rc/ Happy to benefit the community and this is a fruitful learning journey ;-) Cheers Zhouyi > > Thanx, Paul > > > --- > > Sorry, resending with CC to stable. > > > > drivers/base/cpu.c | 3 ++- > > include/linux/tick.h | 2 ++ > > kernel/time/tick-sched.c | 11 ++++++++--- > > 3 files changed, 12 insertions(+), 4 deletions(-) > > > > diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c > > index 55405ebf23ab..450dca235a2f 100644 > > --- a/drivers/base/cpu.c > > +++ b/drivers/base/cpu.c > > @@ -487,7 +487,8 @@ static const struct attribute_group *cpu_root_attr_groups[] = { > > bool cpu_is_hotpluggable(unsigned int cpu) > > { > > struct device *dev = get_cpu_device(cpu); > > - return dev && container_of(dev, struct cpu, dev)->hotpluggable; > > + return dev && container_of(dev, struct cpu, dev)->hotpluggable > > + && tick_nohz_cpu_hotpluggable(cpu); > > } > > EXPORT_SYMBOL_GPL(cpu_is_hotpluggable); > > > > diff --git a/include/linux/tick.h b/include/linux/tick.h > > index bfd571f18cfd..9459fef5b857 100644 > > --- a/include/linux/tick.h > > +++ b/include/linux/tick.h > > @@ -216,6 +216,7 @@ extern void tick_nohz_dep_set_signal(struct task_struct *tsk, > > enum tick_dep_bits bit); > > extern void tick_nohz_dep_clear_signal(struct signal_struct *signal, > > enum tick_dep_bits bit); > > +extern bool tick_nohz_cpu_hotpluggable(unsigned int cpu); > > > > /* > > * The below are tick_nohz_[set,clear]_dep() wrappers that optimize off-cases > > @@ -280,6 +281,7 @@ static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask) { } > > > > static inline void tick_nohz_dep_set_cpu(int cpu, enum tick_dep_bits bit) { } > > static inline void tick_nohz_dep_clear_cpu(int cpu, enum tick_dep_bits bit) { } > > +static inline bool tick_nohz_cpu_hotpluggable(unsigned int cpu) { return true; } > > > > static inline void tick_dep_set(enum tick_dep_bits bit) { } > > static inline void tick_dep_clear(enum tick_dep_bits bit) { } > > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c > > index 9c6f661fb436..63e3e8ebcd64 100644 > > --- a/kernel/time/tick-sched.c > > +++ b/kernel/time/tick-sched.c > > @@ -510,7 +510,7 @@ void __init tick_nohz_full_setup(cpumask_var_t cpumask) > > tick_nohz_full_running = true; > > } > > > > -static int tick_nohz_cpu_down(unsigned int cpu) > > +bool tick_nohz_cpu_hotpluggable(unsigned int cpu) > > { > > /* > > * The tick_do_timer_cpu CPU handles housekeeping duty (unbound > > @@ -518,8 +518,13 @@ static int tick_nohz_cpu_down(unsigned int cpu) > > * CPUs. It must remain online when nohz full is enabled. > > */ > > if (tick_nohz_full_running && tick_do_timer_cpu == cpu) > > - return -EBUSY; > > - return 0; > > + return false; > > + return true; > > +} > > + > > +static int tick_nohz_cpu_down(unsigned int cpu) > > +{ > > + return tick_nohz_cpu_hotpluggable(cpu) ? 0 : -EBUSY; > > } > > > > void __init tick_nohz_init(void) > > -- > > 2.39.1.405.gd4c25cc71f-goog > >
> On Jan 24, 2023, at 7:13 PM, Zhouyi Zhou <zhouzhouyi@gmail.com> wrote: > > On Wed, Jan 25, 2023 at 6:56 AM Paul E. McKenney <paulmck@kernel.org> wrote: >> >>> On Tue, Jan 24, 2023 at 05:31:26PM +0000, Joel Fernandes (Google) wrote: >>> For CONFIG_NO_HZ_FULL systems, the tick_do_timer_cpu cannot be offlined. >>> However, cpu_is_hotpluggable() still returns true for those CPUs. This causes >>> torture tests that do offlining to end up trying to offline this CPU causing >>> test failures. Such failure happens on all architectures. >>> >>> Fix it by asking the opinion of the nohz subsystem on whether the CPU can >>> be hotplugged. >>> >>> [ Apply Frederic Weisbecker feedback on refactoring tick_nohz_cpu_down(). ] >>> >>> For drivers/base/ portion: >>> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> >>> >>> Cc: Frederic Weisbecker <frederic@kernel.org> >>> Cc: "Paul E. McKenney" <paulmck@kernel.org> >>> Cc: Zhouyi Zhou <zhouzhouyi@gmail.com> >>> Cc: Will Deacon <will@kernel.org> >>> Cc: Marc Zyngier <maz@kernel.org> >>> Cc: rcu <rcu@vger.kernel.org> >>> Cc: stable@vger.kernel.org >>> Fixes: 2987557f52b9 ("driver-core/cpu: Expose hotpluggability to the rest of the kernel") >>> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> >> >> Queued for further review and testing, thank you both! >> >> It might be a few hours until it becomes publicly visible, but it will >> get there. > A new round of rcutorture test on fixed linux-5.15.y[3] has been > performed in the PPC VM of Open Source Lab of Oregon State University > [1], which will last about 29 hours. The test result on original > linux-5.15.y is at [2]. > > [1] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/res/2023.01.25-00.04.30-torture/ > [2] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/res/2023.01.18-13.22.39-torture/ > [3] http://140.211.169.189/linux-stable-rc/ > > Happy to benefit the community and this is a fruitful learning journey ;-) Thanks a lot Zhouyi! After testing something your reply can also optionally include: Tested-by: Zhouyi Zhou <zhouzhoui@gmail.com> Thanks. Have fun at the university. I miss my university days. - Joel > > Cheers > Zhouyi >> >> Thanx, Paul >> >>> --- >>> Sorry, resending with CC to stable. >>> >>> drivers/base/cpu.c | 3 ++- >>> include/linux/tick.h | 2 ++ >>> kernel/time/tick-sched.c | 11 ++++++++--- >>> 3 files changed, 12 insertions(+), 4 deletions(-) >>> >>> diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c >>> index 55405ebf23ab..450dca235a2f 100644 >>> --- a/drivers/base/cpu.c >>> +++ b/drivers/base/cpu.c >>> @@ -487,7 +487,8 @@ static const struct attribute_group *cpu_root_attr_groups[] = { >>> bool cpu_is_hotpluggable(unsigned int cpu) >>> { >>> struct device *dev = get_cpu_device(cpu); >>> - return dev && container_of(dev, struct cpu, dev)->hotpluggable; >>> + return dev && container_of(dev, struct cpu, dev)->hotpluggable >>> + && tick_nohz_cpu_hotpluggable(cpu); >>> } >>> EXPORT_SYMBOL_GPL(cpu_is_hotpluggable); >>> >>> diff --git a/include/linux/tick.h b/include/linux/tick.h >>> index bfd571f18cfd..9459fef5b857 100644 >>> --- a/include/linux/tick.h >>> +++ b/include/linux/tick.h >>> @@ -216,6 +216,7 @@ extern void tick_nohz_dep_set_signal(struct task_struct *tsk, >>> enum tick_dep_bits bit); >>> extern void tick_nohz_dep_clear_signal(struct signal_struct *signal, >>> enum tick_dep_bits bit); >>> +extern bool tick_nohz_cpu_hotpluggable(unsigned int cpu); >>> >>> /* >>> * The below are tick_nohz_[set,clear]_dep() wrappers that optimize off-cases >>> @@ -280,6 +281,7 @@ static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask) { } >>> >>> static inline void tick_nohz_dep_set_cpu(int cpu, enum tick_dep_bits bit) { } >>> static inline void tick_nohz_dep_clear_cpu(int cpu, enum tick_dep_bits bit) { } >>> +static inline bool tick_nohz_cpu_hotpluggable(unsigned int cpu) { return true; } >>> >>> static inline void tick_dep_set(enum tick_dep_bits bit) { } >>> static inline void tick_dep_clear(enum tick_dep_bits bit) { } >>> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c >>> index 9c6f661fb436..63e3e8ebcd64 100644 >>> --- a/kernel/time/tick-sched.c >>> +++ b/kernel/time/tick-sched.c >>> @@ -510,7 +510,7 @@ void __init tick_nohz_full_setup(cpumask_var_t cpumask) >>> tick_nohz_full_running = true; >>> } >>> >>> -static int tick_nohz_cpu_down(unsigned int cpu) >>> +bool tick_nohz_cpu_hotpluggable(unsigned int cpu) >>> { >>> /* >>> * The tick_do_timer_cpu CPU handles housekeeping duty (unbound >>> @@ -518,8 +518,13 @@ static int tick_nohz_cpu_down(unsigned int cpu) >>> * CPUs. It must remain online when nohz full is enabled. >>> */ >>> if (tick_nohz_full_running && tick_do_timer_cpu == cpu) >>> - return -EBUSY; >>> - return 0; >>> + return false; >>> + return true; >>> +} >>> + >>> +static int tick_nohz_cpu_down(unsigned int cpu) >>> +{ >>> + return tick_nohz_cpu_hotpluggable(cpu) ? 0 : -EBUSY; >>> } >>> >>> void __init tick_nohz_init(void) >>> -- >>> 2.39.1.405.gd4c25cc71f-goog >>>
On Wed, Jan 25, 2023 at 8:13 AM Zhouyi Zhou <zhouzhouyi@gmail.com> wrote: > > On Wed, Jan 25, 2023 at 6:56 AM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > On Tue, Jan 24, 2023 at 05:31:26PM +0000, Joel Fernandes (Google) wrote: > > > For CONFIG_NO_HZ_FULL systems, the tick_do_timer_cpu cannot be offlined. > > > However, cpu_is_hotpluggable() still returns true for those CPUs. This causes > > > torture tests that do offlining to end up trying to offline this CPU causing > > > test failures. Such failure happens on all architectures. > > > > > > Fix it by asking the opinion of the nohz subsystem on whether the CPU can > > > be hotplugged. > > > > > > [ Apply Frederic Weisbecker feedback on refactoring tick_nohz_cpu_down(). ] > > > > > > For drivers/base/ portion: > > > Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > > > > > > Cc: Frederic Weisbecker <frederic@kernel.org> > > > Cc: "Paul E. McKenney" <paulmck@kernel.org> > > > Cc: Zhouyi Zhou <zhouzhouyi@gmail.com> > > > Cc: Will Deacon <will@kernel.org> > > > Cc: Marc Zyngier <maz@kernel.org> > > > Cc: rcu <rcu@vger.kernel.org> > > > Cc: stable@vger.kernel.org > > > Fixes: 2987557f52b9 ("driver-core/cpu: Expose hotpluggability to the rest of the kernel") > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > > > Queued for further review and testing, thank you both! > > > > It might be a few hours until it becomes publicly visible, but it will > > get there. > A new round of rcutorture test on fixed linux-5.15.y[3] has been > performed in the PPC VM of Open Source Lab of Oregon State University > [1], which will last about 29 hours. The test result on original > linux-5.15.y is at [2]. From the result of [1], the HOTPLUG failure reports have been eliminated, but a new kernel null point bug related to scsi module has been reported [4] ;-( [ 5.178733][ C1] BUG: Kernel NULL pointer dereference on read at 0x00000008 ... [ 5.231013][ C1] [c00000001ff9fca0] [c0000000009ffbc8] scsi_end_request+0xd8/0x1f0 (unreliable)^M [ 5.234961][ C1] [c00000001ff9fcf0] [c000000000a00e68] scsi_io_completion+0x88/0x700^M [ 5.237863][ C1] [c00000001ff9fda0] [c0000000009f5028] scsi_finish_command+0xe8/0x150^M [ 5.240089][ C1] [c00000001ff9fdf0] [c000000000a00c70] scsi_complete+0x90/0x140^M [ 5.242481][ C1] [c00000001ff9fe20] [c0000000007e5170] blk_complete_reqs+0x80/0xa0^M [ 5.245187][ C1] [c00000001ff9fe50] [c000000000f0b5d0] __do_softirq+0x1e0/0x4e0^M [ 5.248479][ C1] [c00000001ff9ff90] [c0000000000170e8] do_softirq_own_stack+0x48/0x60^M [ 5.250919][ C1] [c00000000a5e7c40] [c00000000a5e7c80] 0xc00000000a5e7c80^M [ 5.253792][ C1] [c00000000a5e7c70] [c0000000001534c0] do_softirq+0xb0/0xc0^M [ 5.256824][ C1] [c00000000a5e7ca0] [c0000000001535ac] __local_bh_enable_ip+0xdc/0x110^M [ 5.259414][ C1] [c00000000a5e7cc0] [c0000000001d75e8] irq_forced_thread_fn+0xc8/0xf0^M [ 5.261921][ C1] [c00000000a5e7d00] [c0000000001d7ae4] irq_thread+0x1b4/0x2a0^M [ 5.265298][ C1] [c00000000a5e7da0] [c00000000017d8c8] kthread+0x1a8/0x1d0^M [ 5.269184][ C1] [c00000000a5e7e10] [c00000000000cee4] ret_from_kernel_thread+0x5c/0x64^M But when I invoked [5] by hand, the bug did not show again [6]. [4] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/res/2023.01.25-00.04.30-torture/results-clocksourcewd-2/TREE03/console.log [5] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/res/2023.01.25-00.04.30-torture/results-clocksourcewd-2/TREE03/qemu-cmd [6] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/res/2023.01.25-00.04.30-torture/results-clocksourcewd-2/TREE03/console.zzy.log I think the bug is not caused by Joel's patch, there must be some other reason. I am starting the 29 hours' rcutorture test again. And I can give ssh access to you if you are interested in it. Sorry for the inconvenience that I bring Thanks Zhouyi > > [1] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/res/2023.01.25-00.04.30-torture/ > [2] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/res/2023.01.18-13.22.39-torture/ > [3] http://140.211.169.189/linux-stable-rc/ > > Happy to benefit the community and this is a fruitful learning journey ;-) > > Cheers > Zhouyi > > > > Thanx, Paul > > > > > --- > > > Sorry, resending with CC to stable. > > > > > > drivers/base/cpu.c | 3 ++- > > > include/linux/tick.h | 2 ++ > > > kernel/time/tick-sched.c | 11 ++++++++--- > > > 3 files changed, 12 insertions(+), 4 deletions(-) > > > > > > diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c > > > index 55405ebf23ab..450dca235a2f 100644 > > > --- a/drivers/base/cpu.c > > > +++ b/drivers/base/cpu.c > > > @@ -487,7 +487,8 @@ static const struct attribute_group *cpu_root_attr_groups[] = { > > > bool cpu_is_hotpluggable(unsigned int cpu) > > > { > > > struct device *dev = get_cpu_device(cpu); > > > - return dev && container_of(dev, struct cpu, dev)->hotpluggable; > > > + return dev && container_of(dev, struct cpu, dev)->hotpluggable > > > + && tick_nohz_cpu_hotpluggable(cpu); > > > } > > > EXPORT_SYMBOL_GPL(cpu_is_hotpluggable); > > > > > > diff --git a/include/linux/tick.h b/include/linux/tick.h > > > index bfd571f18cfd..9459fef5b857 100644 > > > --- a/include/linux/tick.h > > > +++ b/include/linux/tick.h > > > @@ -216,6 +216,7 @@ extern void tick_nohz_dep_set_signal(struct task_struct *tsk, > > > enum tick_dep_bits bit); > > > extern void tick_nohz_dep_clear_signal(struct signal_struct *signal, > > > enum tick_dep_bits bit); > > > +extern bool tick_nohz_cpu_hotpluggable(unsigned int cpu); > > > > > > /* > > > * The below are tick_nohz_[set,clear]_dep() wrappers that optimize off-cases > > > @@ -280,6 +281,7 @@ static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask) { } > > > > > > static inline void tick_nohz_dep_set_cpu(int cpu, enum tick_dep_bits bit) { } > > > static inline void tick_nohz_dep_clear_cpu(int cpu, enum tick_dep_bits bit) { } > > > +static inline bool tick_nohz_cpu_hotpluggable(unsigned int cpu) { return true; } > > > > > > static inline void tick_dep_set(enum tick_dep_bits bit) { } > > > static inline void tick_dep_clear(enum tick_dep_bits bit) { } > > > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c > > > index 9c6f661fb436..63e3e8ebcd64 100644 > > > --- a/kernel/time/tick-sched.c > > > +++ b/kernel/time/tick-sched.c > > > @@ -510,7 +510,7 @@ void __init tick_nohz_full_setup(cpumask_var_t cpumask) > > > tick_nohz_full_running = true; > > > } > > > > > > -static int tick_nohz_cpu_down(unsigned int cpu) > > > +bool tick_nohz_cpu_hotpluggable(unsigned int cpu) > > > { > > > /* > > > * The tick_do_timer_cpu CPU handles housekeeping duty (unbound > > > @@ -518,8 +518,13 @@ static int tick_nohz_cpu_down(unsigned int cpu) > > > * CPUs. It must remain online when nohz full is enabled. > > > */ > > > if (tick_nohz_full_running && tick_do_timer_cpu == cpu) > > > - return -EBUSY; > > > - return 0; > > > + return false; > > > + return true; > > > +} > > > + > > > +static int tick_nohz_cpu_down(unsigned int cpu) > > > +{ > > > + return tick_nohz_cpu_hotpluggable(cpu) ? 0 : -EBUSY; > > > } > > > > > > void __init tick_nohz_init(void) > > > -- > > > 2.39.1.405.gd4c25cc71f-goog > > >
On Thu, Jan 26, 2023 at 12:16 PM Zhouyi Zhou <zhouzhouyi@gmail.com> wrote: > > On Wed, Jan 25, 2023 at 8:13 AM Zhouyi Zhou <zhouzhouyi@gmail.com> wrote: > > > > On Wed, Jan 25, 2023 at 6:56 AM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > On Tue, Jan 24, 2023 at 05:31:26PM +0000, Joel Fernandes (Google) wrote: > > > > For CONFIG_NO_HZ_FULL systems, the tick_do_timer_cpu cannot be offlined. > > > > However, cpu_is_hotpluggable() still returns true for those CPUs. This causes > > > > torture tests that do offlining to end up trying to offline this CPU causing > > > > test failures. Such failure happens on all architectures. > > > > > > > > Fix it by asking the opinion of the nohz subsystem on whether the CPU can > > > > be hotplugged. > > > > > > > > [ Apply Frederic Weisbecker feedback on refactoring tick_nohz_cpu_down(). ] > > > > > > > > For drivers/base/ portion: > > > > Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > > > > > > > > Cc: Frederic Weisbecker <frederic@kernel.org> > > > > Cc: "Paul E. McKenney" <paulmck@kernel.org> > > > > Cc: Zhouyi Zhou <zhouzhouyi@gmail.com> > > > > Cc: Will Deacon <will@kernel.org> > > > > Cc: Marc Zyngier <maz@kernel.org> > > > > Cc: rcu <rcu@vger.kernel.org> > > > > Cc: stable@vger.kernel.org > > > > Fixes: 2987557f52b9 ("driver-core/cpu: Expose hotpluggability to the rest of the kernel") > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > > > > > Queued for further review and testing, thank you both! > > > > > > It might be a few hours until it becomes publicly visible, but it will > > > get there. > > A new round of rcutorture test on fixed linux-5.15.y[3] has been > > performed in the PPC VM of Open Source Lab of Oregon State University > > [1], which will last about 29 hours. The test result on original > > linux-5.15.y is at [2]. > From the result of [1], the HOTPLUG failure reports have been > eliminated, but a new kernel null point bug related to scsi module has > been reported [4] ;-( > [ 5.178733][ C1] BUG: Kernel NULL pointer dereference on read at > 0x00000008 > ... > [ 5.231013][ C1] [c00000001ff9fca0] [c0000000009ffbc8] > scsi_end_request+0xd8/0x1f0 (unreliable)^M > [ 5.234961][ C1] [c00000001ff9fcf0] [c000000000a00e68] > scsi_io_completion+0x88/0x700^M > [ 5.237863][ C1] [c00000001ff9fda0] [c0000000009f5028] > scsi_finish_command+0xe8/0x150^M > [ 5.240089][ C1] [c00000001ff9fdf0] [c000000000a00c70] > scsi_complete+0x90/0x140^M > [ 5.242481][ C1] [c00000001ff9fe20] [c0000000007e5170] > blk_complete_reqs+0x80/0xa0^M > [ 5.245187][ C1] [c00000001ff9fe50] [c000000000f0b5d0] > __do_softirq+0x1e0/0x4e0^M > [ 5.248479][ C1] [c00000001ff9ff90] [c0000000000170e8] > do_softirq_own_stack+0x48/0x60^M > [ 5.250919][ C1] [c00000000a5e7c40] [c00000000a5e7c80] > 0xc00000000a5e7c80^M > [ 5.253792][ C1] [c00000000a5e7c70] [c0000000001534c0] > do_softirq+0xb0/0xc0^M > [ 5.256824][ C1] [c00000000a5e7ca0] [c0000000001535ac] > __local_bh_enable_ip+0xdc/0x110^M > [ 5.259414][ C1] [c00000000a5e7cc0] [c0000000001d75e8] > irq_forced_thread_fn+0xc8/0xf0^M > [ 5.261921][ C1] [c00000000a5e7d00] [c0000000001d7ae4] > irq_thread+0x1b4/0x2a0^M > [ 5.265298][ C1] [c00000000a5e7da0] [c00000000017d8c8] > kthread+0x1a8/0x1d0^M > [ 5.269184][ C1] [c00000000a5e7e10] [c00000000000cee4] > ret_from_kernel_thread+0x5c/0x64^M > > But when I invoked [5] by hand, the bug did not show again [6]. > > [4] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/res/2023.01.25-00.04.30-torture/results-clocksourcewd-2/TREE03/console.log > [5] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/res/2023.01.25-00.04.30-torture/results-clocksourcewd-2/TREE03/qemu-cmd > [6] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/res/2023.01.25-00.04.30-torture/results-clocksourcewd-2/TREE03/console.zzy.log > > I think the bug is not caused by Joel's patch, there must be some > other reason. I am starting the 29 hours' rcutorture test again. And I > can give ssh access to you if you are interested in it. > > Sorry for the inconvenience that I bring > > Thanks > Zhouyi Hi the above kernel null pointer dereference bug has nothing to do with Joel's fix because I can reproduce it on original 5.15.y [7] using as while loop [8] (after 36 iterations, the bug fires). So, Joel's patch is tested good! Tested-by: Zhouyi Zhou <zhouzhouyi@gmail.com> In following days, I am like to dig out the root of the null pointer dereference bug. [7] http://140.211.169.189/stable-for-joel/linux-stable-rc/ [8] http://140.211.169.189/stable-for-joel/linux-stable-rc/qemu-cmd.sh Thanks Zhouyi > > > > [1] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/res/2023.01.25-00.04.30-torture/ > > [2] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/res/2023.01.18-13.22.39-torture/ > > [3] http://140.211.169.189/linux-stable-rc/ > > > > Happy to benefit the community and this is a fruitful learning journey ;-) > > > > Cheers > > Zhouyi > > > > > > Thanx, Paul > > > > > > > --- > > > > Sorry, resending with CC to stable. > > > > > > > > drivers/base/cpu.c | 3 ++- > > > > include/linux/tick.h | 2 ++ > > > > kernel/time/tick-sched.c | 11 ++++++++--- > > > > 3 files changed, 12 insertions(+), 4 deletions(-) > > > > > > > > diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c > > > > index 55405ebf23ab..450dca235a2f 100644 > > > > --- a/drivers/base/cpu.c > > > > +++ b/drivers/base/cpu.c > > > > @@ -487,7 +487,8 @@ static const struct attribute_group *cpu_root_attr_groups[] = { > > > > bool cpu_is_hotpluggable(unsigned int cpu) > > > > { > > > > struct device *dev = get_cpu_device(cpu); > > > > - return dev && container_of(dev, struct cpu, dev)->hotpluggable; > > > > + return dev && container_of(dev, struct cpu, dev)->hotpluggable > > > > + && tick_nohz_cpu_hotpluggable(cpu); > > > > } > > > > EXPORT_SYMBOL_GPL(cpu_is_hotpluggable); > > > > > > > > diff --git a/include/linux/tick.h b/include/linux/tick.h > > > > index bfd571f18cfd..9459fef5b857 100644 > > > > --- a/include/linux/tick.h > > > > +++ b/include/linux/tick.h > > > > @@ -216,6 +216,7 @@ extern void tick_nohz_dep_set_signal(struct task_struct *tsk, > > > > enum tick_dep_bits bit); > > > > extern void tick_nohz_dep_clear_signal(struct signal_struct *signal, > > > > enum tick_dep_bits bit); > > > > +extern bool tick_nohz_cpu_hotpluggable(unsigned int cpu); > > > > > > > > /* > > > > * The below are tick_nohz_[set,clear]_dep() wrappers that optimize off-cases > > > > @@ -280,6 +281,7 @@ static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask) { } > > > > > > > > static inline void tick_nohz_dep_set_cpu(int cpu, enum tick_dep_bits bit) { } > > > > static inline void tick_nohz_dep_clear_cpu(int cpu, enum tick_dep_bits bit) { } > > > > +static inline bool tick_nohz_cpu_hotpluggable(unsigned int cpu) { return true; } > > > > > > > > static inline void tick_dep_set(enum tick_dep_bits bit) { } > > > > static inline void tick_dep_clear(enum tick_dep_bits bit) { } > > > > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c > > > > index 9c6f661fb436..63e3e8ebcd64 100644 > > > > --- a/kernel/time/tick-sched.c > > > > +++ b/kernel/time/tick-sched.c > > > > @@ -510,7 +510,7 @@ void __init tick_nohz_full_setup(cpumask_var_t cpumask) > > > > tick_nohz_full_running = true; > > > > } > > > > > > > > -static int tick_nohz_cpu_down(unsigned int cpu) > > > > +bool tick_nohz_cpu_hotpluggable(unsigned int cpu) > > > > { > > > > /* > > > > * The tick_do_timer_cpu CPU handles housekeeping duty (unbound > > > > @@ -518,8 +518,13 @@ static int tick_nohz_cpu_down(unsigned int cpu) > > > > * CPUs. It must remain online when nohz full is enabled. > > > > */ > > > > if (tick_nohz_full_running && tick_do_timer_cpu == cpu) > > > > - return -EBUSY; > > > > - return 0; > > > > + return false; > > > > + return true; > > > > +} > > > > + > > > > +static int tick_nohz_cpu_down(unsigned int cpu) > > > > +{ > > > > + return tick_nohz_cpu_hotpluggable(cpu) ? 0 : -EBUSY; > > > > } > > > > > > > > void __init tick_nohz_init(void) > > > > -- > > > > 2.39.1.405.gd4c25cc71f-goog > > > >
On Thu, Jan 26, 2023 at 10:01 AM Zhouyi Zhou <zhouzhouyi@gmail.com> wrote: > > On Thu, Jan 26, 2023 at 12:16 PM Zhouyi Zhou <zhouzhouyi@gmail.com> wrote: > > > > On Wed, Jan 25, 2023 at 8:13 AM Zhouyi Zhou <zhouzhouyi@gmail.com> wrote: > > > > > > On Wed, Jan 25, 2023 at 6:56 AM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > > > On Tue, Jan 24, 2023 at 05:31:26PM +0000, Joel Fernandes (Google) wrote: > > > > > For CONFIG_NO_HZ_FULL systems, the tick_do_timer_cpu cannot be offlined. > > > > > However, cpu_is_hotpluggable() still returns true for those CPUs. This causes > > > > > torture tests that do offlining to end up trying to offline this CPU causing > > > > > test failures. Such failure happens on all architectures. > > > > > > > > > > Fix it by asking the opinion of the nohz subsystem on whether the CPU can > > > > > be hotplugged. > > > > > > > > > > [ Apply Frederic Weisbecker feedback on refactoring tick_nohz_cpu_down(). ] > > > > > > > > > > For drivers/base/ portion: > > > > > Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > > > > > > > > > > Cc: Frederic Weisbecker <frederic@kernel.org> > > > > > Cc: "Paul E. McKenney" <paulmck@kernel.org> > > > > > Cc: Zhouyi Zhou <zhouzhouyi@gmail.com> > > > > > Cc: Will Deacon <will@kernel.org> > > > > > Cc: Marc Zyngier <maz@kernel.org> > > > > > Cc: rcu <rcu@vger.kernel.org> > > > > > Cc: stable@vger.kernel.org > > > > > Fixes: 2987557f52b9 ("driver-core/cpu: Expose hotpluggability to the rest of the kernel") > > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > > > > > > > Queued for further review and testing, thank you both! > > > > > > > > It might be a few hours until it becomes publicly visible, but it will > > > > get there. > > > A new round of rcutorture test on fixed linux-5.15.y[3] has been > > > performed in the PPC VM of Open Source Lab of Oregon State University > > > [1], which will last about 29 hours. The test result on original > > > linux-5.15.y is at [2]. > > From the result of [1], the HOTPLUG failure reports have been > > eliminated, but a new kernel null point bug related to scsi module has > > been reported [4] ;-( > > [ 5.178733][ C1] BUG: Kernel NULL pointer dereference on read at > > 0x00000008 > > ... > > [ 5.231013][ C1] [c00000001ff9fca0] [c0000000009ffbc8] > > scsi_end_request+0xd8/0x1f0 (unreliable)^M > > [ 5.234961][ C1] [c00000001ff9fcf0] [c000000000a00e68] > > scsi_io_completion+0x88/0x700^M > > [ 5.237863][ C1] [c00000001ff9fda0] [c0000000009f5028] > > scsi_finish_command+0xe8/0x150^M > > [ 5.240089][ C1] [c00000001ff9fdf0] [c000000000a00c70] > > scsi_complete+0x90/0x140^M > > [ 5.242481][ C1] [c00000001ff9fe20] [c0000000007e5170] > > blk_complete_reqs+0x80/0xa0^M > > [ 5.245187][ C1] [c00000001ff9fe50] [c000000000f0b5d0] > > __do_softirq+0x1e0/0x4e0^M > > [ 5.248479][ C1] [c00000001ff9ff90] [c0000000000170e8] > > do_softirq_own_stack+0x48/0x60^M > > [ 5.250919][ C1] [c00000000a5e7c40] [c00000000a5e7c80] > > 0xc00000000a5e7c80^M > > [ 5.253792][ C1] [c00000000a5e7c70] [c0000000001534c0] > > do_softirq+0xb0/0xc0^M > > [ 5.256824][ C1] [c00000000a5e7ca0] [c0000000001535ac] > > __local_bh_enable_ip+0xdc/0x110^M > > [ 5.259414][ C1] [c00000000a5e7cc0] [c0000000001d75e8] > > irq_forced_thread_fn+0xc8/0xf0^M > > [ 5.261921][ C1] [c00000000a5e7d00] [c0000000001d7ae4] > > irq_thread+0x1b4/0x2a0^M > > [ 5.265298][ C1] [c00000000a5e7da0] [c00000000017d8c8] > > kthread+0x1a8/0x1d0^M > > [ 5.269184][ C1] [c00000000a5e7e10] [c00000000000cee4] > > ret_from_kernel_thread+0x5c/0x64^M > > > > But when I invoked [5] by hand, the bug did not show again [6]. > > > > [4] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/res/2023.01.25-00.04.30-torture/results-clocksourcewd-2/TREE03/console.log > > [5] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/res/2023.01.25-00.04.30-torture/results-clocksourcewd-2/TREE03/qemu-cmd > > [6] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/res/2023.01.25-00.04.30-torture/results-clocksourcewd-2/TREE03/console.zzy.log > > > > I think the bug is not caused by Joel's patch, there must be some > > other reason. I am starting the 29 hours' rcutorture test again. And I > > can give ssh access to you if you are interested in it. > > > > Sorry for the inconvenience that I bring > > > > Thanks > > Zhouyi > Hi the above kernel null pointer dereference bug has nothing to do > with Joel's fix because I can reproduce it on original 5.15.y [7] > using as while loop [8] (after 36 iterations, the bug fires). > So, Joel's patch is tested good! > Tested-by: Zhouyi Zhou <zhouzhouyi@gmail.com> Interesting. I have been running rcutorture's TREE03 on 5.15.y quite a lot and I don't see such an issue. However, your logs showed the crash is SCSI related. These were the recent SCSI commits in 5.15.y but I am not sure if it causes the issue: 13259b6 scsi: mpi3mr: Refer CONFIG_SCSI_MPI3MR in Makefile 513fdf0 scsi: ufs: Stop using the clock scaling lock in the error handler 7c26d21 scsi: ufs: core: WLUN suspend SSU/enter hibern Perhaps report it to the scsi and/or stable lists, or do some web searches for if someone else sees it. - Joel
On Thu, Jan 26, 2023 at 11:14 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > On Thu, Jan 26, 2023 at 10:01 AM Zhouyi Zhou <zhouzhouyi@gmail.com> wrote: > > > > On Thu, Jan 26, 2023 at 12:16 PM Zhouyi Zhou <zhouzhouyi@gmail.com> wrote: > > > > > > On Wed, Jan 25, 2023 at 8:13 AM Zhouyi Zhou <zhouzhouyi@gmail.com> wrote: > > > > > > > > On Wed, Jan 25, 2023 at 6:56 AM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > > > > > On Tue, Jan 24, 2023 at 05:31:26PM +0000, Joel Fernandes (Google) wrote: > > > > > > For CONFIG_NO_HZ_FULL systems, the tick_do_timer_cpu cannot be offlined. > > > > > > However, cpu_is_hotpluggable() still returns true for those CPUs. This causes > > > > > > torture tests that do offlining to end up trying to offline this CPU causing > > > > > > test failures. Such failure happens on all architectures. > > > > > > > > > > > > Fix it by asking the opinion of the nohz subsystem on whether the CPU can > > > > > > be hotplugged. > > > > > > > > > > > > [ Apply Frederic Weisbecker feedback on refactoring tick_nohz_cpu_down(). ] > > > > > > > > > > > > For drivers/base/ portion: > > > > > > Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > > > > > > > > > > > > Cc: Frederic Weisbecker <frederic@kernel.org> > > > > > > Cc: "Paul E. McKenney" <paulmck@kernel.org> > > > > > > Cc: Zhouyi Zhou <zhouzhouyi@gmail.com> > > > > > > Cc: Will Deacon <will@kernel.org> > > > > > > Cc: Marc Zyngier <maz@kernel.org> > > > > > > Cc: rcu <rcu@vger.kernel.org> > > > > > > Cc: stable@vger.kernel.org > > > > > > Fixes: 2987557f52b9 ("driver-core/cpu: Expose hotpluggability to the rest of the kernel") > > > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > > > > > > > > > Queued for further review and testing, thank you both! > > > > > > > > > > It might be a few hours until it becomes publicly visible, but it will > > > > > get there. > > > > A new round of rcutorture test on fixed linux-5.15.y[3] has been > > > > performed in the PPC VM of Open Source Lab of Oregon State University > > > > [1], which will last about 29 hours. The test result on original > > > > linux-5.15.y is at [2]. > > > From the result of [1], the HOTPLUG failure reports have been > > > eliminated, but a new kernel null point bug related to scsi module has > > > been reported [4] ;-( > > > [ 5.178733][ C1] BUG: Kernel NULL pointer dereference on read at > > > 0x00000008 > > > ... > > > [ 5.231013][ C1] [c00000001ff9fca0] [c0000000009ffbc8] > > > scsi_end_request+0xd8/0x1f0 (unreliable)^M > > > [ 5.234961][ C1] [c00000001ff9fcf0] [c000000000a00e68] > > > scsi_io_completion+0x88/0x700^M > > > [ 5.237863][ C1] [c00000001ff9fda0] [c0000000009f5028] > > > scsi_finish_command+0xe8/0x150^M > > > [ 5.240089][ C1] [c00000001ff9fdf0] [c000000000a00c70] > > > scsi_complete+0x90/0x140^M > > > [ 5.242481][ C1] [c00000001ff9fe20] [c0000000007e5170] > > > blk_complete_reqs+0x80/0xa0^M > > > [ 5.245187][ C1] [c00000001ff9fe50] [c000000000f0b5d0] > > > __do_softirq+0x1e0/0x4e0^M > > > [ 5.248479][ C1] [c00000001ff9ff90] [c0000000000170e8] > > > do_softirq_own_stack+0x48/0x60^M > > > [ 5.250919][ C1] [c00000000a5e7c40] [c00000000a5e7c80] > > > 0xc00000000a5e7c80^M > > > [ 5.253792][ C1] [c00000000a5e7c70] [c0000000001534c0] > > > do_softirq+0xb0/0xc0^M > > > [ 5.256824][ C1] [c00000000a5e7ca0] [c0000000001535ac] > > > __local_bh_enable_ip+0xdc/0x110^M > > > [ 5.259414][ C1] [c00000000a5e7cc0] [c0000000001d75e8] > > > irq_forced_thread_fn+0xc8/0xf0^M > > > [ 5.261921][ C1] [c00000000a5e7d00] [c0000000001d7ae4] > > > irq_thread+0x1b4/0x2a0^M > > > [ 5.265298][ C1] [c00000000a5e7da0] [c00000000017d8c8] > > > kthread+0x1a8/0x1d0^M > > > [ 5.269184][ C1] [c00000000a5e7e10] [c00000000000cee4] > > > ret_from_kernel_thread+0x5c/0x64^M > > > > > > But when I invoked [5] by hand, the bug did not show again [6]. > > > > > > [4] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/res/2023.01.25-00.04.30-torture/results-clocksourcewd-2/TREE03/console.log > > > [5] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/res/2023.01.25-00.04.30-torture/results-clocksourcewd-2/TREE03/qemu-cmd > > > [6] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/res/2023.01.25-00.04.30-torture/results-clocksourcewd-2/TREE03/console.zzy.log > > > > > > I think the bug is not caused by Joel's patch, there must be some > > > other reason. I am starting the 29 hours' rcutorture test again. And I > > > can give ssh access to you if you are interested in it. > > > > > > Sorry for the inconvenience that I bring > > > > > > Thanks > > > Zhouyi > > Hi the above kernel null pointer dereference bug has nothing to do > > with Joel's fix because I can reproduce it on original 5.15.y [7] > > using as while loop [8] (after 36 iterations, the bug fires). > > So, Joel's patch is tested good! > > Tested-by: Zhouyi Zhou <zhouzhouyi@gmail.com> > > Interesting. I have been running rcutorture's TREE03 on 5.15.y quite a > lot and I don't see such an issue. > > However, your logs showed the crash is SCSI related. These were the > recent SCSI commits in 5.15.y but I am not sure if it causes the > issue: > > 13259b6 scsi: mpi3mr: Refer CONFIG_SCSI_MPI3MR in Makefile > 513fdf0 scsi: ufs: Stop using the clock scaling lock in the error handler > 7c26d21 scsi: ufs: core: WLUN suspend SSU/enter hibern > > Perhaps report it to the scsi and/or stable lists, or do some web > searches for if someone else sees it. Thank Joel for your guidance! I will continue the research following your instructions! I will report to you and the community once I have any process. Thank you all! Best Regards Zhouyi > > - Joel
On Thu, Jan 26, 2023 at 11:41 PM Zhouyi Zhou <zhouzhouyi@gmail.com> wrote: > > On Thu, Jan 26, 2023 at 11:14 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > > > On Thu, Jan 26, 2023 at 10:01 AM Zhouyi Zhou <zhouzhouyi@gmail.com> wrote: > > > > > > On Thu, Jan 26, 2023 at 12:16 PM Zhouyi Zhou <zhouzhouyi@gmail.com> wrote: > > > > > > > > On Wed, Jan 25, 2023 at 8:13 AM Zhouyi Zhou <zhouzhouyi@gmail.com> wrote: > > > > > > > > > > On Wed, Jan 25, 2023 at 6:56 AM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > > > > > > > On Tue, Jan 24, 2023 at 05:31:26PM +0000, Joel Fernandes (Google) wrote: > > > > > > > For CONFIG_NO_HZ_FULL systems, the tick_do_timer_cpu cannot be offlined. > > > > > > > However, cpu_is_hotpluggable() still returns true for those CPUs. This causes > > > > > > > torture tests that do offlining to end up trying to offline this CPU causing > > > > > > > test failures. Such failure happens on all architectures. > > > > > > > > > > > > > > Fix it by asking the opinion of the nohz subsystem on whether the CPU can > > > > > > > be hotplugged. > > > > > > > > > > > > > > [ Apply Frederic Weisbecker feedback on refactoring tick_nohz_cpu_down(). ] > > > > > > > > > > > > > > For drivers/base/ portion: > > > > > > > Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > > > > > > > > > > > > > > Cc: Frederic Weisbecker <frederic@kernel.org> > > > > > > > Cc: "Paul E. McKenney" <paulmck@kernel.org> > > > > > > > Cc: Zhouyi Zhou <zhouzhouyi@gmail.com> > > > > > > > Cc: Will Deacon <will@kernel.org> > > > > > > > Cc: Marc Zyngier <maz@kernel.org> > > > > > > > Cc: rcu <rcu@vger.kernel.org> > > > > > > > Cc: stable@vger.kernel.org > > > > > > > Fixes: 2987557f52b9 ("driver-core/cpu: Expose hotpluggability to the rest of the kernel") > > > > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> > > > > > > > > > > > > Queued for further review and testing, thank you both! > > > > > > > > > > > > It might be a few hours until it becomes publicly visible, but it will > > > > > > get there. > > > > > A new round of rcutorture test on fixed linux-5.15.y[3] has been > > > > > performed in the PPC VM of Open Source Lab of Oregon State University > > > > > [1], which will last about 29 hours. The test result on original > > > > > linux-5.15.y is at [2]. > > > > From the result of [1], the HOTPLUG failure reports have been > > > > eliminated, but a new kernel null point bug related to scsi module has > > > > been reported [4] ;-( > > > > [ 5.178733][ C1] BUG: Kernel NULL pointer dereference on read at > > > > 0x00000008 > > > > ... > > > > [ 5.231013][ C1] [c00000001ff9fca0] [c0000000009ffbc8] > > > > scsi_end_request+0xd8/0x1f0 (unreliable)^M > > > > [ 5.234961][ C1] [c00000001ff9fcf0] [c000000000a00e68] > > > > scsi_io_completion+0x88/0x700^M > > > > [ 5.237863][ C1] [c00000001ff9fda0] [c0000000009f5028] > > > > scsi_finish_command+0xe8/0x150^M > > > > [ 5.240089][ C1] [c00000001ff9fdf0] [c000000000a00c70] > > > > scsi_complete+0x90/0x140^M > > > > [ 5.242481][ C1] [c00000001ff9fe20] [c0000000007e5170] > > > > blk_complete_reqs+0x80/0xa0^M > > > > [ 5.245187][ C1] [c00000001ff9fe50] [c000000000f0b5d0] > > > > __do_softirq+0x1e0/0x4e0^M > > > > [ 5.248479][ C1] [c00000001ff9ff90] [c0000000000170e8] > > > > do_softirq_own_stack+0x48/0x60^M > > > > [ 5.250919][ C1] [c00000000a5e7c40] [c00000000a5e7c80] > > > > 0xc00000000a5e7c80^M > > > > [ 5.253792][ C1] [c00000000a5e7c70] [c0000000001534c0] > > > > do_softirq+0xb0/0xc0^M > > > > [ 5.256824][ C1] [c00000000a5e7ca0] [c0000000001535ac] > > > > __local_bh_enable_ip+0xdc/0x110^M > > > > [ 5.259414][ C1] [c00000000a5e7cc0] [c0000000001d75e8] > > > > irq_forced_thread_fn+0xc8/0xf0^M > > > > [ 5.261921][ C1] [c00000000a5e7d00] [c0000000001d7ae4] > > > > irq_thread+0x1b4/0x2a0^M > > > > [ 5.265298][ C1] [c00000000a5e7da0] [c00000000017d8c8] > > > > kthread+0x1a8/0x1d0^M > > > > [ 5.269184][ C1] [c00000000a5e7e10] [c00000000000cee4] > > > > ret_from_kernel_thread+0x5c/0x64^M > > > > > > > > But when I invoked [5] by hand, the bug did not show again [6]. > > > > > > > > [4] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/res/2023.01.25-00.04.30-torture/results-clocksourcewd-2/TREE03/console.log > > > > [5] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/res/2023.01.25-00.04.30-torture/results-clocksourcewd-2/TREE03/qemu-cmd > > > > [6] http://140.211.169.189/linux-stable-rc/tools/testing/selftests/rcutorture/res/2023.01.25-00.04.30-torture/results-clocksourcewd-2/TREE03/console.zzy.log > > > > > > > > I think the bug is not caused by Joel's patch, there must be some > > > > other reason. I am starting the 29 hours' rcutorture test again. And I > > > > can give ssh access to you if you are interested in it. > > > > > > > > Sorry for the inconvenience that I bring > > > > > > > > Thanks > > > > Zhouyi > > > Hi the above kernel null pointer dereference bug has nothing to do > > > with Joel's fix because I can reproduce it on original 5.15.y [7] > > > using as while loop [8] (after 36 iterations, the bug fires). > > > So, Joel's patch is tested good! > > > Tested-by: Zhouyi Zhou <zhouzhouyi@gmail.com> > > > > Interesting. I have been running rcutorture's TREE03 on 5.15.y quite a > > lot and I don't see such an issue. > > > > However, your logs showed the crash is SCSI related. These were the > > recent SCSI commits in 5.15.y but I am not sure if it causes the > > issue: > > > > 13259b6 scsi: mpi3mr: Refer CONFIG_SCSI_MPI3MR in Makefile > > 513fdf0 scsi: ufs: Stop using the clock scaling lock in the error handler > > 7c26d21 scsi: ufs: core: WLUN suspend SSU/enter hibern > > > > Perhaps report it to the scsi and/or stable lists, or do some web > > searches for if someone else sees it. > Thank Joel for your guidance! > I will continue the research following your instructions! > I will report to you and the community once I have any process. Hi I think I have chased down the reason behind the bug reported by SCSI subsystem following your guidance, and have reported to SCSI mainteners [1] This is a fruitful journey, thank you all ;-) So once again, Joel's patch is tested good ;-) Tested-by: Zhouyi Zhou <zhouzhouyi@gmail.com> Thanks for your attention Thanks Zhouyi [1] https://lore.kernel.org/lkml/CAABZP2ywCbu4Po63BDBE7U1WEqx4DF7F2CZjTqFp0dSDw-uziQ@mail.gmail.com/T/#u > > Thank you all! > > Best Regards > Zhouyi > > > > - Joel
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c index 55405ebf23ab..450dca235a2f 100644 --- a/drivers/base/cpu.c +++ b/drivers/base/cpu.c @@ -487,7 +487,8 @@ static const struct attribute_group *cpu_root_attr_groups[] = { bool cpu_is_hotpluggable(unsigned int cpu) { struct device *dev = get_cpu_device(cpu); - return dev && container_of(dev, struct cpu, dev)->hotpluggable; + return dev && container_of(dev, struct cpu, dev)->hotpluggable + && tick_nohz_cpu_hotpluggable(cpu); } EXPORT_SYMBOL_GPL(cpu_is_hotpluggable); diff --git a/include/linux/tick.h b/include/linux/tick.h index bfd571f18cfd..9459fef5b857 100644 --- a/include/linux/tick.h +++ b/include/linux/tick.h @@ -216,6 +216,7 @@ extern void tick_nohz_dep_set_signal(struct task_struct *tsk, enum tick_dep_bits bit); extern void tick_nohz_dep_clear_signal(struct signal_struct *signal, enum tick_dep_bits bit); +extern bool tick_nohz_cpu_hotpluggable(unsigned int cpu); /* * The below are tick_nohz_[set,clear]_dep() wrappers that optimize off-cases @@ -280,6 +281,7 @@ static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask) { } static inline void tick_nohz_dep_set_cpu(int cpu, enum tick_dep_bits bit) { } static inline void tick_nohz_dep_clear_cpu(int cpu, enum tick_dep_bits bit) { } +static inline bool tick_nohz_cpu_hotpluggable(unsigned int cpu) { return true; } static inline void tick_dep_set(enum tick_dep_bits bit) { } static inline void tick_dep_clear(enum tick_dep_bits bit) { } diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 9c6f661fb436..63e3e8ebcd64 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -510,7 +510,7 @@ void __init tick_nohz_full_setup(cpumask_var_t cpumask) tick_nohz_full_running = true; } -static int tick_nohz_cpu_down(unsigned int cpu) +bool tick_nohz_cpu_hotpluggable(unsigned int cpu) { /* * The tick_do_timer_cpu CPU handles housekeeping duty (unbound @@ -518,8 +518,13 @@ static int tick_nohz_cpu_down(unsigned int cpu) * CPUs. It must remain online when nohz full is enabled. */ if (tick_nohz_full_running && tick_do_timer_cpu == cpu) - return -EBUSY; - return 0; + return false; + return true; +} + +static int tick_nohz_cpu_down(unsigned int cpu) +{ + return tick_nohz_cpu_hotpluggable(cpu) ? 0 : -EBUSY; } void __init tick_nohz_init(void)