Message ID | 20221125155427.1381933-1-qiang1.zhang@intel.com |
---|---|
State | New |
Headers |
From: Zqiang <qiang1.zhang@intel.com> To: paulmck@kernel.org, frederic@kernel.org, joel@joelfernandes.org Cc: rcu@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH] rcu-tasks: Make rude RCU-Tasks work well with CPU hotplug Date: Fri, 25 Nov 2022 23:54:27 +0800 Message-Id: <20221125155427.1381933-1-qiang1.zhang@intel.com> X-Mailer: git-send-email 2.25.1 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit |
Series |
rcu-tasks: Make rude RCU-Tasks work well with CPU hotplug
|
|
Commit Message
Zqiang
Nov. 25, 2022, 3:54 p.m. UTC
Currently, rcu_tasks_rude_wait_gp() returns directly when
num_online_cpus() <= 1, which indicates the end of the current grace
period and allows the old data to be released. This is not accurate:
on an SMP system, when num_online_cpus() is equal to one, another CPU
that is in the process of going offline (after invoking
__cpu_disable()) may still be in a rude RCU-Tasks critical section
holding the old data, which can lead to memory corruption.

Therefore, this commit adds cpus_read_lock/unlock() around the
num_online_cpus() check.
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
---
kernel/rcu/tasks.h | 20 ++++++++++++++++++--
1 file changed, 18 insertions(+), 2 deletions(-)
Comments
On Fri, Nov 25, 2022 at 11:54:27PM +0800, Zqiang wrote:
> Currently, for the case of num_online_cpus() <= 1, return directly,
> indicates the end of current grace period and then release old data.
> it's not accurate, for SMP system, when num_online_cpus() is equal
> one, maybe another cpu that in offline process(after invoke
> __cpu_disable()) is still in the rude RCU-Tasks critical section
> holding the old data, this lead to memory corruption.
>
> Therefore, this commit add cpus_read_lock/unlock() before executing
> num_online_cpus().

I am not sure if this is needed. The only way what you suggest can happen is
if the tasks-RCU protected data is accessed after the num_online_cpus() value is
decremented on the CPU going offline.

However, the number of online CPUs value is changed on a CPU other than the
CPU going offline.

So there's no way the CPU going offline can run any code (it is already
dead courtesy of CPUHP_AP_IDLE_DEAD). So a corruption is impossible.

Or, did I miss something?

thanks,

 - Joel
On Fri, Nov 25, 2022 at 11:54:27PM +0800, Zqiang wrote:
> Currently, for the case of num_online_cpus() <= 1, return directly,
> indicates the end of current grace period and then release old data.
> it's not accurate, for SMP system, when num_online_cpus() is equal
> one, maybe another cpu that in offline process(after invoke
> __cpu_disable()) is still in the rude RCU-Tasks critical section
> holding the old data, this lead to memory corruption.
>
> Therefore, this commit add cpus_read_lock/unlock() before executing
> num_online_cpus().
>
>I am not sure if this is needed. The only way what you suggest can happen is
>if the tasks-RCU protected data is accessed after the num_online_cpus() value is
>decremented on the CPU going offline.
>
>However, the number of online CPUs value is changed on a CPU other than the
>CPU going offline.
>
>So there's no way the CPU going offline can run any code (it is already
>dead courtesy of CPUHP_AP_IDLE_DEAD). So a corruption is impossible.
>
>Or, did I miss something?

Hi Joel

Suppose the system has two CPUs:

CPU0                              CPU1
                                  cpu_stopper_thread
                                  take_cpu_down
                                  __cpu_disable
                                  dec __num_online_cpus
rcu_tasks_rude_wait_gp            cpuhp_invoke_callback
num_online_cpus() == 1
return;

When __num_online_cpus == 1, CPU1 is not yet completely offline.

Thanks
Zqiang
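The two-CPU interleaving in the diagram above can be replayed deterministically in user space. The following is only a minimal sketch under stated assumptions: `struct model` and every function name in it are hypothetical stand-ins invented for illustration, not kernel APIs, and the "reader" flag simply represents CPU1 still executing hotplug callbacks that may touch the old data.

```c
#include <stdbool.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
struct model {
	int num_online;       /* stands in for __num_online_cpus */
	bool cpu1_in_reader;  /* CPU1 still runs code that uses the old data */
	bool old_data_freed;  /* set once the updater releases the old data */
};

/*
 * CPU1, inside stop machine: the step modelled on __cpu_disable() drops
 * the online count, but CPU1 keeps executing hotplug callbacks afterwards.
 */
void cpu1_cpu_disable(struct model *m)
{
	m->num_online--;
	m->cpu1_in_reader = true;
}

/*
 * CPU0: the unprotected fast path from the diagram. Returns true when the
 * grace period is declared over without waiting for any other CPU.
 */
bool cpu0_wait_gp_fastpath(const struct model *m)
{
	return m->num_online <= 1;
}

/*
 * Replay the ordering from the diagram. Returns true if the old data was
 * released while CPU1 was still using it.
 */
bool replay_offline_race(void)
{
	struct model m = { .num_online = 2 };

	cpu1_cpu_disable(&m);             /* CPU1: dec __num_online_cpus */
	if (cpu0_wait_gp_fastpath(&m))    /* CPU0: num_online_cpus() == 1 */
		m.old_data_freed = true;  /* CPU0: return, then free */
	return m.old_data_freed && m.cpu1_in_reader;
}
```

In this model `replay_offline_race()` reports the problematic outcome because the count is decremented before CPU1 has stopped running code. The model says nothing about how likely the window is on real hardware.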
On Sat, Nov 26, 2022 at 02:43:59AM +0000, Zhang, Qiang1 wrote:
> Hi joel
>
> Suppose the system has two cpus
>
> CPU0                              CPU1
>                                   cpu_stopper_thread
>                                   take_cpu_down
>                                   __cpu_disable
>                                   dec __num_online_cpus
> rcu_tasks_rude_wait_gp            cpuhp_invoke_callback

Thanks for clarifying!

You are right, this can be a problem for anything in the stop machine on the
CPU going offline from CPUHP_AP_ONLINE to CPUHP_AP_IDLE_DEAD, during which
the code executing on that CPU is not accounted for in num_online_cpus().

Actually Neeraj found a similar issue 2 years ago and instead of hotplug
lock, he added a new attribute to rcu_state to track number of CPUs.

See:
https://lore.kernel.org/r/20200923210313.GS29330@paulmck-ThinkPad-P72
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg2317853.html

Could we do something similar?

Of note is the comment in that thread:
Actually blocking CPU hotplug would not only result in excessive overhead,
but would also unnecessarily impede CPU-hotplug operations.

Neeraj is also on the thread and could chime in.

Thanks,

 - Joel

> num_online_cpus() == 1
> return;
>
> when __num_online_cpus == 1, the CPU1 not completely offline.
>
> Thanks
> Zqiang
Hi Zqiang,

On 11/25/2022 9:24 PM, Zqiang wrote:
> Currently, for the case of num_online_cpus() <= 1, return directly,
> indicates the end of current grace period and then release old data.
> it's not accurate, for SMP system, when num_online_cpus() is equal
> one, maybe another cpu that in offline process(after invoke
> __cpu_disable()) is still in the rude RCU-Tasks critical section
> holding the old data, this lead to memory corruption.
>

Was this race seen in your testing? For the outgoing CPU, once that
CPU marks itself offline (and decrements __num_online_cpus), do we
have tracing active on that CPU, and synchronize_rcu_tasks_rude()
not waiting for it could potentially lead to memory corruption?

As per my understanding, given that outgoing/incoming CPU
decrements/increments the __num_online_cpus value, and num_online_cpus()
is a plain read, problem could happen when the incoming CPU updates the
__num_online_cpus value, however, rcu_tasks_rude_wait_gp()'s
num_online_cpus() call didn't observe the increment. So,
cpus_read_lock/unlock() seems to be required to handle this case.

Thanks
Neeraj

> Therefore, this commit add cpus_read_lock/unlock() before executing
> num_online_cpus().
Hi,

On 11/26/2022 10:04 AM, Joel Fernandes wrote:
> You are right, this can be a problem for anything in the stop machine on the
> CPU going offline from CPUHP_AP_ONLINE to CPUHP_AP_IDLE_DEAD, during which
> the code execute on that CPU is not accounted for in num_online_cpus().
>
> Actually Neeraj found a similar issue 2 years ago and instead of hotplug
> lock, he added a new attribute to rcu_state to track number of CPUs.
>
> See:
> https://lore.kernel.org/r/20200923210313.GS29330@paulmck-ThinkPad-P72
> https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg2317853.html
>
> Could we do something similar?
>
> Off note is the comment in that thread:
> Actually blocking CPU hotplug would not only result in excessive overhead,
> but would also unnecessarily impede CPU-hotplug operations.
>
> Neeraj is also on the thread and could chime in.
>

I agree that using a counter, which is updated on the control CPU - after the
CPU is dead (for the offline case) and before the CPU starts executing in the
kernel (for the online case) - optimizes the fast path. However, given that, in
the common case (num_online_cpus() > 1), we also need to acquire
cpus_read_lock(), I am not sure how much actual impact that optimization
will have.

Thanks
Neeraj
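The counter idea discussed above (a count that only the control CPU changes, after the outgoing CPU is dead and before an incoming CPU starts executing) can be contrasted with the self-updated online count in a small user-space model. This is a sketch only: `struct hp_model`, the `ncpus` field and all function names are hypothetical and are not taken from the earlier patch that the links refer to.

```c
#include <stdbool.h>

/* Hypothetical model; field and function names are illustrative only. */
struct hp_model {
	int num_online;  /* models __num_online_cpus, changed by the CPU itself */
	int ncpus;       /* assumed counter, changed only by the control CPU */
};

/* Outgoing CPU marks itself offline while it is still executing code. */
void outgoing_cpu_disable(struct hp_model *m)
{
	m->num_online--;
}

/* Control CPU: runs only after the outgoing CPU is completely dead. */
void control_cpu_after_dead(struct hp_model *m)
{
	m->ncpus--;
}

/* Control CPU: runs before the incoming CPU starts executing in the kernel. */
void control_cpu_before_online(struct hp_model *m)
{
	m->ncpus++;
}

/* Incoming CPU marks itself online some time after it started running. */
void incoming_cpu_set_online(struct hp_model *m)
{
	m->num_online++;
}

/* Fast-path decision based on the self-updated online count. */
bool fastpath_by_num_online(const struct hp_model *m)
{
	return m->num_online <= 1;
}

/* Fast-path decision based on the control-CPU counter. */
bool fastpath_by_ncpus(const struct hp_model *m)
{
	return m->ncpus <= 1;
}
```

In this model the control-CPU counter never reads as one while a second CPU may still be running, in either direction of hotplug, which is the property the fast path relies on. As noted in the reply above, whether that is worth it depends on the slow path, which would still need the hotplug read lock.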
>Hi Zqiang,
>
>On 11/25/2022 9:24 PM, Zqiang wrote:
>> Currently, for the case of num_online_cpus() <= 1, return directly,
>> indicates the end of current grace period and then release old data.
>> it's not accurate, for SMP system, when num_online_cpus() is equal
>> one, maybe another cpu that in offline process(after invoke
>> __cpu_disable()) is still in the rude RCU-Tasks critical section
>> holding the old data, this lead to memory corruption.
>
>Was this race seen in your testing? For the outgoing CPU, once that
>CPU marks itself offline (and decrements __num_online_cpus), do we
>have tracing active on that CPU, and synchronize_rcu_tasks_rude()
>not waiting for it could potentially lead to memory corruption?

Hi Neeraj

Indeed, I have not seen this race in an actual production environment.
Maybe my commit message is not accurate enough; the scenario is the one
I described to Joel.

If some callback invoked from cpuhp_invoke_callback() is in a rude RCU-Tasks
read-side critical section and is still holding old data, then
synchronize_rcu_tasks_rude() does not wait for it at that time, and the
old data is released.

Suppose the system has two CPUs:

CPU0                              CPU1
                                  cpu_stopper_thread
                                  take_cpu_down
                                  __cpu_disable
                                  dec __num_online_cpus
rcu_tasks_rude_wait_gp            cpuhp_invoke_callback
num_online_cpus() == 1
return;

When __num_online_cpus == 1, CPU1 is not yet completely offline.

>As per my understanding, given that outgoing/incoming CPU
>decrements/increments the __num_online_cpus value, and num_online_cpus()
>is a plain read, problem could happen when the incoming CPU updates the
>__num_online_cpus value, however, rcu_tasks_rude_wait_gp()'s
>num_online_cpus() call didn't observe the increment. So,
>cpus_read_lock/unlock() seems to be required to handle this case.

Yes, the same problem will be encountered when a CPU is going online, because
rcu_tasks_rude_wait_gp() accesses __num_online_cpus without the protection
of cpus_read_lock/unlock().

Do I need to update the commit message and send a v2?

Thanks
Zqiang
> On Nov 26, 2022, at 12:52 AM, Zhang, Qiang1 <qiang1.zhang@intel.com> wrote:
>
> If in cpuhp_invoke_callback, some callback is in rude rcu-tasks read ctrical section,
> and still holding old data, but in this time, synchronize_rcu_tasks_rude() not waiting,
> and release old data.
>
> Suppose the system has two cpus
>
> CPU0                              CPU1
>                                   cpu_stopper_thread
>                                   take_cpu_down
>                                   __cpu_disable
>                                   dec __num_online_cpus
> rcu_tasks_rude_wait_gp            cpuhp_invoke_callback
> num_online_cpus() == 1
> return;
>
> when __num_online_cpus == 1, the CPU1 not completely offline.

Agreed with your and Neeraj's assessment.

>> As per my understanding, given that outgoing/incoming CPU
>> decrements/increments the __num_online_cpus value, and num_online_cpus()
>> is a plain read, problem could happen when the incoming CPU updates the
>> __num_online_cpus value, however, rcu_tasks_rude_wait_gp()'s
>> num_online_cpus() call didn't observe the increment. So,
>> cpus_read_lock/unlock() seems to be required to handle this case.
>
> Yes, the same problem will be encountered when going online, due to
> access __num_online_cpus that is not protected by cpus_read_lock/unlock()
> in rcu_tasks_rude_wait_gp().
>
> Do I need to change the commit information to send v2?

I think so. If you could add the CPU sequence diagram you mentioned, that
would be great.

Also, I suggest adding more details about which specific parts of the hotplug
process (the ones in stop machine only) are susceptible to the issue. That is,
only those hotplug callbacks that are in stop machine which may have
trampolines prematurely freed from another cpu, right?

Thanks!

 - Joel
> On Nov 26, 2022, at 12:52 AM, Zhang, Qiang1 <qiang1.zhang@intel.com> wrote:
>
> when __num_online_cpus == 1, the CPU1 not completely offline.
>
>Agreed with yours and Neeraj assessment.
>
> Yes, the same problem will be encountered when going online, due to
> access __num_online_cpus that is not protected by
> cpus_read_lock/unlock() in rcu_tasks_rude_wait_gp().
>
> Do I need to change the commit information to send v2?
>
>I think so. If you could add the CPU sequence diagram you mentioned, that would be great.
>
>Also I suggest add more details of which specific parts of the hotplug process (the ones in stop machine only) are susceptible to the issue. That is, only those hotplug callbacks that are in stop machine which may have trampolines prematurely freed from another cpu, right?

Yes, your description is correct. I will resend.

Thanks
Zqiang
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 4a991311be9b..08e72c6462d8 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -1033,14 +1033,30 @@ static void rcu_tasks_be_rude(struct work_struct *work)
 {
 }

+static DEFINE_PER_CPU(struct work_struct, rude_work);
+
 // Wait for one rude RCU-tasks grace period.
 static void rcu_tasks_rude_wait_gp(struct rcu_tasks *rtp)
 {
+	int cpu;
+	struct work_struct *work;
+
+	cpus_read_lock();
 	if (num_online_cpus() <= 1)
-		return; // Fastpath for only one CPU.
+		goto end;// Fastpath for only one CPU.

 	rtp->n_ipis += cpumask_weight(cpu_online_mask);
-	schedule_on_each_cpu(rcu_tasks_be_rude);
+	for_each_online_cpu(cpu) {
+		work = per_cpu_ptr(&rude_work, cpu);
+		INIT_WORK(work, rcu_tasks_be_rude);
+		schedule_work_on(cpu, work);
+	}
+
+	for_each_online_cpu(cpu)
+		flush_work(per_cpu_ptr(&rude_work, cpu));
+
+end:
+	cpus_read_unlock();
 }

 void call_rcu_tasks_rude(struct rcu_head *rhp, rcu_callback_t func);
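The control flow of the patched function (take the hotplug read lock, check the online count, leave through a single unlock label) can be illustrated with a small user-space model. This is a sketch under loud assumptions: the "lock" here is a pair of plain fields rather than a real read/write primitive, a refused acquisition is reported as `GP_MUST_BLOCK` where the real lock would sleep, and every name is hypothetical rather than a kernel API.

```c
#include <stdbool.h>

/* Hypothetical model of a hotplug lock; not a real locking primitive. */
struct hotplug_model {
	int readers;           /* holders of the modelled read side */
	bool offline_running;  /* a hotplug operation holds the write side */
	int num_online;        /* models __num_online_cpus */
};

enum gp_result { GP_MUST_BLOCK, GP_FASTPATH, GP_SLOWPATH };

/* Control CPU: start taking a CPU down. Refused while readers exist. */
bool begin_offline(struct hotplug_model *m)
{
	if (m->readers > 0)
		return false;
	m->offline_running = true;
	m->num_online--;  /* the outgoing CPU may still be executing */
	return true;
}

/* Control CPU: the outgoing CPU is now completely dead. */
void finish_offline(struct hotplug_model *m)
{
	m->offline_running = false;
}

/* Same shape as the patched function: lock, check, single exit label. */
enum gp_result wait_gp_locked(struct hotplug_model *m)
{
	enum gp_result res;

	if (m->offline_running)
		return GP_MUST_BLOCK;  /* a real read lock would sleep here */
	m->readers++;                  /* stand-in for cpus_read_lock() */
	if (m->num_online <= 1) {
		res = GP_FASTPATH;     /* no hotplug operation is in flight */
		goto end;
	}
	res = GP_SLOWPATH;             /* would queue and flush per-CPU work */
end:
	m->readers--;                  /* stand-in for cpus_read_unlock() */
	return res;
}
```

In this model the fast path can only be taken when no offline operation is in flight, so the half-offline state from the discussion is never observed by the count check. The cost that the thread raises, namely that hotplug operations and grace-period waits now exclude each other, shows up here as the `GP_MUST_BLOCK` result.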