Message ID | 20231003232903.7109-1-frederic@kernel.org |
---|---|
Headers |
From: Frederic Weisbecker <frederic@kernel.org>
To: "Paul E . McKenney" <paulmck@kernel.org>
Cc: LKML <linux-kernel@vger.kernel.org>, Frederic Weisbecker <frederic@kernel.org>, Yong He <zhuangel570@gmail.com>, Neeraj upadhyay <neeraj.iitr10@gmail.com>, Joel Fernandes <joel@joelfernandes.org>, Boqun Feng <boqun.feng@gmail.com>, Uladzislau Rezki <urezki@gmail.com>, RCU <rcu@vger.kernel.org>
Subject: [PATCH 0/5] srcu fixes
Date: Wed, 4 Oct 2023 01:28:58 +0200
Message-ID: <20231003232903.7109-1-frederic@kernel.org>
X-Mailer: git-send-email 2.41.0
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org |
Series |
srcu fixes
|
|
Message
Frederic Weisbecker
Oct. 3, 2023, 11:28 p.m. UTC
Hi,

This contains a fix for "SRCU: kworker hung in synchronize_srcu":

http://lore.kernel.org/CANZk6aR+CqZaqmMWrC2eRRPY12qAZnDZLwLnHZbNi=xXMB401g@mail.gmail.com

And a few cleanups.

Passed 50 hours of SRCU-P and SRCU-N.

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
srcu/fixes

HEAD: 7ea5adc5673b42ef06e811dca75e43d558cc87e0

Thanks,
Frederic
---

Frederic Weisbecker (5):
  srcu: Fix callbacks acceleration mishandling
  srcu: Only accelerate on enqueue time
  srcu: Remove superfluous callbacks advancing from srcu_start_gp()
  srcu: No need to advance/accelerate if no callback enqueued
  srcu: Explain why callbacks invocations can't run concurrently

 kernel/rcu/srcutree.c | 55 ++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 39 insertions(+), 16 deletions(-)
Comments
On Wed, Oct 04, 2023 at 01:28:58AM +0200, Frederic Weisbecker wrote:
> Hi,
>
> This contains a fix for "SRCU: kworker hung in synchronize_srcu":
>
> http://lore.kernel.org/CANZk6aR+CqZaqmMWrC2eRRPY12qAZnDZLwLnHZbNi=xXMB401g@mail.gmail.com
>
> And a few cleanups.
>
> Passed 50 hours of SRCU-P and SRCU-N.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> srcu/fixes
>
> HEAD: 7ea5adc5673b42ef06e811dca75e43d558cc87e0
>
> Thanks,
> Frederic

Very good, and a big "Thank You!!!" to all of you!

I queued this series for testing purposes, and have started a bunch of
SRCU-P and SRCU-N tests on one set of systems, and a single SRCU-P and
SRCU-N on another system, but with both scenarios resized to 40 CPU each.

While that is in flight, a few questions:

o   Please check the Co-developed-by rules.  Last I knew, it was
    necessary to have a Signed-off-by after each Co-developed-by.

o   Is it possible to get a Tested-by from the original reporter?
    Or is this not reproducible?

o   Is it possible to convince rcutorture to find this sort of
    bug?  Seems like it should be, but easy to say...

o   Frederic, would you like to include this in your upcoming
    pull request?  Or does it need more time?

Thanx, Paul

> ---
>
> Frederic Weisbecker (5):
>   srcu: Fix callbacks acceleration mishandling
>   srcu: Only accelerate on enqueue time
>   srcu: Remove superfluous callbacks advancing from srcu_start_gp()
>   srcu: No need to advance/accelerate if no callback enqueued
>   srcu: Explain why callbacks invocations can't run concurrently
>
>
>  kernel/rcu/srcutree.c | 55 ++++++++++++++++++++++++++++++++++++---------------
>  1 file changed, 39 insertions(+), 16 deletions(-)
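The Co-developed-by rule raised above (each Co-developed-by immediately followed by a Signed-off-by from the same person) can be checked mechanically. The following is only a minimal sketch: the function name and the example identities are hypothetical, and it is not the check that any kernel tooling actually runs.

```python
import re
from typing import List

# Hypothetical helper: the name and behaviour are illustrative assumptions,
# not part of any kernel script.
def check_codeveloped_trailers(message: str) -> List[str]:
    """Return a list of problems; an empty list means every
    Co-developed-by line is immediately followed by a Signed-off-by
    line carrying the same identity."""
    lines = message.splitlines()
    problems = []
    for i, line in enumerate(lines):
        m = re.match(r"\s*Co-developed-by:\s*(.+?)\s*$", line)
        if not m:
            continue
        who = m.group(1)
        # The very next line must be the co-author's Signed-off-by.
        nxt = lines[i + 1] if i + 1 < len(lines) else ""
        m2 = re.match(r"\s*Signed-off-by:\s*(.+?)\s*$", nxt)
        if not m2 or m2.group(1) != who:
            problems.append("missing Signed-off-by after Co-developed-by: " + who)
    return problems


# Made-up identities, for illustration only.
good = (
    "srcu: example subject\n"
    "\n"
    "Co-developed-by: A Developer <a@example.org>\n"
    "Signed-off-by: A Developer <a@example.org>\n"
    "Signed-off-by: B Maintainer <b@example.org>\n"
)
bad = (
    "srcu: example subject\n"
    "\n"
    "Co-developed-by: A Developer <a@example.org>\n"
    "Signed-off-by: B Maintainer <b@example.org>\n"
)
print(check_codeveloped_trailers(good))
print(check_codeveloped_trailers(bad))
```

The sketch compares identities as exact strings, so differences in spelling or address between the two trailers are reported as a problem.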
On Tue, Oct 03, 2023 at 05:35:31PM -0700, Paul E. McKenney wrote:
> On Wed, Oct 04, 2023 at 01:28:58AM +0200, Frederic Weisbecker wrote:
> > Hi,
> >
> > This contains a fix for "SRCU: kworker hung in synchronize_srcu":
> >
> > http://lore.kernel.org/CANZk6aR+CqZaqmMWrC2eRRPY12qAZnDZLwLnHZbNi=xXMB401g@mail.gmail.com
> >
> > And a few cleanups.
> >
> > Passed 50 hours of SRCU-P and SRCU-N.
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> > srcu/fixes
> >
> > HEAD: 7ea5adc5673b42ef06e811dca75e43d558cc87e0
> >
> > Thanks,
> > Frederic
>
> Very good, and a big "Thank You!!!" to all of you!
>
> I queued this series for testing purposes, and have started a bunch of
> SRCU-P and SRCU-N tests on one set of systems, and a single SRCU-P and
> SRCU-N on another system, but with both scenarios resized to 40 CPU each.
>
> While that is in flight, a few questions:
>
> o   Please check the Co-developed-by rules.  Last I knew, it was
>     necessary to have a Signed-off-by after each Co-developed-by.
>
> o   Is it possible to get a Tested-by from the original reporter?
>     Or is this not reproducible?
>
> o   Is it possible to convince rcutorture to find this sort of
>     bug?  Seems like it should be, but easy to say...

And one other thing...

o   What other bugs like this one are hiding elsewhere
    in RCU?

> o   Frederic, would you like to include this in your upcoming
>     pull request?  Or does it need more time?

Thanx, Paul

> > ---
> >
> > Frederic Weisbecker (5):
> >   srcu: Fix callbacks acceleration mishandling
> >   srcu: Only accelerate on enqueue time
> >   srcu: Remove superfluous callbacks advancing from srcu_start_gp()
> >   srcu: No need to advance/accelerate if no callback enqueued
> >   srcu: Explain why callbacks invocations can't run concurrently
> >
> >
> >  kernel/rcu/srcutree.c | 55 ++++++++++++++++++++++++++++++++++++---------------
> >  1 file changed, 39 insertions(+), 16 deletions(-)
On Tue, Oct 03, 2023 at 08:21:42PM -0700, Paul E. McKenney wrote:
> On Tue, Oct 03, 2023 at 05:35:31PM -0700, Paul E. McKenney wrote:
> > On Wed, Oct 04, 2023 at 01:28:58AM +0200, Frederic Weisbecker wrote:
> > > Hi,
> > >
> > > This contains a fix for "SRCU: kworker hung in synchronize_srcu":
> > >
> > > http://lore.kernel.org/CANZk6aR+CqZaqmMWrC2eRRPY12qAZnDZLwLnHZbNi=xXMB401g@mail.gmail.com
> > >
> > > And a few cleanups.
> > >
> > > Passed 50 hours of SRCU-P and SRCU-N.
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> > > srcu/fixes
> > >
> > > HEAD: 7ea5adc5673b42ef06e811dca75e43d558cc87e0
> > >
> > > Thanks,
> > > Frederic
> >
> > Very good, and a big "Thank You!!!" to all of you!
> >
> > I queued this series for testing purposes, and have started a bunch of
> > SRCU-P and SRCU-N tests on one set of systems, and a single SRCU-P and
> > SRCU-N on another system, but with both scenarios resized to 40 CPU each.

The 200*1h of SRCU-N and the 100*1h of SRCU-P passed other than the usual
tick-stop errors.  (Is there a patch for that one?)  The 40-CPU SRCU-N
run was fine, but the 40-CPU SRCU-P run failed due to the fanouts setting
a maximum of 16 CPUs.  So I started a 10-hour 40-CPU SRCU-P and a pair
of 10-hour 16-CPU SRCU-N runs on one system, and 200*10h of SRCU-N and
100*10h of SRCU-P.

I will let you know how it goes.

Thanx, Paul

> > While that is in flight, a few questions:
> >
> > o   Please check the Co-developed-by rules.  Last I knew, it was
> >     necessary to have a Signed-off-by after each Co-developed-by.
> >
> > o   Is it possible to get a Tested-by from the original reporter?
> >     Or is this not reproducible?
> >
> > o   Is it possible to convince rcutorture to find this sort of
> >     bug?  Seems like it should be, but easy to say...
>
> And one other thing...
>
> o   What other bugs like this one are hiding elsewhere
>     in RCU?
>
> > o   Frederic, would you like to include this in your upcoming
> >     pull request?  Or does it need more time?
>
> Thanx, Paul
>
> > > ---
> > >
> > > Frederic Weisbecker (5):
> > >   srcu: Fix callbacks acceleration mishandling
> > >   srcu: Only accelerate on enqueue time
> > >   srcu: Remove superfluous callbacks advancing from srcu_start_gp()
> > >   srcu: No need to advance/accelerate if no callback enqueued
> > >   srcu: Explain why callbacks invocations can't run concurrently
> > >
> > >
> > >  kernel/rcu/srcutree.c | 55 ++++++++++++++++++++++++++++++++++++---------------
> > >  1 file changed, 39 insertions(+), 16 deletions(-)
On Tue, Oct 03, 2023 at 05:35:31PM -0700, Paul E. McKenney wrote:
> On Wed, Oct 04, 2023 at 01:28:58AM +0200, Frederic Weisbecker wrote:
> > Hi,
> >
> > This contains a fix for "SRCU: kworker hung in synchronize_srcu":
> >
> > http://lore.kernel.org/CANZk6aR+CqZaqmMWrC2eRRPY12qAZnDZLwLnHZbNi=xXMB401g@mail.gmail.com
> >
> > And a few cleanups.
> >
> > Passed 50 hours of SRCU-P and SRCU-N.
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> > srcu/fixes
> >
> > HEAD: 7ea5adc5673b42ef06e811dca75e43d558cc87e0
> >
> > Thanks,
> > Frederic
>
> Very good, and a big "Thank You!!!" to all of you!
>
> I queued this series for testing purposes, and have started a bunch of
> SRCU-P and SRCU-N tests on one set of systems, and a single SRCU-P and
> SRCU-N on another system, but with both scenarios resized to 40 CPU each.
>
> While that is in flight, a few questions:
>
> o   Please check the Co-developed-by rules.  Last I knew, it was
>     necessary to have a Signed-off-by after each Co-developed-by.

Indeed! I'll try to collect the three of them within a few days. If some
are missing, I'll put a Reported-by instead.

>
> o   Is it possible to get a Tested-by from the original reporter?
>     Or is this not reproducible?

It seems that the issue would trigger rarely. But I hope we can get one.

>
> o   Is it possible to convince rcutorture to find this sort of
>     bug?  Seems like it should be, but easy to say...

So at least the part where advance/accelerate fail is observed from time
to time. But then we must meet two more rare events:

1) The CPU failing to ACC/ADV must also fail to start the grace period because
   another CPU was faster.

2) The callbacks invocation must not run until that grace period has ended (even
   though we had a previous one completed with callbacks ready).

   Or it can run after all but at least the acceleration part of it has to
   happen after the end of the new grace period.

Perhaps all these conditions can be met more often if we overcommit the number
of vCPU. For example run 10 SRCU-P instances within 3 real CPUs. This could
introduce random breaks within the torture writers...

Just an idea...

>
> o   Frederic, would you like to include this in your upcoming
>     pull request?  Or does it need more time?

At least the first patch yes. It should be easily backported and
it should be enough to solve the race. I'll just wait a bit to collect
more tags.

Thanks!

>
> Thanx, Paul
>
> > ---
> >
> > Frederic Weisbecker (5):
> >   srcu: Fix callbacks acceleration mishandling
> >   srcu: Only accelerate on enqueue time
> >   srcu: Remove superfluous callbacks advancing from srcu_start_gp()
> >   srcu: No need to advance/accelerate if no callback enqueued
> >   srcu: Explain why callbacks invocations can't run concurrently
> >
> >
> >  kernel/rcu/srcutree.c | 55 ++++++++++++++++++++++++++++++++++++---------------
> >  1 file changed, 39 insertions(+), 16 deletions(-)
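A toy illustration may help with the advance/accelerate discussion above. It assumes the grace-period sequence arithmetic of the kernel's rcu_seq_snap() helper (two low-order state bits, the remaining bits a counter); that detail is taken from kernel/rcu/rcu.h rather than from this thread, so treat it as an assumption. The Python names are illustrative and the model says nothing about the segmented callback list itself.

```python
# Toy model only: assumed to mirror the sequence arithmetic of
# rcu_seq_snap() in kernel/rcu/rcu.h. Not kernel code.
RCU_SEQ_CTR_SHIFT = 2                              # low bits hold the GP state
RCU_SEQ_STATE_MASK = (1 << RCU_SEQ_CTR_SHIFT) - 1  # == 3

def seq_in_progress(seq: int) -> bool:
    """True if the state bits say a grace period is currently running."""
    return (seq & RCU_SEQ_STATE_MASK) != 0

def seq_snap(seq: int) -> int:
    """Sequence value at which a full grace period, started after the
    counter read 'seq', is guaranteed to have completed."""
    return (seq + 2 * RCU_SEQ_STATE_MASK + 1) & ~RCU_SEQ_STATE_MASK

# Idle at 4, then a GP in progress (5), then that GP done and the
# next one already in progress (9).
for seq in (4, 5, 9):
    print(seq, seq_in_progress(seq), seq_snap(seq))
```

Under these assumptions the snapshots are 8, 12 and 16 respectively. Two reads of the counter taken on either side of a grace-period boundary therefore name different target grace periods, which may be one way an advance attempt and a later accelerate attempt end up working from inconsistent values.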
On Tue, Oct 03, 2023 at 08:21:42PM -0700, Paul E. McKenney wrote:
> On Tue, Oct 03, 2023 at 05:35:31PM -0700, Paul E. McKenney wrote:
> > On Wed, Oct 04, 2023 at 01:28:58AM +0200, Frederic Weisbecker wrote:
> > > Hi,
> > >
> > > This contains a fix for "SRCU: kworker hung in synchronize_srcu":
> > >
> > > http://lore.kernel.org/CANZk6aR+CqZaqmMWrC2eRRPY12qAZnDZLwLnHZbNi=xXMB401g@mail.gmail.com
> > >
> > > And a few cleanups.
> > >
> > > Passed 50 hours of SRCU-P and SRCU-N.
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> > > srcu/fixes
> > >
> > > HEAD: 7ea5adc5673b42ef06e811dca75e43d558cc87e0
> > >
> > > Thanks,
> > > Frederic
> >
> > Very good, and a big "Thank You!!!" to all of you!
> >
> > I queued this series for testing purposes, and have started a bunch of
> > SRCU-P and SRCU-N tests on one set of systems, and a single SRCU-P and
> > SRCU-N on another system, but with both scenarios resized to 40 CPU each.
> >
> > While that is in flight, a few questions:
> >
> > o   Please check the Co-developed-by rules.  Last I knew, it was
> >     necessary to have a Signed-off-by after each Co-developed-by.
> >
> > o   Is it possible to get a Tested-by from the original reporter?
> >     Or is this not reproducible?
> >
> > o   Is it possible to convince rcutorture to find this sort of
> >     bug?  Seems like it should be, but easy to say...
>
> And one other thing...
>
> o   What other bugs like this one are hiding elsewhere
>     in RCU?

Hmm, yesterday I thought RCU would be fine because it has a tick polling
on callbacks anyway. But I'm not so sure, I'll check for real...

Thanks.

>
> > o   Frederic, would you like to include this in your upcoming
> >     pull request?  Or does it need more time?
>
> Thanx, Paul
>
> > > ---
> > >
> > > Frederic Weisbecker (5):
> > >   srcu: Fix callbacks acceleration mishandling
> > >   srcu: Only accelerate on enqueue time
> > >   srcu: Remove superfluous callbacks advancing from srcu_start_gp()
> > >   srcu: No need to advance/accelerate if no callback enqueued
> > >   srcu: Explain why callbacks invocations can't run concurrently
> > >
> > >
> > >  kernel/rcu/srcutree.c | 55 ++++++++++++++++++++++++++++++++++++---------------
> > >  1 file changed, 39 insertions(+), 16 deletions(-)
On Tue, Oct 03, 2023 at 08:30:45PM -0700, Paul E. McKenney wrote:
> On Tue, Oct 03, 2023 at 08:21:42PM -0700, Paul E. McKenney wrote:
> > On Tue, Oct 03, 2023 at 05:35:31PM -0700, Paul E. McKenney wrote:
> > > On Wed, Oct 04, 2023 at 01:28:58AM +0200, Frederic Weisbecker wrote:
> > > > Hi,
> > > >
> > > > This contains a fix for "SRCU: kworker hung in synchronize_srcu":
> > > >
> > > > http://lore.kernel.org/CANZk6aR+CqZaqmMWrC2eRRPY12qAZnDZLwLnHZbNi=xXMB401g@mail.gmail.com
> > > >
> > > > And a few cleanups.
> > > >
> > > > Passed 50 hours of SRCU-P and SRCU-N.
> > > >
> > > > git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> > > > srcu/fixes
> > > >
> > > > HEAD: 7ea5adc5673b42ef06e811dca75e43d558cc87e0
> > > >
> > > > Thanks,
> > > > Frederic
> > >
> > > Very good, and a big "Thank You!!!" to all of you!
> > >
> > > I queued this series for testing purposes, and have started a bunch of
> > > SRCU-P and SRCU-N tests on one set of systems, and a single SRCU-P and
> > > SRCU-N on another system, but with both scenarios resized to 40 CPU each.
>
> The 200*1h of SRCU-N and the 100*1h of SRCU-P passed other than the usual
> tick-stop errors.  (Is there a patch for that one?)  The 40-CPU SRCU-N
> run was fine, but the 40-CPU SRCU-P run failed due to the fanouts setting
> a maximum of 16 CPUs.  So I started a 10-hour 40-CPU SRCU-P and a pair
> of 10-hour 16-CPU SRCU-N runs on one system, and 200*10h of SRCU-N and
> 100*10h of SRCU-P.
>
> I will let you know how it goes.

Very nice! It might be worth testing the first patch alone as
well if we backport only this one.

Thanks!

> Thanx, Paul
>
> > > While that is in flight, a few questions:
> > >
> > > o   Please check the Co-developed-by rules.  Last I knew, it was
> > >     necessary to have a Signed-off-by after each Co-developed-by.
> > >
> > > o   Is it possible to get a Tested-by from the original reporter?
> > >     Or is this not reproducible?
> > >
> > > o   Is it possible to convince rcutorture to find this sort of
> > >     bug?  Seems like it should be, but easy to say...
> >
> > And one other thing...
> >
> > o   What other bugs like this one are hiding elsewhere
> >     in RCU?
> >
> > > o   Frederic, would you like to include this in your upcoming
> > >     pull request?  Or does it need more time?
> >
> > Thanx, Paul
> >
> > > > ---
> > > >
> > > > Frederic Weisbecker (5):
> > > >   srcu: Fix callbacks acceleration mishandling
> > > >   srcu: Only accelerate on enqueue time
> > > >   srcu: Remove superfluous callbacks advancing from srcu_start_gp()
> > > >   srcu: No need to advance/accelerate if no callback enqueued
> > > >   srcu: Explain why callbacks invocations can't run concurrently
> > > >
> > > >
> > > >  kernel/rcu/srcutree.c | 55 ++++++++++++++++++++++++++++++++++++---------------
> > > >  1 file changed, 39 insertions(+), 16 deletions(-)
On Wed, Oct 04, 2023 at 11:36:49AM +0200, Frederic Weisbecker wrote:
> On Tue, Oct 03, 2023 at 08:30:45PM -0700, Paul E. McKenney wrote:
> > On Tue, Oct 03, 2023 at 08:21:42PM -0700, Paul E. McKenney wrote:
> > > On Tue, Oct 03, 2023 at 05:35:31PM -0700, Paul E. McKenney wrote:
> > > > On Wed, Oct 04, 2023 at 01:28:58AM +0200, Frederic Weisbecker wrote:
> > > > > Hi,
> > > > >
> > > > > This contains a fix for "SRCU: kworker hung in synchronize_srcu":
> > > > >
> > > > > http://lore.kernel.org/CANZk6aR+CqZaqmMWrC2eRRPY12qAZnDZLwLnHZbNi=xXMB401g@mail.gmail.com
> > > > >
> > > > > And a few cleanups.
> > > > >
> > > > > Passed 50 hours of SRCU-P and SRCU-N.
> > > > >
> > > > > git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> > > > > srcu/fixes
> > > > >
> > > > > HEAD: 7ea5adc5673b42ef06e811dca75e43d558cc87e0
> > > > >
> > > > > Thanks,
> > > > > Frederic
> > > >
> > > > Very good, and a big "Thank You!!!" to all of you!
> > > >
> > > > I queued this series for testing purposes, and have started a bunch of
> > > > SRCU-P and SRCU-N tests on one set of systems, and a single SRCU-P and
> > > > SRCU-N on another system, but with both scenarios resized to 40 CPU each.
> >
> > The 200*1h of SRCU-N and the 100*1h of SRCU-P passed other than the usual
> > tick-stop errors.  (Is there a patch for that one?)  The 40-CPU SRCU-N
> > run was fine, but the 40-CPU SRCU-P run failed due to the fanouts setting
> > a maximum of 16 CPUs.  So I started a 10-hour 40-CPU SRCU-P and a pair
> > of 10-hour 16-CPU SRCU-N runs on one system, and 200*10h of SRCU-N and
> > 100*10h of SRCU-P.
> >
> > I will let you know how it goes.
>
> Very nice! It might be worth testing the first patch alone as
> well if we backport only this one.

The 10-hour 40-CPU SRCU-P run and pair of 10-hour 16-CPU SRCU-N runs
completed without failure.  The others had some failures, but I need
to look and see if any were unexpected.  In the meantime, I started a
two-hour 40-CPU SRCU-P run and a pair of one-hour 16-CPU SRCU-N runs on
just that first commit.  Also servicing SIGSHOWER and SIGFOOD.  ;-)

Thanx, Paul

> Thanks!
>
> > Thanx, Paul
> >
> > > > While that is in flight, a few questions:
> > > >
> > > > o   Please check the Co-developed-by rules.  Last I knew, it was
> > > >     necessary to have a Signed-off-by after each Co-developed-by.
> > > >
> > > > o   Is it possible to get a Tested-by from the original reporter?
> > > >     Or is this not reproducible?
> > > >
> > > > o   Is it possible to convince rcutorture to find this sort of
> > > >     bug?  Seems like it should be, but easy to say...
> > >
> > > And one other thing...
> > >
> > > o   What other bugs like this one are hiding elsewhere
> > >     in RCU?
> > >
> > > > o   Frederic, would you like to include this in your upcoming
> > > >     pull request?  Or does it need more time?
> > >
> > > Thanx, Paul
> > >
> > > > > ---
> > > > >
> > > > > Frederic Weisbecker (5):
> > > > >   srcu: Fix callbacks acceleration mishandling
> > > > >   srcu: Only accelerate on enqueue time
> > > > >   srcu: Remove superfluous callbacks advancing from srcu_start_gp()
> > > > >   srcu: No need to advance/accelerate if no callback enqueued
> > > > >   srcu: Explain why callbacks invocations can't run concurrently
> > > > >
> > > > >
> > > > >  kernel/rcu/srcutree.c | 55 ++++++++++++++++++++++++++++++++++++---------------
> > > > >  1 file changed, 39 insertions(+), 16 deletions(-)
On Wed, Oct 04, 2023 at 07:06:58AM -0700, Paul E. McKenney wrote:
> On Wed, Oct 04, 2023 at 11:36:49AM +0200, Frederic Weisbecker wrote:
> > On Tue, Oct 03, 2023 at 08:30:45PM -0700, Paul E. McKenney wrote:
> > > On Tue, Oct 03, 2023 at 08:21:42PM -0700, Paul E. McKenney wrote:
> > > > On Tue, Oct 03, 2023 at 05:35:31PM -0700, Paul E. McKenney wrote:
> > > > > On Wed, Oct 04, 2023 at 01:28:58AM +0200, Frederic Weisbecker wrote:
> > > > > > Hi,
> > > > > >
> > > > > > This contains a fix for "SRCU: kworker hung in synchronize_srcu":
> > > > > >
> > > > > > http://lore.kernel.org/CANZk6aR+CqZaqmMWrC2eRRPY12qAZnDZLwLnHZbNi=xXMB401g@mail.gmail.com
> > > > > >
> > > > > > And a few cleanups.
> > > > > >
> > > > > > Passed 50 hours of SRCU-P and SRCU-N.
> > > > > >
> > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> > > > > > srcu/fixes
> > > > > >
> > > > > > HEAD: 7ea5adc5673b42ef06e811dca75e43d558cc87e0
> > > > > >
> > > > > > Thanks,
> > > > > > Frederic
> > > > >
> > > > > Very good, and a big "Thank You!!!" to all of you!
> > > > >
> > > > > I queued this series for testing purposes, and have started a bunch of
> > > > > SRCU-P and SRCU-N tests on one set of systems, and a single SRCU-P and
> > > > > SRCU-N on another system, but with both scenarios resized to 40 CPU each.
> > >
> > > The 200*1h of SRCU-N and the 100*1h of SRCU-P passed other than the usual
> > > tick-stop errors.  (Is there a patch for that one?)  The 40-CPU SRCU-N
> > > run was fine, but the 40-CPU SRCU-P run failed due to the fanouts setting
> > > a maximum of 16 CPUs.  So I started a 10-hour 40-CPU SRCU-P and a pair
> > > of 10-hour 16-CPU SRCU-N runs on one system, and 200*10h of SRCU-N and
> > > 100*10h of SRCU-P.
> > >
> > > I will let you know how it goes.
> >
> > Very nice! It might be worth testing the first patch alone as
> > well if we backport only this one.
>
> The 10-hour 40-CPU SRCU-P run and pair of 10-hour 16-CPU SRCU-N runs
> completed without failure.  The others had some failures, but I need
> to look and see if any were unexpected.  In the meantime, I started a
> two-hour 40-CPU SRCU-P run and a pair of one-hour 16-CPU SRCU-N runs on
> just that first commit.  Also servicing SIGSHOWER and SIGFOOD.  ;-)

And the two-hour 40-CPU SRCU-P run and a pair of two-hour 16-CPU SRCU-N
runs (on only the first commit) completed without incident.

The other set of overnight full-stack runs had only tick-stop errors,
so I started a two-hour set on the first commit.

So far so good!

Thanx, Paul

> > Thanks!
> >
> > > Thanx, Paul
> > >
> > > > > While that is in flight, a few questions:
> > > > >
> > > > > o   Please check the Co-developed-by rules.  Last I knew, it was
> > > > >     necessary to have a Signed-off-by after each Co-developed-by.
> > > > >
> > > > > o   Is it possible to get a Tested-by from the original reporter?
> > > > >     Or is this not reproducible?
> > > > >
> > > > > o   Is it possible to convince rcutorture to find this sort of
> > > > >     bug?  Seems like it should be, but easy to say...
> > > >
> > > > And one other thing...
> > > >
> > > > o   What other bugs like this one are hiding elsewhere
> > > >     in RCU?
> > > >
> > > > > o   Frederic, would you like to include this in your upcoming
> > > > >     pull request?  Or does it need more time?
> > > >
> > > > Thanx, Paul
> > > >
> > > > > > ---
> > > > > >
> > > > > > Frederic Weisbecker (5):
> > > > > >   srcu: Fix callbacks acceleration mishandling
> > > > > >   srcu: Only accelerate on enqueue time
> > > > > >   srcu: Remove superfluous callbacks advancing from srcu_start_gp()
> > > > > >   srcu: No need to advance/accelerate if no callback enqueued
> > > > > >   srcu: Explain why callbacks invocations can't run concurrently
> > > > > >
> > > > > >
> > > > > >  kernel/rcu/srcutree.c | 55 ++++++++++++++++++++++++++++++++++++---------------
> > > > > >  1 file changed, 39 insertions(+), 16 deletions(-)
On Wed, Oct 04, 2023 at 09:47:04AM -0700, Paul E. McKenney wrote:
> > The 10-hour 40-CPU SRCU-P run and pair of 10-hour 16-CPU SRCU-N runs
> > completed without failure.  The others had some failures, but I need
> > to look and see if any were unexpected.  In the meantime, I started a
> > two-hour 40-CPU SRCU-P run and a pair of one-hour 16-CPU SRCU-N runs on
> > just that first commit.  Also servicing SIGSHOWER and SIGFOOD.  ;-)
>
> And the two-hour 40-CPU SRCU-P run and a pair of two-hour 16-CPU SRCU-N
> runs (on only the first commit) completed without incident.
>
> The other set of overnight full-stack runs had only tick-stop errors,
> so I started a two-hour set on the first commit.
>
> So far so good!

Very nice!

As for the tick-stop error, see the upstream fix:

   1a6a46477494 ("timers: Tag (hr)timer softirq as hotplug safe")

Thanks!
On Wed, Oct 04, 2023 at 11:27:29PM +0200, Frederic Weisbecker wrote:
> On Wed, Oct 04, 2023 at 09:47:04AM -0700, Paul E. McKenney wrote:
> > > The 10-hour 40-CPU SRCU-P run and pair of 10-hour 16-CPU SRCU-N runs
> > > completed without failure.  The others had some failures, but I need
> > > to look and see if any were unexpected.  In the meantime, I started a
> > > two-hour 40-CPU SRCU-P run and a pair of one-hour 16-CPU SRCU-N runs on
> > > just that first commit.  Also servicing SIGSHOWER and SIGFOOD.  ;-)
> >
> > And the two-hour 40-CPU SRCU-P run and a pair of two-hour 16-CPU SRCU-N
> > runs (on only the first commit) completed without incident.
> >
> > The other set of overnight full-stack runs had only tick-stop errors,
> > so I started a two-hour set on the first commit.
> >
> > So far so good!
>
> Very nice!
>
> As for the tick-stop error, see the upstream fix:
>
>    1a6a46477494 ("timers: Tag (hr)timer softirq as hotplug safe")

Got it, thank you!

And the two-hour set of 200*SRCU-N and 100*SRCU-P had only tick-stop
errors.  I am refreshing the test grid, and will run overnight.

Here is hoping!

Thanx, Paul
On Wed, Oct 04, 2023 at 02:54:57PM -0700, Paul E. McKenney wrote:
> On Wed, Oct 04, 2023 at 11:27:29PM +0200, Frederic Weisbecker wrote:
> > On Wed, Oct 04, 2023 at 09:47:04AM -0700, Paul E. McKenney wrote:
> > > > The 10-hour 40-CPU SRCU-P run and pair of 10-hour 16-CPU SRCU-N runs
> > > > completed without failure.  The others had some failures, but I need
> > > > to look and see if any were unexpected.  In the meantime, I started a
> > > > two-hour 40-CPU SRCU-P run and a pair of one-hour 16-CPU SRCU-N runs on
> > > > just that first commit.  Also servicing SIGSHOWER and SIGFOOD.  ;-)
> > >
> > > And the two-hour 40-CPU SRCU-P run and a pair of two-hour 16-CPU SRCU-N
> > > runs (on only the first commit) completed without incident.
> > >
> > > The other set of overnight full-stack runs had only tick-stop errors,
> > > so I started a two-hour set on the first commit.
> > >
> > > So far so good!
> >
> > Very nice!
> >
> > As for the tick-stop error, see the upstream fix:
> >
> >    1a6a46477494 ("timers: Tag (hr)timer softirq as hotplug safe")
>
> Got it, thank you!
>
> And the two-hour set of 200*SRCU-N and 100*SRCU-P had only tick-stop
> errors.  I am refreshing the test grid, and will run overnight.

And the ten-hour test passed with only the tick-stop errors, representing
2000 hours of SRCU-N and 1000 hours of SRCU-P.  (I ran the exact same
stack, without the rebased fix you call out above.)

Looking good!

Thanx, Paul
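The aggregate figures quoted above follow from the run shape used throughout the thread, where "200*10h" is read as 200 concurrent instances of a scenario, each running for 10 hours. A minimal sketch of that arithmetic (the helper name is hypothetical):

```python
# Hypothetical helper: total machine-hours for N instances of a
# scenario that each run for the same duration.
def total_hours(instances: int, hours_each: int) -> int:
    return instances * hours_each

# Run shape taken from the thread: 200*10h of SRCU-N, 100*10h of SRCU-P.
overnight = {"SRCU-N": (200, 10), "SRCU-P": (100, 10)}
totals = {name: total_hours(n, h) for name, (n, h) in overnight.items()}
print(totals)
```

The same helper applied to the earlier one-hour pass (200*1h and 100*1h) gives 200 and 100 hours.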
On Wed, Oct 4, 2023 at 5:25 PM Frederic Weisbecker <frederic@kernel.org> wrote:
>
> On Tue, Oct 03, 2023 at 05:35:31PM -0700, Paul E. McKenney wrote:
> > On Wed, Oct 04, 2023 at 01:28:58AM +0200, Frederic Weisbecker wrote:
> > > Hi,
> > >
> > > This contains a fix for "SRCU: kworker hung in synchronize_srcu":
> > >
> > >     http://lore.kernel.org/CANZk6aR+CqZaqmMWrC2eRRPY12qAZnDZLwLnHZbNi=xXMB401g@mail.gmail.com
> > >
> > > And a few cleanups.
> > >
> > > Passed 50 hours of SRCU-P and SRCU-N.
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> > >     srcu/fixes
> > >
> > > HEAD: 7ea5adc5673b42ef06e811dca75e43d558cc87e0
> > >
> > > Thanks,
> > >     Frederic
> >
> > Very good, and a big "Thank You!!!" to all of you!
> >
> > I queued this series for testing purposes, and have started a bunch of
> > SRCU-P and SRCU-N tests on one set of systems, and a single SRCU-P and
> > SRCU-N on another system, but with both scenarios resized to 40 CPU each.
> >
> > While that is in flight, a few questions:
> >
> > o     Please check the Co-developed-by rules. Last I knew, it was
> >       necessary to have a Signed-off-by after each Co-developed-by.
>
> Indeed! I'll try to collect the three of them within a few days. If some
> are missing, I'll put a Reported-by instead.
>
> >
> > o     Is it possible to get a Tested-by from the original reporter?
> >       Or is this not reproducible?
>
> It seems that the issue would trigger rarely. But I hope we can get one.

There is currently no way to reproduce this problem in our environment.
The problem has appeared on 2 machines, and each time it occurred, the
test had been running for more than a month.

BTW, I will run tests with these patches in our environment.

>
> >
> > o     Is it possible to convince rcutorture to find this sort of
> >       bug? Seems like it should be, but easy to say...
>
> So at least the part where advance/accelerate fail is observed from time
> to time. But then we must meet two more rare events:
>
> 1) The CPU failing to ACC/ADV must also fail to start the grace period because
>    another CPU was faster.
>
> 2) The callbacks invocation must not run until that grace period has ended (even
>    though we had a previous one completed with callbacks ready).
>
>    Or it can run after all but at least the acceleration part of it has to
>    happen after the end of the new grace period.
>
> Perhaps all these conditions can be met more often if we overcommit the number
> of vCPU. For example run 10 SRCU-P instances within 3 real CPUs. This could
> introduce random breaks within the torture writers...
>
> Just an idea...
>
> >
> > o     Frederic, would you like to include this in your upcoming
> >       pull request? Or does it need more time?
>
> At least the first patch yes. It should be easily backported and
> it should be enough to solve the race. I'll just wait a bit to collect
> more tags.
>
> Thanks!
>
> >
> >                                                       Thanx, Paul
> >
> > > ---
> > >
> > > Frederic Weisbecker (5):
> > >       srcu: Fix callbacks acceleration mishandling
> > >       srcu: Only accelerate on enqueue time
> > >       srcu: Remove superfluous callbacks advancing from srcu_start_gp()
> > >       srcu: No need to advance/accelerate if no callback enqueued
> > >       srcu: Explain why callbacks invocations can't run concurrently
> > >
> > >
> > >  kernel/rcu/srcutree.c | 55 ++++++++++++++++++++++++++++++++++++---------------
> > >  1 file changed, 39 insertions(+), 16 deletions(-)
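The reply above describes a conjunction of rare events: a failed advance/accelerate, losing the race to start the grace period, and callback invocation (or at least its acceleration step) landing after the end of the new grace period. A rough sketch of why such a conjunction can take very long to show up, assuming purely for illustration that the events are independent and have fixed per-attempt probabilities. The probabilities below are hypothetical placeholders, not measurements, and `expected_attempts` is a made-up helper name:

```python
import math


def expected_attempts(probabilities):
    """Expected number of attempts until all events coincide,
    assuming independent events (mean of a geometric distribution)."""
    joint = math.prod(probabilities)
    if joint <= 0:
        raise ValueError("every probability must be positive")
    return 1.0 / joint


# Hypothetical per-attempt probabilities (NOT measured values):
p_acc_adv_fails = 1e-3     # advance/accelerate fails
p_loses_gp_start = 1e-2    # another CPU starts the grace period first
p_invocation_late = 1e-3   # callback invocation lags past the new GP's end
attempts = expected_attempts(
    [p_acc_adv_fails, p_loses_gp_start, p_invocation_late]
)
print(f"expected attempts before a hit: {attempts:.0f}")
```

Under this simplified model, raising any one of the probabilities (for instance by overcommitting vCPUs, as suggested above) reduces the expected number of attempts proportionally.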
On Thu, Oct 05, 2023 at 09:54:12AM -0700, Paul E. McKenney wrote:
> On Wed, Oct 04, 2023 at 02:54:57PM -0700, Paul E. McKenney wrote:
> > On Wed, Oct 04, 2023 at 11:27:29PM +0200, Frederic Weisbecker wrote:
> > > On Wed, Oct 04, 2023 at 09:47:04AM -0700, Paul E. McKenney wrote:
> > > > > The 10-hour 40-CPU SRCU-P run and pair of 10-hour 16-CPU SRCU-N runs
> > > > > completed without failure. The others had some failures, but I need
> > > > > to look and see if any were unexpected. In the meantime, I started a
> > > > > two-hour 40-CPU SRCU-P run and a pair of one-hour 16-CPU SRCU-N runs on
> > > > > just that first commit. Also servicing SIGSHOWER and SIGFOOD. ;-)
> > > >
> > > > And the two-hour 40-CPU SRCU-P run and a pair of two-hour 16-CPU SRCU-N
> > > > runs (on only the first commit) completed without incident.
> > > >
> > > > The other set of overnight full-stack runs had only tick-stop errors,
> > > > so I started a two-hour set on the first commit.
> > > >
> > > > So far so good!
> > >
> > > Very nice!
> > >
> > > As for the tick-stop error, see the upstream fix:
> > >
> > >    1a6a46477494 ("timers: Tag (hr)timer softirq as hotplug safe")
> >
> > Got it, thank you!
> >
> > And the two-hour set of 200*SRCU-N and 100*SRCU-P had only tick-stop
> > errors. I am refreshing the test grid, and will run overnight.
>
> And the ten-hour test passed with only the tick-stop errors, representing
> 2000 hours of SRCU-N and 1000 hours of SRCU-P. (I ran the exact same
> stack, without the rebased fix you call out above.)

Thanks a lot!
On Sat, Oct 07, 2023 at 06:24:53PM +0800, zhuangel570 wrote:
> On Wed, Oct 4, 2023 at 5:25 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> There is currently no way to reproduce this problem in our environment.
> The problem has appeared on 2 machines, and each time it occurred, the
> test had been running for more than a month.
>
> BTW, I will run tests with these patches in our environment.

Ok, let us know if it ever triggers after this series.

Since I added you in the Co-developed-by: tags, I will need to
also add your Signed-off-by: tag.

Is that ok for you?

Thanks.
On Tue, Oct 10, 2023 at 7:27 PM Frederic Weisbecker <frederic@kernel.org> wrote:
>
> On Sat, Oct 07, 2023 at 06:24:53PM +0800, zhuangel570 wrote:
> > On Wed, Oct 4, 2023 at 5:25 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> > There is currently no way to reproduce this problem in our environment.
> > The problem has appeared on 2 machines, and each time it occurred, the
> > test had been running for more than a month.
> >
> > BTW, I will run tests with these patches in our environment.
>
> Ok, let us know if it ever triggers after this series.

Sure. I have ported the patch set to our test environment; 2 machines have
now been running the test for 3 days, and everything looks fine.

The patch set is very elegant and, by our analysis, completely eliminates
the possibility of unexpected accelerations. I am very confident that it
fixes our problem.

>
> Since I added you in the Co-developed-by: tags, I will need to
> also add your Signed-off-by: tag.
>
> Is that ok for you?

Sure. Big thanks!
If possible, would you please change my "Signed-off-by" into:

Signed-off-by: Yong He <alexyonghe@tencent.com>

>
> Thanks.
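The tag rule discussed in this thread (each Co-developed-by: must be immediately followed by a Signed-off-by: from the same co-author) can be checked mechanically. A minimal sketch, not an existing kernel tool: the helper name `unpaired_co_developers` is hypothetical, and the trailer ordering in the example is an assumption; only the names and addresses come from the thread.

```python
def unpaired_co_developers(trailers):
    """Return co-authors whose Co-developed-by line is not immediately
    followed by a Signed-off-by line naming the same person."""
    missing = []
    for i, line in enumerate(trailers):
        key, _, value = line.partition(":")
        if key.strip() != "Co-developed-by":
            continue
        person = value.strip()
        # The very next trailer must be the co-author's own Signed-off-by.
        nxt = trailers[i + 1] if i + 1 < len(trailers) else ""
        next_key, _, next_value = nxt.partition(":")
        if next_key.strip() != "Signed-off-by" or next_value.strip() != person:
            missing.append(person)
    return missing


# Illustrative trailer block; the ordering is an assumption, only the
# names and addresses are taken from the thread.
trailers = [
    "Co-developed-by: Yong He <alexyonghe@tencent.com>",
    "Signed-off-by: Yong He <alexyonghe@tencent.com>",
    "Signed-off-by: Frederic Weisbecker <frederic@kernel.org>",
]
print(unpaired_co_developers(trailers))
```

An empty list means every co-author is paired with a sign-off; any name returned would need either a Signed-off-by or, as mentioned earlier in the thread, a different tag such as Reported-by.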