Message ID | bc7cc8b0-f587-4451-8bcd-0daae627bcc7@paulmck-laptop |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:612c:2016:b0:403:3b70:6f57 with SMTP id fe22csp459416vqb; Thu, 5 Oct 2023 10:37:12 -0700 (PDT) X-Google-Smtp-Source: AGHT+IFCfjbrqlNFg+8OsgJMtQLjsCdJlCZmUgDX1sI6e+tPUQV0CccQ45ZQUHKgc5x4fbNN99i2 X-Received: by 2002:a05:6a00:398c:b0:68a:6cbe:35a7 with SMTP id fi12-20020a056a00398c00b0068a6cbe35a7mr6331318pfb.2.1696527432630; Thu, 05 Oct 2023 10:37:12 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1696527432; cv=none; d=google.com; s=arc-20160816; b=jJDMjrQK9OazwOInncXTSn5iXJK58SMPZN0TyUEvIyjr/z9wuJv+ohr20EYd8sHgTh Vl2X0bewThedQ38S6nziEDnF/Cy1sLjPk5PFveY/L4qQlVeClS48CICDS+tDM6v6Phuu bD0syzcomoe9IXBDmfS3+v0MtMcIKgDJHFQe2PzvtJn3ujY5O8XyM8zBvzeRtFZsbMUD mgbmE5kwkSOZ/49RgKklhIxsWw5b+EZ30Fey7uwUK1lrMQkacS5Rr+rQ7g4yo4yVHkzs NKXFxbDE7majtrWCYAO4RTe7gMbFaz/wKD81mzZ6rEHnUF9QVICDlhVJhEsIgntYTUbN yf/g== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-disposition:mime-version:reply-to :message-id:subject:cc:to:from:date:dkim-signature; bh=3+LPeXXsUuZqTYKewepgZbs5KFtYihniVTkK07HvXvI=; fh=Ly0RBVX4Dm/io6+f3Ow6WlDzGFcLZXQPwAYGpNWor0s=; b=exni+dvQ1qpOf1EEb10uA4GJ2EIh846xDCGhPZUFiziAFW9J2F/P/K4ezXYunbZia7 Mti+zoLoi3YX12kOvyD8hDlCfGfVXWV/pqTScB55maATE5mxUCK2JetLYeR/izTb8U8c r7Y016xzRqEos302U2YMkgRI3FmrzOYefdrfX9PPHJSU3TZRitoosf9NkbD30fcWHKUd wuAU/gFKkIEFf2vLMUlJHreBSzENLy0kYVzwThK3pwEYr/KuvIxHNeJIjVCxDsbmGf8R OJdTx632IBudL07wzHSH3z6pZcQU00IFaCiH0LQnuwlgr0FE0/FyblINxSEot65iEsrz AaPQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b="RsEsZ3/R"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.31 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: from morse.vger.email (morse.vger.email. [23.128.96.31]) by mx.google.com with ESMTPS id u17-20020a056a00125100b0068a38a9ab84si1864434pfi.176.2023.10.05.10.37.12 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 05 Oct 2023 10:37:12 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.31 as permitted sender) client-ip=23.128.96.31; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b="RsEsZ3/R"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.31 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by morse.vger.email (Postfix) with ESMTP id 96CA08095DCC; Thu, 5 Oct 2023 10:23:55 -0700 (PDT) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.10 at morse.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230452AbjJERXe (ORCPT <rfc822;ezelljr.billy@gmail.com> + 18 others); Thu, 5 Oct 2023 13:23:34 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33396 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230405AbjJERXF (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Thu, 5 Oct 2023 13:23:05 -0400 Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5E1F82D4F for <linux-kernel@vger.kernel.org>; Thu, 5 Oct 2023 09:48:37 -0700 (PDT) Received: by smtp.kernel.org (Postfix) with ESMTPSA id ED01DC433C7; Thu, 5 Oct 2023 16:48:36 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1696524517; bh=CQ2O//GYwC5mvy1qRCT49ssKIiGsS8rLXvdx8sZl8yY=; h=Date:From:To:Cc:Subject:Reply-To:From; b=RsEsZ3/RFK5xfWdrP/s2tB3BdQqazLRDwZz9DQU0D7h0ATTaWh7Xtaw2fxgJZeC03 slXoaHSTZYH7zcUYvsuhLx5aHP7utICxbDwqSTRU57p4rw+0g3tZJkKnLvuDngtuve QzRg1FWb262AAve3NySR0Ss5PGfZp5SaX+fP7OjH5eQ+mpBt2sRqaiq0Gpzuq2PYRN epbfg7Ueo7uTx1s39FRzBeCap70+pT+2xkYhgOfTgW+8wa+LaO99R+WDQwq+/KWqHg rX+nMRzBtotCJCwUoVhmK3CLc7+RtAxuDPnmo3+B6Gvfy4gdTKbdD4AceziWu0AAYA dBdbfhjFrj6MA== Received: by paulmck-ThinkPad-P17-Gen-1.home (Postfix, from userid 1000) id 8173ECE0869; Thu, 5 Oct 2023 09:48:36 -0700 (PDT) Date: Thu, 5 Oct 2023 09:48:36 -0700 From: "Paul E. McKenney" <paulmck@kernel.org> To: linux-kernel@vger.kernel.org Cc: Peter Zijlstra <peterz@infradead.org>, Valentin Schneider <vschneid@redhat.com>, Juergen Gross <jgross@suse.com>, Leonardo Bras <leobras@redhat.com>, Imran Khan <imran.f.khan@oracle.com> Subject: [PATCH smp,csd] Throw an error if a CSD lock is stuck for too long Message-ID: <bc7cc8b0-f587-4451-8bcd-0daae627bcc7@paulmck-laptop> Reply-To: paulmck@kernel.org MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline X-Spam-Status: No, score=-1.2 required=5.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on morse.vger.email Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (morse.vger.email [0.0.0.0]); Thu, 05 Oct 2023 10:23:55 -0700 (PDT) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1778937949210050581 X-GMAIL-MSGID: 1778937949210050581 |
Series |
[smp,csd] Throw an error if a CSD lock is stuck for too long
|
|
Commit Message
Paul E. McKenney
Oct. 5, 2023, 4:48 p.m. UTC
The CSD lock seems to get stuck in 2 "modes". When it gets stuck temporarily, it usually gets released in a few seconds, and sometimes up to one or two minutes. If the CSD lock stays stuck for more than several minutes, it never seems to get unstuck, and gradually more and more things in the system end up also getting stuck. In the latter case, we should just give up, so the system can dump out a little more information about what went wrong, and, with panic_on_oops and a kdump kernel loaded, dump a whole bunch more information about what might have gone wrong. Question: should this have its own panic_on_ipistall switch in /proc/sys/kernel, or maybe piggyback on panic_on_oops in a different way than via BUG_ON? Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Comments
Hello Paul, On 6/10/2023 3:48 am, Paul E. McKenney wrote: > The CSD lock seems to get stuck in 2 "modes". When it gets stuck > temporarily, it usually gets released in a few seconds, and sometimes > up to one or two minutes. > > If the CSD lock stays stuck for more than several minutes, it never > seems to get unstuck, and gradually more and more things in the system > end up also getting stuck. > > In the latter case, we should just give up, so the system can dump out > a little more information about what went wrong, and, with panic_on_oops > and a kdump kernel loaded, dump a whole bunch more information about > what might have gone wrong. > > Question: should this have its own panic_on_ipistall switch in > /proc/sys/kernel, or maybe piggyback on panic_on_oops in a different > way than via BUG_ON? > panic_on_ipistall (set to 1 by default) looks better option to me. For systems where such delay is acceptable and system can eventually get back to sane state, this option (set to 0 after boot) would prevent crashing the system for apparently benign CSD hangs of long duration. > Signed-off-by: Rik van Riel <riel@surriel.com> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org> > > diff --git a/kernel/smp.c b/kernel/smp.c > index 8455a53465af..059f1f53fc6b 100644 > --- a/kernel/smp.c > +++ b/kernel/smp.c > @@ -230,6 +230,7 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 * > } > > ts2 = sched_clock(); > + /* How long since we last checked for a stuck CSD lock.*/ > ts_delta = ts2 - *ts1; > if (likely(ts_delta <= csd_lock_timeout_ns || csd_lock_timeout_ns == 0)) > return false; > @@ -243,9 +244,17 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 * > else > cpux = cpu; > cpu_cur_csd = smp_load_acquire(&per_cpu(cur_csd, cpux)); /* Before func and info. */ > + /* How long since this CSD lock was stuck. */ > + ts_delta = ts2 - ts0; > pr_alert("csd: %s non-responsive CSD lock (#%d) on CPU#%d, waiting %llu ns for CPU#%02d %pS(%ps).\n", > - firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts2 - ts0, > + firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts_delta, > cpu, csd->func, csd->info); > + /* > + * If the CSD lock is still stuck after 5 minutes, it is unlikely > + * to become unstuck. Use a signed comparison to avoid triggering > + * on underflows when the TSC is out of sync between sockets. > + */ > + BUG_ON((s64)ts_delta > 300000000000LL); Can we make this a module_param (default value 5 mins), so that if needed it can be tweaked to a bigger/smaller value? > if (cpu_cur_csd && csd != cpu_cur_csd) { > pr_alert("\tcsd: CSD lock (#%d) handling prior %pS(%ps) request.\n", > *bug_id, READ_ONCE(per_cpu(cur_csd_func, cpux)), Thanks, Imran
Is this related to the qspinlock issue you described earlier? jonas Am 10/5/2023 um 6:48 PM schrieb Paul E. McKenney: > The CSD lock seems to get stuck in 2 "modes". When it gets stuck > temporarily, it usually gets released in a few seconds, and sometimes > up to one or two minutes. > > If the CSD lock stays stuck for more than several minutes, it never > seems to get unstuck, and gradually more and more things in the system > end up also getting stuck. > > In the latter case, we should just give up, so the system can dump out > a little more information about what went wrong, and, with panic_on_oops > and a kdump kernel loaded, dump a whole bunch more information about > what might have gone wrong. > > Question: should this have its own panic_on_ipistall switch in > /proc/sys/kernel, or maybe piggyback on panic_on_oops in a different > way than via BUG_ON? > > Signed-off-by: Rik van Riel <riel@surriel.com> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org> > > diff --git a/kernel/smp.c b/kernel/smp.c > index 8455a53465af..059f1f53fc6b 100644 > --- a/kernel/smp.c > +++ b/kernel/smp.c > @@ -230,6 +230,7 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 * > } > > ts2 = sched_clock(); > + /* How long since we last checked for a stuck CSD lock.*/ > ts_delta = ts2 - *ts1; > if (likely(ts_delta <= csd_lock_timeout_ns || csd_lock_timeout_ns == 0)) > return false; > @@ -243,9 +244,17 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 * > else > cpux = cpu; > cpu_cur_csd = smp_load_acquire(&per_cpu(cur_csd, cpux)); /* Before func and info. */ > + /* How long since this CSD lock was stuck. */ > + ts_delta = ts2 - ts0; > pr_alert("csd: %s non-responsive CSD lock (#%d) on CPU#%d, waiting %llu ns for CPU#%02d %pS(%ps).\n", > - firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts2 - ts0, > + firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts_delta, > cpu, csd->func, csd->info); > + /* > + * If the CSD lock is still stuck after 5 minutes, it is unlikely > + * to become unstuck. Use a signed comparison to avoid triggering > + * on underflows when the TSC is out of sync between sockets. > + */ > + BUG_ON((s64)ts_delta > 300000000000LL); > if (cpu_cur_csd && csd != cpu_cur_csd) { > pr_alert("\tcsd: CSD lock (#%d) handling prior %pS(%ps) request.\n", > *bug_id, READ_ONCE(per_cpu(cur_csd_func, cpux)),
On Fri, Oct 06, 2023 at 08:48:23PM +0200, Jonas Oberhauser wrote: > Is this related to the qspinlock issue you described earlier? Kind of in that sometimes qspinlock issues trigger CSD-lock warnings, but not really directly related. Thanx, Paul > jonas > > > Am 10/5/2023 um 6:48 PM schrieb Paul E. McKenney: > > The CSD lock seems to get stuck in 2 "modes". When it gets stuck > > temporarily, it usually gets released in a few seconds, and sometimes > > up to one or two minutes. > > > > If the CSD lock stays stuck for more than several minutes, it never > > seems to get unstuck, and gradually more and more things in the system > > end up also getting stuck. > > > > In the latter case, we should just give up, so the system can dump out > > a little more information about what went wrong, and, with panic_on_oops > > and a kdump kernel loaded, dump a whole bunch more information about > > what might have gone wrong. > > > > Question: should this have its own panic_on_ipistall switch in > > /proc/sys/kernel, or maybe piggyback on panic_on_oops in a different > > way than via BUG_ON? > > > > Signed-off-by: Rik van Riel <riel@surriel.com> > > Signed-off-by: Paul E. McKenney <paulmck@kernel.org> > > > > diff --git a/kernel/smp.c b/kernel/smp.c > > index 8455a53465af..059f1f53fc6b 100644 > > --- a/kernel/smp.c > > +++ b/kernel/smp.c > > @@ -230,6 +230,7 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 * > > } > > ts2 = sched_clock(); > > + /* How long since we last checked for a stuck CSD lock.*/ > > ts_delta = ts2 - *ts1; > > if (likely(ts_delta <= csd_lock_timeout_ns || csd_lock_timeout_ns == 0)) > > return false; > > @@ -243,9 +244,17 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 * > > else > > cpux = cpu; > > cpu_cur_csd = smp_load_acquire(&per_cpu(cur_csd, cpux)); /* Before func and info. */ > > + /* How long since this CSD lock was stuck. */ > > + ts_delta = ts2 - ts0; > > pr_alert("csd: %s non-responsive CSD lock (#%d) on CPU#%d, waiting %llu ns for CPU#%02d %pS(%ps).\n", > > - firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts2 - ts0, > > + firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts_delta, > > cpu, csd->func, csd->info); > > + /* > > + * If the CSD lock is still stuck after 5 minutes, it is unlikely > > + * to become unstuck. Use a signed comparison to avoid triggering > > + * on underflows when the TSC is out of sync between sockets. > > + */ > > + BUG_ON((s64)ts_delta > 300000000000LL); > > if (cpu_cur_csd && csd != cpu_cur_csd) { > > pr_alert("\tcsd: CSD lock (#%d) handling prior %pS(%ps) request.\n", > > *bug_id, READ_ONCE(per_cpu(cur_csd_func, cpux)), >
On Fri, Oct 06, 2023 at 10:32:07AM +1100, Imran Khan wrote: > Hello Paul, > > On 6/10/2023 3:48 am, Paul E. McKenney wrote: > > The CSD lock seems to get stuck in 2 "modes". When it gets stuck > > temporarily, it usually gets released in a few seconds, and sometimes > > up to one or two minutes. > > > > If the CSD lock stays stuck for more than several minutes, it never > > seems to get unstuck, and gradually more and more things in the system > > end up also getting stuck. > > > > In the latter case, we should just give up, so the system can dump out > > a little more information about what went wrong, and, with panic_on_oops > > and a kdump kernel loaded, dump a whole bunch more information about > > what might have gone wrong. > > > > Question: should this have its own panic_on_ipistall switch in > > /proc/sys/kernel, or maybe piggyback on panic_on_oops in a different > > way than via BUG_ON? > > > panic_on_ipistall (set to 1 by default) looks better option to me. For systems > where such delay is acceptable and system can eventually get back to sane state, > this option (set to 0 after boot) would prevent crashing the system for > apparently benign CSD hangs of long duration. Good point! How about like the following? Thanx, Paul ------------------------------------------------------------------------ commit 6bcf3786291b86f13b3e13d51e998737a8009ec3 Author: Rik van Riel <riel@surriel.com> Date: Mon Aug 21 16:04:09 2023 -0400 smp,csd: Throw an error if a CSD lock is stuck for too long The CSD lock seems to get stuck in 2 "modes". When it gets stuck temporarily, it usually gets released in a few seconds, and sometimes up to one or two minutes. If the CSD lock stays stuck for more than several minutes, it never seems to get unstuck, and gradually more and more things in the system end up also getting stuck. In the latter case, we should just give up, so the system can dump out a little more information about what went wrong, and, with panic_on_oops and a kdump kernel loaded, dump a whole bunch more information about what might have gone wrong. In addition, there is an smp.panic_on_ipistall kernel boot parameter that by default retains the old behavior, but when set enables the panic after the CSD lock has been stuck for more than five minutes. [ paulmck: Apply Imran Khan feedback. ] Link: https://lore.kernel.org/lkml/bc7cc8b0-f587-4451-8bcd-0daae627bcc7@paulmck-laptop/ Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Randy Dunlap <rdunlap@infradead.org> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 0a1731a0f0ef..592935267ce2 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -5858,6 +5858,11 @@ This feature may be more efficiently disabled using the csdlock_debug- kernel parameter. + smp.panic_on_ipistall= [KNL] + If a csd_lock_timeout extends for more than + five minutes, panic the system. By default, let + CSD-lock acquisition take as long as they take. + smsc-ircc2.nopnp [HW] Don't use PNP to discover SMC devices smsc-ircc2.ircc_cfg= [HW] Device configuration I/O port smsc-ircc2.ircc_sir= [HW] SIR base I/O port diff --git a/kernel/smp.c b/kernel/smp.c index 8455a53465af..b6a0773a7015 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -170,6 +170,8 @@ static DEFINE_PER_CPU(void *, cur_csd_info); static ulong csd_lock_timeout = 5000; /* CSD lock timeout in milliseconds. */ module_param(csd_lock_timeout, ulong, 0444); +static bool panic_on_ipistall; +module_param(panic_on_ipistall, bool, 0444); static atomic_t csd_bug_count = ATOMIC_INIT(0); @@ -230,6 +232,7 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 * } ts2 = sched_clock(); + /* How long since we last checked for a stuck CSD lock.*/ ts_delta = ts2 - *ts1; if (likely(ts_delta <= csd_lock_timeout_ns || csd_lock_timeout_ns == 0)) return false; @@ -243,9 +246,17 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 * else cpux = cpu; cpu_cur_csd = smp_load_acquire(&per_cpu(cur_csd, cpux)); /* Before func and info. */ + /* How long since this CSD lock was stuck. */ + ts_delta = ts2 - ts0; pr_alert("csd: %s non-responsive CSD lock (#%d) on CPU#%d, waiting %llu ns for CPU#%02d %pS(%ps).\n", - firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts2 - ts0, + firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts_delta, cpu, csd->func, csd->info); + /* + * If the CSD lock is still stuck after 5 minutes, it is unlikely + * to become unstuck. Use a signed comparison to avoid triggering + * on underflows when the TSC is out of sync between sockets. + */ + BUG_ON(panic_on_ipistall && (s64)ts_delta > 300000000000LL); if (cpu_cur_csd && csd != cpu_cur_csd) { pr_alert("\tcsd: CSD lock (#%d) handling prior %pS(%ps) request.\n", *bug_id, READ_ONCE(per_cpu(cur_csd_func, cpux)),
Hello Paul, On 10/10/2023 3:39 am, Paul E. McKenney wrote: > On Fri, Oct 06, 2023 at 10:32:07AM +1100, Imran Khan wrote: >> Hello Paul, >> >> On 6/10/2023 3:48 am, Paul E. McKenney wrote: >>> The CSD lock seems to get stuck in 2 "modes". When it gets stuck >>> temporarily, it usually gets released in a few seconds, and sometimes >>> up to one or two minutes. >>> >>> If the CSD lock stays stuck for more than several minutes, it never >>> seems to get unstuck, and gradually more and more things in the system >>> end up also getting stuck. >>> >>> In the latter case, we should just give up, so the system can dump out >>> a little more information about what went wrong, and, with panic_on_oops >>> and a kdump kernel loaded, dump a whole bunch more information about >>> what might have gone wrong. >>> >>> Question: should this have its own panic_on_ipistall switch in >>> /proc/sys/kernel, or maybe piggyback on panic_on_oops in a different >>> way than via BUG_ON? >>> >> panic_on_ipistall (set to 1 by default) looks better option to me. For systems >> where such delay is acceptable and system can eventually get back to sane state, >> this option (set to 0 after boot) would prevent crashing the system for >> apparently benign CSD hangs of long duration. > > Good point! How about like the following? > Yes, this looks good. Just realized that keeping panic_on_ipistall set by default(as suggested earlier by me) would not follow convention of other similar switches like hard/softlockup_panic etc. which are 0 by deafault. So default value of 0 looks better choice for panic_on_ipistall as well. > Thanx, Paul > > ------------------------------------------------------------------------ > > commit 6bcf3786291b86f13b3e13d51e998737a8009ec3 > Author: Rik van Riel <riel@surriel.com> > Date: Mon Aug 21 16:04:09 2023 -0400 > > smp,csd: Throw an error if a CSD lock is stuck for too long > > The CSD lock seems to get stuck in 2 "modes". When it gets stuck > temporarily, it usually gets released in a few seconds, and sometimes > up to one or two minutes. > > If the CSD lock stays stuck for more than several minutes, it never > seems to get unstuck, and gradually more and more things in the system > end up also getting stuck. > > In the latter case, we should just give up, so the system can dump out > a little more information about what went wrong, and, with panic_on_oops > and a kdump kernel loaded, dump a whole bunch more information about what > might have gone wrong. In addition, there is an smp.panic_on_ipistall > kernel boot parameter that by default retains the old behavior, but when > set enables the panic after the CSD lock has been stuck for more than > five minutes. > > [ paulmck: Apply Imran Khan feedback. ] > > Link: https://urldefense.com/v3/__https://lore.kernel.org/lkml/bc7cc8b0-f587-4451-8bcd-0daae627bcc7@paulmck-laptop/__;!!ACWV5N9M2RV99hQ!PDFpjgGTCPjxqCyusua5IZWkvdWEMf51igFDc-yb9cVK9PYr7FpEE1oGpWp09YK4lc15C2taMdcuEOqyH8k$ > Signed-off-by: Rik van Riel <riel@surriel.com> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org> > Cc: Peter Zijlstra <peterz@infradead.org> > Cc: Valentin Schneider <vschneid@redhat.com> > Cc: Juergen Gross <jgross@suse.com> > Cc: Jonathan Corbet <corbet@lwn.net> > Cc: Randy Dunlap <rdunlap@infradead.org> > Reviewed-by: Imran Khan <imran.f.khan@oracle.com> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > index 0a1731a0f0ef..592935267ce2 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -5858,6 +5858,11 @@ > This feature may be more efficiently disabled > using the csdlock_debug- kernel parameter. > > + smp.panic_on_ipistall= [KNL] > + If a csd_lock_timeout extends for more than > + five minutes, panic the system. By default, let > + CSD-lock acquisition take as long as they take. > + > smsc-ircc2.nopnp [HW] Don't use PNP to discover SMC devices > smsc-ircc2.ircc_cfg= [HW] Device configuration I/O port > smsc-ircc2.ircc_sir= [HW] SIR base I/O port > diff --git a/kernel/smp.c b/kernel/smp.c > index 8455a53465af..b6a0773a7015 100644 > --- a/kernel/smp.c > +++ b/kernel/smp.c > @@ -170,6 +170,8 @@ static DEFINE_PER_CPU(void *, cur_csd_info); > > static ulong csd_lock_timeout = 5000; /* CSD lock timeout in milliseconds. */ > module_param(csd_lock_timeout, ulong, 0444); > +static bool panic_on_ipistall; > +module_param(panic_on_ipistall, bool, 0444); > > static atomic_t csd_bug_count = ATOMIC_INIT(0); > > @@ -230,6 +232,7 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 * > } > > ts2 = sched_clock(); > + /* How long since we last checked for a stuck CSD lock.*/ > ts_delta = ts2 - *ts1; > if (likely(ts_delta <= csd_lock_timeout_ns || csd_lock_timeout_ns == 0)) > return false; > @@ -243,9 +246,17 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 * > else > cpux = cpu; > cpu_cur_csd = smp_load_acquire(&per_cpu(cur_csd, cpux)); /* Before func and info. */ > + /* How long since this CSD lock was stuck. */ > + ts_delta = ts2 - ts0; > pr_alert("csd: %s non-responsive CSD lock (#%d) on CPU#%d, waiting %llu ns for CPU#%02d %pS(%ps).\n", > - firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts2 - ts0, > + firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts_delta, > cpu, csd->func, csd->info); > + /* > + * If the CSD lock is still stuck after 5 minutes, it is unlikely > + * to become unstuck. Use a signed comparison to avoid triggering > + * on underflows when the TSC is out of sync between sockets. > + */ > + BUG_ON(panic_on_ipistall && (s64)ts_delta > 300000000000LL); > if (cpu_cur_csd && csd != cpu_cur_csd) { > pr_alert("\tcsd: CSD lock (#%d) handling prior %pS(%ps) request.\n", > *bug_id, READ_ONCE(per_cpu(cur_csd_func, cpux)),
On Tue, Oct 10, 2023 at 03:58:43PM +1100, Imran Khan wrote: > Hello Paul, > > On 10/10/2023 3:39 am, Paul E. McKenney wrote: > > On Fri, Oct 06, 2023 at 10:32:07AM +1100, Imran Khan wrote: > >> Hello Paul, > >> > >> On 6/10/2023 3:48 am, Paul E. McKenney wrote: > >>> The CSD lock seems to get stuck in 2 "modes". When it gets stuck > >>> temporarily, it usually gets released in a few seconds, and sometimes > >>> up to one or two minutes. > >>> > >>> If the CSD lock stays stuck for more than several minutes, it never > >>> seems to get unstuck, and gradually more and more things in the system > >>> end up also getting stuck. > >>> > >>> In the latter case, we should just give up, so the system can dump out > >>> a little more information about what went wrong, and, with panic_on_oops > >>> and a kdump kernel loaded, dump a whole bunch more information about > >>> what might have gone wrong. > >>> > >>> Question: should this have its own panic_on_ipistall switch in > >>> /proc/sys/kernel, or maybe piggyback on panic_on_oops in a different > >>> way than via BUG_ON? > >>> > >> panic_on_ipistall (set to 1 by default) looks better option to me. For systems > >> where such delay is acceptable and system can eventually get back to sane state, > >> this option (set to 0 after boot) would prevent crashing the system for > >> apparently benign CSD hangs of long duration. > > > > Good point! How about like the following? > > > > Yes, this looks good. > Just realized that keeping panic_on_ipistall set by default(as suggested earlier > by me) would not follow convention of other similar switches like > hard/softlockup_panic etc. which are 0 by deafault. > So default value of 0 looks better choice for panic_on_ipistall as well. Plus if a new option is set by default and causes problems, people get (understandably) annoyed. ;-) > > ------------------------------------------------------------------------ > > > > commit 6bcf3786291b86f13b3e13d51e998737a8009ec3 > > Author: Rik van Riel <riel@surriel.com> > > Date: Mon Aug 21 16:04:09 2023 -0400 > > > > smp,csd: Throw an error if a CSD lock is stuck for too long > > > > The CSD lock seems to get stuck in 2 "modes". When it gets stuck > > temporarily, it usually gets released in a few seconds, and sometimes > > up to one or two minutes. > > > > If the CSD lock stays stuck for more than several minutes, it never > > seems to get unstuck, and gradually more and more things in the system > > end up also getting stuck. > > > > In the latter case, we should just give up, so the system can dump out > > a little more information about what went wrong, and, with panic_on_oops > > and a kdump kernel loaded, dump a whole bunch more information about what > > might have gone wrong. In addition, there is an smp.panic_on_ipistall > > kernel boot parameter that by default retains the old behavior, but when > > set enables the panic after the CSD lock has been stuck for more than > > five minutes. > > > > [ paulmck: Apply Imran Khan feedback. ] > > > > Link: https://urldefense.com/v3/__https://lore.kernel.org/lkml/bc7cc8b0-f587-4451-8bcd-0daae627bcc7@paulmck-laptop/__;!!ACWV5N9M2RV99hQ!PDFpjgGTCPjxqCyusua5IZWkvdWEMf51igFDc-yb9cVK9PYr7FpEE1oGpWp09YK4lc15C2taMdcuEOqyH8k$ > > Signed-off-by: Rik van Riel <riel@surriel.com> > > Signed-off-by: Paul E. McKenney <paulmck@kernel.org> > > Cc: Peter Zijlstra <peterz@infradead.org> > > Cc: Valentin Schneider <vschneid@redhat.com> > > Cc: Juergen Gross <jgross@suse.com> > > Cc: Jonathan Corbet <corbet@lwn.net> > > Cc: Randy Dunlap <rdunlap@infradead.org> > > > Reviewed-by: Imran Khan <imran.f.khan@oracle.com> Thank you, and I will apply this on my next rebase. Thanx, Paul > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > > index 0a1731a0f0ef..592935267ce2 100644 > > --- a/Documentation/admin-guide/kernel-parameters.txt > > +++ b/Documentation/admin-guide/kernel-parameters.txt > > @@ -5858,6 +5858,11 @@ > > This feature may be more efficiently disabled > > using the csdlock_debug- kernel parameter. > > > > + smp.panic_on_ipistall= [KNL] > > + If a csd_lock_timeout extends for more than > > + five minutes, panic the system. By default, let > > + CSD-lock acquisition take as long as they take. > > + > > smsc-ircc2.nopnp [HW] Don't use PNP to discover SMC devices > > smsc-ircc2.ircc_cfg= [HW] Device configuration I/O port > > smsc-ircc2.ircc_sir= [HW] SIR base I/O port > > diff --git a/kernel/smp.c b/kernel/smp.c > > index 8455a53465af..b6a0773a7015 100644 > > --- a/kernel/smp.c > > +++ b/kernel/smp.c > > @@ -170,6 +170,8 @@ static DEFINE_PER_CPU(void *, cur_csd_info); > > > > static ulong csd_lock_timeout = 5000; /* CSD lock timeout in milliseconds. */ > > module_param(csd_lock_timeout, ulong, 0444); > > +static bool panic_on_ipistall; > > +module_param(panic_on_ipistall, bool, 0444); > > > > static atomic_t csd_bug_count = ATOMIC_INIT(0); > > > > @@ -230,6 +232,7 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 * > > } > > > > ts2 = sched_clock(); > > + /* How long since we last checked for a stuck CSD lock.*/ > > ts_delta = ts2 - *ts1; > > if (likely(ts_delta <= csd_lock_timeout_ns || csd_lock_timeout_ns == 0)) > > return false; > > @@ -243,9 +246,17 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 * > > else > > cpux = cpu; > > cpu_cur_csd = smp_load_acquire(&per_cpu(cur_csd, cpux)); /* Before func and info. */ > > + /* How long since this CSD lock was stuck. */ > > + ts_delta = ts2 - ts0; > > pr_alert("csd: %s non-responsive CSD lock (#%d) on CPU#%d, waiting %llu ns for CPU#%02d %pS(%ps).\n", > > - firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts2 - ts0, > > + firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts_delta, > > cpu, csd->func, csd->info); > > + /* > > + * If the CSD lock is still stuck after 5 minutes, it is unlikely > > + * to become unstuck. Use a signed comparison to avoid triggering > > + * on underflows when the TSC is out of sync between sockets. > > + */ > > + BUG_ON(panic_on_ipistall && (s64)ts_delta > 300000000000LL); > > if (cpu_cur_csd && csd != cpu_cur_csd) { > > pr_alert("\tcsd: CSD lock (#%d) handling prior %pS(%ps) request.\n", > > *bug_id, READ_ONCE(per_cpu(cur_csd_func, cpux)),
On Mon, Oct 09, 2023 at 09:39:38AM -0700, Paul E. McKenney wrote: > On Fri, Oct 06, 2023 at 10:32:07AM +1100, Imran Khan wrote: > > Hello Paul, > > > > On 6/10/2023 3:48 am, Paul E. McKenney wrote: > > > The CSD lock seems to get stuck in 2 "modes". When it gets stuck > > > temporarily, it usually gets released in a few seconds, and sometimes > > > up to one or two minutes. > > > > > > If the CSD lock stays stuck for more than several minutes, it never > > > seems to get unstuck, and gradually more and more things in the system > > > end up also getting stuck. > > > > > > In the latter case, we should just give up, so the system can dump out > > > a little more information about what went wrong, and, with panic_on_oops > > > and a kdump kernel loaded, dump a whole bunch more information about > > > what might have gone wrong. > > > > > > Question: should this have its own panic_on_ipistall switch in > > > /proc/sys/kernel, or maybe piggyback on panic_on_oops in a different > > > way than via BUG_ON? > > > > > panic_on_ipistall (set to 1 by default) looks better option to me. For systems > > where such delay is acceptable and system can eventually get back to sane state, > > this option (set to 0 after boot) would prevent crashing the system for > > apparently benign CSD hangs of long duration. > > Good point! How about like the following? > > Thanx, Paul > > ------------------------------------------------------------------------ > > commit 6bcf3786291b86f13b3e13d51e998737a8009ec3 > Author: Rik van Riel <riel@surriel.com> > Date: Mon Aug 21 16:04:09 2023 -0400 > > smp,csd: Throw an error if a CSD lock is stuck for too long > > The CSD lock seems to get stuck in 2 "modes". When it gets stuck > temporarily, it usually gets released in a few seconds, and sometimes > up to one or two minutes. > > If the CSD lock stays stuck for more than several minutes, it never > seems to get unstuck, and gradually more and more things in the system > end up also getting stuck. > > In the latter case, we should just give up, so the system can dump out > a little more information about what went wrong, and, with panic_on_oops > and a kdump kernel loaded, dump a whole bunch more information about what > might have gone wrong. In addition, there is an smp.panic_on_ipistall > kernel boot parameter that by default retains the old behavior, but when > set enables the panic after the CSD lock has been stuck for more than > five minutes. > > [ paulmck: Apply Imran Khan feedback. ] > > Link: https://lore.kernel.org/lkml/bc7cc8b0-f587-4451-8bcd-0daae627bcc7@paulmck-laptop/ > Signed-off-by: Rik van Riel <riel@surriel.com> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org> > Cc: Peter Zijlstra <peterz@infradead.org> > Cc: Valentin Schneider <vschneid@redhat.com> > Cc: Juergen Gross <jgross@suse.com> > Cc: Jonathan Corbet <corbet@lwn.net> > Cc: Randy Dunlap <rdunlap@infradead.org> > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > index 0a1731a0f0ef..592935267ce2 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -5858,6 +5858,11 @@ > This feature may be more efficiently disabled > using the csdlock_debug- kernel parameter. > > + smp.panic_on_ipistall= [KNL] > + If a csd_lock_timeout extends for more than > + five minutes, panic the system. By default, let > + CSD-lock acquisition take as long as they take. > + It could be interesting to have it as an s64 parameter (in {mili,}seconds) instead of bool, this way the user could pick the time to wait before the panic happens. 0 or -1 could mean disabled. What do you think? Other than that, Reviewed-by: Leonardo Bras <leobras@redhat.com> > smsc-ircc2.nopnp [HW] Don't use PNP to discover SMC devices > smsc-ircc2.ircc_cfg= [HW] Device configuration I/O port > smsc-ircc2.ircc_sir= [HW] SIR base I/O port > diff --git a/kernel/smp.c b/kernel/smp.c > index 8455a53465af..b6a0773a7015 100644 > --- a/kernel/smp.c > +++ b/kernel/smp.c > @@ -170,6 +170,8 @@ static DEFINE_PER_CPU(void *, cur_csd_info); > > static ulong csd_lock_timeout = 5000; /* CSD lock timeout in milliseconds. */ > module_param(csd_lock_timeout, ulong, 0444); > +static bool panic_on_ipistall; > +module_param(panic_on_ipistall, bool, 0444); > > static atomic_t csd_bug_count = ATOMIC_INIT(0); > > @@ -230,6 +232,7 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 * > } > > ts2 = sched_clock(); > + /* How long since we last checked for a stuck CSD lock.*/ > ts_delta = ts2 - *ts1; > if (likely(ts_delta <= csd_lock_timeout_ns || csd_lock_timeout_ns == 0)) > return false; > @@ -243,9 +246,17 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 * > else > cpux = cpu; > cpu_cur_csd = smp_load_acquire(&per_cpu(cur_csd, cpux)); /* Before func and info. */ > + /* How long since this CSD lock was stuck. */ > + ts_delta = ts2 - ts0; > pr_alert("csd: %s non-responsive CSD lock (#%d) on CPU#%d, waiting %llu ns for CPU#%02d %pS(%ps).\n", > - firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts2 - ts0, > + firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts_delta, > cpu, csd->func, csd->info); > + /* > + * If the CSD lock is still stuck after 5 minutes, it is unlikely > + * to become unstuck. Use a signed comparison to avoid triggering > + * on underflows when the TSC is out of sync between sockets. > + */ > + BUG_ON(panic_on_ipistall && (s64)ts_delta > 300000000000LL); > if (cpu_cur_csd && csd != cpu_cur_csd) { > pr_alert("\tcsd: CSD lock (#%d) handling prior %pS(%ps) request.\n", > *bug_id, READ_ONCE(per_cpu(cur_csd_func, cpux)), >
On Fri, Oct 13, 2023 at 12:26:22PM -0300, Leonardo Bras wrote: > On Mon, Oct 09, 2023 at 09:39:38AM -0700, Paul E. McKenney wrote: > > On Fri, Oct 06, 2023 at 10:32:07AM +1100, Imran Khan wrote: > > > Hello Paul, > > > > > > On 6/10/2023 3:48 am, Paul E. McKenney wrote: > > > > The CSD lock seems to get stuck in 2 "modes". When it gets stuck > > > > temporarily, it usually gets released in a few seconds, and sometimes > > > > up to one or two minutes. > > > > > > > > If the CSD lock stays stuck for more than several minutes, it never > > > > seems to get unstuck, and gradually more and more things in the system > > > > end up also getting stuck. > > > > > > > > In the latter case, we should just give up, so the system can dump out > > > > a little more information about what went wrong, and, with panic_on_oops > > > > and a kdump kernel loaded, dump a whole bunch more information about > > > > what might have gone wrong. > > > > > > > > Question: should this have its own panic_on_ipistall switch in > > > > /proc/sys/kernel, or maybe piggyback on panic_on_oops in a different > > > > way than via BUG_ON? > > > > > > > panic_on_ipistall (set to 1 by default) looks better option to me. For systems > > > where such delay is acceptable and system can eventually get back to sane state, > > > this option (set to 0 after boot) would prevent crashing the system for > > > apparently benign CSD hangs of long duration. > > > > Good point! How about like the following? > > > > Thanx, Paul > > > > ------------------------------------------------------------------------ > > > > commit 6bcf3786291b86f13b3e13d51e998737a8009ec3 > > Author: Rik van Riel <riel@surriel.com> > > Date: Mon Aug 21 16:04:09 2023 -0400 > > > > smp,csd: Throw an error if a CSD lock is stuck for too long > > > > The CSD lock seems to get stuck in 2 "modes". When it gets stuck > > temporarily, it usually gets released in a few seconds, and sometimes > > up to one or two minutes. > > > > If the CSD lock stays stuck for more than several minutes, it never > > seems to get unstuck, and gradually more and more things in the system > > end up also getting stuck. > > > > In the latter case, we should just give up, so the system can dump out > > a little more information about what went wrong, and, with panic_on_oops > > and a kdump kernel loaded, dump a whole bunch more information about what > > might have gone wrong. In addition, there is an smp.panic_on_ipistall > > kernel boot parameter that by default retains the old behavior, but when > > set enables the panic after the CSD lock has been stuck for more than > > five minutes. > > > > [ paulmck: Apply Imran Khan feedback. ] > > > > Link: https://lore.kernel.org/lkml/bc7cc8b0-f587-4451-8bcd-0daae627bcc7@paulmck-laptop/ > > Signed-off-by: Rik van Riel <riel@surriel.com> > > Signed-off-by: Paul E. McKenney <paulmck@kernel.org> > > Cc: Peter Zijlstra <peterz@infradead.org> > > Cc: Valentin Schneider <vschneid@redhat.com> > > Cc: Juergen Gross <jgross@suse.com> > > Cc: Jonathan Corbet <corbet@lwn.net> > > Cc: Randy Dunlap <rdunlap@infradead.org> > > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > > index 0a1731a0f0ef..592935267ce2 100644 > > --- a/Documentation/admin-guide/kernel-parameters.txt > > +++ b/Documentation/admin-guide/kernel-parameters.txt > > @@ -5858,6 +5858,11 @@ > > This feature may be more efficiently disabled > > using the csdlock_debug- kernel parameter. > > > > + smp.panic_on_ipistall= [KNL] > > + If a csd_lock_timeout extends for more than > > + five minutes, panic the system. By default, let > > + CSD-lock acquisition take as long as they take. > > + > > It could be interesting to have it as an s64 parameter (in {mili,}seconds) > instead of bool, this way the user could pick the time to wait before the > panic happens. 0 or -1 could mean disabled. > > What do you think? > > Other than that, > Reviewed-by: Leonardo Bras <leobras@redhat.com> Thank you for looking this over! How about with the diff shown below, to be folded into the original? I went with int instead of s64 because I am having some difficulty imagining anyone specifying more than a 24-day timeout. ;-) Thanx, Paul ------------------------------------------------------------------------ diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index ccb7621eff79..ea5ae9deb753 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -5931,8 +5931,10 @@ smp.panic_on_ipistall= [KNL] If a csd_lock_timeout extends for more than - five minutes, panic the system. By default, let - CSD-lock acquisition take as long as they take. + the specified number of milliseconds, panic the + system. By default, let CSD-lock acquisition + take as long as they take. Specifying 300,000 + for this value provides a 10-minute timeout. smsc-ircc2.nopnp [HW] Don't use PNP to discover SMC devices smsc-ircc2.ircc_cfg= [HW] Device configuration I/O port diff --git a/kernel/smp.c b/kernel/smp.c index b6a0773a7015..d3ca47f32f38 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -170,8 +170,8 @@ static DEFINE_PER_CPU(void *, cur_csd_info); static ulong csd_lock_timeout = 5000; /* CSD lock timeout in milliseconds. */ module_param(csd_lock_timeout, ulong, 0444); -static bool panic_on_ipistall; -module_param(panic_on_ipistall, bool, 0444); +static int panic_on_ipistall; /* CSD panic timeout in milliseconds, 300000 for ten minutes. */ +module_param(panic_on_ipistall, int, 0444); static atomic_t csd_bug_count = ATOMIC_INIT(0); @@ -256,7 +256,7 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 * * to become unstuck. Use a signed comparison to avoid triggering * on underflows when the TSC is out of sync between sockets. */ - BUG_ON(panic_on_ipistall && (s64)ts_delta > 300000000000LL); + BUG_ON(panic_on_ipistall > 0 && (s64)ts_delta > ((s64)panic_on_ipistall * NSEC_PER_MSEC)); if (cpu_cur_csd && csd != cpu_cur_csd) { pr_alert("\tcsd: CSD lock (#%d) handling prior %pS(%ps) request.\n", *bug_id, READ_ONCE(per_cpu(cur_csd_func, cpux)),
On Mon, Oct 16, 2023 at 11:27:51AM -0700, Paul E. McKenney wrote: > On Fri, Oct 13, 2023 at 12:26:22PM -0300, Leonardo Bras wrote: > > On Mon, Oct 09, 2023 at 09:39:38AM -0700, Paul E. McKenney wrote: > > > On Fri, Oct 06, 2023 at 10:32:07AM +1100, Imran Khan wrote: > > > > Hello Paul, > > > > > > > > On 6/10/2023 3:48 am, Paul E. McKenney wrote: > > > > > The CSD lock seems to get stuck in 2 "modes". When it gets stuck > > > > > temporarily, it usually gets released in a few seconds, and sometimes > > > > > up to one or two minutes. > > > > > > > > > > If the CSD lock stays stuck for more than several minutes, it never > > > > > seems to get unstuck, and gradually more and more things in the system > > > > > end up also getting stuck. > > > > > > > > > > In the latter case, we should just give up, so the system can dump out > > > > > a little more information about what went wrong, and, with panic_on_oops > > > > > and a kdump kernel loaded, dump a whole bunch more information about > > > > > what might have gone wrong. > > > > > > > > > > Question: should this have its own panic_on_ipistall switch in > > > > > /proc/sys/kernel, or maybe piggyback on panic_on_oops in a different > > > > > way than via BUG_ON? > > > > > > > > > panic_on_ipistall (set to 1 by default) looks better option to me. For systems > > > > where such delay is acceptable and system can eventually get back to sane state, > > > > this option (set to 0 after boot) would prevent crashing the system for > > > > apparently benign CSD hangs of long duration. > > > > > > Good point! How about like the following? > > > > > > Thanx, Paul > > > > > > ------------------------------------------------------------------------ > > > > > > commit 6bcf3786291b86f13b3e13d51e998737a8009ec3 > > > Author: Rik van Riel <riel@surriel.com> > > > Date: Mon Aug 21 16:04:09 2023 -0400 > > > > > > smp,csd: Throw an error if a CSD lock is stuck for too long > > > > > > The CSD lock seems to get stuck in 2 "modes". When it gets stuck > > > temporarily, it usually gets released in a few seconds, and sometimes > > > up to one or two minutes. > > > > > > If the CSD lock stays stuck for more than several minutes, it never > > > seems to get unstuck, and gradually more and more things in the system > > > end up also getting stuck. > > > > > > In the latter case, we should just give up, so the system can dump out > > > a little more information about what went wrong, and, with panic_on_oops > > > and a kdump kernel loaded, dump a whole bunch more information about what > > > might have gone wrong. In addition, there is an smp.panic_on_ipistall > > > kernel boot parameter that by default retains the old behavior, but when > > > set enables the panic after the CSD lock has been stuck for more than > > > five minutes. > > > > > > [ paulmck: Apply Imran Khan feedback. ] > > > > > > Link: https://lore.kernel.org/lkml/bc7cc8b0-f587-4451-8bcd-0daae627bcc7@paulmck-laptop/ > > > Signed-off-by: Rik van Riel <riel@surriel.com> > > > Signed-off-by: Paul E. McKenney <paulmck@kernel.org> > > > Cc: Peter Zijlstra <peterz@infradead.org> > > > Cc: Valentin Schneider <vschneid@redhat.com> > > > Cc: Juergen Gross <jgross@suse.com> > > > Cc: Jonathan Corbet <corbet@lwn.net> > > > Cc: Randy Dunlap <rdunlap@infradead.org> > > > > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > > > index 0a1731a0f0ef..592935267ce2 100644 > > > --- a/Documentation/admin-guide/kernel-parameters.txt > > > +++ b/Documentation/admin-guide/kernel-parameters.txt > > > @@ -5858,6 +5858,11 @@ > > > This feature may be more efficiently disabled > > > using the csdlock_debug- kernel parameter. > > > > > > + smp.panic_on_ipistall= [KNL] > > > + If a csd_lock_timeout extends for more than > > > + five minutes, panic the system. By default, let > > > + CSD-lock acquisition take as long as they take. > > > + > > > > It could be interesting to have it as an s64 parameter (in {mili,}seconds) > > instead of bool, this way the user could pick the time to wait before the > > panic happens. 0 or -1 could mean disabled. > > > > What do you think? > > > > Other than that, > > Reviewed-by: Leonardo Bras <leobras@redhat.com> > > Thank you for looking this over! > > How about with the diff shown below, to be folded into the original? > I went with int instead of s64 because I am having some difficulty > imagining anyone specifying more than a 24-day timeout. ;-) I suggested s64 just because it was the type being used in BUG_ON(panic_on_ipistall && (s64)ts_delta > 300000000000LL); But anyway, int should be fine. > > Thanx, Paul > > ------------------------------------------------------------------------ > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > index ccb7621eff79..ea5ae9deb753 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -5931,8 +5931,10 @@ > > smp.panic_on_ipistall= [KNL] > If a csd_lock_timeout extends for more than > - five minutes, panic the system. By default, let > - CSD-lock acquisition take as long as they take. > + the specified number of milliseconds, panic the > + system. By default, let CSD-lock acquisition > + take as long as they take. Specifying 300,000 > + for this value provides a 10-minute timeout. 300,000 ms is 300s, which is 5 minutes, right? > > smsc-ircc2.nopnp [HW] Don't use PNP to discover SMC devices > smsc-ircc2.ircc_cfg= [HW] Device configuration I/O port > diff --git a/kernel/smp.c b/kernel/smp.c > index b6a0773a7015..d3ca47f32f38 100644 > --- a/kernel/smp.c > +++ b/kernel/smp.c > @@ -170,8 +170,8 @@ static DEFINE_PER_CPU(void *, cur_csd_info); > > static ulong csd_lock_timeout = 5000; /* CSD lock timeout in milliseconds. */ > module_param(csd_lock_timeout, ulong, 0444); > -static bool panic_on_ipistall; > -module_param(panic_on_ipistall, bool, 0444); > +static int panic_on_ipistall; /* CSD panic timeout in milliseconds, 300000 for ten minutes. */ s/ten/five > +module_param(panic_on_ipistall, int, 0444); > > static atomic_t csd_bug_count = ATOMIC_INIT(0); > > @@ -256,7 +256,7 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 * > * to become unstuck. Use a signed comparison to avoid triggering > * on underflows when the TSC is out of sync between sockets. > */ > - BUG_ON(panic_on_ipistall && (s64)ts_delta > 300000000000LL); > + BUG_ON(panic_on_ipistall > 0 && (s64)ts_delta > ((s64)panic_on_ipistall * NSEC_PER_MSEC)); s64 here would avoid casting (s64)panic_on_ipistall, but I think it does not really impact readability. IIUC ts_delta is an u64 being casted as s64, which could be an issue but no computer system will actually take over 2^31 ns (292 years) to run 1 iteration, so it's safe. I think it's a nice feaure :) Thanks! Leo > if (cpu_cur_csd && csd != cpu_cur_csd) { > pr_alert("\tcsd: CSD lock (#%d) handling prior %pS(%ps) request.\n", > *bug_id, READ_ONCE(per_cpu(cur_csd_func, cpux)), >
On Mon, Oct 16, 2023 at 05:52:57PM -0300, Leonardo Bras wrote: > On Mon, Oct 16, 2023 at 11:27:51AM -0700, Paul E. McKenney wrote: > > On Fri, Oct 13, 2023 at 12:26:22PM -0300, Leonardo Bras wrote: > > > On Mon, Oct 09, 2023 at 09:39:38AM -0700, Paul E. McKenney wrote: > > > > On Fri, Oct 06, 2023 at 10:32:07AM +1100, Imran Khan wrote: > > > > > Hello Paul, > > > > > > > > > > On 6/10/2023 3:48 am, Paul E. McKenney wrote: > > > > > > The CSD lock seems to get stuck in 2 "modes". When it gets stuck > > > > > > temporarily, it usually gets released in a few seconds, and sometimes > > > > > > up to one or two minutes. > > > > > > > > > > > > If the CSD lock stays stuck for more than several minutes, it never > > > > > > seems to get unstuck, and gradually more and more things in the system > > > > > > end up also getting stuck. > > > > > > > > > > > > In the latter case, we should just give up, so the system can dump out > > > > > > a little more information about what went wrong, and, with panic_on_oops > > > > > > and a kdump kernel loaded, dump a whole bunch more information about > > > > > > what might have gone wrong. > > > > > > > > > > > > Question: should this have its own panic_on_ipistall switch in > > > > > > /proc/sys/kernel, or maybe piggyback on panic_on_oops in a different > > > > > > way than via BUG_ON? > > > > > > > > > > > panic_on_ipistall (set to 1 by default) looks better option to me. For systems > > > > > where such delay is acceptable and system can eventually get back to sane state, > > > > > this option (set to 0 after boot) would prevent crashing the system for > > > > > apparently benign CSD hangs of long duration. > > > > > > > > Good point! How about like the following? > > > > > > > > Thanx, Paul > > > > > > > > ------------------------------------------------------------------------ > > > > > > > > commit 6bcf3786291b86f13b3e13d51e998737a8009ec3 > > > > Author: Rik van Riel <riel@surriel.com> > > > > Date: Mon Aug 21 16:04:09 2023 -0400 > > > > > > > > smp,csd: Throw an error if a CSD lock is stuck for too long > > > > > > > > The CSD lock seems to get stuck in 2 "modes". When it gets stuck > > > > temporarily, it usually gets released in a few seconds, and sometimes > > > > up to one or two minutes. > > > > > > > > If the CSD lock stays stuck for more than several minutes, it never > > > > seems to get unstuck, and gradually more and more things in the system > > > > end up also getting stuck. > > > > > > > > In the latter case, we should just give up, so the system can dump out > > > > a little more information about what went wrong, and, with panic_on_oops > > > > and a kdump kernel loaded, dump a whole bunch more information about what > > > > might have gone wrong. In addition, there is an smp.panic_on_ipistall > > > > kernel boot parameter that by default retains the old behavior, but when > > > > set enables the panic after the CSD lock has been stuck for more than > > > > five minutes. > > > > > > > > [ paulmck: Apply Imran Khan feedback. ] > > > > > > > > Link: https://lore.kernel.org/lkml/bc7cc8b0-f587-4451-8bcd-0daae627bcc7@paulmck-laptop/ > > > > Signed-off-by: Rik van Riel <riel@surriel.com> > > > > Signed-off-by: Paul E. McKenney <paulmck@kernel.org> > > > > Cc: Peter Zijlstra <peterz@infradead.org> > > > > Cc: Valentin Schneider <vschneid@redhat.com> > > > > Cc: Juergen Gross <jgross@suse.com> > > > > Cc: Jonathan Corbet <corbet@lwn.net> > > > > Cc: Randy Dunlap <rdunlap@infradead.org> > > > > > > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > > > > index 0a1731a0f0ef..592935267ce2 100644 > > > > --- a/Documentation/admin-guide/kernel-parameters.txt > > > > +++ b/Documentation/admin-guide/kernel-parameters.txt > > > > @@ -5858,6 +5858,11 @@ > > > > This feature may be more efficiently disabled > > > > using the csdlock_debug- kernel parameter. > > > > > > > > + smp.panic_on_ipistall= [KNL] > > > > + If a csd_lock_timeout extends for more than > > > > + five minutes, panic the system. By default, let > > > > + CSD-lock acquisition take as long as they take. > > > > + > > > > > > It could be interesting to have it as an s64 parameter (in {mili,}seconds) > > > instead of bool, this way the user could pick the time to wait before the > > > panic happens. 0 or -1 could mean disabled. > > > > > > What do you think? > > > > > > Other than that, > > > Reviewed-by: Leonardo Bras <leobras@redhat.com> > > > > Thank you for looking this over! > > > > How about with the diff shown below, to be folded into the original? > > I went with int instead of s64 because I am having some difficulty > > imagining anyone specifying more than a 24-day timeout. ;-) > > I suggested s64 just because it was the type being used in > BUG_ON(panic_on_ipistall && (s64)ts_delta > 300000000000LL); > > But anyway, int should be fine. > > > > > > Thanx, Paul > > > > ------------------------------------------------------------------------ > > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > > index ccb7621eff79..ea5ae9deb753 100644 > > --- a/Documentation/admin-guide/kernel-parameters.txt > > +++ b/Documentation/admin-guide/kernel-parameters.txt > > @@ -5931,8 +5931,10 @@ > > > > smp.panic_on_ipistall= [KNL] > > If a csd_lock_timeout extends for more than > > - five minutes, panic the system. By default, let > > - CSD-lock acquisition take as long as they take. > > + the specified number of milliseconds, panic the > > + system. By default, let CSD-lock acquisition > > + take as long as they take. Specifying 300,000 > > + for this value provides a 10-minute timeout. > > 300,000 ms is 300s, which is 5 minutes, right? > > > > > smsc-ircc2.nopnp [HW] Don't use PNP to discover SMC devices > > smsc-ircc2.ircc_cfg= [HW] Device configuration I/O port > > diff --git a/kernel/smp.c b/kernel/smp.c > > index b6a0773a7015..d3ca47f32f38 100644 > > --- a/kernel/smp.c > > +++ b/kernel/smp.c > > @@ -170,8 +170,8 @@ static DEFINE_PER_CPU(void *, cur_csd_info); > > > > static ulong csd_lock_timeout = 5000; /* CSD lock timeout in milliseconds. */ > > module_param(csd_lock_timeout, ulong, 0444); > > -static bool panic_on_ipistall; > > -module_param(panic_on_ipistall, bool, 0444); > > +static int panic_on_ipistall; /* CSD panic timeout in milliseconds, 300000 for ten minutes. */ > > s/ten/five > > > +module_param(panic_on_ipistall, int, 0444); > > > > static atomic_t csd_bug_count = ATOMIC_INIT(0); > > > > @@ -256,7 +256,7 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 * > > * to become unstuck. Use a signed comparison to avoid triggering > > * on underflows when the TSC is out of sync between sockets. > > */ > > - BUG_ON(panic_on_ipistall && (s64)ts_delta > 300000000000LL); > > + BUG_ON(panic_on_ipistall > 0 && (s64)ts_delta > ((s64)panic_on_ipistall * NSEC_PER_MSEC)); > > s64 here would avoid casting (s64)panic_on_ipistall, but I think it does > not really impact readability. > > IIUC ts_delta is an u64 being casted as s64, which could be an issue but no > computer system will actually take over 2^31 ns (292 years) to run 1 > iteration, so it's safe. > > I think it's a nice feaure :) s/feaure/feature Also, FWIW: Reviewed-by: Leonardo Bras <leobras@redhat.com> > > Thanks! > Leo > > > > if (cpu_cur_csd && csd != cpu_cur_csd) { > > pr_alert("\tcsd: CSD lock (#%d) handling prior %pS(%ps) request.\n", > > *bug_id, READ_ONCE(per_cpu(cur_csd_func, cpux)), > >
On Mon, Oct 16, 2023 at 05:58:23PM -0300, Leonardo Bras wrote: > On Mon, Oct 16, 2023 at 05:52:57PM -0300, Leonardo Bras wrote: > > On Mon, Oct 16, 2023 at 11:27:51AM -0700, Paul E. McKenney wrote: > > > On Fri, Oct 13, 2023 at 12:26:22PM -0300, Leonardo Bras wrote: > > > > On Mon, Oct 09, 2023 at 09:39:38AM -0700, Paul E. McKenney wrote: > > > > > On Fri, Oct 06, 2023 at 10:32:07AM +1100, Imran Khan wrote: > > > > > > Hello Paul, > > > > > > > > > > > > On 6/10/2023 3:48 am, Paul E. McKenney wrote: > > > > > > > The CSD lock seems to get stuck in 2 "modes". When it gets stuck > > > > > > > temporarily, it usually gets released in a few seconds, and sometimes > > > > > > > up to one or two minutes. > > > > > > > > > > > > > > If the CSD lock stays stuck for more than several minutes, it never > > > > > > > seems to get unstuck, and gradually more and more things in the system > > > > > > > end up also getting stuck. > > > > > > > > > > > > > > In the latter case, we should just give up, so the system can dump out > > > > > > > a little more information about what went wrong, and, with panic_on_oops > > > > > > > and a kdump kernel loaded, dump a whole bunch more information about > > > > > > > what might have gone wrong. > > > > > > > > > > > > > > Question: should this have its own panic_on_ipistall switch in > > > > > > > /proc/sys/kernel, or maybe piggyback on panic_on_oops in a different > > > > > > > way than via BUG_ON? > > > > > > > > > > > > > panic_on_ipistall (set to 1 by default) looks better option to me. For systems > > > > > > where such delay is acceptable and system can eventually get back to sane state, > > > > > > this option (set to 0 after boot) would prevent crashing the system for > > > > > > apparently benign CSD hangs of long duration. > > > > > > > > > > Good point! How about like the following? > > > > > > > > > > Thanx, Paul > > > > > > > > > > ------------------------------------------------------------------------ > > > > > > > > > > commit 6bcf3786291b86f13b3e13d51e998737a8009ec3 > > > > > Author: Rik van Riel <riel@surriel.com> > > > > > Date: Mon Aug 21 16:04:09 2023 -0400 > > > > > > > > > > smp,csd: Throw an error if a CSD lock is stuck for too long > > > > > > > > > > The CSD lock seems to get stuck in 2 "modes". When it gets stuck > > > > > temporarily, it usually gets released in a few seconds, and sometimes > > > > > up to one or two minutes. > > > > > > > > > > If the CSD lock stays stuck for more than several minutes, it never > > > > > seems to get unstuck, and gradually more and more things in the system > > > > > end up also getting stuck. > > > > > > > > > > In the latter case, we should just give up, so the system can dump out > > > > > a little more information about what went wrong, and, with panic_on_oops > > > > > and a kdump kernel loaded, dump a whole bunch more information about what > > > > > might have gone wrong. In addition, there is an smp.panic_on_ipistall > > > > > kernel boot parameter that by default retains the old behavior, but when > > > > > set enables the panic after the CSD lock has been stuck for more than > > > > > five minutes. > > > > > > > > > > [ paulmck: Apply Imran Khan feedback. ] > > > > > > > > > > Link: https://lore.kernel.org/lkml/bc7cc8b0-f587-4451-8bcd-0daae627bcc7@paulmck-laptop/ > > > > > Signed-off-by: Rik van Riel <riel@surriel.com> > > > > > Signed-off-by: Paul E. McKenney <paulmck@kernel.org> > > > > > Cc: Peter Zijlstra <peterz@infradead.org> > > > > > Cc: Valentin Schneider <vschneid@redhat.com> > > > > > Cc: Juergen Gross <jgross@suse.com> > > > > > Cc: Jonathan Corbet <corbet@lwn.net> > > > > > Cc: Randy Dunlap <rdunlap@infradead.org> > > > > > > > > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > > > > > index 0a1731a0f0ef..592935267ce2 100644 > > > > > --- a/Documentation/admin-guide/kernel-parameters.txt > > > > > +++ b/Documentation/admin-guide/kernel-parameters.txt > > > > > @@ -5858,6 +5858,11 @@ > > > > > This feature may be more efficiently disabled > > > > > using the csdlock_debug- kernel parameter. > > > > > > > > > > + smp.panic_on_ipistall= [KNL] > > > > > + If a csd_lock_timeout extends for more than > > > > > + five minutes, panic the system. By default, let > > > > > + CSD-lock acquisition take as long as they take. > > > > > + > > > > > > > > It could be interesting to have it as an s64 parameter (in {mili,}seconds) > > > > instead of bool, this way the user could pick the time to wait before the > > > > panic happens. 0 or -1 could mean disabled. > > > > > > > > What do you think? > > > > > > > > Other than that, > > > > Reviewed-by: Leonardo Bras <leobras@redhat.com> > > > > > > Thank you for looking this over! > > > > > > How about with the diff shown below, to be folded into the original? > > > I went with int instead of s64 because I am having some difficulty > > > imagining anyone specifying more than a 24-day timeout. ;-) > > > > I suggested s64 just because it was the type being used in > > BUG_ON(panic_on_ipistall && (s64)ts_delta > 300000000000LL); > > > > But anyway, int should be fine. > > > > > > > > > > Thanx, Paul > > > > > > ------------------------------------------------------------------------ > > > > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > > > index ccb7621eff79..ea5ae9deb753 100644 > > > --- a/Documentation/admin-guide/kernel-parameters.txt > > > +++ b/Documentation/admin-guide/kernel-parameters.txt > > > @@ -5931,8 +5931,10 @@ > > > > > > smp.panic_on_ipistall= [KNL] > > > If a csd_lock_timeout extends for more than > > > - five minutes, panic the system. By default, let > > > - CSD-lock acquisition take as long as they take. > > > + the specified number of milliseconds, panic the > > > + system. By default, let CSD-lock acquisition > > > + take as long as they take. Specifying 300,000 > > > + for this value provides a 10-minute timeout. > > > > 300,000 ms is 300s, which is 5 minutes, right? Right you are! Fixed, please see replacement fixup patch below. > > > smsc-ircc2.nopnp [HW] Don't use PNP to discover SMC devices > > > smsc-ircc2.ircc_cfg= [HW] Device configuration I/O port > > > diff --git a/kernel/smp.c b/kernel/smp.c > > > index b6a0773a7015..d3ca47f32f38 100644 > > > --- a/kernel/smp.c > > > +++ b/kernel/smp.c > > > @@ -170,8 +170,8 @@ static DEFINE_PER_CPU(void *, cur_csd_info); > > > > > > static ulong csd_lock_timeout = 5000; /* CSD lock timeout in milliseconds. */ > > > module_param(csd_lock_timeout, ulong, 0444); > > > -static bool panic_on_ipistall; > > > -module_param(panic_on_ipistall, bool, 0444); > > > +static int panic_on_ipistall; /* CSD panic timeout in milliseconds, 300000 for ten minutes. */ > > > > s/ten/five And right you are again! > > > +module_param(panic_on_ipistall, int, 0444); > > > > > > static atomic_t csd_bug_count = ATOMIC_INIT(0); > > > > > > @@ -256,7 +256,7 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 * > > > * to become unstuck. Use a signed comparison to avoid triggering > > > * on underflows when the TSC is out of sync between sockets. > > > */ > > > - BUG_ON(panic_on_ipistall && (s64)ts_delta > 300000000000LL); > > > + BUG_ON(panic_on_ipistall > 0 && (s64)ts_delta > ((s64)panic_on_ipistall * NSEC_PER_MSEC)); > > > > s64 here would avoid casting (s64)panic_on_ipistall, but I think it does > > not really impact readability. > > > > IIUC ts_delta is an u64 being casted as s64, which could be an issue but no > > computer system will actually take over 2^31 ns (292 years) to run 1 > > iteration, so it's safe. Back at you with s/2^31/2^63/. ;-) > > I think it's a nice feaure :) > > s/feaure/feature > > Also, FWIW: > Reviewed-by: Leonardo Bras <leobras@redhat.com> I have that down as well, and thank you again! Thanx, Paul ------------------------------------------------------------------------ diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index ccb7621eff79..e1fe26dae586 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -5931,8 +5931,10 @@ smp.panic_on_ipistall= [KNL] If a csd_lock_timeout extends for more than - five minutes, panic the system. By default, let - CSD-lock acquisition take as long as they take. + the specified number of milliseconds, panic the + system. By default, let CSD-lock acquisition + take as long as they take. Specifying 300,000 + for this value provides a 5-minute timeout. smsc-ircc2.nopnp [HW] Don't use PNP to discover SMC devices smsc-ircc2.ircc_cfg= [HW] Device configuration I/O port diff --git a/kernel/smp.c b/kernel/smp.c index b6a0773a7015..695eb13a276d 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -170,8 +170,8 @@ static DEFINE_PER_CPU(void *, cur_csd_info); static ulong csd_lock_timeout = 5000; /* CSD lock timeout in milliseconds. */ module_param(csd_lock_timeout, ulong, 0444); -static bool panic_on_ipistall; -module_param(panic_on_ipistall, bool, 0444); +static int panic_on_ipistall; /* CSD panic timeout in milliseconds, 300000 for five minutes. */ +module_param(panic_on_ipistall, int, 0444); static atomic_t csd_bug_count = ATOMIC_INIT(0); @@ -256,7 +256,7 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 * * to become unstuck. Use a signed comparison to avoid triggering * on underflows when the TSC is out of sync between sockets. */ - BUG_ON(panic_on_ipistall && (s64)ts_delta > 300000000000LL); + BUG_ON(panic_on_ipistall > 0 && (s64)ts_delta > ((s64)panic_on_ipistall * NSEC_PER_MSEC)); if (cpu_cur_csd && csd != cpu_cur_csd) { pr_alert("\tcsd: CSD lock (#%d) handling prior %pS(%ps) request.\n", *bug_id, READ_ONCE(per_cpu(cur_csd_func, cpux)),
On Mon, Oct 16, 2023 at 02:37:10PM -0700, Paul E. McKenney wrote: > On Mon, Oct 16, 2023 at 05:58:23PM -0300, Leonardo Bras wrote: > > On Mon, Oct 16, 2023 at 05:52:57PM -0300, Leonardo Bras wrote: > > > On Mon, Oct 16, 2023 at 11:27:51AM -0700, Paul E. McKenney wrote: > > > > On Fri, Oct 13, 2023 at 12:26:22PM -0300, Leonardo Bras wrote: > > > > > On Mon, Oct 09, 2023 at 09:39:38AM -0700, Paul E. McKenney wrote: > > > > > > On Fri, Oct 06, 2023 at 10:32:07AM +1100, Imran Khan wrote: > > > > > > > Hello Paul, > > > > > > > > > > > > > > On 6/10/2023 3:48 am, Paul E. McKenney wrote: > > > > > > > > The CSD lock seems to get stuck in 2 "modes". When it gets stuck > > > > > > > > temporarily, it usually gets released in a few seconds, and sometimes > > > > > > > > up to one or two minutes. > > > > > > > > > > > > > > > > If the CSD lock stays stuck for more than several minutes, it never > > > > > > > > seems to get unstuck, and gradually more and more things in the system > > > > > > > > end up also getting stuck. > > > > > > > > > > > > > > > > In the latter case, we should just give up, so the system can dump out > > > > > > > > a little more information about what went wrong, and, with panic_on_oops > > > > > > > > and a kdump kernel loaded, dump a whole bunch more information about > > > > > > > > what might have gone wrong. > > > > > > > > > > > > > > > > Question: should this have its own panic_on_ipistall switch in > > > > > > > > /proc/sys/kernel, or maybe piggyback on panic_on_oops in a different > > > > > > > > way than via BUG_ON? > > > > > > > > > > > > > > > panic_on_ipistall (set to 1 by default) looks better option to me. For systems > > > > > > > where such delay is acceptable and system can eventually get back to sane state, > > > > > > > this option (set to 0 after boot) would prevent crashing the system for > > > > > > > apparently benign CSD hangs of long duration. > > > > > > > > > > > > Good point! How about like the following? > > > > > > > > > > > > Thanx, Paul > > > > > > > > > > > > ------------------------------------------------------------------------ > > > > > > > > > > > > commit 6bcf3786291b86f13b3e13d51e998737a8009ec3 > > > > > > Author: Rik van Riel <riel@surriel.com> > > > > > > Date: Mon Aug 21 16:04:09 2023 -0400 > > > > > > > > > > > > smp,csd: Throw an error if a CSD lock is stuck for too long > > > > > > > > > > > > The CSD lock seems to get stuck in 2 "modes". When it gets stuck > > > > > > temporarily, it usually gets released in a few seconds, and sometimes > > > > > > up to one or two minutes. > > > > > > > > > > > > If the CSD lock stays stuck for more than several minutes, it never > > > > > > seems to get unstuck, and gradually more and more things in the system > > > > > > end up also getting stuck. > > > > > > > > > > > > In the latter case, we should just give up, so the system can dump out > > > > > > a little more information about what went wrong, and, with panic_on_oops > > > > > > and a kdump kernel loaded, dump a whole bunch more information about what > > > > > > might have gone wrong. In addition, there is an smp.panic_on_ipistall > > > > > > kernel boot parameter that by default retains the old behavior, but when > > > > > > set enables the panic after the CSD lock has been stuck for more than > > > > > > five minutes. > > > > > > > > > > > > [ paulmck: Apply Imran Khan feedback. ] > > > > > > > > > > > > Link: https://lore.kernel.org/lkml/bc7cc8b0-f587-4451-8bcd-0daae627bcc7@paulmck-laptop/ > > > > > > Signed-off-by: Rik van Riel <riel@surriel.com> > > > > > > Signed-off-by: Paul E. McKenney <paulmck@kernel.org> > > > > > > Cc: Peter Zijlstra <peterz@infradead.org> > > > > > > Cc: Valentin Schneider <vschneid@redhat.com> > > > > > > Cc: Juergen Gross <jgross@suse.com> > > > > > > Cc: Jonathan Corbet <corbet@lwn.net> > > > > > > Cc: Randy Dunlap <rdunlap@infradead.org> > > > > > > > > > > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > > > > > > index 0a1731a0f0ef..592935267ce2 100644 > > > > > > --- a/Documentation/admin-guide/kernel-parameters.txt > > > > > > +++ b/Documentation/admin-guide/kernel-parameters.txt > > > > > > @@ -5858,6 +5858,11 @@ > > > > > > This feature may be more efficiently disabled > > > > > > using the csdlock_debug- kernel parameter. > > > > > > > > > > > > + smp.panic_on_ipistall= [KNL] > > > > > > + If a csd_lock_timeout extends for more than > > > > > > + five minutes, panic the system. By default, let > > > > > > + CSD-lock acquisition take as long as they take. > > > > > > + > > > > > > > > > > It could be interesting to have it as an s64 parameter (in {mili,}seconds) > > > > > instead of bool, this way the user could pick the time to wait before the > > > > > panic happens. 0 or -1 could mean disabled. > > > > > > > > > > What do you think? > > > > > > > > > > Other than that, > > > > > Reviewed-by: Leonardo Bras <leobras@redhat.com> > > > > > > > > Thank you for looking this over! > > > > > > > > How about with the diff shown below, to be folded into the original? > > > > I went with int instead of s64 because I am having some difficulty > > > > imagining anyone specifying more than a 24-day timeout. ;-) > > > > > > I suggested s64 just because it was the type being used in > > > BUG_ON(panic_on_ipistall && (s64)ts_delta > 300000000000LL); > > > > > > But anyway, int should be fine. > > > > > > > > > > > > > > Thanx, Paul > > > > > > > > ------------------------------------------------------------------------ > > > > > > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > > > > index ccb7621eff79..ea5ae9deb753 100644 > > > > --- a/Documentation/admin-guide/kernel-parameters.txt > > > > +++ b/Documentation/admin-guide/kernel-parameters.txt > > > > @@ -5931,8 +5931,10 @@ > > > > > > > > smp.panic_on_ipistall= [KNL] > > > > If a csd_lock_timeout extends for more than > > > > - five minutes, panic the system. By default, let > > > > - CSD-lock acquisition take as long as they take. > > > > + the specified number of milliseconds, panic the > > > > + system. By default, let CSD-lock acquisition > > > > + take as long as they take. Specifying 300,000 > > > > + for this value provides a 10-minute timeout. > > > > > > 300,000 ms is 300s, which is 5 minutes, right? > > Right you are! Fixed, please see replacement fixup patch below. > > > > > smsc-ircc2.nopnp [HW] Don't use PNP to discover SMC devices > > > > smsc-ircc2.ircc_cfg= [HW] Device configuration I/O port > > > > diff --git a/kernel/smp.c b/kernel/smp.c > > > > index b6a0773a7015..d3ca47f32f38 100644 > > > > --- a/kernel/smp.c > > > > +++ b/kernel/smp.c > > > > @@ -170,8 +170,8 @@ static DEFINE_PER_CPU(void *, cur_csd_info); > > > > > > > > static ulong csd_lock_timeout = 5000; /* CSD lock timeout in milliseconds. */ > > > > module_param(csd_lock_timeout, ulong, 0444); > > > > -static bool panic_on_ipistall; > > > > -module_param(panic_on_ipistall, bool, 0444); > > > > +static int panic_on_ipistall; /* CSD panic timeout in milliseconds, 300000 for ten minutes. */ > > > > > > s/ten/five > > And right you are again! > > > > > +module_param(panic_on_ipistall, int, 0444); > > > > > > > > static atomic_t csd_bug_count = ATOMIC_INIT(0); > > > > > > > > @@ -256,7 +256,7 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 * > > > > * to become unstuck. Use a signed comparison to avoid triggering > > > > * on underflows when the TSC is out of sync between sockets. > > > > */ > > > > - BUG_ON(panic_on_ipistall && (s64)ts_delta > 300000000000LL); > > > > + BUG_ON(panic_on_ipistall > 0 && (s64)ts_delta > ((s64)panic_on_ipistall * NSEC_PER_MSEC)); > > > > > > s64 here would avoid casting (s64)panic_on_ipistall, but I think it does > > > not really impact readability. > > > > > > IIUC ts_delta is an u64 being casted as s64, which could be an issue but no > > > computer system will actually take over 2^31 ns (292 years) to run 1 > > > iteration, so it's safe. > > Back at you with s/2^31/2^63/. ;-) Right you are! :) > > > > I think it's a nice feaure :) > > > > s/feaure/feature > > > > Also, FWIW: > > Reviewed-by: Leonardo Bras <leobras@redhat.com> > > I have that down as well, and thank you again! > > Thanx, Paul > > ------------------------------------------------------------------------ > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > index ccb7621eff79..e1fe26dae586 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -5931,8 +5931,10 @@ > > smp.panic_on_ipistall= [KNL] > If a csd_lock_timeout extends for more than > - five minutes, panic the system. By default, let > - CSD-lock acquisition take as long as they take. > + the specified number of milliseconds, panic the > + system. By default, let CSD-lock acquisition > + take as long as they take. Specifying 300,000 > + for this value provides a 5-minute timeout. > > smsc-ircc2.nopnp [HW] Don't use PNP to discover SMC devices > smsc-ircc2.ircc_cfg= [HW] Device configuration I/O port > diff --git a/kernel/smp.c b/kernel/smp.c > index b6a0773a7015..695eb13a276d 100644 > --- a/kernel/smp.c > +++ b/kernel/smp.c > @@ -170,8 +170,8 @@ static DEFINE_PER_CPU(void *, cur_csd_info); > > static ulong csd_lock_timeout = 5000; /* CSD lock timeout in milliseconds. */ > module_param(csd_lock_timeout, ulong, 0444); > -static bool panic_on_ipistall; > -module_param(panic_on_ipistall, bool, 0444); > +static int panic_on_ipistall; /* CSD panic timeout in milliseconds, 300000 for five minutes. */ > +module_param(panic_on_ipistall, int, 0444); > > static atomic_t csd_bug_count = ATOMIC_INIT(0); > > @@ -256,7 +256,7 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 * > * to become unstuck. Use a signed comparison to avoid triggering > * on underflows when the TSC is out of sync between sockets. > */ > - BUG_ON(panic_on_ipistall && (s64)ts_delta > 300000000000LL); > + BUG_ON(panic_on_ipistall > 0 && (s64)ts_delta > ((s64)panic_on_ipistall * NSEC_PER_MSEC)); > if (cpu_cur_csd && csd != cpu_cur_csd) { > pr_alert("\tcsd: CSD lock (#%d) handling prior %pS(%ps) request.\n", > *bug_id, READ_ONCE(per_cpu(cur_csd_func, cpux)), > It's great! Thanks!
diff --git a/kernel/smp.c b/kernel/smp.c index 8455a53465af..059f1f53fc6b 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -230,6 +230,7 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 * } ts2 = sched_clock(); + /* How long since we last checked for a stuck CSD lock.*/ ts_delta = ts2 - *ts1; if (likely(ts_delta <= csd_lock_timeout_ns || csd_lock_timeout_ns == 0)) return false; @@ -243,9 +244,17 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 * else cpux = cpu; cpu_cur_csd = smp_load_acquire(&per_cpu(cur_csd, cpux)); /* Before func and info. */ + /* How long since this CSD lock was stuck. */ + ts_delta = ts2 - ts0; pr_alert("csd: %s non-responsive CSD lock (#%d) on CPU#%d, waiting %llu ns for CPU#%02d %pS(%ps).\n", - firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts2 - ts0, + firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts_delta, cpu, csd->func, csd->info); + /* + * If the CSD lock is still stuck after 5 minutes, it is unlikely + * to become unstuck. Use a signed comparison to avoid triggering + * on underflows when the TSC is out of sync between sockets. + */ + BUG_ON((s64)ts_delta > 300000000000LL); if (cpu_cur_csd && csd != cpu_cur_csd) { pr_alert("\tcsd: CSD lock (#%d) handling prior %pS(%ps) request.\n", *bug_id, READ_ONCE(per_cpu(cur_csd_func, cpux)),