From patchwork Thu Oct 5 16:48:36 2023
X-Patchwork-Submitter: "Paul E. McKenney"
X-Patchwork-Id: 148962
Date: Thu, 5 Oct 2023 09:48:36 -0700
From: "Paul E. McKenney"
To: linux-kernel@vger.kernel.org
Cc: Peter Zijlstra, Valentin Schneider, Juergen Gross, Leonardo Bras,
 Imran Khan
Subject: [PATCH smp,csd] Throw an error if a CSD lock is stuck for too long
Reply-To: paulmck@kernel.org

The CSD lock seems to get stuck in 2 "modes". When it gets stuck
temporarily, it usually gets released in a few seconds, and sometimes
up to one or two minutes.

If the CSD lock stays stuck for more than several minutes, it never
seems to get unstuck, and gradually more and more things in the system
end up also getting stuck.

In the latter case, we should just give up, so the system can dump out
a little more information about what went wrong, and, with panic_on_oops
and a kdump kernel loaded, dump a whole bunch more information about
what might have gone wrong.

Question: should this have its own panic_on_ipistall switch in
/proc/sys/kernel, or maybe piggyback on panic_on_oops in a different
way than via BUG_ON?

Signed-off-by: Rik van Riel
Signed-off-by: Paul E. McKenney
Reviewed-by: Imran Khan
Reviewed-by: Leonardo Bras

diff --git a/kernel/smp.c b/kernel/smp.c
index 8455a53465af..059f1f53fc6b 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -230,6 +230,7 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 *
 	}
 
 	ts2 = sched_clock();
+	/* How long since we last checked for a stuck CSD lock.*/
 	ts_delta = ts2 - *ts1;
 	if (likely(ts_delta <= csd_lock_timeout_ns || csd_lock_timeout_ns == 0))
 		return false;
@@ -243,9 +244,17 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 *
 	else
 		cpux = cpu;
 	cpu_cur_csd = smp_load_acquire(&per_cpu(cur_csd, cpux)); /* Before func and info. */
+	/* How long since this CSD lock was stuck. */
+	ts_delta = ts2 - ts0;
 	pr_alert("csd: %s non-responsive CSD lock (#%d) on CPU#%d, waiting %llu ns for CPU#%02d %pS(%ps).\n",
-		 firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts2 - ts0,
+		 firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts_delta,
 		 cpu, csd->func, csd->info);
+	/*
+	 * If the CSD lock is still stuck after 5 minutes, it is unlikely
+	 * to become unstuck. Use a signed comparison to avoid triggering
+	 * on underflows when the TSC is out of sync between sockets.
+	 */
+	BUG_ON((s64)ts_delta > 300000000000LL);
 	if (cpu_cur_csd && csd != cpu_cur_csd) {
 		pr_alert("\tcsd: CSD lock (#%d) handling prior %pS(%ps) request.\n",
 			 *bug_id, READ_ONCE(per_cpu(cur_csd_func, cpux)),
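
Illustration only, not part of the patch: a minimal userspace sketch of
why the BUG_ON() above casts ts_delta to s64 before comparing. The names
stuck_unsigned(), stuck_signed(), check_examples() and STUCK_LIMIT_NS,
and the sample timestamps, are hypothetical; int64_t stands in for the
kernel's s64, and the cast assumes the usual two's-complement conversion.
If the later timestamp reads slightly behind the earlier one, the
unsigned difference wraps to a huge value, while the signed view of the
same bits is a small negative number that stays under the limit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name for the 5-minute limit, in nanoseconds. */
#define STUCK_LIMIT_NS 300000000000LL

/* Unsigned comparison: a wrapped delta looks like a very long wait. */
static int stuck_unsigned(uint64_t ts0, uint64_t ts2)
{
	uint64_t ts_delta = ts2 - ts0;	/* wraps if ts2 < ts0 */

	return ts_delta > (uint64_t)STUCK_LIMIT_NS;
}

/* Signed comparison, as in the patch: a wrapped delta is negative. */
static int stuck_signed(uint64_t ts0, uint64_t ts2)
{
	uint64_t ts_delta = ts2 - ts0;	/* same subtraction as above */

	return (int64_t)ts_delta > STUCK_LIMIT_NS;
}

/* Sample cases; the timestamps are made up for illustration. */
static void check_examples(void)
{
	/* Six minutes elapsed: both comparisons report a stuck lock. */
	assert(stuck_unsigned(1000, 1000 + 360000000000ULL));
	assert(stuck_signed(1000, 1000 + 360000000000ULL));

	/* Ten seconds elapsed: neither comparison triggers. */
	assert(!stuck_unsigned(1000, 1000 + 10000000000ULL));
	assert(!stuck_signed(1000, 1000 + 10000000000ULL));

	/* Later reading is 1000 ns behind: only the unsigned test fires. */
	assert(stuck_unsigned(5000, 4000));
	assert(!stuck_signed(5000, 4000));
}
```

Calling check_examples() from a test program exercises all three cases;
the last pair of assertions is the out-of-sync case the patch comment
describes.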