From patchwork Mon Aug 21 20:04:09 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Rik van Riel
X-Patchwork-Id: 136424
Date: Mon, 21 Aug 2023 16:04:09 -0400
From: Rik van Riel
To: linux-kernel@vger.kernel.org
Cc: kernel-team@meta.com, Peter Zijlstra, "Paul E. McKenney",
 Valentin Schneider, Juergen Gross
Subject: [PATCH,RFC] smp,csd: throw an error if a CSD lock is stuck for too
 long
Message-ID: <20230821160409.663b8ba9@imladris.surriel.com>
Sender: riel@surriel.com
X-Mailing-List: linux-kernel@vger.kernel.org

The CSD lock seems to get stuck in 2 "modes". When it gets stuck
temporarily, it usually gets released in a few seconds, and sometimes
up to one or two minutes.

If the CSD lock stays stuck for more than several minutes, it never
seems to get unstuck, and gradually more and more things in the system
end up also getting stuck.

In the latter case, we should just give up, so the system can dump out
a little more information about what went wrong, and, with panic_on_oops
and a kdump kernel loaded, dump a whole bunch more information about
what might have gone wrong.

Question: should this have its own panic_on_ipistall switch in
/proc/sys/kernel, or maybe piggyback on panic_on_oops in a different
way than via BUG_ON?

Signed-off-by: Rik van Riel
Reviewed-by: Paul E. McKenney
---
 kernel/smp.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 385179dae360..8b808bff15e6 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -228,6 +228,7 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 *
 	}
 
 	ts2 = sched_clock();
+	/* How long since we last checked for a stuck CSD lock.*/
 	ts_delta = ts2 - *ts1;
 	if (likely(ts_delta <= csd_lock_timeout_ns || csd_lock_timeout_ns == 0))
 		return false;
@@ -241,9 +242,17 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 *
 	else
 		cpux = cpu;
 	cpu_cur_csd = smp_load_acquire(&per_cpu(cur_csd, cpux)); /* Before func and info. */
+	/* How long since this CSD lock was stuck. */
+	ts_delta = ts2 - ts0;
 	pr_alert("csd: %s non-responsive CSD lock (#%d) on CPU#%d, waiting %llu ns for CPU#%02d %pS(%ps).\n",
-		 firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts2 - ts0,
+		 firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts_delta,
 		 cpu, csd->func, csd->info);
+	/*
+	 * If the CSD lock is still stuck after 5 minutes, it is unlikely
+	 * to become unstuck. Use a signed comparison to avoid triggering
+	 * on underflows when the TSC is out of sync between sockets.
+	 */
+	BUG_ON((s64)ts_delta > 300000000000LL);
 	if (cpu_cur_csd && csd != cpu_cur_csd) {
 		pr_alert("\tcsd: CSD lock (#%d) handling prior %pS(%ps) request.\n",
 			 *bug_id, READ_ONCE(per_cpu(cur_csd_func, cpux)),
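
Illustration only, not part of the patch: a minimal userspace sketch of why the BUG_ON compares (s64)ts_delta rather than the raw u64. The names STUCK_LIMIT_NS, stuck_unsigned(), stuck_signed() and demo_selfcheck() are hypothetical and do not exist in kernel/smp.c; only the 300000000000LL threshold and the cast are taken from the diff above. It assumes the usual two's-complement conversion from uint64_t to int64_t, which the kernel also relies on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name; the patch hard-codes 5 minutes in nanoseconds. */
#define STUCK_LIMIT_NS 300000000000LL

/*
 * Unsigned comparison: if "now" was read from a clock that is slightly
 * behind "start", the subtraction wraps to a huge value and the check
 * fires even though almost no time has passed.
 */
static bool stuck_unsigned(uint64_t now, uint64_t start)
{
	return (now - start) > (uint64_t)STUCK_LIMIT_NS;
}

/*
 * Signed comparison, as in the patch: the wrapped difference is
 * reinterpreted as a negative number and falls below the threshold.
 */
static bool stuck_signed(uint64_t now, uint64_t start)
{
	return (int64_t)(now - start) > STUCK_LIMIT_NS;
}

/* Small self-check covering the cases discussed above. */
static void demo_selfcheck(void)
{
	const uint64_t start = 1000000000ULL;	/* 1 s, in ns */

	/* Clock reads 1 ms earlier than start: only the unsigned check fires. */
	assert(stuck_unsigned(start - 1000000ULL, start));
	assert(!stuck_signed(start - 1000000ULL, start));

	/* 301 s later: a genuine stall, the signed check fires. */
	assert(stuck_signed(start + 301000000000ULL, start));
}
```

Calling demo_selfcheck() from a test program exercises both cases. The signed form only misbehaves for real waits beyond roughly 292 years, which is not a practical concern here.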