Message ID | 20221027151908.v1.1.I295e65357f06f162b46ea6fc6c03be37e3978bdc@changeid |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a5d:6687:0:0:0:0:0 with SMTP id l7csp297716wru; Thu, 27 Oct 2022 08:31:48 -0700 (PDT) X-Google-Smtp-Source: AMsMyM4jUtZZaruffX/oIJ3g94Wblt9M4kW/EwbSbkHfdWm86wEU4xIznxbXOQuMvEmHwq8sU/L3 X-Received: by 2002:a17:90b:2504:b0:212:def1:623b with SMTP id ns4-20020a17090b250400b00212def1623bmr10964967pjb.47.1666884708425; Thu, 27 Oct 2022 08:31:48 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1666884708; cv=none; d=google.com; s=arc-20160816; b=X4EOrEFDm2CUCOKYIIZp7PJcja04tEgr5FTdVnx/VdXaCnejglNjZu2vSa9lnQrHaE luINskfhsKm+OjO7DAerGSleuDS4zNNk4WsSFnIAD2GfHxp0lmBdS5HfYLrSKKZp9L9b /VmjT9fKlomEvXcRfSUvdcBD1rjJsOg+jE7HK3YL2qZehmsrewW8sFcRYb0h6dWRQiGm 59YRs3m0Euu124B8xSvTuHrq742L9Z15JB6dTIo4ZLv4sVtqq3j1LR4x6QNn5iOWEAKS eQcs74UBTOK6dYch2/grYd5g/Lo/4zsxkYRuOXwBSY1me6GosbNuT33uUon7jKMb2ebW SNkQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :message-id:date:subject:cc:to:from:dkim-signature; bh=ArN/ceylnEiTQ/ffcvkaRkBlDi2r/w3YlTdaaGRYyy0=; b=RbR/ZbZGDFFQATuW5GrWnGw4m3pcqvPUi7ryrj1C9fI+jn6rwYzgX7r+EDLlNTmXf6 LLtoOXYfNviVIMG2wXiN1Jq9JklJkfv9H60ZoTa8gAAY3RSb9ldqyh+WHUphlAEKsPp0 Suuobqp2XdUWTjk71oJEu74kZVVQR1YXA6LwgdiJzIZ3mOEv9H0riNGYvWa6rVqxpe0C dSXHMtgTZbiuZXJuuV6y7ceMZFHCn4Vy8x1asM+WcgjoUyOgecdfM4UFMJQSZnbsWB9X qoZxuo875q5H6rRX8ql//yK4RaDLSVyvxaQG7slAqYxTQcBrXvhjuSoBU8Jjpw77Oaz1 nLaA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@chromium.org header.s=google header.b=TBZFyhZ8; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=chromium.org Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id n2-20020a170903404200b0015f4370afbdsi1592993pla.516.2022.10.27.08.31.33; Thu, 27 Oct 2022 08:31:48 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@chromium.org header.s=google header.b=TBZFyhZ8; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=chromium.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S236511AbiJ0PVM (ORCPT <rfc822;chrisfriedt@gmail.com> + 99 others); Thu, 27 Oct 2022 11:21:12 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:53428 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S236438AbiJ0PU3 (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Thu, 27 Oct 2022 11:20:29 -0400 Received: from mail-qk1-x72e.google.com (mail-qk1-x72e.google.com [IPv6:2607:f8b0:4864:20::72e]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 094AA1AF24 for <linux-kernel@vger.kernel.org>; Thu, 27 Oct 2022 08:19:22 -0700 (PDT) Received: by mail-qk1-x72e.google.com with SMTP id i10so1000552qkl.12 for <linux-kernel@vger.kernel.org>; Thu, 27 Oct 2022 08:19:21 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=chromium.org; s=google; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:from:to:cc:subject:date:message-id:reply-to; bh=ArN/ceylnEiTQ/ffcvkaRkBlDi2r/w3YlTdaaGRYyy0=; b=TBZFyhZ8VFkxJraM6oMydd20rFcNXgaDzV3Eahs4a5rNLpMc88DjEXedfr1aMtOEd1 y3XKPeBHjK3bb8MVVwe6773bWMogMwYqzUTR0TCvNY9AXaq4QrxueuF/zsSoWFesHt87 BZe2FQDx+9TlQd4jKXvs16YCM2v8qFWIYGTAY= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=ArN/ceylnEiTQ/ffcvkaRkBlDi2r/w3YlTdaaGRYyy0=; b=OuHzpPW+mXZu1nqdkRK98PE+REJdo1O8hWkuWbMb2cOCVcYuPTk0tTgrkdlOPpk6Re gUeHPnAoBd3a5xtEr2CwT5Jel5P/GV2cGb7QIgJTFUPC/Cunxk59wZVnllItj1kylvQG DXiIuFJ1KwwTJOk4svSXxDGLzFeuU21ZZp+/vLejDqZwZk83Lq23PU2D/2KQfbMAo/Tt 1+cXN0RcGbpC0IgwfgpMQc2zGTb8SCJxXgKKpDenNMUaelPhyaWEONse3pHvuIjCG3Ou N9VZ/H3Gfoa4yWykBHN3wvLYM90ccde72xWvSwfObYZqe8bIATbRpmhf4WocPAb/0f+Z ERrQ== X-Gm-Message-State: ACrzQf1EF+UfiJOAYFzzKOlIBrzqXA+FdRHzntRDSXlGY1m2mfWonFgf MQrnIfOBTyj5KyMmbOi+vqPWprZXSYyQ6A== X-Received: by 2002:a05:620a:51ca:b0:6f5:71e1:5ccf with SMTP id cx10-20020a05620a51ca00b006f571e15ccfmr14767083qkb.217.1666883961039; Thu, 27 Oct 2022 08:19:21 -0700 (PDT) Received: from trappist.c.googlers.com.com (128.174.85.34.bc.googleusercontent.com. [34.85.174.128]) by smtp.gmail.com with ESMTPSA id w7-20020ac84d07000000b0039d02911555sm991304qtv.78.2022.10.27.08.19.20 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 27 Oct 2022 08:19:20 -0700 (PDT) From: Sven van Ashbrook <svenva@chromium.org> To: LKML <linux-kernel@vger.kernel.org> Cc: platform-driver-x86@vger.kernel.org, Rajneesh Bhardwaj <rajneesh.bhardwaj@intel.com>, Rafael J Wysocki <rjw@rjwysocki.net>, Rajat Jain <rajatja@google.com>, Sven van Ashbrook <svenva@chromium.org>, David E Box <david.e.box@intel.com>, Hans de Goede <hdegoede@redhat.com>, Mark Gross <markgross@kernel.org>, Rajneesh Bhardwaj <irenic.rajneesh@gmail.com> Subject: [PATCH v1] platform/x86: intel_pmc_core: promote S0ix failure warn() to WARN() Date: Thu, 27 Oct 2022 15:19:14 +0000 Message-Id: <20221027151908.v1.1.I295e65357f06f162b46ea6fc6c03be37e3978bdc@changeid> X-Mailer: git-send-email 2.38.0.135.g90850a2211-goog MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-2.6 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1747855300469717103?= X-GMAIL-MSGID: =?utf-8?q?1747855300469717103?= |
Series |
[v1] platform/x86: intel_pmc_core: promote S0ix failure warn() to WARN()
|
|
Commit Message
Sven van Ashbrook
Oct. 27, 2022, 3:19 p.m. UTC
The "failure to enter S0ix" warning is critically important for monitoring
and debugging power regressions, both in the field and in the test lab.
Promote from lower-case warn() to upper-case WARN() so that it becomes
more prominent, and gets picked up as part of existing monitoring
infrastructure, which typically focuses on WARN() and ignores warn()
type log messages.
Signed-off-by: Sven van Ashbrook <svenva@chromium.org>
---
Against v6.1-rc2
drivers/platform/x86/intel/pmc/core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Comments
Hi, On 10/27/22 17:19, Sven van Ashbrook wrote: > The "failure to enter S0ix" warning is critically important for monitoring > and debugging power regressions, both in the field and in the test lab. > > Promote from lower-case warn() to upper-case WARN() so that it becomes > more prominent, and gets picked up as part of existing monitoring > infrastructure, which typically focuses on WARN() and ignores warn() > type log messages. > > Signed-off-by: Sven van Ashbrook <svenva@chromium.org> WARN() is really only intended for internal kernel bugs and not for hw misbehaving, so I'm not a fan of the change you are suggesting here. Intel folks, do you have an opinion on this ? Regards, Hans > --- > Against v6.1-rc2 > > drivers/platform/x86/intel/pmc/core.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/platform/x86/intel/pmc/core.c b/drivers/platform/x86/intel/pmc/core.c > index a1fe1e0dcf4a5..834f0352c0edf 100644 > --- a/drivers/platform/x86/intel/pmc/core.c > +++ b/drivers/platform/x86/intel/pmc/core.c > @@ -2125,7 +2125,7 @@ static __maybe_unused int pmc_core_resume(struct device *dev) > } > > /* The real interesting case - S0ix failed - lets ask PMC why. */ > - dev_warn(dev, "CPU did not enter SLP_S0!!! (S0ix cnt=%llu)\n", > + dev_WARN(dev, "CPU did not enter SLP_S0!!! (S0ix cnt=%llu)\n", > pmcdev->s0ix_counter); > if (pmcdev->map->slps0_dbg_maps) > pmc_core_slps0_display(pmcdev, dev, NULL);
[Public] +Shyam & Raul > -----Original Message----- > From: Hans de Goede <hdegoede@redhat.com> > Sent: Thursday, October 27, 2022 10:41 > To: Sven van Ashbrook <svenva@chromium.org>; LKML <linux- > kernel@vger.kernel.org> > Cc: platform-driver-x86@vger.kernel.org; Rajneesh Bhardwaj > <rajneesh.bhardwaj@intel.com>; Rafael J Wysocki <rjw@rjwysocki.net>; > Rajat Jain <rajatja@google.com>; David E Box <david.e.box@intel.com>; > Mark Gross <markgross@kernel.org>; Rajneesh Bhardwaj > <irenic.rajneesh@gmail.com> > Subject: Re: [PATCH v1] platform/x86: intel_pmc_core: promote S0ix failure > warn() to WARN() > > Hi, > > On 10/27/22 17:19, Sven van Ashbrook wrote: > > The "failure to enter S0ix" warning is critically important for monitoring > > and debugging power regressions, both in the field and in the test lab. > > > > Promote from lower-case warn() to upper-case WARN() so that it becomes > > more prominent, and gets picked up as part of existing monitoring > > infrastructure, which typically focuses on WARN() and ignores warn() > > type log messages. > > > > Signed-off-by: Sven van Ashbrook <svenva@chromium.org> > > WARN() is really only intended for internal kernel bugs and not for > hw misbehaving, so I'm not a fan of the change you are suggesting here. > > Intel folks, do you have an opinion on this ? For a reference point; on AMD's implementation of a similar driver (platform/x86/amd/pmc.c) this "type" of message is also "dev_warn": https://github.com/torvalds/linux/blob/master/drivers/platform/x86/amd/pmc.c#L365 If we do make changes to this message level so that other infrastructure picks up I suggest we do it for both drivers. Are we maybe at the point now it should be dev_err instead? > > Regards, > > Hans > > > > --- > > Against v6.1-rc2 > > > > drivers/platform/x86/intel/pmc/core.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/drivers/platform/x86/intel/pmc/core.c > b/drivers/platform/x86/intel/pmc/core.c > > index a1fe1e0dcf4a5..834f0352c0edf 100644 > > --- a/drivers/platform/x86/intel/pmc/core.c > > +++ b/drivers/platform/x86/intel/pmc/core.c > > @@ -2125,7 +2125,7 @@ static __maybe_unused int > pmc_core_resume(struct device *dev) > > } > > > > /* The real interesting case - S0ix failed - lets ask PMC why. */ > > - dev_warn(dev, "CPU did not enter SLP_S0!!! (S0ix cnt=%llu)\n", > > + dev_WARN(dev, "CPU did not enter SLP_S0!!! (S0ix cnt=%llu)\n", > > pmcdev->s0ix_counter); > > if (pmcdev->map->slps0_dbg_maps) > > pmc_core_slps0_display(pmcdev, dev, NULL);
On Thu, Oct 27, 2022 at 11:47 AM Limonciello, Mario <Mario.Limonciello@amd.com> wrote: > > If we do make changes to this message level so that other infrastructure picks up I suggest > we do it for both drivers. Sounds good, patch feedback willing. > WARN() is really only intended for internal kernel bugs and not for > hw misbehaving, so I'm not a fan of the change you are suggesting here. Thanks, I was not aware of this distinction. But it does look like upper-case WARN is already used in many places to indicate hw misbehaving? Here for example: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/platform/chrome/cros_ec.c?h=v6.1-rc2#n142
On Thu, 2022-10-27 at 17:40 +0200, Hans de Goede wrote: > Hi, > > On 10/27/22 17:19, Sven van Ashbrook wrote: > > The "failure to enter S0ix" warning is critically important for monitoring > > and debugging power regressions, both in the field and in the test lab. > > > > Promote from lower-case warn() to upper-case WARN() so that it becomes > > more prominent, and gets picked up as part of existing monitoring > > infrastructure, which typically focuses on WARN() and ignores warn() > > type log messages. > > > > Signed-off-by: Sven van Ashbrook <svenva@chromium.org> > > WARN() is really only intended for internal kernel bugs and not for > hw misbehaving, so I'm not a fan of the change you are suggesting here. > > Intel folks, do you have an opinion on this ? I agree that not entering s0ix is a critical failure, but this is a hardware suspend failure. How we treat this should be akin to how we treat failure to enter S3 or deeper. S0ix support is indicated by the S0 Low Power Idle bit in the ACPI FADT table. It's better IMO to create some framework in the suspend or ACPI core that allows platforms to report whether they have achieved the hardware state indicated by having this bit set. Rafael, your thoughts? David > > Regards, > > Hans > > > > --- > > Against v6.1-rc2 > > > > drivers/platform/x86/intel/pmc/core.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/drivers/platform/x86/intel/pmc/core.c > > b/drivers/platform/x86/intel/pmc/core.c > > index a1fe1e0dcf4a5..834f0352c0edf 100644 > > --- a/drivers/platform/x86/intel/pmc/core.c > > +++ b/drivers/platform/x86/intel/pmc/core.c > > @@ -2125,7 +2125,7 @@ static __maybe_unused int pmc_core_resume(struct > > device *dev) > > } > > > > /* The real interesting case - S0ix failed - lets ask PMC why. */ > > - dev_warn(dev, "CPU did not enter SLP_S0!!! (S0ix cnt=%llu)\n", > > + dev_WARN(dev, "CPU did not enter SLP_S0!!! (S0ix cnt=%llu)\n", > > pmcdev->s0ix_counter); > > if (pmcdev->map->slps0_dbg_maps) > > pmc_core_slps0_display(pmcdev, dev, NULL); >
On Thu, Oct 27, 2022 at 12:02 PM Rajneesh Bhardwaj <irenic.rajneesh@gmail.com> wrote: > I'd advise against this promotion based on my experience with S0ix entry failures. On Thu, Oct 27, 2022 at 11:40 AM Hans de Goede <hdegoede@redhat.com> wrote: > I'm not a fan of the change you are suggesting here. Thanks everyone for the feedback. Looks like there is consensus that it's not advisable to promote the warning. We will move forward with changes to our monitoring infrastructure instead.
[Public] > -----Original Message----- > From: Sven van Ashbrook <svenva@chromium.org> > Sent: Friday, October 28, 2022 23:12 > To: Rajneesh Bhardwaj <irenic.rajneesh@gmail.com>; Hans de Goede > <hdegoede@redhat.com> > Cc: Limonciello, Mario <Mario.Limonciello@amd.com>; LKML <linux- > kernel@vger.kernel.org>; S-k, Shyam-sundar <Shyam-sundar.S- > k@amd.com>; rrangel@chromium.org; platform-driver- > x86@vger.kernel.org; Rajneesh Bhardwaj <rajneesh.bhardwaj@intel.com>; > Rafael J Wysocki <rjw@rjwysocki.net>; Rajat Jain <rajatja@google.com>; > David E Box <david.e.box@intel.com>; Mark Gross <markgross@kernel.org> > Subject: Re: [PATCH v1] platform/x86: intel_pmc_core: promote S0ix failure > warn() to WARN() > > On Thu, Oct 27, 2022 at 12:02 PM Rajneesh Bhardwaj > <irenic.rajneesh@gmail.com> wrote: > > I'd advise against this promotion based on my experience with S0ix entry > failures. > > On Thu, Oct 27, 2022 at 11:40 AM Hans de Goede <hdegoede@redhat.com> > wrote: > > I'm not a fan of the change you are suggesting here. > > Thanks everyone for the feedback. Looks like there is consensus that it's > not advisable to promote the warning. We will move forward with changes to > our monitoring infrastructure instead. Did you see the idea proposed by David Box to introduce some infrastructure in the kernel for this? Just thinking about it a little bit more, it could be a lot nicer to have something like: /sys/power/suspend_stats/last_hw_deepest_state During the resume process drivers such as amd_pmc and intel_pmc_core could read the appropriate values for the hardware and call a function that would populate it with either a "0" or "1" or maybe even the amount of time spent in that state. We could then retire the debugging messages from both drivers and instead of your infrastructure relying upon scanning logs, userspace could read that sysfs file after every suspend and raise the alarms when it's 0 (or if it's populated with time smaller than X% of the total suspend time).
Hi, On 10/31/22 20:39, Limonciello, Mario wrote: > [Public] > >> -----Original Message----- >> From: Sven van Ashbrook <svenva@chromium.org> >> Sent: Friday, October 28, 2022 23:12 >> To: Rajneesh Bhardwaj <irenic.rajneesh@gmail.com>; Hans de Goede >> <hdegoede@redhat.com> >> Cc: Limonciello, Mario <Mario.Limonciello@amd.com>; LKML <linux- >> kernel@vger.kernel.org>; S-k, Shyam-sundar <Shyam-sundar.S- >> k@amd.com>; rrangel@chromium.org; platform-driver- >> x86@vger.kernel.org; Rajneesh Bhardwaj <rajneesh.bhardwaj@intel.com>; >> Rafael J Wysocki <rjw@rjwysocki.net>; Rajat Jain <rajatja@google.com>; >> David E Box <david.e.box@intel.com>; Mark Gross <markgross@kernel.org> >> Subject: Re: [PATCH v1] platform/x86: intel_pmc_core: promote S0ix failure >> warn() to WARN() >> >> On Thu, Oct 27, 2022 at 12:02 PM Rajneesh Bhardwaj >> <irenic.rajneesh@gmail.com> wrote: >>> I'd advise against this promotion based on my experience with S0ix entry >> failures. >> >> On Thu, Oct 27, 2022 at 11:40 AM Hans de Goede <hdegoede@redhat.com> >> wrote: >>> I'm not a fan of the change you are suggesting here. >> >> Thanks everyone for the feedback. Looks like there is consensus that it's >> not advisable to promote the warning. We will move forward with changes to >> our monitoring infrastructure instead. > > Did you see the idea proposed by David Box to introduce some infrastructure in > the kernel for this? > > Just thinking about it a little bit more, it could be a lot nicer to have something like: > > /sys/power/suspend_stats/last_hw_deepest_state > > During the resume process drivers such as amd_pmc and intel_pmc_core could > read the appropriate values for the hardware and call a function that would > populate it with either a "0" or "1" or maybe even the amount of time spent in > that state. > > We could then retire the debugging messages from both drivers and instead of > your infrastructure relying upon scanning logs, userspace could read that sysfs > file after every suspend and raise the alarms when it's 0 (or if it's populated with > time smaller than X% of the total suspend time). Something like this does indeed sound like a nice solution for this. Regards, Hans
On Mon, Oct 31, 2022 at 4:15 PM Hans de Goede <hdegoede@redhat.com> wrote: > > Something like this does indeed sound like a nice solution for this. Agreed, much nicer than letting the monitoring infrastructure hunt for certain dmesg strings, for sure. Thank you David for the suggestion, and Mario for the opening ideas.
On 10/31/2022 15:47, Rajat Jain wrote: > Hello, > > On Mon, Oct 31, 2022 at 12:39 PM Limonciello, Mario > <Mario.Limonciello@amd.com <mailto:Mario.Limonciello@amd.com>> wrote: > > > > [Public] > > > > > -----Original Message----- > > > From: Sven van Ashbrook <svenva@chromium.org > <mailto:svenva@chromium.org>> > > > Sent: Friday, October 28, 2022 23:12 > > > To: Rajneesh Bhardwaj <irenic.rajneesh@gmail.com > <mailto:irenic.rajneesh@gmail.com>>; Hans de Goede > > > <hdegoede@redhat.com <mailto:hdegoede@redhat.com>> > > > Cc: Limonciello, Mario <Mario.Limonciello@amd.com > <mailto:Mario.Limonciello@amd.com>>; LKML <linux- > > > kernel@vger.kernel.org <mailto:kernel@vger.kernel.org>>; S-k, > Shyam-sundar <Shyam-sundar.S- > > > k@amd.com <mailto:k@amd.com>>; rrangel@chromium.org > <mailto:rrangel@chromium.org>; platform-driver- > > > x86@vger.kernel.org <mailto:x86@vger.kernel.org>; Rajneesh Bhardwaj > <rajneesh.bhardwaj@intel.com <mailto:rajneesh.bhardwaj@intel.com>>; > > > Rafael J Wysocki <rjw@rjwysocki.net <mailto:rjw@rjwysocki.net>>; > Rajat Jain <rajatja@google.com <mailto:rajatja@google.com>>; > > > David E Box <david.e.box@intel.com <mailto:david.e.box@intel.com>>; > Mark Gross <markgross@kernel.org <mailto:markgross@kernel.org>> > > > Subject: Re: [PATCH v1] platform/x86: intel_pmc_core: promote S0ix > failure > > > warn() to WARN() > > > > > > On Thu, Oct 27, 2022 at 12:02 PM Rajneesh Bhardwaj > > > <irenic.rajneesh@gmail.com <mailto:irenic.rajneesh@gmail.com>> wrote: > > > > I'd advise against this promotion based on my experience with > S0ix entry > > > failures. > > > > > > On Thu, Oct 27, 2022 at 11:40 AM Hans de Goede <hdegoede@redhat.com > <mailto:hdegoede@redhat.com>> > > > wrote: > > > > I'm not a fan of the change you are suggesting here. > > > > > > Thanks everyone for the feedback. Looks like there is consensus > that it's > > > not advisable to promote the warning. We will move forward with > changes to > > > our monitoring infrastructure instead. > > > > Did you see the idea proposed by David Box to introduce some > infrastructure in > > the kernel for this? > > > > Just thinking about it a little bit more, it could be a lot nicer to > have something like: > > > > /sys/power/suspend_stats/last_hw_deepest_state > > > > During the resume process drivers such as amd_pmc and intel_pmc_core > could > > read the appropriate values for the hardware and call a function that > would > > populate it with either a "0" or "1" or maybe even the amount of time > spent in > > that state. > > > > We could then retire the debugging messages from both drivers > > I do not think we should retire the debug messages. The sysfs > attribute could help us *trigger* a failure detection, but we would > still need these debug logs to actually determine why exactly we did > not go into the S0ix / deepest power state (And the debug messages > print out the debug register bit fields that let us know that). > > Thanks, > I just spun together an RFC series for this idea and while doing it I had the same realization. So I left the warning messages in place for both drivers. You can take a look at the series here: https://lore.kernel.org/platform-driver-x86/20221031204320.22464-1-mario.limonciello@amd.com/T/#m6c7db55c98b8a3ce8c48d451fc01c1d9b0df37fb Thanks,
On Mon, Oct 31, 2022 at 3:39 PM Limonciello, Mario <Mario.Limonciello@amd.com> wrote: > > Just thinking about it a little bit more, it could be a lot nicer to have something like: > > /sys/power/suspend_stats/last_hw_deepest_state While I agree that reporting through a framework is generally better than getting infrastructure to grep for specific strings, I believe that a simple sysfs file is probably too simplistic. 1. We need more sophisticated reporting than just last_hw_deepest_state: - sometimes the system enters the deep state we want, yet after a while moves back up and gets "stuck" in an intermediate state (below S0). Or, the system enters the deep state we want, but moves back to S0 after a time without apparent reason. These platform-dependent failures are not so easily describable in a generic framework. - ChromeOS in particular has multiple independent S0ix / S3 / s2idle failure report sources. We have the kernel warning above; also our Embedded Controller monitors suspend failure cases which the simple kernel warning cannot catch, reported through a separate WARN_ONCE(). 2. A simple sysfs file will need to be polled by the infrastructure after every suspend; it would be preferable to have some signal or callback which the infrastructure could register itself with. The generic infrastructure to support this sounds like quite a bit of work, and for what gain? Compared to simply matching a log string and sending the whole dmesg if there's a match. Is the light worth the candle? Sven
On 10/31/22 20:38, Sven van Ashbrook wrote: > On Mon, Oct 31, 2022 at 3:39 PM Limonciello, Mario > <Mario.Limonciello@amd.com> wrote: >> >> Just thinking about it a little bit more, it could be a lot nicer to have something like: >> >> /sys/power/suspend_stats/last_hw_deepest_state > > While I agree that reporting through a framework is generally better > than getting infrastructure to grep for specific strings, I believe > that a simple sysfs file is probably too simplistic. > > 1. We need more sophisticated reporting than just last_hw_deepest_state: > > - sometimes the system enters the deep state we want, yet after a > while moves back up and gets "stuck" in an intermediate state (below > S0). Or, the system enters the deep state we want, but moves back to > S0 after a time without apparent reason. These platform-dependent > failures are not so easily describable in a generic framework. I actually thought that by putting the duration of time put in last_hw_deepest_state you'll be able to catch this by comparing the duration of the suspend to the duration of last_hw_deepest_state. If you're below some threshold of percent for suspends that are at least some other threshold long you can trigger the failure. This then lets you tune your framework to find the right place for those thresholds too without needing to change the kernel. > > - ChromeOS in particular has multiple independent S0ix / S3 / s2idle > failure report sources. We have the kernel warning above; also our > Embedded Controller monitors suspend failure cases which the simple > kernel warning cannot catch, reported through a separate WARN_ONCE(). > > 2. A simple sysfs file will need to be polled by the infrastructure > after every suspend; it would be preferable to have some signal or > callback which the infrastructure could register itself with. The interface to trigger a suspend is writing a value into /sys/power/state. You'll get a return code from this, but this return code does not represent whether you got to the deepest state, just whether the suspend succeeded or not. So what would an ideal interface that sends a signal that the last "successful" suspend didn't get to the deepest state look like to you? > > The generic infrastructure to support this sounds like quite a bit of > work, and for what gain? Compared to simply matching a log string and > sending the whole dmesg if there's a match. I would like to think it's cheaper to read the sysfs file, do a local comparison on HW deepest time to the suspend time and then only send the the dmesg up for further analysis. > > Is the light worth the candle? I wrote an RFC that I sent out for it with my ideas at least.
On Mon, Oct 31, 2022 at 9:58 PM Mario Limonciello <mario.limonciello@amd.com> wrote: > > I actually thought that by putting the duration of time put in > last_hw_deepest_state you'll be able to catch this by comparing the > duration of the suspend to the duration of last_hw_deepest_state. I'm not sure if this method would catch all or even most suspend failures. That's why the EC monitoring of S0ix was devised. I will circulate this internally, see what comes back. > > > > Is the light worth the candle? > > I wrote an RFC that I sent out for it with my ideas at least. > That is much appreciated ! Yet even for good ideas, it's often necessary to weigh the benefits and downsides of the intervention. Perhaps we can get some pros/cons feedback from other stakeholders ? Sven
On Mon, Oct 31, 2022 at 4:55 PM Limonciello, Mario <mario.limonciello@amd.com> wrote: > > I just spun together an RFC series for this idea and while doing it I > had the same realization. So I left the warning messages in place for > both drivers. > > You can take a look at the series here: > > https://lore.kernel.org/platform-driver-x86/20221031204320.22464-1-mario.limonciello@amd.com/T/#m6c7db55c98b8a3ce8c48d451fc01c1d9b0df37fb > We've had some internal discussions within ChromeOS intel big core, and we believe this is a worthwhile effort, and we are supportive, as long as our current S0ix fail detection will not break for the foreseeable future, i.e. as long as the warning message and register dump stays in place. Which is the case for your RFC. +swboyd@chromium.org who expressed interest in doing something similar for ARM.
On Tue, 2022-11-01 at 13:24 -0400, Sven van Ashbrook wrote: > On Mon, Oct 31, 2022 at 4:55 PM Limonciello, Mario > <mario.limonciello@amd.com> wrote: > > I just spun together an RFC series for this idea and while doing it I > > had the same realization. So I left the warning messages in place for > > both drivers. > > > > You can take a look at the series here: > > > > https://lore.kernel.org/platform-driver-x86/20221031204320.22464-1-mario.limonciello@amd.com/T/#m6c7db55c98b8a3ce8c48d451fc01c1d9b0df37fb > > > > We've had some internal discussions within ChromeOS intel big core, > and we believe this is a worthwhile effort, and we are supportive, as > long as our current S0ix fail detection will not break for the > foreseeable future, i.e. as long as the warning message and register > dump stays in place. Which is the case for your RFC. Yeah, I did not see this as a replacement for anything in the pmc drivers. Given the prevalence of S0ix, and that hardware based low power idle support is indicated in the FADT (so part of the standard) it makes sense to have it tracked by the suspend core, particularly when it's being used as a replacement for S3. We don't need to collect any implementation or debug details there. Only detect when it's available, being used for suspend, and being achieved. Maybe residency information as well if available but that's it. Other information is separate and should be contained to the individual drivers which have the detailed platform knowledge. David > > +swboyd@chromium.org who expressed interest in doing something similar for > ARM.
diff --git a/drivers/platform/x86/intel/pmc/core.c b/drivers/platform/x86/intel/pmc/core.c index a1fe1e0dcf4a5..834f0352c0edf 100644 --- a/drivers/platform/x86/intel/pmc/core.c +++ b/drivers/platform/x86/intel/pmc/core.c @@ -2125,7 +2125,7 @@ static __maybe_unused int pmc_core_resume(struct device *dev) } /* The real interesting case - S0ix failed - lets ask PMC why. */ - dev_warn(dev, "CPU did not enter SLP_S0!!! (S0ix cnt=%llu)\n", + dev_WARN(dev, "CPU did not enter SLP_S0!!! (S0ix cnt=%llu)\n", pmcdev->s0ix_counter); if (pmcdev->map->slps0_dbg_maps) pmc_core_slps0_display(pmcdev, dev, NULL);