Message ID | 20230803171233.3810944-3-alex.williamson@redhat.com |
---|---|
State | New |
Headers |
From: Alex Williamson <alex.williamson@redhat.com> To: bhelgaas@google.com Cc: Alex Williamson <alex.williamson@redhat.com>, linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, eric.auger@redhat.com Subject: [PATCH v2 2/2] PCI: Fix runtime PM race with PME polling Date: Thu, 3 Aug 2023 11:12:33 -0600 Message-Id: <20230803171233.3810944-3-alex.williamson@redhat.com> In-Reply-To: <20230803171233.3810944-1-alex.williamson@redhat.com> References: <20230803171233.3810944-1-alex.williamson@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org |
Series |
PCI: Protect VPD and PME accesses from power management
|
|
Commit Message
Alex Williamson
Aug. 3, 2023, 5:12 p.m. UTC
Testing that a device is not currently in a low power state provides no
guarantees that the device is not imminently transitioning to such a state.
We need to increment the PM usage counter before accessing the device.
Since we don't wish to wake the device for PME polling, do so only if the
device is already active by using pm_runtime_get_if_active().
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
drivers/pci/pci.c | 23 ++++++++++++++++-------
1 file changed, 16 insertions(+), 7 deletions(-)
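
As a rough userspace illustration of the check described in the commit message above, the sketch below models a "take a reference only if already active" helper and the poll step built on it. Everything prefixed `model_` is hypothetical and exists only for this illustration; it is not kernel code, and the real `pm_runtime_get_if_active()` does its work under the device's power lock, which this single-threaded model leaves out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Userspace model only: names echo the kernel's runtime PM states. */
enum model_rpm_status {
	MODEL_RPM_ACTIVE,
	MODEL_RPM_RESUMING,
	MODEL_RPM_SUSPENDED,
	MODEL_RPM_SUSPENDING,
};

struct model_dev {
	bool rpm_enabled;		/* is runtime PM enabled for the device? */
	enum model_rpm_status status;
	int usage_count;
	bool d3cold;			/* config space unreachable when true */
};

/* Returns 1 with a reference held, 0 if not active, -EINVAL if PM is off. */
static int model_get_if_active(struct model_dev *dev)
{
	if (!dev->rpm_enabled)
		return -EINVAL;
	if (dev->status != MODEL_RPM_ACTIVE)
		return 0;
	dev->usage_count++;
	return 1;
}

static void model_put(struct model_dev *dev)
{
	assert(dev->usage_count > 0);
	dev->usage_count--;
}

/* Returns true when the poll (pci_pme_wakeup() in the patch) would run. */
static bool model_poll_one(struct model_dev *dev)
{
	int pm_status = model_get_if_active(dev);
	bool polled = false;

	if (!pm_status)
		return false;		/* not active: skip this round */

	if (!dev->d3cold)
		polled = true;

	if (pm_status > 0)		/* drop only a reference we took */
		model_put(dev);

	return polled;
}
```

In this model a device that is not in the active state is never polled, which is the behaviour the commit message asks for when it says the device should not be woken for PME polling.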
Comments
On Thu, 3 Aug 2023 11:12:33 -0600 Alex Williamson <alex.williamson@redhat.com> wrote:
> Testing that a device is not currently in a low power state provides no
> guarantees that the device is not imminently transitioning to such a state.
> We need to increment the PM usage counter before accessing the device.
> Since we don't wish to wake the device for PME polling, do so only if the
> device is already active by using pm_runtime_get_if_active().
>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  drivers/pci/pci.c | 23 ++++++++++++++++-------
>  1 file changed, 16 insertions(+), 7 deletions(-)

Hey folks,

Resurrecting this patch (currently commit d3fcd7360338) for discussion
as it's been identified as the source of a regression in:

https://bugzilla.kernel.org/show_bug.cgi?id=218360

Copying Mika, Lukas, and Rafael as it's related to:

  000dd5316e1c ("PCI: Do not poll for PME if the device is in D3cold")

where we skip devices in D3cold when processing the PME list.

I think the issue in the above bz is that the downstream TB3/USB4 port
is in D3 (presumably D3hot) and I therefore infer the device is in state
RPM_SUSPENDED.  This commit is attempting to make sure the device power
state is stable across the call such that it does not transition into
D3cold while we're accessing it.

To do that I used pm_runtime_get_if_active(), but in retrospect this
requires the device to be in RPM_ACTIVE so we end up skipping anything
suspended or transitioning.

As reported in the above bz, I tried replacing this with:

	pm_runtime_get_noresume(dev);
	pm_runtime_barrier(dev);

The theory here being that the barrier would wait for any transitioning
states such that as far as runtime power management is concerned, the
device power state is stable.  This causes live locks where the barrier
never returns.

Instead I'm considering that since we're polling the PME list, maybe we
could just defer devices in transition states, for instance something
that looks like pm_runtime_get_if_active(), but would return zero if the
device was in RPM_SUSPENDING or RPM_RESUMING rather than requiring
RPM_ACTIVE.

I'm not an expert in PME or runtime power management though, so I'm
looking for advice.  Thanks,

Alex

>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 60230da957e0..bc266f290b2c 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -2415,10 +2415,13 @@ static void pci_pme_list_scan(struct work_struct *work)
>
>  	mutex_lock(&pci_pme_list_mutex);
>  	list_for_each_entry_safe(pme_dev, n, &pci_pme_list, list) {
> -		if (pme_dev->dev->pme_poll) {
> -			struct pci_dev *bridge;
> +		struct pci_dev *pdev = pme_dev->dev;
> +
> +		if (pdev->pme_poll) {
> +			struct pci_dev *bridge = pdev->bus->self;
> +			struct device *dev = &pdev->dev;
> +			int pm_status;
>
> -			bridge = pme_dev->dev->bus->self;
>  			/*
>  			 * If bridge is in low power state, the
>  			 * configuration space of subordinate devices
> @@ -2426,14 +2429,20 @@ static void pci_pme_list_scan(struct work_struct *work)
>  			 */
>  			if (bridge && bridge->current_state != PCI_D0)
>  				continue;
> +
>  			/*
> -			 * If the device is in D3cold it should not be
> -			 * polled either.
> +			 * If the device is in a low power state it
> +			 * should not be polled either.
>  			 */
> -			if (pme_dev->dev->current_state == PCI_D3cold)
> +			pm_status = pm_runtime_get_if_active(dev, true);
> +			if (!pm_status)
>  				continue;
>
> -			pci_pme_wakeup(pme_dev->dev, NULL);
> +			if (pdev->current_state != PCI_D3cold)
> +				pci_pme_wakeup(pdev, NULL);
> +
> +			if (pm_status > 0)
> +				pm_runtime_put(dev);
>  		} else {
>  			list_del(&pme_dev->list);
>  			kfree(pme_dev);
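
The idea floated above, a helper that behaves like pm_runtime_get_if_active() but only refuses devices in a transitional state, might look roughly like the following userspace model. The name `model_get_if_not_transitioning()` and all other `model_` identifiers are invented for this sketch; no such kernel function is implied, and locking is omitted because the model is single-threaded.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Userspace model only: names echo the kernel's runtime PM states. */
enum model_rpm_status {
	MODEL_RPM_ACTIVE,
	MODEL_RPM_RESUMING,
	MODEL_RPM_SUSPENDED,
	MODEL_RPM_SUSPENDING,
};

struct model_dev {
	bool rpm_enabled;
	enum model_rpm_status status;
	int usage_count;
};

/*
 * Hypothetical helper: take a reference when the state is stable
 * (active or suspended), return 0 for the transitional states so the
 * caller can defer, and -EINVAL when runtime PM is not enabled.
 */
static int model_get_if_not_transitioning(struct model_dev *dev)
{
	if (!dev->rpm_enabled)
		return -EINVAL;
	if (dev->status == MODEL_RPM_SUSPENDING ||
	    dev->status == MODEL_RPM_RESUMING)
		return 0;
	dev->usage_count++;
	return 1;
}

static void model_put(struct model_dev *dev)
{
	assert(dev->usage_count > 0);
	dev->usage_count--;
}
```

Compared with a strict "active only" check, a suspended device now yields a reference instead of being skipped, while a device caught mid-transition is simply left for the next polling pass.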
On Mon, 22 Jan 2024 23:17:30 +0100 Lukas Wunner <lukas@wunner.de> wrote:
> On Thu, Jan 18, 2024 at 11:50:49AM -0700, Alex Williamson wrote:
> > On Thu, 3 Aug 2023 11:12:33 -0600 Alex Williamson <alex.williamson@redhat.com> wrote:
> > > Testing that a device is not currently in a low power state provides no
> > > guarantees that the device is not imminently transitioning to such a state.
> > > We need to increment the PM usage counter before accessing the device.
> > > Since we don't wish to wake the device for PME polling, do so only if the
> > > device is already active by using pm_runtime_get_if_active().
> > >
> > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > ---
> > >  drivers/pci/pci.c | 23 ++++++++++++++++-------
> > >  1 file changed, 16 insertions(+), 7 deletions(-)
> >
> > Resurrecting this patch (currently commit d3fcd7360338) for discussion
> > as it's been identified as the source of a regression in:
> >
> > https://bugzilla.kernel.org/show_bug.cgi?id=218360
> >
> > Copying Mika, Lukas, and Rafael as it's related to:
> >
> >   000dd5316e1c ("PCI: Do not poll for PME if the device is in D3cold")
> >
> > where we skip devices in D3cold when processing the PME list.
> >
> > I think the issue in the above bz is that the downstream TB3/USB4 port
> > is in D3 (presumably D3hot) and I therefore infer the device is in state
> > RPM_SUSPENDED.  This commit is attempting to make sure the device power
> > state is stable across the call such that it does not transition into
> > D3cold while we're accessing it.
> >
> > To do that I used pm_runtime_get_if_active(), but in retrospect this
> > requires the device to be in RPM_ACTIVE so we end up skipping anything
> > suspended or transitioning.
>
> How about dropping the calls to pm_runtime_get_if_active() and
> pm_runtime_put() and instead simply do:
>
> 			if (pm_runtime_suspended(&pdev->dev) &&
> 			    pdev->current_state != PCI_D3cold)
> 				pci_pme_wakeup(pdev, NULL);

Hi Lukas,

Do we require that the polled device is in the RPM_SUSPENDED state?
Also, pm_runtime_suspended() can only be trusted while holding the
device power.lock; we need a usage count reference to maintain that
state.

I'm also seeing cases where the bridge is power state D0, but PM state
RPM_SUSPENDING, so config space of the polled device becomes
inaccessible even while we're holding a reference once we allow polling
in RPM_SUSPENDED.

I'm currently working with the below patch, which I believe addresses
all these issues, but I'd welcome review and testing.  Thanks,

Alex

commit 0a063b8e91d0bc807db712c81c8b270864f99ecb
Author: Alex Williamson <alex.williamson@redhat.com>
Date:   Tue Jan 16 13:28:33 2024 -0700

    PCI: Fix active state requirement in PME polling

    The commit noted in fixes added a bogus requirement that runtime PM
    managed devices need to be in the RPM_ACTIVE state for PME polling.
    In fact, there is no requirement of a specific runtime PM state, it
    is only required that the state is stable such that testing config
    space availability, i.e. !D3cold, remains valid across the PME wakeup.

    To that effect, defer polling of runtime PM managed devices that are
    not in either the RPM_ACTIVE or RPM_SUSPENDED states.  Devices in
    transitional states remain on the pci_pme_list and will be re-queued.

    However in allowing polling of devices in the RPM_SUSPENDED state,
    the bridge state requires further refinement as it's possible to poll
    while the bridge is in D0, but the runtime PM state is RPM_SUSPENDING.
    An asynchronous completion of the bridge transition to a low power
    state can make config space of the subordinate device become
    unavailable.  A runtime PM reference to the bridge is therefore added
    with a supplementary requirement that the bridge is in the RPM_ACTIVE
    state.

    Fixes: d3fcd7360338 ("PCI: Fix runtime PM race with PME polling")
    Reported-by: Sanath S <sanath.s@amd.com>
    Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218360
    Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index bdbf8a94b4d0..31dbf1834b07 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2433,29 +2433,45 @@ static void pci_pme_list_scan(struct work_struct *work)
 		if (pdev->pme_poll) {
 			struct pci_dev *bridge = pdev->bus->self;
 			struct device *dev = &pdev->dev;
-			int pm_status;
+			struct device *bdev = bridge ? &bridge->dev : NULL;

 			/*
-			 * If bridge is in low power state, the
-			 * configuration space of subordinate devices
-			 * may be not accessible
+			 * If we have a bridge, it should be in an active/D0
+			 * state or the configuration space of subordinate
+			 * devices may not be accessible.
 			 */
-			if (bridge && bridge->current_state != PCI_D0)
-				continue;
+			if (bdev) {
+				spin_lock_irq(&bdev->power.lock);
+				if (!pm_runtime_active(bdev) ||
+				    bridge->current_state != PCI_D0) {
+					spin_unlock_irq(&bdev->power.lock);
+					continue;
+				}
+				pm_runtime_get_noresume(bdev);
+				spin_unlock_irq(&bdev->power.lock);
+			}

 			/*
-			 * If the device is in a low power state it
-			 * should not be polled either.
+			 * The device itself may be either in active or
+			 * suspended state, but must not be in D3cold so
+			 * that configuration space is accessible.  The
+			 * transitional resuming and suspending states are
+			 * skipped to avoid D3cold races.
 			 */
-			pm_status = pm_runtime_get_if_active(dev, true);
-			if (!pm_status)
-				continue;
-
-			if (pdev->current_state != PCI_D3cold)
+			spin_lock_irq(&dev->power.lock);
+			if ((pm_runtime_active(dev) ||
+			     pm_runtime_suspended(dev)) &&
+			    pdev->current_state != PCI_D3cold) {
+				pm_runtime_get_noresume(dev);
+				spin_unlock_irq(&dev->power.lock);
 				pci_pme_wakeup(pdev, NULL);
-
-			if (pm_status > 0)
 				pm_runtime_put(dev);
+			} else {
+				spin_unlock_irq(&dev->power.lock);
+			}
+
+			if (bdev)
+				pm_runtime_put(bdev);
 		} else {
 			list_del(&pme_dev->list);
 			kfree(pme_dev);
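
To make the decision logic of the patch above easier to follow, here is a small single-threaded userspace model of one loop iteration. It is only a sketch: all `model_` names are hypothetical, the power.lock sections are reduced to comments, and the case where runtime PM is disabled for a device is deliberately not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: names echo the kernel's states. */
enum model_rpm_status {
	MODEL_RPM_ACTIVE,
	MODEL_RPM_RESUMING,
	MODEL_RPM_SUSPENDED,
	MODEL_RPM_SUSPENDING,
};

enum model_pci_state { MODEL_D0, MODEL_D3HOT, MODEL_D3COLD };

struct model_dev {
	enum model_rpm_status status;
	enum model_pci_state state;
	int usage_count;
};

static void model_put(struct model_dev *dev)
{
	assert(dev->usage_count > 0);
	dev->usage_count--;
}

/* Returns true when the device would be polled; bridge may be NULL. */
static bool model_poll_entry(struct model_dev *dev, struct model_dev *bridge)
{
	bool polled = false;

	if (bridge) {
		/* The patch does this check-and-get under the bridge's power.lock. */
		if (bridge->status != MODEL_RPM_ACTIVE ||
		    bridge->state != MODEL_D0)
			return false;	/* defer: bridge not stably in D0 */
		bridge->usage_count++;
	}

	/* The patch does this check-and-get under the device's power.lock. */
	if ((dev->status == MODEL_RPM_ACTIVE ||
	     dev->status == MODEL_RPM_SUSPENDED) &&
	    dev->state != MODEL_D3COLD) {
		dev->usage_count++;
		polled = true;		/* stands in for pci_pme_wakeup() */
		model_put(dev);
	}

	if (bridge)
		model_put(bridge);

	return polled;
}
```

The point the model tries to capture is that both references are taken only after the state has been checked, and that every path that takes a reference also drops it, so a skipped device leaves the usage counts unchanged.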
On Mon, 22 Jan 2024 15:50:03 -0700 Alex Williamson <alex.williamson@redhat.com> wrote:
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index bdbf8a94b4d0..31dbf1834b07 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -2433,29 +2433,45 @@ static void pci_pme_list_scan(struct work_struct *work)
>  		if (pdev->pme_poll) {
>  			struct pci_dev *bridge = pdev->bus->self;
>  			struct device *dev = &pdev->dev;
> -			int pm_status;
> +			struct device *bdev = bridge ? &bridge->dev : NULL;
>
>  			/*
> -			 * If bridge is in low power state, the
> -			 * configuration space of subordinate devices
> -			 * may be not accessible
> +			 * If we have a bridge, it should be in an active/D0
> +			 * state or the configuration space of subordinate
> +			 * devices may not be accessible.
>  			 */
> -			if (bridge && bridge->current_state != PCI_D0)
> -				continue;
> +			if (bdev) {
> +				spin_lock_irq(&bdev->power.lock);

With the code as shown here I have one system that seems to be getting
contention when reading the vpd sysfs attribute when the endpoints
(QL41000) are bound to vfio-pci and unused, resulting in the root port
and endpoints being suspended.  A vpd read can take over a minute.
Seems to be resolved changing the above spin_lock to a spin_trylock:

				if (!spin_trylock_irq(&bdev->power.lock))
					continue;

The pm_runtime_barrier() as used in the vpd path can be prone to such
issues, I saw similar in the fix I previously proposed in the bz.

I'll continue to do more testing with this change and hopefully Sanath
can verify this resolves the bug report.  Thanks,

Alex

> +				if (!pm_runtime_active(bdev) ||
> +				    bridge->current_state != PCI_D0) {
> +					spin_unlock_irq(&bdev->power.lock);
> +					continue;
> +				}
> +				pm_runtime_get_noresume(bdev);
> +				spin_unlock_irq(&bdev->power.lock);
> +			}
>
>  			/*
> -			 * If the device is in a low power state it
> -			 * should not be polled either.
> +			 * The device itself may be either in active or
> +			 * suspended state, but must not be in D3cold so
> +			 * that configuration space is accessible.  The
> +			 * transitional resuming and suspending states are
> +			 * skipped to avoid D3cold races.
>  			 */
> -			pm_status = pm_runtime_get_if_active(dev, true);
> -			if (!pm_status)
> -				continue;
> -
> -			if (pdev->current_state != PCI_D3cold)
> +			spin_lock_irq(&dev->power.lock);
> +			if ((pm_runtime_active(dev) ||
> +			     pm_runtime_suspended(dev)) &&
> +			    pdev->current_state != PCI_D3cold) {
> +				pm_runtime_get_noresume(dev);
> +				spin_unlock_irq(&dev->power.lock);
>  				pci_pme_wakeup(pdev, NULL);
> -
> -			if (pm_status > 0)
>  				pm_runtime_put(dev);
> +			} else {
> +				spin_unlock_irq(&dev->power.lock);
> +			}
> +
> +			if (bdev)
> +				pm_runtime_put(bdev);
>  		} else {
>  			list_del(&pme_dev->list);
>  			kfree(pme_dev);
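
The trylock change described above amounts to "if the bridge's lock is busy, treat the bridge as not ready and retry on the next scan". The sketch below models that with a C11 `atomic_flag` standing in for the spinlock. All `model_` names are hypothetical, and interrupt disabling (the `_irq` part of the kernel call) has no counterpart in this userspace model.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Minimal try-lock built on atomic_flag; a stand-in for a spinlock. */
struct model_lock {
	atomic_flag held;
};

static bool model_trylock(struct model_lock *l)
{
	/* test_and_set returns the previous value: false means we got it. */
	return !atomic_flag_test_and_set_explicit(&l->held,
						  memory_order_acquire);
}

static void model_unlock(struct model_lock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

struct model_bridge {
	struct model_lock lock;
	bool active_d0;		/* runtime-active and in D0 */
	int usage_count;
};

/* Returns true with a reference held; false means "defer this device". */
static bool model_bridge_try_get(struct model_bridge *b)
{
	if (!model_trylock(&b->lock))
		return false;	/* contended: do not wait, poll next round */

	if (!b->active_d0) {
		model_unlock(&b->lock);
		return false;
	}

	b->usage_count++;
	model_unlock(&b->lock);
	return true;
}

static void model_bridge_put(struct model_bridge *b)
{
	assert(b->usage_count > 0);
	b->usage_count--;
}
```

The trade-off is that a contended lock costs one polling interval of latency instead of stalling the scan, which fits a best-effort poller better than blocking does.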
On 1/23/2024 10:16 AM, Alex Williamson wrote:
>> +			if (bdev) {
>> +				spin_lock_irq(&bdev->power.lock);
> With the code as shown here I have one system that seems to be getting
> contention when reading the vpd sysfs attribute when the endpoints
> (QL41000) are bound to vfio-pci and unused, resulting in the root port
> and endpoints being suspended.  A vpd read can take over a minute.
> Seems to be resolved changing the above spin_lock to a spin_trylock:
>
> 				if (!spin_trylock_irq(&bdev->power.lock))
> 					continue;
>
> The pm_runtime_barrier() as used in the vpd path can be prone to such
> issues, I saw similar in the fix I previously proposed in the bz.
>
> I'll continue to do more testing with this change and hopefully Sanath
> can verify this resolves the bug report.  Thanks,
>
> Alex

I'll verify it today and let you know the observations.
On Tue, 23 Jan 2024 11:45:19 +0100 Lukas Wunner <lukas@wunner.de> wrote:
> On Mon, Jan 22, 2024 at 03:50:03PM -0700, Alex Williamson wrote:
> > On Mon, 22 Jan 2024 23:17:30 +0100 Lukas Wunner <lukas@wunner.de> wrote:
> > > On Thu, Jan 18, 2024 at 11:50:49AM -0700, Alex Williamson wrote:
> > > > To do that I used pm_runtime_get_if_active(), but in retrospect this
> > > > requires the device to be in RPM_ACTIVE so we end up skipping anything
> > > > suspended or transitioning.
> > >
> > > How about dropping the calls to pm_runtime_get_if_active() and
> > > pm_runtime_put() and instead simply do:
> > >
> > > 			if (pm_runtime_suspended(&pdev->dev) &&
> > > 			    pdev->current_state != PCI_D3cold)
> > > 				pci_pme_wakeup(pdev, NULL);
> >
> > Do we require that the polled device is in the RPM_SUSPENDED state?
>
> If the device is RPM_SUSPENDING, why immediately resume it for polling?
> It's sufficient to poll it the next time around, i.e. 1 second later.
>
> Likewise, if it's already RPM_RESUMING or RPM_ACTIVE anyway, no need
> to poll PME.

I'm clearly not an expert on PME, but this is not obvious to me and
before the commit that went in through this thread, PME wakeup was
triggered regardless of the PM state.  I was trying to restore the
behavior of not requiring a specific PM state other than deferring
polling across transition states.

> This leaves RPM_SUSPENDED as the only state in which it makes sense to
> poll.
>
> > Also, pm_runtime_suspended() can only be trusted while holding the
> > device power.lock; we need a usage count reference to maintain that
> > state.
>
> Why?  Let's say there's a race and the device resumes immediately after
> we call pm_runtime_suspended() here.  So we might call pci_pme_wakeup()
> gratuitously.  So what?  No biggie.

The issue I'm trying to address is that config space of the device can
become inaccessible while calling pci_pme_wakeup() on it, causing a
system fault on some hardware.  So a gratuitous pci_pme_wakeup() can be
detrimental.

We require the device config space to remain accessible, therefore the
instantaneous test against D3cold and that the parent bridge is in D0
is not sufficient.  I see traces where the parent bridge is in D0, but
the PM state is RPM_SUSPENDING and the endpoint device transitions to
D3cold while we're executing pci_pme_wakeup().

Therefore at a minimum, I think we need to enforce that the bridge is
in RPM_ACTIVE and remains in that state across pci_pme_wakeup(), which
means we need to hold a usage count reference, and that usage count
reference must be acquired under power.lock in RPM_ACTIVE state to be
effective.

> > +			if (bdev) {
> > +				spin_lock_irq(&bdev->power.lock);
>
> Hm, I'd expect that lock to be internal to the PM core,
> although there *are* a few stray users outside of it.

Right, there are.  It's possible that if we only need to hold a
reference on the bridge we can abstract this through
pm_runtime_get_if_active(); the semantics worked better to essentially
open code it in this iteration though.  Thanks,

Alex
On Tue, 23 Jan 2024 17:12:39 +0100 Lukas Wunner <lukas@wunner.de> wrote:

> On Tue, Jan 23, 2024 at 08:55:21AM -0700, Alex Williamson wrote:
> > On Tue, 23 Jan 2024 11:45:19 +0100 Lukas Wunner <lukas@wunner.de> wrote:
> > > If the device is RPM_SUSPENDING, why immediately resume it for polling?
> > > It's sufficient to poll it the next time around, i.e. 1 second later.
> > >
> > > Likewise, if it's already RPM_RESUMING or RPM_ACTIVE anyway, no need
> > > to poll PME.
> >
> > I'm clearly not an expert on PME, but this is not obvious to me and
> > before the commit that went in through this thread, PME wakeup was
> > triggered regardless of the PM state.  I was trying to restore the
> > behavior of not requiring a specific PM state other than deferring
> > polling across transition states.
>
> There are broken devices which are incapable of signaling PME.
> As a workaround, the kernel polls these devices once per second.
> The first time the device signals PME, the kernel stops polling
> that particular device because PME is clearly working.
>
> So this is just a best-effort way to support PME for broken devices.
> If it takes a little longer to detect that PME was signaled, it's not
> a big deal.
>
> > The issue I'm trying to address is that config space of the device can
> > become inaccessible while calling pci_pme_wakeup() on it, causing a
> > system fault on some hardware.  So a gratuitous pci_pme_wakeup() can be
> > detrimental.
> >
> > We require the device config space to remain accessible, therefore the
> > instantaneous test against D3cold and that the parent bridge is in D0
> > is not sufficient.  I see traces where the parent bridge is in D0, but
> > the PM state is RPM_SUSPENDING and the endpoint device transitions to
> > D3cold while we're executing pci_pme_wakeup().
>
> We have pci_config_pm_runtime_{get,put}() helpers to ensure the parent
> of a device is in D0 so that the device's config space is accessible.
> So you may need to use that in pci_pme_wakeup().

pci_config_pm_runtime_get() doesn't seem to align with our current
philosophy to defer polling devices that aren't in the correct power
state.  We require the bridge to be in D0, but we defer polling rather
than resume it otherwise.  We also defer device polling if the device
is in D3cold, whereas the above function would resume a device in that
state.

I think our bridge D0 test could be reliable if it were done holding a
reference acquired via pm_runtime_get_if_active().  Thanks,

Alex
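The defer-rather-than-resume policy described above can be sketched as a small user-space model.  This is an illustration under stated assumptions, not the actual pci_pme_list_scan() logic: the struct, the enum, and model_poll_once() are all hypothetical names, the bridge_rpm_active flag stands in for whether a get-if-active style reference on the bridge would succeed, and the wakeups counter stands in for a call to pci_pme_wakeup().

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the PCI power states mentioned in the thread. */
enum model_pci_power { MODEL_D0, MODEL_D3HOT, MODEL_D3COLD };

/* Hypothetical model of one polled device and its parent bridge. */
struct model_poll_target {
	bool has_bridge;        /* false for a device with no parent bridge */
	bool bridge_rpm_active; /* would a get-if-active on the bridge succeed? */
	int bridge_usage;       /* references currently held on the bridge */
	enum model_pci_power dev_state;
	int wakeups;            /* counts simulated wakeup polls */
};

/* Returns true if the device was polled this pass, false if deferred. */
static bool model_poll_once(struct model_poll_target *t)
{
	bool held = false;
	bool polled = false;

	if (t->has_bridge) {
		/* Defer: never resume the bridge just to poll. */
		if (!t->bridge_rpm_active)
			return false;
		t->bridge_usage++; /* hold the bridge active across the poll */
		held = true;
	}

	/* Defer as well if the device itself is in D3cold. */
	if (t->dev_state != MODEL_D3COLD) {
		t->wakeups++;
		polled = true;
	}

	if (held)
		t->bridge_usage--;

	return polled;
}
```

A deferred device is simply picked up on a later pass, which matches the best-effort nature of PME polling described earlier in the thread.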
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 60230da957e0..bc266f290b2c 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2415,10 +2415,13 @@ static void pci_pme_list_scan(struct work_struct *work)
 
 	mutex_lock(&pci_pme_list_mutex);
 	list_for_each_entry_safe(pme_dev, n, &pci_pme_list, list) {
-		if (pme_dev->dev->pme_poll) {
-			struct pci_dev *bridge;
+		struct pci_dev *pdev = pme_dev->dev;
+
+		if (pdev->pme_poll) {
+			struct pci_dev *bridge = pdev->bus->self;
+			struct device *dev = &pdev->dev;
+			int pm_status;
 
-			bridge = pme_dev->dev->bus->self;
 			/*
 			 * If bridge is in low power state, the
 			 * configuration space of subordinate devices
@@ -2426,14 +2429,20 @@ static void pci_pme_list_scan(struct work_struct *work)
 			 */
 			if (bridge && bridge->current_state != PCI_D0)
 				continue;
+
 			/*
-			 * If the device is in D3cold it should not be
-			 * polled either.
+			 * If the device is in a low power state it
+			 * should not be polled either.
 			 */
-			if (pme_dev->dev->current_state == PCI_D3cold)
+			pm_status = pm_runtime_get_if_active(dev, true);
+			if (!pm_status)
 				continue;
 
-			pci_pme_wakeup(pme_dev->dev, NULL);
+			if (pdev->current_state != PCI_D3cold)
+				pci_pme_wakeup(pdev, NULL);
+
+			if (pm_status > 0)
+				pm_runtime_put(dev);
 		} else {
 			list_del(&pme_dev->list);
 			kfree(pme_dev);