[1/2] ata: ahci: Add force LPM policy quirk for ASUS B1400CEAE

Message ID 20240130095933.14158-1-jhp@endlessos.org
State New
Headers
Series [1/2] ata: ahci: Add force LPM policy quirk for ASUS B1400CEAE |

Commit Message

Jian-Hong Pan Jan. 30, 2024, 9:59 a.m. UTC
  Some systems, like ASUS B1400CEAE equipped with the SATA controller
[8086:a0d3] can use LPM policy to save power, especially for s2idle.

However, the same controller may be failed on other platforms. So,
commit (ata: ahci: Revert "ata: ahci: Add Tiger Lake UP{3,4} AHCI
controller") drops LPM policy for [8086:a0d3]. But, this blocks going
to deeper CPU Package C-state when s2idle with enabled Intel VMD.

This patch adds ahci_force_lpm_policy DMI quirk for ASUS B1400CEAE to
fix s2idle's power consumption issue when Intel VMD feature is enabled.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=218394
Signed-off-by: Jian-Hong Pan <jhp@endlessos.org>
---
 drivers/ata/ahci.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)
  

Comments

Mika Westerberg Jan. 30, 2024, 10:13 a.m. UTC | #1
Hi,

On Tue, Jan 30, 2024 at 05:59:33PM +0800, Jian-Hong Pan wrote:
> Some systems, like ASUS B1400CEAE equipped with the SATA controller
> [8086:a0d3] can use LPM policy to save power, especially for s2idle.
> 
> However, the same controller may be failed on other platforms. So,
> commit (ata: ahci: Revert "ata: ahci: Add Tiger Lake UP{3,4} AHCI
> controller") drops LPM policy for [8086:a0d3]. But, this blocks going
> to deeper CPU Package C-state when s2idle with enabled Intel VMD.

Tiger Lake really should support this with no issues (as are the
generations after it). I suggest trying to figure out what was the root
cause of the original problem that triggered the revert, if possible at
all, perhaps it is is something not related to LPM and that would allow
us to enable this unconditionally on all Tiger Lake.

I'm pretty sure the platform where this was reported suffers the same
s2idle issue you are seeing without this patch.
  
Jian-Hong Pan Jan. 31, 2024, 8:56 a.m. UTC | #2
Mika Westerberg <mika.westerberg@linux.intel.com> 於 2024年1月30日 週二 下午6:13寫道:
>
> Hi,
>
> On Tue, Jan 30, 2024 at 05:59:33PM +0800, Jian-Hong Pan wrote:
> > Some systems, like ASUS B1400CEAE equipped with the SATA controller
> > [8086:a0d3] can use LPM policy to save power, especially for s2idle.
> >
> > However, the same controller may be failed on other platforms. So,
> > commit (ata: ahci: Revert "ata: ahci: Add Tiger Lake UP{3,4} AHCI
> > controller") drops LPM policy for [8086:a0d3]. But, this blocks going
> > to deeper CPU Package C-state when s2idle with enabled Intel VMD.
>
> Tiger Lake really should support this with no issues (as are the
> generations after it). I suggest trying to figure out what was the root
> cause of the original problem that triggered the revert, if possible at
> all, perhaps it is is something not related to LPM and that would allow
> us to enable this unconditionally on all Tiger Lake.
>
> I'm pretty sure the platform where this was reported suffers the same
> s2idle issue you are seeing without this patch.

Simply applying LPM policy to [8086:a0d3] sounds like a good idea!

I installed an SATA storage into ASUS B1400CEAE and tested with
enabled & disabled VMD again. Both the NVMe and SATA storage work with
applying LPM policy to [8086:a0d3], which means that it does not hit
the issue mentioned in the commit (ata: ahci: Revert "ata: ahci: Add
Tiger Lake UP{3,4} AHCI controller").

Jian-Hong Pan
  
Niklas Cassel Jan. 31, 2024, 10:57 a.m. UTC | #3
On Wed, Jan 31, 2024 at 04:56:27PM +0800, Jian-Hong Pan wrote:
> Mika Westerberg <mika.westerberg@linux.intel.com> 於 2024年1月30日 週二 下午6:13寫道:
> >
> > Hi,
> >
> > On Tue, Jan 30, 2024 at 05:59:33PM +0800, Jian-Hong Pan wrote:
> > > Some systems, like ASUS B1400CEAE equipped with the SATA controller
> > > [8086:a0d3] can use LPM policy to save power, especially for s2idle.
> > >
> > > However, the same controller may be failed on other platforms. So,
> > > commit (ata: ahci: Revert "ata: ahci: Add Tiger Lake UP{3,4} AHCI
> > > controller") drops LPM policy for [8086:a0d3]. But, this blocks going
> > > to deeper CPU Package C-state when s2idle with enabled Intel VMD.
> >
> > Tiger Lake really should support this with no issues (as are the
> > generations after it). I suggest trying to figure out what was the root
> > cause of the original problem that triggered the revert, if possible at
> > all, perhaps it is is something not related to LPM and that would allow
> > us to enable this unconditionally on all Tiger Lake.
> >
> > I'm pretty sure the platform where this was reported suffers the same
> > s2idle issue you are seeing without this patch.
> 
> Simply applying LPM policy to [8086:a0d3] sounds like a good idea!
> 
> I installed an SATA storage into ASUS B1400CEAE and tested with
> enabled & disabled VMD again. Both the NVMe and SATA storage work with
> applying LPM policy to [8086:a0d3], which means that it does not hit
> the issue mentioned in the commit (ata: ahci: Revert "ata: ahci: Add
> Tiger Lake UP{3,4} AHCI controller").

We would like to enable LPM for Tiger Lake again, but last time we tried
to do so, we got a bunch of reports:
https://bugzilla.kernel.org/show_bug.cgi?id=217114
https://bbs.archlinux.org/viewtopic.php?id=283906

Where people complained that their SATA drive dissappeared as a result.


The bug reports were reported on the following hardware:
ASUS Vivobook 15 X513EAN
Lenovo L3-15ITL6 IdeaPad
Acer A515-56-32DK
Acer A514
HP Pavilion 14-dv0589na

Usually, when a single vendor manages to mess something up in their
platform firmware, and we deal with that by adding a quirk for that
specific vendor.

But here it is four different vendors... This kind of suggests that
there could be a some more fundamental LPM related issue at play here.


Yes, we are fully aware that on some other platforms, not having
LPM enabled stops those platforms from entering the lowest CPU power
states. So right now we are doomed if we do enable LPM, doomed if we
don't.

However, having some systems drawing more power than needed is better
than some other systems not detecting their SATA drives at all.

Unfortunately, we don't have any of these laptops that has a Tiger Lake
AHCI controller (with a disappearing drive), so until someone who does
debugs this, I think we are stuck at status quo.


Kind regards,
Niklas
  
Daniel Drake Jan. 31, 2024, 11:08 a.m. UTC | #4
On Wed, Jan 31, 2024 at 6:57 AM Niklas Cassel <cassel@kernel.org> wrote:
> Unfortunately, we don't have any of these laptops that has a Tiger Lake
> AHCI controller (with a disappearing drive), so until someone who does
> debugs this, I think we are stuck at status quo.

I've asked for volunteers to help test things on those original bug
reports (and may have one already) and would appreciate any suggested
debugging approaches from those more experienced with SATA/AHCI. What
would be your first few suggested debugging steps?

Non-LPM boot:
ata1: SATA max UDMA/133 abar m2048@0x50202000 port 0x50202100 irq 145
ata2: SATA max UDMA/133 abar m2048@0x50202000 port 0x50202180 irq 145
ata2: SATA link down (SStatus 0 SControl 300)
ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)

LPM failed boot:
ata1: SATA max UDMA/133 abar m2048@0x50202000 port 0x50202100 irq 145
ata2: SATA max UDMA/133 abar m2048@0x50202000 port 0x50202180 irq 145
ata1: SATA link down (SStatus 4 SControl 300)
ata2: SATA link down (SStatus 4 SControl 300)

note SStatus=4 on both ports  (means "PHY in offline mode"?)

Daniel
  
Niklas Cassel Jan. 31, 2024, 10:43 p.m. UTC | #5
On Wed, Jan 31, 2024 at 07:08:12AM -0400, Daniel Drake wrote:
> On Wed, Jan 31, 2024 at 6:57 AM Niklas Cassel <cassel@kernel.org> wrote:
> > Unfortunately, we don't have any of these laptops that has a Tiger Lake
> > AHCI controller (with a disappearing drive), so until someone who does
> > debugs this, I think we are stuck at status quo.
> 
> I've asked for volunteers to help test things on those original bug
> reports (and may have one already) and would appreciate any suggested
> debugging approaches from those more experienced with SATA/AHCI. What
> would be your first few suggested debugging steps?
> 
> Non-LPM boot:
> ata1: SATA max UDMA/133 abar m2048@0x50202000 port 0x50202100 irq 145
> ata2: SATA max UDMA/133 abar m2048@0x50202000 port 0x50202180 irq 145
> ata2: SATA link down (SStatus 0 SControl 300)
> ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> 
> LPM failed boot:
> ata1: SATA max UDMA/133 abar m2048@0x50202000 port 0x50202100 irq 145
> ata2: SATA max UDMA/133 abar m2048@0x50202000 port 0x50202180 irq 145
> ata1: SATA link down (SStatus 4 SControl 300)
> ata2: SATA link down (SStatus 4 SControl 300)
> 
> note SStatus=4 on both ports  (means "PHY in offline mode"?)

Hello Daniel, Vitalii,

The attachments that you uploaded to bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=217114
namely dmesg_VMD_off and dmesg_VMD_on:

dmesg_VMD_off:[    1.020080] ahci 0000:00:17.0: AHCI 0001.0301 32 slots 1 ports 6 Gbps 0x1 impl SATA mode
dmesg_VMD_off:[    1.020095] ahci 0000:00:17.0: flags: 64bit ncq sntf pm clo only pio slum part deso sadm sds 
dmesg_VMD_off:[    1.020645] ata1: SATA max UDMA/133 abar m2048@0x80902000 port 0x80902100 irq 123 lpm-pol 0
dmesg_VMD_off:[    1.330090] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)


dmesg_VMD_on:[    0.973901] ahci 10000:e0:17.0: AHCI 0001.0301 32 slots 1 ports 6 Gbps 0x1 impl SATA mode
dmesg_VMD_on:[    0.973904] ahci 10000:e0:17.0: flags: 64bit ncq sntf pm clo only pio slum part deso sadm sds 
dmesg_VMD_on:[    0.974094] ata1: SATA max UDMA/133 abar m2048@0x82102000 port 0x82102100 irq 142 lpm-pol 0
dmesg_VMD_on:[    1.287221] ata1: SATA link down (SStatus 4 SControl 300)


I assume that both of these logs are from the same kernel binary.
Does this kernel binary have the Tiger Lake LPM enablement patch included?
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/patch/?id=104ff59af73aba524e57ae0fef70121643ff270e

If so, since with Intel VMD turned off we can detect the SATA drive,
but with Intel VMD turned on we do not, which strongly suggests that
the problem is related to Intel VMD.



In libata we perform a reset of the port at boot, see:
libata-sata.c:sata_link_hardreset()
after writing to SControl, we call
libata-core.c:ata_wait_ready() that will poll for the port being ready
by calling the check_ready callback.
For AHCI, this callback funcion is set to:
libahci.c:ahci_check_ready().

A reset should take the device out of deep power state and should be
sufficient to establish a connection (and that also seems to be the
case when not using Intel VMD).

However, if you want to debug, I would start by adding prints to
libata-sata.c:sata_link_hardreset()
libata-core.c:ata_wait_ready()
libahci.c:ahci_check_ready().



Vitalii,

I noticed that the prints says "lpm-pol 0"
Do you have:
CONFIG_SATA_MOBILE_LPM_POLICY set to 0?

CONFIG_SATA_MOBILE_LPM_POLICY=0
means do not touch any settings set by firmware.

This means that this code:
https://github.com/torvalds/linux/blob/v6.8-rc2/drivers/ata/libata-eh.c#L3572-L3579
https://github.com/torvalds/linux/blob/v6.8-rc2/drivers/ata/libata-eh.c#L3067-L3072

will not call ata_eh_set_lpm(), which means that ahci_set_lpm() will not
be called, which means that sata_link_scr_lpm() will not be called.

While this shouldn't make a difference, and I didn't check the code
to see where "SATA link up" is printed, but possibly, when using VMD,
perhaps it is so quick to put the device to a deeper power state after
a reset, and with policy == 0, we will never send a ATA_LPM_WAKE_ONLY
that sets PxCMD.ICC to the active state (brings the device out of sleep).

Could you please try to set:
CONFIG_SATA_MOBILE_LPM_POLICY=3
and enable VMD again, and see if that makes you able to detect the SATA
drive even with VMD enabled.


Kind regards,
Niklas
  
Niklas Cassel Feb. 1, 2024, 10:35 a.m. UTC | #6
On Thu, Feb 01, 2024 at 06:50:53AM +0300, Виталий Соломонов wrote:
> Hello Daniel, Niklas
> 
> I assume that both of these logs are from the same kernel binary.
> > Does this kernel binary have the Tiger Lake LPM enablement patch included?
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/patch/?id=104ff59af73aba524e57ae0fef70121643ff270e
> 
> Yes, it's the same kernel binary with applied patch.
> 
> 
> 
> > Could you please try to set:
> > CONFIG_SATA_MOBILE_LPM_POLICY=3
> > and enable VMD again, and see if that makes you able to detect the SATA
> > drive even with VMD enabled.
> >
>  With this policy settings my laptop was able to mount lvm volumes (/ and
> /boot) on SSD (/dev/nvme), and stuck on mounting HDD volumes (/home)
> dmesg: https://bugzilla.kernel.org/attachment.cgi?id=305804
> journalctl: https://bugzilla.kernel.org/attachment.cgi?id=305805

Just to clarify, in this latest log VMD_on_LPM_3_dmesg:
[    0.957793] ahci 10000:e0:17.0: AHCI 0001.0301 32 slots 1 ports 6 Gbps 0x1 impl SATA mode
[    0.957796] ahci 10000:e0:17.0: flags: 64bit ncq sntf pm clo only pio slum part deso sadm sds 
[    0.957982] ata1: SATA max UDMA/133 abar m2048@0x82102000 port 0x82102100 irq 142 lpm-pol 4
[    1.271254] ata1: SATA link down (SStatus 4 SControl 300)

We still don't get any link up, so LPM policy == 3 did no difference.

I guess someone at Intel needs to debug why we don't get a link up with
VMD enabled, since AFAICT, having VMD enabled is like adding an extra HW layer:
libata -> Intel VMD -> Intel AHCI controller
instead of the normal
libata -> Intel AHCI controller

Since it works when talking to the AHCI controller directly, the question
is if it is VMD not following the spec, or if it libata that is not following
the spec (and we are just lucky that it works when talking to the Intel AHCI
controller directly).


Kind regards,
Niklas
  
Niklas Cassel Feb. 1, 2024, 3:01 p.m. UTC | #7
On Wed, Jan 31, 2024 at 11:43:59PM +0100, Niklas Cassel wrote:
> On Wed, Jan 31, 2024 at 07:08:12AM -0400, Daniel Drake wrote:

(snip)

> In libata we perform a reset of the port at boot, see:
> libata-sata.c:sata_link_hardreset()
> after writing to SControl, we call
> libata-core.c:ata_wait_ready() that will poll for the port being ready
> by calling the check_ready callback.
> For AHCI, this callback funcion is set to:
> libahci.c:ahci_check_ready().
> 
> A reset should take the device out of deep power state and should be
> sufficient to establish a connection (and that also seems to be the
> case when not using Intel VMD).
> 
> However, if you want to debug, I would start by adding prints to
> libata-sata.c:sata_link_hardreset()
> libata-core.c:ata_wait_ready()
> libahci.c:ahci_check_ready().

FWIW, this will dump SStatus.DET every time the check_ready callback function
(ahci_check_ready()) is called:


diff --git a/drivers/ata/libahci.c b/drivers/ata/libahci.c
index 1a63200ea437..0467e150601e 100644
--- a/drivers/ata/libahci.c
+++ b/drivers/ata/libahci.c
@@ -1533,6 +1533,12 @@ int ahci_check_ready(struct ata_link *link)
 {
        void __iomem *port_mmio = ahci_port_base(link->ap);
        u8 status = readl(port_mmio + PORT_TFDATA) & 0xFF;
+       u32 cur = 0;
+
+       sata_scr_read(link, SCR_STATUS, &cur);
+
+       ata_link_info(link, "BUSY ? %d (status: %#x) SStatus.DET: %#x\n",
+                     status & ATA_BUSY, status, cur & 0xf);
 
        return ata_check_ready(status);
 }
  
Jian-Hong Pan Feb. 2, 2024, 8:49 a.m. UTC | #8
Niklas Cassel <cassel@kernel.org> 於 2024年2月1日 週四 下午11:01寫道:
>
> On Wed, Jan 31, 2024 at 11:43:59PM +0100, Niklas Cassel wrote:
> > On Wed, Jan 31, 2024 at 07:08:12AM -0400, Daniel Drake wrote:
>
> (snip)
>
> > In libata we perform a reset of the port at boot, see:
> > libata-sata.c:sata_link_hardreset()
> > after writing to SControl, we call
> > libata-core.c:ata_wait_ready() that will poll for the port being ready
> > by calling the check_ready callback.
> > For AHCI, this callback funcion is set to:
> > libahci.c:ahci_check_ready().
> >
> > A reset should take the device out of deep power state and should be
> > sufficient to establish a connection (and that also seems to be the
> > case when not using Intel VMD).
> >
> > However, if you want to debug, I would start by adding prints to
> > libata-sata.c:sata_link_hardreset()
> > libata-core.c:ata_wait_ready()
> > libahci.c:ahci_check_ready().
>
> FWIW, this will dump SStatus.DET every time the check_ready callback function
> (ahci_check_ready()) is called:
>
>
> diff --git a/drivers/ata/libahci.c b/drivers/ata/libahci.c
> index 1a63200ea437..0467e150601e 100644
> --- a/drivers/ata/libahci.c
> +++ b/drivers/ata/libahci.c
> @@ -1533,6 +1533,12 @@ int ahci_check_ready(struct ata_link *link)
>  {
>         void __iomem *port_mmio = ahci_port_base(link->ap);
>         u8 status = readl(port_mmio + PORT_TFDATA) & 0xFF;
> +       u32 cur = 0;
> +
> +       sata_scr_read(link, SCR_STATUS, &cur);
> +
> +       ata_link_info(link, "BUSY ? %d (status: %#x) SStatus.DET: %#x\n",
> +                     status & ATA_BUSY, status, cur & 0xf);
>
>         return ata_check_ready(status);
>  }

I think I can join the test based on kernel v6.8-rc2, too.

The original ASUS B1400CEAE has only one NVMe SSD.  I prepare the
patch ("ata: ahci: Add force LPM policy quirk for ASUS B1400CEAE") to
fix the power consumption issue for s2idle with enabled VMD.

The patch is a quirk limiting ASUS B1400CEAE only, not generic for the
SATA controller [8086:a0d3].  Then, I install another SATA HDD into
ASUS B1400CEAE for test.  The SATA HDD shows up and works.

$ dmesg | grep SATA
[    0.785120] ahci 10000:e0:17.0: AHCI 0001.0301 32 slots 1 ports 6
Gbps 0x1 impl SATA mode
[    0.785269] ata1: SATA max UDMA/133 abar m2048@0x76102000 port
0x76102100 irq 144 lpm-pol 3
[    1.096684] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)

However, if I simply revert the commit 6210038aeaf4 ("ata: ahci:
Revert "ata: ahci: Add Tiger Lake UP{3,4} AHCI controller"") (fix the
conflict, of course), then the SATA HDD disappears!!?  Both
CONFIG_SATA_MOBILE_LPM_POLICY=3 and 0 can reproduce the issue.

$ dmesg | grep SATA
[    0.783211] ahci 10000:e0:17.0: AHCI 0001.0301 32 slots 1 ports 6
Gbps 0x1 impl SATA mode
[    0.783399] ata1: SATA max UDMA/133 abar m2048@0x76102000 port
0x76102100 irq 144 lpm-pol 3
[    1.096685] ata1: SATA link down (SStatus 4 SControl 300)

Here is the dmesg of reverting ("ata: ahci: Revert "ata: ahci: Add
Tiger Lake UP{3,4} AHCI controller"")
https://bugzilla.kernel.org/show_bug.cgi?id=217114#c27
The code already includes the debug message in ahci_check_ready() from
Niklas.  However, the dmesg does not show the "BUSY ? ..." from
ahci_check_ready().

From these scenarios mentioned above, they all apply LPM policy to the
SATA controller [8086:a0d3].  But, they apply LPM policy at different
time:
* The patch ("ata: ahci: Add force LPM policy quirk for ASUS
B1400CEAE") applies LPM policy in early ahci_init_one(), which is the
probe callback.
* Reverting 6210038aeaf4 ("ata: ahci: Revert "ata: ahci: Add Tiger
Lake UP{3,4} AHCI controller"") applies LPM policy via "ahci_pci_tbl"
table.

Jian-Hong Pan
  
Niklas Cassel Feb. 5, 2024, 11:33 a.m. UTC | #9
On Fri, Feb 02, 2024 at 04:49:00PM +0800, Jian-Hong Pan wrote:
> Niklas Cassel <cassel@kernel.org> 於 2024年2月1日 週四 下午11:01寫道:
> >
> > On Wed, Jan 31, 2024 at 11:43:59PM +0100, Niklas Cassel wrote:
> > > On Wed, Jan 31, 2024 at 07:08:12AM -0400, Daniel Drake wrote:
> >
> > (snip)
> >
> > > In libata we perform a reset of the port at boot, see:
> > > libata-sata.c:sata_link_hardreset()
> > > after writing to SControl, we call
> > > libata-core.c:ata_wait_ready() that will poll for the port being ready
> > > by calling the check_ready callback.
> > > For AHCI, this callback funcion is set to:
> > > libahci.c:ahci_check_ready().
> > >
> > > A reset should take the device out of deep power state and should be
> > > sufficient to establish a connection (and that also seems to be the
> > > case when not using Intel VMD).
> > >
> > > However, if you want to debug, I would start by adding prints to
> > > libata-sata.c:sata_link_hardreset()
> > > libata-core.c:ata_wait_ready()
> > > libahci.c:ahci_check_ready().
> >
> > FWIW, this will dump SStatus.DET every time the check_ready callback function
> > (ahci_check_ready()) is called:
> >
> >
> > diff --git a/drivers/ata/libahci.c b/drivers/ata/libahci.c
> > index 1a63200ea437..0467e150601e 100644
> > --- a/drivers/ata/libahci.c
> > +++ b/drivers/ata/libahci.c
> > @@ -1533,6 +1533,12 @@ int ahci_check_ready(struct ata_link *link)
> >  {
> >         void __iomem *port_mmio = ahci_port_base(link->ap);
> >         u8 status = readl(port_mmio + PORT_TFDATA) & 0xFF;
> > +       u32 cur = 0;
> > +
> > +       sata_scr_read(link, SCR_STATUS, &cur);
> > +
> > +       ata_link_info(link, "BUSY ? %d (status: %#x) SStatus.DET: %#x\n",
> > +                     status & ATA_BUSY, status, cur & 0xf);
> >
> >         return ata_check_ready(status);
> >  }
> 
> I think I can join the test based on kernel v6.8-rc2, too.
> 
> The original ASUS B1400CEAE has only one NVMe SSD.  I prepare the
> patch ("ata: ahci: Add force LPM policy quirk for ASUS B1400CEAE") to
> fix the power consumption issue for s2idle with enabled VMD.
> 
> The patch is a quirk limiting ASUS B1400CEAE only, not generic for the
> SATA controller [8086:a0d3].  Then, I install another SATA HDD into
> ASUS B1400CEAE for test.  The SATA HDD shows up and works.
> 
> $ dmesg | grep SATA
> [    0.785120] ahci 10000:e0:17.0: AHCI 0001.0301 32 slots 1 ports 6
> Gbps 0x1 impl SATA mode
> [    0.785269] ata1: SATA max UDMA/133 abar m2048@0x76102000 port
> 0x76102100 irq 144 lpm-pol 3
> [    1.096684] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> 
> However, if I simply revert the commit 6210038aeaf4 ("ata: ahci:
> Revert "ata: ahci: Add Tiger Lake UP{3,4} AHCI controller"") (fix the
> conflict, of course), then the SATA HDD disappears!!?  Both
> CONFIG_SATA_MOBILE_LPM_POLICY=3 and 0 can reproduce the issue.
> 
> $ dmesg | grep SATA
> [    0.783211] ahci 10000:e0:17.0: AHCI 0001.0301 32 slots 1 ports 6
> Gbps 0x1 impl SATA mode
> [    0.783399] ata1: SATA max UDMA/133 abar m2048@0x76102000 port
> 0x76102100 irq 144 lpm-pol 3
> [    1.096685] ata1: SATA link down (SStatus 4 SControl 300)
> 
> Here is the dmesg of reverting ("ata: ahci: Revert "ata: ahci: Add
> Tiger Lake UP{3,4} AHCI controller"")
> https://bugzilla.kernel.org/show_bug.cgi?id=217114#c27
> The code already includes the debug message in ahci_check_ready() from
> Niklas.  However, the dmesg does not show the "BUSY ? ..." from
> ahci_check_ready().
> 
> From these scenarios mentioned above, they all apply LPM policy to the
> SATA controller [8086:a0d3].  But, they apply LPM policy at different
> time:
> * The patch ("ata: ahci: Add force LPM policy quirk for ASUS
> B1400CEAE") applies LPM policy in early ahci_init_one(), which is the
> probe callback.
> * Reverting 6210038aeaf4 ("ata: ahci: Revert "ata: ahci: Add Tiger
> Lake UP{3,4} AHCI controller"") applies LPM policy via "ahci_pci_tbl"
> table.

I don't see why it should matter if we set the AHCI_HFLAG_USE_LPM_POLICY
flag using ahci_pci_tbl, or by your suggested quirk in ahci_init_one(),
as in both cases the flag will be set before ahci_init_one() calls
ahci_update_initial_lpm_policy().


Could it perhaps be that in order for libata to be able to detect your
drive, when VMD is enabled, we also need your patch
"PCI: vmd: enable PCI PM's L1 substates of remapped PCIe port and NVMe" ?


If that is not the case, and there actually is a difference between using
ahci_pci_tbl and your suggested quirk, then my next suggestion would be to
add prints to libata-sata.c:sata_link_scr_lpm(). That way you can dump the
exact SCR writes that are being done for the working case vs. the
non-working case. (Since I assume that there must be some difference.)


Kind regards,
Niklas
  
Jian-Hong Pan Feb. 6, 2024, 8:39 a.m. UTC | #10
Niklas Cassel <cassel@kernel.org> 於 2024年2月5日 週一 下午7:33寫道:
>
> On Fri, Feb 02, 2024 at 04:49:00PM +0800, Jian-Hong Pan wrote:
> > Niklas Cassel <cassel@kernel.org> 於 2024年2月1日 週四 下午11:01寫道:
> > >
> > > On Wed, Jan 31, 2024 at 11:43:59PM +0100, Niklas Cassel wrote:
> > > > On Wed, Jan 31, 2024 at 07:08:12AM -0400, Daniel Drake wrote:
> > >
> > > (snip)
> > >
> > > > In libata we perform a reset of the port at boot, see:
> > > > libata-sata.c:sata_link_hardreset()
> > > > after writing to SControl, we call
> > > > libata-core.c:ata_wait_ready() that will poll for the port being ready
> > > > by calling the check_ready callback.
> > > > For AHCI, this callback funcion is set to:
> > > > libahci.c:ahci_check_ready().
> > > >
> > > > A reset should take the device out of deep power state and should be
> > > > sufficient to establish a connection (and that also seems to be the
> > > > case when not using Intel VMD).
> > > >
> > > > However, if you want to debug, I would start by adding prints to
> > > > libata-sata.c:sata_link_hardreset()
> > > > libata-core.c:ata_wait_ready()
> > > > libahci.c:ahci_check_ready().
> > >
> > > FWIW, this will dump SStatus.DET every time the check_ready callback function
> > > (ahci_check_ready()) is called:
> > >
> > >
> > > diff --git a/drivers/ata/libahci.c b/drivers/ata/libahci.c
> > > index 1a63200ea437..0467e150601e 100644
> > > --- a/drivers/ata/libahci.c
> > > +++ b/drivers/ata/libahci.c
> > > @@ -1533,6 +1533,12 @@ int ahci_check_ready(struct ata_link *link)
> > >  {
> > >         void __iomem *port_mmio = ahci_port_base(link->ap);
> > >         u8 status = readl(port_mmio + PORT_TFDATA) & 0xFF;
> > > +       u32 cur = 0;
> > > +
> > > +       sata_scr_read(link, SCR_STATUS, &cur);
> > > +
> > > +       ata_link_info(link, "BUSY ? %d (status: %#x) SStatus.DET: %#x\n",
> > > +                     status & ATA_BUSY, status, cur & 0xf);
> > >
> > >         return ata_check_ready(status);
> > >  }
> >
> > I think I can join the test based on kernel v6.8-rc2, too.
> >
> > The original ASUS B1400CEAE has only one NVMe SSD.  I prepare the
> > patch ("ata: ahci: Add force LPM policy quirk for ASUS B1400CEAE") to
> > fix the power consumption issue for s2idle with enabled VMD.
> >
> > The patch is a quirk limiting ASUS B1400CEAE only, not generic for the
> > SATA controller [8086:a0d3].  Then, I install another SATA HDD into
> > ASUS B1400CEAE for test.  The SATA HDD shows up and works.
> >
> > $ dmesg | grep SATA
> > [    0.785120] ahci 10000:e0:17.0: AHCI 0001.0301 32 slots 1 ports 6
> > Gbps 0x1 impl SATA mode
> > [    0.785269] ata1: SATA max UDMA/133 abar m2048@0x76102000 port
> > 0x76102100 irq 144 lpm-pol 3
> > [    1.096684] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> >
> > However, if I simply revert the commit 6210038aeaf4 ("ata: ahci:
> > Revert "ata: ahci: Add Tiger Lake UP{3,4} AHCI controller"") (fix the
> > conflict, of course), then the SATA HDD disappears!!?  Both
> > CONFIG_SATA_MOBILE_LPM_POLICY=3 and 0 can reproduce the issue.
> >
> > $ dmesg | grep SATA
> > [    0.783211] ahci 10000:e0:17.0: AHCI 0001.0301 32 slots 1 ports 6
> > Gbps 0x1 impl SATA mode
> > [    0.783399] ata1: SATA max UDMA/133 abar m2048@0x76102000 port
> > 0x76102100 irq 144 lpm-pol 3
> > [    1.096685] ata1: SATA link down (SStatus 4 SControl 300)
> >
> > Here is the dmesg of reverting ("ata: ahci: Revert "ata: ahci: Add
> > Tiger Lake UP{3,4} AHCI controller"")
> > https://bugzilla.kernel.org/show_bug.cgi?id=217114#c27
> > The code already includes the debug message in ahci_check_ready() from
> > Niklas.  However, the dmesg does not show the "BUSY ? ..." from
> > ahci_check_ready().
> >
> > From these scenarios mentioned above, they all apply LPM policy to the
> > SATA controller [8086:a0d3].  But, they apply LPM policy at different
> > time:
> > * The patch ("ata: ahci: Add force LPM policy quirk for ASUS
> > B1400CEAE") applies LPM policy in early ahci_init_one(), which is the
> > probe callback.
> > * Reverting 6210038aeaf4 ("ata: ahci: Revert "ata: ahci: Add Tiger
> > Lake UP{3,4} AHCI controller"") applies LPM policy via "ahci_pci_tbl"
> > table.
>
> I don't see why it should matter if we set the AHCI_HFLAG_USE_LPM_POLICY
> flag using ahci_pci_tbl, or by your suggested quirk in ahci_init_one(),
> as in both cases the flag will be set before ahci_init_one() calls
> ahci_update_initial_lpm_policy().
>
>
> Could it perhaps be that in order for libata to be able to detect your
> drive, when VMD is enabled, we also need your patch
> "PCI: vmd: enable PCI PM's L1 substates of remapped PCIe port and NVMe" ?

I only apply the patch ("ata: ahci: Add force LPM policy quirk for
ASUS B1400CEAE") for this test.  No "PCI: vmd: enable PCI PM's L1
substates of remapped PCIe port and NVMe".  :)

> If that is not the case, and there actually is a difference between using
> ahci_pci_tbl and your suggested quirk, then my next suggestion would be to
> add prints to libata-sata.c:sata_link_scr_lpm(). That way you can dump the
> exact SCR writes that are being done for the working case vs. the
> non-working case. (Since I assume that there must be some difference.)

I prepared debug messages as:

diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index 7ecd56c8262a..b910c7856d08 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -1677,8 +1676,10 @@ static void
ahci_update_initial_lpm_policy(struct ata_port *ap,
/* Ignore processing for chipsets that don't use policy */
- if (!(hpriv->flags & AHCI_HFLAG_USE_LPM_POLICY))
+ if (!(hpriv->flags & AHCI_HFLAG_USE_LPM_POLICY)) {
+ dev_info(ap->dev, "%s: do not use LPM policy\n", __func__);
return;
+ }
/* user modified policy via module param */
if (mobile_lpm_policy != -1) {
@@ -1696,6 +1697,7 @@ static void
ahci_update_initial_lpm_policy(struct ata_port *ap,
update_policy:
if (policy >= ATA_LPM_UNKNOWN && policy <= ATA_LPM_MIN_POWER)
ap->target_lpm_policy = policy;
+ dev_info(ap->dev, "%s: policy %d\n", __func__, policy);
}
static void ahci_intel_pcs_quirk(struct pci_dev *pdev, struct
ahci_host_priv *hpriv)
@@ -1706,12 +1708,16 @@ static void ahci_intel_pcs_quirk(struct
pci_dev *pdev, struct ahci_host_priv *hp
/*
* Only apply the 6-port PCS quirk for known legacy platforms.
*/
- if (!id || id->vendor != PCI_VENDOR_ID_INTEL)
+ if (!id || id->vendor != PCI_VENDOR_ID_INTEL) {
+ dev_info(&pdev->dev, "%s: not Intel, the vendor is 0x%08x\n",
__func__, id->vendor);
return;
+ }
/* Skip applying the quirk on Denverton and beyond */
- if (((enum board_ids) id->driver_data) >= board_ahci_pcs7)
+ if (((enum board_ids) id->driver_data) >= board_ahci_pcs7) {
+ dev_info(&pdev->dev, "%s: skip\n", __func__);
return;
+ }
/*
* port_map is determined from PORTS_IMPL PCI register which is
@@ -1722,8 +1728,10 @@ static void ahci_intel_pcs_quirk(struct pci_dev
*pdev, struct ahci_host_priv *hp
* before the OS boots.
*/
pci_read_config_word(pdev, PCS_6, &tmp16);
+ dev_info(&pdev->dev, "%s: PCS_6 is 0x%04x", __func__, tmp16);
if ((tmp16 & hpriv->port_map) != hpriv->port_map) {
tmp16 |= hpriv->port_map;
+ dev_info(&pdev->dev, "%s: write PCS_6 with 0x%04x", __func__, tmp16);
pci_write_config_word(pdev, PCS_6, tmp16);
}
}
@@ -1998,6 +2006,7 @@ static int ahci_init_one(struct pci_dev *pdev,
const struct pci_device_id *ent)
if (rc)
return rc;
+ dev_info(&pdev->dev, "%s: probed\n", __func__);
pm_runtime_put_noidle(&pdev->dev);
return 0;
}
diff --git a/drivers/ata/libahci.c b/drivers/ata/libahci.c
index 1a63200ea437..7e4f349554eb 100644
--- a/drivers/ata/libahci.c
+++ b/drivers/ata/libahci.c
@@ -812,6 +812,7 @@ static int ahci_set_lpm(struct ata_link *link,
enum ata_lpm_policy policy,
struct ahci_port_priv *pp = ap->private_data;
void __iomem *port_mmio = ahci_port_base(ap);
+ ata_link_info(link, "%s: policy=%d\n", __func__, policy);
if (policy != ATA_LPM_MAX_POWER) {
/* wakeup flag only applies to the max power policy */
hints &= ~ATA_LPM_WAKE_ONLY;
@@ -1533,6 +1534,12 @@ int ahci_check_ready(struct ata_link *link)
{
void __iomem *port_mmio = ahci_port_base(link->ap);
u8 status = readl(port_mmio + PORT_TFDATA) & 0xFF;
+ u32 cur = 0;
+
+ sata_scr_read(link, SCR_STATUS, &cur);
+
+ ata_link_info(link, "BUSY ? %d (status: %#x) SStatus.DET: %#x\n",
+ status & ATA_BUSY, status, cur & 0xf);
return ata_check_ready(status);
}
diff --git a/drivers/ata/libata-sata.c b/drivers/ata/libata-sata.c
index 0fb1934875f2..4bcedd46bcfa 100644
--- a/drivers/ata/libata-sata.c
+++ b/drivers/ata/libata-sata.c
@@ -344,6 +344,7 @@ int sata_link_resume(struct ata_link *link, const
unsigned int *params,
if (!(rc = sata_scr_read(link, SCR_ERROR, &serror)))
rc = sata_scr_write(link, SCR_ERROR, serror);
+ ata_link_info(link, "%s: rc=%d", __func__, rc);
return rc != -EINVAL ? rc : 0;
}
EXPORT_SYMBOL_GPL(sata_link_resume);
@@ -378,6 +379,7 @@ int sata_link_scr_lpm(struct ata_link *link, enum
ata_lpm_policy policy,
if (rc)
return rc;
+ ata_link_info(link, "%s: policy is %d and original scontrol
0x%08x\n", __func__, policy, scontrol);
switch (policy) {
case ATA_LPM_MAX_POWER:
/* disable all LPM transitions */
@@ -422,6 +424,7 @@ int sata_link_scr_lpm(struct ata_link *link, enum
ata_lpm_policy policy,
WARN_ON(1);
}
+ ata_link_info(link, "%s: write scontrol 0x%08x\n", __func__, scontrol);
rc = sata_scr_write(link, SCR_CONTROL, scontrol);
if (rc)
return rc;
@@ -586,9 +589,12 @@ int sata_link_hardreset(struct ata_link *link,
const unsigned int *timing,
rc = sata_link_resume(link, timing, deadline);
if (rc)
goto out;
+
/* if link is offline nothing more to do */
- if (ata_phys_link_offline(link))
+ if (ata_phys_link_offline(link)) {
+ ata_link_info(link, "%s: ata_phys_link_offline is True\n", __func__);
goto out;
+ }
/* Link is online. From this point, -ENODEV too is an error. */
if (online)
@@ -616,12 +622,15 @@ int sata_link_hardreset(struct ata_link *link,
const unsigned int *timing,
rc = 0;
if (check_ready)
rc = ata_wait_ready(link, deadline, check_ready);
+
+ ata_link_info(link, "%s: is %d\n", __func__, rc);
out:
if (rc && rc != -EAGAIN) {
/* online is set iff link is online && reset succeeded */
if (online)
*online = false;
}
+ ata_link_info(link, "%s: is %s line, returns %d\n", __func__,
*online? "on":"off", rc);
return rc;
}
EXPORT_SYMBOL_GPL(sata_link_hardreset);

Have the comparison:

* Bind LPM policy with the patch "ata: ahci: Add force LPM policy
quirk for ASUS B1400CEAE" based on kernel v6.8-rc2:

$ dmesg | grep -E "(SATA|ata1|ahci)"
[    0.791497] ahci 10000:e0:17.0: version 3.0
[    0.791499] ahci 10000:e0:17.0: force controller follow LPM policy
[    0.791517] ahci 10000:e0:17.0: can't derive routing for PCI INT A
[    0.791518] ahci 10000:e0:17.0: PCI INT A: no GSI
[    0.791637] ahci 10000:e0:17.0: ahci_update_initial_lpm_policy: policy 3
[    0.791652] ahci 10000:e0:17.0: ahci_intel_pcs_quirk: not Intel,
the vendor is 0xffffffff
[    0.791662] ahci 10000:e0:17.0: AHCI 0001.0301 32 slots 1 ports 6
Gbps 0x1 impl SATA mode
[    0.791663] ahci 10000:e0:17.0: flags: 64bit ncq sntf pm clo only
pio slum part deso sadm sds
[    0.791771] scsi host0: ahci
[    0.791806] ata1: SATA max UDMA/133 abar m2048@0x76102000 port
0x76102100 irq 145 lpm-pol 3
[    0.791808] ahci 10000:e0:17.0: ahci_init_one: probed
[    1.109393] ata1: sata_link_resume: rc=0
[    1.109415] ata1: BUSY ? 0 (status: 0x50) SStatus.DET: 0x3
[    1.109418] ata1: sata_link_hardreset: is 0
[    1.109420] ata1: sata_link_hardreset: is on line, returns 0
[    1.109444] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    1.110161] ata1.00: ATA-10: WDC WD10SPZX-80Z10T2, 04.01A04, max UDMA/133
[    1.112047] ata1.00: 1953525168 sectors, multi 16: LBA48 NCQ (depth 32), AA
[    1.112054] ata1.00: Features: NCQ-prio
[    1.114814] ata1.00: configured for UDMA/133
[    1.114821] ata1: ahci_set_lpm: policy=3
[    1.114837] ata1: sata_link_scr_lpm: policy is 3 and original
scontrol 0x00000300
[    1.114840] ata1: sata_link_scr_lpm: write scontrol 0x00000000

The SATA link is up and SATA storage shows up.
Full dmesg as the attachment of
https://bugzilla.kernel.org/show_bug.cgi?id=217114#c28

* Bind LPM policy with PCI IDs like commit 104ff59af73a ("ata: ahci:
Add Tiger Lake UP{3,4} AHCI controller"):

$ dmesg | grep -E "(SATA|ata1|ahci)"
[    0.783125] ahci 10000:e0:17.0: version 3.0
[    0.783143] ahci 10000:e0:17.0: can't derive routing for PCI INT A
[    0.783145] ahci 10000:e0:17.0: PCI INT A: no GSI
[    0.783257] ahci 10000:e0:17.0: ahci_update_initial_lpm_policy: policy 3
[    0.783280] ahci 10000:e0:17.0: ahci_intel_pcs_quirk: PCS_6 is 0x0000
[    0.783281] ahci 10000:e0:17.0: ahci_intel_pcs_quirk: write PCS_6 with 0x0001
[    0.783296] ahci 10000:e0:17.0: AHCI 0001.0301 32 slots 1 ports 6
Gbps 0x1 impl SATA mode
[    0.783298] ahci 10000:e0:17.0: flags: 64bit ncq sntf pm clo only
pio slum part deso sadm sds
[    0.783402] scsi host0: ahci
[    0.783440] ata1: SATA max UDMA/133 abar m2048@0x76102000 port
0x76102100 irq 144 lpm-pol 3
[    0.783442] ahci 10000:e0:17.0: ahci_init_one: probed
[    1.096930] ata1: sata_link_resume: rc=0
[    1.096960] ata1: sata_link_hardreset: ata_phys_link_offline is True
[    1.096962] ata1: sata_link_hardreset: is off line, returns 0
[    1.097000] ata1: SATA link down (SStatus 4 SControl 300)
[    1.097025] ata1: ahci_set_lpm: policy=3
[    1.097051] ata1: sata_link_scr_lpm: policy is 3 and original
scontrol 0x00000300
[    1.097054] ata1: sata_link_scr_lpm: write scontrol 0x00000304

The SATA link is down and SATA storage disappears.
Full dmesg as the attachment of
https://bugzilla.kernel.org/show_bug.cgi?id=217114#c29

The SCR writes different values with these two conditions.

However, I notice more interesting thing:
"drivers/ata/ahci.c:ahci_intel_pcs_quirk()"!
If bind LPM policy with PCI IDs matching, then it does the PCS quirk.
But, binding with the patch "ata: ahci: Add force LPM policy quirk for
ASUS B1400CEAE" does not, because the vendor is ANY vendor, not Intel.

So, I did following test:

If I modify the PCI vendor check condition with the pdev, not the PCI
ID's vendor:

diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index 7ecd56c8262a..ece709ac20d6 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -1706,12 +1709,16 @@ static void ahci_intel_pcs_quirk(struct
pci_dev *pdev, struct ahci_host_priv *hp
        /*
         * Only apply the 6-port PCS quirk for known legacy platforms.
         */
-       if (!id || id->vendor != PCI_VENDOR_ID_INTEL)
+       if (!id || pdev->vendor != PCI_VENDOR_ID_INTEL) {
+               dev_info(&pdev->dev, "%s: not Intel, the vendor is
0x%08x\n", __func__, id->vendor);
                return;
+       }

Then, the SATA HDD always disappears like binding the LPM policy with
PCI IDs matching, even with the patch "ata: ahci: Add force LPM policy
quirk for ASUS B1400CEAE".
So, I think ahci_intel_pcs_quirk() is the key point.

BR,
Jian-Hong Pan
  
Niklas Cassel Feb. 6, 2024, 1:12 p.m. UTC | #11
On Tue, Feb 06, 2024 at 04:39:02PM +0800, Jian-Hong Pan wrote:
> Niklas Cassel <cassel@kernel.org> 於 2024年2月5日 週一 下午7:33寫道:
> 
> Have the comparison:
> 
> * Bind LPM policy with the patch "ata: ahci: Add force LPM policy
> quirk for ASUS B1400CEAE" based on kernel v6.8-rc2:
> 
> $ dmesg | grep -E "(SATA|ata1|ahci)"
> [    0.791497] ahci 10000:e0:17.0: version 3.0
> [    0.791499] ahci 10000:e0:17.0: force controller follow LPM policy
> [    0.791517] ahci 10000:e0:17.0: can't derive routing for PCI INT A
> [    0.791518] ahci 10000:e0:17.0: PCI INT A: no GSI
> [    0.791637] ahci 10000:e0:17.0: ahci_update_initial_lpm_policy: policy 3
> [    0.791652] ahci 10000:e0:17.0: ahci_intel_pcs_quirk: not Intel,
> the vendor is 0xffffffff
> [    0.791662] ahci 10000:e0:17.0: AHCI 0001.0301 32 slots 1 ports 6
> Gbps 0x1 impl SATA mode
> [    0.791663] ahci 10000:e0:17.0: flags: 64bit ncq sntf pm clo only
> pio slum part deso sadm sds
> [    0.791771] scsi host0: ahci
> [    0.791806] ata1: SATA max UDMA/133 abar m2048@0x76102000 port
> 0x76102100 irq 145 lpm-pol 3
> [    0.791808] ahci 10000:e0:17.0: ahci_init_one: probed
> [    1.109393] ata1: sata_link_resume: rc=0
> [    1.109415] ata1: BUSY ? 0 (status: 0x50) SStatus.DET: 0x3
> [    1.109418] ata1: sata_link_hardreset: is 0
> [    1.109420] ata1: sata_link_hardreset: is on line, returns 0
> [    1.109444] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> [    1.110161] ata1.00: ATA-10: WDC WD10SPZX-80Z10T2, 04.01A04, max UDMA/133
> [    1.112047] ata1.00: 1953525168 sectors, multi 16: LBA48 NCQ (depth 32), AA
> [    1.112054] ata1.00: Features: NCQ-prio
> [    1.114814] ata1.00: configured for UDMA/133
> [    1.114821] ata1: ahci_set_lpm: policy=3
> [    1.114837] ata1: sata_link_scr_lpm: policy is 3 and original
> scontrol 0x00000300
> [    1.114840] ata1: sata_link_scr_lpm: write scontrol 0x00000000
> 
> The SATA link is up and SATA storage shows up.
> Full dmesg as the attachment of
> https://bugzilla.kernel.org/show_bug.cgi?id=217114#c28
> 
> * Bind LPM policy with PCI IDs like commit 104ff59af73a ("ata: ahci:
> Add Tiger Lake UP{3,4} AHCI controller"):
> 
> $ dmesg | grep -E "(SATA|ata1|ahci)"
> [    0.783125] ahci 10000:e0:17.0: version 3.0
> [    0.783143] ahci 10000:e0:17.0: can't derive routing for PCI INT A
> [    0.783145] ahci 10000:e0:17.0: PCI INT A: no GSI
> [    0.783257] ahci 10000:e0:17.0: ahci_update_initial_lpm_policy: policy 3
> [    0.783280] ahci 10000:e0:17.0: ahci_intel_pcs_quirk: PCS_6 is 0x0000
> [    0.783281] ahci 10000:e0:17.0: ahci_intel_pcs_quirk: write PCS_6 with 0x0001
> [    0.783296] ahci 10000:e0:17.0: AHCI 0001.0301 32 slots 1 ports 6
> Gbps 0x1 impl SATA mode
> [    0.783298] ahci 10000:e0:17.0: flags: 64bit ncq sntf pm clo only
> pio slum part deso sadm sds
> [    0.783402] scsi host0: ahci
> [    0.783440] ata1: SATA max UDMA/133 abar m2048@0x76102000 port
> 0x76102100 irq 144 lpm-pol 3
> [    0.783442] ahci 10000:e0:17.0: ahci_init_one: probed
> [    1.096930] ata1: sata_link_resume: rc=0
> [    1.096960] ata1: sata_link_hardreset: ata_phys_link_offline is True
> [    1.096962] ata1: sata_link_hardreset: is off line, returns 0
> [    1.097000] ata1: SATA link down (SStatus 4 SControl 300)
> [    1.097025] ata1: ahci_set_lpm: policy=3
> [    1.097051] ata1: sata_link_scr_lpm: policy is 3 and original
> scontrol 0x00000300
> [    1.097054] ata1: sata_link_scr_lpm: write scontrol 0x00000304
> 
> The SATA link is down and SATA storage disappears.
> Full dmesg as the attachment of
> https://bugzilla.kernel.org/show_bug.cgi?id=217114#c29
> 

So in summary:
When Intel VMD is on, and the ahci_intel_pcs_quirk is applied => NOT OK
When Intel VMD is on, and the ahci_intel_pcs_quirk is not applied => OK

When Intel VMD is off, and the ahci_intel_pcs_quirk is applied => OK
When Intel VMD is off, and the ahci_intel_pcs_quirk is not applied => ?

Excellent find!



In the bad case:

sata_link_hardreset() sets SControl.DET to 1, to establish the interface
communication. Then sleeps for 1 ms.

Then it calls sata_link_resume(), which clears SControl.DET to 0.
(This matches the AHCI spec which says that SControl.DET should be set
to 1 for at least 1 ms.)

sata_link_hardreset() then calls ata_phys_link_offline(),
which is essentially defined as: return !(SStatus.DET == 0x3)
ata_phys_link_offline() returns true, since SStatus.DET == 0x4.

SStatus.DET == 0x4 means: Phy in offline mode as a result of the
interface being disabled or running in a BIST loopback mode.

If the physical link is not established, there is no point to call
ata_wait_ready() (which waits for the device to become ready on the
protocol level), as the physical link could not even be established.

After that, we write SControl.DET to set bit 4 to disable the port,
in order to save power. This is only done because sata_link_hardreset()
failed to establish a link after toggling SControl.DET == 1.

So the problem is that SStatus.DET never changed to 0x3 after toggling
SControl.DET == 1.


> 
> However, I notice more interesting thing:
> "drivers/ata/ahci.c:ahci_intel_pcs_quirk()"!
> If bind LPM policy with PCI IDs matching, then it does the PCS quirk.
> But, binding with the patch "ata: ahci: Add force LPM policy quirk for
> ASUS B1400CEAE" does not, because the vendor is ANY vendor, not Intel.
> 
> So, I did following test:
> 
> If I modify the PCI vendor check condition with the pdev, not the PCI
> ID's vendor:
> 
> diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
> index 7ecd56c8262a..ece709ac20d6 100644
> --- a/drivers/ata/ahci.c
> +++ b/drivers/ata/ahci.c
> @@ -1706,12 +1709,16 @@ static void ahci_intel_pcs_quirk(struct
> pci_dev *pdev, struct ahci_host_priv *hp
>         /*
>          * Only apply the 6-port PCS quirk for known legacy platforms.
>          */
> -       if (!id || id->vendor != PCI_VENDOR_ID_INTEL)
> +       if (!id || pdev->vendor != PCI_VENDOR_ID_INTEL) {
> +               dev_info(&pdev->dev, "%s: not Intel, the vendor is
> 0x%08x\n", __func__, id->vendor);
>                 return;
> +       }

The reason why you are seeing this is because Tiger Lake does not have
an entry in the ahci_pci_tbl in mainline, so it uses the generic entry
which matches on the AHCI class code:
https://github.com/torvalds/linux/blob/v6.8-rc3/drivers/ata/ahci.c#L636

If you revert 6210038aeaf4 ("ata: ahci: Revert "ata: ahci: Add Tiger Lake
UP{3,4} AHCI controller""), you will get an explicit entry in the
ahci_pci_tbl.

But to clarify, I think that it would make sense to add:

diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index d2460fa985b7..e462509a45e8 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -1672,12 +1672,18 @@ static void ahci_update_initial_lpm_policy(struct ata_port *ap,
 
 static void ahci_intel_pcs_quirk(struct pci_dev *pdev, struct ahci_host_priv *hpriv)
 {
-       const struct pci_device_id *id = pci_match_id(ahci_pci_tbl, pdev);
+       const struct pci_device_id *id;
        u16 tmp16;
 
+       /* If the detected PCI device is not an Intel device, skip. */
+       if (pdev->vendor != PCI_VENDOR_ID_INTEL)
+               return;
+
        /*
-        * Only apply the 6-port PCS quirk for known legacy platforms.
+        * See if there is an explicit entry for this PCI device in
+        * ahci_pci_tbl, if there is not, do not apply the quirk.
         */
+       id = pci_match_id(ahci_pci_tbl, pdev);
        if (!id || id->vendor != PCI_VENDOR_ID_INTEL)
                return;
 


> 
> Then, the SATA HDD always disappears like binding the LPM policy with
> PCI IDs matching, even with the patch "ata: ahci: Add force LPM policy
> quirk for ASUS B1400CEAE".
> So, I think ahci_intel_pcs_quirk() is the key point.

I agree.


Can you verify that things work as expected when doing a:
$ git revert 6210038aeaf49c395c2da57572246d93ec67f6d4
to re-add the explicit entry, if you also do a:

--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -1672,6 +1672,7 @@ static void ahci_update_initial_lpm_policy(struct ata_port *ap,
 
 static void ahci_intel_pcs_quirk(struct pci_dev *pdev, struct ahci_host_priv *hpriv)
 {
+#if 0
        const struct pci_device_id *id = pci_match_id(ahci_pci_tbl, pdev);
        u16 tmp16;
 
@@ -1698,6 +1699,7 @@ static void ahci_intel_pcs_quirk(struct pci_dev *pdev, struct ahci_host_priv *hp
                tmp16 |= hpriv->port_map;
                pci_write_config_word(pdev, PCS_6, tmp16);
        }
+#endif
 }

To make the quirk a no-op?



To be honest, this quirk looks horrible.

Looking at the original commit:
c312ef176399 ("libata/ahci: Drop PCS quirk for Denverton and beyond")

It claims that:

Rather than try to fix the PCS quirk to consider the DNV register layout
instead require explicit opt-in. The assumption is that the OS driver
need not touch this register, and platforms can be added with a new
boad_ahci_pcs7 board-id when / if problematic platforms are found in the
future.

However, it does NOT require an explicit opt-in!

If we were to add an entry with board type "board_ahci" or
"board_ahci_low_power" for Tiger Lake, the quirk gets applied...

See also:
09d6ac8dc51a ("libata/ahci: Fix PCS quirk application")

So basically, what ahci_intel_pcs_quirk() does is that it checks
if there is an explicit entry in ahci_pci_tbl.
If there is not, the quirk is not applied.

If there is an entry, and the enum for that board has a value that
is less than board_ahci_pcs7, the quirk is applied...

But that will be *ALL* other board types since board_ahci_pcs7 is
defined last in the enum:
https://github.com/torvalds/linux/blob/v6.8-rc3/drivers/ata/ahci.c#L75

Not only that but the comment for that enum is wrong:
https://github.com/torvalds/linux/blob/v6.8-rc3/drivers/ata/ahci.c#L71-L74

	/*
	 * board IDs for Intel chipsets that support more than 6 ports
	 * *and* end up needing the PCS quirk.
	 */

Is is the opposite... board IDs that do NOT need the PCS quirk...

But this is not the way we add quirks.
We add a flag and a new board_id and mark the PCI device and vendor ids
that are affected to use that board, see e.g.
20730e9b2778 ("ahci: add 43-bit DMA address quirk for ASMedia ASM1061 controllers")

We don't add a quirk and apply it for everything (board_ahci,
board_ahci_low_power) except for a specific entry (board_ahci_pcs7).

It seems that at least Intel AHCI controllers that also have Intel VMD
enabled break when this quirk is applied.

I guess one way would be to do a:
git show c312ef176399:drivers/ata/ahci.c | grep "PCI_VDEVICE(INTEL"
and replace everything that is not: board_ahci_pcs7
with a board_ahci_pcs_quirk, board_ahci_low_power_pcs_quirk, and
board_ahci_avn_pcs_quirk, and after that change all board_ahci_pcs7
entries to board_ahci, and assume that entries added since c312ef176399
do not need the quirk.

But it would be nice if someone from Intel could clean this up.


Kind regards,
Niklas
  
Niklas Cassel Feb. 6, 2024, 10:07 p.m. UTC | #12
On Tue, Feb 06, 2024 at 02:12:17PM +0100, Niklas Cassel wrote:
> On Tue, Feb 06, 2024 at 04:39:02PM +0800, Jian-Hong Pan wrote:
> > Niklas Cassel <cassel@kernel.org> 於 2024年2月5日 週一 下午7:33寫道:

(snip)

> 
> It seems that at least Intel AHCI controllers that also have Intel VMD
> enabled break when this quirk is applied.
> 
> I guess one way would be to do a:
> git show c312ef176399:drivers/ata/ahci.c | grep "PCI_VDEVICE(INTEL"
> and replace everything that is not: board_ahci_pcs7
> with a board_ahci_pcs_quirk, board_ahci_low_power_pcs_quirk, and
> board_ahci_avn_pcs_quirk, and after that change all board_ahci_pcs7
> entries to board_ahci, and assume that entries added since c312ef176399
> do not need the quirk.
> 
> But it would be nice if someone from Intel could clean this up.

While it would still be nice if someone cleaned up the intel_pcs_quirk.


Jian-Hong, perhaps you can try my series:
https://lore.kernel.org/linux-ide/20240206211352.1664816-1-cassel@kernel.org/

Also available here:
https://github.com/floatious/linux/tree/external-port-v2

With that series, you should not need an explicit "board_ahci_low_power"
entry for Tiger Lake to get working LPM.

If fact, you want to ensure that you do not have any Tiger Lake entry in
ahci_pci_tbl, as that will apply intel_pcs_quirk and break your platform.

Testing is appreciated :)

Let's hope that LPM is enabled and that you can also enter low-power
C-states.


Kind regards,
Niklas
  
Jian-Hong Pan Feb. 7, 2024, 6:31 a.m. UTC | #13
Niklas Cassel <cassel@kernel.org> 於 2024年2月6日 週二 下午9:12寫道:
>
> On Tue, Feb 06, 2024 at 04:39:02PM +0800, Jian-Hong Pan wrote:
> > Niklas Cassel <cassel@kernel.org> 於 2024年2月5日 週一 下午7:33寫道:
> >
> > Have the comparison:
> >
> > * Bind LPM policy with the patch "ata: ahci: Add force LPM policy
> > quirk for ASUS B1400CEAE" based on kernel v6.8-rc2:
> >
> > $ dmesg | grep -E "(SATA|ata1|ahci)"
> > [    0.791497] ahci 10000:e0:17.0: version 3.0
> > [    0.791499] ahci 10000:e0:17.0: force controller follow LPM policy
> > [    0.791517] ahci 10000:e0:17.0: can't derive routing for PCI INT A
> > [    0.791518] ahci 10000:e0:17.0: PCI INT A: no GSI
> > [    0.791637] ahci 10000:e0:17.0: ahci_update_initial_lpm_policy: policy 3
> > [    0.791652] ahci 10000:e0:17.0: ahci_intel_pcs_quirk: not Intel,
> > the vendor is 0xffffffff
> > [    0.791662] ahci 10000:e0:17.0: AHCI 0001.0301 32 slots 1 ports 6
> > Gbps 0x1 impl SATA mode
> > [    0.791663] ahci 10000:e0:17.0: flags: 64bit ncq sntf pm clo only
> > pio slum part deso sadm sds
> > [    0.791771] scsi host0: ahci
> > [    0.791806] ata1: SATA max UDMA/133 abar m2048@0x76102000 port
> > 0x76102100 irq 145 lpm-pol 3
> > [    0.791808] ahci 10000:e0:17.0: ahci_init_one: probed
> > [    1.109393] ata1: sata_link_resume: rc=0
> > [    1.109415] ata1: BUSY ? 0 (status: 0x50) SStatus.DET: 0x3
> > [    1.109418] ata1: sata_link_hardreset: is 0
> > [    1.109420] ata1: sata_link_hardreset: is on line, returns 0
> > [    1.109444] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> > [    1.110161] ata1.00: ATA-10: WDC WD10SPZX-80Z10T2, 04.01A04, max UDMA/133
> > [    1.112047] ata1.00: 1953525168 sectors, multi 16: LBA48 NCQ (depth 32), AA
> > [    1.112054] ata1.00: Features: NCQ-prio
> > [    1.114814] ata1.00: configured for UDMA/133
> > [    1.114821] ata1: ahci_set_lpm: policy=3
> > [    1.114837] ata1: sata_link_scr_lpm: policy is 3 and original
> > scontrol 0x00000300
> > [    1.114840] ata1: sata_link_scr_lpm: write scontrol 0x00000000
> >
> > The SATA link is up and SATA storage shows up.
> > Full dmesg as the attachment of
> > https://bugzilla.kernel.org/show_bug.cgi?id=217114#c28
> >
> > * Bind LPM policy with PCI IDs like commit 104ff59af73a ("ata: ahci:
> > Add Tiger Lake UP{3,4} AHCI controller"):
> >
> > $ dmesg | grep -E "(SATA|ata1|ahci)"
> > [    0.783125] ahci 10000:e0:17.0: version 3.0
> > [    0.783143] ahci 10000:e0:17.0: can't derive routing for PCI INT A
> > [    0.783145] ahci 10000:e0:17.0: PCI INT A: no GSI
> > [    0.783257] ahci 10000:e0:17.0: ahci_update_initial_lpm_policy: policy 3
> > [    0.783280] ahci 10000:e0:17.0: ahci_intel_pcs_quirk: PCS_6 is 0x0000
> > [    0.783281] ahci 10000:e0:17.0: ahci_intel_pcs_quirk: write PCS_6 with 0x0001
> > [    0.783296] ahci 10000:e0:17.0: AHCI 0001.0301 32 slots 1 ports 6
> > Gbps 0x1 impl SATA mode
> > [    0.783298] ahci 10000:e0:17.0: flags: 64bit ncq sntf pm clo only
> > pio slum part deso sadm sds
> > [    0.783402] scsi host0: ahci
> > [    0.783440] ata1: SATA max UDMA/133 abar m2048@0x76102000 port
> > 0x76102100 irq 144 lpm-pol 3
> > [    0.783442] ahci 10000:e0:17.0: ahci_init_one: probed
> > [    1.096930] ata1: sata_link_resume: rc=0
> > [    1.096960] ata1: sata_link_hardreset: ata_phys_link_offline is True
> > [    1.096962] ata1: sata_link_hardreset: is off line, returns 0
> > [    1.097000] ata1: SATA link down (SStatus 4 SControl 300)
> > [    1.097025] ata1: ahci_set_lpm: policy=3
> > [    1.097051] ata1: sata_link_scr_lpm: policy is 3 and original
> > scontrol 0x00000300
> > [    1.097054] ata1: sata_link_scr_lpm: write scontrol 0x00000304
> >
> > The SATA link is down and SATA storage disappears.
> > Full dmesg as the attachment of
> > https://bugzilla.kernel.org/show_bug.cgi?id=217114#c29
> >
>
> So in summary:
> When Intel VMD is on, and the ahci_intel_pcs_quirk is applied => NOT OK
> When Intel VMD is on, and the ahci_intel_pcs_quirk is not applied => OK
>
> When Intel VMD is off, and the ahci_intel_pcs_quirk is applied => OK
> When Intel VMD is off, and the ahci_intel_pcs_quirk is not applied => ?

When Intel VMD is off, and the ahci_intel_pcs_quirk is not applied => OK

> Excellent find!
>
>
>
> In the bad case:
>
> sata_link_hardreset() sets SControl.DET to 1, to establish the interface
> communication. Then sleeps for 1 ms.
>
> Then it calls sata_link_resume(), which clears SControl.DET to 0.
> (This matches the AHCI spec which says that SControl.DET should be set
> to 1 for at least 1 ms.)
>
> sata_link_hardreset() then calls ata_phys_link_offline(),
> which is essentially defined as: return !(SStatus.DET == 0x3)
> ata_phys_link_offline() returns true, since SStatus.DET == 0x4.
>
> SStatus.DET == 0x4 means: Phy in offline mode as a result of the
> interface being disabled or running in a BIST loopback mode.
>
> If the physical link is not established, there is no point to call
> ata_wait_ready() (which waits for the device to become ready on the
> protocol level), as the physical link could not even be established.
>
> After that, we write SControl.DET to set bit 4 to disable the port,
> in order to save power. This is only done because sata_link_hardreset()
> failed to establish a link after toggling SControl.DET == 1.
>
> So the problem is that SStatus.DET never changed to 0x3 after toggling
> SControl.DET == 1.
>
>
> >
> > However, I notice more interesting thing:
> > "drivers/ata/ahci.c:ahci_intel_pcs_quirk()"!
> > If bind LPM policy with PCI IDs matching, then it does the PCS quirk.
> > But, binding with the patch "ata: ahci: Add force LPM policy quirk for
> > ASUS B1400CEAE" does not, because the vendor is ANY vendor, not Intel.
> >
> > So, I did following test:
> >
> > If I modify the PCI vendor check condition with the pdev, not the PCI
> > ID's vendor:
> >
> > diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
> > index 7ecd56c8262a..ece709ac20d6 100644
> > --- a/drivers/ata/ahci.c
> > +++ b/drivers/ata/ahci.c
> > @@ -1706,12 +1709,16 @@ static void ahci_intel_pcs_quirk(struct
> > pci_dev *pdev, struct ahci_host_priv *hp
> >         /*
> >          * Only apply the 6-port PCS quirk for known legacy platforms.
> >          */
> > -       if (!id || id->vendor != PCI_VENDOR_ID_INTEL)
> > +       if (!id || pdev->vendor != PCI_VENDOR_ID_INTEL) {
> > +               dev_info(&pdev->dev, "%s: not Intel, the vendor is
> > 0x%08x\n", __func__, id->vendor);
> >                 return;
> > +       }
>
> The reason why you are seeing this is because Tiger Lake does not have
> an entry in the ahci_pci_tbl in mainline, so it uses the generic entry
> which matches on the AHCI class code:
> https://github.com/torvalds/linux/blob/v6.8-rc3/drivers/ata/ahci.c#L636
>
> If you revert 6210038aeaf4 ("ata: ahci: Revert "ata: ahci: Add Tiger Lake
> UP{3,4} AHCI controller""), you will get an explicit entry in the
> ahci_pci_tbl.
>
> But to clarify, I think that it would make sense to add:
>
> diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
> index d2460fa985b7..e462509a45e8 100644
> --- a/drivers/ata/ahci.c
> +++ b/drivers/ata/ahci.c
> @@ -1672,12 +1672,18 @@ static void ahci_update_initial_lpm_policy(struct ata_port *ap,
>
>  static void ahci_intel_pcs_quirk(struct pci_dev *pdev, struct ahci_host_priv *hpriv)
>  {
> -       const struct pci_device_id *id = pci_match_id(ahci_pci_tbl, pdev);
> +       const struct pci_device_id *id;
>         u16 tmp16;
>
> +       /* If the detected PCI device is not an Intel device, skip. */
> +       if (pdev->vendor != PCI_VENDOR_ID_INTEL)
> +               return;
> +
>         /*
> -        * Only apply the 6-port PCS quirk for known legacy platforms.
> +        * See if there is an explicit entry for this PCI device in
> +        * ahci_pci_tbl, if there is not, do not apply the quirk.
>          */
> +       id = pci_match_id(ahci_pci_tbl, pdev);
>         if (!id || id->vendor != PCI_VENDOR_ID_INTEL)
>                 return;
>
>
>
> >
> > Then, the SATA HDD always disappears like binding the LPM policy with
> > PCI IDs matching, even with the patch "ata: ahci: Add force LPM policy
> > quirk for ASUS B1400CEAE".
> > So, I think ahci_intel_pcs_quirk() is the key point.
>
> I agree.
>
>
> Can you verify that things work as expected when doing a:
> $ git revert 6210038aeaf49c395c2da57572246d93ec67f6d4
> to re-add the explicit entry, if you also do a:
>
> --- a/drivers/ata/ahci.c
> +++ b/drivers/ata/ahci.c
> @@ -1672,6 +1672,7 @@ static void ahci_update_initial_lpm_policy(struct ata_port *ap,
>
>  static void ahci_intel_pcs_quirk(struct pci_dev *pdev, struct ahci_host_priv *hpriv)
>  {
> +#if 0
>         const struct pci_device_id *id = pci_match_id(ahci_pci_tbl, pdev);
>         u16 tmp16;
>
> @@ -1698,6 +1699,7 @@ static void ahci_intel_pcs_quirk(struct pci_dev *pdev, struct ahci_host_priv *hp
>                 tmp16 |= hpriv->port_map;
>                 pci_write_config_word(pdev, PCS_6, tmp16);
>         }
> +#endif
>  }
>
> To make the quirk a no-op?

Here is the test result:

Both enabled & disabled VMD with no-op ahci_intel_pcs_quirk() shows
the SATA storage on ASUS B1400CEAE. => OK

$ cat /tmp/dmesg.log
[    0.799439] ahci 10000:e0:17.0: version 3.0
[    0.799459] ahci 10000:e0:17.0: can't derive routing for PCI INT A
[    0.799460] ahci 10000:e0:17.0: PCI INT A: no GSI
[    0.799582] ahci 10000:e0:17.0: ahci_update_initial_lpm_policy: policy 3
[    0.799615] ahci 10000:e0:17.0: AHCI 0001.0301 32 slots 1 ports 6
Gbps 0x1 impl SATA mode
[    0.799617] ahci 10000:e0:17.0: flags: 64bit ncq sntf pm clo only
pio slum part deso sadm sds
[    0.799722] scsi host0: ahci
[    0.799760] ata1: SATA max UDMA/133 abar m2048@0x76102000 port
0x76102100 irq 144 lpm-pol 3
[    0.799761] ahci 10000:e0:17.0: ahci_init_one: probed
[    1.112519] ata1: sata_link_resume: rc=0
[    1.112541] ata1: BUSY ? 0 (status: 0x50) SStatus.DET: 0x3
[    1.112545] ata1: sata_link_hardreset: is 0
[    1.112547] ata1: sata_link_hardreset: is on line, returns 0
[    1.112571] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    1.113280] ata1.00: ATA-10: WDC WD10SPZX-80Z10T2, 04.01A04, max UDMA/133
[    1.114880] ata1.00: 1953525168 sectors, multi 16: LBA48 NCQ (depth 32), AA
[    1.114887] ata1.00: Features: NCQ-prio
[    1.117148] ata1.00: configured for UDMA/133
[    1.117154] ata1: ahci_set_lpm: policy=3
[    1.117169] ata1: sata_link_scr_lpm: policy is 3 and original
scontrol 0x00000300
[    1.117172] ata1: sata_link_scr_lpm: write scontrol 0x00000000

Jian-Hong Pan

> To be honest, this quirk looks horrible.
>
> Looking at the original commit:
> c312ef176399 ("libata/ahci: Drop PCS quirk for Denverton and beyond")
>
> It claims that:
>
> Rather than try to fix the PCS quirk to consider the DNV register layout
> instead require explicit opt-in. The assumption is that the OS driver
> need not touch this register, and platforms can be added with a new
> boad_ahci_pcs7 board-id when / if problematic platforms are found in the
> future.
>
> However, it does NOT require an explicit opt-in!
>
> If we were to add an entry with board type "board_ahci" or
> "board_ahci_low_power" for Tiger Lake, the quirk gets applied...
>
> See also:
> 09d6ac8dc51a ("libata/ahci: Fix PCS quirk application")
>
> So basically, what ahci_intel_pcs_quirk() does is that it checks
> if there is an explicit entry in ahci_pci_tbl.
> If there is not, the quirk is not applied.
>
> If there is an entry, and the enum for that board has a value that
> is less than board_ahci_pcs7, the quirk is applied...
>
> But that will be *ALL* other board types since board_ahci_pcs7 is
> defined last in the enum:
> https://github.com/torvalds/linux/blob/v6.8-rc3/drivers/ata/ahci.c#L75
>
> Not only that but the comment for that enum is wrong:
> https://github.com/torvalds/linux/blob/v6.8-rc3/drivers/ata/ahci.c#L71-L74
>
>         /*
>          * board IDs for Intel chipsets that support more than 6 ports
>          * *and* end up needing the PCS quirk.
>          */
>
> Is is the opposite... board IDs that do NOT need the PCS quirk...
>
> But this is not the way we add quirks.
> We add a flag and a new board_id and mark the PCI device and vendor ids
> that are affected to use that board, see e.g.
> 20730e9b2778 ("ahci: add 43-bit DMA address quirk for ASMedia ASM1061 controllers")
>
> We don't add a quirk and apply it for everything (board_ahci,
> board_ahci_low_power) except for a specific entry (board_ahci_pcs7).
>
> It seems that at least Intel AHCI controllers that also have Intel VMD
> enabled break when this quirk is applied.
>
> I guess one way would be to do a:
> git show c312ef176399:drivers/ata/ahci.c | grep "PCI_VDEVICE(INTEL"
> and replace everything that is not: board_ahci_pcs7
> with a board_ahci_pcs_quirk, board_ahci_low_power_pcs_quirk, and
> board_ahci_avn_pcs_quirk, and after that change all board_ahci_pcs7
> entries to board_ahci, and assume that entries added since c312ef176399
> do not need the quirk.
>
> But it would be nice if someone from Intel could clean this up.
>
>
> Kind regards,
> Niklas
  
Jian-Hong Pan Feb. 7, 2024, 6:35 a.m. UTC | #14
Niklas Cassel <cassel@kernel.org> 於 2024年2月7日 週三 上午6:07寫道:
>
> On Tue, Feb 06, 2024 at 02:12:17PM +0100, Niklas Cassel wrote:
> > On Tue, Feb 06, 2024 at 04:39:02PM +0800, Jian-Hong Pan wrote:
> > > Niklas Cassel <cassel@kernel.org> 於 2024年2月5日 週一 下午7:33寫道:
>
> (snip)
>
> >
> > It seems that at least Intel AHCI controllers that also have Intel VMD
> > enabled break when this quirk is applied.
> >
> > I guess one way would be to do a:
> > git show c312ef176399:drivers/ata/ahci.c | grep "PCI_VDEVICE(INTEL"
> > and replace everything that is not: board_ahci_pcs7
> > with a board_ahci_pcs_quirk, board_ahci_low_power_pcs_quirk, and
> > board_ahci_avn_pcs_quirk, and after that change all board_ahci_pcs7
> > entries to board_ahci, and assume that entries added since c312ef176399
> > do not need the quirk.
> >
> > But it would be nice if someone from Intel could clean this up.
>
> While it would still be nice if someone cleaned up the intel_pcs_quirk.
>
>
> Jian-Hong, perhaps you can try my series:
> https://lore.kernel.org/linux-ide/20240206211352.1664816-1-cassel@kernel.org/
>
> Also available here:
> https://github.com/floatious/linux/tree/external-port-v2
>
> With that series, you should not need an explicit "board_ahci_low_power"
> entry for Tiger Lake to get working LPM.
>
> If fact, you want to ensure that you do not have any Tiger Lake entry in
> ahci_pci_tbl, as that will apply intel_pcs_quirk and break your platform.
>
> Testing is appreciated :)
>
> Let's hope that LPM is enabled and that you can also enter low-power
> C-states.

Yes!  That patch series binds LPM policy by default and avoids
disappeared SATA storage issue.

Thanks!
Jian-Hong Pan
  

Patch

diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index d2460fa985b7..6bc5298a4adf 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -1434,6 +1434,33 @@  static bool ahci_broken_devslp(struct pci_dev *pdev)
 	return pci_match_id(ids, pdev);
 }
 
+static bool ahci_force_lpm_policy(void)
+{
+	/*
+	 * Some systems, like ASUS B1400CEAE equipped with the SATA controller
+	 * [8086:a0d3] can use LPM policy to save power, especially for s2idle.
+	 * However, the same controller may be failed on other platforms. So,
+	 * commit (ata: ahci: Revert "ata: ahci: Add Tiger Lake UP{3,4} AHCI
+	 * controller") drops LPM policy for [8086:a0d3].
+	 *
+	 * https://bugzilla.kernel.org/show_bug.cgi?id=218394
+	 */
+	static const struct dmi_system_id sysids[] = {
+		{
+			.ident = "ASUS B1400CEAE",
+			.matches = {
+				DMI_MATCH(DMI_SYS_VENDOR,
+					  "ASUSTeK COMPUTER INC."),
+				DMI_MATCH(DMI_PRODUCT_NAME,
+					  "ASUS EXPERTBOOK B1400CEAE"),
+			},
+		},
+		{ }	/* terminate list */
+	};
+
+	return dmi_first_match(sysids);
+}
+
 #ifdef CONFIG_ATA_ACPI
 static void ahci_gtf_filter_workaround(struct ata_host *host)
 {
@@ -1760,6 +1787,11 @@  static int ahci_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 			ahci_pci_bar = AHCI_PCI_BAR_LOONGSON;
 	}
 
+	if (ahci_force_lpm_policy()) {
+		pi = ahci_port_info[board_ahci_low_power];
+		dev_info(&pdev->dev, "force controller follow LPM policy\n");
+	}
+
 	/* acquire resources */
 	rc = pcim_enable_device(pdev);
 	if (rc)