[v5,13/16] x86: decouple PAT and MTRR handling

Message ID 20221102074713.21493-14-jgross@suse.com
State New
Series x86: make PAT and MTRR independent from each other

Commit Message

Juergen Gross Nov. 2, 2022, 7:47 a.m. UTC
  Today, PAT is usable only with MTRR being active, with some nasty tweaks
to make PAT usable when running as a Xen PV guest, which doesn't support
MTRR.

The reason for this coupling is that both PAT MSR changes and MTRR
changes require a similar sequence, so full PAT support was added using
the already available MTRR handling.

Xen PV PAT handling can work without MTRR, as it just needs to consume
the PAT MSR setting done by the hypervisor, without the ability or need
to change it. This coupling has in turn resulted in a convoluted
initialization sequence and in wrong decisions regarding cache mode
availability due to misleading PAT availability flags.

Fix all of that by allowing PAT to be used without MTRR and by reworking
the current PAT initialization sequence to better match the newly
introduced generic cache initialization.

This removes the need for the recently added pat_force_disabled flag, so
remove the remnants of the patch that added it.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
V2:
- former patch 3 completely reworked
V5:
- rework PAT() macro
- drop local pat variable (Borislav Petkov)
- use cpu_feature_enabled() (Borislav Petkov)
- some more minor adjustments (Borislav Petkov)
---
 arch/x86/include/asm/memtype.h  |   5 +-
 arch/x86/kernel/cpu/cacheinfo.c |   3 +-
 arch/x86/kernel/cpu/mtrr/mtrr.c |  12 +--
 arch/x86/kernel/setup.c         |  13 +--
 arch/x86/mm/pat/memtype.c       | 152 +++++++++++---------------------
 5 files changed, 57 insertions(+), 128 deletions(-)
  

Comments

Kirill A. Shutemov Dec. 1, 2022, 4:26 p.m. UTC | #1
On Wed, Nov 02, 2022 at 08:47:10AM +0100, Juergen Gross wrote:
[...]

This patch breaks boot for TDX guest.

Kernel now tries to set CR0.CD which is forbidden in TDX guest[1] and
causes #VE:

	tdx: Unexpected #VE: 28
	VE fault: 0000 [#1] PREEMPT SMP NOPTI
	CPU: 0 PID: 0 Comm: swapper Not tainted 6.1.0-rc1-00015-gadfe7512e1d0 #2646
	Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
	RIP: 0010:native_write_cr0 (arch/x86/kernel/cpu/common.c:427) 
	 Call Trace:
	  <TASK>
	 ? cache_disable (arch/x86/include/asm/cpufeature.h:173 arch/x86/kernel/cpu/cacheinfo.c:1085) 
	 ? cache_cpu_init (arch/x86/kernel/cpu/cacheinfo.c:1132 (discriminator 3)) 
	 ? setup_arch (arch/x86/kernel/setup.c:1079) 
	 ? start_kernel (init/main.c:279 (discriminator 3) init/main.c:477 (discriminator 3) init/main.c:960 (discriminator 3)) 
	 ? load_ucode_bsp (arch/x86/kernel/cpu/microcode/core.c:155) 
	 ? secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:358) 
	  </TASK>

Any suggestion how to fix it?

[1] Section 10.6.1. "CR0", https://cdrdv2.intel.com/v1/dl/getContent/733568
  
Juergen Gross Dec. 1, 2022, 4:33 p.m. UTC | #2
On 01.12.22 17:26, Kirill A. Shutemov wrote:
[...]
> 
> This patch breaks boot for TDX guest.
> 
> Kernel now tries to set CR0.CD which is forbidden in TDX guest[1] and
> causes #VE:
[...]
> 
> Any suggestion how to fix it?
> 
> [1] Section 10.6.1. "CR0", https://cdrdv2.intel.com/v1/dl/getContent/733568

What was the solution before?

I guess MTRR was disabled, so there was no PAT, too?

If this is the case, you can go the same route as Xen PV guests do.


Juergen
  
Kirill A. Shutemov Dec. 1, 2022, 11:57 p.m. UTC | #3
On Thu, Dec 01, 2022 at 05:33:28PM +0100, Juergen Gross wrote:
[...]
> 
> What was the solution before?
> 
> I guess MTRR was disabled, so there was no PAT, too?

Right:

Linus' tree:

[    0.002589] last_pfn = 0x480000 max_arch_pfn = 0x10000000000
[    0.003976] Disabled
[    0.004452] x86/PAT: MTRRs disabled, skipping PAT initialization too.
[    0.005856] CPU MTRRs all blank - virtualized system.
[    0.006915] x86/PAT: Configuration [0-7]: WB  WT  UC- UC  WB  WT  UC- UC

tip/master:

[    0.003443] last_pfn = 0x20b8e max_arch_pfn = 0x10000000000
[    0.005220] Disabled
[    0.005818] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
[    0.007752] tdx: Unexpected #VE: 28

The dangling "Disabled" comes from mtrr_bp_init().


> If this is the case, you can go the same route as Xen PV guests do.

Any reason X86_FEATURE_HYPERVISOR cannot be used instead of
X86_FEATURE_XENPV there?

Do we have any virtualized platform that supports it?
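
For context, "there" is the check this patch adds in pat_bp_init() (see the
memtype.c hunk at the bottom of this page). With Kirill's substitution it
would read roughly like this sketch, which is illustrative only and not a
tested change:

	/* Sketch: keep the hypervisor-provided PAT MSR value for any guest. */
	if (pat_disabled || cpu_feature_enabled(X86_FEATURE_HYPERVISOR)) {
		init_cache_modes(pat_msr_val);
		return;
	}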
  
Juergen Gross Dec. 2, 2022, 5:56 a.m. UTC | #4
On 02.12.22 00:57, Kirill A. Shutemov wrote:
[...]
> 
>> If this is the case, you can go the same route as Xen PV guests do.
> 
> Any reason X86_FEATURE_HYPERVISOR cannot be used instead of
> X86_FEATURE_XENPV there?
> 
> Do we have any virtualized platform that supports it?

Yes, of course. Any hardware-virtualized guest should be able to use it;
obviously TDX guests are the first ones not able to do so.

And the above dmesg snippets show rather nicely that not disabling
PAT completely would be a benefit for TDX guests, as all caching modes
would be usable (the PAT MSR seems to be initialized quite fine).

Instead of X86_FEATURE_XENPV we could introduce something like
X86_FEATURE_PAT_READONLY, which could be set for Xen PV guests and for
TDX guests.


Juergen
  
Kirill A. Shutemov Dec. 2, 2022, 1:27 p.m. UTC | #5
On Fri, Dec 02, 2022 at 06:56:47AM +0100, Juergen Gross wrote:
[...]
> Instead of X86_FEATURE_XENPV we could introduce something like
> X86_FEATURE_PAT_READONLY, which could be set for Xen PV guests and for
> TDX guests.

Technically, the MSR is writable on TDX. But it seems there's no way to
properly change it while following the protocol for such changes on MP
systems.

Although, I don't quite follow what role cache disabling plays on a system
with self-snoop support. Hm?
  
Juergen Gross Dec. 2, 2022, 1:39 p.m. UTC | #6
On 02.12.22 14:27, Kirill A. Shutemov wrote:
[...]
> 
> Technically, the MSR is writable on TDX. But it seems there's no way to
> properly change it while following the protocol for such changes on MP systems.

Why not? I don't see why it is possible in a non-TDX system, but not within
a TDX guest.

> Although, I don't quite follow what role cache disabling plays on a system
> with self-snoop support. Hm?

It is the recommended way to do it. See SDM Vol. 3 Chapter 11 ("Memory Cache
Control"):

The operating system is responsible for insuring that changes to a PAT entry
occur in a manner that maintains the consistency of the processor caches and
translation lookaside buffers (TLB). This is accomplished by following the
procedure as specified in Section 11.11.8, “MTRR Considerations in MP
Systems,” for changing the value of an MTRR in a multiple processor system.
It requires
a specific sequence of operations that includes flushing the processors caches
and TLBs.

And the sequence for the MTRRs is:

1. Broadcast to all processors to execute the following code sequence.
2. Disable interrupts.
3. Wait for all processors to reach this point.
4. Enter the no-fill cache mode. (Set the CD flag in control register CR0 to 1
    and the NW flag to 0.)
5. Flush all caches using the WBINVD instructions. Note on a processor that
    supports self-snooping, CPUID feature flag bit 27, this step is unnecessary.
6. If the PGE flag is set in control register CR4, flush all TLBs by clearing
    that flag.
7. If the PGE flag is clear in control register CR4, flush all TLBs by executing
    a MOV from control register CR3 to another register and then a MOV from that
    register back to CR3.
8. Disable all range registers (by clearing the E flag in register MTRRdefType).
    If only variable ranges are being modified, software may clear the valid bits
    for the affected register pairs instead.
9. Update the MTRRs.
10. Enable all range registers (by setting the E flag in register MTRRdefType).
     If only variable-range registers were modified and their individual valid
     bits were cleared, then set the valid bits for the affected ranges instead.
11. Flush all caches and all TLBs a second time. (The TLB flush is required for
     Pentium 4, Intel Xeon, and P6 family processors. Executing the WBINVD
     instruction is not needed when using Pentium 4, Intel Xeon, and P6 family
     processors, but it may be needed in future systems.)
12. Enter the normal cache mode to re-enable caching. (Set the CD and NW flags
     in control register CR0 to 0.)
13. Set PGE flag in control register CR4, if cleared in Step 6 (above).
14. Wait for all processors to reach this point.
15. Enable interrupts.

So cache disabling is recommended.
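
As a rough C sketch, steps 4 through 8 of that sequence correspond to
something like the following on each CPU. This is loosely modeled on
cache_disable() from arch/x86/kernel/cpu/cacheinfo.c seen in the trace
above; it is simplified, and the helper name mtrr_disable() is used
illustratively rather than quoted verbatim from the tree:

	static void cache_disable_sketch(void)
	{
		unsigned long cr4;

		/* Step 4: enter no-fill cache mode (CR0.CD=1, CR0.NW=0). */
		write_cr0((read_cr0() | X86_CR0_CD) & ~X86_CR0_NW);

		/* Step 5: flush caches; unnecessary with self-snoop. */
		wbinvd();

		/* Steps 6/7: flush TLBs via a CR4.PGE toggle or CR3 reload. */
		cr4 = __read_cr4();
		if (cr4 & X86_CR4_PGE)
			__write_cr4(cr4 & ~X86_CR4_PGE);
		else
			native_write_cr3(__native_read_cr3());

		/* Step 8: disable MTRRs (clear the E flag in MTRRdefType). */
		mtrr_disable();
	}

The CR0.CD write in step 4 is exactly where a TDX guest takes the #VE
reported at the top of this thread.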


Juergen
  
Borislav Petkov Dec. 2, 2022, 1:55 p.m. UTC | #7
On Fri, Dec 02, 2022 at 06:56:47AM +0100, Juergen Gross wrote:
> Instead of X86_FEATURE_XENPV we could introduce something like
> X86_FEATURE_PAT_READONLY, which could be set for Xen PV guests and for
> TDX guests.

Until a third type comes which wants its pony to be pink and to dance.
:-\

Nah, we already have X86_FEATURE_TDX_GUEST - let's use that in
pat_bp_init() and exit there early and be done with it. This way it is
also self-documenting who can and cannot deal with that code.

Thx.
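
A sketch of that suggestion, assuming the pat_bp_init() introduced by this
patch (not a tested or committed change):

	void __init pat_bp_init(void)
	{
		if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
			/* Keep the TDX-module-provided PAT; no CR0.CD toggling. */
			rdmsrl(MSR_IA32_CR_PAT, pat_msr_val);
			init_cache_modes(pat_msr_val);
			return;
		}
		/* ... regular PAT MSR selection and reprogramming follows ... */
	}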
  
Kirill A. Shutemov Dec. 2, 2022, 2:33 p.m. UTC | #8
On Fri, Dec 02, 2022 at 02:39:58PM +0100, Juergen Gross wrote:
[...]
> > Technically, the MSR is writable on TDX. But it seems there's no way to
> > properly change it while following the protocol for such changes on MP systems.
> 
> Why not? I don't see why it is possible in a non-TDX system, but not within
> a TDX guest.

Because the protocol you described below requires setting CR0.CD, which is
not allowed in a TDX guest and causes #VE.

> > Although, I don't quite follow what role cache disabling plays on a system
> > with self-snoop support. Hm?
> 
> It is the recommended way to do it. See SDM Vol. 3 Chapter 11 ("Memory Cache
> Control"):
> 
[...]
> 
> So cache disabling is recommended.

Yeah, I read that.

But the question is what kind of scenario cache disabling actually
prevents if self-snoop is supported. In this case the cache stays intact (no
WBINVD), and the next time a cache line gets accessed with a different
caching mode the old line gets snooped, right?

Would it be valid to avoid touching CR0.CD if X86_FEATURE_SELFSNOOP?
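
In code, the question amounts to whether a guard like this in the
cache-disable path would be valid; the sketch's validity is exactly what
is being asked here:

	/* Only non-self-snooping CPUs enter no-fill mode and flush caches. */
	if (!boot_cpu_has(X86_FEATURE_SELFSNOOP)) {
		write_cr0((read_cr0() | X86_CR0_CD) & ~X86_CR0_NW);
		wbinvd();
	}
	/* TLB flushing and the MTRR/PAT updates would still run. */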
  
Juergen Gross Dec. 2, 2022, 2:56 p.m. UTC | #9
On 02.12.22 15:33, Kirill A. Shutemov wrote:
[...]
>> Why not? I don't see why it is possible in a non-TDX system, but not within
>> a TDX guest.
> 
> Because the protocol you described below requires setting CR0.CD, which is
> not allowed in a TDX guest and causes #VE.

Hmm, yes, seems to be a valid reason. :-)

[...]
> Yeah, I read that.
> 
> But the question is what kind of scenario cache disabling actually
> prevents if self-snoop is supported. In this case the cache stays intact (no
> WBINVD), and the next time a cache line gets accessed with a different
> caching mode the old line gets snooped, right?
> 
> Would it be valid to avoid touching CR0.CD if X86_FEATURE_SELFSNOOP?
> 

That's a question for the Intel architects, I guess.

I'd just ask them how to set up PAT in TDX guests. Either they need to
change the recommended setup sequence, or the PAT support bit needs to
be cleared IMO.


Juergen
  
Juergen Gross Dec. 5, 2022, 7:40 a.m. UTC | #10
On 02.12.22 15:56, Juergen Gross wrote:
[...]
> That's a question for the Intel architects, I guess.
> 
> I'd just ask them how to set up PAT in TDX guests. Either they need to
> change the recommended setup sequence, or the PAT support bit needs to
> be cleared IMO.

I've forwarded the question to Intel, BTW.

Another question to you: where does the initial PAT MSR value come from?
I guess from UEFI?


Juergen
  
Kirill A. Shutemov Dec. 5, 2022, 12:21 p.m. UTC | #11
On Mon, Dec 05, 2022 at 08:40:06AM +0100, Juergen Gross wrote:
> > That's a question for the Intel architects, I guess.
> > 
> > I'd just ask them how to set up PAT in TDX guests. Either they need to
> > change the recommended setup sequence, or the PAT support bit needs to
> > be cleared IMO.
> 
> I've forwarded the question to Intel, BTW.

I've initiated the talk internally too.

> Another question to you: where does the initial PAT MSR value come from?
> I guess from UEFI?

It is set by the TDX module on initialization. See Section 21.2.4.1.2,
"TD VMCS Guest MSRs", of the TDX module spec [1].

[1] https://cdrdv2.intel.com/v1/dl/getContent/733568
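
For completeness: each PAT slot is one byte of that MSR, so the value the
TDX module programs can be decoded the same way init_cache_modes() in the
patch below does. A sketch:

	u64 pat;
	int i;

	rdmsrl(MSR_IA32_CR_PAT, pat);	/* reset value set by the TDX module */
	for (i = 7; i >= 0; i--)	/* highest slot first, as in the patch */
		pr_info("PAT slot %d: type %llu\n", i, (pat >> (i * 8)) & 7);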
  

Patch

diff --git a/arch/x86/include/asm/memtype.h b/arch/x86/include/asm/memtype.h
index 9ca760e430b9..113b2fa51849 100644
--- a/arch/x86/include/asm/memtype.h
+++ b/arch/x86/include/asm/memtype.h
@@ -6,9 +6,8 @@ 
 #include <asm/pgtable_types.h>
 
 extern bool pat_enabled(void);
-extern void pat_disable(const char *reason);
-extern void pat_init(void);
-extern void init_cache_modes(void);
+extern void pat_bp_init(void);
+extern void pat_cpu_init(void);
 
 extern int memtype_reserve(u64 start, u64 end,
 		enum page_cache_mode req_pcm, enum page_cache_mode *ret_pcm);
diff --git a/arch/x86/kernel/cpu/cacheinfo.c b/arch/x86/kernel/cpu/cacheinfo.c
index a92099569617..1aaf830254df 100644
--- a/arch/x86/kernel/cpu/cacheinfo.c
+++ b/arch/x86/kernel/cpu/cacheinfo.c
@@ -1133,7 +1133,7 @@  static void cache_cpu_init(void)
 		mtrr_generic_set_state();
 
 	if (memory_caching_control & CACHE_PAT)
-		pat_init();
+		pat_cpu_init();
 
 	cache_enable();
 	local_irq_restore(flags);
@@ -1162,6 +1162,7 @@  static int cache_rendezvous_handler(void *unused)
 void __init cache_bp_init(void)
 {
 	mtrr_bp_init();
+	pat_bp_init();
 
 	if (memory_caching_control)
 		cache_cpu_init();
diff --git a/arch/x86/kernel/cpu/mtrr/mtrr.c b/arch/x86/kernel/cpu/mtrr/mtrr.c
index 99b6973a69b4..8403daf34158 100644
--- a/arch/x86/kernel/cpu/mtrr/mtrr.c
+++ b/arch/x86/kernel/cpu/mtrr/mtrr.c
@@ -725,7 +725,7 @@  void __init mtrr_bp_init(void)
 		if (mtrr_if == &generic_mtrr_ops) {
 			/* BIOS may override */
 			if (get_mtrr_state()) {
-				memory_caching_control |= CACHE_MTRR | CACHE_PAT;
+				memory_caching_control |= CACHE_MTRR;
 				changed_by_mtrr_cleanup = mtrr_cleanup(phys_addr);
 			} else {
 				mtrr_if = NULL;
@@ -733,16 +733,8 @@  void __init mtrr_bp_init(void)
 		}
 	}
 
-	if (!mtrr_enabled()) {
+	if (!mtrr_enabled())
 		pr_info("Disabled\n");
-
-		/*
-		 * PAT initialization relies on MTRR's rendezvous handler.
-		 * Skip PAT init until the handler can initialize both
-		 * features independently.
-		 */
-		pat_disable("MTRRs disabled, skipping PAT initialization too.");
-	}
 }
 
 /**
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index e0e185ee0229..aacaa96f0195 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1075,23 +1075,12 @@  void __init setup_arch(char **cmdline_p)
 	max_pfn = e820__end_of_ram_pfn();
 
 	/* update e820 for memory not covered by WB MTRRs */
-	if (IS_ENABLED(CONFIG_MTRR))
-		cache_bp_init();
-	else
-		pat_disable("PAT support disabled because CONFIG_MTRR is disabled in the kernel.");
-
+	cache_bp_init();
 	if (mtrr_trim_uncached_memory(max_pfn))
 		max_pfn = e820__end_of_ram_pfn();
 
 	max_possible_pfn = max_pfn;
 
-	/*
-	 * This call is required when the CPU does not support PAT. If
-	 * mtrr_bp_init() invoked it already via pat_init() the call has no
-	 * effect.
-	 */
-	init_cache_modes();
-
 	/*
 	 * Define random base addresses for memory sections after max_pfn is
 	 * defined and before each memory section base is used.
diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index 66a209f7eb86..9aab17d660cd 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -43,6 +43,7 @@ 
 #include <linux/rbtree.h>
 
 #include <asm/cacheflush.h>
+#include <asm/cacheinfo.h>
 #include <asm/processor.h>
 #include <asm/tlbflush.h>
 #include <asm/x86_init.h>
@@ -60,41 +61,34 @@ 
 #undef pr_fmt
 #define pr_fmt(fmt) "" fmt
 
-static bool __read_mostly pat_bp_initialized;
 static bool __read_mostly pat_disabled = !IS_ENABLED(CONFIG_X86_PAT);
-static bool __initdata pat_force_disabled = !IS_ENABLED(CONFIG_X86_PAT);
-static bool __read_mostly pat_bp_enabled;
-static bool __read_mostly pat_cm_initialized;
+static u64 __ro_after_init pat_msr_val;
 
 /*
  * PAT support is enabled by default, but can be disabled for
  * various user-requested or hardware-forced reasons:
  */
-void pat_disable(const char *msg_reason)
+static void __init pat_disable(const char *msg_reason)
 {
 	if (pat_disabled)
 		return;
 
-	if (pat_bp_initialized) {
-		WARN_ONCE(1, "x86/PAT: PAT cannot be disabled after initialization\n");
-		return;
-	}
-
 	pat_disabled = true;
 	pr_info("x86/PAT: %s\n", msg_reason);
+
+	memory_caching_control &= ~CACHE_PAT;
 }
 
 static int __init nopat(char *str)
 {
 	pat_disable("PAT support disabled via boot option.");
-	pat_force_disabled = true;
 	return 0;
 }
 early_param("nopat", nopat);
 
 bool pat_enabled(void)
 {
-	return pat_bp_enabled;
+	return !pat_disabled;
 }
 EXPORT_SYMBOL_GPL(pat_enabled);
 
@@ -192,7 +186,8 @@  enum {
 
 #define CM(c) (_PAGE_CACHE_MODE_ ## c)
 
-static enum page_cache_mode pat_get_cache_mode(unsigned pat_val, char *msg)
+static enum page_cache_mode __init pat_get_cache_mode(unsigned int pat_val,
+						      char *msg)
 {
 	enum page_cache_mode cache;
 	char *cache_mode;
@@ -219,14 +214,12 @@  static enum page_cache_mode pat_get_cache_mode(unsigned pat_val, char *msg)
  * configuration.
  * Using lower indices is preferred, so we start with highest index.
  */
-static void __init_cache_modes(u64 pat)
+static void __init init_cache_modes(u64 pat)
 {
 	enum page_cache_mode cache;
 	char pat_msg[33];
 	int i;
 
-	WARN_ON_ONCE(pat_cm_initialized);
-
 	pat_msg[32] = 0;
 	for (i = 7; i >= 0; i--) {
 		cache = pat_get_cache_mode((pat >> (i * 8)) & 7,
@@ -234,34 +227,9 @@  static void __init_cache_modes(u64 pat)
 		update_cache_mode_entry(i, cache);
 	}
 	pr_info("x86/PAT: Configuration [0-7]: %s\n", pat_msg);
-
-	pat_cm_initialized = true;
 }
 
-#define PAT(x, y)	((u64)PAT_ ## y << ((x)*8))
-
-static void pat_bp_init(u64 pat)
-{
-	u64 tmp_pat;
-
-	if (!boot_cpu_has(X86_FEATURE_PAT)) {
-		pat_disable("PAT not supported by the CPU.");
-		return;
-	}
-
-	rdmsrl(MSR_IA32_CR_PAT, tmp_pat);
-	if (!tmp_pat) {
-		pat_disable("PAT support disabled by the firmware.");
-		return;
-	}
-
-	wrmsrl(MSR_IA32_CR_PAT, pat);
-	pat_bp_enabled = true;
-
-	__init_cache_modes(pat);
-}
-
-static void pat_ap_init(u64 pat)
+void pat_cpu_init(void)
 {
 	if (!boot_cpu_has(X86_FEATURE_PAT)) {
 		/*
@@ -271,30 +239,39 @@  static void pat_ap_init(u64 pat)
 		panic("x86/PAT: PAT enabled, but not supported by secondary CPU\n");
 	}
 
-	wrmsrl(MSR_IA32_CR_PAT, pat);
+	wrmsrl(MSR_IA32_CR_PAT, pat_msr_val);
 }
 
-void __init init_cache_modes(void)
+/**
+ * pat_bp_init - Initialize the PAT MSR value and PAT table
+ *
+ * This function initializes PAT MSR value and PAT table with an OS-defined
+ * value to enable additional cache attributes, WC, WT and WP.
+ *
+ * This function prepares the calls of pat_cpu_init() via cache_cpu_init()
+ * on all CPUs.
+ */
+void __init pat_bp_init(void)
 {
-	u64 pat = 0;
+	struct cpuinfo_x86 *c = &boot_cpu_data;
+#define PAT(p0, p1, p2, p3, p4, p5, p6, p7)			\
+	(((u64)PAT_ ## p0) | ((u64)PAT_ ## p1 << 8) |		\
+	((u64)PAT_ ## p2 << 16) | ((u64)PAT_ ## p3 << 24) |	\
+	((u64)PAT_ ## p4 << 32) | ((u64)PAT_ ## p5 << 40) |	\
+	((u64)PAT_ ## p6 << 48) | ((u64)PAT_ ## p7 << 56))
 
-	if (pat_cm_initialized)
-		return;
 
-	if (boot_cpu_has(X86_FEATURE_PAT)) {
-		/*
-		 * CPU supports PAT. Set PAT table to be consistent with
-		 * PAT MSR. This case supports "nopat" boot option, and
-		 * virtual machine environments which support PAT without
-		 * MTRRs. In specific, Xen has unique setup to PAT MSR.
-		 *
-		 * If PAT MSR returns 0, it is considered invalid and emulates
-		 * as No PAT.
-		 */
-		rdmsrl(MSR_IA32_CR_PAT, pat);
-	}
+	if (!IS_ENABLED(CONFIG_X86_PAT))
+		pr_info_once("x86/PAT: PAT support disabled because CONFIG_X86_PAT is disabled in the kernel.\n");
+
+	if (!cpu_feature_enabled(X86_FEATURE_PAT))
+		pat_disable("PAT not supported by the CPU.");
+	else
+		rdmsrl(MSR_IA32_CR_PAT, pat_msr_val);
+
+	if (!pat_msr_val) {
+		pat_disable("PAT support disabled by the firmware.");
 
-	if (!pat) {
 		/*
 		 * No PAT. Emulate the PAT table that corresponds to the two
 		 * cache bits, PWT (Write Through) and PCD (Cache Disable).
@@ -313,40 +290,17 @@  void __init init_cache_modes(void)
 		 * NOTE: When WC or WP is used, it is redirected to UC- per
 		 * the default setup in __cachemode2pte_tbl[].
 		 */
-		pat = PAT(0, WB) | PAT(1, WT) | PAT(2, UC_MINUS) | PAT(3, UC) |
-		      PAT(4, WB) | PAT(5, WT) | PAT(6, UC_MINUS) | PAT(7, UC);
-	} else if (!pat_force_disabled && cpu_feature_enabled(X86_FEATURE_HYPERVISOR)) {
-		/*
-		 * Clearly PAT is enabled underneath. Allow pat_enabled() to
-		 * reflect this.
-		 */
-		pat_bp_enabled = true;
+		pat_msr_val = PAT(WB, WT, UC_MINUS, UC, WB, WT, UC_MINUS, UC);
 	}
 
-	__init_cache_modes(pat);
-}
-
-/**
- * pat_init - Initialize the PAT MSR and PAT table on the current CPU
- *
- * This function initializes PAT MSR and PAT table with an OS-defined value
- * to enable additional cache attributes, WC, WT and WP.
- *
- * This function must be called on all CPUs using the specific sequence of
- * operations defined in Intel SDM. mtrr_rendezvous_handler() provides this
- * procedure for PAT.
- */
-void pat_init(void)
-{
-	u64 pat;
-	struct cpuinfo_x86 *c = &boot_cpu_data;
-
-#ifndef CONFIG_X86_PAT
-	pr_info_once("x86/PAT: PAT support disabled because CONFIG_X86_PAT is disabled in the kernel.\n");
-#endif
-
-	if (pat_disabled)
+	/*
+	 * Xen PV doesn't allow to set PAT MSR, but all cache modes are
+	 * supported.
+	 */
+	if (pat_disabled || cpu_feature_enabled(X86_FEATURE_XENPV)) {
+		init_cache_modes(pat_msr_val);
 		return;
+	}
 
 	if ((c->x86_vendor == X86_VENDOR_INTEL) &&
 	    (((c->x86 == 0x6) && (c->x86_model <= 0xd)) ||
@@ -371,8 +325,7 @@  void pat_init(void)
 		 * NOTE: When WT or WP is used, it is redirected to UC- per
 		 * the default setup in __cachemode2pte_tbl[].
 		 */
-		pat = PAT(0, WB) | PAT(1, WC) | PAT(2, UC_MINUS) | PAT(3, UC) |
-		      PAT(4, WB) | PAT(5, WC) | PAT(6, UC_MINUS) | PAT(7, UC);
+		pat_msr_val = PAT(WB, WC, UC_MINUS, UC, WB, WC, UC_MINUS, UC);
 	} else {
 		/*
 		 * Full PAT support.  We put WT in slot 7 to improve
@@ -400,19 +353,14 @@  void pat_init(void)
 		 * The reserved slots are unused, but mapped to their
 		 * corresponding types in the presence of PAT errata.
 		 */
-		pat = PAT(0, WB) | PAT(1, WC) | PAT(2, UC_MINUS) | PAT(3, UC) |
-		      PAT(4, WB) | PAT(5, WP) | PAT(6, UC_MINUS) | PAT(7, WT);
+		pat_msr_val = PAT(WB, WC, UC_MINUS, UC, WB, WP, UC_MINUS, WT);
 	}
 
-	if (!pat_bp_initialized) {
-		pat_bp_init(pat);
-		pat_bp_initialized = true;
-	} else {
-		pat_ap_init(pat);
-	}
-}
+	memory_caching_control |= CACHE_PAT;
 
+	init_cache_modes(pat_msr_val);
 #undef PAT
+}
 
 static DEFINE_SPINLOCK(memtype_lock);	/* protects memtype accesses */