[v2,0/8] x86/mtrr: fix handling with PAT but without MTRR

Message ID	20230209072220.6836-1-jgross@suse.com
Headers
Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org
 designates 2620:137:e000::1:20 as permitted sender)
 client-ip=2620:137:e000::1:20;
From: Juergen Gross <jgross@suse.com>
To: linux-kernel@vger.kernel.org, x86@kernel.org,
        linux-hyperv@vger.kernel.org
Cc: lists@nerdbynature.de, mikelley@microsoft.com,
        torvalds@linux-foundation.org, Juergen Gross <jgross@suse.com>,
        Thomas Gleixner <tglx@linutronix.de>,
        Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
        Dave Hansen <dave.hansen@linux.intel.com>,
        "H. Peter Anvin" <hpa@zytor.com>,
        "K. Y. Srinivasan" <kys@microsoft.com>,
        Haiyang Zhang <haiyangz@microsoft.com>,
        Wei Liu <wei.liu@kernel.org>, Dexuan Cui <decui@microsoft.com>,
        Boris Ostrovsky <boris.ostrovsky@oracle.com>,
        xen-devel@lists.xenproject.org, Andy Lutomirski <luto@kernel.org>,
        Peter Zijlstra <peterz@infradead.org>
Subject: [PATCH v2 0/8] x86/mtrr: fix handling with PAT but without MTRR
Date: Thu,  9 Feb 2023 08:22:12 +0100
Message-Id: <20230209072220.6836-1-jgross@suse.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series	x86/mtrr: fix handling with PAT but without MTRR
[v2,0/8] x86/mtrr: fix handling with PAT but without MTRR
[v2,1/8] x86/mtrr: split off physical address size calculation
[v2,2/8] x86/mtrr: support setting MTRR state for software defined MTRRs
[v2,3/8] x86/hyperv: set MTRR state when running as SEV-SNP Hyper-V guest
[v2,4/8] x86/xen: set MTRR state when running as Xen PV initial domain
[v2,5/8] x86/mtrr: revert commit 90b926e68f50
[v2,6/8] x86/mtrr: don't let mtrr_type_lookup() return MTRR_TYPE_INVALID
[v2,7/8] x86/mm: only check uniform after calling mtrr_type_lookup()
[v2,8/8] x86/mtrr: drop sanity check in mtrr_type_lookup_fixed()

Message

Juergen Gross Feb. 9, 2023, 7:22 a.m. UTC

This series tries to fix the rather special case of PAT being available
without MTRRs (either because CONFIG_MTRR is not set, or because the
feature has been disabled, e.g. by a hypervisor).

The main use cases are Xen PV guests and SEV-SNP guests running under
Hyper-V.

Instead of trying to work around all the issues by adding if statements
here and there, just try to use the complete available infrastructure
by setting up a read-only MTRR state when needed.

In the Xen PV case the current MTRR MSR values can be read from the
hypervisor, while for the SEV-SNP case all that is needed is to set the
default caching mode to "WB".
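
As an illustration of the idea described above, here is a minimal
user-space C sketch of a software-defined, read-only MTRR state. All names
in it (sw_range, sw_mtrr_state, sw_state_default_wb(), and so on) are
hypothetical and are not the interfaces added by this series; the
"first matching range wins" lookup is a deliberate simplification. The
memory type encodings (UC = 0, WB = 6) and the base/mask match rule are
the architectural ones.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Memory type encodings as used in the x86 MTRR MSRs. */
#define MT_UC 0
#define MT_WB 6

/* One variable range: addr matches if (addr & mask) == (base & mask). */
struct sw_range {
	uint64_t base;
	uint64_t mask;
	uint8_t  type;
};

/* Hypothetical software-defined MTRR state (not a kernel structure). */
struct sw_mtrr_state {
	const struct sw_range *var;	/* ranges supplied by the platform */
	size_t num_var;
	uint8_t def_type;		/* type for addresses no range matches */
	int read_only;			/* nonzero: state must not be modified */
};

/* SEV-SNP-like case: no ranges at all, everything defaults to WB. */
static struct sw_mtrr_state sw_state_default_wb(void)
{
	struct sw_mtrr_state s = { NULL, 0, MT_WB, 1 };

	return s;
}

/* Xen-PV-like case: the ranges were obtained from the hypervisor. */
static struct sw_mtrr_state sw_state_from_ranges(const struct sw_range *var,
						 size_t num_var,
						 uint8_t def_type)
{
	struct sw_mtrr_state s;

	assert(var != NULL || num_var == 0);
	s.var = var;
	s.num_var = num_var;
	s.def_type = def_type;
	s.read_only = 1;
	return s;
}

/* Attempts to change a read-only state are rejected with -1. */
static int sw_state_set_default(struct sw_mtrr_state *s, uint8_t type)
{
	if (s->read_only)
		return -1;
	s->def_type = type;
	return 0;
}

/* Single-address lookup; the first matching range wins (simplified). */
static uint8_t sw_lookup(const struct sw_mtrr_state *s, uint64_t addr)
{
	size_t i;

	for (i = 0; i < s->num_var; i++) {
		if ((addr & s->var[i].mask) == (s->var[i].base & s->var[i].mask))
			return s->var[i].type;
	}
	return s->def_type;
}
```

With a state like this in place, callers can use the normal lookup path
in both cases instead of special-casing "no MTRRs" at each call site.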

I have added more cleanup which was discussed when looking into
the most recent failures.

Note that I couldn't test the Hyper-V related change (patch 3).

Running on bare metal and with Xen didn't show any problems with the
series applied.

Changes in V2:
- replaced former patches 1+2 with new patches 1-4, avoiding especially
  the rather hacky approach of V1, while making all the MTRR type
  conflict tests available for the Xen PV case
- updated patch 6 (was patch 4 in V1)

Juergen Gross (8):
  x86/mtrr: split off physical address size calculation
  x86/mtrr: support setting MTRR state for software defined MTRRs
  x86/hyperv: set MTRR state when running as SEV-SNP Hyper-V guest
  x86/xen: set MTRR state when running as Xen PV initial domain
  x86/mtrr: revert commit 90b926e68f50
  x86/mtrr: don't let mtrr_type_lookup() return MTRR_TYPE_INVALID
  x86/mm: only check uniform after calling mtrr_type_lookup()
  x86/mtrr: drop sanity check in mtrr_type_lookup_fixed()

 arch/x86/include/asm/mtrr.h        |  9 +++-
 arch/x86/include/uapi/asm/mtrr.h   |  6 +--
 arch/x86/kernel/cpu/mshyperv.c     |  8 +++
 arch/x86/kernel/cpu/mtrr/generic.c | 56 +++++++++++++++++----
 arch/x86/kernel/cpu/mtrr/mtrr.c    | 79 ++++++++++++++++--------------
 arch/x86/mm/pat/memtype.c          |  3 +-
 arch/x86/mm/pgtable.c              |  6 +--
 arch/x86/xen/enlighten_pv.c        | 49 ++++++++++++++++++
 8 files changed, 159 insertions(+), 57 deletions(-)

Comments

Edgecombe, Rick P Feb. 11, 2023, 12:06 a.m. UTC | #1

On Thu, 2023-02-09 at 08:22 +0100, Juergen Gross wrote:
> This series tries to fix the rather special case of PAT being
> available
> without having MTRRs (either due to CONFIG_MTRR being not set, or
> because the feature has been disabled e.g. by a hypervisor).

debug_vm_pgtable fails in a KVM guest with CONFIG_MTRR=y. CONFIG_MTRR=n
succeeds.

[    0.830280] debug_vm_pgtable: [debug_vm_pgtable         ]: Validating architecture page table helpers
[    0.831906] ------------[ cut here ]------------
[    0.832711] WARNING: CPU: 0 PID: 1 at mm/debug_vm_pgtable.c:461 debug_vm_pgtable+0xb9a/0xe16
[    0.833998] Modules linked in:
[    0.834450] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.2.0-rc7+ #2366
[    0.835462] RIP: 0010:debug_vm_pgtable+0xb9a/0xe16
[    0.836217] Code: e2 3a 73 4a 48 c7 00 00 00 00 00 48 8b b4 24 a0 00 00 00 48 8b 54 24 60 48 8b 7c 24 20 48 c4
[    0.839068] RSP: 0000:ffffc90000013de0 EFLAGS: 00010246
[    0.839735] RAX: 0000000000000000 RBX: ffff888100048868 RCX: bffffffffffffff0
[    0.840646] RDX: 0000000000000000 RSI: 0000000040000000 RDI: 0000000000000000
[    0.841661] RBP: ffff88810004d140 R08: 0000000000000000 R09: ffff888100280880
[    0.842625] R10: 0000000000000001 R11: 0000000000000001 R12: ffff888103810298
[    0.843574] R13: ffff888100048780 R14: ffffffff8282e099 R15: 0000000000000000
[    0.844524] FS:  0000000000000000(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000
[    0.845706] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.846499] CR2: ffff88813ffff000 CR3: 000000000222d001 CR4: 0000000000370ef0
[    0.847464] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.848432] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    0.849371] Call Trace:
[    0.849699]  <TASK>
[    0.849997]  ? destroy_args+0x131/0x131
[    0.850487]  do_one_initcall+0x61/0x250
[    0.850983]  ? rdinit_setup+0x2c/0x2c
[    0.851451]  kernel_init_freeable+0x18e/0x1d8
[    0.852033]  ? rest_init+0x130/0x130
[    0.852533]  kernel_init+0x16/0x120
[    0.853035]  ret_from_fork+0x1f/0x30
[    0.853507]  </TASK>
[    0.853803] ---[ end trace 0000000000000000 ]---
[    0.854421] ------------[ cut here ]------------
[    0.855027] WARNING: CPU: 0 PID: 1 at mm/debug_vm_pgtable.c:462 debug_vm_pgtable+0xbaa/0xe16
[    0.856115] Modules linked in:
[    0.856517] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G        W          6.2.0-rc7+ #2366
[    0.857562] RIP: 0010:debug_vm_pgtable+0xbaa/0xe16
[    0.858186] Code: 00 00 00 48 8b 54 24 60 48 8b 7c 24 20 48 c1 e6 0c e8 79 18 7f fe 85 c0 75 02 0f 0b 48 8b 7b
[    0.860778] RSP: 0000:ffffc90000013de0 EFLAGS: 00010246
[    0.861519] RAX: 0000000000000000 RBX: ffff888100048868 RCX: bffffffffffffff0
[    0.862530] RDX: 0000000000000000 RSI: 0000000040000000 RDI: ffff88810380e7f8
[    0.863522] RBP: ffff88810004d140 R08: 0000000000000000 R09: ffff888100280880
[    0.864449] R10: 0000000000000001 R11: 0000000000000001 R12: ffff888103810298
[    0.865454] R13: ffff888100048780 R14: ffffffff8282e099 R15: 0000000000000000
[    0.866401] FS:  0000000000000000(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000
[    0.867438] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.868181] CR2: ffff88813ffff000 CR3: 000000000222d001 CR4: 0000000000370ef0
[    0.869097] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.870026] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    0.870943] Call Trace:
[    0.871259]  <TASK>
[    0.871537]  ? destroy_args+0x131/0x131
[    0.872030]  do_one_initcall+0x61/0x250
[    0.872521]  ? rdinit_setup+0x2c/0x2c
[    0.873005]  kernel_init_freeable+0x18e/0x1d8
[    0.873607]  ? rest_init+0x130/0x130
[    0.874116]  kernel_init+0x16/0x120
[    0.874618]  ret_from_fork+0x1f/0x30
[    0.875123]  </TASK>
[    0.875411] ---[ end trace 0000000000000000 ]---

Juergen Gross Feb. 13, 2023, 6:12 a.m. UTC | #2

On 11.02.23 01:06, Edgecombe, Rick P wrote:
> On Thu, 2023-02-09 at 08:22 +0100, Juergen Gross wrote:
>> This series tries to fix the rather special case of PAT being
>> available
>> without having MTRRs (either due to CONFIG_MTRR being not set, or
>> because the feature has been disabled e.g. by a hypervisor).
> 
> debug_vm_pgtable fails in a KVM guest with CONFIG_MTRR=y. CONFIG_MTRR=n
> succeeds.
> 
> [...]

Thanks for the report.

I'll have a look. Probably I'll need to re-add the check for WB in patch 7.


Juergen

Edgecombe, Rick P Feb. 13, 2023, 6:21 p.m. UTC | #3

On Mon, 2023-02-13 at 07:12 +0100, Juergen Gross wrote:
> 
> Thanks for the report.
> 
> I'll have a look. Probably I'll need to re-add the check for WB in
> patch 7.

Sure, let me know if you need any more details about my setup.

Juergen Gross Feb. 15, 2023, 8:25 a.m. UTC | #4

On 13.02.23 19:21, Edgecombe, Rick P wrote:
> On Mon, 2023-02-13 at 07:12 +0100, Juergen Gross wrote:
>>
>> Thanks for the report.
>>
>> I'll have a look. Probably I'll need to re-add the check for WB in
>> patch 7.
> 
> Sure, let me know if you need any more details about my setup.

I have reproduced the issue.

Adding back the test for WB will fix it, but I'm not sure this is really
what I should do.

The problem arises in case a large mapping is spanning multiple MTRRs,
even if they define the same caching type (uniform is set to 0 in this
case).

So the basic question for me is: shouldn't the semantics of uniform be
adapted? Today it means "the range is covered by only one MTRR or by
none". Looking at the use cases I'm wondering whether it shouldn't be
"the whole range has the same caching type".

Thoughts?

Juergen
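
The two readings of "uniform" discussed in this message can be contrasted
with a small user-space C sketch. This is not the kernel's
mtrr_type_lookup() code: the names (range, type_at(),
uniform_one_or_none(), uniform_same_type()) are hypothetical, ranges are
modelled as plain [start, end) intervals instead of base/mask pairs, and
uniform_one_or_none() is only one possible reading of "covered by only one
MTRR or by none". Overlaps are resolved with the documented precedence
(identical types keep the type, UC wins, WT wins over WB); for
combinations the architecture leaves undefined the sketch picks UC, which
is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MT_UC   0
#define MT_WT   4
#define MT_WB   6
#define MT_NONE 0xFF	/* hypothetical marker: no range matched so far */

struct range {
	uint64_t start;	/* inclusive */
	uint64_t end;	/* exclusive */
	uint8_t  type;
};

/* Combine the types of two overlapping ranges. */
static uint8_t combine(uint8_t a, uint8_t b)
{
	if (a == MT_NONE || a == b)
		return b;
	if (a == MT_UC || b == MT_UC)
		return MT_UC;
	if ((a == MT_WT && b == MT_WB) || (a == MT_WB && b == MT_WT))
		return MT_WT;
	return MT_UC;	/* undefined by the architecture; sketch picks UC */
}

/* Effective type of a single address. */
static uint8_t type_at(const struct range *r, size_t n, uint8_t def,
		       uint64_t addr)
{
	uint8_t t = MT_NONE;
	size_t i;

	for (i = 0; i < n; i++)
		if (addr >= r[i].start && addr < r[i].end)
			t = combine(t, r[i].type);
	return t == MT_NONE ? def : t;
}

/* Current meaning: covered by exactly one range, or touched by none. */
static int uniform_one_or_none(const struct range *r, size_t n,
			       uint64_t start, uint64_t end)
{
	size_t i, touching = 0, covering = 0;

	for (i = 0; i < n; i++) {
		if (r[i].end <= start || r[i].start >= end)
			continue;
		touching++;
		if (r[i].start <= start && r[i].end >= end)
			covering++;
	}
	return touching == 0 || (touching == 1 && covering == 1);
}

/* Proposed meaning: every address in [start, end) has the same type. */
static int uniform_same_type(const struct range *r, size_t n, uint8_t def,
			     uint64_t start, uint64_t end)
{
	uint8_t first;
	size_t i;

	assert(start < end);
	first = type_at(r, n, def, start);
	/* The effective type can only change at a range boundary. */
	for (i = 0; i < n; i++) {
		if (r[i].start > start && r[i].start < end &&
		    type_at(r, n, def, r[i].start) != first)
			return 0;
		if (r[i].end > start && r[i].end < end &&
		    type_at(r, n, def, r[i].end) != first)
			return 0;
	}
	return 1;
}
```

For a 1 GiB mapping that spans two adjacent 512 MiB WB ranges,
uniform_one_or_none() reports 0 while uniform_same_type() reports 1,
which is the situation described above.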

Linus Torvalds Feb. 15, 2023, 11:22 p.m. UTC | #5

On Wed, Feb 15, 2023 at 12:25 AM Juergen Gross <jgross@suse.com> wrote:
>
> The problem arises in case a large mapping is spanning multiple MTRRs,
> even if they define the same caching type (uniform is set to 0 in this
> case).

Oh, I think then you should fix uniform to be 1.

IOW, we should not think "multiple MTRRs" means "non-uniform". Only
"different actual memory types" should mean non-uniformity.

If I remember correctly, there were good reasons to have overlapping
MTRR's. In fact, you can generate a single MTRR that described a
memory type that wasn't even contiguous if you had odd memory setups.

Intel definitely defines how overlapping MTRR's work, and "same types
overlaps" is documented as a real thing.

            Linus
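
The remark about a single MTRR describing a non-contiguous region follows
from the way a variable range is matched: an address belongs to the range
if (addr AND PhysMask) == (PhysBase AND PhysMask). The user-space C sketch
below only evaluates that formula; the names (var_mtrr, mtrr_matches(),
count_pages()) are hypothetical, and the example mask with a hole is
chosen purely for illustration. The sketch makes no claim about whether
any particular CPU or kernel accepts such a mask.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_4K 0x1000ULL

/* Simplified variable MTRR: only the address bits of base and mask. */
struct var_mtrr {
	uint64_t base;
	uint64_t mask;
};

/* The match rule: compare only the address bits selected by the mask. */
static int mtrr_matches(const struct var_mtrr *m, uint64_t addr)
{
	return (addr & m->mask) == (m->base & m->mask);
}

/* Count the 4 KiB pages in [start, end) that the register matches. */
static unsigned int count_pages(const struct var_mtrr *m,
				uint64_t start, uint64_t end)
{
	unsigned int pages = 0;
	uint64_t addr;

	assert((start & (PAGE_SIZE_4K - 1)) == 0 && start <= end);
	for (addr = start; addr < end; addr += PAGE_SIZE_4K)
		if (mtrr_matches(m, addr))
			pages++;
	return pages;
}
```

With base 0 and mask 0xFFFFFFFFFFFFD000 (bit 13 clear, bit 12 set), the
pages at 0x0000 and 0x2000 match while the page at 0x1000 between them
does not, so the matched region has a hole.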

Juergen Gross Feb. 16, 2023, 5:35 a.m. UTC | #6

On 16.02.23 00:22, Linus Torvalds wrote:
> On Wed, Feb 15, 2023 at 12:25 AM Juergen Gross <jgross@suse.com> wrote:
>>
>> The problem arises in case a large mapping is spanning multiple MTRRs,
>> even if they define the same caching type (uniform is set to 0 in this
>> case).
> 
> Oh, I think then you should fix uniform to be 1.
> 
> IOW, we should not think "multiple MTRRs" means "non-uniform". Only
> "different actual memory types" should mean non-uniformity.

Thanks for confirmation. I completely agree.

> If I remember correctly, there were good reasons to have overlapping
> MTRR's. In fact, you can generate a single MTRR that described a
> memory type that wasn't even contiguous if you had odd memory setups.
> 
> Intel definitely defines how overlapping MTRR's work, and "same types
> overlaps" is documented as a real thing.

Yes. And it is handled incorrectly in the current code.

Handling it correctly will require quite some reworking of the code,
which I've already started to work on. I will defer the pud_set_huge()/
pmd_set_huge() modifying patch to after this rework.


Juergen