[v5,11/15] x86/mtrr: construct a memory map with cache modes

Message ID 20230401063652.23522-12-jgross@suse.com
State New
Series x86/mtrr: fix handling with PAT but without MTRR

Commit Message

Juergen Gross April 1, 2023, 6:36 a.m. UTC
  After MTRR initialization, construct a memory map with cache modes from
the MTRR values. This will speed up lookups via mtrr_lookup_type(),
especially in case of overlapping MTRRs.

This will be needed when switching the semantics of the "uniform"
parameter of mtrr_lookup_type() from "only covered by one MTRR" to
"memory range has a uniform cache mode", which is what the callers really
want to know. Today this information is not easily available in case the
MTRRs are not sorted by base address.

The map will be built in __initdata. When memory management is up, the
map will be moved to dynamically allocated memory in order to avoid the
need for an overly large array. The size of this array is calculated from
the number of variable MTRR registers and the size needed for the fixed
entries.
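
Concretely, the worst-case bound and the run-time size end up as follows
(both helpers are part of the patch below, shown here only for
illustration):

/* Worst case: no two adjacent ranges share the same cache mode. */
#define CACHE_MAP_MAX	(MTRR_NUM_FIXED_RANGES + MTRR_MAX_VAR_RANGES * 2)

/* Exact size once the number of variable MTRRs and TOM2 are known. */
static unsigned int __init get_cache_map_size(void)
{
	return cache_map_fixed + 2 * num_var_ranges + (mtrr_tom2 != 0);
}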

Only add the map creation and expansion for now. The lookup will be
added later.
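
For illustration only (the real lookup is added by a later patch in this
series, and map_lookup_sketch() is a made-up name), a lookup over such a
sorted, non-overlapping map could look roughly like this:

/*
 * Sketch: return a cache type for [start, end) and set *uniform to
 * indicate whether the whole range shares one cache mode.  Parts not
 * covered by any map entry fall back to mtrr_state.def_type.  When not
 * uniform, the returned type is only that of the last part looked at.
 */
static u8 map_lookup_sketch(u64 start, u64 end, bool *uniform)
{
	u8 type = MTRR_TYPE_INVALID;
	unsigned int i;

	*uniform = true;

	for (i = 0; i < cache_map_n && start < end; i++) {
		/* Map is sorted, so entries below the range can be skipped. */
		if (cache_map[i].end <= start)
			continue;
		if (cache_map[i].start >= end)
			break;

		/* A hole before this entry is covered by the default type. */
		if (start < cache_map[i].start &&
		    cache_map[i].type != mtrr_state.def_type)
			*uniform = false;

		if (type != MTRR_TYPE_INVALID && type != cache_map[i].type)
			*uniform = false;

		type = cache_map[i].type;
		start = cache_map[i].end;
	}

	/* Rest of the range (if any) is not covered by any map entry. */
	if (start < end) {
		if (type != MTRR_TYPE_INVALID && type != mtrr_state.def_type)
			*uniform = false;
		if (type == MTRR_TYPE_INVALID)
			type = mtrr_state.def_type;
	}

	return type;
}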

When writing new MTRR entries in the running system, rebuild the map
inside the call from mtrr_rendezvous_handler() in order to avoid nasty
race conditions with concurrent lookups.
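
In this patch that boils down to a small hook at the end of
generic_set_mtrr(), run while all CPUs are still caught in the rendezvous
(taken verbatim from the hunk below):

	/* On the first CPU rebuild the cache mode memory map. */
	if (smp_processor_id() == cpumask_first(cpu_online_mask))
		rebuild_map();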

Signed-off-by: Juergen Gross <jgross@suse.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
---
V3:
- new patch
V5:
- fix setting of mtrr_tom2
- change cache_map .type and .fixed to bitfields (Boris Petkov)
- use memmove() (Boris Petkov)
- a lot of comments (Boris Petkov)
- rewrite setting of merge bools (Boris Petkov)
- mark mtrr_build_map() as __init
- add pr_info() (Boris Petkov)
---
 arch/x86/kernel/cpu/mtrr/generic.c | 288 +++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/mtrr/mtrr.c    |  10 +-
 arch/x86/kernel/cpu/mtrr/mtrr.h    |   3 +
 3 files changed, 300 insertions(+), 1 deletion(-)
  

Comments

Borislav Petkov April 20, 2023, 12:15 p.m. UTC | #1
On Sat, Apr 01, 2023 at 08:36:48AM +0200, Juergen Gross wrote:
> +static void rm_map_entry_at(int idx)
> +{
> +	cache_map_n--;

Let's not call memmove() when cache_map_n == idx.

Below too.

Especially since cache_map + idx + 1 is not valid and I wouldn't want it
getting prefetched from %rsi in the hw when there's no reason for it.
Also, the RET, even from a function which doesn't do anything, costs.

> +	memmove(cache_map + idx, cache_map + idx + 1,
> +		sizeof(*cache_map) * (cache_map_n - idx));
> +}
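
A guarded version would look something like this (sketch only, not the
posted code):

static void rm_map_entry_at(int idx)
{
	cache_map_n--;
	/* Nothing to move when the last entry is being removed. */
	if (cache_map_n > idx)
		memmove(cache_map + idx, cache_map + idx + 1,
			sizeof(*cache_map) * (cache_map_n - idx));
}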

Ok, first weird issue I encountered while playing with my carved out
program to exercise this cache_map handling thing. I can share it if you
want it - it is ugly but it works.

So while rebuilding the map, I have these current regions in the map, at
one point in time:

Current map:
0: start: 0x0000000000000000, end: 0x0000000000100000, type: 0x0
1: start: 0x0000000100000000, end: 0x0000000820000000, type: 0x6
2: start: 0x000002f10000c000, end: 0x000003bf0000c000, type: 0x2
3: start: 0x000003bf0000c000, end: 0x00000d4b0000c000, type: 0x1
4: start: 0x00000d4b0000c000, end: 0x00019fc000001000, type: 0x0
5: start: 0x00019fc000001000, end: 0x0001df2d00001000, type: 0x2

note entry #3.

Now the next one it inserts is:

add_map_entry_at: start: 0x3bf0000c000, end: 0xd4b0000c000, type: 0x0, idx: 3
 merge_prev 0: prev->fixed: 0, prev->end: 0x3bf0000c000, prev->type: 0x2
 merge_next: 1, next->fixed: 0, next->start: 0xd4b0000c000, next->type: 0x0
add_map_entry_at: ret: 1

Note how it is the same as entry number #3 - just a different type.

What it ends up doing is, it simply overwrites the previous one and
merges it with the next:

Current map:
0: start: 0x0000000000000000, end: 0x0000000000100000, type: 0x0
1: start: 0x0000000100000000, end: 0x0000000820000000, type: 0x6
2: start: 0x000002f10000c000, end: 0x000003bf0000c000, type: 0x2
3: start: 0x000003bf0000c000, end: 0x00019fc000001000, type: 0x0
4: start: 0x00019fc000001000, end: 0x0001df2d00001000, type: 0x2

Now I think right about now we should've screamed loudly.

I know I know, this should never happen, right? And firmware programming
those MTRRs doesn't make mistakes...

However, we should be loud here and scream when a MTRR range disappears
like that.

Right?
  
Juergen Gross April 20, 2023, 12:30 p.m. UTC | #2
On 20.04.23 14:15, Borislav Petkov wrote:
> On Sat, Apr 01, 2023 at 08:36:48AM +0200, Juergen Gross wrote:
>> +static void rm_map_entry_at(int idx)
>> +{
>> +	cache_map_n--;
> 
> Let's not call memmove() when cache_map_n == idx.
> 
> Below too.
> 
> Especially since cache_map + idx + 1 is not valid and I wouldn't want it
> getting prefetched from %rsi in the hw when there's no reason for it and
> also the RET even from a function which doesn't do anything, costs.

OTOH the additional compare to 0 has costs, too, and this cost is spent for
ALL calls, while the zero size call is a rather rare case.

Regarding "cache_map + idx + 1 is not valid": the standard clearly points
out that a call with size 0 is valid and won't copy anything [1].

> 
>> +	memmove(cache_map + idx, cache_map + idx + 1,
>> +		sizeof(*cache_map) * (cache_map_n - idx));
>> +}
> 
> Ok, first weird issue I encountered while playing with my carved out
> program to exercise this cache_map handling thing. I can share it if you
> want it - it is ugly but it works.
> 
> So while rebuilding the map, I have these current regions in the map, at
> one point in time:
> 
> Current map:
> 0: start: 0x0000000000000000, end: 0x0000000000100000, type: 0x0
> 1: start: 0x0000000100000000, end: 0x0000000820000000, type: 0x6
> 2: start: 0x000002f10000c000, end: 0x000003bf0000c000, type: 0x2
> 3: start: 0x000003bf0000c000, end: 0x00000d4b0000c000, type: 0x1
> 4: start: 0x00000d4b0000c000, end: 0x00019fc000001000, type: 0x0
> 5: start: 0x00019fc000001000, end: 0x0001df2d00001000, type: 0x2
> 
> note entry #3.
> 
> Now the next one it inserts is:
> 
> add_map_entry_at: start: 0x3bf0000c000, end: 0xd4b0000c000, type: 0x0, idx: 3
>   merge_prev 0: prev->fixed: 0, prev->end: 0x3bf0000c000, prev->type: 0x2
>   merge_next: 1, next->fixed: 0, next->start: 0xd4b0000c000, next->type: 0x0
> add_map_entry_at: ret: 1
> 
> Note how it is the same as entry number #3 - just a different type.
> 
> What it ends up doing is, it simply overwrites the previous one and
> merges it with the next:
> 
> Current map:
> 0: start: 0x0000000000000000, end: 0x0000000000100000, type: 0x0
> 1: start: 0x0000000100000000, end: 0x0000000820000000, type: 0x6
> 2: start: 0x000002f10000c000, end: 0x000003bf0000c000, type: 0x2
> 3: start: 0x000003bf0000c000, end: 0x00019fc000001000, type: 0x0
> 4: start: 0x00019fc000001000, end: 0x0001df2d00001000, type: 0x2

The map would reflect hardware behavior. Type 0 wins in case of overlapping
MTRRs.

> Now I think right about now we should've screamed loudly.

Now this is another requirement, right? Today's MTRR code wouldn't scream
either.

> I know I know, this should never happen, right? And firmware programming
> those MTRRs doesn't make mistakes...

At least we don't correct such mistakes today. Do you think we should change
that?

> However, we should be loud here and scream when a MTRR range disappears
> like that.
> 
> Right?

TBH, I don't know.


Juergen
  
Juergen Gross April 20, 2023, 12:30 p.m. UTC | #3
On 20.04.23 14:30, Juergen Gross wrote:
> On 20.04.23 14:15, Borislav Petkov wrote:
>> On Sat, Apr 01, 2023 at 08:36:48AM +0200, Juergen Gross wrote:
>>> +static void rm_map_entry_at(int idx)
>>> +{
>>> +    cache_map_n--;
>>
>> Let's not call memmove() when cache_map_n == idx.
>>
>> Below too.
>>
>> Especially since cache_map + idx + 1 is not valid and I wouldn't want it
>> getting prefetched from %rsi in the hw when there's no reason for it and
>> also the RET even from a function which doesn't do anything, costs.
> 
> OTOH the additional compare to 0 has costs, too, and this cost is spent for
> ALL calls, while the zero size call is a rather rare case.
> 
> Regarding "cache_map + idx + 1 is not valid": the standard clearly points
> out that a call with size 0 is valid and won't copy anything [1].
> 
>>
>>> +    memmove(cache_map + idx, cache_map + idx + 1,
>>> +        sizeof(*cache_map) * (cache_map_n - idx));
>>> +}
>>
>> Ok, first weird issue I encountered while playing with my carved out
>> program to exercise this cache_map handling thing. I can share it if you
>> want it - it is ugly but it works.
>>
>> So while rebuilding the map, I have these current regions in the map, at
>> one point in time:
>>
>> Current map:
>> 0: start: 0x0000000000000000, end: 0x0000000000100000, type: 0x0
>> 1: start: 0x0000000100000000, end: 0x0000000820000000, type: 0x6
>> 2: start: 0x000002f10000c000, end: 0x000003bf0000c000, type: 0x2
>> 3: start: 0x000003bf0000c000, end: 0x00000d4b0000c000, type: 0x1
>> 4: start: 0x00000d4b0000c000, end: 0x00019fc000001000, type: 0x0
>> 5: start: 0x00019fc000001000, end: 0x0001df2d00001000, type: 0x2
>>
>> note entry #3.
>>
>> Now the next one it inserts is:
>>
>> add_map_entry_at: start: 0x3bf0000c000, end: 0xd4b0000c000, type: 0x0, idx: 3
>>   merge_prev 0: prev->fixed: 0, prev->end: 0x3bf0000c000, prev->type: 0x2
>>   merge_next: 1, next->fixed: 0, next->start: 0xd4b0000c000, next->type: 0x0
>> add_map_entry_at: ret: 1
>>
>> Note how it is the same as entry number #3 - just a different type.
>>
>> What it ends up doing is, it simply overwrites the previous one and
>> merges it with the next:
>>
>> Current map:
>> 0: start: 0x0000000000000000, end: 0x0000000000100000, type: 0x0
>> 1: start: 0x0000000100000000, end: 0x0000000820000000, type: 0x6
>> 2: start: 0x000002f10000c000, end: 0x000003bf0000c000, type: 0x2
>> 3: start: 0x000003bf0000c000, end: 0x00019fc000001000, type: 0x0
>> 4: start: 0x00019fc000001000, end: 0x0001df2d00001000, type: 0x2
> 
> The map would reflect hardware behavior. Type 0 wins in case of overlapping
> MTRRs.
> 
>> Now I think right about now we should've screamed loudly.
> 
> Now this is another requirement, right? Today's MTRR code wouldn't scream
> either.
> 
>> I know I know, this should never happen, right? And firmware programming
>> those MTRRs doesn't make mistakes...
> 
> At least we don't correct such mistakes today. Do you think we should change
> that?
> 
>> However, we should be loud here and scream when a MTRR range disappears
>> like that.
>>
>> Right?
> 
> TBH, I don't know.

Bah, forgot the link:

[1]: https://lists.llvm.org/pipermail/llvm-dev/2007-October/011070.html


Juergen
  
Borislav Petkov April 20, 2023, 1:01 p.m. UTC | #4
On Thu, Apr 20, 2023 at 02:30:09PM +0200, Juergen Gross wrote:
> OTOH the additional compare to 0 has costs, too, and this cost is spent for
> ALL calls

I'll take the cost of a single

	cmpl    %edi, %edx

for a handful of entries any day of the week.

> while the zero size call is a rather rare case.

$ grep "memmove size" /tmp/mtrr.txt
memmove size 0
memmove size 0
memmove size 0
memmove size 0
memmove size 0
memmove size 0
memmove size 0
memmove size 0
memmove size 0
memmove size 0
memmove size 0

for - I admit a largely contrived map - but 5 entries only:

Current map:
0: start: 0x0000000000000000, end: 0x0000000000100000, type: 0x0
1: start: 0x0000000100000000, end: 0x0000000820000000, type: 0x6
2: start: 0x000050d70000c000, end: 0x000062a60000c000, type: 0x4
3: start: 0x000062a60000c000, end: 0x0001d6d200001000, type: 0x0
4: start: 0x0001d6d200001000, end: 0x0001dd8100001000, type: 0x6

> Regarding "cache_map + idx + 1 is not valid": the standard clearly points
> out that a call with size 0 is valid and won't copy anything

That's not what I meant. Please read again what I said:

"I wouldn't want it getting prefetched from %rsi in the hw when there's
no reason for it".

IOW, I don't want to put invalid values in hw registers if I can. Think
hw mitigations and *very* hungry prefetchers.

> > Current map:
> > 0: start: 0x0000000000000000, end: 0x0000000000100000, type: 0x0
> > 1: start: 0x0000000100000000, end: 0x0000000820000000, type: 0x6
> > 2: start: 0x000002f10000c000, end: 0x000003bf0000c000, type: 0x2
> > 3: start: 0x000003bf0000c000, end: 0x00019fc000001000, type: 0x0
> > 4: start: 0x00019fc000001000, end: 0x0001df2d00001000, type: 0x2
> 
> The map would reflect hardware behavior. Type 0 wins in case of overlapping
> MTRRs.

Type 0 is MTRR_TYPE_UNCACHABLE, 1 is MTRR_TYPE_WRCOMB.

"Uncacheable (UC)—Reads from, and writes to, UC memory are not cacheable. Reads from UC
memory cannot be speculative. Write-combining to UC memory is not
allowed."

That last sentence.

So if you have conflicting regions and one is WC, but something expects
it to be UC and that something especially does not want it to be WC
because writes to it must not be combined, then that clearly is a
misconfiguration, I'd say.

> Now this is another requirement, right? Today's MTRR code wouldn't scream
> either.

So are we fixing this right or only a little?

> At least we don't correct such mistakes today. Do you think we should change
> that?

I'm thinking, considering how often we've seen all those error messages
from mtrr_state_warn(), that we should at least warn when we encounter
inconsistencies.

This won't help with released BIOSes but it will help when they do new
BIOS verification and see those messages. People do use Linux for that
a lot and then they'll definitely look and address them.

And this is a pretty good goal to have, IMO.

Thx.
  
Juergen Gross April 20, 2023, 1:57 p.m. UTC | #5
On 20.04.23 15:01, Borislav Petkov wrote:
> On Thu, Apr 20, 2023 at 02:30:09PM +0200, Juergen Gross wrote:
>> OTOH the additional compare to 0 has costs, too, and this cost is spent for
>> ALL calls
> 
> I'll take the cost of a single
> 
> 	cmpl    %edi, %edx
> 
> for a handful of entries any day of the week.
> 
>> while the zero size call is a rather rare case.
> 
> $ grep "memmove size" /tmp/mtrr.txt
> memmove size 0
> memmove size 0
> memmove size 0
> memmove size 0
> memmove size 0
> memmove size 0
> memmove size 0
> memmove size 0
> memmove size 0
> memmove size 0
> memmove size 0
> 
> for - I admit a largely contrived map - but 5 entries only:
> 
> Current map:
> 0: start: 0x0000000000000000, end: 0x0000000000100000, type: 0x0
> 1: start: 0x0000000100000000, end: 0x0000000820000000, type: 0x6
> 2: start: 0x000050d70000c000, end: 0x000062a60000c000, type: 0x4
> 3: start: 0x000062a60000c000, end: 0x0001d6d200001000, type: 0x0
> 4: start: 0x0001d6d200001000, end: 0x0001dd8100001000, type: 0x6
> 
>> Regarding "cache_map + idx + 1 is not valid": the standard clearly points
>> out that a call with size 0 is valid and won't copy anything
> 
> That's not what I meant. Please read again what I said:
> 
> "I wouldn't want it getting prefetched from %rsi in the hw when there's
> no reason for it".

So you are suggesting that prefetching can happen across one wrongly
speculated branch, but not across two of them? And you are not worrying
about prefetches past the end of a copy with size > 0?

> IOW, I don't want to put invalid values in hw registers if I can. Think
> hw mitigations and *very* hungry prefetchers.
> 
>>> Current map:
>>> 0: start: 0x0000000000000000, end: 0x0000000000100000, type: 0x0
>>> 1: start: 0x0000000100000000, end: 0x0000000820000000, type: 0x6
>>> 2: start: 0x000002f10000c000, end: 0x000003bf0000c000, type: 0x2
>>> 3: start: 0x000003bf0000c000, end: 0x00019fc000001000, type: 0x0
>>> 4: start: 0x00019fc000001000, end: 0x0001df2d00001000, type: 0x2
>>
>> The map would reflect hardware behavior. Type 0 wins in case of overlapping
>> MTRRs.
> 
> Type 0 is MTRR_TYPE_UNCACHABLE, 1 is MTRR_TYPE_WRCOMB.
> 
> "Uncacheable (UC)—Reads from, and writes to, UC memory are not cacheable. Reads from UC
> memory cannot be speculative. Write-combining to UC memory is not
> allowed."
> 
> That last sentence.

"If two or more variable memory ranges match and one of the memory types is UC,
the UC memory type used."

So technically no problem, apart from lower performance.
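
For reference, that precedence is what the existing get_effective_type()
helper in generic.c encodes (reproduced here roughly, with added
comments):

static u8 get_effective_type(u8 type1, u8 type2)
{
	/* UC always wins over any other type. */
	if (type1 == MTRR_TYPE_UNCACHABLE || type2 == MTRR_TYPE_UNCACHABLE)
		return MTRR_TYPE_UNCACHABLE;

	/* WT wins over WB. */
	if ((type1 == MTRR_TYPE_WRBACK && type2 == MTRR_TYPE_WRTHROUGH) ||
	    (type1 == MTRR_TYPE_WRTHROUGH && type2 == MTRR_TYPE_WRBACK))
		return MTRR_TYPE_WRTHROUGH;

	/* Other combinations are inconsistent; the first type is used. */
	return type1;
}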

> So if you have conflicting regions and one is WC but then something is
> expecting it to be UC and that something doesn't want for it to
> *especially* to be WC because it should not combine writes to it, then
> that clearly is a misconfiguration, I'd say.

Would you be fine with adding that as an additional patch?

I believe if we really want that, then we should be able to disable such a
cleanup. So it should be an add-on anyway.

>> Now this is another requirement, right? Today's MTRR code wouldn't scream
>> either.
> 
> So are we fixing this right or only a little?

I'm not against adding such additional checks. I wouldn't like to force them
into this series right now, because we need this series for Hyper-V isolated
guest support.

>> At least we don't correct such mistakes today. Do you think we should change
>> that?
> 
> I'm thinking considering how often we've seen all those error messages
> from mtrr_state_warn(), we should at least warn when we encounter
> inconsistencies.

Just to say it explicitly: you are concerned about the case that a complete
MTRR is hidden beneath another one (e.g. a large UC MTRR hiding a smaller
MTRR with another type, or a variable MTRR being hidden by fixed MTRRs)?

> This won't help with released BIOSes but it will help when they do new
> BIOS verification and see those messages. People do use Linux for that
> a lot and then they'll definitely look and address them.

Okay.

> And this is a pretty good goal to have, IMO.

Probably, yes.


Juergen
  
Borislav Petkov April 20, 2023, 2:54 p.m. UTC | #6
On Thu, Apr 20, 2023 at 03:57:43PM +0200, Juergen Gross wrote:
> So you are suggesting that prefetching can happen across one wrong speculated
> branch, but not across two of them? And you are not worrying about prefetches
> past the end of a copy with size > 0?

Maybe it will, maybe it won't.

I am worried about calling a function unnecessarily. I'm worried about
calling the asm version __memmove() with zero length unnecessarily. I'm
worried about executing the return thunk unnecessarily:

ffffffff8104c749:       48 83 c4 28             add    $0x28,%rsp
ffffffff8104c74d:       e9 72 2a 9b 00          jmp    ffffffff819ff1c4 <__x86_return_thunk>
ffffffff8104c752:       31 c0                   xor    %eax,%eax
ffffffff8104c754:       e9 6b 2a 9b 00          jmp    ffffffff819ff1c4 <__x86_return_thunk>
ffffffff8104c759:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)

Just say that you don't want to do this simple check and I will do it
myself because I'm tired of debating.

> "If two or more variable memory ranges match and one of the memory types is UC,
> the UC memory type used."
> 
> So technically no problem, apart from lower performance.

How do you get from "Write-combining to UC memory is not allowed" to
"lower performance"?

Not allowed is not allowed. Geez.

> Would you be fine with adding that as an additional patch?
>
> I believe if we really want that, then we should be able to disable such a
> cleanup. So it should be an add-on anyway.

Sure, whatever.

> I'm not against adding such additional checks. I wouldn't like to force them
> into this series right now, because we need this series for Hyper-V isolated
> guest support.

We will add this series when it is ready. If Hyper-V needs it
immediately, they can take whatever they want and do whatever they want.

Or you can do a simpler fix for Hyper-V that goes before this, if you
want to address Hyper-V.

> Just to say it explicitly: you are concerned for the case that a complete
> MTRR is hidden beneath another one (e.g. a large UC MTRR hiding a smaller
> MTRR with another type, or a variable MTRR being hidden by fixed MTRRs)?

I am concerned about catching any and all inconsistencies with the MTRRs
and catching them right. If we're going to spend all this time on this,
then let's do it right, once and for all and do it in a manner that can
be improved in the future.

Thanks!
  
Juergen Gross April 20, 2023, 3:10 p.m. UTC | #7
On 20.04.23 16:54, Borislav Petkov wrote:
> On Thu, Apr 20, 2023 at 03:57:43PM +0200, Juergen Gross wrote:
>> So you are suggesting that prefetching can happen across one wrong speculated
>> branch, but not across two of them? And you are not worrying about prefetches
>> past the end of a copy with size > 0?
> 
> Maybe it will, maybe it won't.
> 
> I am worried about calling a function unnecessarily. I'm worried about
> calling the asm version __memmove() with zero length unnecessarily. I'm
> worried about executing the return thunk unnecessarily:
> 
> ffffffff8104c749:       48 83 c4 28             add    $0x28,%rsp
> ffffffff8104c74d:       e9 72 2a 9b 00          jmp    ffffffff819ff1c4 <__x86_return_thunk>
> ffffffff8104c752:       31 c0                   xor    %eax,%eax
> ffffffff8104c754:       e9 6b 2a 9b 00          jmp    ffffffff819ff1c4 <__x86_return_thunk>
> ffffffff8104c759:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)
> 
> Just say that you don't want to do this simple check and I will do it
> myself because I'm tired of debating.

I just want to make sure I understand your concerns and that the reasoning
is sane.

You seem to feel rather strong here, so I'll add the test.

> 
>> "If two or more variable memory ranges match and one of the memory types is UC,
>> the UC memory type used."
>>
>> So technically no problem, apart from lower performance.
> 
> How do you come from "Write-combining to UC memory is not allowed" to
> "lower performance"?
> 
> Not allowed is not allowed. Geez.

Yes. And using UC instead of WC usually means lower performance of writes.

>> Would you be fine with adding that as an additional patch?
>>
>> I believe if we really want that, then we should be able to disable such a
>> cleanup. So it should be an add-on anyway.
> 
> Sure, whatever.

Okay, thanks.

I think this will need another final loop over the MTRRs, checking against
the constructed map whether a MTRR is completely useless.
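
Something along these lines, maybe (just a sketch; mtrr_check_hidden() is
a made-up name and it only looks at the variable MTRRs):

/* Warn about variable MTRRs which don't show up in the cache map at all. */
static void __init mtrr_check_hidden(void)
{
	unsigned int i, j;
	u64 start, size;
	u8 type;

	for (i = 0; i < num_var_ranges; i++) {
		bool visible = false;

		type = get_var_mtrr_state(i, &start, &size);
		if (type == MTRR_TYPE_INVALID)
			continue;

		/* Does any part of the map still carry this MTRR's type? */
		for (j = 0; j < cache_map_n; j++) {
			if (cache_map[j].end <= start)
				continue;
			if (cache_map[j].start >= start + size)
				break;
			if (cache_map[j].type == type) {
				visible = true;
				break;
			}
		}

		if (!visible)
			pr_warn("MTRR %u is completely hidden by other MTRRs\n", i);
	}
}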

Another question: in case we detect such a hidden MTRR, should it be disabled
in order to have more MTRRs available for run-time adding?

> 
>> I'm not against adding such additional checks. I wouldn't like to force them
>> into this series right now, because we need this series for Hyper-V isolated
>> guest support.
> 
> We will add this series when they're ready. If Hyper-V needs them
> immediately they can take whatever they want and do whatever they want.
> 
> Or you can do a simpler fix for Hyper-V that goes before this, if you
> want to address Hyper-V.
> 
>> Just to say it explicitly: you are concerned for the case that a complete
>> MTRR is hidden beneath another one (e.g. a large UC MTRR hiding a smaller
>> MTRR with another type, or a variable MTRR being hidden by fixed MTRRs)?
> 
> I am concerned about catching any and all inconsistencies with the MTRRs
> and catching them right. If we're going to spend all this time on this,
> then let's do it right, once and for all and do it in a manner that can
> be improved in the future.

Okay.


Juergen
  
Borislav Petkov April 21, 2023, 11:23 a.m. UTC | #8
On Thu, Apr 20, 2023 at 05:10:03PM +0200, Juergen Gross wrote:
> I think this will need another final loop over the MTRRs to check against the
> constructed map if a MTRR is completely useless.

Yeah, I slept on it: I think there should be a patch on top which adds
debug output - disabled by default and controllable by adding
"mtrr=debug" on the cmdline or so - which dumps the cache map operations
(add/remove) and the final table.

The reason being, when this cache_map thing hits upstream, we would need
a way to debug any potential issues which people might report, so asking
them to do a "mtrr=debug" boot would be a good help.

And pls make the prints pr_info() and not pr_debug() because people
should not have to recompile in order to enable that.
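
Something like the following would do (sketch only; the final patch may
well look different, and Dprintk() is just a possible helper name):

static bool mtrr_debug;

static int __init mtrr_param_setup(char *str)
{
	int rc = 0;

	if (!str)
		return -EINVAL;
	if (!strcmp(str, "debug"))
		mtrr_debug = true;
	else
		rc = -EINVAL;

	return rc;
}
early_param("mtrr", mtrr_param_setup);

/* Gate the cache map dump prints on the new cmdline switch. */
#define Dprintk(x...)	do { if (mtrr_debug) pr_info(x); } while (0)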

> Another question: in case we detect such a hidden MTRR, should it be disabled
> in order to have more MTRRs available for run-time adding?

Let's not do anything for now and address this if really needed.

Thx.
  
Juergen Gross April 21, 2023, 2:35 p.m. UTC | #9
On 21.04.23 13:23, Borislav Petkov wrote:
> On Thu, Apr 20, 2023 at 05:10:03PM +0200, Juergen Gross wrote:
>> I think this will need another final loop over the MTRRs to check against the
>> constructed map if a MTRR is completely useless.
> 
> Yeah, I slept on it: so I think there should be a patch ontop which does
> add debug output - disabled by default and controllable by adding
> "mtrr=debug" on the cmdline or so - which dumps the cache map operations
> (add/remove) and the final table.

Okay.

> The reason being, when this cache_map thing hits upstream, we would need
> a way to debug any potential issues which people might report so asking
> them to do a "mtrr=debug" boot would be a good help.
> 
> And pls make the prints pr_info() and not pr_debug() because people
> should not have to recompile in order to enable that.

Agreed.

>> Another question: in case we detect such a hidden MTRR, should it be disabled
>> in order to have more MTRRs available for run-time adding?
> 
> Let's not do anything for now and address this if really needed.

Okay.


Juergen
  

Patch

diff --git a/arch/x86/kernel/cpu/mtrr/generic.c b/arch/x86/kernel/cpu/mtrr/generic.c
index 005f07ebb3a3..fe8238832095 100644
--- a/arch/x86/kernel/cpu/mtrr/generic.c
+++ b/arch/x86/kernel/cpu/mtrr/generic.c
@@ -33,6 +33,38 @@  static struct fixed_range_block fixed_range_blocks[] = {
 	{}
 };
 
+struct cache_map {
+	u64 start;
+	u64 end;
+	u64 flags;
+	u64 type:8;
+	u64 fixed:1;
+};
+
+/*
+ * CACHE_MAP_MAX is the maximum number of memory ranges in cache_map, where
+ * no 2 adjacent ranges have the same cache mode (those would be merged).
+ * The number is based on the worst case:
+ * - no two adjacent fixed MTRRs share the same cache mode
+ * - one variable MTRR is spanning a huge area with mode WB
+ * - 255 variable MTRRs with mode UC all overlap with the WB MTRR, creating 2
+ *   additional ranges each (result like "ababababa...aba" with a = WB, b = UC),
+ *   accounting for MTRR_MAX_VAR_RANGES * 2 - 1 range entries
+ * - a TOM2 area (even with overlapping an UC MTRR can't add 2 range entries
+ *   to the possible maximum, as it always starts at 4GB, thus it can't be in
+ *   the middle of that MTRR, unless that MTRR starts at 0, which would remove
+ *   the initial "a" from the "abababa" pattern above)
+ * The map won't contain ranges with no matching MTRR (those fall back to the
+ * default cache mode).
+ */
+#define CACHE_MAP_MAX	(MTRR_NUM_FIXED_RANGES + MTRR_MAX_VAR_RANGES * 2)
+
+static struct cache_map init_cache_map[CACHE_MAP_MAX] __initdata;
+static struct cache_map *cache_map __refdata = init_cache_map;
+static unsigned int cache_map_size = CACHE_MAP_MAX;
+static unsigned int cache_map_n;
+static unsigned int cache_map_fixed;
+
 static unsigned long smp_changes_mask;
 static int mtrr_state_set;
 u64 mtrr_tom2;
@@ -78,6 +110,20 @@  static u64 get_mtrr_size(u64 mask)
 	return size;
 }
 
+static u8 get_var_mtrr_state(unsigned int reg, u64 *start, u64 *size)
+{
+	struct mtrr_var_range *mtrr = mtrr_state.var_ranges + reg;
+
+	if (!(mtrr->mask_lo & MTRR_MASK_VALID))
+		return MTRR_TYPE_INVALID;
+
+	*start = (((u64)mtrr->base_hi) << 32) + (mtrr->base_lo & PAGE_MASK);
+	*size = get_mtrr_size((((u64)mtrr->mask_hi) << 32) +
+			      (mtrr->mask_lo & PAGE_MASK));
+
+	return mtrr->base_lo & MTRR_BASE_TYPE_MASK;
+}
+
 static u8 get_effective_type(u8 type1, u8 type2)
 {
 	if (type1 == MTRR_TYPE_UNCACHABLE || type2 == MTRR_TYPE_UNCACHABLE)
@@ -242,6 +288,244 @@  static u8 mtrr_type_lookup_variable(u64 start, u64 end, u64 *partial_end,
 	return mtrr_state.def_type;
 }
 
+static void rm_map_entry_at(int idx)
+{
+	cache_map_n--;
+	memmove(cache_map + idx, cache_map + idx + 1,
+		sizeof(*cache_map) * (cache_map_n - idx));
+}
+
+/*
+ * Add an entry into cache_map at a specific index.  Merges adjacent entries if
+ * appropriate.  Return the number of merges for correcting the scan index
+ * (this is needed as merging will reduce the number of entries, which will
+ * result in skipping entries in future iterations if the scan index isn't
+ * corrected).
+ * Note that the corrected index can never go below -1 (resulting in being 0 in
+ * the next scan iteration), as "2" is returned only if the current index is
+ * larger than zero.
+ */
+static int add_map_entry_at(u64 start, u64 end, u8 type, int idx)
+{
+	bool merge_prev = false, merge_next = false;
+
+	if (start >= end)
+		return 0;
+
+	if (idx > 0) {
+		struct cache_map *prev = cache_map + idx - 1;
+
+		if (!prev->fixed && start == prev->end && type == prev->type)
+			merge_prev = true;
+	}
+
+	if (idx < cache_map_n) {
+		struct cache_map *next = cache_map + idx;
+
+		if (!next->fixed && end == next->start && type == next->type)
+			merge_next = true;
+	}
+
+	if (merge_prev && merge_next) {
+		cache_map[idx - 1].end = cache_map[idx].end;
+		rm_map_entry_at(idx);
+		return 2;
+	}
+	if (merge_prev) {
+		cache_map[idx - 1].end = end;
+		return 1;
+	}
+	if (merge_next) {
+		cache_map[idx].start = start;
+		return 1;
+	}
+
+	/* Sanity check: the array should NEVER be too small! */
+	if (cache_map_n == cache_map_size) {
+		WARN(1, "MTRR cache mode memory map exhausted!\n");
+		cache_map_n = cache_map_fixed;
+		return 0;
+	}
+
+	memmove(cache_map + idx + 1, cache_map + idx,
+		sizeof(*cache_map) * (cache_map_n - idx));
+
+	cache_map[idx].start = start;
+	cache_map[idx].end = end;
+	cache_map[idx].type = type;
+	cache_map[idx].fixed = 0;
+	cache_map_n++;
+
+	return 0;
+}
+
+/* Clear a part of an entry. Return 1 if start of entry is still valid. */
+static int clr_map_range_at(u64 start, u64 end, int idx)
+{
+	int ret = start != cache_map[idx].start;
+	u64 tmp;
+
+	if (start == cache_map[idx].start && end == cache_map[idx].end) {
+		rm_map_entry_at(idx);
+	} else if (start == cache_map[idx].start) {
+		cache_map[idx].start = end;
+	} else if (end == cache_map[idx].end) {
+		cache_map[idx].end = start;
+	} else {
+		tmp = cache_map[idx].end;
+		cache_map[idx].end = start;
+		add_map_entry_at(end, tmp, cache_map[idx].type, idx + 1);
+	}
+
+	return ret;
+}
+
+/*
+ * Add MTRR to the map.  The current map is scanned and each part of the MTRR
+ * either overlapping with an existing entry or with a hole in the map is
+ * handled separately.
+ */
+static void add_map_entry(u64 start, u64 end, u8 type)
+{
+	u8 new_type, old_type;
+	u64 tmp;
+	int i;
+
+	for (i = 0; i < cache_map_n && start < end; i++) {
+		if (start >= cache_map[i].end)
+			continue;
+
+		if (start < cache_map[i].start) {
+			/* Region start has no overlap. */
+			tmp = min(end, cache_map[i].start);
+			i -= add_map_entry_at(start, tmp,  type, i);
+			start = tmp;
+			continue;
+		}
+
+		new_type = get_effective_type(type, cache_map[i].type);
+		old_type = cache_map[i].type;
+
+		if (cache_map[i].fixed || new_type == old_type) {
+			/* Cut off start of new entry. */
+			start = cache_map[i].end;
+			continue;
+		}
+
+		/* Handle only overlapping part of region. */
+		tmp = min(end, cache_map[i].end);
+		i += clr_map_range_at(start, tmp, i);
+		i -= add_map_entry_at(start, tmp, new_type, i);
+		start = tmp;
+	}
+
+	/* Add rest of region after last map entry (rest might be empty). */
+	add_map_entry_at(start, end, type, i);
+}
+
+/* Add variable MTRRs to cache map. */
+static void map_add_var(void)
+{
+	u64 start, size;
+	unsigned int i;
+	u8 type;
+
+	/*
+	 * Add AMD TOM2 MTRR.  Can't be added in mtrr_build_map(), as it needs
+	 * to be added again when rebuilding the map due to potentially having
+	 * moved as a result of variable MTRRs for memory below 4GB.
+	 */
+	if (mtrr_tom2) {
+		add_map_entry(BIT_ULL(32), mtrr_tom2, MTRR_TYPE_WRBACK);
+		cache_map[cache_map_n - 1].fixed = 1;
+	}
+
+	for (i = 0; i < num_var_ranges; i++) {
+		type = get_var_mtrr_state(i, &start, &size);
+		if (type != MTRR_TYPE_INVALID)
+			add_map_entry(start, start + size, type);
+	}
+}
+
+/* Rebuild map by replacing variable entries. */
+static void rebuild_map(void)
+{
+	cache_map_n = cache_map_fixed;
+
+	map_add_var();
+}
+
+static unsigned int __init get_cache_map_size(void)
+{
+	return cache_map_fixed + 2 * num_var_ranges + (mtrr_tom2 != 0);
+}
+
+/* Build the cache_map containing the cache modes per memory range. */
+void __init mtrr_build_map(void)
+{
+	u64 start, end, size;
+	unsigned int i;
+	u8 type;
+
+	if (!mtrr_state.enabled)
+		return;
+
+	/* Add fixed MTRRs, optimize for adjacent entries with same type. */
+	if (mtrr_state.enabled & MTRR_STATE_MTRR_FIXED_ENABLED) {
+		/*
+		 * Start with 64k size fixed entries, preset 1st one (hence the
+		 * loop below is starting with index 1).
+		 */
+		start = 0;
+		end = size = 0x10000;
+		type = mtrr_state.fixed_ranges[0];
+
+		for (i = 1; i < MTRR_NUM_FIXED_RANGES; i++) {
+			/* 8 64k entries, then 16 16k ones, rest 4k. */
+			if (i == 8 || i == 24)
+				size >>= 2;
+
+			if (mtrr_state.fixed_ranges[i] != type) {
+				add_map_entry(start, end, type);
+				start = end;
+				type = mtrr_state.fixed_ranges[i];
+			}
+			end += size;
+		}
+		add_map_entry(start, end, type);
+	}
+
+	/* Mark fixed, they take precedence. */
+	for (i = 0; i < cache_map_n; i++)
+		cache_map[i].fixed = 1;
+	cache_map_fixed = cache_map_n;
+
+	map_add_var();
+
+	pr_info("MTRR map: %u entries (%u fixed + %u variable; max %u), built from %u variable MTRRs\n",
+		cache_map_n, cache_map_fixed, cache_map_n - cache_map_fixed,
+		get_cache_map_size(), num_var_ranges + (mtrr_tom2 != 0));
+}
+
+/* Copy the cache_map from __initdata memory to dynamically allocated one. */
+void __init mtrr_copy_map(void)
+{
+	unsigned int new_size = get_cache_map_size();
+
+	if (!mtrr_state.enabled || !new_size) {
+		cache_map = NULL;
+		return;
+	}
+
+	mutex_lock(&mtrr_mutex);
+
+	cache_map = kcalloc(new_size, sizeof(*cache_map), GFP_KERNEL);
+	memmove(cache_map, init_cache_map, cache_map_n * sizeof(*cache_map));
+	cache_map_size = new_size;
+
+	mutex_unlock(&mtrr_mutex);
+}
+
 /**
  * mtrr_overwrite_state - set static MTRR state
  *
@@ -834,6 +1118,10 @@  static void generic_set_mtrr(unsigned int reg, unsigned long base,
 
 	cache_enable();
 	local_irq_restore(flags);
+
+	/* On the first CPU rebuild the cache mode memory map. */
+	if (smp_processor_id() == cpumask_first(cpu_online_mask))
+		rebuild_map();
 }
 
 int generic_validate_add_page(unsigned long base, unsigned long size,
diff --git a/arch/x86/kernel/cpu/mtrr/mtrr.c b/arch/x86/kernel/cpu/mtrr/mtrr.c
index 76f5b5e1128b..d44e4c2670cc 100644
--- a/arch/x86/kernel/cpu/mtrr/mtrr.c
+++ b/arch/x86/kernel/cpu/mtrr/mtrr.c
@@ -65,7 +65,7 @@  static bool mtrr_enabled(void)
 }
 
 unsigned int mtrr_usage_table[MTRR_MAX_VAR_RANGES];
-static DEFINE_MUTEX(mtrr_mutex);
+DEFINE_MUTEX(mtrr_mutex);
 
 u64 size_or_mask, size_and_mask;
 
@@ -660,6 +660,7 @@  void __init mtrr_bp_init(void)
 		/* Software overwrite of MTRR state, only for generic case. */
 		mtrr_calc_physbits(true);
 		init_table();
+		mtrr_build_map();
 		pr_info("MTRRs set to read-only\n");
 
 		return;
@@ -697,6 +698,7 @@  void __init mtrr_bp_init(void)
 			if (get_mtrr_state()) {
 				memory_caching_control |= CACHE_MTRR;
 				changed_by_mtrr_cleanup = mtrr_cleanup(phys_addr);
+				mtrr_build_map();
 			} else {
 				mtrr_if = NULL;
 				why = "by BIOS";
@@ -725,6 +727,12 @@  void mtrr_save_state(void)
 
 static int __init mtrr_init_finialize(void)
 {
+	/*
+	 * Map might exist if mtrr_overwrite_state() has been called or if
+	 * mtrr_enabled() returns true.
+	 */
+	mtrr_copy_map();
+
 	if (!mtrr_enabled())
 		return 0;
 
diff --git a/arch/x86/kernel/cpu/mtrr/mtrr.h b/arch/x86/kernel/cpu/mtrr/mtrr.h
index a3c362d3d5bf..6246a1d8650b 100644
--- a/arch/x86/kernel/cpu/mtrr/mtrr.h
+++ b/arch/x86/kernel/cpu/mtrr/mtrr.h
@@ -53,6 +53,7 @@  bool get_mtrr_state(void);
 
 extern u64 size_or_mask, size_and_mask;
 extern const struct mtrr_ops *mtrr_if;
+extern struct mutex mtrr_mutex;
 
 extern unsigned int num_var_ranges;
 extern u64 mtrr_tom2;
@@ -61,6 +62,8 @@  extern struct mtrr_state_type mtrr_state;
 void mtrr_state_warn(void);
 const char *mtrr_attrib_to_str(int x);
 void mtrr_wrmsr(unsigned, unsigned, unsigned);
+void mtrr_build_map(void);
+void mtrr_copy_map(void);
 
 /* CPU specific mtrr_ops vectors. */
 extern const struct mtrr_ops amd_mtrr_ops;