[v1,11/26] x86/sev: Invalidate pages from the direct map when adding them to the RMP table

Message ID 20231230161954.569267-12-michael.roth@amd.com
State New
Headers
Series Add AMD Secure Nested Paging (SEV-SNP) Initialization Support |

Commit Message

Michael Roth Dec. 30, 2023, 4:19 p.m. UTC
  From: Brijesh Singh <brijesh.singh@amd.com>

The integrity guarantee of SEV-SNP is enforced through the RMP table.
The RMP is used with standard x86 and IOMMU page tables to enforce
memory restrictions and page access rights. The RMP check is enforced as
soon as SEV-SNP is enabled globally in the system. When hardware
encounters an RMP-check failure, it raises a page-fault exception.

If the kernel uses a 2MB directmap mapping to write to an address, and
that 2MB range happens to contain a 4KB page that set to private in the
RMP table, that will also lead to a page-fault exception.

Prevent this by removing pages from the directmap prior to setting them
as private in the RMP table, and then re-adding them to the directmap
when they set back to shared.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Co-developed-by: Ashish Kalra <ashish.kalra@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
---
 arch/x86/virt/svm/sev.c | 58 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 56 insertions(+), 2 deletions(-)
  

Comments

Borislav Petkov Jan. 12, 2024, 7:48 p.m. UTC | #1
On Sat, Dec 30, 2023 at 10:19:39AM -0600, Michael Roth wrote:
>  static int rmpupdate(u64 pfn, struct rmp_state *state)
>  {
>  	unsigned long paddr = pfn << PAGE_SHIFT;
> -	int ret;
> +	int ret, level, npages;
>  
>  	if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
>  		return -ENODEV;
>  
> +	level = RMP_TO_PG_LEVEL(state->pagesize);
> +	npages = page_level_size(level) / PAGE_SIZE;
> +
> +	/*
> +	 * If the kernel uses a 2MB directmap mapping to write to an address,
> +	 * and that 2MB range happens to contain a 4KB page that set to private
> +	 * in the RMP table, an RMP #PF will trigger and cause a host crash.
> +	 *
> +	 * Prevent this by removing pages from the directmap prior to setting
> +	 * them as private in the RMP table.
> +	 */
> +	if (state->assigned && invalidate_direct_map(pfn, npages))
> +		return -EFAULT;

Well, the comment says one thing but the code does not do quite what the
text says:

* where is the check that pfn is part of the kernel direct map?

* where is the check that this address is part of a 2M PMD in the direct
map so that the situation is even given that you can have a 4K private
PTE there?

What this does is simply invalidate @npages unconditionally.

Then, __set_pages_np() already takes a number of pages and a start
address. Why is this thing iterating instead of sending *all* npages in
one go?

Yes, you'd need two new helpers, something like:

set_pages_present(struct page *page, int numpages)
set_pages_non_present(struct page *page, int numpages)

which call the lowlevel counterparts.

For some reason I remember asking this a while ago...
  
Dave Hansen Jan. 12, 2024, 8 p.m. UTC | #2
On 12/30/23 08:19, Michael Roth wrote:
> If the kernel uses a 2MB directmap mapping to write to an address, and
> that 2MB range happens to contain a 4KB page that set to private in the
> RMP table, that will also lead to a page-fault exception.

I thought we agreed long ago to just demote the whole direct map to 4k
on kernels that might need to act as SEV-SNP hosts.  That should be step
one and this can be discussed as an optimization later.
  
Borislav Petkov Jan. 12, 2024, 8:07 p.m. UTC | #3
On Fri, Jan 12, 2024 at 12:00:01PM -0800, Dave Hansen wrote:
> On 12/30/23 08:19, Michael Roth wrote:
> > If the kernel uses a 2MB directmap mapping to write to an address, and
> > that 2MB range happens to contain a 4KB page that set to private in the
> > RMP table, that will also lead to a page-fault exception.
> 
> I thought we agreed long ago to just demote the whole direct map to 4k
> on kernels that might need to act as SEV-SNP hosts.  That should be step
> one and this can be discussed as an optimization later.

What would be the disadvantage here? Higher TLB pressure when running
kernel code I guess...
  
Vlastimil Babka Jan. 12, 2024, 8:27 p.m. UTC | #4
On 1/12/24 21:07, Borislav Petkov wrote:
> On Fri, Jan 12, 2024 at 12:00:01PM -0800, Dave Hansen wrote:
>> On 12/30/23 08:19, Michael Roth wrote:
>> > If the kernel uses a 2MB directmap mapping to write to an address, and
>> > that 2MB range happens to contain a 4KB page that set to private in the
>> > RMP table, that will also lead to a page-fault exception.
>> 
>> I thought we agreed long ago to just demote the whole direct map to 4k
>> on kernels that might need to act as SEV-SNP hosts.  That should be step
>> one and this can be discussed as an optimization later.
> 
> What would be the disadvantage here? Higher TLB pressure when running
> kernel code I guess...

Yeah and last LSF/MM we concluded that it's not as a big disadvantage as we
previously thought https://lwn.net/Articles/931406/
  
Tom Lendacky Jan. 12, 2024, 8:28 p.m. UTC | #5
On 1/12/24 14:07, Borislav Petkov wrote:
> On Fri, Jan 12, 2024 at 12:00:01PM -0800, Dave Hansen wrote:
>> On 12/30/23 08:19, Michael Roth wrote:
>>> If the kernel uses a 2MB directmap mapping to write to an address, and
>>> that 2MB range happens to contain a 4KB page that set to private in the
>>> RMP table, that will also lead to a page-fault exception.

I thought there was also a desire to remove the direct map for any pages 
assigned to a guest as private, not just the case that the comment says. 
So updating the comment would probably the best action.

>>
>> I thought we agreed long ago to just demote the whole direct map to 4k
>> on kernels that might need to act as SEV-SNP hosts.  That should be step
>> one and this can be discussed as an optimization later.

Won't this accomplish that without actually demoting a lot of long-live 
kernel related mappings that would never be demoted? I don't think we need 
to demote the whole mapping to 4K.

Thanks,
Tom

> 
> What would be the disadvantage here? Higher TLB pressure when running
> kernel code I guess...
>
  
Dave Hansen Jan. 12, 2024, 8:37 p.m. UTC | #6
On 1/12/24 12:28, Tom Lendacky wrote:
> I thought there was also a desire to remove the direct map for any pages
> assigned to a guest as private, not just the case that the comment says.
> So updating the comment would probably the best action.

I'm not sure who desires that.

It's sloooooooow to remove things from the direct map.  There's almost
certainly a frequency cutoff where running the whole direct mapping as
4k is better than the cost of mapping/unmapping.

Actually, where _is_ the TLB flushing here?
  
Borislav Petkov Jan. 15, 2024, 9:01 a.m. UTC | #7
On Sat, Dec 30, 2023 at 10:19:39AM -0600, Michael Roth wrote:
> +	/*
> +	 * If the kernel uses a 2MB directmap mapping to write to an address,
> +	 * and that 2MB range happens to contain a 4KB page that set to private
> +	 * in the RMP table, an RMP #PF will trigger and cause a host crash.

Also:

diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index 7d294d1a620b..2ad83e7fb2da 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -415,8 +415,9 @@ static int rmpupdate(u64 pfn, struct rmp_state *state)
 
 	/*
 	 * If the kernel uses a 2MB directmap mapping to write to an address,
-	 * and that 2MB range happens to contain a 4KB page that set to private
-	 * in the RMP table, an RMP #PF will trigger and cause a host crash.
+	 * and that 2MB range happens to contain a 4KB page that has been set
+	 * to private in the RMP table, an RMP #PF will trigger and cause a
+	 * host crash.
 	 *
 	 * Prevent this by removing pages from the directmap prior to setting
 	 * them as private in the RMP table.
  
Borislav Petkov Jan. 15, 2024, 9:06 a.m. UTC | #8
On Fri, Jan 12, 2024 at 09:27:45PM +0100, Vlastimil Babka wrote:
> Yeah and last LSF/MM we concluded that it's not as a big disadvantage as we
> previously thought https://lwn.net/Articles/931406/

How nice, thanks for that!

Do you have some refs to Mike's tests so that we could run them here too
with SNP guests to see how big - if any - the fragmentation has.

Thx.
  
Borislav Petkov Jan. 15, 2024, 9:09 a.m. UTC | #9
On Fri, Jan 12, 2024 at 12:00:01PM -0800, Dave Hansen wrote:
> I thought we agreed long ago to just demote the whole direct map to 4k
> on kernels that might need to act as SEV-SNP hosts.  That should be step
> one and this can be discussed as an optimization later.

Do we have a link to that agreement somewhere?

I'd like to read why we agreed. And looking at Mike's talk:
https://lwn.net/Articles/931406/ - what do we wanna do in general with
the direct map granularity?

Thx.
  
Vlastimil Babka Jan. 15, 2024, 9:14 a.m. UTC | #10
On 1/15/24 10:06, Borislav Petkov wrote:
> On Fri, Jan 12, 2024 at 09:27:45PM +0100, Vlastimil Babka wrote:
>> Yeah and last LSF/MM we concluded that it's not as a big disadvantage as we
>> previously thought https://lwn.net/Articles/931406/
> 
> How nice, thanks for that!
> 
> Do you have some refs to Mike's tests so that we could run them here too
> with SNP guests to see how big - if any - the fragmentation has.

Let me Cc him...

> Thx.
>
  
Mike Rapoport Jan. 15, 2024, 9:16 a.m. UTC | #11
On Mon, Jan 15, 2024 at 10:06:39AM +0100, Borislav Petkov wrote:
> On Fri, Jan 12, 2024 at 09:27:45PM +0100, Vlastimil Babka wrote:
> > Yeah and last LSF/MM we concluded that it's not as a big disadvantage as we
> > previously thought https://lwn.net/Articles/931406/
> 
> How nice, thanks for that!
> 
> Do you have some refs to Mike's tests so that we could run them here too
> with SNP guests to see how big - if any - the fragmentation has.

I ran several tests from mmtests suite.

The tests results are here:
https://rppt.io/gfp-unmapped-bench/
 
> Thx.
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
>
  
Borislav Petkov Jan. 15, 2024, 9:20 a.m. UTC | #12
On Mon, Jan 15, 2024 at 11:16:19AM +0200, Mike Rapoport wrote:
> I ran several tests from mmtests suite.
> 
> The tests results are here:
> https://rppt.io/gfp-unmapped-bench/

Cool, thanks Mike!

:-)
  
Vlastimil Babka Jan. 15, 2024, 9:23 a.m. UTC | #13
On 1/12/24 21:37, Dave Hansen wrote:
> On 1/12/24 12:28, Tom Lendacky wrote:
>> I thought there was also a desire to remove the direct map for any pages
>> assigned to a guest as private, not just the case that the comment says.
>> So updating the comment would probably the best action.
> 
> I'm not sure who desires that.
> 
> It's sloooooooow to remove things from the direct map.  There's almost
> certainly a frequency cutoff where running the whole direct mapping as
> 4k is better than the cost of mapping/unmapping.
> 
> Actually, where _is_ the TLB flushing here?

Hm yeah it seems to be using the _noflush version? Maybe the RMP issues this
avoids are only triggered with actual page tables and a stray outdated TLB
hit doesn't trigger it? Needs documenting though if that's the case.
  
Michael Roth Jan. 16, 2024, 4:19 p.m. UTC | #14
On Fri, Jan 12, 2024 at 12:37:31PM -0800, Dave Hansen wrote:
> On 1/12/24 12:28, Tom Lendacky wrote:
> > I thought there was also a desire to remove the direct map for any pages
> > assigned to a guest as private, not just the case that the comment says.
> > So updating the comment would probably the best action.
> 
> I'm not sure who desires that.
> 
> It's sloooooooow to remove things from the direct map.  There's almost
> certainly a frequency cutoff where running the whole direct mapping as
> 4k is better than the cost of mapping/unmapping.

One area we'd been looking to optimize[1] is lots of vCPUs/guests
faulting in private memory on-demand via kernels with lazy acceptance
support. The lazy acceptance / #NPF path for each 4K/2M guest page
involves updating the corresponding RMP table entry to set it to
Guest-owned state, and as part of that we remove the PFNs from the
directmap. There is indeed potential for scalability issues due to how
directmap updates are currently handled, since they involve holding a
global cpa_lock for every update.

At the time, there were investigations on how to remove the cpa_lock,
and there's an RFC patch[2] that implements this change, but I don't
know where this stands today.

There was also some work[3] on being able to restore 2MB entries in the
directmap (currently entries are only re-added as 4K) the Mike Rapoport
pointed out to me a while back. With that, since the bulk of private
guest pages 2MB, we'd be able to avoid splitting the directmap to 4K over
time.

There was also some indication[4] that UPM/guest_memfd would eventually
manage directmap invalidations internally, so it made sense to have
SNP continue to handle this up until the point that mangement was
moved to gmem.

Those 3 things, paired with a platform-independent way of catching
unexpected kernel accesses to private guest memory, are what I think
nudged us all toward the current implementation.

But AFAIK none of these 3 things are being actively upstreamed today,
so it makes sense to re-consider how we handle the directmap in the
interim.

I did some performance tests which do seem to indicate that
pre-splitting the directmap to 4K can be substantially improve certain
SNP guest workloads. This test involves running a single 1TB SNP guest
with 128 vCPUs running "stress --vm 128 --vm-bytes 5G --vm-keep" to
rapidly fault in all of its memory via lazy acceptance, and then
measuring the rate that gmem pages are being allocated on the host by
monitoring "FileHugePages" from /proc/meminfo to get some rough gauge
of how quickly a guest can fault in it's initial working set prior to
reaching steady state. The data is a bit noisy but seems to indicate
significant improvement by taking the directmap updates out of the
lazy acceptance path, and I would only expect that to become more
significant as you scale up the number of guests / vCPUs.

  # Average fault-in rate across 3 runs, measured in GB/s
                    unpinned | pinned to NUMA node 0
  DirectMap4K           12.9 | 12.1
             stddev      2.2 |  1.3
  DirectMap2M+split      8.0 |  8.9
             stddev      1.3 |  0.8

The downside of course is potential impact for non-SNP workloads
resulting from splitting the directmap. Mike Rapoport's numbers make
me feel a little better about it, but I don't think they apply directly
to the notion of splitting the entire directmap. It's Even he LWN article
summarizes:

  "The conclusion from all of this, Rapoport continued, was that
  direct-map fragmentation just does not matter — for data access, at
  least. Using huge-page mappings does still appear to make a difference
  for memory containing the kernel code, so allocator changes should
  focus on code allocations — improving the layout of allocations for
  loadable modules, for example, or allowing vmalloc() to allocate huge
  pages for code. But, for kernel-data allocations, direct-map
  fragmentation simply appears to not be worth worrying about."

So at the very least, if we went down this path, we would be worth
investigating the following areas in addition to general perf testing:

  1) Only splitting directmap regions corresponding to kernel-allocatable
     *data* (hopefully that's even feasible...)
  2) Potentially deferring the split until an SNP guest is actually
     run, so there isn't any impact just from having SNP enabled (though
     you still take a hit from RMP checks in that case so maybe it's not
     worthwhile, but that itself has been noted as a concern for users
     so it would be nice to not make things even worse).

[1] https://lore.kernel.org/linux-mm/20231103000105.m3z4eijcxlxciyzd@amd.com/
[2] https://lore.kernel.org/lkml/Y7f9ZuPcIMk37KnN@gmail.com/T/#m15b74841f5319c0d1177f118470e9714d4ea96c8
[3] https://lore.kernel.org/linux-kernel/20200416213229.19174-1-kirill.shutemov@linux.intel.com/
[4] https://lore.kernel.org/all/YyGLXXkFCmxBfu5U@google.com/

> 
> Actually, where _is_ the TLB flushing here?

Boris pointed that out in v6, and we implemented it in v7, but it
completely cratered performance:

   https://lore.kernel.org/linux-mm/20221219150026.bltiyk72pmdc2ic3@amd.com/

After further discussion I think we'd concluded it wasn't necessary. Maybe
that's worth revisiting though. If it is necessary, then that would be
another reason to just pre-split the directmap because the above-mentioned
lazy acceptance workload/bottleneck would likely get substantially worse.

-Mike
  
Dave Hansen Jan. 16, 2024, 4:21 p.m. UTC | #15
On 1/15/24 01:09, Borislav Petkov wrote:
> On Fri, Jan 12, 2024 at 12:00:01PM -0800, Dave Hansen wrote:
>> I thought we agreed long ago to just demote the whole direct map to 4k
>> on kernels that might need to act as SEV-SNP hosts.  That should be step
>> one and this can be discussed as an optimization later.
> 
> Do we have a link to that agreement somewhere?

I seem to be remembering it wrong.  I have some memory of disabling 2M
mappings in the same spots that debug_pagealloc_enabled() does, but I
must have hallucinated it.

> https://lore.kernel.org/lkml/535400b4-0593-a7ca-1548-532ee1fefbd7@intel.com/

The original approach tried to fix up the page faults by doing the
demotion there and didn't seem to work with hugetlbfs at all.

> I'd like to read why we agreed. And looking at Mike's talk:
> https://lwn.net/Articles/931406/ - what do we wanna do in general with
> the direct map granularity?

I think fracturing it is much less of a problem than we once thought it
was.  There are _definitely_ people that will pay the cost for added
security.

The problem will be the first time someone sees a regression when their
direct-map-fracturing kernel feature (secret pages, SNP host, etc...)
collides with some workload that's sensitive to kernel TLB pressure.
  
Michael Roth Jan. 16, 2024, 4:50 p.m. UTC | #16
On Tue, Jan 16, 2024 at 10:19:09AM -0600, Michael Roth wrote:
> I did some performance tests which do seem to indicate that
> pre-splitting the directmap to 4K can be substantially improve certain
> SNP guest workloads. This test involves running a single 1TB SNP guest
> with 128 vCPUs running "stress --vm 128 --vm-bytes 5G --vm-keep" to
> rapidly fault in all of its memory via lazy acceptance, and then
> measuring the rate that gmem pages are being allocated on the host by
> monitoring "FileHugePages" from /proc/meminfo to get some rough gauge
> of how quickly a guest can fault in it's initial working set prior to
> reaching steady state. The data is a bit noisy but seems to indicate
> significant improvement by taking the directmap updates out of the
> lazy acceptance path, and I would only expect that to become more
> significant as you scale up the number of guests / vCPUs.
> 
>   # Average fault-in rate across 3 runs, measured in GB/s
>                     unpinned | pinned to NUMA node 0
>   DirectMap4K           12.9 | 12.1
>              stddev      2.2 |  1.3
>   DirectMap2M+split      8.0 |  8.9
>              stddev      1.3 |  0.8
> 
> The downside of course is potential impact for non-SNP workloads
> resulting from splitting the directmap. Mike Rapoport's numbers make
> me feel a little better about it, but I don't think they apply directly
> to the notion of splitting the entire directmap. It's Even he LWN article
> summarizes:
> 
>   "The conclusion from all of this, Rapoport continued, was that
>   direct-map fragmentation just does not matter — for data access, at
>   least. Using huge-page mappings does still appear to make a difference
>   for memory containing the kernel code, so allocator changes should
>   focus on code allocations — improving the layout of allocations for
>   loadable modules, for example, or allowing vmalloc() to allocate huge
>   pages for code. But, for kernel-data allocations, direct-map
>   fragmentation simply appears to not be worth worrying about."
> 
> So at the very least, if we went down this path, we would be worth
> investigating the following areas in addition to general perf testing:
> 
>   1) Only splitting directmap regions corresponding to kernel-allocatable
>      *data* (hopefully that's even feasible...)
>   2) Potentially deferring the split until an SNP guest is actually
>      run, so there isn't any impact just from having SNP enabled (though
>      you still take a hit from RMP checks in that case so maybe it's not
>      worthwhile, but that itself has been noted as a concern for users
>      so it would be nice to not make things even worse).

There's another potential area of investigation I forgot to mention that
doesn't involve pre-splitting the directmap. It makes use of the fact
that the kernel should never be accessing a 2MB mapping that overlaps with
private guest memory if the backing PFN for the guest memory is a 2MB page.
Since there's no chance for overlap (well, maybe via a 1GB directmap entry,
but not as dramatic a change to force those to 2M), there's no need to
actually split the directmap entry in these cases since they won't
result in unexpected RMP faults.

So if pre-splitting the directmap ends up having too many downsides, then
there may still some potential for optimizing the current approach to a
fair degree.

-Mike
  
Borislav Petkov Jan. 16, 2024, 6:22 p.m. UTC | #17
On Tue, Jan 16, 2024 at 10:19:09AM -0600, Michael Roth wrote:
> So at the very least, if we went down this path, we would be worth
> investigating the following areas in addition to general perf testing:
> 
>   1) Only splitting directmap regions corresponding to kernel-allocatable
>      *data* (hopefully that's even feasible...)
>   2) Potentially deferring the split until an SNP guest is actually
>      run, so there isn't any impact just from having SNP enabled (though
>      you still take a hit from RMP checks in that case so maybe it's not
>      worthwhile, but that itself has been noted as a concern for users
>      so it would be nice to not make things even worse).

So the gist of this whole explanation why we end up doing what we end up
doing eventually should be in the commit message so that it is clear
*why* we did it. 

> After further discussion I think we'd concluded it wasn't necessary. Maybe
> that's worth revisiting though. If it is necessary, then that would be
> another reason to just pre-split the directmap because the above-mentioned
> lazy acceptance workload/bottleneck would likely get substantially worse.

The reason for that should also be in the commit message.

And to answer:

https://lore.kernel.org/linux-mm/20221219150026.bltiyk72pmdc2ic3@amd.com/

yes, you should add a @npages variant.

See if you could use/extend this, for example:

https://lore.kernel.org/r/20240116022008.1023398-3-mhklinux@outlook.com

Thx.
  
Mike Rapoport Jan. 16, 2024, 8:12 p.m. UTC | #18
On Tue, Jan 16, 2024 at 10:50:25AM -0600, Michael Roth wrote:
> On Tue, Jan 16, 2024 at 10:19:09AM -0600, Michael Roth wrote:
> > 
> > The downside of course is potential impact for non-SNP workloads
> > resulting from splitting the directmap. Mike Rapoport's numbers make
> > me feel a little better about it, but I don't think they apply directly
> > to the notion of splitting the entire directmap. It's Even he LWN article
> > summarizes:

When I ran the tests, I had the entire direct map forced to 4k or 2M pages.
There is indeed some impact and some tests suffer more than others but
there were also runs that benefit from smaller page size in the direct map,
at least if I remember correctly the results Intel folks posted a while
ago.
 
> >   "The conclusion from all of this, Rapoport continued, was that
> >   direct-map fragmentation just does not matter — for data access, at
> >   least. Using huge-page mappings does still appear to make a difference
> >   for memory containing the kernel code, so allocator changes should
> >   focus on code allocations — improving the layout of allocations for
> >   loadable modules, for example, or allowing vmalloc() to allocate huge
> >   pages for code. But, for kernel-data allocations, direct-map
> >   fragmentation simply appears to not be worth worrying about."
> > 
> > So at the very least, if we went down this path, we would be worth
> > investigating the following areas in addition to general perf testing:
> > 
> >   1) Only splitting directmap regions corresponding to kernel-allocatable
> >      *data* (hopefully that's even feasible...)

Forcing the direct map to 4k pages does not affect the kernel text
mappings, they are created separately and they are not the part of the
kernel mapping of the physical memory.
Well, except the modules, but they are mapped with 4k pages anyway.

> -Mike
  
Dave Hansen Jan. 16, 2024, 8:22 p.m. UTC | #19
On 1/16/24 08:19, Michael Roth wrote:
> 
> So at the very least, if we went down this path, we would be worth
> investigating the following areas in addition to general perf testing:
> 
>   1) Only splitting directmap regions corresponding to kernel-allocatable
>      *data* (hopefully that's even feasible...)

Take a look at the 64-bit memory map in here:

	https://www.kernel.org/doc/Documentation/x86/x86_64/mm.rst

We already have separate mappings for kernel data and (normal) kernel text.

>   2) Potentially deferring the split until an SNP guest is actually
>      run, so there isn't any impact just from having SNP enabled (though
>      you still take a hit from RMP checks in that case so maybe it's not
>      worthwhile, but that itself has been noted as a concern for users
>      so it would be nice to not make things even worse).

Yes, this would be nice too.

>> Actually, where _is_ the TLB flushing here?
> Boris pointed that out in v6, and we implemented it in v7, but it
> completely cratered performance:

That *desperately* needs to be documented.

How can it be safe to skip the TLB flush?  It this akin to a page
permission promotion where you go from RO->RW but can skip the TLB
flush?  In that case, the CPU will see the RO TLB entry violation, drop
it, and re-walk the page tables, discovering the RW entry.

Does something similar happen here where the CPU sees the 2M/4k conflict
in the TLB, drops the 2M entry, does a re-walk then picks up the
newly-split 2M->4k entries?

I can see how something like that would work, but it's _awfully_ subtle
to go unmentioned.
  
Borislav Petkov Jan. 17, 2024, 9:34 a.m. UTC | #20
On Tue, Jan 16, 2024 at 08:21:09AM -0800, Dave Hansen wrote:
> The problem will be the first time someone sees a regression when their
> direct-map-fracturing kernel feature (secret pages, SNP host, etc...)
> collides with some workload that's sensitive to kernel TLB pressure.

Yeah, and that "workload" will be some microbenchmark crap which people
would pay too much attention to, without it having any relevance to
real-world scenarios. Completely hypothetically speaking, ofc.

:-)
  
Michael Roth Jan. 26, 2024, 1:35 a.m. UTC | #21
On Tue, Jan 16, 2024 at 12:22:39PM -0800, Dave Hansen wrote:
> On 1/16/24 08:19, Michael Roth wrote:
> > 
> > So at the very least, if we went down this path, we would be worth
> > investigating the following areas in addition to general perf testing:
> > 
> >   1) Only splitting directmap regions corresponding to kernel-allocatable
> >      *data* (hopefully that's even feasible...)
> 
> Take a look at the 64-bit memory map in here:
> 
> 	https://www.kernel.org/doc/Documentation/x86/x86_64/mm.rst
> 
> We already have separate mappings for kernel data and (normal) kernel text.
> 
> >   2) Potentially deferring the split until an SNP guest is actually
> >      run, so there isn't any impact just from having SNP enabled (though
> >      you still take a hit from RMP checks in that case so maybe it's not
> >      worthwhile, but that itself has been noted as a concern for users
> >      so it would be nice to not make things even worse).
> 
> Yes, this would be nice too.
> 
> >> Actually, where _is_ the TLB flushing here?
> > Boris pointed that out in v6, and we implemented it in v7, but it
> > completely cratered performance:
> 
> That *desperately* needs to be documented.
> 
> How can it be safe to skip the TLB flush?  It this akin to a page
> permission promotion where you go from RO->RW but can skip the TLB
> flush?  In that case, the CPU will see the RO TLB entry violation, drop
> it, and re-walk the page tables, discovering the RW entry.
> 
> Does something similar happen here where the CPU sees the 2M/4k conflict
> in the TLB, drops the 2M entry, does a re-walk then picks up the
> newly-split 2M->4k entries?
> 
> I can see how something like that would work, but it's _awfully_ subtle
> to go unmentioned.

I think the current logic relies on the fact that we don't actually have
to remove entries from the RMP table to avoid unexpected RMP #PFs,
because the owners of private PFNs are aware of what access restrictions
would be active, and non-owners of private PFNs shouldn't be trying to
write to them. 

It's only cases where a non-owner does a write via a 2MB+ mapping to that
happens to overlap a private PFN that needs to be guarded against, and
when a private PFN is removed from the directmap it will end up
splitting the 2MB mapping, and in that cases there *is* a TLB flush that
would wipe out the 2MB TLB entry.

After that point, there may still be stale 4KB TLB entries that accrue,
but by the time there is any sensible reason to use them to perform a
write, the PFN would have been transitioned back to shared state and
re-added to the directmap.

But yes, it's ugly, and we've ended up re-working this to simply split
the directmap via set_memory_4k(), which also does the expected TLB
flushing of 2MB+ mappings, and we only call it when absolutely
necessary, benefitting from the following optimizations:

  - if lookup_address() tells us the mapping is already split, no work
    splitting is needed
  - when the rmpupdate() is for a 2MB private page/RMP entry, there's no
    need to split the directmap, because there's no risk of a non-owner's
    writes overlapping with the private PFNs (RMP checks are only done on
    4KB/2MB granularity, so even a write via a 1GB directmap mapping
    would not trigger an RMP fault since the non-owner's writes would be
    to a different 2MB PFN range.

Since KVM/gmem generally allocate 2MB backing pages for private guest
memory, the above optimizations result in the directmap only needing to
be split, and only needing to acquire the cpa_lock, a handful of times
for each guest.

This also sort of gives us what we were looking for earlier: a way to
defer the directmap split until it's actually needed. But rather than in
bulk, it's amortized over time, and the more it's done, the more likely it
is that the host is being used primarily to run SNP guests, and so the less
likely it is that splitting the directmap is going to cause some sort of
noticeable regression for non-SNP/non-guest workloads.

I did some basic testing to compare this against pre-splitting the
directmap, and I think it's pretty close performance-wise, so it seems like
a good, safe default. It's also fairly trivial to add a kernel cmdline
params or something later if someone wants to optimize specifically for
SNP guests.

  System: EPYC, 512 cores, 1.5TB memory
  
  == Page-in/acceptance rate for 1 guests, each with 128 vCPUs/512M ==
  
     directmap pre-split to 4K:
     13.77GB/s
     11.88GB/s
     15.37GB/s

     directmap split on-demand:
     14.77 GB/s
     16.57 GB/s
     15.53 GB/s
  
  == Page-in/acceptance rate for 4 guests, each with 128 vCPUs/256GB ==
  
     directmap pre-split to 4K:
     37.35 GB/s
     33.37 GB/s
     29.55 GB/s
    
     directmap split on-demand:
     28.88 GB/s
     25.13 GB/s
     29.28 GB/s
  
  == Page-in/acceptance rate for 8 guests, each with 128 vCPUs/160GB ==
  
     directmap pre-split to 4K:
     24.39 GB/s
     27.16 GB/s
     24.70 GB/s
     
     direct split on-demand:
     26.12 GB/s
     24.44 GB/s
     22.66 GB/s

So for v2 we're using this on-demand approach, and I've tried to capture
some of this rationale and in the commit log / comments for the relevant
patch.

Thanks,

Mike
  
Michael Roth Jan. 26, 2024, 1:49 a.m. UTC | #22
On Tue, Jan 16, 2024 at 10:12:26PM +0200, Mike Rapoport wrote:
> On Tue, Jan 16, 2024 at 10:50:25AM -0600, Michael Roth wrote:
> > On Tue, Jan 16, 2024 at 10:19:09AM -0600, Michael Roth wrote:
> > > 
> > > The downside of course is potential impact for non-SNP workloads
> > > resulting from splitting the directmap. Mike Rapoport's numbers make
> > > me feel a little better about it, but I don't think they apply directly
> > > to the notion of splitting the entire directmap. It's Even he LWN article
> > > summarizes:
> 
> When I ran the tests, I had the entire direct map forced to 4k or 2M pages.
> There is indeed some impact and some tests suffer more than others but
> there were also runs that benefit from smaller page size in the direct map,
> at least if I remember correctly the results Intel folks posted a while
> ago.

I see, thanks for clarifying. Certainly helps to have this data if someone
ends up wanting to investigate pre-splitting directmap to optimize SNP
use-cases in the future. Still feel more comfortable introducing this as an
optional tuneable though; can't help but worry about that *1* workload that
somehow suffers perf regression and has us frantically re-working SNP
implementation if we were to start off with making 4k directmap a requirement.

>  
> > >   "The conclusion from all of this, Rapoport continued, was that
> > >   direct-map fragmentation just does not matter — for data access, at
> > >   least. Using huge-page mappings does still appear to make a difference
> > >   for memory containing the kernel code, so allocator changes should
> > >   focus on code allocations — improving the layout of allocations for
> > >   loadable modules, for example, or allowing vmalloc() to allocate huge
> > >   pages for code. But, for kernel-data allocations, direct-map
> > >   fragmentation simply appears to not be worth worrying about."
> > > 
> > > So at the very least, if we went down this path, we would be worth
> > > investigating the following areas in addition to general perf testing:
> > > 
> > >   1) Only splitting directmap regions corresponding to kernel-allocatable
> > >      *data* (hopefully that's even feasible...)
> 
> Forcing the direct map to 4k pages does not affect the kernel text
> mappings, they are created separately and they are not the part of the
> kernel mapping of the physical memory.
> Well, except the modules, but they are mapped with 4k pages anyway.

Thanks!

-Mike

> 
> > -Mike
> 
> -- 
> Sincerely yours,
> Mike.
  

Patch

diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index ff9fa0a85a7f..ee182351d93a 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -369,14 +369,64 @@  int psmash(u64 pfn)
 }
 EXPORT_SYMBOL_GPL(psmash);
 
+static int restore_direct_map(u64 pfn, int npages)
+{
+	int i, ret = 0;
+
+	for (i = 0; i < npages; i++) {
+		ret = set_direct_map_default_noflush(pfn_to_page(pfn + i));
+		if (ret)
+			break;
+	}
+
+	if (ret)
+		pr_warn("Failed to restore direct map for pfn 0x%llx, ret: %d\n",
+			pfn + i, ret);
+
+	return ret;
+}
+
+static int invalidate_direct_map(u64 pfn, int npages)
+{
+	int i, ret = 0;
+
+	for (i = 0; i < npages; i++) {
+		ret = set_direct_map_invalid_noflush(pfn_to_page(pfn + i));
+		if (ret)
+			break;
+	}
+
+	if (ret) {
+		pr_warn("Failed to invalidate direct map for pfn 0x%llx, ret: %d\n",
+			pfn + i, ret);
+		restore_direct_map(pfn, i);
+	}
+
+	return ret;
+}
+
 static int rmpupdate(u64 pfn, struct rmp_state *state)
 {
 	unsigned long paddr = pfn << PAGE_SHIFT;
-	int ret;
+	int ret, level, npages;
 
 	if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
 		return -ENODEV;
 
+	level = RMP_TO_PG_LEVEL(state->pagesize);
+	npages = page_level_size(level) / PAGE_SIZE;
+
+	/*
+	 * If the kernel uses a 2MB directmap mapping to write to an address,
+	 * and that 2MB range happens to contain a 4KB page that set to private
+	 * in the RMP table, an RMP #PF will trigger and cause a host crash.
+	 *
+	 * Prevent this by removing pages from the directmap prior to setting
+	 * them as private in the RMP table.
+	 */
+	if (state->assigned && invalidate_direct_map(pfn, npages))
+		return -EFAULT;
+
 	do {
 		/* Binutils version 2.36 supports the RMPUPDATE mnemonic. */
 		asm volatile(".byte 0xF2, 0x0F, 0x01, 0xFE"
@@ -386,12 +436,16 @@  static int rmpupdate(u64 pfn, struct rmp_state *state)
 	} while (ret == RMPUPDATE_FAIL_OVERLAP);
 
 	if (ret) {
-		pr_err("RMPUPDATE failed for PFN %llx, ret: %d\n", pfn, ret);
+		pr_err("RMPUPDATE failed for PFN %llx, pg_level: %d, ret: %d\n",
+		       pfn, level, ret);
 		dump_rmpentry(pfn);
 		dump_stack();
 		return -EFAULT;
 	}
 
+	if (!state->assigned && restore_direct_map(pfn, npages))
+		return -EFAULT;
+
 	return 0;
 }