[v6,2/6] x86/tdx: Support vmalloc() for tdx_enc_status_changed()

Message ID 20230504225351.10765-3-decui@microsoft.com
State New
Series Support TDX guests on Hyper-V

Commit Message

Dexuan Cui May 4, 2023, 10:53 p.m. UTC
  When a TDX guest runs on Hyper-V, the hv_netvsc driver's netvsc_init_buf()
allocates buffers using vzalloc(), and needs to share the buffers with the
host OS by calling set_memory_decrypted(), which is not working for
vmalloc() yet. Add the support by handling the pages one by one.
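For illustration, a minimal sketch (simplified, not the actual netvsc or
Hyper-V code) of the allocation pattern that needs this support:

#include <linux/vmalloc.h>
#include <linux/set_memory.h>

/* Sketch only: vzalloc() a buffer, then share it with the host. */
static void *alloc_shared_buf(size_t size)
{
	void *buf = vzalloc(size);

	if (!buf)
		return NULL;

	/* Needs vmalloc() support in tdx_enc_status_changed() to work: */
	if (set_memory_decrypted((unsigned long)buf, size / PAGE_SIZE)) {
		vfree(buf);
		return NULL;
	}

	return buf;
}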

Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
---

Changes in v2:
  Changed tdx_enc_status_changed() in place.

Hi Dave, I checked the huge vmalloc mapping code, but I still don't know
how to get the underlying huge page info (if a huge page is in use) so
that try_accept_page() could use PG_LEVEL_2M/1G for vmalloc: I checked
is_vm_area_hugepages() and __vfree() -> __vunmap(), and I think the
underlying page allocation info is internal to the mm code; there is
no mm API for me to get that info in tdx_enc_status_changed().
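For reference, a rough sketch of the per-page walk that generic mm code does
allow via vmalloc_to_page(); it resolves one 4K page at a time and gives no
hint about the order of the underlying allocation, so only PG_LEVEL_4K can
be assumed:

#include <linux/mm.h>
#include <linux/vmalloc.h>

/* Sketch only: walk a vmalloc range one 4K page at a time. */
static void walk_vmalloc_range(const void *addr, unsigned long size)
{
	unsigned long offset;

	for (offset = 0; offset < size; offset += PAGE_SIZE) {
		struct page *page = vmalloc_to_page(addr + offset);

		if (!page)
			break;
		/* Only 4K granularity info is available here. */
	}
}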

Changes in v3:
  No change since v2.

Changes in v4:
  Added Kirill's Co-developed-by since Kirill helped to improve the
    code by adding tdx_enc_status_changed_phys().

  Thanks Kirill for the clarification on load_unaligned_zeropad()!

  The vzalloc() usage in drivers/net/hyperv/netvsc.c: netvsc_init_buf()
    remains the same. It may not be worth it to "allocate a vmalloc region,
    allocate pages manually" (a sketch of that alternative follows below),
    because we have to consider the worst case where the system is
    suffering from severe memory fragmentation and we can only allocate
    multiple single pages. We may not want to complicate the code in
    netvsc_init_buf(). We'll support NIC SR-IOV for TDX VMs on Hyper-V,
    so the netvsc send/recv buffers won't be used when the VF NIC is up.
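  A rough sketch of that rejected alternative (hypothetical helper, not
  proposed for netvsc; under severe fragmentation it degrades to order-0
  pages just like vzalloc()):

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

/*
 * Hypothetical sketch: allocate order-0 pages by hand and map them with
 * vmap().  The caller must keep the page array around to vunmap() and
 * free the pages at teardown.
 */
static void *alloc_and_map_pages(unsigned int nr_pages, struct page ***pagesp)
{
	struct page **pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
	unsigned int i;
	void *vaddr;

	if (!pages)
		return NULL;

	for (i = 0; i < nr_pages; i++) {
		pages[i] = alloc_page(GFP_KERNEL | __GFP_ZERO);
		if (!pages[i])
			goto err;
	}

	vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
	if (!vaddr)
		goto err;

	*pagesp = pages;
	return vaddr;

err:
	while (i--)
		__free_page(pages[i]);
	kfree(pages);
	return NULL;
}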

Changes in v5:
  Added Kirill's Signed-off-by.
  Added Michael's Reviewed-by.

Changes in v6: None.

 arch/x86/coco/tdx/tdx.c | 76 ++++++++++++++++++++++++++++-------------
 1 file changed, 52 insertions(+), 24 deletions(-)
  

Comments

Dave Hansen May 23, 2023, 8:39 p.m. UTC | #1
On 5/4/23 15:53, Dexuan Cui wrote:
> When a TDX guest runs on Hyper-V, the hv_netvsc driver's netvsc_init_buf()
> allocates buffers using vzalloc(), and needs to share the buffers with the
> host OS by calling set_memory_decrypted(), which is not working for
> vmalloc() yet. Add the support by handling the pages one by one.

I think this sets a bad precedent.

There are consequences for converting pages between shared and private.
Doing it on a vmalloc() mapping is guaranteed to fracture the underlying
EPT/SEPT mappings.

How does this work with load_unaligned_zeropad()?  Couldn't it be
running around poking at one of these vmalloc()'d pages via the direct
map during a shared->private conversion before the page has been accepted?
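For context, the access pattern in question is roughly this (a sketch, not
the actual word-at-a-time.h implementation):

#include <asm/word-at-a-time.h>

static unsigned long read_word_near_page_end(const void *addr)
{
	/*
	 * A word-at-a-time string routine reads a full word even when fewer
	 * bytes remain before the page boundary, so the load can spill into
	 * the next page.  Normally the exception fixup zero-pads the result;
	 * the question above is what happens when that next page is mid-way
	 * through a conversion and not yet accepted.
	 */
	return load_unaligned_zeropad(addr);
}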
  
Sean Christopherson May 23, 2023, 9:25 p.m. UTC | #2
On Tue, May 23, 2023, Dave Hansen wrote:
> On 5/4/23 15:53, Dexuan Cui wrote:
> > When a TDX guest runs on Hyper-V, the hv_netvsc driver's netvsc_init_buf()
> > allocates buffers using vzalloc(), and needs to share the buffers with the
> > host OS by calling set_memory_decrypted(), which is not working for
> > vmalloc() yet. Add the support by handling the pages one by one.
> 
> I think this sets a bad precedent.

+1 

> There are consequences for converting pages between shared and private.
> Doing it on a vmalloc() mapping is guaranteed to fracture the underlying
> EPT/SEPT mappings.
> 
> How does this work with load_unaligned_zeropad()?  Couldn't it be
> running around poking at one of these vmalloc()'d pages via the direct
> map during a shared->private conversion before the page has been accepted?

Would it be feasible and sensible to add a GFP_SHARED or whatever, to communicate
to the core allocators that the page is destined to be converted to a shared page?
I assume that would provide a common place (or two) for initiating conversions,
and would hopefully allow for future optimizations, e.g. to keep shared allocation
in the same pool or whatever.  Sharing memory without any intelligence as to what
memory is converted is going to make both the guest and host sad.
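As a purely hypothetical illustration (no such flag exists; the name is just
taken from the sentence above):

	/*
	 * Hypothetical only: the allocation itself carries the intent, so the
	 * allocator could satisfy it from an already-converted, physically
	 * grouped pool instead of converting pages one allocation at a time.
	 */
	buf = __vmalloc(buf_size, GFP_KERNEL | __GFP_ZERO | GFP_SHARED);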
  
Dave Hansen May 23, 2023, 9:33 p.m. UTC | #3
On 5/23/23 14:25, Sean Christopherson wrote:
>> There are consequences for converting pages between shared and private.
>> Doing it on a vmalloc() mapping is guaranteed to fracture the underlying
>> EPT/SEPT mappings.
>>
>> How does this work with load_unaligned_zeropad()?  Couldn't it be
>> running around poking at one of these vmalloc()'d pages via the direct
>> map during a shared->private conversion before the page has been accepted?
> Would it be feasible and sensible to add a GFP_SHARED or whatever, to communicate
> to the core allocators that the page is destined to be converted to a shared page?
> I assume that would provide a common place (or two) for initiating conversions,
> and would hopefully allow for future optimizations, e.g. to keep shared allocation
> in the same pool or whatever.  Sharing memory without any intelligence as to what
> memory is converted is going to make both the guest and host sad.

I don't think we want a GFP flag.  This is still way too specialized to
warrant one of those.

It sounds like a similar problem to what folks want for modules or BPF.
There are a bunch of allocations that are related and can have some of
their setup/teardown costs amortized if they can be clumped together.

For BPF, the costs are from doing RW=>RO in the kernel direct map, and
fracturing it in the process.

Here, the costs are from the private->shared conversions and fracturing
both the direct map and the EPT/SEPT.

I just don't know if there's anything that we can reuse from the BPF effort.
  
Kirill A. Shutemov May 23, 2023, 10:37 p.m. UTC | #4
On Tue, May 23, 2023 at 01:39:11PM -0700, Dave Hansen wrote:
> On 5/4/23 15:53, Dexuan Cui wrote:
> > When a TDX guest runs on Hyper-V, the hv_netvsc driver's netvsc_init_buf()
> > allocates buffers using vzalloc(), and needs to share the buffers with the
> > host OS by calling set_memory_decrypted(), which is not working for
> > vmalloc() yet. Add the support by handling the pages one by one.
> 
> I think this sets a bad precedent.
> 
> There are consequences for converting pages between shared and private.
> Doing it on a vmalloc() mapping is guaranteed to fracture the underlying
> EPT/SEPT mappings.
> 
> How does this work with load_unaligned_zeropad()?  Couldn't it be
> running around poking at one of these vmalloc()'d pages via the direct
> map during a shared->private conversion before the page has been accepted?

Alias processing in __change_page_attr_set_clr() will change direct
mapping if you call it on vmalloc()ed memory. I think we are safe wrt
load_unaligned_zeropad() here.
  
Dave Hansen May 23, 2023, 10:43 p.m. UTC | #5
On 5/23/23 15:37, kirill.shutemov@linux.intel.com wrote:
>> How does this work with load_unaligned_zeropad()?  Couldn't it be
>> running around poking at one of these vmalloc()'d pages via the direct
>> map during a shared->private conversion before the page has been accepted?
> Alias processing in __change_page_attr_set_clr() will change direct
> mapping if you call it on vmalloc()ed memory. I think we are safe wrt
> load_unaligned_zeropad() here.

We're *eventually* OK:

>         /* Notify hypervisor that we are about to set/clr encryption attribute. */
>         x86_platform.guest.enc_status_change_prepare(addr, numpages, enc);
> 
>         ret = __change_page_attr_set_clr(&cpa, 1);

But what about in the middle between enc_status_change_prepare() and
__change_page_attr_set_clr()?  Don't the direct map and the
shared/private status of the page diverge in there?
  
Edgecombe, Rick P May 23, 2023, 11:02 p.m. UTC | #6
On Tue, 2023-05-23 at 14:33 -0700, Dave Hansen wrote:
> On 5/23/23 14:25, Sean Christopherson wrote:
> > > There are consequences for converting pages between shared and private.
> > > Doing it on a vmalloc() mapping is guaranteed to fracture the underlying
> > > EPT/SEPT mappings.
> > > 
> > > How does this work with load_unaligned_zeropad()?  Couldn't it be
> > > running around poking at one of these vmalloc()'d pages via the direct
> > > map during a shared->private conversion before the page has been accepted?
> > Would it be feasible and sensible to add a GFP_SHARED or whatever, to
> > communicate to the core allocators that the page is destined to be
> > converted to a shared page? I assume that would provide a common place
> > (or two) for initiating conversions, and would hopefully allow for future
> > optimizations, e.g. to keep shared allocation in the same pool or
> > whatever.  Sharing memory without any intelligence as to what memory is
> > converted is going to make both the guest and host sad.
> 
> I don't think we want a GFP flag.  This is still way too specialized to
> warrant one of those.
> 
> It sounds like a similar problem to what folks want for modules or BPF.
> There are a bunch of allocations that are related and can have some of
> their setup/teardown costs amortized if they can be clumped together.
> 
> For BPF, the costs are from doing RW=>RO in the kernel direct map, and
> fracturing it in the process.
> 
> Here, the costs are from the private->shared conversions and fracturing
> both the direct map and the EPT/SEPT.
> 
> I just don't know if there's anything that we can reuse from the BPF effort.

Last I saw the BPF allocator was focused on module space memory and
packing multiple allocations into pages. So that would probably have to
be generalized to support this.

But also, a lot of the efforts around improving direct map modification
efficiency have assumed that all of the types of special permissions could
be grouped together on the direct map and just mapped elsewhere with
whatever permissions they needed. This does seem like a special case where
memory really needs to be grouped together only with similarly permissioned
memory.

If a GFP flag is too precious, what about something like what PKS tables[0]
tried? It had a similar problem with respect to preferring physically
contiguous, specially permissioned memory on the direct map. It solved this
with a cache of converted pages outside the page allocator and a separate
alloc() function to get at it. This one could be vmalloc_shared() or
something. Don't know, just tossing it out there.

[0]
https://lore.kernel.org/lkml/20210830235927.6443-7-rick.p.edgecombe@intel.com/
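From the caller's side, that purely hypothetical interface might look like
this (names made up here; nothing like it exists today):

/*
 * Hypothetical: allocations come from a pool of pages that were already
 * converted to shared, so callers avoid per-allocation conversions and the
 * associated direct-map/SEPT fracturing.
 */
void *vmalloc_shared(unsigned long size);
void vfree_shared(const void *addr);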
  
Kirill A. Shutemov May 23, 2023, 11:28 p.m. UTC | #7
On Tue, May 23, 2023 at 03:43:15PM -0700, Dave Hansen wrote:
> On 5/23/23 15:37, kirill.shutemov@linux.intel.com wrote:
> >> How does this work with load_unaligned_zeropad()?  Couldn't it be
> >> running around poking at one of these vmalloc()'d pages via the direct
> >> map during a shared->private conversion before the page has been accepted?
> > Alias processing in __change_page_attr_set_clr() will change direct
> > mapping if you call it on vmalloc()ed memory. I think we are safe wrt
> > load_unaligned_zeropad() here.
> 
> We're *eventually* OK:
> 
> >         /* Notify hypervisor that we are about to set/clr encryption attribute. */
> >         x86_platform.guest.enc_status_change_prepare(addr, numpages, enc);
> > 
> >         ret = __change_page_attr_set_clr(&cpa, 1);
> 
> But what about in the middle between enc_status_change_prepare() and
> __change_page_attr_set_clr()?  Don't the direct map and the
> shared/private status of the page diverge in there?

Hmm. Maybe we would need to go through making the range in direct mapping
non-present before notifying VMM about the change.

I need to look at this again in the morning.
  
Kirill A. Shutemov May 25, 2023, 7:08 p.m. UTC | #8
On Wed, May 24, 2023 at 02:28:51AM +0300, kirill.shutemov@linux.intel.com wrote:
> On Tue, May 23, 2023 at 03:43:15PM -0700, Dave Hansen wrote:
> > On 5/23/23 15:37, kirill.shutemov@linux.intel.com wrote:
> > >> How does this work with load_unaligned_zeropad()?  Couldn't it be
> > >> running around poking at one of these vmalloc()'d pages via the direct
> > >> map during a shared->private conversion before the page has been accepted?
> > > Alias processing in __change_page_attr_set_clr() will change direct
> > > mapping if you call it on vmalloc()ed memory. I think we are safe wrt
> > > load_unaligned_zeropad() here.
> > 
> > We're *eventually* OK:
> > 
> > >         /* Notify hypervisor that we are about to set/clr encryption attribute. */
> > >         x86_platform.guest.enc_status_change_prepare(addr, numpages, enc);
> > > 
> > >         ret = __change_page_attr_set_clr(&cpa, 1);
> > 
> > But what about in the middle between enc_status_change_prepare() and
> > __change_page_attr_set_clr()?  Don't the direct map and the
> > shared/private status of the page diverge in there?
> 
> Hmm. Maybe we would need to go through making the range in direct mapping
> non-present before notifying VMM about the change.
> 
> I need to look at this again in the morning.

Okay, I've got around to it finally.

Private->Shared conversion is safe: we first set the shared bit in the
direct mapping and all aliases, and only then call MapGPA from
enc_status_change_finish(). So we never have privately mapped pages that
have already been converted to shared with MapGPA.

Shared->Private is not safe. As with Private->Shared, we adjust the direct
mapping before notifying the VMM and accepting the memory, so there's a
short window when privately mapped memory is neither mapped into the SEPT
nor accepted. It is a problem as it can race with load_unaligned_zeropad().

Shared->Private conversion is rare. I only see one call total during boot
in my setup. Worth fixing anyway.
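Spelling out the window for the Shared->Private case (simplified, current
ordering):

/*
 * set_memory_encrypted(vaddr, numpages)
 *   __set_memory_enc_pgtable()
 *     enc_status_change_prepare()    // currently a no-op for TDX
 *     __change_page_attr_set_clr()   // direct map + aliases now private
 *     <-- window: mapped private, but MapGPA/ACCEPT_PAGE not done yet;
 *         a stray load_unaligned_zeropad() via the direct map can hit an
 *         unaccepted private page here -->
 *     enc_status_change_finish()     // MapGPA + TDX_ACCEPT_PAGE
 */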


The patch below fixes the issue by hooking up enc_status_change_prepare()
and doing the Shared->Private conversion there.

enc_status_change_finish() only covers the Private->Shared conversion.

The patch is on top of the unaccepted memory patchset, which is a more
convenient base. I will rebase onto Linus' tree if the approach looks sane
to you.

Any comments?

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 32501277ef84..b73ec2449c64 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -713,16 +713,32 @@ static bool tdx_cache_flush_required(void)
 	return true;
 }
 
+static bool tdx_enc_status_change_prepare(unsigned long vaddr, int numpages,
+					  bool enc)
+{
+	phys_addr_t start = __pa(vaddr);
+	phys_addr_t end = __pa(vaddr + numpages * PAGE_SIZE);
+
+	if (!enc)
+		return true;
+
+	return tdx_enc_status_changed_phys(start, end, enc);
+}
+
 /*
  * Inform the VMM of the guest's intent for this physical page: shared with
  * the VMM or private to the guest.  The VMM is expected to change its mapping
  * of the page in response.
  */
-static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
+static bool tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
+					 bool enc)
 {
 	phys_addr_t start = __pa(vaddr);
 	phys_addr_t end = __pa(vaddr + numpages * PAGE_SIZE);
 
+	if (enc)
+		return true;
+
 	return tdx_enc_status_changed_phys(start, end, enc);
 }
 
@@ -753,9 +769,10 @@ void __init tdx_early_init(void)
 	 */
 	physical_mask &= cc_mask - 1;
 
-	x86_platform.guest.enc_cache_flush_required = tdx_cache_flush_required;
-	x86_platform.guest.enc_tlb_flush_required   = tdx_tlb_flush_required;
-	x86_platform.guest.enc_status_change_finish = tdx_enc_status_changed;
+	x86_platform.guest.enc_cache_flush_required  = tdx_cache_flush_required;
+	x86_platform.guest.enc_tlb_flush_required    = tdx_tlb_flush_required;
+	x86_platform.guest.enc_status_change_prepare = tdx_enc_status_change_prepare;
+	x86_platform.guest.enc_status_change_finish  = tdx_enc_status_change_finish;
 
 	pr_info("Guest detected\n");
 }
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index 88085f369ff6..1ca9701917c5 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -150,7 +150,7 @@ struct x86_init_acpi {
  * @enc_cache_flush_required	Returns true if a cache flush is needed before changing page encryption status
  */
 struct x86_guest {
-	void (*enc_status_change_prepare)(unsigned long vaddr, int npages, bool enc);
+	bool (*enc_status_change_prepare)(unsigned long vaddr, int npages, bool enc);
 	bool (*enc_status_change_finish)(unsigned long vaddr, int npages, bool enc);
 	bool (*enc_tlb_flush_required)(bool enc);
 	bool (*enc_cache_flush_required)(void);
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index d82f4fa2f1bf..64664311ac2b 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -130,8 +130,8 @@ struct x86_cpuinit_ops x86_cpuinit = {
 
 static void default_nmi_init(void) { };
 
-static void enc_status_change_prepare_noop(unsigned long vaddr, int npages, bool enc) { }
-static bool enc_status_change_finish_noop(unsigned long vaddr, int npages, bool enc) { return false; }
+static bool enc_status_change_prepare_noop(unsigned long vaddr, int npages, bool enc) { return true; }
+static bool enc_status_change_finish_noop(unsigned long vaddr, int npages, bool enc) { return true; }
 static bool enc_tlb_flush_required_noop(bool enc) { return false; }
 static bool enc_cache_flush_required_noop(void) { return false; }
 static bool is_private_mmio_noop(u64 addr) {return false; }
diff --git a/arch/x86/mm/mem_encrypt_amd.c b/arch/x86/mm/mem_encrypt_amd.c
index e0b51c09109f..4f95c449a406 100644
--- a/arch/x86/mm/mem_encrypt_amd.c
+++ b/arch/x86/mm/mem_encrypt_amd.c
@@ -319,7 +319,7 @@ static void enc_dec_hypercall(unsigned long vaddr, int npages, bool enc)
 #endif
 }
 
-static void amd_enc_status_change_prepare(unsigned long vaddr, int npages, bool enc)
+static bool amd_enc_status_change_prepare(unsigned long vaddr, int npages, bool enc)
 {
 	/*
 	 * To maintain the security guarantees of SEV-SNP guests, make sure
@@ -327,6 +327,8 @@ static void amd_enc_status_change_prepare(unsigned long vaddr, int npages, bool
 	 */
 	if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP) && !enc)
 		snp_set_memory_shared(vaddr, npages);
+
+	return true;
 }
 
 /* Return true unconditionally: return value doesn't matter for the SEV side */
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 7159cf787613..b8f48ebe753c 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2151,7 +2151,8 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
 		cpa_flush(&cpa, x86_platform.guest.enc_cache_flush_required());
 
 	/* Notify hypervisor that we are about to set/clr encryption attribute. */
-	x86_platform.guest.enc_status_change_prepare(addr, numpages, enc);
+	if (!x86_platform.guest.enc_status_change_prepare(addr, numpages, enc))
+		return -EIO;
 
 	ret = __change_page_attr_set_clr(&cpa, 1);
  
Dave Hansen May 25, 2023, 7:18 p.m. UTC | #9
On 5/25/23 12:08, Kirill A. Shutemov wrote:
> Shared->Private conversion is rare. I only see one call total during the
> boot in my setup. Worth fixing anyway.
...
> Any comments?

So the rules are:

 * Shared mapping of a private page: BAD
 * Private mapping of a shared page: OK

?

The patch seems OK, other than having zero comments in it.
  

Patch

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 5574c91541a2..731be50b3d09 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -7,6 +7,7 @@ 
 #include <linux/cpufeature.h>
 #include <linux/export.h>
 #include <linux/io.h>
+#include <linux/mm.h>
 #include <asm/coco.h>
 #include <asm/tdx.h>
 #include <asm/vmx.h>
@@ -789,6 +790,34 @@  static bool try_accept_one(phys_addr_t *start, unsigned long len,
 	return true;
 }
 
+static bool try_accept_page(phys_addr_t start, phys_addr_t end)
+{
+	/*
+	 * For shared->private conversion, accept the page using
+	 * TDX_ACCEPT_PAGE TDX module call.
+	 */
+	while (start < end) {
+		unsigned long len = end - start;
+
+		/*
+		 * Try larger accepts first. It gives chance to VMM to keep
+		 * 1G/2M SEPT entries where possible and speeds up process by
+		 * cutting number of hypercalls (if successful).
+		 */
+
+		if (try_accept_one(&start, len, PG_LEVEL_1G))
+			continue;
+
+		if (try_accept_one(&start, len, PG_LEVEL_2M))
+			continue;
+
+		if (!try_accept_one(&start, len, PG_LEVEL_4K))
+			return false;
+	}
+
+	return true;
+}
+
 /*
  * Notify the VMM about page mapping conversion. More info about ABI
  * can be found in TDX Guest-Host-Communication Interface (GHCI),
@@ -838,6 +867,19 @@  static bool tdx_map_gpa(phys_addr_t start, phys_addr_t end, bool enc)
 	return !ret;
 }
 
+static bool tdx_enc_status_changed_phys(phys_addr_t start, phys_addr_t end,
+					bool enc)
+{
+	if (!tdx_map_gpa(start, end, enc))
+		return false;
+
+	/* private->shared conversion requires only MapGPA call */
+	if (!enc)
+		return true;
+
+	return try_accept_page(start, end);
+}
+
 /*
  * Inform the VMM of the guest's intent for this physical page: shared with
  * the VMM or private to the guest. The VMM is expected to change its mapping
@@ -845,37 +887,23 @@  static bool tdx_map_gpa(phys_addr_t start, phys_addr_t end, bool enc)
  */
 static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
 {
-	phys_addr_t start = __pa(vaddr);
-	phys_addr_t end   = __pa(vaddr + numpages * PAGE_SIZE);
+	unsigned long start = vaddr;
+	unsigned long end = start + numpages * PAGE_SIZE;
 
-	if (!tdx_map_gpa(start, end, enc))
+	if (offset_in_page(start) != 0)
 		return false;
 
-	/* private->shared conversion  requires only MapGPA call */
-	if (!enc)
-		return true;
+	if (!is_vmalloc_addr((void *)start))
+		return tdx_enc_status_changed_phys(__pa(start), __pa(end), enc);
 
-	/*
-	 * For shared->private conversion, accept the page using
-	 * TDX_ACCEPT_PAGE TDX module call.
-	 */
 	while (start < end) {
-		unsigned long len = end - start;
+		phys_addr_t start_pa = slow_virt_to_phys((void *)start);
+		phys_addr_t end_pa = start_pa + PAGE_SIZE;
 
-		/*
-		 * Try larger accepts first. It gives chance to VMM to keep
-		 * 1G/2M SEPT entries where possible and speeds up process by
-		 * cutting number of hypercalls (if successful).
-		 */
-
-		if (try_accept_one(&start, len, PG_LEVEL_1G))
-			continue;
-
-		if (try_accept_one(&start, len, PG_LEVEL_2M))
-			continue;
-
-		if (!try_accept_one(&start, len, PG_LEVEL_4K))
+		if (!tdx_enc_status_changed_phys(start_pa, end_pa, enc))
 			return false;
+
+		start += PAGE_SIZE;
 	}
 
 	return true;