[3/6] x86/tdx: Support vmalloc() for tdx_enc_status_changed()

Message ID 20221121195151.21812-4-decui@microsoft.com
State New
Series: Support TDX guests on Hyper-V

Commit Message

Dexuan Cui Nov. 21, 2022, 7:51 p.m. UTC
  When a TDX guest runs on Hyper-V, the hv_netvsc driver's netvsc_init_buf()
allocates buffers using vzalloc(), and needs to share the buffers with the
host OS by calling set_memory_decrypted(), which does not yet work for
vmalloc(). Add that support by handling the pages one by one.

Signed-off-by: Dexuan Cui <decui@microsoft.com>
---
 arch/x86/coco/tdx/tdx.c | 45 ++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 44 insertions(+), 1 deletion(-)
  

Comments

Dave Hansen Nov. 21, 2022, 9 p.m. UTC | #1
On 11/21/22 11:51, Dexuan Cui wrote:
> -static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
> +static bool tdx_enc_status_changed_for_contiguous_pages(unsigned long vaddr,
> +							int numpages, bool enc)

That naming is unfortunate.

First, it's getting way too long.

Second, you don't need two of these functions because it's contiguous or
not.  It's because tdx_enc_status_changed() only works on the direct map.

>  {
>  	phys_addr_t start = __pa(vaddr);
>  	phys_addr_t end   = __pa(vaddr + numpages * PAGE_SIZE);
> @@ -798,6 +800,47 @@ static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
>  	return true;
>  }
>  
> +static bool tdx_enc_status_changed_for_vmalloc(unsigned long vaddr,
> +					       int numpages, bool enc)
> +{
> +	void *start_va = (void *)vaddr;
> +	void *end_va = start_va + numpages * PAGE_SIZE;
> +	phys_addr_t pa;
> +
> +	if (offset_in_page(vaddr) != 0)
> +		return false;
> +
> +	while (start_va < end_va) {
> +		pa = slow_virt_to_phys(start_va);
> +		if (!enc)
> +			pa |= cc_mkdec(0);
> +
> +		if (!tdx_map_gpa(pa, pa + PAGE_SIZE, enc))
> +			return false;
> +
> +		/*
> +		 * private->shared conversion requires only MapGPA call.
> +		 *
> +		 * For shared->private conversion, accept the page using
> +		 * TDX_ACCEPT_PAGE TDX module call.
> +		 */
> +		if (enc && !try_accept_one(&pa, PAGE_SIZE, PG_LEVEL_4K))
> +			return false;

Don't we support large vmalloc() mappings these days?

> +		start_va += PAGE_SIZE;
> +	}
> +
> +	return true;
> +}

I really don't like the copy-and-paste fork here.

I'd almost just rather have this *one* "vmalloc" copy that does
slow_virt_to_phys() on direct map addresses than have two copies.

Can you please look into making *one* function that works on either kind
of mapping?
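One way to collapse the two copies, sketched here as a self-contained user-space program with stubbed-out TDX primitives (the real kernel code would use slow_virt_to_phys(), tdx_map_gpa() and try_accept_one(); the stub names and the fake address translation below are illustrative only), is a single page-by-page loop that never assumes physical contiguity:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096UL

/* Stubs standing in for the real TDX primitives. */
static int map_gpa_calls;

static bool stub_map_gpa(uint64_t start, uint64_t end, bool enc)
{
    (void)start; (void)end; (void)enc;
    map_gpa_calls++;
    return true;
}

static bool stub_accept_page(uint64_t pa)
{
    (void)pa;
    return true;
}

/* Fake page-table walk: identity map shifted by a fixed offset,
 * standing in for slow_virt_to_phys(), which works for both
 * direct-map and vmalloc addresses. */
static uint64_t stub_virt_to_phys(unsigned long va)
{
    return (uint64_t)va + 0x100000000UL;
}

/*
 * Single conversion routine: walk the range one page at a time, so
 * it does not matter whether the backing physical pages are
 * contiguous (direct map) or scattered (vmalloc).
 */
static bool enc_status_changed(unsigned long vaddr, int numpages, bool enc)
{
    unsigned long va;

    if (vaddr & (PAGE_SIZE - 1))
        return false;

    for (va = vaddr; va < vaddr + numpages * PAGE_SIZE; va += PAGE_SIZE) {
        uint64_t pa = stub_virt_to_phys(va);

        /* The real code would also OR in cc_mkdec(0) for !enc. */
        if (!stub_map_gpa(pa, pa + PAGE_SIZE, enc))
            return false;

        /* shared->private additionally needs TDX_ACCEPT_PAGE. */
        if (enc && !stub_accept_page(pa))
            return false;
    }
    return true;
}
```

The per-page loop trades some hypercall batching for uniformity; whether that cost matters depends on how often set_memory_{en,de}crypted() is actually called.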

> +static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
> +{
> +	if (is_vmalloc_addr((void *)vaddr))
> +		return tdx_enc_status_changed_for_vmalloc(vaddr, numpages, enc);
> +
> +	return tdx_enc_status_changed_for_contiguous_pages(vaddr, numpages, enc);
> +}
> +
>  void __init tdx_early_init(void)
>  {
>  	u64 cc_mask;
  
Kirill A. Shutemov Nov. 22, 2022, 12:24 a.m. UTC | #2
On Mon, Nov 21, 2022 at 11:51:48AM -0800, Dexuan Cui wrote:
> When a TDX guest runs on Hyper-V, the hv_netvsc driver's netvsc_init_buf()
> allocates buffers using vzalloc(), and needs to share the buffers with the
> host OS by calling set_memory_decrypted(), which is not working for
> vmalloc() yet. Add the support by handling the pages one by one.

Why do you use vmalloc here in the first place?

Will you also adjust direct mapping to have shared bit set?

If not, we will have problems with load_unaligned_zeropad() when it will
access shared pages via non-shared mapping.

If direct mapping is adjusted, we will get direct mapping fragmentation.

Maybe tap into swiotlb buffer using DMA API?
  
Dexuan Cui Nov. 23, 2022, 4:01 a.m. UTC | #3
> From: Dave Hansen <dave.hansen@intel.com>
> Sent: Monday, November 21, 2022 1:01 PM
> [...]
> On 11/21/22 11:51, Dexuan Cui wrote:
> > -static bool tdx_enc_status_changed(unsigned long vaddr, int numpages,
> bool enc)
> > +static bool tdx_enc_status_changed_for_contiguous_pages(unsigned long
> vaddr,
> > +							int numpages, bool enc)
> 
> That naming is unfortunate.
> 
> First, it's getting way too long.
> 
> Second, you don't need two of these functions because it's contiguous or
> not.  It's because tdx_enc_status_changed() only works on the direct map.

Will try to make one function with better naming.

> > +static bool tdx_enc_status_changed_for_vmalloc(unsigned long vaddr,
> > +					       int numpages, bool enc)
> > +{
> > +	void *start_va = (void *)vaddr;
> > +	void *end_va = start_va + numpages * PAGE_SIZE;
> > +	phys_addr_t pa;
> > +
> > +	if (offset_in_page(vaddr) != 0)
> > +		return false;
> > +
> > +	while (start_va < end_va) {
> > +		pa = slow_virt_to_phys(start_va);
> > +		if (!enc)
> > +			pa |= cc_mkdec(0);
> > +
> > +		if (!tdx_map_gpa(pa, pa + PAGE_SIZE, enc))
> > +			return false;
> > +
> > +		/*
> > +		 * private->shared conversion requires only MapGPA call.
> > +		 *
> > +		 * For shared->private conversion, accept the page using
> > +		 * TDX_ACCEPT_PAGE TDX module call.
> > +		 */
> > +		if (enc && !try_accept_one(&pa, PAGE_SIZE, PG_LEVEL_4K))
> > +			return false;
> 
> Don't we support large vmalloc() mappings these days?

I just noticed Nicholas Piggin's huge vmalloc mapping patches that were
merged in April 2021. I'll take a look and see what I can use here.

> > +		start_va += PAGE_SIZE;
> > +	}
> > +
> > +	return true;
> > +}
> 
> I really don't like the copy-and-paste fork here.
> 
> I'd almost just rather have this *one* "vmalloc" copy that does
> slow_virt_to_phys() on direct map addresses than have two copies.

It looks like set_memory_{de,en}crypted() is typically not invoked
frequently while Linux is running, e.g. the swiotlb bounce buffers are
only initialized once with set_memory_decrypted(). A driver running in
a guest on a hypervisor also typically calls set_memory_decrypted() only
once, when it initializes the device's buffers (which need to be shared
with the hypervisor). So it looks like slow_virt_to_phys() may be
acceptable for contiguous memory pages as well.

> Can you please look into making *one* function that works on either kind
> of mapping?

Ok. Looking into this using slow_virt_to_phys() for direct map
addresses as well.
  
Dexuan Cui Nov. 23, 2022, 11:51 p.m. UTC | #4
> From: Kirill A. Shutemov <kirill@shutemov.name>
> Sent: Monday, November 21, 2022 4:24 PM
> 
> On Mon, Nov 21, 2022 at 11:51:48AM -0800, Dexuan Cui wrote:
> > When a TDX guest runs on Hyper-V, the hv_netvsc driver's netvsc_init_buf()
> > allocates buffers using vzalloc(), and needs to share the buffers with the
> > host OS by calling set_memory_decrypted(), which is not working for
> > vmalloc() yet. Add the support by handling the pages one by one.
> 
> Why do you use vmalloc here in the first place?

We changed to vmalloc() long ago, mainly for 2 reasons:

1) __alloc_pages() only allows us to allocate up to 4MB of contiguous pages, but
we need a 16MB buffer in the Hyper-V vNIC driver for better performance.

2) Due to memory fragmentation, we have seen that the page allocator can fail
to allocate 2 contiguous pages, though the system has a lot of free memory. We
need to support Hyper-V vNIC hot addition, so we changed to vmalloc(). See

b679ef73edc2 ("hyperv: Add support for physically discontinuous receive buffer")
06b47aac4924 ("Drivers: net-next: hyperv: Increase the size of the sendbuf region")

> Will you also adjust direct mapping to have shared bit set?
> 
> If not, we will have problems with load_unaligned_zeropad() when it will
> access shared pages via non-shared mapping.
> 
> If direct mapping is adjusted, we will get direct mapping fragmentation.

load_unaligned_zeropad() was added 10 years ago by Linus in
e419b4cc5856 ("vfs: make word-at-a-time accesses handle a non-existing page") 
so this seemingly-strange usage is legitimate.

Sorry I don't know how to adjust direct mapping. Do you mean I should do
something like the below in tdx_enc_status_changed_for_vmalloc() for
every 'start_va':
  pa = slow_virt_to_phys(start_va);
  set_memory_decrypted(phys_to_virt(pa), 1);
?

But IIRC this issue is not specific to vmalloc()? e.g. we get 1 page by
__get_free_pages(GFP_KERNEL, 0) or kmalloc(PAGE_SIZE, GFP_KERNEL)
and we call set_memory_decrypted() for the page. How can we make
sure the callers of load_unaligned_zeropad() can't access the page
via non-shared mapping?

It looks like you have a patchset to address the issue (though it seems it
hasn't been merged into the mainline yet)?
https://lwn.net/ml/linux-kernel/20220614120231.48165-11-kirill.shutemov@linux.intel.com/

BTW, I'll drop tdx_enc_status_changed_for_vmalloc() and try to enhance the
existing tdx_enc_status_changed() to support both the direct mapping and vmalloc().

> Maybe tap into swiotlb buffer using DMA API?

I doubt the Hyper-V vNIC driver here can call dma_alloc_coherent() to
get a 16MB buffer from the swiotlb buffers. I'm looking at dma_alloc_coherent() ->
dma_alloc_attrs() -> dma_direct_alloc(), which typically calls
__dma_direct_alloc_pages() to allocate contiguous memory pages (which
can't exceed the 4MB limit; note there is no virtual IOMMU in a guest on Hyper-V).

It looks like we can't force dma_direct_alloc() to call dma_direct_alloc_no_mapping(),
because even if we set the DMA_ATTR_NO_KERNEL_MAPPING flag,
force_dma_unencrypted() is still always true for a TDX guest.
  
Kirill A. Shutemov Nov. 24, 2022, 7:51 a.m. UTC | #5
On Wed, Nov 23, 2022 at 11:51:11PM +0000, Dexuan Cui wrote:
> > From: Kirill A. Shutemov <kirill@shutemov.name>
> > Sent: Monday, November 21, 2022 4:24 PM
> > 
> > On Mon, Nov 21, 2022 at 11:51:48AM -0800, Dexuan Cui wrote:
> > > When a TDX guest runs on Hyper-V, the hv_netvsc driver's netvsc_init_buf()
> > > allocates buffers using vzalloc(), and needs to share the buffers with the
> > > host OS by calling set_memory_decrypted(), which is not working for
> > > vmalloc() yet. Add the support by handling the pages one by one.
> > 
> > Why do you use vmalloc here in the first place?
> 
> We changed to vmalloc() long ago, mainly for 2 reasons:
> 
> 1) __alloc_pages() only allows us to allocate up to 4MB of contiguous pages, but
> we need a 16MB buffer in the Hyper-V vNIC driver for better performance.
> 
> 2) Due to memory fragmentation, we have seen that the page allocator can fail
> to allocate 2 contigous pages, though the system has a lot of free memory. We
> need to support Hyper-V vNIC hot addition, so we changed to vmalloc. See
> 
> b679ef73edc2 ("hyperv: Add support for physically discontinuous receive buffer")
> 06b47aac4924 ("Drivers: net-next: hyperv: Increase the size of the sendbuf region")
> 
> > Will you also adjust direct mapping to have shared bit set?
> > 
> > If not, we will have problems with load_unaligned_zeropad() when it will
> > access shared pages via non-shared mapping.
> > 
> > If direct mapping is adjusted, we will get direct mapping fragmentation.
> 
> load_unaligned_zeropad() was added 10 years ago by Linus in
> e419b4cc5856 ("vfs: make word-at-a-time accesses handle a non-existing page") 
> so this seemingly-strange usage is legitimate.
> 
> Sorry I don't know how to adjust direct mapping. Do you mean I should do
> something like the below in tdx_enc_status_changed_for_vmalloc() for
> every 'start_va':
>   pa = slow_virt_to_phys(start_va);
>   set_memory_decrypted(phys_to_virt(pa), 1);
> ?
> 
> But IIRC this issue is not specific to vmalloc()? e.g. we get 1 page by
> __get_free_pages(GFP_KERNEL, 0) or kmalloc(PAGE_SIZE, GFP_KERNEL)
> and we call set_memory_decrypted() for the page. How can we make
> sure the callers of load_unaligned_zeropad() can't access the page
> via non-shared mapping?

__get_free_pages() and kmalloc() returns pointer to the page in the direct
mapping. set_memory_decrypted() adjust direct mapping to have the shared
bit set. Everything is fine.

> It looks like you have a patchset to address the issue (it looks like it
> hasn't been merged into the mainline?) ?
> https://lwn.net/ml/linux-kernel/20220614120231.48165-11-kirill.shutemov@linux.intel.com/

It addresses similar, but different issue. It is only relevant for
unaccepted memory support.

> BTW, I'll drop tdx_enc_status_changed_for_vmalloc() and try to enhance the
> existing tdx_enc_status() to support both direct mapping and vmalloc().
> 
> > Maybe tap into swiotlb buffer using DMA API?
> 
> I doubt the Hyper-V vNIC driver here can call dma_alloc_coherent() to
> get a 16MB buffer from swiotlb buffers. I'm looking at dma_alloc_coherent() ->
> dma_alloc_attrs() -> dma_direct_alloc(), which typically calls 
> __dma_direct_alloc_pages() to allocate congituous memory pages (which
> can't exceed the 4MB limit. Note there is no virtual IOMMU in a guest on Hyper-V).
> 
> It looks like we can't force dma_direct_alloc() to call dma_direct_alloc_no_mapping(),
> because even if we set the DMA_ATTR_NO_KERNEL_MAPPING flag,
> force_dma_unencrypted() is still always true for a TDX guest.

The point is not in reaching dma_direct_alloc_no_mapping(). The idea is
allocate from existing swiotlb that already has shared bit set in direct
mapping.

vmap area that maps pages allocated from swiotlb also should work fine.

To be honest, I don't understand DMA API well enough. I need to experiment
with it to see what works for the case.
  
Dexuan Cui Nov. 27, 2022, 8:27 p.m. UTC | #6
> From: Kirill A. Shutemov <kirill@shutemov.name>
> Sent: Wednesday, November 23, 2022 11:51 PM
> > [...]
> > > Will you also adjust direct mapping to have shared bit set?
> > >
> > > If not, we will have problems with load_unaligned_zeropad() when it will
> > > access shared pages via non-shared mapping.

It looks like this is also an issue for AMD SNP?

> > > If direct mapping is adjusted, we will get direct mapping fragmentation.
> > [...]
> 
> __get_free_pages() and kmalloc() returns pointer to the page in the direct
> mapping. set_memory_decrypted() adjust direct mapping to have the shared
> bit set. Everything is fine.

You're correct. Now I understand the issue.
 
> > BTW, I'll drop tdx_enc_status_changed_for_vmalloc() and try to enhance the
> > existing tdx_enc_status() to support both direct mapping and vmalloc().

Looks like I should not drop tdx_enc_status_changed_for_vmalloc(): we have to
detect whether the address is a vmalloc address, and if so, we'll also have to
adjust the direct mapping?

> > > Maybe tap into swiotlb buffer using DMA API?
> >
> > I doubt the Hyper-V vNIC driver here can call dma_alloc_coherent() to
> > get a 16MB buffer from swiotlb buffers. I'm looking at dma_alloc_coherent()
> ->
> > dma_alloc_attrs() -> dma_direct_alloc(), which typically calls
> > __dma_direct_alloc_pages() to allocate congituous memory pages (which
> > can't exceed the 4MB limit. Note there is no virtual IOMMU in a guest on
> Hyper-V).
> >
> > It looks like we can't force dma_direct_alloc() to call
> dma_direct_alloc_no_mapping(),
> > because even if we set the DMA_ATTR_NO_KERNEL_MAPPING flag,
> > force_dma_unencrypted() is still always true for a TDX guest.
> 
> The point is not in reaching dma_direct_alloc_no_mapping(). The idea is
> allocate from existing swiotlb that already has shared bit set in direct
> mapping.
> 
> vmap area that maps pages allocated from swiotlb also should work fine.

My understanding is that swiotlb is mainly for buffer bouncing, and the
only non-static function in kernel/dma/swiotlb.c for allocating memory
is swiotlb_alloc(), which is defined only when CONFIG_DMA_RESTRICTED_POOL=y,
and that option is =n on x86-64 because CONFIG_OF=n.

If we don't adjust the direct mapping, IMO we'll have to do:
1) force the kernel to not use load_unaligned_zeropad() for a coco VM?
Or
2) make swiotlb_alloc()/free() available to x86-64 and export them for drivers?
Or
3) implement and use a custom memory pool that's pre-allocated using
__get_free_pages() and set_memory_decrypted(), and use the pool in drivers???

BTW, ideas 1) and 3) are from Michael Kelley who discussed the issue with me.
Michael can share more details and thoughts.

I'm inclined to detect a vmalloc address and adjust the direct mapping:

1) Typically IMO drivers don't use a lot of shared buffers from vmalloc(), so
direct mapping fragmentation is not a severe issue.

2) When a driver does use shared buffers from vmalloc(), typically it only
allocates the buffers once, when the device is being initialized. For a Linux
coco VM on Hyper-V, only the hv_netvsc driver uses shared buffers from
vmalloc(). While we do need to support vNIC hot add/remove, in practice
AFAIK vNIC hot add/remove happens very infrequently.

Thoughts?
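The inclination above, walking a vmalloc range page by page and also fixing up each backing page's direct-map alias, might be sketched as follows (a user-space sketch with a tiny fake "machine"; fake_virt_to_phys() and fake_decrypt_direct_alias() are invented stand-ins for slow_virt_to_phys() and set_memory_decrypted() on phys_to_virt(pa)):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096UL
#define NPAGES    8

/* Recorded per-physical-page state of the direct-map alias. */
static bool direct_map_decrypted[NPAGES];

/* Fake slow_virt_to_phys(): vmalloc page i maps to physical page
 * (NPAGES - 1 - i), i.e. deliberately discontiguous ordering, as
 * vmalloc permits. */
static uint64_t fake_virt_to_phys(unsigned long va)
{
    unsigned long idx = va / PAGE_SIZE;
    return (uint64_t)(NPAGES - 1 - idx) * PAGE_SIZE;
}

/* Fake set_memory_decrypted() on the direct-map alias of 'pa'. */
static void fake_decrypt_direct_alias(uint64_t pa)
{
    direct_map_decrypted[pa / PAGE_SIZE] = true;
}

/* Convert a vmalloc range page by page, fixing up the direct-map
 * alias of each backing page so no stale private alias is left for
 * load_unaligned_zeropad() to trip over. */
static void decrypt_vmalloc_range(unsigned long vaddr, int numpages)
{
    for (int i = 0; i < numpages; i++) {
        uint64_t pa = fake_virt_to_phys(vaddr + i * PAGE_SIZE);

        /* ...MapGPA (and accept, for the encrypt direction) of
         * 'pa' would go here in the real code... */
        fake_decrypt_direct_alias(pa);
    }
}
```

The direct-map fragmentation cost is bounded by how much vmalloc memory drivers actually share, which, per the argument above, is small and mostly paid once at device initialization.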
  

Patch

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 46971cc7d006..8bccae962b6d 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -5,6 +5,7 @@ 
 #define pr_fmt(fmt)     "tdx: " fmt
 
 #include <linux/cpufeature.h>
+#include <linux/mm.h>
 #include <asm/coco.h>
 #include <asm/tdx.h>
 #include <asm/vmx.h>
@@ -754,7 +755,8 @@  static bool tdx_map_gpa(phys_addr_t start, phys_addr_t end, bool enc)
  * the VMM or private to the guest.  The VMM is expected to change its mapping
  * of the page in response.
  */
-static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
+static bool tdx_enc_status_changed_for_contiguous_pages(unsigned long vaddr,
+							int numpages, bool enc)
 {
 	phys_addr_t start = __pa(vaddr);
 	phys_addr_t end   = __pa(vaddr + numpages * PAGE_SIZE);
@@ -798,6 +800,47 @@  static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
 	return true;
 }
 
+static bool tdx_enc_status_changed_for_vmalloc(unsigned long vaddr,
+					       int numpages, bool enc)
+{
+	void *start_va = (void *)vaddr;
+	void *end_va = start_va + numpages * PAGE_SIZE;
+	phys_addr_t pa;
+
+	if (offset_in_page(vaddr) != 0)
+		return false;
+
+	while (start_va < end_va) {
+		pa = slow_virt_to_phys(start_va);
+		if (!enc)
+			pa |= cc_mkdec(0);
+
+		if (!tdx_map_gpa(pa, pa + PAGE_SIZE, enc))
+			return false;
+
+		/*
+		 * private->shared conversion requires only MapGPA call.
+		 *
+		 * For shared->private conversion, accept the page using
+		 * TDX_ACCEPT_PAGE TDX module call.
+		 */
+		if (enc && !try_accept_one(&pa, PAGE_SIZE, PG_LEVEL_4K))
+			return false;
+
+		start_va += PAGE_SIZE;
+	}
+
+	return true;
+}
+
+static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
+{
+	if (is_vmalloc_addr((void *)vaddr))
+		return tdx_enc_status_changed_for_vmalloc(vaddr, numpages, enc);
+
+	return tdx_enc_status_changed_for_contiguous_pages(vaddr, numpages, enc);
+}
+
 void __init tdx_early_init(void)
 {
 	u64 cc_mask;