[RFC,V1,0/5] x86: CVMs: Align memory conversions to 2M granularity

Message ID 20240112055251.36101-1-vannapurve@google.com
Vishal Annapurve Jan. 12, 2024, 5:52 a.m. UTC
  The goal of this series is to align memory conversion requests from CVMs to
huge page sizes, allowing better host-side management of guest memory and
optimized page table walks.

This patch series is partially tested and needs more work; I am seeking
feedback from the wider community before making further progress.

Background
=====================
Confidential VMs (CVMs) support two types of guest memory ranges:
1) Private memory: intended to be consumed/modified only by the CVM.
2) Shared memory: visible to both guest and host components, used for
untrusted IO.

Guest memfd [1] support is set to be merged upstream to isolate guest private
memory from host userspace. The guest memfd approach allows the following
setup:
* Private memory backed by the guest memfd file, which is not accessible
  from host userspace.
* Shared memory backed by tmpfs/hugetlbfs files that are accessible from
  host userspace.

The userspace VMM needs to register two backing stores for all of the guest
memory ranges:
* HVA for shared memory
* Guest memfd ranges for private memory

KVM keeps track of shared/private guest memory ranges that can be updated at
runtime using IOCTLs. This allows KVM to back the guest memory using either HVA
(shared) or guest memfd file offsets (private) based on the attributes of the
guest memory ranges.

In this setup, there is a possibility of "double allocation", i.e. scenarios
where both the shared and private backing stores mapped to the same guest
memory range have memory allocated.

The guest issues a hypercall to convert memory between the two types, which
KVM forwards to host userspace.
The userspace VMM is expected to handle the conversion as follows:
1) Private to shared conversion:
  * Update guest memory attributes for the range to be shared using KVM
    supported IOCTLs.
    - While handling this IOCTL, KVM will unmap EPT/NPT entries corresponding
      to the guest memory being converted.
  * Unback the guest memfd range.
2) Shared to private conversion:
  * Update guest memory attributes for the range to be private using KVM
    supported IOCTLs.
    - While handling this IOCTL, KVM will unmap EPT/NPT entries corresponding
      to the guest memory being converted.
  * Unback the shared memory file.

Note that unbacking needs to be done for both kinds of conversions in order to
avoid double allocation.

Problem
=====================
CVMs can convert memory between these two types at 4K granularity. Conversions
done at 4K granularity cause issues when guest memfd support is used with
hugetlb/hugepage backed guest private memory:
1) Hugetlbfs doesn't allow freeing subpage ranges when punching holes,
causing all private to shared memory conversions to result in double
allocation.
2) Even if a new filesystem that allows splitting hugepages is implemented
for guest memfd, punching holes at 4K will cause:
   - Loss of the vmemmap optimization [2].
   - More memory consumed by EPT/NPT entries and extra page table walks for
     guest side accesses.
   - Shared memory mappings consuming more host page table entries and
     extra page table walks for host side accesses.
   - A higher number of conversions, with the additional overhead of VM
     exits serviced by host userspace.

Memory conversion scenarios in the guest that are of major concern:
- SWIOTLB area conversion early during boot.
   * dma_map_* API invocations in CVMs use bounce buffers from the SWIOTLB
     region, which is already marked as shared.
- Device drivers allocating memory at runtime using dma_alloc_* APIs,
  which bypass SWIOTLB.

Proposal
=====================
To counter the above issues, this series proposes the following:
1) Use boot-time-allocated SWIOTLB pools for all DMA memory allocated
using dma_alloc_* APIs.
2) Increase the memory allocated at boot for SWIOTLB from 6% to 8% for CVMs.
3) Enable dynamic SWIOTLB [4] by default for CVMs so that SWIOTLB can be
scaled up as needed.
4) Ensure the SWIOTLB pool is 2M aligned so that all conversions happen at
2M granularity, once, during boot.
5) Add a check to ensure all conversions happen at 2M granularity.

** This series leaves out some of the conversion sites which might not
be 2M aligned but should be easy to fix once the approach is finalized. **

1G alignment for conversion:
* Using 1G alignment may over-allocate SWIOTLB buffers; whether that is
  acceptable for CVMs depends on further considerations.
* It might be challenging to use 1G aligned conversions in OVMF. 2M
  alignment should be achievable with OVMF changes [3].

Alternatives could be:
1) Separate hugepage-aligned DMA pools set up by individual device drivers
for CVMs.

[1] https://lore.kernel.org/linux-mips/20231105163040.14904-1-pbonzini@redhat.com/
[2] https://www.kernel.org/doc/html/next/mm/vmemmap_dedup.html
[3] https://github.com/tianocore/edk2/pull/3784
[4] https://lore.kernel.org/lkml/20230908080031.GA7848@lst.de/T/

Vishal Annapurve (5):
  swiotlb: Support allocating DMA memory from SWIOTLB
  swiotlb: Allow setting up default alignment of SWIOTLB region
  x86: CVMs: Enable dynamic swiotlb by default for CVMs
  x86: CVMs: Allow allocating all DMA memory from SWIOTLB
  x86: CVMs: Ensure that memory conversions happen at 2M alignment

 arch/x86/Kconfig             |  2 ++
 arch/x86/kernel/pci-dma.c    |  2 +-
 arch/x86/mm/mem_encrypt.c    |  8 ++++++--
 arch/x86/mm/pat/set_memory.c |  6 ++++--
 include/linux/swiotlb.h      | 22 ++++++----------------
 kernel/dma/direct.c          |  4 ++--
 kernel/dma/swiotlb.c         | 17 ++++++++++++-----
 7 files changed, 33 insertions(+), 28 deletions(-)
  

Comments

Vishal Annapurve Jan. 30, 2024, 4:42 p.m. UTC | #1
On Fri, Jan 12, 2024 at 11:22 AM Vishal Annapurve <vannapurve@google.com> wrote:

Ping for review of this series.

Thanks,
Vishal
  
Dave Hansen Jan. 31, 2024, 4:52 p.m. UTC | #2
There's a bunch of code in the kernel for TDX and SEV guests.  How much
of it uses the "CVM" nomenclature?

What do you do when you need to dynamically scale up the SWIOTLB size
and can't allocate a 2M page?  I guess you're saying here that you'd
rather run with a too-small 2M pool than a large-enough mixed 4k/2M pool.

I also had a really hard time parsing through the problem statement and
solution here.  I'd really suggest cleaning up the problem statement and
more clearly differentiating the host and guest sides in the description.
  
Vishal Annapurve Feb. 1, 2024, 5:44 a.m. UTC | #3
On Wed, Jan 31, 2024 at 10:23 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> There's a bunch of code in the kernel for TDX and SEV guests.  How much
> of it uses the "CVM" nomenclature?

Right, I see that "CoCo VMs" is a more accepted term in the kernel
codebase so far, will update the references in the next version.

>
> What do you do when you need to dynamically scale up the SWIOTLB size
> and can't allocate a 2M page?  I guess you're saying here that you'd
> rather run with a too-small 2M pool than a large-enough mixed 4k/2M pool.

I am not yet certain how to ensure a 2M page is always available/made
available at runtime for CoCo VMs. A few options that I can think of:
1) Reserve additional memory for CMA allocations to satisfy runtime
requests for 2M allocations.
2) Pre-reserve the SWIOTLB at a safe value, just as is done today, and
not rely on dynamic scaling.

Any suggestions are welcome.

>
> I also had a really hard time parsing through the problem statement and
> solution here.  I'd really suggest cleaning up the problem statement and
> more clearly differentiating the host and guest sides in the description.

Thanks for taking a look at this series. I will reword the description
in the next version. The goal, basically, is to ensure that private and
shared memory regions are always huge page aligned.