[0/4] KVM: Honor guest memory types for virtio GPU devices

Series: KVM: Honor guest memory types for virtio GPU devices
[0/4] KVM: Honor guest memory types for virtio GPU devices
[1/4] KVM: Introduce a new memslot flag KVM_MEM_NON_COHERENT_DMA
[2/4] KVM: x86: Add a new param "slot" to op get_mt_mask in kvm_x86_ops
[3/4] KVM: VMX: Honor guest PATs for memslots of flag KVM_MEM_NON_COHERENT_DMA
[4/4] KVM: selftests: Set KVM_MEM_NON_COHERENT_DMA as a supported memslot flag

Message ID

20240105091237.24577-1-yan.y.zhao@intel.com

Headers

Received-SPF: pass (google.com: domain of
 linux-kernel+bounces-17671-ouuuleilei=gmail.com@vger.kernel.org designates
 2604:1380:40f1:3f00::1 as permitted sender) client-ip=2604:1380:40f1:3f00::1;
From: Yan Zhao <yan.y.zhao@intel.com>
To: kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	dri-devel@lists.freedesktop.org
Cc: pbonzini@redhat.com,
	seanjc@google.com,
	olvaffe@gmail.com,
	kevin.tian@intel.com,
	zhiyuan.lv@intel.com,
	zhenyu.z.wang@intel.com,
	yongwei.ma@intel.com,
	vkuznets@redhat.com,
	wanpengli@tencent.com,
	jmattson@google.com,
	joro@8bytes.org,
	gurchetansingh@chromium.org,
	kraxel@redhat.com,
	zzyiwei@google.com,
	ankita@nvidia.com,
	jgg@nvidia.com,
	alex.williamson@redhat.com,
	maz@kernel.org,
	oliver.upton@linux.dev,
	james.morse@arm.com,
	suzuki.poulose@arm.com,
	yuzenghui@huawei.com,
	Yan Zhao <yan.y.zhao@intel.com>
Subject: [PATCH 0/4] KVM: Honor guest memory types for virtio GPU devices
Date: Fri,  5 Jan 2024 17:12:37 +0800
Message-Id: <20240105091237.24577-1-yan.y.zhao@intel.com>
Precedence: bulk

Message

Yan Zhao Jan. 5, 2024, 9:12 a.m. UTC

This series allows user space to notify KVM of noncoherent DMA status so
that KVM honors guest memory types in specified memory slot ranges.

Motivation
===
A virtio GPU device may want to configure GPU hardware to work in
noncoherent mode, i.e. some of its DMAs do not snoop CPU caches.
This is generally done for performance reasons.
On certain platforms, GFX performance can improve by 20+% when DMAs take
the noncoherent path.

This noncoherent DMA mode works in the following sequence:
1. The host backend driver programs the hardware not to snoop the memory of
   the target DMA buffer.
2. The host backend driver tells the guest frontend driver to program the
   guest PAT to WC for the target DMA buffer.
3. The guest frontend driver writes to the DMA buffer without issuing
   clflush.
4. The hardware does noncoherent DMA to the target buffer.

In this noncoherent DMA mode, both guest and hardware regard a DMA buffer
as not cached. So, if KVM forces the effective memory type of this DMA
buffer to be WB, hardware DMA may read incorrect data and cause misc
failures.

Therefore, this series introduces a new memslot flag,
KVM_MEM_NON_COHERENT_DMA, to allow user space to convey noncoherent DMA
status at memslot granularity.
Platforms that do not always honor guest memory types can choose to honor
them in the ranges of memslots with KVM_MEM_NON_COHERENT_DMA set.
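
As an illustration of what setting the flag from user space might look like, here is a minimal sketch in C. It only fills in the existing struct kvm_userspace_memory_region from <linux/kvm.h>. The helper name make_noncoherent_slot() is hypothetical, and because the flag is not in mainline headers, the bit value below is a placeholder assumption rather than the value assigned by patch 1/4.

```c
#include <stdint.h>
#include <string.h>
#include <linux/kvm.h>

/*
 * Assumption: the flag is not in mainline uapi headers, so it is defined
 * here with a placeholder bit. The authoritative value is whatever
 * patch 1/4 of the series assigns.
 */
#ifndef KVM_MEM_NON_COHERENT_DMA
#define KVM_MEM_NON_COHERENT_DMA (1UL << 31)
#endif

/*
 * Hypothetical helper: describe a virtio GPU buffer, already mmap'ed by the
 * host backend at hva, as a memslot whose guest memory types KVM should
 * honor.
 */
struct kvm_userspace_memory_region
make_noncoherent_slot(uint32_t slot, uint64_t gpa, uint64_t size, void *hva)
{
	struct kvm_userspace_memory_region region;

	memset(&region, 0, sizeof(region));
	region.slot = slot;
	region.flags = KVM_MEM_NON_COHERENT_DMA;
	region.guest_phys_addr = gpa;
	region.memory_size = size;
	region.userspace_addr = (uint64_t)(uintptr_t)hva;
	return region;
}
```

A VMM would then pass the result to ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region), as it does for any other memslot; only the flag differs.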

Security
===
The biggest concern with KVM honoring the guest's memory type is the page
aliasing issue.
On Intel platforms,
- For host MMIO, KVM VMX always programs the EPT memory type to UC (which
  overrides all guest PAT types except WC); this series does not change
  that.

- For host non-MMIO pages,
  * The virtio guest frontend and host backend drivers should be synced to
    use the same memory type to map a buffer. Otherwise, memory data may be
    incorrect. But this only impacts the buggy guest itself.
  * For live migration, user space can skip reading/writing memory
    corresponding to memslots with the flag KVM_MEM_NON_COHERENT_DMA, or
    do some special handling during memory read/write.

Implementation
===
Unlike the previous RFC series [1], which used a new KVM VIRTIO device to
convey noncoherent DMA status, this version introduces a new memslot
flag, similar to what's done in the series from Google at [2].
The difference is that [2] increases the noncoherent DMA count to ask KVM
VMX to honor guest memory types for all guest memory as a whole, while this
series only asks KVM to honor guest memory types in the specified
memslots.
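
To make the per-memslot behavior concrete, the following is a simplified model in C of how an EPT memory type could be chosen per page. It is a sketch, not the code from patch 3/4: the function name ept_memtype_bits() and the SLOT_NON_COHERENT_DMA placeholder bit are assumptions, and the existing paths for the global noncoherent DMA count and guest CR0.CD are left out. The memory type encodings and the EPT bit positions (type in bits 5:3, "ignore PAT" in bit 6) follow the x86 architecture.

```c
#include <stdbool.h>
#include <stdint.h>

/* x86 memory type encodings (same values as the MTRR type field). */
enum { MT_UC = 0, MT_WC = 1, MT_WT = 4, MT_WP = 5, MT_WB = 6 };

/* EPT leaf entry: memory type in bits 5:3, "ignore PAT" (IPAT) in bit 6. */
#define EPT_MT_SHIFT 3
#define EPT_IPAT_BIT (1u << 6)

/* Placeholder for the new memslot flag; the real bit comes from patch 1/4. */
#define SLOT_NON_COHERENT_DMA (1u << 31)

/* Simplified model: pick the EPT memory type bits for one guest page. */
uint8_t ept_memtype_bits(bool is_host_mmio, uint32_t slot_flags)
{
	/* Host MMIO stays UC, with or without the series. */
	if (is_host_mmio)
		return MT_UC << EPT_MT_SHIFT;

	/*
	 * Flagged memslot: WB with IPAT clear, so the effective type is
	 * whatever the guest PAT selects (for example WC).
	 */
	if (slot_flags & SLOT_NON_COHERENT_DMA)
		return MT_WB << EPT_MT_SHIFT;

	/* Everything else: force WB and ignore the guest PAT. */
	return (MT_WB << EPT_MT_SHIFT) | EPT_IPAT_BIT;
}
```

The point of the sketch is that only pages in flagged memslots lose the IPAT bit; all other guest memory keeps the forced-WB behavior.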

The reason for not introducing a KVM cap or a memslot flag that lets users
toggle noncoherent DMA state as a whole is mainly the page aliasing
issue mentioned above.
If guest memory types are only honored in a limited set of memslots, user
space can do special handling before/after accessing guest memory belonging
to those memslots.

A virtio GPU usually creates memslots that are mapped into guest device
BARs.
- The guest device driver syncs with the host side to use the same memory
  type to access those memslots.
- No other guest components have access to the memory in the memslots,
  since it's mapped as device BARs.
So, by adding the flag KVM_MEM_NON_COHERENT_DMA to memslots specific to
virtio GPUs and asking KVM to honor guest memory types only in those
memslots, the page aliasing issue can be avoided easily.

This series doesn't limit which memslots are eligible to set the flag
KVM_MEM_NON_COHERENT_DMA, so if the user sets this flag on memslots for
guest system RAM, the page aliasing issue may arise during live migration
or in other cases where the host wants to access guest memory with a
different memory type, due to a lack of coordination between
non-enlightened guest components and the host. The same is true when
noncoherent DMA devices are assigned through VFIO.
But as it will not impact other VMs, we choose to trust the user and let
the user apply mitigations when the flag has to be set on memslots for
guest system RAM.

Note:
We also noticed that there's a series [3] trying to fix a similar problem
on ARM for VFIO device passthrough.
The difference is that [3] is trying to fix the problem that guest memory
types for pass-through device MMIOs are not honored on ARM (which is not a
problem for x86 VMX), while this series addresses the problem that guest
memory types are not honored in non-host-MMIO ranges for virtio GPUs on x86
VMX.

Changelog:
RFC --> v1:
- Switch to use memslot flag way to convey non-coherent DMA info
(Sean, Kevin)
- Do not honor guest MTRRs in memslot of flag KVM_MEM_NON_COHERENT_DMA
(Sean)

[1]: https://lore.kernel.org/all/20231214103520.7198-1-yan.y.zhao@intel.com/
[2]: https://patchwork.kernel.org/project/dri-devel/cover/20200213213036.207625-1-olvaffe@gmail.com/
[3]: https://lore.kernel.org/all/20231221154002.32622-1-ankita@nvidia.com/

Yan Zhao (4):
  KVM: Introduce a new memslot flag KVM_MEM_NON_COHERENT_DMA
  KVM: x86: Add a new param "slot" to op get_mt_mask in kvm_x86_ops
  KVM: VMX: Honor guest PATs for memslots of flag
    KVM_MEM_NON_COHERENT_DMA
  KVM: selftests: Set KVM_MEM_NON_COHERENT_DMA as a supported memslot
    flag

base-commit: 8ed26ab8d59111c2f7b86d200d1eb97d2a458fd1

Comments

Jason Gunthorpe Jan. 5, 2024, 7:55 p.m. UTC | #1

On Fri, Jan 05, 2024 at 05:12:37PM +0800, Yan Zhao wrote:
> This series allow user space to notify KVM of noncoherent DMA status so as
> to let KVM honor guest memory types in specified memory slot ranges.
> 
> Motivation
> ===
> A virtio GPU device may want to configure GPU hardware to work in
> noncoherent mode, i.e. some of its DMAs do not snoop CPU caches.

Does this mean some DMA reads do not snoop the caches or does it
include DMA writes not synchronizing the caches too?

> This is generally for performance consideration.
> In certain platform, GFX performance can improve 20+% with DMAs going to
> noncoherent path.
> 
> This noncoherent DMA mode works in below sequence:
> 1. Host backend driver programs hardware not to snoop memory of target
>    DMA buffer.
> 2. Host backend driver indicates guest frontend driver to program guest PAT
>    to WC for target DMA buffer.
> 3. Guest frontend driver writes to the DMA buffer without clflush stuffs.
> 4. Hardware does noncoherent DMA to the target buffer.
> 
> In this noncoherent DMA mode, both guest and hardware regard a DMA buffer
> as not cached. So, if KVM forces the effective memory type of this DMA
> buffer to be WB, hardware DMA may read incorrect data and cause misc
> failures.

I don't know all the details, but a big concern would be that the
caches remain fully coherent with the underlying memory at any point
where kvm decides to revoke the page from the VM.

If you allow an incoherence of cache != physical then it opens a
security attack where the observed content of memory can change when
it should not.

ARM64 has issues like this and due to that ARM has to have explicit,
expensive, cache flushing at certain points.

Jason

Yan Zhao Jan. 8, 2024, 6:02 a.m. UTC | #2

On Fri, Jan 05, 2024 at 03:55:51PM -0400, Jason Gunthorpe wrote:
> On Fri, Jan 05, 2024 at 05:12:37PM +0800, Yan Zhao wrote:
> > This series allow user space to notify KVM of noncoherent DMA status so as
> > to let KVM honor guest memory types in specified memory slot ranges.
> > 
> > Motivation
> > ===
> > A virtio GPU device may want to configure GPU hardware to work in
> > noncoherent mode, i.e. some of its DMAs do not snoop CPU caches.
> 
> Does this mean some DMA reads do not snoop the caches or does it
> include DMA writes not synchronizing the caches too?
Neither DMA reads nor DMA writes are snooped.
The virtio host side will mmap the buffer as WC (pgprot_writecombine)
for CPU access and program the device to access the buffer in an uncached
way.
Meanwhile, the virtio host side will construct a memslot in KVM with the
pointer returned from the mmap, and notify the virtio guest side to map the
same buffer in the guest page table with PAT=WC, too.

> 
> > This is generally for performance consideration.
> > In certain platform, GFX performance can improve 20+% with DMAs going to
> > noncoherent path.
> > 
> > This noncoherent DMA mode works in below sequence:
> > 1. Host backend driver programs hardware not to snoop memory of target
> >    DMA buffer.
> > 2. Host backend driver indicates guest frontend driver to program guest PAT
> >    to WC for target DMA buffer.
> > 3. Guest frontend driver writes to the DMA buffer without clflush stuffs.
> > 4. Hardware does noncoherent DMA to the target buffer.
> > 
> > In this noncoherent DMA mode, both guest and hardware regard a DMA buffer
> > as not cached. So, if KVM forces the effective memory type of this DMA
> > buffer to be WB, hardware DMA may read incorrect data and cause misc
> > failures.
> 
> I don't know all the details, but a big concern would be that the
> caches remain fully coherent with the underlying memory at any point
> where kvm decides to revoke the page from the VM.
Ah, you mean, for page migration, the content of the page may not be copied
correctly, right?

Currently in x86, we have 2 ways to let KVM honor guest memory types:
1. through KVM memslot flag introduced in this series, for virtio GPUs, in
   memslot granularity.
2. through increasing noncoherent dma count, as what's done in VFIO, for
   Intel GPU passthrough, for all guest memory.

This page migration issue should not arise for virtio GPU, as both host
and guest are synced to use the same memory type, and the pages are
actually not anonymous pages.
For GPU pass-through, though the host mmaps with WB, it's still fine for the
guest to use WC because page migration is not allowed on pages of VMs with
pass-through devices.

But I agree, this can happen if user space sets the memslot flag to honor
guest memory types on memslots for guest system RAM, where non-enlightened
guest components may cause guest and host to access memory with different
memory types.
Or simply when the guest is a malicious one.

> If you allow an incoherence of cache != physical then it opens a
> security attack where the observed content of memory can change when
> it should not.

In this case, will this security attack impact other guests?

> 
> ARM64 has issues like this and due to that ARM has to have explicit,
> expensive, cache flushing at certain points.
>

Jason Gunthorpe Jan. 8, 2024, 2:02 p.m. UTC | #3

On Mon, Jan 08, 2024 at 02:02:57PM +0800, Yan Zhao wrote:
> On Fri, Jan 05, 2024 at 03:55:51PM -0400, Jason Gunthorpe wrote:
> > On Fri, Jan 05, 2024 at 05:12:37PM +0800, Yan Zhao wrote:
> > > This series allow user space to notify KVM of noncoherent DMA status so as
> > > to let KVM honor guest memory types in specified memory slot ranges.
> > > 
> > > Motivation
> > > ===
> > > A virtio GPU device may want to configure GPU hardware to work in
> > > noncoherent mode, i.e. some of its DMAs do not snoop CPU caches.
> > 
> > Does this mean some DMA reads do not snoop the caches or does it
> > include DMA writes not synchronizing the caches too?
> Both DMA reads and writes are not snooped.

Oh that sounds really dangerous.

> > > This is generally for performance consideration.
> > > In certain platform, GFX performance can improve 20+% with DMAs going to
> > > noncoherent path.
> > > 
> > > This noncoherent DMA mode works in below sequence:
> > > 1. Host backend driver programs hardware not to snoop memory of target
> > >    DMA buffer.
> > > 2. Host backend driver indicates guest frontend driver to program guest PAT
> > >    to WC for target DMA buffer.
> > > 3. Guest frontend driver writes to the DMA buffer without clflush stuffs.
> > > 4. Hardware does noncoherent DMA to the target buffer.
> > > 
> > > In this noncoherent DMA mode, both guest and hardware regard a DMA buffer
> > > as not cached. So, if KVM forces the effective memory type of this DMA
> > > buffer to be WB, hardware DMA may read incorrect data and cause misc
> > > failures.
> > 
> > I don't know all the details, but a big concern would be that the
> > caches remain fully coherent with the underlying memory at any point
> > where kvm decides to revoke the page from the VM.
> Ah, you mean, for page migration, the content of the page may not be copied
> correctly, right?

Not just migration. Any point where KVM revokes the page from the
VM. Ie just tearing down the VM still has to make the cache coherent
with physical or there may be problems.
 
> Currently in x86, we have 2 ways to let KVM honor guest memory types:
> 1. through KVM memslot flag introduced in this series, for virtio GPUs, in
>    memslot granularity.
> 2. through increasing noncoherent dma count, as what's done in VFIO, for
>    Intel GPU passthrough, for all guest memory.

And where does all this fixup the coherency problem?

> This page migration issue should not be the case for virtio GPU, as both host
> and guest are synced to use the same memory type and actually the pages
> are not anonymous pages.

The guest isn't required to do this so it can force the cache to
become incoherent.

> > If you allow an incoherence of cache != physical then it opens a
> > security attack where the observed content of memory can change when
> > it should not.
> 
> In this case, will this security attack impact other guests?

It impacts the hypervisor potentially. It depends..

Jason

Daniel Vetter Jan. 8, 2024, 3:25 p.m. UTC | #4

On Mon, Jan 08, 2024 at 10:02:50AM -0400, Jason Gunthorpe wrote:
> On Mon, Jan 08, 2024 at 02:02:57PM +0800, Yan Zhao wrote:
> > On Fri, Jan 05, 2024 at 03:55:51PM -0400, Jason Gunthorpe wrote:
> > > On Fri, Jan 05, 2024 at 05:12:37PM +0800, Yan Zhao wrote:
> > > > This series allow user space to notify KVM of noncoherent DMA status so as
> > > > to let KVM honor guest memory types in specified memory slot ranges.
> > > > 
> > > > Motivation
> > > > ===
> > > > A virtio GPU device may want to configure GPU hardware to work in
> > > > noncoherent mode, i.e. some of its DMAs do not snoop CPU caches.
> > > 
> > > Does this mean some DMA reads do not snoop the caches or does it
> > > include DMA writes not synchronizing the caches too?
> > Both DMA reads and writes are not snooped.
> 
> Oh that sounds really dangerous.

So if this is an issue then we might already have a problem, because with
many devices it's entirely up to the device programming whether the i/o is
snooping or not. So the moment you pass such a device to a guest, whether
there's explicit support for non-coherent or not, you have a problem.

_If_ there is a fundamental problem. I'm not sure of that, because my
assumption was that at most the guest shoots itself and the data
corruption doesn't go any further the moment the hypervisor does the
dma/iommu unmapping.

Also, there's a pile of x86 devices where this very much applies, x86
being dma-coherent is not really the true ground story.

Cheers, Sima

> > > > This is generally for performance consideration.
> > > > In certain platform, GFX performance can improve 20+% with DMAs going to
> > > > noncoherent path.
> > > > 
> > > > This noncoherent DMA mode works in below sequence:
> > > > 1. Host backend driver programs hardware not to snoop memory of target
> > > >    DMA buffer.
> > > > 2. Host backend driver indicates guest frontend driver to program guest PAT
> > > >    to WC for target DMA buffer.
> > > > 3. Guest frontend driver writes to the DMA buffer without clflush stuffs.
> > > > 4. Hardware does noncoherent DMA to the target buffer.
> > > > 
> > > > In this noncoherent DMA mode, both guest and hardware regard a DMA buffer
> > > > as not cached. So, if KVM forces the effective memory type of this DMA
> > > > buffer to be WB, hardware DMA may read incorrect data and cause misc
> > > > failures.
> > > 
> > > I don't know all the details, but a big concern would be that the
> > > caches remain fully coherent with the underlying memory at any point
> > > where kvm decides to revoke the page from the VM.
> > Ah, you mean, for page migration, the content of the page may not be copied
> > correctly, right?
> 
> Not just migration. Any point where KVM revokes the page from the
> VM. Ie just tearing down the VM still has to make the cache coherent
> with physical or there may be problems.
>  
> > Currently in x86, we have 2 ways to let KVM honor guest memory types:
> > 1. through KVM memslot flag introduced in this series, for virtio GPUs, in
> >    memslot granularity.
> > 2. through increasing noncoherent dma count, as what's done in VFIO, for
> >    Intel GPU passthrough, for all guest memory.
> 
> And where does all this fixup the coherency problem?
> 
> > This page migration issue should not be the case for virtio GPU, as both host
> > and guest are synced to use the same memory type and actually the pages
> > are not anonymous pages.
> 
> The guest isn't required to do this so it can force the cache to
> become incoherent.
> 
> > > If you allow an incoherence of cache != physical then it opens a
> > > security attack where the observed content of memory can change when
> > > it should not.
> > 
> > In this case, will this security attack impact other guests?
> 
> It impacts the hypervisor potentially. It depends..
> 
> Jason

Jason Gunthorpe Jan. 8, 2024, 3:38 p.m. UTC | #5

On Mon, Jan 08, 2024 at 04:25:02PM +0100, Daniel Vetter wrote:
> On Mon, Jan 08, 2024 at 10:02:50AM -0400, Jason Gunthorpe wrote:
> > On Mon, Jan 08, 2024 at 02:02:57PM +0800, Yan Zhao wrote:
> > > On Fri, Jan 05, 2024 at 03:55:51PM -0400, Jason Gunthorpe wrote:
> > > > On Fri, Jan 05, 2024 at 05:12:37PM +0800, Yan Zhao wrote:
> > > > > This series allow user space to notify KVM of noncoherent DMA status so as
> > > > > to let KVM honor guest memory types in specified memory slot ranges.
> > > > > 
> > > > > Motivation
> > > > > ===
> > > > > A virtio GPU device may want to configure GPU hardware to work in
> > > > > noncoherent mode, i.e. some of its DMAs do not snoop CPU caches.
> > > > 
> > > > Does this mean some DMA reads do not snoop the caches or does it
> > > > include DMA writes not synchronizing the caches too?
> > > Both DMA reads and writes are not snooped.
> > 
> > Oh that sounds really dangerous.
> 
> So if this is an issue then we might already have a problem, because with
> many devices it's entirely up to the device programming whether the i/o is
> snooping or not. So the moment you pass such a device to a guest, whether
> there's explicit support for non-coherent or not, you have a
> problem.

No, the iommus (except Intel and only for Intel integrated GPU, IIRC)
prohibit the use of non-coherent DMA entirely from a VM.

Eg AMD systems 100% block non-coherent DMA in VMs at the iommu level.

> _If_ there is a fundamental problem. I'm not sure of that, because my
> assumption was that at most the guest shoots itself and the data
> corruption doesn't go any further the moment the hypervisor does the
> dma/iommu unmapping.

Who fixes the cache on the unmapping? I didn't see anything..

Jason

Yan Zhao Jan. 8, 2024, 11:36 p.m. UTC | #6

On Mon, Jan 08, 2024 at 10:02:50AM -0400, Jason Gunthorpe wrote:
> On Mon, Jan 08, 2024 at 02:02:57PM +0800, Yan Zhao wrote:
> > On Fri, Jan 05, 2024 at 03:55:51PM -0400, Jason Gunthorpe wrote:
> > > On Fri, Jan 05, 2024 at 05:12:37PM +0800, Yan Zhao wrote:
> > > > This series allow user space to notify KVM of noncoherent DMA status so as
> > > > to let KVM honor guest memory types in specified memory slot ranges.
> > > > 
> > > > Motivation
> > > > ===
> > > > A virtio GPU device may want to configure GPU hardware to work in
> > > > noncoherent mode, i.e. some of its DMAs do not snoop CPU caches.
> > > 
> > > Does this mean some DMA reads do not snoop the caches or does it
> > > include DMA writes not synchronizing the caches too?
> > Both DMA reads and writes are not snooped.
> 
> Oh that sounds really dangerous.
>
But the IOMMU for Intel GPUs does not do force-snoop, regardless of whether
KVM honors the guest memory type.

> > > > This is generally for performance consideration.
> > > > In certain platform, GFX performance can improve 20+% with DMAs going to
> > > > noncoherent path.
> > > > 
> > > > This noncoherent DMA mode works in below sequence:
> > > > 1. Host backend driver programs hardware not to snoop memory of target
> > > >    DMA buffer.
> > > > 2. Host backend driver indicates guest frontend driver to program guest PAT
> > > >    to WC for target DMA buffer.
> > > > 3. Guest frontend driver writes to the DMA buffer without clflush stuffs.
> > > > 4. Hardware does noncoherent DMA to the target buffer.
> > > > 
> > > > In this noncoherent DMA mode, both guest and hardware regard a DMA buffer
> > > > as not cached. So, if KVM forces the effective memory type of this DMA
> > > > buffer to be WB, hardware DMA may read incorrect data and cause misc
> > > > failures.
> > > 
> > > I don't know all the details, but a big concern would be that the
> > > caches remain fully coherent with the underlying memory at any point
> > > where kvm decides to revoke the page from the VM.
> > Ah, you mean, for page migration, the content of the page may not be copied
> > correctly, right?
> 
> Not just migration. Any point where KVM revokes the page from the
> VM. Ie just tearing down the VM still has to make the cache coherent
> with physical or there may be problems.
Not sure what problem you mean during KVM revoking.
In the host,
- If the memory type is WB, as is the case in Intel GPU passthrough,
  the mismatch can only happen when the guest memory type is UC/WC/WT/WP,
  all stronger than WB.
  So, even after KVM revokes the page, the host will not get delayed
  data from the cache.
- If the memory type is WC, as is the case in virtio GPU, after KVM revokes
  the page, the page is still held on the virtio host side.
  Even though an uncooperative guest can cause wrong data in the page,
  the guest can achieve the same result in a more straightforward way,
  i.e. by writing wrong data directly to the page.
  So, I don't see a problem in this case either.

>  
> > Currently in x86, we have 2 ways to let KVM honor guest memory types:
> > 1. through KVM memslot flag introduced in this series, for virtio GPUs, in
> >    memslot granularity.
> > 2. through increasing noncoherent dma count, as what's done in VFIO, for
> >    Intel GPU passthrough, for all guest memory.
> 
> And where does all this fixup the coherency problem?
> 
> > This page migration issue should not be the case for virtio GPU, as both host
> > and guest are synced to use the same memory type and actually the pages
> > are not anonymous pages.
> 
> The guest isn't required to do this so it can force the cache to
> become incoherent.
> 
> > > If you allow an incoherence of cache != physical then it opens a
> > > security attack where the observed content of memory can change when
> > > it should not.
> > 
> > In this case, will this security attack impact other guests?
> 
> It impacts the hypervisor potentially. It depends..
Could you elaborate more on how it will impact hypervisor?
We can try to fix it if it's really a case.

Jason Gunthorpe Jan. 9, 2024, 12:22 a.m. UTC | #7

On Tue, Jan 09, 2024 at 07:36:22AM +0800, Yan Zhao wrote:
> On Mon, Jan 08, 2024 at 10:02:50AM -0400, Jason Gunthorpe wrote:
> > On Mon, Jan 08, 2024 at 02:02:57PM +0800, Yan Zhao wrote:
> > > On Fri, Jan 05, 2024 at 03:55:51PM -0400, Jason Gunthorpe wrote:
> > > > On Fri, Jan 05, 2024 at 05:12:37PM +0800, Yan Zhao wrote:
> > > > > This series allow user space to notify KVM of noncoherent DMA status so as
> > > > > to let KVM honor guest memory types in specified memory slot ranges.
> > > > > 
> > > > > Motivation
> > > > > ===
> > > > > A virtio GPU device may want to configure GPU hardware to work in
> > > > > noncoherent mode, i.e. some of its DMAs do not snoop CPU caches.
> > > > 
> > > > Does this mean some DMA reads do not snoop the caches or does it
> > > > include DMA writes not synchronizing the caches too?
> > > Both DMA reads and writes are not snooped.
> > 
> > Oh that sounds really dangerous.
> >
> But the IOMMU for Intel GPU does not do force-snoop, no matter KVM
> honors guest memory type or not.

Yes, I know. Sounds dangerous!

> > Not just migration. Any point where KVM revokes the page from the
> > VM. Ie just tearing down the VM still has to make the cache coherent
> > with physical or there may be problems.
> Not sure what's the mentioned problem during KVM revoking.
> In host,
> - If the memory type is WB, as the case in intel GPU passthrough,
>   the mismatch can only happen when guest memory type is UC/WC/WT/WP, all
>   stronger than WB.
>   So, even after KVM revoking the page, the host will not get delayed
>   data from cache.
> - If the memory type is WC, as the case in virtio GPU, after KVM revoking
>   the page, the page is still held on the virtio host side.
>   Even though an uncooperative guest can cause wrong data in the page,
>   the guest can achieve the purpose in a more straight-forward way, i.e.
>   writing a wrong data directly to the page.
>   So, I don't see the problem in this case too.

You can't let cache incoherent memory leak back into the hypervisor
for other uses or who knows what can happen. In many cases something
will zero the page and you can probably reliably argue that will make
the cache coherent, but there are still all sorts of cases where pages
are write protected and then used in the hypervisor context. Eg page
out or something where the incoherence is a big problem.

E.g. RAID parity and mirror calculations become at risk of
malfunction. Storage CRCs stop working reliably, etc, etc.

It is certainly a big enough problem that a generic KVM switch to
allow incoherence should be treated with a lot of skepticism. You can't
argue that the only use of the generic switch will be with GPUs that
exclude all the troublesome cases!

> > > In this case, will this security attack impact other guests?
> > 
> > It impacts the hypervisor potentially. It depends..
> Could you elaborate more on how it will impact hypervisor?
> We can try to fix it if it's really a case.

Well, for instance, when you install pages into the KVM the hypervisor
will have taken kernel memory, then zeroed it with cacheable writes.
However, the VM can read it incoherently with DMA and access the
pre-zeroing data, since the zeroing writes potentially haven't left the
cache. That is an information leakage exploit.

Who knows what else you can get up to if you are creative. The whole
security model assumes there is only one view of memory, not two.

Jason
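
The information leak described above can be shown with a toy model in C. This is purely illustrative and not kernel code: struct line and all four helpers are hypothetical names, and the model tracks a single byte with a single "dirty" bit instead of a real cache hierarchy. It shows the two views of memory: a cacheable write that has not been written back is visible to the CPU but not to a non-snooping DMA read.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of one cache line: DRAM and the CPU cache may disagree. */
struct line {
	uint8_t mem;    /* what DRAM holds; a non-snooping DMA sees this */
	uint8_t cache;  /* what the CPU cache holds, meaningful only if dirty */
	bool dirty;     /* cache holds newer data than DRAM */
};

/* Cacheable (WB) CPU write: lands in the cache, not yet in DRAM. */
void cpu_write_wb(struct line *l, uint8_t v)
{
	l->cache = v;
	l->dirty = true;
}

/* Coherent CPU read: sees the cache when it holds newer data. */
uint8_t cpu_read(const struct line *l)
{
	return l->dirty ? l->cache : l->mem;
}

/* Non-snooping DMA read: bypasses the cache entirely. */
uint8_t dma_read_nosnoop(const struct line *l)
{
	return l->mem;
}

/* Write the line back to DRAM, as a cache flush of that line would. */
void flush_line(struct line *l)
{
	if (l->dirty) {
		l->mem = l->cache;
		l->dirty = false;
	}
}
```

If DRAM holds an old secret and the hypervisor zeroes the line with cpu_write_wb(), cpu_read() returns zero while dma_read_nosnoop() still returns the secret. In this model, calling flush_line() before the page is handed over makes both views agree.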

Yan Zhao Jan. 9, 2024, 2:11 a.m. UTC | #8

On Mon, Jan 08, 2024 at 08:22:20PM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 09, 2024 at 07:36:22AM +0800, Yan Zhao wrote:
> > On Mon, Jan 08, 2024 at 10:02:50AM -0400, Jason Gunthorpe wrote:
> > > On Mon, Jan 08, 2024 at 02:02:57PM +0800, Yan Zhao wrote:
> > > > On Fri, Jan 05, 2024 at 03:55:51PM -0400, Jason Gunthorpe wrote:
> > > > > On Fri, Jan 05, 2024 at 05:12:37PM +0800, Yan Zhao wrote:
> > > > > > This series allow user space to notify KVM of noncoherent DMA status so as
> > > > > > to let KVM honor guest memory types in specified memory slot ranges.
> > > > > > 
> > > > > > Motivation
> > > > > > ===
> > > > > > A virtio GPU device may want to configure GPU hardware to work in
> > > > > > noncoherent mode, i.e. some of its DMAs do not snoop CPU caches.
> > > > > 
> > > > > Does this mean some DMA reads do not snoop the caches or does it
> > > > > include DMA writes not synchronizing the caches too?
> > > > Both DMA reads and writes are not snooped.
> > > 
> > > Oh that sounds really dangerous.
> > >
> > But the IOMMU for Intel GPU does not do force-snoop, no matter KVM
> > honors guest memory type or not.
> 
> Yes, I know. Sounds dangerous!
> 
> > > Not just migration. Any point where KVM revokes the page from the
> > > VM. Ie just tearing down the VM still has to make the cache coherent
> > > with physical or there may be problems.
> > Not sure what's the mentioned problem during KVM revoking.
> > In host,
> > - If the memory type is WB, as the case in intel GPU passthrough,
> >   the mismatch can only happen when guest memory type is UC/WC/WT/WP, all
> >   stronger than WB.
> >   So, even after KVM revoking the page, the host will not get delayed
> >   data from cache.
> > - If the memory type is WC, as the case in virtio GPU, after KVM revoking
> >   the page, the page is still held on the virtio host side.
> >   Even though an uncooperative guest can cause wrong data in the page,
> >   the guest can achieve the purpose in a more straight-forward way, i.e.
> >   writing a wrong data directly to the page.
> >   So, I don't see the problem in this case too.
> 
> You can't let cache incoherent memory leak back into the hypervisor
> for other uses or who knows what can happen. In many cases something
> will zero the page and you can probably reliably argue that will make
> the cache coherent, but there are still all sorts of cases where pages
> are write protected and then used in the hypervisor context. Eg page
> out or something where the incoherence is a big problem.
> 
> eg RAID parity and mirror calculations become at risk of
> malfunction. Storage CRCs stop working reliably, etc, etc.
> 
> It is certainly a big enough problem that a generic KVM switch to
> allow incoherence should be treated with a lot of skepticism. You can't
> argue that the only use of the generic switch will be with GPUs that
> exclude all the troublesome cases!
>
You are right. It's safer with only one view of memory.
But even if something zeroes the page, as long as that happens before the
page is returned to the host, isn't the impact constrained to VM scope? E.g.
for a write-protected page, the hypervisor cannot rely on the page content
being correct or expected anyway.

For virtio GPU's use case, do you think a better way would be for KVM to
pull the memory type from the host page table for the specified memslot?

But for noncoherent DMA device passthrough, we can't pull the host memory
type, because we rely on the guest device driver to do cache flushes
properly, and if the guest device driver thinks a memory region is uncached
while it's effectively cached, the device cannot work properly.

> > > > In this case, will this security attack impact other guests?
> > > 
> > > It impacts the hypervisor potentially. It depends..
> > Could you elaborate more on how it will impact hypervisor?
> > We can try to fix it if it's really a case.
> 
> Well, for instance, when you install pages into the KVM the hypervisor
> will have taken kernel memory, then zero'd it with cachable writes,
> however the VM can read it incoherently with DMA and access the
> pre-zero'd data since the zero'd writes potentially hasn't left the
> cache. That is an information leakage exploit.
This makes sense.
How about KVM doing cache flush before installing/revoking the
page if guest memory type is honored?

> Who knows what else you can get up to if you are creative. The whole
> security model assumes there is only one view of memory, not two.
>

Jason Gunthorpe Jan. 15, 2024, 4:30 p.m. UTC | #9

On Tue, Jan 09, 2024 at 10:11:23AM +0800, Yan Zhao wrote:

> > Well, for instance, when you install pages into the KVM the hypervisor
> > will have taken kernel memory, then zero'd it with cachable writes,
> > however the VM can read it incoherently with DMA and access the
> > pre-zero'd data since the zero'd writes potentially hasn't left the
> > cache. That is an information leakage exploit.
>
> This makes sense.
> How about KVM doing cache flush before installing/revoking the
> page if guest memory type is honored?

I think if you are going to allow the guest to bypass the cache in any
way then KVM should fully flush the cache before allowing the guest to
access memory and it should fully flush the cache after removing
memory from the guest.
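
The leak described above, and the effect of flushing before the guest
gets access, can be illustrated with a small user-space toy model. This
is only a sketch under stated assumptions: struct toy_page,
install_page() and the other names are hypothetical and are not KVM
functions, and the "cache" is a single write-back buffer per page rather
than real hardware.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_BYTES 16

/* Hypothetical model: one page of RAM plus its write-back cached copy. */
struct toy_page {
	uint8_t ram[TOY_PAGE_BYTES];   /* what a non-snooped DMA read sees */
	uint8_t cache[TOY_PAGE_BYTES]; /* what a cacheable CPU access sees */
	int dirty;                     /* cache holds data not yet in RAM */
};

/* Start with old kernel data ("secret") coherently in RAM and cache. */
void toy_page_init(struct toy_page *p, uint8_t secret)
{
	memset(p->ram, secret, TOY_PAGE_BYTES);
	memset(p->cache, secret, TOY_PAGE_BYTES);
	p->dirty = 0;
}

/* Cacheable CPU write: lands in the cache only, RAM is left untouched. */
void cpu_write_cached(struct toy_page *p, uint8_t val)
{
	memset(p->cache, val, TOY_PAGE_BYTES);
	p->dirty = 1;
}

/* Write dirty cached data back to RAM. */
void cache_flush(struct toy_page *p)
{
	if (p->dirty) {
		memcpy(p->ram, p->cache, TOY_PAGE_BYTES);
		p->dirty = 0;
	}
}

/* Non-snooped DMA read: bypasses the cache and reads RAM directly. */
uint8_t dma_read_nosnoop(const struct toy_page *p, size_t off)
{
	assert(off < TOY_PAGE_BYTES);
	return p->ram[off];
}

/* Hypervisor zeroes the page with cacheable writes, then hands it over. */
void install_page(struct toy_page *p, int flush_before_install)
{
	cpu_write_cached(p, 0);
	if (flush_before_install)
		cache_flush(p);
}
```

In this model, calling install_page(&p, 0) on a page initialized with
toy_page_init(&p, 0xAA) leaves dma_read_nosnoop() returning the old
0xAA contents, while install_page(&p, 1) makes it return 0.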

Noting that fully removing the memory now includes VFIO too, which is
going to be very hard to co-ordinate between KVM and VFIO.

ARM has the hooks for most of this in the common code already, so it
should not be outrageous to do, but slow I suspect.

Jason

Tian, Kevin Jan. 16, 2024, 12:45 a.m. UTC | #10

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, January 16, 2024 12:31 AM
> 
> On Tue, Jan 09, 2024 at 10:11:23AM +0800, Yan Zhao wrote:
> 
> > > Well, for instance, when you install pages into the KVM the hypervisor
> > > will have taken kernel memory, then zero'd it with cachable writes,
> > > however the VM can read it incoherently with DMA and access the
> > > pre-zero'd data since the zero'd writes potentially hasn't left the
> > > cache. That is an information leakage exploit.
> >
> > This makes sense.
> > How about KVM doing cache flush before installing/revoking the
> > page if guest memory type is honored?
> 
> I think if you are going to allow the guest to bypass the cache in any
> way then KVM should fully flush the cache before allowing the guest to
> access memory and it should fully flush the cache after removing
> memory from the guest.

For GPU passthrough, can we rely on the fact that the entire guest memory
is pinned, so memory is only removed when the guest is killed, after which
the pages will be zeroed by mm before their next use? Then we would
just need to flush the cache before the first guest run to avoid an
information leak.

Yes, it's a more complex issue if the guest is allowed to bypass the cache
in a configuration that mixes in host mm activity on guest pages at run-time.

> 
> Noting that fully removing the memory now includes VFIO too, which is
> going to be very hard to co-ordinate between KVM and VFIO.

If we are only talking about GPU passthrough, do we still need such coordination?

> 
> ARM has the hooks for most of this in the common code already, so it
> should not be outrageous to do, but slow I suspect.
> 
> Jason

Tian, Kevin Jan. 16, 2024, 4:05 a.m. UTC | #11

> From: Tian, Kevin
> Sent: Tuesday, January 16, 2024 8:46 AM
> 
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, January 16, 2024 12:31 AM
> >
> > On Tue, Jan 09, 2024 at 10:11:23AM +0800, Yan Zhao wrote:
> >
> > > > Well, for instance, when you install pages into the KVM the hypervisor
> > > > will have taken kernel memory, then zero'd it with cachable writes,
> > > > however the VM can read it incoherently with DMA and access the
> > > > pre-zero'd data since the zero'd writes potentially hasn't left the
> > > > cache. That is an information leakage exploit.
> > >
> > > This makes sense.
> > > How about KVM doing cache flush before installing/revoking the
> > > page if guest memory type is honored?
> >
> > I think if you are going to allow the guest to bypass the cache in any
> > way then KVM should fully flush the cache before allowing the guest to
> > access memory and it should fully flush the cache after removing
> > memory from the guest.
> 
> For GPU passthrough can we rely on the fact that the entire guest memory
> is pinned so the only occurrence of removing memory is when killing the
> guest then the pages will be zero-ed by mm before next use? then we
> just need to flush the cache before the 1st guest run to avoid information
> leak.

Just checked your past comments. If there is no guarantee that the removed
pages will be zeroed before next use, then yes, the cache has to be flushed
after the page is removed from the guest. :/

> 
> yes it's a more complex issue if allowing guest to bypass cache in a
> configuration mixing host mm activities on guest pages at run-time.
> 
> >
> > Noting that fully removing the memory now includes VFIO too, which is
> > going to be very hard to co-ordinate between KVM and VFIO.
> 

We could probably just handle the cache flush in IOMMUFD or VFIO type1
map/unmap, which is the gate for allowing/denying non-coherent DMA
to specific pages.
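
As a rough illustration of flushing at the map/unmap gate, here is a
user-space toy model, not IOMMUFD or VFIO code. All names (struct
nc_page, nc_map(), nc_unmap(), ...) are hypothetical, and the page is
reduced to a single byte with one cache line. It assumes "flush" means
write back if dirty and then invalidate.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: one byte of RAM and one cache line covering it. */
struct nc_page {
	uint8_t ram;   /* contents seen by non-snooped DMA */
	uint8_t line;  /* cached copy seen by the host CPU */
	int valid;     /* cache line currently holds a copy */
	int dirty;     /* cached copy is newer than RAM */
};

/* Host CPU read: served from the cache, filling it from RAM on a miss. */
uint8_t nc_cpu_read(struct nc_page *p)
{
	if (!p->valid) {
		p->line = p->ram;
		p->valid = 1;
		p->dirty = 0;
	}
	return p->line;
}

/* Non-snooped device write: goes to RAM, cached copy is not updated. */
void nc_dma_write(struct nc_page *p, uint8_t val)
{
	p->ram = val;
}

/* Write back if dirty, then invalidate the line. */
void nc_flush(struct nc_page *p)
{
	if (p->valid && p->dirty)
		p->ram = p->line;
	p->valid = 0;
	p->dirty = 0;
}

/* Gate opening: page becomes reachable by non-coherent DMA. */
void nc_map(struct nc_page *p, int flush)
{
	if (flush)
		nc_flush(p);
}

/* Gate closing: page returns to host-only, coherent use. */
void nc_unmap(struct nc_page *p, int flush)
{
	if (flush)
		nc_flush(p);
}
```

If the host reads the page while it is mapped and the device then
writes it with nc_dma_write(), a host read after nc_unmap(&p, 0) still
returns the stale cached value, whereas after nc_unmap(&p, 1) it
returns what the device wrote. The stale-read case is the kind of
incoherence that would upset parity or CRC calculations in the host.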

Jason Gunthorpe Jan. 16, 2024, 12:54 p.m. UTC | #12

On Tue, Jan 16, 2024 at 04:05:08AM +0000, Tian, Kevin wrote:
> > From: Tian, Kevin
> > Sent: Tuesday, January 16, 2024 8:46 AM
> > 
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Tuesday, January 16, 2024 12:31 AM
> > >
> > > On Tue, Jan 09, 2024 at 10:11:23AM +0800, Yan Zhao wrote:
> > >
> > > > > Well, for instance, when you install pages into the KVM the hypervisor
> > > > > will have taken kernel memory, then zero'd it with cachable writes,
> > > > > however the VM can read it incoherently with DMA and access the
> > > > > pre-zero'd data since the zero'd writes potentially hasn't left the
> > > > > cache. That is an information leakage exploit.
> > > >
> > > > This makes sense.
> > > > How about KVM doing cache flush before installing/revoking the
> > > > page if guest memory type is honored?
> > >
> > > I think if you are going to allow the guest to bypass the cache in any
> > > way then KVM should fully flush the cache before allowing the guest to
> > > access memory and it should fully flush the cache after removing
> > > memory from the guest.
> > 
> > For GPU passthrough can we rely on the fact that the entire guest memory
> > is pinned so the only occurrence of removing memory is when killing the
> > guest then the pages will be zero-ed by mm before next use? then we
> > just need to flush the cache before the 1st guest run to avoid information
> > leak.
> 
> Just checked your past comments. If there is no guarantee that the removed
> pages will be zero-ed before next use then yes cache has to be flushed
> after the page is removed from the guest. :/

Next use may include things like swapping to disk or live migrating the VM.

So it isn't quite so simple in the general case.

> > > Noting that fully removing the memory now includes VFIO too, which is
> > > going to be very hard to co-ordinate between KVM and VFIO.
> 
> Probably we could just handle cache flush in IOMMUFD or VFIO type1
> map/unmap which is the gate of allowing/denying non-coherent DMAs
> to specific pages.

Maybe, and on live migration DMA stop..

Jason