[RFC] KVM: Introduce KVM VIRTIO device

Message ID 20231214103520.7198-1-yan.y.zhao@intel.com
State New
Series [RFC] KVM: Introduce KVM VIRTIO device

Commit Message

Yan Zhao Dec. 14, 2023, 10:35 a.m. UTC
  Introduce a new KVM device that represents a virtio device, in order to
allow userspace to indicate the device hardware's noncoherent DMA status.

Motivation
===
A virtio GPU device may want to configure the GPU hardware to work in
noncoherent mode, i.e. some of its DMAs do not snoop CPU caches.
This is generally done for performance. On certain platforms, GFX
performance can improve by 20+% when DMAs take the noncoherent path.

This noncoherent DMA mode works in the following sequence:
1. Host backend driver programs hardware not to snoop memory of target
   DMA buffer.
2. Host backend driver directs the guest frontend driver to program the
   guest PAT to WC for the target DMA buffer.
3. Guest frontend driver writes to the DMA buffer without issuing clflush
   (see the sketch after this list).
4. Hardware does noncoherent DMA to the target buffer.
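
As an illustration of steps 2-3, a Linux guest frontend could map the buffer
write-combining before writing to it. A minimal sketch (vmap() and
pgprot_writecombine() are standard kernel APIs; the surrounding driver
context is hypothetical):

   #include <linux/mm.h>
   #include <linux/vmalloc.h>

   /* Map the DMA buffer WC so the guest PAT marks it write-combining. */
   static void *map_dma_buf_wc(struct page **pages, unsigned int nr_pages)
   {
           return vmap(pages, nr_pages, VM_MAP,
                       pgprot_writecombine(PAGE_KERNEL));
   }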

In this noncoherent DMA mode, both the guest and the hardware regard a DMA
buffer as not cached. So, if KVM forces the effective memory type of this
DMA buffer to WB, hardware DMA may read stale data and cause various
failures.

Therefore, we introduce a new KVM virtio device to let KVM know that the
virtio device hardware is working in noncoherent mode. KVM then honors the
guest PAT type on Intel platforms.
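
On VMX, "honoring guest PAT" roughly corresponds to building EPT leaves
without VMX_EPT_IPAT_BIT, e.g. the existing fallthrough in vmx_get_mt_mask()
(shown for reference, not code added by this patch):

   /* No VMX_EPT_IPAT_BIT: the guest PAT combines with this EPT memtype. */
   return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;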

For a virtio device model (e.g. QEMU) that knows the device hardware is to be
configured to work in

a. noncoherent mode,
   - on device realization, create a KVM virtio device and set the device
     attr to increase the KVM noncoherent DMA count;
   - on device unrealization, destroy the KVM virtio device to decrease the
     KVM noncoherent DMA count (see "Example QEMU usage" below).

b. coherent mode,
   just do nothing.

Security
===
The biggest concern with KVM honoring the guest's memory type on Intel
platforms is the page aliasing issue.
- For host MMIO, it's not a concern, as KVM VMX programs the EPT memory
  type to UC (which overrides all guest PAT types except WC) no matter
  whether the guest memory type is honored or not.

- For host non-MMIO pages,
  * the virtio guest frontend and host backend drivers should be synced to
    use the same memory type to map a buffer. Otherwise, incorrect memory
    data may result, but this only impacts the buggy guest.
  * for live migration,
    since QEMU reads all guest memory during live migration, page aliasing
    can occur.
    The current thinking is to disable live migration if a virtio device
    has indicated its noncoherent state.
    As a follow-up, we can discuss other solutions, e.g.
    (a) switching back to the coherent path before starting live migration;
    (b) reading/writing guest memory with clflush during live migration
        (sketched below).
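
For illustration, option (b) could flush a range from userspace roughly like
this (a minimal sketch assuming x86 and 64-byte cache lines; the helper name
is hypothetical):

   #include <stddef.h>
   #include <emmintrin.h>  /* _mm_clflush, _mm_mfence */

   /* Flush a guest memory range from CPU caches around a migration access. */
   static void clflush_range(const volatile void *buf, size_t len)
   {
       const volatile char *p = buf;
       size_t i;

       _mm_mfence();
       for (i = 0; i < len; i += 64)
           _mm_clflush((const void *)(p + i));
       _mm_mfence();
   }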

Implementation Consideration
===
There is a previous series [1] from Google that serves the same purpose of
letting KVM know the virtio GPU's noncoherent DMA status. That series
requires a new memslot flag and special memslots in user space.

We chose not to use a memslot flag to request honoring the guest memory
type. Instead, we make the honoring request explicit (not tied to a memslot
flag). This is because once the guest memory type is honored, not only the
memory used by the guest virtio device but all guest memory potentially
faces the page aliasing issue. KVM needs a generic solution to handle page
aliasing rather than counting on the memory type of a special memslot being
aligned between host and guest.
(We can discuss what a generic solution to the page aliasing issue looks
like in a later follow-up series.)

On the other hand, we chose to introduce a KVM virtio device rather than
just provide an ioctl that wraps kvm_arch_[un]register_noncoherent_dma()
directly, based on the following considerations:
1. It explicitly limits the "register noncoherent DMA" ability to virtio
   devices.
2. It provides better encapsulation.
   Repeatedly setting the noncoherent attribute on a KVM virtio device only
   increases the KVM noncoherent DMA count once.
   The KVM noncoherent DMA count is decreased automatically when the KVM
   virtio device is closed.
3. The KVM virtio device can be extended to let KVM know more about the
   device introducing the non-coherent DMA.

Example QEMU usage
===
- on virtio device realize:
   struct kvm_create_device cd = {
       .type = KVM_DEV_TYPE_VIRTIO,
   };

   /* Create the per-device KVM virtio device; cd.fd is filled on success. */
   if (kvm_vm_ioctl(kvm_state, KVM_CREATE_DEVICE, &cd) < 0) {
       /* handle the error, e.g. report it and fail realize */
   }

   struct kvm_device_attr attr = {
       .group = KVM_DEV_VIRTIO_NONCOHERENT,
       .attr = KVM_DEV_VIRTIO_NONCOHERENT_SET,
   };

   /* Bump the VM's noncoherent DMA count (idempotent per device). */
   if (ioctl(cd.fd, KVM_SET_DEVICE_ATTR, &attr) < 0) {
       close(cd.fd);
       /* handle the error */
   }

   g->kvm_device_fd = cd.fd;

- on virtio device unrealize:
  /* Closing the fd decreases the noncoherent DMA count automatically. */
  close(g->kvm_device_fd);

Reviewed-by: Zhenyu Wang <zhenyuw@linux.intel.com>
Link: https://patchwork.kernel.org/project/dri-devel/cover/20200213213036.207625-1-olvaffe@gmail.com/ [1]
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/Kconfig     |   1 +
 include/uapi/linux/kvm.h |   5 ++
 virt/kvm/Kconfig         |   3 +
 virt/kvm/Makefile.kvm    |   1 +
 virt/kvm/kvm_main.c      |   8 +++
 virt/kvm/virtio.c        | 121 +++++++++++++++++++++++++++++++++++++++
 virt/kvm/virtio.h        |  18 ++++++
 7 files changed, 157 insertions(+)
 create mode 100644 virt/kvm/virtio.c
 create mode 100644 virt/kvm/virtio.h


base-commit: 8ed26ab8d59111c2f7b86d200d1eb97d2a458fd1
  

Comments

Tian, Kevin Dec. 15, 2023, 6:23 a.m. UTC | #1
> From: Zhao, Yan Y <yan.y.zhao@intel.com>
> Sent: Thursday, December 14, 2023 6:35 PM
> 
> - For host non-MMIO pages,
>   * virtio guest frontend and host backend driver should be synced to use
>     the same memory type to map a buffer. Otherwise, there will be
>     potential problem for incorrect memory data. But this will only impact
>     the buggy guest alone.
>   * for live migration,
>     as QEMU will read all guest memory during live migration, page aliasing
>     could happen.
>     Current thinking is to disable live migration if a virtio device has
>     indicated its noncoherent state.
>     As a follow-up, we can discuss other solutions. e.g.
>     (a) switching back to coherent path before starting live migration.

both guest/host switching to coherent or host-only?

host-only certainly is problematic if guest is still using non-coherent.

on the other hand I'm not sure whether the host/guest gfx stack is
capable of switching between coherent and non-coherent path in-fly
when the buffer is right being rendered.

>     (b) read/write of guest memory with clflush during live migration.

write is irrelevant as it's only done in the resume path where the
guest is not running.

> 
> Implementation Consideration
> ===
> There is a previous series [1] from google to serve the same purpose to
> let KVM be aware of virtio GPU's noncoherent DMA status. That series
> requires a new memslot flag, and special memslots in user space.
> 
> We don't choose to use memslot flag to request honoring guest memory
> type.

memslot flag has the potential to restrict the impact e.g. when using
clflush-before-read in migration? Of course the implication is to
honor guest type only for the selected slot in KVM instead of applying
to the entire guest memory as in previous series (which selects this
way because vmx_get_mt_mask() is in perf-critical path hence not
good to check memslot flag?)

> Instead we hope to make the honoring request to be explicit (not tied to a
> memslot flag). This is because once guest memory type is honored, not only
> memory used by guest virtio device, but all guest memory is facing page
> aliasing issue potentially. KVM needs a generic solution to take care of
> page aliasing issue rather than counting on memory type of a special
> memslot being aligned in host and guest.
> (we can discuss what a generic solution to handle page aliasing issue will
> look like in later follow-up series).
> 
> On the other hand, we choose to introduce a KVM virtio device rather than
> just provide an ioctl to wrap kvm_arch_[un]register_noncoherent_dma()
> directly, which is based on considerations that

I wonder it's over-engineered for the purpose.

why not just introducing a KVM_CAP and allowing the VMM to enable?
KVM doesn't need to know the exact source of requiring it...
  
Yan Zhao Dec. 18, 2023, 2:58 a.m. UTC | #2
On Fri, Dec 15, 2023 at 02:23:48PM +0800, Tian, Kevin wrote:
> > From: Zhao, Yan Y <yan.y.zhao@intel.com>
> > Sent: Thursday, December 14, 2023 6:35 PM
> > 
> > - For host non-MMIO pages,
> >   * virtio guest frontend and host backend driver should be synced to use
> >     the same memory type to map a buffer. Otherwise, there will be
> >     potential problem for incorrect memory data. But this will only impact
> >     the buggy guest alone.
> >   * for live migration,
> >     as QEMU will read all guest memory during live migration, page aliasing
> >     could happen.
> >     Current thinking is to disable live migration if a virtio device has
> >     indicated its noncoherent state.
> >     As a follow-up, we can discuss other solutions. e.g.
> >     (a) switching back to coherent path before starting live migration.
> 
> both guest/host switching to coherent or host-only?
> 
> host-only certainly is problematic if guest is still using non-coherent.
Both.

> on the other hand I'm not sure whether the host/guest gfx stack is
> capable of switching between coherent and non-coherent path in-fly
> when the buffer is right being rendered.
> 
Yes, I'm also not sure about it, but it's an option.

> >     (b) read/write of guest memory with clflush during live migration.
> 
> write is irrelevant as it's only done in the resume path where the
> guest is not running.
Given that the host write is with PAT WB while the hardware is in no-snoop
mode, is it better to perform a cache flush after the host write?
(We can investigate more to check whether it's necessary.)

BTW, there's also post-copy live migration, in which case the guest is
running :)

> 
> > 
> > Implementation Consideration
> > ===
> > There is a previous series [1] from google to serve the same purpose to
> > let KVM be aware of virtio GPU's noncoherent DMA status. That series
> > requires a new memslot flag, and special memslots in user space.
> > 
> > We don't choose to use memslot flag to request honoring guest memory
> > type.
> 
> memslot flag has the potential to restrict the impact e.g. when using
> clflush-before-read in migration? Of course the implication is to
> honor guest type only for the selected slot in KVM instead of applying
> to the entire guest memory as in previous series (which selects this
> way because vmx_get_mt_mask() is in perf-critical path hence not
> good to check memslot flag?)
>
I think checking a memslot flag is in itself all right.
But the memslot flag does not carry the memory type that the host is using
for the memslot.
On the other hand, virtio GPU is not the only source of non-coherent DMAs.
The memslot flag approach is not applicable to pass-through GPUs, due to
the lack of coordination between guest and host.


> > Instead we hope to make the honoring request to be explicit (not tied to a
> > memslot flag). This is because once guest memory type is honored, not only
> > memory used by guest virtio device, but all guest memory is facing page
> > aliasing issue potentially. KVM needs a generic solution to take care of
> > page aliasing issue rather than counting on memory type of a special
> > memslot being aligned in host and guest.
> > (we can discuss what a generic solution to handle page aliasing issue will
> > look like in later follow-up series).
> > 
> > On the other hand, we choose to introduce a KVM virtio device rather than
> > just provide an ioctl to wrap kvm_arch_[un]register_noncoherent_dma()
> > directly, which is based on considerations that
> 
> I wonder it's over-engineered for the purpose.
> 
> why not just introducing a KVM_CAP and allowing the VMM to enable?
As we hope to increase the non-coherent DMA count on hot-plug of a
non-coherent device and decrease it on hot-unplug of that device, a KVM_CAP
looks like it would require the user to maintain a refcount before turning
it on/off, which is less desirable. Agree?

> KVM doesn't need to know the exact source of requiring it...
Maybe we can use the source info in a way like this:
1. indicate that the source is not a passthrough device
2. record the relationship between GPA and memory type.

Then, if KVM knows the non-coherent DMAs do not involve any passthrough
devices, it can force a GPA's memory type (by ignoring guest PAT) to the
one specified by the host (in 2), so as to avoid cache flush operations
before live migration.

If passthrough devices get involved later, we can zap the EPT and rebuild
the memory types to honor guest PAT, resorting to cache flushes before
live migration to maintain coherency.
  
Sean Christopherson Dec. 18, 2023, 3:08 p.m. UTC | #3
+Yiwei

On Fri, Dec 15, 2023, Kevin Tian wrote:
> > From: Zhao, Yan Y <yan.y.zhao@intel.com>
> > Sent: Thursday, December 14, 2023 6:35 PM
> > 
> > - For host non-MMIO pages,
> >   * virtio guest frontend and host backend driver should be synced to use
> >     the same memory type to map a buffer. Otherwise, there will be
> >     potential problem for incorrect memory data. But this will only impact
> >     the buggy guest alone.
> >   * for live migration,
> >     as QEMU will read all guest memory during live migration, page aliasing
> >     could happen.
> >     Current thinking is to disable live migration if a virtio device has
> >     indicated its noncoherent state.
> >     As a follow-up, we can discuss other solutions. e.g.
> >     (a) switching back to coherent path before starting live migration.
> 
> both guest/host switching to coherent or host-only?
> 
> host-only certainly is problematic if guest is still using non-coherent.
> 
> on the other hand I'm not sure whether the host/guest gfx stack is
> capable of switching between coherent and non-coherent path in-fly
> when the buffer is right being rendered.
> 
> >     (b) read/write of guest memory with clflush during live migration.
> 
> write is irrelevant as it's only done in the resume path where the
> guest is not running.
> 
> > 
> > Implementation Consideration
> > ===
> > There is a previous series [1] from google to serve the same purpose to
> > let KVM be aware of virtio GPU's noncoherent DMA status. That series
> > requires a new memslot flag, and special memslots in user space.
> > 
> > We don't choose to use memslot flag to request honoring guest memory
> > type.
> 
> memslot flag has the potential to restrict the impact e.g. when using
> clflush-before-read in migration?

Yep, exactly.  E.g. if KVM needs to ensure coherency when freeing memory back to
the host kernel, then the memslot flag will allow for a much more targeted
operation.

> Of course the implication is to honor guest type only for the selected slot
> in KVM instead of applying to the entire guest memory as in previous series
> (which selects this way because vmx_get_mt_mask() is in perf-critical path
> hence not good to check memslot flag?)

Checking a memslot flag won't impact performance.  KVM already has the memslot
when creating SPTEs, e.g. the sole caller of vmx_get_mt_mask(), make_spte(), has
access to the memslot.

That isn't coincidental, KVM _must_ have the memslot to construct the SPTE, e.g.
to retrieve the associated PFN, update write-tracking for shadow pages, etc.

I added Yiwei, who I think is planning on posting another RFC for the memslot
idea (I actually completely forgot that the memslot idea had been thought of and
posted a few years back).

> > Instead we hope to make the honoring request to be explicit (not tied to a
> > memslot flag). This is because once guest memory type is honored, not only
> > memory used by guest virtio device, but all guest memory is facing page
> > aliasing issue potentially. KVM needs a generic solution to take care of
> > page aliasing issue rather than counting on memory type of a special
> > memslot being aligned in host and guest.
> > (we can discuss what a generic solution to handle page aliasing issue will
> > look like in later follow-up series).
> > 
> > On the other hand, we choose to introduce a KVM virtio device rather than
> > just provide an ioctl to wrap kvm_arch_[un]register_noncoherent_dma()
> > directly, which is based on considerations that
> 
> I wonder it's over-engineered for the purpose.
> 
> why not just introducing a KVM_CAP and allowing the VMM to enable?
> KVM doesn't need to know the exact source of requiring it...

Agreed.  If we end up needing to grant the whole VM access for some reason, just
give userspace a direct toggle.
  
Yiwei Zhang Dec. 18, 2023, 6:14 p.m. UTC | #4
> +Yiwei
>
> On Fri, Dec 15, 2023, Kevin Tian wrote:
> > > From: Zhao, Yan Y <yan.y.zhao@intel.com>
> > > Sent: Thursday, December 14, 2023 6:35 PM
> > >
> > > - For host non-MMIO pages,
> > >   * virtio guest frontend and host backend driver should be synced to use
> > >     the same memory type to map a buffer. Otherwise, there will be
> > >     potential problem for incorrect memory data. But this will only impact
> > >     the buggy guest alone.
> > >   * for live migration,
> > >     as QEMU will read all guest memory during live migration, page aliasing
> > >     could happen.
> > >     Current thinking is to disable live migration if a virtio device has
> > >     indicated its noncoherent state.
> > >     As a follow-up, we can discuss other solutions. e.g.
> > >     (a) switching back to coherent path before starting live migration.
> >
> > both guest/host switching to coherent or host-only?
> >
> > host-only certainly is problematic if guest is still using non-coherent.
> >
> > on the other hand I'm not sure whether the host/guest gfx stack is
> > capable of switching between coherent and non-coherent path in-fly
> > when the buffer is right being rendered.
> >
> > >     (b) read/write of guest memory with clflush during live migration.
> >
> > write is irrelevant as it's only done in the resume path where the
> > guest is not running.
> >
> > >
> > > Implementation Consideration
> > > ===
> > > There is a previous series [1] from google to serve the same purpose to
> > > let KVM be aware of virtio GPU's noncoherent DMA status. That series
> > > requires a new memslot flag, and special memslots in user space.
> > >
> > > We don't choose to use memslot flag to request honoring guest memory
> > > type.
> >
> > memslot flag has the potential to restrict the impact e.g. when using
> > clflush-before-read in migration?
>
> Yep, exactly.  E.g. if KVM needs to ensure coherency when freeing memory back to
> the host kernel, then the memslot flag will allow for a much more targeted
> operation.
>
> > Of course the implication is to honor guest type only for the selected slot
> > in KVM instead of applying to the entire guest memory as in previous series
> > (which selects this way because vmx_get_mt_mask() is in perf-critical path
> > hence not good to check memslot flag?)
>
> Checking a memslot flag won't impact performance.  KVM already has the memslot
> when creating SPTEs, e.g. the sole caller of vmx_get_mt_mask(), make_spte(), has
> access to the memslot.
>
> That isn't coincidental, KVM _must_ have the memslot to construct the SPTE, e.g.
> to retrieve the associated PFN, update write-tracking for shadow pages, etc.
>
> I added Yiwei, who I think is planning on posting another RFC for the memslot
> idea (I actually completely forgot that the memslot idea had been thought of and
> posted a few years back).

We've deferred to Yan (Intel side) to drive the userspace opt-in, so it's
up to Yan to revise the series to be memslot-flag based. I'm okay with
whatever upstream folks think is safer for the opt-in. Thanks!

> > > Instead we hope to make the honoring request to be explicit (not tied to a
> > > memslot flag). This is because once guest memory type is honored, not only
> > > memory used by guest virtio device, but all guest memory is facing page
> > > aliasing issue potentially. KVM needs a generic solution to take care of
> > > page aliasing issue rather than counting on memory type of a special
> > > memslot being aligned in host and guest.
> > > (we can discuss what a generic solution to handle page aliasing issue will
> > > look like in later follow-up series).
> > >
> > > On the other hand, we choose to introduce a KVM virtio device rather than
> > > just provide an ioctl to wrap kvm_arch_[un]register_noncoherent_dma()
> > > directly, which is based on considerations that
> >
> > I wonder it's over-engineered for the purpose.
> >
> > why not just introducing a KVM_CAP and allowing the VMM to enable?
> > KVM doesn't need to know the exact source of requiring it...
>
> Agreed.  If we end up needing to grant the whole VM access for some reason, just
> give userspace a direct toggle.
  
Yan Zhao Dec. 19, 2023, 4:26 a.m. UTC | #5
On Mon, Dec 18, 2023 at 07:08:51AM -0800, Sean Christopherson wrote:
> > > Implementation Consideration
> > > ===
> > > There is a previous series [1] from google to serve the same purpose to
> > > let KVM be aware of virtio GPU's noncoherent DMA status. That series
> > > requires a new memslot flag, and special memslots in user space.
> > > 
> > > We don't choose to use memslot flag to request honoring guest memory
> > > type.
> > 
> > memslot flag has the potential to restrict the impact e.g. when using
> > clflush-before-read in migration?
> 
> Yep, exactly.  E.g. if KVM needs to ensure coherency when freeing memory back to
> the host kernel, then the memslot flag will allow for a much more targeted
> operation.
> 
> > Of course the implication is to honor guest type only for the selected slot
> > in KVM instead of applying to the entire guest memory as in previous series
> > (which selects this way because vmx_get_mt_mask() is in perf-critical path
> > hence not good to check memslot flag?)
> 
> Checking a memslot flag won't impact performance.  KVM already has the memslot
> when creating SPTEs, e.g. the sole caller of vmx_get_mt_mask(), make_spte(), has
> access to the memslot.
> 
> That isn't coincidental, KVM _must_ have the memslot to construct the SPTE, e.g.
> to retrieve the associated PFN, update write-tracking for shadow pages, etc.
> 
Hi Sean,
Do you prefer to introduce a memslot flag KVM_MEM_DMA or KVM_MEM_WC?
For KVM_MEM_DMA, KVM needs to
(a) search VMA for vma->vm_page_prot and convert it to page cache mode (with
    pgprot2cachemode()? ), or
(b) look up memtype of the PFN, by calling lookup_memtype(), similar to that in
    pat_pfn_immune_to_uc_mtrr().

But pgprot2cachemode() and lookup_memtype() are not exported by x86 code now.

For KVM_MEM_WC, it requires user to ensure the memory is actually mapped
to WC, right?

Then, vmx_get_mt_mask() just ignores guest PAT and programs host PAT as EPT type
for the special memslot only, as below.
Is this understanding correct?

static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
{
        if (is_mmio)                                                                           
                return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;                          
                                                                                               
        if (gfn_in_dma_slot(vcpu->kvm, gfn)) {                                                 
                u8 type = MTRR_TYPE_WRCOMB;                                      
                //u8 type = pat_pfn_memtype(pfn);                                
                return (type << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;       
        }                                                                                      
                                                                                               
        if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))                            
                return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;         
                                                                                               
        if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {                                             
                if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))               
                        return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;                      
                else                                                                           
                        return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) | 
                                VMX_EPT_IPAT_BIT;                                
        }                                                                        
                                                                                 
        return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
}

BTW, since the special memslot must be exposed to guest as virtio GPU BAR in
order to prevent other guest drivers from access, I wonder if it's better to
include some keyword like VIRTIO_GPU_BAR in memslot flag name.
  
Yan Zhao Dec. 20, 2023, 2:15 a.m. UTC | #6
On Tue, Dec 19, 2023 at 12:26:45PM +0800, Yan Zhao wrote:
> On Mon, Dec 18, 2023 at 07:08:51AM -0800, Sean Christopherson wrote:
> > > > Implementation Consideration
> > > > ===
> > > > There is a previous series [1] from google to serve the same purpose to
> > > > let KVM be aware of virtio GPU's noncoherent DMA status. That series
> > > > requires a new memslot flag, and special memslots in user space.
> > > > 
> > > > We don't choose to use memslot flag to request honoring guest memory
> > > > type.
> > > 
> > > memslot flag has the potential to restrict the impact e.g. when using
> > > clflush-before-read in migration?
> > 
> > Yep, exactly.  E.g. if KVM needs to ensure coherency when freeing memory back to
> > the host kernel, then the memslot flag will allow for a much more targeted
> > operation.
> > 
> > > Of course the implication is to honor guest type only for the selected slot
> > > in KVM instead of applying to the entire guest memory as in previous series
> > > (which selects this way because vmx_get_mt_mask() is in perf-critical path
> > > hence not good to check memslot flag?)
> > 
> > Checking a memslot flag won't impact performance.  KVM already has the memslot
> > when creating SPTEs, e.g. the sole caller of vmx_get_mt_mask(), make_spte(), has
> > access to the memslot.
> > 
> > That isn't coincidental, KVM _must_ have the memslot to construct the SPTE, e.g.
> > to retrieve the associated PFN, update write-tracking for shadow pages, etc.
> > 
> Hi Sean,
> Do you prefer to introduce a memslot flag KVM_MEM_DMA or KVM_MEM_WC?
> For KVM_MEM_DMA, KVM needs to
> (a) search VMA for vma->vm_page_prot and convert it to page cache mode (with
>     pgprot2cachemode()? ), or
> (b) look up memtype of the PFN, by calling lookup_memtype(), similar to that in
>     pat_pfn_immune_to_uc_mtrr().
> 
> But pgprot2cachemode() and lookup_memtype() are not exported by x86 code now.
> 
> For KVM_MEM_WC, it requires user to ensure the memory is actually mapped
> to WC, right?
> 
> Then, vmx_get_mt_mask() just ignores guest PAT and programs host PAT as EPT type
> for the special memslot only, as below.
> Is this understanding correct?
> 
> static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> {
>         if (is_mmio)                                                                           
>                 return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;                          
>                                                                                                
>         if (gfn_in_dma_slot(vcpu->kvm, gfn)) {                                                 
>                 u8 type = MTRR_TYPE_WRCOMB;                                      
>                 //u8 type = pat_pfn_memtype(pfn);                                
>                 return (type << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;       
>         }                                                                                      
>                                                                                                
>         if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))                            
>                 return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;         
>                                                                                                
>         if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {                                             
>                 if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))               
>                         return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;                      
>                 else                                                                           
>                         return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) | 
>                                 VMX_EPT_IPAT_BIT;                                
>         }                                                                        
>                                                                                  
>         return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
> }
> 
> BTW, since the special memslot must be exposed to guest as virtio GPU BAR in
> order to prevent other guest drivers from access, I wonder if it's better to
> include some keyword like VIRTIO_GPU_BAR in memslot flag name.
Another choice is to add a memslot flag KVM_MEM_HONOR_GUEST_PAT, then user
(e.g. QEMU) does special treatment to this kind of memslots (e.g. skipping
reading/writing to them in general paths).

@@ -7589,26 +7589,29 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
        if (is_mmio)
                return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;

+       if (in_slot_honor_guest_pat(vcpu->kvm, gfn))
+               return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
+
        if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
                return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;

        if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
                if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
                        return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
                else
                        return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) |
                                VMX_EPT_IPAT_BIT;
        }

        return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
 }
  
Sean Christopherson Dec. 21, 2023, 2:12 a.m. UTC | #7
On Wed, Dec 20, 2023, Yan Zhao wrote:
> On Tue, Dec 19, 2023 at 12:26:45PM +0800, Yan Zhao wrote:
> > On Mon, Dec 18, 2023 at 07:08:51AM -0800, Sean Christopherson wrote:
> > > > > Implementation Consideration
> > > > > ===
> > > > > There is a previous series [1] from google to serve the same purpose to
> > > > > let KVM be aware of virtio GPU's noncoherent DMA status. That series
> > > > > requires a new memslot flag, and special memslots in user space.
> > > > > 
> > > > > We don't choose to use memslot flag to request honoring guest memory
> > > > > type.
> > > > 
> > > > memslot flag has the potential to restrict the impact e.g. when using
> > > > clflush-before-read in migration?
> > > 
> > > Yep, exactly.  E.g. if KVM needs to ensure coherency when freeing memory back to
> > > the host kernel, then the memslot flag will allow for a much more targeted
> > > operation.
> > > 
> > > > Of course the implication is to honor guest type only for the selected slot
> > > > in KVM instead of applying to the entire guest memory as in previous series
> > > > (which selects this way because vmx_get_mt_mask() is in perf-critical path
> > > > hence not good to check memslot flag?)
> > > 
> > > Checking a memslot flag won't impact performance.  KVM already has the memslot
> > > when creating SPTEs, e.g. the sole caller of vmx_get_mt_mask(), make_spte(), has
> > > access to the memslot.
> > > 
> > > That isn't coincidental, KVM _must_ have the memslot to construct the SPTE, e.g.
> > > to retrieve the associated PFN, update write-tracking for shadow pages, etc.
> > > 
> > Hi Sean,
> > Do you prefer to introduce a memslot flag KVM_MEM_DMA or KVM_MEM_WC?
> > For KVM_MEM_DMA, KVM needs to
> > (a) search VMA for vma->vm_page_prot and convert it to page cache mode (with
> >     pgprot2cachemode()? ), or
> > (b) look up memtype of the PFN, by calling lookup_memtype(), similar to that in
> >     pat_pfn_immune_to_uc_mtrr().
> > 
> > But pgprot2cachemode() and lookup_memtype() are not exported by x86 code now.
> > 
> > For KVM_MEM_WC, it requires user to ensure the memory is actually mapped
> > to WC, right?
> > 
> > Then, vmx_get_mt_mask() just ignores guest PAT and programs host PAT as EPT type
> > for the special memslot only, as below.
> > Is this understanding correct?
> > 
> > static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> > {
> >         if (is_mmio)                                                                           
> >                 return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;                          
> >                                                                                                
> >         if (gfn_in_dma_slot(vcpu->kvm, gfn)) {                                                 
> >                 u8 type = MTRR_TYPE_WRCOMB;                                      
> >                 //u8 type = pat_pfn_memtype(pfn);                                
> >                 return (type << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;       
> >         }                                                                                      
> >                                                                                                
> >         if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))                            
> >                 return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;         
> >                                                                                                
> >         if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {                                             
> >                 if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))               
> >                         return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;                      
> >                 else                                                                           
> >                         return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) | 
> >                                 VMX_EPT_IPAT_BIT;                                
> >         }                                                                        
> >                                                                                  
> >         return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
> > }
> > 
> > BTW, since the special memslot must be exposed to guest as virtio GPU BAR in
> > order to prevent other guest drivers from access, I wonder if it's better to
> > include some keyword like VIRTIO_GPU_BAR in memslot flag name.
> Another choice is to add a memslot flag KVM_MEM_HONOR_GUEST_PAT, then user
> (e.g. QEMU) does special treatment to this kind of memslots (e.g. skipping
> reading/writing to them in general paths).
> 
> @@ -7589,26 +7589,29 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
>         if (is_mmio)
>                 return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
> 
> +       if (in_slot_honor_guest_pat(vcpu->kvm, gfn))
> +               return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;

This is more along the lines of what I was thinking, though the name should be
something like KVM_MEM_NON_COHERENT_DMA, i.e. not x86 specific and not contradictory
for AMD (which already honors guest PAT).

I also vote to deliberately ignore MTRRs, i.e. start us on the path of ripping
those out.  This is a new feature, so we have the luxury of defining KVM's ABI
for that feature, i.e. can state that on x86 it honors guest PAT, but not MTRRs.

Like so?

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d21f55f323ea..ed527acb2bd3 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7575,7 +7575,8 @@ static int vmx_vm_init(struct kvm *kvm)
        return 0;
 }
 
-static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
+static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio,
+                         struct kvm_memory_slot *slot)
 {
        /* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
         * memory aliases with conflicting memory types and sometimes MCEs.
@@ -7598,6 +7599,9 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
        if (is_mmio)
                return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
 
+       if (kvm_memslot_has_non_coherent_dma(slot))
+               return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
+
        if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
                return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;

I like the idea of pulling the memtype from the host, but if we can make that
work then I don't see the need for a special memslot flag, i.e. just do it for
*all* SPTEs on VMX.  I don't think we need a VMA for that, e.g. we should be able
to get the memtype from the host PTEs, just like we do the page size.

KVM_MEM_WC is a hard "no" for me.  It's far too x86 centric, and as you alluded
to, it requires coordination from the guest, i.e. is effectively limited to
paravirt scenarios.

> +
>         if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
>                 return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> 
>         if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
>                 if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
>                         return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
>                 else
>                         return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) |
>                                 VMX_EPT_IPAT_BIT;
>         }
> 
>         return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
>  }
  
Yan Zhao Dec. 21, 2023, 2:44 a.m. UTC | #8
On Wed, Dec 20, 2023 at 06:12:55PM -0800, Sean Christopherson wrote:
> On Wed, Dec 20, 2023, Yan Zhao wrote:
> > On Tue, Dec 19, 2023 at 12:26:45PM +0800, Yan Zhao wrote:
> > > On Mon, Dec 18, 2023 at 07:08:51AM -0800, Sean Christopherson wrote:
> > > > > > Implementation Consideration
> > > > > > ===
> > > > > > There is a previous series [1] from google to serve the same purpose to
> > > > > > let KVM be aware of virtio GPU's noncoherent DMA status. That series
> > > > > > requires a new memslot flag, and special memslots in user space.
> > > > > > 
> > > > > > We don't choose to use memslot flag to request honoring guest memory
> > > > > > type.
> > > > > 
> > > > > memslot flag has the potential to restrict the impact e.g. when using
> > > > > clflush-before-read in migration?
> > > > 
> > > > Yep, exactly.  E.g. if KVM needs to ensure coherency when freeing memory back to
> > > > the host kernel, then the memslot flag will allow for a much more targeted
> > > > operation.
> > > > 
> > > > > Of course the implication is to honor guest type only for the selected slot
> > > > > in KVM instead of applying to the entire guest memory as in previous series
> > > > > (which selects this way because vmx_get_mt_mask() is in perf-critical path
> > > > > hence not good to check memslot flag?)
> > > > 
> > > > Checking a memslot flag won't impact performance.  KVM already has the memslot
> > > > when creating SPTEs, e.g. the sole caller of vmx_get_mt_mask(), make_spte(), has
> > > > access to the memslot.
> > > > 
> > > > That isn't coincidental, KVM _must_ have the memslot to construct the SPTE, e.g.
> > > > to retrieve the associated PFN, update write-tracking for shadow pages, etc.
> > > > 
> > > Hi Sean,
> > > Do you prefer to introduce a memslot flag KVM_MEM_DMA or KVM_MEM_WC?
> > > For KVM_MEM_DMA, KVM needs to
> > > (a) search VMA for vma->vm_page_prot and convert it to page cache mode (with
> > >     pgprot2cachemode()? ), or
> > > (b) look up memtype of the PFN, by calling lookup_memtype(), similar to that in
> > >     pat_pfn_immune_to_uc_mtrr().
> > > 
> > > But pgprot2cachemode() and lookup_memtype() are not exported by x86 code now.
> > > 
> > > For KVM_MEM_WC, it requires user to ensure the memory is actually mapped
> > > to WC, right?
> > > 
> > > Then, vmx_get_mt_mask() just ignores guest PAT and programs host PAT as EPT type
> > > for the special memslot only, as below.
> > > Is this understanding correct?
> > > 
> > > static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> > > {
> > >         if (is_mmio)                                                                           
> > >                 return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;                          
> > >                                                                                                
> > >         if (gfn_in_dma_slot(vcpu->kvm, gfn)) {                                                 
> > >                 u8 type = MTRR_TYPE_WRCOMB;                                      
> > >                 //u8 type = pat_pfn_memtype(pfn);                                
> > >                 return (type << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;       
> > >         }                                                                                      
> > >                                                                                                
> > >         if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))                            
> > >                 return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;         
> > >                                                                                                
> > >         if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {                                             
> > >                 if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))               
> > >                         return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;                      
> > >                 else                                                                           
> > >                         return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) | 
> > >                                 VMX_EPT_IPAT_BIT;                                
> > >         }                                                                        
> > >                                                                                  
> > >         return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
> > > }
> > > 
> > > BTW, since the special memslot must be exposed to guest as virtio GPU BAR in
> > > order to prevent other guest drivers from access, I wonder if it's better to
> > > include some keyword like VIRTIO_GPU_BAR in memslot flag name.
> > Another choice is to add a memslot flag KVM_MEM_HONOR_GUEST_PAT, then user
> > (e.g. QEMU) does special treatment to this kind of memslots (e.g. skipping
> > reading/writing to them in general paths).
> > 
> > @@ -7589,26 +7589,29 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> >         if (is_mmio)
> >                 return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
> > 
> > +       if (in_slot_honor_guest_pat(vcpu->kvm, gfn))
> > +               return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
> 
> This is more along the lines of what I was thinking, though the name should be
> something like KVM_MEM_NON_COHERENT_DMA, i.e. not x86 specific and not contradictory
> for AMD (which already honors guest PAT).
> 
> I also vote to deliberately ignore MTRRs, i.e. start us on the path of ripping
> those out.  This is a new feature, so we have the luxury of defining KVM's ABI
> for that feature, i.e. can state that on x86 it honors guest PAT, but not MTRRs.
> 
> Like so?
Yes, this looks good to me. I will refine the patch in the recommended way.
The only remaining question from me is whether we should allow the user to
set the flag on every slot.
If the user is allowed to set the flag on a system RAM slot, it may cause
problems, since guest drivers (other than the paravirt one) may get memory
allocated from that RAM slot and have their PAT settings take effect, which
is not desired.

Do we need to find a way to limit the eligible slots?
e.g. check the VMAs of the slot to ensure their vm_flags include VM_IO and
VM_PFNMAP.

Or just trust the user?
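
For reference, such a check might look roughly like this (a sketch; the
helper name is made up and error handling is simplified):

static bool kvm_memslot_is_pfnmap(struct kvm *kvm,
				  struct kvm_memory_slot *slot)
{
	unsigned long hva = slot->userspace_addr;
	unsigned long end = hva + (slot->npages << PAGE_SHIFT);
	struct vm_area_struct *vma;
	bool ret = true;

	mmap_read_lock(kvm->mm);
	/* Every VMA backing the slot must be VM_IO or VM_PFNMAP. */
	for (; hva < end; hva = vma->vm_end) {
		vma = find_vma(kvm->mm, hva);
		if (!vma || vma->vm_start > hva ||
		    !(vma->vm_flags & (VM_IO | VM_PFNMAP))) {
			ret = false;
			break;
		}
	}
	mmap_read_unlock(kvm->mm);
	return ret;
}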


> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index d21f55f323ea..ed527acb2bd3 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7575,7 +7575,8 @@ static int vmx_vm_init(struct kvm *kvm)
>         return 0;
>  }
>  
> -static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> +static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio,
> +                         struct kvm_memory_slot *slot)
>  {
>         /* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
>          * memory aliases with conflicting memory types and sometimes MCEs.
> @@ -7598,6 +7599,9 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
>         if (is_mmio)
>                 return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
>  
> +       if (kvm_memslot_has_non_coherent_dma(slot))
> +               return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
> +
>         if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
>                 return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> 
> I like the idea of pulling the memtype from the host, but if we can make that
> work then I don't see the need for a special memslot flag, i.e. just do it for
> *all* SPTEs on VMX.  I don't think we need a VMA for that, e.g. we should be able
> to get the memtype from the host PTEs, just like we do the page size.
Right.
Besides, after reading the host PTEs, we still need to decode the bits into
a memtype in a platform-specific way and convert that type to an EPT type.

> 
> KVM_MEM_WC is a hard "no" for me.  It's far too x86 centric, and as you alluded
> to, it requires coordination from the guest, i.e. is effectively limited to
> paravirt scenarios.
Got it!
> 
> > +
> >         if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
> >                 return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> > 
> >         if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
> >                 if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
> >                         return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
> >                 else
> >                         return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) |
> >                                 VMX_EPT_IPAT_BIT;
> >         }
> > 
> >         return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
> >  }
  

Patch

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index b07247b0b958..9f7223d298a2 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -44,6 +44,7 @@  config KVM
 	select KVM_XFER_TO_GUEST_WORK
 	select KVM_GENERIC_DIRTYLOG_READ_PROTECT
 	select KVM_VFIO
+	select KVM_VIRTIO
 	select INTERVAL_TREE
 	select HAVE_KVM_PM_NOTIFIER if PM
 	select KVM_GENERIC_HARDWARE_ENABLING
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index b1f92a0edc35..7f57fa902a0a 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1394,6 +1394,9 @@  struct kvm_device_attr {
 #define   KVM_DEV_VFIO_GROUP_DEL	KVM_DEV_VFIO_FILE_DEL
 #define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
 
+#define KVM_DEV_VIRTIO_NONCOHERENT	1
+#define KVM_DEV_VIRTIO_NONCOHERENT_SET	1
+
 enum kvm_device_type {
 	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
 #define KVM_DEV_TYPE_FSL_MPIC_20	KVM_DEV_TYPE_FSL_MPIC_20
@@ -1417,6 +1420,8 @@  enum kvm_device_type {
 #define KVM_DEV_TYPE_ARM_PV_TIME	KVM_DEV_TYPE_ARM_PV_TIME
 	KVM_DEV_TYPE_RISCV_AIA,
 #define KVM_DEV_TYPE_RISCV_AIA		KVM_DEV_TYPE_RISCV_AIA
+	KVM_DEV_TYPE_VIRTIO,
+#define KVM_DEV_TYPE_VIRTIO		KVM_DEV_TYPE_VIRTIO
 	KVM_DEV_TYPE_MAX,
 };
 
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 6793211a0b64..5dcba034593c 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -56,6 +56,9 @@  config HAVE_KVM_CPU_RELAX_INTERCEPT
 config KVM_VFIO
        bool
 
+config KVM_VIRTIO
+       bool
+
 config HAVE_KVM_INVALID_WAKEUPS
        bool
 
diff --git a/virt/kvm/Makefile.kvm b/virt/kvm/Makefile.kvm
index 724c89af78af..614c4c4fe50e 100644
--- a/virt/kvm/Makefile.kvm
+++ b/virt/kvm/Makefile.kvm
@@ -7,6 +7,7 @@  KVM ?= ../../../virt/kvm
 
 kvm-y := $(KVM)/kvm_main.o $(KVM)/eventfd.o $(KVM)/binary_stats.o
 kvm-$(CONFIG_KVM_VFIO) += $(KVM)/vfio.o
+kvm-$(CONFIG_KVM_VIRTIO) += $(KVM)/virtio.o
 kvm-$(CONFIG_KVM_MMIO) += $(KVM)/coalesced_mmio.o
 kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
 kvm-$(CONFIG_HAVE_KVM_IRQ_ROUTING) += $(KVM)/irqchip.o
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index acd67fb40183..dca2040368da 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -61,6 +61,7 @@ 
 #include "async_pf.h"
 #include "kvm_mm.h"
 #include "vfio.h"
+#include "virtio.h"
 
 #include <trace/events/ipi.h>
 
@@ -6453,6 +6454,10 @@  int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
 	if (WARN_ON_ONCE(r))
 		goto err_vfio;
 
+	r = kvm_virtio_ops_init();
+	if (WARN_ON_ONCE(r))
+		goto err_virtio;
+
 	kvm_gmem_init(module);
 
 	/*
@@ -6468,6 +6473,8 @@  int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
 	return 0;
 
 err_register:
+	kvm_virtio_ops_exit();
+err_virtio:
 	kvm_vfio_ops_exit();
 err_vfio:
 	kvm_async_pf_deinit();
@@ -6503,6 +6510,7 @@  void kvm_exit(void)
 		free_cpumask_var(per_cpu(cpu_kick_mask, cpu));
 	kmem_cache_destroy(kvm_vcpu_cache);
 	kvm_vfio_ops_exit();
+	kvm_virtio_ops_exit();
 	kvm_async_pf_deinit();
 #ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING
 	unregister_syscore_ops(&kvm_syscore_ops);
diff --git a/virt/kvm/virtio.c b/virt/kvm/virtio.c
new file mode 100644
index 000000000000..dbb25d9784c5
--- /dev/null
+++ b/virt/kvm/virtio.c
@@ -0,0 +1,121 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * KVM-VIRTIO device
+ *
+ */
+#include <linux/kvm_host.h>
+#include <linux/errno.h>
+#include <linux/mutex.h>
+#include <linux/slab.h>
+#include "virtio.h"
+
+struct kvm_virtio {
+	struct mutex lock;
+	bool noncoherent;
+};
+
+static int kvm_virtio_set_noncoherent(struct kvm_device *dev, long attr,
+				      void __user *arg)
+{
+	struct kvm_virtio *kv = dev->private;
+
+	/*
+	 * Currently, only set to noncoherent is allowed, and therefore a virtio
+	 * device is not allowed to switch back to coherent once it's set to
+	 * noncoherent.
+	 * User arg is also not checked as the attr name has indicated that the
+	 * purpose is to set to noncoherent.
+	 */
+	if (attr != KVM_DEV_VIRTIO_NONCOHERENT_SET)
+		return -ENXIO;
+
+	mutex_lock(&kv->lock);
+	if (kv->noncoherent)
+		goto out;
+
+	kv->noncoherent = true;
+	kvm_arch_register_noncoherent_dma(dev->kvm);
+out:
+	mutex_unlock(&kv->lock);
+	return 0;
+}
+
+static int kvm_virtio_set_attr(struct kvm_device *dev,
+			       struct kvm_device_attr *attr)
+{
+	switch (attr->group) {
+	case KVM_DEV_VIRTIO_NONCOHERENT:
+		return kvm_virtio_set_noncoherent(dev, attr->attr,
+						  u64_to_user_ptr(attr->addr));
+	}
+
+	return -ENXIO;
+}
+
+static int kvm_virtio_has_attr(struct kvm_device *dev,
+			     struct kvm_device_attr *attr)
+{
+	switch (attr->group) {
+	case KVM_DEV_VIRTIO_NONCOHERENT:
+		switch (attr->attr) {
+		case KVM_DEV_VIRTIO_NONCOHERENT_SET:
+			return 0;
+		}
+
+		break;
+	}
+
+	return -ENXIO;
+}
+
+static void kvm_virtio_release(struct kvm_device *dev)
+{
+	struct kvm_virtio *kv = dev->private;
+
+	if (kv->noncoherent)
+		kvm_arch_unregister_noncoherent_dma(dev->kvm);
+	kfree(kv);
+	kfree(dev); /* alloc by kvm_ioctl_create_device, free by .release */
+}
+
+static int kvm_virtio_create(struct kvm_device *dev, u32 type);
+
+static struct kvm_device_ops kvm_virtio_ops = {
+	.name = "kvm-virtio",
+	.create = kvm_virtio_create,
+	.release = kvm_virtio_release,
+	.set_attr = kvm_virtio_set_attr,
+	.has_attr = kvm_virtio_has_attr,
+};
+
+static int kvm_virtio_create(struct kvm_device *dev, u32 type)
+{
+	struct kvm_virtio *kv;
+
+	if (type != KVM_DEV_TYPE_VIRTIO)
+		return -ENODEV;
+
+	/*
+	 * This kvm_virtio device is created per virtio device.
+	 * Its default noncoherent state is false.
+	 */
+	kv = kzalloc(sizeof(*kv), GFP_KERNEL_ACCOUNT);
+	if (!kv)
+		return -ENOMEM;
+
+	mutex_init(&kv->lock);
+
+	dev->private = kv;
+
+	return 0;
+}
+
+int kvm_virtio_ops_init(void)
+{
+	return kvm_register_device_ops(&kvm_virtio_ops, KVM_DEV_TYPE_VIRTIO);
+}
+
+void kvm_virtio_ops_exit(void)
+{
+	kvm_unregister_device_ops(KVM_DEV_TYPE_VIRTIO);
+}
diff --git a/virt/kvm/virtio.h b/virt/kvm/virtio.h
new file mode 100644
index 000000000000..0353398ee1f2
--- /dev/null
+++ b/virt/kvm/virtio.h
@@ -0,0 +1,18 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVM_VIRTIO_H
+#define __KVM_VIRTIO_H
+
+#ifdef CONFIG_KVM_VIRTIO
+int kvm_virtio_ops_init(void);
+void kvm_virtio_ops_exit(void);
+#else
+static inline int kvm_virtio_ops_init(void)
+{
+	return 0;
+}
+static inline void kvm_virtio_ops_exit(void)
+{
+}
+#endif
+
+#endif