Message ID | 20231214103520.7198-1-yan.y.zhao@intel.com |
---|---|
State | New |
Series | [RFC] KVM: Introduce KVM VIRTIO device |
Commit Message
Yan Zhao
Dec. 14, 2023, 10:35 a.m. UTC
Introduce a new KVM device that represents a virtio device, in order to
allow userspace to indicate the device hardware's noncoherent DMA status.

Motivation
===
A virtio GPU device may want to configure GPU hardware to work in
noncoherent mode, i.e. some of its DMAs do not snoop CPU caches. This is
generally done for performance. On certain platforms, GFX performance can
improve by 20+% when DMAs go through the noncoherent path.

This noncoherent DMA mode works in the following sequence:
1. The host backend driver programs hardware not to snoop memory of the
   target DMA buffer.
2. The host backend driver tells the guest frontend driver to program the
   guest PAT to WC for the target DMA buffer.
3. The guest frontend driver writes to the DMA buffer without any clflush.
4. Hardware does noncoherent DMA to the target buffer.

In this noncoherent DMA mode, both the guest and the hardware regard a DMA
buffer as not cached. So, if KVM forces the effective memory type of this
DMA buffer to be WB, hardware DMA may read incorrect data and cause
miscellaneous failures.

Therefore, we introduce a new KVM virtio device to let KVM be aware that
the virtio device hardware is working in noncoherent mode. KVM will then
honor the guest PAT type on Intel platforms.

For a virtio device model (e.g. QEMU), if it knows the device hardware is
to be configured to work in
a. noncoherent mode,
   - on device realization, it can create a KVM virtio device and set a
     device attr to increase the KVM noncoherent DMA count;
   - on device unrealization, destroy the KVM virtio device to decrease
     the KVM noncoherent DMA count.
b. coherent mode, just do nothing.

Security
===
The biggest concern with KVM honoring the guest's memory type on Intel
platforms is the page aliasing issue.
- For host MMIO, it's not a concern, as KVM VMX programs the EPT memory
  type to UC (which overrides all guest PAT types except WC) no matter
  whether the guest memory type is honored or not.
- For host non-MMIO pages,
  * the virtio guest frontend and host backend drivers should be synced to
    use the same memory type to map a buffer. Otherwise, there is a
    potential problem of incorrect memory data. But this will only impact
    the buggy guest alone.
  * for live migration,
    as QEMU will read all guest memory during live migration, page
    aliasing could happen.
    The current thinking is to disable live migration if a virtio device
    has indicated its noncoherent state.
    As a follow-up, we can discuss other solutions, e.g.
    (a) switching back to the coherent path before starting live migration.
    (b) read/write of guest memory with clflush during live migration.

Implementation Consideration
===
There is a previous series [1] from Google that serves the same purpose of
letting KVM be aware of a virtio GPU's noncoherent DMA status. That series
requires a new memslot flag, and special memslots in user space.

We don't choose to use a memslot flag to request honoring guest memory
type. Instead, we hope to make the honoring request explicit (not tied to
a memslot flag). This is because once the guest memory type is honored,
not only the memory used by the guest virtio device but potentially all
guest memory faces the page aliasing issue. KVM needs a generic solution
to take care of the page aliasing issue rather than counting on the memory
type of a special memslot being aligned between host and guest.
(We can discuss what a generic solution to handle the page aliasing issue
will look like in a later follow-up series.)

On the other hand, we choose to introduce a KVM virtio device rather than
just provide an ioctl that wraps kvm_arch_[un]register_noncoherent_dma()
directly, based on the following considerations:
1. Explicitly limit the "register noncoherent DMA" ability to virtio
   devices.
2. Provide better encapsulation. Repeatedly setting the noncoherent
   attribute on a KVM virtio device will only increase the KVM noncoherent
   DMA count once. The KVM noncoherent DMA count will be decreased
   automatically when the KVM virtio device is closed.
3. The KVM virtio device can be extended to let KVM know more info about
   the device that introduces noncoherent DMA.

Example QEMU usage
===
- on virtio device realize:

    struct kvm_create_device cd = {
        .type = KVM_DEV_TYPE_VIRTIO,
    };
    kvm_vm_ioctl(kvm_state, KVM_CREATE_DEVICE, &cd);

    struct kvm_device_attr attr = {
        .group = KVM_DEV_VIRTIO_NONCOHERENT,
        .attr = KVM_DEV_VIRTIO_NONCOHERENT_SET,
    };
    ioctl(cd.fd, KVM_SET_DEVICE_ATTR, &attr);
    g->kvm_device_fd = cd.fd;

- on virtio device unrealize:

    close(g->kvm_device_fd);

Reviewed-by: Zhenyu Wang <zhenyuw@linux.intel.com>
Link: https://patchwork.kernel.org/project/dri-devel/cover/20200213213036.207625-1-olvaffe@gmail.com/ [1]
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/Kconfig     |   1 +
 include/uapi/linux/kvm.h |   5 ++
 virt/kvm/Kconfig         |   3 +
 virt/kvm/Makefile.kvm    |   1 +
 virt/kvm/kvm_main.c      |   8 +++
 virt/kvm/virtio.c        | 121 +++++++++++++++++++++++++++++++++++++++
 virt/kvm/virtio.h        |  18 ++++++
 7 files changed, 157 insertions(+)
 create mode 100644 virt/kvm/virtio.c
 create mode 100644 virt/kvm/virtio.h


base-commit: 8ed26ab8d59111c2f7b86d200d1eb97d2a458fd1
Comments
> From: Zhao, Yan Y <yan.y.zhao@intel.com> > Sent: Thursday, December 14, 2023 6:35 PM > > - For host non-MMIO pages, > * virtio guest frontend and host backend driver should be synced to use > the same memory type to map a buffer. Otherwise, there will be > potential problem for incorrect memory data. But this will only impact > the buggy guest alone. > * for live migration, > as QEMU will read all guest memory during live migration, page aliasing > could happen. > Current thinking is to disable live migration if a virtio device has > indicated its noncoherent state. > As a follow-up, we can discuss other solutions. e.g. > (a) switching back to coherent path before starting live migration. both guest/host switching to coherent or host-only? host-only certainly is problematic if guest is still using non-coherent. on the other hand I'm not sure whether the host/guest gfx stack is capable of switching between coherent and non-coherent path in-fly when the buffer is right being rendered. > (b) read/write of guest memory with clflush during live migration. write is irrelevant as it's only done in the resume path where the guest is not running. > > Implementation Consideration > === > There is a previous series [1] from google to serve the same purpose to > let KVM be aware of virtio GPU's noncoherent DMA status. That series > requires a new memslot flag, and special memslots in user space. > > We don't choose to use memslot flag to request honoring guest memory > type. memslot flag has the potential to restrict the impact e.g. when using clflush-before-read in migration? Of course the implication is to honor guest type only for the selected slot in KVM instead of applying to the entire guest memory as in previous series (which selects this way because vmx_get_mt_mask() is in perf-critical path hence not good to check memslot flag?) > Instead we hope to make the honoring request to be explicit (not tied to a > memslot flag). This is because once guest memory type is honored, not only > memory used by guest virtio device, but all guest memory is facing page > aliasing issue potentially. KVM needs a generic solution to take care of > page aliasing issue rather than counting on memory type of a special > memslot being aligned in host and guest. > (we can discuss what a generic solution to handle page aliasing issue will > look like in later follow-up series). > > On the other hand, we choose to introduce a KVM virtio device rather than > just provide an ioctl to wrap kvm_arch_[un]register_noncoherent_dma() > directly, which is based on considerations that I wonder it's over-engineered for the purpose. why not just introducing a KVM_CAP and allowing the VMM to enable? KVM doesn't need to know the exact source of requiring it...
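For illustration, the clflush-before-read idea mentioned above could look
roughly like the sketch below on the VMM side. This is only a sketch: it
assumes x86, a fixed 64-byte cache line, and a made-up helper name; a real
implementation would query the cache-line size and batch the flushes.

#include <emmintrin.h>	/* _mm_clflush(), _mm_mfence() */
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative only: flush the cache lines covering a guest RAM range
 * before the migration thread reads it, so the reads observe data written
 * by noncoherent DMA. Assumes a 64-byte cache line.
 */
static void flush_before_read(void *hva, size_t len)
{
	const uintptr_t line = 64;
	uintptr_t p = (uintptr_t)hva & ~(line - 1);
	uintptr_t end = (uintptr_t)hva + len;

	for (; p < end; p += line)
		_mm_clflush((const void *)p);
	_mm_mfence();	/* order the flushes before the subsequent reads */
}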
On Fri, Dec 15, 2023 at 02:23:48PM +0800, Tian, Kevin wrote: > > From: Zhao, Yan Y <yan.y.zhao@intel.com> > > Sent: Thursday, December 14, 2023 6:35 PM > > > > - For host non-MMIO pages, > > * virtio guest frontend and host backend driver should be synced to use > > the same memory type to map a buffer. Otherwise, there will be > > potential problem for incorrect memory data. But this will only impact > > the buggy guest alone. > > * for live migration, > > as QEMU will read all guest memory during live migration, page aliasing > > could happen. > > Current thinking is to disable live migration if a virtio device has > > indicated its noncoherent state. > > As a follow-up, we can discuss other solutions. e.g. > > (a) switching back to coherent path before starting live migration. > > both guest/host switching to coherent or host-only? > > host-only certainly is problematic if guest is still using non-coherent. Both. > on the other hand I'm not sure whether the host/guest gfx stack is > capable of switching between coherent and non-coherent path in-fly > when the buffer is right being rendered. > Yes. I'm also not sure about it. But it's an option though. > > (b) read/write of guest memory with clflush during live migration. > > write is irrelevant as it's only done in the resume path where the > guest is not running. Given host write is with PAT WB and hardware is in no-snoop mode, is it better to perform cache flush after host write? (can do more investigation to check if it's necessary). BTW, there's also post-copy live migration, in which case the guest is running :) > > > > > Implementation Consideration > > === > > There is a previous series [1] from google to serve the same purpose to > > let KVM be aware of virtio GPU's noncoherent DMA status. That series > > requires a new memslot flag, and special memslots in user space. > > > > We don't choose to use memslot flag to request honoring guest memory > > type. > > memslot flag has the potential to restrict the impact e.g. when using > clflush-before-read in migration? Of course the implication is to > honor guest type only for the selected slot in KVM instead of applying > to the entire guest memory as in previous series (which selects this > way because vmx_get_mt_mask() is in perf-critical path hence not > good to check memslot flag?) > I think checking memslot flag in itself is all right. But memslot flag does not contain the memory type that host is using for the memslot. On the other hand, virtio GPU is not the only source of non-coherent DMAs. Memslot flag way is not applicable to pass-through GPUs, due to lacking of coordination between guest and host. > > Instead we hope to make the honoring request to be explicit (not tied to a > > memslot flag). This is because once guest memory type is honored, not only > > memory used by guest virtio device, but all guest memory is facing page > > aliasing issue potentially. KVM needs a generic solution to take care of > > page aliasing issue rather than counting on memory type of a special > > memslot being aligned in host and guest. > > (we can discuss what a generic solution to handle page aliasing issue will > > look like in later follow-up series). > > > > On the other hand, we choose to introduce a KVM virtio device rather than > > just provide an ioctl to wrap kvm_arch_[un]register_noncoherent_dma() > > directly, which is based on considerations that > > I wonder it's over-engineered for the purpose. > > why not just introducing a KVM_CAP and allowing the VMM to enable? 
As we hope to increase non-coherent DMA count on hot-plug of a non-coherent device and decrease non-coherent DMA count on hot-unplug of the non-coherent device, a KVM_CAP looks requiring user to maintain a ref count before turning on/off, which is less desired. Agree? > KVM doesn't need to know the exact source of requiring it... Maybe we can use the source info in a way like this: 1. indicate the source is not a passthrough device 2. record relationship between GPA and memory type. Then, if KVM knows non-coherent DMAs do not contain any passthrough devices, it can force a GPA's memory type (by ignoring guest PAT) to the one specified by host (in 2), so as to avoid cache flush operations before live migration. If there are passthrough devices involved later, we can zap the EPT and rebuild memory type to honor guest PAT, resorting to cache flush before live migration to maintain coherency.
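To make the ref-count argument concrete, a hedged sketch of the device-model
side under the per-device fd approach is shown below: each hot-plugged
noncoherent device owns its own KVM virtio device fd, so KVM's noncoherent
DMA count follows plug/unplug naturally, with no extra counting in the VMM.
The helper names are made up; the device type and attributes are the ones
introduced by this RFC.

#include <errno.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/kvm.h>

/*
 * Illustrative sketch: one KVM virtio device fd per hot-plugged
 * noncoherent device. Each plug bumps KVM's noncoherent DMA count and
 * each unplug (close) drops it -- no VMM-side ref count needed.
 */
static int plug_noncoherent_device(int vm_fd)
{
	struct kvm_create_device cd = { .type = KVM_DEV_TYPE_VIRTIO };
	struct kvm_device_attr attr = {
		.group = KVM_DEV_VIRTIO_NONCOHERENT,
		.attr  = KVM_DEV_VIRTIO_NONCOHERENT_SET,
	};

	if (ioctl(vm_fd, KVM_CREATE_DEVICE, &cd) < 0)
		return -errno;
	if (ioctl(cd.fd, KVM_SET_DEVICE_ATTR, &attr) < 0) {
		close(cd.fd);
		return -errno;
	}
	return cd.fd;			/* noncoherent DMA count += 1 */
}

static void unplug_noncoherent_device(int dev_fd)
{
	close(dev_fd);			/* noncoherent DMA count -= 1 */
}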
+Yiwei On Fri, Dec 15, 2023, Kevin Tian wrote: > > From: Zhao, Yan Y <yan.y.zhao@intel.com> > > Sent: Thursday, December 14, 2023 6:35 PM > > > > - For host non-MMIO pages, > > * virtio guest frontend and host backend driver should be synced to use > > the same memory type to map a buffer. Otherwise, there will be > > potential problem for incorrect memory data. But this will only impact > > the buggy guest alone. > > * for live migration, > > as QEMU will read all guest memory during live migration, page aliasing > > could happen. > > Current thinking is to disable live migration if a virtio device has > > indicated its noncoherent state. > > As a follow-up, we can discuss other solutions. e.g. > > (a) switching back to coherent path before starting live migration. > > both guest/host switching to coherent or host-only? > > host-only certainly is problematic if guest is still using non-coherent. > > on the other hand I'm not sure whether the host/guest gfx stack is > capable of switching between coherent and non-coherent path in-fly > when the buffer is right being rendered. > > > (b) read/write of guest memory with clflush during live migration. > > write is irrelevant as it's only done in the resume path where the > guest is not running. > > > > > Implementation Consideration > > === > > There is a previous series [1] from google to serve the same purpose to > > let KVM be aware of virtio GPU's noncoherent DMA status. That series > > requires a new memslot flag, and special memslots in user space. > > > > We don't choose to use memslot flag to request honoring guest memory > > type. > > memslot flag has the potential to restrict the impact e.g. when using > clflush-before-read in migration? Yep, exactly. E.g. if KVM needs to ensure coherency when freeing memory back to the host kernel, then the memslot flag will allow for a much more targeted operation. > Of course the implication is to honor guest type only for the selected slot > in KVM instead of applying to the entire guest memory as in previous series > (which selects this way because vmx_get_mt_mask() is in perf-critical path > hence not good to check memslot flag?) Checking a memslot flag won't impact performance. KVM already has the memslot when creating SPTEs, e.g. the sole caller of vmx_get_mt_mask(), make_spte(), has access to the memslot. That isn't coincidental, KVM _must_ have the memslot to construct the SPTE, e.g. to retrieve the associated PFN, update write-tracking for shadow pages, etc. I added Yiwei, who I think is planning on posting another RFC for the memslot idea (I actually completely forgot that the memslot idea had been thought of and posted a few years back). > > Instead we hope to make the honoring request to be explicit (not tied to a > > memslot flag). This is because once guest memory type is honored, not only > > memory used by guest virtio device, but all guest memory is facing page > > aliasing issue potentially. KVM needs a generic solution to take care of > > page aliasing issue rather than counting on memory type of a special > > memslot being aligned in host and guest. > > (we can discuss what a generic solution to handle page aliasing issue will > > look like in later follow-up series). > > > > On the other hand, we choose to introduce a KVM virtio device rather than > > just provide an ioctl to wrap kvm_arch_[un]register_noncoherent_dma() > > directly, which is based on considerations that > > I wonder it's over-engineered for the purpose. 
> > why not just introducing a KVM_CAP and allowing the VMM to enable? > KVM doesn't need to know the exact source of requiring it... Agreed. If we end up needing to grant the whole VM access for some reason, just give userspace a direct toggle.
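For comparison, the VM-wide toggle alternative would reduce the userspace
side to a single KVM_ENABLE_CAP call on the VM fd. The capability name
below is hypothetical (neither this RFC nor upstream defines it); only the
ioctl and the struct are existing KVM UAPI.

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hypothetical capability number -- not defined by this RFC or upstream. */
#define KVM_CAP_NONCOHERENT_DMA	0x10000

/*
 * Illustrative sketch of the "direct toggle" alternative: enable one
 * VM-wide capability and let KVM honor guest PAT for the whole VM.
 */
static int enable_noncoherent_dma(int vm_fd)
{
	struct kvm_enable_cap cap = { .cap = KVM_CAP_NONCOHERENT_DMA };

	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}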
> +Yiwei > > On Fri, Dec 15, 2023, Kevin Tian wrote: > > > From: Zhao, Yan Y <yan.y.zhao@intel.com> > > > Sent: Thursday, December 14, 2023 6:35 PM > > > > > > - For host non-MMIO pages, > > > * virtio guest frontend and host backend driver should be synced to use > > > the same memory type to map a buffer. Otherwise, there will be > > > potential problem for incorrect memory data. But this will only impact > > > the buggy guest alone. > > > * for live migration, > > > as QEMU will read all guest memory during live migration, page aliasing > > > could happen. > > > Current thinking is to disable live migration if a virtio device has > > > indicated its noncoherent state. > > > As a follow-up, we can discuss other solutions. e.g. > > > (a) switching back to coherent path before starting live migration. > > > > both guest/host switching to coherent or host-only? > > > > host-only certainly is problematic if guest is still using non-coherent. > > > > on the other hand I'm not sure whether the host/guest gfx stack is > > capable of switching between coherent and non-coherent path in-fly > > when the buffer is right being rendered. > > > > > (b) read/write of guest memory with clflush during live migration. > > > > write is irrelevant as it's only done in the resume path where the > > guest is not running. > > > > > > > > Implementation Consideration > > > === > > > There is a previous series [1] from google to serve the same purpose to > > > let KVM be aware of virtio GPU's noncoherent DMA status. That series > > > requires a new memslot flag, and special memslots in user space. > > > > > > We don't choose to use memslot flag to request honoring guest memory > > > type. > > > > memslot flag has the potential to restrict the impact e.g. when using > > clflush-before-read in migration? > > Yep, exactly. E.g. if KVM needs to ensure coherency when freeing memory back to > the host kernel, then the memslot flag will allow for a much more targeted > operation. > > > Of course the implication is to honor guest type only for the selected slot > > in KVM instead of applying to the entire guest memory as in previous series > > (which selects this way because vmx_get_mt_mask() is in perf-critical path > > hence not good to check memslot flag?) > > Checking a memslot flag won't impact performance. KVM already has the memslot > when creating SPTEs, e.g. the sole caller of vmx_get_mt_mask(), make_spte(), has > access to the memslot. > > That isn't coincidental, KVM _must_ have the memslot to construct the SPTE, e.g. > to retrieve the associated PFN, update write-tracking for shadow pages, etc. > > I added Yiwei, who I think is planning on posting another RFC for the memslot > idea (I actually completely forgot that the memslot idea had been thought of and > posted a few years back). We've deferred to Yan (Intel side) to drive the userspace opt-in. So it's up to Yan to revise the series to be memslot flag based. I'm okay with what upstream folks think to be safer for the opt-in. Thanks! > > > Instead we hope to make the honoring request to be explicit (not tied to a > > > memslot flag). This is because once guest memory type is honored, not only > > > memory used by guest virtio device, but all guest memory is facing page > > > aliasing issue potentially. KVM needs a generic solution to take care of > > > page aliasing issue rather than counting on memory type of a special > > > memslot being aligned in host and guest. 
> > > (we can discuss what a generic solution to handle page aliasing issue will > > > look like in later follow-up series). > > > > > > On the other hand, we choose to introduce a KVM virtio device rather than > > > just provide an ioctl to wrap kvm_arch_[un]register_noncoherent_dma() > > > directly, which is based on considerations that > > > > I wonder it's over-engineered for the purpose. > > > > why not just introducing a KVM_CAP and allowing the VMM to enable? > > KVM doesn't need to know the exact source of requiring it... > > Agreed. If we end up needing to grant the whole VM access for some reason, just > give userspace a direct toggle.
On Mon, Dec 18, 2023 at 07:08:51AM -0800, Sean Christopherson wrote: > > > Implementation Consideration > > > === > > > There is a previous series [1] from google to serve the same purpose to > > > let KVM be aware of virtio GPU's noncoherent DMA status. That series > > > requires a new memslot flag, and special memslots in user space. > > > > > > We don't choose to use memslot flag to request honoring guest memory > > > type. > > > > memslot flag has the potential to restrict the impact e.g. when using > > clflush-before-read in migration? > > Yep, exactly. E.g. if KVM needs to ensure coherency when freeing memory back to > the host kernel, then the memslot flag will allow for a much more targeted > operation. > > > Of course the implication is to honor guest type only for the selected slot > > in KVM instead of applying to the entire guest memory as in previous series > > (which selects this way because vmx_get_mt_mask() is in perf-critical path > > hence not good to check memslot flag?) > > Checking a memslot flag won't impact performance. KVM already has the memslot > when creating SPTEs, e.g. the sole caller of vmx_get_mt_mask(), make_spte(), has > access to the memslot. > > That isn't coincidental, KVM _must_ have the memslot to construct the SPTE, e.g. > to retrieve the associated PFN, update write-tracking for shadow pages, etc. > Hi Sean, Do you prefer to introduce a memslot flag KVM_MEM_DMA or KVM_MEM_WC? For KVM_MEM_DMA, KVM needs to (a) search VMA for vma->vm_page_prot and convert it to page cache mode (with pgprot2cachemode()? ), or (b) look up memtype of the PFN, by calling lookup_memtype(), similar to that in pat_pfn_immune_to_uc_mtrr(). But pgprot2cachemode() and lookup_memtype() are not exported by x86 code now. For KVM_MEM_WC, it requires user to ensure the memory is actually mapped to WC, right? Then, vmx_get_mt_mask() just ignores guest PAT and programs host PAT as EPT type for the special memslot only, as below. Is this understanding correct? static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) { if (is_mmio) return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT; if (gfn_in_dma_slot(vcpu->kvm, gfn)) { u8 type = MTRR_TYPE_WRCOMB; //u8 type = pat_pfn_memtype(pfn); return (type << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT; } if (!kvm_arch_has_noncoherent_dma(vcpu->kvm)) return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT; if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) { if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED)) return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT; else return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT; } return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT; } BTW, since the special memslot must be exposed to guest as virtio GPU BAR in order to prevent other guest drivers from access, I wonder if it's better to include some keyword like VIRTIO_GPU_BAR in memslot flag name.
On Tue, Dec 19, 2023 at 12:26:45PM +0800, Yan Zhao wrote: > On Mon, Dec 18, 2023 at 07:08:51AM -0800, Sean Christopherson wrote: > > > > Implementation Consideration > > > > === > > > > There is a previous series [1] from google to serve the same purpose to > > > > let KVM be aware of virtio GPU's noncoherent DMA status. That series > > > > requires a new memslot flag, and special memslots in user space. > > > > > > > > We don't choose to use memslot flag to request honoring guest memory > > > > type. > > > > > > memslot flag has the potential to restrict the impact e.g. when using > > > clflush-before-read in migration? > > > > Yep, exactly. E.g. if KVM needs to ensure coherency when freeing memory back to > > the host kernel, then the memslot flag will allow for a much more targeted > > operation. > > > > > Of course the implication is to honor guest type only for the selected slot > > > in KVM instead of applying to the entire guest memory as in previous series > > > (which selects this way because vmx_get_mt_mask() is in perf-critical path > > > hence not good to check memslot flag?) > > > > Checking a memslot flag won't impact performance. KVM already has the memslot > > when creating SPTEs, e.g. the sole caller of vmx_get_mt_mask(), make_spte(), has > > access to the memslot. > > > > That isn't coincidental, KVM _must_ have the memslot to construct the SPTE, e.g. > > to retrieve the associated PFN, update write-tracking for shadow pages, etc. > > > Hi Sean, > Do you prefer to introduce a memslot flag KVM_MEM_DMA or KVM_MEM_WC? > For KVM_MEM_DMA, KVM needs to > (a) search VMA for vma->vm_page_prot and convert it to page cache mode (with > pgprot2cachemode()? ), or > (b) look up memtype of the PFN, by calling lookup_memtype(), similar to that in > pat_pfn_immune_to_uc_mtrr(). > > But pgprot2cachemode() and lookup_memtype() are not exported by x86 code now. > > For KVM_MEM_WC, it requires user to ensure the memory is actually mapped > to WC, right? > > Then, vmx_get_mt_mask() just ignores guest PAT and programs host PAT as EPT type > for the special memslot only, as below. > Is this understanding correct? > > static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) > { > if (is_mmio) > return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT; > > if (gfn_in_dma_slot(vcpu->kvm, gfn)) { > u8 type = MTRR_TYPE_WRCOMB; > //u8 type = pat_pfn_memtype(pfn); > return (type << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT; > } > > if (!kvm_arch_has_noncoherent_dma(vcpu->kvm)) > return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT; > > if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) { > if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED)) > return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT; > else > return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) | > VMX_EPT_IPAT_BIT; > } > > return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT; > } > > BTW, since the special memslot must be exposed to guest as virtio GPU BAR in > order to prevent other guest drivers from access, I wonder if it's better to > include some keyword like VIRTIO_GPU_BAR in memslot flag name. Another choice is to add a memslot flag KVM_MEM_HONOR_GUEST_PAT, then user (e.g. QEMU) does special treatment to this kind of memslots (e.g. skipping reading/writing to them in general paths). 
@@ -7589,26 +7589,29 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 	if (is_mmio)
 		return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
 
+	if (in_slot_honor_guest_pat(vcpu->kvm, gfn))
+		return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
+
 	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
 		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
 
 	if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
 		if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
 			return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
 		else
 			return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) |
 				VMX_EPT_IPAT_BIT;
 	}
 
 	return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
 }
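For reference, the memslot-flag variant would be driven from userspace when
the virtio GPU BAR backing memory is registered. The flag name and bit
value below are hypothetical (KVM currently defines only
KVM_MEM_LOG_DIRTY_PAGES and KVM_MEM_READONLY); the ioctl and struct are
existing UAPI.

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hypothetical flag bit -- the real name/value would be assigned by the patch. */
#define KVM_MEM_HONOR_GUEST_PAT	(1UL << 2)

/*
 * Illustrative sketch: register the virtio GPU BAR backing memory as a
 * memslot carrying the hypothetical flag, so only this slot has its guest
 * memory type honored.
 */
static int register_noncoherent_slot(int vm_fd, __u32 slot, __u64 gpa,
				     __u64 size, void *hva)
{
	struct kvm_userspace_memory_region region = {
		.slot            = slot,
		.flags           = KVM_MEM_HONOR_GUEST_PAT,
		.guest_phys_addr = gpa,
		.memory_size     = size,
		.userspace_addr  = (__u64)(unsigned long)hva,
	};

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}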
On Wed, Dec 20, 2023, Yan Zhao wrote: > On Tue, Dec 19, 2023 at 12:26:45PM +0800, Yan Zhao wrote: > > On Mon, Dec 18, 2023 at 07:08:51AM -0800, Sean Christopherson wrote: > > > > > Implementation Consideration > > > > > === > > > > > There is a previous series [1] from google to serve the same purpose to > > > > > let KVM be aware of virtio GPU's noncoherent DMA status. That series > > > > > requires a new memslot flag, and special memslots in user space. > > > > > > > > > > We don't choose to use memslot flag to request honoring guest memory > > > > > type. > > > > > > > > memslot flag has the potential to restrict the impact e.g. when using > > > > clflush-before-read in migration? > > > > > > Yep, exactly. E.g. if KVM needs to ensure coherency when freeing memory back to > > > the host kernel, then the memslot flag will allow for a much more targeted > > > operation. > > > > > > > Of course the implication is to honor guest type only for the selected slot > > > > in KVM instead of applying to the entire guest memory as in previous series > > > > (which selects this way because vmx_get_mt_mask() is in perf-critical path > > > > hence not good to check memslot flag?) > > > > > > Checking a memslot flag won't impact performance. KVM already has the memslot > > > when creating SPTEs, e.g. the sole caller of vmx_get_mt_mask(), make_spte(), has > > > access to the memslot. > > > > > > That isn't coincidental, KVM _must_ have the memslot to construct the SPTE, e.g. > > > to retrieve the associated PFN, update write-tracking for shadow pages, etc. > > > > > Hi Sean, > > Do you prefer to introduce a memslot flag KVM_MEM_DMA or KVM_MEM_WC? > > For KVM_MEM_DMA, KVM needs to > > (a) search VMA for vma->vm_page_prot and convert it to page cache mode (with > > pgprot2cachemode()? ), or > > (b) look up memtype of the PFN, by calling lookup_memtype(), similar to that in > > pat_pfn_immune_to_uc_mtrr(). > > > > But pgprot2cachemode() and lookup_memtype() are not exported by x86 code now. > > > > For KVM_MEM_WC, it requires user to ensure the memory is actually mapped > > to WC, right? > > > > Then, vmx_get_mt_mask() just ignores guest PAT and programs host PAT as EPT type > > for the special memslot only, as below. > > Is this understanding correct? > > > > static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) > > { > > if (is_mmio) > > return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT; > > > > if (gfn_in_dma_slot(vcpu->kvm, gfn)) { > > u8 type = MTRR_TYPE_WRCOMB; > > //u8 type = pat_pfn_memtype(pfn); > > return (type << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT; > > } > > > > if (!kvm_arch_has_noncoherent_dma(vcpu->kvm)) > > return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT; > > > > if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) { > > if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED)) > > return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT; > > else > > return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) | > > VMX_EPT_IPAT_BIT; > > } > > > > return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT; > > } > > > > BTW, since the special memslot must be exposed to guest as virtio GPU BAR in > > order to prevent other guest drivers from access, I wonder if it's better to > > include some keyword like VIRTIO_GPU_BAR in memslot flag name. > Another choice is to add a memslot flag KVM_MEM_HONOR_GUEST_PAT, then user > (e.g. QEMU) does special treatment to this kind of memslots (e.g. skipping > reading/writing to them in general paths). 
>
> @@ -7589,26 +7589,29 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
>  	if (is_mmio)
>  		return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
>
> +	if (in_slot_honor_guest_pat(vcpu->kvm, gfn))
> +		return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;

This is more along the lines of what I was thinking, though the name should be
something like KVM_MEM_NON_COHERENT_DMA, i.e. not x86 specific and not
contradictory for AMD (which already honors guest PAT).

I also vote to deliberately ignore MTRRs, i.e. start us on the path of ripping
those out.  This is a new feature, so we have the luxury of defining KVM's ABI
for that feature, i.e. can state that on x86 it honors guest PAT, but not MTRRs.

Like so?

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d21f55f323ea..ed527acb2bd3 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7575,7 +7575,8 @@ static int vmx_vm_init(struct kvm *kvm)
 	return 0;
 }
 
-static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
+static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio,
+			  struct kvm_memory_slot *slot)
 {
 	/* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
 	 * memory aliases with conflicting memory types and sometimes MCEs.
@@ -7598,6 +7599,9 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 	if (is_mmio)
 		return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
 
+	if (kvm_memslot_has_non_coherent_dma(slot))
+		return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
+
 	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
 		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;

I like the idea of pulling the memtype from the host, but if we can make that
work then I don't see the need for a special memslot flag, i.e. just do it for
*all* SPTEs on VMX.  I don't think we need a VMA for that, e.g. we should be
able to get the memtype from the host PTEs, just like we do the page size.

KVM_MEM_WC is a hard "no" for me.  It's far too x86 centric, and as you alluded
to, it requires coordination from the guest, i.e. is effectively limited to
paravirt scenarios.

> +
> 	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
> 		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
>
> 	if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
> 		if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
> 			return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
> 		else
> 			return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) |
> 				VMX_EPT_IPAT_BIT;
> 	}
>
> 	return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
> }
On Wed, Dec 20, 2023 at 06:12:55PM -0800, Sean Christopherson wrote: > On Wed, Dec 20, 2023, Yan Zhao wrote: > > On Tue, Dec 19, 2023 at 12:26:45PM +0800, Yan Zhao wrote: > > > On Mon, Dec 18, 2023 at 07:08:51AM -0800, Sean Christopherson wrote: > > > > > > Implementation Consideration > > > > > > === > > > > > > There is a previous series [1] from google to serve the same purpose to > > > > > > let KVM be aware of virtio GPU's noncoherent DMA status. That series > > > > > > requires a new memslot flag, and special memslots in user space. > > > > > > > > > > > > We don't choose to use memslot flag to request honoring guest memory > > > > > > type. > > > > > > > > > > memslot flag has the potential to restrict the impact e.g. when using > > > > > clflush-before-read in migration? > > > > > > > > Yep, exactly. E.g. if KVM needs to ensure coherency when freeing memory back to > > > > the host kernel, then the memslot flag will allow for a much more targeted > > > > operation. > > > > > > > > > Of course the implication is to honor guest type only for the selected slot > > > > > in KVM instead of applying to the entire guest memory as in previous series > > > > > (which selects this way because vmx_get_mt_mask() is in perf-critical path > > > > > hence not good to check memslot flag?) > > > > > > > > Checking a memslot flag won't impact performance. KVM already has the memslot > > > > when creating SPTEs, e.g. the sole caller of vmx_get_mt_mask(), make_spte(), has > > > > access to the memslot. > > > > > > > > That isn't coincidental, KVM _must_ have the memslot to construct the SPTE, e.g. > > > > to retrieve the associated PFN, update write-tracking for shadow pages, etc. > > > > > > > Hi Sean, > > > Do you prefer to introduce a memslot flag KVM_MEM_DMA or KVM_MEM_WC? > > > For KVM_MEM_DMA, KVM needs to > > > (a) search VMA for vma->vm_page_prot and convert it to page cache mode (with > > > pgprot2cachemode()? ), or > > > (b) look up memtype of the PFN, by calling lookup_memtype(), similar to that in > > > pat_pfn_immune_to_uc_mtrr(). > > > > > > But pgprot2cachemode() and lookup_memtype() are not exported by x86 code now. > > > > > > For KVM_MEM_WC, it requires user to ensure the memory is actually mapped > > > to WC, right? > > > > > > Then, vmx_get_mt_mask() just ignores guest PAT and programs host PAT as EPT type > > > for the special memslot only, as below. > > > Is this understanding correct? 
> > > > > > static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) > > > { > > > if (is_mmio) > > > return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT; > > > > > > if (gfn_in_dma_slot(vcpu->kvm, gfn)) { > > > u8 type = MTRR_TYPE_WRCOMB; > > > //u8 type = pat_pfn_memtype(pfn); > > > return (type << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT; > > > } > > > > > > if (!kvm_arch_has_noncoherent_dma(vcpu->kvm)) > > > return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT; > > > > > > if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) { > > > if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED)) > > > return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT; > > > else > > > return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) | > > > VMX_EPT_IPAT_BIT; > > > } > > > > > > return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT; > > > } > > > > > > BTW, since the special memslot must be exposed to guest as virtio GPU BAR in > > > order to prevent other guest drivers from access, I wonder if it's better to > > > include some keyword like VIRTIO_GPU_BAR in memslot flag name. > > Another choice is to add a memslot flag KVM_MEM_HONOR_GUEST_PAT, then user > > (e.g. QEMU) does special treatment to this kind of memslots (e.g. skipping > > reading/writing to them in general paths). > > > > @@ -7589,26 +7589,29 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) > > if (is_mmio) > > return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT; > > > > + if (in_slot_honor_guest_pat(vcpu->kvm, gfn)) > > + return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT; > > This is more along the lines of what I was thinking, though the name should be > something like KVM_MEM_NON_COHERENT_DMA, i.e. not x86 specific and not contradictory > for AMD (which already honors guest PAT). > > I also vote to deliberately ignore MTRRs, i.e. start us on the path of ripping > those out. This is a new feature, so we have the luxury of defining KVM's ABI > for that feature, i.e. can state that on x86 it honors guest PAT, but not MTRRs. > > Like so? Yes, this looks good to me. Will refine the patch in the recommended way. Only one remaining question from me is if we could allow user to set the flag to every slot. If user is allowed to set the flag to system ram slot, it may cause problem, since guest drivers (other than the paravirt one) may have chance to get allocated memory in this ram slot and make its PAT setting take effect, which is not desired. Do we need to find a way to limit the eligible slots? e.g. check VMAs of the slot to ensure its vm_flags include VM_IO and VM_PFNMAP. Or, just trust user? > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > index d21f55f323ea..ed527acb2bd3 100644 > --- a/arch/x86/kvm/vmx/vmx.c > +++ b/arch/x86/kvm/vmx/vmx.c > @@ -7575,7 +7575,8 @@ static int vmx_vm_init(struct kvm *kvm) > return 0; > } > > -static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) > +static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio, > + struct kvm_memory_slot *slot) > { > /* We wanted to honor guest CD/MTRR/PAT, but doing so could result in > * memory aliases with conflicting memory types and sometimes MCEs. 
> @@ -7598,6 +7599,9 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) > if (is_mmio) > return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT; > > + if (kvm_memslot_has_non_coherent_dma(slot)) > + return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT; > + > if (!kvm_arch_has_noncoherent_dma(vcpu->kvm)) > return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT; > > I like the idea of pulling the memtype from the host, but if we can make that > work then I don't see the need for a special memslot flag, i.e. just do it for > *all* SPTEs on VMX. I don't think we need a VMA for that, e.g. we should be able > to get the memtype from the host PTEs, just like we do the page size. Right. Besides, after reading host PTEs, we still need to decode the bits to memtype in platform specific way and convert the type to EPT type. > > KVM_MEM_WC is a hard "no" for me. It's far too x86 centric, and as you alluded > to, it requires coordination from the guest, i.e. is effectively limited to > paravirt scenarios. Got it! > > > + > > if (!kvm_arch_has_noncoherent_dma(vcpu->kvm)) > > return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT; > > > > if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) { > > if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED)) > > return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT; > > else > > return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) | > > VMX_EPT_IPAT_BIT; > > } > > > > return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT; > > }
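As a rough sketch of the decoding mentioned above (mapping the host PAT
cache mode of a pfn to the MTRR type that VMX encodes into the EPT leaf),
assuming lookup_memtype() were exported, which, as noted earlier in the
thread, it currently is not:

#include <linux/kvm_host.h>
#include <linux/pfn.h>			/* PFN_PHYS() */
#include <asm/mtrr.h>			/* MTRR_TYPE_* */
#include <asm/pgtable_types.h>		/* _PAGE_CACHE_MODE_* */

/*
 * Illustrative sketch: convert the host PAT cache mode of a pfn into the
 * MTRR memory type used for the EPT leaf. Relies on lookup_memtype(),
 * which x86 does not export today.
 */
static u8 host_pfn_to_ept_memtype(kvm_pfn_t pfn)
{
	switch (lookup_memtype(PFN_PHYS(pfn))) {
	case _PAGE_CACHE_MODE_WC:
		return MTRR_TYPE_WRCOMB;
	case _PAGE_CACHE_MODE_WT:
		return MTRR_TYPE_WRTHROUGH;
	case _PAGE_CACHE_MODE_WP:
		return MTRR_TYPE_WRPROT;
	case _PAGE_CACHE_MODE_UC:
	case _PAGE_CACHE_MODE_UC_MINUS:
		return MTRR_TYPE_UNCACHABLE;
	case _PAGE_CACHE_MODE_WB:
	default:
		return MTRR_TYPE_WRBACK;
	}
}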
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index b07247b0b958..9f7223d298a2 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -44,6 +44,7 @@ config KVM
 	select KVM_XFER_TO_GUEST_WORK
 	select KVM_GENERIC_DIRTYLOG_READ_PROTECT
 	select KVM_VFIO
+	select KVM_VIRTIO
 	select INTERVAL_TREE
 	select HAVE_KVM_PM_NOTIFIER if PM
 	select KVM_GENERIC_HARDWARE_ENABLING
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index b1f92a0edc35..7f57fa902a0a 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1394,6 +1394,9 @@ struct kvm_device_attr {
 #define  KVM_DEV_VFIO_GROUP_DEL			KVM_DEV_VFIO_FILE_DEL
 #define  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE	3
 
+#define KVM_DEV_VIRTIO_NONCOHERENT		1
+#define  KVM_DEV_VIRTIO_NONCOHERENT_SET		1
+
 enum kvm_device_type {
 	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
 #define KVM_DEV_TYPE_FSL_MPIC_20	KVM_DEV_TYPE_FSL_MPIC_20
@@ -1417,6 +1420,8 @@ enum kvm_device_type {
 #define KVM_DEV_TYPE_ARM_PV_TIME	KVM_DEV_TYPE_ARM_PV_TIME
 	KVM_DEV_TYPE_RISCV_AIA,
 #define KVM_DEV_TYPE_RISCV_AIA		KVM_DEV_TYPE_RISCV_AIA
+	KVM_DEV_TYPE_VIRTIO,
+#define KVM_DEV_TYPE_VIRTIO		KVM_DEV_TYPE_VIRTIO
 	KVM_DEV_TYPE_MAX,
 };
 
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 6793211a0b64..5dcba034593c 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -56,6 +56,9 @@ config HAVE_KVM_CPU_RELAX_INTERCEPT
 config KVM_VFIO
 	bool
 
+config KVM_VIRTIO
+	bool
+
 config HAVE_KVM_INVALID_WAKEUPS
 	bool
 
diff --git a/virt/kvm/Makefile.kvm b/virt/kvm/Makefile.kvm
index 724c89af78af..614c4c4fe50e 100644
--- a/virt/kvm/Makefile.kvm
+++ b/virt/kvm/Makefile.kvm
@@ -7,6 +7,7 @@ KVM ?= ../../../virt/kvm
 
 kvm-y := $(KVM)/kvm_main.o $(KVM)/eventfd.o $(KVM)/binary_stats.o
 kvm-$(CONFIG_KVM_VFIO) += $(KVM)/vfio.o
+kvm-$(CONFIG_KVM_VIRTIO) += $(KVM)/virtio.o
 kvm-$(CONFIG_KVM_MMIO) += $(KVM)/coalesced_mmio.o
 kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
 kvm-$(CONFIG_HAVE_KVM_IRQ_ROUTING) += $(KVM)/irqchip.o
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index acd67fb40183..dca2040368da 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -61,6 +61,7 @@
 #include "async_pf.h"
 #include "kvm_mm.h"
 #include "vfio.h"
+#include "virtio.h"
 
 #include <trace/events/ipi.h>
 
@@ -6453,6 +6454,10 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
 	if (WARN_ON_ONCE(r))
 		goto err_vfio;
 
+	r = kvm_virtio_ops_init();
+	if (WARN_ON_ONCE(r))
+		goto err_virtio;
+
 	kvm_gmem_init(module);
 
 	/*
@@ -6468,6 +6473,8 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
 	return 0;
 
 err_register:
+	kvm_virtio_ops_exit();
+err_virtio:
 	kvm_vfio_ops_exit();
 err_vfio:
 	kvm_async_pf_deinit();
@@ -6503,6 +6510,7 @@ void kvm_exit(void)
 		free_cpumask_var(per_cpu(cpu_kick_mask, cpu));
 	kmem_cache_destroy(kvm_vcpu_cache);
 	kvm_vfio_ops_exit();
+	kvm_virtio_ops_exit();
 	kvm_async_pf_deinit();
 #ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING
 	unregister_syscore_ops(&kvm_syscore_ops);
diff --git a/virt/kvm/virtio.c b/virt/kvm/virtio.c
new file mode 100644
index 000000000000..dbb25d9784c5
--- /dev/null
+++ b/virt/kvm/virtio.c
@@ -0,0 +1,121 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * KVM-VIRTIO device
+ *
+ */
+#include <linux/kvm_host.h>
+#include <linux/errno.h>
+#include <linux/mutex.h>
+#include <linux/slab.h>
+#include "virtio.h"
+
+struct kvm_virtio {
+	struct mutex lock;
+	bool noncoherent;
+};
+
+static int kvm_virtio_set_noncoherent(struct kvm_device *dev, long attr,
+				      void __user *arg)
+{
+	struct kvm_virtio *kv = dev->private;
+
+	/*
+	 * Currently, only set to noncoherent is allowed, and therefore a virtio
+	 * device is not allowed to switch back to coherent once it's set to
+	 * noncoherent.
+	 * User arg is also not checked as the attr name has indicated that the
+	 * purpose is to set to noncoherent.
+	 */
+	if (attr != KVM_DEV_VIRTIO_NONCOHERENT_SET)
+		return -ENXIO;
+
+	mutex_lock(&kv->lock);
+	if (kv->noncoherent)
+		goto out;
+
+	kv->noncoherent = true;
+	kvm_arch_register_noncoherent_dma(dev->kvm);
+out:
+	mutex_unlock(&kv->lock);
+	return 0;
+}
+
+static int kvm_virtio_set_attr(struct kvm_device *dev,
+			       struct kvm_device_attr *attr)
+{
+	switch (attr->group) {
+	case KVM_DEV_VIRTIO_NONCOHERENT:
+		return kvm_virtio_set_noncoherent(dev, attr->attr,
+						  u64_to_user_ptr(attr->addr));
+	}
+
+	return -ENXIO;
+}
+
+static int kvm_virtio_has_attr(struct kvm_device *dev,
+			       struct kvm_device_attr *attr)
+{
+	switch (attr->group) {
+	case KVM_DEV_VIRTIO_NONCOHERENT:
+		switch (attr->attr) {
+		case KVM_DEV_VIRTIO_NONCOHERENT_SET:
+			return 0;
+		}
+
+		break;
+	}
+
+	return -ENXIO;
+}
+
+static void kvm_virtio_release(struct kvm_device *dev)
+{
+	struct kvm_virtio *kv = dev->private;
+
+	if (kv->noncoherent)
+		kvm_arch_unregister_noncoherent_dma(dev->kvm);
+	kfree(kv);
+	kfree(dev);	/* alloc by kvm_ioctl_create_device, free by .release */
+}
+
+static int kvm_virtio_create(struct kvm_device *dev, u32 type);
+
+static struct kvm_device_ops kvm_virtio_ops = {
+	.name = "kvm-virtio",
+	.create = kvm_virtio_create,
+	.release = kvm_virtio_release,
+	.set_attr = kvm_virtio_set_attr,
+	.has_attr = kvm_virtio_has_attr,
+};
+
+static int kvm_virtio_create(struct kvm_device *dev, u32 type)
+{
+	struct kvm_virtio *kv;
+
+	if (type != KVM_DEV_TYPE_VIRTIO)
+		return -ENODEV;
+
+	/*
+	 * This kvm_virtio device is created per virtio device.
+	 * Its default noncoherent state is false.
+	 */
+	kv = kzalloc(sizeof(*kv), GFP_KERNEL_ACCOUNT);
+	if (!kv)
+		return -ENOMEM;
+
+	mutex_init(&kv->lock);
+
+	dev->private = kv;
+
+	return 0;
+}
+
+int kvm_virtio_ops_init(void)
+{
+	return kvm_register_device_ops(&kvm_virtio_ops, KVM_DEV_TYPE_VIRTIO);
+}
+
+void kvm_virtio_ops_exit(void)
+{
+	kvm_unregister_device_ops(KVM_DEV_TYPE_VIRTIO);
+}
diff --git a/virt/kvm/virtio.h b/virt/kvm/virtio.h
new file mode 100644
index 000000000000..0353398ee1f2
--- /dev/null
+++ b/virt/kvm/virtio.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVM_VIRTIO_H
+#define __KVM_VIRTIO_H
+
+#ifdef CONFIG_KVM_VIRTIO
+int kvm_virtio_ops_init(void);
+void kvm_virtio_ops_exit(void);
+#else
+static inline int kvm_virtio_ops_init(void)
+{
+	return 0;
+}
+static inline void kvm_virtio_ops_exit(void)
+{
+}
+#endif
+
+#endif