Message ID: <20240105091237.24577-1-yan.y.zhao@intel.com>
Headers:
From: Yan Zhao <yan.y.zhao@intel.com>
To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org
Cc: pbonzini@redhat.com, seanjc@google.com, olvaffe@gmail.com, kevin.tian@intel.com,
    zhiyuan.lv@intel.com, zhenyu.z.wang@intel.com, yongwei.ma@intel.com,
    vkuznets@redhat.com, wanpengli@tencent.com, jmattson@google.com, joro@8bytes.org,
    gurchetansingh@chromium.org, kraxel@redhat.com, zzyiwei@google.com,
    ankita@nvidia.com, jgg@nvidia.com, alex.williamson@redhat.com, maz@kernel.org,
    oliver.upton@linux.dev, james.morse@arm.com, suzuki.poulose@arm.com,
    yuzenghui@huawei.com, Yan Zhao <yan.y.zhao@intel.com>
Subject: [PATCH 0/4] KVM: Honor guest memory types for virtio GPU devices
Date: Fri, 5 Jan 2024 17:12:37 +0800
Message-Id: <20240105091237.24577-1-yan.y.zhao@intel.com>
Series: KVM: Honor guest memory types for virtio GPU devices
Message
Yan Zhao
Jan. 5, 2024, 9:12 a.m. UTC
This series allows user space to notify KVM of noncoherent DMA status so
that KVM can honor guest memory types in specified memory slot ranges.

Motivation
===
A virtio GPU device may want to configure GPU hardware to work in
noncoherent mode, i.e. some of its DMAs do not snoop CPU caches. This is
generally done for performance: on certain platforms, GFX performance can
improve by 20+% with DMAs going through the noncoherent path.

This noncoherent DMA mode works in the following sequence:
1. The host backend driver programs the hardware not to snoop memory of
   the target DMA buffer.
2. The host backend driver instructs the guest frontend driver to program
   the guest PAT to WC for the target DMA buffer.
3. The guest frontend driver writes to the DMA buffer without clflush.
4. The hardware does noncoherent DMA to the target buffer.

In this mode, both the guest and the hardware regard the DMA buffer as
uncached. So, if KVM forces the effective memory type of this DMA buffer
to WB, hardware DMA may read incorrect data and cause miscellaneous
failures.

Therefore we introduce a new memslot flag, KVM_MEM_NON_COHERENT_DMA, to
allow user space to convey noncoherent DMA status at memslot granularity.
Platforms that do not always honor guest memory types can choose to honor
them in the ranges of memslots with KVM_MEM_NON_COHERENT_DMA set.

Security
===
The biggest concern with KVM honoring the guest's memory types is the
page aliasing issue. On Intel platforms:
- For host MMIO, KVM VMX always programs the EPT memory type to UC (which
  overrides all guest PAT types except WC); this is unchanged by this
  series.
- For host non-MMIO pages:
  * The virtio guest frontend and host backend drivers should be synced
    to use the same memory type to map a buffer. Otherwise memory data
    may be incorrect, but this only impacts the buggy guest alone.
  * For live migration, user space can skip reading/writing memory
    corresponding to memslots with flag KVM_MEM_NON_COHERENT_DMA, or do
    special handling during memory reads/writes.

Implementation
===
Unlike the previous RFC series [1], which used a new KVM VIRTIO device to
convey noncoherent DMA status, this version introduces a new memslot
flag, similar to the series from Google at [2]. The difference is that
[2] increases the noncoherent DMA count to ask KVM VMX to honor guest
memory types for all guest memory as a whole, while this series only asks
KVM to honor guest memory types in the specified memslots.

The reason for not introducing a KVM cap or a flag that toggles
noncoherent DMA state as a whole is mainly the page aliasing issue
mentioned above. If guest memory types are only honored in limited
memslots, user space can do special handling before/after accessing guest
memory belonging to those memslots.

A virtio GPU will usually create memslots that are mapped into guest
device BARs:
- The guest device driver syncs with the host side to use the same memory
  type to access those memslots.
- No other guest component has access to the memory in those memslots,
  since it is mapped as device BARs.

So, by adding flag KVM_MEM_NON_COHERENT_DMA to memslots specific to
virtio GPUs and asking KVM to honor guest memory types only in those
memslots, the page aliasing issue can be avoided easily.

This series doesn't limit which memslots are eligible for flag
KVM_MEM_NON_COHERENT_DMA. If the user sets this flag on memslots for
guest system RAM, the page aliasing issue may be hit during live
migration or other use cases where the host wants to access guest memory
with a different memory type, due to the lack of coordination between
non-enlightened guest components and the host, just as when noncoherent
DMA devices are assigned through VFIO. But as this will not impact other
VMs, we choose to trust the user and let the user do mitigations when it
has to set this flag on memslots for guest system RAM.

Note: we also noticed series [3], which tries to fix a similar problem on
ARM for VFIO device passthrough. The difference is that [3] addresses
guest memory types not being honored for pass-through device MMIO on ARM
(which is not a problem for x86 VMX), while this series addresses guest
memory types not being honored in non-host-MMIO ranges for virtio GPUs on
x86 VMX.

Changelog:
RFC --> v1:
- Switch to a memslot flag to convey non-coherent DMA info (Sean, Kevin)
- Do not honor guest MTRRs in memslots with flag KVM_MEM_NON_COHERENT_DMA
  (Sean)

[1]: https://lore.kernel.org/all/20231214103520.7198-1-yan.y.zhao@intel.com/
[2]: https://patchwork.kernel.org/project/dri-devel/cover/20200213213036.207625-1-olvaffe@gmail.com/
[3]: https://lore.kernel.org/all/20231221154002.32622-1-ankita@nvidia.com/

Yan Zhao (4):
  KVM: Introduce a new memslot flag KVM_MEM_NON_COHERENT_DMA
  KVM: x86: Add a new param "slot" to op get_mt_mask in kvm_x86_ops
  KVM: VMX: Honor guest PATs for memslots of flag KVM_MEM_NON_COHERENT_DMA
  KVM: selftests: Set KVM_MEM_NON_COHERENT_DMA as a supported memslot flag

 arch/x86/include/asm/kvm_host.h                      | 3 ++-
 arch/x86/kvm/mmu/spte.c                              | 3 ++-
 arch/x86/kvm/vmx/vmx.c                               | 6 +++++-
 include/uapi/linux/kvm.h                             | 2 ++
 tools/testing/selftests/kvm/set_memory_region_test.c | 3 +++
 virt/kvm/kvm_main.c                                  | 8 ++++++--
 6 files changed, 20 insertions(+), 5 deletions(-)

base-commit: 8ed26ab8d59111c2f7b86d200d1eb97d2a458fd1
Comments
On Fri, Jan 05, 2024 at 05:12:37PM +0800, Yan Zhao wrote:
> This series allow user space to notify KVM of noncoherent DMA status so as
> to let KVM honor guest memory types in specified memory slot ranges.
>
> Motivation
> ===
> A virtio GPU device may want to configure GPU hardware to work in
> noncoherent mode, i.e. some of its DMAs do not snoop CPU caches.

Does this mean some DMA reads do not snoop the caches or does it
include DMA writes not synchronizing the caches too?

> This is generally for performance consideration.
> In certain platform, GFX performance can improve 20+% with DMAs going to
> noncoherent path.
>
> This noncoherent DMA mode works in below sequence:
> 1. Host backend driver programs hardware not to snoop memory of target
> DMA buffer.
> 2. Host backend driver indicates guest frontend driver to program guest PAT
> to WC for target DMA buffer.
> 3. Guest frontend driver writes to the DMA buffer without clflush stuffs.
> 4. Hardware does noncoherent DMA to the target buffer.
>
> In this noncoherent DMA mode, both guest and hardware regard a DMA buffer
> as not cached. So, if KVM forces the effective memory type of this DMA
> buffer to be WB, hardware DMA may read incorrect data and cause misc
> failures.

I don't know all the details, but a big concern would be that the
caches remain fully coherent with the underlying memory at any point
where kvm decides to revoke the page from the VM.

If you allow an incoherence of cache != physical then it opens a
security attack where the observed content of memory can change when
it should not.

ARM64 has issues like this and due to that ARM has to have explict,
expensive, cache flushing at certain points.

Jason
On Fri, Jan 05, 2024 at 03:55:51PM -0400, Jason Gunthorpe wrote:
> On Fri, Jan 05, 2024 at 05:12:37PM +0800, Yan Zhao wrote:
> > This series allow user space to notify KVM of noncoherent DMA status so as
> > to let KVM honor guest memory types in specified memory slot ranges.
> >
> > Motivation
> > ===
> > A virtio GPU device may want to configure GPU hardware to work in
> > noncoherent mode, i.e. some of its DMAs do not snoop CPU caches.
>
> Does this mean some DMA reads do not snoop the caches or does it
> include DMA writes not synchronizing the caches too?
Both DMA reads and writes are not snooped.
The virtio host side will mmap the buffer to WC (pgprot_writecombine)
for CPU access and program the device to access the buffer in uncached
way. Meanwhile, virtio host side will construct a memslot in KVM with
the PTR returned from the mmap, and notify virtio guest side to mmap the
same buffer in guest page table with PAT=WC, too.

>
> > This is generally for performance consideration.
> > In certain platform, GFX performance can improve 20+% with DMAs going to
> > noncoherent path.
> >
> > This noncoherent DMA mode works in below sequence:
> > 1. Host backend driver programs hardware not to snoop memory of target
> > DMA buffer.
> > 2. Host backend driver indicates guest frontend driver to program guest PAT
> > to WC for target DMA buffer.
> > 3. Guest frontend driver writes to the DMA buffer without clflush stuffs.
> > 4. Hardware does noncoherent DMA to the target buffer.
> >
> > In this noncoherent DMA mode, both guest and hardware regard a DMA buffer
> > as not cached. So, if KVM forces the effective memory type of this DMA
> > buffer to be WB, hardware DMA may read incorrect data and cause misc
> > failures.
>
> I don't know all the details, but a big concern would be that the
> caches remain fully coherent with the underlying memory at any point
> where kvm decides to revoke the page from the VM.
Ah, you mean, for page migration, the content of the page may not be copied
correctly, right?

Currently in x86, we have 2 ways to let KVM honor guest memory types:
1. through KVM memslot flag introduced in this series, for virtio GPUs, in
   memslot granularity.
2. through increasing noncoherent dma count, as what's done in VFIO, for
   Intel GPU passthrough, for all guest memory.

This page migration issue should not be the case for virtio GPU, as both host
and guest are synced to use the same memory type and actually the pages
are not anonymous pages.

For GPU pass-through, though host mmaps with WB, it's still fine for guest to
use WC because page migration on pages of VMs with pass-through device is not
allowed.

But I agree, this should be a case if user space sets the memslot flag to
honor guest memory type to memslots for guest system RAM where
non-enlightened guest components may cause guest and host to access with
different memory types. Or simply when the guest is a malicious one.

> If you allow an incoherence of cache != physical then it opens a
> security attack where the observed content of memory can change when
> it should not.

In this case, will this security attack impact other guests?

>
> ARM64 has issues like this and due to that ARM has to have explict,
> expensive, cache flushing at certain points.
>
On Mon, Jan 08, 2024 at 02:02:57PM +0800, Yan Zhao wrote:
> On Fri, Jan 05, 2024 at 03:55:51PM -0400, Jason Gunthorpe wrote:
> > On Fri, Jan 05, 2024 at 05:12:37PM +0800, Yan Zhao wrote:
> > > This series allow user space to notify KVM of noncoherent DMA status so as
> > > to let KVM honor guest memory types in specified memory slot ranges.
> > >
> > > Motivation
> > > ===
> > > A virtio GPU device may want to configure GPU hardware to work in
> > > noncoherent mode, i.e. some of its DMAs do not snoop CPU caches.
> >
> > Does this mean some DMA reads do not snoop the caches or does it
> > include DMA writes not synchronizing the caches too?
> Both DMA reads and writes are not snooped.

Oh that sounds really dangerous.

> > > This is generally for performance consideration.
> > > In certain platform, GFX performance can improve 20+% with DMAs going to
> > > noncoherent path.
> > >
> > > This noncoherent DMA mode works in below sequence:
> > > 1. Host backend driver programs hardware not to snoop memory of target
> > > DMA buffer.
> > > 2. Host backend driver indicates guest frontend driver to program guest PAT
> > > to WC for target DMA buffer.
> > > 3. Guest frontend driver writes to the DMA buffer without clflush stuffs.
> > > 4. Hardware does noncoherent DMA to the target buffer.
> > >
> > > In this noncoherent DMA mode, both guest and hardware regard a DMA buffer
> > > as not cached. So, if KVM forces the effective memory type of this DMA
> > > buffer to be WB, hardware DMA may read incorrect data and cause misc
> > > failures.
> >
> > I don't know all the details, but a big concern would be that the
> > caches remain fully coherent with the underlying memory at any point
> > where kvm decides to revoke the page from the VM.
> Ah, you mean, for page migration, the content of the page may not be copied
> correctly, right?

Not just migration. Any point where KVM revokes the page from the
VM. Ie just tearing down the VM still has to make the cache coherent
with physical or there may be problems.

> Currently in x86, we have 2 ways to let KVM honor guest memory types:
> 1. through KVM memslot flag introduced in this series, for virtio GPUs, in
>    memslot granularity.
> 2. through increasing noncoherent dma count, as what's done in VFIO, for
>    Intel GPU passthrough, for all guest memory.

And where does all this fixup the coherency problem?

> This page migration issue should not be the case for virtio GPU, as both host
> and guest are synced to use the same memory type and actually the pages
> are not anonymous pages.

The guest isn't required to do this so it can force the cache to
become incoherent.

> > If you allow an incoherence of cache != physical then it opens a
> > security attack where the observed content of memory can change when
> > it should not.
>
> In this case, will this security attack impact other guests?

It impacts the hypervisor potentially. It depends..

Jason
On Mon, Jan 08, 2024 at 10:02:50AM -0400, Jason Gunthorpe wrote:
> On Mon, Jan 08, 2024 at 02:02:57PM +0800, Yan Zhao wrote:
> > On Fri, Jan 05, 2024 at 03:55:51PM -0400, Jason Gunthorpe wrote:
> > > On Fri, Jan 05, 2024 at 05:12:37PM +0800, Yan Zhao wrote:
> > > > This series allow user space to notify KVM of noncoherent DMA status so as
> > > > to let KVM honor guest memory types in specified memory slot ranges.
> > > >
> > > > Motivation
> > > > ===
> > > > A virtio GPU device may want to configure GPU hardware to work in
> > > > noncoherent mode, i.e. some of its DMAs do not snoop CPU caches.
> > >
> > > Does this mean some DMA reads do not snoop the caches or does it
> > > include DMA writes not synchronizing the caches too?
> > Both DMA reads and writes are not snooped.
>
> Oh that sounds really dangerous.

So if this is an issue then we might already have a problem, because with
many devices it's entirely up to the device programming whether the i/o is
snooping or not. So the moment you pass such a device to a guest, whether
there's explicit support for non-coherent or not, you have a problem.

_If_ there is a fundamental problem. I'm not sure of that, because my
assumption was that at most the guest shoots itself and the data
corruption doesn't go any further the moment the hypervisor does the
dma/iommu unmapping.

Also, there's a pile of x86 devices where this very much applies, x86
being dma-coherent is not really the true ground story.

Cheers, Sima

> > > > This is generally for performance consideration.
> > > > In certain platform, GFX performance can improve 20+% with DMAs going to
> > > > noncoherent path.
> > > >
> > > > This noncoherent DMA mode works in below sequence:
> > > > 1. Host backend driver programs hardware not to snoop memory of target
> > > > DMA buffer.
> > > > 2. Host backend driver indicates guest frontend driver to program guest PAT
> > > > to WC for target DMA buffer.
> > > > 3. Guest frontend driver writes to the DMA buffer without clflush stuffs.
> > > > 4. Hardware does noncoherent DMA to the target buffer.
> > > >
> > > > In this noncoherent DMA mode, both guest and hardware regard a DMA buffer
> > > > as not cached. So, if KVM forces the effective memory type of this DMA
> > > > buffer to be WB, hardware DMA may read incorrect data and cause misc
> > > > failures.
> > >
> > > I don't know all the details, but a big concern would be that the
> > > caches remain fully coherent with the underlying memory at any point
> > > where kvm decides to revoke the page from the VM.
> > Ah, you mean, for page migration, the content of the page may not be copied
> > correctly, right?
>
> Not just migration. Any point where KVM revokes the page from the
> VM. Ie just tearing down the VM still has to make the cache coherent
> with physical or there may be problems.
>
> > Currently in x86, we have 2 ways to let KVM honor guest memory types:
> > 1. through KVM memslot flag introduced in this series, for virtio GPUs, in
> >    memslot granularity.
> > 2. through increasing noncoherent dma count, as what's done in VFIO, for
> >    Intel GPU passthrough, for all guest memory.
>
> And where does all this fixup the coherency problem?
>
> > This page migration issue should not be the case for virtio GPU, as both host
> > and guest are synced to use the same memory type and actually the pages
> > are not anonymous pages.
>
> The guest isn't required to do this so it can force the cache to
> become incoherent.
>
> > > If you allow an incoherence of cache != physical then it opens a
> > > security attack where the observed content of memory can change when
> > > it should not.
> >
> > In this case, will this security attack impact other guests?
>
> It impacts the hypervisor potentially. It depends..
>
> Jason
On Mon, Jan 08, 2024 at 04:25:02PM +0100, Daniel Vetter wrote:
> On Mon, Jan 08, 2024 at 10:02:50AM -0400, Jason Gunthorpe wrote:
> > On Mon, Jan 08, 2024 at 02:02:57PM +0800, Yan Zhao wrote:
> > > On Fri, Jan 05, 2024 at 03:55:51PM -0400, Jason Gunthorpe wrote:
> > > > On Fri, Jan 05, 2024 at 05:12:37PM +0800, Yan Zhao wrote:
> > > > > This series allow user space to notify KVM of noncoherent DMA status so as
> > > > > to let KVM honor guest memory types in specified memory slot ranges.
> > > > >
> > > > > Motivation
> > > > > ===
> > > > > A virtio GPU device may want to configure GPU hardware to work in
> > > > > noncoherent mode, i.e. some of its DMAs do not snoop CPU caches.
> > > >
> > > > Does this mean some DMA reads do not snoop the caches or does it
> > > > include DMA writes not synchronizing the caches too?
> > > Both DMA reads and writes are not snooped.
> >
> > Oh that sounds really dangerous.
>
> So if this is an issue then we might already have a problem, because with
> many devices it's entirely up to the device programming whether the i/o is
> snooping or not. So the moment you pass such a device to a guest, whether
> there's explicit support for non-coherent or not, you have a
> problem.

No, the iommus (except Intel and only for Intel integrated GPU, IIRC)
prohibit the use of non-coherent DMA entirely from a VM. Eg AMD systems
100% block non-coherent DMA in VMs at the iommu level.

> _If_ there is a fundamental problem. I'm not sure of that, because my
> assumption was that at most the guest shoots itself and the data
> corruption doesn't go any further the moment the hypervisor does the
> dma/iommu unmapping.

Who fixes the cache on the unmapping? I didn't see anything..

Jason
On Mon, Jan 08, 2024 at 10:02:50AM -0400, Jason Gunthorpe wrote:
> On Mon, Jan 08, 2024 at 02:02:57PM +0800, Yan Zhao wrote:
> > On Fri, Jan 05, 2024 at 03:55:51PM -0400, Jason Gunthorpe wrote:
> > > On Fri, Jan 05, 2024 at 05:12:37PM +0800, Yan Zhao wrote:
> > > > This series allow user space to notify KVM of noncoherent DMA status so as
> > > > to let KVM honor guest memory types in specified memory slot ranges.
> > > >
> > > > Motivation
> > > > ===
> > > > A virtio GPU device may want to configure GPU hardware to work in
> > > > noncoherent mode, i.e. some of its DMAs do not snoop CPU caches.
> > >
> > > Does this mean some DMA reads do not snoop the caches or does it
> > > include DMA writes not synchronizing the caches too?
> > Both DMA reads and writes are not snooped.
>
> Oh that sounds really dangerous.
>
But the IOMMU for Intel GPU does not do force-snoop, no matter KVM
honors guest memory type or not.

> > > > This is generally for performance consideration.
> > > > In certain platform, GFX performance can improve 20+% with DMAs going to
> > > > noncoherent path.
> > > >
> > > > This noncoherent DMA mode works in below sequence:
> > > > 1. Host backend driver programs hardware not to snoop memory of target
> > > > DMA buffer.
> > > > 2. Host backend driver indicates guest frontend driver to program guest PAT
> > > > to WC for target DMA buffer.
> > > > 3. Guest frontend driver writes to the DMA buffer without clflush stuffs.
> > > > 4. Hardware does noncoherent DMA to the target buffer.
> > > >
> > > > In this noncoherent DMA mode, both guest and hardware regard a DMA buffer
> > > > as not cached. So, if KVM forces the effective memory type of this DMA
> > > > buffer to be WB, hardware DMA may read incorrect data and cause misc
> > > > failures.
> > >
> > > I don't know all the details, but a big concern would be that the
> > > caches remain fully coherent with the underlying memory at any point
> > > where kvm decides to revoke the page from the VM.
> > Ah, you mean, for page migration, the content of the page may not be copied
> > correctly, right?
>
> Not just migration. Any point where KVM revokes the page from the
> VM. Ie just tearing down the VM still has to make the cache coherent
> with physical or there may be problems.
Not sure what's the mentioned problem during KVM revoking.
In host,
- If the memory type is WB, as the case in intel GPU passthrough,
  the mismatch can only happen when guest memory type is UC/WC/WT/WP, all
  stronger than WB.
  So, even after KVM revoking the page, the host will not get delayed
  data from cache.
- If the memory type is WC, as the case in virtio GPU, after KVM revoking
  the page, the page is still hold in the virtio host side.
  Even though a incooperative guest can cause wrong data in the page,
  the guest can achieve the purpose in a more straight-forward way, i.e.
  writing a wrong data directly to the page.
  So, I don't see the problem in this case too.

> > Currently in x86, we have 2 ways to let KVM honor guest memory types:
> > 1. through KVM memslot flag introduced in this series, for virtio GPUs, in
> >    memslot granularity.
> > 2. through increasing noncoherent dma count, as what's done in VFIO, for
> >    Intel GPU passthrough, for all guest memory.
>
> And where does all this fixup the coherency problem?
>
> > This page migration issue should not be the case for virtio GPU, as both host
> > and guest are synced to use the same memory type and actually the pages
> > are not anonymous pages.
>
> The guest isn't required to do this so it can force the cache to
> become incoherent.
>
> > > If you allow an incoherence of cache != physical then it opens a
> > > security attack where the observed content of memory can change when
> > > it should not.
> >
> > In this case, will this security attack impact other guests?
>
> It impacts the hypervisor potentially. It depends..
Could you elaborate more on how it will impact hypervisor?
We can try to fix it if it's really a case.
On Tue, Jan 09, 2024 at 07:36:22AM +0800, Yan Zhao wrote:
> On Mon, Jan 08, 2024 at 10:02:50AM -0400, Jason Gunthorpe wrote:
> > On Mon, Jan 08, 2024 at 02:02:57PM +0800, Yan Zhao wrote:
> > > On Fri, Jan 05, 2024 at 03:55:51PM -0400, Jason Gunthorpe wrote:
> > > > On Fri, Jan 05, 2024 at 05:12:37PM +0800, Yan Zhao wrote:
> > > > > This series allow user space to notify KVM of noncoherent DMA status so as
> > > > > to let KVM honor guest memory types in specified memory slot ranges.
> > > > >
> > > > > Motivation
> > > > > ===
> > > > > A virtio GPU device may want to configure GPU hardware to work in
> > > > > noncoherent mode, i.e. some of its DMAs do not snoop CPU caches.
> > > >
> > > > Does this mean some DMA reads do not snoop the caches or does it
> > > > include DMA writes not synchronizing the caches too?
> > > Both DMA reads and writes are not snooped.
> >
> > Oh that sounds really dangerous.
> >
> But the IOMMU for Intel GPU does not do force-snoop, no matter KVM
> honors guest memory type or not.

Yes, I know. Sounds dangerous!

> > Not just migration. Any point where KVM revokes the page from the
> > VM. Ie just tearing down the VM still has to make the cache coherent
> > with physical or there may be problems.
> Not sure what's the mentioned problem during KVM revoking.
> In host,
> - If the memory type is WB, as the case in intel GPU passthrough,
>   the mismatch can only happen when guest memory type is UC/WC/WT/WP, all
>   stronger than WB.
>   So, even after KVM revoking the page, the host will not get delayed
>   data from cache.
> - If the memory type is WC, as the case in virtio GPU, after KVM revoking
>   the page, the page is still hold in the virtio host side.
>   Even though a incooperative guest can cause wrong data in the page,
>   the guest can achieve the purpose in a more straight-forward way, i.e.
>   writing a wrong data directly to the page.
>   So, I don't see the problem in this case too.

You can't let cache incoherent memory leak back into the hypervisor
for other uses or who knows what can happen. In many cases something
will zero the page and you can probably reliably argue that will make
the cache coherent, but there are still all sorts of cases where pages
are write protected and then used in the hypervisor context. Eg page
out or something where the incoherence is a big problem.

eg RAID parity and mirror calculations become at-rist of
malfunction. Storage CRCs stop working reliably, etc, etc.

It is certainly a big enough problem that a generic KVM switch to
allow incoherence should be trated with alot of skepticism. You can't
argue that the only use of the generic switch will be with GPUs that
exclude all the troublesome cases!

> > > In this case, will this security attack impact other guests?
> >
> > It impacts the hypervisor potentially. It depends..
> Could you elaborate more on how it will impact hypervisor?
> We can try to fix it if it's really a case.

Well, for instance, when you install pages into the KVM the hypervisor
will have taken kernel memory, then zero'd it with cachable writes,
however the VM can read it incoherently with DMA and access the
pre-zero'd data since the zero'd writes potentially hasn't left the
cache. That is an information leakage exploit.

Who knows what else you can get up to if you are creative. The whole
security model assumes there is only one view of memory, not two.

Jason
On Mon, Jan 08, 2024 at 08:22:20PM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 09, 2024 at 07:36:22AM +0800, Yan Zhao wrote:
> > On Mon, Jan 08, 2024 at 10:02:50AM -0400, Jason Gunthorpe wrote:
> > > On Mon, Jan 08, 2024 at 02:02:57PM +0800, Yan Zhao wrote:
> > > > On Fri, Jan 05, 2024 at 03:55:51PM -0400, Jason Gunthorpe wrote:
> > > > > On Fri, Jan 05, 2024 at 05:12:37PM +0800, Yan Zhao wrote:
> > > > > > This series allows user space to notify KVM of noncoherent DMA
> > > > > > status so as to let KVM honor guest memory types in specified
> > > > > > memory slot ranges.
> > > > > >
> > > > > > Motivation
> > > > > > ===
> > > > > > A virtio GPU device may want to configure GPU hardware to work
> > > > > > in noncoherent mode, i.e. some of its DMAs do not snoop CPU
> > > > > > caches.
> > > > >
> > > > > Does this mean some DMA reads do not snoop the caches, or does it
> > > > > include DMA writes not synchronizing the caches too?
> > > >
> > > > Both DMA reads and writes are not snooped.
> > >
> > > Oh, that sounds really dangerous.
> >
> > But the IOMMU for the Intel GPU does not do force-snoop, whether or
> > not KVM honors the guest memory type.
>
> Yes, I know. Sounds dangerous!
>
> > > Not just migration. Any point where KVM revokes the page from the
> > > VM. Ie just tearing down the VM still has to make the cache coherent
> > > with physical or there may be problems.
> >
> > Not sure what the mentioned problem is during KVM revoking.
> > In the host:
> > - If the memory type is WB, as in the Intel GPU passthrough case,
> >   the mismatch can only happen when the guest memory type is
> >   UC/WC/WT/WP, all stronger than WB.
> >   So, even after KVM revokes the page, the host will not get delayed
> >   data from the cache.
> > - If the memory type is WC, as in the virtio GPU case, after KVM
> >   revokes the page, the page is still held on the virtio host side.
> >   Even though an uncooperative guest can cause wrong data in the page,
> >   the guest can achieve the same purpose in a more straightforward
> >   way, i.e. by writing wrong data directly to the page.
> > So, I don't see a problem in this case either.
>
> You can't let cache-incoherent memory leak back into the hypervisor
> for other uses, or who knows what can happen. In many cases something
> will zero the page, and you can probably reliably argue that will make
> the cache coherent, but there are still all sorts of cases where pages
> are write-protected and then used in the hypervisor context. Eg page
> out or something where the incoherence is a big problem.
>
> Eg RAID parity and mirror calculations become at risk of
> malfunction. Storage CRCs stop working reliably, etc, etc.
>
> It is certainly a big enough problem that a generic KVM switch to
> allow incoherence should be treated with a lot of skepticism. You
> can't argue that the only use of the generic switch will be with GPUs
> that exclude all the troublesome cases!
>

You are right. It's safer with only one view of memory.
But even if something zeroes the page, as long as that happens before
the page is returned to the host, it looks like the impact is
constrained to the VM scope? E.g. for a write-protected page, the
hypervisor cannot rely on the page content being correct or as expected.

For virtio GPU's use case, do you think a better way is for KVM to pull
the memory type from the host page table for the specified memslot?

But for noncoherent DMA device passthrough, we can't pull the host
memory type, because we rely on the guest device driver to do cache
flushes properly, and if the guest device driver thinks a memory range
is uncached while it's effectively cached, the device cannot work
properly.

> > > > In this case, will this security attack impact other guests?
> > >
> > > It impacts the hypervisor potentially. It depends..
> >
> > Could you elaborate more on how it will impact the hypervisor?
> > We can try to fix it if it's really a case.
>
> Well, for instance, when you install pages into KVM the hypervisor
> will have taken kernel memory, then zeroed it with cacheable writes;
> however, the VM can read it incoherently with DMA and access the
> pre-zeroed data, since the zeroing writes potentially haven't left the
> cache. That is an information-leakage exploit.

This makes sense.
How about KVM doing a cache flush before installing/revoking the page
if the guest memory type is honored?

> Who knows what else you can get up to if you are creative. The whole
> security model assumes there is only one view of memory, not two.
>
On Tue, Jan 09, 2024 at 10:11:23AM +0800, Yan Zhao wrote:
> > Well, for instance, when you install pages into KVM the hypervisor
> > will have taken kernel memory, then zeroed it with cacheable writes;
> > however, the VM can read it incoherently with DMA and access the
> > pre-zeroed data, since the zeroing writes potentially haven't left
> > the cache. That is an information-leakage exploit.
>
> This makes sense.
> How about KVM doing a cache flush before installing/revoking the
> page if the guest memory type is honored?

I think if you are going to allow the guest to bypass the cache in any
way then KVM should fully flush the cache before allowing the guest to
access memory, and it should fully flush the cache after removing memory
from the guest.

Noting that fully removing the memory now includes VFIO too, which is
going to be very hard to coordinate between KVM and VFIO.

ARM has the hooks for most of this in the common code already, so it
should not be outrageous to do, but slow I suspect.

Jason
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, January 16, 2024 12:31 AM
>
> On Tue, Jan 09, 2024 at 10:11:23AM +0800, Yan Zhao wrote:
>
> > > Well, for instance, when you install pages into KVM the hypervisor
> > > will have taken kernel memory, then zeroed it with cacheable
> > > writes; however, the VM can read it incoherently with DMA and
> > > access the pre-zeroed data, since the zeroing writes potentially
> > > haven't left the cache. That is an information-leakage exploit.
> >
> > This makes sense.
> > How about KVM doing a cache flush before installing/revoking the
> > page if the guest memory type is honored?
>
> I think if you are going to allow the guest to bypass the cache in any
> way then KVM should fully flush the cache before allowing the guest to
> access memory, and it should fully flush the cache after removing
> memory from the guest.

For GPU passthrough, can we rely on the fact that the entire guest
memory is pinned, so the only occurrence of removing memory is when
killing the guest, at which point the pages will be zeroed by mm before
their next use? Then we just need to flush the cache before the first
guest run to avoid an information leak.

Yes, it's a more complex issue if we allow the guest to bypass the
cache in a configuration that mixes host mm activities on guest pages
at run-time.

> Noting that fully removing the memory now includes VFIO too, which is
> going to be very hard to coordinate between KVM and VFIO.

If we are only talking about GPU passthrough, do we still need such
coordination?

> ARM has the hooks for most of this in the common code already, so it
> should not be outrageous to do, but slow I suspect.
>
> Jason
> From: Tian, Kevin
> Sent: Tuesday, January 16, 2024 8:46 AM
>
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, January 16, 2024 12:31 AM
> >
> > On Tue, Jan 09, 2024 at 10:11:23AM +0800, Yan Zhao wrote:
> >
> > > > Well, for instance, when you install pages into KVM the
> > > > hypervisor will have taken kernel memory, then zeroed it with
> > > > cacheable writes; however, the VM can read it incoherently with
> > > > DMA and access the pre-zeroed data, since the zeroing writes
> > > > potentially haven't left the cache. That is an information-leakage
> > > > exploit.
> > >
> > > This makes sense.
> > > How about KVM doing a cache flush before installing/revoking the
> > > page if the guest memory type is honored?
> >
> > I think if you are going to allow the guest to bypass the cache in
> > any way then KVM should fully flush the cache before allowing the
> > guest to access memory, and it should fully flush the cache after
> > removing memory from the guest.
>
> For GPU passthrough, can we rely on the fact that the entire guest
> memory is pinned, so the only occurrence of removing memory is when
> killing the guest, at which point the pages will be zeroed by mm before
> their next use? Then we just need to flush the cache before the first
> guest run to avoid an information leak.

Just checked your past comments. If there is no guarantee that the
removed pages will be zeroed before their next use, then yes, the cache
has to be flushed after the page is removed from the guest. :/

> Yes, it's a more complex issue if we allow the guest to bypass the
> cache in a configuration that mixes host mm activities on guest pages
> at run-time.
>
> > Noting that fully removing the memory now includes VFIO too, which is
> > going to be very hard to coordinate between KVM and VFIO.
>

Probably we could just handle the cache flush in IOMMUFD or VFIO type1
map/unmap, which is the gate for allowing/denying non-coherent DMAs to
specific pages.
On Tue, Jan 16, 2024 at 04:05:08AM +0000, Tian, Kevin wrote:
> > From: Tian, Kevin
> > Sent: Tuesday, January 16, 2024 8:46 AM
> >
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Tuesday, January 16, 2024 12:31 AM
> > >
> > > On Tue, Jan 09, 2024 at 10:11:23AM +0800, Yan Zhao wrote:
> > >
> > > > > Well, for instance, when you install pages into KVM the
> > > > > hypervisor will have taken kernel memory, then zeroed it with
> > > > > cacheable writes; however, the VM can read it incoherently with
> > > > > DMA and access the pre-zeroed data, since the zeroing writes
> > > > > potentially haven't left the cache. That is an
> > > > > information-leakage exploit.
> > > >
> > > > This makes sense.
> > > > How about KVM doing a cache flush before installing/revoking the
> > > > page if the guest memory type is honored?
> > >
> > > I think if you are going to allow the guest to bypass the cache in
> > > any way then KVM should fully flush the cache before allowing the
> > > guest to access memory, and it should fully flush the cache after
> > > removing memory from the guest.
> >
> > For GPU passthrough, can we rely on the fact that the entire guest
> > memory is pinned, so the only occurrence of removing memory is when
> > killing the guest, at which point the pages will be zeroed by mm
> > before their next use? Then we just need to flush the cache before
> > the first guest run to avoid an information leak.
>
> Just checked your past comments. If there is no guarantee that the
> removed pages will be zeroed before their next use, then yes, the
> cache has to be flushed after the page is removed from the guest. :/

Next use may include things like swap to disk or live migration of the
VM, so it isn't quite so simple in the general case.

> > > Noting that fully removing the memory now includes VFIO too, which
> > > is going to be very hard to coordinate between KVM and VFIO.
>
> Probably we could just handle the cache flush in IOMMUFD or VFIO type1
> map/unmap, which is the gate for allowing/denying non-coherent DMAs to
> specific pages.

Maybe, and on live-migrate DMA stop..

Jason