From patchwork Fri Jan  5 09:12:37 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Yan Zhao <yan.y.zhao@intel.com>
X-Patchwork-Id: 18754
From: Yan Zhao <yan.y.zhao@intel.com>
To: kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	dri-devel@lists.freedesktop.org
Cc: pbonzini@redhat.com,
	seanjc@google.com,
	olvaffe@gmail.com,
	kevin.tian@intel.com,
	zhiyuan.lv@intel.com,
	zhenyu.z.wang@intel.com,
	yongwei.ma@intel.com,
	vkuznets@redhat.com,
	wanpengli@tencent.com,
	jmattson@google.com,
	joro@8bytes.org,
	gurchetansingh@chromium.org,
	kraxel@redhat.com,
	zzyiwei@google.com,
	ankita@nvidia.com,
	jgg@nvidia.com,
	alex.williamson@redhat.com,
	maz@kernel.org,
	oliver.upton@linux.dev,
	james.morse@arm.com,
	suzuki.poulose@arm.com,
	yuzenghui@huawei.com,
	Yan Zhao <yan.y.zhao@intel.com>
Subject: [PATCH 0/4] KVM: Honor guest memory types for virtio GPU devices
Date: Fri,  5 Jan 2024 17:12:37 +0800
Message-Id: <20240105091237.24577-1-yan.y.zhao@intel.com>
X-Mailer: git-send-email 2.17.1
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>

This series allows user space to notify KVM of noncoherent DMA status so
that KVM honors guest memory types in the specified memslot ranges.

Motivation
===
A virtio GPU device may want to configure GPU hardware to work in
noncoherent mode, i.e. some of its DMAs do not snoop CPU caches.
This is generally done for performance reasons.
On certain platforms, GFX performance can improve by 20+% when DMAs take
the noncoherent path.

This noncoherent DMA mode works in the following sequence:
1. The host backend driver programs the hardware not to snoop memory of
   the target DMA buffer.
2. The host backend driver tells the guest frontend driver to program the
   guest PAT to WC for the target DMA buffer.
3. The guest frontend driver writes to the DMA buffer without clflush.
4. The hardware does noncoherent DMA to the target buffer.

In this noncoherent DMA mode, both the guest and the hardware regard the
DMA buffer as not cached. So, if KVM forces the effective memory type of
this DMA buffer to WB, guest writes may linger in CPU caches, and hardware
DMA may read stale data and cause various failures.

Therefore, this series introduces a new memslot flag,
KVM_MEM_NON_COHERENT_DMA, to allow user space to convey noncoherent DMA
status at memslot granularity.
Platforms that do not always honor guest memory type can choose to honor
it in ranges of memslots with KVM_MEM_NON_COHERENT_DMA set.
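
As an illustration of the user space side, a VMM could set the flag when
registering the memslot that backs a virtio GPU buffer. This is only a
sketch under stated assumptions: the flag is proposed by this series and is
not in released uapi headers, so the bit value below is a placeholder, and
gpu_blob_region() / register_gpu_blob() are hypothetical helper names.
KVM_SET_USER_MEMORY_REGION and struct kvm_userspace_memory_region are the
existing KVM uapi.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * ASSUMPTION: placeholder bit value for illustration only. The real value
 * is whatever the series defines in include/uapi/linux/kvm.h.
 */
#ifndef KVM_MEM_NON_COHERENT_DMA
#define KVM_MEM_NON_COHERENT_DMA (1U << 3)
#endif

/* Describe a memslot for a virtio GPU buffer (hypothetical helper). */
struct kvm_userspace_memory_region
gpu_blob_region(uint32_t slot, uint64_t gpa, uint64_t size, void *hva)
{
	struct kvm_userspace_memory_region r;

	/* Guest physical address and size must be page aligned. */
	assert((gpa & 0xfff) == 0 && (size & 0xfff) == 0);

	memset(&r, 0, sizeof(r));
	r.slot = slot;
	r.flags = KVM_MEM_NON_COHERENT_DMA;	/* honor guest memory type here */
	r.guest_phys_addr = gpa;
	r.memory_size = size;
	r.userspace_addr = (uint64_t)(uintptr_t)hva;
	return r;
}

/*
 * Register the memslot on an existing VM fd.
 * Returns 0 on success, -1 with errno set on failure.
 */
int register_gpu_blob(int vm_fd, const struct kvm_userspace_memory_region *r)
{
	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, r);
}
```

Memslots for ordinary guest RAM would keep being registered without the
flag, so KVM's existing behaviour is unchanged for them.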

Security
===
The biggest concern with KVM honoring the guest's memory type is the page
aliasing issue.
On Intel platforms,
- For host MMIO, KVM VMX always programs the EPT memory type to UC (which
  overrides all guest PAT types except WC). This series does not change
  that.

- For host non-MMIO pages,
  * The virtio guest frontend and host backend drivers must agree on the
    memory type used to map a buffer. Otherwise, memory contents may be
    observed incorrectly, but this impacts only the buggy guest itself.
  * For live migration, user space can skip reading/writing memory
    corresponding to memslots with flag KVM_MEM_NON_COHERENT_DMA, or
    do some special handling during memory read/write.
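
The live migration point can be sketched as follows. This is a hedged
illustration, not code from any VMM: struct vmm_slot and both functions are
hypothetical, and the flag's bit value is a placeholder. The idea is simply
that user space already knows which memslots it flagged, so it can route
them away from its normal copy path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: placeholder bit value, see include/uapi/linux/kvm.h. */
#ifndef KVM_MEM_NON_COHERENT_DMA
#define KVM_MEM_NON_COHERENT_DMA (1U << 3)
#endif

/* Hypothetical VMM-side bookkeeping for one memslot. */
struct vmm_slot {
	uint64_t gpa;
	uint64_t size;
	uint32_t flags;
};

/* Nonzero if the slot must be skipped or handled specially. */
int slot_needs_special_handling(const struct vmm_slot *s)
{
	return (s->flags & KVM_MEM_NON_COHERENT_DMA) != 0;
}

/* Sum the bytes that the normal (WB-mapped) copy path may touch. */
uint64_t plain_copy_bytes(const struct vmm_slot *slots, size_t n)
{
	uint64_t total = 0;

	assert(slots != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (slot_needs_special_handling(&slots[i]))
			continue;	/* skipped, or handled by a separate path */
		total += slots[i].size;
	}
	return total;
}
```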

Implementation
===
Unlike the previous RFC series [1], which used a new KVM VIRTIO device to
convey noncoherent DMA status, this version introduces a new memslot flag,
similar to the series from Google at [2].
The difference is that [2] increases the noncoherent DMA count to ask KVM
VMX to honor guest memory type for all guest memory as a whole, while this
series only asks KVM to honor guest memory type in the specified
memslots.
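
To make the per-memslot behaviour concrete, here is a minimal user space
model of how the EPT memory type bits could be chosen. It is a sketch, not
the code in this series: ept_memtype_bits() is a hypothetical name, the
flag's bit value is a placeholder, and the real VMX code considers further
inputs that this model omits. The constants follow the EPT entry layout
(memory type in bits 5:3, "ignore PAT" in bit 6).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MTRR_TYPE_UNCACHABLE	0u
#define MTRR_TYPE_WRBACK	6u
#define VMX_EPT_MT_EPTE_SHIFT	3
#define VMX_EPT_IPAT_BIT	(1u << 6)

/* ASSUMPTION: placeholder bit value, see include/uapi/linux/kvm.h. */
#ifndef KVM_MEM_NON_COHERENT_DMA
#define KVM_MEM_NON_COHERENT_DMA (1U << 3)
#endif

/* The 3-bit memory type field must not overlap the IPAT bit. */
static_assert((7u << VMX_EPT_MT_EPTE_SHIFT) < VMX_EPT_IPAT_BIT,
	      "EPT memory type field overlaps IPAT");

/* Simplified model: EPT memory type bits for a page in a given memslot. */
uint8_t ept_memtype_bits(bool is_mmio, uint32_t slot_flags)
{
	/* Host MMIO: UC, unchanged by this series. */
	if (is_mmio)
		return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;

	/* Flagged memslot: WB without IPAT, so the guest PAT takes effect. */
	if (slot_flags & KVM_MEM_NON_COHERENT_DMA)
		return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;

	/* Everything else: force WB and ignore the guest PAT. */
	return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
}
```

In this model the decision depends only on the memslot being mapped, which
is what distinguishes it from a VM-wide noncoherent DMA count.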

The main reason for not introducing a KVM cap or a memslot flag that
toggles noncoherent DMA state as a whole is the page aliasing issue
mentioned above.
If guest memory type is honored only in a limited set of memslots, user
space can do special handling before/after accessing guest memory that
belongs to those memslots.

Virtio GPUs usually create memslots that are mapped into guest device
BARs.
- The guest device driver syncs with the host side to use the same memory
  type to access those memslots.
- No other guest components have access to the memory in those memslots,
  since it is mapped as device BARs.
So, by adding flag KVM_MEM_NON_COHERENT_DMA to memslots specific to virtio
GPUs and asking KVM to honor guest memory types only in those memslots,
the page aliasing issue can be avoided easily.

This series does not limit which memslots are eligible for flag
KVM_MEM_NON_COHERENT_DMA. If the user sets this flag on memslots for
guest system RAM, the page aliasing issue may be hit during live migration
or in other use cases where the host accesses guest memory with a
different memory type, because non-enlightened guest components and the
host do not coordinate. The same is true when noncoherent DMA devices are
assigned through VFIO.
But as this does not impact other VMs, we choose to trust the user and
leave mitigation to the user when the flag has to be set on memslots for
guest system RAM.

Note:
We also noticed that there is a series [3] trying to fix a similar problem
on ARM for VFIO device passthrough.
The difference is that [3] fixes the problem that guest memory types for
pass-through device MMIOs are not honored on ARM (which is not a problem
for x86 VMX), while this series addresses the problem that guest memory
types are not honored in non-host-MMIO ranges for virtio GPUs on x86
VMX.

Changelog:
RFC --> v1:
- Switch to using a memslot flag to convey non-coherent DMA info
  (Sean, Kevin)
- Do not honor guest MTRRs in memslots with flag KVM_MEM_NON_COHERENT_DMA
  (Sean)

[1]: https://lore.kernel.org/all/20231214103520.7198-1-yan.y.zhao@intel.com/
[2]: https://patchwork.kernel.org/project/dri-devel/cover/20200213213036.207625-1-olvaffe@gmail.com/
[3]: https://lore.kernel.org/all/20231221154002.32622-1-ankita@nvidia.com/


Yan Zhao (4):
  KVM: Introduce a new memslot flag KVM_MEM_NON_COHERENT_DMA
  KVM: x86: Add a new param "slot" to op get_mt_mask in kvm_x86_ops
  KVM: VMX: Honor guest PATs for memslots of flag
    KVM_MEM_NON_COHERENT_DMA
  KVM: selftests: Set KVM_MEM_NON_COHERENT_DMA as a supported memslot
    flag

 arch/x86/include/asm/kvm_host.h                      | 3 ++-
 arch/x86/kvm/mmu/spte.c                              | 3 ++-
 arch/x86/kvm/vmx/vmx.c                               | 6 +++++-
 include/uapi/linux/kvm.h                             | 2 ++
 tools/testing/selftests/kvm/set_memory_region_test.c | 3 +++
 virt/kvm/kvm_main.c                                  | 8 ++++++--
 6 files changed, 20 insertions(+), 5 deletions(-)


base-commit: 8ed26ab8d59111c2f7b86d200d1eb97d2a458fd1