[v4,3/4] KVM: x86/mmu: Move slot checks from __kvm_faultin_pfn() to kvm_faultin_pfn()
Message ID: 20240209222858.396696-4-seanjc@google.com
State: New
Series: KVM: x86/mmu: Pre-check for mmu_notifier retry
Commit Message
Sean Christopherson
Feb. 9, 2024, 10:28 p.m. UTC
Move the checks related to the validity of an access to a memslot from the
inner __kvm_faultin_pfn() to its sole caller, kvm_faultin_pfn(). This
allows emulating accesses to the APIC access page, which don't need to
resolve a pfn, even if there is a relevant in-progress mmu_notifier
invalidation. Ditto for accesses to KVM internal memslots from L2, which
KVM also treats as emulated MMIO.
More importantly, this will allow for future cleanup by having the
"no memslot" case bail from kvm_faultin_pfn() very early on.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/mmu/mmu.c | 62 ++++++++++++++++++++++--------------------
1 file changed, 33 insertions(+), 29 deletions(-)
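At a glance, the ordering of checks that kvm_faultin_pfn() ends up with after
this patch looks roughly like the sketch below. This is a condensed
illustration assembled from the diff in this thread, not the literal patched
function; error handling, the extra parameters of the real function, and the
RET_PF_CONTINUE plumbing are elided.

/*
 * Condensed sketch of the post-patch kvm_faultin_pfn() check ordering
 * (assembled from the diff below; details of the real function are elided).
 */
static int kvm_faultin_pfn_sketch(struct kvm_vcpu *vcpu,
				  struct kvm_page_fault *fault)
{
	struct kvm_memory_slot *slot = fault->slot;

	fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
	smp_rmb();

	/* No memslot: skip straight to the pfn path. */
	if (!slot)
		goto faultin_pfn;

	/* Memslot is being deleted or moved: retry so stale SPTEs are zapped first. */
	if (slot->flags & KVM_MEMSLOT_INVALID)
		return RET_PF_RETRY;

	if (!kvm_is_visible_memslot(slot)) {
		/* Don't expose KVM's internal memslots to L2. */
		if (is_guest_mode(vcpu)) {
			fault->slot = NULL;
			fault->pfn = KVM_PFN_NOSLOT;
			fault->map_writable = false;
			return RET_PF_CONTINUE;
		}

		/* APIC access page exists but APICv is disabled: emulate directly. */
		if (slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT &&
		    !kvm_apicv_activated(vcpu->kvm))
			return RET_PF_EMULATE;
	}

	/* Pre-check for an in-progress mmu_notifier invalidation. */
	if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn))
		return RET_PF_RETRY;

faultin_pfn:
	return __kvm_faultin_pfn(vcpu, fault);
}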
Comments
On Fri, Feb 09, 2024 at 02:28:57PM -0800, Sean Christopherson wrote:
> Move the checks related to the validity of an access to a memslot from the
> inner __kvm_faultin_pfn() to its sole caller, kvm_faultin_pfn(). This
> allows emulating accesses to the APIC access page, which don't need to
> resolve a pfn, even if there is a relevant in-progress mmu_notifier
> invalidation. Ditto for accesses to KVM internal memslots from L2, which
> KVM also treats as emulated MMIO.
>
> More importantly, this will allow for future cleanup by having the
> "no memslot" case bail from kvm_faultin_pfn() very early on.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
>
> [...]
>
> @@ -4406,6 +4379,37 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>  	fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
>  	smp_rmb();
>
> +	if (!slot)
> +		goto faultin_pfn;
> +
> +	/*
> +	 * Retry the page fault if the gfn hit a memslot that is being deleted
> +	 * or moved. This ensures any existing SPTEs for the old memslot will
> +	 * be zapped before KVM inserts a new MMIO SPTE for the gfn.
> +	 */
> +	if (slot->flags & KVM_MEMSLOT_INVALID)
> +		return RET_PF_RETRY;
> +
> +	if (!kvm_is_visible_memslot(slot)) {
> +		/* Don't expose KVM's internal memslots to L2. */
> +		if (is_guest_mode(vcpu)) {
> +			fault->slot = NULL;
> +			fault->pfn = KVM_PFN_NOSLOT;
> +			fault->map_writable = false;
> +			return RET_PF_CONTINUE;

Call kvm_handle_noslot_fault() to replace returning RET_PF_CONTINUE?
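For reference, a minimal sketch of what that suggestion might look like.
kvm_handle_noslot_fault() is an existing mmu.c helper named in the thread,
but the exact signature and the ACC_ALL access argument used here are
assumptions for illustration, not code from this series.

	/* Don't expose KVM's internal memslots to L2. */
	if (is_guest_mode(vcpu)) {
		/*
		 * Sketch of the suggestion: route through the existing noslot
		 * helper instead of open-coding the KVM_PFN_NOSLOT setup and
		 * returning RET_PF_CONTINUE.  The ACC_ALL access argument is
		 * an assumption, not taken from this series.
		 */
		return kvm_handle_noslot_fault(vcpu, fault, ACC_ALL);
	}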
+Jim

On Mon, Feb 19, 2024, Yan Zhao wrote:
> On Fri, Feb 09, 2024 at 02:28:57PM -0800, Sean Christopherson wrote:
> > +	if (!kvm_is_visible_memslot(slot)) {
> > +		/* Don't expose KVM's internal memslots to L2. */
> > +		if (is_guest_mode(vcpu)) {
> > +			fault->slot = NULL;
> > +			fault->pfn = KVM_PFN_NOSLOT;
> > +			fault->map_writable = false;
> > +			return RET_PF_CONTINUE;
> Call kvm_handle_noslot_fault() to replace returning RET_PF_CONTINUE?

Oof. Yes. But there is a pre-existing bug here too, though it's very
theoretical and unlikely to ever cause problems.

If KVM is using TDP, but L1 is using shadow paging for L2, then routing
through kvm_handle_noslot_fault() will incorrectly cache the gfn as MMIO,
and create an MMIO SPTE. Creating an MMIO SPTE is ok, but only because
kvm_mmu_page_role.guest_mode ensures KVM uses different roots for L1 vs. L2.
But mmio_gfn will remain valid, and could (quite theoretically) cause KVM to
incorrectly treat an L1 access to the private TSS or identity mapped page
tables as MMIO.

Furthermore, this check doesn't actually prevent exposing KVM's internal
memslots to L2, it simply forces KVM to emulate the access. In most cases,
that will trigger MMIO, amusingly due to filling arch.mmio_gfn. And
vcpu_is_mmio_gpa() always treats APIC accesses as MMIO, so those are fine.
But the private TSS and identity mapped page tables could go either way
(MMIO or access the private memslot's backing memory).

We could "fix" the issue by forcing MMIO emulation for L2 access to all
internal memslots, not just to the APIC. But I think that's actually less
correct than letting L2 access the private TSS and identity mapped page
tables (not to mention that I can't imagine anyone cares what KVM does in
this case). From L1's perspective, there is R/W memory at those memslots,
the memory just happens to be initialized with non-zero data, and I don't
see a good argument for hiding that memory from L2. Making the memory
disappear is far more magical than the memory existing in the first place.

The APIC access page is special because KVM _must_ emulate the access to do
the right thing. And despite what commit 3a2936dedd20 ("kvm: mmu: Don't
expose private memslots to L2") said, it's not just when L1 is accelerating
L2's virtual APIC, it's just as important (likely *more* important) for
correctness when L1 is passing through its own APIC to L2.

Unless I'm missing something, I think it makes sense to throw in the below
before moving code around.

Ouch, and looking at this further, patch 1 introduced a bug (technically) by
caching fault->slot; in this case KVM will unnecessarily check mmu_notifiers.
That's obviously a very benign bug, as a false positive just means an
unnecessary retry, but yikes.
--
Subject: [PATCH] KVM: x86/mmu: Don't force emulation of L2 accesses to
 non-APIC internal slots

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 488f522f09c6..4ce824cec5b9 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4341,8 +4341,18 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	if (slot && (slot->flags & KVM_MEMSLOT_INVALID))
 		return RET_PF_RETRY;
 
-	if (!kvm_is_visible_memslot(slot)) {
-		/* Don't expose private memslots to L2. */
+	if (slot && slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT) {
+		/*
+		 * Don't map L1's APIC access page into L2, KVM doesn't support
+		 * using APICv/AVIC to accelerate L2 accesses to L1's APIC,
+		 * i.e. the access needs to be emulated. Emulating access to
+		 * L1's APIC is also correct if L1 is accelerating L2's own
+		 * virtual APIC, but for some reason L1 also maps _L1's_ APIC
+		 * into L2. Note, vcpu_is_mmio_gpa() always treats access to
+		 * the APIC as MMIO. Allow an MMIO SPTE to be created, as KVM
+		 * uses different roots for L1 vs. L2, i.e. there is no danger
+		 * of breaking APICv/AVIC for L1.
+		 */
 		if (is_guest_mode(vcpu)) {
 			fault->slot = NULL;
 			fault->pfn = KVM_PFN_NOSLOT;
@@ -4355,8 +4365,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 		 * MMIO SPTE. That way the cache doesn't need to be purged
 		 * when the AVIC is re-enabled.
 		 */
-		if (slot && slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT &&
-		    !kvm_apicv_activated(vcpu->kvm))
+		if (!kvm_apicv_activated(vcpu->kvm))
 			return RET_PF_EMULATE;
 	}
 

base-commit: ec98c2c1a07fb341ba2230eab9a31065d12d9de6
--
On Mon, Feb 19, 2024 at 11:44:54AM -0800, Sean Christopherson wrote:
> If KVM is using TDP, but L1 is using shadow paging for L2, then routing
> through kvm_handle_noslot_fault() will incorrectly cache the gfn as MMIO,
> and create an MMIO SPTE. Creating an MMIO SPTE is ok, but only because
> kvm_mmu_page_role.guest_mode ensures KVM uses different roots for L1 vs.
> L2. But mmio_gfn will remain valid, and could (quite theoretically) cause
> KVM to incorrectly treat an L1 access to the private TSS or identity
> mapped page tables as MMIO.

Why would KVM treat an L1 access to the private TSS and identity mapped page
tables as MMIO even though mmio_gfn is valid? It looks like (on Intel
platforms) the EPT for L1 will only install normal SPTEs (non-MMIO SPTEs)
for the two private slots, so there would be no EPT misconfiguration and the
access would not go down the emulation path incorrectly. Am I missing
something?

> Furthermore, this check doesn't actually prevent exposing KVM's internal
> memslots to L2, it simply forces KVM to emulate the access. In most cases,
> that will trigger MMIO, amusingly due to filling arch.mmio_gfn. And
> vcpu_is_mmio_gpa() always treats APIC accesses as MMIO, so those are fine.
> But the private TSS and identity mapped page tables could go either way
> (MMIO or access the private memslot's backing memory).

Yes, this was also my question when reading that part of the code. A
mmio_gen mismatch may lead to accessing the backing memory directly.

> Ouch, and looking at this further, patch 1 introduced a bug (technically)
> by caching fault->slot; in this case KVM will unnecessarily check
> mmu_notifiers. That's obviously a very benign bug, as a false positive
> just means an unnecessary retry, but yikes.

Patches 3 & 4 remove the bug immediately :)

> +	if (slot && slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT) {
> [...]
>  		if (is_guest_mode(vcpu)) {
>  			fault->slot = NULL;
>  			fault->pfn = KVM_PFN_NOSLOT;

Should fault->is_private be checked before calling kvm_handle_noslot_fault()?
And do we need a centralized check of fault->is_private in
kvm_mmu_do_page_fault() before returning RET_PF_EMULATE?

> -		if (slot && slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT &&
> -		    !kvm_apicv_activated(vcpu->kvm))
> +		if (!kvm_apicv_activated(vcpu->kvm))
>  			return RET_PF_EMULATE;

Otherwise, does this path also need a fault->is_private check? Maybe also
where RET_PF_EMULATE is returned when page_fault_handle_page_track() is true
(though I know it's always false for TDX).
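To make the question concrete, here is a hypothetical sketch of the kind of
guard being asked about. The placement and the use of
kvm_mmu_prepare_memory_fault_exit() are assumptions for illustration, not
code from this series.

	/*
	 * Hypothetical illustration of the fault->is_private concern raised
	 * above (not code from this series): never send a private fault down
	 * the emulation / noslot path, report it to userspace instead.  The
	 * helper and placement here are assumptions.
	 */
	if (fault->is_private) {
		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
		return -EFAULT;
	}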
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 50bfaa53f3f2..505fc7eef533 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4333,33 +4333,6 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	struct kvm_memory_slot *slot = fault->slot;
 	bool async;
 
-	/*
-	 * Retry the page fault if the gfn hit a memslot that is being deleted
-	 * or moved. This ensures any existing SPTEs for the old memslot will
-	 * be zapped before KVM inserts a new MMIO SPTE for the gfn.
-	 */
-	if (slot && (slot->flags & KVM_MEMSLOT_INVALID))
-		return RET_PF_RETRY;
-
-	if (!kvm_is_visible_memslot(slot)) {
-		/* Don't expose private memslots to L2. */
-		if (is_guest_mode(vcpu)) {
-			fault->slot = NULL;
-			fault->pfn = KVM_PFN_NOSLOT;
-			fault->map_writable = false;
-			return RET_PF_CONTINUE;
-		}
-		/*
-		 * If the APIC access page exists but is disabled, go directly
-		 * to emulation without caching the MMIO access or creating a
-		 * MMIO SPTE. That way the cache doesn't need to be purged
-		 * when the AVIC is re-enabled.
-		 */
-		if (slot && slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT &&
-		    !kvm_apicv_activated(vcpu->kvm))
-			return RET_PF_EMULATE;
-	}
-
 	if (fault->is_private)
 		return kvm_faultin_pfn_private(vcpu, fault);
 
@@ -4406,6 +4379,37 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 	fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
 	smp_rmb();
 
+	if (!slot)
+		goto faultin_pfn;
+
+	/*
+	 * Retry the page fault if the gfn hit a memslot that is being deleted
+	 * or moved. This ensures any existing SPTEs for the old memslot will
+	 * be zapped before KVM inserts a new MMIO SPTE for the gfn.
+	 */
+	if (slot->flags & KVM_MEMSLOT_INVALID)
+		return RET_PF_RETRY;
+
+	if (!kvm_is_visible_memslot(slot)) {
+		/* Don't expose KVM's internal memslots to L2. */
+		if (is_guest_mode(vcpu)) {
+			fault->slot = NULL;
+			fault->pfn = KVM_PFN_NOSLOT;
+			fault->map_writable = false;
+			return RET_PF_CONTINUE;
+		}
+
+		/*
+		 * If the APIC access page exists but is disabled, go directly
+		 * to emulation without caching the MMIO access or creating a
+		 * MMIO SPTE. That way the cache doesn't need to be purged
+		 * when the AVIC is re-enabled.
+		 */
+		if (slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT &&
+		    !kvm_apicv_activated(vcpu->kvm))
+			return RET_PF_EMULATE;
+	}
+
 	/*
 	 * Check for a relevant mmu_notifier invalidation event before getting
 	 * the pfn from the primary MMU, and before acquiring mmu_lock.
@@ -4427,10 +4431,10 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 	 * *guaranteed* to need to retry, i.e. waiting until mmu_lock is held
 	 * to detect retry guarantees the worst case latency for the vCPU.
 	 */
-	if (!slot &&
-	    mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn))
+	if (mmu_invalidate_retry_gfn_unsafe(vcpu->kvm, fault->mmu_seq, fault->gfn))
 		return RET_PF_RETRY;
 
+faultin_pfn:
 	ret = __kvm_faultin_pfn(vcpu, fault);
 	if (ret != RET_PF_CONTINUE)
 		return ret;