Message ID | 20240203002343.383056-5-seanjc@google.com
---|---
State | New

Headers
Date: Fri, 2 Feb 2024 16:23:43 -0800
From: Sean Christopherson <seanjc@google.com>
To: Sean Christopherson <seanjc@google.com>, Paolo Bonzini <pbonzini@redhat.com>
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Jim Mattson <jmattson@google.com>, Mingwei Zhang <mizhang@google.com>
Subject: [PATCH v2 4/4] KVM: x86/mmu: Fix a *very* theoretical race in kvm_mmu_track_write()
Message-ID: <20240203002343.383056-5-seanjc@google.com>
In-Reply-To: <20240203002343.383056-1-seanjc@google.com>
References: <20240203002343.383056-1-seanjc@google.com>
List-Id: <linux-kernel.vger.kernel.org>
Series
KVM: x86/mmu: Clean up indirect_shadow_pages usage
Commit Message
Sean Christopherson
Feb. 3, 2024, 12:23 a.m. UTC
Add full memory barriers in kvm_mmu_track_write() and account_shadowed()
to plug a (very, very theoretical) race where kvm_mmu_track_write() could
miss a 0->1 transition of indirect_shadow_pages and fail to zap relevant,
*stale* SPTEs.
Without the barriers, because modern x86 CPUs allow (per the SDM):

  Reads may be reordered with older writes to different locations but not
  with older writes to the same location.

it's (again, super theoretically) possible that the following could happen
(in terms of values being visible/resolved):
  CPU0                              CPU1
  read memory[gfn] (=Y)
                                    memory[gfn] Y=>X
                                    read indirect_shadow_pages (=0)
  indirect_shadow_pages 0=>1
or conversely:
  CPU0                              CPU1
  indirect_shadow_pages 0=>1
                                    read indirect_shadow_pages (=0)
  read memory[gfn] (=Y)
                                    memory[gfn] Y=>X
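
For illustration only (this is not part of the patch), the above is the classic
store-buffering pattern. Below is a minimal userspace sketch, using the
hypothetical stand-ins gpte and nr_indirect for memory[gfn] and
indirect_shadow_pages; it would need to run in a tight loop for the forbidden
outcome to have any realistic chance of showing up, and uncommenting the fences
(the userspace analogue of smp_mb()) rules it out:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int gpte = 1;        /* stands in for memory[gfn], initially Y */
static atomic_int nr_indirect = 0; /* stands in for indirect_shadow_pages    */

static int tracker_saw_shadow, faulter_saw_new_gpte;

/* Models kvm_mmu_track_write(): write the PTE, then check the count. */
static void *tracker(void *arg)
{
        (void)arg;
        atomic_store_explicit(&gpte, 2, memory_order_relaxed);       /* Y=>X */
        /* atomic_thread_fence(memory_order_seq_cst); */        /* smp_mb() */
        tracker_saw_shadow =
                atomic_load_explicit(&nr_indirect, memory_order_relaxed);
        return NULL;
}

/* Models account_shadowed() + FNAME(gpte_changed): bump count, re-read PTE. */
static void *faulter(void *arg)
{
        (void)arg;
        atomic_store_explicit(&nr_indirect, 1, memory_order_relaxed); /* 0=>1 */
        /* atomic_thread_fence(memory_order_seq_cst); */         /* smp_mb() */
        faulter_saw_new_gpte =
                atomic_load_explicit(&gpte, memory_order_relaxed) == 2;
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;

        pthread_create(&t1, NULL, tracker, NULL);
        pthread_create(&t2, NULL, faulter, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        /* Bad outcome: tracker skips the zap AND faulter uses the stale PTE. */
        if (!tracker_saw_shadow && !faulter_saw_new_gpte)
                printf("stale SPTE would survive\n");
        return 0;
}
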
In practice, this bug is likely benign as both the 0=>1 transition and
reordering of this scope are extremely rare occurrences.
Note, if the cost of the barrier (which is simply a locked ADD, see commit
450cbdd0125c ("locking/x86: Use LOCK ADD for smp_mb() instead of MFENCE"))
is problematic, KVM could avoid the barrier by bailing earlier when
kvm_memslots_have_rmaps() is false (sketched after the diffstat below). But
the odds of the barrier being problematic are extremely low, *and* the odds
of the extra checks being meaningfully faster overall are also low.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/mmu/mmu.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)
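
For reference, the early-bail alternative mentioned in the commit message,
which the patch deliberately does not take, would presumably look something
like the following in kvm_mmu_track_write(); a sketch only, not proposed code:

void kvm_mmu_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
                         int bytes)
{
        /*
         * Hypothetical early bail: a VM that has never allocated rmaps cannot
         * have indirect shadow pages, so the barrier could be skipped.  The
         * patch opts for the unconditional smp_mb() instead.
         */
        if (!kvm_memslots_have_rmaps(vcpu->kvm))
                return;

        smp_mb();
        if (!vcpu->kvm->arch.indirect_shadow_pages)
                return;

        /* ... the existing zap-under-mmu_lock path continues here ... */
}
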
Comments
On 2/3/24 01:23, Sean Christopherson wrote:
> Add full memory barriers in kvm_mmu_track_write() and account_shadowed()
> to plug a (very, very theoretical) race where kvm_mmu_track_write() could
> miss a 0->1 transition of indirect_shadow_pages and fail to zap relevant,
> *stale* SPTEs.

Ok, so we have

  emulator_write_phys
    overwrite PTE
    kvm_page_track_write
      kvm_mmu_track_write
        // memory barrier missing here
        if (indirect_shadow_pages)
          zap();

and on the other side

  FNAME(page_fault)
    FNAME(fetch)
      kvm_mmu_get_child_sp
        kvm_mmu_get_shadow_page
          __kvm_mmu_get_shadow_page
            kvm_mmu_alloc_shadow_page
              account_shadowed
                indirect shadow pages++
                // memory barrier missing here
      if (FNAME(gpte_changed)) // reads PTE
        goto out

If you can weave something like this in the commit message the sequence
would be a bit clearer.

> In practice, this bug is likely benign as both the 0=>1 transition and
> reordering of this scope are extremely rare occurrences.

I wouldn't call it benign, it's more that it's unobservable in practice
but the race is real. However...

> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 3c193b096b45..86b85060534d 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -830,6 +830,14 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
>         struct kvm_memory_slot *slot;
>         gfn_t gfn;
>
> +       /*
> +        * Ensure indirect_shadow_pages is elevated prior to re-reading guest
> +        * child PTEs in FNAME(gpte_changed), i.e. guarantee either in-flight
> +        * emulated writes are visible before re-reading guest PTEs, or that
> +        * an emulated write will see the elevated count and acquire mmu_lock
> +        * to update SPTEs. Pairs with the smp_mb() in kvm_mmu_track_write().
> +        */
> +       smp_mb();

... this memory barrier needs to be after the increment (the desired
ordering is store-before-read).

Paolo

>         kvm->arch.indirect_shadow_pages++;
>         gfn = sp->gfn;
>         slots = kvm_memslots_for_spte_role(kvm, sp->role);
> @@ -5747,10 +5755,15 @@ void kvm_mmu_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
>         bool flush = false;
>
>         /*
> -        * If we don't have indirect shadow pages, it means no page is
> -        * write-protected, so we can exit simply.
> +        * When emulating guest writes, ensure the written value is visible to
> +        * any task that is handling page faults before checking whether or not
> +        * KVM is shadowing a guest PTE. This ensures either KVM will create
> +        * the correct SPTE in the page fault handler, or this task will see
> +        * a non-zero indirect_shadow_pages. Pairs with the smp_mb() in
> +        * account_shadowed().
>          */
> -       if (!READ_ONCE(vcpu->kvm->arch.indirect_shadow_pages))
> +       smp_mb();
> +       if (!vcpu->kvm->arch.indirect_shadow_pages)
>                 return;
>
>         write_lock(&vcpu->kvm->mmu_lock);

On Fri, Feb 23, 2024, Paolo Bonzini wrote:
> On 2/3/24 01:23, Sean Christopherson wrote:
> > Add full memory barriers in kvm_mmu_track_write() and account_shadowed()
> > to plug a (very, very theoretical) race where kvm_mmu_track_write() could
> > miss a 0->1 transition of indirect_shadow_pages and fail to zap relevant,
> > *stale* SPTEs.
>
> Ok, so we have
>
>   emulator_write_phys
>     overwrite PTE
>     kvm_page_track_write
>       kvm_mmu_track_write
>         // memory barrier missing here
>         if (indirect_shadow_pages)
>           zap();
>
> and on the other side
>
>   FNAME(page_fault)
>     FNAME(fetch)
>       kvm_mmu_get_child_sp
>         kvm_mmu_get_shadow_page
>           __kvm_mmu_get_shadow_page
>             kvm_mmu_alloc_shadow_page
>               account_shadowed
>                 indirect shadow pages++
>                 // memory barrier missing here
>       if (FNAME(gpte_changed)) // reads PTE
>         goto out
>
> If you can weave something like this in the commit message the sequence
> would be a bit clearer.

Roger that.

> > In practice, this bug is likely benign as both the 0=>1 transition and
> > reordering of this scope are extremely rare occurrences.
>
> I wouldn't call it benign, it's more that it's unobservable in practice but
> the race is real. However...
>
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 3c193b096b45..86b85060534d 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -830,6 +830,14 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
> >         struct kvm_memory_slot *slot;
> >         gfn_t gfn;
> > +       /*
> > +        * Ensure indirect_shadow_pages is elevated prior to re-reading guest
> > +        * child PTEs in FNAME(gpte_changed), i.e. guarantee either in-flight
> > +        * emulated writes are visible before re-reading guest PTEs, or that
> > +        * an emulated write will see the elevated count and acquire mmu_lock
> > +        * to update SPTEs. Pairs with the smp_mb() in kvm_mmu_track_write().
> > +        */
> > +       smp_mb();
>
> ... this memory barrier needs to be after the increment (the desired
> ordering is store-before-read).

Doh.

I'll post a fixed version as a one-off v3.

Thanks!
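
For clarity, the ordering Paolo is asking for, the increment first with the
full barrier after it, would presumably end up looking like this in
account_shadowed(); a sketch of the expected fix, not the posted v3:

static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
{
        struct kvm_memslots *slots;
        struct kvm_memory_slot *slot;
        gfn_t gfn;

        kvm->arch.indirect_shadow_pages++;

        /*
         * Order the 0=>1 transition before the re-reads of guest child PTEs
         * in FNAME(gpte_changed); the required ordering is store-before-read,
         * so the barrier must come after the increment.  Pairs with the
         * smp_mb() in kvm_mmu_track_write().
         */
        smp_mb();

        gfn = sp->gfn;
        slots = kvm_memslots_for_spte_role(kvm, sp->role);
        /* ... rest of the function unchanged ... */
}
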
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3c193b096b45..86b85060534d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -830,6 +830,14 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
        struct kvm_memory_slot *slot;
        gfn_t gfn;
 
+       /*
+        * Ensure indirect_shadow_pages is elevated prior to re-reading guest
+        * child PTEs in FNAME(gpte_changed), i.e. guarantee either in-flight
+        * emulated writes are visible before re-reading guest PTEs, or that
+        * an emulated write will see the elevated count and acquire mmu_lock
+        * to update SPTEs. Pairs with the smp_mb() in kvm_mmu_track_write().
+        */
+       smp_mb();
        kvm->arch.indirect_shadow_pages++;
        gfn = sp->gfn;
        slots = kvm_memslots_for_spte_role(kvm, sp->role);
@@ -5747,10 +5755,15 @@ void kvm_mmu_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
        bool flush = false;
 
        /*
-        * If we don't have indirect shadow pages, it means no page is
-        * write-protected, so we can exit simply.
+        * When emulating guest writes, ensure the written value is visible to
+        * any task that is handling page faults before checking whether or not
+        * KVM is shadowing a guest PTE. This ensures either KVM will create
+        * the correct SPTE in the page fault handler, or this task will see
+        * a non-zero indirect_shadow_pages. Pairs with the smp_mb() in
+        * account_shadowed().
         */
-       if (!READ_ONCE(vcpu->kvm->arch.indirect_shadow_pages))
+       smp_mb();
+       if (!vcpu->kvm->arch.indirect_shadow_pages)
                return;
 
        write_lock(&vcpu->kvm->mmu_lock);