Message ID | 20221117161449.114086-1-pbonzini@redhat.com |
---|---|
State | New |
Headers |
Series | KVM: x86/mmu: simplify kvm_tdp_mmu_map flow when guest has to retry |
Commit Message
Paolo Bonzini
Nov. 17, 2022, 4:14 p.m. UTC
A removed SPTE is never present, hence the "if" in kvm_tdp_mmu_map
only fails in exactly the same conditions that the earlier loop
tested in order to issue a "break". So, instead of checking the
condition twice (an upper-level SPTE could not be created, or the SPTE
was frozen), just exit the loop with a goto---the usual poor man's C
replacement for RAII early returns.

While at it, do not use the "ret" variable for the return values of
functions that do not return a RET_PF_* enum. This is clearer and
also makes it possible to initialize ret to RET_PF_RETRY.
Suggested-by: Robert Hoo <robert.hu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
arch/x86/kvm/mmu/tdp_mmu.c | 40 ++++++++++++++++++--------------------
1 file changed, 19 insertions(+), 21 deletions(-)
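
As an aside for readers less used to the idiom the commit message refers to, below is a minimal, self-contained C sketch of the goto-based single-exit pattern. This is not KVM code; all names (map_example, MAP_RETRY, step_failed) are made up for illustration, and the printf calls merely stand in for the lock/unlock bookkeeping that the real function does around its retry label.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical status codes standing in for the RET_PF_* enum. */
enum map_status { MAP_RETRY, MAP_FIXED };

/* Pretend step 2 races with another thread. */
static bool step_failed(int step) { return step == 2; }

static enum map_status map_example(void)
{
        enum map_status ret = MAP_RETRY;        /* default: caller should retry */
        int step;

        printf("lock\n");                       /* stands in for rcu_read_lock() */

        for (step = 0; step < 4; step++) {
                if (step_failed(step))
                        goto out;               /* single early exit, no duplicated check */
        }

        ret = MAP_FIXED;                        /* only reached when the walk completed */
out:
        printf("unlock\n");                     /* stands in for rcu_read_unlock() */
        return ret;
}

int main(void)
{
        return map_example() == MAP_FIXED ? 0 : 1;
}

The duplicated "unlock and return" sequences collapse into one exit path, which is what the patch does with rcu_read_unlock() and the retry label.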
Comments
On Thu, Nov 17, 2022 at 8:14 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> [...]
>
>                 if (is_shadow_present_pte(iter.old_spte))
> -                       ret = tdp_mmu_split_huge_page(kvm, &iter, sp, true);
> +                       r = tdp_mmu_split_huge_page(kvm, &iter, sp, true);
>                 else
> -                       ret = tdp_mmu_link_sp(kvm, &iter, sp, true);
> +                       r = tdp_mmu_link_sp(kvm, &iter, sp, true);
> [...]

Can this fix be squashed into [1]? It seems like a serious enough bug.
If 2 threads race to update the same PTE, KVM will return -EBUSY out
to userspace from KVM_RUN, I think. I'm not sure about QEMU, but that
would be fatal for the VM in Vanadium.

[1] https://lore.kernel.org/kvm/20221109185905.486172-3-dmatlack@google.com/
On Thu, 2022-11-17 at 11:14 -0500, Paolo Bonzini wrote:
> [...]
>       tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) {
> +             int r;
> +
>               if (fault->nx_huge_page_workaround_enabled)
>                       disallowed_hugepage_adjust(fault, iter.old_spte, iter.level);
>
And this can also be improved, I think:

    tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) {
-           if (fault->nx_huge_page_workaround_enabled)
+           if (fault->huge_page_disallowed)

In the case of !fault->exec && fault->nx_huge_page_workaround_enabled,
a huge page should still be allowed, shouldn't it?

If you agree, I can send out a patch for this. I've roughly tested it
with an ordinary guest boot, and it works normally.
On Thu, 2022-11-17 at 10:43 -0800, David Matlack wrote:
> On Thu, Nov 17, 2022 at 8:14 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
> > [...]
>
> Can this fix be squashed into [1]? It seems like a serious enough bug.
> If 2 threads race to update the same PTE, KVM will return -EBUSY out
> to userspace from KVM_RUN, I think. I'm not sure about QEMU, but that
> would be fatal for the VM in Vanadium.
>
> [1] https://lore.kernel.org/kvm/20221109185905.486172-3-dmatlack@google.com/
>
I think in your patch it's all right: before kvm_tdp_mmu_map() returns,
it must go through tdp_mmu_map_handle_target_level(), which returns a
RET_PF_* enum.
On Thu, Nov 17, 2022 at 6:01 PM Robert Hoo <robert.hu@linux.intel.com> wrote:
>
> On Thu, 2022-11-17 at 10:43 -0800, David Matlack wrote:
> > Can this fix be squashed into [1]? It seems like a serious enough bug.
> > [...]
>
> I think in your patch it's all right: before kvm_tdp_mmu_map() returns,
> it must go through tdp_mmu_map_handle_target_level(), which returns a
> RET_PF_* enum.

Ah that's right. kvm_tdp_mmu_map() won't actually return 0/-EBUSY,
because it either returns RET_PF_RETRY or goes through
tdp_mmu_map_handle_target_level().
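
To make the point above concrete, here is a hedged, compilable sketch (stand-in names only, not the actual KVM functions) of how a 0/-EBUSY result from an inner helper is consumed in a local "r" and translated into a retry, so that only RET_PF_* values ever reach the caller.

#include <errno.h>

enum ret_pf { RET_PF_RETRY, RET_PF_FIXED };

/* Stands in for tdp_mmu_link_sp()/tdp_mmu_split_huge_page(). */
static int link_sp_stub(int racing)
{
        return racing ? -EBUSY : 0;
}

static enum ret_pf map_sketch(int racing)
{
        enum ret_pf ret = RET_PF_RETRY;
        int r;

        r = link_sp_stub(racing);
        if (r)
                goto retry;                     /* -EBUSY never leaks to the caller */

        ret = RET_PF_FIXED;                     /* stands in for tdp_mmu_map_handle_target_level() */
retry:
        return ret;
}

int main(void)
{
        /* With a race the sketch reports "retry"; without, "fixed". */
        return (map_sketch(1) == RET_PF_RETRY && map_sketch(0) == RET_PF_FIXED) ? 0 : 1;
}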
On Thu, Nov 17, 2022 at 5:35 PM Robert Hoo <robert.hu@linux.intel.com> wrote:
>
> On Thu, 2022-11-17 at 11:14 -0500, Paolo Bonzini wrote:
> > +
> >         if (fault->nx_huge_page_workaround_enabled)
> >                 disallowed_hugepage_adjust(fault, iter.old_spte, iter.level);
> >
> And this can also be improved, I think:
>
>     tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) {
> -           if (fault->nx_huge_page_workaround_enabled)
> +           if (fault->huge_page_disallowed)
>
> In the case of !fault->exec && fault->nx_huge_page_workaround_enabled,
> a huge page should still be allowed, shouldn't it?
>
> If you agree, I can send out a patch for this. I've roughly tested it
> with an ordinary guest boot, and it works normally.

This check handles the case where a read or write fault occurs within
a region that has already been split due to an NX huge page. If we
recovered the NX huge page on such faults, the guest could end up
continuously faulting on the same huge page (e.g. if writing to one
page and executing from another within a GPA region backed by a huge
page). So instead, NX huge page recovery is done periodically by a
background thread.

That being said, I'm not surprised you didn't encounter any issues
when testing. Now that the TDP MMU fully splits NX huge pages on
fault, such faults should be rare at best. Perhaps even impossible?

Hm, can we drop the call to disallowed_hugepage_adjust() entirely?
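
For context on the two flags in this exchange, here is a hedged sketch; the struct and helper below are simplified stand-ins, not kvm_page_fault or kvm_mmu_hugepage_adjust themselves, although the field names match the thread. The assumption encoded here is that huge_page_disallowed is the narrower per-fault condition (execute fault with the workaround on), while nx_huge_page_workaround_enabled also covers read/write faults hitting a region that may already have been split, which is why the loop checks the broader flag and avoids the write/exec ping-pong described above.

#include <stdbool.h>

struct fault_sketch {
        bool exec;                              /* fault was an instruction fetch        */
        bool nx_huge_page_workaround_enabled;   /* NX huge page mitigation is active     */
        bool huge_page_disallowed;              /* derived: this fault must stay 4K-ish  */
};

/* Simplified stand-in for how the per-fault flag is derived. */
static void hugepage_adjust_sketch(struct fault_sketch *fault)
{
        fault->huge_page_disallowed = fault->exec &&
                                      fault->nx_huge_page_workaround_enabled;
}

int main(void)
{
        struct fault_sketch wr = { .exec = false, .nx_huge_page_workaround_enabled = true };
        struct fault_sketch ex = { .exec = true,  .nx_huge_page_workaround_enabled = true };

        hugepage_adjust_sketch(&wr);
        hugepage_adjust_sketch(&ex);

        /* A write fault may still use a huge page; an exec fault may not. */
        return (!wr.huge_page_disallowed && ex.huge_page_disallowed) ? 0 : 1;
}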
On Thu, 2022-11-17 at 20:00 -0800, David Matlack wrote:
> On Thu, Nov 17, 2022 at 5:35 PM Robert Hoo <robert.hu@linux.intel.com> wrote:
> > [...]
> > In the case of !fault->exec && fault->nx_huge_page_workaround_enabled,
> > a huge page should still be allowed, shouldn't it?
>
> This check handles the case where a read or write fault occurs within
> a region that has already been split due to an NX huge page.

When an NX huge page is split, the sub-SPTEs are installed, if my
understanding is right. So no fault should happen on the next r/w
access.

> If we recovered the NX huge page on such faults, the guest could end
> up continuously faulting on the same huge page (e.g. if writing to one
> page and executing from another within a GPA region backed by a huge
> page). So instead, NX huge page recovery is done periodically by a
> background thread.

Do you mean the kvm_nx_huge_page_recovery_worker() kthread? My
understanding is that it recycles SPs that were created by NX huge page
splits. That would cause the fault mentioned above, I guess, i.e. the
previously installed SPTE is zapped when the child SP is recycled.

OK, I understand your point now: if an r/w fault of the type you
mention skips disallowed_hugepage_adjust(), it will break out of the
loop and a huge page will be installed. Then the next exec access will
split the huge page again; then the next r/w fault will install a huge
page again ...

> That being said, I'm not surprised you didn't encounter any issues
> when testing. Now that the TDP MMU fully splits NX huge pages on
> fault, such faults should be rare at best. Perhaps even impossible?

Possible, and not rare; I added debug output in
disallowed_hugepage_adjust() and it showed hits.

> Hm, can we drop the call to disallowed_hugepage_adjust() entirely?

I guess not; keep it as is. Even if such faults are rare, or even
impossible, what if is_nx_huge_page_enabled() changes at run time?
E.g. NX huge pages go from enabled to disabled -- give the fault path a
chance to restore the huge page mapping?
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index e08596775427..771210ce5181 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1159,7 +1159,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	struct kvm *kvm = vcpu->kvm;
 	struct tdp_iter iter;
 	struct kvm_mmu_page *sp;
-	int ret;
+	int ret = RET_PF_RETRY;
 
 	kvm_mmu_hugepage_adjust(vcpu, fault);
 
@@ -1168,23 +1168,25 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	rcu_read_lock();
 
 	tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) {
+		int r;
+
 		if (fault->nx_huge_page_workaround_enabled)
 			disallowed_hugepage_adjust(fault, iter.old_spte, iter.level);
 
 		if (iter.level == fault->goal_level)
 			break;
 
-		/* Step down into the lower level page table if it exists. */
-		if (is_shadow_present_pte(iter.old_spte) &&
-		    !is_large_pte(iter.old_spte))
-			continue;
-
 		/*
 		 * If SPTE has been frozen by another thread, just give up and
 		 * retry, avoiding unnecessary page table allocation and free.
 		 */
 		if (is_removed_spte(iter.old_spte))
-			break;
+			goto retry;
+
+		/* Step down into the lower level page table if it exists. */
+		if (is_shadow_present_pte(iter.old_spte) &&
+		    !is_large_pte(iter.old_spte))
+			continue;
 
 		/*
 		 * The SPTE is either non-present or points to a huge page that
@@ -1196,13 +1198,17 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
 
 		if (is_shadow_present_pte(iter.old_spte))
-			ret = tdp_mmu_split_huge_page(kvm, &iter, sp, true);
+			r = tdp_mmu_split_huge_page(kvm, &iter, sp, true);
 		else
-			ret = tdp_mmu_link_sp(kvm, &iter, sp, true);
+			r = tdp_mmu_link_sp(kvm, &iter, sp, true);
 
-		if (ret) {
+		/*
+		 * Also force the guest to retry the access if the upper level SPTEs
+		 * aren't in place.
+		 */
+		if (r) {
 			tdp_mmu_free_sp(sp);
-			break;
+			goto retry;
 		}
 
 		if (fault->huge_page_disallowed &&
@@ -1213,18 +1219,10 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		}
 	}
 
-	/*
-	 * Force the guest to retry the access if the upper level SPTEs aren't
-	 * in place, or if the target leaf SPTE is frozen by another CPU.
-	 */
-	if (iter.level != fault->goal_level || is_removed_spte(iter.old_spte)) {
-		rcu_read_unlock();
-		return RET_PF_RETRY;
-	}
-
 	ret = tdp_mmu_map_handle_target_level(vcpu, fault, &iter);
-	rcu_read_unlock();
 
+retry:
+	rcu_read_unlock();
 	return ret;
 }