Message ID | 20221212153205.3360-2-jiangshanlai@gmail.com |
---|---|
State | New |
Headers |
From: Lai Jiangshan <jiangshanlai@gmail.com> To: linux-kernel@vger.kernel.org Cc: Paolo Bonzini <pbonzini@redhat.com>, Sean Christopherson <seanjc@google.com>, Lai Jiangshan <jiangshan.ljs@antgroup.com>, Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>, Dave Hansen <dave.hansen@linux.intel.com>, x86@kernel.org, "H. Peter Anvin" <hpa@zytor.com>, kvm@vger.kernel.org Subject: [PATCH 1/2] kvm: x86/mmu: Reduce the update to the spte in FNAME(sync_page) Date: Mon, 12 Dec 2022 23:32:04 +0800 Message-Id: <20221212153205.3360-2-jiangshanlai@gmail.com> In-Reply-To: <20221212153205.3360-1-jiangshanlai@gmail.com> References: <20221212153205.3360-1-jiangshanlai@gmail.com> |
Series |
kvm: x86/mmu: Skip adding write-access for spte in FNAME(sync_page) and remove shadow_host_writable_mask
|
|
Commit Message
Lai Jiangshan
Dec. 12, 2022, 3:32 p.m. UTC
From: Lai Jiangshan <jiangshan.ljs@antgroup.com>

Sometimes when the guest updates its pagetable, it only adds new gptes without changing any existing ones, so there is no point in updating the sptes for those unchanged gptes.

Moreover, when the sptes for these unchanged gptes are updated, the AD bits are also removed, since make_spte() is called with prefetch=true, which may result in unneeded TLB flushing.

Do nothing if the permissions are unchanged or only write-access is being added. Only update the spte when write-access is being removed. Drop the SPTE otherwise.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
---
 arch/x86/kvm/mmu/paging_tmpl.h | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)
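The three-way decision described in the commit message can be modeled as a small standalone helper. This is an illustrative sketch, not kernel code: the `ACC_*_MASK` values here and the `sync_decision()`/`sync_action` names are hypothetical stand-ins that merely mirror KVM's access-mask convention.

```c
#include <assert.h>

/* Hypothetical stand-ins for KVM's ACC_*_MASK access bits. */
#define ACC_EXEC_MASK  1u
#define ACC_WRITE_MASK 2u
#define ACC_USER_MASK  4u

enum sync_action { SYNC_SKIP, SYNC_UPDATE, SYNC_DROP };

/*
 * Model of the proposed FNAME(sync_page) logic:
 *  - skip when permissions are unchanged or only write-access is added,
 *  - update in place when only write-access is removed,
 *  - drop the SPTE for any other permission change.
 */
static enum sync_action sync_decision(unsigned int old_acc, unsigned int new_acc)
{
	if (old_acc == new_acc || (old_acc | ACC_WRITE_MASK) == new_acc)
		return SYNC_SKIP;
	if (old_acc == (new_acc | ACC_WRITE_MASK))
		return SYNC_UPDATE;
	return SYNC_DROP;
}
```

For example, a gpte going from read-only to read-write is skipped (the ensuing write-fault would install the writable SPTE lazily), while a gpte losing write-access updates the SPTE in place.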
Comments
On Mon, Dec 12, 2022, Lai Jiangshan wrote:
> From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
>
> Sometimes when the guest updates its pagetable, it adds only new gptes
> to it without changing any existed one, so there is no point to update
> the sptes for these existed gptes.
>
> Also when the sptes for these unchanged gptes are updated, the AD
> bits are also removed since make_spte() is called with prefetch=true
> which might result unneeded TLB flushing.

If either of the proposed changes is kept, please move this to a separate patch.
Skipping updates for PTEs with the same protections is a separate logical change
from skipping updates when making the SPTE writable.

Actually, can't we just pass @prefetch=false to make_spte()?  FNAME(prefetch_invalid_gpte)
has already verified the Accessed bit is set in the GPTE, so at least for guest
correctness there's no need to access-track the SPTE.  Host page aging is already
fuzzy so I don't think there are problems there.

> Do nothing if the permissions are unchanged or only write-access is
> being added.

I'm pretty sure skipping the "make writable" case is architecturally wrong.  On a
#PF, any TLB entries for the faulting virtual address are required to be removed.
That means KVM _must_ refresh the SPTE if a vCPU takes a !WRITABLE fault on an
unsync page.  E.g. see kvm_inject_emulated_page_fault().

> Only update the spte when write-access is being removed. Drop the SPTE
> otherwise.

Correctness aside, there needs to be far more analysis and justification for a
change like this, e.g. performance numbers for various workloads.

> ---
>  arch/x86/kvm/mmu/paging_tmpl.h | 19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index e5662dbd519c..613f043a3e9e 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -1023,7 +1023,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>  	for (i = 0; i < SPTE_ENT_PER_PAGE; i++) {
>  		u64 *sptep, spte;
>  		struct kvm_memory_slot *slot;
> -		unsigned pte_access;
> +		unsigned old_pte_access, pte_access;
>  		pt_element_t gpte;
>  		gpa_t pte_gpa;
>  		gfn_t gfn;
> @@ -1064,6 +1064,23 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>  			continue;
>  		}
>
> +		/*
> +		 * Drop the SPTE if the new protections would result in access
> +		 * permissions other than write-access is changing. Do nothing
> +		 * if the permissions are unchanged or only write-access is
> +		 * being added. Only update the spte when write-access is being
> +		 * removed.
> +		 */
> +		old_pte_access = kvm_mmu_page_get_access(sp, i);
> +		if (old_pte_access == pte_access ||
> +		    (old_pte_access | ACC_WRITE_MASK) == pte_access)
> +			continue;
> +		if (old_pte_access != (pte_access | ACC_WRITE_MASK)) {
> +			drop_spte(vcpu->kvm, &sp->spt[i]);
> +			flush = true;
> +			continue;
> +		}
> +
>  		/* Update the shadowed access bits in case they changed. */
>  		kvm_mmu_page_set_access(sp, i, pte_access);
>
> --
> 2.19.1.6.gb485710b
>
Hello Sean,

On Wed, Dec 14, 2022 at 2:12 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Dec 12, 2022, Lai Jiangshan wrote:
> > From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> >
> > Sometimes when the guest updates its pagetable, it adds only new gptes
> > to it without changing any existed one, so there is no point to update
> > the sptes for these existed gptes.
> >
> > Also when the sptes for these unchanged gptes are updated, the AD
> > bits are also removed since make_spte() is called with prefetch=true
> > which might result unneeded TLB flushing.
>
> If either of the proposed changes is kept, please move this to a separate patch.
> Skipping updates for PTEs with the same protections is separate logical change
> from skipping updates when making the SPTE writable.
>
> Actually, can't we just pass @prefetch=false to make_spte()?  FNAME(prefetch_invalid_gpte)
> has already verified the Accessed bit is set in the GPTE, so at least for guest
> correctness there's no need to access-track the SPTE.  Host page aging is already
> fuzzy so I don't think there are problems there.

FNAME(prefetch_invalid_gpte) has already verified the Accessed bit is set
in the GPTE and FNAME(protect_clean_gpte) has already verified the Dirty
bit is set in the GPTE. These are only for guest AD bits.

And I don't think it is a good idea to pass @prefetch=false to make_spte(),
since the host might have cleared AD bit in the spte for aging or dirty-log.
The AD bits in the spte are better to be kept as before.

Though passing @prefetch=false would not cause any correctness problem
in the view of maintaining guest AD bits.

> > Do nothing if the permissions are unchanged or only write-access is
> > being added.
>
> I'm pretty sure skipping the "make writable" case is architecturally wrong.  On a
> #PF, any TLB entries for the faulting virtual address are required to be removed.
> That means KVM _must_ refresh the SPTE if a vCPU takes a !WRITABLE fault on an
> unsync page.  E.g. see kvm_inject_emulated_page_fault().

I might misunderstand what you meant or I failed to connect it with
the SDM properly.

I think there is no #PF here.

And even if the guest is requesting writable, the hypervisor is allowed to
set it non-writable and prepared to handle it in the ensuing write-fault.

Skipping to make it writable is a kind of lazy operation and considered
to be "the hypervisor doesn't grant the writable permission for a period
before next write-fault".

Thanks
Lai
On Wed, Dec 14, 2022, Lai Jiangshan wrote:
> Hello Sean,
>
> On Wed, Dec 14, 2022 at 2:12 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, Dec 12, 2022, Lai Jiangshan wrote:
> > > From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> > >
> > > Sometimes when the guest updates its pagetable, it adds only new gptes
> > > to it without changing any existed one, so there is no point to update
> > > the sptes for these existed gptes.
> > >
> > > Also when the sptes for these unchanged gptes are updated, the AD
> > > bits are also removed since make_spte() is called with prefetch=true
> > > which might result unneeded TLB flushing.
> >
> > If either of the proposed changes is kept, please move this to a separate patch.
> > Skipping updates for PTEs with the same protections is separate logical change
> > from skipping updates when making the SPTE writable.
> >
> > Actually, can't we just pass @prefetch=false to make_spte()?  FNAME(prefetch_invalid_gpte)
> > has already verified the Accessed bit is set in the GPTE, so at least for guest
> > correctness there's no need to access-track the SPTE.  Host page aging is already
> > fuzzy so I don't think there are problems there.
>
> FNAME(prefetch_invalid_gpte) has already verified the Accessed bit is set
> in the GPTE and FNAME(protect_clean_gpte) has already verified the Dirty
> bit is set in the GPTE. These are only for guest AD bits.
>
> And I don't think it is a good idea to pass @prefetch=false to make_spte(),
> since the host might have cleared AD bit in the spte for aging or dirty-log.
> The AD bits in the spte are better to be kept as before.

Drat, I was thinking KVM never flushes when aging SPTEs, but forgot about
clear_flush_young(). Rather than skipping if the Accessed bit is the only
thing that's changing, what about simply preserving the Accessed bit? And
s/prefetch/accessed in make_spte() so that future changes to make_spte()
don't make incorrect assumptions about the meaning of "prefetch".

Another alternative would be to conditionally preserve the Accessed bit,
i.e. clear it if a flush is needed anyways, but that seems unnecessarily
complex.

> Though passing @prefetch=false would not cause any correctness problem
> in the view of maintaining guest AD bits.
>
> > > Do nothing if the permissions are unchanged or only write-access is
> > > being added.
> >
> > I'm pretty sure skipping the "make writable" case is architecturally wrong.  On a
> > #PF, any TLB entries for the faulting virtual address are required to be removed.
> > That means KVM _must_ refresh the SPTE if a vCPU takes a !WRITABLE fault on an
> > unsync page.  E.g. see kvm_inject_emulated_page_fault().
>
> I might misunderstand what you meant or I failed to connect it with
> the SDM properly.
>
> I think there is no #PF here.
>
> And even if the guest is requesting writable, the hypervisor is allowed to
> set it non-writable and prepared to handle it in the ensuing write-fault.

Yeah, you're right. The host will see the "spurious" page fault but it will
never get injected into the guest.

> Skipping to make it writable is a kind of lazy operation and considered
> to be "the hypervisor doesn't grant the writable permission for a period
> before next write-fault".

But that raises the question of why? No TLB flush is needed precisely because
any !WRITABLE fault will be treated as a spurious fault. The cost of writing
the SPTE is minimal. So why skip? Skipping just to reclaim a low SPTE bit
doesn't seem like a good tradeoff, especially without a concrete use case for
the SPTE bit.

E.g. on pre-Nehalem Intel CPUs, i.e. CPUs that don't support EPT and thus have
to use shadow paging, the CPU automatically retries accesses after the TLB
flush on permission faults. The lazy approach might introduce a noticeable
performance regression on such CPUs due to causing more #PF VM-Exits than the
current approach.
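Sean's suggested alternative — keep the SPTE update but carry the old SPTE's Accessed bit over instead of letting make_spte() clear it — could be sketched roughly as below. `SHADOW_ACCESSED_MASK` and its bit position are simplified assumptions for illustration, not the real kernel definitions, and `preserve_accessed_bit()` is a hypothetical helper, not a KVM function.

```c
#include <stdint.h>

/* Assumed position of the shadow-paging Accessed bit; illustrative only. */
#define SHADOW_ACCESSED_MASK (UINT64_C(1) << 8)

/* Rebuild a SPTE but preserve whatever Accessed state the old SPTE had. */
static uint64_t preserve_accessed_bit(uint64_t old_spte, uint64_t new_spte)
{
	new_spte &= ~SHADOW_ACCESSED_MASK;
	new_spte |= old_spte & SHADOW_ACCESSED_MASK;
	return new_spte;
}
```

The intent is that the sync path neither sets nor clears aging state, which sidesteps the interaction with clear_flush_young() that made @prefetch=false unattractive.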
On Wed, Dec 14, 2022 at 2:12 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Dec 12, 2022, Lai Jiangshan wrote:
> > From: Lai Jiangshan <jiangshan.ljs@antgroup.com>
> >
> > Sometimes when the guest updates its pagetable, it adds only new gptes
> > to it without changing any existed one, so there is no point to update
> > the sptes for these existed gptes.
> >
> > Also when the sptes for these unchanged gptes are updated, the AD
> > bits are also removed since make_spte() is called with prefetch=true
> > which might result unneeded TLB flushing.
>
> If either of the proposed changes is kept, please move this to a separate patch.
> Skipping updates for PTEs with the same protections is separate logical change
> from skipping updates when making the SPTE writable.
>

Did as you suggested:

https://lore.kernel.org/lkml/20230105095848.6061-5-jiangshanlai@gmail.com/
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index e5662dbd519c..613f043a3e9e 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -1023,7 +1023,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 	for (i = 0; i < SPTE_ENT_PER_PAGE; i++) {
 		u64 *sptep, spte;
 		struct kvm_memory_slot *slot;
-		unsigned pte_access;
+		unsigned old_pte_access, pte_access;
 		pt_element_t gpte;
 		gpa_t pte_gpa;
 		gfn_t gfn;
@@ -1064,6 +1064,23 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 			continue;
 		}
 
+		/*
+		 * Drop the SPTE if the new protections would result in access
+		 * permissions other than write-access is changing. Do nothing
+		 * if the permissions are unchanged or only write-access is
+		 * being added. Only update the spte when write-access is being
+		 * removed.
+		 */
+		old_pte_access = kvm_mmu_page_get_access(sp, i);
+		if (old_pte_access == pte_access ||
+		    (old_pte_access | ACC_WRITE_MASK) == pte_access)
+			continue;
+		if (old_pte_access != (pte_access | ACC_WRITE_MASK)) {
+			drop_spte(vcpu->kvm, &sp->spt[i]);
+			flush = true;
+			continue;
+		}
+
 		/* Update the shadowed access bits in case they changed. */
 		kvm_mmu_page_set_access(sp, i, pte_access);
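One way to sanity-check the two branch conditions in the hunk above is to enumerate every (old, new) access pair and confirm the invariants the commit message claims: skipped entries differ from the GPTE at most by added write-access, and in-place updates only ever remove write-access. This is a hedged, standalone harness with stand-in `ACC_*_MASK` values, not kernel code.

```c
/* Stand-ins for KVM's access bits; values are illustrative. */
#define ACC_EXEC_MASK  1u
#define ACC_WRITE_MASK 2u
#define ACC_USER_MASK  4u
#define ACC_ALL        (ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK)

/* Return 1 if the patch's skip/update conditions satisfy the invariants. */
static int sync_invariants_hold(void)
{
	unsigned int old_acc, new_acc;

	for (old_acc = 0; old_acc <= ACC_ALL; old_acc++) {
		for (new_acc = 0; new_acc <= ACC_ALL; new_acc++) {
			int skip = old_acc == new_acc ||
				   (old_acc | ACC_WRITE_MASK) == new_acc;
			int update = !skip &&
				     old_acc == (new_acc | ACC_WRITE_MASK);

			/* Skipped entries gain at most write-access. */
			if (skip && (new_acc & ~old_acc) != 0 &&
			    (new_acc & ~old_acc) != ACC_WRITE_MASK)
				return 0;
			/* In-place updates only ever remove write-access. */
			if (update && ((old_acc & ~new_acc) != ACC_WRITE_MASK ||
				       (new_acc & ~old_acc) != 0))
				return 0;
		}
	}
	return 1;
}
```

Every other (old, new) combination falls through to the drop_spte() path, matching the "Drop the SPTE otherwise" behavior described in the commit message.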