Message ID | 20230203192822.106773-3-vipinsh@google.com |
---|---|
State | New |
Headers | Date: Fri, 3 Feb 2023 11:28:19 -0800; Subject: [Patch v2 2/5] KVM: x86/mmu: Optimize SPTE change flow for clear-dirty-log; From: Vipin Sharma <vipinsh@google.com>; To: seanjc@google.com, pbonzini@redhat.com, bgardon@google.com, dmatlack@google.com; Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Vipin Sharma <vipinsh@google.com> |
Series | Optimize clear dirty log |
Commit Message
Vipin Sharma
Feb. 3, 2023, 7:28 p.m. UTC
No need to check all of the conditions in __handle_changed_spte() as
clearing the dirty log only involves resetting the dirty or writable bit.

Make the dirty or writable bit change atomic and mark the pfn dirty.

Tested on a 160 vCPU, 160 GB VM and found that the performance of the
clear dirty log stage improved by ~38% in dirty_log_perf_test.
Before optimization:
--------------------
Test iterations: 3
Testing guest mode: PA-bits:ANY, VA-bits:48, 4K pages
guest physical test memory: [0x3fd7c0000000, 0x3fffc0000000)
Populate memory time: 6.298459671s
Setting dirty log mode took : 0.000000052s
Enabling dirty logging time: 0.003815691s
Iteration 1 dirty memory time: 0.185538848s
Iteration 1 get dirty log time: 0.002562641s
Iteration 1 clear dirty log time: 3.638543593s
Iteration 2 dirty memory time: 0.192226071s
Iteration 2 get dirty log time: 0.001558446s
Iteration 2 clear dirty log time: 3.145032742s
Iteration 3 dirty memory time: 0.193606295s
Iteration 3 get dirty log time: 0.001559425s
Iteration 3 clear dirty log time: 3.142340358s
Disabling dirty logging time: 3.002873664s
Get dirty log over 3 iterations took 0.005680512s.
(Avg 0.001893504s/iteration)
Clear dirty log over 3 iterations took 9.925916693s. (Avg 3.308638897s/iteration)
After optimization:
-------------------
Test iterations: 3
Testing guest mode: PA-bits:ANY, VA-bits:48, 4K pages
guest physical test memory: [0x3fd7c0000000, 0x3fffc0000000)
Populate memory time: 6.581448437s
Setting dirty log mode took : 0.000000058s
Enabling dirty logging time: 0.003981283s
Iteration 1 dirty memory time: 0.285693420s
Iteration 1 get dirty log time: 0.002743004s
Iteration 1 clear dirty log time: 2.384343157s
Iteration 2 dirty memory time: 0.290414476s
Iteration 2 get dirty log time: 0.001720445s
Iteration 2 clear dirty log time: 1.882770288s
Iteration 3 dirty memory time: 0.289965965s
Iteration 3 get dirty log time: 0.001728232s
Iteration 3 clear dirty log time: 1.881043086s
Disabling dirty logging time: 2.930387523s
Get dirty log over 3 iterations took 0.006191681s.
(Avg 0.002063893s/iteration)
Clear dirty log over 3 iterations took 6.148156531s. (Avg 2.049385510s/iteration)
Signed-off-by: Vipin Sharma <vipinsh@google.com>
---
arch/x86/kvm/mmu/tdp_iter.h | 13 ++++++++++
arch/x86/kvm/mmu/tdp_mmu.c | 51 +++++++++++++++----------------------
2 files changed, 34 insertions(+), 30 deletions(-)
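For reference, the ~38% figure quoted above follows directly from the per-iteration averages in the results: (3.308638897 − 2.049385510) / 3.308638897 ≈ 0.381, i.e. roughly a 38% reduction in clear-dirty-log time.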
Comments
On Fri, Feb 3, 2023 at 11:28 AM Vipin Sharma <vipinsh@google.com> wrote: > > No need to check all of the conditions in __handle_changed_spte() as > clearing dirty log only involves resetting dirty or writable bit. > > Make atomic change to dirty or writable bit and mark pfn dirty. > > Tested on 160 VCPU-160 GB VM and found that performance of clear dirty > log stage improved by ~38% in dirty_log_perf_test Dang! That's a big improvement. ... > diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h > index 30a52e5e68de..21046b34f94e 100644 > --- a/arch/x86/kvm/mmu/tdp_iter.h > +++ b/arch/x86/kvm/mmu/tdp_iter.h > @@ -121,4 +121,17 @@ void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root, > void tdp_iter_next(struct tdp_iter *iter); > void tdp_iter_restart(struct tdp_iter *iter); > > +static inline u64 kvm_tdp_mmu_clear_spte_bit(struct tdp_iter *iter, u64 mask) > +{ > + atomic64_t *sptep; > + > + if (kvm_tdp_mmu_spte_has_volatile_bits(iter->old_spte, iter->level)) { > + sptep = (atomic64_t *)rcu_dereference(iter->sptep); > + return (u64)atomic64_fetch_and(~mask, sptep); > + } > + > + __kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte & ~mask); > + return iter->old_spte; > +} > + If you do end up sending another version of the series and feel like breaking up this patch, you could probably split this part out since the change to how we set the SPTE and how we handle that change are somewhat independent. I like the switch to atomic64_fetch_and though. No idea if it's faster, but I would believe it could be. Totally optional, but there's currently no validation on the mask. Maybe we're only calling this in one place, but it might be worth clarifying the limits (if any) on what bits can be set in the mask. I don't think there necessarily need to be limits as of this commit, but the handling around this function where it's called here would obviously not be sufficient if the mask were -1UL or something. > #endif /* __KVM_X86_MMU_TDP_ITER_H */ > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c > index bba33aea0fb0..83f15052aa6c 100644 > --- a/arch/x86/kvm/mmu/tdp_mmu.c > +++ b/arch/x86/kvm/mmu/tdp_mmu.c > @@ -710,18 +710,13 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm, > * notifier for access tracking. Leaving record_acc_track > * unset in that case prevents page accesses from being > * double counted. > - * @record_dirty_log: Record the page as dirty in the dirty bitmap if > - * appropriate for the change being made. Should be set > - * unless performing certain dirty logging operations. > - * Leaving record_dirty_log unset in that case prevents page > - * writes from being double counted. I was kind of hesitant about getting rid of this but now that I see it going, I love it. ... 
> @@ -1694,18 +1681,22 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root, > mask &= ~(1UL << (iter.gfn - gfn)); > > if (wrprot || spte_ad_need_write_protect(iter.old_spte)) { > - if (is_writable_pte(iter.old_spte)) > - new_spte = iter.old_spte & ~PT_WRITABLE_MASK; > - else > + if (!is_writable_pte(iter.old_spte)) > continue; > + > + clear_bits = PT_WRITABLE_MASK; > } else { > - if (iter.old_spte & shadow_dirty_mask) > - new_spte = iter.old_spte & ~shadow_dirty_mask; > - else > + if (!(iter.old_spte & shadow_dirty_mask)) > continue; > + > + clear_bits = shadow_dirty_mask; > } > > - tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte); > + iter.old_spte = kvm_tdp_mmu_clear_spte_bit(&iter, clear_bits); > + trace_kvm_tdp_mmu_spte_changed(iter.as_id, iter.gfn, iter.level, > + iter.old_spte, > + iter.old_spte & ~clear_bits); > + kvm_set_pfn_dirty(spte_to_pfn(iter.old_spte)); > } Nice!
On Mon, Feb 6, 2023 at 2:06 PM Ben Gardon <bgardon@google.com> wrote: > > On Fri, Feb 3, 2023 at 11:28 AM Vipin Sharma <vipinsh@google.com> wrote: > > > > +static inline u64 kvm_tdp_mmu_clear_spte_bit(struct tdp_iter *iter, u64 mask) > > +{ > > + atomic64_t *sptep; > > + > > + if (kvm_tdp_mmu_spte_has_volatile_bits(iter->old_spte, iter->level)) { > > + sptep = (atomic64_t *)rcu_dereference(iter->sptep); > > + return (u64)atomic64_fetch_and(~mask, sptep); > > + } > > + > > + __kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte & ~mask); > > + return iter->old_spte; > > +} > > + > > If you do end up sending another version of the series and feel like > breaking up this patch, you could probably split this part out since > the change to how we set the SPTE and how we handle that change are > somewhat independent. I like the switch to atomic64_fetch_and though. > No idea if it's faster, but I would believe it could be. The changes are actually dependent. Using the atomic-AND makes it impossible for KVM to clear a volatile bit that it wasn't intending to clear, and that is what enables simplifying the SPTE change propagation. Instead I would split out the opportunistic cleanup of @record_dirty_log to a separate patch, since that's dead-code cleanup and refactoring.
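The property described above is easiest to see in isolation. Below is a minimal user-space sketch, not kernel code: it uses C11 atomics and made-up bit positions (the FAKE_* names are placeholders, not the real SPTE layout). It contrasts a plain read-modify-write, which can silently overwrite a concurrent update to another bit, with an atomic fetch-AND, which can only ever clear the bits in the mask and returns the old value — the property the patch relies on to skip the generic SPTE change handling.

/* Minimal user-space sketch, not kernel code. FAKE_* bit positions are
 * placeholders, not the real SPTE layout. */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define FAKE_WRITABLE_BIT (1ULL << 1)
#define FAKE_DIRTY_BIT    (1ULL << 9)

static _Atomic uint64_t fake_spte;

/* Racy read-modify-write: a concurrent update to another bit (say, a
 * hardware Accessed/Dirty assist) between the load and the store is lost. */
static void clear_bits_rmw(uint64_t mask)
{
	uint64_t old = atomic_load(&fake_spte);

	atomic_store(&fake_spte, old & ~mask);
}

/* Atomic clear, analogous to atomic64_fetch_and(~mask, sptep) in the patch:
 * only the bits in mask can change, and the old value is returned. */
static uint64_t clear_bits_atomic(uint64_t mask)
{
	return atomic_fetch_and(&fake_spte, ~mask);
}

int main(void)
{
	atomic_init(&fake_spte, FAKE_WRITABLE_BIT | FAKE_DIRTY_BIT);

	uint64_t old = clear_bits_atomic(FAKE_DIRTY_BIT);

	printf("atomic clear: old=%#llx new=%#llx\n",
	       (unsigned long long)old,
	       (unsigned long long)atomic_load(&fake_spte));

	clear_bits_rmw(FAKE_WRITABLE_BIT);
	printf("rmw clear:    new=%#llx\n",
	       (unsigned long long)atomic_load(&fake_spte));
	return 0;
}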
On Fri, Feb 03, 2023 at 11:28:19AM -0800, Vipin Sharma wrote: > No need to check all of the conditions in __handle_changed_spte() as > clearing dirty log only involves resetting dirty or writable bit. Channelling Sean: State what the patch does first. > > Make atomic change to dirty or writable bit and mark pfn dirty. This is way too vague. And the old code already did both of these things. What's changed is that the bits are being cleared with an atomic AND and taking advantage of the fact that that avoids needing to deal with changes to volatile bits. Please also explain what effect this has on @record_dirty_log and why it can be opportunistically cleaned up in this commit. > > Tested on 160 VCPU-160 GB VM and found that performance of clear dirty > log stage improved by ~38% in dirty_log_perf_test > > Before optimization: > -------------------- > > Test iterations: 3 > Testing guest mode: PA-bits:ANY, VA-bits:48, 4K pages > guest physical test memory: [0x3fd7c0000000, 0x3fffc0000000) > Populate memory time: 6.298459671s > Setting dirty log mode took : 0.000000052s > > Enabling dirty logging time: 0.003815691s > > Iteration 1 dirty memory time: 0.185538848s > Iteration 1 get dirty log time: 0.002562641s > Iteration 1 clear dirty log time: 3.638543593s > Iteration 2 dirty memory time: 0.192226071s > Iteration 2 get dirty log time: 0.001558446s > Iteration 2 clear dirty log time: 3.145032742s > Iteration 3 dirty memory time: 0.193606295s > Iteration 3 get dirty log time: 0.001559425s > Iteration 3 clear dirty log time: 3.142340358s > Disabling dirty logging time: 3.002873664s > Get dirty log over 3 iterations took 0.005680512s. > (Avg 0.001893504s/iteration) > Clear dirty log over 3 iterations took 9.925916693s. (Avg 3.308638897s/iteration) > > After optimization: > ------------------- > > Test iterations: 3 > Testing guest mode: PA-bits:ANY, VA-bits:48, 4K pages > guest physical test memory: [0x3fd7c0000000, 0x3fffc0000000) > Populate memory time: 6.581448437s > Setting dirty log mode took : 0.000000058s > > Enabling dirty logging time: 0.003981283s > > Iteration 1 dirty memory time: 0.285693420s > Iteration 1 get dirty log time: 0.002743004s > Iteration 1 clear dirty log time: 2.384343157s > Iteration 2 dirty memory time: 0.290414476s > Iteration 2 get dirty log time: 0.001720445s > Iteration 2 clear dirty log time: 1.882770288s > Iteration 3 dirty memory time: 0.289965965s > Iteration 3 get dirty log time: 0.001728232s > Iteration 3 clear dirty log time: 1.881043086s > Disabling dirty logging time: 2.930387523s > Get dirty log over 3 iterations took 0.006191681s. > (Avg 0.002063893s/iteration) > Clear dirty log over 3 iterations took 6.148156531s. (Avg 2.049385510s/iteration) Can you trim these results to just show the clear times? 
(assuming none of the rest are relevant) > > Signed-off-by: Vipin Sharma <vipinsh@google.com> > --- > arch/x86/kvm/mmu/tdp_iter.h | 13 ++++++++++ > arch/x86/kvm/mmu/tdp_mmu.c | 51 +++++++++++++++---------------------- > 2 files changed, 34 insertions(+), 30 deletions(-) > > diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h > index 30a52e5e68de..21046b34f94e 100644 > --- a/arch/x86/kvm/mmu/tdp_iter.h > +++ b/arch/x86/kvm/mmu/tdp_iter.h > @@ -121,4 +121,17 @@ void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root, > void tdp_iter_next(struct tdp_iter *iter); > void tdp_iter_restart(struct tdp_iter *iter); > > +static inline u64 kvm_tdp_mmu_clear_spte_bit(struct tdp_iter *iter, u64 mask) > +{ Make "bit" plural as long as the parameter is a raw mask. Also drop "kvm_" since this is not intended to be called outside the TDP MMU. (It'd be nice to make the same cleanup to the read/write functions if you feel like it.) > + atomic64_t *sptep; > + > + if (kvm_tdp_mmu_spte_has_volatile_bits(iter->old_spte, iter->level)) { > + sptep = (atomic64_t *)rcu_dereference(iter->sptep); > + return (u64)atomic64_fetch_and(~mask, sptep); > + } > + > + __kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte & ~mask); > + return iter->old_spte; > +} > + > #endif /* __KVM_X86_MMU_TDP_ITER_H */ > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c > index bba33aea0fb0..83f15052aa6c 100644 > --- a/arch/x86/kvm/mmu/tdp_mmu.c > +++ b/arch/x86/kvm/mmu/tdp_mmu.c > @@ -710,18 +710,13 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm, > * notifier for access tracking. Leaving record_acc_track > * unset in that case prevents page accesses from being > * double counted. > - * @record_dirty_log: Record the page as dirty in the dirty bitmap if > - * appropriate for the change being made. Should be set > - * unless performing certain dirty logging operations. > - * Leaving record_dirty_log unset in that case prevents page > - * writes from being double counted. > * > * Returns the old SPTE value, which _may_ be different than @old_spte if the > * SPTE had voldatile bits. 
> */ > static u64 __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep, > u64 old_spte, u64 new_spte, gfn_t gfn, int level, > - bool record_acc_track, bool record_dirty_log) > + bool record_acc_track) > { > lockdep_assert_held_write(&kvm->mmu_lock); > > @@ -740,42 +735,34 @@ static u64 __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep, > > if (record_acc_track) > handle_changed_spte_acc_track(old_spte, new_spte, level); > - if (record_dirty_log) > - handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte, > - new_spte, level); > + > + handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte, new_spte, > + level); > return old_spte; > } > > static inline void _tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter, > - u64 new_spte, bool record_acc_track, > - bool record_dirty_log) > + u64 new_spte, bool record_acc_track) > { > WARN_ON_ONCE(iter->yielded); > > iter->old_spte = __tdp_mmu_set_spte(kvm, iter->as_id, iter->sptep, > iter->old_spte, new_spte, > iter->gfn, iter->level, > - record_acc_track, record_dirty_log); > + record_acc_track); > } > > static inline void tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter, > u64 new_spte) > { > - _tdp_mmu_set_spte(kvm, iter, new_spte, true, true); > + _tdp_mmu_set_spte(kvm, iter, new_spte, true); > } > > static inline void tdp_mmu_set_spte_no_acc_track(struct kvm *kvm, > struct tdp_iter *iter, > u64 new_spte) > { > - _tdp_mmu_set_spte(kvm, iter, new_spte, false, true); > -} > - > -static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm, > - struct tdp_iter *iter, > - u64 new_spte) > -{ > - _tdp_mmu_set_spte(kvm, iter, new_spte, true, false); > + _tdp_mmu_set_spte(kvm, iter, new_spte, false); > } > > #define tdp_root_for_each_pte(_iter, _root, _start, _end) \ > @@ -925,7 +912,7 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp) > return false; > > __tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte, 0, > - sp->gfn, sp->role.level + 1, true, true); > + sp->gfn, sp->role.level + 1, true); > > return true; > } > @@ -1678,7 +1665,7 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root, > gfn_t gfn, unsigned long mask, bool wrprot) > { > struct tdp_iter iter; > - u64 new_spte; > + u64 clear_bits; nit: clear_bit since it's always a single bit? > > rcu_read_lock(); > > @@ -1694,18 +1681,22 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root, > mask &= ~(1UL << (iter.gfn - gfn)); > > if (wrprot || spte_ad_need_write_protect(iter.old_spte)) { > - if (is_writable_pte(iter.old_spte)) > - new_spte = iter.old_spte & ~PT_WRITABLE_MASK; > - else > + if (!is_writable_pte(iter.old_spte)) > continue; > + > + clear_bits = PT_WRITABLE_MASK; > } else { > - if (iter.old_spte & shadow_dirty_mask) > - new_spte = iter.old_spte & ~shadow_dirty_mask; > - else > + if (!(iter.old_spte & shadow_dirty_mask)) > continue; You can factor out the continue check now that you have clear_bits. e.g. 
if (wrprot || spte_ad_need_write_protect(iter.old_spte)) clear_bits = PT_WRITABLE_MASK; else clear_bits = shadow_dirty_mask; if (!(iter->old_spte & clear_bits)) continue; iter.old_spte = kvm_tdp_mmu_clear_spte_bit(&iter, clear_bits); > + > + clear_bits = shadow_dirty_mask; > } > > - tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte); > + iter.old_spte = kvm_tdp_mmu_clear_spte_bit(&iter, clear_bits); > + trace_kvm_tdp_mmu_spte_changed(iter.as_id, iter.gfn, iter.level, > + iter.old_spte, > + iter.old_spte & ~clear_bits); > + kvm_set_pfn_dirty(spte_to_pfn(iter.old_spte)); > } > > rcu_read_unlock(); > -- > 2.39.1.519.gcb327c4b5f-goog >
On Fri, Feb 3, 2023 at 11:28 AM Vipin Sharma <vipinsh@google.com> wrote: > > diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h > index 30a52e5e68de..21046b34f94e 100644 > --- a/arch/x86/kvm/mmu/tdp_iter.h > +++ b/arch/x86/kvm/mmu/tdp_iter.h > @@ -121,4 +121,17 @@ void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root, > void tdp_iter_next(struct tdp_iter *iter); > void tdp_iter_restart(struct tdp_iter *iter); > > +static inline u64 kvm_tdp_mmu_clear_spte_bit(struct tdp_iter *iter, u64 mask) > +{ > + atomic64_t *sptep; > + > + if (kvm_tdp_mmu_spte_has_volatile_bits(iter->old_spte, iter->level)) { > + sptep = (atomic64_t *)rcu_dereference(iter->sptep); > + return (u64)atomic64_fetch_and(~mask, sptep); I think you can just set iter->old_spte here and drop the return value? > + } > + > + __kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte & ~mask); > + return iter->old_spte; > +}
On Mon, Feb 6, 2023 at 2:06 PM Ben Gardon <bgardon@google.com> wrote: > > On Fri, Feb 3, 2023 at 11:28 AM Vipin Sharma <vipinsh@google.com> wrote: > > diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h > > index 30a52e5e68de..21046b34f94e 100644 > > --- a/arch/x86/kvm/mmu/tdp_iter.h > > +++ b/arch/x86/kvm/mmu/tdp_iter.h > > @@ -121,4 +121,17 @@ void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root, > > void tdp_iter_next(struct tdp_iter *iter); > > void tdp_iter_restart(struct tdp_iter *iter); > > > > +static inline u64 kvm_tdp_mmu_clear_spte_bit(struct tdp_iter *iter, u64 mask) > > +{ > > + atomic64_t *sptep; > > + > > + if (kvm_tdp_mmu_spte_has_volatile_bits(iter->old_spte, iter->level)) { > > + sptep = (atomic64_t *)rcu_dereference(iter->sptep); > > + return (u64)atomic64_fetch_and(~mask, sptep); > > + } > > + > > + __kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte & ~mask); > > + return iter->old_spte; > > +} > > + > > If you do end up sending another version of the series and feel like > breaking up this patch, you could probably split this part out since > the change to how we set the SPTE and how we handle that change are > somewhat independent. I like the switch to atomic64_fetch_and though. > No idea if it's faster, but I would believe it could be. David explained in his email why it is not independent. > > Totally optional, but there's currently no validation on the mask. > Maybe we're only calling this in one place, but it might be worth > clarifying the limits (if any) on what bits can be set in the mask. I > don't think there necessarily need to be limits as of this commit, but > the handling around this function where it's called here would > obviously not be sufficient if the mask were -1UL or something. > I cannot think of any specific mask to be useful here. Let us keep it as it is, we can revisit this API if there is a need to add a mask in future. If someone sends -1UL then it will be on them on how they are using the API.
On Mon, Feb 6, 2023 at 3:41 PM David Matlack <dmatlack@google.com> wrote: > > On Fri, Feb 03, 2023 at 11:28:19AM -0800, Vipin Sharma wrote: > > No need to check all of the conditions in __handle_changed_spte() as > > clearing dirty log only involves resetting dirty or writable bit. > > Channelling Sean: State what the patch does first. > > > > > Make atomic change to dirty or writable bit and mark pfn dirty. > > This is way too vague. And the old code already did both of these > things. What's changed is that the bits are being cleared with an atomic > AND and taking advantage of the fact that that avoids needing to deal > with changes to volatile bits. > > Please also explain what effect this has on @record_dirty_log and why it > can be opportunistically cleaned up in this commit. > Okay, I will try to be better in the next one. > > Iteration 3 clear dirty log time: 1.881043086s > > Disabling dirty logging time: 2.930387523s > > Get dirty log over 3 iterations took 0.006191681s. > > (Avg 0.002063893s/iteration) > > Clear dirty log over 3 iterations took 6.148156531s. (Avg 2.049385510s/iteration) > > Can you trim these results to just show the clear times? (assuming none > of the rest are relevant) I was not sure if just showing clear dirty times will be acceptable or not. I will update the message to only show clear dirty log time and average. > > > > > +static inline u64 kvm_tdp_mmu_clear_spte_bit(struct tdp_iter *iter, u64 mask) > > +{ > > Make "bit" plural as long as the parameter is a raw mask. > > Also drop "kvm_" since this is not intended to be called outside the TDP > MMU. (It'd be nice to make the same cleanup to the read/write > functions if you feel like it.) > Sounds good. > > @@ -1678,7 +1665,7 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root, > > gfn_t gfn, unsigned long mask, bool wrprot) > > { > > struct tdp_iter iter; > > - u64 new_spte; > > + u64 clear_bits; > > nit: clear_bit since it's always a single bit? Yes. > > > > > rcu_read_lock(); > > > > @@ -1694,18 +1681,22 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root, > > mask &= ~(1UL << (iter.gfn - gfn)); > > > > if (wrprot || spte_ad_need_write_protect(iter.old_spte)) { > > - if (is_writable_pte(iter.old_spte)) > > - new_spte = iter.old_spte & ~PT_WRITABLE_MASK; > > - else > > + if (!is_writable_pte(iter.old_spte)) > > continue; > > + > > + clear_bits = PT_WRITABLE_MASK; > > } else { > > - if (iter.old_spte & shadow_dirty_mask) > > - new_spte = iter.old_spte & ~shadow_dirty_mask; > > - else > > + if (!(iter.old_spte & shadow_dirty_mask)) > > continue; > > You can factor out the continue check now that you have clear_bits. e.g. > > if (wrprot || spte_ad_need_write_protect(iter.old_spte)) > clear_bits = PT_WRITABLE_MASK; > else > clear_bits = shadow_dirty_mask; > > if (!(iter->old_spte & clear_bits)) > continue; > > iter.old_spte = kvm_tdp_mmu_clear_spte_bit(&iter, clear_bits); > Yeah, this is better. Even better if I just initialize like: u64 clear_bits = shadow_dirty_mask; This will also get rid of the else part.
On Mon, Feb 6, 2023 at 3:53 PM David Matlack <dmatlack@google.com> wrote: > > On Fri, Feb 3, 2023 at 11:28 AM Vipin Sharma <vipinsh@google.com> wrote: > > > > diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h > > index 30a52e5e68de..21046b34f94e 100644 > > --- a/arch/x86/kvm/mmu/tdp_iter.h > > +++ b/arch/x86/kvm/mmu/tdp_iter.h > > @@ -121,4 +121,17 @@ void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root, > > void tdp_iter_next(struct tdp_iter *iter); > > void tdp_iter_restart(struct tdp_iter *iter); > > > > +static inline u64 kvm_tdp_mmu_clear_spte_bit(struct tdp_iter *iter, u64 mask) > > +{ > > + atomic64_t *sptep; > > + > > + if (kvm_tdp_mmu_spte_has_volatile_bits(iter->old_spte, iter->level)) { > > + sptep = (atomic64_t *)rcu_dereference(iter->sptep); > > + return (u64)atomic64_fetch_and(~mask, sptep); > > I think you can just set iter->old_spte here and drop the return value? > I can do this. However, my intention was to keep the return contract similar to kvm_tdp_mmu_write_spte(). On second thought, should I make this function signature similar to kvm_tdp_mmu_write_spte() just to be consistent? > > + } > > + > > + __kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte & ~mask); > > + return iter->old_spte; > > +}
On Tue, Feb 7, 2023 at 9:37 AM Vipin Sharma <vipinsh@google.com> wrote: > > On Mon, Feb 6, 2023 at 3:41 PM David Matlack <dmatlack@google.com> wrote: > > > > On Fri, Feb 03, 2023 at 11:28:19AM -0800, Vipin Sharma wrote: > > > > if (wrprot || spte_ad_need_write_protect(iter.old_spte)) > > clear_bits = PT_WRITABLE_MASK; > > else > > clear_bits = shadow_dirty_mask; > > > > if (!(iter->old_spte & clear_bits)) > > continue; > > > > iter.old_spte = kvm_tdp_mmu_clear_spte_bit(&iter, clear_bits); > > > > Yeah, this is better. Even better if I just initialize like: > > u64 clear_bits = shadow_dirty_mask; > > This will also get rid of the else part. On that topic... Do we need to recalculate clear_bits for every spte? wrprot is passed as a parameter so that will not change. And spte_ad_need_write_protect() should either return true or false for all SPTEs in the TDP MMU. Specifically, make_spte() has this code: if (sp->role.ad_disabled) spte |= SPTE_TDP_AD_DISABLED_MASK; else if (kvm_mmu_page_ad_need_write_protect(sp)) spte |= SPTE_TDP_AD_WRPROT_ONLY_MASK; sp->role.ad_disabled is never modified in TDP MMU pages. So it should be constant for all pages. And kvm_mmu_page_ad_need_write_protect() will always return false for TDP MMU pages since sp->role.guest_mode is always false. So this can just be: u64 clear_bit = (wrprot || !kvm_ad_enabled()) ? PT_WRITABLE_MASK : shadow_dirty_mask;
On Tue, Feb 7, 2023 at 9:47 AM David Matlack <dmatlack@google.com> wrote: > > On Tue, Feb 7, 2023 at 9:37 AM Vipin Sharma <vipinsh@google.com> wrote: > > > > On Mon, Feb 6, 2023 at 3:41 PM David Matlack <dmatlack@google.com> wrote: > > > > > > On Fri, Feb 03, 2023 at 11:28:19AM -0800, Vipin Sharma wrote: > > > > > > if (wrprot || spte_ad_need_write_protect(iter.old_spte)) > > > clear_bits = PT_WRITABLE_MASK; > > > else > > > clear_bits = shadow_dirty_mask; > > > > > > if (!(iter->old_spte & clear_bits)) > > > continue; > > > > > > iter.old_spte = kvm_tdp_mmu_clear_spte_bit(&iter, clear_bits); > > > > > > > Yeah, this is better. Even better if I just initialize like: > > > > u64 clear_bits = shadow_dirty_mask; > > > > This will also get rid of the else part. > > On that topic... Do we need to recalculate clear_bits for every spte? > wrprot is passed as a parameter so that will not change. And > spte_ad_need_write_protect() should either return true or false for > all SPTEs in the TDP MMU. Specifically, make_spte() has this code: > > if (sp->role.ad_disabled) > spte |= SPTE_TDP_AD_DISABLED_MASK; > else if (kvm_mmu_page_ad_need_write_protect(sp)) > spte |= SPTE_TDP_AD_WRPROT_ONLY_MASK; > > sp->role.ad_disabled is never modified in TDP MMU pages. So it should > be constant for all pages. And kvm_mmu_page_ad_need_write_protect() > will always return false for TDP MMU pages since sp->role.guest_mode > is always false. > > So this can just be: > > u64 clear_bit = (wrprot || !kvm_ad_enabled()) ? PT_WRITABLE_MASK : > shadow_dirty_mask; I discussed it offline with David to understand more about it. It makes sense as TDP MMU pages will not have nested SPTEs (they are in rmaps). So, this looks good, I will add it in the next version. Thanks
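To summarize where the discussion lands, here is a self-contained user-space sketch, with fake names and bit positions rather than the real KVM code, of the loop shape the next version converges to: the bit to clear is chosen once outside the loop (writable when write-protection is required or A/D bits are disabled, dirty otherwise), and each selected entry is then cleared with an atomic fetch-AND before the pfn is reported dirty.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define FAKE_WRITABLE_BIT (1ULL << 1)
#define FAKE_DIRTY_BIT    (1ULL << 9)
#define NR_FAKE_SPTES     4

static _Atomic uint64_t fake_sptes[NR_FAKE_SPTES];

/* Stand-in for kvm_set_pfn_dirty(spte_to_pfn(old_spte)). */
static void fake_mark_pfn_dirty(int idx, uint64_t old_spte)
{
	printf("spte %d: pfn marked dirty (old spte %#llx)\n",
	       idx, (unsigned long long)old_spte);
}

/* The bit to clear is picked once, outside the loop, per the review
 * discussion: writable when write-protection is required or A/D bits are
 * disabled, dirty otherwise. Each selected entry is then cleared with an
 * atomic fetch-AND. */
static void fake_clear_dirty_masked(unsigned long mask, bool wrprot, bool ad_enabled)
{
	uint64_t clear_bit = (wrprot || !ad_enabled) ? FAKE_WRITABLE_BIT
						     : FAKE_DIRTY_BIT;

	for (int i = 0; i < NR_FAKE_SPTES && mask; i++) {
		if (!(mask & (1UL << i)))
			continue;
		mask &= ~(1UL << i);

		uint64_t old = atomic_load(&fake_sptes[i]);

		if (!(old & clear_bit))
			continue;

		old = atomic_fetch_and(&fake_sptes[i], ~clear_bit);
		fake_mark_pfn_dirty(i, old);
	}
}

int main(void)
{
	for (int i = 0; i < NR_FAKE_SPTES; i++)
		atomic_init(&fake_sptes[i], FAKE_WRITABLE_BIT | FAKE_DIRTY_BIT);

	fake_clear_dirty_masked(0x5UL, false, true);	/* clear dirty bit on entries 0 and 2 */
	return 0;
}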
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index 30a52e5e68de..21046b34f94e 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -121,4 +121,17 @@ void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
 void tdp_iter_next(struct tdp_iter *iter);
 void tdp_iter_restart(struct tdp_iter *iter);
 
+static inline u64 kvm_tdp_mmu_clear_spte_bit(struct tdp_iter *iter, u64 mask)
+{
+	atomic64_t *sptep;
+
+	if (kvm_tdp_mmu_spte_has_volatile_bits(iter->old_spte, iter->level)) {
+		sptep = (atomic64_t *)rcu_dereference(iter->sptep);
+		return (u64)atomic64_fetch_and(~mask, sptep);
+	}
+
+	__kvm_tdp_mmu_write_spte(iter->sptep, iter->old_spte & ~mask);
+	return iter->old_spte;
+}
+
 #endif /* __KVM_X86_MMU_TDP_ITER_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index bba33aea0fb0..83f15052aa6c 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -710,18 +710,13 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
  *		      notifier for access tracking. Leaving record_acc_track
  *		      unset in that case prevents page accesses from being
  *		      double counted.
- * @record_dirty_log: Record the page as dirty in the dirty bitmap if
- *		      appropriate for the change being made. Should be set
- *		      unless performing certain dirty logging operations.
- *		      Leaving record_dirty_log unset in that case prevents page
- *		      writes from being double counted.
  *
  * Returns the old SPTE value, which _may_ be different than @old_spte if the
  * SPTE had voldatile bits.
  */
 static u64 __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
 			      u64 old_spte, u64 new_spte, gfn_t gfn, int level,
-			      bool record_acc_track, bool record_dirty_log)
+			      bool record_acc_track)
 {
 	lockdep_assert_held_write(&kvm->mmu_lock);
 
@@ -740,42 +735,34 @@ static u64 __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
 
 	if (record_acc_track)
 		handle_changed_spte_acc_track(old_spte, new_spte, level);
-	if (record_dirty_log)
-		handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte,
-					      new_spte, level);
+
+	handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte, new_spte,
+				      level);
 	return old_spte;
 }
 
 static inline void _tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
-				     u64 new_spte, bool record_acc_track,
-				     bool record_dirty_log)
+				     u64 new_spte, bool record_acc_track)
 {
 	WARN_ON_ONCE(iter->yielded);
 
 	iter->old_spte = __tdp_mmu_set_spte(kvm, iter->as_id, iter->sptep,
 					    iter->old_spte, new_spte,
 					    iter->gfn, iter->level,
-					    record_acc_track, record_dirty_log);
+					    record_acc_track);
 }
 
 static inline void tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
 				    u64 new_spte)
 {
-	_tdp_mmu_set_spte(kvm, iter, new_spte, true, true);
+	_tdp_mmu_set_spte(kvm, iter, new_spte, true);
 }
 
 static inline void tdp_mmu_set_spte_no_acc_track(struct kvm *kvm,
 						 struct tdp_iter *iter,
 						 u64 new_spte)
 {
-	_tdp_mmu_set_spte(kvm, iter, new_spte, false, true);
-}
-
-static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
-						 struct tdp_iter *iter,
-						 u64 new_spte)
-{
-	_tdp_mmu_set_spte(kvm, iter, new_spte, true, false);
+	_tdp_mmu_set_spte(kvm, iter, new_spte, false);
 }
 
 #define tdp_root_for_each_pte(_iter, _root, _start, _end) \
@@ -925,7 +912,7 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
 		return false;
 
 	__tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte, 0,
-			   sp->gfn, sp->role.level + 1, true, true);
+			   sp->gfn, sp->role.level + 1, true);
 
 	return true;
 }
@@ -1678,7 +1665,7 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
 			   gfn_t gfn, unsigned long mask, bool wrprot)
 {
 	struct tdp_iter iter;
-	u64 new_spte;
+	u64 clear_bits;
 
 	rcu_read_lock();
 
@@ -1694,18 +1681,22 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
 		mask &= ~(1UL << (iter.gfn - gfn));
 
 		if (wrprot || spte_ad_need_write_protect(iter.old_spte)) {
-			if (is_writable_pte(iter.old_spte))
-				new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
-			else
+			if (!is_writable_pte(iter.old_spte))
 				continue;
+
+			clear_bits = PT_WRITABLE_MASK;
 		} else {
-			if (iter.old_spte & shadow_dirty_mask)
-				new_spte = iter.old_spte & ~shadow_dirty_mask;
-			else
+			if (!(iter.old_spte & shadow_dirty_mask))
 				continue;
+
+			clear_bits = shadow_dirty_mask;
 		}
 
-		tdp_mmu_set_spte_no_dirty_log(kvm, &iter, new_spte);
+		iter.old_spte = kvm_tdp_mmu_clear_spte_bit(&iter, clear_bits);
+		trace_kvm_tdp_mmu_spte_changed(iter.as_id, iter.gfn, iter.level,
+					       iter.old_spte,
+					       iter.old_spte & ~clear_bits);
+		kvm_set_pfn_dirty(spte_to_pfn(iter.old_spte));
 	}
 
 	rcu_read_unlock();