Message ID: 20230526234435.662652-5-yuzhao@google.com
State: New

Headers:
Date: Fri, 26 May 2023 17:44:29 -0600
In-Reply-To: <20230526234435.662652-1-yuzhao@google.com>
Message-Id: <20230526234435.662652-5-yuzhao@google.com>
Subject: [PATCH mm-unstable v2 04/10] kvm/arm64: make stage2 page tables RCU safe
From: Yu Zhao <yuzhao@google.com>
To: Andrew Morton <akpm@linux-foundation.org>, Paolo Bonzini <pbonzini@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>, Anup Patel <anup@brainfault.org>, Ben Gardon <bgardon@google.com>, Borislav Petkov <bp@alien8.de>, Catalin Marinas <catalin.marinas@arm.com>, Chao Peng <chao.p.peng@linux.intel.com>, Christophe Leroy <christophe.leroy@csgroup.eu>, Dave Hansen <dave.hansen@linux.intel.com>, Fabiano Rosas <farosas@linux.ibm.com>, Gaosheng Cui <cuigaosheng1@huawei.com>, Gavin Shan <gshan@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>, Ingo Molnar <mingo@redhat.com>, James Morse <james.morse@arm.com>, "Jason A. Donenfeld" <Jason@zx2c4.com>, Jason Gunthorpe <jgg@ziepe.ca>, Jonathan Corbet <corbet@lwn.net>, Marc Zyngier <maz@kernel.org>, Masami Hiramatsu <mhiramat@kernel.org>, Michael Ellerman <mpe@ellerman.id.au>, Michael Larabel <michael@michaellarabel.com>, Mike Rapoport <rppt@kernel.org>, Nicholas Piggin <npiggin@gmail.com>, Oliver Upton <oliver.upton@linux.dev>, Paul Mackerras <paulus@ozlabs.org>, Peter Xu <peterx@redhat.com>, Sean Christopherson <seanjc@google.com>, Steven Rostedt <rostedt@goodmis.org>, Suzuki K Poulose <suzuki.poulose@arm.com>, Thomas Gleixner <tglx@linutronix.de>, Thomas Huth <thuth@redhat.com>, Will Deacon <will@kernel.org>, Zenghui Yu <yuzenghui@huawei.com>, kvmarm@lists.linux.dev, kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linuxppc-dev@lists.ozlabs.org, linux-trace-kernel@vger.kernel.org, x86@kernel.org, linux-mm@google.com, Yu Zhao <yuzhao@google.com>
Series: mm/kvm: locklessly clear the accessed bit
Commit Message
Yu Zhao
May 26, 2023, 11:44 p.m. UTC
Stage2 page tables are currently not RCU safe against unmapping or VM
destruction. The existing mmu_notifier_ops members rely on
kvm->mmu_lock to synchronize with those operations.

However, the new mmu_notifier_ops member test_clear_young() provides
a fast path that does not take kvm->mmu_lock. To implement
kvm_arch_test_clear_young() for that path, unmapped page tables need
to be freed by RCU, and kvm_free_stage2_pgd() needs to be called after
mmu_notifier_unregister().
Remapping, specifically stage2_free_removed_table(), is already RCU
safe.
Signed-off-by: Yu Zhao <yuzhao@google.com>
---
arch/arm64/include/asm/kvm_pgtable.h | 2 ++
arch/arm64/kvm/arm.c | 1 +
arch/arm64/kvm/hyp/pgtable.c | 8 ++++++--
arch/arm64/kvm/mmu.c | 17 ++++++++++++++++-
4 files changed, 25 insertions(+), 3 deletions(-)
Comments
Yu,

On Fri, May 26, 2023 at 05:44:29PM -0600, Yu Zhao wrote:
> Stage2 page tables are currently not RCU safe against unmapping or VM
> destruction. The previous mmu_notifier_ops members rely on
> kvm->mmu_lock to synchronize with those operations.
[...]
> + * @put_page_rcu:		RCU variant of the above.

You don't need to add yet another hook to implement this. I was working
on lock-free walks in a separate context and arrived at the following:

commit f82d264a37745e07ee28e116c336f139f681fd7f
Author: Oliver Upton <oliver.upton@linux.dev>
Date:   Mon May 1 08:53:37 2023 +0000

    KVM: arm64: Consistently use free_removed_table() for stage-2

    free_removed_table() is essential to the RCU-protected parallel walking
    scheme, as behind the scenes the cleanup is deferred until an RCU grace
    period. Nonetheless, the stage-2 unmap path calls put_page() directly,
    which leads to table memory being freed inline with the table walk.

    This is safe for the time being, as the stage-2 unmap walker is called
    while holding the write lock. A future change to KVM will further relax
    the locking mechanics around the stage-2 page tables to allow lock-free
    walkers protected only by RCU. As such, switch to the RCU-safe mechanism
    for freeing table memory.

    Signed-off-by: Oliver Upton <oliver.upton@linux.dev>

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 3d61bd3e591d..bfbebdcb4ef0 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1019,7 +1019,7 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
 					       kvm_granule_size(ctx->level));
 
 	if (childp)
-		mm_ops->put_page(childp);
+		mm_ops->free_removed_table(childp, ctx->level);
 
 	return 0;
 }
On Sat, May 27, 2023 at 12:08 PM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> You don't need to add yet another hook to implement this. I was working
> on lock-free walks in a separate context and arrived at the following:
[...]
> 	if (childp)
> -		mm_ops->put_page(childp);
> +		mm_ops->free_removed_table(childp, ctx->level);

Thanks, Oliver.

A couple of things I haven't had the chance to verify -- I'm hoping
you could help clarify:
1. For unmapping, with free_removed_table(), wouldn't we have to look
into a table we know is empty unnecessarily?
2. For remapping and unmapping, how does free_removed_table() put the
final refcount on the table passed in? (Previously we had
put_page(childp) in stage2_map_walk_table_post(), so I'm assuming we'd
have to do something equivalent with free_removed_table().)
Hi Yu,

On Sat, May 27, 2023 at 02:13:07PM -0600, Yu Zhao wrote:
> 1. For unmapping, with free_removed_table(), wouldn't we have to look
> into a table we know is empty unnecessarily?

As it is currently implemented, yes. But, there's potential to fast-path
the implementation by checking page_count() before starting the walk.

> 2. For remapping and unmapping, how does free_removed_table() put the
> final refcount on the table passed in? (Previously we had
> put_page(childp) in stage2_map_walk_table_post(), so I'm assuming we'd
> have to do something equivalent with free_removed_table().)

Heh, that's a bug, and an embarrassing one at that!

Sent out a fix for that, since it would appear we leak memory on
table->block transitions. PTAL if you have a chance.

https://lore.kernel.org/all/20230530193213.1663411-1-oliver.upton@linux.dev/
On Tue, May 30, 2023 at 1:37 PM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> As it is currently implemented, yes. But, there's potential to fast-path
> the implementation by checking page_count() before starting the walk.

Do you mind posting another patch? I'd be happy to ack it, as well as
the one you suggested above.

> Heh, that's a bug, and an embarrassing one at that!
>
> Sent out a fix for that, since it would appear we leak memory on
> table->block transitions. PTAL if you have a chance.
>
> https://lore.kernel.org/all/20230530193213.1663411-1-oliver.upton@linux.dev/

Awesome.
On Tue, May 30, 2023 at 02:06:55PM -0600, Yu Zhao wrote:
> Do you mind posting another patch? I'd be happy to ack it, as well as
> the one you suggested above.

I'd rather not take such a patch independent of the test_clear_young
series if you're OK with that. Do you mind implementing something
similar to the above patch w/ the proposed optimization if you need it?
On Wed, May 31, 2023 at 1:28 PM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> I'd rather not take such a patch independent of the test_clear_young
> series if you're OK with that. Do you mind implementing something
> similar to the above patch w/ the proposed optimization if you need it?

No worries. I can take the above together with the following, which
would form a new series with its own merits, since apparently you
think the !AF case is important.

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 26a8d955b49c..6ce73ce9f146 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1453,10 +1453,10 @@ static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
 
 	trace_kvm_access_fault(fault_ipa);
 
-	read_lock(&vcpu->kvm->mmu_lock);
+	rcu_read_lock();
 	mmu = vcpu->arch.hw_mmu;
 	pte = kvm_pgtable_stage2_mkyoung(mmu->pgt, fault_ipa);
-	read_unlock(&vcpu->kvm->mmu_lock);
+	rcu_read_unlock();
 
 	if (kvm_pte_valid(pte))
 		kvm_set_pfn_accessed(kvm_pte_to_pfn(pte));
On Wed, May 31, 2023 at 05:10:52PM -0600, Yu Zhao wrote:
> No worries. I can take the above together with the following, which
> would form a new series with its own merits, since apparently you
> think the !AF case is important.

Sorry if my suggestion was unclear.

I thought we were talking about ->free_removed_table() being called from
the stage-2 unmap path, in which case we wind up unnecessarily visiting
PTEs on a table known to be empty. You could fast-path that by only
initiating a walk if page_count() > 1:

diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 95dae02ccc2e..766563dc465c 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1331,7 +1331,8 @@ void kvm_pgtable_stage2_free_removed(struct kvm_pgtable_mm_ops *mm_ops, void *pg
 		.end	= kvm_granule_size(level),
 	};
 
-	WARN_ON(__kvm_pgtable_walk(&data, mm_ops, ptep, level + 1));
+	if (mm_ops->page_count(pgtable) > 1)
+		WARN_ON(__kvm_pgtable_walk(&data, mm_ops, ptep, level + 1));
 
 	WARN_ON(mm_ops->page_count(pgtable) != 1);
 	mm_ops->put_page(pgtable);

A lock-free access fault walker is interesting, but in my testing it
hasn't led to any significant improvements over acquiring the MMU lock
for read. Because of that I hadn't bothered with posting the series
upstream.
On Wed, May 31, 2023 at 5:23 PM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> Sorry if my suggestion was unclear.
>
> I thought we were talking about ->free_removed_table() being called from
> the stage-2 unmap path

Yes, we were, or in general, about how to make KVM PTs RCU safe for
ARM. So I'm thinking about taking 1) your patch above, 2) what I just
suggested and 3) what you suggested below to form a mini series, which
could land independently and would make my job here easier.

> in which case we wind up unnecessarily visiting
> PTEs on a table known to be empty. You could fast-path that by only
> initiating a walk if page_count() > 1:

Yes, this is what I meant.

> [... diff trimmed ...]
>
> A lock-free access fault walker is interesting, but in my testing it
> hasn't led to any significant improvements over acquiring the MMU lock
> for read. Because of that I hadn't bothered with posting the series
> upstream.

It's hard to measure, but we have perf benchmarks on ChromeOS which
should help.
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index ff520598b62c..5cab52e3a35f 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -153,6 +153,7 @@ static inline bool kvm_level_supports_block_mapping(u32 level)
  * @put_page:			Decrement the refcount on a page. When the
  *				refcount reaches 0 the page is automatically
  *				freed.
+ * @put_page_rcu:		RCU variant of the above.
  * @page_count:			Return the refcount of a page.
  * @phys_to_virt:		Convert a physical address into a virtual
  *				address mapped in the current context.
@@ -170,6 +171,7 @@ struct kvm_pgtable_mm_ops {
 	void		(*free_removed_table)(void *addr, u32 level);
 	void		(*get_page)(void *addr);
 	void		(*put_page)(void *addr);
+	void		(*put_page_rcu)(void *addr);
 	int		(*page_count)(void *addr);
 	void*		(*phys_to_virt)(phys_addr_t phys);
 	phys_addr_t	(*virt_to_phys)(void *addr);
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 14391826241c..ee93271035d9 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -191,6 +191,7 @@ vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
  */
 void kvm_arch_destroy_vm(struct kvm *kvm)
 {
+	kvm_free_stage2_pgd(&kvm->arch.mmu);
 	bitmap_free(kvm->arch.pmu_filter);
 	free_cpumask_var(kvm->arch.supported_cpus);
 
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 24678ccba76a..dbace4c6a841 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -988,8 +988,12 @@ static int stage2_unmap_walker(const struct kvm_pgtable_visit_ctx *ctx,
 		mm_ops->dcache_clean_inval_poc(kvm_pte_follow(ctx->old, mm_ops),
 					       kvm_granule_size(ctx->level));
 
-	if (childp)
-		mm_ops->put_page(childp);
+	if (childp) {
+		if (mm_ops->put_page_rcu)
+			mm_ops->put_page_rcu(childp);
+		else
+			mm_ops->put_page(childp);
+	}
 
 	return 0;
 }
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 3b9d4d24c361..c3b3e2afe26f 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -172,6 +172,21 @@ static int kvm_host_page_count(void *addr)
 	return page_count(virt_to_page(addr));
 }
 
+static void kvm_s2_rcu_put_page(struct rcu_head *head)
+{
+	put_page(container_of(head, struct page, rcu_head));
+}
+
+static void kvm_s2_put_page_rcu(void *addr)
+{
+	struct page *page = virt_to_page(addr);
+
+	if (kvm_host_page_count(addr) == 1)
+		kvm_account_pgtable_pages(addr, -1);
+
+	call_rcu(&page->rcu_head, kvm_s2_rcu_put_page);
+}
+
 static phys_addr_t kvm_host_pa(void *addr)
 {
 	return __pa(addr);
@@ -704,6 +719,7 @@ static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
 	.free_removed_table	= stage2_free_removed_table,
 	.get_page		= kvm_host_get_page,
 	.put_page		= kvm_s2_put_page,
+	.put_page_rcu		= kvm_s2_put_page_rcu,
 	.page_count		= kvm_host_page_count,
 	.phys_to_virt		= kvm_host_va,
 	.virt_to_phys		= kvm_host_pa,
@@ -1877,7 +1893,6 @@ void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen)
 
 void kvm_arch_flush_shadow_all(struct kvm *kvm)
 {
-	kvm_free_stage2_pgd(&kvm->arch.mmu);
 }
 
 void kvm_arch_flush_shadow_memslot(struct kvm *kvm,