[v2,1/2] KVM: x86/mmu: Allocate page table pages on TDP splits during dirty log enable on the underlying page's numa node
Message ID | 20221201195718.1409782-2-vipinsh@google.com
State | New
Series | NUMA aware page table allocation
Commit Message
Vipin Sharma
Dec. 1, 2022, 7:57 p.m. UTC
Huge pages are split when dirty log is enabled. New page table pages are
allocated based on the current thread's NUMA node or mempolicy. This
causes inefficient page table accesses if the underlying page is on a
different NUMA node.
Allocate page table pages on the same NUMA node as the underlying huge
page when dirty log is enabled and huge pages are split.
The performance gain during the pre-copy phase of live migration of a
416-vCPU, 11 TiB memory VM on an 8-node host was in the range of 130%
to 150%.
Suggested-by: David Matlack <dmatlack@google.com>
Signed-off-by: Vipin Sharma <vipinsh@google.com>
---
arch/x86/kvm/mmu.h | 1 +
arch/x86/kvm/mmu/mmu.c | 19 +++++++++++++++++++
arch/x86/kvm/mmu/tdp_mmu.c | 12 ++++++++----
include/linux/kvm_host.h | 15 +++++++++++++++
4 files changed, 43 insertions(+), 4 deletions(-)
Comments
On Thu, Dec 1, 2022 at 11:57 AM Vipin Sharma <vipinsh@google.com> wrote: > > Huge pages are split when dirty log is enabled. New page table pages are > allocated based on the current thread NUMA node or mempolicy. This > causes inefficient page table accesses if underlying page is on a > different NUMA node > > Allocate page table pages on the same NUMA node as the underlying huge > page when dirty log is enabled and huge pages are split. > > The performance gain during the pre-copy phase of live migrations of a > 416 vCPUs and 11 TiB memory VM on a 8 node host was seen in the range > of 130% to 150%. > > Suggested-by: David Matlack <dmatlack@google.com> > Signed-off-by: Vipin Sharma <vipinsh@google.com> > --- > arch/x86/kvm/mmu.h | 1 + > arch/x86/kvm/mmu/mmu.c | 19 +++++++++++++++++++ > arch/x86/kvm/mmu/tdp_mmu.c | 12 ++++++++---- > include/linux/kvm_host.h | 15 +++++++++++++++ > 4 files changed, 43 insertions(+), 4 deletions(-) > > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h > index 6bdaacb6faa0..c960fb096e5c 100644 > --- a/arch/x86/kvm/mmu.h > +++ b/arch/x86/kvm/mmu.h > @@ -119,6 +119,7 @@ void kvm_mmu_unload(struct kvm_vcpu *vcpu); > void kvm_mmu_free_obsolete_roots(struct kvm_vcpu *vcpu); > void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu); > void kvm_mmu_sync_prev_roots(struct kvm_vcpu *vcpu); > +void *kvm_mmu_get_free_page(int nid, gfp_t gfp); > > static inline int kvm_mmu_reload(struct kvm_vcpu *vcpu) > { > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index 4736d7849c60..0554dfc55553 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -90,6 +90,9 @@ __MODULE_PARM_TYPE(nx_huge_pages_recovery_period_ms, "uint"); > static bool __read_mostly force_flush_and_sync_on_reuse; > module_param_named(flush_on_reuse, force_flush_and_sync_on_reuse, bool, 0644); > > +static bool __read_mostly numa_aware_pagetable = true; > +module_param_named(numa_aware_pagetable, numa_aware_pagetable, bool, 0644); > + I'm usually all for having module params to control things, but in this case I don't think it provides much value because whether this NUMA optimization is useful or not is going to depend more on VM size and workload than anything else. If we wanted to make this configurable, a VM capability would probably be a better mechanism so that userspace could leave it off when running small, non-performance-sensitive VMs and turn it on when running large, multi-node VMs. A whole-host module parameter seems overly restrictive. 
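For reference, a per-VM knob of the kind Ben describes would normally be exposed through KVM_ENABLE_CAP rather than a module parameter. The sketch below only illustrates that shape; the capability name KVM_CAP_NUMA_AWARE_PAGE_TABLES, its number, the kvm->arch.numa_aware_page_tables flag, and the extra struct kvm * parameter to the allocator are all hypothetical and not part of this series.

/* Hypothetical sketch -- none of these names exist in the series. */

/* include/uapi/linux/kvm.h */
#define KVM_CAP_NUMA_AWARE_PAGE_TABLES 230	/* placeholder number */

/* arch/x86/kvm/x86.c, inside kvm_vm_ioctl_enable_cap() */
	case KVM_CAP_NUMA_AWARE_PAGE_TABLES:
		/* Per-VM opt-in instead of a whole-host module parameter. */
		kvm->arch.numa_aware_page_tables = true;
		r = 0;
		break;

/* arch/x86/kvm/mmu/mmu.c */
void *kvm_mmu_get_free_page(struct kvm *kvm, int nid, gfp_t gfp)
{
	struct page *spt_page;

	if (!kvm->arch.numa_aware_page_tables || nid == NUMA_NO_NODE)
		return (void *)__get_free_page(gfp);

	spt_page = alloc_pages_node(nid, gfp, 0);
	return spt_page ? page_address(spt_page) : NULL;
}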
> /* > * When setting this variable to true it enables Two-Dimensional-Paging > * where the hardware walks 2 page tables: > @@ -6984,3 +6987,19 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm) > if (kvm->arch.nx_huge_page_recovery_thread) > kthread_stop(kvm->arch.nx_huge_page_recovery_thread); > } > + > +void *kvm_mmu_get_free_page(int nid, gfp_t gfp) > +{ > + struct page *spt_page; > + void *address = NULL; > + > + if (numa_aware_pagetable) { > + spt_page = alloc_pages_node(nid, gfp, 0); > + if (spt_page) > + address = page_address(spt_page); > + } else { > + address = (void *)__get_free_page(gfp); > + } > + > + return address; > +} > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c > index 771210ce5181..1607afbfcc0b 100644 > --- a/arch/x86/kvm/mmu/tdp_mmu.c > +++ b/arch/x86/kvm/mmu/tdp_mmu.c > @@ -1413,7 +1413,7 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm, > return spte_set; > } > > -static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp) > +static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(int nid, gfp_t gfp) > { > struct kvm_mmu_page *sp; > > @@ -1423,7 +1423,8 @@ static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp) > if (!sp) > return NULL; > > - sp->spt = (void *)__get_free_page(gfp); > + sp->spt = kvm_mmu_get_free_page(nid, gfp); > + > if (!sp->spt) { > kmem_cache_free(mmu_page_header_cache, sp); > return NULL; > @@ -1437,6 +1438,9 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm, > bool shared) > { > struct kvm_mmu_page *sp; > + int nid; > + > + nid = kvm_pfn_to_refcounted_page_nid(spte_to_pfn(iter->old_spte)); > > /* > * Since we are allocating while under the MMU lock we have to be > @@ -1447,7 +1451,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm, > * If this allocation fails we drop the lock and retry with reclaim > * allowed. > */ > - sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT); > + sp = __tdp_mmu_alloc_sp_for_split(nid, GFP_NOWAIT | __GFP_ACCOUNT); > if (sp) > return sp; > > @@ -1459,7 +1463,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm, > write_unlock(&kvm->mmu_lock); > > iter->yielded = true; > - sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT); > + sp = __tdp_mmu_alloc_sp_for_split(nid, GFP_KERNEL_ACCOUNT); > > if (shared) > read_lock(&kvm->mmu_lock); > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index 8f874a964313..558ded73f660 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -1596,6 +1596,21 @@ void kvm_arch_sync_events(struct kvm *kvm); > int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu); > > struct page *kvm_pfn_to_refcounted_page(kvm_pfn_t pfn); > + > +/* > + * Returns the nid of a 'struct page' if pfn is valid and backed by a refcounted > + * page, NUMA_NO_NODE otherwise. > + */ > +static inline int kvm_pfn_to_refcounted_page_nid(kvm_pfn_t pfn) > +{ > + struct page *page = kvm_pfn_to_refcounted_page(pfn); > + > + if (page) > + return page_to_nid(page); > + else > + return NUMA_NO_NODE; > +} > + > bool kvm_is_zone_device_page(struct page *page); > > struct kvm_irq_ack_notifier { > -- > 2.39.0.rc0.267.gcb52ba06e7-goog >
Side topic, the shortlog is way, way too long. The purpose of the shortlog is to provide a synopsis of the change, not to describe the change in detail. I also think this patch should be 2/2, with the more generic support added along with the module param (or capability) in 1/2. E.g. to yield something like KVM: x86/mmu: Add a module param to make per-vCPU SPTs NUMA aware KVM: x86/mmu: Honor NUMA awareness for per-VM page table allocations On Mon, Dec 05, 2022, Ben Gardon wrote: > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > > index 4736d7849c60..0554dfc55553 100644 > > --- a/arch/x86/kvm/mmu/mmu.c > > +++ b/arch/x86/kvm/mmu/mmu.c > > @@ -90,6 +90,9 @@ __MODULE_PARM_TYPE(nx_huge_pages_recovery_period_ms, "uint"); > > static bool __read_mostly force_flush_and_sync_on_reuse; > > module_param_named(flush_on_reuse, force_flush_and_sync_on_reuse, bool, 0644); > > > > +static bool __read_mostly numa_aware_pagetable = true; > > +module_param_named(numa_aware_pagetable, numa_aware_pagetable, bool, 0644); > > + > > I'm usually all for having module params to control things, but in > this case I don't think it provides much value because whether this > NUMA optimization is useful or not is going to depend more on VM size > and workload than anything else. If we wanted to make this > configurable, a VM capability would probably be a better mechanism so > that userspace could leave it off when running small, > non-performance-sensitive VMs Would we actually want to turn it off in this case? IIUC, @nid is just the preferred node, i.e. failure to allocate for the preferred @nid will result in falling back to other nodes, not outright failure. So the pathological worst case scenario would be that for a system with VMs that don't care about performance, all of a nodes memory is allocated due to all VMs starting on that node. On the flip side, if a system had a mix of VM shapes, I think we'd want even the performance insensitive VMs to be NUMA aware so that they can be sequestered on their own node(s), i.e. don't "steal" memory from the VMs that are performance sensitive and have been affined to a single node. > and turn it on when running large, multi-node VMs. A whole-host module > parameter seems overly restrictive.
On Mon, Dec 5, 2022 at 10:47 AM Sean Christopherson <seanjc@google.com> wrote: > > Side topic, the shortlog is way, way too long. The purpose of the shortlog is > to provide a synopsis of the change, not to describe the change in detail. > > I also think this patch should be 2/2, with the more generic support added along > with the module param (or capability) in 1/2. E.g. to yield something like > > KVM: x86/mmu: Add a module param to make per-vCPU SPTs NUMA aware > KVM: x86/mmu: Honor NUMA awareness for per-VM page table allocations > > On Mon, Dec 05, 2022, Ben Gardon wrote: > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > > > index 4736d7849c60..0554dfc55553 100644 > > > --- a/arch/x86/kvm/mmu/mmu.c > > > +++ b/arch/x86/kvm/mmu/mmu.c > > > @@ -90,6 +90,9 @@ __MODULE_PARM_TYPE(nx_huge_pages_recovery_period_ms, "uint"); > > > static bool __read_mostly force_flush_and_sync_on_reuse; > > > module_param_named(flush_on_reuse, force_flush_and_sync_on_reuse, bool, 0644); > > > > > > +static bool __read_mostly numa_aware_pagetable = true; > > > +module_param_named(numa_aware_pagetable, numa_aware_pagetable, bool, 0644); > > > + > > > > I'm usually all for having module params to control things, but in > > this case I don't think it provides much value because whether this > > NUMA optimization is useful or not is going to depend more on VM size > > and workload than anything else. If we wanted to make this > > configurable, a VM capability would probably be a better mechanism so > > that userspace could leave it off when running small, > > non-performance-sensitive VMs > > Would we actually want to turn it off in this case? IIUC, @nid is just the > preferred node, i.e. failure to allocate for the preferred @nid will result in > falling back to other nodes, not outright failure. So the pathological worst > case scenario would be that for a system with VMs that don't care about performance, > all of a nodes memory is allocated due to all VMs starting on that node. > > On the flip side, if a system had a mix of VM shapes, I think we'd want even the > performance insensitive VMs to be NUMA aware so that they can be sequestered on > their own node(s), i.e. don't "steal" memory from the VMs that are performance > sensitive and have been affined to a single node. Yeah, the only reason to turn it off would be to save memory. As a strawman, if you had 100 1-vCPU VMs on a 2 node system, you'd have 4000 pages allocated in caches, doing nothing. > > > and turn it on when running large, multi-node VMs. A whole-host module > > parameter seems overly restrictive.
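For scale, one plausible reading of that figure, assuming each vCPU's shadow-page cache can hold KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE (40 on x86) pages: 100 VMs x 1 vCPU x 40 cached pages = 4000 pages, i.e. roughly 16 MiB of 4 KiB pages sitting idle across the host.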
On Mon, Dec 05, 2022, Ben Gardon wrote: > On Mon, Dec 5, 2022 at 10:47 AM Sean Christopherson <seanjc@google.com> wrote: > > > > +static bool __read_mostly numa_aware_pagetable = true; > > > > +module_param_named(numa_aware_pagetable, numa_aware_pagetable, bool, 0644); > > > > + > > > > > > I'm usually all for having module params to control things, but in > > > this case I don't think it provides much value because whether this > > > NUMA optimization is useful or not is going to depend more on VM size > > > and workload than anything else. If we wanted to make this > > > configurable, a VM capability would probably be a better mechanism so > > > that userspace could leave it off when running small, > > > non-performance-sensitive VMs > > > > Would we actually want to turn it off in this case? IIUC, @nid is just the > > preferred node, i.e. failure to allocate for the preferred @nid will result in > > falling back to other nodes, not outright failure. So the pathological worst > > case scenario would be that for a system with VMs that don't care about performance, > > all of a nodes memory is allocated due to all VMs starting on that node. > > > > On the flip side, if a system had a mix of VM shapes, I think we'd want even the > > performance insensitive VMs to be NUMA aware so that they can be sequestered on > > their own node(s), i.e. don't "steal" memory from the VMs that are performance > > sensitive and have been affined to a single node. > > Yeah, the only reason to turn it off would be to save memory. As a > strawman, if you had 100 1-vCPU VMs on a 2 node system, you'd have > 4000 pages allocated in caches, doing nothing. The management of the caches really should be addressed separately, and this is the perfect excuse to finally get some traction on mmu_shrink_scan(). From a while back[1]: : I know we proposed dropping mmu_shrink_scan(), but the more I think about that, : the more I think that an outright drop is wrong. The real issue is that KVM as : quite literally the dumbest possible "algorithm" for zapping possibly-in-use : shadow pages, and doesn't target the zapping to fit the cgroup that's under : pressure. : : So for this, IMO rather than assume that freeing the caches when any memslot : disables dirty logging, I think it makes sense to initially keep the caches and : only free them at VM destruction. Then in follow-up patches, if we want, free : the caches in the mmu_shrink_scan(), and/or add a function hook for toggling : eager_page_split to topup/free the caches accordingly. That gives userspace : explicit control over when the caches are purged, and does the logical thing of : freeing the caches when eager_page_split is disabled. The current mmu_shrink_scan() implementation is basically useless. It's so naive that relying on it to react to memory pressure all but guarantees that guest performance will tank if the shrinker kicks in. E.g. KVM zaps the oldest pages, which are likely upper level SPTEs and thus most likely to be reused. And prior to the TDP MMU, a guest running nested VMs would more than likely zap L1's SPTEs even if the majority of shadow pages were allocated for L2. And as pointed out in the RFC to drop shrinker support entirely[1], for well over a decade KVM zapped a single page in each call from the shrinker, i.e. it was _really_ useless. And although in [1] I said adding memcg would be easy, doing so in a performant way would be quite difficult as the per-cpu counter would need to be made memcg aware (didn't think about that before). 
Lastly, given that I completely broke the shrink logic for ~6 months and someone only noticed when n_max_mmu_pages kicked in (commit 8fc517267fb2 "KVM: x86: Zap the oldest MMU pages, not the newest"), I highly doubt anyone is relying on the current shrinker logic, i.e. KVM_SET_NR_MMU_PAGES is used for setups that actually need to limit the number of MMU pages beyond KVM's default. My vote is to do something like the below: repurpose KVM's shrinker integration to only purge the shadow page caches. That can be done with almost no impact on guest performance, e.g. with a dedicated spinlock, reclaiming from the caches wouldn't even need to kick the vCPU. Supporting reclaim of the caches would allow us to change the cache capacity and/or behavior without having to worry too much about exploding memory, and would make KVM's shrinker support actually usable. [1] https://lore.kernel.org/all/YqzRGj6ryBZfGBSl@google.com [2] https://lore.kernel.org/all/20220503221357.943536-1-vipinsh@google.com --- arch/x86/include/asm/kvm_host.h | 2 + arch/x86/kvm/mmu/mmu.c | 131 ++++++++++++++++---------------- 2 files changed, 69 insertions(+), 64 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 283cbb83d6ae..f123bd985e1a 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -786,6 +786,8 @@ struct kvm_vcpu_arch { struct kvm_mmu_memory_cache mmu_shadowed_info_cache; struct kvm_mmu_memory_cache mmu_page_header_cache; + spinlock_t mmu_shadow_page_cache_lock; + /* * QEMU userspace and the guest each have their own FPU state. * In vcpu_run, we switch between the user and guest FPU contexts. diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 4736d7849c60..fca74cb1f2ce 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -157,7 +157,12 @@ struct kvm_shadow_walk_iterator { static struct kmem_cache *pte_list_desc_cache; struct kmem_cache *mmu_page_header_cache; -static struct percpu_counter kvm_total_used_mmu_pages; + +/* + * The total number of allocated, unused MMU pages, i.e. the total number of + * unused pages sitting in kvm_mmu_memory_cache structures. + */ +static struct percpu_counter kvm_total_unused_mmu_pages; static void mmu_spte_set(u64 *sptep, u64 spte); @@ -643,6 +648,23 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu) } } +static int kvm_mmu_topup_sp_cache(struct kvm_vcpu *vcpu) +{ + struct kvm_mmu_memory_cache *cache = &vcpu->arch.mmu_shadow_page_cache; + int orig_nobjs = cache->nobjs; + int r; + + spin_lock(&vcpu->arch.mmu_shadow_page_cache_lock); + r = kvm_mmu_topup_memory_cache(cache, PT64_ROOT_MAX_LEVEL); + + if (cache->nobjs != orig_nobjs) + percpu_counter_add(&kvm_total_unused_mmu_pages, + cache->nobjs - orig_nobjs); + spin_unlock(&vcpu->arch.mmu_shadow_page_cache_lock); + + return r; +} + static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect) { int r; @@ -652,10 +674,11 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect) 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM); if (r) return r; - r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache, - PT64_ROOT_MAX_LEVEL); + + r = kvm_mmu_topup_sp_cache(vcpu); if (r) return r; + if (maybe_indirect) { r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadowed_info_cache, PT64_ROOT_MAX_LEVEL); @@ -1681,28 +1704,23 @@ static int is_empty_shadow_page(u64 *spt) } #endif -/* - * This value is the sum of all of the kvm instances's - * kvm->arch.n_used_mmu_pages values. 
We need a global, - * aggregate version in order to make the slab shrinker - * faster - */ -static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr) -{ - kvm->arch.n_used_mmu_pages += nr; - percpu_counter_add(&kvm_total_used_mmu_pages, nr); -} - static void kvm_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp) { - kvm_mod_used_mmu_pages(kvm, +1); + kvm->arch.n_used_mmu_pages++; kvm_account_pgtable_pages((void *)sp->spt, +1); + + percpu_counter_dec(&kvm_total_unused_mmu_pages); } static void kvm_unaccount_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp) { - kvm_mod_used_mmu_pages(kvm, -1); + kvm->arch.n_used_mmu_pages--; kvm_account_pgtable_pages((void *)sp->spt, -1); + + /* + * KVM doesn't put freed pages back into the cache, thus freeing a page + * doesn't affect the number of unused MMU pages. + */ } static void kvm_mmu_free_shadow_page(struct kvm_mmu_page *sp) @@ -5859,6 +5877,7 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu) vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO; vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO; + spin_lock_init(&vcpu->arch.mmu_shadow_page_cache_lock); vcpu->arch.mmu = &vcpu->arch.root_mmu; vcpu->arch.walk_mmu = &vcpu->arch.root_mmu; @@ -5994,11 +6013,6 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm) kvm_tdp_mmu_zap_invalidated_roots(kvm); } -static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm) -{ - return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages)); -} - static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm, struct kvm_memory_slot *slot, struct kvm_page_track_notifier_node *node) @@ -6583,69 +6597,58 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen) } } -static unsigned long -mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc) +static unsigned long mmu_shrink_scan(struct shrinker *shrink, + struct shrink_control *sc) { + struct kvm_mmu_memory_cache *cache; + struct kvm_vcpu *vcpu; struct kvm *kvm; - int nr_to_scan = sc->nr_to_scan; unsigned long freed = 0; + unsigned long i; mutex_lock(&kvm_lock); list_for_each_entry(kvm, &vm_list, vm_list) { - int idx; - LIST_HEAD(invalid_list); + kvm_for_each_vcpu(i, vcpu, kvm) { + cache = &vcpu->arch.mmu_shadow_page_cache; - /* - * Never scan more than sc->nr_to_scan VM instances. - * Will not hit this condition practically since we do not try - * to shrink more than one VM and it is very unlikely to see - * !n_used_mmu_pages so many times. - */ - if (!nr_to_scan--) - break; - /* - * n_used_mmu_pages is accessed without holding kvm->mmu_lock - * here. We may skip a VM instance errorneosly, but we do not - * want to shrink a VM that only started to populate its MMU - * anyway. 
- */ - if (!kvm->arch.n_used_mmu_pages && - !kvm_has_zapped_obsolete_pages(kvm)) - continue; + if (!READ_ONCE(cache->nobjs)) + continue; - idx = srcu_read_lock(&kvm->srcu); - write_lock(&kvm->mmu_lock); - - if (kvm_has_zapped_obsolete_pages(kvm)) { - kvm_mmu_commit_zap_page(kvm, - &kvm->arch.zapped_obsolete_pages); - goto unlock; + spin_lock(&vcpu->arch.mmu_shadow_page_cache_lock); + while (cache->nobjs) { + free_page((unsigned long)cache->objects[--cache->nobjs]); + ++freed; + } + spin_unlock(&vcpu->arch.mmu_shadow_page_cache_lock); } - freed = kvm_mmu_zap_oldest_mmu_pages(kvm, sc->nr_to_scan); - -unlock: - write_unlock(&kvm->mmu_lock); - srcu_read_unlock(&kvm->srcu, idx); - /* * unfair on small ones * per-vm shrinkers cry out * sadness comes quickly */ list_move_tail(&kvm->vm_list, &vm_list); - break; + + /* + * Process all vCPUs before checking if enough pages have been + * freed as a very naive way of providing fairness, e.g. to + * avoid starving a single vCPU. + */ + if (freed >= sc->nr_to_scan) + break; } mutex_unlock(&kvm_lock); + + sc->nr_scanned = freed; return freed; } -static unsigned long -mmu_shrink_count(struct shrinker *shrink, struct shrink_control *sc) +static unsigned long mmu_shrink_count(struct shrinker *shrink, + struct shrink_control *sc) { - return percpu_counter_read_positive(&kvm_total_used_mmu_pages); + return percpu_counter_read_positive(&kvm_total_unused_mmu_pages); } static struct shrinker mmu_shrinker = { @@ -6753,7 +6756,7 @@ int kvm_mmu_vendor_module_init(void) if (!mmu_page_header_cache) goto out; - if (percpu_counter_init(&kvm_total_used_mmu_pages, 0, GFP_KERNEL)) + if (percpu_counter_init(&kvm_total_unused_mmu_pages, 0, GFP_KERNEL)) goto out; ret = register_shrinker(&mmu_shrinker, "x86-mmu"); @@ -6763,7 +6766,7 @@ int kvm_mmu_vendor_module_init(void) return 0; out_shrinker: - percpu_counter_destroy(&kvm_total_used_mmu_pages); + percpu_counter_destroy(&kvm_total_unused_mmu_pages); out: mmu_destroy_caches(); return ret; @@ -6780,7 +6783,7 @@ void kvm_mmu_destroy(struct kvm_vcpu *vcpu) void kvm_mmu_vendor_module_exit(void) { mmu_destroy_caches(); - percpu_counter_destroy(&kvm_total_used_mmu_pages); + percpu_counter_destroy(&kvm_total_unused_mmu_pages); unregister_shrinker(&mmu_shrinker); } base-commit: d709db0a0c05a6a2973655ed88690d4d43128d60 --
On Mon, Dec 5, 2022 at 1:07 PM Sean Christopherson <seanjc@google.com> wrote: > > On Mon, Dec 05, 2022, Ben Gardon wrote: > > On Mon, Dec 5, 2022 at 10:47 AM Sean Christopherson <seanjc@google.com> wrote: > > > > > +static bool __read_mostly numa_aware_pagetable = true; > > > > > +module_param_named(numa_aware_pagetable, numa_aware_pagetable, bool, 0644); > > > > > + > > > > > > > > I'm usually all for having module params to control things, but in > > > > this case I don't think it provides much value because whether this > > > > NUMA optimization is useful or not is going to depend more on VM size > > > > and workload than anything else. If we wanted to make this > > > > configurable, a VM capability would probably be a better mechanism so > > > > that userspace could leave it off when running small, > > > > non-performance-sensitive VMs > > > > > > Would we actually want to turn it off in this case? IIUC, @nid is just the > > > preferred node, i.e. failure to allocate for the preferred @nid will result in > > > falling back to other nodes, not outright failure. So the pathological worst > > > case scenario would be that for a system with VMs that don't care about performance, > > > all of a nodes memory is allocated due to all VMs starting on that node. > > > > > > On the flip side, if a system had a mix of VM shapes, I think we'd want even the > > > performance insensitive VMs to be NUMA aware so that they can be sequestered on > > > their own node(s), i.e. don't "steal" memory from the VMs that are performance > > > sensitive and have been affined to a single node. > > > > Yeah, the only reason to turn it off would be to save memory. As a > > strawman, if you had 100 1-vCPU VMs on a 2 node system, you'd have > > 4000 pages allocated in caches, doing nothing. > I will keep it as a module parameter for now as Sean noted we don't want insensitive VMs to impact the performance sensitive VMs. Also, as noted below, changing shrinker behaviour to free cache will allow to reclaim memory of unused node. If there is a better reason for using capability let me know we can discuss more about it. > The management of the caches really should be addressed separately, and this is > the perfect excuse to finally get some traction on mmu_shrink_scan(). > > From a while back[1]: > > : I know we proposed dropping mmu_shrink_scan(), but the more I think about that, > : the more I think that an outright drop is wrong. The real issue is that KVM as > : quite literally the dumbest possible "algorithm" for zapping possibly-in-use > : shadow pages, and doesn't target the zapping to fit the cgroup that's under > : pressure. > : > : So for this, IMO rather than assume that freeing the caches when any memslot > : disables dirty logging, I think it makes sense to initially keep the caches and > : only free them at VM destruction. Then in follow-up patches, if we want, free > : the caches in the mmu_shrink_scan(), and/or add a function hook for toggling > : eager_page_split to topup/free the caches accordingly. That gives userspace > : explicit control over when the caches are purged, and does the logical thing of > : freeing the caches when eager_page_split is disabled. > > The current mmu_shrink_scan() implementation is basically useless. It's so naive > that relying on it to react to memory pressure all but guarantees that guest > performance will tank if the shrinker kicks in. E.g. KVM zaps the oldest pages, > which are likely upper level SPTEs and thus most likely to be reused. 
And prior > to the TDP MMU, a guest running nested VMs would more than likely zap L1's SPTEs > even if the majority of shadow pages were allocated for L2. > > And as pointed out in the RFC to drop shrinker support entirely[1], for well over > a decade KVM zapped a single page in each call from the shrinker, i.e. it was > _really_ useless. And although in [1] I said adding memcg would be easy, doing > so in a performant way would be quite difficult as the per-cpu counter would need > to be made memcg aware (didn't think about that before). > > Lastly, given that I completely broke the shrink logic for ~6 months and someone > only noticed when n_max_mmu_pages kicked in (commit 8fc517267fb2 "KVM: x86: Zap > the oldest MMU pages, not the newest"), I highly doubt anyone is relying on the > current shrinker logic, i.e. KVM_SET_NR_MMU_PAGES is used for setups that actually > need to limit the number of MMU pages beyond KVM's default. > > My vote is to do something like the below: repurpose KVM's shrinker integration > to only purge the shadow page caches. That can be done with almost no impact on > guest performance, e.g. with a dedicated spinlock, reclaiming from the caches > wouldn't even need to kick the vCPU. Supporting reclaim of the caches would allow > us to change the cache capacity and/or behavior without having to worry too much > about exploding memory, and would make KVM's shrinker support actually usable. > > [1] https://lore.kernel.org/all/YqzRGj6ryBZfGBSl@google.com > [2] https://lore.kernel.org/all/20220503221357.943536-1-vipinsh@google.com > As discussed offline. 1. I will create a patch in this series which changes the behaviour of shrinker by freeing the cache. 2. One suggestion came up was to fill the cache of the correct node only in the topup stage by using the PFN after kvm_faulin_pfn(). We decided it is not good because mmu_topup_memory_cache() cannot be called after kvm_faultin_pf() as topup will increase the time between reading the mmu_invalidate_seq and checking it in is_page_fault_stale() which will increase chances of retrying page fault and reducing the guest performance. 3. I will reduce the cache size and try to see if there is any impact observed. > --- > arch/x86/include/asm/kvm_host.h | 2 + > arch/x86/kvm/mmu/mmu.c | 131 ++++++++++++++++---------------- > 2 files changed, 69 insertions(+), 64 deletions(-) > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > index 283cbb83d6ae..f123bd985e1a 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -786,6 +786,8 @@ struct kvm_vcpu_arch { > struct kvm_mmu_memory_cache mmu_shadowed_info_cache; > struct kvm_mmu_memory_cache mmu_page_header_cache; > > + spinlock_t mmu_shadow_page_cache_lock; > + > /* > * QEMU userspace and the guest each have their own FPU state. > * In vcpu_run, we switch between the user and guest FPU contexts. > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index 4736d7849c60..fca74cb1f2ce 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -157,7 +157,12 @@ struct kvm_shadow_walk_iterator { > > static struct kmem_cache *pte_list_desc_cache; > struct kmem_cache *mmu_page_header_cache; > -static struct percpu_counter kvm_total_used_mmu_pages; > + > +/* > + * The total number of allocated, unused MMU pages, i.e. the total number of > + * unused pages sitting in kvm_mmu_memory_cache structures. 
> + */ > +static struct percpu_counter kvm_total_unused_mmu_pages; > > static void mmu_spte_set(u64 *sptep, u64 spte); > > @@ -643,6 +648,23 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu) > } > } > > +static int kvm_mmu_topup_sp_cache(struct kvm_vcpu *vcpu) > +{ > + struct kvm_mmu_memory_cache *cache = &vcpu->arch.mmu_shadow_page_cache; > + int orig_nobjs = cache->nobjs; > + int r; > + > + spin_lock(&vcpu->arch.mmu_shadow_page_cache_lock); > + r = kvm_mmu_topup_memory_cache(cache, PT64_ROOT_MAX_LEVEL); > + > + if (cache->nobjs != orig_nobjs) > + percpu_counter_add(&kvm_total_unused_mmu_pages, > + cache->nobjs - orig_nobjs); > + spin_unlock(&vcpu->arch.mmu_shadow_page_cache_lock); > + > + return r; > +} > + > static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect) > { > int r; > @@ -652,10 +674,11 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect) > 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM); > if (r) > return r; > - r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache, > - PT64_ROOT_MAX_LEVEL); > + > + r = kvm_mmu_topup_sp_cache(vcpu); > if (r) > return r; > + > if (maybe_indirect) { > r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadowed_info_cache, > PT64_ROOT_MAX_LEVEL); > @@ -1681,28 +1704,23 @@ static int is_empty_shadow_page(u64 *spt) > } > #endif > > -/* > - * This value is the sum of all of the kvm instances's > - * kvm->arch.n_used_mmu_pages values. We need a global, > - * aggregate version in order to make the slab shrinker > - * faster > - */ > -static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr) > -{ > - kvm->arch.n_used_mmu_pages += nr; > - percpu_counter_add(&kvm_total_used_mmu_pages, nr); > -} > - > static void kvm_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp) > { > - kvm_mod_used_mmu_pages(kvm, +1); > + kvm->arch.n_used_mmu_pages++; > kvm_account_pgtable_pages((void *)sp->spt, +1); > + > + percpu_counter_dec(&kvm_total_unused_mmu_pages); > } > > static void kvm_unaccount_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp) > { > - kvm_mod_used_mmu_pages(kvm, -1); > + kvm->arch.n_used_mmu_pages--; > kvm_account_pgtable_pages((void *)sp->spt, -1); > + > + /* > + * KVM doesn't put freed pages back into the cache, thus freeing a page > + * doesn't affect the number of unused MMU pages. 
> + */ > } > > static void kvm_mmu_free_shadow_page(struct kvm_mmu_page *sp) > @@ -5859,6 +5877,7 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu) > vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO; > > vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO; > + spin_lock_init(&vcpu->arch.mmu_shadow_page_cache_lock); > > vcpu->arch.mmu = &vcpu->arch.root_mmu; > vcpu->arch.walk_mmu = &vcpu->arch.root_mmu; > @@ -5994,11 +6013,6 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm) > kvm_tdp_mmu_zap_invalidated_roots(kvm); > } > > -static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm) > -{ > - return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages)); > -} > - > static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm, > struct kvm_memory_slot *slot, > struct kvm_page_track_notifier_node *node) > @@ -6583,69 +6597,58 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen) > } > } > > -static unsigned long > -mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc) > +static unsigned long mmu_shrink_scan(struct shrinker *shrink, > + struct shrink_control *sc) > { > + struct kvm_mmu_memory_cache *cache; > + struct kvm_vcpu *vcpu; > struct kvm *kvm; > - int nr_to_scan = sc->nr_to_scan; > unsigned long freed = 0; > + unsigned long i; > > mutex_lock(&kvm_lock); > > list_for_each_entry(kvm, &vm_list, vm_list) { > - int idx; > - LIST_HEAD(invalid_list); > + kvm_for_each_vcpu(i, vcpu, kvm) { > + cache = &vcpu->arch.mmu_shadow_page_cache; > > - /* > - * Never scan more than sc->nr_to_scan VM instances. > - * Will not hit this condition practically since we do not try > - * to shrink more than one VM and it is very unlikely to see > - * !n_used_mmu_pages so many times. > - */ > - if (!nr_to_scan--) > - break; > - /* > - * n_used_mmu_pages is accessed without holding kvm->mmu_lock > - * here. We may skip a VM instance errorneosly, but we do not > - * want to shrink a VM that only started to populate its MMU > - * anyway. > - */ > - if (!kvm->arch.n_used_mmu_pages && > - !kvm_has_zapped_obsolete_pages(kvm)) > - continue; > + if (!READ_ONCE(cache->nobjs)) > + continue; > > - idx = srcu_read_lock(&kvm->srcu); > - write_lock(&kvm->mmu_lock); > - > - if (kvm_has_zapped_obsolete_pages(kvm)) { > - kvm_mmu_commit_zap_page(kvm, > - &kvm->arch.zapped_obsolete_pages); > - goto unlock; > + spin_lock(&vcpu->arch.mmu_shadow_page_cache_lock); > + while (cache->nobjs) { > + free_page((unsigned long)cache->objects[--cache->nobjs]); > + ++freed; > + } > + spin_unlock(&vcpu->arch.mmu_shadow_page_cache_lock); > } > > - freed = kvm_mmu_zap_oldest_mmu_pages(kvm, sc->nr_to_scan); > - > -unlock: > - write_unlock(&kvm->mmu_lock); > - srcu_read_unlock(&kvm->srcu, idx); > - > /* > * unfair on small ones > * per-vm shrinkers cry out > * sadness comes quickly > */ > list_move_tail(&kvm->vm_list, &vm_list); > - break; > + > + /* > + * Process all vCPUs before checking if enough pages have been > + * freed as a very naive way of providing fairness, e.g. to > + * avoid starving a single vCPU. 
> + */ > + if (freed >= sc->nr_to_scan) > + break; > } > > mutex_unlock(&kvm_lock); > + > + sc->nr_scanned = freed; > return freed; > } > > -static unsigned long > -mmu_shrink_count(struct shrinker *shrink, struct shrink_control *sc) > +static unsigned long mmu_shrink_count(struct shrinker *shrink, > + struct shrink_control *sc) > { > - return percpu_counter_read_positive(&kvm_total_used_mmu_pages); > + return percpu_counter_read_positive(&kvm_total_unused_mmu_pages); > } > > static struct shrinker mmu_shrinker = { > @@ -6753,7 +6756,7 @@ int kvm_mmu_vendor_module_init(void) > if (!mmu_page_header_cache) > goto out; > > - if (percpu_counter_init(&kvm_total_used_mmu_pages, 0, GFP_KERNEL)) > + if (percpu_counter_init(&kvm_total_unused_mmu_pages, 0, GFP_KERNEL)) > goto out; > > ret = register_shrinker(&mmu_shrinker, "x86-mmu"); > @@ -6763,7 +6766,7 @@ int kvm_mmu_vendor_module_init(void) > return 0; > > out_shrinker: > - percpu_counter_destroy(&kvm_total_used_mmu_pages); > + percpu_counter_destroy(&kvm_total_unused_mmu_pages); > out: > mmu_destroy_caches(); > return ret; > @@ -6780,7 +6783,7 @@ void kvm_mmu_destroy(struct kvm_vcpu *vcpu) > void kvm_mmu_vendor_module_exit(void) > { > mmu_destroy_caches(); > - percpu_counter_destroy(&kvm_total_used_mmu_pages); > + percpu_counter_destroy(&kvm_total_unused_mmu_pages); > unregister_shrinker(&mmu_shrinker); > } > > > base-commit: d709db0a0c05a6a2973655ed88690d4d43128d60 > --
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h index 6bdaacb6faa0..c960fb096e5c 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -119,6 +119,7 @@ void kvm_mmu_unload(struct kvm_vcpu *vcpu); void kvm_mmu_free_obsolete_roots(struct kvm_vcpu *vcpu); void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu); void kvm_mmu_sync_prev_roots(struct kvm_vcpu *vcpu); +void *kvm_mmu_get_free_page(int nid, gfp_t gfp); static inline int kvm_mmu_reload(struct kvm_vcpu *vcpu) { diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 4736d7849c60..0554dfc55553 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -90,6 +90,9 @@ __MODULE_PARM_TYPE(nx_huge_pages_recovery_period_ms, "uint"); static bool __read_mostly force_flush_and_sync_on_reuse; module_param_named(flush_on_reuse, force_flush_and_sync_on_reuse, bool, 0644); +static bool __read_mostly numa_aware_pagetable = true; +module_param_named(numa_aware_pagetable, numa_aware_pagetable, bool, 0644); + /* * When setting this variable to true it enables Two-Dimensional-Paging * where the hardware walks 2 page tables: @@ -6984,3 +6987,19 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm) if (kvm->arch.nx_huge_page_recovery_thread) kthread_stop(kvm->arch.nx_huge_page_recovery_thread); } + +void *kvm_mmu_get_free_page(int nid, gfp_t gfp) +{ + struct page *spt_page; + void *address = NULL; + + if (numa_aware_pagetable) { + spt_page = alloc_pages_node(nid, gfp, 0); + if (spt_page) + address = page_address(spt_page); + } else { + address = (void *)__get_free_page(gfp); + } + + return address; +} diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 771210ce5181..1607afbfcc0b 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -1413,7 +1413,7 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm, return spte_set; } -static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp) +static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(int nid, gfp_t gfp) { struct kvm_mmu_page *sp; @@ -1423,7 +1423,8 @@ static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp) if (!sp) return NULL; - sp->spt = (void *)__get_free_page(gfp); + sp->spt = kvm_mmu_get_free_page(nid, gfp); + if (!sp->spt) { kmem_cache_free(mmu_page_header_cache, sp); return NULL; @@ -1437,6 +1438,9 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm, bool shared) { struct kvm_mmu_page *sp; + int nid; + + nid = kvm_pfn_to_refcounted_page_nid(spte_to_pfn(iter->old_spte)); /* * Since we are allocating while under the MMU lock we have to be @@ -1447,7 +1451,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm, * If this allocation fails we drop the lock and retry with reclaim * allowed. 
*/ - sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT); + sp = __tdp_mmu_alloc_sp_for_split(nid, GFP_NOWAIT | __GFP_ACCOUNT); if (sp) return sp; @@ -1459,7 +1463,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm, write_unlock(&kvm->mmu_lock); iter->yielded = true; - sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT); + sp = __tdp_mmu_alloc_sp_for_split(nid, GFP_KERNEL_ACCOUNT); if (shared) read_lock(&kvm->mmu_lock); diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 8f874a964313..558ded73f660 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -1596,6 +1596,21 @@ void kvm_arch_sync_events(struct kvm *kvm); int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu); struct page *kvm_pfn_to_refcounted_page(kvm_pfn_t pfn); + +/* + * Returns the nid of a 'struct page' if pfn is valid and backed by a refcounted + * page, NUMA_NO_NODE otherwise. + */ +static inline int kvm_pfn_to_refcounted_page_nid(kvm_pfn_t pfn) +{ + struct page *page = kvm_pfn_to_refcounted_page(pfn); + + if (page) + return page_to_nid(page); + else + return NUMA_NO_NODE; +} + bool kvm_is_zone_device_page(struct page *page); struct kvm_irq_ack_notifier {
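Usage note on the knob as posted: since the parameter is registered with 0644 permissions in code built into kvm.ko, it should be settable at load time (e.g. modprobe kvm numa_aware_pagetable=N, or kvm.numa_aware_pagetable=0 on the kernel command line for a built-in kvm) and writable at runtime via /sys/module/kvm/parameters/numa_aware_pagetable.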