From patchwork Thu Dec 1 19:57:17 2022
X-Patchwork-Submitter: Vipin Sharma
X-Patchwork-Id: 28540
Date: Thu, 1 Dec 2022 11:57:17 -0800
In-Reply-To: <20221201195718.1409782-1-vipinsh@google.com>
References: <20221201195718.1409782-1-vipinsh@google.com>
Message-ID: <20221201195718.1409782-2-vipinsh@google.com>
Subject: [Patch v2 1/2] KVM: x86/mmu: Allocate page table pages on TDP splits
 during dirty log enable on the underlying page's numa node
From: Vipin Sharma
To: dmatlack@google.com, bgardon@google.com, seanjc@google.com,
 pbonzini@redhat.com
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Vipin Sharma
Huge pages are split when dirty log is enabled. New page table pages are
allocated based on the current thread's NUMA node or mempolicy. This causes
inefficient page table accesses if the underlying page is on a different
NUMA node.

Allocate page table pages on the same NUMA node as the underlying huge page
when dirty log is enabled and huge pages are split.

The performance gain during the pre-copy phase of live migration of a
416-vCPU, 11 TiB memory VM on an 8-node host was in the range of 130% to
150%.

Suggested-by: David Matlack
Signed-off-by: Vipin Sharma
---
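Note (not part of the applied patch): the allocation policy in the diff below
boils down to the condensed sketch that follows. The name alloc_spt_page() is
illustrative only; the helper the patch actually adds is
kvm_mmu_get_free_page() in arch/x86/kvm/mmu/mmu.c, and alloc_pages_node()
already falls back to the local memory node when passed NUMA_NO_NODE.

	#include <linux/gfp.h>
	#include <linux/mm.h>

	static bool numa_aware_pagetable = true;	/* module param in the patch */

	/*
	 * Prefer an order-0 page on the node that backs the huge page being
	 * split; keep the old node-agnostic allocation when the knob is off.
	 */
	static void *alloc_spt_page(int nid, gfp_t gfp)
	{
		struct page *spt_page;

		if (numa_aware_pagetable) {
			spt_page = alloc_pages_node(nid, gfp, 0);
			return spt_page ? page_address(spt_page) : NULL;
		}

		return (void *)__get_free_page(gfp);
	}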
 arch/x86/kvm/mmu.h         |  1 +
 arch/x86/kvm/mmu/mmu.c     | 19 +++++++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.c | 12 ++++++++----
 include/linux/kvm_host.h   | 15 +++++++++++++++
 4 files changed, 43 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 6bdaacb6faa0..c960fb096e5c 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -119,6 +119,7 @@ void kvm_mmu_unload(struct kvm_vcpu *vcpu);
 void kvm_mmu_free_obsolete_roots(struct kvm_vcpu *vcpu);
 void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu);
 void kvm_mmu_sync_prev_roots(struct kvm_vcpu *vcpu);
+void *kvm_mmu_get_free_page(int nid, gfp_t gfp);
 
 static inline int kvm_mmu_reload(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4736d7849c60..0554dfc55553 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -90,6 +90,9 @@ __MODULE_PARM_TYPE(nx_huge_pages_recovery_period_ms, "uint");
 static bool __read_mostly force_flush_and_sync_on_reuse;
 module_param_named(flush_on_reuse, force_flush_and_sync_on_reuse, bool, 0644);
 
+static bool __read_mostly numa_aware_pagetable = true;
+module_param_named(numa_aware_pagetable, numa_aware_pagetable, bool, 0644);
+
 /*
  * When setting this variable to true it enables Two-Dimensional-Paging
  * where the hardware walks 2 page tables:
@@ -6984,3 +6987,19 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
 	if (kvm->arch.nx_huge_page_recovery_thread)
 		kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
 }
+
+void *kvm_mmu_get_free_page(int nid, gfp_t gfp)
+{
+	struct page *spt_page;
+	void *address = NULL;
+
+	if (numa_aware_pagetable) {
+		spt_page = alloc_pages_node(nid, gfp, 0);
+		if (spt_page)
+			address = page_address(spt_page);
+	} else {
+		address = (void *)__get_free_page(gfp);
+	}
+
+	return address;
+}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 771210ce5181..1607afbfcc0b 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1413,7 +1413,7 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
 	return spte_set;
 }
 
-static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
+static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(int nid, gfp_t gfp)
 {
 	struct kvm_mmu_page *sp;
 
@@ -1423,7 +1423,8 @@ static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
 	if (!sp)
 		return NULL;
 
-	sp->spt = (void *)__get_free_page(gfp);
+	sp->spt = kvm_mmu_get_free_page(nid, gfp);
+
 	if (!sp->spt) {
 		kmem_cache_free(mmu_page_header_cache, sp);
 		return NULL;
@@ -1437,6 +1438,9 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
 						       bool shared)
 {
 	struct kvm_mmu_page *sp;
+	int nid;
+
+	nid = kvm_pfn_to_refcounted_page_nid(spte_to_pfn(iter->old_spte));
 
 	/*
 	 * Since we are allocating while under the MMU lock we have to be
@@ -1447,7 +1451,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
 	 * If this allocation fails we drop the lock and retry with reclaim
 	 * allowed.
 	 */
-	sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT);
+	sp = __tdp_mmu_alloc_sp_for_split(nid, GFP_NOWAIT | __GFP_ACCOUNT);
 	if (sp)
 		return sp;
 
@@ -1459,7 +1463,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
 		write_unlock(&kvm->mmu_lock);
 
 	iter->yielded = true;
-	sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT);
+	sp = __tdp_mmu_alloc_sp_for_split(nid, GFP_KERNEL_ACCOUNT);
 
 	if (shared)
 		read_lock(&kvm->mmu_lock);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8f874a964313..558ded73f660 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1596,6 +1596,21 @@ void kvm_arch_sync_events(struct kvm *kvm);
 int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu);
 
 struct page *kvm_pfn_to_refcounted_page(kvm_pfn_t pfn);
+
+/*
+ * Returns the nid of a 'struct page' if pfn is valid and backed by a refcounted
+ * page, NUMA_NO_NODE otherwise.
+ */
+static inline int kvm_pfn_to_refcounted_page_nid(kvm_pfn_t pfn)
+{
+	struct page *page = kvm_pfn_to_refcounted_page(pfn);
+
+	if (page)
+		return page_to_nid(page);
+	else
+		return NUMA_NO_NODE;
+}
+
 bool kvm_is_zone_device_page(struct page *page);
 
 struct kvm_irq_ack_notifier {

From patchwork Thu Dec 1 19:57:18 2022
X-Patchwork-Submitter: Vipin Sharma
X-Patchwork-Id: 28541
Date: Thu, 1 Dec 2022 11:57:18 -0800
In-Reply-To: <20221201195718.1409782-1-vipinsh@google.com>
References: <20221201195718.1409782-1-vipinsh@google.com>
Message-ID: <20221201195718.1409782-3-vipinsh@google.com>
Subject: [Patch v2 2/2] KVM: x86/mmu: Allocate page table pages on NUMA node
 of underlying pages
From: Vipin Sharma
To: dmatlack@google.com, bgardon@google.com, seanjc@google.com,
 pbonzini@redhat.com
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Vipin Sharma

Page table pages of a VM are currently allocated based on the current task's
NUMA node or its mempolicy. This can cause suboptimal remote accesses by a
vCPU that is accessing physical pages local to its NUMA node while the page
table pages mapping those physical pages were created by some other vCPU
which was on a different NUMA node or had a different mempolicy.

Allocate page table pages on the same NUMA node where the underlying physical
page resides. Page tables at levels 5, 4, and 3 might not end up on the same
NUMA node as the pages they map, since the regions they cover can span
multiple NUMA nodes.

Signed-off-by: Vipin Sharma
---
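Note (not part of the applied patch): the cache selection in the diff below
reduces to the condensed sketch that follows. struct per_node_caches and
pick_shadow_page_cache() are illustrative names only; the patch itself indexes
vcpu->arch.mmu_shadow_page_cache[] and kvm->arch.split_shadow_page_cache[] by
the node of the page that the new page table page will map.

	#include <linux/kvm_types.h>
	#include <linux/numa.h>
	#include <linux/topology.h>

	/* One shadow-page cache per NUMA node. */
	struct per_node_caches {
		struct kvm_mmu_memory_cache shadow_page_cache[MAX_NUMNODES];
	};

	static struct kvm_mmu_memory_cache *
	pick_shadow_page_cache(struct per_node_caches *c, int nid)
	{
		/*
		 * nid == NUMA_NO_NODE means the pfn is not backed by a
		 * refcounted page; fall back to the local memory node.
		 */
		if (nid == NUMA_NO_NODE)
			nid = numa_mem_id();

		return &c->shadow_page_cache[nid];
	}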
 arch/x86/include/asm/kvm_host.h |   4 +-
 arch/x86/kvm/mmu.h              |   1 -
 arch/x86/kvm/mmu/mmu.c          | 109 ++++++++++++++++++++++----------
 arch/x86/kvm/mmu/paging_tmpl.h  |   4 +-
 arch/x86/kvm/mmu/tdp_mmu.c      |  16 +++--
 include/linux/kvm_host.h        |   2 +
 include/linux/kvm_types.h       |   2 +
 virt/kvm/kvm_main.c             |   7 +-
 8 files changed, 101 insertions(+), 44 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 283cbb83d6ae..8a0293326abc 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -782,7 +782,7 @@ struct kvm_vcpu_arch {
 	struct kvm_mmu *walk_mmu;
 
 	struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
-	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
+	struct kvm_mmu_memory_cache mmu_shadow_page_cache[MAX_NUMNODES];
 	struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
 	struct kvm_mmu_memory_cache mmu_page_header_cache;
 
@@ -1415,7 +1415,7 @@ struct kvm_arch {
 	 *
 	 * Protected by kvm->slots_lock.
 	 */
-	struct kvm_mmu_memory_cache split_shadow_page_cache;
+	struct kvm_mmu_memory_cache split_shadow_page_cache[MAX_NUMNODES];
 	struct kvm_mmu_memory_cache split_page_header_cache;
 
 	/*
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index c960fb096e5c..6bdaacb6faa0 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -119,7 +119,6 @@ void kvm_mmu_unload(struct kvm_vcpu *vcpu);
 void kvm_mmu_free_obsolete_roots(struct kvm_vcpu *vcpu);
 void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu);
 void kvm_mmu_sync_prev_roots(struct kvm_vcpu *vcpu);
-void *kvm_mmu_get_free_page(int nid, gfp_t gfp);
 
 static inline int kvm_mmu_reload(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0554dfc55553..ff7b17af8ab8 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -648,31 +648,43 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
 
 static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
 {
-	int r;
+	int r, i;
 
 	/* 1 rmap, 1 parent PTE per level, and the prefetched rmaps. */
 	r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
 				       1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
 	if (r)
 		return r;
-	r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
-				       PT64_ROOT_MAX_LEVEL);
-	if (r)
-		return r;
+
+	for (i = 0; i < MAX_NUMNODES; i++) {
+		if (node_online(i)) {
+			r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache[i],
+						       PT64_ROOT_MAX_LEVEL);
+			if (r)
+				return r;
+		}
+	}
+
 	if (maybe_indirect) {
 		r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadowed_info_cache,
 					       PT64_ROOT_MAX_LEVEL);
 		if (r)
 			return r;
 	}
+
 	return kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_page_header_cache,
 					  PT64_ROOT_MAX_LEVEL);
 }
 
 static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
 {
+	int i;
+
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
-	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
+
+	for (i = 0; i < MAX_NUMNODES; i++)
+		kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache[i]);
+
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
 }
@@ -2203,13 +2215,17 @@ static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
 
 static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
 						    gfn_t gfn,
-						    union kvm_mmu_page_role role)
+						    union kvm_mmu_page_role role,
+						    int nid)
 {
-	struct shadow_page_caches caches = {
-		.page_header_cache = &vcpu->arch.mmu_page_header_cache,
-		.shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
-		.shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache,
-	};
+	struct shadow_page_caches caches;
+
+	if (nid == NUMA_NO_NODE)
+		nid = numa_mem_id();
+
+	caches.page_header_cache = &vcpu->arch.mmu_page_header_cache;
+	caches.shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache[nid];
+	caches.shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache;
 
 	return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role);
 }
@@ -2262,15 +2278,19 @@ static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct,
 
 static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
 						 u64 *sptep, gfn_t gfn,
-						 bool direct, unsigned int access)
+						 bool direct, unsigned int access,
+						 kvm_pfn_t pfn)
 {
 	union kvm_mmu_page_role role;
+	int nid;
 
 	if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
 		return ERR_PTR(-EEXIST);
 
 	role = kvm_mmu_child_role(sptep, direct, access);
-	return kvm_mmu_get_shadow_page(vcpu, gfn, role);
+	nid = kvm_pfn_to_refcounted_page_nid(pfn);
+
+	return kvm_mmu_get_shadow_page(vcpu, gfn, role, nid);
 }
 
 static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
@@ -3153,7 +3173,8 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		if (it.level == fault->goal_level)
 			break;
 
-		sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true, ACC_ALL);
+		sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true,
+					  ACC_ALL, fault->pfn);
 		if (sp == ERR_PTR(-EEXIST))
 			continue;
 
@@ -3579,7 +3600,7 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
 	WARN_ON_ONCE(quadrant && !role.has_4_byte_gpte);
 	WARN_ON_ONCE(role.direct && role.has_4_byte_gpte);
 
-	sp = kvm_mmu_get_shadow_page(vcpu, gfn, role);
+	sp = kvm_mmu_get_shadow_page(vcpu, gfn, role, NUMA_NO_NODE);
 	++sp->root_count;
 
 	return __pa(sp->spt);
@@ -5853,15 +5874,20 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
 
 int kvm_mmu_create(struct kvm_vcpu *vcpu)
 {
-	int ret;
+	int ret, i;
 
 	vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
 	vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;
+	vcpu->arch.mmu_pte_list_desc_cache.node = NUMA_NO_NODE;
 
 	vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
 	vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
+	vcpu->arch.mmu_page_header_cache.node = NUMA_NO_NODE;
 
-	vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
+	for (i = 0; i < MAX_NUMNODES; i++) {
+		vcpu->arch.mmu_shadow_page_cache[i].gfp_zero = __GFP_ZERO;
+		vcpu->arch.mmu_shadow_page_cache[i].node = i;
+	}
 
 	vcpu->arch.mmu = &vcpu->arch.root_mmu;
 	vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
@@ -6012,7 +6038,7 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
 int kvm_mmu_init_vm(struct kvm *kvm)
 {
 	struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
-	int r;
+	int r, i;
 
 	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
 	INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
@@ -6029,20 +6055,29 @@ int kvm_mmu_init_vm(struct kvm *kvm)
 
 	kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;
 	kvm->arch.split_page_header_cache.gfp_zero = __GFP_ZERO;
+	kvm->arch.split_page_header_cache.node = NUMA_NO_NODE;
+
+	for (i = 0; i < MAX_NUMNODES; i++) {
+		kvm->arch.split_shadow_page_cache[i].gfp_zero = __GFP_ZERO;
+		kvm->arch.split_shadow_page_cache[i].node = i;
+	}
 
-	kvm->arch.split_shadow_page_cache.gfp_zero = __GFP_ZERO;
 	kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
 	kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
+	kvm->arch.split_desc_cache.node = NUMA_NO_NODE;
 
 	return 0;
 }
 
 static void mmu_free_vm_memory_caches(struct kvm *kvm)
 {
+	int i;
+
 	kvm_mmu_free_memory_cache(&kvm->arch.split_desc_cache);
 	kvm_mmu_free_memory_cache(&kvm->arch.split_page_header_cache);
-	kvm_mmu_free_memory_cache(&kvm->arch.split_shadow_page_cache);
+
+	for (i = 0; i < MAX_NUMNODES; i++)
+		kvm_mmu_free_memory_cache(&kvm->arch.split_shadow_page_cache[i]);
 }
 
 void kvm_mmu_uninit_vm(struct kvm *kvm)
@@ -6150,7 +6185,7 @@ static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min)
 	return kvm_mmu_memory_cache_nr_free_objects(cache) < min;
 }
 
-static bool need_topup_split_caches_or_resched(struct kvm *kvm)
+static bool need_topup_split_caches_or_resched(struct kvm *kvm, int nid)
 {
 	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
 		return true;
@@ -6162,10 +6197,10 @@ static bool need_topup_split_caches_or_resched(struct kvm *kvm)
 	 */
 	return need_topup(&kvm->arch.split_desc_cache, SPLIT_DESC_CACHE_MIN_NR_OBJECTS) ||
 	       need_topup(&kvm->arch.split_page_header_cache, 1) ||
-	       need_topup(&kvm->arch.split_shadow_page_cache, 1);
+	       need_topup(&kvm->arch.split_shadow_page_cache[nid], 1);
 }
 
-static int topup_split_caches(struct kvm *kvm)
+static int topup_split_caches(struct kvm *kvm, int nid)
 {
 	/*
 	 * Allocating rmap list entries when splitting huge pages for nested
@@ -6195,16 +6230,19 @@ static int topup_split_caches(struct kvm *kvm)
 	if (r)
 		return r;
 
-	return kvm_mmu_topup_memory_cache(&kvm->arch.split_shadow_page_cache, 1);
+	return kvm_mmu_topup_memory_cache(&kvm->arch.split_shadow_page_cache[nid], 1);
 }
 
-static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *huge_sptep)
+static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm,
+							 u64 *huge_sptep,
+							 u64 huge_spte)
 {
 	struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
 	struct shadow_page_caches caches = {};
 	union kvm_mmu_page_role role;
 	unsigned int access;
 	gfn_t gfn;
+	int nid;
 
 	gfn = kvm_mmu_page_get_gfn(huge_sp, spte_index(huge_sptep));
 	access = kvm_mmu_page_get_access(huge_sp, spte_index(huge_sptep));
@@ -6217,9 +6255,13 @@ static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *hu
 	 */
 	role = kvm_mmu_child_role(huge_sptep, /*direct=*/true, access);
 
+	nid = kvm_pfn_to_refcounted_page_nid(spte_to_pfn(huge_spte));
+	if (nid == NUMA_NO_NODE)
+		nid = numa_mem_id();
+
 	/* Direct SPs do not require a shadowed_info_cache. */
 	caches.page_header_cache = &kvm->arch.split_page_header_cache;
-	caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache;
+	caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache[nid];
 
 	/* Safe to pass NULL for vCPU since requesting a direct SP. */
 	return __kvm_mmu_get_shadow_page(kvm, NULL, &caches, gfn, role);
@@ -6238,7 +6280,7 @@ static void shadow_mmu_split_huge_page(struct kvm *kvm,
 	gfn_t gfn;
 	int index;
 
-	sp = shadow_mmu_get_sp_for_split(kvm, huge_sptep);
+	sp = shadow_mmu_get_sp_for_split(kvm, huge_sptep, huge_spte);
 
 	for (index = 0; index < SPTE_ENT_PER_PAGE; index++) {
 		sptep = &sp->spt[index];
@@ -6276,7 +6318,7 @@ static int shadow_mmu_try_split_huge_page(struct kvm *kvm,
 					  u64 *huge_sptep)
 {
 	struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
-	int level, r = 0;
+	int level, r = 0, nid;
 	gfn_t gfn;
 	u64 spte;
 
@@ -6284,13 +6326,16 @@ static int shadow_mmu_try_split_huge_page(struct kvm *kvm,
 	gfn = kvm_mmu_page_get_gfn(huge_sp, spte_index(huge_sptep));
 	level = huge_sp->role.level;
 	spte = *huge_sptep;
+	nid = kvm_pfn_to_refcounted_page_nid(spte_to_pfn(spte));
+	if (nid == NUMA_NO_NODE)
+		nid = numa_mem_id();
 
 	if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES) {
 		r = -ENOSPC;
 		goto out;
 	}
 
-	if (need_topup_split_caches_or_resched(kvm)) {
+	if (need_topup_split_caches_or_resched(kvm, nid)) {
 		write_unlock(&kvm->mmu_lock);
 		cond_resched();
 		/*
@@ -6298,7 +6343,7 @@ static int shadow_mmu_try_split_huge_page(struct kvm *kvm,
 		 * rmap iterator should be restarted because the MMU lock was
 		 * dropped.
 		 */
-		r = topup_split_caches(kvm) ?: -EAGAIN;
+		r = topup_split_caches(kvm, nid) ?: -EAGAIN;
 		write_lock(&kvm->mmu_lock);
 		goto out;
 	}
@@ -6988,7 +7033,7 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
 		kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
 }
 
-void *kvm_mmu_get_free_page(int nid, gfp_t gfp)
+void *kvm_arch_mmu_get_free_page(int nid, gfp_t gfp)
 {
 	struct page *spt_page;
 	void *address = NULL;
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 0f6455072055..1b1039a1b178 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -652,7 +652,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 		table_gfn = gw->table_gfn[it.level - 2];
 		access = gw->pt_access[it.level - 2];
 		sp = kvm_mmu_get_child_sp(vcpu, it.sptep, table_gfn,
-					  false, access);
+					  false, access, fault->pfn);
 
 		if (sp != ERR_PTR(-EEXIST)) {
 			/*
@@ -708,7 +708,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 		validate_direct_spte(vcpu, it.sptep, direct_access);
 
 		sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn,
-					  true, direct_access, fault->pfn);
+					  true, direct_access, fault->pfn);
 
 		if (sp == ERR_PTR(-EEXIST))
 			continue;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 1607afbfcc0b..be0763e6b058 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -270,12 +270,15 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
 		 kvm_mmu_page_as_id(_root) != _as_id) {		\
 	} else
 
-static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
+static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu, int nid)
 {
 	struct kvm_mmu_page *sp;
 
+	if (nid == NUMA_NO_NODE)
+		nid = numa_mem_id();
+
 	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
-	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache[nid]);
 
 	return sp;
 }
@@ -327,7 +330,7 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
 		goto out;
 	}
 
-	root = tdp_mmu_alloc_sp(vcpu);
+	root = tdp_mmu_alloc_sp(vcpu, NUMA_NO_NODE);
 	tdp_mmu_init_sp(root, NULL, 0, role);
 
 	refcount_set(&root->tdp_mmu_root_count, 1);
@@ -1159,7 +1162,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	struct kvm *kvm = vcpu->kvm;
 	struct tdp_iter iter;
 	struct kvm_mmu_page *sp;
-	int ret = RET_PF_RETRY;
+	int ret = RET_PF_RETRY, nid;
 
 	kvm_mmu_hugepage_adjust(vcpu, fault);
 
@@ -1188,11 +1191,12 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		    !is_large_pte(iter.old_spte))
 			continue;
 
+		nid = kvm_pfn_to_refcounted_page_nid(fault->pfn);
 		/*
 		 * The SPTE is either non-present or points to a huge page that
 		 * needs to be split.
 		 */
-		sp = tdp_mmu_alloc_sp(vcpu);
+		sp = tdp_mmu_alloc_sp(vcpu, nid);
 		tdp_mmu_init_child_sp(sp, &iter);
 
 		sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
@@ -1423,7 +1427,7 @@ static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(int nid, gfp_t gfp)
 	if (!sp)
 		return NULL;
 
-	sp->spt = kvm_mmu_get_free_page(nid, gfp);
+	sp->spt = kvm_arch_mmu_get_free_page(nid, gfp);
 
 	if (!sp->spt) {
 		kmem_cache_free(mmu_page_header_cache, sp);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 558ded73f660..07674955460b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1374,6 +1374,8 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu, bool usermode_vcpu_not_eligible);
 
 void kvm_flush_remote_tlbs(struct kvm *kvm);
 
+void *kvm_arch_mmu_get_free_page(int nid, gfp_t gfp);
+
 #ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
 int kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int min);
 int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min);
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 3ca3db020e0e..cb627cf1b4e1 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -96,6 +96,8 @@ struct kvm_mmu_memory_cache {
 	struct kmem_cache *kmem_cache;
 	int capacity;
 	void **objects;
+	/* Node on which memory should be allocated by default */
+	int node;
 };
 
 #endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1782c4555d94..4d59c9d48277 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -384,6 +384,11 @@ static void kvm_flush_shadow_all(struct kvm *kvm)
 	kvm_arch_guest_memory_reclaimed(kvm);
 }
 
+void * __weak kvm_arch_mmu_get_free_page(int nid, gfp_t gfp_flags)
+{
+	return (void *)__get_free_page(gfp_flags);
+}
+
 #ifdef KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE
 static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
 					       gfp_t gfp_flags)
@@ -393,7 +398,7 @@ static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
 	if (mc->kmem_cache)
 		return kmem_cache_alloc(mc->kmem_cache, gfp_flags);
 	else
-		return (void *)__get_free_page(gfp_flags);
+		return kvm_arch_mmu_get_free_page(mc->node, gfp_flags);
 }
 
 int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)