Message ID: 20230808071329.19995-1-yan.y.zhao@intel.com
From: Yan Zhao <yan.y.zhao@intel.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: pbonzini@redhat.com, seanjc@google.com, mike.kravetz@oracle.com, apopple@nvidia.com, jgg@nvidia.com, rppt@kernel.org, akpm@linux-foundation.org, kevin.tian@intel.com, Yan Zhao <yan.y.zhao@intel.com>
Subject: [RFC PATCH 0/3] Reduce NUMA balance caused TLB-shootdowns in a VM
Date: Tue, 8 Aug 2023 15:13:29 +0800

Series: Reduce NUMA balance caused TLB-shootdowns in a VM
Message
Yan Zhao
Aug. 8, 2023, 7:13 a.m. UTC
This is an RFC series trying to fix the issue of unnecessary NUMA protection
and TLB shootdowns found in VMs with assigned devices or VFIO mediated
devices during NUMA balancing.

For VMs with assigned devices or VFIO mediated devices, all or part of guest
memory is pinned long-term. Auto NUMA balancing nevertheless periodically
selects VMAs of a process and changes their protection to PROT_NONE, even
though some or all pages in the selected ranges are long-term pinned for
DMA. Though this causes no real problem, because NUMA migration will
ultimately reject migration of those kinds of pages and restore the
PROT_NONE PTEs, it causes KVM's secondary MMU to be zapped periodically,
with identical SPTEs eventually faulted back in, wasting CPU cycles and
generating unnecessary TLB shootdowns.

This series first introduces a new flag, MMU_NOTIFIER_RANGE_NUMA, in patch 1
to accompany the mmu notifier event type MMU_NOTIFY_PROTECTION_VMA, so that
a subscriber of the mmu notifier (e.g. KVM) can tell that an invalidation
event was sent specifically for NUMA migration. Then, with patch 3, when
zapping its secondary MMU, KVM can check for long-term pinned pages and keep
accessing them even though they are PROT_NONE-mapped in the primary MMU.

Patch 2 skips setting PROT_NONE on long-term pinned pages in the primary
MMU, avoiding the page faults introduced by NUMA protection and the
subsequent restoration of the old huge PMDs/PTEs. Because change_pmd_range()
sends .invalidate_range_start() before walking down and checking which pages
to skip, patches 1 and 3 are still required for KVM.

In my test environment, with this series, during boot-up of a VM with
assigned devices, the TLB shootdown count in KVM caused by
.invalidate_range_start() sent for NUMA balancing in change_pmd_range() is
reduced from 9000+ on average to 0.

Yan Zhao (3):
  mm/mmu_notifier: introduce a new mmu notifier flag
    MMU_NOTIFIER_RANGE_NUMA
  mm: don't set PROT_NONE to maybe-dma-pinned pages for NUMA-migrate
    purpose
  KVM: x86/mmu: skip zap maybe-dma-pinned pages for NUMA migration

 arch/x86/kvm/mmu/mmu.c       |  4 ++--
 arch/x86/kvm/mmu/tdp_mmu.c   | 26 ++++++++++++++++++++++----
 arch/x86/kvm/mmu/tdp_mmu.h   |  4 ++--
 include/linux/kvm_host.h     |  1 +
 include/linux/mmu_notifier.h |  1 +
 mm/huge_memory.c             |  5 +++++
 mm/mprotect.c                |  9 ++++++++-
 virt/kvm/kvm_main.c          |  5 +++++
 8 files changed, 46 insertions(+), 9 deletions(-)

base-commit: fdf0eaf11452d72945af31804e2a1048ee1b574c
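As a rough illustration of the mechanism patch 1 adds, a notifier subscriber
could recognize NUMA-balancing invalidations along these lines (a minimal
sketch, not code from the series; the helper name below is hypothetical,
while MMU_NOTIFY_PROTECTION_VMA is the existing event type and
MMU_NOTIFIER_RANGE_NUMA is the flag the series introduces):

#include <linux/mmu_notifier.h>

/*
 * Hypothetical helper, not from the patches themselves: report whether
 * an invalidation range was sent purely for NUMA-balancing protection,
 * so the subscriber can decide to keep long-term pinned pages mapped.
 */
static bool range_is_numa_protection(const struct mmu_notifier_range *range)
{
	return range->event == MMU_NOTIFY_PROTECTION_VMA &&
	       (range->flags & MMU_NOTIFIER_RANGE_NUMA);
}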
Comments
On Tue, Aug 08, 2023 at 03:17:02PM +0800, Yan Zhao wrote:
> Skip zapping pages that're exclusive anonymous and maybe-dma-pinned in TDP
> MMU if it's for NUMA migration purpose to save unnecessary zaps and TLB
> shootdowns.
>
> For NUMA balancing, change_pmd_range() will send .invalidate_range_start()
> and .invalidate_range_end() pair unconditionally before setting a huge PMD
> or PTE to be PROT_NONE.
>
> No matter whether PROT_NONE is set under change_pmd_range(), NUMA migration
> will eventually reject migrating exclusive anonymous and maybe-dma-pinned
> pages in the later try_to_migrate_one() phase and restore the affected huge
> PMD or PTE.
>
> Therefore, if KVM can detect those kinds of pages in the zap phase, zaps and
> TLB shootdowns caused by this kind of protection can be avoided.
>
> Corner cases like below are still fine.
> 1. Auto NUMA balancing selects a PMD range to set PROT_NONE in
>    change_pmd_range().
> 2. A page is maybe-dma-pinned at the time of sending
>    .invalidate_range_start() with event type MMU_NOTIFY_PROTECTION_VMA.
>    ==> so it's not zapped in KVM's secondary MMU.
> 3. The page is unpinned after sending .invalidate_range_start(), therefore
>    is not maybe-dma-pinned and is set to PROT_NONE in the primary MMU.
> 4. For some reason, a page fault is triggered in the primary MMU and the
>    page is found to be suitable for NUMA migration.
> 5. try_to_migrate_one() will send an .invalidate_range_start() notification
>    with event type MMU_NOTIFY_CLEAR to KVM, and ===>
>    KVM will zap the pages in the secondary MMU.
> 6. The old page will be replaced by a new page in the primary MMU.
>
> If step 4 does not happen, though KVM will keep accessing a page that
> might not be on the best NUMA node, it can be fixed by a next round of
> step 1 in Auto NUMA balancing, as change_pmd_range() will send the mmu
> notification without checking whether PROT_NONE is set or not.
>
> Currently in this patch, for NUMA migration protection purpose, only
> exclusive anonymous maybe-dma-pinned pages are skipped. Other types of
> pages, e.g. is_zone_device_page() or PageKsm() pages, can be included
> later if necessary.
>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  arch/x86/kvm/mmu/mmu.c     |  4 ++--
>  arch/x86/kvm/mmu/tdp_mmu.c | 26 ++++++++++++++++++++++----
>  arch/x86/kvm/mmu/tdp_mmu.h |  4 ++--
>  include/linux/kvm_host.h   |  1 +
>  virt/kvm/kvm_main.c        |  5 +++++
>  5 files changed, 32 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index d72f2b20f430..9dccc25b1389 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6307,8 +6307,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>
>  	if (tdp_mmu_enabled) {
>  		for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
> -			flush = kvm_tdp_mmu_zap_leafs(kvm, i, gfn_start,
> -						      gfn_end, true, flush);
> +			flush = kvm_tdp_mmu_zap_leafs(kvm, i, gfn_start, gfn_end,
> +						      true, flush, false);
>  	}
>
>  	if (flush)
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 6250bd3d20c1..17762b5a2b98 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -838,7 +838,8 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
>   * operation can cause a soft lockup.
>   */
>  static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> -			      gfn_t start, gfn_t end, bool can_yield, bool flush)
> +			      gfn_t start, gfn_t end, bool can_yield, bool flush,
> +			      bool skip_pinned)
>  {
>  	struct tdp_iter iter;
>
> @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
>  		    !is_last_spte(iter.old_spte, iter.level))
>  			continue;
>
> +		if (skip_pinned) {
> +			kvm_pfn_t pfn = spte_to_pfn(iter.old_spte);
> +			struct page *page = kvm_pfn_to_refcounted_page(pfn);
> +			struct folio *folio;
> +
> +			if (!page)
> +				continue;
> +
> +			folio = page_folio(page);
> +
> +			if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) &&
> +			    folio_maybe_dma_pinned(folio))
> +				continue;
> +		}
> +

I don't get it..

The last patch made it so that the NUMA balancing code doesn't change
page_maybe_dma_pinned() pages to PROT_NONE

So why doesn't KVM just check if the current and new SPTE are the same
and refrain from invalidating if nothing changed?

Duplicating the checks here seems very frail to me. If you did that
then you probably don't need to change the notifiers.

Jason
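To make the suggestion concrete, the check Jason has in mind would amount to
something like the following (purely hypothetical; as Sean's reply below
explains, KVM never actually sees the primary MMU's new PTE at zap time, so
no such helper can exist under the current notifier ordering):

/*
 * Hypothetical sketch of "skip if nothing changed": elide the zap when
 * the protection the primary MMU is about to install leaves the mapping
 * identical. In reality invalidate_range_start() fires before the new
 * PTE is computed, so there is no new_pte to compare against -- which
 * is Sean's point in the follow-up.
 */
static bool invalidation_is_noop(u64 old_pte, u64 new_pte)
{
	return old_pte == new_pte;
}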
On Tue, Aug 08, 2023, Jason Gunthorpe wrote:
> On Tue, Aug 08, 2023 at 03:17:02PM +0800, Yan Zhao wrote:
> > @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> >  		    !is_last_spte(iter.old_spte, iter.level))
> >  			continue;
> >
> > +		if (skip_pinned) {
> > +			kvm_pfn_t pfn = spte_to_pfn(iter.old_spte);
> > +			struct page *page = kvm_pfn_to_refcounted_page(pfn);
> > +			struct folio *folio;
> > +
> > +			if (!page)
> > +				continue;
> > +
> > +			folio = page_folio(page);
> > +
> > +			if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) &&
> > +			    folio_maybe_dma_pinned(folio))
> > +				continue;
> > +		}
> > +
>
> I don't get it..
>
> The last patch made it so that the NUMA balancing code doesn't change
> page_maybe_dma_pinned() pages to PROT_NONE
>
> So why doesn't KVM just check if the current and new SPTE are the same
> and refrain from invalidating if nothing changed?

Because KVM doesn't have visibility into the current and new PTEs when the
zapping occurs. The contract for invalidate_range_start() requires that KVM
drop all references before returning, and so the zapping occurs before
change_pte_range() or change_huge_pmd() have done anything.

> Duplicating the checks here seems very frail to me.

Yes, this approach gets a hard NAK from me. IIUC, folio_maybe_dma_pinned()
can yield different results purely based on refcounts, i.e. KVM could skip
pages that the primary MMU does not, and thus violate the mmu_notifier
contract. And in general, I am steadfastly against adding any kind of
heuristic to KVM's zapping logic.

This really needs to be fixed in the primary MMU and not require any direct
involvement from secondary MMUs, e.g. the mmu_notifier invalidation itself
needs to be skipped.
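For reference, the heuristic Sean is pointing at is visible in the helper's
own shape (paraphrased from include/linux/mm.h of kernels around this time,
comments added; not verbatim): for small folios the answer is a pure
refcount guess, so enough ordinary references are indistinguishable from a
pin.

/* Paraphrased from include/linux/mm.h (kernels of this era): */
static inline bool folio_maybe_dma_pinned(struct folio *folio)
{
	/* Large folios track FOLL_PIN pins explicitly... */
	if (folio_test_large(folio))
		return atomic_read(&folio->_pincount) > 0;

	/*
	 * ...but for small folios this is a refcount heuristic:
	 * GUP_PIN_COUNTING_BIAS (1024) ordinary references look exactly
	 * like one pin, hence the "maybe" in the name.
	 */
	return folio_ref_count(folio) >= GUP_PIN_COUNTING_BIAS;
}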
On Tue, Aug 08, 2023 at 07:26:07AM -0700, Sean Christopherson wrote:
> On Tue, Aug 08, 2023, Jason Gunthorpe wrote:
> > On Tue, Aug 08, 2023 at 03:17:02PM +0800, Yan Zhao wrote:
> > > @@ -859,6 +860,21 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> > >  		    !is_last_spte(iter.old_spte, iter.level))
> > >  			continue;
> > >
> > > +		if (skip_pinned) {
> > > +			kvm_pfn_t pfn = spte_to_pfn(iter.old_spte);
> > > +			struct page *page = kvm_pfn_to_refcounted_page(pfn);
> > > +			struct folio *folio;
> > > +
> > > +			if (!page)
> > > +				continue;
> > > +
> > > +			folio = page_folio(page);
> > > +
> > > +			if (folio_test_anon(folio) && PageAnonExclusive(&folio->page) &&
> > > +			    folio_maybe_dma_pinned(folio))
> > > +				continue;
> > > +		}
> > > +
> >
> > I don't get it..
> >
> > The last patch made it so that the NUMA balancing code doesn't change
> > page_maybe_dma_pinned() pages to PROT_NONE
> >
> > So why doesn't KVM just check if the current and new SPTE are the same
> > and refrain from invalidating if nothing changed?
>
> Because KVM doesn't have visibility into the current and new PTEs when the
> zapping occurs. The contract for invalidate_range_start() requires that KVM
> drop all references before returning, and so the zapping occurs before
> change_pte_range() or change_huge_pmd() have done anything.
>
> > Duplicating the checks here seems very frail to me.
>
> Yes, this approach gets a hard NAK from me. IIUC, folio_maybe_dma_pinned()
> can yield different results purely based on refcounts, i.e. KVM could skip
> pages that the primary MMU does not, and thus violate the mmu_notifier
> contract. And in general, I am steadfastly against adding any kind of
> heuristic to KVM's zapping logic.
>
> This really needs to be fixed in the primary MMU and not require any direct
> involvement from secondary MMUs, e.g. the mmu_notifier invalidation itself
> needs to be skipped.

This likely has the same issue you just described: we don't know if it can
be skipped until we iterate over the PTEs, and by then it is too late to
invoke the notifier. Maybe some kind of abort and restart scheme could work?

Jason
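The ordering constraint both replies keep returning to can be sketched like
this (shape of mm/mprotect.c only, not verbatim kernel code; the function
name is made up for illustration):

#include <linux/mmu_notifier.h>

/*
 * Sketch of why it is "too late": the start notifier fires before the
 * primary MMU inspects any PTE, so secondary MMUs must drop their
 * mappings before anyone knows which PTEs will actually change. The
 * abort-and-restart idea would walk the PTEs first and send the
 * notifier pair only if a real change was found.
 */
static void change_range_sketch(struct mmu_notifier_range *range)
{
	mmu_notifier_invalidate_range_start(range); /* KVM zaps here */

	/* only now: walk PTEs, skip maybe-dma-pinned, set PROT_NONE */

	mmu_notifier_invalidate_range_end(range);
}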