Message ID: 20230810085636.25914-1-yan.y.zhao@intel.com
Headers:
From: Yan Zhao <yan.y.zhao@intel.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: pbonzini@redhat.com, seanjc@google.com, mike.kravetz@oracle.com, apopple@nvidia.com, jgg@nvidia.com, rppt@kernel.org, akpm@linux-foundation.org, kevin.tian@intel.com, david@redhat.com, Yan Zhao <yan.y.zhao@intel.com>
Subject: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM
Date: Thu, 10 Aug 2023 16:56:36 +0800
Message-Id: <20230810085636.25914-1-yan.y.zhao@intel.com>
Series: Reduce NUMA balance caused TLB-shootdowns in a VM
Message
Yan Zhao
Aug. 10, 2023, 8:56 a.m. UTC
This is an RFC series that tries to fix the issue of unnecessary NUMA protection and TLB-shootdowns observed in VMs with assigned devices or VFIO mediated devices during NUMA balancing.

For VMs with assigned devices or VFIO mediated devices, all or part of guest memory is pinned long-term.

Auto NUMA balancing periodically selects VMAs of a process and changes their protection to PROT_NONE, even though some or all pages in the selected ranges are long-term pinned for DMA, which is the case for VMs with assigned devices or VFIO mediated devices.

Though this causes no real problem, because NUMA migration ultimately rejects migration of such pages and restores the PROT_NONE PTEs, it causes KVM's secondary MMU to be zapped periodically, with identical SPTEs faulted back in afterwards, wasting CPU cycles and generating unnecessary TLB-shootdowns.

This series first introduces a new flag, MMU_NOTIFIER_RANGE_NUMA, in patch 1 to work with the mmu notifier event type MMU_NOTIFY_PROTECTION_VMA, so that a subscriber of the mmu notifier (e.g. KVM) can tell that an invalidation event is sent specifically for NUMA migration.

Patch 2 skips setting PROT_NONE on long-term pinned pages in the primary MMU, to avoid the page faults introduced by NUMA protection and the restoration of the old huge PMDs/PTEs in the primary MMU.

Patch 3 introduces a new mmu notifier callback, .numa_protect(), which is called in patch 4 once a page is guaranteed to be PROT_NONE protected.

Then, in patch 5, KVM can recognize that an .invalidate_range_start() notification is specific to NUMA balancing and defer the page unmap in the secondary MMU until .numa_protect() arrives.

Changelog:
RFC v1 --> v2:
1. Added patches 3-4 to introduce a new callback .numa_protect().
2. Rather than have KVM duplicate the logic to check whether a page is pinned long-term, let KVM depend on the new callback .numa_protect() to do the page unmap in the secondary MMU for NUMA migration.

RFC v1:
https://lore.kernel.org/all/20230808071329.19995-1-yan.y.zhao@intel.com/

Yan Zhao (5):
  mm/mmu_notifier: introduce a new mmu notifier flag MMU_NOTIFIER_RANGE_NUMA
  mm: don't set PROT_NONE to maybe-dma-pinned pages for NUMA-migrate purpose
  mm/mmu_notifier: introduce a new callback .numa_protect
  mm/autonuma: call .numa_protect() when page is protected for NUMA migrate
  KVM: Unmap pages only when it's indeed protected for NUMA migration

 include/linux/mmu_notifier.h | 16 ++++++++++++++++
 mm/huge_memory.c             |  6 ++++++
 mm/mmu_notifier.c            | 18 ++++++++++++++++++
 mm/mprotect.c                | 10 +++++++++-
 virt/kvm/kvm_main.c          | 25 ++++++++++++++++++++++---
 5 files changed, 71 insertions(+), 4 deletions(-)
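The patches themselves are not included in this cover letter, so as a reading aid, here is a minimal sketch of the notifier additions described above. The names MMU_NOTIFIER_RANGE_NUMA and .numa_protect() come from the cover letter; the flag value and the callback signature are assumptions for illustration only.

```c
/* Sketch only -- not the actual patches. */

/* Patch 1: a range flag so subscribers can tell NUMA-balancing
 * invalidations apart from other MMU_NOTIFY_PROTECTION_VMA events.
 * The bit value here is an assumption. */
#define MMU_NOTIFIER_RANGE_NUMA	(1 << 1)

struct mmu_notifier_ops {
	/* ... existing callbacks such as .invalidate_range_start() ... */

	/* Patch 3: called only once a page is definitely being made
	 * PROT_NONE for NUMA migration, so a subscriber can unmap
	 * exactly those pages instead of the whole notified range.
	 * Signature is illustrative. */
	void (*numa_protect)(struct mmu_notifier *subscription,
			     struct mm_struct *mm,
			     unsigned long start,
			     unsigned long end);
};
```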
Comments
On 10.08.23 10:56, Yan Zhao wrote: > This is an RFC series trying to fix the issue of unnecessary NUMA > protection and TLB-shootdowns found in VMs with assigned devices or VFIO > mediated devices during NUMA balance. > > For VMs with assigned devices or VFIO mediated devices, all or part of > guest memory are pinned for long-term. > > Auto NUMA balancing will periodically selects VMAs of a process and change > protections to PROT_NONE even though some or all pages in the selected > ranges are long-term pinned for DMAs, which is true for VMs with assigned > devices or VFIO mediated devices. > > Though this will not cause real problem because NUMA migration will > ultimately reject migration of those kind of pages and restore those > PROT_NONE PTEs, it causes KVM's secondary MMU to be zapped periodically > with equal SPTEs finally faulted back, wasting CPU cycles and generating > unnecessary TLB-shootdowns. > > This series first introduces a new flag MMU_NOTIFIER_RANGE_NUMA in patch 1 > to work with mmu notifier event type MMU_NOTIFY_PROTECTION_VMA, so that > the subscriber (e.g.KVM) of the mmu notifier can know that an invalidation > event is sent for NUMA migration purpose in specific. > > Patch 2 skips setting PROT_NONE to long-term pinned pages in the primary > MMU to avoid NUMA protection introduced page faults and restoration of old > huge PMDs/PTEs in primary MMU. > > Patch 3 introduces a new mmu notifier callback .numa_protect(), which > will be called in patch 4 when a page is ensured to be PROT_NONE protected. > > Then in patch 5, KVM can recognize a .invalidate_range_start() notification > is for NUMA balancing specific and do not do the page unmap in secondary > MMU until .numa_protect() comes. > Why do we need all that, when we should simply not be applying PROT_NONE to pinned pages? In change_pte_range() we already have: if (is_cow_mapping(vma->vm_flags) && page_count(page) != 1) Which includes both, shared and pinned pages. Staring at page #2, are we still missing something similar for THPs? Why is that MMU notifier thingy and touching KVM code required?
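For readers without the tree open, the test David quotes lives in the prot_numa branch of change_pte_range() in mm/mprotect.c. Below is a trimmed, lightly paraphrased excerpt -- a sketch of the upstream logic, not a verbatim copy.

```c
/* Sketch of the NUMA-hinting branch of change_pte_range(), mm/mprotect.c */
if (prot_numa) {
	struct page *page;

	/* Upstream also filters out KSM and zone-device pages here. */
	page = vm_normal_page(vma, addr, oldpte);
	if (!page)
		continue;

	/* Skip shared copy-on-write pages -- and, as a side effect,
	 * pinned pages, since pinning elevates the refcount. */
	if (is_cow_mapping(vma->vm_flags) &&
	    page_count(page) != 1)
		continue;

	/* ... node checks, then the PTE is made PROT_NONE via
	 * pte_modify() so the next access triggers a hinting fault ... */
}
```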
On Thu, Aug 10, 2023 at 11:34:07AM +0200, David Hildenbrand wrote: > > This series first introduces a new flag MMU_NOTIFIER_RANGE_NUMA in patch 1 > > to work with mmu notifier event type MMU_NOTIFY_PROTECTION_VMA, so that > > the subscriber (e.g.KVM) of the mmu notifier can know that an invalidation > > event is sent for NUMA migration purpose in specific. > > > > Patch 2 skips setting PROT_NONE to long-term pinned pages in the primary > > MMU to avoid NUMA protection introduced page faults and restoration of old > > huge PMDs/PTEs in primary MMU. > > > > Patch 3 introduces a new mmu notifier callback .numa_protect(), which > > will be called in patch 4 when a page is ensured to be PROT_NONE protected. > > > > Then in patch 5, KVM can recognize a .invalidate_range_start() notification > > is for NUMA balancing specific and do not do the page unmap in secondary > > MMU until .numa_protect() comes. > > > > Why do we need all that, when we should simply not be applying PROT_NONE to > pinned pages? > > In change_pte_range() we already have: > > if (is_cow_mapping(vma->vm_flags) && > page_count(page) != 1) > > Which includes both, shared and pinned pages. Ah, right, currently in my side, I don't see any pinned pages are outside of this condition. But I have a question regarding to is_cow_mapping(vma->vm_flags), do we need to allow pinned pages in !is_cow_mapping(vma->vm_flags)? > Staring at page #2, are we still missing something similar for THPs? Yes. > Why is that MMU notifier thingy and touching KVM code required? Because NUMA balancing code will firstly send .invalidate_range_start() with event type MMU_NOTIFY_PROTECTION_VMA to KVM in change_pmd_range() unconditionally, before it goes down into change_pte_range() and change_huge_pmd() to check each page count and apply PROT_NONE. Then current KVM will unmap all notified pages from secondary MMU in .invalidate_range_start(), which could include pages that finally not set to PROT_NONE in primary MMU. For VMs with pass-through devices, though all guest pages are pinned, KVM still periodically unmap pages in response to the .invalidate_range_start() notification from auto NUMA balancing, which is a waste. So, if there's a new callback sent when pages is set to PROT_NONE for NUMA migrate only, KVM can unmap only those pages. As KVM still needs to unmap pages for other type of event in its handler of .invalidate_range_start() (.i.e. kvm_mmu_notifier_invalidate_range_start()), and MMU_NOTIFY_PROTECTION_VMA also include other reasons, so patch 1 added a range flag to help KVM not to do a blind unmap in .invalidate_range_start(), but do it in the new .numa_protect() handler. > > -- > Cheers, > > David / dhildenb > >
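A rough sketch of the KVM-side behaviour Yan describes follows. kvm_mmu_notifier_invalidate_range_start() is the real handler in virt/kvm/kvm_main.c, but the flag test and the .numa_protect() handler below are only illustrations of patch 5, which is not shown here.

```c
/* Sketch of the idea in patch 5, not the actual patch. */
static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
				const struct mmu_notifier_range *range)
{
	/*
	 * If the invalidation is only a NUMA-balancing protection change,
	 * don't blindly zap the whole range here; wait for .numa_protect()
	 * to name the pages actually made PROT_NONE.
	 */
	if (range->event == MMU_NOTIFY_PROTECTION_VMA &&
	    (range->flags & MMU_NOTIFIER_RANGE_NUMA))
		return 0;

	/* ... existing zap of the secondary MMU for the range ... */
	return 0;
}

static void kvm_mmu_notifier_numa_protect(struct mmu_notifier *mn,
					  struct mm_struct *mm,
					  unsigned long start,
					  unsigned long end)
{
	/* Unmap only the pages that were really made PROT_NONE,
	 * e.g. via the existing gfn-range unmap machinery. */
}
```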
On Thu, Aug 10, 2023 at 04:56:36PM +0800, Yan Zhao wrote: >This is an RFC series trying to fix the issue of unnecessary NUMA >protection and TLB-shootdowns found in VMs with assigned devices or VFIO >mediated devices during NUMA balance. > >For VMs with assigned devices or VFIO mediated devices, all or part of >guest memory are pinned for long-term. > >Auto NUMA balancing will periodically selects VMAs of a process and change >protections to PROT_NONE even though some or all pages in the selected >ranges are long-term pinned for DMAs, which is true for VMs with assigned >devices or VFIO mediated devices. > >Though this will not cause real problem because NUMA migration will >ultimately reject migration of those kind of pages and restore those >PROT_NONE PTEs, it causes KVM's secondary MMU to be zapped periodically >with equal SPTEs finally faulted back, wasting CPU cycles and generating >unnecessary TLB-shootdowns. In my understanding, NUMA balancing also moves tasks closer to the memory they are accessing. Can this still work with this series applied? > >This series first introduces a new flag MMU_NOTIFIER_RANGE_NUMA in patch 1 >to work with mmu notifier event type MMU_NOTIFY_PROTECTION_VMA, so that >the subscriber (e.g.KVM) of the mmu notifier can know that an invalidation >event is sent for NUMA migration purpose in specific. > >Patch 2 skips setting PROT_NONE to long-term pinned pages in the primary >MMU to avoid NUMA protection introduced page faults and restoration of old >huge PMDs/PTEs in primary MMU. > >Patch 3 introduces a new mmu notifier callback .numa_protect(), which >will be called in patch 4 when a page is ensured to be PROT_NONE protected. > >Then in patch 5, KVM can recognize a .invalidate_range_start() notification >is for NUMA balancing specific and do not do the page unmap in secondary >MMU until .numa_protect() comes. > > >Changelog: >RFC v1 --> v2: >1. added patch 3-4 to introduce a new callback .numa_protect() >2. Rather than have KVM duplicate logic to check if a page is pinned for >long-term, let KVM depend on the new callback .numa_protect() to do the >page unmap in secondary MMU for NUMA migration purpose. > >RFC v1: >https://lore.kernel.org/all/20230808071329.19995-1-yan.y.zhao@intel.com/ > >Yan Zhao (5): > mm/mmu_notifier: introduce a new mmu notifier flag > MMU_NOTIFIER_RANGE_NUMA > mm: don't set PROT_NONE to maybe-dma-pinned pages for NUMA-migrate > purpose > mm/mmu_notifier: introduce a new callback .numa_protect > mm/autonuma: call .numa_protect() when page is protected for NUMA > migrate > KVM: Unmap pages only when it's indeed protected for NUMA migration > > include/linux/mmu_notifier.h | 16 ++++++++++++++++ > mm/huge_memory.c | 6 ++++++ > mm/mmu_notifier.c | 18 ++++++++++++++++++ > mm/mprotect.c | 10 +++++++++- > virt/kvm/kvm_main.c | 25 ++++++++++++++++++++++--- > 5 files changed, 71 insertions(+), 4 deletions(-) > >-- >2.17.1 >
On Thu, Aug 10, 2023 at 09:58:43PM +0800, Chao Gao wrote: > On Thu, Aug 10, 2023 at 04:56:36PM +0800, Yan Zhao wrote: > >This is an RFC series trying to fix the issue of unnecessary NUMA > >protection and TLB-shootdowns found in VMs with assigned devices or VFIO > >mediated devices during NUMA balance. > > > >For VMs with assigned devices or VFIO mediated devices, all or part of > >guest memory are pinned for long-term. > > > >Auto NUMA balancing will periodically selects VMAs of a process and change > >protections to PROT_NONE even though some or all pages in the selected > >ranges are long-term pinned for DMAs, which is true for VMs with assigned > >devices or VFIO mediated devices. > > > >Though this will not cause real problem because NUMA migration will > >ultimately reject migration of those kind of pages and restore those > >PROT_NONE PTEs, it causes KVM's secondary MMU to be zapped periodically > >with equal SPTEs finally faulted back, wasting CPU cycles and generating > >unnecessary TLB-shootdowns. > > In my understanding, NUMA balancing also moves tasks closer to the memory > they are accessing. Can this still work with this series applied? > For pages protected with PROT_NONE in primary MMU in scanning phase, yes; For pages not set to PROT_NONE, no. Because looks this task_numa_migrate() is only triggered in next page fault when PROT_NONE and accessible VMA is found.
On 10.08.23 11:50, Yan Zhao wrote: > On Thu, Aug 10, 2023 at 11:34:07AM +0200, David Hildenbrand wrote: >>> This series first introduces a new flag MMU_NOTIFIER_RANGE_NUMA in patch 1 >>> to work with mmu notifier event type MMU_NOTIFY_PROTECTION_VMA, so that >>> the subscriber (e.g.KVM) of the mmu notifier can know that an invalidation >>> event is sent for NUMA migration purpose in specific. >>> >>> Patch 2 skips setting PROT_NONE to long-term pinned pages in the primary >>> MMU to avoid NUMA protection introduced page faults and restoration of old >>> huge PMDs/PTEs in primary MMU. >>> >>> Patch 3 introduces a new mmu notifier callback .numa_protect(), which >>> will be called in patch 4 when a page is ensured to be PROT_NONE protected. >>> >>> Then in patch 5, KVM can recognize a .invalidate_range_start() notification >>> is for NUMA balancing specific and do not do the page unmap in secondary >>> MMU until .numa_protect() comes. >>> >> >> Why do we need all that, when we should simply not be applying PROT_NONE to >> pinned pages? >> >> In change_pte_range() we already have: >> >> if (is_cow_mapping(vma->vm_flags) && >> page_count(page) != 1) >> >> Which includes both, shared and pinned pages. > Ah, right, currently in my side, I don't see any pinned pages are > outside of this condition. > But I have a question regarding to is_cow_mapping(vma->vm_flags), do we > need to allow pinned pages in !is_cow_mapping(vma->vm_flags)? One issue is that folio_maybe_pinned...() ... is unreliable as soon as your page is mapped more than 1024 times. One might argue that we also want to exclude pages that are mapped that often. That might possibly work. > >> Staring at page #2, are we still missing something similar for THPs? > Yes. > >> Why is that MMU notifier thingy and touching KVM code required? > Because NUMA balancing code will firstly send .invalidate_range_start() with > event type MMU_NOTIFY_PROTECTION_VMA to KVM in change_pmd_range() > unconditionally, before it goes down into change_pte_range() and > change_huge_pmd() to check each page count and apply PROT_NONE. Ah, okay I see, thanks. That's indeed unfortunate. > > Then current KVM will unmap all notified pages from secondary MMU > in .invalidate_range_start(), which could include pages that finally not > set to PROT_NONE in primary MMU. > > For VMs with pass-through devices, though all guest pages are pinned, > KVM still periodically unmap pages in response to the > .invalidate_range_start() notification from auto NUMA balancing, which > is a waste. Should we want to disable NUMA hinting for such VMAs instead (for example, by QEMU/hypervisor) that knows that any NUMA hinting activity on these ranges would be a complete waste of time? I recall that John H. once mentioned that there are similar issues with GPU memory: NUMA hinting is actually counter-productive and they end up disabling it.
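To make the two filters David contrasts concrete, here is a sketch of what a combined check might look like. skip_prot_numa_folio() is a name invented here, and whether heavily mapped folios should also be skipped is exactly the open question above.

```c
/* Sketch only: a possible filter for the NUMA-hinting paths. */
static bool skip_prot_numa_folio(struct vm_area_struct *vma,
				 struct folio *folio)
{
	/* The existing test: shared or pinned pages in COW mappings. */
	if (is_cow_mapping(vma->vm_flags) && folio_ref_count(folio) != 1)
		return true;

	/*
	 * folio_maybe_dma_pinned() falls back to a refcount heuristic
	 * (GUP_PIN_COUNTING_BIAS == 1024) for small folios, so it can
	 * report false positives once a folio is mapped more than
	 * ~1024 times -- the unreliability noted above.
	 */
	if (folio_maybe_dma_pinned(folio))
		return true;

	return false;
}
```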
On 8/11/23 10:25, David Hildenbrand wrote: ... > One issue is that folio_maybe_pinned...() ... is unreliable as soon as your page is mapped more than 1024 times. > > One might argue that we also want to exclude pages that are mapped that often. That might possibly work. Yes. >> >>> Staring at page #2, are we still missing something similar for THPs? >> Yes. >> >>> Why is that MMU notifier thingy and touching KVM code required? >> Because NUMA balancing code will firstly send .invalidate_range_start() with >> event type MMU_NOTIFY_PROTECTION_VMA to KVM in change_pmd_range() >> unconditionally, before it goes down into change_pte_range() and >> change_huge_pmd() to check each page count and apply PROT_NONE. > > Ah, okay I see, thanks. That's indeed unfortunate. Sigh. All this difficulty reminds me that this mechanism was created in the early days of NUMA. I wonder sometimes lately whether the cost, in complexity and CPU time, is still worth it on today's hardware. But of course I am deeply biased, so don't take that too seriously. See below. :) > >> >> Then current KVM will unmap all notified pages from secondary MMU >> in .invalidate_range_start(), which could include pages that finally not >> set to PROT_NONE in primary MMU. >> >> For VMs with pass-through devices, though all guest pages are pinned, >> KVM still periodically unmap pages in response to the >> .invalidate_range_start() notification from auto NUMA balancing, which >> is a waste. > > Should we want to disable NUMA hinting for such VMAs instead (for example, by QEMU/hypervisor) that knows that any NUMA hinting activity on these ranges would be a complete waste of time? I recall that John H. once mentioned that there are similar issues with GPU memory: NUMA hinting is actually counter-productive and they end up disabling it. > Yes, NUMA balancing is incredibly harmful to performance, for GPU and accelerators that map memory...and VMs as well, it seems. Basically, anything that has its own processors and page tables needs to be left strictly alone by NUMA balancing. Because the kernel is (still, even today) unaware of what those processors are doing, and so it has no way to do productive NUMA balancing. thanks,
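For readers who haven't seen the pattern John describes, a bare-bones sketch of the driver side is below. struct my_device_mirror and the my_device_*() helper are hypothetical; only the mmu_interval_notifier API is real.

```c
/* Sketch of an HMM-style device mirror: the driver must drop its device
 * page-table mappings on every CPU invalidation of the range, including
 * NUMA-hinting ones, then re-fault them via hmm_range_fault() later. */
struct my_device_mirror {			/* hypothetical driver state */
	struct mmu_interval_notifier notifier;
	/* ... handle to the device page tables ... */
};

static void my_device_unmap_range(struct my_device_mirror *mirror,
				  unsigned long start, unsigned long end);

static bool my_mirror_invalidate(struct mmu_interval_notifier *mni,
				 const struct mmu_notifier_range *range,
				 unsigned long cur_seq)
{
	struct my_device_mirror *mirror =
		container_of(mni, struct my_device_mirror, notifier);

	mmu_interval_set_seq(mni, cur_seq);
	/* "Dutifully unmaps the pages from the GPU or accelerator" --
	 * even when the reason is only a NUMA-balancing PROT_NONE pass. */
	my_device_unmap_range(mirror, range->start, range->end);
	return true;
}

static const struct mmu_interval_notifier_ops my_mirror_ops = {
	.invalidate = my_mirror_invalidate,
};
```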
>> Ah, okay I see, thanks. That's indeed unfortunate. > > Sigh. All this difficulty reminds me that this mechanism was created in > the early days of NUMA. I wonder sometimes lately whether the cost, in > complexity and CPU time, is still worth it on today's hardware. > > But of course I am deeply biased, so don't take that too seriously. > See below. :) :) >> >>> >>> Then current KVM will unmap all notified pages from secondary MMU >>> in .invalidate_range_start(), which could include pages that finally not >>> set to PROT_NONE in primary MMU. >>> >>> For VMs with pass-through devices, though all guest pages are pinned, >>> KVM still periodically unmap pages in response to the >>> .invalidate_range_start() notification from auto NUMA balancing, which >>> is a waste. >> >> Should we want to disable NUMA hinting for such VMAs instead (for example, by QEMU/hypervisor) that knows that any NUMA hinting activity on these ranges would be a complete waste of time? I recall that John H. once mentioned that there are > similar issues with GPU memory: NUMA hinting is actually counter-productive and they end up disabling it. >> > > Yes, NUMA balancing is incredibly harmful to performance, for GPU and > accelerators that map memory...and VMs as well, it seems. Basically, > anything that has its own processors and page tables needs to be left > strictly alone by NUMA balancing. Because the kernel is (still, even > today) unaware of what those processors are doing, and so it has no way > to do productive NUMA balancing. Is there any existing way we could handle that better on a per-VMA level, or on the process level? Any magic toggles? MMF_HAS_PINNED might be too restrictive. MMF_HAS_PINNED_LONGTERM might be better, but with things like iouring still too restrictive eventually. I recall that setting a mempolicy could prevent auto-numa from getting active, but that might be undesired. CCing Mel.
On 8/11/23 11:39, David Hildenbrand wrote: ... >>> Should we want to disable NUMA hinting for such VMAs instead (for example, by QEMU/hypervisor) that knows that any NUMA hinting activity on these ranges would be a complete waste of time? I recall that John H. once mentioned that there are >> similar issues with GPU memory: NUMA hinting is actually counter-productive and they end up disabling it. >>> >> >> Yes, NUMA balancing is incredibly harmful to performance, for GPU and >> accelerators that map memory...and VMs as well, it seems. Basically, >> anything that has its own processors and page tables needs to be left >> strictly alone by NUMA balancing. Because the kernel is (still, even >> today) unaware of what those processors are doing, and so it has no way >> to do productive NUMA balancing. > > Is there any existing way we could handle that better on a per-VMA level, or on the process level? Any magic toggles? > > MMF_HAS_PINNED might be too restrictive. MMF_HAS_PINNED_LONGTERM might be better, but with things like iouring still too restrictive eventually. > > I recall that setting a mempolicy could prevent auto-numa from getting active, but that might be undesired. > > CCing Mel. > Let's discern between page pinning situations, and HMM-style situations. Page pinning of CPU memory is unnecessary when setting up for using that memory by modern GPUs or accelerators, because the latter can handle replayable page faults. So for such cases, the pages are in use by a GPU or accelerator, but unpinned. The performance problem occurs because for those pages, the NUMA balancing causes unmapping, which generates callbacks to the device driver, which dutifully unmaps the pages from the GPU or accelerator, even if the GPU might be busy using those pages. The device promptly causes a device page fault, and the driver then re-establishes the device page table mapping, which is good until the next round of unmapping from the NUMA balancer. hmm_range_fault()-based memory management in particular might benefit from having NUMA balancing disabled entirely for the memremap_pages() region, come to think of it. That seems relatively easy and clean at first glance anyway. For other regions (allocated by the device driver), a per-VMA flag seems about right: VM_NO_NUMA_BALANCING ? thanks,
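A rough sketch of how the per-VMA opt-out John floats could be consulted is below. VM_NO_NUMA_BALANCING does not exist upstream, the bit value is a placeholder, and the check site is an assumption (the natural candidates would be the scanner in task_numa_work() and change_prot_numa()).

```c
/* Hypothetical flag -- a real version would need to claim a free
 * vm_flags bit; BIT(40) is a placeholder, not a reserved value. */
#define VM_NO_NUMA_BALANCING	BIT(40)

/* Possible helper for the NUMA scanning/protection paths. */
static bool vma_numa_hinting_allowed(struct vm_area_struct *vma)
{
	/* Memory mapped by devices with their own page tables (GPUs,
	 * accelerators, VFIO-backed guest RAM) gains nothing from
	 * hinting faults, so let the owner opt the VMA out entirely. */
	if (vma->vm_flags & VM_NO_NUMA_BALANCING)
		return false;
	return true;
}
```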
On Fri, Aug 11, 2023 at 12:35:27PM -0700, John Hubbard wrote: > On 8/11/23 11:39, David Hildenbrand wrote: > ... > > > > Should we want to disable NUMA hinting for such VMAs instead (for example, by QEMU/hypervisor) that knows that any NUMA hinting activity on these ranges would be a complete waste of time? I recall that John H. once mentioned that there are > > > similar issues with GPU memory: NUMA hinting is actually counter-productive and they end up disabling it. > > > > > > > > > > Yes, NUMA balancing is incredibly harmful to performance, for GPU and > > > accelerators that map memory...and VMs as well, it seems. Basically, > > > anything that has its own processors and page tables needs to be left > > > strictly alone by NUMA balancing. Because the kernel is (still, even > > > today) unaware of what those processors are doing, and so it has no way > > > to do productive NUMA balancing. > > > > Is there any existing way we could handle that better on a per-VMA level, or on the process level? Any magic toggles? > > > > MMF_HAS_PINNED might be too restrictive. MMF_HAS_PINNED_LONGTERM might be better, but with things like iouring still too restrictive eventually. > > > > I recall that setting a mempolicy could prevent auto-numa from getting active, but that might be undesired. > > > > CCing Mel. > > > > Let's discern between page pinning situations, and HMM-style situations. > Page pinning of CPU memory is unnecessary when setting up for using that > memory by modern GPUs or accelerators, because the latter can handle > replayable page faults. So for such cases, the pages are in use by a GPU > or accelerator, but unpinned. > > The performance problem occurs because for those pages, the NUMA > balancing causes unmapping, which generates callbacks to the device > driver, which dutifully unmaps the pages from the GPU or accelerator, > even if the GPU might be busy using those pages. The device promptly > causes a device page fault, and the driver then re-establishes the > device page table mapping, which is good until the next round of > unmapping from the NUMA balancer. > > hmm_range_fault()-based memory management in particular might benefit > from having NUMA balancing disabled entirely for the memremap_pages() > region, come to think of it. That seems relatively easy and clean at > first glance anyway. > > For other regions (allocated by the device driver), a per-VMA flag > seems about right: VM_NO_NUMA_BALANCING ? > Thanks a lot for those good suggestions! For VMs, when could a per-VMA flag be set? Might be hard in mmap() in QEMU because a VMA may not be used for DMA until after it's mapped into VFIO. Then, should VFIO set this flag on after it maps a range? Could this flag be unset after device hot-unplug?
On 8/14/23 02:09, Yan Zhao wrote: ... >> hmm_range_fault()-based memory management in particular might benefit >> from having NUMA balancing disabled entirely for the memremap_pages() >> region, come to think of it. That seems relatively easy and clean at >> first glance anyway. >> >> For other regions (allocated by the device driver), a per-VMA flag >> seems about right: VM_NO_NUMA_BALANCING ? >> > Thanks a lot for those good suggestions! > For VMs, when could a per-VMA flag be set? > Might be hard in mmap() in QEMU because a VMA may not be used for DMA until > after it's mapped into VFIO. > Then, should VFIO set this flag on after it maps a range? > Could this flag be unset after device hot-unplug? > I'm hoping someone who thinks about VMs and VFIO often can chime in. thanks,
On Wed, Aug 16, 2023 at 09:43:40AM +0200, David Hildenbrand wrote: > On 15.08.23 04:34, John Hubbard wrote: > > On 8/14/23 02:09, Yan Zhao wrote: > > ... > > > > hmm_range_fault()-based memory management in particular might benefit > > > > from having NUMA balancing disabled entirely for the memremap_pages() > > > > region, come to think of it. That seems relatively easy and clean at > > > > first glance anyway. > > > > > > > > For other regions (allocated by the device driver), a per-VMA flag > > > > seems about right: VM_NO_NUMA_BALANCING ? > > > > > > > Thanks a lot for those good suggestions! > > > For VMs, when could a per-VMA flag be set? > > > Might be hard in mmap() in QEMU because a VMA may not be used for DMA until > > > after it's mapped into VFIO. > > > Then, should VFIO set this flag on after it maps a range? > > > Could this flag be unset after device hot-unplug? > > > > > > > I'm hoping someone who thinks about VMs and VFIO often can chime in. > > At least QEMU could just set it on the applicable VMAs (as said by Yuan Yao, > using madvise). > > BUT, I do wonder what value there would be for autonuma to still be active Currently MADV_* is up to 25 #define MADV_COLLAPSE 25, while madvise behavior is of type "int". So it's ok. But vma->vm_flags is of "unsigned long", so it's full at least on 32bit platform. > for the remainder of the hypervisor. If there is none, a prctl() would be > better. Add a new field in "struct vma_numab_state" in vma, and use prctl() to update this field? e.g. struct vma_numab_state { unsigned long next_scan; unsigned long next_pid_reset; unsigned long access_pids[2]; bool no_scan; }; > > We already do have a mechanism in QEMU to get notified when longterm-pinning > in the kernel might happen (and, therefore, MADV_DONTNEED must not be used): > * ram_block_discard_disable() > * ram_block_uncoordinated_discard_disable() Looks this ram_block_discard allow/disallow state is global rather than per-VMA in QEMU. So, do you mean that let kernel provide a per-VMA allow/disallow mechanism, and it's up to the user space to choose between per-VMA and complex way or global and simpler way?
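If something like Yan's no_scan field above were adopted, the scanner could honour it per VMA roughly as follows. struct vma_numab_state and the task_numa_work() walk are real; the no_scan field and the helper below only illustrate the proposal.

```c
/* Sketch: honouring a hypothetical per-VMA opt-out stored in the
 * existing struct vma_numab_state (no_scan is Yan's proposal above,
 * not an upstream field). */
static bool vma_numa_scan_wanted(struct vm_area_struct *vma)
{
	if (vma->numab_state && vma->numab_state->no_scan)
		return false;		/* opted out by user space */
	return true;
}

/*
 * In the VMA walk of task_numa_work() (kernel/sched/fair.c), roughly:
 *
 *	for_each_vma(vmi, vma) {
 *		if (!vma_migratable(vma) || !vma_policy_mof(vma) ||
 *		    !vma_numa_scan_wanted(vma))
 *			continue;
 *		...
 *	}
 */
```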
On 16.08.23 11:06, Yan Zhao wrote: > On Wed, Aug 16, 2023 at 09:43:40AM +0200, David Hildenbrand wrote: >> On 15.08.23 04:34, John Hubbard wrote: >>> On 8/14/23 02:09, Yan Zhao wrote: >>> ... >>>>> hmm_range_fault()-based memory management in particular might benefit >>>>> from having NUMA balancing disabled entirely for the memremap_pages() >>>>> region, come to think of it. That seems relatively easy and clean at >>>>> first glance anyway. >>>>> >>>>> For other regions (allocated by the device driver), a per-VMA flag >>>>> seems about right: VM_NO_NUMA_BALANCING ? >>>>> >>>> Thanks a lot for those good suggestions! >>>> For VMs, when could a per-VMA flag be set? >>>> Might be hard in mmap() in QEMU because a VMA may not be used for DMA until >>>> after it's mapped into VFIO. >>>> Then, should VFIO set this flag on after it maps a range? >>>> Could this flag be unset after device hot-unplug? >>>> >>> >>> I'm hoping someone who thinks about VMs and VFIO often can chime in. >> >> At least QEMU could just set it on the applicable VMAs (as said by Yuan Yao, >> using madvise). >> >> BUT, I do wonder what value there would be for autonuma to still be active > Currently MADV_* is up to 25 > #define MADV_COLLAPSE 25, > while madvise behavior is of type "int". So it's ok. > > But vma->vm_flags is of "unsigned long", so it's full at least on 32bit platform. I remember there were discussions to increase it also for 32bit. If that's required, we might want to go down that path. But do 32bit architectures even care about NUMA hinting? If not, just ignore them ... > >> for the remainder of the hypervisor. If there is none, a prctl() would be >> better. > Add a new field in "struct vma_numab_state" in vma, and use prctl() to > update this field? Rather a global toggle per MM, no need to update individual VMAs -- if we go down that prctl() path. No need to consume more memory for VMAs. [...] >> We already do have a mechanism in QEMU to get notified when longterm-pinning >> in the kernel might happen (and, therefore, MADV_DONTNEED must not be used): >> * ram_block_discard_disable() >> * ram_block_uncoordinated_discard_disable() > Looks this ram_block_discard allow/disallow state is global rather than per-VMA > in QEMU. Yes. Once you transition into "discard of any kind disabled", you can go over all guest memory VMAs (RAMBlock) and issue an madvise() for them. (or alternatively, do the prctl() once ) We'll also have to handle new guest memory being created afterwards, but that is easy. Once we transition to "no discarding disabled", you can go over all guest memory VMAs (RAMBlock) and issue an madvise() for them again (or alternatively, do the prctl() once). > So, do you mean that let kernel provide a per-VMA allow/disallow mechanism, and > it's up to the user space to choose between per-VMA and complex way or > global and simpler way? QEMU could do either way. The question would be if a per-vma settings makes sense for NUMA hinting.
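If the per-MM route David prefers were taken, the QEMU side could reduce to a single call at ram_block_discard_disable() time. The prctl opcode below is purely hypothetical -- no such prctl exists today -- and its value is a placeholder.

```c
/* Hypothetical userspace usage of a process-wide NUMA-hinting opt-out.
 * PR_SET_NUMA_HINTING is NOT an existing prctl; it stands in for
 * whatever interface this discussion might produce. */
#include <stdio.h>
#include <sys/prctl.h>

#define PR_SET_NUMA_HINTING	0x4e554d41	/* placeholder opcode */

int main(void)
{
	/* e.g. issued once by QEMU when long-term pinning becomes
	 * possible (ram_block_discard_disable() time), instead of a
	 * madvise() per guest-RAM VMA. */
	if (prctl(PR_SET_NUMA_HINTING, 0 /* disable */, 0, 0, 0))
		perror("prctl (expected to fail on current kernels)");
	return 0;
}
```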
On 8/16/23 02:49, David Hildenbrand wrote: > But do 32bit architectures even care about NUMA hinting? If not, just > ignore them ... Probably not! ... >> So, do you mean that let kernel provide a per-VMA allow/disallow >> mechanism, and >> it's up to the user space to choose between per-VMA and complex way or >> global and simpler way? > > QEMU could do either way. The question would be if a per-vma settings > makes sense for NUMA hinting. From our experience with compute on GPUs, a per-mm setting would suffice. No need to go all the way to VMA granularity. thanks,