Message ID: 20230613-vv-kmem_memmap-v1-0-f6de9c6af2c6@intel.com
Headers:
From: Vishal Verma <vishal.l.verma@intel.com>
Subject: [PATCH 0/3] mm: use memmap_on_memory semantics for dax/kmem
Date: Thu, 15 Jun 2023 16:00:22 -0600
Message-Id: <20230613-vv-kmem_memmap-v1-0-f6de9c6af2c6@intel.com>
To: "Rafael J. Wysocki" <rafael@kernel.org>, Len Brown <lenb@kernel.org>, Andrew Morton <akpm@linux-foundation.org>, David Hildenbrand <david@redhat.com>, Oscar Salvador <osalvador@suse.de>, Dan Williams <dan.j.williams@intel.com>, Dave Jiang <dave.jiang@intel.com>
Cc: linux-acpi@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org, Huang Ying <ying.huang@intel.com>, Dave Hansen <dave.hansen@linux.intel.com>, Vishal Verma <vishal.l.verma@intel.com>
Series: mm: use memmap_on_memory semantics for dax/kmem
Message
Verma, Vishal L
June 15, 2023, 10 p.m. UTC
The dax/kmem driver can potentially hot-add large amounts of memory
originating from CXL memory expanders, or NVDIMMs, or other 'device
memories'. There is a chance there isn't enough regular system memory
available to fit the memmap for this new memory. It's therefore
desirable, if all other conditions are met, for the kmem managed memory
to place its memmap on the newly added memory itself.
Arrange for this by first allowing for a module parameter override for
the mhp_supports_memmap_on_memory() test using a flag, adjusting the
only other caller of this interface in drivers/acpi/acpi_memhotplug.c,
exporting the symbol so it can be called by kmem.c, and finally changing
the kmem driver to add_memory() in chunks of memory_block_size_bytes().
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
---
Vishal Verma (3):
mm/memory_hotplug: Allow an override for the memmap_on_memory param
mm/memory_hotplug: Export symbol mhp_supports_memmap_on_memory()
dax/kmem: Always enroll hotplugged memory for memmap_on_memory
 include/linux/memory_hotplug.h |  2 +-
 drivers/acpi/acpi_memhotplug.c |  2 +-
 drivers/dax/kmem.c             | 49 +++++++++++++++++++++++++++++++-----------
 mm/memory_hotplug.c            | 25 ++++++++++++++-------
 4 files changed, 55 insertions(+), 23 deletions(-)
---
base-commit: f1fcbaa18b28dec10281551dfe6ed3a3ed80e3d6
change-id: 20230613-vv-kmem_memmap-5483c8d04279
Best regards,
--
Vishal Verma <vishal.l.verma@intel.com>
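As a rough illustration of the approach the cover letter describes (hot-adding in memory-block-sized chunks so each block can host its own memmap), a minimal sketch might look like the following. This is not the code from the series: the helper name kmem_add_in_blocks() is made up for illustration, error unwinding on partial failure is omitted, and the in-tree kmem driver goes through a driver-managed variant of add_memory() rather than calling it directly.

#include <linux/memory.h>		/* memory_block_size_bytes() */
#include <linux/memory_hotplug.h>	/* add_memory(), MHP_MEMMAP_ON_MEMORY */

/* Hypothetical helper, for illustration only -- not the code in this series. */
static int kmem_add_in_blocks(int nid, u64 start, u64 size)
{
	u64 block_size = memory_block_size_bytes();
	u64 offset;
	int rc;

	/* memmap_on_memory operates on single, block-aligned memory blocks */
	if (!IS_ALIGNED(start, block_size) || !IS_ALIGNED(size, block_size))
		return -EINVAL;

	/* add one memory block at a time so each can carry its own memmap */
	for (offset = 0; offset < size; offset += block_size) {
		rc = add_memory(nid, start + offset, block_size,
				MHP_MEMMAP_ON_MEMORY);
		if (rc)
			return rc;
	}
	return 0;
}

In the series, kmem first consults mhp_supports_memmap_on_memory(), which patches 1 and 2 make overridable and callable from kmem.c, to decide whether to pass the MHP_MEMMAP_ON_MEMORY flag at all.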
Comments
On 16.06.23 00:00, Vishal Verma wrote:
> The dax/kmem driver can potentially hot-add large amounts of memory
> originating from CXL memory expanders, or NVDIMMs, or other 'device
> memories'. There is a chance there isn't enough regular system memory
> available to fit the memmap for this new memory. It's therefore
> desirable, if all other conditions are met, for the kmem managed memory
> to place its memmap on the newly added memory itself.
>
> Arrange for this by first allowing for a module parameter override for
> the mhp_supports_memmap_on_memory() test using a flag, adjusting the
> only other caller of this interface in drivers/acpi/acpi_memhotplug.c,
> exporting the symbol so it can be called by kmem.c, and finally changing
> the kmem driver to add_memory() in chunks of memory_block_size_bytes().

1) Why is the override a requirement here? Just let the admin configure
it, then add conditional support for kmem.

2) I recall that there are cases where we don't want the memmap to land
on slow memory (which online_movable would achieve). Just imagine the
slow PMEM case. So this might need another configuration knob on the
kmem side.

I recall some discussions on doing that chunk handling internally (so
kmem can just perform one add_memory() and we'd split that up
internally).
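For readers following along, the onlining policy mentioned above is the per-block choice an admin (or a udev/systemd rule) makes when onlining hot-added memory, e.g.:

  echo online_movable > /sys/devices/system/memory/memoryN/state

where N is the memory block number; a system-wide default can also be set via /sys/devices/system/memory/auto_online_blocks or the memhp_default_state= boot parameter (see Documentation/admin-guide/mm/memory-hotplug.rst).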
On Fri, 2023-06-16 at 09:44 +0200, David Hildenbrand wrote:
> On 16.06.23 00:00, Vishal Verma wrote:
> > The dax/kmem driver can potentially hot-add large amounts of memory
> > originating from CXL memory expanders, or NVDIMMs, or other 'device
> > memories'. There is a chance there isn't enough regular system memory
> > available to fit the memmap for this new memory. It's therefore
> > desirable, if all other conditions are met, for the kmem managed memory
> > to place its memmap on the newly added memory itself.
> >
> > Arrange for this by first allowing for a module parameter override for
> > the mhp_supports_memmap_on_memory() test using a flag, adjusting the
> > only other caller of this interface in drivers/acpi/acpi_memhotplug.c,
> > exporting the symbol so it can be called by kmem.c, and finally changing
> > the kmem driver to add_memory() in chunks of memory_block_size_bytes().
>
> 1) Why is the override a requirement here? Just let the admin configure
> it, then add conditional support for kmem.

Configure it in the current way using the module parameter to
memory_hotplug? The whole module param check feels a bit awkward,
especially since memory_hotplug is builtin: the only way to supply the
param is on the kernel command line, as opposed to a modprobe config.

The goal with extending mhp_supports_memmap_on_memory() to check for
support with or without consideration for the module param is that it
makes it a bit more flexible for callers like kmem.

> 2) I recall that there are cases where we don't want the memmap to land
> on slow memory (which online_movable would achieve). Just imagine the
> slow PMEM case. So this might need another configuration knob on the
> kmem side.
>
> I recall some discussions on doing that chunk handling internally (so
> kmem can just perform one add_memory() and we'd split that up
> internally).

Another config knob isn't unreasonable - though the thinking in making
this behavior the new default policy was that with CXL based memory
expanders, the performance delta from main memory wouldn't be as big as
the pmem - main memory delta. With pmem devices being phased out, it's
not clear we'd need a knob, and it can always be added if it ends up
becoming necessary.

The other comments about doing the per-memblock loop internally, and
fixing up the removal paths all sound good, and I've started reworking
those - thanks for taking a look!
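For context, the global knob being discussed here is the memory_hotplug.memmap_on_memory module parameter; since memory_hotplug is built in, it is supplied at boot, e.g. on the kernel command line:

  memory_hotplug.memmap_on_memory=1

It only takes effect on kernels built with CONFIG_MHP_MEMMAP_ON_MEMORY, and, as noted in the next reply, it is not toggleable at runtime.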
On 21.06.23 21:32, Verma, Vishal L wrote:
> On Fri, 2023-06-16 at 09:44 +0200, David Hildenbrand wrote:
>> On 16.06.23 00:00, Vishal Verma wrote:
>>> The dax/kmem driver can potentially hot-add large amounts of memory
>>> originating from CXL memory expanders, or NVDIMMs, or other 'device
>>> memories'. There is a chance there isn't enough regular system memory
>>> available to fit the memmap for this new memory. It's therefore
>>> desirable, if all other conditions are met, for the kmem managed memory
>>> to place its memmap on the newly added memory itself.
>>>
>>> Arrange for this by first allowing for a module parameter override for
>>> the mhp_supports_memmap_on_memory() test using a flag, adjusting the
>>> only other caller of this interface in drivers/acpi/acpi_memhotplug.c,
>>> exporting the symbol so it can be called by kmem.c, and finally changing
>>> the kmem driver to add_memory() in chunks of memory_block_size_bytes().
>>
>> 1) Why is the override a requirement here? Just let the admin configure
>> it, then add conditional support for kmem.
>
> Configure it in the current way using the module parameter to
> memory_hotplug? The whole module param check feels a bit awkward,
> especially since memory_hotplug is builtin: the only way to supply the
> param is on the kernel command line, as opposed to a modprobe config.

Yes, and that's nothing special. Runtime toggling is not implemented.

> The goal with extending mhp_supports_memmap_on_memory() to check for
> support with or without consideration for the module param is that it
> makes it a bit more flexible for callers like kmem.

Not convinced yet that the global parameter should be bypassed TBH. And
if so, this should be a separate patch on top that is completely
optional for the remainder of the series.

In any case, there has to be some admin control over that, because

1) You usually don't want vmemmap on potentially slow memory

2) Using memmap-on-memory prohibits gigantic pages from forming on that
memory (when runtime-allocating them).

So "just doing that" without any config knob is problematic.
David Hildenbrand <david@redhat.com> writes:

> On 16.06.23 00:00, Vishal Verma wrote:
>> The dax/kmem driver can potentially hot-add large amounts of memory
>> originating from CXL memory expanders, or NVDIMMs, or other 'device
>> memories'. There is a chance there isn't enough regular system memory
>> available to fit the memmap for this new memory. It's therefore
>> desirable, if all other conditions are met, for the kmem managed memory
>> to place its memmap on the newly added memory itself.
>>
>> Arrange for this by first allowing for a module parameter override for
>> the mhp_supports_memmap_on_memory() test using a flag, adjusting the
>> only other caller of this interface in drivers/acpi/acpi_memhotplug.c,
>> exporting the symbol so it can be called by kmem.c, and finally changing
>> the kmem driver to add_memory() in chunks of memory_block_size_bytes().
>
> 1) Why is the override a requirement here? Just let the admin
> configure it, then add conditional support for kmem.
>
> 2) I recall that there are cases where we don't want the memmap to
> land on slow memory (which online_movable would achieve). Just imagine
> the slow PMEM case. So this might need another configuration knob on
> the kmem side.

From my memory, the case where you don't want the memmap to land on
*persistent memory* is when the device is small (such as NVDIMM-N), and
you want to reserve as much space as possible for the application data.
This has nothing to do with the speed of access.

-Jeff
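For a rough sense of scale (assuming 64-byte struct pages and 4 KiB base pages, as on x86_64): the memmap costs 64/4096 = 1/64 of the capacity it describes, about 1.6%. That is roughly 16 GiB per 1 TiB of hot-added memory, 32 MiB per 2 GiB memory block, or around 256 MiB of a hypothetical 16 GiB NVDIMM-N, which is the space such a small device would give up if the memmap lived on the device itself.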
On 13.07.23 21:12, Jeff Moyer wrote:
> David Hildenbrand <david@redhat.com> writes:
>
>> On 16.06.23 00:00, Vishal Verma wrote:
>>> The dax/kmem driver can potentially hot-add large amounts of memory
>>> originating from CXL memory expanders, or NVDIMMs, or other 'device
>>> memories'. There is a chance there isn't enough regular system memory
>>> available to fit the memmap for this new memory. It's therefore
>>> desirable, if all other conditions are met, for the kmem managed memory
>>> to place its memmap on the newly added memory itself.
>>>
>>> Arrange for this by first allowing for a module parameter override for
>>> the mhp_supports_memmap_on_memory() test using a flag, adjusting the
>>> only other caller of this interface in drivers/acpi/acpi_memhotplug.c,
>>> exporting the symbol so it can be called by kmem.c, and finally changing
>>> the kmem driver to add_memory() in chunks of memory_block_size_bytes().
>>
>> 1) Why is the override a requirement here? Just let the admin
>> configure it, then add conditional support for kmem.
>>
>> 2) I recall that there are cases where we don't want the memmap to
>> land on slow memory (which online_movable would achieve). Just imagine
>> the slow PMEM case. So this might need another configuration knob on
>> the kmem side.
>
> From my memory, the case where you don't want the memmap to land on
> *persistent memory* is when the device is small (such as NVDIMM-N), and
> you want to reserve as much space as possible for the application data.
> This has nothing to do with the speed of access.

Now that you mention it, I also do remember the origin of the altmap --
to achieve exactly that: place the memmap on the device.

commit 4b94ffdc4163bae1ec73b6e977ffb7a7da3d06d3
Author: Dan Williams <dan.j.williams@intel.com>
Date:   Fri Jan 15 16:56:22 2016 -0800

    x86, mm: introduce vmem_altmap to augment vmemmap_populate()

    In support of providing struct page for large persistent memory
    capacities, use struct vmem_altmap to change the default policy for
    allocating memory for the memmap array. The default vmemmap_populate()
    allocates page table storage area from the page allocator. Given
    persistent memory capacities relative to DRAM it may not be feasible to
    store the memmap in 'System Memory'. Instead vmem_altmap represents
    pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf()
    requests.

In PFN_MODE_PMEM (and only then), we use the altmap (don't see a way to
configure it).

BUT that case is completely different from the "System RAM" mode. The
memmap of an NVDIMM in pmem mode is barely used by core-mm (i.e., not
the buddy).

In comparison, if the buddy and everybody else works on the memmap in
"System RAM", it's much more significant if that resides on slow memory.

Looking at

commit 9b6e63cbf85b89b2dbffa4955dbf2df8250e5375
Author: Michal Hocko <mhocko@suse.com>
Date:   Tue Oct 3 16:16:19 2017 -0700

    mm, page_alloc: add scheduling point to memmap_init_zone

    memmap_init_zone gets a pfn range to initialize and it can be really
    large resulting in a soft lockup on non-preemptible kernels

      NMI watchdog: BUG: soft lockup - CPU#31 stuck for 23s! [kworker/u642:5:1720]
      [...]
      task: ffff88ecd7e902c0 ti: ffff88eca4e50000 task.ti: ffff88eca4e50000
      RIP: move_pfn_range_to_zone+0x185/0x1d0
      [...]
      Call Trace:
        devm_memremap_pages+0x2c7/0x430
        pmem_attach_disk+0x2fd/0x3f0 [nd_pmem]
        nvdimm_bus_probe+0x64/0x110 [libnvdimm]

It's hard to tell if that was only required due to the memmap for these
devices being that large, or also partially because the access to the
memmap is slow enough that it makes a real difference.

I recall that we're also often using ZONE_MOVABLE on such slow memory to
not end up placing other kernel data structures on there: especially,
user space page tables as I've been told.

@Dan, any insight on the performance aspects when placing the memmap on
(slow) memory and having that memory be consumed by the buddy where we
frequently operate on the memmap?
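To connect the altmap back to this series: when a caller passes MHP_MEMMAP_ON_MEMORY, the hotplug core describes the head of the hot-added range as altmap space, so the vmemmap is allocated from the new memory rather than from the page allocator. The following is a simplified paraphrase of the relevant add_memory_resource() logic in mm/memory_hotplug.c from kernels around the time of this thread; locking, error unwinding, and the memory-block bookkeeping are omitted.

/*
 * Fragment paraphrased from add_memory_resource(); nid/start/size/mhp_flags
 * are its arguments. Simplified for illustration only.
 */
struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
struct vmem_altmap mhp_altmap = {};

if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
	if (!mhp_supports_memmap_on_memory(size))
		return -EINVAL;
	/* carve the memmap out of the range being hot-added */
	mhp_altmap.base_pfn = PHYS_PFN(start);
	mhp_altmap.free = PHYS_PFN(size);
	params.altmap = &mhp_altmap;
}

/* arch code then satisfies vmemmap allocations from the altmap */
ret = arch_add_memory(nid, start, size, &params);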
David Hildenbrand <david@redhat.com> writes:

> On 13.07.23 21:12, Jeff Moyer wrote:
>> David Hildenbrand <david@redhat.com> writes:
>>
>>> On 16.06.23 00:00, Vishal Verma wrote:
>>>> The dax/kmem driver can potentially hot-add large amounts of memory
>>>> originating from CXL memory expanders, or NVDIMMs, or other 'device
>>>> memories'. There is a chance there isn't enough regular system memory
>>>> available to fit the memmap for this new memory. It's therefore
>>>> desirable, if all other conditions are met, for the kmem managed memory
>>>> to place its memmap on the newly added memory itself.
>>>>
>>>> Arrange for this by first allowing for a module parameter override for
>>>> the mhp_supports_memmap_on_memory() test using a flag, adjusting the
>>>> only other caller of this interface in drivers/acpi/acpi_memhotplug.c,
>>>> exporting the symbol so it can be called by kmem.c, and finally changing
>>>> the kmem driver to add_memory() in chunks of memory_block_size_bytes().
>>>
>>> 1) Why is the override a requirement here? Just let the admin
>>> configure it, then add conditional support for kmem.
>>>
>>> 2) I recall that there are cases where we don't want the memmap to
>>> land on slow memory (which online_movable would achieve). Just imagine
>>> the slow PMEM case. So this might need another configuration knob on
>>> the kmem side.
>>
>> From my memory, the case where you don't want the memmap to land on
>> *persistent memory* is when the device is small (such as NVDIMM-N), and
>> you want to reserve as much space as possible for the application data.
>> This has nothing to do with the speed of access.
>
> Now that you mention it, I also do remember the origin of the altmap --
> to achieve exactly that: place the memmap on the device.
>
> commit 4b94ffdc4163bae1ec73b6e977ffb7a7da3d06d3
> Author: Dan Williams <dan.j.williams@intel.com>
> Date:   Fri Jan 15 16:56:22 2016 -0800
>
>     x86, mm: introduce vmem_altmap to augment vmemmap_populate()
>
>     In support of providing struct page for large persistent memory
>     capacities, use struct vmem_altmap to change the default policy for
>     allocating memory for the memmap array. The default vmemmap_populate()
>     allocates page table storage area from the page allocator. Given
>     persistent memory capacities relative to DRAM it may not be feasible to
>     store the memmap in 'System Memory'. Instead vmem_altmap represents
>     pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf()
>     requests.
>
> In PFN_MODE_PMEM (and only then), we use the altmap (don't see a way to
> configure it).

Configuration is done at pmem namespace creation time. The metadata for
the namespace indicates where the memmap resides. See the
ndctl-create-namespace man page:

       -M, --map=
           A pmem namespace in "fsdax" or "devdax" mode requires
           allocation of per-page metadata. The allocation can be drawn
           from either:

           · "mem": typical system memory

           · "dev": persistent memory reserved from the namespace

           Given relative capacities of "Persistent Memory" to "System
           RAM" the allocation defaults to reserving space out of the
           namespace directly ("--map=dev"). The overhead is 64-bytes
           per 4K (16GB per 1TB) on x86.

> BUT that case is completely different from the "System RAM" mode. The
> memmap of an NVDIMM in pmem mode is barely used by core-mm (i.e., not
> the buddy).

Right. (btw, I don't think system ram mode existed back then.)

> In comparison, if the buddy and everybody else works on the memmap in
> "System RAM", it's much more significant if that resides on slow memory.

Agreed.

> Looking at
>
> commit 9b6e63cbf85b89b2dbffa4955dbf2df8250e5375
> Author: Michal Hocko <mhocko@suse.com>
> Date:   Tue Oct 3 16:16:19 2017 -0700
>
>     mm, page_alloc: add scheduling point to memmap_init_zone
>
>     memmap_init_zone gets a pfn range to initialize and it can be really
>     large resulting in a soft lockup on non-preemptible kernels
>
>       NMI watchdog: BUG: soft lockup - CPU#31 stuck for 23s! [kworker/u642:5:1720]
>       [...]
>       task: ffff88ecd7e902c0 ti: ffff88eca4e50000 task.ti: ffff88eca4e50000
>       RIP: move_pfn_range_to_zone+0x185/0x1d0
>       [...]
>       Call Trace:
>         devm_memremap_pages+0x2c7/0x430
>         pmem_attach_disk+0x2fd/0x3f0 [nd_pmem]
>         nvdimm_bus_probe+0x64/0x110 [libnvdimm]
>
> It's hard to tell if that was only required due to the memmap for these
> devices being that large, or also partially because the access to the
> memmap is slow enough that it makes a real difference.

I believe the main driver was the size. At the time, Intel was
advertising 3TiB/socket for pmem. I can't remember the exact DRAM
configuration sizes from the time.

> I recall that we're also often using ZONE_MOVABLE on such slow memory to
> not end up placing other kernel data structures on there: especially,
> user space page tables as I've been told.

Part of the issue was preserving the media. The page structure gets lots
of updates, and that could cause premature wear.

> @Dan, any insight on the performance aspects when placing the memmap on
> (slow) memory and having that memory be consumed by the buddy where we
> frequently operate on the memmap?

I'm glad you're asking these questions. We definitely want to make sure
we don't conflate requirements based on some particular
technology/implementation. Also, I wouldn't make any assumptions about
the performance of CXL devices. As I understand it, there could be a
broad spectrum of performance profiles.

And now Dan can correct anything I got wrong. ;-)

Cheers,
Jeff
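For anyone reproducing the pmem setup Jeff describes, the namespace-creation step looks something like the following (illustrative; region, size, and mode are system- and use-case-specific):

  # default: reserve the per-page metadata (memmap) from the namespace itself
  ndctl create-namespace --mode=fsdax --map=dev

  # alternative: keep the memmap in regular system memory
  ndctl create-namespace --mode=fsdax --map=mem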