Message ID | 20230613-vv-kmem_memmap-v1-3-f6de9c6af2c6@intel.com |
---|---|
State | New |
Headers | From: Vishal Verma <vishal.l.verma@intel.com>; Date: Thu, 15 Jun 2023 16:00:25 -0600; Subject: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory |
Series | mm: use memmap_on_memory semantics for dax/kmem |
Commit Message
Verma, Vishal L
June 15, 2023, 10 p.m. UTC
With DAX memory regions originating from CXL memory expanders or
NVDIMMs, the kmem driver may be hot-adding huge amounts of system memory
on a system without enough 'regular' main memory to support the memmap
for it. To avoid this, ensure that all kmem managed hotplugged memory is
added with the MHP_MEMMAP_ON_MEMORY flag to place the memmap on the
new memory region being hot added.
To do this, call add_memory() in chunks of memory_block_size_bytes() as
that is a requirement for memmap_on_memory. Additionally, use the
mhp_flag to force the memmap_on_memory checks regardless of the
respective module parameter setting.
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
---
drivers/dax/kmem.c | 49 ++++++++++++++++++++++++++++++++++++-------------
1 file changed, 36 insertions(+), 13 deletions(-)
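For readers new to the mechanism: MHP_MEMMAP_ON_MEMORY asks the hotplug core to carve the memmap (struct page array) for a new memory block out of the block itself via a vmem_altmap, rather than allocating it from existing main memory. A conceptual sketch follows; the helper name is invented for illustration, and the field usage mirrors the add_memory_resource() snippet quoted later in this thread.

#include <linux/memory.h>		/* memory_block_size_bytes() */
#include <linux/memory_hotplug.h>	/* struct mhp_params */
#include <linux/memremap.h>		/* struct vmem_altmap */
#include <linux/pfn.h>			/* PHYS_PFN() */

/*
 * Conceptual sketch (not the patch's code): with MHP_MEMMAP_ON_MEMORY,
 * the hotplug core points params.altmap at a vmem_altmap like this, so
 * the vmemmap pages for the new block are allocated from the head of
 * the block itself instead of from existing main memory.
 */
static void sketch_altmap_setup(u64 start, struct mhp_params *params,
				struct vmem_altmap *mhp_altmap)
{
	mhp_altmap->base_pfn = PHYS_PFN(start);
	mhp_altmap->free = PHYS_PFN(memory_block_size_bytes());
	params->altmap = mhp_altmap;
}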
Comments
Vishal Verma <vishal.l.verma@intel.com> writes:

> With DAX memory regions originating from CXL memory expanders or
> NVDIMMs, the kmem driver may be hot-adding huge amounts of system memory
> on a system without enough 'regular' main memory to support the memmap
> for it. To avoid this, ensure that all kmem managed hotplugged memory is
> added with the MHP_MEMMAP_ON_MEMORY flag to place the memmap on the
> new memory region being hot added.
>
> To do this, call add_memory() in chunks of memory_block_size_bytes() as
> that is a requirement for memmap_on_memory. Additionally, use the
> mhp_flag to force the memmap_on_memory checks regardless of the
> respective module parameter setting.

[..full patch snipped..]

It appears that we need to hot-remove memory in the granularity of
memory_block_size_bytes() too, according to try_remove_memory(). If so,
it seems better to allocate one dax_kmem_data.res[] element for each
memory block instead of one per dax region?

Best Regards,
Huang, Ying
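A minimal sketch of the removal-granularity concern raised here, assuming a symmetric per-block teardown loop; the helper name and error handling are invented for illustration and this is not the driver's actual remove path:

#include <linux/memory.h>		/* memory_block_size_bytes() */
#include <linux/memory_hotplug.h>	/* remove_memory() */

/*
 * Illustrative only: if memory was added in memory_block_size_bytes()
 * chunks, the remove path plausibly has to walk the range in the same
 * granularity, since try_remove_memory() operates per memory block.
 */
static int kmem_remove_range_sketch(u64 start, u64 len)
{
	u64 cur;
	int rc;

	for (cur = start; cur < start + len; cur += memory_block_size_bytes()) {
		rc = remove_memory(cur, memory_block_size_bytes());
		if (rc)
			return rc;	/* e.g. block still online/busy */
	}
	return 0;
}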
On 16.06.23 00:00, Vishal Verma wrote:
> With DAX memory regions originating from CXL memory expanders or
> NVDIMMs, the kmem driver may be hot-adding huge amounts of system memory
> on a system without enough 'regular' main memory to support the memmap
> for it. To avoid this, ensure that all kmem managed hotplugged memory is
> added with the MHP_MEMMAP_ON_MEMORY flag to place the memmap on the
> new memory region being hot added.

[..patch snipped..]

Maybe the better alternative is to teach
add_memory_resource()/try_remove_memory() to do that internally.

In the add_memory_resource() case, it might be a loop around that
memmap_on_memory + arch_add_memory code path (well, and the error path
also needs adjustment):

	/*
	 * Self hosted memmap array
	 */
	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
		if (!mhp_supports_memmap_on_memory(size)) {
			ret = -EINVAL;
			goto error;
		}
		mhp_altmap.free = PHYS_PFN(size);
		mhp_altmap.base_pfn = PHYS_PFN(start);
		params.altmap = &mhp_altmap;
	}

	/* call arch's memory hotadd */
	ret = arch_add_memory(nid, start, size, &params);
	if (ret < 0)
		goto error;

Note that we want to handle that on a per memory-block basis, because we
don't want the vmemmap of memory block #2 to end up on memory block #1.
It all gets messy with memory onlining/offlining etc otherwise ...
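To make the "vmemmap of block #2 on block #1" hazard concrete, a layout sketch (block contents are illustrative):

/*
 * If a single altmap spanned two memory blocks, block #1 would host the
 * memmap for both blocks:
 *
 *   block #1: [ memmap(#1) | memmap(#2) | usable RAM ... ]
 *   block #2: [             usable RAM ...               ]
 *
 * Offlining or removing block #1 alone would then pull the memmap out
 * from under the still-online block #2 -- hence one altmap per block.
 */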
Hi Vishal,

Vishal Verma <vishal.l.verma@intel.com> writes:

> With DAX memory regions originating from CXL memory expanders or
> NVDIMMs, the kmem driver may be hot-adding huge amounts of system memory
> on a system without enough 'regular' main memory to support the memmap
> for it. To avoid this, ensure that all kmem managed hotplugged memory is
> added with the MHP_MEMMAP_ON_MEMORY flag to place the memmap on the
> new memory region being hot added.

Some architectures don't have support for MEMMAP_ON_MEMORY; bypassing
the mhp_memmap_on_memory() check might cause problems on such
architectures (e.g. PPC64).

> To do this, call add_memory() in chunks of memory_block_size_bytes() as
> that is a requirement for memmap_on_memory. Additionally, use the
> mhp_flag to force the memmap_on_memory checks regardless of the
> respective module parameter setting.

[..rest of patch snipped..]
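For context, a simplified sketch of the arch gating referred to above, modeled on the mainline pattern where memmap_on_memory support hinges on a config option only some architectures select; the combined IS_ENABLED() form below condenses the two #ifdef branches in mm/memory_hotplug.c and is an approximation, not a verbatim copy:

/*
 * Simplified sketch: without CONFIG_MHP_MEMMAP_ON_MEMORY (which depends
 * on the arch selecting ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE), the check
 * always returns false, so forcing the flag from a driver must still
 * fall back gracefully on such architectures.
 */
static inline bool mhp_memmap_on_memory_sketch(void)
{
	return IS_ENABLED(CONFIG_MHP_MEMMAP_ON_MEMORY) && memmap_on_memory;
}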
David Hildenbrand <david@redhat.com> writes:

> On 16.06.23 00:00, Vishal Verma wrote:
>> With DAX memory regions originating from CXL memory expanders or
>> NVDIMMs, the kmem driver may be hot-adding huge amounts of system memory
>> on a system without enough 'regular' main memory to support the memmap
>> for it.

[..snip..]

> Maybe the better alternative is to teach
> add_memory_resource()/try_remove_memory() to do that internally.
>
> In the add_memory_resource() case, it might be a loop around that
> memmap_on_memory + arch_add_memory code path (well, and the error path
> also needs adjustment):
>
> Note that we want to handle that on a per memory-block basis, because we
> don't want the vmemmap of memory block #2 to end up on memory block #1.
> It all gets messy with memory onlining/offlining etc otherwise ...

I tried to implement this inside add_memory_driver_managed() and also
within dax/kmem. IMHO doing the error handling inside dax/kmem is
better. Here is how it looks:

1. If any blocks got added before (mapped > 0) we loop through all
   successful request_mem_regions
2. For each successful request_mem_region, if any blocks got added, we
   keep the resource. If none got added, we will kfree the resource.

	for (i = 0; i < dev_dax->nr_range; i++) {
		u64 cur_start, cur_len, remaining;
		struct resource *res;
		struct range range;
		bool block_added;

		rc = dax_kmem_range(dev_dax, i, &range);
		if (rc)
			continue;

		/* Region is permanently reserved if hotremove fails. */
		res = request_mem_region(range.start, range_len(&range),
					 data->res_name);
		if (!res) {
			dev_warn(dev,
				 "mapping%d: %#llx-%#llx could not reserve region\n",
				 i, range.start, range.end);
			/*
			 * Once some memory has been onlined we can't
			 * assume that it can be un-onlined safely.
			 */
			if (mapped)
				continue;
			rc = -EBUSY;
			goto err_request_mem;
		}
		data->res[i] = res;

		/*
		 * Set flags appropriate for System RAM. Leave ..._BUSY clear
		 * so that add_memory() can add a child resource. Do not
		 * inherit flags from the parent since it may set new flags
		 * unknown to us that will break add_memory() below.
		 */
		res->flags = IORESOURCE_SYSTEM_RAM;

		/*
		 * Add memory in chunks of memory_block_size_bytes() so that
		 * it is considered for MHP_MEMMAP_ON_MEMORY
		 * @range has already been aligned to memory_block_size_bytes(),
		 * so the following loop will always break it down cleanly.
		 */
		cur_start = range.start;
		cur_len = memory_block_size_bytes();
		remaining = range_len(&range);
		block_added = false;
		while (remaining) {
			/*
			 * If alignment rules are not satisfied we will
			 * fall back to normal memmap allocation.
			 */
			mhp_t mhp_flags = MHP_NID_IS_MGID |
					  MHP_MEMMAP_ON_MEMORY;

			/*
			 * Ensure that future kexec'd kernels will not treat
			 * this as RAM automatically.
			 */
			rc = add_memory_driver_managed(data->mgid, cur_start,
						       cur_len, kmem_name,
						       mhp_flags);
			if (rc)
				dev_warn(dev,
					 "mapping%d: %#llx-%#llx memory add failed\n",
					 i, cur_start, cur_start + cur_len - 1);
			else
				block_added = true;

			cur_start += cur_len;
			remaining -= cur_len;
		}

		if (!block_added) {
			/*
			 * None of the blocks got added, remove the resource.
			 */
			remove_resource(res);
			kfree(res);
			data->res[i] = NULL;
		} else
			mapped++;
	}

	if (mapped) {
		dev_set_drvdata(dev, data);
		return 0;
	}

err_request_mem:
	/*
	 * If none of the resources got mapped,
	 * unregister the group.
	 */
	memory_group_unregister(data->mgid);
err_reg_mgid:
	kfree(data->res_name);
On 11.07.23 16:30, Aneesh Kumar K.V wrote:
> David Hildenbrand <david@redhat.com> writes:
>
>> Maybe the better alternative is to teach
>> add_memory_resource()/try_remove_memory() to do that internally.

[..snip..]

> I tried to implement this inside add_memory_driver_managed() and also
> within dax/kmem. IMHO doing the error handling inside dax/kmem is
> better. Here is how it looks:
>
> 1. If any blocks got added before (mapped > 0) we loop through all
>    successful request_mem_regions
> 2. For each successful request_mem_region, if any blocks got added, we
>    keep the resource. If none got added, we will kfree the resource.

Doing this unconditional splitting outside of
add_memory_driver_managed() is undesirable for at least two reasons:

1) You end up always creating individual entries in the resource tree
   (/proc/iomem) even if MHP_MEMMAP_ON_MEMORY is not effective.
2) As we call arch_add_memory() in memory block granularity (e.g., 128
   MiB on x86), we might not make use of large PUDs (e.g., 1 GiB) in
   the identity mapping -- even if MHP_MEMMAP_ON_MEMORY is not
   effective.

While you could sense for support and do the split based on that, it
will be beneficial for other users (especially DIMMs) if we do that
internally -- where we already know if MHP_MEMMAP_ON_MEMORY can be
effective or not.

In general, we avoid placing important kernel data-structures on slow
memory. That's one of the reasons why PMEM decided to mostly always use
ZONE_MOVABLE, such that exactly what this patch does would not happen.
So I'm wondering if there would be demand for an additional toggle.

Because even with memmap_on_memory enabled in general, you might not
want to do that for dax/kmem.

IMHO, this patch should be dropped from your ppc64 series, as it's an
independent change that might be valuable for other architectures as
well.
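To illustrate point 1), a hypothetical /proc/iomem excerpt (addresses invented) after unconditionally splitting a 256 MiB dax range into 128 MiB memory blocks -- each chunk becomes its own child resource even when no altmap was actually placed:

	240000000-24fffffff : dax0.0
	  240000000-247ffffff : System RAM (kmem)
	  248000000-24fffffff : System RAM (kmem)

Without splitting, the same range would appear as a single "System RAM (kmem)" child entry under the dax device.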
On Tue, 2023-07-11 at 17:21 +0200, David Hildenbrand wrote:
>
> Doing this unconditional splitting outside of
> add_memory_driver_managed() is undesirable for at least two reasons:
>
> 1) You end up always creating individual entries in the resource tree
>    (/proc/iomem) even if MHP_MEMMAP_ON_MEMORY is not effective.
> 2) As we call arch_add_memory() in memory block granularity (e.g., 128
>    MiB on x86), we might not make use of large PUDs (e.g., 1 GiB) in
>    the identity mapping -- even if MHP_MEMMAP_ON_MEMORY is not
>    effective.
>
> While you could sense for support and do the split based on that, it
> will be beneficial for other users (especially DIMMs) if we do that
> internally -- where we already know if MHP_MEMMAP_ON_MEMORY can be
> effective or not.

I'm taking a shot at implementing the splitting internally in
memory_hotplug.c. The caller (kmem) side does become trivial with this
approach, but there's a slight complication if I don't have the module
param override (patch 1 of this series).

The kmem diff now looks like:

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 898ca9505754..8be932f63f90 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -105,6 +105,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 	data->mgid = rc;
 
 	for (i = 0; i < dev_dax->nr_range; i++) {
+		mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY |
+				  MHP_SPLIT_MEMBLOCKS;
 		struct resource *res;
 		struct range range;
 
@@ -141,7 +143,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 		 * this as RAM automatically.
 		 */
 		rc = add_memory_driver_managed(data->mgid, range.start,
-				range_len(&range), kmem_name, MHP_NID_IS_MGID);
+				range_len(&range), kmem_name, mhp_flags);
 
 		if (rc) {
 			dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",

However this begins to fail if the memmap_on_memory modparam is not
set, as add_memory_driver_managed() EINVALs from the
mhp_supports_memmap_on_memory() check.

The way to work around this would probably include doing the
mhp_supports_memmap_on_memory() check in kmem, in a loop to check each
memblock-sized chunk, and that feels like a leak of the implementation
details into the caller. Any suggestions on how to go about this?

> In general, we avoid placing important kernel data-structures on slow
> memory. That's one of the reasons why PMEM decided to mostly always use
> ZONE_MOVABLE, such that exactly what this patch does would not happen.
> So I'm wondering if there would be demand for an additional toggle.
>
> Because even with memmap_on_memory enabled in general, you might not
> want to do that for dax/kmem.
>
> IMHO, this patch should be dropped from your ppc64 series, as it's an
> independent change that might be valuable for other architectures as
> well.

Sure thing, I can pick this back up and Aneesh can drop this from his
set.
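A sketch of the caller-side workaround being described, and why it leaks hotplug internals into kmem. This assumes the two-argument mhp_supports_memmap_on_memory() introduced by this series and the MHP_SPLIT_MEMBLOCKS flag under discussion, which never landed upstream in this form:

	/*
	 * Illustrative only: kmem probing hotplug policy per
	 * memory-block-sized chunk before choosing flags. This mirrors
	 * memory_hotplug's internal checks in the caller, which is the
	 * layering complaint above.
	 */
	mhp_t mhp_flags = MHP_NID_IS_MGID;

	if (mhp_supports_memmap_on_memory(memory_block_size_bytes(),
					  MHP_MEMMAP_ON_MEMORY))
		mhp_flags |= MHP_MEMMAP_ON_MEMORY | MHP_SPLIT_MEMBLOCKS;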
On 13.07.23 08:45, Verma, Vishal L wrote:
>
> I'm taking a shot at implementing the splitting internally in
> memory_hotplug.c. The caller (kmem) side does become trivial with this
> approach, but there's a slight complication if I don't have the module
> param override (patch 1 of this series).

[..diff snipped..]

Why do we need the MHP_SPLIT_MEMBLOCKS?

In add_memory_driver_managed(), if memmap_on_memory = 1 AND is
effective for a single memory block, you can simply split up
internally, no?

Essentially in add_memory_resource() something like:

	if (mhp_flags & MHP_MEMMAP_ON_MEMORY &&
	    mhp_supports_memmap_on_memory(memory_block_size_bytes())) {
		for (cur_start = start; cur_start < start + size;
		     cur_start += memory_block_size_bytes()) {
			mhp_altmap.free = PHYS_PFN(memory_block_size_bytes());
			mhp_altmap.base_pfn = PHYS_PFN(start);
			params.altmap = &mhp_altmap;

			ret = arch_add_memory(nid, start,
					      memory_block_size_bytes(),
					      &params);
			if (ret < 0) ...

			ret = create_memory_block_devices(start,
					memory_block_size_bytes(),
					mhp_altmap.alloc, group);
			if (ret) ...
		}
	} else {
		/* old boring stuff */
	}

Of course, doing it a bit cleaner, factoring out adding of mem+creating
devices into a helper so we can use it on the other path, avoiding
repeating memory_block_size_bytes() ...

If any adding of memory failed, we remove what we already added. That
works, because as long as we're holding the relevant locks, memory
cannot get onlined in the meantime.

Then we only have to teach remove_memory() to deal with individual
blocks when finding blocks that have an altmap.
On Thu, 2023-07-13 at 09:23 +0200, David Hildenbrand wrote:
>
> Why do we need the MHP_SPLIT_MEMBLOCKS?

I thought we still wanted either an opt-in or opt-out for the kmem
driver to be able to do memmap_on_memory, in case there were
performance implications or the lack of 1GiB PUDs. I haven't
implemented that yet, but I was thinking along the lines of a sysfs
knob exposed by kmem, that controls setting of this new
MHP_SPLIT_MEMBLOCKS flag.

> In add_memory_driver_managed(), if memmap_on_memory = 1 AND is
> effective for a single memory block, you can simply split up
> internally, no?
>
> Essentially in add_memory_resource() something like [..snip..]
>
> Of course, doing it a bit cleaner, factoring out adding of mem+creating
> devices into a helper so we can use it on the other path, avoiding
> repeating memory_block_size_bytes() ...

My current approach was looping a level higher, on the call to
add_memory_resource(), but this looks reasonable too, and I can switch
to this. In fact it is better, as in my case I had to loop twice, once
for the regular add_memory() path and once for the _driver_managed()
path. Yours should avoid that.
On 13.07.23 17:15, Verma, Vishal L wrote:
>
> I thought we still wanted either an opt-in or opt-out for the kmem
> driver to be able to do memmap_on_memory, in case there were
> performance implications or the lack of 1GiB PUDs. I haven't
> implemented that yet, but I was thinking along the lines of a sysfs
> knob exposed by kmem, that controls setting of this new
> MHP_SPLIT_MEMBLOCKS flag.

Why is MHP_MEMMAP_ON_MEMORY not sufficient for that?

> My current approach was looping a level higher, on the call to
> add_memory_resource(), but this looks reasonable too, and I can switch
> to this. In fact it is better, as in my case I had to loop twice, once
> for the regular add_memory() path and once for the _driver_managed()
> path. Yours should avoid that.

As we only care about the altmap here, handling it for
arch_add_memory() + create_memory_block_devices() should be sufficient.
On Thu, 2023-07-13 at 17:23 +0200, David Hildenbrand wrote:
>
> Why is MHP_MEMMAP_ON_MEMORY not sufficient for that?

Ah, I see what you mean now - the knob just controls
MHP_MEMMAP_ON_MEMORY, and memory_hotplug is free to split into
memblocks if it needs to in order to satisfy that.

That sounds reasonable. Let me give this a try and see if I run into
anything else. Thanks David!
On 13.07.23 17:40, Verma, Vishal L wrote:
>
> Ah, I see what you mean now - the knob just controls
> MHP_MEMMAP_ON_MEMORY, and memory_hotplug is free to split into
> memblocks if it needs to in order to satisfy that.

And if you don't want memmap holes in a larger area you're adding (for
example to runtime-allocate 1 GiB pages), simply check the size you're
adding, and if it's, say, less than 1 G, don't set the flag. But that's
probably a corner-case use case not worth considering for now.

> That sounds reasonable. Let me give this a try and see if I run into
> anything else. Thanks David!

Sure!
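A minimal sketch of the size-based policy suggested above, as it might sit in the kmem probe loop; the SZ_1G threshold and the exact placement are assumptions for illustration, not part of the series:

#include <linux/sizes.h>	/* SZ_1G */

	mhp_t mhp_flags = MHP_NID_IS_MGID;

	/*
	 * Hypothetical policy: only request memmap-on-memory for ranges
	 * large enough that the per-block memmap holes won't defeat
	 * runtime allocation of 1 GiB pages in the remainder.
	 */
	if (range_len(&range) >= SZ_1G)
		mhp_flags |= MHP_MEMMAP_ON_MEMORY;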
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 7b36db6f1cbd..0751346193ef 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -12,6 +12,7 @@
 #include <linux/mm.h>
 #include <linux/mman.h>
 #include <linux/memory-tiers.h>
+#include <linux/memory_hotplug.h>
 #include "dax-private.h"
 #include "bus.h"
 
@@ -105,6 +106,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 	data->mgid = rc;
 
 	for (i = 0; i < dev_dax->nr_range; i++) {
+		u64 cur_start, cur_len, remaining;
 		struct resource *res;
 		struct range range;
 
@@ -137,21 +139,42 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 		res->flags = IORESOURCE_SYSTEM_RAM;
 
 		/*
-		 * Ensure that future kexec'd kernels will not treat
-		 * this as RAM automatically.
+		 * Add memory in chunks of memory_block_size_bytes() so that
+		 * it is considered for MHP_MEMMAP_ON_MEMORY
+		 * @range has already been aligned to memory_block_size_bytes(),
+		 * so the following loop will always break it down cleanly.
 		 */
-		rc = add_memory_driver_managed(data->mgid, range.start,
-				range_len(&range), kmem_name, MHP_NID_IS_MGID);
+		cur_start = range.start;
+		cur_len = memory_block_size_bytes();
+		remaining = range_len(&range);
+		while (remaining) {
+			mhp_t mhp_flags = MHP_NID_IS_MGID;
 
-		if (rc) {
-			dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
-					i, range.start, range.end);
-			remove_resource(res);
-			kfree(res);
-			data->res[i] = NULL;
-			if (mapped)
-				continue;
-			goto err_request_mem;
+			if (mhp_supports_memmap_on_memory(cur_len,
+							  MHP_MEMMAP_ON_MEMORY))
+				mhp_flags |= MHP_MEMMAP_ON_MEMORY;
+			/*
+			 * Ensure that future kexec'd kernels will not treat
+			 * this as RAM automatically.
+			 */
+			rc = add_memory_driver_managed(data->mgid, cur_start,
+						       cur_len, kmem_name,
+						       mhp_flags);
+
+			if (rc) {
+				dev_warn(dev,
+					 "mapping%d: %#llx-%#llx memory add failed\n",
+					 i, cur_start, cur_start + cur_len - 1);
+				remove_resource(res);
+				kfree(res);
+				data->res[i] = NULL;
+				if (mapped)
+					continue;
+				goto err_request_mem;
+			}
+
+			cur_start += cur_len;
+			remaining -= cur_len;
 		}
 		mapped++;
 	}