Message ID | 20231005-vv-kmem_memmap-v5-1-a54d1981f0a3@intel.com |
---|---|
State | New |
Headers |
From: Vishal Verma <vishal.l.verma@intel.com>
Date: Thu, 05 Oct 2023 12:31:39 -0600
Subject: [PATCH v5 1/2] mm/memory_hotplug: split memmap_on_memory requests across memblocks
Message-Id: <20231005-vv-kmem_memmap-v5-1-a54d1981f0a3@intel.com>
References: <20231005-vv-kmem_memmap-v5-0-a54d1981f0a3@intel.com>
In-Reply-To: <20231005-vv-kmem_memmap-v5-0-a54d1981f0a3@intel.com>
To: Andrew Morton <akpm@linux-foundation.org>, David Hildenbrand <david@redhat.com>, Oscar Salvador <osalvador@suse.de>, Dan Williams <dan.j.williams@intel.com>, Dave Jiang <dave.jiang@intel.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, nvdimm@lists.linux.dev, linux-cxl@vger.kernel.org, Huang Ying <ying.huang@intel.com>, Dave Hansen <dave.hansen@linux.intel.com>, "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>, Michal Hocko <mhocko@suse.com>, Jonathan Cameron <Jonathan.Cameron@Huawei.com>, Jeff Moyer <jmoyer@redhat.com>, Vishal Verma <vishal.l.verma@intel.com> |
Series |
mm: use memmap_on_memory semantics for dax/kmem
|
|
Commit Message
Verma, Vishal L
Oct. 5, 2023, 6:31 p.m. UTC
The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is restricted to
'memblock_size' chunks of memory being added. Adding a larger span of
memory precludes memmap_on_memory semantics.

For users of hotplug such as kmem, large amounts of memory might get
added from the CXL subsystem. In some cases, this amount may exceed the
available 'main memory' to store the memmap for the memory being added.
In this case, it is useful to have a way to place the memmap on the
memory being added, even if it means splitting the addition into
memblock-sized chunks.

Change add_memory_resource() to loop over memblock-sized chunks of
memory if caller requested memmap_on_memory, and if other conditions for
it are met. Teach try_remove_memory() to also expect that a memory
range being removed might have been split up into memblock sized chunks,
and to loop through those as needed.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
---
mm/memory_hotplug.c | 162 ++++++++++++++++++++++++++++++++--------------------
1 file changed, 99 insertions(+), 63 deletions(-)
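
The chunking described in the commit message can be illustrated outside the kernel. The following is only a userspace sketch of the loop shape, not the patch itself: `for_each_chunk`, `chunk_fn` and `count_chunk` are hypothetical names invented for this illustration, and the sketch assumes the range is a non-zero multiple of the block size.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical callback type: handles one chunk, returns 0 on success. */
typedef int (*chunk_fn)(uint64_t start, uint64_t size, void *ctx);

/*
 * Walk [start, start + size) in block_size steps, calling fn once per
 * chunk. Stops at the first non-zero return and hands it back to the
 * caller. Assumes size is a non-zero multiple of block_size.
 */
static int for_each_chunk(uint64_t start, uint64_t size, uint64_t block_size,
			  chunk_fn fn, void *ctx)
{
	uint64_t cur;
	int ret;

	assert(block_size != 0 && size != 0 && size % block_size == 0);

	for (cur = start; cur < start + size; cur += block_size) {
		ret = fn(cur, block_size, ctx);
		if (ret)
			return ret;
	}
	return 0;
}

/* Example callback: only counts the chunks it is handed. */
static int count_chunk(uint64_t start, uint64_t size, void *ctx)
{
	(void)start;
	(void)size;
	++*(unsigned int *)ctx;
	return 0;
}
```

In the patch, the per-chunk work is done by add_memory_create_devices() on the add side and remove_memory_block_and_altmap() on the remove side, so each memory block gets its own altmap.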
Comments
Vishal Verma wrote:
> The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is restricted to
> 'memblock_size' chunks of memory being added. Adding a larger span of
> memory precludes memmap_on_memory semantics.
>
> For users of hotplug such as kmem, large amounts of memory might get
> added from the CXL subsystem. In some cases, this amount may exceed the
> available 'main memory' to store the memmap for the memory being added.
> In this case, it is useful to have a way to place the memmap on the
> memory being added, even if it means splitting the addition into
> memblock-sized chunks.
>
> Change add_memory_resource() to loop over memblock-sized chunks of
> memory if caller requested memmap_on_memory, and if other conditions for
> it are met. Teach try_remove_memory() to also expect that a memory
> range being removed might have been split up into memblock sized chunks,
> and to loop through those as needed.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
> ---
>  mm/memory_hotplug.c | 162 ++++++++++++++++++++++++++++++++--------------------
>  1 file changed, 99 insertions(+), 63 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index f8d3e7427e32..77ec6f15f943 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1380,6 +1380,44 @@ static bool mhp_supports_memmap_on_memory(unsigned long size)
>  	return arch_supports_memmap_on_memory(vmemmap_size);
>  }
>
> +static int add_memory_create_devices(int nid, struct memory_group *group,
> +				     u64 start, u64 size, mhp_t mhp_flags)
> +{
> +	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
> +	struct vmem_altmap mhp_altmap = {
> +		.base_pfn = PHYS_PFN(start),
> +		.end_pfn = PHYS_PFN(start + size - 1),
> +	};
> +	int ret;
> +
> +	if ((mhp_flags & MHP_MEMMAP_ON_MEMORY)) {
> +		mhp_altmap.free = memory_block_memmap_on_memory_pages();
> +		params.altmap = kmalloc(sizeof(struct vmem_altmap), GFP_KERNEL);
> +		if (!params.altmap)
> +			return -ENOMEM;
> +
> +		memcpy(params.altmap, &mhp_altmap, sizeof(mhp_altmap));

Isn't this just open coded kmemdup()?

Other than that, I am not seeing anything else to comment on, you can add:

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
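
For readers unfamiliar with the remark: the kernel's kmemdup(src, len, gfp) allocates a buffer and copies `len` bytes into it in one call, which is what the kmalloc() plus memcpy() pair above does by hand. The sketch below shows the same equivalence in userspace with malloc(). `struct altmap_sketch`, `dup_open_coded` and `memdup_sketch` are made-up stand-ins for illustration, not kernel code.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for struct vmem_altmap; the fields are illustrative only. */
struct altmap_sketch {
	unsigned long base_pfn;
	unsigned long end_pfn;
	unsigned long free;
};

/* Open-coded form: allocate, check, then copy. */
static struct altmap_sketch *dup_open_coded(const struct altmap_sketch *src)
{
	struct altmap_sketch *dst = malloc(sizeof(*dst));

	if (!dst)
		return NULL;
	memcpy(dst, src, sizeof(*dst));
	return dst;
}

/* Userspace analogue of kmemdup(): one helper performs both steps. */
static void *memdup_sketch(const void *src, size_t len)
{
	void *dst = malloc(len);

	if (dst)
		memcpy(dst, src, len);
	return dst;
}
```

Both helpers return a heap copy that the caller must free, or NULL on allocation failure, so the caller's -ENOMEM check stays the same after the conversion.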
On 05.10.23 20:31, Vishal Verma wrote:
> The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is restricted to
> 'memblock_size' chunks of memory being added. Adding a larger span of
> memory precludes memmap_on_memory semantics.
>
> For users of hotplug such as kmem, large amounts of memory might get
> added from the CXL subsystem. In some cases, this amount may exceed the
> available 'main memory' to store the memmap for the memory being added.
> In this case, it is useful to have a way to place the memmap on the
> memory being added, even if it means splitting the addition into
> memblock-sized chunks.
>
> Change add_memory_resource() to loop over memblock-sized chunks of
> memory if caller requested memmap_on_memory, and if other conditions for
> it are met. Teach try_remove_memory() to also expect that a memory
> range being removed might have been split up into memblock sized chunks,
> and to loop through those as needed.
>

Maybe add that this implies that we're not making use of PUD mappings in the direct map yet, and link to the proposal on how we could optimize that eventually in the future.

[...]

>
> -static int __ref try_remove_memory(u64 start, u64 size)
> +static void __ref remove_memory_block_and_altmap(int nid, u64 start, u64 size)

You shouldn't need the nid, right?

>  {
> +	int rc = 0;
>  	struct memory_block *mem;
> -	int rc = 0, nid = NUMA_NO_NODE;
>  	struct vmem_altmap *altmap = NULL;
>
> +	rc = walk_memory_blocks(start, size, &mem, test_has_altmap_cb);
> +	if (rc) {
> +		altmap = mem->altmap;
> +		/*
> +		 * Mark altmap NULL so that we can add a debug
> +		 * check on memblock free.
> +		 */
> +		mem->altmap = NULL;
> +	}
> +
> +	/*
> +	 * Memory block device removal under the device_hotplug_lock is
> +	 * a barrier against racing online attempts.
> +	 */
> +	remove_memory_block_devices(start, size);

We're now calling that under the memory hotplug lock. I assume this is fine, but I remember some ugly lockdep details ...should be alright I guess.

> +
> +	arch_remove_memory(start, size, altmap);
> +
> +	/* Verify that all vmemmap pages have actually been freed. */
> +	if (altmap) {
> +		WARN(altmap->alloc, "Altmap not fully unmapped");
> +		kfree(altmap);
> +	}
> +}
> +
> +static int __ref try_remove_memory(u64 start, u64 size)
> +{
> +	int rc, nid = NUMA_NO_NODE;
> +
>  	BUG_ON(check_hotplug_memory_range(start, size));
>
>  	/*
> @@ -2167,47 +2221,28 @@ static int __ref try_remove_memory(u64 start, u64 size)
>  	if (rc)
>  		return rc;
>
> +	mem_hotplug_begin();
> +
>  	/*
> -	 * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in
> -	 * the same granularity it was added - a single memory block.
> +	 * For memmap_on_memory, the altmaps could have been added on
> +	 * a per-memblock basis. Loop through the entire range if so,
> +	 * and remove each memblock and its altmap.
>  	 */
>  	if (mhp_memmap_on_memory()) {
> -		rc = walk_memory_blocks(start, size, &mem, test_has_altmap_cb);
> -		if (rc) {
> -			if (size != memory_block_size_bytes()) {
> -				pr_warn("Refuse to remove %#llx - %#llx,"
> -					"wrong granularity\n",
> -					start, start + size);
> -				return -EINVAL;
> -			}
> -			altmap = mem->altmap;
> -			/*
> -			 * Mark altmap NULL so that we can add a debug
> -			 * check on memblock free.
> -			 */
> -			mem->altmap = NULL;
> -		}
> +		unsigned long memblock_size = memory_block_size_bytes();
> +		u64 cur_start;
> +
> +		for (cur_start = start; cur_start < start + size;
> +		     cur_start += memblock_size)
> +			remove_memory_block_and_altmap(nid, cur_start,
> +						       memblock_size);
> +	} else {
> +		remove_memory_block_and_altmap(nid, start, size);

Better call remove_memory_block_devices() and arch_remove_memory(start, size, altmap) here explicitly instead of using remove_memory_block_and_altmap() that really can only handle a single memory block with any inputs.

>  	}
>
>  	/* remove memmap entry */
>  	firmware_map_remove(start, start + size, "System RAM");

Can we continue doing that in the old order? (IOW before taking the lock?).

>
> -	/*
> -	 * Memory block device removal under the device_hotplug_lock is
> -	 * a barrier against racing online attempts.
> -	 */
> -	remove_memory_block_devices(start, size);
> -
> -	mem_hotplug_begin();
> -
> -	arch_remove_memory(start, size, altmap);
> -
> -	/* Verify that all vmemmap pages have actually been freed. */
> -	if (altmap) {
> -		WARN(altmap->alloc, "Altmap not fully unmapped");
> -		kfree(altmap);
> -	}
> -
>  	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
>  		memblock_phys_free(start, size);
>  		memblock_remove(start, size);
> @@ -2219,6 +2254,7 @@ static int __ref try_remove_memory(u64 start, u64 size)
>  	try_offline_node(nid);
>
>  	mem_hotplug_done();
> +

Unrelated change.

>  	return 0;
>  }
>
>
On Thu, 2023-10-05 at 14:20 -0700, Dan Williams wrote:
> Vishal Verma wrote:
<..>
> >
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -1380,6 +1380,44 @@ static bool mhp_supports_memmap_on_memory(unsigned long size)
> >  	return arch_supports_memmap_on_memory(vmemmap_size);
> >  }
> >
> > +static int add_memory_create_devices(int nid, struct memory_group *group,
> > +				     u64 start, u64 size, mhp_t mhp_flags)
> > +{
> > +	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
> > +	struct vmem_altmap mhp_altmap = {
> > +		.base_pfn = PHYS_PFN(start),
> > +		.end_pfn = PHYS_PFN(start + size - 1),
> > +	};
> > +	int ret;
> > +
> > +	if ((mhp_flags & MHP_MEMMAP_ON_MEMORY)) {
> > +		mhp_altmap.free = memory_block_memmap_on_memory_pages();
> > +		params.altmap = kmalloc(sizeof(struct vmem_altmap), GFP_KERNEL);
> > +		if (!params.altmap)
> > +			return -ENOMEM;
> > +
> > +		memcpy(params.altmap, &mhp_altmap, sizeof(mhp_altmap));
>
> Isn't this just open coded kmemdup()?

Ah yes - it was existing code that I just moved, but I can add a precursor cleanup patch to change it.

>
> Other than that, I am not seeing anything else to comment on, you can add:
>
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>

Thanks Dan!
On Fri, 2023-10-06 at 14:52 +0200, David Hildenbrand wrote:
> On 05.10.23 20:31, Vishal Verma wrote:
> >
<..>
> > @@ -2167,47 +2221,28 @@ static int __ref try_remove_memory(u64 start, u64 size)
> >  	if (rc)
> >  		return rc;
> >
> > +	mem_hotplug_begin();
> > +
> >  	/*
> > -	 * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in
> > -	 * the same granularity it was added - a single memory block.
> > +	 * For memmap_on_memory, the altmaps could have been added on
> > +	 * a per-memblock basis. Loop through the entire range if so,
> > +	 * and remove each memblock and its altmap.
> >  	 */
> >  	if (mhp_memmap_on_memory()) {
> > -		rc = walk_memory_blocks(start, size, &mem, test_has_altmap_cb);
> > -		if (rc) {
> > -			if (size != memory_block_size_bytes()) {
> > -				pr_warn("Refuse to remove %#llx - %#llx,"
> > -					"wrong granularity\n",
> > -					start, start + size);
> > -				return -EINVAL;
> > -			}
> > -			altmap = mem->altmap;
> > -			/*
> > -			 * Mark altmap NULL so that we can add a debug
> > -			 * check on memblock free.
> > -			 */
> > -			mem->altmap = NULL;
> > -		}
> > +		unsigned long memblock_size = memory_block_size_bytes();
> > +		u64 cur_start;
> > +
> > +		for (cur_start = start; cur_start < start + size;
> > +		     cur_start += memblock_size)
> > +			remove_memory_block_and_altmap(nid, cur_start,
> > +						       memblock_size);
> > +	} else {
> > +		remove_memory_block_and_altmap(nid, start, size);
>
> Better call remove_memory_block_devices() and arch_remove_memory(start,
> size, altmap) here explicitly instead of using
> remove_memory_block_and_altmap() that really can only handle a single
> memory block with any inputs.
>

I'm not sure I follow. Even in the non memmap_on_memory case, we'd have to walk_memory_blocks() to get to the memory_block->altmap, right?

Or is there a more direct way? If we have to walk_memory_blocks, what's the advantage of calling those directly instead of calling the helper created above?

Agreed with and fixed up all the other comments.
Vishal Verma <vishal.l.verma@intel.com> writes:

> The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is restricted to
> 'memblock_size' chunks of memory being added. Adding a larger span of
> memory precludes memmap_on_memory semantics.
>
> For users of hotplug such as kmem, large amounts of memory might get
> added from the CXL subsystem. In some cases, this amount may exceed the
> available 'main memory' to store the memmap for the memory being added.
> In this case, it is useful to have a way to place the memmap on the
> memory being added, even if it means splitting the addition into
> memblock-sized chunks.
>
> Change add_memory_resource() to loop over memblock-sized chunks of
> memory if caller requested memmap_on_memory, and if other conditions for
> it are met. Teach try_remove_memory() to also expect that a memory
> range being removed might have been split up into memblock sized chunks,
> and to loop through those as needed.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
> ---
>  mm/memory_hotplug.c | 162 ++++++++++++++++++++++++++++++++--------------------
>  1 file changed, 99 insertions(+), 63 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index f8d3e7427e32..77ec6f15f943 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1380,6 +1380,44 @@ static bool mhp_supports_memmap_on_memory(unsigned long size)
>  	return arch_supports_memmap_on_memory(vmemmap_size);
>  }
>
> +static int add_memory_create_devices(int nid, struct memory_group *group,
> +				     u64 start, u64 size, mhp_t mhp_flags)
> +{
> +	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
> +	struct vmem_altmap mhp_altmap = {
> +		.base_pfn = PHYS_PFN(start),
> +		.end_pfn = PHYS_PFN(start + size - 1),
> +	};
> +	int ret;
> +
> +	if ((mhp_flags & MHP_MEMMAP_ON_MEMORY)) {
> +		mhp_altmap.free = memory_block_memmap_on_memory_pages();
> +		params.altmap = kmalloc(sizeof(struct vmem_altmap), GFP_KERNEL);
> +		if (!params.altmap)
> +			return -ENOMEM;
> +
> +		memcpy(params.altmap, &mhp_altmap, sizeof(mhp_altmap));
> +	}
> +
> +	/* call arch's memory hotadd */
> +	ret = arch_add_memory(nid, start, size, &params);
> +	if (ret < 0)
> +		goto error;
> +
> +	/* create memory block devices after memory was added */
> +	ret = create_memory_block_devices(start, size, params.altmap, group);
> +	if (ret)
> +		goto err_bdev;
> +
> +	return 0;
> +
> +err_bdev:
> +	arch_remove_memory(start, size, NULL);
> +error:
> +	kfree(params.altmap);
> +	return ret;
> +}
> +
>  /*
>   * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
>   * and online/offline operations (triggered e.g. by sysfs).
> @@ -1388,14 +1426,10 @@ static bool mhp_supports_memmap_on_memory(unsigned long size)
>   */
>  int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>  {
> -	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
> +	unsigned long memblock_size = memory_block_size_bytes();
>  	enum memblock_flags memblock_flags = MEMBLOCK_NONE;
> -	struct vmem_altmap mhp_altmap = {
> -		.base_pfn = PHYS_PFN(res->start),
> -		.end_pfn = PHYS_PFN(res->end),
> -	};
>  	struct memory_group *group = NULL;
> -	u64 start, size;
> +	u64 start, size, cur_start;
>  	bool new_node = false;
>  	int ret;
>
> @@ -1436,28 +1470,21 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>  	/*
>  	 * Self hosted memmap array
>  	 */
> -	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
> -		if (mhp_supports_memmap_on_memory(size)) {
> -			mhp_altmap.free = memory_block_memmap_on_memory_pages();
> -			params.altmap = kmalloc(sizeof(struct vmem_altmap), GFP_KERNEL);
> -			if (!params.altmap)
> +	if ((mhp_flags & MHP_MEMMAP_ON_MEMORY) &&
> +	    mhp_supports_memmap_on_memory(memblock_size)) {
> +		for (cur_start = start; cur_start < start + size;
> +		     cur_start += memblock_size) {
> +			ret = add_memory_create_devices(nid, group, cur_start,
> +							memblock_size,
> +							mhp_flags);
> +			if (ret)
>  				goto error;
> -
> -			memcpy(params.altmap, &mhp_altmap, sizeof(mhp_altmap));
>  		}
> -		/* fallback to not using altmap */
> -	}
> -
> -	/* call arch's memory hotadd */
> -	ret = arch_add_memory(nid, start, size, &params);
> -	if (ret < 0)
> -		goto error_free;
> -
> -	/* create memory block devices after memory was added */
> -	ret = create_memory_block_devices(start, size, params.altmap, group);
> -	if (ret) {
> -		arch_remove_memory(start, size, NULL);
> -		goto error_free;
> +	} else {
> +		ret = add_memory_create_devices(nid, group, start, size,
> +						mhp_flags);
> +		if (ret)
> +			goto error;
>  	}
>
>  	if (new_node) {
> @@ -1494,8 +1521,6 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>  		walk_memory_blocks(start, size, NULL, online_memory_block);
>
>  	return ret;
> -error_free:
> -	kfree(params.altmap);
>  error:
>  	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
>  		memblock_remove(start, size);
> @@ -2146,12 +2171,41 @@ void try_offline_node(int nid)
>  }
>  EXPORT_SYMBOL(try_offline_node);
>
> -static int __ref try_remove_memory(u64 start, u64 size)
> +static void __ref remove_memory_block_and_altmap(int nid, u64 start, u64 size)
>  {
> +	int rc = 0;
>  	struct memory_block *mem;
> -	int rc = 0, nid = NUMA_NO_NODE;
>  	struct vmem_altmap *altmap = NULL;
>
> +	rc = walk_memory_blocks(start, size, &mem, test_has_altmap_cb);
> +	if (rc) {
> +		altmap = mem->altmap;
> +		/*
> +		 * Mark altmap NULL so that we can add a debug
> +		 * check on memblock free.
> +		 */
> +		mem->altmap = NULL;
> +	}
> +
> +	/*
> +	 * Memory block device removal under the device_hotplug_lock is
> +	 * a barrier against racing online attempts.
> +	 */
> +	remove_memory_block_devices(start, size);
> +
> +	arch_remove_memory(start, size, altmap);
> +
> +	/* Verify that all vmemmap pages have actually been freed. */
> +	if (altmap) {
> +		WARN(altmap->alloc, "Altmap not fully unmapped");
> +		kfree(altmap);
> +	}
> +}
> +
> +static int __ref try_remove_memory(u64 start, u64 size)
> +{
> +	int rc, nid = NUMA_NO_NODE;
> +
>  	BUG_ON(check_hotplug_memory_range(start, size));
>
>  	/*
> @@ -2167,47 +2221,28 @@ static int __ref try_remove_memory(u64 start, u64 size)
>  	if (rc)
>  		return rc;
>
> +	mem_hotplug_begin();
> +
>  	/*
> -	 * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in
> -	 * the same granularity it was added - a single memory block.
> +	 * For memmap_on_memory, the altmaps could have been added on
> +	 * a per-memblock basis. Loop through the entire range if so,
> +	 * and remove each memblock and its altmap.
>  	 */
>  	if (mhp_memmap_on_memory()) {

IIUC, even if mhp_memmap_on_memory() returns true, it's still possible that the memmap is put in DRAM after [2/2]. So that, arch_remove_memory() are called for each memory block unnecessarily. Can we detect this (via altmap?) and call remove_memory_block_and_altmap() for the whole range?

> -		rc = walk_memory_blocks(start, size, &mem, test_has_altmap_cb);
> -		if (rc) {
> -			if (size != memory_block_size_bytes()) {
> -				pr_warn("Refuse to remove %#llx - %#llx,"
> -					"wrong granularity\n",
> -					start, start + size);
> -				return -EINVAL;
> -			}
> -			altmap = mem->altmap;
> -			/*
> -			 * Mark altmap NULL so that we can add a debug
> -			 * check on memblock free.
> -			 */
> -			mem->altmap = NULL;
> -		}
> +		unsigned long memblock_size = memory_block_size_bytes();
> +		u64 cur_start;
> +
> +		for (cur_start = start; cur_start < start + size;
> +		     cur_start += memblock_size)
> +			remove_memory_block_and_altmap(nid, cur_start,
> +						       memblock_size);
> +	} else {
> +		remove_memory_block_and_altmap(nid, start, size);
>  	}
>
>  	/* remove memmap entry */
>  	firmware_map_remove(start, start + size, "System RAM");
>
> -	/*
> -	 * Memory block device removal under the device_hotplug_lock is
> -	 * a barrier against racing online attempts.
> -	 */
> -	remove_memory_block_devices(start, size);
> -
> -	mem_hotplug_begin();
> -
> -	arch_remove_memory(start, size, altmap);
> -
> -	/* Verify that all vmemmap pages have actually been freed. */
> -	if (altmap) {
> -		WARN(altmap->alloc, "Altmap not fully unmapped");
> -		kfree(altmap);
> -	}
> -
>  	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
>  		memblock_phys_free(start, size);
>  		memblock_remove(start, size);
> @@ -2219,6 +2254,7 @@ static int __ref try_remove_memory(u64 start, u64 size)
>  	try_offline_node(nid);
>
>  	mem_hotplug_done();
> +
>  	return 0;
>  }

--
Best Regards,
Huang, Ying
On 07.10.23 10:55, Huang, Ying wrote:
> Vishal Verma <vishal.l.verma@intel.com> writes:
>

[...]

>> @@ -2167,47 +2221,28 @@ static int __ref try_remove_memory(u64 start, u64 size)
>>  	if (rc)
>>  		return rc;
>>
>> +	mem_hotplug_begin();
>> +
>>  	/*
>> -	 * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in
>> -	 * the same granularity it was added - a single memory block.
>> +	 * For memmap_on_memory, the altmaps could have been added on
>> +	 * a per-memblock basis. Loop through the entire range if so,
>> +	 * and remove each memblock and its altmap.
>>  	 */
>>  	if (mhp_memmap_on_memory()) {
>
> IIUC, even if mhp_memmap_on_memory() returns true, it's still possible
> that the memmap is put in DRAM after [2/2]. So that,
> arch_remove_memory() are called for each memory block unnecessarily. Can
> we detect this (via altmap?) and call remove_memory_block_and_altmap()
> for the whole range?

Good point. We should handle memblock-per-memblock only if we have to handle the altmap. Otherwise, just call a separate function that doesn't care about -- e.g., called remove_memory_blocks_no_altmap().

We could simply walk all memory blocks and make sure either all have an altmap or none has an altmap. If there is a mix, we should bail out with WARN_ON_ONCE().
On 07.10.23 00:01, Verma, Vishal L wrote:
> On Fri, 2023-10-06 at 14:52 +0200, David Hildenbrand wrote:
>> On 05.10.23 20:31, Vishal Verma wrote:
>>>
> <..>
>>> @@ -2167,47 +2221,28 @@ static int __ref try_remove_memory(u64 start, u64 size)
>>>  	if (rc)
>>>  		return rc;
>>>
>>> +	mem_hotplug_begin();
>>> +
>>>  	/*
>>> -	 * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in
>>> -	 * the same granularity it was added - a single memory block.
>>> +	 * For memmap_on_memory, the altmaps could have been added on
>>> +	 * a per-memblock basis. Loop through the entire range if so,
>>> +	 * and remove each memblock and its altmap.
>>>  	 */
>>>  	if (mhp_memmap_on_memory()) {
>>> -		rc = walk_memory_blocks(start, size, &mem, test_has_altmap_cb);
>>> -		if (rc) {
>>> -			if (size != memory_block_size_bytes()) {
>>> -				pr_warn("Refuse to remove %#llx - %#llx,"
>>> -					"wrong granularity\n",
>>> -					start, start + size);
>>> -				return -EINVAL;
>>> -			}
>>> -			altmap = mem->altmap;
>>> -			/*
>>> -			 * Mark altmap NULL so that we can add a debug
>>> -			 * check on memblock free.
>>> -			 */
>>> -			mem->altmap = NULL;
>>> -		}
>>> +		unsigned long memblock_size = memory_block_size_bytes();
>>> +		u64 cur_start;
>>> +
>>> +		for (cur_start = start; cur_start < start + size;
>>> +		     cur_start += memblock_size)
>>> +			remove_memory_block_and_altmap(nid, cur_start,
>>> +						       memblock_size);
>>> +	} else {
>>> +		remove_memory_block_and_altmap(nid, start, size);
>>
>> Better call remove_memory_block_devices() and arch_remove_memory(start,
>> size, altmap) here explicitly instead of using
>> remove_memory_block_and_altmap() that really can only handle a single
>> memory block with any inputs.
>>
> I'm not sure I follow. Even in the non memmap_on_memory case, we'd have
> to walk_memory_blocks() to get to the memory_block->altmap, right?

See my other reply too: at least with mhp_memmap_on_memory()==false, we
don't have to worry about the altmap.

>
> Or is there a more direct way? If we have to walk_memory_blocks, what's
> the advantage of calling those directly instead of calling the helper
> created above?

I think we have two cases to handle:

1) All blocks have an altmap. Remove them block-by-block. Probably we
   should call a function remove_memory_blocks(altmap=true) [or
   alternatively remove_memory_blocks_and_altmaps()] and just handle
   iterating internally.

2) No block has an altmap. We can remove them in one go. Probably we
   should call that remove_memory_blocks(altmap=false) [or alternatively
   remove_memory_blocks_no_altmaps()].

I guess it's best to do a walk upfront to make sure either all have an
altmap or none has one. Then we can branch off to the right function
knowing whether we have to process altmaps or not.

The existing

	if (mhp_memmap_on_memory()) {
		...
	}

can be extended for that case.

Please let me know if I failed to express what I mean, then I can
briefly prototype it on top of your changes.
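
To make the two cases above concrete, here is a minimal userspace sketch of the "walk upfront, then branch" idea. It is not kernel code: `struct model_block`, `classify_blocks()` and `model_remove_range()` are hypothetical names invented for this illustration, and returning -EINVAL for a mixed range is only one possible way to "bail out". The return value simply counts how many removal calls the model would issue.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memory block; not the kernel's struct. */
struct model_block {
	bool has_altmap;
};

enum altmap_state { ALTMAP_NONE, ALTMAP_ALL, ALTMAP_MIXED };

/* Walk the whole range once and classify it as all / none / mixed. */
static enum altmap_state classify_blocks(const struct model_block *blocks,
					 size_t nr_blocks)
{
	size_t i, nr_altmaps = 0;

	for (i = 0; i < nr_blocks; i++)
		if (blocks[i].has_altmap)
			nr_altmaps++;

	if (nr_altmaps == 0)
		return ALTMAP_NONE;
	if (nr_altmaps == nr_blocks)
		return ALTMAP_ALL;
	return ALTMAP_MIXED;
}

/*
 * Returns the number of removal calls the model would make:
 * one per block when every block has an altmap, a single call for the
 * whole range when none has one, and -EINVAL for a mixed range.
 */
static int model_remove_range(const struct model_block *blocks,
			      size_t nr_blocks)
{
	switch (classify_blocks(blocks, nr_blocks)) {
	case ALTMAP_ALL:
		return (int)nr_blocks;	/* case 1: block-by-block */
	case ALTMAP_NONE:
		return 1;		/* case 2: in one go */
	default:
		return -EINVAL;		/* mix: refuse in this model */
	}
}
```

In this model a three-block range where every block has an altmap yields three removal calls, a range with no altmaps yields one, and a mixed range is rejected before anything is removed.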
On Mon, 2023-10-09 at 17:04 +0200, David Hildenbrand wrote:
> On 07.10.23 10:55, Huang, Ying wrote:
> > Vishal Verma <vishal.l.verma@intel.com> writes:
> >
> > > @@ -2167,47 +2221,28 @@ static int __ref try_remove_memory(u64 start, u64 size)
> > >  	if (rc)
> > >  		return rc;
> > >
> > > +	mem_hotplug_begin();
> > > +
> > >  	/*
> > > -	 * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in
> > > -	 * the same granularity it was added - a single memory block.
> > > +	 * For memmap_on_memory, the altmaps could have been added on
> > > +	 * a per-memblock basis. Loop through the entire range if so,
> > > +	 * and remove each memblock and its altmap.
> > >  	 */
> > >  	if (mhp_memmap_on_memory()) {
> >
> > IIUC, even if mhp_memmap_on_memory() returns true, it's still possible
> > that the memmap is put in DRAM after [2/2]. In that case,
> > arch_remove_memory() is called for each memory block unnecessarily. Can
> > we detect this (via altmap?) and call remove_memory_block_and_altmap()
> > for the whole range?
>
> Good point. We should handle memblock-per-memblock only if we have to
> handle the altmap. Otherwise, just call a separate function that doesn't
> care about it -- e.g., called remove_memory_blocks_no_altmap().
>
> We could simply walk all memory blocks and make sure either all have an
> altmap or none has an altmap. If there is a mix, we should bail out with
> WARN_ON_ONCE().
>
Ok I think I follow - based on both of these threads, here's my
understanding in an incremental diff from the original patches (may not
apply directly as I've already committed changes from the other bits of
feedback - but this should provide an idea of the direction) -

---

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 507291e44c0b..30addcb063b4 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -2201,6 +2201,40 @@ static void __ref remove_memory_block_and_altmap(u64 start, u64 size)
 	}
 }
 
+static bool memblocks_have_altmaps(u64 start, u64 size)
+{
+	unsigned long memblock_size = memory_block_size_bytes();
+	u64 num_altmaps = 0, num_no_altmaps = 0;
+	struct memory_block *mem;
+	u64 cur_start;
+	int rc = 0;
+
+	if (!mhp_memmap_on_memory())
+		return false;
+
+	for (cur_start = start; cur_start < start + size;
+	     cur_start += memblock_size) {
+		if (walk_memory_blocks(cur_start, memblock_size, &mem,
+				       test_has_altmap_cb))
+			num_altmaps++;
+		else
+			num_no_altmaps++;
+	}
+
+	if (!num_altmaps && num_no_altmaps > 0)
+		return false;
+
+	if (!num_no_altmaps && num_altmaps > 0)
+		return true;
+
+	/*
+	 * If there is a mix of memblocks with and without altmaps,
+	 * something has gone very wrong. WARN and bail.
+	 */
+	WARN_ONCE(1, "memblocks have a mix of missing and present altmaps");
+	return false;
+}
+
 static int __ref try_remove_memory(u64 start, u64 size)
 {
 	int rc, nid = NUMA_NO_NODE;
@@ -2230,7 +2264,7 @@ static int __ref try_remove_memory(u64 start, u64 size)
 	 * a per-memblock basis. Loop through the entire range if so,
 	 * and remove each memblock and its altmap.
 	 */
-	if (mhp_memmap_on_memory()) {
+	if (mhp_memmap_on_memory() && memblocks_have_altmaps(start, size)) {
 		unsigned long memblock_size = memory_block_size_bytes();
 		u64 cur_start;
 
@@ -2239,7 +2273,8 @@ static int __ref try_remove_memory(u64 start, u64 size)
 			remove_memory_block_and_altmap(cur_start,
 						       memblock_size);
 	} else {
-		remove_memory_block_and_altmap(start, size);
+		remove_memory_block_devices(start, size);
+		arch_remove_memory(start, size, NULL);
 	}
 
 	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
On 12.10.23 07:53, Verma, Vishal L wrote:
> On Mon, 2023-10-09 at 17:04 +0200, David Hildenbrand wrote:
>> On 07.10.23 10:55, Huang, Ying wrote:
>>> Vishal Verma <vishal.l.verma@intel.com> writes:
>>>
>>>> @@ -2167,47 +2221,28 @@ static int __ref try_remove_memory(u64 start, u64 size)
>>>>  	if (rc)
>>>>  		return rc;
>>>>
>>>> +	mem_hotplug_begin();
>>>> +
>>>>  	/*
>>>> -	 * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in
>>>> -	 * the same granularity it was added - a single memory block.
>>>> +	 * For memmap_on_memory, the altmaps could have been added on
>>>> +	 * a per-memblock basis. Loop through the entire range if so,
>>>> +	 * and remove each memblock and its altmap.
>>>>  	 */
>>>>  	if (mhp_memmap_on_memory()) {
>>>
>>> IIUC, even if mhp_memmap_on_memory() returns true, it's still possible
>>> that the memmap is put in DRAM after [2/2]. In that case,
>>> arch_remove_memory() is called for each memory block unnecessarily. Can
>>> we detect this (via altmap?) and call remove_memory_block_and_altmap()
>>> for the whole range?
>>
>> Good point. We should handle memblock-per-memblock only if we have to
>> handle the altmap. Otherwise, just call a separate function that doesn't
>> care about it -- e.g., called remove_memory_blocks_no_altmap().
>>
>> We could simply walk all memory blocks and make sure either all have an
>> altmap or none has an altmap. If there is a mix, we should bail out with
>> WARN_ON_ONCE().
>>
> Ok I think I follow - based on both of these threads, here's my
> understanding in an incremental diff from the original patches (may not
> apply directly as I've already committed changes from the other bits of
> feedback - but this should provide an idea of the direction) -
>
> ---
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 507291e44c0b..30addcb063b4 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -2201,6 +2201,40 @@ static void __ref remove_memory_block_and_altmap(u64 start, u64 size)
>  	}
>  }
>
> +static bool memblocks_have_altmaps(u64 start, u64 size)
> +{
> +	unsigned long memblock_size = memory_block_size_bytes();
> +	u64 num_altmaps = 0, num_no_altmaps = 0;
> +	struct memory_block *mem;
> +	u64 cur_start;
> +	int rc = 0;
> +
> +	if (!mhp_memmap_on_memory())
> +		return false;

Probably can remove that, as it is already checked by the caller (or
drop the check in the caller instead).

> +
> +	for (cur_start = start; cur_start < start + size;
> +	     cur_start += memblock_size) {
> +		if (walk_memory_blocks(cur_start, memblock_size, &mem,
> +				       test_has_altmap_cb))
> +			num_altmaps++;
> +		else
> +			num_no_altmaps++;
> +	}

You should do that without the outer loop, by doing the counting in the
callback function instead.

> +
> +	if (!num_altmaps && num_no_altmaps > 0)
> +		return false;
> +
> +	if (!num_no_altmaps && num_altmaps > 0)
> +		return true;
> +
> +	/*
> +	 * If there is a mix of memblocks with and without altmaps,
> +	 * something has gone very wrong. WARN and bail.
> +	 */
> +	WARN_ONCE(1, "memblocks have a mix of missing and present altmaps");

It would be better if we could even make try_remove_memory() fail in
this case.

> +	return false;
> +}
> +
>  static int __ref try_remove_memory(u64 start, u64 size)
>  {
>  	int rc, nid = NUMA_NO_NODE;
> @@ -2230,7 +2264,7 @@ static int __ref try_remove_memory(u64 start, u64 size)
>  	 * a per-memblock basis. Loop through the entire range if so,
>  	 * and remove each memblock and its altmap.
>  	 */
> -	if (mhp_memmap_on_memory()) {
> +	if (mhp_memmap_on_memory() && memblocks_have_altmaps(start, size)) {
>  		unsigned long memblock_size = memory_block_size_bytes();
>  		u64 cur_start;
>
> @@ -2239,7 +2273,8 @@ static int __ref try_remove_memory(u64 start, u64 size)
>  			remove_memory_block_and_altmap(cur_start,
>  						       memblock_size);

^ It is probably cleaner to move the loop into
remove_memory_block_and_altmap() and call it
remove_memory_blocks_and_altmaps(start, size) instead.

>  	} else {
> -		remove_memory_block_and_altmap(start, size);
> +		remove_memory_block_devices(start, size);
> +		arch_remove_memory(start, size, NULL);
>  	}
>
>  	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
>
On Thu, 2023-10-12 at 10:40 +0200, David Hildenbrand wrote:
> On 12.10.23 07:53, Verma, Vishal L wrote:
> > On Mon, 2023-10-09 at 17:04 +0200, David Hildenbrand wrote:
> > > On 07.10.23 10:55, Huang, Ying wrote:
> > > > Vishal Verma <vishal.l.verma@intel.com> writes:
> <..>
> > +
> > +	for (cur_start = start; cur_start < start + size;
> > +	     cur_start += memblock_size) {
> > +		if (walk_memory_blocks(cur_start, memblock_size, &mem,
> > +				       test_has_altmap_cb))
> > +			num_altmaps++;
> > +		else
> > +			num_no_altmaps++;
> > +	}
>
> You should do that without the outer loop, by doing the counting in the
> callback function instead.
>
>
I made a new callback, since the existing callback that returns the
memory_block stops the walk the first time an altmap is encountered.

Agreed on all the other comments - it looks much cleaner now! Sending v6
shortly with all of this.
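
As a rough illustration of why a separate counting callback is needed, the userspace sketch below models a block walker that stops as soon as its callback returns non-zero, which is the behaviour described above. All names here (`struct walk_block`, `model_walk_blocks()`, `model_test_has_altmap()`, `model_count_altmaps()`, `struct altmap_counts`) are hypothetical and are not taken from v6 of the series. The "find" style callback ends the walk at the first altmap, while the counting callback always returns 0 so every block is visited.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memory block; not the kernel's struct. */
struct walk_block {
	bool has_altmap;
};

typedef int (*model_walk_fn)(struct walk_block *blk, void *arg);

/* Visit blocks in order; stop and return on the first non-zero result. */
static int model_walk_blocks(struct walk_block *blocks, size_t nr_blocks,
			     void *arg, model_walk_fn fn)
{
	size_t i;

	for (i = 0; i < nr_blocks; i++) {
		int ret = fn(&blocks[i], arg);

		if (ret)
			return ret;
	}
	return 0;
}

/* "Find" callback: hands back the first block with an altmap and stops. */
static int model_test_has_altmap(struct walk_block *blk, void *arg)
{
	struct walk_block **found = arg;

	if (blk->has_altmap) {
		*found = blk;
		return 1;	/* non-zero ends the walk early */
	}
	return 0;
}

struct altmap_counts {
	size_t with_altmap;
	size_t without_altmap;
};

/* Counting callback: never stops the walk, only accumulates totals. */
static int model_count_altmaps(struct walk_block *blk, void *arg)
{
	struct altmap_counts *counts = arg;

	if (blk->has_altmap)
		counts->with_altmap++;
	else
		counts->without_altmap++;
	return 0;
}
```

With the counting callback a single walk over the whole range is enough, so the outer per-memblock loop from the incremental diff is no longer needed in this model; the caller can compare the two totals afterwards to detect a mixed range.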
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index f8d3e7427e32..77ec6f15f943 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1380,6 +1380,44 @@ static bool mhp_supports_memmap_on_memory(unsigned long size)
 	return arch_supports_memmap_on_memory(vmemmap_size);
 }
 
+static int add_memory_create_devices(int nid, struct memory_group *group,
+				     u64 start, u64 size, mhp_t mhp_flags)
+{
+	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
+	struct vmem_altmap mhp_altmap = {
+		.base_pfn = PHYS_PFN(start),
+		.end_pfn = PHYS_PFN(start + size - 1),
+	};
+	int ret;
+
+	if ((mhp_flags & MHP_MEMMAP_ON_MEMORY)) {
+		mhp_altmap.free = memory_block_memmap_on_memory_pages();
+		params.altmap = kmalloc(sizeof(struct vmem_altmap), GFP_KERNEL);
+		if (!params.altmap)
+			return -ENOMEM;
+
+		memcpy(params.altmap, &mhp_altmap, sizeof(mhp_altmap));
+	}
+
+	/* call arch's memory hotadd */
+	ret = arch_add_memory(nid, start, size, &params);
+	if (ret < 0)
+		goto error;
+
+	/* create memory block devices after memory was added */
+	ret = create_memory_block_devices(start, size, params.altmap, group);
+	if (ret)
+		goto err_bdev;
+
+	return 0;
+
+err_bdev:
+	arch_remove_memory(start, size, NULL);
+error:
+	kfree(params.altmap);
+	return ret;
+}
+
 /*
  * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
  * and online/offline operations (triggered e.g. by sysfs).
@@ -1388,14 +1426,10 @@ static bool mhp_supports_memmap_on_memory(unsigned long size)
  */
 int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 {
-	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
+	unsigned long memblock_size = memory_block_size_bytes();
 	enum memblock_flags memblock_flags = MEMBLOCK_NONE;
-	struct vmem_altmap mhp_altmap = {
-		.base_pfn = PHYS_PFN(res->start),
-		.end_pfn = PHYS_PFN(res->end),
-	};
 	struct memory_group *group = NULL;
-	u64 start, size;
+	u64 start, size, cur_start;
 	bool new_node = false;
 	int ret;
 
@@ -1436,28 +1470,21 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 	/*
 	 * Self hosted memmap array
 	 */
-	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
-		if (mhp_supports_memmap_on_memory(size)) {
-			mhp_altmap.free = memory_block_memmap_on_memory_pages();
-			params.altmap = kmalloc(sizeof(struct vmem_altmap), GFP_KERNEL);
-			if (!params.altmap)
+	if ((mhp_flags & MHP_MEMMAP_ON_MEMORY) &&
+	    mhp_supports_memmap_on_memory(memblock_size)) {
+		for (cur_start = start; cur_start < start + size;
+		     cur_start += memblock_size) {
+			ret = add_memory_create_devices(nid, group, cur_start,
+							memblock_size,
+							mhp_flags);
+			if (ret)
 				goto error;
-
-			memcpy(params.altmap, &mhp_altmap, sizeof(mhp_altmap));
 		}
-		/* fallback to not using altmap */
-	}
-
-	/* call arch's memory hotadd */
-	ret = arch_add_memory(nid, start, size, &params);
-	if (ret < 0)
-		goto error_free;
-
-	/* create memory block devices after memory was added */
-	ret = create_memory_block_devices(start, size, params.altmap, group);
-	if (ret) {
-		arch_remove_memory(start, size, NULL);
-		goto error_free;
+	} else {
+		ret = add_memory_create_devices(nid, group, start, size,
+						mhp_flags);
+		if (ret)
+			goto error;
 	}
 
 	if (new_node) {
@@ -1494,8 +1521,6 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 		walk_memory_blocks(start, size, NULL, online_memory_block);
 
 	return ret;
-error_free:
-	kfree(params.altmap);
 error:
 	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
 		memblock_remove(start, size);
@@ -2146,12 +2171,41 @@ void try_offline_node(int nid)
 }
 EXPORT_SYMBOL(try_offline_node);
 
-static int __ref try_remove_memory(u64 start, u64 size)
+static void __ref remove_memory_block_and_altmap(int nid, u64 start, u64 size)
 {
+	int rc = 0;
 	struct memory_block *mem;
-	int rc = 0, nid = NUMA_NO_NODE;
 	struct vmem_altmap *altmap = NULL;
 
+	rc = walk_memory_blocks(start, size, &mem, test_has_altmap_cb);
+	if (rc) {
+		altmap = mem->altmap;
+		/*
+		 * Mark altmap NULL so that we can add a debug
+		 * check on memblock free.
+		 */
+		mem->altmap = NULL;
+	}
+
+	/*
+	 * Memory block device removal under the device_hotplug_lock is
+	 * a barrier against racing online attempts.
+	 */
+	remove_memory_block_devices(start, size);
+
+	arch_remove_memory(start, size, altmap);
+
+	/* Verify that all vmemmap pages have actually been freed. */
+	if (altmap) {
+		WARN(altmap->alloc, "Altmap not fully unmapped");
+		kfree(altmap);
+	}
+}
+
+static int __ref try_remove_memory(u64 start, u64 size)
+{
+	int rc, nid = NUMA_NO_NODE;
+
 	BUG_ON(check_hotplug_memory_range(start, size));
 
 	/*
@@ -2167,47 +2221,28 @@ static int __ref try_remove_memory(u64 start, u64 size)
 	if (rc)
 		return rc;
 
+	mem_hotplug_begin();
+
 	/*
-	 * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in
-	 * the same granularity it was added - a single memory block.
+	 * For memmap_on_memory, the altmaps could have been added on
+	 * a per-memblock basis. Loop through the entire range if so,
+	 * and remove each memblock and its altmap.
 	 */
 	if (mhp_memmap_on_memory()) {
-		rc = walk_memory_blocks(start, size, &mem, test_has_altmap_cb);
-		if (rc) {
-			if (size != memory_block_size_bytes()) {
-				pr_warn("Refuse to remove %#llx - %#llx,"
-					"wrong granularity\n",
-					start, start + size);
-				return -EINVAL;
-			}
-			altmap = mem->altmap;
-			/*
-			 * Mark altmap NULL so that we can add a debug
-			 * check on memblock free.
-			 */
-			mem->altmap = NULL;
-		}
+		unsigned long memblock_size = memory_block_size_bytes();
+		u64 cur_start;
+
+		for (cur_start = start; cur_start < start + size;
+		     cur_start += memblock_size)
+			remove_memory_block_and_altmap(nid, cur_start,
+						       memblock_size);
+	} else {
+		remove_memory_block_and_altmap(nid, start, size);
 	}
 
 	/* remove memmap entry */
 	firmware_map_remove(start, start + size, "System RAM");
 
-	/*
-	 * Memory block device removal under the device_hotplug_lock is
-	 * a barrier against racing online attempts.
-	 */
-	remove_memory_block_devices(start, size);
-
-	mem_hotplug_begin();
-
-	arch_remove_memory(start, size, altmap);
-
-	/* Verify that all vmemmap pages have actually been freed. */
-	if (altmap) {
-		WARN(altmap->alloc, "Altmap not fully unmapped");
-		kfree(altmap);
-	}
-
 	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
 		memblock_phys_free(start, size);
 		memblock_remove(start, size);
@@ -2219,6 +2254,7 @@ static int __ref try_remove_memory(u64 start, u64 size)
 		try_offline_node(nid);
 
 	mem_hotplug_done();
+
 	return 0;
 }