| Message ID | 20231208025240.4744-1-gang.li@linux.dev |
|---|---|
| Series | hugetlb: parallelize hugetlb page init on boot |
Message
Gang Li
Dec. 8, 2023, 2:52 a.m. UTC
Hi all, hugetlb init parallelization has now been updated to v2.

To David Hildenbrand: the padata multithread utilities have been used to reduce
code complexity.

To David Rientjes: The patch for measuring time will be separately included
in the reply. Please test during your free time, thanks.

# Introduction
Hugetlb initialization during boot takes up a considerable amount of time.
For instance, on a 2TB system, initializing 1,800 1GB huge pages takes 1-2
seconds out of 10 seconds. Initializing 11,776 1GB pages on a 12TB Intel
host takes 65.2 seconds [1], which is 17.4% of the total 373.78 seconds boot
time. This is a noteworthy figure.

Inspired by [2] and [3], hugetlb initialization can also be accelerated
through parallelization. The kernel already has infrastructure such as
padata_do_multithreaded; this series uses it to achieve effective results
with minimal modifications.

[1] https://lore.kernel.org/all/783f8bac-55b8-5b95-eb6a-11a583675000@google.com/
[2] https://lore.kernel.org/all/20200527173608.2885243-1-daniel.m.jordan@oracle.com/
[3] https://lore.kernel.org/all/20230906112605.2286994-1-usama.arif@bytedance.com/

# Test result
test                 no patch(ms)   patched(ms)   saved
-------------------  -------------  ------------  --------
256c2t(4 node) 2M    2624           956           63.57%
256c2t(4 node) 1G    2679           1582          40.95%
128c1t(2 node) 2M    1788           684           61.74%
128c1t(2 node) 1G    3160           1618          48.80%

# Change log
Changes in v2:
- Reduce complexity with `padata_do_multithreaded`
- Support 1G hugetlb

v1:
- https://lore.kernel.org/all/20231123133036.68540-1-gang.li@linux.dev/
- parallelize 2M hugetlb initialization with workqueue

Gang Li (5):
  hugetlb: code clean for hugetlb_hstate_alloc_pages
  hugetlb: split hugetlb_hstate_alloc_pages
  padata: dispatch works on different nodes
  hugetlb: parallelize 2M hugetlb allocation and initialization
  hugetlb: parallelize 1G hugetlb initialization

 include/linux/hugetlb.h |   2 +-
 include/linux/padata.h  |   2 +
 kernel/padata.c         |   8 +-
 mm/hugetlb.c            | 201 +++++++++++++++++++++++++++-------------
 mm/mm_init.c            |   1 +
 5 files changed, 148 insertions(+), 66 deletions(-)
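For readers unfamiliar with padata, the following is a minimal sketch of how a boot-time hugetlb allocation loop can be handed to `padata_do_multithreaded()`. The `struct padata_mt_job` fields match the upstream API already used by deferred struct page init; the helper names (`hugetlb_pages_alloc_thread`, `hugetlb_pages_alloc_boot`) and the use of `alloc_pool_huge_page()` are illustrative assumptions, not the series' actual code.

```c
#include <linux/padata.h>
#include <linux/hugetlb.h>
#include <linux/nodemask.h>

/* Worker: each thread allocates its slice [start, end) of the requested pool.
 * alloc_pool_huge_page() stands in for whatever per-page allocation helper
 * the tree provides. */
static void __init hugetlb_pages_alloc_thread(unsigned long start,
					      unsigned long end, void *arg)
{
	struct hstate *h = arg;
	unsigned long i;

	for (i = start; i < end; i++)
		if (!alloc_pool_huge_page(h, &node_states[N_MEMORY], NULL))
			break;
}

/* Driver: describe the whole request as one job and let padata split it. */
static void __init hugetlb_pages_alloc_boot(struct hstate *h)
{
	struct padata_mt_job job = {
		.thread_fn   = hugetlb_pages_alloc_thread,
		.fn_arg      = h,
		.start       = 0,
		.size        = h->max_huge_pages,
		.align       = 1,
		.min_chunk   = 1,
		.max_threads = num_node_state(N_MEMORY),
	};

	padata_do_multithreaded(&job);
}
```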
Comments
On 12/08/23 10:52, Gang Li wrote:
> Hi all, hugetlb init parallelization has now been updated to v2.

Thanks for your efforts, and sorry for my late comments.

> To David Hildenbrand: the padata multithread utilities have been used to
> reduce code complexity.
>
> To David Rientjes: The patch for measuring time will be separately included
> in the reply. Please test during your free time, thanks.
>
> # Introduction
> Hugetlb initialization during boot takes up a considerable amount of time.
> For instance, on a 2TB system, initializing 1,800 1GB huge pages takes 1-2
> seconds out of 10 seconds. Initializing 11,776 1GB pages on a 12TB Intel
> host takes 65.2 seconds [1], which is 17.4% of the total 373.78 seconds boot
> time. This is a noteworthy figure.

One issue to be concerned with is hugetlb page allocation on systems with
unbalanced numa node memory. Commit f60858f9d327 ("hugetlbfs: don't retry
when pool page allocations start to fail") was added to deal with issues
reported on such systems. So, users are certainly using hugetlb pages on
systems with imbalances.

If performing allocations in parallel, I believe we would want the total
number of hugetlb pages allocated to be the same as today. For example,
consider a simple 2 node system with 16GB total memory:
  node 0:  2GB
  node 1: 14GB

With today's code, allocating 6656 2MB pages via the kernel command line
results in:
  node 0:  924 pages
  node 1: 5732 pages
  total:  6656 pages

With code to parallel allocations in this series:
  node 0:  924 pages
  node 1: 1547 pages
  total:  2471 pages
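To illustrate the shortfall Mike describes, here is a hedged sketch (not code from the series or from mm/hugetlb.c verbatim) contrasting the two strategies: in the serial boot path every page request may fall back to any allowed node, while a naive per-node split confines each worker to its own node, so a small node's unmet share is never made up elsewhere. `per_node_target()` and `alloc_huge_page_on_node()` are hypothetical helpers used only for this illustration.

```c
/* Serial boot path (simplified): any allowed node may satisfy each page. */
for (i = 0; i < h->max_huge_pages; i++)
	if (!alloc_pool_huge_page(h, &node_states[N_MEMORY], NULL))
		break;	/* stop only when *all* nodes are exhausted */

/*
 * Naive per-node parallel split: each worker only tries its own node.
 * per_node_target() and alloc_huge_page_on_node() are hypothetical.
 */
for_each_node_state(nid, N_MEMORY) {
	unsigned long target = per_node_target(h, nid); /* this worker's share */

	for (i = 0; i < target; i++)
		if (!alloc_huge_page_on_node(h, nid))
			break;	/* node 0 stops at 924; nobody picks up the rest */
}
```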
On Fri, 8 Dec 2023, Gang Li wrote:

> Hi all, hugetlb init parallelization has now been updated to v2.
>
> To David Hildenbrand: the padata multithread utilities have been used to
> reduce code complexity.
>
> To David Rientjes: The patch for measuring time will be separately included
> in the reply. Please test during your free time, thanks.
>

I'd love to, but what kernel is this based on? :)  I can't get this to
apply to any kernels that I have recently benchmarked with.

> # Introduction
> Hugetlb initialization during boot takes up a considerable amount of time.
> For instance, on a 2TB system, initializing 1,800 1GB huge pages takes 1-2
> seconds out of 10 seconds. Initializing 11,776 1GB pages on a 12TB Intel
> host takes 65.2 seconds [1], which is 17.4% of the total 373.78 seconds boot
> time. This is a noteworthy figure.
>
> Inspired by [2] and [3], hugetlb initialization can also be accelerated
> through parallelization. The kernel already has infrastructure such as
> padata_do_multithreaded; this series uses it to achieve effective results
> with minimal modifications.
>
> [1] https://lore.kernel.org/all/783f8bac-55b8-5b95-eb6a-11a583675000@google.com/
> [2] https://lore.kernel.org/all/20200527173608.2885243-1-daniel.m.jordan@oracle.com/
> [3] https://lore.kernel.org/all/20230906112605.2286994-1-usama.arif@bytedance.com/
>
> # Test result
> test                 no patch(ms)   patched(ms)   saved
> -------------------  -------------  ------------  --------
> 256c2t(4 node) 2M    2624           956           63.57%
> 256c2t(4 node) 1G    2679           1582          40.95%
> 128c1t(2 node) 2M    1788           684           61.74%
> 128c1t(2 node) 1G    3160           1618          48.80%
>
> # Change log
> Changes in v2:
> - Reduce complexity with `padata_do_multithreaded`
> - Support 1G hugetlb
>
> v1:
> - https://lore.kernel.org/all/20231123133036.68540-1-gang.li@linux.dev/
> - parallelize 2M hugetlb initialization with workqueue
>
> Gang Li (5):
>   hugetlb: code clean for hugetlb_hstate_alloc_pages
>   hugetlb: split hugetlb_hstate_alloc_pages
>   padata: dispatch works on different nodes
>   hugetlb: parallelize 2M hugetlb allocation and initialization
>   hugetlb: parallelize 1G hugetlb initialization
>
>  include/linux/hugetlb.h |   2 +-
>  include/linux/padata.h  |   2 +
>  kernel/padata.c         |   8 +-
>  mm/hugetlb.c            | 201 +++++++++++++++++++++++++++-------------
>  mm/mm_init.c            |   1 +
>  5 files changed, 148 insertions(+), 66 deletions(-)
>
> --
> 2.30.2
>
On 12/12/23 14:14, David Rientjes wrote:
> On Fri, 8 Dec 2023, Gang Li wrote:
>
> > Hi all, hugetlb init parallelization has now been updated to v2.
> >
> > To David Hildenbrand: the padata multithread utilities have been used to
> > reduce code complexity.
> >
> > To David Rientjes: The patch for measuring time will be separately included
> > in the reply. Please test during your free time, thanks.
> >
>
> I'd love to, but what kernel is this based on? :)  I can't get this to
> apply to any kernels that I have recently benchmarked with.

I was able to apply and build on top of v6.7-rc5.

Gang Li,
Since hugetlb now depends on CONFIG_PADATA, the Kconfig file should be
updated to reflect this.
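For reference, the change being asked for might look like the sketch below, modeled on how DEFERRED_STRUCT_PAGE_INIT already does `select PADATA` in mm/Kconfig. The exact option to touch (HUGETLBFS is assumed here) and the `if SMP` qualifier are assumptions, not the series' final change.

```kconfig
# Sketch only: make the hugetlb option pull in the padata infrastructure it
# now calls at boot. Placement and the "if SMP" qualifier are assumptions.
config HUGETLBFS
	bool "HugeTLB file system support"
	select PADATA if SMP
```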
On Tue, 12 Dec 2023, Mike Kravetz wrote:

> I was able to apply and build on top of v6.7-rc5.
>
> Gang Li,
> Since hugetlb now depends on CONFIG_PADATA, the Kconfig file should be
> updated to reflect this.

Gotcha, thanks.

I got this:

ld: error: undefined symbol: padata_do_multithreaded
  referenced by hugetlb.c:3470 (./mm/hugetlb.c:3470)
                vmlinux.o:(gather_bootmem_prealloc)
  referenced by hugetlb.c:3592 (./mm/hugetlb.c:3592)
                vmlinux.o:(hugetlb_hstate_alloc_pages_non_gigantic)
  referenced by hugetlb.c:3599 (./mm/hugetlb.c:3599)
                vmlinux.o:(hugetlb_hstate_alloc_pages_non_gigantic)

So, yeah, we need to enable DEFERRED_STRUCT_PAGE_INIT for this to build.

On 6.6 I measured "hugepagesz=1G hugepages=11776" on a 12TB host to be
77s this time around.

A latest Linus build with this patch set does not boot successfully, so
I'll need to look into that and try to capture the failure. Not sure if
it's related to this patch or the latest Linus build in general.
Hi,

On 2023/12/13 08:10, David Rientjes wrote:
> On 6.6 I measured "hugepagesz=1G hugepages=11776" on a 12TB host to be
> 77s this time around.

Thanks for your test! Is this the total kernel boot time, or just the
hugetlb initialization time?

> A latest Linus build with this patch set does not boot successfully, so

Which branch/tag is it compiled on?
I tested this patch on v6.7-rc4 and next-20231130.

> I'll need to look into that and try to capture the failure. Not sure if
> it's related to this patch or the latest Linus build in general.
>
On 2023/12/13 04:06, Mike Kravetz wrote:
> With today's code, allocating 6656 2MB pages via the kernel command line
> results in:
>   node 0:  924 pages
>   node 1: 5732 pages
>   total:  6656 pages
>
> With code to parallel allocations in this series:
>   node 0:  924 pages
>   node 1: 1547 pages
>   total:  2471 pages

Hi Mike,

Disabling numa_aware for hugetlb_alloc_node should solve this problem.
I will fix it in v3.
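A hedged sketch of what that fix might look like, assuming the `numa_aware` field that patch 3 of this series adds to `struct padata_mt_job`; the surrounding job setup is illustrative (see the earlier sketch), not the actual v3 patch. With the flag cleared, workers are not pinned to particular nodes, so each allocation can keep the existing round-robin node selection with cross-node fallback and the totals on imbalanced systems should match the serial path.

```c
struct padata_mt_job job = {
	.thread_fn   = hugetlb_pages_alloc_thread,	/* hypothetical helper from the earlier sketch */
	.fn_arg      = h,
	.start       = 0,
	.size        = h->max_huge_pages,
	.align       = 1,
	.min_chunk   = 1,
	.max_threads = num_node_state(N_MEMORY),
	.numa_aware  = false,	/* don't bind workers per node; keep cross-node fallback */
};

padata_do_multithreaded(&job);
```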
On Mon, 18 Dec 2023, Gang Li wrote:

> Hi,
>
> On 2023/12/13 08:10, David Rientjes wrote:
> > On 6.6 I measured "hugepagesz=1G hugepages=11776" on a 12TB host to be
> > 77s this time around.
>
> Thanks for your test! Is this the total kernel boot time, or just the
> hugetlb initialization time?
>

Ah, sorry for not being specific. It's just the hugetlb preallocation of
11776 1GB hugetlb pages; total boot takes a few more minutes.

> > A latest Linus build with this patch set does not boot successfully, so
>
> Which branch/tag is it compiled on?
> I tested this patch on v6.7-rc4 and next-20231130.
>

It was the latest Linus tip of tree. I'll continue to try again until I
get a successful boot and report back; serial console won't be possible
for unrelated reasons.
On Thu, 21 Dec 2023, David Rientjes wrote:

> > Hi,
> >
> > On 2023/12/13 08:10, David Rientjes wrote:
> > > On 6.6 I measured "hugepagesz=1G hugepages=11776" on a 12TB host to be
> > > 77s this time around.
> >
> > Thanks for your test! Is this the total kernel boot time, or just the
> > hugetlb initialization time?
> >
>
> Ah, sorry for not being specific. It's just the hugetlb preallocation of
> 11776 1GB hugetlb pages; total boot takes a few more minutes.
>

I had to apply this to get the patch series to compile on 6.7-rc7:

diff --git a/kernel/padata.c b/kernel/padata.c
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -485,7 +485,7 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
 	struct padata_work my_work, *pw;
 	struct padata_mt_job_state ps;
 	LIST_HEAD(works);
-	int nworks, nid;
+	int nworks, nid = 0;
 
 	if (job->size == 0)
 		return;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3300,7 +3300,7 @@ int alloc_bootmem_huge_page(struct hstate *h, int nid)
 int __alloc_bootmem_huge_page(struct hstate *h, int nid)
 {
 	struct huge_bootmem_page *m = NULL; /* initialize for clang */
-	int nr_nodes, node;
+	int nr_nodes, node = NUMA_NO_NODE;
 
 	/* do node specific alloc */
 	if (nid != NUMA_NO_NODE) {

With that, I compared "hugepagesz=1G hugepages=11776" before and after on
a 12TB host with eight NUMA nodes.

Compared to 77s of total initialization time before, with this series I
measured 18.3s.

Feel free to add this into the changelog once the initialization issues
are fixed up and I'm happy to ack it.

Thanks!
On 2023/12/25 13:21, David Rientjes wrote:
> With that, I compared "hugepagesz=1G hugepages=11776" before and after on
> a 12TB host with eight NUMA nodes.
>
> Compared to 77s of total initialization time before, with this series I
> measured 18.3s.
>
> Feel free to add this into the changelog once the initialization issues
> are fixed up and I'm happy to ack it.
>
> Thanks!

Cool! Thank you ;)