Message ID | 6f26ce22d2fcd523418a085f2c588fe0776d46e7.1706794035.git.baolin.wang@linux.alibaba.com |
---|---|
State | New |
Headers |
From: Baolin Wang <baolin.wang@linux.alibaba.com>
To: akpm@linux-foundation.org, muchun.song@linux.dev
Cc: osalvador@suse.de, david@redhat.com, mhocko@kernel.org, baolin.wang@linux.alibaba.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH] mm: hugetlb: remove __GFP_THISNODE flag when dissolving the old hugetlb
Date: Thu, 1 Feb 2024 21:31:13 +0800
Message-Id: <6f26ce22d2fcd523418a085f2c588fe0776d46e7.1706794035.git.baolin.wang@linux.alibaba.com>
X-Mailer: git-send-email 2.39.3 |
Series |
[RFC] mm: hugetlb: remove __GFP_THISNODE flag when dissolving the old hugetlb
|
|
Commit Message
Baolin Wang
Feb. 1, 2024, 1:31 p.m. UTC
Since commit 369fa227c219 ("mm: make alloc_contig_range handle free
hugetlb pages"), alloc_contig_range() can handle free hugetlb pages
by allocating a fresh hugepage and replacing the old one in the
free hugepage pool.

However, our customers still see alloc_contig_range() fail when it
encounters a free hugetlb page. The reason is that little memory is left
on the old hugetlb page's node, so the allocation in
isolate_or_dissolve_huge_page(), which is done with the __GFP_THISNODE
flag set, cannot obtain a fresh hugetlb page on that node. This makes
sense to some degree.

Later, commit ae37c7ff79f1 ("mm: make alloc_contig_range handle
in-use hugetlb pages") handled in-use hugetlb pages by isolating them
and migrating them in __alloc_contig_migrate_range(), but that path
allows falling back to other NUMA nodes when allocating a new hugetlb
page in alloc_migration_target().

This makes the handling of free and in-use hugetlb pages inconsistent.
Since CMA allocation and memory hotplug, which rely on
alloc_contig_range(), are important in some scenarios, and to keep
hugetlb handling consistent, we should remove the __GFP_THISNODE flag
in isolate_or_dissolve_huge_page() to allow falling back to other NUMA
nodes, which solves the alloc_contig_range() failure in our case.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
mm/hugetlb.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
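The behavioural difference the commit message describes can be illustrated with a small user-space model. This is only a sketch and not kernel code: `struct node_pool`, `alloc_huge_model()` and `MODEL_THISNODE` are hypothetical names invented for the illustration, and the fallback order (plain index order) is an assumption rather than the kernel's zonelist order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for __GFP_THISNODE; not the kernel's flag value. */
#define MODEL_THISNODE 0x1u

/* Toy per-node accounting: huge-page-sized chunks still free on a node. */
struct node_pool {
    unsigned long free_chunks;
};

/*
 * Try to take one chunk for a replacement hugetlb page.
 * Returns the node id the chunk came from, or -1 on failure.
 * With MODEL_THISNODE set, only `preferred` is tried; otherwise the
 * remaining nodes are scanned in index order as a fallback.
 */
int alloc_huge_model(struct node_pool *nodes, size_t nr_nodes,
                     size_t preferred, unsigned int flags)
{
    assert(preferred < nr_nodes);

    if (nodes[preferred].free_chunks > 0) {
        nodes[preferred].free_chunks--;
        return (int)preferred;
    }
    if (flags & MODEL_THISNODE)
        return -1; /* strict: the exhausted node is the only candidate */

    for (size_t nid = 0; nid < nr_nodes; nid++) {
        if (nid != preferred && nodes[nid].free_chunks > 0) {
            nodes[nid].free_chunks--;
            return (int)nid;
        }
    }
    return -1; /* every node is exhausted */
}
```

With node 0 exhausted and node 1 still holding free chunks, the strict call fails while the relaxed call succeeds on node 1. That is the trade-off under discussion: the contiguous allocation succeeds, but the replacement page no longer sits in the original node's pool.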
Comments
On Thu 01-02-24 21:31:13, Baolin Wang wrote:
[...]
> This introduces inconsistency to handling free and in-use hugetlb.
> Considering the CMA allocation and memory hotplug relying on the
> alloc_contig_range() are important in some scenarios, as well as keeping
> the consistent hugetlb handling, we should remove the __GFP_THISNODE flag
> in isolate_or_dissolve_huge_page() to allow fallbacking to other numa node,
> which can solve the failure of alloc_contig_range() in our case.

I do agree that the inconsistency is not really good but I am not sure
dropping __GFP_THISNODE is the right way forward. Breaking pre-allocated
per-node pools might result in unexpected failures when node bound
workloads don't get what is assumed available. Keep in mind that our
user APIs allow to pre-allocate per-node pools separately.

The in-use hugetlb is a very similar case. While having a temporarily
misplaced page doesn't really look terrible, once that hugetlb page is
released back into the pool we are back to the case above. Either we
make sure that the node affinity is restored later on or it shouldn't be
migrated to a different node at all.
On 2/1/2024 11:27 PM, Michal Hocko wrote:
> On Thu 01-02-24 21:31:13, Baolin Wang wrote:
[...]
> I do agree that the inconsistency is not really good but I am not sure
> dropping __GFP_THISNODE is the right way forward. Breaking pre-allocated
> per-node pools might result in unexpected failures when node bound
> workloads don't get what is assumed available. Keep in mind that our
> user APIs allow to pre-allocate per-node pools separately.

Yes, I agree, that is also what I am concerned about. But sometimes users
don't care about the distribution of per-node hugetlb, instead they are
more concerned about the success of CMA allocation or memory hotplug.

> The in-use hugetlb is a very similar case. While having a temporarily
> misplaced page doesn't really look terrible, once that hugetlb page is
> released back into the pool we are back to the case above. Either we
> make sure that the node affinity is restored later on or it shouldn't be
> migrated to a different node at all.

Agree. So how about the changes below?
(1) disallow fallbacking to other nodes when handling in-use hugetlb, which
can ensure consistent behavior in handling hugetlb.
(2) introduce a new sysctl (maybe named "hugetlb_allow_fallback_nodes")
for users to control whether fallbacking is allowed, which can solve the
CMA or memory hotplug failures that users are more concerned about.
On Fri 02-02-24 09:35:58, Baolin Wang wrote:
> On 2/1/2024 11:27 PM, Michal Hocko wrote:
> > On Thu 01-02-24 21:31:13, Baolin Wang wrote:
[...]
> > I do agree that the inconsistency is not really good but I am not sure
> > dropping __GFP_THISNODE is the right way forward. Breaking pre-allocated
> > per-node pools might result in unexpected failures when node bound
> > workloads don't get what is assumed available. Keep in mind that our
> > user APIs allow to pre-allocate per-node pools separately.
>
> Yes, I agree, that is also what I am concerned about. But sometimes users
> don't care about the distribution of per-node hugetlb, instead they are
> more concerned about the success of CMA allocation or memory hotplug.

Yes, sometimes the exact per-node distribution is not really important.
But the kernel has no way of knowing that right now. And we have to make
a conservative guess here.

> > The in-use hugetlb is a very similar case. While having a temporarily
> > misplaced page doesn't really look terrible, once that hugetlb page is
> > released back into the pool we are back to the case above. Either we
> > make sure that the node affinity is restored later on or it shouldn't be
> > migrated to a different node at all.
>
> Agree. So how about the changes below?
> (1) disallow fallbacking to other nodes when handling in-use hugetlb, which
> can ensure consistent behavior in handling hugetlb.

I can see two cases here. alloc_contig_range which is an internal kernel
user and then we have memory offlining. The former shouldn't break the
per-node hugetlb pool reservations, the latter might not have any other
choice (the whole node could get offline and that resembles breaking cpu
affinity if the cpu is gone).

Now I can see how a hugetlb page sitting inside a CMA region breaks CMA
users expectations but hugetlb migration already tries hard to allocate
a replacement hugetlb so the system must be under a heavy memory
pressure if that fails, right? Is it possible that the hugetlb
reservation is just overshot here? Maybe the memory is just terribly
fragmented though?

Could you be more specific about numbers in your failure case?

> (2) introduce a new sysctl (maybe named "hugetlb_allow_fallback_nodes")
> for users to control whether fallbacking is allowed, which can solve the
> CMA or memory hotplug failures that users are more concerned about.

I do not think this is a good idea. The policy might be different on
each node and this would get messy pretty quickly. If anything we could
try to detect a dedicated per node pool allocation instead. It is quite
likely that if admin preallocates pool without any memory policy then
the exact distribution of pages doesn't play a huge role.
On 2/2/2024 4:17 PM, Michal Hocko wrote:
> On Fri 02-02-24 09:35:58, Baolin Wang wrote:
>> On 2/1/2024 11:27 PM, Michal Hocko wrote:
>>> On Thu 01-02-24 21:31:13, Baolin Wang wrote:
[...]
>>> I do agree that the inconsistency is not really good but I am not sure
>>> dropping __GFP_THISNODE is the right way forward. Breaking pre-allocated
>>> per-node pools might result in unexpected failures when node bound
>>> workloads don't get what is assumed available. Keep in mind that our
>>> user APIs allow to pre-allocate per-node pools separately.
>>
>> Yes, I agree, that is also what I am concerned about. But sometimes users
>> don't care about the distribution of per-node hugetlb, instead they are
>> more concerned about the success of CMA allocation or memory hotplug.
>
> Yes, sometimes the exact per-node distribution is not really important.
> But the kernel has no way of knowing that right now. And we have to make
> a conservative guess here.
>
>>> The in-use hugetlb is a very similar case. While having a temporarily
>>> misplaced page doesn't really look terrible, once that hugetlb page is
>>> released back into the pool we are back to the case above. Either we
>>> make sure that the node affinity is restored later on or it shouldn't be
>>> migrated to a different node at all.
>>
>> Agree. So how about the changes below?
>> (1) disallow fallbacking to other nodes when handling in-use hugetlb, which
>> can ensure consistent behavior in handling hugetlb.
>
> I can see two cases here. alloc_contig_range which is an internal kernel
> user and then we have memory offlining. The former shouldn't break the
> per-node hugetlb pool reservations, the latter might not have any other
> choice (the whole node could get offline and that resembles breaking cpu
> affinity if the cpu is gone).

IMO, not always true for memory offlining, when handling a free hugetlb, it
disallows fallbacking, which is inconsistent.

Not only memory offlining, but also the longterm pinning (in
migrate_longterm_unpinnable_pages()) and memory failure (in
soft_offline_in_use_page()) can also break the per-node hugetlb pool
reservations.

> Now I can see how a hugetlb page sitting inside a CMA region breaks CMA
> users expectations but hugetlb migration already tries hard to allocate
> a replacement hugetlb so the system must be under a heavy memory
> pressure if that fails, right? Is it possible that the hugetlb
> reservation is just overshot here? Maybe the memory is just terribly
> fragmented though?
>
> Could you be more specific about numbers in your failure case?

Sure. Our customer's machine contains several numa nodes, and the system
reserves a large amount of CMA memory, occupying 50% of the total memory,
which is used for the virtual machine; meanwhile it also reserves lots of
hugetlb which can occupy 50% of the CMA. So before starting the virtual
machine, the hugetlb can use 50% of the CMA, but when starting the virtual
machine, the CMA will be used by the virtual machine and the hugetlb should
be migrated from CMA.

Due to several nodes in the system, one node's memory can be exhausted,
which will fail the hugetlb migration with __GFP_THISNODE flag.

>> (2) introduce a new sysctl (maybe named "hugetlb_allow_fallback_nodes")
>> for users to control whether fallbacking is allowed, which can solve the
>> CMA or memory hotplug failures that users are more concerned about.
>
> I do not think this is a good idea. The policy might be different on
> each node and this would get messy pretty quickly. If anything we could
> try to detect a dedicated per node pool allocation instead. It is quite
> likely that if admin preallocates pool without any memory policy then
> the exact distribution of pages doesn't play a huge role.

I also agree. Now I think the policy is already messy when handling hugetlb
migration:
1. CMA allocation: can or can not break the per-node hugetlb pool
reservations.
   1.1 handling free hugetlb: can not break per-node hugetlb pool
   reservations.
   1.2 handling in-use hugetlb: can break per-node hugetlb pool
   reservations.
2. memory offlining: can or can not break per-node hugetlb pool
reservations.
   2.1 handling free hugetlb: can not break
   2.2 handling in-use hugetlb: can break
3. longterm pinning: can break per-node hugetlb pool reservations.
4. memory soft-offline: can break per-node hugetlb pool reservations.

What a messy policy. And now we have no documentation to describe this
messy policy. So we need to make things more clear when handling hugetlb
migration with proper documentation.
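The list above can be restated as a small lookup table, which makes the inconsistency easy to check mechanically. This is a sketch that only encodes the claims as written in this message; it has not been verified against the kernel source, and all identifiers (`policy_table`, `path_is_consistent()`, the enums) are hypothetical names chosen for the illustration. Items 3 and 4 do not distinguish free from in-use pages, so they are recorded as `STATE_ANY`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum page_state { STATE_FREE, STATE_IN_USE, STATE_ANY };
enum pool_effect { POOL_KEPT, POOL_MAY_BREAK };

/* One row per (trigger, page state) pair from the list above. */
struct policy_row {
    const char *path;        /* what triggers the hugetlb move */
    enum page_state state;   /* state of the hugetlb page being handled */
    enum pool_effect effect; /* effect on the per-node pool reservation */
};

static const struct policy_row policy_table[] = {
    { "CMA allocation",      STATE_FREE,   POOL_KEPT },
    { "CMA allocation",      STATE_IN_USE, POOL_MAY_BREAK },
    { "memory offlining",    STATE_FREE,   POOL_KEPT },
    { "memory offlining",    STATE_IN_USE, POOL_MAY_BREAK },
    { "longterm pinning",    STATE_ANY,    POOL_MAY_BREAK },
    { "memory soft-offline", STATE_ANY,    POOL_MAY_BREAK },
};

#define POLICY_ROWS (sizeof(policy_table) / sizeof(policy_table[0]))

/*
 * True when every row recorded for `path` has the same effect.
 * A path with no rows is trivially consistent.
 */
bool path_is_consistent(const char *path)
{
    bool seen = false;
    enum pool_effect first = POOL_KEPT;

    assert(path != NULL);
    for (size_t i = 0; i < POLICY_ROWS; i++) {
        if (strcmp(policy_table[i].path, path) != 0)
            continue;
        if (!seen) {
            first = policy_table[i].effect;
            seen = true;
        } else if (policy_table[i].effect != first) {
            return false;
        }
    }
    return true;
}
```

Under this encoding, the CMA allocation and memory offlining paths come out as inconsistent, while longterm pinning and soft-offline are consistent only in the sense that they may always break the per-node pool.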
On Fri 02-02-24 17:29:02, Baolin Wang wrote:
> On 2/2/2024 4:17 PM, Michal Hocko wrote:
[...]
> > > Agree. So how about the changes below?
> > > (1) disallow fallbacking to other nodes when handling in-use hugetlb, which
> > > can ensure consistent behavior in handling hugetlb.
> >
> > I can see two cases here. alloc_contig_range which is an internal kernel
> > user and then we have memory offlining. The former shouldn't break the
> > per-node hugetlb pool reservations, the latter might not have any other
> > choice (the whole node could get offline and that resembles breaking cpu
> > affinity if the cpu is gone).
>
> IMO, not always true for memory offlining, when handling a free hugetlb, it
> disallows fallbacking, which is inconsistent.

It's been some time since I've looked into that code so I am not 100% sure
how the free pool is currently handled. The above is the way I _think_ it
should work from the usability POV.

> Not only memory offlining, but also the longterm pinning (in
> migrate_longterm_unpinnable_pages()) and memory failure (in
> soft_offline_in_use_page()) can also break the per-node hugetlb pool
> reservations.

Bad

> > Now I can see how a hugetlb page sitting inside a CMA region breaks CMA
> > users expectations but hugetlb migration already tries hard to allocate
> > a replacement hugetlb so the system must be under a heavy memory
> > pressure if that fails, right? Is it possible that the hugetlb
> > reservation is just overshot here? Maybe the memory is just terribly
> > fragmented though?
> >
> > Could you be more specific about numbers in your failure case?
>
> Sure. Our customer's machine contains several numa nodes, and the system
> reserves a large amount of CMA memory, occupying 50% of the total memory,
> which is used for the virtual machine; meanwhile it also reserves lots of
> hugetlb which can occupy 50% of the CMA. So before starting the virtual
> machine, the hugetlb can use 50% of the CMA, but when starting the virtual
> machine, the CMA will be used by the virtual machine and the hugetlb should
> be migrated from CMA.

Would it make more sense for hugetlb pages to _not_ use CMA in this
case? I mean, would you be better off overall if the hugetlb pool was
preallocated before the CMA is reserved? I do realize this is just
working around the current limitations but it could be better than
nothing.

> Due to several nodes in the system, one node's memory can be exhausted,
> which will fail the hugetlb migration with __GFP_THISNODE flag.

Is the workload NUMA aware? I.e. do you bind virtual machines to
specific nodes?
On 2/2/2024 5:55 PM, Michal Hocko wrote:
> On Fri 02-02-24 17:29:02, Baolin Wang wrote:
>> On 2/2/2024 4:17 PM, Michal Hocko wrote:
> [...]
>>>> Agree. So how about the changes below?
>>>> (1) disallow fallbacking to other nodes when handling in-use hugetlb, which
>>>> can ensure consistent behavior in handling hugetlb.
>>>
>>> I can see two cases here. alloc_contig_range which is an internal kernel
>>> user and then we have memory offlining. The former shouldn't break the
>>> per-node hugetlb pool reservations, the latter might not have any other
>>> choice (the whole node could get offline and that resembles breaking cpu
>>> affinity if the cpu is gone).
>>
>> IMO, not always true for memory offlining, when handling a free hugetlb, it
>> disallows fallbacking, which is inconsistent.
>
> It's been some time since I've looked into that code so I am not 100% sure
> how the free pool is currently handled. The above is the way I _think_ it
> should work from the usability POV.

Please see alloc_and_dissolve_hugetlb_folio().

>> Not only memory offlining, but also the longterm pinning (in
>> migrate_longterm_unpinnable_pages()) and memory failure (in
>> soft_offline_in_use_page()) can also break the per-node hugetlb pool
>> reservations.
>
> Bad
>
>>> Now I can see how a hugetlb page sitting inside a CMA region breaks CMA
>>> users expectations but hugetlb migration already tries hard to allocate
>>> a replacement hugetlb so the system must be under a heavy memory
>>> pressure if that fails, right? Is it possible that the hugetlb
>>> reservation is just overshot here? Maybe the memory is just terribly
>>> fragmented though?
>>>
>>> Could you be more specific about numbers in your failure case?
>>
>> Sure. Our customer's machine contains several numa nodes, and the system
>> reserves a large amount of CMA memory, occupying 50% of the total memory,
>> which is used for the virtual machine; meanwhile it also reserves lots of
>> hugetlb which can occupy 50% of the CMA. So before starting the virtual
>> machine, the hugetlb can use 50% of the CMA, but when starting the virtual
>> machine, the CMA will be used by the virtual machine and the hugetlb should
>> be migrated from CMA.
>
> Would it make more sense for hugetlb pages to _not_ use CMA in this
> case? I mean, would you be better off overall if the hugetlb pool was
> preallocated before the CMA is reserved? I do realize this is just
> working around the current limitations but it could be better than
> nothing.

In this case, the CMA area is large and occupies 50% of the total memory.
The purpose is that, if no virtual machines are launched, then CMA memory
can be used by hugetlb as much as possible. Once the virtual machines need
to be launched, it is necessary to allocate CMA memory as much as possible,
such as migrating hugetlb from CMA memory.

After more thinking, I think we should still drop the __GFP_THISNODE flag in
alloc_and_dissolve_hugetlb_folio(). Firstly, not only can it potentially
cause CMA allocation to fail, but it might also cause memory offline to fail
like I said in the commit message. Secondly, there have been no user reports
complaining about breaking the per-node hugetlb pool, although longterm
pinning, memory failure, and memory offline can potentially break the
per-node hugetlb pool.

>> Due to several nodes in the system, one node's memory can be exhausted,
>> which will fail the hugetlb migration with __GFP_THISNODE flag.
>
> Is the workload NUMA aware? I.e. do you bind virtual machines to
> specific nodes?

Yes, the VM can be bound to nodes.
On Mon 05-02-24 10:50:32, Baolin Wang wrote:
>
>
> On 2/2/2024 5:55 PM, Michal Hocko wrote:
> > On Fri 02-02-24 17:29:02, Baolin Wang wrote:
> > > On 2/2/2024 4:17 PM, Michal Hocko wrote:
> > [...]
> > > > > Agree. So how about below changing?
> > > > > (1) disallow fallbacking to other nodes when handling in-use hugetlb, which
> > > > > can ensure consistent behavior in handling hugetlb.
> > > >
> > > > I can see two cases here. alloc_contig_range which is an internal kernel
> > > > user and then we have memory offlining. The former shouldn't break the
> > > > per-node hugetlb pool reservations, the latter might not have any other
> > > > choice (the whole node could get offline and that resembles breaking cpu
> > > > affinity if the cpu is gone).
> > >
> > > IMO, not always true for memory offlining, when handling a free hugetlb, it
> > > disallows fallbacking, which is inconsistent.
> >
> > It's been some time since I've looked into that code so I am not 100% sure how
> > the free pool is currently handled. The above is the way I _think_ it
> > should work from the usability POV.
>
> Please see alloc_and_dissolve_hugetlb_folio().

This is the alloc_contig_range rather than offlining path. Page
offlining migrates in-use pages to a _different_ node (as long as there is one
available) via do_migrate_range and it dissolves free hugetlb pages via
dissolve_free_huge_pages. So the node's pool is altered but as this is
an explicit offlining operation I think there is no choice to go
differently.

> > > Not only memory offlining, but also the longterm pinning (in
> > > migrate_longterm_unpinnable_pages()) and memory failure (in
> > > soft_offline_in_use_page()) can also break the per-node hugetlb pool
> > > reservations.
> >
> > Bad
> >
> > > > Now I can see how a hugetlb page sitting inside a CMA region breaks CMA
> > > > users' expectations but hugetlb migration already tries hard to allocate
> > > > a replacement hugetlb so the system must be under a heavy memory
> > > > pressure if that fails, right? Is it possible that the hugetlb
> > > > reservation is just overshot here? Maybe the memory is just terribly
> > > > fragmented though?
> > > >
> > > > Could you be more specific about numbers in your failure case?
> > >
> > > Sure. Our customer's machine contains several numa nodes, and the system
> > > reserves a large number of CMA memory occupied 50% of the total memory which
> > > is used for the virtual machine, meanwhile it also reserves lots of hugetlb
> > > which can occupy 50% of the CMA. So before starting the virtual machine, the
> > > hugetlb can use 50% of the CMA, but when starting the virtual machine, the
> > > CMA will be used by the virtual machine and the hugetlb should be migrated
> > > from CMA.
> >
> > Would it make more sense for hugetlb pages to _not_ use CMA in this
> > case? I mean, would we be better off overall if the hugetlb pool was
> > preallocated before the CMA is reserved? I do realize this is just
> > working around the current limitations but it could be better than
> > nothing.
>
> In this case, the CMA area is large and occupies 50% of the total memory.
> The purpose is that, if no virtual machines are launched, then CMA memory
> can be used by hugetlb as much as possible. Once the virtual machines need
> to be launched, it is necessary to allocate CMA memory as much as possible,
> such as migrating hugetlb from CMA memory.

I am afraid that your assumption doesn't correspond to the existing
implementation. hugetlb allocations are movable but they are certainly
not as movable as regular pages. So you have to consider a bigger
margin and spare memory to achieve a more reliable movability.

Have you tried to handle this from userspace? It seems that you know
when there is the CMA demand so you could rebalance hugetlb pools at
that moment, no?

> After more thinking, I think we should still drop the __GFP_THISNODE flag in
> alloc_and_dissolve_hugetlb_folio(). Firstly, not only can it potentially cause
> CMA allocation to fail, but it might also cause memory offline to fail like
> I said in the commit message. Secondly, there have been no user reports
> complaining about breaking the per-node hugetlb pool, although longterm
> pinning, memory failure, and memory offline can potentially break the
> per-node hugetlb pool.

It is quite possible that traditional users (like large DBs) do not use
CMA heavily so such a problem was not observed so far. That doesn't mean
those problems do not really matter.
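
[Editorial sketch, not part of the thread.] The userspace rebalancing
suggested above could be done through the per-node hugetlb sysfs files.
The following is only a minimal sketch under assumptions: the helper names
hugetlb_node_path() and set_node_hugepages() are hypothetical, and a real
tool would have to decide which node to shrink and when the CMA demand
arrives, which the thread does not specify.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Build the per-node sysfs path of a hugetlb pool, for example
 * /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages.
 * Returns 0 on success, -1 if the buffer is too small.
 * (Hypothetical helper, for illustration only.)
 */
static int hugetlb_node_path(char *buf, size_t len, int nid,
			     unsigned long size_kb)
{
	int n;

	assert(nid >= 0);
	n = snprintf(buf, len,
		     "/sys/devices/system/node/node%d/hugepages/hugepages-%lukB/nr_hugepages",
		     nid, size_kb);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Ask the kernel to resize one node's persistent pool to nr_pages.
 * Needs root. The kernel may end up with a different count than
 * requested, so callers should read the file back afterwards.
 * Returns 0 if the write succeeded, -1 otherwise.
 */
static int set_node_hugepages(const char *path, unsigned long nr_pages)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%lu\n", nr_pages) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A caller would build the path for the node whose hugetlb pages sit in the
CMA area, shrink that pool before the virtual machine starts, and grow a
pool on another node by the same amount. As noted in the reply below, this
only mitigates the problem.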
On 2/5/2024 5:15 PM, Michal Hocko wrote:
> On Mon 05-02-24 10:50:32, Baolin Wang wrote:
>>
>>
>> On 2/2/2024 5:55 PM, Michal Hocko wrote:
>>> On Fri 02-02-24 17:29:02, Baolin Wang wrote:
>>>> On 2/2/2024 4:17 PM, Michal Hocko wrote:
>>> [...]
>>>>>> Agree. So how about below changing?
>>>>>> (1) disallow fallbacking to other nodes when handling in-use hugetlb, which
>>>>>> can ensure consistent behavior in handling hugetlb.
>>>>>
>>>>> I can see two cases here. alloc_contig_range which is an internal kernel
>>>>> user and then we have memory offlining. The former shouldn't break the
>>>>> per-node hugetlb pool reservations, the latter might not have any other
>>>>> choice (the whole node could get offline and that resembles breaking cpu
>>>>> affinity if the cpu is gone).
>>>>
>>>> IMO, not always true for memory offlining, when handling a free hugetlb, it
>>>> disallows fallbacking, which is inconsistent.
>>>
>>> It's been some time since I've looked into that code so I am not 100% sure how
>>> the free pool is currently handled. The above is the way I _think_ it
>>> should work from the usability POV.
>>
>> Please see alloc_and_dissolve_hugetlb_folio().
>
> This is the alloc_contig_range rather than offlining path. Page
> offlining migrates in-use pages to a _different_ node (as long as there is one
> available) via do_migrate_range and it dissolves free hugetlb pages via
> dissolve_free_huge_pages. So the node's pool is altered but as this is
> an explicit offlining operation I think there is no choice to go
> differently.
>
>>>> Not only memory offlining, but also the longterm pinning (in
>>>> migrate_longterm_unpinnable_pages()) and memory failure (in
>>>> soft_offline_in_use_page()) can also break the per-node hugetlb pool
>>>> reservations.
>>>
>>> Bad
>>>
>>>>> Now I can see how a hugetlb page sitting inside a CMA region breaks CMA
>>>>> users' expectations but hugetlb migration already tries hard to allocate
>>>>> a replacement hugetlb so the system must be under a heavy memory
>>>>> pressure if that fails, right? Is it possible that the hugetlb
>>>>> reservation is just overshot here? Maybe the memory is just terribly
>>>>> fragmented though?
>>>>>
>>>>> Could you be more specific about numbers in your failure case?
>>>>
>>>> Sure. Our customer's machine contains several numa nodes, and the system
>>>> reserves a large number of CMA memory occupied 50% of the total memory which
>>>> is used for the virtual machine, meanwhile it also reserves lots of hugetlb
>>>> which can occupy 50% of the CMA. So before starting the virtual machine, the
>>>> hugetlb can use 50% of the CMA, but when starting the virtual machine, the
>>>> CMA will be used by the virtual machine and the hugetlb should be migrated
>>>> from CMA.
>>>
>>> Would it make more sense for hugetlb pages to _not_ use CMA in this
>>> case? I mean, would we be better off overall if the hugetlb pool was
>>> preallocated before the CMA is reserved? I do realize this is just
>>> working around the current limitations but it could be better than
>>> nothing.
>>
>> In this case, the CMA area is large and occupies 50% of the total memory.
>> The purpose is that, if no virtual machines are launched, then CMA memory
>> can be used by hugetlb as much as possible. Once the virtual machines need
>> to be launched, it is necessary to allocate CMA memory as much as possible,
>> such as migrating hugetlb from CMA memory.
>
> I am afraid that your assumption doesn't correspond to the existing
> implementation. hugetlb allocations are movable but they are certainly
> not as movable as regular pages. So you have to consider a bigger
> margin and spare memory to achieve a more reliable movability.
>
> Have you tried to handle this from userspace? It seems that you know
> when there is the CMA demand so you could rebalance hugetlb pools at
> that moment, no?

Maybe this can help, but this just mitigates the issue ...

>> After more thinking, I think we should still drop the __GFP_THISNODE flag in
>> alloc_and_dissolve_hugetlb_folio(). Firstly, not only can it potentially cause
>> CMA allocation to fail, but it might also cause memory offline to fail like
>> I said in the commit message. Secondly, there have been no user reports
>> complaining about breaking the per-node hugetlb pool, although longterm
>> pinning, memory failure, and memory offline can potentially break the
>> per-node hugetlb pool.
>
> It is quite possible that traditional users (like large DBs) do not use
> CMA heavily so such a problem was not observed so far. That doesn't mean
> those problems do not really matter.

CMA is just one case, as I mentioned before, other situations can also break
the per-node hugetlb pool now.

Let's focus on the main point: why should we still keep inconsistent
behavior when handling free and in-use hugetlb for alloc_contig_range()? That's
really confusing.
On Mon 05-02-24 21:06:17, Baolin Wang wrote:
[...]
> > It is quite possible that traditional users (like large DBs) do not use
> > CMA heavily so such a problem was not observed so far. That doesn't mean
> > those problems do not really matter.
>
> CMA is just one case, as I mentioned before, other situations can also break
> the per-node hugetlb pool now.

Is there any other case than memory hotplug, which is arguably different
as it is a disruptive operation already?

> Let's focus on the main point: why should we still keep inconsistent
> behavior when handling free and in-use hugetlb for alloc_contig_range()? That's
> really confusing.

Yes, this should behave consistently. And the least surprising way to
handle that from the user configuration POV is to not move outside of
the original NUMA node.
On 2024/2/5 22:23, Michal Hocko wrote:
> On Mon 05-02-24 21:06:17, Baolin Wang wrote:
> [...]
>>> It is quite possible that traditional users (like large DBs) do not use
>>> CMA heavily so such a problem was not observed so far. That doesn't mean
>>> those problems do not really matter.
>>
>> CMA is just one case, as I mentioned before, other situations can also break
>> the per-node hugetlb pool now.
>
> Is there any other case than memory hotplug, which is arguably different
> as it is a disruptive operation already?

Yes, like I said before, the longterm pinning, memory failure and the users
of alloc_contig_pages() may also break the per-node hugetlb pool.

>> Let's focus on the main point: why should we still keep inconsistent
>> behavior when handling free and in-use hugetlb for alloc_contig_range()? That's
>> really confusing.
>
> Yes, this should behave consistently. And the least surprising way to
> handle that from the user configuration POV is to not move outside of
> the original NUMA node.

So you mean we should also add the __GFP_THISNODE flag in
alloc_migration_target() when allocating a new hugetlb as the target for
migration, so that we can unify the behavior and avoid breaking the per-node pool?
On Tue 06-02-24 16:18:22, Baolin Wang wrote:
>
>
> On 2024/2/5 22:23, Michal Hocko wrote:
> > On Mon 05-02-24 21:06:17, Baolin Wang wrote:
> > [...]
> > > > It is quite possible that traditional users (like large DBs) do not use
> > > > CMA heavily so such a problem was not observed so far. That doesn't mean
> > > > those problems do not really matter.
> > >
> > > CMA is just one case, as I mentioned before, other situations can also break
> > > the per-node hugetlb pool now.
> >
> > Is there any other case than memory hotplug, which is arguably different
> > as it is a disruptive operation already?
>
> Yes, like I said before, the longterm pinning, memory failure and the users
> of alloc_contig_pages() may also break the per-node hugetlb pool.

Memory failure is similar to memory hotplug in the sense that it is
a disruptive operation and fallback to a different node might be the
only option to handle it. On the other hand, longterm pinning is similar to
alloc_contig_pages() and it should fail if the page cannot be migrated
within the node.

It seems that hugetlb is quite behind with many other features and I am
not really sure how to deal with that. What is your take, Muchun Song?

> > > Let's focus on the main point: why should we still keep inconsistent
> > > behavior when handling free and in-use hugetlb for alloc_contig_range()? That's
> > > really confusing.
> >
> > Yes, this should behave consistently. And the least surprising way to
> > handle that from the user configuration POV is to not move outside of
> > the original NUMA node.
>
> So you mean we should also add the __GFP_THISNODE flag in
> alloc_migration_target() when allocating a new hugetlb as the target for
> migration, so that we can unify the behavior and avoid breaking the per-node pool?

Not as simple as that, because alloc_migration_target is also used by
user-driven migration.
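
[Editorial sketch, not part of the thread.] The per-caller policy argued
for in the message above can be written down as a small decision function.
This is only a userspace model of that stated preference, not kernel code
and not merged behavior: the enum, the MODEL_GFP_THISNODE bit and both
function names are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in bit for this model only; not the kernel's __GFP_THISNODE value. */
#define MODEL_GFP_THISNODE 0x1u

/* Hypothetical reasons, one per caller named in the thread. */
enum model_move_reason {
	MODEL_CONTIG_RANGE,	/* alloc_contig_range() / alloc_contig_pages() */
	MODEL_LONGTERM_PIN,	/* migrate_longterm_unpinnable_pages() */
	MODEL_MEMORY_HOTPLUG,	/* memory offlining */
	MODEL_MEMORY_FAILURE,	/* soft_offline_in_use_page() */
	MODEL_USER_REQUEST,	/* user-driven migration */
};

/*
 * Internal kernel users should fail rather than leave the node.
 * Disruptive operations may have no other choice than to fall back,
 * and user-driven migration follows whatever the user asked for.
 */
static bool model_must_stay_on_node(enum model_move_reason reason)
{
	switch (reason) {
	case MODEL_CONTIG_RANGE:
	case MODEL_LONGTERM_PIN:
		return true;
	case MODEL_MEMORY_HOTPLUG:
	case MODEL_MEMORY_FAILURE:
	case MODEL_USER_REQUEST:
		return false;
	}
	assert(!"unknown reason");
	return false;
}

/* Add the node restriction to a base mask only where the policy asks for it. */
static unsigned int model_alloc_mask(unsigned int base,
				     enum model_move_reason reason)
{
	return model_must_stay_on_node(reason) ?
		(base | MODEL_GFP_THISNODE) : base;
}
```

The last case is why adding the flag unconditionally in
alloc_migration_target() would not work: the same allocation helper serves
callers that must be allowed to cross nodes.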
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9d996fe4ecd9..9c832709728e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3029,7 +3029,7 @@ void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
 static int alloc_and_dissolve_hugetlb_folio(struct hstate *h,
 			struct folio *old_folio, struct list_head *list)
 {
-	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
+	gfp_t gfp_mask = htlb_alloc_mask(h);
 	int nid = folio_nid(old_folio);
 	struct folio *new_folio;
 	int ret = 0;
@@ -3088,7 +3088,7 @@ static int alloc_and_dissolve_hugetlb_folio(struct hstate *h,
 		 * Ref count on new_folio is already zero as it was dropped
 		 * earlier. It can be directly added to the pool free list.
 		 */
-		__prep_account_new_huge_page(h, nid);
+		__prep_account_new_huge_page(h, folio_nid(new_folio));
 		enqueue_hugetlb_folio(h, new_folio);
 
 		/*
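
[Editorial sketch, not part of the patch.] The second hunk follows from the
first: once the replacement page may come from another node, the per-node
counter has to be bumped on the node that actually owns the new page. The
toy model below illustrates this under heavy simplification. struct
pool_model, model_alloc_node() and model_replace() are hypothetical names,
and the real function also handles locking, refcounts and the free lists.

```c
#include <assert.h>

#define NR_NODES 2

struct pool_model {
	int free_mem[NR_NODES];		  /* huge-page-sized chunks still allocatable */
	int nr_huge_pages_node[NR_NODES]; /* per-node hugetlb pool size */
};

/* Pick the node for the replacement page; -1 means the allocation failed. */
static int model_alloc_node(const struct pool_model *p, int preferred,
			    int thisnode)
{
	if (p->free_mem[preferred] > 0)
		return preferred;
	if (thisnode)
		return -1;	/* restricted to the preferred node: no fallback */
	for (int nid = 0; nid < NR_NODES; nid++)
		if (p->free_mem[nid] > 0)
			return nid;
	return -1;
}

/*
 * Replace one free pool page that lives on old_nid.
 * Returns the node of the new page, or -1 if nothing could be allocated.
 */
static int model_replace(struct pool_model *p, int old_nid, int thisnode)
{
	int new_nid;

	assert(old_nid >= 0 && old_nid < NR_NODES);
	new_nid = model_alloc_node(p, old_nid, thisnode);
	if (new_nid < 0)
		return -1;

	p->free_mem[new_nid]--;
	p->nr_huge_pages_node[old_nid]--;	/* old page is dissolved */
	/* Account on the new page's node, not on old_nid. */
	p->nr_huge_pages_node[new_nid]++;
	return new_nid;
}
```

With node 0 out of memory, the restricted variant fails and leaves both
pools untouched, while the unrestricted variant succeeds but moves one page
of the pool from node 0 to node 1. That shift is the per-node pool change
debated in the thread.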