Message ID: <20230412195939.1242462-1-pasha.tatashin@soleen.com>
State: New
From: Pasha Tatashin <pasha.tatashin@soleen.com>
To: pasha.tatashin@soleen.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, akpm@linux-foundation.org, mike.kravetz@oracle.com, mhocko@suse.com, muchun.song@linux.dev, rientjes@google.com, souravpanda@google.com
Subject: [PATCH v2] mm: hugetlb_vmemmap: provide stronger vmemmap allocation guarantees
Date: Wed, 12 Apr 2023 19:59:39 +0000
Series: [v2] mm: hugetlb_vmemmap: provide stronger vmemmap allocation guarantees
Commit Message
Pasha Tatashin
April 12, 2023, 7:59 p.m. UTC
HugeTLB pages have a struct page optimization where the struct pages for tail
pages are freed. However, when HugeTLB pages are destroyed, the memory for
those struct pages (vmemmap) needs to be allocated again.

Currently, the __GFP_NORETRY flag is used to allocate the memory for vmemmap,
but given that this flag makes very little effort to actually reclaim
memory, returning huge pages back to the system can be a problem. Let's
use __GFP_RETRY_MAYFAIL instead. This flag also performs graceful
reclaim without causing OOMs, but at least it may perform a few retries,
and will fail only when there is a genuinely small amount of unused memory
in the system.
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Suggested-by: David Rientjes <rientjes@google.com>
---
mm/hugetlb_vmemmap.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)
Changelog:
v2
- removed gfp_mask argument from alloc_vmemmap_page_list as suggested by
David Rientjes.
- Fixed spelling in the patch title.
Comments
On Wed, Apr 12, 2023 at 4:13 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> Lots of questions (ie, missing information!)
>
> [...]
>
> Are there any reports of this happening in the real world?
>
> If so, does this change help?

It helps to avoid transient allocation problems. In general it is not a good idea to fail when we are trying to free gigantic pages back to the system.

> If the allocation attempt fails, what are the consequences?

The gigantic page is not going to be returned to the system. The user will have to free some memory before returning them back to the system.

> What are the potential downsides to this change? Why did we choose
> __GFP_NORETRY in the first place?
>
> What happens if we try harder (eg, GFP_KERNEL)?

Michal answered this question: it won't make much difference due to __GFP_THISNODE.
On Wed, Apr 12, 2023 at 4:18 PM Michal Hocko <mhocko@suse.com> wrote:
>
> [...]
>
> Mike was generous enough to make me remember
> https://lore.kernel.org/linux-mm/YCafit5ruRJ+SL8I@dhcp22.suse.cz/.
> GFP_KERNEL wouldn't make much difference because this is
> __GFP_THISNODE. But I do agree that the changelog should go into more
> detail about why we want to try harder now. I can imagine that
> shrinking the hugetlb pool by a large number of hugetlb pages might
> become a problem, but is this really happening or is this a theoretical
> concern?

This is a theoretical concern. Freeing a 1G page requires 16M of free memory. A machine might need to be reconfigured from one task to another, and release a large number of 1G pages back to the system. If allocating 16M fails, the release won't work.

In an ideal scenario we should guarantee that this never fails: that we can always free HugeTLB pages back to the system. At the very least we could steal the memory for vmemmap from the page that is being released.

Pasha
On Thu 13-04-23 11:05:20, Pavel Tatashin wrote:
> [...]
> This is a theoretical concern. Freeing a 1G page requires 16M of free
> memory. A machine might need to be reconfigured from one task to
> another, and release a large number of 1G pages back to the system. If
> allocating 16M fails, the release won't work.

This is really an important "detail" the changelog should mention. While I am not really against the change, I would much rather see it as the result of a real-world fix rather than a theoretical concern, mostly because a real-life scenario would allow us to test the effectiveness of __GFP_RETRY_MAYFAIL. As that request might fail as well, we would just end up with a theoretical fix for a theoretical problem: something that is easy to introduce but much harder to get rid of, should we ever need to change the __GFP_RETRY_MAYFAIL implementation, for example.

> In an ideal scenario we should guarantee that this never fails: that
> we can always free HugeTLB pages back to the system. At the very least
> we could steal the memory for vmemmap from the page that is being
> released.

Yes, this really bothered me when the concept was introduced initially. I am always concerned when you need to allocate in order to free memory. Practically speaking we haven't heard about bug reports, so maybe this is not such a big deal as I thought.
On Thu, Apr 13, 2023 at 11:25 AM Michal Hocko <mhocko@suse.com> wrote:
>
> [...]
>
> This is really an important "detail" the changelog should mention.
> While I am not really against the change, I would much rather see it as
> the result of a real-world fix rather than a theoretical concern,
> mostly because a real-life scenario would allow us to test the
> effectiveness of __GFP_RETRY_MAYFAIL.

I will add this to the changelog in v3. If __GFP_RETRY_MAYFAIL is ineffective, we will receive feedback once someone hits this problem; otherwise, we will never hear about it. I think overall it is safer to keep this code with the __GFP_RETRY_MAYFAIL flag.

> Yes, this really bothered me when the concept was introduced initially.
> I am always concerned when you need to allocate in order to free
> memory. Practically speaking we haven't heard about bug reports, so
> maybe this is not such a big deal as I thought.

I suspect this is because at the moment it is not that frequent that a machine is reconfigured from having a lot of HugeTLB based workload to non-HugeTLB workload.

Pasha
On Thu 13-04-23 13:11:39, Pavel Tatashin wrote:
> [...]
> I will add this to the changelog in v3. If __GFP_RETRY_MAYFAIL is
> ineffective, we will receive feedback once someone hits this problem.

I do not remember anybody hitting this with the current __GFP_NORETRY. So arguably there is nothing to be fixed ATM.

> I suspect this is because at the moment it is not that frequent that a
> machine is reconfigured from having a lot of HugeTLB based workload to
> non-HugeTLB workload.

Yes, hugetlb workloads tend to be pretty static from my experience.
On Thu, 13 Apr 2023, Michal Hocko wrote:
> [...]
> I do not remember anybody hitting this with the current __GFP_NORETRY.
> So arguably there is nothing to be fixed ATM.

I think we should still at least clear __GFP_NORETRY in this allocation: to be able to free 1GB hugepages back to the system, we'd like the page allocator to at least exercise its normal order-0 allocation logic rather than exempting it from retrying reclaim by opting into __GFP_NORETRY.

I'd agree with the analysis in https://lore.kernel.org/linux-mm/YCafit5ruRJ+SL8I@dhcp22.suse.cz/ that either a cleared __GFP_NORETRY or a __GFP_RETRY_MAYFAIL makes logical sense.

We really *do* want to free these hugepages back to the system, and the amount of memory freed will always be more than the allocation for the struct pages. The net result is more free memory.

If the allocation fails, we can't free 1GB back to the system on a saturated node if our first reclaim attempt didn't allow these struct pages to be allocated. Stranding 1GB in the hugetlb pool that no userspace on the system can make use of at the time isn't very useful.
On Fri 14-04-23 17:47:28, David Rientjes wrote:
> [...]
> If the allocation fails, we can't free 1GB back to the system on a
> saturated node if our first reclaim attempt didn't allow these struct
> pages to be allocated. Stranding 1GB in the hugetlb pool that no
> userspace on the system can make use of at the time isn't very useful.

I do not think there is any dispute in the theoretical concern. The question is whether this is something that really needs a fix in practice. Have we ever seen workloads which rely on GB pages fail to free them?
On 04/17/23 10:33, Michal Hocko wrote:
> [...]
> I do not think there is any dispute in the theoretical concern. The
> question is whether this is something that really needs a fix in
> practice. Have we ever seen workloads which rely on GB pages fail to
> free them?

Since I have never seen a failure allocating vmemmap, I agree that this is all a theoretical concern. However, to me it seems that replacing __GFP_NORETRY with __GFP_RETRY_MAYFAIL would lessen that theoretical concern just a little, simply because an allocation with __GFP_RETRY_MAYFAIL is a little more likely to succeed.

Again, I know this is all theoretical, but if switching to __GFP_RETRY_MAYFAIL would prevent even one allocation/hugetlb page freeing failure, I think it is worth it. Because as soon as we see one failure, we may need to look into addressing this now-theoretical concern.
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index a559037cce00..236647e4bfec 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -384,8 +384,9 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
 }
 
 static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
-				   gfp_t gfp_mask, struct list_head *list)
+				   struct list_head *list)
 {
+	gfp_t gfp_mask = GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_THISNODE;
 	unsigned long nr_pages = (end - start) >> PAGE_SHIFT;
 	int nid = page_to_nid((struct page *)start);
 	struct page *page, *next;
@@ -413,12 +414,11 @@ static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
  * @end: end address of the vmemmap virtual address range that we want to
  *       remap.
  * @reuse: reuse address.
- * @gfp_mask: GFP flag for allocating vmemmap pages.
  *
  * Return: %0 on success, negative error code otherwise.
  */
 static int vmemmap_remap_alloc(unsigned long start, unsigned long end,
-			       unsigned long reuse, gfp_t gfp_mask)
+			       unsigned long reuse)
 {
 	LIST_HEAD(vmemmap_pages);
 	struct vmemmap_remap_walk walk = {
@@ -430,7 +430,7 @@ static int vmemmap_remap_alloc(unsigned long start, unsigned long end,
 	/* See the comment in the vmemmap_remap_free(). */
 	BUG_ON(start - reuse != PAGE_SIZE);
 
-	if (alloc_vmemmap_page_list(start, end, gfp_mask, &vmemmap_pages))
+	if (alloc_vmemmap_page_list(start, end, &vmemmap_pages))
 		return -ENOMEM;
 
 	mmap_read_lock(&init_mm);
@@ -476,8 +476,7 @@ int hugetlb_vmemmap_restore(const struct hstate *h, struct page *head)
 	 * When a HugeTLB page is freed to the buddy allocator, previously
 	 * discarded vmemmap pages must be allocated and remapping.
 	 */
-	ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, vmemmap_reuse,
-				  GFP_KERNEL | __GFP_NORETRY | __GFP_THISNODE);
+	ret = vmemmap_remap_alloc(vmemmap_start, vmemmap_end, vmemmap_reuse);
 	if (!ret) {
 		ClearHPageVmemmapOptimized(head);
 		static_branch_dec(&hugetlb_optimize_vmemmap_key);