Message ID | a8e16f7eb295e1843f8edaa1ae1c68325c54c896.1699104759.git.quic_charante@quicinc.com |
---|---|
State | New |
Subject | [PATCH V3 3/3] mm: page_alloc: drain pcp lists before oom kill |
Series | mm: page_alloc: fixes for early oom kills |
Commit Message
Charan Teja Kalla
Nov. 5, 2023, 12:50 p.m. UTC
pcp lists are drained from __alloc_pages_direct_reclaim() only if some
progress was made in the reclaim attempt:
struct page *__alloc_pages_direct_reclaim()
{
	bool drained = false;
	.....
	*did_some_progress = __perform_reclaim(gfp_mask, order, ac);
	if (unlikely(!(*did_some_progress)))
		goto out;
retry:
	page = get_page_from_freelist();
	if (!page && !drained) {
		drain_all_pages(NULL);
		drained = true;
		goto retry;
	}
out:
	.....
}
After the above, the allocation attempt can fall back to
should_reclaim_retry() to decide on reclaim retries. If that too returns
false, the allocation request simply falls back to the OOM kill path
without even attempting to drain the pcp pages, which might have helped
the allocation attempt to succeed.
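For context, here is a condensed, illustrative sketch of where
should_reclaim_retry() sits in the slowpath (paraphrased from
mm/page_alloc.c, heavily simplified and not verbatim):

	/* inside __alloc_pages_slowpath(), condensed sketch */
retry:
	/* drains pcp lists, but only when reclaim made progress */
	page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags,
					    ac, &did_some_progress);
	if (page)
		goto got_pg;
	.....
	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
				 did_some_progress > 0, &no_progress_loops))
		goto retry;

	/* reclaim retries exhausted: the OOM killer is next */
	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);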
A VM running with ~50MB of memory showed the below stats during an OOM
kill:
Normal free:760kB boost:0kB min:768kB low:960kB high:1152kB
reserved_highatomic:0KB managed:49152kB free_pcp:460kB
Though an OOM kill is imminent in such a system state, the current kill
could have been delayed by draining the pcp lists: pcp + free (460kB +
760kB = 1220kB) is even above the 1152kB high watermark.
Fix this missing drain of the pcp lists in should_reclaim_retry(),
alongside unreserving the high atomic page blocks, as is already done in
__alloc_pages_direct_reclaim().
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
---
mm/page_alloc.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
Comments
Sorry, this is supposed to be PATCH V2 in place of V3. Not sure if I
have to resend it as V2 again.
On Sun 05-11-23 18:20:50, Charan Teja Kalla wrote:
[...]
> A VM running with ~50MB of memory showed the below stats during an
> OOM kill:
> Normal free:760kB boost:0kB min:768kB low:960kB high:1152kB
> reserved_highatomic:0KB managed:49152kB free_pcp:460kB
>
> Though an OOM kill is imminent in such a system state, the current
> kill could have been delayed by draining the pcp lists, as pcp + free
> is even above the high watermark.

TBH I am not sure this is really worth it. Does it really reduce the
risk of the OOM in any practical situation?
Thanks Michal!!

On 11/9/2023 4:03 PM, Michal Hocko wrote:
> TBH I am not sure this is really worth it. Does it really reduce the
> risk of the OOM in any practical situation?

At least in my particular stress test case it just delayed the OOM, as
I can see that at the time of the OOM kill there are no free pcp pages.
My understanding of the OOM kill is that it should be the last resort,
and only after doing enough reclaim retries. CMIW here.

This patch just aims to not miss the corner case where we hit the OOM
without draining the pcp lists. After draining, some systems may not
need the OOM kill and some may still need it. My case is the latter, so
I am really not sure if we ever encountered/noticed the former case
here.
On Fri 10-11-23 22:06:22, Charan Teja Kalla wrote:
> At least in my particular stress test case it just delayed the OOM,
> as I can see that at the time of the OOM kill there are no free pcp
> pages. My understanding of the OOM kill is that it should be the last
> resort, and only after doing enough reclaim retries. CMIW here.

Yes, it is a last resort, but it is a heuristic as well. So the real
question is whether this makes any practical difference outside of
artificial workloads. I do not see anything particularly worrying about
draining the pcp cache, but it should be noted that this won't be 100%
either, as racing freeing of memory will end up on pcp lists first.
Thanks Michal!!

On 11/14/2023 4:18 PM, Michal Hocko wrote:
> Yes, it is a last resort, but it is a heuristic as well. So the real
> question is whether this makes any practical difference outside of
> artificial workloads. I do not see anything particularly worrying
> about draining the pcp cache, but it should be noted that this won't
> be 100% either, as racing freeing of memory will end up on pcp lists
> first.

Okay, I don't have any practical scenario where this helped me in
avoiding the OOM kill. I will come back if I ever encounter this issue
in a practical scenario.

Also, if you have any comments on [PATCH V2 2/3] mm: page_alloc:
correct high atomic reserve calculations, that will help me. Thanks.
On Tue 14-11-23 22:06:45, Charan Teja Kalla wrote:
> Also, if you have any comments on [PATCH V2 2/3] mm: page_alloc:
> correct high atomic reserve calculations, that will help me.

I do not have a strong opinion on that one, to be honest. I am not even
sure that reserving a full page block (4MB) on small systems as
presented is really a good use of memory.
Thanks Michal.

On 11/15/2023 7:39 PM, Michal Hocko wrote:
> I do not have a strong opinion on that one, to be honest. I am not
> even sure that reserving a full page block (4MB) on small systems as
> presented is really a good use of memory.

Maybe another way to look at that patch is that the comment is really
not reflected in the code. It says, "Limit the number reserved to 1
pageblock or roughly 1% of a zone.", but the current code is making it
2 pageblocks, so for a 4M block size it is > 1%.

A second patch that I will post could avoid reserving the high atomic
page blocks on small systems -- but how to define "small systems" is
unclear. Instead, we could let system administrators choose this
through either:
a) a command line param, high_atomic_reserves=off, on by default --
another knob, so admins may really not like this?
b) CONFIG_HIGH_ATOMIC_RESERVES, which, if not defined, will not
reserve.

Please lmk if you have any more suggestions here?

Also, I am thinking to request Andrew to pick the [PATCH V2 1/3] patch
and take these discussions separately in a separate thread.

Thanks,
Charan
On Thu 16-11-23 11:30:04, Charan Teja Kalla wrote:
> A second patch that I will post could avoid reserving the high atomic
> page blocks on small systems -- but how to define "small systems" is
> unclear. Instead, we could let system administrators choose this
> through either:
> a) a command line param, high_atomic_reserves=off, on by default --
> another knob, so admins may really not like this?
> b) CONFIG_HIGH_ATOMIC_RESERVES, which, if not defined, will not
> reserve.

Please don't! I do not see any admin wanting to care about this at all.
It just takes a lot of understanding of internal MM stuff to make an
educated guess. This should really be auto-tuned. And as responded in
the other reply, my take would be to reserve a page block only if it
doesn't consume more than 1% of memory, to preserve the existing
behavior yet not overconsume on small systems.

> Also, I am thinking to request Andrew to pick the [PATCH V2 1/3]
> patch and take these discussions separately in a separate thread.

That makes sense, as that is a clear bug fix.
Thanks Michal!!

On 11/16/2023 6:25 PM, Michal Hocko wrote:
> Please don't! I do not see any admin wanting to care about this at
> all. It just takes a lot of understanding of internal MM stuff to
> make an educated guess. This should really be auto-tuned. And as
> responded in the other reply, my take would be to reserve a page
> block only if it doesn't consume more than 1% of memory, to preserve
> the existing behavior yet not overconsume on small systems.

This idea of auto-tuning, by reserving a pageblock only if it doesn't
consume more than 1% of memory, seems cleaner to me. For a pageblock
size of 4MB, this means a zone needs at least 400MB of RAM before a
pageblock is reserved. If that is fine, I can post a patch with
Suggested-by: you.
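A minimal sketch of that auto-tuning condition, for illustration only
(highatomic_reserve_allowed() is a hypothetical helper, not an upstream
function; only the 1% threshold comes from the discussion above):

	/*
	 * Illustrative sketch: allow a highatomic pageblock reservation
	 * only when one pageblock is at most ~1% of the zone's managed
	 * pages, so small zones never give up a disproportionate chunk.
	 */
	static bool highatomic_reserve_allowed(struct zone *zone)
	{
		return zone_managed_pages(zone) >= 100 * pageblock_nr_pages;
	}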
Thanks for the patch, Charan, and thanks to Yosry for pointing me
towards it.

I took a look at data from our fleet, and there are many cases on
high-cpu-count machines where we find multi-GiB worth of data sitting
on pcpu free lists at the time of system oom-kill, when free memory
for the relevant zones is below min watermarks. I.e., clear cases
where this patch could have prevented OOM.

This kind of issue scales with the number of cpus, so presumably this
patch will only become increasingly valuable to both datacenters and
desktops alike going forward. Can we revamp it as a standalone patch?

Thanks,
Zach
Hi Michal/Zach,

On 1/25/2024 10:06 PM, Zach O'Keefe wrote:
> I took a look at data from our fleet, and there are many cases on
> high-cpu-count machines where we find multi-GiB worth of data sitting
> on pcpu free lists at the time of system oom-kill, when free memory
> for the relevant zones is below min watermarks. I.e., clear cases
> where this patch could have prevented OOM.

Glad to see a real-world use case for this. We too have observed OOM
kills every now and then with a relatively significant PCP cache, but
in all such cases the OOM kill was imminent. AFAICS, your use case
description reads like a premature OOM scenario despite a lot of free
memory sitting on the pcp lists, where this patch should've helped.

@Michal: This use case seems to be the practical scenario that you were
asking about below. The other concern, of racing freeing of memory
ending up in pcp lists first -- will that be such a big issue? This
patch enables draining the current pcp lists, which can now avoid the
OOM kill altogether. If this racing free is a major concern, should
that be taken as a separate discussion?

Will revamp this as a separate patch if there are no more concerns
here.
On Fri 26-01-24 16:17:04, Charan Teja Kalla wrote:
> On 1/25/2024 10:06 PM, Zach O'Keefe wrote:
> > This kind of issue scales with the number of cpus, so presumably
> > this patch will only become increasingly valuable to both
> > datacenters and desktops alike going forward. Can we revamp it as a
> > standalone patch?

Do you have any example OOM reports? There were recent changes to scale
the pcp pages and it would be good to know whether they work reasonably
well even under memory pressure.

I am not objecting to the patch discussed here, but it would be really
good to understand the underlying problem and the scale of it.

Thanks!
Hey Michal,

> Do you have any example OOM reports? [..]

Sure, here is one on a 1TiB, 128-physical-core machine running a
5.10-based kernel:

---8<---
mytask invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
<...>
oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=sdc,mems_allowed=0-1,global_oom,task_memcg=/sdc,task=mytask,pid=835214,uid=0
Out of memory: Killed process 835214 (mytask) total-vm:787716604kB, anon-rss:787536152kB, file-rss:64kB, shmem-rss:0kB, UID:0 pgtables:1541224kB oom_score_adj:0, hugetlb-usage:0kB
Mem-Info:
active_anon:320 inactive_anon:198083493 isolated_anon:0
 active_file:128283 inactive_file:290086 isolated_file:0
 unevictable:3525 dirty:15 writeback:0
 slab_reclaimable:35505 slab_unreclaimable:272917
 mapped:46414 shmem:822 pagetables:64085088 sec_pagetables:0
 bounce:0 kernel_misc_reclaimable:0
 free:325793 free_pcp:263277 free_cma:0
Node 0 active_anon:1112kB inactive_anon:268172556kB active_file:270992kB inactive_file:254612kB unevictable:12404kB isolated(anon):0kB isolated(file):0kB mapped:147240kB dirty:52kB writeback:0kB shmem:304kB shmem_thp:0kB shmem_pmdmapped:0kB anon_thp:1310720kB writeback_tmp:0kB kernel_stack:32000kB pagetables:255483108kB sec_pagetables:0kB all_unreclaimable? yes
Node 1 active_anon:168kB inactive_anon:524161416kB active_file:242140kB inactive_file:905732kB unevictable:1696kB isolated(anon):0kB isolated(file):0kB mapped:38416kB dirty:8kB writeback:0kB shmem:2984kB shmem_thp:0kB shmem_pmdmapped:0kB anon_thp:267732992kB writeback_tmp:0kB kernel_stack:8520kB pagetables:857244kB sec_pagetables:0kB all_unreclaimable? yes
Node 0 Crash free:72kB min:108kB low:220kB high:332kB reserved_highatomic:0KB active_anon:0kB inactive_anon:111940kB active_file:280kB inactive_file:316kB unevictable:0kB writepending:4kB present:114284kB managed:114196kB mlocked:0kB bounce:0kB free_pcp:1528kB local_pcp:24kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0
Node 0 DMA32 free:66592kB min:2580kB low:5220kB high:7860kB reserved_highatomic:0KB active_anon:8kB inactive_anon:19456kB active_file:4kB inactive_file:224kB unevictable:0kB writepending:0kB present:2643512kB managed:2643512kB mlocked:0kB bounce:0kB free_pcp:8040kB local_pcp:244kB free_cma:0kB
lowmem_reserve[]: 0 0 16029 16029
Node 0 Normal free:513048kB min:513192kB low:1038700kB high:1564208kB reserved_highatomic:0KB active_anon:1104kB inactive_anon:268040520kB active_file:270708kB inactive_file:254072kB unevictable:12404kB writepending:48kB present:533969920kB managed:525510968kB mlocked:12344kB bounce:0kB free_pcp:790040kB local_pcp:7060kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0
Node 1 Normal free:723460kB min:755656kB low:1284080kB high:1812504kB reserved_highatomic:0KB active_anon:168kB inactive_anon:524161416kB active_file:242140kB inactive_file:905732kB unevictable:1696kB writepending:8kB present:536866816kB managed:528427664kB mlocked:1588kB bounce:0kB free_pcp:253500kB local_pcp:12kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0
Node 0 Crash: 0*4kB 0*8kB 1*16kB (M) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 16kB
Node 0 DMA32: 80*4kB (UME) 74*8kB (UE) 23*16kB (UME) 21*32kB (UME) 40*64kB (UE) 35*128kB (UME) 3*256kB (UE) 9*512kB (UME) 13*1024kB (UM) 19*2048kB (UME) 0*4096kB = 66592kB
Node 0 Normal: 1999*4kB (UE) 259*8kB (UM) 465*16kB (UM) 114*32kB (UE) 54*64kB (UME) 14*128kB (U) 74*256kB (UME) 128*512kB (UE) 96*1024kB (U) 56*2048kB (U) 46*4096kB (U) = 512292kB
Node 1 Normal: 2280*4kB (UM) 12667*8kB (UM) 8859*16kB (UME) 5221*32kB (UME) 1631*64kB (UME) 899*128kB (UM) 330*256kB (UME) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 723208kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
420675 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 268435456kB
Total swap = 268435456kB
---8<---

Node 0/1 Normal free memory is below the respective min watermarks,
with 790040kB + 253500kB ~= 1GiB of memory on pcp lists. With this
patch, the GFP_HIGHUSER_MOVABLE + unrestricted mems_allowed allocation
would have been able to access all that memory, very likely avoiding
the oom.

> [..] There were recent changes to scale the pcp pages and it would be
> good to know whether they work reasonably well even under memory
> pressure.

I'm not familiar with these changes, but a quick check of recent
activity points to v6.7 commit fa8c4f9a665b ("mm: fix draining remote
pageset"); is this what you are referring to?

Thanks, and have a great day,
Zach
On Fri 26-01-24 14:51:26, Zach O'Keefe wrote:
[...]
> Node 0 DMA32 free:66592kB min:2580kB low:5220kB high:7860kB
[...]
> free_pcp:8040kB local_pcp:244kB free_cma:0kB
> lowmem_reserve[]: 0 0 16029 16029
> Node 0 Normal free:513048kB min:513192kB low:1038700kB high:1564208kB
[...]
> mlocked:12344kB bounce:0kB free_pcp:790040kB local_pcp:7060kB
[...]
> mlocked:1588kB bounce:0kB free_pcp:253500kB local_pcp:12kB
[...]
> I'm not familiar with these changes, but a quick check of recent
> activity points to v6.7 commit fa8c4f9a665b ("mm: fix draining remote
> pageset"); is this what you are referring to?

No, but looking at the above discrepancy between free_pcp and local_pcp
would point in that direction for sure. So this is worth checking.
vmstat is a periodic activity and it cannot really deal with bursts of
memory allocations, but it is quite possible that the patch above will
prevent the build-up before it grows that large.

I originally referred to different work, though:
https://lore.kernel.org/all/20231016053002.756205-10-ying.huang@intel.com/T/#m9fdfabaee37db1320bbc678a69d1cdd8391640e0
merged as ca71fe1ad922 ("mm, pcp: avoid to drain PCP when process
exit") and the associated patches.
On Mon, Jan 29, 2024 at 7:04 AM Michal Hocko <mhocko@suse.com> wrote:
> No, but looking at the above discrepancy between free_pcp and
> local_pcp would point in that direction for sure. So this is worth
> checking. [...]
>
> I originally referred to different work, though:
> https://lore.kernel.org/all/20231016053002.756205-10-ying.huang@intel.com/T/#m9fdfabaee37db1320bbc678a69d1cdd8391640e0
> merged as ca71fe1ad922 ("mm, pcp: avoid to drain PCP when process
> exit") and the associated patches.

Thanks for the response, Michal, and also thank you for the reference
here. It'll take me a bit to evaluate how these patches might have
helped, and if draining pcpu would have added anything on top.

At present, that might take me a bit to get to, but I just wanted to
thank you for your response, and to leave this discussion, for the
moment, with the ball in my court to return with findings.

Thanks,
Zach
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b91c99e..8eee292 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3857,8 +3857,10 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 	cond_resched();
 out:
 	/* Before OOM, exhaust highatomic_reserve */
-	if (!ret)
-		return unreserve_highatomic_pageblock(ac, true);
+	if (!ret) {
+		ret = unreserve_highatomic_pageblock(ac, true);
+		drain_all_pages(NULL);
+	}
 
 	return ret;
 }
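For readers new to pcp lists, here is a toy userspace C model of the
mechanism the patch relies on. It is illustrative only: the names below
are made up and only loosely mirror NR_FREE_PAGES, pcp->count and
drain_all_pages(NULL). The point is that pages freed to a per-CPU cache
are not credited to the global free counter (which is what watermark
checks consult) until the cache overflows or is explicitly drained.

	#include <stdio.h>

	#define NCPU 4
	#define PCP_HIGH 8

	static unsigned long global_free;     /* loosely: NR_FREE_PAGES */
	static unsigned long pcp_count[NCPU]; /* loosely: pcp->count */

	/* A free lands in the per-CPU cache; only an overflow spills
	 * pages back to the "buddy" and credits the global counter. */
	static void free_page_to_pcp(int cpu)
	{
		if (++pcp_count[cpu] >= PCP_HIGH) {
			global_free += pcp_count[cpu];
			pcp_count[cpu] = 0;
		}
	}

	/* Loosely: drain_all_pages(NULL) */
	static void drain_all(void)
	{
		for (int cpu = 0; cpu < NCPU; cpu++) {
			global_free += pcp_count[cpu];
			pcp_count[cpu] = 0;
		}
	}

	int main(void)
	{
		/* 5 frees per CPU: each cache stays below PCP_HIGH, so
		 * nothing is visible to the global counter yet. */
		for (int cpu = 0; cpu < NCPU; cpu++)
			for (int i = 0; i < 5; i++)
				free_page_to_pcp(cpu);

		printf("free before drain: %lu (cached: %d)\n",
		       global_free, NCPU * 5);
		drain_all();
		printf("free after drain:  %lu\n", global_free);
		return 0;
	}

Before the drain the model reports 0 free pages despite 20 sitting in
per-CPU caches; after the drain all 20 are visible, which is exactly
the gap between free: and free_pcp: in the OOM reports above.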