| Message ID | <1698669590-3193-1-git-send-email-quic_charante@quicinc.com> |
|---|---|
| State | New |
| Headers | From: Charan Teja Kalla <quic_charante@quicinc.com>; To: akpm@linux-foundation.org, mgorman@techsingularity.net, mhocko@suse.com, david@redhat.com, vbabka@suse.cz; Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org; Date: Mon, 30 Oct 2023 18:09:50 +0530 |
| Series | mm: page_alloc: unreserve highatomic page blocks before oom |
Commit Message
Charan Teja Kalla
Oct. 30, 2023, 12:39 p.m. UTC
__alloc_pages_direct_reclaim() is called from the allocation slow path, where
the high atomic reserves can be unreserved after reclaim makes progress and
yet no suitable page is found. Later, should_reclaim_retry() is called from
the slow path to decide whether reclaim should be retried before the OOM kill
path is taken.
should_reclaim_retry() checks the available memory (reclaimable + free pages)
against the min watermark of a zone and returns:
a) true, if it is above the min watermark, so the slow path retries reclaim;
b) false, so the slow path takes the OOM kill path.
should_reclaim_retry() can also unreserve the high atomic reserves,
**but only after all the reclaim retries are exhausted.**
When there is almost no reclaimable memory and the free pages consist mostly
of high atomic reserves that the allocation context cannot use, the available
memory falls below the min watermark, so should_reclaim_retry() returns false
and the allocation request takes the OOM kill path. This is an early OOM
kill, because the high atomic reserves are holding a lot of free memory and
unreserving them is never attempted.
The (early) OOM was encountered on a machine in the state below (excerpt from
the oom kill logs):
[ 295.998653] Normal free:7728kB boost:0kB min:804kB low:1004kB
high:1204kB reserved_highatomic:8192KB active_anon:4kB inactive_anon:0kB
active_file:24kB inactive_file:24kB unevictable:1220kB writepending:0kB
present:70732kB managed:49224kB mlocked:0kB bounce:0kB free_pcp:688kB
local_pcp:492kB free_cma:0kB
[ 295.998656] lowmem_reserve[]: 0 32
[ 295.998659] Normal: 508*4kB (UMEH) 241*8kB (UMEH) 143*16kB (UMEH)
33*32kB (UH) 7*64kB (UH) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB
0*4096kB = 7752kB
Per the above log, ~7 MB of free memory (the buddy lists sum to 7752 kB) sits
in the high atomic reserves and is not freed up before falling back to the
OOM kill path.
This fix unreserves the high atomic reserves in the OOM path before going for
a kill. The side effect of unreserving in the OOM kill path is that the freed
pages are checked against the high watermark; if they are unreserved from
should_reclaim_retry()/__alloc_pages_direct_reclaim(), they are checked
against the min watermark instead.
Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
---
mm/page_alloc.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
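
To make the watermark arithmetic in the commit message above concrete, here is
a small, self-contained userspace model of the early-OOM decision. It is NOT
the kernel's should_reclaim_retry(); the struct, function names and the
reclaimable estimate are illustrative assumptions, while the numbers (in kB)
are taken from the quoted zone report:

#include <stdio.h>
#include <stdbool.h>

/* Hypothetical per-zone snapshot, values in kB (from the oom log above). */
struct zone_model {
	long free_kb;                /* total free memory                    */
	long reserved_highatomic_kb; /* portion held in highatomic reserves  */
	long reclaimable_kb;         /* roughly active_file + inactive_file  */
	long min_wmark_kb;           /* min watermark                        */
};

/* An allocation that cannot dip into the highatomic reserve effectively
 * sees only free - reserved when the watermark check is performed. */
static bool worth_retrying(const struct zone_model *z, bool can_use_highatomic)
{
	long usable_free = z->free_kb;

	if (!can_use_highatomic) {
		usable_free -= z->reserved_highatomic_kb;
		if (usable_free < 0)
			usable_free = 0;
	}

	return usable_free + z->reclaimable_kb > z->min_wmark_kb;
}

int main(void)
{
	struct zone_model z = {
		.free_kb = 7728,
		.reserved_highatomic_kb = 8192,
		.reclaimable_kb = 48,
		.min_wmark_kb = 804,
	};

	/* Reserve still held and unusable: 0 + 48 <= 804 -> early OOM. */
	printf("retry with reserve held:  %s\n",
	       worth_retrying(&z, false) ? "yes" : "no (-> OOM)");

	/* After unreserve_highatomic_pageblock() the pages become usable:
	 * 7728 + 48 > 804 -> reclaim/allocation is worth retrying. */
	z.reserved_highatomic_kb = 0;
	printf("retry after unreserving:  %s\n",
	       worth_retrying(&z, true) ? "yes" : "no (-> OOM)");

	return 0;
}

With the reserve held, usable free memory is effectively zero and the model
reports "no (-> OOM)"; after unreserving, the same 7728 kB clears the 804 kB
min watermark easily, which is exactly the window the patch exploits before
declaring OOM.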
Comments
On Mon, Oct 30, 2023 at 06:09:50PM +0530, Charan Teja Kalla wrote:
> [...]

Thanks for the detailed commit description. Really helpful in understanding
the problem you are fixing.

> +	/*
> +	 * If should_reclaim_retry() encounters a state where:
> +	 * reclaimable + free doesn't satisfy the wmark levels,
> +	 * it can directly jump to OOM without even unreserving
> +	 * the highatomic page blocks. Try them for once here
> +	 * before jumping to OOM.
> +	 */
> +retry:
> +	unreserve_highatomic_pageblock(ac, true);
> +

Not possible to fix this in should_reclaim_retry()?
On Mon 30-10-23 18:09:50, Charan Teja Kalla wrote:
> [...]
> This is an early OOM kill, because the high atomic reserves are holding a
> lot of free memory and unreserving them is never attempted.

OK, I see. So we do not release those reserved pages because OOM hits too
early.

> [ 295.998653] Normal free:7728kB boost:0kB min:804kB low:1004kB
> high:1204kB reserved_highatomic:8192KB [...]

OK, this is quite interesting as well. The system is really tiny and 8MB of
reserved memory is indeed really high. How come those reservations have grown
that high?

> This fix unreserves the high atomic reserves in the OOM path before going
> for a kill. [...]

I do not like the fix much TBH. I think the logic should live in
should_reclaim_retry. One way to approach it is to unreserve at the end of
the function, something like this:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 95546f376302..d04e14adf2c5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3813,10 +3813,8 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 	 * Make sure we converge to OOM if we cannot make any progress
 	 * several times in the row.
 	 */
-	if (*no_progress_loops > MAX_RECLAIM_RETRIES) {
-		/* Before OOM, exhaust highatomic_reserve */
-		return unreserve_highatomic_pageblock(ac, true);
-	}
+	if (*no_progress_loops > MAX_RECLAIM_RETRIES)
+		goto out;
 
 	/*
 	 * Keep reclaiming pages while there is a chance this will lead
@@ -3859,6 +3857,12 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 			schedule_timeout_uninterruptible(1);
 		else
 			cond_resched();
+
+out:
+	/* Before OOM, exhaust highatomic_reserve */
+	if (!ret)
+		return unreserve_highatomic_pageblock(ac, true);
+
 	return ret;
 }
Thanks Michal/Pavan!!

On 10/31/2023 1:44 PM, Michal Hocko wrote:
> OK, this is quite interesting as well. The system is really tiny and 8MB of
> reserved memory is indeed really high. How come those reservations have
> grown that high?

Actually it is a VM running on the Linux kernel.

Regarding the reservations, I think it is because of the 'max_managed'
calculation below:

static void reserve_highatomic_pageblock(struct page *page, ....)
{
	....
	/*
	 * Limit the number reserved to 1 pageblock or roughly 1% of a zone.
	 * Check is race-prone but harmless.
	 */
	max_managed = (zone_managed_pages(zone) / 100) + pageblock_nr_pages;

	if (zone->nr_reserved_highatomic >= max_managed)
		goto out;

	zone->nr_reserved_highatomic += pageblock_nr_pages;
	set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
	move_freepages_block(zone, page, MIGRATE_HIGHATOMIC, NULL);
out:
}

Since we always add pageblock_nr_pages on top of 1% of the zone's managed
pages, and 'nr_reserved_highatomic' is incremented/decremented in
pageblock-sized granules, the reserve can grow to a minimum of 2 pageblocks.
In my case the 8M out of ~50M turns out to be 16%, which is high.

If the below looks fine to you, I can raise this as a separate change:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2a2536d..41441ced 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1886,7 +1886,9 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
 	 * Limit the number reserved to 1 pageblock or roughly 1% of a zone.
 	 * Check is race-prone but harmless.
 	 */
-	max_managed = (zone_managed_pages(zone) / 100) + pageblock_nr_pages;
+	max_managed = max_t(unsigned long,
+			ALIGN(zone_managed_pages(zone) / 100, pageblock_nr_pages),
+			pageblock_nr_pages);
 	if (zone->nr_reserved_highatomic >= max_managed)
 		return;

> I do not like the fix much TBH. I think the logic should live in
> should_reclaim_retry.

Yeah, this code looks way cleaner to me. Let me know if I can raise V2 with
the below, suggested-by you.

I think another thing the system is missing here is draining the pcp lists:
min:804kB low:1004kB high:1204kB free_pcp:688kB

IIUC, the pcp pages are drained in the reclaim path as below. In this case,
when did_some_progress = 0, the pcp drain is also skipped:

struct page *__alloc_pages_direct_reclaim()
{
	.....
	*did_some_progress = __perform_reclaim(gfp_mask, order, ac);
	if (unlikely(!(*did_some_progress)))
		goto out;
retry:
	page = get_page_from_freelist();
	if (!page && !drained) {
		drain_all_pages(NULL);
		drained = true;
		goto retry;
	}
out:
}

So, how about extending your code for this case, assuming that
did_some_progress > 0 means the draining was perhaps already done in
__alloc_pages_direct_reclaim():

out:
	if (!ret) {
		ret = unreserve_highatomic_pageblock(ac, true);
		drain_all_pages(NULL);
	}

	return ret;

Please suggest if the above doesn't make sense. If it looks good, I will
raise a separate patch for this condition.
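
To spell out the arithmetic in the reply above (assuming 4 kB pages and
1024-page/4 MB pageblocks, which is consistent with the 8192 kB reserve in
the log): the zone manages 49224 kB = 12306 pages, so 1% is ~123 pages and
the current formula gives max_managed = 123 + 1024 = 1147 pages. Because
nr_reserved_highatomic grows only in whole pageblocks, one reserved block
(1024 pages) is still below 1147, so a second block can be reserved, giving
2048 pages = 8192 kB, roughly 16% of the zone. With the proposed ALIGN()
change, max_managed = ALIGN(123, 1024) = 1024 pages, capping the reserve at a
single 4096 kB pageblock.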
On Tue 31-10-23 18:43:55, Charan Teja Kalla wrote:
> [...]
> In my case the 8M out of ~50M turns out to be 16%, which is high.
>
> If the below looks fine to you, I can raise this as a separate change:

Yes, please. Having a full page block (4MB) sounds still too much for such a
tiny system. Maybe there shouldn't be any reservation. But definitely worth a
separate patch.

> Yeah, this code looks way cleaner to me. Let me know if I can raise V2 with
> the below, suggested-by you.

Sure, go ahead.

> I think another thing the system is missing here is draining the pcp lists:
> min:804kB low:1004kB high:1204kB free_pcp:688kB

Yes, but this seems like negligible even under a small system like that. Does
it actually help to keep system in balance? I would expect that the OOM is
just imminent no matter the draining. Anyway if this makes any difference
then just make it a separate patch please.
On Tue, Oct 31, 2023 at 06:43:55PM +0530, Charan Teja Kalla wrote:
> If the below looks fine to you, I can raise this as a separate change:
>
> -	max_managed = (zone_managed_pages(zone) / 100) + pageblock_nr_pages;
> +	max_managed = max_t(unsigned long,
> +			ALIGN(zone_managed_pages(zone) / 100, pageblock_nr_pages),
> +			pageblock_nr_pages);
> 	if (zone->nr_reserved_highatomic >= max_managed)
> 		return;

ALIGN() rounds up the value, so max_t() is not needed here. If you had used
ALIGN_DOWN() then max_t() can be used to keep at least pageblock_nr_pages
pages.

> Yeah, this code looks way cleaner to me. Let me know if I can raise V2 with
> the below, suggested-by you.

Also, add the below Fixes tag if it makes sense.

Fixes: 04c8716f7b00 ("mm: try to exhaust highatomic reserve before the OOM")

Thanks,
Pavan
On 11/1/2023 12:16 PM, Pavan Kondeti wrote:
> ALIGN() rounds up the value, so max_t() is not needed here. If you had used
> ALIGN_DOWN() then max_t() can be used to keep at least pageblock_nr_pages
> pages.

Yeah, just ALIGN() is enough here.

> Also, add the below Fixes tag if it makes sense.
>
> Fixes: 04c8716f7b00 ("mm: try to exhaust highatomic reserve before the OOM")

I should be adding this.
On Wed 01-11-23 12:23:24, Charan Teja Kalla wrote:
[...]
> > Also, add the below Fixes tag if it makes sense.
> >
> > Fixes: 04c8716f7b00 ("mm: try to exhaust highatomic reserve before the OOM")
>
> I should be adding this.

I do not think this Fixes tag is really correct. 04c8716f7b00 was rather an
incomplete fix than something that has caused this situation. I think we
would need to reference the commit which adds highatomic reserves.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 95546f3..2a2536d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3281,6 +3281,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		.order = order,
 	};
 	struct page *page;
+	struct zone *zone;
+	struct zoneref *z;
 
 	*did_some_progress = 0;
 
@@ -3295,6 +3297,16 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	}
 
 	/*
+	 * If should_reclaim_retry() encounters a state where:
+	 * reclaimable + free doesn't satisfy the wmark levels,
+	 * it can directly jump to OOM without even unreserving
+	 * the highatomic page blocks. Try them for once here
+	 * before jumping to OOM.
+	 */
+retry:
+	unreserve_highatomic_pageblock(ac, true);
+
+	/*
 	 * Go through the zonelist yet one more time, keep very high watermark
 	 * here, this is only to catch a parallel oom killing, we must fail if
 	 * we're still under heavy pressure. But make sure that this reclaim
@@ -3307,6 +3319,12 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	if (page)
 		goto out;
 
+	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->highest_zoneidx,
+					ac->nodemask) {
+		if (zone->nr_reserved_highatomic > 0)
+			goto retry;
+	}
+
 	/* Coredumps can quickly deplete all memory reserves */
 	if (current->flags & PF_DUMPCORE)
 		goto out;