Message ID | 20231113233420.446465795@redhat.com |
---|---|
Headers |
Message-ID: <20231113233420.446465795@redhat.com> User-Agent: quilt/0.67 Date: Mon, 13 Nov 2023 20:34:20 -0300 From: Marcelo Tosatti <mtosatti@redhat.com> To: linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Michal Hocko <mhocko@suse.com>, Vlastimil Babka <vbabka@suse.cz>, Andrew Morton <akpm@linux-foundation.org>, David Hildenbrand <david@redhat.com>, Peter Xu <peterx@redhat.com> Subject: [patch 0/2] mm: too_many_isolated can stall due to out of sync VM counters Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org |
Series |
mm: too_many_isolated can stall due to out of sync VM counters
|
|
Message
Marcelo Tosatti
Nov. 13, 2023, 11:34 p.m. UTC
A customer reported seeing processes hung at too_many_isolated,
while analysis indicated that the problem occurred due to out
of sync per-CPU stats (see below).

Fix is to use node_page_state_snapshot to avoid the stale values.

2136 static unsigned long
2137 shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
2138                      struct scan_control *sc, enum lru_list lru)
2139 {
   :
2145         bool file = is_file_lru(lru);
   :
2147         struct pglist_data *pgdat = lruvec_pgdat(lruvec);
   :
2150         while (unlikely(too_many_isolated(pgdat, file, sc))) {
2151                 if (stalled)
2152                         return 0;
2153
2154                 /* wait a bit for the reclaimer. */
2155                 msleep(100);   <--- some processes were sleeping here, with pending SIGKILL.
2156                 stalled = true;
2157
2158                 /* We are about to die and free our memory. Return now. */
2159                 if (fatal_signal_pending(current))
2160                         return SWAP_CLUSTER_MAX;
2161         }

msleep() must be called only when there are too many isolated pages:

2019 static int too_many_isolated(struct pglist_data *pgdat, int file,
2020                              struct scan_control *sc)
2021 {
   :
2030         if (file) {
2031                 inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
2032                 isolated = node_page_state(pgdat, NR_ISOLATED_FILE);
2033         } else {
   :
2046         return isolated > inactive;

The return value was true since:

crash> p ((struct pglist_data *) 0xffff00817fffe580)->vm_stat[NR_INACTIVE_FILE]
$8 = {
  counter = 1
}
crash> p ((struct pglist_data *) 0xffff00817fffe580)->vm_stat[NR_ISOLATED_FILE]
$9 = {
  counter = 2

while per_cpu stats had:

crash> p ((struct pglist_data *) 0xffff00817fffe580)->per_cpu_nodestats
$85 = (struct per_cpu_nodestat *) 0xffff8000118832e0
crash> p/x 0xffff8000118832e0 + __per_cpu_offset[42]
$86 = 0xffff00917fcc32e0
crash> p ((struct per_cpu_nodestat *) 0xffff00917fcc32e0)->vm_node_stat_diff[NR_ISOLATED_FILE]
$87 = -1 '\377'

crash> p/x 0xffff8000118832e0 + __per_cpu_offset[44]
$89 = 0xffff00917fe032e0
crash> p ((struct per_cpu_nodestat *) 0xffff00917fe032e0)->vm_node_stat_diff[NR_ISOLATED_FILE]
$91 = -1 '\377'

It seems that processes were trapped in direct reclaim/compaction loop
because these nodes had few free pages lower than watermark min.

crash> kmem -z | grep -A 3 Normal
  :
NODE: 4  ZONE: 1  ADDR: ffff00817fffec40  NAME: "Normal"
  SIZE: 8454144  PRESENT: 98304  MIN/LOW/HIGH: 68/166/264
  VM_STAT:
    NR_FREE_PAGES: 68
--
NODE: 5  ZONE: 1  ADDR: ffff00897fffec40  NAME: "Normal"
  SIZE: 118784  MIN/LOW/HIGH: 82/200/318
  VM_STAT:
    NR_FREE_PAGES: 45
--
NODE: 6  ZONE: 1  ADDR: ffff00917fffec40  NAME: "Normal"
  SIZE: 118784  MIN/LOW/HIGH: 82/200/318
  VM_STAT:
    NR_FREE_PAGES: 53
--
NODE: 7  ZONE: 1  ADDR: ffff00997fbbec40  NAME: "Normal"
  SIZE: 118784  MIN/LOW/HIGH: 82/200/318
  VM_STAT:
    NR_FREE_PAGES: 52

---
 include/linux/vmstat.h |    4 ++++
 mm/compaction.c        |    6 +++---
 mm/vmscan.c            |    8 ++++----
 mm/vmstat.c            |   28 ++++++++++++++++++++++++++++
 4 files changed, 39 insertions(+), 7 deletions(-)
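The difference between the stale read and a snapshot read can be modelled in user space. This is a minimal sketch, not the patch: it assumes that a snapshot read adds the pending per-CPU deltas to the global counter (the cover letter does not show the implementation), and `struct node_stat_model`, `read_global()`, `read_snapshot()`, `too_many_isolated_model()` and `dump_isolated_file()` are hypothetical names. The values are the ones from the crash output: a global NR_ISOLATED_FILE of 2, deltas of -1 on CPUs 42 and 44, and NR_INACTIVE_FILE of 1.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_CPUS 64	/* assumption: any size that covers CPUs 42 and 44 */

/* Hypothetical userspace stand-in for one node counter and its per-CPU deltas. */
struct node_stat_model {
	long global;				/* models pgdat->vm_stat[item] */
	signed char diff[MODEL_NR_CPUS];	/* models vm_node_stat_diff[item] per CPU */
};

/* Cheap read: looks only at the global counter, so pending deltas are ignored. */
static unsigned long read_global(const struct node_stat_model *s)
{
	return s->global < 0 ? 0 : (unsigned long)s->global;
}

/* Snapshot read: adds every per-CPU delta to the global value, then clamps at 0. */
static unsigned long read_snapshot(const struct node_stat_model *s)
{
	long x = s->global;

	for (size_t cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		x += s->diff[cpu];
	return x < 0 ? 0 : (unsigned long)x;
}

/* Same comparison as the quoted too_many_isolated() excerpt. */
static bool too_many_isolated_model(unsigned long isolated, unsigned long inactive)
{
	return isolated > inactive;
}

/* Builds the NR_ISOLATED_FILE state seen in the dump. */
static struct node_stat_model dump_isolated_file(void)
{
	struct node_stat_model s = { .global = 2 };

	s.diff[42] = -1;
	s.diff[44] = -1;
	return s;
}
```

With the dump values, `read_global()` yields 2, so the check against an inactive count of 1 asks the caller to sleep; `read_snapshot()` yields 0, so the same check does not.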
Comments
On Mon 13-11-23 20:34:20, Marcelo Tosatti wrote:
> A customer reported seeing processes hung at too_many_isolated,
> while analysis indicated that the problem occurred due to out
> of sync per-CPU stats (see below).
>
> Fix is to use node_page_state_snapshot to avoid the out of stale values.
>
> 2136 static unsigned long
> 2137 shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> 2138                      struct scan_control *sc, enum lru_list lru)
> 2139 {
>    :
> 2145         bool file = is_file_lru(lru);
>    :
> 2147         struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>    :
> 2150         while (unlikely(too_many_isolated(pgdat, file, sc))) {
> 2151                 if (stalled)
> 2152                         return 0;
> 2153
> 2154                 /* wait a bit for the reclaimer. */
> 2155                 msleep(100);   <--- some processes were sleeping here, with pending SIGKILL.
> 2156                 stalled = true;
> 2157
> 2158                 /* We are about to die and free our memory. Return now. */
> 2159                 if (fatal_signal_pending(current))
> 2160                         return SWAP_CLUSTER_MAX;
> 2161         }
>
> msleep() must be called only when there are too many isolated pages:

What do you mean here?

> 2019 static int too_many_isolated(struct pglist_data *pgdat, int file,
> 2020                              struct scan_control *sc)
> 2021 {
>    :
> 2030         if (file) {
> 2031                 inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
> 2032                 isolated = node_page_state(pgdat, NR_ISOLATED_FILE);
> 2033         } else {
>    :
> 2046         return isolated > inactive;
>
> The return value was true since:
>
> crash> p ((struct pglist_data *) 0xffff00817fffe580)->vm_stat[NR_INACTIVE_FILE]
> $8 = {
>   counter = 1
> }
> crash> p ((struct pglist_data *) 0xffff00817fffe580)->vm_stat[NR_ISOLATED_FILE]
> $9 = {
>   counter = 2
>
> while per_cpu stats had:
>
> crash> p ((struct pglist_data *) 0xffff00817fffe580)->per_cpu_nodestats
> $85 = (struct per_cpu_nodestat *) 0xffff8000118832e0
> crash> p/x 0xffff8000118832e0 + __per_cpu_offset[42]
> $86 = 0xffff00917fcc32e0
> crash> p ((struct per_cpu_nodestat *) 0xffff00917fcc32e0)->vm_node_stat_diff[NR_ISOLATED_FILE]
> $87 = -1 '\377'
>
> crash> p/x 0xffff8000118832e0 + __per_cpu_offset[44]
> $89 = 0xffff00917fe032e0
> crash> p ((struct per_cpu_nodestat *) 0xffff00917fe032e0)->vm_node_stat_diff[NR_ISOLATED_FILE]
> $91 = -1 '\377'

This doesn't really tell much. How far out of sync are they really,
cumulatively over all CPUs?

> It seems that processes were trapped in direct reclaim/compaction loop
> because these nodes had few free pages lower than watermark min.
>
> crash> kmem -z | grep -A 3 Normal
>   :
> NODE: 4  ZONE: 1  ADDR: ffff00817fffec40  NAME: "Normal"
>   SIZE: 8454144  PRESENT: 98304  MIN/LOW/HIGH: 68/166/264
>   VM_STAT:
>     NR_FREE_PAGES: 68
> --
> NODE: 5  ZONE: 1  ADDR: ffff00897fffec40  NAME: "Normal"
>   SIZE: 118784  MIN/LOW/HIGH: 82/200/318
>   VM_STAT:
>     NR_FREE_PAGES: 45
> --
> NODE: 6  ZONE: 1  ADDR: ffff00917fffec40  NAME: "Normal"
>   SIZE: 118784  MIN/LOW/HIGH: 82/200/318
>   VM_STAT:
>     NR_FREE_PAGES: 53
> --
> NODE: 7  ZONE: 1  ADDR: ffff00997fbbec40  NAME: "Normal"
>   SIZE: 118784  MIN/LOW/HIGH: 82/200/318
>   VM_STAT:
>     NR_FREE_PAGES: 52

How have you concluded that too_many_isolated is at the root of this issue?
With a very low NR_FREE_PAGES and many contending allocations the system
could easily be stuck in reclaim. What are the other reclaim
characteristics? Is the direct reclaim successful?
Hi Michal,

On Tue, Nov 14, 2023 at 09:20:09AM +0100, Michal Hocko wrote:
> On Mon 13-11-23 20:34:20, Marcelo Tosatti wrote:
> > A customer reported seeing processes hung at too_many_isolated,
> > while analysis indicated that the problem occurred due to out
> > of sync per-CPU stats (see below).
> >
> > Fix is to use node_page_state_snapshot to avoid the out of stale values.
> >
> > 2136 static unsigned long
> > 2137 shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> > 2138                      struct scan_control *sc, enum lru_list lru)
> > 2139 {
> >    :
> > 2145         bool file = is_file_lru(lru);
> >    :
> > 2147         struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> >    :
> > 2150         while (unlikely(too_many_isolated(pgdat, file, sc))) {
> > 2151                 if (stalled)
> > 2152                         return 0;
> > 2153
> > 2154                 /* wait a bit for the reclaimer. */
> > 2155                 msleep(100);   <--- some processes were sleeping here, with pending SIGKILL.
> > 2156                 stalled = true;
> > 2157
> > 2158                 /* We are about to die and free our memory. Return now. */
> > 2159                 if (fatal_signal_pending(current))
> > 2160                         return SWAP_CLUSTER_MAX;
> > 2161         }
> >
> > msleep() must be called only when there are too many isolated pages:
>
> What do you mean here?

That msleep() must not be called when

isolated > inactive

is false.

> > 2019 static int too_many_isolated(struct pglist_data *pgdat, int file,
> > 2020                              struct scan_control *sc)
> > 2021 {
> >    :
> > 2030         if (file) {
> > 2031                 inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
> > 2032                 isolated = node_page_state(pgdat, NR_ISOLATED_FILE);
> > 2033         } else {
> >    :
> > 2046         return isolated > inactive;
> >
> > The return value was true since:
> >
> > crash> p ((struct pglist_data *) 0xffff00817fffe580)->vm_stat[NR_INACTIVE_FILE]
> > $8 = {
> >   counter = 1
> > }
> > crash> p ((struct pglist_data *) 0xffff00817fffe580)->vm_stat[NR_ISOLATED_FILE]
> > $9 = {
> >   counter = 2
> >
> > while per_cpu stats had:
> >
> > crash> p ((struct pglist_data *) 0xffff00817fffe580)->per_cpu_nodestats
> > $85 = (struct per_cpu_nodestat *) 0xffff8000118832e0
> > crash> p/x 0xffff8000118832e0 + __per_cpu_offset[42]
> > $86 = 0xffff00917fcc32e0
> > crash> p ((struct per_cpu_nodestat *) 0xffff00917fcc32e0)->vm_node_stat_diff[NR_ISOLATED_FILE]
> > $87 = -1 '\377'
> >
> > crash> p/x 0xffff8000118832e0 + __per_cpu_offset[44]
> > $89 = 0xffff00917fe032e0
> > crash> p ((struct per_cpu_nodestat *) 0xffff00917fe032e0)->vm_node_stat_diff[NR_ISOLATED_FILE]
> > $91 = -1 '\377'
>
> This doesn't really tell much. How much out of sync they really are
> cumulatively over all cpus?

This is the cumulative value over all CPUs (offsets for other CPUs
have been omitted since they are zero).

> > It seems that processes were trapped in direct reclaim/compaction loop
> > because these nodes had few free pages lower than watermark min.
> >
> > crash> kmem -z | grep -A 3 Normal
> >   :
> > NODE: 4  ZONE: 1  ADDR: ffff00817fffec40  NAME: "Normal"
> >   SIZE: 8454144  PRESENT: 98304  MIN/LOW/HIGH: 68/166/264
> >   VM_STAT:
> >     NR_FREE_PAGES: 68
> > --
> > NODE: 5  ZONE: 1  ADDR: ffff00897fffec40  NAME: "Normal"
> >   SIZE: 118784  MIN/LOW/HIGH: 82/200/318
> >   VM_STAT:
> >     NR_FREE_PAGES: 45
> > --
> > NODE: 6  ZONE: 1  ADDR: ffff00917fffec40  NAME: "Normal"
> >   SIZE: 118784  MIN/LOW/HIGH: 82/200/318
> >   VM_STAT:
> >     NR_FREE_PAGES: 53
> > --
> > NODE: 7  ZONE: 1  ADDR: ffff00997fbbec40  NAME: "Normal"
> >   SIZE: 118784  MIN/LOW/HIGH: 82/200/318
> >   VM_STAT:
> >     NR_FREE_PAGES: 52
>
> How have you concluded that too_many_isolated is at root of this issue.

Because the customer observed the problem and obtained traces:

"If so, I have to mention about an another problem caused by vmstat issue here.
The customer experienced process hang like the issue reported here, but in this
case the process was trapped in compaction route. In shrink_inactive_list(),
reclaim_throttle() is called when too_many_isolated() is true. In fact confirmed
from memory dump, there was no isolated pages but zone's vmstat have 2 counts as
isolated pages and percpu vmstats have -2 counts.

    too_many = isolated > (inactive + active) / 2;

There was no more inactive and active pages. As the result, the process was
throttled in this point again and again until finish of parallel reclaimers who
did not exist there in real."

> With a very low NR_FREE_PAGES and many contending allocation the system
> could be easily stuck in reclaim. What are other reclaim
> characteristics?

I can ask. What information in particular do you want to know?

> Is the direct reclaim successful?

Processes are stuck in too_many_isolated (unnecessarily). What do you mean when you ask
"Is the direct reclaim successful", precisely?
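The repeated throttling described in the customer's report can be illustrated with a small userspace model of the quoted comparison. This is only a sketch under assumptions: `compaction_too_many_model()` and `throttled_checks()` are hypothetical names, the inputs are plain numbers rather than vmstat reads, and nothing in the loop updates them, which stands in for a counter whose per-CPU deltas are never flushed.

```c
#include <stdbool.h>

/*
 * Hypothetical model of the comparison quoted in the customer's report:
 *     too_many = isolated > (inactive + active) / 2;
 */
static bool compaction_too_many_model(unsigned long isolated,
				      unsigned long inactive,
				      unsigned long active)
{
	return isolated > (inactive + active) / 2;
}

/* Counts how many consecutive checks would throttle the caller, up to max_checks. */
static unsigned int throttled_checks(unsigned long isolated,
				     unsigned long inactive,
				     unsigned long active,
				     unsigned int max_checks)
{
	unsigned int n = 0;

	/* The inputs never change here, mirroring a stale counter that is never corrected. */
	while (n < max_checks &&
	       compaction_too_many_model(isolated, inactive, active))
		n++;
	return n;
}
```

With the reported stale value (2 isolated, no inactive or active pages) every check throttles, however many times it is repeated; with the real value of 0 isolated pages none of them do.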
On Tue 14-11-23 09:26:53, Marcelo Tosatti wrote:
> Hi Michal,
>
> On Tue, Nov 14, 2023 at 09:20:09AM +0100, Michal Hocko wrote:
> > On Mon 13-11-23 20:34:20, Marcelo Tosatti wrote:
> > > A customer reported seeing processes hung at too_many_isolated,
> > > while analysis indicated that the problem occurred due to out
> > > of sync per-CPU stats (see below).
> > >
> > > Fix is to use node_page_state_snapshot to avoid the out of stale values.
> > >
> > > 2136 static unsigned long
> > > 2137 shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> > > 2138                      struct scan_control *sc, enum lru_list lru)
> > > 2139 {
> > >    :
> > > 2145         bool file = is_file_lru(lru);
> > >    :
> > > 2147         struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> > >    :
> > > 2150         while (unlikely(too_many_isolated(pgdat, file, sc))) {
> > > 2151                 if (stalled)
> > > 2152                         return 0;
> > > 2153
> > > 2154                 /* wait a bit for the reclaimer. */
> > > 2155                 msleep(100);   <--- some processes were sleeping here, with pending SIGKILL.
> > > 2156                 stalled = true;
> > > 2157
> > > 2158                 /* We are about to die and free our memory. Return now. */
> > > 2159                 if (fatal_signal_pending(current))
> > > 2160                         return SWAP_CLUSTER_MAX;
> > > 2161         }
> > >
> > > msleep() must be called only when there are too many isolated pages:
> >
> > What do you mean here?
>
> That msleep() must not be called when
>
> isolated > inactive
>
> is false.

Well, but the code is structured in a way that this is simply true.
too_many_isolated might be a false positive because it is a very loose
interface and the number of isolated pages can fluctuate depending on
the number of direct reclaimers.

> > > 2019 static int too_many_isolated(struct pglist_data *pgdat, int file,
> > > 2020                              struct scan_control *sc)
> > > 2021 {
> > >    :
> > > 2030         if (file) {
> > > 2031                 inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
> > > 2032                 isolated = node_page_state(pgdat, NR_ISOLATED_FILE);
> > > 2033         } else {
> > >    :
> > > 2046         return isolated > inactive;
> > >
> > > The return value was true since:
> > >
> > > crash> p ((struct pglist_data *) 0xffff00817fffe580)->vm_stat[NR_INACTIVE_FILE]
> > > $8 = {
> > >   counter = 1
> > > }
> > > crash> p ((struct pglist_data *) 0xffff00817fffe580)->vm_stat[NR_ISOLATED_FILE]
> > > $9 = {
> > >   counter = 2
> > >
> > > while per_cpu stats had:
> > >
> > > crash> p ((struct pglist_data *) 0xffff00817fffe580)->per_cpu_nodestats
> > > $85 = (struct per_cpu_nodestat *) 0xffff8000118832e0
> > > crash> p/x 0xffff8000118832e0 + __per_cpu_offset[42]
> > > $86 = 0xffff00917fcc32e0
> > > crash> p ((struct per_cpu_nodestat *) 0xffff00917fcc32e0)->vm_node_stat_diff[NR_ISOLATED_FILE]
> > > $87 = -1 '\377'
> > >
> > > crash> p/x 0xffff8000118832e0 + __per_cpu_offset[44]
> > > $89 = 0xffff00917fe032e0
> > > crash> p ((struct per_cpu_nodestat *) 0xffff00917fe032e0)->vm_node_stat_diff[NR_ISOLATED_FILE]
> > > $91 = -1 '\377'
> >
> > This doesn't really tell much. How much out of sync they really are
> > cumulatively over all cpus?
>
> This is the cumulative value over all CPUs (offsets for other CPUs
> have been omitted since they are zero).

OK, so that means the NR_ISOLATED_FILE is 0 while NR_INACTIVE_FILE is 1,
correct? If that is the case then the value is indeed outdated but it
also means that the NR_INACTIVE_FILE is so small that all but 1 (resp. 2
as kswapd is never throttled) reclaimers will be stalled anyway. So does
the exact snapshot really help? Do you have any means to reproduce this
behavior and see that the patch actually changed the behavior?

[...]

> > With a very low NR_FREE_PAGES and many contending allocation the system
> > could be easily stuck in reclaim. What are other reclaim
> > characteristics?
>
> I can ask. What information in particular do you want to know?

When I am dealing with issues like this I heavily rely on /proc/vmstat
counters and pgscan, pgsteal counters to see whether there is any
progress over time.

> > Is the direct reclaim successful?
>
> Processes are stuck in too_many_isolated (unnecessarily). What do you mean when you ask
> "Is the direct reclaim successful", precisely?

With such a small LRU list it is quite likely that many processes will
be competing over the last pages on the list while the rest will be throttled
because there is nothing to reclaim. It is quite possible that all
reclaimers will be waiting for a single reclaimer (either kswapd or
other direct reclaimer). I would like to understand whether the system
is stuck in an unproductive state where everybody just waits until the
counter is synced or everything just progresses very slowly because of the
small LRU.
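A small helper for collecting the counters mentioned above might look like the following. It is a sketch, not a tool referenced anywhere in the thread: it assumes the usual one `name value` pair per line layout of /proc/vmstat, and `is_reclaim_counter()`, `parse_vmstat_line()` and `dump_reclaim_counters()` are hypothetical names.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* True for the two counter families mentioned above: pgscan* and pgsteal*. */
static bool is_reclaim_counter(const char *name)
{
	return strncmp(name, "pgscan", 6) == 0 ||
	       strncmp(name, "pgsteal", 7) == 0;
}

/* Parses one "name value" line; returns true and fills name/value on success. */
static bool parse_vmstat_line(const char *line, char name[64],
			      unsigned long long *value)
{
	return sscanf(line, "%63s %llu", name, value) == 2;
}

/*
 * Prints every pgscan/pgsteal counter found in the file at path.
 * Returns the number of counters printed, or -1 if the file cannot be opened.
 */
static int dump_reclaim_counters(const char *path, FILE *out)
{
	char line[256], name[64];
	unsigned long long value;
	int printed = 0;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (parse_vmstat_line(line, name, &value) &&
		    is_reclaim_counter(name)) {
			fprintf(out, "%s %llu\n", name, value);
			printed++;
		}
	}
	fclose(f);
	return printed;
}
```

Calling `dump_reclaim_counters("/proc/vmstat", stdout)` twice, some seconds apart, and comparing the two outputs gives the "progress over time" view: whether the scan and steal counters are still moving while the processes appear hung.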
On Tue, Nov 14, 2023 at 01:46:41PM +0100, Michal Hocko wrote:
> On Tue 14-11-23 09:26:53, Marcelo Tosatti wrote:
> > Hi Michal,
> >
> > On Tue, Nov 14, 2023 at 09:20:09AM +0100, Michal Hocko wrote:
> > > On Mon 13-11-23 20:34:20, Marcelo Tosatti wrote:
> > > > A customer reported seeing processes hung at too_many_isolated,
> > > > while analysis indicated that the problem occurred due to out
> > > > of sync per-CPU stats (see below).
> > > >
> > > > Fix is to use node_page_state_snapshot to avoid the out of stale values.
> > > >
> > > > 2136 static unsigned long
> > > > 2137 shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> > > > 2138                      struct scan_control *sc, enum lru_list lru)
> > > > 2139 {
> > > >    :
> > > > 2145         bool file = is_file_lru(lru);
> > > >    :
> > > > 2147         struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> > > >    :
> > > > 2150         while (unlikely(too_many_isolated(pgdat, file, sc))) {
> > > > 2151                 if (stalled)
> > > > 2152                         return 0;
> > > > 2153
> > > > 2154                 /* wait a bit for the reclaimer. */
> > > > 2155                 msleep(100);   <--- some processes were sleeping here, with pending SIGKILL.
> > > > 2156                 stalled = true;
> > > > 2157
> > > > 2158                 /* We are about to die and free our memory. Return now. */
> > > > 2159                 if (fatal_signal_pending(current))
> > > > 2160                         return SWAP_CLUSTER_MAX;
> > > > 2161         }
> > > >
> > > > msleep() must be called only when there are too many isolated pages:
> > >
> > > What do you mean here?
> >
> > That msleep() must not be called when
> >
> > isolated > inactive
> >
> > is false.
>
> Well, but the code is structured in a way that this is simply true.
> too_many_isolated might be false positive because it is a very loose
> interface and the number of isolated pages can fluctuate depending on
> the number of direct reclaimers.

OK

>
> > > > 2019 static int too_many_isolated(struct pglist_data *pgdat, int file,
> > > > 2020                              struct scan_control *sc)
> > > > 2021 {
> > > >    :
> > > > 2030         if (file) {
> > > > 2031                 inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
> > > > 2032                 isolated = node_page_state(pgdat, NR_ISOLATED_FILE);
> > > > 2033         } else {
> > > >    :
> > > > 2046         return isolated > inactive;
> > > >
> > > > The return value was true since:
> > > >
> > > > crash> p ((struct pglist_data *) 0xffff00817fffe580)->vm_stat[NR_INACTIVE_FILE]
> > > > $8 = {
> > > >   counter = 1
> > > > }
> > > > crash> p ((struct pglist_data *) 0xffff00817fffe580)->vm_stat[NR_ISOLATED_FILE]
> > > > $9 = {
> > > >   counter = 2
> > > >
> > > > while per_cpu stats had:
> > > >
> > > > crash> p ((struct pglist_data *) 0xffff00817fffe580)->per_cpu_nodestats
> > > > $85 = (struct per_cpu_nodestat *) 0xffff8000118832e0
> > > > crash> p/x 0xffff8000118832e0 + __per_cpu_offset[42]
> > > > $86 = 0xffff00917fcc32e0
> > > > crash> p ((struct per_cpu_nodestat *) 0xffff00917fcc32e0)->vm_node_stat_diff[NR_ISOLATED_FILE]
> > > > $87 = -1 '\377'
> > > >
> > > > crash> p/x 0xffff8000118832e0 + __per_cpu_offset[44]
> > > > $89 = 0xffff00917fe032e0
> > > > crash> p ((struct per_cpu_nodestat *) 0xffff00917fe032e0)->vm_node_stat_diff[NR_ISOLATED_FILE]
> > > > $91 = -1 '\377'
> > >
> > > This doesn't really tell much. How much out of sync they really are
> > > cumulatively over all cpus?
> >
> > This is the cumulative value over all CPUs (offsets for other CPUs
> > have been omitted since they are zero).
>
> OK, so that means the NR_ISOLATED_FILE is 0 while NR_INACTIVE_FILE is 1,
> correct? If that is the case then the value is indeed outdated but it
> also means that the NR_INACTIVE_FILE is so small that all but 1 (resp. 2
> as kswapd is never throttled) reclaimers will be stalled anyway. So does
> the exact snapshot really help?

By looking at the data:

> crash> p ((struct pglist_data *) 0xffff00817fffe580)->vm_stat[NR_INACTIVE_FILE]
> $8 = {
>   counter = 1
> }
> crash> p ((struct pglist_data *) 0xffff00817fffe580)->vm_stat[NR_ISOLATED_FILE]
> $9 = {
>   counter = 2
>
> while per_cpu stats had:
>
> crash> p ((struct pglist_data *) 0xffff00817fffe580)->per_cpu_nodestats
> $85 = (struct per_cpu_nodestat *) 0xffff8000118832e0
> crash> p/x 0xffff8000118832e0 + __per_cpu_offset[42]
> $86 = 0xffff00917fcc32e0
> crash> p ((struct per_cpu_nodestat *) 0xffff00917fcc32e0)->vm_node_stat_diff[NR_ISOLATED_FILE]
> $87 = -1 '\377'
>
> crash> p/x 0xffff8000118832e0 + __per_cpu_offset[44]
> $89 = 0xffff00917fe032e0
> crash> p ((struct per_cpu_nodestat *) 0xffff00917fe032e0)->vm_node_stat_diff[NR_ISOLATED_FILE]
> $91 = -1 '\377'

Actual-Value = Global-Counter + CPU0.delta + CPU1.delta + ... + CPUn.delta

Nr-Isolated-File = Nr-Isolated-Global + CPU0.delta-isolated + CPU1.delta-isolated + ... + CPUn.delta-isolated
Nr-Inactive-File = Nr-Inactive-Global + CPU0.delta-inactive + CPU1.delta-inactive + ... + CPUn.delta-inactive

With outdated values:
====================

Nr-Isolated-File = 2
Nr-Inactive-File = 1

Therefore isolated > inactive, since 2 > 1.

Without outdated values (snapshot):
==================================

Nr-Isolated-File = 2 - 1 - 1 = 0
Nr-Inactive-File = 1

> Do you have any means to reproduce this
> behavior and see that the patch actually changed the behavior?

No, because it's not easy to test patches on the system where this was
reproduced. However, the calculations above seem pretty unambiguous,
showing that the snapshot would fix the problem.

> [...]
>
> > > With a very low NR_FREE_PAGES and many contending allocation the system
> > > could be easily stuck in reclaim. What are other reclaim
> > > characteristics?
> >
> > I can ask. What information in particular do you want to know?
>
> When I am dealing with issues like this I heavily rely on /proc/vmstat
> counters and pgscan, pgsteal counters to see whether there is any
> progress over time.

I understand your desire for additional data, can try to grab it
(or create a synthetic configuration where this problem is reproducible).

However, given the calculations above, it is clear that one problem is the
out of sync counters. Don't you agree?

> > > Is the direct reclaim successful?
> >
> > Processes are stuck in too_many_isolated (unnecessarily). What do you mean when you ask
> > "Is the direct reclaim successful", precisely?
>
> With such a small LRU list it is quite likely that many processes will
> be competing over last pages on the list while rest will be throttled
> because there is nothing to reclaim. It is quite possible that all
> reclaimers will be waiting for a single reclaimer (either kswapd or
> other direct reclaimer).

Sure, but again, the calculations above show that processes are stuck
on too_many_isolated (and the proposed fix will address that situation).

> I would like to understand whether the system
> is stuck in unproductive state where everybody just waits until the
> counter is synced or everything just progress very slowly because of the
> small LRU.

OK.
On Tue, Nov 14, 2023 at 01:46:41PM +0100, Michal Hocko wrote: > On Tue 14-11-23 09:26:53, Marcelo Tosatti wrote: > > Hi Michal, > > > > On Tue, Nov 14, 2023 at 09:20:09AM +0100, Michal Hocko wrote: > > > On Mon 13-11-23 20:34:20, Marcelo Tosatti wrote: > > > > A customer reported seeing processes hung at too_many_isolated, > > > > while analysis indicated that the problem occurred due to out > > > > of sync per-CPU stats (see below). > > > > > > > > Fix is to use node_page_state_snapshot to avoid the out of stale values. > > > > > > > > 2136 static unsigned long > > > > 2137 shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, > > > > 2138 struct scan_control *sc, enum lru_list lru) > > > > 2139 { > > > > : > > > > 2145 bool file = is_file_lru(lru); > > > > : > > > > 2147 struct pglist_data *pgdat = lruvec_pgdat(lruvec); > > > > : > > > > 2150 while (unlikely(too_many_isolated(pgdat, file, sc))) { > > > > 2151 if (stalled) > > > > 2152 return 0; > > > > 2153 > > > > 2154 /* wait a bit for the reclaimer. */ > > > > 2155 msleep(100); <--- some processes were sleeping here, with pending SIGKILL. > > > > 2156 stalled = true; > > > > 2157 > > > > 2158 /* We are about to die and free our memory. Return now. */ > > > > 2159 if (fatal_signal_pending(current)) > > > > 2160 return SWAP_CLUSTER_MAX; > > > > 2161 } > > > > > > > > msleep() must be called only when there are too many isolated pages: > > > > > > What do you mean here? > > > > That msleep() must not be called when > > > > isolated > inactive > > > > is false. > > Well, but the code is structured in a way that this is simply true. > too_many_isolated might be false positive because it is a very loose > interface and the number of isolated pages can fluctuate depending on > the number of direct reclaimers. 
> > > > > 2019 static int too_many_isolated(struct pglist_data *pgdat, int file, > > > > 2020 struct scan_control *sc) > > > > 2021 { > > > > : > > > > 2030 if (file) { > > > > 2031 inactive = node_page_state(pgdat, NR_INACTIVE_FILE); > > > > 2032 isolated = node_page_state(pgdat, NR_ISOLATED_FILE); > > > > 2033 } else { > > > > : > > > > 2046 return isolated > inactive; > > > > > > > > The return value was true since: > > > > > > > > crash> p ((struct pglist_data *) 0xffff00817fffe580)->vm_stat[NR_INACTIVE_FILE] > > > > $8 = { > > > > counter = 1 > > > > } > > > > crash> p ((struct pglist_data *) 0xffff00817fffe580)->vm_stat[NR_ISOLATED_FILE] > > > > $9 = { > > > > counter = 2 > > > > > > > > while per_cpu stats had: > > > > > > > > crash> p ((struct pglist_data *) 0xffff00817fffe580)->per_cpu_nodestats > > > > $85 = (struct per_cpu_nodestat *) 0xffff8000118832e0 > > > > crash> p/x 0xffff8000118832e0 + __per_cpu_offset[42] > > > > $86 = 0xffff00917fcc32e0 > > > > crash> p ((struct per_cpu_nodestat *) 0xffff00917fcc32e0)->vm_node_stat_diff[NR_ISOLATED_FILE] > > > > $87 = -1 '\377' > > > > > > > > crash> p/x 0xffff8000118832e0 + __per_cpu_offset[44] > > > > $89 = 0xffff00917fe032e0 > > > > crash> p ((struct per_cpu_nodestat *) 0xffff00917fe032e0)->vm_node_stat_diff[NR_ISOLATED_FILE] > > > > $91 = -1 '\377' > > > > > > This doesn't really tell much. How much out of sync they really are > > > cumulatively over all cpus? > > > > This is the cumulative value over all CPUs (offsets for other CPUs > > have been omitted since they are zero). > > OK, so that means the NR_ISOLATED_FILE is 0 while NR_INACTIVE_FILE is 1, > correct? If that is the case then the value is indeed outdated but it > also means that the NR_INACTIVE_FILE is so small that all but 1 (resp. 2 > as kswapd is never throttled) reclaimers will be stalled anyway. So does > the exact snapshot really help? Do you have any means to reproduce this > behavior and see that the patch actually changed the behavior? 
>
> [...]
>
> > > With a very low NR_FREE_PAGES and many contending allocation the system
> > > could be easily stuck in reclaim. What are other reclaim
> > > characteristics?
> >
> > I can ask. What information in particular do you want to know?
>
> When I am dealing with issues like this I heavily rely on /proc/vmstat
> counters and pgscan, pgsteal counters to see whether there is any
> progress over time.
>
> > > Is the direct reclaim successful?
> >
> > Processes are stuck in too_many_isolated (unnecessarily). What do you mean when you ask
> > "Is the direct reclaim successful", precisely?
>
> With such a small LRU list it is quite likely that many processes will
> be competing over last pages on the list while rest will be throttled
> because there is nothing to reclaim. It is quite possible that all
> reclaimers will be waiting for a single reclaimer (either kswapd or
> other direct reclaimer). I would like to understand whether the system
> is stuck in unproductive state where everybody just waits until the
> counter is synced or everything just progress very slowly because of the
> small LRU.
> --
> Michal Hocko
> SUSE Labs

Michal,

I think this provides the data you are looking for:

It seems that the situation was invoking a memory-consuming user program
in parallel, expecting that the system will kick the oom-killer at the end.

Nodes 0-3 are small, containing system data and almost all files.
Nodes 4-7 are large, prepared to contain user data only.
The issue described above was observed on nodes 4-7, which
had very little memory for files.

Nodes 4-7 have more cpus than nodes 0-3.
Only cpus on nodes 4-7 are configured to be nohz_full.
So we often found unflushed percpu vmstat on cpus of nodes 4-7.
On Wed, Nov 22, 2023 at 08:23:51AM -0300, Marcelo Tosatti wrote:
[...]

Michal,

Let me know if you have any objections to the patch, thanks.
On Wed 22-11-23 08:26:02, Marcelo Tosatti wrote:
[...]
> Michal,
>
> Let me know if you have any objections to the patch, thanks.

I do not think you have explained how the patch helps, nor have you
shown that it fixes the described problem. You seem to be very focused
on the specific snapshot, which I do agree shows that the data is out of
sync and that there is throttling happening when, strictly speaking, it
should not. But (let me repeat) those discrepancies are so small that it
is very likely that concurrent reclaimers will be stalled anyway (it
takes just one to isolate those pages). Maybe this leads to an earlier
OOM killer invocation, as unthrottled reclaimers will be able to conclude
there is no progress rather than being throttled in direct reclaim.

That being said, I am not saying the patch is incorrect. Nevertheless, I
do not think we want to merge this patch without a better understanding
of what is going on in your specific case and what kind of runtime
difference the patch makes in that case. From your previous email it
seems like the actual case is mostly a memory stress test that manages
to fill up memory and push out almost all of the file LRU while the anon
LRU is not reclaimable for some reason. That shouldn't be terribly hard
to reproduce.