Message ID: 20240102184633.748113-1-urezki@gmail.com
Headers:
From: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
To: linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>
Cc: LKML <linux-kernel@vger.kernel.org>, Baoquan He <bhe@redhat.com>, Lorenzo Stoakes <lstoakes@gmail.com>, Christoph Hellwig <hch@infradead.org>, Matthew Wilcox <willy@infradead.org>, "Liam R . Howlett" <Liam.Howlett@oracle.com>, Dave Chinner <david@fromorbit.com>, "Paul E . McKenney" <paulmck@kernel.org>, Joel Fernandes <joel@joelfernandes.org>, Uladzislau Rezki <urezki@gmail.com>, Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
Subject: [PATCH v3 00/11] Mitigate a vmap lock contention v3
Date: Tue, 2 Jan 2024 19:46:22 +0100
Message-Id: <20240102184633.748113-1-urezki@gmail.com>
Series: Mitigate a vmap lock contention v3
Message
Uladzislau Rezki
Jan. 2, 2024, 6:46 p.m. UTC
This is v3. It is based on 6.7.0-rc8.

1. Motivation

- Offload the global vmap locks so the code scales with the number of CPUs;
- If possible and there is agreement, we can remove the "per-cpu kva
  allocator" to make the vmap code simpler;
- There were complaints from XFS folk that vmalloc can be contended on
  their workloads.

2. Design (high-level overview)

We introduce a vmap node logic. A node behaves as an independent entity
that serves an allocation request directly (if possible) from its pool.
That way it bypasses the global vmap space, which is protected by its
own lock.

Access to the pools is serialized per CPU. The number of nodes equals the
number of CPUs in a system. Please note the upper bound is 128 nodes.

Pools are size-segregated and populated based on system demand. The maximum
alloc request that can be stored in segregated storage is 256 pages. The
lazy drain path first decays a pool by 25% and then repopulates it with
freshly freed VAs for reuse instead of returning them to the global space.

When a VA is obtained (alloc path), it is stored in a separate node. A
va->va_start address is converted into the node where it should be placed
and reside. Doing so balances VAs across the nodes, so access becomes
scalable. The addr_to_node() function does the address-to-node conversion.

The vmap space is divided into segments of fixed size, 16 pages each. That
way any address can be associated with a segment number. The number of
segments equals num_possible_cpus() but is not greater than 128. Numbering
starts from 0. See below how an address is converted:

static inline unsigned int
addr_to_node_id(unsigned long addr)
{
	return (addr / zone_size) % nr_nodes;
}

On the free path, a VA can easily be found by converting its "va_start"
address to the node it resides in. It is moved from the "busy" to the
"lazy" data structure. Later on, as noted earlier, the lazy kworker decays
each node pool and repopulates it with fresh incoming VAs. Please note, a
VA is returned to the node that served the alloc request.

3. Test on AMD Ryzen Threadripper 3970X 32-Core Processor

sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64

<default perf>
 94.41%  0.89%  [kernel]        [k] _raw_spin_lock
 93.35% 93.07%  [kernel]        [k] native_queued_spin_lock_slowpath
 76.13%  0.28%  [kernel]        [k] __vmalloc_node_range
 72.96%  0.81%  [kernel]        [k] alloc_vmap_area
 56.94%  0.00%  [kernel]        [k] __get_vm_area_node
 41.95%  0.00%  [kernel]        [k] vmalloc
 37.15%  0.01%  [test_vmalloc]  [k] full_fit_alloc_test
 35.17%  0.00%  [kernel]        [k] ret_from_fork_asm
 35.17%  0.00%  [kernel]        [k] ret_from_fork
 35.17%  0.00%  [kernel]        [k] kthread
 35.08%  0.00%  [test_vmalloc]  [k] test_func
 34.45%  0.00%  [test_vmalloc]  [k] fix_size_alloc_test
 28.09%  0.01%  [test_vmalloc]  [k] long_busy_list_alloc_test
 23.53%  0.25%  [kernel]        [k] vfree.part.0
 21.72%  0.00%  [kernel]        [k] remove_vm_area
 20.08%  0.21%  [kernel]        [k] find_unlink_vmap_area
  2.34%  0.61%  [kernel]        [k] free_vmap_area_noflush
<default perf>
   vs
<patch-series perf>
 82.32%  0.22%  [test_vmalloc]  [k] long_busy_list_alloc_test
 63.36%  0.02%  [kernel]        [k] vmalloc
 63.34%  2.64%  [kernel]        [k] __vmalloc_node_range
 30.42%  4.46%  [kernel]        [k] vfree.part.0
 28.98%  2.51%  [kernel]        [k] __alloc_pages_bulk
 27.28%  0.19%  [kernel]        [k] __get_vm_area_node
 26.13%  1.50%  [kernel]        [k] alloc_vmap_area
 21.72% 21.67%  [kernel]        [k] clear_page_rep
 19.51%  2.43%  [kernel]        [k] _raw_spin_lock
 16.61% 16.51%  [kernel]        [k] native_queued_spin_lock_slowpath
 13.40%  2.07%  [kernel]        [k] free_unref_page
 10.62%  0.01%  [kernel]        [k] remove_vm_area
  9.02%  8.73%  [kernel]        [k] insert_vmap_area
  8.94%  0.00%  [kernel]        [k] ret_from_fork_asm
  8.94%  0.00%  [kernel]        [k] ret_from_fork
  8.94%  0.00%  [kernel]        [k] kthread
  8.29%  0.00%  [test_vmalloc]  [k] test_func
  7.81%  0.05%  [test_vmalloc]  [k] full_fit_alloc_test
  5.30%  4.73%  [kernel]        [k] purge_vmap_node
  4.47%  2.65%  [kernel]        [k] free_vmap_area_noflush
<patch-series perf>

This confirms that native_queued_spin_lock_slowpath goes down to 16.51%
from 93.07%. The throughput is ~12x higher:

urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
Run the test with following parameters: run_test_mask=7 nr_threads=64
Done.
Check the kernel ring buffer to see the summary.

real    10m51.271s
user    0m0.013s
sys     0m0.187s
urezki@pc638:~$

urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=7 nr_threads=64
Run the test with following parameters: run_test_mask=7 nr_threads=64
Done.
Check the kernel ring buffer to see the summary.

real    0m51.301s
user    0m0.015s
sys     0m0.040s
urezki@pc638:~$

4. Changelog

v1: https://lore.kernel.org/linux-mm/ZIAqojPKjChJTssg@pc636/T/
v2: https://lore.kernel.org/lkml/20230829081142.3619-1-urezki@gmail.com/

Delta v2 -> v3:
- fix comments from v2 feedback;
- switch from pre-fetch chunk logic to less complex size-based pools.

Baoquan He (1):
  mm/vmalloc: remove vmap_area_list

Uladzislau Rezki (Sony) (10):
  mm: vmalloc: Add va_alloc() helper
  mm: vmalloc: Rename adjust_va_to_fit_type() function
  mm: vmalloc: Move vmap_init_free_space() down in vmalloc.c
  mm: vmalloc: Remove global vmap_area_root rb-tree
  mm: vmalloc: Remove global purge_vmap_area_root rb-tree
  mm: vmalloc: Offload free_vmap_area_lock lock
  mm: vmalloc: Support multiple nodes in vread_iter
  mm: vmalloc: Support multiple nodes in vmallocinfo
  mm: vmalloc: Set nr_nodes based on CPUs in a system
  mm: vmalloc: Add a shrinker to drain vmap pools

 .../admin-guide/kdump/vmcoreinfo.rst |    8 +-
 arch/arm64/kernel/crash_core.c       |    1 -
 arch/riscv/kernel/crash_core.c       |    1 -
 include/linux/vmalloc.h              |    1 -
 kernel/crash_core.c                  |    4 +-
 kernel/kallsyms_selftest.c           |    1 -
 mm/nommu.c                           |    2 -
 mm/vmalloc.c                         | 1049 ++++++++++++-----
 8 files changed, 786 insertions(+), 281 deletions(-)
Comments
Hello, Folk!

> [ cover letter quoted in full, trimmed ]
>
> --
> 2.39.2

There is one thing that I have to clarify and which is still open for me.

Test machine: qemu x86_64 system, 64 CPUs, 64G of memory
Test suite:   test_vmalloc.sh
Environment:  mm-unstable, branch next-20240220, where this series is
located. On top of it I locally added Suren Baghdasaryan's memory
allocation profiling v3 for a better understanding of memory usage.

Before running the test, the condition is as below:

urezki@pc638:~$ sort -h /proc/allocinfo
   27.2MiB     6970 mm/memory.c:1122 module:memory func:folio_prealloc
   79.1MiB    20245 mm/readahead.c:247 module:readahead func:page_cache_ra_unbounded
    112MiB     8689 mm/slub.c:2202 module:slub func:alloc_slab_page
    122MiB    31168 mm/page_ext.c:270 module:page_ext func:alloc_page_ext
urezki@pc638:~$ free -m
               total        used        free      shared  buff/cache   available
Mem:           64172         936       63618           0         134       63236
Swap:              0           0           0
urezki@pc638:~$

The test suite stresses the vmap/vmalloc layer by creating workers which do
alloc/free in a tight loop, i.e. it is considered extreme.
Below, three identical tests were done with only one difference: 64, 128
and 256 kworkers.

1) sudo tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=64

urezki@pc638:~$ sort -h /proc/allocinfo
   80.1MiB    20518 mm/readahead.c:247 module:readahead func:page_cache_ra_unbounded
    122MiB    31168 mm/page_ext.c:270 module:page_ext func:alloc_page_ext
    153MiB    39048 mm/filemap.c:1919 module:filemap func:__filemap_get_folio
    178MiB    13259 mm/slub.c:2202 module:slub func:alloc_slab_page
    350MiB    89656 include/linux/mm.h:2848 module:memory func:pagetable_alloc
urezki@pc638:~$ free -m
               total        used        free      shared  buff/cache   available
Mem:           64172        1417       63054           0         298       62755
Swap:              0           0           0
urezki@pc638:~$

2) sudo tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=128

urezki@pc638:~$ sort -h /proc/allocinfo
    122MiB    31168 mm/page_ext.c:270 module:page_ext func:alloc_page_ext
    154MiB    39440 mm/filemap.c:1919 module:filemap func:__filemap_get_folio
    196MiB    14038 mm/slub.c:2202 module:slub func:alloc_slab_page
   1.20GiB   315655 include/linux/mm.h:2848 module:memory func:pagetable_alloc
urezki@pc638:~$ free -m
               total        used        free      shared  buff/cache   available
Mem:           64172        2556       61914           0         302       61616
Swap:              0           0           0
urezki@pc638:~$

3) sudo tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=256

urezki@pc638:~$ sort -h /proc/allocinfo
    127MiB    32565 mm/readahead.c:247 module:readahead func:page_cache_ra_unbounded
    197MiB    50506 mm/filemap.c:1919 module:filemap func:__filemap_get_folio
    278MiB    18519 mm/slub.c:2202 module:slub func:alloc_slab_page
   5.36GiB  1405072 include/linux/mm.h:2848 module:memory func:pagetable_alloc
urezki@pc638:~$ free -m
               total        used        free      shared  buff/cache   available
Mem:           64172        6741       57652           0         394       57431
Swap:              0           0           0
urezki@pc638:~$

pagetable_alloc increases as soon as higher pressure is applied by raising
the number of workers. Running the same number of jobs in a subsequent run
does not increase it further; it stays at the same level as before.
/**
 * pagetable_alloc - Allocate pagetables
 * @gfp: GFP flags
 * @order: desired pagetable order
 *
 * pagetable_alloc allocates memory for page tables as well as a page table
 * descriptor to describe that memory.
 *
 * Return: The ptdesc describing the allocated page tables.
 */
static inline struct ptdesc *pagetable_alloc(gfp_t gfp, unsigned int order)
{
	struct page *page = alloc_pages(gfp | __GFP_COMP, order);

	return page_ptdesc(page);
}

Could you please comment on it? Or do you have any thoughts? Is it
expected? Are page tables ever shrunk?

/proc/slabinfo does not show any high "active" or "number" of objects used
by any cache. /proc/meminfo - "VmallocUsed" stays low after those 3 tests.
I have checked it with KASAN and KMEMLEAK and I do not see any issues.

Thank you for the help!

--
Uladzislau Rezki
Hi,

On Thu, Feb 22, 2024 at 8:35 AM Uladzislau Rezki <urezki@gmail.com> wrote:
>
> Hello, Folk!
>
> [...]
>
> pagetable_alloc - gets increased as soon as a higher pressure is applied by
> increasing number of workers.
>
> [...]
>
> Could you please comment on it? Or do you have any thought? Is it expected?
> Is a page-table ever shrink?

It's my understanding that the vunmap_range helpers don't actively free
page tables, they just clear PTEs. munmap does free them in
mmap.c:free_pgtables; maybe something could be worked up for vmalloc too.

I would not be surprised if the memory increase you're seeing is more or
less correlated to the maximum vmalloc footprint throughout the whole test.
On Thu, Feb 22, 2024 at 11:15:59PM +0000, Pedro Falcato wrote:
> Hi,
>
> [...]
>
> It's my understanding that the vunmap_range helpers don't actively
> free page tables, they just clear PTEs. munmap does free them in
> mmap.c:free_pgtables, maybe something could be worked up for vmalloc
> too.
>
Right. I see that for user space, pgtables are removed. There was work
on it.

> I would not be surprised if the memory increase you're seeing is more
> or less correlated to the maximum vmalloc footprint throughout the
> whole test.
>
Yes, the vmalloc footprint follows the memory usage. Some use cases map
a lot of memory.

Thanks for the input!

--
Uladzislau Rezki
On 02/23/24 at 10:34am, Uladzislau Rezki wrote:
> On Thu, Feb 22, 2024 at 11:15:59PM +0000, Pedro Falcato wrote:
> > [...]
> > I would not be surprised if the memory increase you're seeing is more
> > or less correlated to the maximum vmalloc footprint throughout the
> > whole test.
> >
> Yes, the vmalloc footprint follows the memory usage. Some uses cases
> map lot of memory.

The 'nr_threads=256' testing may be too radical. I ran the test on a bare
metal machine as below; it is still running and has been hanging there for
30 minutes. I did this right after system boot. I am looking for other
machines with more processors.

[root@dell-r640-068 ~]# nproc
64
[root@dell-r640-068 ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:          187Gi        18Gi       169Gi        12Mi       262Mi       168Gi
Swap:         4.0Gi          0B       4.0Gi
[root@dell-r640-068 ~]#

[root@dell-r640-068 linux]# tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=256
Run the test with following parameters: run_test_mask=127 nr_threads=256
> On 02/23/24 at 10:34am, Uladzislau Rezki wrote:
> > [...]
>
> The 'nr_threads=256' testing may be too radical. I took the test on
> a bare metal machine as below, it's still running and hang there after
> 30 minutes. I did this after system boot. I am looking for other
> machines with more processors.
>
> [root@dell-r640-068 linux]# tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=256
> Run the test with following parameters: run_test_mask=127 nr_threads=256
>
Agree, nr_threads=256 is way too radical :) Mine took 50 minutes to
complete, so wait a bit more :)

--
Uladzislau Rezki
On 02/23/24 at 12:06pm, Uladzislau Rezki wrote: > > On 02/23/24 at 10:34am, Uladzislau Rezki wrote: > > > On Thu, Feb 22, 2024 at 11:15:59PM +0000, Pedro Falcato wrote: > > > > Hi, > > > > > > > > On Thu, Feb 22, 2024 at 8:35 AM Uladzislau Rezki <urezki@gmail.com> wrote: > > > > > > > > > > Hello, Folk! > > > > > > > > > >[...] > > > > > pagetable_alloc - gets increased as soon as a higher pressure is applied by > > > > > increasing number of workers. Running same number of jobs on a next run > > > > > does not increase it and stays on same level as on previous. > > > > > > > > > > /** > > > > > * pagetable_alloc - Allocate pagetables > > > > > * @gfp: GFP flags > > > > > * @order: desired pagetable order > > > > > * > > > > > * pagetable_alloc allocates memory for page tables as well as a page table > > > > > * descriptor to describe that memory. > > > > > * > > > > > * Return: The ptdesc describing the allocated page tables. > > > > > */ > > > > > static inline struct ptdesc *pagetable_alloc(gfp_t gfp, unsigned int order) > > > > > { > > > > > struct page *page = alloc_pages(gfp | __GFP_COMP, order); > > > > > > > > > > return page_ptdesc(page); > > > > > } > > > > > > > > > > Could you please comment on it? Or do you have any thought? Is it expected? > > > > > Is a page-table ever shrink? > > > > > > > > It's my understanding that the vunmap_range helpers don't actively > > > > free page tables, they just clear PTEs. munmap does free them in > > > > mmap.c:free_pgtables, maybe something could be worked up for vmalloc > > > > too. > > > > > > > Right. I see that for a user space, pgtables are removed. There was a > > > work on it. > > > > > > > > > > > I would not be surprised if the memory increase you're seeing is more > > > > or less correlated to the maximum vmalloc footprint throughout the > > > > whole test. > > > > > > > Yes, the vmalloc footprint follows the memory usage. Some uses cases > > > map lot of memory. 
> >
> > The 'nr_threads=256' testing may be too radical. I took the test on
> > a bare metal machine as below; it's still running and hanging there
> > after 30 minutes. I did this after system boot. I am looking for other
> > machines with more processors.
> >
> > [root@dell-r640-068 ~]# nproc
> > 64
> > [root@dell-r640-068 ~]# free -h
> >               total        used        free      shared  buff/cache   available
> > Mem:          187Gi        18Gi       169Gi        12Mi       262Mi       168Gi
> > Swap:         4.0Gi          0B       4.0Gi
> > [root@dell-r640-068 ~]#
> >
> > [root@dell-r640-068 linux]# tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=256
> > Run the test with following parameters: run_test_mask=127 nr_threads=256
>
> Agree, nr_threads=256 is way too radical :) Mine took 50 minutes to
> complete. So wait more :)

Right, mine could take a similar time to finish. I got a machine with
288 cpus, see if I can get some clues. When I went through the code
flow, I suddenly realized it could be drain_vmap_area_work which is the
bottleneck and causes the tremendous page table page cost.

On your system, there are 64 cpus, so:

  nr_lazy_max = lazy_max_pages() = 7 * 32M = 224M;

So with nr_threads=128 or 256, it's very easy to reach nr_lazy_max and
trigger drain_vmap_area_work(). When CPU resource is very limited, the
lazy vmap purging will be very slow, while the alloc/free in
lib/test_vmalloc.c go far faster and more easily than vmap reclaiming.
If an old va is not reused, a new va is allocated and the range keeps
extending; new page tables surely need to be created to cover them.

I will take testing on the system with 288 cpus, will update when
testing is done.
On Fri, Feb 23, 2024 at 11:57:25PM +0800, Baoquan He wrote:
> On 02/23/24 at 12:06pm, Uladzislau Rezki wrote:
> [...]
>
> Right, mine could take a similar time to finish. I got a machine with
> 288 cpus, see if I can get some clues. When I went through the code
> flow, I suddenly realized it could be drain_vmap_area_work which is the
> bottleneck and causes the tremendous page table page cost.
>
> On your system, there are 64 cpus, so:
>
>   nr_lazy_max = lazy_max_pages() = 7 * 32M = 224M;
>
> So with nr_threads=128 or 256, it's very easy to reach nr_lazy_max and
> trigger drain_vmap_area_work(). When CPU resource is very limited, the
> lazy vmap purging will be very slow, while the alloc/free in
> lib/test_vmalloc.c go far faster and more easily than vmap reclaiming.
> If an old va is not reused, a new va is allocated and the range keeps
> extending; new page tables surely need to be created to cover them.
>
> I will take testing on the system with 288 cpus, will update when
> testing is done.
<snip>
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 12caa794abd4..a90c5393d85f 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1754,6 +1754,8 @@ size_to_va_pool(struct vmap_node *vn, unsigned long size)
 	return NULL;
 }
 
+static unsigned long lazy_max_pages(void);
+
 static bool
 node_pool_add_va(struct vmap_node *n, struct vmap_area *va)
 {
@@ -1763,6 +1765,9 @@ node_pool_add_va(struct vmap_node *n, struct vmap_area *va)
 	if (!vp)
 		return false;
 
+	if (READ_ONCE(vp->len) > lazy_max_pages())
+		return false;
+
 	spin_lock(&n->pool_lock);
 	list_add(&va->list, &vp->head);
 	WRITE_ONCE(vp->len, vp->len + 1);
@@ -2170,9 +2175,9 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end,
 		INIT_WORK(&vn->purge_work, purge_vmap_node);
 
 		if (cpumask_test_cpu(i, cpu_online_mask))
-			schedule_work_on(i, &vn->purge_work);
+			queue_work_on(i, system_highpri_wq, &vn->purge_work);
 		else
-			schedule_work(&vn->purge_work);
+			queue_work(system_highpri_wq, &vn->purge_work);
 
 		nr_purge_helpers--;
 	} else {
<snip>

We need this. It settles it back to a normal PTE usage. Tomorrow I will
check if the cache length should be limited. I tested on my 64-CPU
system with the radical 256 kworkers. It looks good.

--
Uladzislau Rezki
On 02/23/24 at 07:55pm, Uladzislau Rezki wrote:
> On Fri, Feb 23, 2024 at 11:57:25PM +0800, Baoquan He wrote:
> [...]
>
> We need this. It settles it back to a normal PTE usage. Tomorrow I will
> check if the cache length should be limited. I tested on my 64-CPU
> system with the radical 256 kworkers. It looks good.

I finally finished the testing w/o and with your above improvement
patch. Testing is done on a system with 128 cpus. The system with 288
cpus is not available because of a console connection issue. The log is
attached here. In some testing after rebooting, I found it could take
more than 30 minutes; I am not sure if that was caused by my messy code
changes. I finally cleaned them all up, took a clean linux-next to test,
then applied your above draft code.
[root@dell-per6515-03 linux]# nproc
128
[root@dell-per6515-03 linux]# free -h
               total        used        free      shared  buff/cache   available
Mem:           124Gi       2.6Gi       122Gi        21Mi       402Mi       122Gi
Swap:          4.0Gi          0B       4.0Gi

1) linux-next kernel w/o improving code from Uladzislau
-------------------------------------------------------
[root@dell-per6515-03 linux]# time tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=64
Run the test with following parameters: run_test_mask=127 nr_threads=64
Done.
Check the kernel ring buffer to see the summary.

real	4m28.018s
user	0m0.015s
sys	0m4.712s
[root@dell-per6515-03 ~]# sort -h /proc/allocinfo | tail -10
     21405696     5226 mm/memory.c:1122 func:folio_prealloc
     26199936     7980 kernel/fork.c:309 func:alloc_thread_stack_node
     29822976     7281 mm/readahead.c:247 func:page_cache_ra_unbounded
     99090432    96768 drivers/iommu/iova.c:604 func:iova_magazine_alloc
    107638784     6320 mm/readahead.c:468 func:ra_alloc_folio
    120560528    29439 mm/mm_init.c:2521 func:alloc_large_system_hash
    134742016    32896 mm/percpu-vm.c:95 func:pcpu_alloc_pages
    263192576    64256 mm/page_ext.c:270 func:alloc_page_ext
    266797056    65136 include/linux/mm.h:2848 func:pagetable_alloc
    507617280    32796 mm/slub.c:2305 func:alloc_slab_page

[root@dell-per6515-03 linux]# time tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=128
Run the test with following parameters: run_test_mask=127 nr_threads=128
Done.
Check the kernel ring buffer to see the summary.

real	6m19.328s
user	0m0.005s
sys	0m9.476s
[root@dell-per6515-03 ~]# sort -h /proc/allocinfo | tail -10
     21405696     5226 mm/memory.c:1122 func:folio_prealloc
     26889408     8190 kernel/fork.c:309 func:alloc_thread_stack_node
     29822976     7281 mm/readahead.c:247 func:page_cache_ra_unbounded
     99090432    96768 drivers/iommu/iova.c:604 func:iova_magazine_alloc
    107638784     6320 mm/readahead.c:468 func:ra_alloc_folio
    120560528    29439 mm/mm_init.c:2521 func:alloc_large_system_hash
    134742016    32896 mm/percpu-vm.c:95 func:pcpu_alloc_pages
    263192576    64256 mm/page_ext.c:270 func:alloc_page_ext
    550068224    34086 mm/slub.c:2305 func:alloc_slab_page
    664535040   162240 include/linux/mm.h:2848 func:pagetable_alloc

[root@dell-per6515-03 linux]# time tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=256
Run the test with following parameters: run_test_mask=127 nr_threads=256
Done.
Check the kernel ring buffer to see the summary.

real	19m10.657s
user	0m0.015s
sys	0m20.959s
[root@dell-per6515-03 ~]# sort -h /proc/allocinfo | tail -10
     22441984     5479 mm/shmem.c:1634 func:shmem_alloc_folio
     26758080     8150 kernel/fork.c:309 func:alloc_thread_stack_node
     35880960     8760 mm/readahead.c:247 func:page_cache_ra_unbounded
     99090432    96768 drivers/iommu/iova.c:604 func:iova_magazine_alloc
    120560528    29439 mm/mm_init.c:2521 func:alloc_large_system_hash
    122355712     7852 mm/readahead.c:468 func:ra_alloc_folio
    134742016    32896 mm/percpu-vm.c:95 func:pcpu_alloc_pages
    263192576    64256 mm/page_ext.c:270 func:alloc_page_ext
    708231168    50309 mm/slub.c:2305 func:alloc_slab_page
   1107296256   270336 include/linux/mm.h:2848 func:pagetable_alloc

2) linux-next kernel with improving code from Uladzislau
--------------------------------------------------------
[root@dell-per6515-03 linux]# time tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=64
Run the test with following parameters: run_test_mask=127 nr_threads=64
Done.
Check the kernel ring buffer to see the summary.

real	4m27.226s
user	0m0.006s
sys	0m4.709s
[root@dell-per6515-03 linux]# sort -h /proc/allocinfo | tail -10
     38023168     9283 mm/readahead.c:247 func:page_cache_ra_unbounded
     72228864    17634 fs/xfs/xfs_buf.c:390 [xfs] func:xfs_buf_alloc_pages
     99090432    96768 drivers/iommu/iova.c:604 func:iova_magazine_alloc
     99863552    97523 fs/xfs/xfs_icache.c:81 [xfs] func:xfs_inode_alloc
    120560528    29439 mm/mm_init.c:2521 func:alloc_large_system_hash
    136314880    33280 mm/percpu-vm.c:95 func:pcpu_alloc_pages
    184176640    10684 mm/readahead.c:468 func:ra_alloc_folio
    263192576    64256 mm/page_ext.c:270 func:alloc_page_ext
    284700672    69507 include/linux/mm.h:2848 func:pagetable_alloc
    601427968    36377 mm/slub.c:2305 func:alloc_slab_page

[root@dell-per6515-03 linux]# time tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=128
Run the test with following parameters: run_test_mask=127 nr_threads=128
Done.
Check the kernel ring buffer to see the summary.

real	6m16.960s
user	0m0.007s
sys	0m9.465s
[root@dell-per6515-03 linux]# sort -h /proc/allocinfo | tail -10
     38158336     9316 mm/readahead.c:247 func:page_cache_ra_unbounded
     72220672    17632 fs/xfs/xfs_buf.c:390 [xfs] func:xfs_buf_alloc_pages
     99090432    96768 drivers/iommu/iova.c:604 func:iova_magazine_alloc
     99863552    97523 fs/xfs/xfs_icache.c:81 [xfs] func:xfs_inode_alloc
    120560528    29439 mm/mm_init.c:2521 func:alloc_large_system_hash
    136314880    33280 mm/percpu-vm.c:95 func:pcpu_alloc_pages
    184504320    10710 mm/readahead.c:468 func:ra_alloc_folio
    263192576    64256 mm/page_ext.c:270 func:alloc_page_ext
    427884544   104464 include/linux/mm.h:2848 func:pagetable_alloc
    697311232    45159 mm/slub.c:2305 func:alloc_slab_page

[root@dell-per6515-03 linux]# time tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=256
Run the test with following parameters: run_test_mask=127 nr_threads=256
Done.
Check the kernel ring buffer to see the summary.

real	21m15.673s
user	0m0.008s
sys	0m20.259s
[root@dell-per6515-03 linux]# sort -h /proc/allocinfo | tail -10
     38158336     9316 mm/readahead.c:247 func:page_cache_ra_unbounded
     72224768    17633 fs/xfs/xfs_buf.c:390 [xfs] func:xfs_buf_alloc_pages
     99090432    96768 drivers/iommu/iova.c:604 func:iova_magazine_alloc
     99863552    97523 fs/xfs/xfs_icache.c:81 [xfs] func:xfs_inode_alloc
    120560528    29439 mm/mm_init.c:2521 func:alloc_large_system_hash
    136314880    33280 mm/percpu-vm.c:95 func:pcpu_alloc_pages
    184504320    10710 mm/readahead.c:468 func:ra_alloc_folio
    263192576    64256 mm/page_ext.c:270 func:alloc_page_ext
    506974208   123773 include/linux/mm.h:2848 func:pagetable_alloc
    809504768    53621 mm/slub.c:2305 func:alloc_slab_page

[root@dell-per6515-03 linux]# time tools/testing/selftests/mm/test_vmalloc.sh run_test_mask=127 nr_threads=256
Run the test with following parameters: run_test_mask=127 nr_threads=256
Done.
Check the kernel ring buffer to see the summary.

real	21m36.580s
user	0m0.012s
sys	0m19.912s
[root@dell-per6515-03 linux]# sort -h /proc/allocinfo | tail -10
     38977536     9516 mm/readahead.c:247 func:page_cache_ra_unbounded
     72273920    17645 fs/xfs/xfs_buf.c:390 [xfs] func:xfs_buf_alloc_pages
     99090432    96768 drivers/iommu/iova.c:604 func:iova_magazine_alloc
     99895296    97554 fs/xfs/xfs_icache.c:81 [xfs] func:xfs_inode_alloc
    120560528    29439 mm/mm_init.c:2521 func:alloc_large_system_hash
    141033472    34432 mm/percpu-vm.c:95 func:pcpu_alloc_pages
    186064896    10841 mm/readahead.c:468 func:ra_alloc_folio
    263192576    64256 mm/page_ext.c:270 func:alloc_page_ext
    541237248   132138 include/linux/mm.h:2848 func:pagetable_alloc
    694718464    41216 mm/slub.c:2305 func:alloc_slab_page
>
> I finally finished the testing w/o and with your above improvement
> patch. Testing is done on a system with 128 cpus.
> [...]
>     541237248   132138 include/linux/mm.h:2848 func:pagetable_alloc
>     694718464    41216 mm/slub.c:2305 func:alloc_slab_page

Thank you for testing this. So ~132k page-table pages with the patch,
down from ~270k without. I think it looks good, but I might change the
draft version and send out a new version.

Thank you again!

--
Uladzislau Rezki