Message ID | 20240102184633.748113-5-urezki@gmail.com |
---|---|
State | New |
Headers |
From: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
To: linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>
Cc: LKML <linux-kernel@vger.kernel.org>, Baoquan He <bhe@redhat.com>, Lorenzo Stoakes <lstoakes@gmail.com>, Christoph Hellwig <hch@infradead.org>, Matthew Wilcox <willy@infradead.org>, "Liam R . Howlett" <Liam.Howlett@oracle.com>, Dave Chinner <david@fromorbit.com>, "Paul E . McKenney" <paulmck@kernel.org>, Joel Fernandes <joel@joelfernandes.org>, Uladzislau Rezki <urezki@gmail.com>, Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
Subject: [PATCH v3 04/11] mm: vmalloc: Remove global vmap_area_root rb-tree
Date: Tue, 2 Jan 2024 19:46:26 +0100
Message-Id: <20240102184633.748113-5-urezki@gmail.com>
In-Reply-To: <20240102184633.748113-1-urezki@gmail.com>
References: <20240102184633.748113-1-urezki@gmail.com> |
Series | Mitigate a vmap lock contention v3 |
Commit Message
Uladzislau Rezki
Jan. 2, 2024, 6:46 p.m. UTC
Store allocated objects in separate nodes. A va->va_start address is converted into the node where the VA should be placed and reside; the addr_to_node() function performs this address-to-node conversion.

Such an approach balances VAs across nodes, so access becomes scalable. The number of nodes in a system depends on the number of CPUs.

Please note:

1. As of now allocated VAs are bound to node-0, so this patch does not change behavior compared to the current code;

2. The global vmap_area_lock and vmap_area_root are removed as they are no longer needed. The vmap_area_list is still kept and is _empty_; it is exported for kexec only;

3. vmallocinfo and vread() have been reworked to handle multiple nodes.

Reviewed-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 mm/vmalloc.c | 240 +++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 173 insertions(+), 67 deletions(-)
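The address-to-node conversion described above is a simple arithmetic mapping. For reference, the helpers introduced by the patch look like this (vmap_zone_size is the per-node zone size and nr_vmap_nodes the number of nodes; both are fixed at 1 in this patch), followed by the typical usage pattern where only the owning node's busy lock is taken:

<snip>
static inline unsigned int
addr_to_node_id(unsigned long addr)
{
	return (addr / vmap_zone_size) % nr_vmap_nodes;
}

static inline struct vmap_node *
addr_to_node(unsigned long addr)
{
	return &vmap_nodes[addr_to_node_id(addr)];
}

/* Typical use: every busy-tree operation locks only the owning node. */
vn = addr_to_node(va->va_start);

spin_lock(&vn->busy.lock);
insert_vmap_area(va, &vn->busy.root, &vn->busy.head);
spin_unlock(&vn->busy.lock);
<snip>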
Comments
On 2024/01/03 02:46, Uladzislau Rezki wrote: > Store allocated objects in a separate nodes. A va->va_start > address is converted into a correct node where it should > be placed and resided. An addr_to_node() function is used > to do a proper address conversion to determine a node that > contains a VA. > > Such approach balances VAs across nodes as a result an access > becomes scalable. Number of nodes in a system depends on number > of CPUs. > > Please note: > > 1. As of now allocated VAs are bound to a node-0. It means the > patch does not give any difference comparing with a current > behavior; > > 2. The global vmap_area_lock, vmap_area_root are removed as there > is no need in it anymore. The vmap_area_list is still kept and > is _empty_. It is exported for a kexec only; > > 3. The vmallocinfo and vread() have to be reworked to be able to > handle multiple nodes. > > Reviewed-by: Baoquan He <bhe@redhat.com> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> > --- <...> > struct vmap_area *find_vmap_area(unsigned long addr) > { > + struct vmap_node *vn; > struct vmap_area *va; > + int i, j; > > - spin_lock(&vmap_area_lock); > - va = __find_vmap_area(addr, &vmap_area_root); > - spin_unlock(&vmap_area_lock); > + /* > + * An addr_to_node_id(addr) converts an address to a node index > + * where a VA is located. If VA spans several zones and passed > + * addr is not the same as va->va_start, what is not common, we > + * may need to scan an extra nodes. See an example: > + * > + * <--va--> > + * -|-----|-----|-----|-----|- > + * 1 2 0 1 > + * > + * VA resides in node 1 whereas it spans 1 and 2. If passed > + * addr is within a second node we should do extra work. We > + * should mention that it is rare and is a corner case from > + * the other hand it has to be covered. > + */ > + i = j = addr_to_node_id(addr); > + do { > + vn = &vmap_nodes[i]; > > - return va; > + spin_lock(&vn->busy.lock); > + va = __find_vmap_area(addr, &vn->busy.root); > + spin_unlock(&vn->busy.lock); > + > + if (va) > + return va; > + } while ((i = (i + 1) % nr_vmap_nodes) != j); > + > + return NULL; > } > Hi Uladzislau Rezki, I really like your work, it is great and helpful! Currently, I am working on using shared memory communication (SMC [1]) to transparently accelerate TCP communication between two peers within the same OS instance[2]. In this scenario, a vzalloced kernel buffer acts as a shared memory and will be simultaneous read or written by two SMC sockets, thus forming an SMC connection. socket1 socket2 | ^ | | userspace ---- write -------------------- read ------ | +-----------------+ | kernel +--->| shared memory |---+ | (vzalloced now) | +-----------------+ Then I encountered the performance regression caused by lock contention in find_vmap_area() when multiple threads transfer data through multiple SMC connections on machines with many CPUs[3]. According to perf, the performance bottleneck is caused by the global vmap_area_lock contention[4]: - writer: smc_tx_sendmsg -> memcpy_from_msg -> copy_from_iter -> check_copy_size -> check_object_size -> if (CONFIG_HARDENED_USERCOPY is set) check_heap_object -> if(vm) find_vmap_area -> try to hold vmap_area_lock lock - reader: smc_rx_recvmsg -> memcpy_to_msg -> copy_to_iter -> check_copy_size -> check_object_size -> if (CONFIG_HARDENED_USERCOPY is set) check_heap_object -> if(vm) find_vmap_area -> try to hold vmap_area_lock lock Fortunately, thank you for this patch set, the global vmap_area_lock was removed and per node lock vn->busy.lock is introduced. 
it is really helpful: In 48 CPUs qemu environment, the Requests/s increased by 5 times: - nginx - wrk -c 1000 -t 96 -d 30 http://127.0.0.1:80 vzalloced shmem vzalloced shmem(with this patch set) Requests/sec 113536.56 583729.93 But it also has some overhead, compared to using kzalloced shared memory or unsetting CONFIG_HARDENED_USERCOPY, which won't involve finding vmap area: kzalloced shmem vzalloced shmem(unset CONFIG_HARDENED_USERCOPY) Requests/sec 831950.39 805164.78 So, as a newbie in Linux-mm, I would like to ask for some suggestions: Is it possible to further eliminate the overhead caused by lock contention in find_vmap_area() in this scenario (maybe this is asking too much), or the only way out is not setting CONFIG_HARDENED_USERCOPY or not using vzalloced buffer in the situation where cocurrent kernel-userspace-copy happens? Any feedback will be appreciated. Thanks again for your time. [1] Shared Memory Communications (SMC) enables two SMC capable peers to communicate by using memory buffers that each peer allocates for the partner's use. It improves throughput, lowers latency and cost, and maintains existing functions. See details in https://www.ibm.com/support/pages/node/7009315 [2] https://lore.kernel.org/netdev/1702214654-32069-1-git-send-email-guwen@linux.alibaba.com/ [3] issues: https://lore.kernel.org/all/1fbd6b74-1080-923a-01c1-689c3d65f880@huawei.com/ analysis: https://lore.kernel.org/all/3189e342-c38f-6076-b730-19a6efd732a5@linux.alibaba.com/ [4] Some flamegraphs are attached, - SMC using vzalloced buffer: vzalloc_t96.svg - SMC using vzalloced buffer and with this patchset: vzalloc_t96_improve.svg - SMC using vzalloced buffer and unset CONFIG_HARDENED_USERCOPY: vzalloc_t96_nocheck.svg - SMC using kzalloced buffer: kzalloc_t96.svg Best regards, Wen Gu
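For context, the hardened-usercopy path in the call chains above ends up in a vmap lookup roughly as follows; this is a condensed sketch of the vmalloc branch of check_heap_object() in mm/usercopy.c, not the full function:

<snip>
/*
 * Condensed sketch: the bounds check every copy_{from,to}_iter() on a
 * vmalloc'ed buffer performs when CONFIG_HARDENED_USERCOPY is enabled.
 */
if (is_vmalloc_addr(ptr)) {
	struct vmap_area *area = find_vmap_area((unsigned long)ptr);

	/* No VA, or the copy would run past the end of the mapping: abort. */
	if (!area || n > area->va_end - (unsigned long)ptr)
		usercopy_abort("vmalloc", NULL, to_user, 0, n);

	return;
}
<snip>

This is why the vzalloc'ed SMC buffer pays a find_vmap_area() on every send/receive, while a kzalloc'ed buffer goes through the cheaper slab/page checks instead, as the numbers above show.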
Hello, Wen Gu. > > Hi Uladzislau Rezki, > > I really like your work, it is great and helpful! > > Currently, I am working on using shared memory communication (SMC [1]) > to transparently accelerate TCP communication between two peers within > the same OS instance[2]. > > In this scenario, a vzalloced kernel buffer acts as a shared memory and > will be simultaneous read or written by two SMC sockets, thus forming an > SMC connection. > > > socket1 socket2 > | ^ > | | userspace > ---- write -------------------- read ------ > | +-----------------+ | kernel > +--->| shared memory |---+ > | (vzalloced now) | > +-----------------+ > > Then I encountered the performance regression caused by lock contention > in find_vmap_area() when multiple threads transfer data through multiple > SMC connections on machines with many CPUs[3]. > > According to perf, the performance bottleneck is caused by the global > vmap_area_lock contention[4]: > > - writer: > > smc_tx_sendmsg > -> memcpy_from_msg > -> copy_from_iter > -> check_copy_size > -> check_object_size > -> if (CONFIG_HARDENED_USERCOPY is set) check_heap_object > -> if(vm) find_vmap_area > -> try to hold vmap_area_lock lock > - reader: > > smc_rx_recvmsg > -> memcpy_to_msg > -> copy_to_iter > -> check_copy_size > -> check_object_size > -> if (CONFIG_HARDENED_USERCOPY is set) check_heap_object > -> if(vm) find_vmap_area > -> try to hold vmap_area_lock lock > > Fortunately, thank you for this patch set, the global vmap_area_lock was > removed and per node lock vn->busy.lock is introduced. it is really helpful: > > In 48 CPUs qemu environment, the Requests/s increased by 5 times: > - nginx > - wrk -c 1000 -t 96 -d 30 http://127.0.0.1:80 > > vzalloced shmem vzalloced shmem(with this patch set) > Requests/sec 113536.56 583729.93 > > Thank you for the confirmation that your workload is improved. The "nginx" is 5 times better! > But it also has some overhead, compared to using kzalloced shared memory > or unsetting CONFIG_HARDENED_USERCOPY, which won't involve finding vmap area: > > kzalloced shmem vzalloced shmem(unset CONFIG_HARDENED_USERCOPY) > Requests/sec 831950.39 805164.78 > > The CONFIG_HARDENED_USERCOPY prevents coping "wrong" memory regions. That is why if it is a vmalloced memory it wants to make sure it is really true, if not user-copy is aborted. So there is an extra work that involves finding a VA associated with an address. > So, as a newbie in Linux-mm, I would like to ask for some suggestions: > > Is it possible to further eliminate the overhead caused by lock contention > in find_vmap_area() in this scenario (maybe this is asking too much), or the > only way out is not setting CONFIG_HARDENED_USERCOPY or not using vzalloced > buffer in the situation where cocurrent kernel-userspace-copy happens? > Could you please try below patch, if it improves this series further? 
Just in case: <snip> diff --git a/mm/vmalloc.c b/mm/vmalloc.c index e30dabf68263..40acf53cadfb 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -772,7 +772,7 @@ static DEFINE_PER_CPU(struct vmap_area *, ne_fit_preload_node); struct rb_list { struct rb_root root; struct list_head head; - spinlock_t lock; + rwlock_t lock; }; struct vmap_pool { @@ -947,19 +947,19 @@ find_vmap_area_exceed_addr_lock(unsigned long addr, struct vmap_area **va) for (i = 0; i < nr_vmap_nodes; i++) { vn = &vmap_nodes[i]; - spin_lock(&vn->busy.lock); + read_lock(&vn->busy.lock); va_lowest = __find_vmap_area_exceed_addr(addr, &vn->busy.root); if (va_lowest) { if (!va_node || va_lowest->va_start < (*va)->va_start) { if (va_node) - spin_unlock(&va_node->busy.lock); + read_unlock(&va_node->busy.lock); *va = va_lowest; va_node = vn; continue; } } - spin_unlock(&vn->busy.lock); + read_unlock(&vn->busy.lock); } return va_node; @@ -1695,9 +1695,9 @@ static void free_vmap_area(struct vmap_area *va) /* * Remove from the busy tree/list. */ - spin_lock(&vn->busy.lock); + write_lock(&vn->busy.lock); unlink_va(va, &vn->busy.root); - spin_unlock(&vn->busy.lock); + write_unlock(&vn->busy.lock); /* * Insert/Merge it back to the free tree/list. @@ -1901,9 +1901,9 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, vn = addr_to_node(va->va_start); - spin_lock(&vn->busy.lock); + write_lock(&vn->busy.lock); insert_vmap_area(va, &vn->busy.root, &vn->busy.head); - spin_unlock(&vn->busy.lock); + write_unlock(&vn->busy.lock); BUG_ON(!IS_ALIGNED(va->va_start, align)); BUG_ON(va->va_start < vstart); @@ -2123,10 +2123,10 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end, if (RB_EMPTY_ROOT(&vn->lazy.root)) continue; - spin_lock(&vn->lazy.lock); + write_lock(&vn->lazy.lock); WRITE_ONCE(vn->lazy.root.rb_node, NULL); list_replace_init(&vn->lazy.head, &vn->purge_list); - spin_unlock(&vn->lazy.lock); + write_unlock(&vn->lazy.lock); start = min(start, list_first_entry(&vn->purge_list, struct vmap_area, list)->va_start); @@ -2223,9 +2223,9 @@ static void free_vmap_area_noflush(struct vmap_area *va) vn = is_vn_id_valid(vn_id) ? 
id_to_node(vn_id):addr_to_node(va->va_start); - spin_lock(&vn->lazy.lock); + write_lock(&vn->lazy.lock); insert_vmap_area(va, &vn->lazy.root, &vn->lazy.head); - spin_unlock(&vn->lazy.lock); + write_unlock(&vn->lazy.lock); trace_free_vmap_area_noflush(va_start, nr_lazy, nr_lazy_max); @@ -2272,9 +2272,9 @@ struct vmap_area *find_vmap_area(unsigned long addr) do { vn = &vmap_nodes[i]; - spin_lock(&vn->busy.lock); + read_lock(&vn->busy.lock); va = __find_vmap_area(addr, &vn->busy.root); - spin_unlock(&vn->busy.lock); + read_unlock(&vn->busy.lock); if (va) return va; @@ -2293,11 +2293,11 @@ static struct vmap_area *find_unlink_vmap_area(unsigned long addr) do { vn = &vmap_nodes[i]; - spin_lock(&vn->busy.lock); + write_lock(&vn->busy.lock); va = __find_vmap_area(addr, &vn->busy.root); if (va) unlink_va(va, &vn->busy.root); - spin_unlock(&vn->busy.lock); + write_unlock(&vn->busy.lock); if (va) return va; @@ -2514,9 +2514,9 @@ static void free_vmap_block(struct vmap_block *vb) BUG_ON(tmp != vb); vn = addr_to_node(vb->va->va_start); - spin_lock(&vn->busy.lock); + write_lock(&vn->busy.lock); unlink_va(vb->va, &vn->busy.root); - spin_unlock(&vn->busy.lock); + write_unlock(&vn->busy.lock); free_vmap_area_noflush(vb->va); kfree_rcu(vb, rcu_head); @@ -2942,9 +2942,9 @@ static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va, { struct vmap_node *vn = addr_to_node(va->va_start); - spin_lock(&vn->busy.lock); + read_lock(&vn->busy.lock); setup_vmalloc_vm_locked(vm, va, flags, caller); - spin_unlock(&vn->busy.lock); + read_unlock(&vn->busy.lock); } static void clear_vm_uninitialized_flag(struct vm_struct *vm) @@ -4214,19 +4214,19 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count) next_va: next = va->va_end; - spin_unlock(&vn->busy.lock); + read_unlock(&vn->busy.lock); } while ((vn = find_vmap_area_exceed_addr_lock(next, &va))); finished_zero: if (vn) - spin_unlock(&vn->busy.lock); + read_unlock(&vn->busy.lock); /* zero-fill memory holes */ return count - remains + zero_iter(iter, remains); finished: /* Nothing remains, or We couldn't copy/zero everything. 
*/ if (vn) - spin_unlock(&vn->busy.lock); + read_unlock(&vn->busy.lock); return count - remains; } @@ -4563,11 +4563,11 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets, for (area = 0; area < nr_vms; area++) { struct vmap_node *vn = addr_to_node(vas[area]->va_start); - spin_lock(&vn->busy.lock); + write_lock(&vn->busy.lock); insert_vmap_area(vas[area], &vn->busy.root, &vn->busy.head); setup_vmalloc_vm_locked(vms[area], vas[area], VM_ALLOC, pcpu_get_vm_areas); - spin_unlock(&vn->busy.lock); + write_unlock(&vn->busy.lock); } /* @@ -4687,7 +4687,7 @@ bool vmalloc_dump_obj(void *object) vn = addr_to_node((unsigned long)objp); - if (spin_trylock(&vn->busy.lock)) { + if (read_trylock(&vn->busy.lock)) { va = __find_vmap_area(addr, &vn->busy.root); if (va && va->vm) { @@ -4697,7 +4697,7 @@ bool vmalloc_dump_obj(void *object) success = true; } - spin_unlock(&vn->busy.lock); + read_unlock(&vn->busy.lock); } if (success) @@ -4742,13 +4742,13 @@ static void show_purge_info(struct seq_file *m) for (i = 0; i < nr_vmap_nodes; i++) { vn = &vmap_nodes[i]; - spin_lock(&vn->lazy.lock); + read_lock(&vn->lazy.lock); list_for_each_entry(va, &vn->lazy.head, list) { seq_printf(m, "0x%pK-0x%pK %7ld unpurged vm_area\n", (void *)va->va_start, (void *)va->va_end, va->va_end - va->va_start); } - spin_unlock(&vn->lazy.lock); + read_unlock(&vn->lazy.lock); } } @@ -4762,7 +4762,7 @@ static int vmalloc_info_show(struct seq_file *m, void *p) for (i = 0; i < nr_vmap_nodes; i++) { vn = &vmap_nodes[i]; - spin_lock(&vn->busy.lock); + read_lock(&vn->busy.lock); list_for_each_entry(va, &vn->busy.head, list) { if (!va->vm) { if (va->flags & VMAP_RAM) @@ -4808,7 +4808,7 @@ static int vmalloc_info_show(struct seq_file *m, void *p) show_numa_info(m, v); seq_putc(m, '\n'); } - spin_unlock(&vn->busy.lock); + read_unlock(&vn->busy.lock); } /* @@ -4902,11 +4902,11 @@ static void vmap_init_nodes(void) vn = &vmap_nodes[n]; vn->busy.root = RB_ROOT; INIT_LIST_HEAD(&vn->busy.head); - spin_lock_init(&vn->busy.lock); + rwlock_init(&vn->busy.lock); vn->lazy.root = RB_ROOT; INIT_LIST_HEAD(&vn->lazy.head); - spin_lock_init(&vn->lazy.lock); + rwlock_init(&vn->lazy.lock); for (i = 0; i < MAX_VA_SIZE_PAGES; i++) { INIT_LIST_HEAD(&vn->pool[i].head); <snip> Thank you! -- Uladzislau Rezki
On 2024/1/5 18:50, Uladzislau Rezki wrote: > Hello, Wen Gu. > >> >> Hi Uladzislau Rezki, >> <...> >> Fortunately, thank you for this patch set, the global vmap_area_lock was >> removed and per node lock vn->busy.lock is introduced. it is really helpful: >> >> In 48 CPUs qemu environment, the Requests/s increased by 5 times: >> - nginx >> - wrk -c 1000 -t 96 -d 30 http://127.0.0.1:80 >> >> vzalloced shmem vzalloced shmem(with this patch set) >> Requests/sec 113536.56 583729.93 >> >> > Thank you for the confirmation that your workload is improved. The "nginx" > is 5 times better! > Yes, thank you very much for the improvement! >> But it also has some overhead, compared to using kzalloced shared memory >> or unsetting CONFIG_HARDENED_USERCOPY, which won't involve finding vmap area: >> >> kzalloced shmem vzalloced shmem(unset CONFIG_HARDENED_USERCOPY) >> Requests/sec 831950.39 805164.78 >> >> > The CONFIG_HARDENED_USERCOPY prevents coping "wrong" memory regions. That is > why if it is a vmalloced memory it wants to make sure it is really true, > if not user-copy is aborted. > > So there is an extra work that involves finding a VA associated with an address. > Yes, and lock contention in finding VA is likely to be a performance bottleneck, which is mitigated a lot by your work. >> So, as a newbie in Linux-mm, I would like to ask for some suggestions: >> >> Is it possible to further eliminate the overhead caused by lock contention >> in find_vmap_area() in this scenario (maybe this is asking too much), or the >> only way out is not setting CONFIG_HARDENED_USERCOPY or not using vzalloced >> buffer in the situation where cocurrent kernel-userspace-copy happens? >> > Could you please try below patch, if it improves this series further? > Just in case: > Thank you! I tried the patch, and it seems that the wait for rwlock_t also exists, as much as using spinlock_t. (The flamegraph is attached. Not sure why the read_lock waits so long, given that there is no frequent write_lock competition) vzalloced shmem(spinlock_t) vzalloced shmem(rwlock_t) Requests/sec 583729.93 460007.44 So I guess the overhead in finding vmap area is inevitable here and the original spin_lock is fine in this series. Thanks again for your help! Best regards, Wen Gu > <snip> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > index e30dabf68263..40acf53cadfb 100644 > --- a/mm/vmalloc.c > +++ b/mm/vmalloc.c > @@ -772,7 +772,7 @@ static DEFINE_PER_CPU(struct vmap_area *, ne_fit_preload_node); > struct rb_list { > struct rb_root root; > struct list_head head; > - spinlock_t lock; > + rwlock_t lock; > }; > > struct vmap_pool { > @@ -947,19 +947,19 @@ find_vmap_area_exceed_addr_lock(unsigned long addr, struct vmap_area **va) > for (i = 0; i < nr_vmap_nodes; i++) { > vn = &vmap_nodes[i]; > > - spin_lock(&vn->busy.lock); > + read_lock(&vn->busy.lock); > va_lowest = __find_vmap_area_exceed_addr(addr, &vn->busy.root); > if (va_lowest) { > if (!va_node || va_lowest->va_start < (*va)->va_start) { > if (va_node) > - spin_unlock(&va_node->busy.lock); > + read_unlock(&va_node->busy.lock); > > *va = va_lowest; > va_node = vn; > continue; > } > } > - spin_unlock(&vn->busy.lock); > + read_unlock(&vn->busy.lock); > } > > return va_node; > @@ -1695,9 +1695,9 @@ static void free_vmap_area(struct vmap_area *va) > /* > * Remove from the busy tree/list. 
> */ > - spin_lock(&vn->busy.lock); > + write_lock(&vn->busy.lock); > unlink_va(va, &vn->busy.root); > - spin_unlock(&vn->busy.lock); > + write_unlock(&vn->busy.lock); > > /* > * Insert/Merge it back to the free tree/list. > @@ -1901,9 +1901,9 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, > > vn = addr_to_node(va->va_start); > > - spin_lock(&vn->busy.lock); > + write_lock(&vn->busy.lock); > insert_vmap_area(va, &vn->busy.root, &vn->busy.head); > - spin_unlock(&vn->busy.lock); > + write_unlock(&vn->busy.lock); > > BUG_ON(!IS_ALIGNED(va->va_start, align)); > BUG_ON(va->va_start < vstart); > @@ -2123,10 +2123,10 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end, > if (RB_EMPTY_ROOT(&vn->lazy.root)) > continue; > > - spin_lock(&vn->lazy.lock); > + write_lock(&vn->lazy.lock); > WRITE_ONCE(vn->lazy.root.rb_node, NULL); > list_replace_init(&vn->lazy.head, &vn->purge_list); > - spin_unlock(&vn->lazy.lock); > + write_unlock(&vn->lazy.lock); > > start = min(start, list_first_entry(&vn->purge_list, > struct vmap_area, list)->va_start); > @@ -2223,9 +2223,9 @@ static void free_vmap_area_noflush(struct vmap_area *va) > vn = is_vn_id_valid(vn_id) ? > id_to_node(vn_id):addr_to_node(va->va_start); > > - spin_lock(&vn->lazy.lock); > + write_lock(&vn->lazy.lock); > insert_vmap_area(va, &vn->lazy.root, &vn->lazy.head); > - spin_unlock(&vn->lazy.lock); > + write_unlock(&vn->lazy.lock); > > trace_free_vmap_area_noflush(va_start, nr_lazy, nr_lazy_max); > > @@ -2272,9 +2272,9 @@ struct vmap_area *find_vmap_area(unsigned long addr) > do { > vn = &vmap_nodes[i]; > > - spin_lock(&vn->busy.lock); > + read_lock(&vn->busy.lock); > va = __find_vmap_area(addr, &vn->busy.root); > - spin_unlock(&vn->busy.lock); > + read_unlock(&vn->busy.lock); > > if (va) > return va; > @@ -2293,11 +2293,11 @@ static struct vmap_area *find_unlink_vmap_area(unsigned long addr) > do { > vn = &vmap_nodes[i]; > > - spin_lock(&vn->busy.lock); > + write_lock(&vn->busy.lock); > va = __find_vmap_area(addr, &vn->busy.root); > if (va) > unlink_va(va, &vn->busy.root); > - spin_unlock(&vn->busy.lock); > + write_unlock(&vn->busy.lock); > > if (va) > return va; > @@ -2514,9 +2514,9 @@ static void free_vmap_block(struct vmap_block *vb) > BUG_ON(tmp != vb); > > vn = addr_to_node(vb->va->va_start); > - spin_lock(&vn->busy.lock); > + write_lock(&vn->busy.lock); > unlink_va(vb->va, &vn->busy.root); > - spin_unlock(&vn->busy.lock); > + write_unlock(&vn->busy.lock); > > free_vmap_area_noflush(vb->va); > kfree_rcu(vb, rcu_head); > @@ -2942,9 +2942,9 @@ static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va, > { > struct vmap_node *vn = addr_to_node(va->va_start); > > - spin_lock(&vn->busy.lock); > + read_lock(&vn->busy.lock); > setup_vmalloc_vm_locked(vm, va, flags, caller); > - spin_unlock(&vn->busy.lock); > + read_unlock(&vn->busy.lock); > } > > static void clear_vm_uninitialized_flag(struct vm_struct *vm) > @@ -4214,19 +4214,19 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count) > > next_va: > next = va->va_end; > - spin_unlock(&vn->busy.lock); > + read_unlock(&vn->busy.lock); > } while ((vn = find_vmap_area_exceed_addr_lock(next, &va))); > > finished_zero: > if (vn) > - spin_unlock(&vn->busy.lock); > + read_unlock(&vn->busy.lock); > > /* zero-fill memory holes */ > return count - remains + zero_iter(iter, remains); > finished: > /* Nothing remains, or We couldn't copy/zero everything. 
*/ > if (vn) > - spin_unlock(&vn->busy.lock); > + read_unlock(&vn->busy.lock); > > return count - remains; > } > @@ -4563,11 +4563,11 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets, > for (area = 0; area < nr_vms; area++) { > struct vmap_node *vn = addr_to_node(vas[area]->va_start); > > - spin_lock(&vn->busy.lock); > + write_lock(&vn->busy.lock); > insert_vmap_area(vas[area], &vn->busy.root, &vn->busy.head); > setup_vmalloc_vm_locked(vms[area], vas[area], VM_ALLOC, > pcpu_get_vm_areas); > - spin_unlock(&vn->busy.lock); > + write_unlock(&vn->busy.lock); > } > > /* > @@ -4687,7 +4687,7 @@ bool vmalloc_dump_obj(void *object) > > vn = addr_to_node((unsigned long)objp); > > - if (spin_trylock(&vn->busy.lock)) { > + if (read_trylock(&vn->busy.lock)) { > va = __find_vmap_area(addr, &vn->busy.root); > > if (va && va->vm) { > @@ -4697,7 +4697,7 @@ bool vmalloc_dump_obj(void *object) > success = true; > } > > - spin_unlock(&vn->busy.lock); > + read_unlock(&vn->busy.lock); > } > > if (success) > @@ -4742,13 +4742,13 @@ static void show_purge_info(struct seq_file *m) > for (i = 0; i < nr_vmap_nodes; i++) { > vn = &vmap_nodes[i]; > > - spin_lock(&vn->lazy.lock); > + read_lock(&vn->lazy.lock); > list_for_each_entry(va, &vn->lazy.head, list) { > seq_printf(m, "0x%pK-0x%pK %7ld unpurged vm_area\n", > (void *)va->va_start, (void *)va->va_end, > va->va_end - va->va_start); > } > - spin_unlock(&vn->lazy.lock); > + read_unlock(&vn->lazy.lock); > } > } > > @@ -4762,7 +4762,7 @@ static int vmalloc_info_show(struct seq_file *m, void *p) > for (i = 0; i < nr_vmap_nodes; i++) { > vn = &vmap_nodes[i]; > > - spin_lock(&vn->busy.lock); > + read_lock(&vn->busy.lock); > list_for_each_entry(va, &vn->busy.head, list) { > if (!va->vm) { > if (va->flags & VMAP_RAM) > @@ -4808,7 +4808,7 @@ static int vmalloc_info_show(struct seq_file *m, void *p) > show_numa_info(m, v); > seq_putc(m, '\n'); > } > - spin_unlock(&vn->busy.lock); > + read_unlock(&vn->busy.lock); > } > > /* > @@ -4902,11 +4902,11 @@ static void vmap_init_nodes(void) > vn = &vmap_nodes[n]; > vn->busy.root = RB_ROOT; > INIT_LIST_HEAD(&vn->busy.head); > - spin_lock_init(&vn->busy.lock); > + rwlock_init(&vn->busy.lock); > > vn->lazy.root = RB_ROOT; > INIT_LIST_HEAD(&vn->lazy.head); > - spin_lock_init(&vn->lazy.lock); > + rwlock_init(&vn->lazy.lock); > > for (i = 0; i < MAX_VA_SIZE_PAGES; i++) { > INIT_LIST_HEAD(&vn->pool[i].head); > <snip> > > Thank you! > > -- > Uladzislau Rezki
> > On 2024/1/5 18:50, Uladzislau Rezki wrote: > > > Hello, Wen Gu. > > > > > > > > Hi Uladzislau Rezki, > > > > > <...> > > > > Fortunately, thank you for this patch set, the global vmap_area_lock was > > > removed and per node lock vn->busy.lock is introduced. it is really helpful: > > > > > > In 48 CPUs qemu environment, the Requests/s increased by 5 times: > > > - nginx > > > - wrk -c 1000 -t 96 -d 30 http://127.0.0.1:80 > > > > > > vzalloced shmem vzalloced shmem(with this patch set) > > > Requests/sec 113536.56 583729.93 > > > > > > > > Thank you for the confirmation that your workload is improved. The "nginx" > > is 5 times better! > > > > Yes, thank you very much for the improvement! > > > > But it also has some overhead, compared to using kzalloced shared memory > > > or unsetting CONFIG_HARDENED_USERCOPY, which won't involve finding vmap area: > > > > > > kzalloced shmem vzalloced shmem(unset CONFIG_HARDENED_USERCOPY) > > > Requests/sec 831950.39 805164.78 > > > > > > > > The CONFIG_HARDENED_USERCOPY prevents coping "wrong" memory regions. That is > > why if it is a vmalloced memory it wants to make sure it is really true, > > if not user-copy is aborted. > > > > So there is an extra work that involves finding a VA associated with an address. > > > > Yes, and lock contention in finding VA is likely to be a performance bottleneck, > which is mitigated a lot by your work. > > > > So, as a newbie in Linux-mm, I would like to ask for some suggestions: > > > > > > Is it possible to further eliminate the overhead caused by lock contention > > > in find_vmap_area() in this scenario (maybe this is asking too much), or the > > > only way out is not setting CONFIG_HARDENED_USERCOPY or not using vzalloced > > > buffer in the situation where cocurrent kernel-userspace-copy happens? > > > > > Could you please try below patch, if it improves this series further? > > Just in case: > > > > Thank you! I tried the patch, and it seems that the wait for rwlock_t > also exists, as much as using spinlock_t. (The flamegraph is attached. > Not sure why the read_lock waits so long, given that there is no frequent > write_lock competition) > > vzalloced shmem(spinlock_t) vzalloced shmem(rwlock_t) > Requests/sec 583729.93 460007.44 > > So I guess the overhead in finding vmap area is inevitable here and the > original spin_lock is fine in this series. > I have also noticed a erformance difference between rwlock and spinlock. So, yes. This is what we need to do extra if CONFIG_HARDENED_USERCOPY is set, i.e. find a VA. -- Uladzislau Rezki
On Sat, 6 Jan 2024 17:36:23 +0100 Uladzislau Rezki <urezki@gmail.com> > > > > Thank you! I tried the patch, and it seems that the wait for rwlock_t > > also exists, as much as using spinlock_t. (The flamegraph is attached. > > Not sure why the read_lock waits so long, given that there is no frequent > > write_lock competition) > > > > vzalloced shmem(spinlock_t) vzalloced shmem(rwlock_t) > > Requests/sec 583729.93 460007.44 > > > > So I guess the overhead in finding vmap area is inevitable here and the > > original spin_lock is fine in this series. > > > I have also noticed a erformance difference between rwlock and spinlock. > So, yes. This is what we need to do extra if CONFIG_HARDENED_USERCOPY is > set, i.e. find a VA. See if read bias helps to understand the gap between spinlock and rwlock. --- x/kernel/locking/qrwlock.c +++ y/kernel/locking/qrwlock.c @@ -23,7 +23,7 @@ void __lockfunc queued_read_lock_slowpat /* * Readers come here when they cannot get the lock without waiting */ - if (unlikely(in_interrupt())) { + if (1) { /* * Readers in interrupt context will get the lock immediately * if the writer is just waiting (not holding the lock yet),
On 2024/1/7 14:59, Hillf Danton wrote: > On Sat, 6 Jan 2024 17:36:23 +0100 Uladzislau Rezki <urezki@gmail.com> >>> >>> Thank you! I tried the patch, and it seems that the wait for rwlock_t >>> also exists, as much as using spinlock_t. (The flamegraph is attached. >>> Not sure why the read_lock waits so long, given that there is no frequent >>> write_lock competition) >>> >>> vzalloced shmem(spinlock_t) vzalloced shmem(rwlock_t) >>> Requests/sec 583729.93 460007.44 >>> >>> So I guess the overhead in finding vmap area is inevitable here and the >>> original spin_lock is fine in this series. >>> >> I have also noticed a erformance difference between rwlock and spinlock. >> So, yes. This is what we need to do extra if CONFIG_HARDENED_USERCOPY is >> set, i.e. find a VA. > > See if read bias helps to understand the gap between spinlock and rwlock. > > --- x/kernel/locking/qrwlock.c > +++ y/kernel/locking/qrwlock.c > @@ -23,7 +23,7 @@ void __lockfunc queued_read_lock_slowpat > /* > * Readers come here when they cannot get the lock without waiting > */ > - if (unlikely(in_interrupt())) { > + if (1) { > /* > * Readers in interrupt context will get the lock immediately > * if the writer is just waiting (not holding the lock yet), Thank you for the idea! Hillf. IIUC, the change makes read operations more likely to acquire lock and modified fairness to favor reading. The test in my scenario shows: vzalloced shmem with spinlock_t rwlock_t rwlock_t(with above change) Requests/sec 564961.29 442534.33 439733.31 In addition to read bias, there seems to be other factors that cause the gap, but I haven't figured it out yet.. Thanks, Wen Gu
On Mon, Jan 08, 2024 at 03:45:18PM +0800, Wen Gu wrote: > > > On 2024/1/7 14:59, Hillf Danton wrote: > > On Sat, 6 Jan 2024 17:36:23 +0100 Uladzislau Rezki <urezki@gmail.com> > > > > > > > > Thank you! I tried the patch, and it seems that the wait for rwlock_t > > > > also exists, as much as using spinlock_t. (The flamegraph is attached. > > > > Not sure why the read_lock waits so long, given that there is no frequent > > > > write_lock competition) > > > > > > > > vzalloced shmem(spinlock_t) vzalloced shmem(rwlock_t) > > > > Requests/sec 583729.93 460007.44 > > > > > > > > So I guess the overhead in finding vmap area is inevitable here and the > > > > original spin_lock is fine in this series. > > > > > > > I have also noticed a erformance difference between rwlock and spinlock. > > > So, yes. This is what we need to do extra if CONFIG_HARDENED_USERCOPY is > > > set, i.e. find a VA. > > > > See if read bias helps to understand the gap between spinlock and rwlock. > > > > --- x/kernel/locking/qrwlock.c > > +++ y/kernel/locking/qrwlock.c > > @@ -23,7 +23,7 @@ void __lockfunc queued_read_lock_slowpat > > /* > > * Readers come here when they cannot get the lock without waiting > > */ > > - if (unlikely(in_interrupt())) { > > + if (1) { > > /* > > * Readers in interrupt context will get the lock immediately > > * if the writer is just waiting (not holding the lock yet), > > Thank you for the idea! Hillf. > > IIUC, the change makes read operations more likely to acquire lock and > modified fairness to favor reading. > > The test in my scenario shows: > > vzalloced shmem with spinlock_t rwlock_t rwlock_t(with above change) > Requests/sec 564961.29 442534.33 439733.31 > > In addition to read bias, there seems to be other factors that cause the > gap, but I haven't figured it out yet.. > <snip> urezki@pc638:~$ cat locktorture.sh #!/bin/sh # available lock types: spin_lock, rw_lock torture_type=$1 test_time=$2 echo "Start..." modprobe locktorture $torture_type nreaders_stress=0 > /dev/null 2>&1 sleep $test_time rmmod locktorture > /dev/null 2>&1 echo "Done." urezki@pc638:~$ <snip> Out: # sudo ./locktorture.sh rw_lock 30 [12107.327566] Writes: Total: 53304415 Max/Min: 1620715/3225 ??? Fail: 0 [12107.327898] spin_lock-torture: lock_torture_stats is stopping [12107.328192] Writes: Total: 53304415 Max/Min: 1620715/3225 ??? Fail: 0 [12107.328368] spin_lock-torture:--- End of test: SUCCESS: acq_writer_lim=0 bind_readers=0-63 bind_writers=0-63 call_rcu_chains=0 long_hold=100 nested_locks=0 nreaders_stress=0 nwriters_stress=128 onoff_holdoff=0 onoff_interval=0 rt_boost=2 rt_boost_factor=50 shuffle_interval=3 shutdown_secs=0 stat_interval=60 stutter=5 verbose=1 writer_fifo=0 # sudo ./locktorture.sh spin_lock 30 [12051.968992] Writes: Total: 47843400 Max/Min: 1335320/5942 ??? Fail: 0 [12051.969236] spin_lock-torture: lock_torture_stats is stopping [12051.969507] Writes: Total: 47843400 Max/Min: 1335320/5942 ??? Fail: 0 [12051.969813] spin_lock-torture:--- End of test: SUCCESS: acq_writer_lim=0 bind_readers=0-63 bind_writers=0-63 call_rcu_chains=0 long_hold=100 nested_locks=0 nreaders_stress=0 nwriters_stress=128 onoff_holdoff=0 onoff_interval=0 rt_boost=2 rt_boost_factor=50 shuffle_interval=3 shutdown_secs=0 stat_interval=60 stutter=5 verbose=1 writer_fifo=0 I do not see a big difference between spin_lock and rw_lock. In fact the locktorture.ko test shows that a spin_lock is slightly worse but it might be something that i could miss or is not accurate enough in this test. 
When it comes to the vmap test-suite and the difference between rw_lock and spin_lock, I have not spent much time figuring out the difference. At first glance, it could be that the cache-miss rate is higher when switching to rw_lock, or that rw_lock requires more atomic operations. -- Uladzislau Rezki
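One concrete source of the "more atomics" cost: even an uncontended read_lock() performs an atomic RMW on the shared lock word, so parallel readers still bounce the same cache line much as spin_lock() callers do. A simplified view of the qrwlock read fastpath (based on include/asm-generic/qrwlock.h):

<snip>
static inline void queued_read_lock(struct qrwlock *lock)
{
	int cnts;

	/* Every reader atomically bumps the reader count on the shared word. */
	cnts = atomic_add_return_acquire(_QR_BIAS, &lock->cnts);
	if (likely(!(cnts & _QW_WMASK)))
		return;

	/* A writer holds or is waiting for the lock: take the slowpath. */
	queued_read_lock_slowpath(lock);
}
<snip>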
On Tue, Jan 02, 2024 at 07:46:26PM +0100, Uladzislau Rezki (Sony) wrote: > Store allocated objects in a separate nodes. A va->va_start > address is converted into a correct node where it should > be placed and resided. An addr_to_node() function is used > to do a proper address conversion to determine a node that > contains a VA. > > Such approach balances VAs across nodes as a result an access > becomes scalable. Number of nodes in a system depends on number > of CPUs. > > Please note: > > 1. As of now allocated VAs are bound to a node-0. It means the > patch does not give any difference comparing with a current > behavior; > > 2. The global vmap_area_lock, vmap_area_root are removed as there > is no need in it anymore. The vmap_area_list is still kept and > is _empty_. It is exported for a kexec only; > > 3. The vmallocinfo and vread() have to be reworked to be able to > handle multiple nodes. > > Reviewed-by: Baoquan He <bhe@redhat.com> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> > --- > mm/vmalloc.c | 240 +++++++++++++++++++++++++++++++++++++-------------- > 1 file changed, 173 insertions(+), 67 deletions(-) > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > index 06bd843d18ae..786ecb18ae22 100644 > --- a/mm/vmalloc.c > +++ b/mm/vmalloc.c > @@ -728,11 +728,9 @@ EXPORT_SYMBOL(vmalloc_to_pfn); > #define DEBUG_AUGMENT_LOWEST_MATCH_CHECK 0 > > > -static DEFINE_SPINLOCK(vmap_area_lock); > static DEFINE_SPINLOCK(free_vmap_area_lock); > /* Export for kexec only */ > LIST_HEAD(vmap_area_list); > -static struct rb_root vmap_area_root = RB_ROOT; > static bool vmap_initialized __read_mostly; > > static struct rb_root purge_vmap_area_root = RB_ROOT; > @@ -772,6 +770,38 @@ static struct rb_root free_vmap_area_root = RB_ROOT; > */ > static DEFINE_PER_CPU(struct vmap_area *, ne_fit_preload_node); > > +/* > + * An effective vmap-node logic. Users make use of nodes instead > + * of a global heap. It allows to balance an access and mitigate > + * contention. > + */ > +struct rb_list { I'm not sure this name is very instructive - this contains a red/black tree root node, a list head and a lock, but the meaning of it is that it stores a red/black tree and a list of vmap_area objects and has a lock to protect access. A raw 'list_head' without a comment is always hard to parse, maybe some comments/embed in vmap_node? At the very least if you wanted to keep the name generic it should be something like ordered_rb_tree or something like this. > + struct rb_root root; > + struct list_head head; > + spinlock_t lock; > +}; > + > +static struct vmap_node { > + /* Bookkeeping data of this node. */ > + struct rb_list busy; > +} single; This may be a thing about encapsulation/naming or similar, but I'm a little confused as to why the rb_list type is maintained as a field rather than its fields embedded? > + > +static struct vmap_node *vmap_nodes = &single; > +static __read_mostly unsigned int nr_vmap_nodes = 1; > +static __read_mostly unsigned int vmap_zone_size = 1; It might be worth adding a comment here explaining that we're binding to a single node for now to maintain existing behaviour (and a brief description of what these values mean - for instance what unit vmap_zone_size is expressed in?) 
> + > +static inline unsigned int > +addr_to_node_id(unsigned long addr) > +{ > + return (addr / vmap_zone_size) % nr_vmap_nodes; > +} > + > +static inline struct vmap_node * > +addr_to_node(unsigned long addr) > +{ > + return &vmap_nodes[addr_to_node_id(addr)]; > +} > + > static __always_inline unsigned long > va_size(struct vmap_area *va) > { > @@ -803,10 +833,11 @@ unsigned long vmalloc_nr_pages(void) > } > > /* Look up the first VA which satisfies addr < va_end, NULL if none. */ > -static struct vmap_area *find_vmap_area_exceed_addr(unsigned long addr) > +static struct vmap_area * > +find_vmap_area_exceed_addr(unsigned long addr, struct rb_root *root) > { > struct vmap_area *va = NULL; > - struct rb_node *n = vmap_area_root.rb_node; > + struct rb_node *n = root->rb_node; > > addr = (unsigned long)kasan_reset_tag((void *)addr); > > @@ -1552,12 +1583,14 @@ __alloc_vmap_area(struct rb_root *root, struct list_head *head, > */ > static void free_vmap_area(struct vmap_area *va) > { > + struct vmap_node *vn = addr_to_node(va->va_start); > + I'm being nitty here, and while I know it's a vmalloc convention to use 'va' and 'vm', perhaps we can break away from the super short variable name convention and use 'vnode' or something for these values? I feel people might get confused between 'vm' and 'vn' for instance. > /* > * Remove from the busy tree/list. > */ > - spin_lock(&vmap_area_lock); > - unlink_va(va, &vmap_area_root); > - spin_unlock(&vmap_area_lock); > + spin_lock(&vn->busy.lock); > + unlink_va(va, &vn->busy.root); > + spin_unlock(&vn->busy.lock); > > /* > * Insert/Merge it back to the free tree/list. > @@ -1600,6 +1633,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, > int node, gfp_t gfp_mask, > unsigned long va_flags) > { > + struct vmap_node *vn; > struct vmap_area *va; > unsigned long freed; > unsigned long addr; > @@ -1645,9 +1679,11 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, > va->vm = NULL; > va->flags = va_flags; > > - spin_lock(&vmap_area_lock); > - insert_vmap_area(va, &vmap_area_root, &vmap_area_list); > - spin_unlock(&vmap_area_lock); > + vn = addr_to_node(va->va_start); > + > + spin_lock(&vn->busy.lock); > + insert_vmap_area(va, &vn->busy.root, &vn->busy.head); > + spin_unlock(&vn->busy.lock); > > BUG_ON(!IS_ALIGNED(va->va_start, align)); > BUG_ON(va->va_start < vstart); > @@ -1871,26 +1907,61 @@ static void free_unmap_vmap_area(struct vmap_area *va) > > struct vmap_area *find_vmap_area(unsigned long addr) > { > + struct vmap_node *vn; > struct vmap_area *va; > + int i, j; > > - spin_lock(&vmap_area_lock); > - va = __find_vmap_area(addr, &vmap_area_root); > - spin_unlock(&vmap_area_lock); > + /* > + * An addr_to_node_id(addr) converts an address to a node index > + * where a VA is located. If VA spans several zones and passed > + * addr is not the same as va->va_start, what is not common, we > + * may need to scan an extra nodes. See an example: For my understading when you say 'scan an extra nodes' do you mean scan just 1 extra node, or multiple? If the former I'd replace this with 'may need to scan an extra node' if the latter then 'may ened to scan extra nodes'. It's a nitty language thing, but also potentially changes the meaning of this! > + * > + * <--va--> > + * -|-----|-----|-----|-----|- > + * 1 2 0 1 > + * > + * VA resides in node 1 whereas it spans 1 and 2. If passed > + * addr is within a second node we should do extra work. 
We > + * should mention that it is rare and is a corner case from > + * the other hand it has to be covered. A very minor language style nit, but you've already said this is not common, I don't think you need this 'We should mention...' bit. It's not a big deal however! > + */ > + i = j = addr_to_node_id(addr); > + do { > + vn = &vmap_nodes[i]; > > - return va; > + spin_lock(&vn->busy.lock); > + va = __find_vmap_area(addr, &vn->busy.root); > + spin_unlock(&vn->busy.lock); > + > + if (va) > + return va; > + } while ((i = (i + 1) % nr_vmap_nodes) != j); If you comment above suggests that only 1 extra node might need to be scanned, should we stop after one iteration? > + > + return NULL; > } > > static struct vmap_area *find_unlink_vmap_area(unsigned long addr) > { > + struct vmap_node *vn; > struct vmap_area *va; > + int i, j; > > - spin_lock(&vmap_area_lock); > - va = __find_vmap_area(addr, &vmap_area_root); > - if (va) > - unlink_va(va, &vmap_area_root); > - spin_unlock(&vmap_area_lock); > + i = j = addr_to_node_id(addr); > + do { > + vn = &vmap_nodes[i]; > > - return va; > + spin_lock(&vn->busy.lock); > + va = __find_vmap_area(addr, &vn->busy.root); > + if (va) > + unlink_va(va, &vn->busy.root); > + spin_unlock(&vn->busy.lock); > + > + if (va) > + return va; > + } while ((i = (i + 1) % nr_vmap_nodes) != j); Maybe worth adding a comment saying to refer to the comment in find_vmap_area() to see why this loop is necessary. > + > + return NULL; > } > > /*** Per cpu kva allocator ***/ > @@ -2092,6 +2163,7 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask) > > static void free_vmap_block(struct vmap_block *vb) > { > + struct vmap_node *vn; > struct vmap_block *tmp; > struct xarray *xa; > > @@ -2099,9 +2171,10 @@ static void free_vmap_block(struct vmap_block *vb) > tmp = xa_erase(xa, addr_to_vb_idx(vb->va->va_start)); > BUG_ON(tmp != vb); > > - spin_lock(&vmap_area_lock); > - unlink_va(vb->va, &vmap_area_root); > - spin_unlock(&vmap_area_lock); > + vn = addr_to_node(vb->va->va_start); > + spin_lock(&vn->busy.lock); > + unlink_va(vb->va, &vn->busy.root); > + spin_unlock(&vn->busy.lock); > > free_vmap_area_noflush(vb->va); > kfree_rcu(vb, rcu_head); > @@ -2525,9 +2598,11 @@ static inline void setup_vmalloc_vm_locked(struct vm_struct *vm, > static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va, > unsigned long flags, const void *caller) > { > - spin_lock(&vmap_area_lock); > + struct vmap_node *vn = addr_to_node(va->va_start); > + > + spin_lock(&vn->busy.lock); > setup_vmalloc_vm_locked(vm, va, flags, caller); > - spin_unlock(&vmap_area_lock); > + spin_unlock(&vn->busy.lock); > } > > static void clear_vm_uninitialized_flag(struct vm_struct *vm) > @@ -3715,6 +3790,7 @@ static size_t vmap_ram_vread_iter(struct iov_iter *iter, const char *addr, > */ > long vread_iter(struct iov_iter *iter, const char *addr, size_t count) > { > + struct vmap_node *vn; > struct vmap_area *va; > struct vm_struct *vm; > char *vaddr; > @@ -3728,8 +3804,11 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count) Unrelated to your change but makes me feel a little unwell to see 'const char *addr'! Can we change this at some point? Or maybe I can :) > > remains = count; > > - spin_lock(&vmap_area_lock); > - va = find_vmap_area_exceed_addr((unsigned long)addr); > + /* Hooked to node_0 so far. */ > + vn = addr_to_node(0); Why can't we use addr for this call? We already enforce the node-0 only thing by setting nr_vmap_nodes to 1 right? 
And won't this be potentially subtly wrong when we later increase this? > + spin_lock(&vn->busy.lock); > + > + va = find_vmap_area_exceed_addr((unsigned long)addr, &vn->busy.root); > if (!va) > goto finished_zero; > > @@ -3737,7 +3816,7 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count) > if ((unsigned long)addr + remains <= va->va_start) > goto finished_zero; > > - list_for_each_entry_from(va, &vmap_area_list, list) { > + list_for_each_entry_from(va, &vn->busy.head, list) { > size_t copied; > > if (remains == 0) > @@ -3796,12 +3875,12 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count) > } > > finished_zero: > - spin_unlock(&vmap_area_lock); > + spin_unlock(&vn->busy.lock); > /* zero-fill memory holes */ > return count - remains + zero_iter(iter, remains); > finished: > /* Nothing remains, or We couldn't copy/zero everything. */ > - spin_unlock(&vmap_area_lock); > + spin_unlock(&vn->busy.lock); > > return count - remains; > } > @@ -4135,14 +4214,15 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets, > } > > /* insert all vm's */ > - spin_lock(&vmap_area_lock); > for (area = 0; area < nr_vms; area++) { > - insert_vmap_area(vas[area], &vmap_area_root, &vmap_area_list); > + struct vmap_node *vn = addr_to_node(vas[area]->va_start); > > + spin_lock(&vn->busy.lock); > + insert_vmap_area(vas[area], &vn->busy.root, &vn->busy.head); > setup_vmalloc_vm_locked(vms[area], vas[area], VM_ALLOC, > pcpu_get_vm_areas); > + spin_unlock(&vn->busy.lock); Hmm, before we were locking/unlocking once before the loop, now we're locking on each iteration, this seems inefficient. Seems like we need logic like: /* ... something to check nr_vms > 0 ... */ struct vmap_node *last_node = NULL; for (...) { struct vmap_node *vnode = addr_to_node(vas[area]->va_start); if (vnode != last_node) { spin_unlock(last_node->busy.lock); spin_lock(vnode->busy.lock); last_node = vnode; } ... } if (last_node) spin_unlock(last_node->busy.lock); To minimise the lock twiddling. What do you think? > } > - spin_unlock(&vmap_area_lock); > > /* > * Mark allocated areas as accessible. Do it now as a best-effort > @@ -4253,55 +4333,57 @@ bool vmalloc_dump_obj(void *object) > { > void *objp = (void *)PAGE_ALIGN((unsigned long)object); > const void *caller; > - struct vm_struct *vm; > struct vmap_area *va; > + struct vmap_node *vn; > unsigned long addr; > unsigned int nr_pages; > + bool success = false; > > - if (!spin_trylock(&vmap_area_lock)) > - return false; Nitpick on style for this, I really don't know why you are removing this early exit? It's far neater to have a guard clause than to nest a whole bunch of code below. > - va = __find_vmap_area((unsigned long)objp, &vmap_area_root); > - if (!va) { > - spin_unlock(&vmap_area_lock); > - return false; > - } > + vn = addr_to_node((unsigned long)objp); Later in the patch where you fix build bot issues with the below __find_vmap_area() invocation, you move from addr to (unsigned long)objp. However since you're already referring to that here, why not change what addr refers to and use that in both instances, e.g. unsigned long addr = (unsigned long)objp; Then update things that refer to the objp value as necessary. 
> > - vm = va->vm; > - if (!vm) { > - spin_unlock(&vmap_area_lock); > - return false; > + if (spin_trylock(&vn->busy.lock)) { > + va = __find_vmap_area(addr, &vn->busy.root); > + > + if (va && va->vm) { > + addr = (unsigned long)va->vm->addr; > + caller = va->vm->caller; > + nr_pages = va->vm->nr_pages; Again it feels like you're undoing some good here, now you're referencing va->vm over and over when you could simply assign vm = va->vm as the original code did. Also again it'd be nicer to use an early exit/guard clause approach. > + success = true; > + } > + > + spin_unlock(&vn->busy.lock); > } > - addr = (unsigned long)vm->addr; > - caller = vm->caller; > - nr_pages = vm->nr_pages; > - spin_unlock(&vmap_area_lock); > - pr_cont(" %u-page vmalloc region starting at %#lx allocated at %pS\n", > - nr_pages, addr, caller); > - return true; > + > + if (success) > + pr_cont(" %u-page vmalloc region starting at %#lx allocated at %pS\n", > + nr_pages, addr, caller); With the redefinition of addr, you could then simply put va->vm->addr here. > + > + return success; > } > #endif These are all essentially style nits (the actual bug was fixed by your follow up patch for the build bots) :) > > #ifdef CONFIG_PROC_FS > static void *s_start(struct seq_file *m, loff_t *pos) > - __acquires(&vmap_purge_lock) > - __acquires(&vmap_area_lock) Do we want to replace these __acquires() directives? I suppose we simply cannot now we need to look up the node. > { > + struct vmap_node *vn = addr_to_node(0); Hm does the procfs operation span only one node? I guess we can start from the initial node for an iteration, but I wonder if '&vmap_nodes[0]' here is a more 'honest' thing to do rather than to assume that address 0 gets translated to node zero here? I think a comment like: /* We start from node 0 */ Would be useful here at any rate. > + > mutex_lock(&vmap_purge_lock); > - spin_lock(&vmap_area_lock); > + spin_lock(&vn->busy.lock); > > - return seq_list_start(&vmap_area_list, *pos); > + return seq_list_start(&vn->busy.head, *pos); > } > > static void *s_next(struct seq_file *m, void *p, loff_t *pos) > { > - return seq_list_next(p, &vmap_area_list, pos); > + struct vmap_node *vn = addr_to_node(0); This one I'm a little more uncertain of, obviously comments above still apply but actually shouldn't we add a check to see if we're at the end of the list and should look at the next node? Even if we only have one for now, I don't like the idea of leaving in hardcoded things that might get missed when we move to nr_vmap_nodes > 1. For instance right now if you increased this above 1 it'd break things right? I'd say better to write logic assuming nr_vmap_nodes _could_ be > 1 even if, to start, it won't be. > + return seq_list_next(p, &vn->busy.head, pos); > } > > static void s_stop(struct seq_file *m, void *p) > - __releases(&vmap_area_lock) > - __releases(&vmap_purge_lock) > { > - spin_unlock(&vmap_area_lock); > + struct vmap_node *vn = addr_to_node(0); See comments above about use of addr_to_node(0). > + > + spin_unlock(&vn->busy.lock); > mutex_unlock(&vmap_purge_lock); > } > > @@ -4344,9 +4426,11 @@ static void show_purge_info(struct seq_file *m) > > static int s_show(struct seq_file *m, void *p) > { > + struct vmap_node *vn; > struct vmap_area *va; > struct vm_struct *v; > > + vn = addr_to_node(0); This one is really quite icky, should we make it easy for a vmap_area to know its vmap_node? How is this going to work once nr_vmap_nodes > 1? 
> va = list_entry(p, struct vmap_area, list); > > if (!va->vm) { > @@ -4397,7 +4481,7 @@ static int s_show(struct seq_file *m, void *p) > * As a final step, dump "unpurged" areas. > */ > final: > - if (list_is_last(&va->list, &vmap_area_list)) > + if (list_is_last(&va->list, &vn->busy.head)) > show_purge_info(m); > > return 0; > @@ -4428,7 +4512,8 @@ static void vmap_init_free_space(void) > { > unsigned long vmap_start = 1; > const unsigned long vmap_end = ULONG_MAX; > - struct vmap_area *busy, *free; > + struct vmap_area *free; > + struct vm_struct *busy; > > /* > * B F B B B F > @@ -4436,12 +4521,12 @@ static void vmap_init_free_space(void) > * | The KVA space | > * |<--------------------------------->| > */ > - list_for_each_entry(busy, &vmap_area_list, list) { > - if (busy->va_start - vmap_start > 0) { > + for (busy = vmlist; busy; busy = busy->next) { > + if ((unsigned long) busy->addr - vmap_start > 0) { > free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT); > if (!WARN_ON_ONCE(!free)) { > free->va_start = vmap_start; > - free->va_end = busy->va_start; > + free->va_end = (unsigned long) busy->addr; > > insert_vmap_area_augment(free, NULL, > &free_vmap_area_root, > @@ -4449,7 +4534,7 @@ static void vmap_init_free_space(void) > } > } > > - vmap_start = busy->va_end; > + vmap_start = (unsigned long) busy->addr + busy->size; > } > > if (vmap_end - vmap_start > 0) { > @@ -4465,9 +4550,23 @@ static void vmap_init_free_space(void) > } > } > > +static void vmap_init_nodes(void) > +{ > + struct vmap_node *vn; > + int i; > + > + for (i = 0; i < nr_vmap_nodes; i++) { > + vn = &vmap_nodes[i]; > + vn->busy.root = RB_ROOT; > + INIT_LIST_HEAD(&vn->busy.head); > + spin_lock_init(&vn->busy.lock); > + } > +} > + > void __init vmalloc_init(void) > { > struct vmap_area *va; > + struct vmap_node *vn; > struct vm_struct *tmp; > int i; > > @@ -4489,6 +4588,11 @@ void __init vmalloc_init(void) > xa_init(&vbq->vmap_blocks); > } > > + /* > + * Setup nodes before importing vmlist. > + */ > + vmap_init_nodes(); > + > /* Import existing vmlist entries. */ > for (tmp = vmlist; tmp; tmp = tmp->next) { > va = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT); > @@ -4498,7 +4602,9 @@ void __init vmalloc_init(void) > va->va_start = (unsigned long)tmp->addr; > va->va_end = va->va_start + tmp->size; > va->vm = tmp; > - insert_vmap_area(va, &vmap_area_root, &vmap_area_list); > + > + vn = addr_to_node(va->va_start); > + insert_vmap_area(va, &vn->busy.root, &vn->busy.head); > } > > /* > -- > 2.39.2 >
On Tue, Jan 16, 2024 at 11:25:43PM +0000, Lorenzo Stoakes wrote: > On Tue, Jan 02, 2024 at 07:46:26PM +0100, Uladzislau Rezki (Sony) wrote: > > Store allocated objects in a separate nodes. A va->va_start > > address is converted into a correct node where it should > > be placed and resided. An addr_to_node() function is used > > to do a proper address conversion to determine a node that > > contains a VA. > > > > Such approach balances VAs across nodes as a result an access > > becomes scalable. Number of nodes in a system depends on number > > of CPUs. > > > > Please note: > > > > 1. As of now allocated VAs are bound to a node-0. It means the > > patch does not give any difference comparing with a current > > behavior; > > > > 2. The global vmap_area_lock, vmap_area_root are removed as there > > is no need in it anymore. The vmap_area_list is still kept and > > is _empty_. It is exported for a kexec only; > > > > 3. The vmallocinfo and vread() have to be reworked to be able to > > handle multiple nodes. > > > > Reviewed-by: Baoquan He <bhe@redhat.com> > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> > > --- > > mm/vmalloc.c | 240 +++++++++++++++++++++++++++++++++++++-------------- > > 1 file changed, 173 insertions(+), 67 deletions(-) > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > > index 06bd843d18ae..786ecb18ae22 100644 > > --- a/mm/vmalloc.c > > +++ b/mm/vmalloc.c > > @@ -728,11 +728,9 @@ EXPORT_SYMBOL(vmalloc_to_pfn); > > #define DEBUG_AUGMENT_LOWEST_MATCH_CHECK 0 > > > > > > -static DEFINE_SPINLOCK(vmap_area_lock); > > static DEFINE_SPINLOCK(free_vmap_area_lock); > > /* Export for kexec only */ > > LIST_HEAD(vmap_area_list); > > -static struct rb_root vmap_area_root = RB_ROOT; > > static bool vmap_initialized __read_mostly; > > > > static struct rb_root purge_vmap_area_root = RB_ROOT; > > @@ -772,6 +770,38 @@ static struct rb_root free_vmap_area_root = RB_ROOT; > > */ > > static DEFINE_PER_CPU(struct vmap_area *, ne_fit_preload_node); > > > > +/* > > + * An effective vmap-node logic. Users make use of nodes instead > > + * of a global heap. It allows to balance an access and mitigate > > + * contention. > > + */ > > +struct rb_list { > > I'm not sure this name is very instructive - this contains a red/black tree > root node, a list head and a lock, but the meaning of it is that it stores > a red/black tree and a list of vmap_area objects and has a lock to protect > access. > > A raw 'list_head' without a comment is always hard to parse, maybe some > comments/embed in vmap_node? > > At the very least if you wanted to keep the name generic it should be > something like ordered_rb_tree or something like this. > I can add a comment in the vmap_node. rb_list describes a single, solid data structure where a list and rb-tree are part of one entity protected by lock. Similar to the B+tree where you have leaf nodes linked between each other in order to do a fast sequential traversal. > > + struct rb_root root; > > + struct list_head head; > > + spinlock_t lock; > > +}; > > + > > +static struct vmap_node { > > + /* Bookkeeping data of this node. */ > > + struct rb_list busy; > > +} single; > > This may be a thing about encapsulation/naming or similar, but I'm a little > confused as to why the rb_list type is maintained as a field rather than > its fields embedded? > The "struct vmap_node" will be extended by the following patches in the series. 
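To give an idea, the comment I have in mind is along these lines (only a sketch of the wording, to be polished):

	/*
	 * A per-node "busy" tree/list pair. The rb-tree is used for a
	 * fast address lookup, the list keeps the VAs in address order
	 * for a sequential traversal, and both are protected by one
	 * lock, i.e. it is one solid data structure. Conceptually it is
	 * similar to the linked leaf nodes of a B+ tree.
	 */
	struct rb_list {
		struct rb_root root;
		struct list_head head;
		spinlock_t lock;
	};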
> > + > > +static struct vmap_node *vmap_nodes = &single; > > +static __read_mostly unsigned int nr_vmap_nodes = 1; > > +static __read_mostly unsigned int vmap_zone_size = 1; > > It might be worth adding a comment here explaining that we're binding to a > single node for now to maintain existing behaviour (and a brief description > of what these values mean - for instance what unit vmap_zone_size is > expressed in?) > Right. Agree on it :) > > + > > +static inline unsigned int > > +addr_to_node_id(unsigned long addr) > > +{ > > + return (addr / vmap_zone_size) % nr_vmap_nodes; > > +} > > + > > +static inline struct vmap_node * > > +addr_to_node(unsigned long addr) > > +{ > > + return &vmap_nodes[addr_to_node_id(addr)]; > > +} > > + > > static __always_inline unsigned long > > va_size(struct vmap_area *va) > > { > > @@ -803,10 +833,11 @@ unsigned long vmalloc_nr_pages(void) > > } > > > > /* Look up the first VA which satisfies addr < va_end, NULL if none. */ > > -static struct vmap_area *find_vmap_area_exceed_addr(unsigned long addr) > > +static struct vmap_area * > > +find_vmap_area_exceed_addr(unsigned long addr, struct rb_root *root) > > { > > struct vmap_area *va = NULL; > > - struct rb_node *n = vmap_area_root.rb_node; > > + struct rb_node *n = root->rb_node; > > > > addr = (unsigned long)kasan_reset_tag((void *)addr); > > > > @@ -1552,12 +1583,14 @@ __alloc_vmap_area(struct rb_root *root, struct list_head *head, > > */ > > static void free_vmap_area(struct vmap_area *va) > > { > > + struct vmap_node *vn = addr_to_node(va->va_start); > > + > > I'm being nitty here, and while I know it's a vmalloc convention to use > 'va' and 'vm', perhaps we can break away from the super short variable name > convention and use 'vnode' or something for these values? > > I feel people might get confused between 'vm' and 'vn' for instance. > vnode, varea? > > /* > > * Remove from the busy tree/list. > > */ > > - spin_lock(&vmap_area_lock); > > - unlink_va(va, &vmap_area_root); > > - spin_unlock(&vmap_area_lock); > > + spin_lock(&vn->busy.lock); > > + unlink_va(va, &vn->busy.root); > > + spin_unlock(&vn->busy.lock); > > > > /* > > * Insert/Merge it back to the free tree/list. > > @@ -1600,6 +1633,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, > > int node, gfp_t gfp_mask, > > unsigned long va_flags) > > { > > + struct vmap_node *vn; > > struct vmap_area *va; > > unsigned long freed; > > unsigned long addr; > > @@ -1645,9 +1679,11 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, > > va->vm = NULL; > > va->flags = va_flags; > > > > - spin_lock(&vmap_area_lock); > > - insert_vmap_area(va, &vmap_area_root, &vmap_area_list); > > - spin_unlock(&vmap_area_lock); > > + vn = addr_to_node(va->va_start); > > + > > + spin_lock(&vn->busy.lock); > > + insert_vmap_area(va, &vn->busy.root, &vn->busy.head); > > + spin_unlock(&vn->busy.lock); > > > > BUG_ON(!IS_ALIGNED(va->va_start, align)); > > BUG_ON(va->va_start < vstart); > > @@ -1871,26 +1907,61 @@ static void free_unmap_vmap_area(struct vmap_area *va) > > > > struct vmap_area *find_vmap_area(unsigned long addr) > > { > > + struct vmap_node *vn; > > struct vmap_area *va; > > + int i, j; > > > > - spin_lock(&vmap_area_lock); > > - va = __find_vmap_area(addr, &vmap_area_root); > > - spin_unlock(&vmap_area_lock); > > + /* > > + * An addr_to_node_id(addr) converts an address to a node index > > + * where a VA is located. 
If VA spans several zones and passed > > + * addr is not the same as va->va_start, what is not common, we > > + * may need to scan an extra nodes. See an example: > > For my understading when you say 'scan an extra nodes' do you mean scan > just 1 extra node, or multiple? If the former I'd replace this with 'may > need to scan an extra node' if the latter then 'may ened to scan extra > nodes'. > > It's a nitty language thing, but also potentially changes the meaning of > this! > Typo, i should replace it to: scan extra nodes. > > + * > > + * <--va--> > > + * -|-----|-----|-----|-----|- > > + * 1 2 0 1 > > + * > > + * VA resides in node 1 whereas it spans 1 and 2. If passed > > + * addr is within a second node we should do extra work. We > > + * should mention that it is rare and is a corner case from > > + * the other hand it has to be covered. > > A very minor language style nit, but you've already said this is not > common, I don't think you need this 'We should mention...' bit. It's not a > big deal however! > No problem. We can remove it! > > + */ > > + i = j = addr_to_node_id(addr); > > + do { > > + vn = &vmap_nodes[i]; > > > > - return va; > > + spin_lock(&vn->busy.lock); > > + va = __find_vmap_area(addr, &vn->busy.root); > > + spin_unlock(&vn->busy.lock); > > + > > + if (va) > > + return va; > > + } while ((i = (i + 1) % nr_vmap_nodes) != j); > > If you comment above suggests that only 1 extra node might need to be > scanned, should we stop after one iteration? > Not really. Though we can improve it further to scan backward. > > + > > + return NULL; > > } > > > > static struct vmap_area *find_unlink_vmap_area(unsigned long addr) > > { > > + struct vmap_node *vn; > > struct vmap_area *va; > > + int i, j; > > > > - spin_lock(&vmap_area_lock); > > - va = __find_vmap_area(addr, &vmap_area_root); > > - if (va) > > - unlink_va(va, &vmap_area_root); > > - spin_unlock(&vmap_area_lock); > > + i = j = addr_to_node_id(addr); > > + do { > > + vn = &vmap_nodes[i]; > > > > - return va; > > + spin_lock(&vn->busy.lock); > > + va = __find_vmap_area(addr, &vn->busy.root); > > + if (va) > > + unlink_va(va, &vn->busy.root); > > + spin_unlock(&vn->busy.lock); > > + > > + if (va) > > + return va; > > + } while ((i = (i + 1) % nr_vmap_nodes) != j); > > Maybe worth adding a comment saying to refer to the comment in > find_vmap_area() to see why this loop is necessary. > OK. We can do it to make it better for reading. 
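For example, something like this above the loop in find_unlink_vmap_area() (a sketch only):

	/*
	 * See the comment in find_vmap_area() for why the lookup may
	 * have to be retried in other nodes.
	 */
	i = j = addr_to_node_id(addr);
	do {
		vn = &vmap_nodes[i];
		...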
> > + > > + return NULL; > > } > > > > /*** Per cpu kva allocator ***/ > > @@ -2092,6 +2163,7 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask) > > > > static void free_vmap_block(struct vmap_block *vb) > > { > > + struct vmap_node *vn; > > struct vmap_block *tmp; > > struct xarray *xa; > > > > @@ -2099,9 +2171,10 @@ static void free_vmap_block(struct vmap_block *vb) > > tmp = xa_erase(xa, addr_to_vb_idx(vb->va->va_start)); > > BUG_ON(tmp != vb); > > > > - spin_lock(&vmap_area_lock); > > - unlink_va(vb->va, &vmap_area_root); > > - spin_unlock(&vmap_area_lock); > > + vn = addr_to_node(vb->va->va_start); > > + spin_lock(&vn->busy.lock); > > + unlink_va(vb->va, &vn->busy.root); > > + spin_unlock(&vn->busy.lock); > > > > free_vmap_area_noflush(vb->va); > > kfree_rcu(vb, rcu_head); > > @@ -2525,9 +2598,11 @@ static inline void setup_vmalloc_vm_locked(struct vm_struct *vm, > > static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va, > > unsigned long flags, const void *caller) > > { > > - spin_lock(&vmap_area_lock); > > + struct vmap_node *vn = addr_to_node(va->va_start); > > + > > + spin_lock(&vn->busy.lock); > > setup_vmalloc_vm_locked(vm, va, flags, caller); > > - spin_unlock(&vmap_area_lock); > > + spin_unlock(&vn->busy.lock); > > } > > > > static void clear_vm_uninitialized_flag(struct vm_struct *vm) > > @@ -3715,6 +3790,7 @@ static size_t vmap_ram_vread_iter(struct iov_iter *iter, const char *addr, > > */ > > long vread_iter(struct iov_iter *iter, const char *addr, size_t count) > > { > > + struct vmap_node *vn; > > struct vmap_area *va; > > struct vm_struct *vm; > > char *vaddr; > > @@ -3728,8 +3804,11 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count) > > Unrelated to your change but makes me feel a little unwell to see 'const > char *addr'! Can we change this at some point? Or maybe I can :) > You are welcome :) > > > > remains = count; > > > > - spin_lock(&vmap_area_lock); > > - va = find_vmap_area_exceed_addr((unsigned long)addr); > > + /* Hooked to node_0 so far. */ > > + vn = addr_to_node(0); > > Why can't we use addr for this call? We already enforce the node-0 only > thing by setting nr_vmap_nodes to 1 right? And won't this be potentially > subtly wrong when we later increase this? > I used to have 0 here. But please note, it is changed by the next patch in this series. > > + spin_lock(&vn->busy.lock); > > + > > + va = find_vmap_area_exceed_addr((unsigned long)addr, &vn->busy.root); > > if (!va) > > goto finished_zero; > > > > @@ -3737,7 +3816,7 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count) > > if ((unsigned long)addr + remains <= va->va_start) > > goto finished_zero; > > > > - list_for_each_entry_from(va, &vmap_area_list, list) { > > + list_for_each_entry_from(va, &vn->busy.head, list) { > > size_t copied; > > > > if (remains == 0) > > @@ -3796,12 +3875,12 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count) > > } > > > > finished_zero: > > - spin_unlock(&vmap_area_lock); > > + spin_unlock(&vn->busy.lock); > > /* zero-fill memory holes */ > > return count - remains + zero_iter(iter, remains); > > finished: > > /* Nothing remains, or We couldn't copy/zero everything. 
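Just for reference, the literal change being suggested here would be:

	vn = addr_to_node((unsigned long)addr);

but, as said, this place is reworked by the next patch in the series anyway.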
*/ > > - spin_unlock(&vmap_area_lock); > > + spin_unlock(&vn->busy.lock); > > > > return count - remains; > > } > > @@ -4135,14 +4214,15 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets, > > } > > > > /* insert all vm's */ > > - spin_lock(&vmap_area_lock); > > for (area = 0; area < nr_vms; area++) { > > - insert_vmap_area(vas[area], &vmap_area_root, &vmap_area_list); > > + struct vmap_node *vn = addr_to_node(vas[area]->va_start); > > > > + spin_lock(&vn->busy.lock); > > + insert_vmap_area(vas[area], &vn->busy.root, &vn->busy.head); > > setup_vmalloc_vm_locked(vms[area], vas[area], VM_ALLOC, > > pcpu_get_vm_areas); > > + spin_unlock(&vn->busy.lock); > > Hmm, before we were locking/unlocking once before the loop, now we're > locking on each iteration, this seems inefficient. > > Seems like we need logic like: > > /* ... something to check nr_vms > 0 ... */ > struct vmap_node *last_node = NULL; > > for (...) { > struct vmap_node *vnode = addr_to_node(vas[area]->va_start); > > if (vnode != last_node) { > spin_unlock(last_node->busy.lock); > spin_lock(vnode->busy.lock); > last_node = vnode; > } > > ... > } > > if (last_node) > spin_unlock(last_node->busy.lock); > > To minimise the lock twiddling. What do you think? > This per-cpu-allocator prefetches several VA units per-cpu. I do not find it as critical because it is not a hot path for the per-cpu allocator. When its buffers are exhausted it does an extra prefetch. So it is not frequent. > > > } > > - spin_unlock(&vmap_area_lock); > > > > /* > > * Mark allocated areas as accessible. Do it now as a best-effort > > @@ -4253,55 +4333,57 @@ bool vmalloc_dump_obj(void *object) > > { > > void *objp = (void *)PAGE_ALIGN((unsigned long)object); > > const void *caller; > > - struct vm_struct *vm; > > struct vmap_area *va; > > + struct vmap_node *vn; > > unsigned long addr; > > unsigned int nr_pages; > > + bool success = false; > > > > - if (!spin_trylock(&vmap_area_lock)) > > - return false; > > Nitpick on style for this, I really don't know why you are removing this > early exit? It's far neater to have a guard clause than to nest a whole > bunch of code below. > Hm... I can return back as it used to be. I do not have a strong opinion here. > > - va = __find_vmap_area((unsigned long)objp, &vmap_area_root); > > - if (!va) { > > - spin_unlock(&vmap_area_lock); > > - return false; > > - } > > + vn = addr_to_node((unsigned long)objp); > > Later in the patch where you fix build bot issues with the below > __find_vmap_area() invocation, you move from addr to (unsigned long)objp. > > However since you're already referring to that here, why not change what > addr refers to and use that in both instances, e.g. > > unsigned long addr = (unsigned long)objp; > > Then update things that refer to the objp value as necessary. > That is what i was thinking of. We can send a separate patch for this. > > > > - vm = va->vm; > > - if (!vm) { > > - spin_unlock(&vmap_area_lock); > > - return false; > > + if (spin_trylock(&vn->busy.lock)) { > > + va = __find_vmap_area(addr, &vn->busy.root); > > + > > + if (va && va->vm) { > > + addr = (unsigned long)va->vm->addr; > > + caller = va->vm->caller; > > + nr_pages = va->vm->nr_pages; > > Again it feels like you're undoing some good here, now you're referencing > va->vm over and over when you could simply assign vm = va->vm as the > original code did. > > Also again it'd be nicer to use an early exit/guard clause approach. > We can change it in separate patch. 
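To be concrete, what I have in mind for that separate patch is roughly the below - an untested sketch only, combining the early-return style, the addr alias and a local "vm" as you suggest:

	bool vmalloc_dump_obj(void *object)
	{
		void *objp = (void *)PAGE_ALIGN((unsigned long)object);
		unsigned long addr = (unsigned long)objp;
		struct vmap_node *vn = addr_to_node(addr);
		struct vmap_area *va;
		struct vm_struct *vm;
		const void *caller;
		unsigned int nr_pages;

		if (!spin_trylock(&vn->busy.lock))
			return false;

		va = __find_vmap_area(addr, &vn->busy.root);
		if (!va || !va->vm) {
			spin_unlock(&vn->busy.lock);
			return false;
		}

		/* Snapshot what is needed under the lock, print after dropping it. */
		vm = va->vm;
		addr = (unsigned long)vm->addr;
		caller = vm->caller;
		nr_pages = vm->nr_pages;
		spin_unlock(&vn->busy.lock);

		pr_cont(" %u-page vmalloc region starting at %#lx allocated at %pS\n",
			nr_pages, addr, caller);

		return true;
	}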
> > + success = true; > > + } > > + > > + spin_unlock(&vn->busy.lock); > > } > > - addr = (unsigned long)vm->addr; > > - caller = vm->caller; > > - nr_pages = vm->nr_pages; > > - spin_unlock(&vmap_area_lock); > > - pr_cont(" %u-page vmalloc region starting at %#lx allocated at %pS\n", > > - nr_pages, addr, caller); > > - return true; > > + > > + if (success) > > + pr_cont(" %u-page vmalloc region starting at %#lx allocated at %pS\n", > > + nr_pages, addr, caller); > > With the redefinition of addr, you could then simply put va->vm->addr here. > > > + > > + return success; > > } > > #endif > > These are all essentially style nits (the actual bug was fixed by your > follow up patch for the build bots) :) > Right :) > > > > #ifdef CONFIG_PROC_FS > > static void *s_start(struct seq_file *m, loff_t *pos) > > - __acquires(&vmap_purge_lock) > > - __acquires(&vmap_area_lock) > > Do we want to replace these __acquires() directives? I suppose we simply > cannot now we need to look up the node. > Yep. It is reworked anyway in another patch. > > { > > + struct vmap_node *vn = addr_to_node(0); > > Hm does the procfs operation span only one node? I guess we can start from > the initial node for an iteration, but I wonder if '&vmap_nodes[0]' here is > a more 'honest' thing to do rather than to assume that address 0 gets > translated to node zero here? > > I think a comment like: > > /* We start from node 0 */ > > Would be useful here at any rate. > It works since the number of nodes is set to 1. This is filled in and completed by later patches in this series. > > + > > mutex_lock(&vmap_purge_lock); > > - spin_lock(&vmap_area_lock); > > + spin_lock(&vn->busy.lock); > > > > - return seq_list_start(&vmap_area_list, *pos); > > + return seq_list_start(&vn->busy.head, *pos); > > } > > > > static void *s_next(struct seq_file *m, void *p, loff_t *pos) > > { > > - return seq_list_next(p, &vmap_area_list, pos); > > + struct vmap_node *vn = addr_to_node(0); > > This one I'm a little more uncertain of, obviously comments above still > apply but actually shouldn't we add a check to see if we're at the end of > the list and should look at the next node? > > Even if we only have one for now, I don't like the idea of leaving in > hardcoded things that might get missed when we move to nr_vmap_nodes > 1. > > For instance right now if you increased this above 1 it'd break things > right? > > I'd say better to write logic assuming nr_vmap_nodes _could_ be > 1 even > if, to start, it won't be. > Same as above. It is incomplete and sticks to a single node. Further patches solve this fully. > > + return seq_list_next(p, &vn->busy.head, pos); > > } > > > > static void s_stop(struct seq_file *m, void *p) > > - __releases(&vmap_area_lock) > > - __releases(&vmap_purge_lock) > > { > > - spin_unlock(&vmap_area_lock); > > + struct vmap_node *vn = addr_to_node(0); > > See comments above about use of addr_to_node(0). > All of this s_start/s_next/etc. is removed and reworked by the following patches. > > + > > + spin_unlock(&vn->busy.lock); > > mutex_unlock(&vmap_purge_lock); > > } > > > > @@ -4344,9 +4426,11 @@ static void show_purge_info(struct seq_file *m) > > > > static int s_show(struct seq_file *m, void *p) > > { > > + struct vmap_node *vn; > > struct vmap_area *va; > > struct vm_struct *v; > > > > + vn = addr_to_node(0); > > This one is really quite icky, should we make it easy for a vmap_area to > know its vmap_node? How is this going to work once nr_vmap_nodes > 1? > Same as above. Thank you for the review!
I can fix the comments as separate patches if there are no objections. -- Uladzislau Rezki
On Thu, Jan 18, 2024 at 02:15:31PM +0100, Uladzislau Rezki wrote: [snip] > > > > + struct rb_root root; > > > + struct list_head head; > > > + spinlock_t lock; > > > +}; > > > + > > > +static struct vmap_node { > > > + /* Bookkeeping data of this node. */ > > > + struct rb_list busy; > > > +} single; > > > > This may be a thing about encapsulation/naming or similar, but I'm a little > > confused as to why the rb_list type is maintained as a field rather than > > its fields embedded? > > > The "struct vmap_node" will be extended by the following patches in the > series. > Yeah sorry I missed this, only realising after I sent...! > > > + > > > +static struct vmap_node *vmap_nodes = &single; > > > +static __read_mostly unsigned int nr_vmap_nodes = 1; > > > +static __read_mostly unsigned int vmap_zone_size = 1; > > > > It might be worth adding a comment here explaining that we're binding to a > > single node for now to maintain existing behaviour (and a brief description > > of what these values mean - for instance what unit vmap_zone_size is > > expressed in?) > > > Right. Agree on it :) > Indeed :) [snip] > > > /* Look up the first VA which satisfies addr < va_end, NULL if none. */ > > > -static struct vmap_area *find_vmap_area_exceed_addr(unsigned long addr) > > > +static struct vmap_area * > > > +find_vmap_area_exceed_addr(unsigned long addr, struct rb_root *root) > > > { > > > struct vmap_area *va = NULL; > > > - struct rb_node *n = vmap_area_root.rb_node; > > > + struct rb_node *n = root->rb_node; > > > > > > addr = (unsigned long)kasan_reset_tag((void *)addr); > > > > > > @@ -1552,12 +1583,14 @@ __alloc_vmap_area(struct rb_root *root, struct list_head *head, > > > */ > > > static void free_vmap_area(struct vmap_area *va) > > > { > > > + struct vmap_node *vn = addr_to_node(va->va_start); > > > + > > > > I'm being nitty here, and while I know it's a vmalloc convention to use > > 'va' and 'vm', perhaps we can break away from the super short variable name > > convention and use 'vnode' or something for these values? > > > > I feel people might get confused between 'vm' and 'vn' for instance. > > > vnode, varea? I think 'vm' and 'va' are fine, just scanning through easy to mistake 'vn' and 'vm'. Obviously a litle nitpicky! You could replace all but a bit churny, so I think vn -> vnode works best imo. [snip] > > > struct vmap_area *find_vmap_area(unsigned long addr) > > > { > > > + struct vmap_node *vn; > > > struct vmap_area *va; > > > + int i, j; > > > > > > - spin_lock(&vmap_area_lock); > > > - va = __find_vmap_area(addr, &vmap_area_root); > > > - spin_unlock(&vmap_area_lock); > > > + /* > > > + * An addr_to_node_id(addr) converts an address to a node index > > > + * where a VA is located. If VA spans several zones and passed > > > + * addr is not the same as va->va_start, what is not common, we > > > + * may need to scan an extra nodes. See an example: > > > > For my understading when you say 'scan an extra nodes' do you mean scan > > just 1 extra node, or multiple? If the former I'd replace this with 'may > > need to scan an extra node' if the latter then 'may ened to scan extra > > nodes'. > > > > It's a nitty language thing, but also potentially changes the meaning of > > this! > > > Typo, i should replace it to: scan extra nodes. Thanks. > > > > + * > > > + * <--va--> > > > + * -|-----|-----|-----|-----|- > > > + * 1 2 0 1 > > > + * > > > + * VA resides in node 1 whereas it spans 1 and 2. If passed > > > + * addr is within a second node we should do extra work. 
We > > > + * should mention that it is rare and is a corner case from > > > + * the other hand it has to be covered. > > > > A very minor language style nit, but you've already said this is not > > common, I don't think you need this 'We should mention...' bit. It's not a > > big deal however! > > > No problem. We can remove it! Thanks. > > > > + */ > > > + i = j = addr_to_node_id(addr); > > > + do { > > > + vn = &vmap_nodes[i]; > > > > > > - return va; > > > + spin_lock(&vn->busy.lock); > > > + va = __find_vmap_area(addr, &vn->busy.root); > > > + spin_unlock(&vn->busy.lock); > > > + > > > + if (va) > > > + return va; > > > + } while ((i = (i + 1) % nr_vmap_nodes) != j); > > > > If you comment above suggests that only 1 extra node might need to be > > scanned, should we stop after one iteration? > > > Not really. Though we can improve it further to scan backward. I think it'd be good to clarify in the comment above that the VA could span more than 1 node then, as the diagram seems to imply only 1 (I think just simply because of the example you were showing). [snip] > > > static struct vmap_area *find_unlink_vmap_area(unsigned long addr) > > > { > > > + struct vmap_node *vn; > > > struct vmap_area *va; > > > + int i, j; > > > > > > - spin_lock(&vmap_area_lock); > > > - va = __find_vmap_area(addr, &vmap_area_root); > > > - if (va) > > > - unlink_va(va, &vmap_area_root); > > > - spin_unlock(&vmap_area_lock); > > > + i = j = addr_to_node_id(addr); > > > + do { > > > + vn = &vmap_nodes[i]; > > > > > > - return va; > > > + spin_lock(&vn->busy.lock); > > > + va = __find_vmap_area(addr, &vn->busy.root); > > > + if (va) > > > + unlink_va(va, &vn->busy.root); > > > + spin_unlock(&vn->busy.lock); > > > + > > > + if (va) > > > + return va; > > > + } while ((i = (i + 1) % nr_vmap_nodes) != j); > > > > Maybe worth adding a comment saying to refer to the comment in > > find_vmap_area() to see why this loop is necessary. > > > OK. We can do it to make it better for reading. Thanks! [snip] > > > @@ -3728,8 +3804,11 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count) > > > > Unrelated to your change but makes me feel a little unwell to see 'const > > char *addr'! Can we change this at some point? Or maybe I can :) > > > You are welcome :) Haha ;) yes I think I might tbh, I have noted it down. > > > > > > > remains = count; > > > > > > - spin_lock(&vmap_area_lock); > > > - va = find_vmap_area_exceed_addr((unsigned long)addr); > > > + /* Hooked to node_0 so far. */ > > > + vn = addr_to_node(0); > > > > Why can't we use addr for this call? We already enforce the node-0 only > > thing by setting nr_vmap_nodes to 1 right? And won't this be potentially > > subtly wrong when we later increase this? > > > I used to have 0 here. But please note, it is changed by the next patch in > this series. Yeah sorry, again hadn't noticed this. [snip] > > > + spin_lock(&vn->busy.lock); > > > + insert_vmap_area(vas[area], &vn->busy.root, &vn->busy.head); > > > setup_vmalloc_vm_locked(vms[area], vas[area], VM_ALLOC, > > > pcpu_get_vm_areas); > > > + spin_unlock(&vn->busy.lock); > > > > Hmm, before we were locking/unlocking once before the loop, now we're > > locking on each iteration, this seems inefficient. > > > > Seems like we need logic like: > > > > /* ... something to check nr_vms > 0 ... */ > > struct vmap_node *last_node = NULL; > > > > for (...) 
{ > > struct vmap_node *vnode = addr_to_node(vas[area]->va_start); > > > > if (vnode != last_node) { > > spin_unlock(last_node->busy.lock); > > spin_lock(vnode->busy.lock); > > last_node = vnode; > > } > > > > ... > > } > > > > if (last_node) > > spin_unlock(last_node->busy.lock); > > > > To minimise the lock twiddling. What do you think? > > > This per-cpu-allocator prefetches several VA units per-cpu. I do not > find it as critical because it is not a hot path for the per-cpu allocator. > When its buffers are exhausted it does an extra prefetch. So it is not > frequent. OK, sure I mean this is simpler and more readable so if not a huge perf concern then not a big deal. > > > > > > } > > > - spin_unlock(&vmap_area_lock); > > > > > > /* > > > * Mark allocated areas as accessible. Do it now as a best-effort > > > @@ -4253,55 +4333,57 @@ bool vmalloc_dump_obj(void *object) > > > { > > > void *objp = (void *)PAGE_ALIGN((unsigned long)object); > > > const void *caller; > > > - struct vm_struct *vm; > > > struct vmap_area *va; > > > + struct vmap_node *vn; > > > unsigned long addr; > > > unsigned int nr_pages; > > > + bool success = false; > > > > > > - if (!spin_trylock(&vmap_area_lock)) > > > - return false; > > > > Nitpick on style for this, I really don't know why you are removing this > > early exit? It's far neater to have a guard clause than to nest a whole > > bunch of code below. > > > Hm... I can return back as it used to be. I do not have a strong opinion here. Yeah that'd be ideal just for readability. [snip the rest as broadly fairly trivial comment stuff on which we agree] > > Thank you for the review! I can fix the comments as separate patches if > no objections. Yes, overall it's style/comment improvement stuff nothing major, feel free to send as follow-up patches. I don't want to hold anything up here so for the rest, feel free to add: Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> > > -- > Uladzislau Rezki
On Sat, Jan 20, 2024 at 12:55:10PM +0000, Lorenzo Stoakes wrote: > On Thu, Jan 18, 2024 at 02:15:31PM +0100, Uladzislau Rezki wrote: > > [snip] > > > > > > > + struct rb_root root; > > > > + struct list_head head; > > > > + spinlock_t lock; > > > > +}; > > > > + > > > > +static struct vmap_node { > > > > + /* Bookkeeping data of this node. */ > > > > + struct rb_list busy; > > > > +} single; > > > > > > This may be a thing about encapsulation/naming or similar, but I'm a little > > > confused as to why the rb_list type is maintained as a field rather than > > > its fields embedded? > > > > > The "struct vmap_node" will be extended by the following patches in the > > series. > > > > Yeah sorry I missed this, only realising after I sent...! > > > > > + > > > > +static struct vmap_node *vmap_nodes = &single; > > > > +static __read_mostly unsigned int nr_vmap_nodes = 1; > > > > +static __read_mostly unsigned int vmap_zone_size = 1; > > > > > > It might be worth adding a comment here explaining that we're binding to a > > > single node for now to maintain existing behaviour (and a brief description > > > of what these values mean - for instance what unit vmap_zone_size is > > > expressed in?) > > > > > Right. Agree on it :) > > > > Indeed :) > > [snip] > > > > > /* Look up the first VA which satisfies addr < va_end, NULL if none. */ > > > > -static struct vmap_area *find_vmap_area_exceed_addr(unsigned long addr) > > > > +static struct vmap_area * > > > > +find_vmap_area_exceed_addr(unsigned long addr, struct rb_root *root) > > > > { > > > > struct vmap_area *va = NULL; > > > > - struct rb_node *n = vmap_area_root.rb_node; > > > > + struct rb_node *n = root->rb_node; > > > > > > > > addr = (unsigned long)kasan_reset_tag((void *)addr); > > > > > > > > @@ -1552,12 +1583,14 @@ __alloc_vmap_area(struct rb_root *root, struct list_head *head, > > > > */ > > > > static void free_vmap_area(struct vmap_area *va) > > > > { > > > > + struct vmap_node *vn = addr_to_node(va->va_start); > > > > + > > > > > > I'm being nitty here, and while I know it's a vmalloc convention to use > > > 'va' and 'vm', perhaps we can break away from the super short variable name > > > convention and use 'vnode' or something for these values? > > > > > > I feel people might get confused between 'vm' and 'vn' for instance. > > > > > vnode, varea? > > I think 'vm' and 'va' are fine, just scanning through easy to mistake 'vn' > and 'vm'. Obviously a litle nitpicky! You could replace all but a bit > churny, so I think vn -> vnode works best imo. > > [snip] > > > > > struct vmap_area *find_vmap_area(unsigned long addr) > > > > { > > > > + struct vmap_node *vn; > > > > struct vmap_area *va; > > > > + int i, j; > > > > > > > > - spin_lock(&vmap_area_lock); > > > > - va = __find_vmap_area(addr, &vmap_area_root); > > > > - spin_unlock(&vmap_area_lock); > > > > + /* > > > > + * An addr_to_node_id(addr) converts an address to a node index > > > > + * where a VA is located. If VA spans several zones and passed > > > > + * addr is not the same as va->va_start, what is not common, we > > > > + * may need to scan an extra nodes. See an example: > > > > > > For my understading when you say 'scan an extra nodes' do you mean scan > > > just 1 extra node, or multiple? If the former I'd replace this with 'may > > > need to scan an extra node' if the latter then 'may ened to scan extra > > > nodes'. > > > > > > It's a nitty language thing, but also potentially changes the meaning of > > > this! 
> > > > > Typo, i should replace it to: scan extra nodes. > > Thanks. > > > > > > > + * > > > > + * <--va--> > > > > + * -|-----|-----|-----|-----|- > > > > + * 1 2 0 1 > > > > + * > > > > + * VA resides in node 1 whereas it spans 1 and 2. If passed > > > > + * addr is within a second node we should do extra work. We > > > > + * should mention that it is rare and is a corner case from > > > > + * the other hand it has to be covered. > > > > > > A very minor language style nit, but you've already said this is not > > > common, I don't think you need this 'We should mention...' bit. It's not a > > > big deal however! > > > > > No problem. We can remove it! > > Thanks. > > > > > > > + */ > > > > + i = j = addr_to_node_id(addr); > > > > + do { > > > > + vn = &vmap_nodes[i]; > > > > > > > > - return va; > > > > + spin_lock(&vn->busy.lock); > > > > + va = __find_vmap_area(addr, &vn->busy.root); > > > > + spin_unlock(&vn->busy.lock); > > > > + > > > > + if (va) > > > > + return va; > > > > + } while ((i = (i + 1) % nr_vmap_nodes) != j); > > > > > > If you comment above suggests that only 1 extra node might need to be > > > scanned, should we stop after one iteration? > > > > > Not really. Though we can improve it further to scan backward. > > I think it'd be good to clarify in the comment above that the VA could span > more than 1 node then, as the diagram seems to imply only 1 (I think just > simply because of the example you were showing). > > [snip] > > > > > static struct vmap_area *find_unlink_vmap_area(unsigned long addr) > > > > { > > > > + struct vmap_node *vn; > > > > struct vmap_area *va; > > > > + int i, j; > > > > > > > > - spin_lock(&vmap_area_lock); > > > > - va = __find_vmap_area(addr, &vmap_area_root); > > > > - if (va) > > > > - unlink_va(va, &vmap_area_root); > > > > - spin_unlock(&vmap_area_lock); > > > > + i = j = addr_to_node_id(addr); > > > > + do { > > > > + vn = &vmap_nodes[i]; > > > > > > > > - return va; > > > > + spin_lock(&vn->busy.lock); > > > > + va = __find_vmap_area(addr, &vn->busy.root); > > > > + if (va) > > > > + unlink_va(va, &vn->busy.root); > > > > + spin_unlock(&vn->busy.lock); > > > > + > > > > + if (va) > > > > + return va; > > > > + } while ((i = (i + 1) % nr_vmap_nodes) != j); > > > > > > Maybe worth adding a comment saying to refer to the comment in > > > find_vmap_area() to see why this loop is necessary. > > > > > OK. We can do it to make it better for reading. > > Thanks! > > [snip] > > > > > @@ -3728,8 +3804,11 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count) > > > > > > Unrelated to your change but makes me feel a little unwell to see 'const > > > char *addr'! Can we change this at some point? Or maybe I can :) > > > > > You are welcome :) > > Haha ;) yes I think I might tbh, I have noted it down. > > > > > > > > > > > remains = count; > > > > > > > > - spin_lock(&vmap_area_lock); > > > > - va = find_vmap_area_exceed_addr((unsigned long)addr); > > > > + /* Hooked to node_0 so far. */ > > > > + vn = addr_to_node(0); > > > > > > Why can't we use addr for this call? We already enforce the node-0 only > > > thing by setting nr_vmap_nodes to 1 right? And won't this be potentially > > > subtly wrong when we later increase this? > > > > > I used to have 0 here. But please note, it is changed by the next patch in > > this series. > > Yeah sorry, again hadn't noticed this. 
> > [snip] > > > > > + spin_lock(&vn->busy.lock); > > > > + insert_vmap_area(vas[area], &vn->busy.root, &vn->busy.head); > > > > setup_vmalloc_vm_locked(vms[area], vas[area], VM_ALLOC, > > > > pcpu_get_vm_areas); > > > > + spin_unlock(&vn->busy.lock); > > > > > > Hmm, before we were locking/unlocking once before the loop, now we're > > > locking on each iteration, this seems inefficient. > > > > > > Seems like we need logic like: > > > > > > /* ... something to check nr_vms > 0 ... */ > > > struct vmap_node *last_node = NULL; > > > > > > for (...) { > > > struct vmap_node *vnode = addr_to_node(vas[area]->va_start); > > > > > > if (vnode != last_node) { > > > spin_unlock(last_node->busy.lock); > > > spin_lock(vnode->busy.lock); > > > last_node = vnode; > > > } > > > > > > ... > > > } > > > > > > if (last_node) > > > spin_unlock(last_node->busy.lock); > > > > > > To minimise the lock twiddling. What do you think? > > > > > This per-cpu-allocator prefetches several VA units per-cpu. I do not > > find it as critical because it is not a hot path for the per-cpu allocator. > > When its buffers are exhausted it does an extra prefetch. So it is not > > frequent. > > OK, sure I mean this is simpler and more readable so if not a huge perf > concern then not a big deal. > > > > > > > > > > } > > > > - spin_unlock(&vmap_area_lock); > > > > > > > > /* > > > > * Mark allocated areas as accessible. Do it now as a best-effort > > > > @@ -4253,55 +4333,57 @@ bool vmalloc_dump_obj(void *object) > > > > { > > > > void *objp = (void *)PAGE_ALIGN((unsigned long)object); > > > > const void *caller; > > > > - struct vm_struct *vm; > > > > struct vmap_area *va; > > > > + struct vmap_node *vn; > > > > unsigned long addr; > > > > unsigned int nr_pages; > > > > + bool success = false; > > > > > > > > - if (!spin_trylock(&vmap_area_lock)) > > > > - return false; > > > > > > Nitpick on style for this, I really don't know why you are removing this > > > early exit? It's far neater to have a guard clause than to nest a whole > > > bunch of code below. > > > > > Hm... I can return back as it used to be. I do not have a strong opinion here. > > Yeah that'd be ideal just for readability. > > [snip the rest as broadly fairly trivial comment stuff on which we agree] > > > > > Thank you for the review! I can fix the comments as separate patches if > > no objections. > > Yes, overall it's style/comment improvement stuff nothing major, feel free > to send as follow-up patches. > > I don't want to hold anything up here so for the rest, feel free to add: > > Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> > Appreciate! I will go through again and send out the patch that adds more detailed explanation as requested in this review. Again, thank you! -- Uladzislau Rezki
diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 06bd843d18ae..786ecb18ae22 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -728,11 +728,9 @@ EXPORT_SYMBOL(vmalloc_to_pfn); #define DEBUG_AUGMENT_LOWEST_MATCH_CHECK 0 -static DEFINE_SPINLOCK(vmap_area_lock); static DEFINE_SPINLOCK(free_vmap_area_lock); /* Export for kexec only */ LIST_HEAD(vmap_area_list); -static struct rb_root vmap_area_root = RB_ROOT; static bool vmap_initialized __read_mostly; static struct rb_root purge_vmap_area_root = RB_ROOT; @@ -772,6 +770,38 @@ static struct rb_root free_vmap_area_root = RB_ROOT; */ static DEFINE_PER_CPU(struct vmap_area *, ne_fit_preload_node); +/* + * An effective vmap-node logic. Users make use of nodes instead + * of a global heap. It allows to balance an access and mitigate + * contention. + */ +struct rb_list { + struct rb_root root; + struct list_head head; + spinlock_t lock; +}; + +static struct vmap_node { + /* Bookkeeping data of this node. */ + struct rb_list busy; +} single; + +static struct vmap_node *vmap_nodes = &single; +static __read_mostly unsigned int nr_vmap_nodes = 1; +static __read_mostly unsigned int vmap_zone_size = 1; + +static inline unsigned int +addr_to_node_id(unsigned long addr) +{ + return (addr / vmap_zone_size) % nr_vmap_nodes; +} + +static inline struct vmap_node * +addr_to_node(unsigned long addr) +{ + return &vmap_nodes[addr_to_node_id(addr)]; +} + static __always_inline unsigned long va_size(struct vmap_area *va) { @@ -803,10 +833,11 @@ unsigned long vmalloc_nr_pages(void) } /* Look up the first VA which satisfies addr < va_end, NULL if none. */ -static struct vmap_area *find_vmap_area_exceed_addr(unsigned long addr) +static struct vmap_area * +find_vmap_area_exceed_addr(unsigned long addr, struct rb_root *root) { struct vmap_area *va = NULL; - struct rb_node *n = vmap_area_root.rb_node; + struct rb_node *n = root->rb_node; addr = (unsigned long)kasan_reset_tag((void *)addr); @@ -1552,12 +1583,14 @@ __alloc_vmap_area(struct rb_root *root, struct list_head *head, */ static void free_vmap_area(struct vmap_area *va) { + struct vmap_node *vn = addr_to_node(va->va_start); + /* * Remove from the busy tree/list. */ - spin_lock(&vmap_area_lock); - unlink_va(va, &vmap_area_root); - spin_unlock(&vmap_area_lock); + spin_lock(&vn->busy.lock); + unlink_va(va, &vn->busy.root); + spin_unlock(&vn->busy.lock); /* * Insert/Merge it back to the free tree/list. @@ -1600,6 +1633,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, int node, gfp_t gfp_mask, unsigned long va_flags) { + struct vmap_node *vn; struct vmap_area *va; unsigned long freed; unsigned long addr; @@ -1645,9 +1679,11 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, va->vm = NULL; va->flags = va_flags; - spin_lock(&vmap_area_lock); - insert_vmap_area(va, &vmap_area_root, &vmap_area_list); - spin_unlock(&vmap_area_lock); + vn = addr_to_node(va->va_start); + + spin_lock(&vn->busy.lock); + insert_vmap_area(va, &vn->busy.root, &vn->busy.head); + spin_unlock(&vn->busy.lock); BUG_ON(!IS_ALIGNED(va->va_start, align)); BUG_ON(va->va_start < vstart); @@ -1871,26 +1907,61 @@ static void free_unmap_vmap_area(struct vmap_area *va) struct vmap_area *find_vmap_area(unsigned long addr) { + struct vmap_node *vn; struct vmap_area *va; + int i, j; - spin_lock(&vmap_area_lock); - va = __find_vmap_area(addr, &vmap_area_root); - spin_unlock(&vmap_area_lock); + /* + * An addr_to_node_id(addr) converts an address to a node index + * where a VA is located. 
If VA spans several zones and passed + * addr is not the same as va->va_start, what is not common, we + * may need to scan an extra nodes. See an example: + * + * <--va--> + * -|-----|-----|-----|-----|- + * 1 2 0 1 + * + * VA resides in node 1 whereas it spans 1 and 2. If passed + * addr is within a second node we should do extra work. We + * should mention that it is rare and is a corner case from + * the other hand it has to be covered. + */ + i = j = addr_to_node_id(addr); + do { + vn = &vmap_nodes[i]; - return va; + spin_lock(&vn->busy.lock); + va = __find_vmap_area(addr, &vn->busy.root); + spin_unlock(&vn->busy.lock); + + if (va) + return va; + } while ((i = (i + 1) % nr_vmap_nodes) != j); + + return NULL; } static struct vmap_area *find_unlink_vmap_area(unsigned long addr) { + struct vmap_node *vn; struct vmap_area *va; + int i, j; - spin_lock(&vmap_area_lock); - va = __find_vmap_area(addr, &vmap_area_root); - if (va) - unlink_va(va, &vmap_area_root); - spin_unlock(&vmap_area_lock); + i = j = addr_to_node_id(addr); + do { + vn = &vmap_nodes[i]; - return va; + spin_lock(&vn->busy.lock); + va = __find_vmap_area(addr, &vn->busy.root); + if (va) + unlink_va(va, &vn->busy.root); + spin_unlock(&vn->busy.lock); + + if (va) + return va; + } while ((i = (i + 1) % nr_vmap_nodes) != j); + + return NULL; } /*** Per cpu kva allocator ***/ @@ -2092,6 +2163,7 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask) static void free_vmap_block(struct vmap_block *vb) { + struct vmap_node *vn; struct vmap_block *tmp; struct xarray *xa; @@ -2099,9 +2171,10 @@ static void free_vmap_block(struct vmap_block *vb) tmp = xa_erase(xa, addr_to_vb_idx(vb->va->va_start)); BUG_ON(tmp != vb); - spin_lock(&vmap_area_lock); - unlink_va(vb->va, &vmap_area_root); - spin_unlock(&vmap_area_lock); + vn = addr_to_node(vb->va->va_start); + spin_lock(&vn->busy.lock); + unlink_va(vb->va, &vn->busy.root); + spin_unlock(&vn->busy.lock); free_vmap_area_noflush(vb->va); kfree_rcu(vb, rcu_head); @@ -2525,9 +2598,11 @@ static inline void setup_vmalloc_vm_locked(struct vm_struct *vm, static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va, unsigned long flags, const void *caller) { - spin_lock(&vmap_area_lock); + struct vmap_node *vn = addr_to_node(va->va_start); + + spin_lock(&vn->busy.lock); setup_vmalloc_vm_locked(vm, va, flags, caller); - spin_unlock(&vmap_area_lock); + spin_unlock(&vn->busy.lock); } static void clear_vm_uninitialized_flag(struct vm_struct *vm) @@ -3715,6 +3790,7 @@ static size_t vmap_ram_vread_iter(struct iov_iter *iter, const char *addr, */ long vread_iter(struct iov_iter *iter, const char *addr, size_t count) { + struct vmap_node *vn; struct vmap_area *va; struct vm_struct *vm; char *vaddr; @@ -3728,8 +3804,11 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count) remains = count; - spin_lock(&vmap_area_lock); - va = find_vmap_area_exceed_addr((unsigned long)addr); + /* Hooked to node_0 so far. 
*/ + vn = addr_to_node(0); + spin_lock(&vn->busy.lock); + + va = find_vmap_area_exceed_addr((unsigned long)addr, &vn->busy.root); if (!va) goto finished_zero; @@ -3737,7 +3816,7 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count) if ((unsigned long)addr + remains <= va->va_start) goto finished_zero; - list_for_each_entry_from(va, &vmap_area_list, list) { + list_for_each_entry_from(va, &vn->busy.head, list) { size_t copied; if (remains == 0) @@ -3796,12 +3875,12 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count) } finished_zero: - spin_unlock(&vmap_area_lock); + spin_unlock(&vn->busy.lock); /* zero-fill memory holes */ return count - remains + zero_iter(iter, remains); finished: /* Nothing remains, or We couldn't copy/zero everything. */ - spin_unlock(&vmap_area_lock); + spin_unlock(&vn->busy.lock); return count - remains; } @@ -4135,14 +4214,15 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets, } /* insert all vm's */ - spin_lock(&vmap_area_lock); for (area = 0; area < nr_vms; area++) { - insert_vmap_area(vas[area], &vmap_area_root, &vmap_area_list); + struct vmap_node *vn = addr_to_node(vas[area]->va_start); + spin_lock(&vn->busy.lock); + insert_vmap_area(vas[area], &vn->busy.root, &vn->busy.head); setup_vmalloc_vm_locked(vms[area], vas[area], VM_ALLOC, pcpu_get_vm_areas); + spin_unlock(&vn->busy.lock); } - spin_unlock(&vmap_area_lock); /* * Mark allocated areas as accessible. Do it now as a best-effort @@ -4253,55 +4333,57 @@ bool vmalloc_dump_obj(void *object) { void *objp = (void *)PAGE_ALIGN((unsigned long)object); const void *caller; - struct vm_struct *vm; struct vmap_area *va; + struct vmap_node *vn; unsigned long addr; unsigned int nr_pages; + bool success = false; - if (!spin_trylock(&vmap_area_lock)) - return false; - va = __find_vmap_area((unsigned long)objp, &vmap_area_root); - if (!va) { - spin_unlock(&vmap_area_lock); - return false; - } + vn = addr_to_node((unsigned long)objp); - vm = va->vm; - if (!vm) { - spin_unlock(&vmap_area_lock); - return false; + if (spin_trylock(&vn->busy.lock)) { + va = __find_vmap_area(addr, &vn->busy.root); + + if (va && va->vm) { + addr = (unsigned long)va->vm->addr; + caller = va->vm->caller; + nr_pages = va->vm->nr_pages; + success = true; + } + + spin_unlock(&vn->busy.lock); } - addr = (unsigned long)vm->addr; - caller = vm->caller; - nr_pages = vm->nr_pages; - spin_unlock(&vmap_area_lock); - pr_cont(" %u-page vmalloc region starting at %#lx allocated at %pS\n", - nr_pages, addr, caller); - return true; + + if (success) + pr_cont(" %u-page vmalloc region starting at %#lx allocated at %pS\n", + nr_pages, addr, caller); + + return success; } #endif #ifdef CONFIG_PROC_FS static void *s_start(struct seq_file *m, loff_t *pos) - __acquires(&vmap_purge_lock) - __acquires(&vmap_area_lock) { + struct vmap_node *vn = addr_to_node(0); + mutex_lock(&vmap_purge_lock); - spin_lock(&vmap_area_lock); + spin_lock(&vn->busy.lock); - return seq_list_start(&vmap_area_list, *pos); + return seq_list_start(&vn->busy.head, *pos); } static void *s_next(struct seq_file *m, void *p, loff_t *pos) { - return seq_list_next(p, &vmap_area_list, pos); + struct vmap_node *vn = addr_to_node(0); + return seq_list_next(p, &vn->busy.head, pos); } static void s_stop(struct seq_file *m, void *p) - __releases(&vmap_area_lock) - __releases(&vmap_purge_lock) { - spin_unlock(&vmap_area_lock); + struct vmap_node *vn = addr_to_node(0); + + spin_unlock(&vn->busy.lock); mutex_unlock(&vmap_purge_lock); } @@ -4344,9 +4426,11 @@ 
static void show_purge_info(struct seq_file *m) static int s_show(struct seq_file *m, void *p) { + struct vmap_node *vn; struct vmap_area *va; struct vm_struct *v; + vn = addr_to_node(0); va = list_entry(p, struct vmap_area, list); if (!va->vm) { @@ -4397,7 +4481,7 @@ static int s_show(struct seq_file *m, void *p) * As a final step, dump "unpurged" areas. */ final: - if (list_is_last(&va->list, &vmap_area_list)) + if (list_is_last(&va->list, &vn->busy.head)) show_purge_info(m); return 0; @@ -4428,7 +4512,8 @@ static void vmap_init_free_space(void) { unsigned long vmap_start = 1; const unsigned long vmap_end = ULONG_MAX; - struct vmap_area *busy, *free; + struct vmap_area *free; + struct vm_struct *busy; /* * B F B B B F @@ -4436,12 +4521,12 @@ static void vmap_init_free_space(void) * | The KVA space | * |<--------------------------------->| */ - list_for_each_entry(busy, &vmap_area_list, list) { - if (busy->va_start - vmap_start > 0) { + for (busy = vmlist; busy; busy = busy->next) { + if ((unsigned long) busy->addr - vmap_start > 0) { free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT); if (!WARN_ON_ONCE(!free)) { free->va_start = vmap_start; - free->va_end = busy->va_start; + free->va_end = (unsigned long) busy->addr; insert_vmap_area_augment(free, NULL, &free_vmap_area_root, @@ -4449,7 +4534,7 @@ static void vmap_init_free_space(void) } } - vmap_start = busy->va_end; + vmap_start = (unsigned long) busy->addr + busy->size; } if (vmap_end - vmap_start > 0) { @@ -4465,9 +4550,23 @@ static void vmap_init_free_space(void) } } +static void vmap_init_nodes(void) +{ + struct vmap_node *vn; + int i; + + for (i = 0; i < nr_vmap_nodes; i++) { + vn = &vmap_nodes[i]; + vn->busy.root = RB_ROOT; + INIT_LIST_HEAD(&vn->busy.head); + spin_lock_init(&vn->busy.lock); + } +} + void __init vmalloc_init(void) { struct vmap_area *va; + struct vmap_node *vn; struct vm_struct *tmp; int i; @@ -4489,6 +4588,11 @@ void __init vmalloc_init(void) xa_init(&vbq->vmap_blocks); } + /* + * Setup nodes before importing vmlist. + */ + vmap_init_nodes(); + /* Import existing vmlist entries. */ for (tmp = vmlist; tmp; tmp = tmp->next) { va = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT); @@ -4498,7 +4602,9 @@ void __init vmalloc_init(void) va->va_start = (unsigned long)tmp->addr; va->va_end = va->va_start + tmp->size; va->vm = tmp; - insert_vmap_area(va, &vmap_area_root, &vmap_area_list); + + vn = addr_to_node(va->va_start); + insert_vmap_area(va, &vn->busy.root, &vn->busy.head); } /*