Message ID | 20240229082611.4104839-1-rulin.huang@intel.com |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel+bounces-86288-ouuuleilei=gmail.com@vger.kernel.org>
From: rulinhuang <rulin.huang@intel.com>
To: urezki@gmail.com, bhe@redhat.com
Cc: akpm@linux-foundation.org, colin.king@intel.com, hch@infradead.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, lstoakes@gmail.com, rulin.huang@intel.com, tianyou.li@intel.com, tim.c.chen@intel.com, wangyang.guo@intel.com, zhiguo.zhou@intel.com
Subject: [PATCH v6] mm/vmalloc: lock contention optimization under multi-threading
Date: Thu, 29 Feb 2024 00:26:11 -0800
Message-ID: <20240229082611.4104839-1-rulin.huang@intel.com>
X-Mailer: git-send-email 2.43.0
List-Id: <linux-kernel.vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit |
Series |
[v6] mm/vmalloc: lock contention optimization under multi-threading
|
|
Commit Message
rulinhuang
Feb. 29, 2024, 8:26 a.m. UTC
When allocating a new memory area where the mapping address range is known, it is observed that the vmap_node->busy.lock is acquired twice.

The first acquisition occurs in the alloc_vmap_area() function when inserting the vm area into the vm mapping red-black tree. The second occurs in the setup_vmalloc_vm() function when updating the properties of the vm, such as its flags and address.

Combine these two operations in alloc_vmap_area(), which improves scalability when the vmap_node->busy.lock is contended: the lock then needs to be acquired only once instead of twice.

With the above change, tested on an Intel Sapphire Rapids platform (224 vCPUs), a 4% performance improvement is gained on stress-ng/pthread (https://github.com/ColinIanKing/stress-ng), a stress test of thread creation.

Reviewed-by: Uladzislau Rezki <urezki@gmail.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: "Chen, Tim C" <tim.c.chen@intel.com>
Reviewed-by: "King, Colin" <colin.king@intel.com>
Signed-off-by: rulinhuang <rulin.huang@intel.com>
---
V1 -> V2: Avoided the partial initialization issue of vm and separated insert_vmap_area() from alloc_vmap_area()
V2 -> V3: Rebased on 6.8-rc5
V3 -> V4: Rebased on the mm-unstable branch
V4 -> V5: Cancelled the split of alloc_vmap_area() and kept insert_vmap_area()
V5 -> V6: Added a BUG_ON check
---
mm/vmalloc.c | 132 +++++++++++++++++++++++++--------------------------
1 file changed, 64 insertions(+), 68 deletions(-)

base-commit: 7e6ae2db7f319bf9613ec6db8fa3c9bc1de1b346
Comments
Apologies for the confusion the original format led to, and thanks so much for your guidance, which will surely enhance the efficiency of our communication with the kernel community.

We've submitted v6 of the patch, which more rigorously checks va_flags with BUG_ON while ensuring the additional performance overhead stays negligible. In this revision we also moved the macros, because the definition of VMAP_RAM needs to be placed before alloc_vmap_area().

Much appreciation to you and Uladzislau for the code refinement. At the same time, we'd also like to respect the internal review comments and suggestions from Tim and Colin, without which this patch could not have been qualified to be sent out for your review. Although the current implementation is now quite different from its first version, I'd still recommend properly recognizing their contributions with the "Reviewed-by" tag. Does that make sense?

Could you kindly help us review this version and share your further comments? Thanks again for your continuous help!

On 2024/2/29 16:26, rulinhuang wrote:
> When allocating a new memory area where the mapping address range is
> known, it is observed that the vmap_node->busy.lock is acquired twice.
>
> The first acquisition occurs in the alloc_vmap_area() function when
> inserting the vm area into the vm mapping red-black tree. The second
> acquisition occurs in the setup_vmalloc_vm() function when updating the
> properties of the vm, such as flags and address, etc.
>
> Combine these two operations together in alloc_vmap_area(), which
> improves scalability when the vmap_node->busy.lock is contended.
> By doing so, the need to acquire the lock twice can also be eliminated
> to once.
>
> With the above change, tested on intel sapphire rapids
> platform(224 vcpu), a 4% performance improvement is
> gained on stress-ng/pthread(https://github.com/ColinIanKing/stress-ng),
> which is the stress test of thread creations.
>
> Reviewed-by: Uladzislau Rezki <urezki@gmail.com>
> Reviewed-by: Baoquan He <bhe@redhat.com>
> Reviewed-by: "Chen, Tim C" <tim.c.chen@intel.com>
> Reviewed-by: "King, Colin" <colin.king@intel.com>
> Signed-off-by: rulinhuang <rulin.huang@intel.com>
> ---
> V1 -> V2: Avoided the partial initialization issue of vm and
> separated insert_vmap_area() from alloc_vmap_area()
> V2 -> V3: Rebased on 6.8-rc5
> V3 -> V4: Rebased on mm-unstable branch
> V4 -> V5: cancel the split of alloc_vmap_area()
> and keep insert_vmap_area()
> V5 -> V6: add bug_on
> ---
> mm/vmalloc.c | 132 +++++++++++++++++++++++++--------------------------
> 1 file changed, 64 insertions(+), 68 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 25a8df497255..5ae028b0d58d 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1841,15 +1841,66 @@ node_alloc(unsigned long size, unsigned long align,
> 	return va;
> }
>
> +/*** Per cpu kva allocator ***/
> +
> +/*
> + * vmap space is limited especially on 32 bit architectures. Ensure there is
> + * room for at least 16 percpu vmap blocks per CPU.
> + */
> +/*
> + * If we had a constant VMALLOC_START and VMALLOC_END, we'd like to be able
> + * to #define VMALLOC_SPACE		(VMALLOC_END-VMALLOC_START). Guess
> + * instead (we just need a rough idea)
> + */
> +#if BITS_PER_LONG == 32
> +#define VMALLOC_SPACE		(128UL*1024*1024)
> +#else
> +#define VMALLOC_SPACE		(128UL*1024*1024*1024)
> +#endif
> +
> +#define VMALLOC_PAGES		(VMALLOC_SPACE / PAGE_SIZE)
> +#define VMAP_MAX_ALLOC		BITS_PER_LONG	/* 256K with 4K pages */
> +#define VMAP_BBMAP_BITS_MAX	1024	/* 4MB with 4K pages */
> +#define VMAP_BBMAP_BITS_MIN	(VMAP_MAX_ALLOC*2)
> +#define VMAP_MIN(x, y)		((x) < (y) ? (x) : (y)) /* can't use min() */
> +#define VMAP_MAX(x, y)		((x) > (y) ? (x) : (y)) /* can't use max() */
> +#define VMAP_BBMAP_BITS		\
> +		VMAP_MIN(VMAP_BBMAP_BITS_MAX,	\
> +		VMAP_MAX(VMAP_BBMAP_BITS_MIN,	\
> +			VMALLOC_PAGES / roundup_pow_of_two(NR_CPUS) / 16))
> +
> +#define VMAP_BLOCK_SIZE		(VMAP_BBMAP_BITS * PAGE_SIZE)
> +
> +/*
> + * Purge threshold to prevent overeager purging of fragmented blocks for
> + * regular operations: Purge if vb->free is less than 1/4 of the capacity.
> + */
> +#define VMAP_PURGE_THRESHOLD	(VMAP_BBMAP_BITS / 4)
> +
> +#define VMAP_RAM		0x1 /* indicates vm_map_ram area*/
> +#define VMAP_BLOCK		0x2 /* mark out the vmap_block sub-type*/
> +#define VMAP_FLAGS_MASK		0x3
> +
> +static inline void setup_vmalloc_vm(struct vm_struct *vm,
> +	struct vmap_area *va, unsigned long flags, const void *caller)
> +{
> +	vm->flags = flags;
> +	vm->addr = (void *)va->va_start;
> +	vm->size = va->va_end - va->va_start;
> +	vm->caller = caller;
> +	va->vm = vm;
> +}
> +
> /*
>  * Allocate a region of KVA of the specified size and alignment, within the
> - * vstart and vend.
> + * vstart and vend. If vm is passed in, the two will also be bound.
>  */
> static struct vmap_area *alloc_vmap_area(unsigned long size,
> 				unsigned long align,
> 				unsigned long vstart, unsigned long vend,
> 				int node, gfp_t gfp_mask,
> -				unsigned long va_flags)
> +				unsigned long va_flags, struct vm_struct *vm,
> +				unsigned long flags, const void *caller)
> {
> 	struct vmap_node *vn;
> 	struct vmap_area *va;
> @@ -1912,6 +1963,11 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
> 	va->vm = NULL;
> 	va->flags = (va_flags | vn_id);
>
> +	if (vm) {
> +		BUG_ON(va_flags & VMAP_RAM);
> +		setup_vmalloc_vm(vm, va, flags, caller);
> +	}
> +
> 	vn = addr_to_node(va->va_start);
>
> 	spin_lock(&vn->busy.lock);
> @@ -2325,46 +2381,6 @@ static struct vmap_area *find_unlink_vmap_area(unsigned long addr)
> 	return NULL;
> }
>
> -/*** Per cpu kva allocator ***/
> -
> -/*
> - * vmap space is limited especially on 32 bit architectures. Ensure there is
> - * room for at least 16 percpu vmap blocks per CPU.
> - */
> -/*
> - * If we had a constant VMALLOC_START and VMALLOC_END, we'd like to be able
> - * to #define VMALLOC_SPACE		(VMALLOC_END-VMALLOC_START). Guess
> - * instead (we just need a rough idea)
> - */
> -#if BITS_PER_LONG == 32
> -#define VMALLOC_SPACE		(128UL*1024*1024)
> -#else
> -#define VMALLOC_SPACE		(128UL*1024*1024*1024)
> -#endif
> -
> -#define VMALLOC_PAGES		(VMALLOC_SPACE / PAGE_SIZE)
> -#define VMAP_MAX_ALLOC		BITS_PER_LONG	/* 256K with 4K pages */
> -#define VMAP_BBMAP_BITS_MAX	1024	/* 4MB with 4K pages */
> -#define VMAP_BBMAP_BITS_MIN	(VMAP_MAX_ALLOC*2)
> -#define VMAP_MIN(x, y)		((x) < (y) ? (x) : (y)) /* can't use min() */
> -#define VMAP_MAX(x, y)		((x) > (y) ? (x) : (y)) /* can't use max() */
> -#define VMAP_BBMAP_BITS		\
> -		VMAP_MIN(VMAP_BBMAP_BITS_MAX,	\
> -		VMAP_MAX(VMAP_BBMAP_BITS_MIN,	\
> -			VMALLOC_PAGES / roundup_pow_of_two(NR_CPUS) / 16))
> -
> -#define VMAP_BLOCK_SIZE		(VMAP_BBMAP_BITS * PAGE_SIZE)
> -
> -/*
> - * Purge threshold to prevent overeager purging of fragmented blocks for
> - * regular operations: Purge if vb->free is less than 1/4 of the capacity.
> - */
> -#define VMAP_PURGE_THRESHOLD	(VMAP_BBMAP_BITS / 4)
> -
> -#define VMAP_RAM		0x1 /* indicates vm_map_ram area*/
> -#define VMAP_BLOCK		0x2 /* mark out the vmap_block sub-type*/
> -#define VMAP_FLAGS_MASK		0x3
> -
> struct vmap_block_queue {
> 	spinlock_t lock;
> 	struct list_head free;
> @@ -2486,7 +2502,8 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
> 	va = alloc_vmap_area(VMAP_BLOCK_SIZE, VMAP_BLOCK_SIZE,
> 					VMALLOC_START, VMALLOC_END,
> 					node, gfp_mask,
> -					VMAP_RAM|VMAP_BLOCK);
> +					VMAP_RAM|VMAP_BLOCK, NULL,
> +					0, NULL);
> 	if (IS_ERR(va)) {
> 		kfree(vb);
> 		return ERR_CAST(va);
> @@ -2843,7 +2860,8 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node)
> 		struct vmap_area *va;
> 		va = alloc_vmap_area(size, PAGE_SIZE,
> 				VMALLOC_START, VMALLOC_END,
> -				node, GFP_KERNEL, VMAP_RAM);
> +				node, GFP_KERNEL, VMAP_RAM,
> +				NULL, 0, NULL);
> 		if (IS_ERR(va))
> 			return NULL;
>
> @@ -2946,26 +2964,6 @@ void __init vm_area_register_early(struct vm_struct *vm, size_t align)
> 	kasan_populate_early_vm_area_shadow(vm->addr, vm->size);
> }
>
> -static inline void setup_vmalloc_vm_locked(struct vm_struct *vm,
> -	struct vmap_area *va, unsigned long flags, const void *caller)
> -{
> -	vm->flags = flags;
> -	vm->addr = (void *)va->va_start;
> -	vm->size = va->va_end - va->va_start;
> -	vm->caller = caller;
> -	va->vm = vm;
> -}
> -
> -static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va,
> -	unsigned long flags, const void *caller)
> -{
> -	struct vmap_node *vn = addr_to_node(va->va_start);
> -
> -	spin_lock(&vn->busy.lock);
> -	setup_vmalloc_vm_locked(vm, va, flags, caller);
> -	spin_unlock(&vn->busy.lock);
> -}
> -
> static void clear_vm_uninitialized_flag(struct vm_struct *vm)
> {
> 	/*
> @@ -3002,14 +3000,12 @@ static struct vm_struct *__get_vm_area_node(unsigned long size,
> 	if (!(flags & VM_NO_GUARD))
> 		size += PAGE_SIZE;
>
> -	va = alloc_vmap_area(size, align, start, end, node, gfp_mask, 0);
> +	va = alloc_vmap_area(size, align, start, end, node, gfp_mask, 0,
> +		area, flags, caller);
> 	if (IS_ERR(va)) {
> 		kfree(area);
> 		return NULL;
> 	}
>
> -	setup_vmalloc_vm(area, va, flags, caller);
> -
> 	/*
> 	 * Mark pages for non-VM_ALLOC mappings as accessible. Do it now as a
> 	 * best-effort approach, as they can be mapped outside of vmalloc code.
> @@ -4584,7 +4580,7 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
>
> 		spin_lock(&vn->busy.lock);
> 		insert_vmap_area(vas[area], &vn->busy.root, &vn->busy.head);
> -		setup_vmalloc_vm_locked(vms[area], vas[area], VM_ALLOC,
> +		setup_vmalloc_vm(vms[area], vas[area], VM_ALLOC,
> 				pcpu_get_vm_areas);
> 		spin_unlock(&vn->busy.lock);
> 	}
>
> base-commit: 7e6ae2db7f319bf9613ec6db8fa3c9bc1de1b346
Hi Rulin,

Thanks for the great work and v6; some concerns, please see inline comments.

On 02/29/24 at 12:26am, rulinhuang wrote:
> When allocating a new memory area where the mapping address range is
> known, it is observed that the vmap_node->busy.lock is acquired twice.
>
> [...]
>
> Reviewed-by: Uladzislau Rezki <urezki@gmail.com>
> Reviewed-by: Baoquan He <bhe@redhat.com>
> Reviewed-by: "Chen, Tim C" <tim.c.chen@intel.com>
> Reviewed-by: "King, Colin" <colin.king@intel.com>

We possibly need to remove these reviewers' tags when a new code change is taken, so that people check and add Acked-by or Reviewed-by again if they agree, or add new comments if there is any concern.

> Signed-off-by: rulinhuang <rulin.huang@intel.com>
> ---
> [...]
>
> +#define VMAP_RAM		0x1 /* indicates vm_map_ram area*/
> +#define VMAP_BLOCK		0x2 /* mark out the vmap_block sub-type*/
> +#define VMAP_FLAGS_MASK		0x3

This code moving is done because we need to check VMAP_RAM in advance. We may need to move all the data structures and basic helpers related to the per-cpu kva allocator up along with these macros, just as the newly introduced vmap_node does. If that's agreed, it would better be done in a separate patch. My personal opinion; not sure if Uladzislau has different thoughts.

Other than this, the overall patch looks good to me.

> [...]
>
> base-commit: 7e6ae2db7f319bf9613ec6db8fa3c9bc1de1b346
> --
> 2.43.0
>
On Thu, Feb 29, 2024 at 06:12:00PM +0800, Baoquan He wrote:
> Hi Rulin,
>
> Thanks for the great work and v6, some concerns, please see inline
> comments.
>
> [...]
>
> These code moving is made because we need check VMAP_RAM in advance. We
> may need move all those data structures and basic helpers related to per
> cpu kva allocator up too to along with these macros, just as the newly
> introduced vmap_node does. If that's agreed, better be done in a
> separate patch. My personal opinion. Not sure if Uladzislau has
> different thoughts.
>
> Other than this, the overall looks good to me.
>
I agree, the split should be done: one preparatory patch that is a pure code move, stating that no functional change happens, and a final patch with the actual change.

--
Uladzislau Rezki
On 02/29/24 at 04:31pm, Huang, Rulin wrote:
> Apologies for the confusion the original format led to, and thanks so
> much for your guidance, which will surely improve the efficiency of
> communicating with the kernel community.
>
> We've submitted v6 of the patch, which more rigorously checks
> va_flags with BUG_ON while ensuring the additional performance
> overhead stays negligible. In this revision we also moved the macros,
> because the definition of VMAP_RAM needs to appear before
> alloc_vmap_area().
>
> Much appreciation to you and Uladzislau for the code refinement. At
> the same time, we'd also like to respect the internal review comments
> and suggestions from Tim and Colin, without which this patch could not
> have been sent out for your review. Although the current
> implementation is much different from its first version, I'd still
> recommend properly recognizing their contributions with the
> "Reviewed-by" tag. Does that make sense?

Just checked Documentation/process/submitting-patches.rst; the tags
below seem more appropriate, because the work you mentioned is your
internal cooperation and effort, and may not be related to upstream
patch review:

Co-developed-by: "Chen, Tim C" <tim.c.chen@intel.com>
Signed-off-by: "Chen, Tim C" <tim.c.chen@intel.com>
Co-developed-by: "King, Colin" <colin.king@intel.com>
Signed-off-by: "King, Colin" <colin.king@intel.com>
Just to confirm, looks good to me.

Thanks Rulin.

Colin

-----Original Message-----
From: Huang, Rulin <rulin.huang@intel.com>
Sent: Thursday, February 29, 2024 8:26 AM
To: urezki@gmail.com; bhe@redhat.com
Cc: akpm@linux-foundation.org; King, Colin <colin.king@intel.com>; hch@infradead.org; linux-kernel@vger.kernel.org; linux-mm@kvack.org; lstoakes@gmail.com; Huang, Rulin <rulin.huang@intel.com>; Li, Tianyou <tianyou.li@intel.com>; Chen, Tim C <tim.c.chen@intel.com>; Guo, Wangyang <wangyang.guo@intel.com>; Zhou, Zhiguo <zhiguo.zhou@intel.com>
Subject: [PATCH v6] mm/vmalloc: lock contention optimization under multi-threading

When allocating a new memory area where the mapping address range is
known, it is observed that the vmap_node->busy.lock is acquired twice.

The first acquisition occurs in the alloc_vmap_area() function when
inserting the vm area into the vm mapping red-black tree. The second
acquisition occurs in the setup_vmalloc_vm() function when updating the
properties of the vm, such as flags and address, etc.

Combine these two operations in alloc_vmap_area(), which improves
scalability when the vmap_node->busy.lock is contended. By doing so, the
lock needs to be acquired only once instead of twice.

With the above change, tested on an Intel Sapphire Rapids platform
(224 vCPUs), a 4% performance improvement is gained on
stress-ng/pthread (https://github.com/ColinIanKing/stress-ng), which is
a stress test of thread creation.
Reviewed-by: Uladzislau Rezki <urezki@gmail.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: "Chen, Tim C" <tim.c.chen@intel.com>
Reviewed-by: "King, Colin" <colin.king@intel.com>
Signed-off-by: rulinhuang <rulin.huang@intel.com>
---
V1 -> V2: Avoided the partial initialization issue of vm and
          separated insert_vmap_area() from alloc_vmap_area()
V2 -> V3: Rebased on 6.8-rc5
V3 -> V4: Rebased on mm-unstable branch
V4 -> V5: cancel the split of alloc_vmap_area()
          and keep insert_vmap_area()
V5 -> V6: add bug_on
---
 mm/vmalloc.c | 132 +++++++++++++++++++++++++--------------------------
 1 file changed, 64 insertions(+), 68 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 25a8df497255..5ae028b0d58d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1841,15 +1841,66 @@ node_alloc(unsigned long size, unsigned long align,
 	return va;
 }
 
+/*** Per cpu kva allocator ***/
+
+/*
+ * vmap space is limited especially on 32 bit architectures. Ensure there is
+ * room for at least 16 percpu vmap blocks per CPU.
+ */
+/*
+ * If we had a constant VMALLOC_START and VMALLOC_END, we'd like to be able
+ * to #define VMALLOC_SPACE		(VMALLOC_END-VMALLOC_START). Guess
+ * instead (we just need a rough idea)
+ */
+#if BITS_PER_LONG == 32
+#define VMALLOC_SPACE		(128UL*1024*1024)
+#else
+#define VMALLOC_SPACE		(128UL*1024*1024*1024)
+#endif
+
+#define VMALLOC_PAGES		(VMALLOC_SPACE / PAGE_SIZE)
+#define VMAP_MAX_ALLOC		BITS_PER_LONG	/* 256K with 4K pages */
+#define VMAP_BBMAP_BITS_MAX	1024	/* 4MB with 4K pages */
+#define VMAP_BBMAP_BITS_MIN	(VMAP_MAX_ALLOC*2)
+#define VMAP_MIN(x, y)		((x) < (y) ? (x) : (y)) /* can't use min() */
+#define VMAP_MAX(x, y)		((x) > (y) ? (x) : (y)) /* can't use max() */
+#define VMAP_BBMAP_BITS		\
+		VMAP_MIN(VMAP_BBMAP_BITS_MAX,	\
+		VMAP_MAX(VMAP_BBMAP_BITS_MIN,	\
+			VMALLOC_PAGES / roundup_pow_of_two(NR_CPUS) / 16))
+
+#define VMAP_BLOCK_SIZE		(VMAP_BBMAP_BITS * PAGE_SIZE)
+
+/*
+ * Purge threshold to prevent overeager purging of fragmented blocks for
+ * regular operations: Purge if vb->free is less than 1/4 of the capacity.
+ */
+#define VMAP_PURGE_THRESHOLD	(VMAP_BBMAP_BITS / 4)
+
+#define VMAP_RAM		0x1 /* indicates vm_map_ram area*/
+#define VMAP_BLOCK		0x2 /* mark out the vmap_block sub-type*/
+#define VMAP_FLAGS_MASK		0x3
+
+static inline void setup_vmalloc_vm(struct vm_struct *vm,
+	struct vmap_area *va, unsigned long flags, const void *caller)
+{
+	vm->flags = flags;
+	vm->addr = (void *)va->va_start;
+	vm->size = va->va_end - va->va_start;
+	vm->caller = caller;
+	va->vm = vm;
+}
+
 /*
  * Allocate a region of KVA of the specified size and alignment, within the
- * vstart and vend.
+ * vstart and vend. If vm is passed in, the two will also be bound.
  */
 static struct vmap_area *alloc_vmap_area(unsigned long size, unsigned long align,
 				unsigned long vstart, unsigned long vend,
 				int node, gfp_t gfp_mask,
-				unsigned long va_flags)
+				unsigned long va_flags, struct vm_struct *vm,
+				unsigned long flags, const void *caller)
 {
 	struct vmap_node *vn;
 	struct vmap_area *va;
@@ -1912,6 +1963,11 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
 	va->vm = NULL;
 	va->flags = (va_flags | vn_id);
 
+	if (vm) {
+		BUG_ON(va_flags & VMAP_RAM);
+		setup_vmalloc_vm(vm, va, flags, caller);
+	}
+
 	vn = addr_to_node(va->va_start);
 
 	spin_lock(&vn->busy.lock);
@@ -2325,46 +2381,6 @@ static struct vmap_area *find_unlink_vmap_area(unsigned long addr)
 	return NULL;
 }
 
-/*** Per cpu kva allocator ***/
-
-/*
- * vmap space is limited especially on 32 bit architectures. Ensure there is
- * room for at least 16 percpu vmap blocks per CPU.
- */
-/*
- * If we had a constant VMALLOC_START and VMALLOC_END, we'd like to be able
- * to #define VMALLOC_SPACE		(VMALLOC_END-VMALLOC_START). Guess
- * instead (we just need a rough idea)
- */
-#if BITS_PER_LONG == 32
-#define VMALLOC_SPACE		(128UL*1024*1024)
-#else
-#define VMALLOC_SPACE		(128UL*1024*1024*1024)
-#endif
-
-#define VMALLOC_PAGES		(VMALLOC_SPACE / PAGE_SIZE)
-#define VMAP_MAX_ALLOC		BITS_PER_LONG	/* 256K with 4K pages */
-#define VMAP_BBMAP_BITS_MAX	1024	/* 4MB with 4K pages */
-#define VMAP_BBMAP_BITS_MIN	(VMAP_MAX_ALLOC*2)
-#define VMAP_MIN(x, y)		((x) < (y) ? (x) : (y)) /* can't use min() */
-#define VMAP_MAX(x, y)		((x) > (y) ? (x) : (y)) /* can't use max() */
-#define VMAP_BBMAP_BITS		\
-		VMAP_MIN(VMAP_BBMAP_BITS_MAX,	\
-		VMAP_MAX(VMAP_BBMAP_BITS_MIN,	\
-			VMALLOC_PAGES / roundup_pow_of_two(NR_CPUS) / 16))
-
-#define VMAP_BLOCK_SIZE		(VMAP_BBMAP_BITS * PAGE_SIZE)
-
-/*
- * Purge threshold to prevent overeager purging of fragmented blocks for
- * regular operations: Purge if vb->free is less than 1/4 of the capacity.
- */
-#define VMAP_PURGE_THRESHOLD	(VMAP_BBMAP_BITS / 4)
-
-#define VMAP_RAM		0x1 /* indicates vm_map_ram area*/
-#define VMAP_BLOCK		0x2 /* mark out the vmap_block sub-type*/
-#define VMAP_FLAGS_MASK		0x3
-
 struct vmap_block_queue {
 	spinlock_t lock;
 	struct list_head free;
@@ -2486,7 +2502,8 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
 
 	va = alloc_vmap_area(VMAP_BLOCK_SIZE, VMAP_BLOCK_SIZE,
 					VMALLOC_START, VMALLOC_END,
 					node, gfp_mask,
-					VMAP_RAM|VMAP_BLOCK);
+					VMAP_RAM|VMAP_BLOCK, NULL,
+					0, NULL);
 	if (IS_ERR(va)) {
 		kfree(vb);
 		return ERR_CAST(va);
@@ -2843,7 +2860,8 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node)
 		struct vmap_area *va;
 		va = alloc_vmap_area(size, PAGE_SIZE,
 				VMALLOC_START, VMALLOC_END,
-				node, GFP_KERNEL, VMAP_RAM);
+				node, GFP_KERNEL, VMAP_RAM,
+				NULL, 0, NULL);
 		if (IS_ERR(va))
 			return NULL;
 
@@ -2946,26 +2964,6 @@ void __init vm_area_register_early(struct vm_struct *vm, size_t align)
 	kasan_populate_early_vm_area_shadow(vm->addr, vm->size);
 }
 
-static inline void setup_vmalloc_vm_locked(struct vm_struct *vm,
-	struct vmap_area *va, unsigned long flags, const void *caller)
-{
-	vm->flags = flags;
-	vm->addr = (void *)va->va_start;
-	vm->size = va->va_end - va->va_start;
-	vm->caller = caller;
-	va->vm = vm;
-}
-
-static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va,
-	unsigned long flags, const void *caller)
-{
-	struct vmap_node *vn = addr_to_node(va->va_start);
-
-	spin_lock(&vn->busy.lock);
-	setup_vmalloc_vm_locked(vm, va, flags, caller);
-	spin_unlock(&vn->busy.lock);
-}
-
 static void clear_vm_uninitialized_flag(struct vm_struct *vm)
 {
 	/*
@@ -3002,14 +3000,12 @@ static struct vm_struct *__get_vm_area_node(unsigned long size,
 	if (!(flags & VM_NO_GUARD))
 		size += PAGE_SIZE;
 
-	va = alloc_vmap_area(size, align, start, end, node, gfp_mask, 0);
+	va = alloc_vmap_area(size, align, start, end, node, gfp_mask, 0, area,
+			     flags, caller);
 	if (IS_ERR(va)) {
 		kfree(area);
 		return NULL;
 	}
 
-	setup_vmalloc_vm(area, va, flags, caller);
-
 	/*
 	 * Mark pages for non-VM_ALLOC mappings as accessible. Do it now as a
 	 * best-effort approach, as they can be mapped outside of vmalloc code.
@@ -4584,7 +4580,7 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 
 		spin_lock(&vn->busy.lock);
 		insert_vmap_area(vas[area], &vn->busy.root, &vn->busy.head);
-		setup_vmalloc_vm_locked(vms[area], vas[area], VM_ALLOC,
+		setup_vmalloc_vm(vms[area], vas[area], VM_ALLOC,
 				 pcpu_get_vm_areas);
 		spin_unlock(&vn->busy.lock);
 	}

base-commit: 7e6ae2db7f319bf9613ec6db8fa3c9bc1de1b346
--
2.43.0