Message ID | 20230314123403.100158-1-chenjun102@huawei.com |
---|---|
State | New |
Headers |
From: Chen Jun <chenjun102@huawei.com>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cl@linux.com, penberg@kernel.org, rientjes@google.com, iamjoonsoo.kim@lge.com, akpm@linux-foundation.org, vbabka@suse.cz
Cc: xuqiang36@huawei.com, chenjun102@huawei.com, wangkefeng.wang@huawei.com
Subject: [PATCH] mm/slub: Reduce memory consumption in extreme scenarios
Date: Tue, 14 Mar 2023 12:34:03 +0000
Message-ID: <20230314123403.100158-1-chenjun102@huawei.com> |
Series | mm/slub: Reduce memory consumption in extreme scenarios |
Commit Message
Chen Jun
March 14, 2023, 12:34 p.m. UTC
When kmalloc_node() is called without __GFP_THISNODE and the target node
lacks sufficient memory, SLUB allocates a folio from a node other than the
requested one instead of taking a partial slab from it.
However, since the allocated folio does not belong to the requested node,
it is deactivated and added to the partial slab list of the node it
actually belongs to.
This can result in excessive memory usage when the requested node is short
on memory, because SLUB keeps allocating folios from other nodes without
reusing the ones it allocated before.
To prevent this memory wastage, when a specific node is requested
(node != NUMA_NO_NODE) and the caller did not pass __GFP_THISNODE:
1) try to get a partial slab from the target node with __GFP_THISNODE.
2) if 1) fails, try to allocate a new slab from the target node with
__GFP_THISNODE.
3) if 2) fails, retry 1) and 2) without the __GFP_THISNODE constraint.
When node == NUMA_NO_NODE, or the caller did pass __GFP_THISNODE, the
behavior remains unchanged.
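Expressed as simplified pseudocode (a sketch mirroring the ___slab_alloc()
change in the diff below, not the literal kernel code):

	/* Sketch only; see the actual ___slab_alloc() diff below. */
	bool try_thisnode = true;

new_objects:
	pc.flags = gfpflags;
	if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode)
		pc.flags |= __GFP_THISNODE;	/* steps 1) and 2): stay on the target node */

	freelist = get_partial(s, node, &pc);	/* 1) partial slab from the target node */
	if (freelist)
		goto check_new_slab;

	slab = new_slab(s, pc.flags, node);	/* 2) new slab from the target node */
	if (!slab) {
		if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode) {
			try_thisnode = false;	/* 3) drop __GFP_THISNODE and retry 1) and 2) */
			goto new_objects;
		}
		slab_out_of_memory(s, gfpflags, node);
		return NULL;
	}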
On a qemu VM with 4 NUMA nodes, each with 1G of memory, a test module was
written that calls kmalloc_node(196, GFP_KERNEL, 3) (4 * 1024 + 4) * 1024
times.
Before this patch, cat /proc/slabinfo shows:
kmalloc-256 4200530 13519712 256 32 2 : tunables..
After this patch, cat /proc/slabinfo shows:
kmalloc-256 4200558 4200768 256 32 2 : tunables..
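For reference, a minimal sketch of such a test module (illustrative only;
the names and exact structure are assumed, not the module actually used
for the numbers above):

	/* Illustrative reproducer: pin ~4.2 million 196-byte objects on node 3. */
	#include <linux/module.h>
	#include <linux/slab.h>
	#include <linux/vmalloc.h>

	#define NR_ALLOCS ((4 * 1024 + 4) * 1024)

	static void **objs;

	static int __init kmalloc_node_test_init(void)
	{
		long i;

		objs = vmalloc(NR_ALLOCS * sizeof(*objs));
		if (!objs)
			return -ENOMEM;

		/* Fill node 3 well past its 1G of memory with 196-byte objects. */
		for (i = 0; i < NR_ALLOCS; i++)
			objs[i] = kmalloc_node(196, GFP_KERNEL, 3);

		return 0;
	}

	static void __exit kmalloc_node_test_exit(void)
	{
		long i;

		for (i = 0; i < NR_ALLOCS; i++)
			kfree(objs[i]);	/* kfree(NULL) is a no-op */
		vfree(objs);
	}

	module_init(kmalloc_node_test_init);
	module_exit(kmalloc_node_test_exit);
	MODULE_LICENSE("GPL");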
Signed-off-by: Chen Jun <chenjun102@huawei.com>
---
mm/slub.c | 22 +++++++++++++++++++---
1 file changed, 19 insertions(+), 3 deletions(-)
Comments
On 3/14/23 13:34, Chen Jun wrote:

[... quoted commit message and earlier hunks snipped ...]

> @@ -3181,8 +3182,18 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  	}
>  
>  new_objects:
> -
>  	pc.flags = gfpflags;
> +
> +	/*
> +	 * when (node != NUMA_NO_NODE) && (gfpflags & __GFP_THISNODE)
> +	 * 1) try to get a partial slab from target node with __GFP_THISNODE.
> +	 * 2) if 1) failed, try to allocate a new slab from target node with
> +	 * __GFP_THISNODE.
> +	 * 3) if 2) failed, retry 1) and 2) without __GFP_THISNODE constraint.
> +	 */
> +	if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode)
> +		pc.flags |= __GFP_THISNODE;

Hmm I'm thinking we should also perhaps remove direct reclaim possibilities
from the attempt 2). In your qemu test it should make no difference, as it
fills everything with kernel memory that is not reclaimable. But in practice
the target node might be filled with user memory, and I think it's better to
quickly allocate on a different node than spend time in direct reclaim. So
the following should work I think?

pc.flags = GFP_NOWAIT | __GFP_NOWARN | __GFP_THISNODE

> +
>  	pc.slab = &slab;
>  	pc.orig_size = orig_size;
>  	freelist = get_partial(s, node, &pc);
> @@ -3190,10 +3201,15 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  		goto check_new_slab;
>  
>  	slub_put_cpu_ptr(s->cpu_slab);
> -	slab = new_slab(s, gfpflags, node);
> +	slab = new_slab(s, pc.flags, node);
>  	c = slub_get_cpu_ptr(s->cpu_slab);
>  
>  	if (unlikely(!slab)) {
> +		if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode) {
> +			try_thisnode = false;
> +			goto new_objects;
> +		}
> +
>  		slab_out_of_memory(s, gfpflags, node);
>  		return NULL;
>  	}
On 3/17/23 12:32, chenjun (AM) wrote:
> On 2023/3/14 22:41, Vlastimil Babka wrote:
>>>  	pc.flags = gfpflags;
>>> +
>>> +	/*
>>> +	 * when (node != NUMA_NO_NODE) && (gfpflags & __GFP_THISNODE)
>>> +	 * 1) try to get a partial slab from target node with __GFP_THISNODE.
>>> +	 * 2) if 1) failed, try to allocate a new slab from target node with
>>> +	 * __GFP_THISNODE.
>>> +	 * 3) if 2) failed, retry 1) and 2) without __GFP_THISNODE constraint.
>>> +	 */
>>> +	if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode)
>>> +		pc.flags |= __GFP_THISNODE;
>>
>> Hmm I'm thinking we should also perhaps remove direct reclaim possibilities
>> from the attempt 2). In your qemu test it should make no difference, as it
>> fills everything with kernel memory that is not reclaimable. But in practice
>> the target node might be filled with user memory, and I think it's better to
>> quickly allocate on a different node than spend time in direct reclaim. So
>> the following should work I think?
>>
>> pc.flags = GFP_NOWAIT | __GFP_NOWARN | __GFP_THISNODE
>
> Hmm, should it be that:
>
> pc.flags |= GFP_NOWAIT | __GFP_NOWARN | __GFP_THISNODE

No, we need to ignore the other reclaim-related flags that the caller
passed, or it wouldn't work as intended.
The danger is that we ignore some flag that would be necessary to pass, but
I don't think there's any?
On 3/19/23 08:22, chenjun (AM) wrote:
> On 2023/3/17 20:06, Vlastimil Babka wrote:
>> On 3/17/23 12:32, chenjun (AM) wrote:

[... earlier quoted context snipped ...]

>> No, we need to ignore the other reclaim-related flags that the caller
>> passed, or it wouldn't work as intended.
>> The danger is that we ignore some flag that would be necessary to pass, but
>> I don't think there's any?
>
> If we ignore __GFP_ZERO passed by kzalloc, kzalloc will not work.
> Could we just unmask __GFP_RECLAIMABLE | __GFP_RECLAIM?
>
> pc.flags &= ~(__GFP_RECLAIMABLE | __GFP_RECLAIM)
> pc.flags |= __GFP_THISNODE

__GFP_RECLAIMABLE would be wrong, but also ignored as new_slab() does:

	flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK)

which would filter out __GFP_ZERO as well. That's not a problem as kzalloc()
will zero out the individual allocated objects, so it doesn't matter if we
don't zero out the whole slab page.

But I wonder, if we're not past due time for a helper e.g.
gfp_opportunistic(flags) that would turn any allocation flags to a
GFP_NOWAIT while keeping the rest of relevant flags intact, and thus there
would be one canonical way to do it - I'm sure there's a number of places
with their own variants now?
With such helper we'd just add __GFP_THISNODE to the result here as that's
specific to this particular opportunistic allocation.
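[To make the proposed helper concrete: a purely hypothetical sketch. No such
helper exists in the tree at this point, and the exact set of flags to clear
is precisely the open question in this thread.]

	/* Hypothetical helper, illustrative only: make any gfp mask "best effort". */
	static inline gfp_t gfp_opportunistic(gfp_t flags)
	{
		/* never enter direct reclaim or warn on failure; keep e.g. __GFP_ZERO */
		return (flags & ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL)) | __GFP_NOWARN;
	}

	/* the usage discussed here would then be something like: */
	pc.flags = gfp_opportunistic(gfpflags) | __GFP_THISNODE;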
On Mon, Mar 20, 2023 at 09:05:57AM +0100, Vlastimil Babka wrote:
> On 3/19/23 08:22, chenjun (AM) wrote:

[... earlier quoted context snipped ...]

> But I wonder, if we're not past due time for a helper e.g.
> gfp_opportunistic(flags) that would turn any allocation flags to a
> GFP_NOWAIT while keeping the rest of relevant flags intact, and thus there
> would be one canonical way to do it - I'm sure there's a number of places
> with their own variants now?
> With such helper we'd just add __GFP_THISNODE to the result here as that's
> specific to this particular opportunistic allocation.

I like the idea, but maybe gfp_no_reclaim() would be clearer?
On 3/20/23 10:12, Mike Rapoport wrote:
> On Mon, Mar 20, 2023 at 09:05:57AM +0100, Vlastimil Babka wrote:

[... earlier quoted context snipped ...]

>> But I wonder, if we're not past due time for a helper e.g.
>> gfp_opportunistic(flags) that would turn any allocation flags to a
>> GFP_NOWAIT while keeping the rest of relevant flags intact, and thus there
>> would be one canonical way to do it - I'm sure there's a number of places
>> with their own variants now?
>> With such helper we'd just add __GFP_THISNODE to the result here as that's
>> specific to this particular opportunistic allocation.
>
> I like the idea, but maybe gfp_no_reclaim() would be clearer?

Well, that name would say how it's implemented, but not exactly as we also
want to add __GFP_NOWARN. "gfp_opportunistic()" or a better name with similar
meaning was meant to convey the intention of what this allocation is trying
to do, and I think that's better from the API users POV?
On 3/21/23 10:30, chenjun (AM) wrote:
> On 2023/3/20 17:12, Mike Rapoport wrote:

[... earlier quoted context snipped ...]

>>> But I wonder, if we're not past due time for a helper e.g.
>>> gfp_opportunistic(flags) that would turn any allocation flags to a
>>> GFP_NOWAIT while keeping the rest of relevant flags intact, and thus there
>>> would be one canonical way to do it - I'm sure there's a number of places
>>> with their own variants now?
>>> With such helper we'd just add __GFP_THISNODE to the result here as that's
>>> specific to this particular opportunistic allocation.
>>
>> I like the idea, but maybe gfp_no_reclaim() would be clearer?
>
> #define gfp_no_reclaim(gfpflag) (gfpflag & ~__GFP_DIRECT_RECLAIM)

I hoped for more feedback on the idea, but it's probably best proposed
outside of this slub-specific thread, so we could go for an open-coded
solution in slub for now. Also just masking out __GFP_DIRECT_RECLAIM
wouldn't be sufficient in any case for the general solution.

> And here,
>
> pc.flags = gfp_no_reclaim(gfpflags) | __GFP_THISNODE.

I'd still suggest as earlier:

pc.flags = GFP_NOWAIT | __GFP_NOWARN | __GFP_THISNODE;

> Do I get it right?
diff --git a/mm/slub.c b/mm/slub.c
index 39327e98fce3..32e436957e03 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2384,7 +2384,7 @@ static void *get_partial(struct kmem_cache *s, int node, struct partial_context
 		searchnode = numa_mem_id();
 
 	object = get_partial_node(s, get_node(s, searchnode), pc);
-	if (object || node != NUMA_NO_NODE)
+	if (object || (node != NUMA_NO_NODE && (pc->flags & __GFP_THISNODE)))
 		return object;
 
 	return get_any_partial(s, pc);
@@ -3069,6 +3069,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	struct slab *slab;
 	unsigned long flags;
 	struct partial_context pc;
+	bool try_thisnode = true;
 
 	stat(s, ALLOC_SLOWPATH);
 
@@ -3181,8 +3182,18 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	}
 
 new_objects:
-
 	pc.flags = gfpflags;
+
+	/*
+	 * when (node != NUMA_NO_NODE) && (gfpflags & __GFP_THISNODE)
+	 * 1) try to get a partial slab from target node with __GFP_THISNODE.
+	 * 2) if 1) failed, try to allocate a new slab from target node with
+	 * __GFP_THISNODE.
+	 * 3) if 2) failed, retry 1) and 2) without __GFP_THISNODE constraint.
+	 */
+	if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode)
+		pc.flags |= __GFP_THISNODE;
+
 	pc.slab = &slab;
 	pc.orig_size = orig_size;
 	freelist = get_partial(s, node, &pc);
@@ -3190,10 +3201,15 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 		goto check_new_slab;
 
 	slub_put_cpu_ptr(s->cpu_slab);
-	slab = new_slab(s, gfpflags, node);
+	slab = new_slab(s, pc.flags, node);
 	c = slub_get_cpu_ptr(s->cpu_slab);
 
 	if (unlikely(!slab)) {
+		if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode) {
+			try_thisnode = false;
+			goto new_objects;
+		}
+
 		slab_out_of_memory(s, gfpflags, node);
 		return NULL;
 	}