From patchwork Tue Nov 22 20:39:26 2022
X-Patchwork-Submitter: Mathieu Desnoyers
X-Patchwork-Id: 24586
From: Mathieu Desnoyers
To: Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, Thomas Gleixner, "Paul E. McKenney",
    Boqun Feng, "H. Peter Anvin", Paul Turner, linux-api@vger.kernel.org,
    Christian Brauner, Florian Weimer, David.Laight@ACULAB.COM,
    carlos@redhat.com, Peter Oskolkov, Alexander Mikhalitsyn,
    Chris Kennelly, Mathieu Desnoyers
Subject: [PATCH 24/30] sched: NUMA-aware per-memory-map concurrency ID
Date: Tue, 22 Nov 2022 15:39:26 -0500
Message-Id: <20221122203932.231377-25-mathieu.desnoyers@efficios.com>
In-Reply-To: <20221122203932.231377-1-mathieu.desnoyers@efficios.com>
References: <20221122203932.231377-1-mathieu.desnoyers@efficios.com>

Keep track of a NUMA-aware concurrency ID. On NUMA systems, when a
NUMA-aware concurrency ID is observed by user-space to be associated
with a NUMA node, it is guaranteed to remain associated with that node
unless a kernel-level NUMA configuration change happens.

Exposing a NUMA-aware concurrency ID is useful in situations where a
process, or a set of processes belonging to a cpuset, is pinned to a
set of cores belonging to a subset of the system's NUMA nodes. In those
situations, the compactness of concurrency IDs compared to CPU numbers
can be leveraged, while preserving NUMA locality, to index per-CPU data
structures that take NUMA locality into account.
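As an illustration (a user-space sketch only, not part of this patch):
assuming the NUMA-aware concurrency ID is eventually exposed to
user-space, for example through a hypothetical accessor named
rseq_current_numa_cid(), and assuming an upper bound MAX_CIDS on the
number of concurrency IDs, a program could index a node-local table as
follows:

  /*
   * Illustrative sketch only. rseq_current_numa_cid() and MAX_CIDS are
   * hypothetical names standing in for whatever interface exposes the
   * NUMA-aware concurrency ID and its upper bound (e.g. the number of
   * CPUs allowed for the process); they are not existing APIs.
   */
  extern int rseq_current_numa_cid(void);	/* hypothetical accessor */

  #define MAX_CIDS	512			/* assumed upper bound */

  struct counter_slot {
  	long count;
  } __attribute__((aligned(128)));		/* avoid false sharing */

  static struct counter_slot slots[MAX_CIDS];

  static void increment_local_counter(void)
  {
  	int cid = rseq_current_numa_cid();

  	/*
  	 * The ID is compact (dense, starting at 0) and, barring a NUMA
  	 * reconfiguration, keeps mapping to the same NUMA node, so
  	 * slots[cid] can be backed by memory local to that node.
  	 */
  	slots[cid].count++;
  }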
Signed-off-by: Mathieu Desnoyers
---
 include/linux/mm.h       |  18 +++++
 include/linux/mm_types.h |  68 +++++++++++++++-
 include/linux/sched.h    |   3 +
 kernel/fork.c            |   3 +
 kernel/sched/core.c      |   8 +-
 kernel/sched/sched.h     | 168 +++++++++++++++++++++++++++++++++++----
 6 files changed, 245 insertions(+), 23 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e0fba52de3e2..c7dfdf4c9d08 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3484,6 +3484,20 @@ static inline int task_mm_cid(struct task_struct *t)
 {
 	return t->mm_cid;
 }
+#ifdef CONFIG_NUMA
+static inline int task_mm_numa_cid(struct task_struct *t)
+{
+	if (num_possible_nodes() == 1)
+		return task_mm_cid(t);
+	else
+		return t->mm_numa_cid;
+}
+#else
+static inline int task_mm_numa_cid(struct task_struct *t)
+{
+	return task_mm_cid(t);
+}
+#endif
 #else
 static inline void sched_mm_cid_before_execve(struct task_struct *t) { }
 static inline void sched_mm_cid_after_execve(struct task_struct *t) { }
@@ -3498,6 +3512,10 @@ static inline int task_mm_cid(struct task_struct *t)
 	 */
 	return raw_smp_processor_id();
 }
+static inline int task_mm_numa_cid(struct task_struct *t)
+{
+	return task_mm_cid(t);
+}
 #endif
 
 #endif /* _LINUX_MM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index dabb42d26bb9..8c9afe8ce603 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
@@ -847,15 +848,80 @@ static inline cpumask_t *mm_cidmask(struct mm_struct *mm)
 	return (struct cpumask *)cid_bitmap;
 }
 
+#ifdef CONFIG_NUMA
+/*
+ * Layout of node cidmasks:
+ * - mm_numa cidmask:           cpumask of the currently used mm_numa cids.
+ * - node_alloc cidmask:        cpumask tracking which cid were
+ *                              allocated (across nodes) in this
+ *                              memory map.
+ * - node cidmask[nr_node_ids]: per-node cpumask tracking which cid
+ *                              were allocated in this memory map.
+ */
+static inline cpumask_t *mm_numa_cidmask(struct mm_struct *mm)
+{
+	unsigned long cid_bitmap = (unsigned long)mm_cidmask(mm);
+
+	/* Skip mm_cidmask */
+	cid_bitmap += cpumask_size();
+	return (struct cpumask *)cid_bitmap;
+}
+
+static inline cpumask_t *mm_node_alloc_cidmask(struct mm_struct *mm)
+{
+	unsigned long cid_bitmap = (unsigned long)mm_numa_cidmask(mm);
+
+	/* Skip mm_numa_cidmask */
+	cid_bitmap += cpumask_size();
+	return (struct cpumask *)cid_bitmap;
+}
+
+static inline cpumask_t *mm_node_cidmask(struct mm_struct *mm, unsigned int node)
+{
+	unsigned long cid_bitmap = (unsigned long)mm_node_alloc_cidmask(mm);
+
+	/* Skip node alloc cidmask */
+	cid_bitmap += cpumask_size();
+	cid_bitmap += node * cpumask_size();
+	return (struct cpumask *)cid_bitmap;
+}
+
+static inline void mm_init_node_cidmask(struct mm_struct *mm)
+{
+	unsigned int node;
+
+	if (num_possible_nodes() == 1)
+		return;
+	cpumask_clear(mm_numa_cidmask(mm));
+	cpumask_clear(mm_node_alloc_cidmask(mm));
+	for (node = 0; node < nr_node_ids; node++)
+		cpumask_clear(mm_node_cidmask(mm, node));
+}
+
+static inline unsigned int mm_node_cidmask_size(void)
+{
+	if (num_possible_nodes() == 1)
+		return 0;
+	return (nr_node_ids + 2) * cpumask_size();
+}
+#else /* CONFIG_NUMA */
+static inline void mm_init_node_cidmask(struct mm_struct *mm) { }
+static inline unsigned int mm_node_cidmask_size(void)
+{
+	return 0;
+}
+#endif /* CONFIG_NUMA */
+
 static inline void mm_init_cid(struct mm_struct *mm)
 {
 	raw_spin_lock_init(&mm->cid_lock);
 	cpumask_clear(mm_cidmask(mm));
+	mm_init_node_cidmask(mm);
 }
 
 static inline unsigned int mm_cid_size(void)
 {
-	return cpumask_size();
+	return cpumask_size() + mm_node_cidmask_size();
 }
 #else /* CONFIG_SCHED_MM_CID */
 static inline void mm_init_cid(struct mm_struct *mm) { }
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c7e3c27e0e2e..990ef3d4b22b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1317,6 +1317,9 @@ struct task_struct {
 #ifdef CONFIG_SCHED_MM_CID
 	int				mm_cid;		/* Current cid in mm */
 	int				mm_cid_active;	/* Whether cid bitmap is active */
+#ifdef CONFIG_NUMA
+	int				mm_numa_cid;	/* Current numa_cid in mm */
+#endif
 #endif
 
 	struct tlbflush_unmap_batch	tlb_ubc;
diff --git a/kernel/fork.c b/kernel/fork.c
index d48dedc4be75..364f4c62b1a4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1050,6 +1050,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 #ifdef CONFIG_SCHED_MM_CID
 	tsk->mm_cid = -1;
 	tsk->mm_cid_active = 0;
+#ifdef CONFIG_NUMA
+	tsk->mm_numa_cid = -1;
+#endif
 #endif
 	return tsk;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ef0cc40cca6b..095b5eb35d3d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11284,8 +11284,7 @@ void sched_mm_cid_exit_signals(struct task_struct *t)
 	if (!mm)
 		return;
 	local_irq_save(flags);
-	mm_cid_put(mm, t->mm_cid);
-	t->mm_cid = -1;
+	mm_cid_put(mm, t);
 	t->mm_cid_active = 0;
 	local_irq_restore(flags);
 }
@@ -11298,8 +11297,7 @@ void sched_mm_cid_before_execve(struct task_struct *t)
 	if (!mm)
 		return;
 	local_irq_save(flags);
-	mm_cid_put(mm, t->mm_cid);
-	t->mm_cid = -1;
+	mm_cid_put(mm, t);
 	t->mm_cid_active = 0;
 	local_irq_restore(flags);
 }
@@ -11312,7 +11310,7 @@ void sched_mm_cid_after_execve(struct task_struct *t)
 	WARN_ON_ONCE((t->flags & PF_KTHREAD) || !t->mm);
 
 	local_irq_save(flags);
-	t->mm_cid = mm_cid_get(mm);
+	mm_cid_get(mm, t);
 	t->mm_cid_active = 1;
 	local_irq_restore(flags);
 	rseq_set_notify_resume(t);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0096dc22926e..87f61f926e88 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3262,38 +3262,174 @@ static inline void update_current_exec_runtime(struct task_struct *curr,
 }
 
 #ifdef CONFIG_SCHED_MM_CID
-static inline int __mm_cid_get(struct mm_struct *mm)
+#ifdef CONFIG_NUMA
+static inline void __mm_numa_cid_get(struct mm_struct *mm, struct task_struct *t)
+{
+	struct cpumask *cpumask = mm_numa_cidmask(mm),
+		       *node_cpumask = mm_node_cidmask(mm, numa_node_id()),
+		       *node_alloc_cpumask = mm_node_alloc_cidmask(mm);
+	unsigned int node;
+	int cid;
+
+	if (num_possible_nodes() == 1) {
+		cid = -1;
+		goto end;
+	}
+
+	/*
+	 * Try to reserve lowest available cid number within those already
+	 * reserved for this NUMA node.
+	 */
+	cid = cpumask_first_andnot(node_cpumask, cpumask);
+	if (cid >= nr_cpu_ids)
+		goto alloc_numa;
+	__cpumask_set_cpu(cid, cpumask);
+	goto end;
+
+alloc_numa:
+	/*
+	 * Try to reserve lowest available cid number within those not already
+	 * allocated for numa nodes.
+	 */
+	cid = cpumask_first_notandnot(node_alloc_cpumask, cpumask);
+	if (cid >= nr_cpu_ids)
+		goto numa_update;
+	__cpumask_set_cpu(cid, cpumask);
+	__cpumask_set_cpu(cid, node_cpumask);
+	__cpumask_set_cpu(cid, node_alloc_cpumask);
+	goto end;
+
+numa_update:
+	/*
+	 * NUMA node id configuration changed for at least one CPU in the system.
+	 * We need to steal a currently unused cid from an overprovisioned
+	 * node for our current node. Userspace must handle the fact that the
+	 * node id associated with this cid may change due to node ID
+	 * reconfiguration.
+	 *
+	 * Count how many possible cpus are attached to each (other) node id,
+	 * and compare this with the per-mm node cidmask cpu count. Find one
+	 * which has too many cpus in its mask to steal from.
+	 */
+	for (node = 0; node < nr_node_ids; node++) {
+		struct cpumask *iter_cpumask;
+
+		if (node == numa_node_id())
+			continue;
+		iter_cpumask = mm_node_cidmask(mm, node);
+		if (nr_cpus_node(node) < cpumask_weight(iter_cpumask)) {
+			/* Try to steal from this node. */
+			cid = cpumask_first_andnot(iter_cpumask, cpumask);
+			if (cid >= nr_cpu_ids)
+				goto steal_fail;
+			__cpumask_set_cpu(cid, cpumask);
+			__cpumask_clear_cpu(cid, iter_cpumask);
+			__cpumask_set_cpu(cid, node_cpumask);
+			goto end;
+		}
+	}
+
+steal_fail:
+	/*
+	 * Our attempt at gracefully stealing a cid from another
+	 * overprovisioned NUMA node failed. Fallback to grabbing the first
+	 * available cid.
+	 */
+	cid = cpumask_first_zero(cpumask);
+	if (cid >= nr_cpu_ids) {
+		cid = -1;
+		goto end;
+	}
+	__cpumask_set_cpu(cid, cpumask);
+	/* Steal cid from its numa node mask. */
+	for (node = 0; node < nr_node_ids; node++) {
+		struct cpumask *iter_cpumask;
+
+		if (node == numa_node_id())
+			continue;
+		iter_cpumask = mm_node_cidmask(mm, node);
+		if (cpumask_test_cpu(cid, iter_cpumask)) {
+			__cpumask_clear_cpu(cid, iter_cpumask);
+			break;
+		}
+	}
+	__cpumask_set_cpu(cid, node_cpumask);
+end:
+	t->mm_numa_cid = cid;
+}
+
+static inline void __mm_numa_cid_put(struct mm_struct *mm, struct task_struct *t)
+{
+	int cid = t->mm_numa_cid;
+
+	if (num_possible_nodes() == 1)
+		return;
+	if (cid < 0)
+		return;
+	__cpumask_clear_cpu(cid, mm_numa_cidmask(mm));
+	t->mm_numa_cid = -1;
+}
+
+static inline void mm_numa_transfer_cid_prev_next(struct task_struct *prev, struct task_struct *next)
+{
+	next->mm_numa_cid = prev->mm_numa_cid;
+	prev->mm_numa_cid = -1;
+}
+#else
+static inline void __mm_numa_cid_get(struct mm_struct *mm, struct task_struct *t) { }
+static inline void __mm_numa_cid_put(struct mm_struct *mm, struct task_struct *t) { }
+static inline void mm_numa_transfer_cid_prev_next(struct task_struct *prev, struct task_struct *next) { }
+#endif
+
+static inline void __mm_cid_get(struct mm_struct *mm, struct task_struct *t)
 {
 	struct cpumask *cpumask;
 	int cid;
 
 	cpumask = mm_cidmask(mm);
 	cid = cpumask_first_zero(cpumask);
-	if (cid >= nr_cpu_ids)
-		return -1;
+	if (cid >= nr_cpu_ids) {
+		cid = -1;
+		goto end;
+	}
 	__cpumask_set_cpu(cid, cpumask);
-	return cid;
+end:
+	t->mm_cid = cid;
 }
 
-static inline void mm_cid_put(struct mm_struct *mm, int cid)
+static inline void mm_cid_get(struct mm_struct *mm, struct task_struct *t)
 {
 	lockdep_assert_irqs_disabled();
-	if (cid < 0)
-		return;
 	raw_spin_lock(&mm->cid_lock);
-	__cpumask_clear_cpu(cid, mm_cidmask(mm));
+	__mm_cid_get(mm, t);
+	__mm_numa_cid_get(mm, t);
 	raw_spin_unlock(&mm->cid_lock);
 }
 
-static inline int mm_cid_get(struct mm_struct *mm)
+static inline void __mm_cid_put(struct mm_struct *mm, struct task_struct *t)
 {
-	int ret;
+	int cid = t->mm_cid;
+
+	if (cid < 0)
+		return;
+	__cpumask_clear_cpu(cid, mm_cidmask(mm));
+	t->mm_cid = -1;
+}
 
+static inline void mm_cid_put(struct mm_struct *mm, struct task_struct *t)
+{
 	lockdep_assert_irqs_disabled();
 	raw_spin_lock(&mm->cid_lock);
-	ret = __mm_cid_get(mm);
+	__mm_cid_put(mm, t);
+	__mm_numa_cid_put(mm, t);
 	raw_spin_unlock(&mm->cid_lock);
-	return ret;
+}
+
+static inline void mm_transfer_cid_prev_next(struct task_struct *prev, struct task_struct *next)
+{
+	next->mm_cid = prev->mm_cid;
+	prev->mm_cid = -1;
+	mm_numa_transfer_cid_prev_next(prev, next);
 }
 
 static inline void switch_mm_cid(struct task_struct *prev, struct task_struct *next)
@@ -3304,15 +3440,13 @@ static inline void switch_mm_cid(struct task_struct *prev, struct task_struct *next)
 			 * Context switch between threads in same mm, hand over
 			 * the mm_cid from prev to next.
 			 */
-			next->mm_cid = prev->mm_cid;
-			prev->mm_cid = -1;
+			mm_transfer_cid_prev_next(prev, next);
 			return;
 		}
-		mm_cid_put(prev->mm, prev->mm_cid);
-		prev->mm_cid = -1;
+		mm_cid_put(prev->mm, prev);
 	}
 	if (next->mm_cid_active)
-		next->mm_cid = mm_cid_get(next->mm);
+		mm_cid_get(next->mm, next);
 }
 
 #else