From patchwork Wed Mar 15 12:18:10 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Juri Lelli
X-Patchwork-Id: 70199
From: Juri Lelli
To: Peter Zijlstra, Ingo Molnar, Qais Yousef, Waiman Long, Tejun Heo,
 Zefan Li, Johannes Weiner, Hao Luo
Cc: Dietmar Eggemann, Steven Rostedt, linux-kernel@vger.kernel.org,
 luca.abeni@santannapisa.it, claudio@evidence.eu.com,
 tommaso.cucinotta@santannapisa.it, bristot@redhat.com,
 mathieu.poirier@linaro.org, cgroups@vger.kernel.org, Vincent Guittot,
 Wei Wang, Rick Yiu, Quentin Perret, Heiko Carstens, Vasily Gorbik,
 Alexander Gordeev, Sudeep Holla, Juri Lelli
Subject: [RFC PATCH 1/3] sched/cpuset: Bring back cpuset_mutex
Date: Wed, 15 Mar 2023 12:18:10 +0000
Message-Id: <20230315121812.206079-2-juri.lelli@redhat.com>
X-Mailer: git-send-email 2.39.2
In-Reply-To: <20230315121812.206079-1-juri.lelli@redhat.com>
References: <20230315121812.206079-1-juri.lelli@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Turns out percpu_cpuset_rwsem - commit 1243dc518c9d ("cgroup/cpuset:
Convert cpuset_mutex to percpu_rwsem") - wasn't such a brilliant idea,
as it has been reported to cause slowdowns in workloads that need to
change cpuset configuration frequently and it is also not implementing
priority inheritance (which causes troubles with realtime workloads).

Convert percpu_cpuset_rwsem back to regular cpuset_mutex. Also grab it
only for SCHED_DEADLINE tasks (other policies don't care about stable
cpusets anyway).

Signed-off-by: Juri Lelli
---
 include/linux/cpuset.h |   8 +--
 kernel/cgroup/cpuset.c | 147 ++++++++++++++++++++---------------------
 kernel/sched/core.c    |  22 ++++--
 3 files changed, 91 insertions(+), 86 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index d58e0476ee8e..355f796c5f07 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -71,8 +71,8 @@ extern void cpuset_init_smp(void);
 extern void cpuset_force_rebuild(void);
 extern void cpuset_update_active_cpus(void);
 extern void cpuset_wait_for_hotplug(void);
-extern void cpuset_read_lock(void);
-extern void cpuset_read_unlock(void);
+extern void cpuset_lock(void);
+extern void cpuset_unlock(void);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern bool cpuset_cpus_allowed_fallback(struct task_struct *p);
 extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
@@ -196,8 +196,8 @@ static inline void cpuset_update_active_cpus(void)
 
 static inline void cpuset_wait_for_hotplug(void) { }
 
-static inline void cpuset_read_lock(void) { }
-static inline void cpuset_read_unlock(void) { }
+static inline void cpuset_lock(void) { }
+static inline void cpuset_unlock(void) { }
 
 static inline void cpuset_cpus_allowed(struct task_struct *p,
 				       struct cpumask *mask)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index a29c0b13706b..8d82d66d432b 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -366,22 +366,21 @@ static struct cpuset top_cpuset = {
 		if (is_cpuset_online(((des_cs) = css_cs((pos_css)))))
 
 /*
- * There are two global locks guarding cpuset structures - cpuset_rwsem and
+ * There are two global locks guarding cpuset structures - cpuset_mutex and
  * callback_lock. We also require taking task_lock() when dereferencing a
  * task's cpuset pointer. See "The task_lock() exception", at the end of this
- * comment. The cpuset code uses only cpuset_rwsem write lock. Other
- * kernel subsystems can use cpuset_read_lock()/cpuset_read_unlock() to
- * prevent change to cpuset structures.
+ * comment. The cpuset code uses only cpuset_mutex. Other kernel subsystems
+ * can use cpuset_lock()/cpuset_unlock() to prevent change to cpuset
+ * structures.
  *
  * A task must hold both locks to modify cpusets. If a task holds
- * cpuset_rwsem, it blocks others wanting that rwsem, ensuring that it
- * is the only task able to also acquire callback_lock and be able to
- * modify cpusets. It can perform various checks on the cpuset structure
- * first, knowing nothing will change. It can also allocate memory while
- * just holding cpuset_rwsem. While it is performing these checks, various
- * callback routines can briefly acquire callback_lock to query cpusets.
- * Once it is ready to make the changes, it takes callback_lock, blocking
- * everyone else.
+ * cpuset_mutex, it blocks others, ensuring that it is the only task able to
+ * also acquire callback_lock and be able to modify cpusets. It can perform
+ * various checks on the cpuset structure first, knowing nothing will change.
+ * It can also allocate memory while just holding cpuset_mutex. While it is
+ * performing these checks, various callback routines can briefly acquire
+ * callback_lock to query cpusets. Once it is ready to make the changes, it
+ * takes callback_lock, blocking everyone else.
  *
  * Calls to the kernel memory allocator can not be made while holding
  * callback_lock, as that would risk double tripping on callback_lock
@@ -403,16 +402,16 @@ static struct cpuset top_cpuset = {
  * guidelines for accessing subsystem state in kernel/cgroup.c
  */
 
-DEFINE_STATIC_PERCPU_RWSEM(cpuset_rwsem);
+static DEFINE_MUTEX(cpuset_mutex);
 
-void cpuset_read_lock(void)
+void cpuset_lock(void)
 {
-	percpu_down_read(&cpuset_rwsem);
+	mutex_lock(&cpuset_mutex);
 }
 
-void cpuset_read_unlock(void)
+void cpuset_unlock(void)
 {
-	percpu_up_read(&cpuset_rwsem);
+	mutex_unlock(&cpuset_mutex);
 }
 
 static DEFINE_SPINLOCK(callback_lock);
@@ -496,7 +495,7 @@ static inline bool partition_is_populated(struct cpuset *cs,
  * One way or another, we guarantee to return some non-empty subset
  * of cpu_online_mask.
  *
- * Call with callback_lock or cpuset_rwsem held.
+ * Call with callback_lock or cpuset_mutex held.
  */
 static void guarantee_online_cpus(struct task_struct *tsk,
 				  struct cpumask *pmask)
@@ -538,7 +537,7 @@ static void guarantee_online_cpus(struct task_struct *tsk,
  * One way or another, we guarantee to return some non-empty subset
  * of node_states[N_MEMORY].
  *
- * Call with callback_lock or cpuset_rwsem held.
+ * Call with callback_lock or cpuset_mutex held.
  */
 static void guarantee_online_mems(struct cpuset *cs, nodemask_t *pmask)
 {
@@ -550,7 +549,7 @@ static void guarantee_online_mems(struct cpuset *cs, nodemask_t *pmask)
 /*
  * update task's spread flag if cpuset's page/slab spread flag is set
  *
- * Call with callback_lock or cpuset_rwsem held. The check can be skipped
+ * Call with callback_lock or cpuset_mutex held. The check can be skipped
  * if on default hierarchy.
  */
 static void cpuset_update_task_spread_flags(struct cpuset *cs,
@@ -575,7 +574,7 @@ static void cpuset_update_task_spread_flags(struct cpuset *cs,
  *
  * One cpuset is a subset of another if all its allowed CPUs and
  * Memory Nodes are a subset of the other, and its exclusive flags
- * are only set if the other's are set. Call holding cpuset_rwsem.
+ * are only set if the other's are set. Call holding cpuset_mutex.
  */
 
 static int is_cpuset_subset(const struct cpuset *p, const struct cpuset *q)
@@ -713,7 +712,7 @@ static int validate_change_legacy(struct cpuset *cur, struct cpuset *trial)
  * If we replaced the flag and mask values of the current cpuset
  * (cur) with those values in the trial cpuset (trial), would
  * our various subset and exclusive rules still be valid? Presumes
- * cpuset_rwsem held.
+ * cpuset_mutex held.
  *
  * 'cur' is the address of an actual, in-use cpuset. Operations
  * such as list traversal that depend on the actual address of the
@@ -829,7 +828,7 @@ static void update_domain_attr_tree(struct sched_domain_attr *dattr,
 	rcu_read_unlock();
 }
 
-/* Must be called with cpuset_rwsem held. */
+/* Must be called with cpuset_mutex held. */
 static inline int nr_cpusets(void)
 {
 	/* jump label reference count + the top-level cpuset */
@@ -855,7 +854,7 @@ static inline int nr_cpusets(void)
  * domains when operating in the severe memory shortage situations
  * that could cause allocation failures below.
  *
- * Must be called with cpuset_rwsem held.
+ * Must be called with cpuset_mutex held.
  *
  * The three key local variables below are:
  * cp - cpuset pointer, used (together with pos_css) to perform a
@@ -1084,7 +1083,7 @@ static void rebuild_root_domains(void)
 	struct cpuset *cs = NULL;
 	struct cgroup_subsys_state *pos_css;
 
-	percpu_rwsem_assert_held(&cpuset_rwsem);
+	lockdep_assert_held(&cpuset_mutex);
 	lockdep_assert_cpus_held();
 	lockdep_assert_held(&sched_domains_mutex);
 
@@ -1134,7 +1133,7 @@ partition_and_rebuild_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
  * 'cpus' is removed, then call this routine to rebuild the
  * scheduler's dynamic sched domains.
  *
- * Call with cpuset_rwsem held. Takes cpus_read_lock().
+ * Call with cpuset_mutex held. Takes cpus_read_lock().
  */
 static void rebuild_sched_domains_locked(void)
 {
@@ -1145,7 +1144,7 @@ static void rebuild_sched_domains_locked(void)
 	int ndoms;
 
 	lockdep_assert_cpus_held();
-	percpu_rwsem_assert_held(&cpuset_rwsem);
+	lockdep_assert_held(&cpuset_mutex);
 
 	/*
 	 * If we have raced with CPU hotplug, return early to avoid
@@ -1196,9 +1195,9 @@ static void rebuild_sched_domains_locked(void)
 void rebuild_sched_domains(void)
 {
 	cpus_read_lock();
-	percpu_down_write(&cpuset_rwsem);
+	mutex_lock(&cpuset_mutex);
 	rebuild_sched_domains_locked();
-	percpu_up_write(&cpuset_rwsem);
+	mutex_unlock(&cpuset_mutex);
 	cpus_read_unlock();
 }
 
@@ -1207,7 +1206,7 @@ void rebuild_sched_domains(void)
  * @cs: the cpuset in which each task's cpus_allowed mask needs to be changed
  *
  * Iterate through each task of @cs updating its cpus_allowed to the
- * effective cpuset's. As this function is called with cpuset_rwsem held,
+ * effective cpuset's. As this function is called with cpuset_mutex held,
  * cpuset membership stays stable.
  */
 static void update_tasks_cpumask(struct cpuset *cs)
@@ -1313,7 +1312,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
 	int old_prs, new_prs;
 	int part_error = PERR_NONE;	/* Partition error? */
 
-	percpu_rwsem_assert_held(&cpuset_rwsem);
+	lockdep_assert_held(&cpuset_mutex);
 
 	/*
 	 * The parent must be a partition root.
@@ -1536,7 +1535,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
  *
  * On legacy hierarchy, effective_cpus will be the same with cpu_allowed.
  *
- * Called with cpuset_rwsem held
+ * Called with cpuset_mutex held
  */
 static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp,
 				 bool force)
@@ -1696,7 +1695,7 @@ static void update_sibling_cpumasks(struct cpuset *parent, struct cpuset *cs,
 	struct cpuset *sibling;
 	struct cgroup_subsys_state *pos_css;
 
-	percpu_rwsem_assert_held(&cpuset_rwsem);
+	lockdep_assert_held(&cpuset_mutex);
 
 	/*
 	 * Check all its siblings and call update_cpumasks_hier()
@@ -1938,12 +1937,12 @@ static void *cpuset_being_rebound;
  * @cs: the cpuset in which each task's mems_allowed mask needs to be changed
  *
  * Iterate through each task of @cs updating its mems_allowed to the
- * effective cpuset's. As this function is called with cpuset_rwsem held,
+ * effective cpuset's. As this function is called with cpuset_mutex held,
  * cpuset membership stays stable.
  */
 static void update_tasks_nodemask(struct cpuset *cs)
 {
-	static nodemask_t newmems;	/* protected by cpuset_rwsem */
+	static nodemask_t newmems;	/* protected by cpuset_mutex */
 	struct css_task_iter it;
 	struct task_struct *task;
 
@@ -1956,7 +1955,7 @@ static void update_tasks_nodemask(struct cpuset *cs)
 	 * take while holding tasklist_lock. Forks can happen - the
 	 * mpol_dup() cpuset_being_rebound check will catch such forks,
 	 * and rebind their vma mempolicies too. Because we still hold
-	 * the global cpuset_rwsem, we know that no other rebind effort
+	 * the global cpuset_mutex, we know that no other rebind effort
 	 * will be contending for the global variable cpuset_being_rebound.
 	 * It's ok if we rebind the same mm twice; mpol_rebind_mm()
 	 * is idempotent. Also migrate pages in each mm to new nodes.
@@ -2002,7 +2001,7 @@ static void update_tasks_nodemask(struct cpuset *cs)
  *
  * On legacy hierarchy, effective_mems will be the same with mems_allowed.
  *
- * Called with cpuset_rwsem held
+ * Called with cpuset_mutex held
  */
 static void update_nodemasks_hier(struct cpuset *cs, nodemask_t *new_mems)
 {
@@ -2055,7 +2054,7 @@ static void update_nodemasks_hier(struct cpuset *cs, nodemask_t *new_mems)
  * mempolicies and if the cpuset is marked 'memory_migrate',
  * migrate the tasks pages to the new memory.
  *
- * Call with cpuset_rwsem held. May take callback_lock during call.
+ * Call with cpuset_mutex held. May take callback_lock during call.
  * Will take tasklist_lock, scan tasklist for tasks in cpuset cs,
  * lock each such tasks mm->mmap_lock, scan its vma's and rebind
  * their mempolicies to the cpusets new mems_allowed.
@@ -2147,7 +2146,7 @@ static int update_relax_domain_level(struct cpuset *cs, s64 val)
  * @cs: the cpuset in which each task's spread flags needs to be changed
  *
  * Iterate through each task of @cs updating its spread flags. As this
- * function is called with cpuset_rwsem held, cpuset membership stays
+ * function is called with cpuset_mutex held, cpuset membership stays
  * stable.
  */
 static void update_tasks_flags(struct cpuset *cs)
@@ -2167,7 +2166,7 @@ static void update_tasks_flags(struct cpuset *cs)
  * cs: the cpuset to update
  * turning_on: whether the flag is being set or cleared
  *
- * Call with cpuset_rwsem held.
+ * Call with cpuset_mutex held.
  */
 
 static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
@@ -2217,7 +2216,7 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
  * @new_prs: new partition root state
  * Return: 0 if successful, != 0 if error
  *
- * Call with cpuset_rwsem held.
+ * Call with cpuset_mutex held.
  */
 static int update_prstate(struct cpuset *cs, int new_prs)
 {
@@ -2440,7 +2439,7 @@ static int fmeter_getrate(struct fmeter *fmp)
 
 static struct cpuset *cpuset_attach_old_cs;
 
-/* Called by cgroups to determine if a cpuset is usable; cpuset_rwsem held */
+/* Called by cgroups to determine if a cpuset is usable; cpuset_mutex held */
 static int cpuset_can_attach(struct cgroup_taskset *tset)
 {
 	struct cgroup_subsys_state *css;
@@ -2452,7 +2451,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 	cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
 	cs = css_cs(css);
 
-	percpu_down_write(&cpuset_rwsem);
+	mutex_lock(&cpuset_mutex);
 
 	/* allow moving tasks into an empty cpuset if on default hierarchy */
 	ret = -ENOSPC;
@@ -2482,7 +2481,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 	cs->attach_in_progress++;
 	ret = 0;
 out_unlock:
-	percpu_up_write(&cpuset_rwsem);
+	mutex_unlock(&cpuset_mutex);
 	return ret;
 }
 
@@ -2492,13 +2491,13 @@ static void cpuset_cancel_attach(struct cgroup_taskset *tset)
 
 	cgroup_taskset_first(tset, &css);
 
-	percpu_down_write(&cpuset_rwsem);
+	mutex_lock(&cpuset_mutex);
 	css_cs(css)->attach_in_progress--;
-	percpu_up_write(&cpuset_rwsem);
+	mutex_unlock(&cpuset_mutex);
 }
 
 /*
- * Protected by cpuset_rwsem. cpus_attach is used only by cpuset_attach()
+ * Protected by cpuset_mutex. cpus_attach is used only by cpuset_attach()
  * but we can't allocate it dynamically there. Define it global and
  * allocate from cpuset_init().
  */
@@ -2506,7 +2505,7 @@ static cpumask_var_t cpus_attach;
 
 static void cpuset_attach(struct cgroup_taskset *tset)
 {
-	/* static buf protected by cpuset_rwsem */
+	/* static buf protected by cpuset_mutex */
 	static nodemask_t cpuset_attach_nodemask_to;
 	struct task_struct *task;
 	struct task_struct *leader;
@@ -2519,7 +2518,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	cs = css_cs(css);
 
 	lockdep_assert_cpus_held();	/* see cgroup_attach_lock() */
-	percpu_down_write(&cpuset_rwsem);
+	mutex_lock(&cpuset_mutex);
 	cpus_updated = !cpumask_equal(cs->effective_cpus,
 				      oldcs->effective_cpus);
 	mems_updated = !nodes_equal(cs->effective_mems, oldcs->effective_mems);
@@ -2592,7 +2591,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 	if (!cs->attach_in_progress)
 		wake_up(&cpuset_attach_wq);
 
-	percpu_up_write(&cpuset_rwsem);
+	mutex_unlock(&cpuset_mutex);
 }
 
 /* The various types of files and directories in a cpuset file system */
@@ -2624,7 +2623,7 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
 	int retval = 0;
 
 	cpus_read_lock();
-	percpu_down_write(&cpuset_rwsem);
+	mutex_lock(&cpuset_mutex);
 	if (!is_cpuset_online(cs)) {
 		retval = -ENODEV;
 		goto out_unlock;
@@ -2660,7 +2659,7 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
 		break;
 	}
 out_unlock:
-	percpu_up_write(&cpuset_rwsem);
+	mutex_unlock(&cpuset_mutex);
 	cpus_read_unlock();
 	return retval;
 }
@@ -2673,7 +2672,7 @@ static int cpuset_write_s64(struct cgroup_subsys_state *css, struct cftype *cft,
 	int retval = -ENODEV;
 
 	cpus_read_lock();
-	percpu_down_write(&cpuset_rwsem);
+	mutex_lock(&cpuset_mutex);
 	if (!is_cpuset_online(cs))
 		goto out_unlock;
 
@@ -2686,7 +2685,7 @@ static int cpuset_write_s64(struct cgroup_subsys_state *css, struct cftype *cft,
 		break;
 	}
 out_unlock:
-	percpu_up_write(&cpuset_rwsem);
+	mutex_unlock(&cpuset_mutex);
 	cpus_read_unlock();
 	return retval;
 }
@@ -2719,7 +2718,7 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 	 * operation like this one can lead to a deadlock through kernfs
 	 * active_ref protection. Let's break the protection. Losing the
 	 * protection is okay as we check whether @cs is online after
-	 * grabbing cpuset_rwsem anyway. This only happens on the legacy
+	 * grabbing cpuset_mutex anyway. This only happens on the legacy
 	 * hierarchies.
 	 */
 	css_get(&cs->css);
@@ -2727,7 +2726,7 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 	flush_work(&cpuset_hotplug_work);
 
 	cpus_read_lock();
-	percpu_down_write(&cpuset_rwsem);
+	mutex_lock(&cpuset_mutex);
 	if (!is_cpuset_online(cs))
 		goto out_unlock;
 
@@ -2751,7 +2750,7 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
 
 	free_cpuset(trialcs);
 out_unlock:
-	percpu_up_write(&cpuset_rwsem);
+	mutex_unlock(&cpuset_mutex);
 	cpus_read_unlock();
 	kernfs_unbreak_active_protection(of->kn);
 	css_put(&cs->css);
@@ -2899,13 +2898,13 @@ static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf,
 
 	css_get(&cs->css);
 	cpus_read_lock();
-	percpu_down_write(&cpuset_rwsem);
+	mutex_lock(&cpuset_mutex);
 	if (!is_cpuset_online(cs))
 		goto out_unlock;
 
 	retval = update_prstate(cs, val);
 out_unlock:
-	percpu_up_write(&cpuset_rwsem);
+	mutex_unlock(&cpuset_mutex);
 	cpus_read_unlock();
 	css_put(&cs->css);
 	return retval ?: nbytes;
@@ -3122,7 +3121,7 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 		return 0;
 
 	cpus_read_lock();
-	percpu_down_write(&cpuset_rwsem);
+	mutex_lock(&cpuset_mutex);
 
 	set_bit(CS_ONLINE, &cs->flags);
 	if (is_spread_page(parent))
@@ -3173,7 +3172,7 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 	cpumask_copy(cs->effective_cpus, parent->cpus_allowed);
 	spin_unlock_irq(&callback_lock);
 out_unlock:
-	percpu_up_write(&cpuset_rwsem);
+	mutex_unlock(&cpuset_mutex);
 	cpus_read_unlock();
 	return 0;
 }
@@ -3194,7 +3193,7 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
 	struct cpuset *cs = css_cs(css);
 
 	cpus_read_lock();
-	percpu_down_write(&cpuset_rwsem);
+	mutex_lock(&cpuset_mutex);
 
 	if (is_partition_valid(cs))
 		update_prstate(cs, 0);
@@ -3213,7 +3212,7 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
 	cpuset_dec();
 	clear_bit(CS_ONLINE, &cs->flags);
 
-	percpu_up_write(&cpuset_rwsem);
+	mutex_unlock(&cpuset_mutex);
 	cpus_read_unlock();
 }
 
@@ -3226,7 +3225,7 @@ static void cpuset_css_free(struct cgroup_subsys_state *css)
 
 static void cpuset_bind(struct cgroup_subsys_state *root_css)
 {
-	percpu_down_write(&cpuset_rwsem);
+	mutex_lock(&cpuset_mutex);
 	spin_lock_irq(&callback_lock);
 
 	if (is_in_v2_mode()) {
@@ -3239,7 +3238,7 @@ static void cpuset_bind(struct cgroup_subsys_state *root_css)
 	}
 
 	spin_unlock_irq(&callback_lock);
-	percpu_up_write(&cpuset_rwsem);
+	mutex_unlock(&cpuset_mutex);
 }
 
 /*
@@ -3281,8 +3280,6 @@ struct cgroup_subsys cpuset_cgrp_subsys = {
 
 int __init cpuset_init(void)
 {
-	BUG_ON(percpu_init_rwsem(&cpuset_rwsem));
-
 	BUG_ON(!alloc_cpumask_var(&top_cpuset.cpus_allowed, GFP_KERNEL));
 	BUG_ON(!alloc_cpumask_var(&top_cpuset.effective_cpus, GFP_KERNEL));
 	BUG_ON(!zalloc_cpumask_var(&top_cpuset.subparts_cpus, GFP_KERNEL));
@@ -3354,7 +3351,7 @@ hotplug_update_tasks_legacy(struct cpuset *cs,
 	is_empty = cpumask_empty(cs->cpus_allowed) ||
 		   nodes_empty(cs->mems_allowed);
 
-	percpu_up_write(&cpuset_rwsem);
+	mutex_unlock(&cpuset_mutex);
 
 	/*
 	 * Move tasks to the nearest ancestor with execution resources,
@@ -3364,7 +3361,7 @@ hotplug_update_tasks_legacy(struct cpuset *cs,
 	if (is_empty)
 		remove_tasks_in_empty_cpuset(cs);
 
-	percpu_down_write(&cpuset_rwsem);
+	mutex_lock(&cpuset_mutex);
 }
 
 static void
@@ -3415,14 +3412,14 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 retry:
 	wait_event(cpuset_attach_wq, cs->attach_in_progress == 0);
 
-	percpu_down_write(&cpuset_rwsem);
+	mutex_lock(&cpuset_mutex);
 
 	/*
 	 * We have raced with task attaching. We wait until attaching
 	 * is finished, so we won't attach a task to an empty cpuset.
 	 */
 	if (cs->attach_in_progress) {
-		percpu_up_write(&cpuset_rwsem);
+		mutex_unlock(&cpuset_mutex);
 		goto retry;
 	}
 
@@ -3516,7 +3513,7 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 		hotplug_update_tasks_legacy(cs, &new_cpus, &new_mems,
 					    cpus_updated, mems_updated);
 
-	percpu_up_write(&cpuset_rwsem);
+	mutex_unlock(&cpuset_mutex);
 }
 
 /**
@@ -3546,7 +3543,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 	if (on_dfl && !alloc_cpumasks(NULL, &tmp))
 		ptmp = &tmp;
 
-	percpu_down_write(&cpuset_rwsem);
+	mutex_lock(&cpuset_mutex);
 
 	/* fetch the available cpus/mems and find out which changed how */
 	cpumask_copy(&new_cpus, cpu_active_mask);
@@ -3603,7 +3600,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
 		update_tasks_nodemask(&top_cpuset);
 	}
 
-	percpu_up_write(&cpuset_rwsem);
+	mutex_unlock(&cpuset_mutex);
 
 	/* if cpus or mems changed, we need to propagate to descendants */
 	if (cpus_updated || mems_updated) {
@@ -4008,7 +4005,7 @@ void __cpuset_memory_pressure_bump(void)
  * - Used for /proc/<pid>/cpuset.
  * - No need to task_lock(tsk) on this tsk->cpuset reference, as it
  * doesn't really matter if tsk->cpuset changes after we read it,
- * and we take cpuset_rwsem, keeping cpuset_attach() from changing it
+ * and we take cpuset_mutex, keeping cpuset_attach() from changing it
  * anyway.
  */
 int proc_cpuset_show(struct seq_file *m, struct pid_namespace *ns,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4580fe3e1d0c..5902cbb5e751 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7535,6 +7535,7 @@ static int __sched_setscheduler(struct task_struct *p,
 	int reset_on_fork;
 	int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
 	struct rq *rq;
+	bool cpuset_locked = false;
 
 	/* The pi code expects interrupts enabled */
 	BUG_ON(pi && in_interrupt());
@@ -7584,8 +7585,14 @@ static int __sched_setscheduler(struct task_struct *p,
 			return retval;
 	}
 
-	if (pi)
-		cpuset_read_lock();
+	/*
+	 * SCHED_DEADLINE bandwidth accounting relies on stable cpusets
+	 * information.
+	 */
+	if (dl_policy(policy) || dl_policy(p->policy)) {
+		cpuset_locked = true;
+		cpuset_lock();
+	}
 
 	/*
 	 * Make sure no PI-waiters arrive (or leave) while we are
@@ -7661,8 +7668,8 @@ static int __sched_setscheduler(struct task_struct *p,
 	if (unlikely(oldpolicy != -1 && oldpolicy != p->policy)) {
 		policy = oldpolicy = -1;
 		task_rq_unlock(rq, p, &rf);
-		if (pi)
-			cpuset_read_unlock();
+		if (cpuset_locked)
+			cpuset_unlock();
 		goto recheck;
 	}
 
@@ -7729,7 +7736,8 @@ static int __sched_setscheduler(struct task_struct *p,
 	task_rq_unlock(rq, p, &rf);
 
 	if (pi) {
-		cpuset_read_unlock();
+		if (cpuset_locked)
+			cpuset_unlock();
 		rt_mutex_adjust_pi(p);
 	}
 
@@ -7741,8 +7749,8 @@ static int __sched_setscheduler(struct task_struct *p,
 
 unlock:
 	task_rq_unlock(rq, p, &rf);
-	if (pi)
-		cpuset_read_unlock();
+	if (cpuset_locked)
+		cpuset_unlock();
 	return retval;
 }
 

From patchwork Wed Mar 15 12:18:11 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Juri Lelli
X-Patchwork-Id: 70191
2023 08:19:12 -0400 X-MC-Unique: lviYAVyVOU6_WfSoCy_uvA-1 Received: by mail-qv1-f71.google.com with SMTP id jo13-20020a056214500d00b004c6c72bf1d0so11409552qvb.9 for ; Wed, 15 Mar 2023 05:19:11 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; t=1678882750; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=YgxzsfQSDKWEcDKochbw04bI9bnEC0kJv7FwQRZ5yy0=; b=D2DXcqektU+Rkvtpcha4oFqlW+SiyPiDtArFJwJLKHuzTbmWz4/vS3+3JnuTnHKpdm H3G01NnbRMiOBH35YK9mGsAfL1QM3jsgFunb8kYX9D43tiTZWtAFsuW0RO7BYNjoda+k BkL9Bm4hqkd6zxmmwheiNsgM+bYwS2mk/L0IMw+5s4foewX2OqlxugR0lyfbpArrtAOO rQMv7tW63rrJmZlrkp8V0XujLzOe9XVDJTWhPZZqezBu6e/r3i0wHpi0QQlx1GPJn9/9 FpNxXwKOLjTKTve8WwhzTvuOXaqGkez8aik6v+7D/TQA2X2rajSGZ13caMqSUgj5PZt2 A+7w== X-Gm-Message-State: AO0yUKXc4bSu9osI5q40KA28UFImM7Ix77sXHqKPQzDdPDAenRn0+g+9 hD1aGR39VO6eKcKX4LBMM87MVSbea7JRAp0e3fG2PbwAOyCiJRxhmW3dCy+XclRjl9EQjLSM9OQ vAHUE5dZ1inGEEdeWpf//9MGb X-Received: by 2002:ac8:7fc1:0:b0:3bf:b973:3078 with SMTP id b1-20020ac87fc1000000b003bfb9733078mr74469216qtk.13.1678882749781; Wed, 15 Mar 2023 05:19:09 -0700 (PDT) X-Received: by 2002:ac8:7fc1:0:b0:3bf:b973:3078 with SMTP id b1-20020ac87fc1000000b003bfb9733078mr74469175qtk.13.1678882749503; Wed, 15 Mar 2023 05:19:09 -0700 (PDT) Received: from localhost.localdomain.com ([2a00:23c6:4a21:6f01:ac73:9611:643a:5397]) by smtp.gmail.com with ESMTPSA id f11-20020ac8014b000000b003bd21323c80sm3672595qtg.11.2023.03.15.05.19.06 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 15 Mar 2023 05:19:09 -0700 (PDT) From: Juri Lelli To: Peter Zijlstra , Ingo Molnar , Qais Yousef , Waiman Long , Tejun Heo , Zefan Li , Johannes Weiner , Hao Luo Cc: Dietmar Eggemann , Steven Rostedt , linux-kernel@vger.kernel.org, luca.abeni@santannapisa.it, claudio@evidence.eu.com, tommaso.cucinotta@santannapisa.it, bristot@redhat.com, 
mathieu.poirier@linaro.org, cgroups@vger.kernel.org, Vincent Guittot , Wei Wang , Rick Yiu , Quentin Perret , Heiko Carstens , Vasily Gorbik , Alexander Gordeev , Sudeep Holla , Juri Lelli Subject: [RFC PATCH 2/3] sched/cpuset: Keep track of SCHED_DEADLINE tasks in cpusets Date: Wed, 15 Mar 2023 12:18:11 +0000 Message-Id: <20230315121812.206079-3-juri.lelli@redhat.com> X-Mailer: git-send-email 2.39.2 In-Reply-To: <20230315121812.206079-1-juri.lelli@redhat.com> References: <20230315121812.206079-1-juri.lelli@redhat.com> MIME-Version: 1.0 X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H2,SPF_HELO_NONE,SPF_NONE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1760438540765732401?= X-GMAIL-MSGID: =?utf-8?q?1760438540765732401?= Qais reported that iterating over all tasks when rebuilding root domains for finding out which ones are DEADLINE and need their bandwidth correctly restored on such root domains can be a costly operation (10+ ms delays on suspend-resume). To fix the problem keep track of the number of DEADLINE tasks belonging to each cpuset and then use this information (followup patch) to only perform the above iteration if DEADLINE tasks are actually present in the cpuset for which a corresponding root domain is being rebuilt. 
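As a rough user-space illustration of the bookkeeping described above
(this is not the kernel code; cpuset_model, task_model, set_policy and
move_task are hypothetical names made up for the sketch), the counter
only changes when a task enters or leaves the DEADLINE class, or when a
DEADLINE task moves from one set to another:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel structures, for illustration only. */
enum policy_model { POLICY_NORMAL, POLICY_FIFO, POLICY_DEADLINE };

struct cpuset_model {
	int nr_deadline_tasks;	/* DEADLINE tasks currently in this set */
};

struct task_model {
	enum policy_model policy;
	struct cpuset_model *cs;
};

static bool is_dl(enum policy_model policy)
{
	return policy == POLICY_DEADLINE;
}

/*
 * Change a task's policy. The owning set's counter is touched only on a
 * transition into or out of the DEADLINE class, so repeating the same
 * policy leaves it unchanged.
 */
static void set_policy(struct task_model *t, enum policy_model new_policy)
{
	if (is_dl(new_policy) && !is_dl(t->policy))
		t->cs->nr_deadline_tasks++;
	else if (is_dl(t->policy) && !is_dl(new_policy))
		t->cs->nr_deadline_tasks--;

	/* A negative count would mean an unbalanced decrement. */
	assert(t->cs->nr_deadline_tasks >= 0);
	t->policy = new_policy;
}

/* Move a task to another set, carrying its DEADLINE contribution along. */
static void move_task(struct task_model *t, struct cpuset_model *dst)
{
	if (is_dl(t->policy)) {
		dst->nr_deadline_tasks++;
		t->cs->nr_deadline_tasks--;
	}
	t->cs = dst;
}
```

The sketch has no locking at all; in the series the counters are only
meant to be modified with the cpuset lock held, which is what the
previous patch arranges for the setscheduler path.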
Reported-by: Qais Yousef
Signed-off-by: Juri Lelli
---
 include/linux/cpuset.h |  4 ++++
 kernel/cgroup/cgroup.c |  4 ++++
 kernel/cgroup/cpuset.c | 25 +++++++++++++++++++++++++
 kernel/sched/core.c    | 10 ++++++++++
 4 files changed, 43 insertions(+)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 355f796c5f07..0348dba5680e 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -71,6 +71,8 @@ extern void cpuset_init_smp(void);
 extern void cpuset_force_rebuild(void);
 extern void cpuset_update_active_cpus(void);
 extern void cpuset_wait_for_hotplug(void);
+extern void inc_dl_tasks_cs(struct task_struct *task);
+extern void dec_dl_tasks_cs(struct task_struct *task);
 extern void cpuset_lock(void);
 extern void cpuset_unlock(void);
 extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
@@ -196,6 +198,8 @@ static inline void cpuset_update_active_cpus(void)
 
 static inline void cpuset_wait_for_hotplug(void) { }
 
+static inline void inc_dl_tasks_cs(struct task_struct *task) { }
+static inline void dec_dl_tasks_cs(struct task_struct *task) { }
 static inline void cpuset_lock(void) { }
 static inline void cpuset_unlock(void) { }
 
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c099cf3fa02d..357925e1e4af 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -57,6 +57,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 
@@ -6673,6 +6674,9 @@ void cgroup_exit(struct task_struct *tsk)
 	list_add_tail(&tsk->cg_list, &cset->dying_tasks);
 	cset->nr_tasks--;
 
+	if (dl_task(tsk))
+		dec_dl_tasks_cs(tsk);
+
 	WARN_ON_ONCE(cgroup_task_frozen(tsk));
 	if (unlikely(!(tsk->flags & PF_KTHREAD) &&
 		     test_bit(CGRP_FREEZE, &task_dfl_cgroup(tsk)->flags)))
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 8d82d66d432b..57bc60112618 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -193,6 +193,12 @@ struct cpuset {
 	int use_parent_ecpus;
 	int child_ecpus_count;
 
+	/*
+	 * number of SCHED_DEADLINE tasks attached to this cpuset, so that we
+	 * know when to rebuild associated root domain bandwidth information.
+	 */
+	int nr_deadline_tasks;
+
 	/* Invalid partition error code, not lock protected */
 	enum prs_errcode prs_err;
 
@@ -245,6 +251,20 @@ static inline struct cpuset *parent_cs(struct cpuset *cs)
 	return css_cs(cs->css.parent);
 }
 
+void inc_dl_tasks_cs(struct task_struct *p)
+{
+	struct cpuset *cs = task_cs(p);
+
+	cs->nr_deadline_tasks++;
+}
+
+void dec_dl_tasks_cs(struct task_struct *p)
+{
+	struct cpuset *cs = task_cs(p);
+
+	cs->nr_deadline_tasks--;
+}
+
 /* bits in struct cpuset flags field */
 typedef enum {
 	CS_ONLINE,
@@ -2472,6 +2492,11 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 		ret = security_task_setscheduler(task);
 		if (ret)
 			goto out_unlock;
+
+		if (dl_task(task)) {
+			cs->nr_deadline_tasks++;
+			cpuset_attach_old_cs->nr_deadline_tasks--;
+		}
 	}
 
 	/*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5902cbb5e751..d586a8440348 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7683,6 +7683,16 @@ static int __sched_setscheduler(struct task_struct *p,
 		goto unlock;
 	}
 
+	/*
+	 * In case a task is setscheduled to SCHED_DEADLINE, or if a task is
+	 * moved to a different sched policy, we need to keep track of that on
+	 * its cpuset (for correct bandwidth tracking).
+	 */
+	if (dl_policy(policy) && !dl_task(p))
+		inc_dl_tasks_cs(p);
+	else if (dl_task(p) && !dl_policy(policy))
+		dec_dl_tasks_cs(p);
+
 	p->sched_reset_on_fork = reset_on_fork;
 	oldprio = p->prio;
 

From patchwork Wed Mar 15 12:18:12 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Juri Lelli
X-Patchwork-Id: 70203
From: Juri Lelli
To: Peter Zijlstra, Ingo Molnar, Qais Yousef, Waiman Long, Tejun Heo,
 Zefan Li, Johannes Weiner, Hao Luo
Cc: Dietmar Eggemann, Steven Rostedt, linux-kernel@vger.kernel.org,
 luca.abeni@santannapisa.it, claudio@evidence.eu.com,
 tommaso.cucinotta@santannapisa.it, bristot@redhat.com,
 mathieu.poirier@linaro.org, cgroups@vger.kernel.org, Vincent Guittot,
 Wei Wang, Rick Yiu, Quentin Perret, Heiko Carstens, Vasily Gorbik,
 Alexander Gordeev, Sudeep Holla, Juri Lelli
Subject: [RFC PATCH 3/3] cgroup/cpuset: Iterate only if DEADLINE tasks are
 present
Date: Wed, 15 Mar 2023 12:18:12 +0000
Message-Id: <20230315121812.206079-4-juri.lelli@redhat.com>
X-Mailer: git-send-email 2.39.2
In-Reply-To: <20230315121812.206079-1-juri.lelli@redhat.com>
References: <20230315121812.206079-1-juri.lelli@redhat.com>

update_tasks_root_domain currently iterates over all tasks even if no
DEADLINE task is present on the cpuset/root domain for which bandwidth
accounting is being rebuilt. This has been reported to introduce 10+ ms
delays on suspend-resume operations.

Skip the costly iteration for cpusets that don't contain DEADLINE tasks.

Reported-by: Qais Yousef
Signed-off-by: Juri Lelli
---
 kernel/cgroup/cpuset.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 57bc60112618..f46192d2e97e 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1090,6 +1090,9 @@ static void update_tasks_root_domain(struct cpuset *cs)
 	struct css_task_iter it;
 	struct task_struct *task;
 
+	if (cs->nr_deadline_tasks == 0)
+		return;
+
 	css_task_iter_start(&cs->css, 0, &it);
 
 	while ((task = css_task_iter_next(&it)))
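
The effect of the guard added in the hunk above can be modelled in user
space roughly as follows. This is only a sketch under stated
assumptions: walk_task, cs_model and rebuild_bw are hypothetical names,
and summing a per-task bw field merely stands in for whatever
per-task work the real iteration does. The function returns how many
tasks it visited so that the saving is observable:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task record: only what the sketch needs. */
struct walk_task {
	int is_deadline;	/* non-zero for a DEADLINE task */
	unsigned long bw;	/* bandwidth contributed by the task */
};

/* Hypothetical set of tasks with a cached count of DEADLINE members. */
struct cs_model {
	int nr_deadline_tasks;
	const struct walk_task *tasks;
	size_t nr_tasks;
};

/*
 * Recompute the total DEADLINE bandwidth of a set and return the number
 * of tasks visited. When the cached counter is zero the walk is skipped
 * entirely, which is the point of the optimisation.
 */
static size_t rebuild_bw(const struct cs_model *cs, unsigned long *total_bw)
{
	size_t visited = 0;

	*total_bw = 0;

	if (cs->nr_deadline_tasks == 0)
		return 0;

	for (size_t i = 0; i < cs->nr_tasks; i++) {
		visited++;
		if (cs->tasks[i].is_deadline)
			*total_bw += cs->tasks[i].bw;
	}

	return visited;
}
```

The shortcut is only as good as the counter: if nr_deadline_tasks were
ever allowed to drift from the real membership, the walk would be
skipped for a set that still needs it, which is why the counter updates
in the previous patch have to cover every path that changes policy or
moves a task.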