[v2,3/4] sched/isolation: Add HK_TYPE_WQ to isolcpus=domain

Message ID 20221013184028.129486-4-leobras@redhat.com
State New
Series: CPU isolation improvements

Commit Message

Leonardo Bras Soares Passos Oct. 13, 2022, 6:40 p.m. UTC
Housekeeping code keeps multiple cpumasks in order to keep track of which
CPUs can perform a given housekeeping category.

Every time the HK_TYPE_WQ cpumask is checked before queueing work on a CPU
workqueue, the HK_TYPE_DOMAIN cpumask is also checked. So it can be assumed
that domain isolation also ends up isolating workqueues.

Delegating HK_TYPE_DOMAIN's current workqueue isolation to HK_TYPE_WQ makes
it simpler to check whether a CPU can run work on a workqueue, since the
code only needs to go through a single HK_TYPE_* cpumask.

Make isolcpus=domain aggregate both HK_TYPE_DOMAIN and HK_TYPE_WQ, and
remove a lot of cpumask_and() calls.

Also, remove an unnecessary '|=' in housekeeping_isolcpus_setup(), since we
are sure that 'flags == 0' at that point.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 drivers/pci/pci-driver.c | 13 +------------
 kernel/sched/isolation.c |  4 ++--
 kernel/workqueue.c       |  1 -
 net/core/net-sysfs.c     |  1 -
 4 files changed, 3 insertions(+), 16 deletions(-)
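
As a quick orientation before the diffs below, here is a minimal sketch of the
caller-side pattern this patch targets, adapted from the pci_call_probe() hunk
(illustrative only, not code added by the patch):

/* Before: callers intersect the WQ and DOMAIN housekeeping masks themselves. */
cpumask_var_t wq_domain_mask;

if (!zalloc_cpumask_var(&wq_domain_mask, GFP_KERNEL))
	return -ENOMEM;
cpumask_and(wq_domain_mask,
	    housekeeping_cpumask(HK_TYPE_WQ),
	    housekeeping_cpumask(HK_TYPE_DOMAIN));
cpu = cpumask_any_and(cpumask_of_node(node), wq_domain_mask);
free_cpumask_var(wq_domain_mask);

/* After: HK_TYPE_WQ already reflects isolcpus=domain, so one lookup suffices. */
cpu = cpumask_any_and(cpumask_of_node(node),
		      housekeeping_cpumask(HK_TYPE_WQ));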
  

Comments

kernel test robot Oct. 14, 2022, 12:56 a.m. UTC | #1
Hi Leonardo,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on herbert-cryptodev-2.6/master]
[also build test WARNING on herbert-crypto-2.6/master helgaas-pci/next tip/sched/core tj-wq/for-next linus/master v6.0 next-20221013]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patches, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Leonardo-Bras/CPU-isolation-improvements/20221014-024707
base:   https://git.kernel.org/pub/scm/linux/kernel/git/herbert/cryptodev-2.6.git master
config: x86_64-rhel-8.3-kunit
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
        # https://github.com/intel-lab-lkp/linux/commit/7df1006c37ca5353fda63cd01ebf5bf3a2ef6562
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Leonardo-Bras/CPU-isolation-improvements/20221014-024707
        git checkout 7df1006c37ca5353fda63cd01ebf5bf3a2ef6562
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash drivers/pci/

If you fix the issue, kindly add the following tag where applicable:
| Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   drivers/pci/pci-driver.c: In function 'pci_call_probe':
>> drivers/pci/pci-driver.c:382:1: warning: label 'out' defined but not used [-Wunused-label]
     382 | out:
         | ^~~


vim +/out +382 drivers/pci/pci-driver.c

0b2c2a71e6f07fb Thomas Gleixner     2017-05-24  349  
d42c69972b853fd Andi Kleen          2005-07-06  350  static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
d42c69972b853fd Andi Kleen          2005-07-06  351  			  const struct pci_device_id *id)
d42c69972b853fd Andi Kleen          2005-07-06  352  {
0b2c2a71e6f07fb Thomas Gleixner     2017-05-24  353  	int error, node, cpu;
873392ca514f87e Rusty Russell       2008-12-31  354  	struct drv_dev_and_id ddi = { drv, dev, id };
873392ca514f87e Rusty Russell       2008-12-31  355  
12c3156f10c5d8c Alexander Duyck     2013-11-18  356  	/*
12c3156f10c5d8c Alexander Duyck     2013-11-18  357  	 * Execute driver initialization on node where the device is
12c3156f10c5d8c Alexander Duyck     2013-11-18  358  	 * attached.  This way the driver likely allocates its local memory
12c3156f10c5d8c Alexander Duyck     2013-11-18  359  	 * on the right node.
12c3156f10c5d8c Alexander Duyck     2013-11-18  360  	 */
873392ca514f87e Rusty Russell       2008-12-31  361  	node = dev_to_node(&dev->dev);
0b2c2a71e6f07fb Thomas Gleixner     2017-05-24  362  	dev->is_probed = 1;
0b2c2a71e6f07fb Thomas Gleixner     2017-05-24  363  
0b2c2a71e6f07fb Thomas Gleixner     2017-05-24  364  	cpu_hotplug_disable();
12c3156f10c5d8c Alexander Duyck     2013-11-18  365  
12c3156f10c5d8c Alexander Duyck     2013-11-18  366  	/*
0b2c2a71e6f07fb Thomas Gleixner     2017-05-24  367  	 * Prevent nesting work_on_cpu() for the case where a Virtual Function
0b2c2a71e6f07fb Thomas Gleixner     2017-05-24  368  	 * device is probed from work_on_cpu() of the Physical device.
12c3156f10c5d8c Alexander Duyck     2013-11-18  369  	 */
0b2c2a71e6f07fb Thomas Gleixner     2017-05-24  370  	if (node < 0 || node >= MAX_NUMNODES || !node_online(node) ||
9d42ea0d6984044 Frederic Weisbecker 2022-02-07  371  	    pci_physfn_is_probed(dev)) {
0b2c2a71e6f07fb Thomas Gleixner     2017-05-24  372  		cpu = nr_cpu_ids;
9d42ea0d6984044 Frederic Weisbecker 2022-02-07  373  	} else {
69a18b18699b596 Alex Belits         2020-06-25  374  		cpu = cpumask_any_and(cpumask_of_node(node),
7df1006c37ca535 Leonardo Bras       2022-10-13  375  				      housekeeping_cpumask(HK_TYPE_WQ));
9d42ea0d6984044 Frederic Weisbecker 2022-02-07  376  	}
0b2c2a71e6f07fb Thomas Gleixner     2017-05-24  377  
873392ca514f87e Rusty Russell       2008-12-31  378  	if (cpu < nr_cpu_ids)
873392ca514f87e Rusty Russell       2008-12-31  379  		error = work_on_cpu(cpu, local_pci_probe, &ddi);
873392ca514f87e Rusty Russell       2008-12-31  380  	else
873392ca514f87e Rusty Russell       2008-12-31  381  		error = local_pci_probe(&ddi);
9d42ea0d6984044 Frederic Weisbecker 2022-02-07 @382  out:
0b2c2a71e6f07fb Thomas Gleixner     2017-05-24  383  	dev->is_probed = 0;
0b2c2a71e6f07fb Thomas Gleixner     2017-05-24  384  	cpu_hotplug_enable();
d42c69972b853fd Andi Kleen          2005-07-06  385  	return error;
d42c69972b853fd Andi Kleen          2005-07-06  386  }
d42c69972b853fd Andi Kleen          2005-07-06  387
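
One plausible way to address the warning on top of this patch (a sketch only,
not a fix posted in this thread) would be to drop the now-unused label, since
the -ENOMEM error path that jumped to it is gone:

	if (cpu < nr_cpu_ids)
		error = work_on_cpu(cpu, local_pci_probe, &ddi);
	else
		error = local_pci_probe(&ddi);

	/* 'out:' removed: the only 'goto out' went away with the -ENOMEM path. */
	dev->is_probed = 0;
	cpu_hotplug_enable();
	return error;
}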
  
Peter Zijlstra Oct. 14, 2022, 8:36 a.m. UTC | #2
+ Frederic; who actually does most of this code

On Thu, Oct 13, 2022 at 03:40:28PM -0300, Leonardo Bras wrote:
> Housekeeping code keeps multiple cpumasks in order to keep track of which
> CPUs can perform a given housekeeping category.
> 
> Every time the HK_TYPE_WQ cpumask is checked before queueing work on a CPU
> workqueue, the HK_TYPE_DOMAIN cpumask is also checked. So it can be assumed
> that domain isolation also ends up isolating workqueues.
> 
> Delegating HK_TYPE_DOMAIN's current workqueue isolation to HK_TYPE_WQ makes
> it simpler to check whether a CPU can run work on a workqueue, since the
> code only needs to go through a single HK_TYPE_* cpumask.
> 
> Make isolcpus=domain aggregate both HK_TYPE_DOMAIN and HK_TYPE_WQ, and
> remove a lot of cpumask_and() calls.
> 
> Also, remove an unnecessary '|=' in housekeeping_isolcpus_setup(), since we
> are sure that 'flags == 0' at that point.
> 
> Signed-off-by: Leonardo Bras <leobras@redhat.com>

I've long maintained that having all these separate masks is daft;
Frederic do we really need that?

> ---
>  drivers/pci/pci-driver.c | 13 +------------
>  kernel/sched/isolation.c |  4 ++--
>  kernel/workqueue.c       |  1 -
>  net/core/net-sysfs.c     |  1 -
>  4 files changed, 3 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index 107d77f3c8467..550bef2504b8d 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -371,19 +371,8 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
>  	    pci_physfn_is_probed(dev)) {
>  		cpu = nr_cpu_ids;
>  	} else {
> -		cpumask_var_t wq_domain_mask;
> -
> -		if (!zalloc_cpumask_var(&wq_domain_mask, GFP_KERNEL)) {
> -			error = -ENOMEM;
> -			goto out;
> -		}
> -		cpumask_and(wq_domain_mask,
> -			    housekeeping_cpumask(HK_TYPE_WQ),
> -			    housekeeping_cpumask(HK_TYPE_DOMAIN));
> -
>  		cpu = cpumask_any_and(cpumask_of_node(node),
> -				      wq_domain_mask);
> -		free_cpumask_var(wq_domain_mask);
> +				      housekeeping_cpumask(HK_TYPE_WQ));
>  	}
>  
>  	if (cpu < nr_cpu_ids)
> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index 373d42c707bc5..ced4b78564810 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -204,7 +204,7 @@ static int __init housekeeping_isolcpus_setup(char *str)
>  
>  		if (!strncmp(str, "domain,", 7)) {
>  			str += 7;
> -			flags |= HK_FLAG_DOMAIN;
> +			flags |= HK_FLAG_DOMAIN | HK_FLAG_WQ;
>  			continue;
>  		}
>  
> @@ -234,7 +234,7 @@ static int __init housekeeping_isolcpus_setup(char *str)
>  
>  	/* Default behaviour for isolcpus without flags */
>  	if (!flags)
> -		flags |= HK_FLAG_DOMAIN;
> +		flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
>  
>  	return housekeeping_setup(str, flags);
>  }
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 7cd5f5e7e0a1b..b557daa571f17 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -6004,7 +6004,6 @@ void __init workqueue_init_early(void)
>  
>  	BUG_ON(!alloc_cpumask_var(&wq_unbound_cpumask, GFP_KERNEL));
>  	cpumask_copy(wq_unbound_cpumask, housekeeping_cpumask(HK_TYPE_WQ));
> -	cpumask_and(wq_unbound_cpumask, wq_unbound_cpumask, housekeeping_cpumask(HK_TYPE_DOMAIN));
>  
>  	pwq_cache = KMEM_CACHE(pool_workqueue, SLAB_PANIC);
>  
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index 8409d41405dfe..7b6fb62a118ab 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -852,7 +852,6 @@ static ssize_t store_rps_map(struct netdev_rx_queue *queue,
>  	}
>  
>  	if (!cpumask_empty(mask)) {
> -		cpumask_and(mask, mask, housekeeping_cpumask(HK_TYPE_DOMAIN));
>  		cpumask_and(mask, mask, housekeeping_cpumask(HK_TYPE_WQ));
>  		if (cpumask_empty(mask)) {
>  			free_cpumask_var(mask);
> -- 
> 2.38.0
>
  
Frederic Weisbecker Oct. 14, 2022, 1:24 p.m. UTC | #3
On Fri, Oct 14, 2022 at 10:36:19AM +0200, Peter Zijlstra wrote:
> 
> + Frederic; who actually does most of this code
> 
> On Thu, Oct 13, 2022 at 03:40:28PM -0300, Leonardo Bras wrote:
> > Housekeeping code keeps multiple cpumasks in order to keep track of which
> > CPUs can perform a given housekeeping category.
> > 
> > Every time the HK_TYPE_WQ cpumask is checked before queueing work on a CPU
> > workqueue, the HK_TYPE_DOMAIN cpumask is also checked. So it can be assumed
> > that domain isolation also ends up isolating workqueues.
> > 
> > Delegating HK_TYPE_DOMAIN's current workqueue isolation to HK_TYPE_WQ makes
> > it simpler to check whether a CPU can run work on a workqueue, since the
> > code only needs to go through a single HK_TYPE_* cpumask.
> > 
> > Make isolcpus=domain aggregate both HK_TYPE_DOMAIN and HK_TYPE_WQ, and
> > remove a lot of cpumask_and() calls.
> > 
> > Also, remove an unnecessary '|=' in housekeeping_isolcpus_setup(), since we
> > are sure that 'flags == 0' at that point.
> > 
> > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> 
> I've long maintained that having all these separate masks is daft;
> Frederic do we really need that?

Indeed. In my queue for the cpuset interface to nohz_full, I have the following
patch (but note DOMAIN and WQ have to stay separate flags because workqueue
affinity can be modified separately from isolcpus).

---
From: Frederic Weisbecker <frederic@kernel.org>
Date: Tue, 26 Jul 2022 17:03:30 +0200
Subject: [PATCH] sched/isolation: Gather nohz_full related isolation features
 into common flag

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 arch/x86/kvm/x86.c              |  2 +-
 drivers/pci/pci-driver.c        |  2 +-
 include/linux/sched/isolation.h |  7 +------
 kernel/cpu.c                    |  4 ++--
 kernel/kthread.c                |  4 ++--
 kernel/rcu/tasks.h              |  2 +-
 kernel/rcu/tree_plugin.h        |  6 +++---
 kernel/sched/core.c             | 10 +++++-----
 kernel/sched/fair.c             |  6 +++---
 kernel/sched/isolation.c        | 25 +++++++------------------
 kernel/watchdog.c               |  2 +-
 kernel/workqueue.c              |  2 +-
 net/core/net-sysfs.c            |  2 +-
 13 files changed, 29 insertions(+), 45 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1910e1e78b15..d0b73fcf4a1c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9009,7 +9009,7 @@ int kvm_arch_init(void *opaque)
 	}
 
 	if (pi_inject_timer == -1)
-		pi_inject_timer = housekeeping_enabled(HK_TYPE_TIMER);
+		pi_inject_timer = housekeeping_enabled(HK_TYPE_NOHZ_FULL);
 #ifdef CONFIG_X86_64
 	pvclock_gtod_register_notifier(&pvclock_gtod_notifier);
 
diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 49238ddd39ee..af3494a39921 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -378,7 +378,7 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
 			goto out;
 		}
 		cpumask_and(wq_domain_mask,
-			    housekeeping_cpumask(HK_TYPE_WQ),
+			    housekeeping_cpumask(HK_TYPE_NOHZ_FULL),
 			    housekeeping_cpumask(HK_TYPE_DOMAIN));
 
 		cpu = cpumask_any_and(cpumask_of_node(node),
diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index 8c15abd67aed..7ca34e04abe7 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -6,15 +6,10 @@
 #include <linux/tick.h>
 
 enum hk_type {
-	HK_TYPE_TIMER,
-	HK_TYPE_RCU,
-	HK_TYPE_MISC,
+	HK_TYPE_NOHZ_FULL,
 	HK_TYPE_SCHED,
-	HK_TYPE_TICK,
 	HK_TYPE_DOMAIN,
-	HK_TYPE_WQ,
 	HK_TYPE_MANAGED_IRQ,
-	HK_TYPE_KTHREAD,
 	HK_TYPE_MAX
 };
 
diff --git a/kernel/cpu.c b/kernel/cpu.c
index bbad5e375d3b..573f14d75a2e 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1500,8 +1500,8 @@ int freeze_secondary_cpus(int primary)
 	cpu_maps_update_begin();
 	if (primary == -1) {
 		primary = cpumask_first(cpu_online_mask);
-		if (!housekeeping_cpu(primary, HK_TYPE_TIMER))
-			primary = housekeeping_any_cpu(HK_TYPE_TIMER);
+		if (!housekeeping_cpu(primary, HK_TYPE_NOHZ_FULL))
+			primary = housekeeping_any_cpu(HK_TYPE_NOHZ_FULL);
 	} else {
 		if (!cpu_online(primary))
 			primary = cpumask_first(cpu_online_mask);
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 544fd4097406..0719035feba0 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -355,7 +355,7 @@ static int kthread(void *_create)
 	 * back to default in case they have been changed.
 	 */
 	sched_setscheduler_nocheck(current, SCHED_NORMAL, &param);
-	set_cpus_allowed_ptr(current, housekeeping_cpumask(HK_TYPE_KTHREAD));
+	set_cpus_allowed_ptr(current, housekeeping_cpumask(HK_TYPE_NOHZ_FULL));
 
 	/* OK, tell user we're spawned, wait for stop or wakeup */
 	__set_current_state(TASK_UNINTERRUPTIBLE);
@@ -721,7 +721,7 @@ int kthreadd(void *unused)
 	/* Setup a clean context for our children to inherit. */
 	set_task_comm(tsk, "kthreadd");
 	ignore_signals(tsk);
-	set_cpus_allowed_ptr(tsk, housekeeping_cpumask(HK_TYPE_KTHREAD));
+	set_cpus_allowed_ptr(tsk, housekeeping_cpumask(HK_TYPE_NOHZ_FULL));
 	set_mems_allowed(node_states[N_MEMORY]);
 
 	current->flags |= PF_NOFREEZE;
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index f5bf6fb430da..b99f79625b26 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -537,7 +537,7 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 	struct rcu_tasks *rtp = arg;
 
 	/* Run on housekeeping CPUs by default.  Sysadm can move if desired. */
-	housekeeping_affine(current, HK_TYPE_RCU);
+	housekeeping_affine(current, HK_TYPE_NOHZ_FULL);
 	WRITE_ONCE(rtp->kthread_ptr, current); // Let GPs start!
 
 	/*
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index b2219577fbe2..4935b06c3caf 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1237,9 +1237,9 @@ static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu)
 		if ((mask & leaf_node_cpu_bit(rnp, cpu)) &&
 		    cpu != outgoingcpu)
 			cpumask_set_cpu(cpu, cm);
-	cpumask_and(cm, cm, housekeeping_cpumask(HK_TYPE_RCU));
+	cpumask_and(cm, cm, housekeeping_cpumask(HK_TYPE_NOHZ_FULL));
 	if (cpumask_empty(cm))
-		cpumask_copy(cm, housekeeping_cpumask(HK_TYPE_RCU));
+		cpumask_copy(cm, housekeeping_cpumask(HK_TYPE_NOHZ_FULL));
 	set_cpus_allowed_ptr(t, cm);
 	mutex_unlock(&rnp->boost_kthread_mutex);
 	free_cpumask_var(cm);
@@ -1294,5 +1294,5 @@ static void rcu_bind_gp_kthread(void)
 {
 	if (!tick_nohz_full_enabled())
 		return;
-	housekeeping_affine(current, HK_TYPE_RCU);
+	housekeeping_affine(current, HK_TYPE_NOHZ_FULL);
 }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f53c0096860b..5ff205f39197 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1079,13 +1079,13 @@ int get_nohz_timer_target(void)
 	struct sched_domain *sd;
 	const struct cpumask *hk_mask;
 
-	if (housekeeping_cpu(cpu, HK_TYPE_TIMER)) {
+	if (housekeeping_cpu(cpu, HK_TYPE_NOHZ_FULL)) {
 		if (!idle_cpu(cpu))
 			return cpu;
 		default_cpu = cpu;
 	}
 
-	hk_mask = housekeeping_cpumask(HK_TYPE_TIMER);
+	hk_mask = housekeeping_cpumask(HK_TYPE_NOHZ_FULL);
 
 	rcu_read_lock();
 	for_each_domain(cpu, sd) {
@@ -1101,7 +1101,7 @@ int get_nohz_timer_target(void)
 	}
 
 	if (default_cpu == -1)
-		default_cpu = housekeeping_any_cpu(HK_TYPE_TIMER);
+		default_cpu = housekeeping_any_cpu(HK_TYPE_NOHZ_FULL);
 	cpu = default_cpu;
 unlock:
 	rcu_read_unlock();
@@ -5562,7 +5562,7 @@ static void sched_tick_start(int cpu)
 	int os;
 	struct tick_work *twork;
 
-	if (housekeeping_cpu(cpu, HK_TYPE_TICK))
+	if (housekeeping_cpu(cpu, HK_TYPE_NOHZ_FULL))
 		return;
 
 	WARN_ON_ONCE(!tick_work_cpu);
@@ -5583,7 +5583,7 @@ static void sched_tick_stop(int cpu)
 	struct tick_work *twork;
 	int os;
 
-	if (housekeeping_cpu(cpu, HK_TYPE_TICK))
+	if (housekeeping_cpu(cpu, HK_TYPE_NOHZ_FULL))
 		return;
 
 	WARN_ON_ONCE(!tick_work_cpu);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 77b2048a9326..ac3b33e00451 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10375,7 +10375,7 @@ static inline int on_null_domain(struct rq *rq)
  * - When one of the busy CPUs notice that there may be an idle rebalancing
  *   needed, they will kick the idle load balancer, which then does idle
  *   load balancing for all the idle CPUs.
- * - HK_TYPE_MISC CPUs are used for this task, because HK_TYPE_SCHED not set
+ * - HK_TYPE_NOHZ_FULL CPUs are used for this task, because HK_TYPE_SCHED not set
  *   anywhere yet.
  */
 
@@ -10384,7 +10384,7 @@ static inline int find_new_ilb(void)
 	int ilb;
 	const struct cpumask *hk_mask;
 
-	hk_mask = housekeeping_cpumask(HK_TYPE_MISC);
+	hk_mask = housekeeping_cpumask(HK_TYPE_NOHZ_FULL);
 
 	for_each_cpu_and(ilb, nohz.idle_cpus_mask, hk_mask) {
 
@@ -10400,7 +10400,7 @@ static inline int find_new_ilb(void)
 
 /*
  * Kick a CPU to do the nohz balancing, if it is time for it. We pick any
- * idle CPU in the HK_TYPE_MISC housekeeping set (if there is one).
+ * idle CPU in the HK_TYPE_NOHZ_FULL housekeeping set (if there is one).
  */
 static void kick_ilb(unsigned int flags)
 {
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 4087718ee5b4..443f1ce83e32 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -4,20 +4,15 @@
  *  any CPU: unbound workqueues, timers, kthreads and any offloadable work.
  *
  * Copyright (C) 2017 Red Hat, Inc., Frederic Weisbecker
- * Copyright (C) 2017-2018 SUSE, Frederic Weisbecker
+ * Copyright (C) 2017-2022 SUSE, Frederic Weisbecker
  *
  */
 
 enum hk_flags {
-	HK_FLAG_TIMER		= BIT(HK_TYPE_TIMER),
-	HK_FLAG_RCU		= BIT(HK_TYPE_RCU),
-	HK_FLAG_MISC		= BIT(HK_TYPE_MISC),
+	HK_FLAG_NOHZ_FULL	= BIT(HK_TYPE_NOHZ_FULL),
 	HK_FLAG_SCHED		= BIT(HK_TYPE_SCHED),
-	HK_FLAG_TICK		= BIT(HK_TYPE_TICK),
 	HK_FLAG_DOMAIN		= BIT(HK_TYPE_DOMAIN),
-	HK_FLAG_WQ		= BIT(HK_TYPE_WQ),
 	HK_FLAG_MANAGED_IRQ	= BIT(HK_TYPE_MANAGED_IRQ),
-	HK_FLAG_KTHREAD		= BIT(HK_TYPE_KTHREAD),
 };
 
 DEFINE_STATIC_KEY_FALSE(housekeeping_overridden);
@@ -88,7 +83,7 @@ void __init housekeeping_init(void)
 
 	static_branch_enable(&housekeeping_overridden);
 
-	if (housekeeping.flags & HK_FLAG_TICK)
+	if (housekeeping.flags & HK_FLAG_NOHZ_FULL)
 		sched_tick_offload_init();
 
 	for_each_set_bit(type, &housekeeping.flags, HK_TYPE_MAX) {
@@ -111,7 +106,7 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
 	cpumask_var_t non_housekeeping_mask, housekeeping_staging;
 	int err = 0;
 
-	if ((flags & HK_FLAG_TICK) && !(housekeeping.flags & HK_FLAG_TICK)) {
+	if ((flags & HK_FLAG_NOHZ_FULL) && !(housekeeping.flags & HK_FLAG_NOHZ_FULL)) {
 		if (!IS_ENABLED(CONFIG_NO_HZ_FULL)) {
 			pr_warn("Housekeeping: nohz unsupported."
 				" Build with CONFIG_NO_HZ_FULL\n");
@@ -163,7 +158,7 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
 			housekeeping_setup_type(type, housekeeping_staging);
 	}
 
-	if ((flags & HK_FLAG_TICK) && !(housekeeping.flags & HK_FLAG_TICK))
+	if ((flags & HK_FLAG_NOHZ_FULL) && !(housekeeping.flags & HK_FLAG_NOHZ_FULL))
 		tick_nohz_full_setup(non_housekeeping_mask);
 
 	housekeeping.flags |= flags;
@@ -179,12 +174,7 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
 
 static int __init housekeeping_nohz_full_setup(char *str)
 {
-	unsigned long flags;
-
-	flags = HK_FLAG_TICK | HK_FLAG_WQ | HK_FLAG_TIMER | HK_FLAG_RCU |
-		HK_FLAG_MISC | HK_FLAG_KTHREAD;
-
-	return housekeeping_setup(str, flags);
+	return housekeeping_setup(str, HK_FLAG_NOHZ_FULL);
 }
 __setup("nohz_full=", housekeeping_nohz_full_setup);
 
@@ -198,8 +188,7 @@ static int __init housekeeping_isolcpus_setup(char *str)
 	while (isalpha(*str)) {
 		if (!strncmp(str, "nohz,", 5)) {
 			str += 5;
-			flags |= HK_FLAG_TICK | HK_FLAG_WQ | HK_FLAG_TIMER |
-				HK_FLAG_RCU | HK_FLAG_MISC | HK_FLAG_KTHREAD;
+			flags |= HK_FLAG_NOHZ_FULL;
 			continue;
 		}
 
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 20a7a55e62b6..3e9636f4bac6 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -852,7 +852,7 @@ void __init lockup_detector_init(void)
 		pr_info("Disabling watchdog on nohz_full cores by default\n");
 
 	cpumask_copy(&watchdog_cpumask,
-		     housekeeping_cpumask(HK_TYPE_TIMER));
+		     housekeeping_cpumask(HK_TYPE_NOHZ_FULL));
 
 	if (!watchdog_nmi_probe())
 		nmi_watchdog_available = true;
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 1ea50f6be843..3eb283d76d81 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -5993,7 +5993,7 @@ void __init workqueue_init_early(void)
 	BUILD_BUG_ON(__alignof__(struct pool_workqueue) < __alignof__(long long));
 
 	BUG_ON(!alloc_cpumask_var(&wq_unbound_cpumask, GFP_KERNEL));
-	cpumask_copy(wq_unbound_cpumask, housekeeping_cpumask(HK_TYPE_WQ));
+	cpumask_copy(wq_unbound_cpumask, housekeeping_cpumask(HK_TYPE_NOHZ_FULL));
 	cpumask_and(wq_unbound_cpumask, wq_unbound_cpumask, housekeeping_cpumask(HK_TYPE_DOMAIN));
 
 	pwq_cache = KMEM_CACHE(pool_workqueue, SLAB_PANIC);
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index e319e242dddf..6dddf359b754 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -852,7 +852,7 @@ static ssize_t store_rps_map(struct netdev_rx_queue *queue,
 
 	if (!cpumask_empty(mask)) {
 		cpumask_and(mask, mask, housekeeping_cpumask(HK_TYPE_DOMAIN));
-		cpumask_and(mask, mask, housekeeping_cpumask(HK_TYPE_WQ));
+		cpumask_and(mask, mask, housekeeping_cpumask(HK_TYPE_NOHZ_FULL));
 		if (cpumask_empty(mask)) {
 			free_cpumask_var(mask);
 			return -EINVAL;
  
Leonardo Bras Soares Passos Oct. 14, 2022, 4:27 p.m. UTC | #4
On Fri, 2022-10-14 at 15:24 +0200, Frederic Weisbecker wrote:
> On Fri, Oct 14, 2022 at 10:36:19AM +0200, Peter Zijlstra wrote:
> > 
> > + Frederic; who actually does most of this code
> > 
> > On Thu, Oct 13, 2022 at 03:40:28PM -0300, Leonardo Bras wrote:
> > > Housekeeping code keeps multiple cpumasks in order to keep track of which
> > > CPUs can perform a given housekeeping category.
> > > 
> > > Every time the HK_TYPE_WQ cpumask is checked before queueing work on a CPU
> > > workqueue, the HK_TYPE_DOMAIN cpumask is also checked. So it can be assumed
> > > that domain isolation also ends up isolating workqueues.
> > > 
> > > Delegating HK_TYPE_DOMAIN's current workqueue isolation to HK_TYPE_WQ makes
> > > it simpler to check whether a CPU can run work on a workqueue, since the
> > > code only needs to go through a single HK_TYPE_* cpumask.
> > > 
> > > Make isolcpus=domain aggregate both HK_TYPE_DOMAIN and HK_TYPE_WQ, and
> > > remove a lot of cpumask_and() calls.
> > > 
> > > Also, remove an unnecessary '|=' in housekeeping_isolcpus_setup(), since we
> > > are sure that 'flags == 0' at that point.
> > > 
> > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > 
> > I've long maintained that having all these separate masks is daft;
> > Frederic do we really need that?
> 
> Indeed. In my queue for the cpuset interface to nohz_full, I have the following
> patch (but note DOMAIN and WQ have to stay separate flags because workqueue
> affinity can be modified separately from isolcpus).
> 
> [...]

Hello Frederic,

So, IIUC you are removing all the flags composing the nohz_full= parameter in
favor of a unified NOHZ_FULL flag.

I am very new to the code, and I am probably missing the whole picture, but I
actually think it's a good approach to keep them split for a couple reasons:
1 - They are easier to understand in code (IMHO): 
"This cpu should not do this, because it's not able to do WQ housekeeping" looks
better than "because it's not in DOMAIN or NOHZ_FULL housekeeping"

2 - They are simpler to use:
Suppose we have a function that should run on a WQ, but we want to keep it
off the isolated cpus. If we have the unified flags, we need to combine both
the DOMAIN and NOHZ_FULL bitmasks, and then combine the result again with
something like cpu_online_mask. That usually means allocating a new cpumask_t,
and also freeing it afterwards.
If we have a single WQ flag, we can avoid the allocation altogether by using
for_each_cpu_and(), making the code much simpler.

3 - It makes it easier to compose new isolation modes:
If the future requires a new isolation mode that also uses the types of
isolation we currently have implemented, it would be much easier to just
compose it from the current HK flags, instead of having to go through all
usages and do a cpumask_and() there. Also, new isolation modes would make (2)
worse.
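
Here is a minimal sketch of the contrast described in point (2), assuming
HK_TYPE_WQ already reflects domain isolation; the work item 'some_work' and
the use of system_wq are made up purely for illustration:

int cpu;

/* No temporary cpumask needed: iterate the intersection directly and
 * queue the (hypothetical) work item on the first allowed online CPU. */
for_each_cpu_and(cpu, housekeeping_cpumask(HK_TYPE_WQ), cpu_online_mask) {
	queue_work_on(cpu, system_wq, &some_work);
	break;
}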

What I can see as a pro on the "unified flag" side is that getting all the
multiple flags to do their jobs properly can be a lot of trouble.

While I do understand you are much more experienced with this code than I am,
and that your decision is more likely to be the better one, I am unable to see
the arguments that helped you reach it.

Could you please point them out, so I can better understand the decision?

Best regards,
Leo
  
Frederic Weisbecker Nov. 29, 2022, 12:10 p.m. UTC | #5
On Fri, Oct 14, 2022 at 01:27:25PM -0300, Leonardo Brás wrote:
> Hello Frederic,
> 
> So, IIUC you are removing all the flags composing the nohz_full= parameter in
> favor of a unified NOHZ_FULL flag.
> 
> I am very new to the code, and I am probably missing the whole picture, but I
> actually think it's a good approach to keep them split for a couple reasons:
> 1 - They are easier to understand in code (IMHO): 
> "This cpu should not do this, because it's not able to do WQ housekeeping" looks
> better than "because it's not in DOMAIN or NOHZ_FULL housekeeping"

A comment above each site may solve that.

> 
> 2 - They are simpler to use:
> Suppose we have a function that should run on a WQ, but we want to keep it
> off the isolated cpus. If we have the unified flags, we need to combine both
> the DOMAIN and NOHZ_FULL bitmasks, and then combine the result again with
> something like cpu_online_mask. That usually means allocating a new cpumask_t,
> and also freeing it afterwards.
> If we have a single WQ flag, we can avoid the allocation altogether by using
> for_each_cpu_and(), making the code much simpler.

I guess having a specific function for workqueues would arrange for it.
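
(A sketch of what such a helper could look like, assuming the HK_TYPE_NOHZ_FULL
type from the patch above; the helper name and the precomputed mask are made up
for illustration, not an existing API:)

/* Hypothetical: a mask precomputed at boot from the NOHZ_FULL and DOMAIN
 * housekeeping masks, so workqueue callers never need a temporary cpumask. */
static struct cpumask wq_housekeeping_mask __ro_after_init;

static void __init wq_housekeeping_init(void)
{
	cpumask_and(&wq_housekeeping_mask,
		    housekeeping_cpumask(HK_TYPE_NOHZ_FULL),
		    housekeeping_cpumask(HK_TYPE_DOMAIN));
}

const struct cpumask *housekeeping_wq_cpumask(void)
{
	return &wq_housekeeping_mask;
}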

> 
> 3 - It makes it easier to compose new isolation modes:
> If the future requires a new isolation mode that also uses the types of
> isolation we currently have implemented, it would be much easier to just
> compose it from the current HK flags, instead of having to go through all
> usages and do a cpumask_and() there. Also, new isolation modes would make (2)
> worse.

Actually having a new feature merged in HK_NOHZ_FULL would make it easier to
handle as it avoids spreading cpumasks. I'm not sure I understand what you
mean.

Thanks.
  
Leonardo Bras Soares Passos Dec. 20, 2022, 6:57 a.m. UTC | #6
On Tue, 2022-11-29 at 13:10 +0100, Frederic Weisbecker wrote:
> On Fri, Oct 14, 2022 at 01:27:25PM -0300, Leonardo Brás wrote:
> > Hello Frederic,
> > 
> > So, IIUC you are removing all the flags composing the nohz_full= parameter in
> > favor of a unified NOHZ_FULL flag.
> > 
> > I am very new to the code, and I am probably missing the whole picture, but I
> > actually think it's a good approach to keep them split for a couple reasons:
> > 1 - They are easier to understand in code (IMHO): 
> > "This cpu should not do this, because it's not able to do WQ housekeeping" looks
> > better than "because it's not in DOMAIN or NOHZ_FULL housekeeping"
> 
> A comment above each site may solve that.

Sure, but not having to leave comments would be better. Or am I missing
something?

> 
> > 
> > 2 - They are simpler to use:
> > Suppose we have a function that should run on a WQ, but we want to keep it
> > off the isolated cpus. If we have the unified flags, we need to combine both
> > the DOMAIN and NOHZ_FULL bitmasks, and then combine the result again with
> > something like cpu_online_mask. That usually means allocating a new cpumask_t,
> > and also freeing it afterwards.
> > If we have a single WQ flag, we can avoid the allocation altogether by using
> > for_each_cpu_and(), making the code much simpler.
> 
> I guess having a specific function for workqueues would arrange for it.

You mean keeping a WQ housekeeping bitmap? This could be a solution, but it
would affect only the WQ example.

> 
> > 
> > 3 - It makes it easier to compose new isolation modes:
> > If the future requires a new isolation mode that also uses the types of
> > isolation we currently have implemented, it would be much easier to just
> > compose it from the current HK flags, instead of having to go through all
> > usages and do a cpumask_and() there. Also, new isolation modes would make (2)
> > worse.
> 
> Actually having a new feature merged in HK_NOHZ_FULL would make it easier to
> handle as it avoids spreading cpumasks. I'm not sure I understand what you
> mean.

IIUC, your queued patch merges the housekeeping types HK_TYPE_TIMER,
HK_TYPE_RCU, HK_TYPE_MISC, HK_TYPE_TICK, HK_TYPE_WQ and HK_TYPE_KTHREAD into a
single HK_TYPE_NOHZ_FULL.

Suppose in the future we need a new isolation feature on the cmdline, say
isol_new=<cpulist>, which works exactly like nohz_full=<cpulist>, but also
needs to isolate the cpulist from something else, say doing X.

How would this get implemented? IIUC, following the same pattern:
- A new type HK_TYPE_ISOL_NEW would be created together with a cpumask,
- The new cpumask would be used to keep the cpulist from doing X,
- All places that use the HK_TYPE_NOHZ_FULL bitmap for isolation would also
need to cpumask_and() the new cpumask (sometimes needing a local cpumask_t).

Ok, there may be shortcuts for this, like keeping an intermediary bitmap, but
that can become tricky.

Another, more complex example: a new isolation feature isol_new2=<cpulist>
behaves like nohz_full=<cpulist> and keeps the cpulist from doing X, but allows
unbound RCU work. Now it's even harder to find shortcuts from the previous
implementation.

What I am trying to defend here is that keeping the HK types with the idea of
"things the cpulist gets isolated from" works better for future implementations
than a single flag with a lot of responsibilities (a small sketch follows the
list):
- A new type HK_TYPE_X would be created together with a cpumask,
- The new cpumask would be used to keep the cpulist from doing X,
- isol_new=<cpulist> is composed from the flags for whatever the cpulist is
getting isolated from,
- (No need to touch already-implemented isolations.)
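
A sketch of that composition in the style of housekeeping_isolcpus_setup();
"isol_new" and HK_FLAG_X are hypothetical names, not existing flags:

/* Hypothetical new isolation mode: nohz_full-style isolation plus "no X". */
if (!strncmp(str, "isol_new,", 9)) {
	str += 9;
	flags |= HK_FLAG_TICK | HK_FLAG_WQ | HK_FLAG_TIMER |
		 HK_FLAG_RCU | HK_FLAG_MISC | HK_FLAG_KTHREAD |
		 HK_FLAG_X;	/* hypothetical flag backing a new cpumask */
	continue;
}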

In fact, I propose that it works better for the current implementation as well:
the current patch (3/4) takes the WQ isolation responsibility from
HK_TYPE_DOMAIN and focuses it in HK_TYPE_WQ, adding it to the isolcpus=<cpulist>
flags. This avoids some cpumask_and() calls and a cpumask_t allocation, and
makes the code less complex when we need to add isolation to further parts of
the code (patch 4/4).

I am not sure whether I am missing some important point here.
Please let me know if that's the case.

> 
> Thanks.
> 

Thank you for replying!
Leo
  

Patch

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 107d77f3c8467..550bef2504b8d 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -371,19 +371,8 @@  static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
 	    pci_physfn_is_probed(dev)) {
 		cpu = nr_cpu_ids;
 	} else {
-		cpumask_var_t wq_domain_mask;
-
-		if (!zalloc_cpumask_var(&wq_domain_mask, GFP_KERNEL)) {
-			error = -ENOMEM;
-			goto out;
-		}
-		cpumask_and(wq_domain_mask,
-			    housekeeping_cpumask(HK_TYPE_WQ),
-			    housekeeping_cpumask(HK_TYPE_DOMAIN));
-
 		cpu = cpumask_any_and(cpumask_of_node(node),
-				      wq_domain_mask);
-		free_cpumask_var(wq_domain_mask);
+				      housekeeping_cpumask(HK_TYPE_WQ));
 	}
 
 	if (cpu < nr_cpu_ids)
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 373d42c707bc5..ced4b78564810 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -204,7 +204,7 @@  static int __init housekeeping_isolcpus_setup(char *str)
 
 		if (!strncmp(str, "domain,", 7)) {
 			str += 7;
-			flags |= HK_FLAG_DOMAIN;
+			flags |= HK_FLAG_DOMAIN | HK_FLAG_WQ;
 			continue;
 		}
 
@@ -234,7 +234,7 @@  static int __init housekeeping_isolcpus_setup(char *str)
 
 	/* Default behaviour for isolcpus without flags */
 	if (!flags)
-		flags |= HK_FLAG_DOMAIN;
+		flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
 
 	return housekeeping_setup(str, flags);
 }
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 7cd5f5e7e0a1b..b557daa571f17 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -6004,7 +6004,6 @@  void __init workqueue_init_early(void)
 
 	BUG_ON(!alloc_cpumask_var(&wq_unbound_cpumask, GFP_KERNEL));
 	cpumask_copy(wq_unbound_cpumask, housekeeping_cpumask(HK_TYPE_WQ));
-	cpumask_and(wq_unbound_cpumask, wq_unbound_cpumask, housekeeping_cpumask(HK_TYPE_DOMAIN));
 
 	pwq_cache = KMEM_CACHE(pool_workqueue, SLAB_PANIC);
 
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 8409d41405dfe..7b6fb62a118ab 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -852,7 +852,6 @@  static ssize_t store_rps_map(struct netdev_rx_queue *queue,
 	}
 
 	if (!cpumask_empty(mask)) {
-		cpumask_and(mask, mask, housekeeping_cpumask(HK_TYPE_DOMAIN));
 		cpumask_and(mask, mask, housekeeping_cpumask(HK_TYPE_WQ));
 		if (cpumask_empty(mask)) {
 			free_cpumask_var(mask);