[RFC,v2,3/5] padata: dispatch works on different nodes

Message ID 20231208025240.4744-4-gang.li@linux.dev
State New
Series hugetlb: parallelize hugetlb page init on boot

Commit Message

Gang Li Dec. 8, 2023, 2:52 a.m. UTC
  When a group of tasks that access different nodes is scheduled on the
same node, the tasks may hit memory bandwidth bottlenecks and higher
access latency.

Thus, a numa_aware flag is introduced here, allowing tasks to be
distributed across different nodes to fully utilize the bandwidth of
multi-node systems.

Signed-off-by: Gang Li <gang.li@linux.dev>
---
 include/linux/padata.h | 2 ++
 kernel/padata.c        | 8 ++++++--
 mm/mm_init.c           | 1 +
 3 files changed, 9 insertions(+), 2 deletions(-)
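
For illustration only, and not part of this patch: a caller that wants its
work items spread across nodes would set the new field when filling in its
padata_mt_job, roughly as in the sketch below. Only .numa_aware comes from
this patch; the worker function and pfn range are hypothetical placeholders.

/*
 * Hypothetical caller sketch; only .numa_aware is introduced by this
 * patch, the worker function and range are placeholders.
 */
struct padata_mt_job job = {
	.thread_fn   = my_init_chunk,		/* hypothetical per-chunk worker */
	.fn_arg      = NULL,
	.start       = start_pfn,		/* hypothetical range to process */
	.size        = end_pfn - start_pfn,
	.align       = PAGES_PER_SECTION,
	.min_chunk   = PAGES_PER_SECTION,
	.max_threads = num_node_state(N_MEMORY),
	.numa_aware  = true,			/* spread the work items across nodes */
};

padata_do_multithreaded(&job);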
  

Comments

Tim Chen Dec. 12, 2023, 11:40 p.m. UTC | #1
>  
>  	list_for_each_entry(pw, &works, pw_list)
> -		queue_work(system_unbound_wq, &pw->pw_work);
> +		if (job->numa_aware)
> +			queue_work_node((++nid % num_node_state(N_MEMORY)),

The nid may fall on a NUMA node with only memory but no CPU.  In that case you
may still put the work on the unbound queue, and you could end up with one CPU
node doing the work for all the CPU-less memory nodes.  Is this what you want,
or would you like to spread that work between the CPU nodes?

Tim

> +					system_unbound_wq, &pw->pw_work);
> +		else
> +			queue_work(system_unbound_wq, &pw->pw_work);
>  
>  	/* Use the current thread, which saves starting a workqueue worker. */
>  	padata_work_init(&my_work, padata_mt_helper, &ps, PADATA_WORK_ONSTACK);
  
Gang Li Dec. 18, 2023, 6:46 a.m. UTC | #2
On 2023/12/13 07:40, Tim Chen wrote:
> 
>>   
>>   	list_for_each_entry(pw, &works, pw_list)
>> -		queue_work(system_unbound_wq, &pw->pw_work);
>> +		if (job->numa_aware)
>> +			queue_work_node((++nid % num_node_state(N_MEMORY)),
> 
> The nid may fall on a NUMA node with only memory but no CPU.  In that case you
> may still put the work on the unbound queue, and you could end up with one CPU
> node doing the work for all the CPU-less memory nodes.  Is this what you want,
> or would you like to spread that work between the CPU nodes?
> 
> Tim

Hi, thank you for your reminder. My intention was to fully utilize all
memory bandwidth.

For memory nodes without CPUs, I also hope their work can be spread
across different CPUs.
  
Gang Li Dec. 27, 2023, 10:33 a.m. UTC | #3
Hi Tim,

According to queue_work_node(), if there are no CPUs available on the
given node, it will schedule the work on any available CPU.
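
For reference, the CPU selection behind queue_work_node() behaves roughly
like the simplified sketch below (a paraphrase, not the exact kernel code;
the helper name is illustrative): the work is pinned to a CPU only when the
node is valid, online and has an online CPU, otherwise it falls back to
WORK_CPU_UNBOUND.

/* Simplified paraphrase of the CPU selection used by queue_work_node(). */
static int select_cpu_near(int node)
{
	int cpu;

	/* Invalid or offline node: let the workqueue pick any CPU. */
	if (node < 0 || node >= MAX_NUMNODES || !node_online(node))
		return WORK_CPU_UNBOUND;

	/* Prefer the current CPU if it already belongs to the target node. */
	cpu = raw_smp_processor_id();
	if (node == cpu_to_node(cpu))
		return cpu;

	/* Otherwise take any online CPU of that node, if one exists. */
	cpu = cpumask_any_and(cpumask_of_node(node), cpu_online_mask);
	return cpu < nr_cpu_ids ? cpu : WORK_CPU_UNBOUND;
}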

On 2023/12/18 14:46, Gang Li wrote:
> On 2023/12/13 07:40, Tim Chen wrote:
>>
>>>       list_for_each_entry(pw, &works, pw_list)
>>> -        queue_work(system_unbound_wq, &pw->pw_work);
>>> +        if (job->numa_aware)
>>> +            queue_work_node((++nid % num_node_state(N_MEMORY)),
>>
>> The nid may fall on a NUMA node with only memory but no CPU.  In that
>> case you may still put the work on the unbound queue, and you could end
>> up with one CPU node doing the work for all the CPU-less memory nodes.
>> Is this what you want, or would you like to spread that work between
>> the CPU nodes?
>>
>> Tim
> 
> Hi, thank you for your reminder. My intention was to fully utilize all
> memory bandwidth.
> 
> For memory nodes without CPUs, I also hope their work can be spread
> across different CPUs.
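
If spreading across the CPU nodes is the goal, one possible direction (a
sketch only, not what this v2 does) is to round-robin over node_states[N_CPU]
when queueing, so that work generated on behalf of CPU-less memory nodes
still rotates across the nodes that actually have CPUs:

/* Sketch: rotate the work items over nodes that have CPUs. */
int nid = numa_node_id();

list_for_each_entry(pw, &works, pw_list) {
	if (job->numa_aware) {
		nid = next_node_in(nid, node_states[N_CPU]);
		queue_work_node(nid, system_unbound_wq, &pw->pw_work);
	} else {
		queue_work(system_unbound_wq, &pw->pw_work);
	}
}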
  

Patch

diff --git a/include/linux/padata.h b/include/linux/padata.h
index 495b16b6b4d72..f6c58c30ed96a 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -137,6 +137,7 @@  struct padata_shell {
  *             appropriate for one worker thread to do at once.
  * @max_threads: Max threads to use for the job, actual number may be less
  *               depending on task size and minimum chunk size.
+ * @numa_aware: Dispatch jobs to different nodes.
  */
 struct padata_mt_job {
 	void (*thread_fn)(unsigned long start, unsigned long end, void *arg);
@@ -146,6 +147,7 @@  struct padata_mt_job {
 	unsigned long		align;
 	unsigned long		min_chunk;
 	int			max_threads;
+	bool			numa_aware;
 };
 
 /**
diff --git a/kernel/padata.c b/kernel/padata.c
index 179fb1518070c..80f82c563e46a 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -485,7 +485,7 @@  void __init padata_do_multithreaded(struct padata_mt_job *job)
 	struct padata_work my_work, *pw;
 	struct padata_mt_job_state ps;
 	LIST_HEAD(works);
-	int nworks;
+	int nworks, nid = 0;
 
 	if (job->size == 0)
 		return;
@@ -517,7 +517,11 @@  void __init padata_do_multithreaded(struct padata_mt_job *job)
 	ps.chunk_size = roundup(ps.chunk_size, job->align);
 
 	list_for_each_entry(pw, &works, pw_list)
-		queue_work(system_unbound_wq, &pw->pw_work);
+		if (job->numa_aware)
+			queue_work_node((++nid % num_node_state(N_MEMORY)),
+					system_unbound_wq, &pw->pw_work);
+		else
+			queue_work(system_unbound_wq, &pw->pw_work);
 
 	/* Use the current thread, which saves starting a workqueue worker. */
 	padata_work_init(&my_work, padata_mt_helper, &ps, PADATA_WORK_ONSTACK);
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 077bfe393b5e2..1226f0c81fcb3 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2234,6 +2234,7 @@  static int __init deferred_init_memmap(void *data)
 			.align       = PAGES_PER_SECTION,
 			.min_chunk   = PAGES_PER_SECTION,
 			.max_threads = max_threads,
+			.numa_aware  = false,
 		};
 
 		padata_do_multithreaded(&job);