[RFC,v2,3/5] padata: dispatch works on different nodes
Commit Message
When a group of tasks that access memory on different nodes is scheduled
on the same node, the tasks may hit memory bandwidth bottlenecks and
higher access latency. Introduce a numa_aware flag that lets padata
distribute its work items across nodes, so that jobs can take full
advantage of multi-node systems.
Signed-off-by: Gang Li <gang.li@linux.dev>
---
include/linux/padata.h | 2 ++
kernel/padata.c | 8 ++++++--
mm/mm_init.c | 1 +
3 files changed, 9 insertions(+), 2 deletions(-)
Comments
>
> list_for_each_entry(pw, &works, pw_list)
> - queue_work(system_unbound_wq, &pw->pw_work);
> + if (job->numa_aware)
> + queue_work_node((++nid % num_node_state(N_MEMORY)),
The nid may fall on a NUMA node that has memory but no CPUs. In that case
the work may still be put on the unbound queue, so the work for all
memory nodes without CPUs could end up on a single CPU node. Is this what
you want? Or would you like to spread it across the CPU nodes?
Tim
> + system_unbound_wq, &pw->pw_work);
> + else
> + queue_work(system_unbound_wq, &pw->pw_work);
>
> /* Use the current thread, which saves starting a workqueue worker. */
> padata_work_init(&my_work, padata_mt_helper, &ps, PADATA_WORK_ONSTACK);
On 2023/12/13 07:40, Tim Chen wrote:
>
>>
>> list_for_each_entry(pw, &works, pw_list)
>> - queue_work(system_unbound_wq, &pw->pw_work);
>> + if (job->numa_aware)
>> + queue_work_node((++nid % num_node_state(N_MEMORY)),
>
> The nid may fall on a NUMA node with only memory but no CPU. In that case you
> may still put the work on the unbound queue. You could end up on one CPU node for work
> from all memory nodes without CPU. Is this what you want? Or would you
> like to spread them between CPU nodes?
>
> Tim
Hi, thanks for pointing this out. My intention is to fully utilize the
memory bandwidth of all nodes.
For memory nodes without CPUs, I would also like the work to be spread
across different CPUs.
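One way to read "spread across CPU nodes" is a round-robin that skips nodes without CPUs. The sketch below is a userspace model under stated assumptions, not what the patch does: `next_cpu_node()` and the `cpus_per_node` array are hypothetical, and the array merely stands in for whatever the kernel knows about which nodes have CPUs (for example, nodes in the N_CPU state).

```c
#include <assert.h>

#define MAX_NODES 8	/* hypothetical upper bound for this model */

/*
 * Return the first node after 'prev' (wrapping around) that has at least
 * one CPU, or -1 if no node has CPUs. Pass prev = -1 for the first call.
 * cpus_per_node[i] == 0 models a memory-only node.
 */
static int next_cpu_node(const int cpus_per_node[], int nr_nodes, int prev)
{
	assert(nr_nodes > 0 && nr_nodes <= MAX_NODES);
	assert(prev >= -1 && prev < nr_nodes);

	for (int step = 1; step <= nr_nodes; step++) {
		int nid = (prev + step) % nr_nodes;

		if (cpus_per_node[nid] > 0)
			return nid;
	}
	return -1;
}
```

With nodes 1 and 3 being memory-only, successive calls alternate between nodes 0 and 2, so no work item is handed to a node that cannot run it.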
Hi Tim,
According to queue_work_node, if there are no CPUs available on the
given node, it will schedule to any available CPU.
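To make the behaviour discussed above concrete, here is a small userspace model of the dispatch loop. It is only a sketch: `struct topo`, `place_work()`, `dispatch()` and `ANY_CPU` are hypothetical names, the counter starts at 0 as an assumption, and the fallback is modelled simply as "not pinned to a node", following the description of queue_work_node() given in this thread.

```c
#include <assert.h>

#define MAX_NODES 8	/* hypothetical upper bound for this model */
#define ANY_CPU   (-1)	/* hypothetical marker: "any available CPU" */

/* Hypothetical topology; cpus_per_node[i] == 0 is a memory-only node. */
struct topo {
	int nr_mem_nodes;		/* models num_node_state(N_MEMORY) */
	int cpus_per_node[MAX_NODES];
};

/*
 * Pick the target node the way the patch does (++nid % number of memory
 * nodes). Like the patch, this assumes node IDs are contiguous from 0.
 * Returns the node, or ANY_CPU when the target node has no CPUs.
 */
static int place_work(const struct topo *t, int *nid)
{
	int target;

	assert(t->nr_mem_nodes > 0 && t->nr_mem_nodes <= MAX_NODES);
	target = ++(*nid) % t->nr_mem_nodes;
	return t->cpus_per_node[target] > 0 ? target : ANY_CPU;
}

/*
 * Dispatch nworks items. placed[i] counts the items pinned to node i;
 * the return value counts the items that fell back to "any CPU".
 */
static int dispatch(const struct topo *t, int nworks, int placed[MAX_NODES])
{
	int nid = 0, fallback = 0;

	for (int i = 0; i < MAX_NODES; i++)
		placed[i] = 0;

	for (int i = 0; i < nworks; i++) {
		int node = place_work(t, &nid);

		if (node == ANY_CPU)
			fallback++;
		else
			placed[node]++;
	}
	return fallback;
}
```

In a four-node model where nodes 2 and 3 are memory-only, half of the work items take the fallback path. Where those items actually run is then up to the workqueue, which is the concern raised in the review.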
On 2023/12/18 14:46, Gang Li wrote:
> On 2023/12/13 07:40, Tim Chen wrote:
>>
>>> list_for_each_entry(pw, &works, pw_list)
>>> - queue_work(system_unbound_wq, &pw->pw_work);
>>> + if (job->numa_aware)
>>> + queue_work_node((++nid % num_node_state(N_MEMORY)),
>>
>> The nid may fall on a NUMA node with only memory but no CPU. In that
>> case you
>> may still put the work on the unbound queue. You could end up on one
>> CPU node for work
>> from all memory nodes without CPU. Is this what you want? Or would you
>> like to spread them between CPU nodes?
>>
>> Tim
>
> Hi, thank you for your reminder. My intention was to fully utilize all
> memory bandwidth.
>
> For memory nodes without CPUs, I also hope to be able to spread them on
> different CPUs.
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -137,6 +137,7 @@ struct padata_shell {
* appropriate for one worker thread to do at once.
* @max_threads: Max threads to use for the job, actual number may be less
* depending on task size and minimum chunk size.
+ * @numa_aware: Dispatch jobs to different nodes.
*/
struct padata_mt_job {
void (*thread_fn)(unsigned long start, unsigned long end, void *arg);
@@ -146,6 +147,7 @@ struct padata_mt_job {
unsigned long align;
unsigned long min_chunk;
int max_threads;
+ bool numa_aware;
};
/**
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -485,7 +485,7 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
struct padata_work my_work, *pw;
struct padata_mt_job_state ps;
LIST_HEAD(works);
- int nworks;
+ int nworks, nid = 0;
if (job->size == 0)
return;
@@ -517,7 +517,11 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
ps.chunk_size = roundup(ps.chunk_size, job->align);
list_for_each_entry(pw, &works, pw_list)
- queue_work(system_unbound_wq, &pw->pw_work);
+ if (job->numa_aware)
+ queue_work_node((++nid % num_node_state(N_MEMORY)),
+ system_unbound_wq, &pw->pw_work);
+ else
+ queue_work(system_unbound_wq, &pw->pw_work);
/* Use the current thread, which saves starting a workqueue worker. */
padata_work_init(&my_work, padata_mt_helper, &ps, PADATA_WORK_ONSTACK);
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2234,6 +2234,7 @@ static int __init deferred_init_memmap(void *data)
.align = PAGES_PER_SECTION,
.min_chunk = PAGES_PER_SECTION,
.max_threads = max_threads,
+ .numa_aware = false,
};
padata_do_multithreaded(&job);