[RFC,0/3] workqueue: Enable unbound cpumask update on ordered workqueues

Message ID 20240130183336.511948-1-longman@redhat.com
Series workqueue: Enable unbound cpumask update on ordered workqueues

Message

Waiman Long Jan. 30, 2024, 6:33 p.m. UTC
  Ordered workqueues do not currently follow changes made to the
global unbound cpumask because per-pool workqueue changes may break
the ordering guarantee. As a result, a work function in an ordered
workqueue may run on a cpuset-isolated CPU.

This series enables ordered workqueues to follow changes made to the
global unbound cpumask by temporarily saving the work items in an
internal queue until the old pwq has been properly flushed and is
ready to be freed. At that point, those work items, if present, are
queued back to the new pwq to be executed.
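
For illustration, a rough sketch of that park-and-requeue idea (the
struct, field and helper names below are hypothetical and are not taken
from the actual patches):

  #include <linux/list.h>
  #include <linux/spinlock.h>
  #include <linux/workqueue.h>

  /* Hypothetical holding area for the work items of one ordered wq. */
  struct ordered_park {
          spinlock_t lock;
          struct list_head works;         /* parked while the old pwq drains */
  };

  /* Park an incoming work item instead of queueing it on the dying pwq. */
  static void park_work(struct ordered_park *op, struct work_struct *work)
  {
          unsigned long flags;

          spin_lock_irqsave(&op->lock, flags);
          list_add_tail(&work->entry, &op->works);
          spin_unlock_irqrestore(&op->lock, flags);
  }

  /*
   * Once the old pwq has been flushed, requeue everything on @wq so the
   * parked items land on the newly created pwq in their original order.
   */
  static void requeue_parked_works(struct ordered_park *op,
                                   struct workqueue_struct *wq)
  {
          struct work_struct *work, *tmp;
          unsigned long flags;

          spin_lock_irqsave(&op->lock, flags);
          list_for_each_entry_safe(work, tmp, &op->works, entry) {
                  list_del_init(&work->entry);
                  queue_work(wq, work);
          }
          spin_unlock_irqrestore(&op->lock, flags);
  }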

Waiman Long (3):
  workqueue: Skip __WQ_DESTROYING workqueues when updating global
    unbound cpumask
  workqueue: Break out __queue_work_rcu_locked() from __queue_work()
  workqueue: Enable unbound cpumask update on ordered workqueues

 kernel/workqueue.c | 217 ++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 183 insertions(+), 34 deletions(-)
  

Comments

Juri Lelli Jan. 31, 2024, 1:01 p.m. UTC | #1
Hi Waiman,

Thanks for working on this!

On 30/01/24 13:33, Waiman Long wrote:
> Ordered workqueues do not currently follow changes made to the
> global unbound cpumask because per-pool workqueue changes may break
> the ordering guarantee. As a result, a work function in an ordered
> workqueue may run on a cpuset-isolated CPU.
>
> This series enables ordered workqueues to follow changes made to the
> global unbound cpumask by temporarily saving the work items in an
> internal queue until the old pwq has been properly flushed and is
> ready to be freed. At that point, those work items, if present, are
> queued back to the new pwq to be executed.

I took it for a quick first spin (on top of wq/for-6.9) and this is what
I'm seeing.

Let's take edac-poller ordered wq, as the behavior seems to be the same
for the rest.

Initially we have (using wq_dump.py)

wq_unbound_cpumask=0xffffffff 000000ff
..
pool[80] ref= 44 nice=  0 idle/workers=  2/  2 cpus=0xffffffff 000000ff pod_cpus=0xffffffff 000000ff
..
edac-poller                      ordered    80 80 80 80 80 80 80 80 ...
..
edac-poller                      0xffffffff 000000ff    345 0xffffffff 000000ff

after I

# echo 3 >/sys/devices/virtual/workqueue/cpumask

I get

wq_unbound_cpumask=00000003
..
pool[86] ref= 44 nice=  0 idle/workers=  2/  2 cpus=00000003 pod_cpus=00000003
..
edac-poller                      ordered    86 86 86 86 86 86 86 86 86 86 ...
..
edac-poller                      0xffffffff 000000ff    345 0xffffffff 000000ff

So, IIUC, the pool and wq -> pool settings are updated correctly, but
wq->unbound_attrs->cpumask (and the associated rescuer affinity) is
left untouched. Is this expected, or are we maybe still missing an
additional step?

Best,
Juri
  
Waiman Long Jan. 31, 2024, 3:31 p.m. UTC | #2
On 1/31/24 08:01, Juri Lelli wrote:
> Hi Waiman,
>
> Thanks for working on this!
>
> On 30/01/24 13:33, Waiman Long wrote:
>> Ordered workqueues do not currently follow changes made to the
>> global unbound cpumask because per-pool workqueue changes may break
>> the ordering guarantee. As a result, a work function in an ordered
>> workqueue may run on a cpuset-isolated CPU.
>>
>> This series enables ordered workqueues to follow changes made to the
>> global unbound cpumask by temporarily saving the work items in an
>> internal queue until the old pwq has been properly flushed and is
>> ready to be freed. At that point, those work items, if present, are
>> queued back to the new pwq to be executed.
> I took it for a quick first spin (on top of wq/for-6.9) and this is what
> I'm seeing.
>
> Let's take edac-poller ordered wq, as the behavior seems to be the same
> for the rest.
>
> Initially we have (using wq_dump.py)
>
> wq_unbound_cpumask=0xffffffff 000000ff
> ...
> pool[80] ref= 44 nice=  0 idle/workers=  2/  2 cpus=0xffffffff 000000ff pod_cpus=0xffffffff 000000ff
> ...
> edac-poller                      ordered    80 80 80 80 80 80 80 80 ...
> ...
> edac-poller                      0xffffffff 000000ff    345 0xffffffff 000000ff
>
> after I
>
> # echo 3 >/sys/devices/virtual/workqueue/cpumask
>
> I get
>
> wq_unbound_cpumask=00000003
> ...
> pool[86] ref= 44 nice=  0 idle/workers=  2/  2 cpus=00000003 pod_cpus=00000003
> ...
> edac-poller                      ordered    86 86 86 86 86 86 86 86 86 86 ...
> ...
> edac-poller                      0xffffffff 000000ff    345 0xffffffff 000000ff
>
> So, IIUC, the pool and wq -> pool settings are updated correctly, but
> wq->unbound_attrs->cpumask (and the associated rescuer affinity) is
> left untouched. Is this expected, or are we maybe still missing an
> additional step?

Isn't this what the 4th patch of your RFC workqueue patch series does?

https://lore.kernel.org/lkml/20240116161929.232885-5-juri.lelli@redhat.com/

The focus of this series is to make sure that the pool cpumask of
ordered workqueues follows changes in the global unbound cpumask. So I
haven't touched anything related to the rescuer at all.

I will include your 4th patch in the next version of this series.

Cheers,
Longman
  
Juri Lelli Feb. 1, 2024, 10:18 a.m. UTC | #3
On 31/01/24 10:31, Waiman Long wrote:
> 
> On 1/31/24 08:01, Juri Lelli wrote:
> > Hi Waiman,
> > 
> > Thanks for working on this!
> > 
> > On 30/01/24 13:33, Waiman Long wrote:
> > > Ordered workqueues do not currently follow changes made to the
> > > global unbound cpumask because per-pool workqueue changes may break
> > > the ordering guarantee. As a result, a work function in an ordered
> > > workqueue may run on a cpuset-isolated CPU.
> > >
> > > This series enables ordered workqueues to follow changes made to the
> > > global unbound cpumask by temporarily saving the work items in an
> > > internal queue until the old pwq has been properly flushed and is
> > > ready to be freed. At that point, those work items, if present, are
> > > queued back to the new pwq to be executed.
> > I took it for a quick first spin (on top of wq/for-6.9) and this is what
> > I'm seeing.
> > 
> > Let's take edac-poller ordered wq, as the behavior seems to be the same
> > for the rest.
> > 
> > Initially we have (using wq_dump.py)
> > 
> > wq_unbound_cpumask=0xffffffff 000000ff
> > ...
> > pool[80] ref= 44 nice=  0 idle/workers=  2/  2 cpus=0xffffffff 000000ff pod_cpus=0xffffffff 000000ff
> > ...
> > edac-poller                      ordered    80 80 80 80 80 80 80 80 ...
> > ...
> > edac-poller                      0xffffffff 000000ff    345 0xffffffff 000000ff
> > 
> > after I
> > 
> > # echo 3 >/sys/devices/virtual/workqueue/cpumask
> > 
> > I get
> > 
> > wq_unbound_cpumask=00000003
> > ...
> > pool[86] ref= 44 nice=  0 idle/workers=  2/  2 cpus=00000003 pod_cpus=00000003
> > ...
> > edac-poller                      ordered    86 86 86 86 86 86 86 86 86 86 ...
> > ...
> > edac-poller                      0xffffffff 000000ff    345 0xffffffff 000000ff
> > 
> > So, IIUC, the pool and wq -> pool settings are updated correctly, but
> > wq->unbound_attrs->cpumask (and the associated rescuer affinity) is
> > left untouched. Is this expected, or are we maybe still missing an
> > additional step?
> 
> Isn't this what the 4th patch of your RFC workqueue patch series does?
> 
> https://lore.kernel.org/lkml/20240116161929.232885-5-juri.lelli@redhat.com/
> 
> The focus of this series is to make sure that the pool cpumask of
> ordered workqueues follows changes in the global unbound cpumask. So I
> haven't touched anything related to the rescuer at all.

My patch only uses the wq->unbound_attrs->cpumask to change the
associated rescuer cpumask, but I don't think your series modifies the
former?
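
Roughly speaking, that rescuer update amounts to something of the
following shape inside kernel/workqueue.c (a hedged sketch only, not the
literal 4/4 diff; set_cpus_allowed_ptr() is the existing kernel API and
wq->rescuer->task is the rescuer's kthread):

  /* Sketch: push the wq's unbound cpumask to its rescuer thread. */
  if (wq->rescuer)
          set_cpus_allowed_ptr(wq->rescuer->task,
                               wq->unbound_attrs->cpumask);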

Thanks,
Juri
  
Waiman Long Feb. 1, 2024, 2:28 p.m. UTC | #4
On 2/1/24 05:18, Juri Lelli wrote:
> On 31/01/24 10:31, Waiman Long wrote:
>> On 1/31/24 08:01, Juri Lelli wrote:
>>> Hi Waiman,
>>>
>>> Thanks for working on this!
>>>
>>> On 30/01/24 13:33, Waiman Long wrote:
>>>> Ordered workqueues do not currently follow changes made to the
>>>> global unbound cpumask because per-pool workqueue changes may break
>>>> the ordering guarantee. As a result, a work function in an ordered
>>>> workqueue may run on a cpuset-isolated CPU.
>>>>
>>>> This series enables ordered workqueues to follow changes made to the
>>>> global unbound cpumask by temporarily saving the work items in an
>>>> internal queue until the old pwq has been properly flushed and is
>>>> ready to be freed. At that point, those work items, if present, are
>>>> queued back to the new pwq to be executed.
>>> I took it for a quick first spin (on top of wq/for-6.9) and this is what
>>> I'm seeing.
>>>
>>> Let's take edac-poller ordered wq, as the behavior seems to be the same
>>> for the rest.
>>>
>>> Initially we have (using wq_dump.py)
>>>
>>> wq_unbound_cpumask=0xffffffff 000000ff
>>> ...
>>> pool[80] ref= 44 nice=  0 idle/workers=  2/  2 cpus=0xffffffff 000000ff pod_cpus=0xffffffff 000000ff
>>> ...
>>> edac-poller                      ordered    80 80 80 80 80 80 80 80 ...
>>> ...
>>> edac-poller                      0xffffffff 000000ff    345 0xffffffff 000000ff
>>>
>>> after I
>>>
>>> # echo 3 >/sys/devices/virtual/workqueue/cpumask
>>>
>>> I get
>>>
>>> wq_unbound_cpumask=00000003
>>> ...
>>> pool[86] ref= 44 nice=  0 idle/workers=  2/  2 cpus=00000003 pod_cpus=00000003
>>> ...
>>> edac-poller                      ordered    86 86 86 86 86 86 86 86 86 86 ...
>>> ...
>>> edac-poller                      0xffffffff 000000ff    345 0xffffffff 000000ff
>>>
>>> So, IIUC, the pool and wq -> pool settings are updated correctly, but
>>> wq->unbound_attrs->cpumask (and the associated rescuer affinity) is
>>> left untouched. Is this expected, or are we maybe still missing an
>>> additional step?
>> Isn't this what the 4th patch of your RFC workqueue patch series does?
>>
>> https://lore.kernel.org/lkml/20240116161929.232885-5-juri.lelli@redhat.com/
>>
>> The focus of this series is to make sure that the pool cpumask of
>> ordered workqueues follows changes in the global unbound cpumask. So I
>> haven't touched anything related to the rescuer at all.
> My patch only uses the wq->unbound_attrs->cpumask to change the
> associated rescuer cpumask, but I don't think your series modifies the
> former?

I don't think so. The calling sequence of apply_wqattrs_prepare() and
apply_wqattrs_commit() will copy unbound_cpumask into ctx->attrs, which
is then copied into unbound_attrs. So unbound_attrs->cpumask should
reflect the new global unbound cpumask. This code has been there all
along. The only difference is that ordered workqueues were previously
skipped when the unbound cpumask was updated; this patch series now
includes them in that update.

Cheers,
Longman
  
Juri Lelli Feb. 2, 2024, 2:55 p.m. UTC | #5
On 01/02/24 09:28, Waiman Long wrote:
> On 2/1/24 05:18, Juri Lelli wrote:
> > On 31/01/24 10:31, Waiman Long wrote:

..

> > My patch only uses the wq->unbound_attrs->cpumask to change the
> > associated rescuer cpumask, but I don't think your series modifies the
> > former?
> 
> I don't think so. The calling sequence of apply_wqattrs_prepare() and
> apply_wqattrs_commit() will copy unbound_cpumask into ctx->attrs, which
> is then copied into unbound_attrs. So unbound_attrs->cpumask should
> reflect the new global unbound cpumask. This code has been there all
> along.

Indeed. I believe this is what my 3/4 [1] was trying to cure, though. I
still think that with the current code new_attrs->cpumask first gets
correctly initialized, taking unbound_cpumask into account

apply_wqattrs_prepare ->
  copy_workqueue_attrs(new_attrs, attrs);
  wqattrs_actualize_cpumask(new_attrs, unbound_cpumask);

but then overwritten further below using cpu_possible_mask

apply_wqattrs_prepare ->
  copy_workqueue_attrs(new_attrs, attrs);
  cpumask_and(new_attrs->cpumask, new_attrs->cpumask, cpu_possible_mask);

an operation I honestly still fail to grasp the need for. :)

In the end we commit that last (overwritten) cpumask

apply_wqattrs_commit ->
  copy_workqueue_attrs(ctx->wq->unbound_attrs, ctx->attrs);
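
So the net effect on what gets committed is roughly (simplified, not
literal code):

  /* effectively */
  copy_workqueue_attrs(wq->unbound_attrs, attrs);
  cpumask_and(wq->unbound_attrs->cpumask,
              wq->unbound_attrs->cpumask, cpu_possible_mask);

i.e. the user-requested mask clipped to cpu_possible_mask, rather than
the mask actualized against wq_unbound_cpumask.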

Now, my patch was wrong, as you pointed out, since it wasn't taking the
ordering guarantee into consideration. I thought maybe your changes
(plus an additional change to the above?) might fix the problem
correctly.

Best,
Juri

1 - https://lore.kernel.org/lkml/20240116161929.232885-4-juri.lelli@redhat.com/
  
Tejun Heo Feb. 2, 2024, 5:07 p.m. UTC | #6
Hello,

On Fri, Feb 02, 2024 at 03:55:15PM +0100, Juri Lelli wrote:
> Indeed. I believe this is what my 3/4 [1] was trying to cure, though. I
> still think that with the current code new_attrs->cpumask first gets
> correctly initialized, taking unbound_cpumask into account
> 
> apply_wqattrs_prepare ->
>   copy_workqueue_attrs(new_attrs, attrs);
>   wqattrs_actualize_cpumask(new_attrs, unbound_cpumask);
> 
> but then overwritten further below using cpu_possible_mask
> 
> apply_wqattrs_prepare ->
>   copy_workqueue_attrs(new_attrs, attrs);
>   cpumask_and(new_attrs->cpumask, new_attrs->cpumask, cpu_possible_mask);
> 
> an operation I honestly still fail to grasp the need for. :)

So, imagine the following scenario on a system with four CPUs:

1. Initially both wq_unbound_cpumask and wq A's cpumask are 0xf.

2. wq_unbound_cpumask is set to 0x3. A's effective is 0x3.

3. A's cpumask is set to 0xe, A's effective is 0x2.

4. wq_unbound_cpumask is restored to 0xf. A's effective should become 0xe.

The reason why we're saving what user requested rather than effective is to
be able to do #4 so that the effective is always what's currently allowed
from what the user specified for the workqueue.
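
As a plain-C illustration of that bookkeeping (ordinary integers stand
in for cpumasks here; this is not kernel code):

  #include <stdio.h>

  int main(void)
  {
          unsigned int requested = 0xf;   /* what the user asked for on wq A */
          unsigned int unbound   = 0xf;   /* wq_unbound_cpumask */

          unbound = 0x3;                                  /* step 2 */
          printf("effective=%#x\n", requested & unbound); /* 0x3 */

          requested = 0xe;                                /* step 3 */
          printf("effective=%#x\n", requested & unbound); /* 0x2 */

          unbound = 0xf;                                  /* step 4: restore */
          printf("effective=%#x\n", requested & unbound); /* 0xe again */

          return 0;
  }

Because the requested mask is saved, nothing is lost when the effective
mask temporarily shrinks.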

Now, if you want the current effective cpumask, that always coincides with
the workqueue's dfl_pwq's __pod_cpumask and if you look at the current
wq/for-6.9 branch, that's accessible through unbound_effective_cpumask()
helper.

Thanks.
  
Waiman Long Feb. 2, 2024, 7:03 p.m. UTC | #7
On 2/2/24 12:07, Tejun Heo wrote:
> Hello,
>
> On Fri, Feb 02, 2024 at 03:55:15PM +0100, Juri Lelli wrote:
>> Indeed. I believe this is what my 3/4 [1] was trying to cure, though. I
>> still think that with the current code new_attrs->cpumask first gets
>> correctly initialized, taking unbound_cpumask into account
>>
>> apply_wqattrs_prepare ->
>>    copy_workqueue_attrs(new_attrs, attrs);
>>    wqattrs_actualize_cpumask(new_attrs, unbound_cpumask);
>>
>> but then overwritten further below using cpu_possible_mask
>>
>> apply_wqattrs_prepare ->
>>    copy_workqueue_attrs(new_attrs, attrs);
>>    cpumask_and(new_attrs->cpumask, new_attrs->cpumask, cpu_possible_mask);
>>
>> an operation I honestly still fail to grasp the need for. :)
> So, imagine the following scenario on a system with four CPUs:
>
> 1. Initially both wq_unbound_cpumask and wq A's cpumask are 0xf.
>
> 2. wq_unbound_cpumask is set to 0x3. A's effective is 0x3.
>
> 3. A's cpumask is set to 0xe, A's effective is 0x2.
>
> 4. wq_unbound_cpumask is restored to 0xf. A's effective should become 0xe.
>
> The reason why we're saving what user requested rather than effective is to
> be able to do #4 so that the effective is always what's currently allowed
> from what the user specified for the workqueue.
>
> Now, if you want the current effective cpumask, that always coincides with
> the workqueue's dfl_pwq's __pod_cpumask and if you look at the current
> wq/for-6.9 branch, that's accessible through unbound_effective_cpumask()
> helper.

Thanks for the explanation; we will use the new
unbound_effective_cpumask() helper. It does look like there is a major
restructuring of the workqueue code in 6.9, so I will adapt my patch
series to be based on the for-6.9 branch.
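
The rescuer side would then presumably take the same shape as the
earlier sketch, but sourced from the effective mask instead (hedged;
this assumes the helper simply takes the workqueue and returns a
cpumask pointer):

  if (wq->rescuer)
          set_cpus_allowed_ptr(wq->rescuer->task,
                               unbound_effective_cpumask(wq));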

Cheers,
Longman
  
Juri Lelli Feb. 5, 2024, 6:30 a.m. UTC | #8
On 02/02/24 14:03, Waiman Long wrote:
> On 2/2/24 12:07, Tejun Heo wrote:
> > Hello,
> > 
> > On Fri, Feb 02, 2024 at 03:55:15PM +0100, Juri Lelli wrote:
> > > Indeed. I believe this is what my 3/4 [1] was trying to cure, though. I
> > > still think that with the current code new_attrs->cpumask first gets
> > > correctly initialized, taking unbound_cpumask into account
> > > 
> > > apply_wqattrs_prepare ->
> > >    copy_workqueue_attrs(new_attrs, attrs);
> > >    wqattrs_actualize_cpumask(new_attrs, unbound_cpumask);
> > > 
> > > but then overwritten further below using cpu_possible_mask
> > > 
> > > apply_wqattrs_prepare ->
> > >    copy_workqueue_attrs(new_attrs, attrs);
> > >    cpumask_and(new_attrs->cpumask, new_attrs->cpumask, cpu_possible_mask);
> > > 
> > > an operation I honestly still fail to grasp the need for. :)
> > So, imagine the following scenario on a system with four CPUs:
> > 
> > 1. Initially both wq_unbound_cpumask and wq A's cpumask are 0xf.
> > 
> > 2. wq_unbound_cpumask is set to 0x3. A's effective is 0x3.
> > 
> > 3. A's cpumask is set to 0xe, A's effective is 0x2.
> > 
> > 4. wq_unbound_cpumask is restored to 0xf. A's effective should become 0xe.
> > 
> > The reason why we're saving what user requested rather than effective is to
> > be able to do #4 so that the effective is always what's currently allowed
> > from what the user specified for the workqueue.

Thanks for the explanation!

> > Now, if you want the current effective cpumask, that always coincides with
> > the workqueue's dfl_pwq's __pod_cpumask and if you look at the current
> > wq/for-6.9 branch, that's accessible through unbound_effective_cpumask()
> > helper.
> 
> Thanks for the explanation; we will use the new unbound_effective_cpumask()
> helper.

Right, that should indeed work.

Best,
Juri