Message ID: <20221227125502.541931-2-yukuai1@huaweicloud.com>
State: New
From: Yu Kuai <yukuai1@huaweicloud.com>
To: tj@kernel.org, hch@infradead.org, josef@toxicpanda.com, axboe@kernel.dk
Cc: cgroups@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, yukuai3@huawei.com, yukuai1@huaweicloud.com, yi.zhang@huawei.com
Subject: [PATCH v2 1/2] blk-iocost: add refcounting for iocg
Date: Tue, 27 Dec 2022 20:55:01 +0800
In-Reply-To: <20221227125502.541931-1-yukuai1@huaweicloud.com>
References: <20221227125502.541931-1-yukuai1@huaweicloud.com>
Series: blk-iocost: add refcounting for iocg and ioc
Commit Message
Yu Kuai
Dec. 27, 2022, 12:55 p.m. UTC
From: Yu Kuai <yukuai3@huawei.com>

iocost requires that a child iocg must exit before its parent iocg,
otherwise the kernel might crash in ioc_timer_fn(). However, currently
iocg is exited in pd_free_fn(), which can't guarantee such order:

1) removing a cgroup can be concurrent with deactivating the policy;
2) blkg_free() triggered by removing a cgroup is asynchronous, so
   removing a child cgroup can be concurrent with removing the parent
   cgroup.

Fix the problem by adding refcounting for iocg: a child iocg grabs a
reference on its parent iocg, so the parent iocg waits for all child
iocgs to exit.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-iocost.c | 74 +++++++++++++++++++++++++++++++---------------
 1 file changed, 50 insertions(+), 24 deletions(-)
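The lifetime rule the patch enforces can be sketched outside the kernel as a toy reference count: each child pins its parent, so the parent's teardown can only run after the last child's. This is a minimal user-space sketch; `node`, `node_new()` and `node_put()` are illustrative names, not kernel APIs.

```c
#include <stdlib.h>

struct node {
	int refs;
	struct node *parent;
	int exited;                 /* stand-in for "teardown has run" */
};

static struct node *node_new(struct node *parent)
{
	struct node *n = calloc(1, sizeof(*n));

	n->refs = 1;
	n->parent = parent;
	if (parent)
		parent->refs++;     /* child grabs a reference on its parent */
	return n;
}

static void node_put(struct node *n)
{
	if (--n->refs)
		return;
	n->exited = 1;              /* real code would free n here */
	if (n->parent)
		node_put(n->parent); /* parent exits only after last child */
}
```

Dropping the parent's own reference while a child is still alive leaves the parent pinned; only the final child's `node_put()` cascades the parent's teardown, mirroring how `iocg_put()` in the patch defers the parent's exit.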
Comments
On Tue, Dec 27, 2022 at 08:55:01PM +0800, Yu Kuai wrote:
> iocost requires that child iocg must exit before parent iocg, otherwise
> kernel might crash in ioc_timer_fn(). However, currently iocg is exited
> in pd_free_fn(), which can't guarantee such order:
>
> 1) remove cgroup can concurrent with deactivate policy;
> 2) blkg_free() triggered by remove cgroup is asynchronously, remove
>    child cgroup can concurrent with remove parent cgroup;
>
> Fix the problem by add refcounting for iocg, and child iocg will grab
> reference of parent iocg, so that parent iocg will wait for all child
> iocg to be exited.

Wouldn't it be better to do this refcnting in the blk-cgroup core code
rather than in blk-iocost?

Thanks.
Hi,

On 2023/01/05 5:44, Tejun Heo wrote:
> Wouldn't it be better to do this refcnting in the blk-cgroup core code
> rather than in blk-iocost?

The problem is that I can't find a proper way to fix the race where
pd_free_fn() can be called from different contexts:

1) from blkg_free(), which is called asynchronously from removing a cgroup;
2) from blkcg_deactivate_policy(), which is called from removing a device.

1) is related to blkg, while 2) is not, hence refcnting from blkg can't
fix the problem. Refcnting from blkcg_policy_data should be ok, but I see
that bfq already has similar refcnting, while other policies don't
require such refcnting.

Any suggestions?

Thanks,
Kuai
On Thu, Jan 05, 2023 at 09:14:07AM +0800, Yu Kuai wrote:
> 1) is related to blkg, while 2) is not, hence refcnting from blkg can't
> fix the problem. refcnting from blkcg_policy_data should be ok, but I
> see that bfq already has the similar refcnting, while other policy
> doesn't require such refcnting.

Hmm... taking a step back, wouldn't this be solved by moving the first
part of ioc_pd_free() to pd_offline_fn()? The ordering is strictly
defined there, right?

Thanks.

--
tejun
Hi,

On 2023/01/06 2:32, Tejun Heo wrote:
> Hmm... taking a step back, wouldn't this be solved by moving the first part
> of ioc_pd_free() to pd_offline_fn()? The ordering is strictly defined there,
> right?

Moving the first part to pd_offline_fn() has some requirements, like what
I did in the other thread:

An iocg can be activated again after pd_offline_fn(), which is possible
because bio can be dispatched while the cgroup is being removed. I tried
to avoid that by:

1) dispatching all throttled bios in ioc_pd_offline();
2) not throttling bios after ioc_pd_offline().

However, you already disagreed with that. 😔

Thanks,
Kuai
On Fri, Jan 06, 2023 at 09:08:45AM +0800, Yu Kuai wrote:
> Moving first part to pd_offline_fn() has some requirements, like what I
> did in the other thread:
>
> iocg can be activated again after pd_offline_fn(), which is possible
> because bio can be dispatched when cgroup is removed. I tried to avoid
> that by:
>
> 1) dispatch all throttled bio io ioc_pd_offline()
> 2) don't throttle bio after ioc_pd_offline()
>
> However, you already disagreed with that. 😔

Okay, I was completely wrong while I was replying to your original patch.
Should have looked at the code closer, my apologies.

What I missed is that pd_offline doesn't happen when the cgroup goes
offline. Please take a look at the following two commits:

59b57717fff8 ("blkcg: delay blkg destruction until after writeback has finished")
d866dbf61787 ("blkcg: rename blkcg->cgwb_refcnt to ->online_pin and always use it")

After the above two commits, ->pd_offline_fn() is called only after all
possible writebacks are complete, so it shouldn't allow mass escapes to
root. With writebacks out of the picture, it might be that there can be no
further IOs once ->pd_offline_fn() is called too as there can be no tasks
left in it and no dirty pages, but best to confirm that.

So, yeah, the original approach you took should work although I'm not sure
the patches that you added to make offline blkg bypass are necessary
(that also contributed to my assumption that there will be more IOs on
those blkgs). Have you seen more IOs coming down the pipeline after
offline? If so, can you dump some backtraces and see where they're coming
from?

Thanks.
Hi,

On 2023/01/07 4:18, Tejun Heo wrote:
> What I missed is that pd_offline doesn't happen when the cgroup goes
> offline. Please take a look at the following two commits:
>
> 59b57717fff8 ("blkcg: delay blkg destruction until after writeback has finished")
> d866dbf61787 ("blkcg: rename blkcg->cgwb_refcnt to ->online_pin and always use it")

These two commits have been applied for three years. I haven't checked the
details yet, but they don't seem to guarantee that no io will be handled
by rq_qos_throttle() after pd_offline_fn(), because I just reproduced this
in another problem:

f02be9002c48 ("block, bfq: fix null pointer dereference in bfq_bio_bfqg()")

A user thread can issue async io, and the io can be throttled by
blk-throttle (not writeback); then the user thread can exit and the cgroup
can be removed before such io is dispatched to rq_qos_throttle().

> So, yeah, the original approach you took should work although I'm not sure
> the patches that you added to make offline blkg to bypass are necessary
> (that also contributed to my assumption that there will be more IOs on those
> blkg's). Have you seen more IOs coming down the pipeline after offline? If
> so, can you dump some backtraces and see where they're coming from?

Currently I'm sure such IOs can come from blk-throttle, and I'm not sure
yet, but I also suspect io_uring can do this.

Thanks,
Kuai
Hello,

On Mon, Jan 09, 2023 at 09:32:46AM +0800, Yu Kuai wrote:
> These two commits are applied for three years, I don't check the details
> yet but they seem can't guarantee that no io will be handled by
> rq_qos_throttle() after pd_offline_fn(), because I just reproduced this
> in another problem:
>
> f02be9002c48 ("block, bfq: fix null pointer dereference in bfq_bio_bfqg()")

I see.

> Currently I'm sure such IOs can come from blk-throttle, and I'm not sure
> yet but I also suspect io_uring can do this.

Yeah, that's unfortunate. There are several options here:

1. Do what you originally suggested - bypass to root after offline. I feel
   uneasy about this. Both iolatency and throtl clear their configs on
   offline but that's punting to the parent. For iocost it'd be bypassing
   all controls, which can actually be exploited.

2. Make all possible IO issuers use blkcg_[un]pin_online() and shift the
   iocost shutdown to pd_offline_fn(). This likely is the most canonical
   solution given the current situation but it's kinda nasty to add
   another layer of refcnting all over the place.

3. Order blkg free so that parents are never freed before children. You
   did this by adding refcnts in iocost but shouldn't it be possible to
   simply shift blkg_put(blkg->parent) in __blkg_release() to
   blkg_free_workfn()?

#3 seems the most logical to me. What do you think?

Thanks.
Hi,

On 2023/01/10 2:23, Tejun Heo wrote:
> 3. Order blkg free so that parents are never freed before children. You did
>    this by adding refcnts in iocost but shouldn't it be possible to simply
>    shift blkg_put(blkg->parent) in __blkg_release() to blkg_free_workfn()?

As I tried to explain before, we can make sure blkg_free() is called in
order, but blkg_free() from removing a cgroup can be concurrent with
deactivating a policy, and we can't guarantee the order of ioc_pd_free(),
which is called both from blkg_free() and blkcg_deactivate_policy().
Hence I don't think #3 is possible.

I personally prefer #1. I don't see any real use case affected by the
defect that you described, and actually in cgroup v1 blk-throtl is
bypassed to no limit as well.

I'm not sure about #2; that sounds like a possible solution, but I'm not
quite familiar with the implementation here.

Considering that bfq already has such refcounting for bfqg, perhaps
similar refcounting is acceptable?

Thanks,
Kuai
Hello,

On Tue, Jan 10, 2023 at 09:39:44AM +0800, Yu Kuai wrote:
> As I tried to explain before, we can make sure blkg_free() is called
> in order, but blkg_free() from remove cgroup can concurrent with
> deactivate policy, and we can't guarantee the order of ioc_pd_free()
> that is called both from blkg_free() and blkcg_deactivate_policy().
> Hence I don't think #3 is possible.

Hahaha, sorry that I keep forgetting that. This doesn't really feel like
that important or difficult part of the problem tho. Can't it be solved by
synchronizing the blkg free work item against the deactivate path with a
mutex?

Thanks.
Hi,

On 2023/01/11 2:36, Tejun Heo wrote:
> Hahaha, sorry that I keep forgetting that. This doesn't really feel like
> that important or difficult part of the problem tho. Can't it be solved by
> synchronizing blkg free work item against the deactivate path with a mutex?

I'm not sure. Of course this can fix the problem, but two spinlocks,
'blkcg->lock' and 'q->queue_lock', are used to protect blkg_destroy()
currently; adding a mutex (disk level?) requires a refactor, which seems
complex to me.

Thanks,
Kuai
Hello,

On Wed, Jan 11, 2023 at 09:36:25AM +0800, Yu Kuai wrote:
> I'm not sure, of course this can fix the problem, but two spinlock
> 'blkcg->lock' and 'q->queue_lock' are used to protect blkg_destroy()
> currently, add a mutex(disk level?) requires a refactor, which seems
> complex to me.

The fact that the two paths can race each other already seems buggy. e.g.
What prevents them from running pd_free on the same pd twice? So, it needs
to be fixed anyway and the intention always has been that these callbacks
are called in the correct traversal order.

Thanks.
Hi,

On 2023/01/12 1:07, Tejun Heo wrote:
> The fact that the two paths can race each other already seems buggy. e.g.
> What prevents them from running pd_free on the same pd twice? So, it needs
> to be fixed anyway and the intention always has been that these callbacks
> are called in the correct traversal order.

I think the root cause is that blkg is tracked from two different lists:
blkcg->blkg_list at the cgroup level and q->blkg_list at the disk level.
And pd_free_fn() is also called both from blkg_destroy() and from
deactivating a policy for a disk.

I just thought about another solution:

Remove blkcg_deactivate_policy() from rq_qos_exit() when deleting the
device, and delay the policy cleanup and free to blkg_destroy_all(). Then
the policies (other than bfq) can only call pd_free_fn() from
blkg_destroy(), and it's easy to guarantee the order. For bfq, it can stay
the same, since bfq has refcounting itself.

Then for the problem that ioc can be freed in pd_free_fn(), we can fix it
by freeing ioc in ioc_pd_free() for the root blkg instead of
rq_qos_exit().

What do you think?

Thanks,
Kuai
Hello,

On Thu, Jan 12, 2023 at 02:18:15PM +0800, Yu Kuai wrote:
> remove the blkcg_deactivate_policy() from rq_qos_exit() from deleting
> the device, and delay the policy cleanup and free to blkg_destroy_all().
> Then the policies(other than bfq) can only call pd_free_fn() from
> blkg_destroy(), and it's easy to guarantee the order. For bfq, it can
> stay the same since bfq has refcounting itself.

That would remove the ability to dynamically remove an rq_qos policy,
right? We don't currently do it but given that having an rq_qos registered
comes with perf overhead, it's something we might want to do in the
future - e.g. only activate the policy when the controller is actually
enabled. So, idk. What's wrong with synchronizing the two removal paths?
blkcg policies are combinations of cgroups and block device
configurations, so having exit paths from both sides is kinda natural.

Thanks.
Hi,

On 2023/01/13 8:53, Tejun Heo wrote:
> That would remove the ability to dynamically remove an rq_qos policy, right?
> We don't currently do it but given that having an rq_qos registered comes
> with perf overhead, it's something we might want to do in the future - e.g.
> only activate the policy when the controller is actually enabled.

Yes, that makes sense - removing ioc and other policies dynamically.

> So, idk.
> What's wrong with synchronizing the two removal paths? blkcg policies are
> combinations of cgroups and block device configurations, so having exit
> paths from both sides is kinda natural.

I still can't figure out how to synchronize them with a mutex. Maybe I'm
being foolish...

Thanks,
Kuai
Hello,

On Fri, Jan 13, 2023 at 09:10:25AM +0800, Yu Kuai wrote:
> I still can't figure out how to synchronizing them will a mutex. Maybe
> I'm being foolish...

Hmm... can't you just use e.g. a per-bdev mutex which is grabbed by both
blkg_free_workfn() and blkcg_deactivate_policy()?

Thanks.
Hi,

On 2023/01/13 9:15, Tejun Heo wrote:
> Hmm... can't you just use e.g. per-bdev mutex which is grabbed by both
> blkg_free_workfn() and blkcg_deactivate_policy()?

I think holding the lock in blkg_free_workfn() is too late: pd_free_fn()
for the parent can be called first from blkcg_deactivate_policy().

t1: remove cgroup t1/t2
  blkcg_destroy_blkgs
    blkg_destroy
      percpu_ref_kill(&blkg->refcnt)
        blkg_release
          blkg_free
            schedule_work(&blkg->free_work)
            // t1 is done

t2: handle t1 from removing device
  blkcg_deactivate_policy
    pd_free_fn
    // free parent

t3: from t1
  blkg_free_workfn
    pd_free_fn
    // free child

Thanks,
Kuai
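The t1/t2/t3 interleaving above can be replayed deterministically as a user-space sketch (all names illustrative, not the kernel's): if the lock is taken only inside the deferred work item, the deactivate path still frees the parent's pd before the work item frees the child's, reproducing the ordering violation.

```c
static const char *freed[2];    /* records the order teardown ran in */
static int nfreed;

static void pd_free(const char *who)
{
	freed[nfreed++] = who;  /* stand-in for the real pd_free_fn() */
}

/* Replay of the interleaving: t1 only *schedules* the child's free
 * work, t2 (deactivate) frees the parent immediately, and t3 runs the
 * deferred work afterwards. */
static void replay(void)
{
	int child_work_pending = 1; /* t1: schedule_work(&blkg->free_work) */

	pd_free("parent");          /* t2: blkcg_deactivate_policy() */
	if (child_work_pending)
		pd_free("child");   /* t3: blkg_free_workfn(), too late */
}
```

Running the replay shows the parent torn down before its child - exactly the order iocost cannot tolerate - which is why a lock taken only inside the work item cannot help.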
Hello,

On Fri, Jan 13, 2023 at 09:25:11AM +0800, Yu Kuai wrote:
> I think hold the lock in blkg_free_workfn() is too late, pd_free_fn()
> for parent from blkcg_deactivate_policy() can be called first.

As we discussed before, you'd have to order the actual freeing by shifting
the ref puts into the free_work. If you move `blkg_put(blkg->parent)` and
`list_del_init(&blkg->q_node)` to blkg_free_workfn() (this will require
adjustments as these things are used from other places too), the free work
items will be ordered and the blkg would remain iterable - IOW,
deactivate_policy would be able to see it, allowing the two paths to
synchronize, right?

Thanks.
Hi,

On 2023/01/14 1:16, Tejun Heo wrote:
> As we discussed before, you'd have to order the actual freeing by shifting
> the ref puts into the free_work. If you move `blkg_put(blkg->parent)` and
> `list_del_init(&blkg->q_node)` to blkg_free_workfn() (this will require
> adjustments as these things are used from other places too), the free work
> items will be ordered and the blkg would remain iterable - IOW,
> deactivate_policy would be able to see it allowing the two paths to
> synchronize, right?

That sounds reasonable: only remove blkg from the queue list once
pd_free_fn() is done. This way deactivate_policy will be able to see it,
and if deactivate_policy is called first, pd_free_fn() can be called
there, and later blkg_free_workfn() should skip pd_free_fn().

I'm glad that we finally came up with a suitable solution. 😃

BTW, it might take some time before I send a new patchset, because Spring
Festival is coming.

Thanks,
Kuai
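The agreed scheme can be sketched as a per-device mutex plus a per-pd flag (all names illustrative, not the eventual kernel implementation): whichever of blkcg_deactivate_policy() or blkg_free_workfn() gets there first runs the teardown, and the other sees the flag and skips it, so pd_free_fn() runs exactly once.

```c
#include <pthread.h>
#include <stdbool.h>

struct pd {
	bool freed;                 /* set once teardown has run */
};

static pthread_mutex_t disk_mutex = PTHREAD_MUTEX_INITIALIZER;
static int pd_free_calls;           /* counts real teardown invocations */

/* Called from both removal paths; the mutex serializes them and the
 * per-pd flag makes the teardown idempotent. */
static void pd_free_once(struct pd *pd)
{
	pthread_mutex_lock(&disk_mutex);
	if (!pd->freed) {
		pd->freed = true;
		pd_free_calls++;    /* stand-in for pd_free_fn(pd) */
	}
	pthread_mutex_unlock(&disk_mutex);
}
```

Under this sketch, calling `pd_free_once()` from both paths - in either order - performs the real teardown exactly once, which answers the "what prevents them from running pd_free on the same pd twice?" concern raised earlier in the thread.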
diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index 7a0d754b9eb2..525e93e1175a 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -461,6 +461,8 @@ struct ioc_gq {
 	struct blkg_policy_data pd;
 	struct ioc *ioc;
 
+	refcount_t ref;
+
 	/*
 	 * A iocg can get its weight from two sources - an explicit
 	 * per-device-cgroup configuration or the default weight of the
@@ -2943,9 +2945,53 @@ static struct blkg_policy_data *ioc_pd_alloc(gfp_t gfp, struct request_queue *q,
 		return NULL;
 	}
 
+	refcount_set(&iocg->ref, 1);
 	return &iocg->pd;
 }
 
+static void iocg_get(struct ioc_gq *iocg)
+{
+	refcount_inc(&iocg->ref);
+}
+
+static void iocg_put(struct ioc_gq *iocg)
+{
+	struct ioc *ioc = iocg->ioc;
+	unsigned long flags;
+	struct ioc_gq *parent = NULL;
+
+	if (!refcount_dec_and_test(&iocg->ref))
+		return;
+
+	if (iocg->level > 0)
+		parent = iocg->ancestors[iocg->level - 1];
+
+	if (ioc) {
+		spin_lock_irqsave(&ioc->lock, flags);
+
+		if (!list_empty(&iocg->active_list)) {
+			struct ioc_now now;
+
+			ioc_now(ioc, &now);
+			propagate_weights(iocg, 0, 0, false, &now);
+			list_del_init(&iocg->active_list);
+		}
+
+		WARN_ON_ONCE(!list_empty(&iocg->walk_list));
+		WARN_ON_ONCE(!list_empty(&iocg->surplus_list));
+
+		spin_unlock_irqrestore(&ioc->lock, flags);
+
+		hrtimer_cancel(&iocg->waitq_timer);
+	}
+
+	free_percpu(iocg->pcpu_stat);
+	kfree(iocg);
+
+	if (parent)
+		iocg_put(parent);
+}
+
 static void ioc_pd_init(struct blkg_policy_data *pd)
 {
 	struct ioc_gq *iocg = pd_to_iocg(pd);
@@ -2973,6 +3019,9 @@ static void ioc_pd_init(struct blkg_policy_data *pd)
 
 	iocg->level = blkg->blkcg->css.cgroup->level;
 
+	if (blkg->parent)
+		iocg_get(blkg_to_iocg(blkg->parent));
+
 	for (tblkg = blkg; tblkg; tblkg = tblkg->parent) {
 		struct ioc_gq *tiocg = blkg_to_iocg(tblkg);
 		iocg->ancestors[tiocg->level] = tiocg;
@@ -2985,30 +3034,7 @@ static void ioc_pd_init(struct blkg_policy_data *pd)
 
 static void ioc_pd_free(struct blkg_policy_data *pd)
 {
-	struct ioc_gq *iocg = pd_to_iocg(pd);
-	struct ioc *ioc = iocg->ioc;
-	unsigned long flags;
-
-	if (ioc) {
-		spin_lock_irqsave(&ioc->lock, flags);
-
-		if (!list_empty(&iocg->active_list)) {
-			struct ioc_now now;
-
-			ioc_now(ioc, &now);
-			propagate_weights(iocg, 0, 0, false, &now);
-			list_del_init(&iocg->active_list);
-		}
-
-		WARN_ON_ONCE(!list_empty(&iocg->walk_list));
-		WARN_ON_ONCE(!list_empty(&iocg->surplus_list));
-
-		spin_unlock_irqrestore(&ioc->lock, flags);
-
-		hrtimer_cancel(&iocg->waitq_timer);
-	}
-
-	free_percpu(iocg->pcpu_stat);
-	kfree(iocg);
+	iocg_put(pd_to_iocg(pd));
 }
 
 static void ioc_pd_stat(struct blkg_policy_data *pd, struct seq_file *s)