[-next,v2,9/9] blk-iocost: fix walk_list corruption

Message ID 20221130132156.2836184-10-linan122@huawei.com
State New
Series iocost bugfix

Commit Message

Li Nan Nov. 30, 2022, 1:21 p.m. UTC
  From: Yu Kuai <yukuai3@huawei.com>

Our test reported a problem:

------------[ cut here ]------------
list_del corruption. next->prev should be ffff888127e0c4b0, but was ffff888127e090b0
WARNING: CPU: 2 PID: 3117789 at lib/list_debug.c:62 __list_del_entry_valid+0x119/0x130
RIP: 0010:__list_del_entry_valid+0x119/0x130
RIP: 0010:__list_del_entry_valid+0x119/0x130
Call Trace:
 <IRQ>
 iocg_flush_stat.isra.0+0x11e/0x230
 ? ioc_rqos_done+0x230/0x230
 ? ioc_now+0x14f/0x180
 ioc_timer_fn+0x569/0x1640

We haven't reproduced it yet, but we think this happens because a parent
iocg is freed before its child iocg, and walk_list is then corrupted in
ioc_timer_fn.

1) Removing a child cgroup can run concurrently with removing its parent
cgroup, and ioc_pd_free for the parent iocg can be called before that of
the child iocg. This can be fixed by moving the handling of walk_list to
ioc_pd_offline, since offline of a child is guaranteed to be called
before offline of its parent.

2) ioc_pd_free can be triggered both by removing the device and by
removing the cgroup. This patch fixes the problem by deleting the timer
before deactivating the policy, so that freeing the parent iocg first
won't matter in this case.
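
The warning in the report above can be illustrated with a small userspace
sketch. It is not the kernel code: struct fake_iocg, list_del_checked() and
the demo functions are hypothetical names, and "freeing" the parent is only
simulated by re-initializing its list node while it is still linked, as if
the memory had been reused. The check itself follows the same idea as the
debug check named in the trace: both neighbours must point back at the
entry being removed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal doubly linked list, modelled loosely on the kernel's list_head. */
struct list_head { struct list_head *prev, *next; };

static void init_list_head(struct list_head *h) { h->prev = h; h->next = h; }

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Same idea as the kernel debug check: neighbours must point back at entry. */
static bool list_del_entry_valid(const struct list_head *entry)
{
	return entry->prev->next == entry && entry->next->prev == entry;
}

/* Unlink entry; returns false (and does nothing) if corruption is detected. */
static bool list_del_checked(struct list_head *entry)
{
	if (!list_del_entry_valid(entry))
		return false;
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	init_list_head(entry);
	return true;
}

/* Hypothetical stand-in for an iocg; only the list linkage matters here. */
struct fake_iocg { int id; struct list_head walk_list; };

/* Parent's memory is "reused" while still linked: returns true if the
 * later removal of the child trips the corruption check. */
static bool demo_parent_reused_while_linked(void)
{
	struct list_head walk;
	struct fake_iocg child = { .id = 1 }, parent = { .id = 2 };

	init_list_head(&walk);
	list_add_tail(&child.walk_list, &walk);
	list_add_tail(&parent.walk_list, &walk);

	/* Simulated free + reuse: pointers overwritten without unlinking. */
	init_list_head(&parent.walk_list);

	return !list_del_checked(&child.walk_list);
}

/* Orderly teardown: every node is unlinked before it goes away. */
static bool demo_orderly_teardown(void)
{
	struct list_head walk;
	struct fake_iocg child = { .id = 1 }, parent = { .id = 2 };

	init_list_head(&walk);
	list_add_tail(&child.walk_list, &walk);
	list_add_tail(&parent.walk_list, &walk);

	return list_del_checked(&parent.walk_list) &&
	       list_del_checked(&child.walk_list);
}
```

In the first demo the child's next pointer still refers to the parent node,
but the parent's prev pointer no longer refers to the child, which is the
shape of the "next->prev should be ..., but was ..." message.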

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Li Nan <linan122@huawei.com>
---
 block/blk-iocost.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
  

Comments

Tejun Heo Nov. 30, 2022, 8:59 p.m. UTC | #1
On Wed, Nov 30, 2022 at 09:21:56PM +0800, Li Nan wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> Our test reported a problem:
> 
> ------------[ cut here ]------------
> list_del corruption. next->prev should be ffff888127e0c4b0, but was ffff888127e090b0
> WARNING: CPU: 2 PID: 3117789 at lib/list_debug.c:62 __list_del_entry_valid+0x119/0x130
> RIP: 0010:__list_del_entry_valid+0x119/0x130
> RIP: 0010:__list_del_entry_valid+0x119/0x130
> Call Trace:
>  <IRQ>
>  iocg_flush_stat.isra.0+0x11e/0x230
>  ? ioc_rqos_done+0x230/0x230
>  ? ioc_now+0x14f/0x180
>  ioc_timer_fn+0x569/0x1640
> 
> We haven't reproduced it yet, but we think this happens because a parent
> iocg is freed before its child iocg, and walk_list is then corrupted in
> ioc_timer_fn.
> 
> 1) Removing a child cgroup can run concurrently with removing its parent
> cgroup, and ioc_pd_free for the parent iocg can be called before that of
> the child iocg. This can be fixed by moving the handling of walk_list to
> ioc_pd_offline, since offline of a child is guaranteed to be called
> before offline of its parent.

Which you already did in a previous patch, right?

> 2) ioc_pd_free can be triggered both by removing the device and by
> removing the cgroup. This patch fixes the problem by deleting the timer
> before deactivating the policy, so that freeing the parent iocg first
> won't matter in this case.

Okay, so, yeah, css's pin parents but blkg's don't. I think the right thing
to do here is making sure that a child blkg pins its parent (and eventually
ioc).

> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> Signed-off-by: Li Nan <linan122@huawei.com>
> ---
>  block/blk-iocost.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-iocost.c b/block/blk-iocost.c
> index 710cf63a1643..d2b873908f88 100644
> --- a/block/blk-iocost.c
> +++ b/block/blk-iocost.c
> @@ -2813,13 +2813,14 @@ static void ioc_rqos_exit(struct rq_qos *rqos)
>  {
>  	struct ioc *ioc = rqos_to_ioc(rqos);
>  
> +	del_timer_sync(&ioc->timer);
> +
>  	blkcg_deactivate_policy(rqos->q, &blkcg_policy_iocost);
>  
>  	spin_lock_irq(&ioc->lock);
>  	ioc->running = IOC_STOP;
>  	spin_unlock_irq(&ioc->lock);
>  
> -	del_timer_sync(&ioc->timer);

I don't know about this workaround. Let's fix it properly?
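
The pinning idea suggested in this reply (a child holding a reference on its
parent so the parent cannot be freed first) can be sketched in userspace C.
This is only an illustration under simplifying assumptions: struct node,
node_init(), node_put() and the freed flag are hypothetical and do not
correspond to the actual blkg or iocg reference-counting code, and there is
no locking or atomics.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical refcounted node; "freed" stands in for kfree() so the
 * sketch can be inspected after the fact. */
struct node {
	struct node *parent;
	int refcnt;
	bool freed;
};

static void node_init(struct node *n, struct node *parent)
{
	n->parent = parent;
	n->refcnt = 1;		/* the creator's reference */
	n->freed = false;
	if (parent)
		parent->refcnt++;	/* the child pins its parent */
}

/* Drop one reference; when a node dies it releases its pin on the parent. */
static void node_put(struct node *n)
{
	while (n) {
		assert(n->refcnt > 0);
		if (--n->refcnt > 0)
			return;
		struct node *parent = n->parent;
		n->freed = true;
		n = parent;
	}
}

/* Remove the parent first, then the child; returns true if the parent
 * stayed alive until the child was gone. */
static bool demo_parent_outlives_child(void)
{
	struct node parent, child;
	bool parent_alive_after_own_put;

	node_init(&parent, NULL);
	node_init(&child, &parent);

	node_put(&parent);	/* parent removed first */
	parent_alive_after_own_put = !parent.freed;

	node_put(&child);	/* child goes, then its pin on the parent */
	return parent_alive_after_own_put && child.freed && parent.freed;
}
```

With the pin in place, the order in which the two removals are issued no
longer decides the order in which the objects are actually released.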
  
Yu Kuai Dec. 1, 2022, 1:19 a.m. UTC | #2
Hi,

On 2022/12/01 4:59, Tejun Heo wrote:
> On Wed, Nov 30, 2022 at 09:21:56PM +0800, Li Nan wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> Our test reported a problem:
>>
>> ------------[ cut here ]------------
>> list_del corruption. next->prev should be ffff888127e0c4b0, but was ffff888127e090b0
>> WARNING: CPU: 2 PID: 3117789 at lib/list_debug.c:62 __list_del_entry_valid+0x119/0x130
>> RIP: 0010:__list_del_entry_valid+0x119/0x130
>> RIP: 0010:__list_del_entry_valid+0x119/0x130
>> Call Trace:
>>   <IRQ>
>>   iocg_flush_stat.isra.0+0x11e/0x230
>>   ? ioc_rqos_done+0x230/0x230
>>   ? ioc_now+0x14f/0x180
>>   ioc_timer_fn+0x569/0x1640
>>
>> We haven't reproduced it yet, but we think this happens because a parent
>> iocg is freed before its child iocg, and walk_list is then corrupted in
>> ioc_timer_fn.
>>
>> 1) Removing a child cgroup can run concurrently with removing its parent
>> cgroup, and ioc_pd_free for the parent iocg can be called before that of
>> the child iocg. This can be fixed by moving the handling of walk_list to
>> ioc_pd_offline, since offline of a child is guaranteed to be called
>> before offline of its parent.
> 
> Which you already did in a previous patch, right?

Yes, this was already done in patch 7.

> 
>> 2) ioc_pd_free can be triggered both by removing the device and by
>> removing the cgroup. This patch fixes the problem by deleting the timer
>> before deactivating the policy, so that freeing the parent iocg first
>> won't matter in this case.
> 
> Okay, so, yeah, css's pin parents but blkg's don't. I think the right thing
> to do here is making sure that a child blkg pins its parent (and eventually
> ioc).

Ok, I can try to do that.

> 
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> Signed-off-by: Li Nan <linan122@huawei.com>
>> ---
>>   block/blk-iocost.c | 3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/blk-iocost.c b/block/blk-iocost.c
>> index 710cf63a1643..d2b873908f88 100644
>> --- a/block/blk-iocost.c
>> +++ b/block/blk-iocost.c
>> @@ -2813,13 +2813,14 @@ static void ioc_rqos_exit(struct rq_qos *rqos)
>>   {
>>   	struct ioc *ioc = rqos_to_ioc(rqos);
>>   
>> +	del_timer_sync(&ioc->timer);
>> +
>>   	blkcg_deactivate_policy(rqos->q, &blkcg_policy_iocost);
>>   
>>   	spin_lock_irq(&ioc->lock);
>>   	ioc->running = IOC_STOP;
>>   	spin_unlock_irq(&ioc->lock);
>>   
>> -	del_timer_sync(&ioc->timer);
> 
> I don't know about this workaround. Let's fix it properly?

OK. By the way, is there any reason to delete the timer after
deactivating the policy? This seems a little weird to me.

Thanks,
Kuai
>
  
Tejun Heo Dec. 1, 2022, 10 a.m. UTC | #3
On Thu, Dec 01, 2022 at 09:19:54AM +0800, Yu Kuai wrote:
> > > diff --git a/block/blk-iocost.c b/block/blk-iocost.c
> > > index 710cf63a1643..d2b873908f88 100644
> > > --- a/block/blk-iocost.c
> > > +++ b/block/blk-iocost.c
> > > @@ -2813,13 +2813,14 @@ static void ioc_rqos_exit(struct rq_qos *rqos)
> > >   {
> > >   	struct ioc *ioc = rqos_to_ioc(rqos);
> > > +	del_timer_sync(&ioc->timer);
> > > +
> > >   	blkcg_deactivate_policy(rqos->q, &blkcg_policy_iocost);
> > >   	spin_lock_irq(&ioc->lock);
> > >   	ioc->running = IOC_STOP;
> > >   	spin_unlock_irq(&ioc->lock);
> > > -	del_timer_sync(&ioc->timer);
> > 
> > > I don't know about this workaround. Let's fix it properly?
> 
> > OK. By the way, is there any reason to delete the timer after
> > deactivating the policy? This seems a little weird to me.

ioc->running is what controls whether the timer gets rescheduled or not. If
we don't shut that down, the timer may well get rescheduled after being
deleted. Here, the only extra activation point is IO issue which shouldn't
trigger during rq_qos_exit, so the ordering shouldn't matter but this is the
right order for anything which can get restarted.
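
The ordering argument above can be modelled with a small single-threaded
sketch. It is an illustration only: struct fake_ioc, activate() and the
teardown helpers are hypothetical names, the timer is reduced to a "pending"
flag, and the race is modelled by placing one activation between the two
teardown steps. It does not model del_timer_sync() waiting for a running
handler.

```c
#include <assert.h>
#include <stdbool.h>

enum run_state { RUNNING, STOPPED };

/* Hypothetical stand-in for the ioc: a run state plus a timer-armed flag. */
struct fake_ioc {
	enum run_state running;
	bool timer_pending;
};

/* An activation point (e.g. IO issue): arms the timer only while running. */
static void activate(struct fake_ioc *ioc)
{
	if (ioc->running == RUNNING)
		ioc->timer_pending = true;
}

static void stop(struct fake_ioc *ioc) { ioc->running = STOPPED; }
static void del_timer(struct fake_ioc *ioc) { ioc->timer_pending = false; }

/* Delete first, stop later: an activation in between re-arms the timer.
 * Returns whether a timer is still pending after teardown. */
static bool teardown_delete_then_stop(void)
{
	struct fake_ioc ioc = { RUNNING, true };

	del_timer(&ioc);
	activate(&ioc);		/* racing activation */
	stop(&ioc);
	return ioc.timer_pending;
}

/* Stop first, delete later: the same activation can no longer re-arm. */
static bool teardown_stop_then_delete(void)
{
	struct fake_ioc ioc = { RUNNING, true };

	stop(&ioc);
	activate(&ioc);		/* racing activation, now a no-op */
	del_timer(&ioc);
	return ioc.timer_pending;
}
```

In this model only the stop-then-delete order leaves no timer pending, which
is why that order is the safe default for anything that can be restarted,
even if no activation is expected during rq_qos_exit.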

Thanks.
  
Yu Kuai Dec. 1, 2022, 10:14 a.m. UTC | #4
Hi,

On 2022/12/01 18:00, Tejun Heo wrote:
> On Thu, Dec 01, 2022 at 09:19:54AM +0800, Yu Kuai wrote:
>>>> diff --git a/block/blk-iocost.c b/block/blk-iocost.c
>>>> index 710cf63a1643..d2b873908f88 100644
>>>> --- a/block/blk-iocost.c
>>>> +++ b/block/blk-iocost.c
>>>> @@ -2813,13 +2813,14 @@ static void ioc_rqos_exit(struct rq_qos *rqos)
>>>>    {
>>>>    	struct ioc *ioc = rqos_to_ioc(rqos);
>>>> +	del_timer_sync(&ioc->timer);
>>>> +
>>>>    	blkcg_deactivate_policy(rqos->q, &blkcg_policy_iocost);
>>>>    	spin_lock_irq(&ioc->lock);
>>>>    	ioc->running = IOC_STOP;
>>>>    	spin_unlock_irq(&ioc->lock);
>>>> -	del_timer_sync(&ioc->timer);
>>>
>>> I don't know about this workaround. Let's fix it properly?
>>
>> OK. By the way, is there any reason to delete the timer after
>> deactivating the policy? This seems a little weird to me.
> 
> ioc->running is what controls whether the timer gets rescheduled or not. If
> we don't shut that down, the timer may well get rescheduled after being
> deleted. Here, the only extra activation point is IO issue which shouldn't
> trigger during rq_qos_exit, so the ordering shouldn't matter but this is the
> right order for anything which can get restarted.

Thanks for the explanation.

I'm trying to figure out how to make sure a child blkg pins its parent.
By the way, do you think the following cleanup makes sense?

diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index a645184aba4a..6ad8791af9d7 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -2810,13 +2810,13 @@ static void ioc_rqos_exit(struct rq_qos *rqos)
  {
         struct ioc *ioc = rqos_to_ioc(rqos);

-       blkcg_deactivate_policy(rqos->q, &blkcg_policy_iocost);
-
         spin_lock_irq(&ioc->lock);
         ioc->running = IOC_STOP;
         spin_unlock_irq(&ioc->lock);

         del_timer_sync(&ioc->timer);
+       blkcg_deactivate_policy(rqos->q, &blkcg_policy_iocost);
+
         free_percpu(ioc->pcpu_stat);
         kfree(ioc);
  }

Thanks,
Kuai
  
Tejun Heo Dec. 1, 2022, 10:29 a.m. UTC | #5
On Thu, Dec 01, 2022 at 06:14:32PM +0800, Yu Kuai wrote:
> Hi,
> 
> On 2022/12/01 18:00, Tejun Heo wrote:
> > On Thu, Dec 01, 2022 at 09:19:54AM +0800, Yu Kuai wrote:
> > > > > diff --git a/block/blk-iocost.c b/block/blk-iocost.c
> > > > > index 710cf63a1643..d2b873908f88 100644
> > > > > --- a/block/blk-iocost.c
> > > > > +++ b/block/blk-iocost.c
> > > > > @@ -2813,13 +2813,14 @@ static void ioc_rqos_exit(struct rq_qos *rqos)
> > > > >    {
> > > > >    	struct ioc *ioc = rqos_to_ioc(rqos);
> > > > > +	del_timer_sync(&ioc->timer);
> > > > > +
> > > > >    	blkcg_deactivate_policy(rqos->q, &blkcg_policy_iocost);
> > > > >    	spin_lock_irq(&ioc->lock);
> > > > >    	ioc->running = IOC_STOP;
> > > > >    	spin_unlock_irq(&ioc->lock);
> > > > > -	del_timer_sync(&ioc->timer);
> > > > 
> > > > I don't know about this workaround. Let's fix it properly?
> > > 
> > > OK. By the way, is there any reason to delete the timer after
> > > deactivating the policy? This seems a little weird to me.
> > 
> > ioc->running is what controls whether the timer gets rescheduled or not. If
> > we don't shut that down, the timer may well get rescheduled after being
> > deleted. Here, the only extra activation point is IO issue which shouldn't
> > trigger during rq_qos_exit, so the ordering shouldn't matter but this is the
> > right order for anything which can get restarted.
> 
> Thanks for the explanation.
> 
> I'm trying to figure out how to make sure a child blkg pins its parent.
> By the way, do you think the following cleanup makes sense?

It's on you to explain why any change that you're suggesting is better and
safe. I know it's not intentional but you're repeatedly suggesting operation
reorderings in code paths which are really sensitive to ordering at least
seemingly without putting much effort into thinking through the side
effects. This costs a disproportionate amount of review bandwidth, and
increases the chance of new subtle bugs. Can you please slow down a bit and
be more deliberate?

Thanks.
  
Yu Kuai Dec. 1, 2022, 1:43 p.m. UTC | #6
On 2022/12/01 18:29, Tejun Heo wrote:
> On Thu, Dec 01, 2022 at 06:14:32PM +0800, Yu Kuai wrote:
>> Hi,
>>
>> On 2022/12/01 18:00, Tejun Heo wrote:
>>> On Thu, Dec 01, 2022 at 09:19:54AM +0800, Yu Kuai wrote:
>>>>>> diff --git a/block/blk-iocost.c b/block/blk-iocost.c
>>>>>> index 710cf63a1643..d2b873908f88 100644
>>>>>> --- a/block/blk-iocost.c
>>>>>> +++ b/block/blk-iocost.c
>>>>>> @@ -2813,13 +2813,14 @@ static void ioc_rqos_exit(struct rq_qos *rqos)
>>>>>>     {
>>>>>>     	struct ioc *ioc = rqos_to_ioc(rqos);
>>>>>> +	del_timer_sync(&ioc->timer);
>>>>>> +
>>>>>>     	blkcg_deactivate_policy(rqos->q, &blkcg_policy_iocost);
>>>>>>     	spin_lock_irq(&ioc->lock);
>>>>>>     	ioc->running = IOC_STOP;
>>>>>>     	spin_unlock_irq(&ioc->lock);
>>>>>> -	del_timer_sync(&ioc->timer);
>>>>>
>>>>> I don't know about this workaround. Let's fix it properly?
>>>>
>>>> OK. By the way, is there any reason to delete the timer after
>>>> deactivating the policy? This seems a little weird to me.
>>>
>>> ioc->running is what controls whether the timer gets rescheduled or not. If
>>> we don't shut that down, the timer may well get rescheduled after being
>>> deleted. Here, the only extra activation point is IO issue which shouldn't
>>> trigger during rq_qos_exit, so the ordering shouldn't matter but this is the
>>> right order for anything which can get restarted.
>>
>> Thanks for the explanation.
>>
>> I'm trying to figure out how to make sure a child blkg pins its parent.
>> By the way, do you think the following cleanup makes sense?
> 
> It's on you to explain why any change that you're suggesting is better and
> safe. I know it's not intentional but you're repeatedly suggesting operation
> reorderings in code paths which are really sensitive to ordering at least
> seemingly without putting much effort into thinking through the side
> effects. This costs a disproportionate amount of review bandwidth, and
> increases the chance of new subtle bugs. Can you please slow down a bit and
> be more deliberate?

Thanks for the suggestion. I'll take care to explain why a change is
better and safe. And sorry for the review pressure. 😔

> 
> Thanks.
>
  
Yu Kuai Dec. 5, 2022, 9:35 a.m. UTC | #7
Hi, Tejun

On 2022/12/01 4:59, Tejun Heo wrote:
> On Wed, Nov 30, 2022 at 09:21:56PM +0800, Li Nan wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> Our test reported a problem:
>>
>> ------------[ cut here ]------------
>> list_del corruption. next->prev should be ffff888127e0c4b0, but was ffff888127e090b0
>> WARNING: CPU: 2 PID: 3117789 at lib/list_debug.c:62 __list_del_entry_valid+0x119/0x130
>> RIP: 0010:__list_del_entry_valid+0x119/0x130
>> RIP: 0010:__list_del_entry_valid+0x119/0x130
>> Call Trace:
>>   <IRQ>
>>   iocg_flush_stat.isra.0+0x11e/0x230
>>   ? ioc_rqos_done+0x230/0x230
>>   ? ioc_now+0x14f/0x180
>>   ioc_timer_fn+0x569/0x1640
>>
>> We haven't reproduced it yet, but we think this happens because a parent
>> iocg is freed before its child iocg, and walk_list is then corrupted in
>> ioc_timer_fn.
>>
>> 1) Removing a child cgroup can run concurrently with removing its parent
>> cgroup, and ioc_pd_free for the parent iocg can be called before that of
>> the child iocg. This can be fixed by moving the handling of walk_list to
>> ioc_pd_offline, since offline of a child is guaranteed to be called
>> before offline of its parent.
> 
> Which you already did in a previous patch, right?
> 
>> 2) ioc_pd_free can be triggered both by removing the device and by
>> removing the cgroup. This patch fixes the problem by deleting the timer
>> before deactivating the policy, so that freeing the parent iocg first
>> won't matter in this case.
> 
> Okay, so, yeah, css's pin parents but blkg's don't. I think the right thing
> to do here is making sure that a child blkg pins its parent (and eventually
> ioc).

Sorry about this. Actually, it can be ensured that pd_offline for a
child will be called before that of its parent. Hence just moving the
handling of walk_list to ioc_pd_offline can fix this problem thoroughly.

Thanks,
Kuai
  

Patch

diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index 710cf63a1643..d2b873908f88 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -2813,13 +2813,14 @@  static void ioc_rqos_exit(struct rq_qos *rqos)
 {
 	struct ioc *ioc = rqos_to_ioc(rqos);
 
+	del_timer_sync(&ioc->timer);
+
 	blkcg_deactivate_policy(rqos->q, &blkcg_policy_iocost);
 
 	spin_lock_irq(&ioc->lock);
 	ioc->running = IOC_STOP;
 	spin_unlock_irq(&ioc->lock);
 
-	del_timer_sync(&ioc->timer);
 	free_percpu(ioc->pcpu_stat);
 	kfree(ioc);
 }