Message ID: 20221217030908.1261787-1-yukuai1@huaweicloud.com
Headers:
From: Yu Kuai <yukuai1@huaweicloud.com>
To: tj@kernel.org, hch@infradead.org, josef@toxicpanda.com, axboe@kernel.dk
Cc: cgroups@vger.kernel.org, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, yukuai3@huawei.com, yukuai1@huaweicloud.com, yi.zhang@huawei.com
Subject: [PATCH -next 0/4] blk-cgroup: synchronize del_gendisk() with configuring cgroup policy
Date: Sat, 17 Dec 2022 11:09:04 +0800
Message-Id: <20221217030908.1261787-1-yukuai1@huaweicloud.com>
Series: blk-cgroup: synchronize del_gendisk() with configuring cgroup policy
Message
Yu Kuai
Dec. 17, 2022, 3:09 a.m. UTC
From: Yu Kuai <yukuai3@huawei.com>
iocost is initialized the first time it is configured, and that
initialization can race with del_gendisk(), causing a null pointer
dereference:
t1                              t2
ioc_qos_write
 blk_iocost_init
  rq_qos_add
                                del_gendisk
                                 rq_qos_exit
                                 // iocost is removed from q->rq_qos
  blkcg_activate_policy
   pd_init_fn
    ioc_pd_init
     ioc = q_to_ioc(blkg->q)
     // can't find iocost and return null
And iolatency is about to switch to the same lazy initialization.

This patchset fixes the problem by synchronizing rq_qos_add() and
blkcg_activate_policy() with rq_qos_exit().
Yu Kuai (4):
block/rq_qos: protect 'q->rq_qos' with queue_lock in rq_qos_exit()
block/rq_qos: fail rq_qos_add() after del_gendisk()
blk-cgroup: add a new interface blkcg_conf_close_bdev()
blk-cgroup: synchronize del_gendisk() with configuring cgroup policy
block/blk-cgroup.c | 12 ++++++++++--
block/blk-cgroup.h | 1 +
block/blk-iocost.c | 8 ++++----
block/blk-rq-qos.c | 25 ++++++++++++++++++++-----
block/blk-rq-qos.h | 17 +++++++++++++----
include/linux/blkdev.h | 1 +
6 files changed, 49 insertions(+), 15 deletions(-)
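The approach can be sketched as a minimal user-space model (a pthread mutex stands in for queue_lock; the structs are simplified stand-ins, not the real block-layer types): rq_qos_exit() marks the queue dying under the lock, so a concurrent rq_qos_add() fails instead of re-adding a policy to a dead queue.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Toy model of q->rq_qos: a singly linked list guarded by a lock. */
struct rq_qos { struct rq_qos *next; };

struct queue {
    pthread_mutex_t lock;   /* stands in for q->queue_lock */
    struct rq_qos *rq_qos;  /* head of the policy list */
    int dying;              /* set by del_gendisk()/rq_qos_exit() */
};

/* Fails once the disk is gone, closing the race from the cover letter. */
static int rq_qos_add(struct queue *q, struct rq_qos *rqos)
{
    int ret = -1;

    pthread_mutex_lock(&q->lock);
    if (!q->dying) {
        rqos->next = q->rq_qos;
        q->rq_qos = rqos;
        ret = 0;
    }
    pthread_mutex_unlock(&q->lock);
    return ret;
}

static void rq_qos_exit(struct queue *q)
{
    pthread_mutex_lock(&q->lock);
    q->dying = 1;       /* any future add must fail */
    q->rq_qos = NULL;   /* tear down the list */
    pthread_mutex_unlock(&q->lock);
}
```

In the real series the check-and-insert and the teardown both run under queue_lock, which is what makes the "removed from q->rq_qos, then ioc_pd_init() finds nothing" window disappear.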
Comments
Hello,

On Sat, Dec 17, 2022 at 11:09:04AM +0800, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> iocost is initialized when it's configured the first time, and iocost
> initializing can race with del_gendisk(), which will cause null pointer
> dereference:
>
> t1                              t2
> ioc_qos_write
>  blk_iocost_init
>   rq_qos_add
>                                 del_gendisk
>                                  rq_qos_exit
>                                  // iocost is removed from q->rq_qos
>   blkcg_activate_policy
>    pd_init_fn
>     ioc_pd_init
>      ioc = q_to_ioc(blkg->q)
>      // can't find iocost and return null
>
> And iolatency is about to switch to the same lazy initialization.
>
> This patchset fix this problem by synchronize rq_qos_add() and
> blkcg_activate_policy() with rq_qos_exit().

So, the patchset seems a bit overly complicated to me. What do you think
about the following?

* These init/exit paths are super cold paths; just protecting them with a
  global mutex is probably enough. If we encounter a scalability problem,
  it's easy to fix down the line.

* If we're synchronizing this with a mutex anyway, no need to grab the
  queue_lock, right? rq_qos_add/del/exit() can all just hold the mutex.

* And we can keep the state tracking within rq_qos. When rq_qos_exit() is
  called, mark it so that future adds will fail - be that a special ->next
  value, a queue flag, or whatever.

Thanks.
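The three points above could be modeled roughly like this (a hedged user-space sketch, not kernel code: a pthread mutex plays the proposed global mutex, and a hypothetical RQ_QOS_DEAD sentinel plays the "special ->next value" — stored in the list head here for simplicity):

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct rq_qos { struct rq_qos *next; };

/* Sentinel stored in the list head after rq_qos_exit(); any non-NULL,
 * never-dereferenced value works. */
#define RQ_QOS_DEAD ((struct rq_qos *)-1)

/* Init/exit are cold paths, so one global mutex is enough; no queue_lock
 * needed at all in this scheme. */
static pthread_mutex_t rq_qos_mutex = PTHREAD_MUTEX_INITIALIZER;

struct queue { struct rq_qos *rq_qos; };

static int rq_qos_add(struct queue *q, struct rq_qos *rqos)
{
    int ret = -1;

    pthread_mutex_lock(&rq_qos_mutex);
    if (q->rq_qos != RQ_QOS_DEAD) {   /* queue still alive */
        rqos->next = q->rq_qos;
        q->rq_qos = rqos;
        ret = 0;
    }
    pthread_mutex_unlock(&rq_qos_mutex);
    return ret;
}

static void rq_qos_exit(struct queue *q)
{
    pthread_mutex_lock(&rq_qos_mutex);
    q->rq_qos = RQ_QOS_DEAD;          /* all future adds fail */
    pthread_mutex_unlock(&rq_qos_mutex);
}
```

The state lives entirely inside the rq_qos list itself, so no extra queue flag is required — which is the simplification being argued for.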
Hi,

On 2022/12/20 4:55, Tejun Heo wrote:
> Hello,
>
> On Sat, Dec 17, 2022 at 11:09:04AM +0800, Yu Kuai wrote:
>> [...]
>>
>> This patchset fix this problem by synchronize rq_qos_add() and
>> blkcg_activate_policy() with rq_qos_exit().
>
> So, the patchset seems a bit overly complicated to me. What do you think
> about the following?
>
> * These init/exit paths are super cold paths; just protecting them with a
>   global mutex is probably enough. If we encounter a scalability problem,
>   it's easy to fix down the line.
>
> * If we're synchronizing this with a mutex anyway, no need to grab the
>   queue_lock, right? rq_qos_add/del/exit() can all just hold the mutex.
>
> * And we can keep the state tracking within rq_qos. When rq_qos_exit() is
>   called, mark it so that future adds will fail - be that a special ->next
>   value, a queue flag, or whatever.

Yes, that sounds good. BTW, queue_lock is also used to protect
pd_alloc_fn/pd_init_fn, and we found that blkcg_activate_policy() is
problematic:

blkcg_activate_policy
 spin_lock_irq(&q->queue_lock);
 list_for_each_entry_reverse(blkg, &q->blkg_list)
  pd_alloc_fn(GFP_NOWAIT | __GFP_NOWARN, ...) -> failed

  spin_unlock_irq(&q->queue_lock);
  // Releasing queue_lock here is problematic: it can cause
  // pd_offline_fn to be called without pd_init_fn having run.
  pd_alloc_fn(__GFP_NOWARN, ...)

If we are using a mutex to protect the rq_qos ops, it seems the right
thing to also use the mutex to protect the blkcg_policy ops, and this
problem can be fixed because the mutex can be held while allocating
memory with GFP_KERNEL. What do you think?

Thanks,
Kuai
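The GFP_KERNEL point can be illustrated with a small model (hypothetical names throughout; malloc stands in for pd_alloc_fn): with a sleeping-capable lock held, the whole blkg list is allocated in one critical section, so the GFP_NOWAIT-fails-then-drop-the-lock window never opens.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct blkg { struct blkg *next; void *pd; };

/* A mutex instead of q->queue_lock (a spinlock): the allocation may
 * sleep, so the whole blkg list can be walked and populated in a single
 * critical section -- there is no drop/re-lock window during which a
 * blkg could be torn down half-initialized. */
static pthread_mutex_t blkcg_mutex = PTHREAD_MUTEX_INITIALIZER;

static int blkcg_activate_policy(struct blkg *head)
{
    struct blkg *blkg;

    pthread_mutex_lock(&blkcg_mutex);
    for (blkg = head; blkg; blkg = blkg->next) {
        /* stands in for pd_alloc_fn(GFP_KERNEL, ...) */
        blkg->pd = malloc(16);
        if (!blkg->pd) {
            pthread_mutex_unlock(&blkcg_mutex);
            return -1;
        }
    }
    pthread_mutex_unlock(&blkcg_mutex);
    return 0;
}
```

Contrast with the quoted trace: under a spinlock the code must first try GFP_NOWAIT, and on failure drop the lock to retry with a sleeping allocation, which is exactly where pd_offline_fn can sneak in ahead of pd_init_fn.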
Hello,

On Tue, Dec 20, 2022 at 05:19:12PM +0800, Yu Kuai wrote:
> Yes, that sounds good. BTW, queue_lock is also used to protect
> pd_alloc_fn/pd_init_fn, and we found that blkcg_activate_policy() is
> problematic:
>
> blkcg_activate_policy
>  spin_lock_irq(&q->queue_lock);
>  list_for_each_entry_reverse(blkg, &q->blkg_list)
>   pd_alloc_fn(GFP_NOWAIT | __GFP_NOWARN, ...) -> failed
>
>   spin_unlock_irq(&q->queue_lock);
>   // Releasing queue_lock here is problematic: it can cause
>   // pd_offline_fn to be called without pd_init_fn having run.
>   pd_alloc_fn(__GFP_NOWARN, ...)

So, the problem is if a blkg is destroyed while a policy is being
activated, right?

> If we are using a mutex to protect the rq_qos ops, it seems the right
> thing to also use the mutex to protect the blkcg_policy ops, and this
> problem can be fixed because the mutex can be held while allocating
> memory with GFP_KERNEL. What do you think?

One worry is that switching to a mutex can be more of a headache due to
destroy path synchronization. Another approach would be using a per-blkg
flag to track whether a blkg has been initialized.

Thanks.
Hi,

On 2022/12/21 0:01, Tejun Heo wrote:
> Hello,
>
> On Tue, Dec 20, 2022 at 05:19:12PM +0800, Yu Kuai wrote:
>> [...]
>
> So, the problem is if a blkg is destroyed while a policy is being
> activated, right?

Yes, removing a cgroup can race with this; for bfq, a null pointer
dereference will be triggered in bfq_pd_offline().

> One worry is that switching to a mutex can be more of a headache due to
> destroy path synchronization. Another approach would be using a per-blkg
> flag to track whether a blkg has been initialized.

I think perhaps you mean a per-blkg_policy_data flag? A per-blkg flag
would not work in this case.

Thanks,
Kuai
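The per-blkg_policy_data flag being converged on could look like this in miniature (a user-space sketch with hypothetical field names): pd_offline() becomes a no-op unless pd_init() actually ran, which covers exactly the bfq_pd_offline() crash scenario.

```c
#include <assert.h>

/* Toy pd with the proposed per-blkg_policy_data tracking bit. */
struct blkg_policy_data {
    int online;   /* set by pd_init, checked before pd_offline */
    int offlined; /* records whether pd_offline's body actually ran */
};

static void pd_init(struct blkg_policy_data *pd)
{
    pd->online = 1;
}

static void pd_offline(struct blkg_policy_data *pd)
{
    /* Without this check, offline could run on a pd that was allocated
     * by blkcg_activate_policy() but never initialized -- the window
     * opened by dropping queue_lock for the sleeping allocation. */
    if (!pd->online)
        return;
    pd->online = 0;
    pd->offlined = 1;
}
```

A flag on the blkg itself would not help here because activation can fail after some policies' pds are initialized and others are not — the state is per policy per blkg, hence per blkg_policy_data.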
Hi,

On 2022/12/21 9:10, Yu Kuai wrote:
>> So, the problem is if a blkg is destroyed while a policy is being
>> activated, right?
>
> Yes, removing a cgroup can race with this; for bfq, a null pointer
> dereference will be triggered in bfq_pd_offline().

BTW, we just found that pd_online_fn() is missed in
blkcg_activate_policy()... Currently this is not a problem because only
blk-throttle implements it, and blk-throttle is activated while creating
the blkg.

Thanks,
Kuai
On Tue, Dec 20, 2022 at 05:19:12PM +0800, Yu Kuai wrote:
> If we are using a mutex to protect the rq_qos ops, it seems the right
> thing to also use the mutex to protect the blkcg_policy ops, and this
> problem can be fixed because the mutex can be held while allocating
> memory with GFP_KERNEL. What do you think?

Getting rid of the atomic allocations would be awesome. FYI, I'm also in
favor of everything that moves things out of queue_lock into more
dedicated locks. queue_lock is such an undocumented mess of untargeted
things that aren't related to each other right now that splitting and
removing it is becoming more and more important.