[-next,v2,8/9] block: fix null-pointer dereference in ioc_pd_init

Message ID 20221130132156.2836184-9-linan122@huawei.com
State New
Headers
Series iocost bugfix |

Commit Message

Li Nan Nov. 30, 2022, 1:21 p.m. UTC
  Remove block device when iocost is initializing may cause
null-pointer dereference:

	CPU1				   CPU2
  ioc_qos_write
   blkcg_conf_open_bdev
    blkdev_get_no_open
     kobject_get_unless_zero
    blk_iocost_init
     rq_qos_add
  					del_gendisk
  					 rq_qos_exit
  					  q->rq_qos = rqos->next
  					   //iocost is removed from q->roqs
      blkcg_activate_policy
       pd_init_fn
        ioc_pd_init
  	 ioc = q_to_ioc(blkg->q)
 	  //cant find iocost and return null

Fix problem by moving rq_qos_exit() to disk_release(). ioc_qos_write() get
bd_device.kobj in blkcg_conf_open_bdev(), so disk_release will not be
actived until iocost initialization is complited.

Signed-off-by: Li Nan <linan122@huawei.com>
---
 block/genhd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
  

Comments

Tejun Heo Nov. 30, 2022, 8:50 p.m. UTC | #1
On Wed, Nov 30, 2022 at 09:21:55PM +0800, Li Nan wrote:
> Remove block device when iocost is initializing may cause
> null-pointer dereference:
> 
> 	CPU1				   CPU2
>   ioc_qos_write
>    blkcg_conf_open_bdev
>     blkdev_get_no_open
>      kobject_get_unless_zero
>     blk_iocost_init
>      rq_qos_add
>   					del_gendisk
>   					 rq_qos_exit
>   					  q->rq_qos = rqos->next
>   					   //iocost is removed from q->roqs
>       blkcg_activate_policy
>        pd_init_fn
>         ioc_pd_init
>   	 ioc = q_to_ioc(blkg->q)
>  	  //cant find iocost and return null
> 
> Fix problem by moving rq_qos_exit() to disk_release(). ioc_qos_write() get
> bd_device.kobj in blkcg_conf_open_bdev(), so disk_release will not be
> actived until iocost initialization is complited.

I think it'd be better to make del_gendisk wait for these in-flight
operations because this might fix the above particular issue but now all the
policies are exposed to request_queue in a state it never expected before.

del_gendisk() is quiescing the queue around rq_qos_exit(), so maybe we can
piggyback on that and update blkcg_conf_open_bdev() to provide such
exclusion?

Thanks.
  
Yu Kuai Dec. 1, 2022, 2:12 a.m. UTC | #2
Hi,

在 2022/12/01 4:50, Tejun Heo 写道:
> On Wed, Nov 30, 2022 at 09:21:55PM +0800, Li Nan wrote:
>> Remove block device when iocost is initializing may cause
>> null-pointer dereference:
>>
>> 	CPU1				   CPU2
>>    ioc_qos_write
>>     blkcg_conf_open_bdev
>>      blkdev_get_no_open
>>       kobject_get_unless_zero
>>      blk_iocost_init
>>       rq_qos_add
>>    					del_gendisk
>>    					 rq_qos_exit
>>    					  q->rq_qos = rqos->next
>>    					   //iocost is removed from q->roqs
>>        blkcg_activate_policy
>>         pd_init_fn
>>          ioc_pd_init
>>    	 ioc = q_to_ioc(blkg->q)
>>   	  //cant find iocost and return null
>>
>> Fix problem by moving rq_qos_exit() to disk_release(). ioc_qos_write() get
>> bd_device.kobj in blkcg_conf_open_bdev(), so disk_release will not be
>> actived until iocost initialization is complited.
> 
> I think it'd be better to make del_gendisk wait for these in-flight
> operations because this might fix the above particular issue but now all the
> policies are exposed to request_queue in a state it never expected before.
> 
> del_gendisk() is quiescing the queue around rq_qos_exit(), so maybe we can
> piggyback on that and update blkcg_conf_open_bdev() to provide such
> exclusion?

Let del_gendisk waiting for that sounds good, but I'm litter confused
how to do that. Following are what I think about:

1) By mentioning that "del_gendisk() is quiescing the queue", do you
suggest to add rcu_read_lock()? This seems wrong because blk_iocost_init
requires memory allocation.

2) Hold gendisk open mutex

3) Use q_usage_counter, and in the meantime, rq_qos_add() and
blkcg_activate_policy() will need refactoring to factor out freeze
queue.

4) Define a new metux

Thanks,
Kuai
> 
> Thanks.
>
  
Tejun Heo Dec. 1, 2022, 10:11 a.m. UTC | #3
On Thu, Dec 01, 2022 at 10:12:13AM +0800, Yu Kuai wrote:
> 1) By mentioning that "del_gendisk() is quiescing the queue", do you
> suggest to add rcu_read_lock()? This seems wrong because blk_iocost_init
> requires memory allocation.

Quiescing uses SRCU so that should be fine but I'm not sure whether this is
the right one to piggyback on. Jens should have a better idea.

Thanks.
  
Yu Kuai Dec. 1, 2022, 10:23 a.m. UTC | #4
Hi,

在 2022/12/01 18:11, Tejun Heo 写道:
> On Thu, Dec 01, 2022 at 10:12:13AM +0800, Yu Kuai wrote:
>> 1) By mentioning that "del_gendisk() is quiescing the queue", do you
>> suggest to add rcu_read_lock()? This seems wrong because blk_iocost_init
>> requires memory allocation.
> 
> Quiescing uses SRCU so that should be fine but I'm not sure whether this is
> the right one to piggyback on. Jens should have a better idea.
> 
> Thanks.
> 

Currently SRCU is used if BLK_MQ_F_BLOCKING set, otherwise RCU is used.

dispatch:
#define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops) \
do {                                                            \
         if ((q)->tag_set->flags & BLK_MQ_F_BLOCKING) {          \
                 int srcu_idx;                                   \
                                                                 \
                 might_sleep_if(check_sleep);                    \
                 srcu_idx = srcu_read_lock((q)->tag_set->srcu);  \
                 (dispatch_ops);                                 \
                 srcu_read_unlock((q)->tag_set->srcu, srcu_idx); \
         } else {                                                \
                 rcu_read_lock();                                \
                 (dispatch_ops);                                 \
                 rcu_read_unlock();                              \
         }                                                       \
} while (0)

quiesce:
void blk_mq_wait_quiesce_done(struct blk_mq_tag_set *set)
{
         if (set->flags & BLK_MQ_F_BLOCKING)
                 synchronize_srcu(set->srcu);
         else
                 synchronize_rcu();
}

Thanks,
Kuai
  
Tejun Heo Dec. 1, 2022, 10:31 a.m. UTC | #5
On Thu, Dec 01, 2022 at 06:23:16PM +0800, Yu Kuai wrote:
> Hi,
> 
> 在 2022/12/01 18:11, Tejun Heo 写道:
> > On Thu, Dec 01, 2022 at 10:12:13AM +0800, Yu Kuai wrote:
> > > 1) By mentioning that "del_gendisk() is quiescing the queue", do you
> > > suggest to add rcu_read_lock()? This seems wrong because blk_iocost_init
> > > requires memory allocation.
> > 
> > Quiescing uses SRCU so that should be fine but I'm not sure whether this is
> > the right one to piggyback on. Jens should have a better idea.
> > 
> > Thanks.
> > 
> 
> Currently SRCU is used if BLK_MQ_F_BLOCKING set, otherwise RCU is used.
> 
> dispatch:
> #define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops) \
> do {                                                            \
>         if ((q)->tag_set->flags & BLK_MQ_F_BLOCKING) {          \
>                 int srcu_idx;                                   \
>                                                                 \
>                 might_sleep_if(check_sleep);                    \
>                 srcu_idx = srcu_read_lock((q)->tag_set->srcu);  \
>                 (dispatch_ops);                                 \
>                 srcu_read_unlock((q)->tag_set->srcu, srcu_idx); \
>         } else {                                                \
>                 rcu_read_lock();                                \
>                 (dispatch_ops);                                 \
>                 rcu_read_unlock();                              \
>         }                                                       \
> } while (0)
> 
> quiesce:
> void blk_mq_wait_quiesce_done(struct blk_mq_tag_set *set)
> {
>         if (set->flags & BLK_MQ_F_BLOCKING)
>                 synchronize_srcu(set->srcu);
>         else
>                 synchronize_rcu();
> }

Oh I see. Unfortunately, I don't know what to do off the top of my head.

Thanks.
  
Yu Kuai Dec. 5, 2022, 9:32 a.m. UTC | #6
Hi, Tejun!

While reviewing rq_qos code, I found that there are some other possible
defects:

1) queue_lock is held to protect rq_qos_add() and rq_qos_del(), whlie
it's not held to protect rq_qos_exit(), which is absolutely not safe
because they can be called concurrently by configuring iocost and
removing device.
I'm thinking about holding the lock to fetch the list and reset
q->rq_qos first:

diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c
index 88f0fe7dcf54..271ad65eebd9 100644
--- a/block/blk-rq-qos.c
+++ b/block/blk-rq-qos.c
@@ -288,9 +288,15 @@ void rq_qos_wait(struct rq_wait *rqw, void 
*private_data,

  void rq_qos_exit(struct request_queue *q)
  {
-       while (q->rq_qos) {
-               struct rq_qos *rqos = q->rq_qos;
-               q->rq_qos = rqos->next;
+       struct rq_qos *rqos;
+
+       spin_lock_irq(&q->queue_lock);
+       rqos = q->rq_qos;
+       q->rq_qos = NULL;
+       spin_unlock_irq(&q->queue_lock);
+
+       while (rqos) {
                 rqos->ops->exit(rqos);
+               rqos = rqos->next;
         }
  }

2) rq_qos_add() can still succeed after rq_qos_exit() is done, which
will cause memory leak. Hence a checking is required beforing adding
to q->rq_qos. I'm thinking about flag QUEUE_FLAG_DYING first, but the
flag will not set if disk state is not marked GD_OWNS_QUEUE. Since
blk_unregister_queue() is called before rq_qos_exit(), use the queue
flag QUEUE_FLAG_REGISTERED should be OK.

For the current problem that device can be removed while initializing
, I'm thinking about some possible solutions:

Since bfq is initialized in elevator initialization, and others are
in queue initialization, such problem is only possible in iocost, hence
it make sense to fix it in iocost:

1) use open mutex to prevet concurrency, however, this will cause
that configuring iocost will block some other operations that is relied
on open_mutex.

@@ -2889,7 +2889,15 @@ static int blk_iocost_init(struct gendisk *disk)
         if (ret)
                 goto err_free_ioc;

+       mutex_lock(&disk->open_mutex);
+       if (!disk_live(disk)) {
+               mutex_unlock(&disk->open_mutex);
+               ret = -ENODEV;
+               goto err_del_qos;
+       }
+
         ret = blkcg_activate_policy(q, &blkcg_policy_iocost);
+       mutex_unlock(&disk->open_mutex);
         if (ret)
                 goto err_del_qos;

2) like 1), the difference is that define a new mutex just in iocst.

3) Or is it better to fix it in the higher level? For example:
add a new restriction that blkcg_deactivate_policy() should be called
with blkcg_activate_policy() in pairs, and blkcg_deactivate_policy()
will wait for blkcg_activate_policy() to finish. Something like:

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index ef4fef1af909..6266f702157f 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1410,7 +1410,7 @@ int blkcg_activate_policy(struct request_queue *q,
         struct blkcg_gq *blkg, *pinned_blkg = NULL;
         int ret;

-       if (blkcg_policy_enabled(q, pol))
+       if (WARN_ON_ONCE(blkcg_policy_enabled(q, pol)))
                 return 0;

         if (queue_is_mq(q))
@@ -1477,6 +1477,8 @@ int blkcg_activate_policy(struct request_queue *q,
                 blkg_put(pinned_blkg);
         if (pd_prealloc)
                 pol->pd_free_fn(pd_prealloc);
+       if (!ret)
+               wake_up(q->policy_waitq);
         return ret;

  enomem:
@@ -1512,7 +1514,7 @@ void blkcg_deactivate_policy(struct request_queue *q,
         struct blkcg_gq *blkg;

         if (!blkcg_policy_enabled(q, pol))
-               return;
+               wait_event(q->policy_waitq, blkcg_policy_enabled(q, pol));
    wait_event(q->xxx, blkcg_policy_enabled(q, pol));
  
Tejun Heo Dec. 12, 2022, 11:10 p.m. UTC | #7
Hello,

On Mon, Dec 05, 2022 at 05:32:17PM +0800, Yu Kuai wrote:
> 1) queue_lock is held to protect rq_qos_add() and rq_qos_del(), whlie
> it's not held to protect rq_qos_exit(), which is absolutely not safe
> because they can be called concurrently by configuring iocost and
> removing device.
> I'm thinking about holding the lock to fetch the list and reset
> q->rq_qos first:
> 
> diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c
> index 88f0fe7dcf54..271ad65eebd9 100644
> --- a/block/blk-rq-qos.c
> +++ b/block/blk-rq-qos.c
> @@ -288,9 +288,15 @@ void rq_qos_wait(struct rq_wait *rqw, void
> *private_data,
> 
>  void rq_qos_exit(struct request_queue *q)
>  {
> -       while (q->rq_qos) {
> -               struct rq_qos *rqos = q->rq_qos;
> -               q->rq_qos = rqos->next;
> +       struct rq_qos *rqos;
> +
> +       spin_lock_irq(&q->queue_lock);
> +       rqos = q->rq_qos;
> +       q->rq_qos = NULL;
> +       spin_unlock_irq(&q->queue_lock);
> +
> +       while (rqos) {
>                 rqos->ops->exit(rqos);
> +               rqos = rqos->next;
>         }
>  }
> 
> 2) rq_qos_add() can still succeed after rq_qos_exit() is done, which
> will cause memory leak. Hence a checking is required beforing adding
> to q->rq_qos. I'm thinking about flag QUEUE_FLAG_DYING first, but the
> flag will not set if disk state is not marked GD_OWNS_QUEUE. Since
> blk_unregister_queue() is called before rq_qos_exit(), use the queue
> flag QUEUE_FLAG_REGISTERED should be OK.
> 
> For the current problem that device can be removed while initializing
> , I'm thinking about some possible solutions:
> 
> Since bfq is initialized in elevator initialization, and others are
> in queue initialization, such problem is only possible in iocost, hence
> it make sense to fix it in iocost:

So, iolatency is likely to switch to similar lazy init scheme, so we better
fix it in the rq_qos / core block layer.

...
> 3) Or is it better to fix it in the higher level? For example:
> add a new restriction that blkcg_deactivate_policy() should be called
> with blkcg_activate_policy() in pairs, and blkcg_deactivate_policy()
> will wait for blkcg_activate_policy() to finish. Something like:
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index ef4fef1af909..6266f702157f 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -1410,7 +1410,7 @@ int blkcg_activate_policy(struct request_queue *q,
>         struct blkcg_gq *blkg, *pinned_blkg = NULL;
>         int ret;
> 
> -       if (blkcg_policy_enabled(q, pol))
> +       if (WARN_ON_ONCE(blkcg_policy_enabled(q, pol)))
>                 return 0;
> 
>         if (queue_is_mq(q))
> @@ -1477,6 +1477,8 @@ int blkcg_activate_policy(struct request_queue *q,
>                 blkg_put(pinned_blkg);
>         if (pd_prealloc)
>                 pol->pd_free_fn(pd_prealloc);
> +       if (!ret)
> +               wake_up(q->policy_waitq);
>         return ret;
> 
>  enomem:
> @@ -1512,7 +1514,7 @@ void blkcg_deactivate_policy(struct request_queue *q,
>         struct blkcg_gq *blkg;
> 
>         if (!blkcg_policy_enabled(q, pol))
> -               return;
> +               wait_event(q->policy_waitq, blkcg_policy_enabled(q, pol));
>    wait_event(q->xxx, blkcg_policy_enabled(q, pol));

Yeah, along this line but hopefully something simpler like a mutex.

Thanks.
  

Patch

diff --git a/block/genhd.c b/block/genhd.c
index dcf200bcbd3e..0db440bbfefb 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -656,7 +656,6 @@  void del_gendisk(struct gendisk *disk)
 		elevator_exit(q);
 		mutex_unlock(&q->sysfs_lock);
 	}
-	rq_qos_exit(q);
 	blk_mq_unquiesce_queue(q);
 
 	/*
@@ -1168,6 +1167,7 @@  static void disk_release(struct device *dev)
 	    !test_bit(GD_ADDED, &disk->state))
 		blk_mq_exit_queue(disk->queue);
 
+	rq_qos_exit(disk->queue);
 	blkcg_exit_disk(disk);
 
 	bioset_exit(&disk->bio_split);