[0/3] lib/percpu-refcount: fix use-after-free by late ->release

Message ID	20221214025101.1268437-1-ming.lei@redhat.com
Headers	Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; From: Ming Lei <ming.lei@redhat.com> To: Jens Axboe <axboe@kernel.dk>, Tejun Heo <tj@kernel.org> Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, Zhong Jinghua <zhongjinghua@huawei.com>, Yu Kuai <yukuai3@huawei.com>, Dennis Zhou <dennis@kernel.org>, Ming Lei <ming.lei@redhat.com> Subject: [PATCH 0/3] lib/percpu-refcount: fix use-after-free by late ->release Date: Wed, 14 Dec 2022 10:50:58 +0800 Message-Id: <20221214025101.1268437-1-ming.lei@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	lib/percpu-refcount: fix use-after-free by late ->release \| [0/3] lib/percpu-refcount: fix use-after-free by late ->release [1/3] lib/percpu-refcount: support to exit refcount automatically during releasing [2/3] lib/percpu-refcount: apply PERCPU_REF_AUTO_EXIT [3/3] lib/percpu-refcount: drain ->release() in perpcu_ref_exit()

Message

Ming Lei Dec. 14, 2022, 2:50 a.m. UTC

  Hi,

The wait_event(percpu_ref_is_zero()) pattern may cause percpu_ref_exit()
to be called before ->release() has finished, which can lead to a
use-after-free. Fix the issue by draining ->release() in
percpu_ref_exit().
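
To illustrate the window, here is a minimal single-threaded userspace
sketch. It is not kernel code: the toy_* names are hypothetical, and the
put is split into two explicit steps only to make the interleaving
visible.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a percpu_ref that is already in atomic mode. */
struct toy_ref {
	long count;
	bool release_done;
	void (*release)(struct toy_ref *ref);
};

void toy_release(struct toy_ref *ref)
{
	ref->release_done = true;
}

/* What a waiter polls, as in wait_event(percpu_ref_is_zero()). */
bool toy_ref_is_zero(const struct toy_ref *ref)
{
	return ref->count == 0;
}

/* Step 1 of a put: the decrement. Returns true if the count hit zero. */
bool toy_put_dec(struct toy_ref *ref)
{
	return --ref->count == 0;
}

/* Step 2 of a put: the callback, run only by whoever hit zero. */
void toy_put_release(struct toy_ref *ref)
{
	assert(ref->count == 0);
	ref->release(ref);
}

/* Returns true if a waiter can observe zero before ->release() has run. */
bool toy_race_window_open(void)
{
	struct toy_ref ref = {
		.count = 1, .release_done = false, .release = toy_release,
	};
	bool hit_zero = toy_put_dec(&ref);
	/* A waiter scheduled here already sees zero and may tear down. */
	bool waiter_sees_zero = toy_ref_is_zero(&ref);
	bool window = hit_zero && waiter_sees_zero && !ref.release_done;

	toy_put_release(&ref);
	return window;
}
```

In this model toy_race_window_open() returns true: the zero count is
visible while the callback is still pending.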


Ming Lei (3):
  lib/percpu-refcount: support to exit refcount automatically during
    releasing
  lib/percpu-refcount: apply PERCPU_REF_AUTO_EXIT
  lib/percpu-refcount: drain ->release() in perpcu_ref_exit()

 drivers/infiniband/ulp/rtrs/rtrs-srv.c |  4 +--
 include/linux/percpu-refcount.h        | 36 ++++++++++++++++++++++++--
 lib/percpu-refcount.c                  | 31 +++++++++++++++++++---
 mm/memcontrol.c                        |  5 ++--
 4 files changed, 66 insertions(+), 10 deletions(-)

Comments

Ming Lei Dec. 14, 2022, 1:30 p.m. UTC | #1

On Wed, Dec 14, 2022 at 04:16:51PM +0800, Hillf Danton wrote:
> On 14 Dec 2022 10:51:01 +0800 Ming Lei <ming.lei@redhat.com>
> > The pattern of wait_event(percpu_ref_is_zero()) has been used in several
> 
> For example?

blk_mq_freeze_queue_wait() and target_wait_for_sess_cmds().

> 
> > kernel components, and this way actually has the following risk:
> > 
> > - percpu_ref_is_zero() can return true in the window between
> >   atomic_long_sub_and_test() and ref->data->release(ref)
> > 
> > - given that the refcount is observed as zero, percpu_ref_exit() could
> >   be called, and the host data structure freed
> > 
> > - a use-after-free is then triggered in ->release(), because the host
> >   data structure was freed after percpu_ref_exit() returned
> 
> The race between exit and the release callback should be considered at the
> corresponding callsite, given the comment below, and closed for instance
> by synchronizing rcu.
> 
> /**
>  * percpu_ref_put_many - decrement a percpu refcount
>  * @ref: percpu_ref to put
>  * @nr: number of references to put
>  *
>  * Decrement the refcount, and if 0, call the release function (which was passed
>  * to percpu_ref_init())
>  *
>  * This function is safe to call as long as @ref is between init and exit.
>  */

Not sure if the above comment implies that the callsite should cover the
race.

But blk-mq can avoid the trouble by using the existing call_rcu():


diff --git a/block/blk-core.c b/block/blk-core.c
index 3866b6c4cd88..9321767470dc 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -254,14 +254,15 @@ EXPORT_SYMBOL_GPL(blk_clear_pm_only);
 
 static void blk_free_queue_rcu(struct rcu_head *rcu_head)
 {
-	kmem_cache_free(blk_requestq_cachep,
-			container_of(rcu_head, struct request_queue, rcu_head));
+	struct request_queue *q = container_of(rcu_head,
+			struct request_queue, rcu_head);
+
+	percpu_ref_exit(&q->q_usage_counter);
+	kmem_cache_free(blk_requestq_cachep, q);
 }
 
 static void blk_free_queue(struct request_queue *q)
 {
-	percpu_ref_exit(&q->q_usage_counter);
-
 	if (q->poll_stat)
 		blk_stat_remove_callback(q, q->poll_cb);
 	blk_stat_free_callback(q->poll_cb);


Thanks, 
Ming

Dennis Zhou Dec. 14, 2022, 4:07 p.m. UTC | #2

Hello,

On Wed, Dec 14, 2022 at 09:30:08PM +0800, Ming Lei wrote:
> On Wed, Dec 14, 2022 at 04:16:51PM +0800, Hillf Danton wrote:
> > On 14 Dec 2022 10:51:01 +0800 Ming Lei <ming.lei@redhat.com>
> > > The pattern of wait_event(percpu_ref_is_zero()) has been used in several
> > 
> > For example?
> 
> blk_mq_freeze_queue_wait() and target_wait_for_sess_cmds().
> 
> > 
> > > kernel components, and this way actually has the following risk:
> > > 
> > > > - percpu_ref_is_zero() can return true in the window between
> > > >   atomic_long_sub_and_test() and ref->data->release(ref)
> > > > 
> > > > - given that the refcount is observed as zero, percpu_ref_exit() could
> > > >   be called, and the host data structure freed
> > > > 
> > > > - a use-after-free is then triggered in ->release(), because the host
> > > >   data structure was freed after percpu_ref_exit() returned
> > 
> > The race between exit and the release callback should be considered at the
> > corresponding callsite, given the comment below, and closed for instance
> > by synchronizing rcu.
> > 
> > /**
> >  * percpu_ref_put_many - decrement a percpu refcount
> >  * @ref: percpu_ref to put
> >  * @nr: number of references to put
> >  *
> >  * Decrement the refcount, and if 0, call the release function (which was passed
> >  * to percpu_ref_init())
> >  *
> >  * This function is safe to call as long as @ref is between init and exit.
> >  */
> 
> Not sure if the above comment implies that the callsite should cover the
> race.
> 
> But blk-mq can avoid the trouble by using the existing call_rcu():
> 

I struggle with the dependency on release(). release() itself should not
block, but a common pattern would be to throw in a call_rcu() and
schedule additional work - see block/blk-cgroup.c, blkg_release().

I think the dependency really is on the completion of release() and the
work scheduled on its behalf rather than strictly on the start of the
release() callback. This series doesn't preclude that from happening.

/**
 * percpu_ref_exit - undo percpu_ref_init()
 * @ref: percpu_ref to exit
 *
 * This function exits @ref.  The caller is responsible for ensuring that
 * @ref is no longer in active use.  The usual places to invoke this
 * function from are the @ref->release() callback or in init failure path
 * where percpu_ref_init() succeeded but other parts of the initialization
 * of the embedding object failed.
 */

I think the percpu_ref_exit() comment explains the more common approach
to using percpu refcounts. release() triggering percpu_ref_exit() is
the ideal case.
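
As a rough userspace model of that ideal case (the demo_* names are
hypothetical and the counters array only stands in for the per-CPU
data), the release callback itself performs the exit and then frees the
embedding object, so nothing can touch the ref afterwards:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct demo_ref {
	long count;
	int *counters;	/* stand-in for the per-CPU counters */
	void (*release)(struct demo_ref *ref);
};

struct demo_host {
	struct demo_ref ref;
	int *exit_calls;	/* lets the caller observe the teardown */
};

int demo_ref_init(struct demo_ref *ref, void (*release)(struct demo_ref *))
{
	ref->counters = calloc(4, sizeof(*ref->counters));
	if (!ref->counters)
		return -1;
	ref->count = 1;	/* initial reference */
	ref->release = release;
	return 0;
}

void demo_ref_exit(struct demo_ref *ref)
{
	free(ref->counters);
	ref->counters = NULL;
}

void demo_ref_get(struct demo_ref *ref)
{
	ref->count++;
}

void demo_ref_put(struct demo_ref *ref)
{
	assert(ref->count > 0);
	if (--ref->count == 0)
		ref->release(ref);	/* ref must not be touched after this */
}

/* Release callback: exit the ref, then free the object embedding it. */
void demo_host_release(struct demo_ref *ref)
{
	struct demo_host *host = (struct demo_host *)
		((char *)ref - offsetof(struct demo_host, ref));
	int *exit_calls = host->exit_calls;

	demo_ref_exit(ref);
	free(host);
	(*exit_calls)++;
}

/* Init, one extra get, two puts. Returns how many teardowns happened. */
int demo_host_lifecycle(void)
{
	int exit_calls = 0;
	struct demo_host *host = malloc(sizeof(*host));

	if (!host)
		return -1;
	host->exit_calls = &exit_calls;
	if (demo_ref_init(&host->ref, demo_host_release)) {
		free(host);
		return -1;
	}
	demo_ref_get(&host->ref);
	demo_ref_put(&host->ref);	/* 2 -> 1, nothing released yet */
	if (exit_calls != 0)
		return -2;
	demo_ref_put(&host->ref);	/* 1 -> 0, release runs exit + free */
	return exit_calls;
}
```

Here the teardown happens exactly once, on the final put, and no caller
needs to wait for the count to reach zero.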

Thanks,
Dennis

> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 3866b6c4cd88..9321767470dc 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -254,14 +254,15 @@ EXPORT_SYMBOL_GPL(blk_clear_pm_only);
>  
>  static void blk_free_queue_rcu(struct rcu_head *rcu_head)
>  {
> -	kmem_cache_free(blk_requestq_cachep,
> -			container_of(rcu_head, struct request_queue, rcu_head));
> +	struct request_queue *q = container_of(rcu_head,
> +			struct request_queue, rcu_head);
> +
> +	percpu_ref_exit(&q->q_usage_counter);
> +	kmem_cache_free(blk_requestq_cachep, q);
>  }
>  
>  static void blk_free_queue(struct request_queue *q)
>  {
> -	percpu_ref_exit(&q->q_usage_counter);
> -
>  	if (q->poll_stat)
>  		blk_stat_remove_callback(q, q->poll_cb);
>  	blk_stat_free_callback(q->poll_cb);
> 
> 
> Thanks, 
> Ming
>

Ming Lei Dec. 15, 2022, 12:34 a.m. UTC | #3

On Wed, Dec 14, 2022 at 08:07:28AM -0800, Dennis Zhou wrote:
> Hello,
> 
> On Wed, Dec 14, 2022 at 09:30:08PM +0800, Ming Lei wrote:
> > On Wed, Dec 14, 2022 at 04:16:51PM +0800, Hillf Danton wrote:
> > > On 14 Dec 2022 10:51:01 +0800 Ming Lei <ming.lei@redhat.com>
> > > > The pattern of wait_event(percpu_ref_is_zero()) has been used in several
> > > 
> > > For example?
> > 
> > blk_mq_freeze_queue_wait() and target_wait_for_sess_cmds().
> > 
> > > 
> > > > kernel components, and this way actually has the following risk:
> > > > 
> > > > > - percpu_ref_is_zero() can return true in the window between
> > > > >   atomic_long_sub_and_test() and ref->data->release(ref)
> > > > > 
> > > > > - given that the refcount is observed as zero, percpu_ref_exit() could
> > > > >   be called, and the host data structure freed
> > > > > 
> > > > > - a use-after-free is then triggered in ->release(), because the host
> > > > >   data structure was freed after percpu_ref_exit() returned
> > > 
> > > The race between exit and the release callback should be considered at the
> > > corresponding callsite, given the comment below, and closed for instance
> > > by synchronizing rcu.
> > > 
> > > /**
> > >  * percpu_ref_put_many - decrement a percpu refcount
> > >  * @ref: percpu_ref to put
> > >  * @nr: number of references to put
> > >  *
> > >  * Decrement the refcount, and if 0, call the release function (which was passed
> > >  * to percpu_ref_init())
> > >  *
> > >  * This function is safe to call as long as @ref is between init and exit.
> > >  */
> > 
> > Not sure if the above comment implies that the callsite should cover the
> > race.
> > 
> > But blk-mq can avoid the trouble by using the existing call_rcu():
> > 
> 
> I struggle with the dependency on release(). release() itself should not
> block, but a common pattern would be to throw in a call_rcu() and

Yes, release() is called with the RCU read lock held, and I guess the
trouble may originate from the fact that release() may do nothing
related to actually releasing the data.

> schedule additional work - see block/blk-cgroup.c, blkg_release().

I believe the pattern is user specific, and the motivation for using
call_rcu() can't be just to avoid such a potential race between
release() and percpu_ref_exit().

> 
> I think the dependency really is on the completion of release() and the
> work scheduled on its behalf rather than strictly on the start of the
> release() callback. This series doesn't preclude that from happening.

Yeah.

For any additional work or similar scheduled in release(), only the
caller can guarantee that it is drained before percpu_ref_exit(), so
I now agree it is better for the caller to avoid the race.
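
A small userspace sketch of that ordering requirement follows. The
drain_* names are hypothetical, and a single pending flag stands in for
whatever release() schedules via call_rcu() or a workqueue:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct drain_ref {
	long count;
	bool exited;
	void (*release)(struct drain_ref *ref);
};

struct drain_host {
	struct drain_ref ref;
	bool work_pending;		/* deferred work queued by ->release() */
	bool work_saw_exited_ref;	/* set if work ran after the exit */
};

/* ->release() does no teardown itself, it only queues deferred work. */
void drain_release(struct drain_ref *ref)
{
	struct drain_host *host = (struct drain_host *)
		((char *)ref - offsetof(struct drain_host, ref));

	host->work_pending = true;
}

void drain_ref_put(struct drain_ref *ref)
{
	assert(ref->count > 0);
	if (--ref->count == 0)
		ref->release(ref);
}

void drain_ref_exit(struct drain_ref *ref)
{
	ref->exited = true;
}

/* Run the deferred work, if any, and record whether it ran too late. */
void drain_flush(struct drain_host *host)
{
	if (!host->work_pending)
		return;
	host->work_pending = false;
	if (host->ref.exited)
		host->work_saw_exited_ref = true;
}

/* Returns true if deferred work touched the ref after it was exited. */
bool drain_teardown(bool flush_before_exit)
{
	struct drain_host host = {
		.ref = { .count = 1, .exited = false, .release = drain_release },
	};

	drain_ref_put(&host.ref);	/* hits zero, ->release() queues work */
	if (flush_before_exit)
		drain_flush(&host);	/* the caller drains first */
	drain_ref_exit(&host.ref);
	drain_flush(&host);		/* late work, if still pending */
	return host.work_saw_exited_ref;
}
```

Only the caller knows what release() queued, which is why the flush has
to be ordered by the caller rather than inside the refcount code.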

> 
> /**
>  * percpu_ref_exit - undo percpu_ref_init()
>  * @ref: percpu_ref to exit
>  *
>  * This function exits @ref.  The caller is responsible for ensuring that
>  * @ref is no longer in active use.  The usual places to invoke this
>  * function from are the @ref->release() callback or in init failure path
>  * where percpu_ref_init() succeeded but other parts of the initialization
>  * of the embedding object failed.
>  */
> 
> I think the percpu_ref_exit() comment explains the more common approach
> to using percpu refcounts. release() triggering percpu_ref_exit() is
> the ideal case.

But most callers don't actually use it this way.


Thanks, 
Ming