[v3,08/16] nvmet-fc: untangle cross refcounting objects

Message ID 20231218153105.12717-9-dwagner@suse.de
State New
Series enable nvmet-fc for blktests

Commit Message

Daniel Wagner Dec. 18, 2023, 3:30 p.m. UTC
Associations take a refcount on queues, and queues take a refcount on
associations.

The existing code leads to the following situation: the target executes
a disconnect and the host immediately triggers a reconnect. The
reconnect command still finds the existing association and uses it, but
the reconnect crashes later on because nvmet_fc_delete_target_assoc()
blindly goes ahead and removes resources while the reconnect code still
wants to use them. The underlying problem is that
nvmet_fc_find_target_assoc() is able to look up an association which is
being removed.

So the first step, to address the lookup in nvmet_fc_find_target_queue(),
is to remove the association from the list, wait an RCU grace period,
and free the resources in the release callback of kref_put().

The lifetime of the queues is strictly bound to the lifetime of the
association. Thus we don't need to take reverse refcounts (queue ->
association).

Furthermore, streamline the cleanup code by using the workqueue to
delete the association in nvmet_fc_ls_disconnect(). This ensures that we
run through the same shutdown path in all non-error cases.
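
A condensed sketch of the resulting teardown ordering (simplified from
the patch below; locking and error handling omitted):

	/* invoked via assoc->del_work queued on nvmet_wq */
	static void nvmet_fc_delete_target_assoc(struct nvmet_fc_tgt_assoc *assoc)
	{
		int i;

		/* hide the queues from new lookups */
		for (i = NVMET_NR_QUEUES; i >= 0; i--)
			rcu_assign_pointer(assoc->queues[i], NULL);

		/* unlink so lookups can no longer find the association */
		list_del_rcu(&assoc->a_list);

		/* wait out concurrent RCU readers */
		synchronize_rcu();

		/* drain I/O that already entered the queues */
		for (i = NVMET_NR_QUEUES; i >= 0; i--)
			if (assoc->_queues[i])
				flush_workqueue(assoc->_queues[i]->work_q);
	}

The queues themselves are only freed later, from the kref_put() release
callback (nvmet_fc_target_assoc_free()), once the last reference to the
association is gone.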

Reproducer: nvme/003

Signed-off-by: Daniel Wagner <dwagner@suse.de>
---
 drivers/nvme/target/fc.c | 83 +++++++++++++++++++---------------------
 1 file changed, 40 insertions(+), 43 deletions(-)
  

Comments

Christoph Hellwig Dec. 19, 2023, 5:16 a.m. UTC | #1
On Mon, Dec 18, 2023 at 04:30:56PM +0100, Daniel Wagner wrote:
> The live time of the queues are strictly bound to the lifetime of an

> +	struct nvmet_fc_tgt_queue	*_queues[NVMET_NR_QUEUES + 1];
>  	struct nvmet_fc_tgt_queue __rcu	*queues[NVMET_NR_QUEUES + 1];

For magic prefixes we use __, not _ in Linux.  But having two arrays
of queues right next to each other, one with rcu annotation and one
not, rings a bit of a warning bell to me.  Why do we have both?  When
are we supposed to use either?  Why is FC different from the rest?

I really don't have any good answers as I don't know the code in the
FC transport very well, but I think this needs more work.
  
Hannes Reinecke Dec. 19, 2023, 7:59 a.m. UTC | #2
On 12/18/23 16:30, Daniel Wagner wrote:
> Associations take a refcount on queues, queues take a refcount on
> associations.
> 
> The existing code lead to the situation that the target executes a
> disconnect and the host triggers a reconnect immediately. The reconnect
> command still finds an existing association and uses this. Though the
> reconnect crashes later on because nvmet_fc_delete_target_assoc()
> blindly goes ahead and removes resources while the reconnect code wants
> to use it. The problem is that nvmet_fc_find_target_assoc() is able to
> lookup an association which is being removed.
> 
> So the first thing to address nvmet_fc_find_target_queue() is to remove
> the association out of the list and wait a RCU cycle and free resources
> in the free function callback of the kref_put().
> 
> The live time of the queues are strictly bound to the lifetime of an
> association. Thus we don't need to take reverse refcounts (queue ->
> association).
> 
> Furthermore, streamline the cleanup code by using the workqueue for
> delete the association in nvmet_fc_ls_disconnect. This ensures, that we
> run through the same shutdown path in all non error cases.
> 
> Reproducer: nvme/003
> 
> Signed-off-by: Daniel Wagner <dwagner@suse.de>
> ---
>   drivers/nvme/target/fc.c | 83 +++++++++++++++++++---------------------
>   1 file changed, 40 insertions(+), 43 deletions(-)
> 
> diff --git a/drivers/nvme/target/fc.c b/drivers/nvme/target/fc.c
> index 28e432e62361..db992df13c73 100644
> --- a/drivers/nvme/target/fc.c
> +++ b/drivers/nvme/target/fc.c
> @@ -166,6 +166,7 @@ struct nvmet_fc_tgt_assoc {
>   	struct nvmet_fc_hostport	*hostport;
>   	struct nvmet_fc_ls_iod		*rcv_disconn;
>   	struct list_head		a_list;
> +	struct nvmet_fc_tgt_queue	*_queues[NVMET_NR_QUEUES + 1];
>   	struct nvmet_fc_tgt_queue __rcu	*queues[NVMET_NR_QUEUES + 1];
>   	struct kref			ref;
>   	struct work_struct		del_work;
> @@ -803,14 +804,11 @@ nvmet_fc_alloc_target_queue(struct nvmet_fc_tgt_assoc *assoc,
>   	if (!queue)
>   		return NULL;
>   
> -	if (!nvmet_fc_tgt_a_get(assoc))
> -		goto out_free_queue;
> -
>   	queue->work_q = alloc_workqueue("ntfc%d.%d.%d", 0, 0,
>   				assoc->tgtport->fc_target_port.port_num,
>   				assoc->a_id, qid);
>   	if (!queue->work_q)
> -		goto out_a_put;
> +		goto out_free_queue;
>   
>   	queue->qid = qid;
>   	queue->sqsize = sqsize;
> @@ -831,7 +829,8 @@ nvmet_fc_alloc_target_queue(struct nvmet_fc_tgt_assoc *assoc,
>   	if (ret)
>   		goto out_fail_iodlist;
>   
> -	WARN_ON(assoc->queues[qid]);
> +	WARN_ON(assoc->_queues[qid]);
> +	assoc->_queues[qid] = queue;
>   	rcu_assign_pointer(assoc->queues[qid], queue);
>   
Is it really worth using an RCU pointer here?
In the end, creating and deleting queues for an association doesn't
happen that often, and involves some synchronization points anyway.
I.e. the lifetime of the queues is actually bounded by the lifetime of
the association itself, so if the association is valid the queues will
be valid, too.

With that reasoning, can't we drop the RCU annotation above and use the
array directly, delegating any synchronization to the association
itself?
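
IOW something like this (just a sketch):

	/* in struct nvmet_fc_tgt_assoc: plain pointers, no __rcu annotation */
	struct nvmet_fc_tgt_queue	*queues[NVMET_NR_QUEUES + 1];

and the lookup in nvmet_fc_find_target_queue() then simply becomes:

	queue = assoc->queues[qid];
	if (queue && nvmet_fc_tgt_q_get(queue))
		return queue;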

Cheers,

Hannes
  
Daniel Wagner Dec. 21, 2023, 9:15 a.m. UTC | #3
On Tue, Dec 19, 2023 at 06:16:48AM +0100, Christoph Hellwig wrote:
> On Mon, Dec 18, 2023 at 04:30:56PM +0100, Daniel Wagner wrote:
> > The live time of the queues are strictly bound to the lifetime of an
> 
> > +	struct nvmet_fc_tgt_queue	*_queues[NVMET_NR_QUEUES + 1];
> >  	struct nvmet_fc_tgt_queue __rcu	*queues[NVMET_NR_QUEUES + 1];
> 
> For magic prefixes we use __, not _ in Linux.  But having two arrays
> of queues right next to each other, once with rcu annotation and one
> not rings a bit far warning bell to me.  Why do we have both?  When
> are we supposed to use either?  Why is FC different from rest?

This is my attempt to solve the problem that after NULLing the RCU
pointer and waiting for an RCU grace period I still need to clean up
the queues. Thus I need to keep hold of the queue pointers a bit
longer. Indeed, not very elegant.

I'm sure there is a better way to do it; I just didn't figure it out
when I wrote this part. Any tips on how to solve this puzzle are highly
welcome.

> I really don't have any good answers as I don't know the code in the
> FC transport very well, but I think this needs more work.

Indeed, blktests is still able to run into a hang. So not all problems
are solved yet, but we are getting closer.
  
Daniel Wagner Jan. 26, 2024, 3:32 p.m. UTC | #4
On Thu, Dec 21, 2023 at 10:17:30AM +0100, Daniel Wagner wrote:
> On Tue, Dec 19, 2023 at 06:16:48AM +0100, Christoph Hellwig wrote:
> > On Mon, Dec 18, 2023 at 04:30:56PM +0100, Daniel Wagner wrote:
> > > The live time of the queues are strictly bound to the lifetime of an
> > 
> > > +	struct nvmet_fc_tgt_queue	*_queues[NVMET_NR_QUEUES + 1];
> > >  	struct nvmet_fc_tgt_queue __rcu	*queues[NVMET_NR_QUEUES + 1];
> > 
> > For magic prefixes we use __, not _ in Linux.  But having two arrays
> > of queues right next to each other, once with rcu annotation and one
> > not rings a bit far warning bell to me.  Why do we have both?  When
> > are we supposed to use either?  Why is FC different from rest?
> 
> This is my attempt to solve the problem that after NULLing the rcu
> pointer and wait for an RCU graceperiod I still need to cleanup the
> queues. Thus I need to keep hold on the queue pointers a bit longer.
> Indeed not so elegant.
> 
> I'm sure there is a better way to do it, I just didn't figure it out
> when I wrote this part. Any tips are highly welcomed how to solve this
> puzzle.

Looking at this code again, I don't think we need to use RCU for the
queue pointers at all. The association is already under RCU and thus
the queue pointers don't need additional RCU protection. We can either
look up the association or not, and the queue pointers' lifetime is
governed by the kref rules. With this in mind the lookup would be:

	rcu_read_lock();
	list_for_each_entry_rcu(assoc, &tgtport->assoc_list, a_list) {
		if (association_id == assoc->association_id) {
			queue = assoc->queues[qid];
			if (queue &&
			    (!atomic_read(&queue->connected) ||
			     !nvmet_fc_tgt_q_get(queue)))
				queue = NULL;
			rcu_read_unlock();
			return queue;
		}
	}
	rcu_read_unlock();
	return NULL;

No need for more complexity :)
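
And the counterpart on the teardown side (after the association has
been removed from the list and the RCU grace period has elapsed, i.e.
in the kref release path) would then iterate the plain array directly,
roughly:

	for (i = NVMET_NR_QUEUES; i >= 0; i--) {
		if (assoc->queues[i])
			nvmet_fc_delete_target_queue(assoc->queues[i]);
	}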
  

Patch

diff --git a/drivers/nvme/target/fc.c b/drivers/nvme/target/fc.c
index 28e432e62361..db992df13c73 100644
--- a/drivers/nvme/target/fc.c
+++ b/drivers/nvme/target/fc.c
@@ -166,6 +166,7 @@  struct nvmet_fc_tgt_assoc {
 	struct nvmet_fc_hostport	*hostport;
 	struct nvmet_fc_ls_iod		*rcv_disconn;
 	struct list_head		a_list;
+	struct nvmet_fc_tgt_queue	*_queues[NVMET_NR_QUEUES + 1];
 	struct nvmet_fc_tgt_queue __rcu	*queues[NVMET_NR_QUEUES + 1];
 	struct kref			ref;
 	struct work_struct		del_work;
@@ -803,14 +804,11 @@  nvmet_fc_alloc_target_queue(struct nvmet_fc_tgt_assoc *assoc,
 	if (!queue)
 		return NULL;
 
-	if (!nvmet_fc_tgt_a_get(assoc))
-		goto out_free_queue;
-
 	queue->work_q = alloc_workqueue("ntfc%d.%d.%d", 0, 0,
 				assoc->tgtport->fc_target_port.port_num,
 				assoc->a_id, qid);
 	if (!queue->work_q)
-		goto out_a_put;
+		goto out_free_queue;
 
 	queue->qid = qid;
 	queue->sqsize = sqsize;
@@ -831,7 +829,8 @@  nvmet_fc_alloc_target_queue(struct nvmet_fc_tgt_assoc *assoc,
 	if (ret)
 		goto out_fail_iodlist;
 
-	WARN_ON(assoc->queues[qid]);
+	WARN_ON(assoc->_queues[qid]);
+	assoc->_queues[qid] = queue;
 	rcu_assign_pointer(assoc->queues[qid], queue);
 
 	return queue;
@@ -839,8 +838,6 @@  nvmet_fc_alloc_target_queue(struct nvmet_fc_tgt_assoc *assoc,
 out_fail_iodlist:
 	nvmet_fc_destroy_fcp_iodlist(assoc->tgtport, queue);
 	destroy_workqueue(queue->work_q);
-out_a_put:
-	nvmet_fc_tgt_a_put(assoc);
 out_free_queue:
 	kfree(queue);
 	return NULL;
@@ -853,12 +850,8 @@  nvmet_fc_tgt_queue_free(struct kref *ref)
 	struct nvmet_fc_tgt_queue *queue =
 		container_of(ref, struct nvmet_fc_tgt_queue, ref);
 
-	rcu_assign_pointer(queue->assoc->queues[queue->qid], NULL);
-
 	nvmet_fc_destroy_fcp_iodlist(queue->assoc->tgtport, queue);
 
-	nvmet_fc_tgt_a_put(queue->assoc);
-
 	destroy_workqueue(queue->work_q);
 
 	kfree_rcu(queue, rcu);
@@ -1173,13 +1166,18 @@  nvmet_fc_target_assoc_free(struct kref *ref)
 	struct nvmet_fc_tgtport *tgtport = assoc->tgtport;
 	struct nvmet_fc_ls_iod	*oldls;
 	unsigned long flags;
+	int i;
+
+	for (i = NVMET_NR_QUEUES; i >= 0; i--) {
+		if (assoc->_queues[i])
+			nvmet_fc_delete_target_queue(assoc->_queues[i]);
+	}
 
 	/* Send Disconnect now that all i/o has completed */
 	nvmet_fc_xmt_disconnect_assoc(assoc);
 
 	nvmet_fc_free_hostport(assoc->hostport);
 	spin_lock_irqsave(&tgtport->lock, flags);
-	list_del_rcu(&assoc->a_list);
 	oldls = assoc->rcv_disconn;
 	spin_unlock_irqrestore(&tgtport->lock, flags);
 	/* if pending Rcv Disconnect Association LS, send rsp now */
@@ -1209,7 +1207,7 @@  static void
 nvmet_fc_delete_target_assoc(struct nvmet_fc_tgt_assoc *assoc)
 {
 	struct nvmet_fc_tgtport *tgtport = assoc->tgtport;
-	struct nvmet_fc_tgt_queue *queue;
+	unsigned long flags;
 	int i, terminating;
 
 	terminating = atomic_xchg(&assoc->terminating, 1);
@@ -1218,29 +1216,25 @@  nvmet_fc_delete_target_assoc(struct nvmet_fc_tgt_assoc *assoc)
 	if (terminating)
 		return;
 
+	/* prevent new I/Os entering the queues */
+	for (i = NVMET_NR_QUEUES; i >= 0; i--)
+		rcu_assign_pointer(assoc->queues[i], NULL);
 
-	for (i = NVMET_NR_QUEUES; i >= 0; i--) {
-		rcu_read_lock();
-		queue = rcu_dereference(assoc->queues[i]);
-		if (!queue) {
-			rcu_read_unlock();
-			continue;
-		}
+	spin_lock_irqsave(&tgtport->lock, flags);
+	list_del_rcu(&assoc->a_list);
+	spin_unlock_irqrestore(&tgtport->lock, flags);
 
-		if (!nvmet_fc_tgt_q_get(queue)) {
-			rcu_read_unlock();
-			continue;
-		}
-		rcu_read_unlock();
-		nvmet_fc_delete_target_queue(queue);
-		nvmet_fc_tgt_q_put(queue);
+	synchronize_rcu();
+
+	/* ensure all in-flight I/Os have been processed */
+	for (i = NVMET_NR_QUEUES; i >= 0; i--) {
+		if (assoc->_queues[i])
+			flush_workqueue(assoc->_queues[i]->work_q);
 	}
 
 	dev_info(tgtport->dev,
 		"{%d:%d} Association deleted\n",
 		tgtport->fc_target_port.port_num, assoc->a_id);
-
-	nvmet_fc_tgt_a_put(assoc);
 }
 
 static struct nvmet_fc_tgt_assoc *
@@ -1493,9 +1487,8 @@  __nvmet_fc_free_assocs(struct nvmet_fc_tgtport *tgtport)
 	list_for_each_entry_rcu(assoc, &tgtport->assoc_list, a_list) {
 		if (!nvmet_fc_tgt_a_get(assoc))
 			continue;
-		if (!queue_work(nvmet_wq, &assoc->del_work))
-			/* already deleting - release local reference */
-			nvmet_fc_tgt_a_put(assoc);
+		queue_work(nvmet_wq, &assoc->del_work);
+		nvmet_fc_tgt_a_put(assoc);
 	}
 	rcu_read_unlock();
 }
@@ -1548,9 +1541,8 @@  nvmet_fc_invalidate_host(struct nvmet_fc_target_port *target_port,
 			continue;
 		assoc->hostport->invalid = 1;
 		noassoc = false;
-		if (!queue_work(nvmet_wq, &assoc->del_work))
-			/* already deleting - release local reference */
-			nvmet_fc_tgt_a_put(assoc);
+		queue_work(nvmet_wq, &assoc->del_work);
+		nvmet_fc_tgt_a_put(assoc);
 	}
 	spin_unlock_irqrestore(&tgtport->lock, flags);
 
@@ -1594,9 +1586,8 @@  nvmet_fc_delete_ctrl(struct nvmet_ctrl *ctrl)
 		nvmet_fc_tgtport_put(tgtport);
 
 		if (found_ctrl) {
-			if (!queue_work(nvmet_wq, &assoc->del_work))
-				/* already deleting - release local reference */
-				nvmet_fc_tgt_a_put(assoc);
+			queue_work(nvmet_wq, &assoc->del_work);
+			nvmet_fc_tgt_a_put(assoc);
 			return;
 		}
 
@@ -1626,6 +1617,8 @@  nvmet_fc_unregister_targetport(struct nvmet_fc_target_port *target_port)
 	/* terminate any outstanding associations */
 	__nvmet_fc_free_assocs(tgtport);
 
+	flush_workqueue(nvmet_wq);
+
 	/*
 	 * should terminate LS's as well. However, LS's will be generated
 	 * at the tail end of association termination, so they likely don't
@@ -1871,9 +1864,6 @@  nvmet_fc_ls_disconnect(struct nvmet_fc_tgtport *tgtport,
 				sizeof(struct fcnvme_ls_disconnect_assoc_acc)),
 			FCNVME_LS_DISCONNECT_ASSOC);
 
-	/* release get taken in nvmet_fc_find_target_assoc */
-	nvmet_fc_tgt_a_put(assoc);
-
 	/*
 	 * The rules for LS response says the response cannot
 	 * go back until ABTS's have been sent for all outstanding
@@ -1888,8 +1878,6 @@  nvmet_fc_ls_disconnect(struct nvmet_fc_tgtport *tgtport,
 	assoc->rcv_disconn = iod;
 	spin_unlock_irqrestore(&tgtport->lock, flags);
 
-	nvmet_fc_delete_target_assoc(assoc);
-
 	if (oldls) {
 		dev_info(tgtport->dev,
 			"{%d:%d} Multiple Disconnect Association LS's "
@@ -1905,6 +1893,9 @@  nvmet_fc_ls_disconnect(struct nvmet_fc_tgtport *tgtport,
 		nvmet_fc_xmt_ls_rsp(tgtport, oldls);
 	}
 
+	queue_work(nvmet_wq, &assoc->del_work);
+	nvmet_fc_tgt_a_put(assoc);
+
 	return false;
 }
 
@@ -2903,6 +2894,9 @@  nvmet_fc_remove_port(struct nvmet_port *port)
 
 	nvmet_fc_portentry_unbind(pe);
 
+	/* terminate any outstanding associations */
+	__nvmet_fc_free_assocs(pe->tgtport);
+
 	kfree(pe);
 }
 
@@ -2934,6 +2928,9 @@  static int __init nvmet_fc_init_module(void)
 
 static void __exit nvmet_fc_exit_module(void)
 {
+	/* ensure any shutdown operation, e.g. delete ctrls have finished */
+	flush_workqueue(nvmet_wq);
+
 	/* sanity check - all lports should be removed */
 	if (!list_empty(&nvmet_fc_target_list))
 		pr_warn("%s: targetport list not empty\n", __func__);