[1/2] KVM: async kvm_destroy_vm for vfio devices

Message ID 20230109201037.33051-2-mjrosato@linux.ibm.com
State New
Headers
Series kvm/vfio: fix potential deadlock on vfio group lock |

Commit Message

Matthew Rosato Jan. 9, 2023, 8:10 p.m. UTC
  Currently it is possible that the final put of a KVM reference comes from
vfio during its device close operation.  This occurs while the vfio group
lock is held; however, if the vfio device is still in the kvm device list,
then the following call chain could result in a deadlock:

kvm_put_kvm
 -> kvm_destroy_vm
  -> kvm_destroy_devices
   -> kvm_vfio_destroy
    -> kvm_vfio_file_set_kvm
     -> vfio_file_set_kvm
      -> group->group_lock/group_rwsem

Avoid this scenario by adding kvm_put_kvm_async which will perform the
kvm_destroy_vm asynchronously if the refcount reaches 0.

Fixes: 421cfe6596f6 ("vfio: remove VFIO_GROUP_NOTIFY_SET_KVM")
Reported-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
---
 drivers/gpu/drm/i915/gvt/kvmgt.c  |  6 +++++-
 drivers/s390/crypto/vfio_ap_ops.c |  7 ++++++-
 include/linux/kvm_host.h          |  3 +++
 virt/kvm/kvm_main.c               | 22 ++++++++++++++++++++++
 4 files changed, 36 insertions(+), 2 deletions(-)
  

Comments

Jason Gunthorpe Jan. 9, 2023, 8:13 p.m. UTC | #1
On Mon, Jan 09, 2023 at 03:10:36PM -0500, Matthew Rosato wrote:
> Currently it is possible that the final put of a KVM reference comes from
> vfio during its device close operation.  This occurs while the vfio group
> lock is held; however, if the vfio device is still in the kvm device list,
> then the following call chain could result in a deadlock:
> 
> kvm_put_kvm
>  -> kvm_destroy_vm
>   -> kvm_destroy_devices
>    -> kvm_vfio_destroy
>     -> kvm_vfio_file_set_kvm
>      -> vfio_file_set_kvm
>       -> group->group_lock/group_rwsem
> 
> Avoid this scenario by adding kvm_put_kvm_async which will perform the
> kvm_destroy_vm asynchronously if the refcount reaches 0.
> 
> Fixes: 421cfe6596f6 ("vfio: remove VFIO_GROUP_NOTIFY_SET_KVM")
> Reported-by: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
> ---
>  drivers/gpu/drm/i915/gvt/kvmgt.c  |  6 +++++-
>  drivers/s390/crypto/vfio_ap_ops.c |  7 ++++++-
>  include/linux/kvm_host.h          |  3 +++
>  virt/kvm/kvm_main.c               | 22 ++++++++++++++++++++++
>  4 files changed, 36 insertions(+), 2 deletions(-)

Why two patches?

It looks OK to me

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason
  
Matthew Rosato Jan. 9, 2023, 8:24 p.m. UTC | #2
On 1/9/23 3:13 PM, Jason Gunthorpe wrote:
> On Mon, Jan 09, 2023 at 03:10:36PM -0500, Matthew Rosato wrote:
>> Currently it is possible that the final put of a KVM reference comes from
>> vfio during its device close operation.  This occurs while the vfio group
>> lock is held; however, if the vfio device is still in the kvm device list,
>> then the following call chain could result in a deadlock:
>>
>> kvm_put_kvm
>>  -> kvm_destroy_vm
>>   -> kvm_destroy_devices
>>    -> kvm_vfio_destroy
>>     -> kvm_vfio_file_set_kvm
>>      -> vfio_file_set_kvm
>>       -> group->group_lock/group_rwsem
>>
>> Avoid this scenario by adding kvm_put_kvm_async which will perform the
>> kvm_destroy_vm asynchronously if the refcount reaches 0.
>>
>> Fixes: 421cfe6596f6 ("vfio: remove VFIO_GROUP_NOTIFY_SET_KVM")
>> Reported-by: Alex Williamson <alex.williamson@redhat.com>
>> Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
>> ---
>>  drivers/gpu/drm/i915/gvt/kvmgt.c  |  6 +++++-
>>  drivers/s390/crypto/vfio_ap_ops.c |  7 ++++++-
>>  include/linux/kvm_host.h          |  3 +++
>>  virt/kvm/kvm_main.c               | 22 ++++++++++++++++++++++
>>  4 files changed, 36 insertions(+), 2 deletions(-)
> 
> Why two patches?
> 
> It looks OK to me

Mentioned in the cover, the fixes: tag is different on the 2nd patch as the s390 PCI passthrough kvm_puts were added later soemtime after 421cfe6596f6 ("vfio: remove VFIO_GROUP_NOTIFY_SET_KVM").  The blamed commit for those changes also landed in a different release (6.0 vs 5.19).

But, now that you mention it, neither is an LTS so it probably doesn't matter all that much and could be squashed if preferred. 

> 
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Thanks!

> 
> Jason
  
Anthony Krowiak Jan. 9, 2023, 9:07 p.m. UTC | #3
LGTM

Reviewed-by: Tony Krowiak <akrowiak@linux.ibm.com>

On 1/9/23 3:10 PM, Matthew Rosato wrote:
> Currently it is possible that the final put of a KVM reference comes from
> vfio during its device close operation.  This occurs while the vfio group
> lock is held; however, if the vfio device is still in the kvm device list,
> then the following call chain could result in a deadlock:
>
> kvm_put_kvm
>   -> kvm_destroy_vm
>    -> kvm_destroy_devices
>     -> kvm_vfio_destroy
>      -> kvm_vfio_file_set_kvm
>       -> vfio_file_set_kvm
>        -> group->group_lock/group_rwsem
>
> Avoid this scenario by adding kvm_put_kvm_async which will perform the
> kvm_destroy_vm asynchronously if the refcount reaches 0.
>
> Fixes: 421cfe6596f6 ("vfio: remove VFIO_GROUP_NOTIFY_SET_KVM")
> Reported-by: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
> ---
>   drivers/gpu/drm/i915/gvt/kvmgt.c  |  6 +++++-
>   drivers/s390/crypto/vfio_ap_ops.c |  7 ++++++-
>   include/linux/kvm_host.h          |  3 +++
>   virt/kvm/kvm_main.c               | 22 ++++++++++++++++++++++
>   4 files changed, 36 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c b/drivers/gpu/drm/i915/gvt/kvmgt.c
> index 8ae7039b3683..24511c877572 100644
> --- a/drivers/gpu/drm/i915/gvt/kvmgt.c
> +++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
> @@ -703,7 +703,11 @@ static void intel_vgpu_close_device(struct vfio_device *vfio_dev)
>   
>   	kvm_page_track_unregister_notifier(vgpu->vfio_device.kvm,
>   					   &vgpu->track_node);
> -	kvm_put_kvm(vgpu->vfio_device.kvm);
> +	/*
> +	 * Avoid possible deadlock on any currently-held vfio lock by
> +	 * ensuring the potential kvm_destroy_vm call is done asynchronously
> +	 */
> +	kvm_put_kvm_async(vgpu->vfio_device.kvm);
>   
>   	kvmgt_protect_table_destroy(vgpu);
>   	gvt_cache_destroy(vgpu);
> diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
> index e93bb9c468ce..a37b2baefb36 100644
> --- a/drivers/s390/crypto/vfio_ap_ops.c
> +++ b/drivers/s390/crypto/vfio_ap_ops.c
> @@ -1574,7 +1574,12 @@ static void vfio_ap_mdev_unset_kvm(struct ap_matrix_mdev *matrix_mdev)
>   
>   		kvm_arch_crypto_clear_masks(kvm);
>   		vfio_ap_mdev_reset_queues(&matrix_mdev->qtable);
> -		kvm_put_kvm(kvm);
> +		/*
> +		 * Avoid possible deadlock on any currently-held vfio lock by
> +		 * ensuring the potential kvm_destroy_vm call is done
> +		 * asynchronously
> +		 */
> +		kvm_put_kvm_async(kvm);
>   		matrix_mdev->kvm = NULL;
>   
>   		release_update_locks_for_kvm(kvm);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 4f26b244f6d0..2ef6a5102265 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -34,6 +34,7 @@
>   #include <linux/instrumentation.h>
>   #include <linux/interval_tree.h>
>   #include <linux/rbtree.h>
> +#include <linux/workqueue.h>
>   #include <linux/xarray.h>
>   #include <asm/signal.h>
>   
> @@ -793,6 +794,7 @@ struct kvm {
>   	struct kvm_stat_data **debugfs_stat_data;
>   	struct srcu_struct srcu;
>   	struct srcu_struct irq_srcu;
> +	struct work_struct async_work;
>   	pid_t userspace_pid;
>   	bool override_halt_poll_ns;
>   	unsigned int max_halt_poll_ns;
> @@ -963,6 +965,7 @@ void kvm_exit(void);
>   void kvm_get_kvm(struct kvm *kvm);
>   bool kvm_get_kvm_safe(struct kvm *kvm);
>   void kvm_put_kvm(struct kvm *kvm);
> +void kvm_put_kvm_async(struct kvm *kvm);
>   bool file_is_kvm(struct file *file);
>   void kvm_put_kvm_no_destroy(struct kvm *kvm);
>   
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 13e88297f999..fbe8d127028b 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1353,6 +1353,28 @@ void kvm_put_kvm(struct kvm *kvm)
>   }
>   EXPORT_SYMBOL_GPL(kvm_put_kvm);
>   
> +static void kvm_put_async_fn(struct work_struct *work)
> +{
> +	struct kvm *kvm = container_of(work, struct kvm,
> +				       async_work);
> +
> +	kvm_destroy_vm(kvm);
> +}
> +
> +/*
> + * Put a reference but only destroy the vm asynchronously.  Can be used in
> + * cases where the caller holds a mutex that could cause deadlock if
> + * kvm_destroy_vm is triggered
> + */
> +void kvm_put_kvm_async(struct kvm *kvm)
> +{
> +	if (refcount_dec_and_test(&kvm->users_count)) {
> +		INIT_WORK(&kvm->async_work, kvm_put_async_fn);
> +		schedule_work(&kvm->async_work);
> +	}
> +}
> +EXPORT_SYMBOL_GPL(kvm_put_kvm_async);
> +
>   /*
>    * Used to put a reference that was taken on behalf of an object associated
>    * with a user-visible file descriptor, e.g. a vcpu or device, if installation
  
Sean Christopherson Jan. 11, 2023, 7:54 p.m. UTC | #4
On Mon, Jan 09, 2023, Matthew Rosato wrote:
> Currently it is possible that the final put of a KVM reference comes from
> vfio during its device close operation.  This occurs while the vfio group
> lock is held; however, if the vfio device is still in the kvm device list,
> then the following call chain could result in a deadlock:
> 
> kvm_put_kvm
>  -> kvm_destroy_vm
>   -> kvm_destroy_devices
>    -> kvm_vfio_destroy
>     -> kvm_vfio_file_set_kvm
>      -> vfio_file_set_kvm
>       -> group->group_lock/group_rwsem
> 
> Avoid this scenario by adding kvm_put_kvm_async which will perform the
> kvm_destroy_vm asynchronously if the refcount reaches 0.

Something feels off.  If KVM's refcount is 0, then accessing device->group->kvm
in vfio_device_open() can't happen unless there's a refcounting bug somewhere.

E.g. if this snippet holds group_lock

		mutex_lock(&device->group->group_lock);
		device->kvm = device->group->kvm;

		if (device->ops->open_device) {
			ret = device->ops->open_device(device);
			if (ret)
				goto err_undo_count;
		}
		vfio_device_container_register(device);
		mutex_unlock(&device->group->group_lock);

and kvm_vfio_destroy() has already been invoked and is waiting on group_lock in
vfio_file_set_kvm(), then device->ops->open_device() is being called with a
non-NULL device->kvm that has kvm->users_count==0.  And intel_vgpu_open_device()
at least uses kvm_get_kvm(), i.e. assumes kvm->users_count > 0.

If there's no refcounting bug then kvm_vfio_destroy() doesn't need to call
kvm_vfio_file_set_kvm() since the only way there isn't a refcounting bug is if
group->kvm is unreachable.  This seems unlikely.

Assuming there is indeed a refcounting issue, one solution would be to harden all
->open_device() implementations to use kvm_get_kvm_safe().  I'd prefer not to have
to do that since it will complicate those implementations and also requires KVM
to support an async put.

Rather than force devices to get KVM references, why not handle that in common
VFIO code and drop KVM refcountin from devices?  Worst case scenario KVM is pinned
by a device that doesn't need KVM but is in a group associated with KVM.  If that's
a concern, it seems easy enough to add a flag to vfio_device_ops to enumerate
whether or not the device depends on KVM.

---
 drivers/vfio/vfio_main.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 6e8804fe0095..fb43212d77a0 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -772,6 +772,13 @@ static struct file *vfio_device_open(struct vfio_device *device)
 		 * reference and release it during close_device.
 		 */
 		mutex_lock(&device->group->group_lock);
+
+		if (device->group->kvm &&
+		    !kvm_get_kvm_safe(device->group->kvm->kvm)) {
+			ret = -ENODEV;
+			goto err_undo_count;
+		}
+
 		device->kvm = device->group->kvm;
 
 		if (device->ops->open_device) {
@@ -823,8 +830,10 @@ static struct file *vfio_device_open(struct vfio_device *device)
 err_undo_count:
 	mutex_unlock(&device->group->group_lock);
 	device->open_count--;
-	if (device->open_count == 0 && device->kvm)
+	if (device->open_count == 0 && device->kvm) {
+		kvm_put_kvm(device->kvm);
 		device->kvm = NULL;
+	}
 	mutex_unlock(&device->dev_set->lock);
 	module_put(device->dev->driver->owner);
 err_unassign_container:
@@ -1039,8 +1048,10 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
 	}
 	mutex_unlock(&device->group->group_lock);
 	device->open_count--;
-	if (device->open_count == 0)
+	if (device->open_count == 0 && device->kvm) {
+		kvm_put_kvm(device->kvm);
 		device->kvm = NULL;
+	}
 	mutex_unlock(&device->dev_set->lock);
 
 	module_put(device->dev->driver->owner);

base-commit: d52444c7a90fc551b4c3b0bda7d3f0b2ca9fc84d
--
  
Jason Gunthorpe Jan. 11, 2023, 8:05 p.m. UTC | #5
On Wed, Jan 11, 2023 at 07:54:51PM +0000, Sean Christopherson wrote:

> Something feels off.  If KVM's refcount is 0, then accessing device->group->kvm
> in vfio_device_open() can't happen unless there's a refcounting bug somewhere.

The problem is in close, not open.

Specifically it would be very hard to avoid holding the group_lock
during close which is when the put is done.

> Rather than force devices to get KVM references, why not handle that in common
> VFIO code and drop KVM refcountin from devices?  Worst case scenario KVM is pinned
> by a device that doesn't need KVM but is in a group associated with KVM.  If that's
> a concern, it seems easy enough to add a flag to vfio_device_ops to enumerate
> whether or not the device depends on KVM.

We can't make cross-dependencies between kvm and core VFIO - it is why
so much of this is soo ugly.

The few device drivers that unavoidably have KVM involvment already
have a KVM module dependency, so they can safely do the get/put

Jason
  
Sean Christopherson Jan. 11, 2023, 8:53 p.m. UTC | #6
On Wed, Jan 11, 2023, Jason Gunthorpe wrote:
> On Wed, Jan 11, 2023 at 07:54:51PM +0000, Sean Christopherson wrote:
> 
> > Something feels off.  If KVM's refcount is 0, then accessing device->group->kvm
> > in vfio_device_open() can't happen unless there's a refcounting bug somewhere.
> 
> The problem is in close, not open.

The deadlock problem is, yes.  My point is that if group_lock needs to be taken
when nullifying group->kvm during kvm_vfio_destroy(), then there is also a refcounting
prolem with respect to open().  If there is no refcounting problem, then nullifying
group->kvm during kvm_vfio_destroy() is unnecessary (but again, I doubt this is
the case).

The two things aren't directly related, but it seems possible to solve both while
making this all slightly less ugly.  Well, at least from KVM's perspective, whether
or not it'd be an improvement on the VFIO side is definitely debatable.

> Specifically it would be very hard to avoid holding the group_lock
> during close which is when the put is done.
> 
> > Rather than force devices to get KVM references, why not handle that in common
> > VFIO code and drop KVM refcountin from devices?  Worst case scenario KVM is pinned
> > by a device that doesn't need KVM but is in a group associated with KVM.  If that's
> > a concern, it seems easy enough to add a flag to vfio_device_ops to enumerate
> > whether or not the device depends on KVM.
> 
> We can't make cross-dependencies between kvm and core VFIO - it is why
> so much of this is soo ugly.

Ugh, right, modules for everyone.

> The few device drivers that unavoidably have KVM involvment already
> have a KVM module dependency, so they can safely do the get/put

Rather than store a "struct kvm *" in vfio_device, what about adding a new set
of optional ops to get/put KVM references?  Having dedicated KVM ops is gross,
but IMO it's less gross than backdooring the KVM pointer into open_device() by
stashing KVM into the device, e.g. it formalizes the VFIO API for devices that
depend on KVM instead of making devices pinky-swear to grab a reference during
open_device().

To further harden things, KVM could export only kvm_get_safe_kvm() if there are
no vendor modules.  I.e. make kvm_get_kvm() an internal-only helper when possible
and effectively force VFIO devices to use the safe variant.  That would work even
x86, as kvm_get_kvm() wouldn't be exported if neither KVM_AMD nor KVM_INTEL is
built as a module.

---
 drivers/vfio/vfio_main.c | 20 +++++++++++++-------
 include/linux/vfio.h     |  9 +++++++--
 2 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 6e8804fe0095..b3a84d65baa6 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -772,7 +772,12 @@ static struct file *vfio_device_open(struct vfio_device *device)
 		 * reference and release it during close_device.
 		 */
 		mutex_lock(&device->group->group_lock);
-		device->kvm = device->group->kvm;
+
+		if (device->kvm_ops && device->group->kvm) {
+			ret = device->kvm_ops->get_kvm(device->group->kvm);
+			if (ret)
+				goto err_undo_count;
+		}
 
 		if (device->ops->open_device) {
 			ret = device->ops->open_device(device);
@@ -823,8 +828,9 @@ static struct file *vfio_device_open(struct vfio_device *device)
 err_undo_count:
 	mutex_unlock(&device->group->group_lock);
 	device->open_count--;
-	if (device->open_count == 0 && device->kvm)
-		device->kvm = NULL;
+	if (device->open_count == 0 && device->kvm_ops)
+		device->kvm_ops->put_kvm();
+
 	mutex_unlock(&device->dev_set->lock);
 	module_put(device->dev->driver->owner);
 err_unassign_container:
@@ -1039,8 +1045,8 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
 	}
 	mutex_unlock(&device->group->group_lock);
 	device->open_count--;
-	if (device->open_count == 0)
-		device->kvm = NULL;
+	if (device->open_count == 0 && device->kvm_ops)
+		device->kvm_ops->put_kvm();
 	mutex_unlock(&device->dev_set->lock);
 
 	module_put(device->dev->driver->owner);
@@ -1656,8 +1662,8 @@ EXPORT_SYMBOL_GPL(vfio_file_enforced_coherent);
  * @file: VFIO group file
  * @kvm: KVM to link
  *
- * When a VFIO device is first opened the KVM will be available in
- * device->kvm if one was associated with the group.
+ * When a VFIO device is first opened, the device's kvm_ops->get_kvm() will be
+ * invoked with the KVM instance associated with the group (if applicable).
  */
 void vfio_file_set_kvm(struct file *file, struct kvm *kvm)
 {
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index fdd393f70b19..d6dcbe0546bf 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -18,6 +18,11 @@
 
 struct kvm;
 
+struct vfio_device_kvm_ops {
+	int (*get_kvm)(struct kvm *kvm);
+	void (*put_kvm)(void);
+};
+
 /*
  * VFIO devices can be placed in a set, this allows all devices to share this
  * structure and the VFIO core will provide a lock that is held around
@@ -43,8 +48,8 @@ struct vfio_device {
 	struct vfio_device_set *dev_set;
 	struct list_head dev_set_list;
 	unsigned int migration_flags;
-	/* Driver must reference the kvm during open_device or never touch it */
-	struct kvm *kvm;
+
+	const struct vfio_device_kvm_ops *kvm_ops;
 
 	/* Members below here are private, not for driver use */
 	unsigned int index;

base-commit: d52444c7a90fc551b4c3b0bda7d3f0b2ca9fc84d
--
  
Jason Gunthorpe Jan. 12, 2023, 12:45 p.m. UTC | #7
On Wed, Jan 11, 2023 at 08:53:34PM +0000, Sean Christopherson wrote:
> On Wed, Jan 11, 2023, Jason Gunthorpe wrote:
> > On Wed, Jan 11, 2023 at 07:54:51PM +0000, Sean Christopherson wrote:
> > 
> > > Something feels off.  If KVM's refcount is 0, then accessing device->group->kvm
> > > in vfio_device_open() can't happen unless there's a refcounting bug somewhere.
> > 
> > The problem is in close, not open.
> 
> The deadlock problem is, yes.  My point is that if group_lock needs to be taken
> when nullifying group->kvm during kvm_vfio_destroy(), then there is also a refcounting
> prolem with respect to open().  If there is no refcounting problem, then nullifying
> group->kvm during kvm_vfio_destroy() is unnecessary (but again, I doubt this is
> the case).

IIRC the drivers are supposed to use one of the refcount not zero
incrs to counteract this, but I never checked they do..

Yi is working on a patch to change things so vfio drops the kvm
pointer when the kvm file closes, not when the reference goes to 0
to avoid a refcount cycle problem which should also solve that.

> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index 6e8804fe0095..b3a84d65baa6 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -772,7 +772,12 @@ static struct file *vfio_device_open(struct vfio_device *device)
>  		 * reference and release it during close_device.
>  		 */
>  		mutex_lock(&device->group->group_lock);
> -		device->kvm = device->group->kvm;
> +
> +		if (device->kvm_ops && device->group->kvm) {
> +			ret = device->kvm_ops->get_kvm(device->group->kvm);

At this point I'd rather just use the symbol get stuff like kvm does
and call the proper functions.

Jason
  
Matthew Rosato Jan. 12, 2023, 5:21 p.m. UTC | #8
On 1/12/23 7:45 AM, Jason Gunthorpe wrote:
> On Wed, Jan 11, 2023 at 08:53:34PM +0000, Sean Christopherson wrote:
>> On Wed, Jan 11, 2023, Jason Gunthorpe wrote:
>>> On Wed, Jan 11, 2023 at 07:54:51PM +0000, Sean Christopherson wrote:
>>>
>>>> Something feels off.  If KVM's refcount is 0, then accessing device->group->kvm
>>>> in vfio_device_open() can't happen unless there's a refcounting bug somewhere.
>>>
>>> The problem is in close, not open.
>>
>> The deadlock problem is, yes.  My point is that if group_lock needs to be taken
>> when nullifying group->kvm during kvm_vfio_destroy(), then there is also a refcounting
>> prolem with respect to open().  If there is no refcounting problem, then nullifying
>> group->kvm during kvm_vfio_destroy() is unnecessary (but again, I doubt this is
>> the case).
> 
> IIRC the drivers are supposed to use one of the refcount not zero
> incrs to counteract this, but I never checked they do..
> 
> Yi is working on a patch to change things so vfio drops the kvm
> pointer when the kvm file closes, not when the reference goes to 0
> to avoid a refcount cycle problem which should also solve that.
> 
>> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
>> index 6e8804fe0095..b3a84d65baa6 100644
>> --- a/drivers/vfio/vfio_main.c
>> +++ b/drivers/vfio/vfio_main.c
>> @@ -772,7 +772,12 @@ static struct file *vfio_device_open(struct vfio_device *device)
>>  		 * reference and release it during close_device.
>>  		 */
>>  		mutex_lock(&device->group->group_lock);
>> -		device->kvm = device->group->kvm;
>> +
>> +		if (device->kvm_ops && device->group->kvm) {
>> +			ret = device->kvm_ops->get_kvm(device->group->kvm);
> 
> At this point I'd rather just use the symbol get stuff like kvm does
> and call the proper functions.
> 

So should I work up a v2 that does symbol gets for kvm_get_kvm_safe and kvm_put_kvm from vfio_main and drop kvm_put_kvm_async?  Or is the patch Yi is working on changing things such that will also address the deadlock issue?
 
If so, something like the following (where vfio_kvm_get symbol gets kvm_get_kvm_safe and vfio_kvm_put symbol gets kvm_put_kvm):

diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 5177bb061b17..a49bf1080f0a 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -361,16 +361,22 @@ static int vfio_device_first_open(struct vfio_device *device,
        if (ret)
                goto err_module_put;
 
+       if (kvm && !vfio_kvm_get(kvm)) {
+               ret = -ENOENT;
+               goto err_unuse_iommu;
+       }
        device->kvm = kvm;
        if (device->ops->open_device) {
                ret = device->ops->open_device(device);
                if (ret)
-                       goto err_unuse_iommu;
+                       goto err_put_kvm;
        }
        return 0;
 
-err_unuse_iommu:
+err_put_kvm:
+       vfio_put_kvm(kvm);
        device->kvm = NULL;
+err_unuse_iommu:
        if (iommufd)
                vfio_iommufd_unbind(device);
        else
@@ -465,6 +471,9 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
 
        vfio_device_group_close(device);
 
+       if (device->open_count == 0 && device->group->kvm)
+               vfio_kvm_put(device->group->kvm);
+
        vfio_device_put_registration(device);
 
        return 0;
  
Jason Gunthorpe Jan. 12, 2023, 5:27 p.m. UTC | #9
On Thu, Jan 12, 2023 at 12:21:17PM -0500, Matthew Rosato wrote:

> So should I work up a v2 that does symbol gets for kvm_get_kvm_safe
> and kvm_put_kvm from vfio_main and drop kvm_put_kvm_async?  Or is
> the patch Yi is working on changing things such that will also
> address the deadlock issue?

I don't think Yi's part will help

> +361,22 @@ static int vfio_device_first_open(struct vfio_device
> *device, if (ret) goto err_module_put;
>  
> +       if (kvm && !vfio_kvm_get(kvm)) {

Do call it kvm_get_safe though

> +               ret = -ENOENT;
> +               goto err_unuse_iommu;
> +       }
>         device->kvm = kvm;
>         if (device->ops->open_device) {
>                 ret = device->ops->open_device(device);
>                 if (ret)
> -                       goto err_unuse_iommu;
> +                       goto err_put_kvm;
>         }
>         return 0;
>  
> -err_unuse_iommu:
> +err_put_kvm:
> +       vfio_put_kvm(kvm);
>         device->kvm = NULL;
> +err_unuse_iommu:
>         if (iommufd)
>                 vfio_iommufd_unbind(device);
>         else
> @@ -465,6 +471,9 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
>  
>         vfio_device_group_close(device);
>  
> +       if (device->open_count == 0 && device->group->kvm)
> +               vfio_kvm_put(device->group->kvm);
> +

No, you can't touch group->kvm without holding the group lock,
that is the whole point of the problem..

This has to be device->kvm

Jason
  

Patch

diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c b/drivers/gpu/drm/i915/gvt/kvmgt.c
index 8ae7039b3683..24511c877572 100644
--- a/drivers/gpu/drm/i915/gvt/kvmgt.c
+++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
@@ -703,7 +703,11 @@  static void intel_vgpu_close_device(struct vfio_device *vfio_dev)
 
 	kvm_page_track_unregister_notifier(vgpu->vfio_device.kvm,
 					   &vgpu->track_node);
-	kvm_put_kvm(vgpu->vfio_device.kvm);
+	/*
+	 * Avoid possible deadlock on any currently-held vfio lock by
+	 * ensuring the potential kvm_destroy_vm call is done asynchronously
+	 */
+	kvm_put_kvm_async(vgpu->vfio_device.kvm);
 
 	kvmgt_protect_table_destroy(vgpu);
 	gvt_cache_destroy(vgpu);
diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index e93bb9c468ce..a37b2baefb36 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -1574,7 +1574,12 @@  static void vfio_ap_mdev_unset_kvm(struct ap_matrix_mdev *matrix_mdev)
 
 		kvm_arch_crypto_clear_masks(kvm);
 		vfio_ap_mdev_reset_queues(&matrix_mdev->qtable);
-		kvm_put_kvm(kvm);
+		/*
+		 * Avoid possible deadlock on any currently-held vfio lock by
+		 * ensuring the potential kvm_destroy_vm call is done
+		 * asynchronously
+		 */
+		kvm_put_kvm_async(kvm);
 		matrix_mdev->kvm = NULL;
 
 		release_update_locks_for_kvm(kvm);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4f26b244f6d0..2ef6a5102265 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -34,6 +34,7 @@ 
 #include <linux/instrumentation.h>
 #include <linux/interval_tree.h>
 #include <linux/rbtree.h>
+#include <linux/workqueue.h>
 #include <linux/xarray.h>
 #include <asm/signal.h>
 
@@ -793,6 +794,7 @@  struct kvm {
 	struct kvm_stat_data **debugfs_stat_data;
 	struct srcu_struct srcu;
 	struct srcu_struct irq_srcu;
+	struct work_struct async_work;
 	pid_t userspace_pid;
 	bool override_halt_poll_ns;
 	unsigned int max_halt_poll_ns;
@@ -963,6 +965,7 @@  void kvm_exit(void);
 void kvm_get_kvm(struct kvm *kvm);
 bool kvm_get_kvm_safe(struct kvm *kvm);
 void kvm_put_kvm(struct kvm *kvm);
+void kvm_put_kvm_async(struct kvm *kvm);
 bool file_is_kvm(struct file *file);
 void kvm_put_kvm_no_destroy(struct kvm *kvm);
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 13e88297f999..fbe8d127028b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1353,6 +1353,28 @@  void kvm_put_kvm(struct kvm *kvm)
 }
 EXPORT_SYMBOL_GPL(kvm_put_kvm);
 
+static void kvm_put_async_fn(struct work_struct *work)
+{
+	struct kvm *kvm = container_of(work, struct kvm,
+				       async_work);
+
+	kvm_destroy_vm(kvm);
+}
+
+/*
+ * Put a reference but only destroy the vm asynchronously.  Can be used in
+ * cases where the caller holds a mutex that could cause deadlock if
+ * kvm_destroy_vm is triggered
+ */
+void kvm_put_kvm_async(struct kvm *kvm)
+{
+	if (refcount_dec_and_test(&kvm->users_count)) {
+		INIT_WORK(&kvm->async_work, kvm_put_async_fn);
+		schedule_work(&kvm->async_work);
+	}
+}
+EXPORT_SYMBOL_GPL(kvm_put_kvm_async);
+
 /*
  * Used to put a reference that was taken on behalf of an object associated
  * with a user-visible file descriptor, e.g. a vcpu or device, if installation