Message ID | 20230120150528.471752-1-yi.l.liu@intel.com |
---|---|
State | New |
Headers |
From: Yi Liu <yi.l.liu@intel.com>
To: alex.williamson@redhat.com, pbonzini@redhat.com, mjrosato@linux.ibm.com
Cc: yi.l.liu@intel.com, jgg@nvidia.com, kevin.tian@intel.com, cohuck@redhat.com, farman@linux.ibm.com, pmorel@linux.ibm.com, borntraeger@linux.ibm.com, frankja@linux.ibm.com, imbrenda@linux.ibm.com, david@redhat.com, akrowiak@linux.ibm.com, jjherne@linux.ibm.com, pasic@linux.ibm.com, zhenyuw@linux.intel.com, zhi.a.wang@intel.com, seanjc@google.com, linux-s390@vger.kernel.org, kvm@vger.kernel.org, intel-gvt-dev@lists.freedesktop.org, intel-gfx@lists.freedesktop.org, linux-kernel@vger.kernel.org
Subject: [PATCH] kvm/vfio: Fix potential deadlock on vfio group_lock
Date: Fri, 20 Jan 2023 07:05:28 -0800
Message-Id: <20230120150528.471752-1-yi.l.liu@intel.com>
X-Mailer: git-send-email 2.34.1 |
Series |
kvm/vfio: Fix potential deadlock on vfio group_lock
Commit Message
Yi Liu
Jan. 20, 2023, 3:05 p.m. UTC
Currently it is possible that the final put of a KVM reference comes from vfio during its device close operation. This occurs while the vfio group lock is held; however, if the vfio device is still in the kvm device list, then the following call chain could result in a deadlock:

VFIO holds group->group_lock/group_rwsem
 -> kvm_put_kvm
  -> kvm_destroy_vm
   -> kvm_destroy_devices
    -> kvm_vfio_destroy
     -> kvm_vfio_file_set_kvm
      -> vfio_file_set_kvm
       -> try to hold group->group_lock/group_rwsem

The key function is kvm_destroy_devices(), which triggers the destroy callback of kvm_device_ops. That callback calls back into vfio and tries to take group_lock, so the deadlock is avoided if this path no longer calls back into vfio. KVM provides another point at which to free the kvm-vfio device: when the device file descriptor is closed. This can be achieved by providing the release callback instead of the destroy callback. Also rename kvm_vfio_destroy() to kvm_vfio_release().

	/*
	 * Destroy is responsible for freeing dev.
	 *
	 * Destroy may be called before or after destructors are called
	 * on emulated I/O regions, depending on whether a reference is
	 * held by a vcpu or other kvm component that gets destroyed
	 * after the emulated I/O.
	 */
	void (*destroy)(struct kvm_device *dev);

	/*
	 * Release is an alternative method to free the device. It is
	 * called when the device file descriptor is closed. Once
	 * release is called, the destroy method will not be called
	 * anymore as the device is removed from the device list of
	 * the VM. kvm->lock is held.
	 */
	void (*release)(struct kvm_device *dev);

Fixes: 421cfe6596f6 ("vfio: remove VFIO_GROUP_NOTIFY_SET_KVM")
Reported-by: Alex Williamson <alex.williamson@redhat.com>
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 virt/kvm/vfio.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
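The hazard in the call chain above is re-entrant: the callback triggered by the final kvm_put_kvm() needs group_lock while the caller already holds it, and a non-recursive mutex taken twice by the same thread blocks forever. A minimal user-space sketch of that shape (hypothetical names only; this is not the kernel code, and trylock is used so the hazard can be observed without actually hanging):

```c
#include <errno.h>
#include <pthread.h>

static pthread_mutex_t group_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stands in for vfio_file_set_kvm(), which must take group_lock. */
static int set_kvm_cb(void)
{
	int rc = pthread_mutex_trylock(&group_lock);

	if (rc == 0)
		pthread_mutex_unlock(&group_lock);
	return rc;	/* 0 = lock was free; EBUSY = would have deadlocked */
}

/*
 * hold_lock=1 models the old .destroy path: the callback runs while the
 * VFIO close path still holds group_lock. hold_lock=0 models the new
 * .release path: it runs from device-fd close, after group_lock has
 * been dropped.
 */
static int simulate(int hold_lock)
{
	int rc;

	if (hold_lock)
		pthread_mutex_lock(&group_lock);
	rc = set_kvm_cb();
	if (hold_lock)
		pthread_mutex_unlock(&group_lock);
	return rc;
}
```

With a default (non-recursive) mutex, simulate(1) reports EBUSY where the real destroy-time path would self-deadlock, while simulate(0) succeeds, which is the essence of moving the teardown from the destroy callback to the release callback.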
Comments
> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Friday, January 20, 2023 11:05 PM
>
> Currently it is possible that the final put of a KVM reference comes from
> vfio during its device close operation. [...]
>
> Fixes: 421cfe6596f6 ("vfio: remove VFIO_GROUP_NOTIFY_SET_KVM")
> Reported-by: Alex Williamson <alex.williamson@redhat.com>
> Suggested-by: Kevin Tian <kevin.tian@intel.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>

More background can be found in Matthew's work.
https://lore.kernel.org/kvm/20230114000351.115444-1-mjrosato@linux.ibm.com/T/#u

Regards,
Yi Liu
On 1/20/23 10:08 AM, Liu, Yi L wrote:
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Sent: Friday, January 20, 2023 11:05 PM
>>
>> Currently it is possible that the final put of a KVM reference comes from
>> vfio during its device close operation. [...]
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>
> More background can be found in Matthew's work.
> https://lore.kernel.org/kvm/20230114000351.115444-1-mjrosato@linux.ibm.com/T/#u

Thanks Yi.

Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>

One small nit: there is a comment at the very end of kvm_vfio_release on the kfree(dev) that still references .destroy; this should be updated to .release.
On Fri, 20 Jan 2023 10:45:40 -0500, Matthew Rosato <mjrosato@linux.ibm.com> wrote:
> On 1/20/23 10:08 AM, Liu, Yi L wrote:
> >> Currently it is possible that the final put of a KVM reference comes from
> >> vfio during its device close operation. [...]
>
> Thanks Yi.
>
> Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
>
> One small nit: There is a comment at the very end of
> kvm_vfio_release on the kfree(dev) that still references .destroy,
> this should be updated to .release

I've fixed this locally, s/destroy/release/ in that comment. Thanks,

Alex
On Fri, 20 Jan 2023 07:05:28 -0800, Yi Liu <yi.l.liu@intel.com> wrote:
> Currently it is possible that the final put of a KVM reference comes from
> vfio during its device close operation. [...]
>
> Fixes: 421cfe6596f6 ("vfio: remove VFIO_GROUP_NOTIFY_SET_KVM")
> Reported-by: Alex Williamson <alex.williamson@redhat.com>
> Suggested-by: Kevin Tian <kevin.tian@intel.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  virt/kvm/vfio.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> index 495ceabffe88..e94f3ea718e5 100644
> --- a/virt/kvm/vfio.c
> +++ b/virt/kvm/vfio.c
> @@ -336,7 +336,7 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
>  	return -ENXIO;
>  }
>  
> -static void kvm_vfio_destroy(struct kvm_device *dev)
> +static void kvm_vfio_release(struct kvm_device *dev)
>  {
>  	struct kvm_vfio *kv = dev->private;
>  	struct kvm_vfio_group *kvg, *tmp;
> @@ -363,7 +363,7 @@ static int kvm_vfio_create(struct kvm_device *dev, u32 type);
>  static struct kvm_device_ops kvm_vfio_ops = {
>  	.name = "kvm-vfio",
>  	.create = kvm_vfio_create,
> -	.destroy = kvm_vfio_destroy,
> +	.release = kvm_vfio_release,
>  	.set_attr = kvm_vfio_set_attr,
>  	.has_attr = kvm_vfio_has_attr,
>  };

Applied to vfio for-linus branch for v6.2, along with Matthew's R-b, the comment update, and the extra reference link. Once we get a linux-next build I'll send a pull request, along with Matthew's reserved region fix. Thanks,

Alex
I encountered a lockdep splat while running some regression tests today (see below). I suspected it might be this patch so I reverted it, rebuilt the kernel and ran the regression tests again; this time, the test ran cleanly. It looks like this patch may not have fixed the problem for which it was intended. Here is the relevant dmesg output:

[ 579.471402] hades[1099]: Start test run
[ 579.473486] hades[1099]: Start 'tests.test_vfio_ap.VfioAPAssignMdevToGuestTest.runTest' test
[ 579.505804] vfio_ap matrix: MDEV: Registered
[ 579.604024] vfio_ap_mdev 529654a9-bea4-461a-b64d-9d9c63df0deb: Adding to iommu group 0
[ 585.043898] ======================================================
[ 585.043900] WARNING: possible circular locking dependency detected
[ 585.043902] 6.2.0-rc6-00057-g41c03ba9beea-dirty #18 Not tainted
[ 585.043904] ------------------------------------------------------
[ 585.043905] CPU 0/KVM/1173 is trying to acquire lock:
[ 585.043907] 000000008cfb24b0 (&group->group_lock){+.+.}-{3:3}, at: vfio_file_set_kvm+0x50/0x68 [vfio]
[ 585.043919] but task is already holding lock:
[ 585.043920] 00000000b2960ba0 (&kvm->lock){+.+.}-{3:3}, at: kvm_device_release+0x4a/0xb8 [kvm]
[ 585.043960] which lock already depends on the new lock.
[ 585.043962] the existing dependency chain (in reverse order) is:
[ 585.043963] -> #3 (&kvm->lock){+.+.}-{3:3}:
[ 585.043967]        __lock_acquire+0x3e2/0x750
[ 585.043974]        lock_acquire.part.0+0xe2/0x250
[ 585.043977]        lock_acquire+0xac/0x1d0
[ 585.043980]        __mutex_lock+0x9e/0x868
[ 585.043985]        mutex_lock_nested+0x32/0x40
[ 585.043988]        vfio_ap_mdev_open_device+0x9a/0x198 [vfio_ap]
[ 585.043991]        vfio_device_open+0x122/0x168 [vfio]
[ 585.043995]        vfio_device_open_file+0x64/0x120 [vfio]
[ 585.043999]        vfio_group_fops_unl_ioctl+0xd4/0x1e0 [vfio]
[ 585.044002]        __s390x_sys_ioctl+0xc0/0x100
[ 585.044007]        do_syscall+0xee/0x118
[ 585.044032]        __do_syscall+0xd2/0x120
[ 585.044035]        system_call+0x82/0xb0
[ 585.044037] -> #2 (&matrix_dev->guests_lock){+.+.}-{3:3}:
[ 585.044041]        __lock_acquire+0x3e2/0x750
[ 585.044044]        lock_acquire.part.0+0xe2/0x250
[ 585.044047]        lock_acquire+0xac/0x1d0
[ 585.044049]        __mutex_lock+0x9e/0x868
[ 585.044052]        mutex_lock_nested+0x32/0x40
[ 585.044054]        vfio_ap_mdev_open_device+0x8c/0x198 [vfio_ap]
[ 585.044057]        vfio_device_open+0x122/0x168 [vfio]
[ 585.044060]        vfio_device_open_file+0x64/0x120 [vfio]
[ 585.044064]        vfio_group_fops_unl_ioctl+0xd4/0x1e0 [vfio]
[ 585.044068]        __s390x_sys_ioctl+0xc0/0x100
[ 585.044070]        do_syscall+0xee/0x118
[ 585.044072]        __do_syscall+0xd2/0x120
[ 585.044074]        system_call+0x82/0xb0
[ 585.044076] -> #1 (&new_dev_set->lock){+.+.}-{3:3}:
[ 585.044080]        __lock_acquire+0x3e2/0x750
[ 585.044082]        lock_acquire.part.0+0xe2/0x250
[ 585.044085]        lock_acquire+0xac/0x1d0
[ 585.044088]        __mutex_lock+0x9e/0x868
[ 585.044090]        mutex_lock_nested+0x32/0x40
[ 585.044093]        vfio_device_open+0x3e/0x168 [vfio]
[ 585.044096]        vfio_device_open_file+0x64/0x120 [vfio]
[ 585.044100]        vfio_group_fops_unl_ioctl+0xd4/0x1e0 [vfio]
[ 585.044104]        __s390x_sys_ioctl+0xc0/0x100
[ 585.044106]        do_syscall+0xee/0x118
[ 585.044108]        __do_syscall+0xd2/0x120
[ 585.044110]        system_call+0x82/0xb0
[ 585.044112] -> #0 (&group->group_lock){+.+.}-{3:3}:
[ 585.044115]        check_prev_add+0xd4/0xf10
[ 585.044118]        validate_chain+0x698/0x8e8
[ 585.044120]        __lock_acquire+0x3e2/0x750
[ 585.044123]        lock_acquire.part.0+0xe2/0x250
[ 585.044125]        lock_acquire+0xac/0x1d0
[ 585.044128]        __mutex_lock+0x9e/0x868
[ 585.044130]        mutex_lock_nested+0x32/0x40
[ 585.044133]        vfio_file_set_kvm+0x50/0x68 [vfio]
[ 585.044137]        kvm_vfio_release+0x5e/0xf8 [kvm]
[ 585.044156]        kvm_device_release+0x90/0xb8 [kvm]
[ 585.044175]        __fput+0xaa/0x2a0
[ 585.044180]        task_work_run+0x76/0xd0
[ 585.044183]        do_exit+0x248/0x538
[ 585.044186]        do_group_exit+0x40/0xb0
[ 585.044188]        get_signal+0x614/0x698
[ 585.044192]        arch_do_signal_or_restart+0x58/0x370
[ 585.044195]        exit_to_user_mode_loop+0xe8/0x1b8
[ 585.044200]        exit_to_user_mode_prepare+0x164/0x190
[ 585.044203]        __do_syscall+0xd2/0x120
[ 585.044205]        system_call+0x82/0xb0
[ 585.044207] other info that might help us debug this:
[ 585.044209] Chain exists of:
                 &group->group_lock --> &matrix_dev->guests_lock --> &kvm->lock
[ 585.044213] Possible unsafe locking scenario:
[ 585.044214]        CPU0                    CPU1
[ 585.044216]        ----                    ----
[ 585.044217]   lock(&kvm->lock);
[ 585.044219]                           lock(&matrix_dev->guests_lock);
[ 585.044221]                           lock(&kvm->lock);
[ 585.044223]   lock(&group->group_lock);
[ 585.044225] *** DEADLOCK ***
[ 585.044227] 1 lock held by CPU 0/KVM/1173:
[ 585.044228] #0: 00000000b2960ba0 (&kvm->lock){+.+.}-{3:3}, at: kvm_device_release+0x4a/0xb8 [kvm]
[ 585.044251] stack backtrace:
[ 585.044253] CPU: 3 PID: 1173 Comm: CPU 0/KVM Not tainted 6.2.0-rc6-00057-g41c03ba9beea-dirty #18
[ 585.044256] Hardware name: IBM 8561 T01 772 (LPAR)
[ 585.044257] Call Trace:
[ 585.044258] [<000000011a818936>] dump_stack_lvl+0x8e/0xc8
[ 585.044261] [<0000000119aca3f2>] check_noncircular+0x132/0x158
[ 585.044264] [<0000000119acba44>] check_prev_add+0xd4/0xf10
[ 585.044267] [<0000000119accf18>] validate_chain+0x698/0x8e8
[ 585.044270] [<0000000119ace70a>] __lock_acquire+0x3e2/0x750
[ 585.044273] [<0000000119acf682>] lock_acquire.part.0+0xe2/0x250
[ 585.044276] [<0000000119acf89c>] lock_acquire+0xac/0x1d0
[ 585.044279] [<000000011a823c66>] __mutex_lock+0x9e/0x868
[ 585.044282] [<000000011a824462>] mutex_lock_nested+0x32/0x40
[ 585.044285] [<000003ff7fbcd6a0>] vfio_file_set_kvm+0x50/0x68 [vfio]
[ 585.044289] [<000003ff7feacab6>] kvm_vfio_release+0x5e/0xf8 [kvm]
[ 585.044308] [<000003ff7fea6d58>] kvm_device_release+0x90/0xb8 [kvm]
[ 585.044328] [<0000000119dbb83a>] __fput+0xaa/0x2a0
[ 585.044331] [<0000000119a67c66>] task_work_run+0x76/0xd0
[ 585.044333] [<0000000119a3ec18>] do_exit+0x248/0x538
[ 585.044335] [<0000000119a3f0c8>] do_group_exit+0x40/0xb0
[ 585.044338] [<0000000119a50dec>] get_signal+0x614/0x698
[ 585.044340] [<00000001199ea030>] arch_do_signal_or_restart+0x58/0x370
[ 585.044343] [<0000000119b0bb50>] exit_to_user_mode_loop+0xe8/0x1b8
[ 585.044346] [<0000000119b0bd84>] exit_to_user_mode_prepare+0x164/0x190
[ 585.044349] [<000000011a818d2a>] __do_syscall+0xd2/0x120
[ 585.044351] [<000000011a82c462>] system_call+0x82/0xb0
[ 585.044354] INFO: lockdep is turned off.
[ 610.595528] vfio_ap_mdev 529654a9-bea4-461a-b64d-9d9c63df0deb: Removing from iommu group 0
[ 610.604408] vfio_ap matrix: MDEV: Unregistering
[ 610.826074] hades[1099]: Stop 'tests.test_vfio_ap.VfioAPAssignMdevToGuestTest.runTest'

On 1/20/23 10:05 AM, Yi Liu wrote:
> Currently it is possible that the final put of a KVM reference comes from
> vfio during its device close operation. [...]
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  virt/kvm/vfio.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> [...]
On Tue, Jan 31, 2023 at 09:27:54AM -0500, Anthony Krowiak wrote:
> I encountered a lockdep splat while running some regression tests today (see
> below). I suspected it might be this patch so I reverted it, rebuilt the
> kernel and ran the regression tests again; this time, the test ran cleanly.
> It looks like this patch may not have fixed the problem for which it was
> intended. Here is the relevant dmesg output:

Well, it fixes the deadlock it intended to fix and created another one :)

It means device drivers cannot obtain the kvm lock from their open functions in this new model.

Why does ap need to touch kvm->lock? (via get_update_locks_for_kvm)

Maybe you should split that lock and have a dedicated apcb lock?

Jason
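Jason's suggestion amounts to narrowing lock scope: rather than nesting the guest-wide kvm->lock inside the driver's open path, guard only the state that actually needs protection with its own dedicated lock, so kvm->lock disappears from that hierarchy entirely. A rough, self-contained sketch of the pattern (field and function names are illustrative, not the actual vfio_ap implementation):

```c
#include <pthread.h>
#include <stdint.h>

/*
 * Before: the driver took the broad kvm->lock just to update its
 * per-device APCB state, dragging kvm->lock into the open_device
 * lock hierarchy. After: a dedicated apcb_lock covers only that
 * state, so open_device never touches kvm->lock.
 */
struct ap_matrix_mdev {
	pthread_mutex_t apcb_lock;	/* dedicated lock replacing kvm->lock here */
	uint64_t apm;			/* adapter mask (illustrative) */
	uint64_t aqm;			/* queue mask (illustrative) */
};

static void apcb_update(struct ap_matrix_mdev *m, uint64_t apm, uint64_t aqm)
{
	pthread_mutex_lock(&m->apcb_lock);
	m->apm = apm;			/* only APCB state is touched under the lock */
	m->aqm = aqm;
	pthread_mutex_unlock(&m->apcb_lock);
}
```

The design point is that a lock's dependency footprint is as important as what it protects: a finer lock that never nests under vfio's group_lock cannot participate in the reported cycle.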
On 1/31/23 9:27 AM, Anthony Krowiak wrote:
> I encountered a lockdep splat while running some regression tests today
> (see below). I suspected it might be this patch so I reverted it, rebuilt
> the kernel and ran the regression tests again; this time, the test ran
> cleanly. It looks like this patch may not have fixed the problem for which
> it was intended. Here is the relevant dmesg output:
> [...]

Damn it, the config I used to test must not have had lockdep enabled.

It looks like the issue is that .destroy previously did not acquire the kvm lock before calling into vfio_file_set_kvm, whereas now .release does. This means .release expects a hierarchy of kvm->lock ... vfio->group_lock, but your .open_device does vfio->group_lock ... kvm->lock. I can reproduce something similar for vfio_pci_zdev.
>> This occurs while the vfio group
>> lock is held; however, if the vfio device is still in the kvm device list,
>> then the following call chain could result in a deadlock:
>>
>> VFIO holds group->group_lock/group_rwsem
>>  -> kvm_put_kvm
>>   -> kvm_destroy_vm
>>    -> kvm_destroy_devices
>>     -> kvm_vfio_destroy
>>      -> kvm_vfio_file_set_kvm
>>       -> vfio_file_set_kvm
>>        -> try to hold group->group_lock/group_rwsem
>>
>> The key function is kvm_destroy_devices(), which triggers the destroy
>> callback of kvm_device_ops. That callback calls back into vfio and tries
>> to hold the group lock, so if this path no longer calls back into vfio,
>> the deadlock is fixed. KVM provides another point at which to free the
>> kvm-vfio device: when the device file descriptor is closed. This can be
>> achieved by providing the release callback instead of the destroy
>> callback. Also rename kvm_vfio_destroy() to kvm_vfio_release().
>>
>>  /*
>>   * Destroy is responsible for freeing dev.
>>   *
>>   * Destroy may be called before or after destructors are called
>>   * on emulated I/O regions, depending on whether a reference is
>>   * held by a vcpu or other kvm component that gets destroyed
>>   * after the emulated I/O.
>>   */
>>  void (*destroy)(struct kvm_device *dev);
>>
>>  /*
>>   * Release is an alternative method to free the device. It is
>>   * called when the device file descriptor is closed. Once
>>   * release is called, the destroy method will not be called
>>   * anymore as the device is removed from the device list of
>>   * the VM. kvm->lock is held.
>>   */
>>  void (*release)(struct kvm_device *dev);
>>
>> Fixes: 421cfe6596f6 ("vfio: remove VFIO_GROUP_NOTIFY_SET_KVM")
>> Reported-by: Alex Williamson <alex.williamson@redhat.com>
>> Suggested-by: Kevin Tian <kevin.tian@intel.com>
>> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> ---
>>  virt/kvm/vfio.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
>> index 495ceabffe88..e94f3ea718e5 100644
>> --- a/virt/kvm/vfio.c
>> +++ b/virt/kvm/vfio.c
>> @@ -336,7 +336,7 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
>>  	return -ENXIO;
>>  }
>>
>> -static void kvm_vfio_destroy(struct kvm_device *dev)
>> +static void kvm_vfio_release(struct kvm_device *dev)
>>  {
>>  	struct kvm_vfio *kv = dev->private;
>>  	struct kvm_vfio_group *kvg, *tmp;
>> @@ -363,7 +363,7 @@ static int kvm_vfio_create(struct kvm_device *dev, u32 type);
>>  static struct kvm_device_ops kvm_vfio_ops = {
>>  	.name = "kvm-vfio",
>>  	.create = kvm_vfio_create,
>> -	.destroy = kvm_vfio_destroy,
>> +	.release = kvm_vfio_release,
>>  	.set_attr = kvm_vfio_set_attr,
>>  	.has_attr = kvm_vfio_has_attr,
>>  };
On 1/31/23 9:34 AM, Jason Gunthorpe wrote:
> On Tue, Jan 31, 2023 at 09:27:54AM -0500, Anthony Krowiak wrote:
>> I encountered a lockdep splat while running some regression tests today (see
>> below). I suspected it might be this patch so I reverted it, rebuilt the
>> kernel and ran the regression tests again; this time, the test ran cleanly.
>> It looks like this patch may not have fixed the problem for which it was
>> intended. Here is the relevant dmesg output:
>
> Well, it fixes the deadlock it intended to fix and created another one
> :)
>
> It means device drivers cannot obtain the kvm lock from their open
> functions in this new model.
>
> Why does ap need to touch kvm->lock? (via get_update_locks_for_kvm)

We need the kvm->lock because we take the vCPUs out of SIE in order to
dynamically change values in the APCB.

> Maybe you should split that lock and have a dedicated apcb lock?

I don't think that would suffice for taking the vCPUs out of SIE.

> Jason
On Tue, Jan 31, 2023 at 09:46:18AM -0500, Anthony Krowiak wrote:
>> Maybe you should split that lock and have a dedicated apcb lock?
>
> I don't think that would suffice for taking the vCPUs out of SIE.

Then I think we have to keep this patch and also do Matthew's patch to
keep kvm refs inside vfio as well.

Jason
On 1/31/23 9:48 AM, Jason Gunthorpe wrote:
> On Tue, Jan 31, 2023 at 09:46:18AM -0500, Anthony Krowiak wrote:
>
>>> Maybe you should split that lock and have a dedicated apcb lock?
>>
>> I don't think that would suffice for taking the vCPUs out of SIE.
>
> Then I think we have to keep this patch and also do Matthew's patch to
> keep kvm refs inside vfio as well.

I don't think keeping kvm refs inside vfio solves this issue though --
even if we handle the kvm_put_kvm asynchronously within vfio as
previously proposed, kvm_vfio_release will eventually get called, and it
gets called with the kvm->lock already held; it then proceeds to call
vfio_file_set_kvm, which takes the group->lock. That order conflicts
with the hierarchy used by the driver during open_device:
vfio->group_lock ... kvm->lock.
On Tue, Jan 31, 2023 at 10:00:14AM -0500, Matthew Rosato wrote:
> On 1/31/23 9:48 AM, Jason Gunthorpe wrote:
>> On Tue, Jan 31, 2023 at 09:46:18AM -0500, Anthony Krowiak wrote:
>>
>>>> Maybe you should split that lock and have a dedicated apcb lock?
>>>
>>> I don't think that would suffice for taking the vCPUs out of SIE.
>>
>> Then I think we have to keep this patch and also do Matthew's patch to
>> keep kvm refs inside vfio as well.
>
> I don't think keeping kvm refs inside vfio solves this issue though
> -- Even if we handle the kvm_put_kvm asynchronously within vfio as
> previously proposed, kvm_vfio_release will eventually get called and
> it gets called with the kvm->lock already held, then proceeds to
> call vfio_file_set_kvm which gets the group->lock. That order
> conflicts with the hierarchy used by the driver during open_device
> of vfio->group_lock ... kvm->lock.

The group lock is held by vfio_file_set_kvm() only because we don't
have a refcount and we have to hold it across the open call to keep
the pointer alive. With proper refcounting you'd split this to a
spinlock and hold it only while obtaining the ref for the open thread.

Jason
diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
index 495ceabffe88..e94f3ea718e5 100644
--- a/virt/kvm/vfio.c
+++ b/virt/kvm/vfio.c
@@ -336,7 +336,7 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
 	return -ENXIO;
 }
 
-static void kvm_vfio_destroy(struct kvm_device *dev)
+static void kvm_vfio_release(struct kvm_device *dev)
 {
 	struct kvm_vfio *kv = dev->private;
 	struct kvm_vfio_group *kvg, *tmp;
@@ -363,7 +363,7 @@ static int kvm_vfio_create(struct kvm_device *dev, u32 type);
 static struct kvm_device_ops kvm_vfio_ops = {
 	.name = "kvm-vfio",
 	.create = kvm_vfio_create,
-	.destroy = kvm_vfio_destroy,
+	.release = kvm_vfio_release,
 	.set_attr = kvm_vfio_set_attr,
 	.has_attr = kvm_vfio_has_attr,
 };