Message ID | 20230109201037.33051-2-mjrosato@linux.ibm.com |
---|---|
State | New |
Headers |
From: Matthew Rosato <mjrosato@linux.ibm.com>
To: alex.williamson@redhat.com, pbonzini@redhat.com
Cc: jgg@nvidia.com, cohuck@redhat.com, farman@linux.ibm.com, pmorel@linux.ibm.com, borntraeger@linux.ibm.com, frankja@linux.ibm.com, imbrenda@linux.ibm.com, david@redhat.com, akrowiak@linux.ibm.com, jjherne@linux.ibm.com, pasic@linux.ibm.com, zhenyuw@linux.intel.com, zhi.a.wang@intel.com, linux-s390@vger.kernel.org, kvm@vger.kernel.org, intel-gvt-dev@lists.freedesktop.org, intel-gfx@lists.freedesktop.org, linux-kernel@vger.kernel.org
Subject: [PATCH 1/2] KVM: async kvm_destroy_vm for vfio devices
Date: Mon, 9 Jan 2023 15:10:36 -0500
Message-Id: <20230109201037.33051-2-mjrosato@linux.ibm.com>
In-Reply-To: <20230109201037.33051-1-mjrosato@linux.ibm.com>
References: <20230109201037.33051-1-mjrosato@linux.ibm.com>
Series | kvm/vfio: fix potential deadlock on vfio group lock |
Commit Message
Matthew Rosato
Jan. 9, 2023, 8:10 p.m. UTC
Currently it is possible that the final put of a KVM reference comes from
vfio during its device close operation. This occurs while the vfio group
lock is held; however, if the vfio device is still in the kvm device list,
then the following call chain could result in a deadlock:
kvm_put_kvm
 -> kvm_destroy_vm
  -> kvm_destroy_devices
   -> kvm_vfio_destroy
    -> kvm_vfio_file_set_kvm
     -> vfio_file_set_kvm
      -> group->group_lock/group_rwsem
Avoid this scenario by adding kvm_put_kvm_async, which performs the
kvm_destroy_vm asynchronously if the refcount reaches 0.
Fixes: 421cfe6596f6 ("vfio: remove VFIO_GROUP_NOTIFY_SET_KVM")
Reported-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com>
---
drivers/gpu/drm/i915/gvt/kvmgt.c | 6 +++++-
drivers/s390/crypto/vfio_ap_ops.c | 7 ++++++-
include/linux/kvm_host.h | 3 +++
virt/kvm/kvm_main.c | 22 ++++++++++++++++++++++
4 files changed, 36 insertions(+), 2 deletions(-)
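The fix defers the final VM teardown to a separate execution context so the caller can return and drop its locks first. A minimal userspace sketch of that pattern, with hypothetical names (the kernel patch below uses a work_struct and schedule_work() rather than a thread):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Stand-in for struct kvm; users mirrors kvm->users_count. */
struct vm {
	atomic_int users;
};

static atomic_int destroy_count;   /* counts completed teardowns */
static pthread_t destroy_thread;   /* stand-in for kvm->async_work */

static void destroy_vm(struct vm *vm)
{
	/* Heavy teardown; in the kernel this is kvm_destroy_vm(). */
	atomic_fetch_add(&destroy_count, 1);
	free(vm);
}

static void *destroy_worker(void *arg)
{
	destroy_vm(arg);
	return NULL;
}

/* Synchronous put: destroys inline, which is what deadlocks if the
 * caller holds a lock that teardown also needs (the vfio group lock
 * in the scenario above). */
static void put_vm(struct vm *vm)
{
	if (atomic_fetch_sub(&vm->users, 1) == 1)
		destroy_vm(vm);
}

/* Async put: when the count hits zero, hand destruction off to
 * another context instead of running it in the caller. */
static void put_vm_async(struct vm *vm)
{
	if (atomic_fetch_sub(&vm->users, 1) == 1)
		pthread_create(&destroy_thread, NULL, destroy_worker, vm);
}
```

The key property is that put_vm_async() never runs teardown in the caller's context, so it is safe to call with arbitrary locks held.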
Comments
On Mon, Jan 09, 2023 at 03:10:36PM -0500, Matthew Rosato wrote:
> Currently it is possible that the final put of a KVM reference comes from
> vfio during its device close operation. This occurs while the vfio group
> lock is held; however, if the vfio device is still in the kvm device list,
> then the following call chain could result in a deadlock:
>
> kvm_put_kvm
>  -> kvm_destroy_vm
>   -> kvm_destroy_devices
>    -> kvm_vfio_destroy
>     -> kvm_vfio_file_set_kvm
>      -> vfio_file_set_kvm
>       -> group->group_lock/group_rwsem
>
> [...]

Why two patches?

It looks OK to me

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason
On 1/9/23 3:13 PM, Jason Gunthorpe wrote:
> On Mon, Jan 09, 2023 at 03:10:36PM -0500, Matthew Rosato wrote:
>> [...]
>
> Why two patches?
>
> It looks OK to me

Mentioned in the cover: the Fixes: tag is different on the 2nd patch because the s390 PCI passthrough kvm_puts were added sometime after 421cfe6596f6 ("vfio: remove VFIO_GROUP_NOTIFY_SET_KVM"). The blamed commit for those changes also landed in a different release (6.0 vs 5.19). But, now that you mention it, neither is an LTS so it probably doesn't matter all that much and could be squashed if preferred.

> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Thanks!

> Jason
LGTM

Reviewed-by: Tony Krowiak <akrowiak@linux.ibm.com>

On 1/9/23 3:10 PM, Matthew Rosato wrote:
> Currently it is possible that the final put of a KVM reference comes from
> vfio during its device close operation. This occurs while the vfio group
> lock is held; however, if the vfio device is still in the kvm device list,
> then the following call chain could result in a deadlock:
>
> [...]
On Mon, Jan 09, 2023, Matthew Rosato wrote:
> Currently it is possible that the final put of a KVM reference comes from
> vfio during its device close operation. [...]
>
> Avoid this scenario by adding kvm_put_kvm_async which will perform the
> kvm_destroy_vm asynchronously if the refcount reaches 0.

Something feels off. If KVM's refcount is 0, then accessing device->group->kvm in vfio_device_open() can't happen unless there's a refcounting bug somewhere. E.g. if this snippet holds group_lock

	mutex_lock(&device->group->group_lock);
	device->kvm = device->group->kvm;

	if (device->ops->open_device) {
		ret = device->ops->open_device(device);
		if (ret)
			goto err_undo_count;
	}
	vfio_device_container_register(device);
	mutex_unlock(&device->group->group_lock);

and kvm_vfio_destroy() has already been invoked and is waiting on group_lock in vfio_file_set_kvm(), then device->ops->open_device() is being called with a non-NULL device->kvm that has kvm->users_count==0. And intel_vgpu_open_device() at least uses kvm_get_kvm(), i.e. assumes kvm->users_count > 0.

If there's no refcounting bug then kvm_vfio_destroy() doesn't need to call kvm_vfio_file_set_kvm() since the only way there isn't a refcounting bug is if group->kvm is unreachable. This seems unlikely.

Assuming there is indeed a refcounting issue, one solution would be to harden all ->open_device() implementations to use kvm_get_kvm_safe(). I'd prefer not to have to do that since it will complicate those implementations and also requires KVM to support an async put.

Rather than force devices to get KVM references, why not handle that in common VFIO code and drop KVM refcounting from devices? Worst case scenario KVM is pinned by a device that doesn't need KVM but is in a group associated with KVM. If that's a concern, it seems easy enough to add a flag to vfio_device_ops to enumerate whether or not the device depends on KVM.

---
 drivers/vfio/vfio_main.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 6e8804fe0095..fb43212d77a0 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -772,6 +772,13 @@ static struct file *vfio_device_open(struct vfio_device *device)
 	 * reference and release it during close_device.
 	 */
 	mutex_lock(&device->group->group_lock);
+
+	if (device->group->kvm &&
+	    !kvm_get_kvm_safe(device->group->kvm->kvm)) {
+		ret = -ENODEV;
+		goto err_undo_count;
+	}
+
 	device->kvm = device->group->kvm;
 
 	if (device->ops->open_device) {
@@ -823,8 +830,10 @@ static struct file *vfio_device_open(struct vfio_device *device)
 err_undo_count:
 	mutex_unlock(&device->group->group_lock);
 	device->open_count--;
-	if (device->open_count == 0 && device->kvm)
+	if (device->open_count == 0 && device->kvm) {
+		kvm_put_kvm(device->kvm);
 		device->kvm = NULL;
+	}
 	mutex_unlock(&device->dev_set->lock);
 	module_put(device->dev->driver->owner);
 err_unassign_container:
@@ -1039,8 +1048,10 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
 	}
 	mutex_unlock(&device->group->group_lock);
 	device->open_count--;
-	if (device->open_count == 0)
+	if (device->open_count == 0 && device->kvm) {
+		kvm_put_kvm(device->kvm);
 		device->kvm = NULL;
+	}
 	mutex_unlock(&device->dev_set->lock);
 	module_put(device->dev->driver->owner);

base-commit: d52444c7a90fc551b4c3b0bda7d3f0b2ca9fc84d
--
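The kvm_get_kvm_safe() hardening discussed here boils down to an "increment only if not already zero" refcount operation. A small illustrative userspace model of that idea (the names here are made up; the kernel implements this on top of refcount_inc_not_zero()):

```c
#include <stdatomic.h>
#include <stdbool.h>

struct obj {
	atomic_int users;
};

/* Unconditional get: only safe when the caller already holds a
 * reference, i.e. the count is provably nonzero. */
static void obj_get(struct obj *o)
{
	atomic_fetch_add(&o->users, 1);
}

/* "Safe" get: take a reference only if the object is still live.
 * Fails instead of resurrecting an object whose last reference has
 * already been dropped and whose teardown may be in flight. */
static bool obj_get_safe(struct obj *o)
{
	int cur = atomic_load(&o->users);

	while (cur != 0) {
		/* On failure the CAS reloads cur, so just retry. */
		if (atomic_compare_exchange_weak(&o->users, &cur, cur + 1))
			return true;
	}
	return false;
}
```

The CAS loop is what makes the zero check and the increment atomic with respect to a concurrent final put; a plain load-then-add would race.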
On Wed, Jan 11, 2023 at 07:54:51PM +0000, Sean Christopherson wrote:
> Something feels off. If KVM's refcount is 0, then accessing device->group->kvm
> in vfio_device_open() can't happen unless there's a refcounting bug somewhere.

The problem is in close, not open. Specifically it would be very hard to avoid holding the group_lock during close, which is when the put is done.

> Rather than force devices to get KVM references, why not handle that in common
> VFIO code and drop KVM refcounting from devices? Worst case scenario KVM is pinned
> by a device that doesn't need KVM but is in a group associated with KVM. If that's
> a concern, it seems easy enough to add a flag to vfio_device_ops to enumerate
> whether or not the device depends on KVM.

We can't make cross-dependencies between kvm and core VFIO - it is why so much of this is so ugly.

The few device drivers that unavoidably have KVM involvement already have a KVM module dependency, so they can safely do the get/put

Jason
On Wed, Jan 11, 2023, Jason Gunthorpe wrote:
> On Wed, Jan 11, 2023 at 07:54:51PM +0000, Sean Christopherson wrote:
>
>> Something feels off. If KVM's refcount is 0, then accessing device->group->kvm
>> in vfio_device_open() can't happen unless there's a refcounting bug somewhere.
>
> The problem is in close, not open.

The deadlock problem is, yes. My point is that if group_lock needs to be taken when nullifying group->kvm during kvm_vfio_destroy(), then there is also a refcounting problem with respect to open(). If there is no refcounting problem, then nullifying group->kvm during kvm_vfio_destroy() is unnecessary (but again, I doubt this is the case).

The two things aren't directly related, but it seems possible to solve both while making this all slightly less ugly. Well, at least from KVM's perspective; whether or not it'd be an improvement on the VFIO side is definitely debatable.

> Specifically it would be very hard to avoid holding the group_lock
> during close which is when the put is done.
>
> We can't make cross-dependencies between kvm and core VFIO - it is why
> so much of this is so ugly.

Ugh, right, modules for everyone.

> The few device drivers that unavoidably have KVM involvement already
> have a KVM module dependency, so they can safely do the get/put

Rather than store a "struct kvm *" in vfio_device, what about adding a new set of optional ops to get/put KVM references? Having dedicated KVM ops is gross, but IMO it's less gross than backdooring the KVM pointer into open_device() by stashing KVM into the device, e.g. it formalizes the VFIO API for devices that depend on KVM instead of making devices pinky-swear to grab a reference during open_device().

To further harden things, KVM could export only kvm_get_safe_kvm() if there are no vendor modules. I.e. make kvm_get_kvm() an internal-only helper when possible and effectively force VFIO devices to use the safe variant. That would work even on x86, as kvm_get_kvm() wouldn't be exported if neither KVM_AMD nor KVM_INTEL is built as a module.

---
 drivers/vfio/vfio_main.c | 20 +++++++++++++-------
 include/linux/vfio.h     |  9 +++++++--
 2 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 6e8804fe0095..b3a84d65baa6 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -772,7 +772,12 @@ static struct file *vfio_device_open(struct vfio_device *device)
 	 * reference and release it during close_device.
 	 */
 	mutex_lock(&device->group->group_lock);
-	device->kvm = device->group->kvm;
+
+	if (device->kvm_ops && device->group->kvm) {
+		ret = device->kvm_ops->get_kvm(device->group->kvm);
+		if (ret)
+			goto err_undo_count;
+	}
 
 	if (device->ops->open_device) {
 		ret = device->ops->open_device(device);
@@ -823,8 +828,9 @@ static struct file *vfio_device_open(struct vfio_device *device)
 err_undo_count:
 	mutex_unlock(&device->group->group_lock);
 	device->open_count--;
-	if (device->open_count == 0 && device->kvm)
-		device->kvm = NULL;
+	if (device->open_count == 0 && device->kvm_ops)
+		device->kvm_ops->put_kvm();
+
 	mutex_unlock(&device->dev_set->lock);
 	module_put(device->dev->driver->owner);
 err_unassign_container:
@@ -1039,8 +1045,8 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
 	}
 	mutex_unlock(&device->group->group_lock);
 	device->open_count--;
-	if (device->open_count == 0)
-		device->kvm = NULL;
+	if (device->open_count == 0 && device->kvm_ops)
+		device->kvm_ops->put_kvm();
 	mutex_unlock(&device->dev_set->lock);
 	module_put(device->dev->driver->owner);
@@ -1656,8 +1662,8 @@ EXPORT_SYMBOL_GPL(vfio_file_enforced_coherent);
  * @file: VFIO group file
  * @kvm: KVM to link
  *
- * When a VFIO device is first opened the KVM will be available in
- * device->kvm if one was associated with the group.
+ * When a VFIO device is first opened, the device's kvm_ops->get_kvm() will be
+ * invoked with the KVM instance associated with the group (if applicable).
  */
 void vfio_file_set_kvm(struct file *file, struct kvm *kvm)
 {
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index fdd393f70b19..d6dcbe0546bf 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -18,6 +18,11 @@
 
 struct kvm;
 
+struct vfio_device_kvm_ops {
+	int (*get_kvm)(struct kvm *kvm);
+	void (*put_kvm)(void);
+};
+
 /*
  * VFIO devices can be placed in a set, this allows all devices to share this
  * structure and the VFIO core will provide a lock that is held around
@@ -43,8 +48,8 @@ struct vfio_device {
 	struct vfio_device_set *dev_set;
 	struct list_head dev_set_list;
 	unsigned int migration_flags;
-	/* Driver must reference the kvm during open_device or never touch it */
-	struct kvm *kvm;
+
+	const struct vfio_device_kvm_ops *kvm_ops;
 
 	/* Members below here are private, not for driver use */
 	unsigned int index;

base-commit: d52444c7a90fc551b4c3b0bda7d3f0b2ca9fc84d
--
On Wed, Jan 11, 2023 at 08:53:34PM +0000, Sean Christopherson wrote:
> The deadlock problem is, yes. My point is that if group_lock needs to be taken
> when nullifying group->kvm during kvm_vfio_destroy(), then there is also a
> refcounting problem with respect to open(). If there is no refcounting problem,
> then nullifying group->kvm during kvm_vfio_destroy() is unnecessary (but again,
> I doubt this is the case).

IIRC the drivers are supposed to use one of the refcount not zero incrs to counteract this, but I never checked they do..

Yi is working on a patch to change things so vfio drops the kvm pointer when the kvm file closes, not when the reference goes to 0, to avoid a refcount cycle problem which should also solve that.

> @@ -772,7 +772,12 @@ static struct file *vfio_device_open(struct vfio_device *device)
>  	 * reference and release it during close_device.
>  	 */
>  	mutex_lock(&device->group->group_lock);
> -	device->kvm = device->group->kvm;
> +
> +	if (device->kvm_ops && device->group->kvm) {
> +		ret = device->kvm_ops->get_kvm(device->group->kvm);

At this point I'd rather just use the symbol get stuff like kvm does and call the proper functions.

Jason
On 1/12/23 7:45 AM, Jason Gunthorpe wrote:
> IIRC the drivers are supposed to use one of the refcount not zero
> incrs to counteract this, but I never checked they do..
>
> Yi is working on a patch to change things so vfio drops the kvm
> pointer when the kvm file closes, not when the reference goes to 0,
> to avoid a refcount cycle problem which should also solve that.
>
> At this point I'd rather just use the symbol get stuff like kvm does
> and call the proper functions.

So should I work up a v2 that does symbol gets for kvm_get_kvm_safe and kvm_put_kvm from vfio_main and drop kvm_put_kvm_async? Or is the patch Yi is working on changing things such that it will also address the deadlock issue?

If so, something like the following (where vfio_kvm_get symbol gets kvm_get_kvm_safe and vfio_kvm_put symbol gets kvm_put_kvm):

diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 5177bb061b17..a49bf1080f0a 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -361,16 +361,22 @@ static int vfio_device_first_open(struct vfio_device *device,
 	if (ret)
 		goto err_module_put;
 
+	if (kvm && !vfio_kvm_get(kvm)) {
+		ret = -ENOENT;
+		goto err_unuse_iommu;
+	}
 	device->kvm = kvm;
 	if (device->ops->open_device) {
 		ret = device->ops->open_device(device);
 		if (ret)
-			goto err_unuse_iommu;
+			goto err_put_kvm;
 	}
 	return 0;
 
-err_unuse_iommu:
+err_put_kvm:
+	vfio_put_kvm(kvm);
 	device->kvm = NULL;
+err_unuse_iommu:
 	if (iommufd)
 		vfio_iommufd_unbind(device);
 	else
@@ -465,6 +471,9 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
 
 	vfio_device_group_close(device);
 
+	if (device->open_count == 0 && device->group->kvm)
+		vfio_kvm_put(device->group->kvm);
+
 	vfio_device_put_registration(device);
 
 	return 0;
On Thu, Jan 12, 2023 at 12:21:17PM -0500, Matthew Rosato wrote:
> So should I work up a v2 that does symbol gets for kvm_get_kvm_safe
> and kvm_put_kvm from vfio_main and drop kvm_put_kvm_async? Or is
> the patch Yi is working on changing things such that it will also
> address the deadlock issue?

I don't think Yi's part will help

> @@ -361,16 +361,22 @@ static int vfio_device_first_open(struct vfio_device *device,
>  	if (ret)
>  		goto err_module_put;
>
> +	if (kvm && !vfio_kvm_get(kvm)) {

Do call it kvm_get_safe though

> +		ret = -ENOENT;
> +		goto err_unuse_iommu;
> +	}
> 	device->kvm = kvm;
> 	if (device->ops->open_device) {
> 		ret = device->ops->open_device(device);
> 		if (ret)
> -			goto err_unuse_iommu;
> +			goto err_put_kvm;
> 	}
> 	return 0;
>
> -err_unuse_iommu:
> +err_put_kvm:
> +	vfio_put_kvm(kvm);
> 	device->kvm = NULL;
> +err_unuse_iommu:
> 	if (iommufd)
> 		vfio_iommufd_unbind(device);
> 	else
> @@ -465,6 +471,9 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
>
> 	vfio_device_group_close(device);
>
> +	if (device->open_count == 0 && device->group->kvm)
> +		vfio_kvm_put(device->group->kvm);
> +

No, you can't touch group->kvm without holding the group lock, that is the whole point of the problem.. This has to be device->kvm

Jason
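Jason's closing point, that the shared group->kvm pointer may only be read under group_lock while each device must release its own cached copy, can be sketched with an illustrative userspace model (these names are invented for the sketch, not the actual vfio code):

```c
#include <pthread.h>
#include <stddef.h>

struct kvm_ref {
	int users;
};

struct group {
	pthread_mutex_t lock;   /* models group->group_lock */
	struct kvm_ref *kvm;    /* shared; only valid to read under lock */
};

struct device {
	struct group *group;
	struct kvm_ref *kvm;    /* private snapshot taken at open time */
};

static void device_open(struct device *dev)
{
	pthread_mutex_lock(&dev->group->lock);
	/* Snapshot the shared pointer and take a reference under the lock. */
	dev->kvm = dev->group->kvm;
	if (dev->kvm)
		dev->kvm->users++;
	pthread_mutex_unlock(&dev->group->lock);
}

static void device_release(struct device *dev)
{
	/* Drop the cached reference: dev->kvm remains usable even if
	 * group->kvm has since been cleared, and no group lock is
	 * required here, which is why the release path must use the
	 * device's copy rather than re-reading group->kvm. */
	if (dev->kvm) {
		dev->kvm->users--;
		dev->kvm = NULL;
	}
}
```

The sketch keeps the refcount as a plain int for brevity; a concurrent version would need an atomic count as in the earlier examples.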
diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c b/drivers/gpu/drm/i915/gvt/kvmgt.c
index 8ae7039b3683..24511c877572 100644
--- a/drivers/gpu/drm/i915/gvt/kvmgt.c
+++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
@@ -703,7 +703,11 @@ static void intel_vgpu_close_device(struct vfio_device *vfio_dev)
 
 	kvm_page_track_unregister_notifier(vgpu->vfio_device.kvm,
 					   &vgpu->track_node);
-	kvm_put_kvm(vgpu->vfio_device.kvm);
+	/*
+	 * Avoid possible deadlock on any currently-held vfio lock by
+	 * ensuring the potential kvm_destroy_vm call is done asynchronously
+	 */
+	kvm_put_kvm_async(vgpu->vfio_device.kvm);
 
 	kvmgt_protect_table_destroy(vgpu);
 	gvt_cache_destroy(vgpu);
diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index e93bb9c468ce..a37b2baefb36 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -1574,7 +1574,12 @@ static void vfio_ap_mdev_unset_kvm(struct ap_matrix_mdev *matrix_mdev)
 
 		kvm_arch_crypto_clear_masks(kvm);
 		vfio_ap_mdev_reset_queues(&matrix_mdev->qtable);
-		kvm_put_kvm(kvm);
+		/*
+		 * Avoid possible deadlock on any currently-held vfio lock by
+		 * ensuring the potential kvm_destroy_vm call is done
+		 * asynchronously
+		 */
+		kvm_put_kvm_async(kvm);
 		matrix_mdev->kvm = NULL;
 
 		release_update_locks_for_kvm(kvm);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4f26b244f6d0..2ef6a5102265 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -34,6 +34,7 @@
 #include <linux/instrumentation.h>
 #include <linux/interval_tree.h>
 #include <linux/rbtree.h>
+#include <linux/workqueue.h>
 #include <linux/xarray.h>
 #include <asm/signal.h>
 
@@ -793,6 +794,7 @@ struct kvm {
 	struct kvm_stat_data **debugfs_stat_data;
 	struct srcu_struct srcu;
 	struct srcu_struct irq_srcu;
+	struct work_struct async_work;
 	pid_t userspace_pid;
 	bool override_halt_poll_ns;
 	unsigned int max_halt_poll_ns;
@@ -963,6 +965,7 @@ void kvm_exit(void);
 void kvm_get_kvm(struct kvm *kvm);
 bool kvm_get_kvm_safe(struct kvm *kvm);
 void kvm_put_kvm(struct kvm *kvm);
+void kvm_put_kvm_async(struct kvm *kvm);
 bool file_is_kvm(struct file *file);
 void kvm_put_kvm_no_destroy(struct kvm *kvm);
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 13e88297f999..fbe8d127028b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1353,6 +1353,28 @@ void kvm_put_kvm(struct kvm *kvm)
 }
 EXPORT_SYMBOL_GPL(kvm_put_kvm);
 
+static void kvm_put_async_fn(struct work_struct *work)
+{
+	struct kvm *kvm = container_of(work, struct kvm,
+				       async_work);
+
+	kvm_destroy_vm(kvm);
+}
+
+/*
+ * Put a reference but only destroy the vm asynchronously. Can be used in
+ * cases where the caller holds a mutex that could cause deadlock if
+ * kvm_destroy_vm is triggered
+ */
+void kvm_put_kvm_async(struct kvm *kvm)
+{
+	if (refcount_dec_and_test(&kvm->users_count)) {
+		INIT_WORK(&kvm->async_work, kvm_put_async_fn);
+		schedule_work(&kvm->async_work);
+	}
+}
+EXPORT_SYMBOL_GPL(kvm_put_kvm_async);
+
 /*
  * Used to put a reference that was taken on behalf of an object associated
  * with a user-visible file descriptor, e.g. a vcpu or device, if installation