Message ID | 170752273224.1693615.11371097645648272257.stgit@bgt-140510-bm01.eng.stellus.in |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel+bounces-60157-ouuuleilei=gmail.com@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:7300:50ea:b0:106:860b:bbdd with SMTP id r10csp1200967dyd; Fri, 9 Feb 2024 15:52:45 -0800 (PST) X-Google-Smtp-Source: AGHT+IHn87WcKGNjcLdUap6ND3H14LcLDD8GLKz8QwdqgKG4LKkOCfaOVr/2UyEtDuYSN5prYpgl X-Received: by 2002:a05:6870:4d01:b0:219:fdbb:bd4d with SMTP id pn1-20020a0568704d0100b00219fdbbbd4dmr906676oab.10.1707522765529; Fri, 09 Feb 2024 15:52:45 -0800 (PST) ARC-Seal: i=2; a=rsa-sha256; t=1707522765; cv=pass; d=google.com; s=arc-20160816; b=TfUJLg1g/cEE8hgOnauxVcINaOGFyDbXfhTAMpf1nWW4Y9Wk8CgEWABirXsC7NwldM 4Y6sk8LhyJdvDdffsj4ymQl0q6hkkGPZkQuZp4IsspdZUUghfAVOtyAcXyvGYJqeKX3g OYc+DFEymqFJ+Fj9oRYzAeJTj+NnqoRf1hyoADADGbU94fwtKJi4UN1GbyrsvaT0z/TD dw50nn+zOiYKv25Q06lPHINUpF/m49NF2ya9nG67Whbt2CdozDb0F1tCbTKxGNlLw8nK 96RCmpjt0M7CEpkgYp0yK9r3nj8+dEtvfYjUQs4yuHZxjQiHXUc3Dbk5uRQ/wmqyqOKH 4kCw== ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=references:cms-type:mime-version:list-unsubscribe:list-subscribe :list-id:precedence:content-transfer-encoding:content-id :content-language:accept-language:in-reply-to:message-id:date :thread-index:thread-topic:subject:to:from:dkim-signature :dkim-filter; bh=NWHjQPlaGIQ9OtliJHmGY1JMnvToGgMO8MA8ryL6e5g=; fh=NzlMcLPPH0zx9RKbeZxfKKeG7n5nkaPilS0a+kzasdE=; b=ChJscN+vrfLDc2dtj24E7Hmbng5e2QOtCrIOadJHCCGeubryTPWyizvFG+4kbOCx4R ZCAeuvLxAMYfjjgJEuZtHaGFty1s7qQfnvV+dWpj9RzbMcimfyqJ1jYwlYM1cp3WA07+ e+x9BBmQkGolCVGMF2P8LTH6qO0+xjhr/4ROz/1T4RLhmtaojQqRtEKM30hRhrLaMZ2X 5JsoAf2KU1u5TuZBjT04yZ1Rc2957JZrcJxWLOQ0VNHouMYzFFHlU0GIK0+8ES+6V2PE yhfdIO72sJThyYLShYfrKF6hnANvAENyH9Ifg/uPMvrLfeOTHGL+H0uopqenHvS5Sa1E 6TNA==; dara=google.com ARC-Authentication-Results: i=2; mx.google.com; dkim=pass header.i=@samsung.com header.s=mail20170921 header.b=LusQBJ16; arc=pass (i=1 spf=pass spfdomain=samsung.com dkim=pass dkdomain=samsung.com dmarc=pass fromdomain=samsung.com); spf=pass (google.com: domain of linux-kernel+bounces-60157-ouuuleilei=gmail.com@vger.kernel.org designates 139.178.88.99 as permitted sender) smtp.mailfrom="linux-kernel+bounces-60157-ouuuleilei=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=samsung.com X-Forwarded-Encrypted: i=2; AJvYcCXbb5pgh22H9XZbVC8bmTT6YsGYoGYG+SIJZW5VWVJ8EruwUGXeM/5dVPX1kuJ2p+quWhygMLDef1/YPKzZNSGxuCcLvg== Received: from sv.mirrors.kernel.org (sv.mirrors.kernel.org. [139.178.88.99]) by mx.google.com with ESMTPS id b27-20020a63715b000000b005897813624fsi2546288pgn.476.2024.02.09.15.52.45 for <ouuuleilei@gmail.com> (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 09 Feb 2024 15:52:45 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel+bounces-60157-ouuuleilei=gmail.com@vger.kernel.org designates 139.178.88.99 as permitted sender) client-ip=139.178.88.99; Authentication-Results: mx.google.com; dkim=pass header.i=@samsung.com header.s=mail20170921 header.b=LusQBJ16; arc=pass (i=1 spf=pass spfdomain=samsung.com dkim=pass dkdomain=samsung.com dmarc=pass fromdomain=samsung.com); spf=pass (google.com: domain of linux-kernel+bounces-60157-ouuuleilei=gmail.com@vger.kernel.org designates 139.178.88.99 as permitted sender) smtp.mailfrom="linux-kernel+bounces-60157-ouuuleilei=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=samsung.com Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by sv.mirrors.kernel.org (Postfix) with ESMTPS id 4A16E28C288 for <ouuuleilei@gmail.com>; Fri, 9 Feb 2024 23:52:45 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id C247738DF8; Fri, 9 Feb 2024 23:52:21 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=samsung.com header.i=@samsung.com header.b="LusQBJ16" Received: from mailout2.w2.samsung.com (mailout2.w2.samsung.com [211.189.100.12]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2C71538FB9; Fri, 9 Feb 2024 23:52:15 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=211.189.100.12 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707522738; cv=none; b=CIe4jl7yvndKnDwiIo1p+tgNqbwY/9FlSbJw35lS3fc4cMwcl+YqbGS4fanvoelox73mlhnk/qgf9a41cONa4j/EA1Ep6LUOFGOUHWUp8Yd9q6gUQMzdFGfy1afwD1FsXP1GlHstoQoPWYHTeGolamTauUcpOVK2jEGr3B87JOA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707522738; c=relaxed/simple; bh=NWHjQPlaGIQ9OtliJHmGY1JMnvToGgMO8MA8ryL6e5g=; h=From:To:Subject:Date:Message-ID:In-Reply-To:Content-Type: MIME-Version:References; b=ZLVeItb58A219DQmxjOGVLQ8j2C6XD+FfF+pTH7lQ30EylsNhwSZcdUlQ2XBYN2AoumnsEvZ/J+pMNSZuDoTPm+9WGv3VTVIpcj5xmT4d98mp+t9qoLS1Mr0/9aYzSZKUWC8gGtd5ml+gKUL7wSZb3gnDHgEjQH/b8mbbji5krI= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=samsung.com; spf=pass smtp.mailfrom=samsung.com; dkim=pass (1024-bit key) header.d=samsung.com header.i=@samsung.com header.b=LusQBJ16; arc=none smtp.client-ip=211.189.100.12 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=samsung.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=samsung.com Received: from uscas1p2.samsung.com (unknown [182.198.245.207]) by mailout2.w2.samsung.com (KnoxPortal) with ESMTP id 20240209235214usoutp0211a981d2cc3704b6e04b6afb426759d7~yVuFVnazW0735807358usoutp02P; Fri, 9 Feb 2024 23:52:14 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 mailout2.w2.samsung.com 20240209235214usoutp0211a981d2cc3704b6e04b6afb426759d7~yVuFVnazW0735807358usoutp02P DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=samsung.com; s=mail20170921; t=1707522734; bh=NWHjQPlaGIQ9OtliJHmGY1JMnvToGgMO8MA8ryL6e5g=; h=From:To:Subject:Date:In-Reply-To:References:From; b=LusQBJ16MTV4T+mTwWDCGSexl6lXIN7AQppjKIB6Hmy5/0FLZGqWv6b0NgZolkSY+ lZjvi1QT4Zeu5qbVMrKyVDAVxkcC3SOiamThh6szztPK/pY394hFkyV76E1lFF4+28 /ODqlv7WejgD+GiUpDNN7bOSDLV4z7vwI3qWRNOM= Received: from ussmges1new.samsung.com (u109.gpu85.samsung.co.kr [203.254.195.109]) by uscas1p1.samsung.com (KnoxPortal) with ESMTP id 20240209235213uscas1p1405627f096500d89d28cc2b46806a5fd~yVuEi2rxO2023420234uscas1p1P; Fri, 9 Feb 2024 23:52:13 +0000 (GMT) Received: from uscas1p1.samsung.com ( [182.198.245.206]) by ussmges1new.samsung.com (USCPEMTA) with SMTP id 73.51.09678.DAAB6C56; Fri, 9 Feb 2024 18:52:13 -0500 (EST) Received: from ussmgxs1new.samsung.com (u89.gpu85.samsung.co.kr [203.254.195.89]) by uscas1p2.samsung.com (KnoxPortal) with ESMTP id 20240209235213uscas1p2e8de2bdf05e6e7cba51bd41ddb42a8e4~yVuEO-23j3138731387uscas1p2s; Fri, 9 Feb 2024 23:52:13 +0000 (GMT) X-AuditID: cbfec36d-85fff700000025ce-c6-65c6baadf240 Received: from SSI-EX2.ssi.samsung.com ( [105.128.3.67]) by ussmgxs1new.samsung.com (USCPEXMTA) with SMTP id A1.0B.50167.DAAB6C56; Fri, 9 Feb 2024 18:52:13 -0500 (EST) Received: from SSI-EX2.ssi.samsung.com (105.128.2.227) by SSI-EX2.ssi.samsung.com (105.128.2.227) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384) id 15.1.2375.24; Fri, 9 Feb 2024 15:52:12 -0800 Received: from SSI-EX2.ssi.samsung.com ([105.128.2.227]) by SSI-EX2.ssi.samsung.com ([105.128.2.227]) with mapi id 15.01.2375.024; Fri, 9 Feb 2024 15:52:12 -0800 From: Jim Harris <jim.harris@samsung.com> To: Bjorn Helgaas <bhelgaas@google.com>, "linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>, "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>, Leon Romanovsky <leonro@nvidia.com>, "Jason Gunthorpe" <jgg@nvidia.com>, Alex Williamson <alex.williamson@redhat.com>, "pierre.cregut@orange.com" <pierre.cregut@orange.com> Subject: [PATCH v2 1/2] PCI/IOV: Revert "PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes" Thread-Topic: [PATCH v2 1/2] PCI/IOV: Revert "PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes" Thread-Index: AQHaW7MAtwi2xwjfZkaUZDAOVEJQAw== Date: Fri, 9 Feb 2024 23:52:12 +0000 Message-ID: <170752273224.1693615.11371097645648272257.stgit@bgt-140510-bm01.eng.stellus.in> In-Reply-To: <170752254154.1693615.9176696143128338408.stgit@bgt-140510-bm01.eng.stellus.in> Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: Content-Type: text/plain; charset="utf-8" Content-ID: <89791AD5EB1C934187241FFFDADB64F6@ssi.samsung.com> Content-Transfer-Encoding: base64 Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: <linux-kernel.vger.kernel.org> List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org> List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org> MIME-Version: 1.0 X-CFilter-Loop: Reflected X-Brightmail-Tracker: H4sIAAAAAAAAA+NgFlrGKsWRmVeSWpSXmKPExsWy7djXc7prdx1LNfg6Wdni2/8eNoslTRkW V/7tYbTYtOEJi8XlXXPYLM7OO85msf7rezYHdo8Fm0o9epvfsXm0PDvJ5vF+31U2j8+b5AJY o7hsUlJzMstSi/TtErgymmbsYSnYJlBxe3EDewNjg0AXIyeHhICJxMSnRxlBbCGBlYwSC3qU IexWJomLr4Ngal427GPtYuQCiq9hlNh9YxorRNFHRol/5+0gEksZJSb397KDJNgENCV+XVnD BJIQEbjKJDF59QdmkISwQKLEjeb3LCC2iECaxLXTlxghbD2JR/+fgtksAioS/26uAavnFYiR 2PPxLROIzQlkv+vbBBZnFBCT+H5qDVicWUBc4taT+UwQpwpKLJq9hxnCFpP4t+shG4StKHH/ +0ug4ziA6jUl1u/Sh2i1k3gxdx8bhK0oMaX7ITvEWkGJkzOfsEC0SkocXHGDBeQXCYEzHBLT Xn5mhUi4SLw8vYcdwpaWuHp9KtTebImV6zuYQHZJCBRINByBhqK1xMI/66FO5pP4++sR4wRG 5VlIPpiFcN0sJNfNQnLdLCTXLWBkXcUoXlpcnJueWmyYl1quV5yYW1yal66XnJ+7iRGYlE7/ O5y7g3HHrY96hxiZOBgPMUpwMCuJ8IYsOZIqxJuSWFmVWpQfX1Sak1p8iFGag0VJnNfQ9mSy kEB6YklqdmpqQWoRTJaJg1OqgclJ++PdVTMvKqUGN+qcm60vt7316E/914bbYzbu3/T/N4fb kzcKmskGc+tKm5PutIifbHcrnGcg9MDJVeZ0w6Sp500kxEKzVsxQmPgrzHd6UuubvZOOrp2U Ktz/Sr+i/+LbxRV3WGtXS60qk5olfOrn+u9vM7vX1oqsqjx4etYUFS0Fq5nv7V+kbr/8QtTn 4nGJ5WU5f8SP7f7eucHm1fuV7C5nP7mXCcp8PngvzIT7XRb/tsY5tnZvDDtuRC8X0bql8m/e xxPTuOflvlL9edZ7ocyczmfH4m5uXj35b+B34TvzElacfih4mKPw1SenIzXft13T5F3TvK9k 5g7R4rp/z9XkVqb8Pldlf09w1eMEOyWW4oxEQy3mouJEANMPk6i5AwAA X-Brightmail-Tracker: H4sIAAAAAAAAA+NgFtrNIsWRmVeSWpSXmKPExsWS2cDsrLt217FUg2e7eS2+/e9hs1jSlGFx 5d8eRotNG56wWFzeNYfN4uy842wW67++Z3Ng91iwqdSjt/kdm0fLs5NsHu/3XWXz+LxJLoA1 issmJTUnsyy1SN8ugSujacYeloJtAhW3FzewNzA2CHQxcnJICJhIvGzYx9rFyMUhJLCKUWLF lbUsEM5HRoklS5dAZZYySiw8+4kNpIVNQFPi15U1TCAJEYGrTBKTV39gBkkICyRKXL08jxHE FhFIk7h2+hKUrSfx6P9TMJtFQEXi3801YPW8AjESez6+ZQKxhQSiJeZPnMwOYnMCxd/1bQKr YRQQk/h+ag1YDbOAuMStJ/OZIO4WkFiy5zwzhC0q8fLxP1YIW1Hi/veXQHM4gOo1Jdbv0odo tZN4MXcfG4StKDGl+yE7xAmCEidnPmGBaJWUOLjiBssERvFZSLbNQpg0C8mkWUgmzUIyaQEj 6ypG8dLi4tz0imLDvNRyveLE3OLSvHS95PzcTYzA+D3973DkDsajtz7qHWJk4mA8xCjBwawk whuy5EiqEG9KYmVValF+fFFpTmrxIUZpDhYlcd67DzRShQTSE0tSs1NTC1KLYLJMHJxSDUwh Bi/tDFxrHTed+iEn92TBBPt+U2WORczPtqw6+P3f4yCjHetUgiSvzLz38cHu1J7lvUveCPb8 XbpWft7iWQElPFEPWQPrFv6YdrfRsejP1Dm3801N/I0cylOrW1Nl3f6lLK2YZnLkyeKQtO/b FpnuCwqst04uVTr76oH1kkO/7V/HRcmo+dlGum27lbDtZ1ce7web8xtjSs7t3nVg3b1nRs9f XU+WbXNssit4pPBtr+/sUC6NzI8pSfpqkzZeWnl3xg/TuaZOru3d4dnlsZwXF9TXSZyJTfnf NLEi+/gsj5N7L9/rWvm0+/JL371hVq0pLVvvGO6slT/oKiwhstZtz7E9308ayq+4nBNsJ1Ko xFKckWioxVxUnAgAc2HpCE4DAAA= X-CMS-MailID: 20240209235213uscas1p2e8de2bdf05e6e7cba51bd41ddb42a8e4 CMS-TYPE: 301P X-CMS-RootMailID: 20240209235213uscas1p2e8de2bdf05e6e7cba51bd41ddb42a8e4 References: <170752254154.1693615.9176696143128338408.stgit@bgt-140510-bm01.eng.stellus.in> <CGME20240209235213uscas1p2e8de2bdf05e6e7cba51bd41ddb42a8e4@uscas1p2.samsung.com> X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1790467391480335679 X-GMAIL-MSGID: 1790467391480335679 |
Series |
PCI/IOV: sriov_numvfs bug fixes
|
|
Commit Message
Jim Harris
Feb. 9, 2024, 11:52 p.m. UTC
If an SR-IOV enabled device is held by vfio, and the device is removed, vfio will hold device lock and notify userspace of the removal. If userspace reads the sriov_numvfs sysfs entry, that thread will be blocked since sriov_numvfs_show() also tries to acquire the device lock. If that same thread is responsible for releasing the device to vfio, it results in a deadlock. The proper way to detect a change to the num_VFs value is to listen for a sysfs event, not to add a device_lock() on the attribute _show() in the kernel. This reverts commit 35ff867b76576e32f34c698ccd11343f7d616204. Revert had a small conflict, the sprintf() is now changed to sysfs_emit(). Link: https://lore.kernel.org/linux-pci/ZXJI5+f8bUelVXqu@ubuntu/ Suggested-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jim Harris <jim.harris@samsung.com> --- drivers/pci/iov.c | 8 +------- 1 file changed, 1 insertion(+), 7 deletions(-)
Comments
On 2/9/24 3:52 PM, Jim Harris wrote: > If an SR-IOV enabled device is held by vfio, and the device is removed, > vfio will hold device lock and notify userspace of the removal. If > userspace reads the sriov_numvfs sysfs entry, that thread will be blocked > since sriov_numvfs_show() also tries to acquire the device lock. If that > same thread is responsible for releasing the device to vfio, it results in > a deadlock. > > The proper way to detect a change to the num_VFs value is to listen for a > sysfs event, not to add a device_lock() on the attribute _show() in the > kernel. Since you are reverting a commit that synchronizes SysFS read /write, please add some comments about why it is not an issue anymore. > > This reverts commit 35ff867b76576e32f34c698ccd11343f7d616204. > Revert had a small conflict, the sprintf() is now changed to sysfs_emit(). > > Link: https://lore.kernel.org/linux-pci/ZXJI5+f8bUelVXqu@ubuntu/ > Suggested-by: Leon Romanovsky <leonro@nvidia.com> > Reviewed-by: Leon Romanovsky <leonro@nvidia.com> > Signed-off-by: Jim Harris <jim.harris@samsung.com> > --- > drivers/pci/iov.c | 8 +------- > 1 file changed, 1 insertion(+), 7 deletions(-) > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c > index aaa33e8dc4c9..0ca20cd518d5 100644 > --- a/drivers/pci/iov.c > +++ b/drivers/pci/iov.c > @@ -395,14 +395,8 @@ static ssize_t sriov_numvfs_show(struct device *dev, > char *buf) > { > struct pci_dev *pdev = to_pci_dev(dev); > - u16 num_vfs; > - > - /* Serialize vs sriov_numvfs_store() so readers see valid num_VFs */ > - device_lock(&pdev->dev); > - num_vfs = pdev->sriov->num_VFs; > - device_unlock(&pdev->dev); > > - return sysfs_emit(buf, "%u\n", num_vfs); > + return sysfs_emit(buf, "%u\n", pdev->sriov->num_VFs); > } > > /* >
On Fri, Feb 09, 2024 at 07:20:28PM -0800, Kuppuswamy Sathyanarayanan wrote: > > On 2/9/24 3:52 PM, Jim Harris wrote: > > If an SR-IOV enabled device is held by vfio, and the device is removed, > > vfio will hold device lock and notify userspace of the removal. If > > userspace reads the sriov_numvfs sysfs entry, that thread will be blocked > > since sriov_numvfs_show() also tries to acquire the device lock. If that > > same thread is responsible for releasing the device to vfio, it results in > > a deadlock. > > > > The proper way to detect a change to the num_VFs value is to listen for a > > sysfs event, not to add a device_lock() on the attribute _show() in the > > kernel. > > Since you are reverting a commit that synchronizes SysFS read > /write, please add some comments about why it is not an > issue anymore. It was never an issue, the idea that sysfs read and write should be serialized by kernel is not correct by definition. Thanks > > > > > This reverts commit 35ff867b76576e32f34c698ccd11343f7d616204. > > Revert had a small conflict, the sprintf() is now changed to sysfs_emit(). > > > > Link: https://lore.kernel.org/linux-pci/ZXJI5+f8bUelVXqu@ubuntu/ > > Suggested-by: Leon Romanovsky <leonro@nvidia.com> > > Reviewed-by: Leon Romanovsky <leonro@nvidia.com> > > Signed-off-by: Jim Harris <jim.harris@samsung.com> > > --- > > drivers/pci/iov.c | 8 +------- > > 1 file changed, 1 insertion(+), 7 deletions(-) > > > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c > > index aaa33e8dc4c9..0ca20cd518d5 100644 > > --- a/drivers/pci/iov.c > > +++ b/drivers/pci/iov.c > > @@ -395,14 +395,8 @@ static ssize_t sriov_numvfs_show(struct device *dev, > > char *buf) > > { > > struct pci_dev *pdev = to_pci_dev(dev); > > - u16 num_vfs; > > - > > - /* Serialize vs sriov_numvfs_store() so readers see valid num_VFs */ > > - device_lock(&pdev->dev); > > - num_vfs = pdev->sriov->num_VFs; > > - device_unlock(&pdev->dev); > > > > - return sysfs_emit(buf, "%u\n", num_vfs); > > + return sysfs_emit(buf, "%u\n", pdev->sriov->num_VFs); > > } > > > > /* > > > -- > Sathyanarayanan Kuppuswamy > Linux Kernel Developer >
On 2/11/24 12:48 AM, Leon Romanovsky wrote: > On Fri, Feb 09, 2024 at 07:20:28PM -0800, Kuppuswamy Sathyanarayanan wrote: >> On 2/9/24 3:52 PM, Jim Harris wrote: >>> If an SR-IOV enabled device is held by vfio, and the device is removed, >>> vfio will hold device lock and notify userspace of the removal. If >>> userspace reads the sriov_numvfs sysfs entry, that thread will be blocked >>> since sriov_numvfs_show() also tries to acquire the device lock. If that >>> same thread is responsible for releasing the device to vfio, it results in >>> a deadlock. >>> >>> The proper way to detect a change to the num_VFs value is to listen for a >>> sysfs event, not to add a device_lock() on the attribute _show() in the >>> kernel. >> Since you are reverting a commit that synchronizes SysFS read >> /write, please add some comments about why it is not an >> issue anymore. > It was never an issue, the idea that sysfs read and write should be serialized by kernel > is not correct by definition. What: /sys/bus/pci/devices/.../sriov_numvfs Date: November 2012 Contact: Donald Dutile <ddutile@redhat.com> Description: This file appears when a physical PCIe device supports SR-IOV. Userspace applications can read and write to this file to determine and control the enablement or disablement of Virtual Functions (VFs) on the physical function (PF). A read of this file will return the number of VFs that are enabled on this PF. I am not very clear about the user of this SysFs. But, as per above description, this sysfs seems to controls the number of VFs. A typical usage is to allow user to write a value and then read to check the enabled/disabled number of VMs, right? If you are not synchronizing, then the value returned may not reflect the actual number of enabled / disabled VFs. So wont this change affect the existing user of this SysFS. > > Thanks > >>> This reverts commit 35ff867b76576e32f34c698ccd11343f7d616204. >>> Revert had a small conflict, the sprintf() is now changed to sysfs_emit(). >>> >>> Link: https://lore.kernel.org/linux-pci/ZXJI5+f8bUelVXqu@ubuntu/ >>> Suggested-by: Leon Romanovsky <leonro@nvidia.com> >>> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> >>> Signed-off-by: Jim Harris <jim.harris@samsung.com> >>> --- >>> drivers/pci/iov.c | 8 +------- >>> 1 file changed, 1 insertion(+), 7 deletions(-) >>> >>> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c >>> index aaa33e8dc4c9..0ca20cd518d5 100644 >>> --- a/drivers/pci/iov.c >>> +++ b/drivers/pci/iov.c >>> @@ -395,14 +395,8 @@ static ssize_t sriov_numvfs_show(struct device *dev, >>> char *buf) >>> { >>> struct pci_dev *pdev = to_pci_dev(dev); >>> - u16 num_vfs; >>> - >>> - /* Serialize vs sriov_numvfs_store() so readers see valid num_VFs */ >>> - device_lock(&pdev->dev); >>> - num_vfs = pdev->sriov->num_VFs; >>> - device_unlock(&pdev->dev); >>> >>> - return sysfs_emit(buf, "%u\n", num_vfs); >>> + return sysfs_emit(buf, "%u\n", pdev->sriov->num_VFs); >>> } >>> >>> /* >>> >> -- >> Sathyanarayanan Kuppuswamy >> Linux Kernel Developer >>
On Sun, Feb 11, 2024 at 10:48:44AM +0200, Leon Romanovsky wrote: > On Fri, Feb 09, 2024 at 07:20:28PM -0800, Kuppuswamy Sathyanarayanan wrote: > > On 2/9/24 3:52 PM, Jim Harris wrote: > > > If an SR-IOV enabled device is held by vfio, and the device is removed, > > > vfio will hold device lock and notify userspace of the removal. If > > > userspace reads the sriov_numvfs sysfs entry, that thread will be blocked > > > since sriov_numvfs_show() also tries to acquire the device lock. If that > > > same thread is responsible for releasing the device to vfio, it results in > > > a deadlock. > > > > > > The proper way to detect a change to the num_VFs value is to listen for a > > > sysfs event, not to add a device_lock() on the attribute _show() in the > > > kernel. The lock was not about detecting a change; Pierre did this: ip monitor dev ${DEVICE} | grep --line-buffered "^${id}:" | while read line; do \ cat ${path}/device/sriov_numvfs; \ which I assume works by listening for sysfs events. The problem was that after the event occurred, the sriov_numvfs read got a stale value (see https://bugzilla.kernel.org/show_bug.cgi?id=202991). So I would drop this sentence because I don't think it accurately reflects the reason for 35ff867b7657. > > Since you are reverting a commit that synchronizes SysFS read > > /write, please add some comments about why it is not an > > issue anymore. > > It was never an issue, the idea that sysfs read and write should be > serialized by kernel is not correct by definition. I think it *was* an issue. The behavior Pierre observed at was clearly wrong, and we added 35ff867b7657 ("PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes") to resolve it. We should try to avoid reintroducing the problem, so I think we should probably squash these two patches and describe it as a deadlock fix instead of dismissing 35ff867b7657 as being based on false premises. It would be awesome if you had time to verify that these patches also resolve the problem you saw, Pierre. I think we should also add: Fixes: 35ff867b7657 ("PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes") as a trigger for backporting this to kernels that include 35ff867b7657. Bjorn > > > This reverts commit 35ff867b76576e32f34c698ccd11343f7d616204. > > > Revert had a small conflict, the sprintf() is now changed to sysfs_emit(). > > > > > > Link: https://lore.kernel.org/linux-pci/ZXJI5+f8bUelVXqu@ubuntu/ > > > Suggested-by: Leon Romanovsky <leonro@nvidia.com> > > > Reviewed-by: Leon Romanovsky <leonro@nvidia.com> > > > Signed-off-by: Jim Harris <jim.harris@samsung.com> > > > --- > > > drivers/pci/iov.c | 8 +------- > > > 1 file changed, 1 insertion(+), 7 deletions(-) > > > > > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c > > > index aaa33e8dc4c9..0ca20cd518d5 100644 > > > --- a/drivers/pci/iov.c > > > +++ b/drivers/pci/iov.c > > > @@ -395,14 +395,8 @@ static ssize_t sriov_numvfs_show(struct device *dev, > > > char *buf) > > > { > > > struct pci_dev *pdev = to_pci_dev(dev); > > > - u16 num_vfs; > > > - > > > - /* Serialize vs sriov_numvfs_store() so readers see valid num_VFs */ > > > - device_lock(&pdev->dev); > > > - num_vfs = pdev->sriov->num_VFs; > > > - device_unlock(&pdev->dev); > > > > > > - return sysfs_emit(buf, "%u\n", num_vfs); > > > + return sysfs_emit(buf, "%u\n", pdev->sriov->num_VFs); > > > } > > > > > > /* > > > > > -- > > Sathyanarayanan Kuppuswamy > > Linux Kernel Developer > >
On Mon, Feb 12, 2024 at 02:27:14PM -0600, Bjorn Helgaas wrote: > On Sun, Feb 11, 2024 at 10:48:44AM +0200, Leon Romanovsky wrote: > > On Fri, Feb 09, 2024 at 07:20:28PM -0800, Kuppuswamy Sathyanarayanan wrote: > > > On 2/9/24 3:52 PM, Jim Harris wrote: > > > > If an SR-IOV enabled device is held by vfio, and the device is removed, > > > > vfio will hold device lock and notify userspace of the removal. If > > > > userspace reads the sriov_numvfs sysfs entry, that thread will be blocked > > > > since sriov_numvfs_show() also tries to acquire the device lock. If that > > > > same thread is responsible for releasing the device to vfio, it results in > > > > a deadlock. > > > > > > > > The proper way to detect a change to the num_VFs value is to listen for a > > > > sysfs event, not to add a device_lock() on the attribute _show() in the > > > > kernel. > > The lock was not about detecting a change; Pierre did this: > > ip monitor dev ${DEVICE} | grep --line-buffered "^${id}:" | while read line; do \ > cat ${path}/device/sriov_numvfs; \ > > which I assume works by listening for sysfs events. The problem was > that after the event occurred, the sriov_numvfs read got a stale value > (see https://bugzilla.kernel.org/show_bug.cgi?id=202991). I don't think 'ip monitor dev' listens for any sysfs events. Or at least if I have this running and write values to sriov_numvfs, I don't see any output. It looks like the original bug report was against v5.0 (matching by dates and the patch file attached). In that code, we have: kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE); iov->num_VFs = nr_virtfn; which is identical to how the code looks today. Is it possible that userspace could react to this uevent and read the stale num_VFs before iov->num_VFs gets written here? I mean, theoretically it's possible, but from the bug report it seems like the scenario Pierre was facing was 100% reproducible. It would be great if we could get input from Pierre on this. It isn't clear to me from the bug report what exactly is updating the sriov_numvfs sysfs entry, and what is triggering that update. We could also revisit my original suggestion, which was to use a discrete lock just for this sysfs entry, rather than overloading the device lock. That probably has lower risk of introducing an unintended regression. https://lore.kernel.org/linux-pci/ZXNNQkXzluoyeguu@bgt-140510-bm01.eng.stellus.in/ > > So I would drop this sentence because I don't think it accurately > reflects the reason for 35ff867b7657. > > > > Since you are reverting a commit that synchronizes SysFS read > > > /write, please add some comments about why it is not an > > > issue anymore. > > > > It was never an issue, the idea that sysfs read and write should be > > serialized by kernel is not correct by definition. > > I think it *was* an issue. The behavior Pierre observed at was > clearly wrong, and we added 35ff867b7657 ("PCI/IOV: Serialize sysfs > sriov_numvfs reads vs writes") to resolve it. > > We should try to avoid reintroducing the problem, so I think we should > probably squash these two patches and describe it as a deadlock fix > instead of dismissing 35ff867b7657 as being based on false premises. > > It would be awesome if you had time to verify that these patches also > resolve the problem you saw, Pierre. > > I think we should also add: > > Fixes: 35ff867b7657 ("PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes") > > as a trigger for backporting this to kernels that include > 35ff867b7657. > > Bjorn > > > > > This reverts commit 35ff867b76576e32f34c698ccd11343f7d616204. > > > > Revert had a small conflict, the sprintf() is now changed to sysfs_emit(). > > > > > > > > Link: https://lore.kernel.org/linux-pci/ZXJI5+f8bUelVXqu@ubuntu/ > > > > Suggested-by: Leon Romanovsky <leonro@nvidia.com> > > > > Reviewed-by: Leon Romanovsky <leonro@nvidia.com> > > > > Signed-off-by: Jim Harris <jim.harris@samsung.com> > > > > --- > > > > drivers/pci/iov.c | 8 +------- > > > > 1 file changed, 1 insertion(+), 7 deletions(-) > > > > > > > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c > > > > index aaa33e8dc4c9..0ca20cd518d5 100644 > > > > --- a/drivers/pci/iov.c > > > > +++ b/drivers/pci/iov.c > > > > @@ -395,14 +395,8 @@ static ssize_t sriov_numvfs_show(struct device *dev, > > > > char *buf) > > > > { > > > > struct pci_dev *pdev = to_pci_dev(dev); > > > > - u16 num_vfs; > > > > - > > > > - /* Serialize vs sriov_numvfs_store() so readers see valid num_VFs */ > > > > - device_lock(&pdev->dev); > > > > - num_vfs = pdev->sriov->num_VFs; > > > > - device_unlock(&pdev->dev); > > > > > > > > - return sysfs_emit(buf, "%u\n", num_vfs); > > > > + return sysfs_emit(buf, "%u\n", pdev->sriov->num_VFs); > > > > } > > > > > > > > /* > > > > > > > -- > > > Sathyanarayanan Kuppuswamy > > > Linux Kernel Developer > > >
On Mon, Feb 12, 2024 at 02:27:14PM -0600, Bjorn Helgaas wrote: > On Sun, Feb 11, 2024 at 10:48:44AM +0200, Leon Romanovsky wrote: > > On Fri, Feb 09, 2024 at 07:20:28PM -0800, Kuppuswamy Sathyanarayanan wrote: > > > On 2/9/24 3:52 PM, Jim Harris wrote: > > > > If an SR-IOV enabled device is held by vfio, and the device is removed, > > > > vfio will hold device lock and notify userspace of the removal. If > > > > userspace reads the sriov_numvfs sysfs entry, that thread will be blocked > > > > since sriov_numvfs_show() also tries to acquire the device lock. If that > > > > same thread is responsible for releasing the device to vfio, it results in > > > > a deadlock. > > > > > > > > The proper way to detect a change to the num_VFs value is to listen for a > > > > sysfs event, not to add a device_lock() on the attribute _show() in the > > > > kernel. > > The lock was not about detecting a change; Pierre did this: > > ip monitor dev ${DEVICE} | grep --line-buffered "^${id}:" | while read line; do \ > cat ${path}/device/sriov_numvfs; \ > > which I assume works by listening for sysfs events. It is not, "ip monitor ..." listens to netlink events emitted by netdev core and not sysfs events. Sysfs events are not involved in this case. > The problem was that after the event occurred, the sriov_numvfs > read got a stale value (see https://bugzilla.kernel.org/show_bug.cgi?id=202991). Yes, and it is outcome of such cross-subsytem involvement, which is racy by definition. Someone can come with even simpler example of why locking sysfs read and write is not a good idea. For example, let's consider the following scenario with two CPUs and locks on sysfs read and write: CPU1 CPU2 echo 1 > ${path}/device/sriov_numvfs context_switch -> cat ${path}/device/sriov_numvfs lock return 0 unlock context_switch <- lock set 1 unlock CPU1 CPU2 echo 1 > ${path}/device/sriov_numvfs lock set 1 unlock context_switch -> cat ${path}/device/sriov_numvfs lock return 1 unlock So same scenario will return different values if user doesn't protect such case with external to the kernel lock. But if we return back to Pierre report and if you want to provide completely bullet proof solution to solve cross-subsystem interaction, you will need to prohibit device probe till sriov_numvfs update is completed. However, it is overkill for something that is not a real issue. > > So I would drop this sentence because I don't think it accurately > reflects the reason for 35ff867b7657. > > > > Since you are reverting a commit that synchronizes SysFS read > > > /write, please add some comments about why it is not an > > > issue anymore. > > > > It was never an issue, the idea that sysfs read and write should be > > serialized by kernel is not correct by definition. > > I think it *was* an issue. The behavior Pierre observed at was > clearly wrong, I disagree with this sentence. > and we added 35ff867b7657 ("PCI/IOV: Serialize sysfs > sriov_numvfs reads vs writes") to resolve it. > > We should try to avoid reintroducing the problem, so I think we should > probably squash these two patches and describe it as a deadlock fix > instead of dismissing 35ff867b7657 as being based on false premises. > > It would be awesome if you had time to verify that these patches also > resolve the problem you saw, Pierre. They won't resolve his problem, because he is not listening to sysfs events, but rely on something from netdev side. > > I think we should also add: > > Fixes: 35ff867b7657 ("PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes") > > as a trigger for backporting this to kernels that include > 35ff867b7657. > > Bjorn > > > > > This reverts commit 35ff867b76576e32f34c698ccd11343f7d616204. > > > > Revert had a small conflict, the sprintf() is now changed to sysfs_emit(). > > > > > > > > Link: https://lore.kernel.org/linux-pci/ZXJI5+f8bUelVXqu@ubuntu/ > > > > Suggested-by: Leon Romanovsky <leonro@nvidia.com> > > > > Reviewed-by: Leon Romanovsky <leonro@nvidia.com> > > > > Signed-off-by: Jim Harris <jim.harris@samsung.com> > > > > --- > > > > drivers/pci/iov.c | 8 +------- > > > > 1 file changed, 1 insertion(+), 7 deletions(-) > > > > > > > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c > > > > index aaa33e8dc4c9..0ca20cd518d5 100644 > > > > --- a/drivers/pci/iov.c > > > > +++ b/drivers/pci/iov.c > > > > @@ -395,14 +395,8 @@ static ssize_t sriov_numvfs_show(struct device *dev, > > > > char *buf) > > > > { > > > > struct pci_dev *pdev = to_pci_dev(dev); > > > > - u16 num_vfs; > > > > - > > > > - /* Serialize vs sriov_numvfs_store() so readers see valid num_VFs */ > > > > - device_lock(&pdev->dev); > > > > - num_vfs = pdev->sriov->num_VFs; > > > > - device_unlock(&pdev->dev); > > > > > > > > - return sysfs_emit(buf, "%u\n", num_vfs); > > > > + return sysfs_emit(buf, "%u\n", pdev->sriov->num_VFs); > > > > } > > > > > > > > /* > > > > > > > -- > > > Sathyanarayanan Kuppuswamy > > > Linux Kernel Developer > > > >
On Mon, Feb 12, 2024 at 10:59:03PM +0000, Jim Harris wrote: > On Mon, Feb 12, 2024 at 02:27:14PM -0600, Bjorn Helgaas wrote: > > On Sun, Feb 11, 2024 at 10:48:44AM +0200, Leon Romanovsky wrote: > > > On Fri, Feb 09, 2024 at 07:20:28PM -0800, Kuppuswamy Sathyanarayanan wrote: > > > > On 2/9/24 3:52 PM, Jim Harris wrote: > > > > > If an SR-IOV enabled device is held by vfio, and the device is removed, > > > > > vfio will hold device lock and notify userspace of the removal. If > > > > > userspace reads the sriov_numvfs sysfs entry, that thread will be blocked > > > > > since sriov_numvfs_show() also tries to acquire the device lock. If that > > > > > same thread is responsible for releasing the device to vfio, it results in > > > > > a deadlock. > > > > > > > > > > The proper way to detect a change to the num_VFs value is to listen for a > > > > > sysfs event, not to add a device_lock() on the attribute _show() in the > > > > > kernel. > > > > The lock was not about detecting a change; Pierre did this: > > > > ip monitor dev ${DEVICE} | grep --line-buffered "^${id}:" | while read line; do \ > > cat ${path}/device/sriov_numvfs; \ > > > > which I assume works by listening for sysfs events. The problem was > > that after the event occurred, the sriov_numvfs read got a stale value > > (see https://bugzilla.kernel.org/show_bug.cgi?id=202991). > > I don't think 'ip monitor dev' listens for any sysfs events. Or at least if > I have this running and write values to sriov_numvfs, I don't see any > output. > > It looks like the original bug report was against v5.0 (matching by dates > and the patch file attached). In that code, we have: > > kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE); > iov->num_VFs = nr_virtfn; > > which is identical to how the code looks today. Is it possible that > userspace could react to this uevent and read the stale num_VFs before > iov->num_VFs gets written here? I mean, theoretically it's possible, but > from the bug report it seems like the scenario Pierre was facing was > 100% reproducible. > > It would be great if we could get input from Pierre on this. It isn't clear > to me from the bug report what exactly is updating the sriov_numvfs sysfs > entry, and what is triggering that update. > > We could also revisit my original suggestion, which was to use a > discrete lock just for this sysfs entry, rather than overloading the > device lock. That probably has lower risk of introducing an unintended > regression. The idea that lock issues are need to be solved by adding more locks doesn't sound good to me. Thanks > > https://lore.kernel.org/linux-pci/ZXNNQkXzluoyeguu@bgt-140510-bm01.eng.stellus.in/ > > > > > So I would drop this sentence because I don't think it accurately > > reflects the reason for 35ff867b7657. > > > > > > Since you are reverting a commit that synchronizes SysFS read > > > > /write, please add some comments about why it is not an > > > > issue anymore. > > > > > > It was never an issue, the idea that sysfs read and write should be > > > serialized by kernel is not correct by definition. > > > > I think it *was* an issue. The behavior Pierre observed at was > > clearly wrong, and we added 35ff867b7657 ("PCI/IOV: Serialize sysfs > > sriov_numvfs reads vs writes") to resolve it. > > > > We should try to avoid reintroducing the problem, so I think we should > > probably squash these two patches and describe it as a deadlock fix > > instead of dismissing 35ff867b7657 as being based on false premises. > > > > It would be awesome if you had time to verify that these patches also > > resolve the problem you saw, Pierre. > > > > I think we should also add: > > > > Fixes: 35ff867b7657 ("PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes") > > > > as a trigger for backporting this to kernels that include > > 35ff867b7657. > > > > Bjorn > > > > > > > This reverts commit 35ff867b76576e32f34c698ccd11343f7d616204. > > > > > Revert had a small conflict, the sprintf() is now changed to sysfs_emit(). > > > > > > > > > > Link: https://lore.kernel.org/linux-pci/ZXJI5+f8bUelVXqu@ubuntu/ > > > > > Suggested-by: Leon Romanovsky <leonro@nvidia.com> > > > > > Reviewed-by: Leon Romanovsky <leonro@nvidia.com> > > > > > Signed-off-by: Jim Harris <jim.harris@samsung.com> > > > > > --- > > > > > drivers/pci/iov.c | 8 +------- > > > > > 1 file changed, 1 insertion(+), 7 deletions(-) > > > > > > > > > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c > > > > > index aaa33e8dc4c9..0ca20cd518d5 100644 > > > > > --- a/drivers/pci/iov.c > > > > > +++ b/drivers/pci/iov.c > > > > > @@ -395,14 +395,8 @@ static ssize_t sriov_numvfs_show(struct device *dev, > > > > > char *buf) > > > > > { > > > > > struct pci_dev *pdev = to_pci_dev(dev); > > > > > - u16 num_vfs; > > > > > - > > > > > - /* Serialize vs sriov_numvfs_store() so readers see valid num_VFs */ > > > > > - device_lock(&pdev->dev); > > > > > - num_vfs = pdev->sriov->num_VFs; > > > > > - device_unlock(&pdev->dev); > > > > > > > > > > - return sysfs_emit(buf, "%u\n", num_vfs); > > > > > + return sysfs_emit(buf, "%u\n", pdev->sriov->num_VFs); > > > > > } > > > > > > > > > > /* > > > > > > > > > -- > > > > Sathyanarayanan Kuppuswamy > > > > Linux Kernel Developer > > > >
First sorry for not answering earlier but it is a long time ago. I do not work on the topic any more (a monitoring tool Skydive, an open source project no more actively developed as far as I know) and only have vague memories of it. > > kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE); > iov->num_VFs = nr_virtfn; > > which is identical to how the code looks today. Is it possible that > userspace could react to this uevent and read the stale num_VFs before > iov->num_VFs gets written here? I mean, theoretically it's possible, but > from the bug report it seems like the scenario Pierre was facing was > 100% reproducible. From my memories yes that was exactly the problem. Any stable method that could detect the change of configuration in user land and ensure that we get a reliable value of num_vfs after we received it would be fine. Best regards, Pierre Crégut ____________________________________________________________________________________________________________ Ce message et ses pieces jointes peuvent contenir des informations confidentielles ou privilegiees et ne doivent donc pas etre diffuses, exploites ou copies sans autorisation. Si vous avez recu ce message par erreur, veuillez le signaler a l'expediteur et le detruire ainsi que les pieces jointes. Les messages electroniques etant susceptibles d'alteration, Orange decline toute responsabilite si ce message a ete altere, deforme ou falsifie. Merci. This message and its attachments may contain confidential or privileged information that may be protected by law; they should not be distributed, used or copied without authorisation. If you have received this email in error, please notify the sender and delete this message and its attachments. As emails may be altered, Orange is not liable for messages that have been modified, changed or falsified. Thank you.
On Mon, Feb 12, 2024 at 10:59:03PM +0000, Jim Harris wrote: > On Mon, Feb 12, 2024 at 02:27:14PM -0600, Bjorn Helgaas wrote: > > On Sun, Feb 11, 2024 at 10:48:44AM +0200, Leon Romanovsky wrote: > > > On Fri, Feb 09, 2024 at 07:20:28PM -0800, Kuppuswamy Sathyanarayanan wrote: > > > > On 2/9/24 3:52 PM, Jim Harris wrote: > > > > > If an SR-IOV enabled device is held by vfio, and the device is removed, > > > > > vfio will hold device lock and notify userspace of the removal. If > > > > > userspace reads the sriov_numvfs sysfs entry, that thread will be blocked > > > > > since sriov_numvfs_show() also tries to acquire the device lock. If that > > > > > same thread is responsible for releasing the device to vfio, it results in > > > > > a deadlock. > > > > > > > > > > The proper way to detect a change to the num_VFs value is to listen for a > > > > > sysfs event, not to add a device_lock() on the attribute _show() in the > > > > > kernel. > > > > The lock was not about detecting a change; Pierre did this: > > > > ip monitor dev ${DEVICE} | grep --line-buffered "^${id}:" | while read line; do \ > > cat ${path}/device/sriov_numvfs; \ > > > > which I assume works by listening for sysfs events. The problem was > > that after the event occurred, the sriov_numvfs read got a stale value > > (see https://bugzilla.kernel.org/show_bug.cgi?id=202991). > > I don't think 'ip monitor dev' listens for any sysfs events. Or at least if > I have this running and write values to sriov_numvfs, I don't see any > output. The issue is that the sysfs change inadvertently throws out a netlink event (or udev event, or whatever) and something can observe that event and then turn around and read the sysfs and observe a sysfs result that hasn't caught up to the event launch. The lock fixed this because it held it across the event launch and the update of the internal state. > It looks like the original bug report was against v5.0 (matching by dates > and the patch file attached). In that code, we have: > > kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE); > iov->num_VFs = nr_virtfn; This is a udev event, I suspect the ip monitor event was thrown by driver binding during the VF creation. Jason
On Tue, Feb 13, 2024 at 09:34:50AM +0200, Leon Romanovsky wrote: > On Mon, Feb 12, 2024 at 02:27:14PM -0600, Bjorn Helgaas wrote: > > On Sun, Feb 11, 2024 at 10:48:44AM +0200, Leon Romanovsky wrote: > > > On Fri, Feb 09, 2024 at 07:20:28PM -0800, Kuppuswamy Sathyanarayanan wrote: > > > > On 2/9/24 3:52 PM, Jim Harris wrote: > > > > > If an SR-IOV enabled device is held by vfio, and the device > > > > > is removed, vfio will hold device lock and notify userspace > > > > > of the removal. If userspace reads the sriov_numvfs sysfs > > > > > entry, that thread will be blocked since sriov_numvfs_show() > > > > > also tries to acquire the device lock. If that same thread > > > > > is responsible for releasing the device to vfio, it results > > > > > in a deadlock. > > > > > > > > > > The proper way to detect a change to the num_VFs value is to > > > > > listen for a sysfs event, not to add a device_lock() on the > > > > > attribute _show() in the kernel. > > > > The lock was not about detecting a change; Pierre did this: > > > > ip monitor dev ${DEVICE} | grep --line-buffered "^${id}:" | while read line; do \ > > cat ${path}/device/sriov_numvfs; \ > > > > which I assume works by listening for sysfs events. > > It is not, "ip monitor ..." listens to netlink events emitted by > netdev core and not sysfs events. Sysfs events are not involved in > this case. Thanks for correcting my hasty assumption! > > The problem was that after the event occurred, the sriov_numvfs > > read got a stale value (see https://bugzilla.kernel.org/show_bug.cgi?id=202991). > > Yes, and it is outcome of such cross-subsytem involvement, which > is racy by definition. Someone can come with even simpler example of why > locking sysfs read and write is not a good idea. > > For example, let's consider the following scenario with two CPUs and > locks on sysfs read and write: > > CPU1 CPU2 > echo 1 > ${path}/device/sriov_numvfs > context_switch -> > cat ${path}/device/sriov_numvfs > lock > return 0 > unlock > context_switch <- > lock > set 1 > unlock > > CPU1 CPU2 > echo 1 > ${path}/device/sriov_numvfs > lock > set 1 > unlock > context_switch -> > cat ${path}/device/sriov_numvfs > lock > return 1 > unlock > > So same scenario will return different values if user doesn't protect > such case with external to the kernel lock. > > But if we return back to Pierre report and if you want to provide > completely bullet proof solution to solve cross-subsystem interaction, > you will need to prohibit device probe till sriov_numvfs update is completed. > However, it is overkill for something that is not a real issue. Pierre wanted to detect the configuration change and learn the new num_vfs, which seems like a reasonable thing to do. Is there a way to do both via netlink or some other mechanism? > > So I would drop this sentence because I don't think it accurately > > reflects the reason for 35ff867b7657. > > > > > > Since you are reverting a commit that synchronizes SysFS read > > > > /write, please add some comments about why it is not an > > > > issue anymore. > > > > > > It was never an issue, the idea that sysfs read and write should be > > > serialized by kernel is not correct by definition. > > > > I think it *was* an issue. The behavior Pierre observed at was > > clearly wrong, > > I disagree with this sentence. > > > and we added 35ff867b7657 ("PCI/IOV: Serialize sysfs > > sriov_numvfs reads vs writes") to resolve it. > > > > We should try to avoid reintroducing the problem, so I think we should > > probably squash these two patches and describe it as a deadlock fix > > instead of dismissing 35ff867b7657 as being based on false premises. > > > > It would be awesome if you had time to verify that these patches also > > resolve the problem you saw, Pierre. > > They won't resolve his problem, because he is not listening to sysfs > events, but rely on something from netdev side. I guess that means that if we apply this revert, the problem Pierre reported will return. Obviously the deadlock is more important than the inconsistency Pierre observed, but from the user's point of view this will look like a regression. Maybe listening to netlink and then looking at sysfs isn't the "correct" way to do this, but I don't want to just casually break existing user code. If we do contemplate doing the revert, at the very least we should include specific details about what the user code *should* do instead, at the level of the actual commands to use instead of "ip monitor dev; cat ${path}/device/sriov_numvfs". Bjorn
diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c index aaa33e8dc4c9..0ca20cd518d5 100644 --- a/drivers/pci/iov.c +++ b/drivers/pci/iov.c @@ -395,14 +395,8 @@ static ssize_t sriov_numvfs_show(struct device *dev, char *buf) { struct pci_dev *pdev = to_pci_dev(dev); - u16 num_vfs; - - /* Serialize vs sriov_numvfs_store() so readers see valid num_VFs */ - device_lock(&pdev->dev); - num_vfs = pdev->sriov->num_VFs; - device_unlock(&pdev->dev); - return sysfs_emit(buf, "%u\n", num_vfs); + return sysfs_emit(buf, "%u\n", pdev->sriov->num_VFs); } /*