Message ID | 20221117165311.vovrc7usy4efiytl@houat |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:f944:0:0:0:0:0 with SMTP id q4csp511279wrr; Thu, 17 Nov 2022 09:02:25 -0800 (PST) X-Google-Smtp-Source: AA0mqf6uIB+abbrbNBxT5eTzH2V/QhVrJZa38nNehT1ykPDhIUW5C9Ox/Fk7zPXWJdWFxxF1I3AI X-Received: by 2002:a17:902:e5cd:b0:188:f570:7bb6 with SMTP id u13-20020a170902e5cd00b00188f5707bb6mr1983070plf.74.1668704544843; Thu, 17 Nov 2022 09:02:24 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1668704544; cv=none; d=google.com; s=arc-20160816; b=OA80YehlLHbJcoGAEos/KcKOe+jNO4vyrSZiIp59pYw7OpobzjEKI66OZLke2/Rzip zSiUC/3aIfuGex9VMT10Iea3qgv5233uGwLsZ+ZoCSirzgTaeeBW2GcIx48IGdWojrNZ hrFZWeaCWfq3ZEnkaGNiR4V6+JhpoGt4QTH2/AmZvqs9J9LsM0q0QdDEa9dm/BOg8GZE +ZtDX0hFe3xzoiL+cks3DCVUddnMQdvIrq3WKS5sQgwIw6N5wp3H+2+ENCAJ4B0/6jvQ SeZ74gXT1yFmKnY3OzckRG8l6RoHX0YacBrutqb3J8oL2ZCpeNATp9thnN2tTMGthrNZ duxA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-disposition:mime-version:message-id :subject:cc:to:from:date:feedback-id:dkim-signature:dkim-signature; bh=FbzlDkkhhsRtKSZiUm2xWhCcXLtVXKnc1W3izJLploc=; b=FYTUkP10FJI0UWtoOH6w1PDyAq0IY8cud24WTEydmbMst48IgeevalQFTqDUVmPz6+ F8vlGVBx7fPolfAeRFu5TMa43Be5KOvC4vdzVvqLcQ0UtsE51XidkT7caf1fEPbtG66B 2pb4/K8g3epfJNa2VBiwpYbTwoEXwz3EbVDs981r8w7o5mQuuH4wr5uQnNwKzvBq48yu ioQxBJZqdjSP7LQdhmKGcYaHG3M8aHyuwDNZL4KyCnClzL7rHMpzBthOu7b2jXlzGM99 nCeuDVVzs6owu8yLisBAmmInLpnr52+CMwrMq/yj/LSXKQ9SGT2aW+OIc/5qN7PyiwMJ ugxA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@cerno.tech header.s=fm2 header.b="Q/266Oy+"; dkim=pass header.i=@messagingengine.com header.s=fm1 header.b=wkH5BeJJ; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=cerno.tech Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id nl3-20020a17090b384300b0021322320f17si1442456pjb.17.2022.11.17.09.02.08; Thu, 17 Nov 2022 09:02:24 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@cerno.tech header.s=fm2 header.b="Q/266Oy+"; dkim=pass header.i=@messagingengine.com header.s=fm1 header.b=wkH5BeJJ; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=cerno.tech Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S239669AbiKQQxX (ORCPT <rfc822;a1648639935@gmail.com> + 99 others); Thu, 17 Nov 2022 11:53:23 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57590 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234292AbiKQQxV (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Thu, 17 Nov 2022 11:53:21 -0500 Received: from wout4-smtp.messagingengine.com (wout4-smtp.messagingengine.com [64.147.123.20]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D5573331 for <linux-kernel@vger.kernel.org>; Thu, 17 Nov 2022 08:53:19 -0800 (PST) Received: from compute2.internal (compute2.nyi.internal [10.202.2.46]) by mailout.west.internal (Postfix) with ESMTP id 989633201B69; Thu, 17 Nov 2022 11:53:15 -0500 (EST) Received: from mailfrontend2 ([10.202.2.163]) by compute2.internal (MEProxy); Thu, 17 Nov 2022 11:53:16 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cerno.tech; h=cc :cc:content-type:date:date:from:from:in-reply-to:message-id :mime-version:reply-to:sender:subject:subject:to:to; s=fm2; t= 1668703995; x=1668790395; bh=FbzlDkkhhsRtKSZiUm2xWhCcXLtVXKnc1W3 izJLploc=; b=Q/266Oy+xDLeuIV4Kf5CL1NHL7bAxNsInYyW4KsyQ3HCpNw9cHE dGhwAppw14wJKRPMUhL3OCzPc4MHEajCb5+7UHkc4XUvHtLWHFYmyMVc8nKnF2jH TA/jE/wJirFQ4yctUTYNyW9c7JbP4rx/rpWmzpiUubEG/o8xtXt037To+G/QqBra 009fhxDNboiMNF0A3vvEfzZ95B3jk2ZqSnOxdXNY1XJSOF8cF/g/7n3o64qprOGJ YEELav7az3jiUO2w5VLF6RcNkjMvpewKeB7Ix1ySwFDvdxKHhbe+68fhfIxn8NNj hHQYCqAE89CoZjk7qf1vuoZYMlwIr0OlZtg== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:cc:content-type:date:date:feedback-id :feedback-id:from:from:in-reply-to:message-id:mime-version :reply-to:sender:subject:subject:to:to:x-me-proxy:x-me-proxy :x-me-sender:x-me-sender:x-sasl-enc; s=fm1; t=1668703995; x= 1668790395; bh=FbzlDkkhhsRtKSZiUm2xWhCcXLtVXKnc1W3izJLploc=; b=w kH5BeJJ0sxtl3xiySYdhYXXot1xbFrHsakf599nx+1VS0Wp9N8XtcwZSZDeHebpv 5iO113iphsejg6/oZRiQnUjznHRBwLZv17HRs35gFEg9WkGZDBqO8FvE43lyNeoY +U+VxaO+8J07/4hvimbbyHK7r6cmOxdVsXHIr1OpwUSRgr9oX9xPM5L9l/SNMDkC bNok6KkBXuu8bwNxhGDU7qkdI3EKOinFSSNSmEIKq71hisZ6i6drfzUG5D8Imt96 oD0zj022aw0D2/CWrI0eMT5SFohIcDSmaU84y312vSytcyWSOMAPkzuRgQKne23h 9WW1y5ufVgJHT5G8l8GWw== X-ME-Sender: <xms:-mZ2Y7x75JmWi-PU-Tt81VKAsryb21C2a_9B0yzSh_rG1xxOrT24lw> <xme:-mZ2YzSzuS7MqFWmQcdDqyLqW-7l3G_LQFi0MQBGBuZzhnacMighm0ZRDzAhBOFHm yHJmxiARpM8eDlxTWU> X-ME-Received: <xmr:-mZ2Y1VdYv9Yikln76gc7Y6iGlLrM0G5JJ37hk0YA3mpEqq7J9bpa6XVkVm9Ku4hstHFQSkjAh0F6N5xIscBG6BDWeKv1WPsdy8Ua7yh5LrFNg> X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgedvgedrgeekgdelvdcutefuodetggdotefrodftvf curfhrohhfihhlvgemucfhrghsthforghilhdpqfgfvfdpuffrtefokffrpgfnqfghnecu uegrihhlohhuthemuceftddtnecusecvtfgvtghiphhivghnthhsucdlqddutddtmdenuc fjughrpeffhffvvefukfggtggusehgtderredttddvnecuhfhrohhmpeforgigihhmvgcu tfhiphgrrhguuceomhgrgihimhgvsegtvghrnhhordhtvggthheqnecuggftrfgrthhtvg hrnhepteeifeeigeetgfffgedufeefheekgfevuefgtdehffefheelleejveeviefhgfek necuffhomhgrihhnpehfrhgvvgguvghskhhtohhprdhorhhgpdgsohhothhlihhnrdgtoh hmnecuvehluhhsthgvrhfuihiivgeptdenucfrrghrrghmpehmrghilhhfrhhomhepmhgr gihimhgvsegtvghrnhhordhtvggthh X-ME-Proxy: <xmx:-mZ2Y1iH-yFsgQdRLryHHochoP3jGepgvXLGMUhHdo2JarzBJppwbQ> <xmx:-mZ2Y9B0FcVk1lUd8yVc2AXI-uLyPAdP7Iyn1tni29TkL2FIAc7wrg> <xmx:-mZ2Y-I4vMEsHoqYzkUkAIp6bhLLpXte3kTFeSx0Fq4lPInIVpm3iQ> <xmx:-2Z2Y_xVlOHHHpggpykdE8Js_MZ1K28WKoRubwYn69x08c8ivqKrWQ> Feedback-ID: i8771445c:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Thu, 17 Nov 2022 11:53:14 -0500 (EST) Date: Thu, 17 Nov 2022 17:53:11 +0100 From: Maxime Ripard <maxime@cerno.tech> To: Daniel Vetter <daniel.vetter@intel.com>, David Airlie <airlied@linux.ie>, Maarten Lankhorst <maarten.lankhorst@linux.intel.com>, Thomas Zimmermann <tzimmermann@suse.de>, Maxime Ripard <maxime@cerno.tech>, Greg Kroah-Hartman <gregkh@linuxfoundation.org>, "Rafael J. Wysocki" <rafael@kernel.org> Cc: dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org Subject: DRM-managed resources / devm_drm_dev_alloc leaking resources Message-ID: <20221117165311.vovrc7usy4efiytl@houat> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha512; protocol="application/pgp-signature"; boundary="xbb4rrcandpnbw3a" Content-Disposition: inline X-Spam-Status: No, score=-2.8 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_LOW,SPF_HELO_PASS, SPF_PASS autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1749763537084808740?= X-GMAIL-MSGID: =?utf-8?q?1749763537084808740?= |
Series |
DRM-managed resources / devm_drm_dev_alloc leaking resources
|
|
Commit Message
Maxime Ripard
Nov. 17, 2022, 4:53 p.m. UTC
Hi, After trying to get more kunit tests for KMS, I found out that the recent kunit helpers we merged to create a DRM device [1] are broken and won't free their device-managed and DRM-managed resources. With some help from Thomas, we've dug into this and it turns out that if we allocate a device with root_device_register, initialise our drm device with devm_drm_dev_alloc(), register it using drm_dev_register(), unregister it using drm_dev_unregister/drm_dev_unplug and then remove the parent device, neither the device managed nor the DRM managed actions are run. root_device_register initializes the device by eventually calling device_initialize() which sets the initial reference count of the root device to 1 [2]. devm_drm_dev_alloc() then comes in, drm_dev_init() will increase the root device refcount [3] and initialize our DRM device to 1 [4]. drm_dev_register(), through drm_minor_register() and device_add(), will increase the root device refcount [5]. When unrolling things, drm_dev_unregister(), through drm_minor_unregister() and device_del(), will give up its reference [6]. root_device_unregister(), through device_unregister(), will also give up its own [7]. So we end up with this for the reference counts: +------------------------+-------------+------------+ | funcs | root device | DRM device | +------------------------+-------------+------------+ | root_device_register | 1 | N/A | | devm_drm_dev_alloc | 2 | 1 | | drm_dev_register | 3 | 1 | | drm_dev_unregister | 2 | 1 | | root_device_unregister | 1 | 1 | +------------------------+-------------+------------+ If we go back to the list of reference taken, the root device reference and the initial drm_device reference, both taken by devm_drm_dev_alloc() through drm_dev_init(), haven't been put back. If we look at the drm_dev_init code(), we can see that it sets up a DRM-managed action [8] that will put back the device reference [9]. The DRM-managed code is executed by the drm_managed_cleanup() function, that is executed as part of a release hook [10] executed once we give up the final reference to the DRM device [11]. If we go back a little, the final reference to the DRM device is actually the initial one setup by devm_drm_dev_alloc(). This function has superseded drm_dev_alloc(), with the documentation that we do need a final drm_dev_put() to put back our final reference [12]. devm_drm_dev_alloc() is a more convenient variant that has been introduced explicitly to not require that drm_dev_put(), and states it as such in the documentation [13]. It does so by adding a device-managed action that will call drm_dev_put() [14]. Device-managed actions are ran as part devres_release_all() that is called by device_release() [15], itself being run when the last reference on the device is put back [16][17][18]. So if we sum things up, the DRM device will only give its last root device reference when the last DRM device reference will be put back, and the last DRM device reference will be put back when the last device reference will be put back, which sounds very circular to me, with both ending up in a deadlock scenario. I've added two kunit tests that demonstrate the issue: we register a device, allocate and register a DRM device, register a DRM-managed action, remove the DRM device and the parent device, and wait for the action to execute. drm_register_unregister_with_devm_test() uses the broken(?) devm_drm_dev_alloc and is failing. drm_register_unregister_test uses the deprecated drm_dev_alloc() that requires an explicit call to drm_dev_put() which works fine. It's also worth noting that Thomas tested with simpledrm and it seems to work fine. Using a platform_device instead of the root_device doesn't change anything to the outcome in my tests, so there might be a more subtle behaviour involved. Thanks, Maxime --------- 8< ----------- --------- 8< ----------- 1: https://cgit.freedesktop.org/drm-misc/tree/drivers/gpu/drm/tests/drm_kunit_helpers.c 2: https://elixir.bootlin.com/linux/latest/source/drivers/base/core.c#L2979 3: https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/drm_drv.c#L597 4: https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/drm_drv.c#L596 5: https://elixir.bootlin.com/linux/latest/source/drivers/base/core.c#L3437 6: https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/drm_drv.c#L201 7: https://elixir.bootlin.com/linux/latest/source/drivers/base/core.c#L3737 8: https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/drm_drv.c#L618 9: https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/drm_drv.c#L570 10: https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/drm_drv.c#L751 11: https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/drm_drv.c#L785 12: https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/drm_drv.c#L259 13: https://elixir.bootlin.com/linux/latest/source/include/drm/drm_drv.h#L505 14: https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/drm_drv.c#L682 15: https://elixir.bootlin.com/linux/latest/source/drivers/base/core.c#L2321 16: https://elixir.bootlin.com/linux/latest/source/drivers/base/core.c#L2357 17: https://elixir.bootlin.com/linux/latest/source/lib/kobject.c#L721 18: https://elixir.bootlin.com/linux/latest/source/lib/kobject.c#L647
Comments
Hello Maxime, On 11/17/22 17:53, Maxime Ripard wrote: > Hi, > > After trying to get more kunit tests for KMS, I found out that the > recent kunit helpers we merged to create a DRM device [1] are broken and > won't free their device-managed and DRM-managed resources. > > With some help from Thomas, we've dug into this and it turns out that if > we allocate a device with root_device_register, initialise our drm > device with devm_drm_dev_alloc(), register it using drm_dev_register(), > unregister it using drm_dev_unregister/drm_dev_unplug and then remove > the parent device, neither the device managed nor the DRM managed > actions are run. > > root_device_register initializes the device by eventually calling > device_initialize() which sets the initial reference count of the root > device to 1 [2]. devm_drm_dev_alloc() then comes in, drm_dev_init() will > increase the root device refcount [3] and initialize our DRM device to 1 > [4]. drm_dev_register(), through drm_minor_register() and device_add(), > will increase the root device refcount [5]. > > When unrolling things, drm_dev_unregister(), through > drm_minor_unregister() and device_del(), will give up its reference [6]. > root_device_unregister(), through device_unregister(), will also give up > its own [7]. > > So we end up with this for the reference counts: > > +------------------------+-------------+------------+ > | funcs | root device | DRM device | > +------------------------+-------------+------------+ > | root_device_register | 1 | N/A | > | devm_drm_dev_alloc | 2 | 1 | > | drm_dev_register | 3 | 1 | > | drm_dev_unregister | 2 | 1 | > | root_device_unregister | 1 | 1 | > +------------------------+-------------+------------+ > > If we go back to the list of reference taken, the root device reference > and the initial drm_device reference, both taken by devm_drm_dev_alloc() > through drm_dev_init(), haven't been put back. > > If we look at the drm_dev_init code(), we can see that it sets up a > DRM-managed action [8] that will put back the device reference [9]. The > DRM-managed code is executed by the drm_managed_cleanup() function, that > is executed as part of a release hook [10] executed once we give up the > final reference to the DRM device [11]. > > If we go back a little, the final reference to the DRM device is > actually the initial one setup by devm_drm_dev_alloc(). This function > has superseded drm_dev_alloc(), with the documentation that we do need a > final drm_dev_put() to put back our final reference [12]. > > devm_drm_dev_alloc() is a more convenient variant that has been > introduced explicitly to not require that drm_dev_put(), and states it > as such in the documentation [13]. It does so by adding a device-managed > action that will call drm_dev_put() [14]. > > Device-managed actions are ran as part devres_release_all() that is > called by device_release() [15], itself being run when the last > reference on the device is put back [16][17][18]. > Thanks a lot for the write up. It was very informative and detailed. > So if we sum things up, the DRM device will only give its last root > device reference when the last DRM device reference will be put back, > and the last DRM device reference will be put back when the last device > reference will be put back, which sounds very circular to me, with both > ending up in a deadlock scenario. > So my conclusion after looking at this is the same than your, that the drivers would need to signal somehow to the DRM core when a DRM device won't be used anymore and drop the final reference to the DRM device. That is, I don't think we can get away of drivers not calling either drm_dev_put(). I think that we should try to simplify the DRM register and release API and make very clear in the documentation what should be used. Right now for example as you mentioned we have both drm_dev_unregister() and drm_dev_unplug() but AFAICT the only difference is that the latter does a sync to protect critical sections during drm_dev_{enter,exit}(). The drawback is that the DRM device will be marked as unplugged before drm_atomic_helper_shutdown(), but is this really a problem in practice? Maybe we can just rename drm_dev_unplug() to drm_dev_unregister() and drm_dev_unregister() to __drm_dev_unregister(). That way, the register path could always be: devm_drm_dev_alloc() drm_dev_register() and then in the release path: drm_dev_unregister() drm_dev_put() making both DRM-managed and device-managed resources to always work. > I've added two kunit tests that demonstrate the issue: we register a > device, allocate and register a DRM device, register a DRM-managed > action, remove the DRM device and the parent device, and wait for the > action to execute. drm_register_unregister_with_devm_test() uses the > broken(?) devm_drm_dev_alloc and is failing. > drm_register_unregister_test uses the deprecated drm_dev_alloc() that > requires an explicit call to drm_dev_put() which works fine. > Great. In my opinion we should add these Kunit tests even when they are exposing an issue in the devm_drm_dev_alloc() helper. > It's also worth noting that Thomas tested with simpledrm and it seems to > work fine. Using a platform_device instead of the root_device doesn't > change anything to the outcome in my tests, so there might be a more > subtle behaviour involved. > That's strange because AFAICT simpledrm is basically doing the same than your failing tests. I tried to look at the differences but couldn't spot anything evident... > Thanks, > Maxime >
Hi, Javier and I looked into it some more today, and I think we have a better idea of what is going on. On Thu, Nov 17, 2022 at 05:53:11PM +0100, Maxime Ripard wrote: > After trying to get more kunit tests for KMS, I found out that the > recent kunit helpers we merged to create a DRM device [1] are broken and > won't free their device-managed and DRM-managed resources. > > With some help from Thomas, we've dug into this and it turns out that if > we allocate a device with root_device_register, initialise our drm > device with devm_drm_dev_alloc(), register it using drm_dev_register(), > unregister it using drm_dev_unregister/drm_dev_unplug and then remove > the parent device, neither the device managed nor the DRM managed > actions are run. > > root_device_register initializes the device by eventually calling > device_initialize() which sets the initial reference count of the root > device to 1 [2]. devm_drm_dev_alloc() then comes in, drm_dev_init() will > increase the root device refcount [3] and initialize our DRM device to 1 > [4]. drm_dev_register(), through drm_minor_register() and device_add(), > will increase the root device refcount [5]. > > When unrolling things, drm_dev_unregister(), through > drm_minor_unregister() and device_del(), will give up its reference [6]. > root_device_unregister(), through device_unregister(), will also give up > its own [7]. > > So we end up with this for the reference counts: > > +------------------------+-------------+------------+ > | funcs | root device | DRM device | > +------------------------+-------------+------------+ > | root_device_register | 1 | N/A | > | devm_drm_dev_alloc | 2 | 1 | > | drm_dev_register | 3 | 1 | > | drm_dev_unregister | 2 | 1 | > | root_device_unregister | 1 | 1 | > +------------------------+-------------+------------+ > > If we go back to the list of reference taken, the root device reference > and the initial drm_device reference, both taken by devm_drm_dev_alloc() > through drm_dev_init(), haven't been put back. > > If we look at the drm_dev_init code(), we can see that it sets up a > DRM-managed action [8] that will put back the device reference [9]. The > DRM-managed code is executed by the drm_managed_cleanup() function, that > is executed as part of a release hook [10] executed once we give up the > final reference to the DRM device [11]. > > If we go back a little, the final reference to the DRM device is > actually the initial one setup by devm_drm_dev_alloc(). This function > has superseded drm_dev_alloc(), with the documentation that we do need a > final drm_dev_put() to put back our final reference [12]. > > devm_drm_dev_alloc() is a more convenient variant that has been > introduced explicitly to not require that drm_dev_put(), and states it > as such in the documentation [13]. It does so by adding a device-managed > action that will call drm_dev_put() [14]. > > Device-managed actions are ran as part devres_release_all() that is > called by device_release() [15], itself being run when the last > reference on the device is put back [16][17][18]. > > So if we sum things up, the DRM device will only give its last root > device reference when the last DRM device reference will be put back, > and the last DRM device reference will be put back when the last device > reference will be put back, which sounds very circular to me, with both > ending up in a deadlock scenario. > > I've added two kunit tests that demonstrate the issue: we register a > device, allocate and register a DRM device, register a DRM-managed > action, remove the DRM device and the parent device, and wait for the > action to execute. drm_register_unregister_with_devm_test() uses the > broken(?) devm_drm_dev_alloc and is failing. > drm_register_unregister_test uses the deprecated drm_dev_alloc() that > requires an explicit call to drm_dev_put() which works fine. > > It's also worth noting that Thomas tested with simpledrm and it seems to > work fine. Indeed, the transition from simpledrm to a full-blown DRM driver handles this properly. It's using a platform_device_unregister() [1] and eventually device_del() [2][3]. That part is handled just like root_device_unregister() [4][5]. Basically, both will call device_del(), and then device_put(), so device_del() is called while holding a reference. As we've seen before, at this point the DRM driver still holds a reference to the device as well. device_del() will call bus_remove_device() [6], which will be skipped for the root device since it doesn't have a bus [7]. It will then call device_release_driver() [8], which is basically forwarded to __device_release_driver() [9][10], that will call device_unbind_cleanup() [11]. device_unbind_cleanup() calls devres_release_all() directly [12], that runs all the device-managed actions [13]. And it does so WHILE THERE'S STILL A REFCOUNT OF 2! I would expect the call to devres_release_all to happen only in device_release, once all the device reference have been put back. Not 4 functions in as a side effect, and while there's still some active references. > Using a platform_device instead of the root_device doesn't > change anything to the outcome in my tests, so there might be a more > subtle behaviour involved. This one has the same symptom but a different cause. I was just registering a platform_device but it wasn't bound to any driver. While it's valid to do so according to that comment [13], it doesn't have any driver so the check for the driver in device_release_driver() [8], and never hits device_unbind_cleanup(). Thanks again to Thomas and Javier for their help Maxime 1: https://elixir.bootlin.com/linux/latest/source/drivers/video/aperture.c#L199 2: https://elixir.bootlin.com/linux/latest/source/drivers/base/platform.c#L793 3: https://elixir.bootlin.com/linux/latest/source/drivers/base/platform.c#L751 4: https://elixir.bootlin.com/linux/latest/source/drivers/base/core.c#L4153 5: https://elixir.bootlin.com/linux/latest/source/drivers/base/core.c#L3733 6: https://elixir.bootlin.com/linux/latest/source/drivers/base/core.c#L3704 7: https://elixir.bootlin.com/linux/latest/source/drivers/base/bus.c#L511 8: https://elixir.bootlin.com/linux/latest/source/drivers/base/bus.c#L529 9: https://elixir.bootlin.com/linux/latest/source/drivers/base/dd.c#L1298 10: https://elixir.bootlin.com/linux/latest/source/drivers/base/dd.c#L1275 11: https://elixir.bootlin.com/linux/latest/source/drivers/base/dd.c#L1255 12: https://elixir.bootlin.com/linux/latest/source/drivers/base/dd.c#L530 13: https://elixir.bootlin.com/linux/latest/source/drivers/base/core.c#L2312
On Fri, Nov 18, 2022 at 05:15:58PM +0100, Maxime Ripard wrote: > Hi, > > Javier and I looked into it some more today, and I think we have a > better idea of what is going on. > > On Thu, Nov 17, 2022 at 05:53:11PM +0100, Maxime Ripard wrote: > > After trying to get more kunit tests for KMS, I found out that the > > recent kunit helpers we merged to create a DRM device [1] are broken and > > won't free their device-managed and DRM-managed resources. > > > > With some help from Thomas, we've dug into this and it turns out that if > > we allocate a device with root_device_register, initialise our drm > > device with devm_drm_dev_alloc(), register it using drm_dev_register(), > > unregister it using drm_dev_unregister/drm_dev_unplug and then remove > > the parent device, neither the device managed nor the DRM managed > > actions are run. > > > > root_device_register initializes the device by eventually calling > > device_initialize() which sets the initial reference count of the root > > device to 1 [2]. devm_drm_dev_alloc() then comes in, drm_dev_init() will > > increase the root device refcount [3] and initialize our DRM device to 1 > > [4]. drm_dev_register(), through drm_minor_register() and device_add(), > > will increase the root device refcount [5]. > > > > When unrolling things, drm_dev_unregister(), through > > drm_minor_unregister() and device_del(), will give up its reference [6]. > > root_device_unregister(), through device_unregister(), will also give up > > its own [7]. > > > > So we end up with this for the reference counts: > > > > +------------------------+-------------+------------+ > > | funcs | root device | DRM device | > > +------------------------+-------------+------------+ > > | root_device_register | 1 | N/A | > > | devm_drm_dev_alloc | 2 | 1 | > > | drm_dev_register | 3 | 1 | > > | drm_dev_unregister | 2 | 1 | > > | root_device_unregister | 1 | 1 | > > +------------------------+-------------+------------+ > > > > If we go back to the list of reference taken, the root device reference > > and the initial drm_device reference, both taken by devm_drm_dev_alloc() > > through drm_dev_init(), haven't been put back. > > > > If we look at the drm_dev_init code(), we can see that it sets up a > > DRM-managed action [8] that will put back the device reference [9]. The > > DRM-managed code is executed by the drm_managed_cleanup() function, that > > is executed as part of a release hook [10] executed once we give up the > > final reference to the DRM device [11]. > > > > If we go back a little, the final reference to the DRM device is > > actually the initial one setup by devm_drm_dev_alloc(). This function > > has superseded drm_dev_alloc(), with the documentation that we do need a > > final drm_dev_put() to put back our final reference [12]. > > > > devm_drm_dev_alloc() is a more convenient variant that has been > > introduced explicitly to not require that drm_dev_put(), and states it > > as such in the documentation [13]. It does so by adding a device-managed > > action that will call drm_dev_put() [14]. > > > > Device-managed actions are ran as part devres_release_all() that is > > called by device_release() [15], itself being run when the last > > reference on the device is put back [16][17][18]. > > > > So if we sum things up, the DRM device will only give its last root > > device reference when the last DRM device reference will be put back, > > and the last DRM device reference will be put back when the last device > > reference will be put back, which sounds very circular to me, with both > > ending up in a deadlock scenario. > > > > I've added two kunit tests that demonstrate the issue: we register a > > device, allocate and register a DRM device, register a DRM-managed > > action, remove the DRM device and the parent device, and wait for the > > action to execute. drm_register_unregister_with_devm_test() uses the > > broken(?) devm_drm_dev_alloc and is failing. > > drm_register_unregister_test uses the deprecated drm_dev_alloc() that > > requires an explicit call to drm_dev_put() which works fine. > > > > It's also worth noting that Thomas tested with simpledrm and it seems to > > work fine. > > Indeed, the transition from simpledrm to a full-blown DRM driver handles > this properly. It's using a platform_device_unregister() [1] and > eventually device_del() [2][3]. That part is handled just like > root_device_unregister() [4][5]. Basically, both will call device_del(), > and then device_put(), so device_del() is called while holding a > reference. > > As we've seen before, at this point the DRM driver still holds a > reference to the device as well. > > device_del() will call bus_remove_device() [6], which will be skipped > for the root device since it doesn't have a bus [7]. > > It will then call device_release_driver() [8], which is basically forwarded > to __device_release_driver() [9][10], that will call device_unbind_cleanup() [11]. > > device_unbind_cleanup() calls devres_release_all() directly [12], that > runs all the device-managed actions [13]. And it does so WHILE THERE'S > STILL A REFCOUNT OF 2! > > I would expect the call to devres_release_all to happen only in > device_release, once all the device reference have been put back. Not 4 > functions in as a side effect, and while there's still some active > references. So maybe I'm missing something, but devm is tied to the device's existence, not to the lifetime of the struct device. Or well, it's complicated, there's actually two cleanups, once when doing the hotunplug/unregister, and once when the final struct device cleanup happens. This is why devm_drm_dev_alloc absolutely must be in the former cleanup group, otherwise you have the loop you've described. With usual device model this should all Just Work, if you hand roll your struct device it gets more complicated. I've reworked how this works for vgem/vkms a while ago in an attempt to make them look more like real devices/drivers, but it's a bit nasty. But yeah if you don't have that unregister step, then you have a loop and the drm managed stuff never gets freed. That is by design, because the full lifetime dependencies are "hw device" (devm-managed) -> "drm device" (drmm-managed) -> "struct device sw pieces" (kref, nothing else should be attached here really) Cheers, Daniel > > > Using a platform_device instead of the root_device doesn't > > change anything to the outcome in my tests, so there might be a more > > subtle behaviour involved. > > This one has the same symptom but a different cause. I was just > registering a platform_device but it wasn't bound to any driver. While > it's valid to do so according to that comment [13], it doesn't have any > driver so the check for the driver in device_release_driver() [8], and > never hits device_unbind_cleanup(). > > Thanks again to Thomas and Javier for their help > Maxime > > 1: https://elixir.bootlin.com/linux/latest/source/drivers/video/aperture.c#L199 > 2: https://elixir.bootlin.com/linux/latest/source/drivers/base/platform.c#L793 > 3: https://elixir.bootlin.com/linux/latest/source/drivers/base/platform.c#L751 > 4: https://elixir.bootlin.com/linux/latest/source/drivers/base/core.c#L4153 > 5: https://elixir.bootlin.com/linux/latest/source/drivers/base/core.c#L3733 > 6: https://elixir.bootlin.com/linux/latest/source/drivers/base/core.c#L3704 > 7: https://elixir.bootlin.com/linux/latest/source/drivers/base/bus.c#L511 > 8: https://elixir.bootlin.com/linux/latest/source/drivers/base/bus.c#L529 > 9: https://elixir.bootlin.com/linux/latest/source/drivers/base/dd.c#L1298 > 10: https://elixir.bootlin.com/linux/latest/source/drivers/base/dd.c#L1275 > 11: https://elixir.bootlin.com/linux/latest/source/drivers/base/dd.c#L1255 > 12: https://elixir.bootlin.com/linux/latest/source/drivers/base/dd.c#L530 > 13: https://elixir.bootlin.com/linux/latest/source/drivers/base/core.c#L2312
Hi On Sat, Nov 19, 2022 at 12:14:14PM +0100, Daniel Vetter wrote: > On Fri, Nov 18, 2022 at 05:15:58PM +0100, Maxime Ripard wrote: > > Hi, > > > > Javier and I looked into it some more today, and I think we have a > > better idea of what is going on. > > > > On Thu, Nov 17, 2022 at 05:53:11PM +0100, Maxime Ripard wrote: > > > After trying to get more kunit tests for KMS, I found out that the > > > recent kunit helpers we merged to create a DRM device [1] are broken and > > > won't free their device-managed and DRM-managed resources. > > > > > > With some help from Thomas, we've dug into this and it turns out that if > > > we allocate a device with root_device_register, initialise our drm > > > device with devm_drm_dev_alloc(), register it using drm_dev_register(), > > > unregister it using drm_dev_unregister/drm_dev_unplug and then remove > > > the parent device, neither the device managed nor the DRM managed > > > actions are run. > > > > > > root_device_register initializes the device by eventually calling > > > device_initialize() which sets the initial reference count of the root > > > device to 1 [2]. devm_drm_dev_alloc() then comes in, drm_dev_init() will > > > increase the root device refcount [3] and initialize our DRM device to 1 > > > [4]. drm_dev_register(), through drm_minor_register() and device_add(), > > > will increase the root device refcount [5]. > > > > > > When unrolling things, drm_dev_unregister(), through > > > drm_minor_unregister() and device_del(), will give up its reference [6]. > > > root_device_unregister(), through device_unregister(), will also give up > > > its own [7]. > > > > > > So we end up with this for the reference counts: > > > > > > +------------------------+-------------+------------+ > > > | funcs | root device | DRM device | > > > +------------------------+-------------+------------+ > > > | root_device_register | 1 | N/A | > > > | devm_drm_dev_alloc | 2 | 1 | > > > | drm_dev_register | 3 | 1 | > > > | drm_dev_unregister | 2 | 1 | > > > | root_device_unregister | 1 | 1 | > > > +------------------------+-------------+------------+ > > > > > > If we go back to the list of reference taken, the root device reference > > > and the initial drm_device reference, both taken by devm_drm_dev_alloc() > > > through drm_dev_init(), haven't been put back. > > > > > > If we look at the drm_dev_init code(), we can see that it sets up a > > > DRM-managed action [8] that will put back the device reference [9]. The > > > DRM-managed code is executed by the drm_managed_cleanup() function, that > > > is executed as part of a release hook [10] executed once we give up the > > > final reference to the DRM device [11]. > > > > > > If we go back a little, the final reference to the DRM device is > > > actually the initial one setup by devm_drm_dev_alloc(). This function > > > has superseded drm_dev_alloc(), with the documentation that we do need a > > > final drm_dev_put() to put back our final reference [12]. > > > > > > devm_drm_dev_alloc() is a more convenient variant that has been > > > introduced explicitly to not require that drm_dev_put(), and states it > > > as such in the documentation [13]. It does so by adding a device-managed > > > action that will call drm_dev_put() [14]. > > > > > > Device-managed actions are ran as part devres_release_all() that is > > > called by device_release() [15], itself being run when the last > > > reference on the device is put back [16][17][18]. > > > > > > So if we sum things up, the DRM device will only give its last root > > > device reference when the last DRM device reference will be put back, > > > and the last DRM device reference will be put back when the last device > > > reference will be put back, which sounds very circular to me, with both > > > ending up in a deadlock scenario. > > > > > > I've added two kunit tests that demonstrate the issue: we register a > > > device, allocate and register a DRM device, register a DRM-managed > > > action, remove the DRM device and the parent device, and wait for the > > > action to execute. drm_register_unregister_with_devm_test() uses the > > > broken(?) devm_drm_dev_alloc and is failing. > > > drm_register_unregister_test uses the deprecated drm_dev_alloc() that > > > requires an explicit call to drm_dev_put() which works fine. > > > > > > It's also worth noting that Thomas tested with simpledrm and it seems to > > > work fine. > > > > Indeed, the transition from simpledrm to a full-blown DRM driver handles > > this properly. It's using a platform_device_unregister() [1] and > > eventually device_del() [2][3]. That part is handled just like > > root_device_unregister() [4][5]. Basically, both will call device_del(), > > and then device_put(), so device_del() is called while holding a > > reference. > > > > As we've seen before, at this point the DRM driver still holds a > > reference to the device as well. > > > > device_del() will call bus_remove_device() [6], which will be skipped > > for the root device since it doesn't have a bus [7]. > > > > It will then call device_release_driver() [8], which is basically forwarded > > to __device_release_driver() [9][10], that will call device_unbind_cleanup() [11]. > > > > device_unbind_cleanup() calls devres_release_all() directly [12], that > > runs all the device-managed actions [13]. And it does so WHILE THERE'S > > STILL A REFCOUNT OF 2! > > > > I would expect the call to devres_release_all to happen only in > > device_release, once all the device reference have been put back. Not 4 > > functions in as a side effect, and while there's still some active > > references. > > So maybe I'm missing something, but devm is tied to the device's > existence, not to the lifetime of the struct device. Yes, devm_ resources can be released even if somebody is keeping positive reference to struct device FWICT, after bus->remove() you can not access them. > Or well, it's complicated, there's actually two cleanups, once when doing > the hotunplug/unregister, and once when the final struct device cleanup > happens. This is why devm_drm_dev_alloc absolutely must be in the former > cleanup group, otherwise you have the loop you've described. There are 2 separate managed resources list: one in drm_device and one in struct device. Those are cleaned up separately. But if we are talking about struct device there is only one cleanup. However it can happen from few entry points that ended up on: device_unbind_cleanup() -> devres_release_all() And there is a problem when mixing devm_ and drmm_, because you have to correctly chose devm_ resources and avoid those which can still be accessed via drm callbacks. Or block access tho those after bus->remove / sysfs unbind. Not sure it this is related to original, problem presented here by Maxime though. Just my thoughts from converting intel_vpu to use drmm. Regards Stanislaw
diff --git a/drivers/gpu/drm/tests/drm_register_test.c b/drivers/gpu/drm/tests/drm_register_test.c new file mode 100644 index 000000000000..3256b53d08f2 --- /dev/null +++ b/drivers/gpu/drm/tests/drm_register_test.c @@ -0,0 +1,117 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include <drm/drm_drv.h> +#include <drm/drm_managed.h> + +#include <kunit/resource.h> + +#include <linux/device.h> +#include <linux/platform_device.h> + +#include "drm_kunit_helpers.h" + +struct test_priv { + bool release_done; + wait_queue_head_t release_wq; +}; + +static const struct drm_mode_config_funcs drm_mode_config_funcs = { +}; + +static const struct drm_driver drm_driver = { + .driver_features = DRIVER_MODESET, +}; + +static void drm_release(struct drm_device *drm, void *ptr) +{ + struct test_priv *priv = ptr; + + priv->release_done = true; + wake_up_interruptible(&priv->release_wq); +} + +#define RELEASE_TIMEOUT_MS 500 + +static void drm_register_unregister_test(struct kunit *test) +{ + struct test_priv *priv; + struct drm_device *drm; + struct device *dev; + int ret; + + priv = kunit_kzalloc(test, sizeof(*priv), GFP_KERNEL); + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, priv); + init_waitqueue_head(&priv->release_wq); + + dev = root_device_register("test"); + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, dev); + + drm = drm_dev_alloc(&drm_driver, dev); + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, drm); + + drm->mode_config.funcs = &drm_mode_config_funcs; + ret = drmm_mode_config_init(drm); + KUNIT_ASSERT_EQ(test, ret, 0); + + ret = drmm_add_action_or_reset(drm, drm_release, priv); + KUNIT_ASSERT_EQ(test, ret, 0); + + ret = drm_dev_register(drm, 0); + KUNIT_ASSERT_EQ(test, ret, 0); + + drm_dev_unregister(drm); + drm_dev_put(drm); + root_device_unregister(dev); + + ret = wait_event_interruptible_timeout(priv->release_wq, priv->release_done, + msecs_to_jiffies(RELEASE_TIMEOUT_MS)); + KUNIT_EXPECT_GT(test, ret, 0); +} + +static void drm_register_unregister_with_devm_test(struct kunit *test) +{ + struct test_priv *priv; + struct drm_device *drm; + struct device *dev; + int ret; + + priv = kunit_kzalloc(test, sizeof(*priv), GFP_KERNEL); + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, priv); + init_waitqueue_head(&priv->release_wq); + + dev = root_device_register("test"); + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, dev); + + drm = __devm_drm_dev_alloc(dev, &drm_driver, sizeof(*drm), 0); + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, drm); + + drm->mode_config.funcs = &drm_mode_config_funcs; + ret = drmm_mode_config_init(drm); + KUNIT_ASSERT_EQ(test, ret, 0); + + ret = drmm_add_action_or_reset(drm, drm_release, priv); + KUNIT_ASSERT_EQ(test, ret, 0); + + ret = drm_dev_register(drm, 0); + KUNIT_ASSERT_EQ(test, ret, 0); + + drm_dev_unregister(drm); + root_device_unregister(dev); + + ret = wait_event_interruptible_timeout(priv->release_wq, priv->release_done, + msecs_to_jiffies(RELEASE_TIMEOUT_MS)); + KUNIT_EXPECT_GT(test, ret, 0); +} + +static struct kunit_case drm_register_tests[] = { + KUNIT_CASE(drm_register_unregister_test), + KUNIT_CASE(drm_register_unregister_with_devm_test), + {} +}; + +static struct kunit_suite drm_register_test_suite = { + .name = "drm-test-register", + .test_cases = drm_register_tests +}; + +kunit_test_suite(drm_register_test_suite);