Message ID | 20230103114427.1825-1-marcan@marcan.st |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a5d:4e01:0:0:0:0:0 with SMTP id p1csp4568196wrt; Tue, 3 Jan 2023 03:49:28 -0800 (PST) X-Google-Smtp-Source: AMrXdXuTQFO2Ac4vMLgJIHV0CNz8NlnK3ecO8Nho1dOkVrQVHAtUGQ8pnAJPy9TiBg1OVWnkwkPJ X-Received: by 2002:a05:6a20:8b9a:b0:ae:e972:72a4 with SMTP id m26-20020a056a208b9a00b000aee97272a4mr49906527pzh.36.1672746568703; Tue, 03 Jan 2023 03:49:28 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1672746568; cv=none; d=google.com; s=arc-20160816; b=BV45ojYFNBZtVlZvLl70JKlax/s6P03zidAJLelAKNuSjmaSq2Gtleb28JWPg/Kfr8 p+CzSS7R603CUfkypcTbkB5G7DZ8Lc0Qi+Rf3lDqvst2MwrN42VFA++UEUFe6ntGKDkn 0E+OZKSIkYWoD9u2G3cgxOCVsIswX2qUw3rmL+ldgDf3sGgIq1cdfULaM2DL7CTluGs5 WM4UKLr03giiF+SHJjMXnO3Pwr0zfj0lNKb6RRR2mtkYYGKEHJz0BqOxOc8rd5QfiQ5Y OVB1SUK7Q9RWzYuguf1b6jGZqHYarBcvMUxEIXO6wWUtdjQcddQtZzry8n5DePYEwKb5 NnKA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :message-id:date:subject:cc:to:from:dkim-signature; bh=wKe+t8VA60QmNqVieOkZ4XUtzVjUyVuha0KvYNUVkvI=; b=lP3oW5J3RKyHzsfV0znImLnoLp6dZijoDcqGr1/Ecv0Y295e+3wH9+0hkvhXviszOJ xOvEWT3DK1JezP3M72x0LTQdjYvCcq/hsme0TqWIVS8nnW+CfaXJr/xZdtdaAgVaDJRq M7QuguNe6Ite/j/FQTbR35dT4kOH//vD9c79/EwgUCnvnRVlEZHaheYLU2pQlyJEnYRe aR3nz6md2lBCc4bp7TziWLayo5csLkgCBJxI8WV/FjNraJeVFlsAwo9w0NTApBJ16/V/ ry2qeGYUaYWRjL6pCbkVCMVyXBnO37ZNvUMjdov/FKyWUzmf6A4qOLQQ1bQD2Wq7TVTM 9SRg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@marcan.st header.s=default header.b=obvVnoOO; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=marcan.st Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id j3-20020a63cf03000000b004827ef5f0cesi32469958pgg.353.2023.01.03.03.49.15; Tue, 03 Jan 2023 03:49:28 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@marcan.st header.s=default header.b=obvVnoOO; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=marcan.st Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S237161AbjACLpT (ORCPT <rfc822;tmhikaru@gmail.com> + 99 others); Tue, 3 Jan 2023 06:45:19 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:39302 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S237411AbjACLo5 (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Tue, 3 Jan 2023 06:44:57 -0500 Received: from mail.marcansoft.com (marcansoft.com [212.63.210.85]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 20062625D; Tue, 3 Jan 2023 03:44:54 -0800 (PST) Received: from [127.0.0.1] (localhost [127.0.0.1]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) (Authenticated sender: sendonly@marcansoft.com) by mail.marcansoft.com (Postfix) with ESMTPSA id 5731241F72; Tue, 3 Jan 2023 11:44:50 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=marcan.st; s=default; t=1672746292; bh=wOgcEgG9+pZDSbBKY1gb2vXCECwZ3HDBdpb7Vw0TlGs=; h=From:To:Cc:Subject:Date; b=obvVnoOOwOgqZwleUIuXZh9Vut/Qr8xCsjjEGEDUK7cgkhzG/0tkJewmuk8VkWCYe EwODwQrvmpRpqnz6xHlxkTym6m+P+2IFzxV28IKhexwbIYipjnQNaQa0kpExRCR08+ /w0XNFrsj/NzBwHRpnUGVHufxSJM1xtkcmyXK4En8C/kLtfPhk+QIrNSvnWlTw5ei1 4EXD2bnSDVniVAowOk4X7e2j+6sb/uV0aZmOujrssMN8LyYcuyDk+bJbKwJ3IYpWPY Hki9fHU0f+0M9ZpdX5UEFLDvMr1Juh3ezFBYDjNi4wQMpdt3Rqh41RhbzRmeQuBSxh 9NUNLXKZf7uuQ== From: Hector Martin <marcan@marcan.st> To: Srinivas Kandagatla <srinivas.kandagatla@linaro.org> Cc: Sven Peter <sven@svenpeter.dev>, Alyssa Rosenzweig <alyssa@rosenzweig.io>, asahi@lists.linux.dev, linux-kernel@vger.kernel.org, Hector Martin <marcan@marcan.st>, stable@vger.kernel.org, Eric Curtin <ecurtin@redhat.com> Subject: [PATCH v2] nvmem: core: Fix race in nvmem_register() Date: Tue, 3 Jan 2023 20:44:28 +0900 Message-Id: <20230103114427.1825-1-marcan@marcan.st> X-Mailer: git-send-email 2.35.1 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1754001905815760038?= X-GMAIL-MSGID: =?utf-8?q?1754001905815760038?= |
Series |
[v2] nvmem: core: Fix race in nvmem_register()
|
|
Commit Message
Hector Martin
Jan. 3, 2023, 11:44 a.m. UTC
nvmem_register() currently registers the device before adding the nvmem cells, which creates a race window where consumers may find the nvmem device (and not get PROBE_DEFERred), but then fail to find the cells and error out. Move device registration to the end of nvmem_register(), to close the race. Observed when the stars line up on Apple Silicon machines with the (not yet upstream, but trivial) spmi nvmem driver and the macsmc-rtc client: [ 0.487375] macsmc-rtc macsmc-rtc: error -ENOENT: Failed to get rtc_offset NVMEM cell Fixes: eace75cfdcf7 ("nvmem: Add a simple NVMEM framework for nvmem providers") Cc: stable@vger.kernel.org Reviewed-by: Eric Curtin <ecurtin@redhat.com> Signed-off-by: Hector Martin <marcan@marcan.st> --- drivers/nvmem/core.c | 32 +++++++++++++++++--------------- 1 file changed, 17 insertions(+), 15 deletions(-) -- 2.35.1
Comments
On 03/01/2023 11:44, Hector Martin wrote: > nvmem_register() currently registers the device before adding the nvmem > cells, which creates a race window where consumers may find the nvmem > device (and not get PROBE_DEFERred), but then fail to find the cells and > error out. > > Move device registration to the end of nvmem_register(), to close the > race. > > Observed when the stars line up on Apple Silicon machines with the (not > yet upstream, but trivial) spmi nvmem driver and the macsmc-rtc client: > > [ 0.487375] macsmc-rtc macsmc-rtc: error -ENOENT: Failed to get rtc_offset NVMEM cell > > Fixes: eace75cfdcf7 ("nvmem: Add a simple NVMEM framework for nvmem providers") > Cc: stable@vger.kernel.org > Reviewed-by: Eric Curtin <ecurtin@redhat.com> > Signed-off-by: Hector Martin <marcan@marcan.st> > --- What has changed since v1? > drivers/nvmem/core.c | 32 +++++++++++++++++--------------- > 1 file changed, 17 insertions(+), 15 deletions(-) > > diff --git a/drivers/nvmem/core.c b/drivers/nvmem/core.c > index 321d7d63e068..606f428d6292 100644 > --- a/drivers/nvmem/core.c > +++ b/drivers/nvmem/core.c > @@ -822,11 +822,8 @@ struct nvmem_device *nvmem_register(const struct nvmem_config *config) > break; > } > > - if (rval) { > - ida_free(&nvmem_ida, nvmem->id); > - kfree(nvmem); > - return ERR_PTR(rval); > - } > + if (rval) > + goto err_gpiod_put; Why was gpiod changes added to this patch, that should be a separate patch/discussion, as this is not relevant to the issue that you are reporting. > > nvmem->read_only = device_property_present(config->dev, "read-only") || > config->read_only || !nvmem->reg_write; > @@ -837,20 +834,16 @@ struct nvmem_device *nvmem_register(const struct nvmem_config *config) > > dev_dbg(&nvmem->dev, "Registering nvmem device %s\n", config->name); > > - rval = device_register(&nvmem->dev); > - if (rval) > - goto err_put_device; > - > if (nvmem->nkeepout) { > rval = nvmem_validate_keepouts(nvmem); > if (rval) > - goto err_device_del; > + goto err_gpiod_put; > } > > if (config->compat) { > rval = nvmem_sysfs_setup_compat(nvmem, config); > if (rval) > - goto err_device_del; > + goto err_gpiod_put; > } > > if (config->cells) { > @@ -867,6 +860,15 @@ struct nvmem_device *nvmem_register(const struct nvmem_config *config) > if (rval) > goto err_remove_cells; > > + rval = device_register(&nvmem->dev); > + if (rval) { > + nvmem_device_remove_all_cells(nvmem); > + if (config->compat) > + nvmem_sysfs_remove_compat(nvmem, config); > + put_device(&nvmem->dev); > + return ERR_PTR(rval); > + } > + > blocking_notifier_call_chain(&nvmem_notifier, NVMEM_ADD, nvmem); > > return nvmem; > @@ -876,10 +878,10 @@ struct nvmem_device *nvmem_register(const struct nvmem_config *config) > err_teardown_compat: > if (config->compat) > nvmem_sysfs_remove_compat(nvmem, config); > -err_device_del: > - device_del(&nvmem->dev); > -err_put_device: > - put_device(&nvmem->dev); > +err_gpiod_put: > + gpiod_put(nvmem->wp_gpio); > + ida_free(&nvmem_ida, nvmem->id); > + kfree(nvmem); > > return ERR_PTR(rval); > } > -- > 2.35.1 >
On 03/01/2023 21.41, Srinivas Kandagatla wrote: > > > On 03/01/2023 11:44, Hector Martin wrote: >> nvmem_register() currently registers the device before adding the nvmem >> cells, which creates a race window where consumers may find the nvmem >> device (and not get PROBE_DEFERred), but then fail to find the cells and >> error out. >> >> Move device registration to the end of nvmem_register(), to close the >> race. >> >> Observed when the stars line up on Apple Silicon machines with the (not >> yet upstream, but trivial) spmi nvmem driver and the macsmc-rtc client: >> >> [ 0.487375] macsmc-rtc macsmc-rtc: error -ENOENT: Failed to get rtc_offset NVMEM cell >> >> Fixes: eace75cfdcf7 ("nvmem: Add a simple NVMEM framework for nvmem providers") >> Cc: stable@vger.kernel.org >> Reviewed-by: Eric Curtin <ecurtin@redhat.com> >> Signed-off-by: Hector Martin <marcan@marcan.st> >> --- > > What has changed since v1? What you told me to. I'm trying to get a silly bug fixed after you ignored my original patch until Russell independently discovered and submitted a fix for the same thing. I think we've wasted enough developer time here already. > >> drivers/nvmem/core.c | 32 +++++++++++++++++--------------- >> 1 file changed, 17 insertions(+), 15 deletions(-) >> >> diff --git a/drivers/nvmem/core.c b/drivers/nvmem/core.c >> index 321d7d63e068..606f428d6292 100644 >> --- a/drivers/nvmem/core.c >> +++ b/drivers/nvmem/core.c >> @@ -822,11 +822,8 @@ struct nvmem_device *nvmem_register(const struct nvmem_config *config) >> break; >> } >> >> - if (rval) { >> - ida_free(&nvmem_ida, nvmem->id); >> - kfree(nvmem); >> - return ERR_PTR(rval); >> - } >> + if (rval) >> + goto err_gpiod_put; > > Why was gpiod changes added to this patch, that should be a separate > patch/discussion, as this is not relevant to the issue that you are > reporting. Because freeing the device also does a gpiod_put in the destructor, so doing this is correct in every other instance below and maintains existing behavior, and it just so happens that this instance converges into the same codepath so it is correct to merge it, and it just so happens that the gpiod put was missing in this path to begin with so this becomes a drive-by bugfix. If you don't like it I can remove it (i.e. reintroduce the bug for no good reason) and you can submit this fix yourself, because I have no incentive to waste time submitting a separate patch to fix a GPIO leak in an error path corner case in a subsystem I don't own and I have much bigger things to spend my (increasingly lower and lower) willingness to fight for upstream submissions than this. Seriously, what is wrong with y'all kernel people. No other open source project wastes contributors' time with stupid nitpicks like this. I found a bug, I fixed it, I then fixed the issues you pointed out, and I don't have the time nor energy to fight over this kind of nonsense next. Do you want bugs fixed or not? >> >> nvmem->read_only = device_property_present(config->dev, "read-only") || >> config->read_only || !nvmem->reg_write; >> @@ -837,20 +834,16 @@ struct nvmem_device *nvmem_register(const struct nvmem_config *config) >> >> dev_dbg(&nvmem->dev, "Registering nvmem device %s\n", config->name); >> >> - rval = device_register(&nvmem->dev); >> - if (rval) >> - goto err_put_device; >> - >> if (nvmem->nkeepout) { >> rval = nvmem_validate_keepouts(nvmem); >> if (rval) >> - goto err_device_del; >> + goto err_gpiod_put; >> } >> >> if (config->compat) { >> rval = nvmem_sysfs_setup_compat(nvmem, config); >> if (rval) >> - goto err_device_del; >> + goto err_gpiod_put; >> } >> >> if (config->cells) { >> @@ -867,6 +860,15 @@ struct nvmem_device *nvmem_register(const struct nvmem_config *config) >> if (rval) >> goto err_remove_cells; >> >> + rval = device_register(&nvmem->dev); >> + if (rval) { >> + nvmem_device_remove_all_cells(nvmem); >> + if (config->compat) >> + nvmem_sysfs_remove_compat(nvmem, config); >> + put_device(&nvmem->dev); >> + return ERR_PTR(rval); >> + } >> + >> blocking_notifier_call_chain(&nvmem_notifier, NVMEM_ADD, nvmem); >> >> return nvmem; >> @@ -876,10 +878,10 @@ struct nvmem_device *nvmem_register(const struct nvmem_config *config) >> err_teardown_compat: >> if (config->compat) >> nvmem_sysfs_remove_compat(nvmem, config); >> -err_device_del: >> - device_del(&nvmem->dev); >> -err_put_device: >> - put_device(&nvmem->dev); >> +err_gpiod_put: >> + gpiod_put(nvmem->wp_gpio); >> + ida_free(&nvmem_ida, nvmem->id); >> + kfree(nvmem); >> >> return ERR_PTR(rval); >> } >> -- >> 2.35.1 >> > - Hector
On 03/01/2023 13:48, Hector Martin wrote: > On 03/01/2023 21.41, Srinivas Kandagatla wrote: >> >> >> On 03/01/2023 11:44, Hector Martin wrote: >>> nvmem_register() currently registers the device before adding the nvmem >>> cells, which creates a race window where consumers may find the nvmem >>> device (and not get PROBE_DEFERred), but then fail to find the cells and >>> error out. >>> >>> Move device registration to the end of nvmem_register(), to close the >>> race. >>> >>> Observed when the stars line up on Apple Silicon machines with the (not >>> yet upstream, but trivial) spmi nvmem driver and the macsmc-rtc client: >>> >>> [ 0.487375] macsmc-rtc macsmc-rtc: error -ENOENT: Failed to get rtc_offset NVMEM cell >>> >>> Fixes: eace75cfdcf7 ("nvmem: Add a simple NVMEM framework for nvmem providers") >>> Cc: stable@vger.kernel.org >>> Reviewed-by: Eric Curtin <ecurtin@redhat.com> >>> Signed-off-by: Hector Martin <marcan@marcan.st> >>> --- >> >> What has changed since v1? > > What you told me to. I'm trying to get a silly bug fixed after you > ignored my original patch until Russell independently discovered and > submitted a fix for the same thing. I think we've wasted enough > developer time here already. You should remember that maintainers have other regular job and holidays apart from maintaining. When you last sent the patch we are already in near/middle of merge window. If I had applied your original patch without proper review, It would have introduced more regressions. Be patient and we/I understand your concern. Its always possible that multiple developers hit same bug and endup in multiple patches, there is no way to avoid this unless every developer checks for similar patches on the list. > >> >>> drivers/nvmem/core.c | 32 +++++++++++++++++--------------- >>> 1 file changed, 17 insertions(+), 15 deletions(-) >>> >>> diff --git a/drivers/nvmem/core.c b/drivers/nvmem/core.c >>> index 321d7d63e068..606f428d6292 100644 >>> --- a/drivers/nvmem/core.c >>> +++ b/drivers/nvmem/core.c >>> @@ -822,11 +822,8 @@ struct nvmem_device *nvmem_register(const struct nvmem_config *config) >>> break; >>> } >>> >>> - if (rval) { >>> - ida_free(&nvmem_ida, nvmem->id); >>> - kfree(nvmem); >>> - return ERR_PTR(rval); >>> - } >>> + if (rval) >>> + goto err_gpiod_put; >> >> Why was gpiod changes added to this patch, that should be a separate >> patch/discussion, as this is not relevant to the issue that you are >> reporting. > > Because freeing the device also does a gpiod_put in the destructor, so This are clearly untested, And I dont want this to be in the middle to fix to the issue you are hitting. We should always be careful about untested changes, in this case gpiod has some conditions to check before doing a put. So the patch is incorrect as it is. > doing this is correct in every other instance below and maintains > existing behavior, and it just so happens that this instance converges > into the same codepath so it is correct to merge it, and it just so > happens that the gpiod put was missing in this path to begin with so > this becomes a drive-by bugfix. > > If you don't like it I can remove it (i.e. reintroduce the bug for no > good reason) and you can submit this fix yourself, because I have no > incentive to waste time submitting a separate patch to fix a GPIO leak > in an error path corner case in a subsystem I don't own and I have much > bigger things to spend my (increasingly lower and lower) willingness to > fight for upstream submissions than this. > > Seriously, what is wrong with y'all kernel people. No other open source > project wastes contributors' time with stupid nitpicks like this. I These are not stupid nit picks your v1/v2 patches introduced memory leaks and regressions so i will not be picking up any patches that fall in that area. > found a bug, I fixed it, I then fixed the issues you pointed out, and I > don't have the time nor energy to fight over this kind of nonsense next. I think its worth reading ./Documentation/process/submitting-patches.rst thanks, Srini > Do you want bugs fixed or not? > >>> >>> nvmem->read_only = device_property_present(config->dev, "read-only") || >>> config->read_only || !nvmem->reg_write; >>> @@ -837,20 +834,16 @@ struct nvmem_device *nvmem_register(const struct nvmem_config *config) >>> >>> dev_dbg(&nvmem->dev, "Registering nvmem device %s\n", config->name); >>> >>> - rval = device_register(&nvmem->dev); >>> - if (rval) >>> - goto err_put_device; >>> - >>> if (nvmem->nkeepout) { >>> rval = nvmem_validate_keepouts(nvmem); >>> if (rval) >>> - goto err_device_del; >>> + goto err_gpiod_put; >>> } >>> >>> if (config->compat) { >>> rval = nvmem_sysfs_setup_compat(nvmem, config); >>> if (rval) >>> - goto err_device_del; >>> + goto err_gpiod_put; >>> } >>> >>> if (config->cells) { >>> @@ -867,6 +860,15 @@ struct nvmem_device *nvmem_register(const struct nvmem_config *config) >>> if (rval) >>> goto err_remove_cells; >>> >>> + rval = device_register(&nvmem->dev); >>> + if (rval) { >>> + nvmem_device_remove_all_cells(nvmem); >>> + if (config->compat) >>> + nvmem_sysfs_remove_compat(nvmem, config); >>> + put_device(&nvmem->dev); >>> + return ERR_PTR(rval); >>> + } >>> + >>> blocking_notifier_call_chain(&nvmem_notifier, NVMEM_ADD, nvmem); >>> >>> return nvmem; >>> @@ -876,10 +878,10 @@ struct nvmem_device *nvmem_register(const struct nvmem_config *config) >>> err_teardown_compat: >>> if (config->compat) >>> nvmem_sysfs_remove_compat(nvmem, config); >>> -err_device_del: >>> - device_del(&nvmem->dev); >>> -err_put_device: >>> - put_device(&nvmem->dev); >>> +err_gpiod_put: >>> + gpiod_put(nvmem->wp_gpio); >>> + ida_free(&nvmem_ida, nvmem->id); >>> + kfree(nvmem); >>> >>> return ERR_PTR(rval); >>> } >>> -- >>> 2.35.1 >>> >> > > > - Hector
On 03/01/2023 23.22, Srinivas Kandagatla wrote: >>>> drivers/nvmem/core.c | 32 +++++++++++++++++--------------- >>>> 1 file changed, 17 insertions(+), 15 deletions(-) >>>> >>>> diff --git a/drivers/nvmem/core.c b/drivers/nvmem/core.c >>>> index 321d7d63e068..606f428d6292 100644 >>>> --- a/drivers/nvmem/core.c >>>> +++ b/drivers/nvmem/core.c >>>> @@ -822,11 +822,8 @@ struct nvmem_device *nvmem_register(const struct nvmem_config *config) >>>> break; >>>> } >>>> >>>> - if (rval) { >>>> - ida_free(&nvmem_ida, nvmem->id); >>>> - kfree(nvmem); >>>> - return ERR_PTR(rval); >>>> - } >>>> + if (rval) >>>> + goto err_gpiod_put; >>> >>> Why was gpiod changes added to this patch, that should be a separate >>> patch/discussion, as this is not relevant to the issue that you are >>> reporting. >> >> Because freeing the device also does a gpiod_put in the destructor, so > This are clearly untested, And I dont want this to be in the middle to > fix to the issue you are hitting. I somehow doubt you tested any of these error paths either. Nobody tests initialization error paths. That's why there was a gpio leak here to begin with. > We should always be careful about untested changes, in this case gpiod > has some conditions to check before doing a put. So the patch is > incorrect as it is. Then the existing code is also incorrect as it is, because the device release callback is doing the same gpiod_put() already. I just moved it out since we are now registering the device later. > These are not stupid nit picks your v1/v2 patches introduced memory > leaks and regressions so i will not be picking up any patches that fall > in that area. v1 certainly had issues which you rightly pointed out. Now with v2 you are nitpicking that I merged two codepaths that belong together, and fixed a corner case bug in the process. If there is an actual problem, please point it out. "Lol why did you fix this other bug that naturally falls into the influence of the changes being made" is not constructive. >> found a bug, I fixed it, I then fixed the issues you pointed out, and I >> don't have the time nor energy to fight over this kind of nonsense next. > > I think its worth reading ./Documentation/process/submitting-patches.rst Oh I've read it alright. You're not the first maintainer to have me lose more faith in the kernel development process, this isn't my first rodeo (hint: check MAINTAINERS, I'm also a maintainer). And I know it doesn't have to be like this because other maintainers that are actually reasonable and nice to work with do exist. I'm going to note that right now this core subsystem code is broken in the *success path*, while all the arguments here are about the *error path*. In other words, there is a negligible change that any mistakes/regressions I'm making here would actually impact people, while the status quo is that the code is actively breaking people's systems. So, are you going to actually point out the supposed regressions with v2 so we can actually fix this bug, or should I just drop this submission and keep it in my downstream tree and you can continue to get bug reports about this race condition? - Hector
Hi Hector, On Tue, Jan 03, 2023 at 10:48:52PM +0900, Hector Martin wrote: > >> @@ -822,11 +822,8 @@ struct nvmem_device *nvmem_register(const struct nvmem_config *config) > >> break; > >> } > >> > >> - if (rval) { > >> - ida_free(&nvmem_ida, nvmem->id); > >> - kfree(nvmem); > >> - return ERR_PTR(rval); > >> - } > >> + if (rval) > >> + goto err_gpiod_put; > > > > Why was gpiod changes added to this patch, that should be a separate > > patch/discussion, as this is not relevant to the issue that you are > > reporting. > > Because freeing the device also does a gpiod_put in the destructor, so > doing this is correct in every other instance below and maintains > existing behavior, and it just so happens that this instance converges > into the same codepath so it is correct to merge it, and it just so > happens that the gpiod put was missing in this path to begin with so > this becomes a drive-by bugfix. > > If you don't like it I can remove it (i.e. reintroduce the bug for no > good reason) and you can submit this fix yourself, because I have no > incentive to waste time submitting a separate patch to fix a GPIO leak > in an error path corner case in a subsystem I don't own and I have much > bigger things to spend my (increasingly lower and lower) willingness to > fight for upstream submissions than this. > > Seriously, what is wrong with y'all kernel people. No other open source > project wastes contributors' time with stupid nitpicks like this. I > found a bug, I fixed it, I then fixed the issues you pointed out, and I > don't have the time nor energy to fight over this kind of nonsense next. > Do you want bugs fixed or not? This is not nonsense. We have always had a policy of one fix/change per patch, and in this case it makes complete and utter sense. Of course, the interpretation of "one change" is a matter of opinion. Your patch contains two bug fixes for problems: 1) publication of nvmem_device before it's fully setup (leading to the race) which has been around since the inception of nvmem stuff. 2) fixing a memory leak for gpiod stuff, caused by a recent patch 5544e90c8126 ("nvmem: core: add error handling for dev_set_name") from September 2022. Hence these two changes need different treatment for stable backporting, with the former needing to be backported to more kernels than the latter. So, it makes complete sense to split the two fixes into their own separate patches. Why are other projects happy to accept a patch that fixes multiple issues? Maybe they work in different ways - maybe they don't backport changes to older releases? Maybe they don't do multiple stable versions like we do with the kernel.
On 04/01/2023 00.06, Russell King (Oracle) wrote: > Hi Hector, > > On Tue, Jan 03, 2023 at 10:48:52PM +0900, Hector Martin wrote: >>>> @@ -822,11 +822,8 @@ struct nvmem_device *nvmem_register(const struct nvmem_config *config) >>>> break; >>>> } >>>> >>>> - if (rval) { >>>> - ida_free(&nvmem_ida, nvmem->id); >>>> - kfree(nvmem); >>>> - return ERR_PTR(rval); >>>> - } >>>> + if (rval) >>>> + goto err_gpiod_put; >>> >>> Why was gpiod changes added to this patch, that should be a separate >>> patch/discussion, as this is not relevant to the issue that you are >>> reporting. >> >> Because freeing the device also does a gpiod_put in the destructor, so >> doing this is correct in every other instance below and maintains >> existing behavior, and it just so happens that this instance converges >> into the same codepath so it is correct to merge it, and it just so >> happens that the gpiod put was missing in this path to begin with so >> this becomes a drive-by bugfix. >> >> If you don't like it I can remove it (i.e. reintroduce the bug for no >> good reason) and you can submit this fix yourself, because I have no >> incentive to waste time submitting a separate patch to fix a GPIO leak >> in an error path corner case in a subsystem I don't own and I have much >> bigger things to spend my (increasingly lower and lower) willingness to >> fight for upstream submissions than this. >> >> Seriously, what is wrong with y'all kernel people. No other open source >> project wastes contributors' time with stupid nitpicks like this. I >> found a bug, I fixed it, I then fixed the issues you pointed out, and I >> don't have the time nor energy to fight over this kind of nonsense next. >> Do you want bugs fixed or not? > > This is not nonsense. We have always had a policy of one fix/change > per patch, and in this case it makes complete and utter sense. Of > course, the interpretation of "one change" is a matter of opinion. The change here is the race condition fix. That change involves adding an error cleanup path that involves a gpio_put(). Therefore it seems logical to actually use it in that one extra case that should've used it anyway, a few lines above. Now, > > Your patch contains two bug fixes for problems: > 1) publication of nvmem_device before it's fully setup (leading to the > race) which has been around since the inception of nvmem stuff. > 2) fixing a memory leak for gpiod stuff, caused by a recent patch > 5544e90c8126 ("nvmem: core: add error handling for dev_set_name") > from September 2022. That's a fair argument for having two patches (I didn't know the gpiod leak was introduced later). However, the backport is nontrivial anyway if you want clean code, because if we merge the codepaths the fix would end up being different in backports and mainline. Which means we now need 3 patches for them to apply properly. Which is more effort than I'm willing to put in for an issue I don't care about. But the bigger problem is that this isn't what Srini replied with, he's now saying my patch is outright broken, and I'm tired of this nonsense. - Hector
On Tue, Jan 03, 2023 at 11:56:21PM +0900, Hector Martin wrote: > On 03/01/2023 23.22, Srinivas Kandagatla wrote: > >>>> drivers/nvmem/core.c | 32 +++++++++++++++++--------------- > >>>> 1 file changed, 17 insertions(+), 15 deletions(-) > >>>> > >>>> diff --git a/drivers/nvmem/core.c b/drivers/nvmem/core.c > >>>> index 321d7d63e068..606f428d6292 100644 > >>>> --- a/drivers/nvmem/core.c > >>>> +++ b/drivers/nvmem/core.c > >>>> @@ -822,11 +822,8 @@ struct nvmem_device *nvmem_register(const struct nvmem_config *config) > >>>> break; > >>>> } > >>>> > >>>> - if (rval) { > >>>> - ida_free(&nvmem_ida, nvmem->id); > >>>> - kfree(nvmem); > >>>> - return ERR_PTR(rval); > >>>> - } > >>>> + if (rval) > >>>> + goto err_gpiod_put; > >>> > >>> Why was gpiod changes added to this patch, that should be a separate > >>> patch/discussion, as this is not relevant to the issue that you are > >>> reporting. > >> > >> Because freeing the device also does a gpiod_put in the destructor, so > > This are clearly untested, And I dont want this to be in the middle to > > fix to the issue you are hitting. > > I somehow doubt you tested any of these error paths either. Nobody tests > initialization error paths. That's why there was a gpio leak here to > begin with. Sadly, this is one of the biggest problems with error paths, they get very little proper testing - and in most cases we're reliant on reviewers spotting errors. That's why we much prefer the devm_* stuff, but even that can be error-prone. > > We should always be careful about untested changes, in this case gpiod > > has some conditions to check before doing a put. So the patch is > > incorrect as it is. > > Then the existing code is also incorrect as it is, because the device > release callback is doing the same gpiod_put() already. I just moved it > out since we are now registering the device later. At the point where this change is being made (checking rval after dev_set_name()) the struct device has not been initialised, so the release callback will not be called. nvmem->wp_gpio will be leaked. However, there may be bigger problems with wp_gpio - related to where it can come from and what we do with it. nvmem->wp_gpio has two sources - one of them is gpiod_get_optional(), and we need to call gpiod_put() on that to drop the reference that _this_ code acquired. The other is config->wp_gpio - we don't own that reference, yet we call gpiod_put() on it. I'm not sure whether config->wp_gpio is actually used anywhere - my grepping so far has not found any users, but maybe Srivinas knows better. Hence, sorting out the leak of wp_gpio needs more discussion, and it would not be right to delay merging the fix for the very real race that people hit today resulting in stuff not working while we try to work out how wp_gpio should be handled. So... always fix one problem in one patch. Sometimes a fix is not as obvious as one may first think.
On Wed, Jan 04, 2023 at 12:14:10AM +0900, Hector Martin wrote: > On 04/01/2023 00.06, Russell King (Oracle) wrote: > > Hi Hector, > > > > On Tue, Jan 03, 2023 at 10:48:52PM +0900, Hector Martin wrote: > >>>> @@ -822,11 +822,8 @@ struct nvmem_device *nvmem_register(const struct nvmem_config *config) > >>>> break; > >>>> } > >>>> > >>>> - if (rval) { > >>>> - ida_free(&nvmem_ida, nvmem->id); > >>>> - kfree(nvmem); > >>>> - return ERR_PTR(rval); > >>>> - } > >>>> + if (rval) > >>>> + goto err_gpiod_put; > >>> > >>> Why was gpiod changes added to this patch, that should be a separate > >>> patch/discussion, as this is not relevant to the issue that you are > >>> reporting. > >> > >> Because freeing the device also does a gpiod_put in the destructor, so > >> doing this is correct in every other instance below and maintains > >> existing behavior, and it just so happens that this instance converges > >> into the same codepath so it is correct to merge it, and it just so > >> happens that the gpiod put was missing in this path to begin with so > >> this becomes a drive-by bugfix. > >> > >> If you don't like it I can remove it (i.e. reintroduce the bug for no > >> good reason) and you can submit this fix yourself, because I have no > >> incentive to waste time submitting a separate patch to fix a GPIO leak > >> in an error path corner case in a subsystem I don't own and I have much > >> bigger things to spend my (increasingly lower and lower) willingness to > >> fight for upstream submissions than this. > >> > >> Seriously, what is wrong with y'all kernel people. No other open source > >> project wastes contributors' time with stupid nitpicks like this. I > >> found a bug, I fixed it, I then fixed the issues you pointed out, and I > >> don't have the time nor energy to fight over this kind of nonsense next. > >> Do you want bugs fixed or not? > > > > This is not nonsense. We have always had a policy of one fix/change > > per patch, and in this case it makes complete and utter sense. Of > > course, the interpretation of "one change" is a matter of opinion. > > The change here is the race condition fix. That change involves adding > an error cleanup path that involves a gpio_put(). Therefore it seems > logical to actually use it in that one extra case that should've used it > anyway, a few lines above. The two are entirely unrelated. as I've already explained. The call to device_register() happens _after_ the check for rval from the dev_set_name() that you are changing. Moving device_register() doesn't make the lack of gpiod_put() any better or worse than it was before. That said, I'm now thinking that my patch is actually wrong, but for a different reason unrelated to the gpiod issue. :(
On 04/01/2023 00.18, Russell King (Oracle) wrote: > On Tue, Jan 03, 2023 at 11:56:21PM +0900, Hector Martin wrote: >> On 03/01/2023 23.22, Srinivas Kandagatla wrote: >>>>>> drivers/nvmem/core.c | 32 +++++++++++++++++--------------- >>>>>> 1 file changed, 17 insertions(+), 15 deletions(-) >>>>>> >>>>>> diff --git a/drivers/nvmem/core.c b/drivers/nvmem/core.c >>>>>> index 321d7d63e068..606f428d6292 100644 >>>>>> --- a/drivers/nvmem/core.c >>>>>> +++ b/drivers/nvmem/core.c >>>>>> @@ -822,11 +822,8 @@ struct nvmem_device *nvmem_register(const struct nvmem_config *config) >>>>>> break; >>>>>> } >>>>>> >>>>>> - if (rval) { >>>>>> - ida_free(&nvmem_ida, nvmem->id); >>>>>> - kfree(nvmem); >>>>>> - return ERR_PTR(rval); >>>>>> - } >>>>>> + if (rval) >>>>>> + goto err_gpiod_put; >>>>> >>>>> Why was gpiod changes added to this patch, that should be a separate >>>>> patch/discussion, as this is not relevant to the issue that you are >>>>> reporting. >>>> >>>> Because freeing the device also does a gpiod_put in the destructor, so >>> This are clearly untested, And I dont want this to be in the middle to >>> fix to the issue you are hitting. >> >> I somehow doubt you tested any of these error paths either. Nobody tests >> initialization error paths. That's why there was a gpio leak here to >> begin with. > > Sadly, this is one of the biggest problems with error paths, they get > very little proper testing - and in most cases we're reliant on > reviewers spotting errors. That's why we much prefer the devm_* stuff, > but even that can be error-prone. > >>> We should always be careful about untested changes, in this case gpiod >>> has some conditions to check before doing a put. So the patch is >>> incorrect as it is. >> >> Then the existing code is also incorrect as it is, because the device >> release callback is doing the same gpiod_put() already. I just moved it >> out since we are now registering the device later. > > At the point where this change is being made (checking rval after > dev_set_name()) the struct device has not been initialised, so the > release callback will not be called. nvmem->wp_gpio will be leaked. But later in the code where device_put() was being called would will be, and that callback is calling gpiod_put() unconditionally, which is why I am doing the same after moving the device registration later. Is this wrong? Well, > However, there may be bigger problems with wp_gpio - related to where > it can come from and what we do with it. > > nvmem->wp_gpio has two sources - one of them is gpiod_get_optional(), > and we need to call gpiod_put() on that to drop the reference that > _this_ code acquired. The other is config->wp_gpio - we don't own > that reference, yet we call gpiod_put() on it. I'm not sure whether > config->wp_gpio is actually used anywhere - my grepping so far has > not found any users, but maybe Srivinas knows better. ... then yes, it's wrong, and the original code is already broken too. I merely moved functionality around in the other cases *except* the one case after dev_set_name(), where I swapped one bug out for calling other buggy code derived from existing buggy code. ¯\_(ツ)_/¯ > Hence, sorting out the leak of wp_gpio needs more discussion, and it > would not be right to delay merging the fix for the very real race > that people hit today resulting in stuff not working while we try > to work out how wp_gpio should be handled. > So... always fix one problem in one patch. Sometimes a fix is not as > obvious as one may first think. Sure, but you do realize the issue here, right? This, coupled with the kernel development model (and *especially* the impolite discourse that often goes along with it from maintainers) incentivizes doing the minimum amount of work to fix the minimum amount of bugs that actually affect people. I couldn't care less about the gpiod leak fix, it doesn't affect me, but I noticed and thought I could fix it. Turns out it's more complicated than that because the existing code is even more broken than I thought, sure - but I have no incentive to start that conversation nor fix all this now. Now I just know next time I touch nvmem code I won't bother pointing out any bugs I notice by accident, because clearly the discussion likely to follow is more mental overhead than I'm willing to spend on something that doesn't affect me. When bugs don't noticeably break people and fixing them is too much effort, nobody will. While if my change had been merged as-is, we'd have at least fixed one common case bug and reduced two related bugs to one class of bug (the unconditional gpiod_put()) which is still a strict improvement. And then someone can point out the incorrect puts when the GPIO isn't owned and fix that later if they want. - Hector
On Wed, Jan 04, 2023 at 12:33:33AM +0900, Hector Martin wrote: > On 04/01/2023 00.18, Russell King (Oracle) wrote: > > On Tue, Jan 03, 2023 at 11:56:21PM +0900, Hector Martin wrote: > >> On 03/01/2023 23.22, Srinivas Kandagatla wrote: > >>>>>> drivers/nvmem/core.c | 32 +++++++++++++++++--------------- > >>>>>> 1 file changed, 17 insertions(+), 15 deletions(-) > >>>>>> > >>>>>> diff --git a/drivers/nvmem/core.c b/drivers/nvmem/core.c > >>>>>> index 321d7d63e068..606f428d6292 100644 > >>>>>> --- a/drivers/nvmem/core.c > >>>>>> +++ b/drivers/nvmem/core.c > >>>>>> @@ -822,11 +822,8 @@ struct nvmem_device *nvmem_register(const struct nvmem_config *config) > >>>>>> break; > >>>>>> } > >>>>>> > >>>>>> - if (rval) { > >>>>>> - ida_free(&nvmem_ida, nvmem->id); > >>>>>> - kfree(nvmem); > >>>>>> - return ERR_PTR(rval); > >>>>>> - } > >>>>>> + if (rval) > >>>>>> + goto err_gpiod_put; > >>>>> > >>>>> Why was gpiod changes added to this patch, that should be a separate > >>>>> patch/discussion, as this is not relevant to the issue that you are > >>>>> reporting. > >>>> > >>>> Because freeing the device also does a gpiod_put in the destructor, so > >>> This are clearly untested, And I dont want this to be in the middle to > >>> fix to the issue you are hitting. > >> > >> I somehow doubt you tested any of these error paths either. Nobody tests > >> initialization error paths. That's why there was a gpio leak here to > >> begin with. > > > > Sadly, this is one of the biggest problems with error paths, they get > > very little proper testing - and in most cases we're reliant on > > reviewers spotting errors. That's why we much prefer the devm_* stuff, > > but even that can be error-prone. > > > >>> We should always be careful about untested changes, in this case gpiod > >>> has some conditions to check before doing a put. So the patch is > >>> incorrect as it is. > >> > >> Then the existing code is also incorrect as it is, because the device > >> release callback is doing the same gpiod_put() already. I just moved it > >> out since we are now registering the device later. > > > > At the point where this change is being made (checking rval after > > dev_set_name()) the struct device has not been initialised, so the > > release callback will not be called. nvmem->wp_gpio will be leaked. > > But later in the code where device_put() was being called would will be, > and that callback is calling gpiod_put() unconditionally, which is why I > am doing the same after moving the device registration later. > > Is this wrong? Well, I'm not going to read the rest of your rant, honestly it's really not worth it. Let's just concentrate on trying to work out how best to fix this crud. Not only is there the issue with wp_gpio, but the whole IDA handling is fscked as well, so there's many problems to be sorted out here, and if we lump them all into one patch, we'll probably be getting to the point of completely rewriting nvmem_register() making backports extremely difficult.
diff --git a/drivers/nvmem/core.c b/drivers/nvmem/core.c index 321d7d63e068..606f428d6292 100644 --- a/drivers/nvmem/core.c +++ b/drivers/nvmem/core.c @@ -822,11 +822,8 @@ struct nvmem_device *nvmem_register(const struct nvmem_config *config) break; } - if (rval) { - ida_free(&nvmem_ida, nvmem->id); - kfree(nvmem); - return ERR_PTR(rval); - } + if (rval) + goto err_gpiod_put; nvmem->read_only = device_property_present(config->dev, "read-only") || config->read_only || !nvmem->reg_write; @@ -837,20 +834,16 @@ struct nvmem_device *nvmem_register(const struct nvmem_config *config) dev_dbg(&nvmem->dev, "Registering nvmem device %s\n", config->name); - rval = device_register(&nvmem->dev); - if (rval) - goto err_put_device; - if (nvmem->nkeepout) { rval = nvmem_validate_keepouts(nvmem); if (rval) - goto err_device_del; + goto err_gpiod_put; } if (config->compat) { rval = nvmem_sysfs_setup_compat(nvmem, config); if (rval) - goto err_device_del; + goto err_gpiod_put; } if (config->cells) { @@ -867,6 +860,15 @@ struct nvmem_device *nvmem_register(const struct nvmem_config *config) if (rval) goto err_remove_cells; + rval = device_register(&nvmem->dev); + if (rval) { + nvmem_device_remove_all_cells(nvmem); + if (config->compat) + nvmem_sysfs_remove_compat(nvmem, config); + put_device(&nvmem->dev); + return ERR_PTR(rval); + } + blocking_notifier_call_chain(&nvmem_notifier, NVMEM_ADD, nvmem); return nvmem; @@ -876,10 +878,10 @@ struct nvmem_device *nvmem_register(const struct nvmem_config *config) err_teardown_compat: if (config->compat) nvmem_sysfs_remove_compat(nvmem, config); -err_device_del: - device_del(&nvmem->dev); -err_put_device: - put_device(&nvmem->dev); +err_gpiod_put: + gpiod_put(nvmem->wp_gpio); + ida_free(&nvmem_ida, nvmem->id); + kfree(nvmem); return ERR_PTR(rval); }