Message ID | 20231121070619.9836-3-saeed@kernel.org |
---|---|
State | New |
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:612c:2b07:b0:403:3b70:6f57 with SMTP id io7csp444522vqb; Mon, 20 Nov 2023 23:08:33 -0800 (PST) X-Google-Smtp-Source: AGHT+IGSw77xI20jE8oYUqEuYaaZytXstO9DHSilJ1rwpNeyROeMVxnQvn9gHSf5gv3pO1e7J5n4 X-Received: by 2002:a05:6358:4407:b0:168:d2f8:d2ad with SMTP id z7-20020a056358440700b00168d2f8d2admr9263381rwc.7.1700550513682; Mon, 20 Nov 2023 23:08:33 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1700550513; cv=none; d=google.com; s=arc-20160816; b=LmExIzcUXiRSzsvyxdNxdyeYulT7KKoo7N8y7CkrPJ3AapwgYi9ZjzDXqOoK2OV7DV 3v8C8/hpnS21EXXehd/R9hJrPosweyMGsGJk8RITPAP8DkzvqN8JJSDW784enxBzGjon /sNmgPwv16bAisblbw23FoacPQd0hyCoDr7Y9rI7mIgMYj+Pzq2WZWhKH+XRajo2QPdF WTGE3l+IcBp5tZjRDsfVl576UKZ99HPZTMG6VNBau74kcEJQQhyXZBjhhbLVFYsT6zjP Cd3IwHiaI9AKONFEYMuWzCu6PGAxGONOzxQaVrOy6xhO0fjW9nPoFIX2nCGaMQieoy4/ Z4mw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=EFepjdNFszVAVwsSuQXKkHjEKCG82U0NDFT7LoG9fq0=; fh=X0hOMbpOU1raqJinhHHPFdyR1sBFq2lMGBVOqO0x1pk=; b=bTrEvUZFPrOiHT4AchmTtOwFi3C7DZq8LESK50YDatP3zYof9xwgxuKGJTFMOWEUCl p2oJW9nAUeU49x+vAmh/Cd/SIOfeZk2xMJr6UDL4umvO7153oEV2m3c0Qlqz8J7r0rkz h27ko8j1OaJj4+//DpP0lQKP99b1SE3wUekHR83Op8af0LbxcEvd5j12cMxqMDvPxTx+ DP+7RDWoDscnQJPTe6Zdb26y4oNAMzeHhma6eMCBPylsUhGN7Mp3QjrXLbNCFXvI9tki jDc5ylycSvP7SqAQUcC/mVBufk9/O+l6QwdgFHQPEQz9Bdy9ro842keAA1QxaWdIO2iR G5Uw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=mi+Pgl1w; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.37 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: from snail.vger.email (snail.vger.email. [23.128.96.37]) by mx.google.com with ESMTPS id 143-20020a630095000000b005bdf596397bsi9672130pga.732.2023.11.20.23.08.33 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 20 Nov 2023 23:08:33 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.37 as permitted sender) client-ip=23.128.96.37; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=mi+Pgl1w; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.37 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by snail.vger.email (Postfix) with ESMTP id F1C0980DEA64; Mon, 20 Nov 2023 23:06:43 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at snail.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229536AbjKUHGk (ORCPT <rfc822;ouuuleilei@gmail.com> + 99 others); Tue, 21 Nov 2023 02:06:40 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35362 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229754AbjKUHGe (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Tue, 21 Nov 2023 02:06:34 -0500 Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id DF92710C for <linux-kernel@vger.kernel.org>; Mon, 20 Nov 2023 23:06:28 -0800 (PST) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 77D5AC433C8; Tue, 21 Nov 2023 07:06:28 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1700550388; bh=isJE+4To73RbzPw8SDE1Zz0T/qfFE6bocOEKEudicH0=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=mi+Pgl1wvwnpVnwtnnEbnNS/+QC1P52sQYHZkYBuCnDBJwAArVD/Kq7fWQomA5Pdu 2CxDukFP8BH6Y0blRhzcNotoC6HuBo+y9S8iTokbJKbELKo14vbyibKX6sF+oKGYBN C/ib9SfRFWM0tf1I50O1GEwR50d8jtOya92JpLYwtyPYv8HoJeBBcMOTTgIrfItWQA nxPmFhEhuTpp670oN2OsvROjbiRfFkFthNa5WInBz+HZPoZ6PdOFr48nyNwhZInwva cH/VfJsmXyg6zLkLmQ/VYmymrjMeYc2Kh0D3ljFR0Nu0Z49kM5tt7hHtkC6hi2qc7o LZz+dTpbWHVlQ== From: Saeed Mahameed <saeed@kernel.org> To: Arnd Bergmann <arnd@arndb.de>, Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jason Gunthorpe <jgg@nvidia.com>, Leon Romanovsky <leonro@nvidia.com>, Jiri Pirko <jiri@nvidia.com>, Leonid Bloch <lbloch@nvidia.com>, Itay Avraham <itayavr@nvidia.com>, Jakub Kicinski <kuba@kernel.org>, linux-kernel@vger.kernel.org, Saeed Mahameed <saeedm@nvidia.com> Subject: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver Date: Mon, 20 Nov 2023 23:06:16 -0800 Message-ID: <20231121070619.9836-3-saeed@kernel.org> X-Mailer: git-send-email 2.42.0 In-Reply-To: <20231121070619.9836-1-saeed@kernel.org> References: <20231121070619.9836-1-saeed@kernel.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF, RCVD_IN_DNSWL_BLOCKED,SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (snail.vger.email [0.0.0.0]); Mon, 20 Nov 2023 23:06:44 -0800 (PST) X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1783156455098001207 X-GMAIL-MSGID: 1783156455098001207 |
Series | mlx5 ConnectX control misc driver | |
Commit Message
Saeed Mahameed
Nov. 21, 2023, 7:06 a.m. UTC
From: Saeed Mahameed <saeedm@nvidia.com> The ConnectX HW family supported by the mlx5 drivers uses an architecture where a FW component executes "mailbox RPCs" issued by the driver to make changes to the device. This results in a complex debugging environment where the FW component has information and low level configuration that needs to be accessed to userspace for debugging purposes. Historically a userspace program was used that accessed the PCI register and config space directly through /sys/bus/pci/.../XXX and could operate these debugging interfaces in parallel with the running driver. This approach is incompatible with secure boot and kernel lockdown so this driver provides a secure and restricted interface to that same data. On open the driver would allocate a special FW UID (user context ID) restrected to debug RPCs only, later in this series all user RPCs will be stamped with this UID. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> --- MAINTAINERS | 8 + drivers/misc/Kconfig | 1 + drivers/misc/Makefile | 1 + drivers/misc/mlx5ctl/Kconfig | 14 ++ drivers/misc/mlx5ctl/Makefile | 4 + drivers/misc/mlx5ctl/main.c | 314 ++++++++++++++++++++++++++++++++++ 6 files changed, 342 insertions(+) create mode 100644 drivers/misc/mlx5ctl/Kconfig create mode 100644 drivers/misc/mlx5ctl/Makefile create mode 100644 drivers/misc/mlx5ctl/main.c
Comments
On Mon, Nov 20, 2023 at 11:06:16PM -0800, Saeed Mahameed wrote: > +config MLX5CTL > + tristate "mlx5 ConnectX control misc driver" > + depends on MLX5_CORE > + help > + MLX5CTL provides interface for the user process to access the debug and > + configuration registers of the ConnectX hardware family > + (NICs, PCI switches and SmartNIC SoCs). > + This will allow configuration and debug tools to work out of the box on > + mainstream kernel. Did you run checkpatch.pl on this? You lost the tabs :( > --- /dev/null > +++ b/drivers/misc/mlx5ctl/main.c > @@ -0,0 +1,314 @@ > +// SPDX-License-Identifier: BSD-3-Clause OR GPL-2.0 Why is this dual licensed? Again, you are using a GPL-only api for this driver, how would that even be possible to make this code be BSD? I thought we already discussed this, AND I talked to someone who discussed this with a nvidia lawyer already and I thought this was going to get changed. What happened to that? thanks, greg k-h
On Mon, Nov 27, 2023 at 01:36:54PM +0000, Greg Kroah-Hartman wrote: > Why is this dual licensed? Again, you are using a GPL-only api for this > driver, how would that even be possible to make this code be BSD? The code to FreeBSD with mlx5 related dual licensed code integrated is freely available to inspect. I've never looked myself but I'm told FreeBSD changes the reference Linux code to remove the use of GPL code, and that FreeBSD has created BSD licensed versions of some kernel APIs to support that. > I thought we already discussed this, AND I talked to someone who > discussed this with a nvidia lawyer already and I thought this was going > to get changed. What happened to that? It is in the cover letter. You asked for an approval and statement from our legal and we obtained it. Our lawyers did a review, discussed with a LF contact, and continue to instruct to use the dual license. We've done what you required us to do. The summary I have of the call you refer to does not include a discussion or agreement about change in nvidia policy regarding mlx5 code. Like Dave said, our lawyers are not your lawyers. Now that we have involved legal, and they have given an instruction, we must follow it. Regards, Jason
On Mon, Nov 27, 2023 at 10:40:17AM -0400, Jason Gunthorpe wrote: > On Mon, Nov 27, 2023 at 01:36:54PM +0000, Greg Kroah-Hartman wrote: > > > Why is this dual licensed? Again, you are using a GPL-only api for this > > driver, how would that even be possible to make this code be BSD? > > The code to FreeBSD with mlx5 related dual licensed code integrated is > freely available to inspect. I've never looked myself but I'm told > FreeBSD changes the reference Linux code to remove the use of GPL > code, and that FreeBSD has created BSD licensed versions of some > kernel APIs to support that. Yeah, I'm not going to get into the legality of "creating BSD licensed versions of gpl-only Linux apis" but note, lots of lawyers look longingly at those things when they consider early retirement :) > > I thought we already discussed this, AND I talked to someone who > > discussed this with a nvidia lawyer already and I thought this was going > > to get changed. What happened to that? > > It is in the cover letter. You asked for an approval and statement > from our legal and we obtained it. Our lawyers did a review, discussed > with a LF contact, and continue to instruct to use the dual > license. We've done what you required us to do. Ah, missed that, sorry, I didn't see it in the changes for this set of patches, it was in the previous submission. > The summary I have of the call you refer to does not include a > discussion or agreement about change in nvidia policy regarding mlx5 > code. > > Like Dave said, our lawyers are not your lawyers. Now that we have > involved legal, and they have given an instruction, we must follow it. I think everyone involved is thankful that your lawyers are not mine :) Ok, best of luck with this mess, I'll stop harping on it now and just point out all of the other issues here. First off, you all need to get the network maintainers to agree that this driver is ok to do this way, and I don't think that has happened yet, so I'll wait on reviewing the series until that is resolved. thanks, greg k-h
On Mon, Nov 27, 2023 at 03:51:10PM +0000, Greg Kroah-Hartman wrote: > Ok, best of luck with this mess, I'll stop harping on it now and just > point out all of the other issues here. First off, you all need to get > the network maintainers to agree that this driver is ok to do this way, > and I don't think that has happened yet, so I'll wait on reviewing the > series until that is resolved. As I said already, I strongly disagree with the idea that the netdev maintainers get a global veto on what happens with mlx5 devices just because they sometimes have an ethernet port on the back of the card. This module is primarily (but not exclusively) for rdma related functionality, not netdev, and the RDMA maintainers Ack it. Thanks, Jason
On Mon, Nov 27, 2023 at 12:17:32PM -0400, Jason Gunthorpe wrote: > On Mon, Nov 27, 2023 at 03:51:10PM +0000, Greg Kroah-Hartman wrote: > > > Ok, best of luck with this mess, I'll stop harping on it now and just > > point out all of the other issues here. First off, you all need to get > > the network maintainers to agree that this driver is ok to do this way, > > and I don't think that has happened yet, so I'll wait on reviewing the > > series until that is resolved. > > As I said already, I strongly disagree with the idea that the netdev > maintainers get a global veto on what happens with mlx5 devices just > because they sometimes have an ethernet port on the back of the card. I understand you might disagree, however I hold their opinion in high regard and want to ensure that they agree that exposing device-specific debugging information for a device that deals with networking is ok to do so in a device-specific misc device node and not through some other way that other networking devices normally do (i.e. netlink or some-other-such-thing.) Note, device-specific character devices have almost always proven to be a bad idea in the long run, I understand your immediate need to do something like this, but remember that keeping it alive for the next 20+ years is going to be tough. > This module is primarily (but not exclusively) for rdma related > functionality, not netdev, and the RDMA maintainers Ack it. In my mind, RDMA implies networking, as it's over a network connection, but hey, I might be wrong :) thanks, greg k-h
On Mon, Nov 20, 2023 at 11:06:16PM -0800, Saeed Mahameed wrote: > +struct mlx5ctl_dev { > + struct mlx5_core_dev *mdev; > + struct miscdevice miscdev; > + struct auxiliary_device *adev; > + struct list_head fd_list; > + spinlock_t fd_list_lock; /* protect list add/del */ > + struct rw_semaphore rw_lock; > + struct kref refcount; You now have 2 different things that control the lifespan of this structure. We really need some way to automatically check this so that people don't keep making this same mistake, it happens all the time :( Please pick one structure (miscdevice) or the other (kref) to control the lifespan, having 2 will just not work. Also, why a rw_semaphore? Only use those if you can prove with a benchmark that it is actually faster, otherwise it's simpler to just use a normal mutex (hint, you are changing the fields in the structure with the read lock held, which feels very wrong, and so needs a LOT of documentation, or just use a normal mutex...) thanks, greg k-h
On 27 Nov 18:27, Greg Kroah-Hartman wrote: >On Mon, Nov 27, 2023 at 12:17:32PM -0400, Jason Gunthorpe wrote: >> On Mon, Nov 27, 2023 at 03:51:10PM +0000, Greg Kroah-Hartman wrote: >> >> > Ok, best of luck with this mess, I'll stop harping on it now and just >> > point out all of the other issues here. First off, you all need to get >> > the network maintainers to agree that this driver is ok to do this way, >> > and I don't think that has happened yet, so I'll wait on reviewing the >> > series until that is resolved. >> >> As I said already, I strongly disagree with the idea that the netdev >> maintainers get a global veto on what happens with mlx5 devices just >> because they sometimes have an ethernet port on the back of the card. > >I understand you might disagree, however I hold their opinion in high >regard and want to ensure that they agree that exposing device-specific >debugging information for a device that deals with networking is ok to >do so in a device-specific misc device node and not through some other >way that other networking devices normally do (i.e. netlink or >some-other-such-thing.) > >Note, device-specific character devices have almost always proven to be >a bad idea in the long run, I understand your immediate need to do >something like this, but remember that keeping it alive for the next 20+ >years is going to be tough. > This driver is different as it doesn't replace existing mlx5 drivers, mlx5 functionality drivers are still there to expose the device features through the standard stacks, this is just a companion driver to access debug information, by driver and FW design mlx5ctl is not meant to manage or pilot the device like other device specific char drivers. To be clear this debug driver (or at least an older version of it) has been already in use for over than 15 years, since the beginning of mlx5, we used to only provide it as external package called mft debug tools [1] which has the kernel parts as well. Now it's time to upstream it. mlx5ctl will keep serving existing and future HW for the next few decades, I am pretty sure of that. as the cover-letter explains mlx5 architecture is set in stone and written in ink, the same mlx5 drivers work on any ConnectX chip since 2012, and the will keep working on the next generations of chips, mlx5ctl will be no different. [1] https://network.nvidia.com/products/adapter-software/firmware-tools/ >> This module is primarily (but not exclusively) for rdma related >> functionality, not netdev, and the RDMA maintainers Ack it. > For Infiniband/virtio/vfio/vdpa/nvme/fpga ConnectX devices mlx5 netdev doesn't even exist, so it is not reasonable to ask that the debug interface should go via the netdev stack, mlx5ctl is needed to serve all users of mlx5 devices, not only netdev (networking). So I really find this odd, that one stack maintainer gets a veto over all others. >In my mind, RDMA implies networking, as it's over a network connection, >but hey, I might be wrong :) > >thanks, > >greg k-h
On Mon, 27 Nov 2023 11:26:06 -0800 Saeed Mahameed wrote: > This driver is different as it doesn't replace existing mlx5 drivers, > mlx5 functionality drivers are still there to expose the device features > through the standard stacks, this is just a companion driver to access > debug information, by driver and FW design mlx5ctl is not meant to > manage or pilot the device like other device specific char drivers. You keep saying "debug information" which is really underselling this driver. Are you not going to support mstreg? The common development flow as far as netdev is concerned is: - some customer is interested in a new feature of a chip - vendor hacks the support out of tree, using oot module and/or user space tooling - customer does a PoC with that hacked up, non-upstream solution - if it works, vendor has to find out a proper upstream API, hopefully agreed on with other vendors - if it doesn't match customer needs the whole thing lands in the bin If the vendor specific PoC can be achieved with fully upstream software we lose leverage to force vendors to agree on common APIs. This should all be self-evident for people familiar with netdev, but I thought I'd spell it out. I understand that the device has other uses, but every modern NIC has other uses. I don't see how netdev can agree to this driver as long as there is potential for configuring random networking things thru it. All major netdev vendors have a set of out of tree tools / "expose everything misc drivers", "for debug". They will soon follow your example if we let this in.
On Mon, Nov 27, 2023 at 04:07:19PM -0800, Jakub Kicinski wrote: > On Mon, 27 Nov 2023 11:26:06 -0800 Saeed Mahameed wrote: > > This driver is different as it doesn't replace existing mlx5 drivers, > > mlx5 functionality drivers are still there to expose the device features > > through the standard stacks, this is just a companion driver to access > > debug information, by driver and FW design mlx5ctl is not meant to > > manage or pilot the device like other device specific char drivers. > > You keep saying "debug information" which is really underselling this > driver. Are you not going to support mstreg? > > The common development flow as far as netdev is concerned is: > - some customer is interested in a new feature of a chip > - vendor hacks the support out of tree, using oot module and/or > user space tooling > - customer does a PoC with that hacked up, non-upstream solution > - if it works, vendor has to find out a proper upstream API, > hopefully agreed on with other vendors > - if it doesn't match customer needs the whole thing lands in the bin > > If the vendor specific PoC can be achieved with fully upstream software > we lose leverage to force vendors to agree on common APIs. Please elaborate on what "common" API there is to create here? Do you agree that each ASIC in the device is unique and hence will have made different trade offs - both features and nerd knobs to tune and affect the performance and runtime capabilities? If you do not agree, then we need to have a different discussion ... If you do, please elaborate on the outline of some common API that could possibly be done here. You said no to the devlink parameters as a way to tune an ASIC. This is a common, established tool, using a common, established message channel but in an open, free form way of letting a customer see what tunables there are for a specific H/W version and firmware version and then set them. That is about as common as can be for different vendors creating different ASICs with different goals and design objectives. Yet, you somehow expect all of them to have nerd knob X and tunable Y. That is not realistic. > > This should all be self-evident for people familiar with netdev, but > I thought I'd spell it out. > > I understand that the device has other uses, but every modern NIC has > other uses. I don't see how netdev can agree to this driver as long as > there is potential for configuring random networking things thru it. > All major netdev vendors have a set of out of tree tools / "expose > everything misc drivers", "for debug". They will soon follow your > example if we let this in. Out of tree drivers are already ingrained into customers now. Mellanox (in this case) has tried many different angles at getting access to H/W unique tunables (i.e., the devlink comment) and now dumping huge amounts of data. Your response to the devlink parameters attempt is to basically abuse the firmware upgrade command as a way to push a binary blob that can contain said settings (and I still have not fully wrapped my head around the fact that you suggested that option). What specifically are you looking for? There are different vendors and different h/w options for specific market based reasons. Your hard line stance against needs like this is what is pushing out of tree drivers and tools to continue.
On Mon, 27 Nov 2023 21:46:28 -0700 David Ahern wrote: > > You keep saying "debug information" which is really underselling this > > driver. Are you not going to support mstreg? > > > > The common development flow as far as netdev is concerned is: > > - some customer is interested in a new feature of a chip > > - vendor hacks the support out of tree, using oot module and/or > > user space tooling > > - customer does a PoC with that hacked up, non-upstream solution > > - if it works, vendor has to find out a proper upstream API, > > hopefully agreed on with other vendors > > - if it doesn't match customer needs the whole thing lands in the bin > > > > If the vendor specific PoC can be achieved with fully upstream software > > we lose leverage to force vendors to agree on common APIs. > > Please elaborate on what "common" API there is to create here? Damn, am I so bad at explaining basic things or y'all are spending 5 seconds reading this and are not really paying attention? :| > Do you agree that each ASIC in the device is unique and hence will > have made different trade offs - both features and nerd knobs to tune > and affect the performance and runtime capabilities? If you do not > agree, then we need to have a different discussion ... > If you do, please elaborate on the outline of some common API that > could possibly be done here. We have devlink params. If that doesn't scale we can look for other solutions but let's see them not scale _in practice_ first. > You said no to the devlink parameters as a way to tune an ASIC. What? When? > This is a common, established tool, using a common, established message > channel but in an open, free form way of letting a customer see what > tunables there are for a specific H/W version and firmware version > and then set them. That is about as common as can be for different > vendors creating different ASICs with different goals and design > objectives. Yet, you somehow expect all of them to have nerd knob X > and tunable Y. That is not realistic. I don't know what you're talking about. > > This should all be self-evident for people familiar with netdev, but > > I thought I'd spell it out. > > > > I understand that the device has other uses, but every modern NIC has > > other uses. I don't see how netdev can agree to this driver as long as > > there is potential for configuring random networking things thru it. > > All major netdev vendors have a set of out of tree tools / "expose > > everything misc drivers", "for debug". They will soon follow your > > example if we let this in. > > Out of tree drivers are already ingrained into customers now. Mellanox > (in this case) has tried many different angles at getting access to H/W > unique tunables (i.e., the devlink comment) and now dumping huge amounts > of data. Your response to the devlink parameters attempt is to basically > abuse the firmware upgrade command as a way to push a binary blob that > can contain said settings (and I still have not fully wrapped my head > around the fact that you suggested that option). > > What specifically are you looking for? There are different vendors and > different h/w options for specific market based reasons. Your hard line > stance against needs like this is what is pushing out of tree drivers > and tools to continue. Sounds like you'd like a similar open-ended interface for your device.
On Tue, Nov 28, 2023 at 06:53:21AM -0800, Jakub Kicinski wrote: > > You said no to the devlink parameters as a way to tune an ASIC. > > What? When? You said you already rejected it at the very start of this discussion and linked to the video recording of the rejection discussion: https://lore.kernel.org/all/20231019165055.GT3952@nvidia.com/ This session was specifically on the 600 FW configuration parameters that mlx5 has. This is something that is done today on non-secure boot systems with direct PCI access on sysfs and would be absorbed into this driver on secure-boot systems. Ie nothing really changes from the broader ecosystem perspective. The discussion starts at time index 22:11. Dave (IIRC? sorry) is talking about unique things and suggesting devlink. Many people in the audiance seem to support this idea 27:00 -> 28:00 you repeated this thread's argument about said NO "occasionally you are allowed to use [devlink parameters] them" At 29 you said just keep it all out of tree. Oh, I chimed in around 30:00 saying it is not the kernel maintainers job to force standardization. This is a user problem. 31:25 you said again "nothing" 31:30 S390 teams would like this and pushed back against "keep it all out of tree" describing the situation as a "deadlock". 33:00 you said need two implementors for device specific parameters, which misses the point of the discussion, IMHO. And this was at the Dublin LPC, so over a year ago. I second Dave's question - if you do not like mlx5ctl, then what is your vision to solve all these user problems? Jason
On Tue, 28 Nov 2023 12:24:13 -0400 Jason Gunthorpe wrote: > You said you already rejected it at the very start of this discussion > and linked to the video recording of the rejection discussion: > > https://lore.kernel.org/all/20231019165055.GT3952@nvidia.com/ > > This session was specifically on the 600 FW configuration parameters > that mlx5 has. This is something that is done today on non-secure boot > systems with direct PCI access on sysfs and would be absorbed into > this driver on secure-boot systems. Ie nothing really changes from the > broader ecosystem perspective. The question at LPC was about making devlink params completely transparent to the kernel. Basically added directly from FW. That what I was not happy about. You can add as many params at the driver level as you want. In fact I asked Saeed repeatedly to start posting all those params instead of complaining. > I second Dave's question - if you do not like mlx5ctl, then what is > your vision to solve all these user problems? Let the users complain about the user problems. Also something I repeatedly told Saeed. His response was something along the lines of users are secret, they can't post on the list, blah, blah. You know one user who is participating in this thread? *ME* While the lot of you work for vendors.
On 11/28/23 7:53 AM, Jakub Kicinski wrote: > On Mon, 27 Nov 2023 21:46:28 -0700 David Ahern wrote: >>> You keep saying "debug information" which is really underselling this >>> driver. Are you not going to support mstreg? >>> >>> The common development flow as far as netdev is concerned is: >>> - some customer is interested in a new feature of a chip >>> - vendor hacks the support out of tree, using oot module and/or >>> user space tooling >>> - customer does a PoC with that hacked up, non-upstream solution >>> - if it works, vendor has to find out a proper upstream API, >>> hopefully agreed on with other vendors >>> - if it doesn't match customer needs the whole thing lands in the bin >>> >>> If the vendor specific PoC can be achieved with fully upstream software >>> we lose leverage to force vendors to agree on common APIs. >> >> Please elaborate on what "common" API there is to create here? > > Damn, am I so bad at explaining basic things or y'all are spending > 5 seconds reading this and are not really paying attention? :| > >> Do you agree that each ASIC in the device is unique and hence will >> have made different trade offs - both features and nerd knobs to tune >> and affect the performance and runtime capabilities? If you do not >> agree, then we need to have a different discussion ... >> If you do, please elaborate on the outline of some common API that >> could possibly be done here. > > We have devlink params. If that doesn't scale we can look for other > solutions but let's see them not scale _in practice_ first. > >> You said no to the devlink parameters as a way to tune an ASIC. > > What? When? Jason responded with the same LPC reference, so I will only add a reference to the slides for interested parties: https://lpc.events/event/16/contributions/1359/attachments/1092/2094/FW-Centric-devices.pdf > > Sounds like you'd like a similar open-ended interface for your device. That's the danger of chiming in on threads like this - people drawing conclusions they should not. My interest on this thread is along the same lines for commenting during both the LPC 2022 discussion and again at netconf this year - trying to understand and keep track of the strongly held opinions of maintainers and what options are deemed off limits for similar needs. No vendor wants to go in circles redesigning and rewriting s/w.
On Tue, Nov 28, 2023 at 08:44:21AM -0800, Jakub Kicinski wrote: > On Tue, 28 Nov 2023 12:24:13 -0400 Jason Gunthorpe wrote: > > You said you already rejected it at the very start of this discussion > > and linked to the video recording of the rejection discussion: > > > > https://lore.kernel.org/all/20231019165055.GT3952@nvidia.com/ > > > > This session was specifically on the 600 FW configuration parameters > > that mlx5 has. This is something that is done today on non-secure boot > > systems with direct PCI access on sysfs and would be absorbed into > > this driver on secure-boot systems. Ie nothing really changes from the > > broader ecosystem perspective. > > The question at LPC was about making devlink params completely > transparent to the kernel. Basically added directly from FW. > That what I was not happy about. It is creating a back-porting nightmare for all the enterprise distributions. > You can add as many params at the driver level as you want. > In fact I asked Saeed repeatedly to start posting all those > params instead of complaining. That really isn't what you said in the video. Regardless, configurables are only one part of what mlx5ctl addresses, we still have all the debugability problems, which are arguably more important. > > I second Dave's question - if you do not like mlx5ctl, then what is > > your vision to solve all these user problems? > > Let the users complain about the user problems. Also something > I repeatedly told Saeed. His response was something along the lines > of users are secret, they can't post on the list, blah, blah. You mean like the S390 team at IBM did in the video? This is not a reasonable position. One of the jobs of the vendors is to aggregate the user requests. Even the giant hyperscale customers that do have the capacity to come on this list prefer to delegate these things to us. If you want to get a direct user forum the kernel mailing list is not an appropriate place to do it. > You know one user who is participating in this thread? > *ME* > While the lot of you work for vendors. I'm sick of this vendor bashing. You work for *one* user. You know who talks to *every* user out there? *ME*. User and vendors need debugging of this complex HW. I don't need to bring a parade of a dozen users to this thread to re-enforce that obvious truth. Indeed when debugging is required the vendor usually has to do it, so we are the user in this discussion. You didn't answer the question, what is your alternative debug-ability vision here? Jason
On Tue, 28 Nov 2023 13:52:24 -0400 Jason Gunthorpe wrote: > > The question at LPC was about making devlink params completely > > transparent to the kernel. Basically added directly from FW. > > That what I was not happy about. > > It is creating a back-porting nightmare for all the enterprise > distributions. We don't care about enterprise distros, Jason, or stable kernel APIs. > > You can add as many params at the driver level as you want. > > In fact I asked Saeed repeatedly to start posting all those > > params instead of complaining. > > That really isn't what you said in the video. > > Regardless, configurables are only one part of what mlx5ctl addresses, > we still have all the debugability problems, which are arguably more > important. Read-only debug interfaces are "do whatever you want" in netdev. Params controlling them (ie. writing stuff) need to be reviewed but are also allowed. Doesn't mlx5 have a pile of stuff in debugfs already? Nobody bothered to answer my "are you not going support mstreg over this" question (arbitrary register writes). > > Let the users complain about the user problems. Also something > > I repeatedly told Saeed. His response was something along the lines > > of users are secret, they can't post on the list, blah, blah. > > You mean like the S390 team at IBM did in the video? > > This is not a reasonable position. One of the jobs of the vendors is > to aggregate the user requests. Even the giant hyperscale customers > that do have the capacity to come on this list prefer to delegate > these things to us. > > If you want to get a direct user forum the kernel mailing list is not > an appropriate place to do it. Agree to disagree. > > You know one user who is participating in this thread? > > *ME* > > While the lot of you work for vendors. > > I'm sick of this vendor bashing. You work for *one* user. You know who > talks to *every* user out there? *ME*. > > User and vendors need debugging of this complex HW. I don't need to > bring a parade of a dozen users to this thread to re-enforce that > obvious truth. Indeed when debugging is required the vendor usually > has to do it, so we are the user in this discussion. > > You didn't answer the question, what is your alternative debug-ability > vision here? Covered above. And it's been discussed multiple times. Honestly I don't want to spend any more time discussing this. Once you're ready to work together in good faith let me know. On future revisions of this series please carry: Nacked-by: Jakub Kicinski <kuba@kernel.org>
On 28 Nov 08:44, Jakub Kicinski wrote: >On Tue, 28 Nov 2023 12:24:13 -0400 Jason Gunthorpe wrote: >> You said you already rejected it at the very start of this discussion >> and linked to the video recording of the rejection discussion: >> >> https://lore.kernel.org/all/20231019165055.GT3952@nvidia.com/ >> >> This session was specifically on the 600 FW configuration parameters >> that mlx5 has. This is something that is done today on non-secure boot >> systems with direct PCI access on sysfs and would be absorbed into >> this driver on secure-boot systems. Ie nothing really changes from the >> broader ecosystem perspective. > >The question at LPC was about making devlink params completely >transparent to the kernel. Basically added directly from FW. >That what I was not happy about. > >You can add as many params at the driver level as you want. >In fact I asked Saeed repeatedly to start posting all those >params instead of complaining. > We posted many params over the years the you shot down on the spot, do you really expect me to implement 600 of those knowing that you will nack 80% of them asking to have common knobs for all vendors, and you know that is impossible. you nack patches then ask for a porpossal, we came up with many proposal and discussed them face to face on multiple occasions, LPC/netconf etc, then you ask for patches, then you nack again, we are just going in circles here.. >> I second Dave's question - if you do not like mlx5ctl, then what is >> your vision to solve all these user problems? > >Let the users complain about the user problems. Also something >I repeatedly told Saeed. His response was something along the lines >of users are secret, they can't post on the list, blah, blah. > I never said it is a secret, but I can't publicly reveal who my customers are and what they want, you know very well who asked for the high frequency counter sampling.. So we came up with a very clear solution, that has nothing to do with netdev, since for that particular use-case it is not netdev specific, and netdev stack isn't even present. >You know one user who is participating in this thread? >*ME* >While the lot of you work for vendors. And how *YOU* expect the vendors to debug *YOUR* issues, if you don't allow them to access their HW? Asking all vendors to use *YOUR* "devlink generic_dev generic_knob" is an insult to all vendors, how about you provide the ASIC design and RTLs to all vendors and we just manufacture it for you..
On 28 Nov 10:33, Jakub Kicinski wrote: >On Tue, 28 Nov 2023 13:52:24 -0400 Jason Gunthorpe wrote: >> > The question at LPC was about making devlink params completely >> > transparent to the kernel. Basically added directly from FW. >> > That what I was not happy about. >> >> It is creating a back-porting nightmare for all the enterprise >> distributions. > >We don't care about enterprise distros, Jason, or stable kernel APIs. > >> > You can add as many params at the driver level as you want. >> > In fact I asked Saeed repeatedly to start posting all those >> > params instead of complaining. >> >> That really isn't what you said in the video. >> >> Regardless, configurables are only one part of what mlx5ctl addresses, >> we still have all the debugability problems, which are arguably more >> important. > >Read-only debug interfaces are "do whatever you want" in netdev. >Params controlling them (ie. writing stuff) need to be reviewed >but are also allowed. > >Doesn't mlx5 have a pile of stuff in debugfs already? > not enough, not scalable and it's a backporting and maintenance nightmare as Jason already showed. mlx5 supports creating millions of objects, tools need to selectively pick which objects to dump for a specific use case, if it's ok with you to do this in debugfs, then ioctl is much cleaner .. so what's your problem with mlx5ctl? >Nobody bothered to answer my "are you not going support mstreg over >this" question (arbitrary register writes). > >> > Let the users complain about the user problems. Also something >> > I repeatedly told Saeed. His response was something along the lines >> > of users are secret, they can't post on the list, blah, blah. >> >> You mean like the S390 team at IBM did in the video? >> >> This is not a reasonable position. One of the jobs of the vendors is >> to aggregate the user requests. Even the giant hyperscale customers >> that do have the capacity to come on this list prefer to delegate >> these things to us. >> >> If you want to get a direct user forum the kernel mailing list is not >> an appropriate place to do it. > >Agree to disagree. > >> > You know one user who is participating in this thread? >> > *ME* >> > While the lot of you work for vendors. >> >> I'm sick of this vendor bashing. You work for *one* user. You know who >> talks to *every* user out there? *ME*. >> >> User and vendors need debugging of this complex HW. I don't need to >> bring a parade of a dozen users to this thread to re-enforce that >> obvious truth. Indeed when debugging is required the vendor usually >> has to do it, so we are the user in this discussion. >> >> You didn't answer the question, what is your alternative debug-ability >> vision here? > >Covered above. And it's been discussed multiple times. > >Honestly I don't want to spend any more time discussing this. >Once you're ready to work together in good faith let me know. > >On future revisions of this series please carry: > >Nacked-by: Jakub Kicinski <kuba@kernel.org> I asked before and I never got a technical answer, based on what? All we got is just political views and complaints against vendors. What is your proposal for accessing every possible debug information from a vendor specific device ? devlink X Y Z, debugfs? won't work, sorry. And I can't accept "do it out of tree" as an answer from a well established linux maintainer, the whole point of this is to have this available in every linux box with any mlx5 configuration (not only netdev) so we can start debugging on the spot. For your claims that we need this for setting device parameters, it is simply not true, because we don't need this driver to do that, so please go back and read the cover-letter and code, and let me know what is wrong with our approach to get access to our device's debug info.
On 28 Nov 10:33, Jakub Kicinski wrote: >On Tue, 28 Nov 2023 13:52:24 -0400 Jason Gunthorpe wrote: >> > The question at LPC was about making devlink params completely >> > transparent to the kernel. Basically added directly from FW. >> > That what I was not happy about. >> >> It is creating a back-porting nightmare for all the enterprise >> distributions. > >We don't care about enterprise distros, Jason, or stable kernel APIs. > Oh, I missed this one, so you don't care about users? Users often pay to distros for support, and distros always turn to vendors for debug situations, in fact one of the high stakeholders for this is an enterprise distro.. Also Jason and I are users, and more than 300 engineers at nvidia and dozens of customers are users, deploying 100s of thousands of ConnectX chips in their fleets. if it weren't important for us and our users, I wouldn't even fight this hard.. but it is important, and very useful, I can't express that hard enough. So you don't care ? Then don't ask about who the users are, in the other email, apparently it's all about your personal opinion and views about vendors what drives your responses. So it is really hard to debate with you..
On Tue, Nov 28, 2023 at 12:10:00PM -0800, Saeed Mahameed wrote: > On 28 Nov 10:33, Jakub Kicinski wrote: > > On Tue, 28 Nov 2023 13:52:24 -0400 Jason Gunthorpe wrote: > > > > The question at LPC was about making devlink params completely > > > > transparent to the kernel. Basically added directly from FW. > > > > That what I was not happy about. > > > > > > It is creating a back-porting nightmare for all the enterprise > > > distributions. > > > > We don't care about enterprise distros, Jason, or stable kernel APIs. > > > > Oh, I missed this one, so you don't care about users? Ok, time out please. This isn't going anywhere fast except to make everyone else mad. To Jakub's point, no, we don't care about enterprise distros, they are a consumer of our releases and while some of them pay the salaries of our developers, they really don't have much influence over our development rules as they are just so far behind in releases that their releases look nothing like what we do in places (i.e. Linux "like" just like many Android systems.) If the enterprise distros pop in here and give us their opinions of the patchset, I would GREATLY appreciate it, as having more people review code at this point in time would be most helpful instead of having to hear about how the interfaces are broken 2 years from now. And I think that's what is driving your work now, the "enterprise" distros finally picked up the "lock down the kernel from random PCI device access in userspace" which caused you to have to drop your userspace implementation to create a real kernel driver, correct? And as for stable kernel apis, you all know that's just not a thing, and has nothing to do with users EXCEPT it benefits users because it keeps kernel code smaller and faster overall, that's well documented and users appreciate it. > Users often pay to distros for support, and distros always turn to vendors > for debug situations, in fact one of the high stakeholders for this is an > enterprise distro.. Hey, I was right, an enterprise distro wants this driver, great, can you get them to review it as well? I'm tired of being the only one to review patches like this and could use the help if they are going to rely on this (why do they pass that work to me?). They should be the ones helping us catch basic things like the reference count issues I pointed out, as they can test the code, I can't :) But, let's step back a bit further. It seems the network device normal interface for this is using devlink. And drivers are allowed to have driver-specific devlink interfaces, as you know, your driver has lots of them today. Why not just add more? What's wrong with 600+ more interfaces being added as that way they would be well documented and fit in with the existing infrastructure that you have today. Is the issue that the firmware doesn't guarantee that these interfaces will be documented or stable or even known at this point in time? If so, how are your going to have a good userspace tool for this? What am I missing that requires a totally-new-and-custom user/kernel api from what all other network drivers are using today? thanks, greg k-h
On 27 Nov 18:59, Greg Kroah-Hartman wrote: >On Mon, Nov 20, 2023 at 11:06:16PM -0800, Saeed Mahameed wrote: >> +struct mlx5ctl_dev { >> + struct mlx5_core_dev *mdev; >> + struct miscdevice miscdev; >> + struct auxiliary_device *adev; >> + struct list_head fd_list; >> + spinlock_t fd_list_lock; /* protect list add/del */ >> + struct rw_semaphore rw_lock; >> + struct kref refcount; > >You now have 2 different things that control the lifespan of this >structure. We really need some way to automatically check this so that >people don't keep making this same mistake, it happens all the time :( > >Please pick one structure (miscdevice) or the other (kref) to control >the lifespan, having 2 will just not work. > miscdevice doesn't handle the lifespan, open files will remain open even after the miscdevice was unregistered, hence we use the kref to defer the kfree until the last open file is closed. >Also, why a rw_semaphore? Only use those if you can prove with a >benchmark that it is actually faster, otherwise it's simpler to just use >a normal mutex (hint, you are changing the fields in the structure with >the read lock held, which feels very wrong, and so needs a LOT of >documentation, or just use a normal mutex...) > It is needed so we can protect against underlaying device unloading while miscdevice is active, we use rw semaphore since we don't care about synchronization between any of the fops, but we want to protect current active ioctls and fops from sudden mlx5ctl_remove (auxiliary_driver.remove) which can happen dynamically due to underlaying device removal. So here is the locking scheme: write_lock() : only on mlx5_ctl remove and mark the device is down via assigning NULL to mcdev->dev, to let all new readers abort and to wait for current readers to finish their task. read_lock(): used in all fops and ioctls, to make sure underlaying mlx5_core device is still active, and to prevent open files to access the device when miscdevice was already unregistered. I agree, this should've been documented in the code, I will add documentation. >thanks, > >greg k-h
On Wed, Nov 29, 2023 at 01:08:39AM -0800, Saeed Mahameed wrote: > On 27 Nov 18:59, Greg Kroah-Hartman wrote: > > On Mon, Nov 20, 2023 at 11:06:16PM -0800, Saeed Mahameed wrote: > > > +struct mlx5ctl_dev { > > > + struct mlx5_core_dev *mdev; > > > + struct miscdevice miscdev; > > > + struct auxiliary_device *adev; > > > + struct list_head fd_list; > > > + spinlock_t fd_list_lock; /* protect list add/del */ > > > + struct rw_semaphore rw_lock; > > > + struct kref refcount; > > > > You now have 2 different things that control the lifespan of this > > structure. We really need some way to automatically check this so that > > people don't keep making this same mistake, it happens all the time :( > > > > Please pick one structure (miscdevice) or the other (kref) to control > > the lifespan, having 2 will just not work. > > > > miscdevice doesn't handle the lifespan, open files will remain open even > after the miscdevice was unregistered, hence we use the kref to defer the > kfree until the last open file is closed. miscdevice has a reference counter and a lifecycle, you can not have two reference counted objects in the same structure and expect things to work well. > > Also, why a rw_semaphore? Only use those if you can prove with a > > benchmark that it is actually faster, otherwise it's simpler to just use > > a normal mutex (hint, you are changing the fields in the structure with > > the read lock held, which feels very wrong, and so needs a LOT of > > documentation, or just use a normal mutex...) > > > > It is needed so we can protect against underlaying device unloading while > miscdevice is active, we use rw semaphore since we don't care about > synchronization between any of the fops, but we want to protect current > active ioctls and fops from sudden mlx5ctl_remove (auxiliary_driver.remove) > which can happen dynamically due to underlaying device removal. Then use a normal mutex. Only use a rw lock if you can prove the performance needs it as usually a rw lock is slower and more complex as you then have to document stuff like: > So here is the locking scheme: > > write_lock() : only on mlx5_ctl remove and mark the device is down > via assigning NULL to mcdev->dev, to let all new readers abort and to wait > for current readers to finish their task. > > read_lock(): used in all fops and ioctls, to make sure underlaying > mlx5_core device is still active, and to prevent open files to access the > device when miscdevice was already unregistered. > > I agree, this should've been documented in the code, I will add > documentation. Just make it simple and use a normal mutex please. And fix up the reference counting, it shouldn't be this complex, it's just a "simple" misc device driver :) But before you do that, please see my other email about why not using devlink for all of this instead. thanks, greg k-h
On Wed, Nov 29, 2023 at 09:20:32AM +0000, Greg Kroah-Hartman wrote: > On Wed, Nov 29, 2023 at 01:08:39AM -0800, Saeed Mahameed wrote: > > On 27 Nov 18:59, Greg Kroah-Hartman wrote: > > > On Mon, Nov 20, 2023 at 11:06:16PM -0800, Saeed Mahameed wrote: > > > > +struct mlx5ctl_dev { > > > > + struct mlx5_core_dev *mdev; > > > > + struct miscdevice miscdev; > > > > + struct auxiliary_device *adev; > > > > + struct list_head fd_list; > > > > + spinlock_t fd_list_lock; /* protect list add/del */ > > > > + struct rw_semaphore rw_lock; > > > > + struct kref refcount; > > > > > > You now have 2 different things that control the lifespan of this > > > structure. We really need some way to automatically check this so that > > > people don't keep making this same mistake, it happens all the time :( > > > > > > Please pick one structure (miscdevice) or the other (kref) to control > > > the lifespan, having 2 will just not work. > > > > > > > miscdevice doesn't handle the lifespan, open files will remain open even > > after the miscdevice was unregistered, hence we use the kref to defer the > > kfree until the last open file is closed. > > miscdevice has a reference counter and a lifecycle, you can not have two > reference counted objects in the same structure and expect things to > work well. This second refcount is hidden well: struct miscdevice { int minor; const char *name; const struct file_operations *fops; struct list_head list; struct device *parent; struct device *this_device; const struct attribute_group **groups; const char *nodename; umode_t mode; }; You have slammed us in the other thread for doing a poor review job here, but none of us see what you are seeing. Can you explain better? > > write_lock() : only on mlx5_ctl remove and mark the device is down > > via assigning NULL to mcdev->dev, to let all new readers abort and to wait > > for current readers to finish their task. > > > > read_lock(): used in all fops and ioctls, to make sure underlaying > > mlx5_core device is still active, and to prevent open files to access the > > device when miscdevice was already unregistered. > > > > I agree, this should've been documented in the code, I will add > > documentation. > > Just make it simple and use a normal mutex please. A normal mutex would make the entire ioctl interface single threaded, this is not desirable. > But before you do that, please see my other email about why not using > devlink for all of this instead. We've been over this already, the devlink discussion is about some configuration stuff. It has never been suggested to cover the debug interface. This series is primarily about debug, the devlink thing is a distraction to main point. Jason
On Wed, Nov 29, 2023 at 09:02:00AM -0400, Jason Gunthorpe wrote: > On Wed, Nov 29, 2023 at 09:20:32AM +0000, Greg Kroah-Hartman wrote: > > On Wed, Nov 29, 2023 at 01:08:39AM -0800, Saeed Mahameed wrote: > > > On 27 Nov 18:59, Greg Kroah-Hartman wrote: > > > > On Mon, Nov 20, 2023 at 11:06:16PM -0800, Saeed Mahameed wrote: > > > > > +struct mlx5ctl_dev { > > > > > + struct mlx5_core_dev *mdev; > > > > > + struct miscdevice miscdev; > > > > > + struct auxiliary_device *adev; > > > > > + struct list_head fd_list; > > > > > + spinlock_t fd_list_lock; /* protect list add/del */ > > > > > + struct rw_semaphore rw_lock; > > > > > + struct kref refcount; > > > > > > > > You now have 2 different things that control the lifespan of this > > > > structure. We really need some way to automatically check this so that > > > > people don't keep making this same mistake, it happens all the time :( > > > > > > > > Please pick one structure (miscdevice) or the other (kref) to control > > > > the lifespan, having 2 will just not work. > > > > > > > > > > miscdevice doesn't handle the lifespan, open files will remain open even > > > after the miscdevice was unregistered, hence we use the kref to defer the > > > kfree until the last open file is closed. > > > > miscdevice has a reference counter and a lifecycle, you can not have two > > reference counted objects in the same structure and expect things to > > work well. > > This second refcount is hidden well: > > struct miscdevice { > int minor; > const char *name; > const struct file_operations *fops; > struct list_head list; > struct device *parent; > struct device *this_device; > const struct attribute_group **groups; > const char *nodename; > umode_t mode; > }; Ugh, you are right, I was wrong, there is no reference count here, using a miscdevice _requires_ you to have a separate reference count, like you all did. My fault. > > > write_lock() : only on mlx5_ctl remove and mark the device is down > > > via assigning NULL to mcdev->dev, to let all new readers abort and to wait > > > for current readers to finish their task. > > > > > > read_lock(): used in all fops and ioctls, to make sure underlaying > > > mlx5_core device is still active, and to prevent open files to access the > > > device when miscdevice was already unregistered. > > > > > > I agree, this should've been documented in the code, I will add > > > documentation. > > > > Just make it simple and use a normal mutex please. > > A normal mutex would make the entire ioctl interface single threaded, > this is not desirable. Why not? It's an ioctl for a single device, surely this isn't performance criticial. And then only grab it when needed, on read/write/ioctl path it shouldn't be needed at all due to the proper reference counting of the structures. Only on open/close, right? And again, for a rw semaphore, benchmarks matter, often, if not almost always, a normal mutex is faster for stuff like this. If not, then a benchmark will show it. > > But before you do that, please see my other email about why not using > > devlink for all of this instead. > > We've been over this already, the devlink discussion is about some > configuration stuff. It was? I see device-specific diagonostic data for the mlx5 driver being exported through devlink today, that's not configuration. Why not just add more? > It has never been suggested to cover the debug interface. This series > is primarily about debug, the devlink thing is a distraction to main > point. For me it is the main point at the moment. Please explain why devlink does not work for the information that you have created a misc device where you want an ioctl api instead, as I honestly do not understand. thanks, greg k-h
On Wed, Nov 29, 2023 at 03:41:54PM +0000, Greg Kroah-Hartman wrote: > Ugh, you are right, I was wrong, there is no reference count here, using > a miscdevice _requires_ you to have a separate reference count, like you > all did. My fault. I think we did not do a good job conveying that several experienced people have looked at this internally already. I see you noticed a few things we missed, but this is not unreviewed garbage code. We have different opionions on how some of these things should work but none of them are what I'd consider unlinuxy or wrong. > > > Just make it simple and use a normal mutex please. > > > > A normal mutex would make the entire ioctl interface single threaded, > > this is not desirable. > > Why not? It's an ioctl for a single device, surely this isn't > performance criticial. The RPCs to the FW can take a long time to execute and the "single device" is really a multi-threaded CPU complex on the other side. Being able to submit any commands for (say) 0.5s while one thread chews on something is not desirable. > And then only grab it when needed, on read/write/ioctl path it > shouldn't be needed at all due to the proper reference counting of > the structures. Only on open/close, right? No - this lock is protecting the mcdev->mdev from being freed. If the read side is held then mcdev->mdev cannot be freed. This is not a fully refcounted object because the mdev is actually the device underlying the aux device and it will be logically destroyed after the aux device_driver remove() function returns. The purpose of the rwlock is to ensure that the remove() function does not return until all threads have completed using the mdev. Once remove proceeds the read side of the lock will observe mdev==NULL and fail all operations until the FD is closed. The refcount keeps the memory underpinning the basic operation of the FD alive after remove() until the FD finally closes. At least for the purposes of this driver it is how it has to be. I agree the mdev API surface that requires some of this may not be the best it can be. > And again, for a rw semaphore, benchmarks matter, often, if not almost > always, a normal mutex is faster for stuff like this. If not, then a > benchmark will show it. We are not talking about micro differences in lock/unlock cost here. This is obviously better because of how much time the lock will stay locked while executing RPCs and sleeping waiting for interrupts back from FW. We are talking >> 100's of ms of time spent locked while executing FW RPCs. > > We've been over this already, the devlink discussion is about some > > configuration stuff. > > It was? I see device-specific diagonostic data for the mlx5 driver > being exported through devlink today, that's not configuration. Why not > just add more? Today mlx5 has a bunch of debuggables: - counters, even some device specific counters. Both in RDMA and netdev objects exposed through netlink and sysfs - A "core dump" mechanism through devlink - Some debugfs stuff, but we are now aware of serious problems with it - Some "devlink parameters" which configure a limited set of things These are not "interactive". Yes we can add more counters, more parameters and more stuff to the core dump, but that isn't what I mean by debugging. This is a complex device, it has processors on the other side running a lot of software. A primary purpose of this design is to allow bi-directional communication with debug agents in the FW. Consider, (I'm trying to draw a picture here without disclosing trade secrets about how the device operates), that one of the debug agents is a GDB stub. There are many CPUs inside these devices, field engineers may wish to enter a breakpoint and inspect something. With this scheme we would put the GDB stub protocol packets into RPC messages and execute them to the FW. In the very worst nightmare debug scenario we might build a custom FW with a custom debug agent targetting the specific problem and use these generic FW RPCs to communicate however that very special one-off agent requires. Notice what such a thing requires: a future compatible channel to exchange data bi-directionally with the SW components in the FW. The design here of processing *generic RPCs* is not some dirty nod toward enterprise distros - it is a sane and logical design to put things in userspace that the kernel has no value-add to being part of. This is standard straight-from-Linus design values. It is how things like nvme built their very similar "vendor commands" system (see for instance the intel & wd commands in nvmecli). So, can we run generic opaque RPCs over some TBD devlink interface, including with DMA? Technologically yes, of course. Practically? No. Jakub has clearly stated his position on philisophical limits in netdev and those include devlink. It excludes soemthing so open-ended and ill-defined. FWIW, I agree with this, devlink should be much more constrained. It is why we never suggested devlink for this kind of debugging. Genericness is also the source of alot of the cross discussion in this thread. A generic RPC interface can do *alot*. Actually, it can do *almost anything* to the device. That is the entire point. What it can't do is violate the integrity of the kernel - hence why it is a FW RPC interface and not direct access to device registers. So when Jakub says we can send configuration values that could also be sent over devlink parameters, or read counters that could also be read over netlink, he is not wrong. But we already do all those duplicative things through the direct PCI access and have for at least 10 years. I disagree that this duplication is a problem. Re-usable, cross vendor generic interfaces have merit on their own. mlx5 has long been implementing a wide selection of them as they are invented. We've been doing this even though those interfaces duplicate what the raw PCI tools already did. This new driver isn't going to change that. We will still participate in netdev and still implement these other things. We are a big part of the community, I don't need someone holding a stick over the mlx5 team's head beating them to do the "right" thing. Jason
Hi, On Wed, Nov 29, 2023 at 09:08:03AM +0000, Greg Kroah-Hartman wrote: > On Tue, Nov 28, 2023 at 12:10:00PM -0800, Saeed Mahameed wrote: > > On 28 Nov 10:33, Jakub Kicinski wrote: > > > On Tue, 28 Nov 2023 13:52:24 -0400 Jason Gunthorpe wrote: > > > > > The question at LPC was about making devlink params completely > > > > > transparent to the kernel. Basically added directly from FW. > > > > > That what I was not happy about. > > > > > > > > It is creating a back-porting nightmare for all the enterprise > > > > distributions. > > > > > > We don't care about enterprise distros, Jason, or stable kernel APIs. > > > > > > > Oh, I missed this one, so you don't care about users? > > Ok, time out please. This isn't going anywhere fast except to make > everyone else mad. > > To Jakub's point, no, we don't care about enterprise distros, they are a > consumer of our releases and while some of them pay the salaries of our > developers, they really don't have much influence over our development > rules as they are just so far behind in releases that their releases > look nothing like what we do in places (i.e. Linux "like" just like many > Android systems.) > > If the enterprise distros pop in here and give us their opinions of the > patchset, I would GREATLY appreciate it, as having more people review > code at this point in time would be most helpful instead of having to > hear about how the interfaces are broken 2 years from now. We will be happy to test and review v4 of this series. Fully interactive debugging has become essential to getting to the root cause of complex issues that arise between the types of devices being discussed and their interactions with various software layers. Turning knobs and dumping registers just isn't sufficient, and I wish we'd had this capability long ago. Our customers have already benefited from the interactive debugging capability that these patches provide, but the full potential won't be realized until this is upstream. > > And I think that's what is driving your work now, the "enterprise" > distros finally picked up the "lock down the kernel from random PCI > device access in userspace" which caused you to have to drop your > userspace implementation to create a real kernel driver, correct? > > And as for stable kernel apis, you all know that's just not a thing, and > has nothing to do with users EXCEPT it benefits users because it keeps > kernel code smaller and faster overall, that's well documented and users > appreciate it. > > > Users often pay to distros for support, and distros always turn to vendors > > for debug situations, in fact one of the high stakeholders for this is an > > enterprise distro.. > > Hey, I was right, an enterprise distro wants this driver, great, can you > get them to review it as well? I'm tired of being the only one to > review patches like this and could use the help if they are going to > rely on this (why do they pass that work to me?). They should be the > ones helping us catch basic things like the reference count issues I > pointed out, as they can test the code, I can't :) Guilty as charged? But, I'm sure we are not the only ones. I'm sorry that we were not able to spot this and provide a review earlier, but we are happy to do so for v4. > > But, let's step back a bit further. > > It seems the network device normal interface for this is using devlink. > And drivers are allowed to have driver-specific devlink interfaces, as > you know, your driver has lots of them today. Why not just add more? > What's wrong with 600+ more interfaces being added as that way they > would be well documented and fit in with the existing infrastructure > that you have today. > > Is the issue that the firmware doesn't guarantee that these interfaces > will be documented or stable or even known at this point in time? If > so, how are your going to have a good userspace tool for this? What am > I missing that requires a totally-new-and-custom user/kernel api from > what all other network drivers are using today? > > thanks, > > greg k-h Aron
On Mon, 4 Dec 2023 15:37:56 -0600 Aron Silverton wrote: > > To Jakub's point, no, we don't care about enterprise distros, they are a > > consumer of our releases and while some of them pay the salaries of our > > developers, they really don't have much influence over our development > > rules as they are just so far behind in releases that their releases > > look nothing like what we do in places (i.e. Linux "like" just like many > > Android systems.) > > > > If the enterprise distros pop in here and give us their opinions of the > > patchset, I would GREATLY appreciate it, as having more people review > > code at this point in time would be most helpful instead of having to > > hear about how the interfaces are broken 2 years from now. > > We will be happy to test and review v4 of this series. > > Fully interactive debugging has become essential to getting to the root > cause of complex issues that arise between the types of devices being > discussed and their interactions with various software layers. Turning > knobs and dumping registers just isn't sufficient, and I wish we'd had > this capability long ago. Could you shed some light on what issues you were debugging, broadly? > Our customers have already benefited from the interactive debugging > capability that these patches provide, but the full potential won't be > realized until this is upstream. Can you elaborate on why "full potential won't be realized until this is upstream"? The driver looks very easy to ship as a standalone module.
Hi Jakub, On Mon, Dec 04, 2023 at 06:52:10PM -0800, Jakub Kicinski wrote: > On Mon, 4 Dec 2023 15:37:56 -0600 Aron Silverton wrote: > > > To Jakub's point, no, we don't care about enterprise distros, they are a > > > consumer of our releases and while some of them pay the salaries of our > > > developers, they really don't have much influence over our development > > > rules as they are just so far behind in releases that their releases > > > look nothing like what we do in places (i.e. Linux "like" just like many > > > Android systems.) > > > > > > If the enterprise distros pop in here and give us their opinions of the > > > patchset, I would GREATLY appreciate it, as having more people review > > > code at this point in time would be most helpful instead of having to > > > hear about how the interfaces are broken 2 years from now. > > > > We will be happy to test and review v4 of this series. > > > > Fully interactive debugging has become essential to getting to the root > > cause of complex issues that arise between the types of devices being > > discussed and their interactions with various software layers. Turning > > knobs and dumping registers just isn't sufficient, and I wish we'd had > > this capability long ago. > > Could you shed some light on what issues you were debugging, broadly? > I'll do my best, with two recent instances: 1. As mentioned already, we recently faced a complex problem with RDMA in KVM and were getting nowhere trying to debug using the usual methods. Mellanox support was able to use this debug interface to see what was happening on the PCI bus and prove that the issue was caused by corrupted PCIe transactions. This finally put the investigation on the correct path. The debug interface was used consistently and extensively to test theories about what was happening in the system and, ultimately, allowed the problem to be solved. 2. We've faced RDMA issues related to lost EQ doorbells, requiring complex debug, and ultimately root-caused as a defective CPU. Without interactive access to the device allowing us to test theories like, "what if we manually restart the EQ", we could not have proven this definitively. > > Our customers have already benefited from the interactive debugging > > capability that these patches provide, but the full potential won't be > > realized until this is upstream. > > Can you elaborate on why "full potential won't be realized until this > is upstream"? The driver looks very easy to ship as a standalone module. Firstly, We believe in working upstream and all of the advantages that that brings to all the distros as well as to us and our customers. Secondly, Our cloud business offers many types of machine instances, some with bare metal/vfio mlx5 devices, that require customer driven debug and we want our customers to have the freedom to choose which OS they want to use. Aron
On Tue, 5 Dec 2023 11:11:00 -0600 Aron Silverton wrote: > 1. As mentioned already, we recently faced a complex problem with RDMA > in KVM and were getting nowhere trying to debug using the usual methods. > Mellanox support was able to use this debug interface to see what was > happening on the PCI bus and prove that the issue was caused by > corrupted PCIe transactions. This finally put the investigation on the > correct path. The debug interface was used consistently and extensively > to test theories about what was happening in the system and, ultimately, > allowed the problem to be solved. You hit on an important point, and what is also my experience working at Meta. I may have even mentioned it in this thread already. If there is a serious issue with a complex device, there are two ways you can get support - dump all you can and send the dump to the vendor or get on a live debugging session with their engineers. Users' ability to debug those devices is practically non-existent. The idea that we need access to FW internals is predicated on the assumption that we have an ability to make sense of those internals. Once you're on a support call with the vendor - just load a custom kernel, module, whatever, it's already extremely expensive manual labor. > 2. We've faced RDMA issues related to lost EQ doorbells, requiring > complex debug, and ultimately root-caused as a defective CPU. Without > interactive access to the device allowing us to test theories like, > "what if we manually restart the EQ", we could not have proven this > definitively. I'm not familiar with the RDMA debugging capabilities. Perhaps there are some gaps there. The more proprietary the implementation the harder it is to debug. An answer to that would be "try to keep as much as possible open".. and interfaces which let closed user space talk to closed FW take us in the opposite direction. FWIW good netdevice drivers have a selftest which tests IRQ generation and EQ handling. I think that'd cover the case you're describing? IDK if mlx5 has them, but if it doesn't definitely worth adding. And I recommend running those on suspicious machines (ethtool -t, devlink has some selftests, too) > Firstly, We believe in working upstream and all of the advantages that > that brings to all the distros as well as to us and our customers. > > Secondly, Our cloud business offers many types of machine instances, > some with bare metal/vfio mlx5 devices, that require customer driven > debug and we want our customers to have the freedom to choose which OS > they want to use. I understand that having everything packaged and shipped together makes life easier. If the point of the kernel at this stage of its evolution is to collect incompatible bits of vendor software, make sure they build cleanly and ship them to distros - someone should tell me, and I will relent.
On 12/5/23 9:48 PM, Jakub Kicinski wrote: > On Tue, 5 Dec 2023 11:11:00 -0600 Aron Silverton wrote: >> 1. As mentioned already, we recently faced a complex problem with RDMA >> in KVM and were getting nowhere trying to debug using the usual methods. >> Mellanox support was able to use this debug interface to see what was >> happening on the PCI bus and prove that the issue was caused by >> corrupted PCIe transactions. This finally put the investigation on the >> correct path. The debug interface was used consistently and extensively >> to test theories about what was happening in the system and, ultimately, >> allowed the problem to be solved. > > You hit on an important point, and what is also my experience working > at Meta. I may have even mentioned it in this thread already. > If there is a serious issue with a complex device, there are two ways > you can get support - dump all you can and send the dump to the vendor > or get on a live debugging session with their engineers. Users' ability > to debug those devices is practically non-existent. The idea that we > need access to FW internals is predicated on the assumption that we > have an ability to make sense of those internals. > > Once you're on a support call with the vendor - just load a custom > kernel, module, whatever, it's already extremely expensive manual labor. You rail against out of tree drivers and vendor proprietary tools, and now you argue for just that. There is no reason debugging capabilities can not be built into the OS and used when needed. That means anything needed - from kernel modules to userspace tools. The Meta data point is not representative of the world at large - different scale, different needs, different expertise on staff (OS and H/W). Getting S/W installed (especially anything requiring a compiler) in a production server (and VMs) is not an easy request and in many cases not even possible. When a customer hits problem, the standard steps are to run a script, generate a tar file and ship it to the OS vendor. Engineers at the OS vendor go through it and may need other data - like getting detailed dumps from individual pieces of H/W. Every time those requests require going to a vendor web site to pull down vendor tools, get permission to install them, schedule the run of said tool ... it only serves to drag out the debugging process. ie., this open-ended stance only serves to hurt Linux users.
David, I agree with your points. I think you're misreading what I said. On Thu, 7 Dec 2023 08:54:51 -0700 David Ahern wrote: > You rail against out of tree drivers and vendor proprietary tools, and > now you argue for just that. I don't rail against out of tree drivers, very much the opposite. Linux supports out of tree modules, and I agree that's 100% the correct thing to do. I'd encourage more people to take advantage of that. The problem is quite the opposite, all the "security hardening" is making it almost impossible for users to take advantage of OOT modules. > There is no reason debugging capabilities > can not be built into the OS and used when needed. That means anything > needed - from kernel modules to userspace tools. > > The Meta data point is not representative of the world at large - > different scale, different needs, different expertise on staff (OS and > H/W). Getting S/W installed (especially anything requiring a compiler) > in a production server (and VMs) is not an easy request and in many > cases not even possible. I did not say it's easy. > When a customer hits problem, the standard steps are to run a script, > generate a tar file and ship it to the OS vendor. Engineers at the OS > vendor go through it and may need other data - like getting detailed > dumps from individual pieces of H/W. You say that like this is not _exactly_ what I just said!? > Every time those requests require > going to a vendor web site to pull down vendor tools, get permission to > install them, schedule the run of said tool ... it only serves to drag > out the debugging process. ie., this open-ended stance only serves to > hurt Linux users. Right, exactly. What are you arguing with then? As I said - we have a very open / accommodating policy for extracting all sort of debug and state dumps. You can put whatever read only stuff you want in debugfs** Read-write interfaces must be constrained to a clear set of commands / settings but also very much allowed. As you said users need to be able to extract debug info to share with the vendors, no tools necessary. ** (this will surprise you but you can also put stats there, if they are custom, I don't care if they go into ethtool -S or debugfs)
On Tue, Dec 05, 2023 at 08:48:55PM -0800, Jakub Kicinski wrote: > On Tue, 5 Dec 2023 11:11:00 -0600 Aron Silverton wrote: > > 1. As mentioned already, we recently faced a complex problem with RDMA > > in KVM and were getting nowhere trying to debug using the usual methods. > > Mellanox support was able to use this debug interface to see what was > > happening on the PCI bus and prove that the issue was caused by > > corrupted PCIe transactions. This finally put the investigation on the > > correct path. The debug interface was used consistently and extensively > > to test theories about what was happening in the system and, ultimately, > > allowed the problem to be solved. > > You hit on an important point, and what is also my experience working > at Meta. I may have even mentioned it in this thread already. > If there is a serious issue with a complex device, there are two ways > you can get support - dump all you can and send the dump to the vendor > or get on a live debugging session with their engineers. Users' ability > to debug those devices is practically non-existent. The idea that we > need access to FW internals is predicated on the assumption that we > have an ability to make sense of those internals. > > Once you're on a support call with the vendor - just load a custom > kernel, module, whatever, it's already extremely expensive manual labor. > > > 2. We've faced RDMA issues related to lost EQ doorbells, requiring > > complex debug, and ultimately root-caused as a defective CPU. Without > > interactive access to the device allowing us to test theories like, > > "what if we manually restart the EQ", we could not have proven this > > definitively. > > I'm not familiar with the RDMA debugging capabilities. Perhaps there > are some gaps there. The more proprietary the implementation the harder > it is to debug. An answer to that would be "try to keep as much as > possible open".. and interfaces which let closed user space talk to > closed FW take us in the opposite direction. > > FWIW good netdevice drivers have a selftest which tests IRQ generation > and EQ handling. I think that'd cover the case you're describing? > IDK if mlx5 has them, but if it doesn't definitely worth adding. And I > recommend running those on suspicious machines (ethtool -t, devlink has > some selftests, too) Essentially, a warning light, and that doesn't solve the underlying problem. We still need experts (e.g., vendors) to investigate with their toolsets when and where the problem occurs. I offered this as an example of one issue we solved. I cannot predict what kind of issues will pop up in the future, and writing a self-test for every possible situation is impossible by definition. > > > Firstly, We believe in working upstream and all of the advantages that > > that brings to all the distros as well as to us and our customers. > > > > Secondly, Our cloud business offers many types of machine instances, > > some with bare metal/vfio mlx5 devices, that require customer driven > > debug and we want our customers to have the freedom to choose which OS > > they want to use. > > I understand that having everything packaged and shipped together makes > life easier. I think it is a requirement. We operate with Secure Boot. The kernel is locked down. We don't have debugfs access, even if it were sufficient, and we cannot compile and load modules. Even without Secure Boot, there may not be a build environment available. We really need the module ready-to-go when the debug calls for it - no building, no reboots, no months long attempts to reproduce in some lab - just immediate availability of the debug interface on the affected machine. > > If the point of the kernel at this stage of its evolution is to collect > incompatible bits of vendor software, make sure they build cleanly and > ship them to distros - someone should tell me, and I will relent. I'm not sure I follow you... The mlx5ctl driver seems very compatible with the mlx5 device driver. I may be misunderstanding.
On Thu, 7 Dec 2023 10:41:25 -0600 Aron Silverton wrote: > > I understand that having everything packaged and shipped together makes > > life easier. > > I think it is a requirement. We operate with Secure Boot. The kernel is > locked down. We don't have debugfs access, even if it were sufficient, > and we cannot compile and load modules. Even without Secure Boot, there > may not be a build environment available. This 'no debugfs' requirement is a kernel lockdown thing, I presume? Are we expected to throw debugfs out the window and for all vendors to reimplement their debug functionality via a misc driver taking arbitrary ioctls? Not only does that sound like a complete waste of time and going backward in terms of quality of the interfaces, needing custom vendor tools etc. etc., but also you go from (hopefully somewhat) upstream reviewed debugfs interface to an interface where the only security assurance is vendor telling you "trust me, it's all good".
On Thu, Dec 07, 2023 at 09:23:29AM -0800, Jakub Kicinski wrote: > On Thu, 7 Dec 2023 10:41:25 -0600 Aron Silverton wrote: > > > I understand that having everything packaged and shipped together makes > > > life easier. > > > > I think it is a requirement. We operate with Secure Boot. The kernel is > > locked down. We don't have debugfs access, even if it were sufficient, > > and we cannot compile and load modules. Even without Secure Boot, there > > may not be a build environment available. > > This 'no debugfs' requirement is a kernel lockdown thing, I presume? > Are we expected to throw debugfs out the window and for all vendors > to reimplement their debug functionality via a misc driver taking > arbitrary ioctls? Not only does that sound like a complete waste of > time and going backward in terms of quality of the interfaces, needing > custom vendor tools etc. etc., but also you go from (hopefully somewhat) > upstream reviewed debugfs interface to an interface where the only > security assurance is vendor telling you "trust me, it's all good". IIRC, with lockdown, we can read from debugfs IFF the entries' permissions are 0400. We cannot write. It's not suitable for implementing an interactive debug interface.
On 07 Dec 10:41, Aron Silverton wrote: >On Tue, Dec 05, 2023 at 08:48:55PM -0800, Jakub Kicinski wrote: >> On Tue, 5 Dec 2023 11:11:00 -0600 Aron Silverton wrote: >> > 1. As mentioned already, we recently faced a complex problem with RDMA >> > in KVM and were getting nowhere trying to debug using the usual methods. >> > Mellanox support was able to use this debug interface to see what was >> > happening on the PCI bus and prove that the issue was caused by >> > corrupted PCIe transactions. This finally put the investigation on the >> > correct path. The debug interface was used consistently and extensively >> > to test theories about what was happening in the system and, ultimately, >> > allowed the problem to be solved. >> >> You hit on an important point, and what is also my experience working >> at Meta. I may have even mentioned it in this thread already. >> If there is a serious issue with a complex device, there are two ways >> you can get support - dump all you can and send the dump to the vendor >> or get on a live debugging session with their engineers. Users' ability >> to debug those devices is practically non-existent. The idea that we >> need access to FW internals is predicated on the assumption that we >> have an ability to make sense of those internals. >> >> Once you're on a support call with the vendor - just load a custom >> kernel, module, whatever, it's already extremely expensive manual labor. >> >> > 2. We've faced RDMA issues related to lost EQ doorbells, requiring >> > complex debug, and ultimately root-caused as a defective CPU. Without >> > interactive access to the device allowing us to test theories like, >> > "what if we manually restart the EQ", we could not have proven this >> > definitively. >> >> I'm not familiar with the RDMA debugging capabilities. Perhaps there >> are some gaps there. The more proprietary the implementation the harder >> it is to debug. An answer to that would be "try to keep as much as >> possible open".. and interfaces which let closed user space talk to >> closed FW take us in the opposite direction. >> >> FWIW good netdevice drivers have a selftest which tests IRQ generation >> and EQ handling. I think that'd cover the case you're describing? >> IDK if mlx5 has them, but if it doesn't definitely worth adding. And I >> recommend running those on suspicious machines (ethtool -t, devlink has >> some selftests, too) > >Essentially, a warning light, and that doesn't solve the underlying >problem. We still need experts (e.g., vendors) to investigate with their >toolsets when and where the problem occurs. > >I offered this as an example of one issue we solved. I cannot predict >what kind of issues will pop up in the future, and writing a self-test >for every possible situation is impossible by definition. > >> >> > Firstly, We believe in working upstream and all of the advantages that >> > that brings to all the distros as well as to us and our customers. >> > >> > Secondly, Our cloud business offers many types of machine instances, >> > some with bare metal/vfio mlx5 devices, that require customer driven >> > debug and we want our customers to have the freedom to choose which OS >> > they want to use. >> >> I understand that having everything packaged and shipped together makes >> life easier. > >I think it is a requirement. We operate with Secure Boot. The kernel is >locked down. We don't have debugfs access, even if it were sufficient, >and we cannot compile and load modules. Even without Secure Boot, there >may not be a build environment available. > >We really need the module ready-to-go when the debug calls for it - no >building, no reboots, no months long attempts to reproduce in some lab - >just immediate availability of the debug interface on the affected >machine. > >> >> If the point of the kernel at this stage of its evolution is to collect >> incompatible bits of vendor software, make sure they build cleanly and >> ship them to distros - someone should tell me, and I will relent. > >I'm not sure I follow you... The mlx5ctl driver seems very compatible >with the mlx5 device driver. I may be misunderstanding. > mlx5ctl is 100% compatible with mlx5 ConnectX open spec [1], and supports any mlx5 driven stacks, not only netdev, it is able to expose millions of objects and device states interactively, debugfs would explode if we even try to accommodate some of these objects or states via debugfs, not to mention it is also impossible to maintain a stable debugfs output for such a huge data set, when this mlx5ctl interface speaks out a clear and open ConnectX language, which is the hole point of the driver. ConnectX is a highly programmable device for the enduser, and we have a very open / accommodating policy, an advanced user who can read the open spec [1], will also have the ability to do self-debug of their own RDMA/DPU/FPGA apps or similar usecases. Also I would like to repeat, this is not touching netdev, netdev's policies do not apply to the greater kernel or RDMA, and we have use cases with pure-infiniband/DPU/FPGA cards that have no netdev at all, or other cases with pur virtio instances, and much more. [1] https://network.nvidia.com/files/doc-2020/ethernet-adapters-programming-manual.pdf
On 07 Dec 12:06, Aron Silverton wrote: >On Thu, Dec 07, 2023 at 09:23:29AM -0800, Jakub Kicinski wrote: >> On Thu, 7 Dec 2023 10:41:25 -0600 Aron Silverton wrote: >> > > I understand that having everything packaged and shipped together makes >> > > life easier. >> > >> > I think it is a requirement. We operate with Secure Boot. The kernel is >> > locked down. We don't have debugfs access, even if it were sufficient, >> > and we cannot compile and load modules. Even without Secure Boot, there >> > may not be a build environment available. >> >> This 'no debugfs' requirement is a kernel lockdown thing, I presume? >> Are we expected to throw debugfs out the window and for all vendors >> to reimplement their debug functionality via a misc driver taking >> arbitrary ioctls? Not only does that sound like a complete waste of >> time and going backward in terms of quality of the interfaces, needing >> custom vendor tools etc. etc., but also you go from (hopefully somewhat) >> upstream reviewed debugfs interface to an interface where the only >> security assurance is vendor telling you "trust me, it's all good". > >IIRC, with lockdown, we can read from debugfs IFF the entries' >permissions are 0400. We cannot write. It's not suitable for >implementing an interactive debug interface. I would like to add that debugfs is usually used to expose the driver software states, as it evolves and changes with the driver code, but as I explained in the other email, it's clearly not a good solution to expose arbitrary objects of complex devices, that require interactive and selective debug interfaces tailored to the user use-case.
On Thu, Dec 07, 2023 at 12:06:28PM -0600, Aron Silverton wrote: > On Thu, Dec 07, 2023 at 09:23:29AM -0800, Jakub Kicinski wrote: > > On Thu, 7 Dec 2023 10:41:25 -0600 Aron Silverton wrote: > > > > I understand that having everything packaged and shipped together makes > > > > life easier. > > > > > > I think it is a requirement. We operate with Secure Boot. The kernel is > > > locked down. We don't have debugfs access, even if it were sufficient, > > > and we cannot compile and load modules. Even without Secure Boot, there > > > may not be a build environment available. > > > > This 'no debugfs' requirement is a kernel lockdown thing, I presume? > > Are we expected to throw debugfs out the window and for all vendors > > to reimplement their debug functionality via a misc driver taking > > arbitrary ioctls? Not only does that sound like a complete waste of > > time and going backward in terms of quality of the interfaces, needing > > custom vendor tools etc. etc., but also you go from (hopefully somewhat) > > upstream reviewed debugfs interface to an interface where the only > > security assurance is vendor telling you "trust me, it's all good". > > IIRC, with lockdown, we can read from debugfs IFF the entries' > permissions are 0400. We cannot write. It's not suitable for > implementing an interactive debug interface. This is a policy decision that a distro decides to do, and I have seen it happen many times. The problem with this is then, as you have found out, vendors try to work around the removal of access to debugfs by creating new interfaces like misc drivers and sysfs files to emulate what they were previously exporting with debugfs. When they do that, they break the reason why the distro/vendor decided to prevent access to debugfs in the first place, making the whole system insecure again! I see this happen all the time in Android devices as Android restricted access to debugfs many years ago to try to solve the problem of drivers doing insecure things there. Those drivers then just moved those insecure operations to a different interface, making the system insecure again. To stop this cat-and-mouse game, work with the vendors that are implementing these requirements to shut down access to these interfaces. That is a policy decision made by them, and does NOT mean that the kernel community needs to then start taking code to circumvent those decisions as that makes the whole thing pointless. So in short, use debugfs, that is what it was designed for. If a vendor does not wish to enable access to debugfs, then that is their decision (probably a good one), don't circumvent it by making a new interface, as odds are, the vendor will eventually realize it and work to cut off that new interface as well, as they rightfully should. thanks, greg k-h
On Thu, Dec 07, 2023 at 11:02:36AM -0800, Saeed Mahameed wrote: > I would like to add that debugfs is usually used to expose the driver > software states, as it evolves and changes with the driver code, but as I > explained in the other email, it's clearly not a good solution to expose > arbitrary objects of complex devices, that require interactive and > selective debug interfaces tailored to the user use-case. Why not? Remember, the only rule in debugfs is "there are no rules!" Well, there is one practical one, "do not rely on debugfs for any functioning system properties", i.e. "if access to debugfs is not present, or something in debugfs breaks, the kernel should continue to work just fine with no change in operations". But that's an overall system-level rule, not a rule for what you can put into debugfs. Have fun! greg k-h
On Fri, Dec 08, 2023 at 06:27:14AM +0100, Greg Kroah-Hartman wrote: > When they do that, they break the reason why the distro/vendor decided > to prevent access to debugfs in the first place, making the whole system > insecure again! Maybe others have, be we did not! In fact Saeed addresses this explicitly in the cover letter: Historically a userspace program was used that accessed the PCI register and config space directly through /sys/bus/pci/.../XXX and could operate these debugging interfaces in parallel with the running driver. This approach is incompatible with secure boot and kernel lockdown so this driver provides a secure and restricted interface to that. We did a lot of work here to build an interface that is compatible with the security principles of lockdown. This work is embodied by the new FW bit MLX5_UCTX_OBJECT_CAP_TOOLS_RESOURCES which causes the FW to run these RPCs in a security context that is compatible with lockdown. The overall philosophical architecture is similar to the NVMe vendor command channel which is also a way to deliver opaque RPCs to a device FW and is considered lockdown compatible. This series is not some insecure attempt to bypass lockdown like you may have seen in Android. mlx5 is the driver for the most popular high speed NIC in the world. Our customers take security seriously, and we take security seriously. Jason
On Fri, Dec 08, 2023 at 06:29:29AM +0100, Greg Kroah-Hartman wrote: > On Thu, Dec 07, 2023 at 11:02:36AM -0800, Saeed Mahameed wrote: > > I would like to add that debugfs is usually used to expose the driver > > software states, as it evolves and changes with the driver code, but as I > > explained in the other email, it's clearly not a good solution to expose > > arbitrary objects of complex devices, that require interactive and > > selective debug interfaces tailored to the user use-case. > > Why not? Remember, the only rule in debugfs is "there are no rules!" We already have debugfs files to issue RPCs. They are not secure and not lockdown compatible. Few users have been interested in this, Aron does a good job explaining the general perspective I've seen in many places. Users want an in-tree solution that is compatible with lockdown. A solution that works for all the mlx5 deployment modes (including Infiniband native without netdev) and covers most of the functionality they previously enjoyed with the /sys/../resource based tooling. This series delivers that. Nobody has offered an alterative vision that achieves the same thing. There have been lots of suggestions how to do small little parts, but not everything together as this does. > Well, there is one practical one, "do not rely on debugfs for any > functioning system properties" Jakub expressed additional "netdev only" rules for debugfs. Read-write interfaces must be constrained to a clear set of commands / settings Which I think is what Saeed is reacting to. Jason
On Thu, Dec 07, 2023 at 10:54:02AM -0800, Saeed Mahameed wrote: > Also I would like to repeat, this is not touching netdev, netdev's policies > do not apply to the greater kernel or RDMA, and we have use cases with > pure-infiniband/DPU/FPGA cards that have no netdev at all, or other cases > with pur virtio instances, and much more. Yes. I mean just about every complex block driver has some kind of vendor spcific tooling for debugging, statistics, etc. Trying to deny it just because one function expose by a device is a network device is even more silly than disallowing it for pure net devices (which already tend to be complex beasts).
diff --git a/MAINTAINERS b/MAINTAINERS index 97f51d5ec1cf..4b532df16b19 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -13835,6 +13835,14 @@ L: virtualization@lists.linux-foundation.org S: Supported F: drivers/vdpa/mlx5/ +MELLANOX MLX5 ConnectX Diag DRIVER +M: Saeed Mahameed <saeedm@nvidia.com> +R: Itay Avraham <itayavr@nvidia.com> +L: linux-kernel@vger.kernel.org +S: Supported +F: drivers/misc/mlx5ctl/ +F: include/uapi/misc/mlx5ctl.h + MELLANOX MLXCPLD I2C AND MUX DRIVER M: Vadim Pasternak <vadimp@nvidia.com> M: Michael Shych <michaelsh@nvidia.com> diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig index f37c4b8380ae..c0e4823648ed 100644 --- a/drivers/misc/Kconfig +++ b/drivers/misc/Kconfig @@ -579,4 +579,5 @@ source "drivers/misc/cardreader/Kconfig" source "drivers/misc/uacce/Kconfig" source "drivers/misc/pvpanic/Kconfig" source "drivers/misc/mchp_pci1xxxx/Kconfig" +source "drivers/misc/mlx5ctl/Kconfig" endmenu diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile index f2a4d1ff65d4..49bc4697f498 100644 --- a/drivers/misc/Makefile +++ b/drivers/misc/Makefile @@ -67,3 +67,4 @@ obj-$(CONFIG_TMR_MANAGER) += xilinx_tmr_manager.o obj-$(CONFIG_TMR_INJECT) += xilinx_tmr_inject.o obj-$(CONFIG_TPS6594_ESM) += tps6594-esm.o obj-$(CONFIG_TPS6594_PFSM) += tps6594-pfsm.o +obj-$(CONFIG_MLX5CTL) += mlx5ctl/ diff --git a/drivers/misc/mlx5ctl/Kconfig b/drivers/misc/mlx5ctl/Kconfig new file mode 100644 index 000000000000..faaa1dba2cc2 --- /dev/null +++ b/drivers/misc/mlx5ctl/Kconfig @@ -0,0 +1,14 @@ +# SPDX-License-Identifier: GPL-2.0 +# + +config MLX5CTL + tristate "mlx5 ConnectX control misc driver" + depends on MLX5_CORE + help + MLX5CTL provides interface for the user process to access the debug and + configuration registers of the ConnectX hardware family + (NICs, PCI switches and SmartNIC SoCs). + This will allow configuration and debug tools to work out of the box on + mainstream kernel. + + If you don't know what to do here, say N. diff --git a/drivers/misc/mlx5ctl/Makefile b/drivers/misc/mlx5ctl/Makefile new file mode 100644 index 000000000000..b5c7f99e0ab6 --- /dev/null +++ b/drivers/misc/mlx5ctl/Makefile @@ -0,0 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0 + +obj-$(CONFIG_MLX5CTL) += mlx5ctl.o +mlx5ctl-y := main.o diff --git a/drivers/misc/mlx5ctl/main.c b/drivers/misc/mlx5ctl/main.c new file mode 100644 index 000000000000..8eb150461b80 --- /dev/null +++ b/drivers/misc/mlx5ctl/main.c @@ -0,0 +1,314 @@ +// SPDX-License-Identifier: BSD-3-Clause OR GPL-2.0 +/* Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. */ + +#include <linux/miscdevice.h> +#include <linux/fs.h> +#include <linux/kernel.h> +#include <linux/module.h> +#include <linux/auxiliary_bus.h> +#include <linux/mlx5/device.h> +#include <linux/mlx5/driver.h> +#include <linux/atomic.h> +#include <linux/refcount.h> + +MODULE_DESCRIPTION("mlx5 ConnectX control misc driver"); +MODULE_AUTHOR("Saeed Mahameed <saeedm@nvidia.com>"); +MODULE_LICENSE("Dual BSD/GPL"); + +struct mlx5ctl_dev { + struct mlx5_core_dev *mdev; + struct miscdevice miscdev; + struct auxiliary_device *adev; + struct list_head fd_list; + spinlock_t fd_list_lock; /* protect list add/del */ + struct rw_semaphore rw_lock; + struct kref refcount; +}; + +struct mlx5ctl_fd { + u16 uctx_uid; + u32 uctx_cap; + u32 ucap; /* user cap */ + struct mlx5ctl_dev *mcdev; + struct list_head list; +}; + +#define mlx5ctl_err(mcdev, format, ...) \ + dev_err(mcdev->miscdev.parent, format, ##__VA_ARGS__) + +#define mlx5ctl_dbg(mcdev, format, ...) \ + dev_dbg(mcdev->miscdev.parent, "PID %d: " format, \ + current->pid, ##__VA_ARGS__) + +enum { + MLX5_UCTX_OBJECT_CAP_RAW_TX = 0x1, + MLX5_UCTX_OBJECT_CAP_INTERNAL_DEVICE_RESOURCES = 0x2, + MLX5_UCTX_OBJECT_CAP_TOOLS_RESOURCES = 0x4, +}; + +static int mlx5ctl_alloc_uid(struct mlx5ctl_dev *mcdev, u32 cap) +{ + u32 out[MLX5_ST_SZ_DW(create_uctx_out)] = {}; + u32 in[MLX5_ST_SZ_DW(create_uctx_in)] = {}; + void *uctx; + int err; + u16 uid; + + uctx = MLX5_ADDR_OF(create_uctx_in, in, uctx); + + mlx5ctl_dbg(mcdev, "MLX5_CMD_OP_CREATE_UCTX: caps 0x%x\n", cap); + MLX5_SET(create_uctx_in, in, opcode, MLX5_CMD_OP_CREATE_UCTX); + MLX5_SET(uctx, uctx, cap, cap); + + err = mlx5_cmd_exec(mcdev->mdev, in, sizeof(in), out, sizeof(out)); + if (err) + return err; + + uid = MLX5_GET(create_uctx_out, out, uid); + mlx5ctl_dbg(mcdev, "allocated uid %d with caps 0x%x\n", uid, cap); + return uid; +} + +static void mlx5ctl_release_uid(struct mlx5ctl_dev *mcdev, u16 uid) +{ + u32 in[MLX5_ST_SZ_DW(destroy_uctx_in)] = {}; + struct mlx5_core_dev *mdev = mcdev->mdev; + int err; + + MLX5_SET(destroy_uctx_in, in, opcode, MLX5_CMD_OP_DESTROY_UCTX); + MLX5_SET(destroy_uctx_in, in, uid, uid); + + err = mlx5_cmd_exec_in(mdev, destroy_uctx, in); + mlx5ctl_dbg(mcdev, "released uid %d err(%d)\n", uid, err); +} + +static void mcdev_get(struct mlx5ctl_dev *mcdev); +static void mcdev_put(struct mlx5ctl_dev *mcdev); + +static int mlx5ctl_open_mfd(struct mlx5ctl_fd *mfd) +{ + struct mlx5_core_dev *mdev = mfd->mcdev->mdev; + struct mlx5ctl_dev *mcdev = mfd->mcdev; + u32 ucap = 0, cap = 0; + int uid; + +#define MLX5_UCTX_CAP(mdev, cap) \ + (MLX5_CAP_GEN(mdev, uctx_cap) & MLX5_UCTX_OBJECT_CAP_##cap) + + if (capable(CAP_NET_RAW) && MLX5_UCTX_CAP(mdev, RAW_TX)) { + ucap |= CAP_NET_RAW; + cap |= MLX5_UCTX_OBJECT_CAP_RAW_TX; + } + + if (capable(CAP_SYS_RAWIO) && MLX5_UCTX_CAP(mdev, INTERNAL_DEVICE_RESOURCES)) { + ucap |= CAP_SYS_RAWIO; + cap |= MLX5_UCTX_OBJECT_CAP_INTERNAL_DEVICE_RESOURCES; + } + + if (capable(CAP_SYS_ADMIN) && MLX5_UCTX_CAP(mdev, TOOLS_RESOURCES)) { + ucap |= CAP_SYS_ADMIN; + cap |= MLX5_UCTX_OBJECT_CAP_TOOLS_RESOURCES; + } + + uid = mlx5ctl_alloc_uid(mcdev, cap); + if (uid < 0) + return uid; + + mfd->uctx_uid = uid; + mfd->uctx_cap = cap; + mfd->ucap = ucap; + mfd->mcdev = mcdev; + + mlx5ctl_dbg(mcdev, "allocated uid %d with uctx caps 0x%x, user cap 0x%x\n", + uid, cap, ucap); + return 0; +} + +static void mlx5ctl_release_mfd(struct mlx5ctl_fd *mfd) +{ + struct mlx5ctl_dev *mcdev = mfd->mcdev; + + mlx5ctl_release_uid(mcdev, mfd->uctx_uid); +} + +static int mlx5ctl_open(struct inode *inode, struct file *file) +{ + struct mlx5_core_dev *mdev; + struct mlx5ctl_dev *mcdev; + struct mlx5ctl_fd *mfd; + int err = 0; + + mcdev = container_of(file->private_data, struct mlx5ctl_dev, miscdev); + mcdev_get(mcdev); + down_read(&mcdev->rw_lock); + mdev = mcdev->mdev; + if (!mdev) { + err = -ENODEV; + goto unlock; + } + + mfd = kzalloc(sizeof(*mfd), GFP_KERNEL_ACCOUNT); + if (!mfd) + return -ENOMEM; + + mfd->mcdev = mcdev; + err = mlx5ctl_open_mfd(mfd); + if (err) + goto unlock; + + spin_lock(&mcdev->fd_list_lock); + list_add_tail(&mfd->list, &mcdev->fd_list); + spin_unlock(&mcdev->fd_list_lock); + + file->private_data = mfd; + +unlock: + up_read(&mcdev->rw_lock); + if (err) { + mcdev_put(mcdev); + kfree(mfd); + } + return err; +} + +static int mlx5ctl_release(struct inode *inode, struct file *file) +{ + struct mlx5ctl_fd *mfd = file->private_data; + struct mlx5ctl_dev *mcdev = mfd->mcdev; + + down_read(&mcdev->rw_lock); + if (!mcdev->mdev) { + pr_debug("[%d] UID %d mlx5ctl: mdev is already released\n", + current->pid, mfd->uctx_uid); + /* All mfds are already released, skip ... */ + goto unlock; + } + + spin_lock(&mcdev->fd_list_lock); + list_del(&mfd->list); + spin_unlock(&mcdev->fd_list_lock); + + mlx5ctl_release_mfd(mfd); + +unlock: + kfree(mfd); + up_read(&mcdev->rw_lock); + mcdev_put(mcdev); + file->private_data = NULL; + return 0; +} + +static const struct file_operations mlx5ctl_fops = { + .owner = THIS_MODULE, + .open = mlx5ctl_open, + .release = mlx5ctl_release, +}; + +static int mlx5ctl_probe(struct auxiliary_device *adev, + const struct auxiliary_device_id *id) + +{ + struct mlx5_adev *madev = container_of(adev, struct mlx5_adev, adev); + struct mlx5_core_dev *mdev = madev->mdev; + struct mlx5ctl_dev *mcdev; + char *devname = NULL; + int err; + + mcdev = kzalloc(sizeof(*mcdev), GFP_KERNEL_ACCOUNT); + if (!mcdev) + return -ENOMEM; + + kref_init(&mcdev->refcount); + INIT_LIST_HEAD(&mcdev->fd_list); + spin_lock_init(&mcdev->fd_list_lock); + init_rwsem(&mcdev->rw_lock); + mcdev->mdev = mdev; + mcdev->adev = adev; + devname = kasprintf(GFP_KERNEL_ACCOUNT, "mlx5ctl-%s", + dev_name(&adev->dev)); + if (!devname) { + err = -ENOMEM; + goto abort; + } + + mcdev->miscdev = (struct miscdevice) { + .minor = MISC_DYNAMIC_MINOR, + .name = devname, + .fops = &mlx5ctl_fops, + .parent = &adev->dev, + }; + + err = misc_register(&mcdev->miscdev); + if (err) { + mlx5ctl_err(mcdev, "mlx5ctl: failed to register misc device err %d\n", err); + goto abort; + } + + mlx5ctl_dbg(mcdev, "probe mdev@%s %s\n", dev_driver_string(mdev->device), dev_name(mdev->device)); + + auxiliary_set_drvdata(adev, mcdev); + + return 0; + +abort: + kfree(devname); + kfree(mcdev); + return err; +} + +static void mlx5ctl_remove(struct auxiliary_device *adev) +{ + struct mlx5ctl_dev *mcdev = auxiliary_get_drvdata(adev); + struct mlx5_core_dev *mdev = mcdev->mdev; + struct mlx5ctl_fd *mfd, *n; + + misc_deregister(&mcdev->miscdev); + down_write(&mcdev->rw_lock); + + list_for_each_entry_safe(mfd, n, &mcdev->fd_list, list) { + mlx5ctl_dbg(mcdev, "UID %d still has open FDs\n", mfd->uctx_uid); + list_del(&mfd->list); + mlx5ctl_release_mfd(mfd); + } + + mlx5ctl_dbg(mcdev, "removed mdev %s %s\n", + dev_driver_string(mdev->device), dev_name(mdev->device)); + + mcdev->mdev = NULL; /* prevent already open fds from accessing the device */ + up_write(&mcdev->rw_lock); + mcdev_put(mcdev); +} + +static void mcdev_free(struct kref *ref) +{ + struct mlx5ctl_dev *mcdev = container_of(ref, struct mlx5ctl_dev, refcount); + + kfree(mcdev->miscdev.name); + kfree(mcdev); +} + +static void mcdev_get(struct mlx5ctl_dev *mcdev) +{ + kref_get(&mcdev->refcount); +} + +static void mcdev_put(struct mlx5ctl_dev *mcdev) +{ + kref_put(&mcdev->refcount, mcdev_free); +} + +static const struct auxiliary_device_id mlx5ctl_id_table[] = { + { .name = MLX5_ADEV_NAME ".ctl", }, + {}, +}; + +MODULE_DEVICE_TABLE(auxiliary, mlx5ctl_id_table); + +static struct auxiliary_driver mlx5ctl_driver = { + .name = "ctl", + .probe = mlx5ctl_probe, + .remove = mlx5ctl_remove, + .id_table = mlx5ctl_id_table, +}; + +module_auxiliary_driver(mlx5ctl_driver);