diff mbox series

[V3,2/5] misc: mlx5ctl: Add mlx5ctl misc driver

Message ID	20231121070619.9836-3-saeed@kernel.org
State	New
Headers	Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.37 as permitted sender) client-ip=23.128.96.37; From: Saeed Mahameed <saeed@kernel.org> To: Arnd Bergmann <arnd@arndb.de>, Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jason Gunthorpe <jgg@nvidia.com>, Leon Romanovsky <leonro@nvidia.com>, Jiri Pirko <jiri@nvidia.com>, Leonid Bloch <lbloch@nvidia.com>, Itay Avraham <itayavr@nvidia.com>, Jakub Kicinski <kuba@kernel.org>, linux-kernel@vger.kernel.org, Saeed Mahameed <saeedm@nvidia.com> Subject: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver Date: Mon, 20 Nov 2023 23:06:16 -0800 Message-ID: <20231121070619.9836-3-saeed@kernel.org> In-Reply-To: <20231121070619.9836-1-saeed@kernel.org> References: <20231121070619.9836-1-saeed@kernel.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	mlx5 ConnectX control misc driver \| [V3,0/5] mlx5 ConnectX control misc driver [V3,1/5] mlx5: Add aux dev for ctl interface [V3,2/5] misc: mlx5ctl: Add mlx5ctl misc driver [V3,3/5] misc: mlx5ctl: Add info ioctl [V3,4/5] misc: mlx5ctl: Add command rpc ioctl [V3,5/5] misc: mlx5ctl: Add umem reg/unreg ioctl

Commit Message

Saeed Mahameed Nov. 21, 2023, 7:06 a.m. UTC

  From: Saeed Mahameed <saeedm@nvidia.com>

The ConnectX HW family supported by the mlx5 drivers uses an architecture
where a FW component executes "mailbox RPCs" issued by the driver to make
changes to the device. This results in a complex debugging environment
where the FW component has information and low level configuration that
needs to be accessed to userspace for debugging purposes.

Historically a userspace program was used that accessed the PCI register
and config space directly through /sys/bus/pci/.../XXX and could operate
these debugging interfaces in parallel with the running driver.
This approach is incompatible with secure boot and kernel lockdown so this
driver provides a secure and restricted interface to that same data.

On open the driver would allocate a special FW UID (user context ID)
restrected to debug RPCs only, later in this series all user RPCs will
be stamped with this UID.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 MAINTAINERS                   |   8 +
 drivers/misc/Kconfig          |   1 +
 drivers/misc/Makefile         |   1 +
 drivers/misc/mlx5ctl/Kconfig  |  14 ++
 drivers/misc/mlx5ctl/Makefile |   4 +
 drivers/misc/mlx5ctl/main.c   | 314 ++++++++++++++++++++++++++++++++++
 6 files changed, 342 insertions(+)
 create mode 100644 drivers/misc/mlx5ctl/Kconfig
 create mode 100644 drivers/misc/mlx5ctl/Makefile
 create mode 100644 drivers/misc/mlx5ctl/main.c

Comments

Greg KH Nov. 27, 2023, 1:36 p.m. UTC | #1

On Mon, Nov 20, 2023 at 11:06:16PM -0800, Saeed Mahameed wrote:
> +config MLX5CTL
> +	tristate "mlx5 ConnectX control misc driver"
> +	depends on MLX5_CORE
> +	help
> +	  MLX5CTL provides interface for the user process to access the debug and
> +          configuration registers of the ConnectX hardware family
> +          (NICs, PCI switches and SmartNIC SoCs).
> +          This will allow configuration and debug tools to work out of the box on
> +          mainstream kernel.

Did you run checkpatch.pl on this?  You lost the tabs :(

> --- /dev/null
> +++ b/drivers/misc/mlx5ctl/main.c
> @@ -0,0 +1,314 @@
> +// SPDX-License-Identifier: BSD-3-Clause OR GPL-2.0

Why is this dual licensed?  Again, you are using a GPL-only api for this
driver, how would that even be possible to make this code be BSD?

I thought we already discussed this, AND I talked to someone who
discussed this with a nvidia lawyer already and I thought this was going
to get changed.  What happened to that?

thanks,

greg k-h

Jason Gunthorpe Nov. 27, 2023, 2:40 p.m. UTC | #2

On Mon, Nov 27, 2023 at 01:36:54PM +0000, Greg Kroah-Hartman wrote:

> Why is this dual licensed?  Again, you are using a GPL-only api for this
> driver, how would that even be possible to make this code be BSD?

The code to FreeBSD with mlx5 related dual licensed code integrated is
freely available to inspect. I've never looked myself but I'm told
FreeBSD changes the reference Linux code to remove the use of GPL
code, and that FreeBSD has created BSD licensed versions of some
kernel APIs to support that.

> I thought we already discussed this, AND I talked to someone who
> discussed this with a nvidia lawyer already and I thought this was going
> to get changed.  What happened to that?

It is in the cover letter. You asked for an approval and statement
from our legal and we obtained it. Our lawyers did a review, discussed
with a LF contact, and continue to instruct to use the dual
license. We've done what you required us to do.

The summary I have of the call you refer to does not include a
discussion or agreement about change in nvidia policy regarding mlx5
code.

Like Dave said, our lawyers are not your lawyers. Now that we have
involved legal, and they have given an instruction, we must follow it.

Regards,
Jason

Greg KH Nov. 27, 2023, 3:51 p.m. UTC | #3

On Mon, Nov 27, 2023 at 10:40:17AM -0400, Jason Gunthorpe wrote:
> On Mon, Nov 27, 2023 at 01:36:54PM +0000, Greg Kroah-Hartman wrote:
> 
> > Why is this dual licensed?  Again, you are using a GPL-only api for this
> > driver, how would that even be possible to make this code be BSD?
> 
> The code to FreeBSD with mlx5 related dual licensed code integrated is
> freely available to inspect. I've never looked myself but I'm told
> FreeBSD changes the reference Linux code to remove the use of GPL
> code, and that FreeBSD has created BSD licensed versions of some
> kernel APIs to support that.

Yeah, I'm not going to get into the legality of "creating BSD licensed
versions of gpl-only Linux apis" but note, lots of lawyers look
longingly at those things when they consider early retirement :)

> > I thought we already discussed this, AND I talked to someone who
> > discussed this with a nvidia lawyer already and I thought this was going
> > to get changed.  What happened to that?
> 
> It is in the cover letter. You asked for an approval and statement
> from our legal and we obtained it. Our lawyers did a review, discussed
> with a LF contact, and continue to instruct to use the dual
> license. We've done what you required us to do.

Ah, missed that, sorry, I didn't see it in the changes for this set of
patches, it was in the previous submission.

> The summary I have of the call you refer to does not include a
> discussion or agreement about change in nvidia policy regarding mlx5
> code.
> 
> Like Dave said, our lawyers are not your lawyers. Now that we have
> involved legal, and they have given an instruction, we must follow it.

I think everyone involved is thankful that your lawyers are not mine :)

Ok, best of luck with this mess, I'll stop harping on it now and just
point out all of the other issues here.  First off, you all need to get
the network maintainers to agree that this driver is ok to do this way,
and I don't think that has happened yet, so I'll wait on reviewing the
series until that is resolved.

thanks,

greg k-h

Jason Gunthorpe Nov. 27, 2023, 4:17 p.m. UTC | #4

On Mon, Nov 27, 2023 at 03:51:10PM +0000, Greg Kroah-Hartman wrote:

> Ok, best of luck with this mess, I'll stop harping on it now and just
> point out all of the other issues here.  First off, you all need to get
> the network maintainers to agree that this driver is ok to do this way,
> and I don't think that has happened yet, so I'll wait on reviewing the
> series until that is resolved.

As I said already, I strongly disagree with the idea that the netdev
maintainers get a global veto on what happens with mlx5 devices just
because they sometimes have an ethernet port on the back of the card.

This module is primarily (but not exclusively) for rdma related
functionality, not netdev, and the RDMA maintainers Ack it.

Thanks,
Jason

Greg KH Nov. 27, 2023, 6:27 p.m. UTC | #5

On Mon, Nov 27, 2023 at 12:17:32PM -0400, Jason Gunthorpe wrote:
> On Mon, Nov 27, 2023 at 03:51:10PM +0000, Greg Kroah-Hartman wrote:
> 
> > Ok, best of luck with this mess, I'll stop harping on it now and just
> > point out all of the other issues here.  First off, you all need to get
> > the network maintainers to agree that this driver is ok to do this way,
> > and I don't think that has happened yet, so I'll wait on reviewing the
> > series until that is resolved.
> 
> As I said already, I strongly disagree with the idea that the netdev
> maintainers get a global veto on what happens with mlx5 devices just
> because they sometimes have an ethernet port on the back of the card.

I understand you might disagree, however I hold their opinion in high
regard and want to ensure that they agree that exposing device-specific
debugging information for a device that deals with networking is ok to
do so in a device-specific misc device node and not through some other
way that other networking devices normally do (i.e. netlink or
some-other-such-thing.)

Note, device-specific character devices have almost always proven to be
a bad idea in the long run, I understand your immediate need to do
something like this, but remember that keeping it alive for the next 20+
years is going to be tough.

> This module is primarily (but not exclusively) for rdma related
> functionality, not netdev, and the RDMA maintainers Ack it.

In my mind, RDMA implies networking, as it's over a network connection,
but hey, I might be wrong :)

thanks,

greg k-h

Greg KH Nov. 27, 2023, 6:59 p.m. UTC | #6

On Mon, Nov 20, 2023 at 11:06:16PM -0800, Saeed Mahameed wrote:
> +struct mlx5ctl_dev {
> +	struct mlx5_core_dev *mdev;
> +	struct miscdevice miscdev;
> +	struct auxiliary_device *adev;
> +	struct list_head fd_list;
> +	spinlock_t fd_list_lock; /* protect list add/del */
> +	struct rw_semaphore rw_lock;
> +	struct kref refcount;

You now have 2 different things that control the lifespan of this
structure.  We really need some way to automatically check this so that
people don't keep making this same mistake, it happens all the time :(

Please pick one structure (miscdevice) or the other (kref) to control
the lifespan, having 2 will just not work.

Also, why a rw_semaphore?  Only use those if you can prove with a
benchmark that it is actually faster, otherwise it's simpler to just use
a normal mutex (hint, you are changing the fields in the structure with
the read lock held, which feels very wrong, and so needs a LOT of
documentation, or just use a normal mutex...)

thanks,

greg k-h

Saeed Mahameed Nov. 27, 2023, 7:26 p.m. UTC | #7

On 27 Nov 18:27, Greg Kroah-Hartman wrote:
>On Mon, Nov 27, 2023 at 12:17:32PM -0400, Jason Gunthorpe wrote:
>> On Mon, Nov 27, 2023 at 03:51:10PM +0000, Greg Kroah-Hartman wrote:
>>
>> > Ok, best of luck with this mess, I'll stop harping on it now and just
>> > point out all of the other issues here.  First off, you all need to get
>> > the network maintainers to agree that this driver is ok to do this way,
>> > and I don't think that has happened yet, so I'll wait on reviewing the
>> > series until that is resolved.
>>
>> As I said already, I strongly disagree with the idea that the netdev
>> maintainers get a global veto on what happens with mlx5 devices just
>> because they sometimes have an ethernet port on the back of the card.
>
>I understand you might disagree, however I hold their opinion in high
>regard and want to ensure that they agree that exposing device-specific
>debugging information for a device that deals with networking is ok to
>do so in a device-specific misc device node and not through some other
>way that other networking devices normally do (i.e. netlink or
>some-other-such-thing.)
>
>Note, device-specific character devices have almost always proven to be
>a bad idea in the long run, I understand your immediate need to do
>something like this, but remember that keeping it alive for the next 20+
>years is going to be tough.
>

This driver is different as it doesn't replace existing mlx5 drivers,
mlx5 functionality drivers are still there to expose the device features
through the standard stacks, this is just a companion driver to access
debug information, by driver and FW design mlx5ctl is not meant to
manage or pilot the device like other device specific char drivers.

To be clear this debug driver (or at least an older version of it) 
has been already in use for over than 15 years, since the beginning
of mlx5, we used to only provide it as external package called mft 
debug tools [1] which has the kernel parts as well. Now it's time to
upstream it.

mlx5ctl will keep serving existing and future HW for the next few decades,
I am pretty sure of that. as the cover-letter explains  mlx5 architecture
is set in stone and written in ink, the same mlx5 drivers work on any
ConnectX chip since 2012, and the will keep working on the next generations
of chips, mlx5ctl will be no different.

[1] https://network.nvidia.com/products/adapter-software/firmware-tools/

>> This module is primarily (but not exclusively) for rdma related
>> functionality, not netdev, and the RDMA maintainers Ack it.
>

For Infiniband/virtio/vfio/vdpa/nvme/fpga ConnectX devices mlx5 netdev
doesn't even exist, so it is not reasonable to ask that the debug
interface should go via the netdev stack, mlx5ctl is needed to serve
all users of mlx5 devices, not only netdev (networking).

So I really find this odd, that one stack maintainer gets a veto over all
others.

>In my mind, RDMA implies networking, as it's over a network connection,
>but hey, I might be wrong :)
>
>thanks,
>
>greg k-h

Jakub Kicinski Nov. 28, 2023, 12:07 a.m. UTC | #8

On Mon, 27 Nov 2023 11:26:06 -0800 Saeed Mahameed wrote:
> This driver is different as it doesn't replace existing mlx5 drivers,
> mlx5 functionality drivers are still there to expose the device features
> through the standard stacks, this is just a companion driver to access
> debug information, by driver and FW design mlx5ctl is not meant to
> manage or pilot the device like other device specific char drivers.

You keep saying "debug information" which is really underselling this
driver. Are you not going to support mstreg?

The common development flow as far as netdev is concerned is:
 - some customer is interested in a new feature of a chip
 - vendor hacks the support out of tree, using oot module and/or
   user space tooling
 - customer does a PoC with that hacked up, non-upstream solution
   - if it works, vendor has to find out a proper upstream API,
     hopefully agreed on with other vendors
   - if it doesn't match customer needs the whole thing lands in the bin

If the vendor specific PoC can be achieved with fully upstream software
we lose leverage to force vendors to agree on common APIs.

This should all be self-evident for people familiar with netdev, but
I thought I'd spell it out.

I understand that the device has other uses, but every modern NIC has
other uses. I don't see how netdev can agree to this driver as long as
there is potential for configuring random networking things thru it.
All major netdev vendors have a set of out of tree tools / "expose
everything misc drivers", "for debug". They will soon follow your
example if we let this in.

David Ahern Nov. 28, 2023, 4:46 a.m. UTC | #9

On Mon, Nov 27, 2023 at 04:07:19PM -0800, Jakub Kicinski wrote:
> On Mon, 27 Nov 2023 11:26:06 -0800 Saeed Mahameed wrote:
> > This driver is different as it doesn't replace existing mlx5 drivers,
> > mlx5 functionality drivers are still there to expose the device features
> > through the standard stacks, this is just a companion driver to access
> > debug information, by driver and FW design mlx5ctl is not meant to
> > manage or pilot the device like other device specific char drivers.
> 
> You keep saying "debug information" which is really underselling this
> driver. Are you not going to support mstreg?
> 
> The common development flow as far as netdev is concerned is:
>  - some customer is interested in a new feature of a chip
>  - vendor hacks the support out of tree, using oot module and/or
>    user space tooling
>  - customer does a PoC with that hacked up, non-upstream solution
>    - if it works, vendor has to find out a proper upstream API,
>      hopefully agreed on with other vendors
>    - if it doesn't match customer needs the whole thing lands in the bin
> 
> If the vendor specific PoC can be achieved with fully upstream software
> we lose leverage to force vendors to agree on common APIs.

Please elaborate on what "common" API there is to create here?

Do you agree that each ASIC in the device is unique and hence will
have made different trade offs - both features and nerd knobs to tune
and affect the performance and runtime capabilities? If you do not
agree, then we need to have a different discussion ...
If you do, please elaborate on the outline of some common API that
could possibly be done here.

You said no to the devlink parameters as a way to tune an ASIC. This
is a common, established tool, using a common, established message
channel but in an open, free form way of letting a customer see what
tunables there are for a specific H/W version and firmware version
and then set them. That is about as common as can be for different
vendors creating different ASICs with different goals and design
objectives. Yet, you somehow expect all of them to have nerd knob X
and tunable Y. That is not realistic.

> 
> This should all be self-evident for people familiar with netdev, but
> I thought I'd spell it out.
> 
> I understand that the device has other uses, but every modern NIC has
> other uses. I don't see how netdev can agree to this driver as long as
> there is potential for configuring random networking things thru it.
> All major netdev vendors have a set of out of tree tools / "expose
> everything misc drivers", "for debug". They will soon follow your
> example if we let this in.

Out of tree drivers are already ingrained into customers now. Mellanox
(in this case) has tried many different angles at getting access to H/W
unique tunables (i.e., the devlink comment) and now dumping huge amounts
of data. Your response to the devlink parameters attempt is to basically
abuse the firmware upgrade command as a way to push a binary blob that
can contain said settings (and I still have not fully wrapped my head
around the fact that you suggested that option).

What specifically are you looking for? There are different vendors and
different h/w options for specific market based reasons. Your hard line
stance against needs like this is what is pushing out of tree drivers
and tools to continue.

Jakub Kicinski Nov. 28, 2023, 2:53 p.m. UTC | #10

On Mon, 27 Nov 2023 21:46:28 -0700 David Ahern wrote:
> > You keep saying "debug information" which is really underselling this
> > driver. Are you not going to support mstreg?
> > 
> > The common development flow as far as netdev is concerned is:
> >  - some customer is interested in a new feature of a chip
> >  - vendor hacks the support out of tree, using oot module and/or
> >    user space tooling
> >  - customer does a PoC with that hacked up, non-upstream solution
> >    - if it works, vendor has to find out a proper upstream API,
> >      hopefully agreed on with other vendors
> >    - if it doesn't match customer needs the whole thing lands in the bin
> > 
> > If the vendor specific PoC can be achieved with fully upstream software
> > we lose leverage to force vendors to agree on common APIs.  
> 
> Please elaborate on what "common" API there is to create here?

Damn, am I so bad at explaining basic things or y'all are spending
5 seconds reading this and are not really paying attention? :|

> Do you agree that each ASIC in the device is unique and hence will
> have made different trade offs - both features and nerd knobs to tune
> and affect the performance and runtime capabilities? If you do not
> agree, then we need to have a different discussion ...
> If you do, please elaborate on the outline of some common API that
> could possibly be done here.

We have devlink params. If that doesn't scale we can look for other
solutions but let's see them not scale _in practice_ first.

> You said no to the devlink parameters as a way to tune an ASIC.

What? When?

> This is a common, established tool, using a common, established message
> channel but in an open, free form way of letting a customer see what
> tunables there are for a specific H/W version and firmware version
> and then set them. That is about as common as can be for different
> vendors creating different ASICs with different goals and design
> objectives. Yet, you somehow expect all of them to have nerd knob X
> and tunable Y. That is not realistic.

I don't know what you're talking about.

> > This should all be self-evident for people familiar with netdev, but
> > I thought I'd spell it out.
> > 
> > I understand that the device has other uses, but every modern NIC has
> > other uses. I don't see how netdev can agree to this driver as long as
> > there is potential for configuring random networking things thru it.
> > All major netdev vendors have a set of out of tree tools / "expose
> > everything misc drivers", "for debug". They will soon follow your
> > example if we let this in.  
> 
> Out of tree drivers are already ingrained into customers now. Mellanox
> (in this case) has tried many different angles at getting access to H/W
> unique tunables (i.e., the devlink comment) and now dumping huge amounts
> of data. Your response to the devlink parameters attempt is to basically
> abuse the firmware upgrade command as a way to push a binary blob that
> can contain said settings (and I still have not fully wrapped my head
> around the fact that you suggested that option).
> 
> What specifically are you looking for? There are different vendors and
> different h/w options for specific market based reasons. Your hard line
> stance against needs like this is what is pushing out of tree drivers
> and tools to continue.

Sounds like you'd like a similar open-ended interface for your device.

Jason Gunthorpe Nov. 28, 2023, 4:24 p.m. UTC | #11

On Tue, Nov 28, 2023 at 06:53:21AM -0800, Jakub Kicinski wrote:
> > You said no to the devlink parameters as a way to tune an ASIC.
> 
> What? When?

You said you already rejected it at the very start of this discussion
and linked to the video recording of the rejection discussion:

https://lore.kernel.org/all/20231019165055.GT3952@nvidia.com/

This session was specifically on the 600 FW configuration parameters
that mlx5 has. This is something that is done today on non-secure boot
systems with direct PCI access on sysfs and would be absorbed into
this driver on secure-boot systems. Ie nothing really changes from the
broader ecosystem perspective.

The discussion starts at time index 22:11.

Dave (IIRC? sorry) is talking about unique things and suggesting
devlink. Many people in the audiance seem to support this idea

27:00 -> 28:00 you repeated this thread's argument about said NO
"occasionally you are allowed to use [devlink parameters] them"

At 29 you said just keep it all out of tree.

Oh, I chimed in around 30:00 saying it is not the kernel maintainers
job to force standardization. This is a user problem.

31:25 you said again "nothing"

31:30 S390 teams would like this and pushed back against "keep it all
out of tree" describing the situation as a "deadlock".

33:00 you said need two implementors for device specific parameters,
which misses the point of the discussion, IMHO.

And this was at the Dublin LPC, so over a year ago.

I second Dave's question - if you do not like mlx5ctl, then what is
your vision to solve all these user problems?

Jason

Jakub Kicinski Nov. 28, 2023, 4:44 p.m. UTC | #12

On Tue, 28 Nov 2023 12:24:13 -0400 Jason Gunthorpe wrote:
> You said you already rejected it at the very start of this discussion
> and linked to the video recording of the rejection discussion:
> 
> https://lore.kernel.org/all/20231019165055.GT3952@nvidia.com/
> 
> This session was specifically on the 600 FW configuration parameters
> that mlx5 has. This is something that is done today on non-secure boot
> systems with direct PCI access on sysfs and would be absorbed into
> this driver on secure-boot systems. Ie nothing really changes from the
> broader ecosystem perspective.

The question at LPC was about making devlink params completely
transparent to the kernel. Basically added directly from FW.
That what I was not happy about.

You can add as many params at the driver level as you want.
In fact I asked Saeed repeatedly to start posting all those
params instead of complaining.

> I second Dave's question - if you do not like mlx5ctl, then what is
> your vision to solve all these user problems?

Let the users complain about the user problems. Also something
I repeatedly told Saeed. His response was something along the lines
of users are secret, they can't post on the list, blah, blah.

You know one user who is participating in this thread?
*ME*
While the lot of you work for vendors.

David Ahern Nov. 28, 2023, 4:52 p.m. UTC | #13

On 11/28/23 7:53 AM, Jakub Kicinski wrote:
> On Mon, 27 Nov 2023 21:46:28 -0700 David Ahern wrote:
>>> You keep saying "debug information" which is really underselling this
>>> driver. Are you not going to support mstreg?
>>>
>>> The common development flow as far as netdev is concerned is:
>>>  - some customer is interested in a new feature of a chip
>>>  - vendor hacks the support out of tree, using oot module and/or
>>>    user space tooling
>>>  - customer does a PoC with that hacked up, non-upstream solution
>>>    - if it works, vendor has to find out a proper upstream API,
>>>      hopefully agreed on with other vendors
>>>    - if it doesn't match customer needs the whole thing lands in the bin
>>>
>>> If the vendor specific PoC can be achieved with fully upstream software
>>> we lose leverage to force vendors to agree on common APIs.  
>>
>> Please elaborate on what "common" API there is to create here?
> 
> Damn, am I so bad at explaining basic things or y'all are spending
> 5 seconds reading this and are not really paying attention? :|
> 
>> Do you agree that each ASIC in the device is unique and hence will
>> have made different trade offs - both features and nerd knobs to tune
>> and affect the performance and runtime capabilities? If you do not
>> agree, then we need to have a different discussion ...
>> If you do, please elaborate on the outline of some common API that
>> could possibly be done here.
> 
> We have devlink params. If that doesn't scale we can look for other
> solutions but let's see them not scale _in practice_ first.
> 
>> You said no to the devlink parameters as a way to tune an ASIC.
> 
> What? When?

Jason responded with the same LPC reference, so I will only add a
reference to the slides for interested parties:

https://lpc.events/event/16/contributions/1359/attachments/1092/2094/FW-Centric-devices.pdf

> 
> Sounds like you'd like a similar open-ended interface for your device.

That's the danger of chiming in on threads like this - people drawing
conclusions they should not.

My interest on this thread is along the same lines for commenting during
both the LPC 2022 discussion and again at netconf this year - trying to
understand and keep track of the strongly held opinions of maintainers
and what options are deemed off limits for similar needs. No vendor
wants to go in circles redesigning and rewriting s/w.

Jason Gunthorpe Nov. 28, 2023, 5:52 p.m. UTC | #14

On Tue, Nov 28, 2023 at 08:44:21AM -0800, Jakub Kicinski wrote:
> On Tue, 28 Nov 2023 12:24:13 -0400 Jason Gunthorpe wrote:
> > You said you already rejected it at the very start of this discussion
> > and linked to the video recording of the rejection discussion:
> > 
> > https://lore.kernel.org/all/20231019165055.GT3952@nvidia.com/
> > 
> > This session was specifically on the 600 FW configuration parameters
> > that mlx5 has. This is something that is done today on non-secure boot
> > systems with direct PCI access on sysfs and would be absorbed into
> > this driver on secure-boot systems. Ie nothing really changes from the
> > broader ecosystem perspective.
> 
> The question at LPC was about making devlink params completely
> transparent to the kernel. Basically added directly from FW.
> That what I was not happy about.

It is creating a back-porting nightmare for all the enterprise
distributions.

> You can add as many params at the driver level as you want.
> In fact I asked Saeed repeatedly to start posting all those
> params instead of complaining.

That really isn't what you said in the video.

Regardless, configurables are only one part of what mlx5ctl addresses,
we still have all the debugability problems, which are arguably more
important.

> > I second Dave's question - if you do not like mlx5ctl, then what is
> > your vision to solve all these user problems?
>
> Let the users complain about the user problems. Also something
> I repeatedly told Saeed. His response was something along the lines
> of users are secret, they can't post on the list, blah, blah.

You mean like the S390 team at IBM did in the video?

This is not a reasonable position. One of the jobs of the vendors is
to aggregate the user requests. Even the giant hyperscale customers
that do have the capacity to come on this list prefer to delegate
these things to us.

If you want to get a direct user forum the kernel mailing list is not
an appropriate place to do it.

> You know one user who is participating in this thread?
> *ME*
> While the lot of you work for vendors.

I'm sick of this vendor bashing. You work for *one* user. You know who
talks to *every* user out there? *ME*.

User and vendors need debugging of this complex HW. I don't need to
bring a parade of a dozen users to this thread to re-enforce that
obvious truth. Indeed when debugging is required the vendor usually
has to do it, so we are the user in this discussion.

You didn't answer the question, what is your alternative debug-ability
vision here?

Jason

Jakub Kicinski Nov. 28, 2023, 6:33 p.m. UTC | #15

On Tue, 28 Nov 2023 13:52:24 -0400 Jason Gunthorpe wrote:
> > The question at LPC was about making devlink params completely
> > transparent to the kernel. Basically added directly from FW.
> > That what I was not happy about.  
> 
> It is creating a back-porting nightmare for all the enterprise
> distributions.

We don't care about enterprise distros, Jason, or stable kernel APIs.

> > You can add as many params at the driver level as you want.
> > In fact I asked Saeed repeatedly to start posting all those
> > params instead of complaining.  
> 
> That really isn't what you said in the video.
> 
> Regardless, configurables are only one part of what mlx5ctl addresses,
> we still have all the debugability problems, which are arguably more
> important.

Read-only debug interfaces are "do whatever you want" in netdev.
Params controlling them (ie. writing stuff) need to be reviewed 
but are also allowed.

Doesn't mlx5 have a pile of stuff in debugfs already?

Nobody bothered to answer my "are you not going support mstreg over
this" question (arbitrary register writes).

> > Let the users complain about the user problems. Also something
> > I repeatedly told Saeed. His response was something along the lines
> > of users are secret, they can't post on the list, blah, blah.  
> 
> You mean like the S390 team at IBM did in the video?
> 
> This is not a reasonable position. One of the jobs of the vendors is
> to aggregate the user requests. Even the giant hyperscale customers
> that do have the capacity to come on this list prefer to delegate
> these things to us.
> 
> If you want to get a direct user forum the kernel mailing list is not
> an appropriate place to do it.

Agree to disagree.

> > You know one user who is participating in this thread?
> > *ME*
> > While the lot of you work for vendors.  
> 
> I'm sick of this vendor bashing. You work for *one* user. You know who
> talks to *every* user out there? *ME*.
> 
> User and vendors need debugging of this complex HW. I don't need to
> bring a parade of a dozen users to this thread to re-enforce that
> obvious truth. Indeed when debugging is required the vendor usually
> has to do it, so we are the user in this discussion.
> 
> You didn't answer the question, what is your alternative debug-ability
> vision here?

Covered above. And it's been discussed multiple times.

Honestly I don't want to spend any more time discussing this.
Once you're ready to work together in good faith let me know.

On future revisions of this series please carry:

Nacked-by: Jakub Kicinski <kuba@kernel.org>

Saeed Mahameed Nov. 28, 2023, 7:31 p.m. UTC | #16

On 28 Nov 08:44, Jakub Kicinski wrote:
>On Tue, 28 Nov 2023 12:24:13 -0400 Jason Gunthorpe wrote:
>> You said you already rejected it at the very start of this discussion
>> and linked to the video recording of the rejection discussion:
>>
>> https://lore.kernel.org/all/20231019165055.GT3952@nvidia.com/
>>
>> This session was specifically on the 600 FW configuration parameters
>> that mlx5 has. This is something that is done today on non-secure boot
>> systems with direct PCI access on sysfs and would be absorbed into
>> this driver on secure-boot systems. Ie nothing really changes from the
>> broader ecosystem perspective.
>
>The question at LPC was about making devlink params completely
>transparent to the kernel. Basically added directly from FW.
>That what I was not happy about.
>
>You can add as many params at the driver level as you want.
>In fact I asked Saeed repeatedly to start posting all those
>params instead of complaining.
>

We posted many params over the years the you shot down on the spot,
do you really expect me to implement 600 of those knowing that you will
nack 80% of them asking to have common knobs for all vendors, and you know
that is impossible.
you nack patches then ask for a porpossal, we came up with many proposal
and discussed them face to face on multiple occasions, LPC/netconf etc,
then you ask for patches, then you nack again, we are just going in circles
here..

>> I second Dave's question - if you do not like mlx5ctl, then what is
>> your vision to solve all these user problems?
>
>Let the users complain about the user problems. Also something
>I repeatedly told Saeed. His response was something along the lines
>of users are secret, they can't post on the list, blah, blah.
>

I never said it is a secret, but I can't publicly reveal who my customers
are and what they want, you know very well who asked for the high
frequency counter sampling.. So we came up with a very clear solution,
that has nothing to do with netdev, since for that particular use-case it
is not netdev specific, and netdev stack isn't even present.

>You know one user who is participating in this thread?
>*ME*
>While the lot of you work for vendors.

And how *YOU* expect the vendors to debug *YOUR* issues, if you don't
allow them to access their HW? 

Asking all vendors to use *YOUR* "devlink generic_dev generic_knob" is an
insult to all vendors, how about you provide the ASIC design and RTLs
to all vendors and we just manufacture it for you..

Saeed Mahameed Nov. 28, 2023, 7:55 p.m. UTC | #17

On 28 Nov 10:33, Jakub Kicinski wrote:
>On Tue, 28 Nov 2023 13:52:24 -0400 Jason Gunthorpe wrote:
>> > The question at LPC was about making devlink params completely
>> > transparent to the kernel. Basically added directly from FW.
>> > That what I was not happy about.
>>
>> It is creating a back-porting nightmare for all the enterprise
>> distributions.
>
>We don't care about enterprise distros, Jason, or stable kernel APIs.
>
>> > You can add as many params at the driver level as you want.
>> > In fact I asked Saeed repeatedly to start posting all those
>> > params instead of complaining.
>>
>> That really isn't what you said in the video.
>>
>> Regardless, configurables are only one part of what mlx5ctl addresses,
>> we still have all the debugability problems, which are arguably more
>> important.
>
>Read-only debug interfaces are "do whatever you want" in netdev.
>Params controlling them (ie. writing stuff) need to be reviewed
>but are also allowed.
>
>Doesn't mlx5 have a pile of stuff in debugfs already?
>

not enough, not scalable and it's a backporting and maintenance nightmare
as Jason already showed.

mlx5 supports creating millions of objects, tools need to selectively
pick which objects to dump for a specific use case, if it's ok with you to
do this in debugfs, then ioctl is much cleaner .. so what's your problem
with mlx5ctl?


>Nobody bothered to answer my "are you not going support mstreg over
>this" question (arbitrary register writes).
>
>> > Let the users complain about the user problems. Also something
>> > I repeatedly told Saeed. His response was something along the lines
>> > of users are secret, they can't post on the list, blah, blah.
>>
>> You mean like the S390 team at IBM did in the video?
>>
>> This is not a reasonable position. One of the jobs of the vendors is
>> to aggregate the user requests. Even the giant hyperscale customers
>> that do have the capacity to come on this list prefer to delegate
>> these things to us.
>>
>> If you want to get a direct user forum the kernel mailing list is not
>> an appropriate place to do it.
>
>Agree to disagree.
>
>> > You know one user who is participating in this thread?
>> > *ME*
>> > While the lot of you work for vendors.
>>
>> I'm sick of this vendor bashing. You work for *one* user. You know who
>> talks to *every* user out there? *ME*.
>>
>> User and vendors need debugging of this complex HW. I don't need to
>> bring a parade of a dozen users to this thread to re-enforce that
>> obvious truth. Indeed when debugging is required the vendor usually
>> has to do it, so we are the user in this discussion.
>>
>> You didn't answer the question, what is your alternative debug-ability
>> vision here?
>
>Covered above. And it's been discussed multiple times.
>
>Honestly I don't want to spend any more time discussing this.
>Once you're ready to work together in good faith let me know.
>
>On future revisions of this series please carry:
>
>Nacked-by: Jakub Kicinski <kuba@kernel.org>

I asked before and I never got a technical answer, based on what?

All we got is just political views and complaints against vendors.
What is your proposal for accessing every possible debug information from a
vendor specific device ? devlink X Y Z, debugfs?  won't work, sorry.

And I can't accept "do it out of tree" as an answer from a well
established linux maintainer, the whole point of this is to have this
available in every linux box with any mlx5 configuration (not only netdev)
so we can start debugging on the spot.

For your claims that we need this for setting device parameters, it
is simply not true, because we don't need this driver to do that,
so please go back and read the cover-letter and code, and let me know what
is wrong with our approach to get access to our device's debug info.

Saeed Mahameed Nov. 28, 2023, 8:10 p.m. UTC | #18

On 28 Nov 10:33, Jakub Kicinski wrote:
>On Tue, 28 Nov 2023 13:52:24 -0400 Jason Gunthorpe wrote:
>> > The question at LPC was about making devlink params completely
>> > transparent to the kernel. Basically added directly from FW.
>> > That what I was not happy about.
>>
>> It is creating a back-porting nightmare for all the enterprise
>> distributions.
>
>We don't care about enterprise distros, Jason, or stable kernel APIs.
>

Oh, I missed this one, so you don't care about users? 

Users often pay to distros for support, and distros always turn to vendors
for debug situations, in fact one of the high stakeholders for this is an
enterprise distro..

Also Jason and I are users, and more than 300 engineers at nvidia
and dozens of customers are users, deploying 100s of thousands of ConnectX
chips in their fleets.

if it weren't important for us and our users, I wouldn't even fight this
hard.. but it is important, and very useful, I can't express that hard
enough.

So you don't care ? Then don't ask about who the users are, in the other
email, apparently it's all about your personal opinion and views about
vendors what drives your responses. So it is really hard to debate with
you..

Greg KH Nov. 29, 2023, 9:08 a.m. UTC | #19

On Tue, Nov 28, 2023 at 12:10:00PM -0800, Saeed Mahameed wrote:
> On 28 Nov 10:33, Jakub Kicinski wrote:
> > On Tue, 28 Nov 2023 13:52:24 -0400 Jason Gunthorpe wrote:
> > > > The question at LPC was about making devlink params completely
> > > > transparent to the kernel. Basically added directly from FW.
> > > > That what I was not happy about.
> > > 
> > > It is creating a back-porting nightmare for all the enterprise
> > > distributions.
> > 
> > We don't care about enterprise distros, Jason, or stable kernel APIs.
> > 
> 
> Oh, I missed this one, so you don't care about users?

Ok, time out please.  This isn't going anywhere fast except to make
everyone else mad.

To Jakub's point, no, we don't care about enterprise distros, they are a
consumer of our releases and while some of them pay the salaries of our
developers, they really don't have much influence over our development
rules as they are just so far behind in releases that their releases
look nothing like what we do in places (i.e. Linux "like" just like many
Android systems.)

If the enterprise distros pop in here and give us their opinions of the
patchset, I would GREATLY appreciate it, as having more people review
code at this point in time would be most helpful instead of having to
hear about how the interfaces are broken 2 years from now.

And I think that's what is driving your work now, the "enterprise"
distros finally picked up the "lock down the kernel from random PCI
device access in userspace" which caused you to have to drop your
userspace implementation to create a real kernel driver, correct?

And as for stable kernel apis, you all know that's just not a thing, and
has nothing to do with users EXCEPT it benefits users because it keeps
kernel code smaller and faster overall, that's well documented and users
appreciate it.

> Users often pay to distros for support, and distros always turn to vendors
> for debug situations, in fact one of the high stakeholders for this is an
> enterprise distro..

Hey, I was right, an enterprise distro wants this driver, great, can you
get them to review it as well?  I'm tired of being the only one to
review patches like this and could use the help if they are going to
rely on this (why do they pass that work to me?).  They should be the
ones helping us catch basic things like the reference count issues I
pointed out, as they can test the code, I can't :)

But, let's step back a bit further.

It seems the network device normal interface for this is using devlink.
And drivers are allowed to have driver-specific devlink interfaces, as
you know, your driver has lots of them today.  Why not just add more?
What's wrong with 600+ more interfaces being added as that way they
would be well documented and fit in with the existing infrastructure
that you have today.

Is the issue that the firmware doesn't guarantee that these interfaces
will be documented or stable or even known at this point in time?  If
so, how are your going to have a good userspace tool for this?  What am
I missing that requires a totally-new-and-custom user/kernel api from
what all other network drivers are using today?

thanks,

greg k-h

Saeed Mahameed Nov. 29, 2023, 9:08 a.m. UTC | #20

On 27 Nov 18:59, Greg Kroah-Hartman wrote:
>On Mon, Nov 20, 2023 at 11:06:16PM -0800, Saeed Mahameed wrote:
>> +struct mlx5ctl_dev {
>> +	struct mlx5_core_dev *mdev;
>> +	struct miscdevice miscdev;
>> +	struct auxiliary_device *adev;
>> +	struct list_head fd_list;
>> +	spinlock_t fd_list_lock; /* protect list add/del */
>> +	struct rw_semaphore rw_lock;
>> +	struct kref refcount;
>
>You now have 2 different things that control the lifespan of this
>structure.  We really need some way to automatically check this so that
>people don't keep making this same mistake, it happens all the time :(
>
>Please pick one structure (miscdevice) or the other (kref) to control
>the lifespan, having 2 will just not work.
>

miscdevice doesn't handle the lifespan, open files will remain open even
after the miscdevice was unregistered, hence we use the kref to defer the
kfree until the last open file is closed.

>Also, why a rw_semaphore?  Only use those if you can prove with a
>benchmark that it is actually faster, otherwise it's simpler to just use
>a normal mutex (hint, you are changing the fields in the structure with
>the read lock held, which feels very wrong, and so needs a LOT of
>documentation, or just use a normal mutex...)
>

It is needed so we can protect against underlaying device unloading while
miscdevice is active, we use rw semaphore since we don't care about
synchronization between any of the fops, but we want to protect current
active ioctls and fops from sudden mlx5ctl_remove (auxiliary_driver.remove)
which can happen dynamically due to underlaying device removal.

So here is the locking scheme:

write_lock() : only on mlx5_ctl remove and mark the device is down
via assigning NULL to mcdev->dev, to let all new readers abort and to wait
for current readers to finish their task.

read_lock(): used in all fops and ioctls, to make sure underlaying
mlx5_core device is still active, and to prevent open files to access the
device when miscdevice was already unregistered.

I agree, this should've been documented in the code, I will add
documentation.

>thanks,
>
>greg k-h

Greg KH Nov. 29, 2023, 9:20 a.m. UTC | #21

On Wed, Nov 29, 2023 at 01:08:39AM -0800, Saeed Mahameed wrote:
> On 27 Nov 18:59, Greg Kroah-Hartman wrote:
> > On Mon, Nov 20, 2023 at 11:06:16PM -0800, Saeed Mahameed wrote:
> > > +struct mlx5ctl_dev {
> > > +	struct mlx5_core_dev *mdev;
> > > +	struct miscdevice miscdev;
> > > +	struct auxiliary_device *adev;
> > > +	struct list_head fd_list;
> > > +	spinlock_t fd_list_lock; /* protect list add/del */
> > > +	struct rw_semaphore rw_lock;
> > > +	struct kref refcount;
> > 
> > You now have 2 different things that control the lifespan of this
> > structure.  We really need some way to automatically check this so that
> > people don't keep making this same mistake, it happens all the time :(
> > 
> > Please pick one structure (miscdevice) or the other (kref) to control
> > the lifespan, having 2 will just not work.
> > 
> 
> miscdevice doesn't handle the lifespan, open files will remain open even
> after the miscdevice was unregistered, hence we use the kref to defer the
> kfree until the last open file is closed.

miscdevice has a reference counter and a lifecycle, you can not have two
reference counted objects in the same structure and expect things to
work well.

> > Also, why a rw_semaphore?  Only use those if you can prove with a
> > benchmark that it is actually faster, otherwise it's simpler to just use
> > a normal mutex (hint, you are changing the fields in the structure with
> > the read lock held, which feels very wrong, and so needs a LOT of
> > documentation, or just use a normal mutex...)
> > 
> 
> It is needed so we can protect against underlaying device unloading while
> miscdevice is active, we use rw semaphore since we don't care about
> synchronization between any of the fops, but we want to protect current
> active ioctls and fops from sudden mlx5ctl_remove (auxiliary_driver.remove)
> which can happen dynamically due to underlaying device removal.

Then use a normal mutex.  Only use a rw lock if you can prove the
performance needs it as usually a rw lock is slower and more complex as
you then have to document stuff like:

> So here is the locking scheme:
> 
> write_lock() : only on mlx5_ctl remove and mark the device is down
> via assigning NULL to mcdev->dev, to let all new readers abort and to wait
> for current readers to finish their task.
> 
> read_lock(): used in all fops and ioctls, to make sure underlaying
> mlx5_core device is still active, and to prevent open files to access the
> device when miscdevice was already unregistered.
> 
> I agree, this should've been documented in the code, I will add
> documentation.

Just make it simple and use a normal mutex please.

And fix up the reference counting, it shouldn't be this complex, it's
just a "simple" misc device driver :)

But before you do that, please see my other email about why not using
devlink for all of this instead.

thanks,

greg k-h

Jason Gunthorpe Nov. 29, 2023, 1:02 p.m. UTC | #22

On Wed, Nov 29, 2023 at 09:20:32AM +0000, Greg Kroah-Hartman wrote:
> On Wed, Nov 29, 2023 at 01:08:39AM -0800, Saeed Mahameed wrote:
> > On 27 Nov 18:59, Greg Kroah-Hartman wrote:
> > > On Mon, Nov 20, 2023 at 11:06:16PM -0800, Saeed Mahameed wrote:
> > > > +struct mlx5ctl_dev {
> > > > +	struct mlx5_core_dev *mdev;
> > > > +	struct miscdevice miscdev;
> > > > +	struct auxiliary_device *adev;
> > > > +	struct list_head fd_list;
> > > > +	spinlock_t fd_list_lock; /* protect list add/del */
> > > > +	struct rw_semaphore rw_lock;
> > > > +	struct kref refcount;
> > > 
> > > You now have 2 different things that control the lifespan of this
> > > structure.  We really need some way to automatically check this so that
> > > people don't keep making this same mistake, it happens all the time :(
> > > 
> > > Please pick one structure (miscdevice) or the other (kref) to control
> > > the lifespan, having 2 will just not work.
> > > 
> > 
> > miscdevice doesn't handle the lifespan, open files will remain open even
> > after the miscdevice was unregistered, hence we use the kref to defer the
> > kfree until the last open file is closed.
> 
> miscdevice has a reference counter and a lifecycle, you can not have two
> reference counted objects in the same structure and expect things to
> work well.

This second refcount is hidden well:

struct miscdevice {
	int minor;
	const char *name;
	const struct file_operations *fops;
	struct list_head list;
	struct device *parent;
	struct device *this_device;
	const struct attribute_group **groups;
	const char *nodename;
	umode_t mode;
};

You have slammed us in the other thread for doing a poor review job
here, but none of us see what you are seeing. Can you explain better?

> > write_lock() : only on mlx5_ctl remove and mark the device is down
> > via assigning NULL to mcdev->dev, to let all new readers abort and to wait
> > for current readers to finish their task.
> > 
> > read_lock(): used in all fops and ioctls, to make sure underlaying
> > mlx5_core device is still active, and to prevent open files to access the
> > device when miscdevice was already unregistered.
> > 
> > I agree, this should've been documented in the code, I will add
> > documentation.
> 
> Just make it simple and use a normal mutex please.

A normal mutex would make the entire ioctl interface single threaded,
this is not desirable.

> But before you do that, please see my other email about why not using
> devlink for all of this instead.

We've been over this already, the devlink discussion is about some
configuration stuff. It has never been suggested to cover the debug
interface. This series is primarily about debug, the devlink thing is
a distraction to main point.

Jason

Greg KH Nov. 29, 2023, 3:41 p.m. UTC | #23

On Wed, Nov 29, 2023 at 09:02:00AM -0400, Jason Gunthorpe wrote:
> On Wed, Nov 29, 2023 at 09:20:32AM +0000, Greg Kroah-Hartman wrote:
> > On Wed, Nov 29, 2023 at 01:08:39AM -0800, Saeed Mahameed wrote:
> > > On 27 Nov 18:59, Greg Kroah-Hartman wrote:
> > > > On Mon, Nov 20, 2023 at 11:06:16PM -0800, Saeed Mahameed wrote:
> > > > > +struct mlx5ctl_dev {
> > > > > +	struct mlx5_core_dev *mdev;
> > > > > +	struct miscdevice miscdev;
> > > > > +	struct auxiliary_device *adev;
> > > > > +	struct list_head fd_list;
> > > > > +	spinlock_t fd_list_lock; /* protect list add/del */
> > > > > +	struct rw_semaphore rw_lock;
> > > > > +	struct kref refcount;
> > > > 
> > > > You now have 2 different things that control the lifespan of this
> > > > structure.  We really need some way to automatically check this so that
> > > > people don't keep making this same mistake, it happens all the time :(
> > > > 
> > > > Please pick one structure (miscdevice) or the other (kref) to control
> > > > the lifespan, having 2 will just not work.
> > > > 
> > > 
> > > miscdevice doesn't handle the lifespan, open files will remain open even
> > > after the miscdevice was unregistered, hence we use the kref to defer the
> > > kfree until the last open file is closed.
> > 
> > miscdevice has a reference counter and a lifecycle, you can not have two
> > reference counted objects in the same structure and expect things to
> > work well.
> 
> This second refcount is hidden well:
> 
> struct miscdevice {
> 	int minor;
> 	const char *name;
> 	const struct file_operations *fops;
> 	struct list_head list;
> 	struct device *parent;
> 	struct device *this_device;
> 	const struct attribute_group **groups;
> 	const char *nodename;
> 	umode_t mode;
> };

Ugh, you are right, I was wrong, there is no reference count here, using
a miscdevice _requires_ you to have a separate reference count, like you
all did.  My fault.

> > > write_lock() : only on mlx5_ctl remove and mark the device is down
> > > via assigning NULL to mcdev->dev, to let all new readers abort and to wait
> > > for current readers to finish their task.
> > > 
> > > read_lock(): used in all fops and ioctls, to make sure underlaying
> > > mlx5_core device is still active, and to prevent open files to access the
> > > device when miscdevice was already unregistered.
> > > 
> > > I agree, this should've been documented in the code, I will add
> > > documentation.
> > 
> > Just make it simple and use a normal mutex please.
> 
> A normal mutex would make the entire ioctl interface single threaded,
> this is not desirable.

Why not?  It's an ioctl for a single device, surely this isn't
performance criticial.  And then only grab it when needed, on
read/write/ioctl path it shouldn't be needed at all due to the proper
reference counting of the structures.  Only on open/close, right?

And again, for a rw semaphore, benchmarks matter, often, if not almost
always, a normal mutex is faster for stuff like this.  If not, then a
benchmark will show it.

> > But before you do that, please see my other email about why not using
> > devlink for all of this instead.
> 
> We've been over this already, the devlink discussion is about some
> configuration stuff.

It was?  I see device-specific diagonostic data for the mlx5 driver
being exported through devlink today, that's not configuration.  Why not
just add more?

> It has never been suggested to cover the debug interface. This series
> is primarily about debug, the devlink thing is a distraction to main
> point.

For me it is the main point at the moment.  Please explain why devlink
does not work for the information that you have created a misc device
where you want an ioctl api instead, as I honestly do not understand.

thanks,

greg k-h

Jason Gunthorpe Nov. 29, 2023, 6:07 p.m. UTC | #24

On Wed, Nov 29, 2023 at 03:41:54PM +0000, Greg Kroah-Hartman wrote:

> Ugh, you are right, I was wrong, there is no reference count here, using
> a miscdevice _requires_ you to have a separate reference count, like you
> all did.  My fault.

I think we did not do a good job conveying that several experienced
people have looked at this internally already. I see you noticed a few
things we missed, but this is not unreviewed garbage code. We have
different opionions on how some of these things should work but none
of them are what I'd consider unlinuxy or wrong.

> > > Just make it simple and use a normal mutex please.
> > 
> > A normal mutex would make the entire ioctl interface single threaded,
> > this is not desirable.
> 
> Why not?  It's an ioctl for a single device, surely this isn't
> performance criticial.  

The RPCs to the FW can take a long time to execute and the "single
device" is really a multi-threaded CPU complex on the other
side. Being able to submit any commands for (say) 0.5s while one
thread chews on something is not desirable.

> And then only grab it when needed, on read/write/ioctl path it
> shouldn't be needed at all due to the proper reference counting of
> the structures.  Only on open/close, right?

No - this lock is protecting the mcdev->mdev from being freed. If the
read side is held then mcdev->mdev cannot be freed. This is not a
fully refcounted object because the mdev is actually the device
underlying the aux device and it will be logically destroyed after the
aux device_driver remove() function returns.

The purpose of the rwlock is to ensure that the remove() function does
not return until all threads have completed using the mdev. Once
remove proceeds the read side of the lock will observe mdev==NULL and
fail all operations until the FD is closed. The refcount keeps the
memory underpinning the basic operation of the FD alive after remove()
until the FD finally closes.

At least for the purposes of this driver it is how it has to be. I
agree the mdev API surface that requires some of this may not be the
best it can be.

> And again, for a rw semaphore, benchmarks matter, often, if not almost
> always, a normal mutex is faster for stuff like this.  If not, then a
> benchmark will show it.

We are not talking about micro differences in lock/unlock cost here.
This is obviously better because of how much time the lock will stay
locked while executing RPCs and sleeping waiting for interrupts back
from FW. We are talking >> 100's of ms of time spent locked while
executing FW RPCs.

> > We've been over this already, the devlink discussion is about some
> > configuration stuff.
> 
> It was?  I see device-specific diagonostic data for the mlx5 driver
> being exported through devlink today, that's not configuration.  Why not
> just add more?

Today mlx5 has a bunch of debuggables:
 - counters, even some device specific counters. Both in RDMA and
   netdev objects exposed through netlink and sysfs
 - A "core dump" mechanism through devlink
 - Some debugfs stuff, but we are now aware of serious problems with it
 - Some "devlink parameters" which configure a limited set of things

These are not "interactive". Yes we can add more counters, more
parameters and more stuff to the core dump, but that isn't what I mean
by debugging.

This is a complex device, it has processors on the other side running
a lot of software. A primary purpose of this design is to allow
bi-directional communication with debug agents in the FW.

Consider, (I'm trying to draw a picture here without disclosing trade
secrets about how the device operates), that one of the debug agents
is a GDB stub. There are many CPUs inside these devices, field
engineers may wish to enter a breakpoint and inspect something. With
this scheme we would put the GDB stub protocol packets into RPC
messages and execute them to the FW.

In the very worst nightmare debug scenario we might build a custom FW
with a custom debug agent targetting the specific problem and use
these generic FW RPCs to communicate however that very special one-off
agent requires.

Notice what such a thing requires: a future compatible channel to
exchange data bi-directionally with the SW components in the FW.

The design here of processing *generic RPCs* is not some dirty nod
toward enterprise distros - it is a sane and logical design to put
things in userspace that the kernel has no value-add to being part
of. This is standard straight-from-Linus design values. It is how
things like nvme built their very similar "vendor commands" system
(see for instance the intel & wd commands in nvmecli).

So, can we run generic opaque RPCs over some TBD devlink interface,
including with DMA? Technologically yes, of course. Practically?
No. Jakub has clearly stated his position on philisophical limits in
netdev and those include devlink. It excludes soemthing so open-ended
and ill-defined. FWIW, I agree with this, devlink should be much more
constrained. It is why we never suggested devlink for this kind of
debugging.

Genericness is also the source of alot of the cross discussion in this
thread.

A generic RPC interface can do *alot*. Actually, it can do *almost
anything* to the device. That is the entire point. What it can't do is
violate the integrity of the kernel - hence why it is a FW RPC
interface and not direct access to device registers.

So when Jakub says we can send configuration values that could also be
sent over devlink parameters, or read counters that could also be read
over netlink, he is not wrong. But we already do all those duplicative
things through the direct PCI access and have for at least 10 years.

I disagree that this duplication is a problem. Re-usable, cross vendor
generic interfaces have merit on their own. mlx5 has long been
implementing a wide selection of them as they are invented. We've been
doing this even though those interfaces duplicate what the raw PCI
tools already did.

This new driver isn't going to change that. We will still participate
in netdev and still implement these other things. We are a big part of
the community, I don't need someone holding a stick over the mlx5
team's head beating them to do the "right" thing.

Jason

Aron Silverton Dec. 4, 2023, 9:37 p.m. UTC | #25

Hi,

On Wed, Nov 29, 2023 at 09:08:03AM +0000, Greg Kroah-Hartman wrote:
> On Tue, Nov 28, 2023 at 12:10:00PM -0800, Saeed Mahameed wrote:
> > On 28 Nov 10:33, Jakub Kicinski wrote:
> > > On Tue, 28 Nov 2023 13:52:24 -0400 Jason Gunthorpe wrote:
> > > > > The question at LPC was about making devlink params completely
> > > > > transparent to the kernel. Basically added directly from FW.
> > > > > That what I was not happy about.
> > > > 
> > > > It is creating a back-porting nightmare for all the enterprise
> > > > distributions.
> > > 
> > > We don't care about enterprise distros, Jason, or stable kernel APIs.
> > > 
> > 
> > Oh, I missed this one, so you don't care about users?
> 
> Ok, time out please.  This isn't going anywhere fast except to make
> everyone else mad.
> 
> To Jakub's point, no, we don't care about enterprise distros, they are a
> consumer of our releases and while some of them pay the salaries of our
> developers, they really don't have much influence over our development
> rules as they are just so far behind in releases that their releases
> look nothing like what we do in places (i.e. Linux "like" just like many
> Android systems.)
> 
> If the enterprise distros pop in here and give us their opinions of the
> patchset, I would GREATLY appreciate it, as having more people review
> code at this point in time would be most helpful instead of having to
> hear about how the interfaces are broken 2 years from now.

We will be happy to test and review v4 of this series.

Fully interactive debugging has become essential to getting to the root
cause of complex issues that arise between the types of devices being
discussed and their interactions with various software layers. Turning
knobs and dumping registers just isn't sufficient, and I wish we'd had
this capability long ago.

Our customers have already benefited from the interactive debugging
capability that these patches provide, but the full potential won't be
realized until this is upstream.

> 
> And I think that's what is driving your work now, the "enterprise"
> distros finally picked up the "lock down the kernel from random PCI
> device access in userspace" which caused you to have to drop your
> userspace implementation to create a real kernel driver, correct?
> 
> And as for stable kernel apis, you all know that's just not a thing, and
> has nothing to do with users EXCEPT it benefits users because it keeps
> kernel code smaller and faster overall, that's well documented and users
> appreciate it.
> 
> > Users often pay to distros for support, and distros always turn to vendors
> > for debug situations, in fact one of the high stakeholders for this is an
> > enterprise distro..
> 
> Hey, I was right, an enterprise distro wants this driver, great, can you
> get them to review it as well?  I'm tired of being the only one to
> review patches like this and could use the help if they are going to
> rely on this (why do they pass that work to me?).  They should be the
> ones helping us catch basic things like the reference count issues I
> pointed out, as they can test the code, I can't :)

Guilty as charged? But, I'm sure we are not the only ones. I'm sorry that
we were not able to spot this and provide a review earlier, but we are
happy to do so for v4.

> 
> But, let's step back a bit further.
> 
> It seems the network device normal interface for this is using devlink.
> And drivers are allowed to have driver-specific devlink interfaces, as
> you know, your driver has lots of them today.  Why not just add more?
> What's wrong with 600+ more interfaces being added as that way they
> would be well documented and fit in with the existing infrastructure
> that you have today.
> 
> Is the issue that the firmware doesn't guarantee that these interfaces
> will be documented or stable or even known at this point in time?  If
> so, how are your going to have a good userspace tool for this?  What am
> I missing that requires a totally-new-and-custom user/kernel api from
> what all other network drivers are using today?
> 
> thanks,
> 
> greg k-h

Aron

Jakub Kicinski Dec. 5, 2023, 2:52 a.m. UTC | #26

On Mon, 4 Dec 2023 15:37:56 -0600 Aron Silverton wrote:
> > To Jakub's point, no, we don't care about enterprise distros, they are a
> > consumer of our releases and while some of them pay the salaries of our
> > developers, they really don't have much influence over our development
> > rules as they are just so far behind in releases that their releases
> > look nothing like what we do in places (i.e. Linux "like" just like many
> > Android systems.)
> > 
> > If the enterprise distros pop in here and give us their opinions of the
> > patchset, I would GREATLY appreciate it, as having more people review
> > code at this point in time would be most helpful instead of having to
> > hear about how the interfaces are broken 2 years from now.  
> 
> We will be happy to test and review v4 of this series.
> 
> Fully interactive debugging has become essential to getting to the root
> cause of complex issues that arise between the types of devices being
> discussed and their interactions with various software layers. Turning
> knobs and dumping registers just isn't sufficient, and I wish we'd had
> this capability long ago.

Could you shed some light on what issues you were debugging, broadly?

> Our customers have already benefited from the interactive debugging
> capability that these patches provide, but the full potential won't be
> realized until this is upstream.

Can you elaborate on why "full potential won't be realized until this
is upstream"? The driver looks very easy to ship as a standalone module.

Aron Silverton Dec. 5, 2023, 5:11 p.m. UTC | #27

Hi Jakub,

On Mon, Dec 04, 2023 at 06:52:10PM -0800, Jakub Kicinski wrote:
> On Mon, 4 Dec 2023 15:37:56 -0600 Aron Silverton wrote:
> > > To Jakub's point, no, we don't care about enterprise distros, they are a
> > > consumer of our releases and while some of them pay the salaries of our
> > > developers, they really don't have much influence over our development
> > > rules as they are just so far behind in releases that their releases
> > > look nothing like what we do in places (i.e. Linux "like" just like many
> > > Android systems.)
> > > 
> > > If the enterprise distros pop in here and give us their opinions of the
> > > patchset, I would GREATLY appreciate it, as having more people review
> > > code at this point in time would be most helpful instead of having to
> > > hear about how the interfaces are broken 2 years from now.  
> > 
> > We will be happy to test and review v4 of this series.
> > 
> > Fully interactive debugging has become essential to getting to the root
> > cause of complex issues that arise between the types of devices being
> > discussed and their interactions with various software layers. Turning
> > knobs and dumping registers just isn't sufficient, and I wish we'd had
> > this capability long ago.
> 
> Could you shed some light on what issues you were debugging, broadly?
> 

I'll do my best, with two recent instances:

1. As mentioned already, we recently faced a complex problem with RDMA
in KVM and were getting nowhere trying to debug using the usual methods.
Mellanox support was able to use this debug interface to see what was
happening on the PCI bus and prove that the issue was caused by
corrupted PCIe transactions. This finally put the investigation on the
correct path. The debug interface was used consistently and extensively
to test theories about what was happening in the system and, ultimately,
allowed the problem to be solved.

2. We've faced RDMA issues related to lost EQ doorbells, requiring
complex debug, and ultimately root-caused as a defective CPU. Without
interactive access to the device allowing us to test theories like,
"what if we manually restart the EQ", we could not have proven this
definitively.

> > Our customers have already benefited from the interactive debugging
> > capability that these patches provide, but the full potential won't be
> > realized until this is upstream.
> 
> Can you elaborate on why "full potential won't be realized until this
> is upstream"? The driver looks very easy to ship as a standalone module.

Firstly, We believe in working upstream and all of the advantages that
that brings to all the distros as well as to us and our customers.

Secondly, Our cloud business offers many types of machine instances,
some with bare metal/vfio mlx5 devices, that require customer driven
debug and we want our customers to have the freedom to choose which OS
they want to use.

Aron

Jakub Kicinski Dec. 6, 2023, 4:48 a.m. UTC | #28

On Tue, 5 Dec 2023 11:11:00 -0600 Aron Silverton wrote:
> 1. As mentioned already, we recently faced a complex problem with RDMA
> in KVM and were getting nowhere trying to debug using the usual methods.
> Mellanox support was able to use this debug interface to see what was
> happening on the PCI bus and prove that the issue was caused by
> corrupted PCIe transactions. This finally put the investigation on the
> correct path. The debug interface was used consistently and extensively
> to test theories about what was happening in the system and, ultimately,
> allowed the problem to be solved.

You hit on an important point, and what is also my experience working
at Meta. I may have even mentioned it in this thread already.
If there is a serious issue with a complex device, there are two ways
you can get support - dump all you can and send the dump to the vendor
or get on a live debugging session with their engineers. Users' ability
to debug those devices is practically non-existent. The idea that we
need access to FW internals is predicated on the assumption that we
have an ability to make sense of those internals.

Once you're on a support call with the vendor - just load a custom
kernel, module, whatever, it's already extremely expensive manual labor.

> 2. We've faced RDMA issues related to lost EQ doorbells, requiring
> complex debug, and ultimately root-caused as a defective CPU. Without
> interactive access to the device allowing us to test theories like,
> "what if we manually restart the EQ", we could not have proven this
> definitively.

I'm not familiar with the RDMA debugging capabilities. Perhaps there
are some gaps there. The more proprietary the implementation the harder
it is to debug. An answer to that would be "try to keep as much as
possible open".. and interfaces which let closed user space talk to
closed FW take us in the opposite direction.

FWIW good netdevice drivers have a selftest which tests IRQ generation
and EQ handling. I think that'd cover the case you're describing?
IDK if mlx5 has them, but if it doesn't definitely worth adding. And I
recommend running those on suspicious machines (ethtool -t, devlink has
some selftests, too)

> Firstly, We believe in working upstream and all of the advantages that
> that brings to all the distros as well as to us and our customers.
> 
> Secondly, Our cloud business offers many types of machine instances,
> some with bare metal/vfio mlx5 devices, that require customer driven
> debug and we want our customers to have the freedom to choose which OS
> they want to use.

I understand that having everything packaged and shipped together makes
life easier.

If the point of the kernel at this stage of its evolution is to collect
incompatible bits of vendor software, make sure they build cleanly and
ship them to distros - someone should tell me, and I will relent.

David Ahern Dec. 7, 2023, 3:54 p.m. UTC | #29

On 12/5/23 9:48 PM, Jakub Kicinski wrote:
> On Tue, 5 Dec 2023 11:11:00 -0600 Aron Silverton wrote:
>> 1. As mentioned already, we recently faced a complex problem with RDMA
>> in KVM and were getting nowhere trying to debug using the usual methods.
>> Mellanox support was able to use this debug interface to see what was
>> happening on the PCI bus and prove that the issue was caused by
>> corrupted PCIe transactions. This finally put the investigation on the
>> correct path. The debug interface was used consistently and extensively
>> to test theories about what was happening in the system and, ultimately,
>> allowed the problem to be solved.
> 
> You hit on an important point, and what is also my experience working
> at Meta. I may have even mentioned it in this thread already.
> If there is a serious issue with a complex device, there are two ways
> you can get support - dump all you can and send the dump to the vendor
> or get on a live debugging session with their engineers. Users' ability
> to debug those devices is practically non-existent. The idea that we
> need access to FW internals is predicated on the assumption that we
> have an ability to make sense of those internals.
> 
> Once you're on a support call with the vendor - just load a custom
> kernel, module, whatever, it's already extremely expensive manual labor.

You rail against out of tree drivers and vendor proprietary tools, and
now you argue for just that. There is no reason debugging capabilities
can not be built into the OS and used when needed. That means anything
needed - from kernel modules to userspace tools.

The Meta data point is not representative of the world at large -
different scale, different needs, different expertise on staff (OS and
H/W). Getting S/W installed (especially anything requiring a compiler)
in a production server (and VMs) is not an easy request and in many
cases not even possible.

When a customer hits problem, the standard steps are to run a script,
generate a tar file and ship it to the OS vendor. Engineers at the OS
vendor go through it and may need other data - like getting detailed
dumps from individual pieces of H/W. Every time those requests require
going to a vendor web site to pull down vendor tools, get permission to
install them, schedule the run of said tool ... it only serves to drag
out the debugging process. ie., this open-ended stance only serves to
hurt Linux users.

Jakub Kicinski Dec. 7, 2023, 4:20 p.m. UTC | #30

David, I agree with your points. I think you're misreading
what I said.

On Thu, 7 Dec 2023 08:54:51 -0700 David Ahern wrote:
> You rail against out of tree drivers and vendor proprietary tools, and
> now you argue for just that.

I don't rail against out of tree drivers, very much the opposite.
Linux supports out of tree modules, and I agree that's 100% the
correct thing to do. I'd encourage more people to take advantage of
that. The problem is quite the opposite, all the "security hardening"
is making it almost impossible for users to take advantage of OOT
modules.

> There is no reason debugging capabilities
> can not be built into the OS and used when needed. That means anything
> needed - from kernel modules to userspace tools.
> 
> The Meta data point is not representative of the world at large -
> different scale, different needs, different expertise on staff (OS and
> H/W). Getting S/W installed (especially anything requiring a compiler)
> in a production server (and VMs) is not an easy request and in many
> cases not even possible.

I did not say it's easy.

> When a customer hits problem, the standard steps are to run a script,
> generate a tar file and ship it to the OS vendor. Engineers at the OS
> vendor go through it and may need other data - like getting detailed
> dumps from individual pieces of H/W.

You say that like this is not _exactly_ what I just said!?

> Every time those requests require
> going to a vendor web site to pull down vendor tools, get permission to
> install them, schedule the run of said tool ... it only serves to drag
> out the debugging process. ie., this open-ended stance only serves to
> hurt Linux users.

Right, exactly. What are you arguing with then? As I said - we have a
very open / accommodating policy for extracting all sort of debug and
state dumps. You can put whatever read only stuff you want in debugfs**
Read-write interfaces must be constrained to a clear set of commands /
settings but also very much allowed. As you said users need to be able
to extract debug info to share with the vendors, no tools necessary.

** (this will surprise you but you can also put stats there, if they 
    are custom, I don't care if they go into ethtool -S or debugfs)

Aron Silverton Dec. 7, 2023, 4:41 p.m. UTC | #31

On Tue, Dec 05, 2023 at 08:48:55PM -0800, Jakub Kicinski wrote:
> On Tue, 5 Dec 2023 11:11:00 -0600 Aron Silverton wrote:
> > 1. As mentioned already, we recently faced a complex problem with RDMA
> > in KVM and were getting nowhere trying to debug using the usual methods.
> > Mellanox support was able to use this debug interface to see what was
> > happening on the PCI bus and prove that the issue was caused by
> > corrupted PCIe transactions. This finally put the investigation on the
> > correct path. The debug interface was used consistently and extensively
> > to test theories about what was happening in the system and, ultimately,
> > allowed the problem to be solved.
> 
> You hit on an important point, and what is also my experience working
> at Meta. I may have even mentioned it in this thread already.
> If there is a serious issue with a complex device, there are two ways
> you can get support - dump all you can and send the dump to the vendor
> or get on a live debugging session with their engineers. Users' ability
> to debug those devices is practically non-existent. The idea that we
> need access to FW internals is predicated on the assumption that we
> have an ability to make sense of those internals.
> 
> Once you're on a support call with the vendor - just load a custom
> kernel, module, whatever, it's already extremely expensive manual labor.
> 
> > 2. We've faced RDMA issues related to lost EQ doorbells, requiring
> > complex debug, and ultimately root-caused as a defective CPU. Without
> > interactive access to the device allowing us to test theories like,
> > "what if we manually restart the EQ", we could not have proven this
> > definitively.
> 
> I'm not familiar with the RDMA debugging capabilities. Perhaps there
> are some gaps there. The more proprietary the implementation the harder
> it is to debug. An answer to that would be "try to keep as much as
> possible open".. and interfaces which let closed user space talk to
> closed FW take us in the opposite direction.
> 
> FWIW good netdevice drivers have a selftest which tests IRQ generation
> and EQ handling. I think that'd cover the case you're describing?
> IDK if mlx5 has them, but if it doesn't definitely worth adding. And I
> recommend running those on suspicious machines (ethtool -t, devlink has
> some selftests, too)

Essentially, a warning light, and that doesn't solve the underlying
problem. We still need experts (e.g., vendors) to investigate with their
toolsets when and where the problem occurs.

I offered this as an example of one issue we solved. I cannot predict
what kind of issues will pop up in the future, and writing a self-test
for every possible situation is impossible by definition.

> 
> > Firstly, We believe in working upstream and all of the advantages that
> > that brings to all the distros as well as to us and our customers.
> > 
> > Secondly, Our cloud business offers many types of machine instances,
> > some with bare metal/vfio mlx5 devices, that require customer driven
> > debug and we want our customers to have the freedom to choose which OS
> > they want to use.
> 
> I understand that having everything packaged and shipped together makes
> life easier.

I think it is a requirement. We operate with Secure Boot. The kernel is
locked down. We don't have debugfs access, even if it were sufficient,
and we cannot compile and load modules. Even without Secure Boot, there
may not be a build environment available.

We really need the module ready-to-go when the debug calls for it - no
building, no reboots, no months long attempts to reproduce in some lab -
just immediate availability of the debug interface on the affected
machine.

> 
> If the point of the kernel at this stage of its evolution is to collect
> incompatible bits of vendor software, make sure they build cleanly and
> ship them to distros - someone should tell me, and I will relent.

I'm not sure I follow you... The mlx5ctl driver seems very compatible
with the mlx5 device driver. I may be misunderstanding.

Jakub Kicinski Dec. 7, 2023, 5:23 p.m. UTC | #32

On Thu, 7 Dec 2023 10:41:25 -0600 Aron Silverton wrote:
> > I understand that having everything packaged and shipped together makes
> > life easier.  
> 
> I think it is a requirement. We operate with Secure Boot. The kernel is
> locked down. We don't have debugfs access, even if it were sufficient,
> and we cannot compile and load modules. Even without Secure Boot, there
> may not be a build environment available.

This 'no debugfs' requirement is a kernel lockdown thing, I presume?
Are we expected to throw debugfs out the window and for all vendors
to reimplement their debug functionality via a misc driver taking
arbitrary ioctls? Not only does that sound like a complete waste of
time and going backward in terms of quality of the interfaces, needing
custom vendor tools etc. etc., but also you go from (hopefully somewhat)
upstream reviewed debugfs interface to an interface where the only
security assurance is vendor telling you "trust me, it's all good".

Aron Silverton Dec. 7, 2023, 6:06 p.m. UTC | #33

On Thu, Dec 07, 2023 at 09:23:29AM -0800, Jakub Kicinski wrote:
> On Thu, 7 Dec 2023 10:41:25 -0600 Aron Silverton wrote:
> > > I understand that having everything packaged and shipped together makes
> > > life easier.  
> > 
> > I think it is a requirement. We operate with Secure Boot. The kernel is
> > locked down. We don't have debugfs access, even if it were sufficient,
> > and we cannot compile and load modules. Even without Secure Boot, there
> > may not be a build environment available.
> 
> This 'no debugfs' requirement is a kernel lockdown thing, I presume?
> Are we expected to throw debugfs out the window and for all vendors
> to reimplement their debug functionality via a misc driver taking
> arbitrary ioctls? Not only does that sound like a complete waste of
> time and going backward in terms of quality of the interfaces, needing
> custom vendor tools etc. etc., but also you go from (hopefully somewhat)
> upstream reviewed debugfs interface to an interface where the only
> security assurance is vendor telling you "trust me, it's all good".

IIRC, with lockdown, we can read from debugfs IFF the entries'
permissions are 0400. We cannot write. It's not suitable for
implementing an interactive debug interface.

Saeed Mahameed Dec. 7, 2023, 6:54 p.m. UTC | #34

On 07 Dec 10:41, Aron Silverton wrote:
>On Tue, Dec 05, 2023 at 08:48:55PM -0800, Jakub Kicinski wrote:
>> On Tue, 5 Dec 2023 11:11:00 -0600 Aron Silverton wrote:
>> > 1. As mentioned already, we recently faced a complex problem with RDMA
>> > in KVM and were getting nowhere trying to debug using the usual methods.
>> > Mellanox support was able to use this debug interface to see what was
>> > happening on the PCI bus and prove that the issue was caused by
>> > corrupted PCIe transactions. This finally put the investigation on the
>> > correct path. The debug interface was used consistently and extensively
>> > to test theories about what was happening in the system and, ultimately,
>> > allowed the problem to be solved.
>>
>> You hit on an important point, and what is also my experience working
>> at Meta. I may have even mentioned it in this thread already.
>> If there is a serious issue with a complex device, there are two ways
>> you can get support - dump all you can and send the dump to the vendor
>> or get on a live debugging session with their engineers. Users' ability
>> to debug those devices is practically non-existent. The idea that we
>> need access to FW internals is predicated on the assumption that we
>> have an ability to make sense of those internals.
>>
>> Once you're on a support call with the vendor - just load a custom
>> kernel, module, whatever, it's already extremely expensive manual labor.
>>
>> > 2. We've faced RDMA issues related to lost EQ doorbells, requiring
>> > complex debug, and ultimately root-caused as a defective CPU. Without
>> > interactive access to the device allowing us to test theories like,
>> > "what if we manually restart the EQ", we could not have proven this
>> > definitively.
>>
>> I'm not familiar with the RDMA debugging capabilities. Perhaps there
>> are some gaps there. The more proprietary the implementation the harder
>> it is to debug. An answer to that would be "try to keep as much as
>> possible open".. and interfaces which let closed user space talk to
>> closed FW take us in the opposite direction.
>>
>> FWIW good netdevice drivers have a selftest which tests IRQ generation
>> and EQ handling. I think that'd cover the case you're describing?
>> IDK if mlx5 has them, but if it doesn't definitely worth adding. And I
>> recommend running those on suspicious machines (ethtool -t, devlink has
>> some selftests, too)
>
>Essentially, a warning light, and that doesn't solve the underlying
>problem. We still need experts (e.g., vendors) to investigate with their
>toolsets when and where the problem occurs.
>
>I offered this as an example of one issue we solved. I cannot predict
>what kind of issues will pop up in the future, and writing a self-test
>for every possible situation is impossible by definition.
>
>>
>> > Firstly, We believe in working upstream and all of the advantages that
>> > that brings to all the distros as well as to us and our customers.
>> >
>> > Secondly, Our cloud business offers many types of machine instances,
>> > some with bare metal/vfio mlx5 devices, that require customer driven
>> > debug and we want our customers to have the freedom to choose which OS
>> > they want to use.
>>
>> I understand that having everything packaged and shipped together makes
>> life easier.
>
>I think it is a requirement. We operate with Secure Boot. The kernel is
>locked down. We don't have debugfs access, even if it were sufficient,
>and we cannot compile and load modules. Even without Secure Boot, there
>may not be a build environment available.
>
>We really need the module ready-to-go when the debug calls for it - no
>building, no reboots, no months long attempts to reproduce in some lab -
>just immediate availability of the debug interface on the affected
>machine.
>
>>
>> If the point of the kernel at this stage of its evolution is to collect
>> incompatible bits of vendor software, make sure they build cleanly and
>> ship them to distros - someone should tell me, and I will relent.
>
>I'm not sure I follow you... The mlx5ctl driver seems very compatible
>with the mlx5 device driver. I may be misunderstanding.
>

mlx5ctl is 100% compatible with mlx5 ConnectX open spec [1], and supports
any mlx5 driven stacks, not only netdev, it is able to expose millions of
objects and device states interactively, debugfs would explode if we even
try to accommodate some of these objects or states via debugfs, not to
mention it is also impossible to maintain a stable debugfs output for such
a huge data set, when this mlx5ctl interface speaks out a clear and open
ConnectX language, which is the hole point of the driver.

ConnectX is a highly programmable device for the enduser, and we have a
very open / accommodating policy, an advanced user who can read the open
spec [1], will also have the ability to do self-debug of their own
RDMA/DPU/FPGA apps or similar usecases.

Also I would like to repeat, this is not touching netdev, netdev's policies
do not apply to the greater kernel or RDMA, and we have use cases with
pure-infiniband/DPU/FPGA cards that have no netdev at all, or other cases
with pur virtio instances, and much more.

[1] https://network.nvidia.com/files/doc-2020/ethernet-adapters-programming-manual.pdf

Saeed Mahameed Dec. 7, 2023, 7:02 p.m. UTC | #35

On 07 Dec 12:06, Aron Silverton wrote:
>On Thu, Dec 07, 2023 at 09:23:29AM -0800, Jakub Kicinski wrote:
>> On Thu, 7 Dec 2023 10:41:25 -0600 Aron Silverton wrote:
>> > > I understand that having everything packaged and shipped together makes
>> > > life easier.
>> >
>> > I think it is a requirement. We operate with Secure Boot. The kernel is
>> > locked down. We don't have debugfs access, even if it were sufficient,
>> > and we cannot compile and load modules. Even without Secure Boot, there
>> > may not be a build environment available.
>>
>> This 'no debugfs' requirement is a kernel lockdown thing, I presume?
>> Are we expected to throw debugfs out the window and for all vendors
>> to reimplement their debug functionality via a misc driver taking
>> arbitrary ioctls? Not only does that sound like a complete waste of
>> time and going backward in terms of quality of the interfaces, needing
>> custom vendor tools etc. etc., but also you go from (hopefully somewhat)
>> upstream reviewed debugfs interface to an interface where the only
>> security assurance is vendor telling you "trust me, it's all good".
>
>IIRC, with lockdown, we can read from debugfs IFF the entries'
>permissions are 0400. We cannot write. It's not suitable for
>implementing an interactive debug interface.

I would like to add that debugfs is usually used to expose the driver
software states, as it evolves and changes with the driver code, but as I
explained in the other email, it's clearly not a good solution to expose
arbitrary objects of complex devices, that require interactive and
selective debug interfaces tailored to the user use-case.

Greg KH Dec. 8, 2023, 5:27 a.m. UTC | #36

On Thu, Dec 07, 2023 at 12:06:28PM -0600, Aron Silverton wrote:
> On Thu, Dec 07, 2023 at 09:23:29AM -0800, Jakub Kicinski wrote:
> > On Thu, 7 Dec 2023 10:41:25 -0600 Aron Silverton wrote:
> > > > I understand that having everything packaged and shipped together makes
> > > > life easier.  
> > > 
> > > I think it is a requirement. We operate with Secure Boot. The kernel is
> > > locked down. We don't have debugfs access, even if it were sufficient,
> > > and we cannot compile and load modules. Even without Secure Boot, there
> > > may not be a build environment available.
> > 
> > This 'no debugfs' requirement is a kernel lockdown thing, I presume?
> > Are we expected to throw debugfs out the window and for all vendors
> > to reimplement their debug functionality via a misc driver taking
> > arbitrary ioctls? Not only does that sound like a complete waste of
> > time and going backward in terms of quality of the interfaces, needing
> > custom vendor tools etc. etc., but also you go from (hopefully somewhat)
> > upstream reviewed debugfs interface to an interface where the only
> > security assurance is vendor telling you "trust me, it's all good".
> 
> IIRC, with lockdown, we can read from debugfs IFF the entries'
> permissions are 0400. We cannot write. It's not suitable for
> implementing an interactive debug interface.

This is a policy decision that a distro decides to do, and I have seen
it happen many times.  The problem with this is then, as you have found
out, vendors try to work around the removal of access to debugfs by
creating new interfaces like misc drivers and sysfs files to emulate
what they were previously exporting with debugfs.

When they do that, they break the reason why the distro/vendor decided
to prevent access to debugfs in the first place, making the whole system
insecure again!

I see this happen all the time in Android devices as Android restricted
access to debugfs many years ago to try to solve the problem of drivers
doing insecure things there.  Those drivers then just moved those
insecure operations to a different interface, making the system insecure
again.

To stop this cat-and-mouse game, work with the vendors that are
implementing these requirements to shut down access to these interfaces.
That is a policy decision made by them, and does NOT mean that the
kernel community needs to then start taking code to circumvent those
decisions as that makes the whole thing pointless.

So in short, use debugfs, that is what it was designed for.  If a vendor
does not wish to enable access to debugfs, then that is their decision
(probably a good one), don't circumvent it by making a new interface, as
odds are, the vendor will eventually realize it and work to cut off that
new interface as well, as they rightfully should.

thanks,

greg k-h

Greg KH Dec. 8, 2023, 5:29 a.m. UTC | #37

On Thu, Dec 07, 2023 at 11:02:36AM -0800, Saeed Mahameed wrote:
> I would like to add that debugfs is usually used to expose the driver
> software states, as it evolves and changes with the driver code, but as I
> explained in the other email, it's clearly not a good solution to expose
> arbitrary objects of complex devices, that require interactive and
> selective debug interfaces tailored to the user use-case.

Why not?  Remember, the only rule in debugfs is "there are no rules!"

Well, there is one practical one, "do not rely on debugfs for any
functioning system properties", i.e. "if access to debugfs is not
present, or something in debugfs breaks, the kernel should continue to
work just fine with no change in operations".  But that's an overall
system-level rule, not a rule for what you can put into debugfs.

Have fun!

greg k-h

Jason Gunthorpe Dec. 8, 2023, 12:52 p.m. UTC | #38

On Fri, Dec 08, 2023 at 06:27:14AM +0100, Greg Kroah-Hartman wrote:

> When they do that, they break the reason why the distro/vendor decided
> to prevent access to debugfs in the first place, making the whole system
> insecure again!

Maybe others have, be we did not!

In fact Saeed addresses this explicitly in the cover letter:

 Historically a userspace program was used that accessed the PCI register
 and config space directly through /sys/bus/pci/.../XXX and could operate
 these debugging interfaces in parallel with the running driver.
 This approach is incompatible with secure boot and kernel lockdown so this
 driver provides a secure and restricted interface to that.

We did a lot of work here to build an interface that is compatible
with the security principles of lockdown. This work is embodied by the
new FW bit MLX5_UCTX_OBJECT_CAP_TOOLS_RESOURCES which causes the FW to
run these RPCs in a security context that is compatible with lockdown.

The overall philosophical architecture is similar to the NVMe vendor
command channel which is also a way to deliver opaque RPCs to a device
FW and is considered lockdown compatible.

This series is not some insecure attempt to bypass lockdown like you
may have seen in Android. mlx5 is the driver for the most popular high
speed NIC in the world. Our customers take security seriously, and we
take security seriously.

Jason

Jason Gunthorpe Dec. 8, 2023, 1:34 p.m. UTC | #39

On Fri, Dec 08, 2023 at 06:29:29AM +0100, Greg Kroah-Hartman wrote:
> On Thu, Dec 07, 2023 at 11:02:36AM -0800, Saeed Mahameed wrote:
> > I would like to add that debugfs is usually used to expose the driver
> > software states, as it evolves and changes with the driver code, but as I
> > explained in the other email, it's clearly not a good solution to expose
> > arbitrary objects of complex devices, that require interactive and
> > selective debug interfaces tailored to the user use-case.
> 
> Why not?  Remember, the only rule in debugfs is "there are no rules!"

We already have debugfs files to issue RPCs. They are not secure and
not lockdown compatible. Few users have been interested in this, Aron
does a good job explaining the general perspective I've seen in many
places.

Users want an in-tree solution that is compatible with lockdown. A
solution that works for all the mlx5 deployment modes (including
Infiniband native without netdev) and covers most of the functionality
they previously enjoyed with the /sys/../resource based tooling.

This series delivers that.

Nobody has offered an alterative vision that achieves the same
thing. There have been lots of suggestions how to do small little
parts, but not everything together as this does.

> Well, there is one practical one, "do not rely on debugfs for any
> functioning system properties"

Jakub expressed additional "netdev only" rules for debugfs.

  Read-write interfaces must be constrained to a clear set of commands /
  settings

Which I think is what Saeed is reacting to.

Jason

Christoph Hellwig Dec. 13, 2023, 4:55 p.m. UTC | #40

On Thu, Dec 07, 2023 at 10:54:02AM -0800, Saeed Mahameed wrote:
> Also I would like to repeat, this is not touching netdev, netdev's policies
> do not apply to the greater kernel or RDMA, and we have use cases with
> pure-infiniband/DPU/FPGA cards that have no netdev at all, or other cases
> with pur virtio instances, and much more.

Yes.  I mean just about every complex block driver has some kind of
vendor spcific tooling for debugging, statistics, etc.  Trying to deny
it just because one function expose by a device is a network device is
even more silly than disallowing it for pure net devices (which already
tend to be complex beasts).

diff mbox series

Patch

diff --git a/MAINTAINERS b/MAINTAINERS
index 97f51d5ec1cf..4b532df16b19 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13835,6 +13835,14 @@  L:	virtualization@lists.linux-foundation.org
 S:	Supported
 F:	drivers/vdpa/mlx5/
 
+MELLANOX MLX5 ConnectX Diag DRIVER
+M:	Saeed Mahameed <saeedm@nvidia.com>
+R:	Itay Avraham <itayavr@nvidia.com>
+L:	linux-kernel@vger.kernel.org
+S:	Supported
+F:	drivers/misc/mlx5ctl/
+F:	include/uapi/misc/mlx5ctl.h
+
 MELLANOX MLXCPLD I2C AND MUX DRIVER
 M:	Vadim Pasternak <vadimp@nvidia.com>
 M:	Michael Shych <michaelsh@nvidia.com>
diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index f37c4b8380ae..c0e4823648ed 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -579,4 +579,5 @@  source "drivers/misc/cardreader/Kconfig"
 source "drivers/misc/uacce/Kconfig"
 source "drivers/misc/pvpanic/Kconfig"
 source "drivers/misc/mchp_pci1xxxx/Kconfig"
+source "drivers/misc/mlx5ctl/Kconfig"
 endmenu
diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
index f2a4d1ff65d4..49bc4697f498 100644
--- a/drivers/misc/Makefile
+++ b/drivers/misc/Makefile
@@ -67,3 +67,4 @@  obj-$(CONFIG_TMR_MANAGER)      += xilinx_tmr_manager.o
 obj-$(CONFIG_TMR_INJECT)	+= xilinx_tmr_inject.o
 obj-$(CONFIG_TPS6594_ESM)	+= tps6594-esm.o
 obj-$(CONFIG_TPS6594_PFSM)	+= tps6594-pfsm.o
+obj-$(CONFIG_MLX5CTL)		+= mlx5ctl/
diff --git a/drivers/misc/mlx5ctl/Kconfig b/drivers/misc/mlx5ctl/Kconfig
new file mode 100644
index 000000000000..faaa1dba2cc2
--- /dev/null
+++ b/drivers/misc/mlx5ctl/Kconfig
@@ -0,0 +1,14 @@ 
+# SPDX-License-Identifier: GPL-2.0
+#
+
+config MLX5CTL
+	tristate "mlx5 ConnectX control misc driver"
+	depends on MLX5_CORE
+	help
+	  MLX5CTL provides interface for the user process to access the debug and
+          configuration registers of the ConnectX hardware family
+          (NICs, PCI switches and SmartNIC SoCs).
+          This will allow configuration and debug tools to work out of the box on
+          mainstream kernel.
+
+	  If you don't know what to do here, say N.
diff --git a/drivers/misc/mlx5ctl/Makefile b/drivers/misc/mlx5ctl/Makefile
new file mode 100644
index 000000000000..b5c7f99e0ab6
--- /dev/null
+++ b/drivers/misc/mlx5ctl/Makefile
@@ -0,0 +1,4 @@ 
+# SPDX-License-Identifier: GPL-2.0
+
+obj-$(CONFIG_MLX5CTL) += mlx5ctl.o
+mlx5ctl-y := main.o
diff --git a/drivers/misc/mlx5ctl/main.c b/drivers/misc/mlx5ctl/main.c
new file mode 100644
index 000000000000..8eb150461b80
--- /dev/null
+++ b/drivers/misc/mlx5ctl/main.c
@@ -0,0 +1,314 @@ 
+// SPDX-License-Identifier: BSD-3-Clause OR GPL-2.0
+/* Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. */
+
+#include <linux/miscdevice.h>
+#include <linux/fs.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/auxiliary_bus.h>
+#include <linux/mlx5/device.h>
+#include <linux/mlx5/driver.h>
+#include <linux/atomic.h>
+#include <linux/refcount.h>
+
+MODULE_DESCRIPTION("mlx5 ConnectX control misc driver");
+MODULE_AUTHOR("Saeed Mahameed <saeedm@nvidia.com>");
+MODULE_LICENSE("Dual BSD/GPL");
+
+struct mlx5ctl_dev {
+	struct mlx5_core_dev *mdev;
+	struct miscdevice miscdev;
+	struct auxiliary_device *adev;
+	struct list_head fd_list;
+	spinlock_t fd_list_lock; /* protect list add/del */
+	struct rw_semaphore rw_lock;
+	struct kref refcount;
+};
+
+struct mlx5ctl_fd {
+	u16 uctx_uid;
+	u32 uctx_cap;
+	u32 ucap; /* user cap */
+	struct mlx5ctl_dev *mcdev;
+	struct list_head list;
+};
+
+#define mlx5ctl_err(mcdev, format, ...) \
+	dev_err(mcdev->miscdev.parent, format, ##__VA_ARGS__)
+
+#define mlx5ctl_dbg(mcdev, format, ...) \
+	dev_dbg(mcdev->miscdev.parent, "PID %d: " format, \
+		current->pid, ##__VA_ARGS__)
+
+enum {
+	MLX5_UCTX_OBJECT_CAP_RAW_TX                     = 0x1,
+	MLX5_UCTX_OBJECT_CAP_INTERNAL_DEVICE_RESOURCES  = 0x2,
+	MLX5_UCTX_OBJECT_CAP_TOOLS_RESOURCES            = 0x4,
+};
+
+static int mlx5ctl_alloc_uid(struct mlx5ctl_dev *mcdev, u32 cap)
+{
+	u32 out[MLX5_ST_SZ_DW(create_uctx_out)] = {};
+	u32 in[MLX5_ST_SZ_DW(create_uctx_in)] = {};
+	void *uctx;
+	int err;
+	u16 uid;
+
+	uctx = MLX5_ADDR_OF(create_uctx_in, in, uctx);
+
+	mlx5ctl_dbg(mcdev, "MLX5_CMD_OP_CREATE_UCTX: caps 0x%x\n", cap);
+	MLX5_SET(create_uctx_in, in, opcode, MLX5_CMD_OP_CREATE_UCTX);
+	MLX5_SET(uctx, uctx, cap, cap);
+
+	err = mlx5_cmd_exec(mcdev->mdev, in, sizeof(in), out, sizeof(out));
+	if (err)
+		return err;
+
+	uid = MLX5_GET(create_uctx_out, out, uid);
+	mlx5ctl_dbg(mcdev, "allocated uid %d with caps 0x%x\n", uid, cap);
+	return uid;
+}
+
+static void mlx5ctl_release_uid(struct mlx5ctl_dev *mcdev, u16 uid)
+{
+	u32 in[MLX5_ST_SZ_DW(destroy_uctx_in)] = {};
+	struct mlx5_core_dev *mdev = mcdev->mdev;
+	int err;
+
+	MLX5_SET(destroy_uctx_in, in, opcode, MLX5_CMD_OP_DESTROY_UCTX);
+	MLX5_SET(destroy_uctx_in, in, uid, uid);
+
+	err = mlx5_cmd_exec_in(mdev, destroy_uctx, in);
+	mlx5ctl_dbg(mcdev, "released uid %d err(%d)\n", uid, err);
+}
+
+static void mcdev_get(struct mlx5ctl_dev *mcdev);
+static void mcdev_put(struct mlx5ctl_dev *mcdev);
+
+static int mlx5ctl_open_mfd(struct mlx5ctl_fd *mfd)
+{
+	struct mlx5_core_dev *mdev = mfd->mcdev->mdev;
+	struct mlx5ctl_dev *mcdev = mfd->mcdev;
+	u32 ucap = 0, cap = 0;
+	int uid;
+
+#define MLX5_UCTX_CAP(mdev, cap) \
+	(MLX5_CAP_GEN(mdev, uctx_cap) & MLX5_UCTX_OBJECT_CAP_##cap)
+
+	if (capable(CAP_NET_RAW) && MLX5_UCTX_CAP(mdev, RAW_TX)) {
+		ucap |= CAP_NET_RAW;
+		cap |= MLX5_UCTX_OBJECT_CAP_RAW_TX;
+	}
+
+	if (capable(CAP_SYS_RAWIO) && MLX5_UCTX_CAP(mdev, INTERNAL_DEVICE_RESOURCES)) {
+		ucap |= CAP_SYS_RAWIO;
+		cap |= MLX5_UCTX_OBJECT_CAP_INTERNAL_DEVICE_RESOURCES;
+	}
+
+	if (capable(CAP_SYS_ADMIN) && MLX5_UCTX_CAP(mdev, TOOLS_RESOURCES)) {
+		ucap |= CAP_SYS_ADMIN;
+		cap |= MLX5_UCTX_OBJECT_CAP_TOOLS_RESOURCES;
+	}
+
+	uid = mlx5ctl_alloc_uid(mcdev, cap);
+	if (uid < 0)
+		return uid;
+
+	mfd->uctx_uid = uid;
+	mfd->uctx_cap = cap;
+	mfd->ucap = ucap;
+	mfd->mcdev = mcdev;
+
+	mlx5ctl_dbg(mcdev, "allocated uid %d with uctx caps 0x%x, user cap 0x%x\n",
+		    uid, cap, ucap);
+	return 0;
+}
+
+static void mlx5ctl_release_mfd(struct mlx5ctl_fd *mfd)
+{
+	struct mlx5ctl_dev *mcdev = mfd->mcdev;
+
+	mlx5ctl_release_uid(mcdev,  mfd->uctx_uid);
+}
+
+static int mlx5ctl_open(struct inode *inode, struct file *file)
+{
+	struct mlx5_core_dev *mdev;
+	struct mlx5ctl_dev *mcdev;
+	struct mlx5ctl_fd *mfd;
+	int err = 0;
+
+	mcdev = container_of(file->private_data, struct mlx5ctl_dev, miscdev);
+	mcdev_get(mcdev);
+	down_read(&mcdev->rw_lock);
+	mdev = mcdev->mdev;
+	if (!mdev) {
+		err = -ENODEV;
+		goto unlock;
+	}
+
+	mfd = kzalloc(sizeof(*mfd), GFP_KERNEL_ACCOUNT);
+	if (!mfd)
+		return -ENOMEM;
+
+	mfd->mcdev = mcdev;
+	err = mlx5ctl_open_mfd(mfd);
+	if (err)
+		goto unlock;
+
+	spin_lock(&mcdev->fd_list_lock);
+	list_add_tail(&mfd->list, &mcdev->fd_list);
+	spin_unlock(&mcdev->fd_list_lock);
+
+	file->private_data = mfd;
+
+unlock:
+	up_read(&mcdev->rw_lock);
+	if (err) {
+		mcdev_put(mcdev);
+		kfree(mfd);
+	}
+	return err;
+}
+
+static int mlx5ctl_release(struct inode *inode, struct file *file)
+{
+	struct mlx5ctl_fd *mfd = file->private_data;
+	struct mlx5ctl_dev *mcdev = mfd->mcdev;
+
+	down_read(&mcdev->rw_lock);
+	if (!mcdev->mdev) {
+		pr_debug("[%d] UID %d mlx5ctl: mdev is already released\n",
+			 current->pid, mfd->uctx_uid);
+		/* All mfds are already released, skip ... */
+		goto unlock;
+	}
+
+	spin_lock(&mcdev->fd_list_lock);
+	list_del(&mfd->list);
+	spin_unlock(&mcdev->fd_list_lock);
+
+	mlx5ctl_release_mfd(mfd);
+
+unlock:
+	kfree(mfd);
+	up_read(&mcdev->rw_lock);
+	mcdev_put(mcdev);
+	file->private_data = NULL;
+	return 0;
+}
+
+static const struct file_operations mlx5ctl_fops = {
+	.owner = THIS_MODULE,
+	.open = mlx5ctl_open,
+	.release = mlx5ctl_release,
+};
+
+static int mlx5ctl_probe(struct auxiliary_device *adev,
+			 const struct auxiliary_device_id *id)
+
+{
+	struct mlx5_adev *madev = container_of(adev, struct mlx5_adev, adev);
+	struct mlx5_core_dev *mdev = madev->mdev;
+	struct mlx5ctl_dev *mcdev;
+	char *devname = NULL;
+	int err;
+
+	mcdev = kzalloc(sizeof(*mcdev), GFP_KERNEL_ACCOUNT);
+	if (!mcdev)
+		return -ENOMEM;
+
+	kref_init(&mcdev->refcount);
+	INIT_LIST_HEAD(&mcdev->fd_list);
+	spin_lock_init(&mcdev->fd_list_lock);
+	init_rwsem(&mcdev->rw_lock);
+	mcdev->mdev = mdev;
+	mcdev->adev = adev;
+	devname = kasprintf(GFP_KERNEL_ACCOUNT, "mlx5ctl-%s",
+			    dev_name(&adev->dev));
+	if (!devname) {
+		err = -ENOMEM;
+		goto abort;
+	}
+
+	mcdev->miscdev = (struct miscdevice) {
+		.minor = MISC_DYNAMIC_MINOR,
+		.name = devname,
+		.fops = &mlx5ctl_fops,
+		.parent = &adev->dev,
+	};
+
+	err = misc_register(&mcdev->miscdev);
+	if (err) {
+		mlx5ctl_err(mcdev, "mlx5ctl: failed to register misc device err %d\n", err);
+		goto abort;
+	}
+
+	mlx5ctl_dbg(mcdev, "probe mdev@%s %s\n", dev_driver_string(mdev->device), dev_name(mdev->device));
+
+	auxiliary_set_drvdata(adev, mcdev);
+
+	return 0;
+
+abort:
+	kfree(devname);
+	kfree(mcdev);
+	return err;
+}
+
+static void mlx5ctl_remove(struct auxiliary_device *adev)
+{
+	struct mlx5ctl_dev *mcdev = auxiliary_get_drvdata(adev);
+	struct mlx5_core_dev *mdev = mcdev->mdev;
+	struct mlx5ctl_fd *mfd, *n;
+
+	misc_deregister(&mcdev->miscdev);
+	down_write(&mcdev->rw_lock);
+
+	list_for_each_entry_safe(mfd, n, &mcdev->fd_list, list) {
+		mlx5ctl_dbg(mcdev, "UID %d still has open FDs\n", mfd->uctx_uid);
+		list_del(&mfd->list);
+		mlx5ctl_release_mfd(mfd);
+	}
+
+	mlx5ctl_dbg(mcdev, "removed mdev %s %s\n",
+		    dev_driver_string(mdev->device), dev_name(mdev->device));
+
+	mcdev->mdev = NULL; /* prevent already open fds from accessing the device */
+	up_write(&mcdev->rw_lock);
+	mcdev_put(mcdev);
+}
+
+static void mcdev_free(struct kref *ref)
+{
+	struct mlx5ctl_dev *mcdev = container_of(ref, struct mlx5ctl_dev, refcount);
+
+	kfree(mcdev->miscdev.name);
+	kfree(mcdev);
+}
+
+static void mcdev_get(struct mlx5ctl_dev *mcdev)
+{
+	kref_get(&mcdev->refcount);
+}
+
+static void mcdev_put(struct mlx5ctl_dev *mcdev)
+{
+	kref_put(&mcdev->refcount, mcdev_free);
+}
+
+static const struct auxiliary_device_id mlx5ctl_id_table[] = {
+	{ .name = MLX5_ADEV_NAME ".ctl", },
+	{},
+};
+
+MODULE_DEVICE_TABLE(auxiliary, mlx5ctl_id_table);
+
+static struct auxiliary_driver mlx5ctl_driver = {
+	.name = "ctl",
+	.probe = mlx5ctl_probe,
+	.remove = mlx5ctl_remove,
+	.id_table = mlx5ctl_id_table,
+};
+
+module_auxiliary_driver(mlx5ctl_driver);