[V4,0/5] mlx5 ConnectX control misc driver

Message ID: 20240207072435.14182-1-saeed@kernel.org
Series: mlx5 ConnectX control misc driver
Saeed Mahameed Feb. 7, 2024, 7:24 a.m. UTC
  From: Saeed Mahameed <saeedm@nvidia.com>

Recap from V3 discussion:
=========================

LWN has published an article on this series aptly summarizing the debate.
LINK: https://lwn.net/Articles/955001/

We continue to think that mlx5ctl is reasonable and aligned with the
greater kernel community values. People have pointed to the HW RAID
miscdevices as a good analog. The MD developers did not get to block HW
RAID configuration on the basis that it undermines their work on the
software RAID stack. Further, while there is a superficial similarity to
the DRM/accel debate, that was grounded in a real concern that DRM values
on open source would be bypassed. That argument does not hold up here as
this does come with open source userspace and the functionality mlx5ctl
enables on lockdown has always been available to ConnectX users through
the non-lockdown PCI sysfs. netdev has been doing just fine despite the
long standing presence of this tooling and we have continued to work with
Jakub on building common APIs when appropriate. mlx5 already implements
a wide range of the netdev common interfaces, many of which were pushed
forward by our staff - the DPLL configuration netlink being a recent
example.

Version history:
================
V3: https://lore.kernel.org/all/20231121070619.9836-1-saeed@kernel.org/#r
V2: https://lore.kernel.org/all/20231119092450.164996-1-saeed@kernel.org/#r
V1: https://lore.kernel.org/all/20231018081941.475277-1-saeed@kernel.org/#r

V3->V4:
 - Document locking scheme for device lifecycle
 - Document reserved bits will always be checked for 0 by driver
 - Use GFP_KERNEL instead of ACCOUNT for short lived buffers
 - Create sysfs link to parent device under the misc device's sysfs
 - Remove unnecessary device name from info ioctl output
 - Remove reserved and future flags fields from ioctls
 - Precise size checking for ioctl user input.

V2->V3:
  - Fix bad Sign-off line.
  - Fix kernel robot warnings, define a user ptr arg for umem_unreg ioctl
    instead of plain integer to simplify compat_ioctl usage.

V1->V2:
  - Provide legal statement and sign-off for dual license use
  - Fix License clause to use: BSD-3-Clause OR GPL-2.0
  - Fix kernel robot warnings
  - Use dev_dbg directly instead of umem_dbg() local wrapper
  - Implement .compat_ioctl for 32bit compatibility
  - Fix mlx5ctl_info ABI structure size and alignment
  - Local pointer to correct type instead of in-place cast
  - Check unused fields and flags are 0 on ioctl path
  - Use correct macro to declare scalar arg ioctl command
    #define MLX5CTL_IOCTL_UMEM_UNREG \
           _IO(MLX5CTL_IOCTL_MAGIC, 0x3)

mlx5 ConnectX control misc driver:
==================================

The ConnectX HW family supported by the mlx5 drivers uses an architecture
where a FW component executes "mailbox RPCs" issued by the driver to make
changes to the device. This results in a complex debugging environment
where the FW component has information and low level configuration that
needs to be exposed to userspace for debugging purposes.

Historically a userspace program was used that accessed the PCI register
and config space directly through /sys/bus/pci/.../XXX and could operate
these debugging interfaces in parallel with the running driver.
This approach is incompatible with secure boot and kernel lockdown, so this
driver provides a secure and restricted alternative.

Patch breakdown:
================

1) The first patch in the series introduces the main driver file with the
implementation of a new mlx5 auxiliary device driver that runs on top of
mlx5_core device instances. On probe it creates a new misc device, and in
this patch we implement the open and release fops: on open the driver
allocates a special FW UID (user context ID) restricted to debug RPCs
only, all user debug RPCs are executed under this UID, and on release
the UID is freed.

2) The second patch adds an info ioctl that reports the allocated UID, the
capability masks available to the device and to the current UID, and some
other useful device information such as the underlying ConnectX device's
PCI address.

Example:
    $ sudo ./mlx5ctlu mlx5_core.ctl.0
    mlx5dev: 0000:00:04.0
    UCTX UID: 1
    UCTX CAP: 0x3
    DEV UCTX CAP: 0x3
    USER CAP: 0x1d

3) The third patch adds the capability to execute debug RPCs under the
special UID.

In the mlx5 architecture, FW RPC commands take the form of inbox and
outbox buffers. The inbox buffer contains the command rpc layout as
described in the ConnectX Programmers Reference Manual (PRM) and as
defined in linux/include/mlx5/mlx5_ifc.h.

On success the user outbox buffer will be filled with the device's rpc
response.

For example to query device capabilities:
a user fills out an inbox buffer with the inbox layout:
    struct mlx5_ifc_query_hca_cap_in_bits
and expects an outbox buffer with the layout:
     struct mlx5_ifc_cmd_hca_cap_bits

4) The fourth patch adds the ability to register user memory with the
ConnectX device and create a umem object that points to that memory.

The command rpc outbox buffer is limited in size, which is very
restrictive when trying to pull large traces out of the device.
Many rpcs offer the ability to scatter output traces, contexts
and logs directly into user space buffers in a single shot.

The registered memory will be described by a device UMEM object which
has a unique umem_id; this umem_id can later be used in an rpc inbox
to tell the device where to populate the response output,
e.g. HW traces and other debug object queries.

Example usecase: a ConnectX device coredump can be as large as 2MB.
Using inline rpcs it would take thousands of rpcs to get the full
coredump, which can consume multiple seconds.

With UMEM, it can be done in a single rpc, using 2MB of umem user buffer.

Other usecases with umem:
  - dynamic HW and FW trace monitoring
  - high frequency diagnostic counters sampling
  - batched objects and resource dumps

See links below for information about user space tools that use this
interface:

[1] https://github.com/saeedtx/mlx5ctl

[2] https://github.com/Mellanox/mstflint
see:

    d) mstregdump utility
      This utility dumps hardware registers from Mellanox hardware
      for later analysis by Mellanox.

    g) mstconfig
      This tool sets or queries non-volatile configurable options
      for Mellanox HCAs.

    h) mstfwmanager
      Mellanox firmware update and query utility which scans the system
      for available Mellanox devices (only mst PCI devices) and performs
      the necessary firmware updates.

    i) mstreg
      The mlxreg utility allows users to obtain information regarding
      supported access registers, such as their fields.

License: BSD-3-Clause OR GPL-2.0
================================
After a review of this thread [3], and a conversation with the LF,
Mellanox and NVIDIA legal continue to approve the use of a Dual GPL &
Permissive License for mlx5 related driver contributions. This makes it
clear to future contributors that this file may be adapted and reused
under BSD-3-Clause terms on other operating systems. Contributions will
be handled in the normal way and the dual license will apply
automatically. If people wish to contribute significantly and opt out of
a dual license they may separate their GPL only contributions in dedicated
files.

Jason has a signing authority for NVIDIA and has gone through our internal
process to get approval.

[3] https://lore.kernel.org/all/20231018081941.475277-3-saeed@kernel.org/#r

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> # for legal
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Nacked-by: Jakub Kicinski <kuba@kernel.org>

Saeed Mahameed (5):
  mlx5: Add aux dev for ctl interface
  misc: mlx5ctl: Add mlx5ctl misc driver
  misc: mlx5ctl: Add info ioctl
  misc: mlx5ctl: Add command rpc ioctl
  misc: mlx5ctl: Add umem reg/unreg ioctl

 .../userspace-api/ioctl/ioctl-number.rst      |   1 +
 MAINTAINERS                                   |   8 +
 drivers/misc/Kconfig                          |   1 +
 drivers/misc/Makefile                         |   1 +
 drivers/misc/mlx5ctl/Kconfig                  |  14 +
 drivers/misc/mlx5ctl/Makefile                 |   5 +
 drivers/misc/mlx5ctl/main.c                   | 597 ++++++++++++++++++
 drivers/misc/mlx5ctl/umem.c                   | 322 ++++++++++
 drivers/misc/mlx5ctl/umem.h                   |  17 +
 drivers/net/ethernet/mellanox/mlx5/core/dev.c |   8 +
 include/uapi/misc/mlx5ctl.h                   |  50 ++
 11 files changed, 1024 insertions(+)
 create mode 100644 drivers/misc/mlx5ctl/Kconfig
 create mode 100644 drivers/misc/mlx5ctl/Makefile
 create mode 100644 drivers/misc/mlx5ctl/main.c
 create mode 100644 drivers/misc/mlx5ctl/umem.c
 create mode 100644 drivers/misc/mlx5ctl/umem.h
 create mode 100644 include/uapi/misc/mlx5ctl.h

--
2.43.0
  

Comments

Jakub Kicinski Feb. 7, 2024, 3:03 p.m. UTC | #1
On Tue,  6 Feb 2024 23:24:30 -0800 Saeed Mahameed wrote:
> From: Saeed Mahameed <saeedm@nvidia.com>
> 
> Recap from V3 discussion:
> =========================
> 
> LWN has published an article on this series aptly summarizing the debate.
> LINK: https://lwn.net/Articles/955001/
> 
> We continue to think that mlx5ctl is reasonable and aligned with the
> greater kernel community values. People have pointed to the HW RAID
> miscdevices as a good analog. The MD developers did not get to block HW
> RAID configuration on the basis that it undermines their work on the
> software RAID stack. Further, while there is a superficial similarity to
> the DRM/accel debate, that was grounded in a real concern that DRM values
> on open source would be bypassed. That argument does not hold up here as
> this does come with open source userspace and the functionality mlx5ctl
> enables on lockdown has always been available to ConnectX users through
> the non-lockdown PCI sysfs. netdev has been doing just fine despite the
> long standing presence of this tooling and we have continued to work with
> Jakub on building common APIs when appropriate. mlx5 already implements
> a wide range of the netdev common interfaces, many of which were pushed
> forward by our staff - the DPLL configuration netlink being a recent
> example.

I appreciate Jiri's contributions (and you hired Maciej off of Intel at
some point) but don't make it sound like nVidia led DPLL, or pushed for
a common interface :| Intel posted SyncE support. I asked them to make it
a standalone API family:

https://lore.kernel.org/netdev/20210830162909.110753ec@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/

Vadim from Meta joined in and helped a lot based on the OCP time card.
Then after some delaying and weird noises y'all started participating.

> mlx5 ConnectX control misc driver:
> ==================================
> 
> The ConnectX HW family supported by the mlx5 drivers uses an architecture
> where a FW component executes "mailbox RPCs" issued by the driver to make
> changes to the device. This results in a complex debugging environment
> where the FW component has information and low level configuration that
> needs to be exposed to userspace for debugging purposes.

You don't explain anywhere why addressing the challenges of using
debugfs in secure environments isn't the way to go.

Also you keep saying debugging purposes but the driver is called
"control misc driver", you need to iron out your narrative just 
a bit more.

> Historically a userspace program was used that accessed the PCI register
> and config space directly through /sys/bus/pci/.../XXX and could operate
> these debugging interfaces in parallel with the running driver.
> This approach is incompatible with secure boot and kernel lockdown, so this
> driver provides a secure and restricted alternative.

[snip]

>     i) mstreg
>       The mlxreg utility allows users to obtain information regarding
>       supported access registers, such as their fields

So the access mstreg gives over this interface is read only? That's
what your description sounds like, but given your complaints about
"inability to add knobs" and "control" in the name of the driver that
must be false.

> Other usecases with umem:
>   - dynamic HW and FW trace monitoring
>   - high frequency diagnostic counters sampling

Ah yes, the high frequency counters. Something that is definitely
impossible to implement in a generic way. You were literally in the
room at netconf when David Ahern described his proposal for this.

Anyway, I don't want to waste any more time arguing with you.
My opinion is that the kernel community is under no obligation to carry
your proprietary gateway interfaces. I may be wrong, but I'm entitled
to my opinion.

Please do me the basic courtesy of carrying my nack on these patches:

Nacked-by: Jakub Kicinski <kuba@kernel.org>

and CC netdev, so I don't have to respond again :|
  
Saeed Mahameed Feb. 8, 2024, 5:03 a.m. UTC | #2
On 07 Feb 07:03, Jakub Kicinski wrote:
>On Tue,  6 Feb 2024 23:24:30 -0800 Saeed Mahameed wrote:
>> From: Saeed Mahameed <saeedm@nvidia.com>
>>
>> Recap from V3 discussion:
>> =========================
>>
>> LWN has published an article on this series aptly summarizing the debate.
>> LINK: https://lwn.net/Articles/955001/
>>
>> We continue to think that mlx5ctl is reasonable and aligned with the
>> greater kernel community values. People have pointed to the HW RAID
>> miscdevices as a good analog. The MD developers did not get to block HW
>> RAID configuration on the basis that it undermines their work on the
>> software RAID stack. Further, while there is a superficial similarity to
>> the DRM/accel debate, that was grounded in a real concern that DRM values
>> on open source would be bypassed. That argument does not hold up here as
>> this does come with open source userspace and the functionality mlx5ctl
>> enables on lockdown has always been available to ConnectX users through
>> the non-lockdown PCI sysfs. netdev has been doing just fine despite the
>> long standing presence of this tooling and we have continued to work with
>> Jakub on building common APIs when appropriate. mlx5 already implements
>> a wide range of the netdev common interfaces, many of which were pushed
>> forward by our staff - the DPLL configuration netlink being a recent
>> example.
>
>I appreciate Jiri's contributions (and you hired Maciej off of Intel at
>some point) but don't make it sound like nVidia led DPLL, or pushed for
>a common interface :| Intel posted SyncE support. I asked them to make it
>a standalone API family:
>

I will let the stats speak for themselves.
$ git shortlog -sne --no-merges net/devlink 
and prior to commit f05bd8ebeb69 devlink: move code to a dedicated directory
$ git shortlog -sne --no-merges net/core/devlink.c

More than 70% of the commits are authored by more than 10 different individuals
from Mellanox/Nvidia.

Ok you don't like DPLL, here is a list of some central devlink features we
pushed into the standard devlink API:

  - subfunction API and devlink infrastructure
  - Shared buffer API
  - port function and rate API
  - shared buffer
  - health 

>https://lore.kernel.org/netdev/20210830162909.110753ec@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/
>
>Vadim from Meta joined in and helped a lot based on the OCP time card.
>Then after some delaying and weird noises y'all started participating.
>

I remember those discussions, and I agree it is very weird when it
takes 3 vendors and 2 years to get a simple devlink API for a single bit
flip accepted.

>> mlx5 ConnectX control misc driver:
>> ==================================
>>
>> The ConnectX HW family supported by the mlx5 drivers uses an architecture
>> where a FW component executes "mailbox RPCs" issued by the driver to make
>> changes to the device. This results in a complex debugging environment
>> where the FW component has information and low level configuration that
>> needs to be exposed to userspace for debugging purposes.
>
>You don't explain anywhere why addressing the challenges of using
>debugfs in secure environments isn't the way to go.
>

I already answered this question in length in v3
here: https://lore.kernel.org/all/ZWZFm2qqhV1wKKCV@x130/

>Also you keep saying debugging purposes but the driver is called
>"control misc driver", you need to iron out your narrative just
>a bit more.
>
>> Historically a userspace program was used that accessed the PCI register
>> and config space directly through /sys/bus/pci/.../XXX and could operate
>> these debugging interfaces in parallel with the running driver.
>> This approach is incompatible with secure boot and kernel lockdown, so this
>> driver provides a secure and restricted alternative.
>
>[snip]
>
>>     i) mstreg
>>       The mlxreg utility allows users to obtain information regarding
>>       supported access registers, such as their fields
>
>So the access mstreg gives over this interface is read only? That's
>what your description sounds like, but given your complaints about
>"inability to add knobs" and "control" in the name of the driver that
>must be false.
>

Yes this is enforced by the mlx5ctl driver and FW using the special
debug uid.

>> Other usecases with umem:
>>   - dynamic HW and FW trace monitoring
>>   - high frequency diagnostic counters sampling
>
>Ah yes, the high frequency counters. Something that is definitely
>impossible to implement in a generic way. You were literally in the
>room at netconf when David Ahern described his proposal for this.
>

I was in the room and I am in support of David's idea, I like it a lot,
but I don't believe we have any concrete proposal, and we don't have any
use case for it in netdev for now, our use case for this is currently RDMA
and HPC specific.

Also similar to devlink we will be the first to jump in and implement
the new API once defined, but this doesn't mean I need to throw away the
whole driver just because one single use case will be implemented in netdev
one day, and I am sure the netdev implementation won't satisfy all the
use-cases of high frequency counters.

Also keep in mind that high frequency counters are a very small part of the
debug and access capabilities the mlx5ctl interface offers.

>Anyway, I don't want to waste any more time arguing with you.
>My opinion is that the kernel community is under no obligation to carry
>your proprietary gateway interfaces. I may be wrong, but I'm entitled
>to my opinion.
>

Thanks, I appreciate your honesty, but I must disagree with your Nack, we
provided enough argument for why we believe this approach is the right
way to go, it is clear from the responses on V3 and from the LWN article
that we have the community support for this open source project.

>Please do me the basic courtesy of carrying my nack on these patches:
>
>Nacked-by: Jakub Kicinski <kuba@kernel.org>
>
>and CC netdev, so I don't have to respond again :|

Ack.

Thanks,
Saeed.
  
Jakub Kicinski Feb. 9, 2024, 2:15 a.m. UTC | #3
On Wed, 7 Feb 2024 21:03:35 -0800 Saeed Mahameed wrote:
> On 07 Feb 07:03, Jakub Kicinski wrote:
> >On Tue,  6 Feb 2024 23:24:30 -0800 Saeed Mahameed wrote:  
> >> From: Saeed Mahameed <saeedm@nvidia.com>
> >>
> >> Recap from V3 discussion:
> >> =========================
> >>
> >> LWN has published an article on this series aptly summarizing the debate.
> >> LINK: https://lwn.net/Articles/955001/
> >>
> >> We continue to think that mlx5ctl is reasonable and aligned with the
> >> greater kernel community values. People have pointed to the HW RAID
> >> miscdevices as a good analog. The MD developers did not get to block HW
> >> RAID configuration on the basis that it undermines their work on the
> >> software RAID stack. Further, while there is a superficial similarity to
> >> the DRM/accel debate, that was grounded in a real concern that DRM values
> >> on open source would be bypassed. That argument does not hold up here as
> >> this does come with open source userspace and the functionality mlx5ctl
> >> enables on lockdown has always been available to ConnectX users through
> >> the non-lockdown PCI sysfs. netdev has been doing just fine despite the
> >> long standing presence of this tooling and we have continued to work with
> >> Jakub on building common APIs when appropriate. mlx5 already implements
> >> a wide range of the netdev common interfaces, many of which were pushed
> >> forward by our staff - the DPLL configuration netlink being a recent
> >> example.  
> >
> >I appreciate Jiri's contributions (and you hired Maciej off of Intel at
> >some point) but don't make it sound like nVidia led DPLL, or pushed for
> >a common interface :| Intel posted SyncE support. I asked them to make it
> >a standalone API family:
>
> I will let the stats speak for themselves.
> $ git shortlog -sne --no-merges net/devlink 
> and prior to commit f05bd8ebeb69 devlink: move code to a dedicated directory
> $ git shortlog -sne --no-merges net/core/devlink.c
> 
> More than 70% of the commits are authored by more than 10 different individuals
> from Mellanox/Nvidia.

I'm not questioning that there are folks at nVidia who prefer to go 
the generic infrastructure route. Jiri and mlxsw team especially.
I do worry that adding a pass thru interface will undermine them,
but that wasn't my main point.

> Ok you don't like DPLL,

I didn't say I dislike DPLL. I think it's a very odd example for
you to pick for nVidia's contribution. My recollection is:

 - Maciej from Intel started developing upstream API for SyncE support
 - I asked him to generalize it to DPLL, he started working on it
 - nVidia expressed interest in creating a common interface, we thought
   it'd be great to align vendors
 - nVidia hired Maciej from Intel, shutting down Intel's progress for a while
 - nVidia went AWoL, long response times, we held meetings to nudge
   you along, no commitments
 - then after months and months Jiri started helping Arkadiusz and Vadim

I remember thinking at the time that it must have been a terrible
experience for Intel, definitely not how cooperation upstream should
look :|

IDK how disconnected from upstream netdev you have to be to put that on
your banner.

> here is a list of some central devlink features we
> pushed into the standard devlink API:
> 
>   - subfunction API and devlink infrastructure
>   - Shared buffer API
>   - port function and rate API
>   - shared buffer
>   - health 

I guess shared buffer was important enough to mention twice? :)

> >https://lore.kernel.org/netdev/20210830162909.110753ec@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/
> >
> >Vadim from Meta joined in and helped a lot based on the OCP time card.
> >Then after some delaying and weird noises y'all started participating.
> >  
> 
> I remember those discussions, and I agree it is very weird when it
> takes 3 vendors and 2 years to get a simple devlink API for a single bit
> flip accepted.

In hindsight it was naive of us to try to wait for nVidia.
It's better to let one vendor blaze the trail and others to follow.

> >>     i) mstreg
> >>       The mlxreg utility allows users to obtain information regarding
> >>       supported access registers, such as their fields  
> >
> >So the access mstreg gives over this interface is read only? That's
> >what your description sounds like, but given your complaints about
> >"inability to add knobs" and "control" in the name of the driver that
> >must be false.
> >  
> 
> Yes this is enforced by the mlx5ctl driver and FW using the special
> debug uid.

So we trust the proprietary FW not to let the proprietary user space
do something out of scope. Got it.

> >> Other usecases with umem:
> >>   - dynamic HW and FW trace monitoring
> >>   - high frequency diagnostic counters sampling  
> >
> >Ah yes, the high frequency counters. Something that is definitely
> >impossible to implement in a generic way. You were literally in the
> >room at netconf when David Ahern described his proposal for this.
> >  
> 
> I was in the room and I am in support of David's idea, I like it a lot,
> but I don't believe we have any concrete proposal, and we don't have any
> use case for it in netdev for now, our use case for this is currently RDMA
> and HPC specific.
> 
> Also similar to devlink we will be the first to jump in and implement
> the new API once defined, but this doesn't mean I need to throw away the

I'm not asking to throw it away. The question is only whether you get
to push it upstream and skirt subsystem rules by posting a "misc" driver
without even CCing the maintainers on v1 :|

> whole driver just because one single use case will be implemented in netdev
> one day, and I am sure the netdev implementation won't satisfy all the
> use-cases of high frequency counters:
> 
> Also keep in mind high frequency counters is a very small part of the debug 
> and access capabilities the mlx5ctl interface offers.
> 
> >Anyway, I don't want to waste any more time arguing with you.
> >My opinion is that the kernel community is under no obligation to carry
> >your proprietary gateway interfaces. I may be wrong, but I'm entitled
> >to my opinion.
> 
> Thanks, I appreciate your honesty, but I must disagree with your Nack, we
> provided enough argument for why we believe this approach is the right
> way to go, it is clear from the responses on V3 and from the LWN article
> that we have the community support for this open source project.

Why don't you repost it to netdev and see how many acks you get?
I'm not the only netdev maintainer.
  
Jiri Pirko Feb. 9, 2024, 6:55 a.m. UTC | #4
Fri, Feb 09, 2024 at 03:15:55AM CET, kuba@kernel.org wrote:
>On Wed, 7 Feb 2024 21:03:35 -0800 Saeed Mahameed wrote:
>> On 07 Feb 07:03, Jakub Kicinski wrote:
>> >On Tue,  6 Feb 2024 23:24:30 -0800 Saeed Mahameed wrote:  
>> >> From: Saeed Mahameed <saeedm@nvidia.com>

[...]

>
>> Ok you don't like DPLL,
>
>I didn't say I dislike DPLL. I think it's a very odd example for
>you to pick for nVidia's contribution. My recollection is:
>
> - Maciej from Intel started developing upstream API for SyncE support
> - I asked him to generalize it to DPLL, he started working on it
> - nVidia expressed interest in creating a common interface, we thought
>   it'd be great to align vendors
> - nVidia hired Maciej from Intel, shutting down Intel's progress for a while
> - nVidia went AWoL, long response times, we held meetings to nudge
>   you along, no commitments
> - then after months and months Jiri started helping Arkadiusz and Vadim
>
>I remember thinking at the time that it must have been a terrible
>experience for Intel, definitely not how cooperation upstream should
>look :|

For the record, I spent a huge amount of time reviewing the patchset and
ended up redesigning significant chunks of it, steering Arkadiusz and
Vadim in the direction I felt was the correct one. Oftentimes I said to
myself it would have been much quicker to take the patchset over and do
it myself :)

Anyway, in the end, I think that the result is very good. Solid and well
defined uapi, nice kernel implementation, 3 drivers implementing it,
each with a slightly different usecase, and it all clicks. If this is not
a good example of upstream cooperation, I'm not sure what is...

But, I don't think this is related to the misc driver discussion, I just
wanted to express my pov on dpll process, when I see people talking
about it :)

>
>IDK how disconnected from upstream netdev you have to be to put that on
>your banner.

[...]
  
David Ahern Feb. 9, 2024, 10:42 p.m. UTC | #5
On 2/8/24 7:15 PM, Jakub Kicinski wrote:
>>> Ah yes, the high frequency counters. Something that is definitely
>>> impossible to implement in a generic way. You were literally in the
>>> room at netconf when David Ahern described his proposal for this.

The key point of that proposal is host memory mapped to userspace where
H/W counters land (either via direct DMA by a H/W push or a
kthread/timer pulling in updates). That is similar to what is proposed here.

>>>  
>>
>> I was in the room and I am in support of David's idea, I like it a lot,
>> but I don't believe we have any concrete proposal, and we don't have any
>> use case for it in netdev for now, our use case for this is currently RDMA
>> and HPC specific.
>>
>> Also similar to devlink we will be the first to jump in and implement
>> the new API once defined, but this doesn't mean I need to throw away the
> 
> I'm not asking to throw it away. The question is only whether you get
> to push it upstream and skirt subsystem rules by posting a "misc" driver
> without even CCing the maintainers on v1 :|

Can you define what you mean by 'skirt subsystem rules'? That's a new
one to me.

BTW, there is already a broadcom driver under drivers/misc that seems to
have a lot of overlap capability wise to this driver. Perhaps a Broadcom
person could chime in.

> 
>> whole driver just because one single use case will be implemented in netdev
>> one day, and I am sure the netdev implementation won't satisfy all the
>> use-cases of high frequency counters:
>>
>> Also keep in mind high frequency counters is a very small part of the debug 
>> and access capabilities the mlx5ctl interface offers.
>>
>>> Anyway, I don't want to waste any more time arguing with you.
>>> My opinion is that the kernel community is under no obligation to carry
>>> your proprietary gateway interfaces. I may be wrong, but I'm entitled
>>> to my opinion.
>>
>> Thanks, I appreciate your honesty, but I must disagree with your Nack, we
>> provided enough argument for why we believe this approach is the right
>> way to go, it is clear from the responses on V3 and from the LWN article
>> that we have the community support for this open source project.
> 
> Why don't you repost it to netdev and see how many acks you get?
> I'm not the only netdev maintainer.

I'll go out on that limb and say I would have no problem ACK'ing the
driver. It's been proven time and time again that these kinds of
debugging facilities are needed for these kinds of complex,
multifunction devices.
  
Jakub Kicinski Feb. 9, 2024, 10:58 p.m. UTC | #6
On Fri, 9 Feb 2024 15:42:16 -0700 David Ahern wrote:
> On 2/8/24 7:15 PM, Jakub Kicinski wrote:
> >> I was in the room and I am in support of David's idea, I like it a lot,
> >> but I don't believe we have any concrete proposal, and we don't have any
> >> use case for it in netdev for now, our use case for this is currently RDMA
> >> and HPC specific.
> >>
> >> Also similar to devlink we will be the first to jump in and implement
> >> the new API once defined, but this doesn't mean I need to throw away the  
> > 
> > I'm not asking to throw it away. The question is only whether you get
> > to push it upstream and skirt subsystem rules by posting a "misc" driver
> > without even CCing the maintainers on v1 :|  
> 
> Can you define what you mean by 'skirt subsystem rules'? That's a new
> one to me.

I mean that Saeed is well aware that direct FW <> user space interfaces
are not allowed in netdev, so he posted this "misc" driver without
CCing us, knowing we'd nack it.

Maybe the baseline question is whether major subsystems are allowed to
set their own rules. I think they should as historically we have a very
broad range of, eh, openness in different fields. Networking is pretty
open because of the necessary interoperability.

> BTW, there is already a broadcom driver under drivers/misc that seems to
> have a lot of overlap capability wise to this driver. Perhaps a Broadcom
> person could chime in.

I'm not aware. Or do you mean bcm-vk? That's a video encoder.

> >> Thanks, I appreciate your honesty, but I must disagree with your Nack, we
> >> provided enough argument for why we believe this approach is the right
> >> way to go, it is clear from the responses on V3 and from the LWN article
> >> that we have the community support for this open source project.  
> > 
> > Why don't you repost it to netdev and see how many acks you get?
> > I'm not the only netdev maintainer.  
> 
> I'll go out on that limb and say I would have no problem ACK'ing the
> driver. It's been proven time and time again that these kinds of
> debugging facilities are needed for these kinds of complex,
> multifunction devices.

Can we have a bare minimum of honesty and stop pretending that this is
primarily about debug :|
  
Jason Gunthorpe Feb. 10, 2024, 1:01 a.m. UTC | #7
On Fri, Feb 09, 2024 at 03:42:16PM -0700, David Ahern wrote:
> On 2/8/24 7:15 PM, Jakub Kicinski wrote:
> >>> Ah yes, the high frequency counters. Something that is definitely
> >>> impossible to implement in a generic way. You were literally in the
> >>> room at netconf when David Ahern described his proposal for this.
> 
> The key point of that proposal is host memory mapped to userspace where
> H/W counters land (either via direct DMA by a H/W push or a
> kthread/timer pulling in updates). That is similar to what is proposed here.

The counter experiment that inspired Saeed to write about it here was
done using mlx5ctl interfaces and some other POC stuff on an RDMA
network monitoring RDMA workloads, inspecting RDMA objects.

So if your proposal also considers how to select RDMA object counters,
control the detailed sampling hardware with RDMA stuff, and works
on a netdev-free InfiniBand network, then it might be interesting.

It was actually interesting research, I hope some information will be
made public.

> BTW, there is already a broadcom driver under drivers/misc that seems to
> have a lot of overlap capability wise to this driver. Perhaps a Broadcom
> person could chime in.

Yeah, there are lots of examples of drivers that use this kind of FW API
direct to userspace. It is a common design pattern across the kernel
in many subsystems. At the core it is following the general philosophy
of pushing things to userspace that don't need to be in the kernel. It
is more secure, more hackable and easier to deploy.

It becomes a userspace decision what kind of tooling community will
develop and what the ultimate user experience will be.

> > Why don't you repost it to netdev and see how many acks you get?
> > I'm not the only netdev maintainer.
> 
> I'll go out on that limb and say I would have no problem ACK'ing the
> driver. It's been proven time and time again that these kinds of
> debugging facilities are needed for these kinds of complex,
> multifunction devices.

Agree as well. Ack for RDMA community. This is perfectly consistent
with the subsystem's existing design of directly exposing the device
to userspace. It is essential as we can't piggyback on any "generic"
netdev stuff on InfiniBand HW. Further, I anticipate most of the
mlx5ctl users would actually be running primarily RDMA related
workloads anyhow.

There are not that many people who buy these expensive cards and don't
use them to their full capability.

Recently at usenix Microsoft shared some details of their production
network in the paper "Empowering Azure Storage with RDMA".

Notably they shared that "Traffic statistics of all Azure public
regions between January 18 and February 16, 2023. Traffic was measured
by collecting switch counters of server-facing ports on all Top of
Rack (ToR) switches. Around 70% of traffic was RDMA."

It is a rare public insight into what is going on in the industry at
large, and why RDMA is a significant and important subsystem.

Jason
  
David Ahern Feb. 10, 2024, 5:01 a.m. UTC | #8
On 2/9/24 3:58 PM, Jakub Kicinski wrote:
> On Fri, 9 Feb 2024 15:42:16 -0700 David Ahern wrote:
>> On 2/8/24 7:15 PM, Jakub Kicinski wrote:
>>>> I was in the room and I am in support of David's idea, I like it a lot,
>>>> but I don't believe we have any concrete proposal, and we don't have any
>>>> use case for it in netdev for now, our use case for this is currently RDMA
>>>> and HPC specific.
>>>>
>>>> Also similar to devlink we will be the first to jump in and implement
>>>> the new API once defined, but this doesn't mean I need to throw away the  
>>>
>>> I'm not asking to throw it away. The question is only whether you get
>>> to push it upstream and skirt subsystem rules by posting a "misc" driver
>>> without even CCing the maintainers on v1 :|  
>>
>> Can you define what you mean by 'skirt subsystem rules'? That's a new
>> one to me.
> 
> I mean that Saeed is well aware that direct FW <> user space interfaces
> are not allowed in netdev, so he posted this "misc" driver without
> CCing us, knowing we'd nack it.

The argument you are making here is that if a device has an ethernet
port, all patches must flow through netdev. Or am I misunderstanding?

There are a lot of devices that toggle between IB and ethernet with only
one (ignore roce here) active at a moment. MLX5 (like many) is split
between drivers/net and drivers/infiniband. If the debugging capabilities
went through drivers/infiniband would you have the same stance?

Maybe my bigger question is: should drivers that can do different
physical layers, or can operate without a netdev, be split into
different drivers? drivers/misc for the PCI interface, drivers/net
for ethernet interfaces and its APIs and drivers/infiniband for IB and
its APIs? Generic capabilities like this debugging interface belong to
the PCI device since it is generic to the device hence drivers/misc.

> 
> Maybe the baseline question is whether major subsystems are allowed to
> set their own rules. I think they should as historically we have a very
> broad range of, eh, openness in different fields. Networking is pretty
> open because of the necessary interoperability.

Interoperability applies solely to networking protocols, not how a
device is designed and certainly not to how it can be debugged. There is
a clear difference between standard networking protocols (packets on a
wire and its headers) and device designs.

> 
>> BTW, there is already a broadcom driver under drivers/misc that seems to
>> have a lot of overlap capability wise to this driver. Perhaps a Broadcom
>> person could chime in.
> 
> I'm not aware. Or do you mean bcm-vk? That's a video encoder.

If that specific piece of S/W is a video encoder, why isn't it under
drivers/video? Scanning the code it seems to me to fall under the open
channel between userspace and F/W which is a common paradigm. But I do
not want this to distract from this patch set; really I was just
browsing existing drivers for any overlap.
  
Greg KH Feb. 11, 2024, 11:03 a.m. UTC | #9
On Fri, Feb 09, 2024 at 10:01:33PM -0700, David Ahern wrote:
> >> BTW, there is already a broadcom driver under drivers/misc that seems to
> >> have a lot of overlap capability wise to this driver. Perhaps a Broadcom
> >> person could chime in.
> > 
> > I'm not aware. Or do you mean bcm-vk? That's a video encoder.
> 
> If that specific piece of S/W is a video encoder, why isn't it under
> drivers/video? Scanning the code it seems to me to fall under the open
> channel between userspace and F/W which is a common paradigm. But I do
> not want this to distract from this patch set; really I was just
> browsing existing drivers for any overlap.

It is an "offload-engine" type of thing that was added before we had the
specific subsystem for it.  It should be moved to drivers/accel/ one of
these days, want to volunteer to help out with that?  :)

thanks,

greg k-h
  
David Ahern Feb. 11, 2024, 4:59 p.m. UTC | #10
On 2/9/24 6:01 PM, Jason Gunthorpe wrote:
> On Fri, Feb 09, 2024 at 03:42:16PM -0700, David Ahern wrote:
>> On 2/8/24 7:15 PM, Jakub Kicinski wrote:
>>>>> Ah yes, the high frequency counters. Something that is definitely
>>>>> impossible to implement in a generic way. You were literally in the
>>>>> room at netconf when David Ahern described his proposal for this.
>>
>> The key point of that proposal is host memory mapped to userspace where
>> H/W counters land (either via direct DMA by a H/W push or a
>> kthread/timer pulling in updates). That is similar to what is proposed here.
> 
> The counter experiment that inspired Saeed to write about it here was
> done using mlx5ctl interfaces and some other POC stuff on an RDMA
> network monitoring RDMA workloads, inspecting RDMA objects.
> 
> So if your proposal also considers how to select RDMA object counters,
> control the detailed sampling hardware with RDMA stuff, and works
> on a netdev-free InfiniBand network, then it might be interesting.

Response at netconf in September was mixed. As I recall Jakub for
example was shaking his head at 'yet another stats proposal', but since
he has referenced it a couple of times now maybe it is worth moving
beyond slides to a POC. The uapi discussed was netlink (genl) based;
driver hooks were not discussed. Perhaps I can get a working POC for
both stacks by netdevconf in July.
  
David Ahern Feb. 11, 2024, 5:01 p.m. UTC | #11
On 2/11/24 4:03 AM, Greg Kroah-Hartman wrote:
> On Fri, Feb 09, 2024 at 10:01:33PM -0700, David Ahern wrote:
>>>> BTW, there is already a broadcom driver under drivers/misc that seems to
>>>> have a lot of overlap capability wise to this driver. Perhaps a Broadcom
>>>> person could chime in.
>>>
>>> I'm not aware. Or do you mean bcm-vk? That's a video encoder.
>>
>> If that specific piece of S/W is a video encoder, why isn't it under
>> drivers/video? Scanning the code it seems to me to fall under the open
>> channel between userspace and F/W which is a common paradigm. But I do
>> not want this to distract from this patch set; really I was just
>> browsing existing drivers for any overlap.
> 
> It is an "offload-engine" type of thing that was added before we had the
> specific subsystem for it.  It should be moved to drivers/accel/ one of
> these days, want to volunteer to help out with that?  :)
> 

Thanks for the background.

As for volunteering to move it, I believe Broadcom has more kernel
engineers than Enfabrica. :-)
  
Christoph Hellwig Feb. 14, 2024, 8:29 a.m. UTC | #12
Chiming in late after a vacation last week, but I really do not
understand the discussion this caused.

Complex devices need diagnostics, and mlx5ctl and the (free software)
userspace tool provide that.  Given how it is a complex multi-subsystem
driver that exports at least RDMA, ethernet, nvme and virtio_blk
interfaces this functionality by definition can't fit into a single
subsystem.

So with my nvme co-maintainer hat on:

Acked-by: Christoph Hellwig <hch@lst.de>

to the concept (which is not a full code review!).

With my busy kernel contributor head on I have to voice my
dissatisfaction with the subsystem maintainer overreach that's causing
the troubles here.  I think all maintainers can and should voice the
opinions, be those technical or political, but trying to block a useful
feature without lots of precedence because it is vaguely related to the
subsystem is not helpful.  Note that this is absolutely not intended to
shut the discussion down - if we can find valid arguments why some of
this functionality should also be reported through a netdev API we
should continue that.  E.g. just because we have SCSI, NVMe or product
specific passthrough interface we still try to provide useful common
interface generically.
  
Jakub Kicinski Feb. 14, 2024, 3:48 p.m. UTC | #13
On Wed, 14 Feb 2024 00:29:16 -0800 Christoph Hellwig wrote:
> With my busy kernel contributor head on I have to voice my
> dissatisfaction with the subsystem maintainer overreach that's causing
> the troubles here. 

Overreach is unfortunate, I'd love to say "please do merge it as part 
of RDMA". You probably don't trust my opinion but Jason admitted himself
this is primarily for RDMA. RDMA is what it is in terms of openness and
all vendors trying to sell their secret magic sauce.

The problem is that some RDMA stuff is built really closely on TCP,
and given Jason's and co. inability to understand that good fences
make good neighbors it will soon start getting into the netdev stack :|

Ah, and I presume they may also want it for their DOCA products. 
So 80% RDMA, 15% DOCA, 5% the rest is my guess.

> I think all maintainers can and should voice the
> opinions, be those technical or political, but trying to block a useful
> feature without lots of precedence because it is vaguely related to the
> subsystem is not helpful. 

Not sure what you mean by "without lots of precedence" but you can ask
around netdev. We have nacked such interfaces multiple times.
The best proof that the rule exists and is well established is that Saeed
himself has asked us a number of times to lift it.

What should be expected of us is fairness and not engaging in politics.
We have a clear rule against opaque user space to FW interfaces,
and I don't see how we could enforce that fairly for pure Ethernet
devices if big vendors get to do whatever they want.

> Note that this is absolutely not intended to
> shut the discussion down - if we can find valid arguments why some of
> this functionality should also be reported through a netdev API we
> should continue that.

Once again, netdev's requirement for debug is: you can do whatever you
want, but other than knobs clearly defined in the code, the interface
must be read-only.
  
Andy Gospodarek Feb. 14, 2024, 4:17 p.m. UTC | #14
On Wed, Feb 14, 2024 at 12:29:16AM -0800, Christoph Hellwig wrote:
> Chiming in late after a vacation last week, but I really do not
> understand the discussion this caused.
> 
> Complex devices need diagnostics, and mlx5ctl and the (free software)
> userspace tool provide that.  Given how it is a complex multi-subsystem
> driver that exports at least RDMA, ethernet, nvme and virtio_blk
> interfaces this functionality by definition can't fit into a single
> subsystem.
> 
> So with my nvme co-maintainer hat on:
> 
> Acked-by: Christoph Hellwig <hch@lst.de>
> 
> to the concept (which is not a full code review!).

I've been trying to figure out how to respond correctly to this over the
last few days, but I share your sentiment.  It's probably time to have
something like this upstream for more devices.  My initial concerns with
something that allows direct access to hardware to send messages to FW
or read/write registers over BARs were:

1.  How someone working at a distro would be able to help/understand if
a tool like this was run and may have programmed their hardware
differently than a default driver or FW.  There now exists a chance for
identical systems running identical drivers and FW to behave differently
due to out-of-band configuration.  One thought I had was some sort of
journal to note that config happened from outside, but I'm not sure
there is much value there.  With the ability to dump regs with
devlink health it's possible to know that values may have changed, so I'm not
concerned about this since that infra exists.

2.  If one can make configuration changes to hardware without kernel
APIs (devlink et al), will people still develop new kernel APIs?  I
think the answer to this is 'yes' as realistically using default tools
is much better than using vendor tools for regular configuration.  Even
if vendors provide shortcuts to program hardware for eval/testing/debug
my experience is that these are not acceptable long-term.  Requests are always
made to include this type of change in future releases.  So I'm not too
concerned about the ossification of kernel APIs due to this being included.

So if there is general agreement that this is acceptable (especially
compared to other out-of-tree drivers), I think a few who find this
useful should sync on the best way forward; I'm not sure a separate
driver for each vendor is the right approach.

If upstream (and therefore distros) are going to accept this we probably
owe it to them to not have misc drivers for every different flavor of
hardware out there when it might be possible to add a generic driver
that can connect to a PCI device via a new (auxiliary bus?) API.
  
Jason Gunthorpe Feb. 14, 2024, 5:57 p.m. UTC | #15
On Wed, Feb 14, 2024 at 11:17:58AM -0500, Andy Gospodarek wrote:

> 1.  How someone working at a distro would be able to help/understand if
> a tool like this was run and may have programmed their hardware
> differently than a default driver or FW.

This is stepping a bit further ahead of the debug focused interfaces
mlx5ctl currently has and into configuration..

Obviously a generic FW RPC can do configuration too.

We do have a lot of industry experience with configuration already.

Realistically, every complex device already has on-device FLASH
holding FW configuration. Formalizing what already exists and is
widely used isn't going to make the world worse.

I'm also very certain there are actual problems with FW configuration
incompatibility. eg there are PCI related configurables on mlx5 you
can set that will break everything if you are not The Special User
that the configuration was created for.

Today everyone already deals with FW version specific behavioral
differences and bugs.

FW version, and FW version plus config, is also the current state of
affairs, and is dealt with in about the same way, IMHO.

> due to out-of-band configuration.  One thought I had was some sort of
> journal to note that config happened from outside, but I'm not sure
> there is much value there.  

I think devices using this kind of approach need to be well behaved -
like no wild (production) access to random bits of HW. Things need to
be structured and reportable.

There is a clear split in my mind between:
 - inspection debugging
 - invasive mutating debugging
 - configuration

And maybe "invasive mutating debugging" taints the kernel or something
like that.

> With the ability to dump regs with devlink health it's possible to
> know that values may have changed, so I'm not concerned about this
> since that infra exists.

Distros need good tooling to report the state of HW in distro problem
reports. If something is lacking we should improve on it.

TBH I'd pick "report current status" over "journaling" because devices
have FLASH and configuration changes can typically be permanent across
reboots.

Also, I would generally frown on designs changing OS-visible behavior
without running the device through an FLR (Function Level Reset).

> 2.  If one can make configuration changes to hardware without kernel
> APIs (devlink et al), will people still develop new kernel APIs?  

Sure? Why not? Every vendor has existing tooling to do this
configuration stuff. mlx5's existing tooling runs in userspace and
uses /sys/../resource. *Everyone* has been using this for the last 15
years. So whatever impact it has is long since baked in.

> I think the answer to this is 'yes' as realistically using default tools
> is much better than using vendor tools for regular configuration.  

Personally I think the answer is more nuanced. Users have problems
they want to solve. If your problem is "provision my HW from the
factory" you may be perfectly happy with a shared userspace program
that can do that for all the HW vendors at the site. No artificial
kernel enforced commonality required.

IMHO as kernel people we often look at the hammer we have and fall
into these patterns where "only the kernel can do abstractions!" but
it isn't true, we can very effectively make abstractions in userspace
too.

> if vendors provide shortcuts to program hardware for eval/testing/debug
> my experience is that these are not acceptable long-term.  Requests are always
> made to include this type of changes in future releases.  So I'm not too
> concerned about the ossification of kernel APIs due to this being included.

Nor am I. Users will ask for the things that work best for them and
vendors have been historically good at responding.

I also look at RDMA, which is deeply down this path of not doing
common interfaces in the kernel. We have a lot of robust open source
now! The common interfaces that the userspace community developed are
wildly different and much more useful than what the prior effort to
build kernel common interfaces was creating. In fact there are now
competing ideas on what those interfaces should be and a lot of
interesting development.

The kernel common APIs that were developed before turned out to be
such a straitjacket that they held back a great deal of work, forced
people out of tree, and badly harmed community formation on the
userspace side.

In other words, allowing userspace to have freedom has really pushed
things forward in good ways.

> So if there is general agreement that this is acceptable (especially
> compared to other out-of-tree drivers), I think a few who find this
> useful should sync on the best way forward; I'm not sure a separate
> driver for each vendor is the right approach.

I also like this, I don't want the outcome of this discussion to be
that only mlx5ctl gets merged. I want all the HW that has this problem
to have support in the mainline kernel.

I want to see a robust multi-vendor userspace community that houses
a lot of stuff and is working to solve user problems. To me this point
is the big win for the community.

If we need a formally named kernel subsystem to pull that community
together then let's do that.  We can probably have some common ioctls
and shared uABI marshaling/discovery/etc like nvme does.  At the end
it can't really be abstracted too much, the user facing API is really
going to be "send a FW RPC" like mlx5/nvme-vendor does.

Honestly it will be easier to discuss and document the overall design
and what devices have to do to be compliant within a tidy subsystem
with a name that people can identify as a thing. Naming things and
having spaces is really important to build community around them.

Certainly, I am up for this. If you have similar usage flows I'm
quite happy to work with everyone to launch a new mini-subsystem. If
other people are reading this and think they want to take part too
please speak up.

> If upstream (and therefore distros) are going to accept this we probably
> owe it to them to not have misc drivers for every different flavor of
> hardware out there when it might be possible to add a generic driver
> that can connect to a PCI device via new (auxiliary bus?) API.
 
Distros often prefer if they have less packaging, versioning and
community to deal with :)

I'm inspired by things like nvme-cli that show a nice way forward
where there can be vendor plugins and a shared github with a common
shared umbrella of CLI, documentation and packaging.

With some co-operation from the distros we can push vendors to
participate in the shared userspace, and we can push vendors from the
kernel by denying kernel driver support without basic integration to
the userspace (like DRM does with mesa).

Thanks,
Jason
  
Jakub Kicinski Feb. 14, 2024, 6:11 p.m. UTC | #16
On Wed, 14 Feb 2024 13:57:35 -0400 Jason Gunthorpe wrote:
> There is a clear split in my mind between:
>  - inspection debugging
>  - invasive mutating debugging
>  - configuration

Yes there's a clear split, and how are you going to enforce it on 
an opaque interface? Put an "evil" bit in the common header?

> And maybe "invasive mutating debugging" taints the kernel or something
> like that.
  
Jason Gunthorpe Feb. 14, 2024, 6:37 p.m. UTC | #17
On Wed, Feb 14, 2024 at 10:11:26AM -0800, Jakub Kicinski wrote:
> On Wed, 14 Feb 2024 13:57:35 -0400 Jason Gunthorpe wrote:
> > There is a clear split in my mind between:
> >  - inspection debugging
> >  - invasive mutating debugging
> >  - configuration
> 
> Yes there's a clear split, and how are you going to enforce it on 
> an opaque interface? Put an "evil" bit in the common header?

The interface is opaque at the subsystem level, but that doesn't mean it
is completely opaque to every driver layer in the kernel. There is still a
HW specific kernel driver that delivers the FW command to the actual
HW.

In the mlx5 model the kernel driver stamps the command with "uid"
which is effectively a security scope label. This cannot be avoided by
userspace and is fundamental to why mlx5ctl is secure in a lockdown
kernel.

For example mlx5's FW interface has the concept of security scopes. We
have several defined today:
 - Kernel
 - Userspace rdma
 - Userspace rdma with CAP_NET_RAW
 - Userspace rdma with CAP_SYS_RAWIO

So we trivially add three more for the scopes I listed above. The
mlx5ctl driver as posted already introduced a new scope, for example.

Userspace will ask the kernel for an appropriate security scope after
opening the char-device. If userspace asks for invasive then you get a
taint. Issuing an invasive command without a kernel applied invasive
security label will be rejected by the FW.

We trust the kernel to apply the security label for the origin of the
command. We trust the device FW to implement security scopes,
because these are RDMA devices and all of RDMA and all of SRIOV
virtualization are totally broken if the device FW cannot be trusted
to maintain security separation between scopes.

Jason
  
David Ahern Feb. 14, 2024, 8:31 p.m. UTC | #18
On 2/9/24 10:01 PM, David Ahern wrote:
> Maybe my bigger question is should drivers that can do different
> physical layers, or can operate without a netdev, should they be split
> into different drivers? drivers/misc for the PCI interface, drivers/net
> for ethernet interfaces and its APIs and drivers/infiniband for IB and
> its APIs? Generic capabilities like this debugging interface belong to
> the PCI device since it is generic to the device hence drivers/misc.
> 

I do not recall seeing a response to this line of questions. We are
considering splitting our driver along these lines to upstream it given
the independent nature of the capabilities and need to interact with
different S/W APIs and hence different maintainers. Any reason new
drivers should not be split along these lines?

If the answer is no, then where is the best home for the PCI device
driver piece of it? drivers/misc or drivers/accel or create a new
drivers/multifcn/?
  
Jason Gunthorpe Feb. 15, 2024, 12:46 a.m. UTC | #19
On Wed, Feb 14, 2024 at 01:31:29PM -0700, David Ahern wrote:
> On 2/9/24 10:01 PM, David Ahern wrote:
> > Maybe my bigger question is should drivers that can do different
> > physical layers, or can operate without a netdev, should they be split
> > into different drivers? drivers/misc for the PCI interface, drivers/net
> > for ethernet interfaces and its APIs and drivers/infiniband for IB and
> > its APIs? Generic capabilities like this debugging interface belong to
> > the PCI device since it is generic to the device hence drivers/misc.
> > 
> 
> I do not recall seeing a response to this line of questions. We are
> considering splitting our driver along these lines to upstream it given
> the independent nature of the capabilities and need to interact with
> different S/W APIs and hence different maintainers. Any reason new
> drivers should not be split along these lines?

I think it is essential to split them up like this, it is why auxbus
was created in the first place. The subsystem specific logic should
live in the subsystem space.

I've had some conversations where we have disagreements about which
functions end up in which directories in these split up drivers, but
the general principle is good.

I like mlx5's split where the core is about booting and communicating
with the FW and some common helpers then every subsystem has its own
FW calls in its own functions. Other people liked the idea that as
much "low level" code was in the core and the subsystems would call
helper functions in the core that invoke the FW functions.

> If the answer is no, then where is the best home for the PCI device
> driver piece of it? drivers/misc or drivers/accel or create a new
> drivers/multifcn/?

I've had conversations about this too, I would vote for a new
directory going forward. drivers/auxbus or something specifically for
these core components of an auxbus based multi-subsystem driver.

We likely need some reasonable rules so that the core driver remains
an in-kernel component and doesn't start growing weird uapi. If you
want uapi it needs to come through an appropriate subsystem specific
thing, even if that is misc.

Jason
  
Christoph Hellwig Feb. 15, 2024, 7 a.m. UTC | #20
On Wed, Feb 14, 2024 at 07:48:32AM -0800, Jakub Kicinski wrote:
> Overreach is unfortunate, I'd love to say "please do merge it as part 
> of RDMA". You probably don't trust my opinion but Jason admitted himself
> this is primarily for RDMA. RDMA is what it is in terms of openness and
> all vendors trying to sell their secret magic sauce.

Come on.  RDMA has two important open standards, one of them even done
in the IETF, the most open of all standards organizations.

> > I think all maintainers can and should voice the
> > opinions, be those technical or political, but trying to block a useful
> > feature without lots of precedence because it is vaguely related to the
> > subsystem is not helpful. 
> 
> Not sure what you mean by "without lots of precedence" but you can ask

Should have been "with".  Just about every subsystem with complex devices
has this kind of direct interface for observability and the like in at
least some drivers.
  
Jiri Pirko Feb. 15, 2024, 12:08 p.m. UTC | #21
Thu, Feb 15, 2024 at 08:00:40AM CET, hch@infradead.org wrote:
>On Wed, Feb 14, 2024 at 07:48:32AM -0800, Jakub Kicinski wrote:

[...]


>> > I think all maintainers can and should voice the
>> > opinions, be those technical or political, but trying to block a useful
>> > feature without lots of precedence because it is vaguely related to the
>> > subsystem is not helpful. 
>> 
>> Not sure what you mean by "without lots of precedence" but you can ask
>
>Should have been with.  Just about every subsystem with complex devices
>has this kind of direct interface for observability and co in at least
>some drivers.

What about configuration? How do you ensure the direct FW/HW access
is used only for observability/debug purposes? I mean, if you can't,
I think it is incorrect to name it that way, isn't it?
  
Jason Gunthorpe Feb. 15, 2024, 1:21 p.m. UTC | #22
On Wed, Feb 14, 2024 at 07:48:32AM -0800, Jakub Kicinski wrote:
> On Wed, 14 Feb 2024 00:29:16 -0800 Christoph Hellwig wrote:
> > With my busy kernel contributor head on I have to voice my
> > dissatisfaction with the subsystem maintainer overreach that's causing
> > the troubles here. 
> 
> Overreach is unfortunate, I'd love to say "please do merge it as part 
> of RDMA". You probably don't trust my opinion but Jason admitted himself
> this is primarily for RDMA.

"admitted"? You make it sound like a crime. I've been very clear on
this need from the RDMA community since the first posting.

> The problem is that some RDMA stuff is built really closely on TCP,

Huh? Since when? Are you talking about soft-iwarp? That is a research
project and Bernard is very responsive, if you have issues ask him and
he will help.

Otherwise the actual HW devices are not entangled with netdev TCP, the
few iWarp devices have their own TCP implementation, in accordance
with what the IETF standardized.

> and given Jason's and co. inability to understand that good fences
> make good neighbors it will soon start getting into the netdev stack :|

I seem to recall you saying RDMA shouldn't call any netdev APIs at
all. We were unable to agree on where to build the fence for this
reason.

> Ah, and I presume they may also want it for their DOCA products. 
> So 80% RDMA, 15% DOCA, 5% the rest is my guess.

I don't know all details about DOCA, but what I know about runs over
RDMA.

> Not sure what you mean by "without lots of precedence" but you can ask
> around netdev. We have nacked such interfaces multiple times.
> The best proof that the rule exists and is well established is that Saeed
> himself has asked us a number of times to lift it.
> 
> What should be expected of us is fairness and not engaging in politics.
> We have a clear rule against opaque user space to FW interfaces,
> and I don't see how we could enforce that fairly for pure Ethernet
> devices if big vendors get to do whatever they want.

If your community is telling you that your rules are not working for
them anymore, it is not nice to tell them that the rules exist and
cannot be questioned. Try working together toward a reasonable
consensus solution.

The world has changed a lot; the use cases are different, the users
are different, the devices are different. When Dave made that
prohibition long ago it was not in a world of multi-billion-transistor
NICs being deployed in uniform clusters of unimaginable size.

Jason
  
Jakub Kicinski Feb. 16, 2024, 1 a.m. UTC | #23
On Wed, 14 Feb 2024 23:00:40 -0800 Christoph Hellwig wrote:
> On Wed, Feb 14, 2024 at 07:48:32AM -0800, Jakub Kicinski wrote:
> > Overreach is unfortunate, I'd love to say "please do merge it as part 
> > of RDMA". You probably don't trust my opinion but Jason admitted himself
> > this is primarily for RDMA. RDMA is what it is in terms of openness and
> > all vendors trying to sell their secret magic sauce.  
> 
> Come on.  RDMA has two important open standards, one of them even done
> in the IETF, the most open of all standards organizations.

While I don't dispute that there are standards which can be read,
the practical interoperability of RDMA devices is extremely low.
By practical I mean having two devices from different vendors
achieve any reasonable performance talking to each other.
Even two devices from _the same_ vendor but different generations
are unlikely to perform well.

Given how RDMA is deployed (uniform, greenfield/full replacement)
this is entirely reasonable from the engineering perspective.

But this is a bit of a vicious cycle, vendors have little incentive 
to interoperate, and primarily focus on adding secret sauce outside of 
the standard. In fact you're lucky if the vendor didn't bake some
extension which requires custom switches into the NICs :(

Compare that to WiFi, which is a level of standardization netdev folks
are more accustomed to. You can connect a new device from vendor X to 
a 10 year old AP from vendor Y and it will run with high perf.

Unfortunately because of the AI craze I have some experience
with RDMA deployments now. Perhaps you have more, perhaps your
experience differs.
  
Jakub Kicinski Feb. 16, 2024, 1:10 a.m. UTC | #24
On Thu, 15 Feb 2024 09:21:38 -0400 Jason Gunthorpe wrote:
> On Wed, Feb 14, 2024 at 07:48:32AM -0800, Jakub Kicinski wrote:
> > On Wed, 14 Feb 2024 00:29:16 -0800 Christoph Hellwig wrote:  
> > > With my busy kernel contributor head on I have to voice my
> > > dissatisfaction with the subsystem maintainer overreach that's causing
> > > the troubles here.   
> > 
> > Overreach is unfortunate, I'd love to say "please do merge it as part 
> > of RDMA". You probably don't trust my opinion but Jason admitted himself
> > this is primarily for RDMA.  
> 
> "admitted"? You make it sound like a crime. I've been very clear on
> this need from the RDMA community since the first posting.

Sorry, unintentional :) Maybe it's a misunderstanding but my impression
was that at least Saeed was trying hard to make this driver a common
one, not just for RDMA.

> > The problem is that some RDMA stuff is built really closely on TCP,  
> 
> Huh? Since when? Are you talking about soft-iwarp? That is a research
> project and Bernard is very responsive; if you have issues, ask him and
> he will help.
> 
> Otherwise the actual HW devices are not entangled with netdev TCP, the
> few iWarp devices have their own TCP implementation, in accordance
> with what the IETF standardized.

There are some things I know either from work or that were otherwise
told to me in confidence which I can't share. This is very frustrating
for me, and probably for you, sorry :(

> > and given Jason's and co. inability to understand that good fences
> > make good neighbors it will soon start getting into the netdev stack :|  
> 
> I seem to recall you saying RDMA shouldn't call any netdev APIs at
> all. We were unable to agree on where to build the fence for this
> reason.

Last time you were calling into the IPsec stack right? It's not just 
a basic API. IDK how to draw a line, definitely open to suggestions!

> > Ah, and I presume they may also want it for their DOCA products. 
> > So 80% RDMA, 15% DOCA, 5% the rest is my guess.  
> 
> I don't know all details about DOCA, but what I know about runs over
> RDMA.

Well, since you're an RDMA person, that's not really saying much
about the existence of other parts.
  
Jakub Kicinski Feb. 16, 2024, 1:40 a.m. UTC | #25
On Wed, 14 Feb 2024 14:37:55 -0400 Jason Gunthorpe wrote:
> On Wed, Feb 14, 2024 at 10:11:26AM -0800, Jakub Kicinski wrote:
> > On Wed, 14 Feb 2024 13:57:35 -0400 Jason Gunthorpe wrote:  
> > > There is a clear split in my mind between:
> > >  - inspection debugging
> > >  - invasive mutating debugging
> > >  - configuration  
> > 
> > Yes there's a clear split, and how are you going to enforce it on 
> > an opaque interface? Put an "evil" bit in the common header?  
> 
> The interface is opaque through a subsystem, it doesn't mean it is
> completely opaque to every driver layer in the kernel. There is still a
> HW specific kernel driver that delivers the FW command to the actual
> HW.
> 
> In the mlx5 model the kernel driver stamps the command with "uid"
> which is effectively a security scope label. This cannot be avoided by
> userspace and is fundamental to why mlx5ctl is secure in a lockdown
> kernel.
> 
> For example mlx5's FW interface has the concept of security scopes. We
> have several defined today:
>  - Kernel
>  - Userspace rdma
>  - Userspace rdma with CAP_NET_RAW
>  - Userspace rdma with CAP_SYS_RAWIO
> 
> So we trivially add three more for the scopes I listed above. The
> mlx5ctl driver as posted already introduced a new scope, for example.
> 
> Userspace will ask the kernel for an appropriate security scope after
> opening the char-device. If userspace asks for invasive then you get a
> taint. Issuing an invasive command without a kernel applied invasive
> security label will be rejected by the FW.
> 
> We trust the kernel to apply the security label for the origin of the
> command. We trust the device FW to implement security scopes,
> because these are RDMA devices and all of RDMA and all of SRIOV
> virtualization are totally broken if the device FW cannot be trusted
> to maintain security separation between scopes.

You have changed the argument.

The problem Andy was raising is that users having access to low-level
configuration will make it impossible for a distro's support staff to
tell how a device is configured. There won't be any trace of activity
at the OS level.

To which you replied that you can differentiate between debugging and
configuration on an opaque interface, _in the kernel_.

Which I disagree with, obviously.

And now you're saying that you can maintain security if you trust 
the firmware to enforce some rules.

I'm not talking about security here, the evil bit is just an example
of an unsound design.
  
David Ahern Feb. 16, 2024, 4:20 a.m. UTC | #26
On 2/15/24 6:10 PM, Jakub Kicinski wrote:
>>> and given Jason's and co. inability to understand that good fences
>>> make good neighbors it will soon start getting into the netdev stack :|  
>>
>> I seem to recall you saying RDMA shouldn't call any netdev APIs at
>> all. We were unable to agree on where to build the fence for this
>> reason.
> 
> Last time you were calling into the IPsec stack right? It's not just 
> a basic API. IDK how to draw a line, definitely open to suggestions!
> 

In general, each subsystem should focus on proper APIs that other
subsystems can leverage and not worry about how people wire them up.
That NVME, RDMA/IB and io_uring want to re-use the Linux networking
stack is a good thing.
  
Jason Gunthorpe Feb. 16, 2024, 2:27 p.m. UTC | #27
On Thu, Feb 15, 2024 at 05:40:34PM -0800, Jakub Kicinski wrote:
> On Wed, 14 Feb 2024 14:37:55 -0400 Jason Gunthorpe wrote:
> > On Wed, Feb 14, 2024 at 10:11:26AM -0800, Jakub Kicinski wrote:
> > > On Wed, 14 Feb 2024 13:57:35 -0400 Jason Gunthorpe wrote:  
> > > > There is a clear split in my mind between:
> > > >  - inspection debugging
> > > >  - invasive mutating debugging
> > > >  - configuration  
> > > 
> > > Yes there's a clear split, and how are you going to enforce it on 
> > > an opaque interface? Put an "evil" bit in the common header?  
> > 
> > The interface is opaque through a subsystem, it doesn't mean it is
> > completely opaque to every driver layer in the kernel. There is still a
> > HW specific kernel driver that delivers the FW command to the actual
> > HW.
> > 
> > In the mlx5 model the kernel driver stamps the command with "uid"
> > which is effectively a security scope label. This cannot be avoided by
> > userspace and is fundamental to why mlx5ctl is secure in a lockdown
> > kernel.
> > 
> > For example mlx5's FW interface has the concept of security scopes. We
> > have several defined today:
> >  - Kernel
> >  - Userspace rdma
> >  - Userspace rdma with CAP_NET_RAW
> >  - Userspace rdma with CAP_SYS_RAWIO
> > 
> > So we trivially add three more for the scopes I listed above. The
> > mlx5ctl driver as posted already introduced a new scope, for example.
> > 
> > Userspace will ask the kernel for an appropriate security scope after
> > opening the char-device. If userspace asks for invasive then you get a
> > taint. Issuing an invasive command without a kernel applied invasive
> > security label will be rejected by the FW.
> > 
> > We trust the kernel to apply the security label for the origin of the
> > command. We trust the device FW to implement security scopes,
> > because these are RDMA devices and all of RDMA and all of SRIOV
> > virtualization are totally broken if the device FW cannot be trusted
> > to maintain security separation between scopes.
> 
> You have changed the argument.

I explained how the technical bits of a part work; you clipped out my
answer to Andy's concern.

> The problem Andy was raising is that users having access to low level
> configuration will make it impossible for distro's support to tell
> device configuration. There won't be any trace of activity at the OS
> level.

I responded to that by saying the answer is to have robust dumping of
the device configuration and suggested a taint bit if changes are made
to the device outside that support envelope and can't be captured in
the dumps.

The first part is already what everyone does. There is some supported
configuration in flash and there are tools to dump and inspect it. The
field teams understand they need to look at that, and existing data
collection tools already capture this stuff. I don't think we have a
real problem here.

The step beyond that, which Andy was talking about, is the hypothetical
"what if you touch random unsafe registers directly or something?"
That probably shouldn't be allowed on a lockdown kernel, but assuming a
device could do so safely, my answer was to trigger a taint.

Then you asked how do you trigger a taint if the kernel doesn't parse
the commands.

> To which you replied that you can differentiate between debugging and
> configuration on an opaque interface, _in the kernel_.

I did not say "_in the kernel_" meaning the kernel itself would do it;
I meant the kernel would ensure it is done.

The kernel delegates the differentiation to the FW and it trusts the
FW to do that work for it.

> Which I disagree with, obviously.

I don't know why? How is it obvious?

> And now you're saying that you can maintain security if you trust 
> the firmware to enforce some rules.

Right. The userspace sends a command. The kernel tags it with the
permission the userspace has. The FW parses the command, checks the
kernel-supplied permission against the command content, and permits it.

We trust the FW to do the restriction on behalf of the kernel.

The restriction we are talking about is containing the userspace to
only operate within the distro's support envelope. The kernel can ask
the FW to enforce that rule as a matter of security and trust the FW
to do so.

Jason
  
Jason Gunthorpe Feb. 16, 2024, 3:05 p.m. UTC | #28
On Thu, Feb 15, 2024 at 05:00:46PM -0800, Jakub Kicinski wrote:

> But this is a bit of a vicious cycle, vendors have little incentive 
> to interoperate, and primarily focus on adding secret sauce outside of 
> the standard. In fact you're lucky if the vendor didn't bake some
> extension which requires custom switches into the NICs :(

This may all seem shocking if you come from the netdev world, but this
has been normal for HPC networking for the last 30 years at least.

My counter-perspective would be that we are currently in a pretty good
moment for the HPC industry because we actually have open source
implementations for most of it. In fact, most actual deployments are
running something quite close to the mainline open source stack.

The main holdout right now is Cray/HPE's Slingshot networking family
(apparently based on Ethernet), which is less open source.

I would say the HPC community has a very different community goal post
than netdev land. Make your thing, whatever it is. Come with an open
kernel driver, an open rdma-core, an open libfabric/ucx, plug into
the open dpdk/nccl/ucx/libfabric layer, and demonstrate your thing
works with openmpi/etc applications.

Supporting that open stack is broadly my north star from the kernel
perspective, much as Mesa is for DRM.

Several points of this chain are open industry standards driven by
technical working group communities.

This is what standardization and interoperability look like here. It
is probably totally foreign from a netdev viewpoint: far less focus on
the wire protocol, devices, and kernel. Here the focus is on
application and software interoperability. Still, it is open in a
pretty solid way.

Jason
  
Jason Gunthorpe Feb. 16, 2024, 7:04 p.m. UTC | #29
On Thu, Feb 15, 2024 at 05:10:13PM -0800, Jakub Kicinski wrote:
> On Thu, 15 Feb 2024 09:21:38 -0400 Jason Gunthorpe wrote:
> > On Wed, Feb 14, 2024 at 07:48:32AM -0800, Jakub Kicinski wrote:
> > > On Wed, 14 Feb 2024 00:29:16 -0800 Christoph Hellwig wrote:  
> > > > With my busy kernel contributor head on I have to voice my
> > > > dissatisfaction with the subsystem maintainer overreach that's causing
> > > > the troubles here.   
> > > 
> > > Overreach is unfortunate, I'd love to say "please do merge it as part 
> > > of RDMA". You probably don't trust my opinion but Jason admitted himself
> > > this is primarily for RDMA.  
> > 
> > "admitted"? You make it sound like a crime. I've been very clear on
> > this need from the RDMA community since the first posting.
> 
> Sorry, unintentional :) Maybe it's a misunderstanding but my impression
> was that at least Saeed was trying hard to make this driver a common
> one, not just for RDMA.

The hardware is common; this is a driver to talk to the shared FW.
From the perspective of the FW, netdev is effectively an RDMA
application running in the kernel, so I don't know how you'd do just
one.

There is no real line between these things beyond the artificial one
we have created in the kernel.

> > > The problem is that some RDMA stuff is built really closely on TCP,  
> > 
> > Huh? Since when? Are you talking about soft-iwarp? That is a research
> > project and Bernard is very responsive; if you have issues, ask him and
> > he will help.
> > 
> > Otherwise the actual HW devices are not entangled with netdev TCP, the
> > few iWarp devices have their own TCP implementation, in accordance
> > with what the IETF standardized.
> 
> There are some things I know either from work or that were otherwise
> told to me in confidence which I can't share. This is very frustrating
> for me, and probably for you, sorry :(

Well, all I can say is I know of no forthcoming RDMA things with any
different relationship to TCP. I think if someone wants to do more TCP
they will have a hard time, and I'm not inclined to seriously help
anyone get more TCP into RDMA. iWarp is trouble enough already.

> > I seem to recall you saying RDMA shouldn't call any netdev APIs at
> > all. We were unable to agree on where to build the fence for this
> > reason.
> 
> Last time you were calling into the IPsec stack right? It's not just 
> a basic API. IDK how to draw a line, definitely open to suggestions!

I thought the two halves of the mlx5 driver were coordinating their
usage of the shared HW around the ipsec configuration pushed into the
device by netdev xfrm.

> > > Ah, and I presume they may also want it for their DOCA products. 
> > > So 80% RDMA, 15% DOCA, 5% the rest is my guess.  
> > 
> > I don't know all details about DOCA, but what I know about runs over
> > RDMA.
> 
> Well, since you're an RDMA person, that's not really saying much
> about the existence of other parts.

<shrug> why does DOCA matter? Should we have not done io_uring because
Oracle might use it? Besides, from what I know about DOCA it is almost
all data plane stuff and RDMA fully covers that already.

Regards,
Jason
  
Jason Gunthorpe March 4, 2024, 4:02 p.m. UTC | #30
On Wed, Feb 14, 2024 at 01:57:35PM -0400, Jason Gunthorpe wrote:

> I also like this, I don't want the outcome of this discussion to be
> that only mlx5ctl gets merged. I want all the HW that has this problem
> to have support in the mainline kernel.

To this end, here is my proposal to move forward with a new
mini-subsystem to provide rules for this common approach: get the
existing tools out of hacky, unstructured direct hardware access via
/sys/XX and into a formalized, lockdown-compatible system. I've talked
to enough people now to think we have a critical mass.

===============
fwctl subsystem
===============

Overview
--------

Modern devices contain extensive amounts of FW and, in many cases, are largely
software-defined pieces of hardware. The evolution of this approach is largely a
reaction to Moore's Law, where a chip tape-out is now highly expensive and the
chip design is extremely large. Replacing fixed HW logic with a flexible,
tightly coupled FW/HW combination is an effective risk mitigation against a chip
respin: problems in the HW design can be counteracted in device FW. This is
especially true for devices which present a stable and backwards-compatible
interface to the operating system driver (such as NVMe).


The FW layer in devices has grown to incredible sizes and devices frequently
integrate clusters of fast processors to run it. For example, mlx5 devices have
over 30MB of FW code, and big configurations operate with over 1GB of
FW-managed runtime state.

The availability of such a flexible layer has created quite a variety in the
industry, where single pieces of silicon are now configurable, software-defined
devices that can operate in quite different ways depending on the need. Further,
we often see cases where specific sites wish to operate devices in ways that are
highly specialized and can only run an application that has been tailored to
their unique configuration.

Further, devices have become so multi-functional and integrated that they no
longer fit neatly into the kernel's division of subsystems. Modern
multi-functional devices have drivers, such as bnxt/ice/mlx5/pds, that span
many subsystems while sharing the underlying hardware using the auxiliary
device system.

All together this creates a challenge for the operating system, where devices
have an expansive FW environment that needs robust vendor-specific debugging
support, and FW driven functionality that is not well suited to “generic”
interfaces. fwctl seeks to allow access to the full device functionality from
user space in the areas of debuggability, management, and first-boot/nth-boot
provisioning.

fwctl is aimed at the common device design pattern where the OS and FW
communicate via an RPC message layer constructed with a queue or mailbox scheme.
In this case the driver will typically have some layer to deliver RPC messages
and collect RPC responses from device FW. The in-kernel subsystem drivers that
operate the device for its primary purposes will use these RPCs to build their
drivers, but devices also usually have a set of ancillary RPCs that don't really
fit into any specific subsystem. For example, a HW RAID controller is primarily
operated by the block layer but also comes with a set of RPCs to administer the
construction of drives within the HW RAID.

In the past, when devices were more single-function, individual subsystems
grew different approaches to solving some of these common problems: monitoring
device health, manipulating its FLASH, debugging the FW, and provisioning all
have various unique interfaces across the kernel.

fwctl's purpose is to define a common set of limited rules, described below,
that allow user space to securely construct and execute RPCs inside device FW.
The rules serve as an agreement between the operating system and FW on how to
correctly design the RPC interface. As a uAPI the subsystem provides a thin
layer of discovery and a generic uAPI to deliver the RPCs and collect the
response. It supports a system of user space libraries and tools which will
use this interface to control the device.

Scope of Action
---------------

fwctl drivers are strictly restricted to being a way to operate the device FW.
It is not an avenue to access random kernel internals, or other operating system
SW states.

fwctl instances must operate on a well-defined device function, and the device
should have a well-defined security model for what scope within the device the
function is permitted to access. For instance, the most complex PCIe devices
today may broadly have several function-level scopes:

 1. A privileged function with full access to the on-device global state and
    configuration

 2. Multiple hypervisor functions with control over themselves and the child
    functions used with VMs

 3. Multiple VM functions tightly scoped within the VM

The device may create a logical parent/child relationship between these scopes,
for instance a child VM's FW may be within the scope of the hypervisor FW. It is
quite common in the VFIO world that the hypervisor environment has a complex
provisioning/profiling/configuration responsibility for the function VFIO
assigns to the VM.

Further, within the function, devices often have RPC commands that fall within
some general scopes of action:

 1. Access to function & child configuration, flash, etc that becomes live at a
    function reset.

 2. Access to function & child runtime configuration that kernel drivers can
    discover at runtime.

 3. Read only access to function debug information that may report on FW objects
    in the function & child, including FW objects owned by other kernel
    subsystems.

 4. Write access to function & child debug information strictly compatible with
    the principles of kernel lockdown and kernel integrity protection. Triggers
    a kernel Taint.

 5. Full debug device access. Triggers a kernel Taint, requires CAP_SYS_RAWIO.

Limitation to the appropriate scope must be performed by the kernel.

There are many things this interface must not allow user space to do (without a
Taint or CAP), broadly derived from the principles of kernel lockdown. Some
examples:

 1. DMA to/from arbitrary memory, hang the system, run code in the device, or
    otherwise compromise device or system security and integrity.

 2. Provide an abnormal “back door” to kernel drivers. No manipulation of kernel
    objects owned by kernel drivers.

 3. Directly configure or otherwise control kernel drivers. The kernel driver
    can react to the device configuration at function reset/driver load time,
    but otherwise should not be coupled to fwctl.

 4. Operate the HW in a way that overlaps with the core purpose of another
    primary kernel subsystem, such as read/write to LBAs, send/receive of
    Network packets, or operate an accelerator's data plane.

fwctl is not a replacement for device direct access subsystems like uacce or
VFIO.

Security Response
-----------------

The kernel remains the gatekeeper for this interface. If violations of the
scopes, security or isolation principles are found, we have options to let
devices fix them with a FW update, push a kernel patch to parse and block RPC
commands or push a kernel patch to block entire firmware versions, or devices.

While the kernel can always directly parse and restrict RPCs, the existing
kernel pattern of allowing drivers to delegate validation to FW is expected to
remain a useful design.

fwctl Driver design
-------------------

In many cases a fwctl driver is going to be part of a larger cross-subsystem
device possibly using the auxiliary_device mechanism. In that case several
subsystems are going to be sharing the same device and FW interface layer so the
device design must already provide for isolation and co-operation between kernel
subsystems. fwctl should fit into that same model.

Part of the driver should include a description of how its scope restrictions
and security model work. The driver and FW together must ensure that RPCs
provided by user space are mapped to the appropriate scope and that the user
space has rights to work on that scope. Broadly there are two approaches drivers
can take:

 1. Have the kernel provide a scope label to the FW along with the user space
    RPC, and have the FW parse and enforce the scope

 2. Have the kernel parse the RPC and enforce the scope itself

The driver and FW must co-operate to ensure that either fwctl cannot allocate
any FW resources, or any resources it does allocate are freed on FD closure.  A
driver primarily constructed around FW RPCs may find that its core PCI function
and RPC layer belongs under fwctl with auxiliary devices connecting to other
subsystems.

User space Operation
--------------------

To be written in detail, broadly:

 - A “fwctl” class

 - Each fwctl device driver makes a fwctl instance with a struct device and
   sysfs presence in its class and a char device /dev/fwctl/XX

 - Some sysfs files per-instance describing what the device is and bootstrap
   details needed to operate and format RPCs.

 - fds opened from /dev/fwctl/XX support an IOCTL to specify the scope, execute
   an RPC and return the response.

 - A security layer ensuring that operations from user space are restricted and
   contained within the defined scopes.

User space Community
--------------------

Drawing inspiration from nvme-cli, participation on the kernel side must come
with a user space implementation in a common TBD git tree, at a minimum one
that can usefully operate the kernel driver. Providing such an implementation
is a pre-condition to merging a kernel driver.

The goal is to build user space community around some of the shared problems
we all have, and ideally develop some common user space programs with some
starting themes of:

 - Device in-field debugging

 - HW provisioning

 - VFIO child device profiling before VM boot

 - Confidential Compute topics (attestation, secure provisioning)

That stretches across all subsystems in the kernel. fwupd is a great example of
how an excellent user space experience can emerge out of kernel-side diversity.

Existing Similar Examples
-------------------------

The approach described in this document is not a new idea. Direct or
near-direct device access has been offered by the kernel in different areas for
decades. With more devices wanting to follow this design pattern it is becoming
clear that the pattern is not entirely well understood and, more importantly,
that the security considerations are not well defined or agreed upon.

Some examples:

 - HW RAID controllers. This includes RPCs to do things like compose drives into
   a RAID volume, configure RAID parameters, monitor the HW and more.

 - Baseboard managers. RPCs for configuring settings in the device and more

 - NVMe vendor command capsules. nvme-cli provides access to some monitoring
   functions that vendors have defined, but more exists.

 - DRM allows user space drivers to send commands to the device via kernel
   mediation

 - RDMA allows user space drivers to directly push commands to the device
   without kernel involvement

 - Various “raw” APIs: raw HID (SDL2), raw USB, the NVMe Generic Interface, etc.

The first three are examples of areas that fwctl intends to cover.

A key lesson learned from these past efforts is the importance of having a
common user space project as a pre-condition for obtaining a kernel driver.
Developing a good community around useful software in user space is key to
getting vendor companies to fund participation.