Message ID | 20240207072435.14182-1-saeed@kernel.org |
---|---|
Headers |
Return-Path: <linux-kernel+bounces-56072-ouuuleilei=gmail.com@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:a05:7301:168b:b0:106:860b:bbdd with SMTP id ma11csp2059639dyb; Tue, 6 Feb 2024 23:28:58 -0800 (PST) X-Google-Smtp-Source: AGHT+IEkCNFDsaHTuKgBFJMkH+8znpeafHLsRuuYWwm+XrXalqz0eifPdj4ziEAInoFWSYm/BbVt X-Received: by 2002:aa7:d693:0:b0:560:654c:157a with SMTP id d19-20020aa7d693000000b00560654c157amr3091664edr.30.1707290937836; Tue, 06 Feb 2024 23:28:57 -0800 (PST) ARC-Seal: i=2; a=rsa-sha256; t=1707290937; cv=pass; d=google.com; s=arc-20160816; b=cnjW43HXXYHAws2pHEZR8MZrs9t+WJmWvTclJiWT7mKOUFmQcA9z1S68EVxW6YIVja vVJB9wG+VPRRQteKLO4q05VxDkd1oYKoGldCC3XgSaOKIEZMRzMbsbqYx2Q92H3ro6uF REqw+IZcasapDmIqpUxRgiESUkae1WoXhEnpCTFEcehRcNoufZbaNxbSo+bKsRlto4P/ 7n9gY4yv9d2d1W/s0qgszYckxvH1qhWlA3mAvihLAXyk3Rc4aJTducsLDz8EDSveAyc4 i5/mdbWGStTgwXq4PondnN7jNukS9EizCnormKsjUw6rEpfliTUmrpPhKRTczeYeFeVC ZTRA== ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=content-transfer-encoding:mime-version:list-unsubscribe :list-subscribe:list-id:precedence:message-id:date:subject:cc:to :from:dkim-signature; bh=3shc6ooXT3PLJf2RTkFD2HHahxh1RfTjYpSYIitIY3U=; fh=tmXapDjKv2zvSszVhDZ8lcV/45fM4xDgIda3gBh2C8Y=; b=F8X+Jc2GPR+T4Ar+LZkXEqKhoWWt2JBUMkpDB3RsbEzuUP7nDKnq3R5dYRzkMlSY/m gP7sRt1JYz+ABHQXSvy5O7Sm+ql/kRC1B1Zl+MNtaLEfLKej57f2WoPI+rWPhiUy7GK2 uXy63zKbsx1kPGK0W+gOn8+S6AnDZDNXnfmzQNihhVNbBaEhsAHwDrqbPcpR0onSvMVr 0K7Wod6+V+prObbPXuOt9Zbai92mLXnbcr0qJNbVfLJ929aFhzpiZQXuvbYpBESHtHuK TX50oKmrYzbRF8/cOeY7E+jjIlx9OVyiW3z2mfqEjHziDjTTk0tBFVeWRqtnaDJXBR1k ogig==; dara=google.com ARC-Authentication-Results: i=2; mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b="k5BR/Drb"; arc=pass (i=1 dkim=pass dkdomain=kernel.org); spf=pass (google.com: domain of linux-kernel+bounces-56072-ouuuleilei=gmail.com@vger.kernel.org designates 2604:1380:4601:e00::3 as permitted sender) smtp.mailfrom="linux-kernel+bounces-56072-ouuuleilei=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org X-Forwarded-Encrypted: i=2; AJvYcCW8zObWWM95SwFZz1nPBhtZoz7X6duDFpLRx5DjdG0i1JgD7ACXPvA7BX3pICrbV2CY9P9fWLGA/YhpQay9WvS2LqpjWw== Received: from am.mirrors.kernel.org (am.mirrors.kernel.org. [2604:1380:4601:e00::3]) by mx.google.com with ESMTPS id l19-20020aa7c3d3000000b0055fa07cd8d5si526740edr.563.2024.02.06.23.28.57 for <ouuuleilei@gmail.com> (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 06 Feb 2024 23:28:57 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel+bounces-56072-ouuuleilei=gmail.com@vger.kernel.org designates 2604:1380:4601:e00::3 as permitted sender) client-ip=2604:1380:4601:e00::3; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b="k5BR/Drb"; arc=pass (i=1 dkim=pass dkdomain=kernel.org); spf=pass (google.com: domain of linux-kernel+bounces-56072-ouuuleilei=gmail.com@vger.kernel.org designates 2604:1380:4601:e00::3 as permitted sender) smtp.mailfrom="linux-kernel+bounces-56072-ouuuleilei=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by am.mirrors.kernel.org (Postfix) with ESMTPS id 487D81F256F3 for <ouuuleilei@gmail.com>; Wed, 7 Feb 2024 07:28:57 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 677C120322; Wed, 7 Feb 2024 07:28:28 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="k5BR/Drb" Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 5B8CE1F61C for <linux-kernel@vger.kernel.org>; Wed, 7 Feb 2024 07:28:24 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707290904; cv=none; b=RW9BW6cVGHfW6w5uAfiPBezNBcJzwZsFMMn6CsZq9i5zq5kkPM2+F15nb+22iAaInGIfPdK4y0/Si0VA5frETgMzxHriMCt6rIpWp+rxZ4LU4fxnLlwKFY6q3Lbj35AzhAGUF7lks+o0sjV0a2X0TS3GbTzM1fC7jE6CpV2Ee0A= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707290904; c=relaxed/simple; bh=eGtrr4VW10JCFCUBkjAf6PTQszmNmJaBy09Ocx3LzZk=; h=From:To:Cc:Subject:Date:Message-ID:MIME-Version; b=E43YO/wlDHL7VCLwD1TMefNGHEyXxhDMDhSvEoMGeHTX1NwW/J97UUFQup+0LvzQS9v8YfRJiWrKB61g8aqOvyRuoryPECSlVIbCg7FR1PRAqODo6hEbjZc76irrMHLddioj4txLTrm59b+3uoO4MjJKBdhprYsFwS8jcjV6QGE= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=k5BR/Drb; arc=none smtp.client-ip=10.30.226.201 Received: by smtp.kernel.org (Postfix) with ESMTPSA id C8DB6C433C7; Wed, 7 Feb 2024 07:28:23 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1707290903; bh=eGtrr4VW10JCFCUBkjAf6PTQszmNmJaBy09Ocx3LzZk=; h=From:To:Cc:Subject:Date:From; b=k5BR/DrbkoYmLsOW1P3a06ls7ei+TGqP/YQvryerjvBVCOJTy+Yl3eVUh5IV3yd+/ MdLE8TGDOFvDzNyAIFADTTReAU2noF5KT3WUvjBDB6yqomhmW63VFiY4JZ9Vl3Uf/s BliPXa3reIZOVMKTKSlXeQd2a/2uU7+dwwDVGM6CBZr+swA8JHASdx7h6LRE2U3AP1 bnAY4ECSU04peF60ukbSFOrYdfPOwD7uLeljTonEcS8IO5FAUmd8k1a8cKo/5CIAEI KzVx/BeXhRidQJ1RdG+fDVx+gVoEBMEnWG6cNzsgsYa/jp4951SMnILOEK+gtfFh8T y+6M3VVyA2pMw== From: Saeed Mahameed <saeed@kernel.org> To: Arnd Bergmann <arnd@arndb.de>, Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Leon Romanovsky <leonro@nvidia.com>, Jason Gunthorpe <jgg@nvidia.com>, Jiri Pirko <jiri@nvidia.com>, Leonid Bloch <lbloch@nvidia.com>, Itay Avraham <itayavr@nvidia.com>, Jakub Kicinski <kuba@kernel.org>, Saeed Mahameed <saeedm@nvidia.com>, David Ahern <dsahern@kernel.org>, Aron Silverton <aron.silverton@oracle.com>, Christoph Hellwig <hch@infradead.org>, andrew.gospodarek@broadcom.com, linux-kernel@vger.kernel.org Subject: [PATCH V4 0/5] mlx5 ConnectX control misc driver Date: Tue, 6 Feb 2024 23:24:30 -0800 Message-ID: <20240207072435.14182-1-saeed@kernel.org> X-Mailer: git-send-email 2.43.0 Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: <linux-kernel.vger.kernel.org> List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org> List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-getmail-retrieved-from-mailbox: INBOX X-GMAIL-THRID: 1790224302066937454 X-GMAIL-MSGID: 1790224302066937454 |
Series | mlx5 ConnectX control misc driver | |
Message
Saeed Mahameed
Feb. 7, 2024, 7:24 a.m. UTC
From: Saeed Mahameed <saeedm@nvidia.com> Recap from V3 discussion: ========================= LWN has published an article on this series aptly summarizing the debate. LINK: https://lwn.net/Articles/955001/ We continue to think that mlx5ctl is reasonable and aligned with the greater kernel community values. People have pointed to the HW RAID miscdevices as a good analog. The MD developers did not get to block HW RAID configuration on the basis that it undermines their work on the software RAID stack. Further, while there is a superficial similarity to the DRM/accel debate, that was grounded in a real concern that DRM values on open source would be bypassed. That argument does not hold up here as this does come with open source userspace and the functionality mlx5ctl enables on lockdown has always been available to ConnectX users through the non-lockdown PCI sysfs. netdev has been doing just fine despite the long standing presence of this tooling and we have continued to work with Jakub on building common APIs when appropriate. mlx5 already implements a wide range of the netdev common interfaces, many of which were pushed forward by our staff - the DPLL configuration netlink being a recent example. Version history: ================ V3: https://lore.kernel.org/all/20231121070619.9836-1-saeed@kernel.org/#r V2: https://lore.kernel.org/all/20231119092450.164996-1-saeed@kernel.org/#r V1: https://lore.kernel.org/all/20231018081941.475277-1-saeed@kernel.org/#r V3->V4: - Document locking scheme for device lifecycle - Document reserved bits will always be checked for 0 by driver - Use GFP_KERNEL instead of ACCOUNT for short lived buffers - Create sysfs link to parent device under the misc device's sysfs - Remove unnecessary device name from info ioctl output - Remove reserved and future flags fields from ioctls - Precise size checking for ioctl user input. V2->V3: - Fix bad Sign-off line. - Fix kernel robot warnings, define a user ptr arg for umem_unreg ioctl instead of plain integer to simplify compat_ioctl usage. V1->V2: - Provide legal statement and sign-off for dual license use - Fix License clause to use: BSD-3-Clause OR GPL-2.0 - Fix kernel robot warnings - Use dev_dbg directly instead of umem_dbg() local wrapper - Implement .compat_ioctl for 32bit compatibility - Fix mlx5ctl_info ABI structure size and alignment - Local pointer to correct type instead of in-place cast - Check unused fields and flags are 0 on ioctl path - Use correct macro to declare scalar arg ioctl command #define MLX5CTL_IOCTL_UMEM_UNREG \ _IO(MLX5CTL_IOCTL_MAGIC, 0x3) mlx5 ConnectX control misc driver: ================================== The ConnectX HW family supported by the mlx5 drivers uses an architecture where a FW component executes "mailbox RPCs" issued by the driver to make changes to the device. This results in a complex debugging environment where the FW component has information and low level configuration that needs to be accessed to userspace for debugging purposes. Historically a userspace program was used that accessed the PCI register and config space directly through /sys/bus/pci/.../XXX and could operate these debugging interfaces in parallel with the running driver. This approach is incompatible with secure boot and kernel lockdown so this driver provides a secure and restricted interface to that. Patch breakdown: ================ 1) The first patch in the series introduces the main driver file with the implementation of a new mlx5 auxiliary device driver to run on top mlx5_core device instances, on probe it creates a new misc device and in this patch we implement the open and release fops, On open the driver would allocate a special FW UID (user context ID) restricted to debug RPCs only, where all user debug rpcs will be executed under this UID, and on release the UID will be freed. 2) The second patch adds an info ioctl that will show the allocated UID and the available capability masks of the device and the current UID, and some other useful device information such as the underlying ConnectX Example: $ sudo ./mlx5ctlu mlx5_core.ctl.0 mlx5dev: 0000:00:04.0 UCTX UID: 1 UCTX CAP: 0x3 DEV UCTX CAP: 0x3 USER CAP: 0x1d 3) Third patch will add the capability to execute debug RPCs under the special UID. In the mlx5 architecture the FW RPC commands are of the format of inbox and outbox buffers. The inbox buffer contains the command rpc layout as described in the ConnectX Programmers Reference Manual (PRM) document and as defined in linux/include/mlx5/mlx5_ifc.h. On success the user outbox buffer will be filled with the device's rpc response. For example to query device capabilities: a user fills out an inbox buffer with the inbox layout: struct mlx5_ifc_query_hca_cap_in_bits and expects an outbox buffer with the layout: struct mlx5_ifc_cmd_hca_cap_bits 4) The fourth patch adds the ability to register user memory into the ConntectX device and create a umem object that points to that memory. Command rpc outbox buffer is limited in size, which can be very annoying when trying to pull large traces out of the device. Many rpcs offer the ability to scatter output traces, contexts and logs directly into user space buffers in a single shot. The registered memory will be described by a device UMEM object which has a unique umem_id, this umem_id can be later used in the rpc inbox to tell the device where to populate the response output, e.g HW traces and other debug object queries. Example usecase, a ConnectX device coredump can be as large as 2MB. Using inline rpcs will take thousands of rpcs to get the full coredump which can consume multiple seconds. With UMEM, it can be done in a single rpc, using 2MB of umem user buffer. Other usecases with umem: - dynamic HW and FW trace monitoring - high frequency diagnostic counters sampling - batched objects and resource dumps See links below for information about user space tools that use this interface: [1] https://github.com/saeedtx/mlx5ctl [2] https://github.com/Mellanox/mstflint see: d) mstregdump utility This utility dumps hardware registers from Mellanox hardware for later analysis by Mellanox. g) mstconfig This tool sets or queries non-volatile configurable options for Mellanox HCAs. h) mstfwmanager Mellanox firmware update and query utility which scans the system for available Mellanox devices (only mst PCI devices) and performs the necessary firmware updates. i) mstreg The mlxreg utility allows users to obtain information regarding supported access registers, such as their fields License: BSD-3-Clause OR GPL-2.0 ================================ After a review of this thread [3], and a conversation with the LF, Mellanox and NVIDIA legal continue to approve the use of a Dual GPL & Permissive License for mlx5 related driver contributions. This makes it clear to future contributors that this file may be adapted and reused under BSD-3-Clause terms on other operating systems. Contributions will be handled in the normal way and the dual license will apply automatically. If people wish to contribute significantly and opt out of a dual license they may separate their GPL only contributions in dedicated files. Jason has a signing authority for NVIDIA and has gone through our internal process to get approval. [3] https://lore.kernel.org/all/20231018081941.475277-3-saeed@kernel.org/#r Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> # for legal Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Nacked-by: Jakub Kicinski <kuba@kernel.org> Saeed Mahameed (5): mlx5: Add aux dev for ctl interface misc: mlx5ctl: Add mlx5ctl misc driver misc: mlx5ctl: Add info ioctl misc: mlx5ctl: Add command rpc ioctl misc: mlx5ctl: Add umem reg/unreg ioctl .../userspace-api/ioctl/ioctl-number.rst | 1 + MAINTAINERS | 8 + drivers/misc/Kconfig | 1 + drivers/misc/Makefile | 1 + drivers/misc/mlx5ctl/Kconfig | 14 + drivers/misc/mlx5ctl/Makefile | 5 + drivers/misc/mlx5ctl/main.c | 597 ++++++++++++++++++ drivers/misc/mlx5ctl/umem.c | 322 ++++++++++ drivers/misc/mlx5ctl/umem.h | 17 + drivers/net/ethernet/mellanox/mlx5/core/dev.c | 8 + include/uapi/misc/mlx5ctl.h | 50 ++ 11 files changed, 1024 insertions(+) create mode 100644 drivers/misc/mlx5ctl/Kconfig create mode 100644 drivers/misc/mlx5ctl/Makefile create mode 100644 drivers/misc/mlx5ctl/main.c create mode 100644 drivers/misc/mlx5ctl/umem.c create mode 100644 drivers/misc/mlx5ctl/umem.h create mode 100644 include/uapi/misc/mlx5ctl.h -- 2.43.0
Comments
On Tue, 6 Feb 2024 23:24:30 -0800 Saeed Mahameed wrote: > From: Saeed Mahameed <saeedm@nvidia.com> > > Recap from V3 discussion: > ========================= > > LWN has published an article on this series aptly summarizing the debate. > LINK: https://lwn.net/Articles/955001/ > > We continue to think that mlx5ctl is reasonable and aligned with the > greater kernel community values. People have pointed to the HW RAID > miscdevices as a good analog. The MD developers did not get to block HW > RAID configuration on the basis that it undermines their work on the > software RAID stack. Further, while there is a superficial similarity to > the DRM/accel debate, that was grounded in a real concern that DRM values > on open source would be bypassed. That argument does not hold up here as > this does come with open source userspace and the functionality mlx5ctl > enables on lockdown has always been available to ConnectX users through > the non-lockdown PCI sysfs. netdev has been doing just fine despite the > long standing presence of this tooling and we have continued to work with > Jakub on building common APIs when appropriate. mlx5 already implements > a wide range of the netdev common interfaces, many of which were pushed > forward by our staff - the DPLL configuration netlink being a recent > example. I appreciate Jiri's contributions (and you hired Maciej off of Intel at some point) but don't make it sound like nVidia lead DPLL, or pushed for a common interface :| Intel posted SyncE support. I asked them make it a standalone API family: https://lore.kernel.org/netdev/20210830162909.110753ec@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/ Vadim from Meta joined in and helped a lot based on the OCP time card. Then after some delaying and weird noises y'all started participating. > mlx5 ConnectX control misc driver: > ================================== > > The ConnectX HW family supported by the mlx5 drivers uses an architecture > where a FW component executes "mailbox RPCs" issued by the driver to make > changes to the device. This results in a complex debugging environment > where the FW component has information and low level configuration that > needs to be accessed to userspace for debugging purposes. You don't explain anywhere why addressing the challenges of using debugfs in secure environments isn't the way to go. Also you keep saying debugging purposes but the driver is called "control misc driver", you need to iron out your narrative just a bit more. > Historically a userspace program was used that accessed the PCI register > and config space directly through /sys/bus/pci/.../XXX and could operate > these debugging interfaces in parallel with the running driver. > This approach is incompatible with secure boot and kernel lockdown so this > driver provides a secure and restricted interface to that. [snip] > i) mstreg > The mlxreg utility allows users to obtain information regarding > supported access registers, such as their fields So the access mstreg gives over this interface is read only? That's what your description sounds like, but given your complaints about "inability to add knobs" and "control" in the name of the driver that must be false. > Other usecases with umem: > - dynamic HW and FW trace monitoring > - high frequency diagnostic counters sampling Ah yes, the high frequency counters. Something that is definitely impossible to implement in a generic way. You were literally in the room at netconf when David Ahern described his proposal for this. Anyway, I don't want to waste any more time arguing with you. My opinion is that the kernel community is under no obligation to carry your proprietary gateway interfaces. I may be wrong, but I'm entitled to my opinion. Please do me the basic courtesy of carrying my nack on these patches: Nacked-by: Jakub Kicinski <kuba@kernel.org> and CC netdev, so I don't have to respond again :|
On 07 Feb 07:03, Jakub Kicinski wrote: >On Tue, 6 Feb 2024 23:24:30 -0800 Saeed Mahameed wrote: >> From: Saeed Mahameed <saeedm@nvidia.com> >> >> Recap from V3 discussion: >> ========================= >> >> LWN has published an article on this series aptly summarizing the debate. >> LINK: https://lwn.net/Articles/955001/ >> >> We continue to think that mlx5ctl is reasonable and aligned with the >> greater kernel community values. People have pointed to the HW RAID >> miscdevices as a good analog. The MD developers did not get to block HW >> RAID configuration on the basis that it undermines their work on the >> software RAID stack. Further, while there is a superficial similarity to >> the DRM/accel debate, that was grounded in a real concern that DRM values >> on open source would be bypassed. That argument does not hold up here as >> this does come with open source userspace and the functionality mlx5ctl >> enables on lockdown has always been available to ConnectX users through >> the non-lockdown PCI sysfs. netdev has been doing just fine despite the >> long standing presence of this tooling and we have continued to work with >> Jakub on building common APIs when appropriate. mlx5 already implements >> a wide range of the netdev common interfaces, many of which were pushed >> forward by our staff - the DPLL configuration netlink being a recent >> example. > >I appreciate Jiri's contributions (and you hired Maciej off of Intel at >some point) but don't make it sound like nVidia lead DPLL, or pushed for >a common interface :| Intel posted SyncE support. I asked them make it >a standalone API family: > I will let the stats speak for itself. $ git shortlog -sne --no-merges net/devlink and prior to commit f05bd8ebeb69 devlink: move code to a dedicated directory $ git shortlog -sne --no-merges net/core/devlink.c More than 70% of the commits are authored by more than 10 different individuals form Mellanox/Nvidia .. Ok you don't like DPLL, here is a list of some central devlink features we did push to the devlink standard API: - subfunction API and devlink infrastructure - Shared buffer API - port function and rate API - shared buffer - health >https://lore.kernel.org/netdev/20210830162909.110753ec@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/ > >Vadim from Meta joined in and helped a lot based on the OCP time card. >Then after some delaying and weird noises y'all started participating. > I remember those discussions, and I agree it is very weird when it takes 3 vendors and 2 years to get a simple devlink API for single bit flip accepted. >> mlx5 ConnectX control misc driver: >> ================================== >> >> The ConnectX HW family supported by the mlx5 drivers uses an architecture >> where a FW component executes "mailbox RPCs" issued by the driver to make >> changes to the device. This results in a complex debugging environment >> where the FW component has information and low level configuration that >> needs to be accessed to userspace for debugging purposes. > >You don't explain anywhere why addressing the challenges of using >debugfs in secure environments isn't the way to go. > I already answered this question in length in v3 here: https://lore.kernel.org/all/ZWZFm2qqhV1wKKCV@x130/ >Also you keep saying debugging purposes but the driver is called >"control misc driver", you need to iron out your narrative just >a bit more. > >> Historically a userspace program was used that accessed the PCI register >> and config space directly through /sys/bus/pci/.../XXX and could operate >> these debugging interfaces in parallel with the running driver. >> This approach is incompatible with secure boot and kernel lockdown so this >> driver provides a secure and restricted interface to that. > >[snip] > >> i) mstreg >> The mlxreg utility allows users to obtain information regarding >> supported access registers, such as their fields > >So the access mstreg gives over this interface is read only? That's >what your description sounds like, but given your complaints about >"inability to add knobs" and "control" in the name of the driver that >must be false. > Yes this is enforced by the mlx5ctl driver and FW using the special debug uid. >> Other usecases with umem: >> - dynamic HW and FW trace monitoring >> - high frequency diagnostic counters sampling > >Ah yes, the high frequency counters. Something that is definitely >impossible to implement in a generic way. You were literally in the >room at netconf when David Ahern described his proposal for this. > I was in the room and I am in support of David's idea, I like it a lot, but I don't believe we have any concrete proposal, and we don't have any use case for it in netdev for now, our use case for this is currently RDMA and HPC specific. Also siimilar to devlink we will be the first to jump in and implement the new API once defined, but this doesn't mean I need to throw away the whole driver just because one single use case will be implemented in netdev one day, and I am sure the netdev implementation won't satisfy all the use-cases of high frequency counters: Also keep in mind high frequency counters is a very small part of the debug and access capabilities the mlx5ctl interface offers. >Anyway, I don't want to waste any more time arguing with you. >My opinion is that the kernel community is under no obligation to carry >your proprietary gateway interfaces. I may be wrong, but I'm entitled >to my opinion. > Thanks, I appreciate your honesty, but I must disagree with your Nack, we provided enough argument for why we believe this approach is the right way to go, it is clear from the responses on V3 and from the LWN article that we have the community support for this open source project. >Please do me the basic courtesy of carrying my nack on these patches: > >Nacked-by: Jakub Kicinski <kuba@kernel.org> > >and CC netdev, so I don't have to respond again :| Ack. Thanks, Saeed.
On Wed, 7 Feb 2024 21:03:35 -0800 Saeed Mahameed wrote: > On 07 Feb 07:03, Jakub Kicinski wrote: > >On Tue, 6 Feb 2024 23:24:30 -0800 Saeed Mahameed wrote: > >> From: Saeed Mahameed <saeedm@nvidia.com> > >> > >> Recap from V3 discussion: > >> ========================= > >> > >> LWN has published an article on this series aptly summarizing the debate. > >> LINK: https://lwn.net/Articles/955001/ > >> > >> We continue to think that mlx5ctl is reasonable and aligned with the > >> greater kernel community values. People have pointed to the HW RAID > >> miscdevices as a good analog. The MD developers did not get to block HW > >> RAID configuration on the basis that it undermines their work on the > >> software RAID stack. Further, while there is a superficial similarity to > >> the DRM/accel debate, that was grounded in a real concern that DRM values > >> on open source would be bypassed. That argument does not hold up here as > >> this does come with open source userspace and the functionality mlx5ctl > >> enables on lockdown has always been available to ConnectX users through > >> the non-lockdown PCI sysfs. netdev has been doing just fine despite the > >> long standing presence of this tooling and we have continued to work with > >> Jakub on building common APIs when appropriate. mlx5 already implements > >> a wide range of the netdev common interfaces, many of which were pushed > >> forward by our staff - the DPLL configuration netlink being a recent > >> example. > > > >I appreciate Jiri's contributions (and you hired Maciej off of Intel at > >some point) but don't make it sound like nVidia lead DPLL, or pushed for > >a common interface :| Intel posted SyncE support. I asked them make it > >a standalone API family: > > I will let the stats speak for itself. > $ git shortlog -sne --no-merges net/devlink > and prior to commit f05bd8ebeb69 devlink: move code to a dedicated directory > $ git shortlog -sne --no-merges net/core/devlink.c > > More than 70% of the commits are authored by more than 10 different individuals > form Mellanox/Nvidia .. I'm not questioning that there are folks at nVidia who prefer to go the generic infrastructure route. Jiri and mlxsw team especially. I do worry that adding a pass thru interface will undermine them, but that wasn't my main point. > Ok you don't like DPLL, I didn't say I dislike DPLL. I think it's a very odd example for you to pick for nVidia's contribution. My recollection is: - Maciej from Intel started developing upstream API for SyncE support - I asked him to generalize it to DPLL, he started working on it - nVidia expressed interest in creating a common interface, we thought it'd be great to align vendors - nVidia hired Maciej from Intel, shutting down Intel's progress for a while - nVidia went AWoL, long response times, we held meetings to nudge you along, no commitments - then after months and months Jiri started helping Arkadiusz and Vadim I remember thinking at the time that it must have been a terrible experience for Intel, definitely not how cooperation upstream should look :| IDK how disconnected from upstream netdev you have to be to put that on your banner. > here is a list of some central devlink features we did > push to the devlink standard API: > > - subfunction API and devlink infrastructure > - Shared buffer API > - port function and rate API > - shared buffer > - health I guess shared buffer was important enough to mention twice? :) > >https://lore.kernel.org/netdev/20210830162909.110753ec@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/ > > > >Vadim from Meta joined in and helped a lot based on the OCP time card. > >Then after some delaying and weird noises y'all started participating. > > > > I remember those discussions, and I agree it is very weird when it > takes 3 vendors and 2 years to get a simple devlink API for single bit > flip accepted. In hindsight it was naive of us to try to wait for nVidia. It's better to let one vendor blaze the trail and others to follow. > >> i) mstreg > >> The mlxreg utility allows users to obtain information regarding > >> supported access registers, such as their fields > > > >So the access mstreg gives over this interface is read only? That's > >what your description sounds like, but given your complaints about > >"inability to add knobs" and "control" in the name of the driver that > >must be false. > > > > Yes this is enforced by the mlx5ctl driver and FW using the special > debug uid. So we trust the proprietary FW not to let the proprietary user space do something out of scope. Got it. > >> Other usecases with umem: > >> - dynamic HW and FW trace monitoring > >> - high frequency diagnostic counters sampling > > > >Ah yes, the high frequency counters. Something that is definitely > >impossible to implement in a generic way. You were literally in the > >room at netconf when David Ahern described his proposal for this. > > > > I was in the room and I am in support of David's idea, I like it a lot, > but I don't believe we have any concrete proposal, and we don't have any > use case for it in netdev for now, our use case for this is currently RDMA > and HPC specific. > > Also siimilar to devlink we will be the first to jump in and implement > the new API once defined, but this doesn't mean I need to throw away the I'm not asking to throw it away. The question is only whether you get to push it upstream and skirt subsystem rules by posting a "misc" driver without even CCing the maintainers on v1 :| > whole driver just because one single use case will be implemented in netdev > one day, and I am sure the netdev implementation won't satisfy all the > use-cases of high frequency counters: > > Also keep in mind high frequency counters is a very small part of the debug > and access capabilities the mlx5ctl interface offers. > > >Anyway, I don't want to waste any more time arguing with you. > >My opinion is that the kernel community is under no obligation to carry > >your proprietary gateway interfaces. I may be wrong, but I'm entitled > >to my opinion. > > Thanks, I appreciate your honesty, but I must disagree with your Nack, we > provided enough argument for why we believe this approach is the right > way to go, it is clear from the responses on V3 and from the LWN article > that we have the community support for this open source project. Why don't you repost it to netdev and see how many acks you get? I'm not the only netdev maintainer.
Fri, Feb 09, 2024 at 03:15:55AM CET, kuba@kernel.org wrote: >On Wed, 7 Feb 2024 21:03:35 -0800 Saeed Mahameed wrote: >> On 07 Feb 07:03, Jakub Kicinski wrote: >> >On Tue, 6 Feb 2024 23:24:30 -0800 Saeed Mahameed wrote: >> >> From: Saeed Mahameed <saeedm@nvidia.com> [...] > >> Ok you don't like DPLL, > >I didn't say I dislike DPLL. I think it's a very odd example for >you to pick for nVidia's contribution. My recollection is: > > - Maciej from Intel started developing upstream API for SyncE support > - I asked him to generalize it to DPLL, he started working on it > - nVidia expressed interest in creating a common interface, we thought > it'd be great to align vendors > - nVidia hired Maciej from Intel, shutting down Intel's progress for a while > - nVidia went AWoL, long response times, we held meetings to nudge > you along, no commitments > - then after months and months Jiri started helping Arkadiusz and Vadim > >I remember thinking at the time that it must have been a terrible >experience for Intel, definitely not how cooperation upstream should >look :| For the record, I spent huge amount of time reviewing the patchset and ended up with redesigning significant chunks of it, steering Arkadiusz and Vadim the way I felt is the correct one. Oftentimes I said to myself it would be much quicker to take the patchset over and do it myself :) Anyway, at the end, I think that the result is very good. Solid and well defined uapi, nice kernel implementation, 3 drivers implementing it, each with slightly different usecase, all clicks. If this is not good example of upstream cooperation, I'm not sure what else is... But, I don't think this is related to the misc driver discussion, I just wanted to express my pov on dpll process, when I see people talking about it :) > >IDK how disconnected from upstream netdev you have to be to put that on >your banner. [...]
On 2/8/24 7:15 PM, Jakub Kicinski wrote: >>> Ah yes, the high frequency counters. Something that is definitely >>> impossible to implement in a generic way. You were literally in the >>> room at netconf when David Ahern described his proposal for this. The key point of that proposal is host memory mapped to userspace where H/W counters land (either via direct DMA by a H/W push or a kthread/timer pulling in updates). That is similar to what is proposed here. >>> >> >> I was in the room and I am in support of David's idea, I like it a lot, >> but I don't believe we have any concrete proposal, and we don't have any >> use case for it in netdev for now, our use case for this is currently RDMA >> and HPC specific. >> >> Also siimilar to devlink we will be the first to jump in and implement >> the new API once defined, but this doesn't mean I need to throw away the > > I'm not asking to throw it away. The question is only whether you get > to push it upstream and skirt subsystem rules by posting a "misc" driver > without even CCing the maintainers on v1 :| Can you define what you mean by 'skirt subsystem rules'? That's a new one to me. BTW, there is already a broadcom driver under drivers/misc that seems to have a lot of overlap capability wise to this driver. Perhaps a Broadcom person could chime in. > >> whole driver just because one single use case will be implemented in netdev >> one day, and I am sure the netdev implementation won't satisfy all the >> use-cases of high frequency counters: >> >> Also keep in mind high frequency counters is a very small part of the debug >> and access capabilities the mlx5ctl interface offers. >> >>> Anyway, I don't want to waste any more time arguing with you. >>> My opinion is that the kernel community is under no obligation to carry >>> your proprietary gateway interfaces. I may be wrong, but I'm entitled >>> to my opinion. >> >> Thanks, I appreciate your honesty, but I must disagree with your Nack, we >> provided enough argument for why we believe this approach is the right >> way to go, it is clear from the responses on V3 and from the LWN article >> that we have the community support for this open source project. > > Why don't you repost it to netdev and see how many acks you get? > I'm not the only netdev maintainer. I'll go out on that limb and say I would have no problem ACK'ing the driver. It's been proven time and time again that these kinds of debugging facilities are needed for these kinds of complex, multifunction devices.
On Fri, 9 Feb 2024 15:42:16 -0700 David Ahern wrote: > On 2/8/24 7:15 PM, Jakub Kicinski wrote: > >> I was in the room and I am in support of David's idea, I like it a lot, > >> but I don't believe we have any concrete proposal, and we don't have any > >> use case for it in netdev for now, our use case for this is currently RDMA > >> and HPC specific. > >> > >> Also siimilar to devlink we will be the first to jump in and implement > >> the new API once defined, but this doesn't mean I need to throw away the > > > > I'm not asking to throw it away. The question is only whether you get > > to push it upstream and skirt subsystem rules by posting a "misc" driver > > without even CCing the maintainers on v1 :| > > Can you define what you mean by 'skirt subsystem rules'? That's a new > one to me. I mean that Saeed is well aware that direct FW <> user space interfaces are not allowed in netdev, so he posted this "misc" driver without CCing us, knowing we'd nack it. Maybe the baseline question is whether major subsystems are allowed to set their own rules. I think they should as historically we have a very broad range of, eh, openness in different fields. Networking is pretty open because of the necessary interoperability. > BTW, there is already a broadcom driver under drivers/misc that seems to > have a lot of overlap capability wise to this driver. Perhaps a Broadcom > person could chime in. I'm not aware. Or do you mean bcm-vk? That's a video encoder. > >> Thanks, I appreciate your honesty, but I must disagree with your Nack, we > >> provided enough argument for why we believe this approach is the right > >> way to go, it is clear from the responses on V3 and from the LWN article > >> that we have the community support for this open source project. > > > > Why don't you repost it to netdev and see how many acks you get? > > I'm not the only netdev maintainer. > > I'll go out on that limb and say I would have no problem ACK'ing the > driver. It's been proven time and time again that these kinds of > debugging facilities are needed for these kinds of complex, > multifunction devices. Can we have bare minimum of honesty and stop pretending that his is primarily about debug :|
On Fri, Feb 09, 2024 at 03:42:16PM -0700, David Ahern wrote: > On 2/8/24 7:15 PM, Jakub Kicinski wrote: > >>> Ah yes, the high frequency counters. Something that is definitely > >>> impossible to implement in a generic way. You were literally in the > >>> room at netconf when David Ahern described his proposal for this. > > The key point of that proposal is host memory mapped to userspace where > H/W counters land (either via direct DMA by a H/W push or a > kthread/timer pulling in updates). That is similar to what is proposed here. The counter experiment that inspired Saeed to write about it here was done using mlx5ctl interfaces and some other POC stuff on an RDMA network monitoring RDMA workloads, inspecting RDMA objects. So if your proposal also considers how to select RDMA object counters, control the detailed sampling hardware with RDMA stuff, and works on a netdev-free InfiniBand network, then it might be interesting. It was actually interesting research, I hope some information will be made public. > BTW, there is already a broadcom driver under drivers/misc that seems to > have a lot of overlap capability wise to this driver. Perhaps a Broadcom > person could chime in. Yeah, there are lots of examples of drivers that use this kind FW API direct to userspace. It is a common design pattern across the kernel in many subsystems. At the core it is following the general philosophy of pushing things to userspace that don't need to be in the kernel. It is more secure, more hackable and easier to deploy. It becomes a userspace decision what kind of tooling community will develop and what the ultimate user experience will be. > > Why don't you repost it to netdev and see how many acks you get? > > I'm not the only netdev maintainer. > > I'll go out on that limb and say I would have no problem ACK'ing the > driver. It's been proven time and time again that these kinds of > debugging facilities are needed for these kinds of complex, > multifunction devices. Agree as well. Ack for RDMA community. This is perfectly consistent with the subsystem's existing design of directly exposing the device to userspace. It is essential as we can't piggyback on any "generic" netdev stuff on InfiniBand HW. Further, I anticipate most of the mlx5ctl users would actually be running primarily RDMA related workloads anyhow. There is not that many people that buy these expensive cards and don't use them to their full capability. Recently at usenix Microsoft shared some details of their production network in the paper "Empowering Azure Storage with RDMA". Notably they shared that "Traffic statistics of all Azure public regions between January 18 and February 16, 2023. Traffic was measured by collecting switch counters of server-facing ports on all Top of Rack (ToR) switches. Around 70% of traffic was RDMA." It is a rare public insight into what is going on in the industry at large, and why RDMA is a significant and important subsytem. Jason
On 2/9/24 3:58 PM, Jakub Kicinski wrote: > On Fri, 9 Feb 2024 15:42:16 -0700 David Ahern wrote: >> On 2/8/24 7:15 PM, Jakub Kicinski wrote: >>>> I was in the room and I am in support of David's idea, I like it a lot, >>>> but I don't believe we have any concrete proposal, and we don't have any >>>> use case for it in netdev for now, our use case for this is currently RDMA >>>> and HPC specific. >>>> >>>> Also siimilar to devlink we will be the first to jump in and implement >>>> the new API once defined, but this doesn't mean I need to throw away the >>> >>> I'm not asking to throw it away. The question is only whether you get >>> to push it upstream and skirt subsystem rules by posting a "misc" driver >>> without even CCing the maintainers on v1 :| >> >> Can you define what you mean by 'skirt subsystem rules'? That's a new >> one to me. > > I mean that Saeed is well aware that direct FW <> user space interfaces > are not allowed in netdev, so he posted this "misc" driver without > CCing us, knowing we'd nack it. The argument you are making here is that if a device has an ethernet port, all patches must flow through netdev. Or am I misunderstanding? There are a lot of devices that toggle between IB and ethernet with only one (ignore roce here) active at a moment. MLX5 (like many) is split between drivers/net and drivers/infinband. If the debugging capabilities went through drivers/infiniband would you have the same stance? Maybe my bigger question is should drivers that can do different physical layers, or can operate without a netdev, should they be split into different drivers? drivers/misc for the PCI interface, drivers/net for ethernet interfaces and its APIs and drivers/infiniband for IB and its APIs? Generic capabilities like this debugging interface belong to the PCI device since it is generic to the device hence drivers/misc. > > Maybe the baseline question is whether major subsystems are allowed to > set their own rules. I think they should as historically we have a very > broad range of, eh, openness in different fields. Networking is pretty > open because of the necessary interoperability. Interoperability applies solely to networking protocols, not how a device is designed and certainly not to how it can be debugged. There is a clear difference between standard networking protocols (packets on a wire and its headers) and device designs. > >> BTW, there is already a broadcom driver under drivers/misc that seems to >> have a lot of overlap capability wise to this driver. Perhaps a Broadcom >> person could chime in. > > I'm not aware. Or do you mean bcm-vk? That's a video encoder. If that specific piece of S/W is a video encoder, why isn't it under drivers/video? Scanning the code it seems to me to fall under the open channel between userspace and F/W which is a common paradigm. But I do not want this to distract from this patch set; really I was just browsing existing drivers for any overlap.
On Fri, Feb 09, 2024 at 10:01:33PM -0700, David Ahern wrote: > >> BTW, there is already a broadcom driver under drivers/misc that seems to > >> have a lot of overlap capability wise to this driver. Perhaps a Broadcom > >> person could chime in. > > > > I'm not aware. Or do you mean bcm-vk? That's a video encoder. > > If that specific piece of S/W is a video encoder, why isn't it under > drivers/video? Scanning the code it seems to me to fall under the open > channel between userspace and F/W which is a common paradigm. But I do > not want this to distract from this patch set; really I was just > browsing existing drivers for any overlap. It is an "offload-engine" type of thing that was added before we had the specific subsystem for it. It should be moved to drivers/accel/ one of these days, want to volunteer to help out with that? :) thanks, greg k-h
On 2/9/24 6:01 PM, Jason Gunthorpe wrote: > On Fri, Feb 09, 2024 at 03:42:16PM -0700, David Ahern wrote: >> On 2/8/24 7:15 PM, Jakub Kicinski wrote: >>>>> Ah yes, the high frequency counters. Something that is definitely >>>>> impossible to implement in a generic way. You were literally in the >>>>> room at netconf when David Ahern described his proposal for this. >> >> The key point of that proposal is host memory mapped to userspace where >> H/W counters land (either via direct DMA by a H/W push or a >> kthread/timer pulling in updates). That is similar to what is proposed here. > > The counter experiment that inspired Saeed to write about it here was > done using mlx5ctl interfaces and some other POC stuff on an RDMA > network monitoring RDMA workloads, inspecting RDMA objects. > > So if your proposal also considers how to select RDMA object counters, > control the detailed sampling hardware with RDMA stuff, and works > on a netdev-free InfiniBand network, then it might be interesting. Response at netconf in September was mixed. As I recall Jakub for example was shaking his head at 'yet another stats proposal', but since he has referenced it a couple of times now maybe it is worth moving beyond slides to a POC. The uapi discussed was netlink (genl) based; driver hooks were not discussed. Perhaps I can get a working POC for both stacks by netdevconf in July.
On 2/11/24 4:03 AM, Greg Kroah-Hartman wrote: > On Fri, Feb 09, 2024 at 10:01:33PM -0700, David Ahern wrote: >>>> BTW, there is already a broadcom driver under drivers/misc that seems to >>>> have a lot of overlap capability wise to this driver. Perhaps a Broadcom >>>> person could chime in. >>> >>> I'm not aware. Or do you mean bcm-vk? That's a video encoder. >> >> If that specific piece of S/W is a video encoder, why isn't it under >> drivers/video? Scanning the code it seems to me to fall under the open >> channel between userspace and F/W which is a common paradigm. But I do >> not want this to distract from this patch set; really I was just >> browsing existing drivers for any overlap. > > It is an "offload-engine" type of thing that was added before we had the > specific subsystem for it. It should be moved to drivers/accel/ one of > these days, want to volunteer to help out with that? :) > Thanks for the background. As for volunteering to move it, I believe Broadcom has more kernel engineers than Enfabrica. :-)
Chmining in late after a vacation last week, but I really do not
understand the discussion this cause.
Complex devices need diagnostics, and mlx5ctl and the (free software)
userspace tool provide that. Given how it is a complex multi-subsystem
driver that exports at least RDMA, ethernet, nvme and virtio_blk
interfaces this functionality by definition can't fit into a single
subsystem.
So with my nvme co-maintainer hat on:
Acked-by: Christoph Hellwig <hch@lst.de>
to th concept (which is not a full code review!).
With my busy kernel contributor head on I have to voice my
dissatisfaction with the subsystem maintainer overreach that's causing
the troubles here. I think all maintainers can and should voice the
opinions, be those technical or political, but trying to block a useful
feature without lots of precedence because it is vaguely related to the
subsystem is not helpful. Note that this is absolutely not intended to
shut the discussion down - if we can find valid arguments why some of
this functionality should also be reported through a netdev API we
should continue that. E.g. just because we have SCSI, NVMe or product
specific passthrough interface we still try to provide useful common
interface generically.
On Wed, 14 Feb 2024 00:29:16 -0800 Christoph Hellwig wrote: > With my busy kernel contributor head on I have to voice my > dissatisfaction with the subsystem maintainer overreach that's causing > the troubles here. Overreach is unfortunate, I'd love to say "please do merge it as part of RDMA". You probably don't trust my opinion but Jason admitted himself this is primarily for RDMA. RDMA is what it is in terms of openness and all vendors trying to sell their secret magic sauce. The problem is that some RDMA stuff is built really closely on TCP, and given Jason's and co. inability to understand that good fences make good neighbors it will soon start getting into the netdev stack :| Ah, and I presume they may also want it for their DOCA products. So 80% RDMA, 15% DOCA, 5% the rest is my guess. > I think all maintainers can and should voice the > opinions, be those technical or political, but trying to block a useful > feature without lots of precedence because it is vaguely related to the > subsystem is not helpful. Not sure what you mean by "without lots of precedence" but you can ask around netdev. We have nacked such interfaces multiple times. The best proof the rule exists and is well established it is that Saeed has himself asked us a number of times to lift it. What should be expected of us is fairness and not engaging in politics. We have a clear rule against opaque user space to FW interfaces, and I don't see how we could enforce that fairly for pure Ethernet devices if big vendors get to do whatever they want. > Note that this is absolutely not intended to > shut the discussion down - if we can find valid arguments why some of > this functionality should also be reported through a netdev API we > should continue that. Once again, netdev requirements for debug are - you can do whatever you want but other than knobs clearly defined in the code interface must be read only.
On Wed, Feb 14, 2024 at 12:29:16AM -0800, Christoph Hellwig wrote: > Chmining in late after a vacation last week, but I really do not > understand the discussion this cause. > > Complex devices need diagnostics, and mlx5ctl and the (free software) > userspace tool provide that. Given how it is a complex multi-subsystem > driver that exports at least RDMA, ethernet, nvme and virtio_blk > interfaces this functionality by definition can't fit into a single > subsystem. > > So with my nvme co-maintainer hat on: > > Acked-by: Christoph Hellwig <hch@lst.de> > > to th concept (which is not a full code review!). I've been trying to figure out how to respond correctly to this over the last few days, but I share your sentiment. It's probably time to have something like this upstream for more devices. My initial concerns with something that allows direct access to hardware to send messages to FW or read/write registers over BARs were: 1. How someone working at a distro would be able to help/understand if a tool like this was run and may have programmed their hardware differently than a default driver or FW. There now exists a chance for identical systems running identical drivers and FW to behave differently due to out-of-band configuration. One thought I had was some sort of journal to note that config happened from outside, but I'm not sure there is much value there. With the ability to dump regs with devlink health it's possible to know that values may have changed, so I'm not concerned about this since that infra exists. 2. If one can make configuration changes to hardware without kernel APIs (devlink et al), will people still develop new kernel APIs? I think the answer to this is 'yes' as realistically using default tools is much better than using vendor tools for regular configuration. Even if vendors provide shortcuts to program hardware for eval/testing/debug my experience is that these are not acceptable long-term. Requests are always made to include this type of changes in future releases. So I'm not too concerned about the ossification of kernel APIs due to this being included. So if there is general agreement that this is acceptable (especially compared to other out-of-tree drivers, I think a few who find this useful should sync on the best way forward; I'm not sure a separate driver for each vendor is the right approach. If upstream (and therefore distros) are going to accept this we probably owe it to them to not have misc drivers for every different flavor of hardware out there when it might be possible to add a generic driver that can connect to a PCI device via new (auxiliary bus?) API.
On Wed, Feb 14, 2024 at 11:17:58AM -0500, Andy Gospodarek wrote: > 1. How someone working at a distro would be able to help/understand if > a tool like this was run and may have programmed their hardware > differently than a default driver or FW. This is stepping a bit further ahead of the debug focused interfaces mlx5ctl currently has and into configuration.. Obviously a generic FW RPC can do configuration too. We do have alot of industry experience with configuration already. Realistically every complex device already has on-device FLASH that has on-device FW configuration. Formalizing what already exists and is widely used isn't going to make the world worse. I'm also very certain there are actual problems with FW configuration incompatibility. eg there are PCI related configurables on mlx5 you can set that will break everything if you are not The Special User that the configuration was created for. Today everyone already deals with FW version specific behavioral differences and bugs. FW Vers + FW Vers&Config is also the current state of affairs, and is delt with about the same. IMHO > due to out-of-band configuration. One thought I had was some sort of > journal to note that config happened from outside, but I'm not sure > there is much value there. I think devices using this kind of approach need to be well behaved - like no wild (production) access to random bits of HW. Things need to be structured and reportable. There is a clear split in my mind between: - inspection debugging - invasive mutating debugging - configuration And maybe "invasive mutating debugging" taints the kernel or something like that. > With the ability to dump regs with devlink health it's possible to > know that values may have changed, so I'm not concerned about this > since that infra exists. Distros need good tooling to report the state of HW in distro problem reports. If something is lacking we should improve on it. TBH I'd pick "report current status" over "journaling" because devices have FLASH and configuration changes can typically be permanent across reboots. Also, I would generally frown on designs changing OS visible behavior without running the device through a FLR. > 2. If one can make configuration changes to hardware without kernel > APIs (devlink et al), will people still develop new kernel APIs? Sure? Why not? Every vendor has existing tooling to do this configuration stuff. mlx5's existing tooling runs in userspace and uses /sys/../resource. *Everyone* has been using this for the last 15 years. So whatever impact it has is long since baked in. > I think the answer to this is 'yes' as realistically using default tools > is much better than using vendor tools for regular configuration. Personally I think the answer is more nuanced. Users have problems they want to solve. If your problem is "provision my HW from the factory" you may be perfectlly happy with a shared userspace program that can do that for all the HW vendors at the site. No artificial kernel enforced commonality required. IMHO as kernel people we often look at the hammer we have and fall into these patterns where "only the kernel can do abstractions!" but it isn't true, we can very effectively make abstractions in userspace too. > if vendors provide shortcuts to program hardware for eval/testing/debug > my experience is that these are not acceptable long-term. Requests are always > made to include this type of changes in future releases. So I'm not too > concerned about the ossification of kernel APIs due to this being included. Nor am I. Users will ask for the things that work best for them and vendors have been historically good at responding. I also look at RDMA, which is deeply down this path of not doing common interfaces in the kernel. We have alot of robust open source now! The common interfaces that the userspace community developed are wildly different and much more useful than what the prior effort to build kernel common interfaces was creating. In fact there are now competing ideas on what those interfaces should be and alot of interesting development. The kernel common APIs that were developed before turned out to be such a straight jacket that it held back so much stuff and forced people into out-of-tree and badly harmed community forming in the userspace side. In other words, allowing userspace to have freedom has really pushed things forward in good ways. > So if there is general agreement that this is acceptable (especially > compared to other out-of-tree drivers, I think a few who find this > useful should sync on the best way forward; I'm not sure a separate > driver for each vendor is the right approach. I also like this, I don't want the outcome of this discussion to be that only mlx5ctl gets merged. I want all the HW that has this problem to have support in the mainline kernel. I want to see a robust multi-vendor userspace community that houses alot of stuff and is working to solve user problems. To me this point is the big win for the community. If we need a formally named kernel subsystem to pull that community together then lets do that. We can probably have some common ioctls and shared uABI marshaling/discovery/etc like nvme does. At the end it can't really be abstracted too much, the user facing API is really going to be "send a FW RPC" like mlx5/nvme-vendor does. Honestly it will be easier to discuss and document the overall design and what devices have to do to be compliant within a tidy subsytem with a name that people can identify as a thing. Naming things and having spaces is really important to build community around them. Certainly, I am up for this. If you have a similar usage flows I'm quite happy to work with everyone to launch a new mini-subystem. If other people are reading this and think they want to take part too please speak up. > If upstream (and therefore distros) are going to accept this we probably > owe it to them to not have misc drivers for every different flavor of > hardware out there when it might be possible to add a generic driver > that can connect to a PCI device via new (auxiliary bus?) API. Distros often prefer if they have less packaging, versioning and community to deal with :) I'm inspired by things like nvme-cli that show a nice way forward where there can be vendor plugins and a shared github with a common shared umbrella of CLI, documentation and packaging. With some co-operation from the distros we can push vendors to participate in the shared userspace, and we can push vendors from the kernel by denying kernel driver support without basic integration to the userspace (like DRM does with mesa). Thanks, Jason
On Wed, 14 Feb 2024 13:57:35 -0400 Jason Gunthorpe wrote: > There is a clear split in my mind between: > - inspection debugging > - invasive mutating debugging > - configuration Yes there's a clear split, and how are you going to enforce it on an opaque interface? Put an "evil" bit in the common header? > And maybe "invasive mutating debugging" taints the kernel or something > like that.
On Wed, Feb 14, 2024 at 10:11:26AM -0800, Jakub Kicinski wrote: > On Wed, 14 Feb 2024 13:57:35 -0400 Jason Gunthorpe wrote: > > There is a clear split in my mind between: > > - inspection debugging > > - invasive mutating debugging > > - configuration > > Yes there's a clear split, and how are you going to enforce it on > an opaque interface? Put an "evil" bit in the common header? The interface is opaque through a subsystem, it doesn't mean it is completely opaque to every driver layer in the kernel. There is still a HW specific kernel driver that delivers the FW command to the actual HW. In the mlx5 model the kernel driver stamps the command with "uid" which is effectively a security scope label. This cannot be avoided by userspace and is fundamental to why mlx5ctl is secure in a lockdown kernel. For example mlx5's FW interface has the concept of security scopes. We have several defined today: - Kernel - Userspace rdma - Userspace rdma with CAP_NET_RAW - Userspace rdma with CAP_SYS_RAWIO So we trivally add three more for the scopes I listed above. The mlx5ctl driver as posted already introduced a new scope, for example. Userspace will ask the kernel for an appropriate security scope after opening the char-device. If userspace asks for invasive then you get a taint. Issuing an invasive command without a kernel applied invasive security label will be rejected by the FW. We trust the kernel to apply the security label for the origin of the command. We trust the the device FW to implement security scopes, because these are RDMA devices and all of RDMA and all of SRIOV virtualization are totally broken if the device FW cannot be trusted to maintain security separation between scopes. Jason
On 2/9/24 10:01 PM, David Ahern wrote: > Maybe my bigger question is should drivers that can do different > physical layers, or can operate without a netdev, should they be split > into different drivers? drivers/misc for the PCI interface, drivers/net > for ethernet interfaces and its APIs and drivers/infiniband for IB and > its APIs? Generic capabilities like this debugging interface belong to > the PCI device since it is generic to the device hence drivers/misc. > I do not recall seeing a response to this line of questions. We are considering splitting our driver along these lines to upstream it given the independent nature of the capabilities and need to interact with different S/W APIs and hence different maintainers. Any reason new drivers should not be split along these lines? If the answer is no, then where is the best home for the PCI device driver piece of it? drivers/misc or drivers/accel or create a new drivers/multifcn/?
On Wed, Feb 14, 2024 at 01:31:29PM -0700, David Ahern wrote: > On 2/9/24 10:01 PM, David Ahern wrote: > > Maybe my bigger question is should drivers that can do different > > physical layers, or can operate without a netdev, should they be split > > into different drivers? drivers/misc for the PCI interface, drivers/net > > for ethernet interfaces and its APIs and drivers/infiniband for IB and > > its APIs? Generic capabilities like this debugging interface belong to > > the PCI device since it is generic to the device hence drivers/misc. > > > > I do not recall seeing a response to this line of questions. We are > considering splitting our driver along these lines to upstream it given > the independent nature of the capabilities and need to interact with > different S/W APIs and hence different maintainers. Any reason new > drivers should not be split along these lines? I think it is essential to split them up like this, it is why auxbus was created in the first place. The subsystem specific logic should live in the subsystem space. I've had some conversations where we have disagreements about which functions end up in which directories in these split up drivers, but the general principle is good. I like mlx5's split where the core is about booting and communicating with the FW and some common helpers then every subsystem has its own FW calls in its own functions. Other people liked the idea that as much "low level" code was in the core and the subsystems would call helper functions in the core that invoke the FW functions. > If the answer is no, then where is the best home for the PCI device > driver piece of it? drivers/misc or drivers/accel or create a new > drivers/multifcn/? I've had conversations about this too, I would vote for a new directory going forward. drivers/auxbus or something specifically for these core components of an auxbus based multi-subsystem driver. We likely need some reasonable rules so that the core driver remains an in-kernel component and doesn't start growing weird uapi. If you want uapi it needs to come through an appropriate subsystem specific thing, even if that is misc. Jason
On Wed, Feb 14, 2024 at 07:48:32AM -0800, Jakub Kicinski wrote: > Overreach is unfortunate, I'd love to say "please do merge it as part > of RDMA". You probably don't trust my opinion but Jason admitted himself > this is primarily for RDMA. RDMA is what it is in terms of openness and > all vendors trying to sell their secret magic sauce. Common. RDMA has two important open standards, one of them even done in IETF that most open of all standards organizations. > > I think all maintainers can and should voice the > > opinions, be those technical or political, but trying to block a useful > > feature without lots of precedence because it is vaguely related to the > > subsystem is not helpful. > > Not sure what you mean by "without lots of precedence" but you can ask Should have been with. Just about every subsystem with complex devices has this kind of direct interface for observability and co in at least some drivers.
Thu, Feb 15, 2024 at 08:00:40AM CET, hch@infradead.org wrote: >On Wed, Feb 14, 2024 at 07:48:32AM -0800, Jakub Kicinski wrote: [...] >> > I think all maintainers can and should voice the >> > opinions, be those technical or political, but trying to block a useful >> > feature without lots of precedence because it is vaguely related to the >> > subsystem is not helpful. >> >> Not sure what you mean by "without lots of precedence" but you can ask > >Should have been with. Just about every subsystem with complex devices >has this kind of direct interface for observability and co in at least >some drivers. What about configuration? How do you ensure the direct FW/HW access is used only for observability/debug purposes. I mean, if you can't, I think it is incorrect to name it that way, isn't it?
On Wed, Feb 14, 2024 at 07:48:32AM -0800, Jakub Kicinski wrote: > On Wed, 14 Feb 2024 00:29:16 -0800 Christoph Hellwig wrote: > > With my busy kernel contributor head on I have to voice my > > dissatisfaction with the subsystem maintainer overreach that's causing > > the troubles here. > > Overreach is unfortunate, I'd love to say "please do merge it as part > of RDMA". You probably don't trust my opinion but Jason admitted himself > this is primarily for RDMA. "admitted"? You make it sound like a crime. I've been very clear on this need from the RDMA community since the first posting. > The problem is that some RDMA stuff is built really closely on TCP, Huh? Since when? Are you talking about soft-iwarp? That is a reasearch project and Bernard is very responsive, if you have issues ask him and he will help. Otherwise the actual HW devices are not entangled with netdev TCP, the few iWarp devices have their own TCP implementation, in accordance with what the IETF standardized. > and given Jason's and co. inability to understand that good fences > make good neighbors it will soon start getting into the netdev stack :| I seem to recall you saying RDMA shouldn't call any netdev APIs at all. We were unable to agree on where to build the fence for this reason. > Ah, and I presume they may also want it for their DOCA products. > So 80% RDMA, 15% DOCA, 5% the rest is my guess. I don't know all details about DOCA, but what I know about runs over RDMA. > Not sure what you mean by "without lots of precedence" but you can ask > around netdev. We have nacked such interfaces multiple times. > The best proof the rule exists and is well established it is that Saeed > has himself asked us a number of times to lift it. > > What should be expected of us is fairness and not engaging in politics. > We have a clear rule against opaque user space to FW interfaces, > and I don't see how we could enforce that fairly for pure Ethernet > devices if big vendors get to do whatever they want. If your community is telling your rules are not working for them anymore, it is not nice to tell them that rules exist and cannot be questioned. Try working together toward a reasonable consensus solution. The world has changed alot, the use cases are different, the users are different, the devices are different. When Dave made that prohibition long ago it was not in a world of a multi billion transistor NIC being deployed in uniform clusters of unimaginable size. Jason
On Wed, 14 Feb 2024 23:00:40 -0800 Christoph Hellwig wrote: > On Wed, Feb 14, 2024 at 07:48:32AM -0800, Jakub Kicinski wrote: > > Overreach is unfortunate, I'd love to say "please do merge it as part > > of RDMA". You probably don't trust my opinion but Jason admitted himself > > this is primarily for RDMA. RDMA is what it is in terms of openness and > > all vendors trying to sell their secret magic sauce. > > Common. RDMA has two important open standards, one of them even done > in IETF that most open of all standards organizations. While I don't dispute that there are standards which can be read, the practical interoperability of RDMA devices is extremely low. By practical I mean having two devices from different vendors achieve any reasonable performance talking to each other. Even two devices from _the same_ vendor but different generations are unlikely to perform. Given how RDMA is deployed (uniform, greenfield/full replacement) this is entirely reasonable from the engineering perspective. But this is a bit of a vicious cycle, vendors have little incentive to interoperate, and primarily focus on adding secret sauce outside of the standard. In fact you're lucky if the vendor didn't bake some extension which requires custom switches into the NICs :( Compare that to WiFi, which is a level of standardization netdev folks are more accustomed to. You can connect a new device from vendor X to a 10 year old AP from vendor Y and it will run with high perf. Unfortunately because of the AI craze I have some experience with RDMA deployments now. Perhaps you have more, perhaps your experience differs.
On Thu, 15 Feb 2024 09:21:38 -0400 Jason Gunthorpe wrote: > On Wed, Feb 14, 2024 at 07:48:32AM -0800, Jakub Kicinski wrote: > > On Wed, 14 Feb 2024 00:29:16 -0800 Christoph Hellwig wrote: > > > With my busy kernel contributor head on I have to voice my > > > dissatisfaction with the subsystem maintainer overreach that's causing > > > the troubles here. > > > > Overreach is unfortunate, I'd love to say "please do merge it as part > > of RDMA". You probably don't trust my opinion but Jason admitted himself > > this is primarily for RDMA. > > "admitted"? You make it sound like a crime. I've been very clear on > this need from the RDMA community since the first posting. Sorry, unintentional :) Maybe it's a misunderstanding but my impression was that at least Saeed was trying hard to make this driver a common one, not just for RDMA. > > The problem is that some RDMA stuff is built really closely on TCP, > > Huh? Since when? Are you talking about soft-iwarp? That is a reasearch > project and Bernard is very responsive, if you have issues ask him and > he will help. > > Otherwise the actual HW devices are not entangled with netdev TCP, the > few iWarp devices have their own TCP implementation, in accordance > with what the IETF standardized. There are some things I know either from work or otherwise told me in confidence which I can't share. This is very frustrating for me, and probably for you, sorry :( > > and given Jason's and co. inability to understand that good fences > > make good neighbors it will soon start getting into the netdev stack :| > > I seem to recall you saying RDMA shouldn't call any netdev APIs at > all. We were unable to agree on where to build the fence for this > reason. Last time you were calling into the IPsec stack right? It's not just a basic API. IDK how to draw a line, definitely open to suggestions! > > Ah, and I presume they may also want it for their DOCA products. > > So 80% RDMA, 15% DOCA, 5% the rest is my guess. > > I don't know all details about DOCA, but what I know about runs over > RDMA. Well, since you're an RDMA person that's not really saying much about existence of other parts.
On Wed, 14 Feb 2024 14:37:55 -0400 Jason Gunthorpe wrote: > On Wed, Feb 14, 2024 at 10:11:26AM -0800, Jakub Kicinski wrote: > > On Wed, 14 Feb 2024 13:57:35 -0400 Jason Gunthorpe wrote: > > > There is a clear split in my mind between: > > > - inspection debugging > > > - invasive mutating debugging > > > - configuration > > > > Yes there's a clear split, and how are you going to enforce it on > > an opaque interface? Put an "evil" bit in the common header? > > The interface is opaque through a subsystem, it doesn't mean it is > completely opaque to every driver layer in the kernel. There is still a > HW specific kernel driver that delivers the FW command to the actual > HW. > > In the mlx5 model the kernel driver stamps the command with "uid" > which is effectively a security scope label. This cannot be avoided by > userspace and is fundamental to why mlx5ctl is secure in a lockdown > kernel. > > For example mlx5's FW interface has the concept of security scopes. We > have several defined today: > - Kernel > - Userspace rdma > - Userspace rdma with CAP_NET_RAW > - Userspace rdma with CAP_SYS_RAWIO > > So we trivally add three more for the scopes I listed above. The > mlx5ctl driver as posted already introduced a new scope, for example. > > Userspace will ask the kernel for an appropriate security scope after > opening the char-device. If userspace asks for invasive then you get a > taint. Issuing an invasive command without a kernel applied invasive > security label will be rejected by the FW. > > We trust the kernel to apply the security label for the origin of the > command. We trust the the device FW to implement security scopes, > because these are RDMA devices and all of RDMA and all of SRIOV > virtualization are totally broken if the device FW cannot be trusted > to maintain security separation between scopes. You have changed the argument. The problem Andy was raising is that users having access to low level configuration will make it impossible for distro's support to tell device configuration. There won't be any trace of activity at the OS level. To which you replied that you can differentiate between debugging and configuration on an opaque interface, _in the kernel_. Which I disagree with, obviously. And now you're saying that you can maintain security if you trust the firmware to enforce some rules. I'm not talking about security here, the evil bit is just an example of an unsound design.
On 2/15/24 6:10 PM, Jakub Kicinski wrote: >>> and given Jason's and co. inability to understand that good fences >>> make good neighbors it will soon start getting into the netdev stack :| >> >> I seem to recall you saying RDMA shouldn't call any netdev APIs at >> all. We were unable to agree on where to build the fence for this >> reason. > > Last time you were calling into the IPsec stack right? It's not just > a basic API. IDK how to draw a line, definitely open to suggestions! > In general, each subsystem should focus on proper APIs that other subsystems can leverage and not worry about how people wire them up. That NVME, RDMA/IB and io_uring want to re-use the Linux networking stack is a good thing.
On Thu, Feb 15, 2024 at 05:40:34PM -0800, Jakub Kicinski wrote: > On Wed, 14 Feb 2024 14:37:55 -0400 Jason Gunthorpe wrote: > > On Wed, Feb 14, 2024 at 10:11:26AM -0800, Jakub Kicinski wrote: > > > On Wed, 14 Feb 2024 13:57:35 -0400 Jason Gunthorpe wrote: > > > > There is a clear split in my mind between: > > > > - inspection debugging > > > > - invasive mutating debugging > > > > - configuration > > > > > > Yes there's a clear split, and how are you going to enforce it on > > > an opaque interface? Put an "evil" bit in the common header? > > > > The interface is opaque through a subsystem, it doesn't mean it is > > completely opaque to every driver layer in the kernel. There is still a > > HW specific kernel driver that delivers the FW command to the actual > > HW. > > > > In the mlx5 model the kernel driver stamps the command with "uid" > > which is effectively a security scope label. This cannot be avoided by > > userspace and is fundamental to why mlx5ctl is secure in a lockdown > > kernel. > > > > For example mlx5's FW interface has the concept of security scopes. We > > have several defined today: > > - Kernel > > - Userspace rdma > > - Userspace rdma with CAP_NET_RAW > > - Userspace rdma with CAP_SYS_RAWIO > > > > So we trivally add three more for the scopes I listed above. The > > mlx5ctl driver as posted already introduced a new scope, for example. > > > > Userspace will ask the kernel for an appropriate security scope after > > opening the char-device. If userspace asks for invasive then you get a > > taint. Issuing an invasive command without a kernel applied invasive > > security label will be rejected by the FW. > > > > We trust the kernel to apply the security label for the origin of the > > command. We trust the the device FW to implement security scopes, > > because these are RDMA devices and all of RDMA and all of SRIOV > > virtualization are totally broken if the device FW cannot be trusted > > to maintain security separation between scopes. > > You have changed the argument. I explained how the technical bits of a part work, you clipped out my answer to Andy's concern. > The problem Andy was raising is that users having access to low level > configuration will make it impossible for distro's support to tell > device configuration. There won't be any trace of activity at the OS > level. I responded to that by saying the answer is to have robust dumping of the device configuration and suggested a taint bit if changes are made to the device outside that support envelope and can't be captured in the dumps. This first part is already what everyone already does. There is some supported configuration in flash and there are tools to dump and inspect this. The field teams understand they need to look at that, and existing data collection tools already capture this stuff. I don't view we have a real problem here. The step beyond Andy was talking about is the hypothetical "what if you touch random unsafe registers directly or something" Which probably shouldn't be allowed on a lockdown kernel, but assuming a device could do so safely, my answer was to trigger a taint. Then you asked how do you trigger a taint if the kernel doesn't parse the commands. > To which you replied that you can differentiate between debugging and > configuration on an opaque interface, _in the kernel_. I did not say "_in the kernel_" meaning the kernel would do it, I meant the kernel would ensure it is done. The kernel delegates the differentiation to the FW and it trusts the FW to do that work for it. > Which I disagree with, obviously. I don't know why? How is it obvious? > And now you're saying that you can maintain security if you trust > the firmware to enforce some rules. Right. The userspace sends a command. The kernel tags it with the permission the userspace has. The FW parses the command and checks the kernel supplied permission against the command content and permits it. We trust the FW to do the restriction on behalf of the kernel. The restriction we are talking about is containing the userspace to only operate within the distro's support envelope. The kernel can ask the FW to enforce that rule as a matter of security and trust the FW to do so. Jason
On Thu, Feb 15, 2024 at 05:00:46PM -0800, Jakub Kicinski wrote: > But this is a bit of a vicious cycle, vendors have little incentive > to interoperate, and primarily focus on adding secret sauce outside of > the standard. In fact you're lucky if the vendor didn't bake some > extension which requires custom switches into the NICs :( This may all seem shocking if you come from the netdev world, but this has been normal for HPC networking for the last 30 years at least. My counter perspective would be that we are currently in a pretty good moment for HPC industry because we actually have open source implementations for most of it. In fact most actual deployments are running something quite close to the mainline open source stack. The main hold out right now is Cray/HPE's Slingshot networking family (based on ethernet apparently), but less open source. I would say the HPC community has a very different community goal post that netdev land. Make your thing, whatever it is. Come with an open kernel driver, a open rdma-core, a open libfabric/ucx and plug into the open dpdk/nccl/ucx/libfabric layer and demonstrate your thing works with openmpi/etc applications. Supporting that open stack is broadly my north star for the kernel perspective as Mesa is to DRM. Several points of this chain are open industry standards driven by technical working group communities. This is what the standardization and interoperability looks like here. It is probably totally foreign from a netdev view point, far less focus on the wire protocol, devices and kernel. Here the focus is on application and software interoperability. Still, it is open in a pretty solid way. Jason
On Thu, Feb 15, 2024 at 05:10:13PM -0800, Jakub Kicinski wrote: > On Thu, 15 Feb 2024 09:21:38 -0400 Jason Gunthorpe wrote: > > On Wed, Feb 14, 2024 at 07:48:32AM -0800, Jakub Kicinski wrote: > > > On Wed, 14 Feb 2024 00:29:16 -0800 Christoph Hellwig wrote: > > > > With my busy kernel contributor head on I have to voice my > > > > dissatisfaction with the subsystem maintainer overreach that's causing > > > > the troubles here. > > > > > > Overreach is unfortunate, I'd love to say "please do merge it as part > > > of RDMA". You probably don't trust my opinion but Jason admitted himself > > > this is primarily for RDMA. > > > > "admitted"? You make it sound like a crime. I've been very clear on > > this need from the RDMA community since the first posting. > > Sorry, unintentional :) Maybe it's a misunderstanding but my impression > was that at least Saeed was trying hard to make this driver a common > one, not just for RDMA. The hardware is common, this is a driver to talk to the shared FW. I don't know how you'd do just one when netdev is effectively an RDMA application running in the kernel, from the perspective of the FW. There is no real line between these things beyond the artificial one we have created in the kernel. > > > The problem is that some RDMA stuff is built really closely on TCP, > > > > Huh? Since when? Are you talking about soft-iwarp? That is a reasearch > > project and Bernard is very responsive, if you have issues ask him and > > he will help. > > > > Otherwise the actual HW devices are not entangled with netdev TCP, the > > few iWarp devices have their own TCP implementation, in accordance > > with what the IETF standardized. > > There are some things I know either from work or otherwise told me > in confidence which I can't share. This is very frustrating for > me, and probably for you, sorry :( Well, all I can say is I know of no forthcoming RDMA things with any different relationship to TCP. I think if someone wants to more TCP they will have a hard time, and I'm not inclined to seriously help anyone get more TCP into RDMA. iWarp is trouble enough already. > > I seem to recall you saying RDMA shouldn't call any netdev APIs at > > all. We were unable to agree on where to build the fence for this > > reason. > > Last time you were calling into the IPsec stack right? It's not just > a basic API. IDK how to draw a line, definitely open to suggestions! I thought the two halfs of the mlx5 driver were co-ordinating their usage of the shared HW around the ipsec configuration pushed into the device by netdev xfrm. > > > Ah, and I presume they may also want it for their DOCA products. > > > So 80% RDMA, 15% DOCA, 5% the rest is my guess. > > > > I don't know all details about DOCA, but what I know about runs over > > RDMA. > > Well, since you're an RDMA person that's not really saying much > about existence of other parts. <shrug> why does DOCA matter? Should we have not done io_uring because Oracle might use it? Besides, from what I know about DOCA it is almost all data plane stuff and RDMA fully covers that already.. Regards, Jason
On Wed, Feb 14, 2024 at 01:57:35PM -0400, Jason Gunthorpe wrote: > I also like this, I don't want the outcome of this discussion to be > that only mlx5ctl gets merged. I want all the HW that has this problem > to have support in the mainline kernel. To this end here is my proposal to move forward with a new mini-subsystem to provide rules for this common approach. Get the existing tools out of hacky unstructured direct hardware access via /sys/XX and into a formalized lockdown compatible system. I've talked to enough people now to think we have a critical mass. =============== fwctl subsystem =============== Overview -------- Modern devices contain extensive amounts of FW, and in many cases, are largely software defined pieces of hardware. The evolution of this approach is largely a reaction to Moore's Law where a chip tape out is now highly expensive, and the chip design is extremely large. Replacing fixed HW logic with a flexible and tightly coupled FW/HW combination is an effective risk mitigation against chip respin. Problems in the HW design can be counteracted in device FW. This is especially true for devices which present a stable and backwards compatible interface to the operating system driver (such as NVMe). The FW layer in devices has grown to incredible sizes and devices frequently integrate clusters of fast processors to run it. For example, mlx5 devices have over 30MB of FW code, and big configurations operate with over1GB of FW managed runtime state. The availability of such a flexible layer has created quite a variety in the industry where single pieces of silicon are now configurable software defined devices and can operate in quite different ways depending on the need. Further we often see cases where specific sites wish to operate devices in ways that are highly specialized and can only run an application that has been tailored to their unique configuration. Further, devices have become multi-functional and integrated they no longer fit neatly into the kernel's division of subsystems. Modern multi-functional devices have drivers, such as bnxt/ice/mlx5/pds, that span many subsystems while sharing the underlying hardware using the auxiliary device system. All together this creates a challenge for the operating system, where devices have an expansive FW environment that needs robust vendor-specific debugging support, and FW driven functionality that is not well suited to “generic” interfaces. fwctl seeks to allow access to the full device functionality from user space in the areas of debuggability, management, and first-boot/nth-boot provisioning. fwctl is aimed at the common device design pattern where the OS and FW communicate via an RPC message layer constructed with a queue or mailbox scheme. In this case the driver will typically have some layer to deliver RPC messages and collect RPC responses from device FW. The in-kernel subsystem drivers that operate the device for its primary purposes will use these RPCs to build their drivers, but devices also usually have a set of ancillary RPCs that don't really fit into any specific subsystem. For example, a HW RAID controller is primarily operated by the block layer but also comes with a set of RPCs to administer the construction of drives within the HW RAID. In the past when devices were more single function individual subsystems would grow different approaches to solving some of these common problems, for instance monitoring device health, manipulating its FLASH, debugging the FW, provisioning, all have various unique interfaces across the kernel. fwctl's purpose is to define a common set of limited rules, described below, that allow user space to securely construct and execute RPCs inside device FW. The rules serve as an agreement between the operating system and FW on how to correctly design the RPC interface. As a uAPI the subsystem provides a thin layer of discovery and a generic uAPI to deliver the RPCs and collect the response. It supports a system of user space libraries and tools which will use this interface to control the device. Scope of Action --------------- fwctl drivers are strictly restricted to being a way to operate the device FW. It is not an avenue to access random kernel internals, or other operating system SW states. fwctl instances must operate on a well-defined device function, and the device should have a well-defined security model for what scope within the device the function is permitted to access. For instance, the most complex PCIe device today may broadly have several function level scopes: 1. A privileged function with full access to the on-device global state and configuration 2. Multiple hypervisor function with control over itself and child functions used with VMs 3. Multiple VM function tightly scoped within the VM The device may create a logical parent/child relationship between these scopes, for instance a child VM's FW may be within the scope of the hypervisor FW. It is quite common in the VFIO world that the hypervisor environment has a complex provisioning/profiling/configuration responsibility for the function VFIO assigns to the VM. Further, within the function, devices often have RPC commands that fall within some general scopes of action: 1. Access to function & child configuration, flash, etc that becomes live at a function reset. 2. Access to function & child runtime configuration that kernel drivers can discover at runtime. 3. Read only access to function debug information that may report on FW objects in the function & child, including FW objects owned by other kernel subsystems. 4. Write access to function & child debug information strictly compatible with the principles of kernel lockdown and kernel integrity protection. Triggers a kernel Taint. 5. Full debug device access. Triggers a kernel Taint, requires CAP_SYS_RAWIO. Limitation to the appropriate scope must be performed by the kernel. There are many things this interface must not allow user space to do (without a Taint or CAP), broadly derived from the principles of kernel lockdown. Some examples: 1. DMA to/from arbitrary memory, hang the system, run code in the device, or otherwise compromise device or system security and integrity. 2. Provide an abnormal “back door” to kernel drivers. No manipulation of kernel objects owned by kernel drivers. 3. Directly configure or otherwise control kernel drivers. The kernel driver can react to the device configuration at function reset/driver load time, but otherwise should not be coupled to fwctl. 4. Operate the HW in a way that overlaps with the core purpose of another primary kernel subsystem, such as read/write to LBAs, send/receive of Network packets, or operate an accelerator's data plane. fwctl is not a replacement for device direct access subsystems like uacce or VFIO. Security Response ----------------- The kernel remains the gatekeeper for this interface. If violations of the scopes, security or isolation principles are found, we have options to let devices fix them with a FW update, push a kernel patch to parse and block RPC commands or push a kernel patch to block entire firmware versions, or devices. While the kernel can always directly parse and restrict RPCs, it is expected that the existing kernel pattern of allowing drivers to delegate validation to FW to be a useful design. fwctl Driver design ------------------- In many cases a fwctl driver is going to be part of a larger cross-subsystem device possibly using the auxiliary_device mechanism. In that case several subsystems are going to be sharing the same device and FW interface layer so the device design must already provide for isolation and co-operation between kernel subsystems. fwctl should fit into that same model. Part of the driver should include a description of how its scope restrictions and security model work. The driver and FW together must ensure that RPCs provided by user space are mapped to the appropriate scope and that the user space has rights to work on that scope. Broadly there are two approaches drivers can take: Have the kernel provide a scope label to the FW along with the user space RPC and FW will parse and enforce the scope Have the kernel parse the RPC and enforce the scope The driver and FW must co-operate to ensure that either fwctl cannot allocate any FW resources, or any resources it does allocate are freed on FD closure. A driver primarily constructed around FW RPCs may find that its core PCI function and RPC layer belongs under fwctl with auxiliary devices connecting to other subsystems. User space Operation -------------------- To be written in detail, broadly: - A “fwctl” class - Each fwctl device driver makes a fwctl instance with a struct device and sysfs presence in its class and a char device /dev/fwctl/XX - Some sysfs files per-instance describing what the device is and bootstrap details needed to operate and format RPCs. - fds opened from /dev/fwctl/XX support an IOCTL to specify the scope, execute an RPC and return the response. - A security layer ensuring that operations from user space are restricted and contained within the defined scopes. User space Community -------------------- Drawing inspiration from nvme-cli, participating in the kernel side must come with a user space in a common TBD git tree, at a minimum to usefully operate the kernel driver. Providing such an implementation is a pre-condition to merging a kernel driver. The goal is to build user space community around some of the shared problems we all have, and ideally develop some common user space programs with some starting themes of: - Device in-field debugging - HW provisioning - VFIO child device profiling before VM boot - Confidential Compute topics (attestation, secure provisioning) That stretches across all subsystems in the kernel. fwupd is a great example of how an excellent user space experience can emerge out of kernel-side diversity. Existing Similar Examples ------------------------- The approach described in this document is not a new idea. Direct, or near direct device access has been offered by the kernel in different areas for decades. With more devices wanting to follow this design pattern it is becoming clear that it is not entirely well understood and, more importantly, the security considerations are not well defined or agreed upon. Some examples: - HW RAID controllers. This includes RPCs to do things like compose drives into a RAID volume, configure RAID parameters, monitor the HW and more. - Baseboard managers. RPCs for configuring settings in the device and more - NVMe vendor command capsules. nvme-cli provides access to some monitoring functions that vendors have defined, but more exists. - DRM allows user space drivers to send commands to the device via kernel mediation - RDMA allows user space drivers to directly push commands to the device without kernel involvement Various “raw” APIs, raw HID (SDL2), raw USB, NVMe Generic Interface, etc The first 3 would be examples of areas that fwctl intends to cover. Some key lessons learned from these past efforts are the importance of having a common user space project to use as a pre-condition for obtaining a kernel driver. Developing good community around useful software in user space is key to getting vendor companies to fund participation.