[V3,5/5] misc: mlx5ctl: Add umem reg/unreg ioctl

Message ID 20231121070619.9836-6-saeed@kernel.org
State New
Series: mlx5 ConnectX control misc driver

Commit Message

Saeed Mahameed Nov. 21, 2023, 7:06 a.m. UTC
  From: Saeed Mahameed <saeedm@nvidia.com>

The command rpc outbox buffer is limited in size, which can be very
annoying when trying to pull large traces out of the device. Many
device rpcs offer the ability to scatter output traces, contexts
and logs directly into user space buffers in a single shot.

Allow the user to register user memory space with the device, so the
device may dump information directly into user memory.

The registered memory is described by a device UMEM object which has a
unique umem_id; this umem_id can later be used in the rpc inbox to
tell the device where to populate the response output, e.g. HW traces
and other debug object queries.

To do so, this patch introduces two ioctls (a userspace usage sketch
follows below):

MLX5CTL_IOCTL_UMEM_REG(va_address, size):
 - calculate page fragments from the user-provided virtual address
 - pin the pages and allocate an sg list
 - dma map the sg list
 - create a UMEM device object that points to the dma addresses
 - add a driver umem object to an xarray database for bookkeeping
 - return the UMEM ID to the user so it can be used in subsequent rpcs

MLX5CTL_IOCTL_UMEM_UNREG(umem_id):
 - the user provides a previously allocated umem ID
 - unwind all of the above
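
To illustrate the flow end to end, here is a minimal userspace sketch
based on the UAPI added below. It is not part of this patch; the device
node path and the installed header location are assumptions, and error
handling is mostly elided:

  #include <fcntl.h>
  #include <stdint.h>
  #include <stdlib.h>
  #include <sys/ioctl.h>
  #include <unistd.h>
  #include <misc/mlx5ctl.h>  /* assumed install path of the new uapi header */

  int main(void)
  {
          size_t len = 2UL * 1024 * 1024;  /* e.g. a coredump-sized buffer */
          void *buf = aligned_alloc(sysconf(_SC_PAGESIZE), len);
          struct mlx5ctl_umem_reg reg = {0};    /* flags/reserved must be 0 */
          struct mlx5ctl_umem_unreg unreg = {0};
          /* assumed device node name */
          int fd = open("/dev/mlx5ctl-mlx5_core.ctl.0", O_RDWR);

          if (fd < 0 || !buf)
                  return 1;

          reg.addr = (uintptr_t)buf; /* pinned with FOLL_WRITE by the driver */
          reg.len = len;
          if (ioctl(fd, MLX5CTL_IOCTL_UMEM_REG, &reg) < 0)
                  return 1;

          /* reg.umem_id now names the device UMEM object; pass it in an
           * rpc inbox via MLX5CTL_IOCTL_CMDRPC so the device scatters its
           * response directly into buf, then unregister.
           */
          unreg.umem_id = reg.umem_id;
          ioctl(fd, MLX5CTL_IOCTL_UMEM_UNREG, &unreg);
          close(fd);
          free(buf);
          return 0;
  }

The buffer stays pinned, and is accounted against RLIMIT_MEMLOCK, for
the lifetime of the registration; it is released on UMEM_UNREG or when
the fd is closed.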

Example use case: a ConnectX device coredump can be as large as 2MB.
Using inline rpcs, it would take thousands of rpcs to get the full
coredump, which can take multiple seconds.

With UMEM, it can be done in a single rpc, using a 2MB umem user buffer.

$ ./mlx5ctlu mlx5_core.ctl.0 coredump --umem_size=$(( 2 ** 20 ))

00 00 00 00 01 00 20 00 00 00 00 04 00 00 48 ec
00 00 00 08 00 00 00 00 00 00 00 0c 00 00 00 03
00 00 00 10 00 00 00 00 00 00 00 14 00 00 00 00
....
00 50 0b 3c 00 00 00 00 00 50 0b 40 00 00 00 00
00 50 0b 44 00 00 00 00 00 50 0b 48 00 00 00 00
00 50 0c 00 00 00 00 00

INFO : Core dump done
INFO : Core dump size 831304
INFO : Core dump address 0x0
INFO : Core dump cookie 0x500c04
INFO : More Dump 0

Other use cases are: dynamic HW and FW trace monitoring, high-frequency
diagnostic counter sampling, and batched object and resource dumps.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
 drivers/misc/mlx5ctl/Makefile |   1 +
 drivers/misc/mlx5ctl/main.c   |  99 +++++++++++
 drivers/misc/mlx5ctl/umem.c   | 322 ++++++++++++++++++++++++++++++++++
 drivers/misc/mlx5ctl/umem.h   |  17 ++
 include/uapi/misc/mlx5ctl.h   |  22 +++
 5 files changed, 461 insertions(+)
 create mode 100644 drivers/misc/mlx5ctl/umem.c
 create mode 100644 drivers/misc/mlx5ctl/umem.h
  

Comments

Jakub Kicinski Nov. 21, 2023, 8:44 p.m. UTC | #1
On Mon, 20 Nov 2023 23:06:19 -0800 Saeed Mahameed wrote:
> high frequency diagnostic counters

So is it a debug driver or not a debug driver?

Because I'm pretty sure some people want to have access to high freq
counters in production, across their fleet. What's worse David Ahern
has been pitching a way of exposing device counters which would be
common across netdev.

Definite nack on this patch.
  
Saeed Mahameed Nov. 21, 2023, 9:04 p.m. UTC | #2
On 21 Nov 12:44, Jakub Kicinski wrote:
>On Mon, 20 Nov 2023 23:06:19 -0800 Saeed Mahameed wrote:
>> high frequency diagnostic counters
>
>So is it a debug driver or not a debug driver?
>

High frequency _diagnostic_ counters are a very useful tool for
debugging a high-performance chip. So yes, this is for diagnostics/debug.

>Because I'm pretty sure some people want to have access to high freq
>counters in production, across their fleet. What's worse David Ahern
>has been pitching a way of exposing device counters which would be
>common across netdev.
>

This is not netdev; this driver is to support ConnectX chips and SoCs
with any stack (netdev/rdma/vdpa/virtio), internal chip units and
acceleration engines, and on top of that ARM core diagnostics in the
case of BlueField DPUs.

I am not looking to count netdev ethernet packets in this driver.

I am also pretty sure David will want an interface to access counters
other than netdev ones, to get more visibility into how a specific
chip is behaving.

>Definite nack on this patch.

Based on what ?
  
Jakub Kicinski Nov. 21, 2023, 10:10 p.m. UTC | #3
On Tue, 21 Nov 2023 13:04:06 -0800 Saeed Mahameed wrote:
> On 21 Nov 12:44, Jakub Kicinski wrote:
>> On Mon, 20 Nov 2023 23:06:19 -0800 Saeed Mahameed wrote:  
>>> high frequency diagnostic counters  
>>
>> So is it a debug driver or not a debug driver?
> 
> High frequency _diagnostic_ counters are a very useful tool for
> debugging a high performance chip. So yes this is for diagnostics/debug.

You keep saying debugging but if it's expected to run on all servers in
the fleet _monitoring_ performance, then it's a very different thing.
To me it certainly moves this driver from "debug thing loaded when
things fail" to the "always loaded in production" category.
  
David Ahern Nov. 21, 2023, 10:18 p.m. UTC | #4
On 11/21/23 1:04 PM, Saeed Mahameed wrote:
> On 21 Nov 12:44, Jakub Kicinski wrote:
>> On Mon, 20 Nov 2023 23:06:19 -0800 Saeed Mahameed wrote:
>>> high frequency diagnostic counters
>>
>> So is it a debug driver or not a debug driver?
>>
> 
> High frequency _diagnostic_ counters are a very useful tool for
> debugging a high performance chip. So yes this is for diagnostics/debug.
> 
>> Because I'm pretty sure some people want to have access to high freq
>> counters in production, across their fleet. What's worse David Ahern
>> has been pitching a way of exposing device counters which would be
>> common across netdev.
.

For context on the `what's worse ...` comment for those who have not
seen the netconf slides:
https://netdev.bots.linux.dev/netconf/2023/david.pdf

and I am having a hard time parsing Kuba's intent with that comment here
(knowing you did not like the pitch I made at netconf :-))


> 
> This is not netdev, this driver is to support ConnectX chips and SoCs
> with any stack, netdev/rdma/vdpa/virtio and internal chip units and
> acceleration engines, add to that ARM core diagnostics in case of
> Blue-Field DPUs.
> I am not looking for counting netdev ethernet packets in this driver.
> 
> I am also pretty sure David will also want an interface to access other
> than netdev counters, to get more visibility on how a specific chip is
> behaving.

yes, and h/w counters were part of the proposal. One thought is to
leverage userspace registered memory with the device vs mapping bar
space, but we have not moved beyond a theoretical discussion at this point.

> 
>> Definite nack on this patch.
> 
> Based on what ?

It's a generic interface argument?
  
Saeed Mahameed Nov. 21, 2023, 10:46 p.m. UTC | #5
On 21 Nov 14:18, David Ahern wrote:
>On 11/21/23 1:04 PM, Saeed Mahameed wrote:
>> On 21 Nov 12:44, Jakub Kicinski wrote:
>>> On Mon, 20 Nov 2023 23:06:19 -0800 Saeed Mahameed wrote:
>>>> high frequency diagnostic counters
>>>
>>> So is it a debug driver or not a debug driver?
>>>
>>
>> High frequency _diagnostic_ counters are a very useful tool for
>> debugging a high performance chip. So yes this is for diagnostics/debug.
>>
>>> Because I'm pretty sure some people want to have access to high freq
>>> counters in production, across their fleet. What's worse David Ahern
>>> has been pitching a way of exposing device counters which would be
>>> common across netdev.
>.
>
>For context on the `what's worse ...` comment for those who have not
>seen the netconf slides:
>https://netdev.bots.linux.dev/netconf/2023/david.pdf
>
>and I am having a hard time parsing Kuba's intent with that comment here
>(knowing you did not like the pitch I made at netconf :-))
>
>
>>
>> This is not netdev, this driver is to support ConnectX chips and SoCs
>> with any stack, netdev/rdma/vdpa/virtio and internal chip units and
>> acceleration engines, add to that ARM core diagnostics in case of
>> Blue-Field DPUs.
>> I am not looking for counting netdev ethernet packets in this driver.
>>
>> I am also pretty sure David will also want an interface to access other
>> than netdev counters, to get more visibility on how a specific chip is
>> behaving.
>
>yes, and h/w counters were part of the proposal. One thought is to
>leverage userspace registered memory with the device vs mapping bar
>space, but we have not moved beyond a theoretical discussion at this point.
>
>>
>>> Definite nack on this patch.
>>
>> Based on what ?
>
>It's a generic interface argument?
>

For this driver, the diagnostic counters are only a small part of the
debug utilities it provides, so it is not fair to nak this patch based
on one use case; we need this driver to also dump other things like
core dumps, FW contexts, internal objects, register dumps, resource dumps,
etc.

This patch's original purpose was to allow core dumps: since a core dump
can be up to 2MB of memory, without this patch we won't have the core dump
ability, which is more important for debugging than diagnostic counters.

You can find more here:
https://github.com/saeedtx/mlx5ctl#mlx5ctl-userspace-linux-debug-utilities-for-mlx5-connectx-devices

For diagnostic counters, I am all for continuing the discussion on a
generic interface, but it's irrelevant to this submission.
  
Saeed Mahameed Nov. 21, 2023, 10:52 p.m. UTC | #6
On 21 Nov 14:10, Jakub Kicinski wrote:
>On Tue, 21 Nov 2023 13:04:06 -0800 Saeed Mahameed wrote:
>> On 21 Nov 12:44, Jakub Kicinski wrote:
>>> On Mon, 20 Nov 2023 23:06:19 -0800 Saeed Mahameed wrote:
>>>> high frequency diagnostic counters
>>>
>>> So is it a debug driver or not a debug driver?
>>
>> High frequency _diagnostic_ counters are a very useful tool for
>> debugging a high performance chip. So yes this is for diagnostics/debug.
>
>You keep saying debugging but if it's expected to run on all servers in
>the fleet _monitoring_ performance, then it's a very different thing.
>To me it certainly moves this driver from "debug thing loaded when
>things fail" to the "always loaded in production" category.

Exactly, only when things fail or the user wants to debug something.
  
For your example, you can monitor network performance via standard netdev
tools; once you start experiencing hiccups, you can use this driver and
the corresponding tools to quickly grab HW debug information, useful for
further root-causing and analyzing the network hiccups.

Again, this is only one use case; the driver is intended to provide any
debug information, not only diagnostic counters or monitoring tools. The
goal of this driver is not limited to the one use case you have in mind.
  
Jason Gunthorpe Nov. 21, 2023, 11:46 p.m. UTC | #7
On Tue, Nov 21, 2023 at 12:44:56PM -0800, Jakub Kicinski wrote:
> On Mon, 20 Nov 2023 23:06:19 -0800 Saeed Mahameed wrote:
> > high frequency diagnostic counters
> 
> So is it a debug driver or not a debug driver?

In the part you decided not to quote, Saeed explained how the main
purpose of the generic DMA-to-userspace mechanism is to transfer FW
trace, FW memory copies and other large data dumps.

The thing with generic stuff is that you can use it for lots of things if
you are so inclined. Saeed gave many examples. I think you took it the
wrong way, as I am not aware of any plan for actual high-speed,
netdev-relevant counters in a performance monitoring application. It
isn't that kind of "high speed".

The main interest is micro-architectural debugging information, the kind
that is opaque unless you can reference the RTL to understand what it
means. It is "high speed" in the sense that executing a FW command per
register/counter would be offensively slow compared to executing a FW
command to bulk-DMA a cluster of micro-architecture registers/etc. in
the device.

The design is so generic because it is a debug interface that we want
to be always available and always fully functional. Mellanox ships new
FW and new chips at a rapid rate; we do not want to be changing the
kernel driver every time we do anything, as that would never get
backported into production kernels across all our supported customers
fast enough. Debug features that a field support engineer cannot
access simply do not exist.

Debugging is challenging. mlx5 is the most popular datacenter NIC in the
world. We have so many insane problems, you wouldn't believe it. I just
spent 8 months leading a debug that turned out to be a qemu defect
(thanks Paolo for all the help!!). This debug data and flexibility are
critical to making these hugely complex systems work.

Jason
  

Patch

diff --git a/drivers/misc/mlx5ctl/Makefile b/drivers/misc/mlx5ctl/Makefile
index b5c7f99e0ab6..f35234e931a8 100644
--- a/drivers/misc/mlx5ctl/Makefile
+++ b/drivers/misc/mlx5ctl/Makefile
@@ -2,3 +2,4 @@ 
 
 obj-$(CONFIG_MLX5CTL) += mlx5ctl.o
 mlx5ctl-y := main.o
+mlx5ctl-y += umem.o
diff --git a/drivers/misc/mlx5ctl/main.c b/drivers/misc/mlx5ctl/main.c
index e7776ea4bfca..7e6d6da26a79 100644
--- a/drivers/misc/mlx5ctl/main.c
+++ b/drivers/misc/mlx5ctl/main.c
@@ -12,6 +12,8 @@ 
 #include <linux/atomic.h>
 #include <linux/refcount.h>
 
+#include "umem.h"
+
 MODULE_DESCRIPTION("mlx5 ConnectX control misc driver");
 MODULE_AUTHOR("Saeed Mahameed <saeedm@nvidia.com>");
 MODULE_LICENSE("Dual BSD/GPL");
@@ -30,6 +32,8 @@  struct mlx5ctl_fd {
 	u16 uctx_uid;
 	u32 uctx_cap;
 	u32 ucap; /* user cap */
+
+	struct mlx5ctl_umem_db *umem_db;
 	struct mlx5ctl_dev *mcdev;
 	struct list_head list;
 };
@@ -115,6 +119,12 @@  static int mlx5ctl_open_mfd(struct mlx5ctl_fd *mfd)
 	if (uid < 0)
 		return uid;
 
+	mfd->umem_db = mlx5ctl_umem_db_create(mdev, uid);
+	if (IS_ERR(mfd->umem_db)) {
+		mlx5ctl_release_uid(mcdev, uid);
+		return PTR_ERR(mfd->umem_db);
+	}
+
 	mfd->uctx_uid = uid;
 	mfd->uctx_cap = cap;
 	mfd->ucap = ucap;
@@ -129,6 +139,7 @@  static void mlx5ctl_release_mfd(struct mlx5ctl_fd *mfd)
 {
 	struct mlx5ctl_dev *mcdev = mfd->mcdev;
 
+	mlx5ctl_umem_db_destroy(mfd->umem_db);
 	mlx5ctl_release_uid(mcdev,  mfd->uctx_uid);
 }
 
@@ -323,6 +334,86 @@  static int mlx5ctl_cmdrpc_ioctl(struct file *file,
 	return err;
 }
 
+static int mlx5ctl_ioctl_umem_reg(struct file *file,
+				  struct mlx5ctl_umem_reg __user *arg,
+				  size_t usize)
+{
+	struct mlx5ctl_fd *mfd = file->private_data;
+	struct mlx5ctl_umem_reg *umem_reg;
+	int umem_id, err = 0;
+	size_t ksize = 0;
+
+	ksize = max(sizeof(struct mlx5ctl_umem_reg), usize);
+	umem_reg = kzalloc(ksize, GFP_KERNEL_ACCOUNT);
+	if (!umem_reg)
+		return -ENOMEM;
+
+	umem_reg->size = sizeof(struct mlx5ctl_umem_reg);
+
+	if (copy_from_user(umem_reg, arg, usize)) {
+		err = -EFAULT;
+		goto out;
+	}
+
+	if (umem_reg->flags || umem_reg->reserved1 || umem_reg->reserved2) {
+		err = -EOPNOTSUPP;
+		goto out;
+	}
+
+	umem_id = mlx5ctl_umem_reg(mfd->umem_db,
+				   (unsigned long)umem_reg->addr,
+				   umem_reg->len);
+	if (umem_id < 0) {
+		err = umem_id;
+		goto out;
+	}
+
+	umem_reg->umem_id = umem_id;
+
+	if (copy_to_user(arg, umem_reg, usize)) {
+		mlx5ctl_umem_unreg(mfd->umem_db, umem_id);
+		err = -EFAULT;
+	}
+out:
+	kfree(umem_reg);
+	return err;
+}
+
+static int mlx5ctl_ioctl_umem_unreg(struct file *file,
+				    struct mlx5ctl_umem_unreg __user *arg,
+				    size_t usize)
+{
+	struct mlx5ctl_fd *mfd = file->private_data;
+	struct mlx5ctl_umem_unreg *umem_unreg;
+	size_t ksize = 0;
+	int err = 0;
+
+	ksize = max(sizeof(struct mlx5ctl_umem_unreg), usize);
+	umem_unreg = kzalloc(ksize, GFP_KERNEL_ACCOUNT);
+	if (!umem_unreg)
+		return -ENOMEM;
+
+	umem_unreg->size = sizeof(struct mlx5ctl_umem_unreg);
+
+	if (copy_from_user(umem_unreg, arg, usize)) {
+		err = -EFAULT;
+		goto out;
+	}
+
+	if (umem_unreg->flags) {
+		err = -EOPNOTSUPP;
+		goto out;
+	}
+
+	err = mlx5ctl_umem_unreg(mfd->umem_db, umem_unreg->umem_id);
+
+	if (!err && copy_to_user(arg, umem_unreg, usize))
+		err = -EFAULT;
+out:
+	kfree(umem_unreg);
+	return err;
+}
+
 static long mlx5ctl_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 {
 	struct mlx5ctl_fd *mfd = file->private_data;
@@ -352,6 +443,14 @@  static long mlx5ctl_ioctl(struct file *file, unsigned int cmd, unsigned long arg
 		err = mlx5ctl_cmdrpc_ioctl(file, argp, size);
 		break;
 
+	case MLX5CTL_IOCTL_UMEM_REG:
+		err = mlx5ctl_ioctl_umem_reg(file, argp, size);
+		break;
+
+	case MLX5CTL_IOCTL_UMEM_UNREG:
+		err = mlx5ctl_ioctl_umem_unreg(file, argp, size);
+		break;
+
 	default:
 		mlx5ctl_dbg(mcdev, "Unknown ioctl %x\n", cmd);
 		err = -ENOIOCTLCMD;
diff --git a/drivers/misc/mlx5ctl/umem.c b/drivers/misc/mlx5ctl/umem.c
new file mode 100644
index 000000000000..e62030dadf51
--- /dev/null
+++ b/drivers/misc/mlx5ctl/umem.c
@@ -0,0 +1,322 @@ 
+// SPDX-License-Identifier: BSD-3-Clause OR GPL-2.0
+/* Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. */
+
+#include <linux/mlx5/device.h>
+#include <linux/mlx5/driver.h>
+#include <uapi/misc/mlx5ctl.h>
+
+#include "umem.h"
+
+#define MLX5CTL_UMEM_MAX_MB 64
+
+static unsigned long umem_num_pages(u64 addr, size_t len)
+{
+	return DIV_ROUND_UP(addr + len - PAGE_ALIGN_DOWN(addr), PAGE_SIZE);
+}
+
+struct mlx5ctl_umem {
+	struct sg_table sgt;
+	unsigned long addr;
+	size_t size;
+	size_t offset;
+	size_t npages;
+	struct task_struct *source_task;
+	struct mm_struct *source_mm;
+	struct user_struct *source_user;
+	u32 umem_id;
+	struct page **page_list;
+};
+
+struct mlx5ctl_umem_db {
+	struct xarray xarray;
+	struct mlx5_core_dev *mdev;
+	u32 uctx_uid;
+};
+
+static int inc_user_locked_vm(struct mlx5ctl_umem *umem, unsigned long npages)
+{
+	unsigned long lock_limit;
+	unsigned long cur_pages;
+	unsigned long new_pages;
+
+	lock_limit = task_rlimit(umem->source_task, RLIMIT_MEMLOCK) >>
+		     PAGE_SHIFT;
+	do {
+		cur_pages = atomic_long_read(&umem->source_user->locked_vm);
+		new_pages = cur_pages + npages;
+		if (new_pages > lock_limit)
+			return -ENOMEM;
+	} while (atomic_long_cmpxchg(&umem->source_user->locked_vm, cur_pages,
+				     new_pages) != cur_pages);
+	return 0;
+}
+
+static void dec_user_locked_vm(struct mlx5ctl_umem *umem, unsigned long npages)
+{
+	if (WARN_ON(atomic_long_read(&umem->source_user->locked_vm) < npages))
+		return;
+	atomic_long_sub(npages, &umem->source_user->locked_vm);
+}
+
+#define PAGES_2_MB(pages) ((pages) >> (20 - PAGE_SHIFT))
+
+static struct mlx5ctl_umem *mlx5ctl_umem_pin(struct mlx5ctl_umem_db *umem_db,
+					     unsigned long addr, size_t size)
+{
+	size_t npages = umem_num_pages(addr, size);
+	struct mlx5_core_dev *mdev = umem_db->mdev;
+	unsigned long endaddr = addr + size;
+	struct mlx5ctl_umem *umem;
+	struct page **page_list;
+	int err = -EINVAL;
+	int pinned = 0;
+
+	dev_dbg(mdev->device, "%s: addr %p size %zu npages %zu\n",
+		__func__, (void __user *)addr, size, npages);
+
+	/* Avoid integer overflow */
+	if (endaddr < addr || PAGE_ALIGN(endaddr) < endaddr)
+		return ERR_PTR(-EINVAL);
+
+	if (npages == 0 || PAGES_2_MB(npages) > MLX5CTL_UMEM_MAX_MB)
+		return ERR_PTR(-EINVAL);
+
+	page_list = kvmalloc_array(npages, sizeof(struct page *), GFP_KERNEL_ACCOUNT);
+	if (!page_list)
+		return ERR_PTR(-ENOMEM);
+
+	umem = kzalloc(sizeof(*umem), GFP_KERNEL_ACCOUNT);
+	if (!umem) {
+		kvfree(page_list);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	umem->addr = addr;
+	umem->size = size;
+	umem->offset = addr & ~PAGE_MASK;
+	umem->npages = npages;
+
+	umem->page_list = page_list;
+	umem->source_mm = current->mm;
+	umem->source_task = current->group_leader;
+	get_task_struct(current->group_leader);
+	umem->source_user = get_uid(current_user());
+
+	/* mm and RLIMIT_MEMLOCK user task accounting similar to what is
+	 * being done in iopt_alloc_pages() and do_update_pinned()
+	 * for IOPT_PAGES_ACCOUNT_USER @drivers/iommu/iommufd/pages.c
+	 */
+	mmgrab(umem->source_mm);
+
+	pinned = pin_user_pages_fast(addr, npages, FOLL_WRITE, page_list);
+	if (pinned != npages) {
+		dev_dbg(mdev->device, "pin_user_pages_fast failed %d\n", pinned);
+		err = pinned < 0 ? pinned : -ENOMEM;
+		goto pin_failed;
+	}
+
+	err = inc_user_locked_vm(umem, npages);
+	if (err)
+		goto pin_failed;
+
+	atomic64_add(npages, &umem->source_mm->pinned_vm);
+
+	err = sg_alloc_table_from_pages(&umem->sgt, page_list, npages, 0,
+					npages << PAGE_SHIFT, GFP_KERNEL_ACCOUNT);
+	if (err) {
+		dev_dbg(mdev->device, "sg_alloc_table failed: %d\n", err);
+		goto sgt_failed;
+	}
+
+	dev_dbg(mdev->device, "\tsgt: size %zu npages %zu sgt.nents (%d)\n",
+		size, npages, umem->sgt.nents);
+
+	err = dma_map_sgtable(mdev->device, &umem->sgt, DMA_BIDIRECTIONAL, 0);
+	if (err) {
+		dev_dbg(mdev->device, "dma_map_sgtable failed: %d\n", err);
+		goto dma_failed;
+	}
+
+	dev_dbg(mdev->device, "\tsgt: dma_nents %d\n", umem->sgt.nents);
+	return umem;
+
+dma_failed:
+sgt_failed:
+	sg_free_table(&umem->sgt);
+	atomic64_sub(npages, &umem->source_mm->pinned_vm);
+	dec_user_locked_vm(umem, npages);
+pin_failed:
+	if (pinned > 0)
+		unpin_user_pages(page_list, pinned);
+	mmdrop(umem->source_mm);
+	free_uid(umem->source_user);
+	put_task_struct(umem->source_task);
+
+	kfree(umem);
+	kvfree(page_list);
+	return ERR_PTR(err);
+}
+
+static void mlx5ctl_umem_unpin(struct mlx5ctl_umem_db *umem_db,
+			       struct mlx5ctl_umem *umem)
+{
+	struct mlx5_core_dev *mdev = umem_db->mdev;
+
+	dev_dbg(mdev->device, "%s: addr %p size %zu npages %zu dma_nents %d\n",
+		__func__, (void *)umem->addr, umem->size, umem->npages,
+		umem->sgt.nents);
+
+	dma_unmap_sgtable(mdev->device, &umem->sgt, DMA_BIDIRECTIONAL, 0);
+	sg_free_table(&umem->sgt);
+
+	atomic64_sub(umem->npages, &umem->source_mm->pinned_vm);
+	dec_user_locked_vm(umem, umem->npages);
+	unpin_user_pages(umem->page_list, umem->npages);
+	mmdrop(umem->source_mm);
+	free_uid(umem->source_user);
+	put_task_struct(umem->source_task);
+
+	kvfree(umem->page_list);
+	kfree(umem);
+}
+
+static int mlx5ctl_umem_create(struct mlx5_core_dev *mdev,
+			       struct mlx5ctl_umem *umem, u32 uid)
+{
+	u32 out[MLX5_ST_SZ_DW(create_umem_out)] = {};
+	int err, inlen, i, n = 0;
+	struct scatterlist *sg;
+	void *in, *umemptr;
+	__be64 *mtt;
+
+	inlen = MLX5_ST_SZ_BYTES(create_umem_in) +
+		umem->npages * MLX5_ST_SZ_BYTES(mtt);
+
+	in = kzalloc(inlen, GFP_KERNEL_ACCOUNT);
+	if (!in)
+		return -ENOMEM;
+
+	MLX5_SET(create_umem_in, in, opcode, MLX5_CMD_OP_CREATE_UMEM);
+	MLX5_SET(create_umem_in, in, uid, uid);
+
+	umemptr = MLX5_ADDR_OF(create_umem_in, in, umem);
+
+	MLX5_SET(umem, umemptr, log_page_size,
+		 PAGE_SHIFT - MLX5_ADAPTER_PAGE_SHIFT);
+	MLX5_SET64(umem, umemptr, num_of_mtt, umem->npages);
+	MLX5_SET(umem, umemptr, page_offset, umem->offset);
+
+	dev_dbg(mdev->device,
+		"UMEM CREATE: log_page_size %d num_of_mtt %lld page_offset %d\n",
+		MLX5_GET(umem, umemptr, log_page_size),
+		MLX5_GET64(umem, umemptr, num_of_mtt),
+		MLX5_GET(umem, umemptr, page_offset));
+
+	mtt = MLX5_ADDR_OF(create_umem_in, in, umem.mtt);
+	for_each_sgtable_dma_sg(&umem->sgt, sg, i) {
+		u64 dma_addr = sg_dma_address(sg);
+		ssize_t len = sg_dma_len(sg);
+
+		for (; n < umem->npages && len > 0; n++, mtt++) {
+			*mtt = cpu_to_be64(dma_addr);
+			MLX5_SET(mtt, mtt, wr_en, 1);
+			MLX5_SET(mtt, mtt, rd_en, 1);
+			dma_addr += PAGE_SIZE;
+			len -= PAGE_SIZE;
+		}
+		WARN_ON_ONCE(n == umem->npages && len > 0);
+	}
+
+	err = mlx5_cmd_exec(mdev, in, inlen, out, sizeof(out));
+	if (err)
+		goto out;
+
+	umem->umem_id = MLX5_GET(create_umem_out, out, umem_id);
+	dev_dbg(mdev->device, "\tUMEM CREATED: umem_id %d\n", umem->umem_id);
+out:
+	kfree(in);
+	return err;
+}
+
+static void mlx5ctl_umem_destroy(struct mlx5_core_dev *mdev,
+				 struct mlx5ctl_umem *umem)
+{
+	u32 in[MLX5_ST_SZ_DW(destroy_umem_in)] = {};
+
+	MLX5_SET(destroy_umem_in, in, opcode, MLX5_CMD_OP_DESTROY_UMEM);
+	MLX5_SET(destroy_umem_in, in, umem_id, umem->umem_id);
+
+	dev_dbg(mdev->device, "UMEM DESTROY: umem_id %d\n", umem->umem_id);
+	mlx5_cmd_exec_in(mdev, destroy_umem, in);
+}
+
+int mlx5ctl_umem_reg(struct mlx5ctl_umem_db *umem_db, unsigned long addr,
+		     size_t size)
+{
+	struct mlx5ctl_umem *umem;
+	void *ret;
+	int err;
+
+	umem = mlx5ctl_umem_pin(umem_db, addr, size);
+	if (IS_ERR(umem))
+		return PTR_ERR(umem);
+
+	err = mlx5ctl_umem_create(umem_db->mdev, umem, umem_db->uctx_uid);
+	if (err)
+		goto umem_create_err;
+
+	ret = xa_store(&umem_db->xarray, umem->umem_id, umem, GFP_KERNEL_ACCOUNT);
+	if (WARN(xa_is_err(ret), "Failed to store UMEM")) {
+		err = xa_err(ret);
+		goto xa_store_err;
+	}
+
+	return umem->umem_id;
+
+xa_store_err:
+	mlx5ctl_umem_destroy(umem_db->mdev, umem);
+umem_create_err:
+	mlx5ctl_umem_unpin(umem_db, umem);
+	return err;
+}
+
+int mlx5ctl_umem_unreg(struct mlx5ctl_umem_db *umem_db, u32 umem_id)
+{
+	struct mlx5ctl_umem *umem;
+
+	umem = xa_erase(&umem_db->xarray, umem_id);
+	if (!umem)
+		return -ENOENT;
+
+	mlx5ctl_umem_destroy(umem_db->mdev, umem);
+	mlx5ctl_umem_unpin(umem_db, umem);
+	return 0;
+}
+
+struct mlx5ctl_umem_db *mlx5ctl_umem_db_create(struct mlx5_core_dev *mdev,
+					       u32 uctx_uid)
+{
+	struct mlx5ctl_umem_db *umem_db;
+
+	umem_db = kzalloc(sizeof(*umem_db), GFP_KERNEL_ACCOUNT);
+	if (!umem_db)
+		return ERR_PTR(-ENOMEM);
+
+	xa_init(&umem_db->xarray);
+	umem_db->mdev = mdev;
+	umem_db->uctx_uid = uctx_uid;
+
+	return umem_db;
+}
+
+void mlx5ctl_umem_db_destroy(struct mlx5ctl_umem_db *umem_db)
+{
+	struct mlx5ctl_umem *umem;
+	unsigned long index;
+
+	xa_for_each(&umem_db->xarray, index, umem)
+		mlx5ctl_umem_unreg(umem_db, umem->umem_id);
+
+	xa_destroy(&umem_db->xarray);
+	kfree(umem_db);
+}
diff --git a/drivers/misc/mlx5ctl/umem.h b/drivers/misc/mlx5ctl/umem.h
new file mode 100644
index 000000000000..9cf62e5e775e
--- /dev/null
+++ b/drivers/misc/mlx5ctl/umem.h
@@ -0,0 +1,17 @@ 
+/* SPDX-License-Identifier: BSD-3-Clause OR GPL-2.0 */
+/* Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. */
+
+#ifndef __MLX5CTL_UMEM_H__
+#define __MLX5CTL_UMEM_H__
+
+#include <linux/types.h>
+#include <linux/mlx5/driver.h>
+
+struct mlx5ctl_umem_db;
+
+struct mlx5ctl_umem_db *mlx5ctl_umem_db_create(struct mlx5_core_dev *mdev, u32 uctx_uid);
+void mlx5ctl_umem_db_destroy(struct mlx5ctl_umem_db *umem_db);
+int mlx5ctl_umem_reg(struct mlx5ctl_umem_db *umem_db, unsigned long addr, size_t size);
+int mlx5ctl_umem_unreg(struct mlx5ctl_umem_db *umem_db, u32 umem_id);
+
+#endif /* __MLX5CTL_UMEM_H__ */
diff --git a/include/uapi/misc/mlx5ctl.h b/include/uapi/misc/mlx5ctl.h
index 3277eaf78a37..506aa8db75b4 100644
--- a/include/uapi/misc/mlx5ctl.h
+++ b/include/uapi/misc/mlx5ctl.h
@@ -24,6 +24,22 @@  struct mlx5ctl_cmdrpc {
 	__aligned_u64 flags;
 };
 
+struct mlx5ctl_umem_reg {
+	__aligned_u64 flags;
+	__u32 size;
+	__u32 reserved1;
+	__aligned_u64 addr; /* user address */
+	__aligned_u64 len; /* user buffer length */
+	__u32 umem_id; /* returned device's umem ID */
+	__u32 reserved2;
+};
+
+struct mlx5ctl_umem_unreg {
+	__aligned_u64 flags;
+	__u32 size;
+	__u32 umem_id;
+};
+
 #define MLX5CTL_MAX_RPC_SIZE 8192
 
 #define MLX5CTL_IOCTL_MAGIC 0x5c
@@ -34,4 +50,10 @@  struct mlx5ctl_cmdrpc {
 #define MLX5CTL_IOCTL_CMDRPC \
 	_IOWR(MLX5CTL_IOCTL_MAGIC, 0x1, struct mlx5ctl_cmdrpc)
 
+#define MLX5CTL_IOCTL_UMEM_REG \
+	_IOWR(MLX5CTL_IOCTL_MAGIC, 0x2, struct mlx5ctl_umem_reg)
+
+#define MLX5CTL_IOCTL_UMEM_UNREG \
+	_IOWR(MLX5CTL_IOCTL_MAGIC, 0x3, struct mlx5ctl_umem_unreg)
+
 #endif /* __MLX5CTL_IOCTL_H__ */