[RFC,06/10] net: add SO_DEVMEM_DONTNEED setsockopt to release RX pages
Commit Message
Add an interface for the user to notify the kernel that it is done reading
the NET_RX dmabuf pages returned as cmsg. The kernel will drop the
reference on the NET_RX pages to make them available for re-use.
Signed-off-by: Mina Almasry <almasrymina@google.com>
---
include/uapi/asm-generic/socket.h | 1 +
include/uapi/linux/uio.h | 4 +++
net/core/sock.c | 41 +++++++++++++++++++++++++++++++
3 files changed, 46 insertions(+)
Comments
On 7/10/23 15:32, Mina Almasry wrote:
> Add an interface for the user to notify the kernel that it is done reading
> the NET_RX dmabuf pages returned as cmsg. The kernel will drop the
> reference on the NET_RX pages to make them available for re-use.
>
> Signed-off-by: Mina Almasry <almasrymina@google.com>
> ---
> +		for (i = 0; i < num_tokens; i++) {
> +			for (j = 0; j < tokens[i].token_count; j++) {
> +				struct page *pg = xa_erase(&sk->sk_pagepool,
> +							   tokens[i].token_start + j);
> +
> +				if (pg)
> +					put_page(pg);
> +				else
> +					/* -EINTR here notifies the userspace
> +					 * that not all tokens passed to it have
> +					 * been freed.
> +					 */
> +					ret = -EINTR;
Unless I'm missing something, this type of error reporting is
unrecoverable -- userspace doesn't know how many tokens have been freed.
I think you should either make it explicitly unrecoverable (somehow shut
down dmabuf handling entirely) or tell userspace how many tokens were
successfully freed.
--Andy
On Sun, Jul 16, 2023 at 4:57 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> On 7/10/23 15:32, Mina Almasry wrote:
> > Add an interface for the user to notify the kernel that it is done reading
> > the NET_RX dmabuf pages returned as cmsg. The kernel will drop the
> > reference on the NET_RX pages to make them available for re-use.
> >
> > Signed-off-by: Mina Almasry <almasrymina@google.com>
> > ---
>
> > +		for (i = 0; i < num_tokens; i++) {
> > +			for (j = 0; j < tokens[i].token_count; j++) {
> > +				struct page *pg = xa_erase(&sk->sk_pagepool,
> > +							   tokens[i].token_start + j);
> > +
> > +				if (pg)
> > +					put_page(pg);
> > +				else
> > +					/* -EINTR here notifies the userspace
> > +					 * that not all tokens passed to it have
> > +					 * been freed.
> > +					 */
> > +					ret = -EINTR;
>
> Unless I'm missing something, this type of error reporting is
> unrecoverable -- userspace doesn't know how many tokens have been freed.
>
> I think you should either make it explicitly unrecoverable (somehow shut
> down dmabuf handling entirely) or tell userspace how many tokens were
> successfully freed.
>
Thank you, the latter suggestion sounds perfect actually. The user
can't do much if the kernel fails to free all the tokens, but at least
it can know that something wrong is happening and can log an error for
further debugging. Great suggestion! Thanks!
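
To make the "tell userspace how many tokens were freed" option concrete, here is a rough userspace model of the release loop that returns a count instead of -EINTR. It is only a sketch, not kernel code: struct model_pool, model_release_one() and model_dontneed() are hypothetical names, and the array stands in for sk->sk_pagepool so the logic can be compiled and run outside the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the struct this patch adds to include/uapi/linux/uio.h. */
struct devmemtoken {
	uint32_t token_start;
	uint32_t token_count;
};

/* Hypothetical stand-in for sk->sk_pagepool: held[t] != 0 means that
 * token t still pins a page.
 */
#define MODEL_POOL_SIZE 64
struct model_pool {
	unsigned char held[MODEL_POOL_SIZE];
};

/* Stand-in for xa_erase() + put_page(): returns 1 if the token was
 * held and has now been released, 0 otherwise.
 */
static int model_release_one(struct model_pool *pool, uint32_t token)
{
	assert(pool != NULL);

	if (token >= MODEL_POOL_SIZE || !pool->held[token])
		return 0;

	pool->held[token] = 0;
	return 1;
}

/* Same double loop as the patch, but the result is the number of
 * tokens actually released rather than 0 or -EINTR.
 */
static int model_dontneed(struct model_pool *pool,
			  const struct devmemtoken *tokens,
			  unsigned int num_tokens)
{
	unsigned int i;
	uint32_t j;
	int freed = 0;

	for (i = 0; i < num_tokens; i++)
		for (j = 0; j < tokens[i].token_count; j++)
			freed += model_release_one(pool,
						   tokens[i].token_start + j);

	return freed;
}
```

With a count in hand, the caller can compare it against the sum of the token_count fields it passed in and log a mismatch. Whether the real interface would return the count from setsockopt() or report it some other way is left open here.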
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -132,6 +132,7 @@
 #define SO_RCVMARK 75
+#define SO_DEVMEM_DONTNEED 97
 #define SO_DEVMEM_HEADER 98
 #define SCM_DEVMEM_HEADER SO_DEVMEM_HEADER
 #define SO_DEVMEM_OFFSET 99
--- a/include/uapi/linux/uio.h
+++ b/include/uapi/linux/uio.h
@@ -26,6 +26,10 @@ struct cmsg_devmem {
 	__u32 frag_token;
 };
+struct devmemtoken {
+	__u32 token_start;
+	__u32 token_count;
+};
 /*
  * UIO_MAXIOV shall be at least 16 1003.1g (5.4.1.1)
  */
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1531,7 +1531,48 @@ int sk_setsockopt(struct sock *sk, int level, int optname,
 		/* Paired with READ_ONCE() in tcp_rtx_synack() */
 		WRITE_ONCE(sk->sk_txrehash, (u8)val);
 		break;
+	case SO_DEVMEM_DONTNEED: {
+		struct devmemtoken tokens[128];
+		unsigned int num_tokens, i, j;
+		if (sk->sk_type != SOCK_STREAM ||
+		    sk->sk_protocol != IPPROTO_TCP) {
+			ret = -EBADF;
+			break;
+		}
+
+		if (optlen % sizeof(struct devmemtoken) ||
+		    optlen > sizeof(tokens)) {
+			ret = -EINVAL;
+			break;
+		}
+
+		num_tokens = optlen / sizeof(struct devmemtoken);
+		if (copy_from_sockptr(tokens, optval, optlen)) {
+			ret = -EFAULT;
+			break;
+		}
+
+		ret = 0;
+
+		for (i = 0; i < num_tokens; i++) {
+			for (j = 0; j < tokens[i].token_count; j++) {
+				struct page *pg = xa_erase(&sk->sk_pagepool,
+							   tokens[i].token_start + j);
+
+				if (pg)
+					put_page(pg);
+				else
+					/* -EINTR here notifies the userspace
+					 * that not all tokens passed to it have
+					 * been freed.
+					 */
+					ret = -EINTR;
+			}
+		}
+
+		break;
+	}
 	default:
 		ret = -ENOPROTOOPT;
 		break;
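
For illustration, one way userspace might prepare the optval for this option: the handler above takes up to 128 struct devmemtoken ranges per call, so a receiver that has collected individual frag_token values from struct cmsg_devmem could merge consecutive ones into ranges first. The sketch below is an assumption about how a caller might do that, not part of the patch; coalesce_tokens() and DONTNEED_MAX_RANGES are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the struct this patch adds to include/uapi/linux/uio.h. */
struct devmemtoken {
	uint32_t token_start;
	uint32_t token_count;
};

/* Matches the tokens[128] stack array in the SO_DEVMEM_DONTNEED handler. */
#define DONTNEED_MAX_RANGES 128

/* Hypothetical helper: merge runs of consecutive frag tokens into
 * (start, count) ranges. Writes at most max_out ranges to out, stores
 * the number of input tokens covered in *consumed, and returns the
 * number of ranges written.
 */
static size_t coalesce_tokens(const uint32_t *frag_tokens, size_t n,
			      struct devmemtoken *out, size_t max_out,
			      size_t *consumed)
{
	size_t used = 0, i;

	assert(out != NULL && consumed != NULL);

	for (i = 0; i < n; i++) {
		if (used > 0 &&
		    out[used - 1].token_start + out[used - 1].token_count ==
		    frag_tokens[i]) {
			/* Extends the current run by one token. */
			out[used - 1].token_count++;
			continue;
		}

		/* Needs a new range; stop if the batch is full. */
		if (used == max_out)
			break;

		out[used].token_start = frag_tokens[i];
		out[used].token_count = 1;
		used++;
	}

	*consumed = i;
	return used;
}
```

Under this RFC the resulting array would then presumably be handed to setsockopt(fd, SOL_SOCKET, SO_DEVMEM_DONTNEED, out, used * sizeof(*out)), with max_out set to DONTNEED_MAX_RANGES so that optlen stays within the sizeof(tokens) check. Tokens left over when *consumed < n would go into a follow-up call.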