From patchwork Sun Nov 6 19:40:12 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Arseniy Krasnov
X-Patchwork-Id: 16173
From: Arseniy Krasnov
To: Stefano Garzarella, Stefan Hajnoczi, "Michael S. Tsirkin", Jason Wang, "David S. Miller", "edumazet@google.com", Jakub Kicinski, Paolo Abeni, Krasnov Arseniy
CC: "linux-kernel@vger.kernel.org", "kvm@vger.kernel.org", "virtualization@lists.linux-foundation.org", "netdev@vger.kernel.org", kernel
Subject: [RFC PATCH v3 03/11] af_vsock: add zerocopy receive logic
Date: Sun, 6 Nov 2022 19:40:12 +0000
Message-ID: <7aeba781-db09-9be1-a9a3-a4c16da38fb5@sberdevices.ru>

This:
1) Adds
   callback for 'mmap()' call on socket. It checks vm area flags and sets
   vm area ops.
2) Adds special 'getsockopt()' case which calls transport zerocopy
   callback. Input argument is vm area address.
3) Adds 'getsockopt()/setsockopt()' for switching on/off rx zerocopy
   mode.

Signed-off-by: Arseniy Krasnov
---
 include/net/af_vsock.h          |   8 ++
 include/uapi/linux/vm_sockets.h |   3 +
 net/vmw_vsock/af_vsock.c        | 187 +++++++++++++++++++++++++++++++-
 3 files changed, 196 insertions(+), 2 deletions(-)

-- 
2.35.0

diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index 568a87c5e0d0..e4f12ef8e623 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -73,6 +73,8 @@ struct vsock_sock {
 
 	/* Private to transport. */
 	void *trans;
+
+	bool rx_zerocopy_on;
 };
 
 s64 vsock_stream_has_data(struct vsock_sock *vsk);
@@ -138,6 +140,12 @@ struct vsock_transport {
 	bool (*stream_allow)(u32 cid, u32 port);
 	int (*set_rcvlowat)(struct vsock_sock *vsk, int val);
 
+	int (*zerocopy_rx_mmap)(struct vsock_sock *vsk,
+				struct vm_area_struct *vma);
+	int (*zerocopy_dequeue)(struct vsock_sock *vsk,
+				struct page **pages,
+				unsigned long *pages_num);
+
 	/* SEQ_PACKET. */
 	ssize_t (*seqpacket_dequeue)(struct vsock_sock *vsk, struct msghdr *msg,
 				     int flags);
diff --git a/include/uapi/linux/vm_sockets.h b/include/uapi/linux/vm_sockets.h
index c60ca33eac59..d1f792bed1a7 100644
--- a/include/uapi/linux/vm_sockets.h
+++ b/include/uapi/linux/vm_sockets.h
@@ -83,6 +83,9 @@
 
 #define SO_VM_SOCKETS_CONNECT_TIMEOUT_NEW 8
 
+#define SO_VM_SOCKETS_MAP_RX 9
+#define SO_VM_SOCKETS_ZEROCOPY 10
+
 #if !defined(__KERNEL__)
 #if __BITS_PER_LONG == 64 || (defined(__x86_64__) && defined(__ILP32__))
 #define SO_VM_SOCKETS_CONNECT_TIMEOUT SO_VM_SOCKETS_CONNECT_TIMEOUT_OLD
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index ee418701cdee..21a915eb0820 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1663,6 +1663,16 @@ static int vsock_connectible_setsockopt(struct socket *sock,
 		}
 		break;
 	}
+	case SO_VM_SOCKETS_ZEROCOPY: {
+		if (sock->state != SS_UNCONNECTED) {
+			err = -EOPNOTSUPP;
+			break;
+		}
+
+		COPY_IN(val);
+		vsk->rx_zerocopy_on = val;
+		break;
+	}
 
 	default:
 		err = -ENOPROTOOPT;
@@ -1676,6 +1686,124 @@ static int vsock_connectible_setsockopt(struct socket *sock,
 	return err;
 }
 
+static const struct vm_operations_struct afvsock_vm_ops = {
+};
+
+static int vsock_recv_zerocopy(struct socket *sock,
+			       unsigned long address)
+{
+	const struct vsock_transport *transport;
+	struct vm_area_struct *vma;
+	unsigned long vma_pages;
+	struct vsock_sock *vsk;
+	struct page **pages;
+	struct sock *sk;
+	int err;
+	int i;
+
+	sk = sock->sk;
+	vsk = vsock_sk(sk);
+	err = 0;
+
+	lock_sock(sk);
+
+	if (!vsk->rx_zerocopy_on) {
+		err = -EOPNOTSUPP;
+		goto out_unlock_sock;
+	}
+
+	transport = vsk->transport;
+
+	if (!transport->zerocopy_dequeue) {
+		err = -EOPNOTSUPP;
+		goto out_unlock_sock;
+	}
+
+	mmap_write_lock(current->mm);
+
+	vma = vma_lookup(current->mm, address);
+
+	if (!vma || vma->vm_ops != &afvsock_vm_ops) {
+		err = -EINVAL;
+		goto out_unlock_vma;
+	}
+
+	/* Allow to use vm area only from the first page. */
+	if (vma->vm_start != address) {
+		err = -EINVAL;
+		goto out_unlock_vma;
+	}
+
+	vma_pages = (vma->vm_end - vma->vm_start) / PAGE_SIZE;
+	pages = kmalloc_array(vma_pages, sizeof(pages[0]),
+			      GFP_KERNEL | __GFP_ZERO);
+
+	if (!pages) {
+		err = -ENOMEM;
+		goto out_unlock_vma;
+	}
+
+	err = transport->zerocopy_dequeue(vsk, pages, &vma_pages);
+
+	if (err)
+		goto out_unlock_vma;
+
+	/* Now 'vma_pages' contains number of pages in array.
+	 * If array element is NULL, skip it, go to next page.
+	 */
+	for (i = 0; i < vma_pages; i++) {
+		if (pages[i]) {
+			unsigned long pages_inserted;
+
+			pages_inserted = 1;
+			err = vm_insert_pages(vma, address, &pages[i], &pages_inserted);
+
+			if (err || pages_inserted) {
+				/* Failed to insert some pages, we have "partially"
+				 * mapped vma. Do not return, set error code. This
+				 * code will be returned to user. User needs to call
+				 * 'madvise()/mmap()' to clear this vma. Anyway,
+				 * references to all pages will be dropped below.
+				 */
+				if (!err) {
+					err = -EFAULT;
+					break;
+				}
+			}
+		}
+
+		address += PAGE_SIZE;
+	}
+
+	i = 0;
+
+	while (i < vma_pages) {
+		/* Drop ref count for all pages, returned by transport.
+		 * We call 'put_page()' only once, as transport needed
+		 * to 'get_page()' at least once as well, to prevent
+		 * pages being freed. If transport calls 'get_page()'
+		 * twice or more for every page - we don't care,
+		 * if transport calls 'get_page()' only one time, this
+		 * means that every page had ref count equal to 1, then
+		 * 'vm_insert_pages()' increments it to 2. After this
+		 * loop, ref count will be 1 again, and page will be
+		 * returned to allocator by user.
+		 */
+		if (pages[i])
+			put_page(pages[i]);
+		i++;
+	}
+
+	kfree(pages);
+
+out_unlock_vma:
+	mmap_write_unlock(current->mm);
+out_unlock_sock:
+	release_sock(sk);
+
+	return err;
+}
+
 static int vsock_connectible_getsockopt(struct socket *sock,
 					int level, int optname,
 					char __user *optval,
@@ -1720,6 +1848,26 @@ static int vsock_connectible_getsockopt(struct socket *sock,
 		lv = sock_get_timeout(vsk->connect_timeout, &v,
 				      optname == SO_VM_SOCKETS_CONNECT_TIMEOUT_OLD);
 		break;
+	case SO_VM_SOCKETS_ZEROCOPY: {
+		lock_sock(sk);
+
+		v.val64 = vsk->rx_zerocopy_on;
+
+		release_sock(sk);
+
+		break;
+	}
+	case SO_VM_SOCKETS_MAP_RX: {
+		unsigned long vma_addr;
+
+		if (len < sizeof(vma_addr))
+			return -EINVAL;
+
+		if (copy_from_user(&vma_addr, optval, sizeof(vma_addr)))
+			return -EFAULT;
+
+		return vsock_recv_zerocopy(sock, vma_addr);
+	}
 
 	default:
 		return -ENOPROTOOPT;
@@ -2167,6 +2315,41 @@ static int vsock_set_rcvlowat(struct sock *sk, int val)
 	return 0;
 }
 
+static int afvsock_mmap(struct file *file, struct socket *sock,
+			struct vm_area_struct *vma)
+{
+	const struct vsock_transport *transport;
+	struct vsock_sock *vsk;
+	struct sock *sk;
+	int err;
+
+	if (vma->vm_flags & (VM_WRITE | VM_EXEC))
+		return -EPERM;
+
+	vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
+	vma->vm_flags |= (VM_MIXEDMAP);
+	vma->vm_ops = &afvsock_vm_ops;
+
+	sk = sock->sk;
+	vsk = vsock_sk(sk);
+
+	lock_sock(sk);
+
+	transport = vsk->transport;
+
+	if (!transport || !transport->zerocopy_rx_mmap) {
+		err = -EOPNOTSUPP;
+		goto out_unlock;
+	}
+
+	err = transport->zerocopy_rx_mmap(vsk, vma);
+
+out_unlock:
+	release_sock(sk);
+
+	return err;
+}
+
 static const struct proto_ops vsock_stream_ops = {
 	.family = PF_VSOCK,
 	.owner = THIS_MODULE,
@@ -2184,7 +2367,7 @@ static const struct proto_ops vsock_stream_ops = {
 	.getsockopt = vsock_connectible_getsockopt,
 	.sendmsg = vsock_connectible_sendmsg,
 	.recvmsg = vsock_connectible_recvmsg,
-	.mmap = sock_no_mmap,
+	.mmap = afvsock_mmap,
 	.sendpage = sock_no_sendpage,
 	.set_rcvlowat = vsock_set_rcvlowat,
 };
@@ -2206,7 +2389,7 @@ static const struct proto_ops vsock_seqpacket_ops = {
 	.getsockopt = vsock_connectible_getsockopt,
 	.sendmsg = vsock_connectible_sendmsg,
 	.recvmsg = vsock_connectible_recvmsg,
-	.mmap = sock_no_mmap,
+	.mmap = afvsock_mmap,
 	.sendpage = sock_no_sendpage,
 };
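
For illustration only, here is a rough user-space sketch of how the three
pieces above could be used together: enable rx zerocopy on an unconnected
socket, map a read-only area on the socket, then pass the start address of
that area to 'getsockopt(SO_VM_SOCKETS_MAP_RX)'. The option numbers are the
ones defined in this RFC and are not expected to be in installed headers, so
they are defined locally. The helper names ('rx_map_len',
'vsock_enable_rx_zerocopy', 'vsock_map_rx', 'vsock_fill_rx') are hypothetical,
and the use of MAP_SHARED and of a 64-bit option value are assumptions, not
something this patch states.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/socket.h>

/* Values taken from this RFC; defined here only if headers lack them. */
#ifndef AF_VSOCK
#define AF_VSOCK 40
#endif
#ifndef SO_VM_SOCKETS_MAP_RX
#define SO_VM_SOCKETS_MAP_RX 9
#endif
#ifndef SO_VM_SOCKETS_ZEROCOPY
#define SO_VM_SOCKETS_ZEROCOPY 10
#endif

/* Round a byte count up to a whole number of pages (hypothetical helper). */
static size_t rx_map_len(size_t bytes, size_t page_size)
{
	assert(page_size > 0);
	return (bytes + page_size - 1) / page_size * page_size;
}

/* Switch rx zerocopy on. The patch rejects this option unless the socket
 * is still SS_UNCONNECTED, so call it before connect().
 * Assumption: the value is 64-bit, as 'COPY_IN(val)' suggests.
 */
static int vsock_enable_rx_zerocopy(int fd)
{
	uint64_t on = 1;

	return setsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_ZEROCOPY, &on, sizeof(on));
}

/* Map a read-only area on the socket. The 'mmap()' callback in this patch
 * returns -EPERM when VM_WRITE or VM_EXEC is set. MAP_SHARED is an assumption.
 */
static void *vsock_map_rx(int fd, size_t len)
{
	return mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
}

/* Ask the kernel to insert received pages into the mapping. The address
 * must be the start of the area (the patch checks 'vma->vm_start').
 */
static int vsock_fill_rx(int fd, void *area)
{
	unsigned long addr = (unsigned long)area;
	socklen_t len = sizeof(addr);

	return getsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_MAP_RX, &addr, &len);
}
```

A caller would presumably run 'vsock_enable_rx_zerocopy()' right after
'socket()', connect, map an area sized with 'rx_map_len()', and call
'vsock_fill_rx()' when data is expected. How the mapped pages are laid out is
up to the transport and is not defined by this patch. Per the comment in
'vsock_recv_zerocopy()', the area has to be cleared with 'madvise()/mmap()'
before it is reused.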