Message ID | 20230603204939.1598818-1-AVKrasnov@sberdevices.ru |
---|---|
Headers |
From: Arseniy Krasnov <AVKrasnov@sberdevices.ru>
To: Stefan Hajnoczi <stefanha@redhat.com>, Stefano Garzarella <sgarzare@redhat.com>, "David S. Miller" <davem@davemloft.net>, Eric Dumazet <edumazet@google.com>, Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>, "Michael S. Tsirkin" <mst@redhat.com>, Jason Wang <jasowang@redhat.com>, Bobby Eshleman <bobby.eshleman@bytedance.com>
CC: <kvm@vger.kernel.org>, <virtualization@lists.linux-foundation.org>, <netdev@vger.kernel.org>, <linux-kernel@vger.kernel.org>, <kernel@sberdevices.ru>, <oxffffaa@gmail.com>, <avkrasnov@sberdevices.ru>, Arseniy Krasnov <AVKrasnov@sberdevices.ru>
Subject: [RFC PATCH v4 00/17] vsock: MSG_ZEROCOPY flag support
Date: Sat, 3 Jun 2023 23:49:22 +0300
Message-ID: <20230603204939.1598818-1-AVKrasnov@sberdevices.ru>
X-Mailer: git-send-email 2.35.0
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org |
Series |
vsock: MSG_ZEROCOPY flag support
|
|
Message
Arseniy Krasnov
June 3, 2023, 8:49 p.m. UTC
Hello,

DESCRIPTION

This is MSG_ZEROCOPY feature support for virtio/vsock. I tried to follow
the current implementation for TCP as much as possible:

1) Sender must enable the SO_ZEROCOPY flag to use this feature. Without
   this flag, data will be sent in the "classic" copy manner and the
   MSG_ZEROCOPY flag will be ignored (i.e. no completion is generated).

2) Kernel uses completions from the socket's error queue: a single
   completion for a single tx syscall (or it can merge several
   completions into a single one). I used the already implemented logic
   for MSG_ZEROCOPY support: 'msg_zerocopy_realloc()' etc.

The difference from the copy way is not significant. During packet
allocation, a non-linear skb is created and filled with pinned user
pages. There are also some updates for the vhost and guest parts of the
transport - in both cases I've added handling of non-linear skbs for the
virtio part. vhost copies data from such an skb to the guest's rx virtio
buffers. In the guest, the virtio transport fills the tx virtio queue
with pages from the skb.

Head of this patchset is:
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=d20dd0ea14072e8a90ff864b2c1603bd68920b4b

This version has several limits/problems:

1) As this feature totally depends on the transport, there is no way (or
   it is difficult) to check whether the transport is able to handle it
   at the time SO_ZEROCOPY is set. It seems I need to call the AF_VSOCK
   specific setsockopt callback from the setsockopt callback for
   SOL_SOCKET, but this leads to a locking problem, because the AF_VSOCK
   and SOL_SOCKET callbacks are not designed to be called from each
   other. So in the current version SO_ZEROCOPY is set successfully on
   any type (i.e. transport) of AF_VSOCK socket, but if the transport
   does not support MSG_ZEROCOPY, the tx routine will fail with
   EOPNOTSUPP.

   ^^^
   This is still not resolved :(

2) When MSG_ZEROCOPY is used, for each tx system call we need to enqueue
   one completion. In each completion there is a flag which shows how tx
   was performed: zerocopy or copy. This means that the whole message
   must be sent either in zerocopy or in copy way - we can't send part
   of a message with copying and the rest of it in zerocopy mode (or
   vice versa). Now, we need to account for the vsock credit logic, i.e.
   we can't send all the data at once - only the allowed number of bytes
   can be sent at any moment. In the copy way there is no problem, as in
   the worst case we can send single bytes, but zerocopy is more complex
   because the smallest transmission unit is a single page. So if there
   is not enough space at the peer's side to send an integer number of
   pages (at least one), we will wait, thus stalling the tx side. To
   overcome this problem I've added a simple rule - zerocopy is possible
   only when there is enough space at the other side for the whole
   message (to check whether the current 'msghdr' was already used in
   previous tx iterations I use the 'iov_offset' field of its iov iter).

   ^^^
   Discussed as ok during v2. Link:
   https://lore.kernel.org/netdev/23guh3txkghxpgcrcjx7h62qsoj3xgjhfzgtbmqp2slrz3rxr4@zya2z7kwt75l/

3) Loopback transport is not supported, because it requires implementing
   non-linear skb handling in the dequeue logic (as we "send" a fragged
   skb and "receive" it from the same queue). I'm going to implement it
   in next versions.

   ^^^ fixed in v2

4) Current implementation sets the max length of a packet to 64KB. IIUC
   this is due to 'kmalloc()' allocated data buffers. I think that in
   case of MSG_ZEROCOPY this value could be increased, because
   'kmalloc()' is not touched for data - user space pages are used as
   buffers. Also this limit trims every message which is > 64KB, thus
   such messages will be sent in copy mode due to the 'iov_offset' check
   in 2).

   ^^^ fixed in v2

PATCHSET STRUCTURE

Patchset has the following structure:
1) Handle non-linear skbuff on receive in virtio/vhost.
2) Handle non-linear skbuff on send in virtio/vhost.
3) Updates for AF_VSOCK.
4) Enable MSG_ZEROCOPY support on transports.
5) Tests/tools/docs updates.
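The rule from limitation 2) above can be illustrated with a small userspace model. This is only a sketch of the decision as the cover letter describes it, not code from the patches: the struct 'tx_state', its fields and 'can_send_zerocopy()' are hypothetical names, and the real check lives in the virtio transport code and may differ in detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of one tx call; field names are illustrative only. */
struct tx_state {
	size_t msg_len;     /* bytes left to send from this msghdr */
	size_t iov_offset;  /* non-zero: msghdr was partially sent already */
	size_t peer_free;   /* credit: free bytes in the peer's rx buffer */
	bool so_zerocopy;   /* SO_ZEROCOPY was enabled on the socket */
	bool msg_zerocopy;  /* MSG_ZEROCOPY was passed to the send call */
};

/*
 * Zerocopy is used only if it was requested, the message has not been
 * partially transmitted in an earlier iteration, and the peer has room
 * for the whole message. Otherwise the message goes the copy way.
 */
static bool can_send_zerocopy(const struct tx_state *s)
{
	assert(s != NULL);

	if (!s->so_zerocopy || !s->msg_zerocopy)
		return false;

	if (s->iov_offset != 0)
		return false;

	return s->msg_len <= s->peer_free;
}
```

With this model, an 8KB message against 64KB of peer credit is eligible for zerocopy, while the same message against 4KB of credit falls back to copy mode instead of stalling.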
PERFORMANCE

Performance: it is a little bit tricky to compare performance between
copy and zerocopy transmissions. In the zerocopy way we need to wait
until user buffers are released by the kernel, so it is like a
synchronous path (wait until the device driver processes it), while in
the copy way we can feed as much data to the kernel as we want without
caring about the device driver. So I compared only the time which we
spend in the 'send()' syscall. If this value is combined with the total
number of transmitted bytes, we get the Gbit/s parameter. Also, to avoid
tx stalls due to not enough credit, the receiver allocates the same
amount of space as the sender needs.

Sender:
./vsock_perf --sender <CID> --buf-size <buf size> --bytes 256M [--zc]

Receiver:
./vsock_perf --vsk-size 256M

I ran tests on two setups: a desktop with Core i7 - I use this PC for
development, and in this case the guest is a nested guest and the host
is a normal guest. The other hardware is an embedded board with Atom -
here I don't have nested virtualization: the host runs on hw, and the
guest is a normal guest.

G2H transmission (values are Gbit/s):

   Core i7 with nested guest.            Atom with normal guest.

*-------------------------------*   *-------------------------------*
| buf size |   copy  | zerocopy |   | buf size |   copy  | zerocopy |
*-------------------------------*   *-------------------------------*
|   4KB    |    3    |    10    |   |   4KB    |   0.8   |   1.9    |
*-------------------------------*   *-------------------------------*
|   32KB   |   20    |    61    |   |   32KB   |   6.8   |   20.2   |
*-------------------------------*   *-------------------------------*
|  256KB   |   33    |   244    |   |  256KB   |   7.8   |    55    |
*-------------------------------*   *-------------------------------*
|    1M    |   30    |   373    |   |    1M    |    7    |    95    |
*-------------------------------*   *-------------------------------*
|    8M    |   22    |   475    |   |    8M    |    7    |   114    |
*-------------------------------*   *-------------------------------*

H2G:

   Core i7 with nested guest.            Atom with normal guest.

*-------------------------------*   *-------------------------------*
| buf size |   copy  | zerocopy |   | buf size |   copy  | zerocopy |
*-------------------------------*   *-------------------------------*
|   4KB    |   20    |    10    |   |   4KB    |  4.37   |    3     |
*-------------------------------*   *-------------------------------*
|   32KB   |   37    |    75    |   |   32KB   |   11    |    18    |
*-------------------------------*   *-------------------------------*
|  256KB   |   44    |   299    |   |  256KB   |   11    |    62    |
*-------------------------------*   *-------------------------------*
|    1M    |   28    |   335    |   |    1M    |    9    |    77    |
*-------------------------------*   *-------------------------------*
|    8M    |   27    |   417    |   |    8M    |  9.35   |   115    |
*-------------------------------*   *-------------------------------*

 * Let's look at the first row of both H2G tables, where copy is better
   than zerocopy. I analyzed this case more deeply and found that the
   bottleneck is the function 'vhost_work_queue()'. With 4K buffer size,
   the caller spends too much time in it in zerocopy mode (compared to
   copy mode). This happens only with 4K buffer size. This function just
   calls 'wake_up_process()' and its internal logic does not depend on
   the skb, so I think a potential reason (maybe) is the interval
   between two calls of this function (i.e. how often it is called).
   Note that 'vhost_work_queue()' differs from the same function at the
   guest's side of the transport: 'virtio_transport_send_pkt()' uses
   'queue_work()', which I think is more optimized for worker purposes
   than a direct call to 'wake_up_process()'. But again - this is just
   my assumption.

Loopback:

   Core i7 with nested guest.            Atom with normal guest.

*-------------------------------*   *-------------------------------*
| buf size |   copy  | zerocopy |   | buf size |   copy  | zerocopy |
*-------------------------------*   *-------------------------------*
|   4KB    |    8    |    7     |   |   4KB    |   1.8   |   1.3    |
*-------------------------------*   *-------------------------------*
|   32KB   |   38    |    44    |   |   32KB   |   10    |    10    |
*-------------------------------*   *-------------------------------*
|  256KB   |   55    |   168    |   |  256KB   |   15    |    36    |
*-------------------------------*   *-------------------------------*
|    1M    |   53    |   250    |   |    1M    |   12    |    45    |
*-------------------------------*   *-------------------------------*
|    8M    |   40    |   344    |   |    8M    |   11    |    74    |
*-------------------------------*   *-------------------------------*

I analyzed the performance difference more deeply for the following
setup:

server: ./vsock_perf --vsk-size 16M
client: ./vsock_perf --sender 2 --bytes 16M --buf-size 64K/4K [--zc]

In other words, I send 16M of data from guest to host in copy/zerocopy
modes and with two different buffer sizes - 4K and 64K. Let's look at
the tx path for both modes - it consists of two steps:

copy:
1) Allocate skb of buffer's length.
2) Copy data to skb from buffer.

zerocopy:
1) Allocate skb with header space only.
2) Pin pages of the buffer and insert them into the skb.

I measured the average number of ns (returned by 'ktime_get()') for each
step above:
1) Skb allocation (for both copy and zerocopy modes).
2) For copy mode in 'memcpy_to_msg()' - copying.
3) For zerocopy mode in '__zerocopy_sg_from_iter()' - pinning.
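Both kinds of numbers in this section come down to the same arithmetic: accumulate the time spent in one code path, then either average it per call or divide the transmitted bytes by it. A minimal userspace sketch of that bookkeeping follows. The names 'now_ns()', 'step_stat', 'step_account()', 'step_avg_ns()' and 'gbits_per_sec()' are hypothetical and are not taken from 'vsock_perf' or from the kernel instrumentation, which used 'ktime_get()'.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Monotonic timestamp in nanoseconds (userspace stand-in for ktime_get()). */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Accumulated time of one measured step, e.g. "skb alloc" or "pinning". */
struct step_stat {
	uint64_t total_ns;
	uint64_t calls;
};

/* Add one measured interval [start_ns, end_ns] to the step. */
static void step_account(struct step_stat *st, uint64_t start_ns, uint64_t end_ns)
{
	st->total_ns += end_ns - start_ns;
	st->calls++;
}

/* Average duration of the step in ns, or 0 if it was never measured. */
static uint64_t step_avg_ns(const struct step_stat *st)
{
	return st->calls ? st->total_ns / st->calls : 0;
}

/* bits per nanosecond is numerically the same as Gbit/s. */
static double gbits_per_sec(uint64_t bytes, uint64_t ns_in_send)
{
	assert(ns_in_send > 0);
	return (double)bytes * 8.0 / (double)ns_in_send;
}
```

A sender loop would call 'now_ns()' right before and after each 'send()', sum the differences, and pass the sum together with the total byte count to 'gbits_per_sec()'. Time spent waiting for completions is deliberately left out of the sum, matching the methodology described above.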
Here are results for copy mode:

*-------------------------------------*
| buf | skb alloc | 'memcpy_to_msg()' |
*-------------------------------------*
| 64K |  5000ns   |      25000ns      |
*-------------------------------------*
| 4K  |   800ns   |      2200ns       |
*-------------------------------------*

Here are results for zerocopy mode:

*-----------------------------------------------*
| buf | skb alloc | '__zerocopy_sg_from_iter()' |
*-----------------------------------------------*
| 64K |   250ns   |           3500ns            |
*-----------------------------------------------*
| 4K  |   250ns   |           3000ns            |
*-----------------------------------------------*

I guess that the reason for the zerocopy performance is the low overhead
of page pinning: there is a big difference between 4K and 64K in case of
copying (2200ns vs 25000ns), but in the pinning case it is just 3000ns
vs 3500ns.

So, zerocopy is faster than classic copy mode, but of course it requires
a specific application architecture due to user page pinning, buffer
size and alignment.

NOTES

If the host fails to send data with "Cannot allocate memory", check the
value of /proc/sys/net/core/optmem_max - it is accounted during
completion skb allocation. Try to update it to, for example, 1M and try
to send again:

"echo 1048576 > /proc/sys/net/core/optmem_max" (as root).

TESTING

This patchset includes a set of tests for the MSG_ZEROCOPY feature. I
tried to cover new code as much as possible, so there are different
cases for MSG_ZEROCOPY transmissions: with disabled SO_ZEROCOPY and
several io vector types (different sizes, alignments, with unmapped
pages). I also ran tests with the loopback transport and ran vsockmon.
In v3 I've added an io_uring test as a separate application.

LET'S SPLIT PATCHSET TO MAKE REVIEW EASIER

In v3 Stefano Garzarella <sgarzare@redhat.com> asked to split this
patchset into several parts, because it looks too big for review. I
think in this version (v4) we can do it in the following way:

[0001 - 0005] - this is preparation for the virtio/vhost part.
[0006 - 0009] - this is preparation for the AF_VSOCK part.
[0010 - 0013] - these patches allow triggering the logic from the
                previous two parts.
[0014 - rest] - updates for doc, tests, utils. This part doesn't touch
                kernel code and looks not critical.

Thanks, Arseniy

Link to v1:
https://lore.kernel.org/netdev/0e7c6fc4-b4a6-a27b-36e9-359597bba2b5@sberdevices.ru/
Link to v2:
https://lore.kernel.org/netdev/20230423192643.1537470-1-AVKrasnov@sberdevices.ru/
Link to v3:
https://lore.kernel.org/netdev/20230522073950.3574171-1-AVKrasnov@sberdevices.ru/

Changelog:
v1 -> v2:
 - Replace 'get_user_pages()' with 'pin_user_pages()'.
 - Loopback transport support.

v2 -> v3:
 - Use 'get_user_pages()' instead of 'pin_user_pages()'. I think this is
   the right approach, because I'm using the '__zerocopy_sg_from_iter()'
   function. It is already implemented and used by the io_uring zerocopy
   tx logic to 'pin' pages of the user's buffer.
 - Use 'skb_copy_datagram_iter()' to copy data from both linear and
   non-linear skbs to the user's iov iter. It already has support for
   copying data from the paged part of an skb (by calling 'kmap()'). In
   v2 I used my own function implemented from scratch. With this and the
   previous item I significantly reduced the LOC number in the kernel
   part.
 - Add io_uring test for AF_VSOCK. It is implemented as a separate util,
   because it depends on liburing (I think there is no need to link
   'vsock_test' with liburing, because io_uring functionality depends on
   the environment - both in kernel and userspace).
 - Values from the PERFORMANCE section are updated for all transports,
   but I didn't find any significant difference from v2.
 - More details in commit messages.

v3 -> v4:
 - Requirement for buffers to have page aligned base and size is
   removed, because virtio can handle such buffers.
 - Crash with SOCK_SEQPACKET is fixed. This is done by setting the owner
   of the new 'skb' before passing it to '__zerocopy_sg_from_iter()'.
   The latter dereferences the owner of the passed skb without any
   checks (it was NULL).
 - Type of "owning" of the newly created skb is also changed: in v3 and
   before it was 'skb_set_owner_sk_safe()'. I replaced it with
   'skb_set_owner_w()'. This is because '__zerocopy_sg_from_iter()'
   increments 'sk_wmem_alloc' of the socket which owns the skb, thus we
   need a proper destructor which decrements it back - it is
   'sock_wfree()'. This destructor is set by 'skb_set_owner_w()'.
   Otherwise we get a resource leak - such a socket will never be
   deallocated.
 - Use ITER_KVEC instead of ITER_IOVEC when the skb is copied to another
   one for passing to the TAP device. The reason for this update is that
   ITER_IOVEC is considered to be userspace memory, while we have only
   kernel memory here.

Arseniy Krasnov (17):
  vsock/virtio: read data from non-linear skb
  vhost/vsock: read data from non-linear skb
  vsock/virtio: support to send non-linear skb
  vsock/virtio: non-linear skb handling for tap
  vsock/virtio: MSG_ZEROCOPY flag support
  vsock: check error queue to set EPOLLERR
  vsock: read from socket's error queue
  vsock: check for MSG_ZEROCOPY support on send
  vsock: enable SOCK_SUPPORT_ZC bit
  vhost/vsock: support MSG_ZEROCOPY for transport
  vsock/virtio: support MSG_ZEROCOPY for transport
  vsock/loopback: support MSG_ZEROCOPY for transport
  net/sock: enable setting SO_ZEROCOPY for PF_VSOCK
  docs: net: description of MSG_ZEROCOPY for AF_VSOCK
  test/vsock: MSG_ZEROCOPY flag tests
  test/vsock: MSG_ZEROCOPY support for vsock_perf
  test/vsock: io_uring rx/tx tests

 Documentation/networking/msg_zerocopy.rst |  12 +-
 drivers/vhost/vsock.c                     |  18 +-
 include/linux/socket.h                    |   1 +
 include/linux/virtio_vsock.h              |   1 +
 include/net/af_vsock.h                    |   7 +
 net/core/sock.c                           |   4 +-
 net/vmw_vsock/af_vsock.c                  |  19 +-
 net/vmw_vsock/virtio_transport.c          |  44 ++-
 net/vmw_vsock/virtio_transport_common.c   | 317 ++++++++++++++++-----
 net/vmw_vsock/vsock_loopback.c            |   8 +
 tools/testing/vsock/Makefile              |   9 +-
 tools/testing/vsock/util.c                | 218 +++++++++++++++
 tools/testing/vsock/util.h                |  18 ++
 tools/testing/vsock/vsock_perf.c          | 139 +++++++++-
 tools/testing/vsock/vsock_test.c          |  16 ++
 tools/testing/vsock/vsock_test_zerocopy.c | 312 +++++++++++++++++++++
 tools/testing/vsock/vsock_test_zerocopy.h |  15 +
 tools/testing/vsock/vsock_uring_test.c    | 321 ++++++++++++++++++++++
 18 files changed, 1385 insertions(+), 94 deletions(-)
 create mode 100644 tools/testing/vsock/vsock_test_zerocopy.c
 create mode 100644 tools/testing/vsock/vsock_test_zerocopy.h
 create mode 100644 tools/testing/vsock/vsock_uring_test.c
Comments
Hey Arseniy,

Thanks for this series, very good stuff!

On Sat, Jun 03, 2023 at 11:49:22PM +0300, Arseniy Krasnov wrote:
> [...]
>
> This version has several limits/problems:
>
> 1) As this feature totally depends on transport, there is no way (or it
>    is difficult) to check whether transport is able to handle it or not
>    during SO_ZEROCOPY setting. Seems I need to call AF_VSOCK specific
>    setsockopt callback from setsockopt callback for SOL_SOCKET, but this
>    leads to lock problem, because both AF_VSOCK and SOL_SOCKET callback
>    are not considered to be called from each other. So in current version
>    SO_ZEROCOPY is set successfully to any type (e.g. transport) of
>    AF_VSOCK socket, but if transport does not support MSG_ZEROCOPY,
>    tx routine will fail with EOPNOTSUPP.
>
>    ^^^
>    This is still no resolved :(
>

I think to get around this you could set SOCK_CUSTOM_SOCKOPT in the
vsock create function, handle SO_ZEROCOPY in the vsock handler, but pass
the rest of the common options to sock_setsockopt().

I think the next issue you would run into though is that users may call
setsockopt() before connect(), and so the transport will still be
unknown (except for dgrams, which are weird for reasons).

What do you think about EOPNOTSUPP being returned when the user selects
an incompatible transport with connect() instead of returning it later
in the tx path?

> [...]
>
> PERFORMANCE
>
> [...]

Nice!

[...]

Thanks,
Bobby
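The connect-time check proposed above can be modelled in plain C to make the ordering explicit. This is a userspace toy model, not kernel code: 'model_transport', 'model_sock', 'model_setsockopt_zerocopy()' and 'model_connect()' are hypothetical names, and rejecting the option when the transport is already known is an assumption added for symmetry, not something this reply states.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a vsock transport and socket. */
struct model_transport {
	bool supports_zerocopy;
};

struct model_sock {
	bool zerocopy_requested;                  /* SO_ZEROCOPY state */
	const struct model_transport *transport;  /* NULL until connect() */
};

/*
 * Before connect() the transport is unknown, so the option is just
 * recorded. If a transport is already assigned, it can be checked
 * immediately (assumption, see lead-in).
 */
static int model_setsockopt_zerocopy(struct model_sock *sk, bool enable)
{
	assert(sk != NULL);

	if (enable && sk->transport && !sk->transport->supports_zerocopy)
		return -EOPNOTSUPP;

	sk->zerocopy_requested = enable;
	return 0;
}

/*
 * Transport assignment at connect() time: refuse a transport that
 * cannot honour an already requested SO_ZEROCOPY, instead of failing
 * later in the tx path.
 */
static int model_connect(struct model_sock *sk, const struct model_transport *t)
{
	assert(sk != NULL && t != NULL);

	if (sk->zerocopy_requested && !t->supports_zerocopy)
		return -EOPNOTSUPP;

	sk->transport = t;
	return 0;
}
```

In this model the error surfaces at the earliest point where both facts (option requested, transport chosen) are known, so a send call never has to report EOPNOTSUPP for this reason.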
Hello Bobby!

Sorry for a little bit late reply.

On 12.06.2023 20:20, Bobby Eshleman wrote:
> Hey Arseniy,
>
> Thanks for this series, very good stuff!
>
> On Sat, Jun 03, 2023 at 11:49:22PM +0300, Arseniy Krasnov wrote:
>> [...]
>>
>>    ^^^
>>    This is still no resolved :(
>>
>
> I think to get around this you could set SOCK_CUSTOM_SOCKOPT in the
> vsock create function, handle SO_ZEROCOPY in the vsock handler, but pass
> the rest of the common options to sock_setsockopt().

Ah yes, I really forgot about this way, thanks!

>
> I think the next issue you would run into though is that users may call
> setsockopt() before connect(), and so the transport will still be
> unknown (except for dgrams, which are weird for reasons).
>
> What do you think about EOPNOTSUPP being returned when the user selects
> an incompatible transport with connect() instead of returning it later
> in the tx path?

Yes, I think it is ok. In 'vsock_assign_transport()', which is called
from 'connect()', I will check that the transport supports zerocopy
transmission if it is enabled (seqpacket mode works in the same way - if
the transport doesn't support it, connect fails). So if 'setsockopt()'
is called before 'connect()' (i.e. the transport is unknown), I just set
this option and that's all. Later in 'connect()', during transport
assignment, I'll check that the selected transport supports this feature
if needed. If 'setsockopt()' is called after 'connect()', everything is
simple - the transport is already known.

Thanks for this clue, I'll include it in v5!

>
>> [...]
>>
>> H2G:
>>
>>    Core i7 with nested guest.            Atom with normal guest.
>> >> *-------------------------------* *-------------------------------* >> | | | | | | | | >> | buf size | copy | zerocopy | | buf size | copy | zerocopy | >> | | | | | | | | >> *-------------------------------* *-------------------------------* >> | 4KB | 20 | 10 | | 4KB | 4.37 | 3 | >> *-------------------------------* *-------------------------------* >> | 32KB | 37 | 75 | | 32KB | 11 | 18 | >> *-------------------------------* *-------------------------------* >> | 256KB | 44 | 299 | | 256KB | 11 | 62 | >> *-------------------------------* *-------------------------------* >> | 1M | 28 | 335 | | 1M | 9 | 77 | >> *-------------------------------* *-------------------------------* >> | 8M | 27 | 417 | | 8M | 9.35 | 115 | >> *-------------------------------* *-------------------------------* >> > > Nice! > > > [...] > > Thanks, > Bobby Thanks, Arseniy
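The ordering agreed on in the reply above (accept SO_ZEROCOPY while the
transport is unknown, then validate it when connect() assigns a transport)
can be sketched as a small user-space model. This is only an illustration
under stated assumptions, not kernel code: 'struct model_transport',
'struct model_sock', the 'msgzerocopy_allow' flag and both functions are
hypothetical names invented here, and returning -EOPNOTSUPP from the
after-connect setsockopt() path is an assumption based on "transport is
already known".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; these are NOT the kernel's structures. */
struct model_transport {
	const char *name;
	bool msgzerocopy_allow;	/* assumed capability flag of the transport */
};

struct model_sock {
	bool zerocopy_enabled;			/* state set via SO_ZEROCOPY */
	const struct model_transport *transport; /* NULL until connect() */
};

/*
 * Models setsockopt(SO_ZEROCOPY). Before connect() the transport is
 * unknown, so the option is simply stored. After connect() the transport
 * is known and can be checked right away (assumed behaviour).
 */
int model_set_zerocopy(struct model_sock *sk, bool enable)
{
	assert(sk != NULL);

	if (enable && sk->transport && !sk->transport->msgzerocopy_allow)
		return -EOPNOTSUPP;

	sk->zerocopy_enabled = enable;
	return 0;
}

/*
 * Models the transport assignment done during connect(): if zerocopy was
 * requested earlier and the selected transport cannot do it, the connect
 * fails instead of the later tx path.
 */
int model_assign_transport(struct model_sock *sk,
			   const struct model_transport *t)
{
	assert(sk != NULL && t != NULL);

	if (sk->zerocopy_enabled && !t->msgzerocopy_allow)
		return -EOPNOTSUPP;

	sk->transport = t;
	return 0;
}
```

In this model, enabling the option first and then connecting over a
transport without zerocopy support fails at connect time, which is the
behaviour Bobby suggested instead of failing later in the tx path.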
On Sat, Jun 03, 2023 at 11:49:22PM +0300, Arseniy Krasnov wrote:
>Hello,
>
>                           DESCRIPTION
>
>this is MSG_ZEROCOPY feature support for virtio/vsock. I tried to follow
>current implementation for TCP as much as possible:
>
>1) Sender must enable SO_ZEROCOPY flag to use this feature. Without this
>   flag, data will be sent in "classic" copy manner and MSG_ZEROCOPY
>   flag will be ignored (e.g. without completion).
>
>2) Kernel uses completions from socket's error queue. Single completion
>   for single tx syscall (or it can merge several completions to single
>   one). I used already implemented logic for MSG_ZEROCOPY support:
>   'msg_zerocopy_realloc()' etc.
>
>Difference with copy way is not significant. During packet allocation,
>non-linear skb is created and filled with pinned user pages.
>There are also some updates for vhost and guest parts of transport - in
>both cases i've added handling of non-linear skb for virtio part. vhost
>copies data from such skb to the guest's rx virtio buffers. In the guest,
>virtio transport fills tx virtio queue with pages from skb.
>
>Head of this patchset is:
>https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=d20dd0ea14072e8a90ff864b2c1603bd68920b4b
>
>
>This version has several limits/problems:
>
>1) As this feature totally depends on transport, there is no way (or it
>   is difficult) to check whether transport is able to handle it or not
>   during SO_ZEROCOPY setting. Seems I need to call AF_VSOCK specific
>   setsockopt callback from setsockopt callback for SOL_SOCKET, but this
>   leads to lock problem, because both AF_VSOCK and SOL_SOCKET callback
>   are not considered to be called from each other. So in current version
>   SO_ZEROCOPY is set successfully to any type (e.g. transport) of
>   AF_VSOCK socket, but if transport does not support MSG_ZEROCOPY,
>   tx routine will fail with EOPNOTSUPP.
>
>   ^^^
>   This is still no resolved :(
>
>2) When MSG_ZEROCOPY is used, for each tx system call we need to enqueue
>   one completion. In each completion there is flag which shows how tx
>   was performed: zerocopy or copy. This leads that whole message must
>   be send in zerocopy or copy way - we can't send part of message with
>   copying and rest of message with zerocopy mode (or vice versa). Now,
>   we need to account vsock credit logic, e.g. we can't send whole data
>   once - only allowed number of bytes could sent at any moment. In case
>   of copying way there is no problem as in worst case we can send single
>   bytes, but zerocopy is more complex because smallest transmission
>   unit is single page. So if there is not enough space at peer's side
>   to send integer number of pages (at least one) - we will wait, thus
>   stalling tx side. To overcome this problem i've added simple rule -
>   zerocopy is possible only when there is enough space at another side
>   for whole message (to check, that current 'msghdr' was already used
>   in previous tx iterations i use 'iov_offset' field of it's iov iter).
>
>   ^^^
>   Discussed as ok during v2. Link:
>   https://lore.kernel.org/netdev/23guh3txkghxpgcrcjx7h62qsoj3xgjhfzgtbmqp2slrz3rxr4@zya2z7kwt75l/
>
>3) loopback transport is not supported, because it requires to implement
>   non-linear skb handling in dequeue logic (as we "send" fragged skb
>   and "receive" it from the same queue). I'm going to implement it in
>   next versions.
>
>   ^^^ fixed in v2
>
>4) Current implementation sets max length of packet to 64KB. IIUC this
>   is due to 'kmalloc()' allocated data buffers. I think, in case of
>   MSG_ZEROCOPY this value could be increased, because 'kmalloc()' is
>   not touched for data - user space pages are used as buffers. Also
>   this limit trims every message which is > 64KB, thus such messages
>   will be send in copy mode due to 'iov_offset' check in 2).
>
>   ^^^ fixed in v2
>
>                         PATCHSET STRUCTURE
>
>Patchset has the following structure:
>1) Handle non-linear skbuff on receive in virtio/vhost.
>2) Handle non-linear skbuff on send in virtio/vhost.
>3) Updates for AF_VSOCK.
>4) Enable MSG_ZEROCOPY support on transports.
>5) Tests/tools/docs updates.
>
>                            PERFORMANCE
>
>Performance: it is a little bit tricky to compare performance between
>copy and zerocopy transmissions. In zerocopy way we need to wait when
>user buffers will be released by kernel, so it is like synchronous
>path (wait until device driver will process it), while in copy way we
>can feed data to kernel as many as we want, don't care about device
>driver. So I compared only time which we spend in the 'send()' syscall.
>Then if this value will be combined with total number of transmitted
>bytes, we can get Gbit/s parameter. Also to avoid tx stalls due to not
>enough credit, receiver allocates same amount of space as sender needs.
>
>Sender:
>./vsock_perf --sender <CID> --buf-size <buf size> --bytes 256M [--zc]
>
>Receiver:
>./vsock_perf --vsk-size 256M
>
>I run tests on two setups: desktop with Core i7 - I use this PC for
>development and in this case guest is nested guest, and host is normal
>guest. Another hardware is some embedded board with Atom - here I don't
>have nested virtualization - host runs on hw, and guest is normal guest.
>
>G2H transmission (values are Gbit/s):
>
>   Core i7 with nested guest.            Atom with normal guest.
>
>*-------------------------------*   *-------------------------------*
>|          |         |          |   |          |         |          |
>| buf size |   copy  | zerocopy |   | buf size |   copy  | zerocopy |
>|          |         |          |   |          |         |          |
>*-------------------------------*   *-------------------------------*
>|   4KB    |    3    |    10    |   |   4KB    |   0.8   |   1.9    |
>*-------------------------------*   *-------------------------------*
>|   32KB   |   20    |    61    |   |   32KB   |   6.8   |   20.2   |
>*-------------------------------*   *-------------------------------*
>|  256KB   |   33    |   244    |   |  256KB   |   7.8   |    55    |
>*-------------------------------*   *-------------------------------*
>|    1M    |   30    |   373    |   |    1M    |    7    |    95    |
>*-------------------------------*   *-------------------------------*
>|    8M    |   22    |   475    |   |    8M    |    7    |   114    |
>*-------------------------------*   *-------------------------------*
>
>H2G:
>
>   Core i7 with nested guest.            Atom with normal guest.
>
>*-------------------------------*   *-------------------------------*
>|          |         |          |   |          |         |          |
>| buf size |   copy  | zerocopy |   | buf size |   copy  | zerocopy |
>|          |         |          |   |          |         |          |
>*-------------------------------*   *-------------------------------*
>|   4KB    |   20    |    10    |   |   4KB    |   4.37  |    3     |
>*-------------------------------*   *-------------------------------*
>|   32KB   |   37    |    75    |   |   32KB   |   11    |    18    |
>*-------------------------------*   *-------------------------------*
>|  256KB   |   44    |   299    |   |  256KB   |   11    |    62    |
>*-------------------------------*   *-------------------------------*
>|    1M    |   28    |   335    |   |    1M    |    9    |    77    |
>*-------------------------------*   *-------------------------------*
>|    8M    |   27    |   417    |   |    8M    |   9.35  |   115    |
>*-------------------------------*   *-------------------------------*
>
> * Let's look to the first line of both tables - where copy is better
>   than zerocopy. I analyzed this case more deeply and found that
>   bottleneck is function 'vhost_work_queue()'. With 4K buffer size,
>   caller spends too much time in it with zerocopy mode (comparing to
>   copy mode). This happens only with 4K buffer size. This function just
>   calls 'wake_up_process()' and its internal logic does not depends on
>   skb, so i think potential reason (may be) is interval between two
>   calls of this function (e.g. how often it is called). Note, that
>   'vhost_work_queue()' differs from the same function at guest's side of
>   transport: 'virtio_transport_send_pkt()' uses 'queue_work()' which
>   i think is more optimized for worker purposes, than direct call to
>   'wake_up_process()'. But again - this is just my assumption.

Thanks for the analysis, however for small payloads it makes sense that
the cost might be too high that optimization does not bring benefits.

>
>Loopback:
>
>   Core i7 with nested guest.            Atom with normal guest.
>
>*-------------------------------*   *-------------------------------*
>|          |         |          |   |          |         |          |
>| buf size |   copy  | zerocopy |   | buf size |   copy  | zerocopy |
>|          |         |          |   |          |         |          |
>*-------------------------------*   *-------------------------------*
>|   4KB    |    8    |    7     |   |   4KB    |   1.8   |   1.3    |
>*-------------------------------*   *-------------------------------*
>|   32KB   |   38    |    44    |   |   32KB   |   10    |    10    |
>*-------------------------------*   *-------------------------------*
>|  256KB   |   55    |   168    |   |  256KB   |   15    |    36    |
>*-------------------------------*   *-------------------------------*
>|    1M    |   53    |   250    |   |    1M    |   12    |    45    |
>*-------------------------------*   *-------------------------------*
>|    8M    |   40    |   344    |   |    8M    |   11    |    74    |
>*-------------------------------*   *-------------------------------*
>
>I analyzed performace difference more deeply for the following setup:
>server: ./vsock_perf --vsk-size 16M
>client: ./vsock_perf --sender 2 --bytes 16M --buf-size 16K/4K [--zc]
>
>In other words I send 16M of data from guest to host in copy/zerocopy
>modes and with two different sizes of buffer - 4K and 64K. Let's see
>to tx path for both modes - it consists of two steps:
>
>copy:
>1) Allocate skb of buffer's length.
>2) Copy data to skb from buffer.
>
>zerocopy:
>1) Allocate skb with header space only.
>2) Pin pages of the buffer and insert them to skb.
>
>I measured average number of ns (returned by 'ktime_get()') for each
>step above:
>1) Skb allocation (for both copy and zerocopy modes).
>2) For copy mode in 'memcpy_to_msg()' - copying.
>3) For zerocopy mode in '__zerocopy_sg_from_iter()' - pinning.
>
>Here are results for copy mode:
>*-------------------------------------*
>| buf | skb alloc | 'memcpy_to_msg()' |
>*-------------------------------------*
>|     |           |                   |
>| 64K |  5000ns   |      25000ns      |
>|     |           |                   |
>*-------------------------------------*
>|     |           |                   |
>| 4K  |   800ns   |      2200ns       |
>|     |           |                   |
>*-------------------------------------*
>
>Here are results for zerocopy mode:
>*-----------------------------------------------*
>| buf | skb alloc | '__zerocopy_sg_from_iter()' |
>*-----------------------------------------------*
>|     |           |                             |
>| 64K |   250ns   |           3500ns            |
>|     |           |                             |
>*-----------------------------------------------*
>|     |           |                             |
>| 4K  |   250ns   |           3000ns            |
>|     |           |                             |
>*-----------------------------------------------*
>
>I guess that reason of zerocopy performance is low overhead for page
>pinning: there is big difference between 4K and 64K in case of copying
>(25000 vs 2200), but in pinning case - just 3000 vs 3500.
>
>So, zerocopy is faster than classic copy mode, but of course it requires
>specific architecture of application due to user pages pinning, buffer
>size and alignment.

Makes sense!

>
>                             NOTES
>
>If host fails to send data with "Cannot allocate memory", check value
>/proc/sys/net/core/optmem_max - it is accounted during completion skb
>allocation. Try to update it to for example 1M and try send again:
>"echo 1048576 > /proc/sys/net/core/optmem_max" (as root).
>
>                            TESTING
>
>This patchset includes set of tests for MSG_ZEROCOPY feature. I tried to
>cover new code as much as possible so there are different cases for
>MSG_ZEROCOPY transmissions: with disabled SO_ZEROCOPY and several io
>vector types (different sizes, alignments, with unmapped pages). I also
>run tests with loopback transport and run vsockmon. In v3 i've added
>io_uring test as separated application.
>
>           LET'S SPLIT PATCHSET TO MAKE REVIEW EASIER
>
>In v3 Stefano Garzarella <sgarzare@redhat.com> asked to split this patchset
>for several parts, because it looks too big for review. I think in this
>version (v4) we can do it in the following way:
>
>[0001 - 0005] - this is preparation for virtio/vhost part.
>[0006 - 0009] - this is preparation for AF_VSOCK part.
>[0010 - 0013] - these patches allows to trigger logic from the previous
>                two parts.
>[0014 - rest] - updates for doc, tests, utils. This part doesn't touch
>                kernel code and looks not critical.

Yeah, I like this split, but I'd include 14 in the (10, 13) group.

I have reviewed most of them and I think we are well on our way :-)
I've already seen that Bobby suggested changes for v5, so I'll review
that version better.

Great work so far!

Thanks,
Stefano
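The per-step timings quoted in the message above can be summed to see why
4K buffers gain nothing from zerocopy while 64K buffers gain a lot. The
sketch below only does that arithmetic. The struct and function names are
made up for illustration, "64K" is assumed to mean 65536 bytes, and the
derived rates cover just the two measured steps, so they are not the
throughput figures from the tables.

```c
#include <assert.h>
#include <stdint.h>

/* Average cost of the two tx steps measured in the cover letter, in ns. */
struct tx_cost {
	uint32_t skb_alloc_ns;	/* skb allocation */
	uint32_t fill_ns;	/* copying (copy mode) or page pinning (zerocopy) */
};

/* Total time of both steps for one buffer. */
uint32_t tx_total_ns(struct tx_cost c)
{
	return c.skb_alloc_ns + c.fill_ns;
}

/* 1 bit per nanosecond equals 1 Gbit/s, so Gbit/s = bits / ns. */
double gbit_per_s(uint64_t bytes, uint64_t ns)
{
	assert(ns > 0);
	return (double)bytes * 8.0 / (double)ns;
}
```

With the quoted numbers, a 64K buffer costs 5000 + 25000 = 30000 ns in copy
mode against 250 + 3500 = 3750 ns in zerocopy mode, an 8x difference. A 4K
buffer costs 800 + 2200 = 3000 ns in copy mode against 250 + 3000 = 3250 ns
in zerocopy mode, so pinning is slightly more expensive than copying at
that size.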
On 26.06.2023 19:15, Stefano Garzarella wrote:
> On Sat, Jun 03, 2023 at 11:49:22PM +0300, Arseniy Krasnov wrote:

[...]

>>
>>            LET'S SPLIT PATCHSET TO MAKE REVIEW EASIER
>>
>> In v3 Stefano Garzarella <sgarzare@redhat.com> asked to split this patchset
>> for several parts, because it looks too big for review. I think in this
>> version (v4) we can do it in the following way:
>>
>> [0001 - 0005] - this is preparation for virtio/vhost part.
>> [0006 - 0009] - this is preparation for AF_VSOCK part.
>> [0010 - 0013] - these patches allows to trigger logic from the previous
>>                 two parts.
>> [0014 - rest] - updates for doc, tests, utils. This part doesn't touch
>>                 kernel code and looks not critical.
>
> Yeah, I like this split, but I'd include 14 in the (10, 13) group.
>
> I have reviewed most of them and I think we are well on our way :-)
> I've already seen that Bobby suggested changes for v5, so I'll review
> that version better.
>
> Great work so far!

Hello Stefano!

Thanks for the review! I left some questions, but most of the comments are
clear to me. So I guess the idea of the split is that I still keep all patches
in one big patchset, but preserve the structure described above, and we will
do the review step by step according to the split?

Or should I split this patchset into 3 separate sets? I guess this would be
more complex to review...

Thanks, Arseniy

>
> Thanks,
> Stefano
>
On Tue, Jun 27, 2023 at 07:55:58AM +0300, Arseniy Krasnov wrote:
>
>
>On 26.06.2023 19:15, Stefano Garzarella wrote:
>> On Sat, Jun 03, 2023 at 11:49:22PM +0300, Arseniy Krasnov wrote:

[...]

>>>
>>>            LET'S SPLIT PATCHSET TO MAKE REVIEW EASIER
>>>
>>> In v3 Stefano Garzarella <sgarzare@redhat.com> asked to split this patchset
>>> for several parts, because it looks too big for review. I think in this
>>> version (v4) we can do it in the following way:
>>>
>>> [0001 - 0005] - this is preparation for virtio/vhost part.
>>> [0006 - 0009] - this is preparation for AF_VSOCK part.
>>> [0010 - 0013] - these patches allows to trigger logic from the previous
>>>                 two parts.
>>> [0014 - rest] - updates for doc, tests, utils. This part doesn't touch
>>>                 kernel code and looks not critical.
>>
>> Yeah, I like this split, but I'd include 14 in the (10, 13) group.
>>
>> I have reviewed most of them and I think we are well on our way :-)
>> I've already seen that Bobby suggested changes for v5, so I'll review
>> that version better.
>>
>> Great work so far!
>
>Hello Stefano!

Hi Arseniy :-)

>
>Thanks for the review! I left some questions, but most of the comments are
>clear to me. So I guess the idea of the split is that I still keep all patches
>in one big patchset, but preserve the structure described above, and we will
>do the review step by step according to the split?
>
>Or should I split this patchset into 3 separate sets? I guess this would be
>more complex to review...

If the next version is still an RFC, a single series is fine.

Thanks,
Stefano