Message ID | 0e7c6fc4-b4a6-a27b-36e9-359597bba2b5@sberdevices.ru |
---|---|
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> Delivered-To: ouuuleilei@gmail.com Received: by 2002:adf:eb09:0:0:0:0:0 with SMTP id s9csp2086842wrn; Sun, 5 Feb 2023 22:54:32 -0800 (PST) X-Google-Smtp-Source: AK7set+JTWuefmauebYVKG+RMXjKQoy0EgbsHFV3CKlgG/abtAYzoD8CNzBxNZeOXIB5ZX9+pff9 X-Received: by 2002:a17:906:9610:b0:878:605b:ffef with SMTP id s16-20020a170906961000b00878605bffefmr23801060ejx.55.1675666472808; Sun, 05 Feb 2023 22:54:32 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1675666472; cv=none; d=google.com; s=arc-20160816; b=Bea4wRzqfZsWVCQyKFIN/GJyKXFeyZL83pGLyp4ktff9wgAV5iRYr66bqfcmSZazBk sgYfCwRHym10GiQqh8YLTOil6CxKSPnyTdb8930rkMNifbzigYZqP72VeAynaaX2ZyUk 6EaE+/ND1izsNm+gtgbHRsSkYC5oKw/BvQC7KGgivDwgLgWtBJAEVZCgKQwtAkVdr/2y g3TQEenPX1KH4s1j1p1FlZVplR+yykWKsK+xCXyvP77WnE4VmPD1/GHOiyTSIPgmZhCg mom05VLTjnl18pnCWj2YKI5MVP+KR6DlRk6b4sppGxX8bUaFpu0FVHNM3w+P1Yqsz34Q tpJg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:mime-version:content-transfer-encoding :content-id:content-language:accept-language:message-id:date :thread-index:thread-topic:subject:cc:to:from:dkim-signature; bh=W0AZoQYdrXV5hWAmgg6fHvPpS8a3OyjtrscQ67WekUk=; b=bwVsswyuA7pJ1Gl/ump4jIkYc6y3G+nmUd6g598Tyq6fuCLdQQh7E0SgF68BhOXx9D rVvZEMJnoQsv/0MS79LA5vMnyHNwnmQeWQgbL0yRZnshsS2NjmfdU4BCJJ3fU5uBeR30 YhZnCDcFJTHcUnAxO+DwvdDq7Bn2UF61wiRKUEV240aPNOLywfusMuhqOkpXBVJp9HDv RKtVma/fQEDbnjMXoEmOOJKBEdYO4rJwqewmjP8/fmYiEbomisoG2fgtGh12uDBvm6Yz mddWmJLy9pUG4TWNESACIukvOGYShGjhUSXvpxVfTUxwk5AaUjwok08dLCYXqGOYVl9e QkjQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@sberdevices.ru header.s=mail header.b=DSFpNTfj; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=sberdevices.ru Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id f16-20020a170906739000b008847d7ed36csi13479547ejl.895.2023.02.05.22.54.09; Sun, 05 Feb 2023 22:54:32 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@sberdevices.ru header.s=mail header.b=DSFpNTfj; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=sberdevices.ru Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229731AbjBFGwK (ORCPT <rfc822;kmanaouilinux@gmail.com> + 99 others); Mon, 6 Feb 2023 01:52:10 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:46286 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229447AbjBFGwI (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Mon, 6 Feb 2023 01:52:08 -0500 Received: from mx.sberdevices.ru (mx.sberdevices.ru [45.89.227.171]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 82695B745; Sun, 5 Feb 2023 22:52:02 -0800 (PST) Received: from s-lin-edge02.sberdevices.ru (localhost [127.0.0.1]) by mx.sberdevices.ru (Postfix) with ESMTP id 29E115FD02; Mon, 6 Feb 2023 09:51:59 +0300 (MSK) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sberdevices.ru; s=mail; t=1675666319; bh=W0AZoQYdrXV5hWAmgg6fHvPpS8a3OyjtrscQ67WekUk=; h=From:To:Subject:Date:Message-ID:Content-Type:MIME-Version; b=DSFpNTfjfEUSklHX0SzcE/2oRVDJSeacU765KhIUBhLb8zJhW7GEI6ACL8HWcJ1Ab FSYaCi/OBq2k7KLiwaERV2mmB+4IAYcJAWm+NG5EqmVbz40WoJhdHlklCu2RZeLc6x SRr7/7wvTRc4DytPFzUGE1xEIw+qZbN0ZFFfrCODCSdsToQ/fkAP9VeQTZTUDlFVPF hVqvvBHH55I+a4xI7MgAspIZesp1tfCzpH+Z4IbB/izENf0m/hywm7Za3raEG+Ipjv TMAknNpCNRzjZxICAkz4J1uw7W3wI0VwnR/TnxAEfxmmRtmMP76HSA5K04gC6yhZXe 2fsReO4Dso1fA== Received: from S-MS-EXCH02.sberdevices.ru (S-MS-EXCH02.sberdevices.ru [172.16.1.5]) by mx.sberdevices.ru (Postfix) with ESMTP; Mon, 6 Feb 2023 09:51:55 +0300 (MSK) From: Arseniy Krasnov <AVKrasnov@sberdevices.ru> To: Stefan Hajnoczi <stefanha@redhat.com>, Stefano Garzarella <sgarzare@redhat.com>, "Michael S. Tsirkin" <mst@redhat.com>, Jason Wang <jasowang@redhat.com>, "David S. Miller" <davem@davemloft.net>, Eric Dumazet <edumazet@google.com>, Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>, Arseniy Krasnov <AVKrasnov@sberdevices.ru>, "Krasnov Arseniy" <oxffffaa@gmail.com> CC: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>, "kvm@vger.kernel.org" <kvm@vger.kernel.org>, "virtualization@lists.linux-foundation.org" <virtualization@lists.linux-foundation.org>, "netdev@vger.kernel.org" <netdev@vger.kernel.org>, kernel <kernel@sberdevices.ru> Subject: [RFC PATCH v1 00/12] vsock: MSG_ZEROCOPY flag support Thread-Topic: [RFC PATCH v1 00/12] vsock: MSG_ZEROCOPY flag support Thread-Index: AQHZOfeA2nL8nCs+GE2d7KRGHhlyGw== Date: Mon, 6 Feb 2023 06:51:55 +0000 Message-ID: <0e7c6fc4-b4a6-a27b-36e9-359597bba2b5@sberdevices.ru> Accept-Language: en-US, ru-RU Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [172.16.1.12] Content-Type: text/plain; charset="utf-8" Content-ID: <9DBF5BD914616347BBBBC33C6C85928E@sberdevices.ru> Content-Transfer-Encoding: base64 MIME-Version: 1.0 X-KSMG-Rule-ID: 4 X-KSMG-Message-Action: clean X-KSMG-AntiSpam-Status: not scanned, disabled by settings X-KSMG-AntiSpam-Interceptor-Info: not scanned X-KSMG-AntiPhishing: not scanned, disabled by settings X-KSMG-AntiVirus: Kaspersky Secure Mail Gateway, version 1.1.2.30, bases: 2023/02/06 01:18:00 #20834045 X-KSMG-AntiVirus-Status: Clean, skipped X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,SPF_HELO_NONE,T_SPF_TEMPERROR autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?= X-GMAIL-THRID: =?utf-8?q?1757063647023463139?= X-GMAIL-MSGID: =?utf-8?q?1757063647023463139?= |
Series |
vsock: MSG_ZEROCOPY flag support
|
|
Message
Arseniy Krasnov
Feb. 6, 2023, 6:51 a.m. UTC
Hello, DESCRIPTION this is MSG_ZEROCOPY feature support for virtio/vsock. I tried to follow current implementation for TCP as much as possible: 1) Sender must enable SO_ZEROCOPY flag to use this feature. Without this flag, data will be sent in "classic" copy manner and MSG_ZEROCOPY flag will be ignored (e.g. without completion). 2) Kernel uses completions from socket's error queue. Single completion for single tx syscall (or it can merge several completions to single one). I used already implemented logic for MSG_ZEROCOPY support: 'msg_zerocopy_realloc()' etc. Difference with copy way is not significant. During packet allocation, non-linear skb is created, then I call 'get_user_pages()' for each page from user's iov iterator (I think i don't need 'pin_user_pages()' as there is no backing storage for these pages) and add each returned page to the skb as fragment. There are also some updates for vhost and guest parts of transport - in both cases i've added handling of non-linear skb for virtio part. vhost copies data from such skb to the guest's rx virtio buffers. In the guest, virtio transport fills virtio queue with pages from skb. I think doc in Documentation/networking/msg_zerocopy.rst could be also updated in next versions. This version has several limits/problems: 1) As this feature totally depends on transport, there is no way (or it is difficult) to check whether transport is able to handle it or not during SO_ZEROCOPY setting. Seems I need to call AF_VSOCK specific setsockopt callback from setsockopt callback for SOL_SOCKET, but this leads to lock problem, because both AF_VSOCK and SOL_SOCKET callback are not considered to be called from each other. So in current version SO_ZEROCOPY is set successfully to any type (e.g. transport) of AF_VSOCK socket, but if transport does not support MSG_ZEROCOPY, tx routine will fail with EOPNOTSUPP. 2) When MSG_ZEROCOPY is used, for each tx system call we need to enqueue one completion. In each completion there is flag which shows how tx was performed: zerocopy or copy. This leads that whole message must be send in zerocopy or copy way - we can't send part of message with copying and rest of message with zerocopy mode (or vice versa). Now, we need to account vsock credit logic, e.g. we can't send whole data once - only allowed number of bytes could sent at any moment. In case of copying way there is no problem as in worst case we can send single bytes, but zerocopy is more complex because smallest transmission unit is single page. So if there is not enough space at peer's side to send integer number of pages (at least one) - we will wait, thus stalling tx side. To overcome this problem i've added simple rule - zerocopy is possible only when there is enough space at another side for whole message (to check, that current 'msghdr' was already used in previous tx iterations i use 'iov_offset' field of it's iov iter). 3) loopback transport is not supported, because it requires to implement non-linear skb handling in dequeue logic (as we "send" fragged skb and "receive" it from the same queue). I'm going to implement it in next versions. 4) Current implementation sets max length of packet to 64KB. IIUC this is due to 'kmalloc()' allocated data buffers. I think, in case of MSG_ZEROCOPY this value could be increased, because 'kmalloc()' is not touched for data - user space pages are used as buffers. Also this limit trims every message which is > 64KB, thus such messages will be send in copy mode due to 'iov_offset' check in 2). PERFORMANCE Performance: it is a little bit tricky to compare performance between copy and zerocopy transmissions. In zerocopy way we need to wait when user buffers will be released by kernel, so it something like synchronous path (wait until device driver will process it), while in copy way we can feed data to kernel as many as we want, don't care about device driver. So I compared only time which we spend in 'sendmsg()' syscall. Also there is limit from 4) above so max buffer size is 64KB. I've tested this patchset in the nested VM, but i think for V1 it is not a big deal. Sender: ./vsock_perf --sender <CID> --buf-size <buf size> --bytes 60M [--zc] Receiver: ./vsock_perf --vsk-size 256M Number in cell is seconds which senders spends inside tx syscall. Guest to host transmission: *-------------------------------* | | | | | buf size | copy | zerocopy | | | | | *-------------------------------* | 4KB | 0.26 | 0.042 | *-------------------------------* | 16KB | 0.11 | 0.014 | *-------------------------------* | 32KB | 0.05 | 0.009 | *-------------------------------* | 64KB | 0.04 | 0.005 | *-------------------------------* Host to guest transmission: *--------------------------------* | | | | | buf size | copy | zerocopy | | | | | *--------------------------------* | 4KB | 0.049 | 0.034 | *--------------------------------* | 16KB | 0.03 | 0.024 | *--------------------------------* | 32KB | 0.025 | 0.01 | *--------------------------------* | 64KB | 0.028 | 0.01 | *--------------------------------* If host fails to send data with "Cannot allocate memory", check value /proc/sys/net/core/optmem_max - it is accounted during completion skb allocation. Zerocopy is faster than classic copy mode, but of course it requires specific architecture of application due to user pages pinning, buffer size and alignment. In next versions i'm going to fix 64KB barrier to perform tests with bigger buffer sizes. TESTING This patchset includes set of tests for MSG_ZEROCOPY feature. I tried to cover new code as much as possible so there are different cases for MSG_ZEROCOPY transmissions: with disabled SO_ZEROCOPY and several io vector types (different sizes, alignments, with unmapped pages). Thanks, Arseniy Arseniy Krasnov(12): vsock: check error queue to set EPOLLERR vsock: read from socket's error queue vsock: check for MSG_ZEROCOPY support vhost/vsock: non-linear skb handling support vsock/virtio: non-linear skb support vsock/virtio: non-linear skb handling for TAP dev vsock/virtio: MGS_ZEROCOPY flag support vhost/vsock: support MSG_ZEROCOPY for transport vsock/virtio: support MSG_ZEROCOPY for transport net/sock: enable setting SO_ZEROCOPY for PF_VSOCK test/vsock: MSG_ZEROCOPY flag tests test/vsock: MSG_ZEROCOPY support for vsock_perf drivers/vhost/vsock.c | 62 +++- include/linux/socket.h | 1 + include/linux/virtio_vsock.h | 12 + include/net/af_vsock.h | 2 + net/core/sock.c | 4 +- net/vmw_vsock/af_vsock.c | 35 ++- net/vmw_vsock/virtio_transport.c | 38 ++- net/vmw_vsock/virtio_transport_common.c | 255 ++++++++++++++-- tools/testing/vsock/Makefile | 2 +- tools/testing/vsock/util.h | 1 + tools/testing/vsock/vsock_perf.c | 127 +++++++- tools/testing/vsock/vsock_test.c | 11 + tools/testing/vsock/vsock_test_zerocopy.c | 470 ++++++++++++++++++++++++++++++ tools/testing/vsock/vsock_test_zerocopy.h | 12 + 14 files changed, 991 insertions(+), 41 deletions(-) -- 2.25.1
Comments
Hi Arseniy, sorry for the delay, but I was offline. On Mon, Feb 06, 2023 at 06:51:55AM +0000, Arseniy Krasnov wrote: >Hello, > > DESCRIPTION > >this is MSG_ZEROCOPY feature support for virtio/vsock. I tried to follow >current implementation for TCP as much as possible: > >1) Sender must enable SO_ZEROCOPY flag to use this feature. Without this > flag, data will be sent in "classic" copy manner and MSG_ZEROCOPY > flag will be ignored (e.g. without completion). > >2) Kernel uses completions from socket's error queue. Single completion > for single tx syscall (or it can merge several completions to single > one). I used already implemented logic for MSG_ZEROCOPY support: > 'msg_zerocopy_realloc()' etc. I will review for the vsock point of view. Hope some net maintainers can comment about SO_ZEROCOPY. Anyway I think is a good idea to keep it as close as possible to the TCP implementation. > >Difference with copy way is not significant. During packet allocation, >non-linear skb is created, then I call 'get_user_pages()' for each page >from user's iov iterator (I think i don't need 'pin_user_pages()' as Are these pages exposed to the host via virtqueues? If so, I think we should pin them. What happens if the host accesses them but these pages have been unmapped? >there is no backing storage for these pages) and add each returned page >to the skb as fragment. There are also some updates for vhost and guest >parts of transport - in both cases i've added handling of non-linear skb >for virtio part. vhost copies data from such skb to the guest's rx virtio >buffers. In the guest, virtio transport fills virtio queue with pages >from skb. > >I think doc in Documentation/networking/msg_zerocopy.rst could be also >updated in next versions. Yep, good idea. > >This version has several limits/problems: > >1) As this feature totally depends on transport, there is no way (or it > is difficult) to check whether transport is able to handle it or not > during SO_ZEROCOPY setting. Seems I need to call AF_VSOCK specific > setsockopt callback from setsockopt callback for SOL_SOCKET, but this > leads to lock problem, because both AF_VSOCK and SOL_SOCKET callback > are not considered to be called from each other. So in current version > SO_ZEROCOPY is set successfully to any type (e.g. transport) of > AF_VSOCK socket, but if transport does not support MSG_ZEROCOPY, > tx routine will fail with EOPNOTSUPP. I'll take a look, but if we have no alternative, I think it's okay to make tx fail. > >2) When MSG_ZEROCOPY is used, for each tx system call we need to enqueue > one completion. In each completion there is flag which shows how tx > was performed: zerocopy or copy. This leads that whole message must > be send in zerocopy or copy way - we can't send part of message with > copying and rest of message with zerocopy mode (or vice versa). Now, > we need to account vsock credit logic, e.g. we can't send whole data > once - only allowed number of bytes could sent at any moment. In case > of copying way there is no problem as in worst case we can send single > bytes, but zerocopy is more complex because smallest transmission > unit is single page. So if there is not enough space at peer's side > to send integer number of pages (at least one) - we will wait, thus > stalling tx side. To overcome this problem i've added simple rule - > zerocopy is possible only when there is enough space at another side > for whole message (to check, that current 'msghdr' was already used > in previous tx iterations i use 'iov_offset' field of it's iov iter). I see the problem and I think your approach is the right one. > >3) loopback transport is not supported, because it requires to implement > non-linear skb handling in dequeue logic (as we "send" fragged skb > and "receive" it from the same queue). I'm going to implement it in > next versions. loopback is useful for testing and debugging, so it would be great to have the support, but if it's too complicated, we can do it later. > >4) Current implementation sets max length of packet to 64KB. IIUC this > is due to 'kmalloc()' allocated data buffers. I think, in case of Yep, I think so. When I started touching this code, the limit was already there. As you said, I think it was introduced to have a limit on (host/device side?) allocation, but buf_alloc might be enough, so maybe we could also remove it for copy mode. The only problem I see is compatibility with old devices/drivers, so maybe we need a feature in the spec to say the limit is gone, or have a field in the virtio config space where the device specifies its limit (for the driver, the limit is implicitly that of the buffer allocated and put in the virtqueue). This reminded me that Laura had proposed something similar before, maybe we should take it up again: https://markmail.org/message/3el4ckeakfilg5wo > MSG_ZEROCOPY this value could be increased, because 'kmalloc()' is > not touched for data - user space pages are used as buffers. Also > this limit trims every message which is > 64KB, thus such messages > will be send in copy mode due to 'iov_offset' check in 2). The host still needs to allocate and copy, so maybe the limitation could be to avoid large allocations in the host, but actually the host can use vmalloc because it doesn't need them to be contiguous. > > PERFORMANCE > >Performance: it is a little bit tricky to compare performance between >copy and zerocopy transmissions. In zerocopy way we need to wait when >user buffers will be released by kernel, so it something like synchronous >path (wait until device driver will process it), while in copy way we >can feed data to kernel as many as we want, don't care about device >driver. So I compared only time which we spend in 'sendmsg()' syscall. >Also there is limit from 4) above so max buffer size is 64KB. I've >tested this patchset in the nested VM, but i think for V1 it is not a >big deal. > >Sender: >./vsock_perf --sender <CID> --buf-size <buf size> --bytes 60M [--zc] > >Receiver: >./vsock_perf --vsk-size 256M > >Number in cell is seconds which senders spends inside tx syscall. > >Guest to host transmission: > >*-------------------------------* >| | | | >| buf size | copy | zerocopy | >| | | | >*-------------------------------* >| 4KB | 0.26 | 0.042 | >*-------------------------------* >| 16KB | 0.11 | 0.014 | >*-------------------------------* >| 32KB | 0.05 | 0.009 | >*-------------------------------* >| 64KB | 0.04 | 0.005 | >*-------------------------------* > >Host to guest transmission: > >*--------------------------------* >| | | | >| buf size | copy | zerocopy | >| | | | >*--------------------------------* >| 4KB | 0.049 | 0.034 | >*--------------------------------* >| 16KB | 0.03 | 0.024 | >*--------------------------------* >| 32KB | 0.025 | 0.01 | >*--------------------------------* >| 64KB | 0.028 | 0.01 | >*--------------------------------* > >If host fails to send data with "Cannot allocate memory", check value >/proc/sys/net/core/optmem_max - it is accounted during completion skb >allocation. > >Zerocopy is faster than classic copy mode, but of course it requires >specific architecture of application due to user pages pinning, buffer >size and alignment. In next versions i'm going to fix 64KB barrier to >perform tests with bigger buffer sizes. Yep, I see. Adjusting vsock_perf to compare also Gbps (can io_uring helps in this case to have more requests in-flight?) would be great. > > TESTING > >This patchset includes set of tests for MSG_ZEROCOPY feature. I tried to >cover new code as much as possible so there are different cases for >MSG_ZEROCOPY transmissions: with disabled SO_ZEROCOPY and several io >vector types (different sizes, alignments, with unmapped pages). Great! Thanks for adding the tests! I'll go through the patches between today and Monday. Sorry again for taking so long! Thanks, Stefano
On 16.02.2023 16:33, Stefano Garzarella wrote: > Hi Arseniy, > sorry for the delay, but I was offline. Hello! Sure no problem!, i was also offline a little bit so i'm replying just now > > On Mon, Feb 06, 2023 at 06:51:55AM +0000, Arseniy Krasnov wrote: >> Hello, >> >> DESCRIPTION >> >> this is MSG_ZEROCOPY feature support for virtio/vsock. I tried to follow >> current implementation for TCP as much as possible: >> >> 1) Sender must enable SO_ZEROCOPY flag to use this feature. Without this >> flag, data will be sent in "classic" copy manner and MSG_ZEROCOPY >> flag will be ignored (e.g. without completion). >> >> 2) Kernel uses completions from socket's error queue. Single completion >> for single tx syscall (or it can merge several completions to single >> one). I used already implemented logic for MSG_ZEROCOPY support: >> 'msg_zerocopy_realloc()' etc. > > I will review for the vsock point of view. Hope some net maintainers can > comment about SO_ZEROCOPY. > > Anyway I think is a good idea to keep it as close as possible to the TCP > implementation. > >> >> Difference with copy way is not significant. During packet allocation, >> non-linear skb is created, then I call 'get_user_pages()' for each page >> from user's iov iterator (I think i don't need 'pin_user_pages()' as > > Are these pages exposed to the host via virtqueues? If so, I think we > should pin them. What happens if the host accesses them but these pages > have been unmapped? Yes, user pages with data will be used by the virtio device. 'pin' - You mean use 'pin_user_pages()'? Unmapped - You mean guest will unmap it, while host must access it to copy packet? Such pages have incremented refcount by 'get_user_pages()', so page is locked and memory and will be locked until host finishes copying data. I think it is better to discuss things related to 'get/pin_user_pages()' in one place, for example in 07/12 where this function is called. > >> there is no backing storage for these pages) and add each returned page >> to the skb as fragment. There are also some updates for vhost and guest >> parts of transport - in both cases i've added handling of non-linear skb >> for virtio part. vhost copies data from such skb to the guest's rx virtio >> buffers. In the guest, virtio transport fills virtio queue with pages >> from skb. >> >> I think doc in Documentation/networking/msg_zerocopy.rst could be also >> updated in next versions. > > Yep, good idea. Ack, i'll do it in v2. > >> >> This version has several limits/problems: >> >> 1) As this feature totally depends on transport, there is no way (or it >> is difficult) to check whether transport is able to handle it or not >> during SO_ZEROCOPY setting. Seems I need to call AF_VSOCK specific >> setsockopt callback from setsockopt callback for SOL_SOCKET, but this >> leads to lock problem, because both AF_VSOCK and SOL_SOCKET callback >> are not considered to be called from each other. So in current version >> SO_ZEROCOPY is set successfully to any type (e.g. transport) of >> AF_VSOCK socket, but if transport does not support MSG_ZEROCOPY, >> tx routine will fail with EOPNOTSUPP. > > I'll take a look, but if we have no alternative, I think it's okay to > make tx fail.> Thanks >> >> 2) When MSG_ZEROCOPY is used, for each tx system call we need to enqueue >> one completion. In each completion there is flag which shows how tx >> was performed: zerocopy or copy. This leads that whole message must >> be send in zerocopy or copy way - we can't send part of message with >> copying and rest of message with zerocopy mode (or vice versa). Now, >> we need to account vsock credit logic, e.g. we can't send whole data >> once - only allowed number of bytes could sent at any moment. In case >> of copying way there is no problem as in worst case we can send single >> bytes, but zerocopy is more complex because smallest transmission >> unit is single page. So if there is not enough space at peer's side >> to send integer number of pages (at least one) - we will wait, thus >> stalling tx side. To overcome this problem i've added simple rule - >> zerocopy is possible only when there is enough space at another side >> for whole message (to check, that current 'msghdr' was already used >> in previous tx iterations i use 'iov_offset' field of it's iov iter). > > I see the problem and I think your approach is the right one. > >> >> 3) loopback transport is not supported, because it requires to implement >> non-linear skb handling in dequeue logic (as we "send" fragged skb >> and "receive" it from the same queue). I'm going to implement it in >> next versions. > > loopback is useful for testing and debugging, so it would be great to > have the support, but if it's too complicated, we can do it later. > Ok, i'll implement it in v2. >> >> 4) Current implementation sets max length of packet to 64KB. IIUC this >> is due to 'kmalloc()' allocated data buffers. I think, in case of > > Yep, I think so. > When I started touching this code, the limit was already there. > As you said, I think it was introduced to have a limit on (host/device > side?) allocation, but buf_alloc might be enough, so maybe we could > also remove it for copy mode. > The only problem I see is compatibility with old devices/drivers, so > maybe we need a feature in the spec to say the limit is gone, or have a > field in the virtio config space where the device specifies its limit > (for the driver, the limit is implicitly that of the buffer allocated > and put in the virtqueue). > > This reminded me that Laura had proposed something similar before, > maybe we should take it up again: > https://markmail.org/message/3el4ckeakfilg5wo > >> MSG_ZEROCOPY this value could be increased, because 'kmalloc()' is >> not touched for data - user space pages are used as buffers. Also >> this limit trims every message which is > 64KB, thus such messages >> will be send in copy mode due to 'iov_offset' check in 2). > > The host still needs to allocate and copy, so maybe the limitation > could be to avoid large allocations in the host, but actually the host > can use vmalloc because it doesn't need them to be contiguous. > Hmmm, I think it is possible to solve this situation in the following way - i can keep limitation for 64KB for copy mode, and remove it for zero copy, but I'll limit each packet size to 64KB(of course technically it is not exactly 64KB, it is min(max packet size, MAX_SKB_FRAGS * PAGE_SIZE) where max packet size is 64Kb, but for simplicity let's call this limit 64Kb:) ). E.g. when zerocopy transmission needs to send for example 129Kb(of course peer's free space is big enough and this check is passed), I'll won't trim 129Kb to 64Kb + 64Kb + 1Kb in the current manner - by returning to af_vsock.c after sending every skb. I'll alloc several skbs, (3 in this case - 64Kb + 64Kb + 1Kb) in single call to the transport. Completion arg will be attached only to the last one skb, and send these 3 skbs. Host still processes 64Kb(let it be 64Kb for simplicity again :) ) packets - no big allocations. Moreover, i think that this logic could be a little optimization for copy mode - why we allocate single skb and always return to af_vsock.c? May be we can iterate needed number of skbs in the loop and send them. Also about vmalloc(), IIUC there is already this idea which replaces 'kmalloc()' to 'kvmalloc()'. >> >> PERFORMANCE >> >> Performance: it is a little bit tricky to compare performance between >> copy and zerocopy transmissions. In zerocopy way we need to wait when >> user buffers will be released by kernel, so it something like synchronous >> path (wait until device driver will process it), while in copy way we >> can feed data to kernel as many as we want, don't care about device >> driver. So I compared only time which we spend in 'sendmsg()' syscall. >> Also there is limit from 4) above so max buffer size is 64KB. I've >> tested this patchset in the nested VM, but i think for V1 it is not a >> big deal. >> >> Sender: >> ./vsock_perf --sender <CID> --buf-size <buf size> --bytes 60M [--zc] >> >> Receiver: >> ./vsock_perf --vsk-size 256M >> >> Number in cell is seconds which senders spends inside tx syscall. >> >> Guest to host transmission: >> >> *-------------------------------* >> | | | | >> | buf size | copy | zerocopy | >> | | | | >> *-------------------------------* >> | 4KB | 0.26 | 0.042 | >> *-------------------------------* >> | 16KB | 0.11 | 0.014 | >> *-------------------------------* >> | 32KB | 0.05 | 0.009 | >> *-------------------------------* >> | 64KB | 0.04 | 0.005 | >> *-------------------------------* >> >> Host to guest transmission: >> >> *--------------------------------* >> | | | | >> | buf size | copy | zerocopy | >> | | | | >> *--------------------------------* >> | 4KB | 0.049 | 0.034 | >> *--------------------------------* >> | 16KB | 0.03 | 0.024 | >> *--------------------------------* >> | 32KB | 0.025 | 0.01 | >> *--------------------------------* >> | 64KB | 0.028 | 0.01 | >> *--------------------------------* >> >> If host fails to send data with "Cannot allocate memory", check value >> /proc/sys/net/core/optmem_max - it is accounted during completion skb >> allocation. >> >> Zerocopy is faster than classic copy mode, but of course it requires >> specific architecture of application due to user pages pinning, buffer >> size and alignment. In next versions i'm going to fix 64KB barrier to >> perform tests with bigger buffer sizes. > > Yep, I see. > Adjusting vsock_perf to compare also Gbps (can io_uring helps in this > case to have more requests in-flight?) would be great. > Yes, i'll add Gbps. Also I thought about adding io_uring test to the current test suite because io_uring also have MSG_ZEROCOPY type of request, and interesting thing is that io_uring uses it's own already allocated uarg. >> >> TESTING >> >> This patchset includes set of tests for MSG_ZEROCOPY feature. I tried to >> cover new code as much as possible so there are different cases for >> MSG_ZEROCOPY transmissions: with disabled SO_ZEROCOPY and several io >> vector types (different sizes, alignments, with unmapped pages). > > Great! Thanks for adding the tests! > > I'll go through the patches between today and Monday. > Sorry again for taking so long! Sure, Thanks for review! I think we are not hurry :) > > Thanks, > Stefano >
On Mon, Feb 20, 2023 at 08:59:32AM +0000, Krasnov Arseniy wrote: >On 16.02.2023 16:33, Stefano Garzarella wrote: >> Hi Arseniy, >> sorry for the delay, but I was offline. > >Hello! Sure no problem!, i was also offline a little bit so i'm replying >just now > >> >> On Mon, Feb 06, 2023 at 06:51:55AM +0000, Arseniy Krasnov wrote: >>> Hello, >>> >>> DESCRIPTION >>> >>> this is MSG_ZEROCOPY feature support for virtio/vsock. I tried to follow >>> current implementation for TCP as much as possible: >>> >>> 1) Sender must enable SO_ZEROCOPY flag to use this feature. Without this >>> flag, data will be sent in "classic" copy manner and MSG_ZEROCOPY >>> flag will be ignored (e.g. without completion). >>> >>> 2) Kernel uses completions from socket's error queue. Single completion >>> for single tx syscall (or it can merge several completions to single >>> one). I used already implemented logic for MSG_ZEROCOPY support: >>> 'msg_zerocopy_realloc()' etc. >> >> I will review for the vsock point of view. Hope some net maintainers can >> comment about SO_ZEROCOPY. >> >> Anyway I think is a good idea to keep it as close as possible to the TCP >> implementation. >> >>> >>> Difference with copy way is not significant. During packet allocation, >>> non-linear skb is created, then I call 'get_user_pages()' for each page >>> from user's iov iterator (I think i don't need 'pin_user_pages()' as >> >> Are these pages exposed to the host via virtqueues? If so, I think we >> should pin them. What happens if the host accesses them but these pages >> have been unmapped? > >Yes, user pages with data will be used by the virtio device. >'pin' - You mean use 'pin_user_pages()'? Unmapped - You mean guest > will unmap it, while host must access it to copy packet? Such pages > have incremented refcount by 'get_user_pages()', so page is locked > and memory and will be locked until host finishes copying data. Yep, I got it while reviewing patch 7 ;-) > I think it is better to discuss things related to 'get/pin_user_pages()' > in one place, for example in 07/12 where this function is called. Agree! > >> >>> there is no backing storage for these pages) and add each returned page >>> to the skb as fragment. There are also some updates for vhost and guest >>> parts of transport - in both cases i've added handling of non-linear skb >>> for virtio part. vhost copies data from such skb to the guest's rx virtio >>> buffers. In the guest, virtio transport fills virtio queue with pages >>> from skb. >>> >>> I think doc in Documentation/networking/msg_zerocopy.rst could be also >>> updated in next versions. >> >> Yep, good idea. > >Ack, i'll do it in v2. > >> >>> >>> This version has several limits/problems: >>> >>> 1) As this feature totally depends on transport, there is no way (or it >>> is difficult) to check whether transport is able to handle it or not >>> during SO_ZEROCOPY setting. Seems I need to call AF_VSOCK specific >>> setsockopt callback from setsockopt callback for SOL_SOCKET, but this >>> leads to lock problem, because both AF_VSOCK and SOL_SOCKET callback >>> are not considered to be called from each other. So in current version >>> SO_ZEROCOPY is set successfully to any type (e.g. transport) of >>> AF_VSOCK socket, but if transport does not support MSG_ZEROCOPY, >>> tx routine will fail with EOPNOTSUPP. >> >> I'll take a look, but if we have no alternative, I think it's okay to >> make tx fail.> > >Thanks > >>> >>> 2) When MSG_ZEROCOPY is used, for each tx system call we need to enqueue >>> one completion. In each completion there is flag which shows how tx >>> was performed: zerocopy or copy. This leads that whole message must >>> be send in zerocopy or copy way - we can't send part of message with >>> copying and rest of message with zerocopy mode (or vice versa). Now, >>> we need to account vsock credit logic, e.g. we can't send whole data >>> once - only allowed number of bytes could sent at any moment. In case >>> of copying way there is no problem as in worst case we can send single >>> bytes, but zerocopy is more complex because smallest transmission >>> unit is single page. So if there is not enough space at peer's side >>> to send integer number of pages (at least one) - we will wait, thus >>> stalling tx side. To overcome this problem i've added simple rule - >>> zerocopy is possible only when there is enough space at another side >>> for whole message (to check, that current 'msghdr' was already used >>> in previous tx iterations i use 'iov_offset' field of it's iov iter). >> >> I see the problem and I think your approach is the right one. >> >>> >>> 3) loopback transport is not supported, because it requires to implement >>> non-linear skb handling in dequeue logic (as we "send" fragged skb >>> and "receive" it from the same queue). I'm going to implement it in >>> next versions. >> >> loopback is useful for testing and debugging, so it would be great to >> have the support, but if it's too complicated, we can do it later. >> > >Ok, i'll implement it in v2. > >>> >>> 4) Current implementation sets max length of packet to 64KB. IIUC this >>> is due to 'kmalloc()' allocated data buffers. I think, in case of >> >> Yep, I think so. >> When I started touching this code, the limit was already there. >> As you said, I think it was introduced to have a limit on (host/device >> side?) allocation, but buf_alloc might be enough, so maybe we could >> also remove it for copy mode. >> The only problem I see is compatibility with old devices/drivers, so >> maybe we need a feature in the spec to say the limit is gone, or have a >> field in the virtio config space where the device specifies its limit >> (for the driver, the limit is implicitly that of the buffer allocated >> and put in the virtqueue). >> >> This reminded me that Laura had proposed something similar before, >> maybe we should take it up again: >> https://markmail.org/message/3el4ckeakfilg5wo >> >>> MSG_ZEROCOPY this value could be increased, because 'kmalloc()' is >>> not touched for data - user space pages are used as buffers. Also >>> this limit trims every message which is > 64KB, thus such messages >>> will be send in copy mode due to 'iov_offset' check in 2). >> >> The host still needs to allocate and copy, so maybe the limitation >> could be to avoid large allocations in the host, but actually the host >> can use vmalloc because it doesn't need them to be contiguous. >> > >Hmmm, I think it is possible to solve this situation in the following >way - i can keep limitation for 64KB for copy mode, and remove it for >zero copy, but I'll limit each packet size to 64KB(of course technically >it is not exactly 64KB, it is min(max packet size, MAX_SKB_FRAGS * PAGE_SIZE) >where max packet size is 64Kb, but for simplicity let's call this limit 64Kb:) ). >E.g. when zerocopy transmission needs to send for example 129Kb(of course >peer's free space is big enough and this check is passed), I'll won't trim >129Kb to 64Kb + 64Kb + 1Kb in the current manner - by returning to af_vsock.c >after sending every skb. I'll alloc several skbs, (3 in this case - 64Kb + 64Kb + 1Kb) >in single call to the transport. Completion arg will be attached >only to the last one skb, and send these 3 skbs. Host still processes >64Kb(let it be 64Kb for simplicity again :) ) packets - no big allocations. Make sense to me! >Moreover, i think that this logic could be a little optimization for >copy mode - why we allocate single skb and always return to af_vsock.c? @Bobby, can you help us here? >May be we can iterate needed number of skbs in the loop and send them. Yep, I think is doable. > >Also about vmalloc(), IIUC there is already this idea which replaces 'kmalloc()' >to 'kvmalloc()'. Yep, I think it is already merged: 0e3f72931fc4 ("vhost/vsock: Use kvmalloc/kvfree for larger packets.") But this is in the vhost transport (device emulation running in the host), where we don't need that the pages are pinned. > >>> >>> PERFORMANCE >>> >>> Performance: it is a little bit tricky to compare performance between >>> copy and zerocopy transmissions. In zerocopy way we need to wait when >>> user buffers will be released by kernel, so it something like synchronous >>> path (wait until device driver will process it), while in copy way we >>> can feed data to kernel as many as we want, don't care about device >>> driver. So I compared only time which we spend in 'sendmsg()' syscall. >>> Also there is limit from 4) above so max buffer size is 64KB. I've >>> tested this patchset in the nested VM, but i think for V1 it is not a >>> big deal. >>> >>> Sender: >>> ./vsock_perf --sender <CID> --buf-size <buf size> --bytes 60M [--zc] >>> >>> Receiver: >>> ./vsock_perf --vsk-size 256M >>> >>> Number in cell is seconds which senders spends inside tx syscall. >>> >>> Guest to host transmission: >>> >>> *-------------------------------* >>> | | | | >>> | buf size | copy | zerocopy | >>> | | | | >>> *-------------------------------* >>> | 4KB | 0.26 | 0.042 | >>> *-------------------------------* >>> | 16KB | 0.11 | 0.014 | >>> *-------------------------------* >>> | 32KB | 0.05 | 0.009 | >>> *-------------------------------* >>> | 64KB | 0.04 | 0.005 | >>> *-------------------------------* >>> >>> Host to guest transmission: >>> >>> *--------------------------------* >>> | | | | >>> | buf size | copy | zerocopy | >>> | | | | >>> *--------------------------------* >>> | 4KB | 0.049 | 0.034 | >>> *--------------------------------* >>> | 16KB | 0.03 | 0.024 | >>> *--------------------------------* >>> | 32KB | 0.025 | 0.01 | >>> *--------------------------------* >>> | 64KB | 0.028 | 0.01 | >>> *--------------------------------* >>> >>> If host fails to send data with "Cannot allocate memory", check value >>> /proc/sys/net/core/optmem_max - it is accounted during completion skb >>> allocation. >>> >>> Zerocopy is faster than classic copy mode, but of course it requires >>> specific architecture of application due to user pages pinning, buffer >>> size and alignment. In next versions i'm going to fix 64KB barrier to >>> perform tests with bigger buffer sizes. >> >> Yep, I see. >> Adjusting vsock_perf to compare also Gbps (can io_uring helps in this >> case to have more requests in-flight?) would be great. >> > >Yes, i'll add Gbps. Also I thought about adding io_uring test to >the current test suite because io_uring also have MSG_ZEROCOPY >type of request, and interesting thing is that io_uring uses it's >own already allocated uarg. Cool! > >>> >>> TESTING >>> >>> This patchset includes set of tests for MSG_ZEROCOPY feature. I tried to >>> cover new code as much as possible so there are different cases for >>> MSG_ZEROCOPY transmissions: with disabled SO_ZEROCOPY and several io >>> vector types (different sizes, alignments, with unmapped pages). >> >> Great! Thanks for adding the tests! >> >> I'll go through the patches between today and Monday. >> Sorry again for taking so long! > >Sure, Thanks for review! I think we are not hurry :) Yep, thank you for this work! Stefano