Message ID | f60d7e94-795d-06fd-0321-6972533700c5@sberdevices.ru |
---|---|
Headers |
Return-Path: <linux-kernel-owner@vger.kernel.org> From: Arseniy Krasnov <AVKrasnov@sberdevices.ru> To: Stefan Hajnoczi <stefanha@redhat.com>, Stefano Garzarella <sgarzare@redhat.com>, "Michael S. Tsirkin" <mst@redhat.com>, Jason Wang <jasowang@redhat.com>, "David S. Miller" <davem@davemloft.net>, "edumazet@google.com" <edumazet@google.com>, Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>, Krasnov Arseniy <oxffffaa@gmail.com> CC: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, virtualization@lists.linux-foundation.org, netdev@vger.kernel.org, kernel <kernel@sberdevices.ru> Subject: [RFC PATCH v3 00/11] virtio/vsock: experimental zerocopy receive Date: Sun, 6 Nov 2022 19:33:41 +0000 Message-ID: <f60d7e94-795d-06fd-0321-6972533700c5@sberdevices.ru> List-ID: <linux-kernel.vger.kernel.org> X-Mailing-List: linux-kernel@vger.kernel.org |
Series |
virtio/vsock: experimental zerocopy receive
Message
Arseniy Krasnov
Nov. 6, 2022, 7:33 p.m. UTC
INTRODUCTION

Hello,

This is an experimental patchset for virtio vsock zerocopy receive. It
was inspired by TCP zerocopy receive by Eric Dumazet. This API uses the
same idea: call 'mmap()' on a socket's descriptor, then call
'getsockopt()' to fill the provided vma area with pages of virtio
receive buffers. After the received data has been processed by the
user, the pages must be freed by an 'madvise()' call with the
MADV_DONTNEED flag set (if the user does not call 'madvise()', the
next 'getsockopt()' will fail).

DETAILS

Here is how the mapping with mapped pages looks exactly: the first page
contains information about the mapped data buffers. At zero offset the
mapping contains a special data structure:

    struct virtio_vsock_usr_hdr_pref {
        u32 poll_value;
        u32 hdr_num;
    };

This structure contains two fields:

'poll_value' - shows whether the current socket has data to read. When
the socket's input queue is empty, 'poll_value' is set to 0 by the
kernel. When the input queue has some data, 'poll_value' is set to 1 by
the kernel. When the socket is closed for data receive, 'poll_value' is
~0. This tells the user that "there will be no more data, continue to
call 'getsockopt()' until you find 'hdr_num' == 0". The user spins on
it in userspace, without calling the 'poll()' system call (of course,
'poll()' still works).

'hdr_num' - shows the number of mapped pages with data, which start
from the second page of this mapping.

NOTE:
This version has two limitations:

1) One mapping per socket is supported. It is implemented by adding a
   'struct page *' pointer to the 'struct virtio_vsock' structure (the
   first page of the mapping, which contains
   'virtio_vsock_usr_hdr_pref'). But I think support for multiple
   pages could be implemented by using something like a hash table of
   such pages, or, more simply, by just using the first page of the
   mapping as the headers page by default. Also, the number of such
   pages could be controlled by 'setsockopt()'.
2) After the 'mmap()' call, it is impossible to call 'mmap()' again,
   even after calling 'madvise()'/'munmap()' on the whole mapping.
   This is because a socket can't handle 'munmap()' calls (there is no
   such callback in 'proto_ops'), thus the polling page exists until
   the socket is closed.

After the 'virtio_vsock_usr_hdr_pref' object, the first page contains
an array of trimmed virtio vsock packet headers (each contains only the
length of the data on the corresponding page and a 'flags' field):

    struct virtio_vsock_usr_hdr {
        uint32_t length;
        uint32_t flags;
    };

The 'length' field lets the user know the exact size of the payload
within each sequence of pages, and the 'flags' field allows processing
of SOCK_SEQPACKET flags (such as message bounds or record bounds). All
other pages are data pages from the virtio queue.

         Page 0               Page 1        Page N

    [ pref hdr0 .. hdrN ][ data ]    ..   [ data ]
             |       |      ^                 ^
             |       |      |                 |
             |       *------|-----------------*
             |              |
             *--------------*

Of course, a single header can represent an array of pages (when a
packet's buffer is bigger than one page). So here is an example of a
detailed mapping layout for some set of packets. Let's consider the
following sequence of packets: 56 bytes, 4096 bytes and 8200 bytes.
Pages 0, 1, 2, 3, 4 and 5 will be inserted into the user's vma.

    Page 0: [[ pref ][ hdr0 ][ hdr1 ][ hdr2 ][ hdr3 ] ... ]
    Page 1: [ 56 ]
    Page 2: [ 4096 ]
    Page 3: [ 4096 ]
    Page 4: [ 4096 ]
    Page 5: [ 8 ]

Page 0 contains only the array of headers:
    'pref' is 'struct virtio_vsock_usr_hdr_pref'.
    'hdr0' has 56 in its length field.
    'hdr1' has 4096 in its length field.
    'hdr2' has 8200 in its length field.
    'hdr3' has 0 in its length field (this is the end-of-data marker).

Page 1 corresponds to 'hdr0' and holds only 56 bytes of data.
Page 2 corresponds to 'hdr1' and is filled with data.
Page 3 corresponds to 'hdr2' and is filled with data.
Page 4 corresponds to 'hdr2' and is filled with data.
Page 5 corresponds to 'hdr2' and holds only 8 bytes of data.
'pref' will be the following: poll_value = 1, hdr_num = 5.

This patchset also changes the packet allocation: the current code uses
only 'kmalloc()' to create the data buffer. A problem happens when we
try to map such buffers into the user's vma: the kernel forbids mapping
slab pages into a user's vma (pages of "not large" 'kmalloc()'
allocations have the PageSlab flag set, and "not large" can be bigger
than one page). To avoid this, data buffers are now allocated with
'alloc_pages()'.

DIFFERENCE WITH TCP

As this feature uses the same approach as TCP, here are some
differences between the two versions (from the user's POV):

1) For 'getsockopt()':
   - This version passes only the address of the mapping.
   - TCP passes a special structure to 'getsockopt()'. In addition to
     the address of the mapping, it contains 'length' and
     'recv_skip_hint'. The first is the size of the data inside the
     mapping (an out parameter, set by the kernel). The second has
     bool type; if it is true, the user must dequeue the rest of the
     data using the 'read()' syscall (i.e. it is an out parameter too).
2) Mapping structure:
   - This version uses the first page of the mapping for metadata and
     the rest of the pages for data.
   - The TCP version uses the whole mapping for data only.
3) Data layout:
   - This version inserts virtio buffers into the mapping, so each
     buffer may be filled partially. To get the size of the payload in
     every buffer, the first page of the mapping must be used (see 2).
   - The TCP version inserts pages of a single skb.

*Please correct me if I made some mistake in the TCP zerocopy
description.

TESTS

This patchset updates the 'vsock_test' utility: two tests for the new
feature were added. The first test covers invalid cases; the second
checks a valid transmission case.

BENCHMARKING

For benchmarking I've created a small test utility, 'vsock_rx_perf'.
It works in client/server mode. When a client connects to the server,
the server starts sending the specified amount of data to the client
(the size is set as an input argument). The client reads the data and
waits for the next portion. In client mode, the dequeue logic works in
three modes: copy, zerocopy and zerocopy with user polling.
1) In copy mode the client uses the 'read()' system call.
2) In zerocopy mode the client uses 'mmap()'/'getsockopt()' to dequeue
   data and 'poll()' to wait for data.
3) In zerocopy mode + user polling the client uses
   'mmap()'/'getsockopt()', but to wait for data it polls the shared
   page (i.e. a busyloop).

Here is the usage:
-c <cid>       Peer CID to connect to (if run in client mode).
-m <megabytes> Number of megabytes to send.
-b <pages>     Size of RX/TX buffer (or mapping) in pages.
-r <bytes>     SO_RCVLOWAT value in bytes (if run in client mode).
-v <bytes>     Peer credit.
-s             Run as server.
-z [n|y|u]     Dequeue mode.
               n - copy mode; 1) above.
               y - zerocopy mode; 2) above.
               u - zerocopy mode + user poll; 3) above.

The utility produces the following output:
1) In server mode it prints the number of seconds spent in the whole
   tx loop.
2) In client mode it prints several values:
   * Number of seconds spent in the whole rx loop (including 'poll()').
   * Number of seconds spent in dequeue system calls:
     In case of '-z n' it is the time in 'read()'.
     In case of '-z y|u' it is the time in 'getsockopt()' +
     'madvise()'.
   * Number of wake ups with the POLLIN flag set (except in '-z u'
     mode).
   * Average time (ns) of a single dequeue iteration (i.e. the second
     value divided by the third).

The idea of the test is to compare the zerocopy approach with classic
copy. Clearly, to dequeue some "small" amount of data copy must be
better, because a syscall with 'memcpy()' for 1 byte (for example) is
nothing compared to two system calls, where the first must map at
least one page and the second unmaps it.

The test was performed with the following settings:
1) The total size of the data to send is 2G (-m argument).
2) The peer's buffer size is set to 2G (-v argument) - this is needed
   to avoid stalls of the sender waiting for enough credit.
3) Both the buffer size (-b) and SO_RCVLOWAT (-r) are used to control
   the number of bytes to dequeue in a single loop iteration. The
   buffer size limits the max number of bytes to read, while
   SO_RCVLOWAT won't let the user get too small a number of bytes.
4) For the sender, the tx buffer (which is passed to 'write()') size
   is 16Mb. Of course we could set it to the peer's buffer size, and
   since we are in STREAM mode that would lead to 'write()' being
   called once.

Designations here and below:
H2G - host to guest transmission. Server is host, client is guest.
G2H - guest to host transmission. Server is guest, client is host.
C   - copy mode.
ZC  - zerocopy mode.
ZU  - zerocopy with user poll mode. This mode is removed from the test
      at this moment, because I need to support the SO_RCVLOWAT logic
      in it.

Rows correspond to the dequeue mode, while columns show the number of
bytes to dequeue in each mode. Each cell contains several values in
the following format:

    *------------*
    |   A / B    |
    |     C      |
    |     D      |
    *------------*

A - number of seconds the server spent in the tx loop.
B - number of seconds the client spent in the rx loop.
C - number of seconds the client spent in the rx loop, excluding the
    'poll()' system call (i.e. only in dequeue system calls).
D - average number of ns for each POLLIN wake up (in other words, the
    average value for C).
G2H:

           #0        #1        #2        #3        #4        #5
 *----*---------*---------*---------*---------*---------*---------*
 |    |         |         |         |         |         |         |
 |    |   4Kb   |  16Kb   |  64Kb   |  128Kb  |  256Kb  |  512Kb  |
 |    |         |         |         |         |         |         |
 *----*---------*---------*---------*---------*---------*---------*
 |    | 2.3/2.4 |2.48/2.53|2.34/2.38|2.73/2.76|2.65/2.68|3.26/3.35|
 |    |  7039   |  15074  |  34975  |  89938  | 162384  | 438278  |
 *----*---------*---------*---------*---------*---------*---------*
 |    |2.37/2.42|2.36/1.96|2.36/2.42|2.43/2.43|2.42/2.47|2.42/2.46|
 |    |  13598  |  15821  |  29574  |  43265  |  71771  | 150927  |
 *----*---------*---------*---------*---------*---------*---------*

H2G:

           #0        #1        #2        #3        #4        #5
 *----*---------*---------*---------*---------*---------*---------*
 |    |         |         |         |         |         |         |
 |    |   4Kb   |  16Kb   |  64Kb   |  128Kb  |  256Kb  |  512Kb  |
 |    |         |         |         |         |         |         |
 *----*---------*---------*---------*---------*---------*---------*
 |    | 1.5/5.3 |1.55/5.00|1.60/5.00|1.65/5.00|1.65/5.00|1.74/5.00|
 |    |  17145  |  24172  |  72650  | 143496  | 295960  | 674146  |
 *----*---------*---------*---------*---------*---------*---------*
 |    |1.10/6.21|1.10/6.00|1.10/5.48|1.10/5.38|1.10/5.35|1.10/5.35|
 |    |  41855  |  46339  |  71988  | 106000  | 153064  | 242036  |
 *----*---------*---------*---------*---------*---------*---------*

Here are my thoughts about these numbers (the most obvious points):

1) Let's check the C and D values. We see that zerocopy dequeue is
   faster on big buffers (in G2H it starts from 64Kb, in H2G from
   128Kb). I think this is the main result of this test (at the
   moment): it shows the performance difference between copy and
   zerocopy.

2) In G2H mode both server and client spend almost the same time in
   the rx/tx loops (see A / B in the G2H table) - this looks good. In
   H2G mode there is a significant difference between server and
   client. I think some side effects produce this (I continue to
   analyze).

3) Let's check the C value. We can see that G2H is always faster than
   H2G, in both copy and zerocopy mode.
4) Another interesting thing can be seen, for example, in the H2G
   table, row #0, col #4 (the 256Kb case). The number of seconds in
   zerocopy mode is smaller than in copy mode (1.25 vs 2.42), but the
   whole rx loop was faster in copy mode (5 seconds vs 5.35 seconds).
   I.e. if we account for the time spent in 'poll()', copy mode looks
   faster (even though it spends more time in 'read()' than the
   zerocopy loop spends in 'getsockopt()' + 'madvise()'). I think this
   is also a non-obvious effect.

So, according to 1), it is better to use zerocopy if you need to
process big buffers with short rx waits (for example, a video stream).
In other cases it is better to use the classic copy way, as it is more
lightweight.

All tests were performed on an x86 board with a 4-core Celeron N2930
CPU (of course it is not a mainframe, but it is better than a test
with a nested guest) and 8Gb of RAM.

Anyway, this is not the final version, and I will continue to improve
both the kernel logic and the performance tests.

SUGGESTIONS

1) I'm also working on MSG_ZEROCOPY support for virtio/vsock. Maybe I
   can merge both patchsets into a single one?
2) This version works with single page headers. Maybe I can add a new
   'getsockopt()' feature to allow using multiple pages for headers.

CHANGE LOG

v1 -> v2:
1) Zerocopy receive mode must be enabled/disabled (off by default). I
   did not use the generic SO_ZEROCOPY flag, because in the
   virtio-vsock case this feature depends on transport support.
   Instead of SO_ZEROCOPY, an AF_VSOCK layer flag was added:
   SO_VM_SOCKETS_ZEROCOPY, while the previous meaning of
   SO_VM_SOCKETS_ZEROCOPY (insert receive buffers into the user's vm
   area) was renamed to SO_VM_SOCKETS_MAP_RX.
2) The packet header which is exported to the user gets a new extra
   field: 'copy_len'. This field handles a special case: the user
   reads data from the socket in a non-zerocopy way (with zerocopy
   disabled) and then enables zerocopy. In this case, the vhost part
   will switch the buffer allocation logic from 'kmalloc()' to direct
   calls to the buddy allocator.
   But there could be some pending 'kmalloc()'-allocated packets in
   the socket's rx list; when the user then tries to read such packets
   in a zerocopy way, the dequeue will fail, because SLAB pages cannot
   be inserted into the user's vm area. So when such a packet is found
   during zerocopy dequeue, the dequeue loop breaks and 'copy_len'
   shows the size of this "bad" packet. After the user detects this
   case, it must use 'read()'/'recv()' calls to dequeue such a packet.
3) Also, maybe move this feature under a config option? <<<<< DECLINED

v2 -> v3:
1) Zerocopy can be enabled only for a socket in SS_UNCONNECTED state,
   so the 'copy_len' field from v2 was removed.
2) The mapping layout was updated. The first page of the mapping now
   contains the following structure: 'struct virtio_vsock_usr_hdr_pref'
   starting from the first byte, then an array of
   'struct virtio_vsock_usr_hdr'.
3) The transport get/set callbacks for zerocopy were removed; the flag
   that tells whether zerocopy receive is on or off is now stored in
   'vsock_sock'.
4) The 'virtio_transport_recv_pkt()' interface changed. This was done
   because the vhost driver needs to check whether zerocopy is enabled
   on a socket or not, so the socket lookup is performed before packet
   allocation and the socket structure is passed to this function.

       void virtio_transport_recv_pkt(struct virtio_transport *,
                                      struct virtio_vsock_pkt *);
   changed to
       void virtio_transport_recv_pkt(struct virtio_transport *,
                                      struct sock *,
                                      struct virtio_vsock_pkt *);

   If the 'struct sock *' argument is NULL, this function works as
   before; otherwise it skips the socket lookup, using the input
   'sock' as the destination socket.
5) A test for userspace polling was added.
6) The zerocopy tests were moved to a dedicated '.c' file.
7) More benchmark results.
8) Two tests were updated: the message bound test was reworked and a
   test for big message transmission was added.
Arseniy Krasnov (11):
  virtio/vsock: rework packet allocation logic
  virtio/vsock: update 'virtio_transport_recv_pkt()'
  af_vsock: add zerocopy receive logic
  virtio/vsock: add transport zerocopy callback
  vhost/vsock: switch packet's buffer allocation
  vhost/vsock: enable zerocopy callback
  virtio/vsock: enable zerocopy callback
  test/vsock: rework message bound test
  test/vsock: add big message test
  test/vsock: add receive zerocopy tests
  test/vsock: vsock_rx_perf utility

 drivers/vhost/vsock.c                     |  58 ++-
 include/linux/virtio_vsock.h              |   8 +
 include/net/af_vsock.h                    |   8 +
 include/uapi/linux/virtio_vsock.h         |  14 +
 include/uapi/linux/vm_sockets.h           |   3 +
 net/vmw_vsock/af_vsock.c                  | 187 ++++++++-
 net/vmw_vsock/virtio_transport.c          |   4 +-
 net/vmw_vsock/virtio_transport_common.c   | 256 ++++++++++++-
 net/vmw_vsock/vsock_loopback.c            |   2 +-
 tools/include/uapi/linux/virtio_vsock.h   |  15 +
 tools/include/uapi/linux/vm_sockets.h     |   8 +
 tools/testing/vsock/Makefile              |   3 +-
 tools/testing/vsock/control.c             |  34 ++
 tools/testing/vsock/control.h             |   2 +
 tools/testing/vsock/util.c                |  40 +-
 tools/testing/vsock/util.h                |   5 +
 tools/testing/vsock/vsock_rx_perf.c       | 604 ++++++++++++++++++++++++++++++
 tools/testing/vsock/vsock_test.c          | 198 +++++++++-
 tools/testing/vsock/vsock_test_zerocopy.c | 530 ++++++++++++++++++++++++++
 tools/testing/vsock/vsock_test_zerocopy.h |  14 +
 20 files changed, 1953 insertions(+), 40 deletions(-)

--
2.35.0
Comments
Hi Arseniy,
maybe we should start rebasing this series on the new support for
skbuff:
https://lore.kernel.org/lkml/20221110171723.24263-1-bobby.eshleman@bytedance.com/

CCing Bobby to see if it's easy to integrate, since you're both
changing the packet allocation.

On Sun, Nov 06, 2022 at 07:33:41PM +0000, Arseniy Krasnov wrote:

[... quoted INTRODUCTION, DETAILS and DIFFERENCE WITH TCP sections of
the cover letter trimmed ...]

>*Please, correct me if I made some mistake in TCP zerocopy description.

Thank you for the description.
Do you think it would be possible to do the same as TCP? Especially
now that we should support skbuff.

[... quoted TESTS section ...]

Thank you, I really appreciate you adding new tests each time! Great
job!

[... quoted BENCHMARKING section, the three dequeue modes and the
'vsock_rx_perf' usage ...]

Cool, thanks for adding it in this series.

[... quoted test idea and settings ...]

>So, rows corresponds to dequeue mode, while columns show number of

Maybe it would be better to label the rows, I guess the first one is C
and the second one ZC?

Maybe it would be better to report Gbps, so if we change the amount of
data exchanged, we always have a way to compare.

>bytes to dequeue in each mode. Each cell contains several values in
>the next format: [...]

[... quoted description of the A/B/C/D cell values ...]

I see only 3 values in the cells, I missed which one is C and which
one is D.

[... quoted G2H and H2G tables ...]

>1) Let's check C and D values. We see, that zerocopy dequeue is faster
>   on big buffers [...]

Yes, I think this is expected.

>2) In G2H mode both server and client spend almost same time in rx/tx
>   loops [...] In H2G mode, there is significant difference between
>   server and client. [...]

Perhaps a different cost to notify the receiver?

I think it's better to talk about transmitter and receiver, instead of
server and client, I think it's confusing.

>3) Let's check C value. We can see, that G2H is always faster that H2G.
>   In both copy and zerocopy mode.

This is expected, because the guest queues buffers up to 64K entirely,
while the host has to split packets into the guest's preallocated
buffers, which are 4K.

>4) Another interesting thing could be seen for example in H2G table,
>   row #0, col #4 (case for 256Kb). Number of seconds in zerocopy mode
>   is smaller than in copy mode(1.25 vs 2.42) [...]

I see 1.65 vs 1.10, are these the same data, or am I looking at it
wrong?

[... rest of point 4 and the conclusions quoted ...]

>Anyway, this is not final version, and I will continue to improve both
>kernel logic and performance tests.

Great work so far!

Maybe to avoid having to rebase everything later, it's already
worthwhile to start using Bobby's patch with skbuff.

>1) I'm also working on MSG_ZEROCOPY support for virtio/vsock. May be I
>   can merge both patches into single one?

This is already very big, so I don't know if it's worth breaking it
into a preparation series and then a series that adds both.

>2) This version works with single page headers. May be I can add new
>   'getsockopt()' feature to allow to use multiple pages for headers.

What would be the benefit?

A small suggestion: run checkpatch with --strict, because there are
several warnings that would be better resolved.

I'll take a quick look at the patches, but I'd rather do a more
detailed review with skbuffs.

Thanks,
Stefano
On Fri, Nov 11, 2022 at 2:47 PM Stefano Garzarella <sgarzare@redhat.com> wrote: > > Hi Arseniy, > maybe we should start rebasing this series on the new support for > skbuff: > https://lore.kernel.org/lkml/20221110171723.24263-1-bobby.eshleman@bytedance.com/ > > CCing Bobby to see if it's easy to integrate since you're both changing > the packet allocation. > > > On Sun, Nov 06, 2022 at 07:33:41PM +0000, Arseniy Krasnov wrote: > > > > > > INTRODUCTION > > > >Hello, > > > > This is experimental patchset for virtio vsock zerocopy receive. > >It was inspired by TCP zerocopy receive by Eric Dumazet. This API uses > >same idea:call 'mmap()' on socket's descriptor,then call 'getsockopt()' > >to fill provided vma area with pages of virtio receive buffers. After > >received data was processed by user, pages must be freed by 'madvise()' > >call with MADV_DONTNEED flag set(but if user will not call 'madvise()', > >next 'getsockopt()' will fail). > > > > DETAILS > > > > Here is how mapping with mapped pages looks exactly: first page > >contains information about mapped data buffers. At zero offset mapping > >contains special data structure: > > > > struct virtio_vsock_usr_hdr_pref { > > u32 poll_value; > > u32 hdr_num; > > }; > > > >This structure contains two fields: > >'poll_value' - shows that current socket has data to read.When socket's > >intput queue is empty, 'poll_value' is set to 0 by kernel. When input > >queue has some data, 'poll_value' is set to 1 by kernel. When socket is > >closed for data receive, 'poll_value' is ~0.This tells user that "there > >will be no more data,continue to call 'getsockopt()' until you'll find > >'hdr_num' == 0".User spins on it in userspace, without calling 'poll()' > >system call(of course, 'poll()' is still working). > >'hdr_num' - shows number of mapped pages with data which starts from > >second page of this mappined. > > > >NOTE: > > This version has two limitations: > > > > 1) One mapping per socket is supported. 
It is implemented by adding > > 'struct page*' pointer to 'struct virtio_vsock' structure (first > > page of mapping, which contains 'virtio_vsock_usr_hdr_pref').But, > > I think, support for multiple pages could be implemented by using > > something like hash table of such pages, or more simple, just use > > first page of mapping as headers page by default. Also I think, > > number of such pages may be controlled by 'setsockop()'. > > > > 2) After 'mmap()' call,it is impossible to call 'mmap()' again, even > > after calling 'madvise()'/'munmap()' on the whole mapping.This is > > because socket can't handle 'munmap()' calls(as there is no such > > callback in 'proto_ops'),thus polling page exists until socket is > > opened. > > > >After 'virtio_vsock_usr_hdr_pref' object, first page contains array of > >trimmed virtio vsock packet headers (in contains only length of data on > >the corresponding page and 'flags' field): > > > > struct virtio_vsock_usr_hdr { > > uint32_t length; > > uint32_t flags; > > }; > > > >Field 'length' allows user to know exact size of payload within each > >sequence of pages and field 'flags' allows to process SOCK_SEQPACKET > >flags(such as message bounds or record bounds).All other pages are data > >pages from virtio queue. > > > > Page 0 Page 1 Page N > > > > [ pref hdr0 .. hdrN ][ data ] .. [ data ] > > | | ^ ^ > > | | | | > > | *-------|-----------* > > | | > > *----------------* > > > > Of course, single header could represent array of pages (when > >packet's buffer is bigger than one page).So here is example of detailed > >mapping layout for some set of packages. Lets consider that we have the > >following sequence of packages:56 bytes, 4096 bytes and 8200 bytes. All > >pages: 0,1,2,3,4 and 5 will be inserted to user's vma. > > > > Page 0: [[ pref ][ hdr0 ][ hdr 1 ][ hdr 2 ][ hdr 3 ] ... 
] > > Page 1: [ 56 ] > > Page 2: [ 4096 ] > > Page 3: [ 4096 ] > > Page 4: [ 4096 ] > > Page 5: [ 8 ] > > > > Page 0 contains only array of headers: > > 'pref' is 'struct virtio_vsock_usr_hdr_pref'. > > 'hdr0' has 56 in length field. > > 'hdr1' has 4096 in length field. > > 'hdr2' has 8200 in length field. > > 'hdr3' has 0 in length field(this is end of data marker). > > > > Page 1 corresponds to 'hdr0' and has only 56 bytes of data. > > Page 2 corresponds to 'hdr1' and filled with data. > > Page 3 corresponds to 'hdr2' and filled with data. > > Page 4 corresponds to 'hdr2' and filled with data. > > Page 5 corresponds to 'hdr2' and has only 8 bytes of data. > > > > pref will be the following: poll_value = 1, hdr_num = 5 > > > > This patchset also changes packets allocation way: current uses > >only 'kmalloc()' to create data buffer. Problem happens when we try to > >map such buffers to user's vma - kernel restricts to map slab pages > >to user's vma(as pages of "not large" 'kmalloc()' allocations have flag > >PageSlab set and "not large" could be bigger than one page).So to avoid > >this, data buffers now allocated using 'alloc_pages()' call. > > > > DIFFERENCE WITH TCP > > > > As this feature uses same approach as for TCP protocol,here are > >some difference between both version(from user's POV): > > > >1) For 'getsockopt()': > > - This version passes only address of mapping. > > - TCP passes special structure to 'getsockopt()'. In addition to the > > address of mapping in contains 'length' and 'recv_skip_hint'.First > > means size of data inside mapping(out param, set by kernel).Second > > has bool type, if it is true, then user must dequeue rest of data > > using 'read()' syscall(e.g. it is out parameter also). > >2) Mapping structure: > > - This version uses first page of mapping for meta data and rest of > > pages for data. > > - TCP version uses whole mapping for data only. 
> >3) Data layout: > > - This version inserts virtio buffers to mapping, so each buffer may > > be filled partially. To get size of payload in every buffer, first > > mapping's page must be used(see 2). > > - TCP version inserts pages of single skb. > > > >*Please, correct me if I made some mistake in TCP zerocopy description. > > > Thank you for the description. Do you think it would be possible to try > to do the same as TCP? > Especially now that we should support skbuff. > > > > > TESTS > > > > This patchset updates 'vsock_test' utility: two tests for new > >feature were added. First test covers invalid cases.Second checks valid > >transmission case. > > Thank you, I really appreciate you adding new tests each time! Great > job! > > > > > BENCHMARKING > > > > For benchmakring I've created small test utility 'vsock_rx_perf'. > >It works in client/server mode. When client connects to server, server > >starts sending specified amount of data to client(size is set as input > >argument). Client reads data and waits for next portion of it. In client > >mode, dequeue logic works in three modes: copy, zerocopy and zerocopy > >with user polling. > > Cool, thanks for adding it in this series. > > > > >1) In copy mode client uses 'read()' system call. > >2) In zerocopy mode client uses 'mmap()'/'getsockopt()' to dequeue data > > and 'poll()' to wait data. > >3) In zerocopy mode + user polling client uses 'mmap()'/'getsockopt()', > > but to wait data it polls shared page(e.g. busyloop). > > > >Here is usage: > >-c <cid> Peer CID to connect to(if run in client mode). > >-m <megabytes> Number of megabytes to send. > >-b <bytes> Size of RX/TX buffer(or mapping) in pages. > >-r <bytes> SO_RCVLOWAT value in bytes(if run in client mode). > >-v <bytes> peer credit. > >-s Run as server. > >-z [n|y|u] Dequeue mode. > > n - copy mode. 1) above. > > y - zero copy mode. 2) above. > > u - zero copy mode + user poll. 3) above. 
> > > >Utility produces the following output: > >1) In server mode it prints number of sec spent for whole tx loop. > >2) In client mode it prints several values: > > * Number of sec, spent for whole rx loop(including 'poll()'). > > * Number of sec, spend in dequeue system calls: > > In case of '-z n' it will be time in 'read()'. > > In case of '-z y|u' it will be time in 'getsockopt()' + 'madvise()'. > > * Number of wake ups with POLLIN flag set(except '-z u' mode). > > * Average time(ns) in single dequeue iteration(e.g. divide second > > value by third). > > > >Idea of test is to compare zerocopy approach and classic copy, as it is > >clear, that to dequeue some "small" amount of data, copy must be better, > >because syscall with 'memcpy()' for 1 byte(for example) is just nothing > >against two system calls, where first must map at least one page, while > >second will unmap it. > > > >Test was performed with the following settings: > >1) Total size of data to send is 2G(-m argument). > > > >2) Peer's buffer size is changed to 2G(-v argument) - this is needed to > > avoid stalls of sender to wait for enough credit. > > > >3) Both buffer size(-b) and SO_RCVLOWAT(-r) are used to control number > > of bytes to dequeue in single loop iteration. Buffer size limits max > > number of bytes to read, while SO_RCVLOWAT won't allow user to get > > too small number of bytes. > > > >4) For sender, tx buffer(which is passed to 'write()') size is 16Mb. Of > > course we can set it to peer's buffer size and as we are in STREAM > > mode it leads to 'write()' will be called once. > > > >Deignations here and below: > >H2G - host to guest transmission. Server is host, client is guest. > >G2H - guest to host transmission. Server is guest, client is host. > >C - copy mode. > >ZC - zerocopy mode. > >ZU - zerocopy with user poll mode. This mode is removed from test at > > this moment, because I need to support SO_RCVLOWAT logic in it. 
> > > >So, rows corresponds to dequeue mode, while columns show number of > > Maybe it would be better to label the rows, I guess the first one is C > and the second one ZC? > > Maybe it would be better to report Gbps so if we change the amount of > data exchanged, we always have a way to compare. > > >bytes > >to dequeue in each mode. Each cell contains several values in the next > >format: > >*------------* > >| A / B | > >| C | > >| D | > >*------------* > > > >A - number of seconds which server spent in tx loop. > >B - number of seconds which client spent in rx loop. > >C - number of seconds which client spent in rx loop, but except 'poll()' > > system call(e.g. only in dequeue system calls). > >D - Average number of ns for each POLLIN wake up(in other words > > it is average value for C). > > I see only 3 values in the cells, I missed which one is C and which one > is D. > > > > >G2H: > > > > #0 #1 #2 #3 #4 #5 > > *----*---------*---------*---------*---------*---------*---------* > > | | | | | | | | > > | | 4Kb | 16Kb | 64Kb | 128Kb | 256Kb | 512Kb | > > | | | | | | | | > > *----*---------*---------*---------*---------*---------*---------* > > | | 2.3/2.4 |2.48/2.53|2.34/2.38|2.73/2.76|2.65/2.68|3.26/3.35| > > | | 7039 | 15074 | 34975 | 89938 | 162384 | 438278 | > > *----*---------*---------*---------*---------*---------*---------* > > | |2.37/2.42|2.36/1.96|2.36/2.42|2.43/2.43|2.42/2.47|2.42/2.46| > > | | 13598 | 15821 | 29574 | 43265 | 71771 | 150927 | > > *----*---------*---------*---------*---------*---------*---------* > > > >H2G: > > > > #0 #1 #2 #3 #4 #5 > > *----*---------*---------*---------*---------*---------*---------* > > | | | | | | | | > > | | 4Kb | 16Kb | 64Kb | 128Kb | 256Kb | 512Kb | > > | | | | | | | | > > *----*---------*---------*---------*---------*---------*---------* > > | | 1.5/5.3 |1.55/5.00|1.60/5.00|1.65/5.00|1.65/5.00|1.74/5.00| > > | | 17145 | 24172 | 72650 | 143496 | 295960 | 674146 | > > 
*----*---------*---------*---------*---------*---------*---------* > > | |1.10/6.21|1.10/6.00|1.10/5.48|1.10/5.38|1.10/5.35|1.10/5.35| > > | | 41855 | 46339 | 71988 | 106000 | 153064 | 242036 | > > *----*---------*---------*---------*---------*---------*---------* > > > >Here are my thoughts about these numbers(most obvious): > > > >1) Let's check C and D values. We see, that zerocopy dequeue is faster > > on big buffers(in G2H it starts from 64Kb, in H2g - from 128Kb). I > > think this is main result of this test(at this moment), that shows > > performance difference between copy and zerocopy). > > Yes, I think this is expected. > > > > >2) In G2H mode both server and client spend almost same time in rx/tx > > loops(see A / B in G2H table) - it looks good. In H2G mode, there is > > significant difference between server and client. I think there are > > some side effects which produces such effect(continue to analyze). > > Perhaps a different cost to notify the receiver? I think it's better to > talk about transmitter and receiver, instead of server and client, I > think it's confusing. > > > > >3) Let's check C value. We can see, that G2H is always faster that H2G. > > In both copy and zerocopy mode. > > This is expected because the guest queues buffers up to 64K entirely, > while the host has to split packets into the guest's preallocated > buffers, which are 4K. > > > > >4) Another interesting thing could be seen for example in H2G table, > > row #0, col #4 (case for 256Kb). Number of seconds in zerocopy mode > > is smaller than in copy mode(1.25 vs 2.42), but whole rx loop was > > I see 1.65 vs 1.10, are these the same data, or am I looking at it > wrong? > > > faster in copy mode(5 seconds vs 5.35 seconds). E.g. if we account > > time spent in 'poll()', copy mode looks faster(even it spends more > > time in 'read()' than zerocopy loop in 'getsockopt()' + 'madvise()'). > > I think, it is also not obvious effect. 
> > > >So, according 1), it is better to use zerocopy, if You need to process > >big buffers, with small rx waitings(for example it nay be video stream). > >In other cases - it is better to use classic copy way, as it will be > >more lightweight. > > > >All tests were performed on x86 board with 4-core Celeron N2930 CPU(of > >course it is, not a mainframe, but better than test with nested guest) > >and 8Gb of RAM. > > > >Anyway, this is not final version, and I will continue to improve both > >kernel logic and performance tests. > > Great work so far! > > Maybe to avoid having to rebase everything later, it's already > worthwhile to start using Bobby's patch with skbuff. > > > > > SUGGESTIONS > > > >1) I'm also working on MSG_ZEROCOPY support for virtio/vsock. May be I > > can merge both patches into single one? > > This is already very big, so I don't know if it's worth breaking into a > preparation series and then a series that adds both. For example, some test patches not related to zerocopy could go separately. Maybe even vsock_rx_perf without the zerocopy part that is not definitive for now. Too big a set is always scary, even if this one is divided well, but some patches as mentioned could go separately. I left some comments, but as said I prefer to review it after the rebase with skbuff, because I think it changes enough. I'm sorry about that, but having the skbuffs I think is very important. Thanks, Stefano
On 11.11.2022 17:06, Stefano Garzarella wrote: > On Fri, Nov 11, 2022 at 2:47 PM Stefano Garzarella <sgarzare@redhat.com> wrote: >> >> Hi Arseniy, >> maybe we should start rebasing this series on the new support for >> skbuff: >> https://lore.kernel.org/lkml/20221110171723.24263-1-bobby.eshleman@bytedance.com/ >> >> CCing Bobby to see if it's easy to integrate since you're both changing >> the packet allocation. Hello Stefano, Sure! I was waiting for next version of skbuff support(previous one was in august as i remember): 1) Anyway it will be implemented as skbuff is general well-known thing for networking :) 2) It will simplify MSG_ZEROCOPY support, because it uses API based on skbuff. >> >> >> On Sun, Nov 06, 2022 at 07:33:41PM +0000, Arseniy Krasnov wrote: >>> >>> >>> INTRODUCTION >>> >>> Hello, >>> >>> This is experimental patchset for virtio vsock zerocopy receive. >>> It was inspired by TCP zerocopy receive by Eric Dumazet. This API uses >>> same idea:call 'mmap()' on socket's descriptor,then call 'getsockopt()' >>> to fill provided vma area with pages of virtio receive buffers. After >>> received data was processed by user, pages must be freed by 'madvise()' >>> call with MADV_DONTNEED flag set(but if user will not call 'madvise()', >>> next 'getsockopt()' will fail). >>> >>> DETAILS >>> >>> Here is how mapping with mapped pages looks exactly: first page >>> contains information about mapped data buffers. At zero offset mapping >>> contains special data structure: >>> >>> struct virtio_vsock_usr_hdr_pref { >>> u32 poll_value; >>> u32 hdr_num; >>> }; >>> >>> This structure contains two fields: >>> 'poll_value' - shows that current socket has data to read.When socket's >>> intput queue is empty, 'poll_value' is set to 0 by kernel. When input >>> queue has some data, 'poll_value' is set to 1 by kernel. 
When socket is >>> closed for data receive, 'poll_value' is ~0.This tells user that "there >>> will be no more data,continue to call 'getsockopt()' until you'll find >>> 'hdr_num' == 0".User spins on it in userspace, without calling 'poll()' >>> system call(of course, 'poll()' is still working). >>> 'hdr_num' - shows number of mapped pages with data which starts from >>> second page of this mappined. >>> >>> NOTE: >>> This version has two limitations: >>> >>> 1) One mapping per socket is supported. It is implemented by adding >>> 'struct page*' pointer to 'struct virtio_vsock' structure (first >>> page of mapping, which contains 'virtio_vsock_usr_hdr_pref').But, >>> I think, support for multiple pages could be implemented by using >>> something like hash table of such pages, or more simple, just use >>> first page of mapping as headers page by default. Also I think, >>> number of such pages may be controlled by 'setsockop()'. >>> >>> 2) After 'mmap()' call,it is impossible to call 'mmap()' again, even >>> after calling 'madvise()'/'munmap()' on the whole mapping.This is >>> because socket can't handle 'munmap()' calls(as there is no such >>> callback in 'proto_ops'),thus polling page exists until socket is >>> opened. >>> >>> After 'virtio_vsock_usr_hdr_pref' object, first page contains array of >>> trimmed virtio vsock packet headers (in contains only length of data on >>> the corresponding page and 'flags' field): >>> >>> struct virtio_vsock_usr_hdr { >>> uint32_t length; >>> uint32_t flags; >>> }; >>> >>> Field 'length' allows user to know exact size of payload within each >>> sequence of pages and field 'flags' allows to process SOCK_SEQPACKET >>> flags(such as message bounds or record bounds).All other pages are data >>> pages from virtio queue. >>> >>> Page 0 Page 1 Page N >>> >>> [ pref hdr0 .. hdrN ][ data ] .. 
[ data ] >>> | | ^ ^ >>> | | | | >>> | *-------|-----------* >>> | | >>> *----------------* >>> >>> Of course, single header could represent array of pages (when >>> packet's buffer is bigger than one page).So here is example of detailed >>> mapping layout for some set of packages. Lets consider that we have the >>> following sequence of packages:56 bytes, 4096 bytes and 8200 bytes. All >>> pages: 0,1,2,3,4 and 5 will be inserted to user's vma. >>> >>> Page 0: [[ pref ][ hdr0 ][ hdr 1 ][ hdr 2 ][ hdr 3 ] ... ] >>> Page 1: [ 56 ] >>> Page 2: [ 4096 ] >>> Page 3: [ 4096 ] >>> Page 4: [ 4096 ] >>> Page 5: [ 8 ] >>> >>> Page 0 contains only array of headers: >>> 'pref' is 'struct virtio_vsock_usr_hdr_pref'. >>> 'hdr0' has 56 in length field. >>> 'hdr1' has 4096 in length field. >>> 'hdr2' has 8200 in length field. >>> 'hdr3' has 0 in length field(this is end of data marker). >>> >>> Page 1 corresponds to 'hdr0' and has only 56 bytes of data. >>> Page 2 corresponds to 'hdr1' and filled with data. >>> Page 3 corresponds to 'hdr2' and filled with data. >>> Page 4 corresponds to 'hdr2' and filled with data. >>> Page 5 corresponds to 'hdr2' and has only 8 bytes of data. >>> >>> pref will be the following: poll_value = 1, hdr_num = 5 >>> >>> This patchset also changes packets allocation way: current uses >>> only 'kmalloc()' to create data buffer. Problem happens when we try to >>> map such buffers to user's vma - kernel restricts to map slab pages >>> to user's vma(as pages of "not large" 'kmalloc()' allocations have flag >>> PageSlab set and "not large" could be bigger than one page).So to avoid >>> this, data buffers now allocated using 'alloc_pages()' call. >>> >>> DIFFERENCE WITH TCP >>> >>> As this feature uses same approach as for TCP protocol,here are >>> some difference between both version(from user's POV): >>> >>> 1) For 'getsockopt()': >>> - This version passes only address of mapping. >>> - TCP passes special structure to 'getsockopt()'. 
In addition to the >>> address of mapping in contains 'length' and 'recv_skip_hint'.First >>> means size of data inside mapping(out param, set by kernel).Second >>> has bool type, if it is true, then user must dequeue rest of data >>> using 'read()' syscall(e.g. it is out parameter also). >>> 2) Mapping structure: >>> - This version uses first page of mapping for meta data and rest of >>> pages for data. >>> - TCP version uses whole mapping for data only. >>> 3) Data layout: >>> - This version inserts virtio buffers to mapping, so each buffer may >>> be filled partially. To get size of payload in every buffer, first >>> mapping's page must be used(see 2). >>> - TCP version inserts pages of single skb. >>> >>> *Please, correct me if I made some mistake in TCP zerocopy description. >> >> >> Thank you for the description. Do you think it would be possible to try >> to do the same as TCP? >> Especially now that we should support skbuff. Yes, i'll rework my patchset for skbuff usage. >> >>> >>> TESTS >>> >>> This patchset updates 'vsock_test' utility: two tests for new >>> feature were added. First test covers invalid cases.Second checks valid >>> transmission case. >> >> Thank you, I really appreciate you adding new tests each time! Great >> job! >> >>> >>> BENCHMARKING >>> >>> For benchmakring I've created small test utility 'vsock_rx_perf'. >>> It works in client/server mode. When client connects to server, server >>> starts sending specified amount of data to client(size is set as input >>> argument). Client reads data and waits for next portion of it. In client >>> mode, dequeue logic works in three modes: copy, zerocopy and zerocopy >>> with user polling. >> >> Cool, thanks for adding it in this series. >> >>> >>> 1) In copy mode client uses 'read()' system call. >>> 2) In zerocopy mode client uses 'mmap()'/'getsockopt()' to dequeue data >>> and 'poll()' to wait data. 
>>> 3) In zerocopy mode + user polling client uses 'mmap()'/'getsockopt()', >>> but to wait data it polls shared page(e.g. busyloop). >>> >>> Here is usage: >>> -c <cid> Peer CID to connect to(if run in client mode). >>> -m <megabytes> Number of megabytes to send. >>> -b <bytes> Size of RX/TX buffer(or mapping) in pages. >>> -r <bytes> SO_RCVLOWAT value in bytes(if run in client mode). >>> -v <bytes> peer credit. >>> -s Run as server. >>> -z [n|y|u] Dequeue mode. >>> n - copy mode. 1) above. >>> y - zero copy mode. 2) above. >>> u - zero copy mode + user poll. 3) above. >>> >>> Utility produces the following output: >>> 1) In server mode it prints number of sec spent for whole tx loop. >>> 2) In client mode it prints several values: >>> * Number of sec, spent for whole rx loop(including 'poll()'). >>> * Number of sec, spend in dequeue system calls: >>> In case of '-z n' it will be time in 'read()'. >>> In case of '-z y|u' it will be time in 'getsockopt()' + 'madvise()'. >>> * Number of wake ups with POLLIN flag set(except '-z u' mode). >>> * Average time(ns) in single dequeue iteration(e.g. divide second >>> value by third). >>> >>> Idea of test is to compare zerocopy approach and classic copy, as it is >>> clear, that to dequeue some "small" amount of data, copy must be better, >>> because syscall with 'memcpy()' for 1 byte(for example) is just nothing >>> against two system calls, where first must map at least one page, while >>> second will unmap it. >>> >>> Test was performed with the following settings: >>> 1) Total size of data to send is 2G(-m argument). >>> >>> 2) Peer's buffer size is changed to 2G(-v argument) - this is needed to >>> avoid stalls of sender to wait for enough credit. >>> >>> 3) Both buffer size(-b) and SO_RCVLOWAT(-r) are used to control number >>> of bytes to dequeue in single loop iteration. Buffer size limits max >>> number of bytes to read, while SO_RCVLOWAT won't allow user to get >>> too small number of bytes. 
>>>
>>> 4) For sender, tx buffer(which is passed to 'write()') size is 16Mb. Of
>>>    course we can set it to peer's buffer size and as we are in STREAM
>>>    mode it leads to 'write()' will be called once.
>>>
>>> Deignations here and below:
>>> H2G - host to guest transmission. Server is host, client is guest.
>>> G2H - guest to host transmission. Server is guest, client is host.
>>> C - copy mode.
>>> ZC - zerocopy mode.
>>> ZU - zerocopy with user poll mode. This mode is removed from test at
>>>    this moment, because I need to support SO_RCVLOWAT logic in it.
>>>
>>> So, rows corresponds to dequeue mode, while columns show number of
>>
>> Maybe it would be better to label the rows, I guess the first one is C
>> and the second one ZC?

Oops, it seems something went wrong during patch send; here is the valid
table:

G2H:

           #0        #1        #2        #3        #4        #5
     *----*---------*---------*---------*---------*---------*---------*
     |    |         |         |         |         |         |         |
     |    |   4Kb   |  16Kb   |  64Kb   |  128Kb  |  256Kb  |  512Kb  |
     |    |         |         |         |         |         |         |
     *----*---------*---------*---------*---------*---------*---------*
     |    | 2.3/2.4 |2.48/2.53|2.34/2.38|2.73/2.76|2.65/2.68|3.26/3.35|
   #0| C  |  1.44   |  1.81   |  1.14   |  1.47   |  1.32   |  1.78   |
     |    |  7039   |  15074  |  34975  |  89938  | 162384  | 438278  |
     *----*---------*---------*---------*---------*---------*---------*
     |    |2.37/2.42|2.36/1.96|2.36/2.42|2.43/2.43|2.42/2.47|2.42/2.46|
   #1| ZC |  1.7    |  1.76   |  0.96   |  0.7    |  0.58   |  0.61   |
     |    |  13598  |  15821  |  29574  |  43265  |  71771  | 150927  |
     *----*---------*---------*---------*---------*---------*---------*

H2G:

           #0        #1        #2        #3        #4        #5
     *----*---------*---------*---------*---------*---------*---------*
     |    |         |         |         |         |         |         |
     |    |   4Kb   |  16Kb   |  64Kb   |  128Kb  |  256Kb  |  512Kb  |
     |    |         |         |         |         |         |         |
     *----*---------*---------*---------*---------*---------*---------*
     |    | 1.5/5.3 |1.55/5.00|1.60/5.00|1.65/5.00|1.65/5.00|1.74/5.00|
   #0| C  |  3.3    |  4.3    |  2.37   |  2.33   |  2.42   |  2.75   |
     |    |  17145  |  24172  |  72650  | 143496  | 295960  | 674146  |
     *----*---------*---------*---------*---------*---------*---------*
     |    |1.10/6.21|1.10/6.00|1.10/5.48|1.10/5.38|1.10/5.35|1.10/5.35|
   #1| ZC |  3.5    |  3.38   |  2.32   |  1.75   |  1.25   |  0.98   |
     |    |  41855  |  46339  |  71988  | 106000  | 153064  | 242036  |
     *----*---------*---------*---------*---------*---------*---------*

The lines with #0 and #1 were missing! Don't know why, sorry :)

>>
>> Maybe it would be better to report Gbps so if we change the amount of
>> data exchanged, we always have a way to compare.
>>

Ok.

>>> bytes
>>> to dequeue in each mode. Each cell contains several values in the next
>>> format:
>>> *------------*
>>> |   A / B    |
>>> |     C      |
>>> |     D      |
>>> *------------*
>>>
>>> A - number of seconds which server spent in tx loop.
>>> B - number of seconds which client spent in rx loop.
>>> C - number of seconds which client spent in rx loop, but except 'poll()'
>>>    system call(e.g. only in dequeue system calls).
>>> D - Average number of ns for each POLLIN wake up(in other words
>>>    it is average value for C).
>>
>> I see only 3 values in the cells, I missed which one is C and which one
>> is D.
Yep, correct version of table is above >> >>> >>> G2H: >>> >>> #0 #1 #2 #3 #4 #5 >>> *----*---------*---------*---------*---------*---------*---------* >>> | | | | | | | | >>> | | 4Kb | 16Kb | 64Kb | 128Kb | 256Kb | 512Kb | >>> | | | | | | | | >>> *----*---------*---------*---------*---------*---------*---------* >>> | | 2.3/2.4 |2.48/2.53|2.34/2.38|2.73/2.76|2.65/2.68|3.26/3.35| >>> | | 7039 | 15074 | 34975 | 89938 | 162384 | 438278 | >>> *----*---------*---------*---------*---------*---------*---------* >>> | |2.37/2.42|2.36/1.96|2.36/2.42|2.43/2.43|2.42/2.47|2.42/2.46| >>> | | 13598 | 15821 | 29574 | 43265 | 71771 | 150927 | >>> *----*---------*---------*---------*---------*---------*---------* >>> >>> H2G: >>> >>> #0 #1 #2 #3 #4 #5 >>> *----*---------*---------*---------*---------*---------*---------* >>> | | | | | | | | >>> | | 4Kb | 16Kb | 64Kb | 128Kb | 256Kb | 512Kb | >>> | | | | | | | | >>> *----*---------*---------*---------*---------*---------*---------* >>> | | 1.5/5.3 |1.55/5.00|1.60/5.00|1.65/5.00|1.65/5.00|1.74/5.00| >>> | | 17145 | 24172 | 72650 | 143496 | 295960 | 674146 | >>> *----*---------*---------*---------*---------*---------*---------* >>> | |1.10/6.21|1.10/6.00|1.10/5.48|1.10/5.38|1.10/5.35|1.10/5.35| >>> | | 41855 | 46339 | 71988 | 106000 | 153064 | 242036 | >>> *----*---------*---------*---------*---------*---------*---------* >>> >>> Here are my thoughts about these numbers(most obvious): >>> >>> 1) Let's check C and D values. We see, that zerocopy dequeue is faster >>> on big buffers(in G2H it starts from 64Kb, in H2g - from 128Kb). I >>> think this is main result of this test(at this moment), that shows >>> performance difference between copy and zerocopy). >> >> Yes, I think this is expected. >> >>> >>> 2) In G2H mode both server and client spend almost same time in rx/tx >>> loops(see A / B in G2H table) - it looks good. In H2G mode, there is >>> significant difference between server and client. 
I think there are >>> some side effects which produces such effect(continue to analyze). >> >> Perhaps a different cost to notify the receiver? I think it's better to >> talk about transmitter and receiver, instead of server and client, I >> think it's confusing. Ok >> >>> >>> 3) Let's check C value. We can see, that G2H is always faster that H2G. >>> In both copy and zerocopy mode. >> >> This is expected because the guest queues buffers up to 64K entirely, >> while the host has to split packets into the guest's preallocated >> buffers, which are 4K. >> >>> >>> 4) Another interesting thing could be seen for example in H2G table, >>> row #0, col #4 (case for 256Kb). Number of seconds in zerocopy mode >>> is smaller than in copy mode(1.25 vs 2.42), but whole rx loop was >> >> I see 1.65 vs 1.10, are these the same data, or am I looking at it >> wrong? Aha, pls see valid version of the table, my fault >> >>> faster in copy mode(5 seconds vs 5.35 seconds). E.g. if we account >>> time spent in 'poll()', copy mode looks faster(even it spends more >>> time in 'read()' than zerocopy loop in 'getsockopt()' + 'madvise()'). >>> I think, it is also not obvious effect. >>> >>> So, according 1), it is better to use zerocopy, if You need to process >>> big buffers, with small rx waitings(for example it nay be video stream). >>> In other cases - it is better to use classic copy way, as it will be >>> more lightweight. >>> >>> All tests were performed on x86 board with 4-core Celeron N2930 CPU(of >>> course it is, not a mainframe, but better than test with nested guest) >>> and 8Gb of RAM. >>> >>> Anyway, this is not final version, and I will continue to improve both >>> kernel logic and performance tests. >> >> Great work so far! >> >> Maybe to avoid having to rebase everything later, it's already >> worthwhile to start using Bobby's patch with skbuff. >> >>> >>> SUGGESTIONS >>> >>> 1) I'm also working on MSG_ZEROCOPY support for virtio/vsock. 
>>> Maybe I can merge both patches into a single one?
>>
>> This is already very big, so I don't know if it's worth breaking into a
>> preparation series and then a series that adds both.

Hm, ok, I think I can send both series separately (MSG_ZEROCOPY is
smaller). It will be simpler to test and review. Whichever of the two
is ready to merge first goes in - the second will be rebased and
retested.

>
> For example, some test patches not related to zerocopy could go
> separately. Maybe even vsock_rx_perf without the zerocopy part that is
> not definitive for now.

Ok, I think I can send the patch which updates the current tests as a
single one - it will be easy to review and merge. Also vsock_rx_perf: I
can prepare it as a dedicated patch. Of course, without zerocopy it has
the same functionality as iperf, but I think it would be good to have
an independent small tool which implements both rx and tx zerocopy
support (vsock_rx_perf will be renamed to vsock_perf, for example). It
is a small tool (compared to iperf), targeted at vsock only and located
in the kernel source tree.

>
> Too big a set is always scary, even if this one is divided well, but
> some patches as mentioned could go separately.
>
> I left some comments, but as said I prefer to review it after the
> rebase with skbuff, because I think it changes enough. I'm sorry about
> that, but having the skbuffs I think is very important.

Sure, no problem. Of course skbuff is the correct way! Thanks for the
review!

>
> Thanks,
> Stefano
>
On Fri, Nov 11, 2022 at 02:47:15PM +0100, Stefano Garzarella wrote:
> Hi Arseniy,
> maybe we should start rebasing this series on the new support for skbuff: https://lore.kernel.org/lkml/20221110171723.24263-1-bobby.eshleman@bytedance.com/
>
> CCing Bobby to see if it's easy to integrate since you're both changing the
> packet allocation.
>

This looks like the packet allocation can be married somewhat nicely in
since SKBs may be built from pages using build_skb(). There may be some
tweaking necessary though, since it also uses the tail chunk of the page
to hold struct skb_shared_info IIRC.

I left some comments on the patch with the allocator in it.

>
> Maybe to avoid having to rebase everything later, it's already worthwhile to
> start using Bobby's patch with skbuff.
>

I'll be waiting until Monday to see if some more feedback comes in
before sending out v4, so I expect v4 early next week, FWIW.

Best,
Bobby
On 11.11.2022 23:45, Bobby Eshleman wrote:
> On Fri, Nov 11, 2022 at 02:47:15PM +0100, Stefano Garzarella wrote:
>> Hi Arseniy,
>> maybe we should start rebasing this series on the new support for skbuff: https://lore.kernel.org/lkml/20221110171723.24263-1-bobby.eshleman@bytedance.com/
>>
>> CCing Bobby to see if it's easy to integrate since you're both changing the
>> packet allocation.
>>
>
> This looks like the packet allocation can be married somewhat nicely in
> since SKBs may be built from pages using build_skb(). There may be some
> tweaking necessary though, since it also uses the tail chunk of the page
> to hold struct skb_shared_info IIRC.
>
> I left some comments on the patch with the allocator in it.

Hello Bobby,

thanks for the review. I'll rebase my patchset on your skbuff support.

>
>>
>> Maybe to avoid having to rebase everything later, it's already worthwhile to
>> start using Bobby's patch with skbuff.
>>
>
> I'll be waiting until Monday to see if some more feedback comes in
> before sending out v4, so I expect v4 early next week, FWIW.

One request from me: could you please CC me on the next versions of
your patchset? That way:
1) I'll always have the latest version of the skbuff support.
2) I'll also see the review process.

My contacts:
oxffffaa@gmail.com
AVKrasnov@sberdevices.ru

Thanks, Arseniy

>
> Best,
> Bobby
On 12.11.2022 14:40, Arseniy Krasnov wrote:

Hello again Bobby,

I wasn't CCed on your patchset, but I reviewed it anyway and will write
my comments here in this manner :)

I found a strange thing: in 'virtio_transport_recv_enqueue()' a new
packet can be copied into the last packet in the rx queue (an skb in
the current version). During the copy you update the last skb's length
by calling 'skb_put(last_skb, skb->len)' inside 'memcpy()'. So
'last_skb' now has the new length, but the packet's header is not
updated. Now let's look at 'virtio_transport_seqpacket_do_dequeue()':
it uses the value from the packet's header as 'pkt_len', not the one
from the skb:

pkt_len = (size_t)le32_to_cpu(hdr->len);

I think we need to update the last packet's header when merging a new
packet into the last packet of the rx queue.

Thanks, Arseniy

> On 11.11.2022 23:45, Bobby Eshleman wrote:
>> On Fri, Nov 11, 2022 at 02:47:15PM +0100, Stefano Garzarella wrote:
>>> Hi Arseniy,
>>> maybe we should start rebasing this series on the new support for skbuff: https://lore.kernel.org/lkml/20221110171723.24263-1-bobby.eshleman@bytedance.com/
>>>
>>> CCing Bobby to see if it's easy to integrate since you're both changing the
>>> packet allocation.
>>>
>>
>> This looks like the packet allocation can be married somewhat nicely in
>> since SKBs may be built from pages using build_skb(). There may be some
>> tweaking necessary though, since it also uses the tail chunk of the page
>> to hold struct skb_shared_info IIRC.
>>
>> I left some comments on the patch with the allocator in it.
> Hello Bobby,
>
> thanks for review. I'll rebase my patchset on your skbuff support.
>>
>>>
>>> Maybe to avoid having to rebase everything later, it's already worthwhile to
>>> start using Bobby's patch with skbuff.
>>>
>>
>> I'll be waiting until Monday to see if some more feedback comes in
>> before sending out v4, so I expect v4 early next week, FWIW.
> One request from me, could you please CC me for next versions of
> your patchset, because:
> 1) I'll always have the latest version of the skbuff support.
> 2) I'll also see the review process.
>
> My contacts:
> oxffffaa@gmail.com
> AVKrasnov@sberdevices.ru
>
> Thanks, Arseniy
>
>>
>> Best,
>> Bobby
>