Message ID | 20230413-b4-vsock-dgram-v2-0-079cc7cee62e@bytedance.com |
---|---|
Headers | From: Bobby Eshleman <bobby.eshleman@bytedance.com>
Subject: [PATCH RFC net-next v2 0/4] virtio/vsock: support datagrams
Date: Fri, 14 Apr 2023 00:25:56 +0000
To: Stefan Hajnoczi <stefanha@redhat.com>, Stefano Garzarella <sgarzare@redhat.com>, "Michael S. Tsirkin" <mst@redhat.com>, Jason Wang <jasowang@redhat.com>, "David S. Miller" <davem@davemloft.net>, Eric Dumazet <edumazet@google.com>, Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>, "K. Y. Srinivasan" <kys@microsoft.com>, Haiyang Zhang <haiyangz@microsoft.com>, Wei Liu <wei.liu@kernel.org>, Dexuan Cui <decui@microsoft.com>, Bryan Tan <bryantan@vmware.com>, Vishnu Dasa <vdasa@vmware.com>, VMware PV-Drivers Reviewers <pv-drivers@vmware.com>
Cc: kvm@vger.kernel.org, virtualization@lists.linux-foundation.org, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org, Bobby Eshleman <bobby.eshleman@bytedance.com>, Jiang Wang <jiang.wang@bytedance.com> |
Series | virtio/vsock: support datagrams | |
Message
Bobby Eshleman
April 14, 2023, 12:25 a.m. UTC
Hey all!
This series introduces support for datagrams to virtio/vsock.
It is a spin-off (and smaller version) of this series from the summer:
https://lore.kernel.org/all/cover.1660362668.git.bobby.eshleman@bytedance.com/
Please note that this is an RFC and should not be merged until
associated changes are made to the virtio specification, which will
follow after discussion of this series.
This series first supports datagrams in a basic form for virtio, and
then optimizes the sendpath for all transports.
The result is a very fast datagram communication protocol that
outperforms even UDP on multi-queue virtio-net w/ vhost on a variety
of multi-threaded workload samples.
For those that are curious, some summary data comparing UDP and VSOCK
DGRAM (N=5):
vCPUS: 16
virtio-net queues: 16
payload size: 4KB
Setup: bare metal + vm (non-nested)
UDP: 287.59 MB/s
VSOCK DGRAM: 509.2 MB/s
Some notes about the implementation...
This datagram implementation forces datagrams to self-throttle according
to the threshold set by sk_sndbuf. It behaves similarly to the credits
used by streams in its effect on throughput and memory consumption, but
it is not influenced by the receiving socket as credits are.
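To illustrate what this looks like from the application side, here is a rough userspace sketch; it is not code from this series, the destination CID, port, and buffer size are invented for the example, and it assumes the EAGAIN/MSG_DONTWAIT semantics discussed later in the thread. A non-blocking sender that hits the sk_sndbuf threshold backs off until the send buffer drains:

```c
/* Hypothetical userspace sketch: send vsock datagrams non-blocking and
 * back off when the socket hits its sk_sndbuf-based throttle (EAGAIN).
 * The CID, port, and SO_SNDBUF values are examples only.
 */
#include <errno.h>
#include <poll.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/vm_sockets.h>

int main(void)
{
    int fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    int sndbuf = 256 * 1024;    /* example send buffer, i.e. the throttle threshold */
    setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));

    struct sockaddr_vm addr = {
        .svm_family = AF_VSOCK,
        .svm_cid = 3,           /* example guest CID */
        .svm_port = 1234,       /* example port */
    };

    char payload[4096] = { 0 }; /* 4KB payload, as in the benchmark above */

    for (int i = 0; i < 1000; i++) {
        ssize_t n = sendto(fd, payload, sizeof(payload), MSG_DONTWAIT,
                           (struct sockaddr *)&addr, sizeof(addr));
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            /* Self-throttled: wait until the send buffer drains, then retry. */
            struct pollfd pfd = { .fd = fd, .events = POLLOUT };
            poll(&pfd, 1, -1);
            i--;
            continue;
        }
        if (n < 0) {
            perror("sendto");
            break;
        }
    }

    close(fd);
    return 0;
}
```

With SOCK_DGRAM on AF_VSOCK this is essentially the same pattern an application would use with UDP, which is part of what makes the head-to-head comparison above meaningful.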
The device drops packets silently. There is room for improvement by
building some intelligence into the device and driver around how to
reduce the frequency of kicking the virtqueue when packet loss is high. I
think there is a good discussion to be had on this.
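Purely as a sketch of what that intelligence might look like (nothing below is from this series, and it assumes the device had some way to feed drop counts back to the driver, which would itself need a spec change), the driver could widen its kick batch while loss is high and shrink it again when loss subsides:

```c
/* Illustrative only: a toy policy for deciding when the driver should
 * kick the virtqueue, backing off while observed packet loss is high.
 * Initialize with { .queued = 0, .drops = 0, .batch = 1 }; the thresholds
 * are invented for the example.
 */
#include <stdbool.h>

struct kick_policy {
    unsigned int queued;    /* packets queued since last kick */
    unsigned int drops;     /* drops observed in the current window */
    unsigned int batch;     /* kick once every `batch` packets */
};

/* Called per transmitted packet; returns true if the driver should kick now. */
bool should_kick(struct kick_policy *p, bool packet_was_dropped)
{
    if (packet_was_dropped)
        p->drops++;

    if (++p->queued < p->batch)
        return false;

    /* End of a batch window: adapt the batch size to observed loss,
     * kicking less often while drops are frequent.
     */
    if (p->drops > p->batch / 2 && p->batch < 64)
        p->batch *= 2;
    else if (p->drops == 0 && p->batch > 1)
        p->batch /= 2;

    p->queued = 0;
    p->drops = 0;
    return true;
}
```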
In this series I am also proposing that fairness be reexamined as an
issue separate from datagrams, which differs from my previous series
that coupled these issues. After further testing and reflection on the
design, I do not believe that these need to be coupled and I do not
believe this implementation introduces additional unfairness or
exacerbates pre-existing unfairness.
I attempted to characterize vsock fairness by using a pool of processes
to stress test the shared resources while measuring the performance of a
lone stream socket. Given unfair preference for datagrams, we would
assume that a lone stream socket would degrade much more when a pool of
datagram sockets was stressing the system than when a pool of stream
sockets was stressing the system. The result, however, showed no
significant difference between the degradation of throughput of the lone
stream socket when using a pool of datagrams to stress the queue over
using a pool of streams. The absolute difference in throughput actually
favored datagrams as the least interfering workload: the mean difference was +16%
compared to using streams to stress test (N=7), but it was not
statistically significant. Workloads were matched for payload size and
buffer size (to approximate memory consumption) and process count, and
stress workloads were configured to start before and last long after the
lifetime of the "lone" stream socket flow to ensure that competing flows
were continuously hot.
Given the above data, I propose that vsock fairness be addressed
independently of datagrams and that its implementation be deferred to a
future series.
Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
---
Bobby Eshleman (3):
virtio/vsock: support dgram
virtio/vsock: add VIRTIO_VSOCK_F_DGRAM feature bit
vsock: Add lockless sendmsg() support
Jiang Wang (1):
tests: add vsock dgram tests
drivers/vhost/vsock.c | 17 +-
include/net/af_vsock.h | 20 ++-
include/uapi/linux/virtio_vsock.h | 2 +
net/vmw_vsock/af_vsock.c | 287 ++++++++++++++++++++++++++++----
net/vmw_vsock/diag.c | 10 +-
net/vmw_vsock/hyperv_transport.c | 15 +-
net/vmw_vsock/virtio_transport.c | 10 +-
net/vmw_vsock/virtio_transport_common.c | 221 ++++++++++++++++++++----
net/vmw_vsock/vmci_transport.c | 70 ++++++--
tools/testing/vsock/util.c | 105 ++++++++++++
tools/testing/vsock/util.h | 4 +
tools/testing/vsock/vsock_test.c | 193 +++++++++++++++++++++
12 files changed, 859 insertions(+), 95 deletions(-)
---
base-commit: ed72bd5a6790a0c3747cb32b0427f921bd03bb71
change-id: 20230413-b4-vsock-dgram-3b6eba6a64e5
Best regards,
--
Bobby Eshleman <bobby.eshleman@bytedance.com>
Comments
CC'ing Cong. On Fri, Apr 14, 2023 at 12:25:56AM +0000, Bobby Eshleman wrote: > Hey all! > > This series introduces support for datagrams to virtio/vsock. > > It is a spin-off (and smaller version) of this series from the summer: > https://lore.kernel.org/all/cover.1660362668.git.bobby.eshleman@bytedance.com/ > > Please note that this is an RFC and should not be merged until > associated changes are made to the virtio specification, which will > follow after discussion from this series. > > This series first supports datagrams in a basic form for virtio, and > then optimizes the sendpath for all transports. > > The result is a very fast datagram communication protocol that > outperforms even UDP on multi-queue virtio-net w/ vhost on a variety > of multi-threaded workload samples. > > For those that are curious, some summary data comparing UDP and VSOCK > DGRAM (N=5): > > vCPUS: 16 > virtio-net queues: 16 > payload size: 4KB > Setup: bare metal + vm (non-nested) > > UDP: 287.59 MB/s > VSOCK DGRAM: 509.2 MB/s > > Some notes about the implementation... > > This datagram implementation forces datagrams to self-throttle according > to the threshold set by sk_sndbuf. It behaves similar to the credits > used by streams in its effect on throughput and memory consumption, but > it is not influenced by the receiving socket as credits are. > > The device drops packets silently. There is room for improvement by > building into the device and driver some intelligence around how to > reduce frequency of kicking the virtqueue when packet loss is high. I > think there is a good discussion to be had on this. > > In this series I am also proposing that fairness be reexamined as an > issue separate from datagrams, which differs from my previous series > that coupled these issues. After further testing and reflection on the > design, I do not believe that these need to be coupled and I do not > believe this implementation introduces additional unfairness or > exacerbates pre-existing unfairness. > > I attempted to characterize vsock fairness by using a pool of processes > to stress test the shared resources while measuring the performance of a > lone stream socket. Given unfair preference for datagrams, we would > assume that a lone stream socket would degrade much more when a pool of > datagram sockets was stressing the system than when a pool of stream > sockets are stressing the system. The result, however, showed no > significant difference between the degradation of throughput of the lone > stream socket when using a pool of datagrams to stress the queue over > using a pool of streams. The absolute difference in throughput actually > favored datagrams as interfering least as the mean difference was +16% > compared to using streams to stress test (N=7), but it was not > statistically significant. Workloads were matched for payload size and > buffer size (to approximate memory consumption) and process count, and > stress workloads were configured to start before and last long after the > lifetime of the "lone" stream socket flow to ensure that competing flows > were continuously hot. > > Given the above data, I propose that vsock fairness be addressed > independent of datagrams and to defer its implementation to a future > series. 
> > Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com> > --- > Bobby Eshleman (3): > virtio/vsock: support dgram > virtio/vsock: add VIRTIO_VSOCK_F_DGRAM feature bit > vsock: Add lockless sendmsg() support > > Jiang Wang (1): > tests: add vsock dgram tests > > drivers/vhost/vsock.c | 17 +- > include/net/af_vsock.h | 20 ++- > include/uapi/linux/virtio_vsock.h | 2 + > net/vmw_vsock/af_vsock.c | 287 ++++++++++++++++++++++++++++---- > net/vmw_vsock/diag.c | 10 +- > net/vmw_vsock/hyperv_transport.c | 15 +- > net/vmw_vsock/virtio_transport.c | 10 +- > net/vmw_vsock/virtio_transport_common.c | 221 ++++++++++++++++++++---- > net/vmw_vsock/vmci_transport.c | 70 ++++++-- > tools/testing/vsock/util.c | 105 ++++++++++++ > tools/testing/vsock/util.h | 4 + > tools/testing/vsock/vsock_test.c | 193 +++++++++++++++++++++ > 12 files changed, 859 insertions(+), 95 deletions(-) > --- > base-commit: ed72bd5a6790a0c3747cb32b0427f921bd03bb71 > change-id: 20230413-b4-vsock-dgram-3b6eba6a64e5 > > Best regards, > -- > Bobby Eshleman <bobby.eshleman@bytedance.com> >
CC'ing virtio-dev@lists.oasis-open.org because this thread is starting to touch the spec. On Wed, Apr 19, 2023 at 12:00:17PM +0200, Stefano Garzarella wrote: > Hi Bobby, > > On Fri, Apr 14, 2023 at 11:18:40AM +0000, Bobby Eshleman wrote: > > CC'ing Cong. > > > > On Fri, Apr 14, 2023 at 12:25:56AM +0000, Bobby Eshleman wrote: > > > Hey all! > > > > > > This series introduces support for datagrams to virtio/vsock. > > Great! Thanks for restarting this work! > No problem! > > > > > > It is a spin-off (and smaller version) of this series from the summer: > > > https://lore.kernel.org/all/cover.1660362668.git.bobby.eshleman@bytedance.com/ > > > > > > Please note that this is an RFC and should not be merged until > > > associated changes are made to the virtio specification, which will > > > follow after discussion from this series. > > > > > > This series first supports datagrams in a basic form for virtio, and > > > then optimizes the sendpath for all transports. > > > > > > The result is a very fast datagram communication protocol that > > > outperforms even UDP on multi-queue virtio-net w/ vhost on a variety > > > of multi-threaded workload samples. > > > > > > For those that are curious, some summary data comparing UDP and VSOCK > > > DGRAM (N=5): > > > > > > vCPUS: 16 > > > virtio-net queues: 16 > > > payload size: 4KB > > > Setup: bare metal + vm (non-nested) > > > > > > UDP: 287.59 MB/s > > > VSOCK DGRAM: 509.2 MB/s > > > > > > Some notes about the implementation... > > > > > > This datagram implementation forces datagrams to self-throttle according > > > to the threshold set by sk_sndbuf. It behaves similar to the credits > > > used by streams in its effect on throughput and memory consumption, but > > > it is not influenced by the receiving socket as credits are. > > So, sk_sndbuf influece the sender and sk_rcvbuf the receiver, right? > Correct. > We should check if VMCI behaves the same. > > > > > > > The device drops packets silently. There is room for improvement by > > > building into the device and driver some intelligence around how to > > > reduce frequency of kicking the virtqueue when packet loss is high. I > > > think there is a good discussion to be had on this. > > Can you elaborate a bit here? > > Do you mean some mechanism to report to the sender that a destination > (cid, port) is full so the packet will be dropped? > Correct. There is also the case of there being no receiver at all for this address since this case isn't rejected upon connect(). Ideally, such a socket (which will have 100% packet loss) will be throttled aggressively. Before we go down too far on this path, I also want to clarify that using UDP over vhost/virtio-net also has this property... this can be observed by using tcpdump to dump the UDP packets on the bridge network your VM is using. UDP packets sent to a garbage address can be seen on the host bridge (this is the nature of UDP, how is the host supposed to know the address eventually goes nowhere). I mention the above because I think it is possible for vsock to avoid this cost, given that it benefits from being point-to-point and g2h/h2g. If we're okay with vsock being on par, then the current series does that. I propose something below that can be added later and maybe negotiated as a feature bit too. > Can we adapt the credit mechanism? 
> I've thought about this a lot because the attraction of the approach for me would be that we could get the wait/buffer-limiting logic for free and without big changes to the protocol, but the problem is that the unreliable nature of datagrams means that the source's free-running tx_cnt will become out-of-sync with the destination's fwd_cnt upon packet loss. Imagine a source that initializes and starts sending packets before a destination socket even is created, the source's self-throttling will be dysfunctional because its tx_cnt will always far exceed the destination's fwd_cnt. We could play tricks with the meaning of the CREDIT_UPDATE message and fwd_cnt/buf_alloc fields, but I don't think we want to go down that path. I think that the best and simplest approach introduces a congestion notification (VIRTIO_VSOCK_OP_CN?). When a packet is dropped, the destination sends this notification. At a given repeated time period T, the source can check if it has received any notifications in the last T. If so, it halves its buffer allocation. If not, it doubles its buffer allocation unless it is already at its max or original value. An "invalid" socket which never has any receiver will converge towards a rate limit of one packet per time T * log2(average pkt size). That is, a socket with 100% packet loss will only be able to send 16 bytes every 4T. A default send buffer of MAX_UINT32 and T=5ms would hit zero within 160ms given at least one packet sent per 5ms. I have no idea if that is a reasonable default T for vsock, I just pulled it out of a hat for the sake of the example. "Normal" sockets will be responsive to high loss and rebalance during low loss. The source is trying to guess and converge on the actual buffer state of the destination. This would reuse the already-existing throttling mechanisms that throttle based upon buffer allocation. The usage of sk_sndbuf would have to be re-worked. The application using sendmsg() will see EAGAIN when throttled, or just sleep if !MSG_DONTWAIT. I looked at alternative schemes (like the Datagram Congestion Control Protocol), but I do not think the added complexity is necessary in the case of vsock (DCCP requires congestion windows, sequence numbers, batch acknowledgements, etc...). I also looked at UDP-based application protocols like TFTP, DHCP, and SIP over UDP which use a delay-based backoff mechanism, but seem to require acknowledgement for those packet types, which trigger the retries and backoffs. I think we can get away with the simpler approach and not have to potentially kill performance with per-packet acknowledgements. > > > > > > In this series I am also proposing that fairness be reexamined as an > > > issue separate from datagrams, which differs from my previous series > > > that coupled these issues. After further testing and reflection on the > > > design, I do not believe that these need to be coupled and I do not > > > believe this implementation introduces additional unfairness or > > > exacerbates pre-existing unfairness. > > I see. > > > > > > > I attempted to characterize vsock fairness by using a pool of processes > > > to stress test the shared resources while measuring the performance of a > > > lone stream socket. Given unfair preference for datagrams, we would > > > assume that a lone stream socket would degrade much more when a pool of > > > datagram sockets was stressing the system than when a pool of stream > > > sockets are stressing the system. 
The result, however, showed no > > > significant difference between the degradation of throughput of the lone > > > stream socket when using a pool of datagrams to stress the queue over > > > using a pool of streams. The absolute difference in throughput actually > > > favored datagrams as interfering least as the mean difference was +16% > > > compared to using streams to stress test (N=7), but it was not > > > statistically significant. Workloads were matched for payload size and > > > buffer size (to approximate memory consumption) and process count, and > > > stress workloads were configured to start before and last long after the > > > lifetime of the "lone" stream socket flow to ensure that competing flows > > > were continuously hot. > > > > > > Given the above data, I propose that vsock fairness be addressed > > > independent of datagrams and to defer its implementation to a future > > > series. > > Makes sense to me. > > I left some preliminary comments, anyway now it seems reasonable to use > the same virtqueues, so we can go head with the spec proposal. > > Thanks, > Stefano > Thanks for the review! Best, Bobby
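To make the proposed backoff concrete, the halving/doubling behavior described in the message above can be modeled in a few lines of C. VIRTIO_VSOCK_OP_CN, the period T = 5ms, and the MAX_UINT32 default send buffer are values proposed in the discussion, not existing protocol, and this is a toy model rather than driver code:

```c
/* Toy model of the proposed congestion-notification backoff: every period T
 * the sender halves its allowed buffer allocation if any congestion
 * notification was seen, and doubles it (up to the default) otherwise.
 * All names and constants are illustrative only.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define DEFAULT_BUF_ALLOC UINT32_MAX    /* e.g. a MAX_UINT32 default send buffer */

struct dgram_cc {
    uint64_t buf_alloc; /* current allowed allocation */
    bool cn_seen;       /* congestion notification seen this period */
};

/* Called when a (hypothetical) VIRTIO_VSOCK_OP_CN packet arrives. */
void cc_on_congestion(struct dgram_cc *cc)
{
    cc->cn_seen = true;
}

/* Called once per period T. A real implementation would likely clamp the
 * allocation to a non-zero floor; the raw halving is kept here to show the
 * convergence described in the message.
 */
void cc_period_tick(struct dgram_cc *cc)
{
    if (cc->cn_seen)
        cc->buf_alloc /= 2;
    else if (cc->buf_alloc < DEFAULT_BUF_ALLOC / 2)
        cc->buf_alloc *= 2;
    else
        cc->buf_alloc = DEFAULT_BUF_ALLOC;
    cc->cn_seen = false;
}

int main(void)
{
    struct dgram_cc cc = { .buf_alloc = DEFAULT_BUF_ALLOC };

    /* 100% loss: a notification arrives every period, so the allowed
     * allocation halves each tick.
     */
    for (int tick = 1; cc.buf_alloc > 0; tick++) {
        cc_on_congestion(&cc);
        cc_period_tick(&cc);
        printf("tick %2d (t=%3dms): buf_alloc=%llu\n",
               tick, tick * 5, (unsigned long long)cc.buf_alloc);
    }
    return 0;
}
```

Running it shows the allocation reaching zero at tick 32, i.e. 160ms with T = 5ms, matching the arithmetic in the message.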
On Fri, Apr 28, 2023 at 12:43:09PM +0200, Stefano Garzarella wrote: > On Sat, Apr 15, 2023 at 07:13:47AM +0000, Bobby Eshleman wrote: > > CC'ing virtio-dev@lists.oasis-open.org because this thread is starting > > to touch the spec. > > > > On Wed, Apr 19, 2023 at 12:00:17PM +0200, Stefano Garzarella wrote: > > > Hi Bobby, > > > > > > On Fri, Apr 14, 2023 at 11:18:40AM +0000, Bobby Eshleman wrote: > > > > CC'ing Cong. > > > > > > > > On Fri, Apr 14, 2023 at 12:25:56AM +0000, Bobby Eshleman wrote: > > > > > Hey all! > > > > > > > > > > This series introduces support for datagrams to virtio/vsock. > > > > > > Great! Thanks for restarting this work! > > > > > > > No problem! > > > > > > > > > > > > It is a spin-off (and smaller version) of this series from the summer: > > > > > https://lore.kernel.org/all/cover.1660362668.git.bobby.eshleman@bytedance.com/ > > > > > > > > > > Please note that this is an RFC and should not be merged until > > > > > associated changes are made to the virtio specification, which will > > > > > follow after discussion from this series. > > > > > > > > > > This series first supports datagrams in a basic form for virtio, and > > > > > then optimizes the sendpath for all transports. > > > > > > > > > > The result is a very fast datagram communication protocol that > > > > > outperforms even UDP on multi-queue virtio-net w/ vhost on a variety > > > > > of multi-threaded workload samples. > > > > > > > > > > For those that are curious, some summary data comparing UDP and VSOCK > > > > > DGRAM (N=5): > > > > > > > > > > vCPUS: 16 > > > > > virtio-net queues: 16 > > > > > payload size: 4KB > > > > > Setup: bare metal + vm (non-nested) > > > > > > > > > > UDP: 287.59 MB/s > > > > > VSOCK DGRAM: 509.2 MB/s > > > > > > > > > > Some notes about the implementation... > > > > > > > > > > This datagram implementation forces datagrams to self-throttle according > > > > > to the threshold set by sk_sndbuf. It behaves similar to the credits > > > > > used by streams in its effect on throughput and memory consumption, but > > > > > it is not influenced by the receiving socket as credits are. > > > > > > So, sk_sndbuf influece the sender and sk_rcvbuf the receiver, right? > > > > > > > Correct. > > > > > We should check if VMCI behaves the same. > > > > > > > > > > > > > The device drops packets silently. There is room for improvement by > > > > > building into the device and driver some intelligence around how to > > > > > reduce frequency of kicking the virtqueue when packet loss is high. I > > > > > think there is a good discussion to be had on this. > > > > > > Can you elaborate a bit here? > > > > > > Do you mean some mechanism to report to the sender that a destination > > > (cid, port) is full so the packet will be dropped? > > > > > > > Correct. There is also the case of there being no receiver at all for > > this address since this case isn't rejected upon connect(). Ideally, > > such a socket (which will have 100% packet loss) will be throttled > > aggressively. > > > > Before we go down too far on this path, I also want to clarify that > > using UDP over vhost/virtio-net also has this property... this can be > > observed by using tcpdump to dump the UDP packets on the bridge network > > your VM is using. UDP packets sent to a garbage address can be seen on > > the host bridge (this is the nature of UDP, how is the host supposed to > > know the address eventually goes nowhere). 
I mention the above because I > > think it is possible for vsock to avoid this cost, given that it > > benefits from being point-to-point and g2h/h2g. > > > > If we're okay with vsock being on par, then the current series does > > that. I propose something below that can be added later and maybe > > negotiated as a feature bit too. > > I see and I agree on that, let's do it step by step. > If we can do it in the first phase is great, but I think is fine to add > this feature later. > > > > > > Can we adapt the credit mechanism? > > > > > > > I've thought about this a lot because the attraction of the approach for > > me would be that we could get the wait/buffer-limiting logic for free > > and without big changes to the protocol, but the problem is that the > > unreliable nature of datagrams means that the source's free-running > > tx_cnt will become out-of-sync with the destination's fwd_cnt upon > > packet loss. > > We need to understand where the packet can be lost. > If the packet always reaches the destination (vsock driver or device), > we can discard it, but also update the counters. > > > > > Imagine a source that initializes and starts sending packets before a > > destination socket even is created, the source's self-throttling will be > > dysfunctional because its tx_cnt will always far exceed the > > destination's fwd_cnt. > > Right, the other problem I see is that the socket aren't connected, so > we have 1-N relationship. > Oh yeah, good point. > > > > We could play tricks with the meaning of the CREDIT_UPDATE message and > > fwd_cnt/buf_alloc fields, but I don't think we want to go down that > > path. > > > > I think that the best and simplest approach introduces a congestion > > notification (VIRTIO_VSOCK_OP_CN?). When a packet is dropped, the > > destination sends this notification. At a given repeated time period T, > > the source can check if it has received any notifications in the last T. > > If so, it halves its buffer allocation. If not, it doubles its buffer > > allocation unless it is already at its max or original value. > > > > An "invalid" socket which never has any receiver will converge towards a > > rate limit of one packet per time T * log2(average pkt size). That is, a > > socket with 100% packet loss will only be able to send 16 bytes every > > 4T. A default send buffer of MAX_UINT32 and T=5ms would hit zero within > > 160ms given at least one packet sent per 5ms. I have no idea if that is > > a reasonable default T for vsock, I just pulled it out of a hat for the > > sake of the example. > > > > "Normal" sockets will be responsive to high loss and rebalance during > > low loss. The source is trying to guess and converge on the actual > > buffer state of the destination. > > > > This would reuse the already-existing throttling mechanisms that > > throttle based upon buffer allocation. The usage of sk_sndbuf would have > > to be re-worked. The application using sendmsg() will see EAGAIN when > > throttled, or just sleep if !MSG_DONTWAIT. > > I see, it looks interesting, but I think we need to share that > information between multiple sockets, since the same destination > (cid, port), can be reached by multiple sockets. > Good point, that is true. > Another approach could be to have both congestion notification and > decongestion, but maybe it produces double traffic. > I think this could simplify things and could reduce noise. 
It is also probably sufficient for the source to simply halt upon congestion notification and resume upon decongestion notification, instead of scaling up and down like I suggested above. It also avoids the burstiness that would occur with a "congestion notification"-only approach where the source guesses when to resume and guesses wrong. The congestion notification may want to have an expiration period after which the sender can resume without receiving a decongestion notification? If it receives congestion again, then it can halt again. > > > > I looked at alternative schemes (like the Datagram Congestion Control > > Protocol), but I do not think the added complexity is necessary in the > > case of vsock (DCCP requires congestion windows, sequence numbers, batch > > acknowledgements, etc...). I also looked at UDP-based application > > protocols like TFTP, DHCP, and SIP over UDP which use a delay-based > > backoff mechanism, but seem to require acknowledgement for those packet > > types, which trigger the retries and backoffs. I think we can get away > > with the simpler approach and not have to potentially kill performance > > with per-packet acknowledgements. > > Yep I agree. I think our advantage is that the channel (virtqueues), > can't lose packets. > Exactly. > > > > > > > > > > > > In this series I am also proposing that fairness be reexamined as an > > > > > issue separate from datagrams, which differs from my previous series > > > > > that coupled these issues. After further testing and reflection on the > > > > > design, I do not believe that these need to be coupled and I do not > > > > > believe this implementation introduces additional unfairness or > > > > > exacerbates pre-existing unfairness. > > > > > > I see. > > > > > > > > > > > > > I attempted to characterize vsock fairness by using a pool of processes > > > > > to stress test the shared resources while measuring the performance of a > > > > > lone stream socket. Given unfair preference for datagrams, we would > > > > > assume that a lone stream socket would degrade much more when a pool of > > > > > datagram sockets was stressing the system than when a pool of stream > > > > > sockets are stressing the system. The result, however, showed no > > > > > significant difference between the degradation of throughput of the lone > > > > > stream socket when using a pool of datagrams to stress the queue over > > > > > using a pool of streams. The absolute difference in throughput actually > > > > > favored datagrams as interfering least as the mean difference was +16% > > > > > compared to using streams to stress test (N=7), but it was not > > > > > statistically significant. Workloads were matched for payload size and > > > > > buffer size (to approximate memory consumption) and process count, and > > > > > stress workloads were configured to start before and last long after the > > > > > lifetime of the "lone" stream socket flow to ensure that competing flows > > > > > were continuously hot. > > > > > > > > > > Given the above data, I propose that vsock fairness be addressed > > > > > independent of datagrams and to defer its implementation to a future > > > > > series. > > > > > > Makes sense to me. > > > > > > I left some preliminary comments, anyway now it seems reasonable to use > > > the same virtqueues, so we can go head with the spec proposal. > > > > > > Thanks, > > > Stefano > > > > > > > Thanks for the review! > > You're welcome! > > Stefano > Best, Bobby
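The simpler halt/resume variant with an expiration period, as discussed in the exchange above, reduces to a small gate on the send path. This is illustrative only; the names and the expiry value are invented:

```c
/* Illustrative halt/resume gate for the congestion/decongestion variant
 * discussed above: sending stops on a congestion notification and resumes
 * on a decongestion notification or after an expiry period.
 */
#include <stdbool.h>
#include <stdint.h>

#define CN_EXPIRY_MS 50 /* made-up expiration period */

struct dgram_gate {
    bool congested;
    uint64_t congested_until_ms;    /* absolute time the halt expires */
};

void gate_on_congestion(struct dgram_gate *g, uint64_t now_ms)
{
    g->congested = true;
    g->congested_until_ms = now_ms + CN_EXPIRY_MS;
}

void gate_on_decongestion(struct dgram_gate *g)
{
    g->congested = false;
}

/* The send path checks this before queueing another datagram. */
bool gate_may_send(struct dgram_gate *g, uint64_t now_ms)
{
    if (g->congested && now_ms >= g->congested_until_ms)
        g->congested = false;   /* expired without a decongestion message */
    return !g->congested;
}
```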
Hi Bobby, On Fri, Apr 14, 2023 at 11:18:40AM +0000, Bobby Eshleman wrote: >CC'ing Cong. > >On Fri, Apr 14, 2023 at 12:25:56AM +0000, Bobby Eshleman wrote: >> Hey all! >> >> This series introduces support for datagrams to virtio/vsock. Great! Thanks for restarting this work! >> >> It is a spin-off (and smaller version) of this series from the summer: >> https://lore.kernel.org/all/cover.1660362668.git.bobby.eshleman@bytedance.com/ >> >> Please note that this is an RFC and should not be merged until >> associated changes are made to the virtio specification, which will >> follow after discussion from this series. >> >> This series first supports datagrams in a basic form for virtio, and >> then optimizes the sendpath for all transports. >> >> The result is a very fast datagram communication protocol that >> outperforms even UDP on multi-queue virtio-net w/ vhost on a variety >> of multi-threaded workload samples. >> >> For those that are curious, some summary data comparing UDP and VSOCK >> DGRAM (N=5): >> >> vCPUS: 16 >> virtio-net queues: 16 >> payload size: 4KB >> Setup: bare metal + vm (non-nested) >> >> UDP: 287.59 MB/s >> VSOCK DGRAM: 509.2 MB/s >> >> Some notes about the implementation... >> >> This datagram implementation forces datagrams to self-throttle according >> to the threshold set by sk_sndbuf. It behaves similar to the credits >> used by streams in its effect on throughput and memory consumption, but >> it is not influenced by the receiving socket as credits are. So, sk_sndbuf influece the sender and sk_rcvbuf the receiver, right? We should check if VMCI behaves the same. >> >> The device drops packets silently. There is room for improvement by >> building into the device and driver some intelligence around how to >> reduce frequency of kicking the virtqueue when packet loss is high. I >> think there is a good discussion to be had on this. Can you elaborate a bit here? Do you mean some mechanism to report to the sender that a destination (cid, port) is full so the packet will be dropped? Can we adapt the credit mechanism? >> >> In this series I am also proposing that fairness be reexamined as an >> issue separate from datagrams, which differs from my previous series >> that coupled these issues. After further testing and reflection on the >> design, I do not believe that these need to be coupled and I do not >> believe this implementation introduces additional unfairness or >> exacerbates pre-existing unfairness. I see. >> >> I attempted to characterize vsock fairness by using a pool of processes >> to stress test the shared resources while measuring the performance of a >> lone stream socket. Given unfair preference for datagrams, we would >> assume that a lone stream socket would degrade much more when a pool of >> datagram sockets was stressing the system than when a pool of stream >> sockets are stressing the system. The result, however, showed no >> significant difference between the degradation of throughput of the lone >> stream socket when using a pool of datagrams to stress the queue over >> using a pool of streams. The absolute difference in throughput actually >> favored datagrams as interfering least as the mean difference was +16% >> compared to using streams to stress test (N=7), but it was not >> statistically significant. 
Workloads were matched for payload size and >> buffer size (to approximate memory consumption) and process count, and >> stress workloads were configured to start before and last long after the >> lifetime of the "lone" stream socket flow to ensure that competing flows >> were continuously hot. >> >> Given the above data, I propose that vsock fairness be addressed >> independent of datagrams and to defer its implementation to a future >> series. Makes sense to me. I left some preliminary comments, anyway now it seems reasonable to use the same virtqueues, so we can go head with the spec proposal. Thanks, Stefano
On Sat, Apr 15, 2023 at 07:13:47AM +0000, Bobby Eshleman wrote: >CC'ing virtio-dev@lists.oasis-open.org because this thread is starting >to touch the spec. > >On Wed, Apr 19, 2023 at 12:00:17PM +0200, Stefano Garzarella wrote: >> Hi Bobby, >> >> On Fri, Apr 14, 2023 at 11:18:40AM +0000, Bobby Eshleman wrote: >> > CC'ing Cong. >> > >> > On Fri, Apr 14, 2023 at 12:25:56AM +0000, Bobby Eshleman wrote: >> > > Hey all! >> > > >> > > This series introduces support for datagrams to virtio/vsock. >> >> Great! Thanks for restarting this work! >> > >No problem! > >> > > >> > > It is a spin-off (and smaller version) of this series from the summer: >> > > https://lore.kernel.org/all/cover.1660362668.git.bobby.eshleman@bytedance.com/ >> > > >> > > Please note that this is an RFC and should not be merged until >> > > associated changes are made to the virtio specification, which will >> > > follow after discussion from this series. >> > > >> > > This series first supports datagrams in a basic form for virtio, and >> > > then optimizes the sendpath for all transports. >> > > >> > > The result is a very fast datagram communication protocol that >> > > outperforms even UDP on multi-queue virtio-net w/ vhost on a variety >> > > of multi-threaded workload samples. >> > > >> > > For those that are curious, some summary data comparing UDP and VSOCK >> > > DGRAM (N=5): >> > > >> > > vCPUS: 16 >> > > virtio-net queues: 16 >> > > payload size: 4KB >> > > Setup: bare metal + vm (non-nested) >> > > >> > > UDP: 287.59 MB/s >> > > VSOCK DGRAM: 509.2 MB/s >> > > >> > > Some notes about the implementation... >> > > >> > > This datagram implementation forces datagrams to self-throttle according >> > > to the threshold set by sk_sndbuf. It behaves similar to the credits >> > > used by streams in its effect on throughput and memory consumption, but >> > > it is not influenced by the receiving socket as credits are. >> >> So, sk_sndbuf influece the sender and sk_rcvbuf the receiver, right? >> > >Correct. > >> We should check if VMCI behaves the same. >> >> > > >> > > The device drops packets silently. There is room for improvement by >> > > building into the device and driver some intelligence around how to >> > > reduce frequency of kicking the virtqueue when packet loss is high. I >> > > think there is a good discussion to be had on this. >> >> Can you elaborate a bit here? >> >> Do you mean some mechanism to report to the sender that a destination >> (cid, port) is full so the packet will be dropped? >> > >Correct. There is also the case of there being no receiver at all for >this address since this case isn't rejected upon connect(). Ideally, >such a socket (which will have 100% packet loss) will be throttled >aggressively. > >Before we go down too far on this path, I also want to clarify that >using UDP over vhost/virtio-net also has this property... this can be >observed by using tcpdump to dump the UDP packets on the bridge network >your VM is using. UDP packets sent to a garbage address can be seen on >the host bridge (this is the nature of UDP, how is the host supposed to >know the address eventually goes nowhere). I mention the above because I >think it is possible for vsock to avoid this cost, given that it >benefits from being point-to-point and g2h/h2g. > >If we're okay with vsock being on par, then the current series does >that. I propose something below that can be added later and maybe >negotiated as a feature bit too. I see and I agree on that, let's do it step by step. 
If we can do it in the first phase is great, but I think is fine to add this feature later. > >> Can we adapt the credit mechanism? >> > >I've thought about this a lot because the attraction of the approach for >me would be that we could get the wait/buffer-limiting logic for free >and without big changes to the protocol, but the problem is that the >unreliable nature of datagrams means that the source's free-running >tx_cnt will become out-of-sync with the destination's fwd_cnt upon >packet loss. We need to understand where the packet can be lost. If the packet always reaches the destination (vsock driver or device), we can discard it, but also update the counters. > >Imagine a source that initializes and starts sending packets before a >destination socket even is created, the source's self-throttling will be >dysfunctional because its tx_cnt will always far exceed the >destination's fwd_cnt. Right, the other problem I see is that the socket aren't connected, so we have 1-N relationship. > >We could play tricks with the meaning of the CREDIT_UPDATE message and >fwd_cnt/buf_alloc fields, but I don't think we want to go down that >path. > >I think that the best and simplest approach introduces a congestion >notification (VIRTIO_VSOCK_OP_CN?). When a packet is dropped, the >destination sends this notification. At a given repeated time period T, >the source can check if it has received any notifications in the last T. >If so, it halves its buffer allocation. If not, it doubles its buffer >allocation unless it is already at its max or original value. > >An "invalid" socket which never has any receiver will converge towards a >rate limit of one packet per time T * log2(average pkt size). That is, a >socket with 100% packet loss will only be able to send 16 bytes every >4T. A default send buffer of MAX_UINT32 and T=5ms would hit zero within >160ms given at least one packet sent per 5ms. I have no idea if that is >a reasonable default T for vsock, I just pulled it out of a hat for the >sake of the example. > >"Normal" sockets will be responsive to high loss and rebalance during >low loss. The source is trying to guess and converge on the actual >buffer state of the destination. > >This would reuse the already-existing throttling mechanisms that >throttle based upon buffer allocation. The usage of sk_sndbuf would have >to be re-worked. The application using sendmsg() will see EAGAIN when >throttled, or just sleep if !MSG_DONTWAIT. I see, it looks interesting, but I think we need to share that information between multiple sockets, since the same destination (cid, port), can be reached by multiple sockets. Another approach could be to have both congestion notification and decongestion, but maybe it produces double traffic. > >I looked at alternative schemes (like the Datagram Congestion Control >Protocol), but I do not think the added complexity is necessary in the >case of vsock (DCCP requires congestion windows, sequence numbers, batch >acknowledgements, etc...). I also looked at UDP-based application >protocols like TFTP, DHCP, and SIP over UDP which use a delay-based >backoff mechanism, but seem to require acknowledgement for those packet >types, which trigger the retries and backoffs. I think we can get away >with the simpler approach and not have to potentially kill performance >with per-packet acknowledgements. Yep I agree. I think our advantage is that the channel (virtqueues), can't lose packets. 
> >> > > >> > > In this series I am also proposing that fairness be reexamined as an >> > > issue separate from datagrams, which differs from my previous series >> > > that coupled these issues. After further testing and reflection on the >> > > design, I do not believe that these need to be coupled and I do not >> > > believe this implementation introduces additional unfairness or >> > > exacerbates pre-existing unfairness. >> >> I see. >> >> > > >> > > I attempted to characterize vsock fairness by using a pool of processes >> > > to stress test the shared resources while measuring the performance of a >> > > lone stream socket. Given unfair preference for datagrams, we would >> > > assume that a lone stream socket would degrade much more when a pool of >> > > datagram sockets was stressing the system than when a pool of stream >> > > sockets are stressing the system. The result, however, showed no >> > > significant difference between the degradation of throughput of the lone >> > > stream socket when using a pool of datagrams to stress the queue over >> > > using a pool of streams. The absolute difference in throughput actually >> > > favored datagrams as interfering least as the mean difference was +16% >> > > compared to using streams to stress test (N=7), but it was not >> > > statistically significant. Workloads were matched for payload size and >> > > buffer size (to approximate memory consumption) and process count, and >> > > stress workloads were configured to start before and last long after the >> > > lifetime of the "lone" stream socket flow to ensure that competing flows >> > > were continuously hot. >> > > >> > > Given the above data, I propose that vsock fairness be addressed >> > > independent of datagrams and to defer its implementation to a future >> > > series. >> >> Makes sense to me. >> >> I left some preliminary comments, anyway now it seems reasonable to use >> the same virtqueues, so we can go head with the spec proposal. >> >> Thanks, >> Stefano >> > >Thanks for the review! You're welcome! Stefano
On Sat, Apr 15, 2023 at 03:55:05PM +0000, Bobby Eshleman wrote: >On Fri, Apr 28, 2023 at 12:43:09PM +0200, Stefano Garzarella wrote: >> On Sat, Apr 15, 2023 at 07:13:47AM +0000, Bobby Eshleman wrote: >> > CC'ing virtio-dev@lists.oasis-open.org because this thread is starting >> > to touch the spec. >> > >> > On Wed, Apr 19, 2023 at 12:00:17PM +0200, Stefano Garzarella wrote: >> > > Hi Bobby, >> > > >> > > On Fri, Apr 14, 2023 at 11:18:40AM +0000, Bobby Eshleman wrote: >> > > > CC'ing Cong. >> > > > >> > > > On Fri, Apr 14, 2023 at 12:25:56AM +0000, Bobby Eshleman wrote: >> > > > > Hey all! >> > > > > >> > > > > This series introduces support for datagrams to virtio/vsock. >> > > >> > > Great! Thanks for restarting this work! >> > > >> > >> > No problem! >> > >> > > > > >> > > > > It is a spin-off (and smaller version) of this series from the summer: >> > > > > https://lore.kernel.org/all/cover.1660362668.git.bobby.eshleman@bytedance.com/ >> > > > > >> > > > > Please note that this is an RFC and should not be merged until >> > > > > associated changes are made to the virtio specification, which will >> > > > > follow after discussion from this series. >> > > > > >> > > > > This series first supports datagrams in a basic form for virtio, and >> > > > > then optimizes the sendpath for all transports. >> > > > > >> > > > > The result is a very fast datagram communication protocol that >> > > > > outperforms even UDP on multi-queue virtio-net w/ vhost on a variety >> > > > > of multi-threaded workload samples. >> > > > > >> > > > > For those that are curious, some summary data comparing UDP and VSOCK >> > > > > DGRAM (N=5): >> > > > > >> > > > > vCPUS: 16 >> > > > > virtio-net queues: 16 >> > > > > payload size: 4KB >> > > > > Setup: bare metal + vm (non-nested) >> > > > > >> > > > > UDP: 287.59 MB/s >> > > > > VSOCK DGRAM: 509.2 MB/s >> > > > > >> > > > > Some notes about the implementation... >> > > > > >> > > > > This datagram implementation forces datagrams to self-throttle according >> > > > > to the threshold set by sk_sndbuf. It behaves similar to the credits >> > > > > used by streams in its effect on throughput and memory consumption, but >> > > > > it is not influenced by the receiving socket as credits are. >> > > >> > > So, sk_sndbuf influece the sender and sk_rcvbuf the receiver, right? >> > > >> > >> > Correct. >> > >> > > We should check if VMCI behaves the same. >> > > >> > > > > >> > > > > The device drops packets silently. There is room for improvement by >> > > > > building into the device and driver some intelligence around how to >> > > > > reduce frequency of kicking the virtqueue when packet loss is high. I >> > > > > think there is a good discussion to be had on this. >> > > >> > > Can you elaborate a bit here? >> > > >> > > Do you mean some mechanism to report to the sender that a destination >> > > (cid, port) is full so the packet will be dropped? >> > > >> > >> > Correct. There is also the case of there being no receiver at all for >> > this address since this case isn't rejected upon connect(). Ideally, >> > such a socket (which will have 100% packet loss) will be throttled >> > aggressively. >> > >> > Before we go down too far on this path, I also want to clarify that >> > using UDP over vhost/virtio-net also has this property... this can be >> > observed by using tcpdump to dump the UDP packets on the bridge network >> > your VM is using. 
UDP packets sent to a garbage address can be seen on >> > the host bridge (this is the nature of UDP, how is the host supposed to >> > know the address eventually goes nowhere). I mention the above because I >> > think it is possible for vsock to avoid this cost, given that it >> > benefits from being point-to-point and g2h/h2g. >> > >> > If we're okay with vsock being on par, then the current series does >> > that. I propose something below that can be added later and maybe >> > negotiated as a feature bit too. >> >> I see and I agree on that, let's do it step by step. >> If we can do it in the first phase is great, but I think is fine to add >> this feature later. >> >> > >> > > Can we adapt the credit mechanism? >> > > >> > >> > I've thought about this a lot because the attraction of the approach for >> > me would be that we could get the wait/buffer-limiting logic for free >> > and without big changes to the protocol, but the problem is that the >> > unreliable nature of datagrams means that the source's free-running >> > tx_cnt will become out-of-sync with the destination's fwd_cnt upon >> > packet loss. >> >> We need to understand where the packet can be lost. >> If the packet always reaches the destination (vsock driver or device), >> we can discard it, but also update the counters. >> >> > >> > Imagine a source that initializes and starts sending packets before a >> > destination socket even is created, the source's self-throttling will be >> > dysfunctional because its tx_cnt will always far exceed the >> > destination's fwd_cnt. >> >> Right, the other problem I see is that the socket aren't connected, so >> we have 1-N relationship. >> > >Oh yeah, good point. > >> > >> > We could play tricks with the meaning of the CREDIT_UPDATE message and >> > fwd_cnt/buf_alloc fields, but I don't think we want to go down that >> > path. >> > >> > I think that the best and simplest approach introduces a congestion >> > notification (VIRTIO_VSOCK_OP_CN?). When a packet is dropped, the >> > destination sends this notification. At a given repeated time period T, >> > the source can check if it has received any notifications in the last T. >> > If so, it halves its buffer allocation. If not, it doubles its buffer >> > allocation unless it is already at its max or original value. >> > >> > An "invalid" socket which never has any receiver will converge towards a >> > rate limit of one packet per time T * log2(average pkt size). That is, a >> > socket with 100% packet loss will only be able to send 16 bytes every >> > 4T. A default send buffer of MAX_UINT32 and T=5ms would hit zero within >> > 160ms given at least one packet sent per 5ms. I have no idea if that is >> > a reasonable default T for vsock, I just pulled it out of a hat for the >> > sake of the example. >> > >> > "Normal" sockets will be responsive to high loss and rebalance during >> > low loss. The source is trying to guess and converge on the actual >> > buffer state of the destination. >> > >> > This would reuse the already-existing throttling mechanisms that >> > throttle based upon buffer allocation. The usage of sk_sndbuf would have >> > to be re-worked. The application using sendmsg() will see EAGAIN when >> > throttled, or just sleep if !MSG_DONTWAIT. >> >> I see, it looks interesting, but I think we need to share that >> information between multiple sockets, since the same destination >> (cid, port), can be reached by multiple sockets. >> > >Good point, that is true. 
> >> Another approach could be to have both congestion notification and >> decongestion, but maybe it produces double traffic. >> > >I think this could simplify things and could reduce noise. It is also >probably sufficient for the source to simply halt upon congestion >notification and resume upon decongestion notification, instead of >scaling up and down like I suggested above. It also avoids the >burstiness that would occur with a "congestion notification"-only >approach where the source guesses when to resume and guesses wrong. > >The congestion notification may want to have an expiration period after >which the sender can resume without receiving a decongestion >notification? If it receives congestion again, then it can halt again. Yep, I agree. Anyway the congestion/decongestion messages should be just a hint, because the other peer has to keep the state and a malicious host/guest could use it for DoS, so the peer could discard these packets if it has no more space to save the state. Thanks, Stefano
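Stefano's point about treating these messages as hints and bounding the state a peer keeps could be approximated by a fixed-size table keyed by (cid, port) in which hints are simply discarded once the table is full. Everything below is hypothetical; no such structure exists in the series and the sizes are invented:

```c
/* Sketch of bounding the congestion state a peer keeps: remember at most
 * CC_TABLE_SIZE congested (cid, port) destinations and silently drop
 * further hints once the table is full, which is safe because the
 * notification is only a hint.
 */
#include <stdbool.h>
#include <stdint.h>

#define CC_TABLE_SIZE 64    /* made-up bound on tracked destinations */
#define CC_EXPIRY_MS  50    /* made-up expiry for a congestion hint */

struct cc_entry {
    bool in_use;
    uint64_t cid;
    uint32_t port;
    uint64_t expires_ms;
};

static struct cc_entry cc_table[CC_TABLE_SIZE];

/* Record a congestion hint; returns false if no slot is available and the
 * hint is discarded rather than growing memory use without bound.
 */
bool cc_note_congestion(uint64_t cid, uint32_t port, uint64_t now_ms)
{
    for (int i = 0; i < CC_TABLE_SIZE; i++) {
        struct cc_entry *e = &cc_table[i];
        if (!e->in_use || (e->cid == cid && e->port == port) ||
            now_ms >= e->expires_ms) {
            *e = (struct cc_entry){
                .in_use = true,
                .cid = cid,
                .port = port,
                .expires_ms = now_ms + CC_EXPIRY_MS,
            };
            return true;
        }
    }
    return false;   /* table full: drop the hint */
}
```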