[RFC,net-next,v4,0/9] net/smc: Introduce SMC-D-based OS internal communication acceleration
Message ID | 1679887699-54797-1-git-send-email-guwen@linux.alibaba.com |
---|---|
Headers |
From: Wen Gu <guwen@linux.alibaba.com>
To: kgraul@linux.ibm.com, wenjia@linux.ibm.com, jaka@linux.ibm.com, wintera@linux.ibm.com, davem@davemloft.net, edumazet@google.com, kuba@kernel.org, pabeni@redhat.com
Cc: linux-s390@vger.kernel.org, netdev@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH net-next v4 0/9] net/smc: Introduce SMC-D-based OS internal communication acceleration
Date: Mon, 27 Mar 2023 11:28:10 +0800 |
Series |
net/smc: Introduce SMC-D-based OS internal communication acceleration
|
|
Message
Wen Gu
March 27, 2023, 3:28 a.m. UTC
Hi, all

# Background

The background and previous discussion can be found in [1] and [6].

We found SMC-D can be used to accelerate OS internal communication, such as
loopback or between two containers within the same OS instance. So this patch
set provides a kind of SMC-D dummy device (we call it the SMC-D loopback device)
to emulate an ISM device, so that SMC-D can also be used on architectures
other than s390. The SMC-D loopback device is designed as a system global
device, visible to all containers.

This version is implemented based on the generalized interface provided by [2].
And there is an open issue, which will be mentioned later.

# Design

This patch set basically follows the design of the previous version.

Patch #1/9 ~ #3/9 attempt to decouple ISM-related structures from the SMC-D
generalized code and extract some helpers to make SMC-D protocol compatible
with devices other than s390 ISM device.

Patch #4/9 introduces a kind of loopback device, which is defined as SMC-D v2
device and designed to provide communication between SMC sockets in the same OS
instance.

 +-------------------------------------------+
 |  +--------------+       +--------------+  |
 |  | SMC socket A |       | SMC socket B |  |
 |  +--------------+       +--------------+  |
 |       ^                         ^         |
 |       |    +----------------+   |         |
 |       |    |   SMC stack    |   |         |
 |       +--->| +------------+ |<--|         |
 |            | |   dummy    | |             |
 |            | |   device   | |             |
 |            +-+------------+-+             |
 |                    OS                     |
 +-------------------------------------------+

Patch #5/9 ~ #8/9 expand SMC-D protocol interface (smcd_ops) for scenarios where
SMC-D is used to communicate within VM (loopback here) or between VMs on the same
host (based on virtio-ism device, see [3]). What these scenarios have in common
is that the local sndbuf and peer RMB can be mapped to the same physical memory
region, so the data copy between the local sndbuf and peer RMB can be omitted.
Performance improvement brought by this extension can be found in # Benchmark Test.

 +----------+                     +----------+
 | socket A |                     | socket B |
 +----------+                     +----------+
       |                               ^
       |         +---------+           |
  regard as      |         | ----------|
  local sndbuf   |   B's   |       regard as
       |         |   RMB   |       local RMB
       |-------> |         |
                 +---------+

Patch #9/9 realizes the support of loopback device for the above-mentioned expanded
SMC-D protocol interface.

# Benchmark Test

 * Test environments:
      - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
      - SMC sndbuf/RMB size 1MB.

 * Test object:
      - TCP lo: run on TCP loopback.
      - domain: run on UNIX domain.
      - SMC lo: run on SMC loopback device with patch #1/9 ~ #4/9.
      - SMC lo-nocpy: run on SMC loopback device with patch #1/9 ~ #9/9.

1. ipc-benchmark (see [4])

 - ./<foo> -c 1000000 -s 100

                  TCP-lo    domain             SMC-lo             SMC-lo-nocpy
Message
rate (msg/s)      79025     115736(+46.45%)    146760(+85.71%)    149800(+89.56%)

2. sockperf

 - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
 - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30

                  TCP-lo      SMC-lo               SMC-lo-nocpy
Bandwidth(MBps)   4822.388    4940.918(+2.56%)     8086.67(+67.69%)
Latency(us)       6.298       3.352(-46.78%)       3.35(-46.81%)

3. iperf3

 - serv: <smc_run> taskset -c <cpu> iperf3 -s
 - clnt: <smc_run> taskset -c <cpu> iperf3 -c 127.0.0.1 -t 15

                  TCP-lo      SMC-lo               SMC-lo-nocpy
Bitrate(Gb/s)     40.7        40.5(-0.49%)         72.4(+77.89%)

4. nginx/wrk

 - serv: <smc_run> nginx
 - clnt: <smc_run> wrk -t 8 -c 500 -d 30 http://127.0.0.1:80

                  TCP-lo      SMC-lo               SMC-lo-nocpy
Requests/s        155994.57   214544.79(+37.53%)   215538.55(+38.17%)

# Open issue

The open issue is about how to detect that the source and target of CLC proposal
are within the same OS instance and can communicate through the SMC-D loopback device.
Similar issue also exists when using virtio-ism devices (the background and details
of virtio-ism device can be referred from [3]). In previous discussions, multiple
options were proposed (see [5]). Thanks again for the help of the community. :)

But as we discussed, these solutions have some imperfection.
So this version of RFC continues to use the previous workaround, that is, a 64-bit
random GID is generated for the SMC-D loopback device. If the GIDs of the devices
found by two peers are the same, then they are considered to be in the same OS
instance and can communicate with each other by the loopback device.

This approach requires the loopback device GID to be globally unique. But
theoretically there is a possibility of a collision. Assume the following situations:

(1) Assume that the SMC-D loopback devices of the two different OS instances happen
    to generate the same 64-bit GID.

    For the convenience of description, we refer to the sockets on these two
    different OS instances as server A and client B.

    A will misjudge that the two are on the same OS instance because of the same
    GID in the CLC proposal message. Then A creates its RMB and sends 64-bit
    token-A to B in the CLC accept message.

    B receives the CLC accept message. And according to patch #7/9, B tries to
    attach its sndbuf to A's RMB by token-A.

(2) And assume that the OS instance where B is located happens to have an unattached
    RMB whose 64-bit token is the same as token-A.

    Then B successfully attaches its sndbuf to the wrong RMB, creates its RMB, and
    sends token-B to A in the CLC confirm message.

    Similarly, A receives the message and tries to attach its sndbuf to B's RMB by
    token-B.

(3) Similar to (2), assume that the OS instance where A is located happens to have
    an unattached RMB whose 64-bit token is the same as token-B.

    Then A successfully attaches its sndbuf to the wrong RMB. Both sides mistakenly
    believe that an SMC-D connection based on the loopback device is established
    between them.

If the above 3 coincidences all happen, that is, 64-bit random number conflicts occur
3 times, then an unreachable SMC-D connection will be established, which is nasty.
But if any one of the above is not satisfied, it will safely fall back to TCP.

Since the chances of these happening are very small, I wonder if this risk of
1/2^(64*3) probability is acceptable?
Can we just use a 64-bit randomly generated number as GID in the loopback device?

Some other ways that may be able to make loopback GID unique are
1) Using a 128-bit UUID to identify SMC-D loopback device or virtio-ism device, because
   the probability of a 128-bit UUID collision is considered negligible. But it needs
   to extend the CLC message to carry a longer GID.
2) Using MAC address of netdev in the OS as part of SMC-D loopback device GID, provided
   that the MAC addresses are unique. But the MAC address could theoretically also be
   incorrectly set to be the same.

Hope to hear opinions from the community. Any ideas are welcome.

Thanks!
Wen Gu

v3->v4
 1. Rebase to the latest net-next;
 2. Introduce SEID helper. SMC-D loopback will return SMCD_DEFAULT_V2_SEID. And if it
    coexists with an ISM device, the SEID of the ISM device will overwrite
    SMCD_DEFAULT_V2_SEID as smc_ism_v2_system_eid.
 3. Won't remove dmb_node from hashtable until no sndbuf is attached to it.

Something postponed in this version
 1. Hierarchy preference of SMC-D devices when loopback and ISM devices coexist, which
    will be determined after comparing the performance of loopback and ISM.

v2->v3
 1. Adapt new generalized interface provided by [2];
 2. Select loopback device through SMC-D v2 protocol;
 3. Split the loopback-related implementation and generic implementation into different
    patches more reasonably.

v1->v2
 1. Fix some build WARNINGs complained by kernel test robot
    Reported-by: kernel test robot <lkp@intel.com>
 2. Add iperf3 test data.

[1] https://lore.kernel.org/netdev/1671506505-104676-1-git-send-email-guwen@linux.alibaba.com/
[2] https://lore.kernel.org/netdev/20230123181752.1068-1-jaka@linux.ibm.com/
[3] https://lists.oasis-open.org/archives/virtio-comment/202302/msg00148.html
[4] https://github.com/goldsborough/ipc-bench
[5] https://lore.kernel.org/netdev/b9867c7d-bb2b-16fc-feda-b79579aa833d@linux.ibm.com/
[6] https://lore.kernel.org/netdev/1676477905-88043-1-git-send-email-guwen@linux.alibaba.com/

Wen Gu (9):
  net/smc: Decouple ism_dev from SMC-D device dump
  net/smc: Decouple ism_dev from SMC-D DMB registration
  net/smc: Extract v2 check helper from SMC-D device registration
  net/smc: Introduce SMC-D loopback device
  net/smc: Introduce an interface for getting DMB attribute
  net/smc: Introudce interfaces for DMB attach and detach
  net/smc: Avoid data copy from sndbuf to peer RMB in SMC-D
  net/smc: Modify cursor update logic when using mappable DMB
  net/smc: Add interface implementation of loopback device

 drivers/s390/net/ism_drv.c |   5 +-
 include/net/smc.h          |  18 +-
 net/smc/Makefile           |   2 +-
 net/smc/af_smc.c           |  26 ++-
 net/smc/smc_cdc.c          |  59 ++++--
 net/smc/smc_cdc.h          |   1 +
 net/smc/smc_core.c         |  70 ++++++-
 net/smc/smc_core.h         |   1 +
 net/smc/smc_ism.c          |  99 ++++++++--
 net/smc/smc_ism.h          |   5 +
 net/smc/smc_loopback.c     | 445 +++++++++++++++++++++++++++++++++++++++++++++
 net/smc/smc_loopback.h     |  56 ++++++
 12 files changed, 750 insertions(+), 37 deletions(-)
 create mode 100644 net/smc/smc_loopback.c
 create mode 100644 net/smc/smc_loopback.h
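
For readers who want to play with the scenario from "# Open issue", here is a minimal user-space sketch in Python. It is not the code from this series: the class `LoopbackDevice`, the function `connect()` and the dictionary standing in for the DMB hashtable are all hypothetical names, and the model ignores whether an RMB is already attached. It only illustrates two points made above: in the genuine same-OS case the sndbuf and the peer RMB are the same memory, and with a GID collision alone the token lookup fails and the connection falls back to TCP.

```python
import secrets


class LoopbackDevice:
    """Toy model of one OS instance's SMC-D loopback device (hypothetical)."""

    def __init__(self, gid=None):
        # 64-bit random GID, mirroring the workaround in the cover letter.
        self.gid = secrets.randbits(64) if gid is None else gid
        self.dmbs = {}  # token -> buffer, standing in for the DMB hashtable

    def register_dmb(self, size):
        """Create an RMB and return its 64-bit random token."""
        token = secrets.randbits(64)
        while token in self.dmbs:  # keep tokens unique within one device
            token = secrets.randbits(64)
        self.dmbs[token] = bytearray(size)
        return token

    def attach_dmb(self, token):
        """Return the buffer behind token, or None if it is unknown here."""
        return self.dmbs.get(token)


def connect(server_dev, client_dev, size=16):
    """Return (mode, client_sndbuf, server_sndbuf); mode is "smc-d" or "tcp"."""
    fallback = ("tcp", None, None)
    # Coincidence (1): equal GIDs are taken to mean "same OS instance".
    if server_dev.gid != client_dev.gid:
        return fallback
    token_a = server_dev.register_dmb(size)  # sent in CLC accept
    # Coincidence (2): the client must find token-A on its own device.
    client_sndbuf = client_dev.attach_dmb(token_a)
    if client_sndbuf is None:
        return fallback
    token_b = client_dev.register_dmb(size)  # sent in CLC confirm
    # Coincidence (3): the server must find token-B on its own device.
    server_sndbuf = server_dev.attach_dmb(token_b)
    if server_sndbuf is None:
        return fallback
    return ("smc-d", client_sndbuf, server_sndbuf)


if __name__ == "__main__":
    dev = LoopbackDevice()
    mode, snd, _ = connect(dev, dev)  # genuine same-OS case
    snd[:5] = b"hello"  # written once, visible in the peer RMB without a copy
    print(mode, any(buf[:5] == b"hello" for buf in dev.dmbs.values()))
    # Two OS instances that happen to share a GID: the token lookup fails.
    print(connect(LoopbackDevice(gid=7), LoopbackDevice(gid=7))[0])
```

In this model a wrong connection would need all three lookups to succeed by chance, which is the case the cover letter asks about.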
Comments
On 27.03.23 05:28, Wen Gu wrote:
> Hope to hear opinions from the community. Any ideas are welcome.
>
> Thanks!
> Wen Gu

Hi Wen,

Thank you for the new version. The discussion on the open issue is still
on-going in our organisation internally. I appreciate your patience!

One thing I need to mention: while testing the loopback device on our
platform we get a crash, because smc_ism_signal_shutdown() is called by
smc_lgr_free_work(), which is called indirectly by smc_conn_free(). Please
make sure that the loopback device goes through this path cleanly.

Any question and consideration is welcome!

Thanks,
Wenjia
On Mon, 2023-03-27 at 11:28 +0800, Wen Gu wrote:
> Since the chances of these happening are very small, I wonder if this risk of 1/2^(64*3)
> probability is acceptable? Can we just use 64-bits random generated number as GID in
> loopback device?

Let me just spell out some details here to make sure we're all on the
same page.

You're assuming that GIDs are generated randomly at cryptographic
quality. In the code I can see that you use get_random_bytes(), which, as
its comment explains, supplies the same quality randomness as
/dev/urandom, so on modern kernels that should provide cryptographic
quality randomness and be fine. Might be something to keep in mind for
backports though.

The fixed CHID of 0xFFFF makes sure this system identity confusion can
only occur between SMC-D loopback (and possibly virtio-ism?), never with
ISM based SMC-D or SMC-R, as these never use this CHID value. Correct?

Now for the collision scenario above. As I understand it, the
probability of the case where fallback does *not* occur is equivalent
to a 128 bit hash collision. Basically the random 64 bit GID_A
concatenated with the 64 bit DMB Token_A needs to just happen to match
the concatenation of the random 64 bit GID_B with DMB Token_B. With
that interpretation we can consult Wikipedia[0] for a nice table of how
many random GID+DMB Token choices are needed for a certain collision
probability. For 128 bits at least 8.2×10^11 tries would be needed just
to reach a 10^-15 collision probability. Considering that the collision
does not only need to exist between two systems, but these also need to
try to communicate with each other and happen to use the colliding DMBs
for things to get into the broken fallback case, I think from a
theoretical point of view this sounds like negligible risk to me.

That said, I'm more worried about the fallback to TCP being broken due
to a code bug once the GIDs do match, which is already extremely unlikely
and thus not naturally tested in the wild. Do we have a plan for how to
keep testing that fallback scenario? Maybe with a selftest or something?

If we can solve the testing part then I'm personally in favor of this
approach of going with cryptographically random GID and DMB token. It's
simple, doesn't depend on external factors, and doesn't need a
protocol extension except for possibly reserving CHID 0xFFFF.

One more question though: what about the SEID, why does that have to be
fixed and at least partially match what ISM devices use? I think I'm
missing some SMC protocol/design detail here. I'm guessing this would
require a protocol change?

Thanks,
Niklas

[0] https://en.wikipedia.org/wiki/Birthday_attack
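
The 128-bit figure quoted from the Wikipedia table can be re-derived with the usual birthday approximation, n ≈ sqrt(2 · 2^bits · ln(1/(1-p))). The Python sketch below is only a sanity check of that arithmetic; the function names are made up for illustration, and it assumes GID and DMB token are uniformly random and independent.

```python
import math


def tries_for_collision(bits, p):
    """Approximate number of uniformly random `bits`-bit values that must be
    drawn before a collision occurs with probability p (birthday bound)."""
    space = 2.0 ** bits
    # -log1p(-p) is ln(1/(1-p)), computed accurately for very small p.
    return math.sqrt(2.0 * space * -math.log1p(-p))


def collision_probability(bits, n):
    """Approximate probability of at least one collision among n random values."""
    space = 2.0 ** bits
    # 1 - exp(-n(n-1)/(2*space)), using expm1 to avoid cancellation.
    return -math.expm1(-n * (n - 1) / (2.0 * space))


if __name__ == "__main__":
    # GID (64 bit) concatenated with DMB token (64 bit) treated as one 128-bit value.
    print(f"{tries_for_collision(128, 1e-15):.3e}")
    # For comparison: a 64-bit GID on its own.
    print(f"{tries_for_collision(64, 1e-15):.1f}")
    print(f"{collision_probability(128, 8.2e11):.3e}")
```

For 128 bits and p = 10^-15 this gives roughly 8.2×10^11, in line with the number cited above.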
On 05.04.23 19:04, Niklas Schnelle wrote:
> One more question though, what about the SEID? Why does that have to be
> fixed and at least partially match what ISM devices use? I think I'm
> missing some SMC protocol/design detail here. I'm guessing this would
> require a protocol change?
>
> Thanks,
> Niklas

Niklas,
in the initial SMC CLC handshake the client and server exchange the SEID (one per peer system) and up to 8 proposals for SMC-D interfaces. Wen's current proposal assumes that SMC-D loopback can be one of these 8 proposed interfaces, iiuc. So on s390 the proposal can contain ISM devices and an SMC-D loopback device at the same time. If one of the peers is e.g. an older Linux version, it will just ignore the loopback device in the list (it doesn't find a match for CHID 0xFFFF) and use an ISM interface for SMC-D if possible. Therefore it is important that the SEID is used in the same way as it is today in the handshake.

If we decide for some reason (virtio-ism open issues?) that a protocol change/extension is required/wanted, then it is a new game and we can come up with new identifiers, but we may lose compatibility with backlevel systems.

Alexandra
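To illustrate the compatibility argument, here is a toy model of how a peer might pick a common SMC-D device from the proposal list. It is not the kernel's CLC code: the `Peer` class, the `select_smcd_device` helper, the placeholder SEID string and the example CHID 0x0123 are all hypothetical. Only the facts from the mail (one SEID per system, up to 8 proposed devices, loopback identified by CHID 0xFFFF) come from the discussion.

```python
from dataclasses import dataclass, field
from typing import List, Optional

LOOPBACK_CHID = 0xFFFF  # CHID used by SMC-D loopback in this series
MAX_PROPOSALS = 8       # up to 8 SMC-D interfaces per CLC proposal


@dataclass
class Peer:
    seid: str                                       # one SEID per peer system
    chids: List[int] = field(default_factory=list)  # CHIDs of local SMC-D devices


def select_smcd_device(client: Peer, server: Peer) -> Optional[int]:
    """Return the first proposed CHID the server also knows, else None."""
    if client.seid != server.seid:
        return None  # SEIDs differ: treated as "no SMC-D" in this model
    for chid in client.chids[:MAX_PROPOSALS]:
        if chid in server.chids:
            return chid
    return None


seid = "EXAMPLE-SEID"                            # placeholder, not a real SEID
new_peer = Peer(seid, [LOOPBACK_CHID, 0x0123])   # loopback plus one ISM device
old_peer = Peer(seid, [0x0123])                  # older kernel: no loopback device

chosen_with_old = select_smcd_device(new_peer, old_peer)
chosen_with_new = select_smcd_device(new_peer, new_peer)
```

In this model the older peer never matches CHID 0xFFFF and the pair settles on the ISM device, while two loopback-capable peers with the same SEID select the loopback entry.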
On Thu, 2023-04-06 at 13:14 +0200, Alexandra Winter wrote:
>
> On 05.04.23 19:04, Niklas Schnelle wrote:
> > One more question though, what about the SEID? Why does that have to be
> > fixed and at least partially match what ISM devices use? I think I'm
> > missing some SMC protocol/design detail here. I'm guessing this would
> > require a protocol change?
> >
> > Thanks,
> > Niklas
>
> Niklas,
> in the initial SMC CLC handshake the client and server exchange the SEID (one per peer system)
> and up to 8 proposals for SMC-D interfaces.
> Wen's current proposal assumes that smc-d loopback can be one of these 8 proposed interfaces,
> iiuc. So on s390 the proposal can contain ISM devices and a smc-d loopback device at the same time.
> If one of the peers is e.g. an older Linux version, it will just ignore the loopback-device
> in the list (Don't find a match for CHID 0xFFFF) and use an ISM interface for SMC-D if possible.
> Therefore it is important that the SEID is used in the same way as it is today in the handshake.
>
> If we decide for some reason (virtio-ism open issues?) that a protocol change/extension is
> required/wanted, then it is a new game and we can come up with new identifiers, but we may
> lose compatibility to backlevel systems.
>
> Alexandra

Ok, that makes sense to me. I was looking at the code in patch 4 of this series and there it looks to me like SMC-D loopback as implemented would always use the newly added SMCD_DEFAULT_V2_SEID; I might have misread it though. From your description I think that would be wrong: if a SEID is defined, as on s390, it should use that SEID in the CLC for all SMC variants. Similarly on other architectures it should use the same SEID for SMC-D as for SMC-R, right?
Also, with "partially match" I was actually wrong: the SMCD_DEFAULT_V2_SEID.seid_string starts with "IBM-DEF-ISMSEID…" while on s390's existing ISM we use "IBM-SYSZ-ISMSEID…", so if SMC-D loopback correctly uses the shared SEID on s390 we can already only get GID.DMB collisions on the same mainframe.

Thanks,
Niklas
Hi Niklas,

On 2023/4/6 01:04, Niklas Schnelle wrote:

>
> Let me just spell out some details here to make sure we're all on the
> same page.
>
> You're assuming that GIDs are generated randomly at cryptographic
> quality. In the code I can see that you use get_random_bytes() which as
> its comment explains supplies the same quality randomness as
> /dev/urandom so on modern kernels that should provide cryptographic
> quality randomness and be fine. Might be something to keep in mind for
> backports though.
>
> The fixed CHID of 0xFFFF makes sure this system identity confusion can
> only occur between SMC-D loopback (and possibly virtio-ism?) never with
> ISM based SMC-D or SMC-R as these never use this CHID value. Correct?

Yes, the CHID of 0xFFFF used for SMC-D loopback ensures that a GID collision won't involve ISM-based SMC-D or SMC-R.

>
> Now for the collision scenario above. As I understand it the
> probability of the case where fallback does *not* occur is equivalent
> to a 128 bit hash collision. Basically the random 64 bit GID_A
> concatenated with the 64 bit DMB Token_A needs to just happen to match
> the concatenation of the random 64 bit GID_B with DMB Token_B.

Yes, almost like this. A small correction: Token_A happens to match a DMB token in B's kernel (not necessarily Token_B) and Token_B happens to match a DMB token in A's kernel (not necessarily Token_A).

> With that interpretation we can consult Wikipedia[0] for a nice table of how
> many random GID+DMB Token choices are needed for a certain collision
> probability. For 128 bits at least 8.2×10^11 tries would be needed just
> to reach a 10^-15 collision probability. Considering the collision does
> not only need to exist between two systems but these also need to try
> to communicate with each other and happen to use the colliding DMBs for
> things to get into the broken fallback case I think from a theoretical
> point of view this sounds like negligible risk to me.
>

Thanks for the reference data.
> That said I'm more worried about the fallback to TCP being broken due
> to a code bug once the GIDs do match which is already extremely
> unlikely and thus not naturally tested in the wild. Do we have a plan
> how to keep testing that fallback scenario somehow. Maybe with a
> selftest or something?
>

IIUC, you are worried about the code implementation of the fallback when the GID collides but the DMB token check works? If so, I think we can provide a way to set the loopback device's GID manually, so that we can inject a GID collision fault to test the code.

> If we can solve the testing part then I'm personally in favor of this
> approach of going with cryptographically random GID and DMB token. It's
> simple and doesn't depend on external factors and doesn't need a
> protocol extension except for possibly reserving CHID 0xFFFF.
>
> One more question though, what about the SEID? Why does that have to be
> fixed and at least partially match what ISM devices use? I think I'm
> missing some SMC protocol/design detail here. I'm guessing this would
> require a protocol change?

The SEID-related topic will be answered in the next e-mail.

>
> Thanks,
> Niklas
>
> [0] https://en.wikipedia.org/wiki/Birthday_attack
>

Thanks!
Wen Gu
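The fault-injection idea could be exercised against a model like the following. This is a userspace sketch, not kernel code: `LoopbackDev`, its optional `gid` override and `try_loopback` are hypothetical names, and the order of the checks is an assumption based on the collision conditions discussed in this thread (equal GIDs, and each side's DMB token happening to exist on the other side).

```python
import secrets
from typing import Optional, Set


class LoopbackDev:
    """Toy model of one kernel's SMC-D loopback device."""

    def __init__(self, gid: Optional[int] = None):
        # Random 64-bit GID unless a test injects a fixed one.
        self.gid = secrets.randbits(64) if gid is None else gid
        self.dmb_tokens: Set[int] = set()

    def register_dmb(self, token: Optional[int] = None) -> int:
        """Register a DMB and return its (random or injected) 64-bit token."""
        token = secrets.randbits(64) if token is None else token
        self.dmb_tokens.add(token)
        return token


def try_loopback(a: LoopbackDev, tok_a: int, b: LoopbackDev, tok_b: int) -> str:
    """Return "smcd" only if every check passes, otherwise "tcp"."""
    if a.gid != b.gid:
        return "tcp"   # GIDs differ: peers are not on the same loopback device
    if tok_a not in b.dmb_tokens or tok_b not in a.dmb_tokens:
        return "tcp"   # GID collided, but the DMB token check caught it
    return "smcd"


# Inject the same GID into two *separate* devices to simulate a collision.
dev_a = LoopbackDev(gid=0x1234)
dev_b = LoopbackDev(gid=0x1234)
tok_a = dev_a.register_dmb(token=1)
tok_b = dev_b.register_dmb(token=2)
collision_result = try_loopback(dev_a, tok_a, dev_b, tok_b)

# A genuine same-kernel connection: both ends share one device.
tok_c = dev_a.register_dmb(token=3)
same_kernel_result = try_loopback(dev_a, tok_a, dev_a, tok_c)
```

With the GID forced to collide, the model still ends on the TCP path because neither token is known to the other device, which is the behaviour a selftest would want to pin down.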
On 2023/4/6 22:27, Niklas Schnelle wrote:
> On Thu, 2023-04-06 at 13:14 +0200, Alexandra Winter wrote:
>>
>> On 05.04.23 19:04, Niklas Schnelle wrote:
>>> One more question though, what about the SEID? Why does that have to be
>>> fixed and at least partially match what ISM devices use? I think I'm
>>> missing some SMC protocol/design detail here. I'm guessing this would
>>> require a protocol change?
>>>
>>> Thanks,
>>> Niklas
>>
>> Niklas,
>> in the initial SMC CLC handshake the client and server exchange the SEID (one per peer system)
>> and up to 8 proposals for SMC-D interfaces.
>> Wen's current proposal assumes that smc-d loopback can be one of these 8 proposed interfaces,
>> iiuc. So on s390 the proposal can contain ISM devices and a smc-d loopback device at the same time.
>> If one of the peers is e.g. an older Linux version, it will just ignore the loopback-device
>> in the list (Don't find a match for CHID 0xFFFF) and use an ISM interface for SMC-D if possible.
>> Therefore it is important that the SEID is used in the same way as it is today in the handshake.
>>
>> If we decide for some reason (virtio-ism open issues?) that a protocol change/extension is
>> required/wanted, then it is a new game and we can come up with new identifiers, but we may
>> lose compatibility to backlevel systems.
>>
>> Alexandra
>
> Ok that makes sense to me. I was looking at the code in patch 4 of this
> series and there it looks to me like SMC-D loopback as implemented
> would always use the newly added SMCD_DEFAULT_V2_SEID might have
> misread it though. From your description I think that would be wrong,
> if a SEID is defined as on s390 it should use that SEID in the CLC for
> all SMC variants. Similarly on other architectures it should use the
> same SEID for SMC-D as for SMC-R, right?
Also with partially match I
> was actually wrong the SMCD_DEFAULT_V2_SEID.seid_string starts with
> "IBM-DEF-ISMSEID…" while on s390's existing ISM we use
> "IBM-SYSZ-ISMSEID…" so if SMC-D loopback correctly uses the shared SEID on s390
> we can already only get GID.DMB collisions only on the same mainframe.
>
> Thanks,
> Niklas

The SMC stack uses a global variable, smc_ism_v2_system_eid, to hold the single SEID of the system. Because all ISMv2 devices return the same SEID, the SEID of the first registered ISMv2 device is assigned to smc_ism_v2_system_eid.

Now we have extension SMC-D devices (loopback or virtio-ism devices), and this may need a little change.

My original idea was that:

- Extension SMC-D devices always return SMCD_DEFAULT_V2_SEID as the SEID.
- If there are only extension devices in the system, smc_ism_v2_system_eid will record the SMCD_DEFAULT_V2_SEID returned by the SMC-D extension device.
- If extension devices coexist with ISM devices on s390, smc_ism_v2_system_eid will record the SEID of the ISM devices.

But inspired by your comments, I find the original idea has some problems in the situation where one side has only extension devices but the other side has both extension and ISM devices. Although they can communicate through the extension devices (virtio-ism), an SMC-D connection is unavailable due to the different SEIDs.

So, as you suggested, the extension devices (loopback or virtio-ism) should get the shared SEID in the same way as ISM devices on the s390 arch. And on architectures other than s390, an extension SMC-D device can use a fixed SEID like SMCD_DEFAULT_V2_SEID here, if we do not require SMC-D communication between different architectures.

Thanks,
Wen Gu
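The mismatch described above can be reproduced with a small model of the SEID selection. Everything here is illustrative: the function names, the `on_s390` flag and the placeholder SEID strings are assumptions, not the kernel's identifiers or values, and both peers are assumed to run on s390.

```python
DEFAULT_SEID = "DEFAULT-SEID"    # stands in for SMCD_DEFAULT_V2_SEID
S390_SEID = "S390-SHARED-SEID"   # stands in for the SEID reported by ISM devices


def system_seid_original(has_ism: bool) -> str:
    """Original idea: ISM SEID if an ISM device exists, else the default."""
    return S390_SEID if has_ism else DEFAULT_SEID


def system_seid_revised(on_s390: bool) -> str:
    """Revised idea: on s390 always use the shared SEID, elsewhere the default."""
    return S390_SEID if on_s390 else DEFAULT_SEID


# Peer A has only extension devices; peer B has extension and ISM devices.
a_orig = system_seid_original(has_ism=False)
b_orig = system_seid_original(has_ism=True)

# Same two peers under the revised rule (both assumed to be on s390).
a_new = system_seid_revised(on_s390=True)
b_new = system_seid_revised(on_s390=True)

print("original idea, SEIDs match:", a_orig == b_orig)
print("revised idea, SEIDs match:", a_new == b_new)
```

Under the original rule the two peers end up with different SEIDs, so the handshake would not agree on SMC-D even though a shared extension device exists; under the revised rule they agree.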
On 2023/4/5 22:48, Wenjia Zhang wrote:

>
> Hi Wen,
>
> Thank you for the new version. The discussion on the open issue is still on-going in our organisation internally. I
> appreciate your patience!
>
> One thing I need to mention during testing the loopback device on our platform is that we get crash, because
> smc_ism_signal_shutdown() is called by smc_lgr_free_work(), which is called indirectly by smc_conn_free(). Please make
> sure that it would go to the path of the loopback device cleanly. Any question and consideration is welcome!
>
> Thanks,
> Wenjia

Thank you, Wenjia! Testing on s390 is really helpful.

Most of the path in smc_ism_signal_shutdown() is inside the preprocessor conditional '#if IS_ENABLED(CONFIG_ISM) ... #endif', so it is not executed in my test environment. Therefore I overlooked the ops->signal_event interface in the loopback device and missed the crash.

I will fix this and check the other parts wrapped by '#if IS_ENABLED(CONFIG_ISM) ... #endif' which I ignored before. Then I will send out a new version.

Thanks,
Wen Gu