Message ID | 20230920192300.3772199-1-goldstein.w.n@gmail.com |
---|---|
State | New |
Headers |
From: Noah Goldstein <goldstein.w.n@gmail.com>
To: x86@kernel.org
Cc: linux-kernel@vger.kernel.org, edumazet@google.com, tglx@linutronix.de, mingo@redhat.com, torvalds@linux-foundation.org, bp@alien8.de, dave.hansen@linux.intel.com, David.Laight@aculab.com, hpa@zytor.com, goldstein.w.n@gmail.com, David Laight <david.laight@aculab.com>
Subject: x86/csum: Remove unnecessary odd handling
Date: Wed, 20 Sep 2023 14:23:00 -0500
Message-Id: <20230920192300.3772199-1-goldstein.w.n@gmail.com>
In-Reply-To: <20230628020657.957880-1-goldstein.w.n@gmail.com>
References: <20230628020657.957880-1-goldstein.w.n@gmail.com> |
Series | x86/csum: Remove unnecessary odd handling |
Commit Message
Noah Goldstein
Sept. 20, 2023, 7:23 p.m. UTC
The special case for odd-aligned buffers is unnecessary and mostly just adds overhead. Aligned buffers are the expectation, and even for unaligned buffers the only case that was helped is when the buffer is one byte off word alignment, which is only ~1/7 of the misaligned cases. Overall it seems highly unlikely to be worth the extra branch.

It was left in the previous performance-improvement patch because I was erroneously comparing the exact output of `csum_partial(...)`, but really we only need `csum_fold(csum_partial(...))` to match, so it is safe to remove.

All csum kunit tests pass.

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Laight <david.laight@aculab.com>
---
 arch/x86/lib/csum-partial_64.c | 36 ++++------------------------------
 1 file changed, 4 insertions(+), 32 deletions(-)
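For context on the `csum_fold(csum_partial(...))` argument: `csum_partial()` returns a 32-bit partial one's-complement sum (`__wsum`), and callers reduce it to the final 16-bit checksum with `csum_fold()`. A sketch of the usual fold, close to the generic kernel helper but not quoted from it:

	static inline __sum16 csum_fold_sketch(__wsum csum)
	{
		u32 sum = (__force u32)csum;

		/* fold the top 16 bits into the bottom 16, twice, to absorb the carry */
		sum = (sum & 0xffff) + (sum >> 16);
		sum = (sum & 0xffff) + (sum >> 16);

		/* the checksum is the one's complement of the folded sum */
		return (__force __sum16)~sum;
	}

Callers only consume this folded 16-bit value, which is why the raw `__wsum` returned by `csum_partial()` does not have to be bit-for-bit identical to what the old code produced.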
Comments
Hi Noah,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tip/x86/core]
[also build test WARNING on tip/master tip/auto-latest linus/master v6.6-rc2 next-20230921]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Noah-Goldstein/x86-csum-Remove-unnecessary-odd-handling/20230921-032450
base:   tip/x86/core
patch link:    https://lore.kernel.org/r/20230920192300.3772199-1-goldstein.w.n%40gmail.com
patch subject: x86/csum: Remove unnecessary odd handling
config: x86_64-randconfig-121-20230921 (https://download.01.org/0day-ci/archive/20230923/202309231130.ZI5MdlDc-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230923/202309231130.ZI5MdlDc-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202309231130.ZI5MdlDc-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
>> arch/x86/lib/csum-partial_64.c:16:45: sparse: sparse: incorrect type in return expression (different base types) @@ expected restricted __wsum @@ got unsigned long long @@
   arch/x86/lib/csum-partial_64.c:16:45: sparse: expected restricted __wsum
   arch/x86/lib/csum-partial_64.c:16:45: sparse: got unsigned long long
>> arch/x86/lib/csum-partial_64.c:16:45: sparse: sparse: incorrect type in return expression (different base types) @@ expected restricted __wsum @@ got unsigned long long @@
   arch/x86/lib/csum-partial_64.c:16:45: sparse: expected restricted __wsum
   arch/x86/lib/csum-partial_64.c:16:45: sparse: got unsigned long long

vim +16 arch/x86/lib/csum-partial_64.c

    13
    14  static inline __wsum csum_finalize_sum(u64 temp64)
    15  {
  > 16          return (temp64 + ror64(temp64, 32)) >> 32;
    17  }
    18
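For context on the warning: `__wsum` is one of sparse's "restricted" integer types, declared in the kernel's uapi headers roughly as shown below, so returning a plain `unsigned long long` expression as `__wsum` needs an explicit `__force` cast.

	/* roughly as in include/uapi/linux/types.h */
	typedef __u32 __bitwise __wsum;

	/*
	 * Under sparse (__CHECKER__), __bitwise marks the type as restricted:
	 * mixing it with ordinary integers without a __force cast produces the
	 * "incorrect type in return expression" warning quoted above.
	 */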
On Fri, Sep 22, 2023 at 10:25 PM kernel test robot <lkp@intel.com> wrote: > > Hi Noah, > > kernel test robot noticed the following build warnings: > > [auto build test WARNING on tip/x86/core] > [also build test WARNING on tip/master tip/auto-latest linus/master v6.6-rc2 next-20230921] > [If your patch is applied to the wrong git tree, kindly drop us a note. > And when submitting patch, we suggest to use '--base' as documented in > https://git-scm.com/docs/git-format-patch#_base_tree_information] > > url: https://github.com/intel-lab-lkp/linux/commits/Noah-Goldstein/x86-csum-Remove-unnecessary-odd-handling/20230921-032450 > base: tip/x86/core > patch link: https://lore.kernel.org/r/20230920192300.3772199-1-goldstein.w.n%40gmail.com > patch subject: x86/csum: Remove unnecessary odd handling > config: x86_64-randconfig-121-20230921 (https://download.01.org/0day-ci/archive/20230923/202309231130.ZI5MdlDc-lkp@intel.com/config) > compiler: gcc-12 (Debian 12.2.0-14) 12.2.0 > reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230923/202309231130.ZI5MdlDc-lkp@intel.com/reproduce) > > If you fix the issue in a separate patch/commit (i.e. not just a new version of > the same patch/commit), kindly add following tags > | Reported-by: kernel test robot <lkp@intel.com> > | Closes: https://lore.kernel.org/oe-kbuild-all/202309231130.ZI5MdlDc-lkp@intel.com/ > > sparse warnings: (new ones prefixed by >>) > >> arch/x86/lib/csum-partial_64.c:16:45: sparse: sparse: incorrect type in return expression (different base types) @@ expected restricted __wsum @@ got unsigned long long @@ > arch/x86/lib/csum-partial_64.c:16:45: sparse: expected restricted __wsum > arch/x86/lib/csum-partial_64.c:16:45: sparse: got unsigned long long > >> arch/x86/lib/csum-partial_64.c:16:45: sparse: sparse: incorrect type in return expression (different base types) @@ expected restricted __wsum @@ got unsigned long long @@ > arch/x86/lib/csum-partial_64.c:16:45: sparse: expected restricted __wsum > arch/x86/lib/csum-partial_64.c:16:45: sparse: got unsigned long long > > vim +16 arch/x86/lib/csum-partial_64.c > > 13 > 14 static inline __wsum csum_finalize_sum(u64 temp64) > 15 { > > 16 return (temp64 + ror64(temp64, 32)) >> 32; > 17 } > 18 Just needs a `(__wsum)` cast. Should I post a new patch? > > -- > 0-DAY CI Kernel Test Service > https://github.com/intel/lkp-tests/wiki
From: Noah Goldstein > Sent: 23 September 2023 15:05 > > On Fri, Sep 22, 2023 at 10:25 PM kernel test robot <lkp@intel.com> wrote: > > > > Hi Noah, > > > > kernel test robot noticed the following build warnings: > > > > [auto build test WARNING on tip/x86/core] > > [also build test WARNING on tip/master tip/auto-latest linus/master v6.6-rc2 next-20230921] > > [If your patch is applied to the wrong git tree, kindly drop us a note. > > And when submitting patch, we suggest to use '--base' as documented in > > https://git-scm.com/docs/git-format-patch#_base_tree_information] > > > > url: https://github.com/intel-lab-lkp/linux/commits/Noah-Goldstein/x86-csum-Remove-unnecessary- > odd-handling/20230921-032450 > > base: tip/x86/core > > patch link: https://lore.kernel.org/r/20230920192300.3772199-1-goldstein.w.n%40gmail.com > > patch subject: x86/csum: Remove unnecessary odd handling > > config: x86_64-randconfig-121-20230921 (https://download.01.org/0day- > ci/archive/20230923/202309231130.ZI5MdlDc-lkp@intel.com/config) > > compiler: gcc-12 (Debian 12.2.0-14) 12.2.0 > > reproduce (this is a W=1 build): (https://download.01.org/0day- > ci/archive/20230923/202309231130.ZI5MdlDc-lkp@intel.com/reproduce) > > > > If you fix the issue in a separate patch/commit (i.e. not just a new version of > > the same patch/commit), kindly add following tags > > | Reported-by: kernel test robot <lkp@intel.com> > > | Closes: https://lore.kernel.org/oe-kbuild-all/202309231130.ZI5MdlDc-lkp@intel.com/ > > > > sparse warnings: (new ones prefixed by >>) > > >> arch/x86/lib/csum-partial_64.c:16:45: sparse: sparse: incorrect type in return expression > (different base types) @@ expected restricted __wsum @@ got unsigned long long @@ > > arch/x86/lib/csum-partial_64.c:16:45: sparse: expected restricted __wsum > > arch/x86/lib/csum-partial_64.c:16:45: sparse: got unsigned long long > > >> arch/x86/lib/csum-partial_64.c:16:45: sparse: sparse: incorrect type in return expression > (different base types) @@ expected restricted __wsum @@ got unsigned long long @@ > > arch/x86/lib/csum-partial_64.c:16:45: sparse: expected restricted __wsum > > arch/x86/lib/csum-partial_64.c:16:45: sparse: got unsigned long long > > > > vim +16 arch/x86/lib/csum-partial_64.c > > > > 13 > > 14 static inline __wsum csum_finalize_sum(u64 temp64) > > 15 { > > > 16 return (temp64 + ror64(temp64, 32)) >> 32; > > 17 } > > 18 > > Just needs a `(__wsum)` cast. Should I post a new patch? It'll need to be a (__force __wsum) cast. I think new patches are expected... David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Sat, Sep 23, 2023 at 4:13 PM David Laight <David.Laight@aculab.com> wrote: > > From: Noah Goldstein > > Sent: 23 September 2023 15:05 > > > > On Fri, Sep 22, 2023 at 10:25 PM kernel test robot <lkp@intel.com> wrote: > > > > > > Hi Noah, > > > > > > kernel test robot noticed the following build warnings: > > > > > > [auto build test WARNING on tip/x86/core] > > > [also build test WARNING on tip/master tip/auto-latest linus/master v6.6-rc2 next-20230921] > > > [If your patch is applied to the wrong git tree, kindly drop us a note. > > > And when submitting patch, we suggest to use '--base' as documented in > > > https://git-scm.com/docs/git-format-patch#_base_tree_information] > > > > > > url: https://github.com/intel-lab-lkp/linux/commits/Noah-Goldstein/x86-csum-Remove-unnecessary- > > odd-handling/20230921-032450 > > > base: tip/x86/core > > > patch link: https://lore.kernel.org/r/20230920192300.3772199-1-goldstein.w.n%40gmail.com > > > patch subject: x86/csum: Remove unnecessary odd handling > > > config: x86_64-randconfig-121-20230921 (https://download.01.org/0day- > > ci/archive/20230923/202309231130.ZI5MdlDc-lkp@intel.com/config) > > > compiler: gcc-12 (Debian 12.2.0-14) 12.2.0 > > > reproduce (this is a W=1 build): (https://download.01.org/0day- > > ci/archive/20230923/202309231130.ZI5MdlDc-lkp@intel.com/reproduce) > > > > > > If you fix the issue in a separate patch/commit (i.e. not just a new version of > > > the same patch/commit), kindly add following tags > > > | Reported-by: kernel test robot <lkp@intel.com> > > > | Closes: https://lore.kernel.org/oe-kbuild-all/202309231130.ZI5MdlDc-lkp@intel.com/ > > > > > > sparse warnings: (new ones prefixed by >>) > > > >> arch/x86/lib/csum-partial_64.c:16:45: sparse: sparse: incorrect type in return expression > > (different base types) @@ expected restricted __wsum @@ got unsigned long long @@ > > > arch/x86/lib/csum-partial_64.c:16:45: sparse: expected restricted __wsum > > > arch/x86/lib/csum-partial_64.c:16:45: sparse: got unsigned long long > > > >> arch/x86/lib/csum-partial_64.c:16:45: sparse: sparse: incorrect type in return expression > > (different base types) @@ expected restricted __wsum @@ got unsigned long long @@ > > > arch/x86/lib/csum-partial_64.c:16:45: sparse: expected restricted __wsum > > > arch/x86/lib/csum-partial_64.c:16:45: sparse: got unsigned long long > > > > > > vim +16 arch/x86/lib/csum-partial_64.c > > > > > > 13 > > > 14 static inline __wsum csum_finalize_sum(u64 temp64) > > > 15 { > > > > 16 return (temp64 + ror64(temp64, 32)) >> 32; > > > 17 } > > > 18 > > > > Just needs a `(__wsum)` cast. Should I post a new patch? > > It'll need to be a (__force __wsum) cast. > > I think new patches are expected... > Thank you, posted V4 that should fix the warning. > David > > - > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK > Registration No: 1397386 (Wales)
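The v4 posting itself is not part of this thread; the warning fix is presumably just the cast David describes, along these lines (a sketch, not the actual v4 patch):

	static inline __wsum csum_finalize_sum(u64 temp64)
	{
		/* 64 -> 32 bit one's-complement fold; __force satisfies sparse */
		return (__force __wsum)((temp64 + ror64(temp64, 32)) >> 32);
	}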
On Sun, Sep 24, 2023 at 7:35 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > On Sat, Sep 23, 2023 at 4:13 PM David Laight <David.Laight@aculab.com> wrote: > > > > From: Noah Goldstein > > > Sent: 23 September 2023 15:05 > > > > > > On Fri, Sep 22, 2023 at 10:25 PM kernel test robot <lkp@intel.com> wrote: > > > > > > > > Hi Noah, > > > > > > > > kernel test robot noticed the following build warnings: > > > > > > > > [auto build test WARNING on tip/x86/core] > > > > [also build test WARNING on tip/master tip/auto-latest linus/master v6.6-rc2 next-20230921] > > > > [If your patch is applied to the wrong git tree, kindly drop us a note. > > > > And when submitting patch, we suggest to use '--base' as documented in > > > > https://git-scm.com/docs/git-format-patch#_base_tree_information] > > > > > > > > url: https://github.com/intel-lab-lkp/linux/commits/Noah-Goldstein/x86-csum-Remove-unnecessary- > > > odd-handling/20230921-032450 > > > > base: tip/x86/core > > > > patch link: https://lore.kernel.org/r/20230920192300.3772199-1-goldstein.w.n%40gmail.com > > > > patch subject: x86/csum: Remove unnecessary odd handling > > > > config: x86_64-randconfig-121-20230921 (https://download.01.org/0day- > > > ci/archive/20230923/202309231130.ZI5MdlDc-lkp@intel.com/config) > > > > compiler: gcc-12 (Debian 12.2.0-14) 12.2.0 > > > > reproduce (this is a W=1 build): (https://download.01.org/0day- > > > ci/archive/20230923/202309231130.ZI5MdlDc-lkp@intel.com/reproduce) > > > > > > > > If you fix the issue in a separate patch/commit (i.e. not just a new version of > > > > the same patch/commit), kindly add following tags > > > > | Reported-by: kernel test robot <lkp@intel.com> > > > > | Closes: https://lore.kernel.org/oe-kbuild-all/202309231130.ZI5MdlDc-lkp@intel.com/ > > > > > > > > sparse warnings: (new ones prefixed by >>) > > > > >> arch/x86/lib/csum-partial_64.c:16:45: sparse: sparse: incorrect type in return expression > > > (different base types) @@ expected restricted __wsum @@ got unsigned long long @@ > > > > arch/x86/lib/csum-partial_64.c:16:45: sparse: expected restricted __wsum > > > > arch/x86/lib/csum-partial_64.c:16:45: sparse: got unsigned long long > > > > >> arch/x86/lib/csum-partial_64.c:16:45: sparse: sparse: incorrect type in return expression > > > (different base types) @@ expected restricted __wsum @@ got unsigned long long @@ > > > > arch/x86/lib/csum-partial_64.c:16:45: sparse: expected restricted __wsum > > > > arch/x86/lib/csum-partial_64.c:16:45: sparse: got unsigned long long > > > > > > > > vim +16 arch/x86/lib/csum-partial_64.c > > > > > > > > 13 > > > > 14 static inline __wsum csum_finalize_sum(u64 temp64) > > > > 15 { > > > > > 16 return (temp64 + ror64(temp64, 32)) >> 32; > > > > 17 } > > > > 18 > > > > > > Just needs a `(__wsum)` cast. Should I post a new patch? > > > > It'll need to be a (__force __wsum) cast. > > > > I think new patches are expected... > > > Thank you, posted V4 that should fix the warning. > > David > > > > - > > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK > > Registration No: 1397386 (Wales) ping.
On Sat, Dec 23, 2023 at 2:18 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > On Sun, Sep 24, 2023 at 7:35 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > On Sat, Sep 23, 2023 at 4:13 PM David Laight <David.Laight@aculab.com> wrote: > > > > > > From: Noah Goldstein > > > > Sent: 23 September 2023 15:05 > > > > > > > > On Fri, Sep 22, 2023 at 10:25 PM kernel test robot <lkp@intel.com> wrote: > > > > > > > > > > Hi Noah, > > > > > > > > > > kernel test robot noticed the following build warnings: > > > > > > > > > > [auto build test WARNING on tip/x86/core] > > > > > [also build test WARNING on tip/master tip/auto-latest linus/master v6.6-rc2 next-20230921] > > > > > [If your patch is applied to the wrong git tree, kindly drop us a note. > > > > > And when submitting patch, we suggest to use '--base' as documented in > > > > > https://git-scm.com/docs/git-format-patch#_base_tree_information] > > > > > > > > > > url: https://github.com/intel-lab-lkp/linux/commits/Noah-Goldstein/x86-csum-Remove-unnecessary- > > > > odd-handling/20230921-032450 > > > > > base: tip/x86/core > > > > > patch link: https://lore.kernel.org/r/20230920192300.3772199-1-goldstein.w.n%40gmail.com > > > > > patch subject: x86/csum: Remove unnecessary odd handling > > > > > config: x86_64-randconfig-121-20230921 (https://download.01.org/0day- > > > > ci/archive/20230923/202309231130.ZI5MdlDc-lkp@intel.com/config) > > > > > compiler: gcc-12 (Debian 12.2.0-14) 12.2.0 > > > > > reproduce (this is a W=1 build): (https://download.01.org/0day- > > > > ci/archive/20230923/202309231130.ZI5MdlDc-lkp@intel.com/reproduce) > > > > > > > > > > If you fix the issue in a separate patch/commit (i.e. not just a new version of > > > > > the same patch/commit), kindly add following tags > > > > > | Reported-by: kernel test robot <lkp@intel.com> > > > > > | Closes: https://lore.kernel.org/oe-kbuild-all/202309231130.ZI5MdlDc-lkp@intel.com/ > > > > > > > > > > sparse warnings: (new ones prefixed by >>) > > > > > >> arch/x86/lib/csum-partial_64.c:16:45: sparse: sparse: incorrect type in return expression > > > > (different base types) @@ expected restricted __wsum @@ got unsigned long long @@ > > > > > arch/x86/lib/csum-partial_64.c:16:45: sparse: expected restricted __wsum > > > > > arch/x86/lib/csum-partial_64.c:16:45: sparse: got unsigned long long > > > > > >> arch/x86/lib/csum-partial_64.c:16:45: sparse: sparse: incorrect type in return expression > > > > (different base types) @@ expected restricted __wsum @@ got unsigned long long @@ > > > > > arch/x86/lib/csum-partial_64.c:16:45: sparse: expected restricted __wsum > > > > > arch/x86/lib/csum-partial_64.c:16:45: sparse: got unsigned long long > > > > > > > > > > vim +16 arch/x86/lib/csum-partial_64.c > > > > > > > > > > 13 > > > > > 14 static inline __wsum csum_finalize_sum(u64 temp64) > > > > > 15 { > > > > > > 16 return (temp64 + ror64(temp64, 32)) >> 32; > > > > > 17 } > > > > > 18 > > > > > > > > Just needs a `(__wsum)` cast. Should I post a new patch? > > > > > > It'll need to be a (__force __wsum) cast. > > > > > > I think new patches are expected... > > > > > Thank you, posted V4 that should fix the warning. > > > David > > > > > > - > > > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK > > > Registration No: 1397386 (Wales) > > ping. ping
On 1/4/24 15:28, Noah Goldstein wrote:
>> ping.
> ping

Noah, it would be great to elaborate on what you are pinging about. This thread has a note that you "posted v4", but I don't see a clearly-tagged v4 in my inbox, just a bunch of identically-titled, non-standard-subject[1] emails without any indication of what changed in the individual patches.

1. https://docs.kernel.org/process/submitting-patches.html
On Thu, 4 Jan 2024 at 15:28, Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > ping Bah. I think this keeps falling through the cracks because the networking people consider this an architecture thing, and the x86 people probably consider this a networking thing. Anyway, since I looked at the thing originally, and feel like I know the x86 side and understand the strange IP csum too, I just applied it directly. Linus
On Thu, 4 Jan 2024 at 15:36, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Anyway, since I looked at the thing originally, and feel like I know > the x86 side and understand the strange IP csum too, I just applied it > directly. I ended up just applying my 40-byte cleanup thing too that I've been keeping in my own tree since posting it (as the "Silly csum improvement. Maybe" patch). I've been running it on my own machine since last June, and I finally even enabled the csum KUnit test just to double-check it. Linus
From: Linus Torvalds > Sent: 05 January 2024 00:33 > > On Thu, 4 Jan 2024 at 15:36, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > Anyway, since I looked at the thing originally, and feel like I know > > the x86 side and understand the strange IP csum too, I just applied it > > directly. > > I ended up just applying my 40-byte cleanup thing too that I've been > keeping in my own tree since posting it (as the "Silly csum > improvement. Maybe" patch). Interesting, I'm pretty sure trying to get two blocks of 'adc' scheduled in parallel like that doesn't work. I got an adc every clock from this 'beast': + /* + * Align the byte count to a multiple of 16 then + * add 64 bit words to alternating registers. + * Finally reduce to 64 bits. + */ + asm( " bt $4, %[len]\n" + " jnc 10f\n" + " add (%[buff], %[len]), %[sum_0]\n" + " adc 8(%[buff], %[len]), %[sum_1]\n" + " lea 16(%[len]), %[len]\n" + "10: jecxz 20f\n" + " adc (%[buff], %[len]), %[sum_0]\n" + " adc 8(%[buff], %[len]), %[sum_1]\n" + " lea 32(%[len]), %[len_tmp]\n" + " adc 16(%[buff], %[len]), %[sum_0]\n" + " adc 24(%[buff], %[len]), %[sum_1]\n" + " mov %[len_tmp], %[len]\n" + " jmp 10b\n" + "20: adc %[sum_0], %[sum]\n" + " adc %[sum_1], %[sum]\n" + " adc $0, %[sum]\n" + : [sum] "+&r" (sum), [sum_0] "+&r" (sum_0), [sum_1] "+&r" (sum_1), + [len] "+&c" (len), [len_tmp] "=&r" (len_tmp) + : [buff] "r" (buff) + : "memory" ); Followed by code to sort out and trailing 15 bytes. Intel cpu (from P-II until Broadwell 5th-gen) take two clocks for 'adc' (probably because it needs 3 inputs). So 'adc' chains ran a lot slower than you might think. (Clearly no one ever actually benchmarked the old code!) The first fix made the carry output available early - so adding to alternate registers helps. IIRC this is in Ivy/Sandy bridge. Maybe no one cares about Ivy/Sandy bridge and Haswell any more. AMD cpu don't have this problem. I'm pretty sure I measured that loop with a misaligned buffer. Measurably slower, but less than one clock per cache line. I guess that the cache-line crossing reads get split, but you gain most back because the cpu can do two reads/clock. Maybe I'll sort out another patch... I did get 15/16 bytes/clock with a similar loop that used adox/adcx but that needed unrolling again and only works on a few cpu. IIRC amd have some cpu that support adox - but execute it slowly! Annoyingly you can't use 'loop' even on cpu that support adox because it is stupidly slow on intel cpu (ok on amd). That version is a lot of pain since it needs run-time patching. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
From: David Laight > Sent: 05 January 2024 10:41 > > From: Linus Torvalds > > Sent: 05 January 2024 00:33 > > > > On Thu, 4 Jan 2024 at 15:36, Linus Torvalds > > <torvalds@linux-foundation.org> wrote: > > > > > > Anyway, since I looked at the thing originally, and feel like I know > > > the x86 side and understand the strange IP csum too, I just applied it > > > directly. > > > > I ended up just applying my 40-byte cleanup thing too that I've been > > keeping in my own tree since posting it (as the "Silly csum > > improvement. Maybe" patch). > > Interesting, I'm pretty sure trying to get two blocks of > 'adc' scheduled in parallel like that doesn't work. > > I got an adc every clock from this 'beast': > + /* > + * Align the byte count to a multiple of 16 then > + * add 64 bit words to alternating registers. > + * Finally reduce to 64 bits. > + */ > + asm( " bt $4, %[len]\n" > + " jnc 10f\n" > + " add (%[buff], %[len]), %[sum_0]\n" > + " adc 8(%[buff], %[len]), %[sum_1]\n" > + " lea 16(%[len]), %[len]\n" > + "10: jecxz 20f\n" > + " adc (%[buff], %[len]), %[sum_0]\n" > + " adc 8(%[buff], %[len]), %[sum_1]\n" > + " lea 32(%[len]), %[len_tmp]\n" > + " adc 16(%[buff], %[len]), %[sum_0]\n" > + " adc 24(%[buff], %[len]), %[sum_1]\n" > + " mov %[len_tmp], %[len]\n" > + " jmp 10b\n" > + "20: adc %[sum_0], %[sum]\n" > + " adc %[sum_1], %[sum]\n" > + " adc $0, %[sum]\n" > + : [sum] "+&r" (sum), [sum_0] "+&r" (sum_0), [sum_1] "+&r" (sum_1), > + [len] "+&c" (len), [len_tmp] "=&r" (len_tmp) > + : [buff] "r" (buff) > + : "memory" ); I've got far too many x86 checksum functions lying around. Actually you don't need all that. Anything recent (probably Broadwell on) will execute: "10: jecxz 20f\n" " adc (%[buff], %[len]), %[sum]\n" " adc 8(%[buff], %[len]), %[sum]\n" " lea 16(%[len]), %[len]\n" " jmp 10b\n" "20: adc $0, %[sum]\n" in two clocks per iteration - 8 bytes/clock. Since it is trivial to handle 8n+4 buffers (eg as above) that only leaves the C code to handle the final 0-7 bytes. > Maybe I'll sort out another patch... Probably after the next rc1 is out. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
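To illustrate the "final 0-7 bytes" handling David refers to, a little-endian sketch is shown below; the helper name is made up and this is not code from the thread or the kernel.

	/* Fold a 0-7 byte tail into a 64-bit one's-complement accumulator. */
	static inline u64 csum_add_tail(u64 sum, const void *tail, int len)
	{
		u64 word = 0;

		memcpy(&word, tail, len);	/* len < 8; upper bytes stay zero */
		asm("addq %[w], %[s]\n\t"
		    "adcq $0, %[s]"		/* end-around carry */
		    : [s] "+r" (sum)
		    : [w] "r" (word));
		return sum;
	}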
On Fri, 5 Jan 2024 at 02:41, David Laight <David.Laight@aculab.com> wrote: > > Interesting, I'm pretty sure trying to get two blocks of > 'adc' scheduled in parallel like that doesn't work. You should check out the benchmark at https://github.com/fenrus75/csum_partial and see if you can improve on it. I'm including the patch (on top of that code by Arjan) to implement the actual current kernel version as "New version". Linus
From: Linus Torvalds
> Sent: 05 January 2024 18:06
>
> On Fri, 5 Jan 2024 at 02:41, David Laight <David.Laight@aculab.com> wrote:
> >
> > Interesting, I'm pretty sure trying to get two blocks of
> > 'adc' scheduled in parallel like that doesn't work.
>
> You should check out the benchmark at
>
> https://github.com/fenrus75/csum_partial
>
> and see if you can improve on it. I'm including the patch (on top of
> that code by Arjan) to implement the actual current kernel version as
> "New version".

I'd have to fix his benchmark code first :-)
You can't use the TSC unless you lock the cpu frequency.
The longer the test runs for, the faster the cpu will run.
I've had to use the performance counters to get accurate cycle counts.

I'm also sure I tried two separate 'adc' chains and failed to get any
overlapped instructions. But I'll try his loop in my framework.

The 'lfence' version also really does matter.
In some sense it doesn't matter if you add 10 clocks to the IP header
checksum - they'll be dwarfed by everything else. But if you do have to
checksum an 8k NFS UDP datagram the loop count matters.
The OOO engine is very good at just piling up instructions that are
waiting for previous instructions in a big queue and executing
instructions for which the data is available. So the control flow for
the checksum code can finish well before the checksum is available.

On a related point, do you remember what the 'killer app' was for doing
the checksum in copy_to/from_user? The change pre-dates git and I can't
find the commit message.

The only place it is done in the receive path is for UDP. At a guess
that helped 8k UDP datagrams going into a userspace nfsd? I bet nothing
really depends on RX UDP performance to userspace on systems that don't
do hw checksum?

The tx side is done copying data into an skb; I'm not sure if that is
all copies - you really don't want to do it if the hardware supports tx
checksum offload. Using 'rep movsb' for copy_from_user() has to be
faster than the I-cache busting 'copy and checksum' code - even if you
end up doing a checksum not much later on.

IIRC (I looked a couple of weeks ago) only x86, sparc32 and m68k
actually have 'copy and checksum' (and the m68k didn't seem to have the
markers for faults!). (All of them are I-cache killers)

The default iovec[] version checksums every fragment - which gives
grief trying to report a 32bit checksum and EFAULT. OTOH it could sum
the linear kernel buffer after the copy. That would let it fold the
checksum to 17 bits and report -1 for EFAULT.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Fri, 5 Jan 2024 at 15:53, David Laight <David.Laight@aculab.com> wrote:
>
> I'd have to fix his benchmark code first :-)
> You can't use the TSC unless you lock the cpu frequency.
> The longer the test runs for the faster the cpu will run.

They'll stabilize, it has some cycle result aging code.

But yes, set the CPU policy to 'performance' or do performance
counters if you care deeply.

> On a related point, do you remember what the 'killer app'
> was for doing the checksum in copy_to/from_user?

No. It's a long time ago, and many things have changed since.

It's possible that copy-and-csum isn't worth it any more, simply
because all modern network cards will do the csum for us, and I think
loopback sets a flag saying "no need to checksum" too.

But I do have a strong memory of it being a big deal back when. A
_loong_ time ago.

               Linus
On Sat, Jan 6, 2024 at 1:19 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Fri, 5 Jan 2024 at 15:53, David Laight <David.Laight@aculab.com> wrote: > > > > I'd have to fix his benchmark code first :-) > > You can't use the TSC unless you lock the cpu frequency. > > The longer the test runs for the faster the cpu will run. > > They'll stabilize, it has soem cycle result aging code. > > But yes, set the CPU policy to 'performance' or do performance > counters if you care deeply. > > > On a related point, do you remember what the 'killer app' > > was for doing the checksum in copy_to/from_user? > > No. It's a long time ago, and many things have changed since. > > It's possible the copy-and-csum it's not worth it any more, simply > because all modern network cards will do the csum for us, and I think > loopback sets a flag saying "no need to checksum" too. > > But I do have a strong memory of it being a big deal back when. A > _loong_ time ago. > On a related note, at least with clang, I found that csum_ipv6_magic() is needlessly using temporary on-stack storage, showing a stall on Cascade Lake unless I am patching add32_with_carry() : diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h index 407beebadaf45a748f91a36b78bd1d023449b132..c3d6f47626c70d81f0c2ba401d85050b09a39922 100644 --- a/arch/x86/include/asm/checksum_64.h +++ b/arch/x86/include/asm/checksum_64.h @@ -171,7 +171,7 @@ static inline unsigned add32_with_carry(unsigned a, unsigned b) asm("addl %2,%0\n\t" "adcl $0,%0" : "=r" (a) - : "0" (a), "rm" (b)); + : "0" (a), "r" (b)); return a; } Before patch : ffffffff81b9b600 <csum_ipv6_magic>: ffffffff81b9b600: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) ffffffff81b9b601: R_X86_64_NONE __fentry__-0x4 ffffffff81b9b605: 55 push %rbp ffffffff81b9b606: 48 89 e5 mov %rsp,%rbp ffffffff81b9b609: 48 83 ec 04 sub $0x4,%rsp // This will be removed after patch ffffffff81b9b60d: 0f ca bswap %edx ffffffff81b9b60f: 89 c9 mov %ecx,%ecx ffffffff81b9b611: 48 c1 e1 08 shl $0x8,%rcx ffffffff81b9b615: 44 89 c0 mov %r8d,%eax ffffffff81b9b618: 48 01 d0 add %rdx,%rax ffffffff81b9b61b: 48 01 c8 add %rcx,%rax ffffffff81b9b61e: 48 03 07 add (%rdi),%rax ffffffff81b9b621: 48 13 47 08 adc 0x8(%rdi),%rax ffffffff81b9b625: 48 13 06 adc (%rsi),%rax ffffffff81b9b628: 48 13 46 08 adc 0x8(%rsi),%rax ffffffff81b9b62c: 48 83 d0 00 adc $0x0,%rax ffffffff81b9b630: 48 89 c1 mov %rax,%rcx ffffffff81b9b633: 48 c1 e9 20 shr $0x20,%rcx ffffffff81b9b637: 89 4d fc mov %ecx,-0x4(%rbp) // Store exc on the stack ffffffff81b9b63a: 03 45 fc add -0x4(%rbp),%eax // reads the value right away, stalling some Intel cpus. 
ffffffff81b9b63d: 83 d0 00 adc $0x0,%eax ffffffff81b9b640: 89 c1 mov %eax,%ecx // Note that ecs content was scratch anyway ffffffff81b9b642: c1 e1 10 shl $0x10,%ecx ffffffff81b9b645: 25 00 00 ff ff and $0xffff0000,%eax ffffffff81b9b64a: 01 c8 add %ecx,%eax ffffffff81b9b64c: 15 ff ff 00 00 adc $0xffff,%eax ffffffff81b9b651: f7 d0 not %eax ffffffff81b9b653: c1 e8 10 shr $0x10,%eax ffffffff81b9b656: 48 83 c4 04 add $0x4,%rsp ffffffff81b9b65a: 5d pop %rbp ffffffff81b9b65b: c3 ret ffffffff81b9b65c: cc int3 ffffffff81b9b65d: 00 00 add %al,(%rax) ffffffff81b9b65f: cc int3 After patch: 00000000000000a0 <csum_ipv6_magic>: a0: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) a1: R_X86_64_NONE __fentry__-0x4 a5: 55 push %rbp a6: 48 89 e5 mov %rsp,%rbp a9: 0f ca bswap %edx ab: 89 c9 mov %ecx,%ecx ad: 48 c1 e1 08 shl $0x8,%rcx b1: 44 89 c0 mov %r8d,%eax b4: 48 01 d0 add %rdx,%rax b7: 48 01 c8 add %rcx,%rax ba: 48 03 07 add (%rdi),%rax bd: 48 13 47 08 adc 0x8(%rdi),%rax c1: 48 13 06 adc (%rsi),%rax c4: 48 13 46 08 adc 0x8(%rsi),%rax c8: 48 83 d0 00 adc $0x0,%rax cc: 48 89 c1 mov %rax,%rcx cf: 48 c1 e9 20 shr $0x20,%rcx d3: 01 c8 add %ecx,%eax // just better ! d5: 83 d0 00 adc $0x0,%eax d8: 89 c1 mov %eax,%ecx da: c1 e1 10 shl $0x10,%ecx dd: 25 00 00 ff ff and $0xffff0000,%eax e2: 01 c8 add %ecx,%eax e4: 15 ff ff 00 00 adc $0xffff,%eax e9: f7 d0 not %eax eb: c1 e8 10 shr $0x10,%eax ee: 5d pop %rbp ef: c3 ret f0: cc int3
On Sat, 6 Jan 2024 at 02:26, Eric Dumazet <edumazet@google.com> wrote: > > On a related note, at least with clang, I found that csum_ipv6_magic() > is needlessly using temporary on-stack storage, > showing a stall on Cascade Lake unless I am patching add32_with_carry() : This is a known compiler bug in clang: https://github.com/llvm/llvm-project/issues/20571 https://github.com/llvm/llvm-project/issues/30873 https://github.com/llvm/llvm-project/issues/34837 and while we could always just use "r" for constraints, it will (a) possibly run out of registers in some situations (b) pessimize gcc that does this right Can you please push the clang people to not do the stupid thing they do now, which is to say "oh, I can use registers _or_ memory, so I'll always use memory". Btw, it's actually much worse than that, and clang is just doing incredibly broken things. Switching to "r" just hides the garbage. Because sometimes the source really *is* memory, ie we have net/ipv4/udp.c: csum = csum_add(csum, frags->csum); and then it's criminally stupid to load it into a register when it can be just used directly. But clang says "criminally stupid? *I* will show you stupid!" - because what *clang* does with the above is this thing of beauty: movl 136(%rax), %edi movl %edi, 28(%rsp) addl 28(%rsp), %ecx adcl $0, %ecx which has turned from "criminally stupid" to "it's art, I tell you - you're not supposed to understand it". Anyway, this is *literally* a clang bug. Complain to clang people for being *extra* stupid - we told the compiler that it can use a register or memory, and clang decided "I'll use memory", so then when we gave it a memory location, it said "no, not *that* memory - I'll just reload it on stack". In contrast, gcc gets this right - and for that udp.c case it just generates addl 136(%rax),%ecx # frags_67->D.58941.D.58869.D.58836.csum, a adcl $0,%ecx # a like it should. And for csum_ipv6_magic, gcc generates this: addl %edx,%eax # tmp112, a adcl $0,%eax # a IOW, the kernel is *right*, and this is purely clang being completely bogus. I really don't want to penalize a good compiler because a bad one can't get its act together. Linus
From: Linus Torvalds > Sent: 05 January 2024 18:06 > > On Fri, 5 Jan 2024 at 02:41, David Laight <David.Laight@aculab.com> wrote: > > > > Interesting, I'm pretty sure trying to get two blocks of > > 'adc' scheduled in parallel like that doesn't work. > > You should check out the benchmark at > > https://github.com/fenrus75/csum_partial > > and see if you can improve on it. I'm including the patch (on top of > that code by Arjan) to implement the actual current kernel version as > "New version". Annoyingly (for me) you are partially right... I found where my ip checksum perf code was hiding and revisited it. Although I found comments elsewhere that the 'jecxz, adc, adc, lea, jmp' did an adc every clock it isn't happening for me now. I'm only measuring the inner loop for multiples of 64 bytes. The code less than 8 bytes and partial final words is a separate problem. The less unrolled the main loop, the less overhead there'll be for 'normal' sizes. So I've changed your '80 byte' block to 64 bytes for consistency. I'm ignoring pre-sandy bridge cpu (no split flags) and pre-broadwell (adc takes two clocks - although adc to alternate regs is one clock on sandy bridge). My test system is an i7-7700, I think anything from broadwell (gen 4) will be at least as good. I don't have a modern amd cpu. The best loop for 256+ bytes is an adxc/adxo one. However that requires the run-time patching. Followed by new kernel version (two blocks of 4 adc). The surprising one is: xor sum, sum 1: adc (buff), sum adc 8(buff), sum lea 16(buff), buff dec count jnz 1b adc $0, sum For 256 bytes it is only a couple of clocks slower. Maybe 10% slower for 512+ bytes. But it need almost no extra code for 'normal' buffer sizes. By comparison the adxc/adxo one is 20% faster. The code is doing: old = rdpmc mfence csum = do_csum(buf, len); mfence clocks = rdpmc - old (That is directly reading the pmc register.) With 'no-op' function it takes 160 clocks (I-cache resident). Without the mfence 40 - but pretty much everything can execute after the 2nd rdpmc. I've attached my (horrid) test program. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <errno.h> #include <linux/perf_event.h> #include <sys/mman.h> #include <sys/syscall.h> #ifndef PASSES #define PASSES 10 #endif static int init_pmc(void) { static struct perf_event_attr perf_attr = { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_CPU_CYCLES, .pinned = 1, }; struct perf_event_mmap_page *pc; int perf_fd; perf_fd = syscall(__NR_perf_event_open, &perf_attr, 0, -1, -1, 0); if (perf_fd < 0) { fprintf(stderr, "perf_event_open failed: errno %d\n", errno); exit(1); } pc = mmap(NULL, 4096, PROT_READ, MAP_SHARED, perf_fd, 0); if (pc == MAP_FAILED) { fprintf(stderr, "perf_event mmap() failed: errno %d\n", errno); exit(1); } return pc->index - 1; } static inline unsigned int rdpmc(unsigned int counter) { unsigned int low, high; #ifndef NOFENCE asm volatile("mfence"); #endif asm volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (counter)); #ifndef NOFENCE asm volatile("mfence"); #endif // return low bits, counter might to 32 or 40 bits wide. return low; } __attribute__((noinline)) unsigned long overhead(const unsigned char *buff, unsigned long len) { return 0; } // Existing kernel loop. // Simple 'adc' loop, abs max 8 bytes/clock. // Will be very slow on old cpu (Pre Ivy bridge) cpu where 'dec' // has a false dependeny against the carry flag. 
// And also pre-broadwell where the adc take two clocks. __attribute__((noinline)) unsigned long adc_dec_8(const unsigned char *buff, unsigned long len) { unsigned long csum = 0; long count = len/64; asm("1: addq 0(%[buff]), %[csum]\n" " adcq 8(%[buff]), %[csum]\n" " adcq 16(%[buff]), %[csum]\n" " adcq 24(%[buff]), %[csum]\n" " adcq 32(%[buff]), %[csum]\n" " adcq 40(%[buff]), %[csum]\n" " adcq 48(%[buff]), %[csum]\n" " adcq 56(%[buff]), %[csum]\n" " adcq $0, %[csum]\n" " lea 64(%[buff]), %[buff]\n" " dec %[count]\n" " jnz 1b" : [csum] "+&r" (csum), [len] "+&r" (len), [count] "+&r" (count) : [buff] "r" (buff) : "memory"); return csum; } // As above but only 4 adc __attribute__((noinline)) unsigned long adc_dec_4(const unsigned char *buff, unsigned long len) { unsigned long csum = 0; long count = len/32; asm(" clc\n" "1: adcq (%[buff]), %[csum]\n" " adcq 8(%[buff]), %[csum]\n" " adcq 16(%[buff]), %[csum]\n" " adcq 24(%[buff]), %[csum]\n" " lea 32(%[buff]), %[buff]\n" " dec %[count]\n" " jnz 1b\n" " adcq $0, %[csum]\n" : [csum] "+&r" (csum), [len] "+&r" (len), [count] "+&r" (count) : [buff] "r" (buff) : "memory"); return csum; } // As above but only 2 adc __attribute__((noinline)) unsigned long adc_dec_2(const unsigned char *buff, unsigned long len) { unsigned long csum = 0; long count = len/16; asm(" clc\n" "1: adcq (%[buff]), %[csum]\n" " adcq 8(%[buff]), %[csum]\n" " lea 16(%[buff]), %[buff]\n" " dec %[count]\n" " jnz 1b\n" " adcq $0, %[csum]\n" : [csum] "+&r" (csum), [len] "+&r" (len), [count] "+&r" (count) : [buff] "r" (buff) : "memory"); return csum; } // Alternate 'adc' loop that avoids using the zero flag. __attribute__((noinline)) unsigned long adc_jcxz_4(const unsigned char *buff, unsigned long len) { unsigned long csum = 0; buff += len; len = -len; asm(" clc\n" "1: jrcxz 2f\n" " adcq 0(%[buff], %[len]), %[csum]\n" " adcq 8(%[buff], %[len]), %[csum]\n" " adcq 16(%[buff], %[len]), %[csum]\n" " adcq 24(%[buff], %[len]), %[csum]\n" " lea 32(%[len]), %[len]\n" " jmp 1b\n" "2: adcq $0, %[csum]\n" : [csum] "+&r" (csum), [len] "+&c" (len) : [buff] "r" (buff) : "memory"); return csum; } // Shorter alternate 'adc' loop that avoids using the zero flag. __attribute__((noinline)) unsigned long adc_jcxz_2(const unsigned char *buff, unsigned long len) { unsigned long csum = 0; buff += len; len = -len; asm(" clc\n" "1: jrcxz 2f\n" " adcq 0(%[buff], %[len]), %[csum]\n" " adcq 8(%[buff], %[len]), %[csum]\n" " lea 16(%[len]), %[len]\n" " jmp 1b\n" "2: adcq $0, %[csum]\n" : [csum] "+&r" (csum), [len] "+&c" (len) : [buff] "r" (buff) : "memory"); return csum; } // Proposed kernel loop. // This splits the adc chain in two. __attribute__((noinline)) unsigned long adc_4_pair(const unsigned char *buff, unsigned long len) { unsigned long csum = 0; long count = len/64; unsigned long csum1; asm("1: movq 0(%[buff]), %[csum1]\n" " adcq 8(%[buff]), %[csum1]\n" " adcq 16(%[buff]), %[csum1]\n" " adcq 24(%[buff]), %[csum1]\n" " adcq 32(%[buff]), %[csum1]\n" " adcq $0, %[csum1]\n" " lea 64(%[buff]), %[buff]\n" " addq 40-64(%[buff]), %[csum]\n" " adcq 48-64(%[buff]), %[csum]\n" " adcq 56-64(%[buff]), %[csum]\n" " adcq %[csum1], %[csum]\n" " adcq $0, %[csum]\n" " dec %[count]\n" " jnz 1b" : [csum] "+&r" (csum), [csum1] "=&r" (csum1), [len] "+&r" (len), [count] "+&r" (count) : [buff] "r" (buff) : "memory"); return csum; } // Less unrolled version of previous. 
__attribute__((noinline)) unsigned long adc_2_pair(const unsigned char *buff, unsigned long len) { unsigned long csum = 0; long count = len; unsigned long csum1; asm("1: movq 0(%[buff]), %[csum1]\n" " adcq 8(%[buff]), %[csum1]\n" " adcq $0, %[csum1]\n" " lea 32(%[buff]), %[buff]\n" " addq 16-32(%[buff]), %[csum]\n" " adcq 24-32(%[buff]), %[csum]\n" " adcq %[csum1], %[csum]\n" " adcq $0, %[csum]\n" " sub $32, %[count]\n" " jnz 1b" : [csum] "+&r" (csum), [csum1] "=&r" (csum1), [len] "+&r" (len), [count] "+&r" (count) : [buff] "r" (buff) : "memory"); return csum; } // adxc/adxo loop. // This is the only loop that exceeds 8 bytes/clock for 1k buffers. __attribute__((noinline)) unsigned long adcx_adox(const unsigned char *buff, unsigned long len) { unsigned long csum = 0; unsigned long csum_odd; buff += len; len = -len; asm( " xor %[csum_odd], %[csum_odd]\n" // Also clears carry and overflow "10: jrcxz 20f\n" " adcx (%[buff], %[len]), %[csum]\n" " adox 8(%[buff], %[len]), %[csum_odd]\n" " adcx 16(%[buff], %[len]), %[csum]\n" " adox 24(%[buff], %[len]), %[csum_odd]\n" " adcx 32(%[buff], %[len]), %[csum]\n" " adox 40(%[buff], %[len]), %[csum_odd]\n" " adcx 48(%[buff], %[len]), %[csum]\n" " adox 56(%[buff], %[len]), %[csum_odd]\n" " lea 64(%[len]), %[len]\n" " jmp 10b\n" "20: adox %[len], %[csum_odd]\n" // [len] is zero " adcx %[csum_odd], %[csum]\n" " adcx %[len], %[csum]" : [csum] "+&r" (csum), [len] "+&c" (len), [csum_odd] "=&r" (csum_odd) : [buff] "r" (buff) : "memory"); return csum; } #define OPTIMISER_HIDE_VAR(x) asm volatile( "": "+r" (x)) // In principe the cpu can do 2 reads every clock. // So combining with adds to different registers it should // be possible to get this C code to run near to 8 bytes/clock. // But the compiler doesn't want to 'play ball'. // The OPTIMISER_HIDE_VAR() helps a bit __attribute__((noinline)) unsigned long add_c_16(const unsigned char *buff, unsigned long len) { unsigned long csum_odd = 0; unsigned long csum = 0; buff += len; len = -len; do { csum += *(unsigned int *)(buff + len + 0); csum_odd += *(unsigned int *)(buff + len + 4); OPTIMISER_HIDE_VAR(csum); OPTIMISER_HIDE_VAR(csum_odd); csum += *(unsigned int *)(buff + len + 8); csum_odd += *(unsigned int *)(buff + len + 12); OPTIMISER_HIDE_VAR(csum); OPTIMISER_HIDE_VAR(csum_odd); } while (len += 16); csum += csum_odd; return csum; } // As above but unrolled further __attribute__((noinline)) unsigned long add_c_32(const unsigned char *buff, unsigned long len) { unsigned long csum_odd = 0; unsigned long csum = 0; buff += len; len = -len; do { csum += *(unsigned int *)(buff + len + 0); csum_odd += *(unsigned int *)(buff + len + 4); OPTIMISER_HIDE_VAR(csum); OPTIMISER_HIDE_VAR(csum_odd); csum += *(unsigned int *)(buff + len + 8); csum_odd += *(unsigned int *)(buff + len + 12); OPTIMISER_HIDE_VAR(csum); OPTIMISER_HIDE_VAR(csum_odd); csum += *(unsigned int *)(buff + len + 16); csum_odd += *(unsigned int *)(buff + len + 20); OPTIMISER_HIDE_VAR(csum); OPTIMISER_HIDE_VAR(csum_odd); csum += *(unsigned int *)(buff + len + 24); csum_odd += *(unsigned int *)(buff + len + 28); OPTIMISER_HIDE_VAR(csum); OPTIMISER_HIDE_VAR(csum_odd); } while (len += 32); csum += csum_odd; return csum; } // Alternative C loop that sums 'val << 32' to get the carries. 
// Might be better on arm64 than x86 __attribute__((noinline)) unsigned long add_c_high(const unsigned char *buff, unsigned long len) { unsigned long sum_hi_odd = 0, sum_lo_odd = 0; unsigned long sum_hi = 0, sum_lo = 0; unsigned long val; unsigned int i; buff += len; len = -len; do { val = *(unsigned long *)(buff + len + 0); sum_lo += val; sum_hi += val >> 32; val = *(unsigned long *)(buff + len + 8); sum_lo_odd += val; sum_hi_odd += val >> 32; } while (len += 16); OPTIMISER_HIDE_VAR(sum_lo); OPTIMISER_HIDE_VAR(sum_hi); sum_lo += sum_lo_odd; sum_hi += sum_hi_odd; sum_hi += (sum_lo >> 32) - (sum_hi & 0xffffffffu); return sum_hi + (sum_lo & 0xffffffff); } unsigned char buffer[8192] __attribute__((aligned(4096))) = { 0x46,0x56,0x20,0x04,0x10,0x02,0x50,0x07,0x72,0x4d,0xc6,0x3d,0x31,0x85,0x2d,0xbd, 0xe2,0xe0,0x9d,0x3e,0x3b,0x7a,0x70,0x3d,0xd2,0xfb,0x8c,0xbf,0x95,0x10,0xa9,0xbe, 0xeb,0xfd,0x29,0x40,0xd5,0x7a,0x61,0x40,0xde,0xcd,0x14,0xbf,0x81,0x1b,0xf6,0x3f, 0xbc,0xff,0x17,0x3f,0x67,0x1c,0x6e,0xbe,0xf4,0xc2,0x05,0x40,0x0b,0x13,0x78,0x3f, 0xfe,0x47,0xa7,0xbd,0x59,0xc2,0x15,0x3f,0x07,0xd0,0xea,0xbf,0x97,0xf1,0x3c,0x3f, 0xcc,0xfa,0x6b,0x40,0x72,0x6a,0x4f,0xbe,0x0b,0xe3,0x75,0x3e,0x3c,0x9b,0x0e,0xbf, 0xa9,0xeb,0xb7,0x3f,0xeb,0x4a,0xec,0x3e,0x33,0x8c,0x0c,0x3f,0x6a,0xf2,0xf3,0x3e, 0x2b,0x45,0x86,0x3f,0x83,0xce,0x8a,0x3f,0xf6,0x01,0x16,0x40,0x9c,0x17,0x47,0x3e, 0x44,0x83,0x61,0x40,0x74,0xc7,0x5c,0x3f,0xec,0xe7,0x95,0x3f,0xee,0x19,0xb5,0xbf, 0xb5,0xf0,0x03,0xbf,0xd1,0x02,0x1c,0x3e,0xa3,0x55,0x90,0xbe,0x1e,0x0b,0xa1,0xbf, 0xa4,0xa8,0xb4,0x3f,0xc6,0x68,0x91,0x3f,0xd1,0xc5,0xab,0x3f,0xb9,0x14,0x62,0x3f, 0x7c,0xe0,0xb9,0xbf,0xc0,0xa4,0xb5,0x3d,0x6f,0xd9,0xa7,0x3f,0x8f,0xc4,0xb0,0x3d, 0x48,0x2c,0x7a,0x3e,0x83,0xb2,0x3c,0x40,0x36,0xd3,0x18,0x40,0xb7,0xa9,0x57,0x40, 0xda,0xd3,0x95,0x3f,0x74,0x95,0xc0,0xbe,0xbb,0xce,0x71,0x3e,0x95,0xec,0x18,0xbf, 0x94,0x17,0xdd,0x3f,0x98,0xa5,0x02,0x3f,0xbb,0xfb,0xbb,0x3e,0xd0,0x5a,0x9c,0x3f, 0xd4,0x70,0x9b,0xbf,0x3b,0x9f,0x20,0xc0,0x84,0x5b,0x0f,0x40,0x5e,0x48,0x2c,0xbf, }; static __attribute__((noinline)) void print_csum(const char *msg, unsigned long csum, unsigned int *ticks) { static unsigned int offset; unsigned int i; csum = (csum & 0xffffffff) + (csum >> 32); csum = (csum & 0xffffffff) + (csum >> 32); csum = (csum & 0xffff) + (csum >> 16); csum = (csum & 0xffff) + (csum >> 16); printf(" %4lx", csum); for (i = 0; i < PASSES; i++) printf(" %5d", ticks[i] - offset); printf(" %s\n", msg); if (!offset) { offset = ticks[0]; for (i = 1; i < PASSES; i++) if (offset > ticks[i]) offset = ticks[i]; } } static inline __attribute__((always_inline)) void do_csum(int pmc_id, const char *msg, unsigned long (*fn)(const unsigned char *, unsigned long), const unsigned char *buf, unsigned long len) { unsigned int ticks[PASSES]; unsigned long csum; unsigned int tick; unsigned int i; for (i = 0; i < PASSES; i++) { tick = rdpmc(pmc_id); csum = fn(buf, len); ticks[i] = rdpmc(pmc_id) - tick; } print_csum(msg, csum, ticks); } int main(int argc, char **argv) { unsigned long csum; unsigned int len, off; unsigned int pmc_id = init_pmc(); len = 128; off = 0; for (;;) { switch (getopt(argc, argv, "l:o:")) { case -1: break; default: exit(1); case 'l': len = atoi(optarg); continue; case 'o': off = atoi(optarg); continue; } break; } /* Solving partial blocks is a different problem... 
*/ len = (len + 63) & ~63; if (!len) len = 64; do_csum(pmc_id, "overhead", overhead, buffer + off, len); do_csum(pmc_id, "adc_dec_2", adc_dec_2, buffer + off, len); do_csum(pmc_id, "adc_dec_4", adc_dec_4, buffer + off, len); do_csum(pmc_id, "adc_dec_8", adc_dec_8, buffer + off, len); do_csum(pmc_id, "adc_jcxz_2", adc_jcxz_2, buffer + off, len); do_csum(pmc_id, "adc_jcxz_4", adc_jcxz_4, buffer + off, len); do_csum(pmc_id, "adc_2_pair", adc_2_pair, buffer + off, len); do_csum(pmc_id, "adc_4_pair", adc_4_pair, buffer + off, len); do_csum(pmc_id, "adcx_adox", adcx_adox, buffer + off, len); do_csum(pmc_id, "add_c_16", add_c_16, buffer + off, len); do_csum(pmc_id, "add_c_32", add_c_32, buffer + off, len); do_csum(pmc_id, "add_c_high", add_c_high, buffer + off, len); return 0; }
On January 6, 2024 2:08:48 PM PST, David Laight <David.Laight@ACULAB.COM> wrote: >From: Linus Torvalds >> Sent: 05 January 2024 18:06 >> >> On Fri, 5 Jan 2024 at 02:41, David Laight <David.Laight@aculab.com> wrote: >> > >> > Interesting, I'm pretty sure trying to get two blocks of >> > 'adc' scheduled in parallel like that doesn't work. >> >> You should check out the benchmark at >> >> https://github.com/fenrus75/csum_partial >> >> and see if you can improve on it. I'm including the patch (on top of >> that code by Arjan) to implement the actual current kernel version as >> "New version". > >Annoyingly (for me) you are partially right... > >I found where my ip checksum perf code was hiding and revisited it. >Although I found comments elsewhere that the 'jecxz, adc, adc, lea, jmp' >did an adc every clock it isn't happening for me now. > >I'm only measuring the inner loop for multiples of 64 bytes. >The code less than 8 bytes and partial final words is a >separate problem. >The less unrolled the main loop, the less overhead there'll >be for 'normal' sizes. >So I've changed your '80 byte' block to 64 bytes for consistency. > >I'm ignoring pre-sandy bridge cpu (no split flags) and pre-broadwell >(adc takes two clocks - although adc to alternate regs is one clock >on sandy bridge). >My test system is an i7-7700, I think anything from broadwell (gen 4) >will be at least as good. >I don't have a modern amd cpu. > >The best loop for 256+ bytes is an adxc/adxo one. >However that requires the run-time patching. >Followed by new kernel version (two blocks of 4 adc). >The surprising one is: > xor sum, sum > 1: adc (buff), sum > adc 8(buff), sum > lea 16(buff), buff > dec count > jnz 1b > adc $0, sum >For 256 bytes it is only a couple of clocks slower. >Maybe 10% slower for 512+ bytes. >But it need almost no extra code for 'normal' buffer sizes. >By comparison the adxc/adxo one is 20% faster. > >The code is doing: > old = rdpmc > mfence > csum = do_csum(buf, len); > mfence > clocks = rdpmc - old >(That is directly reading the pmc register.) >With 'no-op' function it takes 160 clocks (I-cache resident). >Without the mfence 40 - but pretty much everything can execute >after the 2nd rdpmc. > >I've attached my (horrid) test program. > > David > >- >Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK >Registration No: 1397386 (Wales) Rather than runtime patching perhaps separate paths...
From: H. Peter Anvin > Sent: 07 January 2024 01:09 > > On January 6, 2024 2:08:48 PM PST, David Laight <David.Laight@ACULAB.COM> wrote: ... > >The best loop for 256+ bytes is an adxc/adxo one. > >However that requires the run-time patching. ... > Rather than runtime patching perhaps separate paths... It will need to detect the cpu type earlier, so a static branch is probably enough. Easier than substituting the entire code block. I think it is silvermont and knight's landing that have a 4 clock penalty for 64bit adxc (Intel atom family). That might only be a decode penalty, so doesn't affect the loop 'that much' (adc is 2 clocks on those cpu). So probably not actually worth doing a run-time performance check. I might 'cook up' a full checksum function later. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
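For illustration, the kind of dispatch being discussed could look like the sketch below; csum_partial_adx() and csum_partial_adc() are hypothetical names, not functions from this thread or the kernel.

	__wsum csum_partial(const void *buff, int len, __wsum sum)
	{
		/*
		 * static_cpu_has() patches the branch at boot, so the test is
		 * essentially free on the hot path.
		 */
		if (static_cpu_has(X86_FEATURE_ADX))
			return csum_partial_adx(buff, len, sum);  /* adcx/adox loop */
		return csum_partial_adc(buff, len, sum);          /* plain adc loop */
	}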
From: Eric Dumazet > Sent: 06 January 2024 10:26 ... > On a related note, at least with clang, I found that csum_ipv6_magic() > is needlessly using temporary on-stack storage, > showing a stall on Cascade Lake unless I am patching add32_with_carry() : > > diff --git a/arch/x86/include/asm/checksum_64.h > b/arch/x86/include/asm/checksum_64.h > index 407beebadaf45a748f91a36b78bd1d023449b132..c3d6f47626c70d81f0c2ba401d85050b09a39922 > 100644 > --- a/arch/x86/include/asm/checksum_64.h > +++ b/arch/x86/include/asm/checksum_64.h > @@ -171,7 +171,7 @@ static inline unsigned add32_with_carry(unsigned > a, unsigned b) > asm("addl %2,%0\n\t" > "adcl $0,%0" > : "=r" (a) > - : "0" (a), "rm" (b)); > + : "0" (a), "r" (b)); > return a; > } Try replacing: return csum_fold( (__force __wsum)add32_with_carry(sum64 & 0xffffffff, sum64>>32)); with: return csum_fold((__force __wsum)((sum64 + ror64(sum64, 32)) >> 32)); Should be less instructions as well. (shift, add, shift v shift, mov, and, add, add) Although both might be 3 clocks. The best C version of csum_fold (from IIRC arc) is also likely to be better than the x86 asm one - certainly no worse. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
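For readers following the arithmetic, the sketch below spells out why the suggested `ror64()` form folds correctly; it is illustrative, not code from the thread.

	/* Write sum64 as two 32-bit halves, hi:lo.
	 *
	 *   sum64            = hi * 2^32 + lo
	 *   ror64(sum64, 32) = lo * 2^32 + hi
	 *
	 * Adding the two puts (hi + lo) in both halves; a carry out of the
	 * low half propagates into the high half.  The top 32 bits of the
	 * result are therefore the end-around-carry (one's-complement) fold
	 * of hi and lo -- the same value add32_with_carry(lo, hi) produces,
	 * but without needing the carry flag.
	 */
	static inline u32 fold64_sketch(u64 sum64)
	{
		return (sum64 + ror64(sum64, 32)) >> 32;
	}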
diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c index cea25ca8b8cf..b83c8accd756 100644 --- a/arch/x86/lib/csum-partial_64.c +++ b/arch/x86/lib/csum-partial_64.c @@ -11,26 +11,9 @@ #include <asm/checksum.h> #include <asm/word-at-a-time.h> -static inline unsigned short from32to16(unsigned a) +static inline __wsum csum_finalize_sum(u64 temp64) { - unsigned short b = a >> 16; - asm("addw %w2,%w0\n\t" - "adcw $0,%w0\n" - : "=r" (b) - : "0" (b), "r" (a)); - return b; -} - -static inline __wsum csum_tail(u64 temp64, int odd) -{ - unsigned int result; - - result = add32_with_carry(temp64 >> 32, temp64 & 0xffffffff); - if (unlikely(odd)) { - result = from32to16(result); - result = ((result >> 8) & 0xff) | ((result & 0xff) << 8); - } - return (__force __wsum)result; + return (temp64 + ror64(temp64, 32)) >> 32; } /* @@ -47,17 +30,6 @@ static inline __wsum csum_tail(u64 temp64, int odd) __wsum csum_partial(const void *buff, int len, __wsum sum) { u64 temp64 = (__force u64)sum; - unsigned odd; - - odd = 1 & (unsigned long) buff; - if (unlikely(odd)) { - if (unlikely(len == 0)) - return sum; - temp64 = ror32((__force u32)sum, 8); - temp64 += (*(unsigned char *)buff << 8); - len--; - buff++; - } /* * len == 40 is the hot case due to IPv6 headers, but annotating it likely() @@ -73,7 +45,7 @@ __wsum csum_partial(const void *buff, int len, __wsum sum) "adcq $0,%[res]" : [res] "+r"(temp64) : [src] "r"(buff), "m"(*(const char(*)[40])buff)); - return csum_tail(temp64, odd); + return csum_finalize_sum(temp64); } if (unlikely(len >= 64)) { /* @@ -143,7 +115,7 @@ __wsum csum_partial(const void *buff, int len, __wsum sum) : [res] "+r"(temp64) : [trail] "r"(trail)); } - return csum_tail(temp64, odd); + return csum_finalize_sum(temp64); } EXPORT_SYMBOL(csum_partial);