Message ID | 20221116041342.3841-11-elliott@hpe.com |
---|---|
State | New |
Headers |
From: Robert Elliott <elliott@hpe.com>
To: herbert@gondor.apana.org.au, davem@davemloft.net, tim.c.chen@linux.intel.com, ap420073@gmail.com, ardb@kernel.org, Jason@zx2c4.com, David.Laight@ACULAB.COM, ebiggers@kernel.org, linux-crypto@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: Robert Elliott <elliott@hpe.com>
Subject: [PATCH v4 10/24] crypto: x86/poly - limit FPU preemption
Date: Tue, 15 Nov 2022 22:13:28 -0600
Message-Id: <20221116041342.3841-11-elliott@hpe.com>
In-Reply-To: <20221116041342.3841-1-elliott@hpe.com>
References: <20221103042740.6556-1-elliott@hpe.com> <20221116041342.3841-1-elliott@hpe.com> |
Series | crypto: fix RCU stalls |
Commit Message
Elliott, Robert (Servers)
Nov. 16, 2022, 4:13 a.m. UTC
Use a static const unsigned int for the limit on the number of bytes
processed between kernel_fpu_begin() and kernel_fpu_end(), rather than
using the SZ_4K macro (which is a signed value) or a magic value of
4096U embedded in the C code.

Use unsigned int rather than size_t for some of the arguments to
avoid typecasting for the min() macro.
Signed-off-by: Robert Elliott <elliott@hpe.com>
---
v3: use a static const int rather than a macro; change to while loops
    rather than do/while loops
---
arch/x86/crypto/nhpoly1305-avx2-glue.c | 11 +++++---
arch/x86/crypto/nhpoly1305-sse2-glue.c | 11 +++++---
arch/x86/crypto/poly1305_glue.c | 37 +++++++++++++++++---------
arch/x86/crypto/polyval-clmulni_glue.c | 8 ++++--
4 files changed, 46 insertions(+), 21 deletions(-)
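For orientation before the review discussion, here is the shape of the loop all four glue files move to, as a minimal standalone sketch; bytes_per_fpu and update_helper() are placeholders for the per-module value and per-algorithm helper shown in the diff at the bottom of this page:

	/* Process at most bytes_per_fpu bytes per kernel_fpu_begin()/
	 * kernel_fpu_end() pair, so the scheduler (and RCU) get a chance
	 * to run between chunks instead of once per request. */
	static const unsigned int bytes_per_fpu = 337 * 1024;

	static int example_update(struct shash_desc *desc,
				  const u8 *src, unsigned int srclen)
	{
		while (srclen) {
			unsigned int n = min(srclen, bytes_per_fpu);

			kernel_fpu_begin();
			update_helper(desc, src, n);	/* placeholder */
			kernel_fpu_end();
			src += n;
			srclen -= n;
		}
		return 0;
	}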
Comments
On Tue, Nov 15, 2022 at 10:13:28PM -0600, Robert Elliott wrote:
> +/* avoid kernel_fpu_begin/end scheduler/rcu stalls */
> +static const unsigned int bytes_per_fpu = 337 * 1024;
> +

Use an enum for constants like this:

    enum { BYTES_PER_FPU = ... };

You can even make it function-local, so it's near the code that uses it,
which will better justify its existence.

Also, where did you get this number? Seems kind of weird.

>  asmlinkage void nh_avx2(const u32 *key, const u8 *message, size_t message_len,
>  			u8 hash[NH_HASH_BYTES]);
>
> @@ -26,18 +29,20 @@ static void _nh_avx2(const u32 *key, const u8 *message, size_t message_len,
>  static int nhpoly1305_avx2_update(struct shash_desc *desc,
>  				  const u8 *src, unsigned int srclen)
>  {
> +	BUILD_BUG_ON(bytes_per_fpu == 0);

Make the constant function local and remove this check.

> +
>  	if (srclen < 64 || !crypto_simd_usable())
>  		return crypto_nhpoly1305_update(desc, src, srclen);
>
> -	do {
> -		unsigned int n = min_t(unsigned int, srclen, SZ_4K);
> +	while (srclen) {

Does this add a needless additional check or does it generate better
code? Would be nice to have some explanation of the rationale.

Same comments as above apply for the rest of this patch and series.
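Applied to the hunk quoted above, Jason's function-local suggestion might look like this sketch (the value is illustrative; note min_t() rather than min(), for the reason Robert gives in the next reply):

	static int nhpoly1305_avx2_update(struct shash_desc *desc,
					  const u8 *src, unsigned int srclen)
	{
		/* Function-local, as suggested: an enum member is an integer
		 * constant expression, needs no storage, and sits next to the
		 * loop that uses it. */
		enum { BYTES_PER_FPU = 337 * 1024 };

		if (srclen < 64 || !crypto_simd_usable())
			return crypto_nhpoly1305_update(desc, src, srclen);

		while (srclen) {
			unsigned int n = min_t(unsigned int, srclen, BYTES_PER_FPU);

			kernel_fpu_begin();
			crypto_nhpoly1305_update_helper(desc, src, n, _nh_avx2);
			kernel_fpu_end();
			src += n;
			srclen -= n;
		}
		return 0;
	}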
> -----Original Message-----
> From: Jason A. Donenfeld <Jason@zx2c4.com>
> Sent: Wednesday, November 16, 2022 5:14 AM
> Subject: Re: [PATCH v4 10/24] crypto: x86/poly - limit FPU preemption
>
> On Tue, Nov 15, 2022 at 10:13:28PM -0600, Robert Elliott wrote:
> > +/* avoid kernel_fpu_begin/end scheduler/rcu stalls */
> > +static const unsigned int bytes_per_fpu = 337 * 1024;
> > +
>
> Use an enum for constants like this:
>
>     enum { BYTES_PER_FPU = ... };
>
> You can even make it function-local, so it's near the code that uses it,
> which will better justify its existence.

Using crc32c-intel as an example, the gcc 12.2.1 assembly output is the
same for:
1. const variable per-module
	static const unsigned int bytes_per_fpu = 868 * 1024;
2. enum per-module
	enum { bytes_per_fpu = 868 * 1024 };
3. enum per function (_update and _finup)
	enum { bytes_per_fpu = 868 * 1024 };

Each function gets a movl instruction with that constant, and has the
same compare, add, subtract, and jump instructions:

	# ../arch/x86/crypto/crc32c-intel_glue.c:198: unsigned int chunk = min(len, bytes_per_fpu);
	movl	$888832, %eax	#, tmp127
	# ../arch/x86/crypto/crc32c-intel_glue.c:171: unsigned int chunk = min(len, bytes_per_fpu);
	movl	$888832, %r13d	#, tmp128

Since enum doesn't guarantee any particular type, those variations
upset the min() macro. min_t() is necessary to eliminate the compiler
warning:

	../arch/x86/crypto/crc32c-intel_glue.c: In function 'crc32c_pcl_intel_update':
	../arch/x86/crypto/crc32c-intel_glue.c:171:97: warning: comparison of distinct pointer types lacks a cast
	  171 |         unsigned int chunk = min(len, bytes_per_fpu);

> Also, where did you get this number? Seems kind of weird.

As described in replies on the v2 patches, I created a tcrypt test that
runs each algorithm on a 1 MiB buffer with no loop limits (and irqs
disabled), picks the best result out of 10 passes, and calculates the
number of bytes that nominally fit in 30 us (on a 2.2 GHz Cascade Lake
CPU). Actual results with those values vary from 37 to 102 us; it is
much better than running unlimited, but still imperfect.

https://lore.kernel.org/lkml/MW5PR84MB184284FBED63E2D043C93A6FAB369@MW5PR84MB1842.NAMPRD84.PROD.OUTLOOK.COM/

The hash algorithms seem to congregate around three speeds:
- slow: 10 to 20 KiB for sha* and sm3
- medium: 200 to 400 KiB for poly*
- fast: 600 to 800 KiB for crc*
so it might be preferable to just go with three values (e.g., 16, 256,
and 512 KiB). There's a lot of variability in CPU architecture, CPU
speeds, and other system activity that makes this impossible to perfect.

It'd be ideal if the loops checked the CPU cycle count rather than
worked on a byte count, but the RDTSC instruction is notoriously slow
and would slow down overall performance.

The RAID6 library supports running a benchmark during each boot to pick
the best implementation to use (not always enabled):
	[    3.341372] raid6: skipped pq benchmark and selected avx512x4

The crypto subsystem does something similar with its self-tests, which
could be expanded to include speed tests to tune the loop values.
However, that slows down boot and could be misled by an NMI or SMI
during the test, which could lead to even worse results.

The selected values could be kept in each file or put in a shared .h
file. Both approaches seem difficult to maintain.

The powerpc modules have paragraph-sized comments explaining their
MAX_BYTES macros (e.g., in arch/powerpc/crypto/sha256-spe-glue.c) that
might be a good model for documenting the values:

 * MAX_BYTES defines the number of bytes that are allowed to be processed
 * between preempt_disable() and preempt_enable(). SHA256 takes ~2,000
 * operations per 64 bytes. e500 cores can issue two arithmetic instructions
 * per clock cycle using one 32/64 bit unit (SU1) and one 32 bit unit (SU2).
 * Thus 1KB of input data will need an estimated maximum of 18,000 cycles.
 * Headroom for cache misses included. Even with the low end model clocked
 * at 667 MHz this equals to a critical time window of less than 27us.

> >  asmlinkage void nh_avx2(const u32 *key, const u8 *message, size_t message_len,
> >  			u8 hash[NH_HASH_BYTES]);
> >
> > @@ -26,18 +29,20 @@ static void _nh_avx2(const u32 *key, const u8 *message, size_t message_len,
> >  static int nhpoly1305_avx2_update(struct shash_desc *desc,
> >  				  const u8 *src, unsigned int srclen)
> >  {
> > +	BUILD_BUG_ON(bytes_per_fpu == 0);
>
> Make the constant function local and remove this check.

That just makes sure someone editing the source code doesn't pick a
value that will cause the loops to hang; it's stronger than a comment
saying "don't set this to 0". It's only a compile-time check, and
doesn't result in any change to the assembly language output that I
can see.

> > +
> >  	if (srclen < 64 || !crypto_simd_usable())
> >  		return crypto_nhpoly1305_update(desc, src, srclen);
> >
> > -	do {
> > -		unsigned int n = min_t(unsigned int, srclen, SZ_4K);
> > +	while (srclen) {
>
> Does this add a needless additional check or does it generate better
> code? Would be nice to have some explanation of the rationale.

Each module's assembly function can have different handling of:
- length 0
- length < block size
- length < some minimum length
- length < a performance switchover point
- length not a multiple of block size
- current buffer pointer not aligned to block size

Sometimes the C glue logic checks values upfront; sometimes it doesn't.
The while loops help get them to follow one of two patterns:
	while (length)
or
	while (length >= BLOCK_SIZE)
and sidestep some of the special handling concerns.

Performance-wise, the patches are either:
- adding lots of kernel_fpu_begin() and kernel_fpu_end() calls (all the
  ones that were running unlimited)
- removing lots of kernel_fpu_begin() and kernel_fpu_end() calls (e.g.,
  polyval relaxed from 4 KiB to 383 KiB)
which is much more impactful than the common while loop entry.

I created tcrypt tests that try lengths around all the special values
like 0, 16, 4096, and the selected bytes-per-FPU size (comparing
results to the generic algorithms like the extended self-tests), so I
think these loops are functionally correct. (I've found that violating
the undocumented assumptions of the assembly functions is a good way to
exercise RCU stall reporting.)
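A reduced illustration of the warning Robert quotes: kernel_style_min below mirrors the kernel's classic min() definition, and the enum constant's type is int because its value fits in int, so its typeof differs from unsigned int:

	/* Mirrors older kernels' min(): the pointer comparison enforces
	 * matching operand types and warns when they differ. */
	#define kernel_style_min(x, y) ({		\
		typeof(x) _min1 = (x);			\
		typeof(y) _min2 = (y);			\
		(void) (&_min1 == &_min2);		\
		_min1 < _min2 ? _min1 : _min2; })

	enum { bytes_per_fpu = 868 * 1024 };	/* type is int here */

	unsigned int clamp_len(unsigned int len)
	{
		/* warns: comparison of distinct pointer types lacks a cast */
		return kernel_style_min(len, bytes_per_fpu);
	}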
From: Elliott, Robert
> Sent: 22 November 2022 05:06
...
> Since enum doesn't guarantee any particular type, those variations
> upset the min() macro. min_t() is necessary to eliminate the
> compiler warning.

Yes, min() is fundamentally broken. min_t() isn't really a solution.

I think min() needs to include something like:

	#define min(a, b) \
		__builtin_constant(b) && (b) + 0u <= MAX_INT ? \
			((a) < (int)(b) ? (a) : (int)(b)) : \
			...

So in the common case where 'b' is a small constant integer it doesn't
matter whether it is signed or unsigned.

I might try compiling a kernel where min_t() does that instead of the
casts - just to see how many of the casts are actually needed.

	David
On Wed, Nov 16, 2022 at 12:13:51PM +0100, Jason A. Donenfeld wrote:
> On Tue, Nov 15, 2022 at 10:13:28PM -0600, Robert Elliott wrote:
> > +/* avoid kernel_fpu_begin/end scheduler/rcu stalls */
> > +static const unsigned int bytes_per_fpu = 337 * 1024;
> > +
>
> Use an enum for constants like this:
>
>     enum { BYTES_PER_FPU = ... };
>
> You can even make it function-local, so it's near the code that uses it,
> which will better justify its existence.
>
> Also, where did you get this number? Seems kind of weird.

These numbers are highly dependent on hardware and I think having
them hard-coded is wrong.

Perhaps we should try a different approach. How about just limiting
the size to 4K, and then depending on need_resched we break out of
the loop? Something like:

	if (!len)
		return 0;

	kernel_fpu_begin();
	for (;;) {
		unsigned int chunk = min(len, 4096);

		sha1_base_do_update(desc, data, chunk, sha1_xform);

		len -= chunk;
		data += chunk;

		if (!len)
			break;

		if (need_resched()) {
			kernel_fpu_end();
			cond_resched();
			kernel_fpu_begin();
		}
	}
	kernel_fpu_end();

Cheers,
On Fri, 25 Nov 2022 at 09:41, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> These numbers are highly dependent on hardware and I think having
> them hard-coded is wrong.
>
> Perhaps we should try a different approach. How about just limiting
> the size to 4K, and then depending on need_resched we break out of
> the loop? Something like:
>
> 	if (!len)
> 		return 0;
>
> 	kernel_fpu_begin();
> 	for (;;) {
> 		unsigned int chunk = min(len, 4096);
>
> 		sha1_base_do_update(desc, data, chunk, sha1_xform);
>
> 		len -= chunk;
> 		data += chunk;
>
> 		if (!len)
> 			break;
>
> 		if (need_resched()) {
> 			kernel_fpu_end();
> 			cond_resched();
> 			kernel_fpu_begin();
> 		}
> 	}
> 	kernel_fpu_end();
>

On arm64, this is implemented in an assembler macro 'cond_yield' so we
don't need to preserve/restore the SIMD state at all if the yield is
not going to result in a call to schedule(). For example, the SHA3 code
keeps 400 bytes of state in registers, which we don't want to save and
reload unless needed. (5f6cb2e617681 'crypto: arm64/sha512-ce -
simplify NEON yield')

So the asm routines will call cond_yield, and return early if a yield
is required, with the number of blocks or bytes left to process as the
return value. The C wrapper just calls the asm routine in a loop until
the return value becomes 0.

That way, we don't need magic values at all, and the yield will occur
as soon as the asm inner loop observes the yield condition, so the
latency should be much lower as well.

Note that it is only used in shash implementations, given that they
are the only ones that may receive unbounded inputs.
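A sketch of the C-side loop Ard describes, with hypothetical names (the real arm64 drivers differ in detail): the asm routine checks cond_yield internally and returns the number of blocks it did not process, so the wrapper just re-enters the SIMD region until everything is consumed:

	/* sha512_blocks_asm is a stand-in for an arm64 routine like
	 * sha512_ce_transform, which returns the remaining block count. */
	asmlinkage int sha512_blocks_asm(u64 *state, const u8 *data, int blocks);

	static void sha512_do_blocks(u64 *state, const u8 *data, int blocks)
	{
		while (blocks) {
			int rem;

			kernel_neon_begin();
			rem = sha512_blocks_asm(state, data, blocks);
			kernel_neon_end();

			data += (blocks - rem) * 128;	/* SHA-512 block size */
			blocks = rem;
		}
	}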
On Fri, Nov 25, 2022 at 09:59:17AM +0100, Ard Biesheuvel wrote:
>
> On arm64, this is implemented in an assembler macro 'cond_yield' so we
> don't need to preserve/restore the SIMD state at all if the yield is
> not going to result in a call to schedule(). For example, the SHA3 code
> keeps 400 bytes of state in registers, which we don't want to save and
> reload unless needed. (5f6cb2e617681 'crypto: arm64/sha512-ce -
> simplify NEON yield')

Yes this would be optimally done from the assembly code which would
make a difference if they benefited from larger block sizes.

Cheers,
On Fri, Nov 25, 2022 at 09:59:17AM +0100, Ard Biesheuvel wrote:
> On arm64, this is implemented in an assembler macro 'cond_yield' so we
> don't need to preserve/restore the SIMD state at all if the yield is
> not going to result in a call to schedule(). For example, the SHA3 code
> keeps 400 bytes of state in registers, which we don't want to save and
> reload unless needed. (5f6cb2e617681 'crypto: arm64/sha512-ce -
> simplify NEON yield')

That sounds like the optimal approach. There is a cost to unnecessary
kernel_fpu_begin()/end() calls - increasing their usage in the x86
sha512 driver added 929 us during one boot. The cond_yield check is
just a few memory reads and conditional branches.

I see that is built on the asm-offsets.c technique mentioned by Dave
Hansen in the x86 aria driver thread.

> Note that it is only used in shash implementations, given that they
> are the only ones that may receive unbounded inputs.

Although typical usage probably doesn't stress this, the length of the
additional associated data presented to aead implementations is
unconstrained as well. At least in x86, they can end up processing
multiple megabytes in one chunk like the hash functions (if the
associated data is a big buffer described by a sg list created with
sg_init_one()).
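For illustration, this is the kind of caller Robert means; a hypothetical helper building an aead request whose associated data is one huge contiguous segment (the function name is illustrative; sg_init_one() and aead_request_set_ad() are the real APIs):

	/* A single sg entry can describe megabytes of associated data,
	 * which a driver may then hash inside one kernel_fpu_begin()
	 * block if it does not cap the AAD walk. */
	static void set_unbounded_aad(struct aead_request *req,
				      struct scatterlist *sg,
				      u8 *aad, unsigned int aad_len)
	{
		sg_init_one(sg, aad, aad_len);	/* one contiguous segment */
		aead_request_set_ad(req, aad_len);	/* no upper bound imposed */
	}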
> On Fri, Nov 25, 2022 at 09:59:17AM +0100, Ard Biesheuvel wrote:
> > Note that it is only used in shash implementations, given that they
> > are the only ones that may receive unbounded inputs.
>
> Although typical usage probably doesn't stress this, the length of the
> additional associated data presented to aead implementations is
> unconstrained as well. At least in x86, they can end up processing
> multiple megabytes in one chunk like the hash functions (if the
> associated data is a big buffer described by a sg list created
> with sg_init_one()).
>

Reviewing the two arm64 aead drivers, aes-ce-ccm-glue.c solves that by
including this in the do/while loop in ccm_calculate_auth_mac():

	n = min_t(u32, n, SZ_4K); /* yield NEON at least every 4k */

That was added by 36a916af641 ("crypto: arm64/aes-ccm - yield NEON when
processing auth-only data") in 2021, also relying on 41691c44606b
("crypto: arm64/aes-ccm - reduce NEON begin/end calls for common
case").

ghash-ce-glue.c seems to be missing that in its similar function named
gcm_calculate_auth_mac().
> -----Original Message-----
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Sent: Friday, November 25, 2022 2:41 AM
> Subject: Re: [PATCH v4 10/24] crypto: x86/poly - limit FPU preemption
>
> On Wed, Nov 16, 2022 at 12:13:51PM +0100, Jason A. Donenfeld wrote:
> > On Tue, Nov 15, 2022 at 10:13:28PM -0600, Robert Elliott wrote:
> > > +/* avoid kernel_fpu_begin/end scheduler/rcu stalls */
> > > +static const unsigned int bytes_per_fpu = 337 * 1024;
> > > +
> >
> > Use an enum for constants like this:
> >
> >     enum { BYTES_PER_FPU = ... };
> >
> > You can even make it function-local, so it's near the code that uses it,
> > which will better justify its existence.
> >
> > Also, where did you get this number? Seems kind of weird.
>
> These numbers are highly dependent on hardware and I think having
> them hard-coded is wrong.
>
> Perhaps we should try a different approach. How about just limiting
> the size to 4K, and then depending on need_resched we break out of
> the loop? Something like:
>
> 	if (!len)
> 		return 0;
>
> 	kernel_fpu_begin();
> 	for (;;) {
> 		unsigned int chunk = min(len, 4096);
>
> 		sha1_base_do_update(desc, data, chunk, sha1_xform);
>
> 		len -= chunk;
> 		data += chunk;
>
> 		if (!len)
> 			break;
>
> 		if (need_resched()) {
> 			kernel_fpu_end();
> 			cond_resched();
> 			kernel_fpu_begin();
> 		}
> 	}
> 	kernel_fpu_end();
>

I implemented that conditional approach in the sha algorithms.

The results of a boot (using sha512 for module signatures, with crypto
extra tests enabled, comparing to sha512 with a 20 KiB fixed limit)
are:

	sha1 cond:     14479 calls;   784256 cycles doing begin/end; longest FPU context     35828 cycles
	sha256 cond:   26763 calls;  1273570 cycles doing begin/end; longest FPU context    118612 cycles
	sha512 cond:   26957 calls;  1680046 cycles doing begin/end; longest FPU context 169140982 cycles
	sha512 20KiB: 161011 calls; 16232280 cycles doing begin/end; longest FPU context   4049644 cycles

NOTE: I didn't have a patch in place to isolate the counts for each
variation (ssse3 vs. avx vs. avx2) and
- for sha512: sha512 vs. sha384
- for sha256: sha256 vs. sha224
so the numbers include sha256 and sha512 running twice as many tests
as sha1.

This approach looks very good:
- 16% of the number of begin/end calls
- 10% of the CPU cycles spent making the calls
- the FPU context is held for a long time (77 ms) but only while
  it's not needed.

That's much more efficient than releasing it every 30 us just in case.

I'll keep testing this to make sure RCU stalls stay away, and apply
the approach to the other algorithms.

In x86, need_resched() has to deal with a PER_CPU variable, so I'm
not sure it's worth the hassle to figure out how to do that from
assembly code.
On Fri, Dec 02, 2022 at 06:21:02AM +0000, Elliott, Robert (Servers) wrote:
>
> I'll keep testing this to make sure RCU stalls stay away, and apply
> the approach to the other algorithms.

Thanks for doing all the hard work!

BTW, just a minor nit but you can delete the cond_resched() call
because kernel_fpu_end()/preempt_enable() will do it anyway.

Cheers,
> -----Original Message-----
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Sent: Friday, December 2, 2022 3:25 AM
> To: Elliott, Robert (Servers) <elliott@hpe.com>
> Subject: Re: [PATCH v4 10/24] crypto: x86/poly - limit FPU preemption
>
> On Fri, Dec 02, 2022 at 06:21:02AM +0000, Elliott, Robert (Servers) wrote:
...
> BTW, just a minor nit but you can delete the cond_resched() call
> because kernel_fpu_end()/preempt_enable() will do it anyway.

That happens under CONFIG_PREEMPTION=y (from include/linux/preempt.h
and arch/x86/include/asm/preempt.h).

Is calling cond_resched() still helpful if that is not the
configuration?
On Fri, Dec 02, 2022 at 04:15:23PM +0000, Elliott, Robert (Servers) wrote:
>
> > -----Original Message-----
> > From: Herbert Xu <herbert@gondor.apana.org.au>
> > Sent: Friday, December 2, 2022 3:25 AM
> > To: Elliott, Robert (Servers) <elliott@hpe.com>
> > Subject: Re: [PATCH v4 10/24] crypto: x86/poly - limit FPU preemption
> >
> > On Fri, Dec 02, 2022 at 06:21:02AM +0000, Elliott, Robert (Servers) wrote:
> ...
> > BTW, just a minor nit but you can delete the cond_resched() call
> > because kernel_fpu_end()/preempt_enable() will do it anyway.
>
> That happens under CONFIG_PREEMPTION=y (from include/linux/preempt.h
> and arch/x86/include/asm/preempt.h).
>
> Is calling cond_resched() still helpful if that is not the configuration?

Perhaps, but then again perhaps if preemption is off, maybe we
shouldn't even bother with the 4K split. Were the initial
warnings with or without preemption?

Personally I don't really care since I always use preemption.

The PREEMPT Kconfigs do provide a bit of nuance with the split
between PREEMPT_NONE vs. PREEMPT_VOLUNTARY. But perhaps that is
just overkill for our situation.

I'll leave it to you to decide :)

Thanks,
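Herbert's "maybe we shouldn't even bother with the 4K split" thought could be encoded along these lines, as a sketch only (not something posted in this thread):

	/* Sketch: skip the 4K split entirely when the kernel offers no
	 * preemption points for the split to serve. */
	static inline unsigned int fpu_chunk(unsigned int len)
	{
		if (IS_ENABLED(CONFIG_PREEMPT_NONE))
			return len;
		return min(len, 4096U);
	}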
> > > BTW, just a minor nit but you can delete the cond_resched() call
> > > because kernel_fpu_end()/preempt_enable() will do it anyway.
> >
> > That happens under CONFIG_PREEMPTION=y (from include/linux/preempt.h
> > and arch/x86/include/asm/preempt.h).
> >
> > Is calling cond_resched() still helpful if that is not the configuration?
>
> Perhaps, but then again perhaps if preemption is off, maybe we
> shouldn't even bother with the 4K split. Were the initial
> warnings with or without preemption?
>
> Personally I don't really care since I always use preemption.
>
> The PREEMPT Kconfigs do provide a bit of nuance with the split
> between PREEMPT_NONE vs. PREEMPT_VOLUNTARY. But perhaps that is
> just overkill for our situation.

I was thinking about this a few days ago, and my 2¢ is that it's
probably best to not preempt the kernel in the middle of a crypto
operation under PREEMPT_VOLUNTARY. We're already not preempting during
these operations, and there haven't been complaints of excessive
latency because of these crypto operations.

If we skip the kernel_fpu_{begin,end} pair when not under
CONFIG_PREEMPT, we'll save a significant cycle count that is wasted
currently. See Robert Elliott's numbers on conditional begin/end in
sha to see the benefits of not saving/restoring unnecessarily: "10% of
the CPU cycles spent making the [kernel_fpu_{begin,end}] calls".

> I'll leave it to you to decide :)

One extra thought: commit 827ee47: "crypto: x86 - add some helper
macros for ECB and CBC modes" makes a mention of fpu save/restore
being done lazily. I don't know the details, so would that change this
discussion?

Thanks for listening,
Peter Lafreniere <peter@n8pjl.ca>
From: Peter Lafreniere
> Sent: 06 December 2022 14:04
>
> > > > BTW, just a minor nit but you can delete the cond_resched() call
> > > > because kernel_fpu_end()/preempt_enable() will do it anyway.
> > >
> > > That happens under CONFIG_PREEMPTION=y (from include/linux/preempt.h
> > > and arch/x86/include/asm/preempt.h).
> > >
> > > Is calling cond_resched() still helpful if that is not the configuration?
> >
> > Perhaps, but then again perhaps if preemption is off, maybe we
> > shouldn't even bother with the 4K split. Were the initial
> > warnings with or without preemption?
> >
> > Personally I don't really care since I always use preemption.
> >
> > The PREEMPT Kconfigs do provide a bit of nuance with the split
> > between PREEMPT_NONE vs. PREEMPT_VOLUNTARY. But perhaps that is
> > just overkill for our situation.
>
> I was thinking about this a few days ago, and my 2¢ is that it's
> probably best to not preempt the kernel in the middle of a crypto
> operation under PREEMPT_VOLUNTARY. We're already not preempting during
> these operations, and there haven't been complaints of excessive latency
> because of these crypto operations.
...

Probably because the people who have been suffering from (and looking
for) latency issues aren't running crypto tests.

I've found some terrible pre-emption latency issues trying to get RT
processes scheduled in a sensible timeframe. I wouldn't worry about
100us - I'm doing audio processing every 10ms, but anything much
longer causes problems when trying to use 90+% of the cpu time for
lots of audio channels.

I didn't try a CONFIG_RT kernel, the application needs to run on a
standard 'distro' kernel. In any case I suspect all the extra process
switches (etc) the RT kernel adds will completely kill performance.

I wonder how much it would cost to measure the time spent with
pre-empt disabled (and not checked) and to trace long intervals.

	David
> > Subject: Re: [PATCH v4 10/24] crypto: x86/poly - limit FPU preemption
> > Perhaps we should try a different approach. How about just limiting
> > the size to 4K, and then depending on need_resched we break out of
> > the loop? Something like:
> >
> > 	if (!len)
> > 		return 0;
> >
> > 	kernel_fpu_begin();
> > 	for (;;) {
> > 		unsigned int chunk = min(len, 4096);
> >
> > 		sha1_base_do_update(desc, data, chunk, sha1_xform);
> >
> > 		len -= chunk;
> > 		data += chunk;
> >
> > 		if (!len)
> > 			break;
> >
> > 		if (need_resched()) {
> > 			kernel_fpu_end();
> > 			cond_resched();
> > 			kernel_fpu_begin();
> > 		}
> > 	}
> > 	kernel_fpu_end();
> >
> I implemented that conditional approach in the sha algorithms.
>
> The results of a boot (using sha512 for module signatures, with
> crypto extra tests enabled, comparing to sha512 with a 20 KiB
> fixed limit) are:
>
> sha1 cond:     14479 calls;   784256 cycles doing begin/end; longest FPU context 35828 cycles
> sha256 cond:   26763 calls;  1273570 cycles doing begin/end; longest FPU context 118612 cycles
> sha512 cond:   26957 calls;  1680046 cycles doing begin/end; longest FPU context 169140982 cycles
> sha512 20KiB: 161011 calls; 16232280 cycles doing begin/end; longest FPU context 4049644 cycles
>
> NOTE: I didn't have a patch in place to isolate the counts for each variation
> (ssse3 vs. avx vs. avx2) and
> - for sha512: sha512 vs. sha384
> - for sha256: sha256 vs. sha224
> so the numbers include sha256 and sha512 running twice as many tests
> as sha1.
>
> This approach looks very good:
> - 16% of the number of begin/end calls
> - 10% of the CPU cycles spent making the calls
> - the FPU context is held for a long time (77 ms) but only while
>   it's not needed.
>
> That's much more efficient than releasing it every 30 us just in case.

How recently did you make this change? I implemented this conditional
approach for ecb_cbc_helpers.h, but saw no changes at all to
performance on serpent-avx2 and twofish-avx.

kernel_fpu_{begin,end} (after the first call to begin) don't do
anything more than enable/disable preemption and make a few writes to
the mxcsr. It's likely that the above approach has the tiniest bit
less overhead, and it will preempt on non CONFIG_PREEMPT kernels, but
nothing suggests a performance uplift.

This brings us back to this question: should crypto routines be
preempted under PREEMPT_VOLUNTARY or not?

> I'll keep testing this to make sure RCU stalls stay away, and apply
> the approach to the other algorithms.

I missed the earlier discussions. Have you seen issues with RCU
stalls/latency spikes because of crypto routines? If so, what
preemption model were you running?

> In x86, need_resched() has to deal with a PER_CPU variable, so I'm
> not sure it's worth the hassle to figure out how to do that from
> assembly code.

Leave it in c. It'll be more maintainable that way.

Cheers,
Peter Lafreniere <peter@n8pjl.ca>
> -----Original Message-----
> From: Peter Lafreniere <peter@n8pjl.ca>
> Sent: Tuesday, December 6, 2022 5:06 PM
> To: Elliott, Robert (Servers) <elliott@hpe.com>
> Subject: RE: [PATCH v4 10/24] crypto: x86/poly - limit FPU preemption
>
> How recently did you make this change? I implemented this conditional
> approach for ecb_cbc_helpers.h, but saw no changes at all to performance
> on serpent-avx2 and twofish-avx.

The hash functions are the main problem; the skciphers receive
requests already broken into 4 KiB chunks by the SG list helpers.

> kernel_fpu_{begin,end} (after the first call to begin) don't do anything
> more than enable/disable preemption and make a few writes to the mxcsr.
> It's likely that the above approach has the tiniest bit less overhead,
> and it will preempt on non CONFIG_PREEMPT kernels, but nothing suggests
> a performance uplift.
>
> > I'll keep testing this to make sure RCU stalls stay away, and apply
> > the approach to the other algorithms.
>
> I missed the earlier discussions. Have you seen issues with RCU
> stalls/latency spikes because of crypto routines? If so, what preemption
> model were you running?

While running Wireshark in Fedora, I noticed the top function
consuming CPU cycles (per "perf top") was sha512_generic. Although
Fedora and RHEL have the x86 optimized driver compiled as a module,
nothing in the distro or application space noticed it was there and
loaded it. The only x86 optimized drivers that do get used are the
ones built-in to the kernel.

After making changes to load the x86 sha512 module, I noticed several
boots over the next few weeks reported RCU stalls, all in the
sha512_avx2 function. Because the stack traces take a long time to
print to the serial port, these can trigger soft lockups as well.

Fedora and RHEL default to "Voluntary Kernel Preemption (Desktop)":
	# CONFIG_PREEMPT_NONE is not set
	CONFIG_PREEMPT_VOLUNTARY=y
	# CONFIG_PREEMPT is not set

The reason was that sha512 and all the other x86 crypto hash functions
process the entire data in one kernel_fpu_begin()/end() block, which
blocks preemption. Each boot checks module signatures for about 4000
files, totaling about 2.4 GB.

Breaking the loops into smaller chunks fixes the problem. However,
since functions like crc32c are 20x faster than sha1, one value like
4 KiB is not ideal.

A few non-hash functions have issues too. Although most skciphers are
broken up into 4 KiB chunks by the sg list walking functions, aegis
packs everything inside one kernel_fpu_begin()/end() block. All the
aead functions handle the main data with sg list walking functions,
but handle all the associated data inside one kernel_fpu_begin()/end()
block.

> > In x86, need_resched() has to deal with a PER_CPU variable, so I'm
> > not sure it's worth the hassle to figure out how to do that from
> > assembly code.
>
> Leave it in c. It'll be more maintainable that way.

I'm testing a new kernel_fpu_yield() utility function that looks nice:

	void __sha1_transform_avx2(struct sha1_state *state, const u8 *data, int blocks)
	{
		if (blocks <= 0)
			return;

		kernel_fpu_begin();
		for (;;) {
			const int chunks = min(blocks, 4096 / SHA1_BLOCK_SIZE);

			sha1_transform_avx2(state->state, data, chunks);
			blocks -= chunks;

			if (blocks <= 0)
				break;

			data += chunks * SHA1_BLOCK_SIZE;
			kernel_fpu_yield();
		}
		kernel_fpu_end();
	}

This construction also makes it easy to add debug counters to observe
what is happening.
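kernel_fpu_yield() itself is not shown in this thread; presumably it wraps Herbert's end/resched/begin sequence, roughly like this sketch:

	/* Hypothetical helper matching the usage above: give up the FPU
	 * only when the scheduler actually wants the CPU, then take it
	 * back before the next chunk. */
	static inline void kernel_fpu_yield(void)
	{
		if (need_resched()) {
			kernel_fpu_end();
			cond_resched();
			kernel_fpu_begin();
		}
	}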
In a boot with preempt=none and the crypto extra self-tests enabled,
two modules benefitted from that new yield call:

	/sys/module/sha256_ssse3/parameters/fpu_rescheds:3
	/sys/module/sha512_ssse3/parameters/fpu_rescheds:515

10 passes of 1 MiB buffer tests on all the drivers show several others
benefitting:

	/sys/module/aegis128_aesni/parameters/fpu_rescheds:1
	/sys/module/aesni_intel/parameters/fpu_rescheds:0
	/sys/module/aria_aesni_avx_x86_64/parameters/fpu_rescheds:45
	/sys/module/camellia_aesni_avx2/parameters/fpu_rescheds:0
	/sys/module/camellia_aesni_avx_x86_64/parameters/fpu_rescheds:0
	/sys/module/camellia_x86_64/parameters/fpu_rescheds:0
	/sys/module/cast5_avx_x86_64/parameters/fpu_rescheds:0
	/sys/module/cast6_avx_x86_64/parameters/fpu_rescheds:0
	/sys/module/chacha_x86_64/parameters/fpu_rescheds:0
	/sys/module/crc32c_intel/parameters/fpu_rescheds:1
	/sys/module/crc32_pclmul/parameters/fpu_rescheds:1
	/sys/module/crct10dif_pclmul/parameters/fpu_rescheds:1
	/sys/module/ghash_clmulni_intel/parameters/fpu_rescheds:1
	/sys/module/libblake2s_x86_64/parameters/fpu_rescheds:0
	/sys/module/nhpoly1305_avx2/parameters/fpu_rescheds:1
	/sys/module/nhpoly1305_sse2/parameters/fpu_rescheds:1
	/sys/module/poly1305_x86_64/parameters/fpu_rescheds:1
	/sys/module/polyval_clmulni/parameters/fpu_rescheds:1
	/sys/module/serpent_avx2/parameters/fpu_rescheds:0
	/sys/module/serpent_avx_x86_64/parameters/fpu_rescheds:0
	/sys/module/serpent_sse2_x86_64/parameters/fpu_rescheds:0
	/sys/module/sha1_ssse3/parameters/fpu_rescheds:3
	/sys/module/sha256_ssse3/parameters/fpu_rescheds:9
	/sys/module/sha512_ssse3/parameters/fpu_rescheds:723
	/sys/module/sm3_avx_x86_64/parameters/fpu_rescheds:171
	/sys/module/sm4_aesni_avx_x86_64/parameters/fpu_rescheds:0
	/sys/module/twofish_avx_x86_64/parameters/fpu_rescheds:0
	/sys/module/twofish_x86_64_3way/parameters/fpu_rescheds:0

I'll keep experimenting with all the preempt modes, heavier workloads,
and shorter RCU timeouts to confirm this solution is robust. It might
even be appropriate for the generic drivers, if they suffer from the
problems that sm4 shows here.

> This brings us back to this question: should crypto routines be
> preempted under PREEMPT_VOLUNTARY or not?

I think so. The RCU stall and soft lockup detectors aren't disabled,
so there is still an expectation of sharing the CPUs even in
PREEMPT=none mode.

1 MiB tests under CONFIG_PREEMPT=none triggered soft lockups while
running CBC mode for SM4, Camellia, and Serpent:

	[  208.975253] tcrypt: PERL "cfb-sm4-aesni-avx2" => 22499840,
	[  218.187217] watchdog: BUG: soft lockup - CPU#1 stuck for 26s! [modprobe:3433]
	...
	[  219.391776] tcrypt: PERL "cbc-sm4-aesni-avx2" => 22528138,
	[  244.471181] tcrypt: PERL "ecb-sm4-aesni-avx" => 4469626,
	[  246.181064] watchdog: BUG: soft lockup - CPU#1 stuck for 52s! [modprobe:3433]
	...
	[  250.168239] tcrypt: PERL "cbc-camellia-aesni-avx2" => 12202738,
	[  264.047440] tcrypt: PERL "cbc-cast5-avx" => 17744280,
	[  273.091258] tcrypt: PERL "cbc-cast6-avx" => 19375400,
	[  274.183249] watchdog: BUG: soft lockup - CPU#1 stuck for 78s! [modprobe:3433]
	...
	[  283.066260] tcrypt: PERL "cbc-serpent-avx2" => 21454930,

SM4 falls back to the generic driver for encryption; it only has
optimized decryption functions. Therefore, it doesn't make any
kernel_fpu_end() calls and thus makes no rescheduling calls.

This shows the CPU cycles for 1 MiB of encrypt and decrypt for each
algorithm (no soft lockups this time). SM4, Serpent, Cast5, and Cast6
encryption in CBC mode are the slowest by far.

	[ 2233.362748] tcrypt: PERL my %speeds_skcipher = (
	[ 2234.427387] tcrypt: PERL "cbc-aes-aesni" => 2178586,
	[ 2234.738823] tcrypt: PERL "cbc-aes-aesni" => 538752,
	[ 2235.064335] tcrypt: PERL "ctr-aes-aesni" => 574026,
	[ 2235.389427] tcrypt: PERL "ctr-aes-aesni" => 574060,
	[ 2236.451594] tcrypt: PERL "cts-cbc-aes-aesni" => 2178946,
	[ 2236.762174] tcrypt: PERL "cts-cbc-aes-aesni" => 540066,
	[ 2237.070371] tcrypt: PERL "ecb-aes-aesni" => 536970,
	[ 2237.379549] tcrypt: PERL "ecb-aes-aesni" => 538012,
	[ 2237.686137] tcrypt: PERL "xctr-aes-aesni" => 534690,
	[ 2237.993315] tcrypt: PERL "xctr-aes-aesni" => 534632,
	[ 2238.304077] tcrypt: PERL "xts-aes-aesni" => 542590,
	[ 2238.615057] tcrypt: PERL "xts-aes-aesni" => 541296,
	[ 2240.233298] tcrypt: PERL "ctr-aria-avx" => 3393212,
	[ 2241.849000] tcrypt: PERL "ctr-aria-avx" => 3391982,
	[ 2242.081296] tcrypt: PERL "xchacha12-simd" => 370794,
	[ 2242.316868] tcrypt: PERL "xchacha12-simd" => 373788,
	[ 2242.626165] tcrypt: PERL "xchacha20-simd" => 536310,
	[ 2242.936646] tcrypt: PERL "xchacha20-simd" => 537094,
	[ 2243.250356] tcrypt: PERL "chacha20-simd" => 540542,
	[ 2243.559396] tcrypt: PERL "chacha20-simd" => 536604,
	[ 2244.831594] tcrypt: PERL "ctr-sm4-aesni-avx2" => 2642674,
	[ 2246.106143] tcrypt: PERL "ctr-sm4-aesni-avx2" => 2640350,
	[ 2256.475661] tcrypt: PERL "cfb-sm4-aesni-avx2" => 22496346,
	[ 2257.732511] tcrypt: PERL "cfb-sm4-aesni-avx2" => 2604932,
	[ 2268.123821] tcrypt: PERL "cbc-sm4-aesni-avx2" => 22528268,
	[ 2269.378028] tcrypt: PERL "cbc-sm4-aesni-avx2" => 2601090,
	[ 2271.533556] tcrypt: PERL "ctr-sm4-aesni-avx" => 4559648,
	[ 2273.688772] tcrypt: PERL "ctr-sm4-aesni-avx" => 4561300,
	[ 2284.073187] tcrypt: PERL "cfb-sm4-aesni-avx" => 22499496,
	[ 2286.177732] tcrypt: PERL "cfb-sm4-aesni-avx" => 4457588,
	[ 2296.569751] tcrypt: PERL "cbc-sm4-aesni-avx" => 22529182,
	[ 2298.677312] tcrypt: PERL "cbc-sm4-aesni-avx" => 4457226,
	[ 2300.789931] tcrypt: PERL "ecb-sm4-aesni-avx" => 4464282,
	[ 2302.899974] tcrypt: PERL "ecb-sm4-aesni-avx" => 4466052,
	[ 2308.589365] tcrypt: PERL "cbc-camellia-aesni-avx2" => 12260426,
	[ 2309.737064] tcrypt: PERL "cbc-camellia-aesni-avx2" => 2350988,
	[ 2315.433319] tcrypt: PERL "cbc-camellia-aesni" => 12248986,
	[ 2317.262589] tcrypt: PERL "cbc-camellia-aesni" => 3814202,
	[ 2325.460542] tcrypt: PERL "cbc-cast5-avx" => 17739828,
	[ 2327.856127] tcrypt: PERL "cbc-cast5-avx" => 5061992,
	[ 2336.668992] tcrypt: PERL "cbc-cast6-avx" => 19066440,
	[ 2340.470787] tcrypt: PERL "cbc-cast6-avx" => 8147336,
	[ 2350.376676] tcrypt: PERL "cbc-serpent-avx2" => 21466002,
	[ 2351.646295] tcrypt: PERL "cbc-serpent-avx2" => 2611362,
	[ 2361.562736] tcrypt: PERL "cbc-serpent-avx" => 21471118,
	[ 2364.019693] tcrypt: PERL "cbc-serpent-avx" => 5201506,
	[ 2373.930747] tcrypt: PERL "cbc-serpent-sse2" => 21465594,
	[ 2376.697210] tcrypt: PERL "cbc-serpent-sse2" => 5855766,
	[ 2380.944596] tcrypt: PERL "cbc-twofish-avx" => 9058090,
	[ 2383.308215] tcrypt: PERL "cbc-twofish-avx" => 4989064,
	[ 2384.904158] tcrypt: PERL "ecb-aria-avx" => 3299260,
	[ 2386.498365] tcrypt: PERL "ecb-aria-avx" => 3297534,
	[ 2387.625226] tcrypt: PERL "ecb-camellia-aesni-avx2" => 2306326,
	[ 2388.757749] tcrypt: PERL "ecb-camellia-aesni-avx2" => 2312876,
	[ 2390.549340] tcrypt: PERL "ecb-camellia-aesni" => 3752534,
	[ 2392.335240] tcrypt: PERL "ecb-camellia-aesni" => 3751896,
	[ 2394.724956] tcrypt: PERL "ecb-cast5-avx" => 5032914,
	[ 2397.116268] tcrypt: PERL "ecb-cast5-avx" => 5041908,
	[ 2400.935093] tcrypt: PERL "ecb-cast6-avx" => 8148418,
	[ 2404.754816] tcrypt: PERL "ecb-cast6-avx" => 8150448,
	[ 2406.025861] tcrypt: PERL "ecb-serpent-avx2" => 2613024,
	[ 2407.286682] tcrypt: PERL "ecb-serpent-avx2" => 2602556,
	[ 2409.732474] tcrypt: PERL "ecb-serpent-avx" => 5191944,
	[ 2412.161829] tcrypt: PERL "ecb-serpent-avx" => 5165230,
	[ 2414.678835] tcrypt: PERL "ecb-serpent-sse2" => 5345630,
	[ 2417.217632] tcrypt: PERL "ecb-serpent-sse2" => 5331110,
	[ 2419.545136] tcrypt: PERL "ecb-twofish-avx" => 4917424,
	[ 2421.870457] tcrypt: PERL "ecb-twofish-avx" => 4915194,
	[ 2421.870564] tcrypt: PERL );
> I'll keep experimenting with all the preempt modes, heavier
> workloads, and shorter RCU timeouts to confirm this solution
> is robust. It might even be appropriate for the generic
> drivers, if they suffer from the problems that sm4 shows here.

I have a set of patches that's looking promising. It's no longer
generating RCU stall warnings or soft lockups with either x86 drivers
or generic drivers (sm4 is particularly taxing).

Test case:
* added 28 clones of the tcrypt module so modprobe can run it many
  times in parallel (1 thread per CPU core)
* added 1 MiB big buffer functional tests (compare to generic results)
* added 1 MiB big buffer speed tests
* 3 windows running
* 28 threads running
* modprobe with each defined test mode in order 1, 2, 3, etc.
* RCU stall timeouts set to shortest supported values
* run in preempt=none, preempt=voluntary, preempt=full modes

Patches include:
* Ard's kmap_local() patch
* Suppress RCU stall warnings during speed tests. Change the
  rcu_sysrq_start()/end() functions to be general purpose and call
  them from tcrypt test functions that measure time of a crypto
  operation
* add crypto_yield() unilaterally in skcipher_walk_done so it is run
  even if data is aligned
* add crypto_yield() in aead_encrypt/decrypt so they always call it
  like skcipher
* add crypto_yield() at the end of each hash update(), digest(), and
  finup() function so they always call it like skcipher
* add kernel_fpu_yield() calls every 4 KiB inside x86
  kernel_fpu_begin()/end() blocks, so the x86 functions always yield
  to the scheduler even when they're bypassing those helper functions
  (that now call crypto_yield() more consistently)

I'll keep trying to break it over the weekend. If it holds up I'll
post the patches next week.
diff --git a/arch/x86/crypto/nhpoly1305-avx2-glue.c b/arch/x86/crypto/nhpoly1305-avx2-glue.c
index 8ea5ab0f1ca7..f7dc9c563bb5 100644
--- a/arch/x86/crypto/nhpoly1305-avx2-glue.c
+++ b/arch/x86/crypto/nhpoly1305-avx2-glue.c
@@ -13,6 +13,9 @@
 #include <linux/sizes.h>
 #include <asm/simd.h>
 
+/* avoid kernel_fpu_begin/end scheduler/rcu stalls */
+static const unsigned int bytes_per_fpu = 337 * 1024;
+
 asmlinkage void nh_avx2(const u32 *key, const u8 *message, size_t message_len,
 			u8 hash[NH_HASH_BYTES]);
 
@@ -26,18 +29,20 @@ static void _nh_avx2(const u32 *key, const u8 *message, size_t message_len,
 static int nhpoly1305_avx2_update(struct shash_desc *desc,
 				  const u8 *src, unsigned int srclen)
 {
+	BUILD_BUG_ON(bytes_per_fpu == 0);
+
 	if (srclen < 64 || !crypto_simd_usable())
 		return crypto_nhpoly1305_update(desc, src, srclen);
 
-	do {
-		unsigned int n = min_t(unsigned int, srclen, SZ_4K);
+	while (srclen) {
+		unsigned int n = min(srclen, bytes_per_fpu);
 
 		kernel_fpu_begin();
 		crypto_nhpoly1305_update_helper(desc, src, n, _nh_avx2);
 		kernel_fpu_end();
 		src += n;
 		srclen -= n;
-	} while (srclen);
+	}
 	return 0;
 }
 
diff --git a/arch/x86/crypto/nhpoly1305-sse2-glue.c b/arch/x86/crypto/nhpoly1305-sse2-glue.c
index 2b353d42ed13..daffcc7019ad 100644
--- a/arch/x86/crypto/nhpoly1305-sse2-glue.c
+++ b/arch/x86/crypto/nhpoly1305-sse2-glue.c
@@ -13,6 +13,9 @@
 #include <linux/sizes.h>
 #include <asm/simd.h>
 
+/* avoid kernel_fpu_begin/end scheduler/rcu stalls */
+static const unsigned int bytes_per_fpu = 199 * 1024;
+
 asmlinkage void nh_sse2(const u32 *key, const u8 *message, size_t message_len,
 			u8 hash[NH_HASH_BYTES]);
 
@@ -26,18 +29,20 @@ static void _nh_sse2(const u32 *key, const u8 *message, size_t message_len,
 static int nhpoly1305_sse2_update(struct shash_desc *desc,
 				  const u8 *src, unsigned int srclen)
 {
+	BUILD_BUG_ON(bytes_per_fpu == 0);
+
 	if (srclen < 64 || !crypto_simd_usable())
 		return crypto_nhpoly1305_update(desc, src, srclen);
 
-	do {
-		unsigned int n = min_t(unsigned int, srclen, SZ_4K);
+	while (srclen) {
+		unsigned int n = min(srclen, bytes_per_fpu);
 
 		kernel_fpu_begin();
 		crypto_nhpoly1305_update_helper(desc, src, n, _nh_sse2);
 		kernel_fpu_end();
 		src += n;
 		srclen -= n;
-	} while (srclen);
+	}
 	return 0;
 }
 
diff --git a/arch/x86/crypto/poly1305_glue.c b/arch/x86/crypto/poly1305_glue.c
index 1dfb8af48a3c..16831c036d71 100644
--- a/arch/x86/crypto/poly1305_glue.c
+++ b/arch/x86/crypto/poly1305_glue.c
@@ -15,20 +15,27 @@
 #include <asm/intel-family.h>
 #include <asm/simd.h>
 
+#define POLY1305_BLOCK_SIZE_MASK	(~(POLY1305_BLOCK_SIZE - 1))
+
+/* avoid kernel_fpu_begin/end scheduler/rcu stalls */
+static const unsigned int bytes_per_fpu = 217 * 1024;
+
 asmlinkage void poly1305_init_x86_64(void *ctx,
 				     const u8 key[POLY1305_BLOCK_SIZE]);
 asmlinkage void poly1305_blocks_x86_64(void *ctx, const u8 *inp,
-				       const size_t len, const u32 padbit);
+				       const unsigned int len,
+				       const u32 padbit);
 asmlinkage void poly1305_emit_x86_64(void *ctx, u8 mac[POLY1305_DIGEST_SIZE],
 				     const u32 nonce[4]);
 asmlinkage void poly1305_emit_avx(void *ctx, u8 mac[POLY1305_DIGEST_SIZE],
 				  const u32 nonce[4]);
-asmlinkage void poly1305_blocks_avx(void *ctx, const u8 *inp, const size_t len,
-				    const u32 padbit);
-asmlinkage void poly1305_blocks_avx2(void *ctx, const u8 *inp, const size_t len,
-				     const u32 padbit);
+asmlinkage void poly1305_blocks_avx(void *ctx, const u8 *inp,
+				    const unsigned int len, const u32 padbit);
+asmlinkage void poly1305_blocks_avx2(void *ctx, const u8 *inp,
+				     const unsigned int len, const u32 padbit);
 asmlinkage void poly1305_blocks_avx512(void *ctx, const u8 *inp,
-				       const size_t len, const u32 padbit);
+				       const unsigned int len,
+				       const u32 padbit);
 
 static __ro_after_init DEFINE_STATIC_KEY_FALSE(poly1305_use_avx);
 static __ro_after_init DEFINE_STATIC_KEY_FALSE(poly1305_use_avx2);
@@ -86,14 +93,12 @@ static void poly1305_simd_init(void *ctx, const u8 key[POLY1305_BLOCK_SIZE])
 		poly1305_init_x86_64(ctx, key);
 }
 
-static void poly1305_simd_blocks(void *ctx, const u8 *inp, size_t len,
+static void poly1305_simd_blocks(void *ctx, const u8 *inp, unsigned int len,
 				 const u32 padbit)
 {
 	struct poly1305_arch_internal *state = ctx;
 
-	/* SIMD disables preemption, so relax after processing each page. */
-	BUILD_BUG_ON(SZ_4K < POLY1305_BLOCK_SIZE ||
-		     SZ_4K % POLY1305_BLOCK_SIZE);
+	BUILD_BUG_ON(bytes_per_fpu < POLY1305_BLOCK_SIZE);
 
 	if (!static_branch_likely(&poly1305_use_avx) ||
 	    (len < (POLY1305_BLOCK_SIZE * 18) && !state->is_base2_26) ||
@@ -103,8 +108,14 @@ static void poly1305_simd_blocks(void *ctx, const u8 *inp, size_t len,
 		return;
 	}
 
-	do {
-		const size_t bytes = min_t(size_t, len, SZ_4K);
+	while (len) {
+		unsigned int bytes;
+
+		if (len < POLY1305_BLOCK_SIZE)
+			bytes = len;
+		else
+			bytes = min(len,
+				    bytes_per_fpu & POLY1305_BLOCK_SIZE_MASK);
 
 		kernel_fpu_begin();
 		if (IS_ENABLED(CONFIG_AS_AVX512) &&
 		    static_branch_likely(&poly1305_use_avx512))
@@ -117,7 +128,7 @@ static void poly1305_simd_blocks(void *ctx, const u8 *inp, size_t len,
 
 		len -= bytes;
 		inp += bytes;
-	} while (len);
+	}
 }
 
 static void poly1305_simd_emit(void *ctx, u8 mac[POLY1305_DIGEST_SIZE],
diff --git a/arch/x86/crypto/polyval-clmulni_glue.c b/arch/x86/crypto/polyval-clmulni_glue.c
index b7664d018851..de1c908f7412 100644
--- a/arch/x86/crypto/polyval-clmulni_glue.c
+++ b/arch/x86/crypto/polyval-clmulni_glue.c
@@ -29,6 +29,9 @@
 
 #define NUM_KEY_POWERS	8
 
+/* avoid kernel_fpu_begin/end scheduler/rcu stalls */
+static const unsigned int bytes_per_fpu = 393 * 1024;
+
 struct polyval_tfm_ctx {
 	/*
 	 * These powers must be in the order h^8, ..., h^1.
@@ -107,6 +110,8 @@ static int polyval_x86_update(struct shash_desc *desc,
 	unsigned int nblocks;
 	unsigned int n;
 
+	BUILD_BUG_ON(bytes_per_fpu < POLYVAL_BLOCK_SIZE);
+
 	if (dctx->bytes) {
 		n = min(srclen, dctx->bytes);
 		pos = dctx->buffer + POLYVAL_BLOCK_SIZE - dctx->bytes;
@@ -123,8 +128,7 @@ static int polyval_x86_update(struct shash_desc *desc,
 	}
 
 	while (srclen >= POLYVAL_BLOCK_SIZE) {
-		/* Allow rescheduling every 4K bytes. */
-		nblocks = min(srclen, 4096U) / POLYVAL_BLOCK_SIZE;
+		nblocks = min(srclen, bytes_per_fpu) / POLYVAL_BLOCK_SIZE;
 		internal_polyval_update(tctx, src, nblocks, dctx->buffer);
 		srclen -= nblocks * POLYVAL_BLOCK_SIZE;
 		src += nblocks * POLYVAL_BLOCK_SIZE;