Message ID | 20230720173956.3674987-2-glider@google.com |
---|---|
State | New |
Headers |
Date: Thu, 20 Jul 2023 19:39:52 +0200
From: Alexander Potapenko <glider@google.com>
To: glider@google.com, catalin.marinas@arm.com, will@kernel.org, pcc@google.com, andreyknvl@gmail.com, andriy.shevchenko@linux.intel.com, linux@rasmusvillemoes.dk, yury.norov@gmail.com
Cc: linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org, eugenis@google.com, syednwaris@gmail.com, william.gray@linaro.org, Arnd Bergmann <arnd@arndb.de>
Subject: [PATCH v4 1/5] lib/bitmap: add bitmap_{set,get}_value()
Message-ID: <20230720173956.3674987-2-glider@google.com>
In-Reply-To: <20230720173956.3674987-1-glider@google.com>
References: <20230720173956.3674987-1-glider@google.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8" |
Series | Implement MTE tag compression for swapped pages |
Commit Message
Alexander Potapenko
July 20, 2023, 5:39 p.m. UTC
From: Syed Nayyar Waris <syednwaris@gmail.com>

The two new functions allow setting/getting values of length up to
BITS_PER_LONG bits at arbitrary position in the bitmap.

The code was taken from "bitops: Introduce the for_each_set_clump macro"
by Syed Nayyar Waris with a couple of minor changes:

 - instead of using roundup(), which adds an unnecessary dependency
   on <linux/math.h>, we calculate space as BITS_PER_LONG-offset;
 - indentation is reduced by not using else-clauses (suggested by
   checkpatch for bitmap_get_value())

Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Syed Nayyar Waris <syednwaris@gmail.com>
Signed-off-by: William Breathitt Gray <william.gray@linaro.org>
Link: https://lore.kernel.org/lkml/fe12eedf3666f4af5138de0e70b67a07c7f40338.1592224129.git.syednwaris@gmail.com/
Suggested-by: Yury Norov <yury.norov@gmail.com>
Co-developed-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Alexander Potapenko <glider@google.com>

---
 v4:
  - Address comments by Andy Shevchenko and Yury Norov:
    - prevent passing values >= 64 to GENMASK()
    - fix commit authorship
    - change comments
    - check for unlikely(nbits==0)
    - drop unnecessary const declarations
    - fix kernel-doc comments
    - rename bitmap_{get,set}_value() to bitmap_{read,write}()
---
 include/linux/bitmap.h | 60 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 60 insertions(+)
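For orientation, here is a minimal usage sketch of the two helpers added by this patch (bitmap_read()/bitmap_write(), per the v4 rename noted above). The buffer size, offset and value below are made up for illustration:

===========================================================
#include <linux/bitmap.h>

static void bitmap_rw_example(void)
{
	/* A 128-bit map: two unsigned longs on 64-bit architectures. */
	DECLARE_BITMAP(map, 128);
	unsigned long v;

	bitmap_zero(map, 128);

	/* Write a 12-bit value at bit offset 58, crossing a word boundary. */
	bitmap_write(map, 0xABC, 58, 12);

	/* Read back the same 12 bits: v == 0xABC. */
	v = bitmap_read(map, 58, 12);
}
===========================================================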
Comments
On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:
> +/**
> + * bitmap_write - write n-bit value within a memory region
> + * @map: address to the bitmap memory region
> + * @value: value of nbits
> + * @start: bit offset of the n-bit value
> + * @nbits: size of value in bits, up to BITS_PER_LONG
> + */
> +static inline void bitmap_write(unsigned long *map,
> +				unsigned long value,
> +				unsigned long start, unsigned long nbits)
> +{
> +	size_t index = BIT_WORD(start);
> +	unsigned long offset = start % BITS_PER_LONG;
> +	unsigned long space = BITS_PER_LONG - offset;
> +
> +	if (unlikely(!nbits))
> +		return;
> +	value &= GENMASK(nbits - 1, 0);

Strictly speaking, a 'value' shouldn't contain set bits beyond nbits,
because otherwise it's an out-of-bounds type of error.

This is kind of a gray zone to me, because it's the caller's
responsibility to provide correct input. But on the other hand, it would
be a great headache debugging corrupted bitmaps.

Now that we've got a single user of bitmap_write(), and even more, it's
wrapped with a helper, I think it would be reasonable to trim 'value' in
the helper, if needed.

Anyways, the comment must warn about that loudly...

> +	if (space >= nbits) {
> +		map[index] &= ~(GENMASK(nbits - 1, 0) << offset);

'GENMASK(nbits - 1, 0) << offset' looks really silly.

> +		map[index] |= value << offset;

Here you commit 2 reads and 2 writes for a word instead of one.

> +		return;
> +	}
> +	map[index] &= ~BITMAP_FIRST_WORD_MASK(start);

~FIRST_WORD is the same as LAST_WORD. I tried to replace it, and it
saves ~25 bytes of .text on x86_64.

> +	map[index] |= value << offset;
> +	map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
> +	map[index + 1] |= (value >> space);
> +}

With all that, I think the implementation should look something like
this:

	unsigned long w, end;

	if (unlikely(nbits == 0))
		return;

	map += BIT_WORD(start);
	start %= BITS_PER_LONG;
	end = start + nbits - 1;

	w = *map & (end < BITS_PER_LONG ? ~GENMASK(end, start) :
					  BITMAP_LAST_WORD_MASK(start));
	*map = w | (value << start);

	if (end < BITS_PER_LONG)
		return;

	w = *++map & BITMAP_FIRST_WORD_MASK(end);
	*map = w | (value >> (BITS_PER_LONG * 2 - end));

It's not tested, except the /s/~FIRST_WORD/LAST_WORD/ part, but I expect
it should be more efficient.

Alexander, can you please try the above and compare?

In the previous iteration, I asked you to share disassembly listings
for the functions. Can you please do that now?

Regarding the rest of the series:
 - I still see Evgenii's name in mtecomp.c, and EA0 references;
 - git-am throws a warning about a trailing line;
 - checkpatch warns 7 times;

Can you fix all the above before sending the new version?

Have you tested the generic part against BE32, BE64 and LE32
architectures, and the arch part against BE64? If not, please do.

You're mentioning that the compression ratio is 2 to 20x. Can you share
the absolute numbers? If it's 1k vs 2k, I think most people just don't
care...

Can you share the code that you used to measure the compression ratio?
Would it make sense to export the numbers via sysfs?

Thanks,
Yury
On Sat, Jul 22, 2023 at 06:57:26PM -0700, Yury Norov wrote:
> On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:
> > +/**
> > + * bitmap_write - write n-bit value within a memory region
> > + * @map: address to the bitmap memory region
> > + * @value: value of nbits
> > + * @start: bit offset of the n-bit value
> > + * @nbits: size of value in bits, up to BITS_PER_LONG
> > + */
> > +static inline void bitmap_write(unsigned long *map,
> > +				unsigned long value,
> > +				unsigned long start, unsigned long nbits)
> > +{
> > +	size_t index = BIT_WORD(start);
> > +	unsigned long offset = start % BITS_PER_LONG;
> > +	unsigned long space = BITS_PER_LONG - offset;
> > +
> > +	if (unlikely(!nbits))
> > +		return;
> > +	value &= GENMASK(nbits - 1, 0);
>
> Strictly speaking, a 'value' shouldn't contain set bits beyond nbits,
> because otherwise it's an out-of-bounds type of error.
>
> This is kind of a gray zone to me, because it's the caller's
> responsibility to provide correct input. But on the other hand, it would
> be a great headache debugging corrupted bitmaps.
>
> Now that we've got a single user of bitmap_write(), and even more, it's
> wrapped with a helper, I think it would be reasonable to trim 'value' in
> the helper, if needed.
>
> Anyways, the comment must warn about that loudly...

OK. I spent a night with that, and I'm still not sure. Pseudocode that
implements it looks like this:

	for (bit = 0; bit < nbits; bit++)
		assign_bit(start + bit, bitmap, val & BIT(bit));

And it ignores trailing bits. So if we're optimizing this pattern, we'd
ignore those bits just as well...

Either way, whatever we decide, let's stay clear about our intentions
and mention explicitly that tail bits either must be zero, or are
ignored.

Alexander, can you add the snippet above to the comments for
bitmap_write() and bitmap_read(), as well as to the test? Also, if we
decide to clear the tail of the input value, would
BITMAP_LAST_WORD_MASK() generate better code than GENMASK(nbits - 1, 0)
does?

Comments are very appreciated.

Thanks,
Yury
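For readers following along, the loop-based semantics above can be written out as a freestanding reference (a sketch in portable C with a local stand-in for the kernel's assign_bit(); the names assign_bit_ref() and bitmap_write_ref() are illustrative, not kernel API):

===========================================================
#include <stdbool.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))

/* Local stand-in for the kernel's assign_bit(). */
static void assign_bit_ref(unsigned long nr, unsigned long *bitmap, bool set)
{
	unsigned long mask = 1UL << (nr % BITS_PER_LONG);

	if (set)
		bitmap[nr / BITS_PER_LONG] |= mask;
	else
		bitmap[nr / BITS_PER_LONG] &= ~mask;
}

/*
 * Bit-by-bit reference for bitmap_write(): bits of 'val' beyond 'nbits'
 * are never looked at, matching the "trailing bits are ignored" reading.
 */
static void bitmap_write_ref(unsigned long *bitmap, unsigned long val,
			     unsigned long start, unsigned long nbits)
{
	for (unsigned long bit = 0; bit < nbits; bit++)
		assign_bit_ref(start + bit, bitmap, val & (1UL << bit));
}
===========================================================

Such a reference is convenient in a test: cross-checking the optimized bitmap_write() against it over random inputs pins down the tail-bit semantics either way.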
On Sat, Jul 22, 2023 at 06:57:23PM -0700, Yury Norov wrote:
> On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:

> > +	map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
>
> 'GENMASK(nbits - 1, 0) << offset' looks really silly.

But you followed the thread to get a clue why it's written in this form,
right?

...

> With all that, I think the implementation should look something like
> this:

I would go this way if and only if the code generation on the main
architectures with both GCC and clang is better.

And maybe even some performance tests need to be provided.

...

> Alexander, can you please try the above and compare?
>
> In the previous iteration, I asked you to share disassembly listings
> for the functions. Can you please do that now?

Exactly what we need before going with your suggestion(s).
On Mon, Jul 24, 2023 at 11:36:36AM +0300, Andy Shevchenko wrote:
> On Sat, Jul 22, 2023 at 06:57:23PM -0700, Yury Norov wrote:
> > On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:
>
> > > +	map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
> >
> > 'GENMASK(nbits - 1, 0) << offset' looks really silly.
>
> But you followed the thread to get a clue why it's written in this form,
> right?

Yes, I did. But I don't expect everyone looking at kernel code to spend
time recovering the discussions that explain why that happened. So, at
the least, it would be fine to drop in a comment.

> ...
>
> > With all that, I think the implementation should look something like
> > this:
>
> I would go this way if and only if the code generation on the main
> architectures with both GCC and clang is better.
>
> And maybe even some performance tests need to be provided.

For the following implementation:

void my_bitmap_write(unsigned long *map, unsigned long value,
		     unsigned long start, unsigned long nbits)
{
	unsigned long w, end;

	if (unlikely(nbits == 0))
		return;

	value &= GENMASK(nbits - 1, 0);
	map += BIT_WORD(start);
	start %= BITS_PER_LONG;
	end = start + nbits - 1;

	w = *map & (end < BITS_PER_LONG ? ~GENMASK(end, start) :
					  BITMAP_LAST_WORD_MASK(start));
	*map = w | (value << start);

	if (end < BITS_PER_LONG)
		return;

	w = *++map & BITMAP_LAST_WORD_MASK(end + 1 - BITS_PER_LONG);
	*map = w | (value >> (BITS_PER_LONG - start));
}

this is the bloat-o-meter output:

$ scripts/bloat-o-meter lib/test_bitmap.o.orig lib/test_bitmap.o
add/remove: 8/0 grow/shrink: 1/0 up/down: 2851/0 (2851)
Function                        old     new   delta
test_bitmap_init               3846    5484   +1638
test_bitmap_write_perf            -     401    +401
bitmap_write                      -     271    +271
my_bitmap_write                   -     248    +248
bitmap_read                       -     229    +229
__pfx_test_bitmap_write_perf      -      16     +16
__pfx_my_bitmap_write             -      16     +16
__pfx_bitmap_write                -      16     +16
__pfx_bitmap_read                 -      16     +16
Total: Before=36964, After=39815, chg +7.71%

And this is the performance test:

	for (cnt = 0; cnt < 5; cnt++) {
		time = ktime_get();
		for (nbits = 1; nbits <= BITS_PER_LONG; nbits++) {
			for (i = 0; i < 1000; i++) {
				if (i + nbits > 1000)
					break;
				bitmap_write(bmap, val, i, nbits);
			}
		}
		time = ktime_get() - time;
		pr_err("bitmap_write:\t%llu\t", time);

		time = ktime_get();
		for (nbits = 1; nbits <= BITS_PER_LONG; nbits++) {
			for (i = 0; i < 1000; i++) {
				if (i + nbits > 1000)
					break;
				my_bitmap_write(bmap, val, i, nbits);
			}
		}
		time = ktime_get() - time;
		pr_cont("%llu\n", time);
	}

Which on x86_64/kvm with GCC gives:

					 Orig	My
[    1.630731] test_bitmap: bitmap_write:	299092	252764
[    1.631584] test_bitmap: bitmap_write:	299522	252554
[    1.632429] test_bitmap: bitmap_write:	299171	258665
[    1.633280] test_bitmap: bitmap_write:	299241	252794
[    1.634133] test_bitmap: bitmap_write:	306716	252934

So, it's a ~15% difference in performance and 8% in size. I don't insist
on my implementation, but I think we'd better experiment more with code
generation.

Thanks,
Yury
On Mon, Jul 24, 2023 at 10:04:34PM -0700, Yury Norov wrote:
> On Mon, Jul 24, 2023 at 11:36:36AM +0300, Andy Shevchenko wrote:
> > On Sat, Jul 22, 2023 at 06:57:23PM -0700, Yury Norov wrote:
> > > On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:

...

> > > 'GENMASK(nbits - 1, 0) << offset' looks really silly.
> >
> > But you followed the thread to get a clue why it's written in this form,
> > right?
>
> Yes, I did. But I don't expect everyone looking at kernel code to spend
> time recovering the discussions that explain why that happened. So, at
> the least, it would be fine to drop in a comment.

See also below.

...

> 	w = *map & (end < BITS_PER_LONG ? ~GENMASK(end, start) :
> 					  BITMAP_LAST_WORD_MASK(start));

This GENMASK() may generate worse code, as the compiler uses more
instructions for it than for the simple approach above.

...

> bitmap_write                      -     271    +271
> my_bitmap_write                   -     248    +248
> bitmap_read                       -     229    +229

my_ -- means your proposal? Do you mean you have it better in size than
the original one?
On Sun, Jul 23, 2023 at 3:57 AM Yury Norov <yury.norov@gmail.com> wrote:
>
> On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:
> > +/**
> > + * bitmap_write - write n-bit value within a memory region
> > + * @map: address to the bitmap memory region
> > + * @value: value of nbits
> > + * @start: bit offset of the n-bit value
> > + * @nbits: size of value in bits, up to BITS_PER_LONG
> > + */
> > +static inline void bitmap_write(unsigned long *map,
> > +				unsigned long value,
> > +				unsigned long start, unsigned long nbits)
> > +{
> > +	size_t index = BIT_WORD(start);
> > +	unsigned long offset = start % BITS_PER_LONG;
> > +	unsigned long space = BITS_PER_LONG - offset;
> > +
> > +	if (unlikely(!nbits))
> > +		return;
> > +	value &= GENMASK(nbits - 1, 0);
>
> Strictly speaking, a 'value' shouldn't contain set bits beyond nbits,
> because otherwise it's an out-of-bounds type of error.

I can easily imagine someone passing -1 (or ~0) as a value, but wanting
to only write n bits of it.

> This is kind of a gray zone to me, because it's the caller's
> responsibility to provide correct input. But on the other hand, it would
> be a great headache debugging corrupted bitmaps.
>
> Now that we've got a single user of bitmap_write(), and even more, it's
> wrapped with a helper, I think it would be reasonable to trim 'value' in
> the helper, if needed.
>
> Anyways, the comment must warn about that loudly...
>
> > +	if (space >= nbits) {
> > +		map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
>
> 'GENMASK(nbits - 1, 0) << offset' looks really silly.

As noted in the other patch discussion, pulling offset into GENMASK is
actually not an identity transformation, because nbits + offset may
exceed 64, producing an invalid mask.

> > +		map[index] |= value << offset;
>
> Here you commit 2 reads and 2 writes for a word instead of one.

There won't be two reads and two writes. The compiler will read
map[index] into a register, apply the mask, then OR in value << offset,
and only then perform the write. Unless map[] is volatile, repeated
reads/writes will be optimized out by any decent compiler.

> > +		return;
> > +	}
> > +	map[index] &= ~BITMAP_FIRST_WORD_MASK(start);
>
> ~FIRST_WORD is the same as LAST_WORD. I tried to replace it, and it
> saves ~25 bytes of .text on x86_64.
>
> > +	map[index] |= value << offset;
> > +	map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
> > +	map[index + 1] |= (value >> space);
> > +}
>
> With all that, I think the implementation should look something like
> this:
>
> 	unsigned long w, end;
>
> 	if (unlikely(nbits == 0))
> 		return;
>
> 	map += BIT_WORD(start);
> 	start %= BITS_PER_LONG;
> 	end = start + nbits - 1;
>
> 	w = *map & (end < BITS_PER_LONG ? ~GENMASK(end, start) :
> 					  BITMAP_LAST_WORD_MASK(start));
> 	*map = w | (value << start);
>
> 	if (end < BITS_PER_LONG)
> 		return;
>
> 	w = *++map & BITMAP_FIRST_WORD_MASK(end);
> 	*map = w | (value >> (BITS_PER_LONG * 2 - end));
>
> It's not tested, except the /s/~FIRST_WORD/LAST_WORD/ part, but I expect
> it should be more efficient.
>
> Alexander, can you please try the above and compare?

I like the idea of sharing the first write between the branches, and it
can be made even shorter:

===========================================================
void bitmap_write_new(unsigned long *map, unsigned long value,
		      unsigned long start, unsigned long nbits)
{
	unsigned long offset;
	unsigned long space;
	size_t index;
	bool fit;

	if (unlikely(!nbits))
		return;

	value &= GENMASK(nbits - 1, 0);
	offset = start % BITS_PER_LONG;
	space = BITS_PER_LONG - offset;
	index = BIT_WORD(start);
	fit = space >= nbits;

	map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :
				~BITMAP_FIRST_WORD_MASK(start));
	map[index] |= value << offset;
	if (fit)
		return;

	map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
	map[index + 1] |= (value >> space);
}
===========================================================

According to Godbolt (https://godbolt.org/z/n5Te779bf), this function is
32 bytes shorter than yours under x86 Clang, and 8 bytes shorter under
GCC (which, on the other hand, does a poor job optimizing both).

Overall, given that there's currently a single user of these functions,
isn't it premature to optimize them without knowing anything about their
performance?

> In the previous iteration, I asked you to share disassembly listings
> for the functions. Can you please do that now?

Will Godbolt work for you (see above)?

> Regarding the rest of the series:
>  - I still see Evgenii's name in mtecomp.c, and EA0 references;

Will fix, thanks!

>  - git-am throws a warning about a trailing line;

Interesting. I generate the patches using `git format-patch`; I thought
`git am` should be the inverse of it. I'll check.

>  - checkpatch warns 7 times;

Indeed, there were warnings that I ignored (e.g. one of them was caused
by me adding extern symbols to the test module, as requested during the
review process). I think I can fix all of them.

> Can you fix all the above before sending the new version?
>
> Have you tested the generic part against BE32, BE64 and LE32
> architectures, and the arch part against BE64? If not, please do.

BE64 works; I'll also need to check the 32-bit versions. Note that MTE
is an ARM64 feature (yet we still need to ensure bitops.h works on 32
bits).

> You're mentioning that the compression ratio is 2 to 20x. Can you share
> the absolute numbers? If it's 1k vs 2k, I think most people just don't
> care...

I'll provide the exact numbers with the next patch series. Last time I
checked, the order of magnitude was tens of megabytes.

> Can you share the code that you used to measure the compression ratio?
> Would it make sense to export the numbers via sysfs?

For out-of-line allocations the data can be derived from /proc/slabinfo,
but we don't calculate inline allocations. Agreed, a debugfs interface
won't hurt.
On Wed, Jul 26, 2023 at 10:08:28AM +0200, Alexander Potapenko wrote:
> On Sun, Jul 23, 2023 at 3:57 AM Yury Norov <yury.norov@gmail.com> wrote:
> >
> > On Thu, Jul 20, 2023 at 07:39:52PM +0200, Alexander Potapenko wrote:
> >
> > Strictly speaking, a 'value' shouldn't contain set bits beyond nbits,
> > because otherwise it's an out-of-bounds type of error.
>
> I can easily imagine someone passing -1 (or ~0) as a value, but wanting
> to only write n bits of it.

This is an abuse of the new API, because we've got bitmap_set() for
that. But whatever, let's keep the masking.

...

> I like the idea of sharing the first write between the branches, and it
> can be made even shorter:
>
> ===========================================================
> void bitmap_write_new(unsigned long *map, unsigned long value,
> 		      unsigned long start, unsigned long nbits)
> {
> 	unsigned long offset;
> 	unsigned long space;
> 	size_t index;
> 	bool fit;
>
> 	if (unlikely(!nbits))
> 		return;
>
> 	value &= GENMASK(nbits - 1, 0);
> 	offset = start % BITS_PER_LONG;
> 	space = BITS_PER_LONG - offset;
> 	index = BIT_WORD(start);
> 	fit = space >= nbits;

space >= nbits			<=>
BITS_PER_LONG - offset >= nbits	<=>
offset + nbits <= BITS_PER_LONG

> 	map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :

So here GENMASK(nbits + offset - 1, offset) is at max
GENMASK(BITS_PER_LONG - 1, offset), and it never overflows, which is my
point. Does it make sense?

> 				~BITMAP_FIRST_WORD_MASK(start));

As I said, ~BITMAP_FIRST_WORD_MASK() is the same as
BITMAP_LAST_WORD_MASK() and vice versa.

> 	map[index] |= value << offset;
> 	if (fit)
> 		return;
>
> 	map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
> 	map[index + 1] |= (value >> space);
> }
> ===========================================================
>
> According to Godbolt (https://godbolt.org/z/n5Te779bf), this function is
> 32 bytes shorter than yours under x86 Clang, and 8 bytes shorter under
> GCC (which, on the other hand, does a poor job optimizing both).
>
> Overall, given that there's currently a single user of these functions,
> isn't it premature to optimize them without knowing anything about their
> performance?
>
> > In the previous iteration, I asked you to share disassembly listings
> > for the functions. Can you please do that now?
>
> Will Godbolt work for you (see above)?

I don't know for how long an external resource will keep the reference
alive. My SSD keeps emails long enough.

...

> > You're mentioning that the compression ratio is 2 to 20x. Can you share
> > the absolute numbers? If it's 1k vs 2k, I think most people just don't
> > care...
>
> I'll provide the exact numbers with the next patch series. Last time I
> checked, the order of magnitude was tens of megabytes.

That's impressive. Fruitful idea. It would be important for embedded
guys who may disable MTE because of memory overhead. I think it's worth
mentioning that in Kconfig, together with the associated performance
overhead, if it's ever measurable.

> > Can you share the code that you used to measure the compression ratio?
> > Would it make sense to export the numbers via sysfs?
>
> For out-of-line allocations the data can be derived from /proc/slabinfo,
> but we don't calculate inline allocations. Agreed, a debugfs interface
> won't hurt.
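Yury's bound argument above is easy to confirm by brute force (a userspace sketch with a local copy of the kernel's GENMASK() definition; illustrative only):

===========================================================
#include <assert.h>

#define BITS_PER_LONG 64
/* Local copy of the kernel's GENMASK() definition, for illustration. */
#define GENMASK(h, l) \
	(((~0UL) - (1UL << (l)) + 1) & (~0UL >> (BITS_PER_LONG - 1 - (h))))

int main(void)
{
	unsigned long offset, nbits;

	/*
	 * In the 'fit' case, offset + nbits <= BITS_PER_LONG, so folding
	 * the offset into GENMASK() never requests a bit above 63, and
	 * the two mask forms coincide.
	 */
	for (offset = 0; offset < BITS_PER_LONG; offset++)
		for (nbits = 1; nbits <= BITS_PER_LONG - offset; nbits++)
			assert(GENMASK(nbits - 1 + offset, offset) ==
			       (GENMASK(nbits - 1, 0) << offset));
	return 0;
}
===========================================================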
> space >= nbits			<=>
> BITS_PER_LONG - offset >= nbits	<=>
> offset + nbits <= BITS_PER_LONG
>
> > 	map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :
>
> So here GENMASK(nbits + offset - 1, offset) is at max
> GENMASK(BITS_PER_LONG - 1, offset), and it never overflows, which is my
> point. Does it make sense?

It indeed does. Perhaps pulling offset inside GENMASK is not a bug after
all (a simple test does not show any difference between their behavior).
But `GENMASK(nbits - 1 + offset, offset)` blows up the code (see below).
My guess is that this happens because the compiler fails to reuse the
value of `GENMASK(nbits - 1, 0)` used to clamp the value to write, and
calculates `GENMASK(nbits - 1 + offset, offset)` from scratch.

> > 				~BITMAP_FIRST_WORD_MASK(start));
>
> As I said, ~BITMAP_FIRST_WORD_MASK() is the same as
> BITMAP_LAST_WORD_MASK() and vice versa.

Surprisingly, ~BITMAP_FIRST_WORD_MASK() generates better code than
BITMAP_LAST_WORD_MASK().

> > 	map[index] |= value << offset;
> > 	if (fit)
> > 		return;
> >
> > 	map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);

OTOH I managed to shave three more bytes off by replacing
~BITMAP_LAST_WORD_MASK with BITMAP_FIRST_WORD_MASK here.

> > 	map[index + 1] |= (value >> space);
> > }

I'll post the implementations together with the disassembly below. I
used some Clang 17.0.0 version that is a couple of months behind
upstream, but that still produces consistently shorter code (~48 bytes
less) than the trunk GCC on Godbolt.

1. Original implementation of bitmap_write() from this patch - 164 bytes
(interestingly, it's 157 bytes with Clang 14.0.6):

==================================================================
void bitmap_write(unsigned long *map, unsigned long value,
		  unsigned long start, unsigned long nbits)
{
	size_t index = BIT_WORD(start);
	unsigned long offset = start % BITS_PER_LONG;
	unsigned long space = BITS_PER_LONG - offset;

	if (unlikely(!nbits))
		return;
	value &= GENMASK(nbits - 1, 0);
	if (space >= nbits) {
		map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
		map[index] |= value << offset;
		return;
	}
	map[index] &= ~BITMAP_FIRST_WORD_MASK(start);
	map[index] |= value << offset;
	map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
	map[index + 1] |= (value >> space);
}
==================================================================
0000000000001200 <bitmap_write>:
    1200: 41 57  push %r15
    1202: 41 56  push %r14
    1204: 53  push %rbx
    1205: 48 85 c9  test %rcx,%rcx
    1208: 74 7b  je 1285 <bitmap_write+0x85>
    120a: 48 89 c8  mov %rcx,%rax
    120d: 49 89 d2  mov %rdx,%r10
    1210: 49 c1 ea 06  shr $0x6,%r10
    1214: 41 89 d1  mov %edx,%r9d
    1217: f6 d9  neg %cl
    1219: 49 c7 c7 ff ff ff ff  mov $0xffffffffffffffff,%r15
    1220: 49 d3 ef  shr %cl,%r15
    1223: 41 83 e1 3f  and $0x3f,%r9d
    1227: 41 b8 40 00 00 00  mov $0x40,%r8d
    122d: 4c 21 fe  and %r15,%rsi
    1230: 48 89 f3  mov %rsi,%rbx
    1233: 44 89 c9  mov %r9d,%ecx
    1236: 48 d3 e3  shl %cl,%rbx
    1239: 4d 29 c8  sub %r9,%r8
    123c: 49 c7 c3 ff ff ff ff  mov $0xffffffffffffffff,%r11
    1243: 4e 8b 34 d7  mov (%rdi,%r10,8),%r14
    1247: 49 39 c0  cmp %rax,%r8
    124a: 73 3f  jae 128b <bitmap_write+0x8b>
    124c: 49 c7 c7 ff ff ff ff  mov $0xffffffffffffffff,%r15
    1253: 44 89 c9  mov %r9d,%ecx
    1256: 49 d3 e7  shl %cl,%r15
    1259: 49 f7 d7  not %r15
    125c: 4d 21 fe  and %r15,%r14
    125f: 49 09 de  or %rbx,%r14
    1262: 4e 89 34 d7  mov %r14,(%rdi,%r10,8)
    1266: 01 c2  add %eax,%edx
    1268: f6 da  neg %dl
    126a: 89 d1  mov %edx,%ecx
    126c: 49 d3 eb  shr %cl,%r11
    126f: 49 f7 d3  not %r11
    1272: 44 89 c1  mov %r8d,%ecx
    1275: 48 d3 ee  shr %cl,%rsi
    1278: 4e 23 5c d7 08  and 0x8(%rdi,%r10,8),%r11
    127d: 4c 09 de  or %r11,%rsi
    1280: 4a 89 74 d7 08  mov %rsi,0x8(%rdi,%r10,8)
    1285: 5b  pop %rbx
    1286: 41 5e  pop %r14
    1288: 41 5f  pop %r15
    128a: c3  ret
    128b: 44 89 c9  mov %r9d,%ecx
    128e: 49 d3 e7  shl %cl,%r15
    1291: 49 f7 d7  not %r15
    1294: 4d 21 fe  and %r15,%r14
    1297: 49 09 de  or %rbx,%r14
    129a: 4e 89 34 d7  mov %r14,(%rdi,%r10,8)
    129e: 5b  pop %rbx
    129f: 41 5e  pop %r14
    12a1: 41 5f  pop %r15
    12a3: c3  ret
==================================================================

2. Your implementation, my_bitmap_write() - 152 bytes (it used to be
slightly worse with Clang 14.0.6 - 159 bytes):

==================================================================
void my_bitmap_write(unsigned long *map, unsigned long value,
		     unsigned long start, unsigned long nbits)
{
	unsigned long w, end;

	if (unlikely(nbits == 0))
		return;

	value &= GENMASK(nbits - 1, 0);
	map += BIT_WORD(start);
	start %= BITS_PER_LONG;
	end = start + nbits - 1;

	w = *map & (end < BITS_PER_LONG ? ~GENMASK(end, start) :
					  BITMAP_LAST_WORD_MASK(start));
	*map = w | (value << start);

	if (end < BITS_PER_LONG)
		return;

	w = *++map & BITMAP_LAST_WORD_MASK(end + 1 - BITS_PER_LONG);
	*map = w | (value >> (BITS_PER_LONG - start));
}
==================================================================
0000000000001160 <my_bitmap_write>:
    1160: 48 85 c9  test %rcx,%rcx
    1163: 0f 84 8e 00 00 00  je 11f7 <my_bitmap_write+0x97>
    1169: 49 89 c9  mov %rcx,%r9
    116c: f6 d9  neg %cl
    116e: 48 d3 e6  shl %cl,%rsi
    1171: 48 d3 ee  shr %cl,%rsi
    1174: 49 89 d2  mov %rdx,%r10
    1177: 49 c1 ea 06  shr $0x6,%r10
    117b: 89 d0  mov %edx,%eax
    117d: 83 e0 3f  and $0x3f,%eax
    1180: 4e 8d 04 08  lea (%rax,%r9,1),%r8
    1184: 4a 8d 0c 08  lea (%rax,%r9,1),%rcx
    1188: 48 ff c9  dec %rcx
    118b: 4e 8b 0c d7  mov (%rdi,%r10,8),%r9
    118f: 48 83 f9 3f  cmp $0x3f,%rcx
    1193: 77 29  ja 11be <my_bitmap_write+0x5e>
    1195: 41 f6 d8  neg %r8b
    1198: 48 c7 c2 ff ff ff ff  mov $0xffffffffffffffff,%rdx
    119f: 44 89 c1  mov %r8d,%ecx
    11a2: 48 d3 ea  shr %cl,%rdx
    11a5: 89 c1  mov %eax,%ecx
    11a7: 48 d3 ea  shr %cl,%rdx
    11aa: 48 d3 e2  shl %cl,%rdx
    11ad: 48 d3 e6  shl %cl,%rsi
    11b0: 48 f7 d2  not %rdx
    11b3: 49 21 d1  and %rdx,%r9
    11b6: 4c 09 ce  or %r9,%rsi
    11b9: 4a 89 34 d7  mov %rsi,(%rdi,%r10,8)
    11bd: c3  ret
    11be: f6 da  neg %dl
    11c0: 89 d1  mov %edx,%ecx
    11c2: 49 d3 e1  shl %cl,%r9
    11c5: 49 d3 e9  shr %cl,%r9
    11c8: 48 89 f2  mov %rsi,%rdx
    11cb: 89 c1  mov %eax,%ecx
    11cd: 48 d3 e2  shl %cl,%rdx
    11d0: 4c 09 ca  or %r9,%rdx
    11d3: 4a 89 14 d7  mov %rdx,(%rdi,%r10,8)
    11d7: 4a 8b 54 d7 08  mov 0x8(%rdi,%r10,8),%rdx
    11dc: 41 f6 d8  neg %r8b
    11df: 44 89 c1  mov %r8d,%ecx
    11e2: 48 d3 e2  shl %cl,%rdx
    11e5: 48 d3 ea  shr %cl,%rdx
    11e8: f6 d8  neg %al
    11ea: 89 c1  mov %eax,%ecx
    11ec: 48 d3 ee  shr %cl,%rsi
    11ef: 48 09 d6  or %rdx,%rsi
    11f2: 4a 89 74 d7 08  mov %rsi,0x8(%rdi,%r10,8)
    11f7: c3  ret
==================================================================

3. My improved version built on top of yours and mentioned above under
the name bitmap_write_new() - 116 bytes:

==================================================================
void bitmap_write_new(unsigned long *map, unsigned long value,
		      unsigned long start, unsigned long nbits)
{
	unsigned long offset;
	unsigned long space;
	size_t index;
	bool fit;

	if (unlikely(!nbits))
		return;

	value &= GENMASK(nbits - 1, 0);
	offset = start % BITS_PER_LONG;
	space = BITS_PER_LONG - offset;
	index = BIT_WORD(start);
	fit = space >= nbits;

	map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :
				~BITMAP_FIRST_WORD_MASK(start));
	map[index] |= value << offset;
	if (fit)
		return;

	map[index + 1] &= BITMAP_FIRST_WORD_MASK(start + nbits);
	map[index + 1] |= (value >> space);
}
==================================================================
00000000000012b0 <bitmap_write_new>:
    12b0: 48 85 c9  test %rcx,%rcx
    12b3: 74 6e  je 1323 <bitmap_write_new+0x73>
    12b5: 48 89 c8  mov %rcx,%rax
    12b8: f6 d9  neg %cl
    12ba: 49 c7 c2 ff ff ff ff  mov $0xffffffffffffffff,%r10
    12c1: 49 d3 ea  shr %cl,%r10
    12c4: 49 c7 c3 ff ff ff ff  mov $0xffffffffffffffff,%r11
    12cb: 4c 21 d6  and %r10,%rsi
    12ce: 89 d1  mov %edx,%ecx
    12d0: 83 e1 3f  and $0x3f,%ecx
    12d3: 41 b8 40 00 00 00  mov $0x40,%r8d
    12d9: 49 29 c8  sub %rcx,%r8
    12dc: 49 89 d1  mov %rdx,%r9
    12df: 49 c1 e9 06  shr $0x6,%r9
    12e3: 49 39 c0  cmp %rax,%r8
    12e6: 4d 0f 42 d3  cmovb %r11,%r10
    12ea: 49 d3 e2  shl %cl,%r10
    12ed: 49 f7 d2  not %r10
    12f0: 4e 23 14 cf  and (%rdi,%r9,8),%r10
    12f4: 49 89 f3  mov %rsi,%r11
    12f7: 49 d3 e3  shl %cl,%r11
    12fa: 4d 09 d3  or %r10,%r11
    12fd: 4e 89 1c cf  mov %r11,(%rdi,%r9,8)
    1301: 49 39 c0  cmp %rax,%r8
    1304: 73 1d  jae 1323 <bitmap_write_new+0x73>
    1306: 01 d0  add %edx,%eax
    1308: 4a 8b 54 cf 08  mov 0x8(%rdi,%r9,8),%rdx
    130d: 89 c1  mov %eax,%ecx
    130f: 48 d3 ea  shr %cl,%rdx
    1312: 48 d3 e2  shl %cl,%rdx
    1315: 44 89 c1  mov %r8d,%ecx
    1318: 48 d3 ee  shr %cl,%rsi
    131b: 48 09 d6  or %rdx,%rsi
    131e: 4a 89 74 cf 08  mov %rsi,0x8(%rdi,%r9,8)
    1323: c3  ret
==================================================================

4. The version of bitmap_write_new() that uses
`GENMASK(nbits - 1 + offset, offset)` - 146 bytes:

==================================================================
void bitmap_write_new_shift(unsigned long *map, unsigned long value,
			    unsigned long start, unsigned long nbits)
{
	unsigned long offset;
	unsigned long space;
	size_t index;
	bool fit;

	if (unlikely(!nbits))
		return;

	value &= GENMASK(nbits - 1, 0);
	offset = start % BITS_PER_LONG;
	space = BITS_PER_LONG - offset;
	index = BIT_WORD(start);
	fit = space >= nbits;

	map[index] &= (fit ? ~(GENMASK(nbits - 1 + offset, offset)) :
				~BITMAP_FIRST_WORD_MASK(start));
	map[index] |= value << offset;
	if (fit)
		return;

	map[index + 1] &= BITMAP_FIRST_WORD_MASK(start + nbits);
	map[index + 1] |= (value >> space);
}
==================================================================
0000000000001330 <bitmap_write_new_shift>:
    1330: 48 85 c9  test %rcx,%rcx
    1333: 74 6a  je 139f <bitmap_write_new_shift+0x6f>
    1335: 48 89 c8  mov %rcx,%rax
    1338: f6 d9  neg %cl
    133a: 48 d3 e6  shl %cl,%rsi
    133d: 48 d3 ee  shr %cl,%rsi
    1340: 41 89 d0  mov %edx,%r8d
    1343: 41 83 e0 3f  and $0x3f,%r8d
    1347: 41 b9 40 00 00 00  mov $0x40,%r9d
    134d: 4d 29 c1  sub %r8,%r9
    1350: 49 89 d2  mov %rdx,%r10
    1353: 49 c1 ea 06  shr $0x6,%r10
    1357: 49 c7 c3 ff ff ff ff  mov $0xffffffffffffffff,%r11
    135e: 44 89 c1  mov %r8d,%ecx
    1361: 49 d3 e3  shl %cl,%r11
    1364: 49 39 c1  cmp %rax,%r9
    1367: 73 37  jae 13a0 <bitmap_write_new_shift+0x70>
    1369: 53  push %rbx
    136a: 49 f7 d3  not %r11
    136d: 4e 23 1c d7  and (%rdi,%r10,8),%r11
    1371: 48 89 f3  mov %rsi,%rbx
    1374: 44 89 c1  mov %r8d,%ecx
    1377: 48 d3 e3  shl %cl,%rbx
    137a: 4c 09 db  or %r11,%rbx
    137d: 4a 89 1c d7  mov %rbx,(%rdi,%r10,8)
    1381: 01 d0  add %edx,%eax
    1383: 4a 8b 54 d7 08  mov 0x8(%rdi,%r10,8),%rdx
    1388: 89 c1  mov %eax,%ecx
    138a: 48 d3 ea  shr %cl,%rdx
    138d: 48 d3 e2  shl %cl,%rdx
    1390: 44 89 c9  mov %r9d,%ecx
    1393: 48 d3 ee  shr %cl,%rsi
    1396: 48 09 d6  or %rdx,%rsi
    1399: 4a 89 74 d7 08  mov %rsi,0x8(%rdi,%r10,8)
    139e: 5b  pop %rbx
    139f: c3  ret
    13a0: 44 01 c0  add %r8d,%eax
    13a3: f6 d8  neg %al
    13a5: 89 c1  mov %eax,%ecx
    13a7: 49 d3 e3  shl %cl,%r11
    13aa: 49 d3 eb  shr %cl,%r11
    13ad: 49 f7 d3  not %r11
    13b0: 44 89 c1  mov %r8d,%ecx
    13b3: 48 d3 e6  shl %cl,%rsi
    13b6: 4e 23 1c d7  and (%rdi,%r10,8),%r11
    13ba: 4c 09 de  or %r11,%rsi
    13bd: 4a 89 34 d7  mov %rsi,(%rdi,%r10,8)
    13c1: c3  ret
==================================================================

For posterity, I am also attaching the C file containing the four
implementations together with some code testing them.

PS. I won't be actively working on the patch series till the end of
August, so feel free to ignore this letter until then.
On Fri, Aug 04, 2023 at 06:07:00PM +0200, Alexander Potapenko wrote:
> > space >= nbits			<=>
> > BITS_PER_LONG - offset >= nbits	<=>
> > offset + nbits <= BITS_PER_LONG
> >
> > > 	map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :
> >
> > So here GENMASK(nbits + offset - 1, offset) is at max
> > GENMASK(BITS_PER_LONG - 1, offset), and it never overflows, which is my
> > point. Does it make sense?
>
> It indeed does. Perhaps pulling offset inside GENMASK is not a bug after
> all (a simple test does not show any difference between their behavior).
> But `GENMASK(nbits - 1 + offset, offset)` blows up the code (see below).
> My guess is that this happens because the compiler fails to reuse the
> value of `GENMASK(nbits - 1, 0)` used to clamp the value to write, and
> calculates `GENMASK(nbits - 1 + offset, offset)` from scratch.

OK. Can you put a comment explaining this? Or maybe it would be even
better to use BITMAP_LAST_WORD_MASK() here:

	mask = BITMAP_LAST_WORD_MASK(nbits);
	value &= mask;
	...
	map[index] &= (fit ? ~(mask << offset) :

> > > 				~BITMAP_FIRST_WORD_MASK(start));
> >
> > As I said, ~BITMAP_FIRST_WORD_MASK() is the same as
> > BITMAP_LAST_WORD_MASK() and vice versa.
>
> Surprisingly, ~BITMAP_FIRST_WORD_MASK() generates better code than
> BITMAP_LAST_WORD_MASK().

Wow... If that's consistent across different compilers/arches, we'd just
drop the latter. Thanks for pointing that out. I'll check.

> > > 	map[index] |= value << offset;
> > > 	if (fit)
> > > 		return;
> > >
> > > 	map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
>
> OTOH I managed to shave three more bytes off by replacing
> ~BITMAP_LAST_WORD_MASK with BITMAP_FIRST_WORD_MASK here.
>
> > > 	map[index + 1] |= (value >> space);
> > > }
>
> I'll post the implementations together with the disassembly below. I
> used some Clang 17.0.0 version that is a couple of months behind
> upstream, but that still produces consistently shorter code (~48 bytes
> less) than the trunk GCC on Godbolt.
>
> 1. Original implementation of bitmap_write() from this patch - 164 bytes
> (interestingly, it's 157 bytes with Clang 14.0.6):

I spotted that too in some other case. Newer compilers tend to generate
bigger code, but the result usually works faster. One particular reason
in my case was loop unrolling.

[...]

> 3. My improved version built on top of yours and mentioned above under
> the name bitmap_write_new() - 116 bytes:

30% better in size - that's impressive!

> ==================================================================
> void bitmap_write_new(unsigned long *map, unsigned long value,
> 		      unsigned long start, unsigned long nbits)
> {
> 	unsigned long offset;
> 	unsigned long space;
> 	size_t index;
> 	bool fit;
>
> 	if (unlikely(!nbits))
> 		return;
>
> 	value &= GENMASK(nbits - 1, 0);
> 	offset = start % BITS_PER_LONG;
> 	space = BITS_PER_LONG - offset;
> 	index = BIT_WORD(start);
> 	fit = space >= nbits;
>
> 	map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :
> 				~BITMAP_FIRST_WORD_MASK(start));
> 	map[index] |= value << offset;
> 	if (fit)
> 		return;
>
> 	map[index + 1] &= BITMAP_FIRST_WORD_MASK(start + nbits);
> 	map[index + 1] |= (value >> space);
> }

Thanks,
Yury
On Fri, Aug 04, 2023 at 12:55:01PM -0700, Yury Norov wrote:
> On Fri, Aug 04, 2023 at 06:07:00PM +0200, Alexander Potapenko wrote:
> > > > 				~BITMAP_FIRST_WORD_MASK(start));
> > >
> > > As I said, ~BITMAP_FIRST_WORD_MASK() is the same as
> > > BITMAP_LAST_WORD_MASK() and vice versa.
> >
> > Surprisingly, ~BITMAP_FIRST_WORD_MASK() generates better code than
> > BITMAP_LAST_WORD_MASK().
>
> Wow... If that's consistent across different compilers/arches, we'd just
> drop the latter. Thanks for pointing that out. I'll check.

It might be...

...

> > 	map[index] &= (fit ? (~(GENMASK(nbits - 1, 0) << offset)) :
> > 				~BITMAP_FIRST_WORD_MASK(start));

...that both arms of the ternary use a negation, so the compiler may be
able to apply the negation only once. (I haven't looked at the
assembly.)
> Regarding the rest of the series:
>  - I still see Evgenii's name in mtecomp.c, and EA0 references;

Double-checked: there are none in v5 (only the Suggested-by: tag).

>  - git-am throws a warning about a trailing line;

I checked locally that `git am` does not warn about the v5 patches. But
given that the patches are generated with `git format-patch`, I suspect
they get garbled when you download them - could that be the case?

>  - checkpatch warns 7 times;

It now warns 4 times: three warnings are about updating MAINTAINERS (I
don't think there's a need for this), and the last one is about
CONFIG_ARM64_MTE_COMP_KUNIT_TEST not having three lines of description
text in Kconfig.

> Can you fix all the above before sending the new version?
>
> Have you tested the generic part against BE32, BE64 and LE32
> architectures, and the arch part against BE64? If not, please do.

I did now.

> You're mentioning that the compression ratio is 2 to 20x. Can you share
> the absolute numbers? If it's 1k vs 2k, I think most people just don't
> care...

In the other thread I mentioned that although 20x compression is
reachable, it may not lead to practical savings. I reworded the
description, adding the absolute numbers.

> Can you share the code that you used to measure the compression ratio?
> Would it make sense to export the numbers via sysfs?

Done in v5.

> Thanks,
> Yury
> Either way, whatever we decide, let's stay clear about our intentions
> and mention explicitly that tail bits either must be zero, or are
> ignored.
>
> Alexander, can you add the snippet above to the comments for
> bitmap_write() and bitmap_read(), as well as to the test? Also, if we
> decide to clear the tail of the input value, would
> BITMAP_LAST_WORD_MASK() generate better code than GENMASK(nbits - 1, 0)
> does?

Added the snippet above to bitmap_write(). I think, however, that it is
obvious that bitmap_read() reads only nbits?

> Comments are very appreciated.
>
> Thanks,
> Yury
> OK. Can you put a comment explaining this? Or maybe it would be even
> better to use BITMAP_LAST_WORD_MASK() here:
>
> 	mask = BITMAP_LAST_WORD_MASK(nbits);
> 	value &= mask;
> 	...
> 	map[index] &= (fit ? ~(mask << offset) :

I changed GENMASK to BITMAP_LAST_WORD_MASK here, and I think it's
self-explanatory now, WDYT?
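The equivalence Alexander relies on here can be checked directly (a userspace sketch with local copies of the two kernel macros; illustrative only):

===========================================================
#include <assert.h>

#define BITS_PER_LONG 64
/* Local copies of the kernel macros, for illustration. */
#define GENMASK(h, l) \
	(((~0UL) - (1UL << (l)) + 1) & (~0UL >> (BITS_PER_LONG - 1 - (h))))
#define BITMAP_LAST_WORD_MASK(nbits) \
	(~0UL >> (-(nbits) & (BITS_PER_LONG - 1)))

int main(void)
{
	unsigned long nbits;

	/* The two spellings agree for every nbits the helpers accept. */
	for (nbits = 1; nbits <= BITS_PER_LONG; nbits++)
		assert(BITMAP_LAST_WORD_MASK(nbits) == GENMASK(nbits - 1, 0));
	return 0;
}
===========================================================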
diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
index 03644237e1efb..bc21c09a2e038 100644
--- a/include/linux/bitmap.h
+++ b/include/linux/bitmap.h
@@ -76,7 +76,11 @@ struct device;
  *  bitmap_to_arr32(buf, src, nbits)          Copy nbits from buf to u32[] dst
  *  bitmap_to_arr64(buf, src, nbits)          Copy nbits from buf to u64[] dst
  *  bitmap_get_value8(map, start)             Get 8bit value from map at start
+ *  bitmap_read(map, start, nbits)            Read an nbits-sized value from
+ *                                            map at start
  *  bitmap_set_value8(map, value, start)      Set 8bit value to map at start
+ *  bitmap_write(map, value, start, nbits)    Write an nbits-sized value to
+ *                                            map at start
  *
  * Note, bitmap_zero() and bitmap_fill() operate over the region of
  * unsigned longs, that is, bits behind bitmap till the unsigned long
@@ -583,6 +587,33 @@ static inline unsigned long bitmap_get_value8(const unsigned long *map,
 	return (map[index] >> offset) & 0xFF;
 }
 
+/**
+ * bitmap_read - read a value of n-bits from the memory region
+ * @map: address to the bitmap memory region
+ * @start: bit offset of the n-bit value
+ * @nbits: size of value in bits, up to BITS_PER_LONG
+ *
+ * Returns: value of nbits located at the @start bit offset within the @map
+ * memory region.
+ */
+static inline unsigned long bitmap_read(const unsigned long *map,
+					unsigned long start,
+					unsigned long nbits)
+{
+	size_t index = BIT_WORD(start);
+	unsigned long offset = start % BITS_PER_LONG;
+	unsigned long space = BITS_PER_LONG - offset;
+	unsigned long value_low, value_high;
+
+	if (unlikely(!nbits))
+		return 0;
+	if (space >= nbits)
+		return (map[index] >> offset) & GENMASK(nbits - 1, 0);
+	value_low = map[index] & BITMAP_FIRST_WORD_MASK(start);
+	value_high = map[index + 1] & BITMAP_LAST_WORD_MASK(start + nbits);
+	return (value_low >> offset) | (value_high << space);
+}
+
 /**
  * bitmap_set_value8 - set an 8-bit value within a memory region
  * @map: address to the bitmap memory region
@@ -599,6 +630,35 @@ static inline void bitmap_set_value8(unsigned long *map, unsigned long value,
 	map[index] |= value << offset;
 }
 
+/**
+ * bitmap_write - write n-bit value within a memory region
+ * @map: address to the bitmap memory region
+ * @value: value of nbits
+ * @start: bit offset of the n-bit value
+ * @nbits: size of value in bits, up to BITS_PER_LONG
+ */
+static inline void bitmap_write(unsigned long *map,
+				unsigned long value,
+				unsigned long start, unsigned long nbits)
+{
+	size_t index = BIT_WORD(start);
+	unsigned long offset = start % BITS_PER_LONG;
+	unsigned long space = BITS_PER_LONG - offset;
+
+	if (unlikely(!nbits))
+		return;
+	value &= GENMASK(nbits - 1, 0);
+	if (space >= nbits) {
+		map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
+		map[index] |= value << offset;
+		return;
+	}
+	map[index] &= ~BITMAP_FIRST_WORD_MASK(start);
+	map[index] |= value << offset;
+	map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
+	map[index + 1] |= (value >> space);
+}
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* __LINUX_BITMAP_H */
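To see the helpers' contract end to end, the hunks above can be exercised in userspace with local copies of the kernel macros (a test sketch assuming 64-bit unsigned long; not part of the patch). Whatever bitmap_write() stores must read back through bitmap_read(), modulo the nbits mask:

===========================================================
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define BITS_PER_LONG 64
#define BIT_WORD(nr) ((nr) / BITS_PER_LONG)
#define GENMASK(h, l) \
	(((~0UL) - (1UL << (l)) + 1) & (~0UL >> (BITS_PER_LONG - 1 - (h))))
#define BITMAP_FIRST_WORD_MASK(start) (~0UL << ((start) & (BITS_PER_LONG - 1)))
#define BITMAP_LAST_WORD_MASK(nbits) (~0UL >> (-(nbits) & (BITS_PER_LONG - 1)))

/* Userspace copy of the patch's bitmap_write() (unlikely() dropped). */
static void bitmap_write(unsigned long *map, unsigned long value,
			 unsigned long start, unsigned long nbits)
{
	size_t index = BIT_WORD(start);
	unsigned long offset = start % BITS_PER_LONG;
	unsigned long space = BITS_PER_LONG - offset;

	if (!nbits)
		return;
	value &= GENMASK(nbits - 1, 0);
	if (space >= nbits) {
		map[index] &= ~(GENMASK(nbits - 1, 0) << offset);
		map[index] |= value << offset;
		return;
	}
	map[index] &= ~BITMAP_FIRST_WORD_MASK(start);
	map[index] |= value << offset;
	map[index + 1] &= ~BITMAP_LAST_WORD_MASK(start + nbits);
	map[index + 1] |= (value >> space);
}

/* Userspace copy of the patch's bitmap_read(). */
static unsigned long bitmap_read(const unsigned long *map,
				 unsigned long start, unsigned long nbits)
{
	size_t index = BIT_WORD(start);
	unsigned long offset = start % BITS_PER_LONG;
	unsigned long space = BITS_PER_LONG - offset;
	unsigned long value_low, value_high;

	if (!nbits)
		return 0;
	if (space >= nbits)
		return (map[index] >> offset) & GENMASK(nbits - 1, 0);
	value_low = map[index] & BITMAP_FIRST_WORD_MASK(start);
	value_high = map[index + 1] & BITMAP_LAST_WORD_MASK(start + nbits);
	return (value_low >> offset) | (value_high << space);
}

int main(void)
{
	unsigned long map[4] = { 0 };

	for (int i = 0; i < 100000; i++) {
		unsigned long nbits = 1 + rand() % BITS_PER_LONG;
		unsigned long start = rand() % (3 * BITS_PER_LONG);
		unsigned long value = ((unsigned long)rand() << 32) | rand();

		bitmap_write(map, value, start, nbits);
		/* The round-trip must preserve the low nbits of value. */
		assert(bitmap_read(map, start, nbits) ==
		       (value & GENMASK(nbits - 1, 0)));
	}
	return 0;
}
===========================================================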