[v1,1/2] x86/asm/bitops: Replace __fls() by its generic builtin implementation

Message ID 20221106095106.849154-2-mailhol.vincent@wanadoo.fr
State New
Headers
Series x86/asm/bitops: optimize fls functions for constant expressions |

Commit Message

Vincent Mailhol Nov. 6, 2022, 9:51 a.m. UTC
  Below snippet:

  #include <linux/bitops.h>

  unsigned int foo(unsigned long word)
  {
  	return __fls(word);
  }

produces this on GCC 12.1.0:

  0000000000000000 <foo>:
     0:	f3 0f 1e fa          	endbr64
     4:	e8 00 00 00 00       	call   9 <foo+0x9>
     9:	53                   	push   %rbx
     a:	48 89 fb             	mov    %rdi,%rbx
     d:	e8 00 00 00 00       	call   12 <foo+0x12>
    12:	48 0f bd c3          	bsr    %rbx,%rax
    16:	5b                   	pop    %rbx
    17:	31 ff                	xor    %edi,%edi
    19:	e9 00 00 00 00       	jmp    1e <foo+0x1e>

and that on clang 14.0.6:

  0000000000000000 <foo>:
     0:	f3 0f 1e fa          	endbr64
     4:	e8 00 00 00 00       	call   9 <foo+0x9>
     9:	53                   	push   %rbx
     a:	50                   	push   %rax
     b:	48 89 fb             	mov    %rdi,%rbx
     e:	e8 00 00 00 00       	call   13 <foo+0x13>
    13:	48 89 1c 24          	mov    %rbx,(%rsp)
    17:	48 0f bd 04 24       	bsr    (%rsp),%rax
    1c:	48 83 c4 08          	add    $0x8,%rsp
    20:	5b                   	pop    %rbx
    21:	c3                   	ret

The implementation from <asm-generic/bitops/builtin-__fls.h> [1]
produces the exact same code on GCC and below code on clang:

  0000000000000000 <foo>:
     0:	f3 0f 1e fa          	endbr64
     4:	e8 00 00 00 00       	call   9 <foo+0x9>
     9:	53                   	push   %rbx
     a:	48 89 fb             	mov    %rdi,%rbx
     d:	e8 00 00 00 00       	call   12 <foo+0x12>
    12:	48 0f bd c3          	bsr    %rbx,%rax
    16:	5b                   	pop    %rbx
    17:	c3                   	ret

The builtin implementation is better for two reasons:

  1/ it saves two instructions on clang (a push and a stack pointer
     decrement) because of a useless tentative to save rax.

  2/ when used on constant expressions, the compiler is only able to
     fold the builtin version (c.f. [2]).

For those two reasons, replace the assembly implementation by its
builtin counterpart.

[1] https://elixir.bootlin.com/linux/v6.0/source/include/asm-generic/bitops/builtin-__fls.h

[2] commit 146034fed6ee ("x86/asm/bitops: Use __builtin_ffs() to evaluate constant expressions")

CC: Borislav Petkov <bp@suse.de>
CC: Nick Desaulniers <ndesaulniers@google.com>
CC: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
---
 arch/x86/include/asm/bitops.h | 14 +-------------
 1 file changed, 1 insertion(+), 13 deletions(-)
  

Comments

Peter Zijlstra Nov. 7, 2022, 9:38 a.m. UTC | #1
On Sun, Nov 06, 2022 at 06:51:05PM +0900, Vincent Mailhol wrote:
> The builtin implementation is better for two reasons:
> 
>   1/ it saves two instructions on clang (a push and a stack pointer
>      decrement) because of a useless tentative to save rax.

I'm thinking this is the same old clang-sucks-at-"rm" constraints and
*really* should not be a reason to change things. Clang should get fixed
already.

>   2/ when used on constant expressions, the compiler is only able to
>      fold the builtin version (c.f. [2]).
> 
> For those two reasons, replace the assembly implementation by its
> builtin counterpart.
> 
> [1] https://elixir.bootlin.com/linux/v6.0/source/include/asm-generic/bitops/builtin-__fls.h
> 
> [2] commit 146034fed6ee ("x86/asm/bitops: Use __builtin_ffs() to evaluate constant expressions")

I would much prefer consistently with 146034fed6ee.
  
Vincent Mailhol Nov. 7, 2022, 12:19 p.m. UTC | #2
On Mon. 7 Nov. 2022 at 18:38, Peter Zijlstra <peterz@infradead.org> wrote:
> On Sun, Nov 06, 2022 at 06:51:05PM +0900, Vincent Mailhol wrote:
> > The builtin implementation is better for two reasons:
> >
> >   1/ it saves two instructions on clang (a push and a stack pointer
> >      decrement) because of a useless tentative to save rax.
>
> I'm thinking this is the same old clang-sucks-at-"rm" constraints and
> *really* should not be a reason to change things. Clang should get fixed
> already.
>
> >   2/ when used on constant expressions, the compiler is only able to
> >      fold the builtin version (c.f. [2]).
> >
> > For those two reasons, replace the assembly implementation by its
> > builtin counterpart.
> >
> > [1] https://elixir.bootlin.com/linux/v6.0/source/include/asm-generic/bitops/builtin-__fls.h
> >
> > [2] commit 146034fed6ee ("x86/asm/bitops: Use __builtin_ffs() to evaluate constant expressions")
>
> I would much prefer consistently with 146034fed6ee.

There is one big difference between 146034fed6ee and this patch. For
the ffs() functions, the x86 asm produces *better* code so there is a
reason to keep the x86 asm.
The clang missed optimization is not the core reason for me to send
this patch. The purpose of the x86 asm code is to be more performant
than the generic implementation, isn't it? Let's imagine for a moment
that the x86 asm and the builtin produced exactly the same output,
then what would be the reason for keeping the x86 asm version?

My point is not that x86 asm is worse, but that x86 asm isn't better.
The clang missed optimization is one additional reason for this patch,
not the main one.

Yours sincerely,
Vincent Mailhol
  
Nick Desaulniers Nov. 10, 2022, 7:06 p.m. UTC | #3
On Sun, Nov 6, 2022 at 1:51 AM Vincent Mailhol
<mailhol.vincent@wanadoo.fr> wrote:
>
> Below snippet:
>
>   #include <linux/bitops.h>
>
>   unsigned int foo(unsigned long word)
>   {
>         return __fls(word);
>   }
>
> produces this on GCC 12.1.0:
>
>   0000000000000000 <foo>:
>      0: f3 0f 1e fa             endbr64
>      4: e8 00 00 00 00          call   9 <foo+0x9>
>      9: 53                      push   %rbx
>      a: 48 89 fb                mov    %rdi,%rbx
>      d: e8 00 00 00 00          call   12 <foo+0x12>
>     12: 48 0f bd c3             bsr    %rbx,%rax
>     16: 5b                      pop    %rbx
>     17: 31 ff                   xor    %edi,%edi
>     19: e9 00 00 00 00          jmp    1e <foo+0x1e>
>
> and that on clang 14.0.6:
>
>   0000000000000000 <foo>:
>      0: f3 0f 1e fa             endbr64
>      4: e8 00 00 00 00          call   9 <foo+0x9>
>      9: 53                      push   %rbx
>      a: 50                      push   %rax
>      b: 48 89 fb                mov    %rdi,%rbx
>      e: e8 00 00 00 00          call   13 <foo+0x13>
>     13: 48 89 1c 24             mov    %rbx,(%rsp)
>     17: 48 0f bd 04 24          bsr    (%rsp),%rax
>     1c: 48 83 c4 08             add    $0x8,%rsp
>     20: 5b                      pop    %rbx
>     21: c3                      ret
>
> The implementation from <asm-generic/bitops/builtin-__fls.h> [1]
> produces the exact same code on GCC and below code on clang:
>
>   0000000000000000 <foo>:
>      0: f3 0f 1e fa             endbr64
>      4: e8 00 00 00 00          call   9 <foo+0x9>
>      9: 53                      push   %rbx
>      a: 48 89 fb                mov    %rdi,%rbx
>      d: e8 00 00 00 00          call   12 <foo+0x12>
>     12: 48 0f bd c3             bsr    %rbx,%rax
>     16: 5b                      pop    %rbx
>     17: c3                      ret
>
> The builtin implementation is better for two reasons:
>
>   1/ it saves two instructions on clang (a push and a stack pointer
>      decrement) because of a useless tentative to save rax.
>
>   2/ when used on constant expressions, the compiler is only able to
>      fold the builtin version (c.f. [2]).
>
> For those two reasons, replace the assembly implementation by its
> builtin counterpart.
>
> [1] https://elixir.bootlin.com/linux/v6.0/source/include/asm-generic/bitops/builtin-__fls.h
>
> [2] commit 146034fed6ee ("x86/asm/bitops: Use __builtin_ffs() to evaluate constant expressions")
>
> CC: Borislav Petkov <bp@suse.de>
> CC: Nick Desaulniers <ndesaulniers@google.com>
> CC: Yury Norov <yury.norov@gmail.com>
> Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>

LGTM; thanks for the patch!
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>

> ---
>  arch/x86/include/asm/bitops.h | 14 +-------------
>  1 file changed, 1 insertion(+), 13 deletions(-)
>
> diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
> index 2edf68475fec..a31453d7686d 100644
> --- a/arch/x86/include/asm/bitops.h
> +++ b/arch/x86/include/asm/bitops.h
> @@ -285,19 +285,7 @@ static __always_inline unsigned long variable_ffz(unsigned long word)
>          (unsigned long)__builtin_ctzl(~word) : \
>          variable_ffz(word))
>
> -/*
> - * __fls: find last set bit in word
> - * @word: The word to search
> - *
> - * Undefined if no set bit exists, so code should check against 0 first.
> - */
> -static __always_inline unsigned long __fls(unsigned long word)
> -{
> -       asm("bsr %1,%0"
> -           : "=r" (word)
> -           : "rm" (word));
> -       return word;
> -}
> +#include <asm-generic/bitops/builtin-__fls.h>
>
>  #undef ADDR
>
> --
> 2.37.4
>
  
Nick Desaulniers Nov. 10, 2022, 7:20 p.m. UTC | #4
On Mon, Nov 7, 2022 at 1:39 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sun, Nov 06, 2022 at 06:51:05PM +0900, Vincent Mailhol wrote:
> > The builtin implementation is better for two reasons:
> >
> >   1/ it saves two instructions on clang (a push and a stack pointer
> >      decrement) because of a useless tentative to save rax.
>
> I'm thinking this is the same old clang-sucks-at-"rm" constraints and
> *really* should not be a reason to change things. Clang should get fixed
> already.

Well messing up constant folding for all compilers absolutely should
be a reason!

I did get a chance to speak with some colleagues more about this at
the LLVM developer meeting during the past 2 days.  We have some ideas
on approaches that might work.  There's some higher priority features
we're working on first, but I suspect we'll be able to visit that
issue soon.  It's a pretty tricky dance between instruction selection
and register allocation.

>
> >   2/ when used on constant expressions, the compiler is only able to
> >      fold the builtin version (c.f. [2]).
> >
> > For those two reasons, replace the assembly implementation by its
> > builtin counterpart.
> >
> > [1] https://elixir.bootlin.com/linux/v6.0/source/include/asm-generic/bitops/builtin-__fls.h
> >
> > [2] commit 146034fed6ee ("x86/asm/bitops: Use __builtin_ffs() to evaluate constant expressions")
>
> I would much prefer consistently with 146034fed6ee.

The bottom of this file arch/x86/include/asm/bitops.h is full of
#include <asm-generic/bitops/*.h>
  

Patch

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 2edf68475fec..a31453d7686d 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -285,19 +285,7 @@  static __always_inline unsigned long variable_ffz(unsigned long word)
 	 (unsigned long)__builtin_ctzl(~word) :	\
 	 variable_ffz(word))
 
-/*
- * __fls: find last set bit in word
- * @word: The word to search
- *
- * Undefined if no set bit exists, so code should check against 0 first.
- */
-static __always_inline unsigned long __fls(unsigned long word)
-{
-	asm("bsr %1,%0"
-	    : "=r" (word)
-	    : "rm" (word));
-	return word;
-}
+#include <asm-generic/bitops/builtin-__fls.h>
 
 #undef ADDR