[X86] Split lea into shorter left shift by 2 or 3 bits with -Oz.
Commit Message
This patch avoids long lea instructions for performing x<<2 and x<<3
by splitting them into shorter sal and mov (or xchg) instructions.
Because this increases the number of instructions but reduces the
total size, it's suitable for -Oz (but not -Os).
The impact can be seen in the new test case:
int foo(int x) { return x<<2; }
int bar(int x) { return x<<3; }
long long fool(long long x) { return x<<2; }
long long barl(long long x) { return x<<3; }
where with -O2 we generate:
foo: lea 0x0(,%rdi,4),%eax // 7 bytes
retq
bar: lea 0x0(,%rdi,8),%eax // 7 bytes
retq
fool: lea 0x0(,%rdi,4),%rax // 8 bytes
retq
barl: lea 0x0(,%rdi,8),%rax // 8 bytes
retq
and with -Oz we now generate:
foo: xchg %eax,%edi // 1 byte
shl $0x2,%eax // 3 bytes
retq
bar: xchg %eax,%edi // 1 byte
shl $0x3,%eax // 3 bytes
retq
fool: xchg %rax,%rdi // 2 bytes
shl $0x2,%rax // 4 bytes
retq
barl: xchg %rax,%rdi // 2 bytes
shl $0x3,%rax // 4 bytes
retq
Over the entirety of the CSiBE code size benchmark this saves 1347
bytes (0.037%) for x86_64, and 1312 bytes (0.036%) with -m32.
Conveniently, there's already a backend function in i386.cc for
deciding whether to split an lea into its component instructions,
ix86_avoid_lea_for_addr; all that's required is an additional clause
checking for -Oz (i.e. optimize_size > 1).
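For reference, the clause in question boils down to the following (the exact
hunk appears at the end of this page; the inline comments are annotations
rather than part of the patch). A scale-only lea like the ones above needs a
SIB byte plus a four-byte displacement, which is why the mov/xchg-plus-shift
form is shorter:

  /* Split with -Oz if the encoding requires fewer bytes.  */
  if (optimize_size > 1           /* -Oz rather than -Os.  */
      && parts.scale > 1          /* The lea multiplies by 2, 4 or 8, i.e. a left shift.  */
      && !parts.base              /* No base register, so the SIB form needs a disp32...  */
      && (!parts.disp || parts.disp == const0_rtx))  /* ...whose value is zero anyway.  */
    return true;                  /* Ask for this lea to be split.  */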
This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board='unix{-m32}',
with no new failures. Additional testing was performed by repeating
these steps after removing the "optimize_size > 1" condition, so that
suitable lea instructions were always split (-Oz is not heavily
exercised by the testsuite, so this ensured the new code was invoked
during the bootstrap and regression testing), again with no
regressions. Ok for mainline?
2023-10-05 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
* config/i386/i386.cc (ix86_avoid_lea_for_addr): Split LEAs used
to perform left shifts into shorter instructions with -Oz.
gcc/testsuite/ChangeLog
* gcc.target/i386/lea-2.c: New test case.
Comments
On Thu, Oct 5, 2023 at 11:06 AM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> This patch avoids long lea instructions for performing x<<2 and x<<3
> by splitting them into shorter sal and move (or xchg instructions).
> Because this increases the number of instructions, but reduces the
> total size, its suitable for -Oz (but not -Os).
>
> The impact can be seen in the new test case:
>
> int foo(int x) { return x<<2; }
> int bar(int x) { return x<<3; }
> long long fool(long long x) { return x<<2; }
> long long barl(long long x) { return x<<3; }
>
> where with -O2 we generate:
>
> foo: lea 0x0(,%rdi,4),%eax // 7 bytes
> retq
> bar: lea 0x0(,%rdi,8),%eax // 7 bytes
> retq
> fool: lea 0x0(,%rdi,4),%rax // 8 bytes
> retq
> barl: lea 0x0(,%rdi,8),%rax // 8 bytes
> retq
>
> and with -Oz we now generate:
>
> foo: xchg %eax,%edi // 1 byte
> shl $0x2,%eax // 3 bytes
> retq
> bar: xchg %eax,%edi // 1 byte
> shl $0x3,%eax // 3 bytes
> retq
> fool: xchg %rax,%rdi // 2 bytes
> shl $0x2,%rax // 4 bytes
> retq
> barl: xchg %rax,%rdi // 2 bytes
> shl $0x3,%rax // 4 bytes
> retq
>
> Over the entirety of the CSiBE code size benchmark this saves 1347
> bytes (0.037%) for x86_64, and 1312 bytes (0.036%) with -m32.
> Conveniently, there's already a backend function in i386.cc for
> deciding whether to split an lea into its component instructions,
> ix86_avoid_lea_for_addr, all that's required is an additional clause
> checking for -Oz (i.e. optimize_size > 1).
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board='unix{-m32}'
> with no new failures. Additional testing was performed by repeating
> these steps after removing the "optimize_size > 1" condition, so that
> suitable lea instructions were always split [-Oz is not heavily
> tested, so this invoked the new code during the bootstrap and
> regression testing], again with no regressions. Ok for mainline?
>
>
> 2023-10-05 Roger Sayle <roger@nextmovesoftware.com>
>
> gcc/ChangeLog
> * config/i386/i386.cc (ix86_avoid_lea_for_addr): Split LEAs used
> to perform left shifts into shorter instructions with -Oz.
>
> gcc/testsuite/ChangeLog
> * gcc.target/i386/lea-2.c: New test case.
>
OK, but ...
@@ -0,0 +1,7 @@
+/* { dg-do compile { target { ! ia32 } } } */
Is there a reason to avoid 32-bit targets? I'd expect that the
optimization also triggers on x86_32 for 32-bit integers.
+/* { dg-options "-Oz" } */
+int foo(int x) { return x<<2; }
+int bar(int x) { return x<<3; }
+long long fool(long long x) { return x<<2; }
+long long barl(long long x) { return x<<3; }
+/* { dg-final { scan-assembler-not "lea\[lq\]" } } */
Uros.
Hi Uros,
Very many thanks for the speedy reviews.
Uros Bizjak wrote:
> On Thu, Oct 5, 2023 at 11:06 AM Roger Sayle <roger@nextmovesoftware.com>
> wrote:
>
> OK, but ...
>
> @@ -0,0 +1,7 @@
> +/* { dg-do compile { target { ! ia32 } } } */
>
> Is there a reason to avoid 32-bit targets? I'd expect that the optimization also
> triggers on x86_32 for 32bit integers.
Good catch. You're 100% correct; because the test case just checks that an LEA
is not used, and not for the specific sequence of shift instructions used instead,
this test also passes with --target_board='unix{-m32}'. I'll remove the target clause
from the dg-do compile directive.
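For reference, after dropping that clause the revised directive would simply
read as follows (a sketch, assuming no other changes to the test):

/* { dg-do compile } */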
> +/* { dg-options "-Oz" } */
> +int foo(int x) { return x<<2; }
> +int bar(int x) { return x<<3; }
> +long long fool(long long x) { return x<<2; }
> +long long barl(long long x) { return x<<3; }
> +/* { dg-final { scan-assembler-not "lea\[lq\]" } } */
Thanks again.
Roger
diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -15543,6 +15543,13 @@ ix86_avoid_lea_for_addr (rtx_insn *insn, rtx operands[])
       && (regno0 == regno1 || regno0 == regno2))
     return true;
 
+  /* Split with -Oz if the encoding requires fewer bytes.  */
+  if (optimize_size > 1
+      && parts.scale > 1
+      && !parts.base
+      && (!parts.disp || parts.disp == const0_rtx))
+    return true;
+
   /* Check we need to optimize.  */
   if (!TARGET_AVOID_LEA_FOR_ADDR || optimize_function_for_size_p (cfun))
     return false;
diff --git a/gcc/testsuite/gcc.target/i386/lea-2.c b/gcc/testsuite/gcc.target/i386/lea-2.c
new file mode 100644
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/lea-2.c
@@ -0,0 +1,7 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-Oz" } */
+int foo(int x) { return x<<2; }
+int bar(int x) { return x<<3; }
+long long fool(long long x) { return x<<2; }
+long long barl(long long x) { return x<<3; }
+/* { dg-final { scan-assembler-not "lea\[lq\]" } } */