[x86] Fine tune STV register conversion costs for -Os.
Commit Message
The eagle-eyed may have spotted that my recent testcases for DImode shifts
on x86_64 included -mno-stv in the dg-options. This is because the
Scalar-To-Vector (STV) pass currently transforms these shifts to use
SSE vector operations, producing larger code even with -Os. The issue
is that compute_convert_gain currently underestimates the size of the
instructions required for inter-unit moves, which is corrected by the
patch below.
For the simple test case:
unsigned long long shl1(unsigned long long x) { return x << 1; }
without this patch, GCC -m32 -Os -mavx2 currently generates:
shl1:   push    %ebp                  // 1 byte
        mov     %esp,%ebp             // 2 bytes
        vmovq   0x8(%ebp),%xmm0       // 5 bytes
        pop     %ebp                  // 1 byte
        vpaddq  %xmm0,%xmm0,%xmm0     // 4 bytes
        vmovd   %xmm0,%eax            // 4 bytes
        vpextrd $0x1,%xmm0,%edx       // 6 bytes
        ret                           // 1 byte   = 24 bytes total
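For reference, the vpaddq above is the STV pass implementing x << 1 as
x + x in an SSE register. The following intrinsics version is only my
own sketch of the converted sequence (the shl1_vector name is made up,
and SSE4.1 is assumed for _mm_extract_epi32), not anything GCC emits:

#include <immintrin.h>

unsigned long long shl1_vector (const unsigned long long *p)
{
  __m128i x = _mm_loadl_epi64 ((const __m128i *) p);  /* vmovq   */
  x = _mm_add_epi64 (x, x);                   /* vpaddq: x + x == x << 1 */
  unsigned int lo = _mm_cvtsi128_si32 (x);    /* vmovd   */
  unsigned int hi = _mm_extract_epi32 (x, 1); /* vpextrd */
  return ((unsigned long long) hi << 32) | lo;
}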
with this patch, we now generate the shorter scalar sequence:
shl1:   push    %ebp                  // 1 byte
        mov     %esp,%ebp             // 2 bytes
        mov     0x8(%ebp),%eax        // 3 bytes
        mov     0xc(%ebp),%edx        // 3 bytes
        pop     %ebp                  // 1 byte
        add     %eax,%eax             // 2 bytes
        adc     %edx,%edx             // 2 bytes
        ret                           // 1 byte   = 15 bytes total
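Here the scalar code works a word at a time: add doubles the low half,
and adc folds the resulting carry into the doubled high half. As a
plain C sketch of what the pair computes (shl1_scalar is a made-up
name, and 32-bit words are assumed):

unsigned long long shl1_scalar (unsigned int lo, unsigned int hi)
{
  unsigned int new_lo = lo + lo;          /* add %eax,%eax */
  unsigned int carry = new_lo < lo;       /* carry out of the low word */
  unsigned int new_hi = hi + hi + carry;  /* adc %edx,%edx */
  return ((unsigned long long) new_hi << 32) | new_lo;
}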
Benchmarking with CSiBE shows that this patch saves 1361 bytes when
compiling with -m32 -Os, and 172 bytes when compiling with -Os.
This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32},
with no new failures. Ok for mainline?
2023-10-23 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
	* config/i386/i386-features.cc (compute_convert_gain): Provide
	more accurate values (sizes) for inter-unit moves with -Os.
Thanks in advance,
Roger
--
Comments
On Mon, Oct 23, 2023 at 4:47 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
LGTM.
Thanks,
Uros.
@@ -752,11 +752,33 @@ general_scalar_chain::compute_convert_gain ()
     fprintf (dump_file, "  Instruction conversion gain: %d\n", gain);
 
   /* Cost the integer to sse and sse to integer moves.  */
-  cost += n_sse_to_integer * ix86_cost->sse_to_integer;
-  /* ??? integer_to_sse but we only have that in the RA cost table.
-     Assume sse_to_integer/integer_to_sse are the same which they
-     are at the moment.  */
-  cost += n_integer_to_sse * ix86_cost->sse_to_integer;
+  if (!optimize_function_for_size_p (cfun))
+    {
+      cost += n_sse_to_integer * ix86_cost->sse_to_integer;
+      /* ??? integer_to_sse but we only have that in the RA cost table.
+         Assume sse_to_integer/integer_to_sse are the same which they
+         are at the moment.  */
+      cost += n_integer_to_sse * ix86_cost->sse_to_integer;
+    }
+  else if (TARGET_64BIT || smode == SImode)
+    {
+      cost += n_sse_to_integer * COSTS_N_BYTES (4);
+      cost += n_integer_to_sse * COSTS_N_BYTES (4);
+    }
+  else if (TARGET_SSE4_1)
+    {
+      /* vmovd (4 bytes) + vpextrd (6 bytes).  */
+      cost += n_sse_to_integer * COSTS_N_BYTES (10);
+      /* vmovd (4 bytes) + vpinsrd (6 bytes).  */
+      cost += n_integer_to_sse * COSTS_N_BYTES (10);
+    }
+  else
+    {
+      /* movd (4 bytes) + psrlq (5 bytes) + movd (4 bytes).  */
+      cost += n_sse_to_integer * COSTS_N_BYTES (13);
+      /* movd (4 bytes) + movd (4 bytes) + unpckldq (4 bytes).  */
+      cost += n_integer_to_sse * COSTS_N_BYTES (12);
+    }
 
   if (dump_file)
     fprintf (dump_file, "  Registers conversion cost: %d\n", cost);
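To make the shape of the new logic easy to see outside of GCC, here is
a standalone sketch of the same size-cost selection. The function and
parameter names are stand-ins for the GCC internals
(optimize_function_for_size_p, TARGET_64BIT, TARGET_SSE4_1, ix86_cost),
not real APIs; only the byte counts are taken from the patch:

/* Hypothetical stand-alone model of the -Os inter-unit move costs.  */
int
interunit_move_cost (bool optimize_size, bool is_64bit, bool smode_is_si,
                     bool have_sse4_1, int tune_sse_to_integer,
                     int n_sse_to_integer, int n_integer_to_sse)
{
  int cost = 0;
  if (!optimize_size)
    /* Optimizing for speed: use the tuning table, which currently has
       the same value in both directions.  */
    cost += (n_sse_to_integer + n_integer_to_sse) * tune_sse_to_integer;
  else if (is_64bit || smode_is_si)
    /* A single movd/movq in each direction: 4 bytes.  */
    cost += (n_sse_to_integer + n_integer_to_sse) * 4;
  else if (have_sse4_1)
    /* DImode with -m32: vmovd + vpextrd, or vmovd + vpinsrd.  */
    cost += (n_sse_to_integer + n_integer_to_sse) * 10;
  else
    {
      cost += n_sse_to_integer * 13;  /* movd + psrlq + movd */
      cost += n_integer_to_sse * 12;  /* movd + movd + unpckldq */
    }
  return cost;
}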