RISC-V: Synthesize power-of-two constants.

Message ID 7716481f-2786-f1d0-27dc-b76cac630353@gmail.com
State Unresolved
Headers
Series RISC-V: Synthesize power-of-two constants. |

Checks

Context Check Description
snail/gcc-patch-check warning Git am fail log

Commit Message

Robin Dapp May 30, 2023, 7:13 p.m. UTC
  Hi,

I figured I'd send this patch that I quickly hacked together some
days back.  It's likely going to be controversial because we don't
have vector costs in place at all yet and even with costs it's
probably debatable as the emitted sequence is longer :)
I'm willing to defer or ditch it altogether but as it's small and
localized why not at least discuss it quickly.

For immediates that are powers of two, instead of loading them into a
GPR and then broadcasting (incurring the scalar-vector latency) we
can synthesize them with a vmv.vi and a vsll.v.i.  Depending on actual
costs we could also add more complicated synthesis patterns in the
future.

Regards
 Robin

gcc/ChangeLog:

	* config/riscv/riscv-selftests.cc (run_const_vector_selftests):
	Adjust expectation.
	* config/riscv/riscv-v.cc (expand_const_vector): Synthesize
	power-of-two constants.

gcc/testsuite/ChangeLog:

	* gcc.target/riscv/rvv/autovec/vmv-imm-fixed-rv32.c: Adjust test
	expectation.
	* gcc.target/riscv/rvv/autovec/vmv-imm-fixed-rv64.c: Dito.
	* gcc.target/riscv/rvv/autovec/vmv-imm-rv32.c: Dito.
	* gcc.target/riscv/rvv/autovec/vmv-imm-rv64.c: Dito.
---
 gcc/config/riscv/riscv-selftests.cc           |  9 +++++-
 gcc/config/riscv/riscv-v.cc                   | 31 +++++++++++++++++++
 .../riscv/rvv/autovec/vmv-imm-fixed-rv32.c    |  5 +--
 .../riscv/rvv/autovec/vmv-imm-fixed-rv64.c    |  5 +--
 .../riscv/rvv/autovec/vmv-imm-rv32.c          |  5 +--
 .../riscv/rvv/autovec/vmv-imm-rv64.c          |  5 +--
 6 files changed, 51 insertions(+), 9 deletions(-)
  

Comments

Andrew Waterman May 30, 2023, 8:18 p.m. UTC | #1
This turns out to be a de-optimization for implementations with any
amount of temporal execution (which is most machines with LMUL > 1 and
even some machines with LMUL <= 1).  Scalar instructions are generally
cheaper than multi-cycle-occupancy vector operations, so reducing
scalar work by increasing vector work is normally not a good tradeoff.
(And even if the vector instruction has unit occupancy, it likely
burns a bit more energy.)  The best generic scheme to load 143 into
all elements of a vector register is to first load 143 into a scalar
register, then use vmv.v.x.  If the proposed scheme is profitable on
some implementations in some circumstances, it should probably be
enabled only when tuning for that implementation.


On Tue, May 30, 2023 at 12:14 PM Robin Dapp via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> Hi,
>
> I figured I'd send this patch that I quickly hacked together some
> days back.  It's likely going to be controversial because we don't
> have vector costs in place at all yet and even with costs it's
> probably debatable as the emitted sequence is longer :)
> I'm willing to defer or ditch it altogether but as it's small and
> localized why not at least discuss it quickly.
>
> For immediates that are powers of two, instead of loading them into a
> GPR and then broadcasting (incurring the scalar-vector latency) we
> can synthesize them with a vmv.vi and a vsll.v.i.  Depending on actual
> costs we could also add more complicated synthesis patterns in the
> future.
>
> Regards
>  Robin
>
> gcc/ChangeLog:
>
>         * config/riscv/riscv-selftests.cc (run_const_vector_selftests):
>         Adjust expectation.
>         * config/riscv/riscv-v.cc (expand_const_vector): Synthesize
>         power-of-two constants.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/riscv/rvv/autovec/vmv-imm-fixed-rv32.c: Adjust test
>         expectation.
>         * gcc.target/riscv/rvv/autovec/vmv-imm-fixed-rv64.c: Dito.
>         * gcc.target/riscv/rvv/autovec/vmv-imm-rv32.c: Dito.
>         * gcc.target/riscv/rvv/autovec/vmv-imm-rv64.c: Dito.
> ---
>  gcc/config/riscv/riscv-selftests.cc           |  9 +++++-
>  gcc/config/riscv/riscv-v.cc                   | 31 +++++++++++++++++++
>  .../riscv/rvv/autovec/vmv-imm-fixed-rv32.c    |  5 +--
>  .../riscv/rvv/autovec/vmv-imm-fixed-rv64.c    |  5 +--
>  .../riscv/rvv/autovec/vmv-imm-rv32.c          |  5 +--
>  .../riscv/rvv/autovec/vmv-imm-rv64.c          |  5 +--
>  6 files changed, 51 insertions(+), 9 deletions(-)
>
> diff --git a/gcc/config/riscv/riscv-selftests.cc b/gcc/config/riscv/riscv-selftests.cc
> index 1bf1a648fa1..21fa460bb1f 100644
> --- a/gcc/config/riscv/riscv-selftests.cc
> +++ b/gcc/config/riscv/riscv-selftests.cc
> @@ -259,9 +259,16 @@ run_const_vector_selftests (void)
>               rtx_insn *insn = get_last_insn ();
>               rtx src = XEXP (SET_SRC (PATTERN (insn)), 1);
>               /* 1. Should be vmv.v.i for in rang of -16 ~ 15.
> -                2. Should be vmv.v.x for exceed -16 ~ 15.  */
> +                2. For 16 (and appropriate higher powers of two)
> +                   expect a shift because we emit a
> +                   vmv.v.i v1, 8 and a
> +                   vsll.v.i v1, v1, 1.
> +                3. Should be vmv.v.x for everything else.  */
>               if (IN_RANGE (val, -16, 15))
>                 ASSERT_TRUE (rtx_equal_p (src, dup));
> +             else if (IN_RANGE (val, 16, 16))
> +               ASSERT_TRUE (GET_CODE (src) == ASHIFT
> +                            && INTVAL (XEXP (src, 1)) == 1);
>               else
>                 ASSERT_TRUE (
>                   rtx_equal_p (src,
> diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
> index b381970140d..b295a48bb9d 100644
> --- a/gcc/config/riscv/riscv-v.cc
> +++ b/gcc/config/riscv/riscv-v.cc
> @@ -560,6 +560,7 @@ expand_const_vector (rtx target, rtx src)
>    rtx elt;
>    if (const_vec_duplicate_p (src, &elt))
>      {
> +      HOST_WIDE_INT val = INTVAL (elt);
>        rtx tmp = register_operand (target, mode) ? target : gen_reg_rtx (mode);
>        /* Element in range -16 ~ 15 integer or 0.0 floating-point,
>          we use vmv.v.i instruction.  */
> @@ -568,6 +569,36 @@ expand_const_vector (rtx target, rtx src)
>           rtx ops[] = {tmp, src};
>           emit_vlmax_insn (code_for_pred_mov (mode), RVV_UNOP, ops);
>         }
> +      /* If we can reach the immediate by loading an immediate and shifting,
> +        assume this is cheaper than loading a scalar.
> +        A power-of-two value > 15 cannot be loaded with vmv.v.i but we can
> +        load 8 into a vector register and shift it.  */
> +      else if (val > 15 && wi::popcount (val) == 1
> +              && exact_log2 (val) - 3 /* exact_log2 (8)  */
> +              <= 15)
> +       {
> +         /* We could also allow shifting an immediate and adding
> +            another one if VAL is suitable.
> +            This would allow us to synthesize constants like
> +            143 = 128 + 15 via
> +            vmv.v.i v1, 8
> +            vsll.vi v1, v1, 4
> +            vadd.vi v1, v1, 15
> +            TODO: Try more sequences and actually compare costs.  */
> +
> +         HOST_WIDE_INT sw = exact_log2 (val);
> +         rtx eight = gen_const_vec_duplicate (mode, GEN_INT (8));
> +         rtx imm = gen_reg_rtx (mode);
> +
> +         /* Load '8' as broadcast immediate.  */
> +         rtx ops1[] = {imm, eight};
> +         emit_vlmax_insn (code_for_pred_mov (mode), RVV_UNOP, ops1);
> +
> +         /* Shift it.  */
> +         rtx ops2[] = {tmp, imm, GEN_INT (sw - 3)};
> +         emit_vlmax_insn (code_for_pred_scalar (ASHIFT, mode),
> +                          RVV_BINOP, ops2);
> +       }
>        else
>         {
>           elt = force_reg (elt_mode, elt);
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-fixed-rv32.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-fixed-rv32.c
> index e8d017f7339..5aaf55935a0 100644
> --- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-fixed-rv32.c
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-fixed-rv32.c
> @@ -3,5 +3,6 @@
>
>  #include "vmv-imm-template.h"
>
> -/* { dg-final { scan-assembler-times {vmv.v.i} 32 } } */
> -/* { dg-final { scan-assembler-times {vmv.v.x} 8 } } */
> +/* { dg-final { scan-assembler-times {vmv.v.i} 34 } } */
> +/* { dg-final { scan-assembler-times {vsll.vi} 2 } } */
> +/* { dg-final { scan-assembler-times {vmv.v.x} 6 } } */
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-fixed-rv64.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-fixed-rv64.c
> index f85ad4117d3..0a7effde08a 100644
> --- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-fixed-rv64.c
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-fixed-rv64.c
> @@ -3,5 +3,6 @@
>
>  #include "vmv-imm-template.h"
>
> -/* { dg-final { scan-assembler-times {vmv.v.i} 32 } } */
> -/* { dg-final { scan-assembler-times {vmv.v.x} 8 } } */
> +/* { dg-final { scan-assembler-times {vmv.v.i} 34 } } */
> +/* { dg-final { scan-assembler-times {vsll.vi} 2 } } */
> +/* { dg-final { scan-assembler-times {vmv.v.x} 6 } } */
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-rv32.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-rv32.c
> index 6843bc6018d..d5e7fa409e8 100644
> --- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-rv32.c
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-rv32.c
> @@ -3,5 +3,6 @@
>
>  #include "vmv-imm-template.h"
>
> -/* { dg-final { scan-assembler-times {vmv.v.i} 32 } } */
> -/* { dg-final { scan-assembler-times {vmv.v.x} 8 } } */
> +/* { dg-final { scan-assembler-times {vmv.v.i} 34 } } */
> +/* { dg-final { scan-assembler-times {vsll.vi} 2 } } */
> +/* { dg-final { scan-assembler-times {vmv.v.x} 6 } } */
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-rv64.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-rv64.c
> index 39fb2a6cc7b..adb6a0b869e 100644
> --- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-rv64.c
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-rv64.c
> @@ -3,5 +3,6 @@
>
>  #include "vmv-imm-template.h"
>
> -/* { dg-final { scan-assembler-times {vmv.v.i} 32 } } */
> -/* { dg-final { scan-assembler-times {vmv.v.x} 8 } } */
> +/* { dg-final { scan-assembler-times {vmv.v.i} 34 } } */
> +/* { dg-final { scan-assembler-times {vsll.vi} 2 } } */
> +/* { dg-final { scan-assembler-times {vmv.v.x} 6 } } */
> --
> 2.40.1
  
juzhe.zhong@rivai.ai May 30, 2023, 10:01 p.m. UTC | #2
I agree with Andrew.

And I don't think this patch is appropriate for following reasons:
1. This patch increases vector workload in machine since 
     it convert scalar load + vmv.v.x into vmv.v.i + vsll.vi.
2. For multi-issue OoO machine, scalar instructions are very cheap
    when they are located in vector codegen. For example a sequence
    like this:
      scalar insn
      scalar insn
      vector insn
      scalar insn
      vector insn
      ....
      In such situation, we can issue multiple instructions simultaneously,
      and the latency of scalar instructions will be hided so scalar instruction
      is cheap. Wheras this patch increasing vector pipeline workload is not
      friendly to OoO machine what I mentioned above.
3.   I can image the only benefit of this patch is that we can reduce scalar register pressure
      in some extreme circumstances. However, I don't this benefit is "real" since GCC should
      well schedule the instruction sequence when we well tune the vector instructions scheduling  
      model and cost model to make such register live range very short when the scalar register
      pressure is very high.

Overal, I disagree with this patch.

Thanks.


juzhe.zhong@rivai.ai
 
From: Andrew Waterman
Date: 2023-05-31 04:18
To: Robin Dapp
CC: gcc-patches; Kito Cheng; palmer; juzhe.zhong@rivai.ai; jeffreyalaw
Subject: Re: [PATCH] RISC-V: Synthesize power-of-two constants.
This turns out to be a de-optimization for implementations with any
amount of temporal execution (which is most machines with LMUL > 1 and
even some machines with LMUL <= 1).  Scalar instructions are generally
cheaper than multi-cycle-occupancy vector operations, so reducing
scalar work by increasing vector work is normally not a good tradeoff.
(And even if the vector instruction has unit occupancy, it likely
burns a bit more energy.)  The best generic scheme to load 143 into
all elements of a vector register is to first load 143 into a scalar
register, then use vmv.v.x.  If the proposed scheme is profitable on
some implementations in some circumstances, it should probably be
enabled only when tuning for that implementation.
 
 
On Tue, May 30, 2023 at 12:14 PM Robin Dapp via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> Hi,
>
> I figured I'd send this patch that I quickly hacked together some
> days back.  It's likely going to be controversial because we don't
> have vector costs in place at all yet and even with costs it's
> probably debatable as the emitted sequence is longer :)
> I'm willing to defer or ditch it altogether but as it's small and
> localized why not at least discuss it quickly.
>
> For immediates that are powers of two, instead of loading them into a
> GPR and then broadcasting (incurring the scalar-vector latency) we
> can synthesize them with a vmv.vi and a vsll.v.i.  Depending on actual
> costs we could also add more complicated synthesis patterns in the
> future.
>
> Regards
>  Robin
>
> gcc/ChangeLog:
>
>         * config/riscv/riscv-selftests.cc (run_const_vector_selftests):
>         Adjust expectation.
>         * config/riscv/riscv-v.cc (expand_const_vector): Synthesize
>         power-of-two constants.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/riscv/rvv/autovec/vmv-imm-fixed-rv32.c: Adjust test
>         expectation.
>         * gcc.target/riscv/rvv/autovec/vmv-imm-fixed-rv64.c: Dito.
>         * gcc.target/riscv/rvv/autovec/vmv-imm-rv32.c: Dito.
>         * gcc.target/riscv/rvv/autovec/vmv-imm-rv64.c: Dito.
> ---
>  gcc/config/riscv/riscv-selftests.cc           |  9 +++++-
>  gcc/config/riscv/riscv-v.cc                   | 31 +++++++++++++++++++
>  .../riscv/rvv/autovec/vmv-imm-fixed-rv32.c    |  5 +--
>  .../riscv/rvv/autovec/vmv-imm-fixed-rv64.c    |  5 +--
>  .../riscv/rvv/autovec/vmv-imm-rv32.c          |  5 +--
>  .../riscv/rvv/autovec/vmv-imm-rv64.c          |  5 +--
>  6 files changed, 51 insertions(+), 9 deletions(-)
>
> diff --git a/gcc/config/riscv/riscv-selftests.cc b/gcc/config/riscv/riscv-selftests.cc
> index 1bf1a648fa1..21fa460bb1f 100644
> --- a/gcc/config/riscv/riscv-selftests.cc
> +++ b/gcc/config/riscv/riscv-selftests.cc
> @@ -259,9 +259,16 @@ run_const_vector_selftests (void)
>               rtx_insn *insn = get_last_insn ();
>               rtx src = XEXP (SET_SRC (PATTERN (insn)), 1);
>               /* 1. Should be vmv.v.i for in rang of -16 ~ 15.
> -                2. Should be vmv.v.x for exceed -16 ~ 15.  */
> +                2. For 16 (and appropriate higher powers of two)
> +                   expect a shift because we emit a
> +                   vmv.v.i v1, 8 and a
> +                   vsll.v.i v1, v1, 1.
> +                3. Should be vmv.v.x for everything else.  */
>               if (IN_RANGE (val, -16, 15))
>                 ASSERT_TRUE (rtx_equal_p (src, dup));
> +             else if (IN_RANGE (val, 16, 16))
> +               ASSERT_TRUE (GET_CODE (src) == ASHIFT
> +                            && INTVAL (XEXP (src, 1)) == 1);
>               else
>                 ASSERT_TRUE (
>                   rtx_equal_p (src,
> diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
> index b381970140d..b295a48bb9d 100644
> --- a/gcc/config/riscv/riscv-v.cc
> +++ b/gcc/config/riscv/riscv-v.cc
> @@ -560,6 +560,7 @@ expand_const_vector (rtx target, rtx src)
>    rtx elt;
>    if (const_vec_duplicate_p (src, &elt))
>      {
> +      HOST_WIDE_INT val = INTVAL (elt);
>        rtx tmp = register_operand (target, mode) ? target : gen_reg_rtx (mode);
>        /* Element in range -16 ~ 15 integer or 0.0 floating-point,
>          we use vmv.v.i instruction.  */
> @@ -568,6 +569,36 @@ expand_const_vector (rtx target, rtx src)
>           rtx ops[] = {tmp, src};
>           emit_vlmax_insn (code_for_pred_mov (mode), RVV_UNOP, ops);
>         }
> +      /* If we can reach the immediate by loading an immediate and shifting,
> +        assume this is cheaper than loading a scalar.
> +        A power-of-two value > 15 cannot be loaded with vmv.v.i but we can
> +        load 8 into a vector register and shift it.  */
> +      else if (val > 15 && wi::popcount (val) == 1
> +              && exact_log2 (val) - 3 /* exact_log2 (8)  */
> +              <= 15)
> +       {
> +         /* We could also allow shifting an immediate and adding
> +            another one if VAL is suitable.
> +            This would allow us to synthesize constants like
> +            143 = 128 + 15 via
> +            vmv.v.i v1, 8
> +            vsll.vi v1, v1, 4
> +            vadd.vi v1, v1, 15
> +            TODO: Try more sequences and actually compare costs.  */
> +
> +         HOST_WIDE_INT sw = exact_log2 (val);
> +         rtx eight = gen_const_vec_duplicate (mode, GEN_INT (8));
> +         rtx imm = gen_reg_rtx (mode);
> +
> +         /* Load '8' as broadcast immediate.  */
> +         rtx ops1[] = {imm, eight};
> +         emit_vlmax_insn (code_for_pred_mov (mode), RVV_UNOP, ops1);
> +
> +         /* Shift it.  */
> +         rtx ops2[] = {tmp, imm, GEN_INT (sw - 3)};
> +         emit_vlmax_insn (code_for_pred_scalar (ASHIFT, mode),
> +                          RVV_BINOP, ops2);
> +       }
>        else
>         {
>           elt = force_reg (elt_mode, elt);
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-fixed-rv32.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-fixed-rv32.c
> index e8d017f7339..5aaf55935a0 100644
> --- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-fixed-rv32.c
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-fixed-rv32.c
> @@ -3,5 +3,6 @@
>
>  #include "vmv-imm-template.h"
>
> -/* { dg-final { scan-assembler-times {vmv.v.i} 32 } } */
> -/* { dg-final { scan-assembler-times {vmv.v.x} 8 } } */
> +/* { dg-final { scan-assembler-times {vmv.v.i} 34 } } */
> +/* { dg-final { scan-assembler-times {vsll.vi} 2 } } */
> +/* { dg-final { scan-assembler-times {vmv.v.x} 6 } } */
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-fixed-rv64.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-fixed-rv64.c
> index f85ad4117d3..0a7effde08a 100644
> --- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-fixed-rv64.c
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-fixed-rv64.c
> @@ -3,5 +3,6 @@
>
>  #include "vmv-imm-template.h"
>
> -/* { dg-final { scan-assembler-times {vmv.v.i} 32 } } */
> -/* { dg-final { scan-assembler-times {vmv.v.x} 8 } } */
> +/* { dg-final { scan-assembler-times {vmv.v.i} 34 } } */
> +/* { dg-final { scan-assembler-times {vsll.vi} 2 } } */
> +/* { dg-final { scan-assembler-times {vmv.v.x} 6 } } */
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-rv32.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-rv32.c
> index 6843bc6018d..d5e7fa409e8 100644
> --- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-rv32.c
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-rv32.c
> @@ -3,5 +3,6 @@
>
>  #include "vmv-imm-template.h"
>
> -/* { dg-final { scan-assembler-times {vmv.v.i} 32 } } */
> -/* { dg-final { scan-assembler-times {vmv.v.x} 8 } } */
> +/* { dg-final { scan-assembler-times {vmv.v.i} 34 } } */
> +/* { dg-final { scan-assembler-times {vsll.vi} 2 } } */
> +/* { dg-final { scan-assembler-times {vmv.v.x} 6 } } */
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-rv64.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-rv64.c
> index 39fb2a6cc7b..adb6a0b869e 100644
> --- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-rv64.c
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-rv64.c
> @@ -3,5 +3,6 @@
>
>  #include "vmv-imm-template.h"
>
> -/* { dg-final { scan-assembler-times {vmv.v.i} 32 } } */
> -/* { dg-final { scan-assembler-times {vmv.v.x} 8 } } */
> +/* { dg-final { scan-assembler-times {vmv.v.i} 34 } } */
> +/* { dg-final { scan-assembler-times {vsll.vi} 2 } } */
> +/* { dg-final { scan-assembler-times {vmv.v.x} 6 } } */
> --
> 2.40.1
  
Jeff Law May 30, 2023, 10:09 p.m. UTC | #3
On 5/30/23 16:01, 钟居哲 wrote:
> I agree with Andrew.
> 
> And I don't think this patch is appropriate for following reasons:
> 1. This patch increases vector workload in machine since
>       it convert scalar load + vmv.v.x into vmv.v.i + vsll.vi.
This is probably uarch dependent.  I can probably construct cases where 
the first will be better and I can probably construct cases where the 
latter will be better.  In fact the recommendation from our uarch team 
is to generally do this stuff on the vector side.



> 2. For multi-issue OoO machine, scalar instructions are very cheap
>      when they are located in vector codegen. For example a sequence
>      like this:
>        scalar insn
>        scalar insn
>        vector insn
>        scalar insn
> vector insn
>        ....
>        In such situation, we can issue multiple instructions simultaneously,
>        and the latency of scalar instructions will be hided so scalar 
> instruction
>        is cheap. Wheras this patch increasing vector pipeline workload 
> is not
>        friendly to OoO machine what I mentioned above.
I probably need to be careful what I say here :-)  I'll go with mixing 
vector/scalar code may incur certain penalties on some 
microarchitectures depending on the exact code sequences involved.


> 3.   I can image the only benefit of this patch is that we can reduce 
> scalar register pressure
>        in some extreme circumstances. However, I don't this benefit is 
> "real" since GCC should
>        well schedule the instruction sequence when we well tune the 
> vector instructions scheduling
>        model and cost model to make such register live range very short 
> when the scalar register
>        pressure is very high.
> 
> Overal, I disagree with this patch.
What I think this all argues is that it'll likely need to be uarch 
dependent.    I'm not yet sure how to describe the properties of the 
uarch in a concise manner to put into our costing structure yet though.

jeff
  
juzhe.zhong@rivai.ai May 30, 2023, 10:13 p.m. UTC | #4
Ok. I prefer just keep scalar load + vmv.v.x by default since I believe most machines 
prefer this way.



juzhe.zhong@rivai.ai
 
From: Jeff Law
Date: 2023-05-31 06:09
To: 钟居哲; andrew; rdapp.gcc
CC: gcc-patches; kito.cheng; palmer
Subject: Re: [PATCH] RISC-V: Synthesize power-of-two constants.
 
 
On 5/30/23 16:01, 钟居哲 wrote:
> I agree with Andrew.
> 
> And I don't think this patch is appropriate for following reasons:
> 1. This patch increases vector workload in machine since
>       it convert scalar load + vmv.v.x into vmv.v.i + vsll.vi.
This is probably uarch dependent.  I can probably construct cases where 
the first will be better and I can probably construct cases where the 
latter will be better.  In fact the recommendation from our uarch team 
is to generally do this stuff on the vector side.
 
 
 
> 2. For multi-issue OoO machine, scalar instructions are very cheap
>      when they are located in vector codegen. For example a sequence
>      like this:
>        scalar insn
>        scalar insn
>        vector insn
>        scalar insn
> vector insn
>        ....
>        In such situation, we can issue multiple instructions simultaneously,
>        and the latency of scalar instructions will be hided so scalar 
> instruction
>        is cheap. Wheras this patch increasing vector pipeline workload 
> is not
>        friendly to OoO machine what I mentioned above.
I probably need to be careful what I say here :-)  I'll go with mixing 
vector/scalar code may incur certain penalties on some 
microarchitectures depending on the exact code sequences involved.
 
 
> 3.   I can image the only benefit of this patch is that we can reduce 
> scalar register pressure
>        in some extreme circumstances. However, I don't this benefit is 
> "real" since GCC should
>        well schedule the instruction sequence when we well tune the 
> vector instructions scheduling
>        model and cost model to make such register live range very short 
> when the scalar register
>        pressure is very high.
> 
> Overal, I disagree with this patch.
What I think this all argues is that it'll likely need to be uarch 
dependent.    I'm not yet sure how to describe the properties of the 
uarch in a concise manner to put into our costing structure yet though.
 
jeff
  
Jeff Law May 30, 2023, 10:14 p.m. UTC | #5
On 5/30/23 16:13, 钟居哲 wrote:
> Ok. I prefer just keep scalar load + vmv.v.x by default since I believe 
> most machines prefer this way.
Seems quite reasonable to me.
jeff
  
Philipp Tomsich May 30, 2023, 10:17 p.m. UTC | #6
Assuming a fully pipelined vector unit (and from experience on
AArch64), an u-arch's scalar-to-vector move cost is likely to play a
significant role in whether this will be profitable or not.

--Philipp.

On Wed, 31 May 2023 at 00:10, Jeff Law via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
>
>
> On 5/30/23 16:01, 钟居哲 wrote:
> > I agree with Andrew.
> >
> > And I don't think this patch is appropriate for following reasons:
> > 1. This patch increases vector workload in machine since
> >       it convert scalar load + vmv.v.x into vmv.v.i + vsll.vi.
> This is probably uarch dependent.  I can probably construct cases where
> the first will be better and I can probably construct cases where the
> latter will be better.  In fact the recommendation from our uarch team
> is to generally do this stuff on the vector side.
>
>
>
> > 2. For multi-issue OoO machine, scalar instructions are very cheap
> >      when they are located in vector codegen. For example a sequence
> >      like this:
> >        scalar insn
> >        scalar insn
> >        vector insn
> >        scalar insn
> > vector insn
> >        ....
> >        In such situation, we can issue multiple instructions simultaneously,
> >        and the latency of scalar instructions will be hided so scalar
> > instruction
> >        is cheap. Wheras this patch increasing vector pipeline workload
> > is not
> >        friendly to OoO machine what I mentioned above.
> I probably need to be careful what I say here :-)  I'll go with mixing
> vector/scalar code may incur certain penalties on some
> microarchitectures depending on the exact code sequences involved.
>
>
> > 3.   I can image the only benefit of this patch is that we can reduce
> > scalar register pressure
> >        in some extreme circumstances. However, I don't this benefit is
> > "real" since GCC should
> >        well schedule the instruction sequence when we well tune the
> > vector instructions scheduling
> >        model and cost model to make such register live range very short
> > when the scalar register
> >        pressure is very high.
> >
> > Overal, I disagree with this patch.
> What I think this all argues is that it'll likely need to be uarch
> dependent.    I'm not yet sure how to describe the properties of the
> uarch in a concise manner to put into our costing structure yet though.
>
> jeff
  

Patch

diff --git a/gcc/config/riscv/riscv-selftests.cc b/gcc/config/riscv/riscv-selftests.cc
index 1bf1a648fa1..21fa460bb1f 100644
--- a/gcc/config/riscv/riscv-selftests.cc
+++ b/gcc/config/riscv/riscv-selftests.cc
@@ -259,9 +259,16 @@  run_const_vector_selftests (void)
 	      rtx_insn *insn = get_last_insn ();
 	      rtx src = XEXP (SET_SRC (PATTERN (insn)), 1);
 	      /* 1. Should be vmv.v.i for in rang of -16 ~ 15.
-		 2. Should be vmv.v.x for exceed -16 ~ 15.  */
+		 2. For 16 (and appropriate higher powers of two)
+		    expect a shift because we emit a
+		    vmv.v.i v1, 8 and a
+		    vsll.v.i v1, v1, 1.
+		 3. Should be vmv.v.x for everything else.  */
 	      if (IN_RANGE (val, -16, 15))
 		ASSERT_TRUE (rtx_equal_p (src, dup));
+	      else if (IN_RANGE (val, 16, 16))
+		ASSERT_TRUE (GET_CODE (src) == ASHIFT
+			     && INTVAL (XEXP (src, 1)) == 1);
 	      else
 		ASSERT_TRUE (
 		  rtx_equal_p (src,
diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index b381970140d..b295a48bb9d 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -560,6 +560,7 @@  expand_const_vector (rtx target, rtx src)
   rtx elt;
   if (const_vec_duplicate_p (src, &elt))
     {
+      HOST_WIDE_INT val = INTVAL (elt);
       rtx tmp = register_operand (target, mode) ? target : gen_reg_rtx (mode);
       /* Element in range -16 ~ 15 integer or 0.0 floating-point,
 	 we use vmv.v.i instruction.  */
@@ -568,6 +569,36 @@  expand_const_vector (rtx target, rtx src)
 	  rtx ops[] = {tmp, src};
 	  emit_vlmax_insn (code_for_pred_mov (mode), RVV_UNOP, ops);
 	}
+      /* If we can reach the immediate by loading an immediate and shifting,
+	 assume this is cheaper than loading a scalar.
+	 A power-of-two value > 15 cannot be loaded with vmv.v.i but we can
+	 load 8 into a vector register and shift it.  */
+      else if (val > 15 && wi::popcount (val) == 1
+	       && exact_log2 (val) - 3 /* exact_log2 (8)  */
+	       <= 15)
+	{
+	  /* We could also allow shifting an immediate and adding
+	     another one if VAL is suitable.
+	     This would allow us to synthesize constants like
+	     143 = 128 + 15 via
+	     vmv.v.i v1, 8
+	     vsll.vi v1, v1, 4
+	     vadd.vi v1, v1, 15
+	     TODO: Try more sequences and actually compare costs.  */
+
+	  HOST_WIDE_INT sw = exact_log2 (val);
+	  rtx eight = gen_const_vec_duplicate (mode, GEN_INT (8));
+	  rtx imm = gen_reg_rtx (mode);
+
+	  /* Load '8' as broadcast immediate.  */
+	  rtx ops1[] = {imm, eight};
+	  emit_vlmax_insn (code_for_pred_mov (mode), RVV_UNOP, ops1);
+
+	  /* Shift it.  */
+	  rtx ops2[] = {tmp, imm, GEN_INT (sw - 3)};
+	  emit_vlmax_insn (code_for_pred_scalar (ASHIFT, mode),
+			   RVV_BINOP, ops2);
+	}
       else
 	{
 	  elt = force_reg (elt_mode, elt);
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-fixed-rv32.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-fixed-rv32.c
index e8d017f7339..5aaf55935a0 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-fixed-rv32.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-fixed-rv32.c
@@ -3,5 +3,6 @@ 
 
 #include "vmv-imm-template.h"
 
-/* { dg-final { scan-assembler-times {vmv.v.i} 32 } } */
-/* { dg-final { scan-assembler-times {vmv.v.x} 8 } } */
+/* { dg-final { scan-assembler-times {vmv.v.i} 34 } } */
+/* { dg-final { scan-assembler-times {vsll.vi} 2 } } */
+/* { dg-final { scan-assembler-times {vmv.v.x} 6 } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-fixed-rv64.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-fixed-rv64.c
index f85ad4117d3..0a7effde08a 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-fixed-rv64.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-fixed-rv64.c
@@ -3,5 +3,6 @@ 
 
 #include "vmv-imm-template.h"
 
-/* { dg-final { scan-assembler-times {vmv.v.i} 32 } } */
-/* { dg-final { scan-assembler-times {vmv.v.x} 8 } } */
+/* { dg-final { scan-assembler-times {vmv.v.i} 34 } } */
+/* { dg-final { scan-assembler-times {vsll.vi} 2 } } */
+/* { dg-final { scan-assembler-times {vmv.v.x} 6 } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-rv32.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-rv32.c
index 6843bc6018d..d5e7fa409e8 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-rv32.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-rv32.c
@@ -3,5 +3,6 @@ 
 
 #include "vmv-imm-template.h"
 
-/* { dg-final { scan-assembler-times {vmv.v.i} 32 } } */
-/* { dg-final { scan-assembler-times {vmv.v.x} 8 } } */
+/* { dg-final { scan-assembler-times {vmv.v.i} 34 } } */
+/* { dg-final { scan-assembler-times {vsll.vi} 2 } } */
+/* { dg-final { scan-assembler-times {vmv.v.x} 6 } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-rv64.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-rv64.c
index 39fb2a6cc7b..adb6a0b869e 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-rv64.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vmv-imm-rv64.c
@@ -3,5 +3,6 @@ 
 
 #include "vmv-imm-template.h"
 
-/* { dg-final { scan-assembler-times {vmv.v.i} 32 } } */
-/* { dg-final { scan-assembler-times {vmv.v.x} 8 } } */
+/* { dg-final { scan-assembler-times {vmv.v.i} 34 } } */
+/* { dg-final { scan-assembler-times {vsll.vi} 2 } } */
+/* { dg-final { scan-assembler-times {vmv.v.x} 6 } } */