RISC-V: Enhance RVV VLA SLP auto-vectorization with decompress operation
Commit Message
From: Juzhe-Zhong <juzhe.zhong@rivai.ai>
According to the RVV ISA
(https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc,
16.5.1. Synthesizing vdecompress), we can enhance VLA SLP
auto-vectorization with the decompress operation.
Case 1 (nunits = POLY_INT_CST [16, 16]):
_48 = VEC_PERM_EXPR <_37, _35, { 0, POLY_INT_CST [16, 16], 1, POLY_INT_CST [17, 16], 2, POLY_INT_CST [18, 16], ... }>;
We can optimize such a VLA SLP permutation pattern into:
_48 = vdecompress (_37, _35, mask = { 0, 1, 0, 1, ... });
Case 2 (nunits = POLY_INT_CST [16, 16]):
_23 = VEC_PERM_EXPR <_46, _44, { POLY_INT_CST [1, 1], POLY_INT_CST [3, 3], POLY_INT_CST [2, 1], POLY_INT_CST [4, 3], POLY_INT_CST [3, 1], POLY_INT_CST [5, 3], ... }>;
We can optimize such a VLA SLP permutation pattern into:
_23 = vdecompress (slidedown (_46, 1/2 nunits), slidedown (_44, 1/2 nunits), mask = { 0, 1, 0, 1, ... });
For example:
void __attribute__ ((noinline, noclone))
vec_slp (uint64_t *restrict a, uint64_t b, uint64_t c, int n)
{
  for (int i = 0; i < n; ++i)
    {
      a[i * 2] += b;
      a[i * 2 + 1] += c;
    }
}
ASM:
...
  vid.v       v0
  vand.vi     v0,v0,1
  vmseq.vi    v0,v0,1            ===> mask = { 0, 1, 0, 1, ... }
vdecompress:
  viota.m     v3,v0
  vrgather.vv v2,v1,v3,v0.t
...
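To make the emitted sequence concrete, here is a scalar model of what
vmv + viota.m + masked vrgather.vv compute together. This is only an
illustrative sketch; the helper name decompress_model and the plain-C
loops are assumptions for exposition, not part of the patch:

#include <stdint.h>

/* Scalar model of the synthesized vdecompress: the destination starts
   as op0, viota.m numbers the mask=1 lanes 0, 1, 2, ..., and the masked
   vrgather.vv pulls consecutive elements of the packed source op1 into
   those lanes.  */
void
decompress_model (uint64_t *dest, const uint64_t *op0, const uint64_t *op1,
                  const uint8_t *mask, int n)
{
  for (int i = 0; i < n; ++i)
    dest[i] = op0[i];           /* destination before vdecompress (vmv)  */

  int iota = 0;                 /* running count of set mask bits (viota.m)  */
  for (int i = 0; i < n; ++i)
    if (mask[i])
      dest[i] = op1[iota++];    /* masked vrgather.vv under v0.t  */
}

With the { 0, 1, 0, 1, ... } mask, viota.m numbers the odd lanes
0, 1, 2, ..., so they receive consecutive elements of the packed source,
while the even lanes keep the destination's previous contents.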
gcc/ChangeLog:
* config/riscv/riscv-v.cc (emit_vlmax_decompress_insn): New function.
(expand_const_vector): Enhance repeating sequence mask.
(shuffle_decompress_patterns): New function.
(expand_vec_perm_const_1): Add decompress optimization.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/autovec/partial/slp-8.c: New test.
* gcc.target/riscv/rvv/autovec/partial/slp-9.c: New test.
* gcc.target/riscv/rvv/autovec/partial/slp_run-8.c: New test.
* gcc.target/riscv/rvv/autovec/partial/slp_run-9.c: New test.
---
gcc/config/riscv/riscv-v.cc | 146 +++++++++++++++++-
.../riscv/rvv/autovec/partial/slp-8.c | 30 ++++
.../riscv/rvv/autovec/partial/slp-9.c | 31 ++++
.../riscv/rvv/autovec/partial/slp_run-8.c | 30 ++++
.../riscv/rvv/autovec/partial/slp_run-9.c | 30 ++++
5 files changed, 260 insertions(+), 7 deletions(-)
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp-8.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp-9.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp_run-8.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp_run-9.c
Comments
Hi Juzhe,
this seems like a nice improvement; looks good to me. While reading I was
wondering whether vzext could help synthesize some (zero-based) patterns
as well (e.g. 0 3 0 3 ...).
However, the sequences I could come up with were not shorter than what we
are already emitting, so probably not.
Regards
Robin
No, the pattern you pointed out is already supported. The operation is
very simple: a single vmv.v.i with a larger SEW is enough, so no vzext
is needed.
juzhe.zhong@rivai.ai
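To see why a single splat at a larger SEW covers such zero-based
patterns, consider the byte view of a 16-bit splat. The short C program
below is a hypothetical illustration (not from the patch or this
thread), using plain scalar code in place of vmv.v.i/vmv.v.x:

#include <stdint.h>
#include <stdio.h>

/* On little-endian RISC-V, splatting 0x0300 at SEW=16 stores the byte
   sequence { 0x00, 0x03, 0x00, 0x03, ... }, i.e. the { 0, 3, 0, 3, ... }
   pattern at SEW=8, with no widening (vzext) step required.  */
int
main (void)
{
  uint16_t splat[4] = { 0x0300, 0x0300, 0x0300, 0x0300 };
  const uint8_t *bytes = (const uint8_t *) splat;
  for (int i = 0; i < 8; ++i)
    printf ("%d ", bytes[i]);   /* prints: 0 3 0 3 0 3 0 3 */
  return 0;
}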
I didn't take a close look yet (and I suspect I can't find time before I
start my vacation :P), but I am thinking we might add selftests for
expand_const_vector in the *future*; again, not a blocker for this
patch :)
I'll take this one. Go enjoy your vacation!
jeff
@@ -836,6 +836,46 @@ emit_vlmax_masked_gather_mu_insn (rtx target, rtx op, rtx sel, rtx mask)
emit_vlmax_masked_mu_insn (icode, RVV_BINOP_MU, ops);
}
+/* According to RVV ISA spec (16.5.1. Synthesizing vdecompress):
+ https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc
+
+ There is no inverse vdecompress provided, as this operation can be readily
+ synthesized using iota and a masked vrgather:
+
+ Desired functionality of 'vdecompress'
+ 7 6 5 4 3 2 1 0 # vid
+
+ e d c b a # packed vector of 5 elements
+ 1 0 0 1 1 1 0 1 # mask vector of 8 elements
+ p q r s t u v w # destination register before vdecompress
+
+ e q r d c b v a # result of vdecompress
+ # v0 holds mask
+ # v1 holds packed data
+ # v11 holds input expanded vector and result
+ viota.m v10, v0 # Calc iota from mask in v0
+ vrgather.vv v11, v1, v10, v0.t # Expand into destination
+ p q r s t u v w # v11 destination register
+ e d c b a # v1 source vector
+ 1 0 0 1 1 1 0 1 # v0 mask vector
+
+ 4 4 4 3 2 1 1 0 # v10 result of viota.m
+ e q r d c b v a # v11 destination after vrgather using viota.m under mask
+*/
+static void
+emit_vlmax_decompress_insn (rtx target, rtx op, rtx mask)
+{
+ machine_mode data_mode = GET_MODE (target);
+ machine_mode sel_mode = related_int_vector_mode (data_mode).require ();
+ if (GET_MODE_INNER (data_mode) == QImode)
+ sel_mode = get_vector_mode (HImode, GET_MODE_NUNITS (data_mode)).require ();
+
+ rtx sel = gen_reg_rtx (sel_mode);
+ rtx iota_ops[] = {sel, mask};
+ emit_vlmax_insn (code_for_pred_iota (sel_mode), RVV_UNOP, iota_ops);
+ emit_vlmax_masked_gather_mu_insn (target, op, sel, mask);
+}
+
/* Emit merge instruction. */
static machine_mode
@@ -934,14 +974,41 @@ expand_const_vector (rtx target, rtx src)
{
machine_mode mode = GET_MODE (target);
scalar_mode elt_mode = GET_MODE_INNER (mode);
+ poly_uint64 nunits = GET_MODE_NUNITS (mode);
+ unsigned int nelts_per_pattern = CONST_VECTOR_NELTS_PER_PATTERN (src);
+ unsigned int npatterns = CONST_VECTOR_NPATTERNS (src);
if (GET_MODE_CLASS (mode) == MODE_VECTOR_BOOL)
{
rtx elt;
- gcc_assert (
- const_vec_duplicate_p (src, &elt)
- && (rtx_equal_p (elt, const0_rtx) || rtx_equal_p (elt, const1_rtx)));
- rtx ops[] = {target, src};
- emit_vlmax_insn (code_for_pred_mov (mode), RVV_UNOP, ops);
+ if (const_vec_duplicate_p (src, &elt))
+ {
+ rtx ops[] = {target, src};
+ emit_vlmax_insn (code_for_pred_mov (mode), RVV_UNOP, ops);
+ }
+ else
+ {
+ gcc_assert (CONST_VECTOR_DUPLICATE_P (src));
+ if (npatterns == 2)
+ {
+ /* Generate mask with repeating sequence:
+ 1. { 0, 1, 0, 1, ... }.
+ 2. { 1, 0, 1, 0, ... }. */
+ rtx ele1 = CONST_VECTOR_ELT (src, 1);
+ machine_mode vid_mode
+ = get_vector_mode (QImode, nunits).require ();
+ rtx vid = gen_reg_rtx (vid_mode);
+ rtx vid_repeat = gen_reg_rtx (vid_mode);
+ emit_insn (
+ gen_vec_series (vid_mode, vid, const0_rtx, const1_rtx));
+ rtx and_ops[] = {vid_repeat, vid, const1_rtx};
+ emit_vlmax_insn (code_for_pred_scalar (AND, vid_mode), RVV_BINOP,
+ and_ops);
+ rtx const_vec = gen_const_vector_dup (vid_mode, INTVAL (ele1));
+ expand_vec_cmp (target, EQ, vid_repeat, const_vec);
+ }
+ else
+ gcc_unreachable ();
+ }
return;
}
@@ -977,8 +1044,6 @@ expand_const_vector (rtx target, rtx src)
}
/* Handle variable-length vector. */
- unsigned int nelts_per_pattern = CONST_VECTOR_NELTS_PER_PATTERN (src);
- unsigned int npatterns = CONST_VECTOR_NPATTERNS (src);
rvv_builder builder (mode, npatterns, nelts_per_pattern);
for (unsigned int i = 0; i < nelts_per_pattern; i++)
{
@@ -2337,6 +2402,71 @@ struct expand_vec_perm_d
bool testing_p;
};
+/* Recognize decompress patterns:
+
+ 1. VEC_PERM_EXPR op0 and op1
+ with isel = { 0, nunits, 1, nunits + 1, ... }.
+ Decompress op0 and op1 vector with the mask = { 0, 1, 0, 1, ... }.
+
+ 2. VEC_PERM_EXPR op0 and op1
+ with isel = { 1/2 nunits, 3/2 nunits, 1/2 nunits+1, 3/2 nunits+1,... }.
+ Slide down op0 and op1 with OFFSET = 1/2 nunits.
+ Decompress op0 and op1 vector with the mask = { 0, 1, 0, 1, ... }.
+*/
+static bool
+shuffle_decompress_patterns (struct expand_vec_perm_d *d)
+{
+ poly_uint64 nelt = d->perm.length ();
+ machine_mode mask_mode = get_mask_mode (d->vmode).require ();
+
+ /* For constant size indices, we don't need to handle it here.
+ Just leave it to vec_perm<mode>. */
+ if (d->perm.length ().is_constant ())
+ return false;
+
+ poly_uint64 first = d->perm[0];
+ if ((maybe_ne (first, 0U) && maybe_ne (first * 2, nelt))
+ || !d->perm.series_p (0, 2, first, 1)
+ || !d->perm.series_p (1, 2, first + nelt, 1))
+ return false;
+
+ /* Permuting two SEW8 variable-length vectors needs vrgatherei16.vv.
+ Otherwise, it could overflow the index range. */
+ machine_mode sel_mode;
+ if (GET_MODE_INNER (d->vmode) == QImode
+ && !get_vector_mode (HImode, nelt).exists (&sel_mode))
+ return false;
+
+ /* Success! */
+ if (d->testing_p)
+ return true;
+
+ rtx op0, op1;
+ if (known_eq (first, 0U))
+ {
+ op0 = d->op0;
+ op1 = d->op1;
+ }
+ else
+ {
+ op0 = gen_reg_rtx (d->vmode);
+ op1 = gen_reg_rtx (d->vmode);
+ insn_code icode = code_for_pred_slide (UNSPEC_VSLIDEDOWN, d->vmode);
+ rtx ops0[] = {op0, d->op0, gen_int_mode (first, Pmode)};
+ rtx ops1[] = {op1, d->op1, gen_int_mode (first, Pmode)};
+ emit_vlmax_insn (icode, RVV_BINOP, ops0);
+ emit_vlmax_insn (icode, RVV_BINOP, ops1);
+ }
+ /* Generate the repeating { 0, 1, 0, 1, ... } mask. */
+ rvv_builder builder (mask_mode, 2, 1);
+ builder.quick_push (CONST0_RTX (BImode));
+ builder.quick_push (CONST1_RTX (BImode));
+ emit_move_insn (d->target, op0);
+ emit_vlmax_decompress_insn (d->target, op1,
+ force_reg (mask_mode, builder.build ()));
+ return true;
+}
+
/* Recognize the pattern that can be shuffled by generic approach. */
static bool
@@ -2388,6 +2518,8 @@ expand_vec_perm_const_1 (struct expand_vec_perm_d *d)
{
if (d->vmode == d->op_mode)
{
+ if (shuffle_decompress_patterns (d))
+ return true;
if (shuffle_generic_patterns (d))
return true;
return false;
new file mode 100644
@@ -0,0 +1,30 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-march=rv32gcv -mabi=ilp32d --param riscv-autovec-preference=scalable -fno-vect-cost-model -fdump-tree-optimized-details" } */
+
+#include <stdint.h>
+
+#define VEC_PERM(TYPE) \
+ void __attribute__ ((noinline, noclone)) \
+ vec_slp_##TYPE (TYPE *restrict a, TYPE b, TYPE c, int n) \
+ { \
+ for (int i = 0; i < n; ++i) \
+ { \
+ a[i * 2] += b; \
+ a[i * 2 + 1] += c; \
+ } \
+ }
+
+#define TEST_ALL(T) \
+ T (int8_t) \
+ T (uint8_t) \
+ T (int16_t) \
+ T (uint16_t) \
+ T (int32_t) \
+ T (uint32_t) \
+ T (int64_t) \
+ T (uint64_t)
+
+TEST_ALL (VEC_PERM)
+
+/* { dg-final { scan-tree-dump-times "\.VEC_PERM" 2 "optimized" } } */
+/* { dg-final { scan-assembler-times {viota.m} 2 } } */
new file mode 100644
@@ -0,0 +1,31 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-march=rv32gcv -mabi=ilp32d --param riscv-autovec-preference=scalable -fno-vect-cost-model -fdump-tree-optimized-details" } */
+
+#include <stdint.h>
+
+#define VEC_PERM(TYPE) \
+ void __attribute__ ((noinline, noclone)) \
+ vec_slp_##TYPE (TYPE *restrict a, TYPE b, TYPE c, int n) \
+ { \
+ for (int i = 0; i < n; ++i) \
+ { \
+ a[i * 4] += b; \
+ a[i * 4 + 1] += c; \
+ a[i * 4 + 2] += b; \
+ a[i * 4 + 3] += c; \
+ } \
+ }
+
+#define TEST_ALL(T) \
+ T (int8_t) \
+ T (uint8_t) \
+ T (int16_t) \
+ T (uint16_t) \
+ T (int32_t) \
+ T (uint32_t) \
+ T (int64_t) \
+ T (uint64_t)
+
+TEST_ALL (VEC_PERM)
+
+/* { dg-final { scan-assembler-times {viota.m} 2 } } */
new file mode 100644
@@ -0,0 +1,30 @@
+/* { dg-do run { target { riscv_vector } } } */
+/* { dg-additional-options "--param riscv-autovec-preference=scalable -fno-vect-cost-model" } */
+
+#include "slp-8.c"
+
+#define N (103 * 2)
+
+#define HARNESS(TYPE) \
+ { \
+ TYPE a[N], b[2] = { 3, 11 }; \
+ for (unsigned int i = 0; i < N; ++i) \
+ { \
+ a[i] = i * 2 + i % 5; \
+ asm volatile ("" ::: "memory"); \
+ } \
+ vec_slp_##TYPE (a, b[0], b[1], N / 2); \
+ for (unsigned int i = 0; i < N; ++i) \
+ { \
+ TYPE orig = i * 2 + i % 5; \
+ TYPE expected = orig + b[i % 2]; \
+ if (a[i] != expected) \
+ __builtin_abort (); \
+ } \
+ }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+ TEST_ALL (HARNESS)
+}
new file mode 100644
@@ -0,0 +1,30 @@
+/* { dg-do run { target { riscv_vector } } } */
+/* { dg-additional-options "--param riscv-autovec-preference=scalable -fno-vect-cost-model" } */
+
+#include "slp-9.c"
+
+#define N (103 * 4)
+
+#define HARNESS(TYPE) \
+ { \
+ TYPE a[N], b[2] = { 3, 11 }; \
+ for (unsigned int i = 0; i < N; ++i) \
+ { \
+ a[i] = i * 2 + i % 5; \
+ asm volatile ("" ::: "memory"); \
+ } \
+ vec_slp_##TYPE (a, b[0], b[1], N / 4); \
+ for (unsigned int i = 0; i < N; ++i) \
+ { \
+ TYPE orig = i * 2 + i % 5; \
+ TYPE expected = orig + b[i % 2]; \
+ if (a[i] != expected) \
+ __builtin_abort (); \
+ } \
+ }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+ TEST_ALL (HARNESS)
+}