[Ready to commit V3] RISC-V: Add AVL propagation PASS for RVV auto-vectorization
Commit Message
This patch addresses the redundant AVL/VL toggling in RVV partial auto-vectorization,
which has been a known issue for a long time; I finally found the time to address it.
Consider a simple vector addition operation:
https://godbolt.org/z/7hfGfEjW3
void
foo (int *__restrict a,
     int *__restrict b,
     int n)
{
  for (int i = 0; i < n; i++)
    a[i] = a[i] + b[i];
}
Optimized IR:
Loop body:
_38 = .SELECT_VL (ivtmp_36, POLY_INT_CST [4, 4]); -> vsetvli a5,a2,e8,mf4,ta,ma
...
vect__4.8_27 = .MASK_LEN_LOAD (vectp_a.6_29, 32B, { -1, ... }, _38, 0); -> vle32.v v2,0(a0)
vect__6.11_20 = .MASK_LEN_LOAD (vectp_b.9_25, 32B, { -1, ... }, _38, 0); -> vle32.v v1,0(a1)
vect__7.12_19 = vect__6.11_20 + vect__4.8_27; -> vsetvli a6,zero,e32,m1,ta,ma + vadd.vv v1,v1,v2
.MASK_LEN_STORE (vectp_a.13_11, 32B, { -1, ... }, _38, 0, vect__7.12_19); -> vsetvli zero,a5,e32,m1,ta,ma + vse32.v v1,0(a4)
We can see 2 redundant vsetvls inside the loop body due to AVL/VL toggling.
The AVL/VL toggling arises because we are missing LEN information in the simple PLUS_EXPR GIMPLE assignment:
vect__7.12_19 = vect__6.11_20 + vect__4.8_27;
GCC applies predicated (partial) loads/stores together with un-predicated full-vector arithmetic in partial vectorization.
This flow is shared with other targets such as ARM SVE (RVV uses it too):
ARM SVE:
.L3:
ld1w z30.s, p7/z, [x0, x3, lsl 2] -> predicated load
ld1w z31.s, p7/z, [x1, x3, lsl 2] -> predicated load
add z31.s, z31.s, z30.s -> un-predicated add
st1w z31.s, p7, [x0, x3, lsl 2] -> predicated store
Such a vectorization flow causes AVL/VL toggling on RVV, so we need an AVL propagation PASS for it.
Also, it's very unlikely that we can apply predicated operations to all vectorized operations, for the following reasons:
1. It would be a very heavy workload to support them across all vectorization, and we see no benefit over handling this in the target backend.
2. Changing the loop vectorizer for it would make the code base ugly and hard to maintain.
3. We would need a huge number of patterns covering all operations: not only COND_LEN_ADD, COND_LEN_SUB, ...,
but also COND_LEN_EXTEND, ..., COND_LEN_CEIL, ...: over 100 patterns, an unreasonable number (see the sketch below).
To conclude, we prefer un-predicated operations here, and design a clean AVL propagation PASS that elides the redundant vsetvls
caused by AVL/VL toggling.
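For illustration only, here is a hypothetical sketch (not something this patch or the current
vectorizer emits) of what the fully predicated form of the addition above would have to look
like, using the operand order of the existing COND_LEN_ADD internal function
(mask, op0, op1, else-value, len, bias):

  vect__7.12_19 = .COND_LEN_ADD ({ -1, ... }, vect__6.11_20, vect__4.8_27, vect__6.11_20, _38, 0);

Every arithmetic and conversion operation would need such a COND_LEN_* twin, which is exactly
the pattern explosion described in point 3 above.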
The second question is why we add a separate PASS for AVL propagation instead of optimizing it in the VSETVL PASS (we definitely could optimize AVL in the VSETVL PASS).
Frankly, I was planning to address this issue in the VSETVL PASS, which is why we recently refactored it. However, I changed my mind after several
experiments and attempts.
The reasons are as follows:
1. Code base management and maintainability. The current VSETVL PASS is complicated enough and already has plenty of aggressive and fancy optimizations, which
turn out to generate optimal codegen in most cases. It's not a good idea to keep adding features to the VSETVL PASS and make it
heavyweight again, only to need another refactoring in the future.
Actually, the VSETVL PASS is very stable and optimal after the recent refactoring. Hopefully, we will not need to change it any more except for minor
fixes.
2. vsetvl insertion (what the VSETVL PASS does) and AVL propagation are 2 different things; I don't think we should fuse them into the same PASS.
3. The VSETVL PASS is a post-RA PASS, whereas AVL propagation should be done before RA, where it can reduce register pressure.
4. This patch's AVL propagation PASS only does AVL propagation for RVV partial auto-vectorization situations.
The patch is only a few hundred lines, which is very manageable and easy to extend with new features and enhancements.
We can easily extend AVL propagation in this clean, separate PASS in the future; a sketch of such an extension follows below. (If we did it in the VSETVL PASS, we would complicate
the already complicated VSETVL PASS again.)
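As a purely illustrative sketch of that extensibility (AVLPROP_NONVLMAX_TU and
get_nonvlmax_tu_preferred_avl below are invented placeholders, not part of this patch), a new
propagation kind would only need a new enumerator plus a handler in the pass's existing
get_preferred_avl switch:

  enum avlprop_type
  {
    /* VLMAX AVL and tail agnostic candidates.  */
    AVLPROP_VLMAX_TA,
    /* Hypothetical future candidate kind, e.g. for intrinsic code.  */
    AVLPROP_NONVLMAX_TU,
    AVLPROP_NONE
  };

  /* Inside pass_avlprop::get_preferred_avl:  */
  switch (candidate.first)
    {
    case AVLPROP_VLMAX_TA:
      return get_vlmax_ta_preferred_avl (candidate.second);
    case AVLPROP_NONVLMAX_TU:
      /* Hypothetical new handler.  */
      return get_nonvlmax_tu_preferred_avl (candidate.second);
    default:
      gcc_unreachable ();
    }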
Here is a larger example to demonstrate further:
https://godbolt.org/z/bE86sv3q5
void foo2 (int *__restrict a,
           int *__restrict b,
           int *__restrict c,
           int *__restrict a2,
           int *__restrict b2,
           int *__restrict c2,
           int *__restrict a3,
           int *__restrict b3,
           int *__restrict c3,
           int *__restrict a4,
           int *__restrict b4,
           int *__restrict c4,
           int *__restrict a5,
           int *__restrict b5,
           int *__restrict c5,
           int n)
{
  for (int i = 0; i < n; i++){
      a[i] = b[i] + c[i];
      b5[i] = b[i] + c[i];
      a2[i] = b2[i] + c2[i];
      a3[i] = b3[i] + c3[i];
      a4[i] = b4[i] + c4[i];
      a5[i] = a[i] + a4[i];
      a[i] = a5[i] + b5[i] + a[i];

      a[i] = a[i] + c[i];
      b5[i] = a[i] + c[i];
      a2[i] = a[i] + c2[i];
      a3[i] = a[i] + c3[i];
      a4[i] = a[i] + c4[i];
      a5[i] = a[i] + a4[i];
      a[i] = a[i] + b5[i] + a[i];
  }
}
1. Loop Body:
Before this patch: After this patch:
vsetvli a4,t1,e8,mf4,ta,ma vsetvli a4,t1,e32,m1,ta,ma
vle32.v v2,0(a2) vle32.v v2,0(a2)
vle32.v v4,0(a1) vle32.v v3,0(t2)
vle32.v v1,0(t2) vle32.v v4,0(a1)
vsetvli a7,zero,e32,m1,ta,ma vle32.v v1,0(t0)
vadd.vv v4,v2,v4 vadd.vv v4,v2,v4
vsetvli zero,a4,e32,m1,ta,ma vadd.vv v1,v3,v1
vle32.v v3,0(s0) vadd.vv v1,v1,v4
vsetvli a7,zero,e32,m1,ta,ma vadd.vv v1,v1,v4
vadd.vv v1,v3,v1 vadd.vv v1,v1,v4
vadd.vv v1,v1,v4 vadd.vv v1,v1,v2
vadd.vv v1,v1,v4 vadd.vv v2,v1,v2
vadd.vv v1,v1,v4 vse32.v v2,0(t5)
vsetvli zero,a4,e32,m1,ta,ma vadd.vv v2,v2,v1
vle32.v v4,0(a5) vadd.vv v2,v2,v1
vsetvli a7,zero,e32,m1,ta,ma slli a7,a4,2
vadd.vv v1,v1,v2 vadd.vv v3,v1,v3
vadd.vv v2,v1,v2 vle32.v v5,0(a5)
vadd.vv v4,v1,v4 vle32.v v6,0(t6)
vsetvli zero,a4,e32,m1,ta,ma vse32.v v3,0(t3)
vse32.v v2,0(t5) vse32.v v2,0(a0)
vse32.v v4,0(a3) vadd.vv v3,v3,v1
vsetvli a7,zero,e32,m1,ta,ma vadd.vv v2,v1,v5
vadd.vv v3,v1,v3 vse32.v v3,0(t4)
vadd.vv v2,v2,v1 vadd.vv v1,v1,v6
vadd.vv v2,v2,v1 vse32.v v2,0(a3)
vsetvli zero,a4,e32,m1,ta,ma vse32.v v1,0(a6)
vse32.v v2,0(a0)
vse32.v v3,0(t3)
vle32.v v2,0(t0)
vsetvli a7,zero,e32,m1,ta,ma
vadd.vv v3,v3,v1
vsetvli zero,a4,e32,m1,ta,ma
vse32.v v3,0(t4)
vsetvli a7,zero,e32,m1,ta,ma
slli a7,a4,2
vadd.vv v1,v1,v2
sub t1,t1,a4
vsetvli zero,a4,e32,m1,ta,ma
vse32.v v1,0(a6)
It's quite obvious: all the heavy, redundant vsetvls inside the loop body are eliminated.
2. Epilogue:
Before this patch: After this patch:
.L5: .L5:
ld s0,8(sp) ret
addi sp,sp,16
jr ra
This is the benefit of doing the AVL propagation before RA: we eliminate the use of the 'a7' register,
which was used by the redundant AVL/VL toggling instruction 'vsetvli a7,zero,e32,m1,ta,ma'.
The final codegen after this patch:
foo2:
lw t1,56(sp)
ld t6,0(sp)
ld t3,8(sp)
ld t0,16(sp)
ld t2,24(sp)
ld t4,32(sp)
ld t5,40(sp)
ble t1,zero,.L5
.L3:
vsetvli a4,t1,e32,m1,ta,ma
vle32.v v2,0(a2)
vle32.v v3,0(t2)
vle32.v v4,0(a1)
vle32.v v1,0(t0)
vadd.vv v4,v2,v4
vadd.vv v1,v3,v1
vadd.vv v1,v1,v4
vadd.vv v1,v1,v4
vadd.vv v1,v1,v4
vadd.vv v1,v1,v2
vadd.vv v2,v1,v2
vse32.v v2,0(t5)
vadd.vv v2,v2,v1
vadd.vv v2,v2,v1
slli a7,a4,2
vadd.vv v3,v1,v3
vle32.v v5,0(a5)
vle32.v v6,0(t6)
vse32.v v3,0(t3)
vse32.v v2,0(a0)
vadd.vv v3,v3,v1
vadd.vv v2,v1,v5
vse32.v v3,0(t4)
vadd.vv v1,v1,v6
vse32.v v2,0(a3)
vse32.v v1,0(a6)
sub t1,t1,a4
add a1,a1,a7
add a2,a2,a7
add a5,a5,a7
add t6,t6,a7
add t0,t0,a7
add t2,t2,a7
add t5,t5,a7
add a3,a3,a7
add a6,a6,a7
add t3,t3,a7
add t4,t4,a7
add a0,a0,a7
bne t1,zero,.L3
.L5:
ret
PR target/111318
PR target/111888
gcc/ChangeLog:
* config.gcc: Add AVL propagation pass.
* config/riscv/riscv-passes.def (INSERT_PASS_AFTER): Ditto.
* config/riscv/riscv-protos.h (make_pass_avlprop): Ditto.
* config/riscv/t-riscv: Ditto.
* config/riscv/riscv-avlprop.cc: New file.
gcc/testsuite/ChangeLog:
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-5.c: Adapt test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-2.c: Ditto.
* gcc.target/riscv/rvv/autovec/partial/select_vl-2.c: Ditto.
* gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-2.c: Ditto.
* gcc.target/riscv/rvv/autovec/pr111318.c: New test.
* gcc.target/riscv/rvv/autovec/pr111888.c: New test.
---
gcc/config.gcc | 2 +-
gcc/config/riscv/riscv-avlprop.cc | 419 ++++++++++++++++++
gcc/config/riscv/riscv-passes.def | 1 +
gcc/config/riscv/riscv-protos.h | 1 +
gcc/config/riscv/t-riscv | 6 +
.../costmodel/riscv/rvv/dynamic-lmul4-5.c | 2 +-
.../costmodel/riscv/rvv/dynamic-lmul8-2.c | 2 +-
.../riscv/rvv/autovec/partial/select_vl-2.c | 5 +-
.../gcc.target/riscv/rvv/autovec/pr111318.c | 16 +
.../gcc.target/riscv/rvv/autovec/pr111888.c | 33 ++
.../riscv/rvv/autovec/ternop/ternop_nofm-2.c | 1 -
11 files changed, 482 insertions(+), 6 deletions(-)
create mode 100644 gcc/config/riscv/riscv-avlprop.cc
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111318.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111888.c
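For reference, a minimal sketch of how the new tests can be exercised locally (assuming an
already configured build tree for an rv64gcv target; the exact invocation depends on your
setup):

  make check-gcc RUNTESTFLAGS="rvv.exp=pr111318.c pr111888.c"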
Comments
The popcount and mask_gather_load_run failures seem to be an issue with my
setup or a bug in QEMU, based on the v2 discussion :-)
OK with me regarding testing (I don't have the authority to approve a
patch, but Kito already said LGTM):
https://inbox.sourceware.org/gcc-patches/CALLt3ThXmk4pey2QhSUvK183uuK3oY5bU=a4m8QYv-6UukBYyg@mail.gmail.com/
Thanks for your patience and the revisions!
Your patch resolves these failures on glibc qemu:
rv32gcv:
FAIL: gfortran.dg/intrinsic_pack_6.f90 -O2 execution test
FAIL: gfortran.dg/intrinsic_pack_6.f90 -O3 -g execution test
FAIL: gfortran.dg/matmul_3.f90 -O2 execution test
FAIL: gfortran.fortran-torture/execute/intrinsic_matmul.f90 execution, -O2
FAIL: gfortran.fortran-torture/execute/intrinsic_matmul.f90 execution,
-O2 -fbounds-check
FAIL: gfortran.fortran-torture/execute/intrinsic_matmul.f90 execution,
-O2 -fomit-frame-pointer -finline-functions
FAIL: gfortran.fortran-torture/execute/intrinsic_matmul.f90 execution,
-O3 -g
rv64gcv:
FAIL: gfortran.dg/matmul_6.f90 -O2 execution test
Tested-by: Patrick O'Neill <patrick@rivosinc.com>
Patrick
On 10/26/23 01:13, Juzhe-Zhong wrote:
> diff --git a/gcc/config.gcc b/gcc/config.gcc
> index 09a7fb13da1..d34ea246a98 100644
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -544,7 +544,7 @@ pru-*-*)
> riscv*)
> cpu_type=riscv
> extra_objs="riscv-builtins.o riscv-c.o riscv-sr.o riscv-shorten-memrefs.o riscv-selftests.o riscv-string.o"
> - extra_objs="${extra_objs} riscv-v.o riscv-vsetvl.o riscv-vector-costs.o"
> + extra_objs="${extra_objs} riscv-v.o riscv-vsetvl.o riscv-vector-costs.o riscv-avlprop.o"
> extra_objs="${extra_objs} riscv-vector-builtins.o riscv-vector-builtins-shapes.o riscv-vector-builtins-bases.o"
> extra_objs="${extra_objs} thead.o"
> d_target_objs="riscv-d.o"
> diff --git a/gcc/config/riscv/riscv-avlprop.cc b/gcc/config/riscv/riscv-avlprop.cc
> new file mode 100644
> index 00000000000..2c79ec81806
> --- /dev/null
> +++ b/gcc/config/riscv/riscv-avlprop.cc
> @@ -0,0 +1,419 @@
> +/* AVL propagation pass for RISC-V 'V' Extension for GNU compiler.
> + Copyright (C) 2023-2023 Free Software Foundation, Inc.
> + Contributed by Juzhe Zhong (juzhe.zhong@rivai.ai), RiVAI Technologies Ltd.
> +
> +This file is part of GCC.
> +
> +GCC is free software; you can redistribute it and/or modify
> +it under the terms of the GNU General Public License as published by
> +the Free Software Foundation; either version 3, or (at your option)
> +any later version.
> +
> +GCC is distributed in the hope that it will be useful,
> +but WITHOUT ANY WARRANTY; without even the implied warranty of
> +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> +GNU General Public License for more details.
> +
> +You should have received a copy of the GNU General Public License
> +along with GCC; see the file COPYING3. If not see
> +<http://www.gnu.org/licenses/>. */
> +
> +/* A pre-RA RTL_SSA-based pass that propagates AVL for RVV instructions.
> + A standalone AVL propagation pass is designed because:
> +
> + - Better code maintenance:
> + The current LCM-based VSETVL pass is so complicated that the code
> + there would become even harder to maintain. A straightforward
> + AVL propagation PASS is much easier to maintain.
> +
> + - Reduce scalar register pressure:
> + One type of AVL propagation is propagating the AVL from a NON-VLMAX
> + instruction to a VLMAX instruction.
> + Note: the VLMAX instruction must ignore tail elements (TA)
> + and its result must be used by the NON-VLMAX instruction.
> + This optimization is mostly for auto-vectorization codes:
> +
> + vsetvli r136, r137 --- SELECT_VL
> + vle8.v (use avl = r136) --- IFN_MASK_LEN_LOAD
> + vadd.vv (use VLMAX) --- PLUS_EXPR
> + vse8.v (use avl = r136) --- IFN_MASK_LEN_STORE
> +
> + Without AVL propagation:
> +
> + vsetvli a5, a4, ta
> + vle8.v v1
> + vsetvli t0, zero, ta
> + vadd.vv v2, v1, v1
> + vse8.v v2
> +
> + We can propagate the AVL to 'vadd.vv' since its result
> + is consumed by a 'vse8.v' which has AVL = a5 and its
> + tail elements are agnostic.
> +
> + We DON'T do this optimization in the VSETVL pass since it is a
> + post-RA pass that has already consumed 't0', whereas a standalone
> + pre-RA AVL propagation pass allows us to elide the consumption
> + of the pseudo register behind 't0', so we can reduce scalar
> + register pressure.
> +
> + - More AVL propagation opportunities:
> + A pre-RA pass is more flexible for the AVL REG def-use chain,
> + so we get more potential AVL propagations as long as
> + they don't increase scalar register pressure.
> +*/
> +
> +#define IN_TARGET_CODE 1
> +#define INCLUDE_ALGORITHM
> +#define INCLUDE_FUNCTIONAL
> +
> +#include "config.h"
> +#include "system.h"
> +#include "coretypes.h"
> +#include "tm.h"
> +#include "backend.h"
> +#include "rtl.h"
> +#include "target.h"
> +#include "tree-pass.h"
> +#include "df.h"
> +#include "rtl-ssa.h"
> +#include "cfgcleanup.h"
> +#include "insn-attr.h"
> +
> +using namespace rtl_ssa;
> +using namespace riscv_vector;
> +
> +enum avlprop_type
> +{
> + /* VLMAX AVL and tail agnostic candidates. */
> + AVLPROP_VLMAX_TA,
> + AVLPROP_NONE
> +};
> +
> +/* dump helper functions */
> +static const char *
> +avlprop_type_to_str (enum avlprop_type type)
> +{
> + switch (type)
> + {
> + case AVLPROP_VLMAX_TA:
> + return "vlmax_ta";
> +
> + default:
> + gcc_unreachable ();
> + }
> +}
> +
> +static bool
> +vlmax_ta_p (rtx_insn *rinsn)
> +{
> + return vlmax_avl_type_p (rinsn) && tail_agnostic_p (rinsn);
> +}
> +
> +const pass_data pass_data_avlprop = {
> + RTL_PASS, /* type */
> + "avlprop", /* name */
> + OPTGROUP_NONE, /* optinfo_flags */
> + TV_NONE, /* tv_id */
> + 0, /* properties_required */
> + 0, /* properties_provided */
> + 0, /* properties_destroyed */
> + 0, /* todo_flags_start */
> + 0, /* todo_flags_finish */
> +};
> +
> +class pass_avlprop : public rtl_opt_pass
> +{
> +public:
> + pass_avlprop (gcc::context *ctxt) : rtl_opt_pass (pass_data_avlprop, ctxt) {}
> +
> + /* opt_pass methods: */
> + virtual bool gate (function *) final override
> + {
> + return TARGET_VECTOR && optimize > 0;
> + }
> + virtual unsigned int execute (function *) final override;
> +
> +private:
> + /* The AVL propagation instructions and corresponding preferred AVL.
> + It will be updated during the analysis. */
> + hash_map<insn_info *, rtx> *m_avl_propagations;
> +
> + /* Potential feasible AVL propagation candidates. */
> + auto_vec<std::pair<enum avlprop_type, insn_info *>> m_candidates;
> +
> + rtx get_preferred_avl (const std::pair<enum avlprop_type, insn_info *>) const;
> + rtx get_vlmax_ta_preferred_avl (insn_info *) const;
> + rtx get_nonvlmax_avl (insn_info *) const;
> +
> + void avlprop_init (function *);
> + void avlprop_done (void);
> +}; // class pass_avlprop
> +
> +void
> +pass_avlprop::avlprop_init (function *fn)
> +{
> + calculate_dominance_info (CDI_DOMINATORS);
> + df_analyze ();
> + crtl->ssa = new function_info (fn);
> + m_avl_propagations = new hash_map<insn_info *, rtx>;
> +}
> +
> +void
> +pass_avlprop::avlprop_done (void)
> +{
> + free_dominance_info (CDI_DOMINATORS);
> + if (crtl->ssa->perform_pending_updates ())
> + cleanup_cfg (0);
> + delete crtl->ssa;
> + crtl->ssa = nullptr;
> + delete m_avl_propagations;
> + m_avl_propagations = NULL;
> + if (!m_candidates.is_empty ())
> + m_candidates.release ();
> +}
> +
> +/* If we have a preferred AVL to propagate, return the AVL.
> + Otherwise, return NULL_RTX as we don't have any preferred
> + AVL. */
> +
> +rtx
> +pass_avlprop::get_preferred_avl (
> + const std::pair<enum avlprop_type, insn_info *> candidate) const
> +{
> + switch (candidate.first)
> + {
> + case AVLPROP_VLMAX_TA:
> + return get_vlmax_ta_preferred_avl (candidate.second);
> + default:
> + gcc_unreachable ();
> + }
> + return NULL_RTX;
> +}
> +
> +/* This is a straightforward pattern that ALWAYS appears in partial auto-vectorization:
> +
> + VL = SELECT_AVL (AVL, ...)
> + V0 = MASK_LEN_LOAD (..., VL)
> + V1 = MASK_LEN_LOAD (..., VL)
> + V2 = V0 + V1 --- Missed LEN information.
> + MASK_LEN_STORE (..., V2, VL)
> +
> + We prefer PLUS_EXPR (V0 + V1) instead of COND_LEN_ADD (V0, V1, dummy LEN)
> + because:
> +
> + - Few code changes in Loop Vectorizer.
> + - Reuse the current clean flow of partial vectorization. That is, apply
> + a LEN or MASK predicate to LOAD/STORE operations and other special
> + arithmetic operations (e.g. DIV), then do the whole-vector-register
> + operation where it DOESN'T affect correctness.
> + Such a flow is used by all other targets like x86, SVE, s390, etc.
> + - PLUS_EXPR has better gimple optimizations than COND_LEN_ADD.
> +
> + We propagate AVL from NON-VLMAX to VLMAX for gimple IR like PLUS_EXPR which
> + generates a VLMAX instruction due to the missing LEN information. The later
> + VSETVL PASS will then elide the redundant vsetvls.
> +*/
> +
> +rtx
> +pass_avlprop::get_vlmax_ta_preferred_avl (insn_info *insn) const
> +{
> + int sew = get_sew (insn->rtl ());
> + enum vlmul_type vlmul = get_vlmul (insn->rtl ());
> + int ratio = calculate_ratio (sew, vlmul);
> +
> + rtx use_avl = NULL_RTX;
> + for (def_info *def : insn->defs ())
> + {
> + if (!is_a<set_info *> (def) || def->is_mem ())
> + return NULL_RTX;
> + const auto *set = dyn_cast<set_info *> (def);
> +
> + /* FIXME: Stop AVL propagation if any USE is not an RVV real
> + instruction. This should be sufficient for vectorized code since
> + it is always located in extended basic blocks.
> +
> + TODO: We can extend PHI checking for intrinsic code if
> + necessary in the future. */
> + if (!set->is_local_to_ebb ())
> + return NULL_RTX;
> +
> + for (use_info *use : set->nondebug_insn_uses ())
> + {
> + insn_info *use_insn = use->insn ();
> + if (!use_insn->can_be_optimized () || use_insn->is_asm ()
> + || use_insn->is_call () || use_insn->has_volatile_refs ()
> + || use_insn->has_pre_post_modify ()
> + || !has_vl_op (use_insn->rtl ())
> + || !tail_agnostic_p (use_insn->rtl ()))
> + return NULL_RTX;
> +
> + int new_sew = get_sew (use_insn->rtl ());
> + enum vlmul_type new_vlmul = get_vlmul (use_insn->rtl ());
> + int new_ratio = calculate_ratio (new_sew, new_vlmul);
> + if (new_ratio != ratio)
> + return NULL_RTX;
> +
> + rtx new_use_avl = get_nonvlmax_avl (use_insn);
> + if (!new_use_avl || SUBREG_P (new_use_avl))
> + return NULL_RTX;
> + if (REG_P (new_use_avl))
> + {
> + resource_info resource = full_register (REGNO (new_use_avl));
> + def_lookup dl = crtl->ssa->find_def (resource, use_insn);
> + if (dl.matching_set ())
> + return NULL_RTX;
> + def_info *def1 = dl.prev_def (insn);
> + def_info *def2 = dl.prev_def (use_insn);
> + if (!def1 || !def2 || def1 != def2)
> + return NULL_RTX;
> +
> + /* FIXME: We only allow AVL propagation within a block, which should
> + be sufficient for vectorized code.
> +
> + TODO: We can enhance it here for intrinsic codes in the future
> + if it is necessary. */
> + if (def1->insn ()->bb () != insn->bb ()
> + && !dominated_by_p (CDI_DOMINATORS, insn->bb ()->cfg_bb (),
> + def1->insn ()->bb ()->cfg_bb ()))
> + return NULL_RTX;
> + if (def1->insn ()->bb () == insn->bb ()
> + && def1->insn ()->compare_with (insn) >= 0)
> + return NULL_RTX;
> + }
> +
> + if (!use_avl)
> + use_avl = new_use_avl;
> + else if (!rtx_equal_p (use_avl, new_use_avl))
> + return NULL_RTX;
> + }
> + }
> +
> + return use_avl;
> +}
> +
> +/* Try to get the NONVLMAX AVL of the INSN.
> + INSN can either have a NON-VLMAX AVL itself, or be a VLMAX AVL
> + INSN before this PASS that had a NON-VLMAX AVL propagated into
> + it in an earlier propagation round. */
> +rtx
> +pass_avlprop::get_nonvlmax_avl (insn_info *insn) const
> +{
> + if (m_avl_propagations->get (insn))
> + return (*m_avl_propagations->get (insn));
> + else if (nonvlmax_avl_type_p (insn->rtl ()))
> + {
> + extract_insn_cached (insn->rtl ());
> + return recog_data.operand[get_attr_vl_op_idx (insn->rtl ())];
> + }
> +
> + return NULL_RTX;
> +}
> +
> +/* Main entry point for this pass. */
> +unsigned int
> +pass_avlprop::execute (function *fn)
> +{
> + avlprop_init (fn);
> +
> + /* Iterate the whole function in reverse order (which could speed the
> + convergence) to collect all potential candidates that could be AVL
> + propagated.
> +
> + Note that **NOT** all the candidates will be successfully AVL propagated.
> + */
> + for (bb_info *bb : crtl->ssa->reverse_bbs ())
> + {
> + for (insn_info *insn : bb->reverse_real_nondebug_insns ())
> + {
> + /* We only forward AVL to instructions that have an AVL/VL operand
> + and can be optimized at the RTL_SSA level. */
> + if (!insn->can_be_optimized () || !has_vl_op (insn->rtl ()))
> + continue;
> +
> + /* TODO: We only do AVL propagation for VLMAX AVL with tail
> + agnostic policy since partial autovectorization has missing
> + LEN information. We could add more AVL propagation
> + for intrinsic code in the future. */
> + if (vlmax_ta_p (insn->rtl ()))
> + m_candidates.safe_push (std::make_pair (AVLPROP_VLMAX_TA, insn));
> + }
> + }
> +
> + if (dump_file && (dump_flags & TDF_DETAILS))
> + {
> + fprintf (dump_file, "\nNumber of potential AVL propagations: %d\n",
> + m_candidates.length ());
> + for (const auto candidate : m_candidates)
> + {
> + fprintf (dump_file, "\nAVL propagation type: %s\n",
> + avlprop_type_to_str (candidate.first));
> + print_rtl_single (dump_file, candidate.second->rtl ());
> + }
> + }
> +
> + /* Go through all the candidates looking for AVL that we could propagate. */
> + bool change_p = true;
> + while (change_p)
> + {
> + change_p = false;
> + for (auto &candidate : m_candidates)
> + {
> + rtx new_avl = get_preferred_avl (candidate);
> + if (new_avl)
> + {
> + gcc_assert (!vlmax_avl_p (new_avl));
> + auto &update
> + = m_avl_propagations->get_or_insert (candidate.second);
> + change_p = !rtx_equal_p (update, new_avl);
> + update = new_avl;
> + }
> + }
> + }
> +
> + if (dump_file && (dump_flags & TDF_DETAILS))
> + fprintf (dump_file, "\nNumber of successful AVL propagations: %d\n\n",
> + (int) m_avl_propagations->elements ());
> +
> + for (const auto prop : *m_avl_propagations)
> + {
> + rtx_insn *rinsn = prop.first->rtl ();
> + if (dump_file && (dump_flags & TDF_DETAILS))
> + {
> + fprintf (dump_file, "\nPropagating AVL: ");
> + print_rtl_single (dump_file, prop.second);
> + fprintf (dump_file, "into: ");
> + print_rtl_single (dump_file, rinsn);
> + }
> + /* Replace AVL operand. */
> + extract_insn_cached (rinsn);
> + rtx avl = recog_data.operand[get_attr_vl_op_idx (rinsn)];
> + int count = count_regno_occurrences (rinsn, REGNO (avl));
> + gcc_assert (count == 1);
> + rtx new_pat = simplify_replace_rtx (PATTERN (rinsn), avl, prop.second);
> + validate_change_or_fail (rinsn, &PATTERN (rinsn), new_pat, false);
> +
> + /* Change AVL TYPE into NONVLMAX if it is VLMAX. */
> + if (vlmax_avl_type_p (rinsn))
> + {
> + int index = get_attr_avl_type_idx (rinsn);
> + gcc_assert (index != INVALID_ATTRIBUTE);
> + validate_change_or_fail (rinsn, recog_data.operand_loc[index],
> + get_avl_type_rtx (avl_type::NONVLMAX),
> + false);
> + }
> + if (dump_file && (dump_flags & TDF_DETAILS))
> + {
> + fprintf (dump_file, "Successfully to match this instruction: ");
> + print_rtl_single (dump_file, rinsn);
> + }
> + }
> +
> + avlprop_done ();
> + return 0;
> +}
> +
> +rtl_opt_pass *
> +make_pass_avlprop (gcc::context *ctxt)
> +{
> + return new pass_avlprop (ctxt);
> +}
> diff --git a/gcc/config/riscv/riscv-passes.def b/gcc/config/riscv/riscv-passes.def
> index 4084122cf0a..b6260939d5c 100644
> --- a/gcc/config/riscv/riscv-passes.def
> +++ b/gcc/config/riscv/riscv-passes.def
> @@ -18,4 +18,5 @@
> <http://www.gnu.org/licenses/>. */
>
> INSERT_PASS_AFTER (pass_rtl_store_motion, 1, pass_shorten_memrefs);
> +INSERT_PASS_AFTER (pass_split_all_insns, 1, pass_avlprop);
> INSERT_PASS_BEFORE (pass_fast_rtl_dce, 1, pass_vsetvl);
> diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
> index 668d75043ca..d4e17fc3fd0 100644
> --- a/gcc/config/riscv/riscv-protos.h
> +++ b/gcc/config/riscv/riscv-protos.h
> @@ -156,6 +156,7 @@ extern void riscv_parse_arch_string (const char *, struct gcc_options *, locatio
> extern bool riscv_hard_regno_rename_ok (unsigned, unsigned);
>
> rtl_opt_pass * make_pass_shorten_memrefs (gcc::context *ctxt);
> +rtl_opt_pass * make_pass_avlprop (gcc::context *ctxt);
> rtl_opt_pass * make_pass_vsetvl (gcc::context *ctxt);
>
> /* Routines implemented in riscv-string.c. */
> diff --git a/gcc/config/riscv/t-riscv b/gcc/config/riscv/t-riscv
> index dd17056fe82..f8ca3f4ac57 100644
> --- a/gcc/config/riscv/t-riscv
> +++ b/gcc/config/riscv/t-riscv
> @@ -78,6 +78,12 @@ riscv-vector-costs.o: $(srcdir)/config/riscv/riscv-vector-costs.cc \
> $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
> $(srcdir)/config/riscv/riscv-vector-costs.cc
>
> +riscv-avlprop.o: $(srcdir)/config/riscv/riscv-avlprop.cc \
> + $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(RTL_H) $(REGS_H) \
> + $(TARGET_H) tree-pass.h df.h rtl-ssa.h cfgcleanup.h insn-attr.h
> + $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
> + $(srcdir)/config/riscv/riscv-avlprop.cc
> +
> riscv-d.o: $(srcdir)/config/riscv/riscv-d.cc \
> $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H)
> $(COMPILE) $<
> diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-5.c b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-5.c
> index 928a507a363..5278e4aa38f 100644
> --- a/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-5.c
> +++ b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-5.c
> @@ -39,7 +39,7 @@ void foo2 (int16_t *__restrict a,
> }
> }
>
> -/* { dg-final { scan-assembler {e32,m4} } } */
> +/* { dg-final { scan-assembler {e16,m2} } } */
> /* { dg-final { scan-assembler-not {csrr} } } */
> /* { dg-final { scan-tree-dump-times "Maximum lmul = 8" 1 "vect" } } */
> /* { dg-final { scan-tree-dump-times "Maximum lmul = 4" 1 "vect" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-2.c b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-2.c
> index a50265fc1ec..1db2e073846 100644
> --- a/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-2.c
> +++ b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-2.c
> @@ -10,7 +10,7 @@ foo (int32_t *__restrict a, int16_t *__restrict b, int n)
> a[i] = a[i] + b[i];
> }
>
> -/* { dg-final { scan-assembler {e32,m8} } } */
> +/* { dg-final { scan-assembler {e16,m4} } } */
> /* { dg-final { scan-assembler-not {csrr} } } */
> /* { dg-final { scan-tree-dump-times "Maximum lmul = 8" 1 "vect" } } */
> /* { dg-final { scan-tree-dump-not "Maximum lmul = 4" "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/select_vl-2.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/select_vl-2.c
> index eac7cbc757b..ca88d42cdf4 100644
> --- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/select_vl-2.c
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/select_vl-2.c
> @@ -7,10 +7,11 @@
> /*
> ** foo:
> ** vsetivli\t[a-x0-9]+,\s*8,\s*e(8?|16?|32?|64),\s*m(1?|2?|4?|8?|f2?|f4?|f8),\s*t[au],\s*m[au]
> +** ...
> ** vle32\.v\tv[0-9]+,0\([a-x0-9]+\)
> ** ...
> -** vsetvli\t[a-x0-9]+,\s*[a-x0-9]+,\s*e(8?|16?|32?|64),\s*m(1?|2?|4?|8?|f2?|f4?|f8),\s*t[au],\s*m[au]
> -** add\t[a-x0-9]+,[a-x0-9]+,[a-x0-9]+
> +** vsetvli\tzero,\s*[a-x0-9]+,\s*e(8?|16?|32?|64),\s*m(1?|2?|4?|8?|f2?|f4?|f8),\s*t[au],\s*m[au]
> +** ...
> ** vle32\.v\tv[0-9]+,0\([a-x0-9]+\)
> ** ...
> */
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111318.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111318.c
> new file mode 100644
> index 00000000000..ff36da8feeb
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111318.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-march=rv64gcv -mabi=lp64d -O3 -fno-vect-cost-model" } */
> +
> +void
> +foo (int *__restrict a, int *__restrict b, int *__restrict c, int n)
> +{
> + for (int i = 0; i < n; i += 1)
> + c[i] = a[i] + b[i];
> +}
> +
> +/* { dg-final { scan-assembler-times {vsetvli} 1 } } */
> +/* { dg-final { scan-assembler-not {vsetivli} } } */
> +/* { dg-final { scan-assembler-times {vsetvli\s*[a-x0-9]+,\s*[a-x0-9]+} 1 } } */
> +/* { dg-final { scan-assembler-not {vsetvli\s*[a-x0-9]+,\s*zero} } } */
> +/* { dg-final { scan-assembler-not {vsetvli\s*zero} } } */
> +/* { dg-final { scan-assembler-not {vsetivli\s*zero} } } */
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111888.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111888.c
> new file mode 100644
> index 00000000000..2387c20a26c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111888.c
> @@ -0,0 +1,33 @@
> +/* { dg-do compile } */
> +/* { dg-options "-march=rv64gcv -mabi=lp64d -O3 -fno-vect-cost-model" } */
> +
> +void
> +foo (int *__restrict a, int *__restrict b, int *__restrict c,
> + int *__restrict a2, int *__restrict b2, int *__restrict c2,
> + int *__restrict a3, int *__restrict b3, int *__restrict c3,
> + int *__restrict a4, int *__restrict b4, int *__restrict c4,
> + int *__restrict a5, int *__restrict b5, int *__restrict c5,
> + int *__restrict d, int *__restrict d2, int *__restrict d3,
> + int *__restrict d4, int *__restrict d5, int n, int m)
> +{
> + for (int i = 0; i < n; i++)
> + {
> + a[i] = b[i] + c[i];
> + a2[i] = b2[i] + c2[i];
> + a3[i] = b3[i] + c3[i];
> + a4[i] = b4[i] + c4[i];
> + a5[i] = a[i] + a4[i];
> + d[i] = a[i] - a2[i];
> + d2[i] = a2[i] * a[i];
> + d3[i] = a3[i] * a2[i];
> + d4[i] = a2[i] * d2[i];
> + d5[i] = a[i] * a2[i] * a3[i] * a4[i] * d[i];
> + }
> +}
> +
> +/* { dg-final { scan-assembler-times {vsetvli} 1 } } */
> +/* { dg-final { scan-assembler-not {vsetivli} } } */
> +/* { dg-final { scan-assembler-times {vsetvli\s*[a-x0-9]+,\s*[a-x0-9]+} 1 } } */
> +/* { dg-final { scan-assembler-not {vsetvli\s*[a-x0-9]+,\s*zero} } } */
> +/* { dg-final { scan-assembler-not {vsetvli\s*zero} } } */
> +/* { dg-final { scan-assembler-not {vsetivli\s*zero} } } */
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-2.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-2.c
> index 965365da4bb..13367423751 100644
> --- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-2.c
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-2.c
> @@ -3,7 +3,6 @@
>
> #include "ternop-2.c"
>
> -/* { dg-final { scan-assembler-times {\tvmacc\.vv} 8 } } */
> /* { dg-final { scan-assembler-times {\tvfma[c-d][c-d]\.vv} 9 } } */
> /* { dg-final { scan-tree-dump-times "COND_LEN_FMA" 9 "optimized" } } */
> /* { dg-final { scan-assembler-not {\tvmv} } } */
> rv32gcv:
> FAIL: gfortran.dg/intrinsic_pack_6.f90 -O2 execution test
> FAIL: gfortran.dg/intrinsic_pack_6.f90 -O3 -g execution test
> FAIL: gfortran.dg/matmul_3.f90 -O2 execution test
> FAIL: gfortran.fortran-torture/execute/intrinsic_matmul.f90 execution, -O2
> FAIL: gfortran.fortran-torture/execute/intrinsic_matmul.f90 execution, -O2 -fbounds-check
> FAIL: gfortran.fortran-torture/execute/intrinsic_matmul.f90 execution, -O2 -fomit-frame-pointer -finline-functions
> FAIL: gfortran.fortran-torture/execute/intrinsic_matmul.f90 execution, -O3 -g
>
> rv64gcv:
> FAIL: gfortran.dg/matmul_6.f90 -O2 execution test
Those might also flip-flop; I have seen them FAIL and PASS randomly
before. It looks like there are at least 10 of those; we really need
to figure out the root cause...
Regards
Robin
On 10/26/23 11:15, Robin Dapp wrote:
I've seen the same thing on CI for some of these failures on rv32gcv but always as a group:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111969
Example CI run with the flaky group:
https://github.com/patrick-rivos/gcc-postcommit-ci/issues/75
The fact that some are resolved while not resolving the full group makes me hopeful that:
FAIL: gfortran.dg/intrinsic_pack_6.f90 execution test
FAIL: gfortran.fortran-torture/execute/intrinsic_matmul.f90 execution
are really resolved
I haven't seen these testcases be flaky on CI:
FAIL: gfortran.dg/matmul_3.f90 -O2 execution test
FAIL: gfortran.dg/matmul_6.f90 -O2 execution test
Patrick
Thanks Patrick. Committed.
juzhe.zhong@rivai.ai
From: Patrick O'Neill
Date: 2023-10-27 02:12
To: Juzhe-Zhong; gcc-patches
CC: Kito Cheng; Robin Dapp
Subject: Re: [Ready to commit V3] RISC-V: Add AVL propagation PASS for RVV auto-vectorization
popcount and mask_gather_load_run fails seem to be an issue with my setup or a bug with QEMU based on the v2 discussion :-)
OK with me regarding testing (I don't have the authority to approve a patch, but Kito already said LGTM):
https://inbox.sourceware.org/gcc-patches/CALLt3ThXmk4pey2QhSUvK183uuK3oY5bU=a4m8QYv-6UukBYyg@mail.gmail.com/
Thanks for your patience and the revisions!
Your patch resolves these failures on glibc qemu:
rv32gcv:
FAIL: gfortran.dg/intrinsic_pack_6.f90 -O2 execution test
FAIL: gfortran.dg/intrinsic_pack_6.f90 -O3 -g execution test
FAIL: gfortran.dg/matmul_3.f90 -O2 execution test
FAIL: gfortran.fortran-torture/execute/intrinsic_matmul.f90 execution, -O2
FAIL: gfortran.fortran-torture/execute/intrinsic_matmul.f90 execution, -O2 -fbounds-check
FAIL: gfortran.fortran-torture/execute/intrinsic_matmul.f90 execution, -O2 -fomit-frame-pointer -finline-functions
FAIL: gfortran.fortran-torture/execute/intrinsic_matmul.f90 execution, -O3 -g
rv64gcv:
FAIL: gfortran.dg/matmul_6.f90 -O2 execution test
Tested-by: Patrick O'Neill <patrick@rivosinc.com>
Patrick
On 10/26/23 01:13, Juzhe-Zhong wrote:
This patch addresses the redundant AVL/VL toggling in RVV partial auto-vectorization
which is a known issue for a long time and I finally find the time to address it.
Consider a simple vector addition operation:
https://godbolt.org/z/7hfGfEjW3
void
foo (int *__restrict a,
int *__restrict b,
int *__restrict n)
{
for (int i = 0; i < n; i++)
a[i] = a[i] + b[i];
}
Optimized IR:
Loop body:
_38 = .SELECT_VL (ivtmp_36, POLY_INT_CST [4, 4]); -> vsetvli a5,a2,e8,mf4,ta,ma
...
vect__4.8_27 = .MASK_LEN_LOAD (vectp_a.6_29, 32B, { -1, ... }, _38, 0); -> vle32.v v2,0(a0)
vect__6.11_20 = .MASK_LEN_LOAD (vectp_b.9_25, 32B, { -1, ... }, _38, 0); -> vle32.v v1,0(a1)
vect__7.12_19 = vect__6.11_20 + vect__4.8_27; -> vsetvli a6,zero,e32,m1,ta,ma + vadd.vv v1,v1,v2
.MASK_LEN_STORE (vectp_a.13_11, 32B, { -1, ... }, _38, 0, vect__7.12_19); -> vsetvli zero,a5,e32,m1,ta,ma + vse32.v v1,0(a4)
We can see 2 redundant vsetvls inside the loop body due to AVL/VL toggling.
The AVL/VL toggling is because we are missing LEN information in simple PLUS_EXPR GIMPLE assignment:
vect__7.12_19 = vect__6.11_20 + vect__4.8_27;
GCC apply partial predicate load/store and un-predicated full vector operation on partial vectorization.
Such flow are used by all other targets like ARM SVE (RVV also uses such flow):
ARM SVE:
.L3:
ld1w z30.s, p7/z, [x0, x3, lsl 2] -> predicated load
ld1w z31.s, p7/z, [x1, x3, lsl 2] -> predicated load
add z31.s, z31.s, z30.s -> un-predicated add
st1w z31.s, p7, [x0, x3, lsl 2] -> predicated store
Such vectorization flow causes AVL/VL toggling on RVV so we need AVL propagation PASS for it.
Also, It's very unlikely that we can apply predicated operations on all vectorization for following reasons:
1. It's very heavy workload to support them on all vectorization and we don't see any benefits if we can handle that on targets backend.
2. Changing Loop vectorizer for it will make code base ugly and hard to maintain.
3. We will need so many patterns for all operations. Not only COND_LEN_ADD, COND_LEN_SUB, ....
We also need COND_LEN_EXTEND, ...., COND_LEN_CEIL, ... .. over 100+ patterns, unreasonable number of patterns.
To conclude, we prefer un-predicated operations here, and design a nice and clean AVL propagation PASS for it to elide the redundant vsetvls
due to AVL/VL toggling.
The second question is that why we separate a PASS called AVL propagation. Why not optimize it in VSETVL PASS (We definitetly can optimize AVL in VSETVL PASS)
Frankly, I was planning to address such issue in VSETVL PASS that's why we recently refactored VSETVL PASS. However, I changed my mind recently after several
experiments and tries.
The reasons as follows:
1. For code base management and maintainience. Current VSETVL PASS is complicated enough and aleady has enough aggressive and fancy optimizations which
turns out it can always generate optimal codegen in most of the cases. It's not a good idea keep adding more features into VSETVL PASS to make VSETVL
PASS become heavy and heavy again, then we will need to refactor it again in the future.
Actuall, the VSETVL PASS is very stable and optimal after the recent refactoring. Hopefully, we should not change VSETVL PASS any more except the minor
fixes.
2. vsetvl insertion (VSETVL PASS does this thing) and AVL propagation are 2 different things, I don't think we should fuse them into same PASS.
3. VSETVL PASS is an post-RA PASS, wheras AVL propagtion should be done before RA which can reduce register allocation.
4. This patch's AVL propagation PASS only does AVL propagation for RVV partial auto-vectorization situations.
This patch's codes are only hundreds lines which is very managable and can be very easily extended features and enhancements.
We can easily extend and enhance more AVL propagation in a clean and separate PASS in the future. (If we do it on VSETVL PASS, we will complicate
VSETVL PASS again which is already so complicated.)
Here is an example to demonstrate more:
https://godbolt.org/z/bE86sv3q5
void foo2 (int *__restrict a,
int *__restrict b,
int *__restrict c,
int *__restrict a2,
int *__restrict b2,
int *__restrict c2,
int *__restrict a3,
int *__restrict b3,
int *__restrict c3,
int *__restrict a4,
int *__restrict b4,
int *__restrict c4,
int *__restrict a5,
int *__restrict b5,
int *__restrict c5,
int n)
{
for (int i = 0; i < n; i++){
a[i] = b[i] + c[i];
b5[i] = b[i] + c[i];
a2[i] = b2[i] + c2[i];
a3[i] = b3[i] + c3[i];
a4[i] = b4[i] + c4[i];
a5[i] = a[i] + a4[i];
a[i] = a5[i] + b5[i]+ a[i];
a[i] = a[i] + c[i];
b5[i] = a[i] + c[i];
a2[i] = a[i] + c2[i];
a3[i] = a[i] + c3[i];
a4[i] = a[i] + c4[i];
a5[i] = a[i] + a4[i];
a[i] = a[i] + b5[i]+ a[i];
}
}
1. Loop Body:
Before this patch: After this patch:
vsetvli a4,t1,e8,mf4,ta,ma vsetvli a4,t1,e32,m1,ta,ma
vle32.v v2,0(a2) vle32.v v2,0(a2)
vle32.v v4,0(a1) vle32.v v3,0(t2)
vle32.v v1,0(t2) vle32.v v4,0(a1)
vsetvli a7,zero,e32,m1,ta,ma vle32.v v1,0(t0)
vadd.vv v4,v2,v4 vadd.vv v4,v2,v4
vsetvli zero,a4,e32,m1,ta,ma vadd.vv v1,v3,v1
vle32.v v3,0(s0) vadd.vv v1,v1,v4
vsetvli a7,zero,e32,m1,ta,ma vadd.vv v1,v1,v4
vadd.vv v1,v3,v1 vadd.vv v1,v1,v4
vadd.vv v1,v1,v4 vadd.vv v1,v1,v2
vadd.vv v1,v1,v4 vadd.vv v2,v1,v2
vadd.vv v1,v1,v4 vse32.v v2,0(t5)
vsetvli zero,a4,e32,m1,ta,ma vadd.vv v2,v2,v1
vle32.v v4,0(a5) vadd.vv v2,v2,v1
vsetvli a7,zero,e32,m1,ta,ma slli a7,a4,2
vadd.vv v1,v1,v2 vadd.vv v3,v1,v3
vadd.vv v2,v1,v2 vle32.v v5,0(a5)
vadd.vv v4,v1,v4 vle32.v v6,0(t6)
vsetvli zero,a4,e32,m1,ta,ma vse32.v v3,0(t3)
vse32.v v2,0(t5) vse32.v v2,0(a0)
vse32.v v4,0(a3) vadd.vv v3,v3,v1
vsetvli a7,zero,e32,m1,ta,ma vadd.vv v2,v1,v5
vadd.vv v3,v1,v3 vse32.v v3,0(t4)
vadd.vv v2,v2,v1 vadd.vv v1,v1,v6
vadd.vv v2,v2,v1 vse32.v v2,0(a3)
vsetvli zero,a4,e32,m1,ta,ma vse32.v v1,0(a6)
vse32.v v2,0(a0)
vse32.v v3,0(t3)
vle32.v v2,0(t0)
vsetvli a7,zero,e32,m1,ta,ma
vadd.vv v3,v3,v1
vsetvli zero,a4,e32,m1,ta,ma
vse32.v v3,0(t4)
vsetvli a7,zero,e32,m1,ta,ma
slli a7,a4,2
vadd.vv v1,v1,v2
sub t1,t1,a4
vsetvli zero,a4,e32,m1,ta,ma
vse32.v v1,0(a6)
It's quite obvious, all heavy && redundant vsetvls inside loop body are eliminated.
2. Epilogue:
Before this patch: After this patch:
.L5: .L5:
ld s0,8(sp) ret
addi sp,sp,16
jr ra
This is the benefit we do the AVL propation before RA since we eliminate the use of 'a7' register
which is used by the redudant AVL/VL toggling instruction: 'vsetvli a7,zero,e32,m1,ta,ma'
The final codegen after this patch:
foo2:
lw t1,56(sp)
ld t6,0(sp)
ld t3,8(sp)
ld t0,16(sp)
ld t2,24(sp)
ld t4,32(sp)
ld t5,40(sp)
ble t1,zero,.L5
.L3:
vsetvli a4,t1,e32,m1,ta,ma
vle32.v v2,0(a2)
vle32.v v3,0(t2)
vle32.v v4,0(a1)
vle32.v v1,0(t0)
vadd.vv v4,v2,v4
vadd.vv v1,v3,v1
vadd.vv v1,v1,v4
vadd.vv v1,v1,v4
vadd.vv v1,v1,v4
vadd.vv v1,v1,v2
vadd.vv v2,v1,v2
vse32.v v2,0(t5)
vadd.vv v2,v2,v1
vadd.vv v2,v2,v1
slli a7,a4,2
vadd.vv v3,v1,v3
vle32.v v5,0(a5)
vle32.v v6,0(t6)
vse32.v v3,0(t3)
vse32.v v2,0(a0)
vadd.vv v3,v3,v1
vadd.vv v2,v1,v5
vse32.v v3,0(t4)
vadd.vv v1,v1,v6
vse32.v v2,0(a3)
vse32.v v1,0(a6)
sub t1,t1,a4
add a1,a1,a7
add a2,a2,a7
add a5,a5,a7
add t6,t6,a7
add t0,t0,a7
add t2,t2,a7
add t5,t5,a7
add a3,a3,a7
add a6,a6,a7
add t3,t3,a7
add t4,t4,a7
add a0,a0,a7
bne t1,zero,.L3
.L5:
ret
PR target/111318
PR target/111888
gcc/ChangeLog:
* config.gcc: Add AVL propagation pass.
* config/riscv/riscv-passes.def (INSERT_PASS_AFTER): Ditto.
* config/riscv/riscv-protos.h (make_pass_avlprop): Ditto.
* config/riscv/t-riscv: Ditto.
* config/riscv/riscv-avlprop.cc: New file.
gcc/testsuite/ChangeLog:
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-5.c: Adapt test.
* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-2.c: Ditto.
* gcc.target/riscv/rvv/autovec/partial/select_vl-2.c: Ditto.
* gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-2.c: Ditto.
* gcc.target/riscv/rvv/autovec/pr111318.c: New test.
* gcc.target/riscv/rvv/autovec/pr111888.c: New test.
---
gcc/config.gcc | 2 +-
gcc/config/riscv/riscv-avlprop.cc | 419 ++++++++++++++++++
gcc/config/riscv/riscv-passes.def | 1 +
gcc/config/riscv/riscv-protos.h | 1 +
gcc/config/riscv/t-riscv | 6 +
.../costmodel/riscv/rvv/dynamic-lmul4-5.c | 2 +-
.../costmodel/riscv/rvv/dynamic-lmul8-2.c | 2 +-
.../riscv/rvv/autovec/partial/select_vl-2.c | 5 +-
.../gcc.target/riscv/rvv/autovec/pr111318.c | 16 +
.../gcc.target/riscv/rvv/autovec/pr111888.c | 33 ++
.../riscv/rvv/autovec/ternop/ternop_nofm-2.c | 1 -
11 files changed, 482 insertions(+), 6 deletions(-)
create mode 100644 gcc/config/riscv/riscv-avlprop.cc
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111318.c
create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111888.c
diff --git a/gcc/config.gcc b/gcc/config.gcc
index 09a7fb13da1..d34ea246a98 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -544,7 +544,7 @@ pru-*-*)
riscv*)
cpu_type=riscv
extra_objs="riscv-builtins.o riscv-c.o riscv-sr.o riscv-shorten-memrefs.o riscv-selftests.o riscv-string.o"
- extra_objs="${extra_objs} riscv-v.o riscv-vsetvl.o riscv-vector-costs.o"
+ extra_objs="${extra_objs} riscv-v.o riscv-vsetvl.o riscv-vector-costs.o riscv-avlprop.o"
extra_objs="${extra_objs} riscv-vector-builtins.o riscv-vector-builtins-shapes.o riscv-vector-builtins-bases.o"
extra_objs="${extra_objs} thead.o"
d_target_objs="riscv-d.o"
diff --git a/gcc/config/riscv/riscv-avlprop.cc b/gcc/config/riscv/riscv-avlprop.cc
new file mode 100644
index 00000000000..2c79ec81806
--- /dev/null
+++ b/gcc/config/riscv/riscv-avlprop.cc
@@ -0,0 +1,419 @@
+/* AVL propagation pass for RISC-V 'V' Extension for GNU compiler.
+ Copyright (C) 2023-2023 Free Software Foundation, Inc.
+ Contributed by Juzhe Zhong (juzhe.zhong@rivai.ai), RiVAI Technologies Ltd.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 3, or(at your option)
+any later version.
+
+GCC is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3. If not see
+<http://www.gnu.org/licenses/>. */
+
+/* Pre-RA RTL_SSA-based pass propagates AVL for RVV instructions.
+ A standalone AVL propagation pass is designed because:
+
+ - Better code maintain:
+ Current LCM-based VSETVL pass is so complicated that codes
+ there will become even harder to maintain. A straight forward
+ AVL propagation PASS is much easier to maintain.
+
+ - Reduce scalar register pressure:
+ A type of AVL propagation is we propagate AVL from NON-VLMAX
+ instruction to VLMAX instruction.
+ Note: VLMAX instruction should be ignore tail elements (TA)
+ and the result should be used by the NON-VLMAX instruction.
+ This optimization is mostly for auto-vectorization codes:
+
+ vsetvli r136, r137 --- SELECT_VL
+ vle8.v (use avl = r136) --- IFN_MASK_LEN_LOAD
+ vadd.vv (use VLMAX) --- PLUS_EXPR
+ vse8.v (use avl = r136) --- IFN_MASK_LEN_STORE
+
+ NO AVL propation:
+
+ vsetvli a5, a4, ta
+ vle8.v v1
+ vsetvli t0, zero, ta
+ vadd.vv v2, v1, v1
+ vse8.v v2
+
+ We can propagate the AVL to 'vadd.vv' since its result
+ is consumed by a 'vse8.v' which has AVL = a5 and its
+ tail elements are agnostic.
+
+ We DON'T do this optimization on VSETVL pass since it is a
+ post-RA pass that consumed 't0' already wheras a standalone
+ pre-RA AVL propagation pass allows us elide the consumption
+ of the pseudo register of 't0' then we can reduce scalar
+ register pressure.
+
+ - More AVL propagation opportunities:
+ A pre-RA pass is more flexible for AVL REG def-use chain,
+ thus we will get more potential AVL propagation as long as
+ it doesn't increase the scalar register pressure.
+*/
+
+#define IN_TARGET_CODE 1
+#define INCLUDE_ALGORITHM
+#define INCLUDE_FUNCTIONAL
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "tm.h"
+#include "backend.h"
+#include "rtl.h"
+#include "target.h"
+#include "tree-pass.h"
+#include "df.h"
+#include "rtl-ssa.h"
+#include "cfgcleanup.h"
+#include "insn-attr.h"
+
+using namespace rtl_ssa;
+using namespace riscv_vector;
+
+enum avlprop_type
+{
+ /* VLMAX AVL and tail agnostic candidates. */
+ AVLPROP_VLMAX_TA,
+ AVLPROP_NONE
+};
+
+/* dump helper functions */
+static const char *
+avlprop_type_to_str (enum avlprop_type type)
+{
+ switch (type)
+ {
+ case AVLPROP_VLMAX_TA:
+ return "vlmax_ta";
+
+ default:
+ gcc_unreachable ();
+ }
+}
+
+static bool
+vlmax_ta_p (rtx_insn *rinsn)
+{
+ return vlmax_avl_type_p (rinsn) && tail_agnostic_p (rinsn);
+}
+
+const pass_data pass_data_avlprop = {
+ RTL_PASS, /* type */
+ "avlprop", /* name */
+ OPTGROUP_NONE, /* optinfo_flags */
+ TV_NONE, /* tv_id */
+ 0, /* properties_required */
+ 0, /* properties_provided */
+ 0, /* properties_destroyed */
+ 0, /* todo_flags_start */
+ 0, /* todo_flags_finish */
+};
+
+class pass_avlprop : public rtl_opt_pass
+{
+public:
+ pass_avlprop (gcc::context *ctxt) : rtl_opt_pass (pass_data_avlprop, ctxt) {}
+
+ /* opt_pass methods: */
+ virtual bool gate (function *) final override
+ {
+ return TARGET_VECTOR && optimize > 0;
+ }
+ virtual unsigned int execute (function *) final override;
+
+private:
+ /* The AVL propagation instructions and corresponding preferred AVL.
+ It will be updated during the analysis. */
+ hash_map<insn_info *, rtx> *m_avl_propagations;
+
+ /* Potential feasible AVL propagation candidates. */
+ auto_vec<std::pair<enum avlprop_type, insn_info *>> m_candidates;
+
+ rtx get_preferred_avl (const std::pair<enum avlprop_type, insn_info *>) const;
+ rtx get_vlmax_ta_preferred_avl (insn_info *) const;
+ rtx get_nonvlmax_avl (insn_info *) const;
+
+ void avlprop_init (function *);
+ void avlprop_done (void);
+}; // class pass_avlprop
+
+void
+pass_avlprop::avlprop_init (function *fn)
+{
+ calculate_dominance_info (CDI_DOMINATORS);
+ df_analyze ();
+ crtl->ssa = new function_info (fn);
+ m_avl_propagations = new hash_map<insn_info *, rtx>;
+}
+
+void
+pass_avlprop::avlprop_done (void)
+{
+ free_dominance_info (CDI_DOMINATORS);
+ if (crtl->ssa->perform_pending_updates ())
+ cleanup_cfg (0);
+ delete crtl->ssa;
+ crtl->ssa = nullptr;
+ delete m_avl_propagations;
+ m_avl_propagations = NULL;
+ if (!m_candidates.is_empty ())
+ m_candidates.release ();
+}
+
+/* If we have a preferred AVL to propagate, return the AVL.
+   Otherwise, return NULL_RTX as we don't have any preferred
+   AVL.  */
+
+rtx
+pass_avlprop::get_preferred_avl (
+ const std::pair<enum avlprop_type, insn_info *> candidate) const
+{
+ switch (candidate.first)
+ {
+ case AVLPROP_VLMAX_TA:
+ return get_vlmax_ta_preferred_avl (candidate.second);
+ default:
+ gcc_unreachable ();
+ }
+ return NULL_RTX;
+}
+
+/* This is a straightforward pattern that ALWAYS appears in partial auto-vectorization:
+
+ VL = SELECT_AVL (AVL, ...)
+ V0 = MASK_LEN_LOAD (..., VL)
+ V1 = MASK_LEN_LOAD (..., VL)
+ V2 = V0 + V1 --- Missed LEN information.
+ MASK_LEN_STORE (..., V2, VL)
+
+ We prefer PLUS_EXPR (V0 + V1) instead of COND_LEN_ADD (V0, V1, dummy LEN)
+ because:
+
+   - Few code changes are needed in the Loop Vectorizer.
+   - Reuse the current clean flow of partial vectorization.  That is, apply
+     a LEN or MASK predicate to LOAD/STORE operations and other special
+     arithmetic operations (e.g. DIV), then perform the whole-vector-register
+     operation wherever it DOESN'T affect correctness.
+     Such a flow is used by all other targets like x86, SVE, s390, etc.
+ - PLUS_EXPR has better gimple optimizations than COND_LEN_ADD.
+
+   We propagate the AVL from NON-VLMAX to VLMAX for gimple IR like PLUS_EXPR,
+   which generates a VLMAX instruction due to the missing LEN information.
+   The later VSETVL PASS will then elide the redundant vsetvls.
+*/
+
+rtx
+pass_avlprop::get_vlmax_ta_preferred_avl (insn_info *insn) const
+{
+ int sew = get_sew (insn->rtl ());
+ enum vlmul_type vlmul = get_vlmul (insn->rtl ());
+ int ratio = calculate_ratio (sew, vlmul);
+
+ rtx use_avl = NULL_RTX;
+ for (def_info *def : insn->defs ())
+ {
+ if (!is_a<set_info *> (def) || def->is_mem ())
+ return NULL_RTX;
+ const auto *set = dyn_cast<set_info *> (def);
+
+      /* FIXME: Stop AVL propagation if any USE is not a real RVV
+	 instruction.  This should be sufficient for vectorized code,
+	 since it is always located within extended basic blocks.
+
+	 TODO: We can extend this with PHI checking for intrinsic code
+	 if necessary in the future.  */
+ if (!set->is_local_to_ebb ())
+ return NULL_RTX;
+
+ for (use_info *use : set->nondebug_insn_uses ())
+ {
+ insn_info *use_insn = use->insn ();
+ if (!use_insn->can_be_optimized () || use_insn->is_asm ()
+ || use_insn->is_call () || use_insn->has_volatile_refs ()
+ || use_insn->has_pre_post_modify ()
+ || !has_vl_op (use_insn->rtl ())
+ || !tail_agnostic_p (use_insn->rtl ()))
+ return NULL_RTX;
+
+ int new_sew = get_sew (use_insn->rtl ());
+ enum vlmul_type new_vlmul = get_vlmul (use_insn->rtl ());
+ int new_ratio = calculate_ratio (new_sew, new_vlmul);
+ if (new_ratio != ratio)
+ return NULL_RTX;
+
+ rtx new_use_avl = get_nonvlmax_avl (use_insn);
+ if (!new_use_avl || SUBREG_P (new_use_avl))
+ return NULL_RTX;
+ if (REG_P (new_use_avl))
+ {
+ resource_info resource = full_register (REGNO (new_use_avl));
+ def_lookup dl = crtl->ssa->find_def (resource, use_insn);
+ if (dl.matching_set ())
+ return NULL_RTX;
+ def_info *def1 = dl.prev_def (insn);
+ def_info *def2 = dl.prev_def (use_insn);
+ if (!def1 || !def2 || def1 != def2)
+ return NULL_RTX;
+
+	      /* FIXME: We only allow AVL propagation within a block or
+		 from a dominating block, which suffices for vectorized code.
+
+		 TODO: We can enhance this for intrinsic code in the future
+		 if it is necessary.  */
+ if (def1->insn ()->bb () != insn->bb ()
+ && !dominated_by_p (CDI_DOMINATORS, insn->bb ()->cfg_bb (),
+ def1->insn ()->bb ()->cfg_bb ()))
+ return NULL_RTX;
+ if (def1->insn ()->bb () == insn->bb ()
+ && def1->insn ()->compare_with (insn) >= 0)
+ return NULL_RTX;
+ }
+
+ if (!use_avl)
+ use_avl = new_use_avl;
+ else if (!rtx_equal_p (use_avl, new_use_avl))
+ return NULL_RTX;
+ }
+ }
+
+ return use_avl;
+}
+
+/* Try to get the NON-VLMAX AVL of INSN.
+   INSN either has a NON-VLMAX AVL itself, or was a VLMAX AVL INSN
+   before this PASS but has already been assigned a NON-VLMAX AVL
+   in an earlier propagation round.  */
+rtx
+pass_avlprop::get_nonvlmax_avl (insn_info *insn) const
+{
+ if (m_avl_propagations->get (insn))
+ return (*m_avl_propagations->get (insn));
+ else if (nonvlmax_avl_type_p (insn->rtl ()))
+ {
+ extract_insn_cached (insn->rtl ());
+ return recog_data.operand[get_attr_vl_op_idx (insn->rtl ())];
+ }
+
+ return NULL_RTX;
+}
+
+/* Main entry point for this pass. */
+unsigned int
+pass_avlprop::execute (function *fn)
+{
+ avlprop_init (fn);
+
+  /* Iterate over the whole function in reverse order (which can speed up
+     convergence) to collect all potential candidates that could have an
+     AVL propagated into them.
+
+     Note that **NOT** all the candidates will be successfully AVL
+     propagated.  */
+ for (bb_info *bb : crtl->ssa->reverse_bbs ())
+ {
+ for (insn_info *insn : bb->reverse_real_nondebug_insns ())
+ {
+	  /* We only forward the AVL to instructions that have an AVL/VL
+	     operand and can be optimized at the RTL_SSA level.  */
+ if (!insn->can_be_optimized () || !has_vl_op (insn->rtl ()))
+ continue;
+
+	  /* TODO: We only do AVL propagation for VLMAX AVL with tail
+	     agnostic policy, since partial auto-vectorization misses the
+	     LEN information there.  We could add more AVL propagation
+	     for intrinsic code in the future.  */
+ if (vlmax_ta_p (insn->rtl ()))
+ m_candidates.safe_push (std::make_pair (AVLPROP_VLMAX_TA, insn));
+ }
+ }
+
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ {
+ fprintf (dump_file, "\nNumber of potential AVL propagations: %d\n",
+ m_candidates.length ());
+ for (const auto candidate : m_candidates)
+ {
+ fprintf (dump_file, "\nAVL propagation type: %s\n",
+ avlprop_type_to_str (candidate.first));
+ print_rtl_single (dump_file, candidate.second->rtl ());
+ }
+ }
+
+ /* Go through all the candidates looking for AVL that we could propagate. */
+ bool change_p = true;
+ while (change_p)
+ {
+ change_p = false;
+ for (auto &candidate : m_candidates)
+ {
+ rtx new_avl = get_preferred_avl (candidate);
+ if (new_avl)
+ {
+ gcc_assert (!vlmax_avl_p (new_avl));
+ auto &update
+ = m_avl_propagations->get_or_insert (candidate.second);
+ change_p = !rtx_equal_p (update, new_avl);
+ update = new_avl;
+ }
+ }
+ }
+
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ fprintf (dump_file, "\nNumber of successful AVL propagations: %d\n\n",
+ (int) m_avl_propagations->elements ());
+
+ for (const auto prop : *m_avl_propagations)
+ {
+ rtx_insn *rinsn = prop.first->rtl ();
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ {
+ fprintf (dump_file, "\nPropagating AVL: ");
+ print_rtl_single (dump_file, prop.second);
+ fprintf (dump_file, "into: ");
+ print_rtl_single (dump_file, rinsn);
+ }
+ /* Replace AVL operand. */
+ extract_insn_cached (rinsn);
+ rtx avl = recog_data.operand[get_attr_vl_op_idx (rinsn)];
+ int count = count_regno_occurrences (rinsn, REGNO (avl));
+ gcc_assert (count == 1);
+ rtx new_pat = simplify_replace_rtx (PATTERN (rinsn), avl, prop.second);
+ validate_change_or_fail (rinsn, &PATTERN (rinsn), new_pat, false);
+
+ /* Change AVL TYPE into NONVLMAX if it is VLMAX. */
+ if (vlmax_avl_type_p (rinsn))
+ {
+ int index = get_attr_avl_type_idx (rinsn);
+ gcc_assert (index != INVALID_ATTRIBUTE);
+ validate_change_or_fail (rinsn, recog_data.operand_loc[index],
+ get_avl_type_rtx (avl_type::NONVLMAX),
+ false);
+ }
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ {
+	  fprintf (dump_file, "Successfully matched this instruction: ");
+ print_rtl_single (dump_file, rinsn);
+ }
+ }
+
+ avlprop_done ();
+ return 0;
+}
+
+rtl_opt_pass *
+make_pass_avlprop (gcc::context *ctxt)
+{
+ return new pass_avlprop (ctxt);
+}
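
To see the shape of the propagation algorithm without the RTL_SSA machinery, here is a minimal standalone model of the candidate scan plus the iterate-until-no-change loop in pass_avlprop::execute. Everything in it (the insn ids, the known/props maps, preferred_avl) is a hypothetical stand-in for illustration, not a GCC API:

/* Toy model of the AVL propagation fixpoint.  All names here are
   hypothetical stand-ins, not GCC APIs.  */
#include <cstdio>
#include <map>
#include <vector>

struct candidate { int id; int consumer; };	/* who uses my result */

/* Models get_vlmax_ta_preferred_avl: a candidate may take the AVL of
   its consumer, either known up front or assigned in an earlier
   round.  Returns 0 when no AVL is available yet.  */
static int
preferred_avl (const candidate &c, const std::map<int, int> &known,
	       const std::map<int, int> &props)
{
  auto p = props.find (c.consumer);
  if (p != props.end ())
    return p->second;
  auto k = known.find (c.consumer);
  return k == known.end () ? 0 : k->second;
}

int
main ()
{
  /* insn 2 feeds insn 1 which feeds insn 0; only insn 0 (say, a
     vse8.v) has a known non-VLMAX AVL, so insn 2's AVL only becomes
     available once insn 1 has been assigned one.  */
  std::vector<candidate> candidates = { { 2, 1 }, { 1, 0 } };
  std::map<int, int> known = { { 0, 42 } };	/* non-VLMAX AVLs */
  std::map<int, int> props;			/* m_avl_propagations */

  bool change_p = true;
  while (change_p)				/* iterate to a fixpoint */
    {
      change_p = false;
      for (const candidate &c : candidates)
	if (int a = preferred_avl (c, known, props))
	  {
	    int &slot = props[c.id];
	    change_p |= slot != a;		/* one new propagation may */
	    slot = a;				/* enable another one      */
	  }
    }
  printf ("propagations: %zu\n", props.size ());  /* prints 2 */
}

With the producer listed before its consumer, as above, the fixpoint needs an extra round; the real pass collects candidates in reverse instruction order so that consumers tend to be processed first and convergence is faster.
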
diff --git a/gcc/config/riscv/riscv-passes.def b/gcc/config/riscv/riscv-passes.def
index 4084122cf0a..b6260939d5c 100644
--- a/gcc/config/riscv/riscv-passes.def
+++ b/gcc/config/riscv/riscv-passes.def
@@ -18,4 +18,5 @@
<http://www.gnu.org/licenses/>. */
INSERT_PASS_AFTER (pass_rtl_store_motion, 1, pass_shorten_memrefs);
+INSERT_PASS_AFTER (pass_split_all_insns, 1, pass_avlprop);
INSERT_PASS_BEFORE (pass_fast_rtl_dce, 1, pass_vsetvl);
diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
index 668d75043ca..d4e17fc3fd0 100644
--- a/gcc/config/riscv/riscv-protos.h
+++ b/gcc/config/riscv/riscv-protos.h
@@ -156,6 +156,7 @@ extern void riscv_parse_arch_string (const char *, struct gcc_options *, locatio
extern bool riscv_hard_regno_rename_ok (unsigned, unsigned);
rtl_opt_pass * make_pass_shorten_memrefs (gcc::context *ctxt);
+rtl_opt_pass * make_pass_avlprop (gcc::context *ctxt);
rtl_opt_pass * make_pass_vsetvl (gcc::context *ctxt);
/* Routines implemented in riscv-string.c. */
diff --git a/gcc/config/riscv/t-riscv b/gcc/config/riscv/t-riscv
index dd17056fe82..f8ca3f4ac57 100644
--- a/gcc/config/riscv/t-riscv
+++ b/gcc/config/riscv/t-riscv
@@ -78,6 +78,12 @@ riscv-vector-costs.o: $(srcdir)/config/riscv/riscv-vector-costs.cc \
$(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
$(srcdir)/config/riscv/riscv-vector-costs.cc
+riscv-avlprop.o: $(srcdir)/config/riscv/riscv-avlprop.cc \
+ $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(RTL_H) $(REGS_H) \
+ $(TARGET_H) tree-pass.h df.h rtl-ssa.h cfgcleanup.h insn-attr.h
+ $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
+ $(srcdir)/config/riscv/riscv-avlprop.cc
+
riscv-d.o: $(srcdir)/config/riscv/riscv-d.cc \
$(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H)
$(COMPILE) $<
diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-5.c b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-5.c
index 928a507a363..5278e4aa38f 100644
--- a/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-5.c
+++ b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-5.c
@@ -39,7 +39,7 @@ void foo2 (int16_t *__restrict a,
}
}
-/* { dg-final { scan-assembler {e32,m4} } } */
+/* { dg-final { scan-assembler {e16,m2} } } */
/* { dg-final { scan-assembler-not {csrr} } } */
/* { dg-final { scan-tree-dump-times "Maximum lmul = 8" 1 "vect" } } */
/* { dg-final { scan-tree-dump-times "Maximum lmul = 4" 1 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-2.c b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-2.c
index a50265fc1ec..1db2e073846 100644
--- a/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-2.c
+++ b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-2.c
@@ -10,7 +10,7 @@ foo (int32_t *__restrict a, int16_t *__restrict b, int n)
a[i] = a[i] + b[i];
}
-/* { dg-final { scan-assembler {e32,m8} } } */
+/* { dg-final { scan-assembler {e16,m4} } } */
/* { dg-final { scan-assembler-not {csrr} } } */
/* { dg-final { scan-tree-dump-times "Maximum lmul = 8" 1 "vect" } } */
/* { dg-final { scan-tree-dump-not "Maximum lmul = 4" "vect" } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/select_vl-2.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/select_vl-2.c
index eac7cbc757b..ca88d42cdf4 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/select_vl-2.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/select_vl-2.c
@@ -7,10 +7,11 @@
/*
** foo:
** vsetivli\t[a-x0-9]+,\s*8,\s*e(8?|16?|32?|64),\s*m(1?|2?|4?|8?|f2?|f4?|f8),\s*t[au],\s*m[au]
+** ...
** vle32\.v\tv[0-9]+,0\([a-x0-9]+\)
** ...
-** vsetvli\t[a-x0-9]+,\s*[a-x0-9]+,\s*e(8?|16?|32?|64),\s*m(1?|2?|4?|8?|f2?|f4?|f8),\s*t[au],\s*m[au]
-** add\t[a-x0-9]+,[a-x0-9]+,[a-x0-9]+
+** vsetvli\tzero,\s*[a-x0-9]+,\s*e(8?|16?|32?|64),\s*m(1?|2?|4?|8?|f2?|f4?|f8),\s*t[au],\s*m[au]
+** ...
** vle32\.v\tv[0-9]+,0\([a-x0-9]+\)
** ...
*/
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111318.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111318.c
new file mode 100644
index 00000000000..ff36da8feeb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111318.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gcv -mabi=lp64d -O3 -fno-vect-cost-model" } */
+
+void
+foo (int *__restrict a, int *__restrict b, int *__restrict c, int n)
+{
+ for (int i = 0; i < n; i += 1)
+ c[i] = a[i] + b[i];
+}
+
+/* { dg-final { scan-assembler-times {vsetvli} 1 } } */
+/* { dg-final { scan-assembler-not {vsetivli} } } */
+/* { dg-final { scan-assembler-times {vsetvli\s*[a-x0-9]+,\s*[a-x0-9]+} 1 } } */
+/* { dg-final { scan-assembler-not {vsetvli\s*[a-x0-9]+,\s*zero} } } */
+/* { dg-final { scan-assembler-not {vsetvli\s*zero} } } */
+/* { dg-final { scan-assembler-not {vsetivli\s*zero} } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111888.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111888.c
new file mode 100644
index 00000000000..2387c20a26c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111888.c
@@ -0,0 +1,33 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gcv -mabi=lp64d -O3 -fno-vect-cost-model" } */
+
+void
+foo (int *__restrict a, int *__restrict b, int *__restrict c,
+ int *__restrict a2, int *__restrict b2, int *__restrict c2,
+ int *__restrict a3, int *__restrict b3, int *__restrict c3,
+ int *__restrict a4, int *__restrict b4, int *__restrict c4,
+ int *__restrict a5, int *__restrict b5, int *__restrict c5,
+ int *__restrict d, int *__restrict d2, int *__restrict d3,
+ int *__restrict d4, int *__restrict d5, int n, int m)
+{
+ for (int i = 0; i < n; i++)
+ {
+ a[i] = b[i] + c[i];
+ a2[i] = b2[i] + c2[i];
+ a3[i] = b3[i] + c3[i];
+ a4[i] = b4[i] + c4[i];
+ a5[i] = a[i] + a4[i];
+ d[i] = a[i] - a2[i];
+ d2[i] = a2[i] * a[i];
+ d3[i] = a3[i] * a2[i];
+ d4[i] = a2[i] * d2[i];
+ d5[i] = a[i] * a2[i] * a3[i] * a4[i] * d[i];
+ }
+}
+
+/* { dg-final { scan-assembler-times {vsetvli} 1 } } */
+/* { dg-final { scan-assembler-not {vsetivli} } } */
+/* { dg-final { scan-assembler-times {vsetvli\s*[a-x0-9]+,\s*[a-x0-9]+} 1 } } */
+/* { dg-final { scan-assembler-not {vsetvli\s*[a-x0-9]+,\s*zero} } } */
+/* { dg-final { scan-assembler-not {vsetvli\s*zero} } } */
+/* { dg-final { scan-assembler-not {vsetivli\s*zero} } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-2.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-2.c
index 965365da4bb..13367423751 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-2.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-2.c
@@ -3,7 +3,6 @@
#include "ternop-2.c"
-/* { dg-final { scan-assembler-times {\tvmacc\.vv} 8 } } */
/* { dg-final { scan-assembler-times {\tvfma[c-d][c-d]\.vv} 9 } } */
/* { dg-final { scan-tree-dump-times "COND_LEN_FMA" 9 "optimized" } } */
/* { dg-final { scan-assembler-not {\tvmv} } } */
Yeah, no worries. We will eventually run full-coverage testing and fix all bugs in stage 3 and stage 4.
We are planning to run the whole GCC testsuite with all of the following compile options:
-march=rv64gcv_zvl128b --param=riscv-autovec-lmul=m1
-march=rv64gcv_zvl128b --param=riscv-autovec-lmul=m2
-march=rv64gcv_zvl128b --param=riscv-autovec-lmul=m4
-march=rv64gcv_zvl128b --param=riscv-autovec-lmul=m8
-march=rv64gcv_zvl128b --param=riscv-autovec-lmul=dynamic
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=m1
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=m2
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=m4
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=m8
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=dynamic
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=m1
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=m2
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=m4
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=m8
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=dynamic
-march=rv64gcv_zvl1024b --param=riscv-autovec-lmul=m1
-march=rv64gcv_zvl1024b --param=riscv-autovec-lmul=m2
-march=rv64gcv_zvl1024b --param=riscv-autovec-lmul=m4
-march=rv64gcv_zvl1024b --param=riscv-autovec-lmul=m8
-march=rv64gcv_zvl1024b --param=riscv-autovec-lmul=dynamic
-march=rv64gcv_zvl2048b --param=riscv-autovec-lmul=m1
-march=rv64gcv_zvl2048b --param=riscv-autovec-lmul=m2
-march=rv64gcv_zvl2048b --param=riscv-autovec-lmul=m4
-march=rv64gcv_zvl2048b --param=riscv-autovec-lmul=m8
-march=rv64gcv_zvl2048b --param=riscv-autovec-lmul=dynamic
-march=rv64gcv_zvl4096b --param=riscv-autovec-lmul=m1
-march=rv64gcv_zvl4096b --param=riscv-autovec-lmul=m2
-march=rv64gcv_zvl4096b --param=riscv-autovec-lmul=m4
-march=rv64gcv_zvl4096b --param=riscv-autovec-lmul=m8
-march=rv64gcv_zvl4096b --param=riscv-autovec-lmul=dynamic
-march=rv64gcv_zvl128b --param=riscv-autovec-lmul=m1 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl128b --param=riscv-autovec-lmul=m2 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl128b --param=riscv-autovec-lmul=m4 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl128b --param=riscv-autovec-lmul=m8 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl128b --param=riscv-autovec-lmul=dynamic --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=m1 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=m2 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=m4 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=m8 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl256b --param=riscv-autovec-lmul=dynamic --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=m1 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=m2 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=m4 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=m8 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl512b --param=riscv-autovec-lmul=dynamic --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl1024b --param=riscv-autovec-lmul=m1 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl1024b --param=riscv-autovec-lmul=m2 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl1024b --param=riscv-autovec-lmul=m4 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl1024b --param=riscv-autovec-lmul=m8 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl1024b --param=riscv-autovec-lmul=dynamic --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl2048b --param=riscv-autovec-lmul=m1 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl2048b --param=riscv-autovec-lmul=m2 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl2048b --param=riscv-autovec-lmul=m4 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl2048b --param=riscv-autovec-lmul=m8 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl2048b --param=riscv-autovec-lmul=dynamic --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl4096b --param=riscv-autovec-lmul=m1 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl4096b --param=riscv-autovec-lmul=m2 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl4096b --param=riscv-autovec-lmul=m4 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl4096b --param=riscv-autovec-lmul=m8 --param=riscv-autovec-preference=fixed-vlmax
-march=rv64gcv_zvl4096b --param=riscv-autovec-lmul=dynamic --param=riscv-autovec-preference=fixed-vlmax
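(For reference, this list is the full cross product of six zvl settings, five LMUL settings, and the optional fixed-vlmax preference: 60 configurations. Here is a throwaway sketch, purely illustrative and not part of the patch, that regenerates the list for scripting the runs:)

/* Throwaway generator for the 60 option combinations above.  */
#include <cstdio>

int
main ()
{
  const char *zvl[] = { "128", "256", "512", "1024", "2048", "4096" };
  const char *lmul[] = { "m1", "m2", "m4", "m8", "dynamic" };
  const char *pref[] = { "", " --param=riscv-autovec-preference=fixed-vlmax" };

  for (const char *p : pref)
    for (const char *z : zvl)
      for (const char *l : lmul)
	printf ("-march=rv64gcv_zvl%sb --param=riscv-autovec-lmul=%s%s\n",
		z, l, p);
}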
We will need your help with CI. Currently it's still stage 1, and we are working on landing as many optimizations as possible.
Thanks.
juzhe.zhong@rivai.ai
From: Patrick O'Neill
Date: 2023-10-27 02:43
To: Robin Dapp; Juzhe-Zhong
CC: Kito Cheng; gcc-patches
Subject: Re: [Ready to commit V3] RISC-V: Add AVL propagation PASS for RVV auto-vectorization
On 10/26/23 11:15, Robin Dapp wrote:
rv32gcv:
FAIL: gfortran.dg/intrinsic_pack_6.f90 -O2 execution test
FAIL: gfortran.dg/intrinsic_pack_6.f90 -O3 -g execution test
FAIL: gfortran.dg/matmul_3.f90 -O2 execution test
FAIL: gfortran.fortran-torture/execute/intrinsic_matmul.f90 execution, -O2
FAIL: gfortran.fortran-torture/execute/intrinsic_matmul.f90 execution, -O2 -fbounds-check
FAIL: gfortran.fortran-torture/execute/intrinsic_matmul.f90 execution, -O2 -fomit-frame-pointer -finline-functions
FAIL: gfortran.fortran-torture/execute/intrinsic_matmul.f90 execution, -O3 -g
rv64gcv:
FAIL: gfortran.dg/matmul_6.f90 -O2 execution test
Those might also flip-flop; I have seen them FAIL and PASS randomly
before. It looks like there are at least 10 of those; we really need
to figure out the root cause...
Regards
Robin
I've seen the same thing on CI for some of these failures on rv32gcv but always as a group:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111969
Example CI run with the flaky group:
https://github.com/patrick-rivos/gcc-postcommit-ci/issues/75
The fact that some of them are resolved while the full group is not makes me hopeful that:
FAIL: gfortran.dg/intrinsic_pack_6.f90 execution test
FAIL: gfortran.fortran-torture/execute/intrinsic_matmul.f90 execution
are really resolved
I haven't seen these testcases be flaky on CI:
FAIL: gfortran.dg/matmul_3.f90 -O2 execution test
FAIL: gfortran.dg/matmul_6.f90 -O2 execution test
Patrick
../../gcc/config/riscv/riscv-avlprop.cc: In member function 'virtual unsigned int pass_avlprop::execute(function*)':
../../gcc/config/riscv/riscv-avlprop.cc:346:23: error: loop variable 'candidate' creates a copy from type 'const std::pair<avlprop_type, rtl_ssa::insn_info*>' [-Werror=range-loop-construct]
346 | for (const auto candidate : m_candidates)
| ^~~~~~~~~
../../gcc/config/riscv/riscv-avlprop.cc:346:23: note: use reference type to prevent copying
346 | for (const auto candidate : m_candidates)
| ^~~~~~~~~
| &
This should be fixed by the patch below; feel free to ping me if there are any issues.
https://gcc.gnu.org/pipermail/gcc-patches/2023-October/634616.html
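For the record, the fix is exactly what the diagnostic's note suggests: bind the loop variable by const reference so no copy of the std::pair is made. A sketch of the corrected dump loop (the authoritative change is in the patch linked above):

  /* riscv-avlprop.cc, dump loop: iterate by const reference to
     silence -Werror=range-loop-construct.  */
  for (const auto &candidate : m_candidates)
    {
      fprintf (dump_file, "\nAVL propagation type: %s\n",
	       avlprop_type_to_str (candidate.first));
      print_rtl_single (dump_file, candidate.second->rtl ());
    }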
Pan
-----Original Message-----
From: Andreas Schwab <schwab@linux-m68k.org>
Sent: Saturday, October 28, 2023 4:16 PM
To: Juzhe Zhong <juzhe.zhong@rivai.ai>
Cc: patrick <patrick@rivosinc.com>; gcc-patches <gcc-patches@gcc.gnu.org>; kito.cheng <kito.cheng@gmail.com>; rdapp.gcc <rdapp.gcc@gmail.com>
Subject: Re: [Ready to commit V3] RISC-V: Add AVL propagation PASS for RVV auto-vectorization
../../gcc/config/riscv/riscv-avlprop.cc: In member function 'virtual unsigned int pass_avlprop::execute(function*)':
../../gcc/config/riscv/riscv-avlprop.cc:346:23: error: loop variable 'candidate' creates a copy from type 'const std::pair<avlprop_type, rtl_ssa::insn_info*>' [-Werror=range-loop-construct]
346 | for (const auto candidate : m_candidates)
| ^~~~~~~~~
../../gcc/config/riscv/riscv-avlprop.cc:346:23: note: use reference type to prevent copying
346 | for (const auto candidate : m_candidates)
| ^~~~~~~~~
| &
--
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1
"And now for something completely different."