CPROP: Allow cprop optimization when the function has a single block
Checks
Commit Message
From: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>
Hi, this patch is present for GCC 14 since I understand it's not appropriate
to land it in GCC 13.
NUM_FIXED_BLOCKS = 2 since GCC define each function has aleast 2 blocks
one is entry block, the other is exit block.
So according this code, the function will not do cprop optimization when
there is only exactly one single block.
I am not sure whether it's correct to fix it like this.
Can anyone tell me why forbid cprop optimization if the function only has s
single block ?
Let's take a look at these 2 examples of RVV intrinsics:
1. void f1 (void * in, void *out, int64_t x, int n)
{
vint64m1_t v = __riscv_vle64_v_i64m1 (in + 1, 4);
vint64m1_t v2 = __riscv_vle64_v_i64m1_tu (v, in + 2, 4);
vint64m1_t v3 = __riscv_vadd_vx_i64m1 (v2, x, 4);
vint64m1_t v4 = __riscv_vadd_vx_i64m1 (v3, x, 4);
__riscv_vse64_v_i64m1 (out + 2, v4, 4);
}
asm:
addi sp,sp,-16
sw a2,8(sp)
sw a3,12(sp)
sw a2,0(sp)
sw a3,4(sp)
addi a5,a0,1
vsetivli zero,4,e64,m1,ta,ma
addi a0,a0,2
vle64.v v24,0(a5)
addi a5,sp,8
vlse64.v v27,0(a5),zero
addi a1,a1,2
vsetvli zero,zero,e64,m1,tu,ma
vle64.v v24,0(a0)
vsetvli zero,zero,e64,m1,ta,ma
vlse64.v v25,0(sp),zero
vadd.vv v26,v24,v27
vadd.vv v24,v26,v25
vse64.v v24,0(a1)
addi sp,sp,16
jr ra
you can see here there are 2 vlse64.v instructions that broadcast the scalar value "x"
GCC fail to eliminate the second vlse64.v instruction since GCC doesn't do the cprop
optimization (the function only has 1 single block). It can be optimized if we apply
this patch.
2. void f1 (void * in, void *out, int64_t x, int n)
{
if (n) {
vint64m1_t v = __riscv_vle64_v_i64m1 (in + 1, 4);
vint64m1_t v2 = __riscv_vle64_v_i64m1_tu (v, in + 2, 4);
vint64m1_t v3 = __riscv_vadd_vx_i64m1 (v2, x, 4);
vint64m1_t v4 = __riscv_vadd_vx_i64m1 (v3, x, 4);
__riscv_vse64_v_i64m1 (out + 2, v4, 4);
}
}
asm:
f1:
vsetivli zero,4,e64,m1,ta,ma
beq a4,zero,.L7
addi sp,sp,-16
sw a2,8(sp)
sw a3,12(sp)
addi a5,a0,1
vle64.v v24,0(a5)
addi a0,a0,2
addi a5,sp,8
vlse64.v v25,0(a5),zero
addi a1,a1,2
vsetvli zero,zero,e64,m1,tu,ma
vle64.v v24,0(a0)
vadd.vv v26,v24,v25
vadd.vv v24,v26,v25
vse64.v v24,0(a1)
addi sp,sp,16
jr ra
.L7:
ret
Here, if we add if (n) condition here, the program will end up with more than 1 block.
So GCC will do the cprop optimization and the second vlse64.v instruction is eliminated.
I am not sure whether this patch is correct.
Can anyone help me with that ?
Thanks.
gcc/ChangeLog:
* cprop.cc (one_cprop_pass): Remove +1.
---
gcc/cprop.cc | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Comments
On Wed, 1 Feb 2023, juzhe.zhong@rivai.ai wrote:
> From: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>
>
> Hi, this patch is present for GCC 14 since I understand it's not appropriate
> to land it in GCC 13.
>
> NUM_FIXED_BLOCKS = 2 since GCC define each function has aleast 2 blocks
> one is entry block, the other is exit block.
> So according this code, the function will not do cprop optimization when
> there is only exactly one single block.
cprop / GCSE are global dataflow problems, there's "nothing" to do for
a single block, the local problem plus its application isn't considered
but probably left for CSE.
Why does CSE not perform the desired transform?
> I am not sure whether it's correct to fix it like this.
> Can anyone tell me why forbid cprop optimization if the function only has s
> single block ?
>
> Let's take a look at these 2 examples of RVV intrinsics:
> 1. void f1 (void * in, void *out, int64_t x, int n)
> {
> vint64m1_t v = __riscv_vle64_v_i64m1 (in + 1, 4);
> vint64m1_t v2 = __riscv_vle64_v_i64m1_tu (v, in + 2, 4);
> vint64m1_t v3 = __riscv_vadd_vx_i64m1 (v2, x, 4);
> vint64m1_t v4 = __riscv_vadd_vx_i64m1 (v3, x, 4);
> __riscv_vse64_v_i64m1 (out + 2, v4, 4);
> }
>
> asm:
> addi sp,sp,-16
> sw a2,8(sp)
> sw a3,12(sp)
> sw a2,0(sp)
> sw a3,4(sp)
> addi a5,a0,1
> vsetivli zero,4,e64,m1,ta,ma
> addi a0,a0,2
> vle64.v v24,0(a5)
> addi a5,sp,8
> vlse64.v v27,0(a5),zero
> addi a1,a1,2
> vsetvli zero,zero,e64,m1,tu,ma
> vle64.v v24,0(a0)
> vsetvli zero,zero,e64,m1,ta,ma
> vlse64.v v25,0(sp),zero
> vadd.vv v26,v24,v27
> vadd.vv v24,v26,v25
> vse64.v v24,0(a1)
> addi sp,sp,16
> jr ra
> you can see here there are 2 vlse64.v instructions that broadcast the scalar value "x"
> GCC fail to eliminate the second vlse64.v instruction since GCC doesn't do the cprop
> optimization (the function only has 1 single block). It can be optimized if we apply
> this patch.
>
> 2. void f1 (void * in, void *out, int64_t x, int n)
> {
> if (n) {
> vint64m1_t v = __riscv_vle64_v_i64m1 (in + 1, 4);
> vint64m1_t v2 = __riscv_vle64_v_i64m1_tu (v, in + 2, 4);
> vint64m1_t v3 = __riscv_vadd_vx_i64m1 (v2, x, 4);
> vint64m1_t v4 = __riscv_vadd_vx_i64m1 (v3, x, 4);
> __riscv_vse64_v_i64m1 (out + 2, v4, 4);
> }
> }
>
> asm:
> f1:
> vsetivli zero,4,e64,m1,ta,ma
> beq a4,zero,.L7
> addi sp,sp,-16
> sw a2,8(sp)
> sw a3,12(sp)
> addi a5,a0,1
> vle64.v v24,0(a5)
> addi a0,a0,2
> addi a5,sp,8
> vlse64.v v25,0(a5),zero
> addi a1,a1,2
> vsetvli zero,zero,e64,m1,tu,ma
> vle64.v v24,0(a0)
> vadd.vv v26,v24,v25
> vadd.vv v24,v26,v25
> vse64.v v24,0(a1)
> addi sp,sp,16
> jr ra
> .L7:
> ret
>
> Here, if we add if (n) condition here, the program will end up with more than 1 block.
> So GCC will do the cprop optimization and the second vlse64.v instruction is eliminated.
>
> I am not sure whether this patch is correct.
> Can anyone help me with that ?
> Thanks.
>
>
> gcc/ChangeLog:
>
> * cprop.cc (one_cprop_pass): Remove +1.
>
> ---
> gcc/cprop.cc | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/gcc/cprop.cc b/gcc/cprop.cc
> index 6ec0bda4a24..615bc5078b6 100644
> --- a/gcc/cprop.cc
> +++ b/gcc/cprop.cc
> @@ -1749,7 +1749,7 @@ one_cprop_pass (void)
> int changed = 0;
>
> /* Return if there's nothing to do, or it is too expensive. */
> - if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS + 1
> + if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS
> || gcse_or_cprop_is_too_expensive (_ ("const/copy propagation disabled")))
> return 0;
>
>
I don't known whether CSE do the job. What I saw is CPROP do the optimization when we have more than 1 block.
This the RTL dump before CPROP:
(insn 19 18 20 4 (set (reg:VNx1DI 151)
(if_then_else:VNx1DI (unspec:VNx1BI [
(const_vector:VNx1BI repeat [
(const_int 1 [0x1])
])
(const_int 4 [0x4])
(const_int 2 [0x2]) repeated x2
(const_int 0 [0])
(reg:SI 66 vl)
(reg:SI 67 vtype)
] UNSPEC_VPREDICATE)
(vec_duplicate:VNx1DI (reg/v:DI 148 [ x ]))
(unspec:VNx1DI [
(const_int 0 [0])
] UNSPEC_VUNDEF))) "rvv.c":22:23 695 {pred_broadcastvnx1di}
(nil))
(insn 20 19 21 4 (set (reg/v:VNx1DI 139 [ v3 ])
(if_then_else:VNx1DI (unspec:VNx1BI [
(const_vector:VNx1BI repeat [
(const_int 1 [0x1])
])
(const_int 4 [0x4])
(const_int 2 [0x2]) repeated x2
(const_int 0 [0])
(reg:SI 66 vl)
(reg:SI 67 vtype)
] UNSPEC_VPREDICATE)
(plus:VNx1DI (reg/v:VNx1DI 138 [ v2 ])
(reg:VNx1DI 151))
(unspec:VNx1DI [
(const_int 0 [0])
] UNSPEC_VUNDEF))) "rvv.c":22:23 1528 {pred_addvnx1di}
(expr_list:REG_DEAD (reg:VNx1DI 151)
(nil)))
(insn 21 20 22 4 (set (reg:VNx1DI 152)
(if_then_else:VNx1DI (unspec:VNx1BI [
(const_vector:VNx1BI repeat [
(const_int 1 [0x1])
])
(const_int 4 [0x4])
(const_int 2 [0x2]) repeated x2
(const_int 0 [0])
(reg:SI 66 vl)
(reg:SI 67 vtype)
] UNSPEC_VPREDICATE)
(vec_duplicate:VNx1DI (reg/v:DI 148 [ x ]))
(unspec:VNx1DI [
(const_int 0 [0])
] UNSPEC_VUNDEF))) "rvv.c":23:23 695 {pred_broadcastvnx1di}
(nil))
(insn 22 21 23 4 (set (reg/v:VNx1DI 140 [ v4 ])
(if_then_else:VNx1DI (unspec:VNx1BI [
(const_vector:VNx1BI repeat [
(const_int 1 [0x1])
])
(const_int 4 [0x4])
(const_int 0 [0])
(const_int 2 [0x2])
(const_int 0 [0])
(reg:SI 66 vl)
(reg:SI 67 vtype)
] UNSPEC_VPREDICATE)
(plus:VNx1DI (reg/v:VNx1DI 138 [ v2 ])
(reg:VNx1DI 152))
(reg/v:VNx1DI 139 [ v3 ]))) "rvv.c":23:23 1528 {pred_addvnx1di}
(expr_list:REG_DEAD (reg:VNx1DI 152)
(expr_list:REG_DEAD (reg/v:VNx1DI 139 [ v3 ])
(expr_list:REG_DEAD (reg/v:VNx1DI 138 [ v2 ])
(nil)))))
After CRPOP:
(insn 15 14 16 4 (set (reg:VNx1DI 147)
(if_then_else:VNx1DI (unspec:VNx1BI [
(const_vector:VNx1BI repeat [
(const_int 1 [0x1])
])
(const_int 4 [0x4])
(const_int 2 [0x2]) repeated x2
(const_int 0 [0])
(reg:SI 66 vl)
(reg:SI 67 vtype)
] UNSPEC_VPREDICATE)
(vec_duplicate:VNx1DI (reg/v:DI 143 [ x ]))
(unspec:VNx1DI [
(const_int 0 [0])
] UNSPEC_VUNDEF))) "rvv.c":11:23 695 {pred_broadcastvnx1di}
(nil))
(insn 16 15 18 4 (set (reg/v:VNx1DI 139 [ v3 ])
(if_then_else:VNx1DI (unspec:VNx1BI [
(const_vector:VNx1BI repeat [
(const_int 1 [0x1])
])
(const_int 4 [0x4])
(const_int 2 [0x2]) repeated x2
(const_int 0 [0])
(reg:SI 66 vl)
(reg:SI 67 vtype)
] UNSPEC_VPREDICATE)
(plus:VNx1DI (reg/v:VNx1DI 138 [ v2 ])
(reg:VNx1DI 147))
(unspec:VNx1DI [
(const_int 0 [0])
] UNSPEC_VUNDEF))) "rvv.c":11:23 1528 {pred_addvnx1di}
(expr_list:REG_DEAD (reg:VNx1DI 147)
(expr_list:REG_DEAD (reg/v:VNx1DI 138 [ v2 ])
(nil))))
(insn 18 16 19 4 (set (reg/v:VNx1DI 140 [ v4 ])
(if_then_else:VNx1DI (unspec:VNx1BI [
(const_vector:VNx1BI repeat [
(const_int 1 [0x1])
])
(const_int 4 [0x4])
(const_int 2 [0x2]) repeated x2
(const_int 0 [0])
(reg:SI 66 vl)
(reg:SI 67 vtype)
] UNSPEC_VPREDICATE)
(plus:VNx1DI (reg/v:VNx1DI 139 [ v3 ])
(reg:VNx1DI 147))
(unspec:VNx1DI [
(const_int 0 [0])
] UNSPEC_VUNDEF))) "rvv.c":12:23 1528 {pred_addvnx1di}
(expr_list:REG_DEAD (reg:VNx1DI 148)
(expr_list:REG_DEAD (reg/v:VNx1DI 139 [ v3 ])
(nil))))
You can see CPROP remove the second the "pred_broadcast" instruction and propagate the result to the second "pred_add" instruction。
juzhe.zhong@rivai.ai
From: Richard Biener
Date: 2023-02-01 20:40
To: Ju-Zhe Zhong
CC: gcc-patches; kito.cheng; richard.sandiford; jeffreyalaw; apinski
Subject: Re: [PATCH] CPROP: Allow cprop optimization when the function has a single block
On Wed, 1 Feb 2023, juzhe.zhong@rivai.ai wrote:
> From: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>
>
> Hi, this patch is present for GCC 14 since I understand it's not appropriate
> to land it in GCC 13.
>
> NUM_FIXED_BLOCKS = 2 since GCC define each function has aleast 2 blocks
> one is entry block, the other is exit block.
> So according this code, the function will not do cprop optimization when
> there is only exactly one single block.
cprop / GCSE are global dataflow problems, there's "nothing" to do for
a single block, the local problem plus its application isn't considered
but probably left for CSE.
Why does CSE not perform the desired transform?
> I am not sure whether it's correct to fix it like this.
> Can anyone tell me why forbid cprop optimization if the function only has s
> single block ?
>
> Let's take a look at these 2 examples of RVV intrinsics:
> 1. void f1 (void * in, void *out, int64_t x, int n)
> {
> vint64m1_t v = __riscv_vle64_v_i64m1 (in + 1, 4);
> vint64m1_t v2 = __riscv_vle64_v_i64m1_tu (v, in + 2, 4);
> vint64m1_t v3 = __riscv_vadd_vx_i64m1 (v2, x, 4);
> vint64m1_t v4 = __riscv_vadd_vx_i64m1 (v3, x, 4);
> __riscv_vse64_v_i64m1 (out + 2, v4, 4);
> }
>
> asm:
> addi sp,sp,-16
> sw a2,8(sp)
> sw a3,12(sp)
> sw a2,0(sp)
> sw a3,4(sp)
> addi a5,a0,1
> vsetivli zero,4,e64,m1,ta,ma
> addi a0,a0,2
> vle64.v v24,0(a5)
> addi a5,sp,8
> vlse64.v v27,0(a5),zero
> addi a1,a1,2
> vsetvli zero,zero,e64,m1,tu,ma
> vle64.v v24,0(a0)
> vsetvli zero,zero,e64,m1,ta,ma
> vlse64.v v25,0(sp),zero
> vadd.vv v26,v24,v27
> vadd.vv v24,v26,v25
> vse64.v v24,0(a1)
> addi sp,sp,16
> jr ra
> you can see here there are 2 vlse64.v instructions that broadcast the scalar value "x"
> GCC fail to eliminate the second vlse64.v instruction since GCC doesn't do the cprop
> optimization (the function only has 1 single block). It can be optimized if we apply
> this patch.
>
> 2. void f1 (void * in, void *out, int64_t x, int n)
> {
> if (n) {
> vint64m1_t v = __riscv_vle64_v_i64m1 (in + 1, 4);
> vint64m1_t v2 = __riscv_vle64_v_i64m1_tu (v, in + 2, 4);
> vint64m1_t v3 = __riscv_vadd_vx_i64m1 (v2, x, 4);
> vint64m1_t v4 = __riscv_vadd_vx_i64m1 (v3, x, 4);
> __riscv_vse64_v_i64m1 (out + 2, v4, 4);
> }
> }
>
> asm:
> f1:
> vsetivli zero,4,e64,m1,ta,ma
> beq a4,zero,.L7
> addi sp,sp,-16
> sw a2,8(sp)
> sw a3,12(sp)
> addi a5,a0,1
> vle64.v v24,0(a5)
> addi a0,a0,2
> addi a5,sp,8
> vlse64.v v25,0(a5),zero
> addi a1,a1,2
> vsetvli zero,zero,e64,m1,tu,ma
> vle64.v v24,0(a0)
> vadd.vv v26,v24,v25
> vadd.vv v24,v26,v25
> vse64.v v24,0(a1)
> addi sp,sp,16
> jr ra
> .L7:
> ret
>
> Here, if we add if (n) condition here, the program will end up with more than 1 block.
> So GCC will do the cprop optimization and the second vlse64.v instruction is eliminated.
>
> I am not sure whether this patch is correct.
> Can anyone help me with that ?
> Thanks.
>
>
> gcc/ChangeLog:
>
> * cprop.cc (one_cprop_pass): Remove +1.
>
> ---
> gcc/cprop.cc | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/gcc/cprop.cc b/gcc/cprop.cc
> index 6ec0bda4a24..615bc5078b6 100644
> --- a/gcc/cprop.cc
> +++ b/gcc/cprop.cc
> @@ -1749,7 +1749,7 @@ one_cprop_pass (void)
> int changed = 0;
>
> /* Return if there's nothing to do, or it is too expensive. */
> - if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS + 1
> + if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS
> || gcse_or_cprop_is_too_expensive (_ ("const/copy propagation disabled")))
> return 0;
>
>
--
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)
On Wed, 1 Feb 2023, juzhe.zhong@rivai.ai wrote:
> I don't known whether CSE do the job. What I saw is CPROP do the optimization when we have more than 1 block.
>
> This the RTL dump before CPROP:
>
> (insn 19 18 20 4 (set (reg:VNx1DI 151)
> (if_then_else:VNx1DI (unspec:VNx1BI [
> (const_vector:VNx1BI repeat [
> (const_int 1 [0x1])
> ])
> (const_int 4 [0x4])
> (const_int 2 [0x2]) repeated x2
> (const_int 0 [0])
> (reg:SI 66 vl)
> (reg:SI 67 vtype)
> ] UNSPEC_VPREDICATE)
> (vec_duplicate:VNx1DI (reg/v:DI 148 [ x ]))
> (unspec:VNx1DI [
> (const_int 0 [0])
> ] UNSPEC_VUNDEF))) "rvv.c":22:23 695 {pred_broadcastvnx1di}
> (nil))
> (insn 20 19 21 4 (set (reg/v:VNx1DI 139 [ v3 ])
> (if_then_else:VNx1DI (unspec:VNx1BI [
> (const_vector:VNx1BI repeat [
> (const_int 1 [0x1])
> ])
> (const_int 4 [0x4])
> (const_int 2 [0x2]) repeated x2
> (const_int 0 [0])
> (reg:SI 66 vl)
> (reg:SI 67 vtype)
> ] UNSPEC_VPREDICATE)
> (plus:VNx1DI (reg/v:VNx1DI 138 [ v2 ])
> (reg:VNx1DI 151))
> (unspec:VNx1DI [
> (const_int 0 [0])
> ] UNSPEC_VUNDEF))) "rvv.c":22:23 1528 {pred_addvnx1di}
> (expr_list:REG_DEAD (reg:VNx1DI 151)
> (nil)))
> (insn 21 20 22 4 (set (reg:VNx1DI 152)
> (if_then_else:VNx1DI (unspec:VNx1BI [
> (const_vector:VNx1BI repeat [
> (const_int 1 [0x1])
> ])
> (const_int 4 [0x4])
> (const_int 2 [0x2]) repeated x2
> (const_int 0 [0])
> (reg:SI 66 vl)
> (reg:SI 67 vtype)
> ] UNSPEC_VPREDICATE)
> (vec_duplicate:VNx1DI (reg/v:DI 148 [ x ]))
> (unspec:VNx1DI [
> (const_int 0 [0])
> ] UNSPEC_VUNDEF))) "rvv.c":23:23 695 {pred_broadcastvnx1di}
> (nil))
> (insn 22 21 23 4 (set (reg/v:VNx1DI 140 [ v4 ])
> (if_then_else:VNx1DI (unspec:VNx1BI [
> (const_vector:VNx1BI repeat [
> (const_int 1 [0x1])
> ])
> (const_int 4 [0x4])
> (const_int 0 [0])
> (const_int 2 [0x2])
> (const_int 0 [0])
> (reg:SI 66 vl)
> (reg:SI 67 vtype)
> ] UNSPEC_VPREDICATE)
> (plus:VNx1DI (reg/v:VNx1DI 138 [ v2 ])
> (reg:VNx1DI 152))
> (reg/v:VNx1DI 139 [ v3 ]))) "rvv.c":23:23 1528 {pred_addvnx1di}
> (expr_list:REG_DEAD (reg:VNx1DI 152)
> (expr_list:REG_DEAD (reg/v:VNx1DI 139 [ v3 ])
> (expr_list:REG_DEAD (reg/v:VNx1DI 138 [ v2 ])
> (nil)))))
>
> After CRPOP:
> (insn 15 14 16 4 (set (reg:VNx1DI 147)
> (if_then_else:VNx1DI (unspec:VNx1BI [
> (const_vector:VNx1BI repeat [
> (const_int 1 [0x1])
> ])
> (const_int 4 [0x4])
> (const_int 2 [0x2]) repeated x2
> (const_int 0 [0])
> (reg:SI 66 vl)
> (reg:SI 67 vtype)
> ] UNSPEC_VPREDICATE)
> (vec_duplicate:VNx1DI (reg/v:DI 143 [ x ]))
> (unspec:VNx1DI [
> (const_int 0 [0])
> ] UNSPEC_VUNDEF))) "rvv.c":11:23 695 {pred_broadcastvnx1di}
> (nil))
> (insn 16 15 18 4 (set (reg/v:VNx1DI 139 [ v3 ])
> (if_then_else:VNx1DI (unspec:VNx1BI [
> (const_vector:VNx1BI repeat [
> (const_int 1 [0x1])
> ])
> (const_int 4 [0x4])
> (const_int 2 [0x2]) repeated x2
> (const_int 0 [0])
> (reg:SI 66 vl)
> (reg:SI 67 vtype)
> ] UNSPEC_VPREDICATE)
> (plus:VNx1DI (reg/v:VNx1DI 138 [ v2 ])
> (reg:VNx1DI 147))
> (unspec:VNx1DI [
> (const_int 0 [0])
> ] UNSPEC_VUNDEF))) "rvv.c":11:23 1528 {pred_addvnx1di}
> (expr_list:REG_DEAD (reg:VNx1DI 147)
> (expr_list:REG_DEAD (reg/v:VNx1DI 138 [ v2 ])
> (nil))))
> (insn 18 16 19 4 (set (reg/v:VNx1DI 140 [ v4 ])
> (if_then_else:VNx1DI (unspec:VNx1BI [
> (const_vector:VNx1BI repeat [
> (const_int 1 [0x1])
> ])
> (const_int 4 [0x4])
> (const_int 2 [0x2]) repeated x2
> (const_int 0 [0])
> (reg:SI 66 vl)
> (reg:SI 67 vtype)
> ] UNSPEC_VPREDICATE)
> (plus:VNx1DI (reg/v:VNx1DI 139 [ v3 ])
> (reg:VNx1DI 147))
> (unspec:VNx1DI [
> (const_int 0 [0])
> ] UNSPEC_VUNDEF))) "rvv.c":12:23 1528 {pred_addvnx1di}
> (expr_list:REG_DEAD (reg:VNx1DI 148)
> (expr_list:REG_DEAD (reg/v:VNx1DI 139 [ v3 ])
> (nil))))
>
> You can see CPROP remove the second the "pred_broadcast" instruction and propagate the result to the second "pred_add" instruction?
I see loads of UNSPECs, that might explain why some passes do something
and some not. That said, not sure what exactly CPROP does, the two
pred_broadcast insns look exactly the same so CSE should CSE them?
>
>
> juzhe.zhong@rivai.ai
>
> From: Richard Biener
> Date: 2023-02-01 20:40
> To: Ju-Zhe Zhong
> CC: gcc-patches; kito.cheng; richard.sandiford; jeffreyalaw; apinski
> Subject: Re: [PATCH] CPROP: Allow cprop optimization when the function has a single block
> On Wed, 1 Feb 2023, juzhe.zhong@rivai.ai wrote:
>
> > From: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>
> >
> > Hi, this patch is present for GCC 14 since I understand it's not appropriate
> > to land it in GCC 13.
> >
> > NUM_FIXED_BLOCKS = 2 since GCC define each function has aleast 2 blocks
> > one is entry block, the other is exit block.
> > So according this code, the function will not do cprop optimization when
> > there is only exactly one single block.
>
> cprop / GCSE are global dataflow problems, there's "nothing" to do for
> a single block, the local problem plus its application isn't considered
> but probably left for CSE.
>
> Why does CSE not perform the desired transform?
>
> > I am not sure whether it's correct to fix it like this.
> > Can anyone tell me why forbid cprop optimization if the function only has s
> > single block ?
> >
> > Let's take a look at these 2 examples of RVV intrinsics:
> > 1. void f1 (void * in, void *out, int64_t x, int n)
> > {
> > vint64m1_t v = __riscv_vle64_v_i64m1 (in + 1, 4);
> > vint64m1_t v2 = __riscv_vle64_v_i64m1_tu (v, in + 2, 4);
> > vint64m1_t v3 = __riscv_vadd_vx_i64m1 (v2, x, 4);
> > vint64m1_t v4 = __riscv_vadd_vx_i64m1 (v3, x, 4);
> > __riscv_vse64_v_i64m1 (out + 2, v4, 4);
> > }
> >
> > asm:
> > addi sp,sp,-16
> > sw a2,8(sp)
> > sw a3,12(sp)
> > sw a2,0(sp)
> > sw a3,4(sp)
> > addi a5,a0,1
> > vsetivli zero,4,e64,m1,ta,ma
> > addi a0,a0,2
> > vle64.v v24,0(a5)
> > addi a5,sp,8
> > vlse64.v v27,0(a5),zero
> > addi a1,a1,2
> > vsetvli zero,zero,e64,m1,tu,ma
> > vle64.v v24,0(a0)
> > vsetvli zero,zero,e64,m1,ta,ma
> > vlse64.v v25,0(sp),zero
> > vadd.vv v26,v24,v27
> > vadd.vv v24,v26,v25
> > vse64.v v24,0(a1)
> > addi sp,sp,16
> > jr ra
> > you can see here there are 2 vlse64.v instructions that broadcast the scalar value "x"
> > GCC fail to eliminate the second vlse64.v instruction since GCC doesn't do the cprop
> > optimization (the function only has 1 single block). It can be optimized if we apply
> > this patch.
> >
> > 2. void f1 (void * in, void *out, int64_t x, int n)
> > {
> > if (n) {
> > vint64m1_t v = __riscv_vle64_v_i64m1 (in + 1, 4);
> > vint64m1_t v2 = __riscv_vle64_v_i64m1_tu (v, in + 2, 4);
> > vint64m1_t v3 = __riscv_vadd_vx_i64m1 (v2, x, 4);
> > vint64m1_t v4 = __riscv_vadd_vx_i64m1 (v3, x, 4);
> > __riscv_vse64_v_i64m1 (out + 2, v4, 4);
> > }
> > }
> >
> > asm:
> > f1:
> > vsetivli zero,4,e64,m1,ta,ma
> > beq a4,zero,.L7
> > addi sp,sp,-16
> > sw a2,8(sp)
> > sw a3,12(sp)
> > addi a5,a0,1
> > vle64.v v24,0(a5)
> > addi a0,a0,2
> > addi a5,sp,8
> > vlse64.v v25,0(a5),zero
> > addi a1,a1,2
> > vsetvli zero,zero,e64,m1,tu,ma
> > vle64.v v24,0(a0)
> > vadd.vv v26,v24,v25
> > vadd.vv v24,v26,v25
> > vse64.v v24,0(a1)
> > addi sp,sp,16
> > jr ra
> > .L7:
> > ret
> >
> > Here, if we add if (n) condition here, the program will end up with more than 1 block.
> > So GCC will do the cprop optimization and the second vlse64.v instruction is eliminated.
> >
> > I am not sure whether this patch is correct.
> > Can anyone help me with that ?
> > Thanks.
> >
> >
> > gcc/ChangeLog:
> >
> > * cprop.cc (one_cprop_pass): Remove +1.
> >
> > ---
> > gcc/cprop.cc | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/gcc/cprop.cc b/gcc/cprop.cc
> > index 6ec0bda4a24..615bc5078b6 100644
> > --- a/gcc/cprop.cc
> > +++ b/gcc/cprop.cc
> > @@ -1749,7 +1749,7 @@ one_cprop_pass (void)
> > int changed = 0;
> >
> > /* Return if there's nothing to do, or it is too expensive. */
> > - if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS + 1
> > + if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS
> > || gcse_or_cprop_is_too_expensive (_ ("const/copy propagation disabled")))
> > return 0;
> >
> >
>
>
>> I see loads of UNSPECs, that might explain why some passes do something
>> and some not. That said, not sure what exactly CPROP does, the two
>> pred_broadcast insns look exactly the same so CSE should CSE them?
Yes, the "source" these 2 pred_broadcast the same. However, they have different pseudos in their "dest" (operands[0])
The first one is 151, the second one is 152. And the result in these 2 pseudos are the same.
The first pred_add use 151, the second pred_add use 152.
CSE should do the job to eliminate the second pred_broadcast, but I am not sure whether CSE can propagate the 151 pseudo
to the second pred_add ??
juzhe.zhong@rivai.ai
From: Richard Biener
Date: 2023-02-01 20:51
To: juzhe.zhong@rivai.ai
CC: gcc-patches; kito.cheng; richard.sandiford; jeffreyalaw; apinski
Subject: Re: Re: [PATCH] CPROP: Allow cprop optimization when the function has a single block
On Wed, 1 Feb 2023, juzhe.zhong@rivai.ai wrote:
> I don't known whether CSE do the job. What I saw is CPROP do the optimization when we have more than 1 block.
>
> This the RTL dump before CPROP:
>
> (insn 19 18 20 4 (set (reg:VNx1DI 151)
> (if_then_else:VNx1DI (unspec:VNx1BI [
> (const_vector:VNx1BI repeat [
> (const_int 1 [0x1])
> ])
> (const_int 4 [0x4])
> (const_int 2 [0x2]) repeated x2
> (const_int 0 [0])
> (reg:SI 66 vl)
> (reg:SI 67 vtype)
> ] UNSPEC_VPREDICATE)
> (vec_duplicate:VNx1DI (reg/v:DI 148 [ x ]))
> (unspec:VNx1DI [
> (const_int 0 [0])
> ] UNSPEC_VUNDEF))) "rvv.c":22:23 695 {pred_broadcastvnx1di}
> (nil))
> (insn 20 19 21 4 (set (reg/v:VNx1DI 139 [ v3 ])
> (if_then_else:VNx1DI (unspec:VNx1BI [
> (const_vector:VNx1BI repeat [
> (const_int 1 [0x1])
> ])
> (const_int 4 [0x4])
> (const_int 2 [0x2]) repeated x2
> (const_int 0 [0])
> (reg:SI 66 vl)
> (reg:SI 67 vtype)
> ] UNSPEC_VPREDICATE)
> (plus:VNx1DI (reg/v:VNx1DI 138 [ v2 ])
> (reg:VNx1DI 151))
> (unspec:VNx1DI [
> (const_int 0 [0])
> ] UNSPEC_VUNDEF))) "rvv.c":22:23 1528 {pred_addvnx1di}
> (expr_list:REG_DEAD (reg:VNx1DI 151)
> (nil)))
> (insn 21 20 22 4 (set (reg:VNx1DI 152)
> (if_then_else:VNx1DI (unspec:VNx1BI [
> (const_vector:VNx1BI repeat [
> (const_int 1 [0x1])
> ])
> (const_int 4 [0x4])
> (const_int 2 [0x2]) repeated x2
> (const_int 0 [0])
> (reg:SI 66 vl)
> (reg:SI 67 vtype)
> ] UNSPEC_VPREDICATE)
> (vec_duplicate:VNx1DI (reg/v:DI 148 [ x ]))
> (unspec:VNx1DI [
> (const_int 0 [0])
> ] UNSPEC_VUNDEF))) "rvv.c":23:23 695 {pred_broadcastvnx1di}
> (nil))
> (insn 22 21 23 4 (set (reg/v:VNx1DI 140 [ v4 ])
> (if_then_else:VNx1DI (unspec:VNx1BI [
> (const_vector:VNx1BI repeat [
> (const_int 1 [0x1])
> ])
> (const_int 4 [0x4])
> (const_int 0 [0])
> (const_int 2 [0x2])
> (const_int 0 [0])
> (reg:SI 66 vl)
> (reg:SI 67 vtype)
> ] UNSPEC_VPREDICATE)
> (plus:VNx1DI (reg/v:VNx1DI 138 [ v2 ])
> (reg:VNx1DI 152))
> (reg/v:VNx1DI 139 [ v3 ]))) "rvv.c":23:23 1528 {pred_addvnx1di}
> (expr_list:REG_DEAD (reg:VNx1DI 152)
> (expr_list:REG_DEAD (reg/v:VNx1DI 139 [ v3 ])
> (expr_list:REG_DEAD (reg/v:VNx1DI 138 [ v2 ])
> (nil)))))
>
> After CRPOP:
> (insn 15 14 16 4 (set (reg:VNx1DI 147)
> (if_then_else:VNx1DI (unspec:VNx1BI [
> (const_vector:VNx1BI repeat [
> (const_int 1 [0x1])
> ])
> (const_int 4 [0x4])
> (const_int 2 [0x2]) repeated x2
> (const_int 0 [0])
> (reg:SI 66 vl)
> (reg:SI 67 vtype)
> ] UNSPEC_VPREDICATE)
> (vec_duplicate:VNx1DI (reg/v:DI 143 [ x ]))
> (unspec:VNx1DI [
> (const_int 0 [0])
> ] UNSPEC_VUNDEF))) "rvv.c":11:23 695 {pred_broadcastvnx1di}
> (nil))
> (insn 16 15 18 4 (set (reg/v:VNx1DI 139 [ v3 ])
> (if_then_else:VNx1DI (unspec:VNx1BI [
> (const_vector:VNx1BI repeat [
> (const_int 1 [0x1])
> ])
> (const_int 4 [0x4])
> (const_int 2 [0x2]) repeated x2
> (const_int 0 [0])
> (reg:SI 66 vl)
> (reg:SI 67 vtype)
> ] UNSPEC_VPREDICATE)
> (plus:VNx1DI (reg/v:VNx1DI 138 [ v2 ])
> (reg:VNx1DI 147))
> (unspec:VNx1DI [
> (const_int 0 [0])
> ] UNSPEC_VUNDEF))) "rvv.c":11:23 1528 {pred_addvnx1di}
> (expr_list:REG_DEAD (reg:VNx1DI 147)
> (expr_list:REG_DEAD (reg/v:VNx1DI 138 [ v2 ])
> (nil))))
> (insn 18 16 19 4 (set (reg/v:VNx1DI 140 [ v4 ])
> (if_then_else:VNx1DI (unspec:VNx1BI [
> (const_vector:VNx1BI repeat [
> (const_int 1 [0x1])
> ])
> (const_int 4 [0x4])
> (const_int 2 [0x2]) repeated x2
> (const_int 0 [0])
> (reg:SI 66 vl)
> (reg:SI 67 vtype)
> ] UNSPEC_VPREDICATE)
> (plus:VNx1DI (reg/v:VNx1DI 139 [ v3 ])
> (reg:VNx1DI 147))
> (unspec:VNx1DI [
> (const_int 0 [0])
> ] UNSPEC_VUNDEF))) "rvv.c":12:23 1528 {pred_addvnx1di}
> (expr_list:REG_DEAD (reg:VNx1DI 148)
> (expr_list:REG_DEAD (reg/v:VNx1DI 139 [ v3 ])
> (nil))))
>
> You can see CPROP remove the second the "pred_broadcast" instruction and propagate the result to the second "pred_add" instruction?
I see loads of UNSPECs, that might explain why some passes do something
and some not. That said, not sure what exactly CPROP does, the two
pred_broadcast insns look exactly the same so CSE should CSE them?
>
>
> juzhe.zhong@rivai.ai
>
> From: Richard Biener
> Date: 2023-02-01 20:40
> To: Ju-Zhe Zhong
> CC: gcc-patches; kito.cheng; richard.sandiford; jeffreyalaw; apinski
> Subject: Re: [PATCH] CPROP: Allow cprop optimization when the function has a single block
> On Wed, 1 Feb 2023, juzhe.zhong@rivai.ai wrote:
>
> > From: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>
> >
> > Hi, this patch is present for GCC 14 since I understand it's not appropriate
> > to land it in GCC 13.
> >
> > NUM_FIXED_BLOCKS = 2 since GCC define each function has aleast 2 blocks
> > one is entry block, the other is exit block.
> > So according this code, the function will not do cprop optimization when
> > there is only exactly one single block.
>
> cprop / GCSE are global dataflow problems, there's "nothing" to do for
> a single block, the local problem plus its application isn't considered
> but probably left for CSE.
>
> Why does CSE not perform the desired transform?
>
> > I am not sure whether it's correct to fix it like this.
> > Can anyone tell me why forbid cprop optimization if the function only has s
> > single block ?
> >
> > Let's take a look at these 2 examples of RVV intrinsics:
> > 1. void f1 (void * in, void *out, int64_t x, int n)
> > {
> > vint64m1_t v = __riscv_vle64_v_i64m1 (in + 1, 4);
> > vint64m1_t v2 = __riscv_vle64_v_i64m1_tu (v, in + 2, 4);
> > vint64m1_t v3 = __riscv_vadd_vx_i64m1 (v2, x, 4);
> > vint64m1_t v4 = __riscv_vadd_vx_i64m1 (v3, x, 4);
> > __riscv_vse64_v_i64m1 (out + 2, v4, 4);
> > }
> >
> > asm:
> > addi sp,sp,-16
> > sw a2,8(sp)
> > sw a3,12(sp)
> > sw a2,0(sp)
> > sw a3,4(sp)
> > addi a5,a0,1
> > vsetivli zero,4,e64,m1,ta,ma
> > addi a0,a0,2
> > vle64.v v24,0(a5)
> > addi a5,sp,8
> > vlse64.v v27,0(a5),zero
> > addi a1,a1,2
> > vsetvli zero,zero,e64,m1,tu,ma
> > vle64.v v24,0(a0)
> > vsetvli zero,zero,e64,m1,ta,ma
> > vlse64.v v25,0(sp),zero
> > vadd.vv v26,v24,v27
> > vadd.vv v24,v26,v25
> > vse64.v v24,0(a1)
> > addi sp,sp,16
> > jr ra
> > you can see here there are 2 vlse64.v instructions that broadcast the scalar value "x"
> > GCC fail to eliminate the second vlse64.v instruction since GCC doesn't do the cprop
> > optimization (the function only has 1 single block). It can be optimized if we apply
> > this patch.
> >
> > 2. void f1 (void * in, void *out, int64_t x, int n)
> > {
> > if (n) {
> > vint64m1_t v = __riscv_vle64_v_i64m1 (in + 1, 4);
> > vint64m1_t v2 = __riscv_vle64_v_i64m1_tu (v, in + 2, 4);
> > vint64m1_t v3 = __riscv_vadd_vx_i64m1 (v2, x, 4);
> > vint64m1_t v4 = __riscv_vadd_vx_i64m1 (v3, x, 4);
> > __riscv_vse64_v_i64m1 (out + 2, v4, 4);
> > }
> > }
> >
> > asm:
> > f1:
> > vsetivli zero,4,e64,m1,ta,ma
> > beq a4,zero,.L7
> > addi sp,sp,-16
> > sw a2,8(sp)
> > sw a3,12(sp)
> > addi a5,a0,1
> > vle64.v v24,0(a5)
> > addi a0,a0,2
> > addi a5,sp,8
> > vlse64.v v25,0(a5),zero
> > addi a1,a1,2
> > vsetvli zero,zero,e64,m1,tu,ma
> > vle64.v v24,0(a0)
> > vadd.vv v26,v24,v25
> > vadd.vv v24,v26,v25
> > vse64.v v24,0(a1)
> > addi sp,sp,16
> > jr ra
> > .L7:
> > ret
> >
> > Here, if we add if (n) condition here, the program will end up with more than 1 block.
> > So GCC will do the cprop optimization and the second vlse64.v instruction is eliminated.
> >
> > I am not sure whether this patch is correct.
> > Can anyone help me with that ?
> > Thanks.
> >
> >
> > gcc/ChangeLog:
> >
> > * cprop.cc (one_cprop_pass): Remove +1.
> >
> > ---
> > gcc/cprop.cc | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/gcc/cprop.cc b/gcc/cprop.cc
> > index 6ec0bda4a24..615bc5078b6 100644
> > --- a/gcc/cprop.cc
> > +++ b/gcc/cprop.cc
> > @@ -1749,7 +1749,7 @@ one_cprop_pass (void)
> > int changed = 0;
> >
> > /* Return if there's nothing to do, or it is too expensive. */
> > - if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS + 1
> > + if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS
> > || gcse_or_cprop_is_too_expensive (_ ("const/copy propagation disabled")))
> > return 0;
> >
> >
>
>
--
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)
Yeah, Thanks. You are right. CSE should do the job.
Now I know the reason CSE failed to optimize is I include VL_REGNUM(66)/VTYPE_RENUM(67) hard reg
as the dependency of pred_broadcast:
(insn 19 18 20 4 (set (reg:VNx1DI 152)
> (if_then_else:VNx1DI (unspec:VNx1BI [
> (const_vector:VNx1BI repeat [
> (const_int 1 [0x1])
> ])
> (const_int 4 [0x4])
> (const_int 2 [0x2]) repeated x2
> (const_int 0 [0])
> (reg:SI 66 vl)
> (reg:SI 67 vtype)
> ] UNSPEC_VPREDICATE)
> (vec_duplicate:VNx1DI (reg/v:DI 148 [ x ]))
> (unspec:VNx1DI [
> (const_int 0 [0])
> ] UNSPEC_VUNDEF))) "rvv.c":22:23 695 {pred_broadcastvnx1di}
> (nil))
Then CSE failed to set the 152 as copy.
VL_REGNUM(66)/VTYPE_RENUM(67) are the global hard reg that I should make each RVV instruction depend on them.
Since we use vsetvl instruction (which is setting global VL_REGNUM(66)/VTYPE_RENUM(67) status) to set the global status for
each RVV instruction.
Including the dependency here is to make sure the global VL/VTYPE status is correct of each RVV instruction. (If we don't include
such dependency in RVV instruction, instruction scheduling may move the RVV instructions and vsetvl instructions randomly then
produce incorrect vsetvl configuration)
The original reg_class of VL_REGNUM(66)/VTYPE_RENUM(67) I set here:
riscv_regno_to_class [VL_REGNUM] = VL_REGS;
riscv_regno_to_class [VTYPE_RENUM] = VTYPE_REGS;
Such configuration make CSE failed.
However, if I change the reg_class :
riscv_regno_to_class [VL_REGNUM] = NO_REGS;
riscv_regno_to_class [VTYPE_RENUM] = NO_REGS;
The CSE now can do the optimization now!
1) Would you mind telling me the difference between them?
2) If I set these 2 global status register as NO_REGS, will it create issues for the global status configuration of each RVV instructions ?
Thanks.
juzhe.zhong@rivai.ai
From: Richard Biener
Date: 2023-02-01 23:46
To: juzhe.zhong
CC: rguenther; gcc-patches; kito.cheng; richard.sandiford; jeffreyalaw; apinski
Subject: Re: [PATCH] CPROP: Allow cprop optimization when the function has a single block
Am 01.02.2023 um 13:57 schrieb juzhe.zhong@rivai.ai:
>> I see loads of UNSPECs, that might explain why some passes do something
>> and some not. That said, not sure what exactly CPROP does, the two
>> pred_broadcast insns look exactly the same so CSE should CSE them?
Yes, the "source" these 2 pred_broadcast the same. However, they have different pseudos in their "dest" (operands[0])
The first one is 151, the second one is 152. And the result in these 2 pseudos are the same.
The first pred_add use 151, the second pred_add use 152.
CSE should do the job to eliminate the second pred_broadcast, but I am not sure whether CSE can propagate the 151 pseudo
to the second pred_add ??
It should do that. At least it should make setting 152 a copy. So you need to figure why it doesn’t do that? Maybe it uses costs and those say recomputing is cheaper than a copy?
juzhe.zhong@rivai.ai
From: Richard Biener
Date: 2023-02-01 20:51
To: juzhe.zhong@rivai.ai
CC: gcc-patches; kito.cheng; richard.sandiford; jeffreyalaw; apinski
Subject: Re: Re: [PATCH] CPROP: Allow cprop optimization when the function has a single block
On Wed, 1 Feb 2023, juzhe.zhong@rivai.ai wrote:
> I don't known whether CSE do the job. What I saw is CPROP do the optimization when we have more than 1 block.
>
> This the RTL dump before CPROP:
>
> (insn 19 18 20 4 (set (reg:VNx1DI 151)
> (if_then_else:VNx1DI (unspec:VNx1BI [
> (const_vector:VNx1BI repeat [
> (const_int 1 [0x1])
> ])
> (const_int 4 [0x4])
> (const_int 2 [0x2]) repeated x2
> (const_int 0 [0])
> (reg:SI 66 vl)
> (reg:SI 67 vtype)
> ] UNSPEC_VPREDICATE)
> (vec_duplicate:VNx1DI (reg/v:DI 148 [ x ]))
> (unspec:VNx1DI [
> (const_int 0 [0])
> ] UNSPEC_VUNDEF))) "rvv.c":22:23 695 {pred_broadcastvnx1di}
> (nil))
> (insn 20 19 21 4 (set (reg/v:VNx1DI 139 [ v3 ])
> (if_then_else:VNx1DI (unspec:VNx1BI [
> (const_vector:VNx1BI repeat [
> (const_int 1 [0x1])
> ])
> (const_int 4 [0x4])
> (const_int 2 [0x2]) repeated x2
> (const_int 0 [0])
> (reg:SI 66 vl)
> (reg:SI 67 vtype)
> ] UNSPEC_VPREDICATE)
> (plus:VNx1DI (reg/v:VNx1DI 138 [ v2 ])
> (reg:VNx1DI 151))
> (unspec:VNx1DI [
> (const_int 0 [0])
> ] UNSPEC_VUNDEF))) "rvv.c":22:23 1528 {pred_addvnx1di}
> (expr_list:REG_DEAD (reg:VNx1DI 151)
> (nil)))
> (insn 21 20 22 4 (set (reg:VNx1DI 152)
> (if_then_else:VNx1DI (unspec:VNx1BI [
> (const_vector:VNx1BI repeat [
> (const_int 1 [0x1])
> ])
> (const_int 4 [0x4])
> (const_int 2 [0x2]) repeated x2
> (const_int 0 [0])
> (reg:SI 66 vl)
> (reg:SI 67 vtype)
> ] UNSPEC_VPREDICATE)
> (vec_duplicate:VNx1DI (reg/v:DI 148 [ x ]))
> (unspec:VNx1DI [
> (const_int 0 [0])
> ] UNSPEC_VUNDEF))) "rvv.c":23:23 695 {pred_broadcastvnx1di}
> (nil))
> (insn 22 21 23 4 (set (reg/v:VNx1DI 140 [ v4 ])
> (if_then_else:VNx1DI (unspec:VNx1BI [
> (const_vector:VNx1BI repeat [
> (const_int 1 [0x1])
> ])
> (const_int 4 [0x4])
> (const_int 0 [0])
> (const_int 2 [0x2])
> (const_int 0 [0])
> (reg:SI 66 vl)
> (reg:SI 67 vtype)
> ] UNSPEC_VPREDICATE)
> (plus:VNx1DI (reg/v:VNx1DI 138 [ v2 ])
> (reg:VNx1DI 152))
> (reg/v:VNx1DI 139 [ v3 ]))) "rvv.c":23:23 1528 {pred_addvnx1di}
> (expr_list:REG_DEAD (reg:VNx1DI 152)
> (expr_list:REG_DEAD (reg/v:VNx1DI 139 [ v3 ])
> (expr_list:REG_DEAD (reg/v:VNx1DI 138 [ v2 ])
> (nil)))))
>
> After CRPOP:
> (insn 15 14 16 4 (set (reg:VNx1DI 147)
> (if_then_else:VNx1DI (unspec:VNx1BI [
> (const_vector:VNx1BI repeat [
> (const_int 1 [0x1])
> ])
> (const_int 4 [0x4])
> (const_int 2 [0x2]) repeated x2
> (const_int 0 [0])
> (reg:SI 66 vl)
> (reg:SI 67 vtype)
> ] UNSPEC_VPREDICATE)
> (vec_duplicate:VNx1DI (reg/v:DI 143 [ x ]))
> (unspec:VNx1DI [
> (const_int 0 [0])
> ] UNSPEC_VUNDEF))) "rvv.c":11:23 695 {pred_broadcastvnx1di}
> (nil))
> (insn 16 15 18 4 (set (reg/v:VNx1DI 139 [ v3 ])
> (if_then_else:VNx1DI (unspec:VNx1BI [
> (const_vector:VNx1BI repeat [
> (const_int 1 [0x1])
> ])
> (const_int 4 [0x4])
> (const_int 2 [0x2]) repeated x2
> (const_int 0 [0])
> (reg:SI 66 vl)
> (reg:SI 67 vtype)
> ] UNSPEC_VPREDICATE)
> (plus:VNx1DI (reg/v:VNx1DI 138 [ v2 ])
> (reg:VNx1DI 147))
> (unspec:VNx1DI [
> (const_int 0 [0])
> ] UNSPEC_VUNDEF))) "rvv.c":11:23 1528 {pred_addvnx1di}
> (expr_list:REG_DEAD (reg:VNx1DI 147)
> (expr_list:REG_DEAD (reg/v:VNx1DI 138 [ v2 ])
> (nil))))
> (insn 18 16 19 4 (set (reg/v:VNx1DI 140 [ v4 ])
> (if_then_else:VNx1DI (unspec:VNx1BI [
> (const_vector:VNx1BI repeat [
> (const_int 1 [0x1])
> ])
> (const_int 4 [0x4])
> (const_int 2 [0x2]) repeated x2
> (const_int 0 [0])
> (reg:SI 66 vl)
> (reg:SI 67 vtype)
> ] UNSPEC_VPREDICATE)
> (plus:VNx1DI (reg/v:VNx1DI 139 [ v3 ])
> (reg:VNx1DI 147))
> (unspec:VNx1DI [
> (const_int 0 [0])
> ] UNSPEC_VUNDEF))) "rvv.c":12:23 1528 {pred_addvnx1di}
> (expr_list:REG_DEAD (reg:VNx1DI 148)
> (expr_list:REG_DEAD (reg/v:VNx1DI 139 [ v3 ])
> (nil))))
>
> You can see CPROP remove the second the "pred_broadcast" instruction and propagate the result to the second "pred_add" instruction?
I see loads of UNSPECs, that might explain why some passes do something
and some not. That said, not sure what exactly CPROP does, the two
pred_broadcast insns look exactly the same so CSE should CSE them?
>
>
> juzhe.zhong@rivai.ai
>
> From: Richard Biener
> Date: 2023-02-01 20:40
> To: Ju-Zhe Zhong
> CC: gcc-patches; kito.cheng; richard.sandiford; jeffreyalaw; apinski
> Subject: Re: [PATCH] CPROP: Allow cprop optimization when the function has a single block
> On Wed, 1 Feb 2023, juzhe.zhong@rivai.ai wrote:
>
> > From: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>
> >
> > Hi, this patch is present for GCC 14 since I understand it's not appropriate
> > to land it in GCC 13.
> >
> > NUM_FIXED_BLOCKS = 2 since GCC define each function has aleast 2 blocks
> > one is entry block, the other is exit block.
> > So according this code, the function will not do cprop optimization when
> > there is only exactly one single block.
>
> cprop / GCSE are global dataflow problems, there's "nothing" to do for
> a single block, the local problem plus its application isn't considered
> but probably left for CSE.
>
> Why does CSE not perform the desired transform?
>
> > I am not sure whether it's correct to fix it like this.
> > Can anyone tell me why forbid cprop optimization if the function only has s
> > single block ?
> >
> > Let's take a look at these 2 examples of RVV intrinsics:
> > 1. void f1 (void * in, void *out, int64_t x, int n)
> > {
> > vint64m1_t v = __riscv_vle64_v_i64m1 (in + 1, 4);
> > vint64m1_t v2 = __riscv_vle64_v_i64m1_tu (v, in + 2, 4);
> > vint64m1_t v3 = __riscv_vadd_vx_i64m1 (v2, x, 4);
> > vint64m1_t v4 = __riscv_vadd_vx_i64m1 (v3, x, 4);
> > __riscv_vse64_v_i64m1 (out + 2, v4, 4);
> > }
> >
> > asm:
> > addi sp,sp,-16
> > sw a2,8(sp)
> > sw a3,12(sp)
> > sw a2,0(sp)
> > sw a3,4(sp)
> > addi a5,a0,1
> > vsetivli zero,4,e64,m1,ta,ma
> > addi a0,a0,2
> > vle64.v v24,0(a5)
> > addi a5,sp,8
> > vlse64.v v27,0(a5),zero
> > addi a1,a1,2
> > vsetvli zero,zero,e64,m1,tu,ma
> > vle64.v v24,0(a0)
> > vsetvli zero,zero,e64,m1,ta,ma
> > vlse64.v v25,0(sp),zero
> > vadd.vv v26,v24,v27
> > vadd.vv v24,v26,v25
> > vse64.v v24,0(a1)
> > addi sp,sp,16
> > jr ra
> > you can see here there are 2 vlse64.v instructions that broadcast the scalar value "x"
> > GCC fail to eliminate the second vlse64.v instruction since GCC doesn't do the cprop
> > optimization (the function only has 1 single block). It can be optimized if we apply
> > this patch.
> >
> > 2. void f1 (void * in, void *out, int64_t x, int n)
> > {
> > if (n) {
> > vint64m1_t v = __riscv_vle64_v_i64m1 (in + 1, 4);
> > vint64m1_t v2 = __riscv_vle64_v_i64m1_tu (v, in + 2, 4);
> > vint64m1_t v3 = __riscv_vadd_vx_i64m1 (v2, x, 4);
> > vint64m1_t v4 = __riscv_vadd_vx_i64m1 (v3, x, 4);
> > __riscv_vse64_v_i64m1 (out + 2, v4, 4);
> > }
> > }
> >
> > asm:
> > f1:
> > vsetivli zero,4,e64,m1,ta,ma
> > beq a4,zero,.L7
> > addi sp,sp,-16
> > sw a2,8(sp)
> > sw a3,12(sp)
> > addi a5,a0,1
> > vle64.v v24,0(a5)
> > addi a0,a0,2
> > addi a5,sp,8
> > vlse64.v v25,0(a5),zero
> > addi a1,a1,2
> > vsetvli zero,zero,e64,m1,tu,ma
> > vle64.v v24,0(a0)
> > vadd.vv v26,v24,v25
> > vadd.vv v24,v26,v25
> > vse64.v v24,0(a1)
> > addi sp,sp,16
> > jr ra
> > .L7:
> > ret
> >
> > Here, if we add if (n) condition here, the program will end up with more than 1 block.
> > So GCC will do the cprop optimization and the second vlse64.v instruction is eliminated.
> >
> > I am not sure whether this patch is correct.
> > Can anyone help me with that ?
> > Thanks.
> >
> >
> > gcc/ChangeLog:
> >
> > * cprop.cc (one_cprop_pass): Remove +1.
> >
> > ---
> > gcc/cprop.cc | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/gcc/cprop.cc b/gcc/cprop.cc
> > index 6ec0bda4a24..615bc5078b6 100644
> > --- a/gcc/cprop.cc
> > +++ b/gcc/cprop.cc
> > @@ -1749,7 +1749,7 @@ one_cprop_pass (void)
> > int changed = 0;
> >
> > /* Return if there's nothing to do, or it is too expensive. */
> > - if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS + 1
> > + if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS
> > || gcse_or_cprop_is_too_expensive (_ ("const/copy propagation disabled")))
> > return 0;
> >
> >
>
>
--
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)
On Thu, 2 Feb 2023, juzhe.zhong@rivai.ai wrote:
> Yeah, Thanks. You are right. CSE should do the job.
> Now I know the reason CSE failed to optimize is I include VL_REGNUM(66)/VTYPE_RENUM(67) hard reg
> as the dependency of pred_broadcast:
> (insn 19 18 20 4 (set (reg:VNx1DI 152)
> > (if_then_else:VNx1DI (unspec:VNx1BI [
> > (const_vector:VNx1BI repeat [
> > (const_int 1 [0x1])
> > ])
> > (const_int 4 [0x4])
> > (const_int 2 [0x2]) repeated x2
> > (const_int 0 [0])
> > (reg:SI 66 vl)
> > (reg:SI 67 vtype)
> > ] UNSPEC_VPREDICATE)
> > (vec_duplicate:VNx1DI (reg/v:DI 148 [ x ]))
> > (unspec:VNx1DI [
> > (const_int 0 [0])
> > ] UNSPEC_VUNDEF))) "rvv.c":22:23 695 {pred_broadcastvnx1di}
> > (nil))
> Then CSE failed to set the 152 as copy.
>
> VL_REGNUM(66)/VTYPE_RENUM(67) are the global hard reg that I should make each RVV instruction depend on them.
> Since we use vsetvl instruction (which is setting global VL_REGNUM(66)/VTYPE_RENUM(67) status) to set the global status for
> each RVV instruction.
> Including the dependency here is to make sure the global VL/VTYPE status is correct of each RVV instruction. (If we don't include
> such dependency in RVV instruction, instruction scheduling may move the RVV instructions and vsetvl instructions randomly then
> produce incorrect vsetvl configuration)
>
> The original reg_class of VL_REGNUM(66)/VTYPE_RENUM(67) I set here:
> riscv_regno_to_class [VL_REGNUM] = VL_REGS;
> riscv_regno_to_class [VTYPE_RENUM] = VTYPE_REGS;
> Such configuration make CSE failed.
>
> However, if I change the reg_class :
> riscv_regno_to_class [VL_REGNUM] = NO_REGS;
> riscv_regno_to_class [VTYPE_RENUM] = NO_REGS;
> The CSE now can do the optimization now!
>
> 1) Would you mind telling me the difference between them?
No idea. I think CSE avoids to touch hard register references because
eliding them to copies can increase register pressure.
> 2) If I set these 2 global status register as NO_REGS, will it create
> issues for the global status configuration of each RVV instructions ?
No idea either. Usually these kind of dependences are introduced
by targets at the point the VL setting is introduced to avoid
pessimizing optimizations earlier. Often, for cases like a VL
register, this is done after register allocation only and indeed
necessary to avoid the second scheduling pass from breaking things.
Richard.
Thank you so much. Kito helped me fix it already.
RVV instruction patterns can have CSE optimizations now.
juzhe.zhong@rivai.ai
From: Richard Biener
Date: 2023-02-02 20:26
To: juzhe.zhong@rivai.ai
CC: gcc-patches; kito.cheng; richard.sandiford; jeffreyalaw; apinski
Subject: Re: Re: [PATCH] CPROP: Allow cprop optimization when the function has a single block
On Thu, 2 Feb 2023, juzhe.zhong@rivai.ai wrote:
> Yeah, Thanks. You are right. CSE should do the job.
> Now I know the reason CSE failed to optimize is I include VL_REGNUM(66)/VTYPE_RENUM(67) hard reg
> as the dependency of pred_broadcast:
> (insn 19 18 20 4 (set (reg:VNx1DI 152)
> > (if_then_else:VNx1DI (unspec:VNx1BI [
> > (const_vector:VNx1BI repeat [
> > (const_int 1 [0x1])
> > ])
> > (const_int 4 [0x4])
> > (const_int 2 [0x2]) repeated x2
> > (const_int 0 [0])
> > (reg:SI 66 vl)
> > (reg:SI 67 vtype)
> > ] UNSPEC_VPREDICATE)
> > (vec_duplicate:VNx1DI (reg/v:DI 148 [ x ]))
> > (unspec:VNx1DI [
> > (const_int 0 [0])
> > ] UNSPEC_VUNDEF))) "rvv.c":22:23 695 {pred_broadcastvnx1di}
> > (nil))
> Then CSE failed to set the 152 as copy.
>
> VL_REGNUM(66)/VTYPE_RENUM(67) are the global hard reg that I should make each RVV instruction depend on them.
> Since we use vsetvl instruction (which is setting global VL_REGNUM(66)/VTYPE_RENUM(67) status) to set the global status for
> each RVV instruction.
> Including the dependency here is to make sure the global VL/VTYPE status is correct of each RVV instruction. (If we don't include
> such dependency in RVV instruction, instruction scheduling may move the RVV instructions and vsetvl instructions randomly then
> produce incorrect vsetvl configuration)
>
> The original reg_class of VL_REGNUM(66)/VTYPE_RENUM(67) I set here:
> riscv_regno_to_class [VL_REGNUM] = VL_REGS;
> riscv_regno_to_class [VTYPE_RENUM] = VTYPE_REGS;
> Such configuration make CSE failed.
>
> However, if I change the reg_class :
> riscv_regno_to_class [VL_REGNUM] = NO_REGS;
> riscv_regno_to_class [VTYPE_RENUM] = NO_REGS;
> The CSE now can do the optimization now!
>
> 1) Would you mind telling me the difference between them?
No idea. I think CSE avoids to touch hard register references because
eliding them to copies can increase register pressure.
> 2) If I set these 2 global status register as NO_REGS, will it create
> issues for the global status configuration of each RVV instructions ?
No idea either. Usually these kind of dependences are introduced
by targets at the point the VL setting is introduced to avoid
pessimizing optimizations earlier. Often, for cases like a VL
register, this is done after register allocation only and indeed
necessary to avoid the second scheduling pass from breaking things.
Richard.
On 2/2/23 05:26, Richard Biener wrote:
> On Thu, 2 Feb 2023, juzhe.zhong@rivai.ai wrote:
>
>> Yeah, Thanks. You are right. CSE should do the job.
>> Now I know the reason CSE failed to optimize is I include VL_REGNUM(66)/VTYPE_RENUM(67) hard reg
>> as the dependency of pred_broadcast:
>> (insn 19 18 20 4 (set (reg:VNx1DI 152)
>>> (if_then_else:VNx1DI (unspec:VNx1BI [
>>> (const_vector:VNx1BI repeat [
>>> (const_int 1 [0x1])
>>> ])
>>> (const_int 4 [0x4])
>>> (const_int 2 [0x2]) repeated x2
>>> (const_int 0 [0])
>>> (reg:SI 66 vl)
>>> (reg:SI 67 vtype)
>>> ] UNSPEC_VPREDICATE)
>>> (vec_duplicate:VNx1DI (reg/v:DI 148 [ x ]))
>>> (unspec:VNx1DI [
>>> (const_int 0 [0])
>>> ] UNSPEC_VUNDEF))) "rvv.c":22:23 695 {pred_broadcastvnx1di}
>>> (nil))
>> Then CSE failed to set the 152 as copy.
>>
>> VL_REGNUM(66)/VTYPE_RENUM(67) are the global hard reg that I should make each RVV instruction depend on them.
>> Since we use vsetvl instruction (which is setting global VL_REGNUM(66)/VTYPE_RENUM(67) status) to set the global status for
>> each RVV instruction.
>> Including the dependency here is to make sure the global VL/VTYPE status is correct of each RVV instruction. (If we don't include
>> such dependency in RVV instruction, instruction scheduling may move the RVV instructions and vsetvl instructions randomly then
>> produce incorrect vsetvl configuration)
>>
>> The original reg_class of VL_REGNUM(66)/VTYPE_RENUM(67) I set here:
>> riscv_regno_to_class [VL_REGNUM] = VL_REGS;
>> riscv_regno_to_class [VTYPE_RENUM] = VTYPE_REGS;
>> Such configuration make CSE failed.
>>
>> However, if I change the reg_class :
>> riscv_regno_to_class [VL_REGNUM] = NO_REGS;
>> riscv_regno_to_class [VTYPE_RENUM] = NO_REGS;
>> The CSE now can do the optimization now!
>>
>> 1) Would you mind telling me the difference between them?
>
> No idea. I think CSE avoids to touch hard register references because
> eliding them to copies can increase register pressure.IIRC the costing is set up differently and for a given partition a
pseudo will be preferred over a hard reg. This is in addition to other
places that test the small register class hooks.
>
>> 2) If I set these 2 global status register as NO_REGS, will it create
>> issues for the global status configuration of each RVV instructions ?
>
> No idea either. Usually these kind of dependences are introduced
> by targets at the point the VL setting is introduced to avoid
> pessimizing optimizations earlier. Often, for cases like a VL
> register, this is done after register allocation only and indeed
> necessary to avoid the second scheduling pass from breaking things.
Yea. I'm wondering about when the right place to introduce these
dependencies might be. I'm still a few months out from worrying about
RVV, but it's not too far away.
jeff
On 2/2/23 05:35, juzhe.zhong@rivai.ai wrote:
> Thank you so much. Kito helped me fix it already.
> RVV instruction patterns can have CSE optimizations now.
What was the issue?
jeff
We set VL/VTYPE these 2 implicit global status denpency register as fixed reg.
Then CSE can do the optimization now.
>> Yea. I'm wondering about when the right place to introduce these
>>dependencies might be. I'm still a few months out from worrying about
>>RVV, but it's not too far away.
You don't need to worry about RVV. I can promise you that RVV support in GCC will be solid and
optimal. You can just try. For example, try VSETVL PASS, this PASS implemented in GCC is much better
than LLVM. I have include so many fancy optimizations there.
juzhe.zhong@rivai.ai
From: Jeff Law
Date: 2023-02-02 22:36
To: juzhe.zhong@rivai.ai; rguenther
CC: gcc-patches; kito.cheng; richard.sandiford; apinski
Subject: Re: [PATCH] CPROP: Allow cprop optimization when the function has a single block
On 2/2/23 05:35, juzhe.zhong@rivai.ai wrote:
> Thank you so much. Kito helped me fix it already.
> RVV instruction patterns can have CSE optimizations now.
What was the issue?
jeff
> > Thank you so much. Kito helped me fix it already.
> > RVV instruction patterns can have CSE optimizations now.
> What was the issue?
VL and VTYPE isn't listed in fixed register so CSE feel that isn't
cheap (See CHEAP_REGNO in cse.cc),
but actually it's kind of mistake sett for VL and VTYPE register to
non fixed register,
it all managed by vsetvl insertion pass, and won't involved into the
register allocation
process, so it should be set 1 in FIXED_REGISTERS,
then CSE pass is happy to cse that after we fix that :)
More story behind that is we were trying to rely on RA to manage VL
and VTYPE before,
and then...we gave up and decided to manage that by ourselves.
@@ -1749,7 +1749,7 @@ one_cprop_pass (void)
int changed = 0;
/* Return if there's nothing to do, or it is too expensive. */
- if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS + 1
+ if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS
|| gcse_or_cprop_is_too_expensive (_ ("const/copy propagation disabled")))
return 0;