[v2] LoongArch: Define LOGICAL_OP_NON_SHORT_CIRCUIT.
Commit Message
Define LOGICAL_OP_NON_SHORT_CIRCUIT as 0 so that, for a short-circuit branch, the
short-circuit operation is used instead of the non-short-circuit operation.
This gives a 1.8% improvement in SPECCPU 2017 fprate on 3A6000.
gcc/ChangeLog:
* config/loongarch/loongarch.h (LOGICAL_OP_NON_SHORT_CIRCUIT): Define.
gcc/testsuite/ChangeLog:
* gcc.target/loongarch/short-circuit.c: New test.
Comments
On Tue, 2023-12-12 at 19:14 +0800, Jiahao Xu wrote:
> Define LOGICAL_OP_NON_SHORT_CIRCUIT as 0, for a short-circuit branch, use the
> short-circuit operation instead of the non-short-circuit operation.
>
> This gives a 1.8% improvement in SPECCPU 2017 fprate on 3A6000.
In r14-15 we removed the LOGICAL_OP_NON_SHORT_CIRCUIT definition because the
default value (1 for all current LoongArch CPUs with branch_cost = 6)
may reduce the number of conditional branch instructions.
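For context, when a target does not define the macro, GCC derives a default from BRANCH_COST. The sketch below models that default as a plain C function. The function names and the stand-alone form are illustrative assumptions, and the `>= 2` threshold reflects the generic fallback in the middle end as commonly documented; it is worth checking fold-const.cc before relying on it.

```c
#include <assert.h>

/* Hypothetical stand-alone model of the generic fallback: the
   non-short-circuit form is preferred once a branch costs at least 2.  */
static int
default_logical_op_non_short_circuit (int branch_cost)
{
  return branch_cost >= 2;
}

/* Self-check: branch_cost = 6 (the value quoted above) selects 1,
   while a branch cost of 1 would select the short-circuit form.  */
static void
check_default (void)
{
  assert (default_logical_op_non_short_circuit (6) == 1);
  assert (default_logical_op_non_short_circuit (1) == 0);
}
```

Calling `check_default ()` from a test driver confirms that, under this model, a branch cost of 6 yields the default value of 1 mentioned above.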
I guess the problem here is that the floating-point compare instruction is
much more costly than other instructions, but this fact is not correctly
modeled yet. Could you try
https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640012.html
where I've raised the fp_add cost (which is used for estimating the
floating-point compare cost) to 5 instructions, and see if it solves your
problem without LOGICAL_OP_NON_SHORT_CIRCUIT?
If not I guess you can try increasing the floating-point comparison cost
more in loongarch_rtx_costs:
    case UNLT:
      /* Branch comparisons have VOIDmode, so use the first operand's
         mode instead.  */
      mode = GET_MODE (XEXP (x, 0));
      if (FLOAT_MODE_P (mode))
        {
          *total = loongarch_cost->fp_add;

Try to make it fp_add + something?

          return false;
        }
      *total = loongarch_binary_cost (x, COSTS_N_INSNS (1), COSTS_N_INSNS (4),
                                      speed);
      return true;
If adjusting the cost model does not work I'd say this is a middle-end
issue and we should submit a bug report.
> gcc/ChangeLog:
>
> * config/loongarch/loongarch.h (LOGICAL_OP_NON_SHORT_CIRCUIT): Define.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/loongarch/short-circuit.c: New test.
>
> diff --git a/gcc/config/loongarch/loongarch.h b/gcc/config/loongarch/loongarch.h
> index f1350b6048f..880c576c35b 100644
> --- a/gcc/config/loongarch/loongarch.h
> +++ b/gcc/config/loongarch/loongarch.h
> @@ -869,6 +869,7 @@ typedef struct {
> 1 is the default; other values are interpreted relative to that. */
>
> #define BRANCH_COST(speed_p, predictable_p) loongarch_branch_cost
> +#define LOGICAL_OP_NON_SHORT_CIRCUIT 0
>
> /* Return the asm template for a conditional branch instruction.
> OPCODE is the opcode's mnemonic and OPERANDS is the asm template for
> diff --git a/gcc/testsuite/gcc.target/loongarch/short-circuit.c b/gcc/testsuite/gcc.target/loongarch/short-circuit.c
> new file mode 100644
> index 00000000000..bed585ee172
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/loongarch/short-circuit.c
> @@ -0,0 +1,19 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -ffast-math -fdump-tree-gimple" } */
> +
> +int
> +short_circuit (float *a)
> +{
> + float t1x = a[0];
> + float t2x = a[1];
> + float t1y = a[2];
> + float t2y = a[3];
> + float t1z = a[4];
> + float t2z = a[5];
> +
> + if (t1x > t2y || t2x < t1y || t1x > t2z || t2x < t1z || t1y > t2z || t2y < t1z)
> + return 0;
> +
> + return 1;
> +}
> +/* { dg-final { scan-tree-dump-times "if" 6 "gimple" } } */
On 2023/12/12 at 7:26 PM, Xi Ruoyao wrote:
> On Tue, 2023-12-12 at 19:14 +0800, Jiahao Xu wrote:
>> Define LOGICAL_OP_NON_SHORT_CIRCUIT as 0, for a short-circuit branch, use the
>> short-circuit operation instead of the non-short-circuit operation.
>>
>> This gives a 1.8% improvement in SPECCPU 2017 fprate on 3A6000.
> In r14-15 we removed LOGICAL_OP_NON_SHORT_CIRCUIT definition because the
> default value (1 for all current LoongArch CPUs with branch_cost = 6)
> may reduce the number of conditional branch instructions.
>
> I guess here the problem is floating-point compare instruction is much
> more costly than other instructions but the fact is not correctly
> modeled yet. Could you try
> https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640012.html
> where I've raised fp_add cost (which is used for estimating floating-
> point compare cost) to 5 instructions and see if it solves your problem
> without LOGICAL_OP_NON_SHORT_CIRCUIT?
I think this is not the same issue as the cost of floating-point
comparison instructions. The definition of LOGICAL_OP_NON_SHORT_CIRCUIT
affects how the short-circuit branch, such as (A AND-IF B), is executed,
and it is not directly related to the cost of floating-point comparison
instructions. I will try to test it using SPECCPU 2017.
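To make the distinction concrete, here is a minimal C sketch of the two evaluation strategies for an OR-IF condition (the OR counterpart of the AND-IF case mentioned above). The function names are hypothetical, and the two bodies only approximate what the middle end produces for each macro value; they are not GCC's actual output.

```c
#include <assert.h>

/* Short-circuit form (roughly LOGICAL_OP_NON_SHORT_CIRCUIT = 0):
   the second comparison runs only if the first one is false,
   so there are two conditional branches.  */
static int
or_short_circuit (float a, float b, float c, float d)
{
  if (a > b)
    return 1;
  if (c < d)
    return 1;
  return 0;
}

/* Non-short-circuit form (roughly LOGICAL_OP_NON_SHORT_CIRCUIT = 1):
   both comparisons are always evaluated and combined with a bitwise
   OR, leaving a single test on the combined value.  */
static int
or_non_short_circuit (float a, float b, float c, float d)
{
  int t1 = a > b;
  int t2 = c < d;
  return (t1 | t2) != 0;
}

/* Self-check: for side-effect-free operands both forms agree.  */
static void
check_forms_agree (void)
{
  static const float v[3] = { -1.0f, 0.0f, 2.5f };
  for (int i = 0; i < 3; i++)
    for (int j = 0; j < 3; j++)
      for (int k = 0; k < 3; k++)
        for (int l = 0; l < 3; l++)
          assert (or_short_circuit (v[i], v[j], v[k], v[l])
                  == or_non_short_circuit (v[i], v[j], v[k], v[l]));
}
```

The two forms return the same value; they differ only in how many comparisons execute and how many branches the predictor has to handle, which is why the choice shows up in benchmark results rather than in correctness.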
> If not I guess you can try increasing the floating-point comparison cost
> more in loongarch_rtx_costs:
>
> case UNLT:
> /* Branch comparisons have VOIDmode, so use the first operand's
> mode instead. */
> mode = GET_MODE (XEXP (x, 0));
> if (FLOAT_MODE_P (mode))
> {
> *total = loongarch_cost->fp_add;
>
>
> Try to make it fp_add + something?
>
> return false;
> }
> *total = loongarch_binary_cost (x, COSTS_N_INSNS (1), COSTS_N_INSNS (4),
> speed);
> return true;
>
>
> If adjusting the cost model does not work I'd say this is a middle-end
> issue and we should submit a bug report.
>
>> gcc/ChangeLog:
>>
>> * config/loongarch/loongarch.h (LOGICAL_OP_NON_SHORT_CIRCUIT): Define.
>>
>> gcc/testsuite/ChangeLog:
>>
>> * gcc.target/loongarch/short-circuit.c: New test.
>>
>> diff --git a/gcc/config/loongarch/loongarch.h b/gcc/config/loongarch/loongarch.h
>> index f1350b6048f..880c576c35b 100644
>> --- a/gcc/config/loongarch/loongarch.h
>> +++ b/gcc/config/loongarch/loongarch.h
>> @@ -869,6 +869,7 @@ typedef struct {
>> 1 is the default; other values are interpreted relative to that. */
>>
>> #define BRANCH_COST(speed_p, predictable_p) loongarch_branch_cost
>> +#define LOGICAL_OP_NON_SHORT_CIRCUIT 0
>>
>> /* Return the asm template for a conditional branch instruction.
>> OPCODE is the opcode's mnemonic and OPERANDS is the asm template for
>> diff --git a/gcc/testsuite/gcc.target/loongarch/short-circuit.c b/gcc/testsuite/gcc.target/loongarch/short-circuit.c
>> new file mode 100644
>> index 00000000000..bed585ee172
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/loongarch/short-circuit.c
>> @@ -0,0 +1,19 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-O2 -ffast-math -fdump-tree-gimple" } */
>> +
>> +int
>> +short_circuit (float *a)
>> +{
>> + float t1x = a[0];
>> + float t2x = a[1];
>> + float t1y = a[2];
>> + float t2y = a[3];
>> + float t1z = a[4];
>> + float t2z = a[5];
>> +
>> + if (t1x > t2y || t2x < t1y || t1x > t2z || t2x < t1z || t1y > t2z || t2y < t1z)
>> + return 0;
>> +
>> + return 1;
>> +}
>> +/* { dg-final { scan-tree-dump-times "if" 6 "gimple" } } */
On Tue, 2023-12-12 at 19:59 +0800, Jiahao Xu wrote:
> > I guess here the problem is floating-point compare instruction is much
> > more costly than other instructions but the fact is not correctly
> > modeled yet. Could you try
> > https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640012.html
> > where I've raised fp_add cost (which is used for estimating floating-
> > point compare cost) to 5 instructions and see if it solves your problem
> > without LOGICAL_OP_NON_SHORT_CIRCUIT?
> I think this is not the same issue as the cost of floating-point
> comparison instructions. The definition of LOGICAL_OP_NON_SHORT_CIRCUIT
> affects how the short-circuit branch, such as (A AND-IF B), is executed,
> and it is not directly related to the cost of floating-point comparison
> instructions. I will try to test it using SPECCPU 2017.
The point is that if the cost of floating-point comparison is very high, the
middle end *should* short-circuit floating-point comparisons even if
LOGICAL_OP_NON_SHORT_CIRCUIT = 1.
I've created https://gcc.gnu.org/PR112985.
Another factor regressing the code is that we have not modeled the movcf2gr
instruction yet, so we are not really eliding the branches as
LOGICAL_OP_NON_SHORT_CIRCUIT = 1 is supposed to do.
On Tue, 2023-12-12 at 20:39 +0800, Xi Ruoyao wrote:
> On Tue, 2023-12-12 at 19:59 +0800, Jiahao Xu wrote:
> > > I guess here the problem is floating-point compare instruction is much
> > > more costly than other instructions but the fact is not correctly
> > > modeled yet. Could you try
> > > https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640012.html
> > > where I've raised fp_add cost (which is used for estimating floating-
> > > point compare cost) to 5 instructions and see if it solves your problem
> > > without LOGICAL_OP_NON_SHORT_CIRCUIT?
> > I think this is not the same issue as the cost of floating-point
> > comparison instructions. The definition of LOGICAL_OP_NON_SHORT_CIRCUIT
> > affects how the short-circuit branch, such as (A AND-IF B), is executed,
> > and it is not directly related to the cost of floating-point comparison
> > instructions. I will try to test it using SPECCPU 2017.
>
> The point is if the cost of floating-point comparison is very high, the
> middle end *should* short cut floating-point comparisons even if
> LOGICAL_OP_NON_SHORT_CIRCUIT = 1.
>
> I've created https://gcc.gnu.org/PR112985.
>
> Another factor regressing the code is we don't have modeled movcf2gr
> instruction yet, so we are not really eliding the branches as
> LOGICAL_OP_NON_SHORT_CIRCUIT = 1 supposes to do.
I came up with this:
diff --git a/gcc/config/loongarch/loongarch.md b/gcc/config/loongarch/loongarch.md
index a5d0dcd65fe..84d828ebd0f 100644
--- a/gcc/config/loongarch/loongarch.md
+++ b/gcc/config/loongarch/loongarch.md
@@ -3169,6 +3169,42 @@ (define_insn "s<code>_<ANYF:mode>_using_FCCmode"
[(set_attr "type" "fcmp")
(set_attr "mode" "FCC")])
+(define_insn "movcf2gr<GPR:mode>"
+ [(set (match_operand:GPR 0 "register_operand" "=r")
+ (if_then_else:GPR (ne (match_operand:FCC 1 "register_operand" "z")
+ (const_int 0))
+ (const_int 1)
+ (const_int 0)))]
+ "TARGET_HARD_FLOAT"
+ "movcf2gr\t%0,%1"
+ [(set_attr "type" "move")
+ (set_attr "mode" "FCC")])
+
+(define_expand "cstore<ANYF:mode>4"
+ [(set (match_operand:SI 0 "register_operand")
+ (match_operator:SI 1 "loongarch_fcmp_operator"
+ [(match_operand:ANYF 2 "register_operand")
+ (match_operand:ANYF 3 "register_operand")]))]
+ ""
+ {
+ rtx fcc = gen_reg_rtx (FCCmode);
+ rtx cmp = gen_rtx_fmt_ee (GET_CODE (operands[1]), FCCmode,
+ operands[2], operands[3]);
+
+ emit_insn (gen_rtx_SET (fcc, cmp));
+ if (TARGET_64BIT)
+ {
+ rtx gpr = gen_reg_rtx (DImode);
+ emit_insn (gen_movcf2grdi (gpr, fcc));
+ emit_insn (gen_rtx_SET (operands[0],
+ lowpart_subreg (SImode, gpr, DImode)));
+ }
+ else
+ emit_insn (gen_movcf2grsi (operands[0], fcc));
+
+ DONE;
+ })
+
;;
;; ....................
diff --git a/gcc/config/loongarch/predicates.md b/gcc/config/loongarch/predicates.md
index 9e9ce58cb53..83fea08315c 100644
--- a/gcc/config/loongarch/predicates.md
+++ b/gcc/config/loongarch/predicates.md
@@ -590,6 +590,10 @@ (define_predicate "order_operator"
(define_predicate "loongarch_cstore_operator"
(match_code "ne,eq,gt,gtu,ge,geu,lt,ltu,le,leu"))
+(define_predicate "loongarch_fcmp_operator"
+ (match_code
+ "unordered,uneq,unlt,unle,eq,lt,le,ordered,ltgt,ne,ge,gt,unge,ungt"))
+
(define_predicate "small_data_pattern"
(and (match_code "set,parallel,unspec,unspec_volatile,prefetch")
(match_test "loongarch_small_data_pattern_p (op)")))
and now this function is compiled to (with LOGICAL_OP_NON_SHORT_CIRCUIT
= 1):
fld.s $f1,$r4,0
fld.s $f0,$r4,4
fld.s $f3,$r4,8
fld.s $f2,$r4,12
fcmp.slt.s $fcc1,$f0,$f3
fcmp.sgt.s $fcc0,$f1,$f2
movcf2gr $r13,$fcc1
movcf2gr $r12,$fcc0
or $r12,$r12,$r13
bnez $r12,.L3
fld.s $f4,$r4,16
fld.s $f5,$r4,20
or $r4,$r0,$r0
fcmp.sgt.s $fcc1,$f1,$f5
fcmp.slt.s $fcc0,$f0,$f4
movcf2gr $r12,$fcc1
movcf2gr $r13,$fcc0
or $r12,$r12,$r13
bnez $r12,.L2
fcmp.sgt.s $fcc1,$f3,$f5
fcmp.slt.s $fcc0,$f2,$f4
movcf2gr $r4,$fcc1
movcf2gr $r12,$fcc0
or $r4,$r4,$r12
xori $r4,$r4,1
slli.w $r4,$r4,0
jr $r1
.align 4
.L3:
or $r4,$r0,$r0
.align 4
.L2:
jr $r1
Per my micro-benchmark this is much faster than
LOGICAL_OP_NON_SHORT_CIRCUIT = 0 for randomly generated inputs (i.e.
when the branches are not predictable).
Note that there is a redundant slli.w instruction in the compiled code
and I couldn't find a way to remove it (my trick in the TARGET_64BIT
branch only works for simple examples). We may be able to handle it via
the ext_dce pass [1] in the future.
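As a reading aid, the assembly above can be modeled in C as follows. This is a hand-written sketch rather than compiler output: `short_circuit_reference` restates the test case from the patch, while `short_circuit_paired` and `check_paired_model` are hypothetical names for a model in which the six comparisons are grouped into three pairs, each pair is combined with a bitwise OR (the fcmp + movcf2gr + or sequence), and only two conditional branches remain.

```c
#include <assert.h>

/* The function from the test case, written with ordinary || operators.  */
static int
short_circuit_reference (const float *a)
{
  float t1x = a[0], t2x = a[1], t1y = a[2];
  float t2y = a[3], t1z = a[4], t2z = a[5];

  if (t1x > t2y || t2x < t1y || t1x > t2z
      || t2x < t1z || t1y > t2z || t2y < t1z)
    return 0;
  return 1;
}

/* Model of the generated code: comparisons are evaluated in pairs and
   OR-ed as integers, so only two conditional branches are needed.  */
static int
short_circuit_paired (const float *a)
{
  float t1x = a[0], t2x = a[1], t1y = a[2], t2y = a[3];

  if ((t1x > t2y) | (t2x < t1y))        /* first bnez, to .L3 */
    return 0;

  float t1z = a[4], t2z = a[5];         /* loaded after the first branch */

  if ((t1x > t2z) | (t2x < t1z))        /* second bnez, to .L2 */
    return 0;

  return ((t1y > t2z) | (t2y < t1z)) ^ 1;   /* or + xori, no branch */
}

/* Self-check over all 3^6 combinations of three sample values.  */
static void
check_paired_model (void)
{
  static const float vals[3] = { -1.0f, 0.0f, 1.0f };
  for (int n = 0; n < 729; n++)
    {
      float a[6];
      int k = n;
      for (int i = 0; i < 6; i++)
        {
          a[i] = vals[k % 3];
          k /= 3;
        }
      assert (short_circuit_reference (a) == short_circuit_paired (a));
    }
}
```

The model makes the trade-off visible: each pair always pays for two comparisons and two movcf2gr moves, in exchange for one fewer branch per pair.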
[1]:https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637320.html
On 2023/12/13 at 2:27 AM, Xi Ruoyao wrote:
>
> fld.s $f1,$r4,0
> fld.s $f0,$r4,4
> fld.s $f3,$r4,8
> fld.s $f2,$r4,12
> fcmp.slt.s $fcc1,$f0,$f3
> fcmp.sgt.s $fcc0,$f1,$f2
> movcf2gr $r13,$fcc1
> movcf2gr $r12,$fcc0
> or $r12,$r12,$r13
> bnez $r12,.L3
> fld.s $f4,$r4,16
> fld.s $f5,$r4,20
> or $r4,$r0,$r0
> fcmp.sgt.s $fcc1,$f1,$f5
> fcmp.slt.s $fcc0,$f0,$f4
> movcf2gr $r12,$fcc1
> movcf2gr $r13,$fcc0
> or $r12,$r12,$r13
> bnez $r12,.L2
> fcmp.sgt.s $fcc1,$f3,$f5
> fcmp.slt.s $fcc0,$f2,$f4
> movcf2gr $r4,$fcc1
> movcf2gr $r12,$fcc0
> or $r4,$r4,$r12
> xori $r4,$r4,1
> slli.w $r4,$r4,0
> jr $r1
> .align 4
> .L3:
> or $r4,$r0,$r0
> .align 4
> .L2:
> jr $r1
>
> Per my micro-benchmark this is much faster than
> LOGICAL_OP_NON_SHORT_CIRCUIT = 0 for randomly generated inputs (i.e.
> when the branches are not predictable).
>
> Note that there is a redundant slli.w instruction in the compiled code
> and I couldn't find a way to remove it (my trick in the TARGET_64BIT
> branch only works for simple examples). We may be able to handle via
> the ext_dce pass [1] in the future.
The attached patches can remove the remaining sign-extension
instructions from the assembly.
> [1]:https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637320.html
>
On 2023/12/13 at 2:27 AM, Xi Ruoyao wrote:
> On Tue, 2023-12-12 at 20:39 +0800, Xi Ruoyao wrote:
>
> fld.s $f1,$r4,0
> fld.s $f0,$r4,4
> fld.s $f3,$r4,8
> fld.s $f2,$r4,12
> fcmp.slt.s $fcc1,$f0,$f3
> fcmp.sgt.s $fcc0,$f1,$f2
> movcf2gr $r13,$fcc1
> movcf2gr $r12,$fcc0
There is also a problem: on 3A5000, MOVCF2GR takes 7 cycles, whereas
MOVCF2FR + MOVFR2GR takes a cycle. 3A6000 does not have this problem.
> or $r12,$r12,$r13
> bnez $r12,.L3
> fld.s $f4,$r4,16
> fld.s $f5,$r4,20
> or $r4,$r0,$r0
> fcmp.sgt.s $fcc1,$f1,$f5
> fcmp.slt.s $fcc0,$f0,$f4
> movcf2gr $r12,$fcc1
> movcf2gr $r13,$fcc0
> or $r12,$r12,$r13
> bnez $r12,.L2
> fcmp.sgt.s $fcc1,$f3,$f5
> fcmp.slt.s $fcc0,$f2,$f4
> movcf2gr $r4,$fcc1
> movcf2gr $r12,$fcc0
> or $r4,$r4,$r12
> xori $r4,$r4,1
> slli.w $r4,$r4,0
> jr $r1
> .align 4
> .L3:
> or $r4,$r0,$r0
> .align 4
> .L2:
> jr $r1
>
> Per my micro-benchmark this is much faster than
> LOGICAL_OP_NON_SHORT_CIRCUIT = 0 for randomly generated inputs (i.e.
> when the branches are not predictable).
>
> Note that there is a redundant slli.w instruction in the compiled code
> and I couldn't find a way to remove it (my trick in the TARGET_64BIT
> branch only works for simple examples). We may be able to handle via
> the ext_dce pass [1] in the future.
>
> [1]:https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637320.html
>
On 2023/12/13 at 2:27 AM, Xi Ruoyao wrote:
> On Tue, 2023-12-12 at 20:39 +0800, Xi Ruoyao wrote:
>> On Tue, 2023-12-12 at 19:59 +0800, Jiahao Xu wrote:
>>>> I guess here the problem is floating-point compare instruction is much
>>>> more costly than other instructions but the fact is not correctly
>>>> modeled yet. Could you try
>>>> https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640012.html
>>>> where I've raised fp_add cost (which is used for estimating floating-
>>>> point compare cost) to 5 instructions and see if it solves your problem
>>>> without LOGICAL_OP_NON_SHORT_CIRCUIT?
>>> I think this is not the same issue as the cost of floating-point
>>> comparison instructions. The definition of LOGICAL_OP_NON_SHORT_CIRCUIT
>>> affects how the short-circuit branch, such as (A AND-IF B), is executed,
>>> and it is not directly related to the cost of floating-point comparison
>>> instructions. I will try to test it using SPECCPU 2017.
>> The point is if the cost of floating-point comparison is very high, the
>> middle end *should* short cut floating-point comparisons even if
>> LOGICAL_OP_NON_SHORT_CIRCUIT = 1.
>>
>> I've created https://gcc.gnu.org/PR112985.
>>
>> Another factor regressing the code is we don't have modeled movcf2gr
>> instruction yet, so we are not really eliding the branches as
>> LOGICAL_OP_NON_SHORT_CIRCUIT = 1 supposes to do.
> I made up this:
>
> diff --git a/gcc/config/loongarch/loongarch.md b/gcc/config/loongarch/loongarch.md
> index a5d0dcd65fe..84d828ebd0f 100644
> --- a/gcc/config/loongarch/loongarch.md
> +++ b/gcc/config/loongarch/loongarch.md
> @@ -3169,6 +3169,42 @@ (define_insn "s<code>_<ANYF:mode>_using_FCCmode"
> [(set_attr "type" "fcmp")
> (set_attr "mode" "FCC")])
>
> +(define_insn "movcf2gr<GPR:mode>"
> + [(set (match_operand:GPR 0 "register_operand" "=r")
> + (if_then_else:GPR (ne (match_operand:FCC 1 "register_operand" "z")
> + (const_int 0))
> + (const_int 1)
> + (const_int 0)))]
> + "TARGET_HARD_FLOAT"
> + "movcf2gr\t%0,%1"
> + [(set_attr "type" "move")
> + (set_attr "mode" "FCC")])
> +
> +(define_expand "cstore<ANYF:mode>4"
> + [(set (match_operand:SI 0 "register_operand")
> + (match_operator:SI 1 "loongarch_fcmp_operator"
> + [(match_operand:ANYF 2 "register_operand")
> + (match_operand:ANYF 3 "register_operand")]))]
> + ""
> + {
> + rtx fcc = gen_reg_rtx (FCCmode);
> + rtx cmp = gen_rtx_fmt_ee (GET_CODE (operands[1]), FCCmode,
> + operands[2], operands[3]);
> +
> + emit_insn (gen_rtx_SET (fcc, cmp));
> + if (TARGET_64BIT)
> + {
> + rtx gpr = gen_reg_rtx (DImode);
> + emit_insn (gen_movcf2grdi (gpr, fcc));
> + emit_insn (gen_rtx_SET (operands[0],
> + lowpart_subreg (SImode, gpr, DImode)));
> + }
> + else
> + emit_insn (gen_movcf2grsi (operands[0], fcc));
> +
> + DONE;
> + })
> +
>
>
> ;;
> ;; ....................
> diff --git a/gcc/config/loongarch/predicates.md b/gcc/config/loongarch/predicates.md
> index 9e9ce58cb53..83fea08315c 100644
> --- a/gcc/config/loongarch/predicates.md
> +++ b/gcc/config/loongarch/predicates.md
> @@ -590,6 +590,10 @@ (define_predicate "order_operator"
> (define_predicate "loongarch_cstore_operator"
> (match_code "ne,eq,gt,gtu,ge,geu,lt,ltu,le,leu"))
>
> +(define_predicate "loongarch_fcmp_operator"
> + (match_code
> + "unordered,uneq,unlt,unle,eq,lt,le,ordered,ltgt,ne,ge,gt,unge,ungt"))
> +
> (define_predicate "small_data_pattern"
> (and (match_code "set,parallel,unspec,unspec_volatile,prefetch")
> (match_test "loongarch_small_data_pattern_p (op)")))
>
> and now this function is compiled to (with LOGICAL_OP_NON_SHORT_CIRCUIT
> = 1):
>
> fld.s $f1,$r4,0
> fld.s $f0,$r4,4
> fld.s $f3,$r4,8
> fld.s $f2,$r4,12
> fcmp.slt.s $fcc1,$f0,$f3
> fcmp.sgt.s $fcc0,$f1,$f2
> movcf2gr $r13,$fcc1
> movcf2gr $r12,$fcc0
> or $r12,$r12,$r13
> bnez $r12,.L3
> fld.s $f4,$r4,16
> fld.s $f5,$r4,20
> or $r4,$r0,$r0
> fcmp.sgt.s $fcc1,$f1,$f5
> fcmp.slt.s $fcc0,$f0,$f4
> movcf2gr $r12,$fcc1
> movcf2gr $r13,$fcc0
> or $r12,$r12,$r13
> bnez $r12,.L2
> fcmp.sgt.s $fcc1,$f3,$f5
> fcmp.slt.s $fcc0,$f2,$f4
> movcf2gr $r4,$fcc1
> movcf2gr $r12,$fcc0
> or $r4,$r4,$r12
> xori $r4,$r4,1
> slli.w $r4,$r4,0
> jr $r1
> .align 4
> .L3:
> or $r4,$r0,$r0
> .align 4
> .L2:
> jr $r1
>
> Per my micro-benchmark this is much faster than
> LOGICAL_OP_NON_SHORT_CIRCUIT = 0 for randomly generated inputs (i.e.
> when the branches are not predictable).
>
> Note that there is a redundant slli.w instruction in the compiled code
> and I couldn't find a way to remove it (my trick in the TARGET_64BIT
> branch only works for simple examples). We may be able to handle via
> the ext_dce pass [1] in the future.
>
> [1]:https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637320.html
>
This test was extracted from the hot functions of 526.blender_r. Setting
LOGICAL_OP_NON_SHORT_CIRCUIT to 0 resulted in a 26% decrease in dynamic
instruction count and a 13.4% performance improvement. After applying
the patch mentioned above, the assembly code looks much better with
LOGICAL_OP_NON_SHORT_CIRCUIT=1, bringing an 11% improvement to 526.
Based on this, setting LOGICAL_OP_NON_SHORT_CIRCUIT to 0 further
improved the performance of 526 by 3%. The definition of
LOGICAL_OP_NON_SHORT_CIRCUIT determines how gimple is generated, while
the optimizations you made determine how rtl is generated. They are not
conflicting and combining them would yield better results. Currently, I
have only tested it on 526, and I will continue testing its impact on
the entire SPEC 2017 suite.
On Wed, 2023-12-13 at 14:17 +0800, Jiahao Xu wrote:
> This test was extracted from the hot functions of 526.blender_r. Setting
> LOGICAL_OP_NON_SHORT_CIRCUIT to 0 resulted in a 26% decrease in dynamic
> instruction count and a 13.4% performance improvement. After applying
> the patch mentioned above, the assembly code looks much better with
> LOGICAL_OP_NON_SHORT_CIRCUIT=1, bringing an 11% improvement to 526.
> Based on this, setting LOGICAL_OP_NON_SHORT_CIRCUIT to 0 further
> improved the performance of 526 by 3%. The definition of
> LOGICAL_OP_NON_SHORT_CIRCUIT determines how gimple is generated, while
> the optimizations you made determine how rtl is generated. They are not
> conflicting and combining them would yield better results. Currently, I
> have only tested it on 526, and I will continue testing its impact on
> the entire SPEC 2017 suite.
The problem with LOGICAL_OP_NON_SHORT_CIRCUIT = 0 is that it may regress
fixed-point-only code. In practice the usage of -ffast-math is very
rare ("real" Linux packages invoking floating-point operations often
just malfunction with it), and it does not seem good to regress common
cases for the sake of uncommon ones.
On 2023/12/13 at 2:21 PM, Xi Ruoyao wrote:
> On Wed, 2023-12-13 at 14:17 +0800, Jiahao Xu wrote:
>> This test was extracted from the hot functions of 526.blender_r. Setting
>> LOGICAL_OP_NON_SHORT_CIRCUIT to 0 resulted in a 26% decrease in dynamic
>> instruction count and a 13.4% performance improvement. After applying
>> the patch mentioned above, the assembly code looks much better with
>> LOGICAL_OP_NON_SHORT_CIRCUIT=1, bringing an 11% improvement to 526.
>> Based on this, setting LOGICAL_OP_NON_SHORT_CIRCUIT to 0 further
>> improved the performance of 526 by 3%. The definition of
>> LOGICAL_OP_NON_SHORT_CIRCUIT determines how gimple is generated, while
>> the optimizations you made determine how rtl is generated. They are not
>> conflicting and combining them would yield better results. Currently, I
>> have only tested it on 526, and I will continue testing its impact on
>> the entire SPEC 2017 suite.
> The problem with LOGICAL_OP_NON_SHORT_CIRCUIT = 0 is it may regress
> fixed-point only code. In practice the usage of -ffast-math is very
> rare ("real" Linux packages invoking floating-point operations often
> just malfunction with it) and it seems not good to regress common cases
> with uncommon cases.
>
Setting LOGICAL_OP_NON_SHORT_CIRCUIT to 0 for the SPEC2017 intrate benchmark
results in a 1.6% decrease in dynamic instruction count and an overall
performance improvement of 0.5%. Most of the SPEC2017 int programs
show a decrease in instruction count, and no performance regressions
were observed.
On Wed, 2023-12-13 at 14:32 +0800, Jiahao Xu wrote:
>
> On 2023/12/13 at 2:21 PM, Xi Ruoyao wrote:
> > On Wed, 2023-12-13 at 14:17 +0800, Jiahao Xu wrote:
> > > This test was extracted from the hot functions of 526.blender_r. Setting
> > > LOGICAL_OP_NON_SHORT_CIRCUIT to 0 resulted in a 26% decrease in dynamic
> > > instruction count and a 13.4% performance improvement. After applying
> > > the patch mentioned above, the assembly code looks much better with
> > > LOGICAL_OP_NON_SHORT_CIRCUIT=1, bringing an 11% improvement to 526.
> > > Based on this, setting LOGICAL_OP_NON_SHORT_CIRCUIT to 0 further
> > > improved the performance of 526 by 3%. The definition of
> > > LOGICAL_OP_NON_SHORT_CIRCUIT determines how gimple is generated, while
> > > the optimizations you made determine how rtl is generated. They are not
> > > conflicting and combining them would yield better results. Currently, I
> > > have only tested it on 526, and I will continue testing its impact on
> > > the entire SPEC 2017 suite.
> > The problem with LOGICAL_OP_NON_SHORT_CIRCUIT = 0 is it may regress
> > fixed-point only code. In practice the usage of -ffast-math is very
> > rare ("real" Linux packages invoking floating-point operations often
> > just malfunction with it) and it seems not good to regress common cases
> > with uncommon cases.
> >
> Setting LOGICAL_OP_NON_SHORT_CIRCUIT to 0 in SPEC2017 intrate benchmark
> results in a 1.6% decrease in dynamic instruction count and an overall
> performance improvement of 0.5%. Most of the SPEC2017 int programs
> experience a decrease in instruction count, and there are no instances
> of performance regression observed.
OK then. But please add this information to the commit message.
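diff --git a/gcc/config/loongarch/loongarch.h b/gcc/config/loongarch/loongarch.h
index f1350b6048f..880c576c35b 100644
--- a/gcc/config/loongarch/loongarch.h
+++ b/gcc/config/loongarch/loongarch.h
@@ -869,6 +869,7 @@ typedef struct {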
1 is the default; other values are interpreted relative to that. */
#define BRANCH_COST(speed_p, predictable_p) loongarch_branch_cost
+#define LOGICAL_OP_NON_SHORT_CIRCUIT 0
/* Return the asm template for a conditional branch instruction.
OPCODE is the opcode's mnemonic and OPERANDS is the asm template for
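diff --git a/gcc/testsuite/gcc.target/loongarch/short-circuit.c b/gcc/testsuite/gcc.target/loongarch/short-circuit.c
new file mode 100644
index 00000000000..bed585ee172
--- /dev/null
+++ b/gcc/testsuite/gcc.target/loongarch/short-circuit.c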
@@ -0,0 +1,19 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ffast-math -fdump-tree-gimple" } */
+
+int
+short_circuit (float *a)
+{
+ float t1x = a[0];
+ float t2x = a[1];
+ float t1y = a[2];
+ float t2y = a[3];
+ float t1z = a[4];
+ float t2z = a[5];
+
+ if (t1x > t2y || t2x < t1y || t1x > t2z || t2x < t1z || t1y > t2z || t2y < t1z)
+ return 0;
+
+ return 1;
+}
+/* { dg-final { scan-tree-dump-times "if" 6 "gimple" } } */