match: Do not select branchless expression when target has movcc pattern [PR113095]
Commit Message
This allows the backend to generate movcc instructions if the target
machine has a movcc pattern.
branchless-cond.c needs to be updated since some target machines have
conditional move instructions, and the expression will not be changed to
a branchless expression.
gcc/ChangeLog:
PR target/113095
* match.pd (`(zero_one == 0) ? y : z <op> y`,
`(zero_one != 0) ? z <op> y : y`): Do not match to a branchless
expression if the target machine has a conditional move pattern.
gcc/testsuite/ChangeLog:
* gcc.dg/tree-ssa/branchless-cond.c: Update testcase.
---
gcc/match.pd | 30 +++++++++++++++++--
.../gcc.dg/tree-ssa/branchless-cond.c | 6 ++--
2 files changed, 31 insertions(+), 5 deletions(-)
Comments
On Wed, 17 Jan 2024, Monk Chiang wrote:
> This allows the backend to generate movcc instructions, if target
> machine has movcc pattern.
>
> branchless-cond.c needs to be updated since some target machines have
> conditional move instructions, and the expression will not change to
> branchless expression.
While I agree this pattern should possibly be applied during RTL
expansion or instruction selection, on x86, which also has movcc,
the multiplication is cheaper. So I don't think this is the way to go.
I'd rather revert the change than try to "fix" it this way?
Thanks,
Richard.
> gcc/ChangeLog:
> PR target/113095
> * match.pd (`(zero_one == 0) ? y : z <op> y`,
> `(zero_one != 0) ? z <op> y : y`): Do not match to branchless
> expression, if target machine has conditional move pattern.
>
> gcc/testsuite/ChangeLog:
> * gcc.dg/tree-ssa/branchless-cond.c: Update testcase.
> ---
> gcc/match.pd | 30 +++++++++++++++++--
> .../gcc.dg/tree-ssa/branchless-cond.c | 6 ++--
> 2 files changed, 31 insertions(+), 5 deletions(-)
>
> diff --git a/gcc/match.pd b/gcc/match.pd
> index e42ecaf9ec7..a1f90b1cd41 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -4231,7 +4231,20 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> (if (INTEGRAL_TYPE_P (type)
> && TYPE_PRECISION (type) > 1
> && (INTEGRAL_TYPE_P (TREE_TYPE (@0))))
> - (op (mult (convert:type @0) @2) @1))))
> + (with {
> + bool can_movecc_p = false;
> + if (can_conditionally_move_p (TYPE_MODE (type)))
> + can_movecc_p = true;
> +
> + /* Some targets only support word_mode for the movcc pattern; if the type
> + can be extended to word_mode then use a conditional move.  Even with an
> + extend instruction, the cost is lower than the branchless sequence.  */
> + if (can_extend_p (word_mode, TYPE_MODE (type), TYPE_UNSIGNED (type))
> + && can_conditionally_move_p (word_mode))
> + can_movecc_p = true;
> + }
> + (if (!can_movecc_p)
> + (op (mult (convert:type @0) @2) @1))))))
>
> /* (zero_one != 0) ? z <op> y : y -> ((typeof(y))zero_one * z) <op> y */
> (for op (bit_xor bit_ior plus)
> @@ -4243,7 +4256,20 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> (if (INTEGRAL_TYPE_P (type)
> && TYPE_PRECISION (type) > 1
> && (INTEGRAL_TYPE_P (TREE_TYPE (@0))))
> - (op (mult (convert:type @0) @2) @1))))
> + (with {
> + bool can_movecc_p = false;
> + if (can_conditionally_move_p (TYPE_MODE (type)))
> + can_movecc_p = true;
> +
> + /* Some targets only support word_mode for the movcc pattern; if the type
> + can be extended to word_mode then use a conditional move.  Even with an
> + extend instruction, the cost is lower than the branchless sequence.  */
> + if (can_extend_p (word_mode, TYPE_MODE (type), TYPE_UNSIGNED (type))
> + && can_conditionally_move_p (word_mode))
> + can_movecc_p = true;
> + }
> + (if (!can_movecc_p)
> + (op (mult (convert:type @0) @2) @1))))))
>
> /* ?: Value replacement. */
> /* a == 0 ? b : b + a -> b + a */
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/branchless-cond.c b/gcc/testsuite/gcc.dg/tree-ssa/branchless-cond.c
> index e063dc4bb5f..c002ed97364 100644
> --- a/gcc/testsuite/gcc.dg/tree-ssa/branchless-cond.c
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/branchless-cond.c
> @@ -21,6 +21,6 @@ int f4(unsigned int x, unsigned int y, unsigned int z)
> return ((x & 1) != 0) ? z | y : y;
> }
>
> -/* { dg-final { scan-tree-dump-times " \\\*" 4 "optimized" } } */
> -/* { dg-final { scan-tree-dump-times " & " 4 "optimized" } } */
> -/* { dg-final { scan-tree-dump-not "if " "optimized" } } */
> +/* { dg-final { scan-tree-dump-times " \\\*" 4 "optimized" { xfail { "aarch64*-*-* alpha*-*-* bfin*-*-* epiphany-*-* i?86-*-* x86_64-*-* nds32*-*-*" } } } } */
> +/* { dg-final { scan-tree-dump-times " & " 4 "optimized" { xfail { "aarch64*-*-* alpha*-*-* bfin*-*-* epiphany-*-* i?86-*-* x86_64-*-* nds32*-*-*" } } } } */
> +/* { dg-final { scan-tree-dump-not "if " "optimized" { xfail { "aarch64*-*-* alpha*-*-* bfin*-*-* epiphany-*-* i?86-*-* x86_64-*-* nds32*-*-*" } } } } */
>
On 1/17/24 05:14, Richard Biener wrote:
> On Wed, 17 Jan 2024, Monk Chiang wrote:
>
>> This allows the backend to generate movcc instructions, if target
>> machine has movcc pattern.
>>
>> branchless-cond.c needs to be updated since some target machines have
>> conditional move instructions, and the expression will not change to
>> branchless expression.
>
> While I agree this pattern should possibly be applied during RTL
> expansion or instruction selection, on x86, which also has movcc,
> the multiplication is cheaper. So I don't think this is the way to go.
>
> I'd rather revert the change than try to "fix" it this way?
WRT reverting -- the patch in question's sole purpose was to enable
branchless sequences for that very same code. Reverting would regress
performance on a variety of micro-architectures. IIUC, the issue is
that the SiFive part in question has a fusion which allows it to do the
branchy sequence cheaply.
ISTM this really needs to be addressed during expansion and most likely
with a RISC-V target twiddle for the micro-archs which have
short-forward-branch optimizations.
jeff
Thanks for your advice!! I agree it should be fixed in the RISC-V backend
during expansion.
On Wed, Jan 17, 2024 at 10:37 PM Jeff Law <jeffreyalaw@gmail.com> wrote:
>
>
> On 1/17/24 05:14, Richard Biener wrote:
> > On Wed, 17 Jan 2024, Monk Chiang wrote:
> >
> >> This allows the backend to generate movcc instructions, if target
> >> machine has movcc pattern.
> >>
> >> branchless-cond.c needs to be updated since some target machines have
> >> conditional move instructions, and the expression will not change to
> >> branchless expression.
> >
> > While I agree this pattern should possibly be applied during RTL
> > expansion or instruction selection, on x86, which also has movcc,
> > the multiplication is cheaper. So I don't think this is the way to go.
> >
> > I'd rather revert the change than try to "fix" it this way?
> WRT reverting -- the patch in question's sole purpose was to enable
> branchless sequences for that very same code. Reverting would regress
> performance on a variety of micro-architectures. IIUC, the issue is
> that the SiFive part in question has a fusion which allows it to do the
> branchy sequence cheaply.
>
> ISTM this really needs to be addressed during expansion and most likely
> with a RISC-V target twiddle for the micro-archs which have
> short-forward-branch optimizations.
>
> jeff
>
On Wed, 17 Jan 2024 19:19:58 PST (-0800), monk.chiang@sifive.com wrote:
> Thanks for your advice!! I agree it should be fixed in the RISC-V backend
> during expansion.
>
>
> On Wed, Jan 17, 2024 at 10:37 PM Jeff Law <jeffreyalaw@gmail.com> wrote:
>
>>
>>
>> On 1/17/24 05:14, Richard Biener wrote:
>> > On Wed, 17 Jan 2024, Monk Chiang wrote:
>> >
>> >> This allows the backend to generate movcc instructions, if target
>> >> machine has movcc pattern.
>> >>
>> >> branchless-cond.c needs to be updated since some target machines have
>> >> conditional move instructions, and the expression will not change to
>> >> branchless expression.
>> >
>> > While I agree this pattern should possibly be applied during RTL
>> > expansion or instruction selection, on x86, which also has movcc,
>> > the multiplication is cheaper. So I don't think this is the way to go.
>> >
>> > I'd rather revert the change than try to "fix" it this way?
>> WRT reverting -- the patch in question's sole purpose was to enable
>> branchless sequences for that very same code. Reverting would regress
>> performance on a variety of micro-architectures. IIUC, the issue is
>> that the SiFive part in question has a fusion which allows it to do the
>> branchy sequence cheaply.
>>
>> ISTM this really needs to be addressed during expansion and most likely
>> with a RISC-V target twiddle for the micro-archs which have
>> short-forward-branch optimizations.
IIRC I ran into some of these middle-end interactions a year or two ago
and determined that we'd need middle-end changes to get this working
smoothly -- essentially replacing the expander checks for a MOVCC insn
with some sort of costing.
Without that, we're just going to end up with some missed optimizations
that favor one way or the other.
>>
>> jeff
>>