[v2] x86: make VPTERNLOG* usable on less than 512-bit operands with just AVX512F
Checks
Commit Message
There's no reason to constrain this to AVX512VL, unless instructed so by
-mprefer-vector-width=, as the wider operation is unusable for more
narrow operands only when the possible memory source is a non-broadcast
one. This way even the scalar copysign<mode>3 can benefit from the
operation being a single-insn one (leaving aside moves which the
compiler decides to insert for unclear reasons, and leaving aside the
fact that bcst_mem_operand() is too restrictive for broadcast to be
embedded right into VPTERNLOG*).
Along with this also request value duplication in
ix86_expand_copysign()'s call to ix86_build_signbit_mask(), eliminating
excess space allocation in .rodata.*, filled with zeros which are never
read.
gcc/
* config/i386/i386-expand.cc (ix86_expand_copysign): Request
value duplication by ix86_build_signbit_mask() when AVX512F and
not HFmode.
* config/i386/sse.md (*<avx512>_vternlog<mode>_all): Convert to
2-alternative form. Adjust "mode" attribute. Add "enabled"
attribute.
(*<avx512>_vpternlog<mode>_1): Also permit when TARGET_AVX512F
&& !TARGET_PREFER_AVX256.
(*<avx512>_vpternlog<mode>_2): Likewise.
(*<avx512>_vpternlog<mode>_3): Likewise.
---
I guess the underlying pattern, going along the lines of what
<mask_codefor>one_cmpl<mode>2<mask_name> uses, can be applied elsewhere
as well.
HFmode could use embedded broadcast too for copysign and alike, but that
would need to be V2HF -> V8HF (for which I don't think there are any
existing patterns).
---
v2: Respect -mprefer-vector-width=.
Comments
> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Friday, June 16, 2023 2:22 PM
> To: gcc-patches@gcc.gnu.org
> Cc: Kirill Yukhin <kirill.yukhin@gmail.com>; Liu, Hongtao
> <hongtao.liu@intel.com>
> Subject: [PATCH v2] x86: make VPTERNLOG* usable on less than 512-bit
> operands with just AVX512F
>
> There's no reason to constrain this to AVX512VL, unless instructed so by -
> mprefer-vector-width=, as the wider operation is unusable for more narrow
> operands only when the possible memory source is a non-broadcast one.
> This way even the scalar copysign<mode>3 can benefit from the operation
> being a single-insn one (leaving aside moves which the compiler decides to
> insert for unclear reasons, and leaving aside the fact that
> bcst_mem_operand() is too restrictive for broadcast to be embedded right
> into VPTERNLOG*).
>
> Along with this also request value duplication in ix86_expand_copysign()'s
> call to ix86_build_signbit_mask(), eliminating excess space allocation
> in .rodata.*, filled with zeros which are never read.
>
> gcc/
>
> * config/i386/i386-expand.cc (ix86_expand_copysign): Request
> value duplication by ix86_build_signbit_mask() when AVX512F and
> not HFmode.
> * config/i386/sse.md (*<avx512>_vternlog<mode>_all): Convert to
> 2-alternative form. Adjust "mode" attribute. Add "enabled"
> attribute.
> (*<avx512>_vpternlog<mode>_1): Also permit when
> TARGET_AVX512F
> && !TARGET_PREFER_AVX256.
> (*<avx512>_vpternlog<mode>_2): Likewise.
> (*<avx512>_vpternlog<mode>_3): Likewise.
> ---
> I guess the underlying pattern, going along the lines of what
> <mask_codefor>one_cmpl<mode>2<mask_name> uses, can be applied
> elsewhere as well.
>
> HFmode could use embedded broadcast too for copysign and alike, but that
> would need to be V2HF -> V8HF (for which I don't think there are any existing
> patterns).
> ---
> v2: Respect -mprefer-vector-width=.
>
> --- a/gcc/config/i386/i386-expand.cc
> +++ b/gcc/config/i386/i386-expand.cc
> @@ -2266,7 +2266,7 @@ ix86_expand_copysign (rtx operands[])
> else
> dest = NULL_RTX;
> op1 = lowpart_subreg (vmode, force_reg (mode, operands[2]), mode);
> - mask = ix86_build_signbit_mask (vmode, 0, 0);
> + mask = ix86_build_signbit_mask (vmode, TARGET_AVX512F && mode !=
> + HFmode, 0);
>
> if (CONST_DOUBLE_P (operands[1]))
> {
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -12597,11 +12597,11 @@
> (set_attr "mode" "<sseinsnmode>")])
>
> (define_insn "*<avx512>_vternlog<mode>_all"
> - [(set (match_operand:V 0 "register_operand" "=v")
> + [(set (match_operand:V 0 "register_operand" "=v,v")
> (unspec:V
> - [(match_operand:V 1 "register_operand" "0")
> - (match_operand:V 2 "register_operand" "v")
> - (match_operand:V 3 "bcst_vector_operand" "vmBr")
> + [(match_operand:V 1 "register_operand" "0,0")
> + (match_operand:V 2 "register_operand" "v,v")
> + (match_operand:V 3 "bcst_vector_operand" "vBr,m")
> (match_operand:SI 4 "const_0_to_255_operand")]
> UNSPEC_VTERNLOG))]
> "TARGET_AVX512F
Change condition to <MODE_SIZE> == 64 || TARGET_AVX512VL || (TARGET_AVX512F && !TARGET_PREFER_AVX256)
Also please add a testcase for case TARGET_AVX512F && !TARGET_PREFER_AVX256.
> @@ -12609,10 +12609,22 @@
> it's not real AVX512FP16 instruction. */
> && (GET_MODE_SIZE (GET_MODE_INNER (<MODE>mode)) >= 4
> || GET_CODE (operands[3]) != VEC_DUPLICATE)"
> - "vpternlog<ternlogsuffix>\t{%4, %3, %2, %0|%0, %2, %3, %4}"
> +{
> + if (TARGET_AVX512VL)
> + return "vpternlog<ternlogsuffix>\t{%4, %3, %2, %0|%0, %2, %3, %4}";
> + else
> + return "vpternlog<ternlogsuffix>\t{%4, %g3, %g2, %g0|%g0, %g2, %g3,
> +%4}"; }
> [(set_attr "type" "sselog")
> (set_attr "prefix" "evex")
> - (set_attr "mode" "<sseinsnmode>")])
> + (set (attr "mode")
> + (if_then_else (match_test "TARGET_AVX512VL")
> + (const_string "<sseinsnmode>")
> + (const_string "XI")))
> + (set (attr "enabled")
> + (if_then_else (eq_attr "alternative" "1")
> + (symbol_ref "<MODE_SIZE> == 64 || TARGET_AVX512VL")
> + (const_string "*")))])
>
> ;; There must be lots of other combinations like ;; @@ -12641,7 +12653,8
> @@
> (any_logic2:V
> (match_operand:V 3 "regmem_or_bitnot_regmem_operand")
> (match_operand:V 4 "regmem_or_bitnot_regmem_operand"))))]
> - "(<MODE_SIZE> == 64 || TARGET_AVX512VL)
> + "(<MODE_SIZE> == 64 || TARGET_AVX512VL
> + || (TARGET_AVX512F && !TARGET_PREFER_AVX256))
> && ix86_pre_reload_split ()
> && (rtx_equal_p (STRIP_UNARY (operands[1]),
> STRIP_UNARY (operands[4]))
> @@ -12725,7 +12738,8 @@
> (match_operand:V 2 "regmem_or_bitnot_regmem_operand"))
> (match_operand:V 3 "regmem_or_bitnot_regmem_operand"))
> (match_operand:V 4 "regmem_or_bitnot_regmem_operand")))]
> - "(<MODE_SIZE> == 64 || TARGET_AVX512VL)
> + "(<MODE_SIZE> == 64 || TARGET_AVX512VL
> + || (TARGET_AVX512F && !TARGET_PREFER_AVX256))
> && ix86_pre_reload_split ()
> && (rtx_equal_p (STRIP_UNARY (operands[1]),
> STRIP_UNARY (operands[4]))
> @@ -12808,7 +12822,8 @@
> (match_operand:V 1 "regmem_or_bitnot_regmem_operand")
> (match_operand:V 2 "regmem_or_bitnot_regmem_operand"))
> (match_operand:V 3 "regmem_or_bitnot_regmem_operand")))]
> - "(<MODE_SIZE> == 64 || TARGET_AVX512VL)
> + "(<MODE_SIZE> == 64 || TARGET_AVX512VL
> + || (TARGET_AVX512F && !TARGET_PREFER_AVX256))
> && ix86_pre_reload_split ()"
> "#"
> "&& 1"
On 19.06.2023 04:07, Liu, Hongtao wrote:
>> -----Original Message-----
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: Friday, June 16, 2023 2:22 PM
>>
>> --- a/gcc/config/i386/sse.md
>> +++ b/gcc/config/i386/sse.md
>> @@ -12597,11 +12597,11 @@
>> (set_attr "mode" "<sseinsnmode>")])
>>
>> (define_insn "*<avx512>_vternlog<mode>_all"
>> - [(set (match_operand:V 0 "register_operand" "=v")
>> + [(set (match_operand:V 0 "register_operand" "=v,v")
>> (unspec:V
>> - [(match_operand:V 1 "register_operand" "0")
>> - (match_operand:V 2 "register_operand" "v")
>> - (match_operand:V 3 "bcst_vector_operand" "vmBr")
>> + [(match_operand:V 1 "register_operand" "0,0")
>> + (match_operand:V 2 "register_operand" "v,v")
>> + (match_operand:V 3 "bcst_vector_operand" "vBr,m")
>> (match_operand:SI 4 "const_0_to_255_operand")]
>> UNSPEC_VTERNLOG))]
>> "TARGET_AVX512F
> Change condition to <MODE_SIZE> == 64 || TARGET_AVX512VL || (TARGET_AVX512F && !TARGET_PREFER_AVX256)
May I ask why you think this is necessary? The condition of the insn
already wasn't in sync with the condition used in all three splitters,
and I didn't see any reason why now they would need to be brought in
sync. First and foremost because of the use of the UNSPEC (equally
before and after this patch).
Furthermore, isn't it the case that I'm already mostly expressing
this with the "enabled" attribute? At the very least I think I
should drop that again then if following your request?
> Also please add a testcase for case TARGET_AVX512F && !TARGET_PREFER_AVX256.
Especially in a case like this one I'm wondering about the usefulness
of a contrived testcase: It won't test more than one minor sub-case of
the whole set of constructs covered here. But well, here as well as
for the other change I'll invent something.
Jan
On Mon, Jun 19, 2023 at 3:09 PM Jan Beulich via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On 19.06.2023 04:07, Liu, Hongtao wrote:
> >> -----Original Message-----
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: Friday, June 16, 2023 2:22 PM
> >>
> >> --- a/gcc/config/i386/sse.md
> >> +++ b/gcc/config/i386/sse.md
> >> @@ -12597,11 +12597,11 @@
> >> (set_attr "mode" "<sseinsnmode>")])
> >>
> >> (define_insn "*<avx512>_vternlog<mode>_all"
> >> - [(set (match_operand:V 0 "register_operand" "=v")
> >> + [(set (match_operand:V 0 "register_operand" "=v,v")
> >> (unspec:V
> >> - [(match_operand:V 1 "register_operand" "0")
> >> - (match_operand:V 2 "register_operand" "v")
> >> - (match_operand:V 3 "bcst_vector_operand" "vmBr")
> >> + [(match_operand:V 1 "register_operand" "0,0")
> >> + (match_operand:V 2 "register_operand" "v,v")
> >> + (match_operand:V 3 "bcst_vector_operand" "vBr,m")
> >> (match_operand:SI 4 "const_0_to_255_operand")]
> >> UNSPEC_VTERNLOG))]
> >> "TARGET_AVX512F
> > Change condition to <MODE_SIZE> == 64 || TARGET_AVX512VL || (TARGET_AVX512F && !TARGET_PREFER_AVX256)
>
> May I ask why you think this is necessary? The condition of the insn
> already wasn't in sync with the condition used in all three splitters,
I think it's a latent bug for the original
*<avx512>_vternlog<mode>_all, it should be <MODE_SIZE> == 64 ||
TARGET_AVX512VL instead of TARGET_AVX512F, since TARGET_AVX512VL is
needed for 128/256-bit vectors. The bug won't be exposed since the
pattern is only generated by those 3 splitters which are guarded by
TARGET_AVX512VL. But We can just fix this to make the pattern exactly
correct.
> and I didn't see any reason why now they would need to be brought in
> sync. First and foremost because of the use of the UNSPEC (equally
> before and after this patch).
>
> Furthermore, isn't it the case that I'm already mostly expressing
> this with the "enabled" attribute? At the very least I think I
> should drop that again then if following your request?
You only handle alternative 1, but for alternative 0, it is still
enabled when TARGET_PREFER_AVX256 && !TARGET_AVX512VL for 128/256-bit
vectors. You don't need the drop that, alternative 1 still needs
<MODE_SIZE> == 64 || TARGET_AVX512VL since memory_operand can't be
operated with the zmm instruction for 128/256-bit vectors.
>
> > Also please add a testcase for case TARGET_AVX512F && !TARGET_PREFER_AVX256.
>
> Especially in a case like this one I'm wondering about the usefulness
> of a contrived testcase: It won't test more than one minor sub-case of
> the whole set of constructs covered here. But well, here as well as
> for the other change I'll invent something.
We don't need all sub-case, one is enough to guard that your
optimization won't be corrupted by a later commit.
.i.e
typedef int v4si __attribute((vector_size(16)));
v4si
foo (v4si a, v4si b, v4si c)
{
return (a & b) | c;
}
We can now generate vpternlog with -mavx512f -O2 -mno-avx512vl
-mprefer-vector-width=512.
>
> Jan
@@ -2266,7 +2266,7 @@ ix86_expand_copysign (rtx operands[])
else
dest = NULL_RTX;
op1 = lowpart_subreg (vmode, force_reg (mode, operands[2]), mode);
- mask = ix86_build_signbit_mask (vmode, 0, 0);
+ mask = ix86_build_signbit_mask (vmode, TARGET_AVX512F && mode != HFmode, 0);
if (CONST_DOUBLE_P (operands[1]))
{
@@ -12597,11 +12597,11 @@
(set_attr "mode" "<sseinsnmode>")])
(define_insn "*<avx512>_vternlog<mode>_all"
- [(set (match_operand:V 0 "register_operand" "=v")
+ [(set (match_operand:V 0 "register_operand" "=v,v")
(unspec:V
- [(match_operand:V 1 "register_operand" "0")
- (match_operand:V 2 "register_operand" "v")
- (match_operand:V 3 "bcst_vector_operand" "vmBr")
+ [(match_operand:V 1 "register_operand" "0,0")
+ (match_operand:V 2 "register_operand" "v,v")
+ (match_operand:V 3 "bcst_vector_operand" "vBr,m")
(match_operand:SI 4 "const_0_to_255_operand")]
UNSPEC_VTERNLOG))]
"TARGET_AVX512F
@@ -12609,10 +12609,22 @@
it's not real AVX512FP16 instruction. */
&& (GET_MODE_SIZE (GET_MODE_INNER (<MODE>mode)) >= 4
|| GET_CODE (operands[3]) != VEC_DUPLICATE)"
- "vpternlog<ternlogsuffix>\t{%4, %3, %2, %0|%0, %2, %3, %4}"
+{
+ if (TARGET_AVX512VL)
+ return "vpternlog<ternlogsuffix>\t{%4, %3, %2, %0|%0, %2, %3, %4}";
+ else
+ return "vpternlog<ternlogsuffix>\t{%4, %g3, %g2, %g0|%g0, %g2, %g3, %4}";
+}
[(set_attr "type" "sselog")
(set_attr "prefix" "evex")
- (set_attr "mode" "<sseinsnmode>")])
+ (set (attr "mode")
+ (if_then_else (match_test "TARGET_AVX512VL")
+ (const_string "<sseinsnmode>")
+ (const_string "XI")))
+ (set (attr "enabled")
+ (if_then_else (eq_attr "alternative" "1")
+ (symbol_ref "<MODE_SIZE> == 64 || TARGET_AVX512VL")
+ (const_string "*")))])
;; There must be lots of other combinations like
;;
@@ -12641,7 +12653,8 @@
(any_logic2:V
(match_operand:V 3 "regmem_or_bitnot_regmem_operand")
(match_operand:V 4 "regmem_or_bitnot_regmem_operand"))))]
- "(<MODE_SIZE> == 64 || TARGET_AVX512VL)
+ "(<MODE_SIZE> == 64 || TARGET_AVX512VL
+ || (TARGET_AVX512F && !TARGET_PREFER_AVX256))
&& ix86_pre_reload_split ()
&& (rtx_equal_p (STRIP_UNARY (operands[1]),
STRIP_UNARY (operands[4]))
@@ -12725,7 +12738,8 @@
(match_operand:V 2 "regmem_or_bitnot_regmem_operand"))
(match_operand:V 3 "regmem_or_bitnot_regmem_operand"))
(match_operand:V 4 "regmem_or_bitnot_regmem_operand")))]
- "(<MODE_SIZE> == 64 || TARGET_AVX512VL)
+ "(<MODE_SIZE> == 64 || TARGET_AVX512VL
+ || (TARGET_AVX512F && !TARGET_PREFER_AVX256))
&& ix86_pre_reload_split ()
&& (rtx_equal_p (STRIP_UNARY (operands[1]),
STRIP_UNARY (operands[4]))
@@ -12808,7 +12822,8 @@
(match_operand:V 1 "regmem_or_bitnot_regmem_operand")
(match_operand:V 2 "regmem_or_bitnot_regmem_operand"))
(match_operand:V 3 "regmem_or_bitnot_regmem_operand")))]
- "(<MODE_SIZE> == 64 || TARGET_AVX512VL)
+ "(<MODE_SIZE> == 64 || TARGET_AVX512VL
+ || (TARGET_AVX512F && !TARGET_PREFER_AVX256))
&& ix86_pre_reload_split ()"
"#"
"&& 1"