x86: make VPTERNLOG* usable on less than 512-bit operands with just AVX512F
Checks
Commit Message
There's no reason to constrain this to AVX512VL, as the wider operation
is not usable for more narrow operands only when the possible memory
source is a non-broadcast one. This way even the scalar copysign<mode>3
can benefit from the operation being a single-insn one (leaving aside
moves which the compiler decides to insert for unclear reasons, and
leaving aside the fact that bcst_mem_operand() is too restrictive for
broadcast to be embedded right into VPTERNLOG*).
Along with this also request value duplication in
ix86_expand_copysign()'s call to ix86_build_signbit_mask(), eliminating
excess space allocation in .rodata.*, filled with zeros which are never
read.
gcc/
* config/i386/i386-expand.cc (ix86_expand_copysign): Request
value duplication by ix86_build_signbit_mask() when AVX512F and
not HFmode.
* config/i386/sse.md (*<avx512>_vternlog<mode>_all): Convert to
2-alternative form. Adjust "mode" attribute. Add "enabled"
attribute.
(*<avx512>_vpternlog<mode>_1): Relax to just TARGET_AVX512F.
(*<avx512>_vpternlog<mode>_2): Likewise.
(*<avx512>_vpternlog<mode>_3): Likewise.
---
I guess the underlying pattern, going along the lines of what
<mask_codefor>one_cmpl<mode>2<mask_name> uses, can be applied elsewhere
as well.
HFmode could use embedded broadcast too for copysign and alike, but that
would need to be V2HF -> V8HF (for which I don't think there are any
existing patterns).
Comments
On Wed, Jun 14, 2023 at 1:59 PM Jan Beulich via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> There's no reason to constrain this to AVX512VL, as the wider operation
> is not usable for more narrow operands only when the possible memory
But this may require more resources (on AMD znver4 processor a zmm
instruction will also be split into 2 uops, right?) And on some intel
processors(SKX/CLX) there will be frequency reduction.
If it needs to be done, it is better guarded with
!TARGET_PREFER_AVX256, at least when micro-architecture AVX256_OPTIMAL
or users explicitly uses -mprefer-vector-width=256, we don't want to
produce any zmm instruction for surprise.(Although
-mprefer-vector-width=256 is supposed for auto-vectorizer, but backend
codegen also use it under such cases, i.e. in *movsf_internal
alternative 5 use zmm only TARGET_AVX512F && !TARGET_PREFER_AVX256.)
> source is a non-broadcast one. This way even the scalar copysign<mode>3
> can benefit from the operation being a single-insn one (leaving aside
> moves which the compiler decides to insert for unclear reasons, and
> leaving aside the fact that bcst_mem_operand() is too restrictive for
> broadcast to be embedded right into VPTERNLOG*).
>
> Along with this also request value duplication in
> ix86_expand_copysign()'s call to ix86_build_signbit_mask(), eliminating
> excess space allocation in .rodata.*, filled with zeros which are never
> read.
>
> gcc/
>
> * config/i386/i386-expand.cc (ix86_expand_copysign): Request
> value duplication by ix86_build_signbit_mask() when AVX512F and
> not HFmode.
> * config/i386/sse.md (*<avx512>_vternlog<mode>_all): Convert to
> 2-alternative form. Adjust "mode" attribute. Add "enabled"
> attribute.
> (*<avx512>_vpternlog<mode>_1): Relax to just TARGET_AVX512F.
> (*<avx512>_vpternlog<mode>_2): Likewise.
> (*<avx512>_vpternlog<mode>_3): Likewise.
> ---
> I guess the underlying pattern, going along the lines of what
> <mask_codefor>one_cmpl<mode>2<mask_name> uses, can be applied elsewhere
> as well.
>
> HFmode could use embedded broadcast too for copysign and alike, but that
> would need to be V2HF -> V8HF (for which I don't think there are any
> existing patterns).
>
> --- a/gcc/config/i386/i386-expand.cc
> +++ b/gcc/config/i386/i386-expand.cc
> @@ -2266,7 +2266,7 @@ ix86_expand_copysign (rtx operands[])
> else
> dest = NULL_RTX;
> op1 = lowpart_subreg (vmode, force_reg (mode, operands[2]), mode);
> - mask = ix86_build_signbit_mask (vmode, 0, 0);
> + mask = ix86_build_signbit_mask (vmode, TARGET_AVX512F && mode != HFmode, 0);
>
> if (CONST_DOUBLE_P (operands[1]))
> {
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -12399,11 +12399,11 @@
> (set_attr "mode" "<sseinsnmode>")])
>
> (define_insn "*<avx512>_vternlog<mode>_all"
> - [(set (match_operand:V 0 "register_operand" "=v")
> + [(set (match_operand:V 0 "register_operand" "=v,v")
> (unspec:V
> - [(match_operand:V 1 "register_operand" "0")
> - (match_operand:V 2 "register_operand" "v")
> - (match_operand:V 3 "bcst_vector_operand" "vmBr")
> + [(match_operand:V 1 "register_operand" "0,0")
> + (match_operand:V 2 "register_operand" "v,v")
> + (match_operand:V 3 "bcst_vector_operand" "vBr,m")
> (match_operand:SI 4 "const_0_to_255_operand")]
> UNSPEC_VTERNLOG))]
> "TARGET_AVX512F
> @@ -12411,10 +12411,22 @@
> it's not real AVX512FP16 instruction. */
> && (GET_MODE_SIZE (GET_MODE_INNER (<MODE>mode)) >= 4
> || GET_CODE (operands[3]) != VEC_DUPLICATE)"
> - "vpternlog<ternlogsuffix>\t{%4, %3, %2, %0|%0, %2, %3, %4}"
> +{
> + if (TARGET_AVX512VL)
> + return "vpternlog<ternlogsuffix>\t{%4, %3, %2, %0|%0, %2, %3, %4}";
> + else
> + return "vpternlog<ternlogsuffix>\t{%4, %g3, %g2, %g0|%g0, %g2, %g3, %4}";
> +}
> [(set_attr "type" "sselog")
> (set_attr "prefix" "evex")
> - (set_attr "mode" "<sseinsnmode>")])
> + (set (attr "mode")
> + (if_then_else (match_test "TARGET_AVX512VL")
> + (const_string "<sseinsnmode>")
> + (const_string "XI")))
> + (set (attr "enabled")
> + (if_then_else (eq_attr "alternative" "1")
> + (symbol_ref "<MODE_SIZE> == 64 || TARGET_AVX512VL")
> + (const_string "*")))])
>
> ;; There must be lots of other combinations like
> ;;
> @@ -12443,7 +12455,7 @@
> (any_logic2:V
> (match_operand:V 3 "regmem_or_bitnot_regmem_operand")
> (match_operand:V 4 "regmem_or_bitnot_regmem_operand"))))]
> - "(<MODE_SIZE> == 64 || TARGET_AVX512VL)
> + "TARGET_AVX512F
> && ix86_pre_reload_split ()
> && (rtx_equal_p (STRIP_UNARY (operands[1]),
> STRIP_UNARY (operands[4]))
> @@ -12527,7 +12539,7 @@
> (match_operand:V 2 "regmem_or_bitnot_regmem_operand"))
> (match_operand:V 3 "regmem_or_bitnot_regmem_operand"))
> (match_operand:V 4 "regmem_or_bitnot_regmem_operand")))]
> - "(<MODE_SIZE> == 64 || TARGET_AVX512VL)
> + "TARGET_AVX512F
> && ix86_pre_reload_split ()
> && (rtx_equal_p (STRIP_UNARY (operands[1]),
> STRIP_UNARY (operands[4]))
> @@ -12610,7 +12622,7 @@
> (match_operand:V 1 "regmem_or_bitnot_regmem_operand")
> (match_operand:V 2 "regmem_or_bitnot_regmem_operand"))
> (match_operand:V 3 "regmem_or_bitnot_regmem_operand")))]
> - "(<MODE_SIZE> == 64 || TARGET_AVX512VL)
> + "TARGET_AVX512F
> && ix86_pre_reload_split ()"
> "#"
> "&& 1"
On 14.06.2023 10:10, Hongtao Liu wrote:
> On Wed, Jun 14, 2023 at 1:59 PM Jan Beulich via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
>>
>> There's no reason to constrain this to AVX512VL, as the wider operation
>> is not usable for more narrow operands only when the possible memory
> But this may require more resources (on AMD znver4 processor a zmm
> instruction will also be split into 2 uops, right?) And on some intel
> processors(SKX/CLX) there will be frequency reduction.
I'm afraid I don't follow: Largely the same AVX512 code would be
generated when passing -mavx512vl, so how can power/performance
considerations matter here? All I'm doing here (and in a few more
patches I'm still in the process of testing) is relax when AVX512
insns can actually be used (reducing the copying between registers
and/or the number of insns needed). My understanding on the Intel
side is that it only matters whether AVX512 insns are used, not
what vector length they are. You may be right about znver4, though.
Nevertheless I agree ...
> If it needs to be done, it is better guarded with
> !TARGET_PREFER_AVX256, at least when micro-architecture AVX256_OPTIMAL
> or users explicitly uses -mprefer-vector-width=256, we don't want to
> produce any zmm instruction for surprise.(Although
> -mprefer-vector-width=256 is supposed for auto-vectorizer, but backend
> codegen also use it under such cases, i.e. in *movsf_internal
> alternative 5 use zmm only TARGET_AVX512F && !TARGET_PREFER_AVX256.)
... that respecting such overrides is probably desirable, so I'll
adjust.
Jan
>> source is a non-broadcast one. This way even the scalar copysign<mode>3
>> can benefit from the operation being a single-insn one (leaving aside
>> moves which the compiler decides to insert for unclear reasons, and
>> leaving aside the fact that bcst_mem_operand() is too restrictive for
>> broadcast to be embedded right into VPTERNLOG*).
>>
>> Along with this also request value duplication in
>> ix86_expand_copysign()'s call to ix86_build_signbit_mask(), eliminating
>> excess space allocation in .rodata.*, filled with zeros which are never
>> read.
>>
>> gcc/
>>
>> * config/i386/i386-expand.cc (ix86_expand_copysign): Request
>> value duplication by ix86_build_signbit_mask() when AVX512F and
>> not HFmode.
>> * config/i386/sse.md (*<avx512>_vternlog<mode>_all): Convert to
>> 2-alternative form. Adjust "mode" attribute. Add "enabled"
>> attribute.
>> (*<avx512>_vpternlog<mode>_1): Relax to just TARGET_AVX512F.
>> (*<avx512>_vpternlog<mode>_2): Likewise.
>> (*<avx512>_vpternlog<mode>_3): Likewise.
On Wed, Jun 14, 2023 at 5:32 PM Jan Beulich <jbeulich@suse.com> wrote:
>
> On 14.06.2023 10:10, Hongtao Liu wrote:
> > On Wed, Jun 14, 2023 at 1:59 PM Jan Beulich via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> >>
> >> There's no reason to constrain this to AVX512VL, as the wider operation
> >> is not usable for more narrow operands only when the possible memory
> > But this may require more resources (on AMD znver4 processor a zmm
> > instruction will also be split into 2 uops, right?) And on some intel
> > processors(SKX/CLX) there will be frequency reduction.
>
> I'm afraid I don't follow: Largely the same AVX512 code would be
> generated when passing -mavx512vl, so how can power/performance
> considerations matter here? All I'm doing here (and in a few more
Yes , for -march=*** is ok since AVX512VL is included.
what your patch improve is -mavx512f -mno-avx512vl, but for specific
option combinations like -mavx512f -mprefer-vector-width=256
-mno-avx512vl, your patch will produce zmm instruction which is not
expected.
> patches I'm still in the process of testing) is relax when AVX512
> insns can actually be used (reducing the copying between registers
> and/or the number of insns needed). My understanding on the Intel
> side is that it only matters whether AVX512 insns are used, not
No, vector length matters, ymm/xmm evex insns are ok to use, but zmm
insns will cause frequency reduction.
> what vector length they are. You may be right about znver4, though.
>
> Nevertheless I agree ...
>
> > If it needs to be done, it is better guarded with
> > !TARGET_PREFER_AVX256, at least when micro-architecture AVX256_OPTIMAL
> > or users explicitly uses -mprefer-vector-width=256, we don't want to
> > produce any zmm instruction for surprise.(Although
> > -mprefer-vector-width=256 is supposed for auto-vectorizer, but backend
> > codegen also use it under such cases, i.e. in *movsf_internal
> > alternative 5 use zmm only TARGET_AVX512F && !TARGET_PREFER_AVX256.)
>
> ... that respecting such overrides is probably desirable, so I'll
> adjust.
>
> Jan
>
> >> source is a non-broadcast one. This way even the scalar copysign<mode>3
> >> can benefit from the operation being a single-insn one (leaving aside
> >> moves which the compiler decides to insert for unclear reasons, and
> >> leaving aside the fact that bcst_mem_operand() is too restrictive for
> >> broadcast to be embedded right into VPTERNLOG*).
> >>
> >> Along with this also request value duplication in
> >> ix86_expand_copysign()'s call to ix86_build_signbit_mask(), eliminating
> >> excess space allocation in .rodata.*, filled with zeros which are never
> >> read.
> >>
> >> gcc/
> >>
> >> * config/i386/i386-expand.cc (ix86_expand_copysign): Request
> >> value duplication by ix86_build_signbit_mask() when AVX512F and
> >> not HFmode.
> >> * config/i386/sse.md (*<avx512>_vternlog<mode>_all): Convert to
> >> 2-alternative form. Adjust "mode" attribute. Add "enabled"
> >> attribute.
> >> (*<avx512>_vpternlog<mode>_1): Relax to just TARGET_AVX512F.
> >> (*<avx512>_vpternlog<mode>_2): Likewise.
> >> (*<avx512>_vpternlog<mode>_3): Likewise.
>
@@ -2266,7 +2266,7 @@ ix86_expand_copysign (rtx operands[])
else
dest = NULL_RTX;
op1 = lowpart_subreg (vmode, force_reg (mode, operands[2]), mode);
- mask = ix86_build_signbit_mask (vmode, 0, 0);
+ mask = ix86_build_signbit_mask (vmode, TARGET_AVX512F && mode != HFmode, 0);
if (CONST_DOUBLE_P (operands[1]))
{
@@ -12399,11 +12399,11 @@
(set_attr "mode" "<sseinsnmode>")])
(define_insn "*<avx512>_vternlog<mode>_all"
- [(set (match_operand:V 0 "register_operand" "=v")
+ [(set (match_operand:V 0 "register_operand" "=v,v")
(unspec:V
- [(match_operand:V 1 "register_operand" "0")
- (match_operand:V 2 "register_operand" "v")
- (match_operand:V 3 "bcst_vector_operand" "vmBr")
+ [(match_operand:V 1 "register_operand" "0,0")
+ (match_operand:V 2 "register_operand" "v,v")
+ (match_operand:V 3 "bcst_vector_operand" "vBr,m")
(match_operand:SI 4 "const_0_to_255_operand")]
UNSPEC_VTERNLOG))]
"TARGET_AVX512F
@@ -12411,10 +12411,22 @@
it's not real AVX512FP16 instruction. */
&& (GET_MODE_SIZE (GET_MODE_INNER (<MODE>mode)) >= 4
|| GET_CODE (operands[3]) != VEC_DUPLICATE)"
- "vpternlog<ternlogsuffix>\t{%4, %3, %2, %0|%0, %2, %3, %4}"
+{
+ if (TARGET_AVX512VL)
+ return "vpternlog<ternlogsuffix>\t{%4, %3, %2, %0|%0, %2, %3, %4}";
+ else
+ return "vpternlog<ternlogsuffix>\t{%4, %g3, %g2, %g0|%g0, %g2, %g3, %4}";
+}
[(set_attr "type" "sselog")
(set_attr "prefix" "evex")
- (set_attr "mode" "<sseinsnmode>")])
+ (set (attr "mode")
+ (if_then_else (match_test "TARGET_AVX512VL")
+ (const_string "<sseinsnmode>")
+ (const_string "XI")))
+ (set (attr "enabled")
+ (if_then_else (eq_attr "alternative" "1")
+ (symbol_ref "<MODE_SIZE> == 64 || TARGET_AVX512VL")
+ (const_string "*")))])
;; There must be lots of other combinations like
;;
@@ -12443,7 +12455,7 @@
(any_logic2:V
(match_operand:V 3 "regmem_or_bitnot_regmem_operand")
(match_operand:V 4 "regmem_or_bitnot_regmem_operand"))))]
- "(<MODE_SIZE> == 64 || TARGET_AVX512VL)
+ "TARGET_AVX512F
&& ix86_pre_reload_split ()
&& (rtx_equal_p (STRIP_UNARY (operands[1]),
STRIP_UNARY (operands[4]))
@@ -12527,7 +12539,7 @@
(match_operand:V 2 "regmem_or_bitnot_regmem_operand"))
(match_operand:V 3 "regmem_or_bitnot_regmem_operand"))
(match_operand:V 4 "regmem_or_bitnot_regmem_operand")))]
- "(<MODE_SIZE> == 64 || TARGET_AVX512VL)
+ "TARGET_AVX512F
&& ix86_pre_reload_split ()
&& (rtx_equal_p (STRIP_UNARY (operands[1]),
STRIP_UNARY (operands[4]))
@@ -12610,7 +12622,7 @@
(match_operand:V 1 "regmem_or_bitnot_regmem_operand")
(match_operand:V 2 "regmem_or_bitnot_regmem_operand"))
(match_operand:V 3 "regmem_or_bitnot_regmem_operand")))]
- "(<MODE_SIZE> == 64 || TARGET_AVX512VL)
+ "TARGET_AVX512F
&& ix86_pre_reload_split ()"
"#"
"&& 1"