On Tue, May 30, 2023 at 9:39 AM Uros Bizjak <ubizjak@gmail.com> wrote:
>
> On Mon, May 29, 2023 at 8:17 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
> >
> >
> > This is my proposed minimal fix for PR target/109973 (hopefully suitable
> > for backporting) that follows Jakub Jelinek's suggestion that we introduce
> > CCZmode and CCCmode variants of ptest and vptest, so that the i386
> > backend treats [v]ptest instructions similarly to testl instructions;
> > using different CCmodes to indicate which condition flags are desired,
> > and then relying on the RTL cmpelim pass to eliminate redundant tests.
> >
> > This conveniently matches Intel's intrinsics, that provide different
> > functions for retrieving different flags, _mm_testz_si128 tests the
> > Z flag, _mm_testc_si128 tests the carry flag. Currently we use the
> > same instruction (pattern) for both, and unfortunately the *ptest<mode>_and
> > optimization is only valid when the ptest/vptest instruction is used to
> > set/test the Z flag.
> >
> > The downside, as predicted by Jakub, is that GCC's cmpelim pass is
> > currently COMPARE-centric and not able to merge the ptests from expressions
> > such as _mm256_testc_si256 (a, b) + _mm256_testz_si256 (a, b), which is a
> > known issue, PR target/80040. I've some follow-up patches to improve
> > things, but this first patch fixes the wrong-code regression, replacing
> > it with a rare missed-optimization (hopefully suitable for GCC 13).
> >
> > The only change that was unanticipated was the tweak to ix86_match_ccmode.
> > Oddly, CCZmode is allowable for CCmode, but CCCmode isn't. Given that
> > CCZmode means just the Z flag, CCCmode means just the C flag, and
> > CCmode means all the flags, I'm guessing this asymmetry is unintentional.
> > Perhaps a super-safe fix is to explicitly test for CCZmode, CCCmode or
> > CCmode
> > in the *<sse4_1>_ptest<mode> pattern's predicate, and not attempt to
> > re-use ix86_match_ccmode?
>
> It is actually the other way. CCZmode should NOT be allowed for CCmode
> in ix86_match_ccmode. When CCmode is requested, we don't assume
> anything about FLAGS bits, so we expect all bits to be valid. CCZmode
> implies only Z bit, and should be compatible only with itself. So, the
> "break;" is in the wrong place, it should be before E_CCZmode.
Hm, but PTEST is the *PRODUCER* of flags, not the consumer...
So, the whole picture should be like this:
(define_insn "*cmp<mode>_ccno_1"
[(set (reg FLAGS_REG)
(compare (match_operand:SWI 0 "nonimmediate_operand" "<r>,?m<r>")
(match_operand:SWI 1 "const0_operand")))]
"ix86_match_ccmode (insn, CCNOmode)"
The above means that the compare PROVIDES all bits, but O is
guaranteed to be zero.
(define_insn "*cmp<mode>_1"
[(set (reg FLAGS_REG)
(compare (match_operand:SWI 0 "nonimmediate_operand" "<r>m,<r>")
(match_operand:SWI 1 "<general_operand>" "<r><i>,<r><m>")))]
"ix86_match_ccmode (insn, CCmode)"
The above means that compare PROVIDES all bits.
+(define_expand "<sse4_1>_ptest<mode>"
+ [(set (reg:CC FLAGS_REG)
+ (unspec:CC [(match_operand:V_AVX 0 "register_operand")
+ (match_operand:V_AVX 1 "vector_operand")]
+ UNSPEC_PTEST))]
+ "TARGET_SSE4_1")
This is not true, PTEST does not provide all FLAGS bits in a general sense.
So, I think your original patch is OK, but please introduce the
ix86_match_ptest_ccmode function instead of reusing ix86_match_ccmode.
Uros.
>
> Uros.
>
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check, both with and without --target_board=unix{-m32}
> > with no new failures. Ok for mainline?
> >
> >
> > 2023-05-29 Roger Sayle <roger@nextmovesoftware.com>
> >
> > gcc/ChangeLog
> > PR targt/109973
> > * config/i386/i386-builtin.def (__builtin_ia32_ptestz128): Use new
> > CODE_for_sse4_1_ptestzv2di.
> > (__builtin_ia32_ptestc128): Use new CODE_for_sse4_1_ptestcv2di.
> > (__builtin_ia32_ptestz256): Use new CODE_for_avx_ptestzv4di.
> > (__builtin_ia32_ptestc256): Use new CODE_for_avx_ptestcv4di.
> > * config/i386/i386-expand.cc (ix86_expand_branch): Use CCZmode
> > when expanding UNSPEC_PTEST to compare against zero.
> > * config/i386/i386-features.cc (scalar_chain::convert_compare):
> > Likewise generate CCZmode UNSPEC_PTESTs when converting comparisons.
> > (general_scalar_chain::convert_insn): Use CCZmode for COMPARE
> > result.
> > (timode_scalar_chain::convert_insn): Use CCZmode for COMPARE result.
> > * config/i386/i386.cc (ix86_match_ccmode): Allow the SET_SRC to be
> > an UNSPEC, in addition to a COMPARE. Consider CCCmode to be a form
> > of CCmode.
> > * config/i386/sse.md (define_split): When splitting UNSPEC_MOVMSK
> > to UNSPEC_PTEST, preserve the FLAG_REG mode as CCZ.
> > (*<sse4_1>_ptest<mode>): Add asterisk to hide define_insn.
> > Remove ":CC" flags specification, and use ix86_match_ccmode instead.
> > (<sse4_1>_ptestz<mode>): New define_expand to specify CCZ.
> > (<sse4_1>_ptestc<mode>): New define_expand to specify CCC.
> > (<sse4_1>_ptest<mode>): A define_expand using CC to preserve the
> > current behavior.
> > (*ptest<mode>_and): Specify CCZ to only perform this optimization
> > when only the Z flag is required.
> >
> > gcc/testsuite/ChangeLog
> > PR targt/109973
> > * gcc.target/i386/pr109973-1.c: New test case.
> > * gcc.target/i386/pr109973-2.c: Likewise.
> >
> >
> > Thanks,
> > Roger
> > --
> >
@@ -1004,8 +1004,8 @@ BDESC (OPTION_MASK_ISA_SSE4_1, 0, CODE_FOR_sse4_1_roundps_sfix, "__builtin_ia32_
BDESC (OPTION_MASK_ISA_SSE4_1, 0, CODE_FOR_roundv4sf2, "__builtin_ia32_roundps_az", IX86_BUILTIN_ROUNDPS_AZ, UNKNOWN, (int) V4SF_FTYPE_V4SF)
BDESC (OPTION_MASK_ISA_SSE4_1, 0, CODE_FOR_roundv4sf2_sfix, "__builtin_ia32_roundps_az_sfix", IX86_BUILTIN_ROUNDPS_AZ_SFIX, UNKNOWN, (int) V4SI_FTYPE_V4SF)
-BDESC (OPTION_MASK_ISA_SSE4_1, 0, CODE_FOR_sse4_1_ptestv2di, "__builtin_ia32_ptestz128", IX86_BUILTIN_PTESTZ, EQ, (int) INT_FTYPE_V2DI_V2DI_PTEST)
-BDESC (OPTION_MASK_ISA_SSE4_1, 0, CODE_FOR_sse4_1_ptestv2di, "__builtin_ia32_ptestc128", IX86_BUILTIN_PTESTC, LTU, (int) INT_FTYPE_V2DI_V2DI_PTEST)
+BDESC (OPTION_MASK_ISA_SSE4_1, 0, CODE_FOR_sse4_1_ptestzv2di, "__builtin_ia32_ptestz128", IX86_BUILTIN_PTESTZ, EQ, (int) INT_FTYPE_V2DI_V2DI_PTEST)
+BDESC (OPTION_MASK_ISA_SSE4_1, 0, CODE_FOR_sse4_1_ptestcv2di, "__builtin_ia32_ptestc128", IX86_BUILTIN_PTESTC, LTU, (int) INT_FTYPE_V2DI_V2DI_PTEST)
BDESC (OPTION_MASK_ISA_SSE4_1, 0, CODE_FOR_sse4_1_ptestv2di, "__builtin_ia32_ptestnzc128", IX86_BUILTIN_PTESTNZC, GTU, (int) INT_FTYPE_V2DI_V2DI_PTEST)
/* SSE4.2 */
@@ -1164,8 +1164,8 @@ BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_vtestpd256, "__builtin_ia32_vtestnzc
BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_vtestps256, "__builtin_ia32_vtestzps256", IX86_BUILTIN_VTESTZPS256, EQ, (int) INT_FTYPE_V8SF_V8SF_PTEST)
BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_vtestps256, "__builtin_ia32_vtestcps256", IX86_BUILTIN_VTESTCPS256, LTU, (int) INT_FTYPE_V8SF_V8SF_PTEST)
BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_vtestps256, "__builtin_ia32_vtestnzcps256", IX86_BUILTIN_VTESTNZCPS256, GTU, (int) INT_FTYPE_V8SF_V8SF_PTEST)
-BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_ptestv4di, "__builtin_ia32_ptestz256", IX86_BUILTIN_PTESTZ256, EQ, (int) INT_FTYPE_V4DI_V4DI_PTEST)
-BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_ptestv4di, "__builtin_ia32_ptestc256", IX86_BUILTIN_PTESTC256, LTU, (int) INT_FTYPE_V4DI_V4DI_PTEST)
+BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_ptestzv4di, "__builtin_ia32_ptestz256", IX86_BUILTIN_PTESTZ256, EQ, (int) INT_FTYPE_V4DI_V4DI_PTEST)
+BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_ptestcv4di, "__builtin_ia32_ptestc256", IX86_BUILTIN_PTESTC256, LTU, (int) INT_FTYPE_V4DI_V4DI_PTEST)
BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_ptestv4di, "__builtin_ia32_ptestnzc256", IX86_BUILTIN_PTESTNZC256, GTU, (int) INT_FTYPE_V4DI_V4DI_PTEST)
BDESC (OPTION_MASK_ISA_AVX, 0, CODE_FOR_avx_movmskpd256, "__builtin_ia32_movmskpd256", IX86_BUILTIN_MOVMSKPD256, UNKNOWN, (int) INT_FTYPE_V4DF )
@@ -2370,8 +2370,8 @@ ix86_expand_branch (enum rtx_code code, rtx op0, rtx op1, rtx label)
tmp = gen_reg_rtx (mode);
emit_insn (gen_rtx_SET (tmp, gen_rtx_XOR (mode, op0, op1)));
tmp = gen_lowpart (p_mode, tmp);
- emit_insn (gen_rtx_SET (gen_rtx_REG (CCmode, FLAGS_REG),
- gen_rtx_UNSPEC (CCmode,
+ emit_insn (gen_rtx_SET (gen_rtx_REG (CCZmode, FLAGS_REG),
+ gen_rtx_UNSPEC (CCZmode,
gen_rtvec (2, tmp, tmp),
UNSPEC_PTEST)));
tmp = gen_rtx_fmt_ee (code, VOIDmode, flag, const0_rtx);
@@ -974,7 +974,7 @@ general_scalar_chain::convert_op (rtx *op, rtx_insn *insn)
}
}
-/* Convert COMPARE to vector mode. */
+/* Convert CCZmode COMPARE to vector mode. */
rtx
scalar_chain::convert_compare (rtx op1, rtx op2, rtx_insn *insn)
@@ -1023,7 +1023,7 @@ scalar_chain::convert_compare (rtx op1, rtx op2, rtx_insn *insn)
emit_insn_before (gen_rtx_SET (tmp, op11), insn);
op11 = tmp;
}
- return gen_rtx_UNSPEC (CCmode, gen_rtvec (2, op11, op12),
+ return gen_rtx_UNSPEC (CCZmode, gen_rtvec (2, op11, op12),
UNSPEC_PTEST);
}
else
@@ -1052,7 +1052,7 @@ scalar_chain::convert_compare (rtx op1, rtx op2, rtx_insn *insn)
src = tmp;
}
- return gen_rtx_UNSPEC (CCmode, gen_rtvec (2, src, src), UNSPEC_PTEST);
+ return gen_rtx_UNSPEC (CCZmode, gen_rtvec (2, src, src), UNSPEC_PTEST);
}
/* Helper function for converting INSN to vector mode. */
@@ -1219,7 +1219,7 @@ general_scalar_chain::convert_insn (rtx_insn *insn)
break;
case COMPARE:
- dst = gen_rtx_REG (CCmode, FLAGS_REG);
+ dst = gen_rtx_REG (CCZmode, FLAGS_REG);
src = convert_compare (XEXP (src, 0), XEXP (src, 1), insn);
break;
@@ -1726,7 +1726,7 @@ timode_scalar_chain::convert_insn (rtx_insn *insn)
break;
case COMPARE:
- dst = gen_rtx_REG (CCmode, FLAGS_REG);
+ dst = gen_rtx_REG (CCZmode, FLAGS_REG);
src = convert_compare (XEXP (src, 0), XEXP (src, 1), insn);
break;
@@ -15864,7 +15864,8 @@ ix86_match_ccmode (rtx insn, machine_mode req_mode)
if (GET_CODE (set) == PARALLEL)
set = XVECEXP (set, 0, 0);
gcc_assert (GET_CODE (set) == SET);
- gcc_assert (GET_CODE (SET_SRC (set)) == COMPARE);
+ gcc_assert (GET_CODE (SET_SRC (set)) == COMPARE
+ || GET_CODE (SET_SRC (set)) == UNSPEC);
set_mode = GET_MODE (SET_DEST (set));
switch (set_mode)
@@ -15890,10 +15891,12 @@ ix86_match_ccmode (rtx insn, machine_mode req_mode)
case E_CCZmode:
break;
+ case E_CCCmode:
+ if (req_mode == CCmode)
+ break;
+ /* FALLTHRU */
case E_CCGZmode:
-
case E_CCAmode:
- case E_CCCmode:
case E_CCOmode:
case E_CCPmode:
case E_CCSmode:
@@ -20423,10 +20423,10 @@
UNSPEC_MOVMSK)
(match_operand 2 "const_int_operand")))]
"TARGET_SSE4_1 && (INTVAL (operands[2]) == (int) (<vi1avx2const>))"
- [(set (reg:CC FLAGS_REG)
- (unspec:CC [(match_dup 0)
- (match_dup 0)]
- UNSPEC_PTEST))])
+ [(set (reg:CCZ FLAGS_REG)
+ (unspec:CCZ [(match_dup 0)
+ (match_dup 0)]
+ UNSPEC_PTEST))])
(define_expand "sse2_maskmovdqu"
[(set (match_operand:V16QI 0 "memory_operand")
@@ -23078,13 +23078,13 @@
(set_attr "mode" "<MODE>")])
;; ptest is very similar to comiss and ucomiss when setting FLAGS_REG.
-;; But it is not a really compare instruction.
-(define_insn "<sse4_1>_ptest<mode>"
- [(set (reg:CC FLAGS_REG)
- (unspec:CC [(match_operand:V_AVX 0 "register_operand" "Yr, *x, x")
- (match_operand:V_AVX 1 "vector_operand" "YrBm, *xBm, xm")]
- UNSPEC_PTEST))]
- "TARGET_SSE4_1"
+;; But it is not really a compare instruction.
+(define_insn "*<sse4_1>_ptest<mode>"
+ [(set (reg FLAGS_REG)
+ (unspec [(match_operand:V_AVX 0 "register_operand" "Yr, *x, x")
+ (match_operand:V_AVX 1 "vector_operand" "YrBm, *xBm, xm")]
+ UNSPEC_PTEST))]
+ "TARGET_SSE4_1 && ix86_match_ccmode (insn, CCmode)"
"%vptest\t{%1, %0|%0, %1}"
[(set_attr "isa" "noavx,noavx,avx")
(set_attr "type" "ssecomi")
@@ -23097,6 +23097,30 @@
(const_string "*")))
(set_attr "mode" "<sseinsnmode>")])
+;; Expand a ptest to set the Z flag.
+(define_expand "<sse4_1>_ptestz<mode>"
+ [(set (reg:CCZ FLAGS_REG)
+ (unspec:CCZ [(match_operand:V_AVX 0 "register_operand")
+ (match_operand:V_AVX 1 "vector_operand")]
+ UNSPEC_PTEST))]
+ "TARGET_SSE4_1")
+
+;; Expand a ptest to set the C flag
+(define_expand "<sse4_1>_ptestc<mode>"
+ [(set (reg:CCC FLAGS_REG)
+ (unspec:CCC [(match_operand:V_AVX 0 "register_operand")
+ (match_operand:V_AVX 1 "vector_operand")]
+ UNSPEC_PTEST))]
+ "TARGET_SSE4_1")
+
+;; Expand a ptest to set both the Z and C flags
+(define_expand "<sse4_1>_ptest<mode>"
+ [(set (reg:CC FLAGS_REG)
+ (unspec:CC [(match_operand:V_AVX 0 "register_operand")
+ (match_operand:V_AVX 1 "vector_operand")]
+ UNSPEC_PTEST))]
+ "TARGET_SSE4_1")
+
(define_insn "ptesttf2"
[(set (reg:CC FLAGS_REG)
(unspec:CC [(match_operand:TF 0 "register_operand" "Yr, *x, x")
@@ -23111,17 +23135,17 @@
(set_attr "mode" "TI")])
(define_insn_and_split "*ptest<mode>_and"
- [(set (reg:CC FLAGS_REG)
- (unspec:CC [(and:V_AVX (match_operand:V_AVX 0 "register_operand")
- (match_operand:V_AVX 1 "vector_operand"))
- (and:V_AVX (match_dup 0) (match_dup 1))]
+ [(set (reg:CCZ FLAGS_REG)
+ (unspec:CCZ [(and:V_AVX (match_operand:V_AVX 0 "register_operand")
+ (match_operand:V_AVX 1 "vector_operand"))
+ (and:V_AVX (match_dup 0) (match_dup 1))]
UNSPEC_PTEST))]
"TARGET_SSE4_1
&& ix86_pre_reload_split ()"
"#"
"&& 1"
- [(set (reg:CC FLAGS_REG)
- (unspec:CC [(match_dup 0) (match_dup 1)] UNSPEC_PTEST))])
+ [(set (reg:CCZ FLAGS_REG)
+ (unspec:CCZ [(match_dup 0) (match_dup 1)] UNSPEC_PTEST))])
(define_expand "nearbyint<mode>2"
[(set (match_operand:VFH 0 "register_operand")
new file mode 100644
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx2" } */
+
+typedef long long __m256i __attribute__ ((__vector_size__ (32)));
+
+int
+foo (__m256i x, __m256i y)
+{
+ __m256i a = x & y;
+ return __builtin_ia32_ptestc256 (a, a);
+}
+
+/* { dg-final { scan-assembler "vpand" } } */
new file mode 100644
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -msse4.1" } */
+
+typedef long long __m128i __attribute__ ((__vector_size__ (16)));
+
+int
+foo (__m128i x, __m128i y)
+{
+ __m128i a = x & y;
+ return __builtin_ia32_ptestc128 (a, a);
+}
+
+/* { dg-final { scan-assembler "pand" } } */