[PATCH-2,rs6000] Enable vector mode for by pieces equality compare [PR111449]

Message ID 92676187-37ed-4d1f-aad1-c8eb4c938fa5@linux.ibm.com
State Accepted
Headers
Series [PATCH-2,rs6000] Enable vector mode for by pieces equality compare [PR111449] |

Checks

Context Check Description
snail/gcc-patch-check success Github commit url

Commit Message

HAO CHEN GUI Nov. 6, 2023, 2:36 a.m. UTC
  Hi,
  This patch enables vector mode for by pieces equality compare. It
adds a new expand pattern - cbrnachv16qi4 and set MOVE_MAX_PIECES
and COMPARE_MAX_PIECES to 16 bytes when P8 vector enabled. The compare
relies both move and compare instructions, so both macro are changed.
The vector load/store might be unaligned, so the 16-byte move and
compare are only enabled when p8 vector enabled (TARGET_VSX +
TARGET_EFFICIENT_UNALIGNED_VSX).

  This patch enables 16 byte by pieces move. As the vector mode is not
enabled for by pieces move, TImode is used for the move. It caused some
regression cases. I drafted the third patch to fix them.

  Bootstrapped and tested on x86 and powerpc64-linux BE and LE with no
regressions. Is this OK for trunk?

Thanks
Gui Haochen


ChangeLog
rs6000: Enable vector mode for by pieces equality compare

This patch adds a new expand pattern - cbranchv16qi4 to enable vector
mode by pieces equality compare on rs6000.  The macro MOVE_MAX_PIECES
(COMPARE_MAX_PIECES) is set to 16 bytes when P8 vector is enabled,
otherwise keeps unchanged.  The macro STORE_MAX_PIECES is set to the
same value as MOVE_MAX_PIECES by default, so now it's explicitly
defined and keeps unchanged.

gcc/
	PR target/111449
	* config/rs6000/altivec.md (cbranchv16qi4): New expand pattern.
	* config/rs6000/rs6000.cc (rs6000_generate_compare): Generate
	insn sequence for V16QImode equality compare.
	* config/rs6000/rs6000.h (MOVE_MAX_PIECES): Define.
	(STORE_MAX_PIECES): Define.

gcc/testsuite/
	PR target/111449
	* gcc.target/powerpc/pr111449-1.c: New.

patch.diff
  

Comments

Kewen.Lin Nov. 7, 2023, 2:40 a.m. UTC | #1
Hi Haochen,

on 2023/11/6 10:36, HAO CHEN GUI wrote:
> Hi,
>   This patch enables vector mode for by pieces equality compare. It
> adds a new expand pattern - cbrnachv16qi4 and set MOVE_MAX_PIECES
> and COMPARE_MAX_PIECES to 16 bytes when P8 vector enabled. The compare
> relies both move and compare instructions, so both macro are changed.
> The vector load/store might be unaligned, so the 16-byte move and
> compare are only enabled when p8 vector enabled (TARGET_VSX +
> TARGET_EFFICIENT_UNALIGNED_VSX).
> 
>   This patch enables 16 byte by pieces move. As the vector mode is not
> enabled for by pieces move, TImode is used for the move. It caused some
> regression cases. I drafted the third patch to fix them.

Could you also list the failures if the total number is small?

> 
>   Bootstrapped and tested on x86 and powerpc64-linux BE and LE with no
> regressions. Is this OK for trunk?
> 
> Thanks
> Gui Haochen
> 
> 
> ChangeLog
> rs6000: Enable vector mode for by pieces equality compare
> 
> This patch adds a new expand pattern - cbranchv16qi4 to enable vector
> mode by pieces equality compare on rs6000.  The macro MOVE_MAX_PIECES
> (COMPARE_MAX_PIECES) is set to 16 bytes when P8 vector is enabled,
> otherwise keeps unchanged.  The macro STORE_MAX_PIECES is set to the
> same value as MOVE_MAX_PIECES by default, so now it's explicitly
> defined and keeps unchanged.
> 
> gcc/
> 	PR target/111449
> 	* config/rs6000/altivec.md (cbranchv16qi4): New expand pattern.
> 	* config/rs6000/rs6000.cc (rs6000_generate_compare): Generate
> 	insn sequence for V16QImode equality compare.
> 	* config/rs6000/rs6000.h (MOVE_MAX_PIECES): Define.
> 	(STORE_MAX_PIECES): Define.
> 
> gcc/testsuite/
> 	PR target/111449
> 	* gcc.target/powerpc/pr111449-1.c: New.
> 
> patch.diff
> diff --git a/gcc/config/rs6000/altivec.md b/gcc/config/rs6000/altivec.md
> index e8a596fb7e9..d0937f192d6 100644
> --- a/gcc/config/rs6000/altivec.md
> +++ b/gcc/config/rs6000/altivec.md
> @@ -2605,6 +2605,45 @@ (define_insn "altivec_vupklpx"
>  }
>    [(set_attr "type" "vecperm")])
> 
> +/* The cbranch_optabs doesn't allow FAIL, so altivec load/store
> +   instructions are disabled as the cost is high for unaligned
> +   load/store.  */
> +(define_expand "cbranchv16qi4"
> +  [(use (match_operator 0 "equality_operator"
> +	[(match_operand:V16QI 1 "reg_or_mem_operand")
> +	 (match_operand:V16QI 2 "reg_or_mem_operand")]))
> +   (use (match_operand 3))]
> +  "VECTOR_MEM_VSX_P (V16QImode)
> +   && TARGET_EFFICIENT_UNALIGNED_VSX"
> +{
> +  if (!TARGET_P9_VECTOR
> +      && !BYTES_BIG_ENDIAN

Nit: If these two aim to match the below comment "P8 little endian", it
seems easier to read it with "TARGET_P8_VECTOR && !BYTES_BIG_ENDIAN".

> +      && MEM_P (operands[1])
> +      && !altivec_indexed_or_indirect_operand (operands[1], V16QImode)

Need a comment on why it checks altivec_indexed_or_indirect_operand, it's
not obvious.

> +      && MEM_P (operands[2])
> +      && !altivec_indexed_or_indirect_operand (operands[2], V16QImode))
> +    {
> +      /* Use direct move for P8 little endian to skip bswap, as the byte
> +	 order doesn't matter for equality compare.  */
> +      rtx reg_op1 = gen_reg_rtx (V16QImode);
> +      rtx reg_op2 = gen_reg_rtx (V16QImode);
> +      rs6000_emit_le_vsx_permute (reg_op1, operands[1], V16QImode);
> +      rs6000_emit_le_vsx_permute (reg_op2, operands[2], V16QImode);
> +      operands[1] = reg_op1;
> +      operands[2] = reg_op2;
> +    }
> +  else
> +    {
> +      operands[1] = force_reg (V16QImode, operands[1]);
> +      operands[2] = force_reg (V16QImode, operands[2]);
> +    }
> +
> +  rtx_code code = GET_CODE (operands[0]);
> +  operands[0] = gen_rtx_fmt_ee (code, V16QImode, operands[1], operands[2]);
> +  rs6000_emit_cbranch (V16QImode, operands);
> +  DONE;
> +})
> +
>  ;; Compare vectors producing a vector result and a predicate, setting CR6 to
>  ;; indicate a combined status
>  (define_insn "altivec_vcmpequ<VI_char>_p"
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index cc24dd5301e..10279052636 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -15472,6 +15472,18 @@ rs6000_generate_compare (rtx cmp, machine_mode mode)
>  	  else
>  	    emit_insn (gen_stack_protect_testsi (compare_result, op0, op1b));
>  	}
> +      else if (mode == V16QImode)
> +	{
> +	  gcc_assert (code == EQ || code == NE);
> +
> +	  rtx result_vector = gen_reg_rtx (V16QImode);
> +	  rtx cc_bit = gen_reg_rtx (SImode);
> +	  emit_insn (gen_altivec_vcmpequb_p (result_vector, op0, op1));
> +	  emit_insn (gen_cr6_test_for_lt (cc_bit));
> +	  emit_insn (gen_rtx_SET (compare_result,
> +				  gen_rtx_COMPARE (comp_mode, cc_bit,
> +						   const1_rtx)));
> +	}
>        else
>  	emit_insn (gen_rtx_SET (compare_result,
>  				gen_rtx_COMPARE (comp_mode, op0, op1)));
> diff --git a/gcc/config/rs6000/rs6000.h b/gcc/config/rs6000/rs6000.h
> index 22595f6ebd7..51441825e20 100644
> --- a/gcc/config/rs6000/rs6000.h
> +++ b/gcc/config/rs6000/rs6000.h
> @@ -1730,6 +1730,8 @@ typedef struct rs6000_args
>     in one reasonably fast instruction.  */
>  #define MOVE_MAX (! TARGET_POWERPC64 ? 4 : 8)
>  #define MAX_MOVE_MAX 8
> +#define MOVE_MAX_PIECES (TARGET_P8_VECTOR ? 16 : (TARGET_POWERPC64 ? 8 : 4))

But the cbranchv16qi4 condition uses TARGET_EFFICIENT_UNALIGNED_VSX which can be
separately disabled but TARGET_P8_VECTOR still enabled?  So I think we should
keep both consistent.

> +#define STORE_MAX_PIECES (TARGET_POWERPC64 ? 8 : 4)
> 
>  /* Nonzero if access to memory by bytes is no faster than for words.
>     Also nonzero if doing byte operations (specifically shifts) in registers
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr111449-1.c b/gcc/testsuite/gcc.target/powerpc/pr111449-1.c
> new file mode 100644
> index 00000000000..d8583d3303b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr111449-1.c
> @@ -0,0 +1,18 @@
> +/* { dg-do compile { target { has_arch_pwr8 } } } */

Nit: has_arch_pwr8 would make it un-tested on Power7 default env,
I'd prefer to remove this "has_arch_pwr8" and append
"-mdejagnu-cpu=power8" to dg-options.

Otherwise LGTM.

BR,
Kewen

> +/* { dg-require-effective-target powerpc_p8vector_ok } */
> +/* { dg-options "-mvsx -O2" } */
> +
> +/* Ensure vector mode is used for 16-byte memory equality compare.  */
> +
> +int compare1 (const char* s1, const char* s2)
> +{
> +  return __builtin_memcmp (s1, s2, 16) == 0;
> +}
> +
> +int compare2 (const char* s1)
> +{
> +  return __builtin_memcmp (s1, "0123456789012345", 16) == 0;
> +}
> +
> +/* { dg-final { scan-assembler-times {\mvcmpequb\.} 2 } } */
> +/* { dg-final { scan-assembler-not {\mcmpd\M} } } */
  
HAO CHEN GUI Nov. 7, 2023, 3:24 a.m. UTC | #2
Hi Kewen,

   Thanks for your review comments. Just one question on following
comment.

在 2023/11/7 10:40, Kewen.Lin 写道:
> Nit: has_arch_pwr8 would make it un-tested on Power7 default env, I'd prefer to remove this "has_arch_pwr8" and append "-mdejagnu-cpu=power8" to dg-options.

My original propose is to test the case on p8/p9/p10. Each of them
generate different instruction sequence. If it's assigned
"-mdejagnu-cpu=power8", only p8 instruction sequence is generated.
Does it lost the coverage?

Thanks
Gui Haochen
  
Kewen.Lin Nov. 7, 2023, 5:21 a.m. UTC | #3
Hi,

on 2023/11/7 11:24, HAO CHEN GUI wrote:
> Hi Kewen,
> 
>    Thanks for your review comments. Just one question on following
> comment.
> 
> 在 2023/11/7 10:40, Kewen.Lin 写道:
>> Nit: has_arch_pwr8 would make it un-tested on Power7 default env, I'd prefer to remove this "has_arch_pwr8" and append "-mdejagnu-cpu=power8" to dg-options.
> 
> My original propose is to test the case on p8/p9/p10. Each of them
> generate different instruction sequence. If it's assigned

But the actual test point is the same for p8/p9/p10:

+/* { dg-final { scan-assembler-times {\mvcmpequb\.} 2 } } */
+/* { dg-final { scan-assembler-not {\mcmpd\M} } } */

,it checks if we generates vector comparison, the p8/p9/p10
difference doesn't matter in this checking at least.
IMHO the consistency for vector comparison on different cpus
shouldn't be tested in this case either, it should be covered
by some other existing test cases dedicated for vector
comparison instead.  But if you are eager to cover p8/p9/p10,
we can separate the test cases with separated -mdejagnu-cpu=,
but I don't feel it's worthy.  :)

BR,
Kewen

> "-mdejagnu-cpu=power8", only p8 instruction sequence is generated.
> Does it lost the coverage?
> 
> Thanks
> Gui Haochen
  

Patch

diff --git a/gcc/config/rs6000/altivec.md b/gcc/config/rs6000/altivec.md
index e8a596fb7e9..d0937f192d6 100644
--- a/gcc/config/rs6000/altivec.md
+++ b/gcc/config/rs6000/altivec.md
@@ -2605,6 +2605,45 @@  (define_insn "altivec_vupklpx"
 }
   [(set_attr "type" "vecperm")])

+/* The cbranch_optabs doesn't allow FAIL, so altivec load/store
+   instructions are disabled as the cost is high for unaligned
+   load/store.  */
+(define_expand "cbranchv16qi4"
+  [(use (match_operator 0 "equality_operator"
+	[(match_operand:V16QI 1 "reg_or_mem_operand")
+	 (match_operand:V16QI 2 "reg_or_mem_operand")]))
+   (use (match_operand 3))]
+  "VECTOR_MEM_VSX_P (V16QImode)
+   && TARGET_EFFICIENT_UNALIGNED_VSX"
+{
+  if (!TARGET_P9_VECTOR
+      && !BYTES_BIG_ENDIAN
+      && MEM_P (operands[1])
+      && !altivec_indexed_or_indirect_operand (operands[1], V16QImode)
+      && MEM_P (operands[2])
+      && !altivec_indexed_or_indirect_operand (operands[2], V16QImode))
+    {
+      /* Use direct move for P8 little endian to skip bswap, as the byte
+	 order doesn't matter for equality compare.  */
+      rtx reg_op1 = gen_reg_rtx (V16QImode);
+      rtx reg_op2 = gen_reg_rtx (V16QImode);
+      rs6000_emit_le_vsx_permute (reg_op1, operands[1], V16QImode);
+      rs6000_emit_le_vsx_permute (reg_op2, operands[2], V16QImode);
+      operands[1] = reg_op1;
+      operands[2] = reg_op2;
+    }
+  else
+    {
+      operands[1] = force_reg (V16QImode, operands[1]);
+      operands[2] = force_reg (V16QImode, operands[2]);
+    }
+
+  rtx_code code = GET_CODE (operands[0]);
+  operands[0] = gen_rtx_fmt_ee (code, V16QImode, operands[1], operands[2]);
+  rs6000_emit_cbranch (V16QImode, operands);
+  DONE;
+})
+
 ;; Compare vectors producing a vector result and a predicate, setting CR6 to
 ;; indicate a combined status
 (define_insn "altivec_vcmpequ<VI_char>_p"
diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
index cc24dd5301e..10279052636 100644
--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -15472,6 +15472,18 @@  rs6000_generate_compare (rtx cmp, machine_mode mode)
 	  else
 	    emit_insn (gen_stack_protect_testsi (compare_result, op0, op1b));
 	}
+      else if (mode == V16QImode)
+	{
+	  gcc_assert (code == EQ || code == NE);
+
+	  rtx result_vector = gen_reg_rtx (V16QImode);
+	  rtx cc_bit = gen_reg_rtx (SImode);
+	  emit_insn (gen_altivec_vcmpequb_p (result_vector, op0, op1));
+	  emit_insn (gen_cr6_test_for_lt (cc_bit));
+	  emit_insn (gen_rtx_SET (compare_result,
+				  gen_rtx_COMPARE (comp_mode, cc_bit,
+						   const1_rtx)));
+	}
       else
 	emit_insn (gen_rtx_SET (compare_result,
 				gen_rtx_COMPARE (comp_mode, op0, op1)));
diff --git a/gcc/config/rs6000/rs6000.h b/gcc/config/rs6000/rs6000.h
index 22595f6ebd7..51441825e20 100644
--- a/gcc/config/rs6000/rs6000.h
+++ b/gcc/config/rs6000/rs6000.h
@@ -1730,6 +1730,8 @@  typedef struct rs6000_args
    in one reasonably fast instruction.  */
 #define MOVE_MAX (! TARGET_POWERPC64 ? 4 : 8)
 #define MAX_MOVE_MAX 8
+#define MOVE_MAX_PIECES (TARGET_P8_VECTOR ? 16 : (TARGET_POWERPC64 ? 8 : 4))
+#define STORE_MAX_PIECES (TARGET_POWERPC64 ? 8 : 4)

 /* Nonzero if access to memory by bytes is no faster than for words.
    Also nonzero if doing byte operations (specifically shifts) in registers
diff --git a/gcc/testsuite/gcc.target/powerpc/pr111449-1.c b/gcc/testsuite/gcc.target/powerpc/pr111449-1.c
new file mode 100644
index 00000000000..d8583d3303b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr111449-1.c
@@ -0,0 +1,18 @@ 
+/* { dg-do compile { target { has_arch_pwr8 } } } */
+/* { dg-require-effective-target powerpc_p8vector_ok } */
+/* { dg-options "-mvsx -O2" } */
+
+/* Ensure vector mode is used for 16-byte memory equality compare.  */
+
+int compare1 (const char* s1, const char* s2)
+{
+  return __builtin_memcmp (s1, s2, 16) == 0;
+}
+
+int compare2 (const char* s1)
+{
+  return __builtin_memcmp (s1, "0123456789012345", 16) == 0;
+}
+
+/* { dg-final { scan-assembler-times {\mvcmpequb\.} 2 } } */
+/* { dg-final { scan-assembler-not {\mcmpd\M} } } */