[committed] amdgcn: add -march=gfx1030 EXPERIMENTAL

Message ID b99badf1-895a-4ff1-b68f-abc665a85625@codesourcery.com
State Unresolved
Headers
Series [committed] amdgcn: add -march=gfx1030 EXPERIMENTAL |

Checks

Context Check Description
snail/gcc-patch-check warning Git am fail log

Commit Message

Andrew Stubbs Oct. 20, 2023, 11:51 a.m. UTC
  I've committed this patch that allows building binaries for AMD gfx1030 
GPUs. I can't actually test it, however, so somebody else will have to 
debug it (or wait for me to get my hands on a device). Richi reports 
that it does not execute correctly, as is.

This is an experimental broken feature, so the multilib is deliberatly 
not built in the default configuration, and not (yet) documented.

If you want to try it you will also need to apply the attached Newlib 
patch. I shall submit that to the Newlib list shortly. Then configure 
the accelerator compiler as follows: --with-arch=gfx1030.

Andrew
amdgcn: remove unnecessary scalar cache flush

The exit code isn't actually written via the scalar cache so the cache flush
is not actually needed.
amdgcn: add -march=gfx1030 EXPERIMENTAL

Accept the architecture configure option and resolve build failures.  This is
enough to build binaries, but I've not got a device to test it on, so there
are probably runtime issues to fix.  The cache control instructions might be
unsafe (or too conservative), and the kernel metadata might be off.  Vector
reductions will need to be reworked for RDNA2.  In principle, it would be
better to use wavefrontsize32 for this architecture, but that would mean
switching everything to allow SImode masks, so wavefrontsize64 it is.

The multilib is not included in the default configuration so either configure
--with-arch=gfx1030 or include it in --with-multilib-list=gfx1030,....

The majority of this patch has no effect on other devices, but changing from
using scalar writes for the exit value to vector writes means we don't need
the scalar cache write-back instruction anywhere (which doesn't exist in RDNA2).

gcc/ChangeLog:

	* config.gcc: Allow --with-arch=gfx1030.
	* config/gcn/gcn-hsa.h (NO_XNACK): gfx1030 does not support xnack.
	(ASM_SPEC): gfx1030 needs -mattr=+wavefrontsize64 set.
	* config/gcn/gcn-opts.h (enum processor_type): Add PROCESSOR_GFX1030.
	(TARGET_GFX1030): New.
	(TARGET_RDNA2): New.
	* config/gcn/gcn-valu.md (@dpp_move<mode>): Disable for RDNA2.
	(addc<mode>3<exec_vcc>): Add RDNA2 syntax variant.
	(subc<mode>3<exec_vcc>): Likewise.
	(<convop><mode><vndi>2_exec): Add RDNA2 alternatives.
	(vec_cmp<mode>di): Likewise.
	(vec_cmp<u><mode>di): Likewise.
	(vec_cmp<mode>di_exec): Likewise.
	(vec_cmp<u><mode>di_exec): Likewise.
	(vec_cmp<mode>di_dup): Likewise.
	(vec_cmp<mode>di_dup_exec): Likewise.
	(reduc_<reduc_op>_scal_<mode>): Disable for RDNA2.
	(*<reduc_op>_dpp_shr_<mode>): Likewise.
	(*plus_carry_dpp_shr_<mode>): Likewise.
	(*plus_carry_in_dpp_shr_<mode>): Likewise.
	* config/gcn/gcn.cc (gcn_option_override): Recognise gfx1030.
	(gcn_global_address_p): RDNA2 only allows smaller offsets.
	(gcn_addr_space_legitimate_address_p): Likewise.
	(gcn_omp_device_kind_arch_isa): Recognise gfx1030.
	(gcn_expand_epilogue): Use VGPRs instead of SGPRs.
	(output_file_start): Configure gfx1030.
	* config/gcn/gcn.h (TARGET_CPU_CPP_BUILTINS): Add __RDNA2__;
	(ASSEMBLER_DIALECT): New.
	* config/gcn/gcn.md (rdna): New define_attr.
	(enabled): Use "rdna" attribute.
	(gcn_return): Remove s_dcache_wb.
	(addcsi3_scalar): Add RDNA2 syntax variant.
	(addcsi3_scalar_zero): Likewise.
	(addptrdi3): Likewise.
	(mulsi3): v_mul_lo_i32 should be v_mul_lo_u32 on all ISA.
	(*memory_barrier): Add RDNA2 syntax variant.
	(atomic_load<mode>): Add RDNA2 cache control variants, and disable
	scalar atomics for RDNA2.
	(atomic_store<mode>): Likewise.
	(atomic_exchange<mode>): Likewise.
	* config/gcn/gcn.opt (gpu_type): Add gfx1030.
	* config/gcn/mkoffload.cc (EF_AMDGPU_MACH_AMDGCN_GFX1030): New.
	(main): Recognise -march=gfx1030.
	* config/gcn/t-omp-device: Add gfx1030 isa.

libgcc/ChangeLog:

	* config/gcn/amdgcn_veclib.h (CDNA3_PLUS): Set false for __RDNA2__.

libgomp/ChangeLog:

	* plugin/plugin-gcn.c (EF_AMDGPU_MACH_AMDGCN_GFX1030): New.
	(isa_hsa_name): Recognise gfx1030.
	(isa_code): Likewise.
	* team.c (defined): Remove s_endpgm.

diff --git a/gcc/config.gcc b/gcc/config.gcc
index 9c397156868..0782cbc6e91 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -4529,7 +4529,7 @@ case "${target}" in
 		for which in arch tune; do
 			eval "val=\$with_$which"
 			case ${val} in
-			"" | fiji | gfx900 | gfx906 | gfx908 | gfx90a)
+			"" | fiji | gfx900 | gfx906 | gfx908 | gfx90a | gfx1030)
 				# OK
 				;;
 			*)
diff --git a/gcc/config/gcn/gcn-hsa.h b/gcc/config/gcn/gcn-hsa.h
index 0b5610bbcbe..aa1294cf48f 100644
--- a/gcc/config/gcn/gcn-hsa.h
+++ b/gcc/config/gcn/gcn-hsa.h
@@ -75,7 +75,7 @@ extern unsigned int gcn_local_sym_hash (const char *name);
    supported for gcn.  */
 #define GOMP_SELF_SPECS ""
 
-#define NO_XNACK "!march=*:;march=fiji:;"
+#define NO_XNACK "!march=*:;march=fiji:;march=gfx1030:;"
 #define NO_SRAM_ECC "!march=*:;march=fiji:;march=gfx900:;march=gfx906:;"
 
 /* In HSACOv4 no attribute setting means the binary supports "any" hardware
@@ -92,6 +92,7 @@ extern unsigned int gcn_local_sym_hash (const char *name);
 		  "%{!march=*|march=fiji:--amdhsa-code-object-version=3} " \
 		  "%{" NO_XNACK XNACKOPT "}" \
 		  "%{" NO_SRAM_ECC SRAMOPT "} " \
+		  "%{march=gfx1030:-mattr=+wavefrontsize64} " \
 		  "-filetype=obj"
 #define LINK_SPEC "--pie --export-dynamic"
 #define LIB_SPEC  "-lc"
diff --git a/gcc/config/gcn/gcn-opts.h b/gcc/config/gcn/gcn-opts.h
index f780a7c17fe..b4f494d868c 100644
--- a/gcc/config/gcn/gcn-opts.h
+++ b/gcc/config/gcn/gcn-opts.h
@@ -24,7 +24,8 @@ enum processor_type
   PROCESSOR_VEGA10,  // gfx900
   PROCESSOR_VEGA20,  // gfx906
   PROCESSOR_GFX908,
-  PROCESSOR_GFX90a
+  PROCESSOR_GFX90a,
+  PROCESSOR_GFX1030
 };
 
 #define TARGET_FIJI (gcn_arch == PROCESSOR_FIJI)
@@ -32,12 +33,14 @@ enum processor_type
 #define TARGET_VEGA20 (gcn_arch == PROCESSOR_VEGA20)
 #define TARGET_GFX908 (gcn_arch == PROCESSOR_GFX908)
 #define TARGET_GFX90a (gcn_arch == PROCESSOR_GFX90a)
+#define TARGET_GFX1030 (gcn_arch == PROCESSOR_GFX1030)
 
 /* Set in gcn_option_override.  */
 extern enum gcn_isa {
   ISA_UNKNOWN,
   ISA_GCN3,
   ISA_GCN5,
+  ISA_RDNA2,
   ISA_CDNA1,
   ISA_CDNA2
 } gcn_isa;
@@ -50,6 +53,8 @@ extern enum gcn_isa {
 #define TARGET_CDNA1_PLUS (gcn_isa >= ISA_CDNA1)
 #define TARGET_CDNA2 (gcn_isa == ISA_CDNA2)
 #define TARGET_CDNA2_PLUS (gcn_isa >= ISA_CDNA2)
+#define TARGET_RDNA2 (gcn_isa == ISA_RDNA2)
+
 
 #define TARGET_M0_LDS_LIMIT (TARGET_GCN3)
 #define TARGET_PACKED_WORK_ITEMS (TARGET_CDNA2_PLUS)
diff --git a/gcc/config/gcn/gcn-valu.md b/gcc/config/gcn/gcn-valu.md
index 32b170e8522..c128c819c89 100644
--- a/gcc/config/gcn/gcn-valu.md
+++ b/gcc/config/gcn/gcn-valu.md
@@ -1412,7 +1412,7 @@ (define_insn "@dpp_move<mode>"
 	  [(match_operand:V_noHI 1 "register_operand" " v")
 	   (match_operand:SI 2 "const_int_operand"    " n")]
 	  UNSPEC_MOV_DPP_SHR))]
-  ""
+  "!TARGET_RDNA2"
   {
     return gcn_expand_dpp_shr_insn (<MODE>mode, "v_mov_b32",
 				    UNSPEC_MOV_DPP_SHR, INTVAL (operands[2]));
@@ -1548,7 +1548,7 @@ (define_insn "addc<mode>3<exec_vcc>"
 			  (match_dup 1))
 			(match_dup 1))))]
   ""
-  "v_addc%^_u32\t%0, %4, %2, %1, %3"
+  "{v_addc%^_u32|v_add_co_ci_u32}\t%0, %4, %2, %1, %3"
   [(set_attr "type" "vop2,vop3b")
    (set_attr "length" "4,8")])
 
@@ -1613,10 +1613,10 @@ (define_insn "subc<mode>3<exec_vcc>"
 			(match_dup 1))))]
   ""
   "@
-   v_subb%^_u32\t%0, %4, %1, %2, %3
-   v_subb%^_u32\t%0, %4, %1, %2, %3
-   v_subbrev%^_u32\t%0, %4, %2, %1, %3
-   v_subbrev%^_u32\t%0, %4, %2, %1, %3"
+   {v_subb%^_u32|v_sub_co_ci_u32}\t%0, %4, %1, %2, %3
+   {v_subb%^_u32|v_sub_co_ci_u32}\t%0, %4, %1, %2, %3
+   {v_subbrev%^_u32|v_subrev_co_ci_u32}\t%0, %4, %2, %1, %3
+   {v_subbrev%^_u32|v_subrev_co_ci_u32}\t%0, %4, %2, %1, %3"
   [(set_attr "type" "vop2,vop3b,vop2,vop3b")
    (set_attr "length" "4,8,4,8")])
 
@@ -3667,11 +3667,11 @@ (define_insn_and_split "<convop><mode><vndi>2_exec"
 ;; {{{ Vector comparison/merge
 
 (define_insn "vec_cmp<mode>di"
-  [(set (match_operand:DI 0 "register_operand"	      "=cV,cV,  e, e,Sg,Sg")
+  [(set (match_operand:DI 0 "register_operand"	      "=cV,cV,  e, e,Sg,Sg,  e, e")
 	(match_operator:DI 1 "gcn_fp_compare_operator"
-	  [(match_operand:V_noQI 2 "gcn_alu_operand"  "vSv, B,vSv, B, v,vA")
-	   (match_operand:V_noQI 3 "gcn_vop3_operand" "  v, v,  v, v,vA, v")]))
-   (clobber (match_scratch:DI 4			      "= X, X, cV,cV, X, X"))]
+	  [(match_operand:V_noQI 2 "gcn_alu_operand"  "vSv, B,vSv, B, v,vA,vSv, B")
+	   (match_operand:V_noQI 3 "gcn_vop3_operand" "  v, v,  v, v,vA, v,  v, v")]))
+   (clobber (match_scratch:DI 4			      "= X, X, cV,cV, X, X,  X, X"))]
   ""
   "@
    v_cmp%E1\tvcc, %2, %3
@@ -3679,9 +3679,12 @@ (define_insn "vec_cmp<mode>di"
    v_cmpx%E1\tvcc, %2, %3
    v_cmpx%E1\tvcc, %2, %3
    v_cmp%E1\t%0, %2, %3
-   v_cmp%E1\t%0, %2, %3"
-  [(set_attr "type" "vopc,vopc,vopc,vopc,vop3a,vop3a")
-   (set_attr "length" "4,8,4,8,8,8")])
+   v_cmp%E1\t%0, %2, %3
+   v_cmpx%E1\t%2, %3
+   v_cmpx%E1\t%2, %3"
+  [(set_attr "type" "vopc,vopc,vopc,vopc,vop3a,vop3a,vopc,vopc")
+   (set_attr "length" "4,8,4,8,8,8,4,8")
+   (set_attr "rdna" "*,*,no,no,*,*,yes,yes")])
 
 (define_expand "vec_cmpu<mode>di"
   [(match_operand:DI 0 "register_operand")
@@ -3716,13 +3719,13 @@ (define_expand "vec_cmp<u><mode>di"
   })
 
 (define_insn "vec_cmp<mode>di_exec"
-  [(set (match_operand:DI 0 "register_operand"	       "=cV,cV,  e, e,Sg,Sg")
+  [(set (match_operand:DI 0 "register_operand"	       "=cV,cV,  e, e,Sg,Sg,  e, e")
 	(and:DI
 	  (match_operator 1 "gcn_fp_compare_operator"
-	    [(match_operand:V_noQI 2 "gcn_alu_operand" "vSv, B,vSv, B, v,vA")
-	     (match_operand:V_noQI 3 "gcn_vop3_operand" " v, v,  v, v,vA, v")])
-	  (match_operand:DI 4 "gcn_exec_reg_operand"   "  e, e,  e, e, e, e")))
-   (clobber (match_scratch:DI 5			       "= X, X, cV,cV, X, X"))]
+	    [(match_operand:V_noQI 2 "gcn_alu_operand" "vSv, B,vSv, B, v,vA,vSv, B")
+	     (match_operand:V_noQI 3 "gcn_vop3_operand" " v, v,  v, v,vA, v,  v, v")])
+	  (match_operand:DI 4 "gcn_exec_reg_operand"   "  e, e,  e, e, e, e,  e, e")))
+   (clobber (match_scratch:DI 5			       "= X, X, cV,cV, X, X,  X, X"))]
   ""
   "@
    v_cmp%E1\tvcc, %2, %3
@@ -3730,9 +3733,12 @@ (define_insn "vec_cmp<mode>di_exec"
    v_cmpx%E1\tvcc, %2, %3
    v_cmpx%E1\tvcc, %2, %3
    v_cmp%E1\t%0, %2, %3
-   v_cmp%E1\t%0, %2, %3"
-  [(set_attr "type" "vopc,vopc,vopc,vopc,vop3a,vop3a")
-   (set_attr "length" "4,8,4,8,8,8")])
+   v_cmp%E1\t%0, %2, %3
+   v_cmpx%E1\t%2, %3
+   v_cmpx%E1\t%2, %3"
+  [(set_attr "type" "vopc,vopc,vopc,vopc,vop3a,vop3a,vopc,vopc")
+   (set_attr "length" "4,8,4,8,8,8,4,8")
+   (set_attr "rdna" "*,*,no,no,*,*,yes,yes")])
 
 (define_expand "vec_cmpu<mode>di_exec"
   [(match_operand:DI 0 "register_operand")
@@ -3772,42 +3778,48 @@ (define_expand "vec_cmp<u><mode>di_exec"
   })
 
 (define_insn "vec_cmp<mode>di_dup"
-  [(set (match_operand:DI 0 "register_operand"		   "=cV,cV, e,e,Sg")
+  [(set (match_operand:DI 0 "register_operand"		   "=cV,cV, e,e,Sg, e,e")
 	(match_operator:DI 1 "gcn_fp_compare_operator"
 	  [(vec_duplicate:V_noQI
 	     (match_operand:<SCALAR_MODE> 2 "gcn_alu_operand"
-							   " Sv, B,Sv,B, A"))
-	   (match_operand:V_noQI 3 "gcn_vop3_operand"	   "  v, v, v,v, v")]))
-   (clobber (match_scratch:DI 4				   "= X,X,cV,cV, X"))]
+							   " Sv, B,Sv,B, A,Sv,B"))
+	   (match_operand:V_noQI 3 "gcn_vop3_operand"	   "  v, v, v,v, v, v,v")]))
+   (clobber (match_scratch:DI 4				   "= X,X,cV,cV, X, X,X"))]
   ""
   "@
    v_cmp%E1\tvcc, %2, %3
    v_cmp%E1\tvcc, %2, %3
    v_cmpx%E1\tvcc, %2, %3
    v_cmpx%E1\tvcc, %2, %3
-   v_cmp%E1\t%0, %2, %3"
-  [(set_attr "type" "vopc,vopc,vopc,vopc,vop3a")
-   (set_attr "length" "4,8,4,8,8")])
+   v_cmp%E1\t%0, %2, %3
+   v_cmpx%E1\t%2, %3
+   v_cmpx%E1\t%2, %3"
+  [(set_attr "type" "vopc,vopc,vopc,vopc,vop3a,vopc,vopc")
+   (set_attr "length" "4,8,4,8,8,4,8")
+   (set_attr "rdna" "*,*,no,no,*,yes,yes")])
 
 (define_insn "vec_cmp<mode>di_dup_exec"
-  [(set (match_operand:DI 0 "register_operand"		    "=cV,cV, e,e,Sg")
+  [(set (match_operand:DI 0 "register_operand"		    "=cV,cV, e,e,Sg, e,e")
 	(and:DI
 	  (match_operator 1 "gcn_fp_compare_operator"
 	    [(vec_duplicate:V_noQI
 	       (match_operand:<SCALAR_MODE> 2 "gcn_alu_operand"
-							    " Sv, B,Sv,B, A"))
-	     (match_operand:V_noQI 3 "gcn_vop3_operand"	    "  v, v, v,v, v")])
-	  (match_operand:DI 4 "gcn_exec_reg_operand"	    "  e, e, e,e, e")))
-   (clobber (match_scratch:DI 5				    "= X,X,cV,cV, X"))]
+							    " Sv, B,Sv,B, A,Sv,B"))
+	     (match_operand:V_noQI 3 "gcn_vop3_operand"	    "  v, v, v,v, v, v,v")])
+	  (match_operand:DI 4 "gcn_exec_reg_operand"	    "  e, e, e,e, e, e,e")))
+   (clobber (match_scratch:DI 5				    "= X,X,cV,cV, X, X,X"))]
   ""
   "@
    v_cmp%E1\tvcc, %2, %3
    v_cmp%E1\tvcc, %2, %3
    v_cmpx%E1\tvcc, %2, %3
    v_cmpx%E1\tvcc, %2, %3
-   v_cmp%E1\t%0, %2, %3"
-  [(set_attr "type" "vopc,vopc,vopc,vopc,vop3a")
-   (set_attr "length" "4,8,4,8,8")])
+   v_cmp%E1\t%0, %2, %3
+   v_cmpx%E1\t%2, %3
+   v_cmpx%E1\t%2, %3"
+  [(set_attr "type" "vopc,vopc,vopc,vopc,vop3a,vopc,vopc")
+   (set_attr "length" "4,8,4,8,8,4,8")
+   (set_attr "rdna" "*,*,no,no,*,yes,yes")])
 
 (define_expand "vcond_mask_<mode>di"
   [(parallel
@@ -4176,7 +4188,7 @@ (define_expand "reduc_<reduc_op>_scal_<mode>"
 	(unspec:<SCALAR_MODE>
 	  [(match_operand:V_ALL 1 "register_operand")]
 	  REDUC_UNSPEC))]
-  ""
+  "!TARGET_RDNA2"
   {
     rtx tmp = gcn_expand_reduc_scalar (<MODE>mode, operands[1],
 				       <reduc_unspec>);
@@ -4229,7 +4241,8 @@ (define_insn "*<reduc_op>_dpp_shr_<mode>"
 	  REDUC_UNSPEC))]
   ; GCN3 requires a carry out, GCN5 not
   "!(TARGET_GCN3 && SCALAR_INT_MODE_P (<SCALAR_MODE>mode)
-     && <reduc_unspec> == UNSPEC_PLUS_DPP_SHR)"
+     && <reduc_unspec> == UNSPEC_PLUS_DPP_SHR)
+   && !TARGET_RDNA2"
   {
     return gcn_expand_dpp_shr_insn (<MODE>mode, "<reduc_insn>",
 				    <reduc_unspec>, INTVAL (operands[3]));
@@ -4274,7 +4287,7 @@ (define_insn "*plus_carry_dpp_shr_<mode>"
 	   (match_operand:SI 3 "const_int_operand"	  "n")]
 	  UNSPEC_PLUS_CARRY_DPP_SHR))
    (clobber (reg:DI VCC_REG))]
-  ""
+  "!TARGET_RDNA2"
   {
     return gcn_expand_dpp_shr_insn (<VnSI>mode, "v_add%^_u32",
 				    UNSPEC_PLUS_CARRY_DPP_SHR,
@@ -4292,7 +4305,7 @@ (define_insn "*plus_carry_in_dpp_shr_<mode>"
 	   (match_operand:DI 4 "register_operand"   "cV")]
 	  UNSPEC_PLUS_CARRY_IN_DPP_SHR))
    (clobber (reg:DI VCC_REG))]
-  ""
+  "!TARGET_RDNA2"
   {
     return gcn_expand_dpp_shr_insn (<MODE>mode, "v_addc%^_u32",
 				    UNSPEC_PLUS_CARRY_IN_DPP_SHR,
diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc
index ef3b6472a52..6f85f55803c 100644
--- a/gcc/config/gcn/gcn.cc
+++ b/gcc/config/gcn/gcn.cc
@@ -136,6 +136,7 @@ gcn_option_override (void)
       : gcn_arch == PROCESSOR_VEGA20 ? ISA_GCN5
       : gcn_arch == PROCESSOR_GFX908 ? ISA_CDNA1
       : gcn_arch == PROCESSOR_GFX90a ? ISA_CDNA2
+      : gcn_arch == PROCESSOR_GFX1030 ? ISA_RDNA2
       : ISA_UNKNOWN);
   gcc_assert (gcn_isa != ISA_UNKNOWN);
 
@@ -1616,6 +1617,7 @@ gcn_global_address_p (rtx addr)
     {
       rtx base = XEXP (addr, 0);
       rtx offset = XEXP (addr, 1);
+      int offsetbits = (TARGET_RDNA2 ? 11 : 12);
       bool immediate_p = (CONST_INT_P (offset)
 			  && INTVAL (offset) >= -(1 << 12)
 			  && INTVAL (offset) < (1 << 12));
@@ -1748,10 +1750,11 @@ gcn_addr_space_legitimate_address_p (machine_mode mode, rtx x, bool strict,
 	  rtx base = XEXP (x, 0);
 	  rtx offset = XEXP (x, 1);
 
+	  int offsetbits = (TARGET_RDNA2 ? 11 : 12);
 	  bool immediate_p = (GET_CODE (offset) == CONST_INT
-			      /* Signed 13-bit immediate.  */
-			      && INTVAL (offset) >= -(1 << 12)
-			      && INTVAL (offset) < (1 << 12)
+			      /* Signed 12/13-bit immediate.  */
+			      && INTVAL (offset) >= -(1 << offsetbits)
+			      && INTVAL (offset) < (1 << offsetbits)
 			      /* The low bits of the offset are ignored, even
 			         when they're meant to realign the pointer.  */
 			      && !(INTVAL (offset) & 0x3));
@@ -3029,6 +3032,8 @@ gcn_omp_device_kind_arch_isa (enum omp_device_kind_arch_isa trait,
 	return gcn_arch == PROCESSOR_GFX908;
       if (strcmp (name, "gfx90a") == 0)
 	return gcn_arch == PROCESSOR_GFX90a;
+      if (strcmp (name, "gfx1030") == 0)
+	return gcn_arch == PROCESSOR_GFX1030;
       return 0;
     default:
       gcc_unreachable ();
@@ -3610,9 +3615,11 @@ gcn_expand_epilogue (void)
       set_mem_addr_space (retptr_mem, ADDR_SPACE_SCALAR_FLAT);
       emit_move_insn (kernarg_reg, retptr_mem);
 
-      rtx retval_mem = gen_rtx_MEM (SImode, kernarg_reg);
-      rtx scalar_retval = gen_rtx_REG (SImode, FIRST_PARM_REG);
-      set_mem_addr_space (retval_mem, ADDR_SPACE_SCALAR_FLAT);
+      rtx retval_addr = gen_rtx_REG (DImode, FIRST_VPARM_REG);
+      emit_move_insn (retval_addr, kernarg_reg);
+      rtx retval_mem = gen_rtx_MEM (SImode, retval_addr);
+      rtx scalar_retval = gen_rtx_REG (SImode, FIRST_VPARM_REG + 2);
+      set_mem_addr_space (retval_mem, ADDR_SPACE_FLAT);
       emit_move_insn (scalar_retval, gen_rtx_REG (SImode, RETURN_VALUE_REG));
       emit_move_insn (retval_mem, scalar_retval);
     }
@@ -6454,6 +6461,11 @@ output_file_start (void)
     case PROCESSOR_GFX90a:
       cpu = "gfx90a";
       break;
+    case PROCESSOR_GFX1030:
+      cpu = "gfx1030";
+      xnack = "";
+      sram_ecc = "";
+      break;
     default: gcc_unreachable ();
     }
 
diff --git a/gcc/config/gcn/gcn.h b/gcc/config/gcn/gcn.h
index 4ff9a5d4d12..6372f49d379 100644
--- a/gcc/config/gcn/gcn.h
+++ b/gcc/config/gcn/gcn.h
@@ -28,6 +28,8 @@
 	builtin_define ("__CDNA1__");                                          \
       else if (TARGET_CDNA2)                                                   \
 	builtin_define ("__CDNA2__");                                          \
+      else if (TARGET_RDNA2)                                                   \
+	builtin_define ("__RDNA2__");                                          \
       if (TARGET_FIJI)                                                         \
 	{                                                                      \
 	  builtin_define ("__fiji__");                                         \
@@ -43,6 +45,8 @@
 	builtin_define ("__gfx90a__");                                         \
   } while (0)
 
+#define ASSEMBLER_DIALECT (TARGET_RDNA2 ? 1 : 0)
+
 /* Support for a compile-time default architecture and tuning.
    The rules are:
    --with-arch is ignored if -march is specified.
diff --git a/gcc/config/gcn/gcn.md b/gcc/config/gcn/gcn.md
index 30fe9e34a35..a3d8beefd6d 100644
--- a/gcc/config/gcn/gcn.md
+++ b/gcc/config/gcn/gcn.md
@@ -285,9 +285,16 @@ (define_attr "length" ""
 ; Disable alternatives that only apply to specific ISA variants.
 
 (define_attr "gcn_version" "gcn3,gcn5" (const_string "gcn3"))
+(define_attr "rdna" "any,no,yes" (const_string "any"))
 
 (define_attr "enabled" ""
-  (cond [(eq_attr "gcn_version" "gcn3") (const_int 1)
+  (cond [(and (eq_attr "rdna" "no")
+	      (ne (symbol_ref "TARGET_RDNA2") (const_int 0)))
+	   (const_int 0)
+	 (and (eq_attr "rdna" "yes")
+	      (eq (symbol_ref "TARGET_RDNA2") (const_int 0)))
+	   (const_int 0)
+	 (eq_attr "gcn_version" "gcn3") (const_int 1)
 	 (and (eq_attr "gcn_version" "gcn5")
 	      (ne (symbol_ref "TARGET_GCN5_PLUS") (const_int 0)))
 	   (const_int 1)]
@@ -812,7 +819,7 @@ (define_insn "gcn_return"
     if (cfun && cfun->machine && cfun->machine->normal_function)
       return "s_setpc_b64\ts[18:19]";
     else
-      return "s_waitcnt\tlgkmcnt(0)\;s_dcache_wb\;s_endpgm";
+      return "s_waitcnt\tlgkmcnt(0)\;s_endpgm";
   }
   [(set_attr "type" "sop1")
    (set_attr "length" "12")])
@@ -1179,7 +1186,7 @@ (define_insn "addcsi3_scalar"
   ""
   "@
    s_addc_u32\t%0, %1, %2
-   v_addc%^_u32\t%0, vcc, %2, %1, vcc"
+   {v_addc%^_u32|v_add_co_ci_u32}\t%0, vcc, %2, %1, vcc"
   [(set_attr "type" "sop2,vop2")
    (set_attr "length" "8,4")])
 
@@ -1195,7 +1202,7 @@ (define_insn "addcsi3_scalar_zero"
   ""
   "@
    s_addc_u32\t%0, %1, 0
-   v_addc%^_u32\t%0, vcc, 0, %1, vcc"
+   {v_addc%^_u32|v_add_co_ci_u32}\t%0, vcc, 0, %1, vcc"
   [(set_attr "type" "sop2,vop2")
    (set_attr "length" "4")])
 
@@ -1225,7 +1232,8 @@ (define_insn "addptrdi3"
 				gen_rtx_REG (DImode, CC_SAVE_REG) };
 
 	output_asm_insn ("v_add%^_u32\t%L0, %3, %L2, %L1", new_operands);
-	output_asm_insn ("v_addc%^_u32\t%H0, %3, %H2, %H1, %3", new_operands);
+	output_asm_insn ("{v_addc%^_u32|v_add_co_ci_u32}\t%H0, %3, %H2, %H1, %3",
+			 new_operands);
       }
     else
       {
@@ -1363,7 +1371,7 @@ (define_insn "mulsi3"
    s_mul_i32\t%0, %1, %2
    s_mulk_i32\t%0, %2
    s_mul_i32\t%0, %1, %2
-   v_mul_lo_i32\t%0, %1, %2"
+   v_mul_lo_u32\t%0, %1, %2"
   [(set_attr "type" "sop2,sopk,sop2,vop3a")
    (set_attr "length" "4,4,8,4")])
 
@@ -1885,7 +1893,7 @@ (define_insn "*memory_barrier"
   [(set (match_operand:BLK 0)
 	(unspec:BLK [(match_dup 0)] UNSPEC_MEMORY_BARRIER))]
   ""
-  "buffer_wbinvl1_vol"
+  "{buffer_wbinvl1_vol|buffer_gl0_inv}"
   [(set_attr "type" "mubuf")
    (set_attr "length" "4")])
 
@@ -2004,6 +2012,7 @@ (define_insn "atomic_load<mode>"
    (use (match_operand:SIDI 2 "immediate_operand" "  i, i, i"))]
   ""
   {
+    /* FIXME: RDNA cache instructions may be too conservative?  */
     switch (INTVAL (operands[2]))
       {
       case MEMMODEL_RELAXED:
@@ -2026,11 +2035,17 @@ (define_insn "atomic_load<mode>"
 	    return "s_load%o0\t%0, %A1 glc\;s_waitcnt\tlgkmcnt(0)\;"
 		   "s_dcache_wb_vol";
 	  case 1:
-	    return "flat_load%o0\t%0, %A1%O1 glc\;s_waitcnt\t0\;"
-		   "buffer_wbinvl1_vol";
+	    return (TARGET_RDNA2
+		    ? "flat_load%o0\t%0, %A1%O1 glc\;s_waitcnt\t0\;"
+		      "buffer_gl0_inv"
+		    : "flat_load%o0\t%0, %A1%O1 glc\;s_waitcnt\t0\;"
+		      "buffer_wbinvl1_vol");
 	  case 2:
-	    return "global_load%o0\t%0, %A1%O1 glc\;s_waitcnt\tvmcnt(0)\;"
-		   "buffer_wbinvl1_vol";
+	    return (TARGET_RDNA2
+		    ? "global_load%o0\t%0, %A1%O1 glc\;s_waitcnt\tvmcnt(0)\;"
+		      "buffer_gl0_inv"
+		    : "global_load%o0\t%0, %A1%O1 glc\;s_waitcnt\tvmcnt(0)\;"
+		      "buffer_wbinvl1_vol");
 	  }
 	break;
       case MEMMODEL_ACQ_REL:
@@ -2042,11 +2057,17 @@ (define_insn "atomic_load<mode>"
 	    return "s_dcache_wb_vol\;s_load%o0\t%0, %A1 glc\;"
 		   "s_waitcnt\tlgkmcnt(0)\;s_dcache_inv_vol";
 	  case 1:
-	    return "buffer_wbinvl1_vol\;flat_load%o0\t%0, %A1%O1 glc\;"
-		   "s_waitcnt\t0\;buffer_wbinvl1_vol";
+	    return (TARGET_RDNA2
+		    ? "buffer_gl0_inv\;flat_load%o0\t%0, %A1%O1 glc\;"
+		      "s_waitcnt\t0\;buffer_gl0_inv"
+		    : "buffer_wbinvl1_vol\;flat_load%o0\t%0, %A1%O1 glc\;"
+		      "s_waitcnt\t0\;buffer_wbinvl1_vol");
 	  case 2:
-	    return "buffer_wbinvl1_vol\;global_load%o0\t%0, %A1%O1 glc\;"
-		   "s_waitcnt\tvmcnt(0)\;buffer_wbinvl1_vol";
+	    return (TARGET_RDNA2
+		    ? "buffer_gl0_inv\;global_load%o0\t%0, %A1%O1 glc\;"
+		      "s_waitcnt\tvmcnt(0)\;buffer_gl0_inv"
+		    : "buffer_wbinvl1_vol\;global_load%o0\t%0, %A1%O1 glc\;"
+		      "s_waitcnt\tvmcnt(0)\;buffer_wbinvl1_vol");
 	  }
 	break;
       }
@@ -2054,7 +2075,8 @@ (define_insn "atomic_load<mode>"
   }
   [(set_attr "type" "smem,flat,flat")
    (set_attr "length" "20")
-   (set_attr "gcn_version" "gcn5,*,gcn5")])
+   (set_attr "gcn_version" "gcn5,*,gcn5")
+   (set_attr "rdna" "no,*,*")])
 
 (define_insn "atomic_store<mode>"
   [(set (match_operand:SIDI 0 "memory_operand"      "=RS,RF,RM")
@@ -2084,9 +2106,13 @@ (define_insn "atomic_store<mode>"
 	  case 0:
 	    return "s_dcache_wb_vol\;s_store%o1\t%1, %A0 glc";
 	  case 1:
-	    return "buffer_wbinvl1_vol\;flat_store%o1\t%A0, %1%O0 glc";
+	    return (TARGET_RDNA2
+		    ? "buffer_gl0_inv\;flat_store%o1\t%A0, %1%O0 glc"
+		    : "buffer_wbinvl1_vol\;flat_store%o1\t%A0, %1%O0 glc");
 	  case 2:
-	    return "buffer_wbinvl1_vol\;global_store%o1\t%A0, %1%O0 glc";
+	    return (TARGET_RDNA2
+		    ? "buffer_gl0_inv\;global_store%o1\t%A0, %1%O0 glc"
+		    : "buffer_wbinvl1_vol\;global_store%o1\t%A0, %1%O0 glc");
 	  }
 	break;
       case MEMMODEL_ACQ_REL:
@@ -2098,11 +2124,17 @@ (define_insn "atomic_store<mode>"
 	    return "s_dcache_wb_vol\;s_store%o1\t%1, %A0 glc\;"
 		   "s_waitcnt\tlgkmcnt(0)\;s_dcache_inv_vol";
 	  case 1:
-	    return "buffer_wbinvl1_vol\;flat_store%o1\t%A0, %1%O0 glc\;"
-		   "s_waitcnt\t0\;buffer_wbinvl1_vol";
+	    return (TARGET_RDNA2
+		    ? "buffer_gl0_inv\;flat_store%o1\t%A0, %1%O0 glc\;"
+		      "s_waitcnt\t0\;buffer_gl0_inv"
+		    : "buffer_wbinvl1_vol\;flat_store%o1\t%A0, %1%O0 glc\;"
+		      "s_waitcnt\t0\;buffer_wbinvl1_vol");
 	  case 2:
-	    return "buffer_wbinvl1_vol\;global_store%o1\t%A0, %1%O0 glc\;"
-		   "s_waitcnt\tvmcnt(0)\;buffer_wbinvl1_vol";
+	    return (TARGET_RDNA2
+		    ? "buffer_gl0_inv\;global_store%o1\t%A0, %1%O0 glc\;"
+		      "s_waitcnt\tvmcnt(0)\;buffer_gl0_inv"
+		    : "buffer_wbinvl1_vol\;global_store%o1\t%A0, %1%O0 glc\;"
+		      "s_waitcnt\tvmcnt(0)\;buffer_wbinvl1_vol");
 	  }
 	break;
       }
@@ -2110,7 +2142,8 @@ (define_insn "atomic_store<mode>"
   }
   [(set_attr "type" "smem,flat,flat")
    (set_attr "length" "20")
-   (set_attr "gcn_version" "gcn5,*,gcn5")])
+   (set_attr "gcn_version" "gcn5,*,gcn5")
+   (set_attr "rdna" "no,*,*")])
 
 (define_insn "atomic_exchange<mode>"
   [(set (match_operand:SIDI 0 "register_operand"    "=Sm, v, v")
@@ -2145,11 +2178,17 @@ (define_insn "atomic_exchange<mode>"
 	    return "s_atomic_swap<X>\t%0, %1, %2 glc\;s_waitcnt\tlgkmcnt(0)\;"
 		   "s_dcache_wb_vol\;s_dcache_inv_vol";
 	  case 1:
-	    return "flat_atomic_swap<X>\t%0, %1, %2 glc\;s_waitcnt\t0\;"
-		   "buffer_wbinvl1_vol";
+	    return (TARGET_RDNA2
+		    ? "flat_atomic_swap<X>\t%0, %1, %2 glc\;s_waitcnt\t0\;"
+		      "buffer_gl0_inv"
+		    : "flat_atomic_swap<X>\t%0, %1, %2 glc\;s_waitcnt\t0\;"
+		      "buffer_wbinvl1_vol");
 	  case 2:
-	    return "global_atomic_swap<X>\t%0, %A1, %2%O1 glc\;"
-		   "s_waitcnt\tvmcnt(0)\;buffer_wbinvl1_vol";
+	    return (TARGET_RDNA2
+		    ? "global_atomic_swap<X>\t%0, %A1, %2%O1 glc\;"
+		      "s_waitcnt\tvmcnt(0)\;buffer_gl0_inv"
+		    : "global_atomic_swap<X>\t%0, %A1, %2%O1 glc\;"
+		      "s_waitcnt\tvmcnt(0)\;buffer_wbinvl1_vol");
 	  }
 	break;
       case MEMMODEL_RELEASE:
@@ -2160,12 +2199,19 @@ (define_insn "atomic_exchange<mode>"
 	    return "s_dcache_wb_vol\;s_atomic_swap<X>\t%0, %1, %2 glc\;"
 		   "s_waitcnt\tlgkmcnt(0)";
 	  case 1:
-	    return "buffer_wbinvl1_vol\;flat_atomic_swap<X>\t%0, %1, %2 glc\;"
-		   "s_waitcnt\t0";
+	    return (TARGET_RDNA2
+		    ? "buffer_gl0_inv\;flat_atomic_swap<X>\t%0, %1, %2 glc\;"
+		      "s_waitcnt\t0"
+		    : "buffer_wbinvl1_vol\;flat_atomic_swap<X>\t%0, %1, %2 glc\;"
+		      "s_waitcnt\t0");
 	  case 2:
-	    return "buffer_wbinvl1_vol\;"
-		   "global_atomic_swap<X>\t%0, %A1, %2%O1 glc\;"
-		   "s_waitcnt\tvmcnt(0)";
+	    return (TARGET_RDNA2
+		    ? "buffer_gl0_inv\;"
+		      "global_atomic_swap<X>\t%0, %A1, %2%O1 glc\;"
+		      "s_waitcnt\tvmcnt(0)"
+		    : "buffer_wbinvl1_vol\;"
+		      "global_atomic_swap<X>\t%0, %A1, %2%O1 glc\;"
+		      "s_waitcnt\tvmcnt(0)");
 	  }
 	break;
       case MEMMODEL_ACQ_REL:
@@ -2177,12 +2223,19 @@ (define_insn "atomic_exchange<mode>"
 	    return "s_dcache_wb_vol\;s_atomic_swap<X>\t%0, %1, %2 glc\;"
 		   "s_waitcnt\tlgkmcnt(0)\;s_dcache_inv_vol";
 	  case 1:
-	    return "buffer_wbinvl1_vol\;flat_atomic_swap<X>\t%0, %1, %2 glc\;"
-		   "s_waitcnt\t0\;buffer_wbinvl1_vol";
+	    return (TARGET_RDNA2
+		    ? "buffer_gl0_inv\;flat_atomic_swap<X>\t%0, %1, %2 glc\;"
+		      "s_waitcnt\t0\;buffer_gl0_inv"
+		    : "buffer_wbinvl1_vol\;flat_atomic_swap<X>\t%0, %1, %2 glc\;"
+		      "s_waitcnt\t0\;buffer_wbinvl1_vol");
 	  case 2:
-	    return "buffer_wbinvl1_vol\;"
-		   "global_atomic_swap<X>\t%0, %A1, %2%O1 glc\;"
-		   "s_waitcnt\tvmcnt(0)\;buffer_wbinvl1_vol";
+	    return (TARGET_RDNA2
+		    ? "buffer_gl0_inv\;"
+		      "global_atomic_swap<X>\t%0, %A1, %2%O1 glc\;"
+		      "s_waitcnt\tvmcnt(0)\;buffer_gl0_inv"
+		    : "buffer_wbinvl1_vol\;"
+		      "global_atomic_swap<X>\t%0, %A1, %2%O1 glc\;"
+		      "s_waitcnt\tvmcnt(0)\;buffer_wbinvl1_vol");
 	  }
 	break;
       }
@@ -2190,7 +2243,8 @@ (define_insn "atomic_exchange<mode>"
   }
   [(set_attr "type" "smem,flat,flat")
    (set_attr "length" "20")
-   (set_attr "gcn_version" "gcn5,*,gcn5")])
+   (set_attr "gcn_version" "gcn5,*,gcn5")
+   (set_attr "rdna" "no,*,*")])
 
 ;; }}}
 ;; {{{ OpenACC / OpenMP
diff --git a/gcc/config/gcn/gcn.opt b/gcc/config/gcn/gcn.opt
index 36c2b535284..7a852c51c84 100644
--- a/gcc/config/gcn/gcn.opt
+++ b/gcc/config/gcn/gcn.opt
@@ -40,6 +40,9 @@ Enum(gpu_type) String(gfx908) Value(PROCESSOR_GFX908)
 EnumValue
 Enum(gpu_type) String(gfx90a) Value(PROCESSOR_GFX90a)
 
+EnumValue
+Enum(gpu_type) String(gfx1030) Value(PROCESSOR_GFX1030)
+
 march=
 Target RejectNegative Joined ToLower Enum(gpu_type) Var(gcn_arch) Init(PROCESSOR_FIJI)
 Specify the name of the target GPU.
diff --git a/gcc/config/gcn/mkoffload.cc b/gcc/config/gcn/mkoffload.cc
index 8b608bf024e..f6d56b798e1 100644
--- a/gcc/config/gcn/mkoffload.cc
+++ b/gcc/config/gcn/mkoffload.cc
@@ -57,6 +57,8 @@
 #define EF_AMDGPU_MACH_AMDGCN_GFX908 0x30
 #undef  EF_AMDGPU_MACH_AMDGCN_GFX90a
 #define EF_AMDGPU_MACH_AMDGCN_GFX90a 0x3f
+#undef  EF_AMDGPU_MACH_AMDGCN_GFX1030
+#define EF_AMDGPU_MACH_AMDGCN_GFX1030 0x36
 
 #define EF_AMDGPU_FEATURE_XNACK_V4	0x300  /* Mask.  */
 #define EF_AMDGPU_FEATURE_XNACK_UNSUPPORTED_V4	0x000
@@ -942,6 +944,8 @@ main (int argc, char **argv)
 	elf_arch = EF_AMDGPU_MACH_AMDGCN_GFX908;
       else if (strcmp (argv[i], "-march=gfx90a") == 0)
 	elf_arch = EF_AMDGPU_MACH_AMDGCN_GFX90a;
+      else if (strcmp (argv[i], "-march=gfx1030") == 0)
+	elf_arch = EF_AMDGPU_MACH_AMDGCN_GFX1030;
 #define STR "-mstack-size="
       else if (startswith (argv[i], STR))
 	gcn_stack_size = atoi (argv[i] + strlen (STR));
diff --git a/gcc/config/gcn/t-omp-device b/gcc/config/gcn/t-omp-device
index 538624f7ec7..b1cd998a8b1 100644
--- a/gcc/config/gcn/t-omp-device
+++ b/gcc/config/gcn/t-omp-device
@@ -1,4 +1,4 @@
 omp-device-properties-gcn: $(srcdir)/config/gcn/gcn.cc
 	echo kind: gpu > $@
 	echo arch: amdgcn gcn >> $@
-	echo isa: fiji gfx803 gfx900 gfx906 gfx908 gfx90a >> $@
+	echo isa: fiji gfx803 gfx900 gfx906 gfx908 gfx90a gfx1030 >> $@
diff --git a/libgcc/config/gcn/amdgcn_veclib.h b/libgcc/config/gcn/amdgcn_veclib.h
index 15ea20bcd55..88df5c7df91 100644
--- a/libgcc/config/gcn/amdgcn_veclib.h
+++ b/libgcc/config/gcn/amdgcn_veclib.h
@@ -229,7 +229,8 @@ do { \
 
 
 #if defined (__GCN3__) || defined (__GCN5__) \
-    || defined (__CDNA1__) || defined (__CDNA2__)
+    || defined (__CDNA1__) || defined (__CDNA2__) \
+    || defined (__RDNA2__)
 #define CDNA3_PLUS 0
 #else
 #define CDNA3_PLUS 1
diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index ef22d48da79..4328d3de14e 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -377,7 +377,8 @@ typedef enum {
   EF_AMDGPU_MACH_AMDGCN_GFX900 = 0x02c,
   EF_AMDGPU_MACH_AMDGCN_GFX906 = 0x02f,
   EF_AMDGPU_MACH_AMDGCN_GFX908 = 0x030,
-  EF_AMDGPU_MACH_AMDGCN_GFX90a = 0x03f
+  EF_AMDGPU_MACH_AMDGCN_GFX90a = 0x03f,
+  EF_AMDGPU_MACH_AMDGCN_GFX1030 = 0x036
 } EF_AMDGPU_MACH;
 
 const static int EF_AMDGPU_MACH_MASK = 0x000000ff;
@@ -1633,6 +1634,7 @@ const static char *gcn_gfx900_s = "gfx900";
 const static char *gcn_gfx906_s = "gfx906";
 const static char *gcn_gfx908_s = "gfx908";
 const static char *gcn_gfx90a_s = "gfx90a";
+const static char *gcn_gfx1030_s = "gfx1030";
 const static int gcn_isa_name_len = 6;
 
 /* Returns the name that the HSA runtime uses for the ISA or NULL if we do not
@@ -1652,6 +1654,8 @@ isa_hsa_name (int isa) {
       return gcn_gfx908_s;
     case EF_AMDGPU_MACH_AMDGCN_GFX90a:
       return gcn_gfx90a_s;
+    case EF_AMDGPU_MACH_AMDGCN_GFX1030:
+      return gcn_gfx1030_s;
     }
   return NULL;
 }
@@ -1691,6 +1695,9 @@ isa_code(const char *isa) {
   if (!strncmp (isa, gcn_gfx90a_s, gcn_isa_name_len))
     return EF_AMDGPU_MACH_AMDGCN_GFX90a;
 
+  if (!strncmp (isa, gcn_gfx1030_s, gcn_isa_name_len))
+    return EF_AMDGPU_MACH_AMDGCN_GFX1030;
+
   return -1;
 }
 
diff --git a/libgomp/team.c b/libgomp/team.c
index b4fd6f2704c..0edc6e5bf28 100644
--- a/libgomp/team.c
+++ b/libgomp/team.c
@@ -253,8 +253,7 @@ gomp_free_pool_helper (void *thread_pool)
 #elif defined(__nvptx__)
   asm ("exit;");
 #elif defined(__AMDGCN__)
-  asm ("s_dcache_wb\n\t"
-       "s_endpgm");
+  asm ("s_endpgm");
 #else
 #error gomp_free_pool_helper must terminate the thread
 #endif
  

Comments

Andrew Stubbs Oct. 27, 2023, 5:06 p.m. UTC | #1
On 20/10/2023 12:51, Andrew Stubbs wrote:
> I've committed this patch that allows building binaries for AMD gfx1030 
> GPUs. I can't actually test it, however, so somebody else will have to 
> debug it (or wait for me to get my hands on a device). Richi reports 
> that it does not execute correctly, as is.

The patch introduced a bug returning exit values that affected all the 
targets. I've now committed a patch to fix the issue.

Andrew
amdgcn: Fix bug in gfx1030 support patch

The previous patch to add gfx1030 support introduced an issue with passing
exit codes from kernels run under gcn-run (offload kernels were unaffected).

gcc/ChangeLog:

	PR target/112088
	* config/gcn/gcn.cc (gcn_expand_epilogue): Fix kernel epilogue register
	conflict.

diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc
index 6f85f55803c..6a2aaefceca 100644
--- a/gcc/config/gcn/gcn.cc
+++ b/gcc/config/gcn/gcn.cc
@@ -3615,13 +3615,11 @@ gcn_expand_epilogue (void)
       set_mem_addr_space (retptr_mem, ADDR_SPACE_SCALAR_FLAT);
       emit_move_insn (kernarg_reg, retptr_mem);
 
-      rtx retval_addr = gen_rtx_REG (DImode, FIRST_VPARM_REG);
+      rtx retval_addr = gen_rtx_REG (DImode, FIRST_VPARM_REG + 2);
       emit_move_insn (retval_addr, kernarg_reg);
       rtx retval_mem = gen_rtx_MEM (SImode, retval_addr);
-      rtx scalar_retval = gen_rtx_REG (SImode, FIRST_VPARM_REG + 2);
       set_mem_addr_space (retval_mem, ADDR_SPACE_FLAT);
-      emit_move_insn (scalar_retval, gen_rtx_REG (SImode, RETURN_VALUE_REG));
-      emit_move_insn (retval_mem, scalar_retval);
+      emit_move_insn (retval_mem, gen_rtx_REG (SImode, RETURN_VALUE_REG));
     }
 
   emit_jump_insn (gen_gcn_return ());
  
Thomas Schwinge Feb. 12, 2024, 4:35 p.m. UTC | #2
Hi!

On 2023-10-20T12:51:03+0100, Andrew Stubbs <ams@codesourcery.com> wrote:
> I've committed this patch

... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
"amdgcn: add -march=gfx1030 EXPERIMENTAL".

The RDNA2 ISA variant doesn't support certain instructions previous
implemented in GCC/GCN, so a number of patterns etc. had to be disabled:

> [...] Vector
> reductions will need to be reworked for RDNA2.  [...]

> 	* config/gcn/gcn-valu.md (@dpp_move<mode>): Disable for RDNA2.
> 	(addc<mode>3<exec_vcc>): Add RDNA2 syntax variant.
> 	(subc<mode>3<exec_vcc>): Likewise.
> 	(<convop><mode><vndi>2_exec): Add RDNA2 alternatives.
> 	(vec_cmp<mode>di): Likewise.
> 	(vec_cmp<u><mode>di): Likewise.
> 	(vec_cmp<mode>di_exec): Likewise.
> 	(vec_cmp<u><mode>di_exec): Likewise.
> 	(vec_cmp<mode>di_dup): Likewise.
> 	(vec_cmp<mode>di_dup_exec): Likewise.
> 	(reduc_<reduc_op>_scal_<mode>): Disable for RDNA2.
> 	(*<reduc_op>_dpp_shr_<mode>): Likewise.
> 	(*plus_carry_dpp_shr_<mode>): Likewise.
> 	(*plus_carry_in_dpp_shr_<mode>): Likewise.

Etc.  The expectation being that GCC middle end copes with this, and
synthesizes some less ideal yet still functional vector code, I presume.

The later RDNA3/gfx1100 support builds on top of this, and that's what
I'm currently working on getting proper GCC/GCN target (not offloading)
results for.

I'm seeing a good number of execution test FAILs (regressions compared to
my earlier non-gfx1100 testing), and I've now tracked down where one
large class of those comes into existance -- not yet how to resolve,
unfortunately.  But maybe, with you guys' combined vectorizer and back
end experience, the latter will be done quickly?

Richard, I don't know if you've ever run actual GCC/GCN target (not
offloading) testing; let me know if you have any questions about that.
Given that (at least largely?) the same patterns etc. are disabled as in
my gfx1100 configuration, I suppose your gfx1030 one would exhibit the
same issues.  You can build GCC/GCN target like you build the offloading
one, just remove '--enable-as-accelerator-for=[...]'.  Likely, you can
even use a offloading GCC/GCN build to reproduce the issue below.

One example is the attached 'builtin-bitops-1.c', reduced from
'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is
miscompiled as soon as '-ftree-vectorize' is effective:

    $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/ -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all -fdump-ipa-all-all -fdump-rtl-all-all -save-temps -march=gfx1100 -O1 -ftree-vectorize

In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for
'-march=gfx90a' vs. '-march=gfx1100', we see:

    +builtin-bitops-1.c:7:17: missed:   reduc op not supported by target.

..., and therefore:

    -builtin-bitops-1.c:7:17: note:  Reduce using direct vector reduction.
    +builtin-bitops-1.c:7:17: note:  Reduce using vector shifts
    +builtin-bitops-1.c:7:17: note:  extract scalar result

That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we build a
chain of summation of 'VEC_PERM_EXPR's.  However, there's wrong code
generated:

    $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
    i=1, ints[i]=0x1 a=1, b=2
    i=2, ints[i]=0x80000000 a=1, b=2
    i=3, ints[i]=0x2 a=1, b=2
    i=4, ints[i]=0x40000000 a=1, b=2
    i=5, ints[i]=0x10000 a=1, b=2
    i=6, ints[i]=0x8000 a=1, b=2
    i=7, ints[i]=0xa5a5a5a5 a=16, b=32
    i=8, ints[i]=0x5a5a5a5a a=16, b=32
    i=9, ints[i]=0xcafe0000 a=11, b=22
    i=10, ints[i]=0xcafe00 a=11, b=22
    i=11, ints[i]=0xcafe a=11, b=22
    i=12, ints[i]=0xffffffff a=32, b=64

(I can't tell if the 'b = 2 * a' pattern is purely coincidental?)

I don't speak enough "vectorization" to fully understand the generic
vectorized algorithm and its implementation.  It appears that the
"Reduce using vector shifts" code has been around for a very long time,
but also has gone through a number of changes.  I can't tell which GCC
targets/configurations it's actually used for (in the same way as for
GCN gfx1100), and thus whether there's an issue in that vectorizer code,
or rather in the GCN back end, or GCN back end parameterizing the generic
code?

Manually working through the 'a-builtin-bitops-1.c.265t.optimized' code:

    int my_popcount (unsigned int x)
    {
      int stmp__12.12;
      vector(64) int vect__12.11;
      vector(64) unsigned int vect__1.8;
      vector(64) unsigned int _13;
      vector(64) unsigned int vect_cst__18;
      vector(64) int [all others];
    
      <bb 2> [local count: 32534376]:
      vect_cst__18 = { [all 'x_8(D)'] };
      vect__1.8_19 = vect_cst__18 >> { 0, 1, 2, [...], 61, 62, 63 };
      _13 = .COND_AND ({ [32 x '-1'], [32 x '0'] }, vect__1.8_19, { [all '1'] }, { [all '0'] });
      vect__12.11_24 = VIEW_CONVERT_EXPR<vector(64) int>(_13);
      _26 = VEC_PERM_EXPR <vect__12.11_24, { [all '0'] }, { 32, 33, 34, [...], 93, 94, 95 }>;
      _27 = vect__12.11_24 + _26;
      _28 = VEC_PERM_EXPR <_27, { [all '0'] }, { 16, 17, 18, [...], 77, 78, 79 }>;
      _29 = _27 + _28;
      _30 = VEC_PERM_EXPR <_29, { [all '0'] }, { 8, 9, 10, [...], 69, 70, 71 }>;
      _31 = _29 + _30;
      _32 = VEC_PERM_EXPR <_31, { [all '0'] }, { 4, 5, 6, [...], 65, 66, 67 }>;
      _33 = _31 + _32;
      _34 = VEC_PERM_EXPR <_33, { [all '0'] }, { 2, 3, 4, [...], 63, 64, 65 }>;
      _35 = _33 + _34;
      _36 = VEC_PERM_EXPR <_35, { [all '0'] }, { 1, 2, 3, [...], 62, 63, 64 }>;
      _37 = _35 + _36;
      stmp__12.12_38 = BIT_FIELD_REF <_37, 32, 0>;
      return stmp__12.12_38;

..., for example, for 'x = 7', we get:

      vect_cst__18 = { [all '7'] };
      vect__1.8_19 = { 7, 3, 1, 0, 0, 0, [...] };
      _13 = { 1, 1, 1, 0, 0, 0, [...] };
      vect__12.11_24 = { 1, 1, 1, 0, 0, 0, [...] };
      _26 = { [all '0'] };
      _27 = { 1, 1, 1, 0, 0, 0, [...] };
      _28 = { [all '0'] };
      _29 = { 1, 1, 1, 0, 0, 0, [...] };
      _30 = { [all '0'] };
      _31 = { 1, 1, 1, 0, 0, 0, [...] };
      _32 = { [all '0'] };
      _33 = { 1, 1, 1, 0, 0, 0, [...] };
      _34 = { 1, 0, 0, 0, [...] };
      _35 = { 2, 1, 1, 0, 0, 0, [...] };
      _36 = { 1, 1, 0, 0, 0, [...] };
      _37 = { 3, 2, 1, 0, 0, 0, [...] };
      stmp__12.12_38 = 3;
      return 3;

..., so the algorithm would appear to synthesize correct code for that
case.  Adding '7' to 'builtin-bitops-1.c', we however again get:

    i=13, ints[i]=0x7 a=3, b=6


With the following hack applied to 'gcc/tree-vect-loop.cc':

    @@ -6687,8 +6687,9 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo,
           reduce_with_shift = have_whole_vector_shift (mode1);
           if (!VECTOR_MODE_P (mode1)
              || !directly_supported_p (code, vectype1))
            reduce_with_shift = false;
    +      reduce_with_shift = false;

..., I'm able to work around those regressions: by means of forcing
"Reduce using scalar code" instead of "Reduce using vector shifts".


Grüße
 Thomas
  
Richard Biener Feb. 13, 2024, 8:26 a.m. UTC | #3
On Mon, 12 Feb 2024, Thomas Schwinge wrote:

> Hi!
> 
> On 2023-10-20T12:51:03+0100, Andrew Stubbs <ams@codesourcery.com> wrote:
> > I've committed this patch
> 
> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
> "amdgcn: add -march=gfx1030 EXPERIMENTAL".
> 
> The RDNA2 ISA variant doesn't support certain instructions previous
> implemented in GCC/GCN, so a number of patterns etc. had to be disabled:
> 
> > [...] Vector
> > reductions will need to be reworked for RDNA2.  [...]
> 
> > 	* config/gcn/gcn-valu.md (@dpp_move<mode>): Disable for RDNA2.
> > 	(addc<mode>3<exec_vcc>): Add RDNA2 syntax variant.
> > 	(subc<mode>3<exec_vcc>): Likewise.
> > 	(<convop><mode><vndi>2_exec): Add RDNA2 alternatives.
> > 	(vec_cmp<mode>di): Likewise.
> > 	(vec_cmp<u><mode>di): Likewise.
> > 	(vec_cmp<mode>di_exec): Likewise.
> > 	(vec_cmp<u><mode>di_exec): Likewise.
> > 	(vec_cmp<mode>di_dup): Likewise.
> > 	(vec_cmp<mode>di_dup_exec): Likewise.
> > 	(reduc_<reduc_op>_scal_<mode>): Disable for RDNA2.
> > 	(*<reduc_op>_dpp_shr_<mode>): Likewise.
> > 	(*plus_carry_dpp_shr_<mode>): Likewise.
> > 	(*plus_carry_in_dpp_shr_<mode>): Likewise.
> 
> Etc.  The expectation being that GCC middle end copes with this, and
> synthesizes some less ideal yet still functional vector code, I presume.
> 
> The later RDNA3/gfx1100 support builds on top of this, and that's what
> I'm currently working on getting proper GCC/GCN target (not offloading)
> results for.
> 
> I'm seeing a good number of execution test FAILs (regressions compared to
> my earlier non-gfx1100 testing), and I've now tracked down where one
> large class of those comes into existance -- not yet how to resolve,
> unfortunately.  But maybe, with you guys' combined vectorizer and back
> end experience, the latter will be done quickly?
> 
> Richard, I don't know if you've ever run actual GCC/GCN target (not
> offloading) testing; let me know if you have any questions about that.

I've only done offload testing - in the x86_64 build tree run
check-target-libgomp.  If you can tell me how to do GCN target testing
(maybe document it on the wiki even!) I can try do that as well.

> Given that (at least largely?) the same patterns etc. are disabled as in
> my gfx1100 configuration, I suppose your gfx1030 one would exhibit the
> same issues.  You can build GCC/GCN target like you build the offloading
> one, just remove '--enable-as-accelerator-for=[...]'.  Likely, you can
> even use a offloading GCC/GCN build to reproduce the issue below.
> 
> One example is the attached 'builtin-bitops-1.c', reduced from
> 'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is
> miscompiled as soon as '-ftree-vectorize' is effective:
> 
>     $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/ -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all -fdump-ipa-all-all -fdump-rtl-all-all -save-temps -march=gfx1100 -O1 -ftree-vectorize
> 
> In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for
> '-march=gfx90a' vs. '-march=gfx1100', we see:
> 
>     +builtin-bitops-1.c:7:17: missed:   reduc op not supported by target.
> 
> ..., and therefore:
> 
>     -builtin-bitops-1.c:7:17: note:  Reduce using direct vector reduction.
>     +builtin-bitops-1.c:7:17: note:  Reduce using vector shifts
>     +builtin-bitops-1.c:7:17: note:  extract scalar result
> 
> That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we build a
> chain of summation of 'VEC_PERM_EXPR's.  However, there's wrong code
> generated:
> 
>     $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
>     i=1, ints[i]=0x1 a=1, b=2
>     i=2, ints[i]=0x80000000 a=1, b=2
>     i=3, ints[i]=0x2 a=1, b=2
>     i=4, ints[i]=0x40000000 a=1, b=2
>     i=5, ints[i]=0x10000 a=1, b=2
>     i=6, ints[i]=0x8000 a=1, b=2
>     i=7, ints[i]=0xa5a5a5a5 a=16, b=32
>     i=8, ints[i]=0x5a5a5a5a a=16, b=32
>     i=9, ints[i]=0xcafe0000 a=11, b=22
>     i=10, ints[i]=0xcafe00 a=11, b=22
>     i=11, ints[i]=0xcafe a=11, b=22
>     i=12, ints[i]=0xffffffff a=32, b=64
> 
> (I can't tell if the 'b = 2 * a' pattern is purely coincidental?)
> 
> I don't speak enough "vectorization" to fully understand the generic
> vectorized algorithm and its implementation.  It appears that the
> "Reduce using vector shifts" code has been around for a very long time,
> but also has gone through a number of changes.  I can't tell which GCC
> targets/configurations it's actually used for (in the same way as for
> GCN gfx1100), and thus whether there's an issue in that vectorizer code,
> or rather in the GCN back end, or GCN back end parameterizing the generic
> code?

The "shift" reduction is basically doing reduction by repeatedly
adding the upper to the lower half of the vector (each time halving
the vector size).

> Manually working through the 'a-builtin-bitops-1.c.265t.optimized' code:
> 
>     int my_popcount (unsigned int x)
>     {
>       int stmp__12.12;
>       vector(64) int vect__12.11;
>       vector(64) unsigned int vect__1.8;
>       vector(64) unsigned int _13;
>       vector(64) unsigned int vect_cst__18;
>       vector(64) int [all others];
>     
>       <bb 2> [local count: 32534376]:
>       vect_cst__18 = { [all 'x_8(D)'] };
>       vect__1.8_19 = vect_cst__18 >> { 0, 1, 2, [...], 61, 62, 63 };
>       _13 = .COND_AND ({ [32 x '-1'], [32 x '0'] }, vect__1.8_19, { [all '1'] }, { [all '0'] });
>       vect__12.11_24 = VIEW_CONVERT_EXPR<vector(64) int>(_13);
>       _26 = VEC_PERM_EXPR <vect__12.11_24, { [all '0'] }, { 32, 33, 34, [...], 93, 94, 95 }>;
>       _27 = vect__12.11_24 + _26;
>       _28 = VEC_PERM_EXPR <_27, { [all '0'] }, { 16, 17, 18, [...], 77, 78, 79 }>;
>       _29 = _27 + _28;
>       _30 = VEC_PERM_EXPR <_29, { [all '0'] }, { 8, 9, 10, [...], 69, 70, 71 }>;
>       _31 = _29 + _30;
>       _32 = VEC_PERM_EXPR <_31, { [all '0'] }, { 4, 5, 6, [...], 65, 66, 67 }>;
>       _33 = _31 + _32;
>       _34 = VEC_PERM_EXPR <_33, { [all '0'] }, { 2, 3, 4, [...], 63, 64, 65 }>;
>       _35 = _33 + _34;
>       _36 = VEC_PERM_EXPR <_35, { [all '0'] }, { 1, 2, 3, [...], 62, 63, 64 }>;
>       _37 = _35 + _36;
>       stmp__12.12_38 = BIT_FIELD_REF <_37, 32, 0>;
>       return stmp__12.12_38;
> 
> ..., for example, for 'x = 7', we get:
> 
>       vect_cst__18 = { [all '7'] };
>       vect__1.8_19 = { 7, 3, 1, 0, 0, 0, [...] };
>       _13 = { 1, 1, 1, 0, 0, 0, [...] };
>       vect__12.11_24 = { 1, 1, 1, 0, 0, 0, [...] };
>       _26 = { [all '0'] };
>       _27 = { 1, 1, 1, 0, 0, 0, [...] };
>       _28 = { [all '0'] };
>       _29 = { 1, 1, 1, 0, 0, 0, [...] };
>       _30 = { [all '0'] };
>       _31 = { 1, 1, 1, 0, 0, 0, [...] };
>       _32 = { [all '0'] };
>       _33 = { 1, 1, 1, 0, 0, 0, [...] };
>       _34 = { 1, 0, 0, 0, [...] };
>       _35 = { 2, 1, 1, 0, 0, 0, [...] };
>       _36 = { 1, 1, 0, 0, 0, [...] };
>       _37 = { 3, 2, 1, 0, 0, 0, [...] };
>       stmp__12.12_38 = 3;
>       return 3;
> 
> ..., so the algorithm would appear to synthesize correct code for that
> case.  Adding '7' to 'builtin-bitops-1.c', we however again get:
> 
>     i=13, ints[i]=0x7 a=3, b=6
> 
> 
> With the following hack applied to 'gcc/tree-vect-loop.cc':
> 
>     @@ -6687,8 +6687,9 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo,
>            reduce_with_shift = have_whole_vector_shift (mode1);
>            if (!VECTOR_MODE_P (mode1)
>               || !directly_supported_p (code, vectype1))
>             reduce_with_shift = false;
>     +      reduce_with_shift = false;
> 
> ..., I'm able to work around those regressions: by means of forcing
> "Reduce using scalar code" instead of "Reduce using vector shifts".

I would say it somewhere gets broken between the vectorizer and the GPU
which means likely in the target?  Can you point out an issue in the
actual generated GCN code?

Iff this kind of reduction is the issue you'd see quite a lot of
vectorzer execute FAILs.  I'm seeing a .COND_AND above - could it
be that the "mask" is still set wrong when doing the reduction
steps?

Richard.
  
Andrew Stubbs Feb. 14, 2024, 12:56 p.m. UTC | #4
On 13/02/2024 08:26, Richard Biener wrote:
> On Mon, 12 Feb 2024, Thomas Schwinge wrote:
> 
>> Hi!
>>
>> On 2023-10-20T12:51:03+0100, Andrew Stubbs <ams@codesourcery.com> wrote:
>>> I've committed this patch
>>
>> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
>> "amdgcn: add -march=gfx1030 EXPERIMENTAL".
>>
>> The RDNA2 ISA variant doesn't support certain instructions previous
>> implemented in GCC/GCN, so a number of patterns etc. had to be disabled:
>>
>>> [...] Vector
>>> reductions will need to be reworked for RDNA2.  [...]
>>
>>> 	* config/gcn/gcn-valu.md (@dpp_move<mode>): Disable for RDNA2.
>>> 	(addc<mode>3<exec_vcc>): Add RDNA2 syntax variant.
>>> 	(subc<mode>3<exec_vcc>): Likewise.
>>> 	(<convop><mode><vndi>2_exec): Add RDNA2 alternatives.
>>> 	(vec_cmp<mode>di): Likewise.
>>> 	(vec_cmp<u><mode>di): Likewise.
>>> 	(vec_cmp<mode>di_exec): Likewise.
>>> 	(vec_cmp<u><mode>di_exec): Likewise.
>>> 	(vec_cmp<mode>di_dup): Likewise.
>>> 	(vec_cmp<mode>di_dup_exec): Likewise.
>>> 	(reduc_<reduc_op>_scal_<mode>): Disable for RDNA2.
>>> 	(*<reduc_op>_dpp_shr_<mode>): Likewise.
>>> 	(*plus_carry_dpp_shr_<mode>): Likewise.
>>> 	(*plus_carry_in_dpp_shr_<mode>): Likewise.
>>
>> Etc.  The expectation being that GCC middle end copes with this, and
>> synthesizes some less ideal yet still functional vector code, I presume.
>>
>> The later RDNA3/gfx1100 support builds on top of this, and that's what
>> I'm currently working on getting proper GCC/GCN target (not offloading)
>> results for.
>>
>> I'm seeing a good number of execution test FAILs (regressions compared to
>> my earlier non-gfx1100 testing), and I've now tracked down where one
>> large class of those comes into existance -- not yet how to resolve,
>> unfortunately.  But maybe, with you guys' combined vectorizer and back
>> end experience, the latter will be done quickly?
>>
>> Richard, I don't know if you've ever run actual GCC/GCN target (not
>> offloading) testing; let me know if you have any questions about that.
> 
> I've only done offload testing - in the x86_64 build tree run
> check-target-libgomp.  If you can tell me how to do GCN target testing
> (maybe document it on the wiki even!) I can try do that as well.
> 
>> Given that (at least largely?) the same patterns etc. are disabled as in
>> my gfx1100 configuration, I suppose your gfx1030 one would exhibit the
>> same issues.  You can build GCC/GCN target like you build the offloading
>> one, just remove '--enable-as-accelerator-for=[...]'.  Likely, you can
>> even use a offloading GCC/GCN build to reproduce the issue below.
>>
>> One example is the attached 'builtin-bitops-1.c', reduced from
>> 'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is
>> miscompiled as soon as '-ftree-vectorize' is effective:
>>
>>      $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/ -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all -fdump-ipa-all-all -fdump-rtl-all-all -save-temps -march=gfx1100 -O1 -ftree-vectorize
>>
>> In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for
>> '-march=gfx90a' vs. '-march=gfx1100', we see:
>>
>>      +builtin-bitops-1.c:7:17: missed:   reduc op not supported by target.
>>
>> ..., and therefore:
>>
>>      -builtin-bitops-1.c:7:17: note:  Reduce using direct vector reduction.
>>      +builtin-bitops-1.c:7:17: note:  Reduce using vector shifts
>>      +builtin-bitops-1.c:7:17: note:  extract scalar result
>>
>> That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we build a
>> chain of summation of 'VEC_PERM_EXPR's.  However, there's wrong code
>> generated:
>>
>>      $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
>>      i=1, ints[i]=0x1 a=1, b=2
>>      i=2, ints[i]=0x80000000 a=1, b=2
>>      i=3, ints[i]=0x2 a=1, b=2
>>      i=4, ints[i]=0x40000000 a=1, b=2
>>      i=5, ints[i]=0x10000 a=1, b=2
>>      i=6, ints[i]=0x8000 a=1, b=2
>>      i=7, ints[i]=0xa5a5a5a5 a=16, b=32
>>      i=8, ints[i]=0x5a5a5a5a a=16, b=32
>>      i=9, ints[i]=0xcafe0000 a=11, b=22
>>      i=10, ints[i]=0xcafe00 a=11, b=22
>>      i=11, ints[i]=0xcafe a=11, b=22
>>      i=12, ints[i]=0xffffffff a=32, b=64
>>
>> (I can't tell if the 'b = 2 * a' pattern is purely coincidental?)
>>
>> I don't speak enough "vectorization" to fully understand the generic
>> vectorized algorithm and its implementation.  It appears that the
>> "Reduce using vector shifts" code has been around for a very long time,
>> but also has gone through a number of changes.  I can't tell which GCC
>> targets/configurations it's actually used for (in the same way as for
>> GCN gfx1100), and thus whether there's an issue in that vectorizer code,
>> or rather in the GCN back end, or GCN back end parameterizing the generic
>> code?
> 
> The "shift" reduction is basically doing reduction by repeatedly
> adding the upper to the lower half of the vector (each time halving
> the vector size).
> 
>> Manually working through the 'a-builtin-bitops-1.c.265t.optimized' code:
>>
>>      int my_popcount (unsigned int x)
>>      {
>>        int stmp__12.12;
>>        vector(64) int vect__12.11;
>>        vector(64) unsigned int vect__1.8;
>>        vector(64) unsigned int _13;
>>        vector(64) unsigned int vect_cst__18;
>>        vector(64) int [all others];
>>      
>>        <bb 2> [local count: 32534376]:
>>        vect_cst__18 = { [all 'x_8(D)'] };
>>        vect__1.8_19 = vect_cst__18 >> { 0, 1, 2, [...], 61, 62, 63 };
>>        _13 = .COND_AND ({ [32 x '-1'], [32 x '0'] }, vect__1.8_19, { [all '1'] }, { [all '0'] });
>>        vect__12.11_24 = VIEW_CONVERT_EXPR<vector(64) int>(_13);
>>        _26 = VEC_PERM_EXPR <vect__12.11_24, { [all '0'] }, { 32, 33, 34, [...], 93, 94, 95 }>;
>>        _27 = vect__12.11_24 + _26;
>>        _28 = VEC_PERM_EXPR <_27, { [all '0'] }, { 16, 17, 18, [...], 77, 78, 79 }>;
>>        _29 = _27 + _28;
>>        _30 = VEC_PERM_EXPR <_29, { [all '0'] }, { 8, 9, 10, [...], 69, 70, 71 }>;
>>        _31 = _29 + _30;
>>        _32 = VEC_PERM_EXPR <_31, { [all '0'] }, { 4, 5, 6, [...], 65, 66, 67 }>;
>>        _33 = _31 + _32;
>>        _34 = VEC_PERM_EXPR <_33, { [all '0'] }, { 2, 3, 4, [...], 63, 64, 65 }>;
>>        _35 = _33 + _34;
>>        _36 = VEC_PERM_EXPR <_35, { [all '0'] }, { 1, 2, 3, [...], 62, 63, 64 }>;
>>        _37 = _35 + _36;
>>        stmp__12.12_38 = BIT_FIELD_REF <_37, 32, 0>;
>>        return stmp__12.12_38;
>>
>> ..., for example, for 'x = 7', we get:
>>
>>        vect_cst__18 = { [all '7'] };
>>        vect__1.8_19 = { 7, 3, 1, 0, 0, 0, [...] };
>>        _13 = { 1, 1, 1, 0, 0, 0, [...] };
>>        vect__12.11_24 = { 1, 1, 1, 0, 0, 0, [...] };
>>        _26 = { [all '0'] };
>>        _27 = { 1, 1, 1, 0, 0, 0, [...] };
>>        _28 = { [all '0'] };
>>        _29 = { 1, 1, 1, 0, 0, 0, [...] };
>>        _30 = { [all '0'] };
>>        _31 = { 1, 1, 1, 0, 0, 0, [...] };
>>        _32 = { [all '0'] };
>>        _33 = { 1, 1, 1, 0, 0, 0, [...] };
>>        _34 = { 1, 0, 0, 0, [...] };
>>        _35 = { 2, 1, 1, 0, 0, 0, [...] };
>>        _36 = { 1, 1, 0, 0, 0, [...] };
>>        _37 = { 3, 2, 1, 0, 0, 0, [...] };
>>        stmp__12.12_38 = 3;
>>        return 3;
>>
>> ..., so the algorithm would appear to synthesize correct code for that
>> case.  Adding '7' to 'builtin-bitops-1.c', we however again get:
>>
>>      i=13, ints[i]=0x7 a=3, b=6
>>
>>
>> With the following hack applied to 'gcc/tree-vect-loop.cc':
>>
>>      @@ -6687,8 +6687,9 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo,
>>             reduce_with_shift = have_whole_vector_shift (mode1);
>>             if (!VECTOR_MODE_P (mode1)
>>                || !directly_supported_p (code, vectype1))
>>              reduce_with_shift = false;
>>      +      reduce_with_shift = false;
>>
>> ..., I'm able to work around those regressions: by means of forcing
>> "Reduce using scalar code" instead of "Reduce using vector shifts".
> 
> I would say it somewhere gets broken between the vectorizer and the GPU
> which means likely in the target?  Can you point out an issue in the
> actual generated GCN code?
> 
> Iff this kind of reduction is the issue you'd see quite a lot of
> vectorzer execute FAILs.  I'm seeing a .COND_AND above - could it
> be that the "mask" is still set wrong when doing the reduction
> steps?

It looks like the ds_bpermute_b32 instruction works differently on RDNA3 
(vs. GCN/CDNA and even RDNA2).

 From the pseudocode in the documentation:

   for i in 0 : WAVE64 ? 63 : 31 do
     // ADDR needs to be divided by 4.
     // High-order bits are ignored.
     // NOTE: destination lane is MOD 32 regardless of wave size.
     src_lane = 32'I(VGPR[i][ADDR] + OFFSET.b) / 4 % 32;
     // EXEC is applied to the source VGPR reads.
     if EXEC[src_lane].u1 then
       tmp[i] = VGPR[src_lane][DATA0]
     endif
   endfor;

The key detail is the "mod 32"; the other architectures have "mod 64" there.

So, the last 32 lanes are discarded, and the first 32 lanes are 
duplicated into the last, and this explains why my_popcount returns 
double the expected value for smaller inputs.

Richi, can you confirm that this testcase works properly on your card, 
please?

To test, assuming you only have the offload toolchain built, compile 
using x86_64-none-linux-gnu-accel-amdgcn-amdhsa-gcc, which should 
produce a raw AMD ELF file. Then you run it using "gcn-run a.out" (you 
can find gcn-run under libexec).

Andrew
  
Richard Biener Feb. 14, 2024, 1:27 p.m. UTC | #5
On Wed, 14 Feb 2024, Andrew Stubbs wrote:

> On 13/02/2024 08:26, Richard Biener wrote:
> > On Mon, 12 Feb 2024, Thomas Schwinge wrote:
> > 
> >> Hi!
> >>
> >> On 2023-10-20T12:51:03+0100, Andrew Stubbs <ams@codesourcery.com> wrote:
> >>> I've committed this patch
> >>
> >> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
> >> "amdgcn: add -march=gfx1030 EXPERIMENTAL".
> >>
> >> The RDNA2 ISA variant doesn't support certain instructions previous
> >> implemented in GCC/GCN, so a number of patterns etc. had to be disabled:
> >>
> >>> [...] Vector
> >>> reductions will need to be reworked for RDNA2.  [...]
> >>
> >>>  * config/gcn/gcn-valu.md (@dpp_move<mode>): Disable for RDNA2.
> >>>  (addc<mode>3<exec_vcc>): Add RDNA2 syntax variant.
> >>>  (subc<mode>3<exec_vcc>): Likewise.
> >>>  (<convop><mode><vndi>2_exec): Add RDNA2 alternatives.
> >>>  (vec_cmp<mode>di): Likewise.
> >>>  (vec_cmp<u><mode>di): Likewise.
> >>>  (vec_cmp<mode>di_exec): Likewise.
> >>>  (vec_cmp<u><mode>di_exec): Likewise.
> >>>  (vec_cmp<mode>di_dup): Likewise.
> >>>  (vec_cmp<mode>di_dup_exec): Likewise.
> >>>  (reduc_<reduc_op>_scal_<mode>): Disable for RDNA2.
> >>>  (*<reduc_op>_dpp_shr_<mode>): Likewise.
> >>>  (*plus_carry_dpp_shr_<mode>): Likewise.
> >>>  (*plus_carry_in_dpp_shr_<mode>): Likewise.
> >>
> >> Etc.  The expectation being that GCC middle end copes with this, and
> >> synthesizes some less ideal yet still functional vector code, I presume.
> >>
> >> The later RDNA3/gfx1100 support builds on top of this, and that's what
> >> I'm currently working on getting proper GCC/GCN target (not offloading)
> >> results for.
> >>
> >> I'm seeing a good number of execution test FAILs (regressions compared to
> >> my earlier non-gfx1100 testing), and I've now tracked down where one
> >> large class of those comes into existance -- not yet how to resolve,
> >> unfortunately.  But maybe, with you guys' combined vectorizer and back
> >> end experience, the latter will be done quickly?
> >>
> >> Richard, I don't know if you've ever run actual GCC/GCN target (not
> >> offloading) testing; let me know if you have any questions about that.
> > 
> > I've only done offload testing - in the x86_64 build tree run
> > check-target-libgomp.  If you can tell me how to do GCN target testing
> > (maybe document it on the wiki even!) I can try do that as well.
> > 
> >> Given that (at least largely?) the same patterns etc. are disabled as in
> >> my gfx1100 configuration, I suppose your gfx1030 one would exhibit the
> >> same issues.  You can build GCC/GCN target like you build the offloading
> >> one, just remove '--enable-as-accelerator-for=[...]'.  Likely, you can
> >> even use a offloading GCC/GCN build to reproduce the issue below.
> >>
> >> One example is the attached 'builtin-bitops-1.c', reduced from
> >> 'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is
> >> miscompiled as soon as '-ftree-vectorize' is effective:
> >>
> >>      $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c
> >>      -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
> >>      -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all
> >>      -fdump-ipa-all-all -fdump-rtl-all-all -save-temps -march=gfx1100 -O1
> >>      -ftree-vectorize
> >>
> >> In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for
> >> '-march=gfx90a' vs. '-march=gfx1100', we see:
> >>
> >>      +builtin-bitops-1.c:7:17: missed:   reduc op not supported by target.
> >>
> >> ..., and therefore:
> >>
> >>      -builtin-bitops-1.c:7:17: note:  Reduce using direct vector reduction.
> >>      +builtin-bitops-1.c:7:17: note:  Reduce using vector shifts
> >>      +builtin-bitops-1.c:7:17: note:  extract scalar result
> >>
> >> That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we build a
> >> chain of summation of 'VEC_PERM_EXPR's.  However, there's wrong code
> >> generated:
> >>
> >>      $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
> >>      i=1, ints[i]=0x1 a=1, b=2
> >>      i=2, ints[i]=0x80000000 a=1, b=2
> >>      i=3, ints[i]=0x2 a=1, b=2
> >>      i=4, ints[i]=0x40000000 a=1, b=2
> >>      i=5, ints[i]=0x10000 a=1, b=2
> >>      i=6, ints[i]=0x8000 a=1, b=2
> >>      i=7, ints[i]=0xa5a5a5a5 a=16, b=32
> >>      i=8, ints[i]=0x5a5a5a5a a=16, b=32
> >>      i=9, ints[i]=0xcafe0000 a=11, b=22
> >>      i=10, ints[i]=0xcafe00 a=11, b=22
> >>      i=11, ints[i]=0xcafe a=11, b=22
> >>      i=12, ints[i]=0xffffffff a=32, b=64
> >>
> >> (I can't tell if the 'b = 2 * a' pattern is purely coincidental?)
> >>
> >> I don't speak enough "vectorization" to fully understand the generic
> >> vectorized algorithm and its implementation.  It appears that the
> >> "Reduce using vector shifts" code has been around for a very long time,
> >> but also has gone through a number of changes.  I can't tell which GCC
> >> targets/configurations it's actually used for (in the same way as for
> >> GCN gfx1100), and thus whether there's an issue in that vectorizer code,
> >> or rather in the GCN back end, or GCN back end parameterizing the generic
> >> code?
> > 
> > The "shift" reduction is basically doing reduction by repeatedly
> > adding the upper to the lower half of the vector (each time halving
> > the vector size).
> > 
> >> Manually working through the 'a-builtin-bitops-1.c.265t.optimized' code:
> >>
> >>      int my_popcount (unsigned int x)
> >>      {
> >>        int stmp__12.12;
> >>        vector(64) int vect__12.11;
> >>        vector(64) unsigned int vect__1.8;
> >>        vector(64) unsigned int _13;
> >>        vector(64) unsigned int vect_cst__18;
> >>        vector(64) int [all others];
> >>      
> >>        <bb 2> [local count: 32534376]:
> >>        vect_cst__18 = { [all 'x_8(D)'] };
> >>        vect__1.8_19 = vect_cst__18 >> { 0, 1, 2, [...], 61, 62, 63 };
> >>        _13 = .COND_AND ({ [32 x '-1'], [32 x '0'] }, vect__1.8_19, { [all
> >>        '1'] }, { [all '0'] });
> >>        vect__12.11_24 = VIEW_CONVERT_EXPR<vector(64) int>(_13);
> >>        _26 = VEC_PERM_EXPR <vect__12.11_24, { [all '0'] }, { 32, 33, 34,
> >>        [...], 93, 94, 95 }>;
> >>        _27 = vect__12.11_24 + _26;
> >>        _28 = VEC_PERM_EXPR <_27, { [all '0'] }, { 16, 17, 18, [...], 77,
> >>        78, 79 }>;
> >>        _29 = _27 + _28;
> >>        _30 = VEC_PERM_EXPR <_29, { [all '0'] }, { 8, 9, 10, [...], 69, 70,
> >>        71 }>;
> >>        _31 = _29 + _30;
> >>        _32 = VEC_PERM_EXPR <_31, { [all '0'] }, { 4, 5, 6, [...], 65, 66,
> >>        67 }>;
> >>        _33 = _31 + _32;
> >>        _34 = VEC_PERM_EXPR <_33, { [all '0'] }, { 2, 3, 4, [...], 63, 64,
> >>        65 }>;
> >>        _35 = _33 + _34;
> >>        _36 = VEC_PERM_EXPR <_35, { [all '0'] }, { 1, 2, 3, [...], 62, 63,
> >>        64 }>;
> >>        _37 = _35 + _36;
> >>        stmp__12.12_38 = BIT_FIELD_REF <_37, 32, 0>;
> >>        return stmp__12.12_38;
> >>
> >> ..., for example, for 'x = 7', we get:
> >>
> >>        vect_cst__18 = { [all '7'] };
> >>        vect__1.8_19 = { 7, 3, 1, 0, 0, 0, [...] };
> >>        _13 = { 1, 1, 1, 0, 0, 0, [...] };
> >>        vect__12.11_24 = { 1, 1, 1, 0, 0, 0, [...] };
> >>        _26 = { [all '0'] };
> >>        _27 = { 1, 1, 1, 0, 0, 0, [...] };
> >>        _28 = { [all '0'] };
> >>        _29 = { 1, 1, 1, 0, 0, 0, [...] };
> >>        _30 = { [all '0'] };
> >>        _31 = { 1, 1, 1, 0, 0, 0, [...] };
> >>        _32 = { [all '0'] };
> >>        _33 = { 1, 1, 1, 0, 0, 0, [...] };
> >>        _34 = { 1, 0, 0, 0, [...] };
> >>        _35 = { 2, 1, 1, 0, 0, 0, [...] };
> >>        _36 = { 1, 1, 0, 0, 0, [...] };
> >>        _37 = { 3, 2, 1, 0, 0, 0, [...] };
> >>        stmp__12.12_38 = 3;
> >>        return 3;
> >>
> >> ..., so the algorithm would appear to synthesize correct code for that
> >> case.  Adding '7' to 'builtin-bitops-1.c', we however again get:
> >>
> >>      i=13, ints[i]=0x7 a=3, b=6
> >>
> >>
> >> With the following hack applied to 'gcc/tree-vect-loop.cc':
> >>
> >>      @@ -6687,8 +6687,9 @@ vect_create_epilog_for_reduction (loop_vec_info
> >>      loop_vinfo,
> >>             reduce_with_shift = have_whole_vector_shift (mode1);
> >>             if (!VECTOR_MODE_P (mode1)
> >>                || !directly_supported_p (code, vectype1))
> >>              reduce_with_shift = false;
> >>      +      reduce_with_shift = false;
> >>
> >> ..., I'm able to work around those regressions: by means of forcing
> >> "Reduce using scalar code" instead of "Reduce using vector shifts".
> > 
> > I would say it somewhere gets broken between the vectorizer and the GPU
> > which means likely in the target?  Can you point out an issue in the
> > actual generated GCN code?
> > 
> > Iff this kind of reduction is the issue you'd see quite a lot of
> > vectorzer execute FAILs.  I'm seeing a .COND_AND above - could it
> > be that the "mask" is still set wrong when doing the reduction
> > steps?
> 
> It looks like the ds_bpermute_b32 instruction works differently on RDNA3 (vs.
> GCN/CDNA and even RDNA2).
> 
> From the pseudocode in the documentation:
> 
>   for i in 0 : WAVE64 ? 63 : 31 do
>     // ADDR needs to be divided by 4.
>     // High-order bits are ignored.
>     // NOTE: destination lane is MOD 32 regardless of wave size.
>     src_lane = 32'I(VGPR[i][ADDR] + OFFSET.b) / 4 % 32;
>     // EXEC is applied to the source VGPR reads.
>     if EXEC[src_lane].u1 then
>       tmp[i] = VGPR[src_lane][DATA0]
>     endif
>   endfor;
> 
> The key detail is the "mod 32"; the other architectures have "mod 64" there.
> 
> So, the last 32 lanes are discarded, and the first 32 lanes are duplicated
> into the last, and this explains why my_popcount returns double the expected
> value for smaller inputs.
> 
> Richi, can you confirm that this testcase works properly on your card, please?
>
> To test, assuming you only have the offload toolchain built, compile using
> x86_64-none-linux-gnu-accel-amdgcn-amdhsa-gcc, which should produce a raw AMD
> ELF file. Then you run it using "gcn-run a.out" (you can find gcn-run under
> libexec).

I'm getting

i=1, ints[i]=0x1 a=1, b=2
i=2, ints[i]=0x80000000 a=1, b=2
i=3, ints[i]=0x2 a=1, b=2
i=4, ints[i]=0x40000000 a=1, b=2
i=5, ints[i]=0x10000 a=1, b=2
i=6, ints[i]=0x8000 a=1, b=2
i=7, ints[i]=0xa5a5a5a5 a=16, b=32
i=8, ints[i]=0x5a5a5a5a a=16, b=32
i=9, ints[i]=0xcafe0000 a=11, b=22
i=10, ints[i]=0xcafe00 a=11, b=22
i=11, ints[i]=0xcafe a=11, b=22
i=12, ints[i]=0xffffffff a=32, b=64

which I think is the same as Thomas output and thus wrong?

When building with -O0 I get no output.

I'm of course building with -march=gfx1030

Richard.

> Andrew
> 
>
  
Andrew Stubbs Feb. 14, 2024, 1:40 p.m. UTC | #6
On 14/02/2024 13:27, Richard Biener wrote:
> On Wed, 14 Feb 2024, Andrew Stubbs wrote:
> 
>> On 13/02/2024 08:26, Richard Biener wrote:
>>> On Mon, 12 Feb 2024, Thomas Schwinge wrote:
>>>
>>>> Hi!
>>>>
>>>> On 2023-10-20T12:51:03+0100, Andrew Stubbs <ams@codesourcery.com> wrote:
>>>>> I've committed this patch
>>>>
>>>> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
>>>> "amdgcn: add -march=gfx1030 EXPERIMENTAL".
>>>>
>>>> The RDNA2 ISA variant doesn't support certain instructions previous
>>>> implemented in GCC/GCN, so a number of patterns etc. had to be disabled:
>>>>
>>>>> [...] Vector
>>>>> reductions will need to be reworked for RDNA2.  [...]
>>>>
>>>>>   * config/gcn/gcn-valu.md (@dpp_move<mode>): Disable for RDNA2.
>>>>>   (addc<mode>3<exec_vcc>): Add RDNA2 syntax variant.
>>>>>   (subc<mode>3<exec_vcc>): Likewise.
>>>>>   (<convop><mode><vndi>2_exec): Add RDNA2 alternatives.
>>>>>   (vec_cmp<mode>di): Likewise.
>>>>>   (vec_cmp<u><mode>di): Likewise.
>>>>>   (vec_cmp<mode>di_exec): Likewise.
>>>>>   (vec_cmp<u><mode>di_exec): Likewise.
>>>>>   (vec_cmp<mode>di_dup): Likewise.
>>>>>   (vec_cmp<mode>di_dup_exec): Likewise.
>>>>>   (reduc_<reduc_op>_scal_<mode>): Disable for RDNA2.
>>>>>   (*<reduc_op>_dpp_shr_<mode>): Likewise.
>>>>>   (*plus_carry_dpp_shr_<mode>): Likewise.
>>>>>   (*plus_carry_in_dpp_shr_<mode>): Likewise.
>>>>
>>>> Etc.  The expectation being that GCC middle end copes with this, and
>>>> synthesizes some less ideal yet still functional vector code, I presume.
>>>>
>>>> The later RDNA3/gfx1100 support builds on top of this, and that's what
>>>> I'm currently working on getting proper GCC/GCN target (not offloading)
>>>> results for.
>>>>
>>>> I'm seeing a good number of execution test FAILs (regressions compared to
>>>> my earlier non-gfx1100 testing), and I've now tracked down where one
>>>> large class of those comes into existance -- not yet how to resolve,
>>>> unfortunately.  But maybe, with you guys' combined vectorizer and back
>>>> end experience, the latter will be done quickly?
>>>>
>>>> Richard, I don't know if you've ever run actual GCC/GCN target (not
>>>> offloading) testing; let me know if you have any questions about that.
>>>
>>> I've only done offload testing - in the x86_64 build tree run
>>> check-target-libgomp.  If you can tell me how to do GCN target testing
>>> (maybe document it on the wiki even!) I can try do that as well.
>>>
>>>> Given that (at least largely?) the same patterns etc. are disabled as in
>>>> my gfx1100 configuration, I suppose your gfx1030 one would exhibit the
>>>> same issues.  You can build GCC/GCN target like you build the offloading
>>>> one, just remove '--enable-as-accelerator-for=[...]'.  Likely, you can
>>>> even use a offloading GCC/GCN build to reproduce the issue below.
>>>>
>>>> One example is the attached 'builtin-bitops-1.c', reduced from
>>>> 'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is
>>>> miscompiled as soon as '-ftree-vectorize' is effective:
>>>>
>>>>       $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c
>>>>       -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
>>>>       -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all
>>>>       -fdump-ipa-all-all -fdump-rtl-all-all -save-temps -march=gfx1100 -O1
>>>>       -ftree-vectorize
>>>>
>>>> In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for
>>>> '-march=gfx90a' vs. '-march=gfx1100', we see:
>>>>
>>>>       +builtin-bitops-1.c:7:17: missed:   reduc op not supported by target.
>>>>
>>>> ..., and therefore:
>>>>
>>>>       -builtin-bitops-1.c:7:17: note:  Reduce using direct vector reduction.
>>>>       +builtin-bitops-1.c:7:17: note:  Reduce using vector shifts
>>>>       +builtin-bitops-1.c:7:17: note:  extract scalar result
>>>>
>>>> That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we build a
>>>> chain of summation of 'VEC_PERM_EXPR's.  However, there's wrong code
>>>> generated:
>>>>
>>>>       $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
>>>>       i=1, ints[i]=0x1 a=1, b=2
>>>>       i=2, ints[i]=0x80000000 a=1, b=2
>>>>       i=3, ints[i]=0x2 a=1, b=2
>>>>       i=4, ints[i]=0x40000000 a=1, b=2
>>>>       i=5, ints[i]=0x10000 a=1, b=2
>>>>       i=6, ints[i]=0x8000 a=1, b=2
>>>>       i=7, ints[i]=0xa5a5a5a5 a=16, b=32
>>>>       i=8, ints[i]=0x5a5a5a5a a=16, b=32
>>>>       i=9, ints[i]=0xcafe0000 a=11, b=22
>>>>       i=10, ints[i]=0xcafe00 a=11, b=22
>>>>       i=11, ints[i]=0xcafe a=11, b=22
>>>>       i=12, ints[i]=0xffffffff a=32, b=64
>>>>
>>>> (I can't tell if the 'b = 2 * a' pattern is purely coincidental?)
>>>>
>>>> I don't speak enough "vectorization" to fully understand the generic
>>>> vectorized algorithm and its implementation.  It appears that the
>>>> "Reduce using vector shifts" code has been around for a very long time,
>>>> but also has gone through a number of changes.  I can't tell which GCC
>>>> targets/configurations it's actually used for (in the same way as for
>>>> GCN gfx1100), and thus whether there's an issue in that vectorizer code,
>>>> or rather in the GCN back end, or GCN back end parameterizing the generic
>>>> code?
>>>
>>> The "shift" reduction is basically doing reduction by repeatedly
>>> adding the upper to the lower half of the vector (each time halving
>>> the vector size).
>>>
>>>> Manually working through the 'a-builtin-bitops-1.c.265t.optimized' code:
>>>>
>>>>       int my_popcount (unsigned int x)
>>>>       {
>>>>         int stmp__12.12;
>>>>         vector(64) int vect__12.11;
>>>>         vector(64) unsigned int vect__1.8;
>>>>         vector(64) unsigned int _13;
>>>>         vector(64) unsigned int vect_cst__18;
>>>>         vector(64) int [all others];
>>>>       
>>>>         <bb 2> [local count: 32534376]:
>>>>         vect_cst__18 = { [all 'x_8(D)'] };
>>>>         vect__1.8_19 = vect_cst__18 >> { 0, 1, 2, [...], 61, 62, 63 };
>>>>         _13 = .COND_AND ({ [32 x '-1'], [32 x '0'] }, vect__1.8_19, { [all
>>>>         '1'] }, { [all '0'] });
>>>>         vect__12.11_24 = VIEW_CONVERT_EXPR<vector(64) int>(_13);
>>>>         _26 = VEC_PERM_EXPR <vect__12.11_24, { [all '0'] }, { 32, 33, 34,
>>>>         [...], 93, 94, 95 }>;
>>>>         _27 = vect__12.11_24 + _26;
>>>>         _28 = VEC_PERM_EXPR <_27, { [all '0'] }, { 16, 17, 18, [...], 77,
>>>>         78, 79 }>;
>>>>         _29 = _27 + _28;
>>>>         _30 = VEC_PERM_EXPR <_29, { [all '0'] }, { 8, 9, 10, [...], 69, 70,
>>>>         71 }>;
>>>>         _31 = _29 + _30;
>>>>         _32 = VEC_PERM_EXPR <_31, { [all '0'] }, { 4, 5, 6, [...], 65, 66,
>>>>         67 }>;
>>>>         _33 = _31 + _32;
>>>>         _34 = VEC_PERM_EXPR <_33, { [all '0'] }, { 2, 3, 4, [...], 63, 64,
>>>>         65 }>;
>>>>         _35 = _33 + _34;
>>>>         _36 = VEC_PERM_EXPR <_35, { [all '0'] }, { 1, 2, 3, [...], 62, 63,
>>>>         64 }>;
>>>>         _37 = _35 + _36;
>>>>         stmp__12.12_38 = BIT_FIELD_REF <_37, 32, 0>;
>>>>         return stmp__12.12_38;
>>>>
>>>> ..., for example, for 'x = 7', we get:
>>>>
>>>>         vect_cst__18 = { [all '7'] };
>>>>         vect__1.8_19 = { 7, 3, 1, 0, 0, 0, [...] };
>>>>         _13 = { 1, 1, 1, 0, 0, 0, [...] };
>>>>         vect__12.11_24 = { 1, 1, 1, 0, 0, 0, [...] };
>>>>         _26 = { [all '0'] };
>>>>         _27 = { 1, 1, 1, 0, 0, 0, [...] };
>>>>         _28 = { [all '0'] };
>>>>         _29 = { 1, 1, 1, 0, 0, 0, [...] };
>>>>         _30 = { [all '0'] };
>>>>         _31 = { 1, 1, 1, 0, 0, 0, [...] };
>>>>         _32 = { [all '0'] };
>>>>         _33 = { 1, 1, 1, 0, 0, 0, [...] };
>>>>         _34 = { 1, 0, 0, 0, [...] };
>>>>         _35 = { 2, 1, 1, 0, 0, 0, [...] };
>>>>         _36 = { 1, 1, 0, 0, 0, [...] };
>>>>         _37 = { 3, 2, 1, 0, 0, 0, [...] };
>>>>         stmp__12.12_38 = 3;
>>>>         return 3;
>>>>
>>>> ..., so the algorithm would appear to synthesize correct code for that
>>>> case.  Adding '7' to 'builtin-bitops-1.c', we however again get:
>>>>
>>>>       i=13, ints[i]=0x7 a=3, b=6
>>>>
>>>>
>>>> With the following hack applied to 'gcc/tree-vect-loop.cc':
>>>>
>>>>       @@ -6687,8 +6687,9 @@ vect_create_epilog_for_reduction (loop_vec_info
>>>>       loop_vinfo,
>>>>              reduce_with_shift = have_whole_vector_shift (mode1);
>>>>              if (!VECTOR_MODE_P (mode1)
>>>>                 || !directly_supported_p (code, vectype1))
>>>>               reduce_with_shift = false;
>>>>       +      reduce_with_shift = false;
>>>>
>>>> ..., I'm able to work around those regressions: by means of forcing
>>>> "Reduce using scalar code" instead of "Reduce using vector shifts".
>>>
>>> I would say it somewhere gets broken between the vectorizer and the GPU
>>> which means likely in the target?  Can you point out an issue in the
>>> actual generated GCN code?
>>>
>>> Iff this kind of reduction is the issue you'd see quite a lot of
>>> vectorzer execute FAILs.  I'm seeing a .COND_AND above - could it
>>> be that the "mask" is still set wrong when doing the reduction
>>> steps?
>>
>> It looks like the ds_bpermute_b32 instruction works differently on RDNA3 (vs.
>> GCN/CDNA and even RDNA2).
>>
>>  From the pseudocode in the documentation:
>>
>>    for i in 0 : WAVE64 ? 63 : 31 do
>>      // ADDR needs to be divided by 4.
>>      // High-order bits are ignored.
>>      // NOTE: destination lane is MOD 32 regardless of wave size.
>>      src_lane = 32'I(VGPR[i][ADDR] + OFFSET.b) / 4 % 32;
>>      // EXEC is applied to the source VGPR reads.
>>      if EXEC[src_lane].u1 then
>>        tmp[i] = VGPR[src_lane][DATA0]
>>      endif
>>    endfor;
>>
>> The key detail is the "mod 32"; the other architectures have "mod 64" there.
>>
>> So, the last 32 lanes are discarded, and the first 32 lanes are duplicated
>> into the last, and this explains why my_popcount returns double the expected
>> value for smaller inputs.
>>
>> Richi, can you confirm that this testcase works properly on your card, please?
>>
>> To test, assuming you only have the offload toolchain built, compile using
>> x86_64-none-linux-gnu-accel-amdgcn-amdhsa-gcc, which should produce a raw AMD
>> ELF file. Then you run it using "gcn-run a.out" (you can find gcn-run under
>> libexec).
> 
> I'm getting
> 
> i=1, ints[i]=0x1 a=1, b=2
> i=2, ints[i]=0x80000000 a=1, b=2
> i=3, ints[i]=0x2 a=1, b=2
> i=4, ints[i]=0x40000000 a=1, b=2
> i=5, ints[i]=0x10000 a=1, b=2
> i=6, ints[i]=0x8000 a=1, b=2
> i=7, ints[i]=0xa5a5a5a5 a=16, b=32
> i=8, ints[i]=0x5a5a5a5a a=16, b=32
> i=9, ints[i]=0xcafe0000 a=11, b=22
> i=10, ints[i]=0xcafe00 a=11, b=22
> i=11, ints[i]=0xcafe a=11, b=22
> i=12, ints[i]=0xffffffff a=32, b=64
> 
> which I think is the same as Thomas output and thus wrong?
> 
> When building with -O0 I get no output.
> 
> I'm of course building with -march=gfx1030

OK, please try this example, just to check my expectation that your 
permute works:

typedef int v64si __attribute__ ((vector_size (256)));

int main()
{
   v64si permute = {
     40, 40, 40, 40, 40, 40, 40, 40,
     40, 40, 40, 40, 40, 40, 40, 40,
     40, 40, 40, 40, 40, 40, 40, 40,
     40, 40, 40, 40, 40, 40, 40, 40,
     40, 40, 40, 40, 40, 40, 40, 40,
     40, 40, 40, 40, 40, 40, 40, 40,
     40, 40, 40, 40, 40, 40, 40, 40,
     40, 40, 40, 40, 40, 40, 40, 40
   };
   v64si result;

   asm ("ds_bpermute_b32 %0, %1, v1" : "=v"(result) : "v"(permute), 
"e"(-1L));

   for (int i=0; i<63; i++)
     __builtin_printf ("%d ", result[i]);
   __builtin_printf ("\n");

   return 0;
}

On GCN/CDNA devices I expect this to print "10" 64 times. On RDNA3 it 
prints "10" 32 times, and "42" 32 times (which doesn't quite match what 
I'd expect from the pseudocode, but does match the written description). 
Which do you get?

Thanks

Andrew
  
Richard Biener Feb. 14, 2024, 1:43 p.m. UTC | #7
On Wed, 14 Feb 2024, Andrew Stubbs wrote:

> On 14/02/2024 13:27, Richard Biener wrote:
> > On Wed, 14 Feb 2024, Andrew Stubbs wrote:
> > 
> >> On 13/02/2024 08:26, Richard Biener wrote:
> >>> On Mon, 12 Feb 2024, Thomas Schwinge wrote:
> >>>
> >>>> Hi!
> >>>>
> >>>> On 2023-10-20T12:51:03+0100, Andrew Stubbs <ams@codesourcery.com> wrote:
> >>>>> I've committed this patch
> >>>>
> >>>> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
> >>>> "amdgcn: add -march=gfx1030 EXPERIMENTAL".
> >>>>
> >>>> The RDNA2 ISA variant doesn't support certain instructions previous
> >>>> implemented in GCC/GCN, so a number of patterns etc. had to be disabled:
> >>>>
> >>>>> [...] Vector
> >>>>> reductions will need to be reworked for RDNA2.  [...]
> >>>>
> >>>>>   * config/gcn/gcn-valu.md (@dpp_move<mode>): Disable for RDNA2.
> >>>>>   (addc<mode>3<exec_vcc>): Add RDNA2 syntax variant.
> >>>>>   (subc<mode>3<exec_vcc>): Likewise.
> >>>>>   (<convop><mode><vndi>2_exec): Add RDNA2 alternatives.
> >>>>>   (vec_cmp<mode>di): Likewise.
> >>>>>   (vec_cmp<u><mode>di): Likewise.
> >>>>>   (vec_cmp<mode>di_exec): Likewise.
> >>>>>   (vec_cmp<u><mode>di_exec): Likewise.
> >>>>>   (vec_cmp<mode>di_dup): Likewise.
> >>>>>   (vec_cmp<mode>di_dup_exec): Likewise.
> >>>>>   (reduc_<reduc_op>_scal_<mode>): Disable for RDNA2.
> >>>>>   (*<reduc_op>_dpp_shr_<mode>): Likewise.
> >>>>>   (*plus_carry_dpp_shr_<mode>): Likewise.
> >>>>>   (*plus_carry_in_dpp_shr_<mode>): Likewise.
> >>>>
> >>>> Etc.  The expectation being that GCC middle end copes with this, and
> >>>> synthesizes some less ideal yet still functional vector code, I presume.
> >>>>
> >>>> The later RDNA3/gfx1100 support builds on top of this, and that's what
> >>>> I'm currently working on getting proper GCC/GCN target (not offloading)
> >>>> results for.
> >>>>
> >>>> I'm seeing a good number of execution test FAILs (regressions compared to
> >>>> my earlier non-gfx1100 testing), and I've now tracked down where one
> >>>> large class of those comes into existance -- not yet how to resolve,
> >>>> unfortunately.  But maybe, with you guys' combined vectorizer and back
> >>>> end experience, the latter will be done quickly?
> >>>>
> >>>> Richard, I don't know if you've ever run actual GCC/GCN target (not
> >>>> offloading) testing; let me know if you have any questions about that.
> >>>
> >>> I've only done offload testing - in the x86_64 build tree run
> >>> check-target-libgomp.  If you can tell me how to do GCN target testing
> >>> (maybe document it on the wiki even!) I can try do that as well.
> >>>
> >>>> Given that (at least largely?) the same patterns etc. are disabled as in
> >>>> my gfx1100 configuration, I suppose your gfx1030 one would exhibit the
> >>>> same issues.  You can build GCC/GCN target like you build the offloading
> >>>> one, just remove '--enable-as-accelerator-for=[...]'.  Likely, you can
> >>>> even use a offloading GCC/GCN build to reproduce the issue below.
> >>>>
> >>>> One example is the attached 'builtin-bitops-1.c', reduced from
> >>>> 'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is
> >>>> miscompiled as soon as '-ftree-vectorize' is effective:
> >>>>
> >>>>       $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c
> >>>>       -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
> >>>>       -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all
> >>>>       -fdump-ipa-all-all -fdump-rtl-all-all -save-temps -march=gfx1100
> >>>>       -O1
> >>>>       -ftree-vectorize
> >>>>
> >>>> In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for
> >>>> '-march=gfx90a' vs. '-march=gfx1100', we see:
> >>>>
> >>>>       +builtin-bitops-1.c:7:17: missed:   reduc op not supported by
> >>>>       target.
> >>>>
> >>>> ..., and therefore:
> >>>>
> >>>>       -builtin-bitops-1.c:7:17: note:  Reduce using direct vector
> >>>>       reduction.
> >>>>       +builtin-bitops-1.c:7:17: note:  Reduce using vector shifts
> >>>>       +builtin-bitops-1.c:7:17: note:  extract scalar result
> >>>>
> >>>> That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we build a
> >>>> chain of summation of 'VEC_PERM_EXPR's.  However, there's wrong code
> >>>> generated:
> >>>>
> >>>>       $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
> >>>>       i=1, ints[i]=0x1 a=1, b=2
> >>>>       i=2, ints[i]=0x80000000 a=1, b=2
> >>>>       i=3, ints[i]=0x2 a=1, b=2
> >>>>       i=4, ints[i]=0x40000000 a=1, b=2
> >>>>       i=5, ints[i]=0x10000 a=1, b=2
> >>>>       i=6, ints[i]=0x8000 a=1, b=2
> >>>>       i=7, ints[i]=0xa5a5a5a5 a=16, b=32
> >>>>       i=8, ints[i]=0x5a5a5a5a a=16, b=32
> >>>>       i=9, ints[i]=0xcafe0000 a=11, b=22
> >>>>       i=10, ints[i]=0xcafe00 a=11, b=22
> >>>>       i=11, ints[i]=0xcafe a=11, b=22
> >>>>       i=12, ints[i]=0xffffffff a=32, b=64
> >>>>
> >>>> (I can't tell if the 'b = 2 * a' pattern is purely coincidental?)
> >>>>
> >>>> I don't speak enough "vectorization" to fully understand the generic
> >>>> vectorized algorithm and its implementation.  It appears that the
> >>>> "Reduce using vector shifts" code has been around for a very long time,
> >>>> but also has gone through a number of changes.  I can't tell which GCC
> >>>> targets/configurations it's actually used for (in the same way as for
> >>>> GCN gfx1100), and thus whether there's an issue in that vectorizer code,
> >>>> or rather in the GCN back end, or GCN back end parameterizing the generic
> >>>> code?
> >>>
> >>> The "shift" reduction is basically doing reduction by repeatedly
> >>> adding the upper to the lower half of the vector (each time halving
> >>> the vector size).
> >>>
> >>>> Manually working through the 'a-builtin-bitops-1.c.265t.optimized' code:
> >>>>
> >>>>       int my_popcount (unsigned int x)
> >>>>       {
> >>>>         int stmp__12.12;
> >>>>         vector(64) int vect__12.11;
> >>>>         vector(64) unsigned int vect__1.8;
> >>>>         vector(64) unsigned int _13;
> >>>>         vector(64) unsigned int vect_cst__18;
> >>>>         vector(64) int [all others];
> >>>>       
> >>>>         <bb 2> [local count: 32534376]:
> >>>>         vect_cst__18 = { [all 'x_8(D)'] };
> >>>>         vect__1.8_19 = vect_cst__18 >> { 0, 1, 2, [...], 61, 62, 63 };
> >>>>         _13 = .COND_AND ({ [32 x '-1'], [32 x '0'] }, vect__1.8_19, {
> >>>>         [all
> >>>>         '1'] }, { [all '0'] });
> >>>>         vect__12.11_24 = VIEW_CONVERT_EXPR<vector(64) int>(_13);
> >>>>         _26 = VEC_PERM_EXPR <vect__12.11_24, { [all '0'] }, { 32, 33, 34,
> >>>>         [...], 93, 94, 95 }>;
> >>>>         _27 = vect__12.11_24 + _26;
> >>>>         _28 = VEC_PERM_EXPR <_27, { [all '0'] }, { 16, 17, 18, [...], 77,
> >>>>         78, 79 }>;
> >>>>         _29 = _27 + _28;
> >>>>         _30 = VEC_PERM_EXPR <_29, { [all '0'] }, { 8, 9, 10, [...], 69,
> >>>>         70,
> >>>>         71 }>;
> >>>>         _31 = _29 + _30;
> >>>>         _32 = VEC_PERM_EXPR <_31, { [all '0'] }, { 4, 5, 6, [...], 65,
> >>>>         66,
> >>>>         67 }>;
> >>>>         _33 = _31 + _32;
> >>>>         _34 = VEC_PERM_EXPR <_33, { [all '0'] }, { 2, 3, 4, [...], 63,
> >>>>         64,
> >>>>         65 }>;
> >>>>         _35 = _33 + _34;
> >>>>         _36 = VEC_PERM_EXPR <_35, { [all '0'] }, { 1, 2, 3, [...], 62,
> >>>>         63,
> >>>>         64 }>;
> >>>>         _37 = _35 + _36;
> >>>>         stmp__12.12_38 = BIT_FIELD_REF <_37, 32, 0>;
> >>>>         return stmp__12.12_38;
> >>>>
> >>>> ..., for example, for 'x = 7', we get:
> >>>>
> >>>>         vect_cst__18 = { [all '7'] };
> >>>>         vect__1.8_19 = { 7, 3, 1, 0, 0, 0, [...] };
> >>>>         _13 = { 1, 1, 1, 0, 0, 0, [...] };
> >>>>         vect__12.11_24 = { 1, 1, 1, 0, 0, 0, [...] };
> >>>>         _26 = { [all '0'] };
> >>>>         _27 = { 1, 1, 1, 0, 0, 0, [...] };
> >>>>         _28 = { [all '0'] };
> >>>>         _29 = { 1, 1, 1, 0, 0, 0, [...] };
> >>>>         _30 = { [all '0'] };
> >>>>         _31 = { 1, 1, 1, 0, 0, 0, [...] };
> >>>>         _32 = { [all '0'] };
> >>>>         _33 = { 1, 1, 1, 0, 0, 0, [...] };
> >>>>         _34 = { 1, 0, 0, 0, [...] };
> >>>>         _35 = { 2, 1, 1, 0, 0, 0, [...] };
> >>>>         _36 = { 1, 1, 0, 0, 0, [...] };
> >>>>         _37 = { 3, 2, 1, 0, 0, 0, [...] };
> >>>>         stmp__12.12_38 = 3;
> >>>>         return 3;
> >>>>
> >>>> ..., so the algorithm would appear to synthesize correct code for that
> >>>> case.  Adding '7' to 'builtin-bitops-1.c', we however again get:
> >>>>
> >>>>       i=13, ints[i]=0x7 a=3, b=6
> >>>>
> >>>>
> >>>> With the following hack applied to 'gcc/tree-vect-loop.cc':
> >>>>
> >>>>       @@ -6687,8 +6687,9 @@ vect_create_epilog_for_reduction
> >>>>       (loop_vec_info
> >>>>       loop_vinfo,
> >>>>              reduce_with_shift = have_whole_vector_shift (mode1);
> >>>>              if (!VECTOR_MODE_P (mode1)
> >>>>                 || !directly_supported_p (code, vectype1))
> >>>>               reduce_with_shift = false;
> >>>>       +      reduce_with_shift = false;
> >>>>
> >>>> ..., I'm able to work around those regressions: by means of forcing
> >>>> "Reduce using scalar code" instead of "Reduce using vector shifts".
> >>>
> >>> I would say it somewhere gets broken between the vectorizer and the GPU
> >>> which means likely in the target?  Can you point out an issue in the
> >>> actual generated GCN code?
> >>>
> >>> Iff this kind of reduction is the issue you'd see quite a lot of
> >>> vectorzer execute FAILs.  I'm seeing a .COND_AND above - could it
> >>> be that the "mask" is still set wrong when doing the reduction
> >>> steps?
> >>
> >> It looks like the ds_bpermute_b32 instruction works differently on RDNA3
> >> (vs.
> >> GCN/CDNA and even RDNA2).
> >>
> >>  From the pseudocode in the documentation:
> >>
> >>    for i in 0 : WAVE64 ? 63 : 31 do
> >>      // ADDR needs to be divided by 4.
> >>      // High-order bits are ignored.
> >>      // NOTE: destination lane is MOD 32 regardless of wave size.
> >>      src_lane = 32'I(VGPR[i][ADDR] + OFFSET.b) / 4 % 32;
> >>      // EXEC is applied to the source VGPR reads.
> >>      if EXEC[src_lane].u1 then
> >>        tmp[i] = VGPR[src_lane][DATA0]
> >>      endif
> >>    endfor;
> >>
> >> The key detail is the "mod 32"; the other architectures have "mod 64"
> >> there.
> >>
> >> So, the last 32 lanes are discarded, and the first 32 lanes are duplicated
> >> into the last, and this explains why my_popcount returns double the
> >> expected
> >> value for smaller inputs.
> >>
> >> Richi, can you confirm that this testcase works properly on your card,
> >> please?
> >>
> >> To test, assuming you only have the offload toolchain built, compile using
> >> x86_64-none-linux-gnu-accel-amdgcn-amdhsa-gcc, which should produce a raw
> >> AMD
> >> ELF file. Then you run it using "gcn-run a.out" (you can find gcn-run under
> >> libexec).
> > 
> > I'm getting
> > 
> > i=1, ints[i]=0x1 a=1, b=2
> > i=2, ints[i]=0x80000000 a=1, b=2
> > i=3, ints[i]=0x2 a=1, b=2
> > i=4, ints[i]=0x40000000 a=1, b=2
> > i=5, ints[i]=0x10000 a=1, b=2
> > i=6, ints[i]=0x8000 a=1, b=2
> > i=7, ints[i]=0xa5a5a5a5 a=16, b=32
> > i=8, ints[i]=0x5a5a5a5a a=16, b=32
> > i=9, ints[i]=0xcafe0000 a=11, b=22
> > i=10, ints[i]=0xcafe00 a=11, b=22
> > i=11, ints[i]=0xcafe a=11, b=22
> > i=12, ints[i]=0xffffffff a=32, b=64
> > 
> > which I think is the same as Thomas output and thus wrong?
> > 
> > When building with -O0 I get no output.
> > 
> > I'm of course building with -march=gfx1030
> 
> OK, please try this example, just to check my expectation that your permute
> works:
> 
> typedef int v64si __attribute__ ((vector_size (256)));
> 
> int main()
> {
>   v64si permute = {
>     40, 40, 40, 40, 40, 40, 40, 40,
>     40, 40, 40, 40, 40, 40, 40, 40,
>     40, 40, 40, 40, 40, 40, 40, 40,
>     40, 40, 40, 40, 40, 40, 40, 40,
>     40, 40, 40, 40, 40, 40, 40, 40,
>     40, 40, 40, 40, 40, 40, 40, 40,
>     40, 40, 40, 40, 40, 40, 40, 40,
>     40, 40, 40, 40, 40, 40, 40, 40
>   };
>   v64si result;
> 
>   asm ("ds_bpermute_b32 %0, %1, v1" : "=v"(result) : "v"(permute), "e"(-1L));
> 
>   for (int i=0; i<63; i++)
>     __builtin_printf ("%d ", result[i]);
>   __builtin_printf ("\n");
> 
>   return 0;
> }
> 
> On GCN/CDNA devices I expect this to print "10" 64 times. On RDNA3 it prints
> "10" 32 times, and "42" 32 times (which doesn't quite match what I'd expect
> from the pseudocode, but does match the written description). Which do you
> get?

10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 
10 10 10 10 10 10 10 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 
42 42 42 42 42 42 42 42 42 42 42 42 42 

so RDNA2 matches RDNA3 here.

Richard.
  
Andrew Stubbs Feb. 14, 2024, 3:23 p.m. UTC | #8
On 14/02/2024 13:43, Richard Biener wrote:
> On Wed, 14 Feb 2024, Andrew Stubbs wrote:
> 
>> On 14/02/2024 13:27, Richard Biener wrote:
>>> On Wed, 14 Feb 2024, Andrew Stubbs wrote:
>>>
>>>> On 13/02/2024 08:26, Richard Biener wrote:
>>>>> On Mon, 12 Feb 2024, Thomas Schwinge wrote:
>>>>>
>>>>>> Hi!
>>>>>>
>>>>>> On 2023-10-20T12:51:03+0100, Andrew Stubbs <ams@codesourcery.com> wrote:
>>>>>>> I've committed this patch
>>>>>>
>>>>>> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
>>>>>> "amdgcn: add -march=gfx1030 EXPERIMENTAL".
>>>>>>
>>>>>> The RDNA2 ISA variant doesn't support certain instructions previous
>>>>>> implemented in GCC/GCN, so a number of patterns etc. had to be disabled:
>>>>>>
>>>>>>> [...] Vector
>>>>>>> reductions will need to be reworked for RDNA2.  [...]
>>>>>>
>>>>>>>    * config/gcn/gcn-valu.md (@dpp_move<mode>): Disable for RDNA2.
>>>>>>>    (addc<mode>3<exec_vcc>): Add RDNA2 syntax variant.
>>>>>>>    (subc<mode>3<exec_vcc>): Likewise.
>>>>>>>    (<convop><mode><vndi>2_exec): Add RDNA2 alternatives.
>>>>>>>    (vec_cmp<mode>di): Likewise.
>>>>>>>    (vec_cmp<u><mode>di): Likewise.
>>>>>>>    (vec_cmp<mode>di_exec): Likewise.
>>>>>>>    (vec_cmp<u><mode>di_exec): Likewise.
>>>>>>>    (vec_cmp<mode>di_dup): Likewise.
>>>>>>>    (vec_cmp<mode>di_dup_exec): Likewise.
>>>>>>>    (reduc_<reduc_op>_scal_<mode>): Disable for RDNA2.
>>>>>>>    (*<reduc_op>_dpp_shr_<mode>): Likewise.
>>>>>>>    (*plus_carry_dpp_shr_<mode>): Likewise.
>>>>>>>    (*plus_carry_in_dpp_shr_<mode>): Likewise.
>>>>>>
>>>>>> Etc.  The expectation being that GCC middle end copes with this, and
>>>>>> synthesizes some less ideal yet still functional vector code, I presume.
>>>>>>
>>>>>> The later RDNA3/gfx1100 support builds on top of this, and that's what
>>>>>> I'm currently working on getting proper GCC/GCN target (not offloading)
>>>>>> results for.
>>>>>>
>>>>>> I'm seeing a good number of execution test FAILs (regressions compared to
>>>>>> my earlier non-gfx1100 testing), and I've now tracked down where one
>>>>>> large class of those comes into existance -- not yet how to resolve,
>>>>>> unfortunately.  But maybe, with you guys' combined vectorizer and back
>>>>>> end experience, the latter will be done quickly?
>>>>>>
>>>>>> Richard, I don't know if you've ever run actual GCC/GCN target (not
>>>>>> offloading) testing; let me know if you have any questions about that.
>>>>>
>>>>> I've only done offload testing - in the x86_64 build tree run
>>>>> check-target-libgomp.  If you can tell me how to do GCN target testing
>>>>> (maybe document it on the wiki even!) I can try do that as well.
>>>>>
>>>>>> Given that (at least largely?) the same patterns etc. are disabled as in
>>>>>> my gfx1100 configuration, I suppose your gfx1030 one would exhibit the
>>>>>> same issues.  You can build GCC/GCN target like you build the offloading
>>>>>> one, just remove '--enable-as-accelerator-for=[...]'.  Likely, you can
>>>>>> even use a offloading GCC/GCN build to reproduce the issue below.
>>>>>>
>>>>>> One example is the attached 'builtin-bitops-1.c', reduced from
>>>>>> 'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is
>>>>>> miscompiled as soon as '-ftree-vectorize' is effective:
>>>>>>
>>>>>>        $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c
>>>>>>        -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
>>>>>>        -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all
>>>>>>        -fdump-ipa-all-all -fdump-rtl-all-all -save-temps -march=gfx1100
>>>>>>        -O1
>>>>>>        -ftree-vectorize
>>>>>>
>>>>>> In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for
>>>>>> '-march=gfx90a' vs. '-march=gfx1100', we see:
>>>>>>
>>>>>>        +builtin-bitops-1.c:7:17: missed:   reduc op not supported by
>>>>>>        target.
>>>>>>
>>>>>> ..., and therefore:
>>>>>>
>>>>>>        -builtin-bitops-1.c:7:17: note:  Reduce using direct vector
>>>>>>        reduction.
>>>>>>        +builtin-bitops-1.c:7:17: note:  Reduce using vector shifts
>>>>>>        +builtin-bitops-1.c:7:17: note:  extract scalar result
>>>>>>
>>>>>> That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we build a
>>>>>> chain of summation of 'VEC_PERM_EXPR's.  However, there's wrong code
>>>>>> generated:
>>>>>>
>>>>>>        $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
>>>>>>        i=1, ints[i]=0x1 a=1, b=2
>>>>>>        i=2, ints[i]=0x80000000 a=1, b=2
>>>>>>        i=3, ints[i]=0x2 a=1, b=2
>>>>>>        i=4, ints[i]=0x40000000 a=1, b=2
>>>>>>        i=5, ints[i]=0x10000 a=1, b=2
>>>>>>        i=6, ints[i]=0x8000 a=1, b=2
>>>>>>        i=7, ints[i]=0xa5a5a5a5 a=16, b=32
>>>>>>        i=8, ints[i]=0x5a5a5a5a a=16, b=32
>>>>>>        i=9, ints[i]=0xcafe0000 a=11, b=22
>>>>>>        i=10, ints[i]=0xcafe00 a=11, b=22
>>>>>>        i=11, ints[i]=0xcafe a=11, b=22
>>>>>>        i=12, ints[i]=0xffffffff a=32, b=64
>>>>>>
>>>>>> (I can't tell if the 'b = 2 * a' pattern is purely coincidental?)
>>>>>>
>>>>>> I don't speak enough "vectorization" to fully understand the generic
>>>>>> vectorized algorithm and its implementation.  It appears that the
>>>>>> "Reduce using vector shifts" code has been around for a very long time,
>>>>>> but also has gone through a number of changes.  I can't tell which GCC
>>>>>> targets/configurations it's actually used for (in the same way as for
>>>>>> GCN gfx1100), and thus whether there's an issue in that vectorizer code,
>>>>>> or rather in the GCN back end, or GCN back end parameterizing the generic
>>>>>> code?
>>>>>
>>>>> The "shift" reduction is basically doing reduction by repeatedly
>>>>> adding the upper to the lower half of the vector (each time halving
>>>>> the vector size).
>>>>>
>>>>>> Manually working through the 'a-builtin-bitops-1.c.265t.optimized' code:
>>>>>>
>>>>>>        int my_popcount (unsigned int x)
>>>>>>        {
>>>>>>          int stmp__12.12;
>>>>>>          vector(64) int vect__12.11;
>>>>>>          vector(64) unsigned int vect__1.8;
>>>>>>          vector(64) unsigned int _13;
>>>>>>          vector(64) unsigned int vect_cst__18;
>>>>>>          vector(64) int [all others];
>>>>>>        
>>>>>>          <bb 2> [local count: 32534376]:
>>>>>>          vect_cst__18 = { [all 'x_8(D)'] };
>>>>>>          vect__1.8_19 = vect_cst__18 >> { 0, 1, 2, [...], 61, 62, 63 };
>>>>>>          _13 = .COND_AND ({ [32 x '-1'], [32 x '0'] }, vect__1.8_19, {
>>>>>>          [all
>>>>>>          '1'] }, { [all '0'] });
>>>>>>          vect__12.11_24 = VIEW_CONVERT_EXPR<vector(64) int>(_13);
>>>>>>          _26 = VEC_PERM_EXPR <vect__12.11_24, { [all '0'] }, { 32, 33, 34,
>>>>>>          [...], 93, 94, 95 }>;
>>>>>>          _27 = vect__12.11_24 + _26;
>>>>>>          _28 = VEC_PERM_EXPR <_27, { [all '0'] }, { 16, 17, 18, [...], 77,
>>>>>>          78, 79 }>;
>>>>>>          _29 = _27 + _28;
>>>>>>          _30 = VEC_PERM_EXPR <_29, { [all '0'] }, { 8, 9, 10, [...], 69,
>>>>>>          70,
>>>>>>          71 }>;
>>>>>>          _31 = _29 + _30;
>>>>>>          _32 = VEC_PERM_EXPR <_31, { [all '0'] }, { 4, 5, 6, [...], 65,
>>>>>>          66,
>>>>>>          67 }>;
>>>>>>          _33 = _31 + _32;
>>>>>>          _34 = VEC_PERM_EXPR <_33, { [all '0'] }, { 2, 3, 4, [...], 63,
>>>>>>          64,
>>>>>>          65 }>;
>>>>>>          _35 = _33 + _34;
>>>>>>          _36 = VEC_PERM_EXPR <_35, { [all '0'] }, { 1, 2, 3, [...], 62,
>>>>>>          63,
>>>>>>          64 }>;
>>>>>>          _37 = _35 + _36;
>>>>>>          stmp__12.12_38 = BIT_FIELD_REF <_37, 32, 0>;
>>>>>>          return stmp__12.12_38;
>>>>>>
>>>>>> ..., for example, for 'x = 7', we get:
>>>>>>
>>>>>>          vect_cst__18 = { [all '7'] };
>>>>>>          vect__1.8_19 = { 7, 3, 1, 0, 0, 0, [...] };
>>>>>>          _13 = { 1, 1, 1, 0, 0, 0, [...] };
>>>>>>          vect__12.11_24 = { 1, 1, 1, 0, 0, 0, [...] };
>>>>>>          _26 = { [all '0'] };
>>>>>>          _27 = { 1, 1, 1, 0, 0, 0, [...] };
>>>>>>          _28 = { [all '0'] };
>>>>>>          _29 = { 1, 1, 1, 0, 0, 0, [...] };
>>>>>>          _30 = { [all '0'] };
>>>>>>          _31 = { 1, 1, 1, 0, 0, 0, [...] };
>>>>>>          _32 = { [all '0'] };
>>>>>>          _33 = { 1, 1, 1, 0, 0, 0, [...] };
>>>>>>          _34 = { 1, 0, 0, 0, [...] };
>>>>>>          _35 = { 2, 1, 1, 0, 0, 0, [...] };
>>>>>>          _36 = { 1, 1, 0, 0, 0, [...] };
>>>>>>          _37 = { 3, 2, 1, 0, 0, 0, [...] };
>>>>>>          stmp__12.12_38 = 3;
>>>>>>          return 3;
>>>>>>
>>>>>> ..., so the algorithm would appear to synthesize correct code for that
>>>>>> case.  Adding '7' to 'builtin-bitops-1.c', we however again get:
>>>>>>
>>>>>>        i=13, ints[i]=0x7 a=3, b=6
>>>>>>
>>>>>>
>>>>>> With the following hack applied to 'gcc/tree-vect-loop.cc':
>>>>>>
>>>>>>        @@ -6687,8 +6687,9 @@ vect_create_epilog_for_reduction
>>>>>>        (loop_vec_info
>>>>>>        loop_vinfo,
>>>>>>               reduce_with_shift = have_whole_vector_shift (mode1);
>>>>>>               if (!VECTOR_MODE_P (mode1)
>>>>>>                  || !directly_supported_p (code, vectype1))
>>>>>>                reduce_with_shift = false;
>>>>>>        +      reduce_with_shift = false;
>>>>>>
>>>>>> ..., I'm able to work around those regressions: by means of forcing
>>>>>> "Reduce using scalar code" instead of "Reduce using vector shifts".
>>>>>
>>>>> I would say it somewhere gets broken between the vectorizer and the GPU
>>>>> which means likely in the target?  Can you point out an issue in the
>>>>> actual generated GCN code?
>>>>>
>>>>> Iff this kind of reduction is the issue you'd see quite a lot of
>>>>> vectorzer execute FAILs.  I'm seeing a .COND_AND above - could it
>>>>> be that the "mask" is still set wrong when doing the reduction
>>>>> steps?
>>>>
>>>> It looks like the ds_bpermute_b32 instruction works differently on RDNA3
>>>> (vs.
>>>> GCN/CDNA and even RDNA2).
>>>>
>>>>   From the pseudocode in the documentation:
>>>>
>>>>     for i in 0 : WAVE64 ? 63 : 31 do
>>>>       // ADDR needs to be divided by 4.
>>>>       // High-order bits are ignored.
>>>>       // NOTE: destination lane is MOD 32 regardless of wave size.
>>>>       src_lane = 32'I(VGPR[i][ADDR] + OFFSET.b) / 4 % 32;
>>>>       // EXEC is applied to the source VGPR reads.
>>>>       if EXEC[src_lane].u1 then
>>>>         tmp[i] = VGPR[src_lane][DATA0]
>>>>       endif
>>>>     endfor;
>>>>
>>>> The key detail is the "mod 32"; the other architectures have "mod 64"
>>>> there.
>>>>
>>>> So, the last 32 lanes are discarded, and the first 32 lanes are duplicated
>>>> into the last, and this explains why my_popcount returns double the
>>>> expected
>>>> value for smaller inputs.
>>>>
>>>> Richi, can you confirm that this testcase works properly on your card,
>>>> please?
>>>>
>>>> To test, assuming you only have the offload toolchain built, compile using
>>>> x86_64-none-linux-gnu-accel-amdgcn-amdhsa-gcc, which should produce a raw
>>>> AMD
>>>> ELF file. Then you run it using "gcn-run a.out" (you can find gcn-run under
>>>> libexec).
>>>
>>> I'm getting
>>>
>>> i=1, ints[i]=0x1 a=1, b=2
>>> i=2, ints[i]=0x80000000 a=1, b=2
>>> i=3, ints[i]=0x2 a=1, b=2
>>> i=4, ints[i]=0x40000000 a=1, b=2
>>> i=5, ints[i]=0x10000 a=1, b=2
>>> i=6, ints[i]=0x8000 a=1, b=2
>>> i=7, ints[i]=0xa5a5a5a5 a=16, b=32
>>> i=8, ints[i]=0x5a5a5a5a a=16, b=32
>>> i=9, ints[i]=0xcafe0000 a=11, b=22
>>> i=10, ints[i]=0xcafe00 a=11, b=22
>>> i=11, ints[i]=0xcafe a=11, b=22
>>> i=12, ints[i]=0xffffffff a=32, b=64
>>>
>>> which I think is the same as Thomas output and thus wrong?
>>>
>>> When building with -O0 I get no output.
>>>
>>> I'm of course building with -march=gfx1030
>>
>> OK, please try this example, just to check my expectation that your permute
>> works:
>>
>> typedef int v64si __attribute__ ((vector_size (256)));
>>
>> int main()
>> {
>>    v64si permute = {
>>      40, 40, 40, 40, 40, 40, 40, 40,
>>      40, 40, 40, 40, 40, 40, 40, 40,
>>      40, 40, 40, 40, 40, 40, 40, 40,
>>      40, 40, 40, 40, 40, 40, 40, 40,
>>      40, 40, 40, 40, 40, 40, 40, 40,
>>      40, 40, 40, 40, 40, 40, 40, 40,
>>      40, 40, 40, 40, 40, 40, 40, 40,
>>      40, 40, 40, 40, 40, 40, 40, 40
>>    };
>>    v64si result;
>>
>>    asm ("ds_bpermute_b32 %0, %1, v1" : "=v"(result) : "v"(permute), "e"(-1L));
>>
>>    for (int i=0; i<63; i++)
>>      __builtin_printf ("%d ", result[i]);
>>    __builtin_printf ("\n");
>>
>>    return 0;
>> }
>>
>> On GCN/CDNA devices I expect this to print "10" 64 times. On RDNA3 it prints
>> "10" 32 times, and "42" 32 times (which doesn't quite match what I'd expect
>> from the pseudocode, but does match the written description). Which do you
>> get?
> 
> 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
> 10 10 10 10 10 10 10 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42
> 42 42 42 42 42 42 42 42 42 42 42 42 42
> 
> so RDNA2 matches RDNA3 here.

OK, that probably is the problem with both our reductions then. The 
RDNA2 manual has the 32-lane wording in the description, but the 
instruction pseudocode lies. :(

I'm now not sure how to implement permute without actually hitting 
memory? The permutation vector is exactly what we'd need to do a gather 
load from memory (not a coincident), but we'd need to find a memory 
location to do it, ideally in the low-latency LDS memory, and it'd have 
to be thread-safe.

The attached not-well-tested patch should allow only valid permutations. 
Hopefully we go back to working code, but there'll be things that won't 
vectorize. That said, the new "dump" output code has fewer and probably 
cheaper instructions, so hmmm.

Andrew
amdgcn: Disallow unsupported permute on RDNA devices

The RDNA architecture has limited support for permute operations.  This should
allow use of the permutations that do work, and fall back to linear code for
other cases.

gcc/ChangeLog:

	* config/gcn/gcn-valu.md
	(vec_extract<V_MOV:mode><V_MOV_ALT:mode>): Add conditions for RDNA.
	* config/gcn/gcn.cc (gcn_vectorize_vec_perm_const): Check permutation
	details are supported on RDNA devices.

diff --git a/gcc/config/gcn/gcn-valu.md b/gcc/config/gcn/gcn-valu.md
index 23b441f8e8b..59e27d0aed7 100644
--- a/gcc/config/gcn/gcn-valu.md
+++ b/gcc/config/gcn/gcn-valu.md
@@ -982,7 +982,8 @@
    (match_operand:V_MOV 1 "register_operand")
    (match_operand 2 "immediate_operand")]
   "MODE_VF (<V_MOV_ALT:MODE>mode) < MODE_VF (<V_MOV:MODE>mode)
-   && <V_MOV_ALT:SCALAR_MODE>mode == <V_MOV:SCALAR_MODE>mode"
+   && <V_MOV_ALT:SCALAR_MODE>mode == <V_MOV:SCALAR_MODE>mode
+   && (!TARGET_RDNA2_PLUS || MODE_VF (<V_MOV:MODE>mode) <= 32)"
   {
     int numlanes = GET_MODE_NUNITS (<V_MOV_ALT:MODE>mode);
     int firstlane = INTVAL (operands[2]) * numlanes;
diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc
index e80de2ce056..f067743e31a 100644
--- a/gcc/config/gcn/gcn.cc
+++ b/gcc/config/gcn/gcn.cc
@@ -5110,19 +5110,24 @@ gcn_vectorize_vec_perm_const (machine_mode vmode, machine_mode op_mode,
   gcc_assert (nelt <= 64);
   gcc_assert (sel.length () == nelt);
 
-  if (!dst)
-    {
-      /* All vector permutations are possible on this architecture,
-         with varying degrees of efficiency depending on the permutation. */
-      return true;
-    }
-
   unsigned int perm[64];
   for (unsigned int i = 0; i < nelt; ++i)
     perm[i] = sel[i] & (2 * nelt - 1);
   for (unsigned int i = nelt; i < 64; ++i)
     perm[i] = 0;
 
+  /* RDNA devices can only to permutations within each group of 32-lanes.
+     Reject permutations that cross the boundary.  */
+  if (TARGET_RDNA2_PLUS)
+    for (unsigned int i = 0; i < nelt; i++)
+      if (i < 31 ? perm[i] > 31 : perm[i] < 32)
+	return false;
+
+  /* All vector permutations are possible on other architectures,
+     with varying degrees of efficiency depending on the permutation. */
+  if (!dst)
+    return true;
+
   src0 = force_reg (vmode, src0);
   src1 = force_reg (vmode, src1);
  
Richard Biener Feb. 15, 2024, 7:49 a.m. UTC | #9
On Wed, 14 Feb 2024, Andrew Stubbs wrote:

> On 14/02/2024 13:43, Richard Biener wrote:
> > On Wed, 14 Feb 2024, Andrew Stubbs wrote:
> > 
> >> On 14/02/2024 13:27, Richard Biener wrote:
> >>> On Wed, 14 Feb 2024, Andrew Stubbs wrote:
> >>>
> >>>> On 13/02/2024 08:26, Richard Biener wrote:
> >>>>> On Mon, 12 Feb 2024, Thomas Schwinge wrote:
> >>>>>
> >>>>>> Hi!
> >>>>>>
> >>>>>> On 2023-10-20T12:51:03+0100, Andrew Stubbs <ams@codesourcery.com>
> >>>>>> wrote:
> >>>>>>> I've committed this patch
> >>>>>>
> >>>>>> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
> >>>>>> "amdgcn: add -march=gfx1030 EXPERIMENTAL".
> >>>>>>
> >>>>>> The RDNA2 ISA variant doesn't support certain instructions previous
> >>>>>> implemented in GCC/GCN, so a number of patterns etc. had to be
> >>>>>> disabled:
> >>>>>>
> >>>>>>> [...] Vector
> >>>>>>> reductions will need to be reworked for RDNA2.  [...]
> >>>>>>
> >>>>>>>    * config/gcn/gcn-valu.md (@dpp_move<mode>): Disable for RDNA2.
> >>>>>>>    (addc<mode>3<exec_vcc>): Add RDNA2 syntax variant.
> >>>>>>>    (subc<mode>3<exec_vcc>): Likewise.
> >>>>>>>    (<convop><mode><vndi>2_exec): Add RDNA2 alternatives.
> >>>>>>>    (vec_cmp<mode>di): Likewise.
> >>>>>>>    (vec_cmp<u><mode>di): Likewise.
> >>>>>>>    (vec_cmp<mode>di_exec): Likewise.
> >>>>>>>    (vec_cmp<u><mode>di_exec): Likewise.
> >>>>>>>    (vec_cmp<mode>di_dup): Likewise.
> >>>>>>>    (vec_cmp<mode>di_dup_exec): Likewise.
> >>>>>>>    (reduc_<reduc_op>_scal_<mode>): Disable for RDNA2.
> >>>>>>>    (*<reduc_op>_dpp_shr_<mode>): Likewise.
> >>>>>>>    (*plus_carry_dpp_shr_<mode>): Likewise.
> >>>>>>>    (*plus_carry_in_dpp_shr_<mode>): Likewise.
> >>>>>>
> >>>>>> Etc.  The expectation being that GCC middle end copes with this, and
> >>>>>> synthesizes some less ideal yet still functional vector code, I
> >>>>>> presume.
> >>>>>>
> >>>>>> The later RDNA3/gfx1100 support builds on top of this, and that's what
> >>>>>> I'm currently working on getting proper GCC/GCN target (not offloading)
> >>>>>> results for.
> >>>>>>
> >>>>>> I'm seeing a good number of execution test FAILs (regressions compared
> >>>>>> to
> >>>>>> my earlier non-gfx1100 testing), and I've now tracked down where one
> >>>>>> large class of those comes into existance -- not yet how to resolve,
> >>>>>> unfortunately.  But maybe, with you guys' combined vectorizer and back
> >>>>>> end experience, the latter will be done quickly?
> >>>>>>
> >>>>>> Richard, I don't know if you've ever run actual GCC/GCN target (not
> >>>>>> offloading) testing; let me know if you have any questions about that.
> >>>>>
> >>>>> I've only done offload testing - in the x86_64 build tree run
> >>>>> check-target-libgomp.  If you can tell me how to do GCN target testing
> >>>>> (maybe document it on the wiki even!) I can try do that as well.
> >>>>>
> >>>>>> Given that (at least largely?) the same patterns etc. are disabled as
> >>>>>> in
> >>>>>> my gfx1100 configuration, I suppose your gfx1030 one would exhibit the
> >>>>>> same issues.  You can build GCC/GCN target like you build the
> >>>>>> offloading
> >>>>>> one, just remove '--enable-as-accelerator-for=[...]'.  Likely, you can
> >>>>>> even use a offloading GCC/GCN build to reproduce the issue below.
> >>>>>>
> >>>>>> One example is the attached 'builtin-bitops-1.c', reduced from
> >>>>>> 'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is
> >>>>>> miscompiled as soon as '-ftree-vectorize' is effective:
> >>>>>>
> >>>>>>        $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c
> >>>>>>        -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
> >>>>>>        -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all
> >>>>>>        -fdump-ipa-all-all -fdump-rtl-all-all -save-temps -march=gfx1100
> >>>>>>        -O1
> >>>>>>        -ftree-vectorize
> >>>>>>
> >>>>>> In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for
> >>>>>> '-march=gfx90a' vs. '-march=gfx1100', we see:
> >>>>>>
> >>>>>>        +builtin-bitops-1.c:7:17: missed:   reduc op not supported by
> >>>>>>        target.
> >>>>>>
> >>>>>> ..., and therefore:
> >>>>>>
> >>>>>>        -builtin-bitops-1.c:7:17: note:  Reduce using direct vector
> >>>>>>        reduction.
> >>>>>>        +builtin-bitops-1.c:7:17: note:  Reduce using vector shifts
> >>>>>>        +builtin-bitops-1.c:7:17: note:  extract scalar result
> >>>>>>
> >>>>>> That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we build
> >>>>>> a
> >>>>>> chain of summation of 'VEC_PERM_EXPR's.  However, there's wrong code
> >>>>>> generated:
> >>>>>>
> >>>>>>        $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
> >>>>>>        i=1, ints[i]=0x1 a=1, b=2
> >>>>>>        i=2, ints[i]=0x80000000 a=1, b=2
> >>>>>>        i=3, ints[i]=0x2 a=1, b=2
> >>>>>>        i=4, ints[i]=0x40000000 a=1, b=2
> >>>>>>        i=5, ints[i]=0x10000 a=1, b=2
> >>>>>>        i=6, ints[i]=0x8000 a=1, b=2
> >>>>>>        i=7, ints[i]=0xa5a5a5a5 a=16, b=32
> >>>>>>        i=8, ints[i]=0x5a5a5a5a a=16, b=32
> >>>>>>        i=9, ints[i]=0xcafe0000 a=11, b=22
> >>>>>>        i=10, ints[i]=0xcafe00 a=11, b=22
> >>>>>>        i=11, ints[i]=0xcafe a=11, b=22
> >>>>>>        i=12, ints[i]=0xffffffff a=32, b=64
> >>>>>>
> >>>>>> (I can't tell if the 'b = 2 * a' pattern is purely coincidental?)
> >>>>>>
> >>>>>> I don't speak enough "vectorization" to fully understand the generic
> >>>>>> vectorized algorithm and its implementation.  It appears that the
> >>>>>> "Reduce using vector shifts" code has been around for a very long time,
> >>>>>> but also has gone through a number of changes.  I can't tell which GCC
> >>>>>> targets/configurations it's actually used for (in the same way as for
> >>>>>> GCN gfx1100), and thus whether there's an issue in that vectorizer
> >>>>>> code,
> >>>>>> or rather in the GCN back end, or GCN back end parameterizing the
> >>>>>> generic
> >>>>>> code?
> >>>>>
> >>>>> The "shift" reduction is basically doing reduction by repeatedly
> >>>>> adding the upper to the lower half of the vector (each time halving
> >>>>> the vector size).
> >>>>>
> >>>>>> Manually working through the 'a-builtin-bitops-1.c.265t.optimized'
> >>>>>> code:
> >>>>>>
> >>>>>>        int my_popcount (unsigned int x)
> >>>>>>        {
> >>>>>>          int stmp__12.12;
> >>>>>>          vector(64) int vect__12.11;
> >>>>>>          vector(64) unsigned int vect__1.8;
> >>>>>>          vector(64) unsigned int _13;
> >>>>>>          vector(64) unsigned int vect_cst__18;
> >>>>>>          vector(64) int [all others];
> >>>>>>        
> >>>>>>          <bb 2> [local count: 32534376]:
> >>>>>>          vect_cst__18 = { [all 'x_8(D)'] };
> >>>>>>          vect__1.8_19 = vect_cst__18 >> { 0, 1, 2, [...], 61, 62, 63 };
> >>>>>>          _13 = .COND_AND ({ [32 x '-1'], [32 x '0'] }, vect__1.8_19, {
> >>>>>>          [all
> >>>>>>          '1'] }, { [all '0'] });
> >>>>>>          vect__12.11_24 = VIEW_CONVERT_EXPR<vector(64) int>(_13);
> >>>>>>          _26 = VEC_PERM_EXPR <vect__12.11_24, { [all '0'] }, { 32, 33,
> >>>>>>          34,
> >>>>>>          [...], 93, 94, 95 }>;
> >>>>>>          _27 = vect__12.11_24 + _26;
> >>>>>>          _28 = VEC_PERM_EXPR <_27, { [all '0'] }, { 16, 17, 18, [...],
> >>>>>>          77,
> >>>>>>          78, 79 }>;
> >>>>>>          _29 = _27 + _28;
> >>>>>>          _30 = VEC_PERM_EXPR <_29, { [all '0'] }, { 8, 9, 10, [...],
> >>>>>>          69,
> >>>>>>          70,
> >>>>>>          71 }>;
> >>>>>>          _31 = _29 + _30;
> >>>>>>          _32 = VEC_PERM_EXPR <_31, { [all '0'] }, { 4, 5, 6, [...], 65,
> >>>>>>          66,
> >>>>>>          67 }>;
> >>>>>>          _33 = _31 + _32;
> >>>>>>          _34 = VEC_PERM_EXPR <_33, { [all '0'] }, { 2, 3, 4, [...], 63,
> >>>>>>          64,
> >>>>>>          65 }>;
> >>>>>>          _35 = _33 + _34;
> >>>>>>          _36 = VEC_PERM_EXPR <_35, { [all '0'] }, { 1, 2, 3, [...], 62,
> >>>>>>          63,
> >>>>>>          64 }>;
> >>>>>>          _37 = _35 + _36;
> >>>>>>          stmp__12.12_38 = BIT_FIELD_REF <_37, 32, 0>;
> >>>>>>          return stmp__12.12_38;
> >>>>>>
> >>>>>> ..., for example, for 'x = 7', we get:
> >>>>>>
> >>>>>>          vect_cst__18 = { [all '7'] };
> >>>>>>          vect__1.8_19 = { 7, 3, 1, 0, 0, 0, [...] };
> >>>>>>          _13 = { 1, 1, 1, 0, 0, 0, [...] };
> >>>>>>          vect__12.11_24 = { 1, 1, 1, 0, 0, 0, [...] };
> >>>>>>          _26 = { [all '0'] };
> >>>>>>          _27 = { 1, 1, 1, 0, 0, 0, [...] };
> >>>>>>          _28 = { [all '0'] };
> >>>>>>          _29 = { 1, 1, 1, 0, 0, 0, [...] };
> >>>>>>          _30 = { [all '0'] };
> >>>>>>          _31 = { 1, 1, 1, 0, 0, 0, [...] };
> >>>>>>          _32 = { [all '0'] };
> >>>>>>          _33 = { 1, 1, 1, 0, 0, 0, [...] };
> >>>>>>          _34 = { 1, 0, 0, 0, [...] };
> >>>>>>          _35 = { 2, 1, 1, 0, 0, 0, [...] };
> >>>>>>          _36 = { 1, 1, 0, 0, 0, [...] };
> >>>>>>          _37 = { 3, 2, 1, 0, 0, 0, [...] };
> >>>>>>          stmp__12.12_38 = 3;
> >>>>>>          return 3;
> >>>>>>
> >>>>>> ..., so the algorithm would appear to synthesize correct code for that
> >>>>>> case.  Adding '7' to 'builtin-bitops-1.c', we however again get:
> >>>>>>
> >>>>>>        i=13, ints[i]=0x7 a=3, b=6
> >>>>>>
> >>>>>>
> >>>>>> With the following hack applied to 'gcc/tree-vect-loop.cc':
> >>>>>>
> >>>>>>        @@ -6687,8 +6687,9 @@ vect_create_epilog_for_reduction
> >>>>>>        (loop_vec_info
> >>>>>>        loop_vinfo,
> >>>>>>               reduce_with_shift = have_whole_vector_shift (mode1);
> >>>>>>               if (!VECTOR_MODE_P (mode1)
> >>>>>>                  || !directly_supported_p (code, vectype1))
> >>>>>>                reduce_with_shift = false;
> >>>>>>        +      reduce_with_shift = false;
> >>>>>>
> >>>>>> ..., I'm able to work around those regressions: by means of forcing
> >>>>>> "Reduce using scalar code" instead of "Reduce using vector shifts".
> >>>>>
> >>>>> I would say it somewhere gets broken between the vectorizer and the GPU
> >>>>> which means likely in the target?  Can you point out an issue in the
> >>>>> actual generated GCN code?
> >>>>>
> >>>>> Iff this kind of reduction is the issue you'd see quite a lot of
> >>>>> vectorzer execute FAILs.  I'm seeing a .COND_AND above - could it
> >>>>> be that the "mask" is still set wrong when doing the reduction
> >>>>> steps?
> >>>>
> >>>> It looks like the ds_bpermute_b32 instruction works differently on RDNA3
> >>>> (vs.
> >>>> GCN/CDNA and even RDNA2).
> >>>>
> >>>>   From the pseudocode in the documentation:
> >>>>
> >>>>     for i in 0 : WAVE64 ? 63 : 31 do
> >>>>       // ADDR needs to be divided by 4.
> >>>>       // High-order bits are ignored.
> >>>>       // NOTE: destination lane is MOD 32 regardless of wave size.
> >>>>       src_lane = 32'I(VGPR[i][ADDR] + OFFSET.b) / 4 % 32;
> >>>>       // EXEC is applied to the source VGPR reads.
> >>>>       if EXEC[src_lane].u1 then
> >>>>         tmp[i] = VGPR[src_lane][DATA0]
> >>>>       endif
> >>>>     endfor;
> >>>>
> >>>> The key detail is the "mod 32"; the other architectures have "mod 64"
> >>>> there.
> >>>>
> >>>> So, the last 32 lanes are discarded, and the first 32 lanes are
> >>>> duplicated
> >>>> into the last, and this explains why my_popcount returns double the
> >>>> expected
> >>>> value for smaller inputs.
> >>>>
> >>>> Richi, can you confirm that this testcase works properly on your card,
> >>>> please?
> >>>>
> >>>> To test, assuming you only have the offload toolchain built, compile
> >>>> using
> >>>> x86_64-none-linux-gnu-accel-amdgcn-amdhsa-gcc, which should produce a raw
> >>>> AMD
> >>>> ELF file. Then you run it using "gcn-run a.out" (you can find gcn-run
> >>>> under
> >>>> libexec).
> >>>
> >>> I'm getting
> >>>
> >>> i=1, ints[i]=0x1 a=1, b=2
> >>> i=2, ints[i]=0x80000000 a=1, b=2
> >>> i=3, ints[i]=0x2 a=1, b=2
> >>> i=4, ints[i]=0x40000000 a=1, b=2
> >>> i=5, ints[i]=0x10000 a=1, b=2
> >>> i=6, ints[i]=0x8000 a=1, b=2
> >>> i=7, ints[i]=0xa5a5a5a5 a=16, b=32
> >>> i=8, ints[i]=0x5a5a5a5a a=16, b=32
> >>> i=9, ints[i]=0xcafe0000 a=11, b=22
> >>> i=10, ints[i]=0xcafe00 a=11, b=22
> >>> i=11, ints[i]=0xcafe a=11, b=22
> >>> i=12, ints[i]=0xffffffff a=32, b=64
> >>>
> >>> which I think is the same as Thomas output and thus wrong?
> >>>
> >>> When building with -O0 I get no output.
> >>>
> >>> I'm of course building with -march=gfx1030
> >>
> >> OK, please try this example, just to check my expectation that your permute
> >> works:
> >>
> >> typedef int v64si __attribute__ ((vector_size (256)));
> >>
> >> int main()
> >> {
> >>    v64si permute = {
> >>      40, 40, 40, 40, 40, 40, 40, 40,
> >>      40, 40, 40, 40, 40, 40, 40, 40,
> >>      40, 40, 40, 40, 40, 40, 40, 40,
> >>      40, 40, 40, 40, 40, 40, 40, 40,
> >>      40, 40, 40, 40, 40, 40, 40, 40,
> >>      40, 40, 40, 40, 40, 40, 40, 40,
> >>      40, 40, 40, 40, 40, 40, 40, 40,
> >>      40, 40, 40, 40, 40, 40, 40, 40
> >>    };
> >>    v64si result;
> >>
> >>    asm ("ds_bpermute_b32 %0, %1, v1" : "=v"(result) : "v"(permute),
> >>    "e"(-1L));
> >>
> >>    for (int i=0; i<63; i++)
> >>      __builtin_printf ("%d ", result[i]);
> >>    __builtin_printf ("\n");
> >>
> >>    return 0;
> >> }
> >>
> >> On GCN/CDNA devices I expect this to print "10" 64 times. On RDNA3 it
> >> prints
> >> "10" 32 times, and "42" 32 times (which doesn't quite match what I'd expect
> >> from the pseudocode, but does match the written description). Which do you
> >> get?
> > 
> > 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
> > 10 10 10 10 10 10 10 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42
> > 42 42 42 42 42 42 42 42 42 42 42 42 42
> > 
> > so RDNA2 matches RDNA3 here.
> 
> OK, that probably is the problem with both our reductions then. The RDNA2
> manual has the 32-lane wording in the description, but the instruction
> pseudocode lies. :(
> 
> I'm now not sure how to implement permute without actually hitting memory? The
> permutation vector is exactly what we'd need to do a gather load from memory
> (not a coincident), but we'd need to find a memory location to do it, ideally
> in the low-latency LDS memory, and it'd have to be thread-safe.
> 
> The attached not-well-tested patch should allow only valid permutations.
> Hopefully we go back to working code, but there'll be things that won't
> vectorize. That said, the new "dump" output code has fewer and probably
> cheaper instructions, so hmmm.

This fixes the reduced builtin-bitops-1.c on RDNA2.

I suppse if RDNA really only has 32 lane vectors (it sounds like it,
even if it can "simulate" 64 lane ones?) then it might make sense to
vectorize for 32 lanes?  That said, with variable-length it likely
doesn't matter but I'd not expose fixed-size modes with 64 lanes then?

Richard.
  
Andrew Stubbs Feb. 15, 2024, 10:03 a.m. UTC | #10
On 15/02/2024 07:49, Richard Biener wrote:
> On Wed, 14 Feb 2024, Andrew Stubbs wrote:
> 
>> On 14/02/2024 13:43, Richard Biener wrote:
>>> On Wed, 14 Feb 2024, Andrew Stubbs wrote:
>>>
>>>> On 14/02/2024 13:27, Richard Biener wrote:
>>>>> On Wed, 14 Feb 2024, Andrew Stubbs wrote:
>>>>>
>>>>>> On 13/02/2024 08:26, Richard Biener wrote:
>>>>>>> On Mon, 12 Feb 2024, Thomas Schwinge wrote:
>>>>>>>
>>>>>>>> Hi!
>>>>>>>>
>>>>>>>> On 2023-10-20T12:51:03+0100, Andrew Stubbs <ams@codesourcery.com>
>>>>>>>> wrote:
>>>>>>>>> I've committed this patch
>>>>>>>>
>>>>>>>> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
>>>>>>>> "amdgcn: add -march=gfx1030 EXPERIMENTAL".
>>>>>>>>
>>>>>>>> The RDNA2 ISA variant doesn't support certain instructions previous
>>>>>>>> implemented in GCC/GCN, so a number of patterns etc. had to be
>>>>>>>> disabled:
>>>>>>>>
>>>>>>>>> [...] Vector
>>>>>>>>> reductions will need to be reworked for RDNA2.  [...]
>>>>>>>>
>>>>>>>>>     * config/gcn/gcn-valu.md (@dpp_move<mode>): Disable for RDNA2.
>>>>>>>>>     (addc<mode>3<exec_vcc>): Add RDNA2 syntax variant.
>>>>>>>>>     (subc<mode>3<exec_vcc>): Likewise.
>>>>>>>>>     (<convop><mode><vndi>2_exec): Add RDNA2 alternatives.
>>>>>>>>>     (vec_cmp<mode>di): Likewise.
>>>>>>>>>     (vec_cmp<u><mode>di): Likewise.
>>>>>>>>>     (vec_cmp<mode>di_exec): Likewise.
>>>>>>>>>     (vec_cmp<u><mode>di_exec): Likewise.
>>>>>>>>>     (vec_cmp<mode>di_dup): Likewise.
>>>>>>>>>     (vec_cmp<mode>di_dup_exec): Likewise.
>>>>>>>>>     (reduc_<reduc_op>_scal_<mode>): Disable for RDNA2.
>>>>>>>>>     (*<reduc_op>_dpp_shr_<mode>): Likewise.
>>>>>>>>>     (*plus_carry_dpp_shr_<mode>): Likewise.
>>>>>>>>>     (*plus_carry_in_dpp_shr_<mode>): Likewise.
>>>>>>>>
>>>>>>>> Etc.  The expectation being that GCC middle end copes with this, and
>>>>>>>> synthesizes some less ideal yet still functional vector code, I
>>>>>>>> presume.
>>>>>>>>
>>>>>>>> The later RDNA3/gfx1100 support builds on top of this, and that's what
>>>>>>>> I'm currently working on getting proper GCC/GCN target (not offloading)
>>>>>>>> results for.
>>>>>>>>
>>>>>>>> I'm seeing a good number of execution test FAILs (regressions compared
>>>>>>>> to
>>>>>>>> my earlier non-gfx1100 testing), and I've now tracked down where one
>>>>>>>> large class of those comes into existance -- not yet how to resolve,
>>>>>>>> unfortunately.  But maybe, with you guys' combined vectorizer and back
>>>>>>>> end experience, the latter will be done quickly?
>>>>>>>>
>>>>>>>> Richard, I don't know if you've ever run actual GCC/GCN target (not
>>>>>>>> offloading) testing; let me know if you have any questions about that.
>>>>>>>
>>>>>>> I've only done offload testing - in the x86_64 build tree run
>>>>>>> check-target-libgomp.  If you can tell me how to do GCN target testing
>>>>>>> (maybe document it on the wiki even!) I can try do that as well.
>>>>>>>
>>>>>>>> Given that (at least largely?) the same patterns etc. are disabled as
>>>>>>>> in
>>>>>>>> my gfx1100 configuration, I suppose your gfx1030 one would exhibit the
>>>>>>>> same issues.  You can build GCC/GCN target like you build the
>>>>>>>> offloading
>>>>>>>> one, just remove '--enable-as-accelerator-for=[...]'.  Likely, you can
>>>>>>>> even use a offloading GCC/GCN build to reproduce the issue below.
>>>>>>>>
>>>>>>>> One example is the attached 'builtin-bitops-1.c', reduced from
>>>>>>>> 'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is
>>>>>>>> miscompiled as soon as '-ftree-vectorize' is effective:
>>>>>>>>
>>>>>>>>         $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c
>>>>>>>>         -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
>>>>>>>>         -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all
>>>>>>>>         -fdump-ipa-all-all -fdump-rtl-all-all -save-temps -march=gfx1100
>>>>>>>>         -O1
>>>>>>>>         -ftree-vectorize
>>>>>>>>
>>>>>>>> In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for
>>>>>>>> '-march=gfx90a' vs. '-march=gfx1100', we see:
>>>>>>>>
>>>>>>>>         +builtin-bitops-1.c:7:17: missed:   reduc op not supported by
>>>>>>>>         target.
>>>>>>>>
>>>>>>>> ..., and therefore:
>>>>>>>>
>>>>>>>>         -builtin-bitops-1.c:7:17: note:  Reduce using direct vector
>>>>>>>>         reduction.
>>>>>>>>         +builtin-bitops-1.c:7:17: note:  Reduce using vector shifts
>>>>>>>>         +builtin-bitops-1.c:7:17: note:  extract scalar result
>>>>>>>>
>>>>>>>> That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we build
>>>>>>>> a
>>>>>>>> chain of summation of 'VEC_PERM_EXPR's.  However, there's wrong code
>>>>>>>> generated:
>>>>>>>>
>>>>>>>>         $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
>>>>>>>>         i=1, ints[i]=0x1 a=1, b=2
>>>>>>>>         i=2, ints[i]=0x80000000 a=1, b=2
>>>>>>>>         i=3, ints[i]=0x2 a=1, b=2
>>>>>>>>         i=4, ints[i]=0x40000000 a=1, b=2
>>>>>>>>         i=5, ints[i]=0x10000 a=1, b=2
>>>>>>>>         i=6, ints[i]=0x8000 a=1, b=2
>>>>>>>>         i=7, ints[i]=0xa5a5a5a5 a=16, b=32
>>>>>>>>         i=8, ints[i]=0x5a5a5a5a a=16, b=32
>>>>>>>>         i=9, ints[i]=0xcafe0000 a=11, b=22
>>>>>>>>         i=10, ints[i]=0xcafe00 a=11, b=22
>>>>>>>>         i=11, ints[i]=0xcafe a=11, b=22
>>>>>>>>         i=12, ints[i]=0xffffffff a=32, b=64
>>>>>>>>
>>>>>>>> (I can't tell if the 'b = 2 * a' pattern is purely coincidental?)
>>>>>>>>
>>>>>>>> I don't speak enough "vectorization" to fully understand the generic
>>>>>>>> vectorized algorithm and its implementation.  It appears that the
>>>>>>>> "Reduce using vector shifts" code has been around for a very long time,
>>>>>>>> but also has gone through a number of changes.  I can't tell which GCC
>>>>>>>> targets/configurations it's actually used for (in the same way as for
>>>>>>>> GCN gfx1100), and thus whether there's an issue in that vectorizer
>>>>>>>> code,
>>>>>>>> or rather in the GCN back end, or GCN back end parameterizing the
>>>>>>>> generic
>>>>>>>> code?
>>>>>>>
>>>>>>> The "shift" reduction is basically doing reduction by repeatedly
>>>>>>> adding the upper to the lower half of the vector (each time halving
>>>>>>> the vector size).
>>>>>>>
>>>>>>>> Manually working through the 'a-builtin-bitops-1.c.265t.optimized'
>>>>>>>> code:
>>>>>>>>
>>>>>>>>         int my_popcount (unsigned int x)
>>>>>>>>         {
>>>>>>>>           int stmp__12.12;
>>>>>>>>           vector(64) int vect__12.11;
>>>>>>>>           vector(64) unsigned int vect__1.8;
>>>>>>>>           vector(64) unsigned int _13;
>>>>>>>>           vector(64) unsigned int vect_cst__18;
>>>>>>>>           vector(64) int [all others];
>>>>>>>>         
>>>>>>>>           <bb 2> [local count: 32534376]:
>>>>>>>>           vect_cst__18 = { [all 'x_8(D)'] };
>>>>>>>>           vect__1.8_19 = vect_cst__18 >> { 0, 1, 2, [...], 61, 62, 63 };
>>>>>>>>           _13 = .COND_AND ({ [32 x '-1'], [32 x '0'] }, vect__1.8_19, {
>>>>>>>>           [all
>>>>>>>>           '1'] }, { [all '0'] });
>>>>>>>>           vect__12.11_24 = VIEW_CONVERT_EXPR<vector(64) int>(_13);
>>>>>>>>           _26 = VEC_PERM_EXPR <vect__12.11_24, { [all '0'] }, { 32, 33,
>>>>>>>>           34,
>>>>>>>>           [...], 93, 94, 95 }>;
>>>>>>>>           _27 = vect__12.11_24 + _26;
>>>>>>>>           _28 = VEC_PERM_EXPR <_27, { [all '0'] }, { 16, 17, 18, [...],
>>>>>>>>           77,
>>>>>>>>           78, 79 }>;
>>>>>>>>           _29 = _27 + _28;
>>>>>>>>           _30 = VEC_PERM_EXPR <_29, { [all '0'] }, { 8, 9, 10, [...],
>>>>>>>>           69,
>>>>>>>>           70,
>>>>>>>>           71 }>;
>>>>>>>>           _31 = _29 + _30;
>>>>>>>>           _32 = VEC_PERM_EXPR <_31, { [all '0'] }, { 4, 5, 6, [...], 65,
>>>>>>>>           66,
>>>>>>>>           67 }>;
>>>>>>>>           _33 = _31 + _32;
>>>>>>>>           _34 = VEC_PERM_EXPR <_33, { [all '0'] }, { 2, 3, 4, [...], 63,
>>>>>>>>           64,
>>>>>>>>           65 }>;
>>>>>>>>           _35 = _33 + _34;
>>>>>>>>           _36 = VEC_PERM_EXPR <_35, { [all '0'] }, { 1, 2, 3, [...], 62,
>>>>>>>>           63,
>>>>>>>>           64 }>;
>>>>>>>>           _37 = _35 + _36;
>>>>>>>>           stmp__12.12_38 = BIT_FIELD_REF <_37, 32, 0>;
>>>>>>>>           return stmp__12.12_38;
>>>>>>>>
>>>>>>>> ..., for example, for 'x = 7', we get:
>>>>>>>>
>>>>>>>>           vect_cst__18 = { [all '7'] };
>>>>>>>>           vect__1.8_19 = { 7, 3, 1, 0, 0, 0, [...] };
>>>>>>>>           _13 = { 1, 1, 1, 0, 0, 0, [...] };
>>>>>>>>           vect__12.11_24 = { 1, 1, 1, 0, 0, 0, [...] };
>>>>>>>>           _26 = { [all '0'] };
>>>>>>>>           _27 = { 1, 1, 1, 0, 0, 0, [...] };
>>>>>>>>           _28 = { [all '0'] };
>>>>>>>>           _29 = { 1, 1, 1, 0, 0, 0, [...] };
>>>>>>>>           _30 = { [all '0'] };
>>>>>>>>           _31 = { 1, 1, 1, 0, 0, 0, [...] };
>>>>>>>>           _32 = { [all '0'] };
>>>>>>>>           _33 = { 1, 1, 1, 0, 0, 0, [...] };
>>>>>>>>           _34 = { 1, 0, 0, 0, [...] };
>>>>>>>>           _35 = { 2, 1, 1, 0, 0, 0, [...] };
>>>>>>>>           _36 = { 1, 1, 0, 0, 0, [...] };
>>>>>>>>           _37 = { 3, 2, 1, 0, 0, 0, [...] };
>>>>>>>>           stmp__12.12_38 = 3;
>>>>>>>>           return 3;
>>>>>>>>
>>>>>>>> ..., so the algorithm would appear to synthesize correct code for that
>>>>>>>> case.  Adding '7' to 'builtin-bitops-1.c', we however again get:
>>>>>>>>
>>>>>>>>         i=13, ints[i]=0x7 a=3, b=6
>>>>>>>>
>>>>>>>>
>>>>>>>> With the following hack applied to 'gcc/tree-vect-loop.cc':
>>>>>>>>
>>>>>>>>         @@ -6687,8 +6687,9 @@ vect_create_epilog_for_reduction
>>>>>>>>         (loop_vec_info
>>>>>>>>         loop_vinfo,
>>>>>>>>                reduce_with_shift = have_whole_vector_shift (mode1);
>>>>>>>>                if (!VECTOR_MODE_P (mode1)
>>>>>>>>                   || !directly_supported_p (code, vectype1))
>>>>>>>>                 reduce_with_shift = false;
>>>>>>>>         +      reduce_with_shift = false;
>>>>>>>>
>>>>>>>> ..., I'm able to work around those regressions: by means of forcing
>>>>>>>> "Reduce using scalar code" instead of "Reduce using vector shifts".
>>>>>>>
>>>>>>> I would say it somewhere gets broken between the vectorizer and the GPU
>>>>>>> which means likely in the target?  Can you point out an issue in the
>>>>>>> actual generated GCN code?
>>>>>>>
>>>>>>> Iff this kind of reduction is the issue you'd see quite a lot of
>>>>>>> vectorzer execute FAILs.  I'm seeing a .COND_AND above - could it
>>>>>>> be that the "mask" is still set wrong when doing the reduction
>>>>>>> steps?
>>>>>>
>>>>>> It looks like the ds_bpermute_b32 instruction works differently on RDNA3
>>>>>> (vs.
>>>>>> GCN/CDNA and even RDNA2).
>>>>>>
>>>>>>    From the pseudocode in the documentation:
>>>>>>
>>>>>>      for i in 0 : WAVE64 ? 63 : 31 do
>>>>>>        // ADDR needs to be divided by 4.
>>>>>>        // High-order bits are ignored.
>>>>>>        // NOTE: destination lane is MOD 32 regardless of wave size.
>>>>>>        src_lane = 32'I(VGPR[i][ADDR] + OFFSET.b) / 4 % 32;
>>>>>>        // EXEC is applied to the source VGPR reads.
>>>>>>        if EXEC[src_lane].u1 then
>>>>>>          tmp[i] = VGPR[src_lane][DATA0]
>>>>>>        endif
>>>>>>      endfor;
>>>>>>
>>>>>> The key detail is the "mod 32"; the other architectures have "mod 64"
>>>>>> there.
>>>>>>
>>>>>> So, the last 32 lanes are discarded, and the first 32 lanes are
>>>>>> duplicated
>>>>>> into the last, and this explains why my_popcount returns double the
>>>>>> expected
>>>>>> value for smaller inputs.
>>>>>>
>>>>>> Richi, can you confirm that this testcase works properly on your card,
>>>>>> please?
>>>>>>
>>>>>> To test, assuming you only have the offload toolchain built, compile
>>>>>> using
>>>>>> x86_64-none-linux-gnu-accel-amdgcn-amdhsa-gcc, which should produce a raw
>>>>>> AMD
>>>>>> ELF file. Then you run it using "gcn-run a.out" (you can find gcn-run
>>>>>> under
>>>>>> libexec).
>>>>>
>>>>> I'm getting
>>>>>
>>>>> i=1, ints[i]=0x1 a=1, b=2
>>>>> i=2, ints[i]=0x80000000 a=1, b=2
>>>>> i=3, ints[i]=0x2 a=1, b=2
>>>>> i=4, ints[i]=0x40000000 a=1, b=2
>>>>> i=5, ints[i]=0x10000 a=1, b=2
>>>>> i=6, ints[i]=0x8000 a=1, b=2
>>>>> i=7, ints[i]=0xa5a5a5a5 a=16, b=32
>>>>> i=8, ints[i]=0x5a5a5a5a a=16, b=32
>>>>> i=9, ints[i]=0xcafe0000 a=11, b=22
>>>>> i=10, ints[i]=0xcafe00 a=11, b=22
>>>>> i=11, ints[i]=0xcafe a=11, b=22
>>>>> i=12, ints[i]=0xffffffff a=32, b=64
>>>>>
>>>>> which I think is the same as Thomas output and thus wrong?
>>>>>
>>>>> When building with -O0 I get no output.
>>>>>
>>>>> I'm of course building with -march=gfx1030
>>>>
>>>> OK, please try this example, just to check my expectation that your permute
>>>> works:
>>>>
>>>> typedef int v64si __attribute__ ((vector_size (256)));
>>>>
>>>> int main()
>>>> {
>>>>     v64si permute = {
>>>>       40, 40, 40, 40, 40, 40, 40, 40,
>>>>       40, 40, 40, 40, 40, 40, 40, 40,
>>>>       40, 40, 40, 40, 40, 40, 40, 40,
>>>>       40, 40, 40, 40, 40, 40, 40, 40,
>>>>       40, 40, 40, 40, 40, 40, 40, 40,
>>>>       40, 40, 40, 40, 40, 40, 40, 40,
>>>>       40, 40, 40, 40, 40, 40, 40, 40,
>>>>       40, 40, 40, 40, 40, 40, 40, 40
>>>>     };
>>>>     v64si result;
>>>>
>>>>     asm ("ds_bpermute_b32 %0, %1, v1" : "=v"(result) : "v"(permute),
>>>>     "e"(-1L));
>>>>
>>>>     for (int i=0; i<63; i++)
>>>>       __builtin_printf ("%d ", result[i]);
>>>>     __builtin_printf ("\n");
>>>>
>>>>     return 0;
>>>> }
>>>>
>>>> On GCN/CDNA devices I expect this to print "10" 64 times. On RDNA3 it
>>>> prints
>>>> "10" 32 times, and "42" 32 times (which doesn't quite match what I'd expect
>>>> from the pseudocode, but does match the written description). Which do you
>>>> get?
>>>
>>> 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
>>> 10 10 10 10 10 10 10 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42
>>> 42 42 42 42 42 42 42 42 42 42 42 42 42
>>>
>>> so RDNA2 matches RDNA3 here.
>>
>> OK, that probably is the problem with both our reductions then. The RDNA2
>> manual has the 32-lane wording in the description, but the instruction
>> pseudocode lies. :(
>>
>> I'm now not sure how to implement permute without actually hitting memory? The
>> permutation vector is exactly what we'd need to do a gather load from memory
>> (not a coincident), but we'd need to find a memory location to do it, ideally
>> in the low-latency LDS memory, and it'd have to be thread-safe.
>>
>> The attached not-well-tested patch should allow only valid permutations.
>> Hopefully we go back to working code, but there'll be things that won't
>> vectorize. That said, the new "dump" output code has fewer and probably
>> cheaper instructions, so hmmm.
> 
> This fixes the reduced builtin-bitops-1.c on RDNA2.
> 
> I suppse if RDNA really only has 32 lane vectors (it sounds like it,
> even if it can "simulate" 64 lane ones?) then it might make sense to
> vectorize for 32 lanes?  That said, with variable-length it likely
> doesn't matter but I'd not expose fixed-size modes with 64 lanes then?

For most operations, wavefrontsize=64 works just fine; the GPU runs each 
instruction twice and presents a pair of hardware registers as a logical 
64-lane register. This breaks down for permutations and reductions, and 
is obviously inefficient when they vectors are not fully utilized, but 
is otherwise compatible with the GCN/CDNA compiler.

I didn't want to invest all the effort it would take to support 
wavefrontsize=32, which would be the natural mode for these devices; the 
number of places that have "64" hard-coded is just too big. Not only 
that, but the EXEC and VCC registers change from DImode to SImode and 
that's going to break a lot of stuff. (And we have no paying customer 
for this.)

I'm open to patch submissions. :)

Andrew
  
Richard Biener Feb. 15, 2024, 10:21 a.m. UTC | #11
On Thu, 15 Feb 2024, Andrew Stubbs wrote:

> On 15/02/2024 07:49, Richard Biener wrote:
> > On Wed, 14 Feb 2024, Andrew Stubbs wrote:
> > 
> >> On 14/02/2024 13:43, Richard Biener wrote:
> >>> On Wed, 14 Feb 2024, Andrew Stubbs wrote:
> >>>
> >>>> On 14/02/2024 13:27, Richard Biener wrote:
> >>>>> On Wed, 14 Feb 2024, Andrew Stubbs wrote:
> >>>>>
> >>>>>> On 13/02/2024 08:26, Richard Biener wrote:
> >>>>>>> On Mon, 12 Feb 2024, Thomas Schwinge wrote:
> >>>>>>>
> >>>>>>>> Hi!
> >>>>>>>>
> >>>>>>>> On 2023-10-20T12:51:03+0100, Andrew Stubbs <ams@codesourcery.com>
> >>>>>>>> wrote:
> >>>>>>>>> I've committed this patch
> >>>>>>>>
> >>>>>>>> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
> >>>>>>>> "amdgcn: add -march=gfx1030 EXPERIMENTAL".
> >>>>>>>>
> >>>>>>>> The RDNA2 ISA variant doesn't support certain instructions previous
> >>>>>>>> implemented in GCC/GCN, so a number of patterns etc. had to be
> >>>>>>>> disabled:
> >>>>>>>>
> >>>>>>>>> [...] Vector
> >>>>>>>>> reductions will need to be reworked for RDNA2.  [...]
> >>>>>>>>
> >>>>>>>>>     * config/gcn/gcn-valu.md (@dpp_move<mode>): Disable for RDNA2.
> >>>>>>>>>     (addc<mode>3<exec_vcc>): Add RDNA2 syntax variant.
> >>>>>>>>>     (subc<mode>3<exec_vcc>): Likewise.
> >>>>>>>>>     (<convop><mode><vndi>2_exec): Add RDNA2 alternatives.
> >>>>>>>>>     (vec_cmp<mode>di): Likewise.
> >>>>>>>>>     (vec_cmp<u><mode>di): Likewise.
> >>>>>>>>>     (vec_cmp<mode>di_exec): Likewise.
> >>>>>>>>>     (vec_cmp<u><mode>di_exec): Likewise.
> >>>>>>>>>     (vec_cmp<mode>di_dup): Likewise.
> >>>>>>>>>     (vec_cmp<mode>di_dup_exec): Likewise.
> >>>>>>>>>     (reduc_<reduc_op>_scal_<mode>): Disable for RDNA2.
> >>>>>>>>>     (*<reduc_op>_dpp_shr_<mode>): Likewise.
> >>>>>>>>>     (*plus_carry_dpp_shr_<mode>): Likewise.
> >>>>>>>>>     (*plus_carry_in_dpp_shr_<mode>): Likewise.
> >>>>>>>>
> >>>>>>>> Etc.  The expectation being that GCC middle end copes with this, and
> >>>>>>>> synthesizes some less ideal yet still functional vector code, I
> >>>>>>>> presume.
> >>>>>>>>
> >>>>>>>> The later RDNA3/gfx1100 support builds on top of this, and that's
> >>>>>>>> what
> >>>>>>>> I'm currently working on getting proper GCC/GCN target (not
> >>>>>>>> offloading)
> >>>>>>>> results for.
> >>>>>>>>
> >>>>>>>> I'm seeing a good number of execution test FAILs (regressions
> >>>>>>>> compared
> >>>>>>>> to
> >>>>>>>> my earlier non-gfx1100 testing), and I've now tracked down where one
> >>>>>>>> large class of those comes into existance -- not yet how to resolve,
> >>>>>>>> unfortunately.  But maybe, with you guys' combined vectorizer and
> >>>>>>>> back
> >>>>>>>> end experience, the latter will be done quickly?
> >>>>>>>>
> >>>>>>>> Richard, I don't know if you've ever run actual GCC/GCN target (not
> >>>>>>>> offloading) testing; let me know if you have any questions about
> >>>>>>>> that.
> >>>>>>>
> >>>>>>> I've only done offload testing - in the x86_64 build tree run
> >>>>>>> check-target-libgomp.  If you can tell me how to do GCN target testing
> >>>>>>> (maybe document it on the wiki even!) I can try do that as well.
> >>>>>>>
> >>>>>>>> Given that (at least largely?) the same patterns etc. are disabled as
> >>>>>>>> in
> >>>>>>>> my gfx1100 configuration, I suppose your gfx1030 one would exhibit
> >>>>>>>> the
> >>>>>>>> same issues.  You can build GCC/GCN target like you build the
> >>>>>>>> offloading
> >>>>>>>> one, just remove '--enable-as-accelerator-for=[...]'.  Likely, you
> >>>>>>>> can
> >>>>>>>> even use a offloading GCC/GCN build to reproduce the issue below.
> >>>>>>>>
> >>>>>>>> One example is the attached 'builtin-bitops-1.c', reduced from
> >>>>>>>> 'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is
> >>>>>>>> miscompiled as soon as '-ftree-vectorize' is effective:
> >>>>>>>>
> >>>>>>>>         $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c
> >>>>>>>>         -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
> >>>>>>>>         -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all
> >>>>>>>>         -fdump-ipa-all-all -fdump-rtl-all-all -save-temps
> >>>>>>>>         -march=gfx1100
> >>>>>>>>         -O1
> >>>>>>>>         -ftree-vectorize
> >>>>>>>>
> >>>>>>>> In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for
> >>>>>>>> '-march=gfx90a' vs. '-march=gfx1100', we see:
> >>>>>>>>
> >>>>>>>>         +builtin-bitops-1.c:7:17: missed:   reduc op not supported by
> >>>>>>>>         target.
> >>>>>>>>
> >>>>>>>> ..., and therefore:
> >>>>>>>>
> >>>>>>>>         -builtin-bitops-1.c:7:17: note:  Reduce using direct vector
> >>>>>>>>         reduction.
> >>>>>>>>         +builtin-bitops-1.c:7:17: note:  Reduce using vector shifts
> >>>>>>>>         +builtin-bitops-1.c:7:17: note:  extract scalar result
> >>>>>>>>
> >>>>>>>> That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we
> >>>>>>>> build
> >>>>>>>> a
> >>>>>>>> chain of summation of 'VEC_PERM_EXPR's.  However, there's wrong code
> >>>>>>>> generated:
> >>>>>>>>
> >>>>>>>>         $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
> >>>>>>>>         i=1, ints[i]=0x1 a=1, b=2
> >>>>>>>>         i=2, ints[i]=0x80000000 a=1, b=2
> >>>>>>>>         i=3, ints[i]=0x2 a=1, b=2
> >>>>>>>>         i=4, ints[i]=0x40000000 a=1, b=2
> >>>>>>>>         i=5, ints[i]=0x10000 a=1, b=2
> >>>>>>>>         i=6, ints[i]=0x8000 a=1, b=2
> >>>>>>>>         i=7, ints[i]=0xa5a5a5a5 a=16, b=32
> >>>>>>>>         i=8, ints[i]=0x5a5a5a5a a=16, b=32
> >>>>>>>>         i=9, ints[i]=0xcafe0000 a=11, b=22
> >>>>>>>>         i=10, ints[i]=0xcafe00 a=11, b=22
> >>>>>>>>         i=11, ints[i]=0xcafe a=11, b=22
> >>>>>>>>         i=12, ints[i]=0xffffffff a=32, b=64
> >>>>>>>>
> >>>>>>>> (I can't tell if the 'b = 2 * a' pattern is purely coincidental?)
> >>>>>>>>
> >>>>>>>> I don't speak enough "vectorization" to fully understand the generic
> >>>>>>>> vectorized algorithm and its implementation.  It appears that the
> >>>>>>>> "Reduce using vector shifts" code has been around for a very long
> >>>>>>>> time,
> >>>>>>>> but also has gone through a number of changes.  I can't tell which
> >>>>>>>> GCC
> >>>>>>>> targets/configurations it's actually used for (in the same way as for
> >>>>>>>> GCN gfx1100), and thus whether there's an issue in that vectorizer
> >>>>>>>> code,
> >>>>>>>> or rather in the GCN back end, or GCN back end parameterizing the
> >>>>>>>> generic
> >>>>>>>> code?
> >>>>>>>
> >>>>>>> The "shift" reduction is basically doing reduction by repeatedly
> >>>>>>> adding the upper to the lower half of the vector (each time halving
> >>>>>>> the vector size).
> >>>>>>>
> >>>>>>>> Manually working through the 'a-builtin-bitops-1.c.265t.optimized'
> >>>>>>>> code:
> >>>>>>>>
> >>>>>>>>         int my_popcount (unsigned int x)
> >>>>>>>>         {
> >>>>>>>>           int stmp__12.12;
> >>>>>>>>           vector(64) int vect__12.11;
> >>>>>>>>           vector(64) unsigned int vect__1.8;
> >>>>>>>>           vector(64) unsigned int _13;
> >>>>>>>>           vector(64) unsigned int vect_cst__18;
> >>>>>>>>           vector(64) int [all others];
> >>>>>>>>         
> >>>>>>>>           <bb 2> [local count: 32534376]:
> >>>>>>>>           vect_cst__18 = { [all 'x_8(D)'] };
> >>>>>>>>           vect__1.8_19 = vect_cst__18 >> { 0, 1, 2, [...], 61, 62, 63
> >>>>>>>>           };
> >>>>>>>>           _13 = .COND_AND ({ [32 x '-1'], [32 x '0'] }, vect__1.8_19,
> >>>>>>>>           {
> >>>>>>>>           [all
> >>>>>>>>           '1'] }, { [all '0'] });
> >>>>>>>>           vect__12.11_24 = VIEW_CONVERT_EXPR<vector(64) int>(_13);
> >>>>>>>>           _26 = VEC_PERM_EXPR <vect__12.11_24, { [all '0'] }, { 32,
> >>>>>>>>           33,
> >>>>>>>>           34,
> >>>>>>>>           [...], 93, 94, 95 }>;
> >>>>>>>>           _27 = vect__12.11_24 + _26;
> >>>>>>>>           _28 = VEC_PERM_EXPR <_27, { [all '0'] }, { 16, 17, 18,
> >>>>>>>>           [...],
> >>>>>>>>           77,
> >>>>>>>>           78, 79 }>;
> >>>>>>>>           _29 = _27 + _28;
> >>>>>>>>           _30 = VEC_PERM_EXPR <_29, { [all '0'] }, { 8, 9, 10, [...],
> >>>>>>>>           69,
> >>>>>>>>           70,
> >>>>>>>>           71 }>;
> >>>>>>>>           _31 = _29 + _30;
> >>>>>>>>           _32 = VEC_PERM_EXPR <_31, { [all '0'] }, { 4, 5, 6, [...],
> >>>>>>>>           65,
> >>>>>>>>           66,
> >>>>>>>>           67 }>;
> >>>>>>>>           _33 = _31 + _32;
> >>>>>>>>           _34 = VEC_PERM_EXPR <_33, { [all '0'] }, { 2, 3, 4, [...],
> >>>>>>>>           63,
> >>>>>>>>           64,
> >>>>>>>>           65 }>;
> >>>>>>>>           _35 = _33 + _34;
> >>>>>>>>           _36 = VEC_PERM_EXPR <_35, { [all '0'] }, { 1, 2, 3, [...],
> >>>>>>>>           62,
> >>>>>>>>           63,
> >>>>>>>>           64 }>;
> >>>>>>>>           _37 = _35 + _36;
> >>>>>>>>           stmp__12.12_38 = BIT_FIELD_REF <_37, 32, 0>;
> >>>>>>>>           return stmp__12.12_38;
> >>>>>>>>
> >>>>>>>> ..., for example, for 'x = 7', we get:
> >>>>>>>>
> >>>>>>>>           vect_cst__18 = { [all '7'] };
> >>>>>>>>           vect__1.8_19 = { 7, 3, 1, 0, 0, 0, [...] };
> >>>>>>>>           _13 = { 1, 1, 1, 0, 0, 0, [...] };
> >>>>>>>>           vect__12.11_24 = { 1, 1, 1, 0, 0, 0, [...] };
> >>>>>>>>           _26 = { [all '0'] };
> >>>>>>>>           _27 = { 1, 1, 1, 0, 0, 0, [...] };
> >>>>>>>>           _28 = { [all '0'] };
> >>>>>>>>           _29 = { 1, 1, 1, 0, 0, 0, [...] };
> >>>>>>>>           _30 = { [all '0'] };
> >>>>>>>>           _31 = { 1, 1, 1, 0, 0, 0, [...] };
> >>>>>>>>           _32 = { [all '0'] };
> >>>>>>>>           _33 = { 1, 1, 1, 0, 0, 0, [...] };
> >>>>>>>>           _34 = { 1, 0, 0, 0, [...] };
> >>>>>>>>           _35 = { 2, 1, 1, 0, 0, 0, [...] };
> >>>>>>>>           _36 = { 1, 1, 0, 0, 0, [...] };
> >>>>>>>>           _37 = { 3, 2, 1, 0, 0, 0, [...] };
> >>>>>>>>           stmp__12.12_38 = 3;
> >>>>>>>>           return 3;
> >>>>>>>>
> >>>>>>>> ..., so the algorithm would appear to synthesize correct code for
> >>>>>>>> that
> >>>>>>>> case.  Adding '7' to 'builtin-bitops-1.c', we however again get:
> >>>>>>>>
> >>>>>>>>         i=13, ints[i]=0x7 a=3, b=6
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> With the following hack applied to 'gcc/tree-vect-loop.cc':
> >>>>>>>>
> >>>>>>>>         @@ -6687,8 +6687,9 @@ vect_create_epilog_for_reduction
> >>>>>>>>         (loop_vec_info
> >>>>>>>>         loop_vinfo,
> >>>>>>>>                reduce_with_shift = have_whole_vector_shift (mode1);
> >>>>>>>>                if (!VECTOR_MODE_P (mode1)
> >>>>>>>>                   || !directly_supported_p (code, vectype1))
> >>>>>>>>                 reduce_with_shift = false;
> >>>>>>>>         +      reduce_with_shift = false;
> >>>>>>>>
> >>>>>>>> ..., I'm able to work around those regressions: by means of forcing
> >>>>>>>> "Reduce using scalar code" instead of "Reduce using vector shifts".
> >>>>>>>
> >>>>>>> I would say it somewhere gets broken between the vectorizer and the
> >>>>>>> GPU
> >>>>>>> which means likely in the target?  Can you point out an issue in the
> >>>>>>> actual generated GCN code?
> >>>>>>>
> >>>>>>> Iff this kind of reduction is the issue you'd see quite a lot of
> >>>>>>> vectorzer execute FAILs.  I'm seeing a .COND_AND above - could it
> >>>>>>> be that the "mask" is still set wrong when doing the reduction
> >>>>>>> steps?
> >>>>>>
> >>>>>> It looks like the ds_bpermute_b32 instruction works differently on
> >>>>>> RDNA3
> >>>>>> (vs.
> >>>>>> GCN/CDNA and even RDNA2).
> >>>>>>
> >>>>>>    From the pseudocode in the documentation:
> >>>>>>
> >>>>>>      for i in 0 : WAVE64 ? 63 : 31 do
> >>>>>>        // ADDR needs to be divided by 4.
> >>>>>>        // High-order bits are ignored.
> >>>>>>        // NOTE: destination lane is MOD 32 regardless of wave size.
> >>>>>>        src_lane = 32'I(VGPR[i][ADDR] + OFFSET.b) / 4 % 32;
> >>>>>>        // EXEC is applied to the source VGPR reads.
> >>>>>>        if EXEC[src_lane].u1 then
> >>>>>>          tmp[i] = VGPR[src_lane][DATA0]
> >>>>>>        endif
> >>>>>>      endfor;
> >>>>>>
> >>>>>> The key detail is the "mod 32"; the other architectures have "mod 64"
> >>>>>> there.
> >>>>>>
> >>>>>> So, the last 32 lanes are discarded, and the first 32 lanes are
> >>>>>> duplicated
> >>>>>> into the last, and this explains why my_popcount returns double the
> >>>>>> expected
> >>>>>> value for smaller inputs.
> >>>>>>
> >>>>>> Richi, can you confirm that this testcase works properly on your card,
> >>>>>> please?
> >>>>>>
> >>>>>> To test, assuming you only have the offload toolchain built, compile
> >>>>>> using
> >>>>>> x86_64-none-linux-gnu-accel-amdgcn-amdhsa-gcc, which should produce a
> >>>>>> raw
> >>>>>> AMD
> >>>>>> ELF file. Then you run it using "gcn-run a.out" (you can find gcn-run
> >>>>>> under
> >>>>>> libexec).
> >>>>>
> >>>>> I'm getting
> >>>>>
> >>>>> i=1, ints[i]=0x1 a=1, b=2
> >>>>> i=2, ints[i]=0x80000000 a=1, b=2
> >>>>> i=3, ints[i]=0x2 a=1, b=2
> >>>>> i=4, ints[i]=0x40000000 a=1, b=2
> >>>>> i=5, ints[i]=0x10000 a=1, b=2
> >>>>> i=6, ints[i]=0x8000 a=1, b=2
> >>>>> i=7, ints[i]=0xa5a5a5a5 a=16, b=32
> >>>>> i=8, ints[i]=0x5a5a5a5a a=16, b=32
> >>>>> i=9, ints[i]=0xcafe0000 a=11, b=22
> >>>>> i=10, ints[i]=0xcafe00 a=11, b=22
> >>>>> i=11, ints[i]=0xcafe a=11, b=22
> >>>>> i=12, ints[i]=0xffffffff a=32, b=64
> >>>>>
> >>>>> which I think is the same as Thomas output and thus wrong?
> >>>>>
> >>>>> When building with -O0 I get no output.
> >>>>>
> >>>>> I'm of course building with -march=gfx1030
> >>>>
> >>>> OK, please try this example, just to check my expectation that your
> >>>> permute
> >>>> works:
> >>>>
> >>>> typedef int v64si __attribute__ ((vector_size (256)));
> >>>>
> >>>> int main()
> >>>> {
> >>>>     v64si permute = {
> >>>>       40, 40, 40, 40, 40, 40, 40, 40,
> >>>>       40, 40, 40, 40, 40, 40, 40, 40,
> >>>>       40, 40, 40, 40, 40, 40, 40, 40,
> >>>>       40, 40, 40, 40, 40, 40, 40, 40,
> >>>>       40, 40, 40, 40, 40, 40, 40, 40,
> >>>>       40, 40, 40, 40, 40, 40, 40, 40,
> >>>>       40, 40, 40, 40, 40, 40, 40, 40,
> >>>>       40, 40, 40, 40, 40, 40, 40, 40
> >>>>     };
> >>>>     v64si result;
> >>>>
> >>>>     asm ("ds_bpermute_b32 %0, %1, v1" : "=v"(result) : "v"(permute),
> >>>>     "e"(-1L));
> >>>>
> >>>>     for (int i=0; i<63; i++)
> >>>>       __builtin_printf ("%d ", result[i]);
> >>>>     __builtin_printf ("\n");
> >>>>
> >>>>     return 0;
> >>>> }
> >>>>
> >>>> On GCN/CDNA devices I expect this to print "10" 64 times. On RDNA3 it
> >>>> prints
> >>>> "10" 32 times, and "42" 32 times (which doesn't quite match what I'd
> >>>> expect
> >>>> from the pseudocode, but does match the written description). Which do
> >>>> you
> >>>> get?
> >>>
> >>> 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
> >>> 10 10 10 10 10 10 10 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42
> >>> 42 42 42 42 42 42 42 42 42 42 42 42 42
> >>>
> >>> so RDNA2 matches RDNA3 here.
> >>
> >> OK, that probably is the problem with both our reductions then. The RDNA2
> >> manual has the 32-lane wording in the description, but the instruction
> >> pseudocode lies. :(
> >>
> >> I'm now not sure how to implement permute without actually hitting memory?
> >> The
> >> permutation vector is exactly what we'd need to do a gather load from
> >> memory
> >> (not a coincident), but we'd need to find a memory location to do it,
> >> ideally
> >> in the low-latency LDS memory, and it'd have to be thread-safe.
> >>
> >> The attached not-well-tested patch should allow only valid permutations.
> >> Hopefully we go back to working code, but there'll be things that won't
> >> vectorize. That said, the new "dump" output code has fewer and probably
> >> cheaper instructions, so hmmm.
> > 
> > This fixes the reduced builtin-bitops-1.c on RDNA2.
> > 
> > I suppse if RDNA really only has 32 lane vectors (it sounds like it,
> > even if it can "simulate" 64 lane ones?) then it might make sense to
> > vectorize for 32 lanes?  That said, with variable-length it likely
> > doesn't matter but I'd not expose fixed-size modes with 64 lanes then?
> 
> For most operations, wavefrontsize=64 works just fine; the GPU runs each
> instruction twice and presents a pair of hardware registers as a logical
> 64-lane register. This breaks down for permutations and reductions, and is
> obviously inefficient when they vectors are not fully utilized, but is
> otherwise compatible with the GCN/CDNA compiler.
> 
> I didn't want to invest all the effort it would take to support
> wavefrontsize=32, which would be the natural mode for these devices; the
> number of places that have "64" hard-coded is just too big. Not only that, but
> the EXEC and VCC registers change from DImode to SImode and that's going to
> break a lot of stuff. (And we have no paying customer for this.)
> 
> I'm open to patch submissions. :)

OK, I see ;)  As said for fully masked that's a good answer.  I'd
probably still not expose V64mode modes in the RTL expanders for the
vect_* patterns?  Or, what happens if you change
gcn_vectorize_preferred_simd_mode to return 32 lane modes for RDNA
and omit 64 lane modes from gcn_autovectorize_vector_modes for RDNA?

Does that possibly leave performance on the plate? (not sure if there's
any documents about choosing wavefrontsize=64 vs 32 with regard to
performance)

Note it would entirely forbit the vectorizer from using larger modes,
it just makes it prefer the smaller ones.  OTOH if you then run
wavefrontsize=64 ontop of it it's probably wasting the 2nd instruction
by always masking it?

So yeah.  Guess a s/64/wavefrontsize/ would be a first step towards
allowing 32 there ...

Anyway, the fix works, so that's the most important thing ;)

Richard.
  
Thomas Schwinge Feb. 15, 2024, 10:23 a.m. UTC | #12
Hi!

On 2024-02-15T08:49:17+0100, Richard Biener <rguenther@suse.de> wrote:
> On Wed, 14 Feb 2024, Andrew Stubbs wrote:
>> On 14/02/2024 13:43, Richard Biener wrote:
>> > On Wed, 14 Feb 2024, Andrew Stubbs wrote:
>> >> On 14/02/2024 13:27, Richard Biener wrote:
>> >>> On Wed, 14 Feb 2024, Andrew Stubbs wrote:
>> >>>> On 13/02/2024 08:26, Richard Biener wrote:
>> >>>>> On Mon, 12 Feb 2024, Thomas Schwinge wrote:
>> >>>>>> On 2023-10-20T12:51:03+0100, Andrew Stubbs <ams@codesourcery.com>
>> >>>>>> wrote:
>> >>>>>>> I've committed this patch
>> >>>>>>
>> >>>>>> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
>> >>>>>> "amdgcn: add -march=gfx1030 EXPERIMENTAL".
>> >>>>>>
>> >>>>>> The RDNA2 ISA variant doesn't support certain instructions previous
>> >>>>>> implemented in GCC/GCN, so a number of patterns etc. had to be
>> >>>>>> disabled:
>> >>>>>>
>> >>>>>>> [...] Vector
>> >>>>>>> reductions will need to be reworked for RDNA2.  [...]
>> >>>>>>
>> >>>>>>>    * config/gcn/gcn-valu.md (@dpp_move<mode>): Disable for RDNA2.
>> >>>>>>>    (addc<mode>3<exec_vcc>): Add RDNA2 syntax variant.
>> >>>>>>>    (subc<mode>3<exec_vcc>): Likewise.
>> >>>>>>>    (<convop><mode><vndi>2_exec): Add RDNA2 alternatives.
>> >>>>>>>    (vec_cmp<mode>di): Likewise.
>> >>>>>>>    (vec_cmp<u><mode>di): Likewise.
>> >>>>>>>    (vec_cmp<mode>di_exec): Likewise.
>> >>>>>>>    (vec_cmp<u><mode>di_exec): Likewise.
>> >>>>>>>    (vec_cmp<mode>di_dup): Likewise.
>> >>>>>>>    (vec_cmp<mode>di_dup_exec): Likewise.
>> >>>>>>>    (reduc_<reduc_op>_scal_<mode>): Disable for RDNA2.
>> >>>>>>>    (*<reduc_op>_dpp_shr_<mode>): Likewise.
>> >>>>>>>    (*plus_carry_dpp_shr_<mode>): Likewise.
>> >>>>>>>    (*plus_carry_in_dpp_shr_<mode>): Likewise.
>> >>>>>>
>> >>>>>> Etc.  The expectation being that GCC middle end copes with this, and
>> >>>>>> synthesizes some less ideal yet still functional vector code, I presume.
>> >>>>>>
>> >>>>>> The later RDNA3/gfx1100 support builds on top of this, and that's what
>> >>>>>> I'm currently working on getting proper GCC/GCN target (not offloading)
>> >>>>>> results for.
>> >>>>>>
>> >>>>>> I'm seeing a good number of execution test FAILs (regressions compared to
>> >>>>>> my earlier non-gfx1100 testing), and I've now tracked down where one
>> >>>>>> large class of those comes into existance -- [...]

>> >>>>>> With the following hack applied to 'gcc/tree-vect-loop.cc':
>> >>>>>>
>> >>>>>>        @@ -6687,8 +6687,9 @@ vect_create_epilog_for_reduction
>> >>>>>>        (loop_vec_info
>> >>>>>>        loop_vinfo,
>> >>>>>>               reduce_with_shift = have_whole_vector_shift (mode1);
>> >>>>>>               if (!VECTOR_MODE_P (mode1)
>> >>>>>>                  || !directly_supported_p (code, vectype1))
>> >>>>>>                reduce_with_shift = false;
>> >>>>>>        +      reduce_with_shift = false;
>> >>>>>>
>> >>>>>> ..., I'm able to work around those regressions: by means of forcing
>> >>>>>> "Reduce using scalar code" instead of "Reduce using vector shifts".

>> The attached not-well-tested patch should allow only valid permutations.
>> Hopefully we go back to working code, but there'll be things that won't
>> vectorize. That said, the new "dump" output code has fewer and probably
>> cheaper instructions, so hmmm.
>
> This fixes the reduced builtin-bitops-1.c on RDNA2.

I confirm that "amdgcn: Disallow unsupported permute on RDNA devices"
also obsoletes my 'reduce_with_shift = false;' hack -- and also cures a
good number of additional FAILs (regressions), where presumably we
permute via different code paths.  Thanks!

There also are a few regressions, but only minor:

    PASS: gcc.dg/vect/no-vfa-vect-depend-3.c (test for excess errors)
    PASS: gcc.dg/vect/no-vfa-vect-depend-3.c execution test
    PASS: gcc.dg/vect/no-vfa-vect-depend-3.c scan-tree-dump-times vect "vectorized 1 loops" 4
    [-PASS:-]{+FAIL:+} gcc.dg/vect/no-vfa-vect-depend-3.c scan-tree-dump-times vect "dependence distance negative" 4

..., because:

    gcc.dg/vect/no-vfa-vect-depend-3.c: pattern found 6 times
    FAIL: gcc.dg/vect/no-vfa-vect-depend-3.c scan-tree-dump-times vect "dependence distance negative" 4

    PASS: gcc.dg/vect/vect-119.c (test for excess errors)
    [-PASS:-]{+FAIL:+} gcc.dg/vect/vect-119.c scan-tree-dump-times vect "Detected interleaving load of size 2" 1
    PASS: gcc.dg/vect/vect-119.c scan-tree-dump-not optimized "Invalid sum"

..., because:

    gcc.dg/vect/vect-119.c: pattern found 3 times
    FAIL: gcc.dg/vect/vect-119.c scan-tree-dump-times vect "Detected interleaving load of size 2" 1

    PASS: gcc.dg/vect/vect-reduc-mul_1.c (test for excess errors)
    PASS: gcc.dg/vect/vect-reduc-mul_1.c execution test
    [-PASS:-]{+FAIL:+} gcc.dg/vect/vect-reduc-mul_1.c scan-tree-dump vect "Reduce using vector shifts"

    PASS: gcc.dg/vect/vect-reduc-mul_2.c (test for excess errors)
    PASS: gcc.dg/vect/vect-reduc-mul_2.c execution test
    [-PASS:-]{+FAIL:+} gcc.dg/vect/vect-reduc-mul_2.c scan-tree-dump vect "Reduce using vector shifts"

..., plus the following, in combination with the earlier changes
disabling patterns:

    PASS: gcc.dg/vect/vect-reduc-or_1.c (test for excess errors)
    PASS: gcc.dg/vect/vect-reduc-or_1.c execution test
    [-PASS:-]{+FAIL:+} gcc.dg/vect/vect-reduc-or_1.c scan-tree-dump vect "Reduce using direct vector reduction"

    PASS: gcc.dg/vect/vect-reduc-or_2.c (test for excess errors)
    PASS: gcc.dg/vect/vect-reduc-or_2.c execution test
    [-PASS:-]{+FAIL:+} gcc.dg/vect/vect-reduc-or_2.c scan-tree-dump vect "Reduce using direct vector reduction"

Such test cases will need conditionalization on specific configurations.
I'm fine if we just let those FAIL (for RDNA2+) for the time being; there
are a good number of similar scanning FAILs pre-existing also for
non-gfx1100.


Grüße
 Thomas
  
Andrew Stubbs Feb. 15, 2024, 10:59 a.m. UTC | #13
On 15/02/2024 10:21, Richard Biener wrote:
[snip]
>>> I suppse if RDNA really only has 32 lane vectors (it sounds like it,
>>> even if it can "simulate" 64 lane ones?) then it might make sense to
>>> vectorize for 32 lanes?  That said, with variable-length it likely
>>> doesn't matter but I'd not expose fixed-size modes with 64 lanes then?
>>
>> For most operations, wavefrontsize=64 works just fine; the GPU runs each
>> instruction twice and presents a pair of hardware registers as a logical
>> 64-lane register. This breaks down for permutations and reductions, and is
>> obviously inefficient when they vectors are not fully utilized, but is
>> otherwise compatible with the GCN/CDNA compiler.
>>
>> I didn't want to invest all the effort it would take to support
>> wavefrontsize=32, which would be the natural mode for these devices; the
>> number of places that have "64" hard-coded is just too big. Not only that, but
>> the EXEC and VCC registers change from DImode to SImode and that's going to
>> break a lot of stuff. (And we have no paying customer for this.)
>>
>> I'm open to patch submissions. :)
> 
> OK, I see ;)  As said for fully masked that's a good answer.  I'd
> probably still not expose V64mode modes in the RTL expanders for the
> vect_* patterns?  Or, what happens if you change
> gcn_vectorize_preferred_simd_mode to return 32 lane modes for RDNA
> and omit 64 lane modes from gcn_autovectorize_vector_modes for RDNA?

Changing the preferred mode probably would fix permute.

> Does that possibly leave performance on the plate? (not sure if there's
> any documents about choosing wavefrontsize=64 vs 32 with regard to
> performance)
> 
> Note it would entirely forbit the vectorizer from using larger modes,
> it just makes it prefer the smaller ones.  OTOH if you then run
> wavefrontsize=64 ontop of it it's probably wasting the 2nd instruction
> by always masking it?

Right, the GPU will continue to process the "top half" of the vector as 
an additional step, regardless whether you put anything useful there, or 
not.

> So yeah.  Guess a s/64/wavefrontsize/ would be a first step towards
> allowing 32 there ...

I think the DImode to SImode change is the most difficult fix. Unless 
you know of a cunning trick, that's going to mean a lot of changes to a 
lot of the machine description; substitutions, duplications, iterators, 
indirections, etc., etc., etc.

The "64" substitution would be tedious but less hairy. I did a lot of 
those when I created the fake vector sizes.

> Anyway, the fix works, so that's the most important thing ;)

:)

Andrew
  
Richard Biener Feb. 15, 2024, 12:31 p.m. UTC | #14
On Thu, 15 Feb 2024, Andrew Stubbs wrote:

> On 15/02/2024 10:21, Richard Biener wrote:
> [snip]
> >>> I suppse if RDNA really only has 32 lane vectors (it sounds like it,
> >>> even if it can "simulate" 64 lane ones?) then it might make sense to
> >>> vectorize for 32 lanes?  That said, with variable-length it likely
> >>> doesn't matter but I'd not expose fixed-size modes with 64 lanes then?
> >>
> >> For most operations, wavefrontsize=64 works just fine; the GPU runs each
> >> instruction twice and presents a pair of hardware registers as a logical
> >> 64-lane register. This breaks down for permutations and reductions, and is
> >> obviously inefficient when they vectors are not fully utilized, but is
> >> otherwise compatible with the GCN/CDNA compiler.
> >>
> >> I didn't want to invest all the effort it would take to support
> >> wavefrontsize=32, which would be the natural mode for these devices; the
> >> number of places that have "64" hard-coded is just too big. Not only that,
> >> but
> >> the EXEC and VCC registers change from DImode to SImode and that's going to
> >> break a lot of stuff. (And we have no paying customer for this.)
> >>
> >> I'm open to patch submissions. :)
> > 
> > OK, I see ;)  As said for fully masked that's a good answer.  I'd
> > probably still not expose V64mode modes in the RTL expanders for the
> > vect_* patterns?  Or, what happens if you change
> > gcn_vectorize_preferred_simd_mode to return 32 lane modes for RDNA
> > and omit 64 lane modes from gcn_autovectorize_vector_modes for RDNA?
> 
> Changing the preferred mode probably would fix permute.
> 
> > Does that possibly leave performance on the plate? (not sure if there's
> > any documents about choosing wavefrontsize=64 vs 32 with regard to
> > performance)
> > 
> > Note it would entirely forbit the vectorizer from using larger modes,
> > it just makes it prefer the smaller ones.  OTOH if you then run
> > wavefrontsize=64 ontop of it it's probably wasting the 2nd instruction
> > by always masking it?
> 
> Right, the GPU will continue to process the "top half" of the vector as an
> additional step, regardless whether you put anything useful there, or not.
> 
> > So yeah.  Guess a s/64/wavefrontsize/ would be a first step towards
> > allowing 32 there ...
> 
> I think the DImode to SImode change is the most difficult fix. Unless you know
> of a cunning trick, that's going to mean a lot of changes to a lot of the
> machine description; substitutions, duplications, iterators, indirections,
> etc., etc., etc.

Hmm, maybe just leave it at DImode in the patterns?  OTOH mode
iterators to do both SImode and DImode might work as well, but yeah,
a lot of churn.

Richard.
  
Andrew Stubbs Feb. 15, 2024, 1:02 p.m. UTC | #15
On 15/02/2024 10:23, Thomas Schwinge wrote:
> Hi!
> 
> On 2024-02-15T08:49:17+0100, Richard Biener <rguenther@suse.de> wrote:
>> On Wed, 14 Feb 2024, Andrew Stubbs wrote:
>>> On 14/02/2024 13:43, Richard Biener wrote:
>>>> On Wed, 14 Feb 2024, Andrew Stubbs wrote:
>>>>> On 14/02/2024 13:27, Richard Biener wrote:
>>>>>> On Wed, 14 Feb 2024, Andrew Stubbs wrote:
>>>>>>> On 13/02/2024 08:26, Richard Biener wrote:
>>>>>>>> On Mon, 12 Feb 2024, Thomas Schwinge wrote:
>>>>>>>>> On 2023-10-20T12:51:03+0100, Andrew Stubbs <ams@codesourcery.com>
>>>>>>>>> wrote:
>>>>>>>>>> I've committed this patch
>>>>>>>>>
>>>>>>>>> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
>>>>>>>>> "amdgcn: add -march=gfx1030 EXPERIMENTAL".
>>>>>>>>>
>>>>>>>>> The RDNA2 ISA variant doesn't support certain instructions previous
>>>>>>>>> implemented in GCC/GCN, so a number of patterns etc. had to be
>>>>>>>>> disabled:
>>>>>>>>>
>>>>>>>>>> [...] Vector
>>>>>>>>>> reductions will need to be reworked for RDNA2.  [...]
>>>>>>>>>
>>>>>>>>>>     * config/gcn/gcn-valu.md (@dpp_move<mode>): Disable for RDNA2.
>>>>>>>>>>     (addc<mode>3<exec_vcc>): Add RDNA2 syntax variant.
>>>>>>>>>>     (subc<mode>3<exec_vcc>): Likewise.
>>>>>>>>>>     (<convop><mode><vndi>2_exec): Add RDNA2 alternatives.
>>>>>>>>>>     (vec_cmp<mode>di): Likewise.
>>>>>>>>>>     (vec_cmp<u><mode>di): Likewise.
>>>>>>>>>>     (vec_cmp<mode>di_exec): Likewise.
>>>>>>>>>>     (vec_cmp<u><mode>di_exec): Likewise.
>>>>>>>>>>     (vec_cmp<mode>di_dup): Likewise.
>>>>>>>>>>     (vec_cmp<mode>di_dup_exec): Likewise.
>>>>>>>>>>     (reduc_<reduc_op>_scal_<mode>): Disable for RDNA2.
>>>>>>>>>>     (*<reduc_op>_dpp_shr_<mode>): Likewise.
>>>>>>>>>>     (*plus_carry_dpp_shr_<mode>): Likewise.
>>>>>>>>>>     (*plus_carry_in_dpp_shr_<mode>): Likewise.
>>>>>>>>>
>>>>>>>>> Etc.  The expectation being that GCC middle end copes with this, and
>>>>>>>>> synthesizes some less ideal yet still functional vector code, I presume.
>>>>>>>>>
>>>>>>>>> The later RDNA3/gfx1100 support builds on top of this, and that's what
>>>>>>>>> I'm currently working on getting proper GCC/GCN target (not offloading)
>>>>>>>>> results for.
>>>>>>>>>
>>>>>>>>> I'm seeing a good number of execution test FAILs (regressions compared to
>>>>>>>>> my earlier non-gfx1100 testing), and I've now tracked down where one
>>>>>>>>> large class of those comes into existance -- [...]
> 
>>>>>>>>> With the following hack applied to 'gcc/tree-vect-loop.cc':
>>>>>>>>>
>>>>>>>>>         @@ -6687,8 +6687,9 @@ vect_create_epilog_for_reduction
>>>>>>>>>         (loop_vec_info
>>>>>>>>>         loop_vinfo,
>>>>>>>>>                reduce_with_shift = have_whole_vector_shift (mode1);
>>>>>>>>>                if (!VECTOR_MODE_P (mode1)
>>>>>>>>>                   || !directly_supported_p (code, vectype1))
>>>>>>>>>                 reduce_with_shift = false;
>>>>>>>>>         +      reduce_with_shift = false;
>>>>>>>>>
>>>>>>>>> ..., I'm able to work around those regressions: by means of forcing
>>>>>>>>> "Reduce using scalar code" instead of "Reduce using vector shifts".
> 
>>> The attached not-well-tested patch should allow only valid permutations.
>>> Hopefully we go back to working code, but there'll be things that won't
>>> vectorize. That said, the new "dump" output code has fewer and probably
>>> cheaper instructions, so hmmm.
>>
>> This fixes the reduced builtin-bitops-1.c on RDNA2.
> 
> I confirm that "amdgcn: Disallow unsupported permute on RDNA devices"
> also obsoletes my 'reduce_with_shift = false;' hack -- and also cures a
> good number of additional FAILs (regressions), where presumably we
> permute via different code paths.  Thanks!
> 
> There also are a few regressions, but only minor:
> 
>      PASS: gcc.dg/vect/no-vfa-vect-depend-3.c (test for excess errors)
>      PASS: gcc.dg/vect/no-vfa-vect-depend-3.c execution test
>      PASS: gcc.dg/vect/no-vfa-vect-depend-3.c scan-tree-dump-times vect "vectorized 1 loops" 4
>      [-PASS:-]{+FAIL:+} gcc.dg/vect/no-vfa-vect-depend-3.c scan-tree-dump-times vect "dependence distance negative" 4
> 
> ..., because:
> 
>      gcc.dg/vect/no-vfa-vect-depend-3.c: pattern found 6 times
>      FAIL: gcc.dg/vect/no-vfa-vect-depend-3.c scan-tree-dump-times vect "dependence distance negative" 4
> 
>      PASS: gcc.dg/vect/vect-119.c (test for excess errors)
>      [-PASS:-]{+FAIL:+} gcc.dg/vect/vect-119.c scan-tree-dump-times vect "Detected interleaving load of size 2" 1
>      PASS: gcc.dg/vect/vect-119.c scan-tree-dump-not optimized "Invalid sum"
> 
> ..., because:
> 
>      gcc.dg/vect/vect-119.c: pattern found 3 times
>      FAIL: gcc.dg/vect/vect-119.c scan-tree-dump-times vect "Detected interleaving load of size 2" 1
> 
>      PASS: gcc.dg/vect/vect-reduc-mul_1.c (test for excess errors)
>      PASS: gcc.dg/vect/vect-reduc-mul_1.c execution test
>      [-PASS:-]{+FAIL:+} gcc.dg/vect/vect-reduc-mul_1.c scan-tree-dump vect "Reduce using vector shifts"
> 
>      PASS: gcc.dg/vect/vect-reduc-mul_2.c (test for excess errors)
>      PASS: gcc.dg/vect/vect-reduc-mul_2.c execution test
>      [-PASS:-]{+FAIL:+} gcc.dg/vect/vect-reduc-mul_2.c scan-tree-dump vect "Reduce using vector shifts"
> 
> ..., plus the following, in combination with the earlier changes
> disabling patterns:
> 
>      PASS: gcc.dg/vect/vect-reduc-or_1.c (test for excess errors)
>      PASS: gcc.dg/vect/vect-reduc-or_1.c execution test
>      [-PASS:-]{+FAIL:+} gcc.dg/vect/vect-reduc-or_1.c scan-tree-dump vect "Reduce using direct vector reduction"
> 
>      PASS: gcc.dg/vect/vect-reduc-or_2.c (test for excess errors)
>      PASS: gcc.dg/vect/vect-reduc-or_2.c execution test
>      [-PASS:-]{+FAIL:+} gcc.dg/vect/vect-reduc-or_2.c scan-tree-dump vect "Reduce using direct vector reduction"
> 
> Such test cases will need conditionalization on specific configurations.
> I'm fine if we just let those FAIL (for RDNA2+) for the time being; there
> are a good number of similar scanning FAILs pre-existing also for
> non-gfx1100.

Thanks, Thomas.

The patch is now committed.

Andrew
  
Thomas Schwinge Feb. 16, 2024, 9:52 a.m. UTC | #16
Hi!

On 2023-10-20T12:51:03+0100, Andrew Stubbs <ams@codesourcery.com> wrote:
> I've committed this patch

... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
"amdgcn: add -march=gfx1030 EXPERIMENTAL", which the later RDNA3/gfx1100
support builds on top of, and that's what I'm currently working on
getting proper GCC/GCN target (not offloading) results for.

Now looking at 'gcc.dg/vect/bb-slp-cond-1.c', which is reasonably simple,
and hopefully representative for other SLP execution test FAILs
(regressions compared to my earlier non-gfx1100 testing).

    $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ source-gcc/gcc/testsuite/gcc.dg/vect/bb-slp-cond-1.c --sysroot=install/amdgcn-amdhsa -ftree-vectorize -fno-tree-loop-distribute-patterns -fno-vect-cost-model -fno-common -O2 -fdump-tree-slp-details -fdump-tree-vect-details -isystem build-gcc/amdgcn-amdhsa/gfx1100/newlib/targ-include -isystem source-gcc/newlib/libc/include -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/ -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -wrapper setarch,--addr-no-randomize -fdump-tree-all-all -fdump-ipa-all-all -fdump-rtl-all-all -save-temps -march=gfx1100

The '-march=gfx1030' 'a-bb-slp-cond-1.s' is identical (apart from
'TARGET_PACKED_WORK_ITEMS' in 'gcn_target_asm_function_prologue'), so I
suppose will also exhibit the same failure mode, once again?

Compared to '-march=gfx90a', the differences begin in
'a-bb-slp-cond-1.c.266r.expand' (only!), down to 'a-bb-slp-cond-1.s'.

Changed like:

    @@ -38,10 +38,10 @@ int main ()
     #pragma GCC novector
       for (i = 1; i < N; i++)
         if (a[i] != i%4 + 1)
    -      abort ();
    +      __builtin_printf("%d %d != %d\n", i, a[i], i%4 + 1);
     
       if (a[0] != 5)
    -    abort ();
    +    __builtin_printf("%d %d != %d\n", 0, a[0], 5);

..., we see:

    $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
    40 5 != 1
    41 6 != 2
    42 7 != 3
    43 8 != 4
    44 5 != 1
    45 6 != 2
    46 7 != 3
    47 8 != 4

'40..47' are the 'i = 10..11' in 'foo', and the expectation is
'a[i * stride + 0..3] != 0'.  So, either some earlier iteration has
scribbled zero values over these (vector lane masking issue, perhaps?),
or some other code generation issue?


Grüße
 Thomas
  
Richard Biener Feb. 16, 2024, 10:17 a.m. UTC | #17
On Fri, 16 Feb 2024, Thomas Schwinge wrote:

> Hi!
> 
> On 2023-10-20T12:51:03+0100, Andrew Stubbs <ams@codesourcery.com> wrote:
> > I've committed this patch
> 
> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
> "amdgcn: add -march=gfx1030 EXPERIMENTAL", which the later RDNA3/gfx1100
> support builds on top of, and that's what I'm currently working on
> getting proper GCC/GCN target (not offloading) results for.
> 
> Now looking at 'gcc.dg/vect/bb-slp-cond-1.c', which is reasonably simple,
> and hopefully representative for other SLP execution test FAILs
> (regressions compared to my earlier non-gfx1100 testing).
> 
>     $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ source-gcc/gcc/testsuite/gcc.dg/vect/bb-slp-cond-1.c --sysroot=install/amdgcn-amdhsa -ftree-vectorize -fno-tree-loop-distribute-patterns -fno-vect-cost-model -fno-common -O2 -fdump-tree-slp-details -fdump-tree-vect-details -isystem build-gcc/amdgcn-amdhsa/gfx1100/newlib/targ-include -isystem source-gcc/newlib/libc/include -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/ -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -wrapper setarch,--addr-no-randomize -fdump-tree-all-all -fdump-ipa-all-all -fdump-rtl-all-all -save-temps -march=gfx1100
> 
> The '-march=gfx1030' 'a-bb-slp-cond-1.s' is identical (apart from
> 'TARGET_PACKED_WORK_ITEMS' in 'gcn_target_asm_function_prologue'), so I
> suppose will also exhibit the same failure mode, once again?
> 
> Compared to '-march=gfx90a', the differences begin in
> 'a-bb-slp-cond-1.c.266r.expand' (only!), down to 'a-bb-slp-cond-1.s'.
> 
> Changed like:
> 
>     @@ -38,10 +38,10 @@ int main ()
>      #pragma GCC novector
>        for (i = 1; i < N; i++)
>          if (a[i] != i%4 + 1)
>     -      abort ();
>     +      __builtin_printf("%d %d != %d\n", i, a[i], i%4 + 1);
>      
>        if (a[0] != 5)
>     -    abort ();
>     +    __builtin_printf("%d %d != %d\n", 0, a[0], 5);
> 
> ..., we see:
> 
>     $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
>     40 5 != 1
>     41 6 != 2
>     42 7 != 3
>     43 8 != 4
>     44 5 != 1
>     45 6 != 2
>     46 7 != 3
>     47 8 != 4
> 
> '40..47' are the 'i = 10..11' in 'foo', and the expectation is
> 'a[i * stride + 0..3] != 0'.  So, either some earlier iteration has
> scribbled zero values over these (vector lane masking issue, perhaps?),
> or some other code generation issue?

So we're indeed BB vectorizing this to

  _54 = MEM <vector(4) int> [(int *)_14];
  vect_iftmp.12_56 = .VCOND (_54, { 0, 0, 0, 0 }, { 1, 2, 3, 4 }, { 5, 6, 
7, 8 }, 115);
  MEM <vector(4) int> [(int *)_14] = vect_iftmp.12_56;

I don't understand the assembly very well but it might be that
the mask computation for the .VCOND scribbles the mask used
to constrain operation to 4 lanes?

.L3:
        s_mov_b64       exec, 15
        v_add_co_u32    v4, s[22:23], s32, v3
        v_mov_b32       v5, s33
        v_add_co_ci_u32 v5, s[22:23], 0, v5, s[22:23]
        flat_load_dword v7, v[4:5] offset:0
        s_waitcnt       0
        flat_load_dword v0, v[10:11] offset:0
        s_waitcnt       0
        flat_load_dword v6, v[8:9] offset:0
        s_waitcnt       0
        v_cmp_ne_u32    s[18:19], v7, 0
        v_cndmask_b32   v0, v6, v0, s[18:19]
        flat_store_dword        v[4:5], v0 offset:0
        s_add_i32       s12, s12, 1
        s_add_u32       s32, s32, s28
        s_addc_u32      s33, s33, s29
        s_cmp_lg_u32    s12, s13
        s_cbranch_scc1  .L3

Richard.
  
Andrew Stubbs Feb. 16, 2024, 11:22 a.m. UTC | #18
On 16/02/2024 10:17, Richard Biener wrote:
> On Fri, 16 Feb 2024, Thomas Schwinge wrote:
> 
>> Hi!
>>
>> On 2023-10-20T12:51:03+0100, Andrew Stubbs <ams@codesourcery.com> wrote:
>>> I've committed this patch
>>
>> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
>> "amdgcn: add -march=gfx1030 EXPERIMENTAL", which the later RDNA3/gfx1100
>> support builds on top of, and that's what I'm currently working on
>> getting proper GCC/GCN target (not offloading) results for.
>>
>> Now looking at 'gcc.dg/vect/bb-slp-cond-1.c', which is reasonably simple,
>> and hopefully representative for other SLP execution test FAILs
>> (regressions compared to my earlier non-gfx1100 testing).
>>
>>      $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ source-gcc/gcc/testsuite/gcc.dg/vect/bb-slp-cond-1.c --sysroot=install/amdgcn-amdhsa -ftree-vectorize -fno-tree-loop-distribute-patterns -fno-vect-cost-model -fno-common -O2 -fdump-tree-slp-details -fdump-tree-vect-details -isystem build-gcc/amdgcn-amdhsa/gfx1100/newlib/targ-include -isystem source-gcc/newlib/libc/include -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/ -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -wrapper setarch,--addr-no-randomize -fdump-tree-all-all -fdump-ipa-all-all -fdump-rtl-all-all -save-temps -march=gfx1100
>>
>> The '-march=gfx1030' 'a-bb-slp-cond-1.s' is identical (apart from
>> 'TARGET_PACKED_WORK_ITEMS' in 'gcn_target_asm_function_prologue'), so I
>> suppose will also exhibit the same failure mode, once again?
>>
>> Compared to '-march=gfx90a', the differences begin in
>> 'a-bb-slp-cond-1.c.266r.expand' (only!), down to 'a-bb-slp-cond-1.s'.
>>
>> Changed like:
>>
>>      @@ -38,10 +38,10 @@ int main ()
>>       #pragma GCC novector
>>         for (i = 1; i < N; i++)
>>           if (a[i] != i%4 + 1)
>>      -      abort ();
>>      +      __builtin_printf("%d %d != %d\n", i, a[i], i%4 + 1);
>>       
>>         if (a[0] != 5)
>>      -    abort ();
>>      +    __builtin_printf("%d %d != %d\n", 0, a[0], 5);
>>
>> ..., we see:
>>
>>      $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
>>      40 5 != 1
>>      41 6 != 2
>>      42 7 != 3
>>      43 8 != 4
>>      44 5 != 1
>>      45 6 != 2
>>      46 7 != 3
>>      47 8 != 4
>>
>> '40..47' are the 'i = 10..11' in 'foo', and the expectation is
>> 'a[i * stride + 0..3] != 0'.  So, either some earlier iteration has
>> scribbled zero values over these (vector lane masking issue, perhaps?),
>> or some other code generation issue?
> 
> So we're indeed BB vectorizing this to
> 
>    _54 = MEM <vector(4) int> [(int *)_14];
>    vect_iftmp.12_56 = .VCOND (_54, { 0, 0, 0, 0 }, { 1, 2, 3, 4 }, { 5, 6,
> 7, 8 }, 115);
>    MEM <vector(4) int> [(int *)_14] = vect_iftmp.12_56;
> 
> I don't understand the assembly very well but it might be that
> the mask computation for the .VCOND scribbles the mask used
> to constrain operation to 4 lanes?
> 
> .L3:
>          s_mov_b64       exec, 15
>          v_add_co_u32    v4, s[22:23], s32, v3
>          v_mov_b32       v5, s33
>          v_add_co_ci_u32 v5, s[22:23], 0, v5, s[22:23]
>          flat_load_dword v7, v[4:5] offset:0
>          s_waitcnt       0
>          flat_load_dword v0, v[10:11] offset:0
>          s_waitcnt       0
>          flat_load_dword v6, v[8:9] offset:0
>          s_waitcnt       0
>          v_cmp_ne_u32    s[18:19], v7, 0
>          v_cndmask_b32   v0, v6, v0, s[18:19]
>          flat_store_dword        v[4:5], v0 offset:0
>          s_add_i32       s12, s12, 1
>          s_add_u32       s32, s32, s28
>          s_addc_u32      s33, s33, s29
>          s_cmp_lg_u32    s12, s13
>          s_cbranch_scc1  .L3

This basic block has EXEC set to 15 (4 lanes) throughout. The mask for 
the VCOND a.k.a. v_vndmask_b32 is in s[18:19]. Those things seem OK.

I see the testcase avoids vec_extract V64SI to V4SI for gfx1100, even 
though it would be a no-op conversion, because the general case requires 
a permute instruction and named pattern insns can't have non-constant 
conditions. Is vec_extract allowed to FAIL? That might give a better 
result in this case.

However, I must be doing something different because 
vect/bb-slp-cond-1.c passes for me, on gfx1100.

Andrew
  
Richard Biener Feb. 16, 2024, 12:26 p.m. UTC | #19
On Fri, 16 Feb 2024, Andrew Stubbs wrote:

> On 16/02/2024 10:17, Richard Biener wrote:
> > On Fri, 16 Feb 2024, Thomas Schwinge wrote:
> > 
> >> Hi!
> >>
> >> On 2023-10-20T12:51:03+0100, Andrew Stubbs <ams@codesourcery.com> wrote:
> >>> I've committed this patch
> >>
> >> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
> >> "amdgcn: add -march=gfx1030 EXPERIMENTAL", which the later RDNA3/gfx1100
> >> support builds on top of, and that's what I'm currently working on
> >> getting proper GCC/GCN target (not offloading) results for.
> >>
> >> Now looking at 'gcc.dg/vect/bb-slp-cond-1.c', which is reasonably simple,
> >> and hopefully representative for other SLP execution test FAILs
> >> (regressions compared to my earlier non-gfx1100 testing).
> >>
> >>      $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/
> >>      source-gcc/gcc/testsuite/gcc.dg/vect/bb-slp-cond-1.c
> >>      --sysroot=install/amdgcn-amdhsa -ftree-vectorize
> >>      -fno-tree-loop-distribute-patterns -fno-vect-cost-model -fno-common
> >>      -O2 -fdump-tree-slp-details -fdump-tree-vect-details -isystem
> >>      build-gcc/amdgcn-amdhsa/gfx1100/newlib/targ-include -isystem
> >>      source-gcc/newlib/libc/include
> >>      -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
> >>      -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -wrapper
> >>      setarch,--addr-no-randomize -fdump-tree-all-all -fdump-ipa-all-all
> >>      -fdump-rtl-all-all -save-temps -march=gfx1100
> >>
> >> The '-march=gfx1030' 'a-bb-slp-cond-1.s' is identical (apart from
> >> 'TARGET_PACKED_WORK_ITEMS' in 'gcn_target_asm_function_prologue'), so I
> >> suppose will also exhibit the same failure mode, once again?
> >>
> >> Compared to '-march=gfx90a', the differences begin in
> >> 'a-bb-slp-cond-1.c.266r.expand' (only!), down to 'a-bb-slp-cond-1.s'.
> >>
> >> Changed like:
> >>
> >>      @@ -38,10 +38,10 @@ int main ()
> >>       #pragma GCC novector
> >>         for (i = 1; i < N; i++)
> >>           if (a[i] != i%4 + 1)
> >>      -      abort ();
> >>      +      __builtin_printf("%d %d != %d\n", i, a[i], i%4 + 1);
> >>       
> >>         if (a[0] != 5)
> >>      -    abort ();
> >>      +    __builtin_printf("%d %d != %d\n", 0, a[0], 5);
> >>
> >> ..., we see:
> >>
> >>      $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
> >>      40 5 != 1
> >>      41 6 != 2
> >>      42 7 != 3
> >>      43 8 != 4
> >>      44 5 != 1
> >>      45 6 != 2
> >>      46 7 != 3
> >>      47 8 != 4
> >>
> >> '40..47' are the 'i = 10..11' in 'foo', and the expectation is
> >> 'a[i * stride + 0..3] != 0'.  So, either some earlier iteration has
> >> scribbled zero values over these (vector lane masking issue, perhaps?),
> >> or some other code generation issue?
> > 
> > So we're indeed BB vectorizing this to
> > 
> >    _54 = MEM <vector(4) int> [(int *)_14];
> >    vect_iftmp.12_56 = .VCOND (_54, { 0, 0, 0, 0 }, { 1, 2, 3, 4 }, { 5, 6,
> > 7, 8 }, 115);
> >    MEM <vector(4) int> [(int *)_14] = vect_iftmp.12_56;
> > 
> > I don't understand the assembly very well but it might be that
> > the mask computation for the .VCOND scribbles the mask used
> > to constrain operation to 4 lanes?
> > 
> > .L3:
> >          s_mov_b64       exec, 15
> >          v_add_co_u32    v4, s[22:23], s32, v3
> >          v_mov_b32       v5, s33
> >          v_add_co_ci_u32 v5, s[22:23], 0, v5, s[22:23]
> >          flat_load_dword v7, v[4:5] offset:0
> >          s_waitcnt       0
> >          flat_load_dword v0, v[10:11] offset:0
> >          s_waitcnt       0
> >          flat_load_dword v6, v[8:9] offset:0
> >          s_waitcnt       0
> >          v_cmp_ne_u32    s[18:19], v7, 0
> >          v_cndmask_b32   v0, v6, v0, s[18:19]
> >          flat_store_dword        v[4:5], v0 offset:0
> >          s_add_i32       s12, s12, 1
> >          s_add_u32       s32, s32, s28
> >          s_addc_u32      s33, s33, s29
> >          s_cmp_lg_u32    s12, s13
> >          s_cbranch_scc1  .L3
> 
> This basic block has EXEC set to 15 (4 lanes) throughout. The mask for the
> VCOND a.k.a. v_vndmask_b32 is in s[18:19]. Those things seem OK.
> 
> I see the testcase avoids vec_extract V64SI to V4SI for gfx1100, even though
> it would be a no-op conversion, because the general case requires a permute
> instruction and named pattern insns can't have non-constant conditions. Is
> vec_extract allowed to FAIL? That might give a better result in this case.
> 
> However, I must be doing something different because vect/bb-slp-cond-1.c
> passes for me, on gfx1100.

I didn't try to run it - when doing make check-gcc fails to using
gcn-run for test invocation, what's the trick to make it do that?

Richard.
  
Andrew Stubbs Feb. 16, 2024, 12:41 p.m. UTC | #20
On 16/02/2024 12:26, Richard Biener wrote:
> On Fri, 16 Feb 2024, Andrew Stubbs wrote:
> 
>> On 16/02/2024 10:17, Richard Biener wrote:
>>> On Fri, 16 Feb 2024, Thomas Schwinge wrote:
>>>
>>>> Hi!
>>>>
>>>> On 2023-10-20T12:51:03+0100, Andrew Stubbs <ams@codesourcery.com> wrote:
>>>>> I've committed this patch
>>>>
>>>> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
>>>> "amdgcn: add -march=gfx1030 EXPERIMENTAL", which the later RDNA3/gfx1100
>>>> support builds on top of, and that's what I'm currently working on
>>>> getting proper GCC/GCN target (not offloading) results for.
>>>>
>>>> Now looking at 'gcc.dg/vect/bb-slp-cond-1.c', which is reasonably simple,
>>>> and hopefully representative for other SLP execution test FAILs
>>>> (regressions compared to my earlier non-gfx1100 testing).
>>>>
>>>>       $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/
>>>>       source-gcc/gcc/testsuite/gcc.dg/vect/bb-slp-cond-1.c
>>>>       --sysroot=install/amdgcn-amdhsa -ftree-vectorize
>>>>       -fno-tree-loop-distribute-patterns -fno-vect-cost-model -fno-common
>>>>       -O2 -fdump-tree-slp-details -fdump-tree-vect-details -isystem
>>>>       build-gcc/amdgcn-amdhsa/gfx1100/newlib/targ-include -isystem
>>>>       source-gcc/newlib/libc/include
>>>>       -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
>>>>       -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -wrapper
>>>>       setarch,--addr-no-randomize -fdump-tree-all-all -fdump-ipa-all-all
>>>>       -fdump-rtl-all-all -save-temps -march=gfx1100
>>>>
>>>> The '-march=gfx1030' 'a-bb-slp-cond-1.s' is identical (apart from
>>>> 'TARGET_PACKED_WORK_ITEMS' in 'gcn_target_asm_function_prologue'), so I
>>>> suppose will also exhibit the same failure mode, once again?
>>>>
>>>> Compared to '-march=gfx90a', the differences begin in
>>>> 'a-bb-slp-cond-1.c.266r.expand' (only!), down to 'a-bb-slp-cond-1.s'.
>>>>
>>>> Changed like:
>>>>
>>>>       @@ -38,10 +38,10 @@ int main ()
>>>>        #pragma GCC novector
>>>>          for (i = 1; i < N; i++)
>>>>            if (a[i] != i%4 + 1)
>>>>       -      abort ();
>>>>       +      __builtin_printf("%d %d != %d\n", i, a[i], i%4 + 1);
>>>>        
>>>>          if (a[0] != 5)
>>>>       -    abort ();
>>>>       +    __builtin_printf("%d %d != %d\n", 0, a[0], 5);
>>>>
>>>> ..., we see:
>>>>
>>>>       $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
>>>>       40 5 != 1
>>>>       41 6 != 2
>>>>       42 7 != 3
>>>>       43 8 != 4
>>>>       44 5 != 1
>>>>       45 6 != 2
>>>>       46 7 != 3
>>>>       47 8 != 4
>>>>
>>>> '40..47' are the 'i = 10..11' in 'foo', and the expectation is
>>>> 'a[i * stride + 0..3] != 0'.  So, either some earlier iteration has
>>>> scribbled zero values over these (vector lane masking issue, perhaps?),
>>>> or some other code generation issue?
>>>
>>> So we're indeed BB vectorizing this to
>>>
>>>     _54 = MEM <vector(4) int> [(int *)_14];
>>>     vect_iftmp.12_56 = .VCOND (_54, { 0, 0, 0, 0 }, { 1, 2, 3, 4 }, { 5, 6,
>>> 7, 8 }, 115);
>>>     MEM <vector(4) int> [(int *)_14] = vect_iftmp.12_56;
>>>
>>> I don't understand the assembly very well but it might be that
>>> the mask computation for the .VCOND scribbles the mask used
>>> to constrain operation to 4 lanes?
>>>
>>> .L3:
>>>           s_mov_b64       exec, 15
>>>           v_add_co_u32    v4, s[22:23], s32, v3
>>>           v_mov_b32       v5, s33
>>>           v_add_co_ci_u32 v5, s[22:23], 0, v5, s[22:23]
>>>           flat_load_dword v7, v[4:5] offset:0
>>>           s_waitcnt       0
>>>           flat_load_dword v0, v[10:11] offset:0
>>>           s_waitcnt       0
>>>           flat_load_dword v6, v[8:9] offset:0
>>>           s_waitcnt       0
>>>           v_cmp_ne_u32    s[18:19], v7, 0
>>>           v_cndmask_b32   v0, v6, v0, s[18:19]
>>>           flat_store_dword        v[4:5], v0 offset:0
>>>           s_add_i32       s12, s12, 1
>>>           s_add_u32       s32, s32, s28
>>>           s_addc_u32      s33, s33, s29
>>>           s_cmp_lg_u32    s12, s13
>>>           s_cbranch_scc1  .L3
>>
>> This basic block has EXEC set to 15 (4 lanes) throughout. The mask for the
>> VCOND a.k.a. v_vndmask_b32 is in s[18:19]. Those things seem OK.
>>
>> I see the testcase avoids vec_extract V64SI to V4SI for gfx1100, even though
>> it would be a no-op conversion, because the general case requires a permute
>> instruction and named pattern insns can't have non-constant conditions. Is
>> vec_extract allowed to FAIL? That might give a better result in this case.

I found that vec_extract is not allowed to FAIL. I guess the only way to 
allow the no-op conversions is to implement manual fall-back code-gen 
for the broken cases.

>>
>> However, I must be doing something different because vect/bb-slp-cond-1.c
>> passes for me, on gfx1100.
> 
> I didn't try to run it - when doing make check-gcc fails to using
> gcn-run for test invocation, what's the trick to make it do that?

There's a config file for nvptx here: 
https://github.com/SourceryTools/nvptx-tools/blob/master/nvptx-none-run.exp

You can probably make the obvious adjustments. I think Thomas has a GCN 
version with a few more features.

I usually use the CodeSourcery magic stack of scripts for testing 
installed toolchains on remote devices, so I'm not too familiar with 
using Dejagnu directly.

Andrew
  
Thomas Schwinge Feb. 16, 2024, 1:53 p.m. UTC | #21
Hi!

On 2024-02-16T12:41:06+0000, Andrew Stubbs <ams@baylibre.com> wrote:
> On 16/02/2024 12:26, Richard Biener wrote:
>> On Fri, 16 Feb 2024, Andrew Stubbs wrote:
>>> On 16/02/2024 10:17, Richard Biener wrote:
>>>> On Fri, 16 Feb 2024, Thomas Schwinge wrote:
>>>>> On 2023-10-20T12:51:03+0100, Andrew Stubbs <ams@codesourcery.com> wrote:
>>>>>> I've committed this patch
>>>>>
>>>>> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
>>>>> "amdgcn: add -march=gfx1030 EXPERIMENTAL", which the later RDNA3/gfx1100
>>>>> support builds on top of, and that's what I'm currently working on
>>>>> getting proper GCC/GCN target (not offloading) results for.
>>>>>
>>>>> Now looking at 'gcc.dg/vect/bb-slp-cond-1.c', which is reasonably simple,
>>>>> and hopefully representative for other SLP execution test FAILs
>>>>> (regressions compared to my earlier non-gfx1100 testing).
>>>>>
>>>>>       $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/
>>>>>       source-gcc/gcc/testsuite/gcc.dg/vect/bb-slp-cond-1.c
>>>>>       --sysroot=install/amdgcn-amdhsa -ftree-vectorize
>>>>>       -fno-tree-loop-distribute-patterns -fno-vect-cost-model -fno-common
>>>>>       -O2 -fdump-tree-slp-details -fdump-tree-vect-details -isystem
>>>>>       build-gcc/amdgcn-amdhsa/gfx1100/newlib/targ-include -isystem
>>>>>       source-gcc/newlib/libc/include
>>>>>       -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
>>>>>       -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -wrapper
>>>>>       setarch,--addr-no-randomize -fdump-tree-all-all -fdump-ipa-all-all
>>>>>       -fdump-rtl-all-all -save-temps -march=gfx1100
>>>>>
>>>>> The '-march=gfx1030' 'a-bb-slp-cond-1.s' is identical (apart from
>>>>> 'TARGET_PACKED_WORK_ITEMS' in 'gcn_target_asm_function_prologue'), so I
>>>>> suppose will also exhibit the same failure mode, once again?
>>>>>
>>>>> Compared to '-march=gfx90a', the differences begin in
>>>>> 'a-bb-slp-cond-1.c.266r.expand' (only!), down to 'a-bb-slp-cond-1.s'.
>>>>>
>>>>> Changed like:
>>>>>
>>>>>       @@ -38,10 +38,10 @@ int main ()
>>>>>        #pragma GCC novector
>>>>>          for (i = 1; i < N; i++)
>>>>>            if (a[i] != i%4 + 1)
>>>>>       -      abort ();
>>>>>       +      __builtin_printf("%d %d != %d\n", i, a[i], i%4 + 1);
>>>>>        
>>>>>          if (a[0] != 5)
>>>>>       -    abort ();
>>>>>       +    __builtin_printf("%d %d != %d\n", 0, a[0], 5);
>>>>>
>>>>> ..., we see:
>>>>>
>>>>>       $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
>>>>>       40 5 != 1
>>>>>       41 6 != 2
>>>>>       42 7 != 3
>>>>>       43 8 != 4
>>>>>       44 5 != 1
>>>>>       45 6 != 2
>>>>>       46 7 != 3
>>>>>       47 8 != 4
>>>>>
>>>>> '40..47' are the 'i = 10..11' in 'foo', and the expectation is
>>>>> 'a[i * stride + 0..3] != 0'.  So, either some earlier iteration has
>>>>> scribbled zero values over these (vector lane masking issue, perhaps?),
>>>>> or some other code generation issue?

>>> [...], I must be doing something different because vect/bb-slp-cond-1.c
>>> passes for me, on gfx1100.

That's strange.  I've looked at your log file (looks good), and used your
toolchain to compile, and your 'gcn-run' to invoke, and still do get:

    $ flock /tmp/gcn.lock ~/gcn-run ~/bb-slp-cond-1.exe
    GCN Kernel Aborted
    Kernel aborted

Andrew, later on, please try what happens when you put an unconditional
'abort' call into a test case?

>> I didn't try to run it - when doing make check-gcc fails to using
>> gcn-run for test invocation

Note, that for such individual test cases, invoking the compiler and then
'gcn-run' manually would seem easiest?

>> what's the trick to make it do that?

I tell you've probably not done much "embedded" or simulator testing of
GCC targets?  ;-P

> There's a config file for nvptx here: 
> https://github.com/SourceryTools/nvptx-tools/blob/master/nvptx-none-run.exp

Yes, and I have pending some updates to that one, to be finished once
I've generally got my testing set up again, to a sufficient degree...

> You can probably make the obvious adjustments. I think Thomas has a GCN 
> version with a few more features.

Right.  I'm attaching my current 'amdgcn-amdhsa-run.exp'.

I'm aware that the 'set_board_info gcc,[...] [...]' may be obsolete/wrong
(as Andrew also noted privately) -- likewise, at least in part, for
GCC/nvptx, which is where I copied all that from.  (Will revise later;
not relevant for this discussion, here.)

Similar to what I've recently added to libgomp, there is 'flock'ing here,
so that you may use 'make -j[...] check' for (partial) parallelism, but
still all execution testing runs serialized.  I found this to greatly
help denoise the test results.  (Not ideal, of course, but improving that
is for later, too.)

You may want to disable the 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' thing if
that doesn't work like that in your case.  (I've no idea what
'amdgpu_gpu_recover' would do if the GPU is also used for display.)  But
this, again, greatly helps denoise test results, at least for the one
system I'm currently testing on.

I intend to publish proper documentation of all this, later on -- happy
to answer any questions in the mean time.

If you don't already have a common directory for DejaGnu board files, put
'amdgcn-amdhsa-run.exp' into '~/tmp/amdgcn-amdhsa/', for example, and add
a 'dejagnu.exp' file next to it:

    lappend boards_dir ~/tmp/amdgcn-amdhsa

Prepare:

    $ DEJAGNU=$HOME/tmp/amdgcn-amdhsa/dejagnu.exp
    $ export DEJAGNU
    $ AMDGCN_AMDHSA_RUN=[...]/build-gcc/gcc/gcn-run
    $ export AMDGCN_AMDHSA_RUN
    $ # If necessary:
    $ AMDGCN_AMDHSA_LD_LIBRARY_PATH=/opt/rocm/lib
    $ LD_LIBRARY_PATH=$AMDGCN_AMDHSA_LD_LIBRARY_PATH${LD_LIBRARY_PATH+:$LD_LIBRARY_PATH}
    $ export LD_LIBRARY_PATH

..., and then run:

    $ make -j8 check-gcc-c RUNTESTFLAGS='--target_board=amdgcn-amdhsa-run/-march=gfx1030 vect.exp'

Oh, and I saw that on <https://gcc.gnu.org/wiki/Offloading>, Tobias has
recently put into a new "Using the GPU as stand-alone system" section
some similar information.  (..., but this should, in my opinion, be on a
different page, as it's explicitly *not* about what we understand as
offloading.)

> I usually use the CodeSourcery magic stack of scripts for testing 
> installed toolchains on remote devices, so I'm not too familiar with 
> using Dejagnu directly.

Tsk...  ;'-|


Grüße
 Thomas
# DejaGnu board file for amdgcn-amdhsa.

set_board_info target_install {amdgcn-amdhsa}

load_generic_config "sim"

if { [info exists env(AMDGCN_AMDHSA_LOCK_FILE)] } then {
    set_board_info sim,lock_file "$env(AMDGCN_AMDHSA_LOCK_FILE)"
} else {
    #TODO What's a good default filename?
    set_board_info sim,lock_file "/tmp/gcn.lock"
}

if { [info exists env(AMDGCN_AMDHSA_RUN)] } then {
    set_board_info sim "$env(AMDGCN_AMDHSA_RUN)"
} else {
    set_board_info sim "gcn-run"
}

# This isn't a simulator, but rather a "launcher".
unset_board_info is_simulator
unset_board_info slow_simulator

process_multilib_options ""

set_board_info gcc,stack_size 8192
set_board_info gcc,no_trampolines 1
set_board_info gcc,no_label_values 1
set_board_info gcc,signal_suppress 1

set_board_info compiler "[find_gcc]"
set_board_info cflags "[newlib_include_flags]"
set_board_info ldflags "[newlib_link_flags]"
set_board_info ldscript ""

#TODO Work around <http://mid.mail-archive.com/B457CE4A2BB446B7930A9BA1E38DBCCC@pleaset> 'ERROR: (DejaGnu) proc "::tcl::tm::UnknownHandler {::tcl::MacOSXPkgUnknown ::tclPkgUnknown} msgcat 1.4" does not exist.'...
# Otherwise, our use of 'clock format' may cause spurious errors such as:
#     ERROR: gcc.c-torture/compile/pr44686.c   -O0 : unknown dg option: ::tcl::tm::UnknownHandler ::tclPkgUnknown msgcat 1.4 for " dg-require-profiling 1 "-fprofile-generate" "
# ..., and all testing thus breaking apart.
set dummy [clock format [clock seconds]]
unset dummy

proc sim__open_lock_file { lock_file } {
    # Try to open the lock file for reading, so that this also works if
    # somebody else created the file.
    if [catch {open $lock_file r} result] {
	verbose -log "Couldn't open '$lock_file' for reading: $result"
	# Try to create the lock file.
	if [catch {open $lock_file a+} result] {
	    verbose -log "Couldn't create '$lock_file': $result"
	    # If this again failed, somebody else created it, concurrently.  If
	    # in the following we're now not able to open it for reading, we've
	    # got a fundamental problem, and let it fail.
	    set result [open $lock_file r]
	}
    }
    return $result
}

# The default 'sim_load' would eventually call into 'sim_spawn', 'sim_wait',
# but it's earlier here to just override the former one, and put safeguards
# into the latter two.

proc sim_spawn { dest cmdline args } {
    perror "TODO 'sim_spawn'"
    verbose -log "TODO 'sim_spawn'"
    return -1
}

proc sim_wait { dest timeout } {
    perror "TODO 'sim_wait'"
    verbose -log "TODO 'sim_wait'"
    return -1
}

proc sim_load { dest prog args } {
    set inpfile ""
    if { [llength $args] > 1 } {
	if { [lindex $args 1] != "" } {
	    set inpfile "[lindex $args 1]"
	}
    }

    # The launcher arguments are the program followed by the program arguments.
    set pargs [lindex $args 0]
    set largs [concat $prog $pargs]
    set args [lreplace $args 0 0 $largs]

    set launcher [board_info $dest sim]

    # To support parallel testing ('make -j[...] check') in light of flaky test
    # results for concurrent GPU usage, we'd like to serialize execution tests.
    set lock_file [board_info $dest sim,lock_file]
    if { $lock_file != "" } {
	set lock_fd [sim__open_lock_file $lock_file]
	set lock_clock_begin [clock seconds]
	exec flock 0 <@ $lock_fd
	set lock_clock_end [clock seconds]
	verbose -log "Got flock('$lock_file') at [clock format $lock_clock_end] after [expr $lock_clock_end - $lock_clock_begin] s" 2
    }

    # Note, not using 'remote_exec $dest' here.
    set result [eval [list remote_exec host $launcher] $args $inpfile]
    #TODO If we ran into 'HSA_STATUS_ERROR_OUT_OF_RESOURCES'...
    if { [lindex $result 0] != 0
	 && [string match "*HSA_STATUS_ERROR_OUT_OF_RESOURCES*" [lindex $result 1]] } {
	verbose -log "Trying to recover from 'HSA_STATUS_ERROR_OUT_OF_RESOURCES', and then re-execute."
	#TODO ..., reset the GPU....
	exec sudo cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover
	#TODO ..., and try again.
	set result [eval [list remote_exec host $launcher] $args $inpfile]
    }
    # We don't tell 'launcher' execution failure from 'prog' execution failure.
    # Maybe we should, or maybe it doesn't matter.  (When there's an error,
    # there's an error.)

    if { $lock_file != "" } {
	# Unlock (implicit with 'close').
	close $lock_fd
    }

    if { [lindex $result 0] == 0 } {
	return [list "pass" [lindex $result 1]]
    } else {
	return [list "fail" [lindex $result 1]]
    }
}

# <https://inbox.sourceware.org/1392398663.17835.120.camel@ubuntu-sellcey>
proc sim_exec { dest srcfile args } {
    perror "TODO 'sim_exec'"
    verbose -log "TODO 'sim_exec'"
    return -1
}
  
Thomas Schwinge Feb. 19, 2024, 10:38 a.m. UTC | #22
Hi!

On 2024-02-16T14:53:04+0100, I wrote:
> On 2024-02-16T12:41:06+0000, Andrew Stubbs <ams@baylibre.com> wrote:
>> On 16/02/2024 12:26, Richard Biener wrote:
>>> On Fri, 16 Feb 2024, Andrew Stubbs wrote:
>>>> On 16/02/2024 10:17, Richard Biener wrote:
>>>>> On Fri, 16 Feb 2024, Thomas Schwinge wrote:
>>>>>> On 2023-10-20T12:51:03+0100, Andrew Stubbs <ams@codesourcery.com> wrote:
>>>>>>> I've committed this patch
>>>>>>
>>>>>> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
>>>>>> "amdgcn: add -march=gfx1030 EXPERIMENTAL", which the later RDNA3/gfx1100
>>>>>> support builds on top of, and that's what I'm currently working on
>>>>>> getting proper GCC/GCN target (not offloading) results for.
>>>>>>
>>>>>> Now looking at 'gcc.dg/vect/bb-slp-cond-1.c', which is reasonably simple,
>>>>>> and hopefully representative for other SLP execution test FAILs
>>>>>> (regressions compared to my earlier non-gfx1100 testing).
>>>>>>
>>>>>>       $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/
>>>>>>       source-gcc/gcc/testsuite/gcc.dg/vect/bb-slp-cond-1.c
>>>>>>       --sysroot=install/amdgcn-amdhsa -ftree-vectorize
>>>>>>       -fno-tree-loop-distribute-patterns -fno-vect-cost-model -fno-common
>>>>>>       -O2 -fdump-tree-slp-details -fdump-tree-vect-details -isystem
>>>>>>       build-gcc/amdgcn-amdhsa/gfx1100/newlib/targ-include -isystem
>>>>>>       source-gcc/newlib/libc/include
>>>>>>       -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
>>>>>>       -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -wrapper
>>>>>>       setarch,--addr-no-randomize -fdump-tree-all-all -fdump-ipa-all-all
>>>>>>       -fdump-rtl-all-all -save-temps -march=gfx1100
>>>>>>
>>>>>> The '-march=gfx1030' 'a-bb-slp-cond-1.s' is identical (apart from
>>>>>> 'TARGET_PACKED_WORK_ITEMS' in 'gcn_target_asm_function_prologue'), so I
>>>>>> suppose will also exhibit the same failure mode, once again?
>>>>>>
>>>>>> Compared to '-march=gfx90a', the differences begin in
>>>>>> 'a-bb-slp-cond-1.c.266r.expand' (only!), down to 'a-bb-slp-cond-1.s'.
>>>>>>
>>>>>> Changed like:
>>>>>>
>>>>>>       @@ -38,10 +38,10 @@ int main ()
>>>>>>        #pragma GCC novector
>>>>>>          for (i = 1; i < N; i++)
>>>>>>            if (a[i] != i%4 + 1)
>>>>>>       -      abort ();
>>>>>>       +      __builtin_printf("%d %d != %d\n", i, a[i], i%4 + 1);
>>>>>>        
>>>>>>          if (a[0] != 5)
>>>>>>       -    abort ();
>>>>>>       +    __builtin_printf("%d %d != %d\n", 0, a[0], 5);
>>>>>>
>>>>>> ..., we see:
>>>>>>
>>>>>>       $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
>>>>>>       40 5 != 1
>>>>>>       41 6 != 2
>>>>>>       42 7 != 3
>>>>>>       43 8 != 4
>>>>>>       44 5 != 1
>>>>>>       45 6 != 2
>>>>>>       46 7 != 3
>>>>>>       47 8 != 4
>>>>>>
>>>>>> '40..47' are the 'i = 10..11' in 'foo', and the expectation is
>>>>>> 'a[i * stride + 0..3] != 0'.  So, either some earlier iteration has
>>>>>> scribbled zero values over these (vector lane masking issue, perhaps?),
>>>>>> or some other code generation issue?
>
>>>> [...], I must be doing something different because vect/bb-slp-cond-1.c
>>>> passes for me, on gfx1100.
>
> That's strange.  I've looked at your log file (looks good), and used your
> toolchain to compile, and your 'gcn-run' to invoke, and still do get:
>
>     $ flock /tmp/gcn.lock ~/gcn-run ~/bb-slp-cond-1.exe
>     GCN Kernel Aborted
>     Kernel aborted
>
> Andrew, later on, please try what happens when you put an unconditional
> 'abort' call into a test case?

Andrew, any luck with that yet?

Richard, are you able to reproduce the 'gcc.dg/vect/bb-slp-cond-1.c'
execution test failure mentioned above (manual compilation and
'gcn-run')?


Grüße
 Thomas


>>> I didn't try to run it - when doing make check-gcc fails to using
>>> gcn-run for test invocation
>
> Note, that for such individual test cases, invoking the compiler and then
> 'gcn-run' manually would seem easiest?
>
>>> what's the trick to make it do that?
>
> I tell you've probably not done much "embedded" or simulator testing of
> GCC targets?  ;-P
>
>> There's a config file for nvptx here: 
>> https://github.com/SourceryTools/nvptx-tools/blob/master/nvptx-none-run.exp
>
> Yes, and I have pending some updates to that one, to be finished once
> I've generally got my testing set up again, to a sufficient degree...
>
>> You can probably make the obvious adjustments. I think Thomas has a GCN 
>> version with a few more features.
>
> Right.  I'm attaching my current 'amdgcn-amdhsa-run.exp'.
>
> I'm aware that the 'set_board_info gcc,[...] [...]' may be obsolete/wrong
> (as Andrew also noted privately) -- likewise, at least in part, for
> GCC/nvptx, which is where I copied all that from.  (Will revise later;
> not relevant for this discussion, here.)
>
> Similar to what I've recently added to libgomp, there is 'flock'ing here,
> so that you may use 'make -j[...] check' for (partial) parallelism, but
> still all execution testing runs serialized.  I found this to greatly
> help denoise the test results.  (Not ideal, of course, but improving that
> is for later, too.)
>
> You may want to disable the 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' thing if
> that doesn't work like that in your case.  (I've no idea what
> 'amdgpu_gpu_recover' would do if the GPU is also used for display.)  But
> this, again, greatly helps denoise test results, at least for the one
> system I'm currently testing on.
>
> I intend to publish proper documentation of all this, later on -- happy
> to answer any questions in the mean time.
>
> If you don't already have a common directory for DejaGnu board files, put
> 'amdgcn-amdhsa-run.exp' into '~/tmp/amdgcn-amdhsa/', for example, and add
> a 'dejagnu.exp' file next to it:
>
>     lappend boards_dir ~/tmp/amdgcn-amdhsa
>
> Prepare:
>
>     $ DEJAGNU=$HOME/tmp/amdgcn-amdhsa/dejagnu.exp
>     $ export DEJAGNU
>     $ AMDGCN_AMDHSA_RUN=[...]/build-gcc/gcc/gcn-run
>     $ export AMDGCN_AMDHSA_RUN
>     $ # If necessary:
>     $ AMDGCN_AMDHSA_LD_LIBRARY_PATH=/opt/rocm/lib
>     $ LD_LIBRARY_PATH=$AMDGCN_AMDHSA_LD_LIBRARY_PATH${LD_LIBRARY_PATH+:$LD_LIBRARY_PATH}
>     $ export LD_LIBRARY_PATH
>
> ..., and then run:
>
>     $ make -j8 check-gcc-c RUNTESTFLAGS='--target_board=amdgcn-amdhsa-run/-march=gfx1030 vect.exp'
>
> Oh, and I saw that on <https://gcc.gnu.org/wiki/Offloading>, Tobias has
> recently put into a new "Using the GPU as stand-alone system" section
> some similar information.  (..., but this should, in my opinion, be on a
> different page, as it's explicitly *not* about what we understand as
> offloading.)
>
>> I usually use the CodeSourcery magic stack of scripts for testing 
>> installed toolchains on remote devices, so I'm not too familiar with 
>> using Dejagnu directly.
>
> Tsk...  ;'-|
>
>
> Grüße
>  Thomas
  
Richard Biener Feb. 19, 2024, 10:52 a.m. UTC | #23
On Mon, 19 Feb 2024, Thomas Schwinge wrote:

> Hi!
> 
> On 2024-02-16T14:53:04+0100, I wrote:
> > On 2024-02-16T12:41:06+0000, Andrew Stubbs <ams@baylibre.com> wrote:
> >> On 16/02/2024 12:26, Richard Biener wrote:
> >>> On Fri, 16 Feb 2024, Andrew Stubbs wrote:
> >>>> On 16/02/2024 10:17, Richard Biener wrote:
> >>>>> On Fri, 16 Feb 2024, Thomas Schwinge wrote:
> >>>>>> On 2023-10-20T12:51:03+0100, Andrew Stubbs <ams@codesourcery.com> wrote:
> >>>>>>> I've committed this patch
> >>>>>>
> >>>>>> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
> >>>>>> "amdgcn: add -march=gfx1030 EXPERIMENTAL", which the later RDNA3/gfx1100
> >>>>>> support builds on top of, and that's what I'm currently working on
> >>>>>> getting proper GCC/GCN target (not offloading) results for.
> >>>>>>
> >>>>>> Now looking at 'gcc.dg/vect/bb-slp-cond-1.c', which is reasonably simple,
> >>>>>> and hopefully representative for other SLP execution test FAILs
> >>>>>> (regressions compared to my earlier non-gfx1100 testing).
> >>>>>>
> >>>>>>       $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/
> >>>>>>       source-gcc/gcc/testsuite/gcc.dg/vect/bb-slp-cond-1.c
> >>>>>>       --sysroot=install/amdgcn-amdhsa -ftree-vectorize
> >>>>>>       -fno-tree-loop-distribute-patterns -fno-vect-cost-model -fno-common
> >>>>>>       -O2 -fdump-tree-slp-details -fdump-tree-vect-details -isystem
> >>>>>>       build-gcc/amdgcn-amdhsa/gfx1100/newlib/targ-include -isystem
> >>>>>>       source-gcc/newlib/libc/include
> >>>>>>       -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
> >>>>>>       -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -wrapper
> >>>>>>       setarch,--addr-no-randomize -fdump-tree-all-all -fdump-ipa-all-all
> >>>>>>       -fdump-rtl-all-all -save-temps -march=gfx1100
> >>>>>>
> >>>>>> The '-march=gfx1030' 'a-bb-slp-cond-1.s' is identical (apart from
> >>>>>> 'TARGET_PACKED_WORK_ITEMS' in 'gcn_target_asm_function_prologue'), so I
> >>>>>> suppose will also exhibit the same failure mode, once again?
> >>>>>>
> >>>>>> Compared to '-march=gfx90a', the differences begin in
> >>>>>> 'a-bb-slp-cond-1.c.266r.expand' (only!), down to 'a-bb-slp-cond-1.s'.
> >>>>>>
> >>>>>> Changed like:
> >>>>>>
> >>>>>>       @@ -38,10 +38,10 @@ int main ()
> >>>>>>        #pragma GCC novector
> >>>>>>          for (i = 1; i < N; i++)
> >>>>>>            if (a[i] != i%4 + 1)
> >>>>>>       -      abort ();
> >>>>>>       +      __builtin_printf("%d %d != %d\n", i, a[i], i%4 + 1);
> >>>>>>        
> >>>>>>          if (a[0] != 5)
> >>>>>>       -    abort ();
> >>>>>>       +    __builtin_printf("%d %d != %d\n", 0, a[0], 5);
> >>>>>>
> >>>>>> ..., we see:
> >>>>>>
> >>>>>>       $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
> >>>>>>       40 5 != 1
> >>>>>>       41 6 != 2
> >>>>>>       42 7 != 3
> >>>>>>       43 8 != 4
> >>>>>>       44 5 != 1
> >>>>>>       45 6 != 2
> >>>>>>       46 7 != 3
> >>>>>>       47 8 != 4
> >>>>>>
> >>>>>> '40..47' are the 'i = 10..11' in 'foo', and the expectation is
> >>>>>> 'a[i * stride + 0..3] != 0'.  So, either some earlier iteration has
> >>>>>> scribbled zero values over these (vector lane masking issue, perhaps?),
> >>>>>> or some other code generation issue?
> >
> >>>> [...], I must be doing something different because vect/bb-slp-cond-1.c
> >>>> passes for me, on gfx1100.
> >
> > That's strange.  I've looked at your log file (looks good), and used your
> > toolchain to compile, and your 'gcn-run' to invoke, and still do get:
> >
> >     $ flock /tmp/gcn.lock ~/gcn-run ~/bb-slp-cond-1.exe
> >     GCN Kernel Aborted
> >     Kernel aborted
> >
> > Andrew, later on, please try what happens when you put an unconditional
> > 'abort' call into a test case?
> 
> Andrew, any luck with that yet?
> 
> Richard, are you able to reproduce the 'gcc.dg/vect/bb-slp-cond-1.c'
> execution test failure mentioned above (manual compilation and
> 'gcn-run')?

No, when manually compiling/running the testcase it works fine for me.
Didn't yet get to try the .exp files

Richard.

> 
> Gr??e
>  Thomas
> 
> 
> >>> I didn't try to run it - when doing make check-gcc fails to using
> >>> gcn-run for test invocation
> >
> > Note, that for such individual test cases, invoking the compiler and then
> > 'gcn-run' manually would seem easiest?
> >
> >>> what's the trick to make it do that?
> >
> > I tell you've probably not done much "embedded" or simulator testing of
> > GCC targets?  ;-P
> >
> >> There's a config file for nvptx here: 
> >> https://github.com/SourceryTools/nvptx-tools/blob/master/nvptx-none-run.exp
> >
> > Yes, and I have pending some updates to that one, to be finished once
> > I've generally got my testing set up again, to a sufficient degree...
> >
> >> You can probably make the obvious adjustments. I think Thomas has a GCN 
> >> version with a few more features.
> >
> > Right.  I'm attaching my current 'amdgcn-amdhsa-run.exp'.
> >
> > I'm aware that the 'set_board_info gcc,[...] [...]' may be obsolete/wrong
> > (as Andrew also noted privately) -- likewise, at least in part, for
> > GCC/nvptx, which is where I copied all that from.  (Will revise later;
> > not relevant for this discussion, here.)
> >
> > Similar to what I've recently added to libgomp, there is 'flock'ing here,
> > so that you may use 'make -j[...] check' for (partial) parallelism, but
> > still all execution testing runs serialized.  I found this to greatly
> > help denoise the test results.  (Not ideal, of course, but improving that
> > is for later, too.)
> >
> > You may want to disable the 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' thing if
> > that doesn't work like that in your case.  (I've no idea what
> > 'amdgpu_gpu_recover' would do if the GPU is also used for display.)  But
> > this, again, greatly helps denoise test results, at least for the one
> > system I'm currently testing on.
> >
> > I intend to publish proper documentation of all this, later on -- happy
> > to answer any questions in the mean time.
> >
> > If you don't already have a common directory for DejaGnu board files, put
> > 'amdgcn-amdhsa-run.exp' into '~/tmp/amdgcn-amdhsa/', for example, and add
> > a 'dejagnu.exp' file next to it:
> >
> >     lappend boards_dir ~/tmp/amdgcn-amdhsa
> >
> > Prepare:
> >
> >     $ DEJAGNU=$HOME/tmp/amdgcn-amdhsa/dejagnu.exp
> >     $ export DEJAGNU
> >     $ AMDGCN_AMDHSA_RUN=[...]/build-gcc/gcc/gcn-run
> >     $ export AMDGCN_AMDHSA_RUN
> >     $ # If necessary:
> >     $ AMDGCN_AMDHSA_LD_LIBRARY_PATH=/opt/rocm/lib
> >     $ LD_LIBRARY_PATH=$AMDGCN_AMDHSA_LD_LIBRARY_PATH${LD_LIBRARY_PATH+:$LD_LIBRARY_PATH}
> >     $ export LD_LIBRARY_PATH
> >
> > ..., and then run:
> >
> >     $ make -j8 check-gcc-c RUNTESTFLAGS='--target_board=amdgcn-amdhsa-run/-march=gfx1030 vect.exp'
> >
> > Oh, and I saw that on <https://gcc.gnu.org/wiki/Offloading>, Tobias has
> > recently put into a new "Using the GPU as stand-alone system" section
> > some similar information.  (..., but this should, in my opinion, be on a
> > different page, as it's explicitly *not* about what we understand as
> > offloading.)
> >
> >> I usually use the CodeSourcery magic stack of scripts for testing 
> >> installed toolchains on remote devices, so I'm not too familiar with 
> >> using Dejagnu directly.
> >
> > Tsk...  ;'-|
> >
> >
> > Gr??e
> >  Thomas
>
  
Thomas Schwinge Feb. 19, 2024, 4:31 p.m. UTC | #24
Hi!

On 2024-02-19T11:52:55+0100, Richard Biener <rguenther@suse.de> wrote:
> On Mon, 19 Feb 2024, Thomas Schwinge wrote:
>> On 2024-02-16T14:53:04+0100, I wrote:
>> > On 2024-02-16T12:41:06+0000, Andrew Stubbs <ams@baylibre.com> wrote:
>> >> On 16/02/2024 12:26, Richard Biener wrote:
>> >>> On Fri, 16 Feb 2024, Andrew Stubbs wrote:
>> >>>> On 16/02/2024 10:17, Richard Biener wrote:
>> >>>>> On Fri, 16 Feb 2024, Thomas Schwinge wrote:
>> >>>>>> On 2023-10-20T12:51:03+0100, Andrew Stubbs <ams@codesourcery.com> wrote:
>> >>>>>>> I've committed this patch
>> >>>>>>
>> >>>>>> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
>> >>>>>> "amdgcn: add -march=gfx1030 EXPERIMENTAL", which the later RDNA3/gfx1100
>> >>>>>> support builds on top of, and that's what I'm currently working on
>> >>>>>> getting proper GCC/GCN target (not offloading) results for.
>> >>>>>>
>> >>>>>> Now looking at 'gcc.dg/vect/bb-slp-cond-1.c', which is reasonably simple,
>> >>>>>> and hopefully representative for other SLP execution test FAILs
>> >>>>>> (regressions compared to my earlier non-gfx1100 testing).
>> >>>>>>
>> >>>>>>       $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/
>> >>>>>>       source-gcc/gcc/testsuite/gcc.dg/vect/bb-slp-cond-1.c
>> >>>>>>       --sysroot=install/amdgcn-amdhsa -ftree-vectorize
>> >>>>>>       -fno-tree-loop-distribute-patterns -fno-vect-cost-model -fno-common
>> >>>>>>       -O2 -fdump-tree-slp-details -fdump-tree-vect-details -isystem
>> >>>>>>       build-gcc/amdgcn-amdhsa/gfx1100/newlib/targ-include -isystem
>> >>>>>>       source-gcc/newlib/libc/include
>> >>>>>>       -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
>> >>>>>>       -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -wrapper
>> >>>>>>       setarch,--addr-no-randomize -fdump-tree-all-all -fdump-ipa-all-all
>> >>>>>>       -fdump-rtl-all-all -save-temps -march=gfx1100
>> >>>>>>
>> >>>>>> The '-march=gfx1030' 'a-bb-slp-cond-1.s' is identical (apart from
>> >>>>>> 'TARGET_PACKED_WORK_ITEMS' in 'gcn_target_asm_function_prologue'), so I
>> >>>>>> suppose will also exhibit the same failure mode, once again?
>> >>>>>>
>> >>>>>> Compared to '-march=gfx90a', the differences begin in
>> >>>>>> 'a-bb-slp-cond-1.c.266r.expand' (only!), down to 'a-bb-slp-cond-1.s'.
>> >>>>>>
>> >>>>>> Changed like:
>> >>>>>>
>> >>>>>>       @@ -38,10 +38,10 @@ int main ()
>> >>>>>>        #pragma GCC novector
>> >>>>>>          for (i = 1; i < N; i++)
>> >>>>>>            if (a[i] != i%4 + 1)
>> >>>>>>       -      abort ();
>> >>>>>>       +      __builtin_printf("%d %d != %d\n", i, a[i], i%4 + 1);
>> >>>>>>        
>> >>>>>>          if (a[0] != 5)
>> >>>>>>       -    abort ();
>> >>>>>>       +    __builtin_printf("%d %d != %d\n", 0, a[0], 5);
>> >>>>>>
>> >>>>>> ..., we see:
>> >>>>>>
>> >>>>>>       $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
>> >>>>>>       40 5 != 1
>> >>>>>>       41 6 != 2
>> >>>>>>       42 7 != 3
>> >>>>>>       43 8 != 4
>> >>>>>>       44 5 != 1
>> >>>>>>       45 6 != 2
>> >>>>>>       46 7 != 3
>> >>>>>>       47 8 != 4
>> >>>>>>
>> >>>>>> '40..47' are the 'i = 10..11' in 'foo', and the expectation is
>> >>>>>> 'a[i * stride + 0..3] != 0'.  So, either some earlier iteration has
>> >>>>>> scribbled zero values over these (vector lane masking issue, perhaps?),
>> >>>>>> or some other code generation issue?
>> >
>> >>>> [...], I must be doing something different because vect/bb-slp-cond-1.c
>> >>>> passes for me, on gfx1100.
>> >
>> > That's strange.  I've looked at your log file (looks good), and used your
>> > toolchain to compile, and your 'gcn-run' to invoke, and still do get:
>> >
>> >     $ flock /tmp/gcn.lock ~/gcn-run ~/bb-slp-cond-1.exe
>> >     GCN Kernel Aborted
>> >     Kernel aborted
>> >
>> > Andrew, later on, please try what happens when you put an unconditional
>> > 'abort' call into a test case?
>> 
>> Andrew, any luck with that yet?
>> 
>> Richard, are you able to reproduce the 'gcc.dg/vect/bb-slp-cond-1.c'
>> execution test failure mentioned above (manual compilation and
>> 'gcn-run')?
>
> No, when manually compiling/running the testcase it works fine for me.

I've updated my GCC master branch sources, but it still fails for me:

    $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ source-gcc/gcc/testsuite/gcc.dg/vect/bb-slp-cond-1.c --sysroot=install/amdgcn-amdhsa -isystem build-gcc/amdgcn-amdhsa/gfx1100/newlib/targ-include -isystem source-gcc/newlib/libc/include -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/ -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -march=gfx1100 -ftree-vectorize -fno-tree-loop-distribute-patterns -fno-vect-cost-model -fno-common -O2 -save-temps
    $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
    GCN Kernel Aborted
    Kernel aborted

Strange.

In 'bb-slp-cond-1.tar.xz' I'm attaching the files I've built.  Could you
please compare those to yours and try 'gcn-run gfx1030/a.out'?


Grüße
 Thomas


> Didn't yet get to try the .exp files
>
> Richard.
>
>> 
>> Gr??e
>>  Thomas
>> 
>> 
>> >>> I didn't try to run it - when doing make check-gcc fails to using
>> >>> gcn-run for test invocation
>> >
>> > Note, that for such individual test cases, invoking the compiler and then
>> > 'gcn-run' manually would seem easiest?
>> >
>> >>> what's the trick to make it do that?
>> >
>> > I tell you've probably not done much "embedded" or simulator testing of
>> > GCC targets?  ;-P
>> >
>> >> There's a config file for nvptx here: 
>> >> https://github.com/SourceryTools/nvptx-tools/blob/master/nvptx-none-run.exp
>> >
>> > Yes, and I have pending some updates to that one, to be finished once
>> > I've generally got my testing set up again, to a sufficient degree...
>> >
>> >> You can probably make the obvious adjustments. I think Thomas has a GCN 
>> >> version with a few more features.
>> >
>> > Right.  I'm attaching my current 'amdgcn-amdhsa-run.exp'.
>> >
>> > I'm aware that the 'set_board_info gcc,[...] [...]' may be obsolete/wrong
>> > (as Andrew also noted privately) -- likewise, at least in part, for
>> > GCC/nvptx, which is where I copied all that from.  (Will revise later;
>> > not relevant for this discussion, here.)
>> >
>> > Similar to what I've recently added to libgomp, there is 'flock'ing here,
>> > so that you may use 'make -j[...] check' for (partial) parallelism, but
>> > still all execution testing runs serialized.  I found this to greatly
>> > help denoise the test results.  (Not ideal, of course, but improving that
>> > is for later, too.)
>> >
>> > You may want to disable the 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' thing if
>> > that doesn't work like that in your case.  (I've no idea what
>> > 'amdgpu_gpu_recover' would do if the GPU is also used for display.)  But
>> > this, again, greatly helps denoise test results, at least for the one
>> > system I'm currently testing on.
>> >
>> > I intend to publish proper documentation of all this, later on -- happy
>> > to answer any questions in the mean time.
>> >
>> > If you don't already have a common directory for DejaGnu board files, put
>> > 'amdgcn-amdhsa-run.exp' into '~/tmp/amdgcn-amdhsa/', for example, and add
>> > a 'dejagnu.exp' file next to it:
>> >
>> >     lappend boards_dir ~/tmp/amdgcn-amdhsa
>> >
>> > Prepare:
>> >
>> >     $ DEJAGNU=$HOME/tmp/amdgcn-amdhsa/dejagnu.exp
>> >     $ export DEJAGNU
>> >     $ AMDGCN_AMDHSA_RUN=[...]/build-gcc/gcc/gcn-run
>> >     $ export AMDGCN_AMDHSA_RUN
>> >     $ # If necessary:
>> >     $ AMDGCN_AMDHSA_LD_LIBRARY_PATH=/opt/rocm/lib
>> >     $ LD_LIBRARY_PATH=$AMDGCN_AMDHSA_LD_LIBRARY_PATH${LD_LIBRARY_PATH+:$LD_LIBRARY_PATH}
>> >     $ export LD_LIBRARY_PATH
>> >
>> > ..., and then run:
>> >
>> >     $ make -j8 check-gcc-c RUNTESTFLAGS='--target_board=amdgcn-amdhsa-run/-march=gfx1030 vect.exp'
>> >
>> > Oh, and I saw that on <https://gcc.gnu.org/wiki/Offloading>, Tobias has
>> > recently put into a new "Using the GPU as stand-alone system" section
>> > some similar information.  (..., but this should, in my opinion, be on a
>> > different page, as it's explicitly *not* about what we understand as
>> > offloading.)
>> >
>> >> I usually use the CodeSourcery magic stack of scripts for testing 
>> >> installed toolchains on remote devices, so I'm not too familiar with 
>> >> using Dejagnu directly.
>> >
>> > Tsk...  ;'-|
>> >
>> >
>> > Gr??e
>> >  Thomas
>> 
>
> -- 
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
  
Thomas Schwinge Feb. 19, 2024, 4:35 p.m. UTC | #25
Hi!

On 2024-02-19T17:31:20+0100, I wrote:
> On 2024-02-19T11:52:55+0100, Richard Biener <rguenther@suse.de> wrote:
>> On Mon, 19 Feb 2024, Thomas Schwinge wrote:
>>> On 2024-02-16T14:53:04+0100, I wrote:
>>> > On 2024-02-16T12:41:06+0000, Andrew Stubbs <ams@baylibre.com> wrote:
>>> >> On 16/02/2024 12:26, Richard Biener wrote:
>>> >>> On Fri, 16 Feb 2024, Andrew Stubbs wrote:
>>> >>>> On 16/02/2024 10:17, Richard Biener wrote:
>>> >>>>> On Fri, 16 Feb 2024, Thomas Schwinge wrote:
>>> >>>>>> On 2023-10-20T12:51:03+0100, Andrew Stubbs <ams@codesourcery.com> wrote:
>>> >>>>>>> I've committed this patch
>>> >>>>>>
>>> >>>>>> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
>>> >>>>>> "amdgcn: add -march=gfx1030 EXPERIMENTAL", which the later RDNA3/gfx1100
>>> >>>>>> support builds on top of, and that's what I'm currently working on
>>> >>>>>> getting proper GCC/GCN target (not offloading) results for.
>>> >>>>>>
>>> >>>>>> Now looking at 'gcc.dg/vect/bb-slp-cond-1.c', which is reasonably simple,
>>> >>>>>> and hopefully representative for other SLP execution test FAILs
>>> >>>>>> (regressions compared to my earlier non-gfx1100 testing).
>>> >>>>>>
>>> >>>>>>       $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/
>>> >>>>>>       source-gcc/gcc/testsuite/gcc.dg/vect/bb-slp-cond-1.c
>>> >>>>>>       --sysroot=install/amdgcn-amdhsa -ftree-vectorize
>>> >>>>>>       -fno-tree-loop-distribute-patterns -fno-vect-cost-model -fno-common
>>> >>>>>>       -O2 -fdump-tree-slp-details -fdump-tree-vect-details -isystem
>>> >>>>>>       build-gcc/amdgcn-amdhsa/gfx1100/newlib/targ-include -isystem
>>> >>>>>>       source-gcc/newlib/libc/include
>>> >>>>>>       -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
>>> >>>>>>       -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -wrapper
>>> >>>>>>       setarch,--addr-no-randomize -fdump-tree-all-all -fdump-ipa-all-all
>>> >>>>>>       -fdump-rtl-all-all -save-temps -march=gfx1100
>>> >>>>>>
>>> >>>>>> The '-march=gfx1030' 'a-bb-slp-cond-1.s' is identical (apart from
>>> >>>>>> 'TARGET_PACKED_WORK_ITEMS' in 'gcn_target_asm_function_prologue'), so I
>>> >>>>>> suppose will also exhibit the same failure mode, once again?
>>> >>>>>>
>>> >>>>>> Compared to '-march=gfx90a', the differences begin in
>>> >>>>>> 'a-bb-slp-cond-1.c.266r.expand' (only!), down to 'a-bb-slp-cond-1.s'.
>>> >>>>>>
>>> >>>>>> Changed like:
>>> >>>>>>
>>> >>>>>>       @@ -38,10 +38,10 @@ int main ()
>>> >>>>>>        #pragma GCC novector
>>> >>>>>>          for (i = 1; i < N; i++)
>>> >>>>>>            if (a[i] != i%4 + 1)
>>> >>>>>>       -      abort ();
>>> >>>>>>       +      __builtin_printf("%d %d != %d\n", i, a[i], i%4 + 1);
>>> >>>>>>        
>>> >>>>>>          if (a[0] != 5)
>>> >>>>>>       -    abort ();
>>> >>>>>>       +    __builtin_printf("%d %d != %d\n", 0, a[0], 5);
>>> >>>>>>
>>> >>>>>> ..., we see:
>>> >>>>>>
>>> >>>>>>       $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
>>> >>>>>>       40 5 != 1
>>> >>>>>>       41 6 != 2
>>> >>>>>>       42 7 != 3
>>> >>>>>>       43 8 != 4
>>> >>>>>>       44 5 != 1
>>> >>>>>>       45 6 != 2
>>> >>>>>>       46 7 != 3
>>> >>>>>>       47 8 != 4
>>> >>>>>>
>>> >>>>>> '40..47' are the 'i = 10..11' in 'foo', and the expectation is
>>> >>>>>> 'a[i * stride + 0..3] != 0'.  So, either some earlier iteration has
>>> >>>>>> scribbled zero values over these (vector lane masking issue, perhaps?),
>>> >>>>>> or some other code generation issue?
>>> >
>>> >>>> [...], I must be doing something different because vect/bb-slp-cond-1.c
>>> >>>> passes for me, on gfx1100.
>>> >
>>> > That's strange.  I've looked at your log file (looks good), and used your
>>> > toolchain to compile, and your 'gcn-run' to invoke, and still do get:
>>> >
>>> >     $ flock /tmp/gcn.lock ~/gcn-run ~/bb-slp-cond-1.exe
>>> >     GCN Kernel Aborted
>>> >     Kernel aborted
>>> >
>>> > Andrew, later on, please try what happens when you put an unconditional
>>> > 'abort' call into a test case?
>>> 
>>> Andrew, any luck with that yet?
>>> 
>>> Richard, are you able to reproduce the 'gcc.dg/vect/bb-slp-cond-1.c'
>>> execution test failure mentioned above (manual compilation and
>>> 'gcn-run')?
>>
>> No, when manually compiling/running the testcase it works fine for me.
>
> I've updated my GCC master branch sources, but it still fails for me:
>
>     $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ source-gcc/gcc/testsuite/gcc.dg/vect/bb-slp-cond-1.c --sysroot=install/amdgcn-amdhsa -isystem build-gcc/amdgcn-amdhsa/gfx1100/newlib/targ-include -isystem source-gcc/newlib/libc/include -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/ -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -march=gfx1100 -ftree-vectorize -fno-tree-loop-distribute-patterns -fno-vect-cost-model -fno-common -O2 -save-temps
>     $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
>     GCN Kernel Aborted
>     Kernel aborted
>
> Strange.
>
> In 'bb-slp-cond-1.tar.xz' I'm attaching the files I've built.  Could you
> please compare those to yours and try 'gcn-run gfx1030/a.out'?

Actually: 'gcn-run gfx1030/a.out' a few times -- our dear friend
Nondeterminism seems to be at play here...  :-|


Grüße
 Thomas


>> Didn't yet get to try the .exp files
>>
>> Richard.
>>
>>> 
>>> Gr??e
>>>  Thomas
>>> 
>>> 
>>> >>> I didn't try to run it - when doing make check-gcc fails to using
>>> >>> gcn-run for test invocation
>>> >
>>> > Note, that for such individual test cases, invoking the compiler and then
>>> > 'gcn-run' manually would seem easiest?
>>> >
>>> >>> what's the trick to make it do that?
>>> >
>>> > I tell you've probably not done much "embedded" or simulator testing of
>>> > GCC targets?  ;-P
>>> >
>>> >> There's a config file for nvptx here: 
>>> >> https://github.com/SourceryTools/nvptx-tools/blob/master/nvptx-none-run.exp
>>> >
>>> > Yes, and I have pending some updates to that one, to be finished once
>>> > I've generally got my testing set up again, to a sufficient degree...
>>> >
>>> >> You can probably make the obvious adjustments. I think Thomas has a GCN 
>>> >> version with a few more features.
>>> >
>>> > Right.  I'm attaching my current 'amdgcn-amdhsa-run.exp'.
>>> >
>>> > I'm aware that the 'set_board_info gcc,[...] [...]' may be obsolete/wrong
>>> > (as Andrew also noted privately) -- likewise, at least in part, for
>>> > GCC/nvptx, which is where I copied all that from.  (Will revise later;
>>> > not relevant for this discussion, here.)
>>> >
>>> > Similar to what I've recently added to libgomp, there is 'flock'ing here,
>>> > so that you may use 'make -j[...] check' for (partial) parallelism, but
>>> > still all execution testing runs serialized.  I found this to greatly
>>> > help denoise the test results.  (Not ideal, of course, but improving that
>>> > is for later, too.)
>>> >
>>> > You may want to disable the 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' thing if
>>> > that doesn't work like that in your case.  (I've no idea what
>>> > 'amdgpu_gpu_recover' would do if the GPU is also used for display.)  But
>>> > this, again, greatly helps denoise test results, at least for the one
>>> > system I'm currently testing on.
>>> >
>>> > I intend to publish proper documentation of all this, later on -- happy
>>> > to answer any questions in the mean time.
>>> >
>>> > If you don't already have a common directory for DejaGnu board files, put
>>> > 'amdgcn-amdhsa-run.exp' into '~/tmp/amdgcn-amdhsa/', for example, and add
>>> > a 'dejagnu.exp' file next to it:
>>> >
>>> >     lappend boards_dir ~/tmp/amdgcn-amdhsa
>>> >
>>> > Prepare:
>>> >
>>> >     $ DEJAGNU=$HOME/tmp/amdgcn-amdhsa/dejagnu.exp
>>> >     $ export DEJAGNU
>>> >     $ AMDGCN_AMDHSA_RUN=[...]/build-gcc/gcc/gcn-run
>>> >     $ export AMDGCN_AMDHSA_RUN
>>> >     $ # If necessary:
>>> >     $ AMDGCN_AMDHSA_LD_LIBRARY_PATH=/opt/rocm/lib
>>> >     $ LD_LIBRARY_PATH=$AMDGCN_AMDHSA_LD_LIBRARY_PATH${LD_LIBRARY_PATH+:$LD_LIBRARY_PATH}
>>> >     $ export LD_LIBRARY_PATH
>>> >
>>> > ..., and then run:
>>> >
>>> >     $ make -j8 check-gcc-c RUNTESTFLAGS='--target_board=amdgcn-amdhsa-run/-march=gfx1030 vect.exp'
>>> >
>>> > Oh, and I saw that on <https://gcc.gnu.org/wiki/Offloading>, Tobias has
>>> > recently put into a new "Using the GPU as stand-alone system" section
>>> > some similar information.  (..., but this should, in my opinion, be on a
>>> > different page, as it's explicitly *not* about what we understand as
>>> > offloading.)
>>> >
>>> >> I usually use the CodeSourcery magic stack of scripts for testing 
>>> >> installed toolchains on remote devices, so I'm not too familiar with 
>>> >> using Dejagnu directly.
>>> >
>>> > Tsk...  ;'-|
>>> >
>>> >
>>> > Gr??e
>>> >  Thomas
>>> 
>>
>> -- 
>> Richard Biener <rguenther@suse.de>
>> SUSE Software Solutions Germany GmbH,
>> Frankenstrasse 146, 90461 Nuernberg, Germany;
>> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
  
Richard Biener Feb. 20, 2024, 7:44 a.m. UTC | #26
On Mon, 19 Feb 2024, Thomas Schwinge wrote:

> Hi!
> 
> On 2024-02-19T17:31:20+0100, I wrote:
> > On 2024-02-19T11:52:55+0100, Richard Biener <rguenther@suse.de> wrote:
> >> On Mon, 19 Feb 2024, Thomas Schwinge wrote:
> >>> On 2024-02-16T14:53:04+0100, I wrote:
> >>> > On 2024-02-16T12:41:06+0000, Andrew Stubbs <ams@baylibre.com> wrote:
> >>> >> On 16/02/2024 12:26, Richard Biener wrote:
> >>> >>> On Fri, 16 Feb 2024, Andrew Stubbs wrote:
> >>> >>>> On 16/02/2024 10:17, Richard Biener wrote:
> >>> >>>>> On Fri, 16 Feb 2024, Thomas Schwinge wrote:
> >>> >>>>>> On 2023-10-20T12:51:03+0100, Andrew Stubbs <ams@codesourcery.com> wrote:
> >>> >>>>>>> I've committed this patch
> >>> >>>>>>
> >>> >>>>>> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
> >>> >>>>>> "amdgcn: add -march=gfx1030 EXPERIMENTAL", which the later RDNA3/gfx1100
> >>> >>>>>> support builds on top of, and that's what I'm currently working on
> >>> >>>>>> getting proper GCC/GCN target (not offloading) results for.
> >>> >>>>>>
> >>> >>>>>> Now looking at 'gcc.dg/vect/bb-slp-cond-1.c', which is reasonably simple,
> >>> >>>>>> and hopefully representative for other SLP execution test FAILs
> >>> >>>>>> (regressions compared to my earlier non-gfx1100 testing).
> >>> >>>>>>
> >>> >>>>>>       $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/
> >>> >>>>>>       source-gcc/gcc/testsuite/gcc.dg/vect/bb-slp-cond-1.c
> >>> >>>>>>       --sysroot=install/amdgcn-amdhsa -ftree-vectorize
> >>> >>>>>>       -fno-tree-loop-distribute-patterns -fno-vect-cost-model -fno-common
> >>> >>>>>>       -O2 -fdump-tree-slp-details -fdump-tree-vect-details -isystem
> >>> >>>>>>       build-gcc/amdgcn-amdhsa/gfx1100/newlib/targ-include -isystem
> >>> >>>>>>       source-gcc/newlib/libc/include
> >>> >>>>>>       -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
> >>> >>>>>>       -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -wrapper
> >>> >>>>>>       setarch,--addr-no-randomize -fdump-tree-all-all -fdump-ipa-all-all
> >>> >>>>>>       -fdump-rtl-all-all -save-temps -march=gfx1100
> >>> >>>>>>
> >>> >>>>>> The '-march=gfx1030' 'a-bb-slp-cond-1.s' is identical (apart from
> >>> >>>>>> 'TARGET_PACKED_WORK_ITEMS' in 'gcn_target_asm_function_prologue'), so I
> >>> >>>>>> suppose will also exhibit the same failure mode, once again?
> >>> >>>>>>
> >>> >>>>>> Compared to '-march=gfx90a', the differences begin in
> >>> >>>>>> 'a-bb-slp-cond-1.c.266r.expand' (only!), down to 'a-bb-slp-cond-1.s'.
> >>> >>>>>>
> >>> >>>>>> Changed like:
> >>> >>>>>>
> >>> >>>>>>       @@ -38,10 +38,10 @@ int main ()
> >>> >>>>>>        #pragma GCC novector
> >>> >>>>>>          for (i = 1; i < N; i++)
> >>> >>>>>>            if (a[i] != i%4 + 1)
> >>> >>>>>>       -      abort ();
> >>> >>>>>>       +      __builtin_printf("%d %d != %d\n", i, a[i], i%4 + 1);
> >>> >>>>>>        
> >>> >>>>>>          if (a[0] != 5)
> >>> >>>>>>       -    abort ();
> >>> >>>>>>       +    __builtin_printf("%d %d != %d\n", 0, a[0], 5);
> >>> >>>>>>
> >>> >>>>>> ..., we see:
> >>> >>>>>>
> >>> >>>>>>       $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
> >>> >>>>>>       40 5 != 1
> >>> >>>>>>       41 6 != 2
> >>> >>>>>>       42 7 != 3
> >>> >>>>>>       43 8 != 4
> >>> >>>>>>       44 5 != 1
> >>> >>>>>>       45 6 != 2
> >>> >>>>>>       46 7 != 3
> >>> >>>>>>       47 8 != 4
> >>> >>>>>>
> >>> >>>>>> '40..47' are the 'i = 10..11' in 'foo', and the expectation is
> >>> >>>>>> 'a[i * stride + 0..3] != 0'.  So, either some earlier iteration has
> >>> >>>>>> scribbled zero values over these (vector lane masking issue, perhaps?),
> >>> >>>>>> or some other code generation issue?
> >>> >
> >>> >>>> [...], I must be doing something different because vect/bb-slp-cond-1.c
> >>> >>>> passes for me, on gfx1100.
> >>> >
> >>> > That's strange.  I've looked at your log file (looks good), and used your
> >>> > toolchain to compile, and your 'gcn-run' to invoke, and still do get:
> >>> >
> >>> >     $ flock /tmp/gcn.lock ~/gcn-run ~/bb-slp-cond-1.exe
> >>> >     GCN Kernel Aborted
> >>> >     Kernel aborted
> >>> >
> >>> > Andrew, later on, please try what happens when you put an unconditional
> >>> > 'abort' call into a test case?
> >>> 
> >>> Andrew, any luck with that yet?
> >>> 
> >>> Richard, are you able to reproduce the 'gcc.dg/vect/bb-slp-cond-1.c'
> >>> execution test failure mentioned above (manual compilation and
> >>> 'gcn-run')?
> >>
> >> No, when manually compiling/running the testcase it works fine for me.
> >
> > I've updated my GCC master branch sources, but it still fails for me:
> >
> >     $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ source-gcc/gcc/testsuite/gcc.dg/vect/bb-slp-cond-1.c --sysroot=install/amdgcn-amdhsa -isystem build-gcc/amdgcn-amdhsa/gfx1100/newlib/targ-include -isystem source-gcc/newlib/libc/include -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/ -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -march=gfx1100 -ftree-vectorize -fno-tree-loop-distribute-patterns -fno-vect-cost-model -fno-common -O2 -save-temps
> >     $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
> >     GCN Kernel Aborted
> >     Kernel aborted
> >
> > Strange.
> >
> > In 'bb-slp-cond-1.tar.xz' I'm attaching the files I've built.  Could you
> > please compare those to yours and try 'gcn-run gfx1030/a.out'?
> 
> Actually: 'gcn-run gfx1030/a.out' a few times -- our dear friend
> Nondeterminism seems to be at play here...  :-|

What's your set of compile options?  I don't manage to get close
to your gfx1030 assembly when using your preprocessed source ...

I've tried -march=gfx1030 -O[23] [-fno-vect-cost-model]

Looks like you use -fno-omit-frame-pointer but then I still see
-mine +yours

-       v_readlane_b32  s18, v4, 0
-       v_readlane_b32  s19, v5, 0
-       s_add_u32       s18, s18, s26
-       s_addc_u32      s19, s19, s27
-       v_writelane_b32 v4, s18, 0
-       v_writelane_b32 v5, s19, 0
-       s_mov_b32       s18, s14
-       s_mov_b32       s19, s15
-       s_mov_b32       s22, scc
-       s_add_u32       s18, s18, 4096
-       s_addc_u32      s19, s19, 0
-       s_cmpk_lg_u32   s22, 0
-       v_writelane_b32 v6, s18, 0
-       v_writelane_b32 v7, s19, 0
-       flat_store_dwordx2      v[6:7], v[4:5]
+       v_writelane_b32 v6, s26, 0
+       v_writelane_b32 v7, s27, 0
+       v_add_co_u32    v4, vcc, v6, v4
+       v_add_co_ci_u32 v5, vcc, v7, v5, vcc

and more changes.

Richard.

> 
> Gr??e
>  Thomas
> 
> 
> >> Didn't yet get to try the .exp files
> >>
> >> Richard.
> >>
> >>> 
> >>> Gr??e
> >>>  Thomas
> >>> 
> >>> 
> >>> >>> I didn't try to run it - when doing make check-gcc fails to using
> >>> >>> gcn-run for test invocation
> >>> >
> >>> > Note, that for such individual test cases, invoking the compiler and then
> >>> > 'gcn-run' manually would seem easiest?
> >>> >
> >>> >>> what's the trick to make it do that?
> >>> >
> >>> > I tell you've probably not done much "embedded" or simulator testing of
> >>> > GCC targets?  ;-P
> >>> >
> >>> >> There's a config file for nvptx here: 
> >>> >> https://github.com/SourceryTools/nvptx-tools/blob/master/nvptx-none-run.exp
> >>> >
> >>> > Yes, and I have pending some updates to that one, to be finished once
> >>> > I've generally got my testing set up again, to a sufficient degree...
> >>> >
> >>> >> You can probably make the obvious adjustments. I think Thomas has a GCN 
> >>> >> version with a few more features.
> >>> >
> >>> > Right.  I'm attaching my current 'amdgcn-amdhsa-run.exp'.
> >>> >
> >>> > I'm aware that the 'set_board_info gcc,[...] [...]' may be obsolete/wrong
> >>> > (as Andrew also noted privately) -- likewise, at least in part, for
> >>> > GCC/nvptx, which is where I copied all that from.  (Will revise later;
> >>> > not relevant for this discussion, here.)
> >>> >
> >>> > Similar to what I've recently added to libgomp, there is 'flock'ing here,
> >>> > so that you may use 'make -j[...] check' for (partial) parallelism, but
> >>> > still all execution testing runs serialized.  I found this to greatly
> >>> > help denoise the test results.  (Not ideal, of course, but improving that
> >>> > is for later, too.)
> >>> >
> >>> > You may want to disable the 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' thing if
> >>> > that doesn't work like that in your case.  (I've no idea what
> >>> > 'amdgpu_gpu_recover' would do if the GPU is also used for display.)  But
> >>> > this, again, greatly helps denoise test results, at least for the one
> >>> > system I'm currently testing on.
> >>> >
> >>> > I intend to publish proper documentation of all this, later on -- happy
> >>> > to answer any questions in the mean time.
> >>> >
> >>> > If you don't already have a common directory for DejaGnu board files, put
> >>> > 'amdgcn-amdhsa-run.exp' into '~/tmp/amdgcn-amdhsa/', for example, and add
> >>> > a 'dejagnu.exp' file next to it:
> >>> >
> >>> >     lappend boards_dir ~/tmp/amdgcn-amdhsa
> >>> >
> >>> > Prepare:
> >>> >
> >>> >     $ DEJAGNU=$HOME/tmp/amdgcn-amdhsa/dejagnu.exp
> >>> >     $ export DEJAGNU
> >>> >     $ AMDGCN_AMDHSA_RUN=[...]/build-gcc/gcc/gcn-run
> >>> >     $ export AMDGCN_AMDHSA_RUN
> >>> >     $ # If necessary:
> >>> >     $ AMDGCN_AMDHSA_LD_LIBRARY_PATH=/opt/rocm/lib
> >>> >     $ LD_LIBRARY_PATH=$AMDGCN_AMDHSA_LD_LIBRARY_PATH${LD_LIBRARY_PATH+:$LD_LIBRARY_PATH}
> >>> >     $ export LD_LIBRARY_PATH
> >>> >
> >>> > ..., and then run:
> >>> >
> >>> >     $ make -j8 check-gcc-c RUNTESTFLAGS='--target_board=amdgcn-amdhsa-run/-march=gfx1030 vect.exp'
> >>> >
> >>> > Oh, and I saw that on <https://gcc.gnu.org/wiki/Offloading>, Tobias has
> >>> > recently put into a new "Using the GPU as stand-alone system" section
> >>> > some similar information.  (..., but this should, in my opinion, be on a
> >>> > different page, as it's explicitly *not* about what we understand as
> >>> > offloading.)
> >>> >
> >>> >> I usually use the CodeSourcery magic stack of scripts for testing 
> >>> >> installed toolchains on remote devices, so I'm not too familiar with 
> >>> >> using Dejagnu directly.
> >>> >
> >>> > Tsk...  ;'-|
> >>> >
> >>> >
> >>> > Gr??e
> >>> >  Thomas
> >>> 
> >>
> >> -- 
> >> Richard Biener <rguenther@suse.de>
> >> SUSE Software Solutions Germany GmbH,
> >> Frankenstrasse 146, 90461 Nuernberg, Germany;
> >> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
>
  
Thomas Schwinge Feb. 20, 2024, 8:46 a.m. UTC | #27
Hi Richard!

On 2024-02-20T08:44:35+0100, Richard Biener <rguenther@suse.de> wrote:
> On Mon, 19 Feb 2024, Thomas Schwinge wrote:
>> On 2024-02-19T17:31:20+0100, I wrote:
>> > On 2024-02-19T11:52:55+0100, Richard Biener <rguenther@suse.de> wrote:
>> >> On Mon, 19 Feb 2024, Thomas Schwinge wrote:
>> >>> On 2024-02-16T14:53:04+0100, I wrote:
>> >>> > On 2024-02-16T12:41:06+0000, Andrew Stubbs <ams@baylibre.com> wrote:
>> >>> >> On 16/02/2024 12:26, Richard Biener wrote:
>> >>> >>> On Fri, 16 Feb 2024, Andrew Stubbs wrote:
>> >>> >>>> On 16/02/2024 10:17, Richard Biener wrote:
>> >>> >>>>> On Fri, 16 Feb 2024, Thomas Schwinge wrote:
>> >>> >>>>>> On 2023-10-20T12:51:03+0100, Andrew Stubbs <ams@codesourcery.com> wrote:
>> >>> >>>>>>> I've committed this patch
>> >>> >>>>>>
>> >>> >>>>>> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
>> >>> >>>>>> "amdgcn: add -march=gfx1030 EXPERIMENTAL", which the later RDNA3/gfx1100
>> >>> >>>>>> support builds on top of, and that's what I'm currently working on
>> >>> >>>>>> getting proper GCC/GCN target (not offloading) results for.
>> >>> >>>>>>
>> >>> >>>>>> Now looking at 'gcc.dg/vect/bb-slp-cond-1.c', which is reasonably simple,
>> >>> >>>>>> and hopefully representative for other SLP execution test FAILs
>> >>> >>>>>> (regressions compared to my earlier non-gfx1100 testing).
>> >>> >>>>>>
>> >>> >>>>>>       $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/
>> >>> >>>>>>       source-gcc/gcc/testsuite/gcc.dg/vect/bb-slp-cond-1.c
>> >>> >>>>>>       --sysroot=install/amdgcn-amdhsa -ftree-vectorize
>> >>> >>>>>>       -fno-tree-loop-distribute-patterns -fno-vect-cost-model -fno-common
>> >>> >>>>>>       -O2 -fdump-tree-slp-details -fdump-tree-vect-details -isystem
>> >>> >>>>>>       build-gcc/amdgcn-amdhsa/gfx1100/newlib/targ-include -isystem
>> >>> >>>>>>       source-gcc/newlib/libc/include
>> >>> >>>>>>       -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
>> >>> >>>>>>       -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -wrapper
>> >>> >>>>>>       setarch,--addr-no-randomize -fdump-tree-all-all -fdump-ipa-all-all
>> >>> >>>>>>       -fdump-rtl-all-all -save-temps -march=gfx1100
>> >>> >>>>>>
>> >>> >>>>>> The '-march=gfx1030' 'a-bb-slp-cond-1.s' is identical (apart from
>> >>> >>>>>> 'TARGET_PACKED_WORK_ITEMS' in 'gcn_target_asm_function_prologue'), so I
>> >>> >>>>>> suppose will also exhibit the same failure mode, once again?
>> >>> >>>>>>
>> >>> >>>>>> Compared to '-march=gfx90a', the differences begin in
>> >>> >>>>>> 'a-bb-slp-cond-1.c.266r.expand' (only!), down to 'a-bb-slp-cond-1.s'.
>> >>> >>>>>>
>> >>> >>>>>> Changed like:
>> >>> >>>>>>
>> >>> >>>>>>       @@ -38,10 +38,10 @@ int main ()
>> >>> >>>>>>        #pragma GCC novector
>> >>> >>>>>>          for (i = 1; i < N; i++)
>> >>> >>>>>>            if (a[i] != i%4 + 1)
>> >>> >>>>>>       -      abort ();
>> >>> >>>>>>       +      __builtin_printf("%d %d != %d\n", i, a[i], i%4 + 1);
>> >>> >>>>>>        
>> >>> >>>>>>          if (a[0] != 5)
>> >>> >>>>>>       -    abort ();
>> >>> >>>>>>       +    __builtin_printf("%d %d != %d\n", 0, a[0], 5);
>> >>> >>>>>>
>> >>> >>>>>> ..., we see:
>> >>> >>>>>>
>> >>> >>>>>>       $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
>> >>> >>>>>>       40 5 != 1
>> >>> >>>>>>       41 6 != 2
>> >>> >>>>>>       42 7 != 3
>> >>> >>>>>>       43 8 != 4
>> >>> >>>>>>       44 5 != 1
>> >>> >>>>>>       45 6 != 2
>> >>> >>>>>>       46 7 != 3
>> >>> >>>>>>       47 8 != 4
>> >>> >>>>>>
>> >>> >>>>>> '40..47' are the 'i = 10..11' in 'foo', and the expectation is
>> >>> >>>>>> 'a[i * stride + 0..3] != 0'.  So, either some earlier iteration has
>> >>> >>>>>> scribbled zero values over these (vector lane masking issue, perhaps?),
>> >>> >>>>>> or some other code generation issue?
>> >>> >
>> >>> >>>> [...], I must be doing something different because vect/bb-slp-cond-1.c
>> >>> >>>> passes for me, on gfx1100.
>> >>> >
>> >>> > That's strange.  I've looked at your log file (looks good), and used your
>> >>> > toolchain to compile, and your 'gcn-run' to invoke, and still do get:
>> >>> >
>> >>> >     $ flock /tmp/gcn.lock ~/gcn-run ~/bb-slp-cond-1.exe
>> >>> >     GCN Kernel Aborted
>> >>> >     Kernel aborted
>> >>> >
>> >>> > Andrew, later on, please try what happens when you put an unconditional
>> >>> > 'abort' call into a test case?
>> >>> 
>> >>> Andrew, any luck with that yet?
>> >>> 
>> >>> Richard, are you able to reproduce the 'gcc.dg/vect/bb-slp-cond-1.c'
>> >>> execution test failure mentioned above (manual compilation and
>> >>> 'gcn-run')?
>> >>
>> >> No, when manually compiling/running the testcase it works fine for me.
>> >
>> > I've updated my GCC master branch sources, but it still fails for me:
>> >
>> >     $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ source-gcc/gcc/testsuite/gcc.dg/vect/bb-slp-cond-1.c --sysroot=install/amdgcn-amdhsa -isystem build-gcc/amdgcn-amdhsa/gfx1100/newlib/targ-include -isystem source-gcc/newlib/libc/include -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/ -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -march=gfx1100 -ftree-vectorize -fno-tree-loop-distribute-patterns -fno-vect-cost-model -fno-common -O2 -save-temps
>> >     $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
>> >     GCN Kernel Aborted
>> >     Kernel aborted
>> >
>> > Strange.
>> >
>> > In 'bb-slp-cond-1.tar.xz' I'm attaching the files I've built.  Could you
>> > please compare those to yours and try 'gcn-run gfx1030/a.out'?
>> 
>> Actually: 'gcn-run gfx1030/a.out' a few times -- our dear friend
>> Nondeterminism seems to be at play here...  :-|
>
> What's your set of compile options?  I don't manage to get close
> to your gfx1030 assembly when using your preprocessed source ...
>
> I've tried -march=gfx1030 -O[23] [-fno-vect-cost-model]

See the 'xgcc' command line just a few lines above?  ;-)

    -ftree-vectorize -fno-tree-loop-distribute-patterns -fno-vect-cost-model -fno-common -O2

That's what I originally found in 'gcc.log'.


Grüße
 Thomas


> Looks like you use -fno-omit-frame-pointer but then I still see
> -mine +yours
>
> -       v_readlane_b32  s18, v4, 0
> -       v_readlane_b32  s19, v5, 0
> -       s_add_u32       s18, s18, s26
> -       s_addc_u32      s19, s19, s27
> -       v_writelane_b32 v4, s18, 0
> -       v_writelane_b32 v5, s19, 0
> -       s_mov_b32       s18, s14
> -       s_mov_b32       s19, s15
> -       s_mov_b32       s22, scc
> -       s_add_u32       s18, s18, 4096
> -       s_addc_u32      s19, s19, 0
> -       s_cmpk_lg_u32   s22, 0
> -       v_writelane_b32 v6, s18, 0
> -       v_writelane_b32 v7, s19, 0
> -       flat_store_dwordx2      v[6:7], v[4:5]
> +       v_writelane_b32 v6, s26, 0
> +       v_writelane_b32 v7, s27, 0
> +       v_add_co_u32    v4, vcc, v6, v4
> +       v_add_co_ci_u32 v5, vcc, v7, v5, vcc
>
> and more changes.
>
> Richard.
>
>> 
>> Gr??e
>>  Thomas
>> 
>> 
>> >> Didn't yet get to try the .exp files
>> >>
>> >> Richard.
>> >>
>> >>> 
>> >>> Gr??e
>> >>>  Thomas
>> >>> 
>> >>> 
>> >>> >>> I didn't try to run it - when doing make check-gcc fails to using
>> >>> >>> gcn-run for test invocation
>> >>> >
>> >>> > Note, that for such individual test cases, invoking the compiler and then
>> >>> > 'gcn-run' manually would seem easiest?
>> >>> >
>> >>> >>> what's the trick to make it do that?
>> >>> >
>> >>> > I tell you've probably not done much "embedded" or simulator testing of
>> >>> > GCC targets?  ;-P
>> >>> >
>> >>> >> There's a config file for nvptx here: 
>> >>> >> https://github.com/SourceryTools/nvptx-tools/blob/master/nvptx-none-run.exp
>> >>> >
>> >>> > Yes, and I have pending some updates to that one, to be finished once
>> >>> > I've generally got my testing set up again, to a sufficient degree...
>> >>> >
>> >>> >> You can probably make the obvious adjustments. I think Thomas has a GCN 
>> >>> >> version with a few more features.
>> >>> >
>> >>> > Right.  I'm attaching my current 'amdgcn-amdhsa-run.exp'.
>> >>> >
>> >>> > I'm aware that the 'set_board_info gcc,[...] [...]' may be obsolete/wrong
>> >>> > (as Andrew also noted privately) -- likewise, at least in part, for
>> >>> > GCC/nvptx, which is where I copied all that from.  (Will revise later;
>> >>> > not relevant for this discussion, here.)
>> >>> >
>> >>> > Similar to what I've recently added to libgomp, there is 'flock'ing here,
>> >>> > so that you may use 'make -j[...] check' for (partial) parallelism, but
>> >>> > still all execution testing runs serialized.  I found this to greatly
>> >>> > help denoise the test results.  (Not ideal, of course, but improving that
>> >>> > is for later, too.)
>> >>> >
>> >>> > You may want to disable the 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' thing if
>> >>> > that doesn't work like that in your case.  (I've no idea what
>> >>> > 'amdgpu_gpu_recover' would do if the GPU is also used for display.)  But
>> >>> > this, again, greatly helps denoise test results, at least for the one
>> >>> > system I'm currently testing on.
>> >>> >
>> >>> > I intend to publish proper documentation of all this, later on -- happy
>> >>> > to answer any questions in the mean time.
>> >>> >
>> >>> > If you don't already have a common directory for DejaGnu board files, put
>> >>> > 'amdgcn-amdhsa-run.exp' into '~/tmp/amdgcn-amdhsa/', for example, and add
>> >>> > a 'dejagnu.exp' file next to it:
>> >>> >
>> >>> >     lappend boards_dir ~/tmp/amdgcn-amdhsa
>> >>> >
>> >>> > Prepare:
>> >>> >
>> >>> >     $ DEJAGNU=$HOME/tmp/amdgcn-amdhsa/dejagnu.exp
>> >>> >     $ export DEJAGNU
>> >>> >     $ AMDGCN_AMDHSA_RUN=[...]/build-gcc/gcc/gcn-run
>> >>> >     $ export AMDGCN_AMDHSA_RUN
>> >>> >     $ # If necessary:
>> >>> >     $ AMDGCN_AMDHSA_LD_LIBRARY_PATH=/opt/rocm/lib
>> >>> >     $ LD_LIBRARY_PATH=$AMDGCN_AMDHSA_LD_LIBRARY_PATH${LD_LIBRARY_PATH+:$LD_LIBRARY_PATH}
>> >>> >     $ export LD_LIBRARY_PATH
>> >>> >
>> >>> > ..., and then run:
>> >>> >
>> >>> >     $ make -j8 check-gcc-c RUNTESTFLAGS='--target_board=amdgcn-amdhsa-run/-march=gfx1030 vect.exp'
>> >>> >
>> >>> > Oh, and I saw that on <https://gcc.gnu.org/wiki/Offloading>, Tobias has
>> >>> > recently put into a new "Using the GPU as stand-alone system" section
>> >>> > some similar information.  (..., but this should, in my opinion, be on a
>> >>> > different page, as it's explicitly *not* about what we understand as
>> >>> > offloading.)
>> >>> >
>> >>> >> I usually use the CodeSourcery magic stack of scripts for testing 
>> >>> >> installed toolchains on remote devices, so I'm not too familiar with 
>> >>> >> using Dejagnu directly.
>> >>> >
>> >>> > Tsk...  ;'-|
>> >>> >
>> >>> >
>> >>> > Gr??e
>> >>> >  Thomas
>> >>> 
>> >>
>> >> -- 
>> >> Richard Biener <rguenther@suse.de>
>> >> SUSE Software Solutions Germany GmbH,
>> >> Frankenstrasse 146, 90461 Nuernberg, Germany;
>> >> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
>> 
>
> -- 
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
  
Richard Biener Feb. 20, 2024, 9:13 a.m. UTC | #28
On Tue, 20 Feb 2024, Thomas Schwinge wrote:

> Hi Richard!
> 
> On 2024-02-20T08:44:35+0100, Richard Biener <rguenther@suse.de> wrote:
> > On Mon, 19 Feb 2024, Thomas Schwinge wrote:
> >> On 2024-02-19T17:31:20+0100, I wrote:
> >> > On 2024-02-19T11:52:55+0100, Richard Biener <rguenther@suse.de> wrote:
> >> >> On Mon, 19 Feb 2024, Thomas Schwinge wrote:
> >> >>> On 2024-02-16T14:53:04+0100, I wrote:
> >> >>> > On 2024-02-16T12:41:06+0000, Andrew Stubbs <ams@baylibre.com> wrote:
> >> >>> >> On 16/02/2024 12:26, Richard Biener wrote:
> >> >>> >>> On Fri, 16 Feb 2024, Andrew Stubbs wrote:
> >> >>> >>>> On 16/02/2024 10:17, Richard Biener wrote:
> >> >>> >>>>> On Fri, 16 Feb 2024, Thomas Schwinge wrote:
> >> >>> >>>>>> On 2023-10-20T12:51:03+0100, Andrew Stubbs <ams@codesourcery.com> wrote:
> >> >>> >>>>>>> I've committed this patch
> >> >>> >>>>>>
> >> >>> >>>>>> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
> >> >>> >>>>>> "amdgcn: add -march=gfx1030 EXPERIMENTAL", which the later RDNA3/gfx1100
> >> >>> >>>>>> support builds on top of, and that's what I'm currently working on
> >> >>> >>>>>> getting proper GCC/GCN target (not offloading) results for.
> >> >>> >>>>>>
> >> >>> >>>>>> Now looking at 'gcc.dg/vect/bb-slp-cond-1.c', which is reasonably simple,
> >> >>> >>>>>> and hopefully representative for other SLP execution test FAILs
> >> >>> >>>>>> (regressions compared to my earlier non-gfx1100 testing).
> >> >>> >>>>>>
> >> >>> >>>>>>       $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/
> >> >>> >>>>>>       source-gcc/gcc/testsuite/gcc.dg/vect/bb-slp-cond-1.c
> >> >>> >>>>>>       --sysroot=install/amdgcn-amdhsa -ftree-vectorize
> >> >>> >>>>>>       -fno-tree-loop-distribute-patterns -fno-vect-cost-model -fno-common
> >> >>> >>>>>>       -O2 -fdump-tree-slp-details -fdump-tree-vect-details -isystem
> >> >>> >>>>>>       build-gcc/amdgcn-amdhsa/gfx1100/newlib/targ-include -isystem
> >> >>> >>>>>>       source-gcc/newlib/libc/include
> >> >>> >>>>>>       -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
> >> >>> >>>>>>       -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -wrapper
> >> >>> >>>>>>       setarch,--addr-no-randomize -fdump-tree-all-all -fdump-ipa-all-all
> >> >>> >>>>>>       -fdump-rtl-all-all -save-temps -march=gfx1100
> >> >>> >>>>>>
> >> >>> >>>>>> The '-march=gfx1030' 'a-bb-slp-cond-1.s' is identical (apart from
> >> >>> >>>>>> 'TARGET_PACKED_WORK_ITEMS' in 'gcn_target_asm_function_prologue'), so I
> >> >>> >>>>>> suppose will also exhibit the same failure mode, once again?
> >> >>> >>>>>>
> >> >>> >>>>>> Compared to '-march=gfx90a', the differences begin in
> >> >>> >>>>>> 'a-bb-slp-cond-1.c.266r.expand' (only!), down to 'a-bb-slp-cond-1.s'.
> >> >>> >>>>>>
> >> >>> >>>>>> Changed like:
> >> >>> >>>>>>
> >> >>> >>>>>>       @@ -38,10 +38,10 @@ int main ()
> >> >>> >>>>>>        #pragma GCC novector
> >> >>> >>>>>>          for (i = 1; i < N; i++)
> >> >>> >>>>>>            if (a[i] != i%4 + 1)
> >> >>> >>>>>>       -      abort ();
> >> >>> >>>>>>       +      __builtin_printf("%d %d != %d\n", i, a[i], i%4 + 1);
> >> >>> >>>>>>        
> >> >>> >>>>>>          if (a[0] != 5)
> >> >>> >>>>>>       -    abort ();
> >> >>> >>>>>>       +    __builtin_printf("%d %d != %d\n", 0, a[0], 5);
> >> >>> >>>>>>
> >> >>> >>>>>> ..., we see:
> >> >>> >>>>>>
> >> >>> >>>>>>       $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
> >> >>> >>>>>>       40 5 != 1
> >> >>> >>>>>>       41 6 != 2
> >> >>> >>>>>>       42 7 != 3
> >> >>> >>>>>>       43 8 != 4
> >> >>> >>>>>>       44 5 != 1
> >> >>> >>>>>>       45 6 != 2
> >> >>> >>>>>>       46 7 != 3
> >> >>> >>>>>>       47 8 != 4
> >> >>> >>>>>>
> >> >>> >>>>>> '40..47' are the 'i = 10..11' in 'foo', and the expectation is
> >> >>> >>>>>> 'a[i * stride + 0..3] != 0'.  So, either some earlier iteration has
> >> >>> >>>>>> scribbled zero values over these (vector lane masking issue, perhaps?),
> >> >>> >>>>>> or some other code generation issue?
> >> >>> >
> >> >>> >>>> [...], I must be doing something different because vect/bb-slp-cond-1.c
> >> >>> >>>> passes for me, on gfx1100.
> >> >>> >
> >> >>> > That's strange.  I've looked at your log file (looks good), and used your
> >> >>> > toolchain to compile, and your 'gcn-run' to invoke, and still do get:
> >> >>> >
> >> >>> >     $ flock /tmp/gcn.lock ~/gcn-run ~/bb-slp-cond-1.exe
> >> >>> >     GCN Kernel Aborted
> >> >>> >     Kernel aborted
> >> >>> >
> >> >>> > Andrew, later on, please try what happens when you put an unconditional
> >> >>> > 'abort' call into a test case?
> >> >>> 
> >> >>> Andrew, any luck with that yet?
> >> >>> 
> >> >>> Richard, are you able to reproduce the 'gcc.dg/vect/bb-slp-cond-1.c'
> >> >>> execution test failure mentioned above (manual compilation and
> >> >>> 'gcn-run')?
> >> >>
> >> >> No, when manually compiling/running the testcase it works fine for me.
> >> >
> >> > I've updated my GCC master branch sources, but it still fails for me:
> >> >
> >> >     $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ source-gcc/gcc/testsuite/gcc.dg/vect/bb-slp-cond-1.c --sysroot=install/amdgcn-amdhsa -isystem build-gcc/amdgcn-amdhsa/gfx1100/newlib/targ-include -isystem source-gcc/newlib/libc/include -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/ -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -march=gfx1100 -ftree-vectorize -fno-tree-loop-distribute-patterns -fno-vect-cost-model -fno-common -O2 -save-temps
> >> >     $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
> >> >     GCN Kernel Aborted
> >> >     Kernel aborted
> >> >
> >> > Strange.
> >> >
> >> > In 'bb-slp-cond-1.tar.xz' I'm attaching the files I've built.  Could you
> >> > please compare those to yours and try 'gcn-run gfx1030/a.out'?
> >> 
> >> Actually: 'gcn-run gfx1030/a.out' a few times -- our dear friend
> >> Nondeterminism seems to be at play here...  :-|
> >
> > What's your set of compile options?  I don't manage to get close
> > to your gfx1030 assembly when using your preprocessed source ...
> >
> > I've tried -march=gfx1030 -O[23] [-fno-vect-cost-model]
> 
> See the 'xgcc' command line just a few lines above?  ;-)
> 
>     -ftree-vectorize -fno-tree-loop-distribute-patterns -fno-vect-cost-model -fno-common -O2
> 
> That's what I originally found in 'gcc.log'.

OK, with that I can reproduce the issue.  -O2 -ftree-vectorize seems
to be enough to trigger it.  It's indeed somewhat random whether it
fails or not ...

Richard.

> 
> Gr??e
>  Thomas
> 
> 
> > Looks like you use -fno-omit-frame-pointer but then I still see
> > -mine +yours
> >
> > -       v_readlane_b32  s18, v4, 0
> > -       v_readlane_b32  s19, v5, 0
> > -       s_add_u32       s18, s18, s26
> > -       s_addc_u32      s19, s19, s27
> > -       v_writelane_b32 v4, s18, 0
> > -       v_writelane_b32 v5, s19, 0
> > -       s_mov_b32       s18, s14
> > -       s_mov_b32       s19, s15
> > -       s_mov_b32       s22, scc
> > -       s_add_u32       s18, s18, 4096
> > -       s_addc_u32      s19, s19, 0
> > -       s_cmpk_lg_u32   s22, 0
> > -       v_writelane_b32 v6, s18, 0
> > -       v_writelane_b32 v7, s19, 0
> > -       flat_store_dwordx2      v[6:7], v[4:5]
> > +       v_writelane_b32 v6, s26, 0
> > +       v_writelane_b32 v7, s27, 0
> > +       v_add_co_u32    v4, vcc, v6, v4
> > +       v_add_co_ci_u32 v5, vcc, v7, v5, vcc
> >
> > and more changes.
> >
> > Richard.
> >
> >> 
> >> Gr??e
> >>  Thomas
> >> 
> >> 
> >> >> Didn't yet get to try the .exp files
> >> >>
> >> >> Richard.
> >> >>
> >> >>> 
> >> >>> Gr??e
> >> >>>  Thomas
> >> >>> 
> >> >>> 
> >> >>> >>> I didn't try to run it - when doing make check-gcc fails to using
> >> >>> >>> gcn-run for test invocation
> >> >>> >
> >> >>> > Note, that for such individual test cases, invoking the compiler and then
> >> >>> > 'gcn-run' manually would seem easiest?
> >> >>> >
> >> >>> >>> what's the trick to make it do that?
> >> >>> >
> >> >>> > I tell you've probably not done much "embedded" or simulator testing of
> >> >>> > GCC targets?  ;-P
> >> >>> >
> >> >>> >> There's a config file for nvptx here: 
> >> >>> >> https://github.com/SourceryTools/nvptx-tools/blob/master/nvptx-none-run.exp
> >> >>> >
> >> >>> > Yes, and I have pending some updates to that one, to be finished once
> >> >>> > I've generally got my testing set up again, to a sufficient degree...
> >> >>> >
> >> >>> >> You can probably make the obvious adjustments. I think Thomas has a GCN 
> >> >>> >> version with a few more features.
> >> >>> >
> >> >>> > Right.  I'm attaching my current 'amdgcn-amdhsa-run.exp'.
> >> >>> >
> >> >>> > I'm aware that the 'set_board_info gcc,[...] [...]' may be obsolete/wrong
> >> >>> > (as Andrew also noted privately) -- likewise, at least in part, for
> >> >>> > GCC/nvptx, which is where I copied all that from.  (Will revise later;
> >> >>> > not relevant for this discussion, here.)
> >> >>> >
> >> >>> > Similar to what I've recently added to libgomp, there is 'flock'ing here,
> >> >>> > so that you may use 'make -j[...] check' for (partial) parallelism, but
> >> >>> > still all execution testing runs serialized.  I found this to greatly
> >> >>> > help denoise the test results.  (Not ideal, of course, but improving that
> >> >>> > is for later, too.)
> >> >>> >
> >> >>> > You may want to disable the 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' thing if
> >> >>> > that doesn't work like that in your case.  (I've no idea what
> >> >>> > 'amdgpu_gpu_recover' would do if the GPU is also used for display.)  But
> >> >>> > this, again, greatly helps denoise test results, at least for the one
> >> >>> > system I'm currently testing on.
> >> >>> >
> >> >>> > I intend to publish proper documentation of all this, later on -- happy
> >> >>> > to answer any questions in the mean time.
> >> >>> >
> >> >>> > If you don't already have a common directory for DejaGnu board files, put
> >> >>> > 'amdgcn-amdhsa-run.exp' into '~/tmp/amdgcn-amdhsa/', for example, and add
> >> >>> > a 'dejagnu.exp' file next to it:
> >> >>> >
> >> >>> >     lappend boards_dir ~/tmp/amdgcn-amdhsa
> >> >>> >
> >> >>> > Prepare:
> >> >>> >
> >> >>> >     $ DEJAGNU=$HOME/tmp/amdgcn-amdhsa/dejagnu.exp
> >> >>> >     $ export DEJAGNU
> >> >>> >     $ AMDGCN_AMDHSA_RUN=[...]/build-gcc/gcc/gcn-run
> >> >>> >     $ export AMDGCN_AMDHSA_RUN
> >> >>> >     $ # If necessary:
> >> >>> >     $ AMDGCN_AMDHSA_LD_LIBRARY_PATH=/opt/rocm/lib
> >> >>> >     $ LD_LIBRARY_PATH=$AMDGCN_AMDHSA_LD_LIBRARY_PATH${LD_LIBRARY_PATH+:$LD_LIBRARY_PATH}
> >> >>> >     $ export LD_LIBRARY_PATH
> >> >>> >
> >> >>> > ..., and then run:
> >> >>> >
> >> >>> >     $ make -j8 check-gcc-c RUNTESTFLAGS='--target_board=amdgcn-amdhsa-run/-march=gfx1030 vect.exp'
> >> >>> >
> >> >>> > Oh, and I saw that on <https://gcc.gnu.org/wiki/Offloading>, Tobias has
> >> >>> > recently put into a new "Using the GPU as stand-alone system" section
> >> >>> > some similar information.  (..., but this should, in my opinion, be on a
> >> >>> > different page, as it's explicitly *not* about what we understand as
> >> >>> > offloading.)
> >> >>> >
> >> >>> >> I usually use the CodeSourcery magic stack of scripts for testing 
> >> >>> >> installed toolchains on remote devices, so I'm not too familiar with 
> >> >>> >> using Dejagnu directly.
> >> >>> >
> >> >>> > Tsk...  ;'-|
> >> >>> >
> >> >>> >
> >> >>> > Gr??e
> >> >>> >  Thomas
> >> >>> 
> >> >>
> >> >> -- 
> >> >> Richard Biener <rguenther@suse.de>
> >> >> SUSE Software Solutions Germany GmbH,
> >> >> Frankenstrasse 146, 90461 Nuernberg, Germany;
> >> >> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
> >> 
> >
> > -- 
> > Richard Biener <rguenther@suse.de>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
>
  

Patch

diff --git a/newlib/libc/machine/amdgcn/exit-value.h b/newlib/libc/machine/amdgcn/exit-value.h
index 7aa2508bb..6b9d2411b 100644
--- a/newlib/libc/machine/amdgcn/exit-value.h
+++ b/newlib/libc/machine/amdgcn/exit-value.h
@@ -32,7 +32,6 @@  exit_with_int (int val)
   *return_value = val;
 
   /* Terminate the current kernel.  */
-  asm ("s_dcache_wb");
   asm ("s_endpgm");
   __builtin_unreachable ();
 }