amdgcn: additional gfx1100 support

Message ID 20240124124304.1780645-1-ams@baylibre.com
State Unresolved
Headers
Series amdgcn: additional gfx1100 support |

Checks

Context Check Description
snail/gcc-patch-check warning Git am fail log

Commit Message

Andrew Stubbs Jan. 24, 2024, 12:43 p.m. UTC
  This is enough to get gfx1100 working for most purposes, on top of the
patch that Tobias committed a week or so ago; there are still some test
failures to investigate, and probably some tuning to do.

It might also get gfx1030 working too. @Richi, could you test it,
please?

I can't test the other multilibs right now. @PA, can you test it please?

I can self-approve the patch, but I'll hold off the commit until the
test results come back.

Andrew

gcc/ChangeLog:

	* config/gcn/gcn-opts.h (TARGET_PACKED_WORK_ITEMS): Add TARGET_RDNA3.
	* config/gcn/gcn-valu.md (all_convert): New iterator.
	(<convop><V_INT_1REG_ALT:mode><V_INT_1REG:mode>2<exec>): New
	define_expand, and rename the old one to ...
	(*<convop><V_INT_1REG_ALT:mode><V_INT_1REG:mode>_sdwa<exec>): ... this.
	(extend<V_INT_1REG_ALT:mode><V_INT_1REG:mode>2<exec>): Likewise, to ...
	(extend<V_INT_1REG_ALT:mode><V_INT_1REG:mode>_sdwa<exec>): .. this.
	(*<convop><V_INT_1REG_ALT:mode><V_INT_1REG:mode>_shift<exec>): New.
	* config/gcn/gcn.cc (gcn_global_address_p): Use "offsetbits" correctly.
	(gcn_hsa_declare_function_name): Update the vgpr counting for gfx1100.
	* config/gcn/gcn.md (<u>mulhisi3): Disable on RDNA3.
	(<u>mulqihi3_scalar): Likewise.

libgcc/ChangeLog:

	* config/gcn/amdgcn_veclib.h (CDNA3_PLUS): Handle RDNA3.

libgomp/ChangeLog:

	* config/gcn/time.c (RTC_TICKS): Configure RDNA3.
	(omp_get_wtime): Add RDNA3-compatible variant.
	* plugin/plugin-gcn.c (max_isa_vgprs): Tune for gfx1030 and gfx1100.

Signed-off-by:  Andrew Stubbs <ams@baylibre.com>
---
 gcc/config/gcn/gcn-opts.h         |  2 +-
 gcc/config/gcn/gcn-valu.md        | 41 ++++++++++++++++++++++++++++---
 gcc/config/gcn/gcn.cc             | 31 ++++++++++++++++-------
 gcc/config/gcn/gcn.md             |  4 +--
 libgcc/config/gcn/amdgcn_veclib.h |  2 +-
 libgomp/config/gcn/time.c         | 10 ++++++++
 libgomp/plugin/plugin-gcn.c       |  6 +++--
 7 files changed, 77 insertions(+), 19 deletions(-)
  

Comments

Richard Biener Jan. 26, 2024, 8:56 a.m. UTC | #1
On Wed, 24 Jan 2024, Andrew Stubbs wrote:

> This is enough to get gfx1100 working for most purposes, on top of the
> patch that Tobias committed a week or so ago; there are still some test
> failures to investigate, and probably some tuning to do.
> 
> It might also get gfx1030 working too. @Richi, could you test it,
> please?

I can report partial success here.  I do see quite some FAILs because of

/space/rguenther/src/gcc-autopar_devel/libgomp/testsuite/libgomp.fortran/examples-4/declare_target-4.f90: 
In function 'accum_._omp_fn.1':^M
/space/rguenther/src/gcc-autopar_devel/libgomp/testsuite/libgomp.fortran/examples-4/declare_target-4.f90:20:38: 
error: unrecognizable insn:^M
(insn 108 107 109 6 (set (reg:V8SF 849)^M
        (unspec:V8SF [^M
                (reg:V8SF 844 [ vect__43.12_106 ]) repeated x2^M
                (const_int 1 [0x1])^M
            ] UNSPEC_PLUS_DPP_SHR)) 
"/space/rguenther/src/gcc-autopar_devel/libgomp/testsuite/libgomp.fortran/examples-4/declare_target-4.f90":22:29 
discrim 1 -1^M
     (nil))^M
during RTL pass: vregs^M
/space/rguenther/src/gcc-autopar_devel/libgomp/testsuite/libgomp.fortran/examples-4/declare_target-4.f90:20:38: 
internal compiler error: in extract_insn, at recog.cc:2812^M

there are also quite a number of execution FAILs like

icv-5.exe: 
/space/rguenther/src/gcc-autopar_devel/libgomp/plugin/plugin-gcn.c:2462: 
isa_matches_agent: Assertion `agent_isa_s' failed.
FAIL: libgomp.c/../libgomp.c-c++-common/icv-5.c execution test

(the assert in question looks bad - yeah, somehow we got past
device initialization - not sure how - but end up here)

Maybe HSA behaves odd here - I didn't constrain the device it should
choose but it works (most of the time).  GCN_DEBUG prints me all the
HSA agents available but I don't see any debug on which agent
is actually initialized during libgomp device init (at least nothing
I can easily match up).  Maybe sth to improve.

I'll followup with a test summary once the (serial) run of libgomp
testing finished.  At least there are quite some number of
actual kernel executions and PASSing testcases.

Richard.

> I can't test the other multilibs right now. @PA, can you test it please?
> 
> I can self-approve the patch, but I'll hold off the commit until the
> test results come back.
> 
> Andrew
> 
> gcc/ChangeLog:
> 
> 	* config/gcn/gcn-opts.h (TARGET_PACKED_WORK_ITEMS): Add TARGET_RDNA3.
> 	* config/gcn/gcn-valu.md (all_convert): New iterator.
> 	(<convop><V_INT_1REG_ALT:mode><V_INT_1REG:mode>2<exec>): New
> 	define_expand, and rename the old one to ...
> 	(*<convop><V_INT_1REG_ALT:mode><V_INT_1REG:mode>_sdwa<exec>): ... this.
> 	(extend<V_INT_1REG_ALT:mode><V_INT_1REG:mode>2<exec>): Likewise, to ...
> 	(extend<V_INT_1REG_ALT:mode><V_INT_1REG:mode>_sdwa<exec>): .. this.
> 	(*<convop><V_INT_1REG_ALT:mode><V_INT_1REG:mode>_shift<exec>): New.
> 	* config/gcn/gcn.cc (gcn_global_address_p): Use "offsetbits" correctly.
> 	(gcn_hsa_declare_function_name): Update the vgpr counting for gfx1100.
> 	* config/gcn/gcn.md (<u>mulhisi3): Disable on RDNA3.
> 	(<u>mulqihi3_scalar): Likewise.
> 
> libgcc/ChangeLog:
> 
> 	* config/gcn/amdgcn_veclib.h (CDNA3_PLUS): Handle RDNA3.
> 
> libgomp/ChangeLog:
> 
> 	* config/gcn/time.c (RTC_TICKS): Configure RDNA3.
> 	(omp_get_wtime): Add RDNA3-compatible variant.
> 	* plugin/plugin-gcn.c (max_isa_vgprs): Tune for gfx1030 and gfx1100.
> 
> Signed-off-by:  Andrew Stubbs <ams@baylibre.com>
> ---
>  gcc/config/gcn/gcn-opts.h         |  2 +-
>  gcc/config/gcn/gcn-valu.md        | 41 ++++++++++++++++++++++++++++---
>  gcc/config/gcn/gcn.cc             | 31 ++++++++++++++++-------
>  gcc/config/gcn/gcn.md             |  4 +--
>  libgcc/config/gcn/amdgcn_veclib.h |  2 +-
>  libgomp/config/gcn/time.c         | 10 ++++++++
>  libgomp/plugin/plugin-gcn.c       |  6 +++--
>  7 files changed, 77 insertions(+), 19 deletions(-)
> 
> diff --git a/gcc/config/gcn/gcn-opts.h b/gcc/config/gcn/gcn-opts.h
> index 79fbda3ab25..6be2c9204fa 100644
> --- a/gcc/config/gcn/gcn-opts.h
> +++ b/gcc/config/gcn/gcn-opts.h
> @@ -62,7 +62,7 @@ extern enum gcn_isa {
>  
>  
>  #define TARGET_M0_LDS_LIMIT (TARGET_GCN3)
> -#define TARGET_PACKED_WORK_ITEMS (TARGET_CDNA2_PLUS)
> +#define TARGET_PACKED_WORK_ITEMS (TARGET_CDNA2_PLUS || TARGET_RDNA3)
>  
>  #define TARGET_XNACK (flag_xnack != HSACO_ATTR_OFF)
>  
> diff --git a/gcc/config/gcn/gcn-valu.md b/gcc/config/gcn/gcn-valu.md
> index 3d5b6271ee6..cd027f8b369 100644
> --- a/gcc/config/gcn/gcn-valu.md
> +++ b/gcc/config/gcn/gcn-valu.md
> @@ -3555,30 +3555,63 @@
>  ;; }}}
>  ;; {{{ Int/int conversions
>  
> +(define_code_iterator all_convert [truncate zero_extend sign_extend])
>  (define_code_iterator zero_convert [truncate zero_extend])
>  (define_code_attr convop [
>  	(sign_extend "extend")
>  	(zero_extend "zero_extend")
>  	(truncate "trunc")])
>  
> -(define_insn "<convop><V_INT_1REG_ALT:mode><V_INT_1REG:mode>2<exec>"
> +(define_expand "<convop><V_INT_1REG_ALT:mode><V_INT_1REG:mode>2<exec>"
> +  [(set (match_operand:V_INT_1REG 0 "register_operand"      "=v")
> +        (all_convert:V_INT_1REG
> +	  (match_operand:V_INT_1REG_ALT 1 "gcn_alu_operand" " v")))]
> +  "")
> +
> +(define_insn "*<convop><V_INT_1REG_ALT:mode><V_INT_1REG:mode>_sdwa<exec>"
>    [(set (match_operand:V_INT_1REG 0 "register_operand"      "=v")
>          (zero_convert:V_INT_1REG
>  	  (match_operand:V_INT_1REG_ALT 1 "gcn_alu_operand" " v")))]
> -  ""
> +  "!TARGET_RDNA3"
>    "v_mov_b32_sdwa\t%0, %1 dst_sel:<V_INT_1REG:sdwa> dst_unused:UNUSED_PAD src0_sel:<V_INT_1REG_ALT:sdwa>"
>    [(set_attr "type" "vop_sdwa")
>     (set_attr "length" "8")])
>  
> -(define_insn "extend<V_INT_1REG_ALT:mode><V_INT_1REG:mode>2<exec>"
> +(define_insn "extend<V_INT_1REG_ALT:mode><V_INT_1REG:mode>_sdwa<exec>"
>    [(set (match_operand:V_INT_1REG 0 "register_operand"	    "=v")
>          (sign_extend:V_INT_1REG
>  	  (match_operand:V_INT_1REG_ALT 1 "gcn_alu_operand" " v")))]
> -  ""
> +  "!TARGET_RDNA3"
>    "v_mov_b32_sdwa\t%0, sext(%1) src0_sel:<V_INT_1REG_ALT:sdwa>"
>    [(set_attr "type" "vop_sdwa")
>     (set_attr "length" "8")])
>  
> +(define_insn "*<convop><V_INT_1REG_ALT:mode><V_INT_1REG:mode>_shift<exec>"
> +  [(set (match_operand:V_INT_1REG 0 "register_operand"      "=v")
> +        (all_convert:V_INT_1REG
> +	  (match_operand:V_INT_1REG_ALT 1 "gcn_alu_operand" " v")))]
> +  "TARGET_RDNA3"
> +  {
> +    enum {extend, zero_extend, trunc};
> +    rtx shiftwidth = (<V_INT_1REG_ALT:SCALAR_MODE>mode == QImode
> +		      || <V_INT_1REG:SCALAR_MODE>mode == QImode
> +		      ? GEN_INT (24)
> +		      : <V_INT_1REG_ALT:SCALAR_MODE>mode == HImode
> +		        || <V_INT_1REG:SCALAR_MODE>mode == HImode
> +		      ? GEN_INT (16)
> +		      : NULL);
> +    operands[2] = shiftwidth;
> +
> +    if (!shiftwidth)
> +      return "v_mov_b32 %0, %1";
> +    else if (<convop> == extend || <convop> == trunc)
> +      return "v_lshlrev_b32\t%0, %2, %1\;v_ashrrev_i32\t%0, %2, %0";
> +    else
> +      return "v_lshlrev_b32\t%0, %2, %1\;v_lshrrev_b32\t%0, %2, %0";
> +  }
> +  [(set_attr "type" "mult")
> +   (set_attr "length" "8")])
> +
>  ;; GCC can already do these for scalar types, but not for vector types.
>  ;; Unfortunately you can't just do SUBREG on a vector to select the low part,
>  ;; so there must be a few tricks here.
> diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc
> index e668ce7c69e..e80de2ce056 100644
> --- a/gcc/config/gcn/gcn.cc
> +++ b/gcc/config/gcn/gcn.cc
> @@ -1597,8 +1597,8 @@ gcn_global_address_p (rtx addr)
>        rtx offset = XEXP (addr, 1);
>        int offsetbits = (TARGET_RDNA2_PLUS ? 11 : 12);
>        bool immediate_p = (CONST_INT_P (offset)
> -			  && INTVAL (offset) >= -(1 << 12)
> -			  && INTVAL (offset) < (1 << 12));
> +			  && INTVAL (offset) >= -(1 << offsetbits)
> +			  && INTVAL (offset) < (1 << offsetbits));
>  
>        if ((gcn_address_register_p (base, DImode, false)
>  	   || gcn_vec_address_register_p (base, DImode, false))
> @@ -6597,8 +6597,10 @@ gcn_hsa_declare_function_name (FILE *file, const char *name,
>      if (df_regs_ever_live_p (FIRST_AVGPR_REG + avgpr))
>        break;
>    avgpr++;
> -  vgpr = (vgpr + 3) & ~3;
> -  avgpr = (avgpr + 3) & ~3;
> +
> +  /* The main function epilogue uses v8, but df doesn't see that.  */
> +  if (vgpr < 9)
> +    vgpr = 9;
>  
>    if (!leaf_function_p ())
>      {
> @@ -6611,9 +6613,18 @@ gcn_hsa_declare_function_name (FILE *file, const char *name,
>  	avgpr = MAX_NORMAL_AVGPR_COUNT;
>      }
>  
> -  /* The gfx90a accum_offset field can't represent 0 registers.  */
> -  if (gcn_arch == PROCESSOR_GFX90a && vgpr < 4)
> -    vgpr = 4;
> +  /* SIMD32 devices count double in wavefront64 mode.  */
> +  if (TARGET_RDNA2_PLUS)
> +    vgpr *= 2;
> +
> +  /* Round up to the allocation block size.  */
> +  int vgpr_block_size = (TARGET_RDNA3 ? 12
> +			 : TARGET_RDNA2_PLUS || TARGET_CDNA2_PLUS ? 8
> +			 : 4);
> +  if (vgpr % vgpr_block_size)
> +    vgpr += vgpr_block_size - (vgpr % vgpr_block_size);
> +  if (avgpr % vgpr_block_size)
> +    avgpr += vgpr_block_size - (avgpr % vgpr_block_size);
>  
>    fputs ("\t.rodata\n"
>  	 "\t.p2align\t6\n"
> @@ -6714,12 +6725,14 @@ gcn_hsa_declare_function_name (FILE *file, const char *name,
>  	   "            .private_segment_fixed_size: 0\n"
>  	   "            .wavefront_size: 64\n"
>  	   "            .sgpr_count: %i\n"
> -	   "            .vgpr_count: %i\n"
> +	   "            .vgpr_count: %i%s\n"
>  	   "            .max_flat_workgroup_size: 1024\n",
>  	   cfun->machine->kernarg_segment_byte_size,
>  	   cfun->machine->kernarg_segment_alignment,
>  	   LDS_SIZE,
> -	   sgpr, next_free_vgpr);
> +	   sgpr, next_free_vgpr,
> +	   (TARGET_RDNA2_PLUS ? " ; wavefrontsize64 counts double on SIMD32"
> +	    : ""));
>    if (gcn_arch == PROCESSOR_GFX90a || gcn_arch == PROCESSOR_GFX908)
>      fprintf (file, "            .agpr_count: %i\n", avgpr);
>    fputs ("        .end_amdgpu_metadata\n", file);
> diff --git a/gcc/config/gcn/gcn.md b/gcc/config/gcn/gcn.md
> index 492b833e255..1f3c692b7a6 100644
> --- a/gcc/config/gcn/gcn.md
> +++ b/gcc/config/gcn/gcn.md
> @@ -1618,7 +1618,7 @@
>  	(mult:SI
>  	  (any_extend:SI (match_operand:HI 1 "register_operand" "%v"))
>  	  (any_extend:SI (match_operand:HI 2 "register_operand" " v"))))]
> -  ""
> +  "!TARGET_RDNA3"
>    "v_mul_<iu>32_<iu>24_sdwa\t%0, %<e>1, %<e>2 src0_sel:WORD_0 src1_sel:WORD_0"
>    [(set_attr "type" "vop_sdwa")
>     (set_attr "length" "8")])
> @@ -1628,7 +1628,7 @@
>  	(mult:HI
>  	  (any_extend:HI (match_operand:QI 1 "register_operand" "%v"))
>  	  (any_extend:HI (match_operand:QI 2 "register_operand" " v"))))]
> -  ""
> +  "!TARGET_RDNA3"
>    "v_mul_<iu>32_<iu>24_sdwa\t%0, %<e>1, %<e>2 src0_sel:BYTE_0 src1_sel:BYTE_0"
>    [(set_attr "type" "vop_sdwa")
>     (set_attr "length" "8")])
> diff --git a/libgcc/config/gcn/amdgcn_veclib.h b/libgcc/config/gcn/amdgcn_veclib.h
> index 821f6386dd6..d268c6cac16 100644
> --- a/libgcc/config/gcn/amdgcn_veclib.h
> +++ b/libgcc/config/gcn/amdgcn_veclib.h
> @@ -230,7 +230,7 @@ do { \
>  
>  #if defined (__GCN3__) || defined (__GCN5__) \
>      || defined (__CDNA1__) || defined (__CDNA2__) \
> -    || defined (__RDNA2__)
> +    || defined (__RDNA2__) || defined (__RDNA3__)
>  #define CDNA3_PLUS 0
>  #else
>  #define CDNA3_PLUS 1
> diff --git a/libgomp/config/gcn/time.c b/libgomp/config/gcn/time.c
> index 30a0d0188e4..efcd04f5f43 100644
> --- a/libgomp/config/gcn/time.c
> +++ b/libgomp/config/gcn/time.c
> @@ -30,15 +30,25 @@
>  /* According to AMD:
>      dGPU RTC is 27MHz
>      AGPU RTC is 100MHz
> +    RDNA3 ISA manual states "typically 100MHz"
>     FIXME: DTRT on an APU.  */
> +#ifdef __RDNA3__
> +#define RTC_TICKS (1.0 / 100000000.0) /* 100MHz */
> +#else
>  #define RTC_TICKS (1.0 / 27000000.0) /* 27MHz */
> +#endif
>  
>  double
>  omp_get_wtime (void)
>  {
>    uint64_t clock;
> +#ifdef __RDNA3__
> +  asm ("s_sendmsg_rtn_b64 %0 0x83 ;Get REALTIME\n\t"
> +       "s_waitcnt 0" : "=r" (clock));
> +#else
>    asm ("s_memrealtime %0\n\t"
>         "s_waitcnt 0" : "=r" (clock));
> +#endif
>    return clock * RTC_TICKS;
>  }
>  
> diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
> index 0339848451e..db28781dedb 100644
> --- a/libgomp/plugin/plugin-gcn.c
> +++ b/libgomp/plugin/plugin-gcn.c
> @@ -1741,11 +1741,13 @@ max_isa_vgprs (int isa)
>      case EF_AMDGPU_MACH_AMDGCN_GFX900:
>      case EF_AMDGPU_MACH_AMDGCN_GFX906:
>      case EF_AMDGPU_MACH_AMDGCN_GFX908:
> -    case EF_AMDGPU_MACH_AMDGCN_GFX1030:
> -    case EF_AMDGPU_MACH_AMDGCN_GFX1100:
>        return 256;
>      case EF_AMDGPU_MACH_AMDGCN_GFX90a:
>        return 512;
> +    case EF_AMDGPU_MACH_AMDGCN_GFX1030:
> +      return 512;  /* 512 SIMD32 = 256 wavefrontsize64.  */
> +    case EF_AMDGPU_MACH_AMDGCN_GFX1100:
> +      return 1536; /* 1536 SIMD32 = 768 wavefrontsize64.  */
>      }
>    GOMP_PLUGIN_fatal ("unhandled ISA in max_isa_vgprs");
>  }
>
  
Richard Biener Jan. 26, 2024, 9:45 a.m. UTC | #2
On Fri, 26 Jan 2024, Richard Biener wrote:

> On Wed, 24 Jan 2024, Andrew Stubbs wrote:
> 
> > This is enough to get gfx1100 working for most purposes, on top of the
> > patch that Tobias committed a week or so ago; there are still some test
> > failures to investigate, and probably some tuning to do.
> > 
> > It might also get gfx1030 working too. @Richi, could you test it,
> > please?
> 
> I can report partial success here.  I do see quite some FAILs because of
> 
> /space/rguenther/src/gcc-autopar_devel/libgomp/testsuite/libgomp.fortran/examples-4/declare_target-4.f90: 
> In function 'accum_._omp_fn.1':^M
> /space/rguenther/src/gcc-autopar_devel/libgomp/testsuite/libgomp.fortran/examples-4/declare_target-4.f90:20:38: 
> error: unrecognizable insn:^M
> (insn 108 107 109 6 (set (reg:V8SF 849)^M
>         (unspec:V8SF [^M
>                 (reg:V8SF 844 [ vect__43.12_106 ]) repeated x2^M
>                 (const_int 1 [0x1])^M
>             ] UNSPEC_PLUS_DPP_SHR)) 
> "/space/rguenther/src/gcc-autopar_devel/libgomp/testsuite/libgomp.fortran/examples-4/declare_target-4.f90":22:29 
> discrim 1 -1^M
>      (nil))^M
> during RTL pass: vregs^M
> /space/rguenther/src/gcc-autopar_devel/libgomp/testsuite/libgomp.fortran/examples-4/declare_target-4.f90:20:38: 
> internal compiler error: in extract_insn, at recog.cc:2812^M
> 
> there are also quite a number of execution FAILs like
> 
> icv-5.exe: 
> /space/rguenther/src/gcc-autopar_devel/libgomp/plugin/plugin-gcn.c:2462: 
> isa_matches_agent: Assertion `agent_isa_s' failed.
> FAIL: libgomp.c/../libgomp.c-c++-common/icv-5.c execution test
> 
> (the assert in question looks bad - yeah, somehow we got past
> device initialization - not sure how - but end up here)
> 
> Maybe HSA behaves odd here - I didn't constrain the device it should
> choose but it works (most of the time).  GCN_DEBUG prints me all the
> HSA agents available but I don't see any debug on which agent
> is actually initialized during libgomp device init (at least nothing
> I can easily match up).  Maybe sth to improve.
> 
> I'll followup with a test summary once the (serial) run of libgomp
> testing finished.  At least there are quite some number of
> actual kernel executions and PASSing testcases.

                === libgomp Summary ===

# of expected passes            29126
# of unexpected failures        697
# of unexpected successes       1
# of expected failures          703
# of unresolved testcases       318
# of unsupported tests          766

full summary attached (compressed).  Even compressed libgomp.log is
too big to send.

Richard.

> 
> Richard.
> 
> > I can't test the other multilibs right now. @PA, can you test it please?
> > 
> > I can self-approve the patch, but I'll hold off the commit until the
> > test results come back.
> > 
> > Andrew
> > 
> > gcc/ChangeLog:
> > 
> > 	* config/gcn/gcn-opts.h (TARGET_PACKED_WORK_ITEMS): Add TARGET_RDNA3.
> > 	* config/gcn/gcn-valu.md (all_convert): New iterator.
> > 	(<convop><V_INT_1REG_ALT:mode><V_INT_1REG:mode>2<exec>): New
> > 	define_expand, and rename the old one to ...
> > 	(*<convop><V_INT_1REG_ALT:mode><V_INT_1REG:mode>_sdwa<exec>): ... this.
> > 	(extend<V_INT_1REG_ALT:mode><V_INT_1REG:mode>2<exec>): Likewise, to ...
> > 	(extend<V_INT_1REG_ALT:mode><V_INT_1REG:mode>_sdwa<exec>): .. this.
> > 	(*<convop><V_INT_1REG_ALT:mode><V_INT_1REG:mode>_shift<exec>): New.
> > 	* config/gcn/gcn.cc (gcn_global_address_p): Use "offsetbits" correctly.
> > 	(gcn_hsa_declare_function_name): Update the vgpr counting for gfx1100.
> > 	* config/gcn/gcn.md (<u>mulhisi3): Disable on RDNA3.
> > 	(<u>mulqihi3_scalar): Likewise.
> > 
> > libgcc/ChangeLog:
> > 
> > 	* config/gcn/amdgcn_veclib.h (CDNA3_PLUS): Handle RDNA3.
> > 
> > libgomp/ChangeLog:
> > 
> > 	* config/gcn/time.c (RTC_TICKS): Configure RDNA3.
> > 	(omp_get_wtime): Add RDNA3-compatible variant.
> > 	* plugin/plugin-gcn.c (max_isa_vgprs): Tune for gfx1030 and gfx1100.
> > 
> > Signed-off-by:  Andrew Stubbs <ams@baylibre.com>
> > ---
> >  gcc/config/gcn/gcn-opts.h         |  2 +-
> >  gcc/config/gcn/gcn-valu.md        | 41 ++++++++++++++++++++++++++++---
> >  gcc/config/gcn/gcn.cc             | 31 ++++++++++++++++-------
> >  gcc/config/gcn/gcn.md             |  4 +--
> >  libgcc/config/gcn/amdgcn_veclib.h |  2 +-
> >  libgomp/config/gcn/time.c         | 10 ++++++++
> >  libgomp/plugin/plugin-gcn.c       |  6 +++--
> >  7 files changed, 77 insertions(+), 19 deletions(-)
> > 
> > diff --git a/gcc/config/gcn/gcn-opts.h b/gcc/config/gcn/gcn-opts.h
> > index 79fbda3ab25..6be2c9204fa 100644
> > --- a/gcc/config/gcn/gcn-opts.h
> > +++ b/gcc/config/gcn/gcn-opts.h
> > @@ -62,7 +62,7 @@ extern enum gcn_isa {
> >  
> >  
> >  #define TARGET_M0_LDS_LIMIT (TARGET_GCN3)
> > -#define TARGET_PACKED_WORK_ITEMS (TARGET_CDNA2_PLUS)
> > +#define TARGET_PACKED_WORK_ITEMS (TARGET_CDNA2_PLUS || TARGET_RDNA3)
> >  
> >  #define TARGET_XNACK (flag_xnack != HSACO_ATTR_OFF)
> >  
> > diff --git a/gcc/config/gcn/gcn-valu.md b/gcc/config/gcn/gcn-valu.md
> > index 3d5b6271ee6..cd027f8b369 100644
> > --- a/gcc/config/gcn/gcn-valu.md
> > +++ b/gcc/config/gcn/gcn-valu.md
> > @@ -3555,30 +3555,63 @@
> >  ;; }}}
> >  ;; {{{ Int/int conversions
> >  
> > +(define_code_iterator all_convert [truncate zero_extend sign_extend])
> >  (define_code_iterator zero_convert [truncate zero_extend])
> >  (define_code_attr convop [
> >  	(sign_extend "extend")
> >  	(zero_extend "zero_extend")
> >  	(truncate "trunc")])
> >  
> > -(define_insn "<convop><V_INT_1REG_ALT:mode><V_INT_1REG:mode>2<exec>"
> > +(define_expand "<convop><V_INT_1REG_ALT:mode><V_INT_1REG:mode>2<exec>"
> > +  [(set (match_operand:V_INT_1REG 0 "register_operand"      "=v")
> > +        (all_convert:V_INT_1REG
> > +	  (match_operand:V_INT_1REG_ALT 1 "gcn_alu_operand" " v")))]
> > +  "")
> > +
> > +(define_insn "*<convop><V_INT_1REG_ALT:mode><V_INT_1REG:mode>_sdwa<exec>"
> >    [(set (match_operand:V_INT_1REG 0 "register_operand"      "=v")
> >          (zero_convert:V_INT_1REG
> >  	  (match_operand:V_INT_1REG_ALT 1 "gcn_alu_operand" " v")))]
> > -  ""
> > +  "!TARGET_RDNA3"
> >    "v_mov_b32_sdwa\t%0, %1 dst_sel:<V_INT_1REG:sdwa> dst_unused:UNUSED_PAD src0_sel:<V_INT_1REG_ALT:sdwa>"
> >    [(set_attr "type" "vop_sdwa")
> >     (set_attr "length" "8")])
> >  
> > -(define_insn "extend<V_INT_1REG_ALT:mode><V_INT_1REG:mode>2<exec>"
> > +(define_insn "extend<V_INT_1REG_ALT:mode><V_INT_1REG:mode>_sdwa<exec>"
> >    [(set (match_operand:V_INT_1REG 0 "register_operand"	    "=v")
> >          (sign_extend:V_INT_1REG
> >  	  (match_operand:V_INT_1REG_ALT 1 "gcn_alu_operand" " v")))]
> > -  ""
> > +  "!TARGET_RDNA3"
> >    "v_mov_b32_sdwa\t%0, sext(%1) src0_sel:<V_INT_1REG_ALT:sdwa>"
> >    [(set_attr "type" "vop_sdwa")
> >     (set_attr "length" "8")])
> >  
> > +(define_insn "*<convop><V_INT_1REG_ALT:mode><V_INT_1REG:mode>_shift<exec>"
> > +  [(set (match_operand:V_INT_1REG 0 "register_operand"      "=v")
> > +        (all_convert:V_INT_1REG
> > +	  (match_operand:V_INT_1REG_ALT 1 "gcn_alu_operand" " v")))]
> > +  "TARGET_RDNA3"
> > +  {
> > +    enum {extend, zero_extend, trunc};
> > +    rtx shiftwidth = (<V_INT_1REG_ALT:SCALAR_MODE>mode == QImode
> > +		      || <V_INT_1REG:SCALAR_MODE>mode == QImode
> > +		      ? GEN_INT (24)
> > +		      : <V_INT_1REG_ALT:SCALAR_MODE>mode == HImode
> > +		        || <V_INT_1REG:SCALAR_MODE>mode == HImode
> > +		      ? GEN_INT (16)
> > +		      : NULL);
> > +    operands[2] = shiftwidth;
> > +
> > +    if (!shiftwidth)
> > +      return "v_mov_b32 %0, %1";
> > +    else if (<convop> == extend || <convop> == trunc)
> > +      return "v_lshlrev_b32\t%0, %2, %1\;v_ashrrev_i32\t%0, %2, %0";
> > +    else
> > +      return "v_lshlrev_b32\t%0, %2, %1\;v_lshrrev_b32\t%0, %2, %0";
> > +  }
> > +  [(set_attr "type" "mult")
> > +   (set_attr "length" "8")])
> > +
> >  ;; GCC can already do these for scalar types, but not for vector types.
> >  ;; Unfortunately you can't just do SUBREG on a vector to select the low part,
> >  ;; so there must be a few tricks here.
> > diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc
> > index e668ce7c69e..e80de2ce056 100644
> > --- a/gcc/config/gcn/gcn.cc
> > +++ b/gcc/config/gcn/gcn.cc
> > @@ -1597,8 +1597,8 @@ gcn_global_address_p (rtx addr)
> >        rtx offset = XEXP (addr, 1);
> >        int offsetbits = (TARGET_RDNA2_PLUS ? 11 : 12);
> >        bool immediate_p = (CONST_INT_P (offset)
> > -			  && INTVAL (offset) >= -(1 << 12)
> > -			  && INTVAL (offset) < (1 << 12));
> > +			  && INTVAL (offset) >= -(1 << offsetbits)
> > +			  && INTVAL (offset) < (1 << offsetbits));
> >  
> >        if ((gcn_address_register_p (base, DImode, false)
> >  	   || gcn_vec_address_register_p (base, DImode, false))
> > @@ -6597,8 +6597,10 @@ gcn_hsa_declare_function_name (FILE *file, const char *name,
> >      if (df_regs_ever_live_p (FIRST_AVGPR_REG + avgpr))
> >        break;
> >    avgpr++;
> > -  vgpr = (vgpr + 3) & ~3;
> > -  avgpr = (avgpr + 3) & ~3;
> > +
> > +  /* The main function epilogue uses v8, but df doesn't see that.  */
> > +  if (vgpr < 9)
> > +    vgpr = 9;
> >  
> >    if (!leaf_function_p ())
> >      {
> > @@ -6611,9 +6613,18 @@ gcn_hsa_declare_function_name (FILE *file, const char *name,
> >  	avgpr = MAX_NORMAL_AVGPR_COUNT;
> >      }
> >  
> > -  /* The gfx90a accum_offset field can't represent 0 registers.  */
> > -  if (gcn_arch == PROCESSOR_GFX90a && vgpr < 4)
> > -    vgpr = 4;
> > +  /* SIMD32 devices count double in wavefront64 mode.  */
> > +  if (TARGET_RDNA2_PLUS)
> > +    vgpr *= 2;
> > +
> > +  /* Round up to the allocation block size.  */
> > +  int vgpr_block_size = (TARGET_RDNA3 ? 12
> > +			 : TARGET_RDNA2_PLUS || TARGET_CDNA2_PLUS ? 8
> > +			 : 4);
> > +  if (vgpr % vgpr_block_size)
> > +    vgpr += vgpr_block_size - (vgpr % vgpr_block_size);
> > +  if (avgpr % vgpr_block_size)
> > +    avgpr += vgpr_block_size - (avgpr % vgpr_block_size);
> >  
> >    fputs ("\t.rodata\n"
> >  	 "\t.p2align\t6\n"
> > @@ -6714,12 +6725,14 @@ gcn_hsa_declare_function_name (FILE *file, const char *name,
> >  	   "            .private_segment_fixed_size: 0\n"
> >  	   "            .wavefront_size: 64\n"
> >  	   "            .sgpr_count: %i\n"
> > -	   "            .vgpr_count: %i\n"
> > +	   "            .vgpr_count: %i%s\n"
> >  	   "            .max_flat_workgroup_size: 1024\n",
> >  	   cfun->machine->kernarg_segment_byte_size,
> >  	   cfun->machine->kernarg_segment_alignment,
> >  	   LDS_SIZE,
> > -	   sgpr, next_free_vgpr);
> > +	   sgpr, next_free_vgpr,
> > +	   (TARGET_RDNA2_PLUS ? " ; wavefrontsize64 counts double on SIMD32"
> > +	    : ""));
> >    if (gcn_arch == PROCESSOR_GFX90a || gcn_arch == PROCESSOR_GFX908)
> >      fprintf (file, "            .agpr_count: %i\n", avgpr);
> >    fputs ("        .end_amdgpu_metadata\n", file);
> > diff --git a/gcc/config/gcn/gcn.md b/gcc/config/gcn/gcn.md
> > index 492b833e255..1f3c692b7a6 100644
> > --- a/gcc/config/gcn/gcn.md
> > +++ b/gcc/config/gcn/gcn.md
> > @@ -1618,7 +1618,7 @@
> >  	(mult:SI
> >  	  (any_extend:SI (match_operand:HI 1 "register_operand" "%v"))
> >  	  (any_extend:SI (match_operand:HI 2 "register_operand" " v"))))]
> > -  ""
> > +  "!TARGET_RDNA3"
> >    "v_mul_<iu>32_<iu>24_sdwa\t%0, %<e>1, %<e>2 src0_sel:WORD_0 src1_sel:WORD_0"
> >    [(set_attr "type" "vop_sdwa")
> >     (set_attr "length" "8")])
> > @@ -1628,7 +1628,7 @@
> >  	(mult:HI
> >  	  (any_extend:HI (match_operand:QI 1 "register_operand" "%v"))
> >  	  (any_extend:HI (match_operand:QI 2 "register_operand" " v"))))]
> > -  ""
> > +  "!TARGET_RDNA3"
> >    "v_mul_<iu>32_<iu>24_sdwa\t%0, %<e>1, %<e>2 src0_sel:BYTE_0 src1_sel:BYTE_0"
> >    [(set_attr "type" "vop_sdwa")
> >     (set_attr "length" "8")])
> > diff --git a/libgcc/config/gcn/amdgcn_veclib.h b/libgcc/config/gcn/amdgcn_veclib.h
> > index 821f6386dd6..d268c6cac16 100644
> > --- a/libgcc/config/gcn/amdgcn_veclib.h
> > +++ b/libgcc/config/gcn/amdgcn_veclib.h
> > @@ -230,7 +230,7 @@ do { \
> >  
> >  #if defined (__GCN3__) || defined (__GCN5__) \
> >      || defined (__CDNA1__) || defined (__CDNA2__) \
> > -    || defined (__RDNA2__)
> > +    || defined (__RDNA2__) || defined (__RDNA3__)
> >  #define CDNA3_PLUS 0
> >  #else
> >  #define CDNA3_PLUS 1
> > diff --git a/libgomp/config/gcn/time.c b/libgomp/config/gcn/time.c
> > index 30a0d0188e4..efcd04f5f43 100644
> > --- a/libgomp/config/gcn/time.c
> > +++ b/libgomp/config/gcn/time.c
> > @@ -30,15 +30,25 @@
> >  /* According to AMD:
> >      dGPU RTC is 27MHz
> >      AGPU RTC is 100MHz
> > +    RDNA3 ISA manual states "typically 100MHz"
> >     FIXME: DTRT on an APU.  */
> > +#ifdef __RDNA3__
> > +#define RTC_TICKS (1.0 / 100000000.0) /* 100MHz */
> > +#else
> >  #define RTC_TICKS (1.0 / 27000000.0) /* 27MHz */
> > +#endif
> >  
> >  double
> >  omp_get_wtime (void)
> >  {
> >    uint64_t clock;
> > +#ifdef __RDNA3__
> > +  asm ("s_sendmsg_rtn_b64 %0 0x83 ;Get REALTIME\n\t"
> > +       "s_waitcnt 0" : "=r" (clock));
> > +#else
> >    asm ("s_memrealtime %0\n\t"
> >         "s_waitcnt 0" : "=r" (clock));
> > +#endif
> >    return clock * RTC_TICKS;
> >  }
> >  
> > diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
> > index 0339848451e..db28781dedb 100644
> > --- a/libgomp/plugin/plugin-gcn.c
> > +++ b/libgomp/plugin/plugin-gcn.c
> > @@ -1741,11 +1741,13 @@ max_isa_vgprs (int isa)
> >      case EF_AMDGPU_MACH_AMDGCN_GFX900:
> >      case EF_AMDGPU_MACH_AMDGCN_GFX906:
> >      case EF_AMDGPU_MACH_AMDGCN_GFX908:
> > -    case EF_AMDGPU_MACH_AMDGCN_GFX1030:
> > -    case EF_AMDGPU_MACH_AMDGCN_GFX1100:
> >        return 256;
> >      case EF_AMDGPU_MACH_AMDGCN_GFX90a:
> >        return 512;
> > +    case EF_AMDGPU_MACH_AMDGCN_GFX1030:
> > +      return 512;  /* 512 SIMD32 = 256 wavefrontsize64.  */
> > +    case EF_AMDGPU_MACH_AMDGCN_GFX1100:
> > +      return 1536; /* 1536 SIMD32 = 768 wavefrontsize64.  */
> >      }
> >    GOMP_PLUGIN_fatal ("unhandled ISA in max_isa_vgprs");
> >  }
> > 
> 
>
  
Andrew Stubbs Jan. 26, 2024, 10:19 a.m. UTC | #3
On 26/01/2024 09:45, Richard Biener wrote:
> On Fri, 26 Jan 2024, Richard Biener wrote:
> 
>                  === libgomp Summary ===
> 
> # of expected passes            29126
> # of unexpected failures        697
> # of unexpected successes       1
> # of expected failures          703
> # of unresolved testcases       318
> # of unsupported tests          766
> 
> full summary attached (compressed).  Even compressed libgomp.log is
> too big to send.
> 
> Richard.

I think this is good enough to start with. PA reported clean results for 
everything except gfx900 (looks like an unrelated issue).

I'll go ahead and commit the patch.

Hopefully Tobias's patch has already trimmed all the "-g" failures from 
that list.

Andrew
  
Richard Biener Jan. 26, 2024, 10:22 a.m. UTC | #4
On Fri, 26 Jan 2024, Andrew Stubbs wrote:

> On 26/01/2024 09:45, Richard Biener wrote:
> > On Fri, 26 Jan 2024, Richard Biener wrote:
> > 
> >                  === libgomp Summary ===
> > 
> > # of expected passes            29126
> > # of unexpected failures        697
> > # of unexpected successes       1
> > # of expected failures          703
> > # of unresolved testcases       318
> > # of unsupported tests          766
> > 
> > full summary attached (compressed).  Even compressed libgomp.log is
> > too big to send.
> > 
> > Richard.
> 
> I think this is good enough to start with. PA reported clean results for
> everything except gfx900 (looks like an unrelated issue).
> 
> I'll go ahead and commit the patch.
> 
> Hopefully Tobias's patch has already trimmed all the "-g" failures from that
> list.

Should I open a bug for the ICE?  That's responsible for quite a number
of failures as well.

Richard.
  
Andrew Stubbs Jan. 26, 2024, 10:31 a.m. UTC | #5
On 26/01/2024 10:22, Richard Biener wrote:
> On Fri, 26 Jan 2024, Andrew Stubbs wrote:
> 
>> On 26/01/2024 09:45, Richard Biener wrote:
>>> On Fri, 26 Jan 2024, Richard Biener wrote:
>>>
>>>                   === libgomp Summary ===
>>>
>>> # of expected passes            29126
>>> # of unexpected failures        697
>>> # of unexpected successes       1
>>> # of expected failures          703
>>> # of unresolved testcases       318
>>> # of unsupported tests          766
>>>
>>> full summary attached (compressed).  Even compressed libgomp.log is
>>> too big to send.
>>>
>>> Richard.
>>
>> I think this is good enough to start with. PA reported clean results for
>> everything except gfx900 (looks like an unrelated issue).
>>
>> I'll go ahead and commit the patch.
>>
>> Hopefully Tobias's patch has already trimmed all the "-g" failures from that
>> list.
> 
> Should I open a bug for the ICE?  That's responsible for quite a number
> of failures as well.

The broken vector reduction instruction? It's a known issue (RDNA 
doesn't support those instructions anymore, and somehow disabling the 
insn isn't enough to stop them being generated), but it doesn't have a 
tracking number, so why not?

Thanks

Andrew
  
Thomas Schwinge Feb. 1, 2024, 2:41 p.m. UTC | #6
Hi!

On 2024-01-26T10:45:10+0100, Richard Biener <rguenther@suse.de> wrote:
> On Fri, 26 Jan 2024, Richard Biener wrote:
>> On Wed, 24 Jan 2024, Andrew Stubbs wrote:
>> > [...] is enough to get gfx1100 working for most purposes, on top of the
>> > patch that Tobias committed a week or so ago; there are still some test
>> > failures to investigate, and probably some tuning to do.
>> > 
>> > It might also get gfx1030 working too. @Richi, could you test it,
>> > please?
>> 
>> I can report partial success here.  [...]

>> I'll followup with a test summary once the (serial) run of libgomp
>> testing finished.

(Why serial, by the way?)

>> At least there are quite some number of
>> actual kernel executions and PASSing testcases.
>
>                 === libgomp Summary ===
>
> # of expected passes            29126
> # of unexpected failures        697
> # of unexpected successes       1
> # of expected failures          703
> # of unresolved testcases       318
> # of unsupported tests          766
>
> full summary attached (compressed).

Compating your old results ('|     ' prefix in the following) with what I
got with '-march=gfx1100' for AMD Radeon RX 7900 XTX.  My GCC sources are
a few weeks old, but have all the recent fix-up commits cherry-picked,
and a work-around applied for:

    /tmp/ccfrKwEK.mkoffload.2.s:29:27: error: value out of range
              .amdhsa_next_free_vgpr        516
                                            ^~~

(..., to be discussed later.)

There are, I think, no compilation FAILs anymore; I'm only commenting on
execution test FAILs.  Not all FAILs appear all the time (so it follows
that I may be missing a few), and 'libgomp.c++/../libgomp.c-c++-common'
generally behaves similar to 'libgomp.c/../libgomp.c-c++-common', so
omitting the former here.

|     FAIL: libgomp.c/../libgomp.c-c++-common/error-1.c output pattern test

Not seeing that FAIL.

I also see 'libgomp.c-c++-common/for-5.c' FAIL.

|     FAIL: libgomp.c/../libgomp.c-c++-common/icv-5.c execution test
|     FAIL: libgomp.c/../libgomp.c-c++-common/icv-6.c execution test
|     FAIL: libgomp.c/../libgomp.c-c++-common/icv-7.c execution test
|     FAIL: libgomp.c/../libgomp.c-c++-common/icv-9.c execution test

I confirm 'libgomp.c-c++-common/icv-5.c', 'libgomp.c-c++-common/icv-9.c'
FAIL, but 'libgomp.c-c++-common/icv-6.c', 'libgomp.c-c++-common/icv-7.c'
PASS.

|     FAIL: libgomp.c/../libgomp.c-c++-common/non-rect-loop-1.c execution test

Not seeing that FAIL.

|     FAIL: libgomp.c/../libgomp.c-c++-common/reduction-6.c execution test

I confirm that FAIL, and also 'libgomp.c-c++-common/reduction-5.c'
occasionally.

|     FAIL: libgomp.c/../libgomp.c-c++-common/requires-unified-addr-1.c execution test
|     FAIL: libgomp.c/../libgomp.c-c++-common/requires-unified-addr-2.c execution test
|     FAIL: libgomp.c/../libgomp.c-c++-common/target-45.c execution test
|     FAIL: libgomp.c/../libgomp.c-c++-common/target-implicit-map-3.c execution test
|     FAIL: libgomp.c/../libgomp.c-c++-common/target-is-accessible-1.c execution test

Not seeing these FAILs.

I also see 'libgomp.c-c++-common/reverse-offload-1.c' FAIL.

|     FAIL: libgomp.c/../libgomp.c-c++-common/task-detach-6.c execution test
|     WARNING: program timed out.
|     FAIL: libgomp.c/../libgomp.c-c++-common/task-in-explicit-1.c execution test

I confirm these FAILs.

|     FAIL: libgomp.c/../libgomp.c-c++-common/teams-2.c execution test

Known FAIL.

|     FAIL: libgomp.c/../libgomp.c-c++-common/teams-nteams-icv-1.c execution test
|     FAIL: libgomp.c/../libgomp.c-c++-common/teams-nteams-icv-2.c execution test
|     FAIL: libgomp.c/../libgomp.c-c++-common/teams-nteams-icv-3.c execution test
|     FAIL: libgomp.c/../libgomp.c-c++-common/teams-nteams-icv-4.c execution test
|     FAIL: libgomp.c/declare-variant-4-gfx900.c (test for excess errors)
|     FAIL: libgomp.c/declare-variant-4-gfx906.c (test for excess errors)
|     FAIL: libgomp.c/declare-variant-4-gfx908.c (test for excess errors)
|     FAIL: libgomp.c/declare-variant-4-gfx90a.c (test for excess errors)
|     FAIL: libgomp.c/declare-variant-4.c execution test
|     FAIL: libgomp.c/declare-variant-4.c scan-amdgcn-amdhsa-offload-tree-dump optimized "= gfx[^ ]+ \\\\(\\\\);"
|     FAIL: libgomp.c/examples-4/device-2.c execution test
|     WARNING: program timed out.

Not seeing these FAILs.

I also see 'libgomp.c/examples-4/teams-4.c', 'libgomp.c/target-31.c' FAIL.

|     FAIL: libgomp.c/target-teams-1.c execution test

I confirm this FAIL.

|     FAIL: libgomp.fortran/[...] execution test

You had a lot of FAILs there.  I only see the following:

|     FAIL: libgomp.fortran/examples-4/teams-2.f90   -O0  execution test
|     [...]

|     FAIL: libgomp.fortran/examples-4/teams-4.f90   -O0  execution test
|     [...]

|     FAIL: libgomp.fortran/icv-6.f90   -O  execution test

|     FAIL: libgomp.fortran/reverse-offload-1.f90   -O2  execution test
|     FAIL: libgomp.fortran/reverse-offload-1.f90   -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions  execution test
|     FAIL: libgomp.fortran/reverse-offload-1.f90   -O3 -g  (test for excess errors)
|     UNRESOLVED: libgomp.fortran/reverse-offload-1.f90   -O3 -g  compilation failed to produce executable

|     FAIL: libgomp.fortran/reverse-offload-3.f90   -O  execution test
|     FAIL: libgomp.fortran/reverse-offload-4.f90   -O  execution test

|     FAIL: libgomp.fortran/task-detach-6.f90   -O0  execution test
|     [...]

|     FAIL: libgomp.fortran/task-in-explicit-1.f90   -O0  execution test
|     [...]

You had a lot of FAILs for 'libgomp.oacc-c', 'libgomp.oacc-c++',
'libgomp.oacc-fortran'.  For me:

|     FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/vector-length-128-10.c -DACC_DEVICE_TYPE_radeon=1 -DACC_MEM_SHARED=0 -foffload=amdgcn-amdhsa  -O2  execution test

I confirm this FAIL (also 'libgomp.oacc-c++').

|     FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/vprop.c -DACC_DEVICE_TYPE_radeon=1 -DACC_MEM_SHARED=0 -foffload=amdgcn-amdhsa  -O2  (test for excess errors)

Known FAIL (also 'libgomp.oacc-c++').

|     FAIL: libgomp.oacc-fortran/reduction-5.f90 -DACC_DEVICE_TYPE_radeon=1 -DACC_MEM_SHARED=0 -foffload=amdgcn-amdhsa  -O0  execution test

I confirm this FAIL.


So, "not bad!", but also still some work to be done.  :-)


Grüße
 Thomas
  
Richard Biener Feb. 1, 2024, 2:49 p.m. UTC | #7
On Thu, 1 Feb 2024, Thomas Schwinge wrote:

> Hi!
> 
> On 2024-01-26T10:45:10+0100, Richard Biener <rguenther@suse.de> wrote:
> > On Fri, 26 Jan 2024, Richard Biener wrote:
> >> On Wed, 24 Jan 2024, Andrew Stubbs wrote:
> >> > [...] is enough to get gfx1100 working for most purposes, on top of the
> >> > patch that Tobias committed a week or so ago; there are still some test
> >> > failures to investigate, and probably some tuning to do.
> >> > 
> >> > It might also get gfx1030 working too. @Richi, could you test it,
> >> > please?
> >> 
> >> I can report partial success here.  [...]
> 
> >> I'll followup with a test summary once the (serial) run of libgomp
> >> testing finished.
> 
> (Why serial, by the way?)

Just out of caution ... (I'm using the GPU for the desktop at the
same time and dmesg gets spammed with some not-so reassuring
"errors" during the offloading)

> >> At least there are quite some number of
> >> actual kernel executions and PASSing testcases.
> >
> >                 === libgomp Summary ===
> >
> > # of expected passes            29126
> > # of unexpected failures        697
> > # of unexpected successes       1
> > # of expected failures          703
> > # of unresolved testcases       318
> > # of unsupported tests          766
> >
> > full summary attached (compressed).
> 
> Compating your old results ('|     ' prefix in the following) with what I
> got with '-march=gfx1100' for AMD Radeon RX 7900 XTX.  My GCC sources are
> a few weeks old, but have all the recent fix-up commits cherry-picked,
> and a work-around applied for:
> 
>     /tmp/ccfrKwEK.mkoffload.2.s:29:27: error: value out of range
>               .amdhsa_next_free_vgpr        516
>                                             ^~~
> 
> (..., to be discussed later.)
> 
> There are, I think, no compilation FAILs anymore; I'm only commenting on
> execution test FAILs.  Not all FAILs appear all the time (so it follows
> that I may be missing a few), and 'libgomp.c++/../libgomp.c-c++-common'
> generally behaves similar to 'libgomp.c/../libgomp.c-c++-common', so
> omitting the former here.
> 
> |     FAIL: libgomp.c/../libgomp.c-c++-common/error-1.c output pattern test
> 
> Not seeing that FAIL.
> 
> I also see 'libgomp.c-c++-common/for-5.c' FAIL.
> 
> |     FAIL: libgomp.c/../libgomp.c-c++-common/icv-5.c execution test
> |     FAIL: libgomp.c/../libgomp.c-c++-common/icv-6.c execution test
> |     FAIL: libgomp.c/../libgomp.c-c++-common/icv-7.c execution test
> |     FAIL: libgomp.c/../libgomp.c-c++-common/icv-9.c execution test
> 
> I confirm 'libgomp.c-c++-common/icv-5.c', 'libgomp.c-c++-common/icv-9.c'
> FAIL, but 'libgomp.c-c++-common/icv-6.c', 'libgomp.c-c++-common/icv-7.c'
> PASS.
> 
> |     FAIL: libgomp.c/../libgomp.c-c++-common/non-rect-loop-1.c execution test
> 
> Not seeing that FAIL.
> 
> |     FAIL: libgomp.c/../libgomp.c-c++-common/reduction-6.c execution test
> 
> I confirm that FAIL, and also 'libgomp.c-c++-common/reduction-5.c'
> occasionally.
> 
> |     FAIL: libgomp.c/../libgomp.c-c++-common/requires-unified-addr-1.c execution test
> |     FAIL: libgomp.c/../libgomp.c-c++-common/requires-unified-addr-2.c execution test
> |     FAIL: libgomp.c/../libgomp.c-c++-common/target-45.c execution test
> |     FAIL: libgomp.c/../libgomp.c-c++-common/target-implicit-map-3.c execution test
> |     FAIL: libgomp.c/../libgomp.c-c++-common/target-is-accessible-1.c execution test
> 
> Not seeing these FAILs.
> 
> I also see 'libgomp.c-c++-common/reverse-offload-1.c' FAIL.
> 
> |     FAIL: libgomp.c/../libgomp.c-c++-common/task-detach-6.c execution test
> |     WARNING: program timed out.
> |     FAIL: libgomp.c/../libgomp.c-c++-common/task-in-explicit-1.c execution test
> 
> I confirm these FAILs.
> 
> |     FAIL: libgomp.c/../libgomp.c-c++-common/teams-2.c execution test
> 
> Known FAIL.
> 
> |     FAIL: libgomp.c/../libgomp.c-c++-common/teams-nteams-icv-1.c execution test
> |     FAIL: libgomp.c/../libgomp.c-c++-common/teams-nteams-icv-2.c execution test
> |     FAIL: libgomp.c/../libgomp.c-c++-common/teams-nteams-icv-3.c execution test
> |     FAIL: libgomp.c/../libgomp.c-c++-common/teams-nteams-icv-4.c execution test
> |     FAIL: libgomp.c/declare-variant-4-gfx900.c (test for excess errors)
> |     FAIL: libgomp.c/declare-variant-4-gfx906.c (test for excess errors)
> |     FAIL: libgomp.c/declare-variant-4-gfx908.c (test for excess errors)
> |     FAIL: libgomp.c/declare-variant-4-gfx90a.c (test for excess errors)
> |     FAIL: libgomp.c/declare-variant-4.c execution test
> |     FAIL: libgomp.c/declare-variant-4.c scan-amdgcn-amdhsa-offload-tree-dump optimized "= gfx[^ ]+ \\\\(\\\\);"
> |     FAIL: libgomp.c/examples-4/device-2.c execution test
> |     WARNING: program timed out.
> 
> Not seeing these FAILs.
> 
> I also see 'libgomp.c/examples-4/teams-4.c', 'libgomp.c/target-31.c' FAIL.
> 
> |     FAIL: libgomp.c/target-teams-1.c execution test
> 
> I confirm this FAIL.
> 
> |     FAIL: libgomp.fortran/[...] execution test
> 
> You had a lot of FAILs there.  I only see the following:
> 
> |     FAIL: libgomp.fortran/examples-4/teams-2.f90   -O0  execution test
> |     [...]
> 
> |     FAIL: libgomp.fortran/examples-4/teams-4.f90   -O0  execution test
> |     [...]
> 
> |     FAIL: libgomp.fortran/icv-6.f90   -O  execution test
> 
> |     FAIL: libgomp.fortran/reverse-offload-1.f90   -O2  execution test
> |     FAIL: libgomp.fortran/reverse-offload-1.f90   -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions  execution test
> |     FAIL: libgomp.fortran/reverse-offload-1.f90   -O3 -g  (test for excess errors)
> |     UNRESOLVED: libgomp.fortran/reverse-offload-1.f90   -O3 -g  compilation failed to produce executable
> 
> |     FAIL: libgomp.fortran/reverse-offload-3.f90   -O  execution test
> |     FAIL: libgomp.fortran/reverse-offload-4.f90   -O  execution test
> 
> |     FAIL: libgomp.fortran/task-detach-6.f90   -O0  execution test
> |     [...]
> 
> |     FAIL: libgomp.fortran/task-in-explicit-1.f90   -O0  execution test
> |     [...]
> 
> You had a lot of FAILs for 'libgomp.oacc-c', 'libgomp.oacc-c++',
> 'libgomp.oacc-fortran'.  For me:
> 
> |     FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/vector-length-128-10.c -DACC_DEVICE_TYPE_radeon=1 -DACC_MEM_SHARED=0 -foffload=amdgcn-amdhsa  -O2  execution test
> 
> I confirm this FAIL (also 'libgomp.oacc-c++').
> 
> |     FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/vprop.c -DACC_DEVICE_TYPE_radeon=1 -DACC_MEM_SHARED=0 -foffload=amdgcn-amdhsa  -O2  (test for excess errors)
> 
> Known FAIL (also 'libgomp.oacc-c++').
> 
> |     FAIL: libgomp.oacc-fortran/reduction-5.f90 -DACC_DEVICE_TYPE_radeon=1 -DACC_MEM_SHARED=0 -foffload=amdgcn-amdhsa  -O0  execution test
> 
> I confirm this FAIL.
> 
> 
> So, "not bad!", but also still some work to be done.  :-)

I'm going to re-do the testing with all the fixes in on Monday and
will report back.

Richard.
  
Thomas Schwinge Feb. 21, 2024, 12:34 p.m. UTC | #8
Hi!

On 2024-02-01T15:49:02+0100, Richard Biener <rguenther@suse.de> wrote:
> On Thu, 1 Feb 2024, Thomas Schwinge wrote:
>> On 2024-01-26T10:45:10+0100, Richard Biener <rguenther@suse.de> wrote:
>> > On Fri, 26 Jan 2024, Richard Biener wrote:
>> >> On Wed, 24 Jan 2024, Andrew Stubbs wrote:
>> >> > [...] is enough to get gfx1100 working for most purposes, on top of the
>> >> > patch that Tobias committed a week or so ago; there are still some test
>> >> > failures to investigate, and probably some tuning to do.
>> >> > 
>> >> > It might also get gfx1030 working too. @Richi, could you test it,
>> >> > please?
>> >> 
>> >> I can report partial success here.  [...]
>> 
>> >> I'll followup with a test summary once the (serial) run of libgomp
>> >> testing finished.
>> 
>> (Why serial, by the way?)
>
> Just out of caution ... (I'm using the GPU for the desktop at the
> same time and dmesg gets spammed with some not-so reassuring
> "errors" during the offloading)

Yeah, indeed 'dmesg' is full of "notes"...

However, note that per my work on <https://gcc.gnu.org/PR66005>
"libgomp make check time is excessive", all execution testing in libgomp
is serialized in 'libgomp/testsuite/lib/libgomp.exp:libgomp_load'.  So,
no problem/difference in that regard, to run parallel
'check-target-libgomp'.  (... with the caveat that execution tests for
effective-targets are *not* governed by that, as I've found yesterday.
I have a WIP hack for that, too.)


>> [...] what I
>> got with '-march=gfx1100' for AMD Radeon RX 7900 XTX.  [...]

>> [...] execution test FAILs.  Not all FAILs appear all the time [...]

What disturbs the testing a lot is, that the GPU may get into a bad
state, upon which any use either fails with a
'HSA_STATUS_ERROR_OUT_OF_RESOURCES' error -- or by just hanging, deep in
'libhsa-runtime64.so.1'...

I've now tried to debug the latter case (hang).  When the GPU gets into
this bad state (whatever exactly that is),
'hsa_executable_load_code_object' still returns 'HSA_STATUS_SUCCESS', but
then GCN target execution ('gcn-run') hangs in 'hsa_executable_freeze'
vs. GCN offloading execution ('libgomp-plugin-gcn.so.1') hangs right
before 'hsa_executable_freeze', in the GCN heap setup 'hsa_memory_copy'.
There it hangs until killed (for example, until DejaGnu's timeout
mechanism kills the process -- just that the next GPU-using execution
test then runs into the same thing again...).

In this state (and also the 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' state),
we're able to recover via:

    $ flock /tmp/gpu.lock sudo cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover
    0

This is, obviously, a hack, probably needs a serial lock to not disturb
other things, has hard-coded 'dri/0', and as I said in
<https://inbox.sourceware.org/87plww8qin.fsf@euler.schwinge.ddns.net>
"GCN RDNA2+ vs. GCC SLP vectorizer":

| I've no idea what
| 'amdgpu_gpu_recover' would do if the GPU is also used for display.

However, it's very useful in my testing.  :-|

The questions is, how to detect the "hang" state without first running
into a timeout (and disambiguating such a timeout from a user code
timeout)?  Add a watchdog: call 'alarm([a few seconds])' before device
initialization, and before the actual GPU kernel launch cancel it with
'alarm(0)'?  (..., and add a handler for 'SIGALRM' to print a distinct
error message that we can then react on, like for
'HSA_STATUS_ERROR_OUT_OF_RESOURCES'.)  Probably 'alarm'/'SIGALRM' is a
no-go in libgomp -- instead, use a helper thread to similarly implement a
watchdog?  ('libgomp/plugin/plugin-gcn.c' already is using pthreads for
other purposes.)  Any other clever ideas?  What's a suitable value for
"a few seconds"?


Grüße
 Thomas
  
Richard Biener Feb. 21, 2024, 4:32 p.m. UTC | #9
> Am 21.02.2024 um 13:34 schrieb Thomas Schwinge <tschwinge@baylibre.com>:
> 
> Hi!
> 
>> On 2024-02-01T15:49:02+0100, Richard Biener <rguenther@suse.de> wrote:
>>> On Thu, 1 Feb 2024, Thomas Schwinge wrote:
>>> On 2024-01-26T10:45:10+0100, Richard Biener <rguenther@suse.de> wrote:
>>>> On Fri, 26 Jan 2024, Richard Biener wrote:
>>>>> On Wed, 24 Jan 2024, Andrew Stubbs wrote:
>>>>>> [...] is enough to get gfx1100 working for most purposes, on top of the
>>>>>> patch that Tobias committed a week or so ago; there are still some test
>>>>>> failures to investigate, and probably some tuning to do.
>>>>>> 
>>>>>> It might also get gfx1030 working too. @Richi, could you test it,
>>>>>> please?
>>>>> 
>>>>> I can report partial success here.  [...]
>>> 
>>>>> I'll followup with a test summary once the (serial) run of libgomp
>>>>> testing finished.
>>> 
>>> (Why serial, by the way?)
>> 
>> Just out of caution ... (I'm using the GPU for the desktop at the
>> same time and dmesg gets spammed with some not-so reassuring
>> "errors" during the offloading)
> 
> Yeah, indeed 'dmesg' is full of "notes"...
> 
> However, note that per my work on <https://gcc.gnu.org/PR66005>
> "libgomp make check time is excessive", all execution testing in libgomp
> is serialized in 'libgomp/testsuite/lib/libgomp.exp:libgomp_load'.  So,
> no problem/difference in that regard, to run parallel
> 'check-target-libgomp'.  (... with the caveat that execution tests for
> effective-targets are *not* governed by that, as I've found yesterday.
> I have a WIP hack for that, too.)
> 
> 
>>> [...] what I
>>> got with '-march=gfx1100' for AMD Radeon RX 7900 XTX.  [...]
> 
>>> [...] execution test FAILs.  Not all FAILs appear all the time [...]
> 
> What disturbs the testing a lot is, that the GPU may get into a bad
> state, upon which any use either fails with a
> 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' error -- or by just hanging, deep in
> 'libhsa-runtime64.so.1'...
> 
> I've now tried to debug the latter case (hang).  When the GPU gets into
> this bad state (whatever exactly that is),
> 'hsa_executable_load_code_object' still returns 'HSA_STATUS_SUCCESS', but
> then GCN target execution ('gcn-run') hangs in 'hsa_executable_freeze'
> vs. GCN offloading execution ('libgomp-plugin-gcn.so.1') hangs right
> before 'hsa_executable_freeze', in the GCN heap setup 'hsa_memory_copy'.
> There it hangs until killed (for example, until DejaGnu's timeout
> mechanism kills the process -- just that the next GPU-using execution
> test then runs into the same thing again...).
> 
> In this state (and also the 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' state),
> we're able to recover via:
> 
>    $ flock /tmp/gpu.lock sudo cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover
>    0
> 
> This is, obviously, a hack, probably needs a serial lock to not disturb
> other things, has hard-coded 'dri/0', and as I said in
> <https://inbox.sourceware.org/87plww8qin.fsf@euler.schwinge.ddns.net>
> "GCN RDNA2+ vs. GCC SLP vectorizer":
> 
> | I've no idea what
> | 'amdgpu_gpu_recover' would do if the GPU is also used for display.

It ends up terminating your X session… (there’s some automatic driver recovery that’s also sometimes triggered which sounds like the same thing).  I need to try using the integrated graphics for X11 to see if that avoids the issue.

Guess AMD needs to improve the driver/runtime (or we - it’s open source at least up to the firmware).

Richard 

> However, it's very useful in my testing.  :-|
> 
> The questions is, how to detect the "hang" state without first running
> into a timeout (and disambiguating such a timeout from a user code
> timeout)?  Add a watchdog: call 'alarm([a few seconds])' before device
> initialization, and before the actual GPU kernel launch cancel it with
> 'alarm(0)'?  (..., and add a handler for 'SIGALRM' to print a distinct
> error message that we can then react on, like for
> 'HSA_STATUS_ERROR_OUT_OF_RESOURCES'.)  Probably 'alarm'/'SIGALRM' is a
> no-go in libgomp -- instead, use a helper thread to similarly implement a
> watchdog?  ('libgomp/plugin/plugin-gcn.c' already is using pthreads for
> other purposes.)  Any other clever ideas?  What's a suitable value for
> "a few seconds"?
> 
> 
> Grüße
> Thomas
  

Patch

diff --git a/gcc/config/gcn/gcn-opts.h b/gcc/config/gcn/gcn-opts.h
index 79fbda3ab25..6be2c9204fa 100644
--- a/gcc/config/gcn/gcn-opts.h
+++ b/gcc/config/gcn/gcn-opts.h
@@ -62,7 +62,7 @@  extern enum gcn_isa {
 
 
 #define TARGET_M0_LDS_LIMIT (TARGET_GCN3)
-#define TARGET_PACKED_WORK_ITEMS (TARGET_CDNA2_PLUS)
+#define TARGET_PACKED_WORK_ITEMS (TARGET_CDNA2_PLUS || TARGET_RDNA3)
 
 #define TARGET_XNACK (flag_xnack != HSACO_ATTR_OFF)
 
diff --git a/gcc/config/gcn/gcn-valu.md b/gcc/config/gcn/gcn-valu.md
index 3d5b6271ee6..cd027f8b369 100644
--- a/gcc/config/gcn/gcn-valu.md
+++ b/gcc/config/gcn/gcn-valu.md
@@ -3555,30 +3555,63 @@ 
 ;; }}}
 ;; {{{ Int/int conversions
 
+(define_code_iterator all_convert [truncate zero_extend sign_extend])
 (define_code_iterator zero_convert [truncate zero_extend])
 (define_code_attr convop [
 	(sign_extend "extend")
 	(zero_extend "zero_extend")
 	(truncate "trunc")])
 
-(define_insn "<convop><V_INT_1REG_ALT:mode><V_INT_1REG:mode>2<exec>"
+(define_expand "<convop><V_INT_1REG_ALT:mode><V_INT_1REG:mode>2<exec>"
+  [(set (match_operand:V_INT_1REG 0 "register_operand"      "=v")
+        (all_convert:V_INT_1REG
+	  (match_operand:V_INT_1REG_ALT 1 "gcn_alu_operand" " v")))]
+  "")
+
+(define_insn "*<convop><V_INT_1REG_ALT:mode><V_INT_1REG:mode>_sdwa<exec>"
   [(set (match_operand:V_INT_1REG 0 "register_operand"      "=v")
         (zero_convert:V_INT_1REG
 	  (match_operand:V_INT_1REG_ALT 1 "gcn_alu_operand" " v")))]
-  ""
+  "!TARGET_RDNA3"
   "v_mov_b32_sdwa\t%0, %1 dst_sel:<V_INT_1REG:sdwa> dst_unused:UNUSED_PAD src0_sel:<V_INT_1REG_ALT:sdwa>"
   [(set_attr "type" "vop_sdwa")
    (set_attr "length" "8")])
 
-(define_insn "extend<V_INT_1REG_ALT:mode><V_INT_1REG:mode>2<exec>"
+(define_insn "extend<V_INT_1REG_ALT:mode><V_INT_1REG:mode>_sdwa<exec>"
   [(set (match_operand:V_INT_1REG 0 "register_operand"	    "=v")
         (sign_extend:V_INT_1REG
 	  (match_operand:V_INT_1REG_ALT 1 "gcn_alu_operand" " v")))]
-  ""
+  "!TARGET_RDNA3"
   "v_mov_b32_sdwa\t%0, sext(%1) src0_sel:<V_INT_1REG_ALT:sdwa>"
   [(set_attr "type" "vop_sdwa")
    (set_attr "length" "8")])
 
+(define_insn "*<convop><V_INT_1REG_ALT:mode><V_INT_1REG:mode>_shift<exec>"
+  [(set (match_operand:V_INT_1REG 0 "register_operand"      "=v")
+        (all_convert:V_INT_1REG
+	  (match_operand:V_INT_1REG_ALT 1 "gcn_alu_operand" " v")))]
+  "TARGET_RDNA3"
+  {
+    enum {extend, zero_extend, trunc};
+    rtx shiftwidth = (<V_INT_1REG_ALT:SCALAR_MODE>mode == QImode
+		      || <V_INT_1REG:SCALAR_MODE>mode == QImode
+		      ? GEN_INT (24)
+		      : <V_INT_1REG_ALT:SCALAR_MODE>mode == HImode
+		        || <V_INT_1REG:SCALAR_MODE>mode == HImode
+		      ? GEN_INT (16)
+		      : NULL);
+    operands[2] = shiftwidth;
+
+    if (!shiftwidth)
+      return "v_mov_b32 %0, %1";
+    else if (<convop> == extend || <convop> == trunc)
+      return "v_lshlrev_b32\t%0, %2, %1\;v_ashrrev_i32\t%0, %2, %0";
+    else
+      return "v_lshlrev_b32\t%0, %2, %1\;v_lshrrev_b32\t%0, %2, %0";
+  }
+  [(set_attr "type" "mult")
+   (set_attr "length" "8")])
+
 ;; GCC can already do these for scalar types, but not for vector types.
 ;; Unfortunately you can't just do SUBREG on a vector to select the low part,
 ;; so there must be a few tricks here.
diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc
index e668ce7c69e..e80de2ce056 100644
--- a/gcc/config/gcn/gcn.cc
+++ b/gcc/config/gcn/gcn.cc
@@ -1597,8 +1597,8 @@  gcn_global_address_p (rtx addr)
       rtx offset = XEXP (addr, 1);
       int offsetbits = (TARGET_RDNA2_PLUS ? 11 : 12);
       bool immediate_p = (CONST_INT_P (offset)
-			  && INTVAL (offset) >= -(1 << 12)
-			  && INTVAL (offset) < (1 << 12));
+			  && INTVAL (offset) >= -(1 << offsetbits)
+			  && INTVAL (offset) < (1 << offsetbits));
 
       if ((gcn_address_register_p (base, DImode, false)
 	   || gcn_vec_address_register_p (base, DImode, false))
@@ -6597,8 +6597,10 @@  gcn_hsa_declare_function_name (FILE *file, const char *name,
     if (df_regs_ever_live_p (FIRST_AVGPR_REG + avgpr))
       break;
   avgpr++;
-  vgpr = (vgpr + 3) & ~3;
-  avgpr = (avgpr + 3) & ~3;
+
+  /* The main function epilogue uses v8, but df doesn't see that.  */
+  if (vgpr < 9)
+    vgpr = 9;
 
   if (!leaf_function_p ())
     {
@@ -6611,9 +6613,18 @@  gcn_hsa_declare_function_name (FILE *file, const char *name,
 	avgpr = MAX_NORMAL_AVGPR_COUNT;
     }
 
-  /* The gfx90a accum_offset field can't represent 0 registers.  */
-  if (gcn_arch == PROCESSOR_GFX90a && vgpr < 4)
-    vgpr = 4;
+  /* SIMD32 devices count double in wavefront64 mode.  */
+  if (TARGET_RDNA2_PLUS)
+    vgpr *= 2;
+
+  /* Round up to the allocation block size.  */
+  int vgpr_block_size = (TARGET_RDNA3 ? 12
+			 : TARGET_RDNA2_PLUS || TARGET_CDNA2_PLUS ? 8
+			 : 4);
+  if (vgpr % vgpr_block_size)
+    vgpr += vgpr_block_size - (vgpr % vgpr_block_size);
+  if (avgpr % vgpr_block_size)
+    avgpr += vgpr_block_size - (avgpr % vgpr_block_size);
 
   fputs ("\t.rodata\n"
 	 "\t.p2align\t6\n"
@@ -6714,12 +6725,14 @@  gcn_hsa_declare_function_name (FILE *file, const char *name,
 	   "            .private_segment_fixed_size: 0\n"
 	   "            .wavefront_size: 64\n"
 	   "            .sgpr_count: %i\n"
-	   "            .vgpr_count: %i\n"
+	   "            .vgpr_count: %i%s\n"
 	   "            .max_flat_workgroup_size: 1024\n",
 	   cfun->machine->kernarg_segment_byte_size,
 	   cfun->machine->kernarg_segment_alignment,
 	   LDS_SIZE,
-	   sgpr, next_free_vgpr);
+	   sgpr, next_free_vgpr,
+	   (TARGET_RDNA2_PLUS ? " ; wavefrontsize64 counts double on SIMD32"
+	    : ""));
   if (gcn_arch == PROCESSOR_GFX90a || gcn_arch == PROCESSOR_GFX908)
     fprintf (file, "            .agpr_count: %i\n", avgpr);
   fputs ("        .end_amdgpu_metadata\n", file);
diff --git a/gcc/config/gcn/gcn.md b/gcc/config/gcn/gcn.md
index 492b833e255..1f3c692b7a6 100644
--- a/gcc/config/gcn/gcn.md
+++ b/gcc/config/gcn/gcn.md
@@ -1618,7 +1618,7 @@ 
 	(mult:SI
 	  (any_extend:SI (match_operand:HI 1 "register_operand" "%v"))
 	  (any_extend:SI (match_operand:HI 2 "register_operand" " v"))))]
-  ""
+  "!TARGET_RDNA3"
   "v_mul_<iu>32_<iu>24_sdwa\t%0, %<e>1, %<e>2 src0_sel:WORD_0 src1_sel:WORD_0"
   [(set_attr "type" "vop_sdwa")
    (set_attr "length" "8")])
@@ -1628,7 +1628,7 @@ 
 	(mult:HI
 	  (any_extend:HI (match_operand:QI 1 "register_operand" "%v"))
 	  (any_extend:HI (match_operand:QI 2 "register_operand" " v"))))]
-  ""
+  "!TARGET_RDNA3"
   "v_mul_<iu>32_<iu>24_sdwa\t%0, %<e>1, %<e>2 src0_sel:BYTE_0 src1_sel:BYTE_0"
   [(set_attr "type" "vop_sdwa")
    (set_attr "length" "8")])
diff --git a/libgcc/config/gcn/amdgcn_veclib.h b/libgcc/config/gcn/amdgcn_veclib.h
index 821f6386dd6..d268c6cac16 100644
--- a/libgcc/config/gcn/amdgcn_veclib.h
+++ b/libgcc/config/gcn/amdgcn_veclib.h
@@ -230,7 +230,7 @@  do { \
 
 #if defined (__GCN3__) || defined (__GCN5__) \
     || defined (__CDNA1__) || defined (__CDNA2__) \
-    || defined (__RDNA2__)
+    || defined (__RDNA2__) || defined (__RDNA3__)
 #define CDNA3_PLUS 0
 #else
 #define CDNA3_PLUS 1
diff --git a/libgomp/config/gcn/time.c b/libgomp/config/gcn/time.c
index 30a0d0188e4..efcd04f5f43 100644
--- a/libgomp/config/gcn/time.c
+++ b/libgomp/config/gcn/time.c
@@ -30,15 +30,25 @@ 
 /* According to AMD:
     dGPU RTC is 27MHz
     AGPU RTC is 100MHz
+    RDNA3 ISA manual states "typically 100MHz"
    FIXME: DTRT on an APU.  */
+#ifdef __RDNA3__
+#define RTC_TICKS (1.0 / 100000000.0) /* 100MHz */
+#else
 #define RTC_TICKS (1.0 / 27000000.0) /* 27MHz */
+#endif
 
 double
 omp_get_wtime (void)
 {
   uint64_t clock;
+#ifdef __RDNA3__
+  asm ("s_sendmsg_rtn_b64 %0 0x83 ;Get REALTIME\n\t"
+       "s_waitcnt 0" : "=r" (clock));
+#else
   asm ("s_memrealtime %0\n\t"
        "s_waitcnt 0" : "=r" (clock));
+#endif
   return clock * RTC_TICKS;
 }
 
diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index 0339848451e..db28781dedb 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -1741,11 +1741,13 @@  max_isa_vgprs (int isa)
     case EF_AMDGPU_MACH_AMDGCN_GFX900:
     case EF_AMDGPU_MACH_AMDGCN_GFX906:
     case EF_AMDGPU_MACH_AMDGCN_GFX908:
-    case EF_AMDGPU_MACH_AMDGCN_GFX1030:
-    case EF_AMDGPU_MACH_AMDGCN_GFX1100:
       return 256;
     case EF_AMDGPU_MACH_AMDGCN_GFX90a:
       return 512;
+    case EF_AMDGPU_MACH_AMDGCN_GFX1030:
+      return 512;  /* 512 SIMD32 = 256 wavefrontsize64.  */
+    case EF_AMDGPU_MACH_AMDGCN_GFX1100:
+      return 1536; /* 1536 SIMD32 = 768 wavefrontsize64.  */
     }
   GOMP_PLUGIN_fatal ("unhandled ISA in max_isa_vgprs");
 }