Repost [PATCH 6/6] PowerPC: Add support for 1,024 bit DMR registers.

Message ID ZZiTyrsBFO92FG84@cowardly-lion.the-meissners.org
State Unresolved
Series Repost [PATCH 6/6] PowerPC: Add support for 1,024 bit DMR registers.

Checks

Context Check Description
snail/gcc-patch-check warning Git am fail log

Commit Message

Michael Meissner Jan. 5, 2024, 11:42 p.m. UTC
This patch is a preliminary patch to add the full 1,024 bit dense math registers
(DMRs) for -mcpu=future.  The MMA 512-bit accumulators map onto the top of the
DMR register.

This patch only adds the new 1,024 bit register support.  It does not add
support for any instructions that need 1,024 bit registers instead of 512 bit
registers.

I used the new mode 'TDOmode' to be the opaque mode used for 1,024 bit
registers.  The 'wD' constraint added in previous patches is used for these
registers.  I added support to do load and store of DMRs via the VSX registers,
since there are no load/store dense math instructions.  I added the new keyword
'__dmr' to create 1,024 bit types that can be loaded into DMRs.  At present, I
don't have aliases for __dmr512 and __dmr1024 that we've discussed internally.

The patches have been tested on both little and big endian systems.  Can I check
it into the master branch?

2024-01-05   Michael Meissner  <meissner@linux.ibm.com>

gcc/

	* config/rs6000/mma.md (UNSPEC_DM_INSERT512_UPPER): New unspec.
	(UNSPEC_DM_INSERT512_LOWER): Likewise.
	(UNSPEC_DM_EXTRACT512): Likewise.
	(UNSPEC_DMR_RELOAD_FROM_MEMORY): Likewise.
	(UNSPEC_DMR_RELOAD_TO_MEMORY): Likewise.
	(movtdo): New define_expand and define_insn_and_split to implement 1,024
	bit DMR registers.
	(movtdo_insert512_upper): New insn.
	(movtdo_insert512_lower): Likewise.
	(movtdo_extract512): Likewise.
	(reload_dmr_from_memory): Likewise.
	(reload_dmr_to_memory): Likewise.
	* config/rs6000/rs6000-builtin.cc (rs6000_type_string): Add DMR
	support.
	(rs6000_init_builtins): Add support for __dmr keyword.
	* config/rs6000/rs6000-call.cc (rs6000_return_in_memory): Add support
	for TDOmode.
	(rs6000_function_arg): Likewise.
	* config/rs6000/rs6000-modes.def (TDOmode): New mode.
	* config/rs6000/rs6000.cc (rs6000_hard_regno_nregs_internal): Add
	support for TDOmode.
	(rs6000_hard_regno_mode_ok_uncached): Likewise.
	(rs6000_hard_regno_mode_ok): Likewise.
	(rs6000_modes_tieable_p): Likewise.
	(rs6000_debug_reg_global): Likewise.
	(rs6000_setup_reg_addr_masks): Likewise.
	(rs6000_init_hard_regno_mode_ok): Add support for TDOmode.  Setup reload
	hooks for DMR mode.
	(reg_offset_addressing_ok_p): Add support for TDOmode.
	(rs6000_emit_move): Likewise.
	(rs6000_secondary_reload_simple_move): Likewise.
	(rs6000_secondary_reload_class): Likewise.
	(rs6000_mangle_type): Add mangling for __dmr type.
	(rs6000_dmr_register_move_cost): Add support for TDOmode.
	(rs6000_split_multireg_move): Likewise.
	(rs6000_invalid_conversion): Likewise.
	* config/rs6000/rs6000.h (VECTOR_ALIGNMENT_P): Add TDOmode.
	(enum rs6000_builtin_type_index): Add DMR type nodes.
	(dmr_type_node): Likewise.
	(ptr_dmr_type_node): Likewise.

gcc/testsuite/

	* gcc.target/powerpc/dm-1024bit.c: New test.
---
 gcc/config/rs6000/mma.md                      | 152 ++++++++++++++++++
 gcc/config/rs6000/rs6000-builtin.cc           |  13 ++
 gcc/config/rs6000/rs6000-call.cc              |  13 +-
 gcc/config/rs6000/rs6000-modes.def            |   4 +
 gcc/config/rs6000/rs6000.cc                   | 135 ++++++++++++----
 gcc/config/rs6000/rs6000.h                    |   7 +-
 gcc/testsuite/gcc.target/powerpc/dm-1024bit.c |  63 ++++++++
 7 files changed, 351 insertions(+), 36 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/dm-1024bit.c
  

Comments

Michael Meissner Jan. 19, 2024, 6:49 p.m. UTC | #1
Ping

| Date: Fri, 5 Jan 2024 18:42:02 -0500
| From: Michael Meissner <meissner@linux.ibm.com>
| Subject: Repost [PATCH 6/6] PowerPC: Add support for 1,024 bit DMR registers.
| Message-ID: <ZZiTyrsBFO92FG84@cowardly-lion.the-meissners.org>

https://gcc.gnu.org/pipermail/gcc-patches/2024-January/641966.html
  
Kewen.Lin Feb. 5, 2024, 3:58 a.m. UTC | #2
Hi Mike,

on 2024/1/6 07:42, Michael Meissner wrote:
> This patch is a prelimianry patch to add the full 1,024 bit dense math register
> (DMRs) for -mcpu=future.  The MMA 512-bit accumulators map onto the top of the
> DMR register.
> 
> This patch only adds the new 1,024 bit register support.  It does not add
> support for any instructions that need 1,024 bit registers instead of 512 bit
> registers.
> 
> I used the new mode 'TDOmode' to be the opaque mode used for 1,204 bit

typo: 1,204

> registers.  The 'wD' constraint added in previous patches is used for these
> registers.  I added support to do load and store of DMRs via the VSX registers,
> since there are no load/store dense math instructions.  I added the new keyword
> '__dmr' to create 1,024 bit types that can be loaded into DMRs.  At present, I
> don't have aliases for __dmr512 and __dmr1024 that we've discussed internally.
> 
> The patches have been tested on both little and big endian systems.  Can I check
> it into the master branch?
> 
> 2024-01-05   Michael Meissner  <meissner@linux.ibm.com>
> 
> gcc/
> 
> 	* config/rs6000/mma.md (UNSPEC_DM_INSERT512_UPPER): New unspec.
> 	(UNSPEC_DM_INSERT512_LOWER): Likewise.
> 	(UNSPEC_DM_EXTRACT512): Likewise.
> 	(UNSPEC_DMR_RELOAD_FROM_MEMORY): Likewise.
> 	(UNSPEC_DMR_RELOAD_TO_MEMORY): Likewise.
> 	(movtdo): New define_expand and define_insn_and_split to implement 1,024
> 	bit DMR registers.
> 	(movtdo_insert512_upper): New insn.
> 	(movtdo_insert512_lower): Likewise.
> 	(movtdo_extract512): Likewise.
> 	(reload_dmr_from_memory): Likewise.
> 	(reload_dmr_to_memory): Likewise.
> 	* config/rs6000/rs6000-builtin.cc (rs6000_type_string): Add DMR
> 	support.
> 	(rs6000_init_builtins): Add support for __dmr keyword.
> 	* config/rs6000/rs6000-call.cc (rs6000_return_in_memory): Add support
> 	for TDOmode.
> 	(rs6000_function_arg): Likewise.
> 	* config/rs6000/rs6000-modes.def (TDOmode): New mode.
> 	* config/rs6000/rs6000.cc (rs6000_hard_regno_nregs_internal): Add
> 	support for TDOmode.
> 	(rs6000_hard_regno_mode_ok_uncached): Likewise.
> 	(rs6000_hard_regno_mode_ok): Likewise.
> 	(rs6000_modes_tieable_p): Likewise.
> 	(rs6000_debug_reg_global): Likewise.
> 	(rs6000_setup_reg_addr_masks): Likewise.
> 	(rs6000_init_hard_regno_mode_ok): Add support for TDOmode.  Setup reload
> 	hooks for DMR mode.
> 	(reg_offset_addressing_ok_p): Add support for TDOmode.
> 	(rs6000_emit_move): Likewise.
> 	(rs6000_secondary_reload_simple_move): Likewise.
> 	(rs6000_secondary_reload_class): Likewise.
> 	(rs6000_mangle_type): Add mangling for __dmr type.
> 	(rs6000_dmr_register_move_cost): Add support for TDOmode.
> 	(rs6000_split_multireg_move): Likewise.
> 	(rs6000_invalid_conversion): Likewise.
> 	* config/rs6000/rs6000.h (VECTOR_ALIGNMENT_P): Add TDOmode.
> 	(enum rs6000_builtin_type_index): Add DMR type nodes.
> 	(dmr_type_node): Likewise.
> 	(ptr_dmr_type_node): Likewise.
> 
> gcc/testsuite/
> 
> 	* gcc.target/powerpc/dm-1024bit.c: New test.
> ---
>  gcc/config/rs6000/mma.md                      | 152 ++++++++++++++++++
>  gcc/config/rs6000/rs6000-builtin.cc           |  13 ++
>  gcc/config/rs6000/rs6000-call.cc              |  13 +-
>  gcc/config/rs6000/rs6000-modes.def            |   4 +
>  gcc/config/rs6000/rs6000.cc                   | 135 ++++++++++++----
>  gcc/config/rs6000/rs6000.h                    |   7 +-
>  gcc/testsuite/gcc.target/powerpc/dm-1024bit.c |  63 ++++++++
>  7 files changed, 351 insertions(+), 36 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/dm-1024bit.c
> 
> diff --git a/gcc/config/rs6000/mma.md b/gcc/config/rs6000/mma.md
> index f06e6bbb184..37de9030903 100644
> --- a/gcc/config/rs6000/mma.md
> +++ b/gcc/config/rs6000/mma.md
> @@ -92,6 +92,11 @@ (define_c_enum "unspec"
>     UNSPEC_MMA_XXMFACC
>     UNSPEC_MMA_XXMTACC
>     UNSPEC_DM_ASSEMBLE_ACC
> +   UNSPEC_DM_INSERT512_UPPER
> +   UNSPEC_DM_INSERT512_LOWER
> +   UNSPEC_DM_EXTRACT512
> +   UNSPEC_DMR_RELOAD_FROM_MEMORY
> +   UNSPEC_DMR_RELOAD_TO_MEMORY
>    ])
>  
>  (define_c_enum "unspecv"
> @@ -879,3 +884,150 @@ (define_insn "mma_<avvi4i4i4>"
>    [(set_attr "type" "mma")
>     (set_attr "prefixed" "yes")
>     (set_attr "isa" "dm,not_dm,not_dm")])
> +
> +
> +;; TDOmode (i.e. __dmr).
> +(define_expand "movtdo"
> +  [(set (match_operand:TDO 0 "nonimmediate_operand")
> +	(match_operand:TDO 1 "input_operand"))]
> +  "TARGET_DENSE_MATH"
> +{
> +  rs6000_emit_move (operands[0], operands[1], TDOmode);
> +  DONE;
> +})
> +
> +(define_insn_and_split "*movtdo"
> +  [(set (match_operand:TDO 0 "nonimmediate_operand" "=wa,m,wa,wD,wD,wa")
> +	(match_operand:TDO 1 "input_operand" "m,wa,wa,wa,wD,wD"))]
> +  "TARGET_DENSE_MATH
> +   && (gpc_reg_operand (operands[0], TDOmode)
> +       || gpc_reg_operand (operands[1], TDOmode))"
> +  "@
> +   #
> +   #
> +   #
> +   #
> +   dmmr %0,%1
> +   #"
> +  "&& reload_completed
> +   && (!dmr_operand (operands[0], TDOmode) || !dmr_operand (operands[1], TDOmode))"
> +  [(const_int 0)]
> +{
> +  rtx op0 = operands[0];
> +  rtx op1 = operands[1];
> +
> +  if (REG_P (op0) && REG_P (op1))
> +    {
> +      int regno0 = REGNO (op0);
> +      int regno1 = REGNO (op1);
> +
> +      if (DMR_REGNO_P (regno0) && VSX_REGNO_P (regno1))
> +	{
> +	  rtx op1_upper = gen_rtx_REG (XOmode, regno1);
> +	  rtx op1_lower = gen_rtx_REG (XOmode, regno1 + 4);
> +	  emit_insn (gen_movtdo_insert512_upper (op0, op1_upper));
> +	  emit_insn (gen_movtdo_insert512_lower (op0, op0, op1_lower));
> +	  DONE;
> +	}
> +
> +      else if (VSX_REGNO_P (regno0) && DMR_REGNO_P (regno1))
> +	{
> +	  rtx op0_upper = gen_rtx_REG (XOmode, regno0);
> +	  rtx op0_lower = gen_rtx_REG (XOmode, regno0 + 4);
> +	  emit_insn (gen_movtdo_extract512 (op0_upper, op1, const0_rtx));
> +	  emit_insn (gen_movtdo_extract512 (op0_lower, op1, const1_rtx));
> +	  DONE;
> +	}

Add an assertion here, like gcc_assert (VSX_REGNO_P (regno0) && VSX_REGNO_P (regno1))?

> +    }
> +
> +  rs6000_split_multireg_move (operands[0], operands[1]);
> +  DONE;
> +}
> +  [(set_attr "type" "vecload,vecstore,vecmove,vecmove,vecmove,vecmove")
> +   (set_attr "length" "*,*,32,8,*,8")
> +   (set_attr "max_prefixed_insns" "4,4,*,*,*,*")])
> +
> +;; Move from VSX registers to DMR registers via two insert 512 bit
> +;; instructions.
> +(define_insn "movtdo_insert512_upper"
> +  [(set (match_operand:TDO 0 "dmr_operand" "=wD")
> +	(unspec:TDO [(match_operand:XO 1 "vsx_register_operand" "wa")]
> +		    UNSPEC_DM_INSERT512_UPPER))]
> +  "TARGET_DENSE_MATH"
> +  "dmxxinstdmr512 %0,%1,%Y1,0"
> +  [(set_attr "type" "mma")])
> +
> +(define_insn "movtdo_insert512_lower"
> +  [(set (match_operand:TDO 0 "dmr_operand" "=wD")
> +	(unspec:TDO [(match_operand:TDO 1 "dmr_operand" "0")
> +		     (match_operand:XO 2 "vsx_register_operand" "wa")]
> +		    UNSPEC_DM_INSERT512_LOWER))]
> +  "TARGET_DENSE_MATH"
> +  "dmxxinstdmr512 %0,%2,%Y2,1"
> +  [(set_attr "type" "mma")])
> +
> +;; Move from DMR registers to VSX registers via two extract 512 bit
> +;; instructions.
> +(define_insn "movtdo_extract512"
> +  [(set (match_operand:XO 0 "vsx_register_operand" "=wa")
> +	(unspec:XO [(match_operand:TDO 1 "dmr_operand" "wD")
> +		    (match_operand 2 "const_0_to_1_operand" "n")]
> +		   UNSPEC_DM_EXTRACT512))]
> +  "TARGET_DENSE_MATH"
> +  "dmxxextfdmr512 %0,%Y0,%1,%2"
> +  [(set_attr "type" "mma")])
> +
> +;; Reload DMR registers from memory
> +(define_insn_and_split "reload_dmr_from_memory"
> +  [(set (match_operand:TDO 0 "dmr_operand" "=wD")
> +	(unspec:TDO [(match_operand:TDO 1 "memory_operand" "m")]
> +		    UNSPEC_DMR_RELOAD_FROM_MEMORY))
> +   (clobber (match_operand:XO 2 "vsx_register_operand" "=wa"))]
> +  "TARGET_DENSE_MATH"
> +  "#"
> +  "&& reload_completed"
> +  [(const_int 0)]
> +{
> +  rtx dest = operands[0];
> +  rtx src = operands[1];
> +  rtx tmp = operands[2];
> +  rtx mem_upper = adjust_address (src, XOmode, BYTES_BIG_ENDIAN ? 0 : 32);
> +  rtx mem_lower = adjust_address (src, XOmode, BYTES_BIG_ENDIAN ? 32 : 0);

I think the offset should be 64 rather than 32.
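
To spell out the arithmetic: rs6000-modes.def declares XOmode as 64 bytes and TDOmode as 128 bytes, so the two XOmode halves of a TDOmode memory operand should start at byte offsets 0 and 64.  With an offset of 32 the two 64-byte accesses would overlap and the last 32 bytes would never be touched.  A minimal standalone sketch of the selection I would expect (plain C, not GCC internals; `dmr_half_offset`, `XO_BYTES` and `TDO_BYTES` are names made up for illustration):

```c
#include <assert.h>
#include <stdbool.h>

/* Sizes in bytes, taken from the OPAQUE_MODE declarations in
   rs6000-modes.def: OPAQUE_MODE (XO, 64) and OPAQUE_MODE (TDO, 128).  */
enum { XO_BYTES = 64, TDO_BYTES = 128 };

/* Hypothetical helper: byte offset of the upper or lower 512-bit half of
   a 1,024 bit value in memory.  It mirrors the
   "BYTES_BIG_ENDIAN ? ... : ..." selection in the reload patterns, with
   the XOmode size used as the non-zero offset.  */
static int
dmr_half_offset (bool big_endian, bool upper)
{
  /* Big endian: upper half first.  Little endian: lower half first.  */
  int offset = (big_endian == upper) ? 0 : XO_BYTES;

  /* Each half must lie entirely inside the 128-byte object.  */
  assert (offset + XO_BYTES <= TDO_BYTES);
  return offset;
}
```

In the patterns themselves this would presumably amount to replacing the literal 32 with 64, or with GET_MODE_SIZE (XOmode) if that reads better.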

> +
> +  emit_move_insn (tmp, mem_upper);
> +  emit_insn (gen_movtdo_insert512_upper (dest, tmp));
> +
> +  emit_move_insn (tmp, mem_lower);
> +  emit_insn (gen_movtdo_insert512_lower (dest, dest, tmp));
> +  DONE;
> +}
> +  [(set_attr "length" "16")
> +   (set_attr "max_prefixed_insns" "2")
> +   (set_attr "type" "vecload")])
> +
> +;; Reload dense math registers to memory
> +(define_insn_and_split "reload_dmr_to_memory"
> +  [(set (match_operand:TDO 0 "memory_operand" "=m")
> +	(unspec:TDO [(match_operand:TDO 1 "dmr_operand" "wD")]
> +		    UNSPEC_DMR_RELOAD_TO_MEMORY))
> +   (clobber (match_operand:XO 2 "vsx_register_operand" "=wa"))]
> +  "TARGET_DENSE_MATH"
> +  "#"
> +  "&& reload_completed"
> +  [(const_int 0)]
> +{
> +  rtx dest = operands[0];
> +  rtx src = operands[1];
> +  rtx tmp = operands[2];
> +  rtx mem_upper = adjust_address (dest, XOmode, BYTES_BIG_ENDIAN ? 0 : 32);
> +  rtx mem_lower = adjust_address (dest, XOmode, BYTES_BIG_ENDIAN ? 32 : 0);

Ditto.

> +
> +  emit_insn (gen_movtdo_extract512 (tmp, src, const0_rtx));
> +  emit_move_insn (mem_upper, tmp);
> +
> +  emit_insn (gen_movtdo_extract512 (tmp, src, const1_rtx));
> +  emit_move_insn (mem_lower, tmp);
> +  DONE;
> +}
> +  [(set_attr "length" "16")
> +   (set_attr "max_prefixed_insns" "2")])
> diff --git a/gcc/config/rs6000/rs6000-builtin.cc b/gcc/config/rs6000/rs6000-builtin.cc
> index 6698274031b..54868d2009c 100644
> --- a/gcc/config/rs6000/rs6000-builtin.cc
> +++ b/gcc/config/rs6000/rs6000-builtin.cc
> @@ -495,6 +495,8 @@ const char *rs6000_type_string (tree type_node)
>      return "__vector_pair";
>    else if (type_node == vector_quad_type_node)
>      return "__vector_quad";
> +  else if (type_node == dmr_type_node)
> +    return "__dmr";
>  
>    return "unknown";
>  }
> @@ -781,6 +783,17 @@ rs6000_init_builtins (void)
>    t = build_qualified_type (vector_quad_type_node, TYPE_QUAL_CONST);
>    ptr_vector_quad_type_node = build_pointer_type (t);
>  
> +  dmr_type_node = make_node (OPAQUE_TYPE);
> +  SET_TYPE_MODE (dmr_type_node, TDOmode);
> +  TYPE_SIZE (dmr_type_node) = bitsize_int (GET_MODE_BITSIZE (TDOmode));
> +  TYPE_PRECISION (dmr_type_node) = GET_MODE_BITSIZE (TDOmode);
> +  TYPE_SIZE_UNIT (dmr_type_node) = size_int (GET_MODE_SIZE (TDOmode));
> +  SET_TYPE_ALIGN (dmr_type_node, 512);

Why 512 rather than 1024?

> +  TYPE_USER_ALIGN (dmr_type_node) = 0;
> +  lang_hooks.types.register_builtin_type (dmr_type_node, "__dmr");
> +  t = build_qualified_type (dmr_type_node, TYPE_QUAL_CONST);
> +  ptr_dmr_type_node = build_pointer_type (t);
> +
>    tdecl = add_builtin_type ("__bool char", bool_char_type_node);
>    TYPE_NAME (bool_char_type_node) = tdecl;
>  
> diff --git a/gcc/config/rs6000/rs6000-call.cc b/gcc/config/rs6000/rs6000-call.cc
> index 8c590903c86..6e2465204cf 100644
> --- a/gcc/config/rs6000/rs6000-call.cc
> +++ b/gcc/config/rs6000/rs6000-call.cc
> @@ -437,7 +437,8 @@ rs6000_return_in_memory (const_tree type, const_tree fntype ATTRIBUTE_UNUSED)
>    if (cfun
>        && !cfun->machine->mma_return_type_error
>        && TREE_TYPE (cfun->decl) == fntype
> -      && (TYPE_MODE (type) == OOmode || TYPE_MODE (type) == XOmode))
> +      && (TYPE_MODE (type) == OOmode || TYPE_MODE (type) == XOmode
> +	  || TYPE_MODE (type) == TDOmode))

Maybe just use OPAQUE_MODE_P (TYPE_MODE (type)) to cover all of these type mode cases.

So far only rs6000 defines OPAQUE_MODE.  If we are worried that some generic opaque modes
will show up some day, we can probably add an assertion somewhere to guarantee it, or add a
macro like OPAQUE_MMA_MODE_P that only matches {OO,XO,TDO}mode.
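
A rough standalone sketch of what such a macro could look like.  The enum below is only a stand-in for machine_mode so that the snippet compiles on its own; OPAQUE_MMA_MODE_P, the MOCK_* names and both helper functions are hypothetical and do not exist in the tree today:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for GCC's machine_mode enum; only the modes discussed here.  */
enum mock_machine_mode
{
  MOCK_SImode,
  MOCK_OOmode,	/* __vector_pair */
  MOCK_XOmode,	/* __vector_quad */
  MOCK_TDOmode	/* __dmr */
};

/* Hypothetical macro: true only for the opaque modes rs6000 defines.  */
#define OPAQUE_MMA_MODE_P(MODE) \
  ((MODE) == MOCK_OOmode || (MODE) == MOCK_XOmode || (MODE) == MOCK_TDOmode)

/* The three-way mode comparison in rs6000_return_in_memory would then
   collapse to a single test.  */
static bool
mock_opaque_return_mode_p (enum mock_machine_mode mode)
{
  return OPAQUE_MMA_MODE_P (mode);
}

/* Quick self-check: ordinary modes must not match.  */
static void
mock_opaque_mode_self_check (void)
{
  assert (!OPAQUE_MMA_MODE_P (MOCK_SImode));
  assert (OPAQUE_MMA_MODE_P (MOCK_TDOmode));
}
```

Spelling out the three modes keeps the test from silently widening if a generic opaque mode is ever added, which is the concern with using OPAQUE_MODE_P directly.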

>      {
>        /* Record we have now handled function CFUN, so the next time we
>  	 are called, we do not re-report the same error.  */
> @@ -1641,6 +1642,16 @@ rs6000_function_arg (cumulative_args_t cum_v, const function_arg_info &arg)
>        return NULL_RTX;
>      }
>  
> +  if (mode == TDOmode)
> +    {
> +      if (TYPE_CANONICAL (type) != NULL_TREE)
> +	type = TYPE_CANONICAL (type);
> +      error ("invalid use of dense math operand of type %qs as a function "
> +	     "parameter",
> +	     IDENTIFIER_POINTER (DECL_NAME (TYPE_NAME (type))));
> +      return NULL_RTX;
> +    }

Can we merge this hunk into the above hunk for OOmode and XOmode?  Then the code with TYPE_CANONICAL
can be shared and is easier to maintain.  IMHO, this dense math operand is also an MMA operand, so the
above error message still works.  If it's desired to call out the dense math operand, we can use
(mode == TDOmode) ? "dense math" : "MMA" for the part of the string that differs.
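
Roughly like the following standalone sketch.  The enum and both helper names are hypothetical stand-ins rather than rs6000 code, plain '%s' is used where GCC's diagnostics would use %qs, and I am assuming the OOmode/XOmode message differs from the TDOmode one only in that one word:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Stand-in for GCC's machine_mode enum; only the modes discussed here.  */
enum mock_machine_mode { MOCK_SImode, MOCK_OOmode, MOCK_XOmode, MOCK_TDOmode };

/* Word that differs between the diagnostics, or NULL when the mode is not
   opaque and no error should be reported.  */
static const char *
opaque_operand_kind (enum mock_machine_mode mode)
{
  switch (mode)
    {
    case MOCK_OOmode:
    case MOCK_XOmode:
      return "MMA";
    case MOCK_TDOmode:
      return "dense math";
    default:
      return NULL;
    }
}

/* Format the shared message into BUF; only valid for opaque modes.  */
static int
format_param_error (char *buf, size_t size, enum mock_machine_mode mode,
		    const char *type_name)
{
  const char *kind = opaque_operand_kind (mode);

  assert (kind != NULL);
  return snprintf (buf, size,
		   "invalid use of %s operand of type '%s' as a function "
		   "parameter", kind, type_name);
}
```

One caveat with building the message from pieces: a format string assembled this way may be harder for translators than two complete strings, so keeping two full error calls inside one shared block is also an option.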

> +
>    /* Return a marker to indicate whether CR1 needs to set or clear the
>       bit that V.4 uses to say fp args were passed in registers.
>       Assume that we don't need the marker for software floating point,
> diff --git a/gcc/config/rs6000/rs6000-modes.def b/gcc/config/rs6000/rs6000-modes.def
> index 094b246c834..60ebb363196 100644
> --- a/gcc/config/rs6000/rs6000-modes.def
> +++ b/gcc/config/rs6000/rs6000-modes.def
> @@ -86,3 +86,7 @@ PARTIAL_INT_MODE (TI, 128, PTI);
>  /* Modes used by __vector_pair and __vector_quad.  */
>  OPAQUE_MODE (OO, 32);
>  OPAQUE_MODE (XO, 64);
> +
> +/* Modes used by __dmr.  */

Nit: s/Modes/Mode/

> +OPAQUE_MODE (TDO, 128);
> +

I assume that "TD" stands for something, but it is not obvious to me what.
Could we also add a comment explaining it?

> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index 59517c8608d..aed4b72c4ea 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -1846,7 +1846,9 @@ rs6000_hard_regno_nregs_internal (int regno, machine_mode mode)
>       128-bit floating point that can go in vector registers, which has VSX
>       memory addressing.  */
>    if (FP_REGNO_P (regno))
> -    reg_size = (VECTOR_MEM_VSX_P (mode) || VECTOR_ALIGNMENT_P (mode)
> +    reg_size = (VECTOR_MEM_VSX_P (mode)
> +		|| VECTOR_ALIGNMENT_P (mode)
> +		|| mode == TDOmode

Redundant change: with this patch, VECTOR_ALIGNMENT_P already covers TDOmode.

>  		? UNITS_PER_VSX_WORD
>  		: UNITS_PER_FP_WORD);
>  
> @@ -1880,9 +1882,9 @@ rs6000_hard_regno_mode_ok_uncached (int regno, machine_mode mode)
>    /* On ISA 3.1 (power10), MMA accumulator modes need FPR registers divisible
>       by 4.
>  
> -     If dense math is enabled, allow all VSX registers plus the DMR registers.
> -     We need to make sure we don't cross between the boundary of FPRs and
> -     traditional Altiviec registers.  */
> +     If dense math is enabled, allow all VSX registers plus the dense math
> +     registers.  We need to make sure we don't cross between the boundary of
> +     FPRs and traditional Altiviec registers.  */
>    if (mode == XOmode)
>      {
>        if (TARGET_MMA && !TARGET_DENSE_MATH)
> @@ -1904,7 +1906,27 @@ rs6000_hard_regno_mode_ok_uncached (int regno, machine_mode mode)
>  	return 0;
>      }
>  
> -  /* No other types other than XOmode can go in DMRs.  */
> +  /* Dense math register modes need DMR registers or VSX registers divisible by
> +     2.  We need to make sure we don't cross between the boundary of FPRs and
> +     traditional Altiviec registers.  */
> +  if (mode == TDOmode)
> +    {
> +      if (!TARGET_DENSE_MATH)
> +	return 0;
> +
> +      if (DMR_REGNO_P (regno))
> +	return 1;
> +
> +      if (FP_REGNO_P (regno))
> +	return ((regno & 1) == 0 && regno <= LAST_FPR_REGNO - 7);
> +

Like the comment on XOmode (in one previous patch), this restriction looks too
strict.  Isn't it fine to cross the FPR and Altivec register boundary, since XAp and XBp
are separate operands in the DM 512 insert/extract instructions?
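
To make the difference concrete, here is a small standalone sketch comparing the check as posted with one possible relaxed reading.  It is not GCC code: the register numbers (FPRs 32..63, Altivec 64..95) are my assumption about the rs6000 hard register layout, the eight-register span follows from 128 bytes over 16-byte VSX registers, and both function names are made up:

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed rs6000 hard register numbering.  */
enum { FIRST_FPR = 32, LAST_FPR = 63, FIRST_ALTIVEC = 64, LAST_ALTIVEC = 95 };

/* A TDOmode value spans 8 consecutive 128-bit VSX registers.  */
enum { TDO_NREGS = 8 };

/* Mirrors the check in the patch: even starting register, and the whole
   span must stay inside either the FPRs or the Altivec registers.  */
static bool
tdo_vsx_regno_ok_strict (int regno)
{
  assert (regno >= FIRST_FPR && regno <= LAST_ALTIVEC);
  int last = regno <= LAST_FPR ? LAST_FPR : LAST_ALTIVEC;
  return (regno & 1) == 0 && regno <= last - (TDO_NREGS - 1);
}

/* Relaxed variant: the span only has to stay inside the VSX registers,
   so it may straddle the FPR/Altivec boundary.  */
static bool
tdo_vsx_regno_ok_relaxed (int regno)
{
  assert (regno >= FIRST_FPR && regno <= LAST_ALTIVEC);
  return (regno & 1) == 0 && regno <= LAST_ALTIVEC - (TDO_NREGS - 1);
}
```

Under these assumptions register 58 is rejected by the posted check but accepted by the relaxed one, while 90 is rejected by both because the span would run past the last VSX register.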

> +      if (ALTIVEC_REGNO_P (regno))
> +	return ((regno & 1) == 0 && regno <= LAST_ALTIVEC_REGNO - 7);
> +
> +      return 0;
> +    }
> +
> +  /* No other types other than XOmode or TDOmode can go in DMRs.  */
>    if (DMR_REGNO_P (regno))
>      return 0;
>  
> @@ -2012,9 +2034,11 @@ rs6000_hard_regno_mode_ok (unsigned int regno, machine_mode mode)
>     GPR registers, and TImode can go in any GPR as well as VSX registers (PR
>     57744).
>  
> -   Similarly, don't allow OOmode (vector pair, restricted to even VSX
> -   registers) or XOmode (vector quad, restricted to FPR registers divisible
> -   by 4) to tie with other modes.
> +   Similarly, don't allow OOmode (vector pair), XOmode (vector quad), or
> +   TDOmode (dmr register) to pair with anything else.  Vector pairs are
> +   restricted to even/odd VSX registers.  Without dense math, vector quads are
> +   limited to FPR registers divisible by 4.  With dense math, vector quads are
> +   limited to even VSX registers or DMR registers.
>  
>     Altivec/VSX vector tests were moved ahead of scalar float mode, so that IEEE
>     128-bit floating point on VSX systems ties with other vectors.  */
> @@ -2023,7 +2047,8 @@ static bool
>  rs6000_modes_tieable_p (machine_mode mode1, machine_mode mode2)
>  {
>    if (mode1 == PTImode || mode1 == OOmode || mode1 == XOmode
> -      || mode2 == PTImode || mode2 == OOmode || mode2 == XOmode)
> +      || mode1 == TDOmode || mode2 == PTImode || mode2 == OOmode
> +      || mode2 == XOmode || mode2 == TDOmode)
>      return mode1 == mode2;
>  
>    if (ALTIVEC_OR_VSX_VECTOR_MODE (mode1))
> @@ -2314,6 +2339,7 @@ rs6000_debug_reg_global (void)
>      V4DFmode,
>      OOmode,
>      XOmode,
> +    TDOmode,
>      CCmode,
>      CCUNSmode,
>      CCEQmode,
> @@ -2679,7 +2705,7 @@ rs6000_setup_reg_addr_masks (void)
>  	  /* Special case DMR registers.  */
>  	  if (rc == RELOAD_REG_DMR)
>  	    {
> -	      if (TARGET_DENSE_MATH && m2 == XOmode)
> +	      if (TARGET_DENSE_MATH && (m2 == XOmode || m2 == TDOmode))
>  		{
>  		  addr_mask = RELOAD_REG_VALID;
>  		  reg_addr[m].addr_mask[rc] = addr_mask;
> @@ -2786,12 +2812,14 @@ rs6000_setup_reg_addr_masks (void)
>  
>  	  /* Vector pairs can do both indexed and offset loads if the
>  	     instructions are enabled, otherwise they can only do offset loads
> -	     since it will be broken into two vector moves.  Vector quads can
> -	     only do offset loads.  If the user restricted generation of either
> -	     of the LXVP or STXVP instructions, do not allow indexed mode so
> -	     that we can split the load/store.  */
> +	     since it will be broken into two vector moves.  If the user
> +	     restricted generation of either of the LXVP or STXVP instructions,
> +	     do not allow indexed mode so that we can split the load/store.
> +
> +	     Vector quads and dense math 1,024 bit registers can only do offset
> +	     loads.  */
>  	  else if ((addr_mask != 0) && TARGET_MMA
> -		   && (m2 == OOmode || m2 == XOmode))
> +		   && (m2 == OOmode || m2 == XOmode || m2 == TDOmode))
>  	    {
>  	      addr_mask |= RELOAD_REG_OFFSET;
>  	      if (rc == RELOAD_REG_FPR || rc == RELOAD_REG_VMX)
> @@ -3021,6 +3049,14 @@ rs6000_init_hard_regno_mode_ok (bool global_init_p)
>        rs6000_vector_align[XOmode] = 512;
>      }
>  
> +  /* Add support for 1,024 bit DMR registers.  */
> +  if (TARGET_DENSE_MATH)
> +    {
> +      rs6000_vector_unit[TDOmode] = VECTOR_NONE;
> +      rs6000_vector_mem[TDOmode] = VECTOR_VSX;
> +      rs6000_vector_align[TDOmode] = 512;
> +    }
> +
>    /* Register class constraints for the constraints that depend on compile
>       switches. When the VSX code was added, different constraints were added
>       based on the type (DFmode, V2DFmode, V4SFmode).  For the vector types, all
> @@ -3234,6 +3270,12 @@ rs6000_init_hard_regno_mode_ok (bool global_init_p)
>  	}
>      }
>  
> +  if (TARGET_DENSE_MATH)
> +    {
> +      reg_addr[TDOmode].reload_load = CODE_FOR_reload_dmr_from_memory;
> +      reg_addr[TDOmode].reload_store = CODE_FOR_reload_dmr_to_memory;
> +    }
> +
>    /* Precalculate HARD_REGNO_NREGS.  */
>    for (r = 0; HARD_REGISTER_NUM_P (r); ++r)
>      for (m = 0; m < NUM_MACHINE_MODES; ++m)
> @@ -8800,12 +8842,15 @@ reg_offset_addressing_ok_p (machine_mode mode)
>  	return mode_supports_dq_form (mode);
>        break;
>  
> -      /* The vector pair/quad types support offset addressing if the
> -	 underlying vectors support offset addressing.  */
> +      /* The vector pair/quad types and the dense math types support offset

Nit: s/types/type/

> +	 addressing if the underlying vectors support offset addressing.  */
>      case E_OOmode:
>      case E_XOmode:
>        return TARGET_MMA;
>  
> +    case E_TDOmode:
> +      return TARGET_DENSE_MATH;
> +
>      case E_SDmode:
>        /* If we can do direct load/stores of SDmode, restrict it to reg+reg
>  	 addressing for the LFIWZX and STFIWX instructions.  */
> @@ -11354,6 +11399,12 @@ rs6000_emit_move (rtx dest, rtx source, machine_mode mode)
>  	       (mode == OOmode) ? "__vector_pair" : "__vector_quad");
>        break;
>  
> +    case E_TDOmode:
> +      if (CONST_INT_P (operands[1]))
> +	error ("%qs is an opaque type, and you cannot set it to constants",
> +	       "__dmr");
> +      break;
> +
>      case E_SImode:
>      case E_DImode:
>        /* Use default pattern for address of ELF small data */
> @@ -12817,7 +12868,7 @@ rs6000_secondary_reload_simple_move (enum rs6000_reg_type to_type,
>  
>    /* We can transfer between VSX registers and DMR registers without needing
>       extra registers.  */
> -  if (TARGET_DENSE_MATH && mode == XOmode
> +  if (TARGET_DENSE_MATH && (mode == XOmode || mode == TDOmode)
>        && ((to_type == DMR_REG_TYPE && from_type == VSX_REG_TYPE)
>  	  || (to_type == VSX_REG_TYPE && from_type == DMR_REG_TYPE)))
>      return true;
> @@ -13618,6 +13669,9 @@ rs6000_preferred_reload_class (rtx x, enum reg_class rclass)
>        if (mode == XOmode)
>  	return TARGET_DENSE_MATH ? VSX_REGS : FLOAT_REGS;
>  

Nit: The comment above should be updated as well:

   /* For the vector pair and vector quad modes, prefer their natural register
      (VSX or FPR) rather than GPR registers.  For other integer types, prefer
      the GPR registers.  */

> +      if (mode == TDOmode)
> +	return VSX_REGS;
> +
>        if (GET_MODE_CLASS (mode) == MODE_INT)
>  	return GENERAL_REGS;
>      }
> @@ -13741,8 +13795,9 @@ rs6000_secondary_reload_class (enum reg_class rclass, machine_mode mode,
>    else
>      regno = -1;
>  
> -  /* DMR registers don't have loads or stores.  We have to go through the VSX
> -     registers to load XOmode (vector quad).  */
> +  /* Dense math registers don't have loads or stores.  We have to go through
> +     the VSX registers to load XOmode (vector quad) and TDOmode (dmr 1024
> +     bit).  */
>    if (TARGET_DENSE_MATH && rclass == DM_REGS)
>      return VSX_REGS;
>  
> @@ -20830,6 +20885,8 @@ rs6000_mangle_type (const_tree type)
>      return "u13__vector_pair";
>    if (type == vector_quad_type_node)
>      return "u13__vector_quad";
> +  if (type == dmr_type_node)
> +    return "u5__dmr";
>  
>    /* For all other types, use the default mangling.  */
>    return NULL;
> @@ -22954,6 +23011,10 @@ rs6000_dmr_register_move_cost (machine_mode mode, reg_class_t rclass)
>        if (mode == XOmode)
>  	return reg_move_base;
>  
> +      /* __dmr (i.e. TDOmode) is transferred in 2 instructions.  */
> +      else if (mode == TDOmode)
> +	return reg_move_base * 2;
> +
>        else
>  	return reg_move_base * 2 * hard_regno_nregs (FIRST_DMR_REGNO, mode);
>      }
> @@ -27651,9 +27712,10 @@ rs6000_split_multireg_move (rtx dst, rtx src)
>    mode = GET_MODE (dst);
>    nregs = hard_regno_nregs (reg, mode);
>  
> -  /* If we have a vector quad register for MMA, and this is a load or store,
> -     see if we can use vector paired load/stores.  */
> -  if (mode == XOmode && TARGET_MMA
> +  /* If we have a vector quad register for MMA or DMR register for dense math,
> +     and this is a load or store, see if we can use vector paired
> +     load/stores.  */
> +  if ((mode == XOmode || mode == TDOmode) && TARGET_MMA
>        && (MEM_P (dst) || MEM_P (src)))
>      {
>        reg_mode = OOmode;
> @@ -27661,7 +27723,7 @@ rs6000_split_multireg_move (rtx dst, rtx src)
>      }
>    /* If we have a vector pair/quad mode, split it into two/four separate
>       vectors.  */

Nit: The above comments need to be updated.

> -  else if (mode == OOmode || mode == XOmode)
> +  else if (mode == OOmode || mode == XOmode || mode == TDOmode)
>      reg_mode = V1TImode;
>    else if (FP_REGNO_P (reg))
>      reg_mode = DECIMAL_FLOAT_MODE_P (mode) ? DDmode :
> @@ -27707,13 +27769,13 @@ rs6000_split_multireg_move (rtx dst, rtx src)
>        return;
>      }
>  
> -  /* The __vector_pair and __vector_quad modes are multi-register
> -     modes, so if we have to load or store the registers, we have to be
> -     careful to properly swap them if we're in little endian mode
> -     below.  This means the last register gets the first memory
> -     location.  We also need to be careful of using the right register
> -     numbers if we are splitting XO to OO.  */
> -  if (mode == OOmode || mode == XOmode)
> +  /* The __vector_pair, __vector_quad, and __dmr modes are multi-register
> +     modes, so if we have to load or store the registers, we have to be careful
> +     to properly swap them if we're in little endian mode below.  This means
> +     the last register gets the first memory location.  We also need to be
> +     careful of using the right register numbers if we are splitting XO to
> +     OO.  */
> +  if (mode == OOmode || mode == XOmode || mode == TDOmode)
>      {
>        nregs = hard_regno_nregs (reg, mode);
>        int reg_mode_nregs = hard_regno_nregs (reg, reg_mode);
> @@ -27850,7 +27912,7 @@ rs6000_split_multireg_move (rtx dst, rtx src)
>  	 overlap.  */
>        int i;
>        /* XO/OO are opaque so cannot use subregs. */
> -      if (mode == OOmode || mode == XOmode )
> +      if (mode == OOmode || mode == XOmode || mode == TDOmode)
>  	{
>  	  for (i = nregs - 1; i >= 0; i--)
>  	    {
> @@ -28024,7 +28086,7 @@ rs6000_split_multireg_move (rtx dst, rtx src)
>  	    continue;
>  
>  	  /* XO/OO are opaque so cannot use subregs. */
> -	  if (mode == OOmode || mode == XOmode )
> +	  if (mode == OOmode || mode == XOmode || mode == TDOmode)
>  	    {
>  	      rtx dst_i = gen_rtx_REG (reg_mode, REGNO (dst) + j);
>  	      rtx src_i = gen_rtx_REG (reg_mode, REGNO (src) + j);
> @@ -29006,7 +29068,8 @@ rs6000_invalid_conversion (const_tree fromtype, const_tree totype)
>  
>    if (frommode != tomode)
>      {
> -      /* Do not allow conversions to/from XOmode and OOmode types.  */
> +      /* Do not allow conversions to/from XOmode, OOmode, and TDOmode
> +	 types.  */
>        if (frommode == XOmode)
>  	return N_("invalid conversion from type %<__vector_quad%>");
>        if (tomode == XOmode)
> @@ -29015,6 +29078,10 @@ rs6000_invalid_conversion (const_tree fromtype, const_tree totype)
>  	return N_("invalid conversion from type %<__vector_pair%>");
>        if (tomode == OOmode)
>  	return N_("invalid conversion to type %<__vector_pair%>");
> +      if (frommode == TDOmode)
> +	return N_("invalid conversion from type %<__dmr%>");
> +      if (tomode == TDOmode)
> +	return N_("invalid conversion to type %<__dmr%>");
>      }
>  
>    /* Conversion allowed.  */
> diff --git a/gcc/config/rs6000/rs6000.h b/gcc/config/rs6000/rs6000.h
> index 22efac4a80c..9711777b5cd 100644
> --- a/gcc/config/rs6000/rs6000.h
> +++ b/gcc/config/rs6000/rs6000.h
> @@ -1004,7 +1004,8 @@ enum data_align { align_abi, align_opt, align_both };
>  /* Modes that are not vectors, but require vector alignment.  Treat these like
>     vectors in terms of loads and stores.  */
>  #define VECTOR_ALIGNMENT_P(MODE)					\
> -  (FLOAT128_VECTOR_P (MODE) || (MODE) == OOmode || (MODE) == XOmode)
> +  (FLOAT128_VECTOR_P (MODE) || (MODE) == OOmode || (MODE) == XOmode	\
> +   || (MODE) == TDOmode)
>  
>  #define ALTIVEC_VECTOR_MODE(MODE)					\
>    ((MODE) == V16QImode							\
> @@ -2293,6 +2294,7 @@ enum rs6000_builtin_type_index
>    RS6000_BTI_const_str,		 /* pointer to const char * */
>    RS6000_BTI_vector_pair,	 /* unsigned 256-bit types (vector pair).  */
>    RS6000_BTI_vector_quad,	 /* unsigned 512-bit types (vector quad).  */
> +  RS6000_BTI_dmr,		 /* unsigned 1,024-bit types (dmr).  */
>    RS6000_BTI_const_ptr_void,     /* const pointer to void */
>    RS6000_BTI_ptr_V16QI,
>    RS6000_BTI_ptr_V1TI,
> @@ -2331,6 +2333,7 @@ enum rs6000_builtin_type_index
>    RS6000_BTI_ptr_dfloat128,
>    RS6000_BTI_ptr_vector_pair,
>    RS6000_BTI_ptr_vector_quad,
> +  RS6000_BTI_ptr_dmr,
>    RS6000_BTI_ptr_long_long,
>    RS6000_BTI_ptr_long_long_unsigned,
>    RS6000_BTI_MAX
> @@ -2388,6 +2391,7 @@ enum rs6000_builtin_type_index
>  #define const_str_type_node		 (rs6000_builtin_types[RS6000_BTI_const_str])
>  #define vector_pair_type_node		 (rs6000_builtin_types[RS6000_BTI_vector_pair])
>  #define vector_quad_type_node		 (rs6000_builtin_types[RS6000_BTI_vector_quad])
> +#define dmr_type_node			 (rs6000_builtin_types[RS6000_BTI_dmr])
>  #define pcvoid_type_node		 (rs6000_builtin_types[RS6000_BTI_const_ptr_void])
>  #define ptr_V16QI_type_node		 (rs6000_builtin_types[RS6000_BTI_ptr_V16QI])
>  #define ptr_V1TI_type_node		 (rs6000_builtin_types[RS6000_BTI_ptr_V1TI])
> @@ -2426,6 +2430,7 @@ enum rs6000_builtin_type_index
>  #define ptr_dfloat128_type_node		 (rs6000_builtin_types[RS6000_BTI_ptr_dfloat128])
>  #define ptr_vector_pair_type_node	 (rs6000_builtin_types[RS6000_BTI_ptr_vector_pair])
>  #define ptr_vector_quad_type_node	 (rs6000_builtin_types[RS6000_BTI_ptr_vector_quad])
> +#define ptr_dmr_type_node		 (rs6000_builtin_types[RS6000_BTI_ptr_dmr])
>  #define ptr_long_long_integer_type_node	 (rs6000_builtin_types[RS6000_BTI_ptr_long_long])
>  #define ptr_long_long_unsigned_type_node (rs6000_builtin_types[RS6000_BTI_ptr_long_long_unsigned])
>  
> diff --git a/gcc/testsuite/gcc.target/powerpc/dm-1024bit.c b/gcc/testsuite/gcc.target/powerpc/dm-1024bit.c
> new file mode 100644
> index 00000000000..0a9884ddf63
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/dm-1024bit.c
> @@ -0,0 +1,63 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target powerpc_dense_math_ok } */
> +/* { dg-options "-mdejagnu-cpu=future -O2" } */
> +
> +/* Test basic load/store for __dmr type.  */
> +
> +#ifndef CONSTRAINT
> +#if defined(USE_D)
> +#define CONSTRAINT "d"
> +
> +#elif defined(USE_V)
> +#define CONSTRAINT "v"
> +
> +#elif defined(USE_WA)
> +#define CONSTRAINT "wa"
> +
> +#else
> +#define CONSTRAINT "wD"
> +#endif
> +#endif
> +const char constraint[] = CONSTRAINT;
> +
> +void foo_mem_asm (__dmr *p, __dmr *q)
> +{
> +  /* 2 LXVP instructions.  */

Nit: s/2/4/

> +  __dmr vq = *p;
> +
> +  /* 2 DMXXINSTDMR512 instructions to transfer VSX to DMR.  */
> +  __asm__ ("# foo (" CONSTRAINT ") %A0" : "+" CONSTRAINT (vq));
> +  /* 2 DMXXEXTFDMR512 instructions to transfer DMR to VSX.  */
> +
> +  /* 2 STXVP instructions.  */

Ditto.

> +  *q = vq;
> +}
> +
> +void foo_mem_asm2 (__dmr *p, __dmr *q)
> +{
> +  /* 2 LXVP instructions.  */

Ditto.

> +  __dmr vq = *p;
> +  __dmr vq2;
> +  __dmr vq3;

Nit: vq3 is unused.

> +
> +  /* 2 DMXXINSTDMR512 instructions to transfer VSX to DMR.  */
> +  __asm__ ("# foo1 (" CONSTRAINT ") %A0" : "+" CONSTRAINT (vq));
> +  /* 2 DMXXEXTFDMR512 instructions to transfer DMR to VSX.  */
> +
> +  vq2 = vq;
> +  __asm__ ("# foo2 (wa) %0" : "+wa" (vq2));
> +
> +  /* 2 STXVP instructions.  */

Nit: s/2/4/

> +  *q = vq2;
> +}
> +
> +void foo_mem (__dmr *p, __dmr *q)
> +{
> +  /* 2 LXVP, 2 STXVP instructions, no DMR transfer.  */

Ditto.

> +  *q = *p;
> +}
> +
> +/* { dg-final { scan-assembler-times {\mdmxxextfdmr512\M}  4 } } */
> +/* { dg-final { scan-assembler-times {\mdmxxinstdmr512\M}  4 } } */
> +/* { dg-final { scan-assembler-times {\mlxvp\M}           12 } } */
> +/* { dg-final { scan-assembler-times {\mstxvp\M}          12 } } */
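
For reference, the counts in the scan-assembler-times directives can be
derived from the register sizes.  The sketch below is only illustrative
arithmetic, not part of the patch.  It assumes the default "wD" constraint,
that each of the three functions loads and stores one __dmr, and that only the
two asm functions move the value through a DMR.  The helper names are
hypothetical.

```c
#include <assert.h>

/* Sizes in bits: a __dmr value, a vector pair (LXVP/STXVP), and a 512-bit
   half moved by DMXXINSTDMR512/DMXXEXTFDMR512.  */
enum { DMR_BITS = 1024, VECTOR_PAIR_BITS = 256, XO_BITS = 512 };

/* LXVP (or STXVP) instructions needed per __dmr load (or store).  */
static int pair_insns_per_dmr (void)
{
  return DMR_BITS / VECTOR_PAIR_BITS;
}

/* DMXXINSTDMR512 (or DMXXEXTFDMR512) instructions per VSX<->DMR transfer.  */
static int xfer_insns_per_dmr (void)
{
  return DMR_BITS / XO_BITS;
}

/* Total LXVP (or STXVP) count for N_MEM_FUNCS functions.  */
static int total_pair_insns (int n_mem_funcs)
{
  assert (n_mem_funcs >= 0);
  return n_mem_funcs * pair_insns_per_dmr ();
}

/* Total insert (or extract) count for N_ASM_FUNCS functions.  */
static int total_xfer_insns (int n_asm_funcs)
{
  assert (n_asm_funcs >= 0);
  return n_asm_funcs * xfer_insns_per_dmr ();
}
```

With three functions touching memory and two using the DMR constraint, this
gives 12 for lxvp/stxvp and 4 for the insert/extract instructions, which is
why the per-function comments should say 4 rather than 2.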


The rest looks good to me, thanks!

BR,
Kewen
  
Michael Meissner Feb. 8, 2024, 12:35 a.m. UTC | #3
On Mon, Feb 05, 2024 at 11:58:31AM +0800, Kewen.Lin wrote:
> Hi Mike,

I will respond to about half of the comments now, and come back to the rest
later.

> on 2024/1/6 07:42, Michael Meissner wrote:
> > This patch is a preliminary patch to add the full 1,024 bit dense math register
> > (DMRs) for -mcpu=future.  The MMA 512-bit accumulators map onto the top of the
> > DMR register.
> > 
> > This patch only adds the new 1,024 bit register support.  It does not add
> > support for any instructions that need 1,024 bit registers instead of 512 bit
> > registers.
> > 
> > I used the new mode 'TDOmode' to be the opaque mode used for 1,204 bit
> 
> typo: 1,204

Thanks.

> > +(define_insn_and_split "*movtdo"
> > +  [(set (match_operand:TDO 0 "nonimmediate_operand" "=wa,m,wa,wD,wD,wa")
> > +	(match_operand:TDO 1 "input_operand" "m,wa,wa,wa,wD,wD"))]
> > +  "TARGET_DENSE_MATH
> > +   && (gpc_reg_operand (operands[0], TDOmode)
> > +       || gpc_reg_operand (operands[1], TDOmode))"
> > +  "@
> > +   #
> > +   #
> > +   #
> > +   #
> > +   dmmr %0,%1
> > +   #"
> > +  "&& reload_completed
> > +   && (!dmr_operand (operands[0], TDOmode) || !dmr_operand (operands[1], TDOmode))"
> > +  [(const_int 0)]
> > +{
> > +  rtx op0 = operands[0];
> > +  rtx op1 = operands[1];
> > +
> > +  if (REG_P (op0) && REG_P (op1))
> > +    {
> > +      int regno0 = REGNO (op0);
> > +      int regno1 = REGNO (op1);
> > +
> > +      if (DMR_REGNO_P (regno0) && VSX_REGNO_P (regno1))
> > +	{
> > +	  rtx op1_upper = gen_rtx_REG (XOmode, regno1);
> > +	  rtx op1_lower = gen_rtx_REG (XOmode, regno1 + 4);
> > +	  emit_insn (gen_movtdo_insert512_upper (op0, op1_upper));
> > +	  emit_insn (gen_movtdo_insert512_lower (op0, op0, op1_lower));
> > +	  DONE;
> > +	}
> > +
> > +      else if (VSX_REGNO_P (regno0) && DMR_REGNO_P (regno1))
> > +	{
> > +	  rtx op0_upper = gen_rtx_REG (XOmode, regno0);
> > +	  rtx op0_lower = gen_rtx_REG (XOmode, regno0 + 4);
> > +	  emit_insn (gen_movtdo_extract512 (op0_upper, op1, const0_rtx));
> > +	  emit_insn (gen_movtdo_extract512 (op0_lower, op1, const1_rtx));
> > +	  DONE;
> > +	}
> 
> Add an assertion like gcc_assert (VSX_REGNO_P (regno0) && VSX_REGNO_P (regno1))?

Ok.
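
To spell out what the assertion would guard, here is a minimal sketch of the
three register-to-register cases that can reach the splitter.  The register
numbers and the toy_* names are hypothetical stand-ins for the real
DMR_REGNO_P and VSX_REGNO_P macros; the DMR-to-DMR case is excluded because
the split condition only fires when at least one operand is not a DMR.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical register numbering, for illustration only.  */
enum
{
  TOY_FIRST_VSX = 32, TOY_LAST_VSX = 95,
  TOY_FIRST_DMR = 112, TOY_LAST_DMR = 119
};

enum toy_move_kind { MOVE_VSX_TO_DMR, MOVE_DMR_TO_VSX, MOVE_VSX_TO_VSX };

static bool toy_vsx_regno_p (int regno)
{
  return regno >= TOY_FIRST_VSX && regno <= TOY_LAST_VSX;
}

static bool toy_dmr_regno_p (int regno)
{
  return regno >= TOY_FIRST_DMR && regno <= TOY_LAST_DMR;
}

/* Classify a register move where REGNO0 is the destination and REGNO1 is
   the source.  */
static enum toy_move_kind toy_classify_move (int regno0, int regno1)
{
  if (toy_dmr_regno_p (regno0) && toy_vsx_regno_p (regno1))
    return MOVE_VSX_TO_DMR;	/* Two 512-bit inserts.  */

  if (toy_vsx_regno_p (regno0) && toy_dmr_regno_p (regno1))
    return MOVE_DMR_TO_VSX;	/* Two 512-bit extracts.  */

  /* Anything else falls through to the generic multi-register split, which
     is only expected to see VSX to VSX moves.  */
  assert (toy_vsx_regno_p (regno0) && toy_vsx_regno_p (regno1));
  return MOVE_VSX_TO_VSX;
}
```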

> > +
> > +;; Reload DMR registers from memory
> > +(define_insn_and_split "reload_dmr_from_memory"
> > +  [(set (match_operand:TDO 0 "dmr_operand" "=wD")
> > +	(unspec:TDO [(match_operand:TDO 1 "memory_operand" "m")]
> > +		    UNSPEC_DMR_RELOAD_FROM_MEMORY))
> > +   (clobber (match_operand:XO 2 "vsx_register_operand" "=wa"))]
> > +  "TARGET_DENSE_MATH"
> > +  "#"
> > +  "&& reload_completed"
> > +  [(const_int 0)]
> > +{
> > +  rtx dest = operands[0];
> > +  rtx src = operands[1];
> > +  rtx tmp = operands[2];
> > +  rtx mem_upper = adjust_address (src, XOmode, BYTES_BIG_ENDIAN ? 0 : 32);
> > +  rtx mem_lower = adjust_address (src, XOmode, BYTES_BIG_ENDIAN ? 32 : 0);
> 
> I think the offset should be 64 rather than 32.

Good catch, thanks.
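
As a sanity check on the corrected value: TDOmode is 128 bytes and each XOmode
half is 64 bytes, so the second half has to start at byte 64.  The sketch
below only restates that arithmetic; half_offset and halves_cover_value are
hypothetical helpers, not code from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Sizes in bytes of the full __dmr value and of one 512-bit half.  */
enum { TDO_BYTES = 128, XO_BYTES = 64 };

/* Byte offset of the upper or lower 512-bit half within the memory image.
   This mirrors the BYTES_BIG_ENDIAN ? 0 : N selection in the quoted code,
   with N equal to the size of one half.  */
static int half_offset (bool big_endian, bool upper)
{
  return (big_endian == upper) ? 0 : XO_BYTES;
}

/* True if the two halves are distinct and together cover the whole value.  */
static bool halves_cover_value (bool big_endian)
{
  int upper = half_offset (big_endian, true);
  int lower = half_offset (big_endian, false);
  int high = upper > lower ? upper : lower;

  assert (upper != lower);
  return high + XO_BYTES == TDO_BYTES;
}
```

With an offset of 32 the second access would overlap the first and the last 32
bytes would never be touched.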

> > +
> > +  emit_move_insn (tmp, mem_upper);
> > +  emit_insn (gen_movtdo_insert512_upper (dest, tmp));
> > +
> > +  emit_move_insn (tmp, mem_lower);
> > +  emit_insn (gen_movtdo_insert512_lower (dest, dest, tmp));
> > +  DONE;
> > +}
> > +  [(set_attr "length" "16")
> > +   (set_attr "max_prefixed_insns" "2")
> > +   (set_attr "type" "vecload")])
> > +
> > +;; Reload dense math registers to memory
> > +(define_insn_and_split "reload_dmr_to_memory"
> > +  [(set (match_operand:TDO 0 "memory_operand" "=m")
> > +	(unspec:TDO [(match_operand:TDO 1 "dmr_operand" "wD")]
> > +		    UNSPEC_DMR_RELOAD_TO_MEMORY))
> > +   (clobber (match_operand:XO 2 "vsx_register_operand" "=wa"))]
> > +  "TARGET_DENSE_MATH"
> > +  "#"
> > +  "&& reload_completed"
> > +  [(const_int 0)]
> > +{
> > +  rtx dest = operands[0];
> > +  rtx src = operands[1];
> > +  rtx tmp = operands[2];
> > +  rtx mem_upper = adjust_address (dest, XOmode, BYTES_BIG_ENDIAN ? 0 : 32);
> > +  rtx mem_lower = adjust_address (dest, XOmode, BYTES_BIG_ENDIAN ? 32 : 0);
> 
> Ditto.

Yep.

> > diff --git a/gcc/config/rs6000/rs6000-builtin.cc b/gcc/config/rs6000/rs6000-builtin.cc
> > index 6698274031b..54868d2009c 100644
> > --- a/gcc/config/rs6000/rs6000-builtin.cc
> > +++ b/gcc/config/rs6000/rs6000-builtin.cc
> > @@ -495,6 +495,8 @@ const char *rs6000_type_string (tree type_node)
> >      return "__vector_pair";
> >    else if (type_node == vector_quad_type_node)
> >      return "__vector_quad";
> > +  else if (type_node == dmr_type_node)
> > +    return "__dmr";
> >  
> >    return "unknown";
> >  }
> > @@ -781,6 +783,17 @@ rs6000_init_builtins (void)
> >    t = build_qualified_type (vector_quad_type_node, TYPE_QUAL_CONST);
> >    ptr_vector_quad_type_node = build_pointer_type (t);
> >  
> > +  dmr_type_node = make_node (OPAQUE_TYPE);
> > +  SET_TYPE_MODE (dmr_type_node, TDOmode);
> > +  TYPE_SIZE (dmr_type_node) = bitsize_int (GET_MODE_BITSIZE (TDOmode));
> > +  TYPE_PRECISION (dmr_type_node) = GET_MODE_BITSIZE (TDOmode);
> > +  TYPE_SIZE_UNIT (dmr_type_node) = size_int (GET_MODE_SIZE (TDOmode));
> > +  SET_TYPE_ALIGN (dmr_type_node, 512);
> 
> why not 1024?

Since we don't have a 1,024 bit load/store and have to use multiple vector pair
or vector load/stores, there is no reason to ask for 1,024 bit alignment.  In
addition, I would worry that a larger alignment might cause problems with the
stack, since I don't believe we have support for aligning the stack to 1,024
bit boundaries.
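
To illustrate the layout this implies, here is a small sketch using a
hypothetical stand-in type (struct toy_dmr is not the real __dmr): 128 bytes
of storage with 64-byte alignment.  It only shows that 512-bit alignment
already keeps each 256-bit chunk accessed by a vector pair load/store
naturally aligned.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical stand-in for __dmr: 1,024 bits of storage, 512-bit aligned.  */
struct toy_dmr
{
  alignas (64) unsigned char bytes[128];
};

static_assert (sizeof (struct toy_dmr) == 128, "1,024 bits of storage");
static_assert (alignof (struct toy_dmr) == 64, "512-bit alignment");

/* Two adjacent values, as they might sit in an array or on the stack.  */
static struct toy_dmr toy_pair[2];

/* True if the 32-byte chunk number K (0..3) of D starts on a 32-byte
   boundary.  */
static int chunk_aligned_p (const struct toy_dmr *d, int k)
{
  assert (k >= 0 && k < 4);
  return ((uintptr_t) (d->bytes + 32 * k) % 32) == 0;
}
```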

> > --- a/gcc/config/rs6000/rs6000-call.cc
> > +++ b/gcc/config/rs6000/rs6000-call.cc
> > @@ -437,7 +437,8 @@ rs6000_return_in_memory (const_tree type, const_tree fntype ATTRIBUTE_UNUSED)
> >    if (cfun
> >        && !cfun->machine->mma_return_type_error
> >        && TREE_TYPE (cfun->decl) == fntype
> > -      && (TYPE_MODE (type) == OOmode || TYPE_MODE (type) == XOmode))
> > +      && (TYPE_MODE (type) == OOmode || TYPE_MODE (type) == XOmode
> > +	  || TYPE_MODE (type) == TDOmode))
> 
> Maybe just use OPAQUE_MODE_P (TYPE_MODE (type)) to cover all of the type mode cases.

Basically I forgot about using OPAQUE_MODE_P in this case.  Using OPAQUE_MODE_P
is better.
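
If a predicate restricted to the rs6000 opaque modes turns out to be
preferable to the generic check, it could look roughly like the sketch below.
This is a toy model with a hypothetical mode enum and function name, not
GCC's machine_mode or an existing macro.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy mode list; hypothetical, not GCC's machine_mode.  */
enum toy_mode
{
  TOY_V1TImode,
  TOY_OOmode,	/* __vector_pair, 256 bits.  */
  TOY_XOmode,	/* __vector_quad, 512 bits.  */
  TOY_TDOmode,	/* __dmr, 1,024 bits.  */
  TOY_NUM_MODES
};

/* True only for the three opaque modes discussed in this thread.  */
static bool toy_opaque_mma_mode_p (enum toy_mode mode)
{
  assert ((int) mode >= 0 && (int) mode < TOY_NUM_MODES);

  switch (mode)
    {
    case TOY_OOmode:
    case TOY_XOmode:
    case TOY_TDOmode:
      return true;

    default:
      return false;
    }
}
```

A single predicate like this replaces the chain of mode comparisons, so a
future opaque mode only needs to be added in one place.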

> So far only rs6000 defines OPAQUE_MODE.  If we are worried that there may be some generic opaque modes
> some day, we can probably add an assertion somewhere to guarantee it.  Or add a macro like
> OPAQUE_MMA_MODE_P to ensure it only matches {OO,XO,TDO}mode.
> 
> >      {
> >        /* Record we have now handled function CFUN, so the next time we
> >  	 are called, we do not re-report the same error.  */
> > @@ -1641,6 +1642,16 @@ rs6000_function_arg (cumulative_args_t cum_v, const function_arg_info &arg)
> >        return NULL_RTX;
> >      }
> >  
> > +  if (mode == TDOmode)
> > +    {
> > +      if (TYPE_CANONICAL (type) != NULL_TREE)
> > +	type = TYPE_CANONICAL (type);
> > +      error ("invalid use of dense math operand of type %qs as a function "
> > +	     "parameter",
> > +	     IDENTIFIER_POINTER (DECL_NAME (TYPE_NAME (type))));
> > +      return NULL_RTX;
> > +    }
> 
> Can we merge this hunk into the above hunk for OOmode and XOmode?  Then the code with TYPE_CANONICAL
> can be shared and is easier to maintain.  IMHO, this dense math operand is also an MMA operand, so the above
> error message still works.  If it's desired to call out the dense math operand, then we can use
> (mode == TDOmode) ? "dense math" : "MMA" for the part of the string that differs.

I will need to look into this later.
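
For when I do look at it, the shape of the suggestion is roughly the sketch
below.  It is a standalone illustration that uses snprintf in place of GCC's
error () and a plain quoted %s in place of %qs.  The toy_* names and the
fixed-size buffer are assumptions made for the example.

```c
#include <stdbool.h>
#include <stdio.h>

/* Pick the part of the message that differs between the two cases.  */
static const char *toy_operand_kind (bool is_tdo)
{
  return is_tdo ? "dense math" : "MMA";
}

/* Build the shared diagnostic text into a static buffer and return it.
   TYPE_NAME would be something like "__vector_quad" or "__dmr".  */
static const char *toy_param_error (bool is_tdo, const char *type_name)
{
  static char buf[128];

  snprintf (buf, sizeof buf,
	    "invalid use of %s operand of type '%s' as a function parameter",
	    toy_operand_kind (is_tdo), type_name);
  return buf;
}
```

Whether the message should be assembled from fragments like this, or kept as
two complete strings, is part of what needs checking.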

> > +
> >    /* Return a marker to indicate whether CR1 needs to set or clear the
> >       bit that V.4 uses to say fp args were passed in registers.
> >       Assume that we don't need the marker for software floating point,
> > diff --git a/gcc/config/rs6000/rs6000-modes.def b/gcc/config/rs6000/rs6000-modes.def
> > index 094b246c834..60ebb363196 100644
> > --- a/gcc/config/rs6000/rs6000-modes.def
> > +++ b/gcc/config/rs6000/rs6000-modes.def
> > @@ -86,3 +86,7 @@ PARTIAL_INT_MODE (TI, 128, PTI);
> >  /* Modes used by __vector_pair and __vector_quad.  */
> >  OPAQUE_MODE (OO, 32);
> >  OPAQUE_MODE (XO, 64);
> > +
> > +/* Modes used by __dmr.  */
> 
> Nit: s/Modes/Mode/
> 
> > +OPAQUE_MODE (TDO, 128);
> > +
> 
> I assumed that "TD" stands for something but I have no idea what (at least it is not obvious to me).
> Could we also add a comment explaining it?

Basically Segher and I went back and forth on the names.  I would have to dig
into my notes to find what TDO stands for.

> > diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> > index 59517c8608d..aed4b72c4ea 100644
> > --- a/gcc/config/rs6000/rs6000.cc
> > +++ b/gcc/config/rs6000/rs6000.cc
> > @@ -1846,7 +1846,9 @@ rs6000_hard_regno_nregs_internal (int regno, machine_mode mode)
> >       128-bit floating point that can go in vector registers, which has VSX
> >       memory addressing.  */
> >    if (FP_REGNO_P (regno))
> > -    reg_size = (VECTOR_MEM_VSX_P (mode) || VECTOR_ALIGNMENT_P (mode)
> > +    reg_size = (VECTOR_MEM_VSX_P (mode)
> > +		|| VECTOR_ALIGNMENT_P (mode)
> > +		|| mode == TDOmode
> 
> Redundant change, since VECTOR_ALIGNMENT_P already covers TDOmode with this patch.

Ok.

And I'll get back to the rest of the comments shortly.
  

Patch

diff --git a/gcc/config/rs6000/mma.md b/gcc/config/rs6000/mma.md
index f06e6bbb184..37de9030903 100644
--- a/gcc/config/rs6000/mma.md
+++ b/gcc/config/rs6000/mma.md
@@ -92,6 +92,11 @@  (define_c_enum "unspec"
    UNSPEC_MMA_XXMFACC
    UNSPEC_MMA_XXMTACC
    UNSPEC_DM_ASSEMBLE_ACC
+   UNSPEC_DM_INSERT512_UPPER
+   UNSPEC_DM_INSERT512_LOWER
+   UNSPEC_DM_EXTRACT512
+   UNSPEC_DMR_RELOAD_FROM_MEMORY
+   UNSPEC_DMR_RELOAD_TO_MEMORY
   ])
 
 (define_c_enum "unspecv"
@@ -879,3 +884,150 @@  (define_insn "mma_<avvi4i4i4>"
   [(set_attr "type" "mma")
    (set_attr "prefixed" "yes")
    (set_attr "isa" "dm,not_dm,not_dm")])
+
+
+;; TDOmode (i.e. __dmr).
+(define_expand "movtdo"
+  [(set (match_operand:TDO 0 "nonimmediate_operand")
+	(match_operand:TDO 1 "input_operand"))]
+  "TARGET_DENSE_MATH"
+{
+  rs6000_emit_move (operands[0], operands[1], TDOmode);
+  DONE;
+})
+
+(define_insn_and_split "*movtdo"
+  [(set (match_operand:TDO 0 "nonimmediate_operand" "=wa,m,wa,wD,wD,wa")
+	(match_operand:TDO 1 "input_operand" "m,wa,wa,wa,wD,wD"))]
+  "TARGET_DENSE_MATH
+   && (gpc_reg_operand (operands[0], TDOmode)
+       || gpc_reg_operand (operands[1], TDOmode))"
+  "@
+   #
+   #
+   #
+   #
+   dmmr %0,%1
+   #"
+  "&& reload_completed
+   && (!dmr_operand (operands[0], TDOmode) || !dmr_operand (operands[1], TDOmode))"
+  [(const_int 0)]
+{
+  rtx op0 = operands[0];
+  rtx op1 = operands[1];
+
+  if (REG_P (op0) && REG_P (op1))
+    {
+      int regno0 = REGNO (op0);
+      int regno1 = REGNO (op1);
+
+      if (DMR_REGNO_P (regno0) && VSX_REGNO_P (regno1))
+	{
+	  rtx op1_upper = gen_rtx_REG (XOmode, regno1);
+	  rtx op1_lower = gen_rtx_REG (XOmode, regno1 + 4);
+	  emit_insn (gen_movtdo_insert512_upper (op0, op1_upper));
+	  emit_insn (gen_movtdo_insert512_lower (op0, op0, op1_lower));
+	  DONE;
+	}
+
+      else if (VSX_REGNO_P (regno0) && DMR_REGNO_P (regno1))
+	{
+	  rtx op0_upper = gen_rtx_REG (XOmode, regno0);
+	  rtx op0_lower = gen_rtx_REG (XOmode, regno0 + 4);
+	  emit_insn (gen_movtdo_extract512 (op0_upper, op1, const0_rtx));
+	  emit_insn (gen_movtdo_extract512 (op0_lower, op1, const1_rtx));
+	  DONE;
+	}
+    }
+
+  rs6000_split_multireg_move (operands[0], operands[1]);
+  DONE;
+}
+  [(set_attr "type" "vecload,vecstore,vecmove,vecmove,vecmove,vecmove")
+   (set_attr "length" "*,*,32,8,*,8")
+   (set_attr "max_prefixed_insns" "4,4,*,*,*,*")])
+
+;; Move from VSX registers to DMR registers via two insert 512 bit
+;; instructions.
+(define_insn "movtdo_insert512_upper"
+  [(set (match_operand:TDO 0 "dmr_operand" "=wD")
+	(unspec:TDO [(match_operand:XO 1 "vsx_register_operand" "wa")]
+		    UNSPEC_DM_INSERT512_UPPER))]
+  "TARGET_DENSE_MATH"
+  "dmxxinstdmr512 %0,%1,%Y1,0"
+  [(set_attr "type" "mma")])
+
+(define_insn "movtdo_insert512_lower"
+  [(set (match_operand:TDO 0 "dmr_operand" "=wD")
+	(unspec:TDO [(match_operand:TDO 1 "dmr_operand" "0")
+		     (match_operand:XO 2 "vsx_register_operand" "wa")]
+		    UNSPEC_DM_INSERT512_LOWER))]
+  "TARGET_DENSE_MATH"
+  "dmxxinstdmr512 %0,%2,%Y2,1"
+  [(set_attr "type" "mma")])
+
+;; Move from DMR registers to VSX registers via two extract 512 bit
+;; instructions.
+(define_insn "movtdo_extract512"
+  [(set (match_operand:XO 0 "vsx_register_operand" "=wa")
+	(unspec:XO [(match_operand:TDO 1 "dmr_operand" "wD")
+		    (match_operand 2 "const_0_to_1_operand" "n")]
+		   UNSPEC_DM_EXTRACT512))]
+  "TARGET_DENSE_MATH"
+  "dmxxextfdmr512 %0,%Y0,%1,%2"
+  [(set_attr "type" "mma")])
+
+;; Reload DMR registers from memory
+(define_insn_and_split "reload_dmr_from_memory"
+  [(set (match_operand:TDO 0 "dmr_operand" "=wD")
+	(unspec:TDO [(match_operand:TDO 1 "memory_operand" "m")]
+		    UNSPEC_DMR_RELOAD_FROM_MEMORY))
+   (clobber (match_operand:XO 2 "vsx_register_operand" "=wa"))]
+  "TARGET_DENSE_MATH"
+  "#"
+  "&& reload_completed"
+  [(const_int 0)]
+{
+  rtx dest = operands[0];
+  rtx src = operands[1];
+  rtx tmp = operands[2];
+  rtx mem_upper = adjust_address (src, XOmode, BYTES_BIG_ENDIAN ? 0 : 32);
+  rtx mem_lower = adjust_address (src, XOmode, BYTES_BIG_ENDIAN ? 32 : 0);
+
+  emit_move_insn (tmp, mem_upper);
+  emit_insn (gen_movtdo_insert512_upper (dest, tmp));
+
+  emit_move_insn (tmp, mem_lower);
+  emit_insn (gen_movtdo_insert512_lower (dest, dest, tmp));
+  DONE;
+}
+  [(set_attr "length" "16")
+   (set_attr "max_prefixed_insns" "2")
+   (set_attr "type" "vecload")])
+
+;; Reload dense math registers to memory
+(define_insn_and_split "reload_dmr_to_memory"
+  [(set (match_operand:TDO 0 "memory_operand" "=m")
+	(unspec:TDO [(match_operand:TDO 1 "dmr_operand" "wD")]
+		    UNSPEC_DMR_RELOAD_TO_MEMORY))
+   (clobber (match_operand:XO 2 "vsx_register_operand" "=wa"))]
+  "TARGET_DENSE_MATH"
+  "#"
+  "&& reload_completed"
+  [(const_int 0)]
+{
+  rtx dest = operands[0];
+  rtx src = operands[1];
+  rtx tmp = operands[2];
+  rtx mem_upper = adjust_address (dest, XOmode, BYTES_BIG_ENDIAN ? 0 : 32);
+  rtx mem_lower = adjust_address (dest, XOmode, BYTES_BIG_ENDIAN ? 32 : 0);
+
+  emit_insn (gen_movtdo_extract512 (tmp, src, const0_rtx));
+  emit_move_insn (mem_upper, tmp);
+
+  emit_insn (gen_movtdo_extract512 (tmp, src, const1_rtx));
+  emit_move_insn (mem_lower, tmp);
+  DONE;
+}
+  [(set_attr "length" "16")
+   (set_attr "max_prefixed_insns" "2")])
diff --git a/gcc/config/rs6000/rs6000-builtin.cc b/gcc/config/rs6000/rs6000-builtin.cc
index 6698274031b..54868d2009c 100644
--- a/gcc/config/rs6000/rs6000-builtin.cc
+++ b/gcc/config/rs6000/rs6000-builtin.cc
@@ -495,6 +495,8 @@  const char *rs6000_type_string (tree type_node)
     return "__vector_pair";
   else if (type_node == vector_quad_type_node)
     return "__vector_quad";
+  else if (type_node == dmr_type_node)
+    return "__dmr";
 
   return "unknown";
 }
@@ -781,6 +783,17 @@  rs6000_init_builtins (void)
   t = build_qualified_type (vector_quad_type_node, TYPE_QUAL_CONST);
   ptr_vector_quad_type_node = build_pointer_type (t);
 
+  dmr_type_node = make_node (OPAQUE_TYPE);
+  SET_TYPE_MODE (dmr_type_node, TDOmode);
+  TYPE_SIZE (dmr_type_node) = bitsize_int (GET_MODE_BITSIZE (TDOmode));
+  TYPE_PRECISION (dmr_type_node) = GET_MODE_BITSIZE (TDOmode);
+  TYPE_SIZE_UNIT (dmr_type_node) = size_int (GET_MODE_SIZE (TDOmode));
+  SET_TYPE_ALIGN (dmr_type_node, 512);
+  TYPE_USER_ALIGN (dmr_type_node) = 0;
+  lang_hooks.types.register_builtin_type (dmr_type_node, "__dmr");
+  t = build_qualified_type (dmr_type_node, TYPE_QUAL_CONST);
+  ptr_dmr_type_node = build_pointer_type (t);
+
   tdecl = add_builtin_type ("__bool char", bool_char_type_node);
   TYPE_NAME (bool_char_type_node) = tdecl;
 
diff --git a/gcc/config/rs6000/rs6000-call.cc b/gcc/config/rs6000/rs6000-call.cc
index 8c590903c86..6e2465204cf 100644
--- a/gcc/config/rs6000/rs6000-call.cc
+++ b/gcc/config/rs6000/rs6000-call.cc
@@ -437,7 +437,8 @@  rs6000_return_in_memory (const_tree type, const_tree fntype ATTRIBUTE_UNUSED)
   if (cfun
       && !cfun->machine->mma_return_type_error
       && TREE_TYPE (cfun->decl) == fntype
-      && (TYPE_MODE (type) == OOmode || TYPE_MODE (type) == XOmode))
+      && (TYPE_MODE (type) == OOmode || TYPE_MODE (type) == XOmode
+	  || TYPE_MODE (type) == TDOmode))
     {
       /* Record we have now handled function CFUN, so the next time we
 	 are called, we do not re-report the same error.  */
@@ -1641,6 +1642,16 @@  rs6000_function_arg (cumulative_args_t cum_v, const function_arg_info &arg)
       return NULL_RTX;
     }
 
+  if (mode == TDOmode)
+    {
+      if (TYPE_CANONICAL (type) != NULL_TREE)
+	type = TYPE_CANONICAL (type);
+      error ("invalid use of dense math operand of type %qs as a function "
+	     "parameter",
+	     IDENTIFIER_POINTER (DECL_NAME (TYPE_NAME (type))));
+      return NULL_RTX;
+    }
+
   /* Return a marker to indicate whether CR1 needs to set or clear the
      bit that V.4 uses to say fp args were passed in registers.
      Assume that we don't need the marker for software floating point,
diff --git a/gcc/config/rs6000/rs6000-modes.def b/gcc/config/rs6000/rs6000-modes.def
index 094b246c834..60ebb363196 100644
--- a/gcc/config/rs6000/rs6000-modes.def
+++ b/gcc/config/rs6000/rs6000-modes.def
@@ -86,3 +86,7 @@  PARTIAL_INT_MODE (TI, 128, PTI);
 /* Modes used by __vector_pair and __vector_quad.  */
 OPAQUE_MODE (OO, 32);
 OPAQUE_MODE (XO, 64);
+
+/* Modes used by __dmr.  */
+OPAQUE_MODE (TDO, 128);
+
diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
index 59517c8608d..aed4b72c4ea 100644
--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -1846,7 +1846,9 @@  rs6000_hard_regno_nregs_internal (int regno, machine_mode mode)
      128-bit floating point that can go in vector registers, which has VSX
      memory addressing.  */
   if (FP_REGNO_P (regno))
-    reg_size = (VECTOR_MEM_VSX_P (mode) || VECTOR_ALIGNMENT_P (mode)
+    reg_size = (VECTOR_MEM_VSX_P (mode)
+		|| VECTOR_ALIGNMENT_P (mode)
+		|| mode == TDOmode
 		? UNITS_PER_VSX_WORD
 		: UNITS_PER_FP_WORD);
 
@@ -1880,9 +1882,9 @@  rs6000_hard_regno_mode_ok_uncached (int regno, machine_mode mode)
   /* On ISA 3.1 (power10), MMA accumulator modes need FPR registers divisible
      by 4.
 
-     If dense math is enabled, allow all VSX registers plus the DMR registers.
-     We need to make sure we don't cross between the boundary of FPRs and
-     traditional Altiviec registers.  */
+     If dense math is enabled, allow all VSX registers plus the dense math
+     registers.  We need to make sure we don't cross between the boundary of
+     FPRs and traditional Altivec registers.  */
   if (mode == XOmode)
     {
       if (TARGET_MMA && !TARGET_DENSE_MATH)
@@ -1904,7 +1906,27 @@  rs6000_hard_regno_mode_ok_uncached (int regno, machine_mode mode)
 	return 0;
     }
 
-  /* No other types other than XOmode can go in DMRs.  */
+  /* Dense math register modes need DMR registers or VSX registers divisible by
+     2.  We need to make sure we don't cross between the boundary of FPRs and
+     traditional Altivec registers.  */
+  if (mode == TDOmode)
+    {
+      if (!TARGET_DENSE_MATH)
+	return 0;
+
+      if (DMR_REGNO_P (regno))
+	return 1;
+
+      if (FP_REGNO_P (regno))
+	return ((regno & 1) == 0 && regno <= LAST_FPR_REGNO - 7);
+
+      if (ALTIVEC_REGNO_P (regno))
+	return ((regno & 1) == 0 && regno <= LAST_ALTIVEC_REGNO - 7);
+
+      return 0;
+    }
+
+  /* No other types other than XOmode or TDOmode can go in DMRs.  */
   if (DMR_REGNO_P (regno))
     return 0;
 
@@ -2012,9 +2034,11 @@  rs6000_hard_regno_mode_ok (unsigned int regno, machine_mode mode)
    GPR registers, and TImode can go in any GPR as well as VSX registers (PR
    57744).
 
-   Similarly, don't allow OOmode (vector pair, restricted to even VSX
-   registers) or XOmode (vector quad, restricted to FPR registers divisible
-   by 4) to tie with other modes.
+   Similarly, don't allow OOmode (vector pair), XOmode (vector quad), or
+   TDOmode (dmr register) to pair with anything else.  Vector pairs are
+   restricted to even/odd VSX registers.  Without dense math, vector quads are
+   limited to FPR registers divisible by 4.  With dense math, vector quads are
+   limited to even VSX registers or DMR registers.
 
    Altivec/VSX vector tests were moved ahead of scalar float mode, so that IEEE
    128-bit floating point on VSX systems ties with other vectors.  */
@@ -2023,7 +2047,8 @@  static bool
 rs6000_modes_tieable_p (machine_mode mode1, machine_mode mode2)
 {
   if (mode1 == PTImode || mode1 == OOmode || mode1 == XOmode
-      || mode2 == PTImode || mode2 == OOmode || mode2 == XOmode)
+      || mode1 == TDOmode || mode2 == PTImode || mode2 == OOmode
+      || mode2 == XOmode || mode2 == TDOmode)
     return mode1 == mode2;
 
   if (ALTIVEC_OR_VSX_VECTOR_MODE (mode1))
@@ -2314,6 +2339,7 @@  rs6000_debug_reg_global (void)
     V4DFmode,
     OOmode,
     XOmode,
+    TDOmode,
     CCmode,
     CCUNSmode,
     CCEQmode,
@@ -2679,7 +2705,7 @@  rs6000_setup_reg_addr_masks (void)
 	  /* Special case DMR registers.  */
 	  if (rc == RELOAD_REG_DMR)
 	    {
-	      if (TARGET_DENSE_MATH && m2 == XOmode)
+	      if (TARGET_DENSE_MATH && (m2 == XOmode || m2 == TDOmode))
 		{
 		  addr_mask = RELOAD_REG_VALID;
 		  reg_addr[m].addr_mask[rc] = addr_mask;
@@ -2786,12 +2812,14 @@  rs6000_setup_reg_addr_masks (void)
 
 	  /* Vector pairs can do both indexed and offset loads if the
 	     instructions are enabled, otherwise they can only do offset loads
-	     since it will be broken into two vector moves.  Vector quads can
-	     only do offset loads.  If the user restricted generation of either
-	     of the LXVP or STXVP instructions, do not allow indexed mode so
-	     that we can split the load/store.  */
+	     since it will be broken into two vector moves.  If the user
+	     restricted generation of either of the LXVP or STXVP instructions,
+	     do not allow indexed mode so that we can split the load/store.
+
+	     Vector quads and dense math 1,024 bit registers can only do offset
+	     loads.  */
 	  else if ((addr_mask != 0) && TARGET_MMA
-		   && (m2 == OOmode || m2 == XOmode))
+		   && (m2 == OOmode || m2 == XOmode || m2 == TDOmode))
 	    {
 	      addr_mask |= RELOAD_REG_OFFSET;
 	      if (rc == RELOAD_REG_FPR || rc == RELOAD_REG_VMX)
@@ -3021,6 +3049,14 @@  rs6000_init_hard_regno_mode_ok (bool global_init_p)
       rs6000_vector_align[XOmode] = 512;
     }
 
+  /* Add support for 1,024 bit DMR registers.  */
+  if (TARGET_DENSE_MATH)
+    {
+      rs6000_vector_unit[TDOmode] = VECTOR_NONE;
+      rs6000_vector_mem[TDOmode] = VECTOR_VSX;
+      rs6000_vector_align[TDOmode] = 512;
+    }
+
   /* Register class constraints for the constraints that depend on compile
      switches. When the VSX code was added, different constraints were added
      based on the type (DFmode, V2DFmode, V4SFmode).  For the vector types, all
@@ -3234,6 +3270,12 @@  rs6000_init_hard_regno_mode_ok (bool global_init_p)
 	}
     }
 
+  if (TARGET_DENSE_MATH)
+    {
+      reg_addr[TDOmode].reload_load = CODE_FOR_reload_dmr_from_memory;
+      reg_addr[TDOmode].reload_store = CODE_FOR_reload_dmr_to_memory;
+    }
+
   /* Precalculate HARD_REGNO_NREGS.  */
   for (r = 0; HARD_REGISTER_NUM_P (r); ++r)
     for (m = 0; m < NUM_MACHINE_MODES; ++m)
@@ -8800,12 +8842,15 @@  reg_offset_addressing_ok_p (machine_mode mode)
 	return mode_supports_dq_form (mode);
       break;
 
-      /* The vector pair/quad types support offset addressing if the
-	 underlying vectors support offset addressing.  */
+      /* The vector pair/quad types and the dense math types support offset
+	 addressing if the underlying vectors support offset addressing.  */
     case E_OOmode:
     case E_XOmode:
       return TARGET_MMA;
 
+    case E_TDOmode:
+      return TARGET_DENSE_MATH;
+
     case E_SDmode:
       /* If we can do direct load/stores of SDmode, restrict it to reg+reg
 	 addressing for the LFIWZX and STFIWX instructions.  */
@@ -11354,6 +11399,12 @@  rs6000_emit_move (rtx dest, rtx source, machine_mode mode)
 	       (mode == OOmode) ? "__vector_pair" : "__vector_quad");
       break;
 
+    case E_TDOmode:
+      if (CONST_INT_P (operands[1]))
+	error ("%qs is an opaque type, and you cannot set it to constants",
+	       "__dmr");
+      break;
+
     case E_SImode:
     case E_DImode:
       /* Use default pattern for address of ELF small data */
@@ -12817,7 +12868,7 @@  rs6000_secondary_reload_simple_move (enum rs6000_reg_type to_type,
 
   /* We can transfer between VSX registers and DMR registers without needing
      extra registers.  */
-  if (TARGET_DENSE_MATH && mode == XOmode
+  if (TARGET_DENSE_MATH && (mode == XOmode || mode == TDOmode)
       && ((to_type == DMR_REG_TYPE && from_type == VSX_REG_TYPE)
 	  || (to_type == VSX_REG_TYPE && from_type == DMR_REG_TYPE)))
     return true;
@@ -13618,6 +13669,9 @@  rs6000_preferred_reload_class (rtx x, enum reg_class rclass)
       if (mode == XOmode)
 	return TARGET_DENSE_MATH ? VSX_REGS : FLOAT_REGS;
 
+      if (mode == TDOmode)
+	return VSX_REGS;
+
       if (GET_MODE_CLASS (mode) == MODE_INT)
 	return GENERAL_REGS;
     }
@@ -13741,8 +13795,9 @@  rs6000_secondary_reload_class (enum reg_class rclass, machine_mode mode,
   else
     regno = -1;
 
-  /* DMR registers don't have loads or stores.  We have to go through the VSX
-     registers to load XOmode (vector quad).  */
+  /* Dense math registers don't have loads or stores.  We have to go through
+     the VSX registers to load XOmode (vector quad) and TDOmode (dmr 1024
+     bit).  */
   if (TARGET_DENSE_MATH && rclass == DM_REGS)
     return VSX_REGS;
 
@@ -20830,6 +20885,8 @@  rs6000_mangle_type (const_tree type)
     return "u13__vector_pair";
   if (type == vector_quad_type_node)
     return "u13__vector_quad";
+  if (type == dmr_type_node)
+    return "u5__dmr";
 
   /* For all other types, use the default mangling.  */
   return NULL;
@@ -22954,6 +23011,10 @@  rs6000_dmr_register_move_cost (machine_mode mode, reg_class_t rclass)
       if (mode == XOmode)
 	return reg_move_base;
 
+      /* __dmr (i.e. TDOmode) is transferred in 2 instructions.  */
+      else if (mode == TDOmode)
+	return reg_move_base * 2;
+
       else
 	return reg_move_base * 2 * hard_regno_nregs (FIRST_DMR_REGNO, mode);
     }
@@ -27651,9 +27712,10 @@  rs6000_split_multireg_move (rtx dst, rtx src)
   mode = GET_MODE (dst);
   nregs = hard_regno_nregs (reg, mode);
 
-  /* If we have a vector quad register for MMA, and this is a load or store,
-     see if we can use vector paired load/stores.  */
-  if (mode == XOmode && TARGET_MMA
+  /* If we have a vector quad register for MMA or DMR register for dense math,
+     and this is a load or store, see if we can use vector paired
+     load/stores.  */
+  if ((mode == XOmode || mode == TDOmode) && TARGET_MMA
       && (MEM_P (dst) || MEM_P (src)))
     {
       reg_mode = OOmode;
@@ -27661,7 +27723,7 @@  rs6000_split_multireg_move (rtx dst, rtx src)
     }
   /* If we have a vector pair/quad mode, split it into two/four separate
      vectors.  */
-  else if (mode == OOmode || mode == XOmode)
+  else if (mode == OOmode || mode == XOmode || mode == TDOmode)
     reg_mode = V1TImode;
   else if (FP_REGNO_P (reg))
     reg_mode = DECIMAL_FLOAT_MODE_P (mode) ? DDmode :
@@ -27707,13 +27769,13 @@  rs6000_split_multireg_move (rtx dst, rtx src)
       return;
     }
 
-  /* The __vector_pair and __vector_quad modes are multi-register
-     modes, so if we have to load or store the registers, we have to be
-     careful to properly swap them if we're in little endian mode
-     below.  This means the last register gets the first memory
-     location.  We also need to be careful of using the right register
-     numbers if we are splitting XO to OO.  */
-  if (mode == OOmode || mode == XOmode)
+  /* The __vector_pair, __vector_quad, and __dmr modes are multi-register
+     modes, so if we have to load or store the registers, we have to be careful
+     to properly swap them if we're in little endian mode below.  This means
+     the last register gets the first memory location.  We also need to be
+     careful of using the right register numbers if we are splitting XO or
+     TDO to OO.  */
+  if (mode == OOmode || mode == XOmode || mode == TDOmode)
     {
       nregs = hard_regno_nregs (reg, mode);
       int reg_mode_nregs = hard_regno_nregs (reg, reg_mode);
@@ -27850,7 +27912,7 @@  rs6000_split_multireg_move (rtx dst, rtx src)
 	 overlap.  */
       int i;
       /* XO/OO are opaque so cannot use subregs. */
-      if (mode == OOmode || mode == XOmode )
+      if (mode == OOmode || mode == XOmode || mode == TDOmode)
 	{
 	  for (i = nregs - 1; i >= 0; i--)
 	    {
@@ -28024,7 +28086,7 @@  rs6000_split_multireg_move (rtx dst, rtx src)
 	    continue;
 
 	  /* XO/OO are opaque so cannot use subregs. */
-	  if (mode == OOmode || mode == XOmode )
+	  if (mode == OOmode || mode == XOmode || mode == TDOmode)
 	    {
 	      rtx dst_i = gen_rtx_REG (reg_mode, REGNO (dst) + j);
 	      rtx src_i = gen_rtx_REG (reg_mode, REGNO (src) + j);
@@ -29006,7 +29068,8 @@  rs6000_invalid_conversion (const_tree fromtype, const_tree totype)
 
   if (frommode != tomode)
     {
-      /* Do not allow conversions to/from XOmode and OOmode types.  */
+      /* Do not allow conversions to/from XOmode, OOmode, and TDOmode
+	 types.  */
       if (frommode == XOmode)
 	return N_("invalid conversion from type %<__vector_quad%>");
       if (tomode == XOmode)
@@ -29015,6 +29078,10 @@  rs6000_invalid_conversion (const_tree fromtype, const_tree totype)
 	return N_("invalid conversion from type %<__vector_pair%>");
       if (tomode == OOmode)
 	return N_("invalid conversion to type %<__vector_pair%>");
+      if (frommode == TDOmode)
+	return N_("invalid conversion from type %<__dmr%>");
+      if (tomode == TDOmode)
+	return N_("invalid conversion to type %<__dmr%>");
     }
 
   /* Conversion allowed.  */
diff --git a/gcc/config/rs6000/rs6000.h b/gcc/config/rs6000/rs6000.h
index 22efac4a80c..9711777b5cd 100644
--- a/gcc/config/rs6000/rs6000.h
+++ b/gcc/config/rs6000/rs6000.h
@@ -1004,7 +1004,8 @@  enum data_align { align_abi, align_opt, align_both };
 /* Modes that are not vectors, but require vector alignment.  Treat these like
    vectors in terms of loads and stores.  */
 #define VECTOR_ALIGNMENT_P(MODE)					\
-  (FLOAT128_VECTOR_P (MODE) || (MODE) == OOmode || (MODE) == XOmode)
+  (FLOAT128_VECTOR_P (MODE) || (MODE) == OOmode || (MODE) == XOmode	\
+   || (MODE) == TDOmode)
 
 #define ALTIVEC_VECTOR_MODE(MODE)					\
   ((MODE) == V16QImode							\
@@ -2293,6 +2294,7 @@  enum rs6000_builtin_type_index
   RS6000_BTI_const_str,		 /* pointer to const char * */
   RS6000_BTI_vector_pair,	 /* unsigned 256-bit types (vector pair).  */
   RS6000_BTI_vector_quad,	 /* unsigned 512-bit types (vector quad).  */
+  RS6000_BTI_dmr,		 /* unsigned 1,024-bit types (dmr).  */
   RS6000_BTI_const_ptr_void,     /* const pointer to void */
   RS6000_BTI_ptr_V16QI,
   RS6000_BTI_ptr_V1TI,
@@ -2331,6 +2333,7 @@  enum rs6000_builtin_type_index
   RS6000_BTI_ptr_dfloat128,
   RS6000_BTI_ptr_vector_pair,
   RS6000_BTI_ptr_vector_quad,
+  RS6000_BTI_ptr_dmr,
   RS6000_BTI_ptr_long_long,
   RS6000_BTI_ptr_long_long_unsigned,
   RS6000_BTI_MAX
@@ -2388,6 +2391,7 @@  enum rs6000_builtin_type_index
 #define const_str_type_node		 (rs6000_builtin_types[RS6000_BTI_const_str])
 #define vector_pair_type_node		 (rs6000_builtin_types[RS6000_BTI_vector_pair])
 #define vector_quad_type_node		 (rs6000_builtin_types[RS6000_BTI_vector_quad])
+#define dmr_type_node			 (rs6000_builtin_types[RS6000_BTI_dmr])
 #define pcvoid_type_node		 (rs6000_builtin_types[RS6000_BTI_const_ptr_void])
 #define ptr_V16QI_type_node		 (rs6000_builtin_types[RS6000_BTI_ptr_V16QI])
 #define ptr_V1TI_type_node		 (rs6000_builtin_types[RS6000_BTI_ptr_V1TI])
@@ -2426,6 +2430,7 @@  enum rs6000_builtin_type_index
 #define ptr_dfloat128_type_node		 (rs6000_builtin_types[RS6000_BTI_ptr_dfloat128])
 #define ptr_vector_pair_type_node	 (rs6000_builtin_types[RS6000_BTI_ptr_vector_pair])
 #define ptr_vector_quad_type_node	 (rs6000_builtin_types[RS6000_BTI_ptr_vector_quad])
+#define ptr_dmr_type_node		 (rs6000_builtin_types[RS6000_BTI_ptr_dmr])
 #define ptr_long_long_integer_type_node	 (rs6000_builtin_types[RS6000_BTI_ptr_long_long])
 #define ptr_long_long_unsigned_type_node (rs6000_builtin_types[RS6000_BTI_ptr_long_long_unsigned])
 
diff --git a/gcc/testsuite/gcc.target/powerpc/dm-1024bit.c b/gcc/testsuite/gcc.target/powerpc/dm-1024bit.c
new file mode 100644
index 00000000000..0a9884ddf63
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/dm-1024bit.c
@@ -0,0 +1,63 @@ 
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_dense_math_ok } */
+/* { dg-options "-mdejagnu-cpu=future -O2" } */
+
+/* Test basic load/store for __dmr type.  */
+
+#ifndef CONSTRAINT
+#if defined(USE_D)
+#define CONSTRAINT "d"
+
+#elif defined(USE_V)
+#define CONSTRAINT "v"
+
+#elif defined(USE_WA)
+#define CONSTRAINT "wa"
+
+#else
+#define CONSTRAINT "wD"
+#endif
+#endif
+const char constraint[] = CONSTRAINT;
+
+void foo_mem_asm (__dmr *p, __dmr *q)
+{
+  /* 4 LXVP instructions.  */
+  __dmr vq = *p;
+
+  /* 2 DMXXINSTDMR512 instructions to transfer VSX to DMR.  */
+  __asm__ ("# foo (" CONSTRAINT ") %A0" : "+" CONSTRAINT (vq));
+  /* 2 DMXXEXTFDMR512 instructions to transfer DMR to VSX.  */
+
+  /* 4 STXVP instructions.  */
+  *q = vq;
+}
+
+void foo_mem_asm2 (__dmr *p, __dmr *q)
+{
+  /* 4 LXVP instructions.  */
+  __dmr vq = *p;
+  __dmr vq2;
+  __dmr vq3;
+
+  /* 2 DMXXINSTDMR512 instructions to transfer VSX to DMR.  */
+  __asm__ ("# foo1 (" CONSTRAINT ") %A0" : "+" CONSTRAINT (vq));
+  /* 2 DMXXEXTFDMR512 instructions to transfer DMR to VSX.  */
+
+  vq2 = vq;
+  __asm__ ("# foo2 (wa) %0" : "+wa" (vq2));
+
+  /* 4 STXVP instructions.  */
+  *q = vq2;
+}
+
+void foo_mem (__dmr *p, __dmr *q)
+{
+  /* 4 LXVP, 4 STXVP instructions, no DMR transfer.  */
+  *q = *p;
+}
+
+/* { dg-final { scan-assembler-times {\mdmxxextfdmr512\M}  4 } } */
+/* { dg-final { scan-assembler-times {\mdmxxinstdmr512\M}  4 } } */
+/* { dg-final { scan-assembler-times {\mlxvp\M}           12 } } */
+/* { dg-final { scan-assembler-times {\mstxvp\M}          12 } } */
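
The expected counts in the dg-final directives above can be cross-checked
with some simple arithmetic.  The sketch below is not part of the patch.  It
assumes each lxvp/stxvp moves one 256-bit vector pair and each
dmxxinstdmr512/dmxxextfdmr512 moves one 512-bit half of a DMR, and it uses
hypothetical helper names (pair_ops_per_dmr, transfers_per_dmr,
expected_total) that do not exist in GCC.

```c
#include <assert.h>

/* Widths in bits.  The 256-bit and 512-bit figures are assumptions about
   what a single lxvp/stxvp and a single dmxx*dmr512 instruction move.  */
enum { DMR_BITS = 1024, VECTOR_PAIR_BITS = 256, DMR_HALF_BITS = 512 };

/* Hypothetical helper: paired loads (or stores) per __dmr memory access.  */
static int
pair_ops_per_dmr (void)
{
  return DMR_BITS / VECTOR_PAIR_BITS;
}

/* Hypothetical helper: 512-bit transfers per VSX <-> DMR move.  */
static int
transfers_per_dmr (void)
{
  return DMR_BITS / DMR_HALF_BITS;
}

/* Hypothetical helper: total count over the test file, given how many
   functions perform the operation once each.  */
static int
expected_total (int per_function, int functions)
{
  return per_function * functions;
}

/* Check the arithmetic against the counts used in the dg-final lines.  */
static void
check_expected_counts (void)
{
  /* foo_mem_asm, foo_mem_asm2, and foo_mem each load and store once.  */
  assert (expected_total (pair_ops_per_dmr (), 3) == 12);
  /* Only foo_mem_asm and foo_mem_asm2 move the value into and out of a DMR
     (with the default "wD" constraint).  */
  assert (expected_total (transfers_per_dmr (), 2) == 4);
}
```

Under these assumptions each function needs 4 paired loads and 4 paired
stores, which gives the 12 lxvp and 12 stxvp matches, and the two asm
functions account for the 4 dmxxinstdmr512 and 4 dmxxextfdmr512 matches.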