Repost [PATCH 3/6] PowerPC: Add support for accumulators in DMR registers.

Message ID ZZiS7-05Y1n48bjk@cowardly-lion.the-meissners.org
State Unresolved

Checks

Context                Check    Description
snail/gcc-patch-check  warning  Git am fail log

Commit Message

Michael Meissner Jan. 5, 2024, 11:38 p.m. UTC
  The MMA subsystem added the notion of accumulator registers as an optional
feature of ISA 3.1 (power10).  In ISA 3.1, these accumulators overlapped with
the traditional floating point registers 0..31, but logically the accumulator
registers were separate from the FPR registers.  In ISA 3.1, it was anticipated
that in future systems, the accumulator registers may not overlap with the FPR
registers.  This patch adds the support for dense math registers as separate
registers.

This particular patch does not change the MMA support to use the accumulators
within the dense math registers.  This patch just adds the basic support for
having separate DMRs.  The next patch will switch the MMA support to use the
accumulators if -mcpu=future is used.

For testing purposes, I added an undocumented option '-mdense-math' to enable
or disable the dense math support.

This patch adds a new constraint (wD).  If MMA is selected but dense math is
not selected (i.e. -mcpu=power10), the wD constraint will allow access to
accumulators that overlap with the VSX vector registers 0..31.  If both MMA and
dense math are selected (i.e. -mcpu=future), the wD constraint will only allow
dense math registers.

This patch modifies the existing %A output modifier.  If MMA is selected but
dense math is not selected, then %A output modifier converts the VSX register
number to the accumulator number, by dividing it by 4.  If both MMA and dense
math are selected, then %A will map the separate DMR registers into 0..7.
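
The two %A mappings described above can be sketched as follows.  This is only
an illustration, not the patch's print_operand code: acc_number is a
hypothetical helper, the DMR range (111..118) is taken from the rs6000.md hunk
in this patch, and the FPR base of 32 is inferred from the register-name table
quoted later in the thread (vs52 is hard register 84).

```c
#include <assert.h>
#include <stdbool.h>

/* Hard register numbers.  The DMR range is from the patch; the FPR range is
   an assumption inferred from the vsNN name table.  */
enum {
  FIRST_FPR_REGNO = 32,
  LAST_FPR_REGNO = 63,
  FIRST_DMR_REGNO = 111,
  LAST_DMR_REGNO = 118
};

/* Hypothetical helper: return the accumulator number (0..7) that %A would
   print for REGNO, or -1 if REGNO cannot name an accumulator in the
   selected mode.  */
static int
acc_number (int regno, bool dense_math)
{
  if (dense_math)
    {
      /* Separate DMRs map directly onto 0..7.  */
      if (regno < FIRST_DMR_REGNO || regno > LAST_DMR_REGNO)
	return -1;
      return regno - FIRST_DMR_REGNO;
    }

  /* MMA without dense math: an accumulator overlaps 4 adjacent VSX
     registers, so the VSX register number is divided by 4.  */
  if (regno < FIRST_FPR_REGNO || regno > LAST_FPR_REGNO)
    return -1;

  int vsx = regno - FIRST_FPR_REGNO;
  assert (vsx >= 0 && vsx < 32);
  return (vsx % 4 == 0) ? vsx / 4 : -1;
}
```

Under these assumptions, vs4 (hard register 36) prints as accumulator 1
without dense math, and hard register 113 prints as accumulator 2 with dense
math.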

The intention is that user code using extended asm can be modified to run on
both MMA without dense math and MMA with dense math:

    1)	If possible, don't use extended asm, but instead use the MMA built-in
	functions;

    2)	If you do need to write extended asm, change the d constraints
	targeting accumulators to wD;

    3)	Only use the built-in zero, assemble and disassemble functions to
	create or move data between vector quad types and dense math
	accumulators.  I.e. do not use the xxmfacc, xxmtacc, and xxsetaccz
	instructions directly in the extended asm code.  The reason is that
	these instructions assume there is a 1-to-1 correspondence between 4
	adjacent FPR registers and an accumulator that overlaps with those
	registers.  With accumulators now being separate registers, there is
	no longer a 1-to-1 correspondence.
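
As a rough illustration of points 1 and 3, the sketch below zeroes an
accumulator and copies it out using only the existing MMA built-ins, so the
compiler (not the user) picks the instructions.  The function name
zero_acc_rows is hypothetical, and the non-MMA branch is just a portable
stand-in so the example builds on other targets; it is not part of the patch.

```c
#include <assert.h>
#include <string.h>

#if defined (__MMA__)
#include <altivec.h>

/* Zero an accumulator and copy its four 16-byte rows to OUT, using only the
   MMA built-ins.  No xxsetaccz/xxmfacc appears in user-written asm.  */
static void
zero_acc_rows (unsigned char out[64])
{
  __vector_quad acc;
  vector unsigned char rows[4];

  assert (out != NULL);
  __builtin_mma_xxsetaccz (&acc);
  __builtin_mma_disassemble_acc (rows, &acc);
  memcpy (out, rows, sizeof rows);
}
#else
/* Portable stand-in for targets without MMA (an assumption made only so
   this sketch compiles everywhere).  */
static void
zero_acc_rows (unsigned char out[64])
{
  assert (out != NULL);
  memset (out, 0, 64);
}
#endif
```

Because the code never names an accumulator register or a move instruction,
the same source should be usable whether the accumulators overlap the VSX
registers or live in separate DMRs.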

It is possible that the mangling for DMRs and the GDB register numbers may
change in the future.

2024-01-05   Michael Meissner  <meissner@linux.ibm.com>

gcc/

	* config/rs6000/constraints.md (wD constraint): New constraint.
	* config/rs6000/mma.md (UNSPEC_DM_ASSEMBLE_ACC): New unspec.
	(movxo): Convert into define_expand.
	(movxo_vsx): Version of movxo where accumulators overlap with VSX vector
	registers 0..31.
	(movxo_dm): Version of movxo that supports separate dense math
	accumulators.
	(mma_assemble_acc): Add dense math support to define_expand.
	(mma_assemble_acc_vsx): Rename from mma_assemble_acc, and restrict it to
	non dense math systems.
	(mma_assemble_acc_dm): Dense math version of mma_assemble_acc.
	(mma_disassemble_acc): Add dense math support to define_expand.
	(mma_disassemble_acc_vsx): Rename from mma_disassemble_acc, and restrict
	it to non dense math systems.
	(mma_disassemble_acc_dm): Dense math version of mma_disassemble_acc.
	* config/rs6000/predicates.md (dmr_operand): New predicate.
	(accumulator_operand): Likewise.
	* config/rs6000/rs6000-cpus.def (ISA_FUTURE_MASKS): Add -mdense-math.
	(POWERPC_MASKS): Likewise.
	* config/rs6000/rs6000.cc (enum rs6000_reg_type): Add DMR_REG_TYPE.
	(enum rs6000_reload_reg_type): Add RELOAD_REG_DMR.
	(LAST_RELOAD_REG_CLASS): Add support for DMR registers and the wD
	constraint.
	(reload_reg_map): Likewise.
	(rs6000_reg_names): Likewise.
	(alt_reg_names): Likewise.
	(rs6000_hard_regno_nregs_internal): Likewise.
	(rs6000_hard_regno_mode_ok_uncached): Likewise.
	(rs6000_debug_reg_global): Likewise.
	(rs6000_setup_reg_addr_masks): Likewise.
	(rs6000_init_hard_regno_mode_ok): Likewise.
	(rs6000_option_override_internal): Add checking for -mdense-math.
	(rs6000_secondary_reload_memory): Add support for DMR registers.
	(rs6000_secondary_reload_simple_move): Likewise.
	(rs6000_preferred_reload_class): Likewise.
	(rs6000_secondary_reload_class): Likewise.
	(print_operand): Make %A handle both FPRs and DMRs.
	(rs6000_dmr_register_move_cost): New helper function.
	(rs6000_register_move_cost): Add support for DMR registers.
	(rs6000_memory_move_cost): Likewise.
	(rs6000_compute_pressure_classes): Likewise.
	(rs6000_debugger_regno): Likewise.
	(rs6000_opt_masks): Add -mdense-math.
	(rs6000_split_multireg_move): Add support for DMRs.
	* config/rs6000/rs6000.h (UNITS_PER_DMR_WORD): New macro.
	(FIRST_PSEUDO_REGISTER): Update for DMRs.
	(FIXED_REGISTERS): Add DMRs.
	(CALL_REALLY_USED_REGISTERS): Likewise.
	(REG_ALLOC_ORDER): Likewise.
	(enum reg_class): Add DM_REGS.
	(REG_CLASS_NAMES): Likewise.
	(REG_CLASS_CONTENTS): Likewise.
	* config/rs6000/rs6000.md (FIRST_DMR_REGNO): New constant.
	(LAST_DMR_REGNO): Likewise.
	(isa attribute): Add 'dm' and 'not_dm' attributes.
	(enabled attribute): Support 'dm' and 'not_dm' attributes.
	* config/rs6000/rs6000.opt (-mdense-math): New switch.
	* doc/md.texi (PowerPC constraints): Document wD constraint.
---
 gcc/config/rs6000/constraints.md  |   3 +
 gcc/config/rs6000/mma.md          | 115 ++++++++++++------
 gcc/config/rs6000/predicates.md   |  32 +++++
 gcc/config/rs6000/rs6000-cpus.def |   2 +
 gcc/config/rs6000/rs6000.cc       | 189 ++++++++++++++++++++++++++----
 gcc/config/rs6000/rs6000.h        |  38 +++++-
 gcc/config/rs6000/rs6000.md       |  12 +-
 gcc/config/rs6000/rs6000.opt      |   4 +
 gcc/doc/md.texi                   |   7 ++
 9 files changed, 343 insertions(+), 59 deletions(-)
  

Comments

Michael Meissner Jan. 19, 2024, 6:46 p.m. UTC | #1
Ping

| Date: Fri, 5 Jan 2024 18:38:23 -0500
| From: Michael Meissner <meissner@linux.ibm.com>
| Subject: Repost [PATCH 3/6] PowerPC: Add support for accumulators in DMR registers.
| Message-ID: <ZZiS7-05Y1n48bjk@cowardly-lion.the-meissners.org>

https://gcc.gnu.org/pipermail/gcc-patches/2024-January/641963.html
  
Kewen.Lin Jan. 25, 2024, 9:28 a.m. UTC | #2
Hi Mike,

on 2024/1/6 07:38, Michael Meissner wrote:
> The MMA subsystem added the notion of accumulator registers as an optional
> feature of ISA 3.1 (power10).  In ISA 3.1, these accumulators overlapped with
> the traditional floating point registers 0..31, but logically the accumulator
> registers were separate from the FPR registers.  In ISA 3.1, it was anticipated

Using "VSX registers 0..31" rather than "traditional floating point registers
0..31" seems clearer, since floating point registers imply 64-bit registers.

> that in future systems, the accumulator registers may not overlap with the FPR
> registers.  This patch adds the support for dense math registers as separate
> registers.
> 
> This particular patch does not change the MMA support to use the accumulators
> within the dense math registers.  This patch just adds the basic support for
> having separate DMRs.  The next patch will switch the MMA support to use the
> accumulators if -mcpu=future is used.
> 
> For testing purposes, I added an undocumented option '-mdense-math' to enable
> or disable the dense math support.

Can we avoid this and use one macro for it instead?  As you might have noticed,
some previous temporary options like -mpower{8,9}-vector cause ICEs due to
some unexpected combinations, and we are going to neuter them, so let's try our
best to avoid it if possible.  I guess one macro TARGET_DENSE_MATH defined by
TARGET_FUTURE && TARGET_MMA matches all use places?  Specifying -mcpu=future
would then enable it, while -mcpu=power10 would disable it.

> 
> This patch adds a new constraint (wD).  If MMA is selected but dense math is
> not selected (i.e. -mcpu=power10), the wD constraint will allow access to
> accumulators that overlap with the VSX vector registers 0..31.  If both MMA and

Sorry for nitpicking, but "VSX registers 0..31" is more accurate.

> dense math are selected (i.e. -mcpu=future), the wD constraint will only allow
> dense math registers.
> 
> This patch modifies the existing %A output modifier.  If MMA is selected but
> dense math is not selected, then %A output modifier converts the VSX register
> number to the accumulator number, by dividing it by 4.  If both MMA and dense
> math are selected, then %A will map the separate DMR registers into 0..7.
> 
> The intention is that user code using extended asm can be modified to run on
> both MMA without dense math and MMA with dense math:
> 
>     1)	If possible, don't use extended asm, but instead use the MMA built-in
> 	functions;
> 
>     2)	If you do need to write extended asm, change the d constraints
> 	targeting accumulators to wD;
> 
>     3)	Only use the built-in zero, assemble and disassemble functions to
> 	create or move data between vector quad types and dense math
> 	accumulators.  I.e. do not use the xxmfacc, xxmtacc, and xxsetaccz
> 	instructions directly in the extended asm code.  The reason is that
> 	these instructions assume there is a 1-to-1 correspondence between 4
> 	adjacent FPR registers and an accumulator that overlaps with those
> 	registers.  With accumulators now being separate registers, there is
> 	no longer a 1-to-1 correspondence.
> 
> It is possible that the mangling for DMRs and the GDB register numbers may
> change in the future.
> 
> 2024-01-05   Michael Meissner  <meissner@linux.ibm.com>
> 
> gcc/
> 
> 	* config/rs6000/constraints.md (wD constraint): New constraint.
> 	* config/rs6000/mma.md (UNSPEC_DM_ASSEMBLE_ACC): New unspec.
> 	(movxo): Convert into define_expand.
> 	(movxo_vsx): Version of movxo where accumulators overlap with VSX vector
> 	registers 0..31.
> 	(movxo_dm): Version of movxo that supports separate dense math
> 	accumulators.
> 	(mma_assemble_acc): Add dense math support to define_expand.
> 	(mma_assemble_acc_vsx): Rename from mma_assemble_acc, and restrict it to
> 	non dense math systems.
> 	(mma_assemble_acc_dm): Dense math version of mma_assemble_acc.
> 	(mma_disassemble_acc): Add dense math support to define_expand.
> 	(mma_disassemble_acc_vsx): Rename from mma_disassemble_acc, and restrict
> 	it to non dense math systems.
> 	(mma_disassemble_acc_dm): Dense math version of mma_disassemble_acc.
> 	* config/rs6000/predicates.md (dmr_operand): New predicate.
> 	(accumulator_operand): Likewise.
> 	* config/rs6000/rs6000-cpus.def (ISA_FUTURE_MASKS): Add -mdense-math.
> 	(POWERPC_MASKS): Likewise.
> 	* config/rs6000/rs6000.cc (enum rs6000_reg_type): Add DMR_REG_TYPE.
> 	(enum rs6000_reload_reg_type): Add RELOAD_REG_DMR.
> 	(LAST_RELOAD_REG_CLASS): Add support for DMR registers and the wD
> 	constraint.
> 	(reload_reg_map): Likewise.
> 	(rs6000_reg_names): Likewise.
> 	(alt_reg_names): Likewise.
> 	(rs6000_hard_regno_nregs_internal): Likewise.
> 	(rs6000_hard_regno_mode_ok_uncached): Likewise.
> 	(rs6000_debug_reg_global): Likewise.
> 	(rs6000_setup_reg_addr_masks): Likewise.
> 	(rs6000_init_hard_regno_mode_ok): Likewise.
> 	(rs6000_option_override_internal): Add checking for -mdense-math.
> 	(rs6000_secondary_reload_memory): Add support for DMR registers.
> 	(rs6000_secondary_reload_simple_move): Likewise.
> 	(rs6000_preferred_reload_class): Likewise.
> 	(rs6000_secondary_reload_class): Likewise.
> 	(print_operand): Make %A handle both FPRs and DMRs.
> 	(rs6000_dmr_register_move_cost): New helper function.
> 	(rs6000_register_move_cost): Add support for DMR registers.
> 	(rs6000_memory_move_cost): Likewise.
> 	(rs6000_compute_pressure_classes): Likewise.
> 	(rs6000_debugger_regno): Likewise.
> 	(rs6000_opt_masks): Add -mdense-math.
> 	(rs6000_split_multireg_move): Add support for DMRs.
> 	* config/rs6000/rs6000.h (UNITS_PER_DMR_WORD): New macro.
> 	(FIRST_PSEUDO_REGISTER): Update for DMRs.
> 	(FIXED_REGISTERS): Add DMRs.
> 	(CALL_REALLY_USED_REGISTERS): Likewise.
> 	(REG_ALLOC_ORDER): Likewise.
> 	(enum reg_class): Add DM_REGS.
> 	(REG_CLASS_NAMES): Likewise.
> 	(REG_CLASS_CONTENTS): Likewise.
> 	* config/rs6000/rs6000.md (FIRST_DMR_REGNO): New constant.
> 	(LAST_DMR_REGNO): Likewise.
> 	(isa attribute): Add 'dm' and 'not_dm' attributes.
> 	(enabled attribute): Support 'dm' and 'not_dm' attributes.
> 	* config/rs6000/rs6000.opt (-mdense-math): New switch.
> 	* doc/md.texi (PowerPC constraints): Document wD constraint.
> ---
>  gcc/config/rs6000/constraints.md  |   3 +
>  gcc/config/rs6000/mma.md          | 115 ++++++++++++------
>  gcc/config/rs6000/predicates.md   |  32 +++++
>  gcc/config/rs6000/rs6000-cpus.def |   2 +
>  gcc/config/rs6000/rs6000.cc       | 189 ++++++++++++++++++++++++++----
>  gcc/config/rs6000/rs6000.h        |  38 +++++-
>  gcc/config/rs6000/rs6000.md       |  12 +-
>  gcc/config/rs6000/rs6000.opt      |   4 +
>  gcc/doc/md.texi                   |   7 ++
>  9 files changed, 343 insertions(+), 59 deletions(-)
> 
> diff --git a/gcc/config/rs6000/constraints.md b/gcc/config/rs6000/constraints.md
> index c99997bf82b..614e431c085 100644
> --- a/gcc/config/rs6000/constraints.md
> +++ b/gcc/config/rs6000/constraints.md
> @@ -107,6 +107,9 @@ (define_constraint "wB"
>         (match_test "TARGET_P8_VECTOR")
>         (match_operand 0 "s5bit_cint_operand")))
>  
> +(define_register_constraint "wD" "rs6000_constraints[RS6000_CONSTRAINT_wD]"
> +  "Accumulator register.")
> +
>  (define_constraint "wE"
>    "@internal Vector constant that can be loaded with the XXSPLTIB instruction."
>    (match_test "xxspltib_constant_nosplit (op, mode)"))
> diff --git a/gcc/config/rs6000/mma.md b/gcc/config/rs6000/mma.md
> index 6a7d8a836db..bb898919ab5 100644
> --- a/gcc/config/rs6000/mma.md
> +++ b/gcc/config/rs6000/mma.md
> @@ -91,6 +91,7 @@ (define_c_enum "unspec"
>     UNSPEC_MMA_XVI8GER4SPP
>     UNSPEC_MMA_XXMFACC
>     UNSPEC_MMA_XXMTACC
> +   UNSPEC_DM_ASSEMBLE_ACC

The other UNSPEC.*ASSEMBLE names, like UNSPECV_MMA_ASSEMBLE, don't have an _ACC
suffix; it's better to stay consistent if this suffix doesn't distinguish
anything.

>    ])
>  
>  (define_c_enum "unspecv"
> @@ -321,7 +322,9 @@ (define_insn_and_split "*movoo"
>     (set_attr "length" "*,8,*,8,8")
>     (set_attr "isa" "lxvp,*,stxvp,*,*")])
>  
> -;; Vector quad support.  XOmode can only live in FPRs.
> +;; Vector quad support.  Under the original MMA, XOmode can only live in VSX
> +;; vector registers 0..31.  With dense math, XOmode can live in either VSX

Nit: s/vector//

> +;; registers (0..63) or DMR registers.
>  (define_expand "movxo"
>    [(set (match_operand:XO 0 "nonimmediate_operand")
>  	(match_operand:XO 1 "input_operand"))]
> @@ -346,10 +349,10 @@ (define_expand "movxo"
>      gcc_assert (false);
>  })
>  
> -(define_insn_and_split "*movxo"
> +(define_insn_and_split "*movxo_nodm"
>    [(set (match_operand:XO 0 "nonimmediate_operand" "=d,ZwO,d")
>  	(match_operand:XO 1 "input_operand" "ZwO,d,d"))]
> -  "TARGET_MMA
> +  "TARGET_MMA && !TARGET_DENSE_MATH
>     && (gpc_reg_operand (operands[0], XOmode)
>         || gpc_reg_operand (operands[1], XOmode))"
>    "@
> @@ -366,6 +369,31 @@ (define_insn_and_split "*movxo"
>     (set_attr "length" "*,*,16")
>     (set_attr "max_prefixed_insns" "2,2,*")])
>  
> +(define_insn_and_split "*movxo_dm"
> +  [(set (match_operand:XO 0 "nonimmediate_operand" "=wa,QwO,wa,wD,wD,wa")
> +	(match_operand:XO 1 "input_operand"        "QwO,wa, wa,wa,wD,wD"))]

Why not adopt ZwO rather than QwO?

> +  "TARGET_DENSE_MATH
> +   && (gpc_reg_operand (operands[0], XOmode)
> +       || gpc_reg_operand (operands[1], XOmode))"
> +  "@
> +   #
> +   #
> +   #
> +   dmxxinstdmr512 %0,%1,%Y1,0
> +   dmmr %0,%1
> +   dmxxextfdmr512 %0,%Y0,%1,0"
> +  "&& reload_completed
> +   && !dmr_operand (operands[0], XOmode)
> +   && !dmr_operand (operands[1], XOmode)"
> +  [(const_int 0)]
> +{
> +  rs6000_split_multireg_move (operands[0], operands[1]);
> +  DONE;
> +}
> +  [(set_attr "type" "vecload,vecstore,veclogical,mma,mma,mma")
> +   (set_attr "length" "*,*,16,*,*,*")
> +   (set_attr "max_prefixed_insns" "2,2,*,*,*,*")])
> +
>  (define_expand "vsx_assemble_pair"
>    [(match_operand:OO 0 "vsx_register_operand")
>     (match_operand:V16QI 1 "mma_assemble_input_operand")
> @@ -433,25 +461,38 @@ (define_insn_and_split "*vsx_disassemble_pair"
>  })
>  
>  (define_expand "mma_assemble_acc"
> -  [(match_operand:XO 0 "fpr_reg_operand")
> +  [(match_operand:XO 0 "register_operand")

Maybe use the newly introduced accumulator_operand?

>     (match_operand:V16QI 1 "mma_assemble_input_operand")
>     (match_operand:V16QI 2 "mma_assemble_input_operand")
>     (match_operand:V16QI 3 "mma_assemble_input_operand")
>     (match_operand:V16QI 4 "mma_assemble_input_operand")]
>    "TARGET_MMA"
>  {
> -  rtx src = gen_rtx_UNSPEC_VOLATILE (XOmode,
> -			    	     gen_rtvec (4, operands[1], operands[2],
> -				       		operands[3], operands[4]),
> -			    	     UNSPECV_MMA_ASSEMBLE);
> -  emit_move_insn (operands[0], src);
> +  rtx op0 = operands[0];
> +  rtx op1 = operands[1];
> +  rtx op2 = operands[2];
> +  rtx op3 = operands[3];
> +  rtx op4 = operands[4];
> +
> +  if (TARGET_DENSE_MATH)
> +    {
> +      rtx vpair1 = gen_reg_rtx (OOmode);
> +      rtx vpair2 = gen_reg_rtx (OOmode);
> +      emit_insn (gen_vsx_assemble_pair (vpair1, op1, op2));
> +      emit_insn (gen_vsx_assemble_pair (vpair2, op3, op4));
> +      emit_insn (gen_mma_assemble_acc_dm (op0, vpair1, vpair2));
> +    }
> +
> +  else
> +    emit_insn (gen_mma_assemble_acc_vsx (op0, op1, op2, op3, op4));
> +
>    DONE;
>  })
>  
>  ;; We cannot update the four output registers atomically, so mark the output
> -;; as an early clobber so we don't accidentally clobber the input operands.  */
> +;; as an early clobber so we don't accidentally clobber the input operands.
>  
> -(define_insn_and_split "*mma_assemble_acc"
> +(define_insn_and_split "mma_assemble_acc_vsx"

Nit: since we use "*_nodm" above, it seems better to name it with
"mma_assemble_acc_nodm" which has the same style?

>    [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
>  	(unspec_volatile:XO
>  	  [(match_operand:V16QI 1 "mma_assemble_input_operand" "mwa")
> @@ -459,7 +500,7 @@ (define_insn_and_split "*mma_assemble_acc"
>  	   (match_operand:V16QI 3 "mma_assemble_input_operand" "mwa")
>  	   (match_operand:V16QI 4 "mma_assemble_input_operand" "mwa")]
>  	  UNSPECV_MMA_ASSEMBLE))]
> -  "TARGET_MMA
> +  "TARGET_MMA && !TARGET_DENSE_MATH
>     && fpr_reg_operand (operands[0], XOmode)"
>    "#"
>    "&& reload_completed"
> @@ -473,28 +514,31 @@ (define_insn_and_split "*mma_assemble_acc"
>    DONE;
>  })
>  
> +;; On a system with dense math, we build the accumulators from two vector
> +;; pairs.
> +
> +(define_insn "mma_assemble_acc_dm"
> + [(set (match_operand:XO 0 "dmr_operand" "=wD")
> +       (unspec:XO [(match_operand:OO 1 "vsx_register_operand" "wa")
> +		   (match_operand:OO 2 "vsx_register_operand" "wa")]
> +		  UNSPEC_DM_ASSEMBLE_ACC))]
> + "TARGET_MMA && TARGET_DENSE_MATH"

Nit: redundant TARGET_MMA checking.

> + "dmxxinstdmr512 %0,%1,%2,0"
> + [(set_attr "type" "mma")])
> +
>  (define_expand "mma_disassemble_acc"
> -  [(match_operand:V16QI 0 "mma_disassemble_output_operand")
> -   (match_operand:XO 1 "fpr_reg_operand")
> -   (match_operand 2 "const_0_to_3_operand")]
> -  "TARGET_MMA"
> -{
> -  rtx src;
> -  int regoff = INTVAL (operands[2]);
> -  src = gen_rtx_UNSPEC (V16QImode,
> -			gen_rtvec (2, operands[1], GEN_INT (regoff)),
> -			UNSPEC_MMA_EXTRACT);
> -  emit_move_insn (operands[0], src);
> -  DONE;
> -})
> +  [(set (match_operand:V16QI 0 "register_operand")
> +	(unspec:V16QI [(match_operand:XO 1 "register_operand")

s/register_operand/accumulator_operand/?

> +		       (match_operand 2 "const_0_to_3_operand")]
> +		      UNSPEC_MMA_EXTRACT))]
> +  "TARGET_MMA")
>  
> -(define_insn_and_split "*mma_disassemble_acc"
> +(define_insn_and_split "*mma_disassemble_acc_vsx"
>    [(set (match_operand:V16QI 0 "mma_disassemble_output_operand" "=mwa")
> -       (unspec:V16QI [(match_operand:XO 1 "fpr_reg_operand" "d")
> -		      (match_operand 2 "const_0_to_3_operand")]
> +	(unspec:V16QI [(match_operand:XO 1 "fpr_reg_operand" "d")
> +		       (match_operand 2 "const_0_to_3_operand")]
>  		      UNSPEC_MMA_EXTRACT))]
> -  "TARGET_MMA
> -   && fpr_reg_operand (operands[1], XOmode)"
> +  "TARGET_MMA"

Do we still expect to see this pattern if TARGET_DENSE_MATH?
If not, we should guard the condition with !TARGET_DENSE_MATH.

>    "#"
>    "&& reload_completed"
>    [(const_int 0)]
> @@ -506,9 +550,14 @@ (define_insn_and_split "*mma_disassemble_acc"
>    DONE;
>  })
>  
> -;; MMA instructions that do not use their accumulators as an input, still
> -;; must not allow their vector operands to overlap the registers used by
> -;; the accumulator.  We enforce this by marking the output as early clobber.
> +(define_insn "*mma_disassemble_acc_dm"
> +  [(set (match_operand:V16QI 0 "vsx_register_operand" "=wa")
> +	(unspec:V16QI [(match_operand:XO 1 "dmr_operand" "wD")
> +		       (match_operand 2 "const_0_to_3_operand")]
> +		      UNSPEC_MMA_EXTRACT))]
> +  "TARGET_DENSE_MATH"
> +  "dmxxextfdmr256 %0,%1,2"
> +  [(set_attr "type" "mma")])
>  
>  (define_insn "mma_<acc>"
>    [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
> diff --git a/gcc/config/rs6000/predicates.md b/gcc/config/rs6000/predicates.md
> index d23ce9a77a3..3040dcd50a3 100644
> --- a/gcc/config/rs6000/predicates.md
> +++ b/gcc/config/rs6000/predicates.md
> @@ -186,6 +186,38 @@ (define_predicate "vlogical_operand"
>    return VLOGICAL_REGNO_P (REGNO (op));
>  })
>  
> +;; Return 1 if op is a DMR register
> +(define_predicate "dmr_operand"
> +  (match_operand 0 "register_operand")
> +{
> +  if (!REG_P (op))
> +    return 0;
> +
> +  if (!HARD_REGISTER_P (op))
> +    return 1;
> +
> +  return DMR_REGNO_P (REGNO (op));
> +})
> +
> +;; Return 1 if op is an accumulator.  On power10 systems, the accumulators
> +;; overlap with the FPRs, while on systems with dense math, the accumulators
> +;; are separate dense math registers and do not overlap with the FPR
> +;; registers..

Nit: an unexpected "."?

> +(define_predicate "accumulator_operand"
> +  (match_operand 0 "register_operand")
> +{

fpr_reg_operand checks for subreg as well; should we check for it here too?

> +  if (!REG_P (op))
> +    return 0;
> +
> +  if (!HARD_REGISTER_P (op))
> +    return 1;
> +
> +  int r = REGNO (op);
> +  return (TARGET_DENSE_MATH
> +	  ? DMR_REGNO_P (r)
> +	  : FP_REGNO_P (r) && (r & 3) == 0);
> +})
> +
>  ;; Return 1 if op is the carry register.
>  (define_predicate "ca_operand"
>    (match_operand 0 "register_operand")
> diff --git a/gcc/config/rs6000/rs6000-cpus.def b/gcc/config/rs6000/rs6000-cpus.def
> index b6cd6d8cc84..4621b97b522 100644
> --- a/gcc/config/rs6000/rs6000-cpus.def
> +++ b/gcc/config/rs6000/rs6000-cpus.def
> @@ -91,6 +91,7 @@
>  /* Flags for a potential future processor that may or may not be delivered.  */
>  #define ISA_FUTURE_MASKS	(ISA_3_1_MASKS_SERVER			\
>  				 | OPTION_MASK_BLOCK_OPS_VECTOR_PAIR	\
> +				 | OPTION_MASK_DENSE_MATH		\
>  				 | OPTION_MASK_FUTURE)
>  
>  /* Flags that need to be turned off if -mno-power9-vector.  */
> @@ -134,6 +135,7 @@
>  				 | OPTION_MASK_DFP			\
>  				 | OPTION_MASK_DIRECT_MOVE		\
>  				 | OPTION_MASK_DLMZB			\
> +				 | OPTION_MASK_DENSE_MATH		\
>  				 | OPTION_MASK_EFFICIENT_UNALIGNED_VSX	\
>  				 | OPTION_MASK_FLOAT128_HW		\
>  				 | OPTION_MASK_FLOAT128_KEYWORD		\
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index bc509399cf6..83e32f7a43a 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -290,7 +290,8 @@ enum rs6000_reg_type {
>    ALTIVEC_REG_TYPE,
>    FPR_REG_TYPE,
>    SPR_REG_TYPE,
> -  CR_REG_TYPE
> +  CR_REG_TYPE,
> +  DMR_REG_TYPE
>  };
>  
>  /* Map register class to register type.  */
> @@ -304,22 +305,23 @@ static enum rs6000_reg_type reg_class_to_reg_type[N_REG_CLASSES];
>  
>  
>  /* Register classes we care about in secondary reload or go if legitimate
> -   address.  We only need to worry about GPR, FPR, and Altivec registers here,
> -   along an ANY field that is the OR of the 3 register classes.  */
> +   address.  We only need to worry about GPR, FPR, Altivec, and DMR registers
> +   here, along an ANY field that is the OR of the 4 register classes.  */
>  
>  enum rs6000_reload_reg_type {
>    RELOAD_REG_GPR,			/* General purpose registers.  */
>    RELOAD_REG_FPR,			/* Traditional floating point regs.  */
>    RELOAD_REG_VMX,			/* Altivec (VMX) registers.  */
> -  RELOAD_REG_ANY,			/* OR of GPR, FPR, Altivec masks.  */
> +  RELOAD_REG_DMR,			/* DMR registers.  */
> +  RELOAD_REG_ANY,			/* OR of GPR/FPR/VMX/DMR masks.  */
>    N_RELOAD_REG
>  };
>  
> -/* For setting up register classes, loop through the 3 register classes mapping
> +/* For setting up register classes, loop through the 4 register classes mapping
>     into real registers, and skip the ANY class, which is just an OR of the
>     bits.  */
>  #define FIRST_RELOAD_REG_CLASS	RELOAD_REG_GPR
> -#define LAST_RELOAD_REG_CLASS	RELOAD_REG_VMX
> +#define LAST_RELOAD_REG_CLASS	RELOAD_REG_DMR
>  
>  /* Map reload register type to a register in the register class.  */
>  struct reload_reg_map_type {
> @@ -331,6 +333,7 @@ static const struct reload_reg_map_type reload_reg_map[N_RELOAD_REG] = {
>    { "Gpr",	FIRST_GPR_REGNO },	/* RELOAD_REG_GPR.  */
>    { "Fpr",	FIRST_FPR_REGNO },	/* RELOAD_REG_FPR.  */
>    { "VMX",	FIRST_ALTIVEC_REGNO },	/* RELOAD_REG_VMX.  */
> +  { "DMR",	FIRST_DMR_REGNO },	/* RELOAD_REG_DMR.  */
>    { "Any",	-1 },			/* RELOAD_REG_ANY.  */
>  };
>  
> @@ -1224,6 +1227,8 @@ char rs6000_reg_names[][8] =
>        "0",  "1",  "2",  "3",  "4",  "5",  "6",  "7",
>    /* vrsave vscr sfp */
>        "vrsave", "vscr", "sfp",
> +  /* DMRs */
> +      "0", "1", "2", "3", "4", "5", "6", "7",
>  };
>  
>  #ifdef TARGET_REGNAMES
> @@ -1250,6 +1255,8 @@ static const char alt_reg_names[][8] =
>    "%cr0",  "%cr1", "%cr2", "%cr3", "%cr4", "%cr5", "%cr6", "%cr7",
>    /* vrsave vscr sfp */
>    "vrsave", "vscr", "sfp",
> +  /* DMRs */
> +  "%dmr0", "%dmr1", "%dmr2", "%dmr3", "%dmr4", "%dmr5", "%dmr6", "%dmr7",

These should be spelled without the "r": in my testing, gas doesn't recognize
%dmr0, but it does recognize %dm0.

>  };
>  #endif
>  
> @@ -1846,6 +1853,9 @@ rs6000_hard_regno_nregs_internal (int regno, machine_mode mode)
>    else if (ALTIVEC_REGNO_P (regno))
>      reg_size = UNITS_PER_ALTIVEC_WORD;
>  
> +  else if (DMR_REGNO_P (regno))
> +    reg_size = UNITS_PER_DMR_WORD;
> +
>    else
>      reg_size = UNITS_PER_WORD;
>  
> @@ -1867,9 +1877,36 @@ rs6000_hard_regno_mode_ok_uncached (int regno, machine_mode mode)
>    if (mode == OOmode)
>      return (TARGET_MMA && VSX_REGNO_P (regno) && (regno & 1) == 0);
>  
> -  /* MMA accumulator modes need FPR registers divisible by 4.  */
> +  /* On ISA 3.1 (power10), MMA accumulator modes need FPR registers divisible
> +     by 4.
> +
> +     If dense math is enabled, allow all VSX registers plus the DMR registers.
> +     We need to make sure we don't cross between the boundary of FPRs and
> +     traditional Altiviec registers.  */
>    if (mode == XOmode)
> -    return (TARGET_MMA && FP_REGNO_P (regno) && (regno & 3) == 0);
> +    {
> +      if (TARGET_MMA && !TARGET_DENSE_MATH)
> +	return (FP_REGNO_P (regno) && (regno & 3) == 0);
> +
> +      else if (TARGET_DENSE_MATH)
> +	{
> +	  if (DMR_REGNO_P (regno))
> +	    return 1;
> +
> +	  if (FP_REGNO_P (regno))
> +	    return ((regno & 1) == 0 && regno <= LAST_FPR_REGNO - 3);
> +
> +	  if (ALTIVEC_REGNO_P (regno))
> +	    return ((regno & 1) == 0 && regno <= LAST_ALTIVEC_REGNO - 3);
> +	}

I may be missing something, but I couldn't find which section of the RFC
specifies this restriction.  Could you please point it out for me?  Thanks!
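
For reference, the XOmode check in the hunk above can be restated as a
standalone sketch.  xo_mode_ok is a hypothetical name, the FPR (32..63) and
Altivec (64..95) ranges are inferred from the vsNN register-name table quoted
later in this mail, and the sketch simply returns false where the real function
falls through to its later checks.

```c
#include <assert.h>
#include <stdbool.h>

enum {
  FIRST_FPR_REGNO = 32,     LAST_FPR_REGNO = 63,	/* vs0..vs31 (assumed) */
  FIRST_ALTIVEC_REGNO = 64, LAST_ALTIVEC_REGNO = 95,	/* vs32..vs63 (assumed) */
  FIRST_DMR_REGNO = 111,    LAST_DMR_REGNO = 118	/* from the patch */
};

static bool
in_range (int regno, int first, int last)
{
  return regno >= first && regno <= last;
}

/* Sketch of the XOmode test: may a 512-bit XOmode value start at REGNO?  */
static bool
xo_mode_ok (int regno, bool mma, bool dense_math)
{
  assert (regno >= 0);

  /* power10 style: FPR number divisible by 4.  */
  if (mma && !dense_math)
    return (in_range (regno, FIRST_FPR_REGNO, LAST_FPR_REGNO)
	    && (regno & 3) == 0);

  if (dense_math)
    {
      if (in_range (regno, FIRST_DMR_REGNO, LAST_DMR_REGNO))
	return true;

      /* Even register, and the 4-register group must end inside the same
	 bank, so it cannot straddle the FPR/Altivec boundary.  */
      if (in_range (regno, FIRST_FPR_REGNO, LAST_FPR_REGNO))
	return (regno & 1) == 0 && regno <= LAST_FPR_REGNO - 3;

      if (in_range (regno, FIRST_ALTIVEC_REGNO, LAST_ALTIVEC_REGNO))
	return (regno & 1) == 0 && regno <= LAST_ALTIVEC_REGNO - 3;
    }

  return false;
}
```

Under these assumptions, the dense math case accepts vs2 (hard register 34),
which the power10 case rejects, and rejects vs30 (hard register 62) because
registers 62..65 would cross from the FPRs into the Altivec registers.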

> +
> +      else
> +	return 0;
> +    }
> +
> +  /* No other types other than XOmode can go in DMRs.  */
> +  if (DMR_REGNO_P (regno))
> +    return 0;
>  
>    /* PTImode can only go in GPRs.  Quad word memory operations require even/odd
>       register combinations, and use PTImode where we need to deal with quad
> @@ -2312,6 +2349,7 @@ rs6000_debug_reg_global (void)
>    rs6000_debug_reg_print (FIRST_ALTIVEC_REGNO,
>  			  LAST_ALTIVEC_REGNO,
>  			  "vs");
> +  rs6000_debug_reg_print (FIRST_DMR_REGNO, LAST_DMR_REGNO, "dmr");

Nit: Like above, use 'dm'.

>    rs6000_debug_reg_print (LR_REGNO, LR_REGNO, "lr");
>    rs6000_debug_reg_print (CTR_REGNO, CTR_REGNO, "ctr");
>    rs6000_debug_reg_print (CR0_REGNO, CR7_REGNO, "cr");
> @@ -2332,6 +2370,7 @@ rs6000_debug_reg_global (void)
>  	   "wr reg_class = %s\n"
>  	   "wx reg_class = %s\n"
>  	   "wA reg_class = %s\n"
> +	   "wD reg_class = %s\n"
>  	   "\n",
>  	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_d]],
>  	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_v]],
> @@ -2339,7 +2378,8 @@ rs6000_debug_reg_global (void)
>  	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_we]],
>  	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wr]],
>  	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wx]],
> -	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wA]]);
> +	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wA]],
> +	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wD]]);
> 

snip ...

> +/* Subroutine to determine the move cost of dense math registers.  If we are
> +   moving to/from VSX_REGISTER registers, the cost is either 1 move (for
> +   512-bit accumulators) or 2 moves (for 1,024 dmr registers).  If we are
> +   moving to anything else like GPR registers, make the cost very high.  */
> +
> +static int
> +rs6000_dmr_register_move_cost (machine_mode mode, reg_class_t rclass)
> +{
> +  const int reg_move_base = 2;
> +  HARD_REG_SET vsx_set = (reg_class_contents[rclass]
> +			  & reg_class_contents[VSX_REGS]);
> +
> +  if (TARGET_DENSE_MATH && !hard_reg_set_empty_p (vsx_set))

Can we just use reg_classes_intersect_p (rclass, VSX_REGS)?
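
To show why the two forms should be interchangeable, here is a toy model in
which register sets are plain bit masks.  Everything prefixed with toy_ is
hypothetical (GCC's HARD_REG_SET and class tables are larger); the point is
only that "build the intersection, then test for empty" and a direct
intersect predicate give the same answer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t toy_reg_set;

enum toy_class {
  TOY_GENERAL_REGS, TOY_FLOAT_REGS, TOY_ALTIVEC_REGS,
  TOY_VSX_REGS, TOY_DM_REGS, TOY_N_CLASSES
};

/* Toy class contents: VSX is the union of the FLOAT and ALTIVEC bits.  */
static const toy_reg_set toy_class_contents[TOY_N_CLASSES] = {
  0x000000ffu,	/* TOY_GENERAL_REGS */
  0x00000f00u,	/* TOY_FLOAT_REGS */
  0x0000f000u,	/* TOY_ALTIVEC_REGS */
  0x0000ff00u,	/* TOY_VSX_REGS */
  0x00030000u	/* TOY_DM_REGS */
};

/* Form used in the patch: compute the intersection into a temporary set,
   then test whether it is empty.  */
static bool
toy_intersects_vsx_via_set (enum toy_class rclass)
{
  assert (rclass < TOY_N_CLASSES);
  toy_reg_set vsx_set = (toy_class_contents[rclass]
			 & toy_class_contents[TOY_VSX_REGS]);
  return vsx_set != 0;
}

/* Form suggested in the review: a direct two-class intersect predicate.  */
static bool
toy_classes_intersect_p (enum toy_class c1, enum toy_class c2)
{
  assert (c1 < TOY_N_CLASSES && c2 < TOY_N_CLASSES);
  return (toy_class_contents[c1] & toy_class_contents[c2]) != 0;
}
```

In this model both forms agree for every class, so the temporary set in the
patch would only be needed if it were reused elsewhere.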

> +    {
> +      /* __vector_quad (i.e. XOmode) is tranfered in 1 instruction.  */
> +      if (mode == XOmode)
> +	return reg_move_base;
> +
> +      else
> +	return reg_move_base * 2 * hard_regno_nregs (FIRST_DMR_REGNO, mode);

I guess this "else" arm is for TDOmode, which belongs to that patch.
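
The cost rule in this helper can be restated as a small sketch.  The function
toy_dmr_move_cost is hypothetical, and the register count is passed in as a
parameter because the value of UNITS_PER_DMR_WORD (and so the result of
hard_regno_nregs) is not shown in this part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the DMR move cost.  NREGS stands in for
   hard_regno_nregs (FIRST_DMR_REGNO, mode).  */
static int
toy_dmr_move_cost (bool is_xo_mode, bool other_class_has_vsx,
		   bool dense_math, int nregs)
{
  const int reg_move_base = 2;

  assert (nregs > 0);
  if (dense_math && other_class_has_vsx)
    {
      /* A 512-bit accumulator moves to or from VSX registers in one
	 instruction.  */
      if (is_xo_mode)
	return reg_move_base;

      /* Larger modes need two moves per register.  */
      return reg_move_base * 2 * nregs;
    }

  /* Anything else (e.g. GPRs) is made very expensive.  */
  return 1000 * 2 * nregs;
}
```

So an XOmode move between a DMR and the VSX registers costs 2, while the same
move to a class with no VSX registers costs 2000 per register.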

> +    }
> +
> +  return 1000 * 2 * hard_regno_nregs (FIRST_DMR_REGNO, mode);
> +}
> +
>  /* A C expression returning the cost of moving data from a register of class
>     CLASS1 to one of CLASS2.  */
>  
> @@ -22843,17 +22969,28 @@ rs6000_register_move_cost (machine_mode mode,
>    if (TARGET_DEBUG_COST)
>      dbg_cost_ctrl++;
>  

snip ...

>  /* Table of additional register names to use in user input.  */
> @@ -2132,6 +2158,8 @@ extern char rs6000_reg_names[][8];	/* register names (0 vs. %r0).  */
>    {"vs52", 84}, {"vs53", 85}, {"vs54", 86}, {"vs55", 87},	\
>    {"vs56", 88}, {"vs57", 89}, {"vs58", 90}, {"vs59", 91},	\
>    {"vs60", 92}, {"vs61", 93}, {"vs62", 94}, {"vs63", 95},	\
> +  {"dmr0", 111}, {"dmr1", 112}, {"dmr2", 113}, {"dmr3", 114},	\
> +  {"dmr4", 115}, {"dmr5", 116}, {"dmr6", 117}, {"dmr7", 118},	\

Nit: maybe s/dmr/dm/ to align with the previous regnames.

>  }
>  
>  /* This is how to output an element of a case-vector that is relative.  */
> diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
> index a125fd8fc99..72af3e6ef70 100644
> --- a/gcc/config/rs6000/rs6000.md
> +++ b/gcc/config/rs6000/rs6000.md
> @@ -51,6 +51,8 @@ (define_constants
>     (VRSAVE_REGNO		108)
>     (VSCR_REGNO			109)
>     (FRAME_POINTER_REGNUM	110)
> +   (FIRST_DMR_REGNO		111)
> +   (LAST_DMR_REGNO		118)
>    ])
>  
>  ;;
> @@ -355,7 +357,7 @@ (define_attr "cpu"
>    (const (symbol_ref "(enum attr_cpu) rs6000_tune")))
>  
>  ;; The ISA we implement.
> -(define_attr "isa" "any,p5,p6,p7,p7v,p8v,p9,p9v,p9kf,p9tf,p10,lxvp,stxvp"
> +(define_attr "isa" "any,p5,p6,p7,p7v,p8v,p9,p9v,p9kf,p9tf,p10,lxvp,stxvp,dm,not_dm"

Nit: s/not_dm/nodm/ to align with some previous wording.

BR,
Kewen
  
Michael Meissner Feb. 7, 2024, 12:06 a.m. UTC | #3
On Thu, Jan 25, 2024 at 05:28:49PM +0800, Kewen.Lin wrote:
> Hi Mike,
> 
> on 2024/1/6 07:38, Michael Meissner wrote:
> > The MMA subsystem added the notion of accumulator registers as an optional
> > feature of ISA 3.1 (power10).  In ISA 3.1, these accumulators overlapped with
> > the traditional floating point registers 0..31, but logically the accumulator
> > registers were separate from the FPR registers.  In ISA 3.1, it was anticipated
> 
> Using VSX register 0..31 rather than traditional floating point registers 0..31
> seems more clear, since floating point registers imply 64 bit long registers.

Ok.

> > that in future systems, the accumulator registers may no overlap with the FPR
> > registers.  This patch adds the support for dense math registers as separate
> > registers.
> > 
> > This particular patch does not change the MMA support to use the accumulators
> > within the dense math registers.  This patch just adds the basic support for
> > having separate DMRs.  The next patch will switch the MMA support to use the
> > accumulators if -mcpu=future is used.
> > 
> > For testing purposes, I added an undocumented option '-mdense-math' to enable
> > or disable the dense math support.
> 
> Can we avoid this and use one macro for it instead?  As you might have noticed
> that some previous temporary options like -mpower{8,9}-vector cause ICEs due to
> some unexpected combination and we are going to neuter them, so let's try our
> best to avoid it if possible.  I guess one macro TARGET_DENSE_MATH defined by
> TARGET_FUTURE && TARGET_MMA matches all use places? and specifying -mcpu=future
> can enable it while -mcpu=power10 can disable it.

That depends on whether there will be other things added in the future power
that are not in the MMA+ instruction set.

But I can switch to defining TARGET_DENSE_MATH as a test of TARGET_FUTURE and
TARGET_MMA.  That way if/when a new cpu comes out, we will just have to change
the definition of TARGET_DENSE_MATH and not all of the uses.

I will also add TARGET_MMA_NO_DENSE_MATH to handle the existing MMA code for
assemble and disassemble when we don't have dense math instructions.
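
Roughly, the macros could look like the sketch below.  This is only an illustration: the two bool variables are hypothetical stand-ins for the real option flags, and the exact definition of TARGET_MMA_NO_DENSE_MATH is an assumption based on the description above, not the final patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the option-derived flags.  In GCC these come
   from the option machinery, not from plain variables.  */
static bool toy_target_future;
static bool toy_target_mma;

#define TARGET_FUTURE	toy_target_future
#define TARGET_MMA	toy_target_mma

/* Dense math is available only when both the future cpu and MMA are
   selected.  If a new cpu changes the rule, only this line changes.  */
#define TARGET_DENSE_MATH	(TARGET_FUTURE && TARGET_MMA)

/* Assumed definition: MMA without dense math, i.e. the power10 style
   accumulators that overlap with the VSX registers.  */
#define TARGET_MMA_NO_DENSE_MATH	(TARGET_MMA && !TARGET_DENSE_MATH)

/* Helper for the sketch: pretend a given -mcpu= selection was made.  */
static void
toy_select_cpu (bool future, bool mma)
{
  toy_target_future = future;
  toy_target_mma = mma;
}
```

With this shape, toy_select_cpu (false, true) models -mcpu=power10 (MMA only) and toy_select_cpu (true, true) models -mcpu=future; exactly one of the two macros is true in each case, and both are false when MMA is off.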

> > 
> > This patch adds a new constraint (wD).  If MMA is selected but dense math is
> > not selected (i.e. -mcpu=power10), the wD constraint will allow access to
> > accumulators that overlap with the VSX vector registers 0..31.  If both MMA and
> 
> Sorry for nitpicking, it's more accurate with "VSX registers 0..31".

Ok.

> > diff --git a/gcc/config/rs6000/constraints.md b/gcc/config/rs6000/constraints.md
> > index c99997bf82b..614e431c085 100644
> > --- a/gcc/config/rs6000/constraints.md
> > +++ b/gcc/config/rs6000/constraints.md
> > @@ -107,6 +107,9 @@ (define_constraint "wB"
> >         (match_test "TARGET_P8_VECTOR")
> >         (match_operand 0 "s5bit_cint_operand")))
> >  
> > +(define_register_constraint "wD" "rs6000_constraints[RS6000_CONSTRAINT_wD]"
> > +  "Accumulator register.")
> > +
> >  (define_constraint "wE"
> >    "@internal Vector constant that can be loaded with the XXSPLTIB instruction."
> >    (match_test "xxspltib_constant_nosplit (op, mode)"))
> > diff --git a/gcc/config/rs6000/mma.md b/gcc/config/rs6000/mma.md
> > index 6a7d8a836db..bb898919ab5 100644
> > --- a/gcc/config/rs6000/mma.md
> > +++ b/gcc/config/rs6000/mma.md
> > @@ -91,6 +91,7 @@ (define_c_enum "unspec"
> >     UNSPEC_MMA_XVI8GER4SPP
> >     UNSPEC_MMA_XXMFACC
> >     UNSPEC_MMA_XXMTACC
> > +   UNSPEC_DM_ASSEMBLE_ACC
> 
> The other UNSPEC.*ASSEMBLE like UNSPECV_MMA_ASSEMBLE don't have _ACC suffix,
> it's better to keep consistent if this suffix doesn't distinguish something.

Ok.

> >    ])
> >  
> >  (define_c_enum "unspecv"
> > @@ -321,7 +322,9 @@ (define_insn_and_split "*movoo"
> >     (set_attr "length" "*,8,*,8,8")
> >     (set_attr "isa" "lxvp,*,stxvp,*,*")])
> >  
> > -;; Vector quad support.  XOmode can only live in FPRs.
> > +;; Vector quad support.  Under the original MMA, XOmode can only live in VSX
> > +;; vector registers 0..31.  With dense math, XOmode can live in either VSX
> 
> Nit: s/vector//

Ok.

> > +;; registers (0..63) or DMR registers.
> >  (define_expand "movxo"
> >    [(set (match_operand:XO 0 "nonimmediate_operand")
> >  	(match_operand:XO 1 "input_operand"))]
> > @@ -346,10 +349,10 @@ (define_expand "movxo"
> >      gcc_assert (false);
> >  })
> >  
> > -(define_insn_and_split "*movxo"
> > +(define_insn_and_split "*movxo_nodm"
> >    [(set (match_operand:XO 0 "nonimmediate_operand" "=d,ZwO,d")
> >  	(match_operand:XO 1 "input_operand" "ZwO,d,d"))]
> > -  "TARGET_MMA
> > +  "TARGET_MMA && !TARGET_DENSE_MATH
> >     && (gpc_reg_operand (operands[0], XOmode)
> >         || gpc_reg_operand (operands[1], XOmode))"
> >    "@
> > @@ -366,6 +369,31 @@ (define_insn_and_split "*movxo"
> >     (set_attr "length" "*,*,16")
> >     (set_attr "max_prefixed_insns" "2,2,*")])
> >  
> > +(define_insn_and_split "*movxo_dm"
> > +  [(set (match_operand:XO 0 "nonimmediate_operand" "=wa,QwO,wa,wD,wD,wa")
> > +	(match_operand:XO 1 "input_operand"        "QwO,wa, wa,wa,wD,wD"))]
> 
> Why not adopt ZwO rather than QwO?

You have to split the address into 2 addresses for loading or storing vector
pairs (or 4 addresses for loading or storing vectors).  Z would allow
register+register addresses, and you wouldn't be able to create the second 
address by adding 128 to it.  Hence it uses 'Q' for register only and 'wO' for
d-form addresses.
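
As a rough illustration of that reasoning (a toy model only, with hypothetical names; it does not describe what adjust_address or the real split code does), an address with a constant field can absorb a further constant, while a register+register address has nowhere to put it:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical address shapes, for illustration only.  */
enum toy_addr_kind {
  TOY_ADDR_REG,		/* (reg), the 'Q' style address.  */
  TOY_ADDR_REG_OFFSET,	/* (reg + constant), a d-form address.  */
  TOY_ADDR_REG_REG	/* (reg + reg), an indexed address.  */
};

struct toy_addr {
  enum toy_addr_kind kind;
  int base_reg;
  int index_reg;	/* Only meaningful for TOY_ADDR_REG_REG.  */
  long offset;		/* Only meaningful for TOY_ADDR_REG_OFFSET.  */
};

/* Try to form the address DELTA bytes past IN without using an extra
   register.  Store it in OUT and return true on success; return false
   if the address shape has no constant field to fold DELTA into.  */
static bool
toy_addr_plus (const struct toy_addr *in, long delta, struct toy_addr *out)
{
  switch (in->kind)
    {
    case TOY_ADDR_REG:
      out->kind = TOY_ADDR_REG_OFFSET;
      out->base_reg = in->base_reg;
      out->index_reg = -1;
      out->offset = delta;
      return true;

    case TOY_ADDR_REG_OFFSET:
      *out = *in;
      out->offset += delta;
      return true;

    case TOY_ADDR_REG_REG:
      /* reg + reg + constant is not a single address in this model.  */
      return false;
    }

  return false;
}
```

In this model the split of a multi-register load or store succeeds for the first two shapes and fails for the indexed one, which is the case the constraint is meant to exclude.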

> 
> > +  "TARGET_DENSE_MATH
> > +   && (gpc_reg_operand (operands[0], XOmode)
> > +       || gpc_reg_operand (operands[1], XOmode))"
> > +  "@
> > +   #
> > +   #
> > +   #
> > +   dmxxinstdmr512 %0,%1,%Y1,0
> > +   dmmr %0,%1
> > +   dmxxextfdmr512 %0,%Y0,%1,0"
> > +  "&& reload_completed
> > +   && !dmr_operand (operands[0], XOmode)
> > +   && !dmr_operand (operands[1], XOmode)"
> > +  [(const_int 0)]
> > +{
> > +  rs6000_split_multireg_move (operands[0], operands[1]);
> > +  DONE;
> > +}
> > +  [(set_attr "type" "vecload,vecstore,veclogical,mma,mma,mma")
> > +   (set_attr "length" "*,*,16,*,*,*")
> > +   (set_attr "max_prefixed_insns" "2,2,*,*,*,*")])
> > +
> >  (define_expand "vsx_assemble_pair"
> >    [(match_operand:OO 0 "vsx_register_operand")
> >     (match_operand:V16QI 1 "mma_assemble_input_operand")
> > @@ -433,25 +461,38 @@ (define_insn_and_split "*vsx_disassemble_pair"
> >  })
> >  
> >  (define_expand "mma_assemble_acc"
> > -  [(match_operand:XO 0 "fpr_reg_operand")
> > +  [(match_operand:XO 0 "register_operand")
> 
> Maybe use the newly introduced accumulator_operand?

Ok.

> 
> >     (match_operand:V16QI 1 "mma_assemble_input_operand")
> >     (match_operand:V16QI 2 "mma_assemble_input_operand")
> >     (match_operand:V16QI 3 "mma_assemble_input_operand")
> >     (match_operand:V16QI 4 "mma_assemble_input_operand")]
> >    "TARGET_MMA"
> >  {
> > -  rtx src = gen_rtx_UNSPEC_VOLATILE (XOmode,
> > -			    	     gen_rtvec (4, operands[1], operands[2],
> > -				       		operands[3], operands[4]),
> > -			    	     UNSPECV_MMA_ASSEMBLE);
> > -  emit_move_insn (operands[0], src);
> > +  rtx op0 = operands[0];
> > +  rtx op1 = operands[1];
> > +  rtx op2 = operands[2];
> > +  rtx op3 = operands[3];
> > +  rtx op4 = operands[4];
> > +
> > +  if (TARGET_DENSE_MATH)
> > +    {
> > +      rtx vpair1 = gen_reg_rtx (OOmode);
> > +      rtx vpair2 = gen_reg_rtx (OOmode);
> > +      emit_insn (gen_vsx_assemble_pair (vpair1, op1, op2));
> > +      emit_insn (gen_vsx_assemble_pair (vpair2, op3, op4));
> > +      emit_insn (gen_mma_assemble_acc_dm (op0, vpair1, vpair2));
> > +    }
> > +
> > +  else
> > +    emit_insn (gen_mma_assemble_acc_vsx (op0, op1, op2, op3, op4));
> > +
> >    DONE;
> >  })
> >  
> >  ;; We cannot update the four output registers atomically, so mark the output
> > -;; as an early clobber so we don't accidentally clobber the input operands.  */
> > +;; as an early clobber so we don't accidentally clobber the input operands.
> >  
> > -(define_insn_and_split "*mma_assemble_acc"
> > +(define_insn_and_split "mma_assemble_acc_vsx"
> 
> Nit: since we use "*_nodm" above, it seems better to name it with
> "mma_assemble_acc_nodm" which has the same style?

Ok.

> >    [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
> >  	(unspec_volatile:XO
> >  	  [(match_operand:V16QI 1 "mma_assemble_input_operand" "mwa")
> > @@ -459,7 +500,7 @@ (define_insn_and_split "*mma_assemble_acc"
> >  	   (match_operand:V16QI 3 "mma_assemble_input_operand" "mwa")
> >  	   (match_operand:V16QI 4 "mma_assemble_input_operand" "mwa")]
> >  	  UNSPECV_MMA_ASSEMBLE))]
> > -  "TARGET_MMA
> > +  "TARGET_MMA && !TARGET_DENSE_MATH
> >     && fpr_reg_operand (operands[0], XOmode)"
> >    "#"
> >    "&& reload_completed"
> > @@ -473,28 +514,31 @@ (define_insn_and_split "*mma_assemble_acc"
> >    DONE;
> >  })
> >  
> > +;; On a system with dense math, we build the accumulators from two vector
> > +;; pairs.
> > +
> > +(define_insn "mma_assemble_acc_dm"
> > + [(set (match_operand:XO 0 "dmr_operand" "=wD")
> > +       (unspec:XO [(match_operand:OO 1 "vsx_register_operand" "wa")
> > +		   (match_operand:OO 2 "vsx_register_operand" "wa")]
> > +		  UNSPEC_DM_ASSEMBLE_ACC))]
> > + "TARGET_MMA && TARGET_DENSE_MATH"
> 
> Nit: redundant TARGET_MMA checking.

Ok.

> > + "dmxxinstdmr512 %0,%1,%2,0"
> > + [(set_attr "type" "mma")])
> > +
> >  (define_expand "mma_disassemble_acc"
> > -  [(match_operand:V16QI 0 "mma_disassemble_output_operand")
> > -   (match_operand:XO 1 "fpr_reg_operand")
> > -   (match_operand 2 "const_0_to_3_operand")]
> > -  "TARGET_MMA"
> > -{
> > -  rtx src;
> > -  int regoff = INTVAL (operands[2]);
> > -  src = gen_rtx_UNSPEC (V16QImode,
> > -			gen_rtvec (2, operands[1], GEN_INT (regoff)),
> > -			UNSPEC_MMA_EXTRACT);
> > -  emit_move_insn (operands[0], src);
> > -  DONE;
> > -})
> > +  [(set (match_operand:V16QI 0 "register_operand")
> > +	(unspec:V16QI [(match_operand:XO 1 "register_operand")
> 
> s/register_operand/accumulator_operand/?

Ok

> 
> > +		       (match_operand 2 "const_0_to_3_operand")]
> > +		      UNSPEC_MMA_EXTRACT))]
> > +  "TARGET_MMA")
> >  
> > -(define_insn_and_split "*mma_disassemble_acc"
> > +(define_insn_and_split "*mma_disassemble_acc_vsx"
> >    [(set (match_operand:V16QI 0 "mma_disassemble_output_operand" "=mwa")
> > -       (unspec:V16QI [(match_operand:XO 1 "fpr_reg_operand" "d")
> > -		      (match_operand 2 "const_0_to_3_operand")]
> > +	(unspec:V16QI [(match_operand:XO 1 "fpr_reg_operand" "d")
> > +		       (match_operand 2 "const_0_to_3_operand")]
> >  		      UNSPEC_MMA_EXTRACT))]
> > -  "TARGET_MMA
> > -   && fpr_reg_operand (operands[1], XOmode)"
> > +  "TARGET_MMA"
> 
> Do we still expect to see this pattern if TARGET_DENSE_MATH?
> If no, we should guard the condition with !TARGET_DENSE_MATH.

Ok.
> 
> >    "#"
> >    "&& reload_completed"
> >    [(const_int 0)]
> > @@ -506,9 +550,14 @@ (define_insn_and_split "*mma_disassemble_acc"
> >    DONE;
> >  })
> >  
> > -;; MMA instructions that do not use their accumulators as an input, still
> > -;; must not allow their vector operands to overlap the registers used by
> > -;; the accumulator.  We enforce this by marking the output as early clobber.
> > +(define_insn "*mma_disassemble_acc_dm"
> > +  [(set (match_operand:V16QI 0 "vsx_register_operand" "=wa")
> > +	(unspec:V16QI [(match_operand:XO 1 "dmr_operand" "wD")
> > +		       (match_operand 2 "const_0_to_3_operand")]
> > +		      UNSPEC_MMA_EXTRACT))]
> > +  "TARGET_DENSE_MATH"
> > +  "dmxxextfdmr256 %0,%1,2"
> > +  [(set_attr "type" "mma")])
> >  
> >  (define_insn "mma_<acc>"
> >    [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
> > diff --git a/gcc/config/rs6000/predicates.md b/gcc/config/rs6000/predicates.md
> > index d23ce9a77a3..3040dcd50a3 100644
> > --- a/gcc/config/rs6000/predicates.md
> > +++ b/gcc/config/rs6000/predicates.md
> > @@ -186,6 +186,38 @@ (define_predicate "vlogical_operand"
> >    return VLOGICAL_REGNO_P (REGNO (op));
> >  })
> >  
> > +;; Return 1 if op is a DMR register
> > +(define_predicate "dmr_operand"
> > +  (match_operand 0 "register_operand")
> > +{
> > +  if (!REG_P (op))
> > +    return 0;
> > +
> > +  if (!HARD_REGISTER_P (op))
> > +    return 1;
> > +
> > +  return DMR_REGNO_P (REGNO (op));
> > +})
> > +
> > +;; Return 1 if op is an accumulator.  On power10 systems, the accumulators
> > +;; overlap with the FPRs, while on systems with dense math, the accumulators
> > +;; are separate dense math registers and do not overlap with the FPR
> > +;; registers..
> 
> Nit: an unexpected "."?
> 
> > +(define_predicate "accumulator_operand"
> > +  (match_operand 0 "register_operand")
> > +{
> 
> fpr_reg_operand checks for subreg as well, should we check for it here as well?
> 
> > +  if (!REG_P (op))
> > +    return 0;
> > +
> > +  if (!HARD_REGISTER_P (op))
> > +    return 1;
> > +
> > +  int r = REGNO (op);
> > +  return (TARGET_DENSE_MATH
> > +	  ? DMR_REGNO_P (r)
> > +	  : FP_REGNO_P (r) && (r & 3) == 0);
> > +})
> > +
> >  ;; Return 1 if op is the carry register.
> >  (define_predicate "ca_operand"
> >    (match_operand 0 "register_operand")
> > diff --git a/gcc/config/rs6000/rs6000-cpus.def b/gcc/config/rs6000/rs6000-cpus.def
> > index b6cd6d8cc84..4621b97b522 100644
> > --- a/gcc/config/rs6000/rs6000-cpus.def
> > +++ b/gcc/config/rs6000/rs6000-cpus.def
> > @@ -91,6 +91,7 @@
> >  /* Flags for a potential future processor that may or may not be delivered.  */
> >  #define ISA_FUTURE_MASKS	(ISA_3_1_MASKS_SERVER			\
> >  				 | OPTION_MASK_BLOCK_OPS_VECTOR_PAIR	\
> > +				 | OPTION_MASK_DENSE_MATH		\
> >  				 | OPTION_MASK_FUTURE)
> >  
> >  /* Flags that need to be turned off if -mno-power9-vector.  */
> > @@ -134,6 +135,7 @@
> >  				 | OPTION_MASK_DFP			\
> >  				 | OPTION_MASK_DIRECT_MOVE		\
> >  				 | OPTION_MASK_DLMZB			\
> > +				 | OPTION_MASK_DENSE_MATH		\
> >  				 | OPTION_MASK_EFFICIENT_UNALIGNED_VSX	\
> >  				 | OPTION_MASK_FLOAT128_HW		\
> >  				 | OPTION_MASK_FLOAT128_KEYWORD		\
> > diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> > index bc509399cf6..83e32f7a43a 100644
> > --- a/gcc/config/rs6000/rs6000.cc
> > +++ b/gcc/config/rs6000/rs6000.cc
> > @@ -290,7 +290,8 @@ enum rs6000_reg_type {
> >    ALTIVEC_REG_TYPE,
> >    FPR_REG_TYPE,
> >    SPR_REG_TYPE,
> > -  CR_REG_TYPE
> > +  CR_REG_TYPE,
> > +  DMR_REG_TYPE
> >  };
> >  
> >  /* Map register class to register type.  */
> > @@ -304,22 +305,23 @@ static enum rs6000_reg_type reg_class_to_reg_type[N_REG_CLASSES];
> >  
> >  
> >  /* Register classes we care about in secondary reload or go if legitimate
> > -   address.  We only need to worry about GPR, FPR, and Altivec registers here,
> > -   along an ANY field that is the OR of the 3 register classes.  */
> > +   address.  We only need to worry about GPR, FPR, Altivec, and DMR registers
> > +   here, along an ANY field that is the OR of the 4 register classes.  */
> >  
> >  enum rs6000_reload_reg_type {
> >    RELOAD_REG_GPR,			/* General purpose registers.  */
> >    RELOAD_REG_FPR,			/* Traditional floating point regs.  */
> >    RELOAD_REG_VMX,			/* Altivec (VMX) registers.  */
> > -  RELOAD_REG_ANY,			/* OR of GPR, FPR, Altivec masks.  */
> > +  RELOAD_REG_DMR,			/* DMR registers.  */
> > +  RELOAD_REG_ANY,			/* OR of GPR/FPR/VMX/DMR masks.  */
> >    N_RELOAD_REG
> >  };
> >  
> > -/* For setting up register classes, loop through the 3 register classes mapping
> > +/* For setting up register classes, loop through the 4 register classes mapping
> >     into real registers, and skip the ANY class, which is just an OR of the
> >     bits.  */
> >  #define FIRST_RELOAD_REG_CLASS	RELOAD_REG_GPR
> > -#define LAST_RELOAD_REG_CLASS	RELOAD_REG_VMX
> > +#define LAST_RELOAD_REG_CLASS	RELOAD_REG_DMR
> >  
> >  /* Map reload register type to a register in the register class.  */
> >  struct reload_reg_map_type {
> > @@ -331,6 +333,7 @@ static const struct reload_reg_map_type reload_reg_map[N_RELOAD_REG] = {
> >    { "Gpr",	FIRST_GPR_REGNO },	/* RELOAD_REG_GPR.  */
> >    { "Fpr",	FIRST_FPR_REGNO },	/* RELOAD_REG_FPR.  */
> >    { "VMX",	FIRST_ALTIVEC_REGNO },	/* RELOAD_REG_VMX.  */
> > +  { "DMR",	FIRST_DMR_REGNO },	/* RELOAD_REG_DMR.  */
> >    { "Any",	-1 },			/* RELOAD_REG_ANY.  */
> >  };
> >  
> > @@ -1224,6 +1227,8 @@ char rs6000_reg_names[][8] =
> >        "0",  "1",  "2",  "3",  "4",  "5",  "6",  "7",
> >    /* vrsave vscr sfp */
> >        "vrsave", "vscr", "sfp",
> > +  /* DMRs */
> > +      "0", "1", "2", "3", "4", "5", "6", "7",
> >  };
> >  
> >  #ifdef TARGET_REGNAMES
> > @@ -1250,6 +1255,8 @@ static const char alt_reg_names[][8] =
> >    "%cr0",  "%cr1", "%cr2", "%cr3", "%cr4", "%cr5", "%cr6", "%cr7",
> >    /* vrsave vscr sfp */
> >    "vrsave", "vscr", "sfp",
> > +  /* DMRs */
> > +  "%dmr0", "%dmr1", "%dmr2", "%dmr3", "%dmr4", "%dmr5", "%dmr6", "%dmr7",
> 
> Should be without "r" here, as tested gas doesn't recognize %dmr0 but it does
> recognize %dm0.
> 
> >  };
> >  #endif
> >  
> > @@ -1846,6 +1853,9 @@ rs6000_hard_regno_nregs_internal (int regno, machine_mode mode)
> >    else if (ALTIVEC_REGNO_P (regno))
> >      reg_size = UNITS_PER_ALTIVEC_WORD;
> >  
> > +  else if (DMR_REGNO_P (regno))
> > +    reg_size = UNITS_PER_DMR_WORD;
> > +
> >    else
> >      reg_size = UNITS_PER_WORD;
> >  
> > @@ -1867,9 +1877,36 @@ rs6000_hard_regno_mode_ok_uncached (int regno, machine_mode mode)
> >    if (mode == OOmode)
> >      return (TARGET_MMA && VSX_REGNO_P (regno) && (regno & 1) == 0);
> >  
> > -  /* MMA accumulator modes need FPR registers divisible by 4.  */
> > +  /* On ISA 3.1 (power10), MMA accumulator modes need FPR registers divisible
> > +     by 4.
> > +
> > +     If dense math is enabled, allow all VSX registers plus the DMR registers.
> > +     We need to make sure we don't cross between the boundary of FPRs and
> > +     traditional Altiviec registers.  */
> >    if (mode == XOmode)
> > -    return (TARGET_MMA && FP_REGNO_P (regno) && (regno & 3) == 0);
> > +    {
> > +      if (TARGET_MMA && !TARGET_DENSE_MATH)
> > +	return (FP_REGNO_P (regno) && (regno & 3) == 0);
> > +
> > +      else if (TARGET_DENSE_MATH)
> > +	{
> > +	  if (DMR_REGNO_P (regno))
> > +	    return 1;
> > +
> > +	  if (FP_REGNO_P (regno))
> > +	    return ((regno & 1) == 0 && regno <= LAST_FPR_REGNO - 3);
> > +
> > +	  if (ALTIVEC_REGNO_P (regno))
> > +	    return ((regno & 1) == 0 && regno <= LAST_ALTIVEC_REGNO - 3);
> > +	}
> 
> I could miss something, I didn't find which section of RFC indicates this
> restriction, could you please point out for me?  Thanks!
> 
> > +
> > +      else
> > +	return 0;
> > +    }
> > +
> > +  /* No other types other than XOmode can go in DMRs.  */
> > +  if (DMR_REGNO_P (regno))
> > +    return 0;
> >  
> >    /* PTImode can only go in GPRs.  Quad word memory operations require even/odd
> >       register combinations, and use PTImode where we need to deal with quad
> > @@ -2312,6 +2349,7 @@ rs6000_debug_reg_global (void)
> >    rs6000_debug_reg_print (FIRST_ALTIVEC_REGNO,
> >  			  LAST_ALTIVEC_REGNO,
> >  			  "vs");
> > +  rs6000_debug_reg_print (FIRST_DMR_REGNO, LAST_DMR_REGNO, "dmr");
> 
> Nit: Like above, use 'dm'.
> 
> >    rs6000_debug_reg_print (LR_REGNO, LR_REGNO, "lr");
> >    rs6000_debug_reg_print (CTR_REGNO, CTR_REGNO, "ctr");
> >    rs6000_debug_reg_print (CR0_REGNO, CR7_REGNO, "cr");
> > @@ -2332,6 +2370,7 @@ rs6000_debug_reg_global (void)
> >  	   "wr reg_class = %s\n"
> >  	   "wx reg_class = %s\n"
> >  	   "wA reg_class = %s\n"
> > +	   "wD reg_class = %s\n"
> >  	   "\n",
> >  	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_d]],
> >  	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_v]],
> > @@ -2339,7 +2378,8 @@ rs6000_debug_reg_global (void)
> >  	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_we]],
> >  	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wr]],
> >  	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wx]],
> > -	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wA]]);
> > +	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wA]],
> > +	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wD]]);
> > 
> 
> snip ...
> 
> > +/* Subroutine to determine the move cost of dense math registers.  If we are
> > +   moving to/from VSX_REGISTER registers, the cost is either 1 move (for
> > +   512-bit accumulators) or 2 moves (for 1,024 dmr registers).  If we are
> > +   moving to anything else like GPR registers, make the cost very high.  */
> > +
> > +static int
> > +rs6000_dmr_register_move_cost (machine_mode mode, reg_class_t rclass)
> > +{
> > +  const int reg_move_base = 2;
> > +  HARD_REG_SET vsx_set = (reg_class_contents[rclass]
> > +			  & reg_class_contents[VSX_REGS]);
> > +
> > +  if (TARGET_DENSE_MATH && !hard_reg_set_empty_p (vsx_set))
> 
> Can we just use reg_classes_intersect_p (rclass, VSX_REGS)?
> 
> > +    {
> > +      /* __vector_quad (i.e. XOmode) is tranfered in 1 instruction.  */
> > +      if (mode == XOmode)
> > +	return reg_move_base;
> > +
> > +      else
> > +	return reg_move_base * 2 * hard_regno_nregs (FIRST_DMR_REGNO, mode);
> 
> I guess this "else" arm is for TDOmode, which belongs to that patch.
> 
> > +    }
> > +
> > +  return 1000 * 2 * hard_regno_nregs (FIRST_DMR_REGNO, mode);
> > +}
> > +
> >  /* A C expression returning the cost of moving data from a register of class
> >     CLASS1 to one of CLASS2.  */
> >  
> > @@ -22843,17 +22969,28 @@ rs6000_register_move_cost (machine_mode mode,
> >    if (TARGET_DEBUG_COST)
> >      dbg_cost_ctrl++;
> >  
> 
> snip ...
> 
> >  /* Table of additional register names to use in user input.  */
> > @@ -2132,6 +2158,8 @@ extern char rs6000_reg_names[][8];	/* register names (0 vs. %r0).  */
> >    {"vs52", 84}, {"vs53", 85}, {"vs54", 86}, {"vs55", 87},	\
> >    {"vs56", 88}, {"vs57", 89}, {"vs58", 90}, {"vs59", 91},	\
> >    {"vs60", 92}, {"vs61", 93}, {"vs62", 94}, {"vs63", 95},	\
> > +  {"dmr0", 111}, {"dmr1", 112}, {"dmr2", 113}, {"dmr3", 114},	\
> > +  {"dmr4", 115}, {"dmr5", 116}, {"dmr6", 117}, {"dmr7", 118},	\
> 
> Nit: maybe s/dmr/dm/ to align the previous regnames.
> 
> >  }
> >  
> >  /* This is how to output an element of a case-vector that is relative.  */
> > diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
> > index a125fd8fc99..72af3e6ef70 100644
> > --- a/gcc/config/rs6000/rs6000.md
> > +++ b/gcc/config/rs6000/rs6000.md
> > @@ -51,6 +51,8 @@ (define_constants
> >     (VRSAVE_REGNO		108)
> >     (VSCR_REGNO			109)
> >     (FRAME_POINTER_REGNUM	110)
> > +   (FIRST_DMR_REGNO		111)
> > +   (LAST_DMR_REGNO		118)
> >    ])
> >  
> >  ;;
> > @@ -355,7 +357,7 @@ (define_attr "cpu"
> >    (const (symbol_ref "(enum attr_cpu) rs6000_tune")))
> >  
> >  ;; The ISA we implement.
> > -(define_attr "isa" "any,p5,p6,p7,p7v,p8v,p9,p9v,p9kf,p9tf,p10,lxvp,stxvp"
> > +(define_attr "isa" "any,p5,p6,p7,p7v,p8v,p9,p9v,p9kf,p9tf,p10,lxvp,stxvp,dm,not_dm"
> 
> Nit: s/not_dm/nodm/ to align with some previous wording.
> 
> BR,
> Kewen
>
  
Kewen.Lin Feb. 7, 2024, 9:38 a.m. UTC | #4
on 2024/2/7 08:06, Michael Meissner wrote:
> On Thu, Jan 25, 2024 at 05:28:49PM +0800, Kewen.Lin wrote:
>> Hi Mike,
>>
>> on 2024/1/6 07:38, Michael Meissner wrote:
>>> The MMA subsystem added the notion of accumulator registers as an optional
>>> feature of ISA 3.1 (power10).  In ISA 3.1, these accumulators overlapped with
>>> the traditional floating point registers 0..31, but logically the accumulator
>>> registers were separate from the FPR registers.  In ISA 3.1, it was anticipated
>>
>> Using VSX register 0..31 rather than traditional floating point registers 0..31
>> seems more clear, since floating point registers imply 64 bit long registers.
> 
> Ok.
> 
>>> that in future systems, the accumulator registers may no overlap with the FPR
>>> registers.  This patch adds the support for dense math registers as separate
>>> registers.
>>>
>>> This particular patch does not change the MMA support to use the accumulators
>>> within the dense math registers.  This patch just adds the basic support for
>>> having separate DMRs.  The next patch will switch the MMA support to use the
>>> accumulators if -mcpu=future is used.
>>>
>>> For testing purposes, I added an undocumented option '-mdense-math' to enable
>>> or disable the dense math support.
>>
>> Can we avoid this and use one macro for it instead?  As you might have noticed
>> that some previous temporary options like -mpower{8,9}-vector cause ICEs due to
>> some unexpected combination and we are going to neuter them, so let's try our
>> best to avoid it if possible.  I guess one macro TARGET_DENSE_MATH defined by
>> TARGET_FUTURE && TARGET_MMA matches all use places? and specifying -mcpu=future
>> can enable it while -mcpu=power10 can disable it.
> 
> That depends on whether there will be other things added in the future power
> that are not in the MMA+ instruction set.
> 
> But I can switch to defining TARGET_DENSE_MATH to testing TARGET_FUTURE and
> TARGET_MMA.  That way if/when a new cpu comes out, we will just have to change
> the definition of TARGET_DENSE_MATH and not all of the uses.

Yes, that's what I expected.  Thanks!

> 
> I will also add TARGET_MMA_NO_DENSE_MATH to handle the existing MMA code for
> assemble and disassemble when we don't have dense math instructions.

Nice, I also found that having such a macro helps when reviewing a later patch,
so I suggested something similar there.

>>> -(define_insn_and_split "*movxo"
>>> +(define_insn_and_split "*movxo_nodm"
>>>    [(set (match_operand:XO 0 "nonimmediate_operand" "=d,ZwO,d")
>>>  	(match_operand:XO 1 "input_operand" "ZwO,d,d"))]
>>> -  "TARGET_MMA
>>> +  "TARGET_MMA && !TARGET_DENSE_MATH
>>>     && (gpc_reg_operand (operands[0], XOmode)
>>>         || gpc_reg_operand (operands[1], XOmode))"
>>>    "@
>>> @@ -366,6 +369,31 @@ (define_insn_and_split "*movxo"
>>>     (set_attr "length" "*,*,16")
>>>     (set_attr "max_prefixed_insns" "2,2,*")])
>>>  
>>> +(define_insn_and_split "*movxo_dm"
>>> +  [(set (match_operand:XO 0 "nonimmediate_operand" "=wa,QwO,wa,wD,wD,wa")
>>> +	(match_operand:XO 1 "input_operand"        "QwO,wa, wa,wa,wD,wD"))]
>>
>> Why not adopt ZwO rather than QwO?
> 
> You have to split the address into 2 addresses for loading or storing vector
> pairs (or 4 addresses for loading or storing vectors).  Z would allow
> register+register addresses, and you wouldn't be able to create the second 
> address by adding 128 to it.  Hence it uses 'Q' for register only and 'wo' for
> d-form addresses.

Thanks for clarifying.  But without this patch the define_insn_and_split *movxo
adopts "ZwO"; IMHO that would mean the current "*movxo" define_insn_and_split
has been problematic?  I thought adjust_address can ensure the new address is
still valid after adjusting by the 128 offset, could you double check?

> 
>>
>>> +  "TARGET_DENSE_MATH
>>> +   && (gpc_reg_operand (operands[0], XOmode)
>>> +       || gpc_reg_operand (operands[1], XOmode))"
>>> +  "@
>>> +   #
>>> +   #
>>> +   #
>>> +   dmxxinstdmr512 %0,%1,%Y1,0
>>> +   dmmr %0,%1
>>> +   dmxxextfdmr512 %0,%Y0,%1,0"
>>> +  "&& reload_completed
>>> +   && !dmr_operand (operands[0], XOmode)
>>> +   && !dmr_operand (operands[1], XOmode)"
>>> +  [(const_int 0)]
>>> +{
>>> +  rs6000_split_multireg_move (operands[0], operands[1]);
>>> +  DONE;
>>> +}
>>> +  [(set_attr "type" "vecload,vecstore,veclogical,mma,mma,mma")
>>> +   (set_attr "length" "*,*,16,*,*,*")
>>> +   (set_attr "max_prefixed_insns" "2,2,*,*,*,*")])
>>> +

...

>>> +;; Return 1 if op is a DMR register
>>> +(define_predicate "dmr_operand"
>>> +  (match_operand 0 "register_operand")
>>> +{
>>> +  if (!REG_P (op))
>>> +    return 0;
>>> +
>>> +  if (!HARD_REGISTER_P (op))
>>> +    return 1;
>>> +
>>> +  return DMR_REGNO_P (REGNO (op));
>>> +})
>>> +
>>> +;; Return 1 if op is an accumulator.  On power10 systems, the accumulators
>>> +;; overlap with the FPRs, while on systems with dense math, the accumulators
>>> +;; are separate dense math registers and do not overlap with the FPR
>>> +;; registers..
>>
>> Nit: an unexpected "."?
>>
>>> +(define_predicate "accumulator_operand"
>>> +  (match_operand 0 "register_operand")
>>> +{
>>
>> fpr_reg_operand checks for subreg as well, should we check for it here as well?
>>
>>>  #ifdef TARGET_REGNAMES
>>> @@ -1250,6 +1255,8 @@ static const char alt_reg_names[][8] =
>>>    "%cr0",  "%cr1", "%cr2", "%cr3", "%cr4", "%cr5", "%cr6", "%cr7",
>>>    /* vrsave vscr sfp */
>>>    "vrsave", "vscr", "sfp",
>>> +  /* DMRs */
>>> +  "%dmr0", "%dmr1", "%dmr2", "%dmr3", "%dmr4", "%dmr5", "%dmr6", "%dmr7",
>>
>> Should be without "r" here, as tested gas doesn't recognize %dmr0 but it does
>> recognize %dm0.

I guess some replies were missing on this part (and on some later ones)?  I just
want to make sure nothing was missed, and hope this helps.  :)

>>
>>>  };
>>>  #endif
>>>  
>>> @@ -1846,6 +1853,9 @@ rs6000_hard_regno_nregs_internal (int regno, machine_mode mode)
>>>    else if (ALTIVEC_REGNO_P (regno))
>>>      reg_size = UNITS_PER_ALTIVEC_WORD;
>>>  
>>> +  else if (DMR_REGNO_P (regno))
>>> +    reg_size = UNITS_PER_DMR_WORD;
>>> +
>>>    else
>>>      reg_size = UNITS_PER_WORD;
>>>  
>>> @@ -1867,9 +1877,36 @@ rs6000_hard_regno_mode_ok_uncached (int regno, machine_mode mode)
>>>    if (mode == OOmode)
>>>      return (TARGET_MMA && VSX_REGNO_P (regno) && (regno & 1) == 0);
>>>  
>>> -  /* MMA accumulator modes need FPR registers divisible by 4.  */
>>> +  /* On ISA 3.1 (power10), MMA accumulator modes need FPR registers divisible
>>> +     by 4.
>>> +
>>> +     If dense math is enabled, allow all VSX registers plus the DMR registers.
>>> +     We need to make sure we don't cross between the boundary of FPRs and
>>> +     traditional Altiviec registers.  */
>>>    if (mode == XOmode)
>>> -    return (TARGET_MMA && FP_REGNO_P (regno) && (regno & 3) == 0);
>>> +    {
>>> +      if (TARGET_MMA && !TARGET_DENSE_MATH)
>>> +	return (FP_REGNO_P (regno) && (regno & 3) == 0);
>>> +
>>> +      else if (TARGET_DENSE_MATH)
>>> +	{
>>> +	  if (DMR_REGNO_P (regno))
>>> +	    return 1;
>>> +
>>> +	  if (FP_REGNO_P (regno))
>>> +	    return ((regno & 1) == 0 && regno <= LAST_FPR_REGNO - 3);
>>> +
>>> +	  if (ALTIVEC_REGNO_P (regno))
>>> +	    return ((regno & 1) == 0 && regno <= LAST_ALTIVEC_REGNO - 3);
>>> +	}
>>
>> I could miss something, I didn't find which section of RFC indicates this
>> restriction, could you please point out for me?  Thanks!
>>
>>> +
>>> +      else
>>> +	return 0;
>>> +    }
>>> +
>>> +  /* No other types other than XOmode can go in DMRs.  */
>>> +  if (DMR_REGNO_P (regno))
>>> +    return 0;
>>>  
>>>    /* PTImode can only go in GPRs.  Quad word memory operations require even/odd
>>>       register combinations, and use PTImode where we need to deal with quad
>>> @@ -2312,6 +2349,7 @@ rs6000_debug_reg_global (void)
>>>    rs6000_debug_reg_print (FIRST_ALTIVEC_REGNO,
>>>  			  LAST_ALTIVEC_REGNO,
>>>  			  "vs");
>>> +  rs6000_debug_reg_print (FIRST_DMR_REGNO, LAST_DMR_REGNO, "dmr");
>>
>> Nit: Like above, use 'dm'.
>>
>>>    rs6000_debug_reg_print (LR_REGNO, LR_REGNO, "lr");
>>>    rs6000_debug_reg_print (CTR_REGNO, CTR_REGNO, "ctr");
>>>    rs6000_debug_reg_print (CR0_REGNO, CR7_REGNO, "cr");
>>> @@ -2332,6 +2370,7 @@ rs6000_debug_reg_global (void)
>>>  	   "wr reg_class = %s\n"
>>>  	   "wx reg_class = %s\n"
>>>  	   "wA reg_class = %s\n"
>>> +	   "wD reg_class = %s\n"
>>>  	   "\n",
>>>  	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_d]],
>>>  	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_v]],
>>> @@ -2339,7 +2378,8 @@ rs6000_debug_reg_global (void)
>>>  	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_we]],
>>>  	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wr]],
>>>  	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wx]],
>>> -	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wA]]);
>>> +	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wA]],
>>> +	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wD]]);
>>>
>>
>> snip ...
>>
>>> +/* Subroutine to determine the move cost of dense math registers.  If we are
>>> +   moving to/from VSX_REGISTER registers, the cost is either 1 move (for
>>> +   512-bit accumulators) or 2 moves (for 1,024-bit DMR registers).  If we are
>>> +   moving to anything else like GPR registers, make the cost very high.  */
>>> +
>>> +static int
>>> +rs6000_dmr_register_move_cost (machine_mode mode, reg_class_t rclass)
>>> +{
>>> +  const int reg_move_base = 2;
>>> +  HARD_REG_SET vsx_set = (reg_class_contents[rclass]
>>> +			  & reg_class_contents[VSX_REGS]);
>>> +
>>> +  if (TARGET_DENSE_MATH && !hard_reg_set_empty_p (vsx_set))
>>
>> Can we just use reg_classes_intersect_p (rclass, VSX_REGS)?
>>


BR,
Kewen
  
Michael Meissner Feb. 8, 2024, 12:26 a.m. UTC | #5
On Wed, Feb 07, 2024 at 05:38:46PM +0800, Kewen.Lin wrote:
> >>> -(define_insn_and_split "*movxo"
> >>> +(define_insn_and_split "*movxo_nodm"
> >>>    [(set (match_operand:XO 0 "nonimmediate_operand" "=d,ZwO,d")
> >>>  	(match_operand:XO 1 "input_operand" "ZwO,d,d"))]
> >>> -  "TARGET_MMA
> >>> +  "TARGET_MMA && !TARGET_DENSE_MATH
> >>>     && (gpc_reg_operand (operands[0], XOmode)
> >>>         || gpc_reg_operand (operands[1], XOmode))"
> >>>    "@
> >>> @@ -366,6 +369,31 @@ (define_insn_and_split "*movxo"
> >>>     (set_attr "length" "*,*,16")
> >>>     (set_attr "max_prefixed_insns" "2,2,*")])
> >>>  
> >>> +(define_insn_and_split "*movxo_dm"
> >>> +  [(set (match_operand:XO 0 "nonimmediate_operand" "=wa,QwO,wa,wD,wD,wa")
> >>> +	(match_operand:XO 1 "input_operand"        "QwO,wa, wa,wa,wD,wD"))]
> >>
> >> Why not adopt ZwO rather than QwO?
> > 
> > You have to split the address into 2 addresses for loading or storing vector
> > pairs (or 4 addresses for loading or storing vectors).  Z would allow
> > register+register addresses, and you wouldn't be able to create the second 
> > address by adding 128 to it.  Hence it uses 'Q' for register only and 'wO' for
> > d-form addresses.
> 
> Thanks for clarifying.  But without this patch the define_insn_and_split *movxo
> already uses "ZwO"; IMHO that would mean the current "*movxo" define_insn_and_split
> is itself problematic?  I thought adjust_address can ensure the new address is
> still valid after adding the 128 offset; could you double check?

Well, it is more of a theoretical bug.  Using 'Z' is wrong, as I said, because
after register allocation you can't split an x-form (register+register) address.
Using 'Q' would not allow reg+reg but would allow reg, which can be split
because the 2nd address will be a d-form (reg+offset).

But in practice, it won't be an issue since rs6000_setup_reg_addr_masks won't
allow reg+reg addresses for TDOmode.

> >>>  #ifdef TARGET_REGNAMES
> >>> @@ -1250,6 +1255,8 @@ static const char alt_reg_names[][8] =
> >>>    "%cr0",  "%cr1", "%cr2", "%cr3", "%cr4", "%cr5", "%cr6", "%cr7",
> >>>    /* vrsave vscr sfp */
> >>>    "vrsave", "vscr", "sfp",
> >>> +  /* DMRs */
> >>> +  "%dmr0", "%dmr1", "%dmr2", "%dmr3", "%dmr4", "%dmr5", "%dmr6", "%dmr7",
> >>
> > >> These should be without the "r": in my testing, gas doesn't recognize %dmr0
> > >> but it does recognize %dm0.
> 
> I guess some replies were missing on this part (and on some later ones)?  I just
> want to make sure nothing was missed, and hope this helps.  :)

I missed this on the first round of comments, but if gas doesn't like %dmr2 we
should use %dm2.

Thanks for catching this.
  

Patch

diff --git a/gcc/config/rs6000/constraints.md b/gcc/config/rs6000/constraints.md
index c99997bf82b..614e431c085 100644
--- a/gcc/config/rs6000/constraints.md
+++ b/gcc/config/rs6000/constraints.md
@@ -107,6 +107,9 @@  (define_constraint "wB"
        (match_test "TARGET_P8_VECTOR")
        (match_operand 0 "s5bit_cint_operand")))
 
+(define_register_constraint "wD" "rs6000_constraints[RS6000_CONSTRAINT_wD]"
+  "Accumulator register.")
+
 (define_constraint "wE"
   "@internal Vector constant that can be loaded with the XXSPLTIB instruction."
   (match_test "xxspltib_constant_nosplit (op, mode)"))
diff --git a/gcc/config/rs6000/mma.md b/gcc/config/rs6000/mma.md
index 6a7d8a836db..bb898919ab5 100644
--- a/gcc/config/rs6000/mma.md
+++ b/gcc/config/rs6000/mma.md
@@ -91,6 +91,7 @@  (define_c_enum "unspec"
    UNSPEC_MMA_XVI8GER4SPP
    UNSPEC_MMA_XXMFACC
    UNSPEC_MMA_XXMTACC
+   UNSPEC_DM_ASSEMBLE_ACC
   ])
 
 (define_c_enum "unspecv"
@@ -321,7 +322,9 @@  (define_insn_and_split "*movoo"
    (set_attr "length" "*,8,*,8,8")
    (set_attr "isa" "lxvp,*,stxvp,*,*")])
 
-;; Vector quad support.  XOmode can only live in FPRs.
+;; Vector quad support.  Under the original MMA, XOmode can only live in VSX
+;; vector registers 0..31.  With dense math, XOmode can live in either VSX
+;; registers (0..63) or DMR registers.
 (define_expand "movxo"
   [(set (match_operand:XO 0 "nonimmediate_operand")
 	(match_operand:XO 1 "input_operand"))]
@@ -346,10 +349,10 @@  (define_expand "movxo"
     gcc_assert (false);
 })
 
-(define_insn_and_split "*movxo"
+(define_insn_and_split "*movxo_nodm"
   [(set (match_operand:XO 0 "nonimmediate_operand" "=d,ZwO,d")
 	(match_operand:XO 1 "input_operand" "ZwO,d,d"))]
-  "TARGET_MMA
+  "TARGET_MMA && !TARGET_DENSE_MATH
    && (gpc_reg_operand (operands[0], XOmode)
        || gpc_reg_operand (operands[1], XOmode))"
   "@
@@ -366,6 +369,31 @@  (define_insn_and_split "*movxo"
    (set_attr "length" "*,*,16")
    (set_attr "max_prefixed_insns" "2,2,*")])
 
+(define_insn_and_split "*movxo_dm"
+  [(set (match_operand:XO 0 "nonimmediate_operand" "=wa,QwO,wa,wD,wD,wa")
+	(match_operand:XO 1 "input_operand"        "QwO,wa, wa,wa,wD,wD"))]
+  "TARGET_DENSE_MATH
+   && (gpc_reg_operand (operands[0], XOmode)
+       || gpc_reg_operand (operands[1], XOmode))"
+  "@
+   #
+   #
+   #
+   dmxxinstdmr512 %0,%1,%Y1,0
+   dmmr %0,%1
+   dmxxextfdmr512 %0,%Y0,%1,0"
+  "&& reload_completed
+   && !dmr_operand (operands[0], XOmode)
+   && !dmr_operand (operands[1], XOmode)"
+  [(const_int 0)]
+{
+  rs6000_split_multireg_move (operands[0], operands[1]);
+  DONE;
+}
+  [(set_attr "type" "vecload,vecstore,veclogical,mma,mma,mma")
+   (set_attr "length" "*,*,16,*,*,*")
+   (set_attr "max_prefixed_insns" "2,2,*,*,*,*")])
+
 (define_expand "vsx_assemble_pair"
   [(match_operand:OO 0 "vsx_register_operand")
    (match_operand:V16QI 1 "mma_assemble_input_operand")
@@ -433,25 +461,38 @@  (define_insn_and_split "*vsx_disassemble_pair"
 })
 
 (define_expand "mma_assemble_acc"
-  [(match_operand:XO 0 "fpr_reg_operand")
+  [(match_operand:XO 0 "register_operand")
    (match_operand:V16QI 1 "mma_assemble_input_operand")
    (match_operand:V16QI 2 "mma_assemble_input_operand")
    (match_operand:V16QI 3 "mma_assemble_input_operand")
    (match_operand:V16QI 4 "mma_assemble_input_operand")]
   "TARGET_MMA"
 {
-  rtx src = gen_rtx_UNSPEC_VOLATILE (XOmode,
-			    	     gen_rtvec (4, operands[1], operands[2],
-				       		operands[3], operands[4]),
-			    	     UNSPECV_MMA_ASSEMBLE);
-  emit_move_insn (operands[0], src);
+  rtx op0 = operands[0];
+  rtx op1 = operands[1];
+  rtx op2 = operands[2];
+  rtx op3 = operands[3];
+  rtx op4 = operands[4];
+
+  if (TARGET_DENSE_MATH)
+    {
+      rtx vpair1 = gen_reg_rtx (OOmode);
+      rtx vpair2 = gen_reg_rtx (OOmode);
+      emit_insn (gen_vsx_assemble_pair (vpair1, op1, op2));
+      emit_insn (gen_vsx_assemble_pair (vpair2, op3, op4));
+      emit_insn (gen_mma_assemble_acc_dm (op0, vpair1, vpair2));
+    }
+
+  else
+    emit_insn (gen_mma_assemble_acc_vsx (op0, op1, op2, op3, op4));
+
   DONE;
 })
 
 ;; We cannot update the four output registers atomically, so mark the output
-;; as an early clobber so we don't accidentally clobber the input operands.  */
+;; as an early clobber so we don't accidentally clobber the input operands.
 
-(define_insn_and_split "*mma_assemble_acc"
+(define_insn_and_split "mma_assemble_acc_vsx"
   [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
 	(unspec_volatile:XO
 	  [(match_operand:V16QI 1 "mma_assemble_input_operand" "mwa")
@@ -459,7 +500,7 @@  (define_insn_and_split "*mma_assemble_acc"
 	   (match_operand:V16QI 3 "mma_assemble_input_operand" "mwa")
 	   (match_operand:V16QI 4 "mma_assemble_input_operand" "mwa")]
 	  UNSPECV_MMA_ASSEMBLE))]
-  "TARGET_MMA
+  "TARGET_MMA && !TARGET_DENSE_MATH
    && fpr_reg_operand (operands[0], XOmode)"
   "#"
   "&& reload_completed"
@@ -473,28 +514,31 @@  (define_insn_and_split "*mma_assemble_acc"
   DONE;
 })
 
+;; On a system with dense math, we build the accumulators from two vector
+;; pairs.
+
+(define_insn "mma_assemble_acc_dm"
+ [(set (match_operand:XO 0 "dmr_operand" "=wD")
+       (unspec:XO [(match_operand:OO 1 "vsx_register_operand" "wa")
+		   (match_operand:OO 2 "vsx_register_operand" "wa")]
+		  UNSPEC_DM_ASSEMBLE_ACC))]
+ "TARGET_MMA && TARGET_DENSE_MATH"
+ "dmxxinstdmr512 %0,%1,%2,0"
+ [(set_attr "type" "mma")])
+
 (define_expand "mma_disassemble_acc"
-  [(match_operand:V16QI 0 "mma_disassemble_output_operand")
-   (match_operand:XO 1 "fpr_reg_operand")
-   (match_operand 2 "const_0_to_3_operand")]
-  "TARGET_MMA"
-{
-  rtx src;
-  int regoff = INTVAL (operands[2]);
-  src = gen_rtx_UNSPEC (V16QImode,
-			gen_rtvec (2, operands[1], GEN_INT (regoff)),
-			UNSPEC_MMA_EXTRACT);
-  emit_move_insn (operands[0], src);
-  DONE;
-})
+  [(set (match_operand:V16QI 0 "register_operand")
+	(unspec:V16QI [(match_operand:XO 1 "register_operand")
+		       (match_operand 2 "const_0_to_3_operand")]
+		      UNSPEC_MMA_EXTRACT))]
+  "TARGET_MMA")
 
-(define_insn_and_split "*mma_disassemble_acc"
+(define_insn_and_split "*mma_disassemble_acc_vsx"
   [(set (match_operand:V16QI 0 "mma_disassemble_output_operand" "=mwa")
-       (unspec:V16QI [(match_operand:XO 1 "fpr_reg_operand" "d")
-		      (match_operand 2 "const_0_to_3_operand")]
+	(unspec:V16QI [(match_operand:XO 1 "fpr_reg_operand" "d")
+		       (match_operand 2 "const_0_to_3_operand")]
 		      UNSPEC_MMA_EXTRACT))]
-  "TARGET_MMA
-   && fpr_reg_operand (operands[1], XOmode)"
+  "TARGET_MMA"
   "#"
   "&& reload_completed"
   [(const_int 0)]
@@ -506,9 +550,14 @@  (define_insn_and_split "*mma_disassemble_acc"
   DONE;
 })
 
-;; MMA instructions that do not use their accumulators as an input, still
-;; must not allow their vector operands to overlap the registers used by
-;; the accumulator.  We enforce this by marking the output as early clobber.
+(define_insn "*mma_disassemble_acc_dm"
+  [(set (match_operand:V16QI 0 "vsx_register_operand" "=wa")
+	(unspec:V16QI [(match_operand:XO 1 "dmr_operand" "wD")
+		       (match_operand 2 "const_0_to_3_operand")]
+		      UNSPEC_MMA_EXTRACT))]
+  "TARGET_DENSE_MATH"
+  "dmxxextfdmr256 %0,%1,2"
+  [(set_attr "type" "mma")])
 
 (define_insn "mma_<acc>"
   [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
diff --git a/gcc/config/rs6000/predicates.md b/gcc/config/rs6000/predicates.md
index d23ce9a77a3..3040dcd50a3 100644
--- a/gcc/config/rs6000/predicates.md
+++ b/gcc/config/rs6000/predicates.md
@@ -186,6 +186,38 @@  (define_predicate "vlogical_operand"
   return VLOGICAL_REGNO_P (REGNO (op));
 })
 
+;; Return 1 if op is a DMR register
+(define_predicate "dmr_operand"
+  (match_operand 0 "register_operand")
+{
+  if (!REG_P (op))
+    return 0;
+
+  if (!HARD_REGISTER_P (op))
+    return 1;
+
+  return DMR_REGNO_P (REGNO (op));
+})
+
+;; Return 1 if op is an accumulator.  On power10 systems, the accumulators
+;; overlap with the FPRs, while on systems with dense math, the accumulators
+;; are separate dense math registers and do not overlap with the FPR
+;; registers.
+(define_predicate "accumulator_operand"
+  (match_operand 0 "register_operand")
+{
+  if (!REG_P (op))
+    return 0;
+
+  if (!HARD_REGISTER_P (op))
+    return 1;
+
+  int r = REGNO (op);
+  return (TARGET_DENSE_MATH
+	  ? DMR_REGNO_P (r)
+	  : FP_REGNO_P (r) && (r & 3) == 0);
+})
+
 ;; Return 1 if op is the carry register.
 (define_predicate "ca_operand"
   (match_operand 0 "register_operand")
diff --git a/gcc/config/rs6000/rs6000-cpus.def b/gcc/config/rs6000/rs6000-cpus.def
index b6cd6d8cc84..4621b97b522 100644
--- a/gcc/config/rs6000/rs6000-cpus.def
+++ b/gcc/config/rs6000/rs6000-cpus.def
@@ -91,6 +91,7 @@ 
 /* Flags for a potential future processor that may or may not be delivered.  */
 #define ISA_FUTURE_MASKS	(ISA_3_1_MASKS_SERVER			\
 				 | OPTION_MASK_BLOCK_OPS_VECTOR_PAIR	\
+				 | OPTION_MASK_DENSE_MATH		\
 				 | OPTION_MASK_FUTURE)
 
 /* Flags that need to be turned off if -mno-power9-vector.  */
@@ -134,6 +135,7 @@ 
 				 | OPTION_MASK_DFP			\
 				 | OPTION_MASK_DIRECT_MOVE		\
 				 | OPTION_MASK_DLMZB			\
+				 | OPTION_MASK_DENSE_MATH		\
 				 | OPTION_MASK_EFFICIENT_UNALIGNED_VSX	\
 				 | OPTION_MASK_FLOAT128_HW		\
 				 | OPTION_MASK_FLOAT128_KEYWORD		\
diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
index bc509399cf6..83e32f7a43a 100644
--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -290,7 +290,8 @@  enum rs6000_reg_type {
   ALTIVEC_REG_TYPE,
   FPR_REG_TYPE,
   SPR_REG_TYPE,
-  CR_REG_TYPE
+  CR_REG_TYPE,
+  DMR_REG_TYPE
 };
 
 /* Map register class to register type.  */
@@ -304,22 +305,23 @@  static enum rs6000_reg_type reg_class_to_reg_type[N_REG_CLASSES];
 
 
 /* Register classes we care about in secondary reload or go if legitimate
-   address.  We only need to worry about GPR, FPR, and Altivec registers here,
-   along an ANY field that is the OR of the 3 register classes.  */
+   address.  We only need to worry about GPR, FPR, Altivec, and DMR registers
+   here, along an ANY field that is the OR of the 4 register classes.  */
 
 enum rs6000_reload_reg_type {
   RELOAD_REG_GPR,			/* General purpose registers.  */
   RELOAD_REG_FPR,			/* Traditional floating point regs.  */
   RELOAD_REG_VMX,			/* Altivec (VMX) registers.  */
-  RELOAD_REG_ANY,			/* OR of GPR, FPR, Altivec masks.  */
+  RELOAD_REG_DMR,			/* DMR registers.  */
+  RELOAD_REG_ANY,			/* OR of GPR/FPR/VMX/DMR masks.  */
   N_RELOAD_REG
 };
 
-/* For setting up register classes, loop through the 3 register classes mapping
+/* For setting up register classes, loop through the 4 register classes mapping
    into real registers, and skip the ANY class, which is just an OR of the
    bits.  */
 #define FIRST_RELOAD_REG_CLASS	RELOAD_REG_GPR
-#define LAST_RELOAD_REG_CLASS	RELOAD_REG_VMX
+#define LAST_RELOAD_REG_CLASS	RELOAD_REG_DMR
 
 /* Map reload register type to a register in the register class.  */
 struct reload_reg_map_type {
@@ -331,6 +333,7 @@  static const struct reload_reg_map_type reload_reg_map[N_RELOAD_REG] = {
   { "Gpr",	FIRST_GPR_REGNO },	/* RELOAD_REG_GPR.  */
   { "Fpr",	FIRST_FPR_REGNO },	/* RELOAD_REG_FPR.  */
   { "VMX",	FIRST_ALTIVEC_REGNO },	/* RELOAD_REG_VMX.  */
+  { "DMR",	FIRST_DMR_REGNO },	/* RELOAD_REG_DMR.  */
   { "Any",	-1 },			/* RELOAD_REG_ANY.  */
 };
 
@@ -1224,6 +1227,8 @@  char rs6000_reg_names[][8] =
       "0",  "1",  "2",  "3",  "4",  "5",  "6",  "7",
   /* vrsave vscr sfp */
       "vrsave", "vscr", "sfp",
+  /* DMRs */
+      "0", "1", "2", "3", "4", "5", "6", "7",
 };
 
 #ifdef TARGET_REGNAMES
@@ -1250,6 +1255,8 @@  static const char alt_reg_names[][8] =
   "%cr0",  "%cr1", "%cr2", "%cr3", "%cr4", "%cr5", "%cr6", "%cr7",
   /* vrsave vscr sfp */
   "vrsave", "vscr", "sfp",
+  /* DMRs */
+  "%dmr0", "%dmr1", "%dmr2", "%dmr3", "%dmr4", "%dmr5", "%dmr6", "%dmr7",
 };
 #endif
 
@@ -1846,6 +1853,9 @@  rs6000_hard_regno_nregs_internal (int regno, machine_mode mode)
   else if (ALTIVEC_REGNO_P (regno))
     reg_size = UNITS_PER_ALTIVEC_WORD;
 
+  else if (DMR_REGNO_P (regno))
+    reg_size = UNITS_PER_DMR_WORD;
+
   else
     reg_size = UNITS_PER_WORD;
 
@@ -1867,9 +1877,36 @@  rs6000_hard_regno_mode_ok_uncached (int regno, machine_mode mode)
   if (mode == OOmode)
     return (TARGET_MMA && VSX_REGNO_P (regno) && (regno & 1) == 0);
 
-  /* MMA accumulator modes need FPR registers divisible by 4.  */
+  /* On ISA 3.1 (power10), MMA accumulator modes need FPR registers divisible
+     by 4.
+
+     If dense math is enabled, allow all VSX registers plus the DMR registers.
+     We need to make sure we don't cross between the boundary of FPRs and
+     traditional Altivec registers.  */
   if (mode == XOmode)
-    return (TARGET_MMA && FP_REGNO_P (regno) && (regno & 3) == 0);
+    {
+      if (TARGET_MMA && !TARGET_DENSE_MATH)
+	return (FP_REGNO_P (regno) && (regno & 3) == 0);
+
+      else if (TARGET_DENSE_MATH)
+	{
+	  if (DMR_REGNO_P (regno))
+	    return 1;
+
+	  if (FP_REGNO_P (regno))
+	    return ((regno & 1) == 0 && regno <= LAST_FPR_REGNO - 3);
+
+	  if (ALTIVEC_REGNO_P (regno))
+	    return ((regno & 1) == 0 && regno <= LAST_ALTIVEC_REGNO - 3);
+	}
+
+      else
+	return 0;
+    }
+
+  /* No other types other than XOmode can go in DMRs.  */
+  if (DMR_REGNO_P (regno))
+    return 0;
 
   /* PTImode can only go in GPRs.  Quad word memory operations require even/odd
      register combinations, and use PTImode where we need to deal with quad
@@ -2312,6 +2349,7 @@  rs6000_debug_reg_global (void)
   rs6000_debug_reg_print (FIRST_ALTIVEC_REGNO,
 			  LAST_ALTIVEC_REGNO,
 			  "vs");
+  rs6000_debug_reg_print (FIRST_DMR_REGNO, LAST_DMR_REGNO, "dmr");
   rs6000_debug_reg_print (LR_REGNO, LR_REGNO, "lr");
   rs6000_debug_reg_print (CTR_REGNO, CTR_REGNO, "ctr");
   rs6000_debug_reg_print (CR0_REGNO, CR7_REGNO, "cr");
@@ -2332,6 +2370,7 @@  rs6000_debug_reg_global (void)
 	   "wr reg_class = %s\n"
 	   "wx reg_class = %s\n"
 	   "wA reg_class = %s\n"
+	   "wD reg_class = %s\n"
 	   "\n",
 	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_d]],
 	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_v]],
@@ -2339,7 +2378,8 @@  rs6000_debug_reg_global (void)
 	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_we]],
 	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wr]],
 	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wx]],
-	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wA]]);
+	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wA]],
+	   reg_class_names[rs6000_constraints[RS6000_CONSTRAINT_wD]]);
 
   nl = "\n";
   for (m = 0; m < NUM_MACHINE_MODES; ++m)
@@ -2636,6 +2676,21 @@  rs6000_setup_reg_addr_masks (void)
 	  addr_mask = 0;
 	  reg = reload_reg_map[rc].reg;
 
+	  /* Special case DMR registers.  */
+	  if (rc == RELOAD_REG_DMR)
+	    {
+	      if (TARGET_DENSE_MATH && m2 == XOmode)
+		{
+		  addr_mask = RELOAD_REG_VALID;
+		  reg_addr[m].addr_mask[rc] = addr_mask;
+		  any_addr_mask |= addr_mask;
+		}
+	      else
+		reg_addr[m].addr_mask[rc] = 0;
+
+	      continue;
+	    }
+
 	  /* Can mode values go in the GPR/FPR/Altivec registers?  */
 	  if (reg >= 0 && rs6000_hard_regno_mode_ok_p[m][reg])
 	    {
@@ -2790,6 +2845,9 @@  rs6000_init_hard_regno_mode_ok (bool global_init_p)
   for (r = CR1_REGNO; r <= CR7_REGNO; ++r)
     rs6000_regno_regclass[r] = CR_REGS;
 
+  for (r = FIRST_DMR_REGNO; r <= LAST_DMR_REGNO; ++r)
+    rs6000_regno_regclass[r] = DM_REGS;
+
   rs6000_regno_regclass[LR_REGNO] = LINK_REGS;
   rs6000_regno_regclass[CTR_REGNO] = CTR_REGS;
   rs6000_regno_regclass[CA_REGNO] = NO_REGS;
@@ -2814,6 +2872,7 @@  rs6000_init_hard_regno_mode_ok (bool global_init_p)
   reg_class_to_reg_type[(int)LINK_OR_CTR_REGS] = SPR_REG_TYPE;
   reg_class_to_reg_type[(int)CR_REGS] = CR_REG_TYPE;
   reg_class_to_reg_type[(int)CR0_REGS] = CR_REG_TYPE;
+  reg_class_to_reg_type[(int)DM_REGS] = DMR_REG_TYPE;
 
   if (TARGET_VSX)
     {
@@ -3000,6 +3059,13 @@  rs6000_init_hard_regno_mode_ok (bool global_init_p)
   if (TARGET_DIRECT_MOVE_128)
     rs6000_constraints[RS6000_CONSTRAINT_we] = VSX_REGS;
 
+  /* Support for the accumulator registers, either FPR registers (aka original
+     mma) or DMR registers (dense math).  */
+  if (TARGET_DENSE_MATH)
+    rs6000_constraints[RS6000_CONSTRAINT_wD] = DM_REGS;
+  else if (TARGET_MMA)
+    rs6000_constraints[RS6000_CONSTRAINT_wD] = FLOAT_REGS;
+
   /* Set up the reload helper and direct move functions.  */
   if (TARGET_VSX || TARGET_ALTIVEC)
     {
@@ -4496,6 +4562,14 @@  rs6000_option_override_internal (bool global_init_p)
   if (!TARGET_PCREL && TARGET_PCREL_OPT)
     rs6000_isa_flags &= ~OPTION_MASK_PCREL_OPT;
 
+  /* Dense math requires MMA.  */
+  if (TARGET_DENSE_MATH && !TARGET_MMA)
+    {
+      if ((rs6000_isa_flags_explicit & OPTION_MASK_DENSE_MATH) != 0)
+	error ("%qs requires %qs", "-mdense-math", "-mmma");
+      rs6000_isa_flags &= ~OPTION_MASK_DENSE_MATH;
+    }
+
   if (TARGET_DEBUG_REG || TARGET_DEBUG_TARGET)
     rs6000_print_isa_options (stderr, 0, "after subtarget", rs6000_isa_flags);
 
@@ -12408,6 +12482,11 @@  rs6000_secondary_reload_memory (rtx addr,
     addr_mask = (reg_addr[mode].addr_mask[RELOAD_REG_VMX]
 		 & ~RELOAD_REG_AND_M16);
 
+  /* DMR registers use VSX registers, and need to generate some extra
+     instructions.  */
+  else if (rclass == DM_REGS)
+    return 2;
+
   /* If the register allocator hasn't made up its mind yet on the register
      class to use, settle on defaults to use.  */
   else if (rclass == NO_REGS)
@@ -12736,6 +12815,13 @@  rs6000_secondary_reload_simple_move (enum rs6000_reg_type to_type,
 	       || (to_type == SPR_REG_TYPE && from_type == GPR_REG_TYPE)))
     return true;
 
+  /* We can transfer between VSX registers and DMR registers without needing
+     extra registers.  */
+  if (TARGET_DENSE_MATH && mode == XOmode
+      && ((to_type == DMR_REG_TYPE && from_type == VSX_REG_TYPE)
+	  || (to_type == VSX_REG_TYPE && from_type == DMR_REG_TYPE)))
+    return true;
+
   return false;
 }
 
@@ -13430,6 +13516,10 @@  rs6000_preferred_reload_class (rtx x, enum reg_class rclass)
   machine_mode mode = GET_MODE (x);
   bool is_constant = CONSTANT_P (x);
 
+  /* DMR registers can't be loaded or stored.  */
+  if (rclass == DM_REGS)
+    return NO_REGS;
+
   /* If a mode can't go in FPR/ALTIVEC/VSX registers, don't return a preferred
      reload class for it.  */
   if ((rclass == ALTIVEC_REGS || rclass == VSX_REGS)
@@ -13526,7 +13616,7 @@  rs6000_preferred_reload_class (rtx x, enum reg_class rclass)
 	return VSX_REGS;
 
       if (mode == XOmode)
-	return FLOAT_REGS;
+	return TARGET_DENSE_MATH ? VSX_REGS : FLOAT_REGS;
 
       if (GET_MODE_CLASS (mode) == MODE_INT)
 	return GENERAL_REGS;
@@ -13651,6 +13741,11 @@  rs6000_secondary_reload_class (enum reg_class rclass, machine_mode mode,
   else
     regno = -1;
 
+  /* DMR registers don't have loads or stores.  We have to go through the VSX
+     registers to load XOmode (vector quad).  */
+  if (TARGET_DENSE_MATH && rclass == DM_REGS)
+    return VSX_REGS;
+
   /* If we have VSX register moves, prefer moving scalar values between
      Altivec registers and GPR by going via an FPR (and then via memory)
      instead of reloading the secondary memory address for Altivec moves.  */
@@ -14164,8 +14259,14 @@  print_operand (FILE *file, rtx x, int code)
 	 output_operand.  */
 
     case 'A':
-      /* Write the MMA accumulator number associated with VSX register X.  */
-      if (!REG_P (x) || !FP_REGNO_P (REGNO (x)) || (REGNO (x) % 4) != 0)
+      /* Write the MMA accumulator number associated with VSX register X.  On
+	 dense math systems, only allow DMR accumulators, not accumulators
+	 overlapping with the FPR registers.  */
+      if (!REG_P (x))
+	output_operand_lossage ("invalid %%A value");
+      else if (TARGET_DENSE_MATH && DMR_REGNO_P (REGNO (x)))
+	fprintf (file, "%d", REGNO (x) - FIRST_DMR_REGNO);
+      else if (!FP_REGNO_P (REGNO (x)) || (REGNO (x) % 4) != 0)
 	output_operand_lossage ("invalid %%A value");
       else
 	fprintf (file, "%d", (REGNO (x) - FIRST_FPR_REGNO) / 4);
@@ -22830,6 +22931,31 @@  rs6000_debug_address_cost (rtx x, machine_mode mode,
 }
 
 
+/* Subroutine to determine the move cost of dense math registers.  If we are
+   moving to/from VSX_REGISTER registers, the cost is either 1 move (for
+   512-bit accumulators) or 2 moves (for 1,024-bit DMR registers).  If we are
+   moving to anything else like GPR registers, make the cost very high.  */
+
+static int
+rs6000_dmr_register_move_cost (machine_mode mode, reg_class_t rclass)
+{
+  const int reg_move_base = 2;
+  HARD_REG_SET vsx_set = (reg_class_contents[rclass]
+			  & reg_class_contents[VSX_REGS]);
+
+  if (TARGET_DENSE_MATH && !hard_reg_set_empty_p (vsx_set))
+    {
+      /* __vector_quad (i.e. XOmode) is transferred in 1 instruction.  */
+      if (mode == XOmode)
+	return reg_move_base;
+
+      else
+	return reg_move_base * 2 * hard_regno_nregs (FIRST_DMR_REGNO, mode);
+    }
+
+  return 1000 * 2 * hard_regno_nregs (FIRST_DMR_REGNO, mode);
+}
+
 /* A C expression returning the cost of moving data from a register of class
    CLASS1 to one of CLASS2.  */
 
@@ -22843,17 +22969,28 @@  rs6000_register_move_cost (machine_mode mode,
   if (TARGET_DEBUG_COST)
     dbg_cost_ctrl++;
 
+  HARD_REG_SET to_vsx, from_vsx;
+  to_vsx = reg_class_contents[to] & reg_class_contents[VSX_REGS];
+  from_vsx = reg_class_contents[from] & reg_class_contents[VSX_REGS];
+
+  /* Special case DMR registers, that can only move to/from VSX registers.  */
+  if (from == DM_REGS && to == DM_REGS)
+    ret = 2 * hard_regno_nregs (FIRST_DMR_REGNO, mode);
+
+  else if (from == DM_REGS)
+    ret = rs6000_dmr_register_move_cost (mode, to);
+
+  else if (to == DM_REGS)
+    ret = rs6000_dmr_register_move_cost (mode, from);
+
   /* If we have VSX, we can easily move between FPR or Altivec registers,
      otherwise we can only easily move within classes.
      Do this first so we give best-case answers for union classes
      containing both gprs and vsx regs.  */
-  HARD_REG_SET to_vsx, from_vsx;
-  to_vsx = reg_class_contents[to] & reg_class_contents[VSX_REGS];
-  from_vsx = reg_class_contents[from] & reg_class_contents[VSX_REGS];
-  if (!hard_reg_set_empty_p (to_vsx)
-      && !hard_reg_set_empty_p (from_vsx)
-      && (TARGET_VSX
-	  || hard_reg_set_intersect_p (to_vsx, from_vsx)))
+  else if (!hard_reg_set_empty_p (to_vsx)
+	   && !hard_reg_set_empty_p (from_vsx)
+	   && (TARGET_VSX
+	       || hard_reg_set_intersect_p (to_vsx, from_vsx)))
     {
       int reg = FIRST_FPR_REGNO;
       if (TARGET_VSX
@@ -22948,6 +23085,9 @@  rs6000_memory_move_cost (machine_mode mode, reg_class_t rclass,
     ret = 4 * hard_regno_nregs (32, mode);
   else if (reg_classes_intersect_p (rclass, ALTIVEC_REGS))
     ret = 4 * hard_regno_nregs (FIRST_ALTIVEC_REGNO, mode);
+  else if (reg_classes_intersect_p (rclass, DM_REGS))
+    ret = (rs6000_dmr_register_move_cost (mode, VSX_REGS)
+	   + rs6000_memory_move_cost (mode, VSX_REGS, false));
   else
     ret = 4 + rs6000_register_move_cost (mode, rclass, GENERAL_REGS);
 
@@ -24156,6 +24296,8 @@  rs6000_compute_pressure_classes (enum reg_class *pressure_classes)
       if (TARGET_HARD_FLOAT)
 	pressure_classes[n++] = FLOAT_REGS;
     }
+  if (TARGET_DENSE_MATH)
+    pressure_classes[n++] = DM_REGS;
   pressure_classes[n++] = CR_REGS;
   pressure_classes[n++] = SPECIAL_REGS;
 
@@ -24320,6 +24462,10 @@  rs6000_debugger_regno (unsigned int regno, unsigned int format)
     return 67;
   if (regno == 64)
     return 64;
+  /* XXX: This is a guess.  The GCC register number for FIRST_DMR_REGNO is 111,
+     but the frame pointer regnum uses that.  */
+  if (DMR_REGNO_P (regno))
+    return regno - FIRST_DMR_REGNO + 112;
 
   gcc_unreachable ();
 }
@@ -24531,6 +24677,7 @@  static struct rs6000_opt_mask const rs6000_opt_masks[] =
   { "crypto",			OPTION_MASK_CRYPTO,		false, true  },
   { "direct-move",		OPTION_MASK_DIRECT_MOVE,	false, true  },
   { "dlmzb",			OPTION_MASK_DLMZB,		false, true  },
+  { "dense-math",		OPTION_MASK_DENSE_MATH,		false, true  },
   { "efficient-unaligned-vsx",	OPTION_MASK_EFFICIENT_UNALIGNED_VSX,
 								false, true  },
   { "float128",			OPTION_MASK_FLOAT128_KEYWORD,	false, true  },
@@ -27620,7 +27767,9 @@  rs6000_split_multireg_move (rtx dst, rtx src)
 		      || XINT (src, 1) == UNSPECV_MMA_ASSEMBLE);
 	  gcc_assert (REG_P (dst));
 	  if (GET_MODE (src) == XOmode)
-	    gcc_assert (FP_REGNO_P (REGNO (dst)));
+	    gcc_assert ((TARGET_DENSE_MATH
+			 ? VSX_REGNO_P (REGNO (dst))
+			 : FP_REGNO_P (REGNO (dst))));
 	  if (GET_MODE (src) == OOmode)
 	    gcc_assert (VSX_REGNO_P (REGNO (dst)));
 
diff --git a/gcc/config/rs6000/rs6000.h b/gcc/config/rs6000/rs6000.h
index 43209f9a6e7..22efac4a80c 100644
--- a/gcc/config/rs6000/rs6000.h
+++ b/gcc/config/rs6000/rs6000.h
@@ -660,6 +660,7 @@  extern unsigned char rs6000_recip_bits[];
 #define UNITS_PER_FP_WORD 8
 #define UNITS_PER_ALTIVEC_WORD 16
 #define UNITS_PER_VSX_WORD 16
+#define UNITS_PER_DMR_WORD 128
 
 /* Type used for ptrdiff_t, as a string used in a declaration.  */
 #define PTRDIFF_TYPE "int"
@@ -787,7 +788,7 @@  enum data_align { align_abi, align_opt, align_both };
    Another pseudo (not included in DWARF_FRAME_REGISTERS) is soft frame
    pointer, which is eventually eliminated in favor of SP or FP.  */
 
-#define FIRST_PSEUDO_REGISTER 111
+#define FIRST_PSEUDO_REGISTER 119
 
 /* Use standard DWARF numbering for DWARF debugging information.  */
 #define DEBUGGER_REGNO(REGNO) rs6000_debugger_regno ((REGNO), 0)
@@ -824,7 +825,9 @@  enum data_align { align_abi, align_opt, align_both };
    /* cr0..cr7 */				   \
    0, 0, 0, 0, 0, 0, 0, 0,			   \
    /* vrsave vscr sfp */			   \
-   1, 1, 1					   \
+   1, 1, 1,					   \
+   /* DMR registers.  */			   \
+   0, 0, 0, 0, 0, 0, 0, 0			   \
 }
 
 /* Like `CALL_USED_REGISTERS' except this macro doesn't require that
@@ -848,7 +851,9 @@  enum data_align { align_abi, align_opt, align_both };
    /* cr0..cr7 */				   \
    1, 1, 0, 0, 0, 1, 1, 1,			   \
    /* vrsave vscr sfp */			   \
-   0, 0, 0					   \
+   0, 0, 0,					   \
+   /* DMR registers.  */			   \
+   0, 0, 0, 0, 0, 0, 0, 0			   \
 }
 
 #define TOTAL_ALTIVEC_REGS	(LAST_ALTIVEC_REGNO - FIRST_ALTIVEC_REGNO + 1)
@@ -885,6 +890,7 @@  enum data_align { align_abi, align_opt, align_both };
 	v2		(not saved; incoming vector arg reg; return value)
 	v19 - v14	(not saved or used for anything)
 	v31 - v20	(saved; order given to save least number)
+	dmr0 - dmr7	(not saved)
 	vrsave, vscr	(fixed)
 	sfp		(fixed)
 */
@@ -927,6 +933,9 @@  enum data_align { align_abi, align_opt, align_both };
    66,								\
    83, 82, 81, 80, 79, 78,					\
    95, 94, 93, 92, 91, 90, 89, 88, 87, 86, 85, 84,		\
+   /* DMR registers.  */					\
+   111, 112, 113, 114, 115, 116, 117, 118,			\
+   /* Vrsave, vscr, sfp.  */					\
    108, 109,							\
    110								\
 }
@@ -953,6 +962,9 @@  enum data_align { align_abi, align_opt, align_both };
 /* True if register is a VSX register.  */
 #define VSX_REGNO_P(N) (FP_REGNO_P (N) || ALTIVEC_REGNO_P (N))
 
+/* True if register is a DMR register.  */
+#define DMR_REGNO_P(N) ((N) >= FIRST_DMR_REGNO && (N) <= LAST_DMR_REGNO)
+
 /* Alternate name for any vector register supporting floating point, no matter
    which instruction set(s) are available.  */
 #define VFLOAT_REGNO_P(N) \
@@ -1088,6 +1100,7 @@  enum reg_class
   FLOAT_REGS,
   ALTIVEC_REGS,
   VSX_REGS,
+  DM_REGS,
   VRSAVE_REGS,
   VSCR_REGS,
   GEN_OR_FLOAT_REGS,
@@ -1117,6 +1130,7 @@  enum reg_class
   "FLOAT_REGS",								\
   "ALTIVEC_REGS",							\
   "VSX_REGS",								\
+  "DM_REGS",								\
   "VRSAVE_REGS",							\
   "VSCR_REGS",								\
   "GEN_OR_FLOAT_REGS",							\
@@ -1151,6 +1165,8 @@  enum reg_class
   { 0x00000000, 0x00000000, 0xffffffff, 0x00000000 },			\
   /* VSX_REGS.  */							\
   { 0x00000000, 0xffffffff, 0xffffffff, 0x00000000 },			\
+  /* DM_REGS.  */							\
+  { 0x00000000, 0x00000000, 0x00000000, 0x007f8000 },			\
   /* VRSAVE_REGS.  */							\
   { 0x00000000, 0x00000000, 0x00000000, 0x00001000 },			\
   /* VSCR_REGS.  */							\
@@ -1178,7 +1194,7 @@  enum reg_class
   /* CA_REGS.  */							\
   { 0x00000000, 0x00000000, 0x00000000, 0x00000004 },			\
   /* ALL_REGS.  */							\
-  { 0xffffffff, 0xffffffff, 0xffffffff, 0x00007fff }			\
+  { 0xffffffff, 0xffffffff, 0xffffffff, 0x007fffff }			\
 }
 
 /* The same information, inverted:
@@ -1202,6 +1218,7 @@  enum r6000_reg_class_enum {
   RS6000_CONSTRAINT_wr,		/* GPR register if 64-bit  */
   RS6000_CONSTRAINT_wx,		/* FPR register for STFIWX */
   RS6000_CONSTRAINT_wA,		/* BASE_REGS if 64-bit.  */
+  RS6000_CONSTRAINT_wD,		/* Accumulator regs if MMA/Dense Math.  */
   RS6000_CONSTRAINT_MAX
 };
 
@@ -2078,7 +2095,16 @@  extern char rs6000_reg_names[][8];	/* register names (0 vs. %r0).  */
   &rs6000_reg_names[108][0],	/* vrsave  */				\
   &rs6000_reg_names[109][0],	/* vscr  */				\
 									\
-  &rs6000_reg_names[110][0]	/* sfp  */				\
+  &rs6000_reg_names[110][0],	/* sfp  */				\
+									\
+  &rs6000_reg_names[111][0],	/* dmr0  */				\
+  &rs6000_reg_names[112][0],	/* dmr1  */				\
+  &rs6000_reg_names[113][0],	/* dmr2  */				\
+  &rs6000_reg_names[114][0],	/* dmr3  */				\
+  &rs6000_reg_names[115][0],	/* dmr4  */				\
+  &rs6000_reg_names[116][0],	/* dmr5  */				\
+  &rs6000_reg_names[117][0],	/* dmr6  */				\
+  &rs6000_reg_names[118][0],	/* dmr7  */				\
 }
 
 /* Table of additional register names to use in user input.  */
@@ -2132,6 +2158,8 @@  extern char rs6000_reg_names[][8];	/* register names (0 vs. %r0).  */
   {"vs52", 84}, {"vs53", 85}, {"vs54", 86}, {"vs55", 87},	\
   {"vs56", 88}, {"vs57", 89}, {"vs58", 90}, {"vs59", 91},	\
   {"vs60", 92}, {"vs61", 93}, {"vs62", 94}, {"vs63", 95},	\
+  {"dmr0", 111}, {"dmr1", 112}, {"dmr2", 113}, {"dmr3", 114},	\
+  {"dmr4", 115}, {"dmr5", 116}, {"dmr6", 117}, {"dmr7", 118},	\
 }
 
 /* This is how to output an element of a case-vector that is relative.  */
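Note on the rs6000.h changes above (not part of the patch): the DM_REGS and ALL_REGS masks can be cross-checked against the new register numbers 111..118. The following is a minimal sketch, assuming the same layout the existing tables use, where register N occupies bit N % 32 of word N / 32. The helper names set_range, NUM_WORDS and check_dmr_masks are hypothetical and exist only for this illustration; FIRST_DMR_REGNO, LAST_DMR_REGNO and DMR_REGNO_P are copied from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Values taken from the patch (rs6000.md and rs6000.h).  */
#define FIRST_DMR_REGNO 111
#define LAST_DMR_REGNO  118
#define DMR_REGNO_P(N) ((N) >= FIRST_DMR_REGNO && (N) <= LAST_DMR_REGNO)

/* Hypothetical: number of 32-bit words per register-class mask.  */
#define NUM_WORDS 4

/* Hypothetical helper: set the bits for registers FIRST..LAST, assuming
   register N lives in bit (N % 32) of word (N / 32).  */
static void
set_range (uint32_t mask[NUM_WORDS], int first, int last)
{
  for (int regno = first; regno <= last; regno++)
    mask[regno / 32] |= (uint32_t) 1 << (regno % 32);
}

/* Hypothetical self-check: recompute the two masks the patch touches and
   compare them with the constants in the diff.  */
static void
check_dmr_masks (void)
{
  uint32_t dm_regs[NUM_WORDS] = { 0, 0, 0, 0 };
  uint32_t all_regs[NUM_WORDS] = { 0, 0, 0, 0 };

  set_range (dm_regs, FIRST_DMR_REGNO, LAST_DMR_REGNO);
  set_range (all_regs, 0, LAST_DMR_REGNO);

  /* DM_REGS: only bits 15..22 of the last word are set.  */
  assert (dm_regs[0] == 0 && dm_regs[1] == 0 && dm_regs[2] == 0);
  assert (dm_regs[3] == 0x007f8000);

  /* ALL_REGS: registers 0..118.  */
  assert (all_regs[0] == 0xffffffff && all_regs[1] == 0xffffffff);
  assert (all_regs[2] == 0xffffffff && all_regs[3] == 0x007fffff);

  /* The predicate covers exactly the eight DMR registers.  */
  assert (!DMR_REGNO_P (110) && DMR_REGNO_P (111));
  assert (DMR_REGNO_P (118) && !DMR_REGNO_P (119));
}
```

Calling check_dmr_masks () from a small test driver is enough to confirm that the two mask constants agree with the register range.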
diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
index a125fd8fc99..72af3e6ef70 100644
--- a/gcc/config/rs6000/rs6000.md
+++ b/gcc/config/rs6000/rs6000.md
@@ -51,6 +51,8 @@  (define_constants
    (VRSAVE_REGNO		108)
    (VSCR_REGNO			109)
    (FRAME_POINTER_REGNUM	110)
+   (FIRST_DMR_REGNO		111)
+   (LAST_DMR_REGNO		118)
   ])
 
 ;;
@@ -355,7 +357,7 @@  (define_attr "cpu"
   (const (symbol_ref "(enum attr_cpu) rs6000_tune")))
 
 ;; The ISA we implement.
-(define_attr "isa" "any,p5,p6,p7,p7v,p8v,p9,p9v,p9kf,p9tf,p10,lxvp,stxvp"
+(define_attr "isa" "any,p5,p6,p7,p7v,p8v,p9,p9v,p9kf,p9tf,p10,lxvp,stxvp,dm,not_dm"
   (const_string "any"))
 
 ;; Is this alternative enabled for the current CPU/ISA/etc.?
@@ -411,6 +413,14 @@  (define_attr "enabled" ""
      (and (eq_attr "isa" "stxvp")
 	  (match_test "TARGET_STORE_VECTOR_PAIR"))
      (const_int 1)
+
+     (and (eq_attr "isa" "dm")
+	  (match_test "TARGET_DENSE_MATH"))
+     (const_int 1)
+
+     (and (eq_attr "isa" "not_dm")
+	  (match_test "!TARGET_DENSE_MATH"))
+     (const_int 1)
     ] (const_int 0)))
 
 ;; If this instruction is microcoded on the CELL processor
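Note on the rs6000.md changes above (not part of the patch): the two new "isa" values are complementary, so an insn alternative tagged "dm" and one tagged "not_dm" are never enabled together. A minimal C model of just those two cases follows. The function name isa_alternative_enabled is hypothetical, the flag argument stands in for TARGET_DENSE_MATH, and the other "isa" values are deliberately not modeled.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical model of the two new clauses in the "enabled" attribute.
   TARGET_DENSE_MATH is passed in as a plain flag for illustration.  */
static bool
isa_alternative_enabled (const char *isa, bool target_dense_math)
{
  if (strcmp (isa, "dm") == 0)
    return target_dense_math;	/* (match_test "TARGET_DENSE_MATH")  */

  if (strcmp (isa, "not_dm") == 0)
    return !target_dense_math;	/* (match_test "!TARGET_DENSE_MATH")  */

  /* Every other "isa" value is outside the scope of this sketch.  */
  return false;
}
```

For either setting of the flag, exactly one of isa_alternative_enabled ("dm", flag) and isa_alternative_enabled ("not_dm", flag) is true.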
diff --git a/gcc/config/rs6000/rs6000.opt b/gcc/config/rs6000/rs6000.opt
index 775ba830eac..70913d88d39 100644
--- a/gcc/config/rs6000/rs6000.opt
+++ b/gcc/config/rs6000/rs6000.opt
@@ -632,6 +632,10 @@  mfuture
 Target Undocumented Mask(FUTURE) Var(rs6000_isa_flags)
 Generate (do not generate) future instructions.
 
+mdense-math
+Target Undocumented Mask(DENSE_MATH) Var(rs6000_isa_flags)
+Generate (do not generate) dense math instructions.
+
 ; Documented parameters
 
 -param=rs6000-vect-unroll-limit=
diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 47a87d6ceec..2d7674e85b3 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -3440,6 +3440,13 @@  Like @code{d}, if @option{-mpowerpc-gfxopt} is used; otherwise, @code{NO_REGS}.
 @item wA
 Like @code{b}, if @option{-mpowerpc64} is used; otherwise, @code{NO_REGS}.
 
+@item wD
+Accumulator register if @option{-mma} is used; otherwise,
+@code{NO_REGS}.  If @option{-mdense-math} is used, the accumulator
+register will be in the dense math register set.  If
+@option{-mno-dense-math} is used, the accumulator register will
+overlap with the VSX vector registers 0..31.
+
 @item wB
 Signed 5-bit constant integer that can be loaded into an Altivec register.