> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Friday, November 17, 2023 5:48 PM
> To: Hu, Lin1 <lin1.hu@intel.com>
> Cc: Lu, Hongjiu <hongjiu.lu@intel.com>; binutils@sourceware.org
> Subject: Re: [PATCH][v3] Support APX NDD optimized encoding.
>
> On 17.11.2023 08:24, Hu, Lin1 wrote:
> >> -----Original Message-----
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: Wednesday, November 15, 2023 5:35 PM
> >>
> >> On 15.11.2023 03:59, Hu, Lin1 wrote:
> >>> --- a/gas/config/tc-i386.c
> >>> +++ b/gas/config/tc-i386.c
> >>> @@ -7208,6 +7208,43 @@ check_EgprOperands (const insn_template *t)
> >>> return 0;
> >>> }
> >>>
> >>> +/* Optimize APX NDD insns to legacy insns.  */
> >>> +static bool
> >>> +convert_NDD_to_REX2 (const insn_template *t)
> >>> +{
> >>> + if (t->opcode_modifier.vexvvvv == VexVVVV_DST
> >>> + && t->opcode_space == SPACE_EVEXMAP4
> >>> + && !i.has_nf
> >>> + && i.reg_operands >= 2)
> >>> + {
> >>> + unsigned int readonly_var = ~0;
> >>> + unsigned int dest = i.operands - 1;
> >>> + unsigned int src1 = i.operands - 2;
> >>> + unsigned int src2 = (i.operands > 3) ? i.operands - 3 : 0;
> >>> +
> >>> + if (i.types[src1].bitfield.class == Reg
> >>> + && i.op[src1].regs == i.op[dest].regs)
> >>> + readonly_var = src2;
> >>> +      /* adcx, adox and imul can't support to swap the source operands.  */
> >>> + else if (i.types[src2].bitfield.class == Reg
> >>> + && i.op[src2].regs == i.op[dest].regs
> >>> + && optimize > 1
> >>> + && t->opcode_modifier.commutative)
> >>
> >> Comment and code still aren't in line: "support to swap the source
> >> operands"
> >> really is the D attribute in the opcode table, whereas
> >> t->opcode_modifier.commutative is related to the C attribute (and all
> >> three
> >> insns named really are commutative). It looks to me that the code is
> >> correct, so it would then be the comment that may need updating. But
> >> it may also be better to additionally check .d here (making the code
> >> robust against C being added to the truly commutative yet not eligible
> >> to be optimized insns).
> >> In which case the comment might say "adcx, adox, and imul, while
> >> commutative, don't support to swap the source operands".
> >>
> >
> > I think we don't need to worry about it for now, because we've constrained
> > the function with VexVVVV_DST, and these instructions must be NDD
> > instructions. And adcx, adox and imul don't have the D attribute.
>
> Right, and I thought to leverage this. IOW ...
>
> > If I add check .d here, I will need to exclude them.
>
> ... I don't think I understand this.
>
I mean this place checks whether we can optimize something like "adcx %eax, %ebx, %eax -> adcx %ebx, %eax". Adcx doesn't have the D attribute, so if we want to add a t->opcode_modifier.d constraint, it would have to be "&& (t->opcode_modifier.d || t->mnem_off == MN_adcx) && t->opcode_modifier.commutative". If we don't exclude adcx, the optimization will be blocked by t->opcode_modifier.d.
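As a standalone sketch of the read-only-operand selection being discussed (mocked-up types, not the real tc-i386.c structures; the optimize > 1 and template checks are omitted here):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for the operand info gas keeps in 'i'.
   regs[] holds register ids in AT&T source order: src2, src1, dest.  */
struct ndd_insn {
  int regs[3];
  bool commutative;  /* the C attribute in the opcode table  */
};

/* Return the position of the operand that is only read (the one that
   survives when the insn is re-encoded as a 2-operand legacy form),
   or -1 if the insn cannot be converted.  */
static int
readonly_operand (const struct ndd_insn *insn)
{
  const int src2 = 0, src1 = 1, dest = 2;

  if (insn->regs[src1] == insn->regs[dest])
    return src2;            /* dest duplicates src1: src2 is only read  */
  /* Swapping the sources is only legal for commutative insns.  */
  if (insn->regs[src2] == insn->regs[dest] && insn->commutative)
    return src1;            /* dest duplicates src2: src1 is only read  */
  return -1;
}
```

For "adcx %eax, %ebx, %eax" the destination duplicates src2, so src1 (%ebx) is the read-only operand; the conversion is only taken because adcx is commutative.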
>
> > Based on our previous discussion, I modified tc-i386.c as follows
> >
> > +/* Check if the instruction uses the REX registers.  */
> > +static bool
> > +check_RexOperands (const insn_template *t)
>
> I don't think I can spot a use of the parameter in the function.
>
I merely mimicked check_EgprOperands and didn't pay attention to your comments about it. I will remove the parameter.
>
> > +{
> > + for (unsigned int op = 0; op < i.operands; op++)
> > + {
> > + if (i.types[op].bitfield.class != Reg
> > + /* Special case for (%dx) while doing input/output op */
> > + || i.input_output_operand)
>
> Once again: Is this needed? Respective insns shouldn't even make it here.
> Plus if they did, ...
>
I modified the constraint to be
if (i.types[op].bitfield.class != Reg)
  continue;
>
> > + continue;
> > +
> > + if (i.op[op].regs->reg_flags & (RegRex | RegRex64))
> > + return true;
>
> ... the loop would continue for (%dx) kind operands anyway.
>
>
> > + }
> > +
> > + if ((i.index_reg && (i.index_reg->reg_flags & (RegRex | RegRex64)))
> > + || (i.base_reg && (i.base_reg->reg_flags & (RegRex | RegRex64))))
> > + return true;
> > +
> > +  /* Check the pseudo prefix {rex} is valid.  */
> > +  if (i.rex_encoding)
> > + return true;
> > + return false;
>
> Just "return i.rex_encoding;"?
>
Indeed, that's simpler.
>
> > +}
> > +
> > +/* Optimize APX NDD insns to legacy insns.  */
> > +static unsigned int
> > +convert_NDD_to_legacy (const insn_template *t)
> > +{
> > + unsigned int readonly_var = ~0;
>
> One issue I continue to have is the name of this variable. Good names help
> understanding what code is doing. And in 3-operand NDD insns there are
> uniformly 2 operands which are only read.
>
Maybe I should rename the variable to readonly_reg_pos.
>
> > + if (t->opcode_modifier.vexvvvv == VexVVVV_DST
> > + && t->opcode_space == SPACE_EVEXMAP4
> > + && !i.has_nf
> > + && i.reg_operands >= 2)
> > + {
> > + unsigned int dest = i.operands - 1;
> > + unsigned int src1 = i.operands - 2;
> > + unsigned int src2 = (i.operands > 3) ? i.operands - 3 : 0;
> > +
> > + if (i.types[src1].bitfield.class == Reg
> > + && i.op[src1].regs == i.op[dest].regs)
> > + readonly_var = src2;
> > + /* adcx, adox, and imul, while commutative, don't support to swap
> > + the source operands. */
> > + else if (i.types[src2].bitfield.class == Reg
> > + && i.op[src2].regs == i.op[dest].regs
> > + && optimize > 1
> > + && t->opcode_modifier.commutative)
> > + readonly_var = src1;
> > + }
> > + return readonly_var;
> > +}
>
> You're no longer converting anything in this function, which - I'm sorry to say
> that - once again makes its name unsuitable.
>
True, but I think the function's name can be changed later; moving the operations on i out of it saves me from unnecessary backtracking.
Maybe the function can be named can_convert_NDD_to_legacy: returning (unsigned int) ~0 means it can't be converted, any other value means it can.
>
> > @@ -7728,6 +7782,55 @@ match_template (char mnem_suffix)
> > i.memshift = memshift;
> > }
> >
> > + /* If we can optimize a NDD insn to legacy insn, like
> > + add %r16, %r8, %r8 -> add %r16, %r8,
> > + add %r8, %r16, %r8 -> add %r16, %r8, then rematch template.
> > + Note that the semantics have not been changed. */
> > + if (optimize
> > + && !i.no_optimize
> > + && i.vec_encoding != vex_encoding_evex
> > + && t + 1 < current_templates->end
> > + && !t[1].opcode_modifier.evex
> > + && t[1].opcode_space <= SPACE_0F38)
>
> In all of these checks what I'm missing is a check that we're actually dealing
> with an NDD template.
>
Of course I can add it here, and then I'll remove it from convert_NDD_to_legacy.
>
> > + {
> > + unsigned int readonly_var = convert_NDD_to_legacy (t);
> > + size_match = true;
> > +
> > + if (readonly_var != (unsigned int) ~0)
> > + {
> > + for (j = 0; j < i.operands - 2; j++)
> > + {
> > + check_register = j;
> > + if (t->opcode_modifier.d)
> > + check_register ^= 1;
> > + overlap0 = operand_type_and (i.types[check_register],
> > + t[1].operand_types[check_register]);
> > + if (!operand_type_match (overlap0, i.types[check_register]))
> > + size_match = false;
> > + }
>
> I'm afraid that without a comment I don't understand what this is about.
>
I want to make sure that the two neighboring templates take the same input. I tried base_opcode, but it misses some special cases, so I start from the first operand (AT&T order), like shld with Imm8 vs. shld with a shift count. But some insns have the D attribute, so I also need to check whether the first operand can match the second operand type. It seems NDD insns can simply be matched that way now.
I had some problems with the earlier version, so I've modified it. Deleting this part of the code currently has no effect, because the NDD templates in the opcode table are ordered by us; this check only matters if someone sorts the .tbl incorrectly.
The current version is
+ if (readonly_var != (unsigned int) ~0)
+ {
+ overlap0 = operand_type_and (i.types[0],
+ t[1].operand_types[0]);
+ if (t->opcode_modifier.d)
+ overlap1 = operand_type_and (i.types[0],
+ t[1].operand_types[1]);
+ if (!operand_type_match (overlap0, i.types[0])
+	      && (!t->opcode_modifier.d
+		  || !operand_type_match (overlap1, i.types[0])))
+ size_match = false;
+
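As a bitmask sketch of what the operand_type_and / operand_type_match pair is doing in that check (illustrative OT_* masks; the real i386_operand_type is a multi-word bitfield struct):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative operand-type bitmask.  */
typedef uint32_t otype;
#define OT_REG8   (1u << 0)
#define OT_REG16  (1u << 1)
#define OT_REG32  (1u << 2)
#define OT_REG64  (1u << 3)

/* operand_type_and: intersect what the operand is with what the
   template accepts.  */
static otype ot_and (otype a, otype b) { return a & b; }

/* operand_type_match: the overlap must cover everything the given
   operand could be.  */
static bool ot_match (otype overlap, otype given) { return overlap == given; }

/* The check under discussion: the first operand must fit the next
   template's first operand type, or (for a template with the D
   attribute) its second operand type.  */
static bool
first_operand_fits (otype given, otype tmpl0, otype tmpl1, bool has_d)
{
  if (ot_match (ot_and (given, tmpl0), given))
    return true;
  return has_d && ot_match (ot_and (given, tmpl1), given);
}
```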
>
> > + if (size_match
> > + && (t[1].opcode_space <= SPACE_0F
> > +	      || (!check_EgprOperands (t + 1) // These conditions exclude adcx/adox with inappropriate registers.
> > + && !check_RexOperands (t + 1)
> > +		  && !i.op[i.operands - 1].regs->reg_type.bitfield.qword)))
>
> Saying "inappropriate" in such a comment doesn't really help, as it's then
> still unclear what is "appropriate". But the comment will need re-formatting
> anyway.
>
I modified the comment to "Optimizing some non-legacy-map0/1 insns without REX/REX2 prefix will be valuable."
>
> > + {
> > + unsigned int src1 = i.operands - 2;
>
> Looks like this variable is no longer used?
>
Yes, I removed it.
>
> > +	      unsigned int src2 = (i.operands > 3) ? i.operands - 3 : 0;
> > +
> > + if (readonly_var != src2)
> > + swap_2_operands (readonly_var, src2);
> > +
> > + --i.operands;
> > + --i.reg_operands;
> > +
> > + specific_error = progress (internal_error);
> > + continue;
> > + }
> > +
> > + }
> > + }
> > +
> > /* We've found a match; break out of loop. */
> > break;
> >
> > What's your opinion?
>
> I need some further clarification first, as per above. I also don't think I can
> properly identify (yet) which parts of the code are solely related to the
> ADCX/ADOX special case. The more code that's special for these, the more I'd
> be inclined to ask that dealing with them be a separate patch, for us to
> judge whether effort and effect are in reasonable balance.
>
Currently only this part of the code
+	      /* Optimizing some non-legacy-map0/1 insns without REX/REX2
+		 prefix will be valuable.  */
+	      && (t[1].opcode_space <= SPACE_0F
+		  || (!check_EgprOperands (t + 1)
+		      && !check_RexOperands ()
+		      && !i.op[i.operands - 1].regs->reg_type.bitfield.qword)))
is related to adcx/adox (check_RexOperands included); all the other constraints are just to make the code more robust. If you still think this part is too complicated, I would consider holding off on optimizing adcx/adox; after all, there aren't many insns of this kind at the moment.
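For reference, the overall rewrite we have been discussing (move the read-only operand into the first source slot, then drop the destination) can be sketched standalone like this (hypothetical helper over operand strings, not gas code; the real implementation uses swap_2_operands() and --i.operands):

```c
#include <assert.h>
#include <string.h>

/* Rewrite a 3-operand NDD form "src2, src1, dest" (AT&T order) into a
   2-operand legacy form when dest duplicates one of the sources.
   Returns the resulting operand count; 3 means not convertible.  */
static int
ndd_to_legacy (const char *ops[3], int commutative)
{
  if (strcmp (ops[1], ops[2]) == 0)   /* src1 == dest: just drop dest  */
    return 2;
  if (commutative && strcmp (ops[0], ops[2]) == 0)
    {                                 /* src2 == dest: swap the sources  */
      const char *tmp = ops[0];
      ops[0] = ops[1];
      ops[1] = tmp;
      return 2;
    }
  return 3;
}
```

With this, "add %r16, %r8, %r8" becomes "add %r16, %r8" directly, and "add %r8, %r16, %r8" becomes "add %r16, %r8" via the swap, exactly as in the comment in the patch.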
BRs,
Lin
@@ -7208,6 +7208,43 @@ check_EgprOperands (const insn_template *t)
return 0;
}
+/* Optimize APX NDD insns to legacy insns. */
+static bool
+convert_NDD_to_REX2 (const insn_template *t)
+{
+ if (t->opcode_modifier.vexvvvv == VexVVVV_DST
+ && t->opcode_space == SPACE_EVEXMAP4
+ && !i.has_nf
+ && i.reg_operands >= 2)
+ {
+ unsigned int readonly_var = ~0;
+ unsigned int dest = i.operands - 1;
+ unsigned int src1 = i.operands - 2;
+ unsigned int src2 = (i.operands > 3) ? i.operands - 3 : 0;
+
+ if (i.types[src1].bitfield.class == Reg
+ && i.op[src1].regs == i.op[dest].regs)
+ readonly_var = src2;
+ /* adcx, adox and imul can't support to swap the source operands. */
+ else if (i.types[src2].bitfield.class == Reg
+ && i.op[src2].regs == i.op[dest].regs
+ && optimize > 1
+ && t->opcode_modifier.commutative)
+ readonly_var = src1;
+ if (readonly_var != (unsigned int) ~0)
+ {
+ if (readonly_var != src2)
+ swap_2_operands (readonly_var, src2);
+
+ --i.operands;
+ --i.reg_operands;
+
+ return true;
+ }
+ }
+ return false;
+}
+
/* Helper function for the progress() macro in match_template(). */
static INLINE enum i386_error progress (enum i386_error new,
enum i386_error last,
@@ -7728,6 +7765,21 @@ match_template (char mnem_suffix)
i.memshift = memshift;
}
+  /* If we can optimize an NDD insn to a non-NDD insn, like
+ add %r16, %r8, %r8 -> add %r16, %r8,
+ add %r8, %r16, %r8 -> add %r16, %r8, then rematch template.
+ Note that the semantics have not been changed. */
+ if (optimize
+ && !i.no_optimize
+ && i.vec_encoding != vex_encoding_evex
+ && t + 1 < current_templates->end
+ && !t[1].opcode_modifier.evex
+ && convert_NDD_to_REX2 (t))
+ {
+ specific_error = progress (internal_error);
+ continue;
+ }
+
/* We've found a match; break out of loop. */
break;
}
new file mode 100644
@@ -0,0 +1,126 @@
+#as: -Os
+#objdump: -drw
+#name: x86-64 APX NDD optimized encoding
+#source: x86-64-apx-ndd-optimize.s
+
+.*: +file format .*
+
+
+Disassembly of section .text:
+
+0+ <_start>:
+\s*[a-f0-9]+:\s*d5 19 ff c7 inc %r31
+\s*[a-f0-9]+:\s*d5 11 fe c7 inc %r31b
+\s*[a-f0-9]+:\s*d5 4d 01 f8 add %r31,%r8
+\s*[a-f0-9]+:\s*d5 45 00 f8 add %r31b,%r8b
+\s*[a-f0-9]+:\s*d5 4d 01 f8 add %r31,%r8
+\s*[a-f0-9]+:\s*d5 1d 03 c7 add %r31,%r8
+\s*[a-f0-9]+:\s*d5 4d 03 38 add \(%r8\),%r31
+\s*[a-f0-9]+:\s*d5 1d 03 07 add \(%r31\),%r8
+\s*[a-f0-9]+:\s*49 81 c7 33 44 34 12 add \$0x12344433,%r15
+\s*[a-f0-9]+:\s*49 81 c0 11 22 33 f4 add \$0xfffffffff4332211,%r8
+\s*[a-f0-9]+:\s*d5 18 ff c9 dec %r17
+\s*[a-f0-9]+:\s*d5 10 fe c9 dec %r17b
+\s*[a-f0-9]+:\s*d5 18 f7 d1 not %r17
+\s*[a-f0-9]+:\s*d5 10 f6 d1 not %r17b
+\s*[a-f0-9]+:\s*d5 18 f7 d9 neg %r17
+\s*[a-f0-9]+:\s*d5 10 f6 d9 neg %r17b
+\s*[a-f0-9]+:\s*d5 1c 29 f9 sub %r15,%r17
+\s*[a-f0-9]+:\s*d5 14 28 f9 sub %r15b,%r17b
+\s*[a-f0-9]+:\s*62 54 84 18 29 38 sub %r15,\(%r8\),%r15
+\s*[a-f0-9]+:\s*d5 49 2b 04 07 sub \(%r15,%rax,1\),%r16
+\s*[a-f0-9]+:\s*d5 19 81 ee 34 12 00 00 sub \$0x1234,%r30
+\s*[a-f0-9]+:\s*d5 1c 19 f9 sbb %r15,%r17
+\s*[a-f0-9]+:\s*d5 14 18 f9 sbb %r15b,%r17b
+\s*[a-f0-9]+:\s*62 54 84 18 19 38 sbb %r15,\(%r8\),%r15
+\s*[a-f0-9]+:\s*d5 49 1b 04 07 sbb \(%r15,%rax,1\),%r16
+\s*[a-f0-9]+:\s*d5 19 81 de 34 12 00 00 sbb \$0x1234,%r30
+\s*[a-f0-9]+:\s*d5 1c 11 f9 adc %r15,%r17
+\s*[a-f0-9]+:\s*d5 14 10 f9 adc %r15b,%r17b
+\s*[a-f0-9]+:\s*4d 13 38 adc \(%r8\),%r15
+\s*[a-f0-9]+:\s*d5 49 13 04 07 adc \(%r15,%rax,1\),%r16
+\s*[a-f0-9]+:\s*d5 19 81 d6 34 12 00 00 adc \$0x1234,%r30
+\s*[a-f0-9]+:\s*d5 1c 09 f9 or %r15,%r17
+\s*[a-f0-9]+:\s*d5 14 08 f9 or %r15b,%r17b
+\s*[a-f0-9]+:\s*4d 0b 38 or \(%r8\),%r15
+\s*[a-f0-9]+:\s*d5 49 0b 04 07 or \(%r15,%rax,1\),%r16
+\s*[a-f0-9]+:\s*d5 19 81 ce 34 12 00 00 or \$0x1234,%r30
+\s*[a-f0-9]+:\s*d5 1c 31 f9 xor %r15,%r17
+\s*[a-f0-9]+:\s*d5 14 30 f9 xor %r15b,%r17b
+\s*[a-f0-9]+:\s*4d 33 38 xor \(%r8\),%r15
+\s*[a-f0-9]+:\s*d5 49 33 04 07 xor \(%r15,%rax,1\),%r16
+\s*[a-f0-9]+:\s*d5 19 81 f6 34 12 00 00 xor \$0x1234,%r30
+\s*[a-f0-9]+:\s*d5 1c 21 f9 and %r15,%r17
+\s*[a-f0-9]+:\s*d5 14 20 f9 and %r15b,%r17b
+\s*[a-f0-9]+:\s*4d 23 38 and \(%r8\),%r15
+\s*[a-f0-9]+:\s*d5 49 23 04 07 and \(%r15,%rax,1\),%r16
+\s*[a-f0-9]+:\s*d5 11 81 e6 34 12 00 00 and \$0x1234,%r30d
+\s*[a-f0-9]+:\s*d5 19 d1 cf ror %r31
+\s*[a-f0-9]+:\s*d5 11 d0 cf ror %r31b
+\s*[a-f0-9]+:\s*49 c1 cc 02 ror \$0x2,%r12
+\s*[a-f0-9]+:\s*41 c0 cc 02 ror \$0x2,%r12b
+\s*[a-f0-9]+:\s*d5 19 d1 c7 rol %r31
+\s*[a-f0-9]+:\s*d5 11 d0 c7 rol %r31b
+\s*[a-f0-9]+:\s*49 c1 c4 02 rol \$0x2,%r12
+\s*[a-f0-9]+:\s*41 c0 c4 02 rol \$0x2,%r12b
+\s*[a-f0-9]+:\s*d5 19 d1 df rcr %r31
+\s*[a-f0-9]+:\s*d5 11 d0 df rcr %r31b
+\s*[a-f0-9]+:\s*49 c1 dc 02 rcr \$0x2,%r12
+\s*[a-f0-9]+:\s*41 c0 dc 02 rcr \$0x2,%r12b
+\s*[a-f0-9]+:\s*d5 19 d1 d7 rcl %r31
+\s*[a-f0-9]+:\s*d5 11 d0 d7 rcl %r31b
+\s*[a-f0-9]+:\s*49 c1 d4 02 rcl \$0x2,%r12
+\s*[a-f0-9]+:\s*41 c0 d4 02 rcl \$0x2,%r12b
+\s*[a-f0-9]+:\s*d5 19 d1 e7 shl %r31
+\s*[a-f0-9]+:\s*d5 11 d0 e7 shl %r31b
+\s*[a-f0-9]+:\s*49 c1 e4 02 shl \$0x2,%r12
+\s*[a-f0-9]+:\s*41 c0 e4 02 shl \$0x2,%r12b
+\s*[a-f0-9]+:\s*d5 19 d1 ff sar %r31
+\s*[a-f0-9]+:\s*d5 11 d0 ff sar %r31b
+\s*[a-f0-9]+:\s*49 c1 fc 02 sar \$0x2,%r12
+\s*[a-f0-9]+:\s*41 c0 fc 02 sar \$0x2,%r12b
+\s*[a-f0-9]+:\s*d5 19 d1 e7 shl %r31
+\s*[a-f0-9]+:\s*d5 11 d0 e7 shl %r31b
+\s*[a-f0-9]+:\s*49 c1 e4 02 shl \$0x2,%r12
+\s*[a-f0-9]+:\s*41 c0 e4 02 shl \$0x2,%r12b
+\s*[a-f0-9]+:\s*d5 19 d1 ef shr %r31
+\s*[a-f0-9]+:\s*d5 11 d0 ef shr %r31b
+\s*[a-f0-9]+:\s*49 c1 ec 02 shr \$0x2,%r12
+\s*[a-f0-9]+:\s*41 c0 ec 02 shr \$0x2,%r12b
+\s*[a-f0-9]+:\s*62 74 9c 18 24 20 01 shld \$0x1,%r12,\(%rax\),%r12
+\s*[a-f0-9]+:\s*4d 0f a4 c4 02 shld \$0x2,%r8,%r12
+\s*[a-f0-9]+:\s*62 54 bc 18 24 c4 02 shld \$0x2,%r8,%r12,%r8
+\s*[a-f0-9]+:\s*62 74 b4 18 a5 08 shld %cl,%r9,\(%rax\),%r9
+\s*[a-f0-9]+:\s*d5 9c a5 e0 shld %cl,%r12,%r16
+\s*[a-f0-9]+:\s*62 7c 9c 18 a5 e0 shld %cl,%r12,%r16,%r12
+\s*[a-f0-9]+:\s*62 74 9c 18 2c 20 01 shrd \$0x1,%r12,\(%rax\),%r12
+\s*[a-f0-9]+:\s*4d 0f ac ec 01 shrd \$0x1,%r13,%r12
+\s*[a-f0-9]+:\s*62 54 94 18 2c ec 01 shrd \$0x1,%r13,%r12,%r13
+\s*[a-f0-9]+:\s*62 74 b4 18 ad 08 shrd %cl,%r9,\(%rax\),%r9
+\s*[a-f0-9]+:\s*d5 9c ad e0 shrd %cl,%r12,%r16
+\s*[a-f0-9]+:\s*62 7c 9c 18 ad e0 shrd %cl,%r12,%r16,%r12
+\s*[a-f0-9]+:\s*62 54 bd 18 66 c7 adcx %r15,%r8,%r8
+\s*[a-f0-9]+:\s*62 14 b9 18 66 04 3f adcx \(%r15,%r31,1\),%r8,%r8
+\s*[a-f0-9]+:\s*62 54 bd 18 66 c8 adcx %r8,%r9,%r8
+\s*[a-f0-9]+:\s*62 54 be 18 66 c7 adox %r15,%r8,%r8
+\s*[a-f0-9]+:\s*62 14 ba 18 66 04 3f adox \(%r15,%r31,1\),%r8,%r8
+\s*[a-f0-9]+:\s*62 54 be 18 66 c8 adox %r8,%r9,%r8
+\s*[a-f0-9]+:\s*67 0f 40 90 90 90 90 90 cmovo -0x6f6f6f70\(%eax\),%edx
+\s*[a-f0-9]+:\s*67 0f 41 90 90 90 90 90 cmovno -0x6f6f6f70\(%eax\),%edx
+\s*[a-f0-9]+:\s*67 0f 42 90 90 90 90 90 cmovb -0x6f6f6f70\(%eax\),%edx
+\s*[a-f0-9]+:\s*67 0f 43 90 90 90 90 90 cmovae -0x6f6f6f70\(%eax\),%edx
+\s*[a-f0-9]+:\s*67 0f 44 90 90 90 90 90 cmove -0x6f6f6f70\(%eax\),%edx
+\s*[a-f0-9]+:\s*67 0f 45 90 90 90 90 90 cmovne -0x6f6f6f70\(%eax\),%edx
+\s*[a-f0-9]+:\s*67 0f 46 90 90 90 90 90 cmovbe -0x6f6f6f70\(%eax\),%edx
+\s*[a-f0-9]+:\s*67 0f 47 90 90 90 90 90 cmova -0x6f6f6f70\(%eax\),%edx
+\s*[a-f0-9]+:\s*67 0f 48 90 90 90 90 90 cmovs -0x6f6f6f70\(%eax\),%edx
+\s*[a-f0-9]+:\s*67 0f 49 90 90 90 90 90 cmovns -0x6f6f6f70\(%eax\),%edx
+\s*[a-f0-9]+:\s*67 0f 4a 90 90 90 90 90 cmovp -0x6f6f6f70\(%eax\),%edx
+\s*[a-f0-9]+:\s*67 0f 4b 90 90 90 90 90 cmovnp -0x6f6f6f70\(%eax\),%edx
+\s*[a-f0-9]+:\s*67 0f 4c 90 90 90 90 90 cmovl -0x6f6f6f70\(%eax\),%edx
+\s*[a-f0-9]+:\s*67 0f 4d 90 90 90 90 90 cmovge -0x6f6f6f70\(%eax\),%edx
+\s*[a-f0-9]+:\s*67 0f 4e 90 90 90 90 90 cmovle -0x6f6f6f70\(%eax\),%edx
+\s*[a-f0-9]+:\s*67 0f 4f 90 90 90 90 90 cmovg -0x6f6f6f70\(%eax\),%edx
+\s*[a-f0-9]+:\s*67 0f af 90 09 09 09 00 imul 0x90909\(%eax\),%edx
+\s*[a-f0-9]+:\s*d5 aa af 94 f8 09 09 00 00 imul 0x909\(%rax,%r31,8\),%rdx
+\s*[a-f0-9]+:\s*48 0f af d0 imul %rax,%rdx
new file mode 100644
@@ -0,0 +1,119 @@
+# Check 64bit APX NDD instructions with optimized encoding
+
+ .text
+_start:
+inc %r31,%r31
+incb %r31b,%r31b
+add %r31,%r8,%r8
+addb %r31b,%r8b,%r8b
+{store} add %r31,%r8,%r8
+{load} add %r31,%r8,%r8
+add %r31,(%r8),%r31
+add (%r31),%r8,%r8
+add $0x12344433,%r15,%r15
+add $0xfffffffff4332211,%r8,%r8
+dec %r17,%r17
+decb %r17b,%r17b
+not %r17,%r17
+notb %r17b,%r17b
+neg %r17,%r17
+negb %r17b,%r17b
+sub %r15,%r17,%r17
+subb %r15b,%r17b,%r17b
+sub %r15,(%r8),%r15
+sub (%r15,%rax,1),%r16,%r16
+sub $0x1234,%r30,%r30
+sbb %r15,%r17,%r17
+sbbb %r15b,%r17b,%r17b
+sbb %r15,(%r8),%r15
+sbb (%r15,%rax,1),%r16,%r16
+sbb $0x1234,%r30,%r30
+adc %r15,%r17,%r17
+adcb %r15b,%r17b,%r17b
+adc %r15,(%r8),%r15
+adc (%r15,%rax,1),%r16,%r16
+adc $0x1234,%r30,%r30
+or %r15,%r17,%r17
+orb %r15b,%r17b,%r17b
+or %r15,(%r8),%r15
+or (%r15,%rax,1),%r16,%r16
+or $0x1234,%r30,%r30
+xor %r15,%r17,%r17
+xorb %r15b,%r17b,%r17b
+xor %r15,(%r8),%r15
+xor (%r15,%rax,1),%r16,%r16
+xor $0x1234,%r30,%r30
+and %r15,%r17,%r17
+andb %r15b,%r17b,%r17b
+and %r15,(%r8),%r15
+and (%r15,%rax,1),%r16,%r16
+and $0x1234,%r30,%r30
+ror %r31,%r31
+rorb %r31b,%r31b
+ror $0x2,%r12,%r12
+rorb $0x2,%r12b,%r12b
+rol %r31,%r31
+rolb %r31b,%r31b
+rol $0x2,%r12,%r12
+rolb $0x2,%r12b,%r12b
+rcr %r31,%r31
+rcrb %r31b,%r31b
+rcr $0x2,%r12,%r12
+rcrb $0x2,%r12b,%r12b
+rcl %r31,%r31
+rclb %r31b,%r31b
+rcl $0x2,%r12,%r12
+rclb $0x2,%r12b,%r12b
+sal %r31,%r31
+salb %r31b,%r31b
+sal $0x2,%r12,%r12
+salb $0x2,%r12b,%r12b
+sar %r31,%r31
+sarb %r31b,%r31b
+sar $0x2,%r12,%r12
+sarb $0x2,%r12b,%r12b
+shl %r31,%r31
+shlb %r31b,%r31b
+shl $0x2,%r12,%r12
+shlb $0x2,%r12b,%r12b
+shr %r31,%r31
+shrb %r31b,%r31b
+shr $0x2,%r12,%r12
+shrb $0x2,%r12b,%r12b
+shld $0x1,%r12,(%rax),%r12
+shld $0x2,%r8,%r12,%r12
+shld $0x2,%r8,%r12,%r8
+shld %cl,%r9,(%rax),%r9
+shld %cl,%r12,%r16,%r16
+shld %cl,%r12,%r16,%r12
+shrd $0x1,%r12,(%rax),%r12
+shrd $0x1,%r13,%r12,%r12
+shrd $0x1,%r13,%r12,%r13
+shrd %cl,%r9,(%rax),%r9
+shrd %cl,%r12,%r16,%r16
+shrd %cl,%r12,%r16,%r12
+adcx %r15,%r8,%r8
+adcx (%r15,%r31,1),%r8,%r8
+adcx %r8,%r9,%r8
+adox %r15,%r8,%r8
+adox (%r15,%r31,1),%r8,%r8
+adox %r8,%r9,%r8
+cmovo 0x90909090(%eax),%edx,%edx
+cmovno 0x90909090(%eax),%edx,%edx
+cmovb 0x90909090(%eax),%edx,%edx
+cmovae 0x90909090(%eax),%edx,%edx
+cmove 0x90909090(%eax),%edx,%edx
+cmovne 0x90909090(%eax),%edx,%edx
+cmovbe 0x90909090(%eax),%edx,%edx
+cmova 0x90909090(%eax),%edx,%edx
+cmovs 0x90909090(%eax),%edx,%edx
+cmovns 0x90909090(%eax),%edx,%edx
+cmovp 0x90909090(%eax),%edx,%edx
+cmovnp 0x90909090(%eax),%edx,%edx
+cmovl 0x90909090(%eax),%edx,%edx
+cmovge 0x90909090(%eax),%edx,%edx
+cmovle 0x90909090(%eax),%edx,%edx
+cmovg 0x90909090(%eax),%edx,%edx
+imul 0x90909(%eax),%edx,%edx
+imul 0x909(%rax,%r31,8),%rdx,%rdx
+imul %rdx,%rax,%rdx
@@ -552,6 +552,7 @@ run_dump_test "x86-64-optimize-6"
run_list_test "x86-64-optimize-7a" "-I${srcdir}/$subdir -march=+noavx -al"
run_dump_test "x86-64-optimize-7b"
run_list_test "x86-64-optimize-8" "-I${srcdir}/$subdir -march=+noavx2 -al"
+run_dump_test "x86-64-apx-ndd-optimize"
run_dump_test "x86-64-align-branch-1a"
run_dump_test "x86-64-align-branch-1b"
run_dump_test "x86-64-align-branch-1c"