[7/8] Support APX NDD optimized encoding.

Message ID 20231102112911.2372810-8-lili.cui@intel.com
State: Unresolved
Series: Support Intel APX EGPR

Commit Message

Cui, Lili Nov. 2, 2023, 11:29 a.m. UTC
  From: "Hu, Lin1" <lin1.hu@intel.com>

This patch aims to optimize:

add %r16, %r15, %r15 -> add %r16, %r15

so that an NDD insn whose destination matches one of its sources can be
re-encoded in its two-operand legacy form, replacing the 4-byte EVEX
prefix with the 2-byte REX2 prefix.

gas/ChangeLog:

	* config/tc-i386.c (optimize_NDD_to_nonNDD): New function.
	(match_template): If an APX NDD insn can be optimized, rematch
	the template.
	* testsuite/gas/i386/x86-64.exp: Add test.
	* testsuite/gas/i386/x86-64-apx-ndd-optimize.d: New test.
	* testsuite/gas/i386/x86-64-apx-ndd-optimize.s: Ditto.

opcodes/ChangeLog:

	* i386-init.h: Regenerated.
	* i386-mnem.h: Ditto.
	* i386-tbl.h: Ditto.
	* i386-opc.tbl: Add C to some instructions to support the
	optimization.
---
 gas/config/tc-i386.c                          |  46 +++++++
 .../gas/i386/x86-64-apx-ndd-optimize.d        | 124 ++++++++++++++++++
 .../gas/i386/x86-64-apx-ndd-optimize.s        | 117 +++++++++++++++++
 gas/testsuite/gas/i386/x86-64.exp             |   1 +
 opcodes/i386-opc.tbl                          |  22 ++--
 5 files changed, 302 insertions(+), 8 deletions(-)
 create mode 100644 gas/testsuite/gas/i386/x86-64-apx-ndd-optimize.d
 create mode 100644 gas/testsuite/gas/i386/x86-64-apx-ndd-optimize.s
  

Comments

Jan Beulich Nov. 9, 2023, 10:36 a.m. UTC | #1
On 02.11.2023 12:29, Cui, Lili wrote:
> --- a/gas/config/tc-i386.c
> +++ b/gas/config/tc-i386.c
> @@ -7208,6 +7208,44 @@ check_EgprOperands (const insn_template *t)
>    return 0;
>  }
>  
> +/* Optimize APX NDD insns to non-NDD insns.  */
> +
> +static bool
> +optimize_NDD_to_nonNDD (const insn_template *t)
> +{
> +  if (t->opcode_modifier.vexvvvv == VexVVVV_DST
> +      && t->opcode_space == SPACE_EVEXMAP4
> +      && !i.has_nf

As mentioned before, I'd still prefer the optimization to only be
added after NF handling was put in place. But I'm not meaning to make
this a strict requirement: Introducing the has_nf field here (with
it only being read, never [explicitly] written) is certainly okay-ish.
(But see also below.)

Similarly I'm concerned about the ND form of CFCMOVcc, which isn't there
yet in the patches, but which will also need excluding from this
optimization. Obviously this concern then extends to any future ND-
encoded insns, which (likely) won't have legacy-encoded (and hence
REX2-encodable) counterparts. Are there any plans how to deal with
such? (There's a possible approach mentioned further down.)

> +      && i.reg_operands >= 2
> +      && i.types[i.operands - 1].bitfield.class == Reg)

Isn't this implicit from the VexVVVV check further up?

> +    {
> +      unsigned int readonly_var = ~0;
> +      unsigned int dest = i.operands - 1;
> +      unsigned int src1 = (i.operands > 2) ? i.operands - 2 : 0;

Since we already know i.operands >= 2 from the earlier check of
i.reg_operands, can't this simply be

      unsigned int src1 = i.operands - 2;

?

> +      unsigned int src2 = (i.operands > 3) ? i.operands - 3 : 0;
> +
> +      if (i.types[src1].bitfield.class == Reg
> +	  && i.op[src1].regs == i.op[dest].regs)
> +	readonly_var = src2;

As can be seen in the testcase, this also results in ADCX/ADOX to be
converted to non-ND EVEX forms, i.e. even when that's not a win at all.
We shouldn't change what the user has written when the encoding doesn't
actually improve. (Or else, but I'd be hesitant to accept that, at the
very least the effect would need pointing out in the description or
even a code comment, so that later on it is possible to figure out
whether that was intentional or an oversight.)

This is where my template ordering remark in reply to patch 5 comes
into play: Whether invoking re-parse is okay would further need to
depend on whether an alternative (earlier) template actually allows
REX2 encoding (same base-opcode could be one of the criteria for how
far to look back through earlier templates; an option might also be to
put the 3-operand templates first, so that looking backwards wouldn't
be necessary in the first place). This would then likely also address
one of the forward looking concerns I've raised above.
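
For illustration only (a sketch, not proposed code - the field names
follow tc-i386.c, but the exact look-back criteria are the open question
here), such a check might be shaped like:

      /* Sketch: before re-matching, look for an earlier template with
         the same base opcode that isn't EVEX-only, i.e. one which could
         yield a REX2 or plain legacy encoding.  */
      const insn_template *alt = t;
      bool rematch_worthwhile = false;

      while (alt > current_templates->start
             && alt[-1].base_opcode == t->base_opcode)
        {
          --alt;
          if (!alt->opcode_modifier.evex)
            {
              rematch_worthwhile = true;
              break;
            }
        }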

> +      /* adcx, adox and imul don't have D bit.  */
> +      else if (i.types[src2].bitfield.class == Reg
> +	       && i.op[src2].regs == i.op[dest].regs
> +	       && t->opcode_modifier.commutative)

There's a disconnect between comment and code here: You don't use the
D attribute, so why is it being mentioned?

> +	readonly_var = src1;
> +      if (readonly_var != (unsigned int) ~0)
> +	{
> +	  --i.operands;
> +	  --i.reg_operands;
> +	  --i.tm.operands;
> +
> +	  if (readonly_var != src2)
> +	    swap_2_operands (readonly_var, src2);

May I suggest that just out of precaution the swapping be done before
operand counts are decremented? In principle swap_2_operands() could
do with having assertions added as to it actually dealing with valid
operands. (You'll note that elsewhere, when we add a new operand, we
increment first and then swap.)
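
As a minimal sketch of such an assertion (illustrative only, not part of
this patch; gas_assert and i.operands as already used elsewhere in
tc-i386.c):

--- a/gas/config/tc-i386.c
+++ b/gas/config/tc-i386.c
@@ static void
 swap_2_operands (unsigned int xchg1, unsigned int xchg2)
 {
+  /* Both indices must name currently-valid operands.  */
+  gas_assert (xchg1 < i.operands && xchg2 < i.operands);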

> +	  return 1;
> +	}
> +    }
> +  return 0;
> +}

Nit: A function returning bool would preferably return true/false, not
0/1.

> @@ -7728,6 +7766,14 @@ match_template (char mnem_suffix)
>  	  i.memshift = memshift;
>  	}
>  
> +      /* If we can optimize a NDD insn to non-NDD insn, like
> +	 add %r16, %r8, %r8 -> add %r16, %r8, then rematch template.  */
> +      if (optimize == 1 && optimize_NDD_to_nonNDD (t))

So you do this optimization at -O1, but not at -O2? Imo the "== 1"
simply needs dropping. Furthermore the {nooptimize} and {evex} pseudo
prefixes need respecting. Quite likely respecting {evex} would
eliminate the need for the explicit .has_nf check in the helper
function, as I expect .vec_encoding to be set alongside that bit
anyway. Further quite likely respecting {evex} here will mean that in
patch 3 you need to introduce a new enumerator (e.g.
vex_encoding_egpr, vaguely similar to vex_encoding_evex512), to avoid
setting .vec_encoding to vex_encoding_evex when an eGPR is parsed.

As to optimization level: In build_vex_prefix() we leverage C only
at -O2 or higher (including -Os). We may want to be consistent in this
regard here (i.e. by an extra check in the helper function).
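
For illustration, the combined gating might then look vaguely like this
(a sketch only - which pseudo prefixes and which optimization levels
qualify is exactly what's being discussed):

      if (optimize > 1 /* -O2 and higher, as for C in build_vex_prefix().  */
          && !i.no_optimize
          && i.vec_encoding != vex_encoding_evex
          && optimize_NDD_to_nonNDD (t))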

> +	{
> +	  t = current_templates->start - 1;

As per a remark further up, this adjustment could be avoided if the ND
templates came ahead of the legacy ones. They can't be wrongly used in
place of the legacy ones, due to the extra operand they require. Then
a comment here would merely point out this ordering aspect. But of
course care will then need to be taken to not go past i386_optab[]'s
bounds (by having suitably ordered conditionals when looking for
whether there is an alternative template in the first place; again see
the respective remark further up).

> +	  continue;
> +	}

Btw, considering this re-matching, I wonder whether "convert" wouldn't
be better in the function name compared to "optimize".

> --- /dev/null
> +++ b/gas/testsuite/gas/i386/x86-64-apx-ndd-optimize.s
> @@ -0,0 +1,117 @@
> +# Check 64bit APX NDD instructions with optimized encoding
> +
> +	.text
> +_start:
> +inc    %r31,%r31
> +incb   %r31b,%r31b
> +add    %r31,%r8,%r8
> +addb   %r31b,%r8b,%r8b
> +{store} add    %r31,%r8,%r8
> +{load}  add    %r31,%r8,%r8
> +add    %r31,(%r8),%r31
> +add    (%r31),%r8,%r8
> +add    $0x12344433,%r15,%r15
> +add    $0xfffffffff4332211,%r8,%r8
> +dec    %r17,%r17
> +decb   %r17b,%r17b
> +not    %r17,%r17
> +notb   %r17b,%r17b
> +neg    %r17,%r17
> +negb   %r17b,%r17b
> +sub    %r15,%r17,%r17
> +subb   %r15b,%r17b,%r17b
> +sub    %r15,(%r8),%r15
> +sub    (%r15,%rax,1),%r16,%r16
> +sub    $0x1234,%r30,%r30
> +sbb    %r15,%r17,%r17
> +sbbb   %r15b,%r17b,%r17b
> +sbb    %r15,(%r8),%r15
> +sbb    (%r15,%rax,1),%r16,%r16
> +sbb    $0x1234,%r30,%r30
> +adc    %r15,%r17,%r17
> +adcb   %r15b,%r17b,%r17b
> +adc    %r15,(%r8),%r15
> +adc    (%r15,%rax,1),%r16,%r16
> +adc    $0x1234,%r30,%r30
> +or     %r15,%r17,%r17
> +orb    %r15b,%r17b,%r17b
> +or     %r15,(%r8),%r15
> +or     (%r15,%rax,1),%r16,%r16
> +or     $0x1234,%r30,%r30
> +xor    %r15,%r17,%r17
> +xorb   %r15b,%r17b,%r17b
> +xor    %r15,(%r8),%r15
> +xor    (%r15,%rax,1),%r16,%r16
> +xor    $0x1234,%r30,%r30
> +and    %r15,%r17,%r17
> +andb   %r15b,%r17b,%r17b
> +and    %r15,(%r8),%r15
> +and    (%r15,%rax,1),%r16,%r16
> +and    $0x1234,%r30,%r30
> +ror    %r31,%r31
> +rorb   %r31b,%r31b
> +ror    $0x2,%r12,%r12
> +rorb   $0x2,%r12b,%r12b
> +rol    %r31,%r31
> +rolb   %r31b,%r31b
> +rol    $0x2,%r12,%r12
> +rolb   $0x2,%r12b,%r12b
> +rcr    %r31,%r31
> +rcrb   %r31b,%r31b
> +rcr    $0x2,%r12,%r12
> +rcrb   $0x2,%r12b,%r12b
> +rcl    %r31,%r31
> +rclb   %r31b,%r31b
> +rcl    $0x2,%r12,%r12
> +rclb   $0x2,%r12b,%r12b
> +shl    %r31,%r31
> +shlb   %r31b,%r31b
> +shl    $0x2,%r12,%r12
> +shlb   $0x2,%r12b,%r12b
> +sar    %r31,%r31
> +sarb   %r31b,%r31b
> +sar    $0x2,%r12,%r12
> +sarb   $0x2,%r12b,%r12b
> +shl    %r31,%r31
> +shlb   %r31b,%r31b
> +shl    $0x2,%r12,%r12
> +shlb   $0x2,%r12b,%r12b
> +shr    %r31,%r31
> +shrb   %r31b,%r31b
> +shr    $0x2,%r12,%r12
> +shrb   $0x2,%r12b,%r12b
> +shld   $0x1,%r12,(%rax),%r12
> +shld   $0x2,%r8,%r12,%r12
> +shld   %cl,%r9,(%rax),%r9
> +shld   %cl,%r12,%r16,%r16
> +shld   %cl,%r13,(%r19,%rax,4),%r13

What's the difference (in what is being tested) between this and the
first of the %cl tests? Shouldn't one of them rather be of the form
"reg1,reg2,reg1"? And then shouldn't there be a similar test with an
immediate operand? (Same for SHRD then, obviously.)

> --- a/opcodes/i386-opc.tbl
> +++ b/opcodes/i386-opc.tbl
> @@ -145,6 +145,8 @@
>  // The EVEX purpose of StaticRounding appears only together with SAE. Re-use
>  // the bit to mark commutative VEX encodings where swapping the source
>  // operands may allow to switch from 3-byte to 2-byte VEX encoding.
> +// And re-use the bit to mark some NDD insns where swapping the source operands
> +// may allow to switch from 3 operands to 2 operands.
>  #define C StaticRounding

The 3-to-2 conversion isn't what we're primarily after (see comments above).
It's the EVEX->REX2 encoding conversion which we'd like to do.

> @@ -166,6 +168,10 @@
>  
>  ### MARKER ###
>  
> +// Please don't add an NDD insn which may be optimized to a REX2 insn
> +// before the mov templates. Otherwise a good UB checker may object to
> +// the "template->start - 1" behavior at the end of match_template.
> +
>  // Move instructions.

While I mentioned adding a comment here as a minimal solution, did you try
to think of better approaches, or some enforcement of this restriction
(like gas_assert() before the expression in question)? You could even go
as far as simply not trying the optimization when t == i386_optab, with no
need to have a comment here (the comment would then be next to that
part of the condition, thus right where it's really relevant). Then anyone
misplacing a new template in the opcode table would simply observe that
an expected optimization doesn't happen, and they surely would find the
conditional with its comment.
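
A sketch of that last variant (illustrative only; i386_optab is the
start of the opcode table):

      /* No template precedes the start of i386_optab, so don't even
         try the conversion for the very first table entry.  */
      if (optimize && t != i386_optab && optimize_NDD_to_nonNDD (t))
        {
          t = current_templates->start - 1;
          continue;
        }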

For all of the changes below (which are a little hard to review in email),
aiui they only add C as needed. I once again would prefer if that attribute
could be added right as the templates are introduced, with the description
stating the intention and that the actual use of the attribute will be
added later (i.e. as expressed earlier already for NF).

Jan
  
Hu, Lin1 Nov. 10, 2023, 5:43 a.m. UTC | #2
> On 02.11.2023 12:29, Cui, Lili wrote:
> > --- a/gas/config/tc-i386.c
> > +++ b/gas/config/tc-i386.c
> > @@ -7208,6 +7208,44 @@ check_EgprOperands (const insn_template *t)
> >    return 0;
> >  }
> >
> > +/* Optimize APX NDD insns to non-NDD insns.  */
> > +
> > +static bool
> > +optimize_NDD_to_nonNDD (const insn_template *t)
> > +{
> > +  if (t->opcode_modifier.vexvvvv == VexVVVV_DST
> > +      && t->opcode_space == SPACE_EVEXMAP4
> > +      && !i.has_nf
> 
> As mentioned before, I'd still prefer the optimization to only be added after NF
> handling was put in place. But I'm not meaning to make this a strict requirement:
> Introducing the has_nf field here (with it only being read, never [explicitly]
> written) is certainly okay-ish.
> (But see also below.)
>

Of course we can put the optimization patch after the NF patch. For now, we are just putting it here for review.

> 
> Similarly I'm concerned about the ND form of CFCMOVcc, which isn't there yet in
> the patches, but which will also need excluding from this optimization. Obviously
> this concern then extends to any future ND- encoded insns, which (likely) won't
> have legacy-encoded (and hence
> REX2-encodable) counterparts. Are there any plans how to deal with such?
> (There's a possible approach mentioned further down.)
>

Looking at the other current NDD instructions, it should be possible to use evex encoding even if an insn doesn't have a rex2 encoding.
 
>
> > +      && i.reg_operands >= 2
> > +      && i.types[i.operands - 1].bitfield.class == Reg)
> 
> Isn't this implicit from the VexVVVV check further up?
>

Yes.

> 
> > +    {
> > +      unsigned int readonly_var = ~0;
> > +      unsigned int dest = i.operands - 1;
> > +      unsigned int src1 = (i.operands > 2) ? i.operands - 2 : 0;
> 
> Since we already know i.operands >= 2 from the earlier check of i.reg_operands,
> can't this simply be
> 
>       unsigned int src1 = i.operands - 2;
> 
> ?
>

OK.

> 
> > +      unsigned int src2 = (i.operands > 3) ? i.operands - 3 : 0;
> > +
> > +      if (i.types[src1].bitfield.class == Reg
> > +	  && i.op[src1].regs == i.op[dest].regs)
> > +	readonly_var = src2;
> 
> As can be seen in the testcase, this also results in ADCX/ADOX to be converted to
> non-ND EVEX forms, i.e. even when that's not a win at all.
> We shouldn't change what the user has written when the encoding doesn't
> actually improve. (Or else, but I'd be hesitant to accept that, at the very least the
> effect would need pointing out in the description or even a code comment, so
> that later on it is possible to figure out whether that was intentional or an
> oversight.)
> 
> This is where my template ordering remark in reply to patch 5 comes into play:
> Whether invoking re-parse is okay would further need to depend on whether an
> alternative (earlier) template actually allows
> REX2 encoding (same base-opcode could be one of the criteria for how far to
> look back through earlier templates; an option might also be to put the 3-
> operand templates first, so that looking backwards wouldn't be necessary in the
> first place). This would then likely also address one of the forward looking
> concerns I've raised above.
>

Indeed, adcx's legacy insn can't support rex2.

For my problem, I prefer to re-order the templates, because I haven't thought of a simple way to move t back to the farthest template with the same base_opcode. The tentative ordering would be: NDD EVEX - REX2 - EVEX. And I will need a temporary variable in case the insn doesn't match the REX2 template, letting me backtrack the match result and the value of i.

> 
> > +      /* adcx, adox and imul don't have D bit.  */
> > +      else if (i.types[src2].bitfield.class == Reg
> > +	       && i.op[src2].regs == i.op[dest].regs
> > +	       && t->opcode_modifier.commutative)
> 
> There's a disconnect between comment and code here: You don't use the D
> attribute, so why is it being mentioned?
>

I forgot to modify it.
 
>
> > +	readonly_var = src1;
> > +      if (readonly_var != (unsigned int) ~0)
> > +	{
> > +	  --i.operands;
> > +	  --i.reg_operands;
> > +	  --i.tm.operands;
> > +
> > +	  if (readonly_var != src2)
> > +	    swap_2_operands (readonly_var, src2);
> 
> May I suggest that just out of precaution the swapping be done before operand
> counts are decremented? In principle swap_2_operands() could do with having
> assertions added as to it actually dealing with valid operands. (You'll note that
> elsewhere, when we add a new operand, we increment first and then swap.)
> 

Indeed, it's safer; I've exchanged the order of execution. Do you have any other comments on the assertions (if I understand correctly, you'd like some gas_assert calls)? For the time being I can guarantee that the two indexes are definitely in range; is there anything else that needs to be checked?

> > @@ -7728,6 +7766,14 @@ match_template (char mnem_suffix)
> >  	  i.memshift = memshift;
> >  	}
> >
> > +      /* If we can optimize a NDD insn to non-NDD insn, like
> > +	 add %r16, %r8, %r8 -> add %r16, %r8, then rematch template.  */
> > +      if (optimize == 1 && optimize_NDD_to_nonNDD (t))
> 
> So you do this optimization at -O1, but not at -O2? Imo the "== 1"
> simply needs dropping. Furthermore the {nooptimize} and {evex} pseudo
> prefixes need respecting. Quite likely respecting {evex} would eliminate the need
> for the explicit .has_nf check in the helper function, as I expect .vec_encoding to
> be set alongside that bit anyway. Further quite likely respecting {evex} here will
> mean that in patch 3 you need to introduce a new enumerator (e.g.
> vex_encoding_egpr, vaguely similar to vex_encoding_evex512), to avoid
> setting .vec_encoding to vex_encoding_evex when an eGPR is parsed.
> 
> As to optimization level: In build_vex_prefix() we leverage C only at -O2 or
> higher (including -Os). We may want to be consistent in this regard here (i.e. by
> an extra check in the helper function).
> 

It's a mistake, I have fixed it; the condition will become the following. I will try later, after the NF patch is done, to see whether the i.has_nf constraint can be removed.

       /* If we can optimize a NDD insn to non-NDD insn, like
         add %r16, %r8, %r8 -> add %r16, %r8, then rematch template.  */
-      if (optimize == 1 && optimize_NDD_to_nonNDD (t))
+      if (!i.no_optimize && i.vec_encoding != vex_encoding_evex
+         && optimize && optimize_NDD_to_nonNDD (t))
        {

>
> > +	{
> > +	  t = current_templates->start - 1;
> 
> As per a remark further up, this adjustment could be avoided if the ND templates
> came ahead of the legacy ones. They can't be wrongly used in place of the
> legacy ones, due to the extra operand they require. Then a comment here would
> merely point out this ordering aspect. But of course care will then need to be
> taken to not go past i386_optab[]'s bounds (by having suitably ordered
> conditionals when looking for whether there is an alternative template in the
> first place; again see the respective remark further up).
>

Yes, if we reorder the templates, I will remove the line. Here is just one example of a possible implementation:

        }

+      bool have_converted_NDD_to_nonNDD = false;
+      i386_insn tmp_i;
+
+      if (!i.no_optimize && i.vec_encoding != vex_encoding_evex
+         && optimize && !have_converted_NDD_to_nonNDD
+         && convert_NDD_to_nonNDD (t))
+       {
+         have_converted_NDD_to_nonNDD = true;
+         tmp_i = i;
+       }
+
       /* We've found a match; break out of loop.  */
       break;
     }
@@ -7787,6 +7802,9 @@ match_template (char mnem_suffix)
       return NULL;
     }

+  if (have_converted_NDD_to_nonNDD)
+    i = tmp_i;
+
   if (!quiet_warnings)

> 
> > +	  continue;
> > +	}
> 
> Btw, considering this re-matching, I wonder whether "convert" wouldn't be
> better in the function name compared to "optimize".
> 
> > --- /dev/null
> > +++ b/gas/testsuite/gas/i386/x86-64-apx-ndd-optimize.s
> > @@ -0,0 +1,117 @@
> > +# Check 64bit APX NDD instructions with optimized encoding
> > +
> > +	.text
> > +_start:
> > +inc    %r31,%r31
> > +incb   %r31b,%r31b
> > +add    %r31,%r8,%r8
> > +addb   %r31b,%r8b,%r8b
> > +{store} add    %r31,%r8,%r8
> > +{load}  add    %r31,%r8,%r8
> > +add    %r31,(%r8),%r31
> > +add    (%r31),%r8,%r8
> > +add    $0x12344433,%r15,%r15
> > +add    $0xfffffffff4332211,%r8,%r8
> > +dec    %r17,%r17
> > +decb   %r17b,%r17b
> > +not    %r17,%r17
> > +notb   %r17b,%r17b
> > +neg    %r17,%r17
> > +negb   %r17b,%r17b
> > +sub    %r15,%r17,%r17
> > +subb   %r15b,%r17b,%r17b
> > +sub    %r15,(%r8),%r15
> > +sub    (%r15,%rax,1),%r16,%r16
> > +sub    $0x1234,%r30,%r30
> > +sbb    %r15,%r17,%r17
> > +sbbb   %r15b,%r17b,%r17b
> > +sbb    %r15,(%r8),%r15
> > +sbb    (%r15,%rax,1),%r16,%r16
> > +sbb    $0x1234,%r30,%r30
> > +adc    %r15,%r17,%r17
> > +adcb   %r15b,%r17b,%r17b
> > +adc    %r15,(%r8),%r15
> > +adc    (%r15,%rax,1),%r16,%r16
> > +adc    $0x1234,%r30,%r30
> > +or     %r15,%r17,%r17
> > +orb    %r15b,%r17b,%r17b
> > +or     %r15,(%r8),%r15
> > +or     (%r15,%rax,1),%r16,%r16
> > +or     $0x1234,%r30,%r30
> > +xor    %r15,%r17,%r17
> > +xorb   %r15b,%r17b,%r17b
> > +xor    %r15,(%r8),%r15
> > +xor    (%r15,%rax,1),%r16,%r16
> > +xor    $0x1234,%r30,%r30
> > +and    %r15,%r17,%r17
> > +andb   %r15b,%r17b,%r17b
> > +and    %r15,(%r8),%r15
> > +and    (%r15,%rax,1),%r16,%r16
> > +and    $0x1234,%r30,%r30
> > +ror    %r31,%r31
> > +rorb   %r31b,%r31b
> > +ror    $0x2,%r12,%r12
> > +rorb   $0x2,%r12b,%r12b
> > +rol    %r31,%r31
> > +rolb   %r31b,%r31b
> > +rol    $0x2,%r12,%r12
> > +rolb   $0x2,%r12b,%r12b
> > +rcr    %r31,%r31
> > +rcrb   %r31b,%r31b
> > +rcr    $0x2,%r12,%r12
> > +rcrb   $0x2,%r12b,%r12b
> > +rcl    %r31,%r31
> > +rclb   %r31b,%r31b
> > +rcl    $0x2,%r12,%r12
> > +rclb   $0x2,%r12b,%r12b
> > +shl    %r31,%r31
> > +shlb   %r31b,%r31b
> > +shl    $0x2,%r12,%r12
> > +shlb   $0x2,%r12b,%r12b
> > +sar    %r31,%r31
> > +sarb   %r31b,%r31b
> > +sar    $0x2,%r12,%r12
> > +sarb   $0x2,%r12b,%r12b
> > +shl    %r31,%r31
> > +shlb   %r31b,%r31b
> > +shl    $0x2,%r12,%r12
> > +shlb   $0x2,%r12b,%r12b
> > +shr    %r31,%r31
> > +shrb   %r31b,%r31b
> > +shr    $0x2,%r12,%r12
> > +shrb   $0x2,%r12b,%r12b
> > +shld   $0x1,%r12,(%rax),%r12
> > +shld   $0x2,%r8,%r12,%r12
> > +shld   %cl,%r9,(%rax),%r9
> > +shld   %cl,%r12,%r16,%r16
> > +shld   %cl,%r13,(%r19,%rax,4),%r13
> 
> What's the difference (in what is being tested) between this and the first of
> the %cl tests? Shouldn't one of them rather be of the form "reg1,reg2,reg1"?
> And then shouldn't there be a similar test with an immediate operand? (Same for
> SHRD then, obviously.)
>

I have modified them.

> 
> > --- a/opcodes/i386-opc.tbl
> > +++ b/opcodes/i386-opc.tbl
> > @@ -145,6 +145,8 @@
> >  // The EVEX purpose of StaticRounding appears only together with SAE. Re-use
> >  // the bit to mark commutative VEX encodings where swapping the source
> >  // operands may allow to switch from 3-byte to 2-byte VEX encoding.
> > +// And re-use the bit to mark some NDD insns where swapping the source
> > +// operands may allow to switch from 3 operands to 2 operands.
> >  #define C StaticRounding
> 
> The 3-to-2 conversion isn't what we're primarily after (see comments above).
> It's the EVEX->REX2 encoding conversion which we'd like to do.
> 
> > @@ -166,6 +168,10 @@
> >
> >  ### MARKER ###
> >
> > +// Please don't add an NDD insn which may be optimized to a REX2 insn
> > +// before the mov templates. Otherwise a good UB checker may object to
> > +// the "template->start - 1" behavior at the end of match_template.
> > +
> >  // Move instructions.
> 
> While I mentioned adding a comment here as a minimal solution, did you try to
> think of better approaches, or some enforcement of this restriction (like
> gas_assert() before the expression in question)? You could even go as far as
> simply not trying the optimization when t == i386_optab, with no need to have a
> comment here (the comment would then be next to that part of the condition,
> thus right where it's really relevant). Then anyone misplacing a new template in
> the opcode table would simply observe that an expected optimization doesn't
> happen, and they surely would find the conditional with its comment.
>

If we don't reorder i386-opc.tbl, I will consider modifying t = current_templates->start - 1 to

if (t != current_templates->start)
   t = current_templates->start - 1;
 
>
> For all of the changes below (which are a little hard to review in email), aiui they
> only add C as needed. I once again would prefer if that attribute could be added
> right as the templates are introduced, with the description stating the intention
> and that the actual use of the attribute will be added later (i.e. as expressed
> earlier already for NF).
>

After the changes are finalized, I'll break out the part of the modification that adds the C attribute and give it to Lili so she can put it where it belongs.

BRs,
Lin
  
Jan Beulich Nov. 10, 2023, 9:54 a.m. UTC | #3
On 10.11.2023 06:43, Hu, Lin1 wrote:
>> On 02.11.2023 12:29, Cui, Lili wrote:

Btw, your shrinking of reply context above from here is problematic. Someone
reading just this mail can't tell who ...

>> Similarly I'm concerned about the ND form of CFCMOVcc, which isn't there yet in
>> the patches, but which will also need excluding from this optimization. Obviously
>> this concern then extends to any future ND- encoded insns, which (likely) won't
>> have legacy-encoded (and hence
>> REX2-encodable) counterparts. Are there any plans how to deal with such?
>> (There's a possible approach mentioned further down.)

... originally said this.

> Looking at the other current NDD instructions, it should be possible to use evex encoding even if an insn doesn't have a rex2 encoding.

Should be possible - yes. But why would you do such a transformation? That's
not an optimization at all, afaict. And we shouldn't alter what the
programmer wrote if the result isn't in at least some respect deemed better
than the original. Considering this, the helper function may want further
naming differently than already suggested, to e.g. convert_NDD_to_REX2().

>>> +      unsigned int src2 = (i.operands > 3) ? i.operands - 3 : 0;
>>> +
>>> +      if (i.types[src1].bitfield.class == Reg
>>> +	  && i.op[src1].regs == i.op[dest].regs)
>>> +	readonly_var = src2;
>>
>> As can be seen in the testcase, this also results in ADCX/ADOX to be converted to
>> non-ND EVEX forms, i.e. even when that's not a win at all.
>> We shouldn't change what the user has written when the encoding doesn't
>> actually improve. (Or else, but I'd be hesitant to accept that, at the very least the
>> effect would need pointing out in the description or even a code comment, so
>> that later on it is possible to figure out whether that was intentional or an
>> oversight.)
>>
>> This is where my template ordering remark in reply to patch 5 comes into play:
>> Whether invoking re-parse is okay would further need to depend on whether an
>> alternative (earlier) template actually allows
>> REX2 encoding (same base-opcode could be one of the criteria for how far to
>> look back through earlier templates; an option might also be to put the 3-
>> operand templates first, so that looking backwards wouldn't be necessary in the
>> first place). This would then likely also address one of the forward looking
>> concerns I've raised above.
>>
> 
> Indeed, adcx's legacy insn can't support rex2.
> 
> For my problem, I prefer to re-order the templates, because I haven't thought of a simple way to move t back to the farthest template with the same base_opcode. The tentative ordering would be: NDD EVEX - REX2 - EVEX.

Yes, this matches my understanding / expectation.

> And I will need a temporary variable in case the insn doesn't match the REX2 template, letting me backtrack the match result and the value of i.

This, however, I'm not convinced of. I'd rather see this vaguely in line
with 58bceb182740 ("x86: prefer VEX encodings over EVEX ones when
possible"): Do another full matching round with the removed operand,
arranging for "internal error" to be raised in case that fails. Your
approach would, I think, result in silent bad code generation in case
something went wrong. Thing is - you don't even need to advance (or
backtrack) t in that case.

>>> +	readonly_var = src1;
>>> +      if (readonly_var != (unsigned int) ~0)
>>> +	{
>>> +	  --i.operands;
>>> +	  --i.reg_operands;
>>> +	  --i.tm.operands;
>>> +
>>> +	  if (readonly_var != src2)
>>> +	    swap_2_operands (readonly_var, src2);
>>
>> May I suggest that just out of precaution the swapping be done before operand
>> counts are decremented? In principle swap_2_operands() could do with having
>> assertions added as to it actually dealing with valid operands. (You'll note that
>> elsewhere, when we add a new operand, we increment first and then swap.)
>>
> 
> Indeed, it's safer; I've exchanged the order of execution. Do you have any other comments on the assertions (if I understand correctly, you'd like some gas_assert calls)? For the time being I can guarantee that the two indexes are definitely in range; is there anything else that needs to be checked?

That was a remark towards possible future changes, independent of your
work here. I merely want to make sure that possibly introducing such an
assertion wouldn't require code changes elsewhere when that can be
easily avoided right away.

>>> @@ -7728,6 +7766,14 @@ match_template (char mnem_suffix)
>>>  	  i.memshift = memshift;
>>>  	}
>>>
>>> +      /* If we can optimize a NDD insn to non-NDD insn, like
>>> +	 add %r16, %r8, %r8 -> add %r16, %r8, then rematch template.  */
>>> +      if (optimize == 1 && optimize_NDD_to_nonNDD (t))
>>
>> So you do this optimization at -O1, but not at -O2? Imo the "== 1"
>> simply needs dropping. Furthermore the {nooptimize} and {evex} pseudo
>> prefixes need respecting. Quite likely respecting {evex} would eliminate the need
>> for the explicit .has_nf check in the helper function, as I expect .vec_encoding to
>> be set alongside that bit anyway. Further quite likely respecting {evex} here will
>> mean that in patch 3 you need to introduce a new enumerator (e.g.
>> vex_encoding_egpr, vaguely similar to vex_encoding_evex512), to avoid
>> setting .vec_encoding to vex_encoding_evex when an eGPR is parsed.
>>
>> As to optimization level: In build_vex_prefix() we leverage C only at -O2 or
>> higher (including -Os). We may want to be consistent in this regard here (i.e. by
>> an extra check in the helper function).
>>
> 
> It's a mistake, I have fixed it; the condition will become the following. I will try later, after the NF patch is done, to see whether the i.has_nf constraint can be removed.
> 
>        /* If we can optimize a NDD insn to non-NDD insn, like
>          add %r16, %r8, %r8 -> add %r16, %r8, then rematch template.  */
> -      if (optimize == 1 && optimize_NDD_to_nonNDD (t))
> +      if (!i.no_optimize && i.vec_encoding != vex_encoding_evex
> +         && optimize && optimize_NDD_to_nonNDD (t))
>         {

Regardless of what the final expression is going to be, please keep the
check of "optimize" first, such that the common case of optimization
being disabled will be impacted as little as possible.

>>> +	{
>>> +	  t = current_templates->start - 1;
>>
>> As per a remark further up, this adjustment could be avoided if the ND templates
>> came ahead of the legacy ones. They can't be wrongly used in place of the
>> legacy ones, due to the extra operand they require. Then a comment here would
>> merely point out this ordering aspect. But of course care will then need to be
>> taken to not go past i386_optab[]'s bounds (by having suitably ordered
>> conditionals when looking for whether there is an alternative template in the
>> first place; again see the respective remark further up).
>>
> 
> Yes, if we reorder the templates, I will remove the line. Here is just one example of a possible implementation:
> 
>         }
> 
> +      bool have_converted_NDD_to_nonNDD = false;
> +      i386_insn tmp_i;
> +
> +      if (!i.no_optimize && i.vec_encoding != vex_encoding_evex
> +         && optimize && !have_converted_NDD_to_nonNDD
> +         && convert_NDD_to_nonNDD (t))
> +       {
> +         have_converted_NDD_to_nonNDD = true;
> +         tmp_i = i;
> +       }
> +
>        /* We've found a match; break out of loop.  */
>        break;
>      }
> @@ -7787,6 +7802,9 @@ match_template (char mnem_suffix)
>        return NULL;
>      }
> 
> +  if (have_converted_NDD_to_nonNDD)
> +    i = tmp_i;
> +
>    if (!quiet_warnings)

I have to admit that I don't understand what the goal is of this playing
with i and tmp_i.

>> For all of the changes below (which are a little hard to review in email), aiui they
>> only add C as needed. I once again would prefer if that attribute could be added
>> right as the templates are introduced, with the description stating the intention
>> and that the actual use of the attribute will be added later (i.e. as expressed
>> earlier already for NF).
> 
> After the changes are finalized, I'll break out the part of the modification that adds the C attribute and give it to Lili so she can put it where it belongs.

Hmm, that will need doing early, as the NDD patch is hopefully going to
land soon-ish. Same for the template re-ordering (which will need
explaining when pulled ahead, but it will want pulling ahead to reduce
churn).

Jan
  
Hu, Lin1 Nov. 14, 2023, 2:28 a.m. UTC | #4
> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Friday, November 10, 2023 5:54 PM
> To: Hu, Lin1 <lin1.hu@intel.com>
> Cc: Lu, Hongjiu <hongjiu.lu@intel.com>; ccoutant@gmail.com;
> binutils@sourceware.org; Cui, Lili <lili.cui@intel.com>
> Subject: Re: [PATCH 7/8] Support APX NDD optimized encoding.
> 
> On 10.11.2023 06:43, Hu, Lin1 wrote:
> >> On 02.11.2023 12:29, Cui, Lili wrote:
> 
> Btw, your shrinking of reply context above from here is problematic. Someone
> reading just this mail can't tell who ...
> 
> >> Similarly I'm concerned about the ND form of CFCMOVcc, which isn't there
> >> yet in the patches, but which will also need excluding from this
> >> optimization. Obviously this concern then extends to any future ND-
> >> encoded insns, which (likely) won't have legacy-encoded (and hence
> >> REX2-encodable) counterparts. Are there any plans how to deal with such?
> >> (There's a possible approach mentioned further down.)
> 
> ... originally said this.
>

Thanks for your advice.

> 
> > Looking at the other current NDD instructions, it should be possible to use
> > evex encoding even if an insn doesn't have a rex2 encoding.
> 
> Should be possible - yes. But why would you do such a transformation? That's
> not an optimization at all, afaict. And we shouldn't alter what the programmer
> wrote if the result isn't in at least some respect deemed better than the original.
> Considering this, the helper function may want further naming differently than
> already suggested, to e.g. convert_NDD_to_REX2().
>

Indeed.
 
>
> >>> +      unsigned int src2 = (i.operands > 3) ? i.operands - 3 : 0;
> >>> +
> >>> +      if (i.types[src1].bitfield.class == Reg
> >>> +	  && i.op[src1].regs == i.op[dest].regs)
> >>> +	readonly_var = src2;
> >>
> >> As can be seen in the testcase, this also results in ADCX/ADOX to be
> >> converted to non-ND EVEX forms, i.e. even when that's not a win at all.
> >> We shouldn't change what the user has written when the encoding
> >> doesn't actually improve. (Or else, but I'd be hesitant to accept
> >> that, at the very least the effect would need pointing out in the
> >> description or even a code comment, so that later on it is possible
> >> to figure out whether that was intentional or an
> >> oversight.)
> >>
> >> This is where my template ordering remark in reply to patch 5 comes into play:
> >> Whether invoking re-parse is okay would further need to depend on
> >> whether an alternative (earlier) template actually allows
> >> REX2 encoding (same base-opcode could be one of the criteria for how
> >> far to look back through earlier templates; an option might also be
> >> to put the 3- operand templates first, so that looking backwards
> >> wouldn't be necessary in the first place). This would then likely
> >> also address one of the forward looking concerns I've raised above.
> >>
> >
> > Indeed, adcx's legacy insn can't support rex2.
> >
> > For my problem, I prefer to re-order the templates, because I haven't
> > thought of a simple way to move t back to the farthest template with the
> > same base_opcode. The tentative ordering would be: NDD EVEX - REX2 -
> > EVEX.
> 
> Yes, this matches my understanding / expectation.
> 
> > And I will need a temporary variable in case the insn doesn't match the
> > REX2 template, letting me backtrack the match result and the value of i.
> 
> This, however, I'm not convinced of. I'd rather see this vaguely in line with
> 58bceb182740 ("x86: prefer VEX encodings over EVEX ones when
> possible"): Do another full matching round with the removed operand, arranging
> for "internal error" to be raised in case that fails. Your approach would, I think,
> result in silent bad code generation in case something went wrong. Thing is - you
> don't even need to advance (or
> backtrack) t in that case
>

I tried to reorder the templates and modify the code as follows:

@@ -7728,6 +7765,40 @@ match_template (char mnem_suffix)
          i.memshift = memshift;
        }

+      /* If we can optimize a NDD insn to non-NDD insn, like
+        add %r16, %r8, %r8 -> add %r16, %r8,
+        add  %r8, %r16, %r8 -> add %r16, %r8, then rematch template.
+        Note that the semantics have not been changed.  */
+      if (optimize
+         && !i.no_optimize
+         && i.vec_encoding != vex_encoding_evex
+         && t + 1 < current_templates->end
+         && !t[1].opcode_modifier.evex)
+       {
+         unsigned int readonly_var = convert_NDD_to_REX2 (t);
+         if (readonly_var != ~0)
+           {
+             if (!check_EgprOperands (t + 1))
+               {
+                 specific_error = progress (internal_error);
+                 continue;
+               }
+             ++i.operands;
+             ++i.reg_operands;
+             ++i.tm.operands;
+
+             if (readonly_var == 1)
+               swap_2_operands (0, 1);
+           }
+       }

convert_NDD_to_REX2 returns readonly_var now. check_EgprOperands aims to exclude insns like adcx and adox, because their opcode_space is legacy map2, which can't support rex2.
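
So the assumed interface at this point is (just a sketch; the full function body isn't shown in this mail):

      /* Return the index of the read-only source operand if the NDD
         insn can be converted, or (unsigned int) ~0 if it can't.  On
         success, i.operands and i.reg_operands have been decremented;
         the caller undoes this when it decides not to convert.  */
      static unsigned int convert_NDD_to_REX2 (const insn_template *t);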

And I need some modifications in tc-i386.c after reordering i386-opc.tbl.

diff --git a/gas/config/tc-i386.c b/gas/config/tc-i386.c
index 7a86aff1828..d98950c7dfd 100644
--- a/gas/config/tc-i386.c
+++ b/gas/config/tc-i386.c
@@ -14401,7 +14401,9 @@ static bool check_register (const reg_entry *r)

   if (r->reg_flags & RegRex2)
     {
-      if (is_evex_encoding (current_templates->start))
+      if (is_evex_encoding (current_templates->start)
+         && ((current_templates->start + 1 >= current_templates->end)
+             || (is_evex_encoding (current_templates->start + 1))))
        i.vec_encoding = vex_encoding_evex;

       if (!cpu_arch_flags.bitfield.cpuapx_f

What's your opinion?

> 
> >>> @@ -7728,6 +7766,14 @@ match_template (char mnem_suffix)
> >>>  	  i.memshift = memshift;
> >>>  	}
> >>>
> >>> +      /* If we can optimize a NDD insn to non-NDD insn, like
> >>> +	 add %r16, %r8, %r8 -> add %r16, %r8, then rematch template.  */
> >>> +      if (optimize == 1 && optimize_NDD_to_nonNDD (t))
> >>
> >> So you do this optimization at -O1, but not at -O2? Imo the "== 1"
> >> simply needs dropping. Furthermore the {nooptimize} and {evex} pseudo
> >> prefixes need respecting. Quite likely respecting {evex} would
> >> eliminate the need for the explicit .has_nf check in the helper
> >> function, as I expect .vec_encoding to be set alongside that bit
> >> anyway. Further quite likely respecting {evex} here will mean that in patch 3
> you need to introduce a new enumerator (e.g.
> >> vex_encoding_egpr, vaguely similar to vex_encoding_evex512), to avoid
> >> setting .vec_encoding to vex_encoding_evex when an eGPR is parsed.
> >>
> >> As to optimization level: In build_vex_prefix() we leverage C only at
> >> -O2 or higher (including -Os). We may want to be consistent in this
> >> regard here (i.e. by an extra check in the helper function).
> >>
> >
> > It's a mistake, I have fixed it; the condition will become the following. I will
> > try later, after the NF patch is done, to see whether the i.has_nf constraint
> > can be removed.
> >
> >        /* If we can optimize a NDD insn to non-NDD insn, like
> >          add %r16, %r8, %r8 -> add %r16, %r8, then rematch template.  */
> > -      if (optimize == 1 && optimize_NDD_to_nonNDD (t))
> > +      if (!i.no_optimize && i.vec_encoding != vex_encoding_evex
> > +         && optimize && optimize_NDD_to_nonNDD (t))
> >         {
> 
> Regardless of what the final expression is going to be, please keep the check of
> "optimize" first, such that the common case of optimization being disabled will
> be impacted as little as possible.
>

Okay.
 
>
> >>> +	{
> >>> +	  t = current_templates->start - 1;
> >>
> >> As per a remark further up, this adjustment could be avoided if the
> >> ND templates came ahead of the legacy ones. They can't be wrongly
> >> used in place of the legacy ones, due to the extra operand they
> >> require. Then a comment here would merely point out this ordering
> >> aspect. But of course care will then need to be taken to not go past
> >> i386_optab[]'s bounds (by having suitably ordered conditionals when
> >> looking for whether there is an alternative template in the first place; again
> see the respective remark further up).
> >>
> >
> > Yes, if we reorder the templates, I will remove the line. Here is just one
> > example of a possible implementation:
> >
> >         }
> >
> > +      bool have_converted_NDD_to_nonNDD = false;
> > +      i386_insn tmp_i;
> > +
> > +      if (!i.no_optimize && i.vec_encoding != vex_encoding_evex
> > +         && optimize && !have_converted_NDD_to_nonNDD
> > +         && convert_NDD_to_nonNDD (t))
> > +       {
> > +         have_converted_NDD_to_nonNDD = true;
> > +         tmp_i = i;
> > +       }
> > +
> >        /* We've found a match; break out of loop.  */
> >        break;
> >      }
> > @@ -7787,6 +7802,9 @@ match_template (char mnem_suffix)
> >        return NULL;
> >      }
> >
> > +  if (have_converted_NDD_to_nonNDD)
> > +    i = tmp_i;
> > +
> >    if (!quiet_warnings)
> 
> I have to admit that I don't understand what the goal is of this playing with i and
> tmp_i.
>

This variable has been removed in the new version, so there's no need to think about it.

> 
> >> For all of the changes below (which are a little hard to review in
> >> email), aiui they only add C as needed. I once again would prefer if
> >> that attribute could be added right as the templates are introduced,
> >> with the description stating the intention and that the actual use of
> >> the attribute will be added later (i.e. as expressed earlier already for NF).
> >
> > After the changes are finalized, I'll break out this part of the modification that
> adds the C to lili so she can put it where it belongs.
> 
> Hmm, that will need doing early, as the NDD patch is hopefully going to land
> soon-ish. Same for the template re-ordering (which will need explaining when
> pulled ahead, but it will want pulling ahead to reduce churn).
> 

OK. If there are no major problems with these changes, I'll give a patch to Lili (attached to this email for double-checking) and write in the ChangeLog the reason why we made these changes.

BRs,
Lin
  
Jan Beulich Nov. 14, 2023, 10:50 a.m. UTC | #5
On 14.11.2023 03:28, Hu, Lin1 wrote:
>> -----Original Message-----
>> From: Jan Beulich <jbeulich@suse.com>
>>
>> On 10.11.2023 06:43, Hu, Lin1 wrote:
>>>> On 02.11.2023 12:29, Cui, Lili wrote:
>>>>> +      unsigned int src2 = (i.operands > 3) ? i.operands - 3 : 0;
>>>>> +
>>>>> +      if (i.types[src1].bitfield.class == Reg
>>>>> +	  && i.op[src1].regs == i.op[dest].regs)
>>>>> +	readonly_var = src2;
>>>>
>>>> As can be seen in the testcase, this also results in ADCX/ADOX to be
>>>> converted to non-ND EVEX forms, i.e. even when that's not a win at all.
>>>> We shouldn't change what the user has written when the encoding
>>>> doesn't actually improve. (Or else, but I'd be hesitant to accept
>>>> that, at the very least the effect would need pointing out in the
>>>> description or even a code comment, so that later on it is possible
>>>> to figure out whether that was intentional or an
>>>> oversight.)
>>>>
>>>> This is where my template ordering remark in reply to patch 5 comes into play:
>>>> Whether invoking re-parse is okay would further need to depend on
>>>> whether an alternative (earlier) template actually allows
>>>> REX2 encoding (same base-opcode could be one of the criteria for how
>>>> far to look back through earlier templates; an option might also be
>>>> to put the 3- operand templates first, so that looking backwards
>>>> wouldn't be necessary in the first place). This would then likely
>>>> also address one of the forward looking concerns I've raised above.
>>>>
>>>
>>> Indeed, adcx's legacy insn can't support rex2.
>>>
>>> For my problem, I prefer to re-order the templates, because I haven't
>>> thought of a simple way to move t back to the farthest template with the
>>> same base_opcode. The tentative ordering would be: NDD EVEX - REX2 -
>>> EVEX.
>>
>> Yes, this matches my understanding / expectation.
>>
>>> And I will need a temporary variable in case the insn doesn't match the
>>> REX2 template, letting me backtrack the match result and the value of i.
>>
>> This, however, I'm not convinced of. I'd rather see this vaguely in line with
>> 58bceb182740 ("x86: prefer VEX encodings over EVEX ones when
>> possible"): Do another full matching round with the removed operand, arranging
>> for "internal error" to be raised in case that fails. Your approach would, I think,
>> result in silent bad code generation in case something went wrong. Thing is - you
>> don't even need to advance (or
>> backtrack) t in that case
>>
> 
> I tried to reorder the templates and modify the code as follows:
> 
> @@ -7728,6 +7765,40 @@ match_template (char mnem_suffix)
>           i.memshift = memshift;
>         }
> 
> +      /* If we can optimize a NDD insn to non-NDD insn, like
> +        add %r16, %r8, %r8 -> add %r16, %r8,
> +        add  %r8, %r16, %r8 -> add %r16, %r8, then rematch template.
> +        Note that the semantics have not been changed.  */
> +      if (optimize
> +         && !i.no_optimize
> +         && i.vec_encoding != vex_encoding_evex
> +         && t + 1 < current_templates->end
> +         && !t[1].opcode_modifier.evex)
> +       {
> +         unsigned int readonly_var = convert_NDD_to_REX2 (t);
> +         if (readonly_var != ~0)
> +           {
> +             if (!check_EgprOperands (t + 1))
> +               {
> +                 specific_error = progress (internal_error);
> +                 continue;
> +               }
> +             ++i.operands;
> +             ++i.reg_operands;

DYM decrement rather than increment for these? We're trying to go from
3 to 2 operands, after all.

> +             ++i.tm.operands;

Why is this? Aren't we ahead of filling i.tm here?

> +
> +             if (readonly_var == 1)
> +               swap_2_operands (0, 1);
> +           }
> +       }
> 
> convert_NDD_to_REX2 returns readonly_var now. check_EgprOperands aims to exclude insns like adcx and adox, because their opcode_space is legacy map2, which can't support rex2.

Good. Looking forward to seeing the full change.

> And I need some modifications in tc-i386.c after reordering i386-opc.tbl.
> 
> diff --git a/gas/config/tc-i386.c b/gas/config/tc-i386.c
> index 7a86aff1828..d98950c7dfd 100644
> --- a/gas/config/tc-i386.c
> +++ b/gas/config/tc-i386.c
> @@ -14401,7 +14401,9 @@ static bool check_register (const reg_entry *r)
> 
>    if (r->reg_flags & RegRex2)
>      {
> -      if (is_evex_encoding (current_templates->start))
> +      if (is_evex_encoding (current_templates->start)
> +         && ((current_templates->start + 1 >= current_templates->end)
> +             || (is_evex_encoding (current_templates->start + 1))))
>         i.vec_encoding = vex_encoding_evex;
> 
>        if (!cpu_arch_flags.bitfield.cpuapx_f
> 
> What's your opinion?

See my comments to Lili on the original code already (which you further
modify here). There cannot be a dependency on current_templates here,
imo. Lili - the fact that Lin needs the modification above actually looks
to support my view on this.

Jan
  
Hu, Lin1 Nov. 15, 2023, 2:52 a.m. UTC | #6
> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Tuesday, November 14, 2023 6:51 PM
> To: Hu, Lin1 <lin1.hu@intel.com>; Cui, Lili <lili.cui@intel.com>
> Cc: Lu, Hongjiu <hongjiu.lu@intel.com>; ccoutant@gmail.com;
> binutils@sourceware.org
> Subject: Re: [PATCH 7/8] Support APX NDD optimized encoding.
> 
> On 14.11.2023 03:28, Hu, Lin1 wrote:
> >> -----Original Message-----
> >> From: Jan Beulich <jbeulich@suse.com>
> >>
> >> On 10.11.2023 06:43, Hu, Lin1 wrote:
> >>>> On 02.11.2023 12:29, Cui, Lili wrote:
> >>>>> +      unsigned int src2 = (i.operands > 3) ? i.operands - 3 : 0;
> >>>>> +
> >>>>> +      if (i.types[src1].bitfield.class == Reg
> >>>>> +	  && i.op[src1].regs == i.op[dest].regs)
> >>>>> +	readonly_var = src2;
> >>>>
> >>>> As can be seen in the testcase, this also results in ADCX/ADOX to
> >>>> be converted to non-ND EVEX forms, i.e. even when that's not a win at all.
> >>>> We shouldn't change what the user has written when the encoding
> >>>> doesn't actually improve. (Or else, but I'd be hesitant to accept
> >>>> that, at the very least the effect would need pointing out in the
> >>>> description or even a code comment, so that later on it is possible
> >>>> to figure out whether that was intentional or an
> >>>> oversight.)
> >>>>
> >>>> This is where my template ordering remark in reply to patch 5 comes into
> play:
> >>>> Whether invoking re-parse is okay would further need to depend on
> >>>> whether an alternative (earlier) template actually allows
> >>>> REX2 encoding (same base-opcode could be one of the criteria for
> >>>> how far to look back through earlier templates; an option might
> >>>> also be to put the 3- operand templates first, so that looking
> >>>> backwards wouldn't be necessary in the first place). This would
> >>>> then likely also address one of the forward looking concerns I've raised
> above.
> >>>>
> >>>
> >>> Indeed, adcx's legacy insn can't support rex2.
> >>>
> >>> For my problem, I prefer to re-order the templates, because I haven't
> >>> thought of a simple way to move t back to the farthest template with
> >>> the same base_opcode. The tentative ordering would be: NDD EVEX -
> >>> REX2 - EVEX.
> >>
> >> Yes, this matches my understanding / expectation.
> >>
> >>> And I will need a temporary variable in case the insn doesn't match
> >>> the REX2 template, letting me backtrack the match result and the value
> >>> of i.
> >>
> >> This, however, I'm not convinced of. I'd rather see this vaguely in
> >> line with
> >> 58bceb182740 ("x86: prefer VEX encodings over EVEX ones when
> >> possible"): Do another full matching round with the removed operand,
> >> arranging for "internal error" to be raised in case that fails. Your
> >> approach would, I think, result in silent bad code generation in case
> >> something went wrong. Thing is - you don't even need to advance (or
> >> backtrack) t in that case
> >>
> >
> > I tried to reorder the templates and modify the code as follows:
> >
> > @@ -7728,6 +7765,40 @@ match_template (char mnem_suffix)
> >           i.memshift = memshift;
> >         }
> >
> > +      /* If we can optimize a NDD insn to non-NDD insn, like
> > +        add %r16, %r8, %r8 -> add %r16, %r8,
> > +        add  %r8, %r16, %r8 -> add %r16, %r8, then rematch template.
> > +        Note that the semantics have not been changed.  */
> > +      if (optimize
> > +         && !i.no_optimize
> > +         && i.vec_encoding != vex_encoding_evex
> > +         && t + 1 < current_templates->end
> > +         && !t[1].opcode_modifier.evex)
> > +       {
> > +         unsigned int readonly_var = convert_NDD_to_REX2 (t);
> > +         if (readonly_var != ~0)
> > +           {
> > +             if (!check_EgprOperands (t + 1))
> > +               {
> > +                 specific_error = progress (internal_error);
> > +                 continue;
> > +               }
> > +             ++i.operands;
> > +             ++i.reg_operands;
> 
> DYM decrement rather than increment for these? We're trying to go from
> 3 to 2 operands, after all.
>

This backtracking is to allow for instructions in other opcode spaces (0f38, ...) that can't accept the r16+ registers but can accept the other rex registers or the normal ones. I decrement i.operands and i.reg_operands in convert_NDD_to_REX2; if the legacy or rex version of the insn can't support rex2 registers, I won't optimize it, so I need to increment these again.

> 
> > +             ++i.tm.operands;
> 
> Why is this? Aren't we ahead of filling i.tm here?
>

Indeed. I removed it.

> 
> > +
> > +             if (readonly_var == 1)
> > +               swap_2_operands (0, 1);
> > +           }
> > +       }
> >
> > convert_NDD_to_REX2 returns readonly_var now. check_EgprOperands aims
> > to exclude insns like adcx and adox, because their opcode_space is legacy
> > map2, which can't support rex2.
> 
> Good. Looking forward to seeing the full change.
> 

For some insns like adcx and adox, I'd like to add some details. check_EgprOperands is only used to exclude the situation where these insns have gpr32 registers. If we think about the optimization in terms of encoding length: is it safe to assume that an insn with a 66, f2 or f3 prefix whose opcode_space isn't legacy map0 or map1 won't see its encoding length reduced even if it is optimized? If yes, I think the code can be simplified like:

       /* If we can optimize a NDD insn to non-NDD insn, like
          add %r16, %r8, %r8 -> add %r16, %r8,
           add  %r8, %r16, %r8 -> add %r16, %r8, then rematch template.
           Note that the semantics have not been changed.  */
        if (optimize
            && !i.no_optimize
            && i.vec_encoding != vex_encoding_evex
            && t + 1 < current_templates->end
            && !t[1].opcode_modifier.evex
            && convert_NDD_to_REX2 (t))
          {
            specific_error = progress (internal_error);
            continue;
          }  
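
Spelled out, the assumption in question would amount to a filter along these lines (a sketch only; the SPACE_* and PREFIX_* names are as in current tc-i386.c, and whether the assumption actually holds is exactly the question above):

            /* Assumption under discussion: only insns whose legacy form
               lives in map0/map1 and needs no mandatory 66/F2/F3 prefix
               can shrink relative to the EVEX encoding.  */
            && (t[1].opcode_space == SPACE_BASE
                || t[1].opcode_space == SPACE_0F)
            && t[1].opcode_modifier.opcodeprefix == PREFIX_NONE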

For those instructions that don't need to be optimized, like adcx and adox, we just don't swap the order, so we don't need check_EgprOperands or the backtracking, and convert_NDD_to_REX2 has the same return value as before.

PS: shouldn't the name of the function then be convert_NDD_to_legacy?

>
> > And I need some modifications in tc-i386.c after reordering i386-opc.tbl.
> >
> > diff --git a/gas/config/tc-i386.c b/gas/config/tc-i386.c index
> > 7a86aff1828..d98950c7dfd 100644
> > --- a/gas/config/tc-i386.c
> > +++ b/gas/config/tc-i386.c
> > @@ -14401,7 +14401,9 @@ static bool check_register (const reg_entry
> > *r)
> >
> >    if (r->reg_flags & RegRex2)
> >      {
> > -      if (is_evex_encoding (current_templates->start))
> > +      if (is_evex_encoding (current_templates->start)
> > +         && ((current_templates->start + 1 >= current_templates->end)
> > +             || (is_evex_encoding (current_templates->start + 1))))
> >         i.vec_encoding = vex_encoding_evex;
> >
> >        if (!cpu_arch_flags.bitfield.cpuapx_f
> >
> > What's your opinion?
> 
> See my comments to Lili on already the original code (which you further
> modify) here. There cannot be a dependency on current_templates here, imo.
> Lili - the fact Lin needs the modification above actually looks to support my view
> on this.

This part of the code won't be in the NDD optimize patch but rather should be in the NDD patch. So I'd like to skip that part of the discussion for now until I get a new base branch.

Although I still have some questions about the previous discussion, I will send out the current patch.

BRs,
Lin
  
Jan Beulich Nov. 15, 2023, 8:57 a.m. UTC | #7
On 15.11.2023 03:52, Hu, Lin1 wrote:
>> -----Original Message-----
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: Tuesday, November 14, 2023 6:51 PM
>>
>> On 14.11.2023 03:28, Hu, Lin1 wrote:
>>> @ -7728,6 +7765,40 @@ match_template (char mnem_suffix)
>>>           i.memshift = memshift;
>>>         }
>>>
>>> +      /* If we can optimize a NDD insn to non-NDD insn, like
>>> +        add %r16, %r8, %r8 -> add %r16, %r8,
>>> +        add  %r8, %r16, %r8 -> add %r16, %r8, then rematch template.
>>> +        Note that the semantics have not been changed.  */
>>> +      if (optimize
>>> +         && !i.no_optimize
>>> +         && i.vec_encoding != vex_encoding_evex
>>> +         && t + 1 < current_templates->end
>>> +         && !t[1].opcode_modifier.evex)
>>> +       {
>>> +         unsigned int readonly_var = convert_NDD_to_REX2 (t);
>>> +         if (readonly_var != ~0)
>>> +           {
>>> +             if (!check_EgprOperands (t + 1))
>>> +               {
>>> +                 specific_error = progress (internal_error);
>>> +                 continue;
>>> +               }
>>> +             ++i.operands;
>>> +             ++i.reg_operands;
>>
>> DYM decrement rather than increment for these? We're trying to go from
>> 3 to 2 operands, after all.
>>
> 
> The backtracking here is for instructions in other opcode spaces (0f38, ...) that can't accept the r16+ registers but can accept the other REX registers or the normal ones. I decrement i.operands and i.reg_operands in convert_NDD_to_REX2. If the legacy or REX form of the insn can't support the REX2 registers, I won't optimize it, so I need to increment them again.

Okay, I need to see the full patch for this. Incrementing to undo earlier
decrementing still looks suspicious to me (for now).

>>> +
>>> +             if (readonly_var == 1)
>>> +               swap_2_operands (0, 1);
>>> +           }
>>> +       }
>>>
>>> convert_NDD_to_REX2 returns readonly_var now. check_EgprOperands aims
>>> to exclude some insns like adcx and adox, because their opcode_space is
>>> legacy map 2, which rex2 can't encode.
>>
>> Good. Looking forward to seeing the full change.
>>
> 
> For insns like adcx and adox, I'd like to add some details. check_EgprOperands is only used to exclude the cases where these insns are used with gpr32 registers. Thinking about the optimization in terms of encoding length: is it safe to assume that an insn which carries a 66, f2 or f3 prefix and whose opcode_space is neither legacy map 0 nor legacy map 1 won't become shorter even if it is optimized?

Well, no, not always. See my other reply regarding 32-bit ADCX/ADOX.

> If yes, I think the code can be simplified like:
> 
>        /* If we can optimize an NDD insn to a non-NDD insn, like
>           add %r16, %r8, %r8 -> add %r16, %r8,
>           add %r8, %r16, %r8 -> add %r16, %r8, then rematch the template.
>           Note that the semantics have not been changed.  */
>        if (optimize
>            && !i.no_optimize
>            && i.vec_encoding != vex_encoding_evex
>            && t + 1 < current_templates->end
>            && !t[1].opcode_modifier.evex
>            && convert_NDD_to_REX2 (t))
>          {
>            specific_error = progress (internal_error);
>            continue;
>          }
> 
> For instructions that shouldn't be optimized, like adcx and adox, we simply don't swap the operand order, so we need neither check_EgprOperands nor the backtracking, and convert_NDD_to_REX2 keeps the same (boolean) return value as before.
> 
> PS: Shouldn't the function then be named convert_NDD_to_legacy?

Perhaps yes, if the result can also be non-REX2 encodings.

Jan
  

Patch

diff --git a/gas/config/tc-i386.c b/gas/config/tc-i386.c
index 7a86aff1828..787108cedc8 100644
--- a/gas/config/tc-i386.c
+++ b/gas/config/tc-i386.c
@@ -7208,6 +7208,44 @@  check_EgprOperands (const insn_template *t)
   return 0;
 }
 
+/* Optimize APX NDD insns to non-NDD insns.  */
+
+static bool
+optimize_NDD_to_nonNDD (const insn_template *t)
+{
+  if (t->opcode_modifier.vexvvvv == VexVVVV_DST
+      && t->opcode_space == SPACE_EVEXMAP4
+      && !i.has_nf
+      && i.reg_operands >= 2
+      && i.types[i.operands - 1].bitfield.class == Reg)
+    {
+      unsigned int readonly_var = ~0;
+      unsigned int dest = i.operands - 1;
+      unsigned int src1 = (i.operands > 2) ? i.operands - 2 : 0;
+      unsigned int src2 = (i.operands > 3) ? i.operands - 3 : 0;
+
+      if (i.types[src1].bitfield.class == Reg
+	  && i.op[src1].regs == i.op[dest].regs)
+	readonly_var = src2;
+      /* adcx, adox and imul don't have D bit.  */
+      else if (i.types[src2].bitfield.class == Reg
+	       && i.op[src2].regs == i.op[dest].regs
+	       && t->opcode_modifier.commutative)
+	readonly_var = src1;
+      if (readonly_var != (unsigned int) ~0)
+	{
+	  --i.operands;
+	  --i.reg_operands;
+	  --i.tm.operands;
+
+	  if (readonly_var != src2)
+	    swap_2_operands (readonly_var, src2);
+	  return 1;
+	}
+    }
+  return 0;
+}
+
 /* Helper function for the progress() macro in match_template().  */
 static INLINE enum i386_error progress (enum i386_error new,
 					enum i386_error last,
@@ -7728,6 +7766,14 @@  match_template (char mnem_suffix)
 	  i.memshift = memshift;
 	}
 
+      /* If we can optimize an NDD insn to a non-NDD insn, like
+	 add %r16, %r8, %r8 -> add %r16, %r8, then rematch the template.  */
+      if (optimize == 1 && optimize_NDD_to_nonNDD (t))
+	{
+	  t = current_templates->start - 1;
+	  continue;
+	}
+
       /* We've found a match; break out of loop.  */
       break;
     }
diff --git a/gas/testsuite/gas/i386/x86-64-apx-ndd-optimize.d b/gas/testsuite/gas/i386/x86-64-apx-ndd-optimize.d
new file mode 100644
index 00000000000..f23b2b127b6
--- /dev/null
+++ b/gas/testsuite/gas/i386/x86-64-apx-ndd-optimize.d
@@ -0,0 +1,124 @@ 
+#as: -O1
+#objdump: -drw
+#name: x86-64 APX NDD optimized encoding
+#source: x86-64-apx-ndd-optimize.s
+
+.*: +file format .*
+
+
+Disassembly of section .text:
+
+0+ <_start>:
+\s*[a-f0-9]+:\s*d5 19 ff c7          	inc    %r31
+\s*[a-f0-9]+:\s*d5 11 fe c7          	inc    %r31b
+\s*[a-f0-9]+:\s*d5 4d 01 f8          	add    %r31,%r8
+\s*[a-f0-9]+:\s*d5 45 00 f8          	add    %r31b,%r8b
+\s*[a-f0-9]+:\s*d5 4d 01 f8          	add    %r31,%r8
+\s*[a-f0-9]+:\s*d5 1d 03 c7          	add    %r31,%r8
+\s*[a-f0-9]+:\s*d5 4d 03 38          	add    \(%r8\),%r31
+\s*[a-f0-9]+:\s*d5 1d 03 07          	add    \(%r31\),%r8
+\s*[a-f0-9]+:\s*49 81 c7 33 44 34 12 	add    \$0x12344433,%r15
+\s*[a-f0-9]+:\s*49 81 c0 11 22 33 f4 	add    \$0xfffffffff4332211,%r8
+\s*[a-f0-9]+:\s*d5 18 ff c9          	dec    %r17
+\s*[a-f0-9]+:\s*d5 10 fe c9          	dec    %r17b
+\s*[a-f0-9]+:\s*d5 18 f7 d1          	not    %r17
+\s*[a-f0-9]+:\s*d5 10 f6 d1          	not    %r17b
+\s*[a-f0-9]+:\s*d5 18 f7 d9          	neg    %r17
+\s*[a-f0-9]+:\s*d5 10 f6 d9          	neg    %r17b
+\s*[a-f0-9]+:\s*d5 1c 29 f9          	sub    %r15,%r17
+\s*[a-f0-9]+:\s*d5 14 28 f9          	sub    %r15b,%r17b
+\s*[a-f0-9]+:\s*62 54 84 18 29 38    	sub    %r15,\(%r8\),%r15
+\s*[a-f0-9]+:\s*d5 49 2b 04 07       	sub    \(%r15,%rax,1\),%r16
+\s*[a-f0-9]+:\s*d5 19 81 ee 34 12 00 00 	sub    \$0x1234,%r30
+\s*[a-f0-9]+:\s*d5 1c 19 f9          	sbb    %r15,%r17
+\s*[a-f0-9]+:\s*d5 14 18 f9          	sbb    %r15b,%r17b
+\s*[a-f0-9]+:\s*62 54 84 18 19 38    	sbb    %r15,\(%r8\),%r15
+\s*[a-f0-9]+:\s*d5 49 1b 04 07       	sbb    \(%r15,%rax,1\),%r16
+\s*[a-f0-9]+:\s*d5 19 81 de 34 12 00 00 	sbb    \$0x1234,%r30
+\s*[a-f0-9]+:\s*d5 1c 11 f9          	adc    %r15,%r17
+\s*[a-f0-9]+:\s*d5 14 10 f9          	adc    %r15b,%r17b
+\s*[a-f0-9]+:\s*4d 13 38             	adc    \(%r8\),%r15
+\s*[a-f0-9]+:\s*d5 49 13 04 07       	adc    \(%r15,%rax,1\),%r16
+\s*[a-f0-9]+:\s*d5 19 81 d6 34 12 00 00 	adc    \$0x1234,%r30
+\s*[a-f0-9]+:\s*d5 1c 09 f9          	or     %r15,%r17
+\s*[a-f0-9]+:\s*d5 14 08 f9          	or     %r15b,%r17b
+\s*[a-f0-9]+:\s*4d 0b 38             	or     \(%r8\),%r15
+\s*[a-f0-9]+:\s*d5 49 0b 04 07       	or     \(%r15,%rax,1\),%r16
+\s*[a-f0-9]+:\s*d5 19 81 ce 34 12 00 00 	or     \$0x1234,%r30
+\s*[a-f0-9]+:\s*d5 1c 31 f9          	xor    %r15,%r17
+\s*[a-f0-9]+:\s*d5 14 30 f9          	xor    %r15b,%r17b
+\s*[a-f0-9]+:\s*4d 33 38             	xor    \(%r8\),%r15
+\s*[a-f0-9]+:\s*d5 49 33 04 07       	xor    \(%r15,%rax,1\),%r16
+\s*[a-f0-9]+:\s*d5 19 81 f6 34 12 00 00 	xor    \$0x1234,%r30
+\s*[a-f0-9]+:\s*d5 1c 21 f9          	and    %r15,%r17
+\s*[a-f0-9]+:\s*d5 14 20 f9          	and    %r15b,%r17b
+\s*[a-f0-9]+:\s*4d 23 38             	and    \(%r8\),%r15
+\s*[a-f0-9]+:\s*d5 49 23 04 07       	and    \(%r15,%rax,1\),%r16
+\s*[a-f0-9]+:\s*d5 11 81 e6 34 12 00 00 	and    \$0x1234,%r30d
+\s*[a-f0-9]+:\s*d5 19 d1 cf          	ror    %r31
+\s*[a-f0-9]+:\s*d5 11 d0 cf          	ror    %r31b
+\s*[a-f0-9]+:\s*49 c1 cc 02          	ror    \$0x2,%r12
+\s*[a-f0-9]+:\s*41 c0 cc 02          	ror    \$0x2,%r12b
+\s*[a-f0-9]+:\s*d5 19 d1 c7          	rol    %r31
+\s*[a-f0-9]+:\s*d5 11 d0 c7          	rol    %r31b
+\s*[a-f0-9]+:\s*49 c1 c4 02          	rol    \$0x2,%r12
+\s*[a-f0-9]+:\s*41 c0 c4 02          	rol    \$0x2,%r12b
+\s*[a-f0-9]+:\s*d5 19 d1 df          	rcr    %r31
+\s*[a-f0-9]+:\s*d5 11 d0 df          	rcr    %r31b
+\s*[a-f0-9]+:\s*49 c1 dc 02          	rcr    \$0x2,%r12
+\s*[a-f0-9]+:\s*41 c0 dc 02          	rcr    \$0x2,%r12b
+\s*[a-f0-9]+:\s*d5 19 d1 d7          	rcl    %r31
+\s*[a-f0-9]+:\s*d5 11 d0 d7          	rcl    %r31b
+\s*[a-f0-9]+:\s*49 c1 d4 02          	rcl    \$0x2,%r12
+\s*[a-f0-9]+:\s*41 c0 d4 02          	rcl    \$0x2,%r12b
+\s*[a-f0-9]+:\s*d5 19 d1 e7          	shl    %r31
+\s*[a-f0-9]+:\s*d5 11 d0 e7          	shl    %r31b
+\s*[a-f0-9]+:\s*49 c1 e4 02          	shl    \$0x2,%r12
+\s*[a-f0-9]+:\s*41 c0 e4 02          	shl    \$0x2,%r12b
+\s*[a-f0-9]+:\s*d5 19 d1 ff          	sar    %r31
+\s*[a-f0-9]+:\s*d5 11 d0 ff          	sar    %r31b
+\s*[a-f0-9]+:\s*49 c1 fc 02          	sar    \$0x2,%r12
+\s*[a-f0-9]+:\s*41 c0 fc 02          	sar    \$0x2,%r12b
+\s*[a-f0-9]+:\s*d5 19 d1 e7          	shl    %r31
+\s*[a-f0-9]+:\s*d5 11 d0 e7          	shl    %r31b
+\s*[a-f0-9]+:\s*49 c1 e4 02          	shl    \$0x2,%r12
+\s*[a-f0-9]+:\s*41 c0 e4 02          	shl    \$0x2,%r12b
+\s*[a-f0-9]+:\s*d5 19 d1 ef          	shr    %r31
+\s*[a-f0-9]+:\s*d5 11 d0 ef          	shr    %r31b
+\s*[a-f0-9]+:\s*49 c1 ec 02          	shr    \$0x2,%r12
+\s*[a-f0-9]+:\s*41 c0 ec 02          	shr    \$0x2,%r12b
+\s*[a-f0-9]+:\s*62 74 9c 18 24 20 01 	shld   \$0x1,%r12,\(%rax\),%r12
+\s*[a-f0-9]+:\s*4d 0f a4 c4 02       	shld   \$0x2,%r8,%r12
+\s*[a-f0-9]+:\s*62 74 b4 18 a5 08    	shld   %cl,%r9,\(%rax\),%r9
+\s*[a-f0-9]+:\s*d5 9c a5 e0          	shld   %cl,%r12,%r16
+\s*[a-f0-9]+:\s*62 7c 94 18 a5 2c 83 	shld   %cl,%r13,\(%r19,%rax,4\),%r13
+\s*[a-f0-9]+:\s*62 74 9c 18 2c 20 01 	shrd   \$0x1,%r12,\(%rax\),%r12
+\s*[a-f0-9]+:\s*4d 0f ac ec 01       	shrd   \$0x1,%r13,%r12
+\s*[a-f0-9]+:\s*62 74 b4 18 ad 08    	shrd   %cl,%r9,\(%rax\),%r9
+\s*[a-f0-9]+:\s*d5 9c ad e0          	shrd   %cl,%r12,%r16
+\s*[a-f0-9]+:\s*62 7c 94 18 ad 2c 83 	shrd   %cl,%r13,\(%r19,%rax,4\),%r13
+\s*[a-f0-9]+:\s*66 4d 0f 38 f6 c7    	adcx   %r15,%r8
+\s*[a-f0-9]+:\s*62 14 f9 08 66 04 3f 	adcx   \(%r15,%r31,1\),%r8
+\s*[a-f0-9]+:\s*66 4d 0f 38 f6 c1    	adcx   %r9,%r8
+\s*[a-f0-9]+:\s*f3 4d 0f 38 f6 c7    	adox   %r15,%r8
+\s*[a-f0-9]+:\s*62 14 fa 08 66 04 3f 	adox   \(%r15,%r31,1\),%r8
+\s*[a-f0-9]+:\s*f3 4d 0f 38 f6 c1    	adox   %r9,%r8
+\s*[a-f0-9]+:\s*67 0f 40 90 90 90 90 90 	cmovo  -0x6f6f6f70\(%eax\),%edx
+\s*[a-f0-9]+:\s*67 0f 41 90 90 90 90 90 	cmovno -0x6f6f6f70\(%eax\),%edx
+\s*[a-f0-9]+:\s*67 0f 42 90 90 90 90 90 	cmovb  -0x6f6f6f70\(%eax\),%edx
+\s*[a-f0-9]+:\s*67 0f 43 90 90 90 90 90 	cmovae -0x6f6f6f70\(%eax\),%edx
+\s*[a-f0-9]+:\s*67 0f 44 90 90 90 90 90 	cmove  -0x6f6f6f70\(%eax\),%edx
+\s*[a-f0-9]+:\s*67 0f 45 90 90 90 90 90 	cmovne -0x6f6f6f70\(%eax\),%edx
+\s*[a-f0-9]+:\s*67 0f 46 90 90 90 90 90 	cmovbe -0x6f6f6f70\(%eax\),%edx
+\s*[a-f0-9]+:\s*67 0f 47 90 90 90 90 90 	cmova  -0x6f6f6f70\(%eax\),%edx
+\s*[a-f0-9]+:\s*67 0f 48 90 90 90 90 90 	cmovs  -0x6f6f6f70\(%eax\),%edx
+\s*[a-f0-9]+:\s*67 0f 49 90 90 90 90 90 	cmovns -0x6f6f6f70\(%eax\),%edx
+\s*[a-f0-9]+:\s*67 0f 4a 90 90 90 90 90 	cmovp  -0x6f6f6f70\(%eax\),%edx
+\s*[a-f0-9]+:\s*67 0f 4b 90 90 90 90 90 	cmovnp -0x6f6f6f70\(%eax\),%edx
+\s*[a-f0-9]+:\s*67 0f 4c 90 90 90 90 90 	cmovl  -0x6f6f6f70\(%eax\),%edx
+\s*[a-f0-9]+:\s*67 0f 4d 90 90 90 90 90 	cmovge -0x6f6f6f70\(%eax\),%edx
+\s*[a-f0-9]+:\s*67 0f 4e 90 90 90 90 90 	cmovle -0x6f6f6f70\(%eax\),%edx
+\s*[a-f0-9]+:\s*67 0f 4f 90 90 90 90 90 	cmovg  -0x6f6f6f70\(%eax\),%edx
+\s*[a-f0-9]+:\s*67 0f af 90 09 09 09 00 	imul   0x90909\(%eax\),%edx
+\s*[a-f0-9]+:\s*d5 aa af 94 f8 09 09 00 00 	imul   0x909\(%rax,%r31,8\),%rdx
+\s*[a-f0-9]+:\s*48 0f af d0          	imul   %rax,%rdx
diff --git a/gas/testsuite/gas/i386/x86-64-apx-ndd-optimize.s b/gas/testsuite/gas/i386/x86-64-apx-ndd-optimize.s
new file mode 100644
index 00000000000..1b5cc94757d
--- /dev/null
+++ b/gas/testsuite/gas/i386/x86-64-apx-ndd-optimize.s
@@ -0,0 +1,117 @@ 
+# Check 64bit APX NDD instructions with optimized encoding
+
+	.text
+_start:
+inc    %r31,%r31
+incb   %r31b,%r31b
+add    %r31,%r8,%r8
+addb   %r31b,%r8b,%r8b
+{store} add    %r31,%r8,%r8
+{load}  add    %r31,%r8,%r8
+add    %r31,(%r8),%r31
+add    (%r31),%r8,%r8
+add    $0x12344433,%r15,%r15
+add    $0xfffffffff4332211,%r8,%r8
+dec    %r17,%r17
+decb   %r17b,%r17b
+not    %r17,%r17
+notb   %r17b,%r17b
+neg    %r17,%r17
+negb   %r17b,%r17b
+sub    %r15,%r17,%r17
+subb   %r15b,%r17b,%r17b
+sub    %r15,(%r8),%r15
+sub    (%r15,%rax,1),%r16,%r16
+sub    $0x1234,%r30,%r30
+sbb    %r15,%r17,%r17
+sbbb   %r15b,%r17b,%r17b
+sbb    %r15,(%r8),%r15
+sbb    (%r15,%rax,1),%r16,%r16
+sbb    $0x1234,%r30,%r30
+adc    %r15,%r17,%r17
+adcb   %r15b,%r17b,%r17b
+adc    %r15,(%r8),%r15
+adc    (%r15,%rax,1),%r16,%r16
+adc    $0x1234,%r30,%r30
+or     %r15,%r17,%r17
+orb    %r15b,%r17b,%r17b
+or     %r15,(%r8),%r15
+or     (%r15,%rax,1),%r16,%r16
+or     $0x1234,%r30,%r30
+xor    %r15,%r17,%r17
+xorb   %r15b,%r17b,%r17b
+xor    %r15,(%r8),%r15
+xor    (%r15,%rax,1),%r16,%r16
+xor    $0x1234,%r30,%r30
+and    %r15,%r17,%r17
+andb   %r15b,%r17b,%r17b
+and    %r15,(%r8),%r15
+and    (%r15,%rax,1),%r16,%r16
+and    $0x1234,%r30,%r30
+ror    %r31,%r31
+rorb   %r31b,%r31b
+ror    $0x2,%r12,%r12
+rorb   $0x2,%r12b,%r12b
+rol    %r31,%r31
+rolb   %r31b,%r31b
+rol    $0x2,%r12,%r12
+rolb   $0x2,%r12b,%r12b
+rcr    %r31,%r31
+rcrb   %r31b,%r31b
+rcr    $0x2,%r12,%r12
+rcrb   $0x2,%r12b,%r12b
+rcl    %r31,%r31
+rclb   %r31b,%r31b
+rcl    $0x2,%r12,%r12
+rclb   $0x2,%r12b,%r12b
+shl    %r31,%r31
+shlb   %r31b,%r31b
+shl    $0x2,%r12,%r12
+shlb   $0x2,%r12b,%r12b
+sar    %r31,%r31
+sarb   %r31b,%r31b
+sar    $0x2,%r12,%r12
+sarb   $0x2,%r12b,%r12b
+shl    %r31,%r31
+shlb   %r31b,%r31b
+shl    $0x2,%r12,%r12
+shlb   $0x2,%r12b,%r12b
+shr    %r31,%r31
+shrb   %r31b,%r31b
+shr    $0x2,%r12,%r12
+shrb   $0x2,%r12b,%r12b
+shld   $0x1,%r12,(%rax),%r12
+shld   $0x2,%r8,%r12,%r12
+shld   %cl,%r9,(%rax),%r9
+shld   %cl,%r12,%r16,%r16
+shld   %cl,%r13,(%r19,%rax,4),%r13
+shrd   $0x1,%r12,(%rax),%r12
+shrd   $0x1,%r13,%r12,%r12
+shrd   %cl,%r9,(%rax),%r9
+shrd   %cl,%r12,%r16,%r16
+shrd   %cl,%r13,(%r19,%rax,4),%r13
+adcx   %r15,%r8,%r8
+adcx   (%r15,%r31,1),%r8,%r8
+adcx   %r8,%r9,%r8
+adox   %r15,%r8,%r8
+adox   (%r15,%r31,1),%r8,%r8
+adox   %r8,%r9,%r8
+cmovo  0x90909090(%eax),%edx,%edx
+cmovno 0x90909090(%eax),%edx,%edx
+cmovb  0x90909090(%eax),%edx,%edx
+cmovae 0x90909090(%eax),%edx,%edx
+cmove  0x90909090(%eax),%edx,%edx
+cmovne 0x90909090(%eax),%edx,%edx
+cmovbe 0x90909090(%eax),%edx,%edx
+cmova  0x90909090(%eax),%edx,%edx
+cmovs  0x90909090(%eax),%edx,%edx
+cmovns 0x90909090(%eax),%edx,%edx
+cmovp  0x90909090(%eax),%edx,%edx
+cmovnp 0x90909090(%eax),%edx,%edx
+cmovl  0x90909090(%eax),%edx,%edx
+cmovge 0x90909090(%eax),%edx,%edx
+cmovle 0x90909090(%eax),%edx,%edx
+cmovg  0x90909090(%eax),%edx,%edx
+imul   0x90909(%eax),%edx,%edx
+imul   0x909(%rax,%r31,8),%rdx,%rdx
+imul   %rdx,%rax,%rdx
diff --git a/gas/testsuite/gas/i386/x86-64.exp b/gas/testsuite/gas/i386/x86-64.exp
index 668b366a212..eab99f9e52b 100644
--- a/gas/testsuite/gas/i386/x86-64.exp
+++ b/gas/testsuite/gas/i386/x86-64.exp
@@ -552,6 +552,7 @@  run_dump_test "x86-64-optimize-6"
 run_list_test "x86-64-optimize-7a" "-I${srcdir}/$subdir -march=+noavx -al"
 run_dump_test "x86-64-optimize-7b"
 run_list_test "x86-64-optimize-8" "-I${srcdir}/$subdir -march=+noavx2 -al"
+run_dump_test "x86-64-apx-ndd-optimize"
 run_dump_test "x86-64-align-branch-1a"
 run_dump_test "x86-64-align-branch-1b"
 run_dump_test "x86-64-align-branch-1c"
diff --git a/opcodes/i386-opc.tbl b/opcodes/i386-opc.tbl
index 03ebef028f9..5e36f6f67eb 100644
--- a/opcodes/i386-opc.tbl
+++ b/opcodes/i386-opc.tbl
@@ -145,6 +145,8 @@ 
 // The EVEX purpose of StaticRounding appears only together with SAE. Re-use
 // the bit to mark commutative VEX encodings where swapping the source
 // operands may allow to switch from 3-byte to 2-byte VEX encoding.
+// And re-use the bit to mark some NDD insns that swapping the source operands
+// may allow to switch from 3 operands to 2 operands.
 #define C StaticRounding
 
 #define FP 387|287|8087
@@ -166,6 +168,10 @@ 
 
 ### MARKER ###
 
+// Please don't add an NDD insn which may be optimized to a REX2 insn ahead
+// of mov: otherwise a good UB checker may object to the computation
+// "current_templates->start - 1" at the end of match_template.
+
 // Move instructions.
 mov, 0xa0, No64, D|W|CheckOperandSize|No_sSuf|No_qSuf, { Disp16|Disp32|Unspecified|Byte|Word|Dword, Acc|Byte|Word|Dword }
 mov, 0xa0, x64, D|W|CheckOperandSize|No_sSuf, { Disp64|Unspecified|Byte|Word|Dword|Qword, Acc|Byte|Word|Dword|Qword }
@@ -295,7 +301,7 @@  add, 0x0, 0, D|W|CheckOperandSize|Modrm|No_sSuf|HLEPrefixLock, { Reg8|Reg16|Reg3
 add, 0x83/0, 0, Modrm|No_bSuf|No_sSuf|HLEPrefixLock, { Imm8S, Reg16|Reg32|Reg64|Word|Dword|Qword|Unspecified|BaseIndex }
 add, 0x4, 0, W|No_sSuf, { Imm8|Imm16|Imm32|Imm32S, Acc|Byte|Word|Dword|Qword }
 add, 0x80/0, 0, W|Modrm|No_sSuf|HLEPrefixLock, { Imm8|Imm16|Imm32|Imm32S, Reg8|Reg16|Reg32|Reg64|Byte|Word|Dword|Qword|Unspecified|BaseIndex }
-add, 0x0, APX_F, D|W|CheckOperandSize|Modrm|No_sSuf|VexVVVVDest|EVex128|EVexMap4|NF, { Reg8|Reg16|Reg32|Reg64, Reg8|Reg16|Reg32|Reg64|Byte|Word|Dword|Qword|Unspecified|BaseIndex, Reg8|Reg16|Reg32|Reg64 }
+add, 0x0, APX_F, D|C|W|CheckOperandSize|Modrm|No_sSuf|VexVVVVDest|EVex128|EVexMap4|NF, { Reg8|Reg16|Reg32|Reg64, Reg8|Reg16|Reg32|Reg64|Byte|Word|Dword|Qword|Unspecified|BaseIndex, Reg8|Reg16|Reg32|Reg64 }
 add, 0x83/0, APX_F, Modrm|CheckOperandSize|No_bSuf|No_sSuf|VexVVVVDest|EVex128|EVexMap4|NF, { Imm8S, Reg16|Reg32|Reg64|Word|Dword|Qword|Unspecified|BaseIndex, Reg16|Reg32|Reg64 }
 add, 0x80/0, APX_F, W|Modrm|CheckOperandSize|No_sSuf|VexVVVVDest|EVex128|EVexMap4|NF, { Imm8|Imm16|Imm32|Imm32S, Reg8|Reg16|Reg32|Reg64|Byte|Word|Dword|Qword|Unspecified|BaseIndex, Reg8|Reg16|Reg32|Reg64}
 
@@ -339,7 +345,7 @@  and, 0x20, 0, D|W|CheckOperandSize|Modrm|No_sSuf|HLEPrefixLock|Optimize, { Reg8|
 and, 0x83/4, 0, Modrm|No_bSuf|No_sSuf|HLEPrefixLock|Optimize, { Imm8S, Reg16|Reg32|Reg64|Word|Dword|Qword|Unspecified|BaseIndex }
 and, 0x24, 0, W|No_sSuf|Optimize, { Imm8|Imm16|Imm32|Imm32S, Acc|Byte|Word|Dword|Qword }
 and, 0x80/4, 0, W|Modrm|No_sSuf|HLEPrefixLock|Optimize, { Imm8|Imm16|Imm32|Imm32S, Reg8|Reg16|Reg32|Reg64|Byte|Word|Dword|Qword|Unspecified|BaseIndex }
-and, 0x20, APX_F, D|W|CheckOperandSize|Modrm|No_sSuf|VexVVVVDest|EVex128|EVexMap4|NF|Optimize, { Reg8|Reg16|Reg32|Reg64, Reg8|Reg16|Reg32|Reg64|Byte|Word|Dword|Qword|Unspecified|BaseIndex, Reg8|Reg16|Reg32|Reg64 }
+and, 0x20, APX_F, D|C|W|CheckOperandSize|Modrm|No_sSuf|VexVVVVDest|EVex128|EVexMap4|NF|Optimize, { Reg8|Reg16|Reg32|Reg64, Reg8|Reg16|Reg32|Reg64|Byte|Word|Dword|Qword|Unspecified|BaseIndex, Reg8|Reg16|Reg32|Reg64 }
 and, 0x83/4, APX_F, Modrm|CheckOperandSize|No_bSuf|No_sSuf|VexVVVVDest|EVex128|EVexMap4|NF|Optimize, { Imm8S, Reg16|Reg32|Reg64|Word|Dword|Qword|Unspecified|BaseIndex, Reg16|Reg32|Reg64 }
 and, 0x80/4, APX_F, W|Modrm|CheckOperandSize|No_sSuf|VexVVVVDest|EVex128|EVexMap4|NF|Optimize, { Imm8|Imm16|Imm32|Imm32S, Reg8|Reg16|Reg32|Reg64|Byte|Word|Dword|Qword|Unspecified|BaseIndex, Reg8|Reg16|Reg32|Reg64 }
 
@@ -347,7 +353,7 @@  or, 0x8, 0, D|W|CheckOperandSize|Modrm|No_sSuf|HLEPrefixLock|Optimize, { Reg8|Re
 or, 0x83/1, 0, Modrm|No_bSuf|No_sSuf|HLEPrefixLock, { Imm8S, Reg16|Reg32|Reg64|Word|Dword|Qword|Unspecified|BaseIndex }
 or, 0xc, 0, W|No_sSuf, { Imm8|Imm16|Imm32|Imm32S, Acc|Byte|Word|Dword|Qword }
 or, 0x80/1, 0, W|Modrm|No_sSuf|HLEPrefixLock, { Imm8|Imm16|Imm32|Imm32S, Reg8|Reg16|Reg32|Reg64|Byte|Word|Dword|Qword|Unspecified|BaseIndex }
-or, 0x8, APX_F, D|W|CheckOperandSize|Modrm|No_sSuf|VexVVVVDest|EVex128|EVexMap4|NF|Optimize, { Reg8|Reg16|Reg32|Reg64, Reg8|Reg16|Reg32|Reg64|Byte|Word|Dword|Qword|Unspecified|BaseIndex, Reg8|Reg16|Reg32|Reg64 }
+or, 0x8, APX_F, D|C|W|CheckOperandSize|Modrm|No_sSuf|VexVVVVDest|EVex128|EVexMap4|NF|Optimize, { Reg8|Reg16|Reg32|Reg64, Reg8|Reg16|Reg32|Reg64|Byte|Word|Dword|Qword|Unspecified|BaseIndex, Reg8|Reg16|Reg32|Reg64 }
 or, 0x83/1, APX_F, Modrm|CheckOperandSize|No_bSuf|No_sSuf|VexVVVVDest|EVex128|EVexMap4|NF, { Imm8S, Reg16|Reg32|Reg64|Word|Dword|Qword|Unspecified|BaseIndex, Reg16|Reg32|Reg64 }
 or, 0x80/1, APX_F, W|Modrm|CheckOperandSize|No_sSuf|VexVVVVDest|EVex128|EVexMap4|NF, { Imm8|Imm16|Imm32|Imm32S, Reg8|Reg16|Reg32|Reg64|Byte|Word|Dword|Qword|Unspecified|BaseIndex, Reg8|Reg16|Reg32|Reg64 }
 
@@ -355,7 +361,7 @@  xor, 0x30, 0, D|W|CheckOperandSize|Modrm|No_sSuf|HLEPrefixLock|Optimize, { Reg8|
 xor, 0x83/6, 0, Modrm|No_bSuf|No_sSuf|HLEPrefixLock, { Imm8S, Reg16|Reg32|Reg64|Word|Dword|Qword|Unspecified|BaseIndex }
 xor, 0x34, 0, W|No_sSuf, { Imm8|Imm16|Imm32|Imm32S, Acc|Byte|Word|Dword|Qword }
 xor, 0x80/6, 0, W|Modrm|No_sSuf|HLEPrefixLock, { Imm8|Imm16|Imm32|Imm32S, Reg8|Reg16|Reg32|Reg64|Byte|Word|Dword|Qword|Unspecified|BaseIndex }
-xor, 0x30, APX_F, D|W|CheckOperandSize|Modrm|No_sSuf|VexVVVVDest|EVex128|EVexMap4|NF|Optimize, { Reg8|Reg16|Reg32|Reg64, Reg8|Reg16|Reg32|Reg64|Byte|Word|Dword|Qword|Unspecified|BaseIndex, Reg8|Reg16|Reg32|Reg64 }
+xor, 0x30, APX_F, D|C|W|CheckOperandSize|Modrm|No_sSuf|VexVVVVDest|EVex128|EVexMap4|NF|Optimize, { Reg8|Reg16|Reg32|Reg64, Reg8|Reg16|Reg32|Reg64|Byte|Word|Dword|Qword|Unspecified|BaseIndex, Reg8|Reg16|Reg32|Reg64 }
 xor, 0x83/6, APX_F, Modrm|CheckOperandSize|No_bSuf|No_sSuf|VexVVVVDest|EVex128|EVexMap4|NF, { Imm8S, Reg16|Reg32|Reg64|Word|Dword|Qword|Unspecified|BaseIndex, Reg16|Reg32|Reg64 }
 xor, 0x80/6, APX_F, W|Modrm|CheckOperandSize|No_sSuf|VexVVVVDest|EVex128|EVexMap4|NF, { Imm8|Imm16|Imm32|Imm32S, Reg8|Reg16|Reg32|Reg64|Byte|Word|Dword|Qword|Unspecified|BaseIndex, Reg8|Reg16|Reg32|Reg64 }
 
@@ -369,7 +375,7 @@  adc, 0x80/2, 0, W|Modrm|No_sSuf|HLEPrefixLock, { Imm8|Imm16|Imm32|Imm32S, Reg8|R
 adc, 0x10, APX_F, D|W|CheckOperandSize|Modrm|EVex128|EVexMap4|No_sSuf, { Reg8|Reg16|Reg32|Reg64, Reg8|Reg16|Reg32|Reg64|Byte|Word|Dword|Qword|Unspecified|BaseIndex }
 adc, 0x83/2, APX_F, Modrm|EVex128|EVexMap4|No_bSuf|No_sSuf, { Imm8S, Reg16|Reg32|Reg64|Word|Dword|Qword|Unspecified|BaseIndex }
 adc, 0x80/2, APX_F, W|Modrm|EVex128|EVexMap4|No_sSuf, { Imm8|Imm16|Imm32|Imm32S, Reg8|Reg16|Reg32|Reg64|Byte|Word|Dword|Qword|Unspecified|BaseIndex }
-adc, 0x10, APX_F, D|W|CheckOperandSize|Modrm|No_sSuf|VexVVVVDest|EVex128|EVexMap4, { Reg8|Reg16|Reg32|Reg64, Reg8|Reg16|Reg32|Reg64|Byte|Word|Dword|Qword|Unspecified|BaseIndex, Reg8|Reg16|Reg32|Reg64 }
+adc, 0x10, APX_F, D|C|W|CheckOperandSize|Modrm|No_sSuf|VexVVVVDest|EVex128|EVexMap4, { Reg8|Reg16|Reg32|Reg64, Reg8|Reg16|Reg32|Reg64|Byte|Word|Dword|Qword|Unspecified|BaseIndex, Reg8|Reg16|Reg32|Reg64 }
 adc, 0x83/2, APX_F, Modrm|CheckOperandSize|No_bSuf|No_sSuf|VexVVVVDest|EVex128|EVexMap4, { Imm8S, Reg16|Reg32|Reg64|Word|Dword|Qword|Unspecified|BaseIndex, Reg16|Reg32|Reg64 }
 adc, 0x80/2, APX_F, W|Modrm|CheckOperandSize|No_sSuf|VexVVVVDest|EVex128|EVexMap4, { Imm8|Imm16|Imm32|Imm32S, Reg8|Reg16|Reg32|Reg64|Byte|Word|Dword|Qword|Unspecified|BaseIndex, Reg8|Reg16|Reg32|Reg64 }
 
@@ -412,7 +418,7 @@  cqto, 0x99, x64, Size64|NoSuf, {}
 mul, 0xf6/4, 0, W|Modrm|No_sSuf, { Reg8|Reg16|Reg32|Reg64|Byte|Word|Dword|Qword|Unspecified|BaseIndex }
 imul, 0xf6/5, 0, W|Modrm|No_sSuf, { Reg8|Reg16|Reg32|Reg64|Byte|Word|Dword|Qword|Unspecified|BaseIndex }
 imul, 0xfaf, i386, Modrm|CheckOperandSize|No_bSuf|No_sSuf, { Reg16|Reg32|Reg64|Unspecified|Word|Dword|Qword|BaseIndex, Reg16|Reg32|Reg64 }
-imul, 0xaf, APX_F, Modrm|CheckOperandSize|No_bSuf|No_sSuf|VexVVVVDest|EVex128|EVexMap4, { Reg16|Reg32|Reg64|Unspecified|Word|Dword|Qword|BaseIndex, Reg16|Reg32|Reg64, Reg16|Reg32|Reg64 }
+imul, 0xaf, APX_F, C|Modrm|CheckOperandSize|No_bSuf|No_sSuf|VexVVVVDest|EVex128|EVexMap4, { Reg16|Reg32|Reg64|Unspecified|Word|Dword|Qword|BaseIndex, Reg16|Reg32|Reg64, Reg16|Reg32|Reg64 }
 imul, 0x6b, i186, Modrm|CheckOperandSize|No_bSuf|No_sSuf, { Imm8S, Reg16|Reg32|Reg64|Word|Dword|Qword|Unspecified|BaseIndex, Reg16|Reg32|Reg64 }
 imul, 0x69, i186, Modrm|CheckOperandSize|No_bSuf|No_sSuf, { Imm16|Imm32|Imm32S, Reg16|Reg32|Reg64|Word|Dword|Qword|Unspecified|BaseIndex, Reg16|Reg32|Reg64 }
 // imul with 2 operands mimics imul with 3 by putting the register in
@@ -2126,10 +2132,10 @@  xstore, 0xfa7c0, PadLock, NoSuf|RepPrefixOk, {}
 // Multy-precision Add Carry, rdseed instructions.
 adcx, 0x660f38f6, ADX, Modrm|CheckOperandSize|IgnoreSize|No_bSuf|No_wSuf|No_sSuf, { Reg32|Reg64|Unspecified|BaseIndex, Reg32|Reg64 }
 adcx, 0x6666, ADX|APX_F, Modrm|CheckOperandSize|IgnoreSize|No_bSuf|No_wSuf|No_sSuf|EVex128|EVexMap4, { Reg32|Reg64|Unspecified|BaseIndex, Reg32|Reg64 }
-adcx, 0x6666, ADX|APX_F, Modrm|CheckOperandSize|IgnoreSize|No_bSuf|No_wSuf|No_sSuf|VexVVVVDest|EVex128|EVexMap4, { Reg32|Reg64|Unspecified|BaseIndex, Reg32|Reg64, Reg32|Reg64 }
+adcx, 0x6666, ADX|APX_F, C|Modrm|CheckOperandSize|IgnoreSize|No_bSuf|No_wSuf|No_sSuf|VexVVVVDest|EVex128|EVexMap4, { Reg32|Reg64|Unspecified|BaseIndex, Reg32|Reg64, Reg32|Reg64 }
 adox, 0xf30f38f6, ADX, Modrm|CheckOperandSize|IgnoreSize|No_bSuf|No_wSuf|No_sSuf, { Reg32|Reg64|Unspecified|BaseIndex, Reg32|Reg64 }
 adox, 0xf366, ADX|APX_F, Modrm|CheckOperandSize|IgnoreSize|No_bSuf|No_wSuf|No_sSuf|EVex128|EVexMap4, { Reg32|Reg64|Unspecified|BaseIndex, Reg32|Reg64 }
-adox, 0xf366, ADX|APX_F, Modrm|CheckOperandSize|IgnoreSize|No_bSuf|No_wSuf|No_sSuf|VexVVVVDest|EVex128|EVexMap4, { Reg32|Reg64|Unspecified|BaseIndex, Reg32|Reg64, Reg32|Reg64 }
+adox, 0xf366, ADX|APX_F, C|Modrm|CheckOperandSize|IgnoreSize|No_bSuf|No_wSuf|No_sSuf|VexVVVVDest|EVex128|EVexMap4, { Reg32|Reg64|Unspecified|BaseIndex, Reg32|Reg64, Reg32|Reg64 }
 rdseed, 0xfc7/7, RdSeed, Modrm|NoSuf, { Reg16|Reg32|Reg64 }
 
 // SMAP instructions.