vect: Don't set excess bits in uniform masks

Message ID 63e907af-cde8-4f63-bba9-d39fcd5623fb@codesourcery.com
State Unresolved

Checks

snail/gcc-patch-check: warning (Git am fail log)

Commit Message

Andrew Stubbs Oct. 20, 2023, 3:48 p.m. UTC
  This patch fixes a wrong-code bug on amdgcn in which the excess "ones" 
in the mask enable extra lanes that were supposed to be unused and are 
therefore undefined.

Richi suggested an alternative approach involving narrower types and 
then a zero-extend to the actual mask type.  This solved the problem for 
the specific test case that I had, but I'm not sure if it would work 
with V2 and V4 modes (not that I've observed bad behaviour from them 
anyway, but still).  There were some other caveats involving "two-lane 
division" that I don't fully understand, so I went with the simpler 
implementation.

This patch does have the disadvantage of an additional "and" instruction 
in the non-constant case even for machines that don't need it. I'm not 
sure how to fix that without an additional target hook. (If GCC could 
use the 64-lane vectors more effectively without the assistance of 
artificially reduced sizes then this problem wouldn't exist.)

OK to commit?

Andrew
vect: Don't set excess bits in uniform masks

AVX ignores any excess bits in the mask, but AMD GCN magically uses a larger
vector than was intended (the smaller sizes are "fake"), leading to wrong-code.

gcc/ChangeLog:

	* expr.cc (store_constructor): Add "and" operation to uniform mask
	generation.
  

Comments

Richard Biener Oct. 23, 2023, 10:43 a.m. UTC | #1
On Fri, 20 Oct 2023, Andrew Stubbs wrote:

> This patch fixes a wrong-code bug on amdgcn in which the excess "ones" in the
> mask enable extra lanes that were supposed to be unused and are therefore
> undefined.
> 
> Richi suggested an alternative approach involving narrower types and then a
> zero-extend to the actual mask type.  This solved the problem for the specific
> test case that I had, but I'm not sure if it would work with V2 and V4 modes
> (not that I've observed bad behaviour from them anyway, but still).  There
> were some other caveats involving "two-lane division" that I don't fully
> understand, so I went with the simpler implementation.
> 
> This patch does have the disadvantage of an additional "and" instruction in
> the non-constant case even for machines that don't need it. I'm not sure how
> to fix that without an additional target hook. (If GCC could use the 64-lane
> vectors more effectively without the assistance of artificially reduced sizes
> then this problem wouldn't exist.)
> 
> OK to commit?

-           convert_move (target, op0, 0);
+           rtx tmp = gen_reg_rtx (mode);
+           convert_move (tmp, op0, 0);
+
+           if (known_ne (TYPE_VECTOR_SUBPARTS (type),
+                         GET_MODE_PRECISION (mode)))

Usually this would be maybe_ne, but then ...

+             {
+               /* Ensure no excess bits are set.
+                  GCN needs this, AVX does not.  */
+               expand_binop (mode, and_optab, tmp,
+                             GEN_INT ((1 << (TYPE_VECTOR_SUBPARTS (type)
+                                             .to_constant())) - 1),
+                             target, true, OPTAB_DIRECT);

here you have .to_constant ().  I think with having an integer mode
we know subparts is constant so I'd prefer

            auto nunits = TYPE_VECTOR_SUBPARTS (type).to_constant ();
            if (maybe_ne (GET_MODE_PRECISION (mode), nunits))
...

+             }
+           else
+             emit_move_insn (target, tmp);

note you need the emit_move_insn also for the expand_binop
path since it's not guaranteed that 'target' is used there.  Thus

  tmp = expand_binop (...)
  if (tmp != target)
    emit_move_insn (...)

Otherwise looks good to me.  The and is needed on x86 for
two and four bit masks, it would be more efficient to use
smaller modes for the sign-extension I guess.

Thanks,
Richard.
  
Andrew Stubbs Nov. 10, 2023, 11:45 a.m. UTC | #2
On 23/10/2023 11:43, Richard Biener wrote:
> On Fri, 20 Oct 2023, Andrew Stubbs wrote:
> 
>> This patch fixes a wrong-code bug on amdgcn in which the excess "ones" in the
>> mask enable extra lanes that were supposed to be unused and are therefore
>> undefined.
>>
>> Richi suggested an alternative approach involving narrower types and then a
>> zero-extend to the actual mask type.  This solved the problem for the specific
>> test case that I had, but I'm not sure if it would work with V2 and V4 modes
>> (not that I've observed bad behaviour from them anyway, but still).  There
>> were some other caveats involving "two-lane division" that I don't fully
>> understand, so I went with the simpler implementation.
>>
>> This patch does have the disadvantage of an additional "and" instruction in
>> the non-constant case even for machines that don't need it. I'm not sure how
>> to fix that without an additional target hook. (If GCC could use the 64-lane
>> vectors more effectively without the assistance of artificially reduced sizes
>> then this problem wouldn't exist.)
>>
>> OK to commit?
> 
> -           convert_move (target, op0, 0);
> +           rtx tmp = gen_reg_rtx (mode);
> +           convert_move (tmp, op0, 0);
> +
> +           if (known_ne (TYPE_VECTOR_SUBPARTS (type),
> +                         GET_MODE_PRECISION (mode)))
> 
> Usually this would be maybe_ne, but then ...
> 
> +             {
> +               /* Ensure no excess bits are set.
> +                  GCN needs this, AVX does not.  */
> +               expand_binop (mode, and_optab, tmp,
> +                             GEN_INT ((1 << (TYPE_VECTOR_SUBPARTS (type)
> +                                             .to_constant())) - 1),
> +                             target, true, OPTAB_DIRECT);
> 
> here you have .to_constant ().  I think with having an integer mode
> we know subparts is constant so I'd prefer
> 
>              auto nunits = TYPE_VECTOR_SUBPARTS (type).to_constant ();
>              if (maybe_ne (GET_MODE_PRECISION (mode), nunits))
> ...
> 
> +             }
> +           else
> +             emit_move_insn (target, tmp);
> 
> note you need the emit_move_insn also for the expand_binop
> path since it's not guaranteed that 'target' is used there.  Thus
> 
>    tmp = expand_binop (...)
>    if (tmp != target)
>      emit_move_insn (...)
> 
> Otherwise looks good to me.  The and is needed on x86 for
> two and four bit masks, it would be more efficient to use
> smaller modes for the sign-extension I guess.

I think this patch addresses these issues. I've confirmed it does the 
right thing on amdgcn.

OK?

Andrew
vect: Don't set excess bits in uniform masks

AVX ignores any excess bits in the mask (at least for vector sizes >=8), but
AMD GCN magically uses a larger vector than was intended (the smaller sizes are
"fake"), leading to wrong-code.

This patch fixes amdgcn execution failures in gcc.dg/vect/pr81740-1.c,
gfortran.dg/c-interop/contiguous-1.f90,
gfortran.dg/c-interop/ff-descriptor-7.f90, and others.

gcc/ChangeLog:

	* expr.cc (store_constructor): Add "and" operation to uniform mask
	generation.

diff --git a/gcc/expr.cc b/gcc/expr.cc
index ed4dbb13d89..3e2a678710d 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -7470,7 +7470,7 @@ store_constructor (tree exp, rtx target, int cleared, poly_int64 size,
 	    break;
 	  }
 	/* Use sign-extension for uniform boolean vectors with
-	   integer modes.  */
+	   integer modes.  Effectively "vec_duplicate" for bitmasks.  */
 	if (!TREE_SIDE_EFFECTS (exp)
 	    && VECTOR_BOOLEAN_TYPE_P (type)
 	    && SCALAR_INT_MODE_P (mode)
@@ -7479,7 +7479,19 @@ store_constructor (tree exp, rtx target, int cleared, poly_int64 size,
 	  {
 	    rtx op0 = force_reg (TYPE_MODE (TREE_TYPE (elt)),
 				 expand_normal (elt));
-	    convert_move (target, op0, 0);
+	    rtx tmp = gen_reg_rtx (mode);
+	    convert_move (tmp, op0, 0);
+
+	    /* Ensure no excess bits are set.
+	       GCN needs this for nunits < 64.
+	       x86 needs this for nunits < 8.  */
+	    auto nunits = TYPE_VECTOR_SUBPARTS (type).to_constant ();
+	    if (maybe_ne (GET_MODE_PRECISION (mode), nunits))
+	      tmp = expand_binop (mode, and_optab, tmp,
+				  GEN_INT ((1 << nunits) - 1), target,
+				  true, OPTAB_DIRECT);
+	    if (tmp != target)
+	      emit_move_insn (target, tmp);
 	    break;
 	  }
  
Richard Biener Nov. 10, 2023, 11:53 a.m. UTC | #3
On Fri, 10 Nov 2023, Andrew Stubbs wrote:

> On 23/10/2023 11:43, Richard Biener wrote:
> > On Fri, 20 Oct 2023, Andrew Stubbs wrote:
> > 
> >> This patch fixes a wrong-code bug on amdgcn in which the excess "ones" in
> >> the
> >> mask enable extra lanes that were supposed to be unused and are therefore
> >> undefined.
> >>
> >> Richi suggested an alternative approach involving narrower types and then a
> >> zero-extend to the actual mask type.  This solved the problem for the
> >> specific
> >> test case that I had, but I'm not sure if it would work with V2 and V4
> >> modes
> >> (not that I've observed bad behaviour from them anyway, but still).  There
> >> were some other caveats involving "two-lane division" that I don't fully
> >> understand, so I went with the simpler implementation.
> >>
> >> This patch does have the disadvantage of an additional "and" instruction in
> >> the non-constant case even for machines that don't need it. I'm not sure
> >> how
> >> to fix that without an additional target hook. (If GCC could use the
> >> 64-lane
> >> vectors more effectively without the assistance of artificially reduced
> >> sizes
> >> then this problem wouldn't exist.)
> >>
> >> OK to commit?
> > 
> > -           convert_move (target, op0, 0);
> > +           rtx tmp = gen_reg_rtx (mode);
> > +           convert_move (tmp, op0, 0);
> > +
> > +           if (known_ne (TYPE_VECTOR_SUBPARTS (type),
> > +                         GET_MODE_PRECISION (mode)))
> > 
> > Usually this would be maybe_ne, but then ...
> > 
> > +             {
> > +               /* Ensure no excess bits are set.
> > +                  GCN needs this, AVX does not.  */
> > +               expand_binop (mode, and_optab, tmp,
> > +                             GEN_INT ((1 << (TYPE_VECTOR_SUBPARTS (type)
> > +                                             .to_constant())) - 1),
> > +                             target, true, OPTAB_DIRECT);
> > 
> > here you have .to_constant ().  I think with having an integer mode
> > we know subparts is constant so I'd prefer
> > 
> >              auto nunits = TYPE_VECTOR_SUBPARTS (type).to_constant ();
> >              if (maybe_ne (GET_MODE_PRECISION (mode), nunits))
> > ...
> > 
> > +             }
> > +           else
> > +             emit_move_insn (target, tmp);
> > 
> > note you need the emit_move_insn also for the expand_binop
> > path since it's not guaranteed that 'target' is used there.  Thus
> > 
> >    tmp = expand_binop (...)
> >    if (tmp != target)
> >      emit_move_insn (...)
> > 
> > Otherwise looks good to me.  The and is needed on x86 for
> > two and four bit masks, it would be more efficient to use
> > smaller modes for the sign-extension I guess.
> 
> I think this patch addresses these issues. I've confirmed it does the right
> thing on amdgcn.
> 
> OK?

OK.

thanks,
Richard.

> Andrew
>
  

Patch

diff --git a/gcc/expr.cc b/gcc/expr.cc
index 4220cbd9f8f..fb4609f616e 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -7440,7 +7440,7 @@  store_constructor (tree exp, rtx target, int cleared, poly_int64 size,
 	    break;
 	  }
 	/* Use sign-extension for uniform boolean vectors with
-	   integer modes.  */
+	   integer modes.  Effectively "vec_duplicate" for bitmasks.  */
 	if (!TREE_SIDE_EFFECTS (exp)
 	    && VECTOR_BOOLEAN_TYPE_P (type)
 	    && SCALAR_INT_MODE_P (mode)
@@ -7449,7 +7449,21 @@  store_constructor (tree exp, rtx target, int cleared, poly_int64 size,
 	  {
 	    rtx op0 = force_reg (TYPE_MODE (TREE_TYPE (elt)),
 				 expand_normal (elt));
-	    convert_move (target, op0, 0);
+	    rtx tmp = gen_reg_rtx (mode);
+	    convert_move (tmp, op0, 0);
+
+	    if (known_ne (TYPE_VECTOR_SUBPARTS (type),
+			  GET_MODE_PRECISION (mode)))
+	      {
+		/* Ensure no excess bits are set.
+		   GCN needs this, AVX does not.  */
+		expand_binop (mode, and_optab, tmp,
+			      GEN_INT ((1 << (TYPE_VECTOR_SUBPARTS (type)
+					      .to_constant())) - 1),
+			      target, true, OPTAB_DIRECT);
+	      }
+	    else
+	      emit_move_insn (target, tmp);
 	    break;
 	  }