vect: Try to remove single-vector permutes from SLP graph

Message ID mptmtbjcrq8.fsf@arm.com
State New, archived
Headers
Series vect: Try to remove single-vector permutes from SLP graph |

Commit Message

Richard Sandiford Sept. 1, 2022, 10:29 a.m. UTC
  This patch extends the SLP layout optimisation pass so that it
tries to remove layout changes that are brought about by permutes
of existing vectors.  This fixes the bb-slp-pr54400.c regression on
x86_64 and also means that we can remove the permutes in cases like:

typedef float v4sf __attribute__((vector_size(sizeof(float)*4)));

float __attribute__((noipa))
f(v4sf v0, v4sf v1)
{
  return v0[0]*v1[0]+v0[1]*v1[1]+v0[2]*v1[2]+v0[3]*v1[3];
}

The new test is a simple adaption of bb-slp-pr54400.c, with the
same style of markup.

Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?

Richard

PS. Sorry for missing the regression during testing.  The initial
    version of the new bb-slp-layout* tests had markup that led
    to a lot of FAILs on x86_64.  I fixed the markup to avoid those,
    but didn't notice this extra (unrelated) failure at the end.


gcc/
	* tree-vect-slp.cc (vect_build_slp_tree_2): When building a
	VEC_PERM_EXPR of an existing vector, set the SLP_TREE_LANES
	to the number of vector elements, if that's a known constant.
	(vect_optimize_slp_pass::is_compatible_layout): Remove associated
	comment about zero SLP_TREE_LANES.
	(vect_optimize_slp_pass::start_choosing_layouts): Iterate over
	all partition members when looking for potential layouts.
	Handle existing permutes of fixed-length vectors.

gcc/testsuite/
	* gcc.dg/vect/bb-slp-pr54400.c: Extend to aarch64.
	* gcc.dg/vect/bb-slp-layout-18.c: New test.
---
 gcc/testsuite/gcc.dg/vect/bb-slp-layout-18.c | 15 +++++
 gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c   |  4 +-
 gcc/tree-vect-slp.cc                         | 67 +++++++++++++-------
 3 files changed, 61 insertions(+), 25 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-layout-18.c
  

Comments

Richard Biener Sept. 1, 2022, 10:49 a.m. UTC | #1
On Thu, 1 Sep 2022, Richard Sandiford wrote:

> This patch extends the SLP layout optimisation pass so that it
> tries to remove layout changes that are brought about by permutes
> of existing vectors.  This fixes the bb-slp-pr54400.c regression on
> x86_64 and also means that we can remove the permutes in cases like:
> 
> typedef float v4sf __attribute__((vector_size(sizeof(float)*4)));
> 
> float __attribute__((noipa))
> f(v4sf v0, v4sf v1)
> {
>   return v0[0]*v1[0]+v0[1]*v1[1]+v0[2]*v1[2]+v0[3]*v1[3];
> }
> 
> The new test is a simple adaption of bb-slp-pr54400.c, with the
> same style of markup.
> 
> Tested on aarch64-linux-gnu and x86_64-linux-gnu.  OK to install?

OK.

Thanks,
Richard.

> 
> Richard
> 
> PS. Sorry for missing the regression during testing.  The initial
>     version of the new bb-slp-layout* tests had markup that led
>     to a lot of FAILs on x86_64.  I fixed the markup to avoid those,
>     but didn't notice this extra (unrelated) failure at the end.
> 
> 
> gcc/
> 	* tree-vect-slp.cc (vect_build_slp_tree_2): When building a
> 	VEC_PERM_EXPR of an existing vector, set the SLP_TREE_LANES
> 	to the number of vector elements, if that's a known constant.
> 	(vect_optimize_slp_pass::is_compatible_layout): Remove associated
> 	comment about zero SLP_TREE_LANES.
> 	(vect_optimize_slp_pass::start_choosing_layouts): Iterate over
> 	all partition members when looking for potential layouts.
> 	Handle existing permutes of fixed-length vectors.
> 
> gcc/testsuite/
> 	* gcc.dg/vect/bb-slp-pr54400.c: Extend to aarch64.
> 	* gcc.dg/vect/bb-slp-layout-18.c: New test.
> ---
>  gcc/testsuite/gcc.dg/vect/bb-slp-layout-18.c | 15 +++++
>  gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c   |  4 +-
>  gcc/tree-vect-slp.cc                         | 67 +++++++++++++-------
>  3 files changed, 61 insertions(+), 25 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-layout-18.c
> 
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-layout-18.c b/gcc/testsuite/gcc.dg/vect/bb-slp-layout-18.c
> new file mode 100644
> index 00000000000..ff462722507
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-layout-18.c
> @@ -0,0 +1,15 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target vect_float} */
> +/* { dg-additional-options "-w -Wno-psabi -ffast-math" } */
> +
> +typedef float v4sf __attribute__((vector_size(sizeof(float)*4)));
> +
> +float __attribute__((noipa))
> +f(v4sf v0, v4sf v1)
> +{
> +  return v0[0]*v1[0]+v0[1]*v1[1]+v0[2]*v1[2]+v0[3]*v1[3];
> +}
> +
> +/* We are lacking an effective target for .REDUC_PLUS support.  */
> +/* { dg-final { scan-tree-dump-times "basic block part vectorized" 1 "slp2" { target x86_64-*-* aarch64*-*-* } } } */
> +/* { dg-final { scan-tree-dump-not " = VEC_PERM_EXPR" "slp2" { target x86_64-*-* aarch64*-*-* } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c b/gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c
> index 6b427aac774..8aec2092f4d 100644
> --- a/gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c
> @@ -39,5 +39,5 @@ main ()
>  }
>  
>  /* We are lacking an effective target for .REDUC_PLUS support.  */
> -/* { dg-final { scan-tree-dump-times "basic block part vectorized" 3 "slp2" { target x86_64-*-* } } } */
> -/* { dg-final { scan-tree-dump-not " = VEC_PERM_EXPR" "slp2" { target x86_64-*-* } } } */
> +/* { dg-final { scan-tree-dump-times "basic block part vectorized" 3 "slp2" { target x86_64-*-* aarch64*-*-* } } } */
> +/* { dg-final { scan-tree-dump-not " = VEC_PERM_EXPR" "slp2" { target x86_64-*-* aarch64*-*-* } } } */
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index 226550635cc..cc04ef350a6 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -1840,6 +1840,10 @@ vect_build_slp_tree_2 (vec_info *vinfo, slp_tree node,
>  					     TREE_TYPE (TREE_TYPE (vec))));
>  	  SLP_TREE_VECTYPE (vnode) = TREE_TYPE (vec);
>  	}
> +      auto nunits = TYPE_VECTOR_SUBPARTS (SLP_TREE_VECTYPE (vnode));
> +      unsigned HOST_WIDE_INT const_nunits;
> +      if (nunits.is_constant (&const_nunits))
> +	SLP_TREE_LANES (vnode) = const_nunits;
>        SLP_TREE_VEC_DEFS (vnode).safe_push (vec);
>        /* We are always building a permutation node even if it is an identity
>  	 permute to shield the rest of the vectorizer from the odd node
> @@ -4325,8 +4329,6 @@ vect_optimize_slp_pass::is_compatible_layout (slp_tree node,
>    if (layout_i == 0)
>      return true;
>  
> -  /* SLP_TREE_LANES is zero for existing vectors, but those only support
> -     layout 0 anyway.  */
>    if (SLP_TREE_LANES (node) != m_perms[layout_i].length ())
>      return false;
>  
> @@ -4521,38 +4523,57 @@ vect_optimize_slp_pass::start_choosing_layouts ()
>    m_perms.safe_push (vNULL);
>  
>    /* Create layouts from existing permutations.  */
> -  for (unsigned int node_i : m_leafs)
> +  auto_load_permutation_t tmp_perm;
> +  for (unsigned int node_i : m_partitioned_nodes)
>      {
> -      auto &vertex = m_vertices[node_i];
> -      if (vertex.partition < 0)
> -	continue;
> -
>        /* Leafs also double as entries to the reverse graph.  Allow the
>  	 layout of those to be changed.  */
> +      auto &vertex = m_vertices[node_i];
>        auto &partition = m_partitions[vertex.partition];
>        if (!m_slpg->vertices[node_i].succ)
>  	partition.layout = 0;
>  
> -      /* Loads are the only thing generating permutes.  */
> +      /* Loads and VEC_PERM_EXPRs are the only things generating permutes.  */
>        slp_tree node = vertex.node;
> -      if (!SLP_TREE_LOAD_PERMUTATION (node).exists ())
> -	continue;
> -
> -      /* If splitting out a SLP_TREE_LANE_PERMUTATION can make the node
> -	 unpermuted, record a layout that reverses this permutation.  */
> -      gcc_assert (partition.layout == 0);
>        stmt_vec_info dr_stmt = SLP_TREE_REPRESENTATIVE (node);
> -      if (!STMT_VINFO_GROUPED_ACCESS (dr_stmt))
> -	continue;
> -      dr_stmt = DR_GROUP_FIRST_ELEMENT (dr_stmt);
> -      unsigned imin = DR_GROUP_SIZE (dr_stmt) + 1, imax = 0;
> +      slp_tree child;
> +      unsigned HOST_WIDE_INT imin, imax = 0;
>        bool any_permute = false;
> +      tmp_perm.truncate (0);
> +      if (SLP_TREE_LOAD_PERMUTATION (node).exists ())
> +	{
> +	  /* If splitting out a SLP_TREE_LANE_PERMUTATION can make the node
> +	     unpermuted, record a layout that reverses this permutation.  */
> +	  gcc_assert (partition.layout == 0);
> +	  if (!STMT_VINFO_GROUPED_ACCESS (dr_stmt))
> +	    continue;
> +	  dr_stmt = DR_GROUP_FIRST_ELEMENT (dr_stmt);
> +	  imin = DR_GROUP_SIZE (dr_stmt) + 1;
> +	  tmp_perm.safe_splice (SLP_TREE_LOAD_PERMUTATION (node));
> +	}
> +      else if (SLP_TREE_CODE (node) == VEC_PERM_EXPR
> +	       && SLP_TREE_CHILDREN (node).length () == 1
> +	       && (child = SLP_TREE_CHILDREN (node)[0])
> +	       && (TYPE_VECTOR_SUBPARTS (SLP_TREE_VECTYPE (child))
> +		   .is_constant (&imin)))
> +	{
> +	  /* If the child has the same vector size as this node,
> +	     reversing the permutation can make the permutation a no-op.
> +	     In other cases it can change a true permutation into a
> +	     full-vector extract.  */
> +	  tmp_perm.reserve (SLP_TREE_LANES (node));
> +	  for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> +	    tmp_perm.quick_push (SLP_TREE_LANE_PERMUTATION (node)[j].second);
> +	}
> +      else
> +	continue;
> +
>        for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
>  	{
> -	  unsigned idx = SLP_TREE_LOAD_PERMUTATION (node)[j];
> +	  unsigned idx = tmp_perm[j];
>  	  imin = MIN (imin, idx);
>  	  imax = MAX (imax, idx);
> -	  if (idx - SLP_TREE_LOAD_PERMUTATION (node)[0] != j)
> +	  if (idx - tmp_perm[0] != j)
>  	    any_permute = true;
>  	}
>        /* If the span doesn't match we'd disrupt VF computation, avoid
> @@ -4560,7 +4581,7 @@ vect_optimize_slp_pass::start_choosing_layouts ()
>        if (imax - imin + 1 != SLP_TREE_LANES (node))
>  	continue;
>        /* If there's no permute no need to split one out.  In this case
> -	 we can consider turning the load into a permuted load, if that
> +	 we can consider turning a load into a permuted load, if that
>  	 turns out to be cheaper than alternatives.  */
>        if (!any_permute)
>  	{
> @@ -4575,7 +4596,7 @@ vect_optimize_slp_pass::start_choosing_layouts ()
>        auto_sbitmap load_index (SLP_TREE_LANES (node));
>        bitmap_clear (load_index);
>        for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> -	bitmap_set_bit (load_index, SLP_TREE_LOAD_PERMUTATION (node)[j] - imin);
> +	bitmap_set_bit (load_index, tmp_perm[j] - imin);
>        unsigned j;
>        for (j = 0; j < SLP_TREE_LANES (node); ++j)
>  	if (!bitmap_bit_p (load_index, j))
> @@ -4586,7 +4607,7 @@ vect_optimize_slp_pass::start_choosing_layouts ()
>        vec<unsigned> perm = vNULL;
>        perm.safe_grow (SLP_TREE_LANES (node), true);
>        for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> -	perm[j] = SLP_TREE_LOAD_PERMUTATION (node)[j] - imin;
> +	perm[j] = tmp_perm[j] - imin;
>  
>        if (int (m_perms.length ()) >= param_vect_max_layout_candidates)
>  	{
>
  

Patch

diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-layout-18.c b/gcc/testsuite/gcc.dg/vect/bb-slp-layout-18.c
new file mode 100644
index 00000000000..ff462722507
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-layout-18.c
@@ -0,0 +1,15 @@ 
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_float} */
+/* { dg-additional-options "-w -Wno-psabi -ffast-math" } */
+
+typedef float v4sf __attribute__((vector_size(sizeof(float)*4)));
+
+float __attribute__((noipa))
+f(v4sf v0, v4sf v1)
+{
+  return v0[0]*v1[0]+v0[1]*v1[1]+v0[2]*v1[2]+v0[3]*v1[3];
+}
+
+/* We are lacking an effective target for .REDUC_PLUS support.  */
+/* { dg-final { scan-tree-dump-times "basic block part vectorized" 1 "slp2" { target x86_64-*-* aarch64*-*-* } } } */
+/* { dg-final { scan-tree-dump-not " = VEC_PERM_EXPR" "slp2" { target x86_64-*-* aarch64*-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c b/gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c
index 6b427aac774..8aec2092f4d 100644
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c
@@ -39,5 +39,5 @@  main ()
 }
 
 /* We are lacking an effective target for .REDUC_PLUS support.  */
-/* { dg-final { scan-tree-dump-times "basic block part vectorized" 3 "slp2" { target x86_64-*-* } } } */
-/* { dg-final { scan-tree-dump-not " = VEC_PERM_EXPR" "slp2" { target x86_64-*-* } } } */
+/* { dg-final { scan-tree-dump-times "basic block part vectorized" 3 "slp2" { target x86_64-*-* aarch64*-*-* } } } */
+/* { dg-final { scan-tree-dump-not " = VEC_PERM_EXPR" "slp2" { target x86_64-*-* aarch64*-*-* } } } */
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 226550635cc..cc04ef350a6 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -1840,6 +1840,10 @@  vect_build_slp_tree_2 (vec_info *vinfo, slp_tree node,
 					     TREE_TYPE (TREE_TYPE (vec))));
 	  SLP_TREE_VECTYPE (vnode) = TREE_TYPE (vec);
 	}
+      auto nunits = TYPE_VECTOR_SUBPARTS (SLP_TREE_VECTYPE (vnode));
+      unsigned HOST_WIDE_INT const_nunits;
+      if (nunits.is_constant (&const_nunits))
+	SLP_TREE_LANES (vnode) = const_nunits;
       SLP_TREE_VEC_DEFS (vnode).safe_push (vec);
       /* We are always building a permutation node even if it is an identity
 	 permute to shield the rest of the vectorizer from the odd node
@@ -4325,8 +4329,6 @@  vect_optimize_slp_pass::is_compatible_layout (slp_tree node,
   if (layout_i == 0)
     return true;
 
-  /* SLP_TREE_LANES is zero for existing vectors, but those only support
-     layout 0 anyway.  */
   if (SLP_TREE_LANES (node) != m_perms[layout_i].length ())
     return false;
 
@@ -4521,38 +4523,57 @@  vect_optimize_slp_pass::start_choosing_layouts ()
   m_perms.safe_push (vNULL);
 
   /* Create layouts from existing permutations.  */
-  for (unsigned int node_i : m_leafs)
+  auto_load_permutation_t tmp_perm;
+  for (unsigned int node_i : m_partitioned_nodes)
     {
-      auto &vertex = m_vertices[node_i];
-      if (vertex.partition < 0)
-	continue;
-
       /* Leafs also double as entries to the reverse graph.  Allow the
 	 layout of those to be changed.  */
+      auto &vertex = m_vertices[node_i];
       auto &partition = m_partitions[vertex.partition];
       if (!m_slpg->vertices[node_i].succ)
 	partition.layout = 0;
 
-      /* Loads are the only thing generating permutes.  */
+      /* Loads and VEC_PERM_EXPRs are the only things generating permutes.  */
       slp_tree node = vertex.node;
-      if (!SLP_TREE_LOAD_PERMUTATION (node).exists ())
-	continue;
-
-      /* If splitting out a SLP_TREE_LANE_PERMUTATION can make the node
-	 unpermuted, record a layout that reverses this permutation.  */
-      gcc_assert (partition.layout == 0);
       stmt_vec_info dr_stmt = SLP_TREE_REPRESENTATIVE (node);
-      if (!STMT_VINFO_GROUPED_ACCESS (dr_stmt))
-	continue;
-      dr_stmt = DR_GROUP_FIRST_ELEMENT (dr_stmt);
-      unsigned imin = DR_GROUP_SIZE (dr_stmt) + 1, imax = 0;
+      slp_tree child;
+      unsigned HOST_WIDE_INT imin, imax = 0;
       bool any_permute = false;
+      tmp_perm.truncate (0);
+      if (SLP_TREE_LOAD_PERMUTATION (node).exists ())
+	{
+	  /* If splitting out a SLP_TREE_LANE_PERMUTATION can make the node
+	     unpermuted, record a layout that reverses this permutation.  */
+	  gcc_assert (partition.layout == 0);
+	  if (!STMT_VINFO_GROUPED_ACCESS (dr_stmt))
+	    continue;
+	  dr_stmt = DR_GROUP_FIRST_ELEMENT (dr_stmt);
+	  imin = DR_GROUP_SIZE (dr_stmt) + 1;
+	  tmp_perm.safe_splice (SLP_TREE_LOAD_PERMUTATION (node));
+	}
+      else if (SLP_TREE_CODE (node) == VEC_PERM_EXPR
+	       && SLP_TREE_CHILDREN (node).length () == 1
+	       && (child = SLP_TREE_CHILDREN (node)[0])
+	       && (TYPE_VECTOR_SUBPARTS (SLP_TREE_VECTYPE (child))
+		   .is_constant (&imin)))
+	{
+	  /* If the child has the same vector size as this node,
+	     reversing the permutation can make the permutation a no-op.
+	     In other cases it can change a true permutation into a
+	     full-vector extract.  */
+	  tmp_perm.reserve (SLP_TREE_LANES (node));
+	  for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
+	    tmp_perm.quick_push (SLP_TREE_LANE_PERMUTATION (node)[j].second);
+	}
+      else
+	continue;
+
       for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
 	{
-	  unsigned idx = SLP_TREE_LOAD_PERMUTATION (node)[j];
+	  unsigned idx = tmp_perm[j];
 	  imin = MIN (imin, idx);
 	  imax = MAX (imax, idx);
-	  if (idx - SLP_TREE_LOAD_PERMUTATION (node)[0] != j)
+	  if (idx - tmp_perm[0] != j)
 	    any_permute = true;
 	}
       /* If the span doesn't match we'd disrupt VF computation, avoid
@@ -4560,7 +4581,7 @@  vect_optimize_slp_pass::start_choosing_layouts ()
       if (imax - imin + 1 != SLP_TREE_LANES (node))
 	continue;
       /* If there's no permute no need to split one out.  In this case
-	 we can consider turning the load into a permuted load, if that
+	 we can consider turning a load into a permuted load, if that
 	 turns out to be cheaper than alternatives.  */
       if (!any_permute)
 	{
@@ -4575,7 +4596,7 @@  vect_optimize_slp_pass::start_choosing_layouts ()
       auto_sbitmap load_index (SLP_TREE_LANES (node));
       bitmap_clear (load_index);
       for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
-	bitmap_set_bit (load_index, SLP_TREE_LOAD_PERMUTATION (node)[j] - imin);
+	bitmap_set_bit (load_index, tmp_perm[j] - imin);
       unsigned j;
       for (j = 0; j < SLP_TREE_LANES (node); ++j)
 	if (!bitmap_bit_p (load_index, j))
@@ -4586,7 +4607,7 @@  vect_optimize_slp_pass::start_choosing_layouts ()
       vec<unsigned> perm = vNULL;
       perm.safe_grow (SLP_TREE_LANES (node), true);
       for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
-	perm[j] = SLP_TREE_LOAD_PERMUTATION (node)[j] - imin;
+	perm[j] = tmp_perm[j] - imin;
 
       if (int (m_perms.length ()) >= param_vect_max_layout_candidates)
 	{