Vect: select small VF for epilog of unrolled loop (PR tree-optimization/110474)

Message ID SJ2PR01MB8635E49C6DC6B89D31D6390FE12FA@SJ2PR01MB8635.prod.exchangelabs.com
State Accepted


Commit Message

Hao Liu OS July 5, 2023, 8:46 a.m. UTC
  Hi,

If a loop is unrolled during vectorization (i.e. suggested_unroll_factor > 1),
the VFs of both the main and the epilogue vector loops are enlarged.  The
epilogue vector loop handles the iterations left over by the main loop (all
of them, for a loop with a small iteration count), so a large epilogue VF may
hurt performance.

This patch unscales the main loop VF by suggested_unroll_factor when
selecting the epilogue loop VF, so that the epilogue VF is the same as for a
vectorized loop without unrolling (i.e. suggested_unroll_factor = 1).
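The rule change can be illustrated with a small C model. This is a sketch under simplifying assumptions, not GCC code: vectors are taken to be fixed-length, so VFs are plain `unsigned` values instead of `poly_uint64` (and `maybe_ge` reduces to `>=`), the partial-vectors case (where the check is skipped) is ignored, and the function names are hypothetical. With the VFs from the testcase comment, unrolling by 2 gives a main loop VF of 32; the old rule accepts an epilogue VF of 16, while the new rule compares against 32 / 2 = 16 and so only accepts a smaller VF such as 8.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the old check: an epilogue VF is rejected
   when it is >= the (possibly unrolled) main loop VF.  */
static bool
epilogue_vf_ok_old (unsigned epilogue_vf, unsigned main_vf)
{
  return epilogue_vf < main_vf;
}

/* Hypothetical model of the new check: compare against the main loop VF
   divided by the suggested unroll factor, i.e. the VF the main loop
   would have had without unrolling.  */
static bool
epilogue_vf_ok_new (unsigned epilogue_vf, unsigned main_vf,
                    unsigned suggested_unroll_factor)
{
  /* The patch uses exact_div, which expects the division to be exact.  */
  assert (suggested_unroll_factor > 0
          && main_vf % suggested_unroll_factor == 0);
  unsigned unscaled_vf = main_vf / suggested_unroll_factor;
  return epilogue_vf < unscaled_vf;
}
```

With suggested_unroll_factor = 1 the two functions agree, which matches the intent that the epilogue VF is chosen as if the main loop had not been unrolled.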

gcc/ChangeLog:

	PR tree-optimization/110474
	* tree-vect-loop.cc (vect_analyze_loop_2): Unscale the VF by the
	suggested unroll factor when selecting the epilogue loop VF.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/pr110474.c: New testcase.
---
 gcc/testsuite/gcc.target/aarch64/pr110474.c | 37 +++++++++++++++++++++
 gcc/tree-vect-loop.cc                       | 16 +++++----
 2 files changed, 47 insertions(+), 6 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110474.c
  

Comments

Richard Sandiford July 5, 2023, 7:37 p.m. UTC
Hao Liu OS via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> Hi,
>
> If a loop is unrolled during vectorization (i.e. suggested_unroll_factor > 1),
> the VFs of both main and epilog loop are enlarged.  The epilog vect loop is
> specific for a loop with small iteration counts, so a large VF may hurt
> performance.
>
> This patch unscales the main loop VF by suggested_unroll_factor while selecting
> the epilog loop VF, so that it will be the same as vectorized loop without
> unrolling (i.e. suggested_unroll_factor = 1).

I agree that unrolling the main loop shouldn't cause more iterations
to be handled by the scalar code.  It would be nice to support multiple
epilogues, but that's probably a lot of work.
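The point about scalar iterations can be checked with a rough C model. It assumes, as a simplification rather than a description of GCC's implementation, that the main vector loop consumes whole chunks of main_vf elements, the epilogue vector loop then consumes whole chunks of epilogue_vf from what is left, and the scalar loop runs the rest; peeling for alignment, partial vectors and cost-model thresholds are ignored, and `scalar_iterations` is a hypothetical helper.

```c
#include <assert.h>

/* Hypothetical helper: number of iterations left for the scalar loop
   when N iterations are split between a main vector loop with VF
   MAIN_VF and an epilogue vector loop with VF EPILOGUE_VF.  */
static unsigned
scalar_iterations (unsigned n, unsigned main_vf, unsigned epilogue_vf)
{
  assert (main_vf > 0 && epilogue_vf > 0);
  /* Iterations not covered by whole main-loop vectors.  */
  unsigned after_main = n % main_vf;
  /* Iterations not covered by whole epilogue vectors either.  */
  return after_main % epilogue_vf;
}
```

For N = 8 and a main VF of 32, `scalar_iterations (8, 32, 16)` is 8, so every iteration falls through to the scalar code, whereas `scalar_iterations (8, 32, 8)` is 0, the same as the non-unrolled configuration `scalar_iterations (8, 16, 8)`.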

> gcc/ChangeLog:
>
> 	PR tree-optimization/110474
> 	* tree-vect-loop.cc (vect_analyze_loop_2): unscale the VF by suggested
> 	unroll factor while selecting the epilog vect loop VF.
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.target/aarch64/pr110474.c: New testcase.

OK, thanks.

Richard


Patch

diff --git a/gcc/testsuite/gcc.target/aarch64/pr110474.c b/gcc/testsuite/gcc.target/aarch64/pr110474.c
new file mode 100644
index 00000000000..e548416162a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr110474.c
@@ -0,0 +1,37 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -mtune=neoverse-n2 -mcpu=neoverse-n1 -fdump-tree-vect-details --param aarch64-vect-unroll-limit=2" } */
+/* { dg-final { scan-tree-dump "Choosing vector mode V8HI"  "vect" } } */
+/* { dg-final { scan-tree-dump "Choosing epilogue vector mode V8QI"  "vect" } } */
+
+/* Do not increase the vectorization factor of the epilog vectorized loop
+   for a loop with suggested_unroll_factor > 1.
+
+   before (suggested_unroll_factor=1):
+     if N >= 16:
+         main vect loop
+     if N >= 8:
+         epilog vect loop
+     scalar code
+
+   before (suggested_unroll_factor=2):
+     if N >= 32:
+         main vect loop
+     if N >= 16:  // May fail to execute vectorized code (e.g. N is 8)
+         epilog vect loop
+     scalar code
+
+   after  (suggested_unroll_factor=2):
+     if N >= 32:
+         main vect loop
+     if N >= 8:  // The same VF as suggested_unroll_factor=1
+         epilog vect loop
+     scalar code  */
+
+int
+foo (short *A, char *B, int N)
+{
+  int sum = 0;
+  for (int i = 0; i < N; ++i)
+    sum += A[i] * B[i];
+  return sum;
+}
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 3b46c58a8d8..4d9abd035ea 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -3021,12 +3021,16 @@ start_over:
      to be able to handle fewer than VF scalars, or needs to have a lower VF
      than the main loop.  */
   if (LOOP_VINFO_EPILOGUE_P (loop_vinfo)
-      && !LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
-      && maybe_ge (LOOP_VINFO_VECT_FACTOR (loop_vinfo),
-		   LOOP_VINFO_VECT_FACTOR (orig_loop_vinfo)))
-    return opt_result::failure_at (vect_location,
-				   "Vectorization factor too high for"
-				   " epilogue loop.\n");
+      && !LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
+    {
+      poly_uint64 unscaled_vf
+	= exact_div (LOOP_VINFO_VECT_FACTOR (orig_loop_vinfo),
+		     orig_loop_vinfo->suggested_unroll_factor);
+      if (maybe_ge (LOOP_VINFO_VECT_FACTOR (loop_vinfo), unscaled_vf))
+	return opt_result::failure_at (vect_location,
+				       "Vectorization factor too high for"
+				       " epilogue loop.\n");
+    }
 
   /* Decide whether this loop_vinfo should use partial vectors or peeling,
      assuming that the loop will be used as a main loop.  We will redo