[RFC] tree-optimization/92335 - Improve sinking heuristics for vectorization

Message ID 20230728070552.50C1413276@imap2.suse-dmz.suse.de
State Accepted
Series [RFC] tree-optimization/92335 - Improve sinking heuristics for vectorization

Checks

Context: snail/gcc-patch-check
Check: success
Description: Github commit url

Commit Message

Richard Biener July 28, 2023, 7:05 a.m. UTC
The following delays sinking of loads within the same innermost
loop when they were unconditional before.  That's a not uncommon
issue preventing vectorization when masked loads are not available.

Bootstrapped and tested on x86_64-unknown-linux-gnu.

I have a followup patch improving sinking that, without this, would
cause more of the problematic sinking - now that we have a second
sink pass after loop opts, this looks like a reasonable approach?

OK?

Thanks,
Richard.

	PR tree-optimization/92335
	* tree-ssa-sink.cc (select_best_block): Before loop
	optimizations avoid sinking unconditional loads/stores
	in innermost loops to conditionally executed places.

	* gcc.dg/tree-ssa/ssa-sink-10.c: Disable vectorizing.
	* gcc.dg/tree-ssa/predcom-9.c: Clone from ssa-sink-10.c,
	expect predictive commoning to happen instead of sinking.
	* gcc.dg/vect/pr65947-3.c: Adjust.
---
 gcc/testsuite/gcc.dg/tree-ssa/predcom-9.c   | 20 ++++++++++++++++++++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-10.c |  2 +-
 gcc/testsuite/gcc.dg/vect/pr65947-3.c       |  6 +-----
 gcc/tree-ssa-sink.cc                        | 12 ++++++++++++
 4 files changed, 34 insertions(+), 6 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/predcom-9.c
  

Comments

Jeff Law July 31, 2023, 3:33 p.m. UTC | #1
On 7/28/23 01:05, Richard Biener via Gcc-patches wrote:
> The following delays sinking of loads within the same innermost
> loop when it was unconditional before.  That's a not uncommon
> issue preventing vectorization when masked loads are not available.
> 
> Bootstrapped and tested on x86_64-unknown-linux-gnu.
> 
> I have a followup patch improving sinking that without this would
> cause more of the problematic sinking - now that we have a second
> sink pass after loop opts this looks like a reasonable approach?
> 
> OK?
> 
> Thanks,
> Richard.
> 
> 	PR tree-optimization/92335
> 	* tree-ssa-sink.cc (select_best_block): Before loop
> 	optimizations avoid sinking unconditional loads/stores
> 	in innermost loops to conditional executed places.
> 
> 	* gcc.dg/tree-ssa/ssa-sink-10.c: Disable vectorizing.
> 	* gcc.dg/tree-ssa/predcom-9.c: Clone from ssa-sink-10.c,
> 	expect predictive commoning to happen instead of sinking.
> 	* gcc.dg/vect/pr65947-3.c: Adjust.
I think it's reasonable -- there's probably going to be cases where it's 
not great, but more often than not I think it's going to be a reasonable 
heuristic.

If there is undesirable fallout, better to find it over the coming 
months than next spring.  So I'd suggest we go forward now to give more 
time to find any pathological cases (if they exist).

Jeff
  
Richard Biener Aug. 2, 2023, 8:46 a.m. UTC | #2
On Mon, 31 Jul 2023, Jeff Law wrote:

> 
> 
> On 7/28/23 01:05, Richard Biener via Gcc-patches wrote:
> > The following delays sinking of loads within the same innermost
> > loop when it was unconditional before.  That's a not uncommon
> > issue preventing vectorization when masked loads are not available.
> > 
> > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> > 
> > I have a followup patch improving sinking that without this would
> > cause more of the problematic sinking - now that we have a second
> > sink pass after loop opts this looks like a reasonable approach?
> > 
> > OK?
> > 
> > Thanks,
> > Richard.
> > 
> >  PR tree-optimization/92335
> >  * tree-ssa-sink.cc (select_best_block): Before loop
> >  optimizations avoid sinking unconditional loads/stores
> >  in innermost loops to conditional executed places.
> > 
> >  * gcc.dg/tree-ssa/ssa-sink-10.c: Disable vectorizing.
> >  * gcc.dg/tree-ssa/predcom-9.c: Clone from ssa-sink-10.c,
> >  expect predictive commoning to happen instead of sinking.
> >  * gcc.dg/vect/pr65947-3.c: Adjust.
> I think it's reasonable -- there's probably going to be cases where it's not
> great, but more often than not I think it's going to be a reasonable
> heuristic.
> 
> If there is undesirable fallout, better to find it over the coming months than
> next spring.  So I'd suggest we go forward now to give more time to find any
> pathological cases (if they exist).

Agreed, I've pushed this now.

Richard.
  
Prathamesh Kulkarni Aug. 3, 2023, 11:42 a.m. UTC | #3
On Wed, 2 Aug 2023 at 14:17, Richard Biener via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On Mon, 31 Jul 2023, Jeff Law wrote:
>
> >
> >
> > On 7/28/23 01:05, Richard Biener via Gcc-patches wrote:
> > > The following delays sinking of loads within the same innermost
> > > loop when it was unconditional before.  That's a not uncommon
> > > issue preventing vectorization when masked loads are not available.
> > >
> > > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> > >
> > > I have a followup patch improving sinking that without this would
> > > cause more of the problematic sinking - now that we have a second
> > > sink pass after loop opts this looks like a reasonable approach?
> > >
> > > OK?
> > >
> > > Thanks,
> > > Richard.
> > >
> > >  PR tree-optimization/92335
> > >  * tree-ssa-sink.cc (select_best_block): Before loop
> > >  optimizations avoid sinking unconditional loads/stores
> > >  in innermost loops to conditional executed places.
> > >
> > >  * gcc.dg/tree-ssa/ssa-sink-10.c: Disable vectorizing.
> > >  * gcc.dg/tree-ssa/predcom-9.c: Clone from ssa-sink-10.c,
> > >  expect predictive commoning to happen instead of sinking.
> > >  * gcc.dg/vect/pr65947-3.c: Adjust.
> > I think it's reasonable -- there's probably going to be cases where it's not
> > great, but more often than not I think it's going to be a reasonable
> > heuristic.
> >
> > If there is undesirable fallout, better to find it over the coming months than
> > next spring.  So I'd suggest we go forward now to give more time to find any
> > pathological cases (if they exist).
>
> Agreed, I've pushed this now.
Hi Richard,
After this patch (committed in 399c8dd44ff44f4b496223c7cc980651c4d6f6a0),
pr65947-7.c "failed" for aarch64-linux-gnu:
FAIL: gcc.dg/vect/pr65947-7.c scan-tree-dump-not vect "LOOP VECTORIZED"
FAIL: gcc.dg/vect/pr65947-7.c -flto -ffat-lto-objects
scan-tree-dump-not vect "LOOP VECTORIZED"

/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" { target {
! vect_fold_extract_last } } } } */

With your commit, condition_reduction in pr65947-7.c gets vectorized
regardless of vect_fold_extract_last, which gates the above test.
(That is an improvement, because the function didn't get vectorized
before the commit.)

The attached patch thus removes the gating on vect_fold_extract_last,
and the test passes again.
OK to commit?

Thanks,
Prathamesh
>
> Richard.
diff --git a/gcc/testsuite/gcc.dg/vect/pr65947-7.c b/gcc/testsuite/gcc.dg/vect/pr65947-7.c
index 16cdcd1c6eb..7dabae81abf 100644
--- a/gcc/testsuite/gcc.dg/vect/pr65947-7.c
+++ b/gcc/testsuite/gcc.dg/vect/pr65947-7.c
@@ -52,5 +52,4 @@ main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" { target vect_fold_extract_last } } } */
-/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" { target { ! vect_fold_extract_last } } } } */
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
  
Richard Biener Aug. 3, 2023, 12:14 p.m. UTC | #4
On Thu, 3 Aug 2023, Prathamesh Kulkarni wrote:

> On Wed, 2 Aug 2023 at 14:17, Richard Biener via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> > On Mon, 31 Jul 2023, Jeff Law wrote:
> >
> > >
> > >
> > > On 7/28/23 01:05, Richard Biener via Gcc-patches wrote:
> > > > The following delays sinking of loads within the same innermost
> > > > loop when it was unconditional before.  That's a not uncommon
> > > > issue preventing vectorization when masked loads are not available.
> > > >
> > > > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> > > >
> > > > I have a followup patch improving sinking that without this would
> > > > cause more of the problematic sinking - now that we have a second
> > > > sink pass after loop opts this looks like a reasonable approach?
> > > >
> > > > OK?
> > > >
> > > > Thanks,
> > > > Richard.
> > > >
> > > >  PR tree-optimization/92335
> > > >  * tree-ssa-sink.cc (select_best_block): Before loop
> > > >  optimizations avoid sinking unconditional loads/stores
> > > >  in innermost loops to conditional executed places.
> > > >
> > > >  * gcc.dg/tree-ssa/ssa-sink-10.c: Disable vectorizing.
> > > >  * gcc.dg/tree-ssa/predcom-9.c: Clone from ssa-sink-10.c,
> > > >  expect predictive commoning to happen instead of sinking.
> > > >  * gcc.dg/vect/pr65947-3.c: Adjust.
> > > I think it's reasonable -- there's probably going to be cases where it's not
> > > great, but more often than not I think it's going to be a reasonable
> > > heuristic.
> > >
> > > If there is undesirable fallout, better to find it over the coming months than
> > > next spring.  So I'd suggest we go forward now to give more time to find any
> > > pathological cases (if they exist).
> >
> > Agreed, I've pushed this now.
> Hi Richard,
> After this patch (committed in 399c8dd44ff44f4b496223c7cc980651c4d6f6a0),
> pr65947-7.c "failed" for aarch64-linux-gnu:
> FAIL: gcc.dg/vect/pr65947-7.c scan-tree-dump-not vect "LOOP VECTORIZED"
> FAIL: gcc.dg/vect/pr65947-7.c -flto -ffat-lto-objects
> scan-tree-dump-not vect "LOOP VECTORIZED"
> 
> /* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" { target {
> ! vect_fold_extract_last } } } } */
> 
> With your commit, condition_reduction in pr65947-7.c gets vectorized
> regardless of vect_fold_extract_last,
> which gates the above test (which is an improvement, because the
> function didn't get vectorized before the commit).
> 
> The attached patch thus removes the gating on vect_fold_extract_last,
> and the test passes again.
> OK to commit ?

OK.

Thanks,
Richard.
  
Richard Biener Aug. 3, 2023, 12:16 p.m. UTC | #5
On Thu, 3 Aug 2023, Richard Biener wrote:

> On Thu, 3 Aug 2023, Prathamesh Kulkarni wrote:
> 
> > On Wed, 2 Aug 2023 at 14:17, Richard Biener via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> > >
> > > On Mon, 31 Jul 2023, Jeff Law wrote:
> > >
> > > >
> > > >
> > > > On 7/28/23 01:05, Richard Biener via Gcc-patches wrote:
> > > > > The following delays sinking of loads within the same innermost
> > > > > loop when it was unconditional before.  That's a not uncommon
> > > > > issue preventing vectorization when masked loads are not available.
> > > > >
> > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> > > > >
> > > > > I have a followup patch improving sinking that without this would
> > > > > cause more of the problematic sinking - now that we have a second
> > > > > sink pass after loop opts this looks like a reasonable approach?
> > > > >
> > > > > OK?
> > > > >
> > > > > Thanks,
> > > > > Richard.
> > > > >
> > > > >  PR tree-optimization/92335
> > > > >  * tree-ssa-sink.cc (select_best_block): Before loop
> > > > >  optimizations avoid sinking unconditional loads/stores
> > > > >  in innermost loops to conditional executed places.
> > > > >
> > > > >  * gcc.dg/tree-ssa/ssa-sink-10.c: Disable vectorizing.
> > > > >  * gcc.dg/tree-ssa/predcom-9.c: Clone from ssa-sink-10.c,
> > > > >  expect predictive commoning to happen instead of sinking.
> > > > >  * gcc.dg/vect/pr65947-3.c: Adjust.
> > > > I think it's reasonable -- there's probably going to be cases where it's not
> > > > great, but more often than not I think it's going to be a reasonable
> > > > heuristic.
> > > >
> > > > If there is undesirable fallout, better to find it over the coming months than
> > > > next spring.  So I'd suggest we go forward now to give more time to find any
> > > > pathological cases (if they exist).
> > >
> > > Agreed, I've pushed this now.
> > Hi Richard,
> > After this patch (committed in 399c8dd44ff44f4b496223c7cc980651c4d6f6a0),
> > pr65947-7.c "failed" for aarch64-linux-gnu:
> > FAIL: gcc.dg/vect/pr65947-7.c scan-tree-dump-not vect "LOOP VECTORIZED"
> > FAIL: gcc.dg/vect/pr65947-7.c -flto -ffat-lto-objects
> > scan-tree-dump-not vect "LOOP VECTORIZED"
> > 
> > /* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" { target {
> > ! vect_fold_extract_last } } } } */
> > 
> > With your commit, condition_reduction in pr65947-7.c gets vectorized
> > regardless of vect_fold_extract_last,
> > which gates the above test (which is an improvement, because the
> > function didn't get vectorized before the commit).
> > 
> > The attached patch thus removes the gating on vect_fold_extract_last,
> > and the test passes again.
> > OK to commit ?
> 
> OK.

Or wait - the loop doesn't vectorize on x86_64, so I guess one
critical target condition is missing.  Can you figure out which?

Thanks,
Richard.
  
Richard Biener Aug. 3, 2023, 12:18 p.m. UTC | #6
On Thu, 3 Aug 2023, Richard Biener wrote:

> On Thu, 3 Aug 2023, Richard Biener wrote:
> 
> > On Thu, 3 Aug 2023, Prathamesh Kulkarni wrote:
> > 
> > > On Wed, 2 Aug 2023 at 14:17, Richard Biener via Gcc-patches
> > > <gcc-patches@gcc.gnu.org> wrote:
> > > >
> > > > On Mon, 31 Jul 2023, Jeff Law wrote:
> > > >
> > > > >
> > > > >
> > > > > On 7/28/23 01:05, Richard Biener via Gcc-patches wrote:
> > > > > > The following delays sinking of loads within the same innermost
> > > > > > loop when it was unconditional before.  That's a not uncommon
> > > > > > issue preventing vectorization when masked loads are not available.
> > > > > >
> > > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> > > > > >
> > > > > > I have a followup patch improving sinking that without this would
> > > > > > cause more of the problematic sinking - now that we have a second
> > > > > > sink pass after loop opts this looks like a reasonable approach?
> > > > > >
> > > > > > OK?
> > > > > >
> > > > > > Thanks,
> > > > > > Richard.
> > > > > >
> > > > > >  PR tree-optimization/92335
> > > > > >  * tree-ssa-sink.cc (select_best_block): Before loop
> > > > > >  optimizations avoid sinking unconditional loads/stores
> > > > > >  in innermost loops to conditional executed places.
> > > > > >
> > > > > >  * gcc.dg/tree-ssa/ssa-sink-10.c: Disable vectorizing.
> > > > > >  * gcc.dg/tree-ssa/predcom-9.c: Clone from ssa-sink-10.c,
> > > > > >  expect predictive commoning to happen instead of sinking.
> > > > > >  * gcc.dg/vect/pr65947-3.c: Adjust.
> > > > > I think it's reasonable -- there's probably going to be cases where it's not
> > > > > great, but more often than not I think it's going to be a reasonable
> > > > > heuristic.
> > > > >
> > > > > If there is undesirable fallout, better to find it over the coming months than
> > > > > next spring.  So I'd suggest we go forward now to give more time to find any
> > > > > pathological cases (if they exist).
> > > >
> > > > Agreed, I've pushed this now.
> > > Hi Richard,
> > > After this patch (committed in 399c8dd44ff44f4b496223c7cc980651c4d6f6a0),
> > > pr65947-7.c "failed" for aarch64-linux-gnu:
> > > FAIL: gcc.dg/vect/pr65947-7.c scan-tree-dump-not vect "LOOP VECTORIZED"
> > > FAIL: gcc.dg/vect/pr65947-7.c -flto -ffat-lto-objects
> > > scan-tree-dump-not vect "LOOP VECTORIZED"
> > > 
> > > /* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" { target {
> > > ! vect_fold_extract_last } } } } */
> > > 
> > > With your commit, condition_reduction in pr65947-7.c gets vectorized
> > > regardless of vect_fold_extract_last,
> > > which gates the above test (which is an improvement, because the
> > > function didn't get vectorized before the commit).
> > > 
> > > The attached patch thus removes the gating on vect_fold_extract_last,
> > > and the test passes again.
> > > OK to commit ?
> > 
> > OK.
> 
> Or wait - the loop doesn't vectorize on x86_64, so I guess one
> critical target condition is missing.  Can you figure out which?

I see

/space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21: 
note:   vect_is_simple_use: operand last_19 = PHI <last_8(7), 108(15)>, 
type of def: reduction
/space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21: 
note:   vect_is_simple_use: vectype vector(4) int
/space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21: 
missed:   multiple types in double reduction or condition reduction or 
fold-left reduction.
/space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:13:1: 
missed:   not vectorized: relevant phi not supported: last_19 = PHI 
<last_8(7), 108(15)>
/space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21: 
missed:  bad operation or unsupported loop bound.

Richard.
  
Prathamesh Kulkarni Aug. 7, 2023, 12:04 a.m. UTC | #7
On Thu, 3 Aug 2023 at 17:48, Richard Biener <rguenther@suse.de> wrote:
>
> On Thu, 3 Aug 2023, Richard Biener wrote:
>
> > On Thu, 3 Aug 2023, Richard Biener wrote:
> >
> > > On Thu, 3 Aug 2023, Prathamesh Kulkarni wrote:
> > >
> > > > On Wed, 2 Aug 2023 at 14:17, Richard Biener via Gcc-patches
> > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > >
> > > > > On Mon, 31 Jul 2023, Jeff Law wrote:
> > > > >
> > > > > >
> > > > > >
> > > > > > On 7/28/23 01:05, Richard Biener via Gcc-patches wrote:
> > > > > > > The following delays sinking of loads within the same innermost
> > > > > > > loop when it was unconditional before.  That's a not uncommon
> > > > > > > issue preventing vectorization when masked loads are not available.
> > > > > > >
> > > > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> > > > > > >
> > > > > > > I have a followup patch improving sinking that without this would
> > > > > > > cause more of the problematic sinking - now that we have a second
> > > > > > > sink pass after loop opts this looks like a reasonable approach?
> > > > > > >
> > > > > > > OK?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Richard.
> > > > > > >
> > > > > > >  PR tree-optimization/92335
> > > > > > >  * tree-ssa-sink.cc (select_best_block): Before loop
> > > > > > >  optimizations avoid sinking unconditional loads/stores
> > > > > > >  in innermost loops to conditional executed places.
> > > > > > >
> > > > > > >  * gcc.dg/tree-ssa/ssa-sink-10.c: Disable vectorizing.
> > > > > > >  * gcc.dg/tree-ssa/predcom-9.c: Clone from ssa-sink-10.c,
> > > > > > >  expect predictive commoning to happen instead of sinking.
> > > > > > >  * gcc.dg/vect/pr65947-3.c: Adjust.
> > > > > > I think it's reasonable -- there's probably going to be cases where it's not
> > > > > > great, but more often than not I think it's going to be a reasonable
> > > > > > heuristic.
> > > > > >
> > > > > > If there is undesirable fallout, better to find it over the coming months than
> > > > > > next spring.  So I'd suggest we go forward now to give more time to find any
> > > > > > pathological cases (if they exist).
> > > > >
> > > > > Agreed, I've pushed this now.
> > > > Hi Richard,
> > > > After this patch (committed in 399c8dd44ff44f4b496223c7cc980651c4d6f6a0),
> > > > pr65947-7.c "failed" for aarch64-linux-gnu:
> > > > FAIL: gcc.dg/vect/pr65947-7.c scan-tree-dump-not vect "LOOP VECTORIZED"
> > > > FAIL: gcc.dg/vect/pr65947-7.c -flto -ffat-lto-objects
> > > > scan-tree-dump-not vect "LOOP VECTORIZED"
> > > >
> > > > /* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" { target {
> > > > ! vect_fold_extract_last } } } } */
> > > >
> > > > With your commit, condition_reduction in pr65947-7.c gets vectorized
> > > > regardless of vect_fold_extract_last,
> > > > which gates the above test (which is an improvement, because the
> > > > function didn't get vectorized before the commit).
> > > >
> > > > The attached patch thus removes the gating on vect_fold_extract_last,
> > > > and the test passes again.
> > > > OK to commit ?
> > >
> > > OK.
> >
> > Or wait - the loop doesn't vectorize on x86_64, so I guess one
> > critical target condition is missing.  Can you figure out which?
>
> I see
>
> /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21:
> note:   vect_is_simple_use: operand last_19 = PHI <last_8(7), 108(15)>,
> type of def: reduction
> /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21:
> note:   vect_is_simple_use: vectype vector(4) int
> /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21:
> missed:   multiple types in double reduction or condition reduction or
> fold-left reduction.
> /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:13:1:
> missed:   not vectorized: relevant phi not supported: last_19 = PHI
> <last_8(7), 108(15)>
> /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21:
> missed:  bad operation or unsupported loop bound.
Hi Richard,
Looking at the aarch64 vect dump, it seems the loop in
condition_reduction gets vectorized with V4HI mode
while it fails for other modes in vectorizable_condition:

  if ((double_reduc || reduction_type != TREE_CODE_REDUCTION)
      && ncopies > 1)
    {
      if (dump_enabled_p ())
        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                         "multiple types in double reduction or condition "
                         "reduction or fold-left reduction.\n");
      return false;
    }

From the dump:
foo.c:9:21: note:   === vect_analyze_loop_operations ===
foo.c:9:21: note:   examining phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note:   vect_is_simple_use: operand (int) aval_13, type of
def: internal
foo.c:9:21: note:   vect_is_simple_use: vectype vector(4) int
foo.c:9:21: note:   vect_is_simple_use: operand last_19 = PHI
<last_8(7), 108(15)>, type of def: reduction
foo.c:9:21: note:   vect_is_simple_use: vectype vector(4) int

For V8HI, VF = 8, and vectype_in = vector(4) int.
Thus ncopies = VF / length(vectype_in) = 2, which is greater than 1,
and thus fails:
foo.c:9:21: missed:   multiple types in double reduction or condition
reduction or fold-left reduction.
foo.c:4:1: missed:   not vectorized: relevant phi not supported:
last_19 = PHI <last_8(7), 108(15)>
While for V4HI, VF = 4 and thus ncopies = 1, so it succeeds.

For x86_64, it seems the vectorizer doesn't seem to try V4HI mode.
If I "force" the vectorizer to use V4HI mode, we get the following dump:
foo.c:9:21: note:   === vect_analyze_loop_operations ===
foo.c:9:21: note:   examining phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note:   vect_is_simple_use: operand (int) aval_13, type of
def: internal
foo.c:9:21: note:   vect_is_simple_use: vectype vector(2) int
foo.c:9:21: note:   vect_is_simple_use: operand last_19 = PHI
<last_8(7), 108(15)>, type of def: reduction
foo.c:9:21: note:   vect_is_simple_use: vectype vector(2) int
foo.c:9:21: missed:   multiple types in double reduction or condition
reduction or fold-left reduction.

Not sure though if this is the only reason the test fails to
vectorize on the target.
Will investigate in more detail next week.

Thanks,
Prathamesh
>
> Richard.
;; Function condition_reduction (condition_reduction, funcdef_no=0, decl_uid=4390, cgraph_uid=1, symbol_order=0)


Analyzing loop at foo.c:9
foo.c:9:21: note:  === analyze_loop_nest ===
foo.c:9:21: note:   === vect_analyze_loop_form ===
foo.c:9:21: note:    === get_loop_niters ===
Analyzing # of iterations of loop 1
  exit condition [42, + , 4294967295] != 0
  bounds on difference of bases: -42 ... -42
  result:
    # of iterations 42, bounded by 42
Creating dr for *_3
analyze_innermost: success.
	base_address: a_12(D)
	offset from base address: 0
	constant offset from base address: 0
	step: 2
	base alignment: 2
	base misalignment: 0
	offset alignment: 128
	step alignment: 2
	base_object: *a_12(D)
	Access function 0: {0B, +, 2}_1
Creating dr for *_6
analyze_innermost: success.
	base_address: b_14(D)
	offset from base address: 0
	constant offset from base address: 0
	step: 4
	base alignment: 4
	base misalignment: 0
	offset alignment: 128
	step alignment: 4
	base_object: *b_14(D)
	Access function 0: {0B, +, 4}_1
foo.c:9:21: note:   === vect_analyze_data_refs ===
foo.c:9:21: note:   got vectype for stmt: aval_13 = *_3;
vector(8) short int
foo.c:9:21: note:   got vectype for stmt: _7 = *_6;
vector(4) int
foo.c:9:21: note:   === vect_analyze_scalar_cycles ===
foo.c:9:21: note:   Analyze phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note:   Access function of PHI: last_19
foo.c:9:21: note:   Analyze phi: i_21 = PHI <i_17(7), 0(15)>
foo.c:9:21: note:   Access function of PHI: {0, +, 1}_1
foo.c:9:21: note:   step: 1,  init: 0
foo.c:9:21: note:   Detected induction.
foo.c:9:21: note:   Analyze phi: ivtmp_18 = PHI <ivtmp_10(7), 43(15)>
foo.c:9:21: note:   Access function of PHI: {43, +, 4294967295}_1
foo.c:9:21: note:   step: 4294967295,  init: 43
foo.c:9:21: note:   Detected induction.
foo.c:9:21: note:   Analyze phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note:   reduction path: last_8 last_19 
foo.c:9:21: note:   reduction: detected reduction
foo.c:9:21: note:   Detected reduction.
foo.c:9:21: note:   === vect_determine_precisions ===
foo.c:9:21: note:   using boolean precision 32 for _9 = _7 < min_v_15(D);
foo.c:9:21: note:   ivtmp_10 has no range info
foo.c:9:21: note:   i_17 has range [0x1, 0x2b]
foo.c:9:21: note:   can narrow to unsigned:6 without loss of precision: i_17 = i_21 + 1;
foo.c:9:21: note:   last_8 has no range info
foo.c:9:21: note:   last_16 has no range info
foo.c:9:21: note:   _7 has no range info
foo.c:9:21: note:   _5 has range [0x0, 0xa8]
foo.c:9:21: note:   can narrow to unsigned:8 without loss of precision: _5 = _1 * 4;
foo.c:9:21: note:   aval_13 has no range info
foo.c:9:21: note:   _2 has range [0x0, 0x54]
foo.c:9:21: note:   can narrow to unsigned:7 without loss of precision: _2 = _1 * 2;
foo.c:9:21: note:   _1 has range [0x0, 0x2a]
foo.c:9:21: note:   === vect_pattern_recog ===
foo.c:9:21: note:   vect_is_simple_use: operand (long unsigned int) i_21, type of def: internal
foo.c:9:21: note:   vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note:   vect_is_simple_use: operand (long unsigned int) i_21, type of def: internal
foo.c:9:21: note:   vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note:   vect_recog_widen_mult_pattern: detected: _2 = _1 * 2;
foo.c:9:21: note:   widen_mult pattern recognized: patt_37 = (long unsigned int) patt_4;
foo.c:9:21: note:   extra pattern stmt: patt_4 = i_21 w* 2;
foo.c:9:21: note:   vect_is_simple_use: operand (long unsigned int) i_21, type of def: internal
foo.c:9:21: note:   vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note:   vect_is_simple_use: operand (long unsigned int) i_21, type of def: internal
foo.c:9:21: note:   vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note:   vect_recog_widen_mult_pattern: detected: _5 = _1 * 4;
foo.c:9:21: note:   widen_mult pattern recognized: patt_39 = (long unsigned int) patt_38;
foo.c:9:21: note:   extra pattern stmt: patt_38 = i_21 w* 4;
foo.c:9:21: note:   vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note:   vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note:   vect_is_simple_use: operand ivtmp_18 = PHI <ivtmp_10(7), 43(15)>, type of def: induction
foo.c:9:21: note:   === vect_analyze_data_ref_accesses ===
foo.c:9:21: note:   === vect_mark_stmts_to_be_vectorized ===
foo.c:9:21: note:   init: phi relevant? last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note:   init: phi relevant? i_21 = PHI <i_17(7), 0(15)>
foo.c:9:21: note:   init: phi relevant? ivtmp_18 = PHI <ivtmp_10(7), 43(15)>
foo.c:9:21: note:   init: stmt relevant? _1 = (long unsigned int) i_21;
foo.c:9:21: note:   init: stmt relevant? _2 = _1 * 2;
foo.c:9:21: note:   init: stmt relevant? _3 = a_12(D) + _2;
foo.c:9:21: note:   init: stmt relevant? aval_13 = *_3;
foo.c:9:21: note:   init: stmt relevant? _5 = _1 * 4;
foo.c:9:21: note:   init: stmt relevant? _6 = b_14(D) + _5;
foo.c:9:21: note:   init: stmt relevant? _7 = *_6;
foo.c:9:21: note:   init: stmt relevant? last_16 = (int) aval_13;
foo.c:9:21: note:   init: stmt relevant? _9 = _7 < min_v_15(D);
foo.c:9:21: note:   init: stmt relevant? last_8 = _9 ? last_16 : last_19;
foo.c:9:21: note:   vec_stmt_relevant_p: used out of loop.
foo.c:9:21: note:   vect_is_simple_use: operand _7 < min_v_15(D), type of def: internal
foo.c:9:21: note:   vec_stmt_relevant_p: stmt live but not relevant.
foo.c:9:21: note:   mark relevant 1, live 1: last_8 = _9 ? last_16 : last_19;
foo.c:9:21: note:   init: stmt relevant? i_17 = i_21 + 1;
foo.c:9:21: note:   init: stmt relevant? ivtmp_10 = ivtmp_18 - 1;
foo.c:9:21: note:   init: stmt relevant? if (ivtmp_10 != 0)
foo.c:9:21: note:   worklist: examine stmt: last_8 = _9 ? last_16 : last_19;
foo.c:9:21: note:   vect_is_simple_use: operand _7 < min_v_15(D), type of def: internal
foo.c:9:21: note:   mark relevant 1, live 0: _9 = _7 < min_v_15(D);
foo.c:9:21: note:   vect_is_simple_use: operand (int) aval_13, type of def: internal
foo.c:9:21: note:   mark relevant 1, live 0: last_16 = (int) aval_13;
foo.c:9:21: note:   vect_is_simple_use: operand last_19 = PHI <last_8(7), 108(15)>, type of def: reduction
foo.c:9:21: note:   mark relevant 1, live 0: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note:   worklist: examine stmt: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note:   vect_is_simple_use: operand _9 ? last_16 : last_19, type of def: reduction
foo.c:9:21: note:   reduc-stmt defining reduc-phi in the same nest.
foo.c:9:21: note:   mark relevant 1, live 1: last_8 = _9 ? last_16 : last_19;
foo.c:9:21: note:   already marked relevant/live.
foo.c:9:21: note:   vect_is_simple_use: operand 108, type of def: constant
foo.c:9:21: note:   worklist: examine stmt: last_16 = (int) aval_13;
foo.c:9:21: note:   vect_is_simple_use: operand *_3, type of def: internal
foo.c:9:21: note:   mark relevant 1, live 0: aval_13 = *_3;
foo.c:9:21: note:   worklist: examine stmt: aval_13 = *_3;
foo.c:9:21: note:   worklist: examine stmt: _9 = _7 < min_v_15(D);
foo.c:9:21: note:   vect_is_simple_use: operand *_6, type of def: internal
foo.c:9:21: note:   mark relevant 1, live 0: _7 = *_6;
foo.c:9:21: note:   vect_is_simple_use: operand min_v_15(D), type of def: external
foo.c:9:21: note:   worklist: examine stmt: _7 = *_6;
foo.c:9:21: note:   === vect_analyze_data_ref_dependences ===
foo.c:9:21: note:   === vect_determine_vectorization_factor ===
foo.c:9:21: note:   ==> examining phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note:   get vectype for scalar type:  int
foo.c:9:21: note:   vectype: vector(4) int
foo.c:9:21: note:   nunits = 4
foo.c:9:21: note:   ==> examining phi: i_21 = PHI <i_17(7), 0(15)>
foo.c:9:21: note:   ==> examining phi: ivtmp_18 = PHI <ivtmp_10(7), 43(15)>
foo.c:9:21: note:   ==> examining statement: _1 = (long unsigned int) i_21;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining statement: _2 = _1 * 2;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining pattern def stmt: patt_4 = i_21 w* 2;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining pattern statement: patt_37 = (long unsigned int) patt_4;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining statement: _3 = a_12(D) + _2;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining statement: aval_13 = *_3;
foo.c:9:21: note:   precomputed vectype: vector(8) short int
foo.c:9:21: note:   nunits = 8
foo.c:9:21: note:   ==> examining statement: _5 = _1 * 4;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining pattern def stmt: patt_38 = i_21 w* 4;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining pattern statement: patt_39 = (long unsigned int) patt_38;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining statement: _6 = b_14(D) + _5;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining statement: _7 = *_6;
foo.c:9:21: note:   precomputed vectype: vector(4) int
foo.c:9:21: note:   nunits = 4
foo.c:9:21: note:   ==> examining statement: last_16 = (int) aval_13;
foo.c:9:21: note:   get vectype for scalar type: int
foo.c:9:21: note:   vectype: vector(4) int
foo.c:9:21: note:   get vectype for smallest scalar type: short int
foo.c:9:21: note:   nunits vectype: vector(8) short int
foo.c:9:21: note:   nunits = 8
foo.c:9:21: note:   ==> examining statement: _9 = _7 < min_v_15(D);
foo.c:9:21: note:   vectype: vector(4) <signed-boolean:32>
foo.c:9:21: note:   nunits = 4
foo.c:9:21: note:   ==> examining statement: last_8 = _9 ? last_16 : last_19;
foo.c:9:21: note:   get vectype for scalar type: int
foo.c:9:21: note:   vectype: vector(4) int
foo.c:9:21: note:   nunits = 4
foo.c:9:21: note:   ==> examining statement: i_17 = i_21 + 1;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining statement: ivtmp_10 = ivtmp_18 - 1;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining statement: if (ivtmp_10 != 0)
foo.c:9:21: note:   skip.
foo.c:9:21: note:   vectorization factor = 8
foo.c:9:21: note:   === vect_compute_single_scalar_iteration_cost ===
*_3 1 times scalar_load costs 1 in prologue
*_6 1 times scalar_load costs 1 in prologue
(int) aval_13 1 times scalar_stmt costs 1 in prologue
_7 < min_v_15(D) 1 times scalar_stmt costs 1 in prologue
_9 ? last_16 : last_19 1 times scalar_stmt costs 1 in prologue
foo.c:9:21: note:   === vect_analyze_slp ===
foo.c:9:21: note:   === vect_make_slp_decision ===
foo.c:9:21: note:  vectorization_factor = 8, niters = 43
foo.c:9:21: note:   === vect_analyze_data_refs_alignment ===
foo.c:9:21: note:   recording new base alignment for a_12(D)
  alignment:    2
  misalignment: 0
  based on:     aval_13 = *_3;
foo.c:9:21: note:   recording new base alignment for b_14(D)
  alignment:    4
  misalignment: 0
  based on:     _7 = *_6;
foo.c:9:21: note:   vect_compute_data_ref_alignment:
foo.c:9:21: note:   can't force alignment of ref: *_3
foo.c:9:21: note:   vect_compute_data_ref_alignment:
foo.c:9:21: note:   can't force alignment of ref: *_6
foo.c:9:21: note:   === vect_prune_runtime_alias_test_list ===
foo.c:9:21: note:   === vect_enhance_data_refs_alignment ===
foo.c:9:21: missed:   Unknown misalignment, naturally aligned
foo.c:9:21: missed:   Unknown misalignment, naturally aligned
foo.c:9:21: note:   vect_can_advance_ivs_p:
foo.c:9:21: note:   Analyze phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note:   reduc or virtual phi. skip.
foo.c:9:21: note:   Analyze phi: i_21 = PHI <i_17(7), 0(15)>
foo.c:9:21: note:   Analyze phi: ivtmp_18 = PHI <ivtmp_10(7), 43(15)>
foo.c:9:21: note:   vect_model_load_cost: aligned.
foo.c:9:21: note:   vect_get_data_access_cost: inside_cost = 1, outside_cost = 0.
foo.c:9:21: note:   vect_model_load_cost: unaligned supported by hardware.
foo.c:9:21: note:   vect_get_data_access_cost: inside_cost = 3, outside_cost = 0.
foo.c:9:21: note:   vect_model_load_cost: unaligned supported by hardware.
foo.c:9:21: note:   vect_get_data_access_cost: inside_cost = 1, outside_cost = 0.
foo.c:9:21: note:   vect_model_load_cost: unaligned supported by hardware.
foo.c:9:21: note:   vect_get_data_access_cost: inside_cost = 3, outside_cost = 0.
foo.c:9:21: note:   === vect_dissolve_slp_only_groups ===
foo.c:9:21: note:   === vect_analyze_loop_operations ===
foo.c:9:21: note:   examining phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note:   vect_is_simple_use: operand (int) aval_13, type of def: internal
foo.c:9:21: note:   vect_is_simple_use: vectype vector(4) int
foo.c:9:21: note:   vect_is_simple_use: operand last_19 = PHI <last_8(7), 108(15)>, type of def: reduction
foo.c:9:21: note:   vect_is_simple_use: vectype vector(4) int
foo.c:9:21: missed:   multiple types in double reduction or condition reduction or fold-left reduction.
foo.c:4:1: missed:   not vectorized: relevant phi not supported: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: missed:  bad operation or unsupported loop bound.
foo.c:9:21: note:  ***** Analysis  failed with vector mode V8HI
foo.c:9:21: note:  ***** The result for vector mode V16QI would be the same
foo.c:9:21: note:  ***** The result for vector mode V8QI would be the same
foo.c:9:21: note:  ***** Re-trying analysis with vector mode V4HI
foo.c:9:21: note:   === vect_analyze_data_refs ===
foo.c:9:21: note:   got vectype for stmt: aval_13 = *_3;
vector(4) short int
foo.c:9:21: note:   got vectype for stmt: _7 = *_6;
vector(4) int
foo.c:9:21: note:   === vect_analyze_scalar_cycles ===
foo.c:9:21: note:   Analyze phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note:   Access function of PHI: last_19
foo.c:9:21: note:   Analyze phi: i_21 = PHI <i_17(7), 0(15)>
foo.c:9:21: note:   Access function of PHI: {0, +, 1}_1
foo.c:9:21: note:   step: 1,  init: 0
foo.c:9:21: note:   Detected induction.
foo.c:9:21: note:   Analyze phi: ivtmp_18 = PHI <ivtmp_10(7), 43(15)>
foo.c:9:21: note:   Access function of PHI: {43, +, 4294967295}_1
foo.c:9:21: note:   step: 4294967295,  init: 43
foo.c:9:21: note:   Detected induction.
foo.c:9:21: note:   Analyze phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note:   reduction path: last_8 last_19 
foo.c:9:21: note:   reduction: detected reduction
foo.c:9:21: note:   Detected reduction.
foo.c:9:21: note:   === vect_determine_precisions ===
foo.c:9:21: note:   using boolean precision 32 for _9 = _7 < min_v_15(D);
foo.c:9:21: note:   ivtmp_10 has no range info
foo.c:9:21: note:   i_17 has range [0x1, 0x2b]
foo.c:9:21: note:   can narrow to unsigned:6 without loss of precision: i_17 = i_21 + 1;
foo.c:9:21: note:   last_8 has no range info
foo.c:9:21: note:   last_16 has no range info
foo.c:9:21: note:   _7 has no range info
foo.c:9:21: note:   _5 has range [0x0, 0xa8]
foo.c:9:21: note:   can narrow to unsigned:8 without loss of precision: _5 = _1 * 4;
foo.c:9:21: note:   aval_13 has no range info
foo.c:9:21: note:   _2 has range [0x0, 0x54]
foo.c:9:21: note:   can narrow to unsigned:7 without loss of precision: _2 = _1 * 2;
foo.c:9:21: note:   _1 has range [0x0, 0x2a]
foo.c:9:21: note:   === vect_pattern_recog ===
foo.c:9:21: note:   vect_is_simple_use: operand (long unsigned int) i_21, type of def: internal
foo.c:9:21: note:   vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note:   vect_is_simple_use: operand (long unsigned int) i_21, type of def: internal
foo.c:9:21: note:   vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note:   vect_recog_widen_mult_pattern: detected: _2 = _1 * 2;
foo.c:9:21: note:   widen_mult pattern recognized: patt_41 = (long unsigned int) patt_40;
foo.c:9:21: note:   extra pattern stmt: patt_40 = i_21 w* 2;
foo.c:9:21: note:   vect_is_simple_use: operand (long unsigned int) i_21, type of def: internal
foo.c:9:21: note:   vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note:   vect_is_simple_use: operand (long unsigned int) i_21, type of def: internal
foo.c:9:21: note:   vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note:   vect_recog_widen_mult_pattern: detected: _5 = _1 * 4;
foo.c:9:21: note:   widen_mult pattern recognized: patt_43 = (long unsigned int) patt_42;
foo.c:9:21: note:   extra pattern stmt: patt_42 = i_21 w* 4;
foo.c:9:21: note:   vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note:   vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note:   vect_is_simple_use: operand ivtmp_18 = PHI <ivtmp_10(7), 43(15)>, type of def: induction
foo.c:9:21: note:   === vect_analyze_data_ref_accesses ===
foo.c:9:21: note:   === vect_mark_stmts_to_be_vectorized ===
foo.c:9:21: note:   init: phi relevant? last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note:   init: phi relevant? i_21 = PHI <i_17(7), 0(15)>
foo.c:9:21: note:   init: phi relevant? ivtmp_18 = PHI <ivtmp_10(7), 43(15)>
foo.c:9:21: note:   init: stmt relevant? _1 = (long unsigned int) i_21;
foo.c:9:21: note:   init: stmt relevant? _2 = _1 * 2;
foo.c:9:21: note:   init: stmt relevant? _3 = a_12(D) + _2;
foo.c:9:21: note:   init: stmt relevant? aval_13 = *_3;
foo.c:9:21: note:   init: stmt relevant? _5 = _1 * 4;
foo.c:9:21: note:   init: stmt relevant? _6 = b_14(D) + _5;
foo.c:9:21: note:   init: stmt relevant? _7 = *_6;
foo.c:9:21: note:   init: stmt relevant? last_16 = (int) aval_13;
foo.c:9:21: note:   init: stmt relevant? _9 = _7 < min_v_15(D);
foo.c:9:21: note:   init: stmt relevant? last_8 = _9 ? last_16 : last_19;
foo.c:9:21: note:   vec_stmt_relevant_p: used out of loop.
foo.c:9:21: note:   vect_is_simple_use: operand _7 < min_v_15(D), type of def: internal
foo.c:9:21: note:   vec_stmt_relevant_p: stmt live but not relevant.
foo.c:9:21: note:   mark relevant 1, live 1: last_8 = _9 ? last_16 : last_19;
foo.c:9:21: note:   init: stmt relevant? i_17 = i_21 + 1;
foo.c:9:21: note:   init: stmt relevant? ivtmp_10 = ivtmp_18 - 1;
foo.c:9:21: note:   init: stmt relevant? if (ivtmp_10 != 0)
foo.c:9:21: note:   worklist: examine stmt: last_8 = _9 ? last_16 : last_19;
foo.c:9:21: note:   vect_is_simple_use: operand _7 < min_v_15(D), type of def: internal
foo.c:9:21: note:   mark relevant 1, live 0: _9 = _7 < min_v_15(D);
foo.c:9:21: note:   vect_is_simple_use: operand (int) aval_13, type of def: internal
foo.c:9:21: note:   mark relevant 1, live 0: last_16 = (int) aval_13;
foo.c:9:21: note:   vect_is_simple_use: operand last_19 = PHI <last_8(7), 108(15)>, type of def: reduction
foo.c:9:21: note:   mark relevant 1, live 0: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note:   worklist: examine stmt: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note:   vect_is_simple_use: operand _9 ? last_16 : last_19, type of def: reduction
foo.c:9:21: note:   reduc-stmt defining reduc-phi in the same nest.
foo.c:9:21: note:   mark relevant 1, live 1: last_8 = _9 ? last_16 : last_19;
foo.c:9:21: note:   already marked relevant/live.
foo.c:9:21: note:   vect_is_simple_use: operand 108, type of def: constant
foo.c:9:21: note:   worklist: examine stmt: last_16 = (int) aval_13;
foo.c:9:21: note:   vect_is_simple_use: operand *_3, type of def: internal
foo.c:9:21: note:   mark relevant 1, live 0: aval_13 = *_3;
foo.c:9:21: note:   worklist: examine stmt: aval_13 = *_3;
foo.c:9:21: note:   worklist: examine stmt: _9 = _7 < min_v_15(D);
foo.c:9:21: note:   vect_is_simple_use: operand *_6, type of def: internal
foo.c:9:21: note:   mark relevant 1, live 0: _7 = *_6;
foo.c:9:21: note:   vect_is_simple_use: operand min_v_15(D), type of def: external
foo.c:9:21: note:   worklist: examine stmt: _7 = *_6;
foo.c:9:21: note:   === vect_analyze_data_ref_dependences ===
foo.c:9:21: note:   === vect_determine_vectorization_factor ===
foo.c:9:21: note:   ==> examining phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note:   get vectype for scalar type:  int
foo.c:9:21: note:   vectype: vector(4) int
foo.c:9:21: note:   nunits = 4
foo.c:9:21: note:   ==> examining phi: i_21 = PHI <i_17(7), 0(15)>
foo.c:9:21: note:   ==> examining phi: ivtmp_18 = PHI <ivtmp_10(7), 43(15)>
foo.c:9:21: note:   ==> examining statement: _1 = (long unsigned int) i_21;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining statement: _2 = _1 * 2;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining pattern def stmt: patt_40 = i_21 w* 2;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining pattern statement: patt_41 = (long unsigned int) patt_40;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining statement: _3 = a_12(D) + _2;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining statement: aval_13 = *_3;
foo.c:9:21: note:   precomputed vectype: vector(4) short int
foo.c:9:21: note:   nunits = 4
foo.c:9:21: note:   ==> examining statement: _5 = _1 * 4;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining pattern def stmt: patt_42 = i_21 w* 4;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining pattern statement: patt_43 = (long unsigned int) patt_42;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining statement: _6 = b_14(D) + _5;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining statement: _7 = *_6;
foo.c:9:21: note:   precomputed vectype: vector(4) int
foo.c:9:21: note:   nunits = 4
foo.c:9:21: note:   ==> examining statement: last_16 = (int) aval_13;
foo.c:9:21: note:   get vectype for scalar type: int
foo.c:9:21: note:   vectype: vector(4) int
foo.c:9:21: note:   get vectype for smallest scalar type: short int
foo.c:9:21: note:   nunits vectype: vector(4) short int
foo.c:9:21: note:   nunits = 4
foo.c:9:21: note:   ==> examining statement: _9 = _7 < min_v_15(D);
foo.c:9:21: note:   vectype: vector(4) <signed-boolean:32>
foo.c:9:21: note:   nunits = 4
foo.c:9:21: note:   ==> examining statement: last_8 = _9 ? last_16 : last_19;
foo.c:9:21: note:   get vectype for scalar type: int
foo.c:9:21: note:   vectype: vector(4) int
foo.c:9:21: note:   nunits = 4
foo.c:9:21: note:   ==> examining statement: i_17 = i_21 + 1;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining statement: ivtmp_10 = ivtmp_18 - 1;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining statement: if (ivtmp_10 != 0)
foo.c:9:21: note:   skip.
foo.c:9:21: note:   vectorization factor = 4
foo.c:9:21: note:   === vect_compute_single_scalar_iteration_cost ===
*_3 1 times scalar_load costs 1 in prologue
*_6 1 times scalar_load costs 1 in prologue
(int) aval_13 1 times scalar_stmt costs 1 in prologue
_7 < min_v_15(D) 1 times scalar_stmt costs 1 in prologue
_9 ? last_16 : last_19 1 times scalar_stmt costs 1 in prologue
foo.c:9:21: note:   === vect_analyze_slp ===
foo.c:9:21: note:   === vect_make_slp_decision ===
foo.c:9:21: note:  vectorization_factor = 4, niters = 43
foo.c:9:21: note:   === vect_analyze_data_refs_alignment ===
foo.c:9:21: note:   recording new base alignment for a_12(D)
  alignment:    2
  misalignment: 0
  based on:     aval_13 = *_3;
foo.c:9:21: note:   recording new base alignment for b_14(D)
  alignment:    4
  misalignment: 0
  based on:     _7 = *_6;
foo.c:9:21: note:   vect_compute_data_ref_alignment:
foo.c:9:21: note:   can't force alignment of ref: *_3
foo.c:9:21: note:   vect_compute_data_ref_alignment:
foo.c:9:21: note:   can't force alignment of ref: *_6
foo.c:9:21: note:   === vect_prune_runtime_alias_test_list ===
foo.c:9:21: note:   === vect_enhance_data_refs_alignment ===
foo.c:9:21: missed:   Unknown misalignment, naturally aligned
foo.c:9:21: missed:   Unknown misalignment, naturally aligned
foo.c:9:21: note:   vect_can_advance_ivs_p:
foo.c:9:21: note:   Analyze phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note:   reduc or virtual phi. skip.
foo.c:9:21: note:   Analyze phi: i_21 = PHI <i_17(7), 0(15)>
foo.c:9:21: note:   Analyze phi: ivtmp_18 = PHI <ivtmp_10(7), 43(15)>
foo.c:9:21: note:   vect_model_load_cost: aligned.
foo.c:9:21: note:   vect_get_data_access_cost: inside_cost = 1, outside_cost = 0.
foo.c:9:21: note:   vect_model_load_cost: unaligned supported by hardware.
foo.c:9:21: note:   vect_get_data_access_cost: inside_cost = 2, outside_cost = 0.
foo.c:9:21: note:   vect_model_load_cost: unaligned supported by hardware.
foo.c:9:21: note:   vect_get_data_access_cost: inside_cost = 1, outside_cost = 0.
foo.c:9:21: note:   vect_model_load_cost: unaligned supported by hardware.
foo.c:9:21: note:   vect_get_data_access_cost: inside_cost = 2, outside_cost = 0.
foo.c:9:21: note:   === vect_dissolve_slp_only_groups ===
foo.c:9:21: note:   === vect_analyze_loop_operations ===
foo.c:9:21: note:   examining phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note:   vect_is_simple_use: operand (int) aval_13, type of def: internal
foo.c:9:21: note:   vect_is_simple_use: vectype vector(4) int
foo.c:9:21: note:   vect_is_simple_use: operand last_19 = PHI <last_8(7), 108(15)>, type of def: reduction
foo.c:9:21: note:   vect_is_simple_use: vectype vector(4) int
Estimating # of iterations of loop 1
Analyzing # of iterations of loop 1
  exit condition [42, + , 4294967295] != 0
  bounds on difference of bases: -42 ... -42
  result:
    # of iterations 42, bounded by 42
Analyzing # of iterations of loop 1
  exit condition [42, + , 4294967295] != 0
  bounds on difference of bases: -42 ... -42
  result:
    # of iterations 42, bounded by 42
Statement (exit)if (ivtmp_10 != 0)
 is executed at most 42 (bounded by 42) + 1 times in loop 1.
Induction variable (short int *) a_12(D) + 2 * iteration does not wrap in statement _3 = a_12(D) + _2;
 in loop 1.
Statement _3 = a_12(D) + _2;
 is executed at most 9223372036854775806 (bounded by 9223372036854775806) + 1 times in loop 1.
Induction variable (int *) b_14(D) + 4 * iteration does not wrap in statement _6 = b_14(D) + _5;
 in loop 1.
Statement _6 = b_14(D) + _5;
 is executed at most 4611686018427387902 (bounded by 4611686018427387902) + 1 times in loop 1.
Induction variable (int) 1 + 1 * iteration does not wrap in statement i_17 = i_21 + 1;
 in loop 1.
Statement i_17 = i_21 + 1;
 is executed at most 42 (bounded by 42) + 1 times in loop 1.
vect_model_reduction_cost: inside_cost = 0, prologue_cost = 4, epilogue_cost = 7 .
foo.c:9:21: note:   examining phi: i_21 = PHI <i_17(7), 0(15)>
foo.c:9:21: note:   examining phi: ivtmp_18 = PHI <ivtmp_10(7), 43(15)>
foo.c:9:21: note:   ==> examining statement: _1 = (long unsigned int) i_21;
foo.c:9:21: note:   irrelevant.
foo.c:9:21: note:   ==> examining statement: _2 = _1 * 2;
foo.c:9:21: note:   irrelevant.
foo.c:9:21: note:   ==> examining statement: _3 = a_12(D) + _2;
foo.c:9:21: note:   irrelevant.
foo.c:9:21: note:   ==> examining statement: aval_13 = *_3;
foo.c:9:21: missed:   can't operate on partial vectors because the target doesn't have the appropriate partial vectorization load or store.
foo.c:9:21: note:   Vectorizing an unaligned access.
foo.c:9:21: note:   vect_model_load_cost: unaligned supported by hardware.
foo.c:9:21: note:   vect_model_load_cost: inside_cost = 1, prologue_cost = 0 .
foo.c:9:21: note:   ==> examining statement: _5 = _1 * 4;
foo.c:9:21: note:   irrelevant.
foo.c:9:21: note:   ==> examining statement: _6 = b_14(D) + _5;
foo.c:9:21: note:   irrelevant.
foo.c:9:21: note:   ==> examining statement: _7 = *_6;
foo.c:9:21: note:   Vectorizing an unaligned access.
foo.c:9:21: note:   vect_model_load_cost: unaligned supported by hardware.
foo.c:9:21: note:   vect_model_load_cost: inside_cost = 1, prologue_cost = 0 .
foo.c:9:21: note:   ==> examining statement: last_16 = (int) aval_13;
foo.c:9:21: note:   vect_is_simple_use: operand *_3, type of def: internal
foo.c:9:21: note:   vect_is_simple_use: vectype vector(4) short int
foo.c:9:21: note:    === vectorizable_conversion ===
foo.c:9:21: note:    vect_model_simple_cost: inside_cost = 1, prologue_cost = 0 .
foo.c:9:21: note:   ==> examining statement: _9 = _7 < min_v_15(D);
foo.c:9:21: note:   vect_is_simple_use: operand *_6, type of def: internal
foo.c:9:21: note:   vect_is_simple_use: vectype vector(4) int
foo.c:9:21: note:   vect_is_simple_use: operand min_v_15(D), type of def: external
foo.c:9:21: note:   vect_model_simple_cost: inside_cost = 1, prologue_cost = 1 .
foo.c:9:21: note:   ==> examining statement: last_8 = _9 ? last_16 : last_19;
foo.c:9:21: note:   vect_is_simple_use: operand _7 < min_v_15(D), type of def: internal
foo.c:9:21: note:   vect_is_simple_use: vectype vector(4) <signed-boolean:32>
foo.c:9:21: note:   vect_is_simple_use: operand (int) aval_13, type of def: internal
foo.c:9:21: note:   vect_is_simple_use: vectype vector(4) int
foo.c:9:21: note:   vect_is_simple_use: operand last_19 = PHI <last_8(7), 108(15)>, type of def: reduction
foo.c:9:21: note:   vect_is_simple_use: vectype vector(4) int
foo.c:9:21: note:   vect_model_simple_cost: inside_cost = 1, prologue_cost = 0 .
foo.c:9:21: note:   ==> examining statement: i_17 = i_21 + 1;
foo.c:9:21: note:   irrelevant.
foo.c:9:21: note:   ==> examining statement: ivtmp_10 = ivtmp_18 - 1;
foo.c:9:21: note:   irrelevant.
foo.c:9:21: note:   ==> examining statement: if (ivtmp_10 != 0)
foo.c:9:21: note:   irrelevant.
_9 ? last_16 : last_19 4 times scalar_to_vec costs 4 in prologue
_9 ? last_16 : last_19 2 times vector_stmt costs 2 in epilogue
_9 ? last_16 : last_19 2 times vec_to_scalar costs 4 in epilogue
_9 ? last_16 : last_19 1 times scalar_to_vec costs 1 in epilogue
*_3 1 times unaligned_load (misalign -1) costs 1 in body
*_6 1 times unaligned_load (misalign -1) costs 1 in body
(int) aval_13 1 times vector_stmt costs 1 in body
_7 < min_v_15(D) 1 times scalar_to_vec costs 1 in prologue
_7 < min_v_15(D) 1 times vector_stmt costs 1 in body
_9 ? last_16 : last_19 1 times vector_stmt costs 1 in body
foo.c:9:21: note:  operating on full vectors.
foo.c:9:21: note:  cost model disabled.
foo.c:9:21: note:  epilog loop required
foo.c:9:21: note:  vect_can_advance_ivs_p:
foo.c:9:21: note:  Analyze phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note:  reduc or virtual phi. skip.
foo.c:9:21: note:  Analyze phi: i_21 = PHI <i_17(7), 0(15)>
foo.c:9:21: note:  Analyze phi: ivtmp_18 = PHI <ivtmp_10(7), 43(15)>
foo.c:9:21: note:  ***** Analysis succeeded with vector mode V4HI
foo.c:9:21: note:  ***** Choosing vector mode V4HI
foo.c:9:21: note:  ***** Re-trying epilogue analysis with vector mode V16QI
foo.c:9:21: note:   === vect_analyze_data_refs ===
foo.c:9:21: note:   got vectype for stmt: aval_13 = *_3;
vector(8) short int
foo.c:9:21: note:   got vectype for stmt: _7 = *_6;
vector(4) int
foo.c:9:21: note:   === vect_analyze_scalar_cycles ===
foo.c:9:21: note:   Analyze phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note:   Access function of PHI: last_19
foo.c:9:21: note:   Analyze phi: i_21 = PHI <i_17(7), 0(15)>
foo.c:9:21: note:   Access function of PHI: {0, +, 1}_1
foo.c:9:21: note:   step: 1,  init: 0
foo.c:9:21: note:   Detected induction.
foo.c:9:21: note:   Analyze phi: ivtmp_18 = PHI <ivtmp_10(7), 43(15)>
foo.c:9:21: note:   Access function of PHI: {43, +, 4294967295}_1
foo.c:9:21: note:   step: 4294967295,  init: 43
foo.c:9:21: note:   Detected induction.
foo.c:9:21: note:   Analyze phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note:   reduction path: last_8 last_19 
foo.c:9:21: note:   reduction: detected reduction
foo.c:9:21: note:   Detected reduction.
foo.c:9:21: note:   === vect_determine_precisions ===
foo.c:9:21: note:   using boolean precision 32 for _9 = _7 < min_v_15(D);
foo.c:9:21: note:   ivtmp_10 has no range info
foo.c:9:21: note:   i_17 has range [0x1, 0x2b]
foo.c:9:21: note:   can narrow to unsigned:6 without loss of precision: i_17 = i_21 + 1;
foo.c:9:21: note:   last_8 has no range info
foo.c:9:21: note:   last_16 has no range info
foo.c:9:21: note:   _7 has no range info
foo.c:9:21: note:   _5 has range [0x0, 0xa8]
foo.c:9:21: note:   can narrow to unsigned:8 without loss of precision: _5 = _1 * 4;
foo.c:9:21: note:   aval_13 has no range info
foo.c:9:21: note:   _2 has range [0x0, 0x54]
foo.c:9:21: note:   can narrow to unsigned:7 without loss of precision: _2 = _1 * 2;
foo.c:9:21: note:   _1 has range [0x0, 0x2a]
foo.c:9:21: note:   === vect_pattern_recog ===
foo.c:9:21: note:   vect_is_simple_use: operand (long unsigned int) i_21, type of def: internal
foo.c:9:21: note:   vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note:   vect_is_simple_use: operand (long unsigned int) i_21, type of def: internal
foo.c:9:21: note:   vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note:   vect_recog_widen_mult_pattern: detected: _2 = _1 * 2;
foo.c:9:21: note:   widen_mult pattern recognized: patt_45 = (long unsigned int) patt_44;
foo.c:9:21: note:   extra pattern stmt: patt_44 = i_21 w* 2;
foo.c:9:21: note:   vect_is_simple_use: operand (long unsigned int) i_21, type of def: internal
foo.c:9:21: note:   vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note:   vect_is_simple_use: operand (long unsigned int) i_21, type of def: internal
foo.c:9:21: note:   vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note:   vect_recog_widen_mult_pattern: detected: _5 = _1 * 4;
foo.c:9:21: note:   widen_mult pattern recognized: patt_47 = (long unsigned int) patt_46;
foo.c:9:21: note:   extra pattern stmt: patt_46 = i_21 w* 4;
foo.c:9:21: note:   vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note:   vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note:   vect_is_simple_use: operand ivtmp_18 = PHI <ivtmp_10(7), 43(15)>, type of def: induction
foo.c:9:21: note:   === vect_analyze_data_ref_accesses ===
foo.c:9:21: note:   === vect_mark_stmts_to_be_vectorized ===
foo.c:9:21: note:   init: phi relevant? last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note:   init: phi relevant? i_21 = PHI <i_17(7), 0(15)>
foo.c:9:21: note:   init: phi relevant? ivtmp_18 = PHI <ivtmp_10(7), 43(15)>
foo.c:9:21: note:   init: stmt relevant? _1 = (long unsigned int) i_21;
foo.c:9:21: note:   init: stmt relevant? _2 = _1 * 2;
foo.c:9:21: note:   init: stmt relevant? _3 = a_12(D) + _2;
foo.c:9:21: note:   init: stmt relevant? aval_13 = *_3;
foo.c:9:21: note:   init: stmt relevant? _5 = _1 * 4;
foo.c:9:21: note:   init: stmt relevant? _6 = b_14(D) + _5;
foo.c:9:21: note:   init: stmt relevant? _7 = *_6;
foo.c:9:21: note:   init: stmt relevant? last_16 = (int) aval_13;
foo.c:9:21: note:   init: stmt relevant? _9 = _7 < min_v_15(D);
foo.c:9:21: note:   init: stmt relevant? last_8 = _9 ? last_16 : last_19;
foo.c:9:21: note:   vec_stmt_relevant_p: used out of loop.
foo.c:9:21: note:   vect_is_simple_use: operand _7 < min_v_15(D), type of def: internal
foo.c:9:21: note:   vec_stmt_relevant_p: stmt live but not relevant.
foo.c:9:21: note:   mark relevant 1, live 1: last_8 = _9 ? last_16 : last_19;
foo.c:9:21: note:   init: stmt relevant? i_17 = i_21 + 1;
foo.c:9:21: note:   init: stmt relevant? ivtmp_10 = ivtmp_18 - 1;
foo.c:9:21: note:   init: stmt relevant? if (ivtmp_10 != 0)
foo.c:9:21: note:   worklist: examine stmt: last_8 = _9 ? last_16 : last_19;
foo.c:9:21: note:   vect_is_simple_use: operand _7 < min_v_15(D), type of def: internal
foo.c:9:21: note:   mark relevant 1, live 0: _9 = _7 < min_v_15(D);
foo.c:9:21: note:   vect_is_simple_use: operand (int) aval_13, type of def: internal
foo.c:9:21: note:   mark relevant 1, live 0: last_16 = (int) aval_13;
foo.c:9:21: note:   vect_is_simple_use: operand last_19 = PHI <last_8(7), 108(15)>, type of def: reduction
foo.c:9:21: note:   mark relevant 1, live 0: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note:   worklist: examine stmt: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note:   vect_is_simple_use: operand _9 ? last_16 : last_19, type of def: reduction
foo.c:9:21: note:   reduc-stmt defining reduc-phi in the same nest.
foo.c:9:21: note:   mark relevant 1, live 1: last_8 = _9 ? last_16 : last_19;
foo.c:9:21: note:   already marked relevant/live.
foo.c:9:21: note:   vect_is_simple_use: operand 108, type of def: constant
foo.c:9:21: note:   worklist: examine stmt: last_16 = (int) aval_13;
foo.c:9:21: note:   vect_is_simple_use: operand *_3, type of def: internal
foo.c:9:21: note:   mark relevant 1, live 0: aval_13 = *_3;
foo.c:9:21: note:   worklist: examine stmt: aval_13 = *_3;
foo.c:9:21: note:   worklist: examine stmt: _9 = _7 < min_v_15(D);
foo.c:9:21: note:   vect_is_simple_use: operand *_6, type of def: internal
foo.c:9:21: note:   mark relevant 1, live 0: _7 = *_6;
foo.c:9:21: note:   vect_is_simple_use: operand min_v_15(D), type of def: external
foo.c:9:21: note:   worklist: examine stmt: _7 = *_6;
foo.c:9:21: note:   === vect_analyze_data_ref_dependences ===
foo.c:9:21: note:   === vect_determine_vectorization_factor ===
foo.c:9:21: note:   ==> examining phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note:   get vectype for scalar type:  int
foo.c:9:21: note:   vectype: vector(4) int
foo.c:9:21: note:   nunits = 4
foo.c:9:21: note:   ==> examining phi: i_21 = PHI <i_17(7), 0(15)>
foo.c:9:21: note:   ==> examining phi: ivtmp_18 = PHI <ivtmp_10(7), 43(15)>
foo.c:9:21: note:   ==> examining statement: _1 = (long unsigned int) i_21;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining statement: _2 = _1 * 2;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining pattern def stmt: patt_44 = i_21 w* 2;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining pattern statement: patt_45 = (long unsigned int) patt_44;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining statement: _3 = a_12(D) + _2;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining statement: aval_13 = *_3;
foo.c:9:21: note:   precomputed vectype: vector(8) short int
foo.c:9:21: note:   nunits = 8
foo.c:9:21: note:   ==> examining statement: _5 = _1 * 4;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining pattern def stmt: patt_46 = i_21 w* 4;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining pattern statement: patt_47 = (long unsigned int) patt_46;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining statement: _6 = b_14(D) + _5;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining statement: _7 = *_6;
foo.c:9:21: note:   precomputed vectype: vector(4) int
foo.c:9:21: note:   nunits = 4
foo.c:9:21: note:   ==> examining statement: last_16 = (int) aval_13;
foo.c:9:21: note:   get vectype for scalar type: int
foo.c:9:21: note:   vectype: vector(4) int
foo.c:9:21: note:   get vectype for smallest scalar type: short int
foo.c:9:21: note:   nunits vectype: vector(8) short int
foo.c:9:21: note:   nunits = 8
foo.c:9:21: note:   ==> examining statement: _9 = _7 < min_v_15(D);
foo.c:9:21: note:   vectype: vector(4) <signed-boolean:32>
foo.c:9:21: note:   nunits = 4
foo.c:9:21: note:   ==> examining statement: last_8 = _9 ? last_16 : last_19;
foo.c:9:21: note:   get vectype for scalar type: int
foo.c:9:21: note:   vectype: vector(4) int
foo.c:9:21: note:   nunits = 4
foo.c:9:21: note:   ==> examining statement: i_17 = i_21 + 1;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining statement: ivtmp_10 = ivtmp_18 - 1;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining statement: if (ivtmp_10 != 0)
foo.c:9:21: note:   skip.
foo.c:9:21: note:   vectorization factor = 8
foo.c:9:21: note:   === vect_compute_single_scalar_iteration_cost ===
*_3 1 times scalar_load costs 1 in prologue
*_6 1 times scalar_load costs 1 in prologue
(int) aval_13 1 times scalar_stmt costs 1 in prologue
_7 < min_v_15(D) 1 times scalar_stmt costs 1 in prologue
_9 ? last_16 : last_19 1 times scalar_stmt costs 1 in prologue
foo.c:9:21: note:   === vect_analyze_slp ===
foo.c:9:21: note:   === vect_make_slp_decision ===
foo.c:9:21: note:  vectorization_factor = 8, niters = 43
foo.c:9:21: note:   === vect_analyze_data_refs_alignment ===
foo.c:9:21: note:   recording new base alignment for a_12(D)
  alignment:    2
  misalignment: 0
  based on:     aval_13 = *_3;
foo.c:9:21: note:   recording new base alignment for b_14(D)
  alignment:    4
  misalignment: 0
  based on:     _7 = *_6;
foo.c:9:21: note:   vect_compute_data_ref_alignment:
foo.c:9:21: note:   can't force alignment of ref: *_3
foo.c:9:21: note:   vect_compute_data_ref_alignment:
foo.c:9:21: note:   can't force alignment of ref: *_6
foo.c:9:21: note:   === vect_prune_runtime_alias_test_list ===
foo.c:9:21: note:   === vect_dissolve_slp_only_groups ===
foo.c:9:21: note:   === vect_analyze_loop_operations ===
foo.c:9:21: note:   examining phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note:   vect_is_simple_use: operand (int) aval_13, type of def: internal
foo.c:9:21: note:   vect_is_simple_use: vectype vector(4) int
foo.c:9:21: note:   vect_is_simple_use: operand last_19 = PHI <last_8(7), 108(15)>, type of def: reduction
foo.c:9:21: note:   vect_is_simple_use: vectype vector(4) int
foo.c:9:21: missed:   multiple types in double reduction or condition reduction or fold-left reduction.
foo.c:4:1: missed:   not vectorized: relevant phi not supported: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: missed:  bad operation or unsupported loop bound.
foo.c:9:21: note:  ***** Analysis  failed with vector mode V16QI
foo.c:9:21: note:  ***** The result for vector mode V8QI would be the same
foo.c:9:21: note:  ***** Re-trying epilogue analysis with vector mode V2SI
foo.c:9:21: note:   === vect_analyze_data_refs ===
foo.c:9:21: note:   got vectype for stmt: aval_13 = *_3;
vector(4) short int
foo.c:9:21: note:   got vectype for stmt: _7 = *_6;
vector(2) int
foo.c:9:21: note:   === vect_analyze_scalar_cycles ===
foo.c:9:21: note:   Analyze phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note:   Access function of PHI: last_19
foo.c:9:21: note:   Analyze phi: i_21 = PHI <i_17(7), 0(15)>
foo.c:9:21: note:   Access function of PHI: {0, +, 1}_1
foo.c:9:21: note:   step: 1,  init: 0
foo.c:9:21: note:   Detected induction.
foo.c:9:21: note:   Analyze phi: ivtmp_18 = PHI <ivtmp_10(7), 43(15)>
foo.c:9:21: note:   Access function of PHI: {43, +, 4294967295}_1
foo.c:9:21: note:   step: 4294967295,  init: 43
foo.c:9:21: note:   Detected induction.
foo.c:9:21: note:   Analyze phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note:   reduction path: last_8 last_19 
foo.c:9:21: note:   reduction: detected reduction
foo.c:9:21: note:   Detected reduction.
foo.c:9:21: note:   === vect_determine_precisions ===
foo.c:9:21: note:   using boolean precision 32 for _9 = _7 < min_v_15(D);
foo.c:9:21: note:   ivtmp_10 has no range info
foo.c:9:21: note:   i_17 has range [0x1, 0x2b]
foo.c:9:21: note:   can narrow to unsigned:6 without loss of precision: i_17 = i_21 + 1;
foo.c:9:21: note:   last_8 has no range info
foo.c:9:21: note:   last_16 has no range info
foo.c:9:21: note:   _7 has no range info
foo.c:9:21: note:   _5 has range [0x0, 0xa8]
foo.c:9:21: note:   can narrow to unsigned:8 without loss of precision: _5 = _1 * 4;
foo.c:9:21: note:   aval_13 has no range info
foo.c:9:21: note:   _2 has range [0x0, 0x54]
foo.c:9:21: note:   can narrow to unsigned:7 without loss of precision: _2 = _1 * 2;
foo.c:9:21: note:   _1 has range [0x0, 0x2a]
foo.c:9:21: note:   === vect_pattern_recog ===
foo.c:9:21: note:   vect_is_simple_use: operand (long unsigned int) i_21, type of def: internal
foo.c:9:21: note:   vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note:   vect_is_simple_use: operand (long unsigned int) i_21, type of def: internal
foo.c:9:21: note:   vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note:   vect_recog_widen_mult_pattern: detected: _2 = _1 * 2;
foo.c:9:21: note:   vect_recog_mult_pattern: detected: _2 = _1 * 2;
foo.c:9:21: note:   mult pattern recognized: patt_48 = _1 << 1;
foo.c:9:21: note:   vect_is_simple_use: operand (long unsigned int) i_21, type of def: internal
foo.c:9:21: note:   vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note:   vect_is_simple_use: operand (long unsigned int) i_21, type of def: internal
foo.c:9:21: note:   vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note:   vect_recog_widen_mult_pattern: detected: _5 = _1 * 4;
foo.c:9:21: note:   vect_recog_mult_pattern: detected: _5 = _1 * 4;
foo.c:9:21: note:   mult pattern recognized: patt_49 = _1 << 2;
foo.c:9:21: note:   vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note:   vect_is_simple_use: operand i_21 = PHI <i_17(7), 0(15)>, type of def: induction
foo.c:9:21: note:   vect_is_simple_use: operand ivtmp_18 = PHI <ivtmp_10(7), 43(15)>, type of def: induction
foo.c:9:21: note:   === vect_analyze_data_ref_accesses ===
foo.c:9:21: note:   === vect_mark_stmts_to_be_vectorized ===
foo.c:9:21: note:   init: phi relevant? last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note:   init: phi relevant? i_21 = PHI <i_17(7), 0(15)>
foo.c:9:21: note:   init: phi relevant? ivtmp_18 = PHI <ivtmp_10(7), 43(15)>
foo.c:9:21: note:   init: stmt relevant? _1 = (long unsigned int) i_21;
foo.c:9:21: note:   init: stmt relevant? _2 = _1 * 2;
foo.c:9:21: note:   init: stmt relevant? _3 = a_12(D) + _2;
foo.c:9:21: note:   init: stmt relevant? aval_13 = *_3;
foo.c:9:21: note:   init: stmt relevant? _5 = _1 * 4;
foo.c:9:21: note:   init: stmt relevant? _6 = b_14(D) + _5;
foo.c:9:21: note:   init: stmt relevant? _7 = *_6;
foo.c:9:21: note:   init: stmt relevant? last_16 = (int) aval_13;
foo.c:9:21: note:   init: stmt relevant? _9 = _7 < min_v_15(D);
foo.c:9:21: note:   init: stmt relevant? last_8 = _9 ? last_16 : last_19;
foo.c:9:21: note:   vec_stmt_relevant_p: used out of loop.
foo.c:9:21: note:   vect_is_simple_use: operand _7 < min_v_15(D), type of def: internal
foo.c:9:21: note:   vec_stmt_relevant_p: stmt live but not relevant.
foo.c:9:21: note:   mark relevant 1, live 1: last_8 = _9 ? last_16 : last_19;
foo.c:9:21: note:   init: stmt relevant? i_17 = i_21 + 1;
foo.c:9:21: note:   init: stmt relevant? ivtmp_10 = ivtmp_18 - 1;
foo.c:9:21: note:   init: stmt relevant? if (ivtmp_10 != 0)
foo.c:9:21: note:   worklist: examine stmt: last_8 = _9 ? last_16 : last_19;
foo.c:9:21: note:   vect_is_simple_use: operand _7 < min_v_15(D), type of def: internal
foo.c:9:21: note:   mark relevant 1, live 0: _9 = _7 < min_v_15(D);
foo.c:9:21: note:   vect_is_simple_use: operand (int) aval_13, type of def: internal
foo.c:9:21: note:   mark relevant 1, live 0: last_16 = (int) aval_13;
foo.c:9:21: note:   vect_is_simple_use: operand last_19 = PHI <last_8(7), 108(15)>, type of def: reduction
foo.c:9:21: note:   mark relevant 1, live 0: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note:   worklist: examine stmt: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note:   vect_is_simple_use: operand _9 ? last_16 : last_19, type of def: reduction
foo.c:9:21: note:   reduc-stmt defining reduc-phi in the same nest.
foo.c:9:21: note:   mark relevant 1, live 1: last_8 = _9 ? last_16 : last_19;
foo.c:9:21: note:   already marked relevant/live.
foo.c:9:21: note:   vect_is_simple_use: operand 108, type of def: constant
foo.c:9:21: note:   worklist: examine stmt: last_16 = (int) aval_13;
foo.c:9:21: note:   vect_is_simple_use: operand *_3, type of def: internal
foo.c:9:21: note:   mark relevant 1, live 0: aval_13 = *_3;
foo.c:9:21: note:   worklist: examine stmt: aval_13 = *_3;
foo.c:9:21: note:   worklist: examine stmt: _9 = _7 < min_v_15(D);
foo.c:9:21: note:   vect_is_simple_use: operand *_6, type of def: internal
foo.c:9:21: note:   mark relevant 1, live 0: _7 = *_6;
foo.c:9:21: note:   vect_is_simple_use: operand min_v_15(D), type of def: external
foo.c:9:21: note:   worklist: examine stmt: _7 = *_6;
foo.c:9:21: note:   === vect_analyze_data_ref_dependences ===
foo.c:9:21: note:   === vect_determine_vectorization_factor ===
foo.c:9:21: note:   ==> examining phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note:   get vectype for scalar type:  int
foo.c:9:21: note:   vectype: vector(2) int
foo.c:9:21: note:   nunits = 2
foo.c:9:21: note:   ==> examining phi: i_21 = PHI <i_17(7), 0(15)>
foo.c:9:21: note:   ==> examining phi: ivtmp_18 = PHI <ivtmp_10(7), 43(15)>
foo.c:9:21: note:   ==> examining statement: _1 = (long unsigned int) i_21;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining statement: _2 = _1 * 2;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining pattern statement: patt_48 = _1 << 1;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining statement: _3 = a_12(D) + _2;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining statement: aval_13 = *_3;
foo.c:9:21: note:   precomputed vectype: vector(4) short int
foo.c:9:21: note:   nunits = 4
foo.c:9:21: note:   ==> examining statement: _5 = _1 * 4;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining pattern statement: patt_49 = _1 << 2;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining statement: _6 = b_14(D) + _5;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining statement: _7 = *_6;
foo.c:9:21: note:   precomputed vectype: vector(2) int
foo.c:9:21: note:   nunits = 2
foo.c:9:21: note:   ==> examining statement: last_16 = (int) aval_13;
foo.c:9:21: note:   get vectype for scalar type: int
foo.c:9:21: note:   vectype: vector(2) int
foo.c:9:21: note:   get vectype for smallest scalar type: short int
foo.c:9:21: note:   nunits vectype: vector(4) short int
foo.c:9:21: note:   nunits = 4
foo.c:9:21: note:   ==> examining statement: _9 = _7 < min_v_15(D);
foo.c:9:21: note:   vectype: vector(2) <signed-boolean:32>
foo.c:9:21: note:   nunits = 2
foo.c:9:21: note:   ==> examining statement: last_8 = _9 ? last_16 : last_19;
foo.c:9:21: note:   get vectype for scalar type: int
foo.c:9:21: note:   vectype: vector(2) int
foo.c:9:21: note:   nunits = 2
foo.c:9:21: note:   ==> examining statement: i_17 = i_21 + 1;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining statement: ivtmp_10 = ivtmp_18 - 1;
foo.c:9:21: note:   skip.
foo.c:9:21: note:   ==> examining statement: if (ivtmp_10 != 0)
foo.c:9:21: note:   skip.
foo.c:9:21: note:   vectorization factor = 4
foo.c:9:21: note:   === vect_compute_single_scalar_iteration_cost ===
*_3 1 times scalar_load costs 1 in prologue
*_6 1 times scalar_load costs 1 in prologue
(int) aval_13 1 times scalar_stmt costs 1 in prologue
_7 < min_v_15(D) 1 times scalar_stmt costs 1 in prologue
_9 ? last_16 : last_19 1 times scalar_stmt costs 1 in prologue
foo.c:9:21: note:   === vect_analyze_slp ===
foo.c:9:21: note:   === vect_make_slp_decision ===
foo.c:9:21: note:  vectorization_factor = 4, niters = 43
foo.c:9:21: note:   === vect_analyze_data_refs_alignment ===
foo.c:9:21: note:   recording new base alignment for a_12(D)
  alignment:    2
  misalignment: 0
  based on:     aval_13 = *_3;
foo.c:9:21: note:   recording new base alignment for b_14(D)
  alignment:    4
  misalignment: 0
  based on:     _7 = *_6;
foo.c:9:21: note:   vect_compute_data_ref_alignment:
foo.c:9:21: note:   can't force alignment of ref: *_3
foo.c:9:21: note:   vect_compute_data_ref_alignment:
foo.c:9:21: note:   can't force alignment of ref: *_6
foo.c:9:21: note:   === vect_prune_runtime_alias_test_list ===
foo.c:9:21: note:   === vect_dissolve_slp_only_groups ===
foo.c:9:21: note:   === vect_analyze_loop_operations ===
foo.c:9:21: note:   examining phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note:   vect_is_simple_use: operand (int) aval_13, type of def: internal
foo.c:9:21: note:   vect_is_simple_use: vectype vector(2) int
foo.c:9:21: note:   vect_is_simple_use: operand last_19 = PHI <last_8(7), 108(15)>, type of def: reduction
foo.c:9:21: note:   vect_is_simple_use: vectype vector(2) int
foo.c:9:21: missed:   multiple types in double reduction or condition reduction or fold-left reduction.
foo.c:4:1: missed:   not vectorized: relevant phi not supported: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: missed:  bad operation or unsupported loop bound.
foo.c:9:21: note:  ***** Analysis  failed with vector mode V2SI
foo.c:9:21: optimized: loop vectorized using 8 byte vectors
foo.c:9:21: note:  === vec_transform_loop ===
split exit edge
split exit edge of scalar loop
Removing basic block 19
;; basic block 19, loop depth 0
;;  pred:       16
;;  succ:      


foo.c:9:21: note:  vect_can_advance_ivs_p:
foo.c:9:21: note:  Analyze phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note:  reduc or virtual phi. skip.
foo.c:9:21: note:  Analyze phi: i_21 = PHI <i_17(7), 0(15)>
foo.c:9:21: note:  Analyze phi: ivtmp_18 = PHI <ivtmp_10(7), 43(15)>
foo.c:9:21: note:  vect_update_ivs_after_vectorizer: phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note:  reduc or virtual phi. skip.
foo.c:9:21: note:  vect_update_ivs_after_vectorizer: phi: i_21 = PHI <i_17(7), 0(15)>
foo.c:9:21: note:  vect_update_ivs_after_vectorizer: phi: ivtmp_18 = PHI <ivtmp_10(7), 43(15)>
;; Guessed iterations of loop 3 is 42.052870. New upper bound 2.
;; Scaling loop 3 with scale 7.0% (guessed) to reach upper bound 2
foo.c:9:21: note:  ------>vectorizing phi: last_19 = PHI <last_8(7), 108(25)>
foo.c:9:21: note:  transform phi.
foo.c:9:21: note:  ------>vectorizing phi: i_21 = PHI <i_17(7), 0(25)>
foo.c:9:21: note:  ------>vectorizing phi: ivtmp_18 = PHI <ivtmp_10(7), 43(25)>
foo.c:9:21: note:  ------>vectorizing phi: vect_last_19.7_67 = PHI <(7), { 108, 108, 108, 108 }(25)>
foo.c:9:21: note:  ------>vectorizing statement: _1 = (long unsigned int) i_21;
foo.c:9:21: note:  ------>vectorizing statement: patt_40 = i_21 w* 2;
foo.c:9:21: note:  ------>vectorizing statement: patt_41 = (long unsigned int) patt_40;
foo.c:9:21: note:  ------>vectorizing statement: _3 = a_12(D) + _2;
foo.c:9:21: note:  ------>vectorizing statement: aval_13 = *_3;
foo.c:9:21: note:  transform statement.
foo.c:9:21: note:  transform load. ncopies = 1
foo.c:9:21: note:  create vector_type-pointer variable to type: vector(4) short int  vectorizing a pointer ref: *a_12(D)
foo.c:9:21: note:  created a_12(D)
foo.c:9:21: note:  add new stmt: vect_aval_13.10_70 = MEM <vector(4) short int> [(short int *)vectp_a.8_68];
foo.c:9:21: note:  ------>vectorizing statement: patt_42 = i_21 w* 4;
foo.c:9:21: note:  ------>vectorizing statement: patt_43 = (long unsigned int) patt_42;
foo.c:9:21: note:  ------>vectorizing statement: _6 = b_14(D) + _5;
foo.c:9:21: note:  ------>vectorizing statement: _7 = *_6;
foo.c:9:21: note:  transform statement.
foo.c:9:21: note:  transform load. ncopies = 1
foo.c:9:21: note:  create vector_type-pointer variable to type: vector(4) int  vectorizing a pointer ref: *b_14(D)
foo.c:9:21: note:  created b_14(D)
foo.c:9:21: note:  add new stmt: vect__7.13_73 = MEM <vector(4) int> [(int *)vectp_b.11_71];
foo.c:9:21: note:  ------>vectorizing statement: last_16 = (int) aval_13;
foo.c:9:21: note:  transform statement.
foo.c:9:21: note:  vect_is_simple_use: operand *_3, type of def: internal
foo.c:9:21: note:  vect_is_simple_use: vectype vector(4) short int
foo.c:9:21: note:  transform conversion. ncopies = 1.
foo.c:9:21: note:  vect_get_vec_defs_for_operand: aval_13
foo.c:9:21: note:  vect_is_simple_use: operand *_3, type of def: internal
foo.c:9:21: note:    def_stmt =  aval_13 = *_3;
foo.c:9:21: note:  add new stmt: vect_last_16.14_74 = (vector(4) int) vect_aval_13.10_70;
foo.c:9:21: note:  ------>vectorizing statement: _9 = _7 < min_v_15(D);
foo.c:9:21: note:  transform statement.
foo.c:9:21: note:  vect_is_simple_use: operand *_6, type of def: internal
foo.c:9:21: note:  vect_is_simple_use: vectype vector(4) int
foo.c:9:21: note:  vect_is_simple_use: operand min_v_15(D), type of def: external
foo.c:9:21: note:  vect_get_vec_defs_for_operand: _7
foo.c:9:21: note:  vect_is_simple_use: operand *_6, type of def: internal
foo.c:9:21: note:    def_stmt =  _7 = *_6;
foo.c:9:21: note:  vect_get_vec_defs_for_operand: min_v_15(D)
foo.c:9:21: note:  vect_is_simple_use: operand min_v_15(D), type of def: external
foo.c:9:21: note:  created new init_stmt: vect_cst__75 = {min_v_15(D), min_v_15(D), min_v_15(D), min_v_15(D)};
foo.c:9:21: note:  add new stmt: mask__9.15_76 = vect__7.13_73 < vect_cst__75;
foo.c:9:21: note:  ------>vectorizing statement: last_8 = _9 ? last_16 : last_19;
foo.c:9:21: note:  transform statement.
foo.c:9:21: note:  vect_is_simple_use: operand _7 < min_v_15(D), type of def: internal
foo.c:9:21: note:  vect_is_simple_use: vectype vector(4) <signed-boolean:32>
foo.c:9:21: note:  vect_is_simple_use: operand (int) aval_13, type of def: internal
foo.c:9:21: note:  vect_is_simple_use: vectype vector(4) int
foo.c:9:21: note:  vect_is_simple_use: operand last_19 = PHI <last_8(7), 108(25)>, type of def: reduction
foo.c:9:21: note:  vect_is_simple_use: vectype vector(4) int
foo.c:9:21: note:  vect_get_vec_defs_for_operand: _9
foo.c:9:21: note:  vect_is_simple_use: operand _7 < min_v_15(D), type of def: internal
foo.c:9:21: note:    def_stmt =  _9 = _7 < min_v_15(D);
foo.c:9:21: note:  vect_get_vec_defs_for_operand: last_16
foo.c:9:21: note:  vect_is_simple_use: operand (int) aval_13, type of def: internal
foo.c:9:21: note:    def_stmt =  last_16 = (int) aval_13;
foo.c:9:21: note:  vect_get_vec_defs_for_operand: last_19
foo.c:9:21: note:  vect_is_simple_use: operand last_19 = PHI <last_8(7), 108(25)>, type of def: reduction
foo.c:9:21: note:    def_stmt =  last_19 = PHI <last_8(7), 108(25)>
foo.c:9:21: note:  add new stmt: vect_last_8.16_77 = VEC_COND_EXPR <mask__9.15_76, vect_last_16.14_74, vect_last_19.7_67>;
foo.c:9:21: note:  ------>vectorizing statement: i_17 = i_21 + 1;
foo.c:9:21: note:  ------>vectorizing statement: ivtmp_10 = ivtmp_18 - 1;
foo.c:9:21: note:  ------>vectorizing statement: if (ivtmp_10 != 0)
foo.c:9:21: note:  New loop exit condition: if (ivtmp_91 < 10)
;; Scaling loop 1 with scale 25.0% (adjusted)
;; Guessed iterations of loop 1 is 9.763217. New upper bound 9.
;; Scaling loop 1 with scale 92.9% (guessed) to reach upper bound 9
foo.c:9:21: note:  LOOP VECTORIZED

foo.c:4:1: note: vectorized 1 loops in function.
;; Created LCSSA PHI: _92 = PHI <_81(3)>

Updating SSA:
Registering new PHI nodes in block #3
Updating SSA information for statement _81 = VEC_COND_EXPR <mask__9.15_76, ivtmp_78, _80>;
Registering new PHI nodes in block #7
Registering new PHI nodes in block #20
Updating SSA information for statement _83 = .REDUC_MAX (_81);
Updating SSA information for statement _85 = _81 == _84;
Registering new PHI nodes in block #21

SSA replacement table
N_i -> { O_1 ... O_j } means that N_i replaces O_1, ..., O_j

_92 -> { _81 }
Incremental SSA update started at block: 3
Number of blocks in CFG: 26
Number of blocks to update: 3 ( 12%)
Affected blocks: 3 7 20


Processing block 0: BB25
Value numbering stmt = vect_cst__75 = {min_v_15(D), min_v_15(D), min_v_15(D), min_v_15(D)};
Setting value number of vect_cst__75 to vect_cst__75 (changed)
marking outgoing edge 25 -> 3 executable
Making available beyond BB25 vect_cst__75 for value vect_cst__75
Processing block 1: BB3
Cannot trust state of predecessor edge 7 -> 3, marking executable
Value numbering stmt = last_19 = PHI <last_8(7), 108(25)>
Setting value number of last_19 to last_19 (changed)
Making available beyond BB3 last_19 for value last_19
Value numbering stmt = i_21 = PHI <i_17(7), 0(25)>
Setting value number of i_21 to i_21 (changed)
Making available beyond BB3 i_21 for value i_21
Value numbering stmt = ivtmp_18 = PHI <ivtmp_10(7), 43(25)>
Setting value number of ivtmp_18 to ivtmp_18 (changed)
Making available beyond BB3 ivtmp_18 for value ivtmp_18
Value numbering stmt = vect_last_19.7_67 = PHI <vect_last_8.16_77(7), { 108, 108, 108, 108 }(25)>
Setting value number of vect_last_19.7_67 to vect_last_19.7_67 (changed)
Making available beyond BB3 vect_last_19.7_67 for value vect_last_19.7_67
Value numbering stmt = vectp_a.8_68 = PHI <vectp_a.8_69(7), a_12(D)(25)>
Setting value number of vectp_a.8_68 to vectp_a.8_68 (changed)
Making available beyond BB3 vectp_a.8_68 for value vectp_a.8_68
Value numbering stmt = vectp_b.11_71 = PHI <vectp_b.11_72(7), b_14(D)(25)>
Setting value number of vectp_b.11_71 to vectp_b.11_71 (changed)
Making available beyond BB3 vectp_b.11_71 for value vectp_b.11_71
Value numbering stmt = ivtmp_78 = PHI <ivtmp_79(7), { 1, 2, 3, 4 }(25)>
Setting value number of ivtmp_78 to ivtmp_78 (changed)
Making available beyond BB3 ivtmp_78 for value ivtmp_78
Value numbering stmt = _80 = PHI <_81(7), { 0, 0, 0, 0 }(25)>
Setting value number of _80 to _80 (changed)
Making available beyond BB3 _80 for value _80
Value numbering stmt = ivtmp_90 = PHI <ivtmp_91(7), 0(25)>
Setting value number of ivtmp_90 to ivtmp_90 (changed)
Making available beyond BB3 ivtmp_90 for value ivtmp_90
Value numbering stmt = _1 = (long unsigned int) i_21;
Setting value number of _1 to _1 (changed)
Making available beyond BB3 _1 for value _1
Value numbering stmt = _2 = _1 * 2;
Setting value number of _2 to _2 (changed)
Making available beyond BB3 _2 for value _2
Value numbering stmt = _3 = a_12(D) + _2;
Setting value number of _3 to _3 (changed)
Making available beyond BB3 _3 for value _3
Value numbering stmt = vect_aval_13.10_70 = MEM <vector(4) short int> [(short int *)vectp_a.8_68];
Setting value number of vect_aval_13.10_70 to vect_aval_13.10_70 (changed)
Making available beyond BB3 vect_aval_13.10_70 for value vect_aval_13.10_70
Value numbering stmt = aval_13 = *_3;
Setting value number of aval_13 to aval_13 (changed)
Making available beyond BB3 aval_13 for value aval_13
Value numbering stmt = _5 = _1 * 4;
Setting value number of _5 to _5 (changed)
Making available beyond BB3 _5 for value _5
Value numbering stmt = _6 = b_14(D) + _5;
Setting value number of _6 to _6 (changed)
Making available beyond BB3 _6 for value _6
Value numbering stmt = vect__7.13_73 = MEM <vector(4) int> [(int *)vectp_b.11_71];
Setting value number of vect__7.13_73 to vect__7.13_73 (changed)
Making available beyond BB3 vect__7.13_73 for value vect__7.13_73
Value numbering stmt = _7 = *_6;
Setting value number of _7 to _7 (changed)
Making available beyond BB3 _7 for value _7
Value numbering stmt = vect_last_16.14_74 = (vector(4) int) vect_aval_13.10_70;
Setting value number of vect_last_16.14_74 to vect_last_16.14_74 (changed)
Making available beyond BB3 vect_last_16.14_74 for value vect_last_16.14_74
Value numbering stmt = last_16 = (int) aval_13;
Setting value number of last_16 to last_16 (changed)
Making available beyond BB3 last_16 for value last_16
Value numbering stmt = mask__9.15_76 = vect__7.13_73 < vect_cst__75;
Setting value number of mask__9.15_76 to mask__9.15_76 (changed)
Making available beyond BB3 mask__9.15_76 for value mask__9.15_76
Value numbering stmt = _9 = _7 < min_v_15(D);
Setting value number of _9 to _9 (changed)
Making available beyond BB3 _9 for value _9
Value numbering stmt = vect_last_8.16_77 = VEC_COND_EXPR <mask__9.15_76, vect_last_16.14_74, vect_last_19.7_67>;
Setting value number of vect_last_8.16_77 to vect_last_8.16_77 (changed)
Making available beyond BB3 vect_last_8.16_77 for value vect_last_8.16_77
Value numbering stmt = last_8 = _9 ? last_16 : last_19;
Setting value number of last_8 to last_8 (changed)
Making available beyond BB3 last_8 for value last_8
Value numbering stmt = i_17 = i_21 + 1;
Setting value number of i_17 to i_17 (changed)
Making available beyond BB3 i_17 for value i_17
Value numbering stmt = ivtmp_10 = ivtmp_18 - 1;
Setting value number of ivtmp_10 to ivtmp_10 (changed)
Making available beyond BB3 ivtmp_10 for value ivtmp_10
Value numbering stmt = vectp_a.8_69 = vectp_a.8_68 + 8;
Setting value number of vectp_a.8_69 to vectp_a.8_69 (changed)
Making available beyond BB3 vectp_a.8_69 for value vectp_a.8_69
Value numbering stmt = vectp_b.11_72 = vectp_b.11_71 + 16;
Setting value number of vectp_b.11_72 to vectp_b.11_72 (changed)
Making available beyond BB3 vectp_b.11_72 for value vectp_b.11_72
Value numbering stmt = _81 = VEC_COND_EXPR <mask__9.15_76, ivtmp_78, _80>;
Setting value number of _81 to _81 (changed)
Making available beyond BB3 _81 for value _81
Value numbering stmt = ivtmp_79 = ivtmp_78 + { 4, 4, 4, 4 };
Setting value number of ivtmp_79 to ivtmp_79 (changed)
Making available beyond BB3 ivtmp_79 for value ivtmp_79
Value numbering stmt = ivtmp_91 = ivtmp_90 + 1;
Setting value number of ivtmp_91 to ivtmp_91 (changed)
Making available beyond BB3 ivtmp_91 for value ivtmp_91
Value numbering stmt = if (ivtmp_91 < 10)
Recording on edge 3->7 ivtmp_91 lt_expr 10 == true
Recording on edge 3->7 ivtmp_91 ge_expr 10 == false
Recording on edge 3->7 ivtmp_91 ne_expr 10 == true
Recording on edge 3->7 ivtmp_91 le_expr 10 == true
Recording on edge 3->7 ivtmp_91 gt_expr 10 == false
Recording on edge 3->7 ivtmp_91 eq_expr 10 == false
marking outgoing edge 3 -> 7 executable
marking destination block 20 reachable
Processing block 2: BB7
RPO iteration over 3 blocks visited 3 blocks in total discovering 3 executable blocks iterating 1.0 times, a block was visited max. 1 times
RPO tracked 35 values available at 32 locations and 35 lattice elements
Removing basic block 9
;; basic block 9, loop depth 1
;;  pred:       16
;;              13
# last_23 = PHI <108(16), last_34(13)>
# i_24 = PHI <0(16), i_35(13)>
# ivtmp_25 = PHI <43(16), ivtmp_36(13)>
_26 = (long unsigned int) i_24;
_27 = _26 * 2;
_28 = a_12(D) + _27;
aval_29 = *_28;
_30 = _26 * 4;
_31 = b_14(D) + _30;
_32 = *_31;
if (_32 < min_v_15(D))
  goto <bb 11>; [50.00%]
else
  goto <bb 12>; [50.00%]
;;  succ:       11
;;              12


Removing basic block 11
;; basic block 11, loop depth 1
;;  pred:      
last_33 = (int) _29;
;;  succ:       12


Removing basic block 12
;; basic block 12, loop depth 1
;;  pred:      
# last_34 = PHI <>
i_35 = _24 + 1;
ivtmp_36 = _25 - 1;
if (ivtmp_36 != 0)
  goto <bb 13>; [97.68%]
else
  goto <bb 18>; [2.32%]
;;  succ:       13
;;              18


Removing basic block 13
;; basic block 13, loop depth 1
;;  pred:      
;;  succ:      


Removing basic block 16
;; basic block 16, loop depth 0
;;  pred:      
;;  succ:      


Removing basic block 18
;; basic block 18, loop depth 0
;;  pred:      
# last_51 = PHI <>
goto <bb 6>; [100.00%]
;;  succ:       6


Merging blocks 2 and 15
Merging blocks 17 and 6
Merging blocks 2 and 25
fix_loop_structure: fixing up loops for function
fix_loop_structure: removing loop 2
__attribute__((noipa, noinline, noclone, no_icf))
int condition_reduction (short int * a, int min_v, int * b)
{
  int stmp_last_8.17;
  vector(4) int vect_last_8.16;
  vector(4) <signed-boolean:32> mask__9.15;
  vector(4) int vect_last_16.14;
  vector(4) int vect__7.13;
  int * vectp_b.12;
  vector(4) int * vectp_b.11;
  vector(4) short int vect_aval_13.10;
  short int * vectp_a.9;
  vector(4) short int * vectp_a.8;
  vector(4) int vect_last_19.7;
  unsigned int tmp.6;
  int tmp.5;
  int i;
  short int aval;
  int last;
  long unsigned int _1;
  long unsigned int _2;
  short int * _3;
  long unsigned int _5;
  int * _6;
  int _7;
  _Bool _9;
  unsigned int ivtmp_10;
  unsigned int ivtmp_18;
  _Bool _22;
  unsigned int ivtmp_54;
  long unsigned int _55;
  long unsigned int _56;
  short int * _57;
  long unsigned int _59;
  int * _60;
  int _61;
  unsigned int ivtmp_64;
  vector(4) int vect_cst__75;
  vector(4) unsigned int ivtmp_78;
  vector(4) unsigned int ivtmp_79;
  vector(4) unsigned int _80;
  vector(4) unsigned int _81;
  unsigned int _83;
  vector(4) unsigned int _84;
  vector(4) <signed-boolean:32> _85;
  vector(4) int _86;
  vector(4) unsigned int _87;
  unsigned int _88;
  int _89;
  unsigned int ivtmp_90;
  unsigned int ivtmp_91;
  vector(4) unsigned int _92;

  <bb 2> [local count: 24373936]:
  _22 = 1;
  vect_cst__75 = {min_v_15(D), min_v_15(D), min_v_15(D), min_v_15(D)};

  <bb 3> [local count: 243739360]:
  # last_19 = PHI <last_8(7), 108(2)>
  # i_21 = PHI <i_17(7), 0(2)>
  # ivtmp_18 = PHI <ivtmp_10(7), 43(2)>
  # vect_last_19.7_67 = PHI <vect_last_8.16_77(7), { 108, 108, 108, 108 }(2)>
  # vectp_a.8_68 = PHI <vectp_a.8_69(7), a_12(D)(2)>
  # vectp_b.11_71 = PHI <vectp_b.11_72(7), b_14(D)(2)>
  # ivtmp_78 = PHI <ivtmp_79(7), { 1, 2, 3, 4 }(2)>
  # _80 = PHI <_81(7), { 0, 0, 0, 0 }(2)>
  # ivtmp_90 = PHI <ivtmp_91(7), 0(2)>
  _1 = (long unsigned int) i_21;
  _2 = _1 * 2;
  _3 = a_12(D) + _2;
  vect_aval_13.10_70 = MEM <vector(4) short int> [(short int *)vectp_a.8_68];
  aval_13 = *_3;
  _5 = _1 * 4;
  _6 = b_14(D) + _5;
  vect__7.13_73 = MEM <vector(4) int> [(int *)vectp_b.11_71];
  _7 = *_6;
  vect_last_16.14_74 = (vector(4) int) vect_aval_13.10_70;
  last_16 = (int) aval_13;
  mask__9.15_76 = vect__7.13_73 < vect_cst__75;
  _9 = _7 < min_v_15(D);
  vect_last_8.16_77 = VEC_COND_EXPR <mask__9.15_76, vect_last_16.14_74, vect_last_19.7_67>;
  last_8 = _9 ? last_16 : last_19;
  i_17 = i_21 + 1;
  ivtmp_10 = ivtmp_18 - 1;
  vectp_a.8_69 = vectp_a.8_68 + 8;
  vectp_b.11_72 = vectp_b.11_71 + 16;
  _81 = VEC_COND_EXPR <mask__9.15_76, ivtmp_78, _80>;
  ivtmp_79 = ivtmp_78 + { 4, 4, 4, 4 };
  ivtmp_91 = ivtmp_90 + 1;
  if (ivtmp_91 < 10)
    goto <bb 7>; [90.00%]
  else
    goto <bb 20>; [10.00%]

  <bb 7> [local count: 219365424]:
  goto <bb 3>; [100.00%]

  <bb 20> [local count: 24373936]:
  # last_66 = PHI <last_8(3)>
  # vect_last_8.16_82 = PHI <vect_last_8.16_77(3)>
  # _92 = PHI <_81(3)>
  _83 = .REDUC_MAX (_92);
  _84 = {_83, _83, _83, _83};
  _85 = _92 == _84;
  _86 = VEC_COND_EXPR <_85, vect_last_8.16_82, { 0, 0, 0, 0 }>;
  _87 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(_86);
  _88 = .REDUC_MAX (_87);
  _89 = (int) _88;

  <bb 21> [local count: 73121805]:
  # last_52 = PHI <_89(20), last_62(22)>
  # i_53 = PHI <40(20), i_63(22)>
  # ivtmp_54 = PHI <3(20), ivtmp_64(22)>
  _55 = (long unsigned int) i_53;
  _56 = _55 * 2;
  _57 = a_12(D) + _56;
  aval_58 = *_57;
  _59 = _55 * 4;
  _60 = b_14(D) + _59;
  _61 = *_60;
  if (_61 < min_v_15(D))
    goto <bb 24>; [50.00%]
  else
    goto <bb 23>; [50.00%]

  <bb 22> [local count: 48747874]:
  goto <bb 21>; [100.00%]

  <bb 23> [local count: 73121805]:
  # last_62 = PHI <last_52(21), last_65(24)>
  i_63 = i_53 + 1;
  ivtmp_64 = ivtmp_54 - 1;
  if (ivtmp_64 != 0)
    goto <bb 22>; [66.67%]
  else
    goto <bb 17>; [33.33%]

  <bb 24> [local count: 36560903]:
  last_65 = (int) aval_58;
  goto <bb 23>; [100.00%]

  <bb 17> [local count: 24373936]:
  # last_50 = PHI <last_62(23)>
  return last_50;

}
  
Richard Biener Aug. 7, 2023, 7:48 a.m. UTC | #8
On Mon, Aug 7, 2023 at 2:05 AM Prathamesh Kulkarni via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On Thu, 3 Aug 2023 at 17:48, Richard Biener <rguenther@suse.de> wrote:
> >
> > On Thu, 3 Aug 2023, Richard Biener wrote:
> >
> > > On Thu, 3 Aug 2023, Richard Biener wrote:
> > >
> > > > On Thu, 3 Aug 2023, Prathamesh Kulkarni wrote:
> > > >
> > > > > On Wed, 2 Aug 2023 at 14:17, Richard Biener via Gcc-patches
> > > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > > >
> > > > > > On Mon, 31 Jul 2023, Jeff Law wrote:
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On 7/28/23 01:05, Richard Biener via Gcc-patches wrote:
> > > > > > > > The following delays sinking of loads within the same innermost
> > > > > > > > loop when it was unconditional before.  That's a not uncommon
> > > > > > > > issue preventing vectorization when masked loads are not available.
> > > > > > > >
> > > > > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> > > > > > > >
> > > > > > > > I have a followup patch improving sinking that without this would
> > > > > > > > cause more of the problematic sinking - now that we have a second
> > > > > > > > sink pass after loop opts this looks like a reasonable approach?
> > > > > > > >
> > > > > > > > OK?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Richard.
> > > > > > > >
> > > > > > > >  PR tree-optimization/92335
> > > > > > > >  * tree-ssa-sink.cc (select_best_block): Before loop
> > > > > > > >  optimizations avoid sinking unconditional loads/stores
> > > > > > > >  in innermost loops to conditional executed places.
> > > > > > > >
> > > > > > > >  * gcc.dg/tree-ssa/ssa-sink-10.c: Disable vectorizing.
> > > > > > > >  * gcc.dg/tree-ssa/predcom-9.c: Clone from ssa-sink-10.c,
> > > > > > > >  expect predictive commoning to happen instead of sinking.
> > > > > > > >  * gcc.dg/vect/pr65947-3.c: Adjust.
> > > > > > > I think it's reasonable -- there's probably going to be cases where it's not
> > > > > > > great, but more often than not I think it's going to be a reasonable
> > > > > > > heuristic.
> > > > > > >
> > > > > > > If there is undesirable fallout, better to find it over the coming months than
> > > > > > > next spring.  So I'd suggest we go forward now to give more time to find any
> > > > > > > pathological cases (if they exist).
> > > > > >
> > > > > > Agreed, I've pushed this now.
> > > > > Hi Richard,
> > > > > After this patch (committed in 399c8dd44ff44f4b496223c7cc980651c4d6f6a0),
> > > > > pr65947-7.c "failed" for aarch64-linux-gnu:
> > > > > FAIL: gcc.dg/vect/pr65947-7.c scan-tree-dump-not vect "LOOP VECTORIZED"
> > > > > FAIL: gcc.dg/vect/pr65947-7.c -flto -ffat-lto-objects
> > > > > scan-tree-dump-not vect "LOOP VECTORIZED"
> > > > >
> > > > > /* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" { target {
> > > > > ! vect_fold_extract_last } } } } */
> > > > >
> > > > > With your commit, condition_reduction in pr65947-7.c gets vectorized
> > > > > regardless of vect_fold_extract_last,
> > > > > which gates the above test (which is an improvement, because the
> > > > > function didn't get vectorized before the commit).
> > > > >
> > > > > The attached patch thus removes the gating on vect_fold_extract_last,
> > > > > and the test passes again.
> > > > > OK to commit ?
> > > >
> > > > OK.
> > >
> > > Or wait - the loop doesn't vectorize on x86_64, so I guess one
> > > critical target condition is missing.  Can you figure out which?
> >
> > I see
> >
> > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21:
> > note:   vect_is_simple_use: operand last_19 = PHI <last_8(7), 108(15)>,
> > type of def: reduction
> > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21:
> > note:   vect_is_simple_use: vectype vector(4) int
> > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21:
> > missed:   multiple types in double reduction or condition reduction or
> > fold-left reduction.
> > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:13:1:
> > missed:   not vectorized: relevant phi not supported: last_19 = PHI
> > <last_8(7), 108(15)>
> > /space/rguenther/src/gcc/gcc/testsuite/gcc.dg/vect/pr65947-7.c:18:21:
> > missed:  bad operation or unsupported loop bound.
> Hi Richard,
> Looking at the aarch64 vect dump, it seems the loop in
> condition_reduction gets vectorized with V4HI mode
> while fails for other modes in vectorizable_condition:
>
>   if ((double_reduc || reduction_type != TREE_CODE_REDUCTION)
>       && ncopies > 1)
>     {
>       if (dump_enabled_p ())
>         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>                          "multiple types in double reduction or condition "
>                          "reduction or fold-left reduction.\n");
>       return false;
>     }
>
> From the dump:
> foo.c:9:21: note:   === vect_analyze_loop_operations ===
> foo.c:9:21: note:   examining phi: last_19 = PHI <last_8(7), 108(15)>
> foo.c:9:21: note:   vect_is_simple_use: operand (int) aval_13, type of
> def: internal
> foo.c:9:21: note:   vect_is_simple_use: vectype vector(4) int
> foo.c:9:21: note:   vect_is_simple_use: operand last_19 = PHI
> <last_8(7), 108(15)>, type of def: reduction
> foo.c:9:21: note:   vect_is_simple_use: vectype vector(4) int
>
> For V8HI, VF = 8, and vectype_in = vector(4) int.
> Thus ncopies = VF / length(vectype_in) = 2, which is greater than 1,
> and thus fails:
> foo.c:9:21: missed:   multiple types in double reduction or condition
> reduction or fold-left reduction.
> foo.c:4:1: missed:   not vectorized: relevant phi not supported:
> last_19 = PHI <last_8(7), 108(15)>
> While for V4HI, VF = 4 and thus ncopies = 1, so it succeeds.
>
> > For x86_64, the vectorizer doesn't seem to try V4HI mode.
> If I "force" the vectorizer to use V4HI mode, we get the following dump:
> foo.c:9:21: note:   === vect_analyze_loop_operations ===
> foo.c:9:21: note:   examining phi: last_19 = PHI <last_8(7), 108(15)>
> foo.c:9:21: note:   vect_is_simple_use: operand (int) aval_13, type of
> def: internal
> foo.c:9:21: note:   vect_is_simple_use: vectype vector(2) int
> foo.c:9:21: note:   vect_is_simple_use: operand last_19 = PHI
> <last_8(7), 108(15)>, type of def: reduction
> foo.c:9:21: note:   vect_is_simple_use: vectype vector(2) int
> foo.c:9:21: missed:   multiple types in double reduction or condition
> reduction or fold-left reduction.
>
> Not sure though if this is the only reason for the test failing to
> vectorize on the target.
> Will investigate in more detail next week.

The odd thing is that you say

  for (int i = 0; i < N; i++)
    {
      aval = a[i];
      if (b[i] < min_v)
        last = aval;
    }

fails to vectorize but

  for (int i = 0; i < N; i++)
    {
      if (b[i] < min_v)
        last = a[i];
    }

succeeds?  The IL difference should be irrelevant for the reduction
vectorization:

  <bb 3> [local count: 1049367889]:
  # last_19 = PHI <last_8(7), 108(15)>
  # i_21 = PHI <i_17(7), 0(15)>
  # ivtmp_18 = PHI <ivtmp_10(7), 43(15)>
  _1 = (long unsigned int) i_21;
  _2 = _1 * 2;
  _3 = a_12(D) + _2;
  aval_13 = *_3;
  _5 = _1 * 4;
  _6 = b_14(D) + _5;
  _7 = *_6;
  last_16 = (int) aval_13;
  _9 = _7 < min_v_15(D);
  last_8 = _9 ? last_16 : last_19;
  i_17 = i_21 + 1;
  ivtmp_10 = ivtmp_18 - 1;
  if (ivtmp_10 != 0)
    goto <bb 7>; [97.68%]

vs

  <bb 3> [local count: 1049367889]:
  # last_19 = PHI <last_9(7), 108(15)>
  # i_21 = PHI <i_17(7), 0(15)>
  # ivtmp_11 = PHI <ivtmp_10(7), 43(15)>
  _1 = (long unsigned int) i_21;
  _2 = _1 * 4;
  _3 = b_13(D) + _2;
  _4 = *_3;
  _5 = _4 < min_v_14(D);
  _6 = _1 * 2;
  _38 = _37 + _6;
  _7 = (short int *) _38;
  _8 = .MASK_LOAD (_7, 16B, _5);
  last_16 = (int) _8;
  last_9 = _5 ? last_16 : last_19;
  i_17 = i_21 + 1;
  ivtmp_10 = ivtmp_11 - 1;
  if (ivtmp_10 != 0)
    goto <bb 7>; [97.68%]

maybe since the "mask" is used twice with the .MASK_LOAD
we are not actually looking at the def (the comparison) and it's
the comparison which would introduce the "multiple types"?

That is, I wonder why not sinking the load, avoiding a conditional
load, makes a difference to vectorizing the condition/extract last reduction.

It doesn't seem to make a difference for x86.  That said, the "fix" is
probably sticking the correct target on the dump-check, it seems
that vect_fold_extract_last is no longer correct here.

Richard.

> Thanks,
> Prathamesh
> >
> > Richard.
  
Prathamesh Kulkarni Aug. 14, 2023, 2:58 p.m. UTC | #9
On Mon, 7 Aug 2023 at 13:19, Richard Biener <richard.guenther@gmail.com> wrote:
>
> On Mon, Aug 7, 2023 at 2:05 AM Prathamesh Kulkarni via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> > [...]
> > Hi Richard,
> > Looking at the aarch64 vect dump, it seems the loop in
> > condition_reduction gets vectorized with V4HI mode
> > while fails for other modes in vectorizable_condition:
> >
> >   if ((double_reduc || reduction_type != TREE_CODE_REDUCTION)
> >       && ncopies > 1)
> >     {
> >       if (dump_enabled_p ())
> >         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >                          "multiple types in double reduction or condition "
> >                          "reduction or fold-left reduction.\n");
> >       return false;
> >     }
> >
> > From the dump:
> > foo.c:9:21: note:   === vect_analyze_loop_operations ===
> > foo.c:9:21: note:   examining phi: last_19 = PHI <last_8(7), 108(15)>
> > foo.c:9:21: note:   vect_is_simple_use: operand (int) aval_13, type of
> > def: internal
> > foo.c:9:21: note:   vect_is_simple_use: vectype vector(4) int
> > foo.c:9:21: note:   vect_is_simple_use: operand last_19 = PHI
> > <last_8(7), 108(15)>, type of def: reduction
> > foo.c:9:21: note:   vect_is_simple_use: vectype vector(4) int
> >
> > For V8HI, VF = 8, and vectype_in = vector(4) int.
> > Thus ncopies = VF / length(vectype_in) = 2, which is greater than 1,
> > and thus fails:
> > foo.c:9:21: missed:   multiple types in double reduction or condition
> > reduction or fold-left reduction.
> > foo.c:4:1: missed:   not vectorized: relevant phi not supported:
> > last_19 = PHI <last_8(7), 108(15)>
> > While for V4HI, VF = 4 and thus ncopies = 1, so it succeeds.
> >
> > For x86_64, it seems the vectorizer doesn't seem to try V4HI mode.
> > If I "force" the vectorizer to use V4HI mode, we get the following dump:
> > foo.c:9:21: note:   === vect_analyze_loop_operations ===
> > foo.c:9:21: note:   examining phi: last_19 = PHI <last_8(7), 108(15)>
> > foo.c:9:21: note:   vect_is_simple_use: operand (int) aval_13, type of
> > def: internal
> > foo.c:9:21: note:   vect_is_simple_use: vectype vector(2) int
> > foo.c:9:21: note:   vect_is_simple_use: operand last_19 = PHI
> > <last_8(7), 108(15)>, type of def: reduction
> > foo.c:9:21: note:   vect_is_simple_use: vectype vector(2) int
> > foo.c:9:21: missed:   multiple types in double reduction or condition
> > reduction or fold-left reduction.
> >
> > Not sure tho if this is the only reason for the test to fail to
> > vectorize on the target.
> > Will investigate in more details next week.
>
> The odd thing is that you say
>
>   for (int i = 0; i < N; i++)
>     {
>       aval = a[i];
>       if (b[i] < min_v)
>         last = aval;
>     }
>
> fails to vectorize but
>
>   for (int i = 0; i < N; i++)
>     {
>       if (b[i] < min_v)
>         last = a[i];
>     }
>
> succeeds?  The IL difference should be irrelevant for the reduction
Hi Richard,
Sorry for the late response.
No, this case containing a conditional load doesn't vectorize on aarch64 either:
foo2.c:9:21: note:  === analyze_loop_nest ===
foo2.c:9:21: note:   === vect_analyze_loop_form ===
foo2.c:9:21: missed:   not vectorized: control flow in loop.
foo2.c:9:21: missed:  bad loop form.

For this test:
for (int i = 0; i < N; i++)
    {
      aval = a[i];
      if (b[i] < min_v)
        last = aval;
    }

IIUC the sink pass made the load conditional, preventing vectorization
(similar to the above), but your PR92335 fix delays sinking of the load
until after loop opts, so the loop gets vectorized.
Up to the vect pass, the dumps are similar for x86 and aarch64.
> vectorization:
>
> [...]
>
> maybe since the "mask" is used twice with the .MASK_LOAD
> we are not actually looking at the def (the comparison) and it's
> the comparison which would introduce the "multiple types"?
>
> That is, I wonder why not sinking the load, avoiding a conditional
> load, makes a difference to vectorizing the condition/extract last reduction.
IIUC, the issue is that the vector type used for the reduction can be
different from the vector type used for determining the VF, and the
above check in vectorizable_condition passes only if the VF matches the
length of the vector type used for the reduction.

For V4HI mode on aarch64, it sets VF = 4, which matches the length of
the vector type used for the reduction (vector(4) int):
foo.c:9:21: note:   examining phi: last_19 = PHI <last_8(7), 108(15)>
foo.c:9:21: note:   vect_is_simple_use: operand (int) aval_13, type of
def: internal
foo.c:9:21: note:   vect_is_simple_use: vectype vector(4) int
foo.c:9:21: note:   vect_is_simple_use: operand last_19 = PHI
<last_8(7), 108(15)>, type of def: reduction
foo.c:9:21: note:   vect_is_simple_use: vectype vector(4) int
>
> It doesn't seem to make a difference for x86.  That said, the "fix" is
> probably sticking the correct target on the dump-check, it seems
> that vect_fold_extract_last is no longer correct here.
Um sorry, I did go through the various checks in target-supports.exp,
but I am not sure which one would be appropriate for this case, and am
stuck here :/ Could you please suggest how to proceed?

Thanks,
Prathamesh
>
> Richard.
>
> > Thanks,
> > Prathamesh
> > >
> > > Richard.
  
Richard Biener Aug. 15, 2023, 7:36 a.m. UTC | #10
On Mon, 14 Aug 2023, Prathamesh Kulkarni wrote:

> On Mon, 7 Aug 2023 at 13:19, Richard Biener <richard.guenther@gmail.com> wrote:
> >
> > On Mon, Aug 7, 2023 at 2:05 AM Prathamesh Kulkarni via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> > >
> > > [...]
> > > Hi Richard,
> > > Looking at the aarch64 vect dump, it seems the loop in
> > > condition_reduction gets vectorized with V4HI mode
> > > while fails for other modes in vectorizable_condition:
> > >
> > >   if ((double_reduc || reduction_type != TREE_CODE_REDUCTION)
> > >       && ncopies > 1)
> > >     {
> > >       if (dump_enabled_p ())
> > >         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > >                          "multiple types in double reduction or condition "
> > >                          "reduction or fold-left reduction.\n");
> > >       return false;
> > >     }
> > >
> > > From the dump:
> > > foo.c:9:21: note:   === vect_analyze_loop_operations ===
> > > foo.c:9:21: note:   examining phi: last_19 = PHI <last_8(7), 108(15)>
> > > foo.c:9:21: note:   vect_is_simple_use: operand (int) aval_13, type of
> > > def: internal
> > > foo.c:9:21: note:   vect_is_simple_use: vectype vector(4) int
> > > foo.c:9:21: note:   vect_is_simple_use: operand last_19 = PHI
> > > <last_8(7), 108(15)>, type of def: reduction
> > > foo.c:9:21: note:   vect_is_simple_use: vectype vector(4) int
> > >
> > > For V8HI, VF = 8, and vectype_in = vector(4) int.
> > > Thus ncopies = VF / length(vectype_in) = 2, which is greater than 1,
> > > and thus fails:
> > > foo.c:9:21: missed:   multiple types in double reduction or condition
> > > reduction or fold-left reduction.
> > > foo.c:4:1: missed:   not vectorized: relevant phi not supported:
> > > last_19 = PHI <last_8(7), 108(15)>
> > > While for V4HI, VF = 4 and thus ncopies = 1, so it succeeds.
> > >
> > > For x86_64, it seems the vectorizer doesn't seem to try V4HI mode.
> > > If I "force" the vectorizer to use V4HI mode, we get the following dump:
> > > foo.c:9:21: note:   === vect_analyze_loop_operations ===
> > > foo.c:9:21: note:   examining phi: last_19 = PHI <last_8(7), 108(15)>
> > > foo.c:9:21: note:   vect_is_simple_use: operand (int) aval_13, type of
> > > def: internal
> > > foo.c:9:21: note:   vect_is_simple_use: vectype vector(2) int
> > > foo.c:9:21: note:   vect_is_simple_use: operand last_19 = PHI
> > > <last_8(7), 108(15)>, type of def: reduction
> > > foo.c:9:21: note:   vect_is_simple_use: vectype vector(2) int
> > > foo.c:9:21: missed:   multiple types in double reduction or condition
> > > reduction or fold-left reduction.
> > >
> > > Not sure tho if this is the only reason for the test to fail to
> > > vectorize on the target.
> > > Will investigate in more details next week.
> >
> > The odd thing is that you say
> >
> >   for (int i = 0; i < N; i++)
> >     {
> >       aval = a[i];
> >       if (b[i] < min_v)
> >         last = aval;
> >     }
> >
> > fails to vectorize but
> >
> >   for (int i = 0; i < N; i++)
> >     {
> >       if (b[i] < min_v)
> >         last = a[i];
> >     }
> >
> > succeeds?  The IL difference should be irrelevant for the reduction
> Hi Richard,
> Sorry for late response.
> No this case containing a conditional load doesn't vectorize on aarch64 either:
> foo2.c:9:21: note:  === analyze_loop_nest ===
> foo2.c:9:21: note:   === vect_analyze_loop_form ===
> foo2.c:9:21: missed:   not vectorized: control flow in loop.
> foo2.c:9:21: missed:  bad loop form.
> 
> For this test:
> for (int i = 0; i < N; i++)
>     {
>       aval = a[i];
>       if (b[i] < min_v)
>         last = aval;
>     }
> 
> IIUC sink pass made the load conditional preventing vectorization
> (similar to above),
> but your PR92335 fix delays the sinking of load before loop opts, and
> thus gets vectorized.
> Till vect pass, the dumps are similar for x86 and aarch64.
> > vectorization:
> >
> > [...]
> >
> > maybe since the "mask" is used twice with the .MASK_LOAD
> > we are not actually looking at the def (the comparison) and it's
> > the comparison which would introduce the "multiple types"?
> >
> > That is, I wonder why not sinking the load, avoiding a conditional
> > load, makes a difference to vectorizing the condition/extract last reduction.
> IIUC, the issue is that the vector type used for reduction seems to be
> different than
> the vector type used for determining VF, and it passes the above check
> in vectoriable_reduction,
> only if VF matches the length of the vector type used for reduction.
> 
> For V4HI mode on aarch64, it sets VF = 4 which matches with the length of vector
> type used for reduction (vector (4) int):
> foo.c:9:21: note:   examining phi: last_19 = PHI <last_8(7), 108(15)>
> foo.c:9:21: note:   vect_is_simple_use: operand (int) aval_13, type of
> def: internal
> foo.c:9:21: note:   vect_is_simple_use: vectype vector(4) int
> foo.c:9:21: note:   vect_is_simple_use: operand last_19 = PHI
> <last_8(7), 108(15)>, type of def: reduction
> foo.c:9:21: note:   vect_is_simple_use: vectype vector(4) int

so it doesn't use a fold_extract_last reduction but a regular condition
reduction?  How does it end up with a V4HI + V4SI combo here?
Ah, so the "key" is that we end up using

  vect_last_16.23_74 = (vector(4) int) vect_aval_13.19_70;
  vect_last_8.25_77 = VEC_COND_EXPR <mask__9.24_76, vect_last_16.23_74, 
vect_last_19.16_67>;

so we can promote V4HI to V4SI via direct conversion instead of
via unpacking.  The x86 backend has no such feature: while it can
do a zero_extend via punpcklwd, doing a sign_extend requires a
compare, unpacking the result with itself and then doing the
punpcklwd on top of that.  Of course that's what we're doing
when using vec_unpack_lo/hi_expr already.

> >
> > It doesn't seem to make a difference for x86.  That said, the "fix" is
> > probably sticking the correct target on the dump-check, it seems
> > that vect_fold_extract_last is no longer correct here.
> Um sorry, I did go thru various checks in target-supports.exp, but not
> sure which one will be appropriate for this case,
> and am stuck here :/ Could you please suggest how to proceed ?

Maybe Richard S. knows the magic thing to test; he originally
implemented the direct conversion support.  I suggest implementing
such dg-checks if they are not present (I can't find them),
possibly quite specific to the modes involved (like we have
other checks with _qi_to_hi suffixes; for float modes maybe
just _float).

Richard.
  
Richard Sandiford Aug. 15, 2023, 8:58 a.m. UTC | #11
Richard Biener <rguenther@suse.de> writes:
> On Mon, 14 Aug 2023, Prathamesh Kulkarni wrote:
>> On Mon, 7 Aug 2023 at 13:19, Richard Biener <richard.guenther@gmail.com> wrote:
>> > It doesn't seem to make a difference for x86.  That said, the "fix" is
>> > probably sticking the correct target on the dump-check, it seems
>> > that vect_fold_extract_last is no longer correct here.
>> Um sorry, I did go thru various checks in target-supports.exp, but not
>> sure which one will be appropriate for this case,
>> and am stuck here :/ Could you please suggest how to proceed ?
>
> Maybe Richard S. knows the magic thing to test, he originally
> implemented the direct conversion support.  I suggest to implement
> such dg-checks if they are not present (I can't find them),
> possibly quite specific to the modes involved (like we have
> other checks with _qi_to_hi suffixes, for float modes maybe
> just _float).

Yeah, can't remember specific selectors for that feature.  TBH I think
most (all?) of the tests were AArch64-specific.

Thanks,
Richard
  
Prathamesh Kulkarni Aug. 17, 2023, 5:10 p.m. UTC | #12
On Tue, 15 Aug 2023 at 14:28, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener <rguenther@suse.de> writes:
> > On Mon, 14 Aug 2023, Prathamesh Kulkarni wrote:
> >> On Mon, 7 Aug 2023 at 13:19, Richard Biener <richard.guenther@gmail.com> wrote:
> >> > It doesn't seem to make a difference for x86.  That said, the "fix" is
> >> > probably sticking the correct target on the dump-check, it seems
> >> > that vect_fold_extract_last is no longer correct here.
> >> Um sorry, I did go thru various checks in target-supports.exp, but not
> >> sure which one will be appropriate for this case,
> >> and am stuck here :/ Could you please suggest how to proceed ?
> >
> > Maybe Richard S. knows the magic thing to test, he originally
> > implemented the direct conversion support.  I suggest to implement
> > such dg-checks if they are not present (I can't find them),
> > possibly quite specific to the modes involved (like we have
> > other checks with _qi_to_hi suffixes, for float modes maybe
> > just _float).
>
> Yeah, can't remember specific selectors for that feature.  TBH I think
> most (all?) of the tests were AArch64-specific.
Hi,
As Richi mentioned above, the test now vectorizes on AArch64 because it
has support for direct conversion between vectors while x86 doesn't.
IIUC this is because supportable_convert_operation returns true for
V4HI -> V4SI on AArch64, since it can use extend_v4hiv4si2 for doing
the conversion?

In the attached patch, I added a new target check, vect_extend, which
(currently) returns 1 only for aarch64*-*-*, and which makes the test
PASS on both targets, although I am not sure if this is entirely
correct.
Does the patch look OK?

Thanks,
Prathamesh
>
> Thanks,
> Richard
diff --git a/gcc/testsuite/gcc.dg/vect/pr65947-7.c b/gcc/testsuite/gcc.dg/vect/pr65947-7.c
index 16cdcd1c6eb..c8623854af5 100644
--- a/gcc/testsuite/gcc.dg/vect/pr65947-7.c
+++ b/gcc/testsuite/gcc.dg/vect/pr65947-7.c
@@ -52,5 +52,4 @@ main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" { target vect_fold_extract_last } } } */
-/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" { target { ! vect_fold_extract_last } } } } */
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" { target vect_extend } } } */
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index 92b6f69730e..29ef64b84f3 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -7768,6 +7768,16 @@ proc check_effective_target_vect_unpack { } {
 	     || [istarget amdgcn*-*-*] }}]
 }
 
+# Return 1 if the target plus current options supports vector
+# conversion of chars (to shorts) and shorts (to ints), 0 otherwise.
+#
+# This won't change for different subtargets so cache the result.
+
+proc check_effective_target_vect_extend { } {
+    return [check_cached_effective_target_indexed vect_extend {
+      expr { [istarget aarch64*-*-*]}}]
+}
+
 # Return 1 if the target plus current options does not guarantee
 # that its STACK_BOUNDARY is >= the reguired vector alignment.
 #
  
Richard Biener Aug. 18, 2023, 9:29 a.m. UTC | #13
On Thu, 17 Aug 2023, Prathamesh Kulkarni wrote:

> On Tue, 15 Aug 2023 at 14:28, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
> >
> > Richard Biener <rguenther@suse.de> writes:
> > > On Mon, 14 Aug 2023, Prathamesh Kulkarni wrote:
> > >> On Mon, 7 Aug 2023 at 13:19, Richard Biener <richard.guenther@gmail.com> wrote:
> > >> > It doesn't seem to make a difference for x86.  That said, the "fix" is
> > >> > probably sticking the correct target on the dump-check, it seems
> > >> > that vect_fold_extract_last is no longer correct here.
> > >> Um sorry, I did go thru various checks in target-supports.exp, but not
> > >> sure which one will be appropriate for this case,
> > >> and am stuck here :/ Could you please suggest how to proceed ?
> > >
> > > Maybe Richard S. knows the magic thing to test, he originally
> > > implemented the direct conversion support.  I suggest to implement
> > > such dg-checks if they are not present (I can't find them),
> > > possibly quite specific to the modes involved (like we have
> > > other checks with _qi_to_hi suffixes, for float modes maybe
> > > just _float).
> >
> > Yeah, can't remember specific selectors for that feature.  TBH I think
> > most (all?) of the tests were AArch64-specific.
> Hi,
> As Richi mentioned above, the test now vectorizes on AArch64 because
> it has support for direct conversion
> between vectors while x86 doesn't. IIUC this is because
> supportable_convert_operation returns true
> for V4HI -> V4SI on AArch64 since it can use extend_v4hiv4si2 for
> doing the conversion ?
> 
> In the attached patch, I added a new target check vect_extend which
> (currently) returns 1 only for aarch64*-*-*,
> which makes the test PASS on both targets, although I am not sure if
> this is entirely correct.
> Does the patch look OK ?

Can you make vect_extend more specific, say vect_extend_hi_si or
what is specifically needed here?  Note I'll have to investigate
why x86 cannot vectorize here since in fact it does have
the extend operation ... it might also be worth splitting the
sign/zero extend case, so - vect_sign_extend_hi_si or
vect_extend_short_int?
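
A more mode-specific selector of the kind suggested here could follow the
same pattern as the vect_extend proc in the attached patch (a hypothetical
sketch; the proc name and the target list are assumptions, not an existing
check in target-supports.exp):

```tcl
# Return 1 if the target plus current options supports direct vector
# sign-extension from HImode to SImode elements, 0 otherwise.
#
# This won't change for different subtargets so cache the result.

proc check_effective_target_vect_sign_extend_hi_si { } {
    return [check_cached_effective_target_indexed vect_sign_extend_hi_si {
      expr { [istarget aarch64*-*-*] }}]
}
```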

> Thanks,
> Prathamesh
> >
> > Thanks,
> > Richard
>
  
Richard Biener Aug. 18, 2023, 11:41 a.m. UTC | #14
On Fri, 18 Aug 2023, Richard Biener wrote:

> On Thu, 17 Aug 2023, Prathamesh Kulkarni wrote:
> 
> > On Tue, 15 Aug 2023 at 14:28, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> > >
> > > Richard Biener <rguenther@suse.de> writes:
> > > > On Mon, 14 Aug 2023, Prathamesh Kulkarni wrote:
> > > >> On Mon, 7 Aug 2023 at 13:19, Richard Biener <richard.guenther@gmail.com> wrote:
> > > >> > It doesn't seem to make a difference for x86.  That said, the "fix" is
> > > >> > probably sticking the correct target on the dump-check, it seems
> > > >> > that vect_fold_extract_last is no longer correct here.
> > > >> Um sorry, I did go thru various checks in target-supports.exp, but not
> > > >> sure which one will be appropriate for this case,
> > > >> and am stuck here :/ Could you please suggest how to proceed ?
> > > >
> > > > Maybe Richard S. knows the magic thing to test, he originally
> > > > implemented the direct conversion support.  I suggest to implement
> > > > such dg-checks if they are not present (I can't find them),
> > > > possibly quite specific to the modes involved (like we have
> > > > other checks with _qi_to_hi suffixes, for float modes maybe
> > > > just _float).
> > >
> > > Yeah, can't remember specific selectors for that feature.  TBH I think
> > > most (all?) of the tests were AArch64-specific.
> > Hi,
> > As Richi mentioned above, the test now vectorizes on AArch64 because
> > it has support for direct conversion
> > between vectors while x86 doesn't. IIUC this is because
> > supportable_convert_operation returns true
> > for V4HI -> V4SI on AArch64 since it can use extend_v4hiv4si2 for
> > doing the conversion ?
> > 
> > In the attached patch, I added a new target check vect_extend which
> > (currently) returns 1 only for aarch64*-*-*,
> > which makes the test PASS on both targets, although I am not sure if
> > this is entirely correct.
> > Does the patch look OK ?
> 
> Can you make vect_extend more specific, say vect_extend_hi_si or
> what is specifically needed here?  Note I'll have to investigate
> why x86 cannot vectorize here since in fact it does have
> the extend operation ... it might also be worth splitting the
> sign/zero extend case, so - vect_sign_extend_hi_si or
> vect_extend_short_int?

And now, having analyzed _why_ x86 doesn't vectorize, it's rather
why we get this vectorized with NEON, which is because

static opt_machine_mode
aarch64_vectorize_related_mode (machine_mode vector_mode,
                                scalar_mode element_mode,
                                poly_uint64 nunits)
{
...
  /* Prefer to use 1 128-bit vector instead of 2 64-bit vectors.  */
  if (TARGET_SIMD
      && (vec_flags & VEC_ADVSIMD)
      && known_eq (nunits, 0U)
      && known_eq (GET_MODE_BITSIZE (vector_mode), 64U)
      && maybe_ge (GET_MODE_BITSIZE (element_mode)
                   * GET_MODE_NUNITS (vector_mode), 128U))
    {
      machine_mode res = aarch64_simd_container_mode (element_mode, 128);
      if (VECTOR_MODE_P (res))
        return res;

which makes us get a V4SImode vector for a V4HImode loop vector_mode.

So I think the appropriate effective DejaGnu target is
aarch64-*-* (there's none specific to advsimd; not sure if one
can disable that?)

Richard.

> > Thanks,
> > Prathamesh
> > >
> > > Thanks,
> > > Richard
> > 
> 
>
  
Prathamesh Kulkarni Aug. 19, 2023, 3:48 p.m. UTC | #15
On Fri, 18 Aug 2023 at 17:11, Richard Biener <rguenther@suse.de> wrote:
>
> On Fri, 18 Aug 2023, Richard Biener wrote:
>
> > On Thu, 17 Aug 2023, Prathamesh Kulkarni wrote:
> >
> > > On Tue, 15 Aug 2023 at 14:28, Richard Sandiford
> > > <richard.sandiford@arm.com> wrote:
> > > >
> > > > Richard Biener <rguenther@suse.de> writes:
> > > > > On Mon, 14 Aug 2023, Prathamesh Kulkarni wrote:
> > > > >> On Mon, 7 Aug 2023 at 13:19, Richard Biener <richard.guenther@gmail.com> wrote:
> > > > >> > It doesn't seem to make a difference for x86.  That said, the "fix" is
> > > > >> > probably sticking the correct target on the dump-check, it seems
> > > > >> > that vect_fold_extract_last is no longer correct here.
> > > > >> Um sorry, I did go thru various checks in target-supports.exp, but not
> > > > >> sure which one will be appropriate for this case,
> > > > >> and am stuck here :/ Could you please suggest how to proceed ?
> > > > >
> > > > > Maybe Richard S. knows the magic thing to test, he originally
> > > > > implemented the direct conversion support.  I suggest to implement
> > > > > such dg-checks if they are not present (I can't find them),
> > > > > possibly quite specific to the modes involved (like we have
> > > > > other checks with _qi_to_hi suffixes, for float modes maybe
> > > > > just _float).
> > > >
> > > > Yeah, can't remember specific selectors for that feature.  TBH I think
> > > > most (all?) of the tests were AArch64-specific.
> > > Hi,
> > > As Richi mentioned above, the test now vectorizes on AArch64 because
> > > it has support for direct conversion
> > > between vectors while x86 doesn't. IIUC this is because
> > > supportable_convert_operation returns true
> > > for V4HI -> V4SI on AArch64 since it can use extend_v4hiv4si2 for
> > > doing the conversion ?
> > >
> > > In the attached patch, I added a new target check vect_extend which
> > > (currently) returns 1 only for aarch64*-*-*,
> > > which makes the test PASS on both targets, although I am not sure if
> > > this is entirely correct.
> > > Does the patch look OK ?
> >
> > Can you make vect_extend more specific, say vect_extend_hi_si or
> > what is specifically needed here?  Note I'll have to investigate
> > why x86 cannot vectorize here since in fact it does have
> > the extend operation ... it might also be worth splitting the
> > sign/zero extend case, so - vect_sign_extend_hi_si or
> > vect_extend_short_int?
>
> And now, having analyzed _why_ x86 doesn't vectorize, it's rather
> why we get this vectorized with NEON, which is because
>
> static opt_machine_mode
> aarch64_vectorize_related_mode (machine_mode vector_mode,
>                                 scalar_mode element_mode,
>                                 poly_uint64 nunits)
> {
> ...
>   /* Prefer to use 1 128-bit vector instead of 2 64-bit vectors.  */
>   if (TARGET_SIMD
>       && (vec_flags & VEC_ADVSIMD)
>       && known_eq (nunits, 0U)
>       && known_eq (GET_MODE_BITSIZE (vector_mode), 64U)
>       && maybe_ge (GET_MODE_BITSIZE (element_mode)
>                    * GET_MODE_NUNITS (vector_mode), 128U))
>     {
>       machine_mode res = aarch64_simd_container_mode (element_mode, 128);
>       if (VECTOR_MODE_P (res))
>         return res;
>
> which makes us get a V4SImode vector for a V4HImode loop vector_mode.
Thanks for the explanation!
>
> So I think the appropriate effective dejagnu target is
> aarch64-*-* (there's none specifically to advsimd, not sure if one
> can disable that?)
The attached patch uses an aarch64*-*-* target check, and additionally,
for SVE (and other targets supporting vect_fold_extract_last), it checks
whether the condition reduction was carried out using FOLD_EXTRACT_LAST.
Does that look OK?

Thanks,
Prathamesh
>

> Richard.
>
> > > Thanks,
> > > Prathamesh
> > > >
> > > > Thanks,
> > > > Richard
> > >
> >
> >
>
> --
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
diff --git a/gcc/testsuite/gcc.dg/vect/pr65947-7.c b/gcc/testsuite/gcc.dg/vect/pr65947-7.c
index 16cdcd1c6eb..58c46df5c54 100644
--- a/gcc/testsuite/gcc.dg/vect/pr65947-7.c
+++ b/gcc/testsuite/gcc.dg/vect/pr65947-7.c
@@ -52,5 +52,5 @@ main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" { target vect_fold_extract_last } } } */
-/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" { target { ! vect_fold_extract_last } } } } */
+/* { dg-final { scan-tree-dump "optimizing condition reduction with FOLD_EXTRACT_LAST" "vect" { target vect_fold_extract_last } } } */
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" { target aarch64*-*-* } } } */
  
Richard Biener Aug. 21, 2023, 6:57 a.m. UTC | #16
On Sat, 19 Aug 2023, Prathamesh Kulkarni wrote:

> On Fri, 18 Aug 2023 at 17:11, Richard Biener <rguenther@suse.de> wrote:
> >
> > On Fri, 18 Aug 2023, Richard Biener wrote:
> >
> > > On Thu, 17 Aug 2023, Prathamesh Kulkarni wrote:
> > >
> > > > On Tue, 15 Aug 2023 at 14:28, Richard Sandiford
> > > > <richard.sandiford@arm.com> wrote:
> > > > >
> > > > > Richard Biener <rguenther@suse.de> writes:
> > > > > > On Mon, 14 Aug 2023, Prathamesh Kulkarni wrote:
> > > > > >> On Mon, 7 Aug 2023 at 13:19, Richard Biener <richard.guenther@gmail.com> wrote:
> > > > > >> > It doesn't seem to make a difference for x86.  That said, the "fix" is
> > > > > >> > probably sticking the correct target on the dump-check, it seems
> > > > > >> > that vect_fold_extract_last is no longer correct here.
> > > > > >> Um sorry, I did go thru various checks in target-supports.exp, but not
> > > > > >> sure which one will be appropriate for this case,
> > > > > >> and am stuck here :/ Could you please suggest how to proceed ?
> > > > > >
> > > > > > Maybe Richard S. knows the magic thing to test, he originally
> > > > > > implemented the direct conversion support.  I suggest to implement
> > > > > > such dg-checks if they are not present (I can't find them),
> > > > > > possibly quite specific to the modes involved (like we have
> > > > > > other checks with _qi_to_hi suffixes, for float modes maybe
> > > > > > just _float).
> > > > >
> > > > > Yeah, can't remember specific selectors for that feature.  TBH I think
> > > > > most (all?) of the tests were AArch64-specific.
> > > > Hi,
> > > > As Richi mentioned above, the test now vectorizes on AArch64 because
> > > > it has support for direct conversion
> > > > between vectors while x86 doesn't. IIUC this is because
> > > > supportable_convert_operation returns true
> > > > for V4HI -> V4SI on AArch64 since it can use extend_v4hiv4si2 for
> > > > doing the conversion ?
> > > >
> > > > In the attached patch, I added a new target check vect_extend which
> > > > (currently) returns 1 only for aarch64*-*-*,
> > > > which makes the test PASS on both targets, although I am not sure if
> > > > this is entirely correct.
> > > > Does the patch look OK ?
> > >
> > > Can you make vect_extend more specific, say vect_extend_hi_si or
> > > what is specifically needed here?  Note I'll have to investigate
> > > why x86 cannot vectorize here since in fact it does have
> > > the extend operation ... it might also be worth splitting the
> > > sign/zero extend case, so - vect_sign_extend_hi_si or
> > > vect_extend_short_int?
> >
> > And now, having analyzed _why_ x86 doesn't vectorize, it's rather
> > why we get this vectorized with NEON, which is because
> >
> > static opt_machine_mode
> > aarch64_vectorize_related_mode (machine_mode vector_mode,
> >                                 scalar_mode element_mode,
> >                                 poly_uint64 nunits)
> > {
> > ...
> >   /* Prefer to use 1 128-bit vector instead of 2 64-bit vectors.  */
> >   if (TARGET_SIMD
> >       && (vec_flags & VEC_ADVSIMD)
> >       && known_eq (nunits, 0U)
> >       && known_eq (GET_MODE_BITSIZE (vector_mode), 64U)
> >       && maybe_ge (GET_MODE_BITSIZE (element_mode)
> >                    * GET_MODE_NUNITS (vector_mode), 128U))
> >     {
> >       machine_mode res = aarch64_simd_container_mode (element_mode, 128);
> >       if (VECTOR_MODE_P (res))
> >         return res;
> >
> > which makes us get a V4SImode vector for a V4HImode loop vector_mode.
> Thanks for the explanation!
> >
> > So I think the appropriate effective DejaGnu target is
> > aarch64-*-* (there's none specific to advsimd; not sure if one
> > can disable that?)
> The attached patch uses an aarch64*-*-* target check, and additionally,
> for SVE (and other targets supporting vect_fold_extract_last), it checks
> whether the condition reduction was carried out using FOLD_EXTRACT_LAST.
> Does that look OK?

Works for me.

Richard.

> Thanks,
> Prathamesh
> >
> 
> > Richard.
> >
> > > > Thanks,
> > > > Prathamesh
> > > > >
> > > > > Thanks,
> > > > > Richard
> > > >
> > >
> > >
> >
> > --
> > Richard Biener <rguenther@suse.de>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
>
  
Prathamesh Kulkarni Aug. 21, 2023, 11:32 a.m. UTC | #17
On Mon, 21 Aug 2023 at 12:27, Richard Biener <rguenther@suse.de> wrote:
>
> On Sat, 19 Aug 2023, Prathamesh Kulkarni wrote:
>
> > On Fri, 18 Aug 2023 at 17:11, Richard Biener <rguenther@suse.de> wrote:
> > >
> > > On Fri, 18 Aug 2023, Richard Biener wrote:
> > >
> > > > On Thu, 17 Aug 2023, Prathamesh Kulkarni wrote:
> > > >
> > > > > On Tue, 15 Aug 2023 at 14:28, Richard Sandiford
> > > > > <richard.sandiford@arm.com> wrote:
> > > > > >
> > > > > > Richard Biener <rguenther@suse.de> writes:
> > > > > > > On Mon, 14 Aug 2023, Prathamesh Kulkarni wrote:
> > > > > > >> On Mon, 7 Aug 2023 at 13:19, Richard Biener <richard.guenther@gmail.com> wrote:
> > > > > > >> > It doesn't seem to make a difference for x86.  That said, the "fix" is
> > > > > > >> > probably sticking the correct target on the dump-check, it seems
> > > > > > >> > that vect_fold_extract_last is no longer correct here.
> > > > > > >> Um sorry, I did go thru various checks in target-supports.exp, but not
> > > > > > >> sure which one will be appropriate for this case,
> > > > > > >> and am stuck here :/ Could you please suggest how to proceed ?
> > > > > > >
> > > > > > > Maybe Richard S. knows the magic thing to test, he originally
> > > > > > > implemented the direct conversion support.  I suggest to implement
> > > > > > > such dg-checks if they are not present (I can't find them),
> > > > > > > possibly quite specific to the modes involved (like we have
> > > > > > > other checks with _qi_to_hi suffixes, for float modes maybe
> > > > > > > just _float).
> > > > > >
> > > > > > Yeah, can't remember specific selectors for that feature.  TBH I think
> > > > > > most (all?) of the tests were AArch64-specific.
> > > > > Hi,
> > > > > As Richi mentioned above, the test now vectorizes on AArch64 because
> > > > > it has support for direct conversion
> > > > > between vectors while x86 doesn't. IIUC this is because
> > > > > supportable_convert_operation returns true
> > > > > for V4HI -> V4SI on AArch64 since it can use extend_v4hiv4si2 for
> > > > > doing the conversion ?
> > > > >
> > > > > In the attached patch, I added a new target check vect_extend which
> > > > > (currently) returns 1 only for aarch64*-*-*,
> > > > > which makes the test PASS on both targets, although I am not sure if
> > > > > this is entirely correct.
> > > > > Does the patch look OK ?
> > > >
> > > > Can you make vect_extend more specific, say vect_extend_hi_si or
> > > > what is specifically needed here?  Note I'll have to investigate
> > > > why x86 cannot vectorize here since in fact it does have
> > > > the extend operation ... it might also be worth splitting the
> > > > sign/zero extend case, so - vect_sign_extend_hi_si or
> > > > vect_extend_short_int?
> > >
> > > And now, having analyzed _why_ x86 doesn't vectorize, it's rather
> > > why we get this vectorized with NEON, which is because
> > >
> > > static opt_machine_mode
> > > aarch64_vectorize_related_mode (machine_mode vector_mode,
> > >                                 scalar_mode element_mode,
> > >                                 poly_uint64 nunits)
> > > {
> > > ...
> > >   /* Prefer to use 1 128-bit vector instead of 2 64-bit vectors.  */
> > >   if (TARGET_SIMD
> > >       && (vec_flags & VEC_ADVSIMD)
> > >       && known_eq (nunits, 0U)
> > >       && known_eq (GET_MODE_BITSIZE (vector_mode), 64U)
> > >       && maybe_ge (GET_MODE_BITSIZE (element_mode)
> > >                    * GET_MODE_NUNITS (vector_mode), 128U))
> > >     {
> > >       machine_mode res = aarch64_simd_container_mode (element_mode, 128);
> > >       if (VECTOR_MODE_P (res))
> > >         return res;
> > >
> > > which makes us get a V4SImode vector for a V4HImode loop vector_mode.
> > Thanks for the explanation!
> > >
> > > So I think the appropriate effective DejaGnu target is
> > > aarch64-*-* (there's none specific to advsimd; not sure if one
> > > can disable that?)
> > The attached patch uses an aarch64*-*-* target check, and additionally,
> > for SVE (and other targets supporting vect_fold_extract_last), it checks
> > whether the condition reduction was carried out using FOLD_EXTRACT_LAST.
> > Does that look OK?
>
> Works for me.
Thanks, committed to trunk in dd606dc7c7e49feb7a900902ec6d35b421789173

Thanks,
Prathamesh
>
> Richard.
>
> > Thanks,
> > Prathamesh
> > >
> >
> > > Richard.
> > >
> > > > > Thanks,
> > > > > Prathamesh
> > > > > >
> > > > > > Thanks,
> > > > > > Richard
> > > > >
> > > >
> > > >
> > >
> > > --
> > > Richard Biener <rguenther@suse.de>
> > > SUSE Software Solutions Germany GmbH,
> > > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
> >
>
> --
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
  

Patch

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/predcom-9.c b/gcc/testsuite/gcc.dg/tree-ssa/predcom-9.c
new file mode 100644
index 00000000000..b0fb0e2d4c5
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/predcom-9.c
@@ -0,0 +1,20 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-sink-details -fdump-tree-pcom-details" } */
+
+int x[1024], y[1024], z[1024], w[1024];
+void foo (void)
+{
+  int i;
+  for (i = 1; i < 1024; ++i)
+    {
+      int a = x[i];
+      int b = y[i];
+      int c = x[i-1];
+      int d = y[i-1];
+      if (w[i])
+	z[i] = (a + b) + (c + d);
+    }
+}
+
+/* { dg-final { scan-tree-dump-not "Sinking # VUSE" "sink1" } } */
+/* { dg-final { scan-tree-dump "Executing predictive commoning without unrolling" "pcom" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-10.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-10.c
index 535cb3208f5..a35014be038 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-10.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-10.c
@@ -1,5 +1,5 @@ 
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-sink-details -fno-tree-pre" } */
+/* { dg-options "-O2 -fdump-tree-sink-details -fno-tree-vectorize -fno-tree-pre" } */
 
 int x[1024], y[1024], z[1024], w[1024];
 void foo (void)
diff --git a/gcc/testsuite/gcc.dg/vect/pr65947-3.c b/gcc/testsuite/gcc.dg/vect/pr65947-3.c
index f1bfad65c22..6b4077e1a62 100644
--- a/gcc/testsuite/gcc.dg/vect/pr65947-3.c
+++ b/gcc/testsuite/gcc.dg/vect/pr65947-3.c
@@ -51,10 +51,6 @@  main (void)
   return 0;
 }
 
-/* Since the fix for PR97307 which sinks the load of a[i], preventing
-   if-conversion to happen, targets that cannot do masked loads only
-   vectorize the inline copy.  */
-/* { dg-final { scan-tree-dump-times "LOOP VECTORIZED" 2 "vect" { target vect_masked_load } } } */
-/* { dg-final { scan-tree-dump-times "LOOP VECTORIZED" 1 "vect" { target { ! vect_masked_load } } } } */
+/* { dg-final { scan-tree-dump-times "LOOP VECTORIZED" 2 "vect" } } */
 /* { dg-final { scan-tree-dump-times "optimizing condition reduction with FOLD_EXTRACT_LAST" 2 "vect" { target vect_fold_extract_last } } } */
 /* { dg-final { scan-tree-dump-not "condition expression based on integer induction." "vect" } } */
diff --git a/gcc/tree-ssa-sink.cc b/gcc/tree-ssa-sink.cc
index cf0a32a954b..dcbe05b3b03 100644
--- a/gcc/tree-ssa-sink.cc
+++ b/gcc/tree-ssa-sink.cc
@@ -220,6 +220,18 @@  select_best_block (basic_block early_bb,
   if (bb_loop_depth (best_bb) < bb_loop_depth (early_bb))
     return best_bb;
 
+  /* Avoid turning an unconditional load/store into a conditional one when we
+     still might want to perform vectorization.  */
+  if (best_bb->loop_father == early_bb->loop_father
+      && loop_outer (best_bb->loop_father)
+      && !best_bb->loop_father->inner
+      && gimple_vuse (stmt)
+      && flag_tree_loop_vectorize
+      && !(cfun->curr_properties & PROP_loop_opts_done)
+      && dominated_by_p (CDI_DOMINATORS, best_bb->loop_father->latch, early_bb)
+      && !dominated_by_p (CDI_DOMINATORS, best_bb->loop_father->latch, best_bb))
+    return early_bb;
+
   /* Get the sinking threshold.  If the statement to be moved has memory
      operands, then increase the threshold by 7% as those are even more
      profitable to avoid, clamping at 100%.  */