[v5,0/19] Support early break/return auto-vectorization

Message ID patch-17494-tamar@arm.com
Tamar Christina June 28, 2023, 1:40 p.m. UTC
  Hi All,

This patch adds initial support for early break vectorization in GCC.
The support is added for any target that implements a vector cbranch optab,
this includes both fully masked and non-masked targets.

Depending on the operation, the vectorizer may also require support for boolean
mask reductions using Inclusive OR.  This is however only checked when the
comparison would produce multiple statements.

Concretely the kind of loops supported are of the forms:

 for (int i = 0; i < N; i++)
 {
   <statements1>
   if (<condition>)
     {
       ...
       <action>;
     }
   <statements2>
 }

where <action> can be:
 - break
 - return
 - goto

Any number of statements can be used before the <action> occurs.
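As a concrete instance of the shape above (a hypothetical example, not taken
from the patch itself; the names first_above, a and x are illustrative), the
"return" form looks like:

```c
#include <assert.h>

#define N 16

/* Hypothetical example of the "return" form of <action>: scan a
   statically sized buffer and leave the loop early from inside it.  */
static int first_above (const unsigned *a, unsigned x)
{
  for (int i = 0; i < N; i++)
    {
      /* <statements1> could go here.  */
      if (a[i] > x)   /* <condition> */
        return i;     /* <action>: return */
      /* <statements2> could go here.  */
    }
  return -1;
}
```

Replacing the return with a break, or a goto past the loop, gives the other
two supported forms.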

Since this is an initial version for GCC 14 it has the following limitations and
features:

- Only fixed sized iterations and buffers are supported.  That is to say any
  vectors loaded or stored must be to statically allocated arrays with known
  sizes.  N must also be known.  This limitation exists because our primary
  target for this optimization is SVE.  For VLA SVE we can't easily do cross
  page iteration checks, and the result is likely not to be beneficial either.
  For that reason we punt support for variable buffers until we have
  First-Faulting support in GCC.
- Any stores in <statements1> must not be to the same objects as in
  <condition>.  Loads are fine as long as they don't have the possibility to
  alias.  More concretely, we block RAW dependencies when the intermediate
  value can't be separated from the store, or the store itself can't be moved.
- The number of loop iterations must be known.  This is just a temporary
  limitation that I intend to address in GCC 14 itself as follow-on patches.
- Prologue peeling, alignment peeling and loop versioning are supported.
- Fully masked loops, unmasked loops and partially masked loops are supported.
- Any number of loop early exits are supported.
- The early exit must be before the natural loop exit/latch.  The vectorizer
  is designed in a way that propagates phi-nodes downwards, so supporting this
  inverted control flow is hard.
- No support for epilogue vectorization.  The only epilogue supported is the
  scalar final one.  Epilogue vectorization would also not be profitable.
- Early breaks are only supported for inner loop vectorization.
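To illustrate the store restriction above with a hypothetical sketch (the
names raw_blocked and a are illustrative only), a loop like the following
would be declined: the store in <statements1> writes the very object the
condition reads, a RAW dependence that can't be separated from the store.

```c
#include <assert.h>

#define N 16
static unsigned a[N];

/* Hypothetical example of a loop the limitation above rejects: a[i]
   is stored and then immediately read by the exit condition, so the
   store can't be moved past the break.  */
static unsigned raw_blocked (unsigned x)
{
  for (int i = 0; i < N; i++)
    {
      a[i] = x + i;        /* store in <statements1> ...  */
      if (a[i] > 2 * x)    /* ... read by <condition>: RAW dependence */
        break;
    }
  return a[0];
}
```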

I have pushed a branch to refs/users/tnfchris/heads/gcc-14-early-break

With the help of IPA and LTO this still gets hit quite often.  During
bootstrap it is hit rather frequently.  Additionally TSVC s332, s481 and s482
all pass now since these are tests for early exit vectorization support.

This implementation does not handle the early break completely inside the
vector loop itself.  Instead it adds checks such that if we know we have to
exit in the current iteration, we branch to scalar code which performs the
final VF iterations, handling all the code in <action>.

The niters analysis and the majority of the vectorizer, which had single_exit
hardcoded, have been updated to use a new function vec_loop_iv which returns
the exit the vectorizer wants to use as the main IV exit.

For niters, this exit is what determines the overall iteration count, as that
is the O(iters) of the loop.

For the scalar loop we know that whatever exit you take you have to perform at
most VF iterations.  For vector code we only care about the state of fully
performed iterations and reset the scalar code to the (partially) remaining
loop.

This new version of the patch does the majority of the work in a new,
rewritten loop peeling.  The new function maintains LCSSA all the way through
and no longer requires the touch-up functions the vectorizer used to
incrementally adjust the loops later on.  This means that aside from IV
updates and guard edge updates, the early exit code is identical to the
single exit cases.

When the loop is peeled during the copying I have to go to great lengths to
keep the dominators up to date.  All exits from the first loop are rewired to
the loop header of the second loop, but this can change the immediate
dominator.

The dominators can change again when we wire in the loop guard, as such peeling
now returns a list of dominators that need to be updated if a new guard edge is
added.

For the loop peeling we rewrite the loop from:


                     Header
                      ---
                      |x|
                       2
                       |
                       v
                -------3<------
     early exit |      |      |
                v      v      | latch
                7      4----->6
                |      |
                |      v
                |      8
                |      |
                |      v
                ------>5

into

                     Header
                      ---
                      |x|
                       2
                       |
                       v
                -------3<------
     early exit |      |      |
                v      v      | latch
                7      4----->6
                |      |
                |      v
                |      8
                |      |
                |      v
                |  New Header
                |     ---
                ----->|x|
                       9
                       |
                       v
                ------10<-----
     early exit |      |      |
                v      v      | latch
                14     11---->13
                |      |
                |      v
                |      12
                |      |
                |      v
                ------> 5

That is to say, the first vector loop executes as long as the early exit isn't
needed.  Once the exit is taken, the scalar code performs at most VF extra
iterations.  The exact number depends on peeling, the iteration start, and
which exit was taken (natural or early).  For this scalar loop, all early
exits are treated the same.

When we vectorize, any statement that is not related to the early break itself
and that would be incorrect to execute before the break (i.e. has side
effects) is moved to after the break.  If this is not possible we decline to
vectorize.

This means that we check at the start of an iteration whether we are going to
exit or not.  During the analysis phase we check whether we are allowed to do
this moving of statements.  Note that we move only the scalar statements, and
we do so after peeling but just before we start transforming statements.
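A scalar model of the resulting shape can make this concrete.  The following
is a simplified, hypothetical sketch (with VF fixed at 4, not the actual
generated code) of the test4 example shown below: the whole block is checked
up front, the side-effecting stores run only when no lane exits, and the
scalar tail redoes at most VF iterations in the original order.

```c
#include <assert.h>

#define N 8
#define VF 4

static unsigned va[N], vb[N];

/* Models the vector cbranch: does any lane of this block take the
   early exit?  */
static int block_exits (int base, unsigned x)
{
  for (int j = 0; j < VF; j++)
    if (va[base + j] > x)
      return 1;
  return 0;
}

/* Hypothetical scalar model of the vectorized shape: check the whole
   block first, with the side-effecting stores sunk below the check,
   then let scalar code redo at most VF iterations as written.  */
static unsigned test4_model (unsigned x)
{
  int i = 0;
  for (; i + VF <= N; i += VF)
    {
      if (block_exits (i, x))      /* early exit check, hoisted */
        break;                     /* fall through to scalar code */
      for (int j = 0; j < VF; j++) /* stores, moved after the check */
        {
          vb[i + j] = x + i + j;
          va[i + j] = x;
        }
    }
  for (; i < N; i++)               /* scalar code: original ordering */
    {
      vb[i] = x + i;
      if (va[i] > x)
        break;
      va[i] = x;
    }
  return 0;
}
```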

Codegen:

for e.g.

#define N 803
unsigned vect_a[N];
unsigned vect_b[N];

unsigned test4(unsigned x)
{
 unsigned ret = 0;
 for (int i = 0; i < N; i++)
 {
   vect_b[i] = x + i;
   if (vect_a[i] > x)
     break;
   vect_a[i] = x;

 }
 return ret;
}

We generate for Adv. SIMD:

test4:
        adrp    x2, .LC0
        adrp    x3, .LANCHOR0
        dup     v2.4s, w0
        add     x3, x3, :lo12:.LANCHOR0
        movi    v4.4s, 0x4
        add     x4, x3, 3216
        ldr     q1, [x2, #:lo12:.LC0]
        mov     x1, 0
        mov     w2, 0
        .p2align 3,,7
.L3:
        ldr     q0, [x3, x1]
        add     v3.4s, v1.4s, v2.4s
        add     v1.4s, v1.4s, v4.4s
        cmhi    v0.4s, v0.4s, v2.4s
        umaxp   v0.4s, v0.4s, v0.4s
        fmov    x5, d0
        cbnz    x5, .L6
        add     w2, w2, 1
        str     q3, [x1, x4]
        str     q2, [x3, x1]
        add     x1, x1, 16
        cmp     w2, 200
        bne     .L3
        mov     w7, 3
.L2:
        lsl     w2, w2, 2
        add     x5, x3, 3216
        add     w6, w2, w0
        sxtw    x4, w2
        ldr     w1, [x3, x4, lsl 2]
        str     w6, [x5, x4, lsl 2]
        cmp     w0, w1
        bcc     .L4
        add     w1, w2, 1
        str     w0, [x3, x4, lsl 2]
        add     w6, w1, w0
        sxtw    x1, w1
        ldr     w4, [x3, x1, lsl 2]
        str     w6, [x5, x1, lsl 2]
        cmp     w0, w4
        bcc     .L4
        add     w4, w2, 2
        str     w0, [x3, x1, lsl 2]
        sxtw    x1, w4
        add     w6, w1, w0
        ldr     w4, [x3, x1, lsl 2]
        str     w6, [x5, x1, lsl 2]
        cmp     w0, w4
        bcc     .L4
        str     w0, [x3, x1, lsl 2]
        add     w2, w2, 3
        cmp     w7, 3
        beq     .L4
        sxtw    x1, w2
        add     w2, w2, w0
        ldr     w4, [x3, x1, lsl 2]
        str     w2, [x5, x1, lsl 2]
        cmp     w0, w4
        bcc     .L4
        str     w0, [x3, x1, lsl 2]
.L4:
        mov     w0, 0
        ret
        .p2align 2,,3
.L6:
        mov     w7, 4
        b       .L2

and for SVE:

test4:
        adrp    x2, .LANCHOR0
        add     x2, x2, :lo12:.LANCHOR0
        add     x5, x2, 3216
        mov     x3, 0
        mov     w1, 0
        cntw    x4
        mov     z1.s, w0
        index   z0.s, #0, #1
        ptrue   p1.b, all
        ptrue   p0.s, all
        .p2align 3,,7
.L3:
        ld1w    z2.s, p1/z, [x2, x3, lsl 2]
        add     z3.s, z0.s, z1.s
        cmplo   p2.s, p0/z, z1.s, z2.s
        b.any   .L2
        st1w    z3.s, p1, [x5, x3, lsl 2]
        add     w1, w1, 1
        st1w    z1.s, p1, [x2, x3, lsl 2]
        add     x3, x3, x4
        incw    z0.s
        cmp     w3, 803
        bls     .L3
.L5:
        mov     w0, 0
        ret
        .p2align 2,,3
.L2:
        cntw    x5
        mul     w1, w1, w5
        cbz     w5, .L5
        sxtw    x1, w1
        sub     w5, w5, #1
        add     x5, x5, x1
        add     x6, x2, 3216
        b       .L6
        .p2align 2,,3
.L14:
        str     w0, [x2, x1, lsl 2]
        cmp     x1, x5
        beq     .L5
        mov     x1, x4
.L6:
        ldr     w3, [x2, x1, lsl 2]
        add     w4, w0, w1
        str     w4, [x6, x1, lsl 2]
        add     x4, x1, 1
        cmp     w0, w3
        bcs     .L14
        mov     w0, 0
        ret

On the workloads this work is based on we see a 2-3x performance uplift using
this patch.

Follow up plan:
 - Boolean vectorization has several shortcomings.  I've filed PR110223 with the
   bigger ones that cause vectorization to fail with this patch.
 - SLP support.  This is planned for GCC 15, as for the majority of cases
   building SLP itself fails.  This means I'll need to spend time making it
   more robust first.  Additionally it requires:
     * Adding support for vectorizing CFG (gconds)
     * Support for CFG to differ between vector and scalar loops.
   Both of which would be disruptive to the tree and I suspect I'll be handling
   fallouts from this patch for a while.  So I plan to work on the surrounding
   building blocks first for the remainder of the year.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
Also ran across various workloads and no issues.

When closer to acceptance I will run on other targets as well and clean up
related testsuite fallouts there.


Comments

Tamar Christina June 28, 2023, 1:54 p.m. UTC | #1
Resending attached only due to size limit

> -----Original Message-----
> From: Tamar Christina
> Sent: Wednesday, June 28, 2023 2:42 PM
> To: gcc-patches@gcc.gnu.org
> Cc: nd <nd@arm.com>; rguenther@suse.de; jlaw@ventanamicro.com
> Subject: [PATCH 3/19]middle-end clean up vect testsuite using pragma
> novector
> 
> Hi All,
> 
> The support for early break vectorization breaks lots of scan vect and slp
> testcases because they assume that loops with abort () in them cannot be
> vectorized.  Additionally it defeats the point of having a scalar loop to
> check the output of the vectorizer if that loop is also vectorized.
> 
> For that reason this adds #pragma GCC novector to all tests which have a
> scalar loop that we would have vectorized using this patch series.
> 
> FWIW, none of these tests were failing to vectorize or run before the pragma.
> The tests that did point to some issues were copied to the early break test
> suite as well.
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> 
> Ok for master?
> 
> Thanks,
> Tamar
> 
> gcc/testsuite/ChangeLog:
> 
> 	* g++.dg/vect/pr84556.cc: Add novector pragma.
> 	* g++.dg/vect/simd-1.cc: Add novector pragma.
> 	* g++.dg/vect/simd-2.cc: Add novector pragma.
> 	* g++.dg/vect/simd-3.cc: Add novector pragma.
> 	* g++.dg/vect/simd-4.cc: Add novector pragma.
> 	* g++.dg/vect/simd-5.cc: Add novector pragma.
> 	* g++.dg/vect/simd-6.cc: Add novector pragma.
> 	* g++.dg/vect/simd-7.cc: Add novector pragma.
> 	* g++.dg/vect/simd-8.cc: Add novector pragma.
> 	* g++.dg/vect/simd-9.cc: Add novector pragma.
> 	* g++.dg/vect/simd-clone-6.cc: Add novector pragma.
> 	* gcc.dg/vect/O3-pr70130.c: Add novector pragma.
> 	* gcc.dg/vect/Os-vect-95.c: Add novector pragma.
> 	* gcc.dg/vect/bb-slp-1.c: Add novector pragma.
> 	* gcc.dg/vect/bb-slp-16.c: Add novector pragma.
> 	* gcc.dg/vect/bb-slp-2.c: Add novector pragma.
> 	* gcc.dg/vect/bb-slp-24.c: Add novector pragma.
> 	* gcc.dg/vect/bb-slp-25.c: Add novector pragma.
> 	* gcc.dg/vect/bb-slp-26.c: Add novector pragma.
> 	* gcc.dg/vect/bb-slp-27.c: Add novector pragma.
> 	* gcc.dg/vect/bb-slp-28.c: Add novector pragma.
> 	* gcc.dg/vect/bb-slp-29.c: Add novector pragma.
> 	* gcc.dg/vect/bb-slp-42.c: Add novector pragma.
> 	* gcc.dg/vect/bb-slp-cond-1.c: Add novector pragma.
> 	* gcc.dg/vect/bb-slp-over-widen-1.c: Add novector pragma.
> 	* gcc.dg/vect/bb-slp-over-widen-2.c: Add novector pragma.
> 	* gcc.dg/vect/bb-slp-pattern-1.c: Add novector pragma.
> 	* gcc.dg/vect/bb-slp-pattern-2.c: Add novector pragma.
> 	* gcc.dg/vect/bb-slp-pow-1.c: Add novector pragma.
> 	* gcc.dg/vect/bb-slp-pr101615-2.c: Add novector pragma.
> 	* gcc.dg/vect/bb-slp-pr65935.c: Add novector pragma.
> 	* gcc.dg/vect/bb-slp-subgroups-1.c: Add novector pragma.
> 	* gcc.dg/vect/costmodel/i386/costmodel-vect-31.c: Add novector
> pragma.
> 	* gcc.dg/vect/costmodel/i386/costmodel-vect-33.c: Add novector
> pragma.
> 	* gcc.dg/vect/costmodel/i386/costmodel-vect-68.c: Add novector
> pragma.
> 	* gcc.dg/vect/costmodel/ppc/costmodel-slp-12.c: Add novector
> pragma.
> 	* gcc.dg/vect/costmodel/ppc/costmodel-slp-33.c: Add novector
> pragma.
> 	* gcc.dg/vect/costmodel/ppc/costmodel-slp-34.c: Add novector
> pragma.
> 	* gcc.dg/vect/costmodel/ppc/costmodel-vect-31a.c: Add novector
> pragma.
> 	* gcc.dg/vect/costmodel/ppc/costmodel-vect-31b.c: Add novector
> pragma.
> 	* gcc.dg/vect/costmodel/ppc/costmodel-vect-31c.c: Add novector
> pragma.
> 	* gcc.dg/vect/costmodel/ppc/costmodel-vect-33.c: Add novector
> pragma.
> 	* gcc.dg/vect/costmodel/ppc/costmodel-vect-68a.c: Add novector
> pragma.
> 	* gcc.dg/vect/costmodel/ppc/costmodel-vect-68b.c: Add novector
> pragma.
> 	* gcc.dg/vect/costmodel/ppc/costmodel-vect-68c.c: Add novector
> pragma.
> 	* gcc.dg/vect/costmodel/ppc/costmodel-vect-76a.c: Add novector
> pragma.
> 	* gcc.dg/vect/costmodel/ppc/costmodel-vect-76b.c: Add novector
> pragma.
> 	* gcc.dg/vect/costmodel/ppc/costmodel-vect-76c.c: Add novector
> pragma.
> 	* gcc.dg/vect/costmodel/ppc/costmodel-vect-outer-fir.c: Add
> novector pragma.
> 	* gcc.dg/vect/costmodel/x86_64/costmodel-vect-31.c: Add novector
> pragma.
> 	* gcc.dg/vect/costmodel/x86_64/costmodel-vect-33.c: Add novector
> pragma.
> 	* gcc.dg/vect/costmodel/x86_64/costmodel-vect-68.c: Add novector
> pragma.
> 	* gcc.dg/vect/fast-math-bb-slp-call-1.c: Add novector pragma.
> 	* gcc.dg/vect/fast-math-bb-slp-call-2.c: Add novector pragma.
> 	* gcc.dg/vect/fast-math-vect-call-1.c: Add novector pragma.
> 	* gcc.dg/vect/fast-math-vect-call-2.c: Add novector pragma.
> 	* gcc.dg/vect/fast-math-vect-complex-3.c: Add novector pragma.
> 	* gcc.dg/vect/if-cvt-stores-vect-ifcvt-18.c: Add novector pragma.
> 	* gcc.dg/vect/no-scevccp-noreassoc-outer-1.c: Add novector pragma.
> 	* gcc.dg/vect/no-scevccp-noreassoc-outer-2.c: Add novector pragma.
> 	* gcc.dg/vect/no-scevccp-noreassoc-outer-3.c: Add novector pragma.
> 	* gcc.dg/vect/no-scevccp-noreassoc-outer-5.c: Add novector pragma.
> 	* gcc.dg/vect/no-scevccp-outer-10.c: Add novector pragma.
> 	* gcc.dg/vect/no-scevccp-outer-10a.c: Add novector pragma.
> 	* gcc.dg/vect/no-scevccp-outer-10b.c: Add novector pragma.
> 	* gcc.dg/vect/no-scevccp-outer-11.c: Add novector pragma.
> 	* gcc.dg/vect/no-scevccp-outer-12.c: Add novector pragma.
> 	* gcc.dg/vect/no-scevccp-outer-15.c: Add novector pragma.
> 	* gcc.dg/vect/no-scevccp-outer-16.c: Add novector pragma.
> 	* gcc.dg/vect/no-scevccp-outer-17.c: Add novector pragma.
> 	* gcc.dg/vect/no-scevccp-outer-18.c: Add novector pragma.
> 	* gcc.dg/vect/no-scevccp-outer-19.c: Add novector pragma.
> 	* gcc.dg/vect/no-scevccp-outer-20.c: Add novector pragma.
> 	* gcc.dg/vect/no-scevccp-outer-21.c: Add novector pragma.
> 	* gcc.dg/vect/no-scevccp-outer-22.c: Add novector pragma.
> 	* gcc.dg/vect/no-scevccp-outer-3.c: Add novector pragma.
> 	* gcc.dg/vect/no-scevccp-outer-4.c: Add novector pragma.
> 	* gcc.dg/vect/no-scevccp-outer-5.c: Add novector pragma.
> 	* gcc.dg/vect/no-scevccp-outer-6-global.c: Add novector pragma.
> 	* gcc.dg/vect/no-scevccp-outer-6.c: Add novector pragma.
> 	* gcc.dg/vect/no-scevccp-outer-7.c: Add novector pragma.
> 	* gcc.dg/vect/no-scevccp-outer-8.c: Add novector pragma.
> 	* gcc.dg/vect/no-scevccp-outer-9.c: Add novector pragma.
> 	* gcc.dg/vect/no-scevccp-outer-9a.c: Add novector pragma.
> 	* gcc.dg/vect/no-scevccp-outer-9b.c: Add novector pragma.
> 	* gcc.dg/vect/no-scevccp-slp-30.c: Add novector pragma.
> 	* gcc.dg/vect/no-scevccp-slp-31.c: Add novector pragma.
> 	* gcc.dg/vect/no-scevccp-vect-iv-2.c: Add novector pragma.
> 	* gcc.dg/vect/no-section-anchors-vect-31.c: Add novector pragma.
> 	* gcc.dg/vect/no-section-anchors-vect-34.c: Add novector pragma.
> 	* gcc.dg/vect/no-section-anchors-vect-36.c: Add novector pragma.
> 	* gcc.dg/vect/no-section-anchors-vect-64.c: Add novector pragma.
> 	* gcc.dg/vect/no-section-anchors-vect-65.c: Add novector pragma.
> 	* gcc.dg/vect/no-section-anchors-vect-66.c: Add novector pragma.
> 	* gcc.dg/vect/no-section-anchors-vect-68.c: Add novector pragma.
> 	* gcc.dg/vect/no-section-anchors-vect-69.c: Add novector pragma.
> 	* gcc.dg/vect/no-section-anchors-vect-outer-4h.c: Add novector
> pragma.
> 	* gcc.dg/vect/no-trapping-math-2.c: Add novector pragma.
> 	* gcc.dg/vect/no-trapping-math-vect-111.c: Add novector pragma.
> 	* gcc.dg/vect/no-trapping-math-vect-ifcvt-11.c: Add novector
> pragma.
> 	* gcc.dg/vect/no-trapping-math-vect-ifcvt-12.c: Add novector
> pragma.
> 	* gcc.dg/vect/no-trapping-math-vect-ifcvt-13.c: Add novector
> pragma.
> 	* gcc.dg/vect/no-trapping-math-vect-ifcvt-14.c: Add novector
> pragma.
> 	* gcc.dg/vect/no-trapping-math-vect-ifcvt-15.c: Add novector
> pragma.
> 	* gcc.dg/vect/no-tree-dom-vect-bug.c: Add novector pragma.
> 	* gcc.dg/vect/no-tree-pre-slp-29.c: Add novector pragma.
> 	* gcc.dg/vect/no-vfa-pr29145.c: Add novector pragma.
> 	* gcc.dg/vect/no-vfa-vect-101.c: Add novector pragma.
> 	* gcc.dg/vect/no-vfa-vect-102.c: Add novector pragma.
> 	* gcc.dg/vect/no-vfa-vect-102a.c: Add novector pragma.
> 	* gcc.dg/vect/no-vfa-vect-37.c: Add novector pragma.
> 	* gcc.dg/vect/no-vfa-vect-43.c: Add novector pragma.
> 	* gcc.dg/vect/no-vfa-vect-45.c: Add novector pragma.
> 	* gcc.dg/vect/no-vfa-vect-49.c: Add novector pragma.
> 	* gcc.dg/vect/no-vfa-vect-51.c: Add novector pragma.
> 	* gcc.dg/vect/no-vfa-vect-53.c: Add novector pragma.
> 	* gcc.dg/vect/no-vfa-vect-57.c: Add novector pragma.
> 	* gcc.dg/vect/no-vfa-vect-61.c: Add novector pragma.
> 	* gcc.dg/vect/no-vfa-vect-79.c: Add novector pragma.
> 	* gcc.dg/vect/no-vfa-vect-depend-1.c: Add novector pragma.
> 	* gcc.dg/vect/no-vfa-vect-depend-2.c: Add novector pragma.
> 	* gcc.dg/vect/no-vfa-vect-depend-3.c: Add novector pragma.
> 	* gcc.dg/vect/no-vfa-vect-dv-2.c: Add novector pragma.
> 	* gcc.dg/vect/pr101445.c: Add novector pragma.
> 	* gcc.dg/vect/pr103581.c: Add novector pragma.
> 	* gcc.dg/vect/pr105219.c: Add novector pragma.
> 	* gcc.dg/vect/pr108608.c: Add novector pragma.
> 	* gcc.dg/vect/pr18400.c: Add novector pragma.
> 	* gcc.dg/vect/pr18536.c: Add novector pragma.
> 	* gcc.dg/vect/pr20122.c: Add novector pragma.
> 	* gcc.dg/vect/pr25413.c: Add novector pragma.
> 	* gcc.dg/vect/pr30784.c: Add novector pragma.
> 	* gcc.dg/vect/pr37539.c: Add novector pragma.
> 	* gcc.dg/vect/pr40074.c: Add novector pragma.
> 	* gcc.dg/vect/pr45752.c: Add novector pragma.
> 	* gcc.dg/vect/pr45902.c: Add novector pragma.
> 	* gcc.dg/vect/pr46009.c: Add novector pragma.
> 	* gcc.dg/vect/pr48172.c: Add novector pragma.
> 	* gcc.dg/vect/pr51074.c: Add novector pragma.
> 	* gcc.dg/vect/pr51581-3.c: Add novector pragma.
> 	* gcc.dg/vect/pr51581-4.c: Add novector pragma.
> 	* gcc.dg/vect/pr53185-2.c: Add novector pragma.
> 	* gcc.dg/vect/pr56826.c: Add novector pragma.
> 	* gcc.dg/vect/pr56918.c: Add novector pragma.
> 	* gcc.dg/vect/pr56920.c: Add novector pragma.
> 	* gcc.dg/vect/pr56933.c: Add novector pragma.
> 	* gcc.dg/vect/pr57705.c: Add novector pragma.
> 	* gcc.dg/vect/pr57741-2.c: Add novector pragma.
> 	* gcc.dg/vect/pr57741-3.c: Add novector pragma.
> 	* gcc.dg/vect/pr59591-1.c: Add novector pragma.
> 	* gcc.dg/vect/pr59591-2.c: Add novector pragma.
> 	* gcc.dg/vect/pr59594.c: Add novector pragma.
> 	* gcc.dg/vect/pr59984.c: Add novector pragma.
> 	* gcc.dg/vect/pr60276.c: Add novector pragma.
> 	* gcc.dg/vect/pr61194.c: Add novector pragma.
> 	* gcc.dg/vect/pr61680.c: Add novector pragma.
> 	* gcc.dg/vect/pr62021.c: Add novector pragma.
> 	* gcc.dg/vect/pr63341-2.c: Add novector pragma.
> 	* gcc.dg/vect/pr64252.c: Add novector pragma.
> 	* gcc.dg/vect/pr64404.c: Add novector pragma.
> 	* gcc.dg/vect/pr64421.c: Add novector pragma.
> 	* gcc.dg/vect/pr64493.c: Add novector pragma.
> 	* gcc.dg/vect/pr64495.c: Add novector pragma.
> 	* gcc.dg/vect/pr66251.c: Add novector pragma.
> 	* gcc.dg/vect/pr66253.c: Add novector pragma.
> 	* gcc.dg/vect/pr68502-1.c: Add novector pragma.
> 	* gcc.dg/vect/pr68502-2.c: Add novector pragma.
> 	* gcc.dg/vect/pr69820.c: Add novector pragma.
> 	* gcc.dg/vect/pr70021.c: Add novector pragma.
> 	* gcc.dg/vect/pr70354-1.c: Add novector pragma.
> 	* gcc.dg/vect/pr70354-2.c: Add novector pragma.
> 	* gcc.dg/vect/pr71259.c: Add novector pragma.
> 	* gcc.dg/vect/pr78005.c: Add novector pragma.
> 	* gcc.dg/vect/pr78558.c: Add novector pragma.
> 	* gcc.dg/vect/pr80815-2.c: Add novector pragma.
> 	* gcc.dg/vect/pr80815-3.c: Add novector pragma.
> 	* gcc.dg/vect/pr80928.c: Add novector pragma.
> 	* gcc.dg/vect/pr81410.c: Add novector pragma.
> 	* gcc.dg/vect/pr81633.c: Add novector pragma.
> 	* gcc.dg/vect/pr81740-1.c: Add novector pragma.
> 	* gcc.dg/vect/pr81740-2.c: Add novector pragma.
> 	* gcc.dg/vect/pr85586.c: Add novector pragma.
> 	* gcc.dg/vect/pr87288-1.c: Add novector pragma.
> 	* gcc.dg/vect/pr87288-2.c: Add novector pragma.
> 	* gcc.dg/vect/pr87288-3.c: Add novector pragma.
> 	* gcc.dg/vect/pr88903-1.c: Add novector pragma.
> 	* gcc.dg/vect/pr88903-2.c: Add novector pragma.
> 	* gcc.dg/vect/pr90018.c: Add novector pragma.
> 	* gcc.dg/vect/pr92420.c: Add novector pragma.
> 	* gcc.dg/vect/pr94994.c: Add novector pragma.
> 	* gcc.dg/vect/pr96783-1.c: Add novector pragma.
> 	* gcc.dg/vect/pr96783-2.c: Add novector pragma.
> 	* gcc.dg/vect/pr97081-2.c: Add novector pragma.
> 	* gcc.dg/vect/pr97558-2.c: Add novector pragma.
> 	* gcc.dg/vect/pr97678.c: Add novector pragma.
> 	* gcc.dg/vect/section-anchors-pr27770.c: Add novector pragma.
> 	* gcc.dg/vect/section-anchors-vect-69.c: Add novector pragma.
> 	* gcc.dg/vect/slp-1.c: Add novector pragma.
> 	* gcc.dg/vect/slp-10.c: Add novector pragma.
> 	* gcc.dg/vect/slp-11a.c: Add novector pragma.
> 	* gcc.dg/vect/slp-11b.c: Add novector pragma.
> 	* gcc.dg/vect/slp-11c.c: Add novector pragma.
> 	* gcc.dg/vect/slp-12a.c: Add novector pragma.
> 	* gcc.dg/vect/slp-12b.c: Add novector pragma.
> 	* gcc.dg/vect/slp-12c.c: Add novector pragma.
> 	* gcc.dg/vect/slp-13-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/slp-13.c: Add novector pragma.
> 	* gcc.dg/vect/slp-14.c: Add novector pragma.
> 	* gcc.dg/vect/slp-15.c: Add novector pragma.
> 	* gcc.dg/vect/slp-16.c: Add novector pragma.
> 	* gcc.dg/vect/slp-17.c: Add novector pragma.
> 	* gcc.dg/vect/slp-18.c: Add novector pragma.
> 	* gcc.dg/vect/slp-19a.c: Add novector pragma.
> 	* gcc.dg/vect/slp-19b.c: Add novector pragma.
> 	* gcc.dg/vect/slp-19c.c: Add novector pragma.
> 	* gcc.dg/vect/slp-2.c: Add novector pragma.
> 	* gcc.dg/vect/slp-20.c: Add novector pragma.
> 	* gcc.dg/vect/slp-21.c: Add novector pragma.
> 	* gcc.dg/vect/slp-22.c: Add novector pragma.
> 	* gcc.dg/vect/slp-23.c: Add novector pragma.
> 	* gcc.dg/vect/slp-24-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/slp-24.c: Add novector pragma.
> 	* gcc.dg/vect/slp-25.c: Add novector pragma.
> 	* gcc.dg/vect/slp-26.c: Add novector pragma.
> 	* gcc.dg/vect/slp-28.c: Add novector pragma.
> 	* gcc.dg/vect/slp-3-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/slp-3.c: Add novector pragma.
> 	* gcc.dg/vect/slp-33.c: Add novector pragma.
> 	* gcc.dg/vect/slp-34-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/slp-34.c: Add novector pragma.
> 	* gcc.dg/vect/slp-35.c: Add novector pragma.
> 	* gcc.dg/vect/slp-37.c: Add novector pragma.
> 	* gcc.dg/vect/slp-4-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/slp-4.c: Add novector pragma.
> 	* gcc.dg/vect/slp-41.c: Add novector pragma.
> 	* gcc.dg/vect/slp-43.c: Add novector pragma.
> 	* gcc.dg/vect/slp-45.c: Add novector pragma.
> 	* gcc.dg/vect/slp-46.c: Add novector pragma.
> 	* gcc.dg/vect/slp-47.c: Add novector pragma.
> 	* gcc.dg/vect/slp-48.c: Add novector pragma.
> 	* gcc.dg/vect/slp-49.c: Add novector pragma.
> 	* gcc.dg/vect/slp-5.c: Add novector pragma.
> 	* gcc.dg/vect/slp-6.c: Add novector pragma.
> 	* gcc.dg/vect/slp-7.c: Add novector pragma.
> 	* gcc.dg/vect/slp-8.c: Add novector pragma.
> 	* gcc.dg/vect/slp-9.c: Add novector pragma.
> 	* gcc.dg/vect/slp-cond-1.c: Add novector pragma.
> 	* gcc.dg/vect/slp-cond-2-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/slp-cond-2.c: Add novector pragma.
> 	* gcc.dg/vect/slp-cond-3.c: Add novector pragma.
> 	* gcc.dg/vect/slp-cond-4.c: Add novector pragma.
> 	* gcc.dg/vect/slp-cond-5.c: Add novector pragma.
> 	* gcc.dg/vect/slp-multitypes-1.c: Add novector pragma.
> 	* gcc.dg/vect/slp-multitypes-10.c: Add novector pragma.
> 	* gcc.dg/vect/slp-multitypes-11-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/slp-multitypes-11.c: Add novector pragma.
> 	* gcc.dg/vect/slp-multitypes-12.c: Add novector pragma.
> 	* gcc.dg/vect/slp-multitypes-2.c: Add novector pragma.
> 	* gcc.dg/vect/slp-multitypes-3.c: Add novector pragma.
> 	* gcc.dg/vect/slp-multitypes-4.c: Add novector pragma.
> 	* gcc.dg/vect/slp-multitypes-5.c: Add novector pragma.
> 	* gcc.dg/vect/slp-multitypes-6.c: Add novector pragma.
> 	* gcc.dg/vect/slp-multitypes-7.c: Add novector pragma.
> 	* gcc.dg/vect/slp-multitypes-8.c: Add novector pragma.
> 	* gcc.dg/vect/slp-multitypes-9.c: Add novector pragma.
> 	* gcc.dg/vect/slp-perm-1.c: Add novector pragma.
> 	* gcc.dg/vect/slp-perm-10.c: Add novector pragma.
> 	* gcc.dg/vect/slp-perm-11.c: Add novector pragma.
> 	* gcc.dg/vect/slp-perm-12.c: Add novector pragma.
> 	* gcc.dg/vect/slp-perm-2.c: Add novector pragma.
> 	* gcc.dg/vect/slp-perm-3.c: Add novector pragma.
> 	* gcc.dg/vect/slp-perm-4.c: Add novector pragma.
> 	* gcc.dg/vect/slp-perm-5.c: Add novector pragma.
> 	* gcc.dg/vect/slp-perm-6.c: Add novector pragma.
> 	* gcc.dg/vect/slp-perm-7.c: Add novector pragma.
> 	* gcc.dg/vect/slp-perm-8.c: Add novector pragma.
> 	* gcc.dg/vect/slp-perm-9.c: Add novector pragma.
> 	* gcc.dg/vect/slp-widen-mult-half.c: Add novector pragma.
> 	* gcc.dg/vect/slp-widen-mult-s16.c: Add novector pragma.
> 	* gcc.dg/vect/slp-widen-mult-u8.c: Add novector pragma.
> 	* gcc.dg/vect/vect-100.c: Add novector pragma.
> 	* gcc.dg/vect/vect-103.c: Add novector pragma.
> 	* gcc.dg/vect/vect-104.c: Add novector pragma.
> 	* gcc.dg/vect/vect-105-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/vect-105.c: Add novector pragma.
> 	* gcc.dg/vect/vect-106.c: Add novector pragma.
> 	* gcc.dg/vect/vect-107.c: Add novector pragma.
> 	* gcc.dg/vect/vect-108.c: Add novector pragma.
> 	* gcc.dg/vect/vect-109.c: Add novector pragma.
> 	* gcc.dg/vect/vect-11.c: Add novector pragma.
> 	* gcc.dg/vect/vect-110.c: Add novector pragma.
> 	* gcc.dg/vect/vect-113.c: Add novector pragma.
> 	* gcc.dg/vect/vect-114.c: Add novector pragma.
> 	* gcc.dg/vect/vect-115.c: Add novector pragma.
> 	* gcc.dg/vect/vect-116.c: Add novector pragma.
> 	* gcc.dg/vect/vect-117.c: Add novector pragma.
> 	* gcc.dg/vect/vect-11a.c: Add novector pragma.
> 	* gcc.dg/vect/vect-12.c: Add novector pragma.
> 	* gcc.dg/vect/vect-122.c: Add novector pragma.
> 	* gcc.dg/vect/vect-124.c: Add novector pragma.
> 	* gcc.dg/vect/vect-13.c: Add novector pragma.
> 	* gcc.dg/vect/vect-14.c: Add novector pragma.
> 	* gcc.dg/vect/vect-15-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/vect-15.c: Add novector pragma.
> 	* gcc.dg/vect/vect-17.c: Add novector pragma.
> 	* gcc.dg/vect/vect-18.c: Add novector pragma.
> 	* gcc.dg/vect/vect-19.c: Add novector pragma.
> 	* gcc.dg/vect/vect-2-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/vect-2.c: Add novector pragma.
> 	* gcc.dg/vect/vect-20.c: Add novector pragma.
> 	* gcc.dg/vect/vect-21.c: Add novector pragma.
> 	* gcc.dg/vect/vect-22.c: Add novector pragma.
> 	* gcc.dg/vect/vect-23.c: Add novector pragma.
> 	* gcc.dg/vect/vect-24.c: Add novector pragma.
> 	* gcc.dg/vect/vect-25.c: Add novector pragma.
> 	* gcc.dg/vect/vect-26.c: Add novector pragma.
> 	* gcc.dg/vect/vect-27.c: Add novector pragma.
> 	* gcc.dg/vect/vect-28.c: Add novector pragma.
> 	* gcc.dg/vect/vect-29.c: Add novector pragma.
> 	* gcc.dg/vect/vect-3.c: Add novector pragma.
> 	* gcc.dg/vect/vect-30.c: Add novector pragma.
> 	* gcc.dg/vect/vect-31-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/vect-31.c: Add novector pragma.
> 	* gcc.dg/vect/vect-32-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/vect-32.c: Add novector pragma.
> 	* gcc.dg/vect/vect-33-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/vect-33.c: Add novector pragma.
> 	* gcc.dg/vect/vect-34-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/vect-34.c: Add novector pragma.
> 	* gcc.dg/vect/vect-35-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/vect-35.c: Add novector pragma.
> 	* gcc.dg/vect/vect-36-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/vect-36.c: Add novector pragma.
> 	* gcc.dg/vect/vect-38.c: Add novector pragma.
> 	* gcc.dg/vect/vect-4.c: Add novector pragma.
> 	* gcc.dg/vect/vect-40.c: Add novector pragma.
> 	* gcc.dg/vect/vect-42.c: Add novector pragma.
> 	* gcc.dg/vect/vect-44.c: Add novector pragma.
> 	* gcc.dg/vect/vect-46.c: Add novector pragma.
> 	* gcc.dg/vect/vect-48.c: Add novector pragma.
> 	* gcc.dg/vect/vect-5.c: Add novector pragma.
> 	* gcc.dg/vect/vect-50.c: Add novector pragma.
> 	* gcc.dg/vect/vect-52.c: Add novector pragma.
> 	* gcc.dg/vect/vect-54.c: Add novector pragma.
> 	* gcc.dg/vect/vect-56.c: Add novector pragma.
> 	* gcc.dg/vect/vect-58.c: Add novector pragma.
> 	* gcc.dg/vect/vect-6-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/vect-6.c: Add novector pragma.
> 	* gcc.dg/vect/vect-60.c: Add novector pragma.
> 	* gcc.dg/vect/vect-62.c: Add novector pragma.
> 	* gcc.dg/vect/vect-63.c: Add novector pragma.
> 	* gcc.dg/vect/vect-64.c: Add novector pragma.
> 	* gcc.dg/vect/vect-65.c: Add novector pragma.
> 	* gcc.dg/vect/vect-66.c: Add novector pragma.
> 	* gcc.dg/vect/vect-67.c: Add novector pragma.
> 	* gcc.dg/vect/vect-68.c: Add novector pragma.
> 	* gcc.dg/vect/vect-7.c: Add novector pragma.
> 	* gcc.dg/vect/vect-70.c: Add novector pragma.
> 	* gcc.dg/vect/vect-71.c: Add novector pragma.
> 	* gcc.dg/vect/vect-72.c: Add novector pragma.
> 	* gcc.dg/vect/vect-73-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/vect-73.c: Add novector pragma.
> 	* gcc.dg/vect/vect-74-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/vect-74.c: Add novector pragma.
> 	* gcc.dg/vect/vect-75-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/vect-75.c: Add novector pragma.
> 	* gcc.dg/vect/vect-76-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/vect-76.c: Add novector pragma.
> 	* gcc.dg/vect/vect-77-alignchecks.c: Add novector pragma.
> 	* gcc.dg/vect/vect-77-global.c: Add novector pragma.
> 	* gcc.dg/vect/vect-77.c: Add novector pragma.
> 	* gcc.dg/vect/vect-78-alignchecks.c: Add novector pragma.
> 	* gcc.dg/vect/vect-78-global.c: Add novector pragma.
> 	* gcc.dg/vect/vect-78.c: Add novector pragma.
> 	* gcc.dg/vect/vect-8.c: Add novector pragma.
> 	* gcc.dg/vect/vect-80-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/vect-80.c: Add novector pragma.
> 	* gcc.dg/vect/vect-82.c: Add novector pragma.
> 	* gcc.dg/vect/vect-82_64.c: Add novector pragma.
> 	* gcc.dg/vect/vect-83.c: Add novector pragma.
> 	* gcc.dg/vect/vect-83_64.c: Add novector pragma.
> 	* gcc.dg/vect/vect-85-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/vect-85.c: Add novector pragma.
> 	* gcc.dg/vect/vect-86.c: Add novector pragma.
> 	* gcc.dg/vect/vect-87.c: Add novector pragma.
> 	* gcc.dg/vect/vect-88.c: Add novector pragma.
> 	* gcc.dg/vect/vect-89-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/vect-89.c: Add novector pragma.
> 	* gcc.dg/vect/vect-9.c: Add novector pragma.
> 	* gcc.dg/vect/vect-92.c: Add novector pragma.
> 	* gcc.dg/vect/vect-93.c: Add novector pragma.
> 	* gcc.dg/vect/vect-95.c: Add novector pragma.
> 	* gcc.dg/vect/vect-96.c: Add novector pragma.
> 	* gcc.dg/vect/vect-97-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/vect-97.c: Add novector pragma.
> 	* gcc.dg/vect/vect-98-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/vect-98.c: Add novector pragma.
> 	* gcc.dg/vect/vect-99.c: Add novector pragma.
> 	* gcc.dg/vect/vect-alias-check-10.c: Add novector pragma.
> 	* gcc.dg/vect/vect-alias-check-11.c: Add novector pragma.
> 	* gcc.dg/vect/vect-alias-check-12.c: Add novector pragma.
> 	* gcc.dg/vect/vect-alias-check-14.c: Add novector pragma.
> 	* gcc.dg/vect/vect-alias-check-15.c: Add novector pragma.
> 	* gcc.dg/vect/vect-alias-check-16.c: Add novector pragma.
> 	* gcc.dg/vect/vect-alias-check-18.c: Add novector pragma.
> 	* gcc.dg/vect/vect-alias-check-19.c: Add novector pragma.
> 	* gcc.dg/vect/vect-alias-check-20.c: Add novector pragma.
> 	* gcc.dg/vect/vect-alias-check-8.c: Add novector pragma.
> 	* gcc.dg/vect/vect-alias-check-9.c: Add novector pragma.
> 	* gcc.dg/vect/vect-align-1.c: Add novector pragma.
> 	* gcc.dg/vect/vect-align-2.c: Add novector pragma.
> 	* gcc.dg/vect/vect-all-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/vect-all.c: Add novector pragma.
> 	* gcc.dg/vect/vect-avg-1.c: Add novector pragma.
> 	* gcc.dg/vect/vect-avg-11.c: Add novector pragma.
> 	* gcc.dg/vect/vect-avg-15.c: Add novector pragma.
> 	* gcc.dg/vect/vect-avg-16.c: Add novector pragma.
> 	* gcc.dg/vect/vect-avg-5.c: Add novector pragma.
> 	* gcc.dg/vect/vect-bitfield-write-1.c: Add novector pragma.
> 	* gcc.dg/vect/vect-bitfield-write-2.c: Add novector pragma.
> 	* gcc.dg/vect/vect-bitfield-write-3.c: Add novector pragma.
> 	* gcc.dg/vect/vect-bitfield-write-4.c: Add novector pragma.
> 	* gcc.dg/vect/vect-bitfield-write-5.c: Add novector pragma.
> 	* gcc.dg/vect/vect-bool-cmp.c: Add novector pragma.
> 	* gcc.dg/vect/vect-bswap16.c: Add novector pragma.
> 	* gcc.dg/vect/vect-bswap32.c: Add novector pragma.
> 	* gcc.dg/vect/vect-bswap64.c: Add novector pragma.
> 	* gcc.dg/vect/vect-complex-1.c: Add novector pragma.
> 	* gcc.dg/vect/vect-complex-2.c: Add novector pragma.
> 	* gcc.dg/vect/vect-complex-4.c: Add novector pragma.
> 	* gcc.dg/vect/vect-cond-1.c: Add novector pragma.
> 	* gcc.dg/vect/vect-cond-10.c: Add novector pragma.
> 	* gcc.dg/vect/vect-cond-11.c: Add novector pragma.
> 	* gcc.dg/vect/vect-cond-3.c: Add novector pragma.
> 	* gcc.dg/vect/vect-cond-4.c: Add novector pragma.
> 	* gcc.dg/vect/vect-cond-5.c: Add novector pragma.
> 	* gcc.dg/vect/vect-cond-6.c: Add novector pragma.
> 	* gcc.dg/vect/vect-cond-7.c: Add novector pragma.
> 	* gcc.dg/vect/vect-cond-8.c: Add novector pragma.
> 	* gcc.dg/vect/vect-cond-9.c: Add novector pragma.
> 	* gcc.dg/vect/vect-cond-arith-1.c: Add novector pragma.
> 	* gcc.dg/vect/vect-cond-arith-3.c: Add novector pragma.
> 	* gcc.dg/vect/vect-cond-arith-4.c: Add novector pragma.
> 	* gcc.dg/vect/vect-cond-arith-5.c: Add novector pragma.
> 	* gcc.dg/vect/vect-cond-arith-6.c: Add novector pragma.
> 	* gcc.dg/vect/vect-cond-arith-7.c: Add novector pragma.
> 	* gcc.dg/vect/vect-cselim-1.c: Add novector pragma.
> 	* gcc.dg/vect/vect-cselim-2.c: Add novector pragma.
> 	* gcc.dg/vect/vect-div-bitmask-4.c: Add novector pragma.
> 	* gcc.dg/vect/vect-div-bitmask-5.c: Add novector pragma.
> 	* gcc.dg/vect/vect-div-bitmask.h: Add novector pragma.
> 	* gcc.dg/vect/vect-double-reduc-1.c: Add novector pragma.
> 	* gcc.dg/vect/vect-double-reduc-2.c: Add novector pragma.
> 	* gcc.dg/vect/vect-double-reduc-3.c: Add novector pragma.
> 	* gcc.dg/vect/vect-double-reduc-4.c: Add novector pragma.
> 	* gcc.dg/vect/vect-double-reduc-5.c: Add novector pragma.
> 	* gcc.dg/vect/vect-double-reduc-6-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/vect-double-reduc-6.c: Add novector pragma.
> 	* gcc.dg/vect/vect-double-reduc-7.c: Add novector pragma.
> 	* gcc.dg/vect/vect-float-extend-1.c: Add novector pragma.
> 	* gcc.dg/vect/vect-float-truncate-1.c: Add novector pragma.
> 	* gcc.dg/vect/vect-floatint-conversion-1.c: Add novector pragma.
> 	* gcc.dg/vect/vect-floatint-conversion-2.c: Add novector pragma.
> 	* gcc.dg/vect/vect-fma-1.c: Add novector pragma.
> 	* gcc.dg/vect/vect-gather-1.c: Add novector pragma.
> 	* gcc.dg/vect/vect-gather-3.c: Add novector pragma.
> 	* gcc.dg/vect/vect-ifcvt-11.c: Add novector pragma.
> 	* gcc.dg/vect/vect-ifcvt-16.c: Add novector pragma.
> 	* gcc.dg/vect/vect-ifcvt-17.c: Add novector pragma.
> 	* gcc.dg/vect/vect-ifcvt-2.c: Add novector pragma.
> 	* gcc.dg/vect/vect-ifcvt-3.c: Add novector pragma.
> 	* gcc.dg/vect/vect-ifcvt-4.c: Add novector pragma.
> 	* gcc.dg/vect/vect-ifcvt-5.c: Add novector pragma.
> 	* gcc.dg/vect/vect-ifcvt-6.c: Add novector pragma.
> 	* gcc.dg/vect/vect-ifcvt-7.c: Add novector pragma.
> 	* gcc.dg/vect/vect-ifcvt-9.c: Add novector pragma.
> 	* gcc.dg/vect/vect-intfloat-conversion-1.c: Add novector pragma.
> 	* gcc.dg/vect/vect-intfloat-conversion-2.c: Add novector pragma.
> 	* gcc.dg/vect/vect-intfloat-conversion-3.c: Add novector pragma.
> 	* gcc.dg/vect/vect-intfloat-conversion-4a.c: Add novector pragma.
> 	* gcc.dg/vect/vect-intfloat-conversion-4b.c: Add novector pragma.
> 	* gcc.dg/vect/vect-iv-1.c: Add novector pragma.
> 	* gcc.dg/vect/vect-iv-10.c: Add novector pragma.
> 	* gcc.dg/vect/vect-iv-2.c: Add novector pragma.
> 	* gcc.dg/vect/vect-iv-3.c: Add novector pragma.
> 	* gcc.dg/vect/vect-iv-4.c: Add novector pragma.
> 	* gcc.dg/vect/vect-iv-5.c: Add novector pragma.
> 	* gcc.dg/vect/vect-iv-6.c: Add novector pragma.
> 	* gcc.dg/vect/vect-iv-7.c: Add novector pragma.
> 	* gcc.dg/vect/vect-iv-8-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/vect-iv-8.c: Add novector pragma.
> 	* gcc.dg/vect/vect-iv-8a-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/vect-iv-8a.c: Add novector pragma.
> 	* gcc.dg/vect/vect-live-1.c: Add novector pragma.
> 	* gcc.dg/vect/vect-live-2.c: Add novector pragma.
> 	* gcc.dg/vect/vect-live-3.c: Add novector pragma.
> 	* gcc.dg/vect/vect-live-4.c: Add novector pragma.
> 	* gcc.dg/vect/vect-live-5.c: Add novector pragma.
> 	* gcc.dg/vect/vect-live-slp-1.c: Add novector pragma.
> 	* gcc.dg/vect/vect-live-slp-2.c: Add novector pragma.
> 	* gcc.dg/vect/vect-live-slp-3.c: Add novector pragma.
> 	* gcc.dg/vect/vect-mask-load-1.c: Add novector pragma.
> 	* gcc.dg/vect/vect-mask-loadstore-1.c: Add novector pragma.
> 	* gcc.dg/vect/vect-mulhrs-1.c: Add novector pragma.
> 	* gcc.dg/vect/vect-mult-const-pattern-1.c: Add novector pragma.
> 	* gcc.dg/vect/vect-mult-const-pattern-2.c: Add novector pragma.
> 	* gcc.dg/vect/vect-multitypes-1.c: Add novector pragma.
> 	* gcc.dg/vect/vect-multitypes-10.c: Add novector pragma.
> 	* gcc.dg/vect/vect-multitypes-11.c: Add novector pragma.
> 	* gcc.dg/vect/vect-multitypes-12.c: Add novector pragma.
> 	* gcc.dg/vect/vect-multitypes-13.c: Add novector pragma.
> 	* gcc.dg/vect/vect-multitypes-14.c: Add novector pragma.
> 	* gcc.dg/vect/vect-multitypes-16.c: Add novector pragma.
> 	* gcc.dg/vect/vect-multitypes-17.c: Add novector pragma.
> 	* gcc.dg/vect/vect-multitypes-2.c: Add novector pragma.
> 	* gcc.dg/vect/vect-multitypes-3.c: Add novector pragma.
> 	* gcc.dg/vect/vect-multitypes-4.c: Add novector pragma.
> 	* gcc.dg/vect/vect-multitypes-5.c: Add novector pragma.
> 	* gcc.dg/vect/vect-multitypes-6.c: Add novector pragma.
> 	* gcc.dg/vect/vect-multitypes-8.c: Add novector pragma.
> 	* gcc.dg/vect/vect-multitypes-9.c: Add novector pragma.
> 	* gcc.dg/vect/vect-nb-iter-ub-1.c: Add novector pragma.
> 	* gcc.dg/vect/vect-nb-iter-ub-2.c: Add novector pragma.
> 	* gcc.dg/vect/vect-nb-iter-ub-3.c: Add novector pragma.
> 	* gcc.dg/vect/vect-neg-store-1.c: Add novector pragma.
> 	* gcc.dg/vect/vect-neg-store-2.c: Add novector pragma.
> 	* gcc.dg/vect/vect-nest-cycle-1.c: Add novector pragma.
> 	* gcc.dg/vect/vect-nest-cycle-2.c: Add novector pragma.
> 	* gcc.dg/vect/vect-nest-cycle-3.c: Add novector pragma.
> 	* gcc.dg/vect/vect-outer-2-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/vect-outer-2.c: Add novector pragma.
> 	* gcc.dg/vect/vect-outer-2a-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/vect-outer-2a.c: Add novector pragma.
> 	* gcc.dg/vect/vect-outer-2b.c: Add novector pragma.
> 	* gcc.dg/vect/vect-outer-2c-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/vect-outer-2c.c: Add novector pragma.
> 	* gcc.dg/vect/vect-outer-2d.c: Add novector pragma.
> 	* gcc.dg/vect/vect-outer-3-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/vect-outer-3.c: Add novector pragma.
> 	* gcc.dg/vect/vect-outer-3a-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/vect-outer-3a.c: Add novector pragma.
> 	* gcc.dg/vect/vect-outer-3b.c: Add novector pragma.
> 	* gcc.dg/vect/vect-outer-3c.c: Add novector pragma.
> 	* gcc.dg/vect/vect-outer-4.c: Add novector pragma.
> 	* gcc.dg/vect/vect-outer-4d-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/vect-outer-4d.c: Add novector pragma.
> 	* gcc.dg/vect/vect-outer-5.c: Add novector pragma.
> 	* gcc.dg/vect/vect-outer-6.c: Add novector pragma.
> 	* gcc.dg/vect/vect-outer-fir-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/vect-outer-fir-lb-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/vect-outer-fir-lb.c: Add novector pragma.
> 	* gcc.dg/vect/vect-outer-fir.c: Add novector pragma.
> 	* gcc.dg/vect/vect-outer-simd-1.c: Add novector pragma.
> 	* gcc.dg/vect/vect-outer-simd-2.c: Add novector pragma.
> 	* gcc.dg/vect/vect-outer-simd-3.c: Add novector pragma.
> 	* gcc.dg/vect/vect-outer-slp-2.c: Add novector pragma.
> 	* gcc.dg/vect/vect-outer-slp-3.c: Add novector pragma.
> 	* gcc.dg/vect/vect-over-widen-1-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/vect-over-widen-1.c: Add novector pragma.
> 	* gcc.dg/vect/vect-over-widen-11.c: Add novector pragma.
> 	* gcc.dg/vect/vect-over-widen-13.c: Add novector pragma.
> 	* gcc.dg/vect/vect-over-widen-15.c: Add novector pragma.
> 	* gcc.dg/vect/vect-over-widen-17.c: Add novector pragma.
> 	* gcc.dg/vect/vect-over-widen-18.c: Add novector pragma.
> 	* gcc.dg/vect/vect-over-widen-19.c: Add novector pragma.
> 	* gcc.dg/vect/vect-over-widen-2-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/vect-over-widen-2.c: Add novector pragma.
> 	* gcc.dg/vect/vect-over-widen-20.c: Add novector pragma.
> 	* gcc.dg/vect/vect-over-widen-21.c: Add novector pragma.
> 	* gcc.dg/vect/vect-over-widen-22.c: Add novector pragma.
> 	* gcc.dg/vect/vect-over-widen-3-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/vect-over-widen-3.c: Add novector pragma.
> 	* gcc.dg/vect/vect-over-widen-4-big-array.c: Add novector pragma.
> 	* gcc.dg/vect/vect-over-widen-4.c: Add novector pragma.
> 	* gcc.dg/vect/vect-over-widen-5.c: Add novector pragma.
> 	* gcc.dg/vect/vect-over-widen-7.c: Add novector pragma.
> 	* gcc.dg/vect/vect-over-widen-9.c: Add novector pragma.
> 	* gcc.dg/vect/vect-peel-1-src.c: Add novector pragma.
> 	* gcc.dg/vect/vect-peel-2-src.c: Add novector pragma.
> 	* gcc.dg/vect/vect-peel-4-src.c: Add novector pragma.
> 	* gcc.dg/vect/vect-recurr-1.c: Add novector pragma.
> 	* gcc.dg/vect/vect-recurr-2.c: Add novector pragma.
> 	* gcc.dg/vect/vect-recurr-3.c: Add novector pragma.
> 	* gcc.dg/vect/vect-recurr-4.c: Add novector pragma.
> 	* gcc.dg/vect/vect-recurr-5.c: Add novector pragma.
> 	* gcc.dg/vect/vect-recurr-6.c: Add novector pragma.
> 	* gcc.dg/vect/vect-sdiv-pow2-1.c: Add novector pragma.
> 	* gcc.dg/vect/vect-sdivmod-1.c: Add novector pragma.
> 	* gcc.dg/vect/vect-shift-1.c: Add novector pragma.
> 	* gcc.dg/vect/vect-shift-3.c: Add novector pragma.
> 	* gcc.dg/vect/vect-shift-4.c: Add novector pragma.
> 	* gcc.dg/vect/vect-simd-1.c: Add novector pragma.
> 	* gcc.dg/vect/vect-simd-10.c: Add novector pragma.
> 	* gcc.dg/vect/vect-simd-11.c: Add novector pragma.
> 	* gcc.dg/vect/vect-simd-12.c: Add novector pragma.
> 	* gcc.dg/vect/vect-simd-13.c: Add novector pragma.
> 	* gcc.dg/vect/vect-simd-14.c: Add novector pragma.
> 	* gcc.dg/vect/vect-simd-15.c: Add novector pragma.
> 	* gcc.dg/vect/vect-simd-16.c: Add novector pragma.
> 	* gcc.dg/vect/vect-simd-17.c: Add novector pragma.
> 	* gcc.dg/vect/vect-simd-18.c: Add novector pragma.
> 	* gcc.dg/vect/vect-simd-19.c: Add novector pragma.
> 	* gcc.dg/vect/vect-simd-20.c: Add novector pragma.
> 	* gcc.dg/vect/vect-simd-8.c: Add novector pragma.
> 	* gcc.dg/vect/vect-simd-9.c: Add novector pragma.
> 	* gcc.dg/vect/vect-simd-clone-1.c: Add novector pragma.
> 	* gcc.dg/vect/vect-simd-clone-10.c: Add novector pragma.
> 	* gcc.dg/vect/vect-simd-clone-11.c: Add novector pragma.
> 	* gcc.dg/vect/vect-simd-clone-15.c: Add novector pragma.
> 	* gcc.dg/vect/vect-simd-clone-2.c: Add novector pragma.
> 	* gcc.dg/vect/vect-simd-clone-3.c: Add novector pragma.
> 	* gcc.dg/vect/vect-simd-clone-4.c: Add novector pragma.
> 	* gcc.dg/vect/vect-simd-clone-5.c: Add novector pragma.
> 	* gcc.dg/vect/vect-simd-clone-6.c: Add novector pragma.
> 	* gcc.dg/vect/vect-simd-clone-7.c: Add novector pragma.
> 	* gcc.dg/vect/vect-simd-clone-8.c: Add novector pragma.
> 	* gcc.dg/vect/vect-simd-clone-9.c: Add novector pragma.
> 	* gcc.dg/vect/vect-strided-a-mult.c: Add novector pragma.
> 	* gcc.dg/vect/vect-strided-a-u16-i2.c: Add novector pragma.
> 	* gcc.dg/vect/vect-strided-a-u16-i4.c: Add novector pragma.
> 	* gcc.dg/vect/vect-strided-a-u16-mult.c: Add novector pragma.
> 	* gcc.dg/vect/vect-strided-a-u32-mult.c: Add novector pragma.
> 	* gcc.dg/vect/vect-strided-a-u8-i2-gap.c: Add novector pragma.
> 	* gcc.dg/vect/vect-strided-a-u8-i8-gap2-big-array.c: Add novector
> pragma.
> 	* gcc.dg/vect/vect-strided-a-u8-i8-gap2.c: Add novector pragma.
> 	* gcc.dg/vect/vect-strided-a-u8-i8-gap7-big-array.c: Add novector
> pragma.
> 	* gcc.dg/vect/vect-strided-a-u8-i8-gap7.c: Add novector pragma.
> 	* gcc.dg/vect/vect-strided-float.c: Add novector pragma.
> 	* gcc.dg/vect/vect-strided-mult-char-ls.c: Add novector pragma.
> 	* gcc.dg/vect/vect-strided-mult.c: Add novector pragma.
> 	* gcc.dg/vect/vect-strided-same-dr.c: Add novector pragma.
> 	* gcc.dg/vect/vect-strided-shift-1.c: Add novector pragma.
> 	* gcc.dg/vect/vect-strided-store-a-u8-i2.c: Add novector pragma.
> 	* gcc.dg/vect/vect-strided-store-u16-i4.c: Add novector pragma.
> 	* gcc.dg/vect/vect-strided-store-u32-i2.c: Add novector pragma.
> 	* gcc.dg/vect/vect-strided-store.c: Add novector pragma.
> 	* gcc.dg/vect/vect-strided-u16-i2.c: Add novector pragma.
> 	* gcc.dg/vect/vect-strided-u16-i3.c: Add novector pragma.
> 	* gcc.dg/vect/vect-strided-u16-i4.c: Add novector pragma.
> 	* gcc.dg/vect/vect-strided-u32-i4.c: Add novector pragma.
> 	* gcc.dg/vect/vect-strided-u32-i8.c: Add novector pragma.
> 	* gcc.dg/vect/vect-strided-u32-mult.c: Add novector pragma.
> 	* gcc.dg/vect/vect-strided-u8-i2-gap.c: Add novector pragma.
> 	* gcc.dg/vect/vect-strided-u8-i2.c: Add novector pragma.
> 	* gcc.dg/vect/vect-strided-u8-i8-gap2-big-array.c: Add novector
> pragma.
> 	* gcc.dg/vect/vect-strided-u8-i8-gap2.c: Add novector pragma.
> 	* gcc.dg/vect/vect-strided-u8-i8-gap4-big-array.c: Add novector
> pragma.
> 	* gcc.dg/vect/vect-strided-u8-i8-gap4-unknown.c: Add novector
> pragma.
> 	* gcc.dg/vect/vect-strided-u8-i8-gap4.c: Add novector pragma.
> 	* gcc.dg/vect/vect-strided-u8-i8-gap7-big-array.c: Add novector
> pragma.
> 	* gcc.dg/vect/vect-strided-u8-i8-gap7.c: Add novector pragma.
> 	* gcc.dg/vect/vect-strided-u8-i8.c: Add novector pragma.
> 	* gcc.dg/vect/vect-vfa-01.c: Add novector pragma.
> 	* gcc.dg/vect/vect-vfa-02.c: Add novector pragma.
> 	* gcc.dg/vect/vect-vfa-03.c: Add novector pragma.
> 	* gcc.dg/vect/vect-vfa-04.c: Add novector pragma.
> 	* gcc.dg/vect/vect-vfa-slp.c: Add novector pragma.
> 	* gcc.dg/vect/vect-widen-mult-1.c: Add novector pragma.
> 	* gcc.dg/vect/vect-widen-mult-const-s16.c: Add novector pragma.
> 	* gcc.dg/vect/vect-widen-mult-const-u16.c: Add novector pragma.
> 	* gcc.dg/vect/vect-widen-mult-half-u8.c: Add novector pragma.
> 	* gcc.dg/vect/vect-widen-mult-half.c: Add novector pragma.
> 	* gcc.dg/vect/vect-widen-mult-s16.c: Add novector pragma.
> 	* gcc.dg/vect/vect-widen-mult-s8.c: Add novector pragma.
> 	* gcc.dg/vect/vect-widen-mult-u16.c: Add novector pragma.
> 	* gcc.dg/vect/vect-widen-mult-u8-s16-s32.c: Add novector pragma.
> 	* gcc.dg/vect/vect-widen-mult-u8-u32.c: Add novector pragma.
> 	* gcc.dg/vect/vect-widen-mult-u8.c: Add novector pragma.
> 	* gcc.dg/vect/vect-widen-shift-s16.c: Add novector pragma.
> 	* gcc.dg/vect/vect-widen-shift-s8.c: Add novector pragma.
> 	* gcc.dg/vect/vect-widen-shift-u16.c: Add novector pragma.
> 	* gcc.dg/vect/vect-widen-shift-u8.c: Add novector pragma.
> 	* gcc.dg/vect/wrapv-vect-7.c: Add novector pragma.
  
juzhe.zhong@rivai.ai June 28, 2023, 2:49 p.m. UTC | #2
Hi, Tamar.

This is an amazing auto-vectorization flow.

I am thinking about whether RVV can also get benefits from this optimization.
IMHO, RVV should be also using this flow.

So, to allow RVV  (target uses len as loop_control and mask as flow control),
I am not sure whether we can do this (Feel free to correct me if I am wrong):

+      if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
+	vect_record_loop_mask (loop_vinfo, masks, ncopies, truth_type, NULL);

Maybe it can be ?

if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)) {
  if (mask_loop_p)
     vect_record_loop_mask
   else
     vect_record_loop_len
}


+  tree cond = gimple_assign_lhs (new_stmt);
+  if (masked_loop_p)
+    {
+      tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks, ncopies, truth_type, 0);
+      cond = prepare_vec_mask (loop_vinfo, TREE_TYPE (mask), mask, cond,
+			       &cond_gsi);
+    }
+
+  tree t = fold_build2 (NE_EXPR, boolean_type_node, cond,
+			build_zero_cst (truth_type));

From my understanding, you are using final_mask = loop_mask (WHILE_ULT) && control_mask (comparison),
then testing final_mask using NE_EXPR.  Am I right?

For RVV, I am thinking about whether we can have a good way to do this testing.
I am not sure whether we can have something like LEN_TEST_MASK_NE (loop_len, control_mask, ...).

I am not saying that we should support "early break" auto-vectorization for RVV (loop_len && control_mask).
I am just writing some comments trying to figure out how I can adapt your work for RVV in the future.

Thanks.


juzhe.zhong@rivai.ai
 
From: Li, Pan2
Date: 2023-06-28 22:21
To: juzhe.zhong@rivai.ai
Subject: FW: [PATCH v5 0/19] Support early break/return auto-vectorization
FYI.
 
-----Original Message-----
From: Gcc-patches <gcc-patches-bounces+pan2.li=intel.com@gcc.gnu.org> On Behalf Of Tamar Christina via Gcc-patches
Sent: Wednesday, June 28, 2023 9:41 PM
To: gcc-patches@gcc.gnu.org
Cc: nd@arm.com; rguenther@suse.de; jlaw@ventanamicro.com
Subject: [PATCH v5 0/19] Support early break/return auto-vectorization
 
Hi All,
 
This patch adds initial support for early break vectorization in GCC.
The support is added for any target that implements a vector cbranch optab,
this includes both fully masked and non-masked targets.
 
Depending on the operation, the vectorizer may also require support for boolean
mask reductions using Inclusive OR.  This is, however, only checked when the
comparison would produce multiple statements.
 
Concretely the kind of loops supported are of the forms:
 
for (int i = 0; i < N; i++)
{
   <statements1>
   if (<condition>)
     {
       ...
       <action>;
     }
   <statements2>
}
 
where <action> can be:
- break
- return
- goto
 
Any number of statements can be used before the <action> occurs.
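As an illustrative sketch (the function name and data here are invented for
this example, not taken from the patch's testsuite), a supported loop of this
shape with a return as the <action> could look like:

```c
#include <assert.h>

#define N 128

/* Illustrative only: a find-first search loop matching the shape above.
   <statements1> computes a value, <condition> tests it, and the
   <action> is an early return.  */
int
find_first_above (const unsigned *a, unsigned x)
{
  for (int i = 0; i < N; i++)
    {
      unsigned v = a[i] + 1;   /* <statements1> */
      if (v > x)               /* <condition> */
        return i;              /* <action>: early return */
      /* <statements2> could follow here.  */
    }
  return -1;
}
```

Since N and the array bounds are compile-time constants, this falls inside the
fixed-size restriction described below.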
 
Since this is an initial version for GCC 14 it has the following limitations and
features:
 
- Only fixed-size iterations and buffers are supported.  That is to say any
  vectors loaded or stored must be to statically allocated arrays with known
  sizes.  N must also be known.  This limitation exists because our primary
  target for this optimization is SVE, and for VLA SVE we can't easily do
  cross-page iteration checks.  The result is also unlikely to be beneficial.
  For that reason we defer support for variable buffers until we have
  First-Faulting support in GCC.
- Any stores in <statements1> must not be to the same objects as in
  <condition>.  Loads are fine as long as they cannot alias.  More
  concretely, we block RAW dependencies when the intermediate value can't be
  separated from the store, or the store itself can't be moved.
- The number of loop iterations must be known; this is a temporary
  limitation that I intend to address in GCC 14 itself in follow-on patches.
- Prologue peeling, alignment peeling and loop versioning are supported.
- Fully masked loops, unmasked loops and partially masked loops are supported
- Any number of loop early exits are supported.
- The early exit must be before the natural loop exit/latch.  The vectorizer
  is designed in a way that propagates phi-nodes downwards, so supporting this
  inverted control flow is hard.
- No support for epilogue vectorization.  The only epilogue supported is the
  scalar final one.  Epilogue vectorization would also not be profitable.
- Early breaks are only supported for inner loop vectorization.
 
I have pushed a branch to refs/users/tnfchris/heads/gcc-14-early-break
 
With the help of IPA and LTO this optimization still triggers quite often;
during bootstrap it was hit rather frequently.  Additionally TSVC s332, s481
and s482 all pass now, since these are tests for early exit vectorization
support.
 
This implementation does not handle the early break completely inside the
vector loop itself.  Instead it adds checks such that, if we know we have to
exit in the current iteration, we branch to scalar code that performs the
final VF iterations and handles all the code in <action>.
 
niters analysis and the majority of the vectorizer, which hardcoded
single_exit, have been updated to use a new function, vec_loop_iv, which
returns the exit the vectorizer wants to use as the main IV exit.

For niters, this exit is what determines the overall iteration count, as it
is the O(iters) for the loop.
 
For the scalar loop we know that whichever exit you take you have to perform
at most VF iterations.  For vector code we only care about the state of fully
performed iterations and reset the scalar code to the (partially) remaining
loop.
 
This new version of the patch does the majority of the work in a newly
rewritten loop peeling.  This new function maintains LCSSA all the way
through and no longer requires the touch-up functions the vectorizer used to
use to incrementally adjust things later on.  This means that aside from IV
updates and guard edge updates, the early exit code is identical to the
single exit cases.
 
When the loop is peeled during the copying I have to go to great lengths to
keep the dominators up to date.  All exits from the first loop are rewired to
the loop header of the second loop.  But this can change the immediate
dominator.
 
The dominators can change again when we wire in the loop guard, as such peeling
now returns a list of dominators that need to be updated if a new guard edge is
added.
 
For the loop peeling we rewrite the loop from:
 
 
                     Header
                      ---
                      |x|
                       2
                       |
                       v
                -------3<------
     early exit |      |      |
                v      v      | latch
                7      4----->6
                |      |
                |      v
                |      8
                |      |
                |      v
                ------>5
 
into
 
                     Header
                      ---
                      |x|
                       2
                       |
                       v
                -------3<------
     early exit |      |      |
                v      v      | latch
                7      4----->6
                |      |
                |      v
                |      8
                |      |
                |      v
                |  New Header
                |     ---
                ----->|x|
                       9
                       |
                       v
                ------10<-----
     early exit |      |      |
                v      v      | latch
                14     11---->13
                |      |
                |      v
                |      12
                |      |
                |      v
                ------> 5
 
That is to say, the first vector loop executes so long as the early exit
isn't needed.  Once the exit is taken, the scalar code will perform at most
VF extra iterations.  The exact number depends on peeling, the iteration
start, and which exit was taken (natural or early).  For this scalar loop,
all early exits are treated the same.
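To make that shape concrete, here is a hedged, purely scalar model of the
transformed structure, with an assumed VF of 4 (illustrative only, not the
actual generated code): the "vector" part checks per block of VF iterations
whether any lane would take the early exit, and on exit falls through to a
scalar loop that performs at most VF iterations and handles the break
precisely.

```c
#include <assert.h>

#define N 803
#define VF 4   /* assumed vector factor, for illustration only */

static unsigned vect_a[N], vect_b[N];

/* Scalar model of the vector-loop-plus-scalar-tail structure.  */
unsigned
test4_model (unsigned x)
{
  int i = 0;
  /* "Vector" part: only executes full blocks in which no lane exits.  */
  while (i + VF <= N)
    {
      int any_exit = 0;
      for (int l = 0; l < VF; l++)
        if (vect_a[i + l] > x)
          any_exit = 1;          /* models the cbranch on the mask */
      if (any_exit)
        break;                   /* branch to the scalar code */
      for (int l = 0; l < VF; l++)
        {
          vect_b[i + l] = x + i + l;
          vect_a[i + l] = x;
        }
      i += VF;
    }
  /* Scalar part: at most VF (plus remainder) iterations, handling the
     break exactly as the original loop does.  */
  for (; i < N; i++)
    {
      vect_b[i] = x + i;
      if (vect_a[i] > x)
        break;
      vect_a[i] = x;
    }
  return 0;
}
```

The scalar tail re-runs the breaking block from its start, which is safe here
because the stores it repeats are idempotent for already-completed lanes.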
 
When we vectorize we move any statement not related to the early break itself
and that would be incorrect to execute before the break (i.e. has side effects)
to after the break.  If this is not possible we decline to vectorize.
 
This means that we check at the start of iterations whether we are going to
exit or not.  During the analysis phase we check whether we are allowed to do
this moving of statements.  Note that we move only the scalar statements, and
do so after peeling but just before we start transforming statements.
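As a hedged scalar illustration of that statement motion (the helper names
are invented, not code from the patch): sinking a side-effecting store below
the exit test is semantically equivalent precisely when the condition does
not read the stored object, which is the RAW restriction listed earlier.

```c
#include <assert.h>

#define N 16

/* Original body order: the store to b[] is a side effect that happens
   before the break test.  */
int
body_original (unsigned *a, unsigned *b, unsigned x)
{
  for (int i = 0; i < N; i++)
    {
      b[i] = x + i;       /* side effect before <condition> */
      if (a[i] > x)
        return i;
      a[i] = x;
    }
  return -1;
}

/* After the (illustrative) statement motion: the exit test runs first;
   the store is sunk below it but still executed on the breaking
   iteration.  This is only valid because the condition reads a[], not
   b[], so no RAW dependency is broken.  */
int
body_moved (unsigned *a, unsigned *b, unsigned x)
{
  for (int i = 0; i < N; i++)
    {
      if (a[i] > x)
        {
          b[i] = x + i;   /* moved, still done for the final iteration */
          return i;
        }
      b[i] = x + i;
      a[i] = x;
    }
  return -1;
}
```

Both bodies leave identical memory state and return the same index for any
input, which is the equivalence the analysis phase has to prove.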
 
Codegen:
 
for e.g.
 
#define N 803
unsigned vect_a[N];
unsigned vect_b[N];
 
unsigned test4(unsigned x)
{
unsigned ret = 0;
for (int i = 0; i < N; i++)
{
   vect_b[i] = x + i;
   if (vect_a[i] > x)
     break;
   vect_a[i] = x;
 
}
return ret;
}
 
We generate for Adv. SIMD:
 
test4:
        adrp    x2, .LC0
        adrp    x3, .LANCHOR0
        dup     v2.4s, w0
        add     x3, x3, :lo12:.LANCHOR0
        movi    v4.4s, 0x4
        add     x4, x3, 3216
        ldr     q1, [x2, #:lo12:.LC0]
        mov     x1, 0
        mov     w2, 0
        .p2align 3,,7
.L3:
        ldr     q0, [x3, x1]
        add     v3.4s, v1.4s, v2.4s
        add     v1.4s, v1.4s, v4.4s
        cmhi    v0.4s, v0.4s, v2.4s
        umaxp   v0.4s, v0.4s, v0.4s
        fmov    x5, d0
        cbnz    x5, .L6
        add     w2, w2, 1
        str     q3, [x1, x4]
        str     q2, [x3, x1]
        add     x1, x1, 16
        cmp     w2, 200
        bne     .L3
        mov     w7, 3
.L2:
        lsl     w2, w2, 2
        add     x5, x3, 3216
        add     w6, w2, w0
        sxtw    x4, w2
        ldr     w1, [x3, x4, lsl 2]
        str     w6, [x5, x4, lsl 2]
        cmp     w0, w1
        bcc     .L4
        add     w1, w2, 1
        str     w0, [x3, x4, lsl 2]
        add     w6, w1, w0
        sxtw    x1, w1
        ldr     w4, [x3, x1, lsl 2]
        str     w6, [x5, x1, lsl 2]
        cmp     w0, w4
        bcc     .L4
        add     w4, w2, 2
        str     w0, [x3, x1, lsl 2]
        sxtw    x1, w4
        add     w6, w1, w0
        ldr     w4, [x3, x1, lsl 2]
        str     w6, [x5, x1, lsl 2]
        cmp     w0, w4
        bcc     .L4
        str     w0, [x3, x1, lsl 2]
        add     w2, w2, 3
        cmp     w7, 3
        beq     .L4
        sxtw    x1, w2
        add     w2, w2, w0
        ldr     w4, [x3, x1, lsl 2]
        str     w2, [x5, x1, lsl 2]
        cmp     w0, w4
        bcc     .L4
        str     w0, [x3, x1, lsl 2]
.L4:
        mov     w0, 0
        ret
        .p2align 2,,3
.L6:
        mov     w7, 4
        b       .L2
 
and for SVE:
 
test4:
        adrp    x2, .LANCHOR0
        add     x2, x2, :lo12:.LANCHOR0
        add     x5, x2, 3216
        mov     x3, 0
        mov     w1, 0
        cntw    x4
        mov     z1.s, w0
        index   z0.s, #0, #1
        ptrue   p1.b, all
        ptrue   p0.s, all
        .p2align 3,,7
.L3:
        ld1w    z2.s, p1/z, [x2, x3, lsl 2]
        add     z3.s, z0.s, z1.s
        cmplo   p2.s, p0/z, z1.s, z2.s
        b.any   .L2
        st1w    z3.s, p1, [x5, x3, lsl 2]
        add     w1, w1, 1
        st1w    z1.s, p1, [x2, x3, lsl 2]
        add     x3, x3, x4
        incw    z0.s
        cmp     w3, 803
        bls     .L3
.L5:
        mov     w0, 0
        ret
        .p2align 2,,3
.L2:
        cntw    x5
        mul     w1, w1, w5
        cbz     w5, .L5
        sxtw    x1, w1
        sub     w5, w5, #1
        add     x5, x5, x1
        add     x6, x2, 3216
        b       .L6
        .p2align 2,,3
.L14:
        str     w0, [x2, x1, lsl 2]
        cmp     x1, x5
        beq     .L5
        mov     x1, x4
.L6:
        ldr     w3, [x2, x1, lsl 2]
        add     w4, w0, w1
        str     w4, [x6, x1, lsl 2]
        add     x4, x1, 1
        cmp     w0, w3
        bcs     .L14
        mov     w0, 0
        ret
 
On the workloads this work is based on we see a 2-3x performance uplift
using this patch.
 
Follow up plan:
- Boolean vectorization has several shortcomings.  I've filed PR110223 with the
   bigger ones that cause vectorization to fail with this patch.
- SLP support.  This is planned for GCC 15, as for the majority of cases
   building the SLP tree itself fails.  This means I'll need to spend time
   making this more robust first.  Additionally it requires:
     * Adding support for vectorizing CFG (gconds)
     * Support for CFG to differ between vector and scalar loops.
   Both of which would be disruptive to the tree and I suspect I'll be handling
   fallouts from this patch for a while.  So I plan to work on the surrounding
   building blocks first for the remainder of the year.
 
Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
Also ran across various workloads and no issues.
 
When closer to acceptance I will run on other targets as well and clean up
related testsuite fallouts there.
 
--- inline copy of patch -- 
 
--
  
Tamar Christina June 28, 2023, 4 p.m. UTC | #3
Hi Juzhe,

> 
> Hi, Tamar.
> 
> This is an amazing auto-vectorization flow.
> 
> I am thinking about whether RVV can also get benefits from this optimization.
> IMHO, RVV should be also using this flow.
> 
> So, to allow RVV  (target uses len as loop_control and mask as flow control), I
> am not sure whether we can do this (Feel free to correct me if I am wrong):
> 
> +      if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
> +	vect_record_loop_mask (loop_vinfo, masks, ncopies, truth_type,
> NULL);
> 
> Maybe it can be ?
> 
> if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)) {
>   if (mask_loop_p)
>      vect_record_loop_mask
>    else
>      vect_record_loop_len
> }
> 

Yeah, that should be the only change required.  I started this patch before the loop_len change
made it in and just rebased recently 😊

> 
> +  tree cond = gimple_assign_lhs (new_stmt);
> +  if (masked_loop_p)
> +    {
> +      tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks, ncopies,
> truth_type, 0);
> +      cond = prepare_vec_mask (loop_vinfo, TREE_TYPE (mask), mask, cond,
> +			       &cond_gsi);
> +    }
> +
> +  tree t = fold_build2 (NE_EXPR, boolean_type_node, cond,
> +			build_zero_cst (truth_type));
> 
> From my understanding, you are using final_mask = loop_mask (WHILE_ULT)
> && control_mask (comparison).
> Then test final_mask using NE_EXPR.  Am I right?

Yeah, that's right.  It's creating the mask for partial iterations.  The only other constraint is
being able to reduce a boolean mask using inclusive OR, but that's optional and is only
needed if one side of the comparison produces more than one copy (so it's only checked then).

> 
> For RVV, I am thinking about whether we can have a good way to do this testing.
> Not sure whether we can have something like LEN_TEST_MASK_NE (loop_len,
> control_mask...)
> 

Hmm, is just the vect_record_loop_len change not enough?  (I haven't followed the masking
implementation in RVV in detail.)  But I assume it follows the general principle that
ANDing an operation with a mask creates a masked operation?

That is to say, I thought LOOP_LEN was only for loop control, which doesn't change here.
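A scalar sketch of that distinction (again purely illustrative C; VF, the function name, and the "exceeds x" condition are invented for the example): the length only governs how many lanes are active per iteration, while the early break is still decided by a separate per-lane comparison:

```c
/* Scalar model of a length-controlled loop (RVV-style): len is loop
   control -- how many lanes are active this iteration -- while the
   early-break decision is flow control from a per-lane comparison.
   Returns the index of the first element greater than x, or n if none.  */

#define VF 4 /* illustrative vector factor */

static int
first_exceeding (const unsigned *a, int n, unsigned x)
{
  for (int i = 0; i < n; i += VF)
    {
      int len = (n - i < VF) ? n - i : VF; /* loop control: active lanes */
      for (int lane = 0; lane < len; lane++)
        if (a[i + lane] > x) /* flow control: per-lane comparison */
          return i + lane;   /* early break out of the whole loop */
    }
  return n;
}
```

The point being modeled: replacing the mask-based loop control with a length does not change how the break condition itself is computed.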

> I am not saying that we should support "early break" auto-vectorization for
> RVV (loop_len && control_mask).
> I am just writing some comments trying to figure out how I can adapt your
> work for RVV in the future.
> 

Yes happy to help, the more uses it gets the more bugs I can fix 😊

Cheers,
Tamar

> Thanks.
> 
> 
> juzhe.zhong@rivai.ai
> 
> From: Li, Pan2
> Date: 2023-06-28 22:21
> To: juzhe.zhong@rivai.ai
> Subject: FW: [PATCH v5 0/19] Support early break/return auto-vectorization
> FYI.
> 
> -----Original Message-----
> From: Gcc-patches <gcc-patches-bounces+pan2.li=intel.com@gcc.gnu.org>
> On Behalf Of Tamar Christina via Gcc-patches
> Sent: Wednesday, June 28, 2023 9:41 PM
> To: gcc-patches@gcc.gnu.org
> Cc: nd@arm.com; rguenther@suse.de; jlaw@ventanamicro.com
> Subject: [PATCH v5 0/19] Support early break/return auto-vectorization
> 
> Hi All,
> 
> This patch adds initial support for early break vectorization in GCC.
> The support is added for any target that implements a vector cbranch optab,
> this includes both fully masked and non-masked targets.
> 
> Depending on the operation, the vectorizer may also require support for
> boolean mask reductions using Inclusive OR.  This is however only checked
> when the comparison would produce multiple statements.
> 
> Concretely the kind of loops supported are of the forms:
> 
> for (int i = 0; i < N; i++)
> {
>    <statements1>
>    if (<condition>)
>      {
>        ...
>        <action>;
>      }
>    <statements2>
> }
> 
> where <action> can be:
> - break
> - return
> - goto
> 
> Any number of statements can be used before the <action> occurs.
> 
> Since this is an initial version for GCC 14 it has the following limitations and
> features:
> 
> - Only fixed sized iterations and buffers are supported.  That is to say any
>   vectors loaded or stored must be to statically allocated arrays with known
>   sizes. N must also be known.  This limitation is because our primary target
>   for this optimization is SVE.  For VLA SVE we can't easily do cross page
>   iteration checks. The result is likely to also not be beneficial. For that
>   reason we punt support for variable buffers till we have First-Faulting
>   support in GCC.
> - any stores in <statements1> should not be to the same objects as in
>   <condition>.  Loads are fine as long as they don't have the possibility to
>   alias.  More concretely, we block RAW dependencies when the intermediate
>   value can't be separated from the store, or the store itself can't be moved.
> - The number of loop iterations must be known; this is just a temporary
>   limitation that I intend to address in GCC 14 itself as follow-on patches.
> - Prologue peeling, alignment peeling and loop versioning are supported.
> - Fully masked loops, unmasked loops and partially masked loops are
> supported
> - Any number of loop early exits are supported.
> - The early exit must be before the natural loop exit/latch.  The vectorizer is
>   designed in a way to propagate phi-nodes downwards.  As such supporting this
>   inverted control flow is hard.
> - No support for epilogue vectorization.  The only epilogue supported is the
>   scalar final one.  Epilogue vectorization would also not be profitable.
> - Early breaks are only supported for inner loop vectorization.
> 
> I have pushed a branch to refs/users/tnfchris/heads/gcc-14-early-break
> 
> With the help of IPA and LTO this still gets hit quite often.  During bootstrap it
> hit rather frequently.  Additionally TSVC s332, s481 and s482 all pass now
> since these are tests for support for early exit vectorization.
> 
> This implementation does not support completely handling the early break
> inside the vector loop itself but instead supports adding checks such that if we
> know that we have to exit in the current iteration then we branch to scalar
> code to actually do the final VF iterations which handles all the code in
> <action>.
> 
> The niters analysis and the majority of the vectorizer that hardcoded single_exit
> have been updated to use a new function vec_loop_iv which returns the exit the
> vectorizer wants to use as the main IV exit.
> 
> For niters, this exit is what determines the overall iteration count, as that
> is the O(iters) for the loop.
> 
> For the scalar loop we know that whatever exit you take you have to perform
> at most VF iterations.  For vector code we only care about the state of fully
> performed iterations and reset the scalar code to the (partially) remaining loop.
> 
> This new version of the patch does the majority of the work in a new rewritten
> loop peeling.  This new function maintains LCSSA all the way through and no
> longer requires the touch-up functions the vectorizer used to incrementally
> adjust them later on.  This means that aside from IV updates and guard edge
> updates the early exit code is identical to the single exit cases.
> 
> When the loop is peeled during the copying I have to go to great lengths
> to keep the dominators up to date.  All exits from the first loop are rewired to
> the loop header of the second loop.  But this can change the immediate
> dominator.
> 
> The dominators can change again when we wire in the loop guard, as such
> peeling now returns a list of dominators that need to be updated if a new
> guard edge is added.
> 
> For the loop peeling we rewrite the loop form:
> 
> 
>                      Header
>                       ---
>                       |x|
>                        2
>                        |
>                        v
>                 -------3<------
>      early exit |      |      |
>                 v      v      | latch
>                 7      4----->6
>                 |      |
>                 |      v
>                 |      8
>                 |      |
>                 |      v
>                 ------>5
> 
> into
> 
>                      Header
>                       ---
>                       |x|
>                        2
>                        |
>                        v
>                 -------3<------
>      early exit |      |      |
>                 v      v      | latch
>                 7      4----->6
>                 |      |
>                 |      v
>                 |      8
>                 |      |
>                 |      v
>                 |  New Header
>                 |     ---
>                 ----->|x|
>                        9
>                        |
>                        v
>                 ------10<-----
>      early exit |      |      |
>                 v      v      | latch
>                 14     11---->13
>                 |      |
>                 |      v
>                 |      12
>                 |      |
>                 |      v
>                 ------> 5
> 
> That is to say, the first vector loop executes so long as the early exit isn't
> needed.  Once the exit is taken, the scalar code will perform at most VF extra
> iterations.  The exact number depending on peeling and iteration start and
> which
> exit was taken (natural or early).   For this scalar loop, all early exits are
> treated the same.
> 
> When we vectorize we move any statement not related to the early break
> itself and that would be incorrect to execute before the break (i.e. has side
> effects) to after the break.  If this is not possible we decline to vectorize.
> 
> This means that we check at the start of iterations whether we are going to
> exit or not.  During the analysis phase we check whether we are allowed to do
> this moving of statements.  Also note that we only move the scalar
> statements, but only do so after peeling but just before we start transforming
> statements.
> 
> Codegen:
> 
> for e.g.
> 
> #define N 803
> unsigned vect_a[N];
> unsigned vect_b[N];
> 
> unsigned test4(unsigned x)
> {
> unsigned ret = 0;
> for (int i = 0; i < N; i++)
> {
>    vect_b[i] = x + i;
>    if (vect_a[i] > x)
>      break;
>    vect_a[i] = x;
> 
> }
> return ret;
> }
> 
> We generate for Adv. SIMD:
> 
> test4:
>         adrp    x2, .LC0
>         adrp    x3, .LANCHOR0
>         dup     v2.4s, w0
>         add     x3, x3, :lo12:.LANCHOR0
>         movi    v4.4s, 0x4
>         add     x4, x3, 3216
>         ldr     q1, [x2, #:lo12:.LC0]
>         mov     x1, 0
>         mov     w2, 0
>         .p2align 3,,7
> .L3:
>         ldr     q0, [x3, x1]
>         add     v3.4s, v1.4s, v2.4s
>         add     v1.4s, v1.4s, v4.4s
>         cmhi    v0.4s, v0.4s, v2.4s
>         umaxp   v0.4s, v0.4s, v0.4s
>         fmov    x5, d0
>         cbnz    x5, .L6
>         add     w2, w2, 1
>         str     q3, [x1, x4]
>         str     q2, [x3, x1]
>         add     x1, x1, 16
>         cmp     w2, 200
>         bne     .L3
>         mov     w7, 3
> .L2:
>         lsl     w2, w2, 2
>         add     x5, x3, 3216
>         add     w6, w2, w0
>         sxtw    x4, w2
>         ldr     w1, [x3, x4, lsl 2]
>         str     w6, [x5, x4, lsl 2]
>         cmp     w0, w1
>         bcc     .L4
>         add     w1, w2, 1
>         str     w0, [x3, x4, lsl 2]
>         add     w6, w1, w0
>         sxtw    x1, w1
>         ldr     w4, [x3, x1, lsl 2]
>         str     w6, [x5, x1, lsl 2]
>         cmp     w0, w4
>         bcc     .L4
>         add     w4, w2, 2
>         str     w0, [x3, x1, lsl 2]
>         sxtw    x1, w4
>         add     w6, w1, w0
>         ldr     w4, [x3, x1, lsl 2]
>         str     w6, [x5, x1, lsl 2]
>         cmp     w0, w4
>         bcc     .L4
>         str     w0, [x3, x1, lsl 2]
>         add     w2, w2, 3
>         cmp     w7, 3
>         beq     .L4
>         sxtw    x1, w2
>         add     w2, w2, w0
>         ldr     w4, [x3, x1, lsl 2]
>         str     w2, [x5, x1, lsl 2]
>         cmp     w0, w4
>         bcc     .L4
>         str     w0, [x3, x1, lsl 2]
> .L4:
>         mov     w0, 0
>         ret
>         .p2align 2,,3
> .L6:
>         mov     w7, 4
>         b       .L2
> 
> and for SVE:
> 
> test4:
>         adrp    x2, .LANCHOR0
>         add     x2, x2, :lo12:.LANCHOR0
>         add     x5, x2, 3216
>         mov     x3, 0
>         mov     w1, 0
>         cntw    x4
>         mov     z1.s, w0
>         index   z0.s, #0, #1
>         ptrue   p1.b, all
>         ptrue   p0.s, all
>         .p2align 3,,7
> .L3:
>         ld1w    z2.s, p1/z, [x2, x3, lsl 2]
>         add     z3.s, z0.s, z1.s
>         cmplo   p2.s, p0/z, z1.s, z2.s
>         b.any   .L2
>         st1w    z3.s, p1, [x5, x3, lsl 2]
>         add     w1, w1, 1
>         st1w    z1.s, p1, [x2, x3, lsl 2]
>         add     x3, x3, x4
>         incw    z0.s
>         cmp     w3, 803
>         bls     .L3
> .L5:
>         mov     w0, 0
>         ret
>         .p2align 2,,3
> .L2:
>         cntw    x5
>         mul     w1, w1, w5
>         cbz     w5, .L5
>         sxtw    x1, w1
>         sub     w5, w5, #1
>         add     x5, x5, x1
>         add     x6, x2, 3216
>         b       .L6
>         .p2align 2,,3
> .L14:
>         str     w0, [x2, x1, lsl 2]
>         cmp     x1, x5
>         beq     .L5
>         mov     x1, x4
> .L6:
>         ldr     w3, [x2, x1, lsl 2]
>         add     w4, w0, w1
>         str     w4, [x6, x1, lsl 2]
>         add     x4, x1, 1
>         cmp     w0, w3
>         bcs     .L14
>         mov     w0, 0
>         ret
> 
> On the workloads this work is based on we see between 2-3x performance
> uplift using this patch.
> 
> Follow up plan:
> - Boolean vectorization has several shortcomings.  I've filed PR110223 with
> the
>    bigger ones that cause vectorization to fail with this patch.
> - SLP support.  This is planned for GCC 15 as for the majority of cases building
>    SLP itself fails.  This means I'll need to spend time making this more
>    robust first.  Additionally it requires:
>      * Adding support for vectorizing CFG (gconds)
>      * Support for CFG to differ between vector and scalar loops.
>    Both of which would be disruptive to the tree and I suspect I'll be handling
>    fallouts from this patch for a while.  So I plan to work on the surrounding
>    building blocks first for the remainder of the year.
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> Also ran across various workloads and no issues.
> 
> When closer to acceptance I will run on other targets as well and clean up
> related testsuite fallouts there.
> 
> --- inline copy of patch --
> 
> --
  
Richard Biener Nov. 6, 2023, 2:25 p.m. UTC | #4
On Mon, 6 Nov 2023, Tamar Christina wrote:

> Hi All,
> 
> This patch adds initial support for early break vectorization in GCC.
> The support is added for any target that implements a vector cbranch optab,
> this includes both fully masked and non-masked targets.
> 
> Depending on the operation, the vectorizer may also require support for boolean
> mask reductions using Inclusive OR.  This is however only checked when the
> comparison would produce multiple statements.
> 
> Note: I am currently struggling to get patch 7 correct in all cases and could use
>       some feedback there.
> 
> Concretely the kind of loops supported are of the forms:
> 
>  for (int i = 0; i < N; i++)
>  {
>    <statements1>
>    if (<condition>)
>      {
>        ...
>        <action>;
>      }
>    <statements2>
>  }
> 
> where <action> can be:
>  - break
>  - return
>  - goto
> 
> Any number of statements can be used before the <action> occurs.
> 
> Since this is an initial version for GCC 14 it has the following limitations and
> features:
> 
> - Only fixed sized iterations and buffers are supported.  That is to say any
>   vectors loaded or stored must be to statically allocated arrays with known
>   sizes. N must also be known.  This limitation is because our primary target
>   for this optimization is SVE.  For VLA SVE we can't easily do cross page
>   iteration checks. The result is likely to also not be beneficial. For that
>   reason we punt support for variable buffers till we have First-Faulting
>   support in GCC.
> - any stores in <statements1> should not be to the same objects as in
>   <condition>.  Loads are fine as long as they don't have the possibility to
>   alias.  More concretely, we block RAW dependencies when the intermediate value
>   can't be separated from the store, or the store itself can't be moved.
> - Prologue peeling, alignment peeling and loop versioning are supported.
> - Fully masked loops, unmasked loops and partially masked loops are supported
> - Any number of loop early exits are supported.
> - No support for epilogue vectorization.  The only epilogue supported is the
>   scalar final one.  Peeling code supports it but the code motion code cannot
>   find instructions to make the move in the epilog.
> - Early breaks are only supported for inner loop vectorization.
> 
> I have pushed a branch to refs/users/tnfchris/heads/gcc-14-early-break
> 
> With the help of IPA and LTO this still gets hit quite often.  During bootstrap
> it hit rather frequently.  Additionally TSVC s332, s481 and s482 all pass now
> since these are tests for support for early exit vectorization.
> 
> This implementation does not support completely handling the early break inside
> the vector loop itself but instead supports adding checks such that if we know
> that we have to exit in the current iteration then we branch to scalar code to
> actually do the final VF iterations which handles all the code in <action>.
> 
> For the scalar loop we know that whatever exit you take you have to perform at
> most VF iterations.  For vector code we only care about the state of fully
> performed iterations and reset the scalar code to the (partially) remaining loop.
> 
> That is to say, the first vector loop executes so long as the early exit isn't
> needed.  Once the exit is taken, the scalar code will perform at most VF extra
> iterations.  The exact number depending on peeling and iteration start and which
> exit was taken (natural or early).   For this scalar loop, all early exits are
> treated the same.
> 
> When we vectorize we move any statement not related to the early break itself
> and that would be incorrect to execute before the break (i.e. has side effects)
> to after the break.  If this is not possible we decline to vectorize.
> 
> This means that we check at the start of iterations whether we are going to exit
> or not.  During the analysis phase we check whether we are allowed to do this
> moving of statements.  Also note that we only move the scalar statements, but
> only do so after peeling but just before we start transforming statements.
> 
> Codegen:
> 
> for e.g.
> 
> #define N 803
> unsigned vect_a[N];
> unsigned vect_b[N];
> 
> unsigned test4(unsigned x)
> {
>  unsigned ret = 0;
>  for (int i = 0; i < N; i++)
>  {
>    vect_b[i] = x + i;
>    if (vect_a[i] > x)
>      break;
>    vect_a[i] = x;
> 
>  }
>  return ret;
> }
> 
> We generate for Adv. SIMD:
> 
> test4:
>         adrp    x2, .LC0
>         adrp    x3, .LANCHOR0
>         dup     v2.4s, w0
>         add     x3, x3, :lo12:.LANCHOR0
>         movi    v4.4s, 0x4
>         add     x4, x3, 3216
>         ldr     q1, [x2, #:lo12:.LC0]
>         mov     x1, 0
>         mov     w2, 0
>         .p2align 3,,7
> .L3:
>         ldr     q0, [x3, x1]
>         add     v3.4s, v1.4s, v2.4s
>         add     v1.4s, v1.4s, v4.4s
>         cmhi    v0.4s, v0.4s, v2.4s
>         umaxp   v0.4s, v0.4s, v0.4s
>         fmov    x5, d0
>         cbnz    x5, .L6
>         add     w2, w2, 1
>         str     q3, [x1, x4]
>         str     q2, [x3, x1]
>         add     x1, x1, 16
>         cmp     w2, 200
>         bne     .L3
>         mov     w7, 3
> .L2:
>         lsl     w2, w2, 2
>         add     x5, x3, 3216
>         add     w6, w2, w0
>         sxtw    x4, w2
>         ldr     w1, [x3, x4, lsl 2]
>         str     w6, [x5, x4, lsl 2]
>         cmp     w0, w1
>         bcc     .L4
>         add     w1, w2, 1
>         str     w0, [x3, x4, lsl 2]
>         add     w6, w1, w0
>         sxtw    x1, w1
>         ldr     w4, [x3, x1, lsl 2]
>         str     w6, [x5, x1, lsl 2]
>         cmp     w0, w4
>         bcc     .L4
>         add     w4, w2, 2
>         str     w0, [x3, x1, lsl 2]
>         sxtw    x1, w4
>         add     w6, w1, w0
>         ldr     w4, [x3, x1, lsl 2]
>         str     w6, [x5, x1, lsl 2]
>         cmp     w0, w4
>         bcc     .L4
>         str     w0, [x3, x1, lsl 2]
>         add     w2, w2, 3
>         cmp     w7, 3
>         beq     .L4
>         sxtw    x1, w2
>         add     w2, w2, w0
>         ldr     w4, [x3, x1, lsl 2]
>         str     w2, [x5, x1, lsl 2]
>         cmp     w0, w4
>         bcc     .L4
>         str     w0, [x3, x1, lsl 2]
> .L4:
>         mov     w0, 0
>         ret
>         .p2align 2,,3
> .L6:
>         mov     w7, 4
>         b       .L2
> 
> and for SVE:
> 
> test4:
>         adrp    x2, .LANCHOR0
>         add     x2, x2, :lo12:.LANCHOR0
>         add     x5, x2, 3216
>         mov     x3, 0
>         mov     w1, 0
>         cntw    x4
>         mov     z1.s, w0
>         index   z0.s, #0, #1
>         ptrue   p1.b, all
>         ptrue   p0.s, all
>         .p2align 3,,7
> .L3:
>         ld1w    z2.s, p1/z, [x2, x3, lsl 2]
>         add     z3.s, z0.s, z1.s
>         cmplo   p2.s, p0/z, z1.s, z2.s
>         b.any   .L2
>         st1w    z3.s, p1, [x5, x3, lsl 2]
>         add     w1, w1, 1
>         st1w    z1.s, p1, [x2, x3, lsl 2]
>         add     x3, x3, x4
>         incw    z0.s
>         cmp     w3, 803
>         bls     .L3
> .L5:
>         mov     w0, 0
>         ret
>         .p2align 2,,3
> .L2:
>         cntw    x5
>         mul     w1, w1, w5
>         cbz     w5, .L5
>         sxtw    x1, w1
>         sub     w5, w5, #1
>         add     x5, x5, x1
>         add     x6, x2, 3216
>         b       .L6
>         .p2align 2,,3
> .L14:
>         str     w0, [x2, x1, lsl 2]
>         cmp     x1, x5
>         beq     .L5
>         mov     x1, x4
> .L6:
>         ldr     w3, [x2, x1, lsl 2]
>         add     w4, w0, w1
>         str     w4, [x6, x1, lsl 2]
>         add     x4, x1, 1
>         cmp     w0, w3
>         bcs     .L14
>         mov     w0, 0
>         ret
> 
> On the workloads this work is based on we see between 2-3x performance uplift
> using this patch.
> 
> Follow up plan:
>  - Boolean vectorization has several shortcomings.  I've filed PR110223 with the
>    bigger ones that cause vectorization to fail with this patch.
>  - SLP support.  This is planned for GCC 15 as for the majority of cases building
>    SLP itself fails.

It would be nice to get at least single-lane SLP support working.  I think
you need to treat the gcond as SLP root stmt and basically do discovery
on the condition as if it were a mask-generating condition.

Code generation would then simply schedule the gcond root instances
first (that would get you the code motion automagically).

So, add a new slp_instance_kind, for example slp_inst_kind_early_break,
and record the gcond as root stmt.  Possibly "pattern" recognizing

 gcond <_1 != _2>

as

 _mask = _1 != _2;
 gcond <_mask != 0>

makes the SLP discovery less fiddly (but in theory you can of course
handle gconds directly).

Is there any part of the series that can be pushed independently?  If
so I'll try to look at those parts first.

Thanks,
Richard.
  
Tamar Christina Nov. 6, 2023, 3:17 p.m. UTC | #5
> -----Original Message-----
> From: Richard Biener <rguenther@suse.de>
> Sent: Monday, November 6, 2023 2:25 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>
> Subject: Re: [PATCH v6 0/21]middle-end: Support early break/return auto-
> vectorization
> 
> On Mon, 6 Nov 2023, Tamar Christina wrote:
> 
> > Hi All,
> >
> > This patch adds initial support for early break vectorization in GCC.
> > The support is added for any target that implements a vector cbranch
> > optab, this includes both fully masked and non-masked targets.
> >
> > Depending on the operation, the vectorizer may also require support
> > for boolean mask reductions using Inclusive OR.  This is however only
> > checked when the comparison would produce multiple statements.
> >
> > Note: I am currently struggling to get patch 7 correct in all cases and could
> use
> >       some feedback there.
> >
> > Concretely the kind of loops supported are of the forms:
> >
> >  for (int i = 0; i < N; i++)
> >  {
> >    <statements1>
> >    if (<condition>)
> >      {
> >        ...
> >        <action>;
> >      }
> >    <statements2>
> >  }
> >
> > where <action> can be:
> >  - break
> >  - return
> >  - goto
> >
> > Any number of statements can be used before the <action> occurs.
> >
> > Since this is an initial version for GCC 14 it has the following
> > limitations and
> > features:
> >
> > - Only fixed sized iterations and buffers are supported.  That is to say any
> >   vectors loaded or stored must be to statically allocated arrays with known
> >   sizes. N must also be known.  This limitation is because our primary target
> >   for this optimization is SVE.  For VLA SVE we can't easily do cross page
> >   iteration checks. The result is likely to also not be beneficial. For that
> >   reason we punt support for variable buffers till we have First-Faulting
> >   support in GCC.
> > - any stores in <statements1> should not be to the same objects as in
> >   <condition>.  Loads are fine as long as they don't have the possibility to
> >   alias.  More concretely, we block RAW dependencies when the intermediate
> >   value can't be separated from the store, or the store itself can't be moved.
> > - Prologue peeling, alignment peeling and loop versioning are supported.
> > - Fully masked loops, unmasked loops and partially masked loops are
> > supported
> > - Any number of loop early exits are supported.
> > - No support for epilogue vectorization.  The only epilogue supported is the
> >   scalar final one.  Peeling code supports it but the code motion code cannot
> >   find instructions to make the move in the epilog.
> > - Early breaks are only supported for inner loop vectorization.
> >
> > I have pushed a branch to refs/users/tnfchris/heads/gcc-14-early-break
> >
> > With the help of IPA and LTO this still gets hit quite often.  During
> > bootstrap it hit rather frequently.  Additionally TSVC s332, s481 and
> > s482 all pass now since these are tests for support for early exit
> vectorization.
> >
> > This implementation does not support completely handling the early
> > break inside the vector loop itself but instead supports adding checks
> > such that if we know that we have to exit in the current iteration
> > then we branch to scalar code to actually do the final VF iterations which
> handles all the code in <action>.
> >
> > For the scalar loop we know that whatever exit you take you have to
> > perform at most VF iterations.  For vector code we only care about the
> > state of fully performed iterations and reset the scalar code to the
> > (partially) remaining loop.
> >
> > That is to say, the first vector loop executes so long as the early
> > exit isn't needed.  Once the exit is taken, the scalar code will
> > perform at most VF extra iterations.  The exact number depending on peeling
> and iteration start and which
> > exit was taken (natural or early).   For this scalar loop, all early exits are
> > treated the same.
> >
> > When we vectorize we move any statement not related to the early break
> > itself and that would be incorrect to execute before the break (i.e.
> > has side effects) to after the break.  If this is not possible we decline to
> vectorize.
> >
> > This means that we check at the start of iterations whether we are
> > going to exit or not.  During the analysis phase we check whether we
> > are allowed to do this moving of statements.  Also note that we only
> > move the scalar statements, but only do so after peeling but just before we
> start transforming statements.
> >
> > Codegen:
> >
> > for e.g.
> >
> > #define N 803
> > unsigned vect_a[N];
> > unsigned vect_b[N];
> >
> > unsigned test4(unsigned x)
> > {
> >  unsigned ret = 0;
> >  for (int i = 0; i < N; i++)
> >  {
> >    vect_b[i] = x + i;
> >    if (vect_a[i] > x)
> >      break;
> >    vect_a[i] = x;
> >
> >  }
> >  return ret;
> > }
> >
> > We generate for Adv. SIMD:
> >
> > test4:
> >         adrp    x2, .LC0
> >         adrp    x3, .LANCHOR0
> >         dup     v2.4s, w0
> >         add     x3, x3, :lo12:.LANCHOR0
> >         movi    v4.4s, 0x4
> >         add     x4, x3, 3216
> >         ldr     q1, [x2, #:lo12:.LC0]
> >         mov     x1, 0
> >         mov     w2, 0
> >         .p2align 3,,7
> > .L3:
> >         ldr     q0, [x3, x1]
> >         add     v3.4s, v1.4s, v2.4s
> >         add     v1.4s, v1.4s, v4.4s
> >         cmhi    v0.4s, v0.4s, v2.4s
> >         umaxp   v0.4s, v0.4s, v0.4s
> >         fmov    x5, d0
> >         cbnz    x5, .L6
> >         add     w2, w2, 1
> >         str     q3, [x1, x4]
> >         str     q2, [x3, x1]
> >         add     x1, x1, 16
> >         cmp     w2, 200
> >         bne     .L3
> >         mov     w7, 3
> > .L2:
> >         lsl     w2, w2, 2
> >         add     x5, x3, 3216
> >         add     w6, w2, w0
> >         sxtw    x4, w2
> >         ldr     w1, [x3, x4, lsl 2]
> >         str     w6, [x5, x4, lsl 2]
> >         cmp     w0, w1
> >         bcc     .L4
> >         add     w1, w2, 1
> >         str     w0, [x3, x4, lsl 2]
> >         add     w6, w1, w0
> >         sxtw    x1, w1
> >         ldr     w4, [x3, x1, lsl 2]
> >         str     w6, [x5, x1, lsl 2]
> >         cmp     w0, w4
> >         bcc     .L4
> >         add     w4, w2, 2
> >         str     w0, [x3, x1, lsl 2]
> >         sxtw    x1, w4
> >         add     w6, w1, w0
> >         ldr     w4, [x3, x1, lsl 2]
> >         str     w6, [x5, x1, lsl 2]
> >         cmp     w0, w4
> >         bcc     .L4
> >         str     w0, [x3, x1, lsl 2]
> >         add     w2, w2, 3
> >         cmp     w7, 3
> >         beq     .L4
> >         sxtw    x1, w2
> >         add     w2, w2, w0
> >         ldr     w4, [x3, x1, lsl 2]
> >         str     w2, [x5, x1, lsl 2]
> >         cmp     w0, w4
> >         bcc     .L4
> >         str     w0, [x3, x1, lsl 2]
> > .L4:
> >         mov     w0, 0
> >         ret
> >         .p2align 2,,3
> > .L6:
> >         mov     w7, 4
> >         b       .L2
> >
> > and for SVE:
> >
> > test4:
> >         adrp    x2, .LANCHOR0
> >         add     x2, x2, :lo12:.LANCHOR0
> >         add     x5, x2, 3216
> >         mov     x3, 0
> >         mov     w1, 0
> >         cntw    x4
> >         mov     z1.s, w0
> >         index   z0.s, #0, #1
> >         ptrue   p1.b, all
> >         ptrue   p0.s, all
> >         .p2align 3,,7
> > .L3:
> >         ld1w    z2.s, p1/z, [x2, x3, lsl 2]
> >         add     z3.s, z0.s, z1.s
> >         cmplo   p2.s, p0/z, z1.s, z2.s
> >         b.any   .L2
> >         st1w    z3.s, p1, [x5, x3, lsl 2]
> >         add     w1, w1, 1
> >         st1w    z1.s, p1, [x2, x3, lsl 2]
> >         add     x3, x3, x4
> >         incw    z0.s
> >         cmp     w3, 803
> >         bls     .L3
> > .L5:
> >         mov     w0, 0
> >         ret
> >         .p2align 2,,3
> > .L2:
> >         cntw    x5
> >         mul     w1, w1, w5
> >         cbz     w5, .L5
> >         sxtw    x1, w1
> >         sub     w5, w5, #1
> >         add     x5, x5, x1
> >         add     x6, x2, 3216
> >         b       .L6
> >         .p2align 2,,3
> > .L14:
> >         str     w0, [x2, x1, lsl 2]
> >         cmp     x1, x5
> >         beq     .L5
> >         mov     x1, x4
> > .L6:
> >         ldr     w3, [x2, x1, lsl 2]
> >         add     w4, w0, w1
> >         str     w4, [x6, x1, lsl 2]
> >         add     x4, x1, 1
> >         cmp     w0, w3
> >         bcs     .L14
> >         mov     w0, 0
> >         ret
> >
> > On the workloads this work is based on we see between 2-3x performance
> > uplift using this patch.
> >
> > Follow up plan:
> >  - Boolean vectorization has several shortcomings.  I've filed PR110223 with
> the
> >    bigger ones that cause vectorization to fail with this patch.
> >  - SLP support.  This is planned for GCC 15 as for the majority of cases
> >    building SLP itself fails.
> 
> It would be nice to get at least single-lane SLP support working.  I think you
> need to treat the gcond as SLP root stmt and basically do discovery on the
> condition as if it were a mask-generating condition.

Hmm ok, will give it a try.

> 
> Code generation would then simply schedule the gcond root instances first
> (that would get you the code motion automagically).

Right, so you're saying treat the gcond's as the seed, and stores as a sink.
And then schedule only the instances without a gcond around such that we
can still vectorize in place to get the branches.  Ok, makes sense.

> 
> So, add a new slp_instance_kind, for example slp_inst_kind_early_break, and
> record the gcond as root stmt.  Possibly "pattern" recognizing
> 
>  gcond <_1 != _2>
> 
> as
> 
>  _mask = _1 != _2;
>  gcond <_mask != 0>
> 
> makes the SLP discovery less fiddly (but in theory you can of course handle
> gconds directly).
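In scalar C terms, the "pattern" rewrite described above is just the usual equivalence between branching on a comparison directly and branching on a materialized boolean mask. The sketch below is a hypothetical illustration of that equivalence, not the actual GIMPLE transform:

```c
#include <assert.h>

/* Branch directly on the comparison, like gcond <_1 != _2>.  */
static int direct_form (int a, int b)
{
  return (a != b) ? 1 : 0;
}

/* Materialize the comparison as a boolean "mask" first, then branch on
   mask != 0, mirroring:
     _mask = _1 != _2;
     gcond <_mask != 0>
   The mask form is what a vector compare naturally produces, which is
   why discovery on the split form is less fiddly.  */
static int split_form (int a, int b)
{
  int mask = (a != b);
  return (mask != 0) ? 1 : 0;
}
```

Both forms compute the same result for all inputs; the split form simply exposes the mask as a value SLP discovery can follow.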
> 
> Is there any part of the series that can be pushed independently?  If so I'll try to
> look at those parts first.
> 

Aside from:

[PATCH 4/21]middle-end: update loop peeling code to maintain LCSSA form for early breaks
[PATCH 7/21]middle-end: update IV update code to support early breaks and arbitrary exits  

The rest lie dormant and don't do anything or disrupt the tree until those two are in.
The rest all just touch up different parts piecewise.

They do rely on the new field introduced in:

[PATCH 3/21]middle-end: Implement code motion and dependency analysis for early breaks

But I can split them out.

I'll start respinning #4 and #7 with your latest changes now.

Thanks,
Tamar

> Thanks,
> Richard.
  
Richard Biener Nov. 7, 2023, 9:42 a.m. UTC | #6
On Mon, 6 Nov 2023, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <rguenther@suse.de>
> > Sent: Monday, November 6, 2023 2:25 PM
> > To: Tamar Christina <Tamar.Christina@arm.com>
> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>
> > Subject: Re: [PATCH v6 0/21]middle-end: Support early break/return auto-
> > vectorization
> > 
> > On Mon, 6 Nov 2023, Tamar Christina wrote:
> > 
> > > Hi All,
> > >
> > > This patch adds initial support for early break vectorization in GCC.
> > > The support is added for any target that implements a vector cbranch
> > > optab, this includes both fully masked and non-masked targets.
> > >
> > > Depending on the operation, the vectorizer may also require support
> > > for boolean mask reductions using Inclusive OR.  This is however only
> > > checked when the comparison would produce multiple statements.
> > >
> > > Note: I am currently struggling to get patch 7 correct in all cases and
> > >       could use some feedback there.
> > >
> > > Concretely the kind of loops supported are of the forms:
> > >
> > >  for (int i = 0; i < N; i++)
> > >  {
> > >    <statements1>
> > >    if (<condition>)
> > >      {
> > >        ...
> > >        <action>;
> > >      }
> > >    <statements2>
> > >  }
> > >
> > > where <action> can be:
> > >  - break
> > >  - return
> > >  - goto
> > >
> > > Any number of statements can be used before the <action> occurs.
> > >
> > > Since this is an initial version for GCC 14 it has the following
> > > limitations and
> > > features:
> > >
> > > - Only fixed sized iterations and buffers are supported.  That is to say any
> > >   vectors loaded or stored must be to statically allocated arrays with known
> > >   sizes. N must also be known.  This limitation is because our primary target
> > >   for this optimization is SVE.  For VLA SVE we can't easily do cross page
> > >   iteration checks. The result is likely to also not be beneficial. For that
> > >   reason we punt support for variable buffers till we have First-Faulting
> > >   support in GCC.

Btw, for this I wonder if you thought about marking memory accesses
required for the early break condition as required to be vector-size
aligned, thus peeling or versioning them for alignment?  That should
ensure they do not fault.

OTOH I somehow remember prologue peeling isn't supported for early
break vectorization?  ..

> > > - any stores in <statements1> should not be to the same objects as in
> > >   <condition>.  Loads are fine as long as they don't have the possibility to
> > >   alias.  More concretely, we block RAW dependencies when the intermediate
> > >   value can't be separated from the store, or the store itself can't be moved.
> > > - Prologue peeling, alignment peeling and loop versioning are supported.

.. but here you say it is.  Not sure if peeling for alignment works for
VLA vectors though.  Just to say x86 doesn't support first-faulting
loads.
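The fault-safety argument behind aligning the speculated loads can be sanity-checked with plain address arithmetic: if an access of S bytes starts at an S-aligned address and S divides the page size, the access can never straddle a page boundary, so a speculative vector load whose first element is in valid memory cannot fault. A sketch of that reasoning (not vectorizer code):

```c
#include <assert.h>

/* Return nonzero if a SIZE-byte access at ADDR touches two pages.  */
static int crosses_page (unsigned long addr, unsigned long size,
                         unsigned long page)
{
  return addr / page != (addr + size - 1) / page;
}

/* Exhaustively check that SIZE-aligned accesses (SIZE dividing the page
   size) never cross a page boundary over a few pages' worth of addresses.  */
static int aligned_access_is_safe (unsigned long size, unsigned long page)
{
  for (unsigned long addr = 0; addr < 4 * page; addr += size)
    if (crosses_page (addr, size, page))
      return 0;
  return 1;
}
```

An unaligned access near the end of a page, by contrast, does cross, which is exactly the case first-faulting loads are designed to tolerate.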

> > > - Fully masked loops, unmasked loops and partially masked loops are
> > > supported
> > > - Any number of loop early exits are supported.
> > > - No support for epilogue vectorization.  The only epilogue supported is the
> > >   scalar final one.  Peeling code supports it but the code motion code cannot
> > >   find instructions to make the move in the epilog.
> > > - Early breaks are only supported for inner loop vectorization.
> > >
> > > I have pushed a branch to refs/users/tnfchris/heads/gcc-14-early-break
> > >
> > > With the help of IPA and LTO this still gets hit quite often.  During
> > > bootstrap it hit rather frequently.  Additionally TSVC s332, s481 and
> > > s482 all pass now since these are tests for support for early exit
> > > vectorization.
> > >
> > > This implementation does not support completely handling the early
> > > break inside the vector loop itself but instead supports adding checks
> > > such that if we know that we have to exit in the current iteration
> > > then we branch to scalar code to actually do the final VF iterations which
> > > handles all the code in <action>.
> > >
> > > For the scalar loop we know that whatever exit you take you have to
> > > perform at most VF iterations.  For vector code we only care about the
> > > state of fully performed iterations and reset the scalar code to the (partially)
> > > remaining loop.
> > >
> > > That is to say, the first vector loop executes so long as the early
> > > exit isn't needed.  Once the exit is taken, the scalar code will
> > > perform at most VF extra iterations.  The exact number depending on peeling
> > > and iteration start and which
> > > exit was taken (natural or early).   For this scalar loop, all early exits are
> > > treated the same.
> > >
> > > When we vectorize we move any statement not related to the early break
> > > itself and that would be incorrect to execute before the break (i.e.
> > > has side effects) to after the break.  If this is not possible we decline to
> > > vectorize.
> > >
> > > This means that we check at the start of iterations whether we are
> > > going to exit or not.  During the analysis phase we check whether we
> > > are allowed to do this moving of statements.  Also note that we only
> > > move the scalar statements, but only do so after peeling but just before we
> > > start transforming statements.
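In scalar C, the scheme just described can be sketched roughly as follows. This is a hypothetical simulation of the control flow only (using the test4 kernel from the codegen example below), not the vectorizer's actual output: each "vector" iteration first checks whether any of the next VF lanes needs the exit, and if so branches to scalar code that performs at most VF iterations, handling the break itself.

```c
#include <assert.h>

#define N  803
#define VF 4

static unsigned a_vec[N], b_vec[N];

/* Returns the index at which the early exit was taken, or N.  */
static int early_exit_sim (unsigned x)
{
  int i = 0;
  for (; i + VF <= N; i += VF)
    {
      int exit_needed = 0;
      for (int l = 0; l < VF; l++)       /* vector compare + "b.any"       */
        if (a_vec[i + l] > x)
          exit_needed = 1;
      if (exit_needed)
        break;                           /* branch to the scalar code      */
      for (int l = 0; l < VF; l++)       /* full block: safe to commit     */
        {
          b_vec[i + l] = x + (unsigned) (i + l);
          a_vec[i + l] = x;
        }
    }
  for (; i < N; i++)                     /* scalar code: at most VF + tail */
    {
      b_vec[i] = x + (unsigned) i;
      if (a_vec[i] > x)
        return i;
      a_vec[i] = x;
    }
  return i;
}
```

Note how stores for a vector block are only committed once the block is known not to exit, which mirrors moving side-effecting statements after the break.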
> > >
> > > Codegen:
> > >
> > > for e.g.
> > >
> > > #define N 803
> > > unsigned vect_a[N];
> > > unsigned vect_b[N];
> > >
> > > unsigned test4(unsigned x)
> > > {
> > >  unsigned ret = 0;
> > >  for (int i = 0; i < N; i++)
> > >  {
> > >    vect_b[i] = x + i;
> > >    if (vect_a[i] > x)
> > >      break;
> > >    vect_a[i] = x;
> > >
> > >  }
> > >  return ret;
> > > }
> > >
> > > We generate for Adv. SIMD:
> > >
> > > test4:
> > >         adrp    x2, .LC0
> > >         adrp    x3, .LANCHOR0
> > >         dup     v2.4s, w0
> > >         add     x3, x3, :lo12:.LANCHOR0
> > >         movi    v4.4s, 0x4
> > >         add     x4, x3, 3216
> > >         ldr     q1, [x2, #:lo12:.LC0]
> > >         mov     x1, 0
> > >         mov     w2, 0
> > >         .p2align 3,,7
> > > .L3:
> > >         ldr     q0, [x3, x1]
> > >         add     v3.4s, v1.4s, v2.4s
> > >         add     v1.4s, v1.4s, v4.4s
> > >         cmhi    v0.4s, v0.4s, v2.4s
> > >         umaxp   v0.4s, v0.4s, v0.4s
> > >         fmov    x5, d0
> > >         cbnz    x5, .L6
> > >         add     w2, w2, 1
> > >         str     q3, [x1, x4]
> > >         str     q2, [x3, x1]
> > >         add     x1, x1, 16
> > >         cmp     w2, 200
> > >         bne     .L3
> > >         mov     w7, 3
> > > .L2:
> > >         lsl     w2, w2, 2
> > >         add     x5, x3, 3216
> > >         add     w6, w2, w0
> > >         sxtw    x4, w2
> > >         ldr     w1, [x3, x4, lsl 2]
> > >         str     w6, [x5, x4, lsl 2]
> > >         cmp     w0, w1
> > >         bcc     .L4
> > >         add     w1, w2, 1
> > >         str     w0, [x3, x4, lsl 2]
> > >         add     w6, w1, w0
> > >         sxtw    x1, w1
> > >         ldr     w4, [x3, x1, lsl 2]
> > >         str     w6, [x5, x1, lsl 2]
> > >         cmp     w0, w4
> > >         bcc     .L4
> > >         add     w4, w2, 2
> > >         str     w0, [x3, x1, lsl 2]
> > >         sxtw    x1, w4
> > >         add     w6, w1, w0
> > >         ldr     w4, [x3, x1, lsl 2]
> > >         str     w6, [x5, x1, lsl 2]
> > >         cmp     w0, w4
> > >         bcc     .L4
> > >         str     w0, [x3, x1, lsl 2]
> > >         add     w2, w2, 3
> > >         cmp     w7, 3
> > >         beq     .L4
> > >         sxtw    x1, w2
> > >         add     w2, w2, w0
> > >         ldr     w4, [x3, x1, lsl 2]
> > >         str     w2, [x5, x1, lsl 2]
> > >         cmp     w0, w4
> > >         bcc     .L4
> > >         str     w0, [x3, x1, lsl 2]
> > > .L4:
> > >         mov     w0, 0
> > >         ret
> > >         .p2align 2,,3
> > > .L6:
> > >         mov     w7, 4
> > >         b       .L2
> > >
> > > and for SVE:
> > >
> > > test4:
> > >         adrp    x2, .LANCHOR0
> > >         add     x2, x2, :lo12:.LANCHOR0
> > >         add     x5, x2, 3216
> > >         mov     x3, 0
> > >         mov     w1, 0
> > >         cntw    x4
> > >         mov     z1.s, w0
> > >         index   z0.s, #0, #1
> > >         ptrue   p1.b, all
> > >         ptrue   p0.s, all
> > >         .p2align 3,,7
> > > .L3:
> > >         ld1w    z2.s, p1/z, [x2, x3, lsl 2]
> > >         add     z3.s, z0.s, z1.s
> > >         cmplo   p2.s, p0/z, z1.s, z2.s
> > >         b.any   .L2
> > >         st1w    z3.s, p1, [x5, x3, lsl 2]
> > >         add     w1, w1, 1
> > >         st1w    z1.s, p1, [x2, x3, lsl 2]
> > >         add     x3, x3, x4
> > >         incw    z0.s
> > >         cmp     w3, 803
> > >         bls     .L3
> > > .L5:
> > >         mov     w0, 0
> > >         ret
> > >         .p2align 2,,3
> > > .L2:
> > >         cntw    x5
> > >         mul     w1, w1, w5
> > >         cbz     w5, .L5
> > >         sxtw    x1, w1
> > >         sub     w5, w5, #1
> > >         add     x5, x5, x1
> > >         add     x6, x2, 3216
> > >         b       .L6
> > >         .p2align 2,,3
> > > .L14:
> > >         str     w0, [x2, x1, lsl 2]
> > >         cmp     x1, x5
> > >         beq     .L5
> > >         mov     x1, x4
> > > .L6:
> > >         ldr     w3, [x2, x1, lsl 2]
> > >         add     w4, w0, w1
> > >         str     w4, [x6, x1, lsl 2]
> > >         add     x4, x1, 1
> > >         cmp     w0, w3
> > >         bcs     .L14
> > >         mov     w0, 0
> > >         ret
> > >
> > > On the workloads this work is based on we see between 2-3x performance
> > > uplift using this patch.
> > >
> > > Follow up plan:
> > >  - Boolean vectorization has several shortcomings.  I've filed PR110223 with
> > >    the bigger ones that cause vectorization to fail with this patch.
> > >  - SLP support.  This is planned for GCC 15 as for the majority of cases
> > >    SLP build itself fails.
> > 
> > It would be nice to get at least single-lane SLP support working.  I think you
> > need to treat the gcond as SLP root stmt and basically do discovery on the
> > condition as if it were a mask generating condition.
> 
> Hmm ok, will give it  a try.
> 
> > 
> > Code generation would then simply schedule the gcond root instances first
> > (that would get you the code motion automagically).
> 
> Right, so you're saying treat the gcond's as the seed, and stores as a sink.
> And then schedule only the instances without a gcond around such that we
> can still vectorize in place to get the branches.  Ok, makes sense.
> 
> > 
> > So, add a new slp_instance_kind, for example slp_inst_kind_early_break, and
> > record the gcond as root stmt.  Possibly "pattern" recognizing
> > 
> >  gcond <_1 != _2>
> > 
> > as
> > 
> >  _mask = _1 != _2;
> >  gcond <_mask != 0>
> > 
> > makes the SLP discovery less fiddly (but in theory you can of course handle
> > gconds directly).
> > 
> > Is there any part of the series that can be pushed independently?  If so I'll try to
> > look at those parts first.
> > 
> 
> Aside from:
> 
> [PATCH 4/21]middle-end: update loop peeling code to maintain LCSSA form for early breaks
> [PATCH 7/21]middle-end: update IV update code to support early breaks and arbitrary exits  
> 
> The rest lie dormant and don't do anything or disrupt the tree until those two are in.
> The rest all just touch up different parts piecewise.
> 
> They do rely on the new field introduced in:
> 
> [PATCH 3/21]middle-end: Implement code motion and dependency analysis for early breaks
> 
> But can split them out.
> 
> I'll start respinning no #4 and #7 with your latest changes now.

OK, I'll simply go 1-n then.

Richard.

> Thanks,
> Tamar
> 
> > Thanks,
> > Richard.
  
Tamar Christina Nov. 7, 2023, 10:47 a.m. UTC | #7
> -----Original Message-----
> From: Richard Biener <rguenther@suse.de>
> Sent: Tuesday, November 7, 2023 9:43 AM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>
> Subject: RE: [PATCH v6 0/21]middle-end: Support early break/return auto-
> vectorization
> 
> On Mon, 6 Nov 2023, Tamar Christina wrote:
> 
> > > -----Original Message-----
> > > From: Richard Biener <rguenther@suse.de>
> > > Sent: Monday, November 6, 2023 2:25 PM
> > > To: Tamar Christina <Tamar.Christina@arm.com>
> > > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>
> > > Subject: Re: [PATCH v6 0/21]middle-end: Support early break/return
> > > auto- vectorization
> > >
> > > On Mon, 6 Nov 2023, Tamar Christina wrote:
> > >
> > > > Hi All,
> > > >
> > > > This patch adds initial support for early break vectorization in GCC.
> > > > The support is added for any target that implements a vector
> > > > cbranch optab, this includes both fully masked and non-masked targets.
> > > >
> > > > Depending on the operation, the vectorizer may also require
> > > > support for boolean mask reductions using Inclusive OR.  This is
> > > > however only checked when the comparison would produce multiple
> > > > statements.
> > > >
> > > > Note: I am currently struggling to get patch 7 correct in all cases
> > > > and could use some feedback there.
> > > >
> > > > Concretely the kind of loops supported are of the forms:
> > > >
> > > >  for (int i = 0; i < N; i++)
> > > >  {
> > > >    <statements1>
> > > >    if (<condition>)
> > > >      {
> > > >        ...
> > > >        <action>;
> > > >      }
> > > >    <statements2>
> > > >  }
> > > >
> > > > where <action> can be:
> > > >  - break
> > > >  - return
> > > >  - goto
> > > >
> > > > Any number of statements can be used before the <action> occurs.
> > > >
> > > > Since this is an initial version for GCC 14 it has the following
> > > > limitations and
> > > > features:
> > > >
> > > > - Only fixed sized iterations and buffers are supported.  That is to say any
> > > >   vectors loaded or stored must be to statically allocated arrays with known
> > > >   sizes. N must also be known.  This limitation is because our primary target
> > > >   for this optimization is SVE.  For VLA SVE we can't easily do cross page
> > > >   iteration checks. The result is likely to also not be beneficial. For that
> > > >   reason we punt support for variable buffers till we have First-Faulting
> > > >   support in GCC.
> 
> Btw, for this I wonder if you thought about marking memory accesses required
> for the early break condition as required to be vector-size aligned, thus peeling
> or versioning them for alignment?  That should ensure they do not fault.
> 
> OTOH I somehow remember prologue peeling isn't supported for early break
> vectorization?  ..
> 
> > > > - any stores in <statements1> should not be to the same objects as in
> > > >   <condition>.  Loads are fine as long as they don't have the possibility to
> > > >   alias.  More concretely, we block RAW dependencies when the intermediate
> > > >   value can't be separated from the store, or the store itself can't be moved.
> > > > - Prologue peeling, alignment peeling and loop versioning are supported.
> 
> .. but here you say it is.  Not sure if peeling for alignment works for VLA vectors
> though.  Just to say x86 doesn't support first-faulting loads.

For VLA we support it through masking.  I.e. if you need to peel N iterations, we
generate a masked copy of the vectorized loop which masks off the first N lanes.

This is not typically needed, but we do support it.  But the problem with this
scheme and early break is obviously that the peeled loop needs to be vectorized
so you kind of end up with the same issue again.  So at the moment we reject it for VLA.
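A scalar sketch of what "peeling by masking" means here (hypothetical, with the lane mask modeled as a plain per-lane predicate): instead of running the PEEL prologue iterations as scalar code, the first vector iteration executes with its leading lanes inactive, and all later iterations run fully active.

```c
#include <assert.h>

#define VLEN 8   /* elements per vector iteration */

/* Sum BUF[PEEL..N-1], expressed as full-width vector iterations where the
   lanes below PEEL are simply masked off.  Only the first vector
   iteration can actually have inactive leading lanes.  */
static long masked_peel_sum (const long *buf, int n, int peel)
{
  long sum = 0;
  for (int base = 0; base < n; base += VLEN)          /* one "vector" iter */
    for (int lane = 0; lane < VLEN && base + lane < n; lane++)
      if (base + lane >= peel)                        /* lane predicate    */
        sum += buf[base + lane];
  return sum;
}
```

The snag described above is visible in the sketch: the masked first iteration is still a full vector iteration, so for early-break loops it would itself need the speculative-load machinery.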

Regards,
Tamar

> 
> > > > - Fully masked loops, unmasked loops and partially masked loops
> > > > are supported
> > > > - Any number of loop early exits are supported.
> > > > - No support for epilogue vectorization.  The only epilogue supported is the
> > > >   scalar final one.  Peeling code supports it but the code motion code cannot
> > > >   find instructions to make the move in the epilog.
> > > > - Early breaks are only supported for inner loop vectorization.
> > > >
> > > > I have pushed a branch to
> > > > refs/users/tnfchris/heads/gcc-14-early-break
> > > >
> > > > With the help of IPA and LTO this still gets hit quite often.
> > > > During bootstrap it hit rather frequently.  Additionally TSVC
> > > > s332, s481 and
> > > > s482 all pass now since these are tests for support for early exit
> > > > vectorization.
> > > >
> > > > This implementation does not support completely handling the early
> > > > break inside the vector loop itself but instead supports adding
> > > > checks such that if we know that we have to exit in the current
> > > > iteration then we branch to scalar code to actually do the final
> > > > VF iterations which
> > > > handles all the code in <action>.
> > > >
> > > > For the scalar loop we know that whatever exit you take you have
> > > > to perform at most VF iterations.  For vector code we only care
> > > > about the state of fully performed iterations and reset the scalar
> > > > code to the (partially) remaining loop.
> > > >
> > > > That is to say, the first vector loop executes so long as the
> > > > early exit isn't needed.  Once the exit is taken, the scalar code
> > > > will perform at most VF extra iterations.  The exact number
> > > > depending on peeling and iteration start and which
> > > > exit was taken (natural or early).   For this scalar loop, all early exits are
> > > > treated the same.
> > > >
> > > > When we vectorize we move any statement not related to the early
> > > > break itself and that would be incorrect to execute before the break (i.e.
> > > > has side effects) to after the break.  If this is not possible we
> > > > decline to
> > > > vectorize.
> > > >
> > > > This means that we check at the start of iterations whether we are
> > > > going to exit or not.  During the analysis phase we check whether
> > > > we are allowed to do this moving of statements.  Also note that we
> > > > only move the scalar statements, but only do so after peeling but
> > > > just before we
> > > > start transforming statements.
> > > >
> > > > Codegen:
> > > >
> > > > for e.g.
> > > >
> > > > #define N 803
> > > > unsigned vect_a[N];
> > > > unsigned vect_b[N];
> > > >
> > > > unsigned test4(unsigned x)
> > > > {
> > > >  unsigned ret = 0;
> > > >  for (int i = 0; i < N; i++)
> > > >  {
> > > >    vect_b[i] = x + i;
> > > >    if (vect_a[i] > x)
> > > >      break;
> > > >    vect_a[i] = x;
> > > >
> > > >  }
> > > >  return ret;
> > > > }
> > > >
> > > > We generate for Adv. SIMD:
> > > >
> > > > test4:
> > > >         adrp    x2, .LC0
> > > >         adrp    x3, .LANCHOR0
> > > >         dup     v2.4s, w0
> > > >         add     x3, x3, :lo12:.LANCHOR0
> > > >         movi    v4.4s, 0x4
> > > >         add     x4, x3, 3216
> > > >         ldr     q1, [x2, #:lo12:.LC0]
> > > >         mov     x1, 0
> > > >         mov     w2, 0
> > > >         .p2align 3,,7
> > > > .L3:
> > > >         ldr     q0, [x3, x1]
> > > >         add     v3.4s, v1.4s, v2.4s
> > > >         add     v1.4s, v1.4s, v4.4s
> > > >         cmhi    v0.4s, v0.4s, v2.4s
> > > >         umaxp   v0.4s, v0.4s, v0.4s
> > > >         fmov    x5, d0
> > > >         cbnz    x5, .L6
> > > >         add     w2, w2, 1
> > > >         str     q3, [x1, x4]
> > > >         str     q2, [x3, x1]
> > > >         add     x1, x1, 16
> > > >         cmp     w2, 200
> > > >         bne     .L3
> > > >         mov     w7, 3
> > > > .L2:
> > > >         lsl     w2, w2, 2
> > > >         add     x5, x3, 3216
> > > >         add     w6, w2, w0
> > > >         sxtw    x4, w2
> > > >         ldr     w1, [x3, x4, lsl 2]
> > > >         str     w6, [x5, x4, lsl 2]
> > > >         cmp     w0, w1
> > > >         bcc     .L4
> > > >         add     w1, w2, 1
> > > >         str     w0, [x3, x4, lsl 2]
> > > >         add     w6, w1, w0
> > > >         sxtw    x1, w1
> > > >         ldr     w4, [x3, x1, lsl 2]
> > > >         str     w6, [x5, x1, lsl 2]
> > > >         cmp     w0, w4
> > > >         bcc     .L4
> > > >         add     w4, w2, 2
> > > >         str     w0, [x3, x1, lsl 2]
> > > >         sxtw    x1, w4
> > > >         add     w6, w1, w0
> > > >         ldr     w4, [x3, x1, lsl 2]
> > > >         str     w6, [x5, x1, lsl 2]
> > > >         cmp     w0, w4
> > > >         bcc     .L4
> > > >         str     w0, [x3, x1, lsl 2]
> > > >         add     w2, w2, 3
> > > >         cmp     w7, 3
> > > >         beq     .L4
> > > >         sxtw    x1, w2
> > > >         add     w2, w2, w0
> > > >         ldr     w4, [x3, x1, lsl 2]
> > > >         str     w2, [x5, x1, lsl 2]
> > > >         cmp     w0, w4
> > > >         bcc     .L4
> > > >         str     w0, [x3, x1, lsl 2]
> > > > .L4:
> > > >         mov     w0, 0
> > > >         ret
> > > >         .p2align 2,,3
> > > > .L6:
> > > >         mov     w7, 4
> > > >         b       .L2
> > > >
> > > > and for SVE:
> > > >
> > > > test4:
> > > >         adrp    x2, .LANCHOR0
> > > >         add     x2, x2, :lo12:.LANCHOR0
> > > >         add     x5, x2, 3216
> > > >         mov     x3, 0
> > > >         mov     w1, 0
> > > >         cntw    x4
> > > >         mov     z1.s, w0
> > > >         index   z0.s, #0, #1
> > > >         ptrue   p1.b, all
> > > >         ptrue   p0.s, all
> > > >         .p2align 3,,7
> > > > .L3:
> > > >         ld1w    z2.s, p1/z, [x2, x3, lsl 2]
> > > >         add     z3.s, z0.s, z1.s
> > > >         cmplo   p2.s, p0/z, z1.s, z2.s
> > > >         b.any   .L2
> > > >         st1w    z3.s, p1, [x5, x3, lsl 2]
> > > >         add     w1, w1, 1
> > > >         st1w    z1.s, p1, [x2, x3, lsl 2]
> > > >         add     x3, x3, x4
> > > >         incw    z0.s
> > > >         cmp     w3, 803
> > > >         bls     .L3
> > > > .L5:
> > > >         mov     w0, 0
> > > >         ret
> > > >         .p2align 2,,3
> > > > .L2:
> > > >         cntw    x5
> > > >         mul     w1, w1, w5
> > > >         cbz     w5, .L5
> > > >         sxtw    x1, w1
> > > >         sub     w5, w5, #1
> > > >         add     x5, x5, x1
> > > >         add     x6, x2, 3216
> > > >         b       .L6
> > > >         .p2align 2,,3
> > > > .L14:
> > > >         str     w0, [x2, x1, lsl 2]
> > > >         cmp     x1, x5
> > > >         beq     .L5
> > > >         mov     x1, x4
> > > > .L6:
> > > >         ldr     w3, [x2, x1, lsl 2]
> > > >         add     w4, w0, w1
> > > >         str     w4, [x6, x1, lsl 2]
> > > >         add     x4, x1, 1
> > > >         cmp     w0, w3
> > > >         bcs     .L14
> > > >         mov     w0, 0
> > > >         ret
> > > >
> > > > On the workloads this work is based on we see between 2-3x
> > > > performance uplift using this patch.
> > > >
> > > > Follow up plan:
> > > >  - Boolean vectorization has several shortcomings.  I've filed PR110223 with
> > > >    the bigger ones that cause vectorization to fail with this patch.
> > > >  - SLP support.  This is planned for GCC 15 as for the majority of cases
> > > >    SLP build itself fails.
> > >
> > > It would be nice to get at least single-lane SLP support working.  I
> > > think you need to treat the gcond as SLP root stmt and basically do
> > > discovery on the condition as if it were a mask generating condition.
> >
> > Hmm ok, will give it  a try.
> >
> > >
> > > Code generation would then simply schedule the gcond root instances
> > > first (that would get you the code motion automagically).
> >
> > Right, so you're saying treat the gcond's as the seed, and stores as a sink.
> > And then schedule only the instances without a gcond around such that
> > we can still vectorize in place to get the branches.  Ok, makes sense.
> >
> > >
> > > So, add a new slp_instance_kind, for example
> > > slp_inst_kind_early_break, and record the gcond as root stmt.
> > > Possibly "pattern" recognizing
> > >
> > >  gcond <_1 != _2>
> > >
> > > as
> > >
> > >  _mask = _1 != _2;
> > >  gcond <_mask != 0>
> > >
> > > makes the SLP discovery less fiddly (but in theory you can of course
> > > handle gconds directly).
> > >
> > > Is there any part of the series that can be pushed independently?  If
> > > so I'll try to look at those parts first.
> > >
> >
> > Aside from:
> >
> > [PATCH 4/21]middle-end: update loop peeling code to maintain LCSSA form for early breaks
> > [PATCH 7/21]middle-end: update IV update code to support early breaks and arbitrary exits
> >
> > The rest lie dormant and don't do anything or disrupt the tree until those
> > two are in.
> > The rest all just touch up different parts piecewise.
> >
> > They do rely on the new field introduced in:
> >
> > [PATCH 3/21]middle-end: Implement code motion and dependency analysis
> > for early breaks
> >
> > But can split them out.
> >
> > I'll start respinning no #4 and #7 with your latest changes now.
> 
> OK, I'll simply go 1-n then.
> 
> Richard.
> 
> > Thanks,
> > Tamar
> >
> > > Thanks,
> > > Richard.
  
Richard Biener Nov. 7, 2023, 1:58 p.m. UTC | #8
On Tue, 7 Nov 2023, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <rguenther@suse.de>
> > Sent: Tuesday, November 7, 2023 9:43 AM
> > To: Tamar Christina <Tamar.Christina@arm.com>
> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>
> > Subject: RE: [PATCH v6 0/21]middle-end: Support early break/return auto-
> > vectorization
> > 
> > On Mon, 6 Nov 2023, Tamar Christina wrote:
> > 
> > > > -----Original Message-----
> > > > From: Richard Biener <rguenther@suse.de>
> > > > Sent: Monday, November 6, 2023 2:25 PM
> > > > To: Tamar Christina <Tamar.Christina@arm.com>
> > > > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>
> > > > Subject: Re: [PATCH v6 0/21]middle-end: Support early break/return
> > > > auto- vectorization
> > > >
> > > > On Mon, 6 Nov 2023, Tamar Christina wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > This patch adds initial support for early break vectorization in GCC.
> > > > > The support is added for any target that implements a vector
> > > > > cbranch optab, this includes both fully masked and non-masked targets.
> > > > >
> > > > > Depending on the operation, the vectorizer may also require
> > > > > support for boolean mask reductions using Inclusive OR.  This is
> > > > > however only checked when the comparison would produce multiple
> > > > > statements.
> > > > >
> > > > > Note: I am currently struggling to get patch 7 correct in all cases
> > > > > and could use some feedback there.
> > > > >
> > > > > Concretely the kind of loops supported are of the forms:
> > > > >
> > > > >  for (int i = 0; i < N; i++)
> > > > >  {
> > > > >    <statements1>
> > > > >    if (<condition>)
> > > > >      {
> > > > >        ...
> > > > >        <action>;
> > > > >      }
> > > > >    <statements2>
> > > > >  }
> > > > >
> > > > > where <action> can be:
> > > > >  - break
> > > > >  - return
> > > > >  - goto
> > > > >
> > > > > Any number of statements can be used before the <action> occurs.
> > > > >
> > > > > Since this is an initial version for GCC 14 it has the following
> > > > > limitations and
> > > > > features:
> > > > >
> > > > > - Only fixed sized iterations and buffers are supported.  That is to say any
> > > > >   vectors loaded or stored must be to statically allocated arrays with known
> > > > >   sizes. N must also be known.  This limitation is because our primary target
> > > > >   for this optimization is SVE.  For VLA SVE we can't easily do cross page
> > > > >   iteration checks. The result is likely to also not be beneficial. For that
> > > > >   reason we punt support for variable buffers till we have First-Faulting
> > > > >   support in GCC.
> > 
> > Btw, for this I wonder if you thought about marking memory accesses required
> > for the early break condition as required to be vector-size aligned, thus peeling
> > or versioning them for alignment?  That should ensure they do not fault.
> > 
> > OTOH I somehow remember prologue peeling isn't supported for early break
> > vectorization?  ..
> > 
> > > > > - any stores in <statements1> should not be to the same objects as in
> > > > >   <condition>.  Loads are fine as long as they don't have the possibility to
> > > > >   alias.  More concretely, we block RAW dependencies when the intermediate
> > > > >   value can't be separated from the store, or the store itself can't be moved.
> > > > > - Prologue peeling, alignment peeling and loop versioning are supported.
> > 
> > .. but here you say it is.  Not sure if peeling for alignment works for VLA vectors
> > though.  Just to say x86 doesn't support first-faulting loads.
> 
> For VLA we support it through masking.  i.e. if you need to peel N iterations, we
> generate a masked copy of the loop vectorized which masks off the first N bits.
> 
> This is not typically needed, but we do support it.  But the problem with this
> scheme and early break is obviously that the peeled loop needs to be vectorized
> so you kinda end up with the same issue again.  So Atm it rejects it for VLA.

Hmm, I see.  I thought peeling by masking is an optimization.  Anyhow,
I think it should still work here - since all accesses are aligned and
we know that there's at least one original scalar iteration in the
first masked vector iteration, the following "unmasked" vector iterations
should never fault for any of the aligned accesses.

I think going via alignment is a way easier method to guarantee this
than handwaving about "declared" arrays and niter.  One can try that
in addition of course - it's not always possible to align all
vector loads we are going to speculate (for VLA one could also
find common runtime (mis-)alignment and restrict the vector length based
on that, for RISC-V it seems to be efficient, not sure whether altering
that for SVE is though).
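The runtime-misalignment idea mentioned in the parenthetical can be sketched as simple address arithmetic (a hypothetical helper, not an existing vectorizer function): shrink the vector length until the whole access is aligned to its own size, so the speculative load cannot cross a page, or any larger alignment, boundary.

```c
#include <assert.h>

/* Given the runtime address of a speculated load, the element size in
   bytes, and the maximum vector length in elements (all powers of two),
   return the largest VL such that a VL-element access at ADDR is
   aligned to its own size and therefore stays within one naturally
   aligned block.  */
static unsigned long restrict_vl (unsigned long addr, unsigned long elem_size,
                                  unsigned long max_vl)
{
  unsigned long vl = max_vl;
  /* Halve the VL until the access size divides the address.  */
  while (vl > 1 && addr % (vl * elem_size) != 0)
    vl /= 2;
  return vl;
}
```

A fully aligned pointer keeps the maximum VL, while a misaligned one degrades, in the worst case to scalar, which is the cost/benefit question raised above for SVE.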

Richard.

> Regards,
> Tamar
> 
> > 
> > > > > - Fully masked loops, unmasked loops and partially masked loops
> > > > > are supported
> > > > > - Any number of loop early exits are supported.
> > > > > - No support for epilogue vectorization.  The only epilogue supported is the
> > > > >   scalar final one.  Peeling code supports it but the code motion code cannot
> > > > >   find instructions to make the move in the epilog.
> > > > > - Early breaks are only supported for inner loop vectorization.
> > > > >
> > > > > I have pushed a branch to
> > > > > refs/users/tnfchris/heads/gcc-14-early-break
> > > > >
> > > > > With the help of IPA and LTO this still gets hit quite often.
> > > > > During bootstrap it hit rather frequently.  Additionally TSVC
> > > > > s332, s481 and
> > > > > s482 all pass now since these are tests for support for early exit
> > > > vectorization.
> > > > >
> > > > > This implementation does not support completely handling the early
> > > > > break inside the vector loop itself but instead supports adding
> > > > > checks such that if we know that we have to exit in the current
> > > > > iteration then we branch to scalar code to actually do the final
> > > > > VF iterations which handles all the code in <action>.
> > > > >
> > > > > For the scalar loop we know that whatever exit you take you have
> > > > > to perform at most VF iterations.  For vector code we only care
> > > > > about the state of fully performed iterations and reset the scalar
> > > > > code to the (partially)
> > > > remaining loop.
> > > > >
> > > > > That is to say, the first vector loop executes so long as the
> > > > > early exit isn't needed.  Once the exit is taken, the scalar code
> > > > > will perform at most VF extra iterations.  The exact number depends
> > > > > on peeling and iteration start and which
> > > > > exit was taken (natural or early).  For this scalar loop, all early exits are
> > > > > treated the same.
> > > > >
> > > > > When we vectorize we move any statement not related to the early
> > > > > break itself and that would be incorrect to execute before the break (i.e.
> > > > > has side effects) to after the break.  If this is not possible we
> > > > > decline to
> > > > vectorize.
> > > > >
> > > > > This means that we check at the start of each iteration whether we are
> > > > > going to exit or not.  During the analysis phase we check whether
> > > > > we are allowed to do this moving of statements.  Also note that we
> > > > > only move the scalar statements, and do so after peeling but just
> > > > > before we start transforming statements.
> > > > >
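To make the described scheme concrete, the following is a rough scalar model of the generated structure, not actual GCC output: the vector factor is fixed at 4 purely for illustration, `test4_model` is a hypothetical name, and the body matches the test4 example further down.

```c
#define N 803
#define VF 4   /* illustrative vector factor, not what GCC would pick */

unsigned vect_a[N];
unsigned vect_b[N];

/* Scalar model of the strategy: the "vector" loop only computes whether
   ANY lane in the current group of VF iterations would take the early
   exit; if so, it branches to a scalar tail that re-runs at most VF
   iterations with the full per-iteration semantics, including <action>.  */
unsigned test4_model (unsigned x)
{
  int i = 0;
  for (; i + VF <= N; i += VF)
    {
      int exit_needed = 0;
      for (int l = 0; l < VF; l++)        /* vectorized compare ...      */
        exit_needed |= vect_a[i + l] > x; /* ... plus any-lane reduction */
      if (exit_needed)
        break;                            /* cbranch to the scalar code  */
      for (int l = 0; l < VF; l++)
        {                                 /* side effects happen only    */
          vect_b[i + l] = x + i + l;      /* once the whole group is     */
          vect_a[i + l] = x;              /* known not to exit           */
        }
    }
  /* Scalar tail: at most VF (remaining) iterations of the original loop.  */
  for (int l = 0; l < VF && i + l < N; l++)
    {
      vect_b[i + l] = x + i + l;
      if (vect_a[i + l] > x)
        break;
      vect_a[i + l] = x;
    }
  return 0;
}
```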
> > > > > Codegen:
> > > > >
> > > > > for e.g.
> > > > >
> > > > > #define N 803
> > > > > unsigned vect_a[N];
> > > > > unsigned vect_b[N];
> > > > >
> > > > > unsigned test4(unsigned x)
> > > > > {
> > > > >  unsigned ret = 0;
> > > > >  for (int i = 0; i < N; i++)
> > > > >  {
> > > > >    vect_b[i] = x + i;
> > > > >    if (vect_a[i] > x)
> > > > >      break;
> > > > >    vect_a[i] = x;
> > > > >
> > > > >  }
> > > > >  return ret;
> > > > > }
> > > > >
> > > > > We generate for Adv. SIMD:
> > > > >
> > > > > test4:
> > > > >         adrp    x2, .LC0
> > > > >         adrp    x3, .LANCHOR0
> > > > >         dup     v2.4s, w0
> > > > >         add     x3, x3, :lo12:.LANCHOR0
> > > > >         movi    v4.4s, 0x4
> > > > >         add     x4, x3, 3216
> > > > >         ldr     q1, [x2, #:lo12:.LC0]
> > > > >         mov     x1, 0
> > > > >         mov     w2, 0
> > > > >         .p2align 3,,7
> > > > > .L3:
> > > > >         ldr     q0, [x3, x1]
> > > > >         add     v3.4s, v1.4s, v2.4s
> > > > >         add     v1.4s, v1.4s, v4.4s
> > > > >         cmhi    v0.4s, v0.4s, v2.4s
> > > > >         umaxp   v0.4s, v0.4s, v0.4s
> > > > >         fmov    x5, d0
> > > > >         cbnz    x5, .L6
> > > > >         add     w2, w2, 1
> > > > >         str     q3, [x1, x4]
> > > > >         str     q2, [x3, x1]
> > > > >         add     x1, x1, 16
> > > > >         cmp     w2, 200
> > > > >         bne     .L3
> > > > >         mov     w7, 3
> > > > > .L2:
> > > > >         lsl     w2, w2, 2
> > > > >         add     x5, x3, 3216
> > > > >         add     w6, w2, w0
> > > > >         sxtw    x4, w2
> > > > >         ldr     w1, [x3, x4, lsl 2]
> > > > >         str     w6, [x5, x4, lsl 2]
> > > > >         cmp     w0, w1
> > > > >         bcc     .L4
> > > > >         add     w1, w2, 1
> > > > >         str     w0, [x3, x4, lsl 2]
> > > > >         add     w6, w1, w0
> > > > >         sxtw    x1, w1
> > > > >         ldr     w4, [x3, x1, lsl 2]
> > > > >         str     w6, [x5, x1, lsl 2]
> > > > >         cmp     w0, w4
> > > > >         bcc     .L4
> > > > >         add     w4, w2, 2
> > > > >         str     w0, [x3, x1, lsl 2]
> > > > >         sxtw    x1, w4
> > > > >         add     w6, w1, w0
> > > > >         ldr     w4, [x3, x1, lsl 2]
> > > > >         str     w6, [x5, x1, lsl 2]
> > > > >         cmp     w0, w4
> > > > >         bcc     .L4
> > > > >         str     w0, [x3, x1, lsl 2]
> > > > >         add     w2, w2, 3
> > > > >         cmp     w7, 3
> > > > >         beq     .L4
> > > > >         sxtw    x1, w2
> > > > >         add     w2, w2, w0
> > > > >         ldr     w4, [x3, x1, lsl 2]
> > > > >         str     w2, [x5, x1, lsl 2]
> > > > >         cmp     w0, w4
> > > > >         bcc     .L4
> > > > >         str     w0, [x3, x1, lsl 2]
> > > > > .L4:
> > > > >         mov     w0, 0
> > > > >         ret
> > > > >         .p2align 2,,3
> > > > > .L6:
> > > > >         mov     w7, 4
> > > > >         b       .L2
> > > > >
> > > > > and for SVE:
> > > > >
> > > > > test4:
> > > > >         adrp    x2, .LANCHOR0
> > > > >         add     x2, x2, :lo12:.LANCHOR0
> > > > >         add     x5, x2, 3216
> > > > >         mov     x3, 0
> > > > >         mov     w1, 0
> > > > >         cntw    x4
> > > > >         mov     z1.s, w0
> > > > >         index   z0.s, #0, #1
> > > > >         ptrue   p1.b, all
> > > > >         ptrue   p0.s, all
> > > > >         .p2align 3,,7
> > > > > .L3:
> > > > >         ld1w    z2.s, p1/z, [x2, x3, lsl 2]
> > > > >         add     z3.s, z0.s, z1.s
> > > > >         cmplo   p2.s, p0/z, z1.s, z2.s
> > > > >         b.any   .L2
> > > > >         st1w    z3.s, p1, [x5, x3, lsl 2]
> > > > >         add     w1, w1, 1
> > > > >         st1w    z1.s, p1, [x2, x3, lsl 2]
> > > > >         add     x3, x3, x4
> > > > >         incw    z0.s
> > > > >         cmp     w3, 803
> > > > >         bls     .L3
> > > > > .L5:
> > > > >         mov     w0, 0
> > > > >         ret
> > > > >         .p2align 2,,3
> > > > > .L2:
> > > > >         cntw    x5
> > > > >         mul     w1, w1, w5
> > > > >         cbz     w5, .L5
> > > > >         sxtw    x1, w1
> > > > >         sub     w5, w5, #1
> > > > >         add     x5, x5, x1
> > > > >         add     x6, x2, 3216
> > > > >         b       .L6
> > > > >         .p2align 2,,3
> > > > > .L14:
> > > > >         str     w0, [x2, x1, lsl 2]
> > > > >         cmp     x1, x5
> > > > >         beq     .L5
> > > > >         mov     x1, x4
> > > > > .L6:
> > > > >         ldr     w3, [x2, x1, lsl 2]
> > > > >         add     w4, w0, w1
> > > > >         str     w4, [x6, x1, lsl 2]
> > > > >         add     x4, x1, 1
> > > > >         cmp     w0, w3
> > > > >         bcs     .L14
> > > > >         mov     w0, 0
> > > > >         ret
> > > > >
> > > > > On the workloads this work is based on we see between 2-3x
> > > > > performance uplift using this patch.
> > > > >
> > > > > Follow up plan:
> > > > >  - Boolean vectorization has several shortcomings.  I've filed
> > > > >    PR110223 with the bigger ones that cause vectorization to fail
> > > > >    with this patch.
> > > > >  - SLP support.  This is planned for GCC 15 as for the majority of
> > > > >    cases building SLP itself fails.
> > > >
> > > > It would be nice to get at least single-lane SLP support working.  I
> > > > think you need to treat the gcond as SLP root stmt and basically do
> > > > discovery on the condition as if it were a mask generating condition.
> > >
> > > Hmm ok, will give it a try.
> > >
> > > >
> > > > Code generation would then simply schedule the gcond root instances
> > > > first (that would get you the code motion automagically).
> > >
> > > Right, so you're saying treat the gconds as the seed, and stores as a sink.
> > > And then schedule only the instances without a gcond around such that
> > > we can still vectorize in place to get the branches.  Ok, makes sense.
> > >
> > > >
> > > > So, add a new slp_instance_kind, for example
> > > > slp_inst_kind_early_break, and record the gcond as root stmt.
> > > > Possibly "pattern" recognizing
> > > >
> > > >  gcond <_1 != _2>
> > > >
> > > > as
> > > >
> > > >  _mask = _1 != _2;
> > > >  gcond <_mask != 0>
> > > >
> > > > makes the SLP discovery less fiddly (but in theory you can of course
> > > > handle gconds directly).
> > > >
> > > > Is there any part of the series that can be pushed independently?  If
> > > > so I'll try to look at those parts first.
> > > >
> > >
> > > Aside from:
> > >
> > > [PATCH 4/21]middle-end: update loop peeling code to maintain LCSSA
> > > form for early breaks [PATCH 7/21]middle-end: update IV update code to
> > > support early breaks and arbitrary exits
> > >
> > > The rest lie dormant and don't do anything or disrupt the tree until
> > > those two are in.
> > > The rest all just touch up different parts piecewise.
> > >
> > > They do rely on the new field introduced in:
> > >
> > > [PATCH 3/21]middle-end: Implement code motion and dependency analysis
> > > for early breaks
> > >
> > > But can split them out.
> > >
> > > I'll start respinning #4 and #7 with your latest changes now.
> > 
> > OK, I'll simply go 1-n then.
> > 
> > Richard.
> > 
> > > Thanks,
> > > Tamar
> > >
> > > > Thanks,
> > > > Richard.
>
  
Richard Sandiford Nov. 27, 2023, 6:30 p.m. UTC | #9
Catching up on backlog, so this might already be resolved, but:

Richard Biener <rguenther@suse.de> writes:
> On Tue, 7 Nov 2023, Tamar Christina wrote:
>
>> > -----Original Message-----
>> > From: Richard Biener <rguenther@suse.de>
>> > Sent: Tuesday, November 7, 2023 9:43 AM
>> > To: Tamar Christina <Tamar.Christina@arm.com>
>> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>
>> > Subject: RE: [PATCH v6 0/21]middle-end: Support early break/return auto-
>> > vectorization
>> > 
>> > On Mon, 6 Nov 2023, Tamar Christina wrote:
>> > 
>> > > > -----Original Message-----
>> > > > From: Richard Biener <rguenther@suse.de>
>> > > > Sent: Monday, November 6, 2023 2:25 PM
>> > > > To: Tamar Christina <Tamar.Christina@arm.com>
>> > > > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>
>> > > > Subject: Re: [PATCH v6 0/21]middle-end: Support early break/return
>> > > > auto- vectorization
>> > > >
>> > > > On Mon, 6 Nov 2023, Tamar Christina wrote:
>> > > >
>> > > > > Hi All,
>> > > > >
>> > > > > This patch adds initial support for early break vectorization in GCC.
>> > > > > The support is added for any target that implements a vector
>> > > > > cbranch optab, this includes both fully masked and non-masked targets.
>> > > > >
>> > > > > Depending on the operation, the vectorizer may also require
>> > > > > support for boolean mask reductions using Inclusive OR.  This is
>> > > > > however only checked when the comparison would produce multiple
>> > > > > statements.
>> > > > >
>> > > > > Note: I am currently struggling to get patch 7 correct in all
>> > > > > cases and could use some feedback there.
>> > > > >
>> > > > > Concretely the kind of loops supported are of the forms:
>> > > > >
>> > > > >  for (int i = 0; i < N; i++)
>> > > > >  {
>> > > > >    <statements1>
>> > > > >    if (<condition>)
>> > > > >      {
>> > > > >        ...
>> > > > >        <action>;
>> > > > >      }
>> > > > >    <statements2>
>> > > > >  }
>> > > > >
>> > > > > where <action> can be:
>> > > > >  - break
>> > > > >  - return
>> > > > >  - goto
>> > > > >
>> > > > > Any number of statements can be used before the <action> occurs.
>> > > > >
>> > > > > Since this is an initial version for GCC 14 it has the following
>> > > > > limitations and
>> > > > > features:
>> > > > >
>> > > > > - Only fixed sized iterations and buffers are supported.  That is to say any
>> > > > >   vectors loaded or stored must be to statically allocated arrays with known
>> > > > >   sizes. N must also be known.  This limitation is because our primary target
>> > > > >   for this optimization is SVE.  For VLA SVE we can't easily do cross page
>> > > > >   iteration checks. The result is likely to also not be beneficial. For that
>> > > > >   reason we punt support for variable buffers till we have First-Faulting
>> > > > >   support in GCC.
>> > 
>> > Btw, for this I wonder if you thought about marking memory accesses required
>> > for the early break condition as required to be vector-size aligned, thus peeling
>> > or versioning them for alignment?  That should ensure they do not fault.
>> > 
>> > OTOH I somehow remember prologue peeling isn't supported for early break
>> > vectorization?  ..
>> > 
>> > > > > - any stores in <statements1> should not be to the same objects as in
>> > > > >   <condition>.  Loads are fine as long as they don't have the possibility to
>> > > > >   alias.  More concretely, we block RAW dependencies when the intermediate
>> > > > >   value can't be separated from the store, or the store itself can't be moved.
>> > > > > - Prologue peeling, alignment peeling and loop versioning are supported.
>> > 
>> > .. but here you say it is.  Not sure if peeling for alignment works for VLA vectors
>> > though.  Just to say x86 doesn't support first-faulting loads.
>> 
>> For VLA we support it through masking.  i.e. if you need to peel N iterations, we
>> generate a masked copy of the loop vectorized which masks off the first N bits.
>> 
>> This is not typically needed, but we do support it.  But the problem with this
>> scheme and early break is obviously that the peeled loop needs to be vectorized
>> so you kinda end up with the same issue again.  So at the moment it rejects it for VLA.
>
> Hmm, I see.  I thought peeling by masking is an optimization.

Yeah, it's an opt-in optimisation.  No current Arm cores opt in though.

> Anyhow, I think it should still work here - since all accesses are aligned
> and we know that there's at least one original scalar iteration in the
> first masked and the following "unmasked" vector iterations there
> should never be faults for any of the aligned accesses.

Peeling via masking works by using the main loop for the "peeled"
iteration (so it's a bit of a misnomer).  The vector pointers start
out lower than the original scalar pointers, with some leading
inactive elements.

The awkwardness would be in skipping those leading inactive elements
in the epilogue, if an early break occurs in the first vector iteration.
Definitely doable, but I imagine not trivial.
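As a rough illustration of where those leading inactive elements come from, the sketch below models the first iteration's predicate; the lane count and element size here are arbitrary placeholders, not what SVE would actually use, and `first_iteration_mask` is a hypothetical helper.

```c
#include <stdint.h>

#define VF 4  /* illustrative lane count */

/* Peeling via masking: the vector accesses start at the last VF-element
   boundary at or below the scalar start, so the first iteration's
   predicate has some leading inactive lanes.  Writes mask[] with 0 for
   inactive and 1 for active lanes.  */
void first_iteration_mask (uintptr_t scalar_start, unsigned elt_size,
                           int mask[VF])
{
  uintptr_t vec_bytes = (uintptr_t) VF * elt_size;
  uintptr_t vec_start = scalar_start & ~(vec_bytes - 1);   /* align down */
  unsigned skip = (unsigned) ((scalar_start - vec_start) / elt_size);
  for (unsigned l = 0; l < VF; l++)
    mask[l] = l >= skip;  /* leading 'skip' lanes are inactive */
}
```

If an early break fires during this first vector iteration, the epilogue has to skip those same `skip` leading lanes, which is the non-trivial part noted above.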

> I think going via alignment is a way easier method to guarantee this
> than handwaving about "declared" arrays and niter.  One can try that
> in addition of course - it's not always possible to align all
> vector loads we are going to speculate (for VLA one could also
> find common runtime (mis-)alignment and restrict the vector length based
> on that, for RISC-V it seems to be efficient, not sure whether altering
> that for SVE is though).

I think both techniques (alignment and reasoning about accessibility)
are useful.  And they each help with different cases.  Like you say,
if there are two vector loads that need to be aligned, we'd need to
version for alignment on fixed-length architectures, with a scalar
fallback when the alignment requirement isn't met.  In contrast,
static reasoning about accessibility allows the vector loop to be
used for all relative misalignments.
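A minimal sketch of such a fixed-length versioning check, assuming the speculated loads just need vector-size alignment so their lanes cannot cross into an unmapped page (real versioning conditions in the vectorizer carry more state than this, and the function name is invented):

```c
#include <stdint.h>

/* Runtime alignment test for loop versioning: take the vector loop only
   if every load we will speculate is aligned to the vector size; fall
   back to the scalar loop otherwise.  */
int can_take_vector_loop (const void *load1, const void *load2,
                          uintptr_t vector_bytes)
{
  uintptr_t misalign = ((uintptr_t) load1 | (uintptr_t) load2)
                       & (vector_bytes - 1);
  return misalign == 0;  /* 0 => both vector-size aligned */
}
```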

So I think the aim should be to support both techniques.  But IMO it's
reasonable to start with either one.  It sounds from Tamar's results
like starting with static reasoning does fire quite often, and it
should have less runtime overhead than the alignment approach.

Plus, when the loop operates on chars, it's hard to predict whether
peeling for alignment pays for itself, or whether the scalar prologue
will end up handling the majority of cases.  If we have the option
of not peeling for alignment, then it's probably worth taking it
for chars.

Capping the VL at runtime is possible on SVE.  It's on the backlog
for handling runtime aliases, where we can vectorise with a lower VF
rather than falling back to scalar code.  But first-faulting loads
are likely to be better than halving or quartering the VL at runtime,
so I don't think capping the VL would be the right SVE technique for
early exits.

Thanks,
Richard
  
Richard Biener Nov. 28, 2023, 8:11 a.m. UTC | #10
On Mon, 27 Nov 2023, Richard Sandiford wrote:

> Catching up on backlog, so this might already be resolved, but:
> 
> Richard Biener <rguenther@suse.de> writes:
> > On Tue, 7 Nov 2023, Tamar Christina wrote:
> >
> >> > -----Original Message-----
> >> > From: Richard Biener <rguenther@suse.de>
> >> > Sent: Tuesday, November 7, 2023 9:43 AM
> >> > To: Tamar Christina <Tamar.Christina@arm.com>
> >> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>
> >> > Subject: RE: [PATCH v6 0/21]middle-end: Support early break/return auto-
> >> > vectorization
> >> > 
> >> > On Mon, 6 Nov 2023, Tamar Christina wrote:
> >> > 
> >> > > > -----Original Message-----
> >> > > > From: Richard Biener <rguenther@suse.de>
> >> > > > Sent: Monday, November 6, 2023 2:25 PM
> >> > > > To: Tamar Christina <Tamar.Christina@arm.com>
> >> > > > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>
> >> > > > Subject: Re: [PATCH v6 0/21]middle-end: Support early break/return
> >> > > > auto- vectorization
> >> > > >
> >> > > > On Mon, 6 Nov 2023, Tamar Christina wrote:
> >> > > >
> >> > > > > Hi All,
> >> > > > >
> >> > > > > This patch adds initial support for early break vectorization in GCC.
> >> > > > > The support is added for any target that implements a vector
> >> > > > > cbranch optab, this includes both fully masked and non-masked targets.
> >> > > > >
> >> > > > > Depending on the operation, the vectorizer may also require
> >> > > > > support for boolean mask reductions using Inclusive OR.  This is
> >> > > > > however only checked when the comparison would produce multiple
> >> > > > > statements.
> >> > > > >
> >> > > > > Note: I am currently struggling to get patch 7 correct in all
> >> > > > > cases and could use some feedback there.
> >> > > > >
> >> > > > > Concretely the kind of loops supported are of the forms:
> >> > > > >
> >> > > > >  for (int i = 0; i < N; i++)
> >> > > > >  {
> >> > > > >    <statements1>
> >> > > > >    if (<condition>)
> >> > > > >      {
> >> > > > >        ...
> >> > > > >        <action>;
> >> > > > >      }
> >> > > > >    <statements2>
> >> > > > >  }
> >> > > > >
> >> > > > > where <action> can be:
> >> > > > >  - break
> >> > > > >  - return
> >> > > > >  - goto
> >> > > > >
> >> > > > > Any number of statements can be used before the <action> occurs.
> >> > > > >
> >> > > > > Since this is an initial version for GCC 14 it has the following
> >> > > > > limitations and
> >> > > > > features:
> >> > > > >
> >> > > > > - Only fixed sized iterations and buffers are supported.  That is to say any
> >> > > > >   vectors loaded or stored must be to statically allocated arrays with known
> >> > > > >   sizes. N must also be known.  This limitation is because our primary target
> >> > > > >   for this optimization is SVE.  For VLA SVE we can't easily do cross page
> >> > > > >   iteration checks. The result is likely to also not be beneficial. For that
> >> > > > >   reason we punt support for variable buffers till we have First-Faulting
> >> > > > >   support in GCC.
> >> > 
> >> > Btw, for this I wonder if you thought about marking memory accesses required
> >> > for the early break condition as required to be vector-size aligned, thus peeling
> >> > or versioning them for alignment?  That should ensure they do not fault.
> >> > 
> >> > OTOH I somehow remember prologue peeling isn't supported for early break
> >> > vectorization?  ..
> >> > 
> >> > > > > - any stores in <statements1> should not be to the same objects as in
> >> > > > >   <condition>.  Loads are fine as long as they don't have the possibility to
> >> > > > >   alias.  More concretely, we block RAW dependencies when the intermediate
> >> > > > >   value can't be separated from the store, or the store itself can't be moved.
> >> > > > > - Prologue peeling, alignment peeling and loop versioning are supported.
> >> > 
> >> > .. but here you say it is.  Not sure if peeling for alignment works for VLA vectors
> >> > though.  Just to say x86 doesn't support first-faulting loads.
> >> 
> >> For VLA we support it through masking.  i.e. if you need to peel N iterations, we
> >> generate a masked copy of the loop vectorized which masks off the first N bits.
> >> 
> >> This is not typically needed, but we do support it.  But the problem with this
> >> scheme and early break is obviously that the peeled loop needs to be vectorized
> >> so you kinda end up with the same issue again.  So at the moment it rejects it for VLA.
> >
> > Hmm, I see.  I thought peeling by masking is an optimization.
> 
> Yeah, it's an opt-in optimisation.  No current Arm cores opt in though.
> 
> > Anyhow, I think it should still work here - since all accesses are aligned
> > and we know that there's at least one original scalar iteration in the
> > first masked and the following "unmasked" vector iterations there
> > should never be faults for any of the aligned accesses.
> 
> Peeling via masking works by using the main loop for the "peeled"
> iteration (so it's a bit of a misnomer).  The vector pointers start
> out lower than the original scalar pointers, with some leading
> inactive elements.
> 
> The awkwardness would be in skipping those leading inactive elements
> in the epilogue, if an early break occurs in the first vector iteration.
> Definitely doable, but I imagine not trivial.
> 
> > I think going via alignment is a way easier method to guarantee this
> > than handwaving about "declared" arrays and niter.  One can try that
> > in addition of course - it's not always possible to align all
> > vector loads we are going to speculate (for VLA one could also
> > find common runtime (mis-)alignment and restrict the vector length based
> > on that, for RISC-V it seems to be efficient, not sure whether altering
> > that for SVE is though).
> 
> I think both techniques (alignment and reasoning about accessibility)
> are useful.  And they each help with different cases.  Like you say,
> if there are two vector loads that need to be aligned, we'd need to
> version for alignment on fixed-length architectures, with a scalar
> fallback when the alignment requirement isn't met.  In contrast,
> static reasoning about accessibility allows the vector loop to be
> used for all relative misalignments.
> 
> So I think the aim should be to support both techniques.  But IMO it's
> reasonable to start with either one.  It sounds from Tamar's results
> like starting with static reasoning does fire quite often, and it
> should have less runtime overhead than the alignment approach.

Fair enough, we need to fix the correctness issues then though
(as said, correctness is way easier to assert for alignment).

> Plus, when the loop operates on chars, it's hard to predict whether
> peeling for alignment pays for itself, or whether the scalar prologue
> will end up handling the majority of cases.  If we have the option
> of not peeling for alignment, then it's probably worth taking it
> for chars.

That's true.

> Capping the VL at runtime is possible on SVE.  It's on the backlog
> for handling runtime aliases, where we can vectorise with a lower VF
> rather than falling back to scalar code.  But first-faulting loads
> are likely to be better than halving or quartering the VL at runtime,
> so I don't think capping the VL would be the right SVE technique for
> early exits.

For targets with no first-faulting loads we only have alignment as
additional possibility then.  I can look at this for next stage1.

Richard.

> Thanks,
> Richard
  
Richard Biener Dec. 8, 2023, 10:28 a.m. UTC | #11
On Fri, 8 Dec 2023, Tamar Christina wrote:

> > --param vect-partial-vector-usage=2 would, no?
> > 
> I.. didn't even know it went to 2!
> 
> > > In principal I suppose I could mask the individual stmts, that should handle the
> > future case when
> > > This is relaxed to supposed non-fix length buffers?
> > 
> > Well, it looks wrong - either put in an assert that we start with a
> > single stmt or assert !masked_loop_p instead?  Better ICE than
> > generate wrong code.
> > 
> > That said, I think you need to apply the masking on the original
> > stmts[], before reducing them, no?
> 
> Yeah, I've done so now.  For simplicity I've just kept the final masking always as well
> and just leave it up to the optimizers to drop it when it's superfluous.
> 
> Simple testcase:
> 
> #ifndef N
> #define N 837
> #endif
> float vect_a[N];
> unsigned vect_b[N];
> 
> unsigned test4(double x)
> {
>  unsigned ret = 0;
>  for (int i = 0; i < N; i++)
>  {
>    if (vect_a[i] > x)
>      break;
>    vect_a[i] = x;
> 
>  }
>  return ret;
> }
> 
> Looks good now. After this one there's only one patch left, the dependency analysis.
> I'm almost done with the cleanup/respin, but want to take the weekend to double check and will post it first thing Monday morning.
> 
> Did you want to see the testsuite changes as well again? I've basically just added the right dg-requires-effective and add-options etc.

Yes please.

> Thanks for all the reviews!
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> 
> Ok for master?
>
> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> 	* tree-vect-patterns.cc (vect_init_pattern_stmt): Support gconds.
> 	(check_bool_pattern, adjust_bool_pattern, adjust_bool_stmts,
> 	vect_recog_bool_pattern, sort_after_uid): Support gconds type analysis.
> 	* tree-vect-stmts.cc (vectorizable_comparison_1): Support stmts without
> 	lhs.
> 	(vectorizable_early_exit): New.
> 	(vect_analyze_stmt, vect_transform_stmt): Use it.
> 	(vect_is_simple_use, vect_get_vector_types_for_stmt): Support gcond.
> 
> 
> --- inline copy of patch ---
> 
> diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> index 7debe7f0731673cd1bf25cd39d55e23990a73d0e..8865cde9f3481a474d31848ae12523576d29744d 100644
> --- a/gcc/tree-vect-patterns.cc
> +++ b/gcc/tree-vect-patterns.cc
> @@ -132,6 +132,7 @@ vect_init_pattern_stmt (vec_info *vinfo, gimple *pattern_stmt,
>    if (!STMT_VINFO_VECTYPE (pattern_stmt_info))
>      {
>        gcc_assert (!vectype
> +		  || is_a <gcond *> (pattern_stmt)
>  		  || (VECTOR_BOOLEAN_TYPE_P (vectype)
>  		      == vect_use_mask_type_p (orig_stmt_info)));
>        STMT_VINFO_VECTYPE (pattern_stmt_info) = vectype;
> @@ -5210,19 +5211,28 @@ vect_recog_mixed_size_cond_pattern (vec_info *vinfo,
>     true if bool VAR can and should be optimized that way.  Assume it shouldn't
>     in case it's a result of a comparison which can be directly vectorized into
>     a vector comparison.  Fills in STMTS with all stmts visited during the
> -   walk.  */
> +   walk.  If ANALYZE_ONLY then only analyze the booleans but do not perform any
> +   codegen associated with the boolean condition.  */
>  
>  static bool
> -check_bool_pattern (tree var, vec_info *vinfo, hash_set<gimple *> &stmts)
> +check_bool_pattern (tree var, vec_info *vinfo, hash_set<gimple *> &stmts,
> +		    bool analyze_only)
>  {
>    tree rhs1;
>    enum tree_code rhs_code;
> +  gassign *def_stmt = NULL;
>  
>    stmt_vec_info def_stmt_info = vect_get_internal_def (vinfo, var);
> -  if (!def_stmt_info)
> +  if (!def_stmt_info && !analyze_only)
>      return false;
> +  else if (!def_stmt_info)
> +    /* If we're only analyzing we won't be codegen-ing the statements and
> +       only care whether the types match.  In that case we can accept loop
> +       invariant values.  */
> +    def_stmt = dyn_cast <gassign *> (SSA_NAME_DEF_STMT (var));
> +  else
> +    def_stmt = dyn_cast <gassign *> (def_stmt_info->stmt);
>  

Hmm, but we're visiting them then?  I wonder how you get along
without doing adjustmens on the uses if you consider

    _1 = a < b;
    _2 = c != d;
    _3 = _1 | _2;
    if (_3 != 0)
      exit loop;

thus a combined condition like

    if (a < b || c != d)

that we if-converted.  We need to recognize that _1, _2 and _3 have
mask uses and thus possibly adjust them.
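For reference, a hypothetical scalar source whose if-converted exit condition takes exactly that combined-mask shape (the short-circuit || becomes a bitwise | of the two masks when both operands are safe to evaluate):

```c
/* Early break on a combined condition: after if-conversion the exit
   test becomes _1 = a < b; _2 = c != d; _3 = _1 | _2; gcond <_3 != 0>,
   so _1, _2 and _3 all have mask uses.  */
int first_exit (int n, const int a[], const int b[],
                const int c[], const int d[])
{
  for (int i = 0; i < n; i++)
    if (a[i] < b[i] || c[i] != d[i])
      return i;  /* early exit index */
  return n;
}
```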

What bad happens if you drop 'analyze_only'?  We're not really
rewriting anything there.

> -  gassign *def_stmt = dyn_cast <gassign *> (def_stmt_info->stmt);
>    if (!def_stmt)
>      return false;
>  
> @@ -5234,27 +5244,28 @@ check_bool_pattern (tree var, vec_info *vinfo, hash_set<gimple *> &stmts)
>    switch (rhs_code)
>      {
>      case SSA_NAME:
> -      if (! check_bool_pattern (rhs1, vinfo, stmts))
> +      if (! check_bool_pattern (rhs1, vinfo, stmts, analyze_only))
>  	return false;
>        break;
>  
>      CASE_CONVERT:
>        if (!VECT_SCALAR_BOOLEAN_TYPE_P (TREE_TYPE (rhs1)))
>  	return false;
> -      if (! check_bool_pattern (rhs1, vinfo, stmts))
> +      if (! check_bool_pattern (rhs1, vinfo, stmts, analyze_only))
>  	return false;
>        break;
>  
>      case BIT_NOT_EXPR:
> -      if (! check_bool_pattern (rhs1, vinfo, stmts))
> +      if (! check_bool_pattern (rhs1, vinfo, stmts, analyze_only))
>  	return false;
>        break;
>  
>      case BIT_AND_EXPR:
>      case BIT_IOR_EXPR:
>      case BIT_XOR_EXPR:
> -      if (! check_bool_pattern (rhs1, vinfo, stmts)
> -	  || ! check_bool_pattern (gimple_assign_rhs2 (def_stmt), vinfo, stmts))
> +      if (! check_bool_pattern (rhs1, vinfo, stmts, analyze_only)
> +	  || ! check_bool_pattern (gimple_assign_rhs2 (def_stmt), vinfo, stmts,
> +				   analyze_only))
>  	return false;
>        break;
>  
> @@ -5275,6 +5286,7 @@ check_bool_pattern (tree var, vec_info *vinfo, hash_set<gimple *> &stmts)
>  	  tree mask_type = get_mask_type_for_scalar_type (vinfo,
>  							  TREE_TYPE (rhs1));
>  	  if (mask_type
> +	      && !analyze_only
>  	      && expand_vec_cmp_expr_p (comp_vectype, mask_type, rhs_code))
>  	    return false;
>  
> @@ -5289,7 +5301,8 @@ check_bool_pattern (tree var, vec_info *vinfo, hash_set<gimple *> &stmts)
>  	    }
>  	  else
>  	    vecitype = comp_vectype;
> -	  if (! expand_vec_cond_expr_p (vecitype, comp_vectype, rhs_code))
> +	  if (!analyze_only
> +	      && !expand_vec_cond_expr_p (vecitype, comp_vectype, rhs_code))
>  	    return false;
>  	}
>        else
> @@ -5324,11 +5337,13 @@ adjust_bool_pattern_cast (vec_info *vinfo,
>     VAR is an SSA_NAME that should be transformed from bool to a wider integer
>     type, OUT_TYPE is the desired final integer type of the whole pattern.
>     STMT_INFO is the info of the pattern root and is where pattern stmts should
> -   be associated with.  DEFS is a map of pattern defs.  */
> +   be associated with.  DEFS is a map of pattern defs.  If TYPE_ONLY then don't
> +   create new pattern statements and instead only fill LAST_STMT and DEFS.  */
>  
>  static void
>  adjust_bool_pattern (vec_info *vinfo, tree var, tree out_type,
> -		     stmt_vec_info stmt_info, hash_map <tree, tree> &defs)
> +		     stmt_vec_info stmt_info, hash_map <tree, tree> &defs,
> +		     gimple *&last_stmt, bool type_only)
>  {
>    gimple *stmt = SSA_NAME_DEF_STMT (var);
>    enum tree_code rhs_code, def_rhs_code;
> @@ -5492,28 +5507,38 @@ adjust_bool_pattern (vec_info *vinfo, tree var, tree out_type,
>      }
>  
>    gimple_set_location (pattern_stmt, loc);
> -  append_pattern_def_seq (vinfo, stmt_info, pattern_stmt,
> -			  get_vectype_for_scalar_type (vinfo, itype));
> +  if (!type_only)
> +    append_pattern_def_seq (vinfo, stmt_info, pattern_stmt,
> +			    get_vectype_for_scalar_type (vinfo, itype));
> +  last_stmt = pattern_stmt;
>    defs.put (var, gimple_assign_lhs (pattern_stmt));
>  }
>  
> -/* Comparison function to qsort a vector of gimple stmts after UID.  */
> +/* Comparison function to qsort a vector of gimple stmts after BB and UID.
> +   the def of one statement can be in an earlier block than the use, so if
> +   the BB are different, first compare by BB.  */
>  
>  static int
>  sort_after_uid (const void *p1, const void *p2)
>  {
>    const gimple *stmt1 = *(const gimple * const *)p1;
>    const gimple *stmt2 = *(const gimple * const *)p2;
> +  if (gimple_bb (stmt1)->index != gimple_bb (stmt2)->index)
> +    return gimple_bb (stmt1)->index - gimple_bb (stmt2)->index;
> +

Is this because you eventually get out-of-loop stmts (without UIDs)?

>    return gimple_uid (stmt1) - gimple_uid (stmt2);
>  }
>  
>  /* Create pattern stmts for all stmts participating in the bool pattern
>     specified by BOOL_STMT_SET and its root STMT_INFO with the desired type
> -   OUT_TYPE.  Return the def of the pattern root.  */
> +   OUT_TYPE.  Return the def of the pattern root.  If TYPE_ONLY the new
> +   statements are not emitted as pattern statements and the tree returned is
> +   only useful for type queries.  */
>  
>  static tree
>  adjust_bool_stmts (vec_info *vinfo, hash_set <gimple *> &bool_stmt_set,
> -		   tree out_type, stmt_vec_info stmt_info)
> +		   tree out_type, stmt_vec_info stmt_info,
> +		   bool type_only = false)
>  {
>    /* Gather original stmts in the bool pattern in their order of appearance
>       in the IL.  */
> @@ -5523,16 +5548,16 @@ adjust_bool_stmts (vec_info *vinfo, hash_set <gimple *> &bool_stmt_set,
>      bool_stmts.quick_push (*i);
>    bool_stmts.qsort (sort_after_uid);
>  
> +  gimple *last_stmt = NULL;
> +
>    /* Now process them in that order, producing pattern stmts.  */
>    hash_map <tree, tree> defs;
> -  for (unsigned i = 0; i < bool_stmts.length (); ++i)
> -    adjust_bool_pattern (vinfo, gimple_assign_lhs (bool_stmts[i]),
> -			 out_type, stmt_info, defs);
> +  for (auto bool_stmt : bool_stmts)
> +    adjust_bool_pattern (vinfo, gimple_assign_lhs (bool_stmt),
> +			 out_type, stmt_info, defs, last_stmt, type_only);
>  
>    /* Pop the last pattern seq stmt and install it as pattern root for STMT.  */
> -  gimple *pattern_stmt
> -    = gimple_seq_last_stmt (STMT_VINFO_PATTERN_DEF_SEQ (stmt_info));
> -  return gimple_assign_lhs (pattern_stmt);
> +  return gimple_assign_lhs (last_stmt);
>  }
>  
>  /* Return the proper type for converting bool VAR into
> @@ -5608,13 +5633,27 @@ vect_recog_bool_pattern (vec_info *vinfo,
>    enum tree_code rhs_code;
>    tree var, lhs, rhs, vectype;
>    gimple *pattern_stmt;
> -
> -  if (!is_gimple_assign (last_stmt))
> +  gcond* cond = NULL;
> +  if (!is_gimple_assign (last_stmt)
> +      && !(cond = dyn_cast <gcond *> (last_stmt)))
>      return NULL;
>  
> -  var = gimple_assign_rhs1 (last_stmt);
> -  lhs = gimple_assign_lhs (last_stmt);
> -  rhs_code = gimple_assign_rhs_code (last_stmt);
> +  loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo);
> +  if (is_gimple_assign (last_stmt))
> +    {
> +      var = gimple_assign_rhs1 (last_stmt);
> +      lhs = gimple_assign_lhs (last_stmt);
> +      rhs_code = gimple_assign_rhs_code (last_stmt);
> +    }
> +  else if (loop_vinfo && LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> +    {
> +      /* Unless this is loop vectorization with multiple exits, don't
> +	 bother analyzing the gcond as we don't support SLP today.  */
> +      lhs = var = gimple_cond_lhs (last_stmt);
> +      rhs_code = gimple_cond_code (last_stmt);
> +    }
> +  else
> +    return NULL;
>  
>    if (rhs_code == VIEW_CONVERT_EXPR)
>      var = TREE_OPERAND (var, 0);
> @@ -5632,7 +5671,7 @@ vect_recog_bool_pattern (vec_info *vinfo,
>  	return NULL;
>        vectype = get_vectype_for_scalar_type (vinfo, TREE_TYPE (lhs));
>  
> -      if (check_bool_pattern (var, vinfo, bool_stmts))
> +      if (check_bool_pattern (var, vinfo, bool_stmts, false))
>  	{
>  	  rhs = adjust_bool_stmts (vinfo, bool_stmts,
>  				   TREE_TYPE (lhs), stmt_vinfo);
> @@ -5680,7 +5719,7 @@ vect_recog_bool_pattern (vec_info *vinfo,
>  
>        return pattern_stmt;
>      }
> -  else if (rhs_code == COND_EXPR
> +  else if ((rhs_code == COND_EXPR || cond)
>  	   && TREE_CODE (var) == SSA_NAME)
>      {
>        vectype = get_vectype_for_scalar_type (vinfo, TREE_TYPE (lhs));
> @@ -5700,18 +5739,31 @@ vect_recog_bool_pattern (vec_info *vinfo,
>        if (get_vectype_for_scalar_type (vinfo, type) == NULL_TREE)
>  	return NULL;
>  
> -      if (check_bool_pattern (var, vinfo, bool_stmts))
> -	var = adjust_bool_stmts (vinfo, bool_stmts, type, stmt_vinfo);
> +      if (check_bool_pattern (var, vinfo, bool_stmts, cond))
> +	var = adjust_bool_stmts (vinfo, bool_stmts, type, stmt_vinfo, cond);
>        else if (integer_type_for_mask (var, vinfo))
>  	return NULL;
>  
> -      lhs = vect_recog_temp_ssa_var (TREE_TYPE (lhs), NULL);
> -      pattern_stmt 
> -	= gimple_build_assign (lhs, COND_EXPR,
> -			       build2 (NE_EXPR, boolean_type_node,
> -				       var, build_int_cst (TREE_TYPE (var), 0)),
> -			       gimple_assign_rhs2 (last_stmt),
> -			       gimple_assign_rhs3 (last_stmt));
> +      if (!cond)
> +	{
> +	  lhs = vect_recog_temp_ssa_var (TREE_TYPE (lhs), NULL);
> +	  pattern_stmt
> +	    = gimple_build_assign (lhs, COND_EXPR,
> +				   build2 (NE_EXPR, boolean_type_node, var,
> +					   build_int_cst (TREE_TYPE (var), 0)),
> +				   gimple_assign_rhs2 (last_stmt),
> +				   gimple_assign_rhs3 (last_stmt));
> +	}
> +      else
> +	{
> +	  pattern_stmt
> +	    = gimple_build_cond (gimple_cond_code (cond),
> +				 gimple_cond_lhs (cond), gimple_cond_rhs (cond),
> +				 gimple_cond_true_label (cond),
> +				 gimple_cond_false_label (cond));
> +	  vectype = get_vectype_for_scalar_type (vinfo, TREE_TYPE (var));
> +	  vectype = truth_type_for (vectype);
> +	}
>        *type_out = vectype;
>        vect_pattern_detected ("vect_recog_bool_pattern", last_stmt);
>  

So this is also quite odd.  You're hooking into COND_EXPR handling
but only looking at the LHS of the GIMPLE_COND compare.

Please refactor the changes to separate the GIMPLE_COND path
completely.

Is there test coverage for such a "complex" condition?  I think
you'll need adjustments to vect_recog_mask_conversion_pattern as
well, similar to how COND_EXPR is handled there.

> @@ -5725,7 +5777,7 @@ vect_recog_bool_pattern (vec_info *vinfo,
>        if (!vectype || !VECTOR_MODE_P (TYPE_MODE (vectype)))
>  	return NULL;
>  
> -      if (check_bool_pattern (var, vinfo, bool_stmts))
> +      if (check_bool_pattern (var, vinfo, bool_stmts, false))
>  	rhs = adjust_bool_stmts (vinfo, bool_stmts,
>  				 TREE_TYPE (vectype), stmt_vinfo);
>        else
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index 582c5e678fad802d6e76300fe3c939b9f2978f17..e9116d184149826ba436b0f562721c140d586c94 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -12489,7 +12489,7 @@ vectorizable_comparison_1 (vec_info *vinfo, tree vectype,
>    vec<tree> vec_oprnds0 = vNULL;
>    vec<tree> vec_oprnds1 = vNULL;
>    tree mask_type;
> -  tree mask;
> +  tree mask = NULL_TREE;
>  
>    if (!STMT_VINFO_RELEVANT_P (stmt_info) && !bb_vinfo)
>      return false;
> @@ -12629,8 +12629,9 @@ vectorizable_comparison_1 (vec_info *vinfo, tree vectype,
>    /* Transform.  */
>  
>    /* Handle def.  */
> -  lhs = gimple_assign_lhs (STMT_VINFO_STMT (stmt_info));
> -  mask = vect_create_destination_var (lhs, mask_type);
> +  lhs = gimple_get_lhs (STMT_VINFO_STMT (stmt_info));
> +  if (lhs)
> +    mask = vect_create_destination_var (lhs, mask_type);
>  
>    vect_get_vec_defs (vinfo, stmt_info, slp_node, ncopies,
>  		     rhs1, &vec_oprnds0, vectype,
> @@ -12644,7 +12645,10 @@ vectorizable_comparison_1 (vec_info *vinfo, tree vectype,
>        gimple *new_stmt;
>        vec_rhs2 = vec_oprnds1[i];
>  
> -      new_temp = make_ssa_name (mask);
> +      if (lhs)
> +	new_temp = make_ssa_name (mask);
> +      else
> +	new_temp = make_temp_ssa_name (mask_type, NULL, "cmp");
>        if (bitop1 == NOP_EXPR)
>  	{
>  	  new_stmt = gimple_build_assign (new_temp, code,
> @@ -12723,6 +12727,184 @@ vectorizable_comparison (vec_info *vinfo,
>    return true;
>  }
>  
> +/* Check to see if the current early break given in STMT_INFO is valid for
> +   vectorization.  */
> +
> +static bool
> +vectorizable_early_exit (vec_info *vinfo, stmt_vec_info stmt_info,
> +			 gimple_stmt_iterator *gsi, gimple **vec_stmt,
> +			 slp_tree slp_node, stmt_vector_for_cost *cost_vec)
> +{
> +  loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo);
> +  if (!loop_vinfo
> +      || !is_a <gcond *> (STMT_VINFO_STMT (stmt_info)))
> +    return false;
> +
> +  if (STMT_VINFO_DEF_TYPE (stmt_info) != vect_condition_def)
> +    return false;
> +
> +  if (!STMT_VINFO_RELEVANT_P (stmt_info))
> +    return false;
> +
> +  auto code = gimple_cond_code (STMT_VINFO_STMT (stmt_info));
> +  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
> +  gcc_assert (vectype);
> +
> +  tree vectype_op0 = NULL_TREE;
> +  slp_tree slp_op0;
> +  tree op0;
> +  enum vect_def_type dt0;
> +  if (!vect_is_simple_use (vinfo, stmt_info, slp_node, 0, &op0, &slp_op0, &dt0,
> +			   &vectype_op0))
> +    {
> +      if (dump_enabled_p ())
> +	  dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +			   "use not simple.\n");
> +	return false;
> +    }

I think you rely on patterns transforming this into the canonical form
mask != 0, so I suggest checking that here.

> +  machine_mode mode = TYPE_MODE (vectype);
> +  int ncopies;
> +
> +  if (slp_node)
> +    ncopies = 1;
> +  else
> +    ncopies = vect_get_num_copies (loop_vinfo, vectype);
> +
> +  vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> +  bool masked_loop_p = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
> +
> +  /* Analyze only.  */
> +  if (!vec_stmt)
> +    {
> +      if (direct_optab_handler (cbranch_optab, mode) == CODE_FOR_nothing)
> +	{
> +	  if (dump_enabled_p ())
> +	      dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +			       "can't vectorize early exit because the "
> +			       "target doesn't support flag setting vector "
> +			       "comparisons.\n");
> +	  return false;
> +	}
> +
> +      if (ncopies > 1

Also required for vec_num > 1 with SLP
(SLP_TREE_NUMBER_OF_VEC_STMTS)

> +	  && direct_optab_handler (ior_optab, mode) == CODE_FOR_nothing)
> +	{
> +	  if (dump_enabled_p ())
> +	      dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +			       "can't vectorize early exit because the "
> +			       "target does not support boolean vector OR for "
> +			       "type %T.\n", vectype);
> +	  return false;
> +	}
> +
> +      if (!vectorizable_comparison_1 (vinfo, vectype, stmt_info, code, gsi,
> +				      vec_stmt, slp_node, cost_vec))
> +	return false;
> +
> +      if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
> +	{
> +	  if (direct_internal_fn_supported_p (IFN_VCOND_MASK_LEN, vectype,
> +					      OPTIMIZE_FOR_SPEED))
> +	    return false;
> +	  else
> +	    vect_record_loop_mask (loop_vinfo, masks, ncopies, vectype, NULL);
> +	}
> +
> +
> +      return true;
> +    }
> +
> +  /* Transform.  */
> +
> +  tree new_temp = NULL_TREE;
> +  gimple *new_stmt = NULL;
> +
> +  if (dump_enabled_p ())
> +    dump_printf_loc (MSG_NOTE, vect_location, "transform early-exit.\n");
> +
> +  if (!vectorizable_comparison_1 (vinfo, vectype, stmt_info, code, gsi,
> +				  vec_stmt, slp_node, cost_vec))
> +    gcc_unreachable ();
> +
> +  gimple *stmt = STMT_VINFO_STMT (stmt_info);
> +  basic_block cond_bb = gimple_bb (stmt);
> +  gimple_stmt_iterator  cond_gsi = gsi_last_bb (cond_bb);
> +
> +  auto_vec<tree> stmts;
> +
> +  tree mask = NULL_TREE;
> +  if (masked_loop_p)
> +    mask = vect_get_loop_mask (loop_vinfo, gsi, masks, ncopies, vectype, 0);
> +
> +  if (slp_node)
> +    stmts.safe_splice (SLP_TREE_VEC_DEFS (slp_node));
> +  else
> +    {
> +      auto vec_stmts = STMT_VINFO_VEC_STMTS (stmt_info);
> +      stmts.reserve_exact (vec_stmts.length ());
> +      for (auto stmt : vec_stmts)
> +	stmts.quick_push (gimple_assign_lhs (stmt));
> +    }
> +
> +  /* Determine if we need to reduce the final value.  */
> +  if (stmts.length () > 1)
> +    {
> +      /* We build the reductions in a way to maintain as much parallelism as
> +	 possible.  */
> +      auto_vec<tree> workset (stmts.length ());
> +
> +      /* Mask the statements as we queue them up.  */
> +      if (masked_loop_p)
> +	for (auto stmt : stmts)
> +	  workset.quick_push (prepare_vec_mask (loop_vinfo, TREE_TYPE (mask),
> +						mask, stmt, &cond_gsi));

I think this still uses the wrong mask, you need to use

  vect_get_loop_mask (loop_vinfo, gsi, masks, ncopies, vectype, <cnt>)

replacing <cnt> with the vector def index to mask I think.  For this
reason keeping the "final" mask below is also wrong.

Or am I missing something?

> +      else
> +	workset.splice (stmts);
> +
> +      while (workset.length () > 1)
> +	{
> +	  new_temp = make_temp_ssa_name (vectype, NULL, "vexit_reduc");
> +	  tree arg0 = workset.pop ();
> +	  tree arg1 = workset.pop ();
> +	  new_stmt = gimple_build_assign (new_temp, BIT_IOR_EXPR, arg0, arg1);
> +	  vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt,
> +				       &cond_gsi);
> +	  workset.quick_insert (0, new_temp);
> +	}
> +    }
> +  else
> +    new_temp = stmts[0];
> +
> +  gcc_assert (new_temp);
> +
> +  tree cond = new_temp;
> +  /* If we have multiple statements after reduction we should check all the
> +     lanes and treat it as a full vector.  */
> +  if (masked_loop_p)
> +    cond = prepare_vec_mask (loop_vinfo, TREE_TYPE (mask), mask, cond,
> +			     &cond_gsi);

so just do this in the else path above

Otherwise looks OK.

Richard.

> +  /* Now build the new conditional.  Pattern gimple_conds get dropped during
> +     codegen so we must replace the original insn.  */
> +  stmt = STMT_VINFO_STMT (vect_orig_stmt (stmt_info));
> +  gcond *cond_stmt = as_a <gcond *>(stmt);
> +  gimple_cond_set_condition (cond_stmt, NE_EXPR, cond,
> +			     build_zero_cst (vectype));
> +  update_stmt (stmt);
> +
> +  if (slp_node)
> +    SLP_TREE_VEC_DEFS (slp_node).truncate (0);
> +   else
> +    STMT_VINFO_VEC_STMTS (stmt_info).truncate (0);
> +
> +
> +  if (!slp_node)
> +    *vec_stmt = stmt;
> +
> +  return true;
> +}
> +
>  /* If SLP_NODE is nonnull, return true if vectorizable_live_operation
>     can handle all live statements in the node.  Otherwise return true
>     if STMT_INFO is not live or if vectorizable_live_operation can handle it.
> @@ -12949,7 +13131,9 @@ vect_analyze_stmt (vec_info *vinfo,
>  	  || vectorizable_lc_phi (as_a <loop_vec_info> (vinfo),
>  				  stmt_info, NULL, node)
>  	  || vectorizable_recurr (as_a <loop_vec_info> (vinfo),
> -				   stmt_info, NULL, node, cost_vec));
> +				   stmt_info, NULL, node, cost_vec)
> +	  || vectorizable_early_exit (vinfo, stmt_info, NULL, NULL, node,
> +				      cost_vec));
>    else
>      {
>        if (bb_vinfo)
> @@ -12972,7 +13156,10 @@ vect_analyze_stmt (vec_info *vinfo,
>  					 NULL, NULL, node, cost_vec)
>  	      || vectorizable_comparison (vinfo, stmt_info, NULL, NULL, node,
>  					  cost_vec)
> -	      || vectorizable_phi (vinfo, stmt_info, NULL, node, cost_vec));
> +	      || vectorizable_phi (vinfo, stmt_info, NULL, node, cost_vec)
> +	      || vectorizable_early_exit (vinfo, stmt_info, NULL, NULL, node,
> +					  cost_vec));
> +
>      }
>  
>    if (node)
> @@ -13131,6 +13318,12 @@ vect_transform_stmt (vec_info *vinfo,
>        gcc_assert (done);
>        break;
>  
> +    case loop_exit_ctrl_vec_info_type:
> +      done = vectorizable_early_exit (vinfo, stmt_info, gsi, &vec_stmt,
> +				      slp_node, NULL);
> +      gcc_assert (done);
> +      break;
> +
>      default:
>        if (!STMT_VINFO_LIVE_P (stmt_info))
>  	{
> @@ -14321,10 +14514,19 @@ vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
>      }
>    else
>      {
> +      gcond *cond = NULL;
>        if (data_reference *dr = STMT_VINFO_DATA_REF (stmt_info))
>  	scalar_type = TREE_TYPE (DR_REF (dr));
>        else if (gimple_call_internal_p (stmt, IFN_MASK_STORE))
>  	scalar_type = TREE_TYPE (gimple_call_arg (stmt, 3));
> +      else if ((cond = dyn_cast <gcond *> (stmt)))
> +	{
> +	  /* We can't convert the scalar type to a boolean yet, since
> +	     booleans have single-bit precision and we need the vector
> +	     boolean to be a representation of the integer mask.  So set
> +	     the correct integer type now and convert to a boolean vector
> +	     once we have a vectype.  */
> +	  scalar_type = TREE_TYPE (gimple_cond_lhs (cond));
> +	}
>        else
>  	scalar_type = TREE_TYPE (gimple_get_lhs (stmt));
>  
> @@ -14339,12 +14541,18 @@ vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
>  			     "get vectype for scalar type: %T\n", scalar_type);
>  	}
>        vectype = get_vectype_for_scalar_type (vinfo, scalar_type, group_size);
> +
>        if (!vectype)
>  	return opt_result::failure_at (stmt,
>  				       "not vectorized:"
>  				       " unsupported data-type %T\n",
>  				       scalar_type);
>  
> +      /* If we were a gcond, convert the resulting type to a vector
> +	 boolean type now that we have the correct integer mask type.  */
> +      if (cond)
> +	vectype = truth_type_for (vectype);
> +
>        if (dump_enabled_p ())
>  	dump_printf_loc (MSG_NOTE, vect_location, "vectype: %T\n", vectype);
>      }
>
  
Richard Biener Dec. 12, 2023, 11:30 a.m. UTC | #12
On Tue, 12 Dec 2023, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:
> > On Mon, 11 Dec 2023, Tamar Christina wrote:
> >> @@ -5553,6 +5554,83 @@ integer_type_for_mask (tree var, vec_info *vinfo)
> >>    return build_nonstandard_integer_type (def_stmt_info->mask_precision, 1);
> >>  }
> >>  
> >> +/* Function vect_recog_gcond_pattern
> >> +
> >> +   Try to find pattern like following:
> >> +
> >> +     if (a op b)
> >> +
> >> +   where operator 'op' is not != and convert it to an adjusted boolean pattern
> >> +
> >> +     mask = a op b
> >> +     if (mask != 0)
> >> +
> >> +   and set the mask type on MASK.
> >> +
> >> +   Input:
> >> +
> >> +   * STMT_VINFO: The stmt at the end from which the pattern
> >> +		 search begins, i.e. cast of a bool to
> >> +		 an integer type.
> >> +
> >> +   Output:
> >> +
> >> +   * TYPE_OUT: The type of the output of this pattern.
> >> +
> >> +   * Return value: A new stmt that will be used to replace the pattern.  */
> >> +
> >> +static gimple *
> >> +vect_recog_gcond_pattern (vec_info *vinfo,
> >> +			 stmt_vec_info stmt_vinfo, tree *type_out)
> >> +{
> >> +  gimple *last_stmt = STMT_VINFO_STMT (stmt_vinfo);
> >> +  gcond* cond = NULL;
> >> +  if (!(cond = dyn_cast <gcond *> (last_stmt)))
> >> +    return NULL;
> >> +
> >> +  auto lhs = gimple_cond_lhs (cond);
> >> +  auto rhs = gimple_cond_rhs (cond);
> >> +  auto code = gimple_cond_code (cond);
> >> +
> >> +  tree scalar_type = TREE_TYPE (lhs);
> >> +  if (VECTOR_TYPE_P (scalar_type))
> >> +    return NULL;
> >> +
> >> +  if (code == NE_EXPR && zerop (rhs))
> >
> > I think you need && VECT_SCALAR_BOOLEAN_TYPE_P (scalar_type) here,
> > an integer != 0 would not be an appropriate mask.  I guess two
> > relevant testcases would have an early exit like
> >
> >    if (here[i] != 0)
> >      break;
> >
> > once with a 'bool here[]' and once with a 'int here[]'.
> >
> >> +    return NULL;
> >> +
> >> +  tree vecitype = get_vectype_for_scalar_type (vinfo, scalar_type);
> >> +  if (vecitype == NULL_TREE)
> >> +    return NULL;
> >> +
> >> +  /* Build a scalar type for the boolean result that when vectorized matches the
> >> +     vector type of the result in size and number of elements.  */
> >> +  unsigned prec
> >> +    = vector_element_size (tree_to_poly_uint64 (TYPE_SIZE (vecitype)),
> >> +			   TYPE_VECTOR_SUBPARTS (vecitype));
> >> +
> >> +  scalar_type
> >> +    = build_nonstandard_integer_type (prec, TYPE_UNSIGNED (scalar_type));
> >> +
> >> +  vecitype = get_vectype_for_scalar_type (vinfo, scalar_type);
> >> +  if (vecitype == NULL_TREE)
> >> +    return NULL;
> >> +
> >> +  tree vectype = truth_type_for (vecitype);
> >
> > That looks awfully complicated.  I guess one complication is that
> > we compute mask_precision & friends before this pattern gets
> > recognized.  See vect_determine_mask_precision and its handling
> > of tcc_comparison, see also integer_type_for_mask.  For comparisons
> > properly handled during pattern recog the vector type is determined
> > in vect_get_vector_types_for_stmt via
> >
> >   else if (vect_use_mask_type_p (stmt_info))
> >     {
> >       unsigned int precision = stmt_info->mask_precision;
> >       scalar_type = build_nonstandard_integer_type (precision, 1);
> >       vectype = get_mask_type_for_scalar_type (vinfo, scalar_type, 
> > group_size);
> >       if (!vectype)
> >         return opt_result::failure_at (stmt, "not vectorized: unsupported"
> >                                        " data-type %T\n", scalar_type);
> >
> > Richard, do you have any advice here?  I suppose vect_determine_precisions
> > needs to handle the gcond case with bool != 0 somehow and for the
> > extra mask producer we add here we have to emulate what it would have 
> > done, right?
> 
> How about handling gconds directly in vect_determine_mask_precision?
> In a sense it's not needed, since gconds are always roots, and so we
> could calculate their precision on the fly instead.  But handling it in
> vect_determine_mask_precision feels like it should reduce the number
> of special cases.

Yeah, that sounds worth trying.

Richard.