[12/19] middle-end: implement loop peeling and IV updates for early break.

Message ID ZJw510c3/NpFrqbh@arm.com
State Not Applicable
Headers
Series Support early break/return auto-vectorization |

Checks

Context Check Description
snail/gcc-patch-check fail Git am fail log

Commit Message

Tamar Christina June 28, 2023, 1:47 p.m. UTC
  Hi All,

This patch updates the peeling code to maintain LCSSA during peeling.
The rewrite also naturally takes into account multiple exits and so it didn't
make sense to split them off.

For the purposes of peeling the only change for multiple exits is that the
secondary exits are all wired to the start of the new loop preheader when doing
epilogue peeling.

When doing prologue peeling the CFG is kept in tact.

For both epilogue and prologue peeling we wire through between the two loops any
PHI nodes that escape the first loop into the second loop if flow_loops is
specified.  The reason for this conditionality is because
slpeel_tree_duplicate_loop_to_edge_cfg is used in the compiler in 3 ways:
  - prologue peeling
  - epilogue peeling
  - loop distribution

for the last case the loops should remain independent, and so not be connected.
Because of this propagation of only used phi nodes get_current_def can be used
to easily find the previous definitions.  However live statements that are
not used inside the loop itself are not propagated (since if unused, the moment
we add the guard in between the two loops the value across the bypass edge can
be wrong if the loop has been peeled.)

This is dealt with easily enough in find_guard_arg.

For multiple exits, while we are in LCSSA form, and have a correct DOM tree, the
moment we add the guard block we will change the dominators again.  To deal with
this slpeel_tree_duplicate_loop_to_edge_cfg can optionally return the blocks to
update without having to recompute the list of blocks to update again.

When multiple exits and doing epilogue peeling we will also temporarily have an
incorrect VUSES chain for the secondary exits as it anticipates the final result
after the VDEFs have been moved.  This will thus be corrected once the code
motion is applied.

Lastly by doing things this way we can remove the helper functions that
previously did lock step iterations to update things as it went along.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* tree-loop-distribution.cc (copy_loop_before): Pass flow_loops = false.
	* tree-ssa-loop-niter.cc (loop_only_exit_p):  Fix bug when exit==null.
	* tree-vect-loop-manip.cc (adjust_phi_and_debug_stmts): Add additional
	assert.
	(vect_set_loop_condition_normal): Skip modifying loop IV for multiple
	exits.
	(slpeel_tree_duplicate_loop_to_edge_cfg): Support multiple exit peeling.
	(slpeel_can_duplicate_loop_p): Likewise.
	(vect_update_ivs_after_vectorizer): Don't enter this...
	(vect_update_ivs_after_early_break): ...but instead enter here.
	(find_guard_arg): Update for new peeling code.
	(slpeel_update_phi_nodes_for_loops): Remove.
	(slpeel_update_phi_nodes_for_guard2): Remove hardcoded edge 0 checks.
	(slpeel_update_phi_nodes_for_lcssa): Remove.
	(vect_do_peeling): Fix VF for multiple exits and force epilogue.
	* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Initialize
	non_break_control_flow and early_breaks.
	(vect_need_peeling_or_partial_vectors_p): Force partial vector if
	multiple exits and VLA.
	(vect_analyze_loop_form): Support inner loop multiple exits.
	(vect_create_loop_vinfo): Set LOOP_VINFO_EARLY_BREAKS.
	(vect_create_epilog_for_reduction):  Update live phi nodes.
	(vectorizable_live_operation): Ignore live operations in vector loop
	when multiple exits.
	(vect_transform_loop): Force unrolling for VF loops and multiple exits.
	* tree-vect-stmts.cc (vect_stmt_relevant_p): Analyze ctrl statements.
	(vect_mark_stmts_to_be_vectorized): Check for non-exit control flow and
	analyze gcond params.
	(vect_analyze_stmt): Support gcond.
	* tree-vectorizer.cc (pass_vectorize::execute): Support multiple exits
	in RPO pass.
	* tree-vectorizer.h (enum vect_def_type): Add vect_early_exit_def.
	(LOOP_VINFO_EARLY_BREAKS, LOOP_VINFO_GENERAL_CTR_FLOW): New.
	(loop_vec_info_for_loop): Change to const and static.
	(is_loop_header_bb_p): Drop assert.
	(slpeel_can_duplicate_loop_p): Update prototype.
	(class loop): Add early_breaks and non_break_control_flow.

--- inline copy of patch -- 
diff --git a/gcc/tree-loop-distribution.cc b/gcc/tree-loop-distribution.cc
index 97879498db46dd3c34181ae9aa6e5476004dd5b5..d790ce5fffab3aa3dfc40d833a968314a4442b9e 100644




--
diff --git a/gcc/tree-loop-distribution.cc b/gcc/tree-loop-distribution.cc
index 97879498db46dd3c34181ae9aa6e5476004dd5b5..d790ce5fffab3aa3dfc40d833a968314a4442b9e 100644
--- a/gcc/tree-loop-distribution.cc
+++ b/gcc/tree-loop-distribution.cc
@@ -948,7 +948,7 @@ copy_loop_before (class loop *loop, bool redirect_lc_phi_defs)
   edge preheader = loop_preheader_edge (loop);
 
   initialize_original_copy_tables ();
-  res = slpeel_tree_duplicate_loop_to_edge_cfg (loop, NULL, preheader);
+  res = slpeel_tree_duplicate_loop_to_edge_cfg (loop, NULL, preheader, false);
   gcc_assert (res != NULL);
 
   /* When a not last partition is supposed to keep the LC PHIs computed
diff --git a/gcc/tree-ssa-loop-niter.cc b/gcc/tree-ssa-loop-niter.cc
index 5d398b67e68c7076760854119590f18b19c622b6..79686f6c4945b7139ba377300430c04b7aeefe6c 100644
--- a/gcc/tree-ssa-loop-niter.cc
+++ b/gcc/tree-ssa-loop-niter.cc
@@ -3072,7 +3072,12 @@ loop_only_exit_p (const class loop *loop, basic_block *body, const_edge exit)
   gimple_stmt_iterator bsi;
   unsigned i;
 
-  if (exit != single_exit (loop))
+  /* We need to check for alternative exits since exit can be NULL.  */
+  auto exits = get_loop_exit_edges (loop);
+  if (exits.length () != 1)
+    return false;
+
+  if (exit != exits[0])
     return false;
 
   for (i = 0; i < loop->num_nodes; i++)
diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
index 6b93fb3f9af8f2bbdf5dec28f0009177aa5171ab..550d7f40002cf0b58f8a927cb150edd7c2aa9999 100644
--- a/gcc/tree-vect-loop-manip.cc
+++ b/gcc/tree-vect-loop-manip.cc
@@ -252,6 +252,9 @@ adjust_phi_and_debug_stmts (gimple *update_phi, edge e, tree new_def)
 {
   tree orig_def = PHI_ARG_DEF_FROM_EDGE (update_phi, e);
 
+  gcc_assert (TREE_CODE (orig_def) != SSA_NAME
+	      || orig_def != new_def);
+
   SET_PHI_ARG_DEF (update_phi, e->dest_idx, new_def);
 
   if (MAY_HAVE_DEBUG_BIND_STMTS)
@@ -1292,7 +1295,8 @@ vect_set_loop_condition_normal (loop_vec_info loop_vinfo,
   gsi_insert_before (&loop_cond_gsi, cond_stmt, GSI_SAME_STMT);
 
   /* Record the number of latch iterations.  */
-  if (limit == niters)
+  if (limit == niters
+      || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
     /* Case A: the loop iterates NITERS times.  Subtract one to get the
        latch count.  */
     loop->nb_iterations = fold_build2 (MINUS_EXPR, niters_type, niters,
@@ -1303,7 +1307,13 @@ vect_set_loop_condition_normal (loop_vec_info loop_vinfo,
     loop->nb_iterations = fold_build2 (TRUNC_DIV_EXPR, niters_type,
 				       limit, step);
 
-  if (final_iv)
+  /* For multiple exits we've already maintained LCSSA form and handled
+     the scalar iteration update in the code that deals with the merge
+     block and its updated guard.  I could move that code here instead
+     of in vect_update_ivs_after_early_break but I have to still deal
+     with the updates to the counter `i`.  So for now I'll keep them
+     together.  */
+  if (final_iv && !LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
     {
       gassign *assign;
       edge exit = LOOP_VINFO_IV_EXIT (loop_vinfo);
@@ -1509,11 +1519,19 @@ vec_init_exit_info (class loop *loop)
    on E which is either the entry or exit of LOOP.  If SCALAR_LOOP is
    non-NULL, assume LOOP and SCALAR_LOOP are equivalent and copy the
    basic blocks from SCALAR_LOOP instead of LOOP, but to either the
-   entry or exit of LOOP.  */
+   entry or exit of LOOP.  If FLOW_LOOPS then connect LOOP to SCALAR_LOOP as a
+   continuation.  This is correct for cases where one loop continues from the
+   other like in the vectorizer, but not true for uses in e.g. loop distribution
+   where the loop is duplicated and then modified.
+
+   If UPDATED_DOMS is not NULL it is update with the list of basic blocks whoms
+   dominators were updated during the peeling.  */
 
 class loop *
 slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop,
-					class loop *scalar_loop, edge e)
+					class loop *scalar_loop, edge e,
+					bool flow_loops,
+					vec<basic_block> *updated_doms)
 {
   class loop *new_loop;
   basic_block *new_bbs, *bbs, *pbbs;
@@ -1602,6 +1620,19 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop,
   for (unsigned i = (at_exit ? 0 : 1); i < scalar_loop->num_nodes + 1; i++)
     rename_variables_in_bb (new_bbs[i], duplicate_outer_loop);
 
+  /* Rename the exit uses.  */
+  for (edge exit : get_loop_exit_edges (new_loop))
+    for (auto gsi = gsi_start_phis (exit->dest);
+	 !gsi_end_p (gsi); gsi_next (&gsi))
+      {
+	tree orig_def = PHI_ARG_DEF_FROM_EDGE (gsi.phi (), exit);
+	rename_use_op (PHI_ARG_DEF_PTR_FROM_EDGE (gsi.phi (), exit));
+	if (MAY_HAVE_DEBUG_BIND_STMTS)
+	  adjust_debug_stmts (orig_def, PHI_RESULT (gsi.phi ()), exit->dest);
+      }
+
+  /* This condition happens when the loop has been versioned. e.g. due to ifcvt
+     versioning the loop.  */
   if (scalar_loop != loop)
     {
       /* If we copied from SCALAR_LOOP rather than LOOP, SSA_NAMEs from
@@ -1616,28 +1647,106 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop,
 						EDGE_SUCC (loop->latch, 0));
     }
 
+  vec<edge> alt_exits = loop->vec_loop_alt_exits;
+  bool multiple_exits_p = !alt_exits.is_empty ();
+  auto_vec<basic_block> doms;
+  class loop *update_loop = NULL;
+
   if (at_exit) /* Add the loop copy at exit.  */
     {
-      if (scalar_loop != loop)
+      if (scalar_loop != loop && new_exit->dest != exit_dest)
 	{
-	  gphi_iterator gsi;
 	  new_exit = redirect_edge_and_branch (new_exit, exit_dest);
+	  flush_pending_stmts (new_exit);
+	}
 
-	  for (gsi = gsi_start_phis (exit_dest); !gsi_end_p (gsi);
-	       gsi_next (&gsi))
+      auto loop_exits = get_loop_exit_edges (loop);
+      for (edge exit : loop_exits)
+	redirect_edge_and_branch (exit, new_preheader);
+
+
+      /* Copy the current loop LC PHI nodes between the original loop exit
+	 block and the new loop header.  This allows us to later split the
+	 preheader block and still find the right LC nodes.  */
+      edge latch_new = single_succ_edge (new_preheader);
+      edge latch_old = loop_latch_edge (loop);
+      hash_set <tree> lcssa_vars;
+      for (auto gsi_from = gsi_start_phis (latch_old->dest),
+	   gsi_to = gsi_start_phis (latch_new->dest);
+	   flow_loops && !gsi_end_p (gsi_from) && !gsi_end_p (gsi_to);
+	   gsi_next (&gsi_from), gsi_next (&gsi_to))
+	{
+	  gimple *from_phi = gsi_stmt (gsi_from);
+	  gimple *to_phi = gsi_stmt (gsi_to);
+	  tree new_arg = PHI_ARG_DEF_FROM_EDGE (from_phi, latch_old);
+	  /* In all cases, even in early break situations we're only
+	     interested in the number of fully executed loop iters.  As such
+	     we discard any partially done iteration.  So we simply propagate
+	     the phi nodes from the latch to the merge block.  */
+	  tree new_res = copy_ssa_name (gimple_phi_result (from_phi));
+	  gphi *lcssa_phi = create_phi_node (new_res, e->dest);
+
+	  lcssa_vars.add (new_arg);
+
+	  /* Main loop exit should use the final iter value.  */
+	  add_phi_arg (lcssa_phi, new_arg, loop->vec_loop_iv, UNKNOWN_LOCATION);
+
+	  /* All other exits use the previous iters.  */
+	  for (edge e : alt_exits)
+	    add_phi_arg (lcssa_phi, gimple_phi_result (from_phi), e,
+			 UNKNOWN_LOCATION);
+
+	  adjust_phi_and_debug_stmts (to_phi, latch_new, new_res);
+	}
+
+      /* Copy over any live SSA vars that may not have been materialized in the
+	 loops themselves but would be in the exit block.  However when the live
+	 value is not used inside the loop then we don't need to do this,  if we do
+	 then when we split the guard block the branch edge can end up containing the
+	 wrong reference,  particularly if it shares an edge with something that has
+	 bypassed the loop.  This is not something peeling can check so we need to
+	 anticipate the usage of the live variable here.  */
+      auto exit_map = redirect_edge_var_map_vector (exit);
+      if (exit_map)
+        for (auto vm : exit_map)
+	{
+	  if (lcssa_vars.contains (vm.def)
+	      || TREE_CODE (vm.def) != SSA_NAME)
+	    continue;
+
+	  imm_use_iterator imm_iter;
+	  use_operand_p use_p;
+	  bool use_in_loop = false;
+
+	  FOR_EACH_IMM_USE_FAST (use_p, imm_iter, vm.def)
 	    {
-	      gphi *phi = gsi.phi ();
-	      tree orig_arg = PHI_ARG_DEF_FROM_EDGE (phi, e);
-	      location_t orig_locus
-		= gimple_phi_arg_location_from_edge (phi, e);
+	      basic_block bb = gimple_bb (USE_STMT (use_p));
+	      if (flow_bb_inside_loop_p (loop, bb)
+		  && !gimple_vuse (USE_STMT (use_p)))
+		{
+		  use_in_loop = true;
+		  break;
+		}
+	    }
 
-	      add_phi_arg (phi, orig_arg, new_exit, orig_locus);
+	  if (!use_in_loop)
+	    {
+	       /* Do a final check to see if it's perhaps defined in the loop.  This
+		  mirrors the relevancy analysis's used_outside_scope.  */
+	      gimple *stmt = SSA_NAME_DEF_STMT (vm.def);
+	      if (!stmt || !flow_bb_inside_loop_p (loop, gimple_bb (stmt)))
+		continue;
 	    }
+
+	  tree new_res = copy_ssa_name (vm.result);
+	  gphi *lcssa_phi = create_phi_node (new_res, e->dest);
+	  for (edge exit : loop_exits)
+	     add_phi_arg (lcssa_phi, vm.def, exit, vm.locus);
 	}
-      redirect_edge_and_branch_force (e, new_preheader);
-      flush_pending_stmts (e);
+
       set_immediate_dominator (CDI_DOMINATORS, new_preheader, e->src);
-      if (was_imm_dom || duplicate_outer_loop)
+
+      if ((was_imm_dom || duplicate_outer_loop) && !multiple_exits_p)
 	set_immediate_dominator (CDI_DOMINATORS, exit_dest, new_exit->src);
 
       /* And remove the non-necessary forwarder again.  Keep the other
@@ -1647,9 +1756,42 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop,
       delete_basic_block (preheader);
       set_immediate_dominator (CDI_DOMINATORS, scalar_loop->header,
 			       loop_preheader_edge (scalar_loop)->src);
+
+      /* Finally after wiring the new epilogue we need to update its main exit
+	 to the original function exit we recorded.  Other exits are already
+	 correct.  */
+      if (multiple_exits_p)
+	{
+	  for (edge e : get_loop_exit_edges (loop))
+	    doms.safe_push (e->dest);
+	  update_loop = new_loop;
+	  doms.safe_push (exit_dest);
+
+	  /* Likely a fall-through edge, so update if needed.  */
+	  if (single_succ_p (exit_dest))
+	    doms.safe_push (single_succ (exit_dest));
+	}
     }
   else /* Add the copy at entry.  */
     {
+      /* Copy the current loop LC PHI nodes between the original loop exit
+	 block and the new loop header.  This allows us to later split the
+	 preheader block and still find the right LC nodes.  */
+      edge old_latch_loop = loop_latch_edge (loop);
+      edge old_latch_init = loop_preheader_edge (loop);
+      edge new_latch_loop = loop_latch_edge (new_loop);
+      edge new_latch_init = loop_preheader_edge (new_loop);
+      for (auto gsi_from = gsi_start_phis (new_latch_init->dest),
+	   gsi_to = gsi_start_phis (old_latch_loop->dest);
+	   flow_loops && !gsi_end_p (gsi_from) && !gsi_end_p (gsi_to);
+	   gsi_next (&gsi_from), gsi_next (&gsi_to))
+	{
+	  gimple *from_phi = gsi_stmt (gsi_from);
+	  gimple *to_phi = gsi_stmt (gsi_to);
+	  tree new_arg = PHI_ARG_DEF_FROM_EDGE (from_phi, new_latch_loop);
+	  adjust_phi_and_debug_stmts (to_phi, old_latch_init, new_arg);
+	}
+
       if (scalar_loop != loop)
 	{
 	  /* Remove the non-necessary forwarder of scalar_loop again.  */
@@ -1677,31 +1819,36 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop,
       delete_basic_block (new_preheader);
       set_immediate_dominator (CDI_DOMINATORS, new_loop->header,
 			       loop_preheader_edge (new_loop)->src);
+
+      if (multiple_exits_p)
+	update_loop = loop;
     }
 
-  if (scalar_loop != loop)
+  if (multiple_exits_p)
     {
-      /* Update new_loop->header PHIs, so that on the preheader
-	 edge they are the ones from loop rather than scalar_loop.  */
-      gphi_iterator gsi_orig, gsi_new;
-      edge orig_e = loop_preheader_edge (loop);
-      edge new_e = loop_preheader_edge (new_loop);
-
-      for (gsi_orig = gsi_start_phis (loop->header),
-	   gsi_new = gsi_start_phis (new_loop->header);
-	   !gsi_end_p (gsi_orig) && !gsi_end_p (gsi_new);
-	   gsi_next (&gsi_orig), gsi_next (&gsi_new))
+      for (edge e : get_loop_exit_edges (update_loop))
 	{
-	  gphi *orig_phi = gsi_orig.phi ();
-	  gphi *new_phi = gsi_new.phi ();
-	  tree orig_arg = PHI_ARG_DEF_FROM_EDGE (orig_phi, orig_e);
-	  location_t orig_locus
-	    = gimple_phi_arg_location_from_edge (orig_phi, orig_e);
-
-	  add_phi_arg (new_phi, orig_arg, new_e, orig_locus);
+	  edge ex;
+	  edge_iterator ei;
+	  FOR_EACH_EDGE (ex, ei, e->dest->succs)
+	    {
+	      /* Find the first non-fallthrough block as fall-throughs can't
+		 dominate other blocks.  */
+	      while ((ex->flags & EDGE_FALLTHRU)
+		     && single_succ_p (ex->dest))
+		{
+		  doms.safe_push (ex->dest);
+		  ex = single_succ_edge (ex->dest);
+		}
+	      doms.safe_push (ex->dest);
+	    }
+	  doms.safe_push (e->dest);
 	}
-    }
 
+      iterate_fix_dominators (CDI_DOMINATORS, doms, false);
+      if (updated_doms)
+	updated_doms->safe_splice (doms);
+    }
   free (new_bbs);
   free (bbs);
 
@@ -1777,6 +1924,9 @@ slpeel_can_duplicate_loop_p (const loop_vec_info loop_vinfo, const_edge e)
   gimple_stmt_iterator loop_exit_gsi = gsi_last_bb (exit_e->src);
   unsigned int num_bb = loop->inner? 5 : 2;
 
+  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
+    num_bb += LOOP_VINFO_ALT_EXITS (loop_vinfo).length ();
+
   /* All loops have an outer scope; the only case loop->outer is NULL is for
      the function itself.  */
   if (!loop_outer (loop)
@@ -2044,6 +2194,11 @@ vect_update_ivs_after_vectorizer (loop_vec_info loop_vinfo,
   class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
   basic_block update_bb = update_e->dest;
 
+  /* For early exits we'll update the IVs in
+     vect_update_ivs_after_early_break.  */
+  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
+    return;
+
   basic_block exit_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
 
   /* Make sure there exists a single-predecessor exit bb:  */
@@ -2131,6 +2286,208 @@ vect_update_ivs_after_vectorizer (loop_vec_info loop_vinfo,
       /* Fix phi expressions in the successor bb.  */
       adjust_phi_and_debug_stmts (phi1, update_e, ni_name);
     }
+  return;
+}
+
+/*   Function vect_update_ivs_after_early_break.
+
+     "Advance" the induction variables of LOOP to the value they should take
+     after the execution of LOOP.  This is currently necessary because the
+     vectorizer does not handle induction variables that are used after the
+     loop.  Such a situation occurs when the last iterations of LOOP are
+     peeled, because of the early exit.  With an early exit we always peel the
+     loop.
+
+     Input:
+     - LOOP_VINFO - a loop info structure for the loop that is going to be
+		    vectorized. The last few iterations of LOOP were peeled.
+     - LOOP - a loop that is going to be vectorized. The last few iterations
+	      of LOOP were peeled.
+     - VF - The loop vectorization factor.
+     - NITERS_ORIG - the number of iterations that LOOP executes (before it is
+		     vectorized). i.e, the number of times the ivs should be
+		     bumped.
+     - NITERS_VECTOR - The number of iterations that the vector LOOP executes.
+     - UPDATE_E - a successor edge of LOOP->exit that is on the (only) path
+		  coming out from LOOP on which there are uses of the LOOP ivs
+		  (this is the path from LOOP->exit to epilog_loop->preheader).
+
+		  The new definitions of the ivs are placed in LOOP->exit.
+		  The phi args associated with the edge UPDATE_E in the bb
+		  UPDATE_E->dest are updated accordingly.
+
+     Output:
+       - If available, the LCSSA phi node for the loop IV temp.
+
+     Assumption 1: Like the rest of the vectorizer, this function assumes
+     a single loop exit that has a single predecessor.
+
+     Assumption 2: The phi nodes in the LOOP header and in update_bb are
+     organized in the same order.
+
+     Assumption 3: The access function of the ivs is simple enough (see
+     vect_can_advance_ivs_p).  This assumption will be relaxed in the future.
+
+     Assumption 4: Exactly one of the successors of LOOP exit-bb is on a path
+     coming out of LOOP on which the ivs of LOOP are used (this is the path
+     that leads to the epilog loop; other paths skip the epilog loop).  This
+     path starts with the edge UPDATE_E, and its destination (denoted update_bb)
+     needs to have its phis updated.
+ */
+
+static tree
+vect_update_ivs_after_early_break (loop_vec_info loop_vinfo, class loop * epilog,
+				   poly_int64 vf, tree niters_orig,
+				   tree niters_vector, edge update_e)
+{
+  if (!LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
+    return NULL;
+
+  gphi_iterator gsi, gsi1;
+  tree ni_name, ivtmp = NULL;
+  basic_block update_bb = update_e->dest;
+  vec<edge> alt_exits = LOOP_VINFO_ALT_EXITS (loop_vinfo);
+  edge loop_iv = LOOP_VINFO_IV_EXIT (loop_vinfo);
+  basic_block exit_bb = loop_iv->dest;
+  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  gcond *cond = LOOP_VINFO_LOOP_IV_COND (loop_vinfo);
+
+  gcc_assert (cond);
+
+  for (gsi = gsi_start_phis (loop->header), gsi1 = gsi_start_phis (update_bb);
+       !gsi_end_p (gsi) && !gsi_end_p (gsi1);
+       gsi_next (&gsi), gsi_next (&gsi1))
+    {
+      tree init_expr, final_expr, step_expr;
+      tree type;
+      tree var, ni, off;
+      gimple_stmt_iterator last_gsi;
+
+      gphi *phi = gsi1.phi ();
+      tree phi_ssa = PHI_ARG_DEF_FROM_EDGE (phi, loop_preheader_edge (epilog));
+      gphi *phi1 = dyn_cast <gphi *> (SSA_NAME_DEF_STMT (phi_ssa));
+      if (!phi1)
+	continue;
+      stmt_vec_info phi_info = loop_vinfo->lookup_stmt (gsi.phi ());
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "vect_update_ivs_after_early_break: phi: %G",
+			 (gimple *)phi);
+
+      /* Skip reduction and virtual phis.  */
+      if (!iv_phi_p (phi_info))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "reduc or virtual phi. skip.\n");
+	  continue;
+	}
+
+      /* For multiple exits where we handle early exits we need to carry on
+	 with the previous IV as loop iteration was not done because we exited
+	 early.  As such just grab the original IV.  */
+      phi_ssa = PHI_ARG_DEF_FROM_EDGE (gsi.phi (), loop_latch_edge (loop));
+      if (gimple_cond_lhs (cond) != phi_ssa
+	  && gimple_cond_rhs (cond) != phi_ssa)
+	{
+	  type = TREE_TYPE (gimple_phi_result (phi));
+	  step_expr = STMT_VINFO_LOOP_PHI_EVOLUTION_PART (phi_info);
+	  step_expr = unshare_expr (step_expr);
+
+	  /* We previously generated the new merged phi in the same BB as the
+	     guard.  So use that to perform the scaling on rather than the
+	     normal loop phi which don't take the early breaks into account.  */
+	  final_expr = gimple_phi_result (phi1);
+	  init_expr = PHI_ARG_DEF_FROM_EDGE (gsi.phi (), loop_preheader_edge (loop));
+
+	  tree stype = TREE_TYPE (step_expr);
+	  /* For early break the final loop IV is:
+	     init + (final - init) * vf which takes into account peeling
+	     values and non-single steps.  */
+	  off = fold_build2 (MINUS_EXPR, stype,
+			     fold_convert (stype, final_expr),
+			     fold_convert (stype, init_expr));
+	  /* Now adjust for VF to get the final iteration value.  */
+	  off = fold_build2 (MULT_EXPR, stype, off, build_int_cst (stype, vf));
+
+	  /* Adjust the value with the offset.  */
+	  if (POINTER_TYPE_P (type))
+	    ni = fold_build_pointer_plus (init_expr, off);
+	  else
+	    ni = fold_convert (type,
+			       fold_build2 (PLUS_EXPR, stype,
+					    fold_convert (stype, init_expr),
+					    off));
+	  var = create_tmp_var (type, "tmp");
+
+	  last_gsi = gsi_last_bb (exit_bb);
+	  gimple_seq new_stmts = NULL;
+	  ni_name = force_gimple_operand (ni, &new_stmts, false, var);
+	  /* Exit_bb shouldn't be empty.  */
+	  if (!gsi_end_p (last_gsi))
+	    gsi_insert_seq_after (&last_gsi, new_stmts, GSI_SAME_STMT);
+	  else
+	    gsi_insert_seq_before (&last_gsi, new_stmts, GSI_SAME_STMT);
+
+	  /* Fix phi expressions in the successor bb.  */
+	  adjust_phi_and_debug_stmts (phi, update_e, ni_name);
+	}
+      else
+	{
+	  type = TREE_TYPE (gimple_phi_result (phi));
+	  step_expr = STMT_VINFO_LOOP_PHI_EVOLUTION_PART (phi_info);
+	  step_expr = unshare_expr (step_expr);
+
+	  /* We previously generated the new merged phi in the same BB as the
+	     guard.  So use that to perform the scaling on rather than the
+	     normal loop phi which don't take the early breaks into account.  */
+	  init_expr = PHI_ARG_DEF_FROM_EDGE (phi1, loop_preheader_edge (loop));
+	  tree stype = TREE_TYPE (step_expr);
+
+	  if (vf.is_constant ())
+	    {
+	      ni = fold_build2 (MULT_EXPR, stype,
+				fold_convert (stype,
+					      niters_vector),
+				build_int_cst (stype, vf));
+
+	      ni = fold_build2 (MINUS_EXPR, stype,
+				fold_convert (stype,
+					      niters_orig),
+				fold_convert (stype, ni));
+	    }
+	  else
+	    /* If the loop's VF isn't constant then the loop must have been
+	       masked, so at the end of the loop we know we have finished
+	       the entire loop and found nothing.  */
+	    ni = build_zero_cst (stype);
+
+	  ni = fold_convert (type, ni);
+	  /* We don't support variable n in this version yet.  */
+	  gcc_assert (TREE_CODE (ni) == INTEGER_CST);
+
+	  var = create_tmp_var (type, "tmp");
+
+	  last_gsi = gsi_last_bb (exit_bb);
+	  gimple_seq new_stmts = NULL;
+	  ni_name = force_gimple_operand (ni, &new_stmts, false, var);
+	  /* Exit_bb shouldn't be empty.  */
+	  if (!gsi_end_p (last_gsi))
+	    gsi_insert_seq_after (&last_gsi, new_stmts, GSI_SAME_STMT);
+	  else
+	    gsi_insert_seq_before (&last_gsi, new_stmts, GSI_SAME_STMT);
+
+	  adjust_phi_and_debug_stmts (phi1, loop_iv, ni_name);
+
+	  for (edge exit : alt_exits)
+	    adjust_phi_and_debug_stmts (phi1, exit,
+					build_int_cst (TREE_TYPE (step_expr),
+						       vf));
+	  ivtmp = gimple_phi_result (phi1);
+	}
+    }
+
+  return ivtmp;
 }
 
 /* Return a gimple value containing the misalignment (measured in vector
@@ -2632,137 +2989,34 @@ vect_gen_vector_loop_niters_mult_vf (loop_vec_info loop_vinfo,
 
 /* LCSSA_PHI is a lcssa phi of EPILOG loop which is copied from LOOP,
    this function searches for the corresponding lcssa phi node in exit
-   bb of LOOP.  If it is found, return the phi result; otherwise return
-   NULL.  */
+   bb of LOOP following the LCSSA_EDGE to the exit node.  If it is found,
+   return the phi result; otherwise return NULL.  */
 
 static tree
 find_guard_arg (class loop *loop, class loop *epilog ATTRIBUTE_UNUSED,
-		gphi *lcssa_phi)
+		gphi *lcssa_phi, int lcssa_edge = 0)
 {
   gphi_iterator gsi;
   edge e = loop->vec_loop_iv;
 
-  gcc_assert (single_pred_p (e->dest));
   for (gsi = gsi_start_phis (e->dest); !gsi_end_p (gsi); gsi_next (&gsi))
     {
       gphi *phi = gsi.phi ();
-      if (operand_equal_p (PHI_ARG_DEF (phi, 0),
-			   PHI_ARG_DEF (lcssa_phi, 0), 0))
-	return PHI_RESULT (phi);
-    }
-  return NULL_TREE;
-}
-
-/* Function slpeel_tree_duplicate_loop_to_edge_cfg duplciates FIRST/SECOND
-   from SECOND/FIRST and puts it at the original loop's preheader/exit
-   edge, the two loops are arranged as below:
-
-       preheader_a:
-     first_loop:
-       header_a:
-	 i_1 = PHI<i_0, i_2>;
-	 ...
-	 i_2 = i_1 + 1;
-	 if (cond_a)
-	   goto latch_a;
-	 else
-	   goto between_bb;
-       latch_a:
-	 goto header_a;
-
-       between_bb:
-	 ;; i_x = PHI<i_2>;   ;; LCSSA phi node to be created for FIRST,
-
-     second_loop:
-       header_b:
-	 i_3 = PHI<i_0, i_4>; ;; Use of i_0 to be replaced with i_x,
-				 or with i_2 if no LCSSA phi is created
-				 under condition of CREATE_LCSSA_FOR_IV_PHIS.
-	 ...
-	 i_4 = i_3 + 1;
-	 if (cond_b)
-	   goto latch_b;
-	 else
-	   goto exit_bb;
-       latch_b:
-	 goto header_b;
-
-       exit_bb:
-
-   This function creates loop closed SSA for the first loop; update the
-   second loop's PHI nodes by replacing argument on incoming edge with the
-   result of newly created lcssa PHI nodes.  IF CREATE_LCSSA_FOR_IV_PHIS
-   is false, Loop closed ssa phis will only be created for non-iv phis for
-   the first loop.
-
-   This function assumes exit bb of the first loop is preheader bb of the
-   second loop, i.e, between_bb in the example code.  With PHIs updated,
-   the second loop will execute rest iterations of the first.  */
-
-static void
-slpeel_update_phi_nodes_for_loops (loop_vec_info loop_vinfo,
-				   class loop *first, class loop *second,
-				   bool create_lcssa_for_iv_phis)
-{
-  gphi_iterator gsi_update, gsi_orig;
-  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
-
-  edge first_latch_e = EDGE_SUCC (first->latch, 0);
-  edge second_preheader_e = loop_preheader_edge (second);
-  basic_block between_bb = single_exit (first)->dest;
-
-  gcc_assert (between_bb == second_preheader_e->src);
-  gcc_assert (single_pred_p (between_bb) && single_succ_p (between_bb));
-  /* Either the first loop or the second is the loop to be vectorized.  */
-  gcc_assert (loop == first || loop == second);
-
-  for (gsi_orig = gsi_start_phis (first->header),
-       gsi_update = gsi_start_phis (second->header);
-       !gsi_end_p (gsi_orig) && !gsi_end_p (gsi_update);
-       gsi_next (&gsi_orig), gsi_next (&gsi_update))
-    {
-      gphi *orig_phi = gsi_orig.phi ();
-      gphi *update_phi = gsi_update.phi ();
-
-      tree arg = PHI_ARG_DEF_FROM_EDGE (orig_phi, first_latch_e);
-      /* Generate lcssa PHI node for the first loop.  */
-      gphi *vect_phi = (loop == first) ? orig_phi : update_phi;
-      stmt_vec_info vect_phi_info = loop_vinfo->lookup_stmt (vect_phi);
-      if (create_lcssa_for_iv_phis || !iv_phi_p (vect_phi_info))
+      /* Nested loops with multiple exits can have different no# phi node
+	 arguments between the main loop and epilog as epilog falls to the
+	 second loop.  */
+      if (gimple_phi_num_args (phi) > e->dest_idx)
 	{
-	  tree new_res = copy_ssa_name (PHI_RESULT (orig_phi));
-	  gphi *lcssa_phi = create_phi_node (new_res, between_bb);
-	  add_phi_arg (lcssa_phi, arg, single_exit (first), UNKNOWN_LOCATION);
-	  arg = new_res;
-	}
-
-      /* Update PHI node in the second loop by replacing arg on the loop's
-	 incoming edge.  */
-      adjust_phi_and_debug_stmts (update_phi, second_preheader_e, arg);
-    }
-
-  /* For epilogue peeling we have to make sure to copy all LC PHIs
-     for correct vectorization of live stmts.  */
-  if (loop == first)
-    {
-      basic_block orig_exit = single_exit (second)->dest;
-      for (gsi_orig = gsi_start_phis (orig_exit);
-	   !gsi_end_p (gsi_orig); gsi_next (&gsi_orig))
-	{
-	  gphi *orig_phi = gsi_orig.phi ();
-	  tree orig_arg = PHI_ARG_DEF (orig_phi, 0);
-	  if (TREE_CODE (orig_arg) != SSA_NAME || virtual_operand_p  (orig_arg))
-	    continue;
-
-	  /* Already created in the above loop.   */
-	  if (find_guard_arg (first, second, orig_phi))
+	  tree var = PHI_ARG_DEF (phi, e->dest_idx);
+	  if (TREE_CODE (var) != SSA_NAME)
 	    continue;
 
-	  tree new_res = copy_ssa_name (orig_arg);
-	  gphi *lcphi = create_phi_node (new_res, between_bb);
-	  add_phi_arg (lcphi, orig_arg, single_exit (first), UNKNOWN_LOCATION);
+	  if (operand_equal_p (get_current_def (var),
+			       PHI_ARG_DEF (lcssa_phi, lcssa_edge), 0))
+	    return PHI_RESULT (phi);
 	}
     }
+  return NULL_TREE;
 }
 
 /* Function slpeel_add_loop_guard adds guard skipping from the beginning
@@ -2910,13 +3164,11 @@ slpeel_update_phi_nodes_for_guard2 (class loop *loop, class loop *epilog,
   gcc_assert (single_succ_p (merge_bb));
   edge e = single_succ_edge (merge_bb);
   basic_block exit_bb = e->dest;
-  gcc_assert (single_pred_p (exit_bb));
-  gcc_assert (single_pred (exit_bb) == single_exit (epilog)->dest);
 
   for (gsi = gsi_start_phis (exit_bb); !gsi_end_p (gsi); gsi_next (&gsi))
     {
       gphi *update_phi = gsi.phi ();
-      tree old_arg = PHI_ARG_DEF (update_phi, 0);
+      tree old_arg = PHI_ARG_DEF (update_phi, e->dest_idx);
 
       tree merge_arg = NULL_TREE;
 
@@ -2928,7 +3180,7 @@ slpeel_update_phi_nodes_for_guard2 (class loop *loop, class loop *epilog,
       if (!merge_arg)
 	merge_arg = old_arg;
 
-      tree guard_arg = find_guard_arg (loop, epilog, update_phi);
+      tree guard_arg = find_guard_arg (loop, epilog, update_phi, e->dest_idx);
       /* If the var is live after loop but not a reduction, we simply
 	 use the old arg.  */
       if (!guard_arg)
@@ -2948,21 +3200,6 @@ slpeel_update_phi_nodes_for_guard2 (class loop *loop, class loop *epilog,
     }
 }
 
-/* EPILOG loop is duplicated from the original loop for vectorizing,
-   the arg of its loop closed ssa PHI needs to be updated.  */
-
-static void
-slpeel_update_phi_nodes_for_lcssa (class loop *epilog)
-{
-  gphi_iterator gsi;
-  basic_block exit_bb = single_exit (epilog)->dest;
-
-  gcc_assert (single_pred_p (exit_bb));
-  edge e = EDGE_PRED (exit_bb, 0);
-  for (gsi = gsi_start_phis (exit_bb); !gsi_end_p (gsi); gsi_next (&gsi))
-    rename_use_op (PHI_ARG_DEF_PTR_FROM_EDGE (gsi.phi (), e));
-}
-
 /* EPILOGUE_VINFO is an epilogue loop that we now know would need to
    iterate exactly CONST_NITERS times.  Make a final decision about
    whether the epilogue loop should be used, returning true if so.  */
@@ -3138,6 +3375,14 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
     bound_epilog += vf - 1;
   if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
     bound_epilog += 1;
+  /* For early breaks the scalar loop needs to execute at most VF times
+     to find the element that caused the break.  */
+  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
+    {
+      bound_epilog = vf;
+      /* Force a scalar epilogue as we can't vectorize the index finding.  */
+      vect_epilogues = false;
+    }
   bool epilog_peeling = maybe_ne (bound_epilog, 0U);
   poly_uint64 bound_scalar = bound_epilog;
 
@@ -3297,16 +3542,24 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 				  bound_prolog + bound_epilog)
 		      : (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
 			 || vect_epilogues));
+
+  /* We only support early break vectorization on known bounds at this time.
+     This means that if the vector loop can't be entered then we won't generate
+     it at all.  So for now force skip_vector off because the additional control
+     flow messes with the BB exits and we've already analyzed them.  */
+ skip_vector = skip_vector && !LOOP_VINFO_EARLY_BREAKS (loop_vinfo);
+
   /* Epilog loop must be executed if the number of iterations for epilog
      loop is known at compile time, otherwise we need to add a check at
      the end of vector loop and skip to the end of epilog loop.  */
   bool skip_epilog = (prolog_peeling < 0
 		      || !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
 		      || !vf.is_constant ());
-  /* PEELING_FOR_GAPS is special because epilog loop must be executed.  */
-  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
+  /* PEELING_FOR_GAPS and peeling for early breaks are special because epilog
+     loop must be executed.  */
+  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
+      || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
     skip_epilog = false;
-
   class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
   auto_vec<profile_count> original_counts;
   basic_block *original_bbs = NULL;
@@ -3344,13 +3597,13 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
   if (prolog_peeling)
     {
       e = loop_preheader_edge (loop);
-      gcc_checking_assert (slpeel_can_duplicate_loop_p (loop, e));
-
+      gcc_checking_assert (slpeel_can_duplicate_loop_p (loop_vinfo, e));
       /* Peel prolog and put it on preheader edge of loop.  */
-      prolog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, scalar_loop, e);
+      prolog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, scalar_loop, e,
+						       true);
       gcc_assert (prolog);
       prolog->force_vectorize = false;
-      slpeel_update_phi_nodes_for_loops (loop_vinfo, prolog, loop, true);
+
       first_loop = prolog;
       reset_original_copy_tables ();
 
@@ -3420,11 +3673,12 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 	 as the transformations mentioned above make less or no sense when not
 	 vectorizing.  */
       epilog = vect_epilogues ? get_loop_copy (loop) : scalar_loop;
-      epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, epilog, e);
+      auto_vec<basic_block> doms;
+      epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, epilog, e, true,
+						       &doms);
       gcc_assert (epilog);
 
       epilog->force_vectorize = false;
-      slpeel_update_phi_nodes_for_loops (loop_vinfo, loop, epilog, false);
 
       /* Scalar version loop may be preferred.  In this case, add guard
 	 and skip to epilog.  Note this only happens when the number of
@@ -3496,6 +3750,54 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
       vect_update_ivs_after_vectorizer (loop_vinfo, niters_vector_mult_vf,
 					update_e);
 
+      /* For early breaks we must create a guard to check how many iterations
+	 of the scalar loop are yet to be performed.  */
+      if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
+	{
+	  tree ivtmp =
+	    vect_update_ivs_after_early_break (loop_vinfo, epilog, vf, niters,
+					       *niters_vector, update_e);
+
+	  gcc_assert (ivtmp);
+	  tree guard_cond = fold_build2 (EQ_EXPR, boolean_type_node,
+					 fold_convert (TREE_TYPE (niters),
+						       ivtmp),
+					 build_zero_cst (TREE_TYPE (niters)));
+	  basic_block guard_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
+
+	  /* If we had a fallthrough edge, the guard will the threaded through
+	     and so we may need to find the actual final edge.  */
+	  edge final_edge = epilog->vec_loop_iv;
+	  /* slpeel_update_phi_nodes_for_guard2 expects an empty block in
+	     between the guard and the exit edge.  It only adds new nodes and
+	     doesn't update existing one in the current scheme.  */
+	  basic_block guard_to = split_edge (final_edge);
+	  edge guard_e = slpeel_add_loop_guard (guard_bb, guard_cond, guard_to,
+						guard_bb, prob_epilog.invert (),
+						irred_flag);
+	  doms.safe_push (guard_bb);
+
+	  iterate_fix_dominators (CDI_DOMINATORS, doms, false);
+
+	  /* We must update all the edges from the new guard_bb.  */
+	  slpeel_update_phi_nodes_for_guard2 (loop, epilog, guard_e,
+					      final_edge);
+
+	  /* If the loop was versioned we'll have an intermediate BB between
+	     the guard and the exit.  This intermediate block is required
+	     because in the current scheme of things the guard block phi
+	     updating can only maintain LCSSA by creating new blocks.  In this
+	     case we just need to update the uses in this block as well.  */
+	  if (loop != scalar_loop)
+	    {
+	      for (gphi_iterator gsi = gsi_start_phis (guard_to);
+		   !gsi_end_p (gsi); gsi_next (&gsi))
+		rename_use_op (PHI_ARG_DEF_PTR_FROM_EDGE (gsi.phi (), guard_e));
+	    }
+
+	  flush_pending_stmts (guard_e);
+	}
+
       if (skip_epilog)
 	{
 	  guard_cond = fold_build2 (EQ_EXPR, boolean_type_node,
@@ -3520,8 +3822,6 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 	    }
 	  scale_loop_profile (epilog, prob_epilog, 0);
 	}
-      else
-	slpeel_update_phi_nodes_for_lcssa (epilog);
 
       unsigned HOST_WIDE_INT bound;
       if (bound_scalar.is_constant (&bound))
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index b4a98de80aa39057fc9b17977dd0e347b4f0fb5d..ab9a2048186f461f5ec49f21421958e7ee25eada 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -1007,6 +1007,8 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
     partial_load_store_bias (0),
     peeling_for_gaps (false),
     peeling_for_niter (false),
+    early_breaks (false),
+    non_break_control_flow (false),
     no_data_dependencies (false),
     has_mask_store (false),
     scalar_loop_scaling (profile_probability::uninitialized ()),
@@ -1199,6 +1201,14 @@ vect_need_peeling_or_partial_vectors_p (loop_vec_info loop_vinfo)
     th = LOOP_VINFO_COST_MODEL_THRESHOLD (LOOP_VINFO_ORIG_LOOP_INFO
 					  (loop_vinfo));
 
+  /* When we have multiple exits and VF is unknown, we must require partial
+     vectors because the loop bounds is not a minimum but a maximum.  That is to
+     say we cannot unpredicate the main loop unless we peel or use partial
+     vectors in the epilogue.  */
+  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
+      && !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ())
+    return true;
+
   if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
       && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) >= 0)
     {
@@ -1652,12 +1662,12 @@ vect_compute_single_scalar_iteration_cost (loop_vec_info loop_vinfo)
   loop_vinfo->scalar_costs->finish_cost (nullptr);
 }
 
-
 /* Function vect_analyze_loop_form.
 
    Verify that certain CFG restrictions hold, including:
    - the loop has a pre-header
-   - the loop has a single entry and exit
+   - the loop has a single entry
+   - nested loops can have only a single exit.
    - the loop exit condition is simple enough
    - the number of iterations can be analyzed, i.e, a countable loop.  The
      niter could be analyzed under some assumptions.  */
@@ -1693,11 +1703,6 @@ vect_analyze_loop_form (class loop *loop, vect_loop_form_info *info)
                            |
                         (exit-bb)  */
 
-      if (loop->num_nodes != 2)
-	return opt_result::failure_at (vect_location,
-				       "not vectorized:"
-				       " control flow in loop.\n");
-
       if (empty_block_p (loop->header))
 	return opt_result::failure_at (vect_location,
 				       "not vectorized: empty loop.\n");
@@ -1768,11 +1773,13 @@ vect_analyze_loop_form (class loop *loop, vect_loop_form_info *info)
         dump_printf_loc (MSG_NOTE, vect_location,
 			 "Considering outer-loop vectorization.\n");
       info->inner_loop_cond = inner.loop_cond;
+
+      if (!single_exit (loop))
+	return opt_result::failure_at (vect_location,
+				       "not vectorized: multiple exits.\n");
+
     }
 
-  if (!single_exit (loop))
-    return opt_result::failure_at (vect_location,
-				   "not vectorized: multiple exits.\n");
   if (EDGE_COUNT (loop->header->preds) != 2)
     return opt_result::failure_at (vect_location,
 				   "not vectorized:"
@@ -1788,11 +1795,36 @@ vect_analyze_loop_form (class loop *loop, vect_loop_form_info *info)
 				   "not vectorized: latch block not empty.\n");
 
   /* Make sure the exit is not abnormal.  */
-  edge e = single_exit (loop);
-  if (e->flags & EDGE_ABNORMAL)
-    return opt_result::failure_at (vect_location,
-				   "not vectorized:"
-				   " abnormal loop exit edge.\n");
+  auto_vec<edge> exits = get_loop_exit_edges (loop);
+  edge nexit = loop->vec_loop_iv;
+  for (edge e : exits)
+    {
+      if (e->flags & EDGE_ABNORMAL)
+	return opt_result::failure_at (vect_location,
+				       "not vectorized:"
+				       " abnormal loop exit edge.\n");
+      /* Early break BB must be after the main exit BB.  In theory we should
+	 be able to vectorize the inverse order, but the current flow in the
+	 the vectorizer always assumes you update successor PHI nodes, not
+	 preds.  */
+      if (e != nexit && !dominated_by_p (CDI_DOMINATORS, nexit->src, e->src))
+	return opt_result::failure_at (vect_location,
+				       "not vectorized:"
+				       " abnormal loop exit edge order.\n");
+    }
+
+  /* We currently only support early exit loops with known bounds.   */
+  if (exits.length () > 1)
+    {
+      class tree_niter_desc niter;
+      if (!number_of_iterations_exit_assumptions (loop, nexit, &niter, NULL)
+	  || chrec_contains_undetermined (niter.niter)
+	  || !evolution_function_is_constant_p (niter.niter))
+	return opt_result::failure_at (vect_location,
+				       "not vectorized:"
+				       " early breaks only supported on loops"
+				       " with known iteration bounds.\n");
+    }
 
   info->conds
     = vect_get_loop_niters (loop, &info->assumptions,
@@ -1866,6 +1898,10 @@ vect_create_loop_vinfo (class loop *loop, vec_info_shared *shared,
   LOOP_VINFO_LOOP_CONDS (loop_vinfo).safe_splice (info->alt_loop_conds);
   LOOP_VINFO_LOOP_IV_COND (loop_vinfo) = info->loop_cond;
 
+  /* Check to see if we're vectorizing multiple exits.  */
+  LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
+    = !LOOP_VINFO_LOOP_CONDS (loop_vinfo).is_empty ();
+
   if (info->inner_loop_cond)
     {
       stmt_vec_info inner_loop_cond_info
@@ -3070,7 +3106,8 @@ start_over:
 
   /* If an epilogue loop is required make sure we can create one.  */
   if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
-      || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo))
+      || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)
+      || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
     {
       if (dump_enabled_p ())
         dump_printf_loc (MSG_NOTE, vect_location, "epilog loop required\n");
@@ -5797,7 +5834,7 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo,
   basic_block exit_bb;
   tree scalar_dest;
   tree scalar_type;
-  gimple *new_phi = NULL, *phi;
+  gimple *new_phi = NULL, *phi = NULL;
   gimple_stmt_iterator exit_gsi;
   tree new_temp = NULL_TREE, new_name, new_scalar_dest;
   gimple *epilog_stmt = NULL;
@@ -6039,6 +6076,33 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo,
 	  new_def = gimple_convert (&stmts, vectype, new_def);
 	  reduc_inputs.quick_push (new_def);
 	}
+
+	/* Update the other exits.  */
+	if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
+	  {
+	    vec<edge> alt_exits = LOOP_VINFO_ALT_EXITS (loop_vinfo);
+	    gphi_iterator gsi, gsi1;
+	    for (edge exit : alt_exits)
+	      {
+		/* Find the phi node to propaget into the exit block for each
+		   exit edge.  */
+		for (gsi = gsi_start_phis (exit_bb),
+		     gsi1 = gsi_start_phis (exit->src);
+		     !gsi_end_p (gsi) && !gsi_end_p (gsi1);
+		     gsi_next (&gsi), gsi_next (&gsi1))
+		  {
+		    /* There really should be a function to just get the number
+		       of phis inside a bb.  */
+		    if (phi && phi == gsi.phi ())
+		      {
+			gphi *phi1 = gsi1.phi ();
+			SET_PHI_ARG_DEF (phi, exit->dest_idx,
+					 PHI_RESULT (phi1));
+			break;
+		      }
+		  }
+	      }
+	  }
       gsi_insert_seq_before (&exit_gsi, stmts, GSI_SAME_STMT);
     }
 
@@ -10355,6 +10419,13 @@ vectorizable_live_operation (vec_info *vinfo,
 	   new_tree = lane_extract <vec_lhs', ...>;
 	   lhs' = new_tree;  */
 
+      /* When vectorizing an early break, any live statement that is used
+	 outside of the loop are dead.  The loop will never get to them.
+	 We could change the liveness value during analysis instead but since
+	 the below code is invalid anyway just ignore it during codegen.  */
+      if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
+	return true;
+
       class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
       basic_block exit_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
       gcc_assert (single_pred_p (exit_bb));
@@ -11277,7 +11348,7 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
   /* Make sure there exists a single-predecessor exit bb.  Do this before 
      versioning.   */
   edge e = LOOP_VINFO_IV_EXIT (loop_vinfo);
-  if (! single_pred_p (e->dest))
+  if (e && ! single_pred_p (e->dest) && !LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
     {
       split_loop_exit_edge (e, true);
       if (dump_enabled_p ())
@@ -11303,7 +11374,7 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
   if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo))
     {
       e = single_exit (LOOP_VINFO_SCALAR_LOOP (loop_vinfo));
-      if (! single_pred_p (e->dest))
+      if (e && ! single_pred_p (e->dest))
 	{
 	  split_loop_exit_edge (e, true);
 	  if (dump_enabled_p ())
@@ -11641,7 +11712,8 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
 
   /* Loops vectorized with a variable factor won't benefit from
      unrolling/peeling.  */
-  if (!vf.is_constant ())
+  if (!vf.is_constant ()
+      && !LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
     {
       loop->unroll = 1;
       if (dump_enabled_p ())
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 87c4353fa5180fcb7f60b192897456cf24f3fdbe..03524e8500ee06df42f82afe78ee2a7c627be45b 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -344,9 +344,34 @@ vect_stmt_relevant_p (stmt_vec_info stmt_info, loop_vec_info loop_vinfo,
   *live_p = false;
 
   /* cond stmt other than loop exit cond.  */
-  if (is_ctrl_stmt (stmt_info->stmt)
-      && STMT_VINFO_TYPE (stmt_info) != loop_exit_ctrl_vec_info_type)
-    *relevant = vect_used_in_scope;
+  if (is_ctrl_stmt (stmt_info->stmt))
+    {
+      /* Ideally EDGE_LOOP_EXIT would have been set on the exit edge, but
+	 it looks like loop_manip doesn't do that..  So we have to do it
+	 the hard way.  */
+      basic_block bb = gimple_bb (stmt_info->stmt);
+      bool exit_bb = false, early_exit = false;
+      edge_iterator ei;
+      edge e;
+      FOR_EACH_EDGE (e, ei, bb->succs)
+        if (!flow_bb_inside_loop_p (loop, e->dest))
+	  {
+	    exit_bb = true;
+	    early_exit = loop->vec_loop_iv->src != bb;
+	    break;
+	  }
+
+      /* We should have processed any exit edge, so an edge not an early
+	 break must be a loop IV edge.  We need to distinguish between the
+	 two as we don't want to generate code for the main loop IV.  */
+      if (exit_bb)
+	{
+	  if (early_exit)
+	    *relevant = vect_used_in_scope;
+	}
+      else if (bb->loop_father == loop)
+	LOOP_VINFO_GENERAL_CTR_FLOW (loop_vinfo) = true;
+    }
 
   /* changing memory.  */
   if (gimple_code (stmt_info->stmt) != GIMPLE_PHI)
@@ -359,6 +384,11 @@ vect_stmt_relevant_p (stmt_vec_info stmt_info, loop_vec_info loop_vinfo,
 	*relevant = vect_used_in_scope;
       }
 
+  auto_vec<edge> exits = get_loop_exit_edges (loop);
+  auto_bitmap exit_bbs;
+  for (edge exit : exits)
+    bitmap_set_bit (exit_bbs, exit->dest->index);
+
   /* uses outside the loop.  */
   FOR_EACH_PHI_OR_STMT_DEF (def_p, stmt_info->stmt, op_iter, SSA_OP_DEF)
     {
@@ -377,7 +407,7 @@ vect_stmt_relevant_p (stmt_vec_info stmt_info, loop_vec_info loop_vinfo,
 	      /* We expect all such uses to be in the loop exit phis
 		 (because of loop closed form)   */
 	      gcc_assert (gimple_code (USE_STMT (use_p)) == GIMPLE_PHI);
-	      gcc_assert (bb == single_exit (loop)->dest);
+	      gcc_assert (bitmap_bit_p (exit_bbs, bb->index));
 
               *live_p = true;
 	    }
@@ -683,6 +713,13 @@ vect_mark_stmts_to_be_vectorized (loop_vec_info loop_vinfo, bool *fatal)
 	}
     }
 
+  /* Ideally this should be in vect_analyze_loop_form but we haven't seen all
+     the conds yet at that point and there's no quick way to retrieve them.  */
+  if (LOOP_VINFO_GENERAL_CTR_FLOW (loop_vinfo))
+    return opt_result::failure_at (vect_location,
+				   "not vectorized:"
+				   " unsupported control flow in loop.\n");
+
   /* 2. Process_worklist */
   while (worklist.length () > 0)
     {
@@ -778,6 +815,20 @@ vect_mark_stmts_to_be_vectorized (loop_vec_info loop_vinfo, bool *fatal)
 			return res;
 		    }
                  }
+	    }
+	  else if (gcond *cond = dyn_cast <gcond *> (stmt_vinfo->stmt))
+	    {
+	      enum tree_code rhs_code = gimple_cond_code (cond);
+	      gcc_assert (TREE_CODE_CLASS (rhs_code) == tcc_comparison);
+	      opt_result res
+		= process_use (stmt_vinfo, gimple_cond_lhs (cond),
+			       loop_vinfo, relevant, &worklist, false);
+	      if (!res)
+		return res;
+	      res = process_use (stmt_vinfo, gimple_cond_rhs (cond),
+				loop_vinfo, relevant, &worklist, false);
+	      if (!res)
+		return res;
             }
 	  else if (gcall *call = dyn_cast <gcall *> (stmt_vinfo->stmt))
 	    {
@@ -11919,11 +11970,15 @@ vect_analyze_stmt (vec_info *vinfo,
 			     node_instance, cost_vec);
       if (!res)
 	return res;
-   }
+    }
+
+  if (is_ctrl_stmt (stmt_info->stmt))
+    STMT_VINFO_DEF_TYPE (stmt_info) = vect_early_exit_def;
 
   switch (STMT_VINFO_DEF_TYPE (stmt_info))
     {
       case vect_internal_def:
+      case vect_early_exit_def:
         break;
 
       case vect_reduction_def:
@@ -11956,6 +12011,7 @@ vect_analyze_stmt (vec_info *vinfo,
     {
       gcall *call = dyn_cast <gcall *> (stmt_info->stmt);
       gcc_assert (STMT_VINFO_VECTYPE (stmt_info)
+		  || gimple_code (stmt_info->stmt) == GIMPLE_COND
 		  || (call && gimple_call_lhs (call) == NULL_TREE));
       *need_to_vectorize = true;
     }
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index ec65b65b5910e9cbad0a8c7e83c950b6168b98bf..24a0567a2f23f1b3d8b340baff61d18da8e242dd 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -63,6 +63,7 @@ enum vect_def_type {
   vect_internal_def,
   vect_induction_def,
   vect_reduction_def,
+  vect_early_exit_def,
   vect_double_reduction_def,
   vect_nested_cycle,
   vect_first_order_recurrence,
@@ -876,6 +877,13 @@ public:
      we need to peel off iterations at the end to form an epilogue loop.  */
   bool peeling_for_niter;
 
+  /* When the loop has early breaks that we can vectorize we need to peel
+     the loop for the break finding loop.  */
+  bool early_breaks;
+
+  /* When the loop has a non-early break control flow inside.  */
+  bool non_break_control_flow;
+
   /* List of loop additional IV conditionals found in the loop.  */
   auto_vec<gcond *> conds;
 
@@ -985,9 +993,11 @@ public:
 #define LOOP_VINFO_REDUCTION_CHAINS(L)     (L)->reduction_chains
 #define LOOP_VINFO_PEELING_FOR_GAPS(L)     (L)->peeling_for_gaps
 #define LOOP_VINFO_PEELING_FOR_NITER(L)    (L)->peeling_for_niter
+#define LOOP_VINFO_EARLY_BREAKS(L)         (L)->early_breaks
 #define LOOP_VINFO_EARLY_BRK_CONFLICT_STMTS(L) (L)->early_break_conflict
 #define LOOP_VINFO_EARLY_BRK_DEST_BB(L)    (L)->early_break_dest_bb
 #define LOOP_VINFO_EARLY_BRK_VUSES(L)      (L)->early_break_vuses
+#define LOOP_VINFO_GENERAL_CTR_FLOW(L)     (L)->non_break_control_flow
 #define LOOP_VINFO_LOOP_CONDS(L)           (L)->conds
 #define LOOP_VINFO_LOOP_IV_COND(L)         (L)->loop_iv_cond
 #define LOOP_VINFO_NO_DATA_DEPENDENCIES(L) (L)->no_data_dependencies
@@ -1038,8 +1048,8 @@ public:
    stack.  */
 typedef opt_pointer_wrapper <loop_vec_info> opt_loop_vec_info;
 
-inline loop_vec_info
-loop_vec_info_for_loop (class loop *loop)
+static inline loop_vec_info
+loop_vec_info_for_loop (const class loop *loop)
 {
   return (loop_vec_info) loop->aux;
 }
@@ -1789,7 +1799,7 @@ is_loop_header_bb_p (basic_block bb)
 {
   if (bb == (bb->loop_father)->header)
     return true;
-  gcc_checking_assert (EDGE_COUNT (bb->preds) == 1);
+
   return false;
 }
 
@@ -2176,9 +2186,10 @@ class auto_purge_vect_location
    in tree-vect-loop-manip.cc.  */
 extern void vect_set_loop_condition (class loop *, loop_vec_info,
 				     tree, tree, tree, bool);
-extern bool slpeel_can_duplicate_loop_p (const class loop *, const_edge);
+extern bool slpeel_can_duplicate_loop_p (const loop_vec_info, const_edge);
 class loop *slpeel_tree_duplicate_loop_to_edge_cfg (class loop *,
-						     class loop *, edge);
+						    class loop *, edge, bool,
+						    vec<basic_block> * = NULL);
 class loop *vect_loop_versioning (loop_vec_info, gimple *);
 extern class loop *vect_do_peeling (loop_vec_info, tree, tree,
 				    tree *, tree *, tree *, int, bool, bool,
diff --git a/gcc/tree-vectorizer.cc b/gcc/tree-vectorizer.cc
index a048e9d89178a37455bd7b83ab0f2a238a4ce69e..0dc5479dc92058b6c70c67f29f5dc9a8d72235f4 100644
--- a/gcc/tree-vectorizer.cc
+++ b/gcc/tree-vectorizer.cc
@@ -1379,7 +1379,9 @@ pass_vectorize::execute (function *fun)
 	 predicates that need to be shared for optimal predicate usage.
 	 However reassoc will re-order them and prevent CSE from working
 	 as it should.  CSE only the loop body, not the entry.  */
-      bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
+      auto_vec<edge> exits = get_loop_exit_edges (loop);
+      for (edge exit : exits)
+	bitmap_set_bit (exit_bbs, exit->dest->index);
 
       edge entry = EDGE_PRED (loop_preheader_edge (loop)->src, 0);
       do_rpo_vn (fun, entry, exit_bbs);
  

Comments

Richard Biener July 13, 2023, 5:31 p.m. UTC | #1
On Wed, 28 Jun 2023, Tamar Christina wrote:

> Hi All,
> 
> This patch updates the peeling code to maintain LCSSA during peeling.
> The rewrite also naturally takes into account multiple exits and so it didn't
> make sense to split them off.
> 
> For the purposes of peeling the only change for multiple exits is that the
> secondary exits are all wired to the start of the new loop preheader when doing
> epilogue peeling.
> 
> When doing prologue peeling the CFG is kept in tact.
> 
> For both epilogue and prologue peeling we wire through between the two loops any
> PHI nodes that escape the first loop into the second loop if flow_loops is
> specified.  The reason for this conditionality is because
> slpeel_tree_duplicate_loop_to_edge_cfg is used in the compiler in 3 ways:
>   - prologue peeling
>   - epilogue peeling
>   - loop distribution
> 
> for the last case the loops should remain independent, and so not be connected.
> Because of this propagation of only used phi nodes get_current_def can be used
> to easily find the previous definitions.  However live statements that are
> not used inside the loop itself are not propagated (since if unused, the moment
> we add the guard in between the two loops the value across the bypass edge can
> be wrong if the loop has been peeled.)
> 
> This is dealt with easily enough in find_guard_arg.
> 
> For multiple exits, while we are in LCSSA form, and have a correct DOM tree, the
> moment we add the guard block we will change the dominators again.  To deal with
> this slpeel_tree_duplicate_loop_to_edge_cfg can optionally return the blocks to
> update without having to recompute the list of blocks to update again.
> 
> When multiple exits and doing epilogue peeling we will also temporarily have an
> incorrect VUSES chain for the secondary exits as it anticipates the final result
> after the VDEFs have been moved.  This will thus be corrected once the code
> motion is applied.
> 
> Lastly by doing things this way we can remove the helper functions that
> previously did lock step iterations to update things as it went along.
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> 
> Ok for master?

Not sure if I get through all of this in one go - so be prepared that
the rest of the review follows another day.

> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> 	* tree-loop-distribution.cc (copy_loop_before): Pass flow_loops = false.
> 	* tree-ssa-loop-niter.cc (loop_only_exit_p):  Fix bug when exit==null.
> 	* tree-vect-loop-manip.cc (adjust_phi_and_debug_stmts): Add additional
> 	assert.
> 	(vect_set_loop_condition_normal): Skip modifying loop IV for multiple
> 	exits.
> 	(slpeel_tree_duplicate_loop_to_edge_cfg): Support multiple exit peeling.
> 	(slpeel_can_duplicate_loop_p): Likewise.
> 	(vect_update_ivs_after_vectorizer): Don't enter this...
> 	(vect_update_ivs_after_early_break): ...but instead enter here.
> 	(find_guard_arg): Update for new peeling code.
> 	(slpeel_update_phi_nodes_for_loops): Remove.
> 	(slpeel_update_phi_nodes_for_guard2): Remove hardcoded edge 0 checks.
> 	(slpeel_update_phi_nodes_for_lcssa): Remove.
> 	(vect_do_peeling): Fix VF for multiple exits and force epilogue.
> 	* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Initialize
> 	non_break_control_flow and early_breaks.
> 	(vect_need_peeling_or_partial_vectors_p): Force partial vector if
> 	multiple exits and VLA.
> 	(vect_analyze_loop_form): Support inner loop multiple exits.
> 	(vect_create_loop_vinfo): Set LOOP_VINFO_EARLY_BREAKS.
> 	(vect_create_epilog_for_reduction):  Update live phi nodes.
> 	(vectorizable_live_operation): Ignore live operations in vector loop
> 	when multiple exits.
> 	(vect_transform_loop): Force unrolling for VF loops and multiple exits.
> 	* tree-vect-stmts.cc (vect_stmt_relevant_p): Analyze ctrl statements.
> 	(vect_mark_stmts_to_be_vectorized): Check for non-exit control flow and
> 	analyze gcond params.
> 	(vect_analyze_stmt): Support gcond.
> 	* tree-vectorizer.cc (pass_vectorize::execute): Support multiple exits
> 	in RPO pass.
> 	* tree-vectorizer.h (enum vect_def_type): Add vect_early_exit_def.
> 	(LOOP_VINFO_EARLY_BREAKS, LOOP_VINFO_GENERAL_CTR_FLOW): New.
> 	(loop_vec_info_for_loop): Change to const and static.
> 	(is_loop_header_bb_p): Drop assert.
> 	(slpeel_can_duplicate_loop_p): Update prototype.
> 	(class loop): Add early_breaks and non_break_control_flow.
> 
> --- inline copy of patch -- 
> diff --git a/gcc/tree-loop-distribution.cc b/gcc/tree-loop-distribution.cc
> index 97879498db46dd3c34181ae9aa6e5476004dd5b5..d790ce5fffab3aa3dfc40d833a968314a4442b9e 100644
> --- a/gcc/tree-loop-distribution.cc
> +++ b/gcc/tree-loop-distribution.cc
> @@ -948,7 +948,7 @@ copy_loop_before (class loop *loop, bool redirect_lc_phi_defs)
>    edge preheader = loop_preheader_edge (loop);
>  
>    initialize_original_copy_tables ();
> -  res = slpeel_tree_duplicate_loop_to_edge_cfg (loop, NULL, preheader);
> +  res = slpeel_tree_duplicate_loop_to_edge_cfg (loop, NULL, preheader, false);
>    gcc_assert (res != NULL);
>  
>    /* When a not last partition is supposed to keep the LC PHIs computed
> diff --git a/gcc/tree-ssa-loop-niter.cc b/gcc/tree-ssa-loop-niter.cc
> index 5d398b67e68c7076760854119590f18b19c622b6..79686f6c4945b7139ba377300430c04b7aeefe6c 100644
> --- a/gcc/tree-ssa-loop-niter.cc
> +++ b/gcc/tree-ssa-loop-niter.cc
> @@ -3072,7 +3072,12 @@ loop_only_exit_p (const class loop *loop, basic_block *body, const_edge exit)
>    gimple_stmt_iterator bsi;
>    unsigned i;
>  
> -  if (exit != single_exit (loop))
> +  /* We need to check for alternative exits since exit can be NULL.  */

You mean we pass in exit == NULL in some cases?  I'm not sure what
the desired behavior in that case is - can you point out the
callers you are fixing here?

I think we should add gcc_assert (exit != nullptr)

>    for (i = 0; i < loop->num_nodes; i++)
> diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
> index 6b93fb3f9af8f2bbdf5dec28f0009177aa5171ab..550d7f40002cf0b58f8a927cb150edd7c2aa9999 100644
> --- a/gcc/tree-vect-loop-manip.cc
> +++ b/gcc/tree-vect-loop-manip.cc
> @@ -252,6 +252,9 @@ adjust_phi_and_debug_stmts (gimple *update_phi, edge e, tree new_def)
>  {
>    tree orig_def = PHI_ARG_DEF_FROM_EDGE (update_phi, e);
>  
> +  gcc_assert (TREE_CODE (orig_def) != SSA_NAME
> +	      || orig_def != new_def);
> +
>    SET_PHI_ARG_DEF (update_phi, e->dest_idx, new_def);
>  
>    if (MAY_HAVE_DEBUG_BIND_STMTS)
> @@ -1292,7 +1295,8 @@ vect_set_loop_condition_normal (loop_vec_info loop_vinfo,
>    gsi_insert_before (&loop_cond_gsi, cond_stmt, GSI_SAME_STMT);
>  
>    /* Record the number of latch iterations.  */
> -  if (limit == niters)
> +  if (limit == niters
> +      || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
>      /* Case A: the loop iterates NITERS times.  Subtract one to get the
>         latch count.  */
>      loop->nb_iterations = fold_build2 (MINUS_EXPR, niters_type, niters,
> @@ -1303,7 +1307,13 @@ vect_set_loop_condition_normal (loop_vec_info loop_vinfo,
>      loop->nb_iterations = fold_build2 (TRUNC_DIV_EXPR, niters_type,
>  				       limit, step);
>  
> -  if (final_iv)
> +  /* For multiple exits we've already maintained LCSSA form and handled
> +     the scalar iteration update in the code that deals with the merge
> +     block and its updated guard.  I could move that code here instead
> +     of in vect_update_ivs_after_early_break but I have to still deal
> +     with the updates to the counter `i`.  So for now I'll keep them
> +     together.  */
> +  if (final_iv && !LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
>      {
>        gassign *assign;
>        edge exit = LOOP_VINFO_IV_EXIT (loop_vinfo);
> @@ -1509,11 +1519,19 @@ vec_init_exit_info (class loop *loop)
>     on E which is either the entry or exit of LOOP.  If SCALAR_LOOP is
>     non-NULL, assume LOOP and SCALAR_LOOP are equivalent and copy the
>     basic blocks from SCALAR_LOOP instead of LOOP, but to either the
> -   entry or exit of LOOP.  */
> +   entry or exit of LOOP.  If FLOW_LOOPS then connect LOOP to SCALAR_LOOP as a
> +   continuation.  This is correct for cases where one loop continues from the
> +   other like in the vectorizer, but not true for uses in e.g. loop distribution
> +   where the loop is duplicated and then modified.
> +

but for loop distribution the flow also continues?  I'm not sure what you
are refering to here.  Do you by chance have a branch with the patches
installed?

> +   If UPDATED_DOMS is not NULL it is update with the list of basic blocks whoms
> +   dominators were updated during the peeling.  */
>  
>  class loop *
>  slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop,
> -					class loop *scalar_loop, edge e)
> +					class loop *scalar_loop, edge e,
> +					bool flow_loops,
> +					vec<basic_block> *updated_doms)
>  {
>    class loop *new_loop;
>    basic_block *new_bbs, *bbs, *pbbs;
> @@ -1602,6 +1620,19 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop,
>    for (unsigned i = (at_exit ? 0 : 1); i < scalar_loop->num_nodes + 1; i++)
>      rename_variables_in_bb (new_bbs[i], duplicate_outer_loop);
>  
> +  /* Rename the exit uses.  */
> +  for (edge exit : get_loop_exit_edges (new_loop))
> +    for (auto gsi = gsi_start_phis (exit->dest);
> +	 !gsi_end_p (gsi); gsi_next (&gsi))
> +      {
> +	tree orig_def = PHI_ARG_DEF_FROM_EDGE (gsi.phi (), exit);
> +	rename_use_op (PHI_ARG_DEF_PTR_FROM_EDGE (gsi.phi (), exit));
> +	if (MAY_HAVE_DEBUG_BIND_STMTS)
> +	  adjust_debug_stmts (orig_def, PHI_RESULT (gsi.phi ()), exit->dest);
> +      }
> +
> +  /* This condition happens when the loop has been versioned. e.g. due to ifcvt
> +     versioning the loop.  */
>    if (scalar_loop != loop)
>      {
>        /* If we copied from SCALAR_LOOP rather than LOOP, SSA_NAMEs from
> @@ -1616,28 +1647,106 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop,
>  						EDGE_SUCC (loop->latch, 0));
>      }
>  
> +  vec<edge> alt_exits = loop->vec_loop_alt_exits;

So 'e' is not one of alt_exits, right?  I wonder if we can simply
compute the vector from all exits of 'loop' and removing 'e'?

> +  bool multiple_exits_p = !alt_exits.is_empty ();
> +  auto_vec<basic_block> doms;
> +  class loop *update_loop = NULL;
> +
>    if (at_exit) /* Add the loop copy at exit.  */
>      {
> -      if (scalar_loop != loop)
> +      if (scalar_loop != loop && new_exit->dest != exit_dest)
>  	{
> -	  gphi_iterator gsi;
>  	  new_exit = redirect_edge_and_branch (new_exit, exit_dest);
> +	  flush_pending_stmts (new_exit);
> +	}
>  
> -	  for (gsi = gsi_start_phis (exit_dest); !gsi_end_p (gsi);
> -	       gsi_next (&gsi))
> +      auto loop_exits = get_loop_exit_edges (loop);
> +      for (edge exit : loop_exits)
> +	redirect_edge_and_branch (exit, new_preheader);
> +
> +

one line vertical space too much

> +      /* Copy the current loop LC PHI nodes between the original loop exit
> +	 block and the new loop header.  This allows us to later split the
> +	 preheader block and still find the right LC nodes.  */
> +      edge latch_new = single_succ_edge (new_preheader);
> +      edge latch_old = loop_latch_edge (loop);
> +      hash_set <tree> lcssa_vars;
> +      for (auto gsi_from = gsi_start_phis (latch_old->dest),

so that's loop->header (and makes it more clear which PHI nodes you are 
looking at)

> +	   gsi_to = gsi_start_phis (latch_new->dest);

likewise new_loop->header

> +	   flow_loops && !gsi_end_p (gsi_from) && !gsi_end_p (gsi_to);
> +	   gsi_next (&gsi_from), gsi_next (&gsi_to))
> +	{
> +	  gimple *from_phi = gsi_stmt (gsi_from);
> +	  gimple *to_phi = gsi_stmt (gsi_to);
> +	  tree new_arg = PHI_ARG_DEF_FROM_EDGE (from_phi, latch_old);
> +	  /* In all cases, even in early break situations we're only
> +	     interested in the number of fully executed loop iters.  As such
> +	     we discard any partially done iteration.  So we simply propagate
> +	     the phi nodes from the latch to the merge block.  */
> +	  tree new_res = copy_ssa_name (gimple_phi_result (from_phi));
> +	  gphi *lcssa_phi = create_phi_node (new_res, e->dest);
> +
> +	  lcssa_vars.add (new_arg);
> +
> +	  /* Main loop exit should use the final iter value.  */
> +	  add_phi_arg (lcssa_phi, new_arg, loop->vec_loop_iv, UNKNOWN_LOCATION);

above you are creating the PHI node at e->dest but here add the PHI arg to
loop->vec_loop_iv - that's 'e' here, no?  Consistency makes it easier
to follow.  I _think_ this code doesn't need to know about the "special"
edge.

> +
> +	  /* All other exits use the previous iters.  */
> +	  for (edge e : alt_exits)
> +	    add_phi_arg (lcssa_phi, gimple_phi_result (from_phi), e,
> +			 UNKNOWN_LOCATION);
> +
> +	  adjust_phi_and_debug_stmts (to_phi, latch_new, new_res);
> +	}
> +
> +      /* Copy over any live SSA vars that may not have been materialized in the
> +	 loops themselves but would be in the exit block.  However when the live
> +	 value is not used inside the loop then we don't need to do this,  if we do
> +	 then when we split the guard block the branch edge can end up containing the
> +	 wrong reference,  particularly if it shares an edge with something that has
> +	 bypassed the loop.  This is not something peeling can check so we need to
> +	 anticipate the usage of the live variable here.  */
> +      auto exit_map = redirect_edge_var_map_vector (exit);

Hmm, did I use that in my attemt to refactor things? ...

> +      if (exit_map)
> +        for (auto vm : exit_map)
> +	{
> +	  if (lcssa_vars.contains (vm.def)
> +	      || TREE_CODE (vm.def) != SSA_NAME)

the latter check is cheaper so it should come first

> +	    continue;
> +
> +	  imm_use_iterator imm_iter;
> +	  use_operand_p use_p;
> +	  bool use_in_loop = false;
> +
> +	  FOR_EACH_IMM_USE_FAST (use_p, imm_iter, vm.def)
>  	    {
> -	      gphi *phi = gsi.phi ();
> -	      tree orig_arg = PHI_ARG_DEF_FROM_EDGE (phi, e);
> -	      location_t orig_locus
> -		= gimple_phi_arg_location_from_edge (phi, e);
> +	      basic_block bb = gimple_bb (USE_STMT (use_p));
> +	      if (flow_bb_inside_loop_p (loop, bb)
> +		  && !gimple_vuse (USE_STMT (use_p)))
> +		{
> +		  use_in_loop = true;
> +		  break;
> +		}
> +	    }
>  
> -	      add_phi_arg (phi, orig_arg, new_exit, orig_locus);
> +	  if (!use_in_loop)
> +	    {
> +	       /* Do a final check to see if it's perhaps defined in the loop.  This
> +		  mirrors the relevancy analysis's used_outside_scope.  */
> +	      gimple *stmt = SSA_NAME_DEF_STMT (vm.def);
> +	      if (!stmt || !flow_bb_inside_loop_p (loop, gimple_bb (stmt)))
> +		continue;
>  	    }
> +
> +	  tree new_res = copy_ssa_name (vm.result);
> +	  gphi *lcssa_phi = create_phi_node (new_res, e->dest);
> +	  for (edge exit : loop_exits)
> +	     add_phi_arg (lcssa_phi, vm.def, exit, vm.locus);

not sure what you are doing above - I guess I have to play with it
in a debug session.

>  	}
> -      redirect_edge_and_branch_force (e, new_preheader);
> -      flush_pending_stmts (e);
> +
>        set_immediate_dominator (CDI_DOMINATORS, new_preheader, e->src);
> -      if (was_imm_dom || duplicate_outer_loop)
> +
> +      if ((was_imm_dom || duplicate_outer_loop) && !multiple_exits_p)
>  	set_immediate_dominator (CDI_DOMINATORS, exit_dest, new_exit->src);
>  
>        /* And remove the non-necessary forwarder again.  Keep the other
> @@ -1647,9 +1756,42 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop,
>        delete_basic_block (preheader);
>        set_immediate_dominator (CDI_DOMINATORS, scalar_loop->header,
>  			       loop_preheader_edge (scalar_loop)->src);
> +
> +      /* Finally after wiring the new epilogue we need to update its main exit
> +	 to the original function exit we recorded.  Other exits are already
> +	 correct.  */
> +      if (multiple_exits_p)
> +	{
> +	  for (edge e : get_loop_exit_edges (loop))
> +	    doms.safe_push (e->dest);
> +	  update_loop = new_loop;
> +	  doms.safe_push (exit_dest);
> +
> +	  /* Likely a fall-through edge, so update if needed.  */
> +	  if (single_succ_p (exit_dest))
> +	    doms.safe_push (single_succ (exit_dest));
> +	}
>      }
>    else /* Add the copy at entry.  */
>      {
> +      /* Copy the current loop LC PHI nodes between the original loop exit
> +	 block and the new loop header.  This allows us to later split the
> +	 preheader block and still find the right LC nodes.  */
> +      edge old_latch_loop = loop_latch_edge (loop);
> +      edge old_latch_init = loop_preheader_edge (loop);
> +      edge new_latch_loop = loop_latch_edge (new_loop);
> +      edge new_latch_init = loop_preheader_edge (new_loop);
> +      for (auto gsi_from = gsi_start_phis (new_latch_init->dest),

see above

> +	   gsi_to = gsi_start_phis (old_latch_loop->dest);
> +	   flow_loops && !gsi_end_p (gsi_from) && !gsi_end_p (gsi_to);
> +	   gsi_next (&gsi_from), gsi_next (&gsi_to))
> +	{
> +	  gimple *from_phi = gsi_stmt (gsi_from);
> +	  gimple *to_phi = gsi_stmt (gsi_to);
> +	  tree new_arg = PHI_ARG_DEF_FROM_EDGE (from_phi, new_latch_loop);
> +	  adjust_phi_and_debug_stmts (to_phi, old_latch_init, new_arg);
> +	}
> +
>        if (scalar_loop != loop)
>  	{
>  	  /* Remove the non-necessary forwarder of scalar_loop again.  */
> @@ -1677,31 +1819,36 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop,
>        delete_basic_block (new_preheader);
>        set_immediate_dominator (CDI_DOMINATORS, new_loop->header,
>  			       loop_preheader_edge (new_loop)->src);
> +
> +      if (multiple_exits_p)
> +	update_loop = loop;
>      }
>  
> -  if (scalar_loop != loop)
> +  if (multiple_exits_p)
>      {
> -      /* Update new_loop->header PHIs, so that on the preheader
> -	 edge they are the ones from loop rather than scalar_loop.  */
> -      gphi_iterator gsi_orig, gsi_new;
> -      edge orig_e = loop_preheader_edge (loop);
> -      edge new_e = loop_preheader_edge (new_loop);
> -
> -      for (gsi_orig = gsi_start_phis (loop->header),
> -	   gsi_new = gsi_start_phis (new_loop->header);
> -	   !gsi_end_p (gsi_orig) && !gsi_end_p (gsi_new);
> -	   gsi_next (&gsi_orig), gsi_next (&gsi_new))
> +      for (edge e : get_loop_exit_edges (update_loop))
>  	{
> -	  gphi *orig_phi = gsi_orig.phi ();
> -	  gphi *new_phi = gsi_new.phi ();
> -	  tree orig_arg = PHI_ARG_DEF_FROM_EDGE (orig_phi, orig_e);
> -	  location_t orig_locus
> -	    = gimple_phi_arg_location_from_edge (orig_phi, orig_e);
> -
> -	  add_phi_arg (new_phi, orig_arg, new_e, orig_locus);
> +	  edge ex;
> +	  edge_iterator ei;
> +	  FOR_EACH_EDGE (ex, ei, e->dest->succs)
> +	    {
> +	      /* Find the first non-fallthrough block as fall-throughs can't
> +		 dominate other blocks.  */
> +	      while ((ex->flags & EDGE_FALLTHRU)

I don't think EDGE_FALLTHRU is set correctly, what's wrong with
just using single_succ_p here?  A fallthru edge src dominates the
fallthru edge dest, so the sentence above doesn't make sense.

> +		     && single_succ_p (ex->dest))
> +		{
> +		  doms.safe_push (ex->dest);
> +		  ex = single_succ_edge (ex->dest);
> +		}
> +	      doms.safe_push (ex->dest);
> +	    }
> +	  doms.safe_push (e->dest);
>  	}
> -    }
>  
> +      iterate_fix_dominators (CDI_DOMINATORS, doms, false);
> +      if (updated_doms)
> +	updated_doms->safe_splice (doms);
> +    }
>    free (new_bbs);
>    free (bbs);
>  
> @@ -1777,6 +1924,9 @@ slpeel_can_duplicate_loop_p (const loop_vec_info loop_vinfo, const_edge e)
>    gimple_stmt_iterator loop_exit_gsi = gsi_last_bb (exit_e->src);
>    unsigned int num_bb = loop->inner? 5 : 2;
>  
> +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> +    num_bb += LOOP_VINFO_ALT_EXITS (loop_vinfo).length ();
> +

I think checking the number of BBs is odd, I don't remember anything
in slpeel is specifically tied to that?  I think we can simply drop
this or do you remember anything that would depend on ->num_nodes
being only exactly 5 or 2?

>    /* All loops have an outer scope; the only case loop->outer is NULL is for
>       the function itself.  */
>    if (!loop_outer (loop)
> @@ -2044,6 +2194,11 @@ vect_update_ivs_after_vectorizer (loop_vec_info loop_vinfo,
>    class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
>    basic_block update_bb = update_e->dest;
>  
> +  /* For early exits we'll update the IVs in
> +     vect_update_ivs_after_early_break.  */
> +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> +    return;
> +
>    basic_block exit_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
>  
>    /* Make sure there exists a single-predecessor exit bb:  */
> @@ -2131,6 +2286,208 @@ vect_update_ivs_after_vectorizer (loop_vec_info loop_vinfo,
>        /* Fix phi expressions in the successor bb.  */
>        adjust_phi_and_debug_stmts (phi1, update_e, ni_name);
>      }
> +  return;

we don't usually place a return at the end of void functions

> +}
> +
> +/*   Function vect_update_ivs_after_early_break.
> +
> +     "Advance" the induction variables of LOOP to the value they should take
> +     after the execution of LOOP.  This is currently necessary because the
> +     vectorizer does not handle induction variables that are used after the
> +     loop.  Such a situation occurs when the last iterations of LOOP are
> +     peeled, because of the early exit.  With an early exit we always peel the
> +     loop.
> +
> +     Input:
> +     - LOOP_VINFO - a loop info structure for the loop that is going to be
> +		    vectorized. The last few iterations of LOOP were peeled.
> +     - LOOP - a loop that is going to be vectorized. The last few iterations
> +	      of LOOP were peeled.
> +     - VF - The loop vectorization factor.
> +     - NITERS_ORIG - the number of iterations that LOOP executes (before it is
> +		     vectorized). i.e, the number of times the ivs should be
> +		     bumped.
> +     - NITERS_VECTOR - The number of iterations that the vector LOOP executes.
> +     - UPDATE_E - a successor edge of LOOP->exit that is on the (only) path
> +		  coming out from LOOP on which there are uses of the LOOP ivs
> +		  (this is the path from LOOP->exit to epilog_loop->preheader).
> +
> +		  The new definitions of the ivs are placed in LOOP->exit.
> +		  The phi args associated with the edge UPDATE_E in the bb
> +		  UPDATE_E->dest are updated accordingly.
> +
> +     Output:
> +       - If available, the LCSSA phi node for the loop IV temp.
> +
> +     Assumption 1: Like the rest of the vectorizer, this function assumes
> +     a single loop exit that has a single predecessor.
> +
> +     Assumption 2: The phi nodes in the LOOP header and in update_bb are
> +     organized in the same order.
> +
> +     Assumption 3: The access function of the ivs is simple enough (see
> +     vect_can_advance_ivs_p).  This assumption will be relaxed in the future.
> +
> +     Assumption 4: Exactly one of the successors of LOOP exit-bb is on a path
> +     coming out of LOOP on which the ivs of LOOP are used (this is the path
> +     that leads to the epilog loop; other paths skip the epilog loop).  This
> +     path starts with the edge UPDATE_E, and its destination (denoted update_bb)
> +     needs to have its phis updated.
> + */
> +
> +static tree
> +vect_update_ivs_after_early_break (loop_vec_info loop_vinfo, class loop * epilog,
> +				   poly_int64 vf, tree niters_orig,
> +				   tree niters_vector, edge update_e)
> +{
> +  if (!LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> +    return NULL;
> +
> +  gphi_iterator gsi, gsi1;
> +  tree ni_name, ivtmp = NULL;
> +  basic_block update_bb = update_e->dest;
> +  vec<edge> alt_exits = LOOP_VINFO_ALT_EXITS (loop_vinfo);
> +  edge loop_iv = LOOP_VINFO_IV_EXIT (loop_vinfo);
> +  basic_block exit_bb = loop_iv->dest;
> +  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> +  gcond *cond = LOOP_VINFO_LOOP_IV_COND (loop_vinfo);
> +
> +  gcc_assert (cond);
> +
> +  for (gsi = gsi_start_phis (loop->header), gsi1 = gsi_start_phis (update_bb);
> +       !gsi_end_p (gsi) && !gsi_end_p (gsi1);
> +       gsi_next (&gsi), gsi_next (&gsi1))
> +    {
> +      tree init_expr, final_expr, step_expr;
> +      tree type;
> +      tree var, ni, off;
> +      gimple_stmt_iterator last_gsi;
> +
> +      gphi *phi = gsi1.phi ();
> +      tree phi_ssa = PHI_ARG_DEF_FROM_EDGE (phi, loop_preheader_edge (epilog));

I'm confused about the setup.  update_bb looks like the block with the
loop-closed PHI nodes of 'loop' and the exit (update_e)?  How does
loop_preheader_edge (epilog) come into play here?  That would feed into
epilog->header PHIs?!

It would be nice to name 'gsi[1]', 'update_e' and 'update_bb' in a
better way?  Is update_bb really epilog->header?!

We're missing checking in PHI_ARG_DEF_FROM_EDGE, namely that
E->dest == gimple_bb (PHI) - we're just using E->dest_idx there
which "works" even for totally unrelated edges.

> +      gphi *phi1 = dyn_cast <gphi *> (SSA_NAME_DEF_STMT (phi_ssa));
> +      if (!phi1)

shouldn't that be an assert?

> +	continue;
> +      stmt_vec_info phi_info = loop_vinfo->lookup_stmt (gsi.phi ());
> +      if (dump_enabled_p ())
> +	dump_printf_loc (MSG_NOTE, vect_location,
> +			 "vect_update_ivs_after_early_break: phi: %G",
> +			 (gimple *)phi);
> +
> +      /* Skip reduction and virtual phis.  */
> +      if (!iv_phi_p (phi_info))
> +	{
> +	  if (dump_enabled_p ())
> +	    dump_printf_loc (MSG_NOTE, vect_location,
> +			     "reduc or virtual phi. skip.\n");
> +	  continue;
> +	}
> +
> +      /* For multiple exits where we handle early exits we need to carry on
> +	 with the previous IV as loop iteration was not done because we exited
> +	 early.  As such just grab the original IV.  */
> +      phi_ssa = PHI_ARG_DEF_FROM_EDGE (gsi.phi (), loop_latch_edge (loop));

but this should be taken care of by LC SSA?

OK, have to continue tomorrow from here.

Richard.

> +      if (gimple_cond_lhs (cond) != phi_ssa
> +	  && gimple_cond_rhs (cond) != phi_ssa)
> +	{
> +	  type = TREE_TYPE (gimple_phi_result (phi));
> +	  step_expr = STMT_VINFO_LOOP_PHI_EVOLUTION_PART (phi_info);
> +	  step_expr = unshare_expr (step_expr);
> +
> +	  /* We previously generated the new merged phi in the same BB as the
> +	     guard.  So use that to perform the scaling on rather than the
> +	     normal loop phi which don't take the early breaks into account.  */
> +	  final_expr = gimple_phi_result (phi1);
> +	  init_expr = PHI_ARG_DEF_FROM_EDGE (gsi.phi (), loop_preheader_edge (loop));
> +
> +	  tree stype = TREE_TYPE (step_expr);
> +	  /* For early break the final loop IV is:
> +	     init + (final - init) * vf which takes into account peeling
> +	     values and non-single steps.  */
> +	  off = fold_build2 (MINUS_EXPR, stype,
> +			     fold_convert (stype, final_expr),
> +			     fold_convert (stype, init_expr));
> +	  /* Now adjust for VF to get the final iteration value.  */
> +	  off = fold_build2 (MULT_EXPR, stype, off, build_int_cst (stype, vf));
> +
> +	  /* Adjust the value with the offset.  */
> +	  if (POINTER_TYPE_P (type))
> +	    ni = fold_build_pointer_plus (init_expr, off);
> +	  else
> +	    ni = fold_convert (type,
> +			       fold_build2 (PLUS_EXPR, stype,
> +					    fold_convert (stype, init_expr),
> +					    off));
> +	  var = create_tmp_var (type, "tmp");
> +
> +	  last_gsi = gsi_last_bb (exit_bb);
> +	  gimple_seq new_stmts = NULL;
> +	  ni_name = force_gimple_operand (ni, &new_stmts, false, var);
> +	  /* Exit_bb shouldn't be empty.  */
> +	  if (!gsi_end_p (last_gsi))
> +	    gsi_insert_seq_after (&last_gsi, new_stmts, GSI_SAME_STMT);
> +	  else
> +	    gsi_insert_seq_before (&last_gsi, new_stmts, GSI_SAME_STMT);
> +
> +	  /* Fix phi expressions in the successor bb.  */
> +	  adjust_phi_and_debug_stmts (phi, update_e, ni_name);
> +	}
> +      else
> +	{
> +	  type = TREE_TYPE (gimple_phi_result (phi));
> +	  step_expr = STMT_VINFO_LOOP_PHI_EVOLUTION_PART (phi_info);
> +	  step_expr = unshare_expr (step_expr);
> +
> +	  /* We previously generated the new merged phi in the same BB as the
> +	     guard.  So use that to perform the scaling on rather than the
> +	     normal loop phi which don't take the early breaks into account.  */
> +	  init_expr = PHI_ARG_DEF_FROM_EDGE (phi1, loop_preheader_edge (loop));
> +	  tree stype = TREE_TYPE (step_expr);
> +
> +	  if (vf.is_constant ())
> +	    {
> +	      ni = fold_build2 (MULT_EXPR, stype,
> +				fold_convert (stype,
> +					      niters_vector),
> +				build_int_cst (stype, vf));
> +
> +	      ni = fold_build2 (MINUS_EXPR, stype,
> +				fold_convert (stype,
> +					      niters_orig),
> +				fold_convert (stype, ni));
> +	    }
> +	  else
> +	    /* If the loop's VF isn't constant then the loop must have been
> +	       masked, so at the end of the loop we know we have finished
> +	       the entire loop and found nothing.  */
> +	    ni = build_zero_cst (stype);
> +
> +	  ni = fold_convert (type, ni);
> +	  /* We don't support variable n in this version yet.  */
> +	  gcc_assert (TREE_CODE (ni) == INTEGER_CST);
> +
> +	  var = create_tmp_var (type, "tmp");
> +
> +	  last_gsi = gsi_last_bb (exit_bb);
> +	  gimple_seq new_stmts = NULL;
> +	  ni_name = force_gimple_operand (ni, &new_stmts, false, var);
> +	  /* Exit_bb shouldn't be empty.  */
> +	  if (!gsi_end_p (last_gsi))
> +	    gsi_insert_seq_after (&last_gsi, new_stmts, GSI_SAME_STMT);
> +	  else
> +	    gsi_insert_seq_before (&last_gsi, new_stmts, GSI_SAME_STMT);
> +
> +	  adjust_phi_and_debug_stmts (phi1, loop_iv, ni_name);
> +
> +	  for (edge exit : alt_exits)
> +	    adjust_phi_and_debug_stmts (phi1, exit,
> +					build_int_cst (TREE_TYPE (step_expr),
> +						       vf));
> +	  ivtmp = gimple_phi_result (phi1);
> +	}
> +    }
> +
> +  return ivtmp;
>  }
>  
>  /* Return a gimple value containing the misalignment (measured in vector
> @@ -2632,137 +2989,34 @@ vect_gen_vector_loop_niters_mult_vf (loop_vec_info loop_vinfo,
>  
>  /* LCSSA_PHI is a lcssa phi of EPILOG loop which is copied from LOOP,
>     this function searches for the corresponding lcssa phi node in exit
> -   bb of LOOP.  If it is found, return the phi result; otherwise return
> -   NULL.  */
> +   bb of LOOP following the LCSSA_EDGE to the exit node.  If it is found,
> +   return the phi result; otherwise return NULL.  */
>  
>  static tree
>  find_guard_arg (class loop *loop, class loop *epilog ATTRIBUTE_UNUSED,
> -		gphi *lcssa_phi)
> +		gphi *lcssa_phi, int lcssa_edge = 0)
>  {
>    gphi_iterator gsi;
>    edge e = loop->vec_loop_iv;
>  
> -  gcc_assert (single_pred_p (e->dest));
>    for (gsi = gsi_start_phis (e->dest); !gsi_end_p (gsi); gsi_next (&gsi))
>      {
>        gphi *phi = gsi.phi ();
> -      if (operand_equal_p (PHI_ARG_DEF (phi, 0),
> -			   PHI_ARG_DEF (lcssa_phi, 0), 0))
> -	return PHI_RESULT (phi);
> -    }
> -  return NULL_TREE;
> -}
> -
> -/* Function slpeel_tree_duplicate_loop_to_edge_cfg duplciates FIRST/SECOND
> -   from SECOND/FIRST and puts it at the original loop's preheader/exit
> -   edge, the two loops are arranged as below:
> -
> -       preheader_a:
> -     first_loop:
> -       header_a:
> -	 i_1 = PHI<i_0, i_2>;
> -	 ...
> -	 i_2 = i_1 + 1;
> -	 if (cond_a)
> -	   goto latch_a;
> -	 else
> -	   goto between_bb;
> -       latch_a:
> -	 goto header_a;
> -
> -       between_bb:
> -	 ;; i_x = PHI<i_2>;   ;; LCSSA phi node to be created for FIRST,
> -
> -     second_loop:
> -       header_b:
> -	 i_3 = PHI<i_0, i_4>; ;; Use of i_0 to be replaced with i_x,
> -				 or with i_2 if no LCSSA phi is created
> -				 under condition of CREATE_LCSSA_FOR_IV_PHIS.
> -	 ...
> -	 i_4 = i_3 + 1;
> -	 if (cond_b)
> -	   goto latch_b;
> -	 else
> -	   goto exit_bb;
> -       latch_b:
> -	 goto header_b;
> -
> -       exit_bb:
> -
> -   This function creates loop closed SSA for the first loop; update the
> -   second loop's PHI nodes by replacing argument on incoming edge with the
> -   result of newly created lcssa PHI nodes.  IF CREATE_LCSSA_FOR_IV_PHIS
> -   is false, Loop closed ssa phis will only be created for non-iv phis for
> -   the first loop.
> -
> -   This function assumes exit bb of the first loop is preheader bb of the
> -   second loop, i.e, between_bb in the example code.  With PHIs updated,
> -   the second loop will execute rest iterations of the first.  */
> -
> -static void
> -slpeel_update_phi_nodes_for_loops (loop_vec_info loop_vinfo,
> -				   class loop *first, class loop *second,
> -				   bool create_lcssa_for_iv_phis)
> -{
> -  gphi_iterator gsi_update, gsi_orig;
> -  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> -
> -  edge first_latch_e = EDGE_SUCC (first->latch, 0);
> -  edge second_preheader_e = loop_preheader_edge (second);
> -  basic_block between_bb = single_exit (first)->dest;
> -
> -  gcc_assert (between_bb == second_preheader_e->src);
> -  gcc_assert (single_pred_p (between_bb) && single_succ_p (between_bb));
> -  /* Either the first loop or the second is the loop to be vectorized.  */
> -  gcc_assert (loop == first || loop == second);
> -
> -  for (gsi_orig = gsi_start_phis (first->header),
> -       gsi_update = gsi_start_phis (second->header);
> -       !gsi_end_p (gsi_orig) && !gsi_end_p (gsi_update);
> -       gsi_next (&gsi_orig), gsi_next (&gsi_update))
> -    {
> -      gphi *orig_phi = gsi_orig.phi ();
> -      gphi *update_phi = gsi_update.phi ();
> -
> -      tree arg = PHI_ARG_DEF_FROM_EDGE (orig_phi, first_latch_e);
> -      /* Generate lcssa PHI node for the first loop.  */
> -      gphi *vect_phi = (loop == first) ? orig_phi : update_phi;
> -      stmt_vec_info vect_phi_info = loop_vinfo->lookup_stmt (vect_phi);
> -      if (create_lcssa_for_iv_phis || !iv_phi_p (vect_phi_info))
> +      /* Nested loops with multiple exits can have different no# phi node
> +	 arguments between the main loop and epilog as epilog falls to the
> +	 second loop.  */
> +      if (gimple_phi_num_args (phi) > e->dest_idx)
>  	{
> -	  tree new_res = copy_ssa_name (PHI_RESULT (orig_phi));
> -	  gphi *lcssa_phi = create_phi_node (new_res, between_bb);
> -	  add_phi_arg (lcssa_phi, arg, single_exit (first), UNKNOWN_LOCATION);
> -	  arg = new_res;
> -	}
> -
> -      /* Update PHI node in the second loop by replacing arg on the loop's
> -	 incoming edge.  */
> -      adjust_phi_and_debug_stmts (update_phi, second_preheader_e, arg);
> -    }
> -
> -  /* For epilogue peeling we have to make sure to copy all LC PHIs
> -     for correct vectorization of live stmts.  */
> -  if (loop == first)
> -    {
> -      basic_block orig_exit = single_exit (second)->dest;
> -      for (gsi_orig = gsi_start_phis (orig_exit);
> -	   !gsi_end_p (gsi_orig); gsi_next (&gsi_orig))
> -	{
> -	  gphi *orig_phi = gsi_orig.phi ();
> -	  tree orig_arg = PHI_ARG_DEF (orig_phi, 0);
> -	  if (TREE_CODE (orig_arg) != SSA_NAME || virtual_operand_p  (orig_arg))
> -	    continue;
> -
> -	  /* Already created in the above loop.   */
> -	  if (find_guard_arg (first, second, orig_phi))
> +	  tree var = PHI_ARG_DEF (phi, e->dest_idx);
> +	  if (TREE_CODE (var) != SSA_NAME)
>  	    continue;
>  
> -	  tree new_res = copy_ssa_name (orig_arg);
> -	  gphi *lcphi = create_phi_node (new_res, between_bb);
> -	  add_phi_arg (lcphi, orig_arg, single_exit (first), UNKNOWN_LOCATION);
> +	  if (operand_equal_p (get_current_def (var),
> +			       PHI_ARG_DEF (lcssa_phi, lcssa_edge), 0))
> +	    return PHI_RESULT (phi);
>  	}
>      }
> +  return NULL_TREE;
>  }
>  
>  /* Function slpeel_add_loop_guard adds guard skipping from the beginning
> @@ -2910,13 +3164,11 @@ slpeel_update_phi_nodes_for_guard2 (class loop *loop, class loop *epilog,
>    gcc_assert (single_succ_p (merge_bb));
>    edge e = single_succ_edge (merge_bb);
>    basic_block exit_bb = e->dest;
> -  gcc_assert (single_pred_p (exit_bb));
> -  gcc_assert (single_pred (exit_bb) == single_exit (epilog)->dest);
>  
>    for (gsi = gsi_start_phis (exit_bb); !gsi_end_p (gsi); gsi_next (&gsi))
>      {
>        gphi *update_phi = gsi.phi ();
> -      tree old_arg = PHI_ARG_DEF (update_phi, 0);
> +      tree old_arg = PHI_ARG_DEF (update_phi, e->dest_idx);
>  
>        tree merge_arg = NULL_TREE;
>  
> @@ -2928,7 +3180,7 @@ slpeel_update_phi_nodes_for_guard2 (class loop *loop, class loop *epilog,
>        if (!merge_arg)
>  	merge_arg = old_arg;
>  
> -      tree guard_arg = find_guard_arg (loop, epilog, update_phi);
> +      tree guard_arg = find_guard_arg (loop, epilog, update_phi, e->dest_idx);
>        /* If the var is live after loop but not a reduction, we simply
>  	 use the old arg.  */
>        if (!guard_arg)
> @@ -2948,21 +3200,6 @@ slpeel_update_phi_nodes_for_guard2 (class loop *loop, class loop *epilog,
>      }
>  }
>  
> -/* EPILOG loop is duplicated from the original loop for vectorizing,
> -   the arg of its loop closed ssa PHI needs to be updated.  */
> -
> -static void
> -slpeel_update_phi_nodes_for_lcssa (class loop *epilog)
> -{
> -  gphi_iterator gsi;
> -  basic_block exit_bb = single_exit (epilog)->dest;
> -
> -  gcc_assert (single_pred_p (exit_bb));
> -  edge e = EDGE_PRED (exit_bb, 0);
> -  for (gsi = gsi_start_phis (exit_bb); !gsi_end_p (gsi); gsi_next (&gsi))
> -    rename_use_op (PHI_ARG_DEF_PTR_FROM_EDGE (gsi.phi (), e));
> -}
> -
>  /* EPILOGUE_VINFO is an epilogue loop that we now know would need to
>     iterate exactly CONST_NITERS times.  Make a final decision about
>     whether the epilogue loop should be used, returning true if so.  */
> @@ -3138,6 +3375,14 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
>      bound_epilog += vf - 1;
>    if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
>      bound_epilog += 1;
> +  /* For early breaks the scalar loop needs to execute at most VF times
> +     to find the element that caused the break.  */
> +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> +    {
> +      bound_epilog = vf;
> +      /* Force a scalar epilogue as we can't vectorize the index finding.  */
> +      vect_epilogues = false;
> +    }
>    bool epilog_peeling = maybe_ne (bound_epilog, 0U);
>    poly_uint64 bound_scalar = bound_epilog;
>  
> @@ -3297,16 +3542,24 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
>  				  bound_prolog + bound_epilog)
>  		      : (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
>  			 || vect_epilogues));
> +
> +  /* We only support early break vectorization on known bounds at this time.
> +     This means that if the vector loop can't be entered then we won't generate
> +     it at all.  So for now force skip_vector off because the additional control
> +     flow messes with the BB exits and we've already analyzed them.  */
> + skip_vector = skip_vector && !LOOP_VINFO_EARLY_BREAKS (loop_vinfo);
> +
>    /* Epilog loop must be executed if the number of iterations for epilog
>       loop is known at compile time, otherwise we need to add a check at
>       the end of vector loop and skip to the end of epilog loop.  */
>    bool skip_epilog = (prolog_peeling < 0
>  		      || !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
>  		      || !vf.is_constant ());
> -  /* PEELING_FOR_GAPS is special because epilog loop must be executed.  */
> -  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
> +  /* PEELING_FOR_GAPS and peeling for early breaks are special because epilog
> +     loop must be executed.  */
> +  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> +      || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
>      skip_epilog = false;
> -
>    class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
>    auto_vec<profile_count> original_counts;
>    basic_block *original_bbs = NULL;
> @@ -3344,13 +3597,13 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
>    if (prolog_peeling)
>      {
>        e = loop_preheader_edge (loop);
> -      gcc_checking_assert (slpeel_can_duplicate_loop_p (loop, e));
> -
> +      gcc_checking_assert (slpeel_can_duplicate_loop_p (loop_vinfo, e));
>        /* Peel prolog and put it on preheader edge of loop.  */
> -      prolog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, scalar_loop, e);
> +      prolog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, scalar_loop, e,
> +						       true);
>        gcc_assert (prolog);
>        prolog->force_vectorize = false;
> -      slpeel_update_phi_nodes_for_loops (loop_vinfo, prolog, loop, true);
> +
>        first_loop = prolog;
>        reset_original_copy_tables ();
>  
> @@ -3420,11 +3673,12 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
>  	 as the transformations mentioned above make less or no sense when not
>  	 vectorizing.  */
>        epilog = vect_epilogues ? get_loop_copy (loop) : scalar_loop;
> -      epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, epilog, e);
> +      auto_vec<basic_block> doms;
> +      epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, epilog, e, true,
> +						       &doms);
>        gcc_assert (epilog);
>  
>        epilog->force_vectorize = false;
> -      slpeel_update_phi_nodes_for_loops (loop_vinfo, loop, epilog, false);
>  
>        /* Scalar version loop may be preferred.  In this case, add guard
>  	 and skip to epilog.  Note this only happens when the number of
> @@ -3496,6 +3750,54 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
>        vect_update_ivs_after_vectorizer (loop_vinfo, niters_vector_mult_vf,
>  					update_e);
>  
> +      /* For early breaks we must create a guard to check how many iterations
> +	 of the scalar loop are yet to be performed.  */
> +      if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> +	{
> +	  tree ivtmp =
> +	    vect_update_ivs_after_early_break (loop_vinfo, epilog, vf, niters,
> +					       *niters_vector, update_e);
> +
> +	  gcc_assert (ivtmp);
> +	  tree guard_cond = fold_build2 (EQ_EXPR, boolean_type_node,
> +					 fold_convert (TREE_TYPE (niters),
> +						       ivtmp),
> +					 build_zero_cst (TREE_TYPE (niters)));
> +	  basic_block guard_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
> +
> +	  /* If we had a fallthrough edge, the guard will the threaded through
> +	     and so we may need to find the actual final edge.  */
> +	  edge final_edge = epilog->vec_loop_iv;
> +	  /* slpeel_update_phi_nodes_for_guard2 expects an empty block in
> +	     between the guard and the exit edge.  It only adds new nodes and
> +	     doesn't update existing one in the current scheme.  */
> +	  basic_block guard_to = split_edge (final_edge);
> +	  edge guard_e = slpeel_add_loop_guard (guard_bb, guard_cond, guard_to,
> +						guard_bb, prob_epilog.invert (),
> +						irred_flag);
> +	  doms.safe_push (guard_bb);
> +
> +	  iterate_fix_dominators (CDI_DOMINATORS, doms, false);
> +
> +	  /* We must update all the edges from the new guard_bb.  */
> +	  slpeel_update_phi_nodes_for_guard2 (loop, epilog, guard_e,
> +					      final_edge);
> +
> +	  /* If the loop was versioned we'll have an intermediate BB between
> +	     the guard and the exit.  This intermediate block is required
> +	     because in the current scheme of things the guard block phi
> +	     updating can only maintain LCSSA by creating new blocks.  In this
> +	     case we just need to update the uses in this block as well.  */
> +	  if (loop != scalar_loop)
> +	    {
> +	      for (gphi_iterator gsi = gsi_start_phis (guard_to);
> +		   !gsi_end_p (gsi); gsi_next (&gsi))
> +		rename_use_op (PHI_ARG_DEF_PTR_FROM_EDGE (gsi.phi (), guard_e));
> +	    }
> +
> +	  flush_pending_stmts (guard_e);
> +	}
> +
>        if (skip_epilog)
>  	{
>  	  guard_cond = fold_build2 (EQ_EXPR, boolean_type_node,
> @@ -3520,8 +3822,6 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
>  	    }
>  	  scale_loop_profile (epilog, prob_epilog, 0);
>  	}
> -      else
> -	slpeel_update_phi_nodes_for_lcssa (epilog);
>  
>        unsigned HOST_WIDE_INT bound;
>        if (bound_scalar.is_constant (&bound))
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index b4a98de80aa39057fc9b17977dd0e347b4f0fb5d..ab9a2048186f461f5ec49f21421958e7ee25eada 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -1007,6 +1007,8 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
>      partial_load_store_bias (0),
>      peeling_for_gaps (false),
>      peeling_for_niter (false),
> +    early_breaks (false),
> +    non_break_control_flow (false),
>      no_data_dependencies (false),
>      has_mask_store (false),
>      scalar_loop_scaling (profile_probability::uninitialized ()),
> @@ -1199,6 +1201,14 @@ vect_need_peeling_or_partial_vectors_p (loop_vec_info loop_vinfo)
>      th = LOOP_VINFO_COST_MODEL_THRESHOLD (LOOP_VINFO_ORIG_LOOP_INFO
>  					  (loop_vinfo));
>  
> +  /* When we have multiple exits and VF is unknown, we must require partial
> +     vectors because the loop bounds is not a minimum but a maximum.  That is to
> +     say we cannot unpredicate the main loop unless we peel or use partial
> +     vectors in the epilogue.  */
> +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
> +      && !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ())
> +    return true;
> +
>    if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
>        && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) >= 0)
>      {
> @@ -1652,12 +1662,12 @@ vect_compute_single_scalar_iteration_cost (loop_vec_info loop_vinfo)
>    loop_vinfo->scalar_costs->finish_cost (nullptr);
>  }
>  
> -
>  /* Function vect_analyze_loop_form.
>  
>     Verify that certain CFG restrictions hold, including:
>     - the loop has a pre-header
> -   - the loop has a single entry and exit
> +   - the loop has a single entry
> +   - nested loops can have only a single exit.
>     - the loop exit condition is simple enough
>     - the number of iterations can be analyzed, i.e, a countable loop.  The
>       niter could be analyzed under some assumptions.  */
> @@ -1693,11 +1703,6 @@ vect_analyze_loop_form (class loop *loop, vect_loop_form_info *info)
>                             |
>                          (exit-bb)  */
>  
> -      if (loop->num_nodes != 2)
> -	return opt_result::failure_at (vect_location,
> -				       "not vectorized:"
> -				       " control flow in loop.\n");
> -
>        if (empty_block_p (loop->header))
>  	return opt_result::failure_at (vect_location,
>  				       "not vectorized: empty loop.\n");
> @@ -1768,11 +1773,13 @@ vect_analyze_loop_form (class loop *loop, vect_loop_form_info *info)
>          dump_printf_loc (MSG_NOTE, vect_location,
>  			 "Considering outer-loop vectorization.\n");
>        info->inner_loop_cond = inner.loop_cond;
> +
> +      if (!single_exit (loop))
> +	return opt_result::failure_at (vect_location,
> +				       "not vectorized: multiple exits.\n");
> +
>      }
>  
> -  if (!single_exit (loop))
> -    return opt_result::failure_at (vect_location,
> -				   "not vectorized: multiple exits.\n");
>    if (EDGE_COUNT (loop->header->preds) != 2)
>      return opt_result::failure_at (vect_location,
>  				   "not vectorized:"
> @@ -1788,11 +1795,36 @@ vect_analyze_loop_form (class loop *loop, vect_loop_form_info *info)
>  				   "not vectorized: latch block not empty.\n");
>  
>    /* Make sure the exit is not abnormal.  */
> -  edge e = single_exit (loop);
> -  if (e->flags & EDGE_ABNORMAL)
> -    return opt_result::failure_at (vect_location,
> -				   "not vectorized:"
> -				   " abnormal loop exit edge.\n");
> +  auto_vec<edge> exits = get_loop_exit_edges (loop);
> +  edge nexit = loop->vec_loop_iv;
> +  for (edge e : exits)
> +    {
> +      if (e->flags & EDGE_ABNORMAL)
> +	return opt_result::failure_at (vect_location,
> +				       "not vectorized:"
> +				       " abnormal loop exit edge.\n");
> +      /* Early break BB must be after the main exit BB.  In theory we should
> +	 be able to vectorize the inverse order, but the current flow in the
> +	 the vectorizer always assumes you update successor PHI nodes, not
> +	 preds.  */
> +      if (e != nexit && !dominated_by_p (CDI_DOMINATORS, nexit->src, e->src))
> +	return opt_result::failure_at (vect_location,
> +				       "not vectorized:"
> +				       " abnormal loop exit edge order.\n");
> +    }
> +
> +  /* We currently only support early exit loops with known bounds.   */
> +  if (exits.length () > 1)
> +    {
> +      class tree_niter_desc niter;
> +      if (!number_of_iterations_exit_assumptions (loop, nexit, &niter, NULL)
> +	  || chrec_contains_undetermined (niter.niter)
> +	  || !evolution_function_is_constant_p (niter.niter))
> +	return opt_result::failure_at (vect_location,
> +				       "not vectorized:"
> +				       " early breaks only supported on loops"
> +				       " with known iteration bounds.\n");
> +    }
>  
>    info->conds
>      = vect_get_loop_niters (loop, &info->assumptions,
> @@ -1866,6 +1898,10 @@ vect_create_loop_vinfo (class loop *loop, vec_info_shared *shared,
>    LOOP_VINFO_LOOP_CONDS (loop_vinfo).safe_splice (info->alt_loop_conds);
>    LOOP_VINFO_LOOP_IV_COND (loop_vinfo) = info->loop_cond;
>  
> +  /* Check to see if we're vectorizing multiple exits.  */
> +  LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
> +    = !LOOP_VINFO_LOOP_CONDS (loop_vinfo).is_empty ();
> +
>    if (info->inner_loop_cond)
>      {
>        stmt_vec_info inner_loop_cond_info
> @@ -3070,7 +3106,8 @@ start_over:
>  
>    /* If an epilogue loop is required make sure we can create one.  */
>    if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> -      || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo))
> +      || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)
> +      || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
>      {
>        if (dump_enabled_p ())
>          dump_printf_loc (MSG_NOTE, vect_location, "epilog loop required\n");
> @@ -5797,7 +5834,7 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo,
>    basic_block exit_bb;
>    tree scalar_dest;
>    tree scalar_type;
> -  gimple *new_phi = NULL, *phi;
> +  gimple *new_phi = NULL, *phi = NULL;
>    gimple_stmt_iterator exit_gsi;
>    tree new_temp = NULL_TREE, new_name, new_scalar_dest;
>    gimple *epilog_stmt = NULL;
> @@ -6039,6 +6076,33 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo,
>  	  new_def = gimple_convert (&stmts, vectype, new_def);
>  	  reduc_inputs.quick_push (new_def);
>  	}
> +
> +	/* Update the other exits.  */
> +	if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> +	  {
> +	    vec<edge> alt_exits = LOOP_VINFO_ALT_EXITS (loop_vinfo);
> +	    gphi_iterator gsi, gsi1;
> +	    for (edge exit : alt_exits)
> +	      {
> +		/* Find the phi node to propaget into the exit block for each
> +		   exit edge.  */
> +		for (gsi = gsi_start_phis (exit_bb),
> +		     gsi1 = gsi_start_phis (exit->src);
> +		     !gsi_end_p (gsi) && !gsi_end_p (gsi1);
> +		     gsi_next (&gsi), gsi_next (&gsi1))
> +		  {
> +		    /* There really should be a function to just get the number
> +		       of phis inside a bb.  */
> +		    if (phi && phi == gsi.phi ())
> +		      {
> +			gphi *phi1 = gsi1.phi ();
> +			SET_PHI_ARG_DEF (phi, exit->dest_idx,
> +					 PHI_RESULT (phi1));
> +			break;
> +		      }
> +		  }
> +	      }
> +	  }
>        gsi_insert_seq_before (&exit_gsi, stmts, GSI_SAME_STMT);
>      }
>  
> @@ -10355,6 +10419,13 @@ vectorizable_live_operation (vec_info *vinfo,
>  	   new_tree = lane_extract <vec_lhs', ...>;
>  	   lhs' = new_tree;  */
>  
> +      /* When vectorizing an early break, any live statement that is used
> +	 outside of the loop are dead.  The loop will never get to them.
> +	 We could change the liveness value during analysis instead but since
> +	 the below code is invalid anyway just ignore it during codegen.  */
> +      if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> +	return true;
> +
>        class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
>        basic_block exit_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
>        gcc_assert (single_pred_p (exit_bb));
> @@ -11277,7 +11348,7 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
>    /* Make sure there exists a single-predecessor exit bb.  Do this before 
>       versioning.   */
>    edge e = LOOP_VINFO_IV_EXIT (loop_vinfo);
> -  if (! single_pred_p (e->dest))
> +  if (e && ! single_pred_p (e->dest) && !LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
>      {
>        split_loop_exit_edge (e, true);
>        if (dump_enabled_p ())
> @@ -11303,7 +11374,7 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
>    if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo))
>      {
>        e = single_exit (LOOP_VINFO_SCALAR_LOOP (loop_vinfo));
> -      if (! single_pred_p (e->dest))
> +      if (e && ! single_pred_p (e->dest))
>  	{
>  	  split_loop_exit_edge (e, true);
>  	  if (dump_enabled_p ())
> @@ -11641,7 +11712,8 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
>  
>    /* Loops vectorized with a variable factor won't benefit from
>       unrolling/peeling.  */
> -  if (!vf.is_constant ())
> +  if (!vf.is_constant ()
> +      && !LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
>      {
>        loop->unroll = 1;
>        if (dump_enabled_p ())
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index 87c4353fa5180fcb7f60b192897456cf24f3fdbe..03524e8500ee06df42f82afe78ee2a7c627be45b 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -344,9 +344,34 @@ vect_stmt_relevant_p (stmt_vec_info stmt_info, loop_vec_info loop_vinfo,
>    *live_p = false;
>  
>    /* cond stmt other than loop exit cond.  */
> -  if (is_ctrl_stmt (stmt_info->stmt)
> -      && STMT_VINFO_TYPE (stmt_info) != loop_exit_ctrl_vec_info_type)
> -    *relevant = vect_used_in_scope;
> +  if (is_ctrl_stmt (stmt_info->stmt))
> +    {
> +      /* Ideally EDGE_LOOP_EXIT would have been set on the exit edge, but
> +	 it looks like loop_manip doesn't do that..  So we have to do it
> +	 the hard way.  */
> +      basic_block bb = gimple_bb (stmt_info->stmt);
> +      bool exit_bb = false, early_exit = false;
> +      edge_iterator ei;
> +      edge e;
> +      FOR_EACH_EDGE (e, ei, bb->succs)
> +        if (!flow_bb_inside_loop_p (loop, e->dest))
> +	  {
> +	    exit_bb = true;
> +	    early_exit = loop->vec_loop_iv->src != bb;
> +	    break;
> +	  }
> +
> +      /* We should have processed any exit edge, so an edge not an early
> +	 break must be a loop IV edge.  We need to distinguish between the
> +	 two as we don't want to generate code for the main loop IV.  */
> +      if (exit_bb)
> +	{
> +	  if (early_exit)
> +	    *relevant = vect_used_in_scope;
> +	}
> +      else if (bb->loop_father == loop)
> +	LOOP_VINFO_GENERAL_CTR_FLOW (loop_vinfo) = true;
> +    }
>  
>    /* changing memory.  */
>    if (gimple_code (stmt_info->stmt) != GIMPLE_PHI)
> @@ -359,6 +384,11 @@ vect_stmt_relevant_p (stmt_vec_info stmt_info, loop_vec_info loop_vinfo,
>  	*relevant = vect_used_in_scope;
>        }
>  
> +  auto_vec<edge> exits = get_loop_exit_edges (loop);
> +  auto_bitmap exit_bbs;
> +  for (edge exit : exits)
> +    bitmap_set_bit (exit_bbs, exit->dest->index);
> +
>    /* uses outside the loop.  */
>    FOR_EACH_PHI_OR_STMT_DEF (def_p, stmt_info->stmt, op_iter, SSA_OP_DEF)
>      {
> @@ -377,7 +407,7 @@ vect_stmt_relevant_p (stmt_vec_info stmt_info, loop_vec_info loop_vinfo,
>  	      /* We expect all such uses to be in the loop exit phis
>  		 (because of loop closed form)   */
>  	      gcc_assert (gimple_code (USE_STMT (use_p)) == GIMPLE_PHI);
> -	      gcc_assert (bb == single_exit (loop)->dest);
> +	      gcc_assert (bitmap_bit_p (exit_bbs, bb->index));
>  
>                *live_p = true;
>  	    }
> @@ -683,6 +713,13 @@ vect_mark_stmts_to_be_vectorized (loop_vec_info loop_vinfo, bool *fatal)
>  	}
>      }
>  
> +  /* Ideally this should be in vect_analyze_loop_form but we haven't seen all
> +     the conds yet at that point and there's no quick way to retrieve them.  */
> +  if (LOOP_VINFO_GENERAL_CTR_FLOW (loop_vinfo))
> +    return opt_result::failure_at (vect_location,
> +				   "not vectorized:"
> +				   " unsupported control flow in loop.\n");
> +
>    /* 2. Process_worklist */
>    while (worklist.length () > 0)
>      {
> @@ -778,6 +815,20 @@ vect_mark_stmts_to_be_vectorized (loop_vec_info loop_vinfo, bool *fatal)
>  			return res;
>  		    }
>                   }
> +	    }
> +	  else if (gcond *cond = dyn_cast <gcond *> (stmt_vinfo->stmt))
> +	    {
> +	      enum tree_code rhs_code = gimple_cond_code (cond);
> +	      gcc_assert (TREE_CODE_CLASS (rhs_code) == tcc_comparison);
> +	      opt_result res
> +		= process_use (stmt_vinfo, gimple_cond_lhs (cond),
> +			       loop_vinfo, relevant, &worklist, false);
> +	      if (!res)
> +		return res;
> +	      res = process_use (stmt_vinfo, gimple_cond_rhs (cond),
> +				loop_vinfo, relevant, &worklist, false);
> +	      if (!res)
> +		return res;
>              }
>  	  else if (gcall *call = dyn_cast <gcall *> (stmt_vinfo->stmt))
>  	    {
> @@ -11919,11 +11970,15 @@ vect_analyze_stmt (vec_info *vinfo,
>  			     node_instance, cost_vec);
>        if (!res)
>  	return res;
> -   }
> +    }
> +
> +  if (is_ctrl_stmt (stmt_info->stmt))
> +    STMT_VINFO_DEF_TYPE (stmt_info) = vect_early_exit_def;
>  
>    switch (STMT_VINFO_DEF_TYPE (stmt_info))
>      {
>        case vect_internal_def:
> +      case vect_early_exit_def:
>          break;
>  
>        case vect_reduction_def:
> @@ -11956,6 +12011,7 @@ vect_analyze_stmt (vec_info *vinfo,
>      {
>        gcall *call = dyn_cast <gcall *> (stmt_info->stmt);
>        gcc_assert (STMT_VINFO_VECTYPE (stmt_info)
> +		  || gimple_code (stmt_info->stmt) == GIMPLE_COND
>  		  || (call && gimple_call_lhs (call) == NULL_TREE));
>        *need_to_vectorize = true;
>      }
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index ec65b65b5910e9cbad0a8c7e83c950b6168b98bf..24a0567a2f23f1b3d8b340baff61d18da8e242dd 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -63,6 +63,7 @@ enum vect_def_type {
>    vect_internal_def,
>    vect_induction_def,
>    vect_reduction_def,
> +  vect_early_exit_def,
>    vect_double_reduction_def,
>    vect_nested_cycle,
>    vect_first_order_recurrence,
> @@ -876,6 +877,13 @@ public:
>       we need to peel off iterations at the end to form an epilogue loop.  */
>    bool peeling_for_niter;
>  
> +  /* When the loop has early breaks that we can vectorize we need to peel
> +     the loop for the break finding loop.  */
> +  bool early_breaks;
> +
> +  /* When the loop has a non-early break control flow inside.  */
> +  bool non_break_control_flow;
> +
>    /* List of loop additional IV conditionals found in the loop.  */
>    auto_vec<gcond *> conds;
>  
> @@ -985,9 +993,11 @@ public:
>  #define LOOP_VINFO_REDUCTION_CHAINS(L)     (L)->reduction_chains
>  #define LOOP_VINFO_PEELING_FOR_GAPS(L)     (L)->peeling_for_gaps
>  #define LOOP_VINFO_PEELING_FOR_NITER(L)    (L)->peeling_for_niter
> +#define LOOP_VINFO_EARLY_BREAKS(L)         (L)->early_breaks
>  #define LOOP_VINFO_EARLY_BRK_CONFLICT_STMTS(L) (L)->early_break_conflict
>  #define LOOP_VINFO_EARLY_BRK_DEST_BB(L)    (L)->early_break_dest_bb
>  #define LOOP_VINFO_EARLY_BRK_VUSES(L)      (L)->early_break_vuses
> +#define LOOP_VINFO_GENERAL_CTR_FLOW(L)     (L)->non_break_control_flow
>  #define LOOP_VINFO_LOOP_CONDS(L)           (L)->conds
>  #define LOOP_VINFO_LOOP_IV_COND(L)         (L)->loop_iv_cond
>  #define LOOP_VINFO_NO_DATA_DEPENDENCIES(L) (L)->no_data_dependencies
> @@ -1038,8 +1048,8 @@ public:
>     stack.  */
>  typedef opt_pointer_wrapper <loop_vec_info> opt_loop_vec_info;
>  
> -inline loop_vec_info
> -loop_vec_info_for_loop (class loop *loop)
> +static inline loop_vec_info
> +loop_vec_info_for_loop (const class loop *loop)
>  {
>    return (loop_vec_info) loop->aux;
>  }
> @@ -1789,7 +1799,7 @@ is_loop_header_bb_p (basic_block bb)
>  {
>    if (bb == (bb->loop_father)->header)
>      return true;
> -  gcc_checking_assert (EDGE_COUNT (bb->preds) == 1);
> +
>    return false;
>  }
>  
> @@ -2176,9 +2186,10 @@ class auto_purge_vect_location
>     in tree-vect-loop-manip.cc.  */
>  extern void vect_set_loop_condition (class loop *, loop_vec_info,
>  				     tree, tree, tree, bool);
> -extern bool slpeel_can_duplicate_loop_p (const class loop *, const_edge);
> +extern bool slpeel_can_duplicate_loop_p (const loop_vec_info, const_edge);
>  class loop *slpeel_tree_duplicate_loop_to_edge_cfg (class loop *,
> -						     class loop *, edge);
> +						    class loop *, edge, bool,
> +						    vec<basic_block> * = NULL);
>  class loop *vect_loop_versioning (loop_vec_info, gimple *);
>  extern class loop *vect_do_peeling (loop_vec_info, tree, tree,
>  				    tree *, tree *, tree *, int, bool, bool,
> diff --git a/gcc/tree-vectorizer.cc b/gcc/tree-vectorizer.cc
> index a048e9d89178a37455bd7b83ab0f2a238a4ce69e..0dc5479dc92058b6c70c67f29f5dc9a8d72235f4 100644
> --- a/gcc/tree-vectorizer.cc
> +++ b/gcc/tree-vectorizer.cc
> @@ -1379,7 +1379,9 @@ pass_vectorize::execute (function *fun)
>  	 predicates that need to be shared for optimal predicate usage.
>  	 However reassoc will re-order them and prevent CSE from working
>  	 as it should.  CSE only the loop body, not the entry.  */
> -      bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
> +      auto_vec<edge> exits = get_loop_exit_edges (loop);
> +      for (edge exit : exits)
> +	bitmap_set_bit (exit_bbs, exit->dest->index);
>  
>        edge entry = EDGE_PRED (loop_preheader_edge (loop)->src, 0);
>        do_rpo_vn (fun, entry, exit_bbs);
> 
> 
> 
> 
>
  
Tamar Christina July 13, 2023, 7:05 p.m. UTC | #2
> -----Original Message-----
> From: Richard Biener <rguenther@suse.de>
> Sent: Thursday, July 13, 2023 6:31 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; jlaw@ventanamicro.com
> Subject: Re: [PATCH 12/19]middle-end: implement loop peeling and IV
> updates for early break.
> 
> On Wed, 28 Jun 2023, Tamar Christina wrote:
> 
> > Hi All,
> >
> > This patch updates the peeling code to maintain LCSSA during peeling.
> > The rewrite also naturally takes into account multiple exits and so it didn't
> > make sense to split them off.
> >
> > For the purposes of peeling the only change for multiple exits is that the
> > secondary exits are all wired to the start of the new loop preheader when
> doing
> > epilogue peeling.
> >
> > When doing prologue peeling the CFG is kept in tact.
> >
> > For both epilogue and prologue peeling we wire through between the two
> loops any
> > PHI nodes that escape the first loop into the second loop if flow_loops is
> > specified.  The reason for this conditionality is because
> > slpeel_tree_duplicate_loop_to_edge_cfg is used in the compiler in 3 ways:
> >   - prologue peeling
> >   - epilogue peeling
> >   - loop distribution
> >
> > for the last case the loops should remain independent, and so not be
> connected.
> > Because of this propagation of only used phi nodes get_current_def can be
> used
> > to easily find the previous definitions.  However live statements that are
> > not used inside the loop itself are not propagated (since if unused, the
> moment
> > we add the guard in between the two loops the value across the bypass edge
> can
> > be wrong if the loop has been peeled.)
> >
> > This is dealt with easily enough in find_guard_arg.
> >
> > For multiple exits, while we are in LCSSA form, and have a correct DOM tree,
> the
> > moment we add the guard block we will change the dominators again.  To
> deal with
> > this slpeel_tree_duplicate_loop_to_edge_cfg can optionally return the blocks
> to
> > update without having to recompute the list of blocks to update again.
> >
> > When multiple exits and doing epilogue peeling we will also temporarily have
> an
> > incorrect VUSES chain for the secondary exits as it anticipates the final result
> > after the VDEFs have been moved.  This will thus be corrected once the code
> > motion is applied.
> >
> > Lastly by doing things this way we can remove the helper functions that
> > previously did lock step iterations to update things as it went along.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> >
> > Ok for master?
> 
> Not sure if I get through all of this in one go - so be prepared that
> the rest of the review follows another day.

No worries, I appreciate the reviews!
Just giving some quick replies for when you continue.

> 
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > 	* tree-loop-distribution.cc (copy_loop_before): Pass flow_loops =
> false.
> > 	* tree-ssa-loop-niter.cc (loop_only_exit_p):  Fix bug when exit==null.
> > 	* tree-vect-loop-manip.cc (adjust_phi_and_debug_stmts): Add
> additional
> > 	assert.
> > 	(vect_set_loop_condition_normal): Skip modifying loop IV for multiple
> > 	exits.
> > 	(slpeel_tree_duplicate_loop_to_edge_cfg): Support multiple exit
> peeling.
> > 	(slpeel_can_duplicate_loop_p): Likewise.
> > 	(vect_update_ivs_after_vectorizer): Don't enter this...
> > 	(vect_update_ivs_after_early_break): ...but instead enter here.
> > 	(find_guard_arg): Update for new peeling code.
> > 	(slpeel_update_phi_nodes_for_loops): Remove.
> > 	(slpeel_update_phi_nodes_for_guard2): Remove hardcoded edge 0
> checks.
> > 	(slpeel_update_phi_nodes_for_lcssa): Remove.
> > 	(vect_do_peeling): Fix VF for multiple exits and force epilogue.
> > 	* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Initialize
> > 	non_break_control_flow and early_breaks.
> > 	(vect_need_peeling_or_partial_vectors_p): Force partial vector if
> > 	multiple exits and VLA.
> > 	(vect_analyze_loop_form): Support inner loop multiple exits.
> > 	(vect_create_loop_vinfo): Set LOOP_VINFO_EARLY_BREAKS.
> > 	(vect_create_epilog_for_reduction):  Update live phi nodes.
> > 	(vectorizable_live_operation): Ignore live operations in vector loop
> > 	when multiple exits.
> > 	(vect_transform_loop): Force unrolling for VF loops and multiple exits.
> > 	* tree-vect-stmts.cc (vect_stmt_relevant_p): Analyze ctrl statements.
> > 	(vect_mark_stmts_to_be_vectorized): Check for non-exit control flow
> and
> > 	analyze gcond params.
> > 	(vect_analyze_stmt): Support gcond.
> > 	* tree-vectorizer.cc (pass_vectorize::execute): Support multiple exits
> > 	in RPO pass.
> > 	* tree-vectorizer.h (enum vect_def_type): Add vect_early_exit_def.
> > 	(LOOP_VINFO_EARLY_BREAKS, LOOP_VINFO_GENERAL_CTR_FLOW):
> New.
> > 	(loop_vec_info_for_loop): Change to const and static.
> > 	(is_loop_header_bb_p): Drop assert.
> > 	(slpeel_can_duplicate_loop_p): Update prototype.
> > 	(class loop): Add early_breaks and non_break_control_flow.
> >
> > --- inline copy of patch --
> > diff --git a/gcc/tree-loop-distribution.cc b/gcc/tree-loop-distribution.cc
> > index
> 97879498db46dd3c34181ae9aa6e5476004dd5b5..d790ce5fffab3aa3dfc40
> d833a968314a4442b9e 100644
> > --- a/gcc/tree-loop-distribution.cc
> > +++ b/gcc/tree-loop-distribution.cc
> > @@ -948,7 +948,7 @@ copy_loop_before (class loop *loop, bool
> redirect_lc_phi_defs)
> >    edge preheader = loop_preheader_edge (loop);
> >
> >    initialize_original_copy_tables ();
> > -  res = slpeel_tree_duplicate_loop_to_edge_cfg (loop, NULL, preheader);
> > +  res = slpeel_tree_duplicate_loop_to_edge_cfg (loop, NULL, preheader,
> false);
> >    gcc_assert (res != NULL);
> >
> >    /* When a not last partition is supposed to keep the LC PHIs computed
> > diff --git a/gcc/tree-ssa-loop-niter.cc b/gcc/tree-ssa-loop-niter.cc
> > index
> 5d398b67e68c7076760854119590f18b19c622b6..79686f6c4945b7139ba
> 377300430c04b7aeefe6c 100644
> > --- a/gcc/tree-ssa-loop-niter.cc
> > +++ b/gcc/tree-ssa-loop-niter.cc
> > @@ -3072,7 +3072,12 @@ loop_only_exit_p (const class loop *loop,
> basic_block *body, const_edge exit)
> >    gimple_stmt_iterator bsi;
> >    unsigned i;
> >
> > -  if (exit != single_exit (loop))
> > +  /* We need to check for alternative exits since exit can be NULL.  */
> 
> You mean we pass in exit == NULL in some cases?  I'm not sure what
> the desired behavior in that case is - can you point out the
> callers you are fixing here?
> 
> I think we should add gcc_assert (exit != nullptr)
> 
> >    for (i = 0; i < loop->num_nodes; i++)
> > diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
> > index
> 6b93fb3f9af8f2bbdf5dec28f0009177aa5171ab..550d7f40002cf0b58f8a92
> 7cb150edd7c2aa9999 100644
> > --- a/gcc/tree-vect-loop-manip.cc
> > +++ b/gcc/tree-vect-loop-manip.cc
> > @@ -252,6 +252,9 @@ adjust_phi_and_debug_stmts (gimple *update_phi,
> edge e, tree new_def)
> >  {
> >    tree orig_def = PHI_ARG_DEF_FROM_EDGE (update_phi, e);
> >
> > +  gcc_assert (TREE_CODE (orig_def) != SSA_NAME
> > +	      || orig_def != new_def);
> > +
> >    SET_PHI_ARG_DEF (update_phi, e->dest_idx, new_def);
> >
> >    if (MAY_HAVE_DEBUG_BIND_STMTS)
> > @@ -1292,7 +1295,8 @@ vect_set_loop_condition_normal (loop_vec_info
> loop_vinfo,
> >    gsi_insert_before (&loop_cond_gsi, cond_stmt, GSI_SAME_STMT);
> >
> >    /* Record the number of latch iterations.  */
> > -  if (limit == niters)
> > +  if (limit == niters
> > +      || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> >      /* Case A: the loop iterates NITERS times.  Subtract one to get the
> >         latch count.  */
> >      loop->nb_iterations = fold_build2 (MINUS_EXPR, niters_type, niters,
> > @@ -1303,7 +1307,13 @@ vect_set_loop_condition_normal
> (loop_vec_info loop_vinfo,
> >      loop->nb_iterations = fold_build2 (TRUNC_DIV_EXPR, niters_type,
> >  				       limit, step);
> >
> > -  if (final_iv)
> > +  /* For multiple exits we've already maintained LCSSA form and handled
> > +     the scalar iteration update in the code that deals with the merge
> > +     block and its updated guard.  I could move that code here instead
> > +     of in vect_update_ivs_after_early_break but I have to still deal
> > +     with the updates to the counter `i`.  So for now I'll keep them
> > +     together.  */
> > +  if (final_iv && !LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> >      {
> >        gassign *assign;
> >        edge exit = LOOP_VINFO_IV_EXIT (loop_vinfo);
> > @@ -1509,11 +1519,19 @@ vec_init_exit_info (class loop *loop)
> >     on E which is either the entry or exit of LOOP.  If SCALAR_LOOP is
> >     non-NULL, assume LOOP and SCALAR_LOOP are equivalent and copy the
> >     basic blocks from SCALAR_LOOP instead of LOOP, but to either the
> > -   entry or exit of LOOP.  */
> > +   entry or exit of LOOP.  If FLOW_LOOPS then connect LOOP to
> SCALAR_LOOP as a
> > +   continuation.  This is correct for cases where one loop continues from the
> > +   other like in the vectorizer, but not true for uses in e.g. loop distribution
> > +   where the loop is duplicated and then modified.
> > +
> 
> but for loop distribution the flow also continues?  I'm not sure what you
> are refering to here.  Do you by chance have a branch with the patches
> installed?

Yup, they're at refs/users/tnfchris/heads/gcc-14-early-break in the repo.

> 
> > +   If UPDATED_DOMS is not NULL it is update with the list of basic blocks
> whoms
> > +   dominators were updated during the peeling.  */
> >
> >  class loop *
> >  slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop,
> > -					class loop *scalar_loop, edge e)
> > +					class loop *scalar_loop, edge e,
> > +					bool flow_loops,
> > +					vec<basic_block> *updated_doms)
> >  {
> >    class loop *new_loop;
> >    basic_block *new_bbs, *bbs, *pbbs;
> > @@ -1602,6 +1620,19 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class
> loop *loop,
> >    for (unsigned i = (at_exit ? 0 : 1); i < scalar_loop->num_nodes + 1; i++)
> >      rename_variables_in_bb (new_bbs[i], duplicate_outer_loop);
> >
> > +  /* Rename the exit uses.  */
> > +  for (edge exit : get_loop_exit_edges (new_loop))
> > +    for (auto gsi = gsi_start_phis (exit->dest);
> > +	 !gsi_end_p (gsi); gsi_next (&gsi))
> > +      {
> > +	tree orig_def = PHI_ARG_DEF_FROM_EDGE (gsi.phi (), exit);
> > +	rename_use_op (PHI_ARG_DEF_PTR_FROM_EDGE (gsi.phi (), exit));
> > +	if (MAY_HAVE_DEBUG_BIND_STMTS)
> > +	  adjust_debug_stmts (orig_def, PHI_RESULT (gsi.phi ()), exit->dest);
> > +      }
> > +
> > +  /* This condition happens when the loop has been versioned. e.g. due to
> ifcvt
> > +     versioning the loop.  */
> >    if (scalar_loop != loop)
> >      {
> >        /* If we copied from SCALAR_LOOP rather than LOOP, SSA_NAMEs from
> > @@ -1616,28 +1647,106 @@ slpeel_tree_duplicate_loop_to_edge_cfg
> (class loop *loop,
> >  						EDGE_SUCC (loop->latch, 0));
> >      }
> >
> > +  vec<edge> alt_exits = loop->vec_loop_alt_exits;
> 
> So 'e' is not one of alt_exits, right?  I wonder if we can simply
> compute the vector from all exits of 'loop' and removing 'e'?
> 
> > +  bool multiple_exits_p = !alt_exits.is_empty ();
> > +  auto_vec<basic_block> doms;
> > +  class loop *update_loop = NULL;
> > +
> >    if (at_exit) /* Add the loop copy at exit.  */
> >      {
> > -      if (scalar_loop != loop)
> > +      if (scalar_loop != loop && new_exit->dest != exit_dest)
> >  	{
> > -	  gphi_iterator gsi;
> >  	  new_exit = redirect_edge_and_branch (new_exit, exit_dest);
> > +	  flush_pending_stmts (new_exit);
> > +	}
> >
> > -	  for (gsi = gsi_start_phis (exit_dest); !gsi_end_p (gsi);
> > -	       gsi_next (&gsi))
> > +      auto loop_exits = get_loop_exit_edges (loop);
> > +      for (edge exit : loop_exits)
> > +	redirect_edge_and_branch (exit, new_preheader);
> > +
> > +
> 
> one line vertical space too much
> 
> > +      /* Copy the current loop LC PHI nodes between the original loop exit
> > +	 block and the new loop header.  This allows us to later split the
> > +	 preheader block and still find the right LC nodes.  */
> > +      edge latch_new = single_succ_edge (new_preheader);
> > +      edge latch_old = loop_latch_edge (loop);
> > +      hash_set <tree> lcssa_vars;
> > +      for (auto gsi_from = gsi_start_phis (latch_old->dest),
> 
> so that's loop->header (and makes it more clear which PHI nodes you are
> looking at)
> 
> > +	   gsi_to = gsi_start_phis (latch_new->dest);
> 
> likewise new_loop->header
> 
> > +	   flow_loops && !gsi_end_p (gsi_from) && !gsi_end_p (gsi_to);
> > +	   gsi_next (&gsi_from), gsi_next (&gsi_to))
> > +	{
> > +	  gimple *from_phi = gsi_stmt (gsi_from);
> > +	  gimple *to_phi = gsi_stmt (gsi_to);
> > +	  tree new_arg = PHI_ARG_DEF_FROM_EDGE (from_phi, latch_old);
> > +	  /* In all cases, even in early break situations we're only
> > +	     interested in the number of fully executed loop iters.  As such
> > +	     we discard any partially done iteration.  So we simply propagate
> > +	     the phi nodes from the latch to the merge block.  */
> > +	  tree new_res = copy_ssa_name (gimple_phi_result (from_phi));
> > +	  gphi *lcssa_phi = create_phi_node (new_res, e->dest);
> > +
> > +	  lcssa_vars.add (new_arg);
> > +
> > +	  /* Main loop exit should use the final iter value.  */
> > +	  add_phi_arg (lcssa_phi, new_arg, loop->vec_loop_iv,
> UNKNOWN_LOCATION);
> 
> above you are creating the PHI node at e->dest but here add the PHI arg to
> loop->vec_loop_iv - that's 'e' here, no?  Consistency makes it easier
> to follow.  I _think_ this code doesn't need to know about the "special"
> edge.
> 
> > +
> > +	  /* All other exits use the previous iters.  */
> > +	  for (edge e : alt_exits)
> > +	    add_phi_arg (lcssa_phi, gimple_phi_result (from_phi), e,
> > +			 UNKNOWN_LOCATION);
> > +
> > +	  adjust_phi_and_debug_stmts (to_phi, latch_new, new_res);
> > +	}
> > +
> > +      /* Copy over any live SSA vars that may not have been materialized in
> the
> > +	 loops themselves but would be in the exit block.  However when the
> live
> > +	 value is not used inside the loop then we don't need to do this,  if we
> do
> > +	 then when we split the guard block the branch edge can end up
> containing the
> > +	 wrong reference,  particularly if it shares an edge with something that
> has
> > +	 bypassed the loop.  This is not something peeling can check so we
> need to
> > +	 anticipate the usage of the live variable here.  */
> > +      auto exit_map = redirect_edge_var_map_vector (exit);
> 
> Hmm, did I use that in my attemt to refactor things? ...

Indeed, I didn't always use it, but found it was the best way to deal with the
variables being live in various BB after the loop.

> 
> > +      if (exit_map)
> > +        for (auto vm : exit_map)
> > +	{
> > +	  if (lcssa_vars.contains (vm.def)
> > +	      || TREE_CODE (vm.def) != SSA_NAME)
> 
> the latter check is cheaper so it should come first
> 
> > +	    continue;
> > +
> > +	  imm_use_iterator imm_iter;
> > +	  use_operand_p use_p;
> > +	  bool use_in_loop = false;
> > +
> > +	  FOR_EACH_IMM_USE_FAST (use_p, imm_iter, vm.def)
> >  	    {
> > -	      gphi *phi = gsi.phi ();
> > -	      tree orig_arg = PHI_ARG_DEF_FROM_EDGE (phi, e);
> > -	      location_t orig_locus
> > -		= gimple_phi_arg_location_from_edge (phi, e);
> > +	      basic_block bb = gimple_bb (USE_STMT (use_p));
> > +	      if (flow_bb_inside_loop_p (loop, bb)
> > +		  && !gimple_vuse (USE_STMT (use_p)))
> > +		{
> > +		  use_in_loop = true;
> > +		  break;
> > +		}
> > +	    }
> >
> > -	      add_phi_arg (phi, orig_arg, new_exit, orig_locus);
> > +	  if (!use_in_loop)
> > +	    {
> > +	       /* Do a final check to see if it's perhaps defined in the loop.  This
> > +		  mirrors the relevancy analysis's used_outside_scope.  */
> > +	      gimple *stmt = SSA_NAME_DEF_STMT (vm.def);
> > +	      if (!stmt || !flow_bb_inside_loop_p (loop, gimple_bb (stmt)))
> > +		continue;
> >  	    }
> > +
> > +	  tree new_res = copy_ssa_name (vm.result);
> > +	  gphi *lcssa_phi = create_phi_node (new_res, e->dest);
> > +	  for (edge exit : loop_exits)
> > +	     add_phi_arg (lcssa_phi, vm.def, exit, vm.locus);
> 
> not sure what you are doing above - I guess I have to play with it
> in a debug session.

Yeah if you comment it out one of the testcases should fail.

> 
> >  	}
> > -      redirect_edge_and_branch_force (e, new_preheader);
> > -      flush_pending_stmts (e);
> > +
> >        set_immediate_dominator (CDI_DOMINATORS, new_preheader, e->src);
> > -      if (was_imm_dom || duplicate_outer_loop)
> > +
> > +      if ((was_imm_dom || duplicate_outer_loop) && !multiple_exits_p)
> >  	set_immediate_dominator (CDI_DOMINATORS, exit_dest, new_exit-
> >src);
> >
> >        /* And remove the non-necessary forwarder again.  Keep the other
> > @@ -1647,9 +1756,42 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class
> loop *loop,
> >        delete_basic_block (preheader);
> >        set_immediate_dominator (CDI_DOMINATORS, scalar_loop->header,
> >  			       loop_preheader_edge (scalar_loop)->src);
> > +
> > +      /* Finally after wiring the new epilogue we need to update its main exit
> > +	 to the original function exit we recorded.  Other exits are already
> > +	 correct.  */
> > +      if (multiple_exits_p)
> > +	{
> > +	  for (edge e : get_loop_exit_edges (loop))
> > +	    doms.safe_push (e->dest);
> > +	  update_loop = new_loop;
> > +	  doms.safe_push (exit_dest);
> > +
> > +	  /* Likely a fall-through edge, so update if needed.  */
> > +	  if (single_succ_p (exit_dest))
> > +	    doms.safe_push (single_succ (exit_dest));
> > +	}
> >      }
> >    else /* Add the copy at entry.  */
> >      {
> > +      /* Copy the current loop LC PHI nodes between the original loop exit
> > +	 block and the new loop header.  This allows us to later split the
> > +	 preheader block and still find the right LC nodes.  */
> > +      edge old_latch_loop = loop_latch_edge (loop);
> > +      edge old_latch_init = loop_preheader_edge (loop);
> > +      edge new_latch_loop = loop_latch_edge (new_loop);
> > +      edge new_latch_init = loop_preheader_edge (new_loop);
> > +      for (auto gsi_from = gsi_start_phis (new_latch_init->dest),
> 
> see above
> 
> > +	   gsi_to = gsi_start_phis (old_latch_loop->dest);
> > +	   flow_loops && !gsi_end_p (gsi_from) && !gsi_end_p (gsi_to);
> > +	   gsi_next (&gsi_from), gsi_next (&gsi_to))
> > +	{
> > +	  gimple *from_phi = gsi_stmt (gsi_from);
> > +	  gimple *to_phi = gsi_stmt (gsi_to);
> > +	  tree new_arg = PHI_ARG_DEF_FROM_EDGE (from_phi,
> new_latch_loop);
> > +	  adjust_phi_and_debug_stmts (to_phi, old_latch_init, new_arg);
> > +	}
> > +
> >        if (scalar_loop != loop)
> >  	{
> >  	  /* Remove the non-necessary forwarder of scalar_loop again.  */
> > @@ -1677,31 +1819,36 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class
> loop *loop,
> >        delete_basic_block (new_preheader);
> >        set_immediate_dominator (CDI_DOMINATORS, new_loop->header,
> >  			       loop_preheader_edge (new_loop)->src);
> > +
> > +      if (multiple_exits_p)
> > +	update_loop = loop;
> >      }
> >
> > -  if (scalar_loop != loop)
> > +  if (multiple_exits_p)
> >      {
> > -      /* Update new_loop->header PHIs, so that on the preheader
> > -	 edge they are the ones from loop rather than scalar_loop.  */
> > -      gphi_iterator gsi_orig, gsi_new;
> > -      edge orig_e = loop_preheader_edge (loop);
> > -      edge new_e = loop_preheader_edge (new_loop);
> > -
> > -      for (gsi_orig = gsi_start_phis (loop->header),
> > -	   gsi_new = gsi_start_phis (new_loop->header);
> > -	   !gsi_end_p (gsi_orig) && !gsi_end_p (gsi_new);
> > -	   gsi_next (&gsi_orig), gsi_next (&gsi_new))
> > +      for (edge e : get_loop_exit_edges (update_loop))
> >  	{
> > -	  gphi *orig_phi = gsi_orig.phi ();
> > -	  gphi *new_phi = gsi_new.phi ();
> > -	  tree orig_arg = PHI_ARG_DEF_FROM_EDGE (orig_phi, orig_e);
> > -	  location_t orig_locus
> > -	    = gimple_phi_arg_location_from_edge (orig_phi, orig_e);
> > -
> > -	  add_phi_arg (new_phi, orig_arg, new_e, orig_locus);
> > +	  edge ex;
> > +	  edge_iterator ei;
> > +	  FOR_EACH_EDGE (ex, ei, e->dest->succs)
> > +	    {
> > +	      /* Find the first non-fallthrough block as fall-throughs can't
> > +		 dominate other blocks.  */
> > +	      while ((ex->flags & EDGE_FALLTHRU)
> 
> I don't think EDGE_FALLTHRU is set correctly, what's wrong with
> just using single_succ_p here?  A fallthru edge src dominates the
> fallthru edge dest, so the sentence above doesn't make sense.

I wanted to say, that the immediate dominator of a block is never
an fall through block.  At least from what I understood from how
the dominators are calculated in the code, though may have missed
something.

> 
> > +		     && single_succ_p (ex->dest))
> > +		{
> > +		  doms.safe_push (ex->dest);
> > +		  ex = single_succ_edge (ex->dest);
> > +		}
> > +	      doms.safe_push (ex->dest);
> > +	    }
> > +	  doms.safe_push (e->dest);
> >  	}
> > -    }
> >
> > +      iterate_fix_dominators (CDI_DOMINATORS, doms, false);
> > +      if (updated_doms)
> > +	updated_doms->safe_splice (doms);
> > +    }
> >    free (new_bbs);
> >    free (bbs);
> >
> > @@ -1777,6 +1924,9 @@ slpeel_can_duplicate_loop_p (const
> loop_vec_info loop_vinfo, const_edge e)
> >    gimple_stmt_iterator loop_exit_gsi = gsi_last_bb (exit_e->src);
> >    unsigned int num_bb = loop->inner? 5 : 2;
> >
> > +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > +    num_bb += LOOP_VINFO_ALT_EXITS (loop_vinfo).length ();
> > +
> 
> I think checking the number of BBs is odd, I don't remember anything
> in slpeel is specifically tied to that?  I think we can simply drop
> this or do you remember anything that would depend on ->num_nodes
> being only exactly 5 or 2?

Never actually seemed to require it, but they're used as some check to
see if there are unexpected control flow in the loop.

i.e. this would say no if you have an if statement in the loop that wasn't
converted.  The other part of this and the accompanying explanation is in
vect_analyze_loop_form.  In the patch series I had to remove the hard
num_nodes == 2 check from there because number of nodes restricted
things too much.  If you have an empty fall through block, which seems to
happen often between the main exit and the latch block then we'd not
vectorize.

So instead I now rejects loops after analyzing the gcond.  So think this check
can go/needs to be different.

> 
> >    /* All loops have an outer scope; the only case loop->outer is NULL is for
> >       the function itself.  */
> >    if (!loop_outer (loop)
> > @@ -2044,6 +2194,11 @@ vect_update_ivs_after_vectorizer
> (loop_vec_info loop_vinfo,
> >    class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> >    basic_block update_bb = update_e->dest;
> >
> > +  /* For early exits we'll update the IVs in
> > +     vect_update_ivs_after_early_break.  */
> > +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > +    return;
> > +
> >    basic_block exit_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
> >
> >    /* Make sure there exists a single-predecessor exit bb:  */
> > @@ -2131,6 +2286,208 @@ vect_update_ivs_after_vectorizer
> (loop_vec_info loop_vinfo,
> >        /* Fix phi expressions in the successor bb.  */
> >        adjust_phi_and_debug_stmts (phi1, update_e, ni_name);
> >      }
> > +  return;
> 
> we don't usually place a return at the end of void functions
> 
> > +}
> > +
> > +/*   Function vect_update_ivs_after_early_break.
> > +
> > +     "Advance" the induction variables of LOOP to the value they should take
> > +     after the execution of LOOP.  This is currently necessary because the
> > +     vectorizer does not handle induction variables that are used after the
> > +     loop.  Such a situation occurs when the last iterations of LOOP are
> > +     peeled, because of the early exit.  With an early exit we always peel the
> > +     loop.
> > +
> > +     Input:
> > +     - LOOP_VINFO - a loop info structure for the loop that is going to be
> > +		    vectorized. The last few iterations of LOOP were peeled.
> > +     - LOOP - a loop that is going to be vectorized. The last few iterations
> > +	      of LOOP were peeled.
> > +     - VF - The loop vectorization factor.
> > +     - NITERS_ORIG - the number of iterations that LOOP executes (before it is
> > +		     vectorized). i.e, the number of times the ivs should be
> > +		     bumped.
> > +     - NITERS_VECTOR - The number of iterations that the vector LOOP
> executes.
> > +     - UPDATE_E - a successor edge of LOOP->exit that is on the (only) path
> > +		  coming out from LOOP on which there are uses of the LOOP
> ivs
> > +		  (this is the path from LOOP->exit to epilog_loop->preheader).
> > +
> > +		  The new definitions of the ivs are placed in LOOP->exit.
> > +		  The phi args associated with the edge UPDATE_E in the bb
> > +		  UPDATE_E->dest are updated accordingly.
> > +
> > +     Output:
> > +       - If available, the LCSSA phi node for the loop IV temp.
> > +
> > +     Assumption 1: Like the rest of the vectorizer, this function assumes
> > +     a single loop exit that has a single predecessor.
> > +
> > +     Assumption 2: The phi nodes in the LOOP header and in update_bb are
> > +     organized in the same order.
> > +
> > +     Assumption 3: The access function of the ivs is simple enough (see
> > +     vect_can_advance_ivs_p).  This assumption will be relaxed in the future.
> > +
> > +     Assumption 4: Exactly one of the successors of LOOP exit-bb is on a path
> > +     coming out of LOOP on which the ivs of LOOP are used (this is the path
> > +     that leads to the epilog loop; other paths skip the epilog loop).  This
> > +     path starts with the edge UPDATE_E, and its destination (denoted
> update_bb)
> > +     needs to have its phis updated.
> > + */
> > +
> > +static tree
> > +vect_update_ivs_after_early_break (loop_vec_info loop_vinfo, class loop *
> epilog,
> > +				   poly_int64 vf, tree niters_orig,
> > +				   tree niters_vector, edge update_e)
> > +{
> > +  if (!LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > +    return NULL;
> > +
> > +  gphi_iterator gsi, gsi1;
> > +  tree ni_name, ivtmp = NULL;
> > +  basic_block update_bb = update_e->dest;
> > +  vec<edge> alt_exits = LOOP_VINFO_ALT_EXITS (loop_vinfo);
> > +  edge loop_iv = LOOP_VINFO_IV_EXIT (loop_vinfo);
> > +  basic_block exit_bb = loop_iv->dest;
> > +  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > +  gcond *cond = LOOP_VINFO_LOOP_IV_COND (loop_vinfo);
> > +
> > +  gcc_assert (cond);
> > +
> > +  for (gsi = gsi_start_phis (loop->header), gsi1 = gsi_start_phis (update_bb);
> > +       !gsi_end_p (gsi) && !gsi_end_p (gsi1);
> > +       gsi_next (&gsi), gsi_next (&gsi1))
> > +    {
> > +      tree init_expr, final_expr, step_expr;
> > +      tree type;
> > +      tree var, ni, off;
> > +      gimple_stmt_iterator last_gsi;
> > +
> > +      gphi *phi = gsi1.phi ();
> > +      tree phi_ssa = PHI_ARG_DEF_FROM_EDGE (phi, loop_preheader_edge
> (epilog));
> 
> I'm confused about the setup.  update_bb looks like the block with the
> loop-closed PHI nodes of 'loop' and the exit (update_e)?  How does
> loop_preheader_edge (epilog) come into play here?  That would feed into
> epilog->header PHIs?!

We can't query the type of the phis in the block with the LC PHI nodes, so the
Typical pattern seems to be that we iterate over a block that's part of the loop
and that would have the PHIs in the same order, just so we can get to the
stmt_vec_info.

> 
> It would be nice to name 'gsi[1]', 'update_e' and 'update_bb' in a
> better way?  Is update_bb really epilog->header?!
> 
> We're missing checking in PHI_ARG_DEF_FROM_EDGE, namely that
> E->dest == gimple_bb (PHI) - we're just using E->dest_idx there
> which "works" even for totally unrelated edges.
> 
> > +      gphi *phi1 = dyn_cast <gphi *> (SSA_NAME_DEF_STMT (phi_ssa));
> > +      if (!phi1)
> 
> shouldn't that be an assert?
> 
> > +	continue;
> > +      stmt_vec_info phi_info = loop_vinfo->lookup_stmt (gsi.phi ());
> > +      if (dump_enabled_p ())
> > +	dump_printf_loc (MSG_NOTE, vect_location,
> > +			 "vect_update_ivs_after_early_break: phi: %G",
> > +			 (gimple *)phi);
> > +
> > +      /* Skip reduction and virtual phis.  */
> > +      if (!iv_phi_p (phi_info))
> > +	{
> > +	  if (dump_enabled_p ())
> > +	    dump_printf_loc (MSG_NOTE, vect_location,
> > +			     "reduc or virtual phi. skip.\n");
> > +	  continue;
> > +	}
> > +
> > +      /* For multiple exits where we handle early exits we need to carry on
> > +	 with the previous IV as loop iteration was not done because we exited
> > +	 early.  As such just grab the original IV.  */
> > +      phi_ssa = PHI_ARG_DEF_FROM_EDGE (gsi.phi (), loop_latch_edge
> (loop));
> 
> but this should be taken care of by LC SSA?

It is, the comment is probably missing details, this part just scales the counter
from VF to scalar counts.  It's just a reminder that this scaling is done differently
from normal single exit vectorization.

> 
> OK, have to continue tomorrow from here.

Cheers, Thank you!

Tamar

> 
> Richard.
> 
> > +      if (gimple_cond_lhs (cond) != phi_ssa
> > +	  && gimple_cond_rhs (cond) != phi_ssa)
> > +	{
> > +	  type = TREE_TYPE (gimple_phi_result (phi));
> > +	  step_expr = STMT_VINFO_LOOP_PHI_EVOLUTION_PART (phi_info);
> > +	  step_expr = unshare_expr (step_expr);
> > +
> > +	  /* We previously generated the new merged phi in the same BB as
> the
> > +	     guard.  So use that to perform the scaling on rather than the
> > +	     normal loop phi which don't take the early breaks into account.  */
> > +	  final_expr = gimple_phi_result (phi1);
> > +	  init_expr = PHI_ARG_DEF_FROM_EDGE (gsi.phi (),
> loop_preheader_edge (loop));
> > +
> > +	  tree stype = TREE_TYPE (step_expr);
> > +	  /* For early break the final loop IV is:
> > +	     init + (final - init) * vf which takes into account peeling
> > +	     values and non-single steps.  */
> > +	  off = fold_build2 (MINUS_EXPR, stype,
> > +			     fold_convert (stype, final_expr),
> > +			     fold_convert (stype, init_expr));
> > +	  /* Now adjust for VF to get the final iteration value.  */
> > +	  off = fold_build2 (MULT_EXPR, stype, off, build_int_cst (stype, vf));
> > +
> > +	  /* Adjust the value with the offset.  */
> > +	  if (POINTER_TYPE_P (type))
> > +	    ni = fold_build_pointer_plus (init_expr, off);
> > +	  else
> > +	    ni = fold_convert (type,
> > +			       fold_build2 (PLUS_EXPR, stype,
> > +					    fold_convert (stype, init_expr),
> > +					    off));
> > +	  var = create_tmp_var (type, "tmp");
> > +
> > +	  last_gsi = gsi_last_bb (exit_bb);
> > +	  gimple_seq new_stmts = NULL;
> > +	  ni_name = force_gimple_operand (ni, &new_stmts, false, var);
> > +	  /* Exit_bb shouldn't be empty.  */
> > +	  if (!gsi_end_p (last_gsi))
> > +	    gsi_insert_seq_after (&last_gsi, new_stmts, GSI_SAME_STMT);
> > +	  else
> > +	    gsi_insert_seq_before (&last_gsi, new_stmts, GSI_SAME_STMT);
> > +
> > +	  /* Fix phi expressions in the successor bb.  */
> > +	  adjust_phi_and_debug_stmts (phi, update_e, ni_name);
> > +	}
> > +      else
> > +	{
> > +	  type = TREE_TYPE (gimple_phi_result (phi));
> > +	  step_expr = STMT_VINFO_LOOP_PHI_EVOLUTION_PART (phi_info);
> > +	  step_expr = unshare_expr (step_expr);
> > +
> > +	  /* We previously generated the new merged phi in the same BB as
> the
> > +	     guard.  So use that to perform the scaling on rather than the
> > +	     normal loop phi which don't take the early breaks into account.  */
> > +	  init_expr = PHI_ARG_DEF_FROM_EDGE (phi1, loop_preheader_edge
> (loop));
> > +	  tree stype = TREE_TYPE (step_expr);
> > +
> > +	  if (vf.is_constant ())
> > +	    {
> > +	      ni = fold_build2 (MULT_EXPR, stype,
> > +				fold_convert (stype,
> > +					      niters_vector),
> > +				build_int_cst (stype, vf));
> > +
> > +	      ni = fold_build2 (MINUS_EXPR, stype,
> > +				fold_convert (stype,
> > +					      niters_orig),
> > +				fold_convert (stype, ni));
> > +	    }
> > +	  else
> > +	    /* If the loop's VF isn't constant then the loop must have been
> > +	       masked, so at the end of the loop we know we have finished
> > +	       the entire loop and found nothing.  */
> > +	    ni = build_zero_cst (stype);
> > +
> > +	  ni = fold_convert (type, ni);
> > +	  /* We don't support variable n in this version yet.  */
> > +	  gcc_assert (TREE_CODE (ni) == INTEGER_CST);
> > +
> > +	  var = create_tmp_var (type, "tmp");
> > +
> > +	  last_gsi = gsi_last_bb (exit_bb);
> > +	  gimple_seq new_stmts = NULL;
> > +	  ni_name = force_gimple_operand (ni, &new_stmts, false, var);
> > +	  /* Exit_bb shouldn't be empty.  */
> > +	  if (!gsi_end_p (last_gsi))
> > +	    gsi_insert_seq_after (&last_gsi, new_stmts, GSI_SAME_STMT);
> > +	  else
> > +	    gsi_insert_seq_before (&last_gsi, new_stmts, GSI_SAME_STMT);
> > +
> > +	  adjust_phi_and_debug_stmts (phi1, loop_iv, ni_name);
> > +
> > +	  for (edge exit : alt_exits)
> > +	    adjust_phi_and_debug_stmts (phi1, exit,
> > +					build_int_cst (TREE_TYPE (step_expr),
> > +						       vf));
> > +	  ivtmp = gimple_phi_result (phi1);
> > +	}
> > +    }
> > +
> > +  return ivtmp;
> >  }
> >
> >  /* Return a gimple value containing the misalignment (measured in vector
> > @@ -2632,137 +2989,34 @@ vect_gen_vector_loop_niters_mult_vf
> (loop_vec_info loop_vinfo,
> >
> >  /* LCSSA_PHI is a lcssa phi of EPILOG loop which is copied from LOOP,
> >     this function searches for the corresponding lcssa phi node in exit
> > -   bb of LOOP.  If it is found, return the phi result; otherwise return
> > -   NULL.  */
> > +   bb of LOOP following the LCSSA_EDGE to the exit node.  If it is found,
> > +   return the phi result; otherwise return NULL.  */
> >
> >  static tree
> >  find_guard_arg (class loop *loop, class loop *epilog ATTRIBUTE_UNUSED,
> > -		gphi *lcssa_phi)
> > +		gphi *lcssa_phi, int lcssa_edge = 0)
> >  {
> >    gphi_iterator gsi;
> >    edge e = loop->vec_loop_iv;
> >
> > -  gcc_assert (single_pred_p (e->dest));
> >    for (gsi = gsi_start_phis (e->dest); !gsi_end_p (gsi); gsi_next (&gsi))
> >      {
> >        gphi *phi = gsi.phi ();
> > -      if (operand_equal_p (PHI_ARG_DEF (phi, 0),
> > -			   PHI_ARG_DEF (lcssa_phi, 0), 0))
> > -	return PHI_RESULT (phi);
> > -    }
> > -  return NULL_TREE;
> > -}
> > -
> > -/* Function slpeel_tree_duplicate_loop_to_edge_cfg duplciates
> FIRST/SECOND
> > -   from SECOND/FIRST and puts it at the original loop's preheader/exit
> > -   edge, the two loops are arranged as below:
> > -
> > -       preheader_a:
> > -     first_loop:
> > -       header_a:
> > -	 i_1 = PHI<i_0, i_2>;
> > -	 ...
> > -	 i_2 = i_1 + 1;
> > -	 if (cond_a)
> > -	   goto latch_a;
> > -	 else
> > -	   goto between_bb;
> > -       latch_a:
> > -	 goto header_a;
> > -
> > -       between_bb:
> > -	 ;; i_x = PHI<i_2>;   ;; LCSSA phi node to be created for FIRST,
> > -
> > -     second_loop:
> > -       header_b:
> > -	 i_3 = PHI<i_0, i_4>; ;; Use of i_0 to be replaced with i_x,
> > -				 or with i_2 if no LCSSA phi is created
> > -				 under condition of
> CREATE_LCSSA_FOR_IV_PHIS.
> > -	 ...
> > -	 i_4 = i_3 + 1;
> > -	 if (cond_b)
> > -	   goto latch_b;
> > -	 else
> > -	   goto exit_bb;
> > -       latch_b:
> > -	 goto header_b;
> > -
> > -       exit_bb:
> > -
> > -   This function creates loop closed SSA for the first loop; update the
> > -   second loop's PHI nodes by replacing argument on incoming edge with the
> > -   result of newly created lcssa PHI nodes.  IF CREATE_LCSSA_FOR_IV_PHIS
> > -   is false, Loop closed ssa phis will only be created for non-iv phis for
> > -   the first loop.
> > -
> > -   This function assumes exit bb of the first loop is preheader bb of the
> > -   second loop, i.e, between_bb in the example code.  With PHIs updated,
> > -   the second loop will execute rest iterations of the first.  */
> > -
> > -static void
> > -slpeel_update_phi_nodes_for_loops (loop_vec_info loop_vinfo,
> > -				   class loop *first, class loop *second,
> > -				   bool create_lcssa_for_iv_phis)
> > -{
> > -  gphi_iterator gsi_update, gsi_orig;
> > -  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > -
> > -  edge first_latch_e = EDGE_SUCC (first->latch, 0);
> > -  edge second_preheader_e = loop_preheader_edge (second);
> > -  basic_block between_bb = single_exit (first)->dest;
> > -
> > -  gcc_assert (between_bb == second_preheader_e->src);
> > -  gcc_assert (single_pred_p (between_bb) && single_succ_p (between_bb));
> > -  /* Either the first loop or the second is the loop to be vectorized.  */
> > -  gcc_assert (loop == first || loop == second);
> > -
> > -  for (gsi_orig = gsi_start_phis (first->header),
> > -       gsi_update = gsi_start_phis (second->header);
> > -       !gsi_end_p (gsi_orig) && !gsi_end_p (gsi_update);
> > -       gsi_next (&gsi_orig), gsi_next (&gsi_update))
> > -    {
> > -      gphi *orig_phi = gsi_orig.phi ();
> > -      gphi *update_phi = gsi_update.phi ();
> > -
> > -      tree arg = PHI_ARG_DEF_FROM_EDGE (orig_phi, first_latch_e);
> > -      /* Generate lcssa PHI node for the first loop.  */
> > -      gphi *vect_phi = (loop == first) ? orig_phi : update_phi;
> > -      stmt_vec_info vect_phi_info = loop_vinfo->lookup_stmt (vect_phi);
> > -      if (create_lcssa_for_iv_phis || !iv_phi_p (vect_phi_info))
> > +      /* Nested loops with multiple exits can have different no# phi node
> > +	 arguments between the main loop and epilog as epilog falls to the
> > +	 second loop.  */
> > +      if (gimple_phi_num_args (phi) > e->dest_idx)
> >  	{
> > -	  tree new_res = copy_ssa_name (PHI_RESULT (orig_phi));
> > -	  gphi *lcssa_phi = create_phi_node (new_res, between_bb);
> > -	  add_phi_arg (lcssa_phi, arg, single_exit (first),
> UNKNOWN_LOCATION);
> > -	  arg = new_res;
> > -	}
> > -
> > -      /* Update PHI node in the second loop by replacing arg on the loop's
> > -	 incoming edge.  */
> > -      adjust_phi_and_debug_stmts (update_phi, second_preheader_e, arg);
> > -    }
> > -
> > -  /* For epilogue peeling we have to make sure to copy all LC PHIs
> > -     for correct vectorization of live stmts.  */
> > -  if (loop == first)
> > -    {
> > -      basic_block orig_exit = single_exit (second)->dest;
> > -      for (gsi_orig = gsi_start_phis (orig_exit);
> > -	   !gsi_end_p (gsi_orig); gsi_next (&gsi_orig))
> > -	{
> > -	  gphi *orig_phi = gsi_orig.phi ();
> > -	  tree orig_arg = PHI_ARG_DEF (orig_phi, 0);
> > -	  if (TREE_CODE (orig_arg) != SSA_NAME || virtual_operand_p
> (orig_arg))
> > -	    continue;
> > -
> > -	  /* Already created in the above loop.   */
> > -	  if (find_guard_arg (first, second, orig_phi))
> > +	  tree var = PHI_ARG_DEF (phi, e->dest_idx);
> > +	  if (TREE_CODE (var) != SSA_NAME)
> >  	    continue;
> >
> > -	  tree new_res = copy_ssa_name (orig_arg);
> > -	  gphi *lcphi = create_phi_node (new_res, between_bb);
> > -	  add_phi_arg (lcphi, orig_arg, single_exit (first),
> UNKNOWN_LOCATION);
> > +	  if (operand_equal_p (get_current_def (var),
> > +			       PHI_ARG_DEF (lcssa_phi, lcssa_edge), 0))
> > +	    return PHI_RESULT (phi);
> >  	}
> >      }
> > +  return NULL_TREE;
> >  }
> >
> >  /* Function slpeel_add_loop_guard adds guard skipping from the beginning
> > @@ -2910,13 +3164,11 @@ slpeel_update_phi_nodes_for_guard2 (class
> loop *loop, class loop *epilog,
> >    gcc_assert (single_succ_p (merge_bb));
> >    edge e = single_succ_edge (merge_bb);
> >    basic_block exit_bb = e->dest;
> > -  gcc_assert (single_pred_p (exit_bb));
> > -  gcc_assert (single_pred (exit_bb) == single_exit (epilog)->dest);
> >
> >    for (gsi = gsi_start_phis (exit_bb); !gsi_end_p (gsi); gsi_next (&gsi))
> >      {
> >        gphi *update_phi = gsi.phi ();
> > -      tree old_arg = PHI_ARG_DEF (update_phi, 0);
> > +      tree old_arg = PHI_ARG_DEF (update_phi, e->dest_idx);
> >
> >        tree merge_arg = NULL_TREE;
> >
> > @@ -2928,7 +3180,7 @@ slpeel_update_phi_nodes_for_guard2 (class loop
> *loop, class loop *epilog,
> >        if (!merge_arg)
> >  	merge_arg = old_arg;
> >
> > -      tree guard_arg = find_guard_arg (loop, epilog, update_phi);
> > +      tree guard_arg = find_guard_arg (loop, epilog, update_phi, e->dest_idx);
> >        /* If the var is live after loop but not a reduction, we simply
> >  	 use the old arg.  */
> >        if (!guard_arg)
> > @@ -2948,21 +3200,6 @@ slpeel_update_phi_nodes_for_guard2 (class
> loop *loop, class loop *epilog,
> >      }
> >  }
> >
> > -/* EPILOG loop is duplicated from the original loop for vectorizing,
> > -   the arg of its loop closed ssa PHI needs to be updated.  */
> > -
> > -static void
> > -slpeel_update_phi_nodes_for_lcssa (class loop *epilog)
> > -{
> > -  gphi_iterator gsi;
> > -  basic_block exit_bb = single_exit (epilog)->dest;
> > -
> > -  gcc_assert (single_pred_p (exit_bb));
> > -  edge e = EDGE_PRED (exit_bb, 0);
> > -  for (gsi = gsi_start_phis (exit_bb); !gsi_end_p (gsi); gsi_next (&gsi))
> > -    rename_use_op (PHI_ARG_DEF_PTR_FROM_EDGE (gsi.phi (), e));
> > -}
> > -
> >  /* EPILOGUE_VINFO is an epilogue loop that we now know would need to
> >     iterate exactly CONST_NITERS times.  Make a final decision about
> >     whether the epilogue loop should be used, returning true if so.  */
> > @@ -3138,6 +3375,14 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree
> niters, tree nitersm1,
> >      bound_epilog += vf - 1;
> >    if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
> >      bound_epilog += 1;
> > +  /* For early breaks the scalar loop needs to execute at most VF times
> > +     to find the element that caused the break.  */
> > +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > +    {
> > +      bound_epilog = vf;
> > +      /* Force a scalar epilogue as we can't vectorize the index finding.  */
> > +      vect_epilogues = false;
> > +    }
> >    bool epilog_peeling = maybe_ne (bound_epilog, 0U);
> >    poly_uint64 bound_scalar = bound_epilog;
> >
> > @@ -3297,16 +3542,24 @@ vect_do_peeling (loop_vec_info loop_vinfo,
> tree niters, tree nitersm1,
> >  				  bound_prolog + bound_epilog)
> >  		      : (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
> >  			 || vect_epilogues));
> > +
> > +  /* We only support early break vectorization on known bounds at this
> time.
> > +     This means that if the vector loop can't be entered then we won't
> generate
> > +     it at all.  So for now force skip_vector off because the additional control
> > +     flow messes with the BB exits and we've already analyzed them.  */
> > + skip_vector = skip_vector && !LOOP_VINFO_EARLY_BREAKS (loop_vinfo);
> > +
> >    /* Epilog loop must be executed if the number of iterations for epilog
> >       loop is known at compile time, otherwise we need to add a check at
> >       the end of vector loop and skip to the end of epilog loop.  */
> >    bool skip_epilog = (prolog_peeling < 0
> >  		      || !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> >  		      || !vf.is_constant ());
> > -  /* PEELING_FOR_GAPS is special because epilog loop must be executed.  */
> > -  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
> > +  /* PEELING_FOR_GAPS and peeling for early breaks are special because
> epilog
> > +     loop must be executed.  */
> > +  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> > +      || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> >      skip_epilog = false;
> > -
> >    class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
> >    auto_vec<profile_count> original_counts;
> >    basic_block *original_bbs = NULL;
> > @@ -3344,13 +3597,13 @@ vect_do_peeling (loop_vec_info loop_vinfo,
> tree niters, tree nitersm1,
> >    if (prolog_peeling)
> >      {
> >        e = loop_preheader_edge (loop);
> > -      gcc_checking_assert (slpeel_can_duplicate_loop_p (loop, e));
> > -
> > +      gcc_checking_assert (slpeel_can_duplicate_loop_p (loop_vinfo, e));
> >        /* Peel prolog and put it on preheader edge of loop.  */
> > -      prolog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, scalar_loop, e);
> > +      prolog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, scalar_loop, e,
> > +						       true);
> >        gcc_assert (prolog);
> >        prolog->force_vectorize = false;
> > -      slpeel_update_phi_nodes_for_loops (loop_vinfo, prolog, loop, true);
> > +
> >        first_loop = prolog;
> >        reset_original_copy_tables ();
> >
> > @@ -3420,11 +3673,12 @@ vect_do_peeling (loop_vec_info loop_vinfo,
> tree niters, tree nitersm1,
> >  	 as the transformations mentioned above make less or no sense when
> not
> >  	 vectorizing.  */
> >        epilog = vect_epilogues ? get_loop_copy (loop) : scalar_loop;
> > -      epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, epilog, e);
> > +      auto_vec<basic_block> doms;
> > +      epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, epilog, e, true,
> > +						       &doms);
> >        gcc_assert (epilog);
> >
> >        epilog->force_vectorize = false;
> > -      slpeel_update_phi_nodes_for_loops (loop_vinfo, loop, epilog, false);
> >
> >        /* Scalar version loop may be preferred.  In this case, add guard
> >  	 and skip to epilog.  Note this only happens when the number of
> > @@ -3496,6 +3750,54 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree
> niters, tree nitersm1,
> >        vect_update_ivs_after_vectorizer (loop_vinfo, niters_vector_mult_vf,
> >  					update_e);
> >
> > +      /* For early breaks we must create a guard to check how many iterations
> > +	 of the scalar loop are yet to be performed.  */
> > +      if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > +	{
> > +	  tree ivtmp =
> > +	    vect_update_ivs_after_early_break (loop_vinfo, epilog, vf, niters,
> > +					       *niters_vector, update_e);
> > +
> > +	  gcc_assert (ivtmp);
> > +	  tree guard_cond = fold_build2 (EQ_EXPR, boolean_type_node,
> > +					 fold_convert (TREE_TYPE (niters),
> > +						       ivtmp),
> > +					 build_zero_cst (TREE_TYPE (niters)));
> > +	  basic_block guard_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
> > +
> > +	  /* If we had a fallthrough edge, the guard will the threaded through
> > +	     and so we may need to find the actual final edge.  */
> > +	  edge final_edge = epilog->vec_loop_iv;
> > +	  /* slpeel_update_phi_nodes_for_guard2 expects an empty block in
> > +	     between the guard and the exit edge.  It only adds new nodes and
> > +	     doesn't update existing one in the current scheme.  */
> > +	  basic_block guard_to = split_edge (final_edge);
> > +	  edge guard_e = slpeel_add_loop_guard (guard_bb, guard_cond,
> guard_to,
> > +						guard_bb, prob_epilog.invert
> (),
> > +						irred_flag);
> > +	  doms.safe_push (guard_bb);
> > +
> > +	  iterate_fix_dominators (CDI_DOMINATORS, doms, false);
> > +
> > +	  /* We must update all the edges from the new guard_bb.  */
> > +	  slpeel_update_phi_nodes_for_guard2 (loop, epilog, guard_e,
> > +					      final_edge);
> > +
> > +	  /* If the loop was versioned we'll have an intermediate BB between
> > +	     the guard and the exit.  This intermediate block is required
> > +	     because in the current scheme of things the guard block phi
> > +	     updating can only maintain LCSSA by creating new blocks.  In this
> > +	     case we just need to update the uses in this block as well.  */
> > +	  if (loop != scalar_loop)
> > +	    {
> > +	      for (gphi_iterator gsi = gsi_start_phis (guard_to);
> > +		   !gsi_end_p (gsi); gsi_next (&gsi))
> > +		rename_use_op (PHI_ARG_DEF_PTR_FROM_EDGE (gsi.phi (),
> guard_e));
> > +	    }
> > +
> > +	  flush_pending_stmts (guard_e);
> > +	}
> > +
> >        if (skip_epilog)
> >  	{
> >  	  guard_cond = fold_build2 (EQ_EXPR, boolean_type_node,
> > @@ -3520,8 +3822,6 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree
> niters, tree nitersm1,
> >  	    }
> >  	  scale_loop_profile (epilog, prob_epilog, 0);
> >  	}
> > -      else
> > -	slpeel_update_phi_nodes_for_lcssa (epilog);
> >
> >        unsigned HOST_WIDE_INT bound;
> >        if (bound_scalar.is_constant (&bound))
> > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > index
> b4a98de80aa39057fc9b17977dd0e347b4f0fb5d..ab9a2048186f461f5ec49
> f21421958e7ee25eada 100644
> > --- a/gcc/tree-vect-loop.cc
> > +++ b/gcc/tree-vect-loop.cc
> > @@ -1007,6 +1007,8 @@ _loop_vec_info::_loop_vec_info (class loop
> *loop_in, vec_info_shared *shared)
> >      partial_load_store_bias (0),
> >      peeling_for_gaps (false),
> >      peeling_for_niter (false),
> > +    early_breaks (false),
> > +    non_break_control_flow (false),
> >      no_data_dependencies (false),
> >      has_mask_store (false),
> >      scalar_loop_scaling (profile_probability::uninitialized ()),
> > @@ -1199,6 +1201,14 @@ vect_need_peeling_or_partial_vectors_p
> (loop_vec_info loop_vinfo)
> >      th = LOOP_VINFO_COST_MODEL_THRESHOLD
> (LOOP_VINFO_ORIG_LOOP_INFO
> >  					  (loop_vinfo));
> >
> > +  /* When we have multiple exits and VF is unknown, we must require
> partial
> > +     vectors because the loop bounds is not a minimum but a maximum.
> That is to
> > +     say we cannot unpredicate the main loop unless we peel or use partial
> > +     vectors in the epilogue.  */
> > +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
> > +      && !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ())
> > +    return true;
> > +
> >    if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> >        && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) >= 0)
> >      {
> > @@ -1652,12 +1662,12 @@ vect_compute_single_scalar_iteration_cost
> (loop_vec_info loop_vinfo)
> >    loop_vinfo->scalar_costs->finish_cost (nullptr);
> >  }
> >
> > -
> >  /* Function vect_analyze_loop_form.
> >
> >     Verify that certain CFG restrictions hold, including:
> >     - the loop has a pre-header
> > -   - the loop has a single entry and exit
> > +   - the loop has a single entry
> > +   - nested loops can have only a single exit.
> >     - the loop exit condition is simple enough
> >     - the number of iterations can be analyzed, i.e, a countable loop.  The
> >       niter could be analyzed under some assumptions.  */
> > @@ -1693,11 +1703,6 @@ vect_analyze_loop_form (class loop *loop,
> vect_loop_form_info *info)
> >                             |
> >                          (exit-bb)  */
> >
> > -      if (loop->num_nodes != 2)
> > -	return opt_result::failure_at (vect_location,
> > -				       "not vectorized:"
> > -				       " control flow in loop.\n");
> > -
> >        if (empty_block_p (loop->header))
> >  	return opt_result::failure_at (vect_location,
> >  				       "not vectorized: empty loop.\n");
> > @@ -1768,11 +1773,13 @@ vect_analyze_loop_form (class loop *loop,
> vect_loop_form_info *info)
> >          dump_printf_loc (MSG_NOTE, vect_location,
> >  			 "Considering outer-loop vectorization.\n");
> >        info->inner_loop_cond = inner.loop_cond;
> > +
> > +      if (!single_exit (loop))
> > +	return opt_result::failure_at (vect_location,
> > +				       "not vectorized: multiple exits.\n");
> > +
> >      }
> >
> > -  if (!single_exit (loop))
> > -    return opt_result::failure_at (vect_location,
> > -				   "not vectorized: multiple exits.\n");
> >    if (EDGE_COUNT (loop->header->preds) != 2)
> >      return opt_result::failure_at (vect_location,
> >  				   "not vectorized:"
> > @@ -1788,11 +1795,36 @@ vect_analyze_loop_form (class loop *loop,
> vect_loop_form_info *info)
> >  				   "not vectorized: latch block not empty.\n");
> >
> >    /* Make sure the exit is not abnormal.  */
> > -  edge e = single_exit (loop);
> > -  if (e->flags & EDGE_ABNORMAL)
> > -    return opt_result::failure_at (vect_location,
> > -				   "not vectorized:"
> > -				   " abnormal loop exit edge.\n");
> > +  auto_vec<edge> exits = get_loop_exit_edges (loop);
> > +  edge nexit = loop->vec_loop_iv;
> > +  for (edge e : exits)
> > +    {
> > +      if (e->flags & EDGE_ABNORMAL)
> > +	return opt_result::failure_at (vect_location,
> > +				       "not vectorized:"
> > +				       " abnormal loop exit edge.\n");
> > +      /* Early break BB must be after the main exit BB.  In theory we should
> > +	 be able to vectorize the inverse order, but the current flow in the
> > +	 the vectorizer always assumes you update successor PHI nodes, not
> > +	 preds.  */
> > +      if (e != nexit && !dominated_by_p (CDI_DOMINATORS, nexit->src, e-
> >src))
> > +	return opt_result::failure_at (vect_location,
> > +				       "not vectorized:"
> > +				       " abnormal loop exit edge order.\n");
> > +    }
> > +
> > +  /* We currently only support early exit loops with known bounds.   */
> > +  if (exits.length () > 1)
> > +    {
> > +      class tree_niter_desc niter;
> > +      if (!number_of_iterations_exit_assumptions (loop, nexit, &niter, NULL)
> > +	  || chrec_contains_undetermined (niter.niter)
> > +	  || !evolution_function_is_constant_p (niter.niter))
> > +	return opt_result::failure_at (vect_location,
> > +				       "not vectorized:"
> > +				       " early breaks only supported on loops"
> > +				       " with known iteration bounds.\n");
> > +    }
> >
> >    info->conds
> >      = vect_get_loop_niters (loop, &info->assumptions,
> > @@ -1866,6 +1898,10 @@ vect_create_loop_vinfo (class loop *loop,
> vec_info_shared *shared,
> >    LOOP_VINFO_LOOP_CONDS (loop_vinfo).safe_splice (info-
> >alt_loop_conds);
> >    LOOP_VINFO_LOOP_IV_COND (loop_vinfo) = info->loop_cond;
> >
> > +  /* Check to see if we're vectorizing multiple exits.  */
> > +  LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
> > +    = !LOOP_VINFO_LOOP_CONDS (loop_vinfo).is_empty ();
> > +
> >    if (info->inner_loop_cond)
> >      {
> >        stmt_vec_info inner_loop_cond_info
> > @@ -3070,7 +3106,8 @@ start_over:
> >
> >    /* If an epilogue loop is required make sure we can create one.  */
> >    if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> > -      || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo))
> > +      || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)
> > +      || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> >      {
> >        if (dump_enabled_p ())
> >          dump_printf_loc (MSG_NOTE, vect_location, "epilog loop required\n");
> > @@ -5797,7 +5834,7 @@ vect_create_epilog_for_reduction (loop_vec_info
> loop_vinfo,
> >    basic_block exit_bb;
> >    tree scalar_dest;
> >    tree scalar_type;
> > -  gimple *new_phi = NULL, *phi;
> > +  gimple *new_phi = NULL, *phi = NULL;
> >    gimple_stmt_iterator exit_gsi;
> >    tree new_temp = NULL_TREE, new_name, new_scalar_dest;
> >    gimple *epilog_stmt = NULL;
> > @@ -6039,6 +6076,33 @@ vect_create_epilog_for_reduction
> (loop_vec_info loop_vinfo,
> >  	  new_def = gimple_convert (&stmts, vectype, new_def);
> >  	  reduc_inputs.quick_push (new_def);
> >  	}
> > +
> > +	/* Update the other exits.  */
> > +	if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > +	  {
> > +	    vec<edge> alt_exits = LOOP_VINFO_ALT_EXITS (loop_vinfo);
> > +	    gphi_iterator gsi, gsi1;
> > +	    for (edge exit : alt_exits)
> > +	      {
> > +		/* Find the phi node to propaget into the exit block for each
> > +		   exit edge.  */
> > +		for (gsi = gsi_start_phis (exit_bb),
> > +		     gsi1 = gsi_start_phis (exit->src);
> > +		     !gsi_end_p (gsi) && !gsi_end_p (gsi1);
> > +		     gsi_next (&gsi), gsi_next (&gsi1))
> > +		  {
> > +		    /* There really should be a function to just get the number
> > +		       of phis inside a bb.  */
> > +		    if (phi && phi == gsi.phi ())
> > +		      {
> > +			gphi *phi1 = gsi1.phi ();
> > +			SET_PHI_ARG_DEF (phi, exit->dest_idx,
> > +					 PHI_RESULT (phi1));
> > +			break;
> > +		      }
> > +		  }
> > +	      }
> > +	  }
> >        gsi_insert_seq_before (&exit_gsi, stmts, GSI_SAME_STMT);
> >      }
> >
> > @@ -10355,6 +10419,13 @@ vectorizable_live_operation (vec_info *vinfo,
> >  	   new_tree = lane_extract <vec_lhs', ...>;
> >  	   lhs' = new_tree;  */
> >
> > +      /* When vectorizing an early break, any live statement that is used
> > +	 outside of the loop are dead.  The loop will never get to them.
> > +	 We could change the liveness value during analysis instead but since
> > +	 the below code is invalid anyway just ignore it during codegen.  */
> > +      if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > +	return true;
> > +
> >        class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> >        basic_block exit_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
> >        gcc_assert (single_pred_p (exit_bb));
> > @@ -11277,7 +11348,7 @@ vect_transform_loop (loop_vec_info
> loop_vinfo, gimple *loop_vectorized_call)
> >    /* Make sure there exists a single-predecessor exit bb.  Do this before
> >       versioning.   */
> >    edge e = LOOP_VINFO_IV_EXIT (loop_vinfo);
> > -  if (! single_pred_p (e->dest))
> > +  if (e && ! single_pred_p (e->dest) && !LOOP_VINFO_EARLY_BREAKS
> (loop_vinfo))
> >      {
> >        split_loop_exit_edge (e, true);
> >        if (dump_enabled_p ())
> > @@ -11303,7 +11374,7 @@ vect_transform_loop (loop_vec_info
> loop_vinfo, gimple *loop_vectorized_call)
> >    if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo))
> >      {
> >        e = single_exit (LOOP_VINFO_SCALAR_LOOP (loop_vinfo));
> > -      if (! single_pred_p (e->dest))
> > +      if (e && ! single_pred_p (e->dest))
> >  	{
> >  	  split_loop_exit_edge (e, true);
> >  	  if (dump_enabled_p ())
> > @@ -11641,7 +11712,8 @@ vect_transform_loop (loop_vec_info
> loop_vinfo, gimple *loop_vectorized_call)
> >
> >    /* Loops vectorized with a variable factor won't benefit from
> >       unrolling/peeling.  */
> > -  if (!vf.is_constant ())
> > +  if (!vf.is_constant ()
> > +      && !LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> >      {
> >        loop->unroll = 1;
> >        if (dump_enabled_p ())
> > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > index
> 87c4353fa5180fcb7f60b192897456cf24f3fdbe..03524e8500ee06df42f82af
> e78ee2a7c627be45b 100644
> > --- a/gcc/tree-vect-stmts.cc
> > +++ b/gcc/tree-vect-stmts.cc
> > @@ -344,9 +344,34 @@ vect_stmt_relevant_p (stmt_vec_info stmt_info,
> loop_vec_info loop_vinfo,
> >    *live_p = false;
> >
> >    /* cond stmt other than loop exit cond.  */
> > -  if (is_ctrl_stmt (stmt_info->stmt)
> > -      && STMT_VINFO_TYPE (stmt_info) != loop_exit_ctrl_vec_info_type)
> > -    *relevant = vect_used_in_scope;
> > +  if (is_ctrl_stmt (stmt_info->stmt))
> > +    {
> > +      /* Ideally EDGE_LOOP_EXIT would have been set on the exit edge, but
> > +	 it looks like loop_manip doesn't do that..  So we have to do it
> > +	 the hard way.  */
> > +      basic_block bb = gimple_bb (stmt_info->stmt);
> > +      bool exit_bb = false, early_exit = false;
> > +      edge_iterator ei;
> > +      edge e;
> > +      FOR_EACH_EDGE (e, ei, bb->succs)
> > +        if (!flow_bb_inside_loop_p (loop, e->dest))
> > +	  {
> > +	    exit_bb = true;
> > +	    early_exit = loop->vec_loop_iv->src != bb;
> > +	    break;
> > +	  }
> > +
> > +      /* We should have processed any exit edge, so an edge not an early
> > +	 break must be a loop IV edge.  We need to distinguish between the
> > +	 two as we don't want to generate code for the main loop IV.  */
> > +      if (exit_bb)
> > +	{
> > +	  if (early_exit)
> > +	    *relevant = vect_used_in_scope;
> > +	}
> > +      else if (bb->loop_father == loop)
> > +	LOOP_VINFO_GENERAL_CTR_FLOW (loop_vinfo) = true;
> > +    }
> >
> >    /* changing memory.  */
> >    if (gimple_code (stmt_info->stmt) != GIMPLE_PHI)
> > @@ -359,6 +384,11 @@ vect_stmt_relevant_p (stmt_vec_info stmt_info,
> loop_vec_info loop_vinfo,
> >  	*relevant = vect_used_in_scope;
> >        }
> >
> > +  auto_vec<edge> exits = get_loop_exit_edges (loop);
> > +  auto_bitmap exit_bbs;
> > +  for (edge exit : exits)
> > +    bitmap_set_bit (exit_bbs, exit->dest->index);
> > +
> >    /* uses outside the loop.  */
> >    FOR_EACH_PHI_OR_STMT_DEF (def_p, stmt_info->stmt, op_iter,
> SSA_OP_DEF)
> >      {
> > @@ -377,7 +407,7 @@ vect_stmt_relevant_p (stmt_vec_info stmt_info,
> loop_vec_info loop_vinfo,
> >  	      /* We expect all such uses to be in the loop exit phis
> >  		 (because of loop closed form)   */
> >  	      gcc_assert (gimple_code (USE_STMT (use_p)) == GIMPLE_PHI);
> > -	      gcc_assert (bb == single_exit (loop)->dest);
> > +	      gcc_assert (bitmap_bit_p (exit_bbs, bb->index));
> >
> >                *live_p = true;
> >  	    }
> > @@ -683,6 +713,13 @@ vect_mark_stmts_to_be_vectorized
> (loop_vec_info loop_vinfo, bool *fatal)
> >  	}
> >      }
> >
> > +  /* Ideally this should be in vect_analyze_loop_form but we haven't seen all
> > +     the conds yet at that point and there's no quick way to retrieve them.  */
> > +  if (LOOP_VINFO_GENERAL_CTR_FLOW (loop_vinfo))
> > +    return opt_result::failure_at (vect_location,
> > +				   "not vectorized:"
> > +				   " unsupported control flow in loop.\n");
> > +
> >    /* 2. Process_worklist */
> >    while (worklist.length () > 0)
> >      {
> > @@ -778,6 +815,20 @@ vect_mark_stmts_to_be_vectorized
> (loop_vec_info loop_vinfo, bool *fatal)
> >  			return res;
> >  		    }
> >                   }
> > +	    }
> > +	  else if (gcond *cond = dyn_cast <gcond *> (stmt_vinfo->stmt))
> > +	    {
> > +	      enum tree_code rhs_code = gimple_cond_code (cond);
> > +	      gcc_assert (TREE_CODE_CLASS (rhs_code) == tcc_comparison);
> > +	      opt_result res
> > +		= process_use (stmt_vinfo, gimple_cond_lhs (cond),
> > +			       loop_vinfo, relevant, &worklist, false);
> > +	      if (!res)
> > +		return res;
> > +	      res = process_use (stmt_vinfo, gimple_cond_rhs (cond),
> > +				loop_vinfo, relevant, &worklist, false);
> > +	      if (!res)
> > +		return res;
> >              }
> >  	  else if (gcall *call = dyn_cast <gcall *> (stmt_vinfo->stmt))
> >  	    {
> > @@ -11919,11 +11970,15 @@ vect_analyze_stmt (vec_info *vinfo,
> >  			     node_instance, cost_vec);
> >        if (!res)
> >  	return res;
> > -   }
> > +    }
> > +
> > +  if (is_ctrl_stmt (stmt_info->stmt))
> > +    STMT_VINFO_DEF_TYPE (stmt_info) = vect_early_exit_def;
> >
> >    switch (STMT_VINFO_DEF_TYPE (stmt_info))
> >      {
> >        case vect_internal_def:
> > +      case vect_early_exit_def:
> >          break;
> >
> >        case vect_reduction_def:
> > @@ -11956,6 +12011,7 @@ vect_analyze_stmt (vec_info *vinfo,
> >      {
> >        gcall *call = dyn_cast <gcall *> (stmt_info->stmt);
> >        gcc_assert (STMT_VINFO_VECTYPE (stmt_info)
> > +		  || gimple_code (stmt_info->stmt) == GIMPLE_COND
> >  		  || (call && gimple_call_lhs (call) == NULL_TREE));
> >        *need_to_vectorize = true;
> >      }
> > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> > index
> ec65b65b5910e9cbad0a8c7e83c950b6168b98bf..24a0567a2f23f1b3d8b3
> 40baff61d18da8e242dd 100644
> > --- a/gcc/tree-vectorizer.h
> > +++ b/gcc/tree-vectorizer.h
> > @@ -63,6 +63,7 @@ enum vect_def_type {
> >    vect_internal_def,
> >    vect_induction_def,
> >    vect_reduction_def,
> > +  vect_early_exit_def,
> >    vect_double_reduction_def,
> >    vect_nested_cycle,
> >    vect_first_order_recurrence,
> > @@ -876,6 +877,13 @@ public:
> >       we need to peel off iterations at the end to form an epilogue loop.  */
> >    bool peeling_for_niter;
> >
> > +  /* When the loop has early breaks that we can vectorize we need to peel
> > +     the loop for the break finding loop.  */
> > +  bool early_breaks;
> > +
> > +  /* When the loop has a non-early break control flow inside.  */
> > +  bool non_break_control_flow;
> > +
> >    /* List of loop additional IV conditionals found in the loop.  */
> >    auto_vec<gcond *> conds;
> >
> > @@ -985,9 +993,11 @@ public:
> >  #define LOOP_VINFO_REDUCTION_CHAINS(L)     (L)->reduction_chains
> >  #define LOOP_VINFO_PEELING_FOR_GAPS(L)     (L)->peeling_for_gaps
> >  #define LOOP_VINFO_PEELING_FOR_NITER(L)    (L)->peeling_for_niter
> > +#define LOOP_VINFO_EARLY_BREAKS(L)         (L)->early_breaks
> >  #define LOOP_VINFO_EARLY_BRK_CONFLICT_STMTS(L) (L)-
> >early_break_conflict
> >  #define LOOP_VINFO_EARLY_BRK_DEST_BB(L)    (L)->early_break_dest_bb
> >  #define LOOP_VINFO_EARLY_BRK_VUSES(L)      (L)->early_break_vuses
> > +#define LOOP_VINFO_GENERAL_CTR_FLOW(L)     (L)-
> >non_break_control_flow
> >  #define LOOP_VINFO_LOOP_CONDS(L)           (L)->conds
> >  #define LOOP_VINFO_LOOP_IV_COND(L)         (L)->loop_iv_cond
> >  #define LOOP_VINFO_NO_DATA_DEPENDENCIES(L) (L)-
> >no_data_dependencies
> > @@ -1038,8 +1048,8 @@ public:
> >     stack.  */
> >  typedef opt_pointer_wrapper <loop_vec_info> opt_loop_vec_info;
> >
> > -inline loop_vec_info
> > -loop_vec_info_for_loop (class loop *loop)
> > +static inline loop_vec_info
> > +loop_vec_info_for_loop (const class loop *loop)
> >  {
> >    return (loop_vec_info) loop->aux;
> >  }
> > @@ -1789,7 +1799,7 @@ is_loop_header_bb_p (basic_block bb)
> >  {
> >    if (bb == (bb->loop_father)->header)
> >      return true;
> > -  gcc_checking_assert (EDGE_COUNT (bb->preds) == 1);
> > +
> >    return false;
> >  }
> >
> > @@ -2176,9 +2186,10 @@ class auto_purge_vect_location
> >     in tree-vect-loop-manip.cc.  */
> >  extern void vect_set_loop_condition (class loop *, loop_vec_info,
> >  				     tree, tree, tree, bool);
> > -extern bool slpeel_can_duplicate_loop_p (const class loop *, const_edge);
> > +extern bool slpeel_can_duplicate_loop_p (const loop_vec_info,
> const_edge);
> >  class loop *slpeel_tree_duplicate_loop_to_edge_cfg (class loop *,
> > -						     class loop *, edge);
> > +						    class loop *, edge, bool,
> > +						    vec<basic_block> * = NULL);
> >  class loop *vect_loop_versioning (loop_vec_info, gimple *);
> >  extern class loop *vect_do_peeling (loop_vec_info, tree, tree,
> >  				    tree *, tree *, tree *, int, bool, bool,
> > diff --git a/gcc/tree-vectorizer.cc b/gcc/tree-vectorizer.cc
> > index
> a048e9d89178a37455bd7b83ab0f2a238a4ce69e..0dc5479dc92058b6c70c
> 67f29f5dc9a8d72235f4 100644
> > --- a/gcc/tree-vectorizer.cc
> > +++ b/gcc/tree-vectorizer.cc
> > @@ -1379,7 +1379,9 @@ pass_vectorize::execute (function *fun)
> >  	 predicates that need to be shared for optimal predicate usage.
> >  	 However reassoc will re-order them and prevent CSE from working
> >  	 as it should.  CSE only the loop body, not the entry.  */
> > -      bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
> > +      auto_vec<edge> exits = get_loop_exit_edges (loop);
> > +      for (edge exit : exits)
> > +	bitmap_set_bit (exit_bbs, exit->dest->index);
> >
> >        edge entry = EDGE_PRED (loop_preheader_edge (loop)->src, 0);
> >        do_rpo_vn (fun, entry, exit_bbs);
> >
> >
> >
> >
> >
> 
> --
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461
> Nuernberg,
> Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien
> Moerman;
> HRB 36809 (AG Nuernberg)
  
Richard Biener July 14, 2023, 1:34 p.m. UTC | #3
On Thu, 13 Jul 2023, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <rguenther@suse.de>
> > Sent: Thursday, July 13, 2023 6:31 PM
> > To: Tamar Christina <Tamar.Christina@arm.com>
> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; jlaw@ventanamicro.com
> > Subject: Re: [PATCH 12/19]middle-end: implement loop peeling and IV
> > updates for early break.
> > 
> > On Wed, 28 Jun 2023, Tamar Christina wrote:
> > 
> > > Hi All,
> > >
> > > This patch updates the peeling code to maintain LCSSA during peeling.
> > > The rewrite also naturally takes into account multiple exits and so it didn't
> > > make sense to split them off.
> > >
> > > For the purposes of peeling the only change for multiple exits is that the
> > > secondary exits are all wired to the start of the new loop preheader when
> > doing
> > > epilogue peeling.
> > >
> > > When doing prologue peeling the CFG is kept in tact.
> > >
> > > For both epilogue and prologue peeling we wire through between the two
> > loops any
> > > PHI nodes that escape the first loop into the second loop if flow_loops is
> > > specified.  The reason for this conditionality is because
> > > slpeel_tree_duplicate_loop_to_edge_cfg is used in the compiler in 3 ways:
> > >   - prologue peeling
> > >   - epilogue peeling
> > >   - loop distribution
> > >
> > > for the last case the loops should remain independent, and so not be
> > connected.
> > > Because of this propagation of only used phi nodes get_current_def can be
> > used
> > > to easily find the previous definitions.  However live statements that are
> > > not used inside the loop itself are not propagated (since if unused, the
> > moment
> > > we add the guard in between the two loops the value across the bypass edge
> > can
> > > be wrong if the loop has been peeled.)
> > >
> > > This is dealt with easily enough in find_guard_arg.
> > >
> > > For multiple exits, while we are in LCSSA form, and have a correct DOM tree,
> > the
> > > moment we add the guard block we will change the dominators again.  To
> > deal with
> > > this slpeel_tree_duplicate_loop_to_edge_cfg can optionally return the blocks
> > to
> > > update without having to recompute the list of blocks to update again.
> > >
> > > When multiple exits and doing epilogue peeling we will also temporarily have
> > an
> > > incorrect VUSES chain for the secondary exits as it anticipates the final result
> > > after the VDEFs have been moved.  This will thus be corrected once the code
> > > motion is applied.
> > >
> > > Lastly by doing things this way we can remove the helper functions that
> > > previously did lock step iterations to update things as it went along.
> > >
> > > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> > >
> > > Ok for master?
> > 
> > Not sure if I get through all of this in one go - so be prepared that
> > the rest of the review follows another day.
> 
> No worries, I appreciate the reviews!
> Just giving some quick replies for when you continue.

Continueing.

> > 
> > > Thanks,
> > > Tamar
> > >
> > > gcc/ChangeLog:
> > >
> > > 	* tree-loop-distribution.cc (copy_loop_before): Pass flow_loops =
> > false.
> > > 	* tree-ssa-loop-niter.cc (loop_only_exit_p):  Fix bug when exit==null.
> > > 	* tree-vect-loop-manip.cc (adjust_phi_and_debug_stmts): Add
> > additional
> > > 	assert.
> > > 	(vect_set_loop_condition_normal): Skip modifying loop IV for multiple
> > > 	exits.
> > > 	(slpeel_tree_duplicate_loop_to_edge_cfg): Support multiple exit
> > peeling.
> > > 	(slpeel_can_duplicate_loop_p): Likewise.
> > > 	(vect_update_ivs_after_vectorizer): Don't enter this...
> > > 	(vect_update_ivs_after_early_break): ...but instead enter here.
> > > 	(find_guard_arg): Update for new peeling code.
> > > 	(slpeel_update_phi_nodes_for_loops): Remove.
> > > 	(slpeel_update_phi_nodes_for_guard2): Remove hardcoded edge 0
> > checks.
> > > 	(slpeel_update_phi_nodes_for_lcssa): Remove.
> > > 	(vect_do_peeling): Fix VF for multiple exits and force epilogue.
> > > 	* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Initialize
> > > 	non_break_control_flow and early_breaks.
> > > 	(vect_need_peeling_or_partial_vectors_p): Force partial vector if
> > > 	multiple exits and VLA.
> > > 	(vect_analyze_loop_form): Support inner loop multiple exits.
> > > 	(vect_create_loop_vinfo): Set LOOP_VINFO_EARLY_BREAKS.
> > > 	(vect_create_epilog_for_reduction):  Update live phi nodes.
> > > 	(vectorizable_live_operation): Ignore live operations in vector loop
> > > 	when multiple exits.
> > > 	(vect_transform_loop): Force unrolling for VF loops and multiple exits.
> > > 	* tree-vect-stmts.cc (vect_stmt_relevant_p): Analyze ctrl statements.
> > > 	(vect_mark_stmts_to_be_vectorized): Check for non-exit control flow
> > and
> > > 	analyze gcond params.
> > > 	(vect_analyze_stmt): Support gcond.
> > > 	* tree-vectorizer.cc (pass_vectorize::execute): Support multiple exits
> > > 	in RPO pass.
> > > 	* tree-vectorizer.h (enum vect_def_type): Add vect_early_exit_def.
> > > 	(LOOP_VINFO_EARLY_BREAKS, LOOP_VINFO_GENERAL_CTR_FLOW):
> > New.
> > > 	(loop_vec_info_for_loop): Change to const and static.
> > > 	(is_loop_header_bb_p): Drop assert.
> > > 	(slpeel_can_duplicate_loop_p): Update prototype.
> > > 	(class loop): Add early_breaks and non_break_control_flow.
> > >
> > > --- inline copy of patch --
> > > diff --git a/gcc/tree-loop-distribution.cc b/gcc/tree-loop-distribution.cc
> > > index
> > 97879498db46dd3c34181ae9aa6e5476004dd5b5..d790ce5fffab3aa3dfc40
> > d833a968314a4442b9e 100644
> > > --- a/gcc/tree-loop-distribution.cc
> > > +++ b/gcc/tree-loop-distribution.cc
> > > @@ -948,7 +948,7 @@ copy_loop_before (class loop *loop, bool
> > redirect_lc_phi_defs)
> > >    edge preheader = loop_preheader_edge (loop);
> > >
> > >    initialize_original_copy_tables ();
> > > -  res = slpeel_tree_duplicate_loop_to_edge_cfg (loop, NULL, preheader);
> > > +  res = slpeel_tree_duplicate_loop_to_edge_cfg (loop, NULL, preheader,
> > false);
> > >    gcc_assert (res != NULL);
> > >
> > >    /* When a not last partition is supposed to keep the LC PHIs computed
> > > diff --git a/gcc/tree-ssa-loop-niter.cc b/gcc/tree-ssa-loop-niter.cc
> > > index
> > 5d398b67e68c7076760854119590f18b19c622b6..79686f6c4945b7139ba
> > 377300430c04b7aeefe6c 100644
> > > --- a/gcc/tree-ssa-loop-niter.cc
> > > +++ b/gcc/tree-ssa-loop-niter.cc
> > > @@ -3072,7 +3072,12 @@ loop_only_exit_p (const class loop *loop,
> > basic_block *body, const_edge exit)
> > >    gimple_stmt_iterator bsi;
> > >    unsigned i;
> > >
> > > -  if (exit != single_exit (loop))
> > > +  /* We need to check for alternative exits since exit can be NULL.  */
> > 
> > You mean we pass in exit == NULL in some cases?  I'm not sure what
> > the desired behavior in that case is - can you point out the
> > callers you are fixing here?
> > 
> > I think we should add gcc_assert (exit != nullptr)
> > 
> > >    for (i = 0; i < loop->num_nodes; i++)
> > > diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
> > > index
> > 6b93fb3f9af8f2bbdf5dec28f0009177aa5171ab..550d7f40002cf0b58f8a92
> > 7cb150edd7c2aa9999 100644
> > > --- a/gcc/tree-vect-loop-manip.cc
> > > +++ b/gcc/tree-vect-loop-manip.cc
> > > @@ -252,6 +252,9 @@ adjust_phi_and_debug_stmts (gimple *update_phi,
> > edge e, tree new_def)
> > >  {
> > >    tree orig_def = PHI_ARG_DEF_FROM_EDGE (update_phi, e);
> > >
> > > +  gcc_assert (TREE_CODE (orig_def) != SSA_NAME
> > > +	      || orig_def != new_def);
> > > +
> > >    SET_PHI_ARG_DEF (update_phi, e->dest_idx, new_def);
> > >
> > >    if (MAY_HAVE_DEBUG_BIND_STMTS)
> > > @@ -1292,7 +1295,8 @@ vect_set_loop_condition_normal (loop_vec_info
> > loop_vinfo,
> > >    gsi_insert_before (&loop_cond_gsi, cond_stmt, GSI_SAME_STMT);
> > >
> > >    /* Record the number of latch iterations.  */
> > > -  if (limit == niters)
> > > +  if (limit == niters
> > > +      || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > >      /* Case A: the loop iterates NITERS times.  Subtract one to get the
> > >         latch count.  */
> > >      loop->nb_iterations = fold_build2 (MINUS_EXPR, niters_type, niters,
> > > @@ -1303,7 +1307,13 @@ vect_set_loop_condition_normal
> > (loop_vec_info loop_vinfo,
> > >      loop->nb_iterations = fold_build2 (TRUNC_DIV_EXPR, niters_type,
> > >  				       limit, step);
> > >
> > > -  if (final_iv)
> > > +  /* For multiple exits we've already maintained LCSSA form and handled
> > > +     the scalar iteration update in the code that deals with the merge
> > > +     block and its updated guard.  I could move that code here instead
> > > +     of in vect_update_ivs_after_early_break but I have to still deal
> > > +     with the updates to the counter `i`.  So for now I'll keep them
> > > +     together.  */
> > > +  if (final_iv && !LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > >      {
> > >        gassign *assign;
> > >        edge exit = LOOP_VINFO_IV_EXIT (loop_vinfo);
> > > @@ -1509,11 +1519,19 @@ vec_init_exit_info (class loop *loop)
> > >     on E which is either the entry or exit of LOOP.  If SCALAR_LOOP is
> > >     non-NULL, assume LOOP and SCALAR_LOOP are equivalent and copy the
> > >     basic blocks from SCALAR_LOOP instead of LOOP, but to either the
> > > -   entry or exit of LOOP.  */
> > > +   entry or exit of LOOP.  If FLOW_LOOPS then connect LOOP to
> > SCALAR_LOOP as a
> > > +   continuation.  This is correct for cases where one loop continues from the
> > > +   other like in the vectorizer, but not true for uses in e.g. loop distribution
> > > +   where the loop is duplicated and then modified.
> > > +
> > 
> > but for loop distribution the flow also continues?  I'm not sure what you
> > are refering to here.  Do you by chance have a branch with the patches
> > installed?
> 
> Yup, they're at refs/users/tnfchris/heads/gcc-14-early-break in the repo.
> 
> > 
> > > +   If UPDATED_DOMS is not NULL it is update with the list of basic blocks
> > whoms
> > > +   dominators were updated during the peeling.  */
> > >
> > >  class loop *
> > >  slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop,
> > > -					class loop *scalar_loop, edge e)
> > > +					class loop *scalar_loop, edge e,
> > > +					bool flow_loops,
> > > +					vec<basic_block> *updated_doms)
> > >  {
> > >    class loop *new_loop;
> > >    basic_block *new_bbs, *bbs, *pbbs;
> > > @@ -1602,6 +1620,19 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class
> > loop *loop,
> > >    for (unsigned i = (at_exit ? 0 : 1); i < scalar_loop->num_nodes + 1; i++)
> > >      rename_variables_in_bb (new_bbs[i], duplicate_outer_loop);
> > >
> > > +  /* Rename the exit uses.  */
> > > +  for (edge exit : get_loop_exit_edges (new_loop))
> > > +    for (auto gsi = gsi_start_phis (exit->dest);
> > > +	 !gsi_end_p (gsi); gsi_next (&gsi))
> > > +      {
> > > +	tree orig_def = PHI_ARG_DEF_FROM_EDGE (gsi.phi (), exit);
> > > +	rename_use_op (PHI_ARG_DEF_PTR_FROM_EDGE (gsi.phi (), exit));
> > > +	if (MAY_HAVE_DEBUG_BIND_STMTS)
> > > +	  adjust_debug_stmts (orig_def, PHI_RESULT (gsi.phi ()), exit->dest);
> > > +      }
> > > +
> > > +  /* This condition happens when the loop has been versioned. e.g. due to
> > ifcvt
> > > +     versioning the loop.  */
> > >    if (scalar_loop != loop)
> > >      {
> > >        /* If we copied from SCALAR_LOOP rather than LOOP, SSA_NAMEs from
> > > @@ -1616,28 +1647,106 @@ slpeel_tree_duplicate_loop_to_edge_cfg
> > (class loop *loop,
> > >  						EDGE_SUCC (loop->latch, 0));
> > >      }
> > >
> > > +  vec<edge> alt_exits = loop->vec_loop_alt_exits;
> > 
> > So 'e' is not one of alt_exits, right?  I wonder if we can simply
> > compute the vector from all exits of 'loop' and removing 'e'?
> > 
> > > +  bool multiple_exits_p = !alt_exits.is_empty ();
> > > +  auto_vec<basic_block> doms;
> > > +  class loop *update_loop = NULL;
> > > +
> > >    if (at_exit) /* Add the loop copy at exit.  */
> > >      {
> > > -      if (scalar_loop != loop)
> > > +      if (scalar_loop != loop && new_exit->dest != exit_dest)
> > >  	{
> > > -	  gphi_iterator gsi;
> > >  	  new_exit = redirect_edge_and_branch (new_exit, exit_dest);
> > > +	  flush_pending_stmts (new_exit);
> > > +	}
> > >
> > > -	  for (gsi = gsi_start_phis (exit_dest); !gsi_end_p (gsi);
> > > -	       gsi_next (&gsi))
> > > +      auto loop_exits = get_loop_exit_edges (loop);
> > > +      for (edge exit : loop_exits)
> > > +	redirect_edge_and_branch (exit, new_preheader);
> > > +
> > > +
> > 
> > one line vertical space too much
> > 
> > > +      /* Copy the current loop LC PHI nodes between the original loop exit
> > > +	 block and the new loop header.  This allows us to later split the
> > > +	 preheader block and still find the right LC nodes.  */
> > > +      edge latch_new = single_succ_edge (new_preheader);
> > > +      edge latch_old = loop_latch_edge (loop);
> > > +      hash_set <tree> lcssa_vars;
> > > +      for (auto gsi_from = gsi_start_phis (latch_old->dest),
> > 
> > so that's loop->header (and makes it more clear which PHI nodes you are
> > looking at)


So I'm now in a debug session - I think that conceptually it would
make more sense to create the LC PHI nodes that are present at the
old exit destination in the new preheader _before_ you redirect them
above and then flush_pending_stmts after redirecting, that should deal
with the copying.

Now, your copying actually iterates over all PHIs in the loop _header_,
so it doesn't actually copy LC PHI nodes but possibly creates additional
ones.  The intent does seem to do this since you want a different value
on those edges for all but the main loop exit.  But then the 
overall comments should better reflect that and maybe you should
do what I suggested anyway and have this loop alter only the alternate
exit LC PHIs?

If you don't flush_pending_stmts on an edge after redirecting you
should call redirect_edge_var_map_clear (edge), otherwise the stale
info might break things later.

> > > +	   gsi_to = gsi_start_phis (latch_new->dest);
> > 
> > likewise new_loop->header
> > 
> > > +	   flow_loops && !gsi_end_p (gsi_from) && !gsi_end_p (gsi_to);
> > > +	   gsi_next (&gsi_from), gsi_next (&gsi_to))
> > > +	{
> > > +	  gimple *from_phi = gsi_stmt (gsi_from);
> > > +	  gimple *to_phi = gsi_stmt (gsi_to);
> > > +	  tree new_arg = PHI_ARG_DEF_FROM_EDGE (from_phi, latch_old);
> > > +	  /* In all cases, even in early break situations we're only
> > > +	     interested in the number of fully executed loop iters.  As such
> > > +	     we discard any partially done iteration.  So we simply propagate
> > > +	     the phi nodes from the latch to the merge block.  */
> > > +	  tree new_res = copy_ssa_name (gimple_phi_result (from_phi));
> > > +	  gphi *lcssa_phi = create_phi_node (new_res, e->dest);
> > > +
> > > +	  lcssa_vars.add (new_arg);
> > > +
> > > +	  /* Main loop exit should use the final iter value.  */
> > > +	  add_phi_arg (lcssa_phi, new_arg, loop->vec_loop_iv,
> > UNKNOWN_LOCATION);
> > 
> > above you are creating the PHI node at e->dest but here add the PHI arg to
> > loop->vec_loop_iv - that's 'e' here, no?  Consistency makes it easier
> > to follow.  I _think_ this code doesn't need to know about the "special"
> > edge.
> > 
> > > +
> > > +	  /* All other exits use the previous iters.  */
> > > +	  for (edge e : alt_exits)
> > > +	    add_phi_arg (lcssa_phi, gimple_phi_result (from_phi), e,
> > > +			 UNKNOWN_LOCATION);
> > > +
> > > +	  adjust_phi_and_debug_stmts (to_phi, latch_new, new_res);
> > > +	}
> > > +
> > > +      /* Copy over any live SSA vars that may not have been materialized in
> > the
> > > +	 loops themselves but would be in the exit block.  However when the
> > live
> > > +	 value is not used inside the loop then we don't need to do this,  if we
> > do
> > > +	 then when we split the guard block the branch edge can end up
> > containing the
> > > +	 wrong reference,  particularly if it shares an edge with something that
> > has
> > > +	 bypassed the loop.  This is not something peeling can check so we
> > need to
> > > +	 anticipate the usage of the live variable here.  */
> > > +      auto exit_map = redirect_edge_var_map_vector (exit);
> > 
> > Hmm, did I use that in my attemt to refactor things? ...
> 
> Indeed, I didn't always use it, but found it was the best way to deal with the
> variables being live in various BB after the loop.

As said this whole piece of code is possibly more complicated than 
necessary.  First copying/creating the PHI nodes that are present
at the exit (the old LC PHI nodes), then redirecting edges and flushing
stmts should deal with half of this.

> > 
> > > +      if (exit_map)
> > > +        for (auto vm : exit_map)
> > > +	{
> > > +	  if (lcssa_vars.contains (vm.def)
> > > +	      || TREE_CODE (vm.def) != SSA_NAME)
> > 
> > the latter check is cheaper so it should come first
> > 
> > > +	    continue;
> > > +
> > > +	  imm_use_iterator imm_iter;
> > > +	  use_operand_p use_p;
> > > +	  bool use_in_loop = false;
> > > +
> > > +	  FOR_EACH_IMM_USE_FAST (use_p, imm_iter, vm.def)
> > >  	    {
> > > -	      gphi *phi = gsi.phi ();
> > > -	      tree orig_arg = PHI_ARG_DEF_FROM_EDGE (phi, e);
> > > -	      location_t orig_locus
> > > -		= gimple_phi_arg_location_from_edge (phi, e);
> > > +	      basic_block bb = gimple_bb (USE_STMT (use_p));
> > > +	      if (flow_bb_inside_loop_p (loop, bb)
> > > +		  && !gimple_vuse (USE_STMT (use_p)))

what's this gimple_vuse check?  I see now for vect-early-break_17.c this
code triggers and ignores

  vect_b[i_18] = _2;

> > > +		{
> > > +		  use_in_loop = true;
> > > +		  break;
> > > +		}
> > > +	    }
> > >
> > > -	      add_phi_arg (phi, orig_arg, new_exit, orig_locus);
> > > +	  if (!use_in_loop)
> > > +	    {
> > > +	       /* Do a final check to see if it's perhaps defined in the loop.  This
> > > +		  mirrors the relevancy analysis's used_outside_scope.  */
> > > +	      gimple *stmt = SSA_NAME_DEF_STMT (vm.def);
> > > +	      if (!stmt || !flow_bb_inside_loop_p (loop, gimple_bb (stmt)))
> > > +		continue;
> > >  	    }

since the def was on a LC PHI the def should always be defined inside the 
loop.

> > > +
> > > +	  tree new_res = copy_ssa_name (vm.result);
> > > +	  gphi *lcssa_phi = create_phi_node (new_res, e->dest);
> > > +	  for (edge exit : loop_exits)
> > > +	     add_phi_arg (lcssa_phi, vm.def, exit, vm.locus);
> > 
> > not sure what you are doing above - I guess I have to play with it
> > in a debug session.
> 
> Yeah if you comment it out one of the testcases should fail.

using new_preheader instead of e->dest would make things clearer.

You are now adding the same arg to every exit (you've just queried the
main exit redirect_edge_var_map_vector).

OK, so I think I understand what you're doing.  If I understand
correctly we know that when we exit the main loop via one of the
early exits we are definitely going to enter the epilog but when
we take the main exit we might not.

Looking at the CFG we create currently this isn't reflected and
this complicates this PHI node updating.  What I'd try to do
is leave redirecting the alternate exits until after
slpeel_tree_duplicate_loop_to_edge_cfg finished which probably
means leaving it almost unchanged besides the LC SSA maintaining
changes.  After that for the multi-exit case split the
epilog preheader edge and redirect all the alternate exits to the
new preheader.  So the CFG becomes

                 <original loop>
                /      |
               /    <main exit w/ original LC PHI>
              /      if (epilog)
   alt exits /        /  \
            /        /    loop around
            |       /    
           preheader with "header" PHIs
              |
          <epilog>

note you need the header PHIs also on the main exit path but you
only need the loop end PHIs there.

It seems so that at least currently the order of things makes
them more complicated than necessary.

> > 
> > >  	}
> > > -      redirect_edge_and_branch_force (e, new_preheader);
> > > -      flush_pending_stmts (e);
> > > +
> > >        set_immediate_dominator (CDI_DOMINATORS, new_preheader, e->src);
> > > -      if (was_imm_dom || duplicate_outer_loop)
> > > +
> > > +      if ((was_imm_dom || duplicate_outer_loop) && !multiple_exits_p)
> > >  	set_immediate_dominator (CDI_DOMINATORS, exit_dest, new_exit-
> > >src);
> > >
> > >        /* And remove the non-necessary forwarder again.  Keep the other
> > > @@ -1647,9 +1756,42 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class
> > loop *loop,
> > >        delete_basic_block (preheader);
> > >        set_immediate_dominator (CDI_DOMINATORS, scalar_loop->header,
> > >  			       loop_preheader_edge (scalar_loop)->src);
> > > +
> > > +      /* Finally after wiring the new epilogue we need to update its main exit
> > > +	 to the original function exit we recorded.  Other exits are already
> > > +	 correct.  */
> > > +      if (multiple_exits_p)
> > > +	{
> > > +	  for (edge e : get_loop_exit_edges (loop))
> > > +	    doms.safe_push (e->dest);
> > > +	  update_loop = new_loop;
> > > +	  doms.safe_push (exit_dest);
> > > +
> > > +	  /* Likely a fall-through edge, so update if needed.  */
> > > +	  if (single_succ_p (exit_dest))
> > > +	    doms.safe_push (single_succ (exit_dest));
> > > +	}
> > >      }
> > >    else /* Add the copy at entry.  */
> > >      {
> > > +      /* Copy the current loop LC PHI nodes between the original loop exit
> > > +	 block and the new loop header.  This allows us to later split the
> > > +	 preheader block and still find the right LC nodes.  */
> > > +      edge old_latch_loop = loop_latch_edge (loop);
> > > +      edge old_latch_init = loop_preheader_edge (loop);
> > > +      edge new_latch_loop = loop_latch_edge (new_loop);
> > > +      edge new_latch_init = loop_preheader_edge (new_loop);
> > > +      for (auto gsi_from = gsi_start_phis (new_latch_init->dest),
> > 
> > see above
> > 
> > > +	   gsi_to = gsi_start_phis (old_latch_loop->dest);
> > > +	   flow_loops && !gsi_end_p (gsi_from) && !gsi_end_p (gsi_to);
> > > +	   gsi_next (&gsi_from), gsi_next (&gsi_to))
> > > +	{
> > > +	  gimple *from_phi = gsi_stmt (gsi_from);
> > > +	  gimple *to_phi = gsi_stmt (gsi_to);
> > > +	  tree new_arg = PHI_ARG_DEF_FROM_EDGE (from_phi,
> > new_latch_loop);
> > > +	  adjust_phi_and_debug_stmts (to_phi, old_latch_init, new_arg);
> > > +	}
> > > +
> > >        if (scalar_loop != loop)
> > >  	{
> > >  	  /* Remove the non-necessary forwarder of scalar_loop again.  */
> > > @@ -1677,31 +1819,36 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class
> > loop *loop,
> > >        delete_basic_block (new_preheader);
> > >        set_immediate_dominator (CDI_DOMINATORS, new_loop->header,
> > >  			       loop_preheader_edge (new_loop)->src);
> > > +
> > > +      if (multiple_exits_p)
> > > +	update_loop = loop;
> > >      }
> > >
> > > -  if (scalar_loop != loop)
> > > +  if (multiple_exits_p)
> > >      {
> > > -      /* Update new_loop->header PHIs, so that on the preheader
> > > -	 edge they are the ones from loop rather than scalar_loop.  */
> > > -      gphi_iterator gsi_orig, gsi_new;
> > > -      edge orig_e = loop_preheader_edge (loop);
> > > -      edge new_e = loop_preheader_edge (new_loop);
> > > -
> > > -      for (gsi_orig = gsi_start_phis (loop->header),
> > > -	   gsi_new = gsi_start_phis (new_loop->header);
> > > -	   !gsi_end_p (gsi_orig) && !gsi_end_p (gsi_new);
> > > -	   gsi_next (&gsi_orig), gsi_next (&gsi_new))
> > > +      for (edge e : get_loop_exit_edges (update_loop))
> > >  	{
> > > -	  gphi *orig_phi = gsi_orig.phi ();
> > > -	  gphi *new_phi = gsi_new.phi ();
> > > -	  tree orig_arg = PHI_ARG_DEF_FROM_EDGE (orig_phi, orig_e);
> > > -	  location_t orig_locus
> > > -	    = gimple_phi_arg_location_from_edge (orig_phi, orig_e);
> > > -
> > > -	  add_phi_arg (new_phi, orig_arg, new_e, orig_locus);
> > > +	  edge ex;
> > > +	  edge_iterator ei;
> > > +	  FOR_EACH_EDGE (ex, ei, e->dest->succs)
> > > +	    {
> > > +	      /* Find the first non-fallthrough block as fall-throughs can't
> > > +		 dominate other blocks.  */
> > > +	      while ((ex->flags & EDGE_FALLTHRU)

For the prologue peeling any early exit we take would skip all other
loops so we can simply leave them and their LC PHI nodes in place.
We need extra PHIs only on the path to the main vector loop.  I
think the comment isn't accurately reflecting what we do.  In
fact we do not add any LC PHI nodes here but simply adjust the
main loop header PHI arguments?

> > I don't think EDGE_FALLTHRU is set correctly, what's wrong with
> > just using single_succ_p here?  A fallthru edge src dominates the
> > fallthru edge dest, so the sentence above doesn't make sense.
> 
> I wanted to say, that the immediate dominator of a block is never
> an fall through block.  At least from what I understood from how
> the dominators are calculated in the code, though may have missed
> something.

 BB1
  |
 BB2
  |
 BB3

here the immediate dominator of BB3 is BB2 and that of BB2 is BB1.

> > 
> > > +		     && single_succ_p (ex->dest))
> > > +		{
> > > +		  doms.safe_push (ex->dest);
> > > +		  ex = single_succ_edge (ex->dest);
> > > +		}
> > > +	      doms.safe_push (ex->dest);
> > > +	    }
> > > +	  doms.safe_push (e->dest);
> > >  	}
> > > -    }
> > >
> > > +      iterate_fix_dominators (CDI_DOMINATORS, doms, false);
> > > +      if (updated_doms)
> > > +	updated_doms->safe_splice (doms);
> > > +    }
> > >    free (new_bbs);
> > >    free (bbs);
> > >
> > > @@ -1777,6 +1924,9 @@ slpeel_can_duplicate_loop_p (const
> > loop_vec_info loop_vinfo, const_edge e)
> > >    gimple_stmt_iterator loop_exit_gsi = gsi_last_bb (exit_e->src);
> > >    unsigned int num_bb = loop->inner? 5 : 2;
> > >
> > > +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > +    num_bb += LOOP_VINFO_ALT_EXITS (loop_vinfo).length ();
> > > +
> > 
> > I think checking the number of BBs is odd, I don't remember anything
> > in slpeel is specifically tied to that?  I think we can simply drop
> > this or do you remember anything that would depend on ->num_nodes
> > being only exactly 5 or 2?
> 
> Never actually seemed to require it, but they're used as some check to
> see if there are unexpected control flow in the loop.
> 
> i.e. this would say no if you have an if statement in the loop that wasn't
> converted.  The other part of this and the accompanying explanation is in
> vect_analyze_loop_form.  In the patch series I had to remove the hard
> num_nodes == 2 check from there because number of nodes restricted
> things too much.  If you have an empty fall through block, which seems to
> happen often between the main exit and the latch block then we'd not
> vectorize.
> 
> So instead I now rejects loops after analyzing the gcond.  So think this check
> can go/needs to be different.

Lets remove it from this function then.

> > 
> > >    /* All loops have an outer scope; the only case loop->outer is NULL is for
> > >       the function itself.  */
> > >    if (!loop_outer (loop)
> > > @@ -2044,6 +2194,11 @@ vect_update_ivs_after_vectorizer
> > (loop_vec_info loop_vinfo,
> > >    class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > >    basic_block update_bb = update_e->dest;
> > >
> > > +  /* For early exits we'll update the IVs in
> > > +     vect_update_ivs_after_early_break.  */
> > > +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > +    return;
> > > +
> > >    basic_block exit_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
> > >
> > >    /* Make sure there exists a single-predecessor exit bb:  */
> > > @@ -2131,6 +2286,208 @@ vect_update_ivs_after_vectorizer
> > (loop_vec_info loop_vinfo,
> > >        /* Fix phi expressions in the successor bb.  */
> > >        adjust_phi_and_debug_stmts (phi1, update_e, ni_name);
> > >      }
> > > +  return;
> > 
> > we don't usually place a return at the end of void functions
> > 
> > > +}
> > > +
> > > +/*   Function vect_update_ivs_after_early_break.
> > > +
> > > +     "Advance" the induction variables of LOOP to the value they should take
> > > +     after the execution of LOOP.  This is currently necessary because the
> > > +     vectorizer does not handle induction variables that are used after the
> > > +     loop.  Such a situation occurs when the last iterations of LOOP are
> > > +     peeled, because of the early exit.  With an early exit we always peel the
> > > +     loop.
> > > +
> > > +     Input:
> > > +     - LOOP_VINFO - a loop info structure for the loop that is going to be
> > > +		    vectorized. The last few iterations of LOOP were peeled.
> > > +     - LOOP - a loop that is going to be vectorized. The last few iterations
> > > +	      of LOOP were peeled.
> > > +     - VF - The loop vectorization factor.
> > > +     - NITERS_ORIG - the number of iterations that LOOP executes (before it is
> > > +		     vectorized). i.e, the number of times the ivs should be
> > > +		     bumped.
> > > +     - NITERS_VECTOR - The number of iterations that the vector LOOP
> > executes.
> > > +     - UPDATE_E - a successor edge of LOOP->exit that is on the (only) path
> > > +		  coming out from LOOP on which there are uses of the LOOP
> > ivs
> > > +		  (this is the path from LOOP->exit to epilog_loop->preheader).
> > > +
> > > +		  The new definitions of the ivs are placed in LOOP->exit.
> > > +		  The phi args associated with the edge UPDATE_E in the bb
> > > +		  UPDATE_E->dest are updated accordingly.
> > > +
> > > +     Output:
> > > +       - If available, the LCSSA phi node for the loop IV temp.
> > > +
> > > +     Assumption 1: Like the rest of the vectorizer, this function assumes
> > > +     a single loop exit that has a single predecessor.
> > > +
> > > +     Assumption 2: The phi nodes in the LOOP header and in update_bb are
> > > +     organized in the same order.
> > > +
> > > +     Assumption 3: The access function of the ivs is simple enough (see
> > > +     vect_can_advance_ivs_p).  This assumption will be relaxed in the future.
> > > +
> > > +     Assumption 4: Exactly one of the successors of LOOP exit-bb is on a path
> > > +     coming out of LOOP on which the ivs of LOOP are used (this is the path
> > > +     that leads to the epilog loop; other paths skip the epilog loop).  This
> > > +     path starts with the edge UPDATE_E, and its destination (denoted
> > update_bb)
> > > +     needs to have its phis updated.
> > > + */
> > > +
> > > +static tree
> > > +vect_update_ivs_after_early_break (loop_vec_info loop_vinfo, class loop *
> > epilog,
> > > +				   poly_int64 vf, tree niters_orig,
> > > +				   tree niters_vector, edge update_e)
> > > +{
> > > +  if (!LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > +    return NULL;
> > > +
> > > +  gphi_iterator gsi, gsi1;
> > > +  tree ni_name, ivtmp = NULL;
> > > +  basic_block update_bb = update_e->dest;
> > > +  vec<edge> alt_exits = LOOP_VINFO_ALT_EXITS (loop_vinfo);
> > > +  edge loop_iv = LOOP_VINFO_IV_EXIT (loop_vinfo);
> > > +  basic_block exit_bb = loop_iv->dest;
> > > +  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > > +  gcond *cond = LOOP_VINFO_LOOP_IV_COND (loop_vinfo);
> > > +
> > > +  gcc_assert (cond);
> > > +
> > > +  for (gsi = gsi_start_phis (loop->header), gsi1 = gsi_start_phis (update_bb);
> > > +       !gsi_end_p (gsi) && !gsi_end_p (gsi1);
> > > +       gsi_next (&gsi), gsi_next (&gsi1))
> > > +    {
> > > +      tree init_expr, final_expr, step_expr;
> > > +      tree type;
> > > +      tree var, ni, off;
> > > +      gimple_stmt_iterator last_gsi;
> > > +
> > > +      gphi *phi = gsi1.phi ();
> > > +      tree phi_ssa = PHI_ARG_DEF_FROM_EDGE (phi, loop_preheader_edge
> > (epilog));
> > 
> > I'm confused about the setup.  update_bb looks like the block with the
> > loop-closed PHI nodes of 'loop' and the exit (update_e)?  How does
> > loop_preheader_edge (epilog) come into play here?  That would feed into
> > epilog->header PHIs?!
> 
> We can't query the type of the phis in the block with the LC PHI nodes, so the
> Typical pattern seems to be that we iterate over a block that's part of the loop
> and that would have the PHIs in the same order, just so we can get to the
> stmt_vec_info.
> 
> > 
> > It would be nice to name 'gsi[1]', 'update_e' and 'update_bb' in a
> > better way?  Is update_bb really epilog->header?!
> > 
> > We're missing checking in PHI_ARG_DEF_FROM_EDGE, namely that
> > E->dest == gimple_bb (PHI) - we're just using E->dest_idx there
> > which "works" even for totally unrelated edges.
> > 
> > > +      gphi *phi1 = dyn_cast <gphi *> (SSA_NAME_DEF_STMT (phi_ssa));
> > > +      if (!phi1)
> > 
> > shouldn't that be an assert?
> > 
> > > +	continue;
> > > +      stmt_vec_info phi_info = loop_vinfo->lookup_stmt (gsi.phi ());
> > > +      if (dump_enabled_p ())
> > > +	dump_printf_loc (MSG_NOTE, vect_location,
> > > +			 "vect_update_ivs_after_early_break: phi: %G",
> > > +			 (gimple *)phi);
> > > +
> > > +      /* Skip reduction and virtual phis.  */
> > > +      if (!iv_phi_p (phi_info))
> > > +	{
> > > +	  if (dump_enabled_p ())
> > > +	    dump_printf_loc (MSG_NOTE, vect_location,
> > > +			     "reduc or virtual phi. skip.\n");
> > > +	  continue;
> > > +	}
> > > +
> > > +      /* For multiple exits where we handle early exits we need to carry on
> > > +	 with the previous IV as loop iteration was not done because we exited
> > > +	 early.  As such just grab the original IV.  */
> > > +      phi_ssa = PHI_ARG_DEF_FROM_EDGE (gsi.phi (), loop_latch_edge
> > (loop));
> > 
> > but this should be taken care of by LC SSA?
> 
> It is, the comment is probably missing details, this part just scales the counter
> from VF to scalar counts.  It's just a reminder that this scaling is done differently
> from normal single exit vectorization.
>
> > 
> > OK, have to continue tomorrow from here.
> 
> Cheers, Thank you!
> 
> Tamar
> 
> > 
> > Richard.
> > 
> > > +      if (gimple_cond_lhs (cond) != phi_ssa
> > > +	  && gimple_cond_rhs (cond) != phi_ssa)

so this is a way to avoid touching the main IV?  Looks a bit fragile to 
me.  Hmm, we're iterating over the main loop header PHIs here?
Can't you check, say, the relevancy of the PHI node instead?  Though
it might also be used as induction.  Can't it be used as alternate
exit like

  for (i)
   {
     if (i & bit)
       break;
   }

and would we need to adjust 'i' then?

> > > +	{
> > > +	  type = TREE_TYPE (gimple_phi_result (phi));
> > > +	  step_expr = STMT_VINFO_LOOP_PHI_EVOLUTION_PART (phi_info);
> > > +	  step_expr = unshare_expr (step_expr);
> > > +
> > > +	  /* We previously generated the new merged phi in the same BB as
> > the
> > > +	     guard.  So use that to perform the scaling on rather than the
> > > +	     normal loop phi which don't take the early breaks into account.  */
> > > +	  final_expr = gimple_phi_result (phi1);
> > > +	  init_expr = PHI_ARG_DEF_FROM_EDGE (gsi.phi (),
> > loop_preheader_edge (loop));
> > > +
> > > +	  tree stype = TREE_TYPE (step_expr);
> > > +	  /* For early break the final loop IV is:
> > > +	     init + (final - init) * vf which takes into account peeling
> > > +	     values and non-single steps.  */
> > > +	  off = fold_build2 (MINUS_EXPR, stype,
> > > +			     fold_convert (stype, final_expr),
> > > +			     fold_convert (stype, init_expr));
> > > +	  /* Now adjust for VF to get the final iteration value.  */
> > > +	  off = fold_build2 (MULT_EXPR, stype, off, build_int_cst (stype, vf));
> > > +
> > > +	  /* Adjust the value with the offset.  */
> > > +	  if (POINTER_TYPE_P (type))
> > > +	    ni = fold_build_pointer_plus (init_expr, off);
> > > +	  else
> > > +	    ni = fold_convert (type,
> > > +			       fold_build2 (PLUS_EXPR, stype,
> > > +					    fold_convert (stype, init_expr),
> > > +					    off));
> > > +	  var = create_tmp_var (type, "tmp");

so how does the non-early break code deal with updating inductions?
And how do you avoid altering this when we flow in from the normal
exit?  That is, you are updating the value in the epilog loop
header but don't you need to instead update the value only on
the alternate exit edges from the main loop (and keep the not
updated value on the main exit edge)?

> > > +	  last_gsi = gsi_last_bb (exit_bb);
> > > +	  gimple_seq new_stmts = NULL;
> > > +	  ni_name = force_gimple_operand (ni, &new_stmts, false, var);
> > > +	  /* Exit_bb shouldn't be empty.  */
> > > +	  if (!gsi_end_p (last_gsi))
> > > +	    gsi_insert_seq_after (&last_gsi, new_stmts, GSI_SAME_STMT);
> > > +	  else
> > > +	    gsi_insert_seq_before (&last_gsi, new_stmts, GSI_SAME_STMT);
> > > +
> > > +	  /* Fix phi expressions in the successor bb.  */
> > > +	  adjust_phi_and_debug_stmts (phi, update_e, ni_name);
> > > +	}
> > > +      else
> > > +	{
> > > +	  type = TREE_TYPE (gimple_phi_result (phi));
> > > +	  step_expr = STMT_VINFO_LOOP_PHI_EVOLUTION_PART (phi_info);
> > > +	  step_expr = unshare_expr (step_expr);
> > > +
> > > +	  /* We previously generated the new merged phi in the same BB as
> > the
> > > +	     guard.  So use that to perform the scaling on rather than the
> > > +	     normal loop phi which don't take the early breaks into account.  */
> > > +	  init_expr = PHI_ARG_DEF_FROM_EDGE (phi1, loop_preheader_edge
> > (loop));
> > > +	  tree stype = TREE_TYPE (step_expr);
> > > +
> > > +	  if (vf.is_constant ())
> > > +	    {
> > > +	      ni = fold_build2 (MULT_EXPR, stype,
> > > +				fold_convert (stype,
> > > +					      niters_vector),
> > > +				build_int_cst (stype, vf));
> > > +
> > > +	      ni = fold_build2 (MINUS_EXPR, stype,
> > > +				fold_convert (stype,
> > > +					      niters_orig),
> > > +				fold_convert (stype, ni));
> > > +	    }
> > > +	  else
> > > +	    /* If the loop's VF isn't constant then the loop must have been
> > > +	       masked, so at the end of the loop we know we have finished
> > > +	       the entire loop and found nothing.  */
> > > +	    ni = build_zero_cst (stype);
> > > +
> > > +	  ni = fold_convert (type, ni);
> > > +	  /* We don't support variable n in this version yet.  */
> > > +	  gcc_assert (TREE_CODE (ni) == INTEGER_CST);
> > > +
> > > +	  var = create_tmp_var (type, "tmp");
> > > +
> > > +	  last_gsi = gsi_last_bb (exit_bb);
> > > +	  gimple_seq new_stmts = NULL;
> > > +	  ni_name = force_gimple_operand (ni, &new_stmts, false, var);
> > > +	  /* Exit_bb shouldn't be empty.  */
> > > +	  if (!gsi_end_p (last_gsi))
> > > +	    gsi_insert_seq_after (&last_gsi, new_stmts, GSI_SAME_STMT);
> > > +	  else
> > > +	    gsi_insert_seq_before (&last_gsi, new_stmts, GSI_SAME_STMT);
> > > +
> > > +	  adjust_phi_and_debug_stmts (phi1, loop_iv, ni_name);
> > > +
> > > +	  for (edge exit : alt_exits)
> > > +	    adjust_phi_and_debug_stmts (phi1, exit,
> > > +					build_int_cst (TREE_TYPE (step_expr),
> > > +						       vf));
> > > +	  ivtmp = gimple_phi_result (phi1);
> > > +	}
> > > +    }
> > > +
> > > +  return ivtmp;
> > >  }
> > >
> > >  /* Return a gimple value containing the misalignment (measured in vector
> > > @@ -2632,137 +2989,34 @@ vect_gen_vector_loop_niters_mult_vf
> > (loop_vec_info loop_vinfo,
> > >
> > >  /* LCSSA_PHI is a lcssa phi of EPILOG loop which is copied from LOOP,
> > >     this function searches for the corresponding lcssa phi node in exit
> > > -   bb of LOOP.  If it is found, return the phi result; otherwise return
> > > -   NULL.  */
> > > +   bb of LOOP following the LCSSA_EDGE to the exit node.  If it is found,
> > > +   return the phi result; otherwise return NULL.  */
> > >
> > >  static tree
> > >  find_guard_arg (class loop *loop, class loop *epilog ATTRIBUTE_UNUSED,
> > > -		gphi *lcssa_phi)
> > > +		gphi *lcssa_phi, int lcssa_edge = 0)
> > >  {
> > >    gphi_iterator gsi;
> > >    edge e = loop->vec_loop_iv;
> > >
> > > -  gcc_assert (single_pred_p (e->dest));
> > >    for (gsi = gsi_start_phis (e->dest); !gsi_end_p (gsi); gsi_next (&gsi))
> > >      {
> > >        gphi *phi = gsi.phi ();
> > > -      if (operand_equal_p (PHI_ARG_DEF (phi, 0),
> > > -			   PHI_ARG_DEF (lcssa_phi, 0), 0))
> > > -	return PHI_RESULT (phi);
> > > -    }
> > > -  return NULL_TREE;
> > > -}
> > > -
> > > -/* Function slpeel_tree_duplicate_loop_to_edge_cfg duplciates
> > FIRST/SECOND
> > > -   from SECOND/FIRST and puts it at the original loop's preheader/exit
> > > -   edge, the two loops are arranged as below:
> > > -
> > > -       preheader_a:
> > > -     first_loop:
> > > -       header_a:
> > > -	 i_1 = PHI<i_0, i_2>;
> > > -	 ...
> > > -	 i_2 = i_1 + 1;
> > > -	 if (cond_a)
> > > -	   goto latch_a;
> > > -	 else
> > > -	   goto between_bb;
> > > -       latch_a:
> > > -	 goto header_a;
> > > -
> > > -       between_bb:
> > > -	 ;; i_x = PHI<i_2>;   ;; LCSSA phi node to be created for FIRST,
> > > -
> > > -     second_loop:
> > > -       header_b:
> > > -	 i_3 = PHI<i_0, i_4>; ;; Use of i_0 to be replaced with i_x,
> > > -				 or with i_2 if no LCSSA phi is created
> > > -				 under condition of
> > CREATE_LCSSA_FOR_IV_PHIS.
> > > -	 ...
> > > -	 i_4 = i_3 + 1;
> > > -	 if (cond_b)
> > > -	   goto latch_b;
> > > -	 else
> > > -	   goto exit_bb;
> > > -       latch_b:
> > > -	 goto header_b;
> > > -
> > > -       exit_bb:
> > > -
> > > -   This function creates loop closed SSA for the first loop; update the
> > > -   second loop's PHI nodes by replacing argument on incoming edge with the
> > > -   result of newly created lcssa PHI nodes.  IF CREATE_LCSSA_FOR_IV_PHIS
> > > -   is false, Loop closed ssa phis will only be created for non-iv phis for
> > > -   the first loop.
> > > -
> > > -   This function assumes exit bb of the first loop is preheader bb of the
> > > -   second loop, i.e, between_bb in the example code.  With PHIs updated,
> > > -   the second loop will execute rest iterations of the first.  */
> > > -
> > > -static void
> > > -slpeel_update_phi_nodes_for_loops (loop_vec_info loop_vinfo,
> > > -				   class loop *first, class loop *second,
> > > -				   bool create_lcssa_for_iv_phis)
> > > -{
> > > -  gphi_iterator gsi_update, gsi_orig;
> > > -  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > > -
> > > -  edge first_latch_e = EDGE_SUCC (first->latch, 0);
> > > -  edge second_preheader_e = loop_preheader_edge (second);
> > > -  basic_block between_bb = single_exit (first)->dest;
> > > -
> > > -  gcc_assert (between_bb == second_preheader_e->src);
> > > -  gcc_assert (single_pred_p (between_bb) && single_succ_p (between_bb));
> > > -  /* Either the first loop or the second is the loop to be vectorized.  */
> > > -  gcc_assert (loop == first || loop == second);
> > > -
> > > -  for (gsi_orig = gsi_start_phis (first->header),
> > > -       gsi_update = gsi_start_phis (second->header);
> > > -       !gsi_end_p (gsi_orig) && !gsi_end_p (gsi_update);
> > > -       gsi_next (&gsi_orig), gsi_next (&gsi_update))
> > > -    {
> > > -      gphi *orig_phi = gsi_orig.phi ();
> > > -      gphi *update_phi = gsi_update.phi ();
> > > -
> > > -      tree arg = PHI_ARG_DEF_FROM_EDGE (orig_phi, first_latch_e);
> > > -      /* Generate lcssa PHI node for the first loop.  */
> > > -      gphi *vect_phi = (loop == first) ? orig_phi : update_phi;
> > > -      stmt_vec_info vect_phi_info = loop_vinfo->lookup_stmt (vect_phi);
> > > -      if (create_lcssa_for_iv_phis || !iv_phi_p (vect_phi_info))
> > > +      /* Nested loops with multiple exits can have different no# phi node
> > > +	 arguments between the main loop and epilog as epilog falls to the
> > > +	 second loop.  */
> > > +      if (gimple_phi_num_args (phi) > e->dest_idx)
> > >  	{
> > > -	  tree new_res = copy_ssa_name (PHI_RESULT (orig_phi));
> > > -	  gphi *lcssa_phi = create_phi_node (new_res, between_bb);
> > > -	  add_phi_arg (lcssa_phi, arg, single_exit (first),
> > UNKNOWN_LOCATION);
> > > -	  arg = new_res;
> > > -	}
> > > -
> > > -      /* Update PHI node in the second loop by replacing arg on the loop's
> > > -	 incoming edge.  */
> > > -      adjust_phi_and_debug_stmts (update_phi, second_preheader_e, arg);
> > > -    }
> > > -
> > > -  /* For epilogue peeling we have to make sure to copy all LC PHIs
> > > -     for correct vectorization of live stmts.  */
> > > -  if (loop == first)
> > > -    {
> > > -      basic_block orig_exit = single_exit (second)->dest;
> > > -      for (gsi_orig = gsi_start_phis (orig_exit);
> > > -	   !gsi_end_p (gsi_orig); gsi_next (&gsi_orig))
> > > -	{
> > > -	  gphi *orig_phi = gsi_orig.phi ();
> > > -	  tree orig_arg = PHI_ARG_DEF (orig_phi, 0);
> > > -	  if (TREE_CODE (orig_arg) != SSA_NAME || virtual_operand_p
> > (orig_arg))
> > > -	    continue;
> > > -
> > > -	  /* Already created in the above loop.   */
> > > -	  if (find_guard_arg (first, second, orig_phi))
> > > +	  tree var = PHI_ARG_DEF (phi, e->dest_idx);
> > > +	  if (TREE_CODE (var) != SSA_NAME)
> > >  	    continue;
> > >
> > > -	  tree new_res = copy_ssa_name (orig_arg);
> > > -	  gphi *lcphi = create_phi_node (new_res, between_bb);
> > > -	  add_phi_arg (lcphi, orig_arg, single_exit (first),
> > UNKNOWN_LOCATION);
> > > +	  if (operand_equal_p (get_current_def (var),
> > > +			       PHI_ARG_DEF (lcssa_phi, lcssa_edge), 0))
> > > +	    return PHI_RESULT (phi);
> > >  	}
> > >      }
> > > +  return NULL_TREE;
> > >  }
> > >
> > >  /* Function slpeel_add_loop_guard adds guard skipping from the beginning
> > > @@ -2910,13 +3164,11 @@ slpeel_update_phi_nodes_for_guard2 (class
> > loop *loop, class loop *epilog,
> > >    gcc_assert (single_succ_p (merge_bb));
> > >    edge e = single_succ_edge (merge_bb);
> > >    basic_block exit_bb = e->dest;
> > > -  gcc_assert (single_pred_p (exit_bb));
> > > -  gcc_assert (single_pred (exit_bb) == single_exit (epilog)->dest);
> > >
> > >    for (gsi = gsi_start_phis (exit_bb); !gsi_end_p (gsi); gsi_next (&gsi))
> > >      {
> > >        gphi *update_phi = gsi.phi ();
> > > -      tree old_arg = PHI_ARG_DEF (update_phi, 0);
> > > +      tree old_arg = PHI_ARG_DEF (update_phi, e->dest_idx);
> > >
> > >        tree merge_arg = NULL_TREE;
> > >
> > > @@ -2928,7 +3180,7 @@ slpeel_update_phi_nodes_for_guard2 (class loop
> > *loop, class loop *epilog,
> > >        if (!merge_arg)
> > >  	merge_arg = old_arg;
> > >
> > > -      tree guard_arg = find_guard_arg (loop, epilog, update_phi);
> > > +      tree guard_arg = find_guard_arg (loop, epilog, update_phi, e->dest_idx);
> > >        /* If the var is live after loop but not a reduction, we simply
> > >  	 use the old arg.  */
> > >        if (!guard_arg)
> > > @@ -2948,21 +3200,6 @@ slpeel_update_phi_nodes_for_guard2 (class
> > loop *loop, class loop *epilog,
> > >      }
> > >  }
> > >
> > > -/* EPILOG loop is duplicated from the original loop for vectorizing,
> > > -   the arg of its loop closed ssa PHI needs to be updated.  */
> > > -
> > > -static void
> > > -slpeel_update_phi_nodes_for_lcssa (class loop *epilog)
> > > -{
> > > -  gphi_iterator gsi;
> > > -  basic_block exit_bb = single_exit (epilog)->dest;
> > > -
> > > -  gcc_assert (single_pred_p (exit_bb));
> > > -  edge e = EDGE_PRED (exit_bb, 0);
> > > -  for (gsi = gsi_start_phis (exit_bb); !gsi_end_p (gsi); gsi_next (&gsi))
> > > -    rename_use_op (PHI_ARG_DEF_PTR_FROM_EDGE (gsi.phi (), e));
> > > -}
> > > -

I wonder if we can still split these changes out to before early break 
vect?

> > >  /* EPILOGUE_VINFO is an epilogue loop that we now know would need to
> > >     iterate exactly CONST_NITERS times.  Make a final decision about
> > >     whether the epilogue loop should be used, returning true if so.  */
> > > @@ -3138,6 +3375,14 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree
> > niters, tree nitersm1,
> > >      bound_epilog += vf - 1;
> > >    if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
> > >      bound_epilog += 1;
> > > +  /* For early breaks the scalar loop needs to execute at most VF times
> > > +     to find the element that caused the break.  */
> > > +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > +    {
> > > +      bound_epilog = vf;
> > > +      /* Force a scalar epilogue as we can't vectorize the index finding.  */
> > > +      vect_epilogues = false;
> > > +    }
> > >    bool epilog_peeling = maybe_ne (bound_epilog, 0U);
> > >    poly_uint64 bound_scalar = bound_epilog;
> > >
> > > @@ -3297,16 +3542,24 @@ vect_do_peeling (loop_vec_info loop_vinfo,
> > tree niters, tree nitersm1,
> > >  				  bound_prolog + bound_epilog)
> > >  		      : (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
> > >  			 || vect_epilogues));
> > > +
> > > +  /* We only support early break vectorization on known bounds at this
> > time.
> > > +     This means that if the vector loop can't be entered then we won't
> > generate
> > > +     it at all.  So for now force skip_vector off because the additional control
> > > +     flow messes with the BB exits and we've already analyzed them.  */
> > > + skip_vector = skip_vector && !LOOP_VINFO_EARLY_BREAKS (loop_vinfo);
> > > +

I think it should be as "easy" as entering the epilog via the block taking
the regular exit?

> > >    /* Epilog loop must be executed if the number of iterations for epilog
> > >       loop is known at compile time, otherwise we need to add a check at
> > >       the end of vector loop and skip to the end of epilog loop.  */
> > >    bool skip_epilog = (prolog_peeling < 0
> > >  		      || !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > >  		      || !vf.is_constant ());
> > > -  /* PEELING_FOR_GAPS is special because epilog loop must be executed.  */
> > > -  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
> > > +  /* PEELING_FOR_GAPS and peeling for early breaks are special because
> > epilog
> > > +     loop must be executed.  */
> > > +  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> > > +      || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > >      skip_epilog = false;
> > > -
> > >    class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
> > >    auto_vec<profile_count> original_counts;
> > >    basic_block *original_bbs = NULL;
> > > @@ -3344,13 +3597,13 @@ vect_do_peeling (loop_vec_info loop_vinfo,
> > tree niters, tree nitersm1,
> > >    if (prolog_peeling)
> > >      {
> > >        e = loop_preheader_edge (loop);
> > > -      gcc_checking_assert (slpeel_can_duplicate_loop_p (loop, e));
> > > -
> > > +      gcc_checking_assert (slpeel_can_duplicate_loop_p (loop_vinfo, e));
> > >        /* Peel prolog and put it on preheader edge of loop.  */
> > > -      prolog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, scalar_loop, e);
> > > +      prolog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, scalar_loop, e,
> > > +						       true);
> > >        gcc_assert (prolog);
> > >        prolog->force_vectorize = false;
> > > -      slpeel_update_phi_nodes_for_loops (loop_vinfo, prolog, loop, true);
> > > +
> > >        first_loop = prolog;
> > >        reset_original_copy_tables ();
> > >
> > > @@ -3420,11 +3673,12 @@ vect_do_peeling (loop_vec_info loop_vinfo,
> > tree niters, tree nitersm1,
> > >  	 as the transformations mentioned above make less or no sense when
> > not
> > >  	 vectorizing.  */
> > >        epilog = vect_epilogues ? get_loop_copy (loop) : scalar_loop;
> > > -      epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, epilog, e);
> > > +      auto_vec<basic_block> doms;
> > > +      epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, epilog, e, true,
> > > +						       &doms);
> > >        gcc_assert (epilog);
> > >
> > >        epilog->force_vectorize = false;
> > > -      slpeel_update_phi_nodes_for_loops (loop_vinfo, loop, epilog, false);
> > >
> > >        /* Scalar version loop may be preferred.  In this case, add guard
> > >  	 and skip to epilog.  Note this only happens when the number of
> > > @@ -3496,6 +3750,54 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree
> > niters, tree nitersm1,
> > >        vect_update_ivs_after_vectorizer (loop_vinfo, niters_vector_mult_vf,
> > >  					update_e);
> > >
> > > +      /* For early breaks we must create a guard to check how many iterations
> > > +	 of the scalar loop are yet to be performed.  */

We have this check anyway, no?  In fact don't we know that we always enter
the epilog (see above)?

> > > +      if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > +	{
> > > +	  tree ivtmp =
> > > +	    vect_update_ivs_after_early_break (loop_vinfo, epilog, vf, niters,
> > > +					       *niters_vector, update_e);
> > > +
> > > +	  gcc_assert (ivtmp);
> > > +	  tree guard_cond = fold_build2 (EQ_EXPR, boolean_type_node,
> > > +					 fold_convert (TREE_TYPE (niters),
> > > +						       ivtmp),
> > > +					 build_zero_cst (TREE_TYPE (niters)));
> > > +	  basic_block guard_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
> > > +
> > > +	  /* If we had a fallthrough edge, the guard will the threaded through
> > > +	     and so we may need to find the actual final edge.  */
> > > +	  edge final_edge = epilog->vec_loop_iv;
> > > +	  /* slpeel_update_phi_nodes_for_guard2 expects an empty block in
> > > +	     between the guard and the exit edge.  It only adds new nodes and
> > > +	     doesn't update existing one in the current scheme.  */
> > > +	  basic_block guard_to = split_edge (final_edge);
> > > +	  edge guard_e = slpeel_add_loop_guard (guard_bb, guard_cond,
> > guard_to,
> > > +						guard_bb, prob_epilog.invert
> > (),
> > > +						irred_flag);
> > > +	  doms.safe_push (guard_bb);
> > > +
> > > +	  iterate_fix_dominators (CDI_DOMINATORS, doms, false);
> > > +
> > > +	  /* We must update all the edges from the new guard_bb.  */
> > > +	  slpeel_update_phi_nodes_for_guard2 (loop, epilog, guard_e,
> > > +					      final_edge);
> > > +
> > > +	  /* If the loop was versioned we'll have an intermediate BB between
> > > +	     the guard and the exit.  This intermediate block is required
> > > +	     because in the current scheme of things the guard block phi
> > > +	     updating can only maintain LCSSA by creating new blocks.  In this
> > > +	     case we just need to update the uses in this block as well.  */
> > > +	  if (loop != scalar_loop)
> > > +	    {
> > > +	      for (gphi_iterator gsi = gsi_start_phis (guard_to);
> > > +		   !gsi_end_p (gsi); gsi_next (&gsi))
> > > +		rename_use_op (PHI_ARG_DEF_PTR_FROM_EDGE (gsi.phi (),
> > guard_e));
> > > +	    }
> > > +
> > > +	  flush_pending_stmts (guard_e);
> > > +	}
> > > +
> > >        if (skip_epilog)
> > >  	{
> > >  	  guard_cond = fold_build2 (EQ_EXPR, boolean_type_node,
> > > @@ -3520,8 +3822,6 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree
> > niters, tree nitersm1,
> > >  	    }
> > >  	  scale_loop_profile (epilog, prob_epilog, 0);
> > >  	}
> > > -      else
> > > -	slpeel_update_phi_nodes_for_lcssa (epilog);
> > >
> > >        unsigned HOST_WIDE_INT bound;
> > >        if (bound_scalar.is_constant (&bound))
> > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > > index
> > b4a98de80aa39057fc9b17977dd0e347b4f0fb5d..ab9a2048186f461f5ec49
> > f21421958e7ee25eada 100644
> > > --- a/gcc/tree-vect-loop.cc
> > > +++ b/gcc/tree-vect-loop.cc
> > > @@ -1007,6 +1007,8 @@ _loop_vec_info::_loop_vec_info (class loop
> > *loop_in, vec_info_shared *shared)
> > >      partial_load_store_bias (0),
> > >      peeling_for_gaps (false),
> > >      peeling_for_niter (false),
> > > +    early_breaks (false),
> > > +    non_break_control_flow (false),
> > >      no_data_dependencies (false),
> > >      has_mask_store (false),
> > >      scalar_loop_scaling (profile_probability::uninitialized ()),
> > > @@ -1199,6 +1201,14 @@ vect_need_peeling_or_partial_vectors_p
> > (loop_vec_info loop_vinfo)
> > >      th = LOOP_VINFO_COST_MODEL_THRESHOLD
> > (LOOP_VINFO_ORIG_LOOP_INFO
> > >  					  (loop_vinfo));
> > >
> > > +  /* When we have multiple exits and VF is unknown, we must require
> > partial
> > > +     vectors because the loop bounds is not a minimum but a maximum.
> > That is to
> > > +     say we cannot unpredicate the main loop unless we peel or use partial
> > > +     vectors in the epilogue.  */
> > > +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
> > > +      && !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ())
> > > +    return true;
> > > +
> > >    if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > >        && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) >= 0)
> > >      {
> > > @@ -1652,12 +1662,12 @@ vect_compute_single_scalar_iteration_cost
> > (loop_vec_info loop_vinfo)
> > >    loop_vinfo->scalar_costs->finish_cost (nullptr);
> > >  }
> > >
> > > -
> > >  /* Function vect_analyze_loop_form.
> > >
> > >     Verify that certain CFG restrictions hold, including:
> > >     - the loop has a pre-header
> > > -   - the loop has a single entry and exit
> > > +   - the loop has a single entry
> > > +   - nested loops can have only a single exit.
> > >     - the loop exit condition is simple enough
> > >     - the number of iterations can be analyzed, i.e, a countable loop.  The
> > >       niter could be analyzed under some assumptions.  */
> > > @@ -1693,11 +1703,6 @@ vect_analyze_loop_form (class loop *loop,
> > vect_loop_form_info *info)
> > >                             |
> > >                          (exit-bb)  */
> > >
> > > -      if (loop->num_nodes != 2)
> > > -	return opt_result::failure_at (vect_location,
> > > -				       "not vectorized:"
> > > -				       " control flow in loop.\n");
> > > -
> > >        if (empty_block_p (loop->header))
> > >  	return opt_result::failure_at (vect_location,
> > >  				       "not vectorized: empty loop.\n");
> > > @@ -1768,11 +1773,13 @@ vect_analyze_loop_form (class loop *loop,
> > vect_loop_form_info *info)
> > >          dump_printf_loc (MSG_NOTE, vect_location,
> > >  			 "Considering outer-loop vectorization.\n");
> > >        info->inner_loop_cond = inner.loop_cond;
> > > +
> > > +      if (!single_exit (loop))
> > > +	return opt_result::failure_at (vect_location,
> > > +				       "not vectorized: multiple exits.\n");
> > > +
> > >      }
> > >
> > > -  if (!single_exit (loop))
> > > -    return opt_result::failure_at (vect_location,
> > > -				   "not vectorized: multiple exits.\n");
> > >    if (EDGE_COUNT (loop->header->preds) != 2)
> > >      return opt_result::failure_at (vect_location,
> > >  				   "not vectorized:"
> > > @@ -1788,11 +1795,36 @@ vect_analyze_loop_form (class loop *loop,
> > vect_loop_form_info *info)
> > >  				   "not vectorized: latch block not empty.\n");
> > >
> > >    /* Make sure the exit is not abnormal.  */
> > > -  edge e = single_exit (loop);
> > > -  if (e->flags & EDGE_ABNORMAL)
> > > -    return opt_result::failure_at (vect_location,
> > > -				   "not vectorized:"
> > > -				   " abnormal loop exit edge.\n");
> > > +  auto_vec<edge> exits = get_loop_exit_edges (loop);
> > > +  edge nexit = loop->vec_loop_iv;
> > > +  for (edge e : exits)
> > > +    {
> > > +      if (e->flags & EDGE_ABNORMAL)
> > > +	return opt_result::failure_at (vect_location,
> > > +				       "not vectorized:"
> > > +				       " abnormal loop exit edge.\n");
> > > +      /* Early break BB must be after the main exit BB.  In theory we should
> > > +	 be able to vectorize the inverse order, but the current flow in the
> > > +	 the vectorizer always assumes you update successor PHI nodes, not
> > > +	 preds.  */
> > > +      if (e != nexit && !dominated_by_p (CDI_DOMINATORS, nexit->src, e-
> > >src))
> > > +	return opt_result::failure_at (vect_location,
> > > +				       "not vectorized:"
> > > +				       " abnormal loop exit edge order.\n");

"unsupported loop exit order", but I don't understand the comment.

> > > +    }
> > > +
> > > +  /* We currently only support early exit loops with known bounds.   */

Btw, why's that?  Is that because we don't support the loop-around edge?
IMHO this is the most serious limitation (and as said above it should be
trivial to fix).

> > > +  if (exits.length () > 1)
> > > +    {
> > > +      class tree_niter_desc niter;
> > > +      if (!number_of_iterations_exit_assumptions (loop, nexit, &niter, NULL)
> > > +	  || chrec_contains_undetermined (niter.niter)
> > > +	  || !evolution_function_is_constant_p (niter.niter))
> > > +	return opt_result::failure_at (vect_location,
> > > +				       "not vectorized:"
> > > +				       " early breaks only supported on loops"
> > > +				       " with known iteration bounds.\n");
> > > +    }
> > >
> > >    info->conds
> > >      = vect_get_loop_niters (loop, &info->assumptions,
> > > @@ -1866,6 +1898,10 @@ vect_create_loop_vinfo (class loop *loop,
> > vec_info_shared *shared,
> > >    LOOP_VINFO_LOOP_CONDS (loop_vinfo).safe_splice (info-
> > >alt_loop_conds);
> > >    LOOP_VINFO_LOOP_IV_COND (loop_vinfo) = info->loop_cond;
> > >
> > > +  /* Check to see if we're vectorizing multiple exits.  */
> > > +  LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
> > > +    = !LOOP_VINFO_LOOP_CONDS (loop_vinfo).is_empty ();
> > > +
> > >    if (info->inner_loop_cond)
> > >      {
> > >        stmt_vec_info inner_loop_cond_info
> > > @@ -3070,7 +3106,8 @@ start_over:
> > >
> > >    /* If an epilogue loop is required make sure we can create one.  */
> > >    if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> > > -      || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo))
> > > +      || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)
> > > +      || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > >      {
> > >        if (dump_enabled_p ())
> > >          dump_printf_loc (MSG_NOTE, vect_location, "epilog loop required\n");
> > > @@ -5797,7 +5834,7 @@ vect_create_epilog_for_reduction (loop_vec_info
> > loop_vinfo,
> > >    basic_block exit_bb;
> > >    tree scalar_dest;
> > >    tree scalar_type;
> > > -  gimple *new_phi = NULL, *phi;
> > > +  gimple *new_phi = NULL, *phi = NULL;
> > >    gimple_stmt_iterator exit_gsi;
> > >    tree new_temp = NULL_TREE, new_name, new_scalar_dest;
> > >    gimple *epilog_stmt = NULL;
> > > @@ -6039,6 +6076,33 @@ vect_create_epilog_for_reduction
> > (loop_vec_info loop_vinfo,
> > >  	  new_def = gimple_convert (&stmts, vectype, new_def);
> > >  	  reduc_inputs.quick_push (new_def);
> > >  	}
> > > +
> > > +	/* Update the other exits.  */
> > > +	if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > +	  {
> > > +	    vec<edge> alt_exits = LOOP_VINFO_ALT_EXITS (loop_vinfo);
> > > +	    gphi_iterator gsi, gsi1;
> > > +	    for (edge exit : alt_exits)
> > > +	      {
> > > +		/* Find the phi node to propaget into the exit block for each
> > > +		   exit edge.  */
> > > +		for (gsi = gsi_start_phis (exit_bb),
> > > +		     gsi1 = gsi_start_phis (exit->src);

exit->src == loop->header, right?  I think this won't work for multiple
alternate exits.  It's probably easier to do this where we create the
LC PHI node for the reduction result?

> > > +		     !gsi_end_p (gsi) && !gsi_end_p (gsi1);
> > > +		     gsi_next (&gsi), gsi_next (&gsi1))
> > > +		  {
> > > +		    /* There really should be a function to just get the number
> > > +		       of phis inside a bb.  */
> > > +		    if (phi && phi == gsi.phi ())
> > > +		      {
> > > +			gphi *phi1 = gsi1.phi ();
> > > +			SET_PHI_ARG_DEF (phi, exit->dest_idx,
> > > +					 PHI_RESULT (phi1));

I think we know the header PHI of a reduction perfectly well, there 
shouldn't be the need to "search" for it.

> > > +			break;
> > > +		      }
> > > +		  }
> > > +	      }
> > > +	  }
> > >        gsi_insert_seq_before (&exit_gsi, stmts, GSI_SAME_STMT);
> > >      }
> > >
> > > @@ -10355,6 +10419,13 @@ vectorizable_live_operation (vec_info *vinfo,
> > >  	   new_tree = lane_extract <vec_lhs', ...>;
> > >  	   lhs' = new_tree;  */
> > >
> > > +      /* When vectorizing an early break, any live statement that is used
> > > +	 outside of the loop are dead.  The loop will never get to them.
> > > +	 We could change the liveness value during analysis instead but since
> > > +	 the below code is invalid anyway just ignore it during codegen.  */
> > > +      if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > +	return true;

But what about the value that's live across the main exit when the 
epilogue is not entered?

> > > +
> > >        class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > >        basic_block exit_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
> > >        gcc_assert (single_pred_p (exit_bb));
> > > @@ -11277,7 +11348,7 @@ vect_transform_loop (loop_vec_info
> > loop_vinfo, gimple *loop_vectorized_call)
> > >    /* Make sure there exists a single-predecessor exit bb.  Do this before
> > >       versioning.   */
> > >    edge e = LOOP_VINFO_IV_EXIT (loop_vinfo);
> > > -  if (! single_pred_p (e->dest))
> > > +  if (e && ! single_pred_p (e->dest) && !LOOP_VINFO_EARLY_BREAKS
> > (loop_vinfo))

e can be NULL here?  I think we should reject such loops earlier.

> > >      {
> > >        split_loop_exit_edge (e, true);
> > >        if (dump_enabled_p ())
> > > @@ -11303,7 +11374,7 @@ vect_transform_loop (loop_vec_info
> > loop_vinfo, gimple *loop_vectorized_call)
> > >    if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo))
> > >      {
> > >        e = single_exit (LOOP_VINFO_SCALAR_LOOP (loop_vinfo));
> > > -      if (! single_pred_p (e->dest))
> > > +      if (e && ! single_pred_p (e->dest))
> > >  	{
> > >  	  split_loop_exit_edge (e, true);
> > >  	  if (dump_enabled_p ())
> > > @@ -11641,7 +11712,8 @@ vect_transform_loop (loop_vec_info
> > loop_vinfo, gimple *loop_vectorized_call)
> > >
> > >    /* Loops vectorized with a variable factor won't benefit from
> > >       unrolling/peeling.  */

update the comment?  Why would we unroll a VLA loop with early breaks?
Or did you mean to use || LOOP_VINFO_EARLY_BREAKS (loop_vinfo)?

> > > -  if (!vf.is_constant ())
> > > +  if (!vf.is_constant ()
> > > +      && !LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > >      {
> > >        loop->unroll = 1;
> > >        if (dump_enabled_p ())
> > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > > index
> > 87c4353fa5180fcb7f60b192897456cf24f3fdbe..03524e8500ee06df42f82af
> > e78ee2a7c627be45b 100644
> > > --- a/gcc/tree-vect-stmts.cc
> > > +++ b/gcc/tree-vect-stmts.cc
> > > @@ -344,9 +344,34 @@ vect_stmt_relevant_p (stmt_vec_info stmt_info,
> > loop_vec_info loop_vinfo,
> > >    *live_p = false;
> > >
> > >    /* cond stmt other than loop exit cond.  */
> > > -  if (is_ctrl_stmt (stmt_info->stmt)
> > > -      && STMT_VINFO_TYPE (stmt_info) != loop_exit_ctrl_vec_info_type)
> > > -    *relevant = vect_used_in_scope;

how was that ever hit before?  For outer loop processing with outer loop
vectorization?

> > > +  if (is_ctrl_stmt (stmt_info->stmt))
> > > +    {
> > > +      /* Ideally EDGE_LOOP_EXIT would have been set on the exit edge, but
> > > +	 it looks like loop_manip doesn't do that..  So we have to do it
> > > +	 the hard way.  */
> > > +      basic_block bb = gimple_bb (stmt_info->stmt);
> > > +      bool exit_bb = false, early_exit = false;
> > > +      edge_iterator ei;
> > > +      edge e;
> > > +      FOR_EACH_EDGE (e, ei, bb->succs)
> > > +        if (!flow_bb_inside_loop_p (loop, e->dest))
> > > +	  {
> > > +	    exit_bb = true;
> > > +	    early_exit = loop->vec_loop_iv->src != bb;
> > > +	    break;
> > > +	  }
> > > +
> > > +      /* We should have processed any exit edge, so an edge not an early
> > > +	 break must be a loop IV edge.  We need to distinguish between the
> > > +	 two as we don't want to generate code for the main loop IV.  */
> > > +      if (exit_bb)
> > > +	{
> > > +	  if (early_exit)
> > > +	    *relevant = vect_used_in_scope;
> > > +	}

I wonder why you can't simply do

         if (is_ctrl_stmt (stmt_info->stmt)
             && stmt_info->stmt != LOOP_VINFO_COND (loop_info))

?

> > > +      else if (bb->loop_father == loop)
> > > +	LOOP_VINFO_GENERAL_CTR_FLOW (loop_vinfo) = true;

so for control flow not exiting the loop you can check
loop_exits_from_bb_p ().

> > > +    }
> > >
> > >    /* changing memory.  */
> > >    if (gimple_code (stmt_info->stmt) != GIMPLE_PHI)
> > > @@ -359,6 +384,11 @@ vect_stmt_relevant_p (stmt_vec_info stmt_info,
> > loop_vec_info loop_vinfo,
> > >  	*relevant = vect_used_in_scope;
> > >        }
> > >
> > > +  auto_vec<edge> exits = get_loop_exit_edges (loop);
> > > +  auto_bitmap exit_bbs;
> > > +  for (edge exit : exits)
> > > +    bitmap_set_bit (exit_bbs, exit->dest->index);
> > > +
> > >    /* uses outside the loop.  */
> > >    FOR_EACH_PHI_OR_STMT_DEF (def_p, stmt_info->stmt, op_iter,
> > SSA_OP_DEF)
> > >      {
> > > @@ -377,7 +407,7 @@ vect_stmt_relevant_p (stmt_vec_info stmt_info,
> > loop_vec_info loop_vinfo,
> > >  	      /* We expect all such uses to be in the loop exit phis
> > >  		 (because of loop closed form)   */
> > >  	      gcc_assert (gimple_code (USE_STMT (use_p)) == GIMPLE_PHI);
> > > -	      gcc_assert (bb == single_exit (loop)->dest);
> > > +	      gcc_assert (bitmap_bit_p (exit_bbs, bb->index));

That now becomes quite expensive checking already covered by the LC SSA 
verifier so I suggest to simply drop this assert instead.

> > >                *live_p = true;
> > >  	    }
> > > @@ -683,6 +713,13 @@ vect_mark_stmts_to_be_vectorized
> > (loop_vec_info loop_vinfo, bool *fatal)
> > >  	}
> > >      }
> > >
> > > +  /* Ideally this should be in vect_analyze_loop_form but we haven't seen all
> > > +     the conds yet at that point and there's no quick way to retrieve them.  */
> > > +  if (LOOP_VINFO_GENERAL_CTR_FLOW (loop_vinfo))
> > > +    return opt_result::failure_at (vect_location,
> > > +				   "not vectorized:"
> > > +				   " unsupported control flow in loop.\n");

so we didn't do this before?  But see above where I wondered.  So when 
does this hit with early exits and why can't we check for this in
vect_verify_loop_form?

> > > +
> > >    /* 2. Process_worklist */
> > >    while (worklist.length () > 0)
> > >      {
> > > @@ -778,6 +815,20 @@ vect_mark_stmts_to_be_vectorized
> > (loop_vec_info loop_vinfo, bool *fatal)
> > >  			return res;
> > >  		    }
> > >                   }
> > > +	    }
> > > +	  else if (gcond *cond = dyn_cast <gcond *> (stmt_vinfo->stmt))
> > > +	    {
> > > +	      enum tree_code rhs_code = gimple_cond_code (cond);
> > > +	      gcc_assert (TREE_CODE_CLASS (rhs_code) == tcc_comparison);
> > > +	      opt_result res
> > > +		= process_use (stmt_vinfo, gimple_cond_lhs (cond),
> > > +			       loop_vinfo, relevant, &worklist, false);
> > > +	      if (!res)
> > > +		return res;
> > > +	      res = process_use (stmt_vinfo, gimple_cond_rhs (cond),
> > > +				loop_vinfo, relevant, &worklist, false);
> > > +	      if (!res)
> > > +		return res;
> > >              }
> > >  	  else if (gcall *call = dyn_cast <gcall *> (stmt_vinfo->stmt))
> > >  	    {
> > > @@ -11919,11 +11970,15 @@ vect_analyze_stmt (vec_info *vinfo,
> > >  			     node_instance, cost_vec);
> > >        if (!res)
> > >  	return res;
> > > -   }
> > > +    }
> > > +
> > > +  if (is_ctrl_stmt (stmt_info->stmt))
> > > +    STMT_VINFO_DEF_TYPE (stmt_info) = vect_early_exit_def;
> > >
> > >    switch (STMT_VINFO_DEF_TYPE (stmt_info))
> > >      {
> > >        case vect_internal_def:
> > > +      case vect_early_exit_def:
> > >          break;
> > >
> > >        case vect_reduction_def:
> > > @@ -11956,6 +12011,7 @@ vect_analyze_stmt (vec_info *vinfo,
> > >      {
> > >        gcall *call = dyn_cast <gcall *> (stmt_info->stmt);
> > >        gcc_assert (STMT_VINFO_VECTYPE (stmt_info)
> > > +		  || gimple_code (stmt_info->stmt) == GIMPLE_COND
> > >  		  || (call && gimple_call_lhs (call) == NULL_TREE));
> > >        *need_to_vectorize = true;
> > >      }
> > > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> > > index
> > ec65b65b5910e9cbad0a8c7e83c950b6168b98bf..24a0567a2f23f1b3d8b3
> > 40baff61d18da8e242dd 100644
> > > --- a/gcc/tree-vectorizer.h
> > > +++ b/gcc/tree-vectorizer.h
> > > @@ -63,6 +63,7 @@ enum vect_def_type {
> > >    vect_internal_def,
> > >    vect_induction_def,
> > >    vect_reduction_def,
> > > +  vect_early_exit_def,

can you avoid putting this inbetween reduction and double reduction 
please?  Just put it before vect_unknown_def_type.  In fact the COND
isn't a def ... maybe we should have pattern recogized

 if (a < b) exit;

as

 cond = a < b;
 if (cond != 0) exit;

so the part that we need to vectorize is more clear.

> > >    vect_double_reduction_def,
> > >    vect_nested_cycle,
> > >    vect_first_order_recurrence,
> > > @@ -876,6 +877,13 @@ public:
> > >       we need to peel off iterations at the end to form an epilogue loop.  */
> > >    bool peeling_for_niter;
> > >
> > > +  /* When the loop has early breaks that we can vectorize we need to peel
> > > +     the loop for the break finding loop.  */
> > > +  bool early_breaks;
> > > +
> > > +  /* When the loop has a non-early break control flow inside.  */
> > > +  bool non_break_control_flow;
> > > +
> > >    /* List of loop additional IV conditionals found in the loop.  */
> > >    auto_vec<gcond *> conds;
> > >
> > > @@ -985,9 +993,11 @@ public:
> > >  #define LOOP_VINFO_REDUCTION_CHAINS(L)     (L)->reduction_chains
> > >  #define LOOP_VINFO_PEELING_FOR_GAPS(L)     (L)->peeling_for_gaps
> > >  #define LOOP_VINFO_PEELING_FOR_NITER(L)    (L)->peeling_for_niter
> > > +#define LOOP_VINFO_EARLY_BREAKS(L)         (L)->early_breaks
> > >  #define LOOP_VINFO_EARLY_BRK_CONFLICT_STMTS(L) (L)-
> > >early_break_conflict
> > >  #define LOOP_VINFO_EARLY_BRK_DEST_BB(L)    (L)->early_break_dest_bb
> > >  #define LOOP_VINFO_EARLY_BRK_VUSES(L)      (L)->early_break_vuses
> > > +#define LOOP_VINFO_GENERAL_CTR_FLOW(L)     (L)-
> > >non_break_control_flow
> > >  #define LOOP_VINFO_LOOP_CONDS(L)           (L)->conds
> > >  #define LOOP_VINFO_LOOP_IV_COND(L)         (L)->loop_iv_cond
> > >  #define LOOP_VINFO_NO_DATA_DEPENDENCIES(L) (L)-
> > >no_data_dependencies
> > > @@ -1038,8 +1048,8 @@ public:
> > >     stack.  */
> > >  typedef opt_pointer_wrapper <loop_vec_info> opt_loop_vec_info;
> > >
> > > -inline loop_vec_info
> > > -loop_vec_info_for_loop (class loop *loop)
> > > +static inline loop_vec_info
> > > +loop_vec_info_for_loop (const class loop *loop)
> > >  {
> > >    return (loop_vec_info) loop->aux;
> > >  }
> > > @@ -1789,7 +1799,7 @@ is_loop_header_bb_p (basic_block bb)
> > >  {
> > >    if (bb == (bb->loop_father)->header)
> > >      return true;
> > > -  gcc_checking_assert (EDGE_COUNT (bb->preds) == 1);
> > > +
> > >    return false;
> > >  }
> > >
> > > @@ -2176,9 +2186,10 @@ class auto_purge_vect_location
> > >     in tree-vect-loop-manip.cc.  */
> > >  extern void vect_set_loop_condition (class loop *, loop_vec_info,
> > >  				     tree, tree, tree, bool);
> > > -extern bool slpeel_can_duplicate_loop_p (const class loop *, const_edge);
> > > +extern bool slpeel_can_duplicate_loop_p (const loop_vec_info,
> > const_edge);
> > >  class loop *slpeel_tree_duplicate_loop_to_edge_cfg (class loop *,
> > > -						     class loop *, edge);
> > > +						    class loop *, edge, bool,
> > > +						    vec<basic_block> * = NULL);
> > >  class loop *vect_loop_versioning (loop_vec_info, gimple *);
> > >  extern class loop *vect_do_peeling (loop_vec_info, tree, tree,
> > >  				    tree *, tree *, tree *, int, bool, bool,
> > > diff --git a/gcc/tree-vectorizer.cc b/gcc/tree-vectorizer.cc
> > > index
> > a048e9d89178a37455bd7b83ab0f2a238a4ce69e..0dc5479dc92058b6c70c
> > 67f29f5dc9a8d72235f4 100644
> > > --- a/gcc/tree-vectorizer.cc
> > > +++ b/gcc/tree-vectorizer.cc
> > > @@ -1379,7 +1379,9 @@ pass_vectorize::execute (function *fun)
> > >  	 predicates that need to be shared for optimal predicate usage.
> > >  	 However reassoc will re-order them and prevent CSE from working
> > >  	 as it should.  CSE only the loop body, not the entry.  */
> > > -      bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
> > > +      auto_vec<edge> exits = get_loop_exit_edges (loop);

seeing this more and more I think we want a simple way to iterate over
all exits without copying to a vector when we have them recorded.  My
C++ fu is too limited to support

  for (auto exit : recorded_exits (loop))
    ...

(maybe that's enough for somebody to jump onto this ;))

Don't treat all review comments as change orders, but it should be clear
the code isn't 100% obvious.  Maybe the patch can be simplified by
splitting out the LC SSA cleanup parts.

Thanks,
Richard.

> > > +      for (edge exit : exits)
> > > +	bitmap_set_bit (exit_bbs, exit->dest->index);
> > >
> > >        edge entry = EDGE_PRED (loop_preheader_edge (loop)->src, 0);
> > >        do_rpo_vn (fun, entry, exit_bbs);
> > >
> > >
> > >
> > >
> > >
> > 
> > --
> > Richard Biener <rguenther@suse.de>
> > SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461
> > Nuernberg,
> > Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien
> > Moerman;
> > HRB 36809 (AG Nuernberg)
>
  
Tamar Christina July 17, 2023, 10:56 a.m. UTC | #4
> -----Original Message-----
> From: Richard Biener <rguenther@suse.de>
> Sent: Friday, July 14, 2023 2:35 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; jlaw@ventanamicro.com
> Subject: RE: [PATCH 12/19]middle-end: implement loop peeling and IV
> updates for early break.
> 
> On Thu, 13 Jul 2023, Tamar Christina wrote:
> 
> > > -----Original Message-----
> > > From: Richard Biener <rguenther@suse.de>
> > > Sent: Thursday, July 13, 2023 6:31 PM
> > > To: Tamar Christina <Tamar.Christina@arm.com>
> > > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>;
> jlaw@ventanamicro.com
> > > Subject: Re: [PATCH 12/19]middle-end: implement loop peeling and IV
> > > updates for early break.
> > >
> > > On Wed, 28 Jun 2023, Tamar Christina wrote:
> > >
> > > > Hi All,
> > > >
> > > > This patch updates the peeling code to maintain LCSSA during peeling.
> > > > The rewrite also naturally takes into account multiple exits and so it didn't
> > > > make sense to split them off.
> > > >
> > > > For the purposes of peeling the only change for multiple exits is that the
> > > > secondary exits are all wired to the start of the new loop preheader when
> > > doing
> > > > epilogue peeling.
> > > >
> > > > When doing prologue peeling the CFG is kept in tact.
> > > >
> > > > For both epilogue and prologue peeling we wire through between the
> two
> > > loops any
> > > > PHI nodes that escape the first loop into the second loop if flow_loops is
> > > > specified.  The reason for this conditionality is because
> > > > slpeel_tree_duplicate_loop_to_edge_cfg is used in the compiler in 3 ways:
> > > >   - prologue peeling
> > > >   - epilogue peeling
> > > >   - loop distribution
> > > >
> > > > for the last case the loops should remain independent, and so not be
> > > connected.
> > > > Because of this propagation of only used phi nodes get_current_def can
> be
> > > used
> > > > to easily find the previous definitions.  However live statements that are
> > > > not used inside the loop itself are not propagated (since if unused, the
> > > moment
> > > > we add the guard in between the two loops the value across the bypass
> edge
> > > can
> > > > be wrong if the loop has been peeled.)
> > > >
> > > > This is dealt with easily enough in find_guard_arg.
> > > >
> > > > For multiple exits, while we are in LCSSA form, and have a correct DOM
> tree,
> > > the
> > > > moment we add the guard block we will change the dominators again.  To
> > > deal with
> > > > this slpeel_tree_duplicate_loop_to_edge_cfg can optionally return the
> blocks
> > > to
> > > > update without having to recompute the list of blocks to update again.
> > > >
> > > > When multiple exits and doing epilogue peeling we will also temporarily
> have
> > > an
> > > > incorrect VUSES chain for the secondary exits as it anticipates the final
> result
> > > > after the VDEFs have been moved.  This will thus be corrected once the
> code
> > > > motion is applied.
> > > >
> > > > Lastly by doing things this way we can remove the helper functions that
> > > > previously did lock step iterations to update things as it went along.
> > > >
> > > > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> > > >
> > > > Ok for master?
> > >
> > > Not sure if I get through all of this in one go - so be prepared that
> > > the rest of the review follows another day.
> >
> > No worries, I appreciate the reviews!
> > Just giving some quick replies for when you continue.
> 
> Continueing.
> 
> > >
> > > > Thanks,
> > > > Tamar
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > > 	* tree-loop-distribution.cc (copy_loop_before): Pass flow_loops =
> > > false.
> > > > 	* tree-ssa-loop-niter.cc (loop_only_exit_p):  Fix bug when exit==null.
> > > > 	* tree-vect-loop-manip.cc (adjust_phi_and_debug_stmts): Add
> > > additional
> > > > 	assert.
> > > > 	(vect_set_loop_condition_normal): Skip modifying loop IV for multiple
> > > > 	exits.
> > > > 	(slpeel_tree_duplicate_loop_to_edge_cfg): Support multiple exit
> > > peeling.
> > > > 	(slpeel_can_duplicate_loop_p): Likewise.
> > > > 	(vect_update_ivs_after_vectorizer): Don't enter this...
> > > > 	(vect_update_ivs_after_early_break): ...but instead enter here.
> > > > 	(find_guard_arg): Update for new peeling code.
> > > > 	(slpeel_update_phi_nodes_for_loops): Remove.
> > > > 	(slpeel_update_phi_nodes_for_guard2): Remove hardcoded edge 0
> > > checks.
> > > > 	(slpeel_update_phi_nodes_for_lcssa): Remove.
> > > > 	(vect_do_peeling): Fix VF for multiple exits and force epilogue.
> > > > 	* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Initialize
> > > > 	non_break_control_flow and early_breaks.
> > > > 	(vect_need_peeling_or_partial_vectors_p): Force partial vector if
> > > > 	multiple exits and VLA.
> > > > 	(vect_analyze_loop_form): Support inner loop multiple exits.
> > > > 	(vect_create_loop_vinfo): Set LOOP_VINFO_EARLY_BREAKS.
> > > > 	(vect_create_epilog_for_reduction):  Update live phi nodes.
> > > > 	(vectorizable_live_operation): Ignore live operations in vector loop
> > > > 	when multiple exits.
> > > > 	(vect_transform_loop): Force unrolling for VF loops and multiple exits.
> > > > 	* tree-vect-stmts.cc (vect_stmt_relevant_p): Analyze ctrl statements.
> > > > 	(vect_mark_stmts_to_be_vectorized): Check for non-exit control flow
> > > and
> > > > 	analyze gcond params.
> > > > 	(vect_analyze_stmt): Support gcond.
> > > > 	* tree-vectorizer.cc (pass_vectorize::execute): Support multiple exits
> > > > 	in RPO pass.
> > > > 	* tree-vectorizer.h (enum vect_def_type): Add vect_early_exit_def.
> > > > 	(LOOP_VINFO_EARLY_BREAKS, LOOP_VINFO_GENERAL_CTR_FLOW):
> > > New.
> > > > 	(loop_vec_info_for_loop): Change to const and static.
> > > > 	(is_loop_header_bb_p): Drop assert.
> > > > 	(slpeel_can_duplicate_loop_p): Update prototype.
> > > > 	(class loop): Add early_breaks and non_break_control_flow.
> > > >
> > > > --- inline copy of patch --
> > > > diff --git a/gcc/tree-loop-distribution.cc b/gcc/tree-loop-distribution.cc
> > > > index
> > >
> 97879498db46dd3c34181ae9aa6e5476004dd5b5..d790ce5fffab3aa3dfc40
> > > d833a968314a4442b9e 100644
> > > > --- a/gcc/tree-loop-distribution.cc
> > > > +++ b/gcc/tree-loop-distribution.cc
> > > > @@ -948,7 +948,7 @@ copy_loop_before (class loop *loop, bool
> > > redirect_lc_phi_defs)
> > > >    edge preheader = loop_preheader_edge (loop);
> > > >
> > > >    initialize_original_copy_tables ();
> > > > -  res = slpeel_tree_duplicate_loop_to_edge_cfg (loop, NULL, preheader);
> > > > +  res = slpeel_tree_duplicate_loop_to_edge_cfg (loop, NULL, preheader,
> > > false);
> > > >    gcc_assert (res != NULL);
> > > >
> > > >    /* When a not last partition is supposed to keep the LC PHIs computed
> > > > diff --git a/gcc/tree-ssa-loop-niter.cc b/gcc/tree-ssa-loop-niter.cc
> > > > index
> > >
> 5d398b67e68c7076760854119590f18b19c622b6..79686f6c4945b7139ba
> > > 377300430c04b7aeefe6c 100644
> > > > --- a/gcc/tree-ssa-loop-niter.cc
> > > > +++ b/gcc/tree-ssa-loop-niter.cc
> > > > @@ -3072,7 +3072,12 @@ loop_only_exit_p (const class loop *loop,
> > > basic_block *body, const_edge exit)
> > > >    gimple_stmt_iterator bsi;
> > > >    unsigned i;
> > > >
> > > > -  if (exit != single_exit (loop))
> > > > +  /* We need to check for alternative exits since exit can be NULL.  */
> > >
> > > You mean we pass in exit == NULL in some cases?  I'm not sure what
> > > the desired behavior in that case is - can you point out the
> > > callers you are fixing here?
> > >
> > > I think we should add gcc_assert (exit != nullptr)
> > >
> > > >    for (i = 0; i < loop->num_nodes; i++)
> > > > diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
> > > > index
> > >
> 6b93fb3f9af8f2bbdf5dec28f0009177aa5171ab..550d7f40002cf0b58f8a92
> > > 7cb150edd7c2aa9999 100644
> > > > --- a/gcc/tree-vect-loop-manip.cc
> > > > +++ b/gcc/tree-vect-loop-manip.cc
> > > > @@ -252,6 +252,9 @@ adjust_phi_and_debug_stmts (gimple
> *update_phi,
> > > edge e, tree new_def)
> > > >  {
> > > >    tree orig_def = PHI_ARG_DEF_FROM_EDGE (update_phi, e);
> > > >
> > > > +  gcc_assert (TREE_CODE (orig_def) != SSA_NAME
> > > > +	      || orig_def != new_def);
> > > > +
> > > >    SET_PHI_ARG_DEF (update_phi, e->dest_idx, new_def);
> > > >
> > > >    if (MAY_HAVE_DEBUG_BIND_STMTS)
> > > > @@ -1292,7 +1295,8 @@ vect_set_loop_condition_normal
> (loop_vec_info
> > > loop_vinfo,
> > > >    gsi_insert_before (&loop_cond_gsi, cond_stmt, GSI_SAME_STMT);
> > > >
> > > >    /* Record the number of latch iterations.  */
> > > > -  if (limit == niters)
> > > > +  if (limit == niters
> > > > +      || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > >      /* Case A: the loop iterates NITERS times.  Subtract one to get the
> > > >         latch count.  */
> > > >      loop->nb_iterations = fold_build2 (MINUS_EXPR, niters_type, niters,
> > > > @@ -1303,7 +1307,13 @@ vect_set_loop_condition_normal
> > > (loop_vec_info loop_vinfo,
> > > >      loop->nb_iterations = fold_build2 (TRUNC_DIV_EXPR, niters_type,
> > > >  				       limit, step);
> > > >
> > > > -  if (final_iv)
> > > > +  /* For multiple exits we've already maintained LCSSA form and handled
> > > > +     the scalar iteration update in the code that deals with the merge
> > > > +     block and its updated guard.  I could move that code here instead
> > > > +     of in vect_update_ivs_after_early_break but I have to still deal
> > > > +     with the updates to the counter `i`.  So for now I'll keep them
> > > > +     together.  */
> > > > +  if (final_iv && !LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > >      {
> > > >        gassign *assign;
> > > >        edge exit = LOOP_VINFO_IV_EXIT (loop_vinfo);
> > > > @@ -1509,11 +1519,19 @@ vec_init_exit_info (class loop *loop)
> > > >     on E which is either the entry or exit of LOOP.  If SCALAR_LOOP is
> > > >     non-NULL, assume LOOP and SCALAR_LOOP are equivalent and copy
> the
> > > >     basic blocks from SCALAR_LOOP instead of LOOP, but to either the
> > > > -   entry or exit of LOOP.  */
> > > > +   entry or exit of LOOP.  If FLOW_LOOPS then connect LOOP to
> > > SCALAR_LOOP as a
> > > > +   continuation.  This is correct for cases where one loop continues from
> the
> > > > +   other like in the vectorizer, but not true for uses in e.g. loop
> distribution
> > > > +   where the loop is duplicated and then modified.
> > > > +
> > >
> > > but for loop distribution the flow also continues?  I'm not sure what you
> > > are refering to here.  Do you by chance have a branch with the patches
> > > installed?
> >
> > Yup, they're at refs/users/tnfchris/heads/gcc-14-early-break in the repo.
> >
> > >
> > > > +   If UPDATED_DOMS is not NULL it is update with the list of basic blocks
> > > whoms
> > > > +   dominators were updated during the peeling.  */
> > > >
> > > >  class loop *
> > > >  slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop,
> > > > -					class loop *scalar_loop, edge e)
> > > > +					class loop *scalar_loop, edge e,
> > > > +					bool flow_loops,
> > > > +					vec<basic_block> *updated_doms)
> > > >  {
> > > >    class loop *new_loop;
> > > >    basic_block *new_bbs, *bbs, *pbbs;
> > > > @@ -1602,6 +1620,19 @@ slpeel_tree_duplicate_loop_to_edge_cfg
> (class
> > > loop *loop,
> > > >    for (unsigned i = (at_exit ? 0 : 1); i < scalar_loop->num_nodes + 1; i++)
> > > >      rename_variables_in_bb (new_bbs[i], duplicate_outer_loop);
> > > >
> > > > +  /* Rename the exit uses.  */
> > > > +  for (edge exit : get_loop_exit_edges (new_loop))
> > > > +    for (auto gsi = gsi_start_phis (exit->dest);
> > > > +	 !gsi_end_p (gsi); gsi_next (&gsi))
> > > > +      {
> > > > +	tree orig_def = PHI_ARG_DEF_FROM_EDGE (gsi.phi (), exit);
> > > > +	rename_use_op (PHI_ARG_DEF_PTR_FROM_EDGE (gsi.phi (), exit));
> > > > +	if (MAY_HAVE_DEBUG_BIND_STMTS)
> > > > +	  adjust_debug_stmts (orig_def, PHI_RESULT (gsi.phi ()), exit->dest);
> > > > +      }
> > > > +
> > > > +  /* This condition happens when the loop has been versioned. e.g. due
> to
> > > ifcvt
> > > > +     versioning the loop.  */
> > > >    if (scalar_loop != loop)
> > > >      {
> > > >        /* If we copied from SCALAR_LOOP rather than LOOP, SSA_NAMEs
> from
> > > > @@ -1616,28 +1647,106 @@ slpeel_tree_duplicate_loop_to_edge_cfg
> > > (class loop *loop,
> > > >  						EDGE_SUCC (loop->latch, 0));
> > > >      }
> > > >
> > > > +  vec<edge> alt_exits = loop->vec_loop_alt_exits;
> > >
> > > So 'e' is not one of alt_exits, right?  I wonder if we can simply
> > > compute the vector from all exits of 'loop' and removing 'e'?
> > >
> > > > +  bool multiple_exits_p = !alt_exits.is_empty ();
> > > > +  auto_vec<basic_block> doms;
> > > > +  class loop *update_loop = NULL;
> > > > +
> > > >    if (at_exit) /* Add the loop copy at exit.  */
> > > >      {
> > > > -      if (scalar_loop != loop)
> > > > +      if (scalar_loop != loop && new_exit->dest != exit_dest)
> > > >  	{
> > > > -	  gphi_iterator gsi;
> > > >  	  new_exit = redirect_edge_and_branch (new_exit, exit_dest);
> > > > +	  flush_pending_stmts (new_exit);
> > > > +	}
> > > >
> > > > -	  for (gsi = gsi_start_phis (exit_dest); !gsi_end_p (gsi);
> > > > -	       gsi_next (&gsi))
> > > > +      auto loop_exits = get_loop_exit_edges (loop);
> > > > +      for (edge exit : loop_exits)
> > > > +	redirect_edge_and_branch (exit, new_preheader);
> > > > +
> > > > +
> > >
> > > one line vertical space too much
> > >
> > > > +      /* Copy the current loop LC PHI nodes between the original loop exit
> > > > +	 block and the new loop header.  This allows us to later split the
> > > > +	 preheader block and still find the right LC nodes.  */
> > > > +      edge latch_new = single_succ_edge (new_preheader);
> > > > +      edge latch_old = loop_latch_edge (loop);
> > > > +      hash_set <tree> lcssa_vars;
> > > > +      for (auto gsi_from = gsi_start_phis (latch_old->dest),
> > >
> > > so that's loop->header (and makes it more clear which PHI nodes you are
> > > looking at)
> 
> 
> So I'm now in a debug session - I think that conceptually it would
> make more sense to create the LC PHI nodes that are present at the
> old exit destination in the new preheader _before_ you redirect them
> above and then flush_pending_stmts after redirecting, that should deal
> with the copying.
> 

This was the first thing I tried, however as soon as you redirect one edge
you destroy all other phi nodes on the original block.

As in if I have 3 phi nodes, I need to move them all at the same time, which
brings the next problem in that I can't add more entries to a phi than it has
incoming edges.  That is I can't just make the final phi nodes on the destination
without having an edge for it.

And to make an edge for it I need to have a condition to attach to the edge.
To work around it I tried maintaining a cache of the nodes I need to make on
the new destination and after redirecting just create them,  but that has me
looping over the same PHIs multiple times.

Any suggestions?

> Now, your copying actually iterates over all PHIs in the loop _header_,
> so it doesn't actually copy LC PHI nodes but possibly creates additional
> ones.  The intent does seem to do this since you want a different value
> on those edges for all but the main loop exit.  But then the
> overall comments should better reflect that and maybe you should
> do what I suggested anyway and have this loop alter only the alternate
> exit LC PHIs?

It does create the LC PHI nodes for all exits, it's just that for alternate exits
all the nodes are the same since we only care about the value for full
iterations. 

I'm not sure what you're suggesting with alter only the alternative exits.
I need to still create the one for the main loop and seems easier to do them
In one loop rather than two?

Doing this here allows the removal of all the code later on that the vectorizer
uses to try to find the main exit's PHIs.

> 
> If you don't flush_pending_stmts on an edge after redirecting you
> should call redirect_edge_var_map_clear (edge), otherwise the stale
> info might break things later.
> 
> > > > +	   gsi_to = gsi_start_phis (latch_new->dest);
> > >
> > > likewise new_loop->header
> > >
> > > > +	   flow_loops && !gsi_end_p (gsi_from) && !gsi_end_p (gsi_to);
> > > > +	   gsi_next (&gsi_from), gsi_next (&gsi_to))
> > > > +	{
> > > > +	  gimple *from_phi = gsi_stmt (gsi_from);
> > > > +	  gimple *to_phi = gsi_stmt (gsi_to);
> > > > +	  tree new_arg = PHI_ARG_DEF_FROM_EDGE (from_phi, latch_old);
> > > > +	  /* In all cases, even in early break situations we're only
> > > > +	     interested in the number of fully executed loop iters.  As such
> > > > +	     we discard any partially done iteration.  So we simply propagate
> > > > +	     the phi nodes from the latch to the merge block.  */
> > > > +	  tree new_res = copy_ssa_name (gimple_phi_result (from_phi));
> > > > +	  gphi *lcssa_phi = create_phi_node (new_res, e->dest);
> > > > +
> > > > +	  lcssa_vars.add (new_arg);
> > > > +
> > > > +	  /* Main loop exit should use the final iter value.  */
> > > > +	  add_phi_arg (lcssa_phi, new_arg, loop->vec_loop_iv,
> > > UNKNOWN_LOCATION);
> > >
> > > above you are creating the PHI node at e->dest but here add the PHI arg to
> > > loop->vec_loop_iv - that's 'e' here, no?  Consistency makes it easier
> > > to follow.  I _think_ this code doesn't need to know about the "special"
> > > edge.
> > >
> > > > +
> > > > +	  /* All other exits use the previous iters.  */
> > > > +	  for (edge e : alt_exits)
> > > > +	    add_phi_arg (lcssa_phi, gimple_phi_result (from_phi), e,
> > > > +			 UNKNOWN_LOCATION);
> > > > +
> > > > +	  adjust_phi_and_debug_stmts (to_phi, latch_new, new_res);
> > > > +	}
> > > > +
> > > > +      /* Copy over any live SSA vars that may not have been materialized in
> > > the
> > > > +	 loops themselves but would be in the exit block.  However when the
> > > live
> > > > +	 value is not used inside the loop then we don't need to do this,  if we
> > > do
> > > > +	 then when we split the guard block the branch edge can end up
> > > containing the
> > > > +	 wrong reference,  particularly if it shares an edge with something that
> > > has
> > > > +	 bypassed the loop.  This is not something peeling can check so we
> > > need to
> > > > +	 anticipate the usage of the live variable here.  */
> > > > +      auto exit_map = redirect_edge_var_map_vector (exit);
> > >
> > > Hmm, did I use that in my attemt to refactor things? ...
> >
> > Indeed, I didn't always use it, but found it was the best way to deal with the
> > variables being live in various BB after the loop.
> 
> As said this whole piece of code is possibly more complicated than
> necessary.  First copying/creating the PHI nodes that are present
> at the exit (the old LC PHI nodes), then redirecting edges and flushing
> stmts should deal with half of this.
>
> > >
> > > > +      if (exit_map)
> > > > +        for (auto vm : exit_map)
> > > > +	{
> > > > +	  if (lcssa_vars.contains (vm.def)
> > > > +	      || TREE_CODE (vm.def) != SSA_NAME)
> > >
> > > the latter check is cheaper so it should come first
> > >
> > > > +	    continue;
> > > > +
> > > > +	  imm_use_iterator imm_iter;
> > > > +	  use_operand_p use_p;
> > > > +	  bool use_in_loop = false;
> > > > +
> > > > +	  FOR_EACH_IMM_USE_FAST (use_p, imm_iter, vm.def)
> > > >  	    {
> > > > -	      gphi *phi = gsi.phi ();
> > > > -	      tree orig_arg = PHI_ARG_DEF_FROM_EDGE (phi, e);
> > > > -	      location_t orig_locus
> > > > -		= gimple_phi_arg_location_from_edge (phi, e);
> > > > +	      basic_block bb = gimple_bb (USE_STMT (use_p));
> > > > +	      if (flow_bb_inside_loop_p (loop, bb)
> > > > +		  && !gimple_vuse (USE_STMT (use_p)))
> 
> what's this gimple_vuse check?  I see now for vect-early-break_17.c this
> code triggers and ignores
> 
>   vect_b[i_18] = _2;
> 
> > > > +		{
> > > > +		  use_in_loop = true;
> > > > +		  break;
> > > > +		}
> > > > +	    }
> > > >
> > > > -	      add_phi_arg (phi, orig_arg, new_exit, orig_locus);
> > > > +	  if (!use_in_loop)
> > > > +	    {
> > > > +	       /* Do a final check to see if it's perhaps defined in the loop.  This
> > > > +		  mirrors the relevancy analysis's used_outside_scope.  */
> > > > +	      gimple *stmt = SSA_NAME_DEF_STMT (vm.def);
> > > > +	      if (!stmt || !flow_bb_inside_loop_p (loop, gimple_bb (stmt)))
> > > > +		continue;
> > > >  	    }
> 
> since the def was on a LC PHI the def should always be defined inside the
> loop.
> 
> > > > +
> > > > +	  tree new_res = copy_ssa_name (vm.result);
> > > > +	  gphi *lcssa_phi = create_phi_node (new_res, e->dest);
> > > > +	  for (edge exit : loop_exits)
> > > > +	     add_phi_arg (lcssa_phi, vm.def, exit, vm.locus);
> > >
> > > not sure what you are doing above - I guess I have to play with it
> > > in a debug session.
> >
> > Yeah if you comment it out one of the testcases should fail.
> 
> using new_preheader instead of e->dest would make things clearer.
> 
> You are now adding the same arg to every exit (you've just queried the
> main exit redirect_edge_var_map_vector).
> 
> OK, so I think I understand what you're doing.  If I understand
> correctly we know that when we exit the main loop via one of the
> early exits we are definitely going to enter the epilog but when
> we take the main exit we might not.

Correct

> 
> Looking at the CFG we create currently this isn't reflected and
> this complicates this PHI node updating.  What I'd try to do
> is leave redirecting the alternate exits until after
> slpeel_tree_duplicate_loop_to_edge_cfg finished which probably
> means leaving it almost unchanged besides the LC SSA maintaining
> changes.  After that for the multi-exit case split the
> epilog preheader edge and redirect all the alternate exits to the
> new preheader.  So the CFG becomes
> 
>                  <original loop>
>                 /      |
>                /    <main exit w/ original LC PHI>
>               /      if (epilog)
>    alt exits /        /  \
>             /        /    loop around
>             |       /
>            preheader with "header" PHIs
>               |
>           <epilog>
> 
> note you need the header PHIs also on the main exit path but you
> only need the loop end PHIs there.

Ah, hadn't considered this one.  I'll give it a try.

> 
> It seems so that at least currently the order of things makes
> them more complicated than necessary.

Possibly yeah, there's a lot of work to maintain the dominators
and phi nodes because the exits are rewritten early.

I'll give this a go!

> 
> > >
> > > >  	}
> > > > -      redirect_edge_and_branch_force (e, new_preheader);
> > > > -      flush_pending_stmts (e);
> > > > +
> > > >        set_immediate_dominator (CDI_DOMINATORS, new_preheader, e-
> >src);
> > > > -      if (was_imm_dom || duplicate_outer_loop)
> > > > +
> > > > +      if ((was_imm_dom || duplicate_outer_loop) && !multiple_exits_p)
> > > >  	set_immediate_dominator (CDI_DOMINATORS, exit_dest, new_exit-
> > > >src);
> > > >
> > > >        /* And remove the non-necessary forwarder again.  Keep the other
> > > > @@ -1647,9 +1756,42 @@ slpeel_tree_duplicate_loop_to_edge_cfg
> (class
> > > loop *loop,
> > > >        delete_basic_block (preheader);
> > > >        set_immediate_dominator (CDI_DOMINATORS, scalar_loop->header,
> > > >  			       loop_preheader_edge (scalar_loop)->src);
> > > > +
> > > > +      /* Finally after wiring the new epilogue we need to update its main
> exit
> > > > +	 to the original function exit we recorded.  Other exits are already
> > > > +	 correct.  */
> > > > +      if (multiple_exits_p)
> > > > +	{
> > > > +	  for (edge e : get_loop_exit_edges (loop))
> > > > +	    doms.safe_push (e->dest);
> > > > +	  update_loop = new_loop;
> > > > +	  doms.safe_push (exit_dest);
> > > > +
> > > > +	  /* Likely a fall-through edge, so update if needed.  */
> > > > +	  if (single_succ_p (exit_dest))
> > > > +	    doms.safe_push (single_succ (exit_dest));
> > > > +	}
> > > >      }
> > > >    else /* Add the copy at entry.  */
> > > >      {
> > > > +      /* Copy the current loop LC PHI nodes between the original loop exit
> > > > +	 block and the new loop header.  This allows us to later split the
> > > > +	 preheader block and still find the right LC nodes.  */
> > > > +      edge old_latch_loop = loop_latch_edge (loop);
> > > > +      edge old_latch_init = loop_preheader_edge (loop);
> > > > +      edge new_latch_loop = loop_latch_edge (new_loop);
> > > > +      edge new_latch_init = loop_preheader_edge (new_loop);
> > > > +      for (auto gsi_from = gsi_start_phis (new_latch_init->dest),
> > >
> > > see above
> > >
> > > > +	   gsi_to = gsi_start_phis (old_latch_loop->dest);
> > > > +	   flow_loops && !gsi_end_p (gsi_from) && !gsi_end_p (gsi_to);
> > > > +	   gsi_next (&gsi_from), gsi_next (&gsi_to))
> > > > +	{
> > > > +	  gimple *from_phi = gsi_stmt (gsi_from);
> > > > +	  gimple *to_phi = gsi_stmt (gsi_to);
> > > > +	  tree new_arg = PHI_ARG_DEF_FROM_EDGE (from_phi,
> > > new_latch_loop);
> > > > +	  adjust_phi_and_debug_stmts (to_phi, old_latch_init, new_arg);
> > > > +	}
> > > > +
> > > >        if (scalar_loop != loop)
> > > >  	{
> > > >  	  /* Remove the non-necessary forwarder of scalar_loop again.  */
> > > > @@ -1677,31 +1819,36 @@ slpeel_tree_duplicate_loop_to_edge_cfg
> (class
> > > loop *loop,
> > > >        delete_basic_block (new_preheader);
> > > >        set_immediate_dominator (CDI_DOMINATORS, new_loop->header,
> > > >  			       loop_preheader_edge (new_loop)->src);
> > > > +
> > > > +      if (multiple_exits_p)
> > > > +	update_loop = loop;
> > > >      }
> > > >
> > > > -  if (scalar_loop != loop)
> > > > +  if (multiple_exits_p)
> > > >      {
> > > > -      /* Update new_loop->header PHIs, so that on the preheader
> > > > -	 edge they are the ones from loop rather than scalar_loop.  */
> > > > -      gphi_iterator gsi_orig, gsi_new;
> > > > -      edge orig_e = loop_preheader_edge (loop);
> > > > -      edge new_e = loop_preheader_edge (new_loop);
> > > > -
> > > > -      for (gsi_orig = gsi_start_phis (loop->header),
> > > > -	   gsi_new = gsi_start_phis (new_loop->header);
> > > > -	   !gsi_end_p (gsi_orig) && !gsi_end_p (gsi_new);
> > > > -	   gsi_next (&gsi_orig), gsi_next (&gsi_new))
> > > > +      for (edge e : get_loop_exit_edges (update_loop))
> > > >  	{
> > > > -	  gphi *orig_phi = gsi_orig.phi ();
> > > > -	  gphi *new_phi = gsi_new.phi ();
> > > > -	  tree orig_arg = PHI_ARG_DEF_FROM_EDGE (orig_phi, orig_e);
> > > > -	  location_t orig_locus
> > > > -	    = gimple_phi_arg_location_from_edge (orig_phi, orig_e);
> > > > -
> > > > -	  add_phi_arg (new_phi, orig_arg, new_e, orig_locus);
> > > > +	  edge ex;
> > > > +	  edge_iterator ei;
> > > > +	  FOR_EACH_EDGE (ex, ei, e->dest->succs)
> > > > +	    {
> > > > +	      /* Find the first non-fallthrough block as fall-throughs can't
> > > > +		 dominate other blocks.  */
> > > > +	      while ((ex->flags & EDGE_FALLTHRU)
> 
> For the prologue peeling any early exit we take would skip all other
> loops so we can simply leave them and their LC PHI nodes in place.
> We need extra PHIs only on the path to the main vector loop.  I
> think the comment isn't accurately reflecting what we do.  In
> fact we do not add any LC PHI nodes here but simply adjust the
> main loop header PHI arguments?

Yeah we don't create any new nodes after peeling in this version.

> 
> > > I don't think EDGE_FALLTHRU is set correctly, what's wrong with
> > > just using single_succ_p here?  A fallthru edge src dominates the
> > > fallthru edge dest, so the sentence above doesn't make sense.
> >
> > I wanted to say, that the immediate dominator of a block is never
> > an fall through block.  At least from what I understood from how
> > the dominators are calculated in the code, though may have missed
> > something.
> 
>  BB1
>   |
>  BB2
>   |
>  BB3
> 
> here the immediate dominator of BB3 is BB2 and that of BB2 is BB1.
> 
> > >
> > > > +		     && single_succ_p (ex->dest))
> > > > +		{
> > > > +		  doms.safe_push (ex->dest);
> > > > +		  ex = single_succ_edge (ex->dest);
> > > > +		}
> > > > +	      doms.safe_push (ex->dest);
> > > > +	    }
> > > > +	  doms.safe_push (e->dest);
> > > >  	}
> > > > -    }
> > > >
> > > > +      iterate_fix_dominators (CDI_DOMINATORS, doms, false);
> > > > +      if (updated_doms)
> > > > +	updated_doms->safe_splice (doms);
> > > > +    }
> > > >    free (new_bbs);
> > > >    free (bbs);
> > > >
> > > > @@ -1777,6 +1924,9 @@ slpeel_can_duplicate_loop_p (const
> > > loop_vec_info loop_vinfo, const_edge e)
> > > >    gimple_stmt_iterator loop_exit_gsi = gsi_last_bb (exit_e->src);
> > > >    unsigned int num_bb = loop->inner? 5 : 2;
> > > >
> > > > +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > +    num_bb += LOOP_VINFO_ALT_EXITS (loop_vinfo).length ();
> > > > +
> > >
> > > I think checking the number of BBs is odd, I don't remember anything
> > > in slpeel is specifically tied to that?  I think we can simply drop
> > > this or do you remember anything that would depend on ->num_nodes
> > > being only exactly 5 or 2?
> >
> > Never actually seemed to require it, but they're used as some check to
> > see if there are unexpected control flow in the loop.
> >
> > i.e. this would say no if you have an if statement in the loop that wasn't
> > converted.  The other part of this and the accompanying explanation is in
> > vect_analyze_loop_form.  In the patch series I had to remove the hard
> > num_nodes == 2 check from there because number of nodes restricted
> > things too much.  If you have an empty fall through block, which seems to
> > happen often between the main exit and the latch block then we'd not
> > vectorize.
> >
> > So instead I now rejects loops after analyzing the gcond.  So think this check
> > can go/needs to be different.
> 
> Lets remove it from this function then.

Ok, I can remove it from the outerloop vect then too, since it's mostly dead code.
Will do so and reg-test that as well.

> 
> > >
> > > >    /* All loops have an outer scope; the only case loop->outer is NULL is for
> > > >       the function itself.  */
> > > >    if (!loop_outer (loop)
> > > > @@ -2044,6 +2194,11 @@ vect_update_ivs_after_vectorizer
> > > (loop_vec_info loop_vinfo,
> > > >    class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > > >    basic_block update_bb = update_e->dest;
> > > >
> > > > +  /* For early exits we'll update the IVs in
> > > > +     vect_update_ivs_after_early_break.  */
> > > > +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > +    return;
> > > > +
> > > >    basic_block exit_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
> > > >
> > > >    /* Make sure there exists a single-predecessor exit bb:  */
> > > > @@ -2131,6 +2286,208 @@ vect_update_ivs_after_vectorizer
> > > (loop_vec_info loop_vinfo,
> > > >        /* Fix phi expressions in the successor bb.  */
> > > >        adjust_phi_and_debug_stmts (phi1, update_e, ni_name);
> > > >      }
> > > > +  return;
> > >
> > > we don't usually place a return at the end of void functions
> > >
> > > > +}
> > > > +
> > > > +/*   Function vect_update_ivs_after_early_break.
> > > > +
> > > > +     "Advance" the induction variables of LOOP to the value they should
> take
> > > > +     after the execution of LOOP.  This is currently necessary because the
> > > > +     vectorizer does not handle induction variables that are used after the
> > > > +     loop.  Such a situation occurs when the last iterations of LOOP are
> > > > +     peeled, because of the early exit.  With an early exit we always peel
> the
> > > > +     loop.
> > > > +
> > > > +     Input:
> > > > +     - LOOP_VINFO - a loop info structure for the loop that is going to be
> > > > +		    vectorized. The last few iterations of LOOP were peeled.
> > > > +     - LOOP - a loop that is going to be vectorized. The last few iterations
> > > > +	      of LOOP were peeled.
> > > > +     - VF - The loop vectorization factor.
> > > > +     - NITERS_ORIG - the number of iterations that LOOP executes (before
> it is
> > > > +		     vectorized). i.e, the number of times the ivs should be
> > > > +		     bumped.
> > > > +     - NITERS_VECTOR - The number of iterations that the vector LOOP
> > > executes.
> > > > +     - UPDATE_E - a successor edge of LOOP->exit that is on the (only)
> path
> > > > +		  coming out from LOOP on which there are uses of the LOOP
> > > ivs
> > > > +		  (this is the path from LOOP->exit to epilog_loop->preheader).
> > > > +
> > > > +		  The new definitions of the ivs are placed in LOOP->exit.
> > > > +		  The phi args associated with the edge UPDATE_E in the bb
> > > > +		  UPDATE_E->dest are updated accordingly.
> > > > +
> > > > +     Output:
> > > > +       - If available, the LCSSA phi node for the loop IV temp.
> > > > +
> > > > +     Assumption 1: Like the rest of the vectorizer, this function assumes
> > > > +     a single loop exit that has a single predecessor.
> > > > +
> > > > +     Assumption 2: The phi nodes in the LOOP header and in update_bb
> are
> > > > +     organized in the same order.
> > > > +
> > > > +     Assumption 3: The access function of the ivs is simple enough (see
> > > > +     vect_can_advance_ivs_p).  This assumption will be relaxed in the
> future.
> > > > +
> > > > +     Assumption 4: Exactly one of the successors of LOOP exit-bb is on a
> path
> > > > +     coming out of LOOP on which the ivs of LOOP are used (this is the
> path
> > > > +     that leads to the epilog loop; other paths skip the epilog loop).  This
> > > > +     path starts with the edge UPDATE_E, and its destination (denoted
> > > update_bb)
> > > > +     needs to have its phis updated.
> > > > + */
> > > > +
> > > > +static tree
> > > > +vect_update_ivs_after_early_break (loop_vec_info loop_vinfo, class
> loop *
> > > epilog,
> > > > +				   poly_int64 vf, tree niters_orig,
> > > > +				   tree niters_vector, edge update_e)
> > > > +{
> > > > +  if (!LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > +    return NULL;
> > > > +
> > > > +  gphi_iterator gsi, gsi1;
> > > > +  tree ni_name, ivtmp = NULL;
> > > > +  basic_block update_bb = update_e->dest;
> > > > +  vec<edge> alt_exits = LOOP_VINFO_ALT_EXITS (loop_vinfo);
> > > > +  edge loop_iv = LOOP_VINFO_IV_EXIT (loop_vinfo);
> > > > +  basic_block exit_bb = loop_iv->dest;
> > > > +  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > > > +  gcond *cond = LOOP_VINFO_LOOP_IV_COND (loop_vinfo);
> > > > +
> > > > +  gcc_assert (cond);
> > > > +
> > > > +  for (gsi = gsi_start_phis (loop->header), gsi1 = gsi_start_phis
> (update_bb);
> > > > +       !gsi_end_p (gsi) && !gsi_end_p (gsi1);
> > > > +       gsi_next (&gsi), gsi_next (&gsi1))
> > > > +    {
> > > > +      tree init_expr, final_expr, step_expr;
> > > > +      tree type;
> > > > +      tree var, ni, off;
> > > > +      gimple_stmt_iterator last_gsi;
> > > > +
> > > > +      gphi *phi = gsi1.phi ();
> > > > +      tree phi_ssa = PHI_ARG_DEF_FROM_EDGE (phi,
> loop_preheader_edge
> > > (epilog));
> > >
> > > I'm confused about the setup.  update_bb looks like the block with the
> > > loop-closed PHI nodes of 'loop' and the exit (update_e)?  How does
> > > loop_preheader_edge (epilog) come into play here?  That would feed into
> > > epilog->header PHIs?!
> >
> > We can't query the type of the phis in the block with the LC PHI nodes, so the
> > Typical pattern seems to be that we iterate over a block that's part of the
> loop
> > and that would have the PHIs in the same order, just so we can get to the
> > stmt_vec_info.
> >
> > >
> > > It would be nice to name 'gsi[1]', 'update_e' and 'update_bb' in a
> > > better way?  Is update_bb really epilog->header?!
> > >
> > > We're missing checking in PHI_ARG_DEF_FROM_EDGE, namely that
> > > E->dest == gimple_bb (PHI) - we're just using E->dest_idx there
> > > which "works" even for totally unrelated edges.
> > >
> > > > +      gphi *phi1 = dyn_cast <gphi *> (SSA_NAME_DEF_STMT (phi_ssa));
> > > > +      if (!phi1)
> > >
> > > shouldn't that be an assert?
> > >
> > > > +	continue;
> > > > +      stmt_vec_info phi_info = loop_vinfo->lookup_stmt (gsi.phi ());
> > > > +      if (dump_enabled_p ())
> > > > +	dump_printf_loc (MSG_NOTE, vect_location,
> > > > +			 "vect_update_ivs_after_early_break: phi: %G",
> > > > +			 (gimple *)phi);
> > > > +
> > > > +      /* Skip reduction and virtual phis.  */
> > > > +      if (!iv_phi_p (phi_info))
> > > > +	{
> > > > +	  if (dump_enabled_p ())
> > > > +	    dump_printf_loc (MSG_NOTE, vect_location,
> > > > +			     "reduc or virtual phi. skip.\n");
> > > > +	  continue;
> > > > +	}
> > > > +
> > > > +      /* For multiple exits where we handle early exits we need to carry on
> > > > +	 with the previous IV as loop iteration was not done because we exited
> > > > +	 early.  As such just grab the original IV.  */
> > > > +      phi_ssa = PHI_ARG_DEF_FROM_EDGE (gsi.phi (), loop_latch_edge
> > > (loop));
> > >
> > > but this should be taken care of by LC SSA?
> >
> > It is, the comment is probably missing details, this part just scales the
> counter
> > from VF to scalar counts.  It's just a reminder that this scaling is done
> differently
> > from normal single exit vectorization.
> >
> > >
> > > OK, have to continue tomorrow from here.
> >
> > Cheers, Thank you!
> >
> > Tamar
> >
> > >
> > > Richard.
> > >
> > > > +      if (gimple_cond_lhs (cond) != phi_ssa
> > > > +	  && gimple_cond_rhs (cond) != phi_ssa)
> 
> so this is a way to avoid touching the main IV?  Looks a bit fragile to
> me.  Hmm, we're iterating over the main loop header PHIs here?
> Can't you check, say, the relevancy of the PHI node instead?  Though
> it might also be used as induction.  Can't it be used as alternate
> exit like
> 
>   for (i)
>    {
>      if (i & bit)
>        break;
>    }
> 
> and would we need to adjust 'i' then?

Hmm you're right, I could reject them based on the definition BB and the
location the main IV is, that's probably safest.

> 
> > > > +	{
> > > > +	  type = TREE_TYPE (gimple_phi_result (phi));
> > > > +	  step_expr = STMT_VINFO_LOOP_PHI_EVOLUTION_PART (phi_info);
> > > > +	  step_expr = unshare_expr (step_expr);
> > > > +
> > > > +	  /* We previously generated the new merged phi in the same BB as
> > > the
> > > > +	     guard.  So use that to perform the scaling on rather than the
> > > > +	     normal loop phi which don't take the early breaks into account.  */
> > > > +	  final_expr = gimple_phi_result (phi1);
> > > > +	  init_expr = PHI_ARG_DEF_FROM_EDGE (gsi.phi (),
> > > loop_preheader_edge (loop));
> > > > +
> > > > +	  tree stype = TREE_TYPE (step_expr);
> > > > +	  /* For early break the final loop IV is:
> > > > +	     init + (final - init) * vf which takes into account peeling
> > > > +	     values and non-single steps.  */
> > > > +	  off = fold_build2 (MINUS_EXPR, stype,
> > > > +			     fold_convert (stype, final_expr),
> > > > +			     fold_convert (stype, init_expr));
> > > > +	  /* Now adjust for VF to get the final iteration value.  */
> > > > +	  off = fold_build2 (MULT_EXPR, stype, off, build_int_cst (stype, vf));
> > > > +
> > > > +	  /* Adjust the value with the offset.  */
> > > > +	  if (POINTER_TYPE_P (type))
> > > > +	    ni = fold_build_pointer_plus (init_expr, off);
> > > > +	  else
> > > > +	    ni = fold_convert (type,
> > > > +			       fold_build2 (PLUS_EXPR, stype,
> > > > +					    fold_convert (stype, init_expr),
> > > > +					    off));
> > > > +	  var = create_tmp_var (type, "tmp");
> 
> so how does the non-early break code deal with updating inductions?

non-early break code does this essentially the same, see vect_update_ivs_after_vectorizer
which was modified to skip early exits.  The major difference is that on a non-early
exit the value can just be adjusted linearly: i_scalar = i_vect * VF + init and peeling is taken
into account by vect_peel_nonlinear_iv_init. 

I wanted to keep them separately as there's a significant enough difference in the
calculation of the loop bodies themselves that having one function became unwieldy.

That and we don't support all of the induction steps a non-early exit supports so I inlined
a simpler calculation.

Now there's a big difference between the normal loop vectorization and early break.
During the main loop vectorization the ivtmp is adjusted for peeling.  That is the amount is
already adjusted in the phi node itself.

For early break this isn't done because the bounds are fixed, i.e. for every exit we can at
most do VF iterations.

> And how do you avoid altering this when we flow in from the normal
> exit?  That is, you are updating the value in the epilog loop
> header but don't you need to instead update the value only on
> the alternate exit edges from the main loop (and keep the not
> updated value on the main exit edge)?

Because the ivtmps has not been adjusted for peeling, we adjust for it here.  This allows us
to have the same update required for all exits, so I don't have to differentiate between the
two here because I didn't do so during creation of the PHI node.

> 
> > > > +	  last_gsi = gsi_last_bb (exit_bb);
> > > > +	  gimple_seq new_stmts = NULL;
> > > > +	  ni_name = force_gimple_operand (ni, &new_stmts, false, var);
> > > > +	  /* Exit_bb shouldn't be empty.  */
> > > > +	  if (!gsi_end_p (last_gsi))
> > > > +	    gsi_insert_seq_after (&last_gsi, new_stmts, GSI_SAME_STMT);
> > > > +	  else
> > > > +	    gsi_insert_seq_before (&last_gsi, new_stmts, GSI_SAME_STMT);
> > > > +
> > > > +	  /* Fix phi expressions in the successor bb.  */
> > > > +	  adjust_phi_and_debug_stmts (phi, update_e, ni_name);
> > > > +	}
> > > > +      else
> > > > +	{
> > > > +	  type = TREE_TYPE (gimple_phi_result (phi));
> > > > +	  step_expr = STMT_VINFO_LOOP_PHI_EVOLUTION_PART (phi_info);
> > > > +	  step_expr = unshare_expr (step_expr);
> > > > +
> > > > +	  /* We previously generated the new merged phi in the same BB as
> > > the
> > > > +	     guard.  So use that to perform the scaling on rather than the
> > > > +	     normal loop phi which don't take the early breaks into account.  */
> > > > +	  init_expr = PHI_ARG_DEF_FROM_EDGE (phi1, loop_preheader_edge
> > > (loop));
> > > > +	  tree stype = TREE_TYPE (step_expr);
> > > > +
> > > > +	  if (vf.is_constant ())
> > > > +	    {
> > > > +	      ni = fold_build2 (MULT_EXPR, stype,
> > > > +				fold_convert (stype,
> > > > +					      niters_vector),
> > > > +				build_int_cst (stype, vf));
> > > > +
> > > > +	      ni = fold_build2 (MINUS_EXPR, stype,
> > > > +				fold_convert (stype,
> > > > +					      niters_orig),
> > > > +				fold_convert (stype, ni));
> > > > +	    }
> > > > +	  else
> > > > +	    /* If the loop's VF isn't constant then the loop must have been
> > > > +	       masked, so at the end of the loop we know we have finished
> > > > +	       the entire loop and found nothing.  */
> > > > +	    ni = build_zero_cst (stype);
> > > > +
> > > > +	  ni = fold_convert (type, ni);
> > > > +	  /* We don't support variable n in this version yet.  */
> > > > +	  gcc_assert (TREE_CODE (ni) == INTEGER_CST);
> > > > +
> > > > +	  var = create_tmp_var (type, "tmp");
> > > > +
> > > > +	  last_gsi = gsi_last_bb (exit_bb);
> > > > +	  gimple_seq new_stmts = NULL;
> > > > +	  ni_name = force_gimple_operand (ni, &new_stmts, false, var);
> > > > +	  /* Exit_bb shouldn't be empty.  */
> > > > +	  if (!gsi_end_p (last_gsi))
> > > > +	    gsi_insert_seq_after (&last_gsi, new_stmts, GSI_SAME_STMT);
> > > > +	  else
> > > > +	    gsi_insert_seq_before (&last_gsi, new_stmts, GSI_SAME_STMT);
> > > > +
> > > > +	  adjust_phi_and_debug_stmts (phi1, loop_iv, ni_name);
> > > > +
> > > > +	  for (edge exit : alt_exits)
> > > > +	    adjust_phi_and_debug_stmts (phi1, exit,
> > > > +					build_int_cst (TREE_TYPE (step_expr),
> > > > +						       vf));
> > > > +	  ivtmp = gimple_phi_result (phi1);
> > > > +	}
> > > > +    }
> > > > +
> > > > +  return ivtmp;
> > > >  }
> > > >
> > > >  /* Return a gimple value containing the misalignment (measured in
> vector
> > > > @@ -2632,137 +2989,34 @@ vect_gen_vector_loop_niters_mult_vf
> > > (loop_vec_info loop_vinfo,
> > > >
> > > >  /* LCSSA_PHI is a lcssa phi of EPILOG loop which is copied from LOOP,
> > > >     this function searches for the corresponding lcssa phi node in exit
> > > > -   bb of LOOP.  If it is found, return the phi result; otherwise return
> > > > -   NULL.  */
> > > > +   bb of LOOP following the LCSSA_EDGE to the exit node.  If it is found,
> > > > +   return the phi result; otherwise return NULL.  */
> > > >
> > > >  static tree
> > > >  find_guard_arg (class loop *loop, class loop *epilog
> ATTRIBUTE_UNUSED,
> > > > -		gphi *lcssa_phi)
> > > > +		gphi *lcssa_phi, int lcssa_edge = 0)
> > > >  {
> > > >    gphi_iterator gsi;
> > > >    edge e = loop->vec_loop_iv;
> > > >
> > > > -  gcc_assert (single_pred_p (e->dest));
> > > >    for (gsi = gsi_start_phis (e->dest); !gsi_end_p (gsi); gsi_next (&gsi))
> > > >      {
> > > >        gphi *phi = gsi.phi ();
> > > > -      if (operand_equal_p (PHI_ARG_DEF (phi, 0),
> > > > -			   PHI_ARG_DEF (lcssa_phi, 0), 0))
> > > > -	return PHI_RESULT (phi);
> > > > -    }
> > > > -  return NULL_TREE;
> > > > -}
> > > > -
> > > > -/* Function slpeel_tree_duplicate_loop_to_edge_cfg duplciates
> > > FIRST/SECOND
> > > > -   from SECOND/FIRST and puts it at the original loop's preheader/exit
> > > > -   edge, the two loops are arranged as below:
> > > > -
> > > > -       preheader_a:
> > > > -     first_loop:
> > > > -       header_a:
> > > > -	 i_1 = PHI<i_0, i_2>;
> > > > -	 ...
> > > > -	 i_2 = i_1 + 1;
> > > > -	 if (cond_a)
> > > > -	   goto latch_a;
> > > > -	 else
> > > > -	   goto between_bb;
> > > > -       latch_a:
> > > > -	 goto header_a;
> > > > -
> > > > -       between_bb:
> > > > -	 ;; i_x = PHI<i_2>;   ;; LCSSA phi node to be created for FIRST,
> > > > -
> > > > -     second_loop:
> > > > -       header_b:
> > > > -	 i_3 = PHI<i_0, i_4>; ;; Use of i_0 to be replaced with i_x,
> > > > -				 or with i_2 if no LCSSA phi is created
> > > > -				 under condition of
> > > CREATE_LCSSA_FOR_IV_PHIS.
> > > > -	 ...
> > > > -	 i_4 = i_3 + 1;
> > > > -	 if (cond_b)
> > > > -	   goto latch_b;
> > > > -	 else
> > > > -	   goto exit_bb;
> > > > -       latch_b:
> > > > -	 goto header_b;
> > > > -
> > > > -       exit_bb:
> > > > -
> > > > -   This function creates loop closed SSA for the first loop; update the
> > > > -   second loop's PHI nodes by replacing argument on incoming edge with
> the
> > > > -   result of newly created lcssa PHI nodes.  IF
> CREATE_LCSSA_FOR_IV_PHIS
> > > > -   is false, Loop closed ssa phis will only be created for non-iv phis for
> > > > -   the first loop.
> > > > -
> > > > -   This function assumes exit bb of the first loop is preheader bb of the
> > > > -   second loop, i.e, between_bb in the example code.  With PHIs updated,
> > > > -   the second loop will execute rest iterations of the first.  */
> > > > -
> > > > -static void
> > > > -slpeel_update_phi_nodes_for_loops (loop_vec_info loop_vinfo,
> > > > -				   class loop *first, class loop *second,
> > > > -				   bool create_lcssa_for_iv_phis)
> > > > -{
> > > > -  gphi_iterator gsi_update, gsi_orig;
> > > > -  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > > > -
> > > > -  edge first_latch_e = EDGE_SUCC (first->latch, 0);
> > > > -  edge second_preheader_e = loop_preheader_edge (second);
> > > > -  basic_block between_bb = single_exit (first)->dest;
> > > > -
> > > > -  gcc_assert (between_bb == second_preheader_e->src);
> > > > -  gcc_assert (single_pred_p (between_bb) && single_succ_p
> (between_bb));
> > > > -  /* Either the first loop or the second is the loop to be vectorized.  */
> > > > -  gcc_assert (loop == first || loop == second);
> > > > -
> > > > -  for (gsi_orig = gsi_start_phis (first->header),
> > > > -       gsi_update = gsi_start_phis (second->header);
> > > > -       !gsi_end_p (gsi_orig) && !gsi_end_p (gsi_update);
> > > > -       gsi_next (&gsi_orig), gsi_next (&gsi_update))
> > > > -    {
> > > > -      gphi *orig_phi = gsi_orig.phi ();
> > > > -      gphi *update_phi = gsi_update.phi ();
> > > > -
> > > > -      tree arg = PHI_ARG_DEF_FROM_EDGE (orig_phi, first_latch_e);
> > > > -      /* Generate lcssa PHI node for the first loop.  */
> > > > -      gphi *vect_phi = (loop == first) ? orig_phi : update_phi;
> > > > -      stmt_vec_info vect_phi_info = loop_vinfo->lookup_stmt (vect_phi);
> > > > -      if (create_lcssa_for_iv_phis || !iv_phi_p (vect_phi_info))
> > > > +      /* Nested loops with multiple exits can have different no# phi node
> > > > +	 arguments between the main loop and epilog as epilog falls to the
> > > > +	 second loop.  */
> > > > +      if (gimple_phi_num_args (phi) > e->dest_idx)
> > > >  	{
> > > > -	  tree new_res = copy_ssa_name (PHI_RESULT (orig_phi));
> > > > -	  gphi *lcssa_phi = create_phi_node (new_res, between_bb);
> > > > -	  add_phi_arg (lcssa_phi, arg, single_exit (first),
> > > UNKNOWN_LOCATION);
> > > > -	  arg = new_res;
> > > > -	}
> > > > -
> > > > -      /* Update PHI node in the second loop by replacing arg on the loop's
> > > > -	 incoming edge.  */
> > > > -      adjust_phi_and_debug_stmts (update_phi, second_preheader_e,
> arg);
> > > > -    }
> > > > -
> > > > -  /* For epilogue peeling we have to make sure to copy all LC PHIs
> > > > -     for correct vectorization of live stmts.  */
> > > > -  if (loop == first)
> > > > -    {
> > > > -      basic_block orig_exit = single_exit (second)->dest;
> > > > -      for (gsi_orig = gsi_start_phis (orig_exit);
> > > > -	   !gsi_end_p (gsi_orig); gsi_next (&gsi_orig))
> > > > -	{
> > > > -	  gphi *orig_phi = gsi_orig.phi ();
> > > > -	  tree orig_arg = PHI_ARG_DEF (orig_phi, 0);
> > > > -	  if (TREE_CODE (orig_arg) != SSA_NAME || virtual_operand_p
> > > (orig_arg))
> > > > -	    continue;
> > > > -
> > > > -	  /* Already created in the above loop.   */
> > > > -	  if (find_guard_arg (first, second, orig_phi))
> > > > +	  tree var = PHI_ARG_DEF (phi, e->dest_idx);
> > > > +	  if (TREE_CODE (var) != SSA_NAME)
> > > >  	    continue;
> > > >
> > > > -	  tree new_res = copy_ssa_name (orig_arg);
> > > > -	  gphi *lcphi = create_phi_node (new_res, between_bb);
> > > > -	  add_phi_arg (lcphi, orig_arg, single_exit (first),
> > > UNKNOWN_LOCATION);
> > > > +	  if (operand_equal_p (get_current_def (var),
> > > > +			       PHI_ARG_DEF (lcssa_phi, lcssa_edge), 0))
> > > > +	    return PHI_RESULT (phi);
> > > >  	}
> > > >      }
> > > > +  return NULL_TREE;
> > > >  }
> > > >
> > > >  /* Function slpeel_add_loop_guard adds guard skipping from the
> beginning
> > > > @@ -2910,13 +3164,11 @@ slpeel_update_phi_nodes_for_guard2
> (class
> > > loop *loop, class loop *epilog,
> > > >    gcc_assert (single_succ_p (merge_bb));
> > > >    edge e = single_succ_edge (merge_bb);
> > > >    basic_block exit_bb = e->dest;
> > > > -  gcc_assert (single_pred_p (exit_bb));
> > > > -  gcc_assert (single_pred (exit_bb) == single_exit (epilog)->dest);
> > > >
> > > >    for (gsi = gsi_start_phis (exit_bb); !gsi_end_p (gsi); gsi_next (&gsi))
> > > >      {
> > > >        gphi *update_phi = gsi.phi ();
> > > > -      tree old_arg = PHI_ARG_DEF (update_phi, 0);
> > > > +      tree old_arg = PHI_ARG_DEF (update_phi, e->dest_idx);
> > > >
> > > >        tree merge_arg = NULL_TREE;
> > > >
> > > > @@ -2928,7 +3180,7 @@ slpeel_update_phi_nodes_for_guard2 (class
> loop
> > > *loop, class loop *epilog,
> > > >        if (!merge_arg)
> > > >  	merge_arg = old_arg;
> > > >
> > > > -      tree guard_arg = find_guard_arg (loop, epilog, update_phi);
> > > > +      tree guard_arg = find_guard_arg (loop, epilog, update_phi, e-
> >dest_idx);
> > > >        /* If the var is live after loop but not a reduction, we simply
> > > >  	 use the old arg.  */
> > > >        if (!guard_arg)
> > > > @@ -2948,21 +3200,6 @@ slpeel_update_phi_nodes_for_guard2 (class
> > > loop *loop, class loop *epilog,
> > > >      }
> > > >  }
> > > >
> > > > -/* EPILOG loop is duplicated from the original loop for vectorizing,
> > > > -   the arg of its loop closed ssa PHI needs to be updated.  */
> > > > -
> > > > -static void
> > > > -slpeel_update_phi_nodes_for_lcssa (class loop *epilog)
> > > > -{
> > > > -  gphi_iterator gsi;
> > > > -  basic_block exit_bb = single_exit (epilog)->dest;
> > > > -
> > > > -  gcc_assert (single_pred_p (exit_bb));
> > > > -  edge e = EDGE_PRED (exit_bb, 0);
> > > > -  for (gsi = gsi_start_phis (exit_bb); !gsi_end_p (gsi); gsi_next (&gsi))
> > > > -    rename_use_op (PHI_ARG_DEF_PTR_FROM_EDGE (gsi.phi (), e));
> > > > -}
> > > > -
> 
> I wonder if we can still split these changes out to before early break
> vect?

Maybe, I hadn't done so before because I was redirecting the edges during peeling.
If we no longer do that that may be easier.  Let me try to.

> 
> > > >  /* EPILOGUE_VINFO is an epilogue loop that we now know would need
> to
> > > >     iterate exactly CONST_NITERS times.  Make a final decision about
> > > >     whether the epilogue loop should be used, returning true if so.  */
> > > > @@ -3138,6 +3375,14 @@ vect_do_peeling (loop_vec_info loop_vinfo,
> tree
> > > niters, tree nitersm1,
> > > >      bound_epilog += vf - 1;
> > > >    if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
> > > >      bound_epilog += 1;
> > > > +  /* For early breaks the scalar loop needs to execute at most VF times
> > > > +     to find the element that caused the break.  */
> > > > +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > +    {
> > > > +      bound_epilog = vf;
> > > > +      /* Force a scalar epilogue as we can't vectorize the index finding.  */
> > > > +      vect_epilogues = false;
> > > > +    }
> > > >    bool epilog_peeling = maybe_ne (bound_epilog, 0U);
> > > >    poly_uint64 bound_scalar = bound_epilog;
> > > >
> > > > @@ -3297,16 +3542,24 @@ vect_do_peeling (loop_vec_info
> loop_vinfo,
> > > tree niters, tree nitersm1,
> > > >  				  bound_prolog + bound_epilog)
> > > >  		      : (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
> > > >  			 || vect_epilogues));
> > > > +
> > > > +  /* We only support early break vectorization on known bounds at this
> > > time.
> > > > +     This means that if the vector loop can't be entered then we won't
> > > generate
> > > > +     it at all.  So for now force skip_vector off because the additional
> control
> > > > +     flow messes with the BB exits and we've already analyzed them.  */
> > > > + skip_vector = skip_vector && !LOOP_VINFO_EARLY_BREAKS
> (loop_vinfo);
> > > > +
> 
> I think it should be as "easy" as entering the epilog via the block taking
> the regular exit?
> 
> > > >    /* Epilog loop must be executed if the number of iterations for epilog
> > > >       loop is known at compile time, otherwise we need to add a check at
> > > >       the end of vector loop and skip to the end of epilog loop.  */
> > > >    bool skip_epilog = (prolog_peeling < 0
> > > >  		      || !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > > >  		      || !vf.is_constant ());
> > > > -  /* PEELING_FOR_GAPS is special because epilog loop must be executed.
> */
> > > > -  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
> > > > +  /* PEELING_FOR_GAPS and peeling for early breaks are special because
> > > epilog
> > > > +     loop must be executed.  */
> > > > +  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> > > > +      || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > >      skip_epilog = false;
> > > > -
> > > >    class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
> > > >    auto_vec<profile_count> original_counts;
> > > >    basic_block *original_bbs = NULL;
> > > > @@ -3344,13 +3597,13 @@ vect_do_peeling (loop_vec_info
> loop_vinfo,
> > > tree niters, tree nitersm1,
> > > >    if (prolog_peeling)
> > > >      {
> > > >        e = loop_preheader_edge (loop);
> > > > -      gcc_checking_assert (slpeel_can_duplicate_loop_p (loop, e));
> > > > -
> > > > +      gcc_checking_assert (slpeel_can_duplicate_loop_p (loop_vinfo, e));
> > > >        /* Peel prolog and put it on preheader edge of loop.  */
> > > > -      prolog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, scalar_loop,
> e);
> > > > +      prolog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, scalar_loop,
> e,
> > > > +						       true);
> > > >        gcc_assert (prolog);
> > > >        prolog->force_vectorize = false;
> > > > -      slpeel_update_phi_nodes_for_loops (loop_vinfo, prolog, loop, true);
> > > > +
> > > >        first_loop = prolog;
> > > >        reset_original_copy_tables ();
> > > >
> > > > @@ -3420,11 +3673,12 @@ vect_do_peeling (loop_vec_info
> loop_vinfo,
> > > tree niters, tree nitersm1,
> > > >  	 as the transformations mentioned above make less or no sense when
> > > not
> > > >  	 vectorizing.  */
> > > >        epilog = vect_epilogues ? get_loop_copy (loop) : scalar_loop;
> > > > -      epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, epilog, e);
> > > > +      auto_vec<basic_block> doms;
> > > > +      epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, epilog, e,
> true,
> > > > +						       &doms);
> > > >        gcc_assert (epilog);
> > > >
> > > >        epilog->force_vectorize = false;
> > > > -      slpeel_update_phi_nodes_for_loops (loop_vinfo, loop, epilog, false);
> > > >
> > > >        /* Scalar version loop may be preferred.  In this case, add guard
> > > >  	 and skip to epilog.  Note this only happens when the number of
> > > > @@ -3496,6 +3750,54 @@ vect_do_peeling (loop_vec_info loop_vinfo,
> tree
> > > niters, tree nitersm1,
> > > >        vect_update_ivs_after_vectorizer (loop_vinfo, niters_vector_mult_vf,
> > > >  					update_e);
> > > >
> > > > +      /* For early breaks we must create a guard to check how many
> iterations
> > > > +	 of the scalar loop are yet to be performed.  */
> 
> We have this check anyway, no?  In fact don't we know that we always enter
> the epilog (see above)?

Not always, masked loops for instance never enter the epilogue if the main loop
finished completely since there is no "remainder".

> 
> > > > +      if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > +	{
> > > > +	  tree ivtmp =
> > > > +	    vect_update_ivs_after_early_break (loop_vinfo, epilog, vf, niters,
> > > > +					       *niters_vector, update_e);
> > > > +
> > > > +	  gcc_assert (ivtmp);
> > > > +	  tree guard_cond = fold_build2 (EQ_EXPR, boolean_type_node,
> > > > +					 fold_convert (TREE_TYPE (niters),
> > > > +						       ivtmp),
> > > > +					 build_zero_cst (TREE_TYPE (niters)));
> > > > +	  basic_block guard_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
> > > > +
> > > > +	  /* If we had a fallthrough edge, the guard will the threaded through
> > > > +	     and so we may need to find the actual final edge.  */
> > > > +	  edge final_edge = epilog->vec_loop_iv;
> > > > +	  /* slpeel_update_phi_nodes_for_guard2 expects an empty block in
> > > > +	     between the guard and the exit edge.  It only adds new nodes and
> > > > +	     doesn't update existing one in the current scheme.  */
> > > > +	  basic_block guard_to = split_edge (final_edge);
> > > > +	  edge guard_e = slpeel_add_loop_guard (guard_bb, guard_cond,
> > > guard_to,
> > > > +						guard_bb, prob_epilog.invert
> > > (),
> > > > +						irred_flag);
> > > > +	  doms.safe_push (guard_bb);
> > > > +
> > > > +	  iterate_fix_dominators (CDI_DOMINATORS, doms, false);
> > > > +
> > > > +	  /* We must update all the edges from the new guard_bb.  */
> > > > +	  slpeel_update_phi_nodes_for_guard2 (loop, epilog, guard_e,
> > > > +					      final_edge);
> > > > +
> > > > +	  /* If the loop was versioned we'll have an intermediate BB between
> > > > +	     the guard and the exit.  This intermediate block is required
> > > > +	     because in the current scheme of things the guard block phi
> > > > +	     updating can only maintain LCSSA by creating new blocks.  In this
> > > > +	     case we just need to update the uses in this block as well.  */
> > > > +	  if (loop != scalar_loop)
> > > > +	    {
> > > > +	      for (gphi_iterator gsi = gsi_start_phis (guard_to);
> > > > +		   !gsi_end_p (gsi); gsi_next (&gsi))
> > > > +		rename_use_op (PHI_ARG_DEF_PTR_FROM_EDGE (gsi.phi (),
> > > guard_e));
> > > > +	    }
> > > > +
> > > > +	  flush_pending_stmts (guard_e);
> > > > +	}
> > > > +
> > > >        if (skip_epilog)
> > > >  	{
> > > >  	  guard_cond = fold_build2 (EQ_EXPR, boolean_type_node,
> > > > @@ -3520,8 +3822,6 @@ vect_do_peeling (loop_vec_info loop_vinfo,
> tree
> > > niters, tree nitersm1,
> > > >  	    }
> > > >  	  scale_loop_profile (epilog, prob_epilog, 0);
> > > >  	}
> > > > -      else
> > > > -	slpeel_update_phi_nodes_for_lcssa (epilog);
> > > >
> > > >        unsigned HOST_WIDE_INT bound;
> > > >        if (bound_scalar.is_constant (&bound))
> > > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > > > index
> > >
> b4a98de80aa39057fc9b17977dd0e347b4f0fb5d..ab9a2048186f461f5ec49
> > > f21421958e7ee25eada 100644
> > > > --- a/gcc/tree-vect-loop.cc
> > > > +++ b/gcc/tree-vect-loop.cc
> > > > @@ -1007,6 +1007,8 @@ _loop_vec_info::_loop_vec_info (class loop
> > > *loop_in, vec_info_shared *shared)
> > > >      partial_load_store_bias (0),
> > > >      peeling_for_gaps (false),
> > > >      peeling_for_niter (false),
> > > > +    early_breaks (false),
> > > > +    non_break_control_flow (false),
> > > >      no_data_dependencies (false),
> > > >      has_mask_store (false),
> > > >      scalar_loop_scaling (profile_probability::uninitialized ()),
> > > > @@ -1199,6 +1201,14 @@ vect_need_peeling_or_partial_vectors_p
> > > (loop_vec_info loop_vinfo)
> > > >      th = LOOP_VINFO_COST_MODEL_THRESHOLD
> > > (LOOP_VINFO_ORIG_LOOP_INFO
> > > >  					  (loop_vinfo));
> > > >
> > > > +  /* When we have multiple exits and VF is unknown, we must require
> > > partial
> > > > +     vectors because the loop bounds is not a minimum but a maximum.
> > > That is to
> > > > +     say we cannot unpredicate the main loop unless we peel or use partial
> > > > +     vectors in the epilogue.  */
> > > > +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
> > > > +      && !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ())
> > > > +    return true;
> > > > +
> > > >    if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > > >        && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) >= 0)
> > > >      {
> > > > @@ -1652,12 +1662,12 @@ vect_compute_single_scalar_iteration_cost
> > > (loop_vec_info loop_vinfo)
> > > >    loop_vinfo->scalar_costs->finish_cost (nullptr);
> > > >  }
> > > >
> > > > -
> > > >  /* Function vect_analyze_loop_form.
> > > >
> > > >     Verify that certain CFG restrictions hold, including:
> > > >     - the loop has a pre-header
> > > > -   - the loop has a single entry and exit
> > > > +   - the loop has a single entry
> > > > +   - nested loops can have only a single exit.
> > > >     - the loop exit condition is simple enough
> > > >     - the number of iterations can be analyzed, i.e, a countable loop.  The
> > > >       niter could be analyzed under some assumptions.  */
> > > > @@ -1693,11 +1703,6 @@ vect_analyze_loop_form (class loop *loop,
> > > vect_loop_form_info *info)
> > > >                             |
> > > >                          (exit-bb)  */
> > > >
> > > > -      if (loop->num_nodes != 2)
> > > > -	return opt_result::failure_at (vect_location,
> > > > -				       "not vectorized:"
> > > > -				       " control flow in loop.\n");
> > > > -
> > > >        if (empty_block_p (loop->header))
> > > >  	return opt_result::failure_at (vect_location,
> > > >  				       "not vectorized: empty loop.\n");
> > > > @@ -1768,11 +1773,13 @@ vect_analyze_loop_form (class loop *loop,
> > > vect_loop_form_info *info)
> > > >          dump_printf_loc (MSG_NOTE, vect_location,
> > > >  			 "Considering outer-loop vectorization.\n");
> > > >        info->inner_loop_cond = inner.loop_cond;
> > > > +
> > > > +      if (!single_exit (loop))
> > > > +	return opt_result::failure_at (vect_location,
> > > > +				       "not vectorized: multiple exits.\n");
> > > > +
> > > >      }
> > > >
> > > > -  if (!single_exit (loop))
> > > > -    return opt_result::failure_at (vect_location,
> > > > -				   "not vectorized: multiple exits.\n");
> > > >    if (EDGE_COUNT (loop->header->preds) != 2)
> > > >      return opt_result::failure_at (vect_location,
> > > >  				   "not vectorized:"
> > > > @@ -1788,11 +1795,36 @@ vect_analyze_loop_form (class loop *loop,
> > > vect_loop_form_info *info)
> > > >  				   "not vectorized: latch block not empty.\n");
> > > >
> > > >    /* Make sure the exit is not abnormal.  */
> > > > -  edge e = single_exit (loop);
> > > > -  if (e->flags & EDGE_ABNORMAL)
> > > > -    return opt_result::failure_at (vect_location,
> > > > -				   "not vectorized:"
> > > > -				   " abnormal loop exit edge.\n");
> > > > +  auto_vec<edge> exits = get_loop_exit_edges (loop);
> > > > +  edge nexit = loop->vec_loop_iv;
> > > > +  for (edge e : exits)
> > > > +    {
> > > > +      if (e->flags & EDGE_ABNORMAL)
> > > > +	return opt_result::failure_at (vect_location,
> > > > +				       "not vectorized:"
> > > > +				       " abnormal loop exit edge.\n");
> > > > +      /* Early break BB must be after the main exit BB.  In theory we should
> > > > +	 be able to vectorize the inverse order, but the current flow in the
> > > > +	 the vectorizer always assumes you update successor PHI nodes, not
> > > > +	 preds.  */
> > > > +      if (e != nexit && !dominated_by_p (CDI_DOMINATORS, nexit->src, e-
> > > >src))
> > > > +	return opt_result::failure_at (vect_location,
> > > > +				       "not vectorized:"
> > > > +				       " abnormal loop exit edge order.\n");
> 
> "unsupported loop exit order", but I don't understand the comment.
> 

One failure I found during bootstrap is that there was a CFG where the BBs were all
reversed.  Should have it in the testsuite, I'll dig it out and come back to you.

> > > > +    }
> > > > +
> > > > +  /* We currently only support early exit loops with known bounds.   */
> 
> Btw, why's that?  Is that because we don't support the loop-around edge?
> IMHO this is the most serious limitation (and as said above it should be
> trivial to fix).

Nah, it's just time 😊 I wanted to start getting feedback before relaxing it.
My patch 0/19 has an implementation plan for the remaining work.

I plan to relax this in this release, most likely in this series itself.

> 
> > > > +  if (exits.length () > 1)
> > > > +    {
> > > > +      class tree_niter_desc niter;
> > > > +      if (!number_of_iterations_exit_assumptions (loop, nexit, &niter,
> NULL)
> > > > +	  || chrec_contains_undetermined (niter.niter)
> > > > +	  || !evolution_function_is_constant_p (niter.niter))
> > > > +	return opt_result::failure_at (vect_location,
> > > > +				       "not vectorized:"
> > > > +				       " early breaks only supported on loops"
> > > > +				       " with known iteration bounds.\n");
> > > > +    }
> > > >
> > > >    info->conds
> > > >      = vect_get_loop_niters (loop, &info->assumptions,
> > > > @@ -1866,6 +1898,10 @@ vect_create_loop_vinfo (class loop *loop,
> > > vec_info_shared *shared,
> > > >    LOOP_VINFO_LOOP_CONDS (loop_vinfo).safe_splice (info-
> > > >alt_loop_conds);
> > > >    LOOP_VINFO_LOOP_IV_COND (loop_vinfo) = info->loop_cond;
> > > >
> > > > +  /* Check to see if we're vectorizing multiple exits.  */
> > > > +  LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
> > > > +    = !LOOP_VINFO_LOOP_CONDS (loop_vinfo).is_empty ();
> > > > +
> > > >    if (info->inner_loop_cond)
> > > >      {
> > > >        stmt_vec_info inner_loop_cond_info
> > > > @@ -3070,7 +3106,8 @@ start_over:
> > > >
> > > >    /* If an epilogue loop is required make sure we can create one.  */
> > > >    if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> > > > -      || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo))
> > > > +      || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)
> > > > +      || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > >      {
> > > >        if (dump_enabled_p ())
> > > >          dump_printf_loc (MSG_NOTE, vect_location, "epilog loop
> required\n");
> > > > @@ -5797,7 +5834,7 @@ vect_create_epilog_for_reduction
> (loop_vec_info
> > > loop_vinfo,
> > > >    basic_block exit_bb;
> > > >    tree scalar_dest;
> > > >    tree scalar_type;
> > > > -  gimple *new_phi = NULL, *phi;
> > > > +  gimple *new_phi = NULL, *phi = NULL;
> > > >    gimple_stmt_iterator exit_gsi;
> > > >    tree new_temp = NULL_TREE, new_name, new_scalar_dest;
> > > >    gimple *epilog_stmt = NULL;
> > > > @@ -6039,6 +6076,33 @@ vect_create_epilog_for_reduction
> > > (loop_vec_info loop_vinfo,
> > > >  	  new_def = gimple_convert (&stmts, vectype, new_def);
> > > >  	  reduc_inputs.quick_push (new_def);
> > > >  	}
> > > > +
> > > > +	/* Update the other exits.  */
> > > > +	if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > +	  {
> > > > +	    vec<edge> alt_exits = LOOP_VINFO_ALT_EXITS (loop_vinfo);
> > > > +	    gphi_iterator gsi, gsi1;
> > > > +	    for (edge exit : alt_exits)
> > > > +	      {
> > > > +		/* Find the phi node to propaget into the exit block for each
> > > > +		   exit edge.  */
> > > > +		for (gsi = gsi_start_phis (exit_bb),
> > > > +		     gsi1 = gsi_start_phis (exit->src);
> 
> exit->src == loop->header, right?  I think this won't work for multiple
> alternate exits.  It's probably easier to do this where we create the
> LC PHI node for the reduction result?

No exit->src == definition block of the gcond. 

> 
> > > > +		     !gsi_end_p (gsi) && !gsi_end_p (gsi1);
> > > > +		     gsi_next (&gsi), gsi_next (&gsi1))
> > > > +		  {
> > > > +		    /* There really should be a function to just get the number
> > > > +		       of phis inside a bb.  */
> > > > +		    if (phi && phi == gsi.phi ())
> > > > +		      {
> > > > +			gphi *phi1 = gsi1.phi ();
> > > > +			SET_PHI_ARG_DEF (phi, exit->dest_idx,
> > > > +					 PHI_RESULT (phi1));
> 
> I think we know the header PHI of a reduction perfectly well, there
> shouldn't be the need to "search" for it.
> 
> > > > +			break;
> > > > +		      }
> > > > +		  }
> > > > +	      }
> > > > +	  }
> > > >        gsi_insert_seq_before (&exit_gsi, stmts, GSI_SAME_STMT);
> > > >      }
> > > >
> > > > @@ -10355,6 +10419,13 @@ vectorizable_live_operation (vec_info
> *vinfo,
> > > >  	   new_tree = lane_extract <vec_lhs', ...>;
> > > >  	   lhs' = new_tree;  */
> > > >
> > > > +      /* When vectorizing an early break, any live statement that is used
> > > > +	 outside of the loop are dead.  The loop will never get to them.
> > > > +	 We could change the liveness value during analysis instead but since
> > > > +	 the below code is invalid anyway just ignore it during codegen.  */
> > > > +      if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > +	return true;
> 
> But what about the value that's live across the main exit when the
> epilogue is not entered?

My understanding is that vectorizable_live_operation only vectorizes statements within
a loop that can be live outside, In this case e.g. statements inside the body of the IF.

What you're describing above is done by other vectorizable_reduction etc.

That said, I can make this much safer by just restricting it to statements inside the same BB
as an alt exit BB.

> 
> > > > +
> > > >        class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > > >        basic_block exit_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
> > > >        gcc_assert (single_pred_p (exit_bb));
> > > > @@ -11277,7 +11348,7 @@ vect_transform_loop (loop_vec_info
> > > loop_vinfo, gimple *loop_vectorized_call)
> > > >    /* Make sure there exists a single-predecessor exit bb.  Do this before
> > > >       versioning.   */
> > > >    edge e = LOOP_VINFO_IV_EXIT (loop_vinfo);
> > > > -  if (! single_pred_p (e->dest))
> > > > +  if (e && ! single_pred_p (e->dest) && !LOOP_VINFO_EARLY_BREAKS
> > > (loop_vinfo))
> 
> e can be NULL here?  I think we should reject such loops earlier.
> 

Ah no, that's left-over from when this used single_exit.  It should be removed
In this patch.  Had missed it, sorry.

> > > >      {
> > > >        split_loop_exit_edge (e, true);
> > > >        if (dump_enabled_p ())
> > > > @@ -11303,7 +11374,7 @@ vect_transform_loop (loop_vec_info
> > > loop_vinfo, gimple *loop_vectorized_call)
> > > >    if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo))
> > > >      {
> > > >        e = single_exit (LOOP_VINFO_SCALAR_LOOP (loop_vinfo));
> > > > -      if (! single_pred_p (e->dest))
> > > > +      if (e && ! single_pred_p (e->dest))
> > > >  	{
> > > >  	  split_loop_exit_edge (e, true);
> > > >  	  if (dump_enabled_p ())
> > > > @@ -11641,7 +11712,8 @@ vect_transform_loop (loop_vec_info
> > > loop_vinfo, gimple *loop_vectorized_call)
> > > >
> > > >    /* Loops vectorized with a variable factor won't benefit from
> > > >       unrolling/peeling.  */
> 
> update the comment?  Why would we unroll a VLA loop with early breaks?
> Or did you mean to use || LOOP_VINFO_EARLY_BREAKS (loop_vinfo)?
> 

Ah indeed, should be ||.

> > > > -  if (!vf.is_constant ())
> > > > +  if (!vf.is_constant ()
> > > > +      && !LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > >      {
> > > >        loop->unroll = 1;
> > > >        if (dump_enabled_p ())
> > > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > > > index
> > >
> 87c4353fa5180fcb7f60b192897456cf24f3fdbe..03524e8500ee06df42f82af
> > > e78ee2a7c627be45b 100644
> > > > --- a/gcc/tree-vect-stmts.cc
> > > > +++ b/gcc/tree-vect-stmts.cc
> > > > @@ -344,9 +344,34 @@ vect_stmt_relevant_p (stmt_vec_info
> stmt_info,
> > > loop_vec_info loop_vinfo,
> > > >    *live_p = false;
> > > >
> > > >    /* cond stmt other than loop exit cond.  */
> > > > -  if (is_ctrl_stmt (stmt_info->stmt)
> > > > -      && STMT_VINFO_TYPE (stmt_info) != loop_exit_ctrl_vec_info_type)
> > > > -    *relevant = vect_used_in_scope;
> 
> how was that ever hit before?  For outer loop processing with outer loop
> vectorization?
>

I believe so, because the outer-loop would see the exit cond of the inner loop as well.

> > > > +  if (is_ctrl_stmt (stmt_info->stmt))
> > > > +    {
> > > > +      /* Ideally EDGE_LOOP_EXIT would have been set on the exit edge,
> but
> > > > +	 it looks like loop_manip doesn't do that..  So we have to do it
> > > > +	 the hard way.  */
> > > > +      basic_block bb = gimple_bb (stmt_info->stmt);
> > > > +      bool exit_bb = false, early_exit = false;
> > > > +      edge_iterator ei;
> > > > +      edge e;
> > > > +      FOR_EACH_EDGE (e, ei, bb->succs)
> > > > +        if (!flow_bb_inside_loop_p (loop, e->dest))
> > > > +	  {
> > > > +	    exit_bb = true;
> > > > +	    early_exit = loop->vec_loop_iv->src != bb;
> > > > +	    break;
> > > > +	  }
> > > > +
> > > > +      /* We should have processed any exit edge, so an edge not an early
> > > > +	 break must be a loop IV edge.  We need to distinguish between the
> > > > +	 two as we don't want to generate code for the main loop IV.  */
> > > > +      if (exit_bb)
> > > > +	{
> > > > +	  if (early_exit)
> > > > +	    *relevant = vect_used_in_scope;
> > > > +	}
> 
> I wonder why you can't simply do
> 
>          if (is_ctrl_stmt (stmt_info->stmt)
>              && stmt_info->stmt != LOOP_VINFO_COND (loop_info))
> 
> ?
> 
> > > > +      else if (bb->loop_father == loop)
> > > > +	LOOP_VINFO_GENERAL_CTR_FLOW (loop_vinfo) = true;
> 
> so for control flow not exiting the loop you can check
> loop_exits_from_bb_p ().
> 
> > > > +    }
> > > >
> > > >    /* changing memory.  */
> > > >    if (gimple_code (stmt_info->stmt) != GIMPLE_PHI)
> > > > @@ -359,6 +384,11 @@ vect_stmt_relevant_p (stmt_vec_info
> stmt_info,
> > > loop_vec_info loop_vinfo,
> > > >  	*relevant = vect_used_in_scope;
> > > >        }
> > > >
> > > > +  auto_vec<edge> exits = get_loop_exit_edges (loop);
> > > > +  auto_bitmap exit_bbs;
> > > > +  for (edge exit : exits)
> > > > +    bitmap_set_bit (exit_bbs, exit->dest->index);
> > > > +
> > > >    /* uses outside the loop.  */
> > > >    FOR_EACH_PHI_OR_STMT_DEF (def_p, stmt_info->stmt, op_iter,
> > > SSA_OP_DEF)
> > > >      {
> > > > @@ -377,7 +407,7 @@ vect_stmt_relevant_p (stmt_vec_info stmt_info,
> > > loop_vec_info loop_vinfo,
> > > >  	      /* We expect all such uses to be in the loop exit phis
> > > >  		 (because of loop closed form)   */
> > > >  	      gcc_assert (gimple_code (USE_STMT (use_p)) == GIMPLE_PHI);
> > > > -	      gcc_assert (bb == single_exit (loop)->dest);
> > > > +	      gcc_assert (bitmap_bit_p (exit_bbs, bb->index));
> 
> That now becomes quite expensive checking already covered by the LC SSA
> verifier so I suggest to simply drop this assert instead.
> 
> > > >                *live_p = true;
> > > >  	    }
> > > > @@ -683,6 +713,13 @@ vect_mark_stmts_to_be_vectorized
> > > (loop_vec_info loop_vinfo, bool *fatal)
> > > >  	}
> > > >      }
> > > >
> > > > +  /* Ideally this should be in vect_analyze_loop_form but we haven't
> seen all
> > > > +     the conds yet at that point and there's no quick way to retrieve them.
> */
> > > > +  if (LOOP_VINFO_GENERAL_CTR_FLOW (loop_vinfo))
> > > > +    return opt_result::failure_at (vect_location,
> > > > +				   "not vectorized:"
> > > > +				   " unsupported control flow in loop.\n");
> 
> so we didn't do this before?  But see above where I wondered.  So when
> does this hit with early exits and why can't we check for this in
> vect_verify_loop_form?
> 

We did, but it was done in vect_analyze_loop_form but it was done purely based
on number of BB in the loop.  This required loops to be highly normalized which
isn't the case with multiple exits. That is I've seen various loops with different
numbers of random empty fall through BB in the body or after the main exit before
the latch.

We can do it in vect_analyze_loop_form but that requires us to walk all the
statements in all the basic blocks, because the loops track exit edges and a general
control flow edge is not easy to find as far as I know.  I added it at this point
because by here because by this point we would have walked all the statements.

> > > > +
> > > >    /* 2. Process_worklist */
> > > >    while (worklist.length () > 0)
> > > >      {
> > > > @@ -778,6 +815,20 @@ vect_mark_stmts_to_be_vectorized
> > > (loop_vec_info loop_vinfo, bool *fatal)
> > > >  			return res;
> > > >  		    }
> > > >                   }
> > > > +	    }
> > > > +	  else if (gcond *cond = dyn_cast <gcond *> (stmt_vinfo->stmt))
> > > > +	    {
> > > > +	      enum tree_code rhs_code = gimple_cond_code (cond);
> > > > +	      gcc_assert (TREE_CODE_CLASS (rhs_code) == tcc_comparison);
> > > > +	      opt_result res
> > > > +		= process_use (stmt_vinfo, gimple_cond_lhs (cond),
> > > > +			       loop_vinfo, relevant, &worklist, false);
> > > > +	      if (!res)
> > > > +		return res;
> > > > +	      res = process_use (stmt_vinfo, gimple_cond_rhs (cond),
> > > > +				loop_vinfo, relevant, &worklist, false);
> > > > +	      if (!res)
> > > > +		return res;
> > > >              }
> > > >  	  else if (gcall *call = dyn_cast <gcall *> (stmt_vinfo->stmt))
> > > >  	    {
> > > > @@ -11919,11 +11970,15 @@ vect_analyze_stmt (vec_info *vinfo,
> > > >  			     node_instance, cost_vec);
> > > >        if (!res)
> > > >  	return res;
> > > > -   }
> > > > +    }
> > > > +
> > > > +  if (is_ctrl_stmt (stmt_info->stmt))
> > > > +    STMT_VINFO_DEF_TYPE (stmt_info) = vect_early_exit_def;
> > > >
> > > >    switch (STMT_VINFO_DEF_TYPE (stmt_info))
> > > >      {
> > > >        case vect_internal_def:
> > > > +      case vect_early_exit_def:
> > > >          break;
> > > >
> > > >        case vect_reduction_def:
> > > > @@ -11956,6 +12011,7 @@ vect_analyze_stmt (vec_info *vinfo,
> > > >      {
> > > >        gcall *call = dyn_cast <gcall *> (stmt_info->stmt);
> > > >        gcc_assert (STMT_VINFO_VECTYPE (stmt_info)
> > > > +		  || gimple_code (stmt_info->stmt) == GIMPLE_COND
> > > >  		  || (call && gimple_call_lhs (call) == NULL_TREE));
> > > >        *need_to_vectorize = true;
> > > >      }
> > > > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> > > > index
> > >
> ec65b65b5910e9cbad0a8c7e83c950b6168b98bf..24a0567a2f23f1b3d8b3
> > > 40baff61d18da8e242dd 100644
> > > > --- a/gcc/tree-vectorizer.h
> > > > +++ b/gcc/tree-vectorizer.h
> > > > @@ -63,6 +63,7 @@ enum vect_def_type {
> > > >    vect_internal_def,
> > > >    vect_induction_def,
> > > >    vect_reduction_def,
> > > > +  vect_early_exit_def,
> 
> can you avoid putting this inbetween reduction and double reduction
> please?  Just put it before vect_unknown_def_type.  In fact the COND
> isn't a def ... maybe we should have pattern recogized
> 
>  if (a < b) exit;
> 
> as
> 
>  cond = a < b;
>  if (cond != 0) exit;
> 
> so the part that we need to vectorize is more clear.

Hmm fair enough, I still find it useful to be able to distinguish
between this and general control flow though.  In fact depending
on when we finish reviewing/upstreaming this It should be easy to
support general control flow in such loops.

> 
> > > >    vect_double_reduction_def,
> > > >    vect_nested_cycle,
> > > >    vect_first_order_recurrence,
> > > > @@ -876,6 +877,13 @@ public:
> > > >       we need to peel off iterations at the end to form an epilogue loop.  */
> > > >    bool peeling_for_niter;
> > > >
> > > > +  /* When the loop has early breaks that we can vectorize we need to
> peel
> > > > +     the loop for the break finding loop.  */
> > > > +  bool early_breaks;
> > > > +
> > > > +  /* When the loop has a non-early break control flow inside.  */
> > > > +  bool non_break_control_flow;
> > > > +
> > > >    /* List of loop additional IV conditionals found in the loop.  */
> > > >    auto_vec<gcond *> conds;
> > > >
> > > > @@ -985,9 +993,11 @@ public:
> > > >  #define LOOP_VINFO_REDUCTION_CHAINS(L)     (L)->reduction_chains
> > > >  #define LOOP_VINFO_PEELING_FOR_GAPS(L)     (L)->peeling_for_gaps
> > > >  #define LOOP_VINFO_PEELING_FOR_NITER(L)    (L)->peeling_for_niter
> > > > +#define LOOP_VINFO_EARLY_BREAKS(L)         (L)->early_breaks
> > > >  #define LOOP_VINFO_EARLY_BRK_CONFLICT_STMTS(L) (L)-
> > > >early_break_conflict
> > > >  #define LOOP_VINFO_EARLY_BRK_DEST_BB(L)    (L)-
> >early_break_dest_bb
> > > >  #define LOOP_VINFO_EARLY_BRK_VUSES(L)      (L)->early_break_vuses
> > > > +#define LOOP_VINFO_GENERAL_CTR_FLOW(L)     (L)-
> > > >non_break_control_flow
> > > >  #define LOOP_VINFO_LOOP_CONDS(L)           (L)->conds
> > > >  #define LOOP_VINFO_LOOP_IV_COND(L)         (L)->loop_iv_cond
> > > >  #define LOOP_VINFO_NO_DATA_DEPENDENCIES(L) (L)-
> > > >no_data_dependencies
> > > > @@ -1038,8 +1048,8 @@ public:
> > > >     stack.  */
> > > >  typedef opt_pointer_wrapper <loop_vec_info> opt_loop_vec_info;
> > > >
> > > > -inline loop_vec_info
> > > > -loop_vec_info_for_loop (class loop *loop)
> > > > +static inline loop_vec_info
> > > > +loop_vec_info_for_loop (const class loop *loop)
> > > >  {
> > > >    return (loop_vec_info) loop->aux;
> > > >  }
> > > > @@ -1789,7 +1799,7 @@ is_loop_header_bb_p (basic_block bb)
> > > >  {
> > > >    if (bb == (bb->loop_father)->header)
> > > >      return true;
> > > > -  gcc_checking_assert (EDGE_COUNT (bb->preds) == 1);
> > > > +
> > > >    return false;
> > > >  }
> > > >
> > > > @@ -2176,9 +2186,10 @@ class auto_purge_vect_location
> > > >     in tree-vect-loop-manip.cc.  */
> > > >  extern void vect_set_loop_condition (class loop *, loop_vec_info,
> > > >  				     tree, tree, tree, bool);
> > > > -extern bool slpeel_can_duplicate_loop_p (const class loop *,
> const_edge);
> > > > +extern bool slpeel_can_duplicate_loop_p (const loop_vec_info,
> > > const_edge);
> > > >  class loop *slpeel_tree_duplicate_loop_to_edge_cfg (class loop *,
> > > > -						     class loop *, edge);
> > > > +						    class loop *, edge, bool,
> > > > +						    vec<basic_block> * = NULL);
> > > >  class loop *vect_loop_versioning (loop_vec_info, gimple *);
> > > >  extern class loop *vect_do_peeling (loop_vec_info, tree, tree,
> > > >  				    tree *, tree *, tree *, int, bool, bool,
> > > > diff --git a/gcc/tree-vectorizer.cc b/gcc/tree-vectorizer.cc
> > > > index
> > >
> a048e9d89178a37455bd7b83ab0f2a238a4ce69e..0dc5479dc92058b6c70c
> > > 67f29f5dc9a8d72235f4 100644
> > > > --- a/gcc/tree-vectorizer.cc
> > > > +++ b/gcc/tree-vectorizer.cc
> > > > @@ -1379,7 +1379,9 @@ pass_vectorize::execute (function *fun)
> > > >  	 predicates that need to be shared for optimal predicate usage.
> > > >  	 However reassoc will re-order them and prevent CSE from working
> > > >  	 as it should.  CSE only the loop body, not the entry.  */
> > > > -      bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
> > > > +      auto_vec<edge> exits = get_loop_exit_edges (loop);
> 
> seeing this more and more I think we want a simple way to iterate over
> all exits without copying to a vector when we have them recorded.  My
> C++ fu is too limited to support
> 
>   for (auto exit : recorded_exits (loop))
>     ...
> 
> (maybe that's enough for somebody to jump onto this ;))
> 
> Don't treat all review comments as change orders, but it should be clear
> the code isn't 100% obvious.  Maybe the patch can be simplified by
> splitting out the LC SSA cleanup parts.

Will give it a try,

Thanks!
Tamar

> 
> Thanks,
> Richard.
> 
> > > > +      for (edge exit : exits)
> > > > +	bitmap_set_bit (exit_bbs, exit->dest->index);
> > > >
> > > >        edge entry = EDGE_PRED (loop_preheader_edge (loop)->src, 0);
> > > >        do_rpo_vn (fun, entry, exit_bbs);
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > > --
> > > Richard Biener <rguenther@suse.de>
> > > SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461
> > > Nuernberg,
> > > Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien
> > > Moerman;
> > > HRB 36809 (AG Nuernberg)
> >
  
Richard Biener July 17, 2023, 12:48 p.m. UTC | #5
On Mon, 17 Jul 2023, Tamar Christina wrote:

> 
> 
> > -----Original Message-----
> > From: Richard Biener <rguenther@suse.de>
> > Sent: Friday, July 14, 2023 2:35 PM
> > To: Tamar Christina <Tamar.Christina@arm.com>
> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; jlaw@ventanamicro.com
> > Subject: RE: [PATCH 12/19]middle-end: implement loop peeling and IV
> > updates for early break.
> > 
> > On Thu, 13 Jul 2023, Tamar Christina wrote:
> > 
> > > > -----Original Message-----
> > > > From: Richard Biener <rguenther@suse.de>
> > > > Sent: Thursday, July 13, 2023 6:31 PM
> > > > To: Tamar Christina <Tamar.Christina@arm.com>
> > > > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>;
> > jlaw@ventanamicro.com
> > > > Subject: Re: [PATCH 12/19]middle-end: implement loop peeling and IV
> > > > updates for early break.
> > > >
> > > > On Wed, 28 Jun 2023, Tamar Christina wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > This patch updates the peeling code to maintain LCSSA during peeling.
> > > > > The rewrite also naturally takes into account multiple exits and so it didn't
> > > > > make sense to split them off.
> > > > >
> > > > > For the purposes of peeling the only change for multiple exits is that the
> > > > > secondary exits are all wired to the start of the new loop preheader when
> > > > doing
> > > > > epilogue peeling.
> > > > >
> > > > > When doing prologue peeling the CFG is kept in tact.
> > > > >
> > > > > For both epilogue and prologue peeling we wire through between the
> > two
> > > > loops any
> > > > > PHI nodes that escape the first loop into the second loop if flow_loops is
> > > > > specified.  The reason for this conditionality is because
> > > > > slpeel_tree_duplicate_loop_to_edge_cfg is used in the compiler in 3 ways:
> > > > >   - prologue peeling
> > > > >   - epilogue peeling
> > > > >   - loop distribution
> > > > >
> > > > > for the last case the loops should remain independent, and so not be
> > > > connected.
> > > > > Because of this propagation of only used phi nodes get_current_def can
> > be
> > > > used
> > > > > to easily find the previous definitions.  However live statements that are
> > > > > not used inside the loop itself are not propagated (since if unused, the
> > > > moment
> > > > > we add the guard in between the two loops the value across the bypass
> > edge
> > > > can
> > > > > be wrong if the loop has been peeled.)
> > > > >
> > > > > This is dealt with easily enough in find_guard_arg.
> > > > >
> > > > > For multiple exits, while we are in LCSSA form, and have a correct DOM
> > tree,
> > > > the
> > > > > moment we add the guard block we will change the dominators again.  To
> > > > deal with
> > > > > this slpeel_tree_duplicate_loop_to_edge_cfg can optionally return the
> > blocks
> > > > to
> > > > > update without having to recompute the list of blocks to update again.
> > > > >
> > > > > When multiple exits and doing epilogue peeling we will also temporarily
> > have
> > > > an
> > > > > incorrect VUSES chain for the secondary exits as it anticipates the final
> > result
> > > > > after the VDEFs have been moved.  This will thus be corrected once the
> > code
> > > > > motion is applied.
> > > > >
> > > > > Lastly by doing things this way we can remove the helper functions that
> > > > > previously did lock step iterations to update things as it went along.
> > > > >
> > > > > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> > > > >
> > > > > Ok for master?
> > > >
> > > > Not sure if I get through all of this in one go - so be prepared that
> > > > the rest of the review follows another day.
> > >
> > > No worries, I appreciate the reviews!
> > > Just giving some quick replies for when you continue.
> > 
> > Continueing.
> > 
> > > >
> > > > > Thanks,
> > > > > Tamar
> > > > >
> > > > > gcc/ChangeLog:
> > > > >
> > > > > 	* tree-loop-distribution.cc (copy_loop_before): Pass flow_loops =
> > > > false.
> > > > > 	* tree-ssa-loop-niter.cc (loop_only_exit_p):  Fix bug when exit==null.
> > > > > 	* tree-vect-loop-manip.cc (adjust_phi_and_debug_stmts): Add
> > > > additional
> > > > > 	assert.
> > > > > 	(vect_set_loop_condition_normal): Skip modifying loop IV for multiple
> > > > > 	exits.
> > > > > 	(slpeel_tree_duplicate_loop_to_edge_cfg): Support multiple exit
> > > > peeling.
> > > > > 	(slpeel_can_duplicate_loop_p): Likewise.
> > > > > 	(vect_update_ivs_after_vectorizer): Don't enter this...
> > > > > 	(vect_update_ivs_after_early_break): ...but instead enter here.
> > > > > 	(find_guard_arg): Update for new peeling code.
> > > > > 	(slpeel_update_phi_nodes_for_loops): Remove.
> > > > > 	(slpeel_update_phi_nodes_for_guard2): Remove hardcoded edge 0
> > > > checks.
> > > > > 	(slpeel_update_phi_nodes_for_lcssa): Remove.
> > > > > 	(vect_do_peeling): Fix VF for multiple exits and force epilogue.
> > > > > 	* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Initialize
> > > > > 	non_break_control_flow and early_breaks.
> > > > > 	(vect_need_peeling_or_partial_vectors_p): Force partial vector if
> > > > > 	multiple exits and VLA.
> > > > > 	(vect_analyze_loop_form): Support inner loop multiple exits.
> > > > > 	(vect_create_loop_vinfo): Set LOOP_VINFO_EARLY_BREAKS.
> > > > > 	(vect_create_epilog_for_reduction):  Update live phi nodes.
> > > > > 	(vectorizable_live_operation): Ignore live operations in vector loop
> > > > > 	when multiple exits.
> > > > > 	(vect_transform_loop): Force unrolling for VF loops and multiple exits.
> > > > > 	* tree-vect-stmts.cc (vect_stmt_relevant_p): Analyze ctrl statements.
> > > > > 	(vect_mark_stmts_to_be_vectorized): Check for non-exit control flow
> > > > and
> > > > > 	analyze gcond params.
> > > > > 	(vect_analyze_stmt): Support gcond.
> > > > > 	* tree-vectorizer.cc (pass_vectorize::execute): Support multiple exits
> > > > > 	in RPO pass.
> > > > > 	* tree-vectorizer.h (enum vect_def_type): Add vect_early_exit_def.
> > > > > 	(LOOP_VINFO_EARLY_BREAKS, LOOP_VINFO_GENERAL_CTR_FLOW):
> > > > New.
> > > > > 	(loop_vec_info_for_loop): Change to const and static.
> > > > > 	(is_loop_header_bb_p): Drop assert.
> > > > > 	(slpeel_can_duplicate_loop_p): Update prototype.
> > > > > 	(class loop): Add early_breaks and non_break_control_flow.
> > > > >
> > > > > --- inline copy of patch --
> > > > > diff --git a/gcc/tree-loop-distribution.cc b/gcc/tree-loop-distribution.cc
> > > > > index
> > > >
> > 97879498db46dd3c34181ae9aa6e5476004dd5b5..d790ce5fffab3aa3dfc40
> > > > d833a968314a4442b9e 100644
> > > > > --- a/gcc/tree-loop-distribution.cc
> > > > > +++ b/gcc/tree-loop-distribution.cc
> > > > > @@ -948,7 +948,7 @@ copy_loop_before (class loop *loop, bool
> > > > redirect_lc_phi_defs)
> > > > >    edge preheader = loop_preheader_edge (loop);
> > > > >
> > > > >    initialize_original_copy_tables ();
> > > > > -  res = slpeel_tree_duplicate_loop_to_edge_cfg (loop, NULL, preheader);
> > > > > +  res = slpeel_tree_duplicate_loop_to_edge_cfg (loop, NULL, preheader,
> > > > false);
> > > > >    gcc_assert (res != NULL);
> > > > >
> > > > >    /* When a not last partition is supposed to keep the LC PHIs computed
> > > > > diff --git a/gcc/tree-ssa-loop-niter.cc b/gcc/tree-ssa-loop-niter.cc
> > > > > index
> > > >
> > 5d398b67e68c7076760854119590f18b19c622b6..79686f6c4945b7139ba
> > > > 377300430c04b7aeefe6c 100644
> > > > > --- a/gcc/tree-ssa-loop-niter.cc
> > > > > +++ b/gcc/tree-ssa-loop-niter.cc
> > > > > @@ -3072,7 +3072,12 @@ loop_only_exit_p (const class loop *loop,
> > > > basic_block *body, const_edge exit)
> > > > >    gimple_stmt_iterator bsi;
> > > > >    unsigned i;
> > > > >
> > > > > -  if (exit != single_exit (loop))
> > > > > +  /* We need to check for alternative exits since exit can be NULL.  */
> > > >
> > > > You mean we pass in exit == NULL in some cases?  I'm not sure what
> > > > the desired behavior in that case is - can you point out the
> > > > callers you are fixing here?
> > > >
> > > > I think we should add gcc_assert (exit != nullptr)
> > > >
> > > > >    for (i = 0; i < loop->num_nodes; i++)
> > > > > diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
> > > > > index
> > > >
> > 6b93fb3f9af8f2bbdf5dec28f0009177aa5171ab..550d7f40002cf0b58f8a92
> > > > 7cb150edd7c2aa9999 100644
> > > > > --- a/gcc/tree-vect-loop-manip.cc
> > > > > +++ b/gcc/tree-vect-loop-manip.cc
> > > > > @@ -252,6 +252,9 @@ adjust_phi_and_debug_stmts (gimple
> > *update_phi,
> > > > edge e, tree new_def)
> > > > >  {
> > > > >    tree orig_def = PHI_ARG_DEF_FROM_EDGE (update_phi, e);
> > > > >
> > > > > +  gcc_assert (TREE_CODE (orig_def) != SSA_NAME
> > > > > +	      || orig_def != new_def);
> > > > > +
> > > > >    SET_PHI_ARG_DEF (update_phi, e->dest_idx, new_def);
> > > > >
> > > > >    if (MAY_HAVE_DEBUG_BIND_STMTS)
> > > > > @@ -1292,7 +1295,8 @@ vect_set_loop_condition_normal
> > (loop_vec_info
> > > > loop_vinfo,
> > > > >    gsi_insert_before (&loop_cond_gsi, cond_stmt, GSI_SAME_STMT);
> > > > >
> > > > >    /* Record the number of latch iterations.  */
> > > > > -  if (limit == niters)
> > > > > +  if (limit == niters
> > > > > +      || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > >      /* Case A: the loop iterates NITERS times.  Subtract one to get the
> > > > >         latch count.  */
> > > > >      loop->nb_iterations = fold_build2 (MINUS_EXPR, niters_type, niters,
> > > > > @@ -1303,7 +1307,13 @@ vect_set_loop_condition_normal
> > > > (loop_vec_info loop_vinfo,
> > > > >      loop->nb_iterations = fold_build2 (TRUNC_DIV_EXPR, niters_type,
> > > > >  				       limit, step);
> > > > >
> > > > > -  if (final_iv)
> > > > > +  /* For multiple exits we've already maintained LCSSA form and handled
> > > > > +     the scalar iteration update in the code that deals with the merge
> > > > > +     block and its updated guard.  I could move that code here instead
> > > > > +     of in vect_update_ivs_after_early_break but I have to still deal
> > > > > +     with the updates to the counter `i`.  So for now I'll keep them
> > > > > +     together.  */
> > > > > +  if (final_iv && !LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > >      {
> > > > >        gassign *assign;
> > > > >        edge exit = LOOP_VINFO_IV_EXIT (loop_vinfo);
> > > > > @@ -1509,11 +1519,19 @@ vec_init_exit_info (class loop *loop)
> > > > >     on E which is either the entry or exit of LOOP.  If SCALAR_LOOP is
> > > > >     non-NULL, assume LOOP and SCALAR_LOOP are equivalent and copy
> > the
> > > > >     basic blocks from SCALAR_LOOP instead of LOOP, but to either the
> > > > > -   entry or exit of LOOP.  */
> > > > > +   entry or exit of LOOP.  If FLOW_LOOPS then connect LOOP to
> > > > SCALAR_LOOP as a
> > > > > +   continuation.  This is correct for cases where one loop continues from
> > the
> > > > > +   other like in the vectorizer, but not true for uses in e.g. loop
> > distribution
> > > > > +   where the loop is duplicated and then modified.
> > > > > +
> > > >
> > > > but for loop distribution the flow also continues?  I'm not sure what you
> > > > are refering to here.  Do you by chance have a branch with the patches
> > > > installed?
> > >
> > > Yup, they're at refs/users/tnfchris/heads/gcc-14-early-break in the repo.
> > >
> > > >
> > > > > +   If UPDATED_DOMS is not NULL it is update with the list of basic blocks
> > > > whoms
> > > > > +   dominators were updated during the peeling.  */
> > > > >
> > > > >  class loop *
> > > > >  slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop,
> > > > > -					class loop *scalar_loop, edge e)
> > > > > +					class loop *scalar_loop, edge e,
> > > > > +					bool flow_loops,
> > > > > +					vec<basic_block> *updated_doms)
> > > > >  {
> > > > >    class loop *new_loop;
> > > > >    basic_block *new_bbs, *bbs, *pbbs;
> > > > > @@ -1602,6 +1620,19 @@ slpeel_tree_duplicate_loop_to_edge_cfg
> > (class
> > > > loop *loop,
> > > > >    for (unsigned i = (at_exit ? 0 : 1); i < scalar_loop->num_nodes + 1; i++)
> > > > >      rename_variables_in_bb (new_bbs[i], duplicate_outer_loop);
> > > > >
> > > > > +  /* Rename the exit uses.  */
> > > > > +  for (edge exit : get_loop_exit_edges (new_loop))
> > > > > +    for (auto gsi = gsi_start_phis (exit->dest);
> > > > > +	 !gsi_end_p (gsi); gsi_next (&gsi))
> > > > > +      {
> > > > > +	tree orig_def = PHI_ARG_DEF_FROM_EDGE (gsi.phi (), exit);
> > > > > +	rename_use_op (PHI_ARG_DEF_PTR_FROM_EDGE (gsi.phi (), exit));
> > > > > +	if (MAY_HAVE_DEBUG_BIND_STMTS)
> > > > > +	  adjust_debug_stmts (orig_def, PHI_RESULT (gsi.phi ()), exit->dest);
> > > > > +      }
> > > > > +
> > > > > +  /* This condition happens when the loop has been versioned. e.g. due
> > to
> > > > ifcvt
> > > > > +     versioning the loop.  */
> > > > >    if (scalar_loop != loop)
> > > > >      {
> > > > >        /* If we copied from SCALAR_LOOP rather than LOOP, SSA_NAMEs
> > from
> > > > > @@ -1616,28 +1647,106 @@ slpeel_tree_duplicate_loop_to_edge_cfg
> > > > (class loop *loop,
> > > > >  						EDGE_SUCC (loop->latch, 0));
> > > > >      }
> > > > >
> > > > > +  vec<edge> alt_exits = loop->vec_loop_alt_exits;
> > > >
> > > > So 'e' is not one of alt_exits, right?  I wonder if we can simply
> > > > compute the vector from all exits of 'loop' and removing 'e'?
> > > >
> > > > > +  bool multiple_exits_p = !alt_exits.is_empty ();
> > > > > +  auto_vec<basic_block> doms;
> > > > > +  class loop *update_loop = NULL;
> > > > > +
> > > > >    if (at_exit) /* Add the loop copy at exit.  */
> > > > >      {
> > > > > -      if (scalar_loop != loop)
> > > > > +      if (scalar_loop != loop && new_exit->dest != exit_dest)
> > > > >  	{
> > > > > -	  gphi_iterator gsi;
> > > > >  	  new_exit = redirect_edge_and_branch (new_exit, exit_dest);
> > > > > +	  flush_pending_stmts (new_exit);
> > > > > +	}
> > > > >
> > > > > -	  for (gsi = gsi_start_phis (exit_dest); !gsi_end_p (gsi);
> > > > > -	       gsi_next (&gsi))
> > > > > +      auto loop_exits = get_loop_exit_edges (loop);
> > > > > +      for (edge exit : loop_exits)
> > > > > +	redirect_edge_and_branch (exit, new_preheader);
> > > > > +
> > > > > +
> > > >
> > > > one line vertical space too much
> > > >
> > > > > +      /* Copy the current loop LC PHI nodes between the original loop exit
> > > > > +	 block and the new loop header.  This allows us to later split the
> > > > > +	 preheader block and still find the right LC nodes.  */
> > > > > +      edge latch_new = single_succ_edge (new_preheader);
> > > > > +      edge latch_old = loop_latch_edge (loop);
> > > > > +      hash_set <tree> lcssa_vars;
> > > > > +      for (auto gsi_from = gsi_start_phis (latch_old->dest),
> > > >
> > > > so that's loop->header (and makes it more clear which PHI nodes you are
> > > > looking at)
> > 
> > 
> > So I'm now in a debug session - I think that conceptually it would
> > make more sense to create the LC PHI nodes that are present at the
> > old exit destination in the new preheader _before_ you redirect them
> > above and then flush_pending_stmts after redirecting, that should deal
> > with the copying.
> > 
> 
> This was the first thing I tried, however as soon as you redirect one edge
> you destroy all other phi nodes on the original block.
> 
> As in if I have 3 phi nodes, I need to move them all at the same time, which
> brings the next problem in that I can't add more entries to a phi than it has
> incoming edges.  That is I can't just make the final phi nodes on the destination
> without having an edge for it.

Hmm?  It would go like

 1. create all PHI nodes (without any arguments) at the destination
 2. redirect edge + flush stmts (will copy all PHI args)

> And to make an edge for it I need to have a condition to attach to the edge.
> To work around it I tried maintaining a cache of the nodes I need to make on
> the new destination and after redirecting just create them,  but that has me
> looping over the same PHIs multiple times.
> 
> Any suggestions?

Not sure what you mean with "a condition to attach to the edge"?

> > Now, your copying actually iterates over all PHIs in the loop _header_,
> > so it doesn't actually copy LC PHI nodes but possibly creates additional
> > ones.  The intent does seem to do this since you want a different value
> > on those edges for all but the main loop exit.  But then the
> > overall comments should better reflect that and maybe you should
> > do what I suggested anyway and have this loop alter only the alternate
> > exit LC PHIs?
> 
> It does create the LC PHI nodes for all exits, it's just that for alternate exits
> all the nodes are the same since we only care about the value for full
> iterations. 
> 
> I'm not sure what you're suggesting with alter only the alternative exits.
> I need to still create the one for the main loop and seems easier to do them
> In one loop rather than two?
> 
> Doing this here allows the removal of all the code later on that the vectorizer
> uses to try to find the main exit's PHIs.

I think you have two paths (unless you always peel at least a single 
iteration - it seems you probably do that).  On the path from main
to epilog you don't need the existing LC PHI values - those are for
the loop _exit_ but you are continuing the loop so you need values
for the entry of the epilog header PHIs.

There's still somehow the distinction betwee taking the "normal"
exit and the alternate exits - at least your code suggests so.
In case we would have the case the epilog loop is skipped for
the normal exit we'd need the original LC PHI nodes there
(merging with the epilog exit LC PHI nodes).  So for the
"normal" exit we'd need both set of PHIs.  Thus I assume you
made your life simpler by always peeling one iteration so
even the "normal" exit only needs the PHIs for the epilog header?

But then I wonder why you look at the original values of the exit
LC PHI node at all ...

So what is it?  A comment at the very spot in the code would be
helpful, documenting the constraints we are dealing with.

> > 
> > If you don't flush_pending_stmts on an edge after redirecting you
> > should call redirect_edge_var_map_clear (edge), otherwise the stale
> > info might break things later.
> > 
> > > > > +	   gsi_to = gsi_start_phis (latch_new->dest);
> > > >
> > > > likewise new_loop->header
> > > >
> > > > > +	   flow_loops && !gsi_end_p (gsi_from) && !gsi_end_p (gsi_to);
> > > > > +	   gsi_next (&gsi_from), gsi_next (&gsi_to))
> > > > > +	{
> > > > > +	  gimple *from_phi = gsi_stmt (gsi_from);
> > > > > +	  gimple *to_phi = gsi_stmt (gsi_to);
> > > > > +	  tree new_arg = PHI_ARG_DEF_FROM_EDGE (from_phi, latch_old);
> > > > > +	  /* In all cases, even in early break situations we're only
> > > > > +	     interested in the number of fully executed loop iters.  As such
> > > > > +	     we discard any partially done iteration.  So we simply propagate
> > > > > +	     the phi nodes from the latch to the merge block.  */
> > > > > +	  tree new_res = copy_ssa_name (gimple_phi_result (from_phi));
> > > > > +	  gphi *lcssa_phi = create_phi_node (new_res, e->dest);
> > > > > +
> > > > > +	  lcssa_vars.add (new_arg);
> > > > > +
> > > > > +	  /* Main loop exit should use the final iter value.  */
> > > > > +	  add_phi_arg (lcssa_phi, new_arg, loop->vec_loop_iv,
> > > > UNKNOWN_LOCATION);
> > > >
> > > > above you are creating the PHI node at e->dest but here add the PHI arg to
> > > > loop->vec_loop_iv - that's 'e' here, no?  Consistency makes it easier
> > > > to follow.  I _think_ this code doesn't need to know about the "special"
> > > > edge.
> > > >
> > > > > +
> > > > > +	  /* All other exits use the previous iters.  */
> > > > > +	  for (edge e : alt_exits)
> > > > > +	    add_phi_arg (lcssa_phi, gimple_phi_result (from_phi), e,
> > > > > +			 UNKNOWN_LOCATION);
> > > > > +
> > > > > +	  adjust_phi_and_debug_stmts (to_phi, latch_new, new_res);
> > > > > +	}
> > > > > +
> > > > > +      /* Copy over any live SSA vars that may not have been materialized in
> > > > the
> > > > > +	 loops themselves but would be in the exit block.  However when the
> > > > live
> > > > > +	 value is not used inside the loop then we don't need to do this,  if we
> > > > do
> > > > > +	 then when we split the guard block the branch edge can end up
> > > > containing the
> > > > > +	 wrong reference,  particularly if it shares an edge with something that
> > > > has
> > > > > +	 bypassed the loop.  This is not something peeling can check so we
> > > > need to
> > > > > +	 anticipate the usage of the live variable here.  */
> > > > > +      auto exit_map = redirect_edge_var_map_vector (exit);
> > > >
> > > > Hmm, did I use that in my attemt to refactor things? ...
> > >
> > > Indeed, I didn't always use it, but found it was the best way to deal with the
> > > variables being live in various BB after the loop.
> > 
> > As said this whole piece of code is possibly more complicated than
> > necessary.  First copying/creating the PHI nodes that are present
> > at the exit (the old LC PHI nodes), then redirecting edges and flushing
> > stmts should deal with half of this.
> >
> > > >
> > > > > +      if (exit_map)
> > > > > +        for (auto vm : exit_map)
> > > > > +	{
> > > > > +	  if (lcssa_vars.contains (vm.def)
> > > > > +	      || TREE_CODE (vm.def) != SSA_NAME)
> > > >
> > > > the latter check is cheaper so it should come first
> > > >
> > > > > +	    continue;
> > > > > +
> > > > > +	  imm_use_iterator imm_iter;
> > > > > +	  use_operand_p use_p;
> > > > > +	  bool use_in_loop = false;
> > > > > +
> > > > > +	  FOR_EACH_IMM_USE_FAST (use_p, imm_iter, vm.def)
> > > > >  	    {
> > > > > -	      gphi *phi = gsi.phi ();
> > > > > -	      tree orig_arg = PHI_ARG_DEF_FROM_EDGE (phi, e);
> > > > > -	      location_t orig_locus
> > > > > -		= gimple_phi_arg_location_from_edge (phi, e);
> > > > > +	      basic_block bb = gimple_bb (USE_STMT (use_p));
> > > > > +	      if (flow_bb_inside_loop_p (loop, bb)
> > > > > +		  && !gimple_vuse (USE_STMT (use_p)))
> > 
> > what's this gimple_vuse check?  I see now for vect-early-break_17.c this
> > code triggers and ignores
> > 
> >   vect_b[i_18] = _2;
> > 
> > > > > +		{
> > > > > +		  use_in_loop = true;
> > > > > +		  break;
> > > > > +		}
> > > > > +	    }
> > > > >
> > > > > -	      add_phi_arg (phi, orig_arg, new_exit, orig_locus);
> > > > > +	  if (!use_in_loop)
> > > > > +	    {
> > > > > +	       /* Do a final check to see if it's perhaps defined in the loop.  This
> > > > > +		  mirrors the relevancy analysis's used_outside_scope.  */
> > > > > +	      gimple *stmt = SSA_NAME_DEF_STMT (vm.def);
> > > > > +	      if (!stmt || !flow_bb_inside_loop_p (loop, gimple_bb (stmt)))
> > > > > +		continue;
> > > > >  	    }
> > 
> > since the def was on a LC PHI the def should always be defined inside the
> > loop.
> > 
> > > > > +
> > > > > +	  tree new_res = copy_ssa_name (vm.result);
> > > > > +	  gphi *lcssa_phi = create_phi_node (new_res, e->dest);
> > > > > +	  for (edge exit : loop_exits)
> > > > > +	     add_phi_arg (lcssa_phi, vm.def, exit, vm.locus);
> > > >
> > > > not sure what you are doing above - I guess I have to play with it
> > > > in a debug session.
> > >
> > > Yeah if you comment it out one of the testcases should fail.
> > 
> > using new_preheader instead of e->dest would make things clearer.
> > 
> > You are now adding the same arg to every exit (you've just queried the
> > main exit redirect_edge_var_map_vector).
> > 
> > OK, so I think I understand what you're doing.  If I understand
> > correctly we know that when we exit the main loop via one of the
> > early exits we are definitely going to enter the epilog but when
> > we take the main exit we might not.
> 
> Correct

Hmm, so you _didn't_ make your life easier.

> > 
> > Looking at the CFG we create currently this isn't reflected and
> > this complicates this PHI node updating.  What I'd try to do
> > is leave redirecting the alternate exits until after
> > slpeel_tree_duplicate_loop_to_edge_cfg finished which probably
> > means leaving it almost unchanged besides the LC SSA maintaining
> > changes.  After that for the multi-exit case split the
> > epilog preheader edge and redirect all the alternate exits to the
> > new preheader.  So the CFG becomes
> > 
> >                  <original loop>
> >                 /      |
> >                /    <main exit w/ original LC PHI>
> >               /      if (epilog)
> >    alt exits /        /  \
> >             /        /    loop around
> >             |       /
> >            preheader with "header" PHIs
> >               |
> >           <epilog>
> > 
> > note you need the header PHIs also on the main exit path but you
> > only need the loop end PHIs there.
> 
> Ah, hadn't considered this one.  I'll give it a try.
> 
> > 
> > It seems so that at least currently the order of things makes
> > them more complicated than necessary.
> 
> Possibly yeah, there's a lot of work to maintain the dominators
> and phi nodes because the exits are rewritten early.
> 
> I'll give this a go!

Note the above CFG will likely confuse the hell out of the
reduction code (or at least you need to fixup your fixup).

> > 
> > > >
> > > > >  	}
> > > > > -      redirect_edge_and_branch_force (e, new_preheader);
> > > > > -      flush_pending_stmts (e);
> > > > > +
> > > > >        set_immediate_dominator (CDI_DOMINATORS, new_preheader, e-
> > >src);
> > > > > -      if (was_imm_dom || duplicate_outer_loop)
> > > > > +
> > > > > +      if ((was_imm_dom || duplicate_outer_loop) && !multiple_exits_p)
> > > > >  	set_immediate_dominator (CDI_DOMINATORS, exit_dest, new_exit-
> > > > >src);
> > > > >
> > > > >        /* And remove the non-necessary forwarder again.  Keep the other
> > > > > @@ -1647,9 +1756,42 @@ slpeel_tree_duplicate_loop_to_edge_cfg
> > (class
> > > > loop *loop,
> > > > >        delete_basic_block (preheader);
> > > > >        set_immediate_dominator (CDI_DOMINATORS, scalar_loop->header,
> > > > >  			       loop_preheader_edge (scalar_loop)->src);
> > > > > +
> > > > > +      /* Finally after wiring the new epilogue we need to update its main
> > exit
> > > > > +	 to the original function exit we recorded.  Other exits are already
> > > > > +	 correct.  */
> > > > > +      if (multiple_exits_p)
> > > > > +	{
> > > > > +	  for (edge e : get_loop_exit_edges (loop))
> > > > > +	    doms.safe_push (e->dest);
> > > > > +	  update_loop = new_loop;
> > > > > +	  doms.safe_push (exit_dest);
> > > > > +
> > > > > +	  /* Likely a fall-through edge, so update if needed.  */
> > > > > +	  if (single_succ_p (exit_dest))
> > > > > +	    doms.safe_push (single_succ (exit_dest));
> > > > > +	}
> > > > >      }
> > > > >    else /* Add the copy at entry.  */
> > > > >      {
> > > > > +      /* Copy the current loop LC PHI nodes between the original loop exit
> > > > > +	 block and the new loop header.  This allows us to later split the
> > > > > +	 preheader block and still find the right LC nodes.  */
> > > > > +      edge old_latch_loop = loop_latch_edge (loop);
> > > > > +      edge old_latch_init = loop_preheader_edge (loop);
> > > > > +      edge new_latch_loop = loop_latch_edge (new_loop);
> > > > > +      edge new_latch_init = loop_preheader_edge (new_loop);
> > > > > +      for (auto gsi_from = gsi_start_phis (new_latch_init->dest),
> > > >
> > > > see above
> > > >
> > > > > +	   gsi_to = gsi_start_phis (old_latch_loop->dest);
> > > > > +	   flow_loops && !gsi_end_p (gsi_from) && !gsi_end_p (gsi_to);
> > > > > +	   gsi_next (&gsi_from), gsi_next (&gsi_to))
> > > > > +	{
> > > > > +	  gimple *from_phi = gsi_stmt (gsi_from);
> > > > > +	  gimple *to_phi = gsi_stmt (gsi_to);
> > > > > +	  tree new_arg = PHI_ARG_DEF_FROM_EDGE (from_phi,
> > > > new_latch_loop);
> > > > > +	  adjust_phi_and_debug_stmts (to_phi, old_latch_init, new_arg);
> > > > > +	}
> > > > > +
> > > > >        if (scalar_loop != loop)
> > > > >  	{
> > > > >  	  /* Remove the non-necessary forwarder of scalar_loop again.  */
> > > > > @@ -1677,31 +1819,36 @@ slpeel_tree_duplicate_loop_to_edge_cfg
> > (class
> > > > loop *loop,
> > > > >        delete_basic_block (new_preheader);
> > > > >        set_immediate_dominator (CDI_DOMINATORS, new_loop->header,
> > > > >  			       loop_preheader_edge (new_loop)->src);
> > > > > +
> > > > > +      if (multiple_exits_p)
> > > > > +	update_loop = loop;
> > > > >      }
> > > > >
> > > > > -  if (scalar_loop != loop)
> > > > > +  if (multiple_exits_p)
> > > > >      {
> > > > > -      /* Update new_loop->header PHIs, so that on the preheader
> > > > > -	 edge they are the ones from loop rather than scalar_loop.  */
> > > > > -      gphi_iterator gsi_orig, gsi_new;
> > > > > -      edge orig_e = loop_preheader_edge (loop);
> > > > > -      edge new_e = loop_preheader_edge (new_loop);
> > > > > -
> > > > > -      for (gsi_orig = gsi_start_phis (loop->header),
> > > > > -	   gsi_new = gsi_start_phis (new_loop->header);
> > > > > -	   !gsi_end_p (gsi_orig) && !gsi_end_p (gsi_new);
> > > > > -	   gsi_next (&gsi_orig), gsi_next (&gsi_new))
> > > > > +      for (edge e : get_loop_exit_edges (update_loop))
> > > > >  	{
> > > > > -	  gphi *orig_phi = gsi_orig.phi ();
> > > > > -	  gphi *new_phi = gsi_new.phi ();
> > > > > -	  tree orig_arg = PHI_ARG_DEF_FROM_EDGE (orig_phi, orig_e);
> > > > > -	  location_t orig_locus
> > > > > -	    = gimple_phi_arg_location_from_edge (orig_phi, orig_e);
> > > > > -
> > > > > -	  add_phi_arg (new_phi, orig_arg, new_e, orig_locus);
> > > > > +	  edge ex;
> > > > > +	  edge_iterator ei;
> > > > > +	  FOR_EACH_EDGE (ex, ei, e->dest->succs)
> > > > > +	    {
> > > > > +	      /* Find the first non-fallthrough block as fall-throughs can't
> > > > > +		 dominate other blocks.  */
> > > > > +	      while ((ex->flags & EDGE_FALLTHRU)
> > 
> > For the prologue peeling any early exit we take would skip all other
> > loops so we can simply leave them and their LC PHI nodes in place.
> > We need extra PHIs only on the path to the main vector loop.  I
> > think the comment isn't accurately reflecting what we do.  In
> > fact we do not add any LC PHI nodes here but simply adjust the
> > main loop header PHI arguments?
> 
> Yeah we don't create any new nodes after peeling in this version.
> 
> > 
> > > > I don't think EDGE_FALLTHRU is set correctly, what's wrong with
> > > > just using single_succ_p here?  A fallthru edge src dominates the
> > > > fallthru edge dest, so the sentence above doesn't make sense.
> > >
> > > I wanted to say, that the immediate dominator of a block is never
> > > an fall through block.  At least from what I understood from how
> > > the dominators are calculated in the code, though may have missed
> > > something.
> > 
> >  BB1
> >   |
> >  BB2
> >   |
> >  BB3
> > 
> > here the immediate dominator of BB3 is BB2 and that of BB2 is BB1.
> > 
> > > >
> > > > > +		     && single_succ_p (ex->dest))
> > > > > +		{
> > > > > +		  doms.safe_push (ex->dest);
> > > > > +		  ex = single_succ_edge (ex->dest);
> > > > > +		}
> > > > > +	      doms.safe_push (ex->dest);
> > > > > +	    }
> > > > > +	  doms.safe_push (e->dest);
> > > > >  	}
> > > > > -    }
> > > > >
> > > > > +      iterate_fix_dominators (CDI_DOMINATORS, doms, false);
> > > > > +      if (updated_doms)
> > > > > +	updated_doms->safe_splice (doms);
> > > > > +    }
> > > > >    free (new_bbs);
> > > > >    free (bbs);
> > > > >
> > > > > @@ -1777,6 +1924,9 @@ slpeel_can_duplicate_loop_p (const
> > > > loop_vec_info loop_vinfo, const_edge e)
> > > > >    gimple_stmt_iterator loop_exit_gsi = gsi_last_bb (exit_e->src);
> > > > >    unsigned int num_bb = loop->inner? 5 : 2;
> > > > >
> > > > > +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > > +    num_bb += LOOP_VINFO_ALT_EXITS (loop_vinfo).length ();
> > > > > +
> > > >
> > > > I think checking the number of BBs is odd, I don't remember anything
> > > > in slpeel is specifically tied to that?  I think we can simply drop
> > > > this or do you remember anything that would depend on ->num_nodes
> > > > being only exactly 5 or 2?
> > >
> > > Never actually seemed to require it, but they're used as some check to
> > > see if there are unexpected control flow in the loop.
> > >
> > > i.e. this would say no if you have an if statement in the loop that wasn't
> > > converted.  The other part of this and the accompanying explanation is in
> > > vect_analyze_loop_form.  In the patch series I had to remove the hard
> > > num_nodes == 2 check from there because number of nodes restricted
> > > things too much.  If you have an empty fall through block, which seems to
> > > happen often between the main exit and the latch block then we'd not
> > > vectorize.
> > >
> > > So instead I now rejects loops after analyzing the gcond.  So think this check
> > > can go/needs to be different.
> > 
> > Lets remove it from this function then.
> 
> Ok, I can remove it from the outerloop vect then too, since it's mostly dead code.
> Will do so and reg-test that as well.
> 
> > 
> > > >
> > > > >    /* All loops have an outer scope; the only case loop->outer is NULL is for
> > > > >       the function itself.  */
> > > > >    if (!loop_outer (loop)
> > > > > @@ -2044,6 +2194,11 @@ vect_update_ivs_after_vectorizer
> > > > (loop_vec_info loop_vinfo,
> > > > >    class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > > > >    basic_block update_bb = update_e->dest;
> > > > >
> > > > > +  /* For early exits we'll update the IVs in
> > > > > +     vect_update_ivs_after_early_break.  */
> > > > > +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > > +    return;
> > > > > +
> > > > >    basic_block exit_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
> > > > >
> > > > >    /* Make sure there exists a single-predecessor exit bb:  */
> > > > > @@ -2131,6 +2286,208 @@ vect_update_ivs_after_vectorizer
> > > > (loop_vec_info loop_vinfo,
> > > > >        /* Fix phi expressions in the successor bb.  */
> > > > >        adjust_phi_and_debug_stmts (phi1, update_e, ni_name);
> > > > >      }
> > > > > +  return;
> > > >
> > > > we don't usually place a return at the end of void functions
> > > >
> > > > > +}
> > > > > +
> > > > > +/*   Function vect_update_ivs_after_early_break.
> > > > > +
> > > > > +     "Advance" the induction variables of LOOP to the value they should
> > take
> > > > > +     after the execution of LOOP.  This is currently necessary because the
> > > > > +     vectorizer does not handle induction variables that are used after the
> > > > > +     loop.  Such a situation occurs when the last iterations of LOOP are
> > > > > +     peeled, because of the early exit.  With an early exit we always peel
> > the
> > > > > +     loop.
> > > > > +
> > > > > +     Input:
> > > > > +     - LOOP_VINFO - a loop info structure for the loop that is going to be
> > > > > +		    vectorized. The last few iterations of LOOP were peeled.
> > > > > +     - LOOP - a loop that is going to be vectorized. The last few iterations
> > > > > +	      of LOOP were peeled.
> > > > > +     - VF - The loop vectorization factor.
> > > > > +     - NITERS_ORIG - the number of iterations that LOOP executes (before
> > it is
> > > > > +		     vectorized). i.e, the number of times the ivs should be
> > > > > +		     bumped.
> > > > > +     - NITERS_VECTOR - The number of iterations that the vector LOOP
> > > > executes.
> > > > > +     - UPDATE_E - a successor edge of LOOP->exit that is on the (only)
> > path
> > > > > +		  coming out from LOOP on which there are uses of the LOOP
> > > > ivs
> > > > > +		  (this is the path from LOOP->exit to epilog_loop->preheader).
> > > > > +
> > > > > +		  The new definitions of the ivs are placed in LOOP->exit.
> > > > > +		  The phi args associated with the edge UPDATE_E in the bb
> > > > > +		  UPDATE_E->dest are updated accordingly.
> > > > > +
> > > > > +     Output:
> > > > > +       - If available, the LCSSA phi node for the loop IV temp.
> > > > > +
> > > > > +     Assumption 1: Like the rest of the vectorizer, this function assumes
> > > > > +     a single loop exit that has a single predecessor.
> > > > > +
> > > > > +     Assumption 2: The phi nodes in the LOOP header and in update_bb
> > are
> > > > > +     organized in the same order.
> > > > > +
> > > > > +     Assumption 3: The access function of the ivs is simple enough (see
> > > > > +     vect_can_advance_ivs_p).  This assumption will be relaxed in the
> > future.
> > > > > +
> > > > > +     Assumption 4: Exactly one of the successors of LOOP exit-bb is on a
> > path
> > > > > +     coming out of LOOP on which the ivs of LOOP are used (this is the
> > path
> > > > > +     that leads to the epilog loop; other paths skip the epilog loop).  This
> > > > > +     path starts with the edge UPDATE_E, and its destination (denoted
> > > > update_bb)
> > > > > +     needs to have its phis updated.
> > > > > + */
> > > > > +
> > > > > +static tree
> > > > > +vect_update_ivs_after_early_break (loop_vec_info loop_vinfo, class
> > loop *
> > > > epilog,
> > > > > +				   poly_int64 vf, tree niters_orig,
> > > > > +				   tree niters_vector, edge update_e)
> > > > > +{
> > > > > +  if (!LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > > +    return NULL;
> > > > > +
> > > > > +  gphi_iterator gsi, gsi1;
> > > > > +  tree ni_name, ivtmp = NULL;
> > > > > +  basic_block update_bb = update_e->dest;
> > > > > +  vec<edge> alt_exits = LOOP_VINFO_ALT_EXITS (loop_vinfo);
> > > > > +  edge loop_iv = LOOP_VINFO_IV_EXIT (loop_vinfo);
> > > > > +  basic_block exit_bb = loop_iv->dest;
> > > > > +  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > > > > +  gcond *cond = LOOP_VINFO_LOOP_IV_COND (loop_vinfo);
> > > > > +
> > > > > +  gcc_assert (cond);
> > > > > +
> > > > > +  for (gsi = gsi_start_phis (loop->header), gsi1 = gsi_start_phis
> > (update_bb);
> > > > > +       !gsi_end_p (gsi) && !gsi_end_p (gsi1);
> > > > > +       gsi_next (&gsi), gsi_next (&gsi1))
> > > > > +    {
> > > > > +      tree init_expr, final_expr, step_expr;
> > > > > +      tree type;
> > > > > +      tree var, ni, off;
> > > > > +      gimple_stmt_iterator last_gsi;
> > > > > +
> > > > > +      gphi *phi = gsi1.phi ();
> > > > > +      tree phi_ssa = PHI_ARG_DEF_FROM_EDGE (phi,
> > loop_preheader_edge
> > > > (epilog));
> > > >
> > > > I'm confused about the setup.  update_bb looks like the block with the
> > > > loop-closed PHI nodes of 'loop' and the exit (update_e)?  How does
> > > > loop_preheader_edge (epilog) come into play here?  That would feed into
> > > > epilog->header PHIs?!
> > >
> > > We can't query the type of the phis in the block with the LC PHI nodes, so the
> > > Typical pattern seems to be that we iterate over a block that's part of the
> > loop
> > > and that would have the PHIs in the same order, just so we can get to the
> > > stmt_vec_info.
> > >
> > > >
> > > > It would be nice to name 'gsi[1]', 'update_e' and 'update_bb' in a
> > > > better way?  Is update_bb really epilog->header?!
> > > >
> > > > We're missing checking in PHI_ARG_DEF_FROM_EDGE, namely that
> > > > E->dest == gimple_bb (PHI) - we're just using E->dest_idx there
> > > > which "works" even for totally unrelated edges.
> > > >
> > > > > +      gphi *phi1 = dyn_cast <gphi *> (SSA_NAME_DEF_STMT (phi_ssa));
> > > > > +      if (!phi1)
> > > >
> > > > shouldn't that be an assert?
> > > >
> > > > > +	continue;
> > > > > +      stmt_vec_info phi_info = loop_vinfo->lookup_stmt (gsi.phi ());
> > > > > +      if (dump_enabled_p ())
> > > > > +	dump_printf_loc (MSG_NOTE, vect_location,
> > > > > +			 "vect_update_ivs_after_early_break: phi: %G",
> > > > > +			 (gimple *)phi);
> > > > > +
> > > > > +      /* Skip reduction and virtual phis.  */
> > > > > +      if (!iv_phi_p (phi_info))
> > > > > +	{
> > > > > +	  if (dump_enabled_p ())
> > > > > +	    dump_printf_loc (MSG_NOTE, vect_location,
> > > > > +			     "reduc or virtual phi. skip.\n");
> > > > > +	  continue;
> > > > > +	}
> > > > > +
> > > > > +      /* For multiple exits where we handle early exits we need to carry on
> > > > > +	 with the previous IV as loop iteration was not done because we exited
> > > > > +	 early.  As such just grab the original IV.  */
> > > > > +      phi_ssa = PHI_ARG_DEF_FROM_EDGE (gsi.phi (), loop_latch_edge
> > > > (loop));
> > > >
> > > > but this should be taken care of by LC SSA?
> > >
> > > It is, the comment is probably missing details, this part just scales the
> > counter
> > > from VF to scalar counts.  It's just a reminder that this scaling is done
> > differently
> > > from normal single exit vectorization.
> > >
> > > >
> > > > OK, have to continue tomorrow from here.
> > >
> > > Cheers, Thank you!
> > >
> > > Tamar
> > >
> > > >
> > > > Richard.
> > > >
> > > > > +      if (gimple_cond_lhs (cond) != phi_ssa
> > > > > +	  && gimple_cond_rhs (cond) != phi_ssa)
> > 
> > so this is a way to avoid touching the main IV?  Looks a bit fragile to
> > me.  Hmm, we're iterating over the main loop header PHIs here?
> > Can't you check, say, the relevancy of the PHI node instead?  Though
> > it might also be used as induction.  Can't it be used as alternate
> > exit like
> > 
> >   for (i)
> >    {
> >      if (i & bit)
> >        break;
> >    }
> > 
> > and would we need to adjust 'i' then?
> 
> Hmm you're right, I could reject them based on the definition BB and the
> location the main IV is, that's probably safest.
> 
> > 
> > > > > +	{
> > > > > +	  type = TREE_TYPE (gimple_phi_result (phi));
> > > > > +	  step_expr = STMT_VINFO_LOOP_PHI_EVOLUTION_PART (phi_info);
> > > > > +	  step_expr = unshare_expr (step_expr);
> > > > > +
> > > > > +	  /* We previously generated the new merged phi in the same BB as
> > > > the
> > > > > +	     guard.  So use that to perform the scaling on rather than the
> > > > > +	     normal loop phi which don't take the early breaks into account.  */
> > > > > +	  final_expr = gimple_phi_result (phi1);
> > > > > +	  init_expr = PHI_ARG_DEF_FROM_EDGE (gsi.phi (),
> > > > loop_preheader_edge (loop));
> > > > > +
> > > > > +	  tree stype = TREE_TYPE (step_expr);
> > > > > +	  /* For early break the final loop IV is:
> > > > > +	     init + (final - init) * vf which takes into account peeling
> > > > > +	     values and non-single steps.  */
> > > > > +	  off = fold_build2 (MINUS_EXPR, stype,
> > > > > +			     fold_convert (stype, final_expr),
> > > > > +			     fold_convert (stype, init_expr));
> > > > > +	  /* Now adjust for VF to get the final iteration value.  */
> > > > > +	  off = fold_build2 (MULT_EXPR, stype, off, build_int_cst (stype, vf));
> > > > > +
> > > > > +	  /* Adjust the value with the offset.  */
> > > > > +	  if (POINTER_TYPE_P (type))
> > > > > +	    ni = fold_build_pointer_plus (init_expr, off);
> > > > > +	  else
> > > > > +	    ni = fold_convert (type,
> > > > > +			       fold_build2 (PLUS_EXPR, stype,
> > > > > +					    fold_convert (stype, init_expr),
> > > > > +					    off));
> > > > > +	  var = create_tmp_var (type, "tmp");
> > 
> > so how does the non-early break code deal with updating inductions?
> 
> non-early break code does this essentially the same, see vect_update_ivs_after_vectorizer
> which was modified to skip early exits.  The major difference is that on a non-early
> exit the value can just be adjusted linearly: i_scalar = i_vect * VF + init and peeling is taken
> into account by vect_peel_nonlinear_iv_init. 
> 
> I wanted to keep them separately as there's a significant enough difference in the
> calculation of the loop bodies themselves that having one function became unwieldy.
> 
> That and we don't support all of the induction steps a non-early exit supports so I inlined
> a simpler calculation.
> 
> Now there's a big difference between the normal loop vectorization and early break.
> During the main loop vectorization the ivtmp is adjusted for peeling.  That is the amount is
> already adjusted in the phi node itself.
> 
> For early break this isn't done because the bounds are fixed, i.e. for every exit we can at
> most do VF iterations.
>
> > And how do you avoid altering this when we flow in from the normal
> > exit?  That is, you are updating the value in the epilog loop
> > header but don't you need to instead update the value only on
> > the alternate exit edges from the main loop (and keep the not
> > updated value on the main exit edge)?
> 
> Because the ivtmps has not been adjusted for peeling, we adjust for it here.  This allows us
> to have the same update required for all exits, so I don't have to differentiate between the
> two here because I didn't do so during creation of the PHI node.

Can we do this in all cases then?  I really don't like to have two
schemes with differences spread over two places ... if the current
scheme doesn't work for early exit the early exit scheme should
do for zero early exits?  What's the downside of using that 
unconditionally?

> > 
> > > > > +	  last_gsi = gsi_last_bb (exit_bb);
> > > > > +	  gimple_seq new_stmts = NULL;
> > > > > +	  ni_name = force_gimple_operand (ni, &new_stmts, false, var);
> > > > > +	  /* Exit_bb shouldn't be empty.  */
> > > > > +	  if (!gsi_end_p (last_gsi))
> > > > > +	    gsi_insert_seq_after (&last_gsi, new_stmts, GSI_SAME_STMT);
> > > > > +	  else
> > > > > +	    gsi_insert_seq_before (&last_gsi, new_stmts, GSI_SAME_STMT);
> > > > > +
> > > > > +	  /* Fix phi expressions in the successor bb.  */
> > > > > +	  adjust_phi_and_debug_stmts (phi, update_e, ni_name);
> > > > > +	}
> > > > > +      else
> > > > > +	{
> > > > > +	  type = TREE_TYPE (gimple_phi_result (phi));
> > > > > +	  step_expr = STMT_VINFO_LOOP_PHI_EVOLUTION_PART (phi_info);
> > > > > +	  step_expr = unshare_expr (step_expr);
> > > > > +
> > > > > +	  /* We previously generated the new merged phi in the same BB as
> > > > the
> > > > > +	     guard.  So use that to perform the scaling on rather than the
> > > > > +	     normal loop phi which don't take the early breaks into account.  */
> > > > > +	  init_expr = PHI_ARG_DEF_FROM_EDGE (phi1, loop_preheader_edge
> > > > (loop));
> > > > > +	  tree stype = TREE_TYPE (step_expr);
> > > > > +
> > > > > +	  if (vf.is_constant ())
> > > > > +	    {
> > > > > +	      ni = fold_build2 (MULT_EXPR, stype,
> > > > > +				fold_convert (stype,
> > > > > +					      niters_vector),
> > > > > +				build_int_cst (stype, vf));
> > > > > +
> > > > > +	      ni = fold_build2 (MINUS_EXPR, stype,
> > > > > +				fold_convert (stype,
> > > > > +					      niters_orig),
> > > > > +				fold_convert (stype, ni));
> > > > > +	    }
> > > > > +	  else
> > > > > +	    /* If the loop's VF isn't constant then the loop must have been
> > > > > +	       masked, so at the end of the loop we know we have finished
> > > > > +	       the entire loop and found nothing.  */
> > > > > +	    ni = build_zero_cst (stype);
> > > > > +
> > > > > +	  ni = fold_convert (type, ni);
> > > > > +	  /* We don't support variable n in this version yet.  */
> > > > > +	  gcc_assert (TREE_CODE (ni) == INTEGER_CST);
> > > > > +
> > > > > +	  var = create_tmp_var (type, "tmp");
> > > > > +
> > > > > +	  last_gsi = gsi_last_bb (exit_bb);
> > > > > +	  gimple_seq new_stmts = NULL;
> > > > > +	  ni_name = force_gimple_operand (ni, &new_stmts, false, var);
> > > > > +	  /* Exit_bb shouldn't be empty.  */
> > > > > +	  if (!gsi_end_p (last_gsi))
> > > > > +	    gsi_insert_seq_after (&last_gsi, new_stmts, GSI_SAME_STMT);
> > > > > +	  else
> > > > > +	    gsi_insert_seq_before (&last_gsi, new_stmts, GSI_SAME_STMT);
> > > > > +
> > > > > +	  adjust_phi_and_debug_stmts (phi1, loop_iv, ni_name);
> > > > > +
> > > > > +	  for (edge exit : alt_exits)
> > > > > +	    adjust_phi_and_debug_stmts (phi1, exit,
> > > > > +					build_int_cst (TREE_TYPE (step_expr),
> > > > > +						       vf));
> > > > > +	  ivtmp = gimple_phi_result (phi1);
> > > > > +	}
> > > > > +    }
> > > > > +
> > > > > +  return ivtmp;
> > > > >  }
> > > > >
> > > > >  /* Return a gimple value containing the misalignment (measured in
> > vector
> > > > > @@ -2632,137 +2989,34 @@ vect_gen_vector_loop_niters_mult_vf
> > > > (loop_vec_info loop_vinfo,
> > > > >
> > > > >  /* LCSSA_PHI is a lcssa phi of EPILOG loop which is copied from LOOP,
> > > > >     this function searches for the corresponding lcssa phi node in exit
> > > > > -   bb of LOOP.  If it is found, return the phi result; otherwise return
> > > > > -   NULL.  */
> > > > > +   bb of LOOP following the LCSSA_EDGE to the exit node.  If it is found,
> > > > > +   return the phi result; otherwise return NULL.  */
> > > > >
> > > > >  static tree
> > > > >  find_guard_arg (class loop *loop, class loop *epilog
> > ATTRIBUTE_UNUSED,
> > > > > -		gphi *lcssa_phi)
> > > > > +		gphi *lcssa_phi, int lcssa_edge = 0)
> > > > >  {
> > > > >    gphi_iterator gsi;
> > > > >    edge e = loop->vec_loop_iv;
> > > > >
> > > > > -  gcc_assert (single_pred_p (e->dest));
> > > > >    for (gsi = gsi_start_phis (e->dest); !gsi_end_p (gsi); gsi_next (&gsi))
> > > > >      {
> > > > >        gphi *phi = gsi.phi ();
> > > > > -      if (operand_equal_p (PHI_ARG_DEF (phi, 0),
> > > > > -			   PHI_ARG_DEF (lcssa_phi, 0), 0))
> > > > > -	return PHI_RESULT (phi);
> > > > > -    }
> > > > > -  return NULL_TREE;
> > > > > -}
> > > > > -
> > > > > -/* Function slpeel_tree_duplicate_loop_to_edge_cfg duplciates
> > > > FIRST/SECOND
> > > > > -   from SECOND/FIRST and puts it at the original loop's preheader/exit
> > > > > -   edge, the two loops are arranged as below:
> > > > > -
> > > > > -       preheader_a:
> > > > > -     first_loop:
> > > > > -       header_a:
> > > > > -	 i_1 = PHI<i_0, i_2>;
> > > > > -	 ...
> > > > > -	 i_2 = i_1 + 1;
> > > > > -	 if (cond_a)
> > > > > -	   goto latch_a;
> > > > > -	 else
> > > > > -	   goto between_bb;
> > > > > -       latch_a:
> > > > > -	 goto header_a;
> > > > > -
> > > > > -       between_bb:
> > > > > -	 ;; i_x = PHI<i_2>;   ;; LCSSA phi node to be created for FIRST,
> > > > > -
> > > > > -     second_loop:
> > > > > -       header_b:
> > > > > -	 i_3 = PHI<i_0, i_4>; ;; Use of i_0 to be replaced with i_x,
> > > > > -				 or with i_2 if no LCSSA phi is created
> > > > > -				 under condition of
> > > > CREATE_LCSSA_FOR_IV_PHIS.
> > > > > -	 ...
> > > > > -	 i_4 = i_3 + 1;
> > > > > -	 if (cond_b)
> > > > > -	   goto latch_b;
> > > > > -	 else
> > > > > -	   goto exit_bb;
> > > > > -       latch_b:
> > > > > -	 goto header_b;
> > > > > -
> > > > > -       exit_bb:
> > > > > -
> > > > > -   This function creates loop closed SSA for the first loop; update the
> > > > > -   second loop's PHI nodes by replacing argument on incoming edge with
> > the
> > > > > -   result of newly created lcssa PHI nodes.  IF
> > CREATE_LCSSA_FOR_IV_PHIS
> > > > > -   is false, Loop closed ssa phis will only be created for non-iv phis for
> > > > > -   the first loop.
> > > > > -
> > > > > -   This function assumes exit bb of the first loop is preheader bb of the
> > > > > -   second loop, i.e, between_bb in the example code.  With PHIs updated,
> > > > > -   the second loop will execute rest iterations of the first.  */
> > > > > -
> > > > > -static void
> > > > > -slpeel_update_phi_nodes_for_loops (loop_vec_info loop_vinfo,
> > > > > -				   class loop *first, class loop *second,
> > > > > -				   bool create_lcssa_for_iv_phis)
> > > > > -{
> > > > > -  gphi_iterator gsi_update, gsi_orig;
> > > > > -  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > > > > -
> > > > > -  edge first_latch_e = EDGE_SUCC (first->latch, 0);
> > > > > -  edge second_preheader_e = loop_preheader_edge (second);
> > > > > -  basic_block between_bb = single_exit (first)->dest;
> > > > > -
> > > > > -  gcc_assert (between_bb == second_preheader_e->src);
> > > > > -  gcc_assert (single_pred_p (between_bb) && single_succ_p
> > (between_bb));
> > > > > -  /* Either the first loop or the second is the loop to be vectorized.  */
> > > > > -  gcc_assert (loop == first || loop == second);
> > > > > -
> > > > > -  for (gsi_orig = gsi_start_phis (first->header),
> > > > > -       gsi_update = gsi_start_phis (second->header);
> > > > > -       !gsi_end_p (gsi_orig) && !gsi_end_p (gsi_update);
> > > > > -       gsi_next (&gsi_orig), gsi_next (&gsi_update))
> > > > > -    {
> > > > > -      gphi *orig_phi = gsi_orig.phi ();
> > > > > -      gphi *update_phi = gsi_update.phi ();
> > > > > -
> > > > > -      tree arg = PHI_ARG_DEF_FROM_EDGE (orig_phi, first_latch_e);
> > > > > -      /* Generate lcssa PHI node for the first loop.  */
> > > > > -      gphi *vect_phi = (loop == first) ? orig_phi : update_phi;
> > > > > -      stmt_vec_info vect_phi_info = loop_vinfo->lookup_stmt (vect_phi);
> > > > > -      if (create_lcssa_for_iv_phis || !iv_phi_p (vect_phi_info))
> > > > > +      /* Nested loops with multiple exits can have different no# phi node
> > > > > +	 arguments between the main loop and epilog as epilog falls to the
> > > > > +	 second loop.  */
> > > > > +      if (gimple_phi_num_args (phi) > e->dest_idx)
> > > > >  	{
> > > > > -	  tree new_res = copy_ssa_name (PHI_RESULT (orig_phi));
> > > > > -	  gphi *lcssa_phi = create_phi_node (new_res, between_bb);
> > > > > -	  add_phi_arg (lcssa_phi, arg, single_exit (first),
> > > > UNKNOWN_LOCATION);
> > > > > -	  arg = new_res;
> > > > > -	}
> > > > > -
> > > > > -      /* Update PHI node in the second loop by replacing arg on the loop's
> > > > > -	 incoming edge.  */
> > > > > -      adjust_phi_and_debug_stmts (update_phi, second_preheader_e,
> > arg);
> > > > > -    }
> > > > > -
> > > > > -  /* For epilogue peeling we have to make sure to copy all LC PHIs
> > > > > -     for correct vectorization of live stmts.  */
> > > > > -  if (loop == first)
> > > > > -    {
> > > > > -      basic_block orig_exit = single_exit (second)->dest;
> > > > > -      for (gsi_orig = gsi_start_phis (orig_exit);
> > > > > -	   !gsi_end_p (gsi_orig); gsi_next (&gsi_orig))
> > > > > -	{
> > > > > -	  gphi *orig_phi = gsi_orig.phi ();
> > > > > -	  tree orig_arg = PHI_ARG_DEF (orig_phi, 0);
> > > > > -	  if (TREE_CODE (orig_arg) != SSA_NAME || virtual_operand_p
> > > > (orig_arg))
> > > > > -	    continue;
> > > > > -
> > > > > -	  /* Already created in the above loop.   */
> > > > > -	  if (find_guard_arg (first, second, orig_phi))
> > > > > +	  tree var = PHI_ARG_DEF (phi, e->dest_idx);
> > > > > +	  if (TREE_CODE (var) != SSA_NAME)
> > > > >  	    continue;
> > > > >
> > > > > -	  tree new_res = copy_ssa_name (orig_arg);
> > > > > -	  gphi *lcphi = create_phi_node (new_res, between_bb);
> > > > > -	  add_phi_arg (lcphi, orig_arg, single_exit (first),
> > > > UNKNOWN_LOCATION);
> > > > > +	  if (operand_equal_p (get_current_def (var),
> > > > > +			       PHI_ARG_DEF (lcssa_phi, lcssa_edge), 0))
> > > > > +	    return PHI_RESULT (phi);
> > > > >  	}
> > > > >      }
> > > > > +  return NULL_TREE;
> > > > >  }
> > > > >
> > > > >  /* Function slpeel_add_loop_guard adds guard skipping from the
> > beginning
> > > > > @@ -2910,13 +3164,11 @@ slpeel_update_phi_nodes_for_guard2
> > (class
> > > > loop *loop, class loop *epilog,
> > > > >    gcc_assert (single_succ_p (merge_bb));
> > > > >    edge e = single_succ_edge (merge_bb);
> > > > >    basic_block exit_bb = e->dest;
> > > > > -  gcc_assert (single_pred_p (exit_bb));
> > > > > -  gcc_assert (single_pred (exit_bb) == single_exit (epilog)->dest);
> > > > >
> > > > >    for (gsi = gsi_start_phis (exit_bb); !gsi_end_p (gsi); gsi_next (&gsi))
> > > > >      {
> > > > >        gphi *update_phi = gsi.phi ();
> > > > > -      tree old_arg = PHI_ARG_DEF (update_phi, 0);
> > > > > +      tree old_arg = PHI_ARG_DEF (update_phi, e->dest_idx);
> > > > >
> > > > >        tree merge_arg = NULL_TREE;
> > > > >
> > > > > @@ -2928,7 +3180,7 @@ slpeel_update_phi_nodes_for_guard2 (class
> > loop
> > > > *loop, class loop *epilog,
> > > > >        if (!merge_arg)
> > > > >  	merge_arg = old_arg;
> > > > >
> > > > > -      tree guard_arg = find_guard_arg (loop, epilog, update_phi);
> > > > > +      tree guard_arg = find_guard_arg (loop, epilog, update_phi, e-
> > >dest_idx);
> > > > >        /* If the var is live after loop but not a reduction, we simply
> > > > >  	 use the old arg.  */
> > > > >        if (!guard_arg)
> > > > > @@ -2948,21 +3200,6 @@ slpeel_update_phi_nodes_for_guard2 (class
> > > > loop *loop, class loop *epilog,
> > > > >      }
> > > > >  }
> > > > >
> > > > > -/* EPILOG loop is duplicated from the original loop for vectorizing,
> > > > > -   the arg of its loop closed ssa PHI needs to be updated.  */
> > > > > -
> > > > > -static void
> > > > > -slpeel_update_phi_nodes_for_lcssa (class loop *epilog)
> > > > > -{
> > > > > -  gphi_iterator gsi;
> > > > > -  basic_block exit_bb = single_exit (epilog)->dest;
> > > > > -
> > > > > -  gcc_assert (single_pred_p (exit_bb));
> > > > > -  edge e = EDGE_PRED (exit_bb, 0);
> > > > > -  for (gsi = gsi_start_phis (exit_bb); !gsi_end_p (gsi); gsi_next (&gsi))
> > > > > -    rename_use_op (PHI_ARG_DEF_PTR_FROM_EDGE (gsi.phi (), e));
> > > > > -}
> > > > > -
> > 
> > I wonder if we can still split these changes out to before early break
> > vect?
> 
> Maybe, I hadn't done so before because I was redirecting the edges during peeling.
> If we no longer do that that may be easier.  Let me try to.
>
> > 
> > > > >  /* EPILOGUE_VINFO is an epilogue loop that we now know would need
> > to
> > > > >     iterate exactly CONST_NITERS times.  Make a final decision about
> > > > >     whether the epilogue loop should be used, returning true if so.  */
> > > > > @@ -3138,6 +3375,14 @@ vect_do_peeling (loop_vec_info loop_vinfo,
> > tree
> > > > niters, tree nitersm1,
> > > > >      bound_epilog += vf - 1;
> > > > >    if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
> > > > >      bound_epilog += 1;
> > > > > +  /* For early breaks the scalar loop needs to execute at most VF times
> > > > > +     to find the element that caused the break.  */
> > > > > +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > > +    {
> > > > > +      bound_epilog = vf;
> > > > > +      /* Force a scalar epilogue as we can't vectorize the index finding.  */
> > > > > +      vect_epilogues = false;
> > > > > +    }
> > > > >    bool epilog_peeling = maybe_ne (bound_epilog, 0U);
> > > > >    poly_uint64 bound_scalar = bound_epilog;
> > > > >
> > > > > @@ -3297,16 +3542,24 @@ vect_do_peeling (loop_vec_info
> > loop_vinfo,
> > > > tree niters, tree nitersm1,
> > > > >  				  bound_prolog + bound_epilog)
> > > > >  		      : (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
> > > > >  			 || vect_epilogues));
> > > > > +
> > > > > +  /* We only support early break vectorization on known bounds at this
> > > > time.
> > > > > +     This means that if the vector loop can't be entered then we won't
> > > > generate
> > > > > +     it at all.  So for now force skip_vector off because the additional
> > control
> > > > > +     flow messes with the BB exits and we've already analyzed them.  */
> > > > > + skip_vector = skip_vector && !LOOP_VINFO_EARLY_BREAKS
> > (loop_vinfo);
> > > > > +
> > 
> > I think it should be as "easy" as entering the epilog via the block taking
> > the regular exit?
> > 
> > > > >    /* Epilog loop must be executed if the number of iterations for epilog
> > > > >       loop is known at compile time, otherwise we need to add a check at
> > > > >       the end of vector loop and skip to the end of epilog loop.  */
> > > > >    bool skip_epilog = (prolog_peeling < 0
> > > > >  		      || !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > > > >  		      || !vf.is_constant ());
> > > > > -  /* PEELING_FOR_GAPS is special because epilog loop must be executed.
> > */
> > > > > -  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
> > > > > +  /* PEELING_FOR_GAPS and peeling for early breaks are special because
> > > > epilog
> > > > > +     loop must be executed.  */
> > > > > +  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> > > > > +      || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > >      skip_epilog = false;
> > > > > -
> > > > >    class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
> > > > >    auto_vec<profile_count> original_counts;
> > > > >    basic_block *original_bbs = NULL;
> > > > > @@ -3344,13 +3597,13 @@ vect_do_peeling (loop_vec_info
> > loop_vinfo,
> > > > tree niters, tree nitersm1,
> > > > >    if (prolog_peeling)
> > > > >      {
> > > > >        e = loop_preheader_edge (loop);
> > > > > -      gcc_checking_assert (slpeel_can_duplicate_loop_p (loop, e));
> > > > > -
> > > > > +      gcc_checking_assert (slpeel_can_duplicate_loop_p (loop_vinfo, e));
> > > > >        /* Peel prolog and put it on preheader edge of loop.  */
> > > > > -      prolog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, scalar_loop,
> > e);
> > > > > +      prolog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, scalar_loop,
> > e,
> > > > > +						       true);
> > > > >        gcc_assert (prolog);
> > > > >        prolog->force_vectorize = false;
> > > > > -      slpeel_update_phi_nodes_for_loops (loop_vinfo, prolog, loop, true);
> > > > > +
> > > > >        first_loop = prolog;
> > > > >        reset_original_copy_tables ();
> > > > >
> > > > > @@ -3420,11 +3673,12 @@ vect_do_peeling (loop_vec_info
> > loop_vinfo,
> > > > tree niters, tree nitersm1,
> > > > >  	 as the transformations mentioned above make less or no sense when
> > > > not
> > > > >  	 vectorizing.  */
> > > > >        epilog = vect_epilogues ? get_loop_copy (loop) : scalar_loop;
> > > > > -      epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, epilog, e);
> > > > > +      auto_vec<basic_block> doms;
> > > > > +      epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, epilog, e,
> > true,
> > > > > +						       &doms);
> > > > >        gcc_assert (epilog);
> > > > >
> > > > >        epilog->force_vectorize = false;
> > > > > -      slpeel_update_phi_nodes_for_loops (loop_vinfo, loop, epilog, false);
> > > > >
> > > > >        /* Scalar version loop may be preferred.  In this case, add guard
> > > > >  	 and skip to epilog.  Note this only happens when the number of
> > > > > @@ -3496,6 +3750,54 @@ vect_do_peeling (loop_vec_info loop_vinfo,
> > tree
> > > > niters, tree nitersm1,
> > > > >        vect_update_ivs_after_vectorizer (loop_vinfo, niters_vector_mult_vf,
> > > > >  					update_e);
> > > > >
> > > > > +      /* For early breaks we must create a guard to check how many
> > iterations
> > > > > +	 of the scalar loop are yet to be performed.  */
> > 
> > We have this check anyway, no?  In fact don't we know that we always enter
> > the epilog (see above)?
> 
> Not always, masked loops for instance never enter the epilogue if the main loop
> finished completely since there is no "remainder".
> 
> > 
> > > > > +      if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > > +	{
> > > > > +	  tree ivtmp =
> > > > > +	    vect_update_ivs_after_early_break (loop_vinfo, epilog, vf, niters,
> > > > > +					       *niters_vector, update_e);
> > > > > +
> > > > > +	  gcc_assert (ivtmp);
> > > > > +	  tree guard_cond = fold_build2 (EQ_EXPR, boolean_type_node,
> > > > > +					 fold_convert (TREE_TYPE (niters),
> > > > > +						       ivtmp),
> > > > > +					 build_zero_cst (TREE_TYPE (niters)));
> > > > > +	  basic_block guard_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
> > > > > +
> > > > > +	  /* If we had a fallthrough edge, the guard will the threaded through
> > > > > +	     and so we may need to find the actual final edge.  */
> > > > > +	  edge final_edge = epilog->vec_loop_iv;
> > > > > +	  /* slpeel_update_phi_nodes_for_guard2 expects an empty block in
> > > > > +	     between the guard and the exit edge.  It only adds new nodes and
> > > > > +	     doesn't update existing one in the current scheme.  */
> > > > > +	  basic_block guard_to = split_edge (final_edge);
> > > > > +	  edge guard_e = slpeel_add_loop_guard (guard_bb, guard_cond,
> > > > guard_to,
> > > > > +						guard_bb, prob_epilog.invert
> > > > (),
> > > > > +						irred_flag);
> > > > > +	  doms.safe_push (guard_bb);
> > > > > +
> > > > > +	  iterate_fix_dominators (CDI_DOMINATORS, doms, false);
> > > > > +
> > > > > +	  /* We must update all the edges from the new guard_bb.  */
> > > > > +	  slpeel_update_phi_nodes_for_guard2 (loop, epilog, guard_e,
> > > > > +					      final_edge);
> > > > > +
> > > > > +	  /* If the loop was versioned we'll have an intermediate BB between
> > > > > +	     the guard and the exit.  This intermediate block is required
> > > > > +	     because in the current scheme of things the guard block phi
> > > > > +	     updating can only maintain LCSSA by creating new blocks.  In this
> > > > > +	     case we just need to update the uses in this block as well.  */
> > > > > +	  if (loop != scalar_loop)
> > > > > +	    {
> > > > > +	      for (gphi_iterator gsi = gsi_start_phis (guard_to);
> > > > > +		   !gsi_end_p (gsi); gsi_next (&gsi))
> > > > > +		rename_use_op (PHI_ARG_DEF_PTR_FROM_EDGE (gsi.phi (),
> > > > guard_e));
> > > > > +	    }
> > > > > +
> > > > > +	  flush_pending_stmts (guard_e);
> > > > > +	}
> > > > > +
> > > > >        if (skip_epilog)
> > > > >  	{
> > > > >  	  guard_cond = fold_build2 (EQ_EXPR, boolean_type_node,
> > > > > @@ -3520,8 +3822,6 @@ vect_do_peeling (loop_vec_info loop_vinfo,
> > tree
> > > > niters, tree nitersm1,
> > > > >  	    }
> > > > >  	  scale_loop_profile (epilog, prob_epilog, 0);
> > > > >  	}
> > > > > -      else
> > > > > -	slpeel_update_phi_nodes_for_lcssa (epilog);
> > > > >
> > > > >        unsigned HOST_WIDE_INT bound;
> > > > >        if (bound_scalar.is_constant (&bound))
> > > > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > > > > index
> > > >
> > b4a98de80aa39057fc9b17977dd0e347b4f0fb5d..ab9a2048186f461f5ec49
> > > > f21421958e7ee25eada 100644
> > > > > --- a/gcc/tree-vect-loop.cc
> > > > > +++ b/gcc/tree-vect-loop.cc
> > > > > @@ -1007,6 +1007,8 @@ _loop_vec_info::_loop_vec_info (class loop
> > > > *loop_in, vec_info_shared *shared)
> > > > >      partial_load_store_bias (0),
> > > > >      peeling_for_gaps (false),
> > > > >      peeling_for_niter (false),
> > > > > +    early_breaks (false),
> > > > > +    non_break_control_flow (false),
> > > > >      no_data_dependencies (false),
> > > > >      has_mask_store (false),
> > > > >      scalar_loop_scaling (profile_probability::uninitialized ()),
> > > > > @@ -1199,6 +1201,14 @@ vect_need_peeling_or_partial_vectors_p
> > > > (loop_vec_info loop_vinfo)
> > > > >      th = LOOP_VINFO_COST_MODEL_THRESHOLD
> > > > (LOOP_VINFO_ORIG_LOOP_INFO
> > > > >  					  (loop_vinfo));
> > > > >
> > > > > +  /* When we have multiple exits and VF is unknown, we must require
> > > > partial
> > > > > +     vectors because the loop bounds is not a minimum but a maximum.
> > > > That is to
> > > > > +     say we cannot unpredicate the main loop unless we peel or use partial
> > > > > +     vectors in the epilogue.  */
> > > > > +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
> > > > > +      && !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ())
> > > > > +    return true;
> > > > > +
> > > > >    if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > > > >        && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) >= 0)
> > > > >      {
> > > > > @@ -1652,12 +1662,12 @@ vect_compute_single_scalar_iteration_cost
> > > > (loop_vec_info loop_vinfo)
> > > > >    loop_vinfo->scalar_costs->finish_cost (nullptr);
> > > > >  }
> > > > >
> > > > > -
> > > > >  /* Function vect_analyze_loop_form.
> > > > >
> > > > >     Verify that certain CFG restrictions hold, including:
> > > > >     - the loop has a pre-header
> > > > > -   - the loop has a single entry and exit
> > > > > +   - the loop has a single entry
> > > > > +   - nested loops can have only a single exit.
> > > > >     - the loop exit condition is simple enough
> > > > >     - the number of iterations can be analyzed, i.e, a countable loop.  The
> > > > >       niter could be analyzed under some assumptions.  */
> > > > > @@ -1693,11 +1703,6 @@ vect_analyze_loop_form (class loop *loop,
> > > > vect_loop_form_info *info)
> > > > >                             |
> > > > >                          (exit-bb)  */
> > > > >
> > > > > -      if (loop->num_nodes != 2)
> > > > > -	return opt_result::failure_at (vect_location,
> > > > > -				       "not vectorized:"
> > > > > -				       " control flow in loop.\n");
> > > > > -
> > > > >        if (empty_block_p (loop->header))
> > > > >  	return opt_result::failure_at (vect_location,
> > > > >  				       "not vectorized: empty loop.\n");
> > > > > @@ -1768,11 +1773,13 @@ vect_analyze_loop_form (class loop *loop,
> > > > vect_loop_form_info *info)
> > > > >          dump_printf_loc (MSG_NOTE, vect_location,
> > > > >  			 "Considering outer-loop vectorization.\n");
> > > > >        info->inner_loop_cond = inner.loop_cond;
> > > > > +
> > > > > +      if (!single_exit (loop))
> > > > > +	return opt_result::failure_at (vect_location,
> > > > > +				       "not vectorized: multiple exits.\n");
> > > > > +
> > > > >      }
> > > > >
> > > > > -  if (!single_exit (loop))
> > > > > -    return opt_result::failure_at (vect_location,
> > > > > -				   "not vectorized: multiple exits.\n");
> > > > >    if (EDGE_COUNT (loop->header->preds) != 2)
> > > > >      return opt_result::failure_at (vect_location,
> > > > >  				   "not vectorized:"
> > > > > @@ -1788,11 +1795,36 @@ vect_analyze_loop_form (class loop *loop,
> > > > vect_loop_form_info *info)
> > > > >  				   "not vectorized: latch block not empty.\n");
> > > > >
> > > > >    /* Make sure the exit is not abnormal.  */
> > > > > -  edge e = single_exit (loop);
> > > > > -  if (e->flags & EDGE_ABNORMAL)
> > > > > -    return opt_result::failure_at (vect_location,
> > > > > -				   "not vectorized:"
> > > > > -				   " abnormal loop exit edge.\n");
> > > > > +  auto_vec<edge> exits = get_loop_exit_edges (loop);
> > > > > +  edge nexit = loop->vec_loop_iv;
> > > > > +  for (edge e : exits)
> > > > > +    {
> > > > > +      if (e->flags & EDGE_ABNORMAL)
> > > > > +	return opt_result::failure_at (vect_location,
> > > > > +				       "not vectorized:"
> > > > > +				       " abnormal loop exit edge.\n");
> > > > > +      /* Early break BB must be after the main exit BB.  In theory we should
> > > > > +	 be able to vectorize the inverse order, but the current flow in the
> > > > > +	 the vectorizer always assumes you update successor PHI nodes, not
> > > > > +	 preds.  */
> > > > > +      if (e != nexit && !dominated_by_p (CDI_DOMINATORS, nexit->src, e-
> > > > >src))
> > > > > +	return opt_result::failure_at (vect_location,
> > > > > +				       "not vectorized:"
> > > > > +				       " abnormal loop exit edge order.\n");
> > 
> > "unsupported loop exit order", but I don't understand the comment.
> > 
> 
> One failure I found during bootstrap is that there was a CFG where the BBs were all
> reversed.  Should have it in the testsuite, I'll dig it out and come back to you.
> 
> > > > > +    }
> > > > > +
> > > > > +  /* We currently only support early exit loops with known bounds.   */
> > 
> > Btw, why's that?  Is that because we don't support the loop-around edge?
> > IMHO this is the most serious limitation (and as said above it should be
> > trivial to fix).
> 
> Nah, it's just time ? I wanted to start getting feedback before relaxing it.
> My patch 0/19 has an implementation plan for the remaining work.
> 
> I plan to relax this in this release, most likely in this series itself.
> 
> > 
> > > > > +  if (exits.length () > 1)
> > > > > +    {
> > > > > +      class tree_niter_desc niter;
> > > > > +      if (!number_of_iterations_exit_assumptions (loop, nexit, &niter,
> > NULL)
> > > > > +	  || chrec_contains_undetermined (niter.niter)
> > > > > +	  || !evolution_function_is_constant_p (niter.niter))
> > > > > +	return opt_result::failure_at (vect_location,
> > > > > +				       "not vectorized:"
> > > > > +				       " early breaks only supported on loops"
> > > > > +				       " with known iteration bounds.\n");
> > > > > +    }
> > > > >
> > > > >    info->conds
> > > > >      = vect_get_loop_niters (loop, &info->assumptions,
> > > > > @@ -1866,6 +1898,10 @@ vect_create_loop_vinfo (class loop *loop,
> > > > vec_info_shared *shared,
> > > > >    LOOP_VINFO_LOOP_CONDS (loop_vinfo).safe_splice (info-
> > > > >alt_loop_conds);
> > > > >    LOOP_VINFO_LOOP_IV_COND (loop_vinfo) = info->loop_cond;
> > > > >
> > > > > +  /* Check to see if we're vectorizing multiple exits.  */
> > > > > +  LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
> > > > > +    = !LOOP_VINFO_LOOP_CONDS (loop_vinfo).is_empty ();
> > > > > +
> > > > >    if (info->inner_loop_cond)
> > > > >      {
> > > > >        stmt_vec_info inner_loop_cond_info
> > > > > @@ -3070,7 +3106,8 @@ start_over:
> > > > >
> > > > >    /* If an epilogue loop is required make sure we can create one.  */
> > > > >    if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> > > > > -      || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo))
> > > > > +      || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)
> > > > > +      || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > >      {
> > > > >        if (dump_enabled_p ())
> > > > >          dump_printf_loc (MSG_NOTE, vect_location, "epilog loop
> > required\n");
> > > > > @@ -5797,7 +5834,7 @@ vect_create_epilog_for_reduction
> > (loop_vec_info
> > > > loop_vinfo,
> > > > >    basic_block exit_bb;
> > > > >    tree scalar_dest;
> > > > >    tree scalar_type;
> > > > > -  gimple *new_phi = NULL, *phi;
> > > > > +  gimple *new_phi = NULL, *phi = NULL;
> > > > >    gimple_stmt_iterator exit_gsi;
> > > > >    tree new_temp = NULL_TREE, new_name, new_scalar_dest;
> > > > >    gimple *epilog_stmt = NULL;
> > > > > @@ -6039,6 +6076,33 @@ vect_create_epilog_for_reduction
> > > > (loop_vec_info loop_vinfo,
> > > > >  	  new_def = gimple_convert (&stmts, vectype, new_def);
> > > > >  	  reduc_inputs.quick_push (new_def);
> > > > >  	}
> > > > > +
> > > > > +	/* Update the other exits.  */
> > > > > +	if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > > +	  {
> > > > > +	    vec<edge> alt_exits = LOOP_VINFO_ALT_EXITS (loop_vinfo);
> > > > > +	    gphi_iterator gsi, gsi1;
> > > > > +	    for (edge exit : alt_exits)
> > > > > +	      {
> > > > > +		/* Find the phi node to propaget into the exit block for each
> > > > > +		   exit edge.  */
> > > > > +		for (gsi = gsi_start_phis (exit_bb),
> > > > > +		     gsi1 = gsi_start_phis (exit->src);
> > 
> > exit->src == loop->header, right?  I think this won't work for multiple
> > alternate exits.  It's probably easier to do this where we create the
> > LC PHI node for the reduction result?
> 
> No exit->src == definition block of the gcond. 
>
> > 
> > > > > +		     !gsi_end_p (gsi) && !gsi_end_p (gsi1);
> > > > > +		     gsi_next (&gsi), gsi_next (&gsi1))
> > > > > +		  {
> > > > > +		    /* There really should be a function to just get the number
> > > > > +		       of phis inside a bb.  */
> > > > > +		    if (phi && phi == gsi.phi ())
> > > > > +		      {
> > > > > +			gphi *phi1 = gsi1.phi ();
> > > > > +			SET_PHI_ARG_DEF (phi, exit->dest_idx,
> > > > > +					 PHI_RESULT (phi1));

But the definition block of the gcond doesn't have a PHI?!

for (;;)
{
  if (alt-exit1) <- exit->src is the loop header in this case
    ..
  if (alt-exit2) <- exit->src doesn't have any PHIs
    ..
  if (IV)
    ..
}

so I think the above "works" for alt-exit1 but not alt-exit2?  Should't
you look at loop->header for phi1?

> > 
> > I think we know the header PHI of a reduction perfectly well, there
> > shouldn't be the need to "search" for it.
> > 
> > > > > +			break;
> > > > > +		      }
> > > > > +		  }
> > > > > +	      }
> > > > > +	  }
> > > > >        gsi_insert_seq_before (&exit_gsi, stmts, GSI_SAME_STMT);
> > > > >      }
> > > > >
> > > > > @@ -10355,6 +10419,13 @@ vectorizable_live_operation (vec_info
> > *vinfo,
> > > > >  	   new_tree = lane_extract <vec_lhs', ...>;
> > > > >  	   lhs' = new_tree;  */
> > > > >
> > > > > +      /* When vectorizing an early break, any live statement that is used
> > > > > +	 outside of the loop are dead.  The loop will never get to them.
> > > > > +	 We could change the liveness value during analysis instead but since
> > > > > +	 the below code is invalid anyway just ignore it during codegen.  */
> > > > > +      if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > > +	return true;
> > 
> > But what about the value that's live across the main exit when the
> > epilogue is not entered?
> 
> My understanding is that vectorizable_live_operation only vectorizes statements within
> a loop that can be live outside, In this case e.g. statements inside the body of the IF.
> 
> What you're describing above is done by other vectorizable_reduction etc.
> 
> That said, I can make this much safer by just restricting it to statements inside the same BB
> as an alt exit BB.

So is this a restriction you pose on LOOP_VINFO_EARLY_BREAKS during
early analysis?  Because as you return true above you claim you can
vectorize all live operations just fine.  A live operation is for example

 for (i)
  {
    tem = a[i];
    b[i] = tem;
  }
.. use tem here ..

this isn't a reduction or induction.  You don't need 'tem' on the
path to the epilog loop (if you enter it), since the epilog will
compute the correct 'tem', but on the path through the regular
exit when that doesn't enter the epilog because no iterations are
left you _do_ need the value of 'tem' and thus have to code generate
it.

> > 
> > > > > +
> > > > >        class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > > > >        basic_block exit_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
> > > > >        gcc_assert (single_pred_p (exit_bb));
> > > > > @@ -11277,7 +11348,7 @@ vect_transform_loop (loop_vec_info
> > > > loop_vinfo, gimple *loop_vectorized_call)
> > > > >    /* Make sure there exists a single-predecessor exit bb.  Do this before
> > > > >       versioning.   */
> > > > >    edge e = LOOP_VINFO_IV_EXIT (loop_vinfo);
> > > > > -  if (! single_pred_p (e->dest))
> > > > > +  if (e && ! single_pred_p (e->dest) && !LOOP_VINFO_EARLY_BREAKS
> > > > (loop_vinfo))
> > 
> > e can be NULL here?  I think we should reject such loops earlier.
> > 
> 
> Ah no, that's left-over from when this used single_exit.  It should be removed
> In this patch.  Had missed it, sorry.
> 
> > > > >      {
> > > > >        split_loop_exit_edge (e, true);
> > > > >        if (dump_enabled_p ())
> > > > > @@ -11303,7 +11374,7 @@ vect_transform_loop (loop_vec_info
> > > > loop_vinfo, gimple *loop_vectorized_call)
> > > > >    if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo))
> > > > >      {
> > > > >        e = single_exit (LOOP_VINFO_SCALAR_LOOP (loop_vinfo));
> > > > > -      if (! single_pred_p (e->dest))
> > > > > +      if (e && ! single_pred_p (e->dest))
> > > > >  	{
> > > > >  	  split_loop_exit_edge (e, true);
> > > > >  	  if (dump_enabled_p ())
> > > > > @@ -11641,7 +11712,8 @@ vect_transform_loop (loop_vec_info
> > > > loop_vinfo, gimple *loop_vectorized_call)
> > > > >
> > > > >    /* Loops vectorized with a variable factor won't benefit from
> > > > >       unrolling/peeling.  */
> > 
> > update the comment?  Why would we unroll a VLA loop with early breaks?
> > Or did you mean to use || LOOP_VINFO_EARLY_BREAKS (loop_vinfo)?
> > 
> 
> Ah indeed, should be ||.
> 
> > > > > -  if (!vf.is_constant ())
> > > > > +  if (!vf.is_constant ()
> > > > > +      && !LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > >      {
> > > > >        loop->unroll = 1;
> > > > >        if (dump_enabled_p ())
> > > > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > > > > index
> > > >
> > 87c4353fa5180fcb7f60b192897456cf24f3fdbe..03524e8500ee06df42f82af
> > > > e78ee2a7c627be45b 100644
> > > > > --- a/gcc/tree-vect-stmts.cc
> > > > > +++ b/gcc/tree-vect-stmts.cc
> > > > > @@ -344,9 +344,34 @@ vect_stmt_relevant_p (stmt_vec_info
> > stmt_info,
> > > > loop_vec_info loop_vinfo,
> > > > >    *live_p = false;
> > > > >
> > > > >    /* cond stmt other than loop exit cond.  */
> > > > > -  if (is_ctrl_stmt (stmt_info->stmt)
> > > > > -      && STMT_VINFO_TYPE (stmt_info) != loop_exit_ctrl_vec_info_type)
> > > > > -    *relevant = vect_used_in_scope;
> > 
> > how was that ever hit before?  For outer loop processing with outer loop
> > vectorization?
> >
> 
> I believe so, because the outer-loop would see the exit cond of the inner loop as well.
> 
> > > > > +  if (is_ctrl_stmt (stmt_info->stmt))
> > > > > +    {
> > > > > +      /* Ideally EDGE_LOOP_EXIT would have been set on the exit edge,
> > but
> > > > > +	 it looks like loop_manip doesn't do that..  So we have to do it
> > > > > +	 the hard way.  */
> > > > > +      basic_block bb = gimple_bb (stmt_info->stmt);
> > > > > +      bool exit_bb = false, early_exit = false;
> > > > > +      edge_iterator ei;
> > > > > +      edge e;
> > > > > +      FOR_EACH_EDGE (e, ei, bb->succs)
> > > > > +        if (!flow_bb_inside_loop_p (loop, e->dest))
> > > > > +	  {
> > > > > +	    exit_bb = true;
> > > > > +	    early_exit = loop->vec_loop_iv->src != bb;
> > > > > +	    break;
> > > > > +	  }
> > > > > +
> > > > > +      /* We should have processed any exit edge, so an edge not an early
> > > > > +	 break must be a loop IV edge.  We need to distinguish between the
> > > > > +	 two as we don't want to generate code for the main loop IV.  */
> > > > > +      if (exit_bb)
> > > > > +	{
> > > > > +	  if (early_exit)
> > > > > +	    *relevant = vect_used_in_scope;
> > > > > +	}
> > 
> > I wonder why you can't simply do
> > 
> >          if (is_ctrl_stmt (stmt_info->stmt)
> >              && stmt_info->stmt != LOOP_VINFO_COND (loop_info))
> > 
> > ?
> > 
> > > > > +      else if (bb->loop_father == loop)
> > > > > +	LOOP_VINFO_GENERAL_CTR_FLOW (loop_vinfo) = true;
> > 
> > so for control flow not exiting the loop you can check
> > loop_exits_from_bb_p ().
> > 
> > > > > +    }
> > > > >
> > > > >    /* changing memory.  */
> > > > >    if (gimple_code (stmt_info->stmt) != GIMPLE_PHI)
> > > > > @@ -359,6 +384,11 @@ vect_stmt_relevant_p (stmt_vec_info
> > stmt_info,
> > > > loop_vec_info loop_vinfo,
> > > > >  	*relevant = vect_used_in_scope;
> > > > >        }
> > > > >
> > > > > +  auto_vec<edge> exits = get_loop_exit_edges (loop);
> > > > > +  auto_bitmap exit_bbs;
> > > > > +  for (edge exit : exits)
> > > > > +    bitmap_set_bit (exit_bbs, exit->dest->index);
> > > > > +
> > > > >    /* uses outside the loop.  */
> > > > >    FOR_EACH_PHI_OR_STMT_DEF (def_p, stmt_info->stmt, op_iter,
> > > > SSA_OP_DEF)
> > > > >      {
> > > > > @@ -377,7 +407,7 @@ vect_stmt_relevant_p (stmt_vec_info stmt_info,
> > > > loop_vec_info loop_vinfo,
> > > > >  	      /* We expect all such uses to be in the loop exit phis
> > > > >  		 (because of loop closed form)   */
> > > > >  	      gcc_assert (gimple_code (USE_STMT (use_p)) == GIMPLE_PHI);
> > > > > -	      gcc_assert (bb == single_exit (loop)->dest);
> > > > > +	      gcc_assert (bitmap_bit_p (exit_bbs, bb->index));
> > 
> > That now becomes quite expensive checking already covered by the LC SSA
> > verifier so I suggest to simply drop this assert instead.
> > 
> > > > >                *live_p = true;
> > > > >  	    }
> > > > > @@ -683,6 +713,13 @@ vect_mark_stmts_to_be_vectorized
> > > > (loop_vec_info loop_vinfo, bool *fatal)
> > > > >  	}
> > > > >      }
> > > > >
> > > > > +  /* Ideally this should be in vect_analyze_loop_form but we haven't
> > seen all
> > > > > +     the conds yet at that point and there's no quick way to retrieve them.
> > */
> > > > > +  if (LOOP_VINFO_GENERAL_CTR_FLOW (loop_vinfo))
> > > > > +    return opt_result::failure_at (vect_location,
> > > > > +				   "not vectorized:"
> > > > > +				   " unsupported control flow in loop.\n");
> > 
> > so we didn't do this before?  But see above where I wondered.  So when
> > does this hit with early exits and why can't we check for this in
> > vect_verify_loop_form?
> > 
> 
> We did, but it was done in vect_analyze_loop_form but it was done purely based
> on number of BB in the loop.  This required loops to be highly normalized which
> isn't the case with multiple exits. That is I've seen various loops with different
> numbers of random empty fall through BB in the body or after the main exit before
> the latch.
> 
> We can do it in vect_analyze_loop_form but that requires us to walk all the
> statements in all the basic blocks, because the loops track exit edges and a general
> control flow edge is not easy to find as far as I know.  I added it at this point
> because by here because by this point we would have walked all the statements.

You only need to look at *gsi_last_bb (bb) for each block, only the last
stmt can be a control stmt.  I think rejecting this earlier is better.

> > > > > +
> > > > >    /* 2. Process_worklist */
> > > > >    while (worklist.length () > 0)
> > > > >      {
> > > > > @@ -778,6 +815,20 @@ vect_mark_stmts_to_be_vectorized
> > > > (loop_vec_info loop_vinfo, bool *fatal)
> > > > >  			return res;
> > > > >  		    }
> > > > >                   }
> > > > > +	    }
> > > > > +	  else if (gcond *cond = dyn_cast <gcond *> (stmt_vinfo->stmt))
> > > > > +	    {
> > > > > +	      enum tree_code rhs_code = gimple_cond_code (cond);
> > > > > +	      gcc_assert (TREE_CODE_CLASS (rhs_code) == tcc_comparison);
> > > > > +	      opt_result res
> > > > > +		= process_use (stmt_vinfo, gimple_cond_lhs (cond),
> > > > > +			       loop_vinfo, relevant, &worklist, false);
> > > > > +	      if (!res)
> > > > > +		return res;
> > > > > +	      res = process_use (stmt_vinfo, gimple_cond_rhs (cond),
> > > > > +				loop_vinfo, relevant, &worklist, false);
> > > > > +	      if (!res)
> > > > > +		return res;
> > > > >              }
> > > > >  	  else if (gcall *call = dyn_cast <gcall *> (stmt_vinfo->stmt))
> > > > >  	    {
> > > > > @@ -11919,11 +11970,15 @@ vect_analyze_stmt (vec_info *vinfo,
> > > > >  			     node_instance, cost_vec);
> > > > >        if (!res)
> > > > >  	return res;
> > > > > -   }
> > > > > +    }
> > > > > +
> > > > > +  if (is_ctrl_stmt (stmt_info->stmt))
> > > > > +    STMT_VINFO_DEF_TYPE (stmt_info) = vect_early_exit_def;
> > > > >
> > > > >    switch (STMT_VINFO_DEF_TYPE (stmt_info))
> > > > >      {
> > > > >        case vect_internal_def:
> > > > > +      case vect_early_exit_def:
> > > > >          break;
> > > > >
> > > > >        case vect_reduction_def:
> > > > > @@ -11956,6 +12011,7 @@ vect_analyze_stmt (vec_info *vinfo,
> > > > >      {
> > > > >        gcall *call = dyn_cast <gcall *> (stmt_info->stmt);
> > > > >        gcc_assert (STMT_VINFO_VECTYPE (stmt_info)
> > > > > +		  || gimple_code (stmt_info->stmt) == GIMPLE_COND
> > > > >  		  || (call && gimple_call_lhs (call) == NULL_TREE));
> > > > >        *need_to_vectorize = true;
> > > > >      }
> > > > > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> > > > > index
> > > >
> > ec65b65b5910e9cbad0a8c7e83c950b6168b98bf..24a0567a2f23f1b3d8b3
> > > > 40baff61d18da8e242dd 100644
> > > > > --- a/gcc/tree-vectorizer.h
> > > > > +++ b/gcc/tree-vectorizer.h
> > > > > @@ -63,6 +63,7 @@ enum vect_def_type {
> > > > >    vect_internal_def,
> > > > >    vect_induction_def,
> > > > >    vect_reduction_def,
> > > > > +  vect_early_exit_def,
> > 
> > can you avoid putting this inbetween reduction and double reduction
> > please?  Just put it before vect_unknown_def_type.  In fact the COND
> > isn't a def ... maybe we should have pattern recogized
> > 
> >  if (a < b) exit;
> > 
> > as
> > 
> >  cond = a < b;
> >  if (cond != 0) exit;
> > 
> > so the part that we need to vectorize is more clear.
> 
> Hmm fair enough, I still find it useful to be able to distinguish
> between this and general control flow though.  In fact depending
> on when we finish reviewing/upstreaming this It should be easy to
> support general control flow in such loops.
> 
> > 
> > > > >    vect_double_reduction_def,
> > > > >    vect_nested_cycle,
> > > > >    vect_first_order_recurrence,
> > > > > @@ -876,6 +877,13 @@ public:
> > > > >       we need to peel off iterations at the end to form an epilogue loop.  */
> > > > >    bool peeling_for_niter;
> > > > >
> > > > > +  /* When the loop has early breaks that we can vectorize we need to
> > peel
> > > > > +     the loop for the break finding loop.  */
> > > > > +  bool early_breaks;
> > > > > +
> > > > > +  /* When the loop has a non-early break control flow inside.  */
> > > > > +  bool non_break_control_flow;
> > > > > +
> > > > >    /* List of loop additional IV conditionals found in the loop.  */
> > > > >    auto_vec<gcond *> conds;
> > > > >
> > > > > @@ -985,9 +993,11 @@ public:
> > > > >  #define LOOP_VINFO_REDUCTION_CHAINS(L)     (L)->reduction_chains
> > > > >  #define LOOP_VINFO_PEELING_FOR_GAPS(L)     (L)->peeling_for_gaps
> > > > >  #define LOOP_VINFO_PEELING_FOR_NITER(L)    (L)->peeling_for_niter
> > > > > +#define LOOP_VINFO_EARLY_BREAKS(L)         (L)->early_breaks
> > > > >  #define LOOP_VINFO_EARLY_BRK_CONFLICT_STMTS(L) (L)-
> > > > >early_break_conflict
> > > > >  #define LOOP_VINFO_EARLY_BRK_DEST_BB(L)    (L)-
> > >early_break_dest_bb
> > > > >  #define LOOP_VINFO_EARLY_BRK_VUSES(L)      (L)->early_break_vuses
> > > > > +#define LOOP_VINFO_GENERAL_CTR_FLOW(L)     (L)-
> > > > >non_break_control_flow
> > > > >  #define LOOP_VINFO_LOOP_CONDS(L)           (L)->conds
> > > > >  #define LOOP_VINFO_LOOP_IV_COND(L)         (L)->loop_iv_cond
> > > > >  #define LOOP_VINFO_NO_DATA_DEPENDENCIES(L) (L)-
> > > > >no_data_dependencies
> > > > > @@ -1038,8 +1048,8 @@ public:
> > > > >     stack.  */
> > > > >  typedef opt_pointer_wrapper <loop_vec_info> opt_loop_vec_info;
> > > > >
> > > > > -inline loop_vec_info
> > > > > -loop_vec_info_for_loop (class loop *loop)
> > > > > +static inline loop_vec_info
> > > > > +loop_vec_info_for_loop (const class loop *loop)
> > > > >  {
> > > > >    return (loop_vec_info) loop->aux;
> > > > >  }
> > > > > @@ -1789,7 +1799,7 @@ is_loop_header_bb_p (basic_block bb)
> > > > >  {
> > > > >    if (bb == (bb->loop_father)->header)
> > > > >      return true;
> > > > > -  gcc_checking_assert (EDGE_COUNT (bb->preds) == 1);
> > > > > +
> > > > >    return false;
> > > > >  }
> > > > >
> > > > > @@ -2176,9 +2186,10 @@ class auto_purge_vect_location
> > > > >     in tree-vect-loop-manip.cc.  */
> > > > >  extern void vect_set_loop_condition (class loop *, loop_vec_info,
> > > > >  				     tree, tree, tree, bool);
> > > > > -extern bool slpeel_can_duplicate_loop_p (const class loop *,
> > const_edge);
> > > > > +extern bool slpeel_can_duplicate_loop_p (const loop_vec_info,
> > > > const_edge);
> > > > >  class loop *slpeel_tree_duplicate_loop_to_edge_cfg (class loop *,
> > > > > -						     class loop *, edge);
> > > > > +						    class loop *, edge, bool,
> > > > > +						    vec<basic_block> * = NULL);
> > > > >  class loop *vect_loop_versioning (loop_vec_info, gimple *);
> > > > >  extern class loop *vect_do_peeling (loop_vec_info, tree, tree,
> > > > >  				    tree *, tree *, tree *, int, bool, bool,
> > > > > diff --git a/gcc/tree-vectorizer.cc b/gcc/tree-vectorizer.cc
> > > > > index
> > > >
> > a048e9d89178a37455bd7b83ab0f2a238a4ce69e..0dc5479dc92058b6c70c
> > > > 67f29f5dc9a8d72235f4 100644
> > > > > --- a/gcc/tree-vectorizer.cc
> > > > > +++ b/gcc/tree-vectorizer.cc
> > > > > @@ -1379,7 +1379,9 @@ pass_vectorize::execute (function *fun)
> > > > >  	 predicates that need to be shared for optimal predicate usage.
> > > > >  	 However reassoc will re-order them and prevent CSE from working
> > > > >  	 as it should.  CSE only the loop body, not the entry.  */
> > > > > -      bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
> > > > > +      auto_vec<edge> exits = get_loop_exit_edges (loop);
> > 
> > seeing this more and more I think we want a simple way to iterate over
> > all exits without copying to a vector when we have them recorded.  My
> > C++ fu is too limited to support
> > 
> >   for (auto exit : recorded_exits (loop))
> >     ...
> > 
> > (maybe that's enough for somebody to jump onto this ;))
> > 
> > Don't treat all review comments as change orders, but it should be clear
> > the code isn't 100% obvious.  Maybe the patch can be simplified by
> > splitting out the LC SSA cleanup parts.
> 
> Will give it a try,
> 
> Thanks!
> Tamar
> 
> > 
> > Thanks,
> > Richard.
> > 
> > > > > +      for (edge exit : exits)
> > > > > +	bitmap_set_bit (exit_bbs, exit->dest->index);
> > > > >
> > > > >        edge entry = EDGE_PRED (loop_preheader_edge (loop)->src, 0);
> > > > >        do_rpo_vn (fun, entry, exit_bbs);
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > > --
> > > > Richard Biener <rguenther@suse.de>
> > > > SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461
> > > > Nuernberg,
> > > > Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien
> > > > Moerman;
> > > > HRB 36809 (AG Nuernberg)
> > >
>
  
Tamar Christina Aug. 18, 2023, 11:35 a.m. UTC | #6
> > Yeah if you comment it out one of the testcases should fail.
> 
> using new_preheader instead of e->dest would make things clearer.
> 
> You are now adding the same arg to every exit (you've just queried the
> main exit redirect_edge_var_map_vector).
> 
> OK, so I think I understand what you're doing.  If I understand
> correctly we know that when we exit the main loop via one of the
> early exits we are definitely going to enter the epilog but when
> we take the main exit we might not.
> 

Correct.. but..

> Looking at the CFG we create currently this isn't reflected and
> this complicates this PHI node updating.  What I'd try to do
> is leave redirecting the alternate exits until after

It is, in the case of the alternate exits this is reflected in copying
the same values, as they are the values of the number of completed 
iterations since the scalar code restarts the last iteration.

So all the PHI nodes of the alternate exits are correct.  The vector
Iteration doesn't handle the partial iteration.

> slpeel_tree_duplicate_loop_to_edge_cfg finished which probably
> means leaving it almost unchanged besides the LC SSA maintaining
> changes.  After that for the multi-exit case split the
> epilog preheader edge and redirect all the alternate exits to the
> new preheader.  So the CFG becomes
> 
>                  <original loop>
>                 /      |
>                /    <main exit w/ original LC PHI>
>               /      if (epilog)
>    alt exits /        /  \
>             /        /    loop around
>             |       /
>            preheader with "header" PHIs
>               |
>           <epilog>
> 
> note you need the header PHIs also on the main exit path but you
> only need the loop end PHIs there.
> 
> It seems so that at least currently the order of things makes
> them more complicated than necessary.

I've been trying to, but this representation seems a lot harder to work with,
In particular at the moment once we exit slpeel_tree_duplicate_loop_to_edge_cfg
the loop structure is exactly the same as one expects from any normal epilog vectorization.

But this new representation requires me to place the guard much earlier than the epilogue
preheader,  yet I still have to adjust the PHI nodes in the preheader.  So it seems that this split
is there to only indicate that we always enter the epilog when taking an early exit.

Today this is reflected in the values of the PHI nodes rather than structurally.  Once we place
The guard we update the nodes and the alternate exits get their value for ivtmp updated to VF.

This representation also forces me to do the redirection in every call site of
slpeel_tree_duplicate_loop_to_edge_cfg making the code more complicated in all use sites.

But I think this doesn't address the main reason why the slpeel_tree_duplicate_loop_to_edge_cfg
code has a large block of code to deal with PHI node updates.

The reason as you mentioned somewhere else is that after we redirect the edges I have to reconstruct
the phi nodes.  For most it's straight forwards, but for live values or vuse chains it requires extra code.

You're right in that before we redirect the edges they are all correct in the exit block, you mentioned that
the API for the edge redirection is supposed to copy the values over if I create the phi nodes before hand.

However this doesn't seem to work:

     for (auto gsi_from = gsi_start_phis (scalar_exit->dest);
	   !gsi_end_p (gsi_from); gsi_next (&gsi_from))
	{
	  gimple *from_phi = gsi_stmt (gsi_from);
	  tree new_res = copy_ssa_name (gimple_phi_result (from_phi));
	  create_phi_node (new_res, new_preheader);
	}

      for (edge exit : loop_exits)
	redirect_edge_and_branch (exit, new_preheader);

Still leaves them empty.  Grepping around most code seems to pair redirect_edge_and_branch with
copy_phi_arg_into_existing_phi.  The problem is that in all these cases after redirecting an edge they
call copy_phi_arg_into_existing_phi from a predecessor edge to fill in the phi nodes.

This is because as I redirect_edge_and_branch destroys the phi node entries and copy_phi_arg_into_existing_phi
simply just reads the gimple_phi_arg_def which would be NULL.

You could point it to the src block of the exit, in which case it copies the wrong values in for the vuses.  At the end
of vectorization the cfgcleanup code does the same thing to maintain LCSSA if you haven't.  This code always goes
wrong for multiple exits because of the problem described above.  There's no node for it to copy the right value
from.

As an alternate approach I can split the exit edges, copy the phi nodes into the split and after that redirect them.
This however creates the awkwardness of having the exit edges no longer connect to the preheader.

All of this then begs the question if this is all easier than the current approach which is just to read the edge var
map to figure out the nodes that were removed during the redirect.

Maybe I'm still misunderstanding the API, but reading the sources of the functions, they all copy values from *existing*
phi nodes.  And any existing phi node after the redirect are not correct.

gimple_redirect_edge_and_branch has a chunk that indicates it should have updated the PHI nodes before calling
ssa_redirect_edge to remove the old ones, but there's no code there. It's all empty.

Most of the other refactorings/changes were easy enough to do, but this one I seem to be a struggling with.

Thanks,
Tamar
> 
> > >
> > > >  	}
> > > > -      redirect_edge_and_branch_force (e, new_preheader);
> > > > -      flush_pending_stmts (e);
> > > > +
> > > >        set_immediate_dominator (CDI_DOMINATORS, new_preheader, e-
> >src);
> > > > -      if (was_imm_dom || duplicate_outer_loop)
> > > > +
> > > > +      if ((was_imm_dom || duplicate_outer_loop) && !multiple_exits_p)
> > > >  	set_immediate_dominator (CDI_DOMINATORS, exit_dest, new_exit-
> > > >src);
> > > >
> > > >        /* And remove the non-necessary forwarder again.  Keep the other
> > > > @@ -1647,9 +1756,42 @@ slpeel_tree_duplicate_loop_to_edge_cfg
> (class
> > > loop *loop,
> > > >        delete_basic_block (preheader);
> > > >        set_immediate_dominator (CDI_DOMINATORS, scalar_loop->header,
> > > >  			       loop_preheader_edge (scalar_loop)->src);
> > > > +
> > > > +      /* Finally after wiring the new epilogue we need to update its main
> exit
> > > > +	 to the original function exit we recorded.  Other exits are already
> > > > +	 correct.  */
> > > > +      if (multiple_exits_p)
> > > > +	{
> > > > +	  for (edge e : get_loop_exit_edges (loop))
> > > > +	    doms.safe_push (e->dest);
> > > > +	  update_loop = new_loop;
> > > > +	  doms.safe_push (exit_dest);
> > > > +
> > > > +	  /* Likely a fall-through edge, so update if needed.  */
> > > > +	  if (single_succ_p (exit_dest))
> > > > +	    doms.safe_push (single_succ (exit_dest));
> > > > +	}
> > > >      }
> > > >    else /* Add the copy at entry.  */
> > > >      {
> > > > +      /* Copy the current loop LC PHI nodes between the original loop exit
> > > > +	 block and the new loop header.  This allows us to later split the
> > > > +	 preheader block and still find the right LC nodes.  */
> > > > +      edge old_latch_loop = loop_latch_edge (loop);
> > > > +      edge old_latch_init = loop_preheader_edge (loop);
> > > > +      edge new_latch_loop = loop_latch_edge (new_loop);
> > > > +      edge new_latch_init = loop_preheader_edge (new_loop);
> > > > +      for (auto gsi_from = gsi_start_phis (new_latch_init->dest),
> > >
> > > see above
> > >
> > > > +	   gsi_to = gsi_start_phis (old_latch_loop->dest);
> > > > +	   flow_loops && !gsi_end_p (gsi_from) && !gsi_end_p (gsi_to);
> > > > +	   gsi_next (&gsi_from), gsi_next (&gsi_to))
> > > > +	{
> > > > +	  gimple *from_phi = gsi_stmt (gsi_from);
> > > > +	  gimple *to_phi = gsi_stmt (gsi_to);
> > > > +	  tree new_arg = PHI_ARG_DEF_FROM_EDGE (from_phi,
> > > new_latch_loop);
> > > > +	  adjust_phi_and_debug_stmts (to_phi, old_latch_init, new_arg);
> > > > +	}
> > > > +
> > > >        if (scalar_loop != loop)
> > > >  	{
> > > >  	  /* Remove the non-necessary forwarder of scalar_loop again.  */
> > > > @@ -1677,31 +1819,36 @@ slpeel_tree_duplicate_loop_to_edge_cfg
> (class
> > > loop *loop,
> > > >        delete_basic_block (new_preheader);
> > > >        set_immediate_dominator (CDI_DOMINATORS, new_loop->header,
> > > >  			       loop_preheader_edge (new_loop)->src);
> > > > +
> > > > +      if (multiple_exits_p)
> > > > +	update_loop = loop;
> > > >      }
> > > >
> > > > -  if (scalar_loop != loop)
> > > > +  if (multiple_exits_p)
> > > >      {
> > > > -      /* Update new_loop->header PHIs, so that on the preheader
> > > > -	 edge they are the ones from loop rather than scalar_loop.  */
> > > > -      gphi_iterator gsi_orig, gsi_new;
> > > > -      edge orig_e = loop_preheader_edge (loop);
> > > > -      edge new_e = loop_preheader_edge (new_loop);
> > > > -
> > > > -      for (gsi_orig = gsi_start_phis (loop->header),
> > > > -	   gsi_new = gsi_start_phis (new_loop->header);
> > > > -	   !gsi_end_p (gsi_orig) && !gsi_end_p (gsi_new);
> > > > -	   gsi_next (&gsi_orig), gsi_next (&gsi_new))
> > > > +      for (edge e : get_loop_exit_edges (update_loop))
> > > >  	{
> > > > -	  gphi *orig_phi = gsi_orig.phi ();
> > > > -	  gphi *new_phi = gsi_new.phi ();
> > > > -	  tree orig_arg = PHI_ARG_DEF_FROM_EDGE (orig_phi, orig_e);
> > > > -	  location_t orig_locus
> > > > -	    = gimple_phi_arg_location_from_edge (orig_phi, orig_e);
> > > > -
> > > > -	  add_phi_arg (new_phi, orig_arg, new_e, orig_locus);
> > > > +	  edge ex;
> > > > +	  edge_iterator ei;
> > > > +	  FOR_EACH_EDGE (ex, ei, e->dest->succs)
> > > > +	    {
> > > > +	      /* Find the first non-fallthrough block as fall-throughs can't
> > > > +		 dominate other blocks.  */
> > > > +	      while ((ex->flags & EDGE_FALLTHRU)
> 
> For the prologue peeling any early exit we take would skip all other
> loops so we can simply leave them and their LC PHI nodes in place.
> We need extra PHIs only on the path to the main vector loop.  I
> think the comment isn't accurately reflecting what we do.  In
> fact we do not add any LC PHI nodes here but simply adjust the
> main loop header PHI arguments?
> 
> > > I don't think EDGE_FALLTHRU is set correctly, what's wrong with
> > > just using single_succ_p here?  A fallthru edge src dominates the
> > > fallthru edge dest, so the sentence above doesn't make sense.
> >
> > I wanted to say, that the immediate dominator of a block is never
> > an fall through block.  At least from what I understood from how
> > the dominators are calculated in the code, though may have missed
> > something.
> 
>  BB1
>   |
>  BB2
>   |
>  BB3
> 
> here the immediate dominator of BB3 is BB2 and that of BB2 is BB1.
> 
> > >
> > > > +		     && single_succ_p (ex->dest))
> > > > +		{
> > > > +		  doms.safe_push (ex->dest);
> > > > +		  ex = single_succ_edge (ex->dest);
> > > > +		}
> > > > +	      doms.safe_push (ex->dest);
> > > > +	    }
> > > > +	  doms.safe_push (e->dest);
> > > >  	}
> > > > -    }
> > > >
> > > > +      iterate_fix_dominators (CDI_DOMINATORS, doms, false);
> > > > +      if (updated_doms)
> > > > +	updated_doms->safe_splice (doms);
> > > > +    }
> > > >    free (new_bbs);
> > > >    free (bbs);
> > > >
> > > > @@ -1777,6 +1924,9 @@ slpeel_can_duplicate_loop_p (const
> > > loop_vec_info loop_vinfo, const_edge e)
> > > >    gimple_stmt_iterator loop_exit_gsi = gsi_last_bb (exit_e->src);
> > > >    unsigned int num_bb = loop->inner? 5 : 2;
> > > >
> > > > +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > +    num_bb += LOOP_VINFO_ALT_EXITS (loop_vinfo).length ();
> > > > +
> > >
> > > I think checking the number of BBs is odd, I don't remember anything
> > > in slpeel is specifically tied to that?  I think we can simply drop
> > > this or do you remember anything that would depend on ->num_nodes
> > > being only exactly 5 or 2?
> >
> > Never actually seemed to require it, but they're used as some check to
> > see if there are unexpected control flow in the loop.
> >
> > i.e. this would say no if you have an if statement in the loop that wasn't
> > converted.  The other part of this and the accompanying explanation is in
> > vect_analyze_loop_form.  In the patch series I had to remove the hard
> > num_nodes == 2 check from there because number of nodes restricted
> > things too much.  If you have an empty fall through block, which seems to
> > happen often between the main exit and the latch block then we'd not
> > vectorize.
> >
> > So instead I now rejects loops after analyzing the gcond.  So think this check
> > can go/needs to be different.
> 
> Lets remove it from this function then.
> 
> > >
> > > >    /* All loops have an outer scope; the only case loop->outer is NULL is for
> > > >       the function itself.  */
> > > >    if (!loop_outer (loop)
> > > > @@ -2044,6 +2194,11 @@ vect_update_ivs_after_vectorizer
> > > (loop_vec_info loop_vinfo,
> > > >    class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > > >    basic_block update_bb = update_e->dest;
> > > >
> > > > +  /* For early exits we'll update the IVs in
> > > > +     vect_update_ivs_after_early_break.  */
> > > > +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > +    return;
> > > > +
> > > >    basic_block exit_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
> > > >
> > > >    /* Make sure there exists a single-predecessor exit bb:  */
> > > > @@ -2131,6 +2286,208 @@ vect_update_ivs_after_vectorizer
> > > (loop_vec_info loop_vinfo,
> > > >        /* Fix phi expressions in the successor bb.  */
> > > >        adjust_phi_and_debug_stmts (phi1, update_e, ni_name);
> > > >      }
> > > > +  return;
> > >
> > > we don't usually place a return at the end of void functions
> > >
> > > > +}
> > > > +
> > > > +/*   Function vect_update_ivs_after_early_break.
> > > > +
> > > > +     "Advance" the induction variables of LOOP to the value they should
> take
> > > > +     after the execution of LOOP.  This is currently necessary because the
> > > > +     vectorizer does not handle induction variables that are used after the
> > > > +     loop.  Such a situation occurs when the last iterations of LOOP are
> > > > +     peeled, because of the early exit.  With an early exit we always peel
> the
> > > > +     loop.
> > > > +
> > > > +     Input:
> > > > +     - LOOP_VINFO - a loop info structure for the loop that is going to be
> > > > +		    vectorized. The last few iterations of LOOP were peeled.
> > > > +     - LOOP - a loop that is going to be vectorized. The last few iterations
> > > > +	      of LOOP were peeled.
> > > > +     - VF - The loop vectorization factor.
> > > > +     - NITERS_ORIG - the number of iterations that LOOP executes (before
> it is
> > > > +		     vectorized). i.e, the number of times the ivs should be
> > > > +		     bumped.
> > > > +     - NITERS_VECTOR - The number of iterations that the vector LOOP
> > > executes.
> > > > +     - UPDATE_E - a successor edge of LOOP->exit that is on the (only)
> path
> > > > +		  coming out from LOOP on which there are uses of the LOOP
> > > ivs
> > > > +		  (this is the path from LOOP->exit to epilog_loop->preheader).
> > > > +
> > > > +		  The new definitions of the ivs are placed in LOOP->exit.
> > > > +		  The phi args associated with the edge UPDATE_E in the bb
> > > > +		  UPDATE_E->dest are updated accordingly.
> > > > +
> > > > +     Output:
> > > > +       - If available, the LCSSA phi node for the loop IV temp.
> > > > +
> > > > +     Assumption 1: Like the rest of the vectorizer, this function assumes
> > > > +     a single loop exit that has a single predecessor.
> > > > +
> > > > +     Assumption 2: The phi nodes in the LOOP header and in update_bb
> are
> > > > +     organized in the same order.
> > > > +
> > > > +     Assumption 3: The access function of the ivs is simple enough (see
> > > > +     vect_can_advance_ivs_p).  This assumption will be relaxed in the
> future.
> > > > +
> > > > +     Assumption 4: Exactly one of the successors of LOOP exit-bb is on a
> path
> > > > +     coming out of LOOP on which the ivs of LOOP are used (this is the
> path
> > > > +     that leads to the epilog loop; other paths skip the epilog loop).  This
> > > > +     path starts with the edge UPDATE_E, and its destination (denoted
> > > update_bb)
> > > > +     needs to have its phis updated.
> > > > + */
> > > > +
> > > > +static tree
> > > > +vect_update_ivs_after_early_break (loop_vec_info loop_vinfo, class
> loop *
> > > epilog,
> > > > +				   poly_int64 vf, tree niters_orig,
> > > > +				   tree niters_vector, edge update_e)
> > > > +{
> > > > +  if (!LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > +    return NULL;
> > > > +
> > > > +  gphi_iterator gsi, gsi1;
> > > > +  tree ni_name, ivtmp = NULL;
> > > > +  basic_block update_bb = update_e->dest;
> > > > +  vec<edge> alt_exits = LOOP_VINFO_ALT_EXITS (loop_vinfo);
> > > > +  edge loop_iv = LOOP_VINFO_IV_EXIT (loop_vinfo);
> > > > +  basic_block exit_bb = loop_iv->dest;
> > > > +  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > > > +  gcond *cond = LOOP_VINFO_LOOP_IV_COND (loop_vinfo);
> > > > +
> > > > +  gcc_assert (cond);
> > > > +
> > > > +  for (gsi = gsi_start_phis (loop->header), gsi1 = gsi_start_phis
> (update_bb);
> > > > +       !gsi_end_p (gsi) && !gsi_end_p (gsi1);
> > > > +       gsi_next (&gsi), gsi_next (&gsi1))
> > > > +    {
> > > > +      tree init_expr, final_expr, step_expr;
> > > > +      tree type;
> > > > +      tree var, ni, off;
> > > > +      gimple_stmt_iterator last_gsi;
> > > > +
> > > > +      gphi *phi = gsi1.phi ();
> > > > +      tree phi_ssa = PHI_ARG_DEF_FROM_EDGE (phi,
> loop_preheader_edge
> > > (epilog));
> > >
> > > I'm confused about the setup.  update_bb looks like the block with the
> > > loop-closed PHI nodes of 'loop' and the exit (update_e)?  How does
> > > loop_preheader_edge (epilog) come into play here?  That would feed into
> > > epilog->header PHIs?!
> >
> > We can't query the type of the phis in the block with the LC PHI nodes, so the
> > Typical pattern seems to be that we iterate over a block that's part of the
> loop
> > and that would have the PHIs in the same order, just so we can get to the
> > stmt_vec_info.
> >
> > >
> > > It would be nice to name 'gsi[1]', 'update_e' and 'update_bb' in a
> > > better way?  Is update_bb really epilog->header?!
> > >
> > > We're missing checking in PHI_ARG_DEF_FROM_EDGE, namely that
> > > E->dest == gimple_bb (PHI) - we're just using E->dest_idx there
> > > which "works" even for totally unrelated edges.
> > >
> > > > +      gphi *phi1 = dyn_cast <gphi *> (SSA_NAME_DEF_STMT (phi_ssa));
> > > > +      if (!phi1)
> > >
> > > shouldn't that be an assert?
> > >
> > > > +	continue;
> > > > +      stmt_vec_info phi_info = loop_vinfo->lookup_stmt (gsi.phi ());
> > > > +      if (dump_enabled_p ())
> > > > +	dump_printf_loc (MSG_NOTE, vect_location,
> > > > +			 "vect_update_ivs_after_early_break: phi: %G",
> > > > +			 (gimple *)phi);
> > > > +
> > > > +      /* Skip reduction and virtual phis.  */
> > > > +      if (!iv_phi_p (phi_info))
> > > > +	{
> > > > +	  if (dump_enabled_p ())
> > > > +	    dump_printf_loc (MSG_NOTE, vect_location,
> > > > +			     "reduc or virtual phi. skip.\n");
> > > > +	  continue;
> > > > +	}
> > > > +
> > > > +      /* For multiple exits where we handle early exits we need to carry on
> > > > +	 with the previous IV as loop iteration was not done because we exited
> > > > +	 early.  As such just grab the original IV.  */
> > > > +      phi_ssa = PHI_ARG_DEF_FROM_EDGE (gsi.phi (), loop_latch_edge
> > > (loop));
> > >
> > > but this should be taken care of by LC SSA?
> >
> > It is, the comment is probably missing details, this part just scales the
> counter
> > from VF to scalar counts.  It's just a reminder that this scaling is done
> differently
> > from normal single exit vectorization.
> >
> > >
> > > OK, have to continue tomorrow from here.
> >
> > Cheers, Thank you!
> >
> > Tamar
> >
> > >
> > > Richard.
> > >
> > > > +      if (gimple_cond_lhs (cond) != phi_ssa
> > > > +	  && gimple_cond_rhs (cond) != phi_ssa)
> 
> so this is a way to avoid touching the main IV?  Looks a bit fragile to
> me.  Hmm, we're iterating over the main loop header PHIs here?
> Can't you check, say, the relevancy of the PHI node instead?  Though
> it might also be used as induction.  Can't it be used as alternate
> exit like
> 
>   for (i)
>    {
>      if (i & bit)
>        break;
>    }
> 
> and would we need to adjust 'i' then?
> 
> > > > +	{
> > > > +	  type = TREE_TYPE (gimple_phi_result (phi));
> > > > +	  step_expr = STMT_VINFO_LOOP_PHI_EVOLUTION_PART (phi_info);
> > > > +	  step_expr = unshare_expr (step_expr);
> > > > +
> > > > +	  /* We previously generated the new merged phi in the same BB as
> > > the
> > > > +	     guard.  So use that to perform the scaling on rather than the
> > > > +	     normal loop phi which don't take the early breaks into account.  */
> > > > +	  final_expr = gimple_phi_result (phi1);
> > > > +	  init_expr = PHI_ARG_DEF_FROM_EDGE (gsi.phi (),
> > > loop_preheader_edge (loop));
> > > > +
> > > > +	  tree stype = TREE_TYPE (step_expr);
> > > > +	  /* For early break the final loop IV is:
> > > > +	     init + (final - init) * vf which takes into account peeling
> > > > +	     values and non-single steps.  */
> > > > +	  off = fold_build2 (MINUS_EXPR, stype,
> > > > +			     fold_convert (stype, final_expr),
> > > > +			     fold_convert (stype, init_expr));
> > > > +	  /* Now adjust for VF to get the final iteration value.  */
> > > > +	  off = fold_build2 (MULT_EXPR, stype, off, build_int_cst (stype, vf));
> > > > +
> > > > +	  /* Adjust the value with the offset.  */
> > > > +	  if (POINTER_TYPE_P (type))
> > > > +	    ni = fold_build_pointer_plus (init_expr, off);
> > > > +	  else
> > > > +	    ni = fold_convert (type,
> > > > +			       fold_build2 (PLUS_EXPR, stype,
> > > > +					    fold_convert (stype, init_expr),
> > > > +					    off));
> > > > +	  var = create_tmp_var (type, "tmp");
> 
> so how does the non-early break code deal with updating inductions?
> And how do you avoid altering this when we flow in from the normal
> exit?  That is, you are updating the value in the epilog loop
> header but don't you need to instead update the value only on
> the alternate exit edges from the main loop (and keep the not
> updated value on the main exit edge)?
> 
> > > > +	  last_gsi = gsi_last_bb (exit_bb);
> > > > +	  gimple_seq new_stmts = NULL;
> > > > +	  ni_name = force_gimple_operand (ni, &new_stmts, false, var);
> > > > +	  /* Exit_bb shouldn't be empty.  */
> > > > +	  if (!gsi_end_p (last_gsi))
> > > > +	    gsi_insert_seq_after (&last_gsi, new_stmts, GSI_SAME_STMT);
> > > > +	  else
> > > > +	    gsi_insert_seq_before (&last_gsi, new_stmts, GSI_SAME_STMT);
> > > > +
> > > > +	  /* Fix phi expressions in the successor bb.  */
> > > > +	  adjust_phi_and_debug_stmts (phi, update_e, ni_name);
> > > > +	}
> > > > +      else
> > > > +	{
> > > > +	  type = TREE_TYPE (gimple_phi_result (phi));
> > > > +	  step_expr = STMT_VINFO_LOOP_PHI_EVOLUTION_PART (phi_info);
> > > > +	  step_expr = unshare_expr (step_expr);
> > > > +
> > > > +	  /* We previously generated the new merged phi in the same BB as
> > > the
> > > > +	     guard.  So use that to perform the scaling on rather than the
> > > > +	     normal loop phi which don't take the early breaks into account.  */
> > > > +	  init_expr = PHI_ARG_DEF_FROM_EDGE (phi1, loop_preheader_edge
> > > (loop));
> > > > +	  tree stype = TREE_TYPE (step_expr);
> > > > +
> > > > +	  if (vf.is_constant ())
> > > > +	    {
> > > > +	      ni = fold_build2 (MULT_EXPR, stype,
> > > > +				fold_convert (stype,
> > > > +					      niters_vector),
> > > > +				build_int_cst (stype, vf));
> > > > +
> > > > +	      ni = fold_build2 (MINUS_EXPR, stype,
> > > > +				fold_convert (stype,
> > > > +					      niters_orig),
> > > > +				fold_convert (stype, ni));
> > > > +	    }
> > > > +	  else
> > > > +	    /* If the loop's VF isn't constant then the loop must have been
> > > > +	       masked, so at the end of the loop we know we have finished
> > > > +	       the entire loop and found nothing.  */
> > > > +	    ni = build_zero_cst (stype);
> > > > +
> > > > +	  ni = fold_convert (type, ni);
> > > > +	  /* We don't support variable n in this version yet.  */
> > > > +	  gcc_assert (TREE_CODE (ni) == INTEGER_CST);
> > > > +
> > > > +	  var = create_tmp_var (type, "tmp");
> > > > +
> > > > +	  last_gsi = gsi_last_bb (exit_bb);
> > > > +	  gimple_seq new_stmts = NULL;
> > > > +	  ni_name = force_gimple_operand (ni, &new_stmts, false, var);
> > > > +	  /* Exit_bb shouldn't be empty.  */
> > > > +	  if (!gsi_end_p (last_gsi))
> > > > +	    gsi_insert_seq_after (&last_gsi, new_stmts, GSI_SAME_STMT);
> > > > +	  else
> > > > +	    gsi_insert_seq_before (&last_gsi, new_stmts, GSI_SAME_STMT);
> > > > +
> > > > +	  adjust_phi_and_debug_stmts (phi1, loop_iv, ni_name);
> > > > +
> > > > +	  for (edge exit : alt_exits)
> > > > +	    adjust_phi_and_debug_stmts (phi1, exit,
> > > > +					build_int_cst (TREE_TYPE (step_expr),
> > > > +						       vf));
> > > > +	  ivtmp = gimple_phi_result (phi1);
> > > > +	}
> > > > +    }
> > > > +
> > > > +  return ivtmp;
> > > >  }
> > > >
> > > >  /* Return a gimple value containing the misalignment (measured in
> vector
> > > > @@ -2632,137 +2989,34 @@ vect_gen_vector_loop_niters_mult_vf
> > > (loop_vec_info loop_vinfo,
> > > >
> > > >  /* LCSSA_PHI is a lcssa phi of EPILOG loop which is copied from LOOP,
> > > >     this function searches for the corresponding lcssa phi node in exit
> > > > -   bb of LOOP.  If it is found, return the phi result; otherwise return
> > > > -   NULL.  */
> > > > +   bb of LOOP following the LCSSA_EDGE to the exit node.  If it is found,
> > > > +   return the phi result; otherwise return NULL.  */
> > > >
> > > >  static tree
> > > >  find_guard_arg (class loop *loop, class loop *epilog
> ATTRIBUTE_UNUSED,
> > > > -		gphi *lcssa_phi)
> > > > +		gphi *lcssa_phi, int lcssa_edge = 0)
> > > >  {
> > > >    gphi_iterator gsi;
> > > >    edge e = loop->vec_loop_iv;
> > > >
> > > > -  gcc_assert (single_pred_p (e->dest));
> > > >    for (gsi = gsi_start_phis (e->dest); !gsi_end_p (gsi); gsi_next (&gsi))
> > > >      {
> > > >        gphi *phi = gsi.phi ();
> > > > -      if (operand_equal_p (PHI_ARG_DEF (phi, 0),
> > > > -			   PHI_ARG_DEF (lcssa_phi, 0), 0))
> > > > -	return PHI_RESULT (phi);
> > > > -    }
> > > > -  return NULL_TREE;
> > > > -}
> > > > -
> > > > -/* Function slpeel_tree_duplicate_loop_to_edge_cfg duplciates
> > > FIRST/SECOND
> > > > -   from SECOND/FIRST and puts it at the original loop's preheader/exit
> > > > -   edge, the two loops are arranged as below:
> > > > -
> > > > -       preheader_a:
> > > > -     first_loop:
> > > > -       header_a:
> > > > -	 i_1 = PHI<i_0, i_2>;
> > > > -	 ...
> > > > -	 i_2 = i_1 + 1;
> > > > -	 if (cond_a)
> > > > -	   goto latch_a;
> > > > -	 else
> > > > -	   goto between_bb;
> > > > -       latch_a:
> > > > -	 goto header_a;
> > > > -
> > > > -       between_bb:
> > > > -	 ;; i_x = PHI<i_2>;   ;; LCSSA phi node to be created for FIRST,
> > > > -
> > > > -     second_loop:
> > > > -       header_b:
> > > > -	 i_3 = PHI<i_0, i_4>; ;; Use of i_0 to be replaced with i_x,
> > > > -				 or with i_2 if no LCSSA phi is created
> > > > -				 under condition of
> > > CREATE_LCSSA_FOR_IV_PHIS.
> > > > -	 ...
> > > > -	 i_4 = i_3 + 1;
> > > > -	 if (cond_b)
> > > > -	   goto latch_b;
> > > > -	 else
> > > > -	   goto exit_bb;
> > > > -       latch_b:
> > > > -	 goto header_b;
> > > > -
> > > > -       exit_bb:
> > > > -
> > > > -   This function creates loop closed SSA for the first loop; update the
> > > > -   second loop's PHI nodes by replacing argument on incoming edge with
> the
> > > > -   result of newly created lcssa PHI nodes.  IF
> CREATE_LCSSA_FOR_IV_PHIS
> > > > -   is false, Loop closed ssa phis will only be created for non-iv phis for
> > > > -   the first loop.
> > > > -
> > > > -   This function assumes exit bb of the first loop is preheader bb of the
> > > > -   second loop, i.e, between_bb in the example code.  With PHIs updated,
> > > > -   the second loop will execute rest iterations of the first.  */
> > > > -
> > > > -static void
> > > > -slpeel_update_phi_nodes_for_loops (loop_vec_info loop_vinfo,
> > > > -				   class loop *first, class loop *second,
> > > > -				   bool create_lcssa_for_iv_phis)
> > > > -{
> > > > -  gphi_iterator gsi_update, gsi_orig;
> > > > -  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > > > -
> > > > -  edge first_latch_e = EDGE_SUCC (first->latch, 0);
> > > > -  edge second_preheader_e = loop_preheader_edge (second);
> > > > -  basic_block between_bb = single_exit (first)->dest;
> > > > -
> > > > -  gcc_assert (between_bb == second_preheader_e->src);
> > > > -  gcc_assert (single_pred_p (between_bb) && single_succ_p
> (between_bb));
> > > > -  /* Either the first loop or the second is the loop to be vectorized.  */
> > > > -  gcc_assert (loop == first || loop == second);
> > > > -
> > > > -  for (gsi_orig = gsi_start_phis (first->header),
> > > > -       gsi_update = gsi_start_phis (second->header);
> > > > -       !gsi_end_p (gsi_orig) && !gsi_end_p (gsi_update);
> > > > -       gsi_next (&gsi_orig), gsi_next (&gsi_update))
> > > > -    {
> > > > -      gphi *orig_phi = gsi_orig.phi ();
> > > > -      gphi *update_phi = gsi_update.phi ();
> > > > -
> > > > -      tree arg = PHI_ARG_DEF_FROM_EDGE (orig_phi, first_latch_e);
> > > > -      /* Generate lcssa PHI node for the first loop.  */
> > > > -      gphi *vect_phi = (loop == first) ? orig_phi : update_phi;
> > > > -      stmt_vec_info vect_phi_info = loop_vinfo->lookup_stmt (vect_phi);
> > > > -      if (create_lcssa_for_iv_phis || !iv_phi_p (vect_phi_info))
> > > > +      /* Nested loops with multiple exits can have different no# phi node
> > > > +	 arguments between the main loop and epilog as epilog falls to the
> > > > +	 second loop.  */
> > > > +      if (gimple_phi_num_args (phi) > e->dest_idx)
> > > >  	{
> > > > -	  tree new_res = copy_ssa_name (PHI_RESULT (orig_phi));
> > > > -	  gphi *lcssa_phi = create_phi_node (new_res, between_bb);
> > > > -	  add_phi_arg (lcssa_phi, arg, single_exit (first),
> > > UNKNOWN_LOCATION);
> > > > -	  arg = new_res;
> > > > -	}
> > > > -
> > > > -      /* Update PHI node in the second loop by replacing arg on the loop's
> > > > -	 incoming edge.  */
> > > > -      adjust_phi_and_debug_stmts (update_phi, second_preheader_e,
> arg);
> > > > -    }
> > > > -
> > > > -  /* For epilogue peeling we have to make sure to copy all LC PHIs
> > > > -     for correct vectorization of live stmts.  */
> > > > -  if (loop == first)
> > > > -    {
> > > > -      basic_block orig_exit = single_exit (second)->dest;
> > > > -      for (gsi_orig = gsi_start_phis (orig_exit);
> > > > -	   !gsi_end_p (gsi_orig); gsi_next (&gsi_orig))
> > > > -	{
> > > > -	  gphi *orig_phi = gsi_orig.phi ();
> > > > -	  tree orig_arg = PHI_ARG_DEF (orig_phi, 0);
> > > > -	  if (TREE_CODE (orig_arg) != SSA_NAME || virtual_operand_p
> > > (orig_arg))
> > > > -	    continue;
> > > > -
> > > > -	  /* Already created in the above loop.   */
> > > > -	  if (find_guard_arg (first, second, orig_phi))
> > > > +	  tree var = PHI_ARG_DEF (phi, e->dest_idx);
> > > > +	  if (TREE_CODE (var) != SSA_NAME)
> > > >  	    continue;
> > > >
> > > > -	  tree new_res = copy_ssa_name (orig_arg);
> > > > -	  gphi *lcphi = create_phi_node (new_res, between_bb);
> > > > -	  add_phi_arg (lcphi, orig_arg, single_exit (first),
> > > UNKNOWN_LOCATION);
> > > > +	  if (operand_equal_p (get_current_def (var),
> > > > +			       PHI_ARG_DEF (lcssa_phi, lcssa_edge), 0))
> > > > +	    return PHI_RESULT (phi);
> > > >  	}
> > > >      }
> > > > +  return NULL_TREE;
> > > >  }
> > > >
> > > >  /* Function slpeel_add_loop_guard adds guard skipping from the
> beginning
> > > > @@ -2910,13 +3164,11 @@ slpeel_update_phi_nodes_for_guard2
> (class
> > > loop *loop, class loop *epilog,
> > > >    gcc_assert (single_succ_p (merge_bb));
> > > >    edge e = single_succ_edge (merge_bb);
> > > >    basic_block exit_bb = e->dest;
> > > > -  gcc_assert (single_pred_p (exit_bb));
> > > > -  gcc_assert (single_pred (exit_bb) == single_exit (epilog)->dest);
> > > >
> > > >    for (gsi = gsi_start_phis (exit_bb); !gsi_end_p (gsi); gsi_next (&gsi))
> > > >      {
> > > >        gphi *update_phi = gsi.phi ();
> > > > -      tree old_arg = PHI_ARG_DEF (update_phi, 0);
> > > > +      tree old_arg = PHI_ARG_DEF (update_phi, e->dest_idx);
> > > >
> > > >        tree merge_arg = NULL_TREE;
> > > >
> > > > @@ -2928,7 +3180,7 @@ slpeel_update_phi_nodes_for_guard2 (class
> loop
> > > *loop, class loop *epilog,
> > > >        if (!merge_arg)
> > > >  	merge_arg = old_arg;
> > > >
> > > > -      tree guard_arg = find_guard_arg (loop, epilog, update_phi);
> > > > +      tree guard_arg = find_guard_arg (loop, epilog, update_phi, e-
> >dest_idx);
> > > >        /* If the var is live after loop but not a reduction, we simply
> > > >  	 use the old arg.  */
> > > >        if (!guard_arg)
> > > > @@ -2948,21 +3200,6 @@ slpeel_update_phi_nodes_for_guard2 (class
> > > loop *loop, class loop *epilog,
> > > >      }
> > > >  }
> > > >
> > > > -/* EPILOG loop is duplicated from the original loop for vectorizing,
> > > > -   the arg of its loop closed ssa PHI needs to be updated.  */
> > > > -
> > > > -static void
> > > > -slpeel_update_phi_nodes_for_lcssa (class loop *epilog)
> > > > -{
> > > > -  gphi_iterator gsi;
> > > > -  basic_block exit_bb = single_exit (epilog)->dest;
> > > > -
> > > > -  gcc_assert (single_pred_p (exit_bb));
> > > > -  edge e = EDGE_PRED (exit_bb, 0);
> > > > -  for (gsi = gsi_start_phis (exit_bb); !gsi_end_p (gsi); gsi_next (&gsi))
> > > > -    rename_use_op (PHI_ARG_DEF_PTR_FROM_EDGE (gsi.phi (), e));
> > > > -}
> > > > -
> 
> I wonder if we can still split these changes out to before early break
> vect?
> 
> > > >  /* EPILOGUE_VINFO is an epilogue loop that we now know would need
> to
> > > >     iterate exactly CONST_NITERS times.  Make a final decision about
> > > >     whether the epilogue loop should be used, returning true if so.  */
> > > > @@ -3138,6 +3375,14 @@ vect_do_peeling (loop_vec_info loop_vinfo,
> tree
> > > niters, tree nitersm1,
> > > >      bound_epilog += vf - 1;
> > > >    if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
> > > >      bound_epilog += 1;
> > > > +  /* For early breaks the scalar loop needs to execute at most VF times
> > > > +     to find the element that caused the break.  */
> > > > +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > +    {
> > > > +      bound_epilog = vf;
> > > > +      /* Force a scalar epilogue as we can't vectorize the index finding.  */
> > > > +      vect_epilogues = false;
> > > > +    }
> > > >    bool epilog_peeling = maybe_ne (bound_epilog, 0U);
> > > >    poly_uint64 bound_scalar = bound_epilog;
> > > >
> > > > @@ -3297,16 +3542,24 @@ vect_do_peeling (loop_vec_info
> loop_vinfo,
> > > tree niters, tree nitersm1,
> > > >  				  bound_prolog + bound_epilog)
> > > >  		      : (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
> > > >  			 || vect_epilogues));
> > > > +
> > > > +  /* We only support early break vectorization on known bounds at this
> > > time.
> > > > +     This means that if the vector loop can't be entered then we won't
> > > generate
> > > > +     it at all.  So for now force skip_vector off because the additional
> control
> > > > +     flow messes with the BB exits and we've already analyzed them.  */
> > > > + skip_vector = skip_vector && !LOOP_VINFO_EARLY_BREAKS
> (loop_vinfo);
> > > > +
> 
> I think it should be as "easy" as entering the epilog via the block taking
> the regular exit?
> 
> > > >    /* Epilog loop must be executed if the number of iterations for epilog
> > > >       loop is known at compile time, otherwise we need to add a check at
> > > >       the end of vector loop and skip to the end of epilog loop.  */
> > > >    bool skip_epilog = (prolog_peeling < 0
> > > >  		      || !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > > >  		      || !vf.is_constant ());
> > > > -  /* PEELING_FOR_GAPS is special because epilog loop must be executed.
> */
> > > > -  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
> > > > +  /* PEELING_FOR_GAPS and peeling for early breaks are special because
> > > epilog
> > > > +     loop must be executed.  */
> > > > +  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> > > > +      || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > >      skip_epilog = false;
> > > > -
> > > >    class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
> > > >    auto_vec<profile_count> original_counts;
> > > >    basic_block *original_bbs = NULL;
> > > > @@ -3344,13 +3597,13 @@ vect_do_peeling (loop_vec_info
> loop_vinfo,
> > > tree niters, tree nitersm1,
> > > >    if (prolog_peeling)
> > > >      {
> > > >        e = loop_preheader_edge (loop);
> > > > -      gcc_checking_assert (slpeel_can_duplicate_loop_p (loop, e));
> > > > -
> > > > +      gcc_checking_assert (slpeel_can_duplicate_loop_p (loop_vinfo, e));
> > > >        /* Peel prolog and put it on preheader edge of loop.  */
> > > > -      prolog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, scalar_loop,
> e);
> > > > +      prolog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, scalar_loop,
> e,
> > > > +						       true);
> > > >        gcc_assert (prolog);
> > > >        prolog->force_vectorize = false;
> > > > -      slpeel_update_phi_nodes_for_loops (loop_vinfo, prolog, loop, true);
> > > > +
> > > >        first_loop = prolog;
> > > >        reset_original_copy_tables ();
> > > >
> > > > @@ -3420,11 +3673,12 @@ vect_do_peeling (loop_vec_info
> loop_vinfo,
> > > tree niters, tree nitersm1,
> > > >  	 as the transformations mentioned above make less or no sense when
> > > not
> > > >  	 vectorizing.  */
> > > >        epilog = vect_epilogues ? get_loop_copy (loop) : scalar_loop;
> > > > -      epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, epilog, e);
> > > > +      auto_vec<basic_block> doms;
> > > > +      epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, epilog, e,
> true,
> > > > +						       &doms);
> > > >        gcc_assert (epilog);
> > > >
> > > >        epilog->force_vectorize = false;
> > > > -      slpeel_update_phi_nodes_for_loops (loop_vinfo, loop, epilog, false);
> > > >
> > > >        /* Scalar version loop may be preferred.  In this case, add guard
> > > >  	 and skip to epilog.  Note this only happens when the number of
> > > > @@ -3496,6 +3750,54 @@ vect_do_peeling (loop_vec_info loop_vinfo,
> tree
> > > niters, tree nitersm1,
> > > >        vect_update_ivs_after_vectorizer (loop_vinfo, niters_vector_mult_vf,
> > > >  					update_e);
> > > >
> > > > +      /* For early breaks we must create a guard to check how many
> iterations
> > > > +	 of the scalar loop are yet to be performed.  */
> 
> We have this check anyway, no?  In fact don't we know that we always enter
> the epilog (see above)?
> 
> > > > +      if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > +	{
> > > > +	  tree ivtmp =
> > > > +	    vect_update_ivs_after_early_break (loop_vinfo, epilog, vf, niters,
> > > > +					       *niters_vector, update_e);
> > > > +
> > > > +	  gcc_assert (ivtmp);
> > > > +	  tree guard_cond = fold_build2 (EQ_EXPR, boolean_type_node,
> > > > +					 fold_convert (TREE_TYPE (niters),
> > > > +						       ivtmp),
> > > > +					 build_zero_cst (TREE_TYPE (niters)));
> > > > +	  basic_block guard_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
> > > > +
> > > > +	  /* If we had a fallthrough edge, the guard will the threaded through
> > > > +	     and so we may need to find the actual final edge.  */
> > > > +	  edge final_edge = epilog->vec_loop_iv;
> > > > +	  /* slpeel_update_phi_nodes_for_guard2 expects an empty block in
> > > > +	     between the guard and the exit edge.  It only adds new nodes and
> > > > +	     doesn't update existing one in the current scheme.  */
> > > > +	  basic_block guard_to = split_edge (final_edge);
> > > > +	  edge guard_e = slpeel_add_loop_guard (guard_bb, guard_cond,
> > > guard_to,
> > > > +						guard_bb, prob_epilog.invert
> > > (),
> > > > +						irred_flag);
> > > > +	  doms.safe_push (guard_bb);
> > > > +
> > > > +	  iterate_fix_dominators (CDI_DOMINATORS, doms, false);
> > > > +
> > > > +	  /* We must update all the edges from the new guard_bb.  */
> > > > +	  slpeel_update_phi_nodes_for_guard2 (loop, epilog, guard_e,
> > > > +					      final_edge);
> > > > +
> > > > +	  /* If the loop was versioned we'll have an intermediate BB between
> > > > +	     the guard and the exit.  This intermediate block is required
> > > > +	     because in the current scheme of things the guard block phi
> > > > +	     updating can only maintain LCSSA by creating new blocks.  In this
> > > > +	     case we just need to update the uses in this block as well.  */
> > > > +	  if (loop != scalar_loop)
> > > > +	    {
> > > > +	      for (gphi_iterator gsi = gsi_start_phis (guard_to);
> > > > +		   !gsi_end_p (gsi); gsi_next (&gsi))
> > > > +		rename_use_op (PHI_ARG_DEF_PTR_FROM_EDGE (gsi.phi (),
> > > guard_e));
> > > > +	    }
> > > > +
> > > > +	  flush_pending_stmts (guard_e);
> > > > +	}
> > > > +
> > > >        if (skip_epilog)
> > > >  	{
> > > >  	  guard_cond = fold_build2 (EQ_EXPR, boolean_type_node,
> > > > @@ -3520,8 +3822,6 @@ vect_do_peeling (loop_vec_info loop_vinfo,
> tree
> > > niters, tree nitersm1,
> > > >  	    }
> > > >  	  scale_loop_profile (epilog, prob_epilog, 0);
> > > >  	}
> > > > -      else
> > > > -	slpeel_update_phi_nodes_for_lcssa (epilog);
> > > >
> > > >        unsigned HOST_WIDE_INT bound;
> > > >        if (bound_scalar.is_constant (&bound))
> > > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > > > index
> > >
> b4a98de80aa39057fc9b17977dd0e347b4f0fb5d..ab9a2048186f461f5ec49
> > > f21421958e7ee25eada 100644
> > > > --- a/gcc/tree-vect-loop.cc
> > > > +++ b/gcc/tree-vect-loop.cc
> > > > @@ -1007,6 +1007,8 @@ _loop_vec_info::_loop_vec_info (class loop
> > > *loop_in, vec_info_shared *shared)
> > > >      partial_load_store_bias (0),
> > > >      peeling_for_gaps (false),
> > > >      peeling_for_niter (false),
> > > > +    early_breaks (false),
> > > > +    non_break_control_flow (false),
> > > >      no_data_dependencies (false),
> > > >      has_mask_store (false),
> > > >      scalar_loop_scaling (profile_probability::uninitialized ()),
> > > > @@ -1199,6 +1201,14 @@ vect_need_peeling_or_partial_vectors_p
> > > (loop_vec_info loop_vinfo)
> > > >      th = LOOP_VINFO_COST_MODEL_THRESHOLD
> > > (LOOP_VINFO_ORIG_LOOP_INFO
> > > >  					  (loop_vinfo));
> > > >
> > > > +  /* When we have multiple exits and VF is unknown, we must require
> > > partial
> > > > +     vectors because the loop bounds is not a minimum but a maximum.
> > > That is to
> > > > +     say we cannot unpredicate the main loop unless we peel or use partial
> > > > +     vectors in the epilogue.  */
> > > > +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
> > > > +      && !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ())
> > > > +    return true;
> > > > +
> > > >    if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > > >        && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) >= 0)
> > > >      {
> > > > @@ -1652,12 +1662,12 @@ vect_compute_single_scalar_iteration_cost
> > > (loop_vec_info loop_vinfo)
> > > >    loop_vinfo->scalar_costs->finish_cost (nullptr);
> > > >  }
> > > >
> > > > -
> > > >  /* Function vect_analyze_loop_form.
> > > >
> > > >     Verify that certain CFG restrictions hold, including:
> > > >     - the loop has a pre-header
> > > > -   - the loop has a single entry and exit
> > > > +   - the loop has a single entry
> > > > +   - nested loops can have only a single exit.
> > > >     - the loop exit condition is simple enough
> > > >     - the number of iterations can be analyzed, i.e, a countable loop.  The
> > > >       niter could be analyzed under some assumptions.  */
> > > > @@ -1693,11 +1703,6 @@ vect_analyze_loop_form (class loop *loop,
> > > vect_loop_form_info *info)
> > > >                             |
> > > >                          (exit-bb)  */
> > > >
> > > > -      if (loop->num_nodes != 2)
> > > > -	return opt_result::failure_at (vect_location,
> > > > -				       "not vectorized:"
> > > > -				       " control flow in loop.\n");
> > > > -
> > > >        if (empty_block_p (loop->header))
> > > >  	return opt_result::failure_at (vect_location,
> > > >  				       "not vectorized: empty loop.\n");
> > > > @@ -1768,11 +1773,13 @@ vect_analyze_loop_form (class loop *loop,
> > > vect_loop_form_info *info)
> > > >          dump_printf_loc (MSG_NOTE, vect_location,
> > > >  			 "Considering outer-loop vectorization.\n");
> > > >        info->inner_loop_cond = inner.loop_cond;
> > > > +
> > > > +      if (!single_exit (loop))
> > > > +	return opt_result::failure_at (vect_location,
> > > > +				       "not vectorized: multiple exits.\n");
> > > > +
> > > >      }
> > > >
> > > > -  if (!single_exit (loop))
> > > > -    return opt_result::failure_at (vect_location,
> > > > -				   "not vectorized: multiple exits.\n");
> > > >    if (EDGE_COUNT (loop->header->preds) != 2)
> > > >      return opt_result::failure_at (vect_location,
> > > >  				   "not vectorized:"
> > > > @@ -1788,11 +1795,36 @@ vect_analyze_loop_form (class loop *loop,
> > > vect_loop_form_info *info)
> > > >  				   "not vectorized: latch block not empty.\n");
> > > >
> > > >    /* Make sure the exit is not abnormal.  */
> > > > -  edge e = single_exit (loop);
> > > > -  if (e->flags & EDGE_ABNORMAL)
> > > > -    return opt_result::failure_at (vect_location,
> > > > -				   "not vectorized:"
> > > > -				   " abnormal loop exit edge.\n");
> > > > +  auto_vec<edge> exits = get_loop_exit_edges (loop);
> > > > +  edge nexit = loop->vec_loop_iv;
> > > > +  for (edge e : exits)
> > > > +    {
> > > > +      if (e->flags & EDGE_ABNORMAL)
> > > > +	return opt_result::failure_at (vect_location,
> > > > +				       "not vectorized:"
> > > > +				       " abnormal loop exit edge.\n");
> > > > +      /* Early break BB must be after the main exit BB.  In theory we should
> > > > +	 be able to vectorize the inverse order, but the current flow in the
> > > > +	 the vectorizer always assumes you update successor PHI nodes, not
> > > > +	 preds.  */
> > > > +      if (e != nexit && !dominated_by_p (CDI_DOMINATORS, nexit->src, e-
> > > >src))
> > > > +	return opt_result::failure_at (vect_location,
> > > > +				       "not vectorized:"
> > > > +				       " abnormal loop exit edge order.\n");
> 
> "unsupported loop exit order", but I don't understand the comment.
> 
> > > > +    }
> > > > +
> > > > +  /* We currently only support early exit loops with known bounds.   */
> 
> Btw, why's that?  Is that because we don't support the loop-around edge?
> IMHO this is the most serious limitation (and as said above it should be
> trivial to fix).
> 
> > > > +  if (exits.length () > 1)
> > > > +    {
> > > > +      class tree_niter_desc niter;
> > > > +      if (!number_of_iterations_exit_assumptions (loop, nexit, &niter,
> NULL)
> > > > +	  || chrec_contains_undetermined (niter.niter)
> > > > +	  || !evolution_function_is_constant_p (niter.niter))
> > > > +	return opt_result::failure_at (vect_location,
> > > > +				       "not vectorized:"
> > > > +				       " early breaks only supported on loops"
> > > > +				       " with known iteration bounds.\n");
> > > > +    }
> > > >
> > > >    info->conds
> > > >      = vect_get_loop_niters (loop, &info->assumptions,
> > > > @@ -1866,6 +1898,10 @@ vect_create_loop_vinfo (class loop *loop,
> > > vec_info_shared *shared,
> > > >    LOOP_VINFO_LOOP_CONDS (loop_vinfo).safe_splice (info-
> > > >alt_loop_conds);
> > > >    LOOP_VINFO_LOOP_IV_COND (loop_vinfo) = info->loop_cond;
> > > >
> > > > +  /* Check to see if we're vectorizing multiple exits.  */
> > > > +  LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
> > > > +    = !LOOP_VINFO_LOOP_CONDS (loop_vinfo).is_empty ();
> > > > +
> > > >    if (info->inner_loop_cond)
> > > >      {
> > > >        stmt_vec_info inner_loop_cond_info
> > > > @@ -3070,7 +3106,8 @@ start_over:
> > > >
> > > >    /* If an epilogue loop is required make sure we can create one.  */
> > > >    if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> > > > -      || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo))
> > > > +      || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)
> > > > +      || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > >      {
> > > >        if (dump_enabled_p ())
> > > >          dump_printf_loc (MSG_NOTE, vect_location, "epilog loop
> required\n");
> > > > @@ -5797,7 +5834,7 @@ vect_create_epilog_for_reduction
> (loop_vec_info
> > > loop_vinfo,
> > > >    basic_block exit_bb;
> > > >    tree scalar_dest;
> > > >    tree scalar_type;
> > > > -  gimple *new_phi = NULL, *phi;
> > > > +  gimple *new_phi = NULL, *phi = NULL;
> > > >    gimple_stmt_iterator exit_gsi;
> > > >    tree new_temp = NULL_TREE, new_name, new_scalar_dest;
> > > >    gimple *epilog_stmt = NULL;
> > > > @@ -6039,6 +6076,33 @@ vect_create_epilog_for_reduction
> > > (loop_vec_info loop_vinfo,
> > > >  	  new_def = gimple_convert (&stmts, vectype, new_def);
> > > >  	  reduc_inputs.quick_push (new_def);
> > > >  	}
> > > > +
> > > > +	/* Update the other exits.  */
> > > > +	if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > +	  {
> > > > +	    vec<edge> alt_exits = LOOP_VINFO_ALT_EXITS (loop_vinfo);
> > > > +	    gphi_iterator gsi, gsi1;
> > > > +	    for (edge exit : alt_exits)
> > > > +	      {
> > > > +		/* Find the phi node to propaget into the exit block for each
> > > > +		   exit edge.  */
> > > > +		for (gsi = gsi_start_phis (exit_bb),
> > > > +		     gsi1 = gsi_start_phis (exit->src);
> 
> exit->src == loop->header, right?  I think this won't work for multiple
> alternate exits.  It's probably easier to do this where we create the
> LC PHI node for the reduction result?
> 
> > > > +		     !gsi_end_p (gsi) && !gsi_end_p (gsi1);
> > > > +		     gsi_next (&gsi), gsi_next (&gsi1))
> > > > +		  {
> > > > +		    /* There really should be a function to just get the number
> > > > +		       of phis inside a bb.  */
> > > > +		    if (phi && phi == gsi.phi ())
> > > > +		      {
> > > > +			gphi *phi1 = gsi1.phi ();
> > > > +			SET_PHI_ARG_DEF (phi, exit->dest_idx,
> > > > +					 PHI_RESULT (phi1));
> 
> I think we know the header PHI of a reduction perfectly well, there
> shouldn't be the need to "search" for it.
> 
> > > > +			break;
> > > > +		      }
> > > > +		  }
> > > > +	      }
> > > > +	  }
> > > >        gsi_insert_seq_before (&exit_gsi, stmts, GSI_SAME_STMT);
> > > >      }
> > > >
> > > > @@ -10355,6 +10419,13 @@ vectorizable_live_operation (vec_info
> *vinfo,
> > > >  	   new_tree = lane_extract <vec_lhs', ...>;
> > > >  	   lhs' = new_tree;  */
> > > >
> > > > +      /* When vectorizing an early break, any live statement that is used
> > > > +	 outside of the loop are dead.  The loop will never get to them.
> > > > +	 We could change the liveness value during analysis instead but since
> > > > +	 the below code is invalid anyway just ignore it during codegen.  */
> > > > +      if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > +	return true;
> 
> But what about the value that's live across the main exit when the
> epilogue is not entered?
> 
> > > > +
> > > >        class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > > >        basic_block exit_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
> > > >        gcc_assert (single_pred_p (exit_bb));
> > > > @@ -11277,7 +11348,7 @@ vect_transform_loop (loop_vec_info
> > > loop_vinfo, gimple *loop_vectorized_call)
> > > >    /* Make sure there exists a single-predecessor exit bb.  Do this before
> > > >       versioning.   */
> > > >    edge e = LOOP_VINFO_IV_EXIT (loop_vinfo);
> > > > -  if (! single_pred_p (e->dest))
> > > > +  if (e && ! single_pred_p (e->dest) && !LOOP_VINFO_EARLY_BREAKS
> > > (loop_vinfo))
> 
> e can be NULL here?  I think we should reject such loops earlier.
> 
> > > >      {
> > > >        split_loop_exit_edge (e, true);
> > > >        if (dump_enabled_p ())
> > > > @@ -11303,7 +11374,7 @@ vect_transform_loop (loop_vec_info
> > > loop_vinfo, gimple *loop_vectorized_call)
> > > >    if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo))
> > > >      {
> > > >        e = single_exit (LOOP_VINFO_SCALAR_LOOP (loop_vinfo));
> > > > -      if (! single_pred_p (e->dest))
> > > > +      if (e && ! single_pred_p (e->dest))
> > > >  	{
> > > >  	  split_loop_exit_edge (e, true);
> > > >  	  if (dump_enabled_p ())
> > > > @@ -11641,7 +11712,8 @@ vect_transform_loop (loop_vec_info
> > > loop_vinfo, gimple *loop_vectorized_call)
> > > >
> > > >    /* Loops vectorized with a variable factor won't benefit from
> > > >       unrolling/peeling.  */
> 
> update the comment?  Why would we unroll a VLA loop with early breaks?
> Or did you mean to use || LOOP_VINFO_EARLY_BREAKS (loop_vinfo)?
> 
> > > > -  if (!vf.is_constant ())
> > > > +  if (!vf.is_constant ()
> > > > +      && !LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > >      {
> > > >        loop->unroll = 1;
> > > >        if (dump_enabled_p ())
> > > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > > > index
> > >
> 87c4353fa5180fcb7f60b192897456cf24f3fdbe..03524e8500ee06df42f82af
> > > e78ee2a7c627be45b 100644
> > > > --- a/gcc/tree-vect-stmts.cc
> > > > +++ b/gcc/tree-vect-stmts.cc
> > > > @@ -344,9 +344,34 @@ vect_stmt_relevant_p (stmt_vec_info
> stmt_info,
> > > loop_vec_info loop_vinfo,
> > > >    *live_p = false;
> > > >
> > > >    /* cond stmt other than loop exit cond.  */
> > > > -  if (is_ctrl_stmt (stmt_info->stmt)
> > > > -      && STMT_VINFO_TYPE (stmt_info) != loop_exit_ctrl_vec_info_type)
> > > > -    *relevant = vect_used_in_scope;
> 
> how was that ever hit before?  For outer loop processing with outer loop
> vectorization?
> 
> > > > +  if (is_ctrl_stmt (stmt_info->stmt))
> > > > +    {
> > > > +      /* Ideally EDGE_LOOP_EXIT would have been set on the exit edge,
> but
> > > > +	 it looks like loop_manip doesn't do that..  So we have to do it
> > > > +	 the hard way.  */
> > > > +      basic_block bb = gimple_bb (stmt_info->stmt);
> > > > +      bool exit_bb = false, early_exit = false;
> > > > +      edge_iterator ei;
> > > > +      edge e;
> > > > +      FOR_EACH_EDGE (e, ei, bb->succs)
> > > > +        if (!flow_bb_inside_loop_p (loop, e->dest))
> > > > +	  {
> > > > +	    exit_bb = true;
> > > > +	    early_exit = loop->vec_loop_iv->src != bb;
> > > > +	    break;
> > > > +	  }
> > > > +
> > > > +      /* We should have processed any exit edge, so an edge not an early
> > > > +	 break must be a loop IV edge.  We need to distinguish between the
> > > > +	 two as we don't want to generate code for the main loop IV.  */
> > > > +      if (exit_bb)
> > > > +	{
> > > > +	  if (early_exit)
> > > > +	    *relevant = vect_used_in_scope;
> > > > +	}
> 
> I wonder why you can't simply do
> 
>          if (is_ctrl_stmt (stmt_info->stmt)
>              && stmt_info->stmt != LOOP_VINFO_COND (loop_info))
> 
> ?
> 
> > > > +      else if (bb->loop_father == loop)
> > > > +	LOOP_VINFO_GENERAL_CTR_FLOW (loop_vinfo) = true;
> 
> so for control flow not exiting the loop you can check
> loop_exits_from_bb_p ().
> 
> > > > +    }
> > > >
> > > >    /* changing memory.  */
> > > >    if (gimple_code (stmt_info->stmt) != GIMPLE_PHI)
> > > > @@ -359,6 +384,11 @@ vect_stmt_relevant_p (stmt_vec_info
> stmt_info,
> > > loop_vec_info loop_vinfo,
> > > >  	*relevant = vect_used_in_scope;
> > > >        }
> > > >
> > > > +  auto_vec<edge> exits = get_loop_exit_edges (loop);
> > > > +  auto_bitmap exit_bbs;
> > > > +  for (edge exit : exits)
> > > > +    bitmap_set_bit (exit_bbs, exit->dest->index);
> > > > +
> > > >    /* uses outside the loop.  */
> > > >    FOR_EACH_PHI_OR_STMT_DEF (def_p, stmt_info->stmt, op_iter,
> > > SSA_OP_DEF)
> > > >      {
> > > > @@ -377,7 +407,7 @@ vect_stmt_relevant_p (stmt_vec_info stmt_info,
> > > loop_vec_info loop_vinfo,
> > > >  	      /* We expect all such uses to be in the loop exit phis
> > > >  		 (because of loop closed form)   */
> > > >  	      gcc_assert (gimple_code (USE_STMT (use_p)) == GIMPLE_PHI);
> > > > -	      gcc_assert (bb == single_exit (loop)->dest);
> > > > +	      gcc_assert (bitmap_bit_p (exit_bbs, bb->index));
> 
> That now becomes quite expensive checking already covered by the LC SSA
> verifier so I suggest to simply drop this assert instead.
> 
> > > >                *live_p = true;
> > > >  	    }
> > > > @@ -683,6 +713,13 @@ vect_mark_stmts_to_be_vectorized
> > > (loop_vec_info loop_vinfo, bool *fatal)
> > > >  	}
> > > >      }
> > > >
> > > > +  /* Ideally this should be in vect_analyze_loop_form but we haven't
> seen all
> > > > +     the conds yet at that point and there's no quick way to retrieve them.
> */
> > > > +  if (LOOP_VINFO_GENERAL_CTR_FLOW (loop_vinfo))
> > > > +    return opt_result::failure_at (vect_location,
> > > > +				   "not vectorized:"
> > > > +				   " unsupported control flow in loop.\n");
> 
> so we didn't do this before?  But see above where I wondered.  So when
> does this hit with early exits and why can't we check for this in
> vect_verify_loop_form?
> 
> > > > +
> > > >    /* 2. Process_worklist */
> > > >    while (worklist.length () > 0)
> > > >      {
> > > > @@ -778,6 +815,20 @@ vect_mark_stmts_to_be_vectorized
> > > (loop_vec_info loop_vinfo, bool *fatal)
> > > >  			return res;
> > > >  		    }
> > > >                   }
> > > > +	    }
> > > > +	  else if (gcond *cond = dyn_cast <gcond *> (stmt_vinfo->stmt))
> > > > +	    {
> > > > +	      enum tree_code rhs_code = gimple_cond_code (cond);
> > > > +	      gcc_assert (TREE_CODE_CLASS (rhs_code) == tcc_comparison);
> > > > +	      opt_result res
> > > > +		= process_use (stmt_vinfo, gimple_cond_lhs (cond),
> > > > +			       loop_vinfo, relevant, &worklist, false);
> > > > +	      if (!res)
> > > > +		return res;
> > > > +	      res = process_use (stmt_vinfo, gimple_cond_rhs (cond),
> > > > +				loop_vinfo, relevant, &worklist, false);
> > > > +	      if (!res)
> > > > +		return res;
> > > >              }
> > > >  	  else if (gcall *call = dyn_cast <gcall *> (stmt_vinfo->stmt))
> > > >  	    {
> > > > @@ -11919,11 +11970,15 @@ vect_analyze_stmt (vec_info *vinfo,
> > > >  			     node_instance, cost_vec);
> > > >        if (!res)
> > > >  	return res;
> > > > -   }
> > > > +    }
> > > > +
> > > > +  if (is_ctrl_stmt (stmt_info->stmt))
> > > > +    STMT_VINFO_DEF_TYPE (stmt_info) = vect_early_exit_def;
> > > >
> > > >    switch (STMT_VINFO_DEF_TYPE (stmt_info))
> > > >      {
> > > >        case vect_internal_def:
> > > > +      case vect_early_exit_def:
> > > >          break;
> > > >
> > > >        case vect_reduction_def:
> > > > @@ -11956,6 +12011,7 @@ vect_analyze_stmt (vec_info *vinfo,
> > > >      {
> > > >        gcall *call = dyn_cast <gcall *> (stmt_info->stmt);
> > > >        gcc_assert (STMT_VINFO_VECTYPE (stmt_info)
> > > > +		  || gimple_code (stmt_info->stmt) == GIMPLE_COND
> > > >  		  || (call && gimple_call_lhs (call) == NULL_TREE));
> > > >        *need_to_vectorize = true;
> > > >      }
> > > > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> > > > index
> > >
> ec65b65b5910e9cbad0a8c7e83c950b6168b98bf..24a0567a2f23f1b3d8b3
> > > 40baff61d18da8e242dd 100644
> > > > --- a/gcc/tree-vectorizer.h
> > > > +++ b/gcc/tree-vectorizer.h
> > > > @@ -63,6 +63,7 @@ enum vect_def_type {
> > > >    vect_internal_def,
> > > >    vect_induction_def,
> > > >    vect_reduction_def,
> > > > +  vect_early_exit_def,
> 
> can you avoid putting this inbetween reduction and double reduction
> please?  Just put it before vect_unknown_def_type.  In fact the COND
> isn't a def ... maybe we should have pattern recogized
> 
>  if (a < b) exit;
> 
> as
> 
>  cond = a < b;
>  if (cond != 0) exit;
> 
> so the part that we need to vectorize is more clear.
> 
> > > >    vect_double_reduction_def,
> > > >    vect_nested_cycle,
> > > >    vect_first_order_recurrence,
> > > > @@ -876,6 +877,13 @@ public:
> > > >       we need to peel off iterations at the end to form an epilogue loop.  */
> > > >    bool peeling_for_niter;
> > > >
> > > > +  /* When the loop has early breaks that we can vectorize we need to
> peel
> > > > +     the loop for the break finding loop.  */
> > > > +  bool early_breaks;
> > > > +
> > > > +  /* When the loop has a non-early break control flow inside.  */
> > > > +  bool non_break_control_flow;
> > > > +
> > > >    /* List of loop additional IV conditionals found in the loop.  */
> > > >    auto_vec<gcond *> conds;
> > > >
> > > > @@ -985,9 +993,11 @@ public:
> > > >  #define LOOP_VINFO_REDUCTION_CHAINS(L)     (L)->reduction_chains
> > > >  #define LOOP_VINFO_PEELING_FOR_GAPS(L)     (L)->peeling_for_gaps
> > > >  #define LOOP_VINFO_PEELING_FOR_NITER(L)    (L)->peeling_for_niter
> > > > +#define LOOP_VINFO_EARLY_BREAKS(L)         (L)->early_breaks
> > > >  #define LOOP_VINFO_EARLY_BRK_CONFLICT_STMTS(L) (L)-
> > > >early_break_conflict
> > > >  #define LOOP_VINFO_EARLY_BRK_DEST_BB(L)    (L)-
> >early_break_dest_bb
> > > >  #define LOOP_VINFO_EARLY_BRK_VUSES(L)      (L)->early_break_vuses
> > > > +#define LOOP_VINFO_GENERAL_CTR_FLOW(L)     (L)-
> > > >non_break_control_flow
> > > >  #define LOOP_VINFO_LOOP_CONDS(L)           (L)->conds
> > > >  #define LOOP_VINFO_LOOP_IV_COND(L)         (L)->loop_iv_cond
> > > >  #define LOOP_VINFO_NO_DATA_DEPENDENCIES(L) (L)-
> > > >no_data_dependencies
> > > > @@ -1038,8 +1048,8 @@ public:
> > > >     stack.  */
> > > >  typedef opt_pointer_wrapper <loop_vec_info> opt_loop_vec_info;
> > > >
> > > > -inline loop_vec_info
> > > > -loop_vec_info_for_loop (class loop *loop)
> > > > +static inline loop_vec_info
> > > > +loop_vec_info_for_loop (const class loop *loop)
> > > >  {
> > > >    return (loop_vec_info) loop->aux;
> > > >  }
> > > > @@ -1789,7 +1799,7 @@ is_loop_header_bb_p (basic_block bb)
> > > >  {
> > > >    if (bb == (bb->loop_father)->header)
> > > >      return true;
> > > > -  gcc_checking_assert (EDGE_COUNT (bb->preds) == 1);
> > > > +
> > > >    return false;
> > > >  }
> > > >
> > > > @@ -2176,9 +2186,10 @@ class auto_purge_vect_location
> > > >     in tree-vect-loop-manip.cc.  */
> > > >  extern void vect_set_loop_condition (class loop *, loop_vec_info,
> > > >  				     tree, tree, tree, bool);
> > > > -extern bool slpeel_can_duplicate_loop_p (const class loop *,
> const_edge);
> > > > +extern bool slpeel_can_duplicate_loop_p (const loop_vec_info,
> > > const_edge);
> > > >  class loop *slpeel_tree_duplicate_loop_to_edge_cfg (class loop *,
> > > > -						     class loop *, edge);
> > > > +						    class loop *, edge, bool,
> > > > +						    vec<basic_block> * = NULL);
> > > >  class loop *vect_loop_versioning (loop_vec_info, gimple *);
> > > >  extern class loop *vect_do_peeling (loop_vec_info, tree, tree,
> > > >  				    tree *, tree *, tree *, int, bool, bool,
> > > > diff --git a/gcc/tree-vectorizer.cc b/gcc/tree-vectorizer.cc
> > > > index
> > >
> a048e9d89178a37455bd7b83ab0f2a238a4ce69e..0dc5479dc92058b6c70c
> > > 67f29f5dc9a8d72235f4 100644
> > > > --- a/gcc/tree-vectorizer.cc
> > > > +++ b/gcc/tree-vectorizer.cc
> > > > @@ -1379,7 +1379,9 @@ pass_vectorize::execute (function *fun)
> > > >  	 predicates that need to be shared for optimal predicate usage.
> > > >  	 However reassoc will re-order them and prevent CSE from working
> > > >  	 as it should.  CSE only the loop body, not the entry.  */
> > > > -      bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
> > > > +      auto_vec<edge> exits = get_loop_exit_edges (loop);
> 
> seeing this more and more I think we want a simple way to iterate over
> all exits without copying to a vector when we have them recorded.  My
> C++ fu is too limited to support
> 
>   for (auto exit : recorded_exits (loop))
>     ...
> 
> (maybe that's enough for somebody to jump onto this ;))
> 
> Don't treat all review comments as change orders, but it should be clear
> the code isn't 100% obvious.  Maybe the patch can be simplified by
> splitting out the LC SSA cleanup parts.
> 
> Thanks,
> Richard.
> 
> > > > +      for (edge exit : exits)
> > > > +	bitmap_set_bit (exit_bbs, exit->dest->index);
> > > >
> > > >        edge entry = EDGE_PRED (loop_preheader_edge (loop)->src, 0);
> > > >        do_rpo_vn (fun, entry, exit_bbs);
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > > --
> > > Richard Biener <rguenther@suse.de>
> > > SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461
> > > Nuernberg,
> > > Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien
> > > Moerman;
> > > HRB 36809 (AG Nuernberg)
> >
  
Richard Biener Aug. 18, 2023, 12:53 p.m. UTC | #7
On Fri, 18 Aug 2023, Tamar Christina wrote:

> > > Yeah if you comment it out one of the testcases should fail.
> > 
> > using new_preheader instead of e->dest would make things clearer.
> > 
> > You are now adding the same arg to every exit (you've just queried the
> > main exit redirect_edge_var_map_vector).
> > 
> > OK, so I think I understand what you're doing.  If I understand
> > correctly we know that when we exit the main loop via one of the
> > early exits we are definitely going to enter the epilog but when
> > we take the main exit we might not.
> > 
> 
> Correct.. but..
> 
> > Looking at the CFG we create currently this isn't reflected and
> > this complicates this PHI node updating.  What I'd try to do
> > is leave redirecting the alternate exits until after
> 
> It is, in the case of the alternate exits this is reflected in copying
> the same values, as they are the values of the number of completed 
> iterations since the scalar code restarts the last iteration.
> 
> So all the PHI nodes of the alternate exits are correct.  The vector
> Iteration doesn't handle the partial iteration.
> 
> > slpeel_tree_duplicate_loop_to_edge_cfg finished which probably
> > means leaving it almost unchanged besides the LC SSA maintaining
> > changes.  After that for the multi-exit case split the
> > epilog preheader edge and redirect all the alternate exits to the
> > new preheader.  So the CFG becomes
> > 
> >                  <original loop>
> >                 /      |
> >                /    <main exit w/ original LC PHI>
> >               /      if (epilog)
> >    alt exits /        /  \
> >             /        /    loop around
> >             |       /
> >            preheader with "header" PHIs
> >               |
> >           <epilog>
> > 
> > note you need the header PHIs also on the main exit path but you
> > only need the loop end PHIs there.
> > 
> > It seems so that at least currently the order of things makes
> > them more complicated than necessary.
> 
> I've been trying to, but this representation seems a lot harder to work with,
> In particular at the moment once we exit slpeel_tree_duplicate_loop_to_edge_cfg
> the loop structure is exactly the same as one expects from any normal epilog vectorization.
> 
> But this new representation requires me to place the guard much earlier than the epilogue
> preheader,  yet I still have to adjust the PHI nodes in the preheader.  So it seems that this split
> is there to only indicate that we always enter the epilog when taking an early exit.
> 
> Today this is reflected in the values of the PHI nodes rather than structurally.  Once we place
> The guard we update the nodes and the alternate exits get their value for ivtmp updated to VF.
> 
> This representation also forces me to do the redirection in every call site of
> slpeel_tree_duplicate_loop_to_edge_cfg making the code more complicated in all use sites.
> 
> But I think this doesn't address the main reason why the slpeel_tree_duplicate_loop_to_edge_cfg
> code has a large block of code to deal with PHI node updates.
> 
> The reason as you mentioned somewhere else is that after we redirect the edges I have to reconstruct
> the phi nodes.  For most it's straight forwards, but for live values or vuse chains it requires extra code.
> 
> You're right in that before we redirect the edges they are all correct in the exit block, you mentioned that
> the API for the edge redirection is supposed to copy the values over if I create the phi nodes before hand.
> 
> However this doesn't seem to work:
> 
>      for (auto gsi_from = gsi_start_phis (scalar_exit->dest);
> 	   !gsi_end_p (gsi_from); gsi_next (&gsi_from))
> 	{
> 	  gimple *from_phi = gsi_stmt (gsi_from);
> 	  tree new_res = copy_ssa_name (gimple_phi_result (from_phi));
> 	  create_phi_node (new_res, new_preheader);
> 	}
> 
>       for (edge exit : loop_exits)
> 	redirect_edge_and_branch (exit, new_preheader);
> 
> Still leaves them empty.  Grepping around most code seems to pair redirect_edge_and_branch with
> copy_phi_arg_into_existing_phi.  The problem is that in all these cases after redirecting an edge they
> call copy_phi_arg_into_existing_phi from a predecessor edge to fill in the phi nodes.

You need to call flush_pending_stmts on each edge you redirect.
copy_phi_arg_into_existing_phi isn't suitable for edge redirecting.

> This is because as I redirect_edge_and_branch destroys the phi node entries and copy_phi_arg_into_existing_phi
> simply just reads the gimple_phi_arg_def which would be NULL.
> 
> You could point it to the src block of the exit, in which case it copies the wrong values in for the vuses.  At the end
> of vectorization the cfgcleanup code does the same thing to maintain LCSSA if you haven't.  This code always goes
> wrong for multiple exits because of the problem described above.  There's no node for it to copy the right value
> from.
> 
> As an alternate approach I can split the exit edges, copy the phi nodes into the split and after that redirect them.
> This however creates the awkwardness of having the exit edges no longer connect to the preheader.
> 
> All of this then begs the question if this is all easier than the current approach which is just to read the edge var
> map to figure out the nodes that were removed during the redirect.

But the edge map is supposed to be applied via flush_pending_stmts,
specifically it relies on PHI nodes having a 1:1 correspondence between
old and new destination and thus is really designed for the case
you copy the destination and redirect an edge to the copy.

That is, the main issue I have with the CFG manipulation is that it
isn't broken down to simple operations that in themselves leave
everything correct.  I think it should be possible to do this
and as 2nd step only do the special massaging for the early exit
LC PHIs that feed into the epilogue loop.

As you say the code is quite complicated even without early break
vectorization which is why I originally suggested to try "fixing" it
as prerequesite.  It does have the same fundamental issue when feeding
the epilogue - the "missing" LC PHIs, the difference is only that
without early break vectorization we take the exit values while
for early break vectorization we take the latch values from the
previous iteration(?)

> Maybe I'm still misunderstanding the API, but reading the sources of the functions, they all copy values from *existing*
> phi nodes.  And any existing phi node after the redirect are not correct.
> 
> gimple_redirect_edge_and_branch has a chunk that indicates it should have updated the PHI nodes before calling
> ssa_redirect_edge to remove the old ones, but there's no code there. It's all empty.
> 
> Most of the other refactorings/changes were easy enough to do, but this one I seem to be a struggling with.

I see.  If you are tired of trying feel free to send an updated series
with the other changes, if it looks awkward but correct we can see
someone doing the cleanup afterwards.

Richard.

> Thanks,
> Tamar
> > 
> > > >
> > > > >  	}
> > > > > -      redirect_edge_and_branch_force (e, new_preheader);
> > > > > -      flush_pending_stmts (e);
> > > > > +
> > > > >        set_immediate_dominator (CDI_DOMINATORS, new_preheader, e-
> > >src);
> > > > > -      if (was_imm_dom || duplicate_outer_loop)
> > > > > +
> > > > > +      if ((was_imm_dom || duplicate_outer_loop) && !multiple_exits_p)
> > > > >  	set_immediate_dominator (CDI_DOMINATORS, exit_dest, new_exit-
> > > > >src);
> > > > >
> > > > >        /* And remove the non-necessary forwarder again.  Keep the other
> > > > > @@ -1647,9 +1756,42 @@ slpeel_tree_duplicate_loop_to_edge_cfg
> > (class
> > > > loop *loop,
> > > > >        delete_basic_block (preheader);
> > > > >        set_immediate_dominator (CDI_DOMINATORS, scalar_loop->header,
> > > > >  			       loop_preheader_edge (scalar_loop)->src);
> > > > > +
> > > > > +      /* Finally after wiring the new epilogue we need to update its main
> > exit
> > > > > +	 to the original function exit we recorded.  Other exits are already
> > > > > +	 correct.  */
> > > > > +      if (multiple_exits_p)
> > > > > +	{
> > > > > +	  for (edge e : get_loop_exit_edges (loop))
> > > > > +	    doms.safe_push (e->dest);
> > > > > +	  update_loop = new_loop;
> > > > > +	  doms.safe_push (exit_dest);
> > > > > +
> > > > > +	  /* Likely a fall-through edge, so update if needed.  */
> > > > > +	  if (single_succ_p (exit_dest))
> > > > > +	    doms.safe_push (single_succ (exit_dest));
> > > > > +	}
> > > > >      }
> > > > >    else /* Add the copy at entry.  */
> > > > >      {
> > > > > +      /* Copy the current loop LC PHI nodes between the original loop exit
> > > > > +	 block and the new loop header.  This allows us to later split the
> > > > > +	 preheader block and still find the right LC nodes.  */
> > > > > +      edge old_latch_loop = loop_latch_edge (loop);
> > > > > +      edge old_latch_init = loop_preheader_edge (loop);
> > > > > +      edge new_latch_loop = loop_latch_edge (new_loop);
> > > > > +      edge new_latch_init = loop_preheader_edge (new_loop);
> > > > > +      for (auto gsi_from = gsi_start_phis (new_latch_init->dest),
> > > >
> > > > see above
> > > >
> > > > > +	   gsi_to = gsi_start_phis (old_latch_loop->dest);
> > > > > +	   flow_loops && !gsi_end_p (gsi_from) && !gsi_end_p (gsi_to);
> > > > > +	   gsi_next (&gsi_from), gsi_next (&gsi_to))
> > > > > +	{
> > > > > +	  gimple *from_phi = gsi_stmt (gsi_from);
> > > > > +	  gimple *to_phi = gsi_stmt (gsi_to);
> > > > > +	  tree new_arg = PHI_ARG_DEF_FROM_EDGE (from_phi,
> > > > new_latch_loop);
> > > > > +	  adjust_phi_and_debug_stmts (to_phi, old_latch_init, new_arg);
> > > > > +	}
> > > > > +
> > > > >        if (scalar_loop != loop)
> > > > >  	{
> > > > >  	  /* Remove the non-necessary forwarder of scalar_loop again.  */
> > > > > @@ -1677,31 +1819,36 @@ slpeel_tree_duplicate_loop_to_edge_cfg
> > (class
> > > > loop *loop,
> > > > >        delete_basic_block (new_preheader);
> > > > >        set_immediate_dominator (CDI_DOMINATORS, new_loop->header,
> > > > >  			       loop_preheader_edge (new_loop)->src);
> > > > > +
> > > > > +      if (multiple_exits_p)
> > > > > +	update_loop = loop;
> > > > >      }
> > > > >
> > > > > -  if (scalar_loop != loop)
> > > > > +  if (multiple_exits_p)
> > > > >      {
> > > > > -      /* Update new_loop->header PHIs, so that on the preheader
> > > > > -	 edge they are the ones from loop rather than scalar_loop.  */
> > > > > -      gphi_iterator gsi_orig, gsi_new;
> > > > > -      edge orig_e = loop_preheader_edge (loop);
> > > > > -      edge new_e = loop_preheader_edge (new_loop);
> > > > > -
> > > > > -      for (gsi_orig = gsi_start_phis (loop->header),
> > > > > -	   gsi_new = gsi_start_phis (new_loop->header);
> > > > > -	   !gsi_end_p (gsi_orig) && !gsi_end_p (gsi_new);
> > > > > -	   gsi_next (&gsi_orig), gsi_next (&gsi_new))
> > > > > +      for (edge e : get_loop_exit_edges (update_loop))
> > > > >  	{
> > > > > -	  gphi *orig_phi = gsi_orig.phi ();
> > > > > -	  gphi *new_phi = gsi_new.phi ();
> > > > > -	  tree orig_arg = PHI_ARG_DEF_FROM_EDGE (orig_phi, orig_e);
> > > > > -	  location_t orig_locus
> > > > > -	    = gimple_phi_arg_location_from_edge (orig_phi, orig_e);
> > > > > -
> > > > > -	  add_phi_arg (new_phi, orig_arg, new_e, orig_locus);
> > > > > +	  edge ex;
> > > > > +	  edge_iterator ei;
> > > > > +	  FOR_EACH_EDGE (ex, ei, e->dest->succs)
> > > > > +	    {
> > > > > +	      /* Find the first non-fallthrough block as fall-throughs can't
> > > > > +		 dominate other blocks.  */
> > > > > +	      while ((ex->flags & EDGE_FALLTHRU)
> > 
> > For the prologue peeling any early exit we take would skip all other
> > loops so we can simply leave them and their LC PHI nodes in place.
> > We need extra PHIs only on the path to the main vector loop.  I
> > think the comment isn't accurately reflecting what we do.  In
> > fact we do not add any LC PHI nodes here but simply adjust the
> > main loop header PHI arguments?
> > 
> > > > I don't think EDGE_FALLTHRU is set correctly, what's wrong with
> > > > just using single_succ_p here?  A fallthru edge src dominates the
> > > > fallthru edge dest, so the sentence above doesn't make sense.
> > >
> > > I wanted to say, that the immediate dominator of a block is never
> > > an fall through block.  At least from what I understood from how
> > > the dominators are calculated in the code, though may have missed
> > > something.
> > 
> >  BB1
> >   |
> >  BB2
> >   |
> >  BB3
> > 
> > here the immediate dominator of BB3 is BB2 and that of BB2 is BB1.
> > 
> > > >
> > > > > +		     && single_succ_p (ex->dest))
> > > > > +		{
> > > > > +		  doms.safe_push (ex->dest);
> > > > > +		  ex = single_succ_edge (ex->dest);
> > > > > +		}
> > > > > +	      doms.safe_push (ex->dest);
> > > > > +	    }
> > > > > +	  doms.safe_push (e->dest);
> > > > >  	}
> > > > > -    }
> > > > >
> > > > > +      iterate_fix_dominators (CDI_DOMINATORS, doms, false);
> > > > > +      if (updated_doms)
> > > > > +	updated_doms->safe_splice (doms);
> > > > > +    }
> > > > >    free (new_bbs);
> > > > >    free (bbs);
> > > > >
> > > > > @@ -1777,6 +1924,9 @@ slpeel_can_duplicate_loop_p (const
> > > > loop_vec_info loop_vinfo, const_edge e)
> > > > >    gimple_stmt_iterator loop_exit_gsi = gsi_last_bb (exit_e->src);
> > > > >    unsigned int num_bb = loop->inner? 5 : 2;
> > > > >
> > > > > +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > > +    num_bb += LOOP_VINFO_ALT_EXITS (loop_vinfo).length ();
> > > > > +
> > > >
> > > > I think checking the number of BBs is odd, I don't remember anything
> > > > in slpeel is specifically tied to that?  I think we can simply drop
> > > > this or do you remember anything that would depend on ->num_nodes
> > > > being only exactly 5 or 2?
> > >
> > > Never actually seemed to require it, but they're used as some check to
> > > see if there are unexpected control flow in the loop.
> > >
> > > i.e. this would say no if you have an if statement in the loop that wasn't
> > > converted.  The other part of this and the accompanying explanation is in
> > > vect_analyze_loop_form.  In the patch series I had to remove the hard
> > > num_nodes == 2 check from there because number of nodes restricted
> > > things too much.  If you have an empty fall through block, which seems to
> > > happen often between the main exit and the latch block then we'd not
> > > vectorize.
> > >
> > > So instead I now rejects loops after analyzing the gcond.  So think this check
> > > can go/needs to be different.
> > 
> > Lets remove it from this function then.
> > 
> > > >
> > > > >    /* All loops have an outer scope; the only case loop->outer is NULL is for
> > > > >       the function itself.  */
> > > > >    if (!loop_outer (loop)
> > > > > @@ -2044,6 +2194,11 @@ vect_update_ivs_after_vectorizer
> > > > (loop_vec_info loop_vinfo,
> > > > >    class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > > > >    basic_block update_bb = update_e->dest;
> > > > >
> > > > > +  /* For early exits we'll update the IVs in
> > > > > +     vect_update_ivs_after_early_break.  */
> > > > > +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > > +    return;
> > > > > +
> > > > >    basic_block exit_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
> > > > >
> > > > >    /* Make sure there exists a single-predecessor exit bb:  */
> > > > > @@ -2131,6 +2286,208 @@ vect_update_ivs_after_vectorizer
> > > > (loop_vec_info loop_vinfo,
> > > > >        /* Fix phi expressions in the successor bb.  */
> > > > >        adjust_phi_and_debug_stmts (phi1, update_e, ni_name);
> > > > >      }
> > > > > +  return;
> > > >
> > > > we don't usually place a return at the end of void functions
> > > >
> > > > > +}
> > > > > +
> > > > > +/*   Function vect_update_ivs_after_early_break.
> > > > > +
> > > > > +     "Advance" the induction variables of LOOP to the value they should
> > take
> > > > > +     after the execution of LOOP.  This is currently necessary because the
> > > > > +     vectorizer does not handle induction variables that are used after the
> > > > > +     loop.  Such a situation occurs when the last iterations of LOOP are
> > > > > +     peeled, because of the early exit.  With an early exit we always peel
> > the
> > > > > +     loop.
> > > > > +
> > > > > +     Input:
> > > > > +     - LOOP_VINFO - a loop info structure for the loop that is going to be
> > > > > +		    vectorized. The last few iterations of LOOP were peeled.
> > > > > +     - LOOP - a loop that is going to be vectorized. The last few iterations
> > > > > +	      of LOOP were peeled.
> > > > > +     - VF - The loop vectorization factor.
> > > > > +     - NITERS_ORIG - the number of iterations that LOOP executes (before
> > it is
> > > > > +		     vectorized). i.e, the number of times the ivs should be
> > > > > +		     bumped.
> > > > > +     - NITERS_VECTOR - The number of iterations that the vector LOOP
> > > > executes.
> > > > > +     - UPDATE_E - a successor edge of LOOP->exit that is on the (only)
> > path
> > > > > +		  coming out from LOOP on which there are uses of the LOOP
> > > > ivs
> > > > > +		  (this is the path from LOOP->exit to epilog_loop->preheader).
> > > > > +
> > > > > +		  The new definitions of the ivs are placed in LOOP->exit.
> > > > > +		  The phi args associated with the edge UPDATE_E in the bb
> > > > > +		  UPDATE_E->dest are updated accordingly.
> > > > > +
> > > > > +     Output:
> > > > > +       - If available, the LCSSA phi node for the loop IV temp.
> > > > > +
> > > > > +     Assumption 1: Like the rest of the vectorizer, this function assumes
> > > > > +     a single loop exit that has a single predecessor.
> > > > > +
> > > > > +     Assumption 2: The phi nodes in the LOOP header and in update_bb
> > are
> > > > > +     organized in the same order.
> > > > > +
> > > > > +     Assumption 3: The access function of the ivs is simple enough (see
> > > > > +     vect_can_advance_ivs_p).  This assumption will be relaxed in the
> > future.
> > > > > +
> > > > > +     Assumption 4: Exactly one of the successors of LOOP exit-bb is on a
> > path
> > > > > +     coming out of LOOP on which the ivs of LOOP are used (this is the
> > path
> > > > > +     that leads to the epilog loop; other paths skip the epilog loop).  This
> > > > > +     path starts with the edge UPDATE_E, and its destination (denoted
> > > > update_bb)
> > > > > +     needs to have its phis updated.
> > > > > + */
> > > > > +
> > > > > +static tree
> > > > > +vect_update_ivs_after_early_break (loop_vec_info loop_vinfo, class
> > loop *
> > > > epilog,
> > > > > +				   poly_int64 vf, tree niters_orig,
> > > > > +				   tree niters_vector, edge update_e)
> > > > > +{
> > > > > +  if (!LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > > +    return NULL;
> > > > > +
> > > > > +  gphi_iterator gsi, gsi1;
> > > > > +  tree ni_name, ivtmp = NULL;
> > > > > +  basic_block update_bb = update_e->dest;
> > > > > +  vec<edge> alt_exits = LOOP_VINFO_ALT_EXITS (loop_vinfo);
> > > > > +  edge loop_iv = LOOP_VINFO_IV_EXIT (loop_vinfo);
> > > > > +  basic_block exit_bb = loop_iv->dest;
> > > > > +  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > > > > +  gcond *cond = LOOP_VINFO_LOOP_IV_COND (loop_vinfo);
> > > > > +
> > > > > +  gcc_assert (cond);
> > > > > +
> > > > > +  for (gsi = gsi_start_phis (loop->header), gsi1 = gsi_start_phis
> > (update_bb);
> > > > > +       !gsi_end_p (gsi) && !gsi_end_p (gsi1);
> > > > > +       gsi_next (&gsi), gsi_next (&gsi1))
> > > > > +    {
> > > > > +      tree init_expr, final_expr, step_expr;
> > > > > +      tree type;
> > > > > +      tree var, ni, off;
> > > > > +      gimple_stmt_iterator last_gsi;
> > > > > +
> > > > > +      gphi *phi = gsi1.phi ();
> > > > > +      tree phi_ssa = PHI_ARG_DEF_FROM_EDGE (phi,
> > loop_preheader_edge
> > > > (epilog));
> > > >
> > > > I'm confused about the setup.  update_bb looks like the block with the
> > > > loop-closed PHI nodes of 'loop' and the exit (update_e)?  How does
> > > > loop_preheader_edge (epilog) come into play here?  That would feed into
> > > > epilog->header PHIs?!
> > >
> > > We can't query the type of the phis in the block with the LC PHI nodes, so the
> > > Typical pattern seems to be that we iterate over a block that's part of the
> > loop
> > > and that would have the PHIs in the same order, just so we can get to the
> > > stmt_vec_info.
> > >
> > > >
> > > > It would be nice to name 'gsi[1]', 'update_e' and 'update_bb' in a
> > > > better way?  Is update_bb really epilog->header?!
> > > >
> > > > We're missing checking in PHI_ARG_DEF_FROM_EDGE, namely that
> > > > E->dest == gimple_bb (PHI) - we're just using E->dest_idx there
> > > > which "works" even for totally unrelated edges.
> > > >
> > > > > +      gphi *phi1 = dyn_cast <gphi *> (SSA_NAME_DEF_STMT (phi_ssa));
> > > > > +      if (!phi1)
> > > >
> > > > shouldn't that be an assert?
> > > >
> > > > > +	continue;
> > > > > +      stmt_vec_info phi_info = loop_vinfo->lookup_stmt (gsi.phi ());
> > > > > +      if (dump_enabled_p ())
> > > > > +	dump_printf_loc (MSG_NOTE, vect_location,
> > > > > +			 "vect_update_ivs_after_early_break: phi: %G",
> > > > > +			 (gimple *)phi);
> > > > > +
> > > > > +      /* Skip reduction and virtual phis.  */
> > > > > +      if (!iv_phi_p (phi_info))
> > > > > +	{
> > > > > +	  if (dump_enabled_p ())
> > > > > +	    dump_printf_loc (MSG_NOTE, vect_location,
> > > > > +			     "reduc or virtual phi. skip.\n");
> > > > > +	  continue;
> > > > > +	}
> > > > > +
> > > > > +      /* For multiple exits where we handle early exits we need to carry on
> > > > > +	 with the previous IV as loop iteration was not done because we exited
> > > > > +	 early.  As such just grab the original IV.  */
> > > > > +      phi_ssa = PHI_ARG_DEF_FROM_EDGE (gsi.phi (), loop_latch_edge
> > > > (loop));
> > > >
> > > > but this should be taken care of by LC SSA?
> > >
> > > It is, the comment is probably missing details, this part just scales the
> > counter
> > > from VF to scalar counts.  It's just a reminder that this scaling is done
> > differently
> > > from normal single exit vectorization.
> > >
> > > >
> > > > OK, have to continue tomorrow from here.
> > >
> > > Cheers, Thank you!
> > >
> > > Tamar
> > >
> > > >
> > > > Richard.
> > > >
> > > > > +      if (gimple_cond_lhs (cond) != phi_ssa
> > > > > +	  && gimple_cond_rhs (cond) != phi_ssa)
> > 
> > so this is a way to avoid touching the main IV?  Looks a bit fragile to
> > me.  Hmm, we're iterating over the main loop header PHIs here?
> > Can't you check, say, the relevancy of the PHI node instead?  Though
> > it might also be used as induction.  Can't it be used as alternate
> > exit like
> > 
> >   for (i)
> >    {
> >      if (i & bit)
> >        break;
> >    }
> > 
> > and would we need to adjust 'i' then?
> > 
> > > > > +	{
> > > > > +	  type = TREE_TYPE (gimple_phi_result (phi));
> > > > > +	  step_expr = STMT_VINFO_LOOP_PHI_EVOLUTION_PART (phi_info);
> > > > > +	  step_expr = unshare_expr (step_expr);
> > > > > +
> > > > > +	  /* We previously generated the new merged phi in the same BB as
> > > > the
> > > > > +	     guard.  So use that to perform the scaling on rather than the
> > > > > +	     normal loop phi which don't take the early breaks into account.  */
> > > > > +	  final_expr = gimple_phi_result (phi1);
> > > > > +	  init_expr = PHI_ARG_DEF_FROM_EDGE (gsi.phi (),
> > > > loop_preheader_edge (loop));
> > > > > +
> > > > > +	  tree stype = TREE_TYPE (step_expr);
> > > > > +	  /* For early break the final loop IV is:
> > > > > +	     init + (final - init) * vf which takes into account peeling
> > > > > +	     values and non-single steps.  */
> > > > > +	  off = fold_build2 (MINUS_EXPR, stype,
> > > > > +			     fold_convert (stype, final_expr),
> > > > > +			     fold_convert (stype, init_expr));
> > > > > +	  /* Now adjust for VF to get the final iteration value.  */
> > > > > +	  off = fold_build2 (MULT_EXPR, stype, off, build_int_cst (stype, vf));
> > > > > +
> > > > > +	  /* Adjust the value with the offset.  */
> > > > > +	  if (POINTER_TYPE_P (type))
> > > > > +	    ni = fold_build_pointer_plus (init_expr, off);
> > > > > +	  else
> > > > > +	    ni = fold_convert (type,
> > > > > +			       fold_build2 (PLUS_EXPR, stype,
> > > > > +					    fold_convert (stype, init_expr),
> > > > > +					    off));
> > > > > +	  var = create_tmp_var (type, "tmp");
> > 
> > so how does the non-early break code deal with updating inductions?
> > And how do you avoid altering this when we flow in from the normal
> > exit?  That is, you are updating the value in the epilog loop
> > header but don't you need to instead update the value only on
> > the alternate exit edges from the main loop (and keep the not
> > updated value on the main exit edge)?
> > 
> > > > > +	  last_gsi = gsi_last_bb (exit_bb);
> > > > > +	  gimple_seq new_stmts = NULL;
> > > > > +	  ni_name = force_gimple_operand (ni, &new_stmts, false, var);
> > > > > +	  /* Exit_bb shouldn't be empty.  */
> > > > > +	  if (!gsi_end_p (last_gsi))
> > > > > +	    gsi_insert_seq_after (&last_gsi, new_stmts, GSI_SAME_STMT);
> > > > > +	  else
> > > > > +	    gsi_insert_seq_before (&last_gsi, new_stmts, GSI_SAME_STMT);
> > > > > +
> > > > > +	  /* Fix phi expressions in the successor bb.  */
> > > > > +	  adjust_phi_and_debug_stmts (phi, update_e, ni_name);
> > > > > +	}
> > > > > +      else
> > > > > +	{
> > > > > +	  type = TREE_TYPE (gimple_phi_result (phi));
> > > > > +	  step_expr = STMT_VINFO_LOOP_PHI_EVOLUTION_PART (phi_info);
> > > > > +	  step_expr = unshare_expr (step_expr);
> > > > > +
> > > > > +	  /* We previously generated the new merged phi in the same BB as
> > > > the
> > > > > +	     guard.  So use that to perform the scaling on rather than the
> > > > > +	     normal loop phi which don't take the early breaks into account.  */
> > > > > +	  init_expr = PHI_ARG_DEF_FROM_EDGE (phi1, loop_preheader_edge
> > > > (loop));
> > > > > +	  tree stype = TREE_TYPE (step_expr);
> > > > > +
> > > > > +	  if (vf.is_constant ())
> > > > > +	    {
> > > > > +	      ni = fold_build2 (MULT_EXPR, stype,
> > > > > +				fold_convert (stype,
> > > > > +					      niters_vector),
> > > > > +				build_int_cst (stype, vf));
> > > > > +
> > > > > +	      ni = fold_build2 (MINUS_EXPR, stype,
> > > > > +				fold_convert (stype,
> > > > > +					      niters_orig),
> > > > > +				fold_convert (stype, ni));
> > > > > +	    }
> > > > > +	  else
> > > > > +	    /* If the loop's VF isn't constant then the loop must have been
> > > > > +	       masked, so at the end of the loop we know we have finished
> > > > > +	       the entire loop and found nothing.  */
> > > > > +	    ni = build_zero_cst (stype);
> > > > > +
> > > > > +	  ni = fold_convert (type, ni);
> > > > > +	  /* We don't support variable n in this version yet.  */
> > > > > +	  gcc_assert (TREE_CODE (ni) == INTEGER_CST);
> > > > > +
> > > > > +	  var = create_tmp_var (type, "tmp");
> > > > > +
> > > > > +	  last_gsi = gsi_last_bb (exit_bb);
> > > > > +	  gimple_seq new_stmts = NULL;
> > > > > +	  ni_name = force_gimple_operand (ni, &new_stmts, false, var);
> > > > > +	  /* Exit_bb shouldn't be empty.  */
> > > > > +	  if (!gsi_end_p (last_gsi))
> > > > > +	    gsi_insert_seq_after (&last_gsi, new_stmts, GSI_SAME_STMT);
> > > > > +	  else
> > > > > +	    gsi_insert_seq_before (&last_gsi, new_stmts, GSI_SAME_STMT);
> > > > > +
> > > > > +	  adjust_phi_and_debug_stmts (phi1, loop_iv, ni_name);
> > > > > +
> > > > > +	  for (edge exit : alt_exits)
> > > > > +	    adjust_phi_and_debug_stmts (phi1, exit,
> > > > > +					build_int_cst (TREE_TYPE (step_expr),
> > > > > +						       vf));
> > > > > +	  ivtmp = gimple_phi_result (phi1);
> > > > > +	}
> > > > > +    }
> > > > > +
> > > > > +  return ivtmp;
> > > > >  }
> > > > >
> > > > >  /* Return a gimple value containing the misalignment (measured in
> > vector
> > > > > @@ -2632,137 +2989,34 @@ vect_gen_vector_loop_niters_mult_vf
> > > > (loop_vec_info loop_vinfo,
> > > > >
> > > > >  /* LCSSA_PHI is a lcssa phi of EPILOG loop which is copied from LOOP,
> > > > >     this function searches for the corresponding lcssa phi node in exit
> > > > > -   bb of LOOP.  If it is found, return the phi result; otherwise return
> > > > > -   NULL.  */
> > > > > +   bb of LOOP following the LCSSA_EDGE to the exit node.  If it is found,
> > > > > +   return the phi result; otherwise return NULL.  */
> > > > >
> > > > >  static tree
> > > > >  find_guard_arg (class loop *loop, class loop *epilog
> > ATTRIBUTE_UNUSED,
> > > > > -		gphi *lcssa_phi)
> > > > > +		gphi *lcssa_phi, int lcssa_edge = 0)
> > > > >  {
> > > > >    gphi_iterator gsi;
> > > > >    edge e = loop->vec_loop_iv;
> > > > >
> > > > > -  gcc_assert (single_pred_p (e->dest));
> > > > >    for (gsi = gsi_start_phis (e->dest); !gsi_end_p (gsi); gsi_next (&gsi))
> > > > >      {
> > > > >        gphi *phi = gsi.phi ();
> > > > > -      if (operand_equal_p (PHI_ARG_DEF (phi, 0),
> > > > > -			   PHI_ARG_DEF (lcssa_phi, 0), 0))
> > > > > -	return PHI_RESULT (phi);
> > > > > -    }
> > > > > -  return NULL_TREE;
> > > > > -}
> > > > > -
> > > > > -/* Function slpeel_tree_duplicate_loop_to_edge_cfg duplciates
> > > > FIRST/SECOND
> > > > > -   from SECOND/FIRST and puts it at the original loop's preheader/exit
> > > > > -   edge, the two loops are arranged as below:
> > > > > -
> > > > > -       preheader_a:
> > > > > -     first_loop:
> > > > > -       header_a:
> > > > > -	 i_1 = PHI<i_0, i_2>;
> > > > > -	 ...
> > > > > -	 i_2 = i_1 + 1;
> > > > > -	 if (cond_a)
> > > > > -	   goto latch_a;
> > > > > -	 else
> > > > > -	   goto between_bb;
> > > > > -       latch_a:
> > > > > -	 goto header_a;
> > > > > -
> > > > > -       between_bb:
> > > > > -	 ;; i_x = PHI<i_2>;   ;; LCSSA phi node to be created for FIRST,
> > > > > -
> > > > > -     second_loop:
> > > > > -       header_b:
> > > > > -	 i_3 = PHI<i_0, i_4>; ;; Use of i_0 to be replaced with i_x,
> > > > > -				 or with i_2 if no LCSSA phi is created
> > > > > -				 under condition of
> > > > CREATE_LCSSA_FOR_IV_PHIS.
> > > > > -	 ...
> > > > > -	 i_4 = i_3 + 1;
> > > > > -	 if (cond_b)
> > > > > -	   goto latch_b;
> > > > > -	 else
> > > > > -	   goto exit_bb;
> > > > > -       latch_b:
> > > > > -	 goto header_b;
> > > > > -
> > > > > -       exit_bb:
> > > > > -
> > > > > -   This function creates loop closed SSA for the first loop; update the
> > > > > -   second loop's PHI nodes by replacing argument on incoming edge with
> > the
> > > > > -   result of newly created lcssa PHI nodes.  IF
> > CREATE_LCSSA_FOR_IV_PHIS
> > > > > -   is false, Loop closed ssa phis will only be created for non-iv phis for
> > > > > -   the first loop.
> > > > > -
> > > > > -   This function assumes exit bb of the first loop is preheader bb of the
> > > > > -   second loop, i.e, between_bb in the example code.  With PHIs updated,
> > > > > -   the second loop will execute rest iterations of the first.  */
> > > > > -
> > > > > -static void
> > > > > -slpeel_update_phi_nodes_for_loops (loop_vec_info loop_vinfo,
> > > > > -				   class loop *first, class loop *second,
> > > > > -				   bool create_lcssa_for_iv_phis)
> > > > > -{
> > > > > -  gphi_iterator gsi_update, gsi_orig;
> > > > > -  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > > > > -
> > > > > -  edge first_latch_e = EDGE_SUCC (first->latch, 0);
> > > > > -  edge second_preheader_e = loop_preheader_edge (second);
> > > > > -  basic_block between_bb = single_exit (first)->dest;
> > > > > -
> > > > > -  gcc_assert (between_bb == second_preheader_e->src);
> > > > > -  gcc_assert (single_pred_p (between_bb) && single_succ_p
> > (between_bb));
> > > > > -  /* Either the first loop or the second is the loop to be vectorized.  */
> > > > > -  gcc_assert (loop == first || loop == second);
> > > > > -
> > > > > -  for (gsi_orig = gsi_start_phis (first->header),
> > > > > -       gsi_update = gsi_start_phis (second->header);
> > > > > -       !gsi_end_p (gsi_orig) && !gsi_end_p (gsi_update);
> > > > > -       gsi_next (&gsi_orig), gsi_next (&gsi_update))
> > > > > -    {
> > > > > -      gphi *orig_phi = gsi_orig.phi ();
> > > > > -      gphi *update_phi = gsi_update.phi ();
> > > > > -
> > > > > -      tree arg = PHI_ARG_DEF_FROM_EDGE (orig_phi, first_latch_e);
> > > > > -      /* Generate lcssa PHI node for the first loop.  */
> > > > > -      gphi *vect_phi = (loop == first) ? orig_phi : update_phi;
> > > > > -      stmt_vec_info vect_phi_info = loop_vinfo->lookup_stmt (vect_phi);
> > > > > -      if (create_lcssa_for_iv_phis || !iv_phi_p (vect_phi_info))
> > > > > +      /* Nested loops with multiple exits can have different no# phi node
> > > > > +	 arguments between the main loop and epilog as epilog falls to the
> > > > > +	 second loop.  */
> > > > > +      if (gimple_phi_num_args (phi) > e->dest_idx)
> > > > >  	{
> > > > > -	  tree new_res = copy_ssa_name (PHI_RESULT (orig_phi));
> > > > > -	  gphi *lcssa_phi = create_phi_node (new_res, between_bb);
> > > > > -	  add_phi_arg (lcssa_phi, arg, single_exit (first),
> > > > UNKNOWN_LOCATION);
> > > > > -	  arg = new_res;
> > > > > -	}
> > > > > -
> > > > > -      /* Update PHI node in the second loop by replacing arg on the loop's
> > > > > -	 incoming edge.  */
> > > > > -      adjust_phi_and_debug_stmts (update_phi, second_preheader_e,
> > arg);
> > > > > -    }
> > > > > -
> > > > > -  /* For epilogue peeling we have to make sure to copy all LC PHIs
> > > > > -     for correct vectorization of live stmts.  */
> > > > > -  if (loop == first)
> > > > > -    {
> > > > > -      basic_block orig_exit = single_exit (second)->dest;
> > > > > -      for (gsi_orig = gsi_start_phis (orig_exit);
> > > > > -	   !gsi_end_p (gsi_orig); gsi_next (&gsi_orig))
> > > > > -	{
> > > > > -	  gphi *orig_phi = gsi_orig.phi ();
> > > > > -	  tree orig_arg = PHI_ARG_DEF (orig_phi, 0);
> > > > > -	  if (TREE_CODE (orig_arg) != SSA_NAME || virtual_operand_p
> > > > (orig_arg))
> > > > > -	    continue;
> > > > > -
> > > > > -	  /* Already created in the above loop.   */
> > > > > -	  if (find_guard_arg (first, second, orig_phi))
> > > > > +	  tree var = PHI_ARG_DEF (phi, e->dest_idx);
> > > > > +	  if (TREE_CODE (var) != SSA_NAME)
> > > > >  	    continue;
> > > > >
> > > > > -	  tree new_res = copy_ssa_name (orig_arg);
> > > > > -	  gphi *lcphi = create_phi_node (new_res, between_bb);
> > > > > -	  add_phi_arg (lcphi, orig_arg, single_exit (first),
> > > > UNKNOWN_LOCATION);
> > > > > +	  if (operand_equal_p (get_current_def (var),
> > > > > +			       PHI_ARG_DEF (lcssa_phi, lcssa_edge), 0))
> > > > > +	    return PHI_RESULT (phi);
> > > > >  	}
> > > > >      }
> > > > > +  return NULL_TREE;
> > > > >  }
> > > > >
> > > > >  /* Function slpeel_add_loop_guard adds guard skipping from the
> > beginning
> > > > > @@ -2910,13 +3164,11 @@ slpeel_update_phi_nodes_for_guard2
> > (class
> > > > loop *loop, class loop *epilog,
> > > > >    gcc_assert (single_succ_p (merge_bb));
> > > > >    edge e = single_succ_edge (merge_bb);
> > > > >    basic_block exit_bb = e->dest;
> > > > > -  gcc_assert (single_pred_p (exit_bb));
> > > > > -  gcc_assert (single_pred (exit_bb) == single_exit (epilog)->dest);
> > > > >
> > > > >    for (gsi = gsi_start_phis (exit_bb); !gsi_end_p (gsi); gsi_next (&gsi))
> > > > >      {
> > > > >        gphi *update_phi = gsi.phi ();
> > > > > -      tree old_arg = PHI_ARG_DEF (update_phi, 0);
> > > > > +      tree old_arg = PHI_ARG_DEF (update_phi, e->dest_idx);
> > > > >
> > > > >        tree merge_arg = NULL_TREE;
> > > > >
> > > > > @@ -2928,7 +3180,7 @@ slpeel_update_phi_nodes_for_guard2 (class
> > loop
> > > > *loop, class loop *epilog,
> > > > >        if (!merge_arg)
> > > > >  	merge_arg = old_arg;
> > > > >
> > > > > -      tree guard_arg = find_guard_arg (loop, epilog, update_phi);
> > > > > +      tree guard_arg = find_guard_arg (loop, epilog, update_phi, e-
> > >dest_idx);
> > > > >        /* If the var is live after loop but not a reduction, we simply
> > > > >  	 use the old arg.  */
> > > > >        if (!guard_arg)
> > > > > @@ -2948,21 +3200,6 @@ slpeel_update_phi_nodes_for_guard2 (class
> > > > loop *loop, class loop *epilog,
> > > > >      }
> > > > >  }
> > > > >
> > > > > -/* EPILOG loop is duplicated from the original loop for vectorizing,
> > > > > -   the arg of its loop closed ssa PHI needs to be updated.  */
> > > > > -
> > > > > -static void
> > > > > -slpeel_update_phi_nodes_for_lcssa (class loop *epilog)
> > > > > -{
> > > > > -  gphi_iterator gsi;
> > > > > -  basic_block exit_bb = single_exit (epilog)->dest;
> > > > > -
> > > > > -  gcc_assert (single_pred_p (exit_bb));
> > > > > -  edge e = EDGE_PRED (exit_bb, 0);
> > > > > -  for (gsi = gsi_start_phis (exit_bb); !gsi_end_p (gsi); gsi_next (&gsi))
> > > > > -    rename_use_op (PHI_ARG_DEF_PTR_FROM_EDGE (gsi.phi (), e));
> > > > > -}
> > > > > -
> > 
> > I wonder if we can still split these changes out to before early break
> > vect?
> > 
> > > > >  /* EPILOGUE_VINFO is an epilogue loop that we now know would need
> > to
> > > > >     iterate exactly CONST_NITERS times.  Make a final decision about
> > > > >     whether the epilogue loop should be used, returning true if so.  */
> > > > > @@ -3138,6 +3375,14 @@ vect_do_peeling (loop_vec_info loop_vinfo,
> > tree
> > > > niters, tree nitersm1,
> > > > >      bound_epilog += vf - 1;
> > > > >    if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
> > > > >      bound_epilog += 1;
> > > > > +  /* For early breaks the scalar loop needs to execute at most VF times
> > > > > +     to find the element that caused the break.  */
> > > > > +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > > +    {
> > > > > +      bound_epilog = vf;
> > > > > +      /* Force a scalar epilogue as we can't vectorize the index finding.  */
> > > > > +      vect_epilogues = false;
> > > > > +    }
> > > > >    bool epilog_peeling = maybe_ne (bound_epilog, 0U);
> > > > >    poly_uint64 bound_scalar = bound_epilog;
> > > > >
> > > > > @@ -3297,16 +3542,24 @@ vect_do_peeling (loop_vec_info
> > loop_vinfo,
> > > > tree niters, tree nitersm1,
> > > > >  				  bound_prolog + bound_epilog)
> > > > >  		      : (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
> > > > >  			 || vect_epilogues));
> > > > > +
> > > > > +  /* We only support early break vectorization on known bounds at this
> > > > time.
> > > > > +     This means that if the vector loop can't be entered then we won't
> > > > generate
> > > > > +     it at all.  So for now force skip_vector off because the additional
> > control
> > > > > +     flow messes with the BB exits and we've already analyzed them.  */
> > > > > + skip_vector = skip_vector && !LOOP_VINFO_EARLY_BREAKS
> > (loop_vinfo);
> > > > > +
> > 
> > I think it should be as "easy" as entering the epilog via the block taking
> > the regular exit?
> > 
> > > > >    /* Epilog loop must be executed if the number of iterations for epilog
> > > > >       loop is known at compile time, otherwise we need to add a check at
> > > > >       the end of vector loop and skip to the end of epilog loop.  */
> > > > >    bool skip_epilog = (prolog_peeling < 0
> > > > >  		      || !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > > > >  		      || !vf.is_constant ());
> > > > > -  /* PEELING_FOR_GAPS is special because epilog loop must be executed.
> > */
> > > > > -  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
> > > > > +  /* PEELING_FOR_GAPS and peeling for early breaks are special because
> > > > epilog
> > > > > +     loop must be executed.  */
> > > > > +  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> > > > > +      || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > >      skip_epilog = false;
> > > > > -
> > > > >    class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
> > > > >    auto_vec<profile_count> original_counts;
> > > > >    basic_block *original_bbs = NULL;
> > > > > @@ -3344,13 +3597,13 @@ vect_do_peeling (loop_vec_info
> > loop_vinfo,
> > > > tree niters, tree nitersm1,
> > > > >    if (prolog_peeling)
> > > > >      {
> > > > >        e = loop_preheader_edge (loop);
> > > > > -      gcc_checking_assert (slpeel_can_duplicate_loop_p (loop, e));
> > > > > -
> > > > > +      gcc_checking_assert (slpeel_can_duplicate_loop_p (loop_vinfo, e));
> > > > >        /* Peel prolog and put it on preheader edge of loop.  */
> > > > > -      prolog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, scalar_loop,
> > e);
> > > > > +      prolog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, scalar_loop,
> > e,
> > > > > +						       true);
> > > > >        gcc_assert (prolog);
> > > > >        prolog->force_vectorize = false;
> > > > > -      slpeel_update_phi_nodes_for_loops (loop_vinfo, prolog, loop, true);
> > > > > +
> > > > >        first_loop = prolog;
> > > > >        reset_original_copy_tables ();
> > > > >
> > > > > @@ -3420,11 +3673,12 @@ vect_do_peeling (loop_vec_info
> > loop_vinfo,
> > > > tree niters, tree nitersm1,
> > > > >  	 as the transformations mentioned above make less or no sense when
> > > > not
> > > > >  	 vectorizing.  */
> > > > >        epilog = vect_epilogues ? get_loop_copy (loop) : scalar_loop;
> > > > > -      epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, epilog, e);
> > > > > +      auto_vec<basic_block> doms;
> > > > > +      epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, epilog, e,
> > true,
> > > > > +						       &doms);
> > > > >        gcc_assert (epilog);
> > > > >
> > > > >        epilog->force_vectorize = false;
> > > > > -      slpeel_update_phi_nodes_for_loops (loop_vinfo, loop, epilog, false);
> > > > >
> > > > >        /* Scalar version loop may be preferred.  In this case, add guard
> > > > >  	 and skip to epilog.  Note this only happens when the number of
> > > > > @@ -3496,6 +3750,54 @@ vect_do_peeling (loop_vec_info loop_vinfo,
> > tree
> > > > niters, tree nitersm1,
> > > > >        vect_update_ivs_after_vectorizer (loop_vinfo, niters_vector_mult_vf,
> > > > >  					update_e);
> > > > >
> > > > > +      /* For early breaks we must create a guard to check how many
> > iterations
> > > > > +	 of the scalar loop are yet to be performed.  */
> > 
> > We have this check anyway, no?  In fact don't we know that we always enter
> > the epilog (see above)?
> > 
> > > > > +      if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > > +	{
> > > > > +	  tree ivtmp =
> > > > > +	    vect_update_ivs_after_early_break (loop_vinfo, epilog, vf, niters,
> > > > > +					       *niters_vector, update_e);
> > > > > +
> > > > > +	  gcc_assert (ivtmp);
> > > > > +	  tree guard_cond = fold_build2 (EQ_EXPR, boolean_type_node,
> > > > > +					 fold_convert (TREE_TYPE (niters),
> > > > > +						       ivtmp),
> > > > > +					 build_zero_cst (TREE_TYPE (niters)));
> > > > > +	  basic_block guard_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
> > > > > +
> > > > > +	  /* If we had a fallthrough edge, the guard will the threaded through
> > > > > +	     and so we may need to find the actual final edge.  */
> > > > > +	  edge final_edge = epilog->vec_loop_iv;
> > > > > +	  /* slpeel_update_phi_nodes_for_guard2 expects an empty block in
> > > > > +	     between the guard and the exit edge.  It only adds new nodes and
> > > > > +	     doesn't update existing one in the current scheme.  */
> > > > > +	  basic_block guard_to = split_edge (final_edge);
> > > > > +	  edge guard_e = slpeel_add_loop_guard (guard_bb, guard_cond,
> > > > guard_to,
> > > > > +						guard_bb, prob_epilog.invert
> > > > (),
> > > > > +						irred_flag);
> > > > > +	  doms.safe_push (guard_bb);
> > > > > +
> > > > > +	  iterate_fix_dominators (CDI_DOMINATORS, doms, false);
> > > > > +
> > > > > +	  /* We must update all the edges from the new guard_bb.  */
> > > > > +	  slpeel_update_phi_nodes_for_guard2 (loop, epilog, guard_e,
> > > > > +					      final_edge);
> > > > > +
> > > > > +	  /* If the loop was versioned we'll have an intermediate BB between
> > > > > +	     the guard and the exit.  This intermediate block is required
> > > > > +	     because in the current scheme of things the guard block phi
> > > > > +	     updating can only maintain LCSSA by creating new blocks.  In this
> > > > > +	     case we just need to update the uses in this block as well.  */
> > > > > +	  if (loop != scalar_loop)
> > > > > +	    {
> > > > > +	      for (gphi_iterator gsi = gsi_start_phis (guard_to);
> > > > > +		   !gsi_end_p (gsi); gsi_next (&gsi))
> > > > > +		rename_use_op (PHI_ARG_DEF_PTR_FROM_EDGE (gsi.phi (),
> > > > guard_e));
> > > > > +	    }
> > > > > +
> > > > > +	  flush_pending_stmts (guard_e);
> > > > > +	}
> > > > > +
> > > > >        if (skip_epilog)
> > > > >  	{
> > > > >  	  guard_cond = fold_build2 (EQ_EXPR, boolean_type_node,
> > > > > @@ -3520,8 +3822,6 @@ vect_do_peeling (loop_vec_info loop_vinfo,
> > tree
> > > > niters, tree nitersm1,
> > > > >  	    }
> > > > >  	  scale_loop_profile (epilog, prob_epilog, 0);
> > > > >  	}
> > > > > -      else
> > > > > -	slpeel_update_phi_nodes_for_lcssa (epilog);
> > > > >
> > > > >        unsigned HOST_WIDE_INT bound;
> > > > >        if (bound_scalar.is_constant (&bound))
> > > > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > > > > index
> > > >
> > b4a98de80aa39057fc9b17977dd0e347b4f0fb5d..ab9a2048186f461f5ec49
> > > > f21421958e7ee25eada 100644
> > > > > --- a/gcc/tree-vect-loop.cc
> > > > > +++ b/gcc/tree-vect-loop.cc
> > > > > @@ -1007,6 +1007,8 @@ _loop_vec_info::_loop_vec_info (class loop
> > > > *loop_in, vec_info_shared *shared)
> > > > >      partial_load_store_bias (0),
> > > > >      peeling_for_gaps (false),
> > > > >      peeling_for_niter (false),
> > > > > +    early_breaks (false),
> > > > > +    non_break_control_flow (false),
> > > > >      no_data_dependencies (false),
> > > > >      has_mask_store (false),
> > > > >      scalar_loop_scaling (profile_probability::uninitialized ()),
> > > > > @@ -1199,6 +1201,14 @@ vect_need_peeling_or_partial_vectors_p
> > > > (loop_vec_info loop_vinfo)
> > > > >      th = LOOP_VINFO_COST_MODEL_THRESHOLD
> > > > (LOOP_VINFO_ORIG_LOOP_INFO
> > > > >  					  (loop_vinfo));
> > > > >
> > > > > +  /* When we have multiple exits and VF is unknown, we must require
> > > > partial
> > > > > +     vectors because the loop bounds is not a minimum but a maximum.
> > > > That is to
> > > > > +     say we cannot unpredicate the main loop unless we peel or use partial
> > > > > +     vectors in the epilogue.  */
> > > > > +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
> > > > > +      && !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ())
> > > > > +    return true;
> > > > > +
> > > > >    if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > > > >        && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) >= 0)
> > > > >      {
> > > > > @@ -1652,12 +1662,12 @@ vect_compute_single_scalar_iteration_cost
> > > > (loop_vec_info loop_vinfo)
> > > > >    loop_vinfo->scalar_costs->finish_cost (nullptr);
> > > > >  }
> > > > >
> > > > > -
> > > > >  /* Function vect_analyze_loop_form.
> > > > >
> > > > >     Verify that certain CFG restrictions hold, including:
> > > > >     - the loop has a pre-header
> > > > > -   - the loop has a single entry and exit
> > > > > +   - the loop has a single entry
> > > > > +   - nested loops can have only a single exit.
> > > > >     - the loop exit condition is simple enough
> > > > >     - the number of iterations can be analyzed, i.e, a countable loop.  The
> > > > >       niter could be analyzed under some assumptions.  */
> > > > > @@ -1693,11 +1703,6 @@ vect_analyze_loop_form (class loop *loop,
> > > > vect_loop_form_info *info)
> > > > >                             |
> > > > >                          (exit-bb)  */
> > > > >
> > > > > -      if (loop->num_nodes != 2)
> > > > > -	return opt_result::failure_at (vect_location,
> > > > > -				       "not vectorized:"
> > > > > -				       " control flow in loop.\n");
> > > > > -
> > > > >        if (empty_block_p (loop->header))
> > > > >  	return opt_result::failure_at (vect_location,
> > > > >  				       "not vectorized: empty loop.\n");
> > > > > @@ -1768,11 +1773,13 @@ vect_analyze_loop_form (class loop *loop,
> > > > vect_loop_form_info *info)
> > > > >          dump_printf_loc (MSG_NOTE, vect_location,
> > > > >  			 "Considering outer-loop vectorization.\n");
> > > > >        info->inner_loop_cond = inner.loop_cond;
> > > > > +
> > > > > +      if (!single_exit (loop))
> > > > > +	return opt_result::failure_at (vect_location,
> > > > > +				       "not vectorized: multiple exits.\n");
> > > > > +
> > > > >      }
> > > > >
> > > > > -  if (!single_exit (loop))
> > > > > -    return opt_result::failure_at (vect_location,
> > > > > -				   "not vectorized: multiple exits.\n");
> > > > >    if (EDGE_COUNT (loop->header->preds) != 2)
> > > > >      return opt_result::failure_at (vect_location,
> > > > >  				   "not vectorized:"
> > > > > @@ -1788,11 +1795,36 @@ vect_analyze_loop_form (class loop *loop,
> > > > vect_loop_form_info *info)
> > > > >  				   "not vectorized: latch block not empty.\n");
> > > > >
> > > > >    /* Make sure the exit is not abnormal.  */
> > > > > -  edge e = single_exit (loop);
> > > > > -  if (e->flags & EDGE_ABNORMAL)
> > > > > -    return opt_result::failure_at (vect_location,
> > > > > -				   "not vectorized:"
> > > > > -				   " abnormal loop exit edge.\n");
> > > > > +  auto_vec<edge> exits = get_loop_exit_edges (loop);
> > > > > +  edge nexit = loop->vec_loop_iv;
> > > > > +  for (edge e : exits)
> > > > > +    {
> > > > > +      if (e->flags & EDGE_ABNORMAL)
> > > > > +	return opt_result::failure_at (vect_location,
> > > > > +				       "not vectorized:"
> > > > > +				       " abnormal loop exit edge.\n");
> > > > > +      /* Early break BB must be after the main exit BB.  In theory we should
> > > > > +	 be able to vectorize the inverse order, but the current flow in the
> > > > > +	 the vectorizer always assumes you update successor PHI nodes, not
> > > > > +	 preds.  */
> > > > > +      if (e != nexit && !dominated_by_p (CDI_DOMINATORS, nexit->src, e-
> > > > >src))
> > > > > +	return opt_result::failure_at (vect_location,
> > > > > +				       "not vectorized:"
> > > > > +				       " abnormal loop exit edge order.\n");
> > 
> > "unsupported loop exit order", but I don't understand the comment.
> > 
> > > > > +    }
> > > > > +
> > > > > +  /* We currently only support early exit loops with known bounds.   */
> > 
> > Btw, why's that?  Is that because we don't support the loop-around edge?
> > IMHO this is the most serious limitation (and as said above it should be
> > trivial to fix).
> > 
> > > > > +  if (exits.length () > 1)
> > > > > +    {
> > > > > +      class tree_niter_desc niter;
> > > > > +      if (!number_of_iterations_exit_assumptions (loop, nexit, &niter,
> > NULL)
> > > > > +	  || chrec_contains_undetermined (niter.niter)
> > > > > +	  || !evolution_function_is_constant_p (niter.niter))
> > > > > +	return opt_result::failure_at (vect_location,
> > > > > +				       "not vectorized:"
> > > > > +				       " early breaks only supported on loops"
> > > > > +				       " with known iteration bounds.\n");
> > > > > +    }
> > > > >
> > > > >    info->conds
> > > > >      = vect_get_loop_niters (loop, &info->assumptions,
> > > > > @@ -1866,6 +1898,10 @@ vect_create_loop_vinfo (class loop *loop,
> > > > vec_info_shared *shared,
> > > > >    LOOP_VINFO_LOOP_CONDS (loop_vinfo).safe_splice (info-
> > > > >alt_loop_conds);
> > > > >    LOOP_VINFO_LOOP_IV_COND (loop_vinfo) = info->loop_cond;
> > > > >
> > > > > +  /* Check to see if we're vectorizing multiple exits.  */
> > > > > +  LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
> > > > > +    = !LOOP_VINFO_LOOP_CONDS (loop_vinfo).is_empty ();
> > > > > +
> > > > >    if (info->inner_loop_cond)
> > > > >      {
> > > > >        stmt_vec_info inner_loop_cond_info
> > > > > @@ -3070,7 +3106,8 @@ start_over:
> > > > >
> > > > >    /* If an epilogue loop is required make sure we can create one.  */
> > > > >    if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> > > > > -      || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo))
> > > > > +      || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)
> > > > > +      || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > >      {
> > > > >        if (dump_enabled_p ())
> > > > >          dump_printf_loc (MSG_NOTE, vect_location, "epilog loop
> > required\n");
> > > > > @@ -5797,7 +5834,7 @@ vect_create_epilog_for_reduction
> > (loop_vec_info
> > > > loop_vinfo,
> > > > >    basic_block exit_bb;
> > > > >    tree scalar_dest;
> > > > >    tree scalar_type;
> > > > > -  gimple *new_phi = NULL, *phi;
> > > > > +  gimple *new_phi = NULL, *phi = NULL;
> > > > >    gimple_stmt_iterator exit_gsi;
> > > > >    tree new_temp = NULL_TREE, new_name, new_scalar_dest;
> > > > >    gimple *epilog_stmt = NULL;
> > > > > @@ -6039,6 +6076,33 @@ vect_create_epilog_for_reduction
> > > > (loop_vec_info loop_vinfo,
> > > > >  	  new_def = gimple_convert (&stmts, vectype, new_def);
> > > > >  	  reduc_inputs.quick_push (new_def);
> > > > >  	}
> > > > > +
> > > > > +	/* Update the other exits.  */
> > > > > +	if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > > +	  {
> > > > > +	    vec<edge> alt_exits = LOOP_VINFO_ALT_EXITS (loop_vinfo);
> > > > > +	    gphi_iterator gsi, gsi1;
> > > > > +	    for (edge exit : alt_exits)
> > > > > +	      {
> > > > > +		/* Find the phi node to propaget into the exit block for each
> > > > > +		   exit edge.  */
> > > > > +		for (gsi = gsi_start_phis (exit_bb),
> > > > > +		     gsi1 = gsi_start_phis (exit->src);
> > 
> > exit->src == loop->header, right?  I think this won't work for multiple
> > alternate exits.  It's probably easier to do this where we create the
> > LC PHI node for the reduction result?
> > 
> > > > > +		     !gsi_end_p (gsi) && !gsi_end_p (gsi1);
> > > > > +		     gsi_next (&gsi), gsi_next (&gsi1))
> > > > > +		  {
> > > > > +		    /* There really should be a function to just get the number
> > > > > +		       of phis inside a bb.  */
> > > > > +		    if (phi && phi == gsi.phi ())
> > > > > +		      {
> > > > > +			gphi *phi1 = gsi1.phi ();
> > > > > +			SET_PHI_ARG_DEF (phi, exit->dest_idx,
> > > > > +					 PHI_RESULT (phi1));
> > 
> > I think we know the header PHI of a reduction perfectly well, there
> > shouldn't be the need to "search" for it.
> > 
> > > > > +			break;
> > > > > +		      }
> > > > > +		  }
> > > > > +	      }
> > > > > +	  }
> > > > >        gsi_insert_seq_before (&exit_gsi, stmts, GSI_SAME_STMT);
> > > > >      }
> > > > >
> > > > > @@ -10355,6 +10419,13 @@ vectorizable_live_operation (vec_info
> > *vinfo,
> > > > >  	   new_tree = lane_extract <vec_lhs', ...>;
> > > > >  	   lhs' = new_tree;  */
> > > > >
> > > > > +      /* When vectorizing an early break, any live statement that is used
> > > > > +	 outside of the loop are dead.  The loop will never get to them.
> > > > > +	 We could change the liveness value during analysis instead but since
> > > > > +	 the below code is invalid anyway just ignore it during codegen.  */
> > > > > +      if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > > +	return true;
> > 
> > But what about the value that's live across the main exit when the
> > epilogue is not entered?
> > 
> > > > > +
> > > > >        class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > > > >        basic_block exit_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
> > > > >        gcc_assert (single_pred_p (exit_bb));
> > > > > @@ -11277,7 +11348,7 @@ vect_transform_loop (loop_vec_info
> > > > loop_vinfo, gimple *loop_vectorized_call)
> > > > >    /* Make sure there exists a single-predecessor exit bb.  Do this before
> > > > >       versioning.   */
> > > > >    edge e = LOOP_VINFO_IV_EXIT (loop_vinfo);
> > > > > -  if (! single_pred_p (e->dest))
> > > > > +  if (e && ! single_pred_p (e->dest) && !LOOP_VINFO_EARLY_BREAKS
> > > > (loop_vinfo))
> > 
> > e can be NULL here?  I think we should reject such loops earlier.
> > 
> > > > >      {
> > > > >        split_loop_exit_edge (e, true);
> > > > >        if (dump_enabled_p ())
> > > > > @@ -11303,7 +11374,7 @@ vect_transform_loop (loop_vec_info
> > > > loop_vinfo, gimple *loop_vectorized_call)
> > > > >    if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo))
> > > > >      {
> > > > >        e = single_exit (LOOP_VINFO_SCALAR_LOOP (loop_vinfo));
> > > > > -      if (! single_pred_p (e->dest))
> > > > > +      if (e && ! single_pred_p (e->dest))
> > > > >  	{
> > > > >  	  split_loop_exit_edge (e, true);
> > > > >  	  if (dump_enabled_p ())
> > > > > @@ -11641,7 +11712,8 @@ vect_transform_loop (loop_vec_info
> > > > loop_vinfo, gimple *loop_vectorized_call)
> > > > >
> > > > >    /* Loops vectorized with a variable factor won't benefit from
> > > > >       unrolling/peeling.  */
> > 
> > update the comment?  Why would we unroll a VLA loop with early breaks?
> > Or did you mean to use || LOOP_VINFO_EARLY_BREAKS (loop_vinfo)?
> > 
> > > > > -  if (!vf.is_constant ())
> > > > > +  if (!vf.is_constant ()
> > > > > +      && !LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > >      {
> > > > >        loop->unroll = 1;
> > > > >        if (dump_enabled_p ())
> > > > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > > > > index
> > > >
> > 87c4353fa5180fcb7f60b192897456cf24f3fdbe..03524e8500ee06df42f82af
> > > > e78ee2a7c627be45b 100644
> > > > > --- a/gcc/tree-vect-stmts.cc
> > > > > +++ b/gcc/tree-vect-stmts.cc
> > > > > @@ -344,9 +344,34 @@ vect_stmt_relevant_p (stmt_vec_info
> > stmt_info,
> > > > loop_vec_info loop_vinfo,
> > > > >    *live_p = false;
> > > > >
> > > > >    /* cond stmt other than loop exit cond.  */
> > > > > -  if (is_ctrl_stmt (stmt_info->stmt)
> > > > > -      && STMT_VINFO_TYPE (stmt_info) != loop_exit_ctrl_vec_info_type)
> > > > > -    *relevant = vect_used_in_scope;
> > 
> > how was that ever hit before?  For outer loop processing with outer loop
> > vectorization?
> > 
> > > > > +  if (is_ctrl_stmt (stmt_info->stmt))
> > > > > +    {
> > > > > +      /* Ideally EDGE_LOOP_EXIT would have been set on the exit edge,
> > but
> > > > > +	 it looks like loop_manip doesn't do that..  So we have to do it
> > > > > +	 the hard way.  */
> > > > > +      basic_block bb = gimple_bb (stmt_info->stmt);
> > > > > +      bool exit_bb = false, early_exit = false;
> > > > > +      edge_iterator ei;
> > > > > +      edge e;
> > > > > +      FOR_EACH_EDGE (e, ei, bb->succs)
> > > > > +        if (!flow_bb_inside_loop_p (loop, e->dest))
> > > > > +	  {
> > > > > +	    exit_bb = true;
> > > > > +	    early_exit = loop->vec_loop_iv->src != bb;
> > > > > +	    break;
> > > > > +	  }
> > > > > +
> > > > > +      /* We should have processed any exit edge, so an edge not an early
> > > > > +	 break must be a loop IV edge.  We need to distinguish between the
> > > > > +	 two as we don't want to generate code for the main loop IV.  */
> > > > > +      if (exit_bb)
> > > > > +	{
> > > > > +	  if (early_exit)
> > > > > +	    *relevant = vect_used_in_scope;
> > > > > +	}
> > 
> > I wonder why you can't simply do
> > 
> >          if (is_ctrl_stmt (stmt_info->stmt)
> >              && stmt_info->stmt != LOOP_VINFO_COND (loop_info))
> > 
> > ?
> > 
> > > > > +      else if (bb->loop_father == loop)
> > > > > +	LOOP_VINFO_GENERAL_CTR_FLOW (loop_vinfo) = true;
> > 
> > so for control flow not exiting the loop you can check
> > loop_exits_from_bb_p ().
> > 
> > > > > +    }
> > > > >
> > > > >    /* changing memory.  */
> > > > >    if (gimple_code (stmt_info->stmt) != GIMPLE_PHI)
> > > > > @@ -359,6 +384,11 @@ vect_stmt_relevant_p (stmt_vec_info
> > stmt_info,
> > > > loop_vec_info loop_vinfo,
> > > > >  	*relevant = vect_used_in_scope;
> > > > >        }
> > > > >
> > > > > +  auto_vec<edge> exits = get_loop_exit_edges (loop);
> > > > > +  auto_bitmap exit_bbs;
> > > > > +  for (edge exit : exits)
> > > > > +    bitmap_set_bit (exit_bbs, exit->dest->index);
> > > > > +
> > > > >    /* uses outside the loop.  */
> > > > >    FOR_EACH_PHI_OR_STMT_DEF (def_p, stmt_info->stmt, op_iter,
> > > > SSA_OP_DEF)
> > > > >      {
> > > > > @@ -377,7 +407,7 @@ vect_stmt_relevant_p (stmt_vec_info stmt_info,
> > > > loop_vec_info loop_vinfo,
> > > > >  	      /* We expect all such uses to be in the loop exit phis
> > > > >  		 (because of loop closed form)   */
> > > > >  	      gcc_assert (gimple_code (USE_STMT (use_p)) == GIMPLE_PHI);
> > > > > -	      gcc_assert (bb == single_exit (loop)->dest);
> > > > > +	      gcc_assert (bitmap_bit_p (exit_bbs, bb->index));
> > 
> > That now becomes quite expensive checking already covered by the LC SSA
> > verifier so I suggest to simply drop this assert instead.
> > 
> > > > >                *live_p = true;
> > > > >  	    }
> > > > > @@ -683,6 +713,13 @@ vect_mark_stmts_to_be_vectorized
> > > > (loop_vec_info loop_vinfo, bool *fatal)
> > > > >  	}
> > > > >      }
> > > > >
> > > > > +  /* Ideally this should be in vect_analyze_loop_form but we haven't
> > seen all
> > > > > +     the conds yet at that point and there's no quick way to retrieve them.
> > */
> > > > > +  if (LOOP_VINFO_GENERAL_CTR_FLOW (loop_vinfo))
> > > > > +    return opt_result::failure_at (vect_location,
> > > > > +				   "not vectorized:"
> > > > > +				   " unsupported control flow in loop.\n");
> > 
> > so we didn't do this before?  But see above where I wondered.  So when
> > does this hit with early exits and why can't we check for this in
> > vect_verify_loop_form?
> > 
> > > > > +
> > > > >    /* 2. Process_worklist */
> > > > >    while (worklist.length () > 0)
> > > > >      {
> > > > > @@ -778,6 +815,20 @@ vect_mark_stmts_to_be_vectorized
> > > > (loop_vec_info loop_vinfo, bool *fatal)
> > > > >  			return res;
> > > > >  		    }
> > > > >                   }
> > > > > +	    }
> > > > > +	  else if (gcond *cond = dyn_cast <gcond *> (stmt_vinfo->stmt))
> > > > > +	    {
> > > > > +	      enum tree_code rhs_code = gimple_cond_code (cond);
> > > > > +	      gcc_assert (TREE_CODE_CLASS (rhs_code) == tcc_comparison);
> > > > > +	      opt_result res
> > > > > +		= process_use (stmt_vinfo, gimple_cond_lhs (cond),
> > > > > +			       loop_vinfo, relevant, &worklist, false);
> > > > > +	      if (!res)
> > > > > +		return res;
> > > > > +	      res = process_use (stmt_vinfo, gimple_cond_rhs (cond),
> > > > > +				loop_vinfo, relevant, &worklist, false);
> > > > > +	      if (!res)
> > > > > +		return res;
> > > > >              }
> > > > >  	  else if (gcall *call = dyn_cast <gcall *> (stmt_vinfo->stmt))
> > > > >  	    {
> > > > > @@ -11919,11 +11970,15 @@ vect_analyze_stmt (vec_info *vinfo,
> > > > >  			     node_instance, cost_vec);
> > > > >        if (!res)
> > > > >  	return res;
> > > > > -   }
> > > > > +    }
> > > > > +
> > > > > +  if (is_ctrl_stmt (stmt_info->stmt))
> > > > > +    STMT_VINFO_DEF_TYPE (stmt_info) = vect_early_exit_def;
> > > > >
> > > > >    switch (STMT_VINFO_DEF_TYPE (stmt_info))
> > > > >      {
> > > > >        case vect_internal_def:
> > > > > +      case vect_early_exit_def:
> > > > >          break;
> > > > >
> > > > >        case vect_reduction_def:
> > > > > @@ -11956,6 +12011,7 @@ vect_analyze_stmt (vec_info *vinfo,
> > > > >      {
> > > > >        gcall *call = dyn_cast <gcall *> (stmt_info->stmt);
> > > > >        gcc_assert (STMT_VINFO_VECTYPE (stmt_info)
> > > > > +		  || gimple_code (stmt_info->stmt) == GIMPLE_COND
> > > > >  		  || (call && gimple_call_lhs (call) == NULL_TREE));
> > > > >        *need_to_vectorize = true;
> > > > >      }
> > > > > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> > > > > index
> > > >
> > ec65b65b5910e9cbad0a8c7e83c950b6168b98bf..24a0567a2f23f1b3d8b3
> > > > 40baff61d18da8e242dd 100644
> > > > > --- a/gcc/tree-vectorizer.h
> > > > > +++ b/gcc/tree-vectorizer.h
> > > > > @@ -63,6 +63,7 @@ enum vect_def_type {
> > > > >    vect_internal_def,
> > > > >    vect_induction_def,
> > > > >    vect_reduction_def,
> > > > > +  vect_early_exit_def,
> > 
> > can you avoid putting this inbetween reduction and double reduction
> > please?  Just put it before vect_unknown_def_type.  In fact the COND
> > isn't a def ... maybe we should have pattern recogized
> > 
> >  if (a < b) exit;
> > 
> > as
> > 
> >  cond = a < b;
> >  if (cond != 0) exit;
> > 
> > so the part that we need to vectorize is more clear.
> > 
> > > > >    vect_double_reduction_def,
> > > > >    vect_nested_cycle,
> > > > >    vect_first_order_recurrence,
> > > > > @@ -876,6 +877,13 @@ public:
> > > > >       we need to peel off iterations at the end to form an epilogue loop.  */
> > > > >    bool peeling_for_niter;
> > > > >
> > > > > +  /* When the loop has early breaks that we can vectorize we need to
> > peel
> > > > > +     the loop for the break finding loop.  */
> > > > > +  bool early_breaks;
> > > > > +
> > > > > +  /* When the loop has a non-early break control flow inside.  */
> > > > > +  bool non_break_control_flow;
> > > > > +
> > > > >    /* List of loop additional IV conditionals found in the loop.  */
> > > > >    auto_vec<gcond *> conds;
> > > > >
> > > > > @@ -985,9 +993,11 @@ public:
> > > > >  #define LOOP_VINFO_REDUCTION_CHAINS(L)     (L)->reduction_chains
> > > > >  #define LOOP_VINFO_PEELING_FOR_GAPS(L)     (L)->peeling_for_gaps
> > > > >  #define LOOP_VINFO_PEELING_FOR_NITER(L)    (L)->peeling_for_niter
> > > > > +#define LOOP_VINFO_EARLY_BREAKS(L)         (L)->early_breaks
> > > > >  #define LOOP_VINFO_EARLY_BRK_CONFLICT_STMTS(L) (L)-
> > > > >early_break_conflict
> > > > >  #define LOOP_VINFO_EARLY_BRK_DEST_BB(L)    (L)-
> > >early_break_dest_bb
> > > > >  #define LOOP_VINFO_EARLY_BRK_VUSES(L)      (L)->early_break_vuses
> > > > > +#define LOOP_VINFO_GENERAL_CTR_FLOW(L)     (L)-
> > > > >non_break_control_flow
> > > > >  #define LOOP_VINFO_LOOP_CONDS(L)           (L)->conds
> > > > >  #define LOOP_VINFO_LOOP_IV_COND(L)         (L)->loop_iv_cond
> > > > >  #define LOOP_VINFO_NO_DATA_DEPENDENCIES(L) (L)-
> > > > >no_data_dependencies
> > > > > @@ -1038,8 +1048,8 @@ public:
> > > > >     stack.  */
> > > > >  typedef opt_pointer_wrapper <loop_vec_info> opt_loop_vec_info;
> > > > >
> > > > > -inline loop_vec_info
> > > > > -loop_vec_info_for_loop (class loop *loop)
> > > > > +static inline loop_vec_info
> > > > > +loop_vec_info_for_loop (const class loop *loop)
> > > > >  {
> > > > >    return (loop_vec_info) loop->aux;
> > > > >  }
> > > > > @@ -1789,7 +1799,7 @@ is_loop_header_bb_p (basic_block bb)
> > > > >  {
> > > > >    if (bb == (bb->loop_father)->header)
> > > > >      return true;
> > > > > -  gcc_checking_assert (EDGE_COUNT (bb->preds) == 1);
> > > > > +
> > > > >    return false;
> > > > >  }
> > > > >
> > > > > @@ -2176,9 +2186,10 @@ class auto_purge_vect_location
> > > > >     in tree-vect-loop-manip.cc.  */
> > > > >  extern void vect_set_loop_condition (class loop *, loop_vec_info,
> > > > >  				     tree, tree, tree, bool);
> > > > > -extern bool slpeel_can_duplicate_loop_p (const class loop *,
> > const_edge);
> > > > > +extern bool slpeel_can_duplicate_loop_p (const loop_vec_info,
> > > > const_edge);
> > > > >  class loop *slpeel_tree_duplicate_loop_to_edge_cfg (class loop *,
> > > > > -						     class loop *, edge);
> > > > > +						    class loop *, edge, bool,
> > > > > +						    vec<basic_block> * = NULL);
> > > > >  class loop *vect_loop_versioning (loop_vec_info, gimple *);
> > > > >  extern class loop *vect_do_peeling (loop_vec_info, tree, tree,
> > > > >  				    tree *, tree *, tree *, int, bool, bool,
> > > > > diff --git a/gcc/tree-vectorizer.cc b/gcc/tree-vectorizer.cc
> > > > > index
> > > >
> > a048e9d89178a37455bd7b83ab0f2a238a4ce69e..0dc5479dc92058b6c70c
> > > > 67f29f5dc9a8d72235f4 100644
> > > > > --- a/gcc/tree-vectorizer.cc
> > > > > +++ b/gcc/tree-vectorizer.cc
> > > > > @@ -1379,7 +1379,9 @@ pass_vectorize::execute (function *fun)
> > > > >  	 predicates that need to be shared for optimal predicate usage.
> > > > >  	 However reassoc will re-order them and prevent CSE from working
> > > > >  	 as it should.  CSE only the loop body, not the entry.  */
> > > > > -      bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
> > > > > +      auto_vec<edge> exits = get_loop_exit_edges (loop);
> > 
> > seeing this more and more I think we want a simple way to iterate over
> > all exits without copying to a vector when we have them recorded.  My
> > C++ fu is too limited to support
> > 
> >   for (auto exit : recorded_exits (loop))
> >     ...
> > 
> > (maybe that's enough for somebody to jump onto this ;))
> > 
> > Don't treat all review comments as change orders, but it should be clear
> > the code isn't 100% obvious.  Maybe the patch can be simplified by
> > splitting out the LC SSA cleanup parts.
> > 
> > Thanks,
> > Richard.
> > 
> > > > > +      for (edge exit : exits)
> > > > > +	bitmap_set_bit (exit_bbs, exit->dest->index);
> > > > >
> > > > >        edge entry = EDGE_PRED (loop_preheader_edge (loop)->src, 0);
> > > > >        do_rpo_vn (fun, entry, exit_bbs);
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > > --
> > > > Richard Biener <rguenther@suse.de>
> > > > SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461
> > > > Nuernberg,
> > > > Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien
> > > > Moerman;
> > > > HRB 36809 (AG Nuernberg)
> > >
>
  
Tamar Christina Aug. 18, 2023, 1:12 p.m. UTC | #8
> -----Original Message-----
> From: Richard Biener <rguenther@suse.de>
> Sent: Friday, August 18, 2023 2:53 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; jlaw@ventanamicro.com
> Subject: RE: [PATCH 12/19]middle-end: implement loop peeling and IV
> updates for early break.
> 
> On Fri, 18 Aug 2023, Tamar Christina wrote:
> 
> > > > Yeah if you comment it out one of the testcases should fail.
> > >
> > > using new_preheader instead of e->dest would make things clearer.
> > >
> > > You are now adding the same arg to every exit (you've just queried the
> > > main exit redirect_edge_var_map_vector).
> > >
> > > OK, so I think I understand what you're doing.  If I understand
> > > correctly we know that when we exit the main loop via one of the
> > > early exits we are definitely going to enter the epilog but when
> > > we take the main exit we might not.
> > >
> >
> > Correct.. but..
> >
> > > Looking at the CFG we create currently this isn't reflected and
> > > this complicates this PHI node updating.  What I'd try to do
> > > is leave redirecting the alternate exits until after
> >
> > It is, in the case of the alternate exits this is reflected in copying
> > the same values, as they are the values of the number of completed
> > iterations since the scalar code restarts the last iteration.
> >
> > So all the PHI nodes of the alternate exits are correct.  The vector
> > Iteration doesn't handle the partial iteration.
> >
> > > slpeel_tree_duplicate_loop_to_edge_cfg finished which probably
> > > means leaving it almost unchanged besides the LC SSA maintaining
> > > changes.  After that for the multi-exit case split the
> > > epilog preheader edge and redirect all the alternate exits to the
> > > new preheader.  So the CFG becomes
> > >
> > >                  <original loop>
> > >                 /      |
> > >                /    <main exit w/ original LC PHI>
> > >               /      if (epilog)
> > >    alt exits /        /  \
> > >             /        /    loop around
> > >             |       /
> > >            preheader with "header" PHIs
> > >               |
> > >           <epilog>
> > >
> > > note you need the header PHIs also on the main exit path but you
> > > only need the loop end PHIs there.
> > >
> > > It seems so that at least currently the order of things makes
> > > them more complicated than necessary.
> >
> > I've been trying to, but this representation seems a lot harder to work with,
> > In particular at the moment once we exit
> slpeel_tree_duplicate_loop_to_edge_cfg
> > the loop structure is exactly the same as one expects from any normal epilog
> vectorization.
> >
> > But this new representation requires me to place the guard much earlier than
> the epilogue
> > preheader,  yet I still have to adjust the PHI nodes in the preheader.  So it
> seems that this split
> > is there to only indicate that we always enter the epilog when taking an early
> exit.
> >
> > Today this is reflected in the values of the PHI nodes rather than structurally.
> Once we place
> > The guard we update the nodes and the alternate exits get their value for
> ivtmp updated to VF.
> >
> > This representation also forces me to do the redirection in every call site of
> > slpeel_tree_duplicate_loop_to_edge_cfg making the code more complicated
> in all use sites.
> >
> > But I think this doesn't address the main reason why the
> slpeel_tree_duplicate_loop_to_edge_cfg
> > code has a large block of code to deal with PHI node updates.
> >
> > The reason as you mentioned somewhere else is that after we redirect the
> edges I have to reconstruct
> > the phi nodes.  For most it's straight forwards, but for live values or vuse
> chains it requires extra code.
> >
> > You're right in that before we redirect the edges they are all correct in the exit
> block, you mentioned that
> > the API for the edge redirection is supposed to copy the values over if I
> create the phi nodes before hand.
> >
> > However this doesn't seem to work:
> >
> >      for (auto gsi_from = gsi_start_phis (scalar_exit->dest);
> > 	   !gsi_end_p (gsi_from); gsi_next (&gsi_from))
> > 	{
> > 	  gimple *from_phi = gsi_stmt (gsi_from);
> > 	  tree new_res = copy_ssa_name (gimple_phi_result (from_phi));
> > 	  create_phi_node (new_res, new_preheader);
> > 	}
> >
> >       for (edge exit : loop_exits)
> > 	redirect_edge_and_branch (exit, new_preheader);
> >
> > Still leaves them empty.  Grepping around most code seems to pair
> redirect_edge_and_branch with
> > copy_phi_arg_into_existing_phi.  The problem is that in all these cases after
> redirecting an edge they
> > call copy_phi_arg_into_existing_phi from a predecessor edge to fill in the phi
> nodes.
> 
> You need to call flush_pending_stmts on each edge you redirect.
> copy_phi_arg_into_existing_phi isn't suitable for edge redirecting.

Oh. I'll give that a try, that would make sense.. I didn't flush it in the current approach
because I needed the map, but since I want to get rid of the map, this makes sense.

> 
> > This is because as I redirect_edge_and_branch destroys the phi node entries
> and copy_phi_arg_into_existing_phi
> > simply just reads the gimple_phi_arg_def which would be NULL.
> >
> > You could point it to the src block of the exit, in which case it copies the
> wrong values in for the vuses.  At the end
> > of vectorization the cfgcleanup code does the same thing to maintain LCSSA
> if you haven't.  This code always goes
> > wrong for multiple exits because of the problem described above.  There's no
> node for it to copy the right value
> > from.
> >
> > As an alternate approach I can split the exit edges, copy the phi nodes into
> the split and after that redirect them.
> > This however creates the awkwardness of having the exit edges no longer
> connect to the preheader.
> >
> > All of this then begs the question if this is all easier than the current approach
> which is just to read the edge var
> > map to figure out the nodes that were removed during the redirect.
> 
> But the edge map is supposed to be applied via flush_pending_stmts,
> specifically it relies on PHI nodes having a 1:1 correspondence between
> old and new destination and thus is really designed for the case
> you copy the destination and redirect an edge to the copy.

Ah you were referring to flush_pending_stmts,  right that would make more sense.
Will give it  a go, thanks!

> 
> That is, the main issue I have with the CFG manipulation is that it
> isn't broken down to simple operations that in themselves leave
> everything correct.  I think it should be possible to do this
> and as 2nd step only do the special massaging for the early exit
> LC PHIs that feed into the epilogue loop.
> 

I see, So I think, If I understand what you're saying correctly, is that
you would like instead of having 1 big function, to have little smaller
helper functions that each on their own do something small but
correct and when called in sequence perform the complicated work?

i.e. you still have one top level function, and at the end of that call
everything is in order? 

> As you say the code is quite complicated even without early break
> vectorization which is why I originally suggested to try "fixing" it
> as prerequesite.  It does have the same fundamental issue when feeding
> the epilogue - the "missing" LC PHIs, the difference is only that
> without early break vectorization we take the exit values while
> for early break vectorization we take the latch values from the
> previous iteration(?)

Ok, If I understood the above correctly with how you wanted the sequence
split out, then would this work for you:

I pull out from the patch series:

1. the single_exit removal and using of our own IV in the vectorizer.
2. The refactoring of the current peeling code, without new functionality but
     splitting it up into the logical little helper functions.

And get those committed separately and then rebase on the early break stuff
which will add handling the additional case for multiple exits in the peeling step?

I already did 1, I didn't do 2 because the function was reworked in one go to both
clean it up and support multiple exits.  But would make sense to pull them out.

Is it ok to also pull out the vectorable_comparison refactoring as well? I haven't
committed it yet because without the early_break support it doesn't look useful
but rebases are tricky when it changes.

> 
> > Maybe I'm still misunderstanding the API, but reading the sources of the
> functions, they all copy values from *existing*
> > phi nodes.  And any existing phi node after the redirect are not correct.
> >
> > gimple_redirect_edge_and_branch has a chunk that indicates it should have
> updated the PHI nodes before calling
> > ssa_redirect_edge to remove the old ones, but there's no code there. It's all
> empty.
> >
> > Most of the other refactorings/changes were easy enough to do, but this
> one I seem to be a struggling with.
> 
> I see.  If you are tired of trying feel free to send an updated series
> with the other changes, if it looks awkward but correct we can see
> someone doing the cleanup afterwards.
> 

Thanks, I'm not tired yet 😊, just confused, and I think it's clearer now.  If you agree
with the above I can pull them out of the series.

Cheers,
Tamar

> Richard.
> 
> > Thanks,
> > Tamar
> > >
> > > > >
> > > > > >  	}
> > > > > > -      redirect_edge_and_branch_force (e, new_preheader);
> > > > > > -      flush_pending_stmts (e);
> > > > > > +
> > > > > >        set_immediate_dominator (CDI_DOMINATORS, new_preheader,
> e-
> > > >src);
> > > > > > -      if (was_imm_dom || duplicate_outer_loop)
> > > > > > +
> > > > > > +      if ((was_imm_dom || duplicate_outer_loop) &&
> !multiple_exits_p)
> > > > > >  	set_immediate_dominator (CDI_DOMINATORS, exit_dest,
> new_exit-
> > > > > >src);
> > > > > >
> > > > > >        /* And remove the non-necessary forwarder again.  Keep the
> other
> > > > > > @@ -1647,9 +1756,42 @@ slpeel_tree_duplicate_loop_to_edge_cfg
> > > (class
> > > > > loop *loop,
> > > > > >        delete_basic_block (preheader);
> > > > > >        set_immediate_dominator (CDI_DOMINATORS, scalar_loop-
> >header,
> > > > > >  			       loop_preheader_edge (scalar_loop)->src);
> > > > > > +
> > > > > > +      /* Finally after wiring the new epilogue we need to update its
> main
> > > exit
> > > > > > +	 to the original function exit we recorded.  Other exits are
> already
> > > > > > +	 correct.  */
> > > > > > +      if (multiple_exits_p)
> > > > > > +	{
> > > > > > +	  for (edge e : get_loop_exit_edges (loop))
> > > > > > +	    doms.safe_push (e->dest);
> > > > > > +	  update_loop = new_loop;
> > > > > > +	  doms.safe_push (exit_dest);
> > > > > > +
> > > > > > +	  /* Likely a fall-through edge, so update if needed.  */
> > > > > > +	  if (single_succ_p (exit_dest))
> > > > > > +	    doms.safe_push (single_succ (exit_dest));
> > > > > > +	}
> > > > > >      }
> > > > > >    else /* Add the copy at entry.  */
> > > > > >      {
> > > > > > +      /* Copy the current loop LC PHI nodes between the original loop
> exit
> > > > > > +	 block and the new loop header.  This allows us to later split
> the
> > > > > > +	 preheader block and still find the right LC nodes.  */
> > > > > > +      edge old_latch_loop = loop_latch_edge (loop);
> > > > > > +      edge old_latch_init = loop_preheader_edge (loop);
> > > > > > +      edge new_latch_loop = loop_latch_edge (new_loop);
> > > > > > +      edge new_latch_init = loop_preheader_edge (new_loop);
> > > > > > +      for (auto gsi_from = gsi_start_phis (new_latch_init->dest),
> > > > >
> > > > > see above
> > > > >
> > > > > > +	   gsi_to = gsi_start_phis (old_latch_loop->dest);
> > > > > > +	   flow_loops && !gsi_end_p (gsi_from) && !gsi_end_p
> (gsi_to);
> > > > > > +	   gsi_next (&gsi_from), gsi_next (&gsi_to))
> > > > > > +	{
> > > > > > +	  gimple *from_phi = gsi_stmt (gsi_from);
> > > > > > +	  gimple *to_phi = gsi_stmt (gsi_to);
> > > > > > +	  tree new_arg = PHI_ARG_DEF_FROM_EDGE (from_phi,
> > > > > new_latch_loop);
> > > > > > +	  adjust_phi_and_debug_stmts (to_phi, old_latch_init,
> new_arg);
> > > > > > +	}
> > > > > > +
> > > > > >        if (scalar_loop != loop)
> > > > > >  	{
> > > > > >  	  /* Remove the non-necessary forwarder of scalar_loop again.
> */
> > > > > > @@ -1677,31 +1819,36 @@
> slpeel_tree_duplicate_loop_to_edge_cfg
> > > (class
> > > > > loop *loop,
> > > > > >        delete_basic_block (new_preheader);
> > > > > >        set_immediate_dominator (CDI_DOMINATORS, new_loop-
> >header,
> > > > > >  			       loop_preheader_edge (new_loop)->src);
> > > > > > +
> > > > > > +      if (multiple_exits_p)
> > > > > > +	update_loop = loop;
> > > > > >      }
> > > > > >
> > > > > > -  if (scalar_loop != loop)
> > > > > > +  if (multiple_exits_p)
> > > > > >      {
> > > > > > -      /* Update new_loop->header PHIs, so that on the preheader
> > > > > > -	 edge they are the ones from loop rather than scalar_loop.  */
> > > > > > -      gphi_iterator gsi_orig, gsi_new;
> > > > > > -      edge orig_e = loop_preheader_edge (loop);
> > > > > > -      edge new_e = loop_preheader_edge (new_loop);
> > > > > > -
> > > > > > -      for (gsi_orig = gsi_start_phis (loop->header),
> > > > > > -	   gsi_new = gsi_start_phis (new_loop->header);
> > > > > > -	   !gsi_end_p (gsi_orig) && !gsi_end_p (gsi_new);
> > > > > > -	   gsi_next (&gsi_orig), gsi_next (&gsi_new))
> > > > > > +      for (edge e : get_loop_exit_edges (update_loop))
> > > > > >  	{
> > > > > > -	  gphi *orig_phi = gsi_orig.phi ();
> > > > > > -	  gphi *new_phi = gsi_new.phi ();
> > > > > > -	  tree orig_arg = PHI_ARG_DEF_FROM_EDGE (orig_phi, orig_e);
> > > > > > -	  location_t orig_locus
> > > > > > -	    = gimple_phi_arg_location_from_edge (orig_phi, orig_e);
> > > > > > -
> > > > > > -	  add_phi_arg (new_phi, orig_arg, new_e, orig_locus);
> > > > > > +	  edge ex;
> > > > > > +	  edge_iterator ei;
> > > > > > +	  FOR_EACH_EDGE (ex, ei, e->dest->succs)
> > > > > > +	    {
> > > > > > +	      /* Find the first non-fallthrough block as fall-throughs can't
> > > > > > +		 dominate other blocks.  */
> > > > > > +	      while ((ex->flags & EDGE_FALLTHRU)
> > >
> > > For the prologue peeling any early exit we take would skip all other
> > > loops so we can simply leave them and their LC PHI nodes in place.
> > > We need extra PHIs only on the path to the main vector loop.  I
> > > think the comment isn't accurately reflecting what we do.  In
> > > fact we do not add any LC PHI nodes here but simply adjust the
> > > main loop header PHI arguments?
> > >
> > > > > I don't think EDGE_FALLTHRU is set correctly, what's wrong with
> > > > > just using single_succ_p here?  A fallthru edge src dominates the
> > > > > fallthru edge dest, so the sentence above doesn't make sense.
> > > >
> > > > I wanted to say, that the immediate dominator of a block is never
> > > > an fall through block.  At least from what I understood from how
> > > > the dominators are calculated in the code, though may have missed
> > > > something.
> > >
> > >  BB1
> > >   |
> > >  BB2
> > >   |
> > >  BB3
> > >
> > > here the immediate dominator of BB3 is BB2 and that of BB2 is BB1.
> > >
> > > > >
> > > > > > +		     && single_succ_p (ex->dest))
> > > > > > +		{
> > > > > > +		  doms.safe_push (ex->dest);
> > > > > > +		  ex = single_succ_edge (ex->dest);
> > > > > > +		}
> > > > > > +	      doms.safe_push (ex->dest);
> > > > > > +	    }
> > > > > > +	  doms.safe_push (e->dest);
> > > > > >  	}
> > > > > > -    }
> > > > > >
> > > > > > +      iterate_fix_dominators (CDI_DOMINATORS, doms, false);
> > > > > > +      if (updated_doms)
> > > > > > +	updated_doms->safe_splice (doms);
> > > > > > +    }
> > > > > >    free (new_bbs);
> > > > > >    free (bbs);
> > > > > >
> > > > > > @@ -1777,6 +1924,9 @@ slpeel_can_duplicate_loop_p (const
> > > > > loop_vec_info loop_vinfo, const_edge e)
> > > > > >    gimple_stmt_iterator loop_exit_gsi = gsi_last_bb (exit_e->src);
> > > > > >    unsigned int num_bb = loop->inner? 5 : 2;
> > > > > >
> > > > > > +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > > > +    num_bb += LOOP_VINFO_ALT_EXITS (loop_vinfo).length ();
> > > > > > +
> > > > >
> > > > > I think checking the number of BBs is odd, I don't remember anything
> > > > > in slpeel is specifically tied to that?  I think we can simply drop
> > > > > this or do you remember anything that would depend on ->num_nodes
> > > > > being only exactly 5 or 2?
> > > >
> > > > Never actually seemed to require it, but they're used as some check to
> > > > see if there are unexpected control flow in the loop.
> > > >
> > > > i.e. this would say no if you have an if statement in the loop that wasn't
> > > > converted.  The other part of this and the accompanying explanation is in
> > > > vect_analyze_loop_form.  In the patch series I had to remove the hard
> > > > num_nodes == 2 check from there because number of nodes restricted
> > > > things too much.  If you have an empty fall through block, which seems
> to
> > > > happen often between the main exit and the latch block then we'd not
> > > > vectorize.
> > > >
> > > > So instead I now rejects loops after analyzing the gcond.  So think this
> check
> > > > can go/needs to be different.
> > >
> > > Lets remove it from this function then.
> > >
> > > > >
> > > > > >    /* All loops have an outer scope; the only case loop->outer is NULL is
> for
> > > > > >       the function itself.  */
> > > > > >    if (!loop_outer (loop)
> > > > > > @@ -2044,6 +2194,11 @@ vect_update_ivs_after_vectorizer
> > > > > (loop_vec_info loop_vinfo,
> > > > > >    class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > > > > >    basic_block update_bb = update_e->dest;
> > > > > >
> > > > > > +  /* For early exits we'll update the IVs in
> > > > > > +     vect_update_ivs_after_early_break.  */
> > > > > > +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > > > +    return;
> > > > > > +
> > > > > >    basic_block exit_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
> > > > > >
> > > > > >    /* Make sure there exists a single-predecessor exit bb:  */
> > > > > > @@ -2131,6 +2286,208 @@ vect_update_ivs_after_vectorizer
> > > > > (loop_vec_info loop_vinfo,
> > > > > >        /* Fix phi expressions in the successor bb.  */
> > > > > >        adjust_phi_and_debug_stmts (phi1, update_e, ni_name);
> > > > > >      }
> > > > > > +  return;
> > > > >
> > > > > we don't usually place a return at the end of void functions
> > > > >
> > > > > > +}
> > > > > > +
> > > > > > +/*   Function vect_update_ivs_after_early_break.
> > > > > > +
> > > > > > +     "Advance" the induction variables of LOOP to the value they
> should
> > > take
> > > > > > +     after the execution of LOOP.  This is currently necessary because
> the
> > > > > > +     vectorizer does not handle induction variables that are used after
> the
> > > > > > +     loop.  Such a situation occurs when the last iterations of LOOP are
> > > > > > +     peeled, because of the early exit.  With an early exit we always peel
> > > the
> > > > > > +     loop.
> > > > > > +
> > > > > > +     Input:
> > > > > > +     - LOOP_VINFO - a loop info structure for the loop that is going to
> be
> > > > > > +		    vectorized. The last few iterations of LOOP were
> peeled.
> > > > > > +     - LOOP - a loop that is going to be vectorized. The last few
> iterations
> > > > > > +	      of LOOP were peeled.
> > > > > > +     - VF - The loop vectorization factor.
> > > > > > +     - NITERS_ORIG - the number of iterations that LOOP executes
> (before
> > > it is
> > > > > > +		     vectorized). i.e, the number of times the ivs should
> be
> > > > > > +		     bumped.
> > > > > > +     - NITERS_VECTOR - The number of iterations that the vector LOOP
> > > > > executes.
> > > > > > +     - UPDATE_E - a successor edge of LOOP->exit that is on the (only)
> > > path
> > > > > > +		  coming out from LOOP on which there are uses of
> the LOOP
> > > > > ivs
> > > > > > +		  (this is the path from LOOP->exit to epilog_loop-
> >preheader).
> > > > > > +
> > > > > > +		  The new definitions of the ivs are placed in LOOP-
> >exit.
> > > > > > +		  The phi args associated with the edge UPDATE_E in
> the bb
> > > > > > +		  UPDATE_E->dest are updated accordingly.
> > > > > > +
> > > > > > +     Output:
> > > > > > +       - If available, the LCSSA phi node for the loop IV temp.
> > > > > > +
> > > > > > +     Assumption 1: Like the rest of the vectorizer, this function
> assumes
> > > > > > +     a single loop exit that has a single predecessor.
> > > > > > +
> > > > > > +     Assumption 2: The phi nodes in the LOOP header and in
> update_bb
> > > are
> > > > > > +     organized in the same order.
> > > > > > +
> > > > > > +     Assumption 3: The access function of the ivs is simple enough (see
> > > > > > +     vect_can_advance_ivs_p).  This assumption will be relaxed in the
> > > future.
> > > > > > +
> > > > > > +     Assumption 4: Exactly one of the successors of LOOP exit-bb is on
> a
> > > path
> > > > > > +     coming out of LOOP on which the ivs of LOOP are used (this is the
> > > path
> > > > > > +     that leads to the epilog loop; other paths skip the epilog loop).
> This
> > > > > > +     path starts with the edge UPDATE_E, and its destination (denoted
> > > > > update_bb)
> > > > > > +     needs to have its phis updated.
> > > > > > + */
> > > > > > +
> > > > > > +static tree
> > > > > > +vect_update_ivs_after_early_break (loop_vec_info loop_vinfo, class
> > > loop *
> > > > > epilog,
> > > > > > +				   poly_int64 vf, tree niters_orig,
> > > > > > +				   tree niters_vector, edge update_e)
> > > > > > +{
> > > > > > +  if (!LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > > > +    return NULL;
> > > > > > +
> > > > > > +  gphi_iterator gsi, gsi1;
> > > > > > +  tree ni_name, ivtmp = NULL;
> > > > > > +  basic_block update_bb = update_e->dest;
> > > > > > +  vec<edge> alt_exits = LOOP_VINFO_ALT_EXITS (loop_vinfo);
> > > > > > +  edge loop_iv = LOOP_VINFO_IV_EXIT (loop_vinfo);
> > > > > > +  basic_block exit_bb = loop_iv->dest;
> > > > > > +  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > > > > > +  gcond *cond = LOOP_VINFO_LOOP_IV_COND (loop_vinfo);
> > > > > > +
> > > > > > +  gcc_assert (cond);
> > > > > > +
> > > > > > +  for (gsi = gsi_start_phis (loop->header), gsi1 = gsi_start_phis
> > > (update_bb);
> > > > > > +       !gsi_end_p (gsi) && !gsi_end_p (gsi1);
> > > > > > +       gsi_next (&gsi), gsi_next (&gsi1))
> > > > > > +    {
> > > > > > +      tree init_expr, final_expr, step_expr;
> > > > > > +      tree type;
> > > > > > +      tree var, ni, off;
> > > > > > +      gimple_stmt_iterator last_gsi;
> > > > > > +
> > > > > > +      gphi *phi = gsi1.phi ();
> > > > > > +      tree phi_ssa = PHI_ARG_DEF_FROM_EDGE (phi,
> > > loop_preheader_edge
> > > > > (epilog));
> > > > >
> > > > > I'm confused about the setup.  update_bb looks like the block with the
> > > > > loop-closed PHI nodes of 'loop' and the exit (update_e)?  How does
> > > > > loop_preheader_edge (epilog) come into play here?  That would feed
> into
> > > > > epilog->header PHIs?!
> > > >
> > > > We can't query the type of the phis in the block with the LC PHI nodes, so
> the
> > > > Typical pattern seems to be that we iterate over a block that's part of the
> > > loop
> > > > and that would have the PHIs in the same order, just so we can get to the
> > > > stmt_vec_info.
> > > >
> > > > >
> > > > > It would be nice to name 'gsi[1]', 'update_e' and 'update_bb' in a
> > > > > better way?  Is update_bb really epilog->header?!
> > > > >
> > > > > We're missing checking in PHI_ARG_DEF_FROM_EDGE, namely that
> > > > > E->dest == gimple_bb (PHI) - we're just using E->dest_idx there
> > > > > which "works" even for totally unrelated edges.
> > > > >
> > > > > > +      gphi *phi1 = dyn_cast <gphi *> (SSA_NAME_DEF_STMT
> (phi_ssa));
> > > > > > +      if (!phi1)
> > > > >
> > > > > shouldn't that be an assert?
> > > > >
> > > > > > +	continue;
> > > > > > +      stmt_vec_info phi_info = loop_vinfo->lookup_stmt (gsi.phi ());
> > > > > > +      if (dump_enabled_p ())
> > > > > > +	dump_printf_loc (MSG_NOTE, vect_location,
> > > > > > +			 "vect_update_ivs_after_early_break: phi:
> %G",
> > > > > > +			 (gimple *)phi);
> > > > > > +
> > > > > > +      /* Skip reduction and virtual phis.  */
> > > > > > +      if (!iv_phi_p (phi_info))
> > > > > > +	{
> > > > > > +	  if (dump_enabled_p ())
> > > > > > +	    dump_printf_loc (MSG_NOTE, vect_location,
> > > > > > +			     "reduc or virtual phi. skip.\n");
> > > > > > +	  continue;
> > > > > > +	}
> > > > > > +
> > > > > > +      /* For multiple exits where we handle early exits we need to carry
> on
> > > > > > +	 with the previous IV as loop iteration was not done because
> we exited
> > > > > > +	 early.  As such just grab the original IV.  */
> > > > > > +      phi_ssa = PHI_ARG_DEF_FROM_EDGE (gsi.phi (), loop_latch_edge
> > > > > (loop));
> > > > >
> > > > > but this should be taken care of by LC SSA?
> > > >
> > > > It is, the comment is probably missing details, this part just scales the
> > > counter
> > > > from VF to scalar counts.  It's just a reminder that this scaling is done
> > > differently
> > > > from normal single exit vectorization.
> > > >
> > > > >
> > > > > OK, have to continue tomorrow from here.
> > > >
> > > > Cheers, Thank you!
> > > >
> > > > Tamar
> > > >
> > > > >
> > > > > Richard.
> > > > >
> > > > > > +      if (gimple_cond_lhs (cond) != phi_ssa
> > > > > > +	  && gimple_cond_rhs (cond) != phi_ssa)
> > >
> > > so this is a way to avoid touching the main IV?  Looks a bit fragile to
> > > me.  Hmm, we're iterating over the main loop header PHIs here?
> > > Can't you check, say, the relevancy of the PHI node instead?  Though
> > > it might also be used as induction.  Can't it be used as alternate
> > > exit like
> > >
> > >   for (i)
> > >    {
> > >      if (i & bit)
> > >        break;
> > >    }
> > >
> > > and would we need to adjust 'i' then?
> > >
> > > > > > +	{
> > > > > > +	  type = TREE_TYPE (gimple_phi_result (phi));
> > > > > > +	  step_expr = STMT_VINFO_LOOP_PHI_EVOLUTION_PART
> (phi_info);
> > > > > > +	  step_expr = unshare_expr (step_expr);
> > > > > > +
> > > > > > +	  /* We previously generated the new merged phi in the same
> BB as
> > > > > the
> > > > > > +	     guard.  So use that to perform the scaling on rather than the
> > > > > > +	     normal loop phi which don't take the early breaks into
> account.  */
> > > > > > +	  final_expr = gimple_phi_result (phi1);
> > > > > > +	  init_expr = PHI_ARG_DEF_FROM_EDGE (gsi.phi (),
> > > > > loop_preheader_edge (loop));
> > > > > > +
> > > > > > +	  tree stype = TREE_TYPE (step_expr);
> > > > > > +	  /* For early break the final loop IV is:
> > > > > > +	     init + (final - init) * vf which takes into account peeling
> > > > > > +	     values and non-single steps.  */
> > > > > > +	  off = fold_build2 (MINUS_EXPR, stype,
> > > > > > +			     fold_convert (stype, final_expr),
> > > > > > +			     fold_convert (stype, init_expr));
> > > > > > +	  /* Now adjust for VF to get the final iteration value.  */
> > > > > > +	  off = fold_build2 (MULT_EXPR, stype, off, build_int_cst
> (stype, vf));
> > > > > > +
> > > > > > +	  /* Adjust the value with the offset.  */
> > > > > > +	  if (POINTER_TYPE_P (type))
> > > > > > +	    ni = fold_build_pointer_plus (init_expr, off);
> > > > > > +	  else
> > > > > > +	    ni = fold_convert (type,
> > > > > > +			       fold_build2 (PLUS_EXPR, stype,
> > > > > > +					    fold_convert (stype,
> init_expr),
> > > > > > +					    off));
> > > > > > +	  var = create_tmp_var (type, "tmp");
> > >
> > > so how does the non-early break code deal with updating inductions?
> > > And how do you avoid altering this when we flow in from the normal
> > > exit?  That is, you are updating the value in the epilog loop
> > > header but don't you need to instead update the value only on
> > > the alternate exit edges from the main loop (and keep the not
> > > updated value on the main exit edge)?
> > >
> > > > > > +	  last_gsi = gsi_last_bb (exit_bb);
> > > > > > +	  gimple_seq new_stmts = NULL;
> > > > > > +	  ni_name = force_gimple_operand (ni, &new_stmts, false,
> var);
> > > > > > +	  /* Exit_bb shouldn't be empty.  */
> > > > > > +	  if (!gsi_end_p (last_gsi))
> > > > > > +	    gsi_insert_seq_after (&last_gsi, new_stmts,
> GSI_SAME_STMT);
> > > > > > +	  else
> > > > > > +	    gsi_insert_seq_before (&last_gsi, new_stmts,
> GSI_SAME_STMT);
> > > > > > +
> > > > > > +	  /* Fix phi expressions in the successor bb.  */
> > > > > > +	  adjust_phi_and_debug_stmts (phi, update_e, ni_name);
> > > > > > +	}
> > > > > > +      else
> > > > > > +	{
> > > > > > +	  type = TREE_TYPE (gimple_phi_result (phi));
> > > > > > +	  step_expr = STMT_VINFO_LOOP_PHI_EVOLUTION_PART
> (phi_info);
> > > > > > +	  step_expr = unshare_expr (step_expr);
> > > > > > +
> > > > > > +	  /* We previously generated the new merged phi in the same
> BB as
> > > > > the
> > > > > > +	     guard.  So use that to perform the scaling on rather than the
> > > > > > +	     normal loop phi which don't take the early breaks into
> account.  */
> > > > > > +	  init_expr = PHI_ARG_DEF_FROM_EDGE (phi1,
> loop_preheader_edge
> > > > > (loop));
> > > > > > +	  tree stype = TREE_TYPE (step_expr);
> > > > > > +
> > > > > > +	  if (vf.is_constant ())
> > > > > > +	    {
> > > > > > +	      ni = fold_build2 (MULT_EXPR, stype,
> > > > > > +				fold_convert (stype,
> > > > > > +					      niters_vector),
> > > > > > +				build_int_cst (stype, vf));
> > > > > > +
> > > > > > +	      ni = fold_build2 (MINUS_EXPR, stype,
> > > > > > +				fold_convert (stype,
> > > > > > +					      niters_orig),
> > > > > > +				fold_convert (stype, ni));
> > > > > > +	    }
> > > > > > +	  else
> > > > > > +	    /* If the loop's VF isn't constant then the loop must have
> been
> > > > > > +	       masked, so at the end of the loop we know we have
> finished
> > > > > > +	       the entire loop and found nothing.  */
> > > > > > +	    ni = build_zero_cst (stype);
> > > > > > +
> > > > > > +	  ni = fold_convert (type, ni);
> > > > > > +	  /* We don't support variable n in this version yet.  */
> > > > > > +	  gcc_assert (TREE_CODE (ni) == INTEGER_CST);
> > > > > > +
> > > > > > +	  var = create_tmp_var (type, "tmp");
> > > > > > +
> > > > > > +	  last_gsi = gsi_last_bb (exit_bb);
> > > > > > +	  gimple_seq new_stmts = NULL;
> > > > > > +	  ni_name = force_gimple_operand (ni, &new_stmts, false,
> var);
> > > > > > +	  /* Exit_bb shouldn't be empty.  */
> > > > > > +	  if (!gsi_end_p (last_gsi))
> > > > > > +	    gsi_insert_seq_after (&last_gsi, new_stmts,
> GSI_SAME_STMT);
> > > > > > +	  else
> > > > > > +	    gsi_insert_seq_before (&last_gsi, new_stmts,
> GSI_SAME_STMT);
> > > > > > +
> > > > > > +	  adjust_phi_and_debug_stmts (phi1, loop_iv, ni_name);
> > > > > > +
> > > > > > +	  for (edge exit : alt_exits)
> > > > > > +	    adjust_phi_and_debug_stmts (phi1, exit,
> > > > > > +					build_int_cst (TREE_TYPE
> (step_expr),
> > > > > > +						       vf));
> > > > > > +	  ivtmp = gimple_phi_result (phi1);
> > > > > > +	}
> > > > > > +    }
> > > > > > +
> > > > > > +  return ivtmp;
> > > > > >  }
> > > > > >
> > > > > >  /* Return a gimple value containing the misalignment (measured in
> > > vector
> > > > > > @@ -2632,137 +2989,34 @@ vect_gen_vector_loop_niters_mult_vf
> > > > > (loop_vec_info loop_vinfo,
> > > > > >
> > > > > >  /* LCSSA_PHI is a lcssa phi of EPILOG loop which is copied from LOOP,
> > > > > >     this function searches for the corresponding lcssa phi node in exit
> > > > > > -   bb of LOOP.  If it is found, return the phi result; otherwise return
> > > > > > -   NULL.  */
> > > > > > +   bb of LOOP following the LCSSA_EDGE to the exit node.  If it is
> found,
> > > > > > +   return the phi result; otherwise return NULL.  */
> > > > > >
> > > > > >  static tree
> > > > > >  find_guard_arg (class loop *loop, class loop *epilog
> > > ATTRIBUTE_UNUSED,
> > > > > > -		gphi *lcssa_phi)
> > > > > > +		gphi *lcssa_phi, int lcssa_edge = 0)
> > > > > >  {
> > > > > >    gphi_iterator gsi;
> > > > > >    edge e = loop->vec_loop_iv;
> > > > > >
> > > > > > -  gcc_assert (single_pred_p (e->dest));
> > > > > >    for (gsi = gsi_start_phis (e->dest); !gsi_end_p (gsi); gsi_next (&gsi))
> > > > > >      {
> > > > > >        gphi *phi = gsi.phi ();
> > > > > > -      if (operand_equal_p (PHI_ARG_DEF (phi, 0),
> > > > > > -			   PHI_ARG_DEF (lcssa_phi, 0), 0))
> > > > > > -	return PHI_RESULT (phi);
> > > > > > -    }
> > > > > > -  return NULL_TREE;
> > > > > > -}
> > > > > > -
> > > > > > -/* Function slpeel_tree_duplicate_loop_to_edge_cfg duplciates
> > > > > FIRST/SECOND
> > > > > > -   from SECOND/FIRST and puts it at the original loop's preheader/exit
> > > > > > -   edge, the two loops are arranged as below:
> > > > > > -
> > > > > > -       preheader_a:
> > > > > > -     first_loop:
> > > > > > -       header_a:
> > > > > > -	 i_1 = PHI<i_0, i_2>;
> > > > > > -	 ...
> > > > > > -	 i_2 = i_1 + 1;
> > > > > > -	 if (cond_a)
> > > > > > -	   goto latch_a;
> > > > > > -	 else
> > > > > > -	   goto between_bb;
> > > > > > -       latch_a:
> > > > > > -	 goto header_a;
> > > > > > -
> > > > > > -       between_bb:
> > > > > > -	 ;; i_x = PHI<i_2>;   ;; LCSSA phi node to be created for FIRST,
> > > > > > -
> > > > > > -     second_loop:
> > > > > > -       header_b:
> > > > > > -	 i_3 = PHI<i_0, i_4>; ;; Use of i_0 to be replaced with i_x,
> > > > > > -				 or with i_2 if no LCSSA phi is created
> > > > > > -				 under condition of
> > > > > CREATE_LCSSA_FOR_IV_PHIS.
> > > > > > -	 ...
> > > > > > -	 i_4 = i_3 + 1;
> > > > > > -	 if (cond_b)
> > > > > > -	   goto latch_b;
> > > > > > -	 else
> > > > > > -	   goto exit_bb;
> > > > > > -       latch_b:
> > > > > > -	 goto header_b;
> > > > > > -
> > > > > > -       exit_bb:
> > > > > > -
> > > > > > -   This function creates loop closed SSA for the first loop; update the
> > > > > > -   second loop's PHI nodes by replacing argument on incoming edge
> with
> > > the
> > > > > > -   result of newly created lcssa PHI nodes.  IF
> > > CREATE_LCSSA_FOR_IV_PHIS
> > > > > > -   is false, Loop closed ssa phis will only be created for non-iv phis for
> > > > > > -   the first loop.
> > > > > > -
> > > > > > -   This function assumes exit bb of the first loop is preheader bb of
> the
> > > > > > -   second loop, i.e, between_bb in the example code.  With PHIs
> updated,
> > > > > > -   the second loop will execute rest iterations of the first.  */
> > > > > > -
> > > > > > -static void
> > > > > > -slpeel_update_phi_nodes_for_loops (loop_vec_info loop_vinfo,
> > > > > > -				   class loop *first, class loop *second,
> > > > > > -				   bool create_lcssa_for_iv_phis)
> > > > > > -{
> > > > > > -  gphi_iterator gsi_update, gsi_orig;
> > > > > > -  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > > > > > -
> > > > > > -  edge first_latch_e = EDGE_SUCC (first->latch, 0);
> > > > > > -  edge second_preheader_e = loop_preheader_edge (second);
> > > > > > -  basic_block between_bb = single_exit (first)->dest;
> > > > > > -
> > > > > > -  gcc_assert (between_bb == second_preheader_e->src);
> > > > > > -  gcc_assert (single_pred_p (between_bb) && single_succ_p
> > > (between_bb));
> > > > > > -  /* Either the first loop or the second is the loop to be vectorized.  */
> > > > > > -  gcc_assert (loop == first || loop == second);
> > > > > > -
> > > > > > -  for (gsi_orig = gsi_start_phis (first->header),
> > > > > > -       gsi_update = gsi_start_phis (second->header);
> > > > > > -       !gsi_end_p (gsi_orig) && !gsi_end_p (gsi_update);
> > > > > > -       gsi_next (&gsi_orig), gsi_next (&gsi_update))
> > > > > > -    {
> > > > > > -      gphi *orig_phi = gsi_orig.phi ();
> > > > > > -      gphi *update_phi = gsi_update.phi ();
> > > > > > -
> > > > > > -      tree arg = PHI_ARG_DEF_FROM_EDGE (orig_phi, first_latch_e);
> > > > > > -      /* Generate lcssa PHI node for the first loop.  */
> > > > > > -      gphi *vect_phi = (loop == first) ? orig_phi : update_phi;
> > > > > > -      stmt_vec_info vect_phi_info = loop_vinfo->lookup_stmt
> (vect_phi);
> > > > > > -      if (create_lcssa_for_iv_phis || !iv_phi_p (vect_phi_info))
> > > > > > +      /* Nested loops with multiple exits can have different no# phi
> node
> > > > > > +	 arguments between the main loop and epilog as epilog falls to
> the
> > > > > > +	 second loop.  */
> > > > > > +      if (gimple_phi_num_args (phi) > e->dest_idx)
> > > > > >  	{
> > > > > > -	  tree new_res = copy_ssa_name (PHI_RESULT (orig_phi));
> > > > > > -	  gphi *lcssa_phi = create_phi_node (new_res, between_bb);
> > > > > > -	  add_phi_arg (lcssa_phi, arg, single_exit (first),
> > > > > UNKNOWN_LOCATION);
> > > > > > -	  arg = new_res;
> > > > > > -	}
> > > > > > -
> > > > > > -      /* Update PHI node in the second loop by replacing arg on the
> loop's
> > > > > > -	 incoming edge.  */
> > > > > > -      adjust_phi_and_debug_stmts (update_phi, second_preheader_e,
> > > arg);
> > > > > > -    }
> > > > > > -
> > > > > > -  /* For epilogue peeling we have to make sure to copy all LC PHIs
> > > > > > -     for correct vectorization of live stmts.  */
> > > > > > -  if (loop == first)
> > > > > > -    {
> > > > > > -      basic_block orig_exit = single_exit (second)->dest;
> > > > > > -      for (gsi_orig = gsi_start_phis (orig_exit);
> > > > > > -	   !gsi_end_p (gsi_orig); gsi_next (&gsi_orig))
> > > > > > -	{
> > > > > > -	  gphi *orig_phi = gsi_orig.phi ();
> > > > > > -	  tree orig_arg = PHI_ARG_DEF (orig_phi, 0);
> > > > > > -	  if (TREE_CODE (orig_arg) != SSA_NAME || virtual_operand_p
> > > > > (orig_arg))
> > > > > > -	    continue;
> > > > > > -
> > > > > > -	  /* Already created in the above loop.   */
> > > > > > -	  if (find_guard_arg (first, second, orig_phi))
> > > > > > +	  tree var = PHI_ARG_DEF (phi, e->dest_idx);
> > > > > > +	  if (TREE_CODE (var) != SSA_NAME)
> > > > > >  	    continue;
> > > > > >
> > > > > > -	  tree new_res = copy_ssa_name (orig_arg);
> > > > > > -	  gphi *lcphi = create_phi_node (new_res, between_bb);
> > > > > > -	  add_phi_arg (lcphi, orig_arg, single_exit (first),
> > > > > UNKNOWN_LOCATION);
> > > > > > +	  if (operand_equal_p (get_current_def (var),
> > > > > > +			       PHI_ARG_DEF (lcssa_phi, lcssa_edge), 0))
> > > > > > +	    return PHI_RESULT (phi);
> > > > > >  	}
> > > > > >      }
> > > > > > +  return NULL_TREE;
> > > > > >  }
> > > > > >
> > > > > >  /* Function slpeel_add_loop_guard adds guard skipping from the
> > > beginning
> > > > > > @@ -2910,13 +3164,11 @@ slpeel_update_phi_nodes_for_guard2
> > > (class
> > > > > loop *loop, class loop *epilog,
> > > > > >    gcc_assert (single_succ_p (merge_bb));
> > > > > >    edge e = single_succ_edge (merge_bb);
> > > > > >    basic_block exit_bb = e->dest;
> > > > > > -  gcc_assert (single_pred_p (exit_bb));
> > > > > > -  gcc_assert (single_pred (exit_bb) == single_exit (epilog)->dest);
> > > > > >
> > > > > >    for (gsi = gsi_start_phis (exit_bb); !gsi_end_p (gsi); gsi_next (&gsi))
> > > > > >      {
> > > > > >        gphi *update_phi = gsi.phi ();
> > > > > > -      tree old_arg = PHI_ARG_DEF (update_phi, 0);
> > > > > > +      tree old_arg = PHI_ARG_DEF (update_phi, e->dest_idx);
> > > > > >
> > > > > >        tree merge_arg = NULL_TREE;
> > > > > >
> > > > > > @@ -2928,7 +3180,7 @@ slpeel_update_phi_nodes_for_guard2
> (class
> > > loop
> > > > > *loop, class loop *epilog,
> > > > > >        if (!merge_arg)
> > > > > >  	merge_arg = old_arg;
> > > > > >
> > > > > > -      tree guard_arg = find_guard_arg (loop, epilog, update_phi);
> > > > > > +      tree guard_arg = find_guard_arg (loop, epilog, update_phi, e-
> > > >dest_idx);
> > > > > >        /* If the var is live after loop but not a reduction, we simply
> > > > > >  	 use the old arg.  */
> > > > > >        if (!guard_arg)
> > > > > > @@ -2948,21 +3200,6 @@ slpeel_update_phi_nodes_for_guard2
> (class
> > > > > loop *loop, class loop *epilog,
> > > > > >      }
> > > > > >  }
> > > > > >
> > > > > > -/* EPILOG loop is duplicated from the original loop for vectorizing,
> > > > > > -   the arg of its loop closed ssa PHI needs to be updated.  */
> > > > > > -
> > > > > > -static void
> > > > > > -slpeel_update_phi_nodes_for_lcssa (class loop *epilog)
> > > > > > -{
> > > > > > -  gphi_iterator gsi;
> > > > > > -  basic_block exit_bb = single_exit (epilog)->dest;
> > > > > > -
> > > > > > -  gcc_assert (single_pred_p (exit_bb));
> > > > > > -  edge e = EDGE_PRED (exit_bb, 0);
> > > > > > -  for (gsi = gsi_start_phis (exit_bb); !gsi_end_p (gsi); gsi_next (&gsi))
> > > > > > -    rename_use_op (PHI_ARG_DEF_PTR_FROM_EDGE (gsi.phi (), e));
> > > > > > -}
> > > > > > -
> > >
> > > I wonder if we can still split these changes out to before early break
> > > vect?
> > >
> > > > > >  /* EPILOGUE_VINFO is an epilogue loop that we now know would
> need
> > > to
> > > > > >     iterate exactly CONST_NITERS times.  Make a final decision about
> > > > > >     whether the epilogue loop should be used, returning true if so.  */
> > > > > > @@ -3138,6 +3375,14 @@ vect_do_peeling (loop_vec_info
> loop_vinfo,
> > > tree
> > > > > niters, tree nitersm1,
> > > > > >      bound_epilog += vf - 1;
> > > > > >    if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
> > > > > >      bound_epilog += 1;
> > > > > > +  /* For early breaks the scalar loop needs to execute at most VF
> times
> > > > > > +     to find the element that caused the break.  */
> > > > > > +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > > > +    {
> > > > > > +      bound_epilog = vf;
> > > > > > +      /* Force a scalar epilogue as we can't vectorize the index finding.
> */
> > > > > > +      vect_epilogues = false;
> > > > > > +    }
> > > > > >    bool epilog_peeling = maybe_ne (bound_epilog, 0U);
> > > > > >    poly_uint64 bound_scalar = bound_epilog;
> > > > > >
> > > > > > @@ -3297,16 +3542,24 @@ vect_do_peeling (loop_vec_info
> > > loop_vinfo,
> > > > > tree niters, tree nitersm1,
> > > > > >  				  bound_prolog + bound_epilog)
> > > > > >  		      : (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
> > > > > >  			 || vect_epilogues));
> > > > > > +
> > > > > > +  /* We only support early break vectorization on known bounds at
> this
> > > > > time.
> > > > > > +     This means that if the vector loop can't be entered then we won't
> > > > > generate
> > > > > > +     it at all.  So for now force skip_vector off because the additional
> > > control
> > > > > > +     flow messes with the BB exits and we've already analyzed them.
> */
> > > > > > + skip_vector = skip_vector && !LOOP_VINFO_EARLY_BREAKS
> > > (loop_vinfo);
> > > > > > +
> > >
> > > I think it should be as "easy" as entering the epilog via the block taking
> > > the regular exit?
> > >
> > > > > >    /* Epilog loop must be executed if the number of iterations for epilog
> > > > > >       loop is known at compile time, otherwise we need to add a check
> at
> > > > > >       the end of vector loop and skip to the end of epilog loop.  */
> > > > > >    bool skip_epilog = (prolog_peeling < 0
> > > > > >  		      || !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > > > > >  		      || !vf.is_constant ());
> > > > > > -  /* PEELING_FOR_GAPS is special because epilog loop must be
> executed.
> > > */
> > > > > > -  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
> > > > > > +  /* PEELING_FOR_GAPS and peeling for early breaks are special
> because
> > > > > epilog
> > > > > > +     loop must be executed.  */
> > > > > > +  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> > > > > > +      || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > > >      skip_epilog = false;
> > > > > > -
> > > > > >    class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
> > > > > >    auto_vec<profile_count> original_counts;
> > > > > >    basic_block *original_bbs = NULL;
> > > > > > @@ -3344,13 +3597,13 @@ vect_do_peeling (loop_vec_info
> > > loop_vinfo,
> > > > > tree niters, tree nitersm1,
> > > > > >    if (prolog_peeling)
> > > > > >      {
> > > > > >        e = loop_preheader_edge (loop);
> > > > > > -      gcc_checking_assert (slpeel_can_duplicate_loop_p (loop, e));
> > > > > > -
> > > > > > +      gcc_checking_assert (slpeel_can_duplicate_loop_p (loop_vinfo,
> e));
> > > > > >        /* Peel prolog and put it on preheader edge of loop.  */
> > > > > > -      prolog = slpeel_tree_duplicate_loop_to_edge_cfg (loop,
> scalar_loop,
> > > e);
> > > > > > +      prolog = slpeel_tree_duplicate_loop_to_edge_cfg (loop,
> scalar_loop,
> > > e,
> > > > > > +						       true);
> > > > > >        gcc_assert (prolog);
> > > > > >        prolog->force_vectorize = false;
> > > > > > -      slpeel_update_phi_nodes_for_loops (loop_vinfo, prolog, loop,
> true);
> > > > > > +
> > > > > >        first_loop = prolog;
> > > > > >        reset_original_copy_tables ();
> > > > > >
> > > > > > @@ -3420,11 +3673,12 @@ vect_do_peeling (loop_vec_info
> > > loop_vinfo,
> > > > > tree niters, tree nitersm1,
> > > > > >  	 as the transformations mentioned above make less or no
> sense when
> > > > > not
> > > > > >  	 vectorizing.  */
> > > > > >        epilog = vect_epilogues ? get_loop_copy (loop) : scalar_loop;
> > > > > > -      epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, epilog, e);
> > > > > > +      auto_vec<basic_block> doms;
> > > > > > +      epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, epilog, e,
> > > true,
> > > > > > +						       &doms);
> > > > > >        gcc_assert (epilog);
> > > > > >
> > > > > >        epilog->force_vectorize = false;
> > > > > > -      slpeel_update_phi_nodes_for_loops (loop_vinfo, loop, epilog,
> false);
> > > > > >
> > > > > >        /* Scalar version loop may be preferred.  In this case, add guard
> > > > > >  	 and skip to epilog.  Note this only happens when the number
> of
> > > > > > @@ -3496,6 +3750,54 @@ vect_do_peeling (loop_vec_info
> loop_vinfo,
> > > tree
> > > > > niters, tree nitersm1,
> > > > > >        vect_update_ivs_after_vectorizer (loop_vinfo,
> niters_vector_mult_vf,
> > > > > >  					update_e);
> > > > > >
> > > > > > +      /* For early breaks we must create a guard to check how many
> > > iterations
> > > > > > +	 of the scalar loop are yet to be performed.  */
> > >
> > > We have this check anyway, no?  In fact don't we know that we always
> enter
> > > the epilog (see above)?
> > >
> > > > > > +      if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > > > +	{
> > > > > > +	  tree ivtmp =
> > > > > > +	    vect_update_ivs_after_early_break (loop_vinfo, epilog, vf,
> niters,
> > > > > > +					       *niters_vector, update_e);
> > > > > > +
> > > > > > +	  gcc_assert (ivtmp);
> > > > > > +	  tree guard_cond = fold_build2 (EQ_EXPR,
> boolean_type_node,
> > > > > > +					 fold_convert (TREE_TYPE
> (niters),
> > > > > > +						       ivtmp),
> > > > > > +					 build_zero_cst (TREE_TYPE
> (niters)));
> > > > > > +	  basic_block guard_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)-
> >dest;
> > > > > > +
> > > > > > +	  /* If we had a fallthrough edge, the guard will the threaded
> through
> > > > > > +	     and so we may need to find the actual final edge.  */
> > > > > > +	  edge final_edge = epilog->vec_loop_iv;
> > > > > > +	  /* slpeel_update_phi_nodes_for_guard2 expects an empty
> block in
> > > > > > +	     between the guard and the exit edge.  It only adds new
> nodes and
> > > > > > +	     doesn't update existing one in the current scheme.  */
> > > > > > +	  basic_block guard_to = split_edge (final_edge);
> > > > > > +	  edge guard_e = slpeel_add_loop_guard (guard_bb,
> guard_cond,
> > > > > guard_to,
> > > > > > +						guard_bb,
> prob_epilog.invert
> > > > > (),
> > > > > > +						irred_flag);
> > > > > > +	  doms.safe_push (guard_bb);
> > > > > > +
> > > > > > +	  iterate_fix_dominators (CDI_DOMINATORS, doms, false);
> > > > > > +
> > > > > > +	  /* We must update all the edges from the new guard_bb.  */
> > > > > > +	  slpeel_update_phi_nodes_for_guard2 (loop, epilog, guard_e,
> > > > > > +					      final_edge);
> > > > > > +
> > > > > > +	  /* If the loop was versioned we'll have an intermediate BB
> between
> > > > > > +	     the guard and the exit.  This intermediate block is required
> > > > > > +	     because in the current scheme of things the guard block phi
> > > > > > +	     updating can only maintain LCSSA by creating new blocks.
> In this
> > > > > > +	     case we just need to update the uses in this block as well.
> */
> > > > > > +	  if (loop != scalar_loop)
> > > > > > +	    {
> > > > > > +	      for (gphi_iterator gsi = gsi_start_phis (guard_to);
> > > > > > +		   !gsi_end_p (gsi); gsi_next (&gsi))
> > > > > > +		rename_use_op (PHI_ARG_DEF_PTR_FROM_EDGE
> (gsi.phi (),
> > > > > guard_e));
> > > > > > +	    }
> > > > > > +
> > > > > > +	  flush_pending_stmts (guard_e);
> > > > > > +	}
> > > > > > +
> > > > > >        if (skip_epilog)
> > > > > >  	{
> > > > > >  	  guard_cond = fold_build2 (EQ_EXPR, boolean_type_node,
> > > > > > @@ -3520,8 +3822,6 @@ vect_do_peeling (loop_vec_info
> loop_vinfo,
> > > tree
> > > > > niters, tree nitersm1,
> > > > > >  	    }
> > > > > >  	  scale_loop_profile (epilog, prob_epilog, 0);
> > > > > >  	}
> > > > > > -      else
> > > > > > -	slpeel_update_phi_nodes_for_lcssa (epilog);
> > > > > >
> > > > > >        unsigned HOST_WIDE_INT bound;
> > > > > >        if (bound_scalar.is_constant (&bound))
> > > > > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > > > > > index
> > > > >
> > >
> b4a98de80aa39057fc9b17977dd0e347b4f0fb5d..ab9a2048186f461f5ec49
> > > > > f21421958e7ee25eada 100644
> > > > > > --- a/gcc/tree-vect-loop.cc
> > > > > > +++ b/gcc/tree-vect-loop.cc
> > > > > > @@ -1007,6 +1007,8 @@ _loop_vec_info::_loop_vec_info (class loop
> > > > > *loop_in, vec_info_shared *shared)
> > > > > >      partial_load_store_bias (0),
> > > > > >      peeling_for_gaps (false),
> > > > > >      peeling_for_niter (false),
> > > > > > +    early_breaks (false),
> > > > > > +    non_break_control_flow (false),
> > > > > >      no_data_dependencies (false),
> > > > > >      has_mask_store (false),
> > > > > >      scalar_loop_scaling (profile_probability::uninitialized ()),
> > > > > > @@ -1199,6 +1201,14 @@ vect_need_peeling_or_partial_vectors_p
> > > > > (loop_vec_info loop_vinfo)
> > > > > >      th = LOOP_VINFO_COST_MODEL_THRESHOLD
> > > > > (LOOP_VINFO_ORIG_LOOP_INFO
> > > > > >  					  (loop_vinfo));
> > > > > >
> > > > > > +  /* When we have multiple exits and VF is unknown, we must
> require
> > > > > partial
> > > > > > +     vectors because the loop bounds is not a minimum but a
> maximum.
> > > > > That is to
> > > > > > +     say we cannot unpredicate the main loop unless we peel or use
> partial
> > > > > > +     vectors in the epilogue.  */
> > > > > > +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
> > > > > > +      && !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ())
> > > > > > +    return true;
> > > > > > +
> > > > > >    if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > > > > >        && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) >= 0)
> > > > > >      {
> > > > > > @@ -1652,12 +1662,12 @@
> vect_compute_single_scalar_iteration_cost
> > > > > (loop_vec_info loop_vinfo)
> > > > > >    loop_vinfo->scalar_costs->finish_cost (nullptr);
> > > > > >  }
> > > > > >
> > > > > > -
> > > > > >  /* Function vect_analyze_loop_form.
> > > > > >
> > > > > >     Verify that certain CFG restrictions hold, including:
> > > > > >     - the loop has a pre-header
> > > > > > -   - the loop has a single entry and exit
> > > > > > +   - the loop has a single entry
> > > > > > +   - nested loops can have only a single exit.
> > > > > >     - the loop exit condition is simple enough
> > > > > >     - the number of iterations can be analyzed, i.e, a countable loop.
> The
> > > > > >       niter could be analyzed under some assumptions.  */
> > > > > > @@ -1693,11 +1703,6 @@ vect_analyze_loop_form (class loop
> *loop,
> > > > > vect_loop_form_info *info)
> > > > > >                             |
> > > > > >                          (exit-bb)  */
> > > > > >
> > > > > > -      if (loop->num_nodes != 2)
> > > > > > -	return opt_result::failure_at (vect_location,
> > > > > > -				       "not vectorized:"
> > > > > > -				       " control flow in loop.\n");
> > > > > > -
> > > > > >        if (empty_block_p (loop->header))
> > > > > >  	return opt_result::failure_at (vect_location,
> > > > > >  				       "not vectorized: empty loop.\n");
> > > > > > @@ -1768,11 +1773,13 @@ vect_analyze_loop_form (class loop
> *loop,
> > > > > vect_loop_form_info *info)
> > > > > >          dump_printf_loc (MSG_NOTE, vect_location,
> > > > > >  			 "Considering outer-loop vectorization.\n");
> > > > > >        info->inner_loop_cond = inner.loop_cond;
> > > > > > +
> > > > > > +      if (!single_exit (loop))
> > > > > > +	return opt_result::failure_at (vect_location,
> > > > > > +				       "not vectorized: multiple
> exits.\n");
> > > > > > +
> > > > > >      }
> > > > > >
> > > > > > -  if (!single_exit (loop))
> > > > > > -    return opt_result::failure_at (vect_location,
> > > > > > -				   "not vectorized: multiple exits.\n");
> > > > > >    if (EDGE_COUNT (loop->header->preds) != 2)
> > > > > >      return opt_result::failure_at (vect_location,
> > > > > >  				   "not vectorized:"
> > > > > > @@ -1788,11 +1795,36 @@ vect_analyze_loop_form (class loop
> *loop,
> > > > > vect_loop_form_info *info)
> > > > > >  				   "not vectorized: latch block not
> empty.\n");
> > > > > >
> > > > > >    /* Make sure the exit is not abnormal.  */
> > > > > > -  edge e = single_exit (loop);
> > > > > > -  if (e->flags & EDGE_ABNORMAL)
> > > > > > -    return opt_result::failure_at (vect_location,
> > > > > > -				   "not vectorized:"
> > > > > > -				   " abnormal loop exit edge.\n");
> > > > > > +  auto_vec<edge> exits = get_loop_exit_edges (loop);
> > > > > > +  edge nexit = loop->vec_loop_iv;
> > > > > > +  for (edge e : exits)
> > > > > > +    {
> > > > > > +      if (e->flags & EDGE_ABNORMAL)
> > > > > > +	return opt_result::failure_at (vect_location,
> > > > > > +				       "not vectorized:"
> > > > > > +				       " abnormal loop exit edge.\n");
> > > > > > +      /* Early break BB must be after the main exit BB.  In theory we
> should
> > > > > > +	 be able to vectorize the inverse order, but the current flow in
> the
> > > > > > +	 the vectorizer always assumes you update successor PHI
> nodes, not
> > > > > > +	 preds.  */
> > > > > > +      if (e != nexit && !dominated_by_p (CDI_DOMINATORS, nexit->src,
> e-
> > > > > >src))
> > > > > > +	return opt_result::failure_at (vect_location,
> > > > > > +				       "not vectorized:"
> > > > > > +				       " abnormal loop exit edge
> order.\n");
> > >
> > > "unsupported loop exit order", but I don't understand the comment.
> > >
> > > > > > +    }
> > > > > > +
> > > > > > +  /* We currently only support early exit loops with known bounds.
> */
> > >
> > > Btw, why's that?  Is that because we don't support the loop-around edge?
> > > IMHO this is the most serious limitation (and as said above it should be
> > > trivial to fix).
> > >
> > > > > > +  if (exits.length () > 1)
> > > > > > +    {
> > > > > > +      class tree_niter_desc niter;
> > > > > > +      if (!number_of_iterations_exit_assumptions (loop, nexit, &niter,
> > > NULL)
> > > > > > +	  || chrec_contains_undetermined (niter.niter)
> > > > > > +	  || !evolution_function_is_constant_p (niter.niter))
> > > > > > +	return opt_result::failure_at (vect_location,
> > > > > > +				       "not vectorized:"
> > > > > > +				       " early breaks only supported on
> loops"
> > > > > > +				       " with known iteration
> bounds.\n");
> > > > > > +    }
> > > > > >
> > > > > >    info->conds
> > > > > >      = vect_get_loop_niters (loop, &info->assumptions,
> > > > > > @@ -1866,6 +1898,10 @@ vect_create_loop_vinfo (class loop *loop,
> > > > > vec_info_shared *shared,
> > > > > >    LOOP_VINFO_LOOP_CONDS (loop_vinfo).safe_splice (info-
> > > > > >alt_loop_conds);
> > > > > >    LOOP_VINFO_LOOP_IV_COND (loop_vinfo) = info->loop_cond;
> > > > > >
> > > > > > +  /* Check to see if we're vectorizing multiple exits.  */
> > > > > > +  LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
> > > > > > +    = !LOOP_VINFO_LOOP_CONDS (loop_vinfo).is_empty ();
> > > > > > +
> > > > > >    if (info->inner_loop_cond)
> > > > > >      {
> > > > > >        stmt_vec_info inner_loop_cond_info
> > > > > > @@ -3070,7 +3106,8 @@ start_over:
> > > > > >
> > > > > >    /* If an epilogue loop is required make sure we can create one.  */
> > > > > >    if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> > > > > > -      || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo))
> > > > > > +      || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)
> > > > > > +      || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > > >      {
> > > > > >        if (dump_enabled_p ())
> > > > > >          dump_printf_loc (MSG_NOTE, vect_location, "epilog loop
> > > required\n");
> > > > > > @@ -5797,7 +5834,7 @@ vect_create_epilog_for_reduction
> > > (loop_vec_info
> > > > > loop_vinfo,
> > > > > >    basic_block exit_bb;
> > > > > >    tree scalar_dest;
> > > > > >    tree scalar_type;
> > > > > > -  gimple *new_phi = NULL, *phi;
> > > > > > +  gimple *new_phi = NULL, *phi = NULL;
> > > > > >    gimple_stmt_iterator exit_gsi;
> > > > > >    tree new_temp = NULL_TREE, new_name, new_scalar_dest;
> > > > > >    gimple *epilog_stmt = NULL;
> > > > > > @@ -6039,6 +6076,33 @@ vect_create_epilog_for_reduction
> > > > > (loop_vec_info loop_vinfo,
> > > > > >  	  new_def = gimple_convert (&stmts, vectype, new_def);
> > > > > >  	  reduc_inputs.quick_push (new_def);
> > > > > >  	}
> > > > > > +
> > > > > > +	/* Update the other exits.  */
> > > > > > +	if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > > > +	  {
> > > > > > +	    vec<edge> alt_exits = LOOP_VINFO_ALT_EXITS (loop_vinfo);
> > > > > > +	    gphi_iterator gsi, gsi1;
> > > > > > +	    for (edge exit : alt_exits)
> > > > > > +	      {
> > > > > > +		/* Find the phi node to propaget into the exit block for
> each
> > > > > > +		   exit edge.  */
> > > > > > +		for (gsi = gsi_start_phis (exit_bb),
> > > > > > +		     gsi1 = gsi_start_phis (exit->src);
> > >
> > > exit->src == loop->header, right?  I think this won't work for multiple
> > > alternate exits.  It's probably easier to do this where we create the
> > > LC PHI node for the reduction result?
> > >
> > > > > > +		     !gsi_end_p (gsi) && !gsi_end_p (gsi1);
> > > > > > +		     gsi_next (&gsi), gsi_next (&gsi1))
> > > > > > +		  {
> > > > > > +		    /* There really should be a function to just get the
> number
> > > > > > +		       of phis inside a bb.  */
> > > > > > +		    if (phi && phi == gsi.phi ())
> > > > > > +		      {
> > > > > > +			gphi *phi1 = gsi1.phi ();
> > > > > > +			SET_PHI_ARG_DEF (phi, exit->dest_idx,
> > > > > > +					 PHI_RESULT (phi1));
> > >
> > > I think we know the header PHI of a reduction perfectly well, there
> > > shouldn't be the need to "search" for it.
> > >
> > > > > > +			break;
> > > > > > +		      }
> > > > > > +		  }
> > > > > > +	      }
> > > > > > +	  }
> > > > > >        gsi_insert_seq_before (&exit_gsi, stmts, GSI_SAME_STMT);
> > > > > >      }
> > > > > >
> > > > > > @@ -10355,6 +10419,13 @@ vectorizable_live_operation (vec_info
> > > *vinfo,
> > > > > >  	   new_tree = lane_extract <vec_lhs', ...>;
> > > > > >  	   lhs' = new_tree;  */
> > > > > >
> > > > > > +      /* When vectorizing an early break, any live statement that is
> used
> > > > > > +	 outside of the loop are dead.  The loop will never get to them.
> > > > > > +	 We could change the liveness value during analysis instead
> but since
> > > > > > +	 the below code is invalid anyway just ignore it during
> codegen.  */
> > > > > > +      if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > > > +	return true;
> > >
> > > But what about the value that's live across the main exit when the
> > > epilogue is not entered?
> > >
> > > > > > +
> > > > > >        class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > > > > >        basic_block exit_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
> > > > > >        gcc_assert (single_pred_p (exit_bb));
> > > > > > @@ -11277,7 +11348,7 @@ vect_transform_loop (loop_vec_info
> > > > > loop_vinfo, gimple *loop_vectorized_call)
> > > > > >    /* Make sure there exists a single-predecessor exit bb.  Do this before
> > > > > >       versioning.   */
> > > > > >    edge e = LOOP_VINFO_IV_EXIT (loop_vinfo);
> > > > > > -  if (! single_pred_p (e->dest))
> > > > > > +  if (e && ! single_pred_p (e->dest) &&
> !LOOP_VINFO_EARLY_BREAKS
> > > > > (loop_vinfo))
> > >
> > > e can be NULL here?  I think we should reject such loops earlier.
> > >
> > > > > >      {
> > > > > >        split_loop_exit_edge (e, true);
> > > > > >        if (dump_enabled_p ())
> > > > > > @@ -11303,7 +11374,7 @@ vect_transform_loop (loop_vec_info
> > > > > loop_vinfo, gimple *loop_vectorized_call)
> > > > > >    if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo))
> > > > > >      {
> > > > > >        e = single_exit (LOOP_VINFO_SCALAR_LOOP (loop_vinfo));
> > > > > > -      if (! single_pred_p (e->dest))
> > > > > > +      if (e && ! single_pred_p (e->dest))
> > > > > >  	{
> > > > > >  	  split_loop_exit_edge (e, true);
> > > > > >  	  if (dump_enabled_p ())
> > > > > > @@ -11641,7 +11712,8 @@ vect_transform_loop (loop_vec_info
> > > > > loop_vinfo, gimple *loop_vectorized_call)
> > > > > >
> > > > > >    /* Loops vectorized with a variable factor won't benefit from
> > > > > >       unrolling/peeling.  */
> > >
> > > update the comment?  Why would we unroll a VLA loop with early breaks?
> > > Or did you mean to use || LOOP_VINFO_EARLY_BREAKS (loop_vinfo)?
> > >
> > > > > > -  if (!vf.is_constant ())
> > > > > > +  if (!vf.is_constant ()
> > > > > > +      && !LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > > >      {
> > > > > >        loop->unroll = 1;
> > > > > >        if (dump_enabled_p ())
> > > > > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > > > > > index
> > > > >
> > >
> 87c4353fa5180fcb7f60b192897456cf24f3fdbe..03524e8500ee06df42f82af
> > > > > e78ee2a7c627be45b 100644
> > > > > > --- a/gcc/tree-vect-stmts.cc
> > > > > > +++ b/gcc/tree-vect-stmts.cc
> > > > > > @@ -344,9 +344,34 @@ vect_stmt_relevant_p (stmt_vec_info
> > > stmt_info,
> > > > > loop_vec_info loop_vinfo,
> > > > > >    *live_p = false;
> > > > > >
> > > > > >    /* cond stmt other than loop exit cond.  */
> > > > > > -  if (is_ctrl_stmt (stmt_info->stmt)
> > > > > > -      && STMT_VINFO_TYPE (stmt_info) !=
> loop_exit_ctrl_vec_info_type)
> > > > > > -    *relevant = vect_used_in_scope;
> > >
> > > how was that ever hit before?  For outer loop processing with outer loop
> > > vectorization?
> > >
> > > > > > +  if (is_ctrl_stmt (stmt_info->stmt))
> > > > > > +    {
> > > > > > +      /* Ideally EDGE_LOOP_EXIT would have been set on the exit edge,
> > > but
> > > > > > +	 it looks like loop_manip doesn't do that..  So we have to do it
> > > > > > +	 the hard way.  */
> > > > > > +      basic_block bb = gimple_bb (stmt_info->stmt);
> > > > > > +      bool exit_bb = false, early_exit = false;
> > > > > > +      edge_iterator ei;
> > > > > > +      edge e;
> > > > > > +      FOR_EACH_EDGE (e, ei, bb->succs)
> > > > > > +        if (!flow_bb_inside_loop_p (loop, e->dest))
> > > > > > +	  {
> > > > > > +	    exit_bb = true;
> > > > > > +	    early_exit = loop->vec_loop_iv->src != bb;
> > > > > > +	    break;
> > > > > > +	  }
> > > > > > +
> > > > > > +      /* We should have processed any exit edge, so an edge not an
> early
> > > > > > +	 break must be a loop IV edge.  We need to distinguish
> between the
> > > > > > +	 two as we don't want to generate code for the main loop IV.
> */
> > > > > > +      if (exit_bb)
> > > > > > +	{
> > > > > > +	  if (early_exit)
> > > > > > +	    *relevant = vect_used_in_scope;
> > > > > > +	}
> > >
> > > I wonder why you can't simply do
> > >
> > >          if (is_ctrl_stmt (stmt_info->stmt)
> > >              && stmt_info->stmt != LOOP_VINFO_COND (loop_info))
> > >
> > > ?
> > >
> > > > > > +      else if (bb->loop_father == loop)
> > > > > > +	LOOP_VINFO_GENERAL_CTR_FLOW (loop_vinfo) = true;
> > >
> > > so for control flow not exiting the loop you can check
> > > loop_exits_from_bb_p ().
> > >
> > > > > > +    }
> > > > > >
> > > > > >    /* changing memory.  */
> > > > > >    if (gimple_code (stmt_info->stmt) != GIMPLE_PHI)
> > > > > > @@ -359,6 +384,11 @@ vect_stmt_relevant_p (stmt_vec_info
> > > stmt_info,
> > > > > loop_vec_info loop_vinfo,
> > > > > >  	*relevant = vect_used_in_scope;
> > > > > >        }
> > > > > >
> > > > > > +  auto_vec<edge> exits = get_loop_exit_edges (loop);
> > > > > > +  auto_bitmap exit_bbs;
> > > > > > +  for (edge exit : exits)
> > > > > > +    bitmap_set_bit (exit_bbs, exit->dest->index);
> > > > > > +
> > > > > >    /* uses outside the loop.  */
> > > > > >    FOR_EACH_PHI_OR_STMT_DEF (def_p, stmt_info->stmt, op_iter,
> > > > > SSA_OP_DEF)
> > > > > >      {
> > > > > > @@ -377,7 +407,7 @@ vect_stmt_relevant_p (stmt_vec_info
> stmt_info,
> > > > > loop_vec_info loop_vinfo,
> > > > > >  	      /* We expect all such uses to be in the loop exit phis
> > > > > >  		 (because of loop closed form)   */
> > > > > >  	      gcc_assert (gimple_code (USE_STMT (use_p)) ==
> GIMPLE_PHI);
> > > > > > -	      gcc_assert (bb == single_exit (loop)->dest);
> > > > > > +	      gcc_assert (bitmap_bit_p (exit_bbs, bb->index));
> > >
> > > That now becomes quite expensive checking already covered by the LC SSA
> > > verifier so I suggest to simply drop this assert instead.
> > >
> > > > > >                *live_p = true;
> > > > > >  	    }
> > > > > > @@ -683,6 +713,13 @@ vect_mark_stmts_to_be_vectorized
> > > > > (loop_vec_info loop_vinfo, bool *fatal)
> > > > > >  	}
> > > > > >      }
> > > > > >
> > > > > > +  /* Ideally this should be in vect_analyze_loop_form but we haven't
> > > seen all
> > > > > > +     the conds yet at that point and there's no quick way to retrieve
> them.
> > > */
> > > > > > +  if (LOOP_VINFO_GENERAL_CTR_FLOW (loop_vinfo))
> > > > > > +    return opt_result::failure_at (vect_location,
> > > > > > +				   "not vectorized:"
> > > > > > +				   " unsupported control flow in
> loop.\n");
> > >
> > > so we didn't do this before?  But see above where I wondered.  So when
> > > does this hit with early exits and why can't we check for this in
> > > vect_verify_loop_form?
> > >
> > > > > > +
> > > > > >    /* 2. Process_worklist */
> > > > > >    while (worklist.length () > 0)
> > > > > >      {
> > > > > > @@ -778,6 +815,20 @@ vect_mark_stmts_to_be_vectorized
> > > > > (loop_vec_info loop_vinfo, bool *fatal)
> > > > > >  			return res;
> > > > > >  		    }
> > > > > >                   }
> > > > > > +	    }
> > > > > > +	  else if (gcond *cond = dyn_cast <gcond *> (stmt_vinfo-
> >stmt))
> > > > > > +	    {
> > > > > > +	      enum tree_code rhs_code = gimple_cond_code (cond);
> > > > > > +	      gcc_assert (TREE_CODE_CLASS (rhs_code) ==
> tcc_comparison);
> > > > > > +	      opt_result res
> > > > > > +		= process_use (stmt_vinfo, gimple_cond_lhs (cond),
> > > > > > +			       loop_vinfo, relevant, &worklist, false);
> > > > > > +	      if (!res)
> > > > > > +		return res;
> > > > > > +	      res = process_use (stmt_vinfo, gimple_cond_rhs (cond),
> > > > > > +				loop_vinfo, relevant, &worklist, false);
> > > > > > +	      if (!res)
> > > > > > +		return res;
> > > > > >              }
> > > > > >  	  else if (gcall *call = dyn_cast <gcall *> (stmt_vinfo->stmt))
> > > > > >  	    {
> > > > > > @@ -11919,11 +11970,15 @@ vect_analyze_stmt (vec_info *vinfo,
> > > > > >  			     node_instance, cost_vec);
> > > > > >        if (!res)
> > > > > >  	return res;
> > > > > > -   }
> > > > > > +    }
> > > > > > +
> > > > > > +  if (is_ctrl_stmt (stmt_info->stmt))
> > > > > > +    STMT_VINFO_DEF_TYPE (stmt_info) = vect_early_exit_def;
> > > > > >
> > > > > >    switch (STMT_VINFO_DEF_TYPE (stmt_info))
> > > > > >      {
> > > > > >        case vect_internal_def:
> > > > > > +      case vect_early_exit_def:
> > > > > >          break;
> > > > > >
> > > > > >        case vect_reduction_def:
> > > > > > @@ -11956,6 +12011,7 @@ vect_analyze_stmt (vec_info *vinfo,
> > > > > >      {
> > > > > >        gcall *call = dyn_cast <gcall *> (stmt_info->stmt);
> > > > > >        gcc_assert (STMT_VINFO_VECTYPE (stmt_info)
> > > > > > +		  || gimple_code (stmt_info->stmt) == GIMPLE_COND
> > > > > >  		  || (call && gimple_call_lhs (call) == NULL_TREE));
> > > > > >        *need_to_vectorize = true;
> > > > > >      }
> > > > > > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> > > > > > index
> > > > >
> > >
> ec65b65b5910e9cbad0a8c7e83c950b6168b98bf..24a0567a2f23f1b3d8b3
> > > > > 40baff61d18da8e242dd 100644
> > > > > > --- a/gcc/tree-vectorizer.h
> > > > > > +++ b/gcc/tree-vectorizer.h
> > > > > > @@ -63,6 +63,7 @@ enum vect_def_type {
> > > > > >    vect_internal_def,
> > > > > >    vect_induction_def,
> > > > > >    vect_reduction_def,
> > > > > > +  vect_early_exit_def,
> > >
> > > can you avoid putting this inbetween reduction and double reduction
> > > please?  Just put it before vect_unknown_def_type.  In fact the COND
> > > isn't a def ... maybe we should have pattern recogized
> > >
> > >  if (a < b) exit;
> > >
> > > as
> > >
> > >  cond = a < b;
> > >  if (cond != 0) exit;
> > >
> > > so the part that we need to vectorize is more clear.
> > >
> > > > > >    vect_double_reduction_def,
> > > > > >    vect_nested_cycle,
> > > > > >    vect_first_order_recurrence,
> > > > > > @@ -876,6 +877,13 @@ public:
> > > > > >       we need to peel off iterations at the end to form an epilogue loop.
> */
> > > > > >    bool peeling_for_niter;
> > > > > >
> > > > > > +  /* When the loop has early breaks that we can vectorize we need to
> > > peel
> > > > > > +     the loop for the break finding loop.  */
> > > > > > +  bool early_breaks;
> > > > > > +
> > > > > > +  /* When the loop has a non-early break control flow inside.  */
> > > > > > +  bool non_break_control_flow;
> > > > > > +
> > > > > >    /* List of loop additional IV conditionals found in the loop.  */
> > > > > >    auto_vec<gcond *> conds;
> > > > > >
> > > > > > @@ -985,9 +993,11 @@ public:
> > > > > >  #define LOOP_VINFO_REDUCTION_CHAINS(L)     (L)-
> >reduction_chains
> > > > > >  #define LOOP_VINFO_PEELING_FOR_GAPS(L)     (L)-
> >peeling_for_gaps
> > > > > >  #define LOOP_VINFO_PEELING_FOR_NITER(L)    (L)-
> >peeling_for_niter
> > > > > > +#define LOOP_VINFO_EARLY_BREAKS(L)         (L)->early_breaks
> > > > > >  #define LOOP_VINFO_EARLY_BRK_CONFLICT_STMTS(L) (L)-
> > > > > >early_break_conflict
> > > > > >  #define LOOP_VINFO_EARLY_BRK_DEST_BB(L)    (L)-
> > > >early_break_dest_bb
> > > > > >  #define LOOP_VINFO_EARLY_BRK_VUSES(L)      (L)-
> >early_break_vuses
> > > > > > +#define LOOP_VINFO_GENERAL_CTR_FLOW(L)     (L)-
> > > > > >non_break_control_flow
> > > > > >  #define LOOP_VINFO_LOOP_CONDS(L)           (L)->conds
> > > > > >  #define LOOP_VINFO_LOOP_IV_COND(L)         (L)->loop_iv_cond
> > > > > >  #define LOOP_VINFO_NO_DATA_DEPENDENCIES(L) (L)-
> > > > > >no_data_dependencies
> > > > > > @@ -1038,8 +1048,8 @@ public:
> > > > > >     stack.  */
> > > > > >  typedef opt_pointer_wrapper <loop_vec_info> opt_loop_vec_info;
> > > > > >
> > > > > > -inline loop_vec_info
> > > > > > -loop_vec_info_for_loop (class loop *loop)
> > > > > > +static inline loop_vec_info
> > > > > > +loop_vec_info_for_loop (const class loop *loop)
> > > > > >  {
> > > > > >    return (loop_vec_info) loop->aux;
> > > > > >  }
> > > > > > @@ -1789,7 +1799,7 @@ is_loop_header_bb_p (basic_block bb)
> > > > > >  {
> > > > > >    if (bb == (bb->loop_father)->header)
> > > > > >      return true;
> > > > > > -  gcc_checking_assert (EDGE_COUNT (bb->preds) == 1);
> > > > > > +
> > > > > >    return false;
> > > > > >  }
> > > > > >
> > > > > > @@ -2176,9 +2186,10 @@ class auto_purge_vect_location
> > > > > >     in tree-vect-loop-manip.cc.  */
> > > > > >  extern void vect_set_loop_condition (class loop *, loop_vec_info,
> > > > > >  				     tree, tree, tree, bool);
> > > > > > -extern bool slpeel_can_duplicate_loop_p (const class loop *,
> > > const_edge);
> > > > > > +extern bool slpeel_can_duplicate_loop_p (const loop_vec_info,
> > > > > const_edge);
> > > > > >  class loop *slpeel_tree_duplicate_loop_to_edge_cfg (class loop *,
> > > > > > -						     class loop *, edge);
> > > > > > +						    class loop *, edge,
> bool,
> > > > > > +						    vec<basic_block> *
> = NULL);
> > > > > >  class loop *vect_loop_versioning (loop_vec_info, gimple *);
> > > > > >  extern class loop *vect_do_peeling (loop_vec_info, tree, tree,
> > > > > >  				    tree *, tree *, tree *, int, bool, bool,
> > > > > > diff --git a/gcc/tree-vectorizer.cc b/gcc/tree-vectorizer.cc
> > > > > > index
> > > > >
> > >
> a048e9d89178a37455bd7b83ab0f2a238a4ce69e..0dc5479dc92058b6c70c
> > > > > 67f29f5dc9a8d72235f4 100644
> > > > > > --- a/gcc/tree-vectorizer.cc
> > > > > > +++ b/gcc/tree-vectorizer.cc
> > > > > > @@ -1379,7 +1379,9 @@ pass_vectorize::execute (function *fun)
> > > > > >  	 predicates that need to be shared for optimal predicate usage.
> > > > > >  	 However reassoc will re-order them and prevent CSE from
> working
> > > > > >  	 as it should.  CSE only the loop body, not the entry.  */
> > > > > > -      bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
> > > > > > +      auto_vec<edge> exits = get_loop_exit_edges (loop);
> > >
> > > seeing this more and more I think we want a simple way to iterate over
> > > all exits without copying to a vector when we have them recorded.  My
> > > C++ fu is too limited to support
> > >
> > >   for (auto exit : recorded_exits (loop))
> > >     ...
> > >
> > > (maybe that's enough for somebody to jump onto this ;))
> > >
> > > Don't treat all review comments as change orders, but it should be clear
> > > the code isn't 100% obvious.  Maybe the patch can be simplified by
> > > splitting out the LC SSA cleanup parts.
> > >
> > > Thanks,
> > > Richard.
> > >
> > > > > > +      for (edge exit : exits)
> > > > > > +	bitmap_set_bit (exit_bbs, exit->dest->index);
> > > > > >
> > > > > >        edge entry = EDGE_PRED (loop_preheader_edge (loop)->src, 0);
> > > > > >        do_rpo_vn (fun, entry, exit_bbs);
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > > --
> > > > > Richard Biener <rguenther@suse.de>
> > > > > SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461
> > > > > Nuernberg,
> > > > > Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien
> > > > > Moerman;
> > > > > HRB 36809 (AG Nuernberg)
> > > >
> >
> 
> --
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG
> Nuernberg)
  
Richard Biener Aug. 18, 2023, 1:15 p.m. UTC | #9
On Fri, 18 Aug 2023, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <rguenther@suse.de>
> > Sent: Friday, August 18, 2023 2:53 PM
> > To: Tamar Christina <Tamar.Christina@arm.com>
> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; jlaw@ventanamicro.com
> > Subject: RE: [PATCH 12/19]middle-end: implement loop peeling and IV
> > updates for early break.
> > 
> > On Fri, 18 Aug 2023, Tamar Christina wrote:
> > 
> > > > > Yeah if you comment it out one of the testcases should fail.
> > > >
> > > > using new_preheader instead of e->dest would make things clearer.
> > > >
> > > > You are now adding the same arg to every exit (you've just queried the
> > > > main exit redirect_edge_var_map_vector).
> > > >
> > > > OK, so I think I understand what you're doing.  If I understand
> > > > correctly we know that when we exit the main loop via one of the
> > > > early exits we are definitely going to enter the epilog but when
> > > > we take the main exit we might not.
> > > >
> > >
> > > Correct.. but..
> > >
> > > > Looking at the CFG we create currently this isn't reflected and
> > > > this complicates this PHI node updating.  What I'd try to do
> > > > is leave redirecting the alternate exits until after
> > >
> > > It is, in the case of the alternate exits this is reflected in copying
> > > the same values, as they are the values of the number of completed
> > > iterations since the scalar code restarts the last iteration.
> > >
> > > So all the PHI nodes of the alternate exits are correct.  The vector
> > > Iteration doesn't handle the partial iteration.
> > >
> > > > slpeel_tree_duplicate_loop_to_edge_cfg finished which probably
> > > > means leaving it almost unchanged besides the LC SSA maintaining
> > > > changes.  After that for the multi-exit case split the
> > > > epilog preheader edge and redirect all the alternate exits to the
> > > > new preheader.  So the CFG becomes
> > > >
> > > >                  <original loop>
> > > >                 /      |
> > > >                /    <main exit w/ original LC PHI>
> > > >               /      if (epilog)
> > > >    alt exits /        /  \
> > > >             /        /    loop around
> > > >             |       /
> > > >            preheader with "header" PHIs
> > > >               |
> > > >           <epilog>
> > > >
> > > > note you need the header PHIs also on the main exit path but you
> > > > only need the loop end PHIs there.
> > > >
> > > > It seems so that at least currently the order of things makes
> > > > them more complicated than necessary.
> > >
> > > I've been trying to, but this representation seems a lot harder to work with,
> > > In particular at the moment once we exit
> > slpeel_tree_duplicate_loop_to_edge_cfg
> > > the loop structure is exactly the same as one expects from any normal epilog
> > vectorization.
> > >
> > > But this new representation requires me to place the guard much earlier than
> > the epilogue
> > > preheader,  yet I still have to adjust the PHI nodes in the preheader.  So it
> > seems that this split
> > > is there to only indicate that we always enter the epilog when taking an early
> > exit.
> > >
> > > Today this is reflected in the values of the PHI nodes rather than structurally.
> > Once we place
> > > The guard we update the nodes and the alternate exits get their value for
> > ivtmp updated to VF.
> > >
> > > This representation also forces me to do the redirection in every call site of
> > > slpeel_tree_duplicate_loop_to_edge_cfg making the code more complicated
> > in all use sites.
> > >
> > > But I think this doesn't address the main reason why the
> > slpeel_tree_duplicate_loop_to_edge_cfg
> > > code has a large block of code to deal with PHI node updates.
> > >
> > > The reason as you mentioned somewhere else is that after we redirect the
> > edges I have to reconstruct
> > > the phi nodes.  For most it's straight forwards, but for live values or vuse
> > chains it requires extra code.
> > >
> > > You're right in that before we redirect the edges they are all correct in the exit
> > block, you mentioned that
> > > the API for the edge redirection is supposed to copy the values over if I
> > create the phi nodes before hand.
> > >
> > > However this doesn't seem to work:
> > >
> > >      for (auto gsi_from = gsi_start_phis (scalar_exit->dest);
> > > 	   !gsi_end_p (gsi_from); gsi_next (&gsi_from))
> > > 	{
> > > 	  gimple *from_phi = gsi_stmt (gsi_from);
> > > 	  tree new_res = copy_ssa_name (gimple_phi_result (from_phi));
> > > 	  create_phi_node (new_res, new_preheader);
> > > 	}
> > >
> > >       for (edge exit : loop_exits)
> > > 	redirect_edge_and_branch (exit, new_preheader);
> > >
> > > Still leaves them empty.  Grepping around most code seems to pair
> > redirect_edge_and_branch with
> > > copy_phi_arg_into_existing_phi.  The problem is that in all these cases after
> > redirecting an edge they
> > > call copy_phi_arg_into_existing_phi from a predecessor edge to fill in the phi
> > nodes.
> > 
> > You need to call flush_pending_stmts on each edge you redirect.
> > copy_phi_arg_into_existing_phi isn't suitable for edge redirecting.
> 
> Oh. I'll give that a try, that would make sense.. I didn't flush it in the current approach
> because I needed the map, but since I want to get rid of the map, this makes sense.
> 
> > 
> > > This is because as I redirect_edge_and_branch destroys the phi node entries
> > and copy_phi_arg_into_existing_phi
> > > simply just reads the gimple_phi_arg_def which would be NULL.
> > >
> > > You could point it to the src block of the exit, in which case it copies the
> > wrong values in for the vuses.  At the end
> > > of vectorization the cfgcleanup code does the same thing to maintain LCSSA
> > if you haven't.  This code always goes
> > > wrong for multiple exits because of the problem described above.  There's no
> > node for it to copy the right value
> > > from.
> > >
> > > As an alternate approach I can split the exit edges, copy the phi nodes into
> > the split and after that redirect them.
> > > This however creates the awkwardness of having the exit edges no longer
> > connect to the preheader.
> > >
> > > All of this then begs the question if this is all easier than the current approach
> > which is just to read the edge var
> > > map to figure out the nodes that were removed during the redirect.
> > 
> > But the edge map is supposed to be applied via flush_pending_stmts,
> > specifically it relies on PHI nodes having a 1:1 correspondence between
> > old and new destination and thus is really designed for the case
> > you copy the destination and redirect an edge to the copy.
> 
> Ah you were referring to flush_pending_stmts,  right that would make more sense.
> Will give it  a go, thanks!
> 
> > 
> > That is, the main issue I have with the CFG manipulation is that it
> > isn't broken down to simple operations that in themselves leave
> > everything correct.  I think it should be possible to do this
> > and as 2nd step only do the special massaging for the early exit
> > LC PHIs that feed into the epilogue loop.
> > 
> 
> I see, So I think, If I understand what you're saying correctly, is that
> you would like instead of having 1 big function, to have little smaller
> helper functions that each on their own do something small but
> correct and when called in sequence perform the complicated work?
> 
> i.e. you still have one top level function, and at the end of that call
> everything is in order? 

Well, ideally we'd be able to use existing APIs to lay out the CFG
and only need to do vectorizer transform specific things for the
"missing" PHIs.

> > As you say the code is quite complicated even without early break
> > vectorization which is why I originally suggested to try "fixing" it
> > as prerequesite.  It does have the same fundamental issue when feeding
> > the epilogue - the "missing" LC PHIs, the difference is only that
> > without early break vectorization we take the exit values while
> > for early break vectorization we take the latch values from the
> > previous iteration(?)
> 
> Ok, If I understood the above correctly with how you wanted the sequence
> split out, then would this work for you:
> 
> I pull out from the patch series:
> 
> 1. the single_exit removal and using of our own IV in the vectorizer.
> 2. The refactoring of the current peeling code, without new functionality but
>      splitting it up into the logical little helper functions.
> 
> And get those committed separately and then rebase on the early break stuff
> which will add handling the additional case for multiple exits in the peeling step?

Yeah, that would be nice.

> I already did 1, I didn't do 2 because the function was reworked in one go to both
> clean it up and support multiple exits.  But would make sense to pull them out.
> 
> Is it ok to also pull out the vectorable_comparison refactoring as well? I haven't
> committed it yet because without the early_break support it doesn't look useful
> but rebases are tricky when it changes.

I don't remember exactly what that was, but I think yes, that's fine.

> > 
> > > Maybe I'm still misunderstanding the API, but reading the sources of the
> > functions, they all copy values from *existing*
> > > phi nodes.  And any existing phi node after the redirect are not correct.
> > >
> > > gimple_redirect_edge_and_branch has a chunk that indicates it should have
> > updated the PHI nodes before calling
> > > ssa_redirect_edge to remove the old ones, but there's no code there. It's all
> > empty.
> > >
> > > Most of the other refactorings/changes were easy enough to do, but this
> > one I seem to be a struggling with.
> > 
> > I see.  If you are tired of trying feel free to send an updated series
> > with the other changes, if it looks awkward but correct we can see
> > someone doing the cleanup afterwards.
> > 
> 
> Thanks, I'm not tired yet ?, just confused, and I think it's clearer now.  If you agree
> with the above I can pull them out of the series.
> 
> Cheers,
> Tamar
> 
> > Richard.
> > 
> > > Thanks,
> > > Tamar
> > > >
> > > > > >
> > > > > > >  	}
> > > > > > > -      redirect_edge_and_branch_force (e, new_preheader);
> > > > > > > -      flush_pending_stmts (e);
> > > > > > > +
> > > > > > >        set_immediate_dominator (CDI_DOMINATORS, new_preheader,
> > e-
> > > > >src);
> > > > > > > -      if (was_imm_dom || duplicate_outer_loop)
> > > > > > > +
> > > > > > > +      if ((was_imm_dom || duplicate_outer_loop) &&
> > !multiple_exits_p)
> > > > > > >  	set_immediate_dominator (CDI_DOMINATORS, exit_dest,
> > new_exit-
> > > > > > >src);
> > > > > > >
> > > > > > >        /* And remove the non-necessary forwarder again.  Keep the
> > other
> > > > > > > @@ -1647,9 +1756,42 @@ slpeel_tree_duplicate_loop_to_edge_cfg
> > > > (class
> > > > > > loop *loop,
> > > > > > >        delete_basic_block (preheader);
> > > > > > >        set_immediate_dominator (CDI_DOMINATORS, scalar_loop-
> > >header,
> > > > > > >  			       loop_preheader_edge (scalar_loop)->src);
> > > > > > > +
> > > > > > > +      /* Finally after wiring the new epilogue we need to update its
> > main
> > > > exit
> > > > > > > +	 to the original function exit we recorded.  Other exits are
> > already
> > > > > > > +	 correct.  */
> > > > > > > +      if (multiple_exits_p)
> > > > > > > +	{
> > > > > > > +	  for (edge e : get_loop_exit_edges (loop))
> > > > > > > +	    doms.safe_push (e->dest);
> > > > > > > +	  update_loop = new_loop;
> > > > > > > +	  doms.safe_push (exit_dest);
> > > > > > > +
> > > > > > > +	  /* Likely a fall-through edge, so update if needed.  */
> > > > > > > +	  if (single_succ_p (exit_dest))
> > > > > > > +	    doms.safe_push (single_succ (exit_dest));
> > > > > > > +	}
> > > > > > >      }
> > > > > > >    else /* Add the copy at entry.  */
> > > > > > >      {
> > > > > > > +      /* Copy the current loop LC PHI nodes between the original loop
> > exit
> > > > > > > +	 block and the new loop header.  This allows us to later split
> > the
> > > > > > > +	 preheader block and still find the right LC nodes.  */
> > > > > > > +      edge old_latch_loop = loop_latch_edge (loop);
> > > > > > > +      edge old_latch_init = loop_preheader_edge (loop);
> > > > > > > +      edge new_latch_loop = loop_latch_edge (new_loop);
> > > > > > > +      edge new_latch_init = loop_preheader_edge (new_loop);
> > > > > > > +      for (auto gsi_from = gsi_start_phis (new_latch_init->dest),
> > > > > >
> > > > > > see above
> > > > > >
> > > > > > > +	   gsi_to = gsi_start_phis (old_latch_loop->dest);
> > > > > > > +	   flow_loops && !gsi_end_p (gsi_from) && !gsi_end_p
> > (gsi_to);
> > > > > > > +	   gsi_next (&gsi_from), gsi_next (&gsi_to))
> > > > > > > +	{
> > > > > > > +	  gimple *from_phi = gsi_stmt (gsi_from);
> > > > > > > +	  gimple *to_phi = gsi_stmt (gsi_to);
> > > > > > > +	  tree new_arg = PHI_ARG_DEF_FROM_EDGE (from_phi,
> > > > > > new_latch_loop);
> > > > > > > +	  adjust_phi_and_debug_stmts (to_phi, old_latch_init,
> > new_arg);
> > > > > > > +	}
> > > > > > > +
> > > > > > >        if (scalar_loop != loop)
> > > > > > >  	{
> > > > > > >  	  /* Remove the non-necessary forwarder of scalar_loop again.
> > */
> > > > > > > @@ -1677,31 +1819,36 @@
> > slpeel_tree_duplicate_loop_to_edge_cfg
> > > > (class
> > > > > > loop *loop,
> > > > > > >        delete_basic_block (new_preheader);
> > > > > > >        set_immediate_dominator (CDI_DOMINATORS, new_loop-
> > >header,
> > > > > > >  			       loop_preheader_edge (new_loop)->src);
> > > > > > > +
> > > > > > > +      if (multiple_exits_p)
> > > > > > > +	update_loop = loop;
> > > > > > >      }
> > > > > > >
> > > > > > > -  if (scalar_loop != loop)
> > > > > > > +  if (multiple_exits_p)
> > > > > > >      {
> > > > > > > -      /* Update new_loop->header PHIs, so that on the preheader
> > > > > > > -	 edge they are the ones from loop rather than scalar_loop.  */
> > > > > > > -      gphi_iterator gsi_orig, gsi_new;
> > > > > > > -      edge orig_e = loop_preheader_edge (loop);
> > > > > > > -      edge new_e = loop_preheader_edge (new_loop);
> > > > > > > -
> > > > > > > -      for (gsi_orig = gsi_start_phis (loop->header),
> > > > > > > -	   gsi_new = gsi_start_phis (new_loop->header);
> > > > > > > -	   !gsi_end_p (gsi_orig) && !gsi_end_p (gsi_new);
> > > > > > > -	   gsi_next (&gsi_orig), gsi_next (&gsi_new))
> > > > > > > +      for (edge e : get_loop_exit_edges (update_loop))
> > > > > > >  	{
> > > > > > > -	  gphi *orig_phi = gsi_orig.phi ();
> > > > > > > -	  gphi *new_phi = gsi_new.phi ();
> > > > > > > -	  tree orig_arg = PHI_ARG_DEF_FROM_EDGE (orig_phi, orig_e);
> > > > > > > -	  location_t orig_locus
> > > > > > > -	    = gimple_phi_arg_location_from_edge (orig_phi, orig_e);
> > > > > > > -
> > > > > > > -	  add_phi_arg (new_phi, orig_arg, new_e, orig_locus);
> > > > > > > +	  edge ex;
> > > > > > > +	  edge_iterator ei;
> > > > > > > +	  FOR_EACH_EDGE (ex, ei, e->dest->succs)
> > > > > > > +	    {
> > > > > > > +	      /* Find the first non-fallthrough block as fall-throughs can't
> > > > > > > +		 dominate other blocks.  */
> > > > > > > +	      while ((ex->flags & EDGE_FALLTHRU)
> > > >
> > > > For the prologue peeling any early exit we take would skip all other
> > > > loops so we can simply leave them and their LC PHI nodes in place.
> > > > We need extra PHIs only on the path to the main vector loop.  I
> > > > think the comment isn't accurately reflecting what we do.  In
> > > > fact we do not add any LC PHI nodes here but simply adjust the
> > > > main loop header PHI arguments?
> > > >
> > > > > > I don't think EDGE_FALLTHRU is set correctly, what's wrong with
> > > > > > just using single_succ_p here?  A fallthru edge src dominates the
> > > > > > fallthru edge dest, so the sentence above doesn't make sense.
> > > > >
> > > > > I wanted to say, that the immediate dominator of a block is never
> > > > > an fall through block.  At least from what I understood from how
> > > > > the dominators are calculated in the code, though may have missed
> > > > > something.
> > > >
> > > >  BB1
> > > >   |
> > > >  BB2
> > > >   |
> > > >  BB3
> > > >
> > > > here the immediate dominator of BB3 is BB2 and that of BB2 is BB1.
> > > >
> > > > > >
> > > > > > > +		     && single_succ_p (ex->dest))
> > > > > > > +		{
> > > > > > > +		  doms.safe_push (ex->dest);
> > > > > > > +		  ex = single_succ_edge (ex->dest);
> > > > > > > +		}
> > > > > > > +	      doms.safe_push (ex->dest);
> > > > > > > +	    }
> > > > > > > +	  doms.safe_push (e->dest);
> > > > > > >  	}
> > > > > > > -    }
> > > > > > >
> > > > > > > +      iterate_fix_dominators (CDI_DOMINATORS, doms, false);
> > > > > > > +      if (updated_doms)
> > > > > > > +	updated_doms->safe_splice (doms);
> > > > > > > +    }
> > > > > > >    free (new_bbs);
> > > > > > >    free (bbs);
> > > > > > >
> > > > > > > @@ -1777,6 +1924,9 @@ slpeel_can_duplicate_loop_p (const
> > > > > > loop_vec_info loop_vinfo, const_edge e)
> > > > > > >    gimple_stmt_iterator loop_exit_gsi = gsi_last_bb (exit_e->src);
> > > > > > >    unsigned int num_bb = loop->inner? 5 : 2;
> > > > > > >
> > > > > > > +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > > > > +    num_bb += LOOP_VINFO_ALT_EXITS (loop_vinfo).length ();
> > > > > > > +
> > > > > >
> > > > > > I think checking the number of BBs is odd, I don't remember anything
> > > > > > in slpeel is specifically tied to that?  I think we can simply drop
> > > > > > this or do you remember anything that would depend on ->num_nodes
> > > > > > being only exactly 5 or 2?
> > > > >
> > > > > Never actually seemed to require it, but they're used as some check to
> > > > > see if there are unexpected control flow in the loop.
> > > > >
> > > > > i.e. this would say no if you have an if statement in the loop that wasn't
> > > > > converted.  The other part of this and the accompanying explanation is in
> > > > > vect_analyze_loop_form.  In the patch series I had to remove the hard
> > > > > num_nodes == 2 check from there because number of nodes restricted
> > > > > things too much.  If you have an empty fall through block, which seems
> > to
> > > > > happen often between the main exit and the latch block then we'd not
> > > > > vectorize.
> > > > >
> > > > > So instead I now rejects loops after analyzing the gcond.  So think this
> > check
> > > > > can go/needs to be different.
> > > >
> > > > Lets remove it from this function then.
> > > >
> > > > > >
> > > > > > >    /* All loops have an outer scope; the only case loop->outer is NULL is
> > for
> > > > > > >       the function itself.  */
> > > > > > >    if (!loop_outer (loop)
> > > > > > > @@ -2044,6 +2194,11 @@ vect_update_ivs_after_vectorizer
> > > > > > (loop_vec_info loop_vinfo,
> > > > > > >    class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > > > > > >    basic_block update_bb = update_e->dest;
> > > > > > >
> > > > > > > +  /* For early exits we'll update the IVs in
> > > > > > > +     vect_update_ivs_after_early_break.  */
> > > > > > > +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > > > > +    return;
> > > > > > > +
> > > > > > >    basic_block exit_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
> > > > > > >
> > > > > > >    /* Make sure there exists a single-predecessor exit bb:  */
> > > > > > > @@ -2131,6 +2286,208 @@ vect_update_ivs_after_vectorizer
> > > > > > (loop_vec_info loop_vinfo,
> > > > > > >        /* Fix phi expressions in the successor bb.  */
> > > > > > >        adjust_phi_and_debug_stmts (phi1, update_e, ni_name);
> > > > > > >      }
> > > > > > > +  return;
> > > > > >
> > > > > > we don't usually place a return at the end of void functions
> > > > > >
> > > > > > > +}
> > > > > > > +
> > > > > > > +/*   Function vect_update_ivs_after_early_break.
> > > > > > > +
> > > > > > > +     "Advance" the induction variables of LOOP to the value they
> > should
> > > > take
> > > > > > > +     after the execution of LOOP.  This is currently necessary because
> > the
> > > > > > > +     vectorizer does not handle induction variables that are used after
> > the
> > > > > > > +     loop.  Such a situation occurs when the last iterations of LOOP are
> > > > > > > +     peeled, because of the early exit.  With an early exit we always peel
> > > > the
> > > > > > > +     loop.
> > > > > > > +
> > > > > > > +     Input:
> > > > > > > +     - LOOP_VINFO - a loop info structure for the loop that is going to
> > be
> > > > > > > +		    vectorized. The last few iterations of LOOP were
> > peeled.
> > > > > > > +     - LOOP - a loop that is going to be vectorized. The last few
> > iterations
> > > > > > > +	      of LOOP were peeled.
> > > > > > > +     - VF - The loop vectorization factor.
> > > > > > > +     - NITERS_ORIG - the number of iterations that LOOP executes
> > (before
> > > > it is
> > > > > > > +		     vectorized). i.e, the number of times the ivs should
> > be
> > > > > > > +		     bumped.
> > > > > > > +     - NITERS_VECTOR - The number of iterations that the vector LOOP
> > > > > > executes.
> > > > > > > +     - UPDATE_E - a successor edge of LOOP->exit that is on the (only)
> > > > path
> > > > > > > +		  coming out from LOOP on which there are uses of
> > the LOOP
> > > > > > ivs
> > > > > > > +		  (this is the path from LOOP->exit to epilog_loop-
> > >preheader).
> > > > > > > +
> > > > > > > +		  The new definitions of the ivs are placed in LOOP-
> > >exit.
> > > > > > > +		  The phi args associated with the edge UPDATE_E in
> > the bb
> > > > > > > +		  UPDATE_E->dest are updated accordingly.
> > > > > > > +
> > > > > > > +     Output:
> > > > > > > +       - If available, the LCSSA phi node for the loop IV temp.
> > > > > > > +
> > > > > > > +     Assumption 1: Like the rest of the vectorizer, this function
> > assumes
> > > > > > > +     a single loop exit that has a single predecessor.
> > > > > > > +
> > > > > > > +     Assumption 2: The phi nodes in the LOOP header and in
> > update_bb
> > > > are
> > > > > > > +     organized in the same order.
> > > > > > > +
> > > > > > > +     Assumption 3: The access function of the ivs is simple enough (see
> > > > > > > +     vect_can_advance_ivs_p).  This assumption will be relaxed in the
> > > > future.
> > > > > > > +
> > > > > > > +     Assumption 4: Exactly one of the successors of LOOP exit-bb is on
> > a
> > > > path
> > > > > > > +     coming out of LOOP on which the ivs of LOOP are used (this is the
> > > > path
> > > > > > > +     that leads to the epilog loop; other paths skip the epilog loop).
> > This
> > > > > > > +     path starts with the edge UPDATE_E, and its destination (denoted
> > > > > > update_bb)
> > > > > > > +     needs to have its phis updated.
> > > > > > > + */
> > > > > > > +
> > > > > > > +static tree
> > > > > > > +vect_update_ivs_after_early_break (loop_vec_info loop_vinfo, class
> > > > loop *
> > > > > > epilog,
> > > > > > > +				   poly_int64 vf, tree niters_orig,
> > > > > > > +				   tree niters_vector, edge update_e)
> > > > > > > +{
> > > > > > > +  if (!LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > > > > +    return NULL;
> > > > > > > +
> > > > > > > +  gphi_iterator gsi, gsi1;
> > > > > > > +  tree ni_name, ivtmp = NULL;
> > > > > > > +  basic_block update_bb = update_e->dest;
> > > > > > > +  vec<edge> alt_exits = LOOP_VINFO_ALT_EXITS (loop_vinfo);
> > > > > > > +  edge loop_iv = LOOP_VINFO_IV_EXIT (loop_vinfo);
> > > > > > > +  basic_block exit_bb = loop_iv->dest;
> > > > > > > +  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > > > > > > +  gcond *cond = LOOP_VINFO_LOOP_IV_COND (loop_vinfo);
> > > > > > > +
> > > > > > > +  gcc_assert (cond);
> > > > > > > +
> > > > > > > +  for (gsi = gsi_start_phis (loop->header), gsi1 = gsi_start_phis
> > > > (update_bb);
> > > > > > > +       !gsi_end_p (gsi) && !gsi_end_p (gsi1);
> > > > > > > +       gsi_next (&gsi), gsi_next (&gsi1))
> > > > > > > +    {
> > > > > > > +      tree init_expr, final_expr, step_expr;
> > > > > > > +      tree type;
> > > > > > > +      tree var, ni, off;
> > > > > > > +      gimple_stmt_iterator last_gsi;
> > > > > > > +
> > > > > > > +      gphi *phi = gsi1.phi ();
> > > > > > > +      tree phi_ssa = PHI_ARG_DEF_FROM_EDGE (phi,
> > > > loop_preheader_edge
> > > > > > (epilog));
> > > > > >
> > > > > > I'm confused about the setup.  update_bb looks like the block with the
> > > > > > loop-closed PHI nodes of 'loop' and the exit (update_e)?  How does
> > > > > > loop_preheader_edge (epilog) come into play here?  That would feed
> > into
> > > > > > epilog->header PHIs?!
> > > > >
> > > > > We can't query the type of the phis in the block with the LC PHI nodes, so
> > the
> > > > > Typical pattern seems to be that we iterate over a block that's part of the
> > > > loop
> > > > > and that would have the PHIs in the same order, just so we can get to the
> > > > > stmt_vec_info.
> > > > >
> > > > > >
> > > > > > It would be nice to name 'gsi[1]', 'update_e' and 'update_bb' in a
> > > > > > better way?  Is update_bb really epilog->header?!
> > > > > >
> > > > > > We're missing checking in PHI_ARG_DEF_FROM_EDGE, namely that
> > > > > > E->dest == gimple_bb (PHI) - we're just using E->dest_idx there
> > > > > > which "works" even for totally unrelated edges.
> > > > > >
> > > > > > > +      gphi *phi1 = dyn_cast <gphi *> (SSA_NAME_DEF_STMT
> > (phi_ssa));
> > > > > > > +      if (!phi1)
> > > > > >
> > > > > > shouldn't that be an assert?
> > > > > >
> > > > > > > +	continue;
> > > > > > > +      stmt_vec_info phi_info = loop_vinfo->lookup_stmt (gsi.phi ());
> > > > > > > +      if (dump_enabled_p ())
> > > > > > > +	dump_printf_loc (MSG_NOTE, vect_location,
> > > > > > > +			 "vect_update_ivs_after_early_break: phi:
> > %G",
> > > > > > > +			 (gimple *)phi);
> > > > > > > +
> > > > > > > +      /* Skip reduction and virtual phis.  */
> > > > > > > +      if (!iv_phi_p (phi_info))
> > > > > > > +	{
> > > > > > > +	  if (dump_enabled_p ())
> > > > > > > +	    dump_printf_loc (MSG_NOTE, vect_location,
> > > > > > > +			     "reduc or virtual phi. skip.\n");
> > > > > > > +	  continue;
> > > > > > > +	}
> > > > > > > +
> > > > > > > +      /* For multiple exits where we handle early exits we need to carry
> > on
> > > > > > > +	 with the previous IV as loop iteration was not done because
> > we exited
> > > > > > > +	 early.  As such just grab the original IV.  */
> > > > > > > +      phi_ssa = PHI_ARG_DEF_FROM_EDGE (gsi.phi (), loop_latch_edge
> > > > > > (loop));
> > > > > >
> > > > > > but this should be taken care of by LC SSA?
> > > > >
> > > > > It is, the comment is probably missing details, this part just scales the
> > > > counter
> > > > > from VF to scalar counts.  It's just a reminder that this scaling is done
> > > > differently
> > > > > from normal single exit vectorization.
> > > > >
> > > > > >
> > > > > > OK, have to continue tomorrow from here.
> > > > >
> > > > > Cheers, Thank you!
> > > > >
> > > > > Tamar
> > > > >
> > > > > >
> > > > > > Richard.
> > > > > >
> > > > > > > +      if (gimple_cond_lhs (cond) != phi_ssa
> > > > > > > +	  && gimple_cond_rhs (cond) != phi_ssa)
> > > >
> > > > so this is a way to avoid touching the main IV?  Looks a bit fragile to
> > > > me.  Hmm, we're iterating over the main loop header PHIs here?
> > > > Can't you check, say, the relevancy of the PHI node instead?  Though
> > > > it might also be used as induction.  Can't it be used as alternate
> > > > exit like
> > > >
> > > >   for (i)
> > > >    {
> > > >      if (i & bit)
> > > >        break;
> > > >    }
> > > >
> > > > and would we need to adjust 'i' then?
> > > >
> > > > > > > +	{
> > > > > > > +	  type = TREE_TYPE (gimple_phi_result (phi));
> > > > > > > +	  step_expr = STMT_VINFO_LOOP_PHI_EVOLUTION_PART
> > (phi_info);
> > > > > > > +	  step_expr = unshare_expr (step_expr);
> > > > > > > +
> > > > > > > +	  /* We previously generated the new merged phi in the same
> > BB as
> > > > > > the
> > > > > > > +	     guard.  So use that to perform the scaling on rather than the
> > > > > > > +	     normal loop phi which don't take the early breaks into
> > account.  */
> > > > > > > +	  final_expr = gimple_phi_result (phi1);
> > > > > > > +	  init_expr = PHI_ARG_DEF_FROM_EDGE (gsi.phi (),
> > > > > > loop_preheader_edge (loop));
> > > > > > > +
> > > > > > > +	  tree stype = TREE_TYPE (step_expr);
> > > > > > > +	  /* For early break the final loop IV is:
> > > > > > > +	     init + (final - init) * vf which takes into account peeling
> > > > > > > +	     values and non-single steps.  */
> > > > > > > +	  off = fold_build2 (MINUS_EXPR, stype,
> > > > > > > +			     fold_convert (stype, final_expr),
> > > > > > > +			     fold_convert (stype, init_expr));
> > > > > > > +	  /* Now adjust for VF to get the final iteration value.  */
> > > > > > > +	  off = fold_build2 (MULT_EXPR, stype, off, build_int_cst
> > (stype, vf));
> > > > > > > +
> > > > > > > +	  /* Adjust the value with the offset.  */
> > > > > > > +	  if (POINTER_TYPE_P (type))
> > > > > > > +	    ni = fold_build_pointer_plus (init_expr, off);
> > > > > > > +	  else
> > > > > > > +	    ni = fold_convert (type,
> > > > > > > +			       fold_build2 (PLUS_EXPR, stype,
> > > > > > > +					    fold_convert (stype,
> > init_expr),
> > > > > > > +					    off));
> > > > > > > +	  var = create_tmp_var (type, "tmp");
> > > >
> > > > so how does the non-early break code deal with updating inductions?
> > > > And how do you avoid altering this when we flow in from the normal
> > > > exit?  That is, you are updating the value in the epilog loop
> > > > header but don't you need to instead update the value only on
> > > > the alternate exit edges from the main loop (and keep the not
> > > > updated value on the main exit edge)?
> > > >
> > > > > > > +	  last_gsi = gsi_last_bb (exit_bb);
> > > > > > > +	  gimple_seq new_stmts = NULL;
> > > > > > > +	  ni_name = force_gimple_operand (ni, &new_stmts, false,
> > var);
> > > > > > > +	  /* Exit_bb shouldn't be empty.  */
> > > > > > > +	  if (!gsi_end_p (last_gsi))
> > > > > > > +	    gsi_insert_seq_after (&last_gsi, new_stmts,
> > GSI_SAME_STMT);
> > > > > > > +	  else
> > > > > > > +	    gsi_insert_seq_before (&last_gsi, new_stmts,
> > GSI_SAME_STMT);
> > > > > > > +
> > > > > > > +	  /* Fix phi expressions in the successor bb.  */
> > > > > > > +	  adjust_phi_and_debug_stmts (phi, update_e, ni_name);
> > > > > > > +	}
> > > > > > > +      else
> > > > > > > +	{
> > > > > > > +	  type = TREE_TYPE (gimple_phi_result (phi));
> > > > > > > +	  step_expr = STMT_VINFO_LOOP_PHI_EVOLUTION_PART
> > (phi_info);
> > > > > > > +	  step_expr = unshare_expr (step_expr);
> > > > > > > +
> > > > > > > +	  /* We previously generated the new merged phi in the same
> > BB as
> > > > > > the
> > > > > > > +	     guard.  So use that to perform the scaling on rather than the
> > > > > > > +	     normal loop phi which don't take the early breaks into
> > account.  */
> > > > > > > +	  init_expr = PHI_ARG_DEF_FROM_EDGE (phi1,
> > loop_preheader_edge
> > > > > > (loop));
> > > > > > > +	  tree stype = TREE_TYPE (step_expr);
> > > > > > > +
> > > > > > > +	  if (vf.is_constant ())
> > > > > > > +	    {
> > > > > > > +	      ni = fold_build2 (MULT_EXPR, stype,
> > > > > > > +				fold_convert (stype,
> > > > > > > +					      niters_vector),
> > > > > > > +				build_int_cst (stype, vf));
> > > > > > > +
> > > > > > > +	      ni = fold_build2 (MINUS_EXPR, stype,
> > > > > > > +				fold_convert (stype,
> > > > > > > +					      niters_orig),
> > > > > > > +				fold_convert (stype, ni));
> > > > > > > +	    }
> > > > > > > +	  else
> > > > > > > +	    /* If the loop's VF isn't constant then the loop must have
> > been
> > > > > > > +	       masked, so at the end of the loop we know we have
> > finished
> > > > > > > +	       the entire loop and found nothing.  */
> > > > > > > +	    ni = build_zero_cst (stype);
> > > > > > > +
> > > > > > > +	  ni = fold_convert (type, ni);
> > > > > > > +	  /* We don't support variable n in this version yet.  */
> > > > > > > +	  gcc_assert (TREE_CODE (ni) == INTEGER_CST);
> > > > > > > +
> > > > > > > +	  var = create_tmp_var (type, "tmp");
> > > > > > > +
> > > > > > > +	  last_gsi = gsi_last_bb (exit_bb);
> > > > > > > +	  gimple_seq new_stmts = NULL;
> > > > > > > +	  ni_name = force_gimple_operand (ni, &new_stmts, false,
> > var);
> > > > > > > +	  /* Exit_bb shouldn't be empty.  */
> > > > > > > +	  if (!gsi_end_p (last_gsi))
> > > > > > > +	    gsi_insert_seq_after (&last_gsi, new_stmts,
> > GSI_SAME_STMT);
> > > > > > > +	  else
> > > > > > > +	    gsi_insert_seq_before (&last_gsi, new_stmts,
> > GSI_SAME_STMT);
> > > > > > > +
> > > > > > > +	  adjust_phi_and_debug_stmts (phi1, loop_iv, ni_name);
> > > > > > > +
> > > > > > > +	  for (edge exit : alt_exits)
> > > > > > > +	    adjust_phi_and_debug_stmts (phi1, exit,
> > > > > > > +					build_int_cst (TREE_TYPE
> > (step_expr),
> > > > > > > +						       vf));
> > > > > > > +	  ivtmp = gimple_phi_result (phi1);
> > > > > > > +	}
> > > > > > > +    }
> > > > > > > +
> > > > > > > +  return ivtmp;
> > > > > > >  }
> > > > > > >
> > > > > > >  /* Return a gimple value containing the misalignment (measured in
> > > > vector
> > > > > > > @@ -2632,137 +2989,34 @@ vect_gen_vector_loop_niters_mult_vf
> > > > > > (loop_vec_info loop_vinfo,
> > > > > > >
> > > > > > >  /* LCSSA_PHI is a lcssa phi of EPILOG loop which is copied from LOOP,
> > > > > > >     this function searches for the corresponding lcssa phi node in exit
> > > > > > > -   bb of LOOP.  If it is found, return the phi result; otherwise return
> > > > > > > -   NULL.  */
> > > > > > > +   bb of LOOP following the LCSSA_EDGE to the exit node.  If it is
> > found,
> > > > > > > +   return the phi result; otherwise return NULL.  */
> > > > > > >
> > > > > > >  static tree
> > > > > > >  find_guard_arg (class loop *loop, class loop *epilog
> > > > ATTRIBUTE_UNUSED,
> > > > > > > -		gphi *lcssa_phi)
> > > > > > > +		gphi *lcssa_phi, int lcssa_edge = 0)
> > > > > > >  {
> > > > > > >    gphi_iterator gsi;
> > > > > > >    edge e = loop->vec_loop_iv;
> > > > > > >
> > > > > > > -  gcc_assert (single_pred_p (e->dest));
> > > > > > >    for (gsi = gsi_start_phis (e->dest); !gsi_end_p (gsi); gsi_next (&gsi))
> > > > > > >      {
> > > > > > >        gphi *phi = gsi.phi ();
> > > > > > > -      if (operand_equal_p (PHI_ARG_DEF (phi, 0),
> > > > > > > -			   PHI_ARG_DEF (lcssa_phi, 0), 0))
> > > > > > > -	return PHI_RESULT (phi);
> > > > > > > -    }
> > > > > > > -  return NULL_TREE;
> > > > > > > -}
> > > > > > > -
> > > > > > > -/* Function slpeel_tree_duplicate_loop_to_edge_cfg duplciates
> > > > > > FIRST/SECOND
> > > > > > > -   from SECOND/FIRST and puts it at the original loop's preheader/exit
> > > > > > > -   edge, the two loops are arranged as below:
> > > > > > > -
> > > > > > > -       preheader_a:
> > > > > > > -     first_loop:
> > > > > > > -       header_a:
> > > > > > > -	 i_1 = PHI<i_0, i_2>;
> > > > > > > -	 ...
> > > > > > > -	 i_2 = i_1 + 1;
> > > > > > > -	 if (cond_a)
> > > > > > > -	   goto latch_a;
> > > > > > > -	 else
> > > > > > > -	   goto between_bb;
> > > > > > > -       latch_a:
> > > > > > > -	 goto header_a;
> > > > > > > -
> > > > > > > -       between_bb:
> > > > > > > -	 ;; i_x = PHI<i_2>;   ;; LCSSA phi node to be created for FIRST,
> > > > > > > -
> > > > > > > -     second_loop:
> > > > > > > -       header_b:
> > > > > > > -	 i_3 = PHI<i_0, i_4>; ;; Use of i_0 to be replaced with i_x,
> > > > > > > -				 or with i_2 if no LCSSA phi is created
> > > > > > > -				 under condition of
> > > > > > CREATE_LCSSA_FOR_IV_PHIS.
> > > > > > > -	 ...
> > > > > > > -	 i_4 = i_3 + 1;
> > > > > > > -	 if (cond_b)
> > > > > > > -	   goto latch_b;
> > > > > > > -	 else
> > > > > > > -	   goto exit_bb;
> > > > > > > -       latch_b:
> > > > > > > -	 goto header_b;
> > > > > > > -
> > > > > > > -       exit_bb:
> > > > > > > -
> > > > > > > -   This function creates loop closed SSA for the first loop; update the
> > > > > > > -   second loop's PHI nodes by replacing argument on incoming edge
> > with
> > > > the
> > > > > > > -   result of newly created lcssa PHI nodes.  IF
> > > > CREATE_LCSSA_FOR_IV_PHIS
> > > > > > > -   is false, Loop closed ssa phis will only be created for non-iv phis for
> > > > > > > -   the first loop.
> > > > > > > -
> > > > > > > -   This function assumes exit bb of the first loop is preheader bb of
> > the
> > > > > > > -   second loop, i.e, between_bb in the example code.  With PHIs
> > updated,
> > > > > > > -   the second loop will execute rest iterations of the first.  */
> > > > > > > -
> > > > > > > -static void
> > > > > > > -slpeel_update_phi_nodes_for_loops (loop_vec_info loop_vinfo,
> > > > > > > -				   class loop *first, class loop *second,
> > > > > > > -				   bool create_lcssa_for_iv_phis)
> > > > > > > -{
> > > > > > > -  gphi_iterator gsi_update, gsi_orig;
> > > > > > > -  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > > > > > > -
> > > > > > > -  edge first_latch_e = EDGE_SUCC (first->latch, 0);
> > > > > > > -  edge second_preheader_e = loop_preheader_edge (second);
> > > > > > > -  basic_block between_bb = single_exit (first)->dest;
> > > > > > > -
> > > > > > > -  gcc_assert (between_bb == second_preheader_e->src);
> > > > > > > -  gcc_assert (single_pred_p (between_bb) && single_succ_p
> > > > (between_bb));
> > > > > > > -  /* Either the first loop or the second is the loop to be vectorized.  */
> > > > > > > -  gcc_assert (loop == first || loop == second);
> > > > > > > -
> > > > > > > -  for (gsi_orig = gsi_start_phis (first->header),
> > > > > > > -       gsi_update = gsi_start_phis (second->header);
> > > > > > > -       !gsi_end_p (gsi_orig) && !gsi_end_p (gsi_update);
> > > > > > > -       gsi_next (&gsi_orig), gsi_next (&gsi_update))
> > > > > > > -    {
> > > > > > > -      gphi *orig_phi = gsi_orig.phi ();
> > > > > > > -      gphi *update_phi = gsi_update.phi ();
> > > > > > > -
> > > > > > > -      tree arg = PHI_ARG_DEF_FROM_EDGE (orig_phi, first_latch_e);
> > > > > > > -      /* Generate lcssa PHI node for the first loop.  */
> > > > > > > -      gphi *vect_phi = (loop == first) ? orig_phi : update_phi;
> > > > > > > -      stmt_vec_info vect_phi_info = loop_vinfo->lookup_stmt
> > (vect_phi);
> > > > > > > -      if (create_lcssa_for_iv_phis || !iv_phi_p (vect_phi_info))
> > > > > > > +      /* Nested loops with multiple exits can have different no# phi
> > node
> > > > > > > +	 arguments between the main loop and epilog as epilog falls to
> > the
> > > > > > > +	 second loop.  */
> > > > > > > +      if (gimple_phi_num_args (phi) > e->dest_idx)
> > > > > > >  	{
> > > > > > > -	  tree new_res = copy_ssa_name (PHI_RESULT (orig_phi));
> > > > > > > -	  gphi *lcssa_phi = create_phi_node (new_res, between_bb);
> > > > > > > -	  add_phi_arg (lcssa_phi, arg, single_exit (first),
> > > > > > UNKNOWN_LOCATION);
> > > > > > > -	  arg = new_res;
> > > > > > > -	}
> > > > > > > -
> > > > > > > -      /* Update PHI node in the second loop by replacing arg on the
> > loop's
> > > > > > > -	 incoming edge.  */
> > > > > > > -      adjust_phi_and_debug_stmts (update_phi, second_preheader_e,
> > > > arg);
> > > > > > > -    }
> > > > > > > -
> > > > > > > -  /* For epilogue peeling we have to make sure to copy all LC PHIs
> > > > > > > -     for correct vectorization of live stmts.  */
> > > > > > > -  if (loop == first)
> > > > > > > -    {
> > > > > > > -      basic_block orig_exit = single_exit (second)->dest;
> > > > > > > -      for (gsi_orig = gsi_start_phis (orig_exit);
> > > > > > > -	   !gsi_end_p (gsi_orig); gsi_next (&gsi_orig))
> > > > > > > -	{
> > > > > > > -	  gphi *orig_phi = gsi_orig.phi ();
> > > > > > > -	  tree orig_arg = PHI_ARG_DEF (orig_phi, 0);
> > > > > > > -	  if (TREE_CODE (orig_arg) != SSA_NAME || virtual_operand_p
> > > > > > (orig_arg))
> > > > > > > -	    continue;
> > > > > > > -
> > > > > > > -	  /* Already created in the above loop.   */
> > > > > > > -	  if (find_guard_arg (first, second, orig_phi))
> > > > > > > +	  tree var = PHI_ARG_DEF (phi, e->dest_idx);
> > > > > > > +	  if (TREE_CODE (var) != SSA_NAME)
> > > > > > >  	    continue;
> > > > > > >
> > > > > > > -	  tree new_res = copy_ssa_name (orig_arg);
> > > > > > > -	  gphi *lcphi = create_phi_node (new_res, between_bb);
> > > > > > > -	  add_phi_arg (lcphi, orig_arg, single_exit (first),
> > > > > > UNKNOWN_LOCATION);
> > > > > > > +	  if (operand_equal_p (get_current_def (var),
> > > > > > > +			       PHI_ARG_DEF (lcssa_phi, lcssa_edge), 0))
> > > > > > > +	    return PHI_RESULT (phi);
> > > > > > >  	}
> > > > > > >      }
> > > > > > > +  return NULL_TREE;
> > > > > > >  }
> > > > > > >
> > > > > > >  /* Function slpeel_add_loop_guard adds guard skipping from the
> > > > beginning
> > > > > > > @@ -2910,13 +3164,11 @@ slpeel_update_phi_nodes_for_guard2
> > > > (class
> > > > > > loop *loop, class loop *epilog,
> > > > > > >    gcc_assert (single_succ_p (merge_bb));
> > > > > > >    edge e = single_succ_edge (merge_bb);
> > > > > > >    basic_block exit_bb = e->dest;
> > > > > > > -  gcc_assert (single_pred_p (exit_bb));
> > > > > > > -  gcc_assert (single_pred (exit_bb) == single_exit (epilog)->dest);
> > > > > > >
> > > > > > >    for (gsi = gsi_start_phis (exit_bb); !gsi_end_p (gsi); gsi_next (&gsi))
> > > > > > >      {
> > > > > > >        gphi *update_phi = gsi.phi ();
> > > > > > > -      tree old_arg = PHI_ARG_DEF (update_phi, 0);
> > > > > > > +      tree old_arg = PHI_ARG_DEF (update_phi, e->dest_idx);
> > > > > > >
> > > > > > >        tree merge_arg = NULL_TREE;
> > > > > > >
> > > > > > > @@ -2928,7 +3180,7 @@ slpeel_update_phi_nodes_for_guard2
> > (class
> > > > loop
> > > > > > *loop, class loop *epilog,
> > > > > > >        if (!merge_arg)
> > > > > > >  	merge_arg = old_arg;
> > > > > > >
> > > > > > > -      tree guard_arg = find_guard_arg (loop, epilog, update_phi);
> > > > > > > +      tree guard_arg = find_guard_arg (loop, epilog, update_phi, e-
> > > > >dest_idx);
> > > > > > >        /* If the var is live after loop but not a reduction, we simply
> > > > > > >  	 use the old arg.  */
> > > > > > >        if (!guard_arg)
> > > > > > > @@ -2948,21 +3200,6 @@ slpeel_update_phi_nodes_for_guard2
> > (class
> > > > > > loop *loop, class loop *epilog,
> > > > > > >      }
> > > > > > >  }
> > > > > > >
> > > > > > > -/* EPILOG loop is duplicated from the original loop for vectorizing,
> > > > > > > -   the arg of its loop closed ssa PHI needs to be updated.  */
> > > > > > > -
> > > > > > > -static void
> > > > > > > -slpeel_update_phi_nodes_for_lcssa (class loop *epilog)
> > > > > > > -{
> > > > > > > -  gphi_iterator gsi;
> > > > > > > -  basic_block exit_bb = single_exit (epilog)->dest;
> > > > > > > -
> > > > > > > -  gcc_assert (single_pred_p (exit_bb));
> > > > > > > -  edge e = EDGE_PRED (exit_bb, 0);
> > > > > > > -  for (gsi = gsi_start_phis (exit_bb); !gsi_end_p (gsi); gsi_next (&gsi))
> > > > > > > -    rename_use_op (PHI_ARG_DEF_PTR_FROM_EDGE (gsi.phi (), e));
> > > > > > > -}
> > > > > > > -
> > > >
> > > > I wonder if we can still split these changes out to before early break
> > > > vect?
> > > >
> > > > > > >  /* EPILOGUE_VINFO is an epilogue loop that we now know would
> > need
> > > > to
> > > > > > >     iterate exactly CONST_NITERS times.  Make a final decision about
> > > > > > >     whether the epilogue loop should be used, returning true if so.  */
> > > > > > > @@ -3138,6 +3375,14 @@ vect_do_peeling (loop_vec_info
> > loop_vinfo,
> > > > tree
> > > > > > niters, tree nitersm1,
> > > > > > >      bound_epilog += vf - 1;
> > > > > > >    if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
> > > > > > >      bound_epilog += 1;
> > > > > > > +  /* For early breaks the scalar loop needs to execute at most VF
> > times
> > > > > > > +     to find the element that caused the break.  */
> > > > > > > +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > > > > +    {
> > > > > > > +      bound_epilog = vf;
> > > > > > > +      /* Force a scalar epilogue as we can't vectorize the index finding.
> > */
> > > > > > > +      vect_epilogues = false;
> > > > > > > +    }
> > > > > > >    bool epilog_peeling = maybe_ne (bound_epilog, 0U);
> > > > > > >    poly_uint64 bound_scalar = bound_epilog;
> > > > > > >
> > > > > > > @@ -3297,16 +3542,24 @@ vect_do_peeling (loop_vec_info
> > > > loop_vinfo,
> > > > > > tree niters, tree nitersm1,
> > > > > > >  				  bound_prolog + bound_epilog)
> > > > > > >  		      : (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
> > > > > > >  			 || vect_epilogues));
> > > > > > > +
> > > > > > > +  /* We only support early break vectorization on known bounds at
> > this
> > > > > > time.
> > > > > > > +     This means that if the vector loop can't be entered then we won't
> > > > > > generate
> > > > > > > +     it at all.  So for now force skip_vector off because the additional
> > > > control
> > > > > > > +     flow messes with the BB exits and we've already analyzed them.
> > */
> > > > > > > + skip_vector = skip_vector && !LOOP_VINFO_EARLY_BREAKS
> > > > (loop_vinfo);
> > > > > > > +
> > > >
> > > > I think it should be as "easy" as entering the epilog via the block taking
> > > > the regular exit?
> > > >
> > > > > > >    /* Epilog loop must be executed if the number of iterations for epilog
> > > > > > >       loop is known at compile time, otherwise we need to add a check
> > at
> > > > > > >       the end of vector loop and skip to the end of epilog loop.  */
> > > > > > >    bool skip_epilog = (prolog_peeling < 0
> > > > > > >  		      || !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > > > > > >  		      || !vf.is_constant ());
> > > > > > > -  /* PEELING_FOR_GAPS is special because epilog loop must be
> > executed.
> > > > */
> > > > > > > -  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
> > > > > > > +  /* PEELING_FOR_GAPS and peeling for early breaks are special
> > because
> > > > > > epilog
> > > > > > > +     loop must be executed.  */
> > > > > > > +  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> > > > > > > +      || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > > > >      skip_epilog = false;
> > > > > > > -
> > > > > > >    class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
> > > > > > >    auto_vec<profile_count> original_counts;
> > > > > > >    basic_block *original_bbs = NULL;
> > > > > > > @@ -3344,13 +3597,13 @@ vect_do_peeling (loop_vec_info
> > > > loop_vinfo,
> > > > > > tree niters, tree nitersm1,
> > > > > > >    if (prolog_peeling)
> > > > > > >      {
> > > > > > >        e = loop_preheader_edge (loop);
> > > > > > > -      gcc_checking_assert (slpeel_can_duplicate_loop_p (loop, e));
> > > > > > > -
> > > > > > > +      gcc_checking_assert (slpeel_can_duplicate_loop_p (loop_vinfo,
> > e));
> > > > > > >        /* Peel prolog and put it on preheader edge of loop.  */
> > > > > > > -      prolog = slpeel_tree_duplicate_loop_to_edge_cfg (loop,
> > scalar_loop,
> > > > e);
> > > > > > > +      prolog = slpeel_tree_duplicate_loop_to_edge_cfg (loop,
> > scalar_loop,
> > > > e,
> > > > > > > +						       true);
> > > > > > >        gcc_assert (prolog);
> > > > > > >        prolog->force_vectorize = false;
> > > > > > > -      slpeel_update_phi_nodes_for_loops (loop_vinfo, prolog, loop,
> > true);
> > > > > > > +
> > > > > > >        first_loop = prolog;
> > > > > > >        reset_original_copy_tables ();
> > > > > > >
> > > > > > > @@ -3420,11 +3673,12 @@ vect_do_peeling (loop_vec_info
> > > > loop_vinfo,
> > > > > > tree niters, tree nitersm1,
> > > > > > >  	 as the transformations mentioned above make less or no
> > sense when
> > > > > > not
> > > > > > >  	 vectorizing.  */
> > > > > > >        epilog = vect_epilogues ? get_loop_copy (loop) : scalar_loop;
> > > > > > > -      epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, epilog, e);
> > > > > > > +      auto_vec<basic_block> doms;
> > > > > > > +      epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, epilog, e,
> > > > true,
> > > > > > > +						       &doms);
> > > > > > >        gcc_assert (epilog);
> > > > > > >
> > > > > > >        epilog->force_vectorize = false;
> > > > > > > -      slpeel_update_phi_nodes_for_loops (loop_vinfo, loop, epilog,
> > false);
> > > > > > >
> > > > > > >        /* Scalar version loop may be preferred.  In this case, add guard
> > > > > > >  	 and skip to epilog.  Note this only happens when the number
> > of
> > > > > > > @@ -3496,6 +3750,54 @@ vect_do_peeling (loop_vec_info
> > loop_vinfo,
> > > > tree
> > > > > > niters, tree nitersm1,
> > > > > > >        vect_update_ivs_after_vectorizer (loop_vinfo,
> > niters_vector_mult_vf,
> > > > > > >  					update_e);
> > > > > > >
> > > > > > > +      /* For early breaks we must create a guard to check how many
> > > > iterations
> > > > > > > +	 of the scalar loop are yet to be performed.  */
> > > >
> > > > We have this check anyway, no?  In fact don't we know that we always
> > enter
> > > > the epilog (see above)?
> > > >
> > > > > > > +      if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > > > > +	{
> > > > > > > +	  tree ivtmp =
> > > > > > > +	    vect_update_ivs_after_early_break (loop_vinfo, epilog, vf,
> > niters,
> > > > > > > +					       *niters_vector, update_e);
> > > > > > > +
> > > > > > > +	  gcc_assert (ivtmp);
> > > > > > > +	  tree guard_cond = fold_build2 (EQ_EXPR,
> > boolean_type_node,
> > > > > > > +					 fold_convert (TREE_TYPE
> > (niters),
> > > > > > > +						       ivtmp),
> > > > > > > +					 build_zero_cst (TREE_TYPE
> > (niters)));
> > > > > > > +	  basic_block guard_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)-
> > >dest;
> > > > > > > +
> > > > > > > +	  /* If we had a fallthrough edge, the guard will the threaded
> > through
> > > > > > > +	     and so we may need to find the actual final edge.  */
> > > > > > > +	  edge final_edge = epilog->vec_loop_iv;
> > > > > > > +	  /* slpeel_update_phi_nodes_for_guard2 expects an empty
> > block in
> > > > > > > +	     between the guard and the exit edge.  It only adds new
> > nodes and
> > > > > > > +	     doesn't update existing one in the current scheme.  */
> > > > > > > +	  basic_block guard_to = split_edge (final_edge);
> > > > > > > +	  edge guard_e = slpeel_add_loop_guard (guard_bb,
> > guard_cond,
> > > > > > guard_to,
> > > > > > > +						guard_bb,
> > prob_epilog.invert
> > > > > > (),
> > > > > > > +						irred_flag);
> > > > > > > +	  doms.safe_push (guard_bb);
> > > > > > > +
> > > > > > > +	  iterate_fix_dominators (CDI_DOMINATORS, doms, false);
> > > > > > > +
> > > > > > > +	  /* We must update all the edges from the new guard_bb.  */
> > > > > > > +	  slpeel_update_phi_nodes_for_guard2 (loop, epilog, guard_e,
> > > > > > > +					      final_edge);
> > > > > > > +
> > > > > > > +	  /* If the loop was versioned we'll have an intermediate BB
> > between
> > > > > > > +	     the guard and the exit.  This intermediate block is required
> > > > > > > +	     because in the current scheme of things the guard block phi
> > > > > > > +	     updating can only maintain LCSSA by creating new blocks.
> > In this
> > > > > > > +	     case we just need to update the uses in this block as well.
> > */
> > > > > > > +	  if (loop != scalar_loop)
> > > > > > > +	    {
> > > > > > > +	      for (gphi_iterator gsi = gsi_start_phis (guard_to);
> > > > > > > +		   !gsi_end_p (gsi); gsi_next (&gsi))
> > > > > > > +		rename_use_op (PHI_ARG_DEF_PTR_FROM_EDGE
> > (gsi.phi (),
> > > > > > guard_e));
> > > > > > > +	    }
> > > > > > > +
> > > > > > > +	  flush_pending_stmts (guard_e);
> > > > > > > +	}
> > > > > > > +
> > > > > > >        if (skip_epilog)
> > > > > > >  	{
> > > > > > >  	  guard_cond = fold_build2 (EQ_EXPR, boolean_type_node,
> > > > > > > @@ -3520,8 +3822,6 @@ vect_do_peeling (loop_vec_info
> > loop_vinfo,
> > > > tree
> > > > > > niters, tree nitersm1,
> > > > > > >  	    }
> > > > > > >  	  scale_loop_profile (epilog, prob_epilog, 0);
> > > > > > >  	}
> > > > > > > -      else
> > > > > > > -	slpeel_update_phi_nodes_for_lcssa (epilog);
> > > > > > >
> > > > > > >        unsigned HOST_WIDE_INT bound;
> > > > > > >        if (bound_scalar.is_constant (&bound))
> > > > > > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > > > > > > index
> > > > > >
> > > >
> > b4a98de80aa39057fc9b17977dd0e347b4f0fb5d..ab9a2048186f461f5ec49
> > > > > > f21421958e7ee25eada 100644
> > > > > > > --- a/gcc/tree-vect-loop.cc
> > > > > > > +++ b/gcc/tree-vect-loop.cc
> > > > > > > @@ -1007,6 +1007,8 @@ _loop_vec_info::_loop_vec_info (class loop
> > > > > > *loop_in, vec_info_shared *shared)
> > > > > > >      partial_load_store_bias (0),
> > > > > > >      peeling_for_gaps (false),
> > > > > > >      peeling_for_niter (false),
> > > > > > > +    early_breaks (false),
> > > > > > > +    non_break_control_flow (false),
> > > > > > >      no_data_dependencies (false),
> > > > > > >      has_mask_store (false),
> > > > > > >      scalar_loop_scaling (profile_probability::uninitialized ()),
> > > > > > > @@ -1199,6 +1201,14 @@ vect_need_peeling_or_partial_vectors_p
> > > > > > (loop_vec_info loop_vinfo)
> > > > > > >      th = LOOP_VINFO_COST_MODEL_THRESHOLD
> > > > > > (LOOP_VINFO_ORIG_LOOP_INFO
> > > > > > >  					  (loop_vinfo));
> > > > > > >
> > > > > > > +  /* When we have multiple exits and VF is unknown, we must
> > require
> > > > > > partial
> > > > > > > +     vectors because the loop bounds is not a minimum but a
> > maximum.
> > > > > > That is to
> > > > > > > +     say we cannot unpredicate the main loop unless we peel or use
> > partial
> > > > > > > +     vectors in the epilogue.  */
> > > > > > > +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
> > > > > > > +      && !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ())
> > > > > > > +    return true;
> > > > > > > +
> > > > > > >    if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > > > > > >        && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) >= 0)
> > > > > > >      {
> > > > > > > @@ -1652,12 +1662,12 @@
> > vect_compute_single_scalar_iteration_cost
> > > > > > (loop_vec_info loop_vinfo)
> > > > > > >    loop_vinfo->scalar_costs->finish_cost (nullptr);
> > > > > > >  }
> > > > > > >
> > > > > > > -
> > > > > > >  /* Function vect_analyze_loop_form.
> > > > > > >
> > > > > > >     Verify that certain CFG restrictions hold, including:
> > > > > > >     - the loop has a pre-header
> > > > > > > -   - the loop has a single entry and exit
> > > > > > > +   - the loop has a single entry
> > > > > > > +   - nested loops can have only a single exit.
> > > > > > >     - the loop exit condition is simple enough
> > > > > > >     - the number of iterations can be analyzed, i.e, a countable loop.
> > The
> > > > > > >       niter could be analyzed under some assumptions.  */
> > > > > > > @@ -1693,11 +1703,6 @@ vect_analyze_loop_form (class loop
> > *loop,
> > > > > > vect_loop_form_info *info)
> > > > > > >                             |
> > > > > > >                          (exit-bb)  */
> > > > > > >
> > > > > > > -      if (loop->num_nodes != 2)
> > > > > > > -	return opt_result::failure_at (vect_location,
> > > > > > > -				       "not vectorized:"
> > > > > > > -				       " control flow in loop.\n");
> > > > > > > -
> > > > > > >        if (empty_block_p (loop->header))
> > > > > > >  	return opt_result::failure_at (vect_location,
> > > > > > >  				       "not vectorized: empty loop.\n");
> > > > > > > @@ -1768,11 +1773,13 @@ vect_analyze_loop_form (class loop
> > *loop,
> > > > > > vect_loop_form_info *info)
> > > > > > >          dump_printf_loc (MSG_NOTE, vect_location,
> > > > > > >  			 "Considering outer-loop vectorization.\n");
> > > > > > >        info->inner_loop_cond = inner.loop_cond;
> > > > > > > +
> > > > > > > +      if (!single_exit (loop))
> > > > > > > +	return opt_result::failure_at (vect_location,
> > > > > > > +				       "not vectorized: multiple
> > exits.\n");
> > > > > > > +
> > > > > > >      }
> > > > > > >
> > > > > > > -  if (!single_exit (loop))
> > > > > > > -    return opt_result::failure_at (vect_location,
> > > > > > > -				   "not vectorized: multiple exits.\n");
> > > > > > >    if (EDGE_COUNT (loop->header->preds) != 2)
> > > > > > >      return opt_result::failure_at (vect_location,
> > > > > > >  				   "not vectorized:"
> > > > > > > @@ -1788,11 +1795,36 @@ vect_analyze_loop_form (class loop
> > *loop,
> > > > > > vect_loop_form_info *info)
> > > > > > >  				   "not vectorized: latch block not
> > empty.\n");
> > > > > > >
> > > > > > >    /* Make sure the exit is not abnormal.  */
> > > > > > > -  edge e = single_exit (loop);
> > > > > > > -  if (e->flags & EDGE_ABNORMAL)
> > > > > > > -    return opt_result::failure_at (vect_location,
> > > > > > > -				   "not vectorized:"
> > > > > > > -				   " abnormal loop exit edge.\n");
> > > > > > > +  auto_vec<edge> exits = get_loop_exit_edges (loop);
> > > > > > > +  edge nexit = loop->vec_loop_iv;
> > > > > > > +  for (edge e : exits)
> > > > > > > +    {
> > > > > > > +      if (e->flags & EDGE_ABNORMAL)
> > > > > > > +	return opt_result::failure_at (vect_location,
> > > > > > > +				       "not vectorized:"
> > > > > > > +				       " abnormal loop exit edge.\n");
> > > > > > > +      /* Early break BB must be after the main exit BB.  In theory we
> > should
> > > > > > > +	 be able to vectorize the inverse order, but the current flow in
> > the
> > > > > > > +	 the vectorizer always assumes you update successor PHI
> > nodes, not
> > > > > > > +	 preds.  */
> > > > > > > +      if (e != nexit && !dominated_by_p (CDI_DOMINATORS, nexit->src,
> > e-
> > > > > > >src))
> > > > > > > +	return opt_result::failure_at (vect_location,
> > > > > > > +				       "not vectorized:"
> > > > > > > +				       " abnormal loop exit edge
> > order.\n");
> > > >
> > > > "unsupported loop exit order", but I don't understand the comment.
> > > >
> > > > > > > +    }
> > > > > > > +
> > > > > > > +  /* We currently only support early exit loops with known bounds.
> > */
> > > >
> > > > Btw, why's that?  Is that because we don't support the loop-around edge?
> > > > IMHO this is the most serious limitation (and as said above it should be
> > > > trivial to fix).
> > > >
> > > > > > > +  if (exits.length () > 1)
> > > > > > > +    {
> > > > > > > +      class tree_niter_desc niter;
> > > > > > > +      if (!number_of_iterations_exit_assumptions (loop, nexit, &niter,
> > > > NULL)
> > > > > > > +	  || chrec_contains_undetermined (niter.niter)
> > > > > > > +	  || !evolution_function_is_constant_p (niter.niter))
> > > > > > > +	return opt_result::failure_at (vect_location,
> > > > > > > +				       "not vectorized:"
> > > > > > > +				       " early breaks only supported on
> > loops"
> > > > > > > +				       " with known iteration
> > bounds.\n");
> > > > > > > +    }
> > > > > > >
> > > > > > >    info->conds
> > > > > > >      = vect_get_loop_niters (loop, &info->assumptions,
> > > > > > > @@ -1866,6 +1898,10 @@ vect_create_loop_vinfo (class loop *loop,
> > > > > > vec_info_shared *shared,
> > > > > > >    LOOP_VINFO_LOOP_CONDS (loop_vinfo).safe_splice (info-
> > > > > > >alt_loop_conds);
> > > > > > >    LOOP_VINFO_LOOP_IV_COND (loop_vinfo) = info->loop_cond;
> > > > > > >
> > > > > > > +  /* Check to see if we're vectorizing multiple exits.  */
> > > > > > > +  LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
> > > > > > > +    = !LOOP_VINFO_LOOP_CONDS (loop_vinfo).is_empty ();
> > > > > > > +
> > > > > > >    if (info->inner_loop_cond)
> > > > > > >      {
> > > > > > >        stmt_vec_info inner_loop_cond_info
> > > > > > > @@ -3070,7 +3106,8 @@ start_over:
> > > > > > >
> > > > > > >    /* If an epilogue loop is required make sure we can create one.  */
> > > > > > >    if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> > > > > > > -      || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo))
> > > > > > > +      || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)
> > > > > > > +      || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > > > >      {
> > > > > > >        if (dump_enabled_p ())
> > > > > > >          dump_printf_loc (MSG_NOTE, vect_location, "epilog loop
> > > > required\n");
> > > > > > > @@ -5797,7 +5834,7 @@ vect_create_epilog_for_reduction
> > > > (loop_vec_info
> > > > > > loop_vinfo,
> > > > > > >    basic_block exit_bb;
> > > > > > >    tree scalar_dest;
> > > > > > >    tree scalar_type;
> > > > > > > -  gimple *new_phi = NULL, *phi;
> > > > > > > +  gimple *new_phi = NULL, *phi = NULL;
> > > > > > >    gimple_stmt_iterator exit_gsi;
> > > > > > >    tree new_temp = NULL_TREE, new_name, new_scalar_dest;
> > > > > > >    gimple *epilog_stmt = NULL;
> > > > > > > @@ -6039,6 +6076,33 @@ vect_create_epilog_for_reduction
> > > > > > (loop_vec_info loop_vinfo,
> > > > > > >  	  new_def = gimple_convert (&stmts, vectype, new_def);
> > > > > > >  	  reduc_inputs.quick_push (new_def);
> > > > > > >  	}
> > > > > > > +
> > > > > > > +	/* Update the other exits.  */
> > > > > > > +	if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > > > > +	  {
> > > > > > > +	    vec<edge> alt_exits = LOOP_VINFO_ALT_EXITS (loop_vinfo);
> > > > > > > +	    gphi_iterator gsi, gsi1;
> > > > > > > +	    for (edge exit : alt_exits)
> > > > > > > +	      {
> > > > > > > +		/* Find the phi node to propaget into the exit block for
> > each
> > > > > > > +		   exit edge.  */
> > > > > > > +		for (gsi = gsi_start_phis (exit_bb),
> > > > > > > +		     gsi1 = gsi_start_phis (exit->src);
> > > >
> > > > exit->src == loop->header, right?  I think this won't work for multiple
> > > > alternate exits.  It's probably easier to do this where we create the
> > > > LC PHI node for the reduction result?
> > > >
> > > > > > > +		     !gsi_end_p (gsi) && !gsi_end_p (gsi1);
> > > > > > > +		     gsi_next (&gsi), gsi_next (&gsi1))
> > > > > > > +		  {
> > > > > > > +		    /* There really should be a function to just get the
> > number
> > > > > > > +		       of phis inside a bb.  */
> > > > > > > +		    if (phi && phi == gsi.phi ())
> > > > > > > +		      {
> > > > > > > +			gphi *phi1 = gsi1.phi ();
> > > > > > > +			SET_PHI_ARG_DEF (phi, exit->dest_idx,
> > > > > > > +					 PHI_RESULT (phi1));
> > > >
> > > > I think we know the header PHI of a reduction perfectly well, there
> > > > shouldn't be the need to "search" for it.
> > > >
> > > > > > > +			break;
> > > > > > > +		      }
> > > > > > > +		  }
> > > > > > > +	      }
> > > > > > > +	  }
> > > > > > >        gsi_insert_seq_before (&exit_gsi, stmts, GSI_SAME_STMT);
> > > > > > >      }
> > > > > > >
> > > > > > > @@ -10355,6 +10419,13 @@ vectorizable_live_operation (vec_info
> > > > *vinfo,
> > > > > > >  	   new_tree = lane_extract <vec_lhs', ...>;
> > > > > > >  	   lhs' = new_tree;  */
> > > > > > >
> > > > > > > +      /* When vectorizing an early break, any live statement that is
> > used
> > > > > > > +	 outside of the loop are dead.  The loop will never get to them.
> > > > > > > +	 We could change the liveness value during analysis instead
> > but since
> > > > > > > +	 the below code is invalid anyway just ignore it during
> > codegen.  */
> > > > > > > +      if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > > > > +	return true;
> > > >
> > > > But what about the value that's live across the main exit when the
> > > > epilogue is not entered?
> > > >
> > > > > > > +
> > > > > > >        class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > > > > > >        basic_block exit_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
> > > > > > >        gcc_assert (single_pred_p (exit_bb));
> > > > > > > @@ -11277,7 +11348,7 @@ vect_transform_loop (loop_vec_info
> > > > > > loop_vinfo, gimple *loop_vectorized_call)
> > > > > > >    /* Make sure there exists a single-predecessor exit bb.  Do this before
> > > > > > >       versioning.   */
> > > > > > >    edge e = LOOP_VINFO_IV_EXIT (loop_vinfo);
> > > > > > > -  if (! single_pred_p (e->dest))
> > > > > > > +  if (e && ! single_pred_p (e->dest) &&
> > !LOOP_VINFO_EARLY_BREAKS
> > > > > > (loop_vinfo))
> > > >
> > > > e can be NULL here?  I think we should reject such loops earlier.
> > > >
> > > > > > >      {
> > > > > > >        split_loop_exit_edge (e, true);
> > > > > > >        if (dump_enabled_p ())
> > > > > > > @@ -11303,7 +11374,7 @@ vect_transform_loop (loop_vec_info
> > > > > > loop_vinfo, gimple *loop_vectorized_call)
> > > > > > >    if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo))
> > > > > > >      {
> > > > > > >        e = single_exit (LOOP_VINFO_SCALAR_LOOP (loop_vinfo));
> > > > > > > -      if (! single_pred_p (e->dest))
> > > > > > > +      if (e && ! single_pred_p (e->dest))
> > > > > > >  	{
> > > > > > >  	  split_loop_exit_edge (e, true);
> > > > > > >  	  if (dump_enabled_p ())
> > > > > > > @@ -11641,7 +11712,8 @@ vect_transform_loop (loop_vec_info
> > > > > > loop_vinfo, gimple *loop_vectorized_call)
> > > > > > >
> > > > > > >    /* Loops vectorized with a variable factor won't benefit from
> > > > > > >       unrolling/peeling.  */
> > > >
> > > > update the comment?  Why would we unroll a VLA loop with early breaks?
> > > > Or did you mean to use || LOOP_VINFO_EARLY_BREAKS (loop_vinfo)?
> > > >
> > > > > > > -  if (!vf.is_constant ())
> > > > > > > +  if (!vf.is_constant ()
> > > > > > > +      && !LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > > > >      {
> > > > > > >        loop->unroll = 1;
> > > > > > >        if (dump_enabled_p ())
> > > > > > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > > > > > > index
> > > > > >
> > > >
> > 87c4353fa5180fcb7f60b192897456cf24f3fdbe..03524e8500ee06df42f82af
> > > > > > e78ee2a7c627be45b 100644
> > > > > > > --- a/gcc/tree-vect-stmts.cc
> > > > > > > +++ b/gcc/tree-vect-stmts.cc
> > > > > > > @@ -344,9 +344,34 @@ vect_stmt_relevant_p (stmt_vec_info
> > > > stmt_info,
> > > > > > loop_vec_info loop_vinfo,
> > > > > > >    *live_p = false;
> > > > > > >
> > > > > > >    /* cond stmt other than loop exit cond.  */
> > > > > > > -  if (is_ctrl_stmt (stmt_info->stmt)
> > > > > > > -      && STMT_VINFO_TYPE (stmt_info) !=
> > loop_exit_ctrl_vec_info_type)
> > > > > > > -    *relevant = vect_used_in_scope;
> > > >
> > > > how was that ever hit before?  For outer loop processing with outer loop
> > > > vectorization?
> > > >
> > > > > > > +  if (is_ctrl_stmt (stmt_info->stmt))
> > > > > > > +    {
> > > > > > > +      /* Ideally EDGE_LOOP_EXIT would have been set on the exit edge,
> > > > but
> > > > > > > +	 it looks like loop_manip doesn't do that..  So we have to do it
> > > > > > > +	 the hard way.  */
> > > > > > > +      basic_block bb = gimple_bb (stmt_info->stmt);
> > > > > > > +      bool exit_bb = false, early_exit = false;
> > > > > > > +      edge_iterator ei;
> > > > > > > +      edge e;
> > > > > > > +      FOR_EACH_EDGE (e, ei, bb->succs)
> > > > > > > +        if (!flow_bb_inside_loop_p (loop, e->dest))
> > > > > > > +	  {
> > > > > > > +	    exit_bb = true;
> > > > > > > +	    early_exit = loop->vec_loop_iv->src != bb;
> > > > > > > +	    break;
> > > > > > > +	  }
> > > > > > > +
> > > > > > > +      /* We should have processed any exit edge, so an edge not an
> > early
> > > > > > > +	 break must be a loop IV edge.  We need to distinguish
> > between the
> > > > > > > +	 two as we don't want to generate code for the main loop IV.
> > */
> > > > > > > +      if (exit_bb)
> > > > > > > +	{
> > > > > > > +	  if (early_exit)
> > > > > > > +	    *relevant = vect_used_in_scope;
> > > > > > > +	}
> > > >
> > > > I wonder why you can't simply do
> > > >
> > > >          if (is_ctrl_stmt (stmt_info->stmt)
> > > >              && stmt_info->stmt != LOOP_VINFO_COND (loop_info))
> > > >
> > > > ?
> > > >
> > > > > > > +      else if (bb->loop_father == loop)
> > > > > > > +	LOOP_VINFO_GENERAL_CTR_FLOW (loop_vinfo) = true;
> > > >
> > > > so for control flow not exiting the loop you can check
> > > > loop_exits_from_bb_p ().
> > > >
> > > > > > > +    }
> > > > > > >
> > > > > > >    /* changing memory.  */
> > > > > > >    if (gimple_code (stmt_info->stmt) != GIMPLE_PHI)
> > > > > > > @@ -359,6 +384,11 @@ vect_stmt_relevant_p (stmt_vec_info
> > > > stmt_info,
> > > > > > loop_vec_info loop_vinfo,
> > > > > > >  	*relevant = vect_used_in_scope;
> > > > > > >        }
> > > > > > >
> > > > > > > +  auto_vec<edge> exits = get_loop_exit_edges (loop);
> > > > > > > +  auto_bitmap exit_bbs;
> > > > > > > +  for (edge exit : exits)
> > > > > > > +    bitmap_set_bit (exit_bbs, exit->dest->index);
> > > > > > > +
> > > > > > >    /* uses outside the loop.  */
> > > > > > >    FOR_EACH_PHI_OR_STMT_DEF (def_p, stmt_info->stmt, op_iter,
> > > > > > SSA_OP_DEF)
> > > > > > >      {
> > > > > > > @@ -377,7 +407,7 @@ vect_stmt_relevant_p (stmt_vec_info
> > stmt_info,
> > > > > > loop_vec_info loop_vinfo,
> > > > > > >  	      /* We expect all such uses to be in the loop exit phis
> > > > > > >  		 (because of loop closed form)   */
> > > > > > >  	      gcc_assert (gimple_code (USE_STMT (use_p)) ==
> > GIMPLE_PHI);
> > > > > > > -	      gcc_assert (bb == single_exit (loop)->dest);
> > > > > > > +	      gcc_assert (bitmap_bit_p (exit_bbs, bb->index));
> > > >
> > > > That now becomes quite expensive checking already covered by the LC SSA
> > > > verifier so I suggest to simply drop this assert instead.
> > > >
> > > > > > >                *live_p = true;
> > > > > > >  	    }
> > > > > > > @@ -683,6 +713,13 @@ vect_mark_stmts_to_be_vectorized
> > > > > > (loop_vec_info loop_vinfo, bool *fatal)
> > > > > > >  	}
> > > > > > >      }
> > > > > > >
> > > > > > > +  /* Ideally this should be in vect_analyze_loop_form but we haven't
> > > > seen all
> > > > > > > +     the conds yet at that point and there's no quick way to retrieve
> > them.
> > > > */
> > > > > > > +  if (LOOP_VINFO_GENERAL_CTR_FLOW (loop_vinfo))
> > > > > > > +    return opt_result::failure_at (vect_location,
> > > > > > > +				   "not vectorized:"
> > > > > > > +				   " unsupported control flow in
> > loop.\n");
> > > >
> > > > so we didn't do this before?  But see above where I wondered.  So when
> > > > does this hit with early exits and why can't we check for this in
> > > > vect_verify_loop_form?
> > > >
> > > > > > > +
> > > > > > >    /* 2. Process_worklist */
> > > > > > >    while (worklist.length () > 0)
> > > > > > >      {
> > > > > > > @@ -778,6 +815,20 @@ vect_mark_stmts_to_be_vectorized
> > > > > > (loop_vec_info loop_vinfo, bool *fatal)
> > > > > > >  			return res;
> > > > > > >  		    }
> > > > > > >                   }
> > > > > > > +	    }
> > > > > > > +	  else if (gcond *cond = dyn_cast <gcond *> (stmt_vinfo-
> > >stmt))
> > > > > > > +	    {
> > > > > > > +	      enum tree_code rhs_code = gimple_cond_code (cond);
> > > > > > > +	      gcc_assert (TREE_CODE_CLASS (rhs_code) ==
> > tcc_comparison);
> > > > > > > +	      opt_result res
> > > > > > > +		= process_use (stmt_vinfo, gimple_cond_lhs (cond),
> > > > > > > +			       loop_vinfo, relevant, &worklist, false);
> > > > > > > +	      if (!res)
> > > > > > > +		return res;
> > > > > > > +	      res = process_use (stmt_vinfo, gimple_cond_rhs (cond),
> > > > > > > +				loop_vinfo, relevant, &worklist, false);
> > > > > > > +	      if (!res)
> > > > > > > +		return res;
> > > > > > >              }
> > > > > > >  	  else if (gcall *call = dyn_cast <gcall *> (stmt_vinfo->stmt))
> > > > > > >  	    {
> > > > > > > @@ -11919,11 +11970,15 @@ vect_analyze_stmt (vec_info *vinfo,
> > > > > > >  			     node_instance, cost_vec);
> > > > > > >        if (!res)
> > > > > > >  	return res;
> > > > > > > -   }
> > > > > > > +    }
> > > > > > > +
> > > > > > > +  if (is_ctrl_stmt (stmt_info->stmt))
> > > > > > > +    STMT_VINFO_DEF_TYPE (stmt_info) = vect_early_exit_def;
> > > > > > >
> > > > > > >    switch (STMT_VINFO_DEF_TYPE (stmt_info))
> > > > > > >      {
> > > > > > >        case vect_internal_def:
> > > > > > > +      case vect_early_exit_def:
> > > > > > >          break;
> > > > > > >
> > > > > > >        case vect_reduction_def:
> > > > > > > @@ -11956,6 +12011,7 @@ vect_analyze_stmt (vec_info *vinfo,
> > > > > > >      {
> > > > > > >        gcall *call = dyn_cast <gcall *> (stmt_info->stmt);
> > > > > > >        gcc_assert (STMT_VINFO_VECTYPE (stmt_info)
> > > > > > > +		  || gimple_code (stmt_info->stmt) == GIMPLE_COND
> > > > > > >  		  || (call && gimple_call_lhs (call) == NULL_TREE));
> > > > > > >        *need_to_vectorize = true;
> > > > > > >      }
> > > > > > > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> > > > > > > index
> > > > > >
> > > >
> > ec65b65b5910e9cbad0a8c7e83c950b6168b98bf..24a0567a2f23f1b3d8b3
> > > > > > 40baff61d18da8e242dd 100644
> > > > > > > --- a/gcc/tree-vectorizer.h
> > > > > > > +++ b/gcc/tree-vectorizer.h
> > > > > > > @@ -63,6 +63,7 @@ enum vect_def_type {
> > > > > > >    vect_internal_def,
> > > > > > >    vect_induction_def,
> > > > > > >    vect_reduction_def,
> > > > > > > +  vect_early_exit_def,
> > > >
> > > > can you avoid putting this inbetween reduction and double reduction
> > > > please?  Just put it before vect_unknown_def_type.  In fact the COND
> > > > isn't a def ... maybe we should have pattern recogized
> > > >
> > > >  if (a < b) exit;
> > > >
> > > > as
> > > >
> > > >  cond = a < b;
> > > >  if (cond != 0) exit;
> > > >
> > > > so the part that we need to vectorize is more clear.
> > > >
> > > > > > >    vect_double_reduction_def,
> > > > > > >    vect_nested_cycle,
> > > > > > >    vect_first_order_recurrence,
> > > > > > > @@ -876,6 +877,13 @@ public:
> > > > > > >       we need to peel off iterations at the end to form an epilogue loop.
> > */
> > > > > > >    bool peeling_for_niter;
> > > > > > >
> > > > > > > +  /* When the loop has early breaks that we can vectorize we need to
> > > > peel
> > > > > > > +     the loop for the break finding loop.  */
> > > > > > > +  bool early_breaks;
> > > > > > > +
> > > > > > > +  /* When the loop has a non-early break control flow inside.  */
> > > > > > > +  bool non_break_control_flow;
> > > > > > > +
> > > > > > >    /* List of loop additional IV conditionals found in the loop.  */
> > > > > > >    auto_vec<gcond *> conds;
> > > > > > >
> > > > > > > @@ -985,9 +993,11 @@ public:
> > > > > > >  #define LOOP_VINFO_REDUCTION_CHAINS(L)     (L)-
> > >reduction_chains
> > > > > > >  #define LOOP_VINFO_PEELING_FOR_GAPS(L)     (L)-
> > >peeling_for_gaps
> > > > > > >  #define LOOP_VINFO_PEELING_FOR_NITER(L)    (L)-
> > >peeling_for_niter
> > > > > > > +#define LOOP_VINFO_EARLY_BREAKS(L)         (L)->early_breaks
> > > > > > >  #define LOOP_VINFO_EARLY_BRK_CONFLICT_STMTS(L) (L)-
> > > > > > >early_break_conflict
> > > > > > >  #define LOOP_VINFO_EARLY_BRK_DEST_BB(L)    (L)-
> > > > >early_break_dest_bb
> > > > > > >  #define LOOP_VINFO_EARLY_BRK_VUSES(L)      (L)-
> > >early_break_vuses
> > > > > > > +#define LOOP_VINFO_GENERAL_CTR_FLOW(L)     (L)-
> > > > > > >non_break_control_flow
> > > > > > >  #define LOOP_VINFO_LOOP_CONDS(L)           (L)->conds
> > > > > > >  #define LOOP_VINFO_LOOP_IV_COND(L)         (L)->loop_iv_cond
> > > > > > >  #define LOOP_VINFO_NO_DATA_DEPENDENCIES(L) (L)-
> > > > > > >no_data_dependencies
> > > > > > > @@ -1038,8 +1048,8 @@ public:
> > > > > > >     stack.  */
> > > > > > >  typedef opt_pointer_wrapper <loop_vec_info> opt_loop_vec_info;
> > > > > > >
> > > > > > > -inline loop_vec_info
> > > > > > > -loop_vec_info_for_loop (class loop *loop)
> > > > > > > +static inline loop_vec_info
> > > > > > > +loop_vec_info_for_loop (const class loop *loop)
> > > > > > >  {
> > > > > > >    return (loop_vec_info) loop->aux;
> > > > > > >  }
> > > > > > > @@ -1789,7 +1799,7 @@ is_loop_header_bb_p (basic_block bb)
> > > > > > >  {
> > > > > > >    if (bb == (bb->loop_father)->header)
> > > > > > >      return true;
> > > > > > > -  gcc_checking_assert (EDGE_COUNT (bb->preds) == 1);
> > > > > > > +
> > > > > > >    return false;
> > > > > > >  }
> > > > > > >
> > > > > > > @@ -2176,9 +2186,10 @@ class auto_purge_vect_location
> > > > > > >     in tree-vect-loop-manip.cc.  */
> > > > > > >  extern void vect_set_loop_condition (class loop *, loop_vec_info,
> > > > > > >  				     tree, tree, tree, bool);
> > > > > > > -extern bool slpeel_can_duplicate_loop_p (const class loop *,
> > > > const_edge);
> > > > > > > +extern bool slpeel_can_duplicate_loop_p (const loop_vec_info,
> > > > > > const_edge);
> > > > > > >  class loop *slpeel_tree_duplicate_loop_to_edge_cfg (class loop *,
> > > > > > > -						     class loop *, edge);
> > > > > > > +						    class loop *, edge,
> > bool,
> > > > > > > +						    vec<basic_block> *
> > = NULL);
> > > > > > >  class loop *vect_loop_versioning (loop_vec_info, gimple *);
> > > > > > >  extern class loop *vect_do_peeling (loop_vec_info, tree, tree,
> > > > > > >  				    tree *, tree *, tree *, int, bool, bool,
> > > > > > > diff --git a/gcc/tree-vectorizer.cc b/gcc/tree-vectorizer.cc
> > > > > > > index
> > > > > >
> > > >
> > a048e9d89178a37455bd7b83ab0f2a238a4ce69e..0dc5479dc92058b6c70c
> > > > > > 67f29f5dc9a8d72235f4 100644
> > > > > > > --- a/gcc/tree-vectorizer.cc
> > > > > > > +++ b/gcc/tree-vectorizer.cc
> > > > > > > @@ -1379,7 +1379,9 @@ pass_vectorize::execute (function *fun)
> > > > > > >  	 predicates that need to be shared for optimal predicate usage.
> > > > > > >  	 However reassoc will re-order them and prevent CSE from
> > working
> > > > > > >  	 as it should.  CSE only the loop body, not the entry.  */
> > > > > > > -      bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
> > > > > > > +      auto_vec<edge> exits = get_loop_exit_edges (loop);
> > > >
> > > > seeing this more and more I think we want a simple way to iterate over
> > > > all exits without copying to a vector when we have them recorded.  My
> > > > C++ fu is too limited to support
> > > >
> > > >   for (auto exit : recorded_exits (loop))
> > > >     ...
> > > >
> > > > (maybe that's enough for somebody to jump onto this ;))
> > > >
> > > > Don't treat all review comments as change orders, but it should be clear
> > > > the code isn't 100% obvious.  Maybe the patch can be simplified by
> > > > splitting out the LC SSA cleanup parts.
> > > >
> > > > Thanks,
> > > > Richard.
> > > >
> > > > > > > +      for (edge exit : exits)
> > > > > > > +	bitmap_set_bit (exit_bbs, exit->dest->index);
> > > > > > >
> > > > > > >        edge entry = EDGE_PRED (loop_preheader_edge (loop)->src, 0);
> > > > > > >        do_rpo_vn (fun, entry, exit_bbs);
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > > > --
> > > > > > Richard Biener <rguenther@suse.de>
> > > > > > SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461
> > > > > > Nuernberg,
> > > > > > Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien
> > > > > > Moerman;
> > > > > > HRB 36809 (AG Nuernberg)
> > > > >
> > >
> > 
> > --
> > Richard Biener <rguenther@suse.de>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG
> > Nuernberg)
>
  
Tamar Christina Oct. 23, 2023, 8:21 p.m. UTC | #10
> -----Original Message-----
> From: Richard Biener <rguenther@suse.de>
> Sent: Friday, July 14, 2023 2:35 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; jlaw@ventanamicro.com
> Subject: RE: [PATCH 12/19]middle-end: implement loop peeling and IV
> updates for early break.
> 
> On Thu, 13 Jul 2023, Tamar Christina wrote:
> 
> > > -----Original Message-----
> > > From: Richard Biener <rguenther@suse.de>
> > > Sent: Thursday, July 13, 2023 6:31 PM
> > > To: Tamar Christina <Tamar.Christina@arm.com>
> > > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>;
> jlaw@ventanamicro.com
> > > Subject: Re: [PATCH 12/19]middle-end: implement loop peeling and IV
> > > updates for early break.
> > >
> > > On Wed, 28 Jun 2023, Tamar Christina wrote:
> > >
> > > > Hi All,
> > > >
> > > > This patch updates the peeling code to maintain LCSSA during peeling.
> > > > The rewrite also naturally takes into account multiple exits and so it didn't
> > > > make sense to split them off.
> > > >
> > > > For the purposes of peeling the only change for multiple exits is that the
> > > > secondary exits are all wired to the start of the new loop preheader when
> > > doing
> > > > epilogue peeling.
> > > >
> > > > When doing prologue peeling the CFG is kept in tact.
> > > >
> > > > For both epilogue and prologue peeling we wire through between the
> two
> > > loops any
> > > > PHI nodes that escape the first loop into the second loop if flow_loops is
> > > > specified.  The reason for this conditionality is because
> > > > slpeel_tree_duplicate_loop_to_edge_cfg is used in the compiler in 3 ways:
> > > >   - prologue peeling
> > > >   - epilogue peeling
> > > >   - loop distribution
> > > >
> > > > for the last case the loops should remain independent, and so not be
> > > connected.
> > > > Because of this propagation of only used phi nodes get_current_def can
> be
> > > used
> > > > to easily find the previous definitions.  However live statements that are
> > > > not used inside the loop itself are not propagated (since if unused, the
> > > moment
> > > > we add the guard in between the two loops the value across the bypass
> edge
> > > can
> > > > be wrong if the loop has been peeled.)
> > > >
> > > > This is dealt with easily enough in find_guard_arg.
> > > >
> > > > For multiple exits, while we are in LCSSA form, and have a correct DOM
> tree,
> > > the
> > > > moment we add the guard block we will change the dominators again.  To
> > > deal with
> > > > this slpeel_tree_duplicate_loop_to_edge_cfg can optionally return the
> blocks
> > > to
> > > > update without having to recompute the list of blocks to update again.
> > > >
> > > > When multiple exits and doing epilogue peeling we will also temporarily
> have
> > > an
> > > > incorrect VUSES chain for the secondary exits as it anticipates the final
> result
> > > > after the VDEFs have been moved.  This will thus be corrected once the
> code
> > > > motion is applied.
> > > >
> > > > Lastly by doing things this way we can remove the helper functions that
> > > > previously did lock step iterations to update things as it went along.
> > > >
> > > > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> > > >
> > > > Ok for master?
> > >
> > > Not sure if I get through all of this in one go - so be prepared that
> > > the rest of the review follows another day.
> >
> > No worries, I appreciate the reviews!
> > Just giving some quick replies for when you continue.
> 
> Continueing.
> 
> > >
> > > > Thanks,
> > > > Tamar
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > > 	* tree-loop-distribution.cc (copy_loop_before): Pass flow_loops =
> > > false.
> > > > 	* tree-ssa-loop-niter.cc (loop_only_exit_p):  Fix bug when exit==null.
> > > > 	* tree-vect-loop-manip.cc (adjust_phi_and_debug_stmts): Add
> > > additional
> > > > 	assert.
> > > > 	(vect_set_loop_condition_normal): Skip modifying loop IV for multiple
> > > > 	exits.
> > > > 	(slpeel_tree_duplicate_loop_to_edge_cfg): Support multiple exit
> > > peeling.
> > > > 	(slpeel_can_duplicate_loop_p): Likewise.
> > > > 	(vect_update_ivs_after_vectorizer): Don't enter this...
> > > > 	(vect_update_ivs_after_early_break): ...but instead enter here.
> > > > 	(find_guard_arg): Update for new peeling code.
> > > > 	(slpeel_update_phi_nodes_for_loops): Remove.
> > > > 	(slpeel_update_phi_nodes_for_guard2): Remove hardcoded edge 0
> > > checks.
> > > > 	(slpeel_update_phi_nodes_for_lcssa): Remove.
> > > > 	(vect_do_peeling): Fix VF for multiple exits and force epilogue.
> > > > 	* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Initialize
> > > > 	non_break_control_flow and early_breaks.
> > > > 	(vect_need_peeling_or_partial_vectors_p): Force partial vector if
> > > > 	multiple exits and VLA.
> > > > 	(vect_analyze_loop_form): Support inner loop multiple exits.
> > > > 	(vect_create_loop_vinfo): Set LOOP_VINFO_EARLY_BREAKS.
> > > > 	(vect_create_epilog_for_reduction):  Update live phi nodes.
> > > > 	(vectorizable_live_operation): Ignore live operations in vector loop
> > > > 	when multiple exits.
> > > > 	(vect_transform_loop): Force unrolling for VF loops and multiple exits.
> > > > 	* tree-vect-stmts.cc (vect_stmt_relevant_p): Analyze ctrl statements.
> > > > 	(vect_mark_stmts_to_be_vectorized): Check for non-exit control flow
> > > and
> > > > 	analyze gcond params.
> > > > 	(vect_analyze_stmt): Support gcond.
> > > > 	* tree-vectorizer.cc (pass_vectorize::execute): Support multiple exits
> > > > 	in RPO pass.
> > > > 	* tree-vectorizer.h (enum vect_def_type): Add vect_early_exit_def.
> > > > 	(LOOP_VINFO_EARLY_BREAKS, LOOP_VINFO_GENERAL_CTR_FLOW):
> > > New.
> > > > 	(loop_vec_info_for_loop): Change to const and static.
> > > > 	(is_loop_header_bb_p): Drop assert.
> > > > 	(slpeel_can_duplicate_loop_p): Update prototype.
> > > > 	(class loop): Add early_breaks and non_break_control_flow.
> > > >
> > > > --- inline copy of patch --
> > > > diff --git a/gcc/tree-loop-distribution.cc b/gcc/tree-loop-distribution.cc
> > > > index
> > >
> 97879498db46dd3c34181ae9aa6e5476004dd5b5..d790ce5fffab3aa3dfc40
> > > d833a968314a4442b9e 100644
> > > > --- a/gcc/tree-loop-distribution.cc
> > > > +++ b/gcc/tree-loop-distribution.cc
> > > > @@ -948,7 +948,7 @@ copy_loop_before (class loop *loop, bool
> > > redirect_lc_phi_defs)
> > > >    edge preheader = loop_preheader_edge (loop);
> > > >
> > > >    initialize_original_copy_tables ();
> > > > -  res = slpeel_tree_duplicate_loop_to_edge_cfg (loop, NULL, preheader);
> > > > +  res = slpeel_tree_duplicate_loop_to_edge_cfg (loop, NULL, preheader,
> > > false);
> > > >    gcc_assert (res != NULL);
> > > >
> > > >    /* When a not last partition is supposed to keep the LC PHIs computed
> > > > diff --git a/gcc/tree-ssa-loop-niter.cc b/gcc/tree-ssa-loop-niter.cc
> > > > index
> > >
> 5d398b67e68c7076760854119590f18b19c622b6..79686f6c4945b7139ba
> > > 377300430c04b7aeefe6c 100644
> > > > --- a/gcc/tree-ssa-loop-niter.cc
> > > > +++ b/gcc/tree-ssa-loop-niter.cc
> > > > @@ -3072,7 +3072,12 @@ loop_only_exit_p (const class loop *loop,
> > > basic_block *body, const_edge exit)
> > > >    gimple_stmt_iterator bsi;
> > > >    unsigned i;
> > > >
> > > > -  if (exit != single_exit (loop))
> > > > +  /* We need to check for alternative exits since exit can be NULL.  */
> > >
> > > You mean we pass in exit == NULL in some cases?  I'm not sure what
> > > the desired behavior in that case is - can you point out the
> > > callers you are fixing here?
> > >
> > > I think we should add gcc_assert (exit != nullptr)
> > >
> > > >    for (i = 0; i < loop->num_nodes; i++)
> > > > diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
> > > > index
> > >
> 6b93fb3f9af8f2bbdf5dec28f0009177aa5171ab..550d7f40002cf0b58f8a92
> > > 7cb150edd7c2aa9999 100644
> > > > --- a/gcc/tree-vect-loop-manip.cc
> > > > +++ b/gcc/tree-vect-loop-manip.cc
> > > > @@ -252,6 +252,9 @@ adjust_phi_and_debug_stmts (gimple
> *update_phi,
> > > edge e, tree new_def)
> > > >  {
> > > >    tree orig_def = PHI_ARG_DEF_FROM_EDGE (update_phi, e);
> > > >
> > > > +  gcc_assert (TREE_CODE (orig_def) != SSA_NAME
> > > > +	      || orig_def != new_def);
> > > > +
> > > >    SET_PHI_ARG_DEF (update_phi, e->dest_idx, new_def);
> > > >
> > > >    if (MAY_HAVE_DEBUG_BIND_STMTS)
> > > > @@ -1292,7 +1295,8 @@ vect_set_loop_condition_normal
> (loop_vec_info
> > > loop_vinfo,
> > > >    gsi_insert_before (&loop_cond_gsi, cond_stmt, GSI_SAME_STMT);
> > > >
> > > >    /* Record the number of latch iterations.  */
> > > > -  if (limit == niters)
> > > > +  if (limit == niters
> > > > +      || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > >      /* Case A: the loop iterates NITERS times.  Subtract one to get the
> > > >         latch count.  */
> > > >      loop->nb_iterations = fold_build2 (MINUS_EXPR, niters_type, niters,
> > > > @@ -1303,7 +1307,13 @@ vect_set_loop_condition_normal
> > > (loop_vec_info loop_vinfo,
> > > >      loop->nb_iterations = fold_build2 (TRUNC_DIV_EXPR, niters_type,
> > > >  				       limit, step);
> > > >
> > > > -  if (final_iv)
> > > > +  /* For multiple exits we've already maintained LCSSA form and handled
> > > > +     the scalar iteration update in the code that deals with the merge
> > > > +     block and its updated guard.  I could move that code here instead
> > > > +     of in vect_update_ivs_after_early_break but I have to still deal
> > > > +     with the updates to the counter `i`.  So for now I'll keep them
> > > > +     together.  */
> > > > +  if (final_iv && !LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > >      {
> > > >        gassign *assign;
> > > >        edge exit = LOOP_VINFO_IV_EXIT (loop_vinfo);
> > > > @@ -1509,11 +1519,19 @@ vec_init_exit_info (class loop *loop)
> > > >     on E which is either the entry or exit of LOOP.  If SCALAR_LOOP is
> > > >     non-NULL, assume LOOP and SCALAR_LOOP are equivalent and copy
> the
> > > >     basic blocks from SCALAR_LOOP instead of LOOP, but to either the
> > > > -   entry or exit of LOOP.  */
> > > > +   entry or exit of LOOP.  If FLOW_LOOPS then connect LOOP to
> > > SCALAR_LOOP as a
> > > > +   continuation.  This is correct for cases where one loop continues from
> the
> > > > +   other like in the vectorizer, but not true for uses in e.g. loop
> distribution
> > > > +   where the loop is duplicated and then modified.
> > > > +
> > >
> > > but for loop distribution the flow also continues?  I'm not sure what you
> > > are refering to here.  Do you by chance have a branch with the patches
> > > installed?
> >
> > Yup, they're at refs/users/tnfchris/heads/gcc-14-early-break in the repo.
> >
> > >
> > > > +   If UPDATED_DOMS is not NULL it is update with the list of basic blocks
> > > whoms
> > > > +   dominators were updated during the peeling.  */
> > > >
> > > >  class loop *
> > > >  slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop,
> > > > -					class loop *scalar_loop, edge e)
> > > > +					class loop *scalar_loop, edge e,
> > > > +					bool flow_loops,
> > > > +					vec<basic_block> *updated_doms)
> > > >  {
> > > >    class loop *new_loop;
> > > >    basic_block *new_bbs, *bbs, *pbbs;
> > > > @@ -1602,6 +1620,19 @@ slpeel_tree_duplicate_loop_to_edge_cfg
> (class
> > > loop *loop,
> > > >    for (unsigned i = (at_exit ? 0 : 1); i < scalar_loop->num_nodes + 1; i++)
> > > >      rename_variables_in_bb (new_bbs[i], duplicate_outer_loop);
> > > >
> > > > +  /* Rename the exit uses.  */
> > > > +  for (edge exit : get_loop_exit_edges (new_loop))
> > > > +    for (auto gsi = gsi_start_phis (exit->dest);
> > > > +	 !gsi_end_p (gsi); gsi_next (&gsi))
> > > > +      {
> > > > +	tree orig_def = PHI_ARG_DEF_FROM_EDGE (gsi.phi (), exit);
> > > > +	rename_use_op (PHI_ARG_DEF_PTR_FROM_EDGE (gsi.phi (), exit));
> > > > +	if (MAY_HAVE_DEBUG_BIND_STMTS)
> > > > +	  adjust_debug_stmts (orig_def, PHI_RESULT (gsi.phi ()), exit->dest);
> > > > +      }
> > > > +
> > > > +  /* This condition happens when the loop has been versioned. e.g. due
> to
> > > ifcvt
> > > > +     versioning the loop.  */
> > > >    if (scalar_loop != loop)
> > > >      {
> > > >        /* If we copied from SCALAR_LOOP rather than LOOP, SSA_NAMEs
> from
> > > > @@ -1616,28 +1647,106 @@ slpeel_tree_duplicate_loop_to_edge_cfg
> > > (class loop *loop,
> > > >  						EDGE_SUCC (loop->latch, 0));
> > > >      }
> > > >
> > > > +  vec<edge> alt_exits = loop->vec_loop_alt_exits;
> > >
> > > So 'e' is not one of alt_exits, right?  I wonder if we can simply
> > > compute the vector from all exits of 'loop' and removing 'e'?
> > >
> > > > +  bool multiple_exits_p = !alt_exits.is_empty ();
> > > > +  auto_vec<basic_block> doms;
> > > > +  class loop *update_loop = NULL;
> > > > +
> > > >    if (at_exit) /* Add the loop copy at exit.  */
> > > >      {
> > > > -      if (scalar_loop != loop)
> > > > +      if (scalar_loop != loop && new_exit->dest != exit_dest)
> > > >  	{
> > > > -	  gphi_iterator gsi;
> > > >  	  new_exit = redirect_edge_and_branch (new_exit, exit_dest);
> > > > +	  flush_pending_stmts (new_exit);
> > > > +	}
> > > >
> > > > -	  for (gsi = gsi_start_phis (exit_dest); !gsi_end_p (gsi);
> > > > -	       gsi_next (&gsi))
> > > > +      auto loop_exits = get_loop_exit_edges (loop);
> > > > +      for (edge exit : loop_exits)
> > > > +	redirect_edge_and_branch (exit, new_preheader);
> > > > +
> > > > +
> > >
> > > one line vertical space too much
> > >
> > > > +      /* Copy the current loop LC PHI nodes between the original loop exit
> > > > +	 block and the new loop header.  This allows us to later split the
> > > > +	 preheader block and still find the right LC nodes.  */
> > > > +      edge latch_new = single_succ_edge (new_preheader);
> > > > +      edge latch_old = loop_latch_edge (loop);
> > > > +      hash_set <tree> lcssa_vars;
> > > > +      for (auto gsi_from = gsi_start_phis (latch_old->dest),
> > >
> > > so that's loop->header (and makes it more clear which PHI nodes you are
> > > looking at)
> 
> 
> So I'm now in a debug session - I think that conceptually it would
> make more sense to create the LC PHI nodes that are present at the
> old exit destination in the new preheader _before_ you redirect them
> above and then flush_pending_stmts after redirecting, that should deal
> with the copying.
> 
> Now, your copying actually iterates over all PHIs in the loop _header_,
> so it doesn't actually copy LC PHI nodes but possibly creates additional
> ones.  The intent does seem to do this since you want a different value
> on those edges for all but the main loop exit.  But then the
> overall comments should better reflect that and maybe you should
> do what I suggested anyway and have this loop alter only the alternate
> exit LC PHIs?
> 
> If you don't flush_pending_stmts on an edge after redirecting you
> should call redirect_edge_var_map_clear (edge), otherwise the stale
> info might break things later.
> 
> > > > +	   gsi_to = gsi_start_phis (latch_new->dest);
> > >
> > > likewise new_loop->header
> > >
> > > > +	   flow_loops && !gsi_end_p (gsi_from) && !gsi_end_p (gsi_to);
> > > > +	   gsi_next (&gsi_from), gsi_next (&gsi_to))
> > > > +	{
> > > > +	  gimple *from_phi = gsi_stmt (gsi_from);
> > > > +	  gimple *to_phi = gsi_stmt (gsi_to);
> > > > +	  tree new_arg = PHI_ARG_DEF_FROM_EDGE (from_phi, latch_old);
> > > > +	  /* In all cases, even in early break situations we're only
> > > > +	     interested in the number of fully executed loop iters.  As such
> > > > +	     we discard any partially done iteration.  So we simply propagate
> > > > +	     the phi nodes from the latch to the merge block.  */
> > > > +	  tree new_res = copy_ssa_name (gimple_phi_result (from_phi));
> > > > +	  gphi *lcssa_phi = create_phi_node (new_res, e->dest);
> > > > +
> > > > +	  lcssa_vars.add (new_arg);
> > > > +
> > > > +	  /* Main loop exit should use the final iter value.  */
> > > > +	  add_phi_arg (lcssa_phi, new_arg, loop->vec_loop_iv,
> > > UNKNOWN_LOCATION);
> > >
> > > above you are creating the PHI node at e->dest but here add the PHI arg to
> > > loop->vec_loop_iv - that's 'e' here, no?  Consistency makes it easier
> > > to follow.  I _think_ this code doesn't need to know about the "special"
> > > edge.
> > >
> > > > +
> > > > +	  /* All other exits use the previous iters.  */
> > > > +	  for (edge e : alt_exits)
> > > > +	    add_phi_arg (lcssa_phi, gimple_phi_result (from_phi), e,
> > > > +			 UNKNOWN_LOCATION);
> > > > +
> > > > +	  adjust_phi_and_debug_stmts (to_phi, latch_new, new_res);
> > > > +	}
> > > > +
> > > > +      /* Copy over any live SSA vars that may not have been materialized in
> > > the
> > > > +	 loops themselves but would be in the exit block.  However when the
> > > live
> > > > +	 value is not used inside the loop then we don't need to do this,  if we
> > > do
> > > > +	 then when we split the guard block the branch edge can end up
> > > containing the
> > > > +	 wrong reference,  particularly if it shares an edge with something that
> > > has
> > > > +	 bypassed the loop.  This is not something peeling can check so we
> > > need to
> > > > +	 anticipate the usage of the live variable here.  */
> > > > +      auto exit_map = redirect_edge_var_map_vector (exit);
> > >
> > > Hmm, did I use that in my attemt to refactor things? ...
> >
> > Indeed, I didn't always use it, but found it was the best way to deal with the
> > variables being live in various BB after the loop.
> 
> As said this whole piece of code is possibly more complicated than
> necessary.  First copying/creating the PHI nodes that are present
> at the exit (the old LC PHI nodes), then redirecting edges and flushing
> stmts should deal with half of this.
> 
> > >
> > > > +      if (exit_map)
> > > > +        for (auto vm : exit_map)
> > > > +	{
> > > > +	  if (lcssa_vars.contains (vm.def)
> > > > +	      || TREE_CODE (vm.def) != SSA_NAME)
> > >
> > > the latter check is cheaper so it should come first
> > >
> > > > +	    continue;
> > > > +
> > > > +	  imm_use_iterator imm_iter;
> > > > +	  use_operand_p use_p;
> > > > +	  bool use_in_loop = false;
> > > > +
> > > > +	  FOR_EACH_IMM_USE_FAST (use_p, imm_iter, vm.def)
> > > >  	    {
> > > > -	      gphi *phi = gsi.phi ();
> > > > -	      tree orig_arg = PHI_ARG_DEF_FROM_EDGE (phi, e);
> > > > -	      location_t orig_locus
> > > > -		= gimple_phi_arg_location_from_edge (phi, e);
> > > > +	      basic_block bb = gimple_bb (USE_STMT (use_p));
> > > > +	      if (flow_bb_inside_loop_p (loop, bb)
> > > > +		  && !gimple_vuse (USE_STMT (use_p)))
> 
> what's this gimple_vuse check?  I see now for vect-early-break_17.c this
> code triggers and ignores
> 
>   vect_b[i_18] = _2;
> 
> > > > +		{
> > > > +		  use_in_loop = true;
> > > > +		  break;
> > > > +		}
> > > > +	    }
> > > >
> > > > -	      add_phi_arg (phi, orig_arg, new_exit, orig_locus);
> > > > +	  if (!use_in_loop)
> > > > +	    {
> > > > +	       /* Do a final check to see if it's perhaps defined in the loop.  This
> > > > +		  mirrors the relevancy analysis's used_outside_scope.  */
> > > > +	      gimple *stmt = SSA_NAME_DEF_STMT (vm.def);
> > > > +	      if (!stmt || !flow_bb_inside_loop_p (loop, gimple_bb (stmt)))
> > > > +		continue;
> > > >  	    }
> 
> since the def was on a LC PHI the def should always be defined inside the
> loop.
> 
> > > > +
> > > > +	  tree new_res = copy_ssa_name (vm.result);
> > > > +	  gphi *lcssa_phi = create_phi_node (new_res, e->dest);
> > > > +	  for (edge exit : loop_exits)
> > > > +	     add_phi_arg (lcssa_phi, vm.def, exit, vm.locus);
> > >
> > > not sure what you are doing above - I guess I have to play with it
> > > in a debug session.
> >
> > Yeah if you comment it out one of the testcases should fail.
> 
> using new_preheader instead of e->dest would make things clearer.
> 
> You are now adding the same arg to every exit (you've just queried the
> main exit redirect_edge_var_map_vector).
> 
> OK, so I think I understand what you're doing.  If I understand
> correctly we know that when we exit the main loop via one of the
> early exits we are definitely going to enter the epilog but when
> we take the main exit we might not.
> 
> Looking at the CFG we create currently this isn't reflected and
> this complicates this PHI node updating.  What I'd try to do
> is leave redirecting the alternate exits until after
> slpeel_tree_duplicate_loop_to_edge_cfg finished which probably
> means leaving it almost unchanged besides the LC SSA maintaining
> changes.  After that for the multi-exit case split the
> epilog preheader edge and redirect all the alternate exits to the
> new preheader.  So the CFG becomes
> 
>                  <original loop>
>                 /      |
>                /    <main exit w/ original LC PHI>
>               /      if (epilog)
>    alt exits /        /  \
>             /        /    loop around
>             |       /
>            preheader with "header" PHIs
>               |
>           <epilog>
> 
> note you need the header PHIs also on the main exit path but you
> only need the loop end PHIs there.
> 
> It seems so that at least currently the order of things makes
> them more complicated than necessary.
> 

Hi,

I'm re-spinning this particular change and would like some clarification.

I assume the reason you prefer flow is because the updates on the "normal" pre-header remains
the same as today. And the alternate exits are easy since the iteration count is just VF.

I think that should be easy enough to do and I guess this makes it easier since I can re-use
vect_update_ivs_after_vectorizer as is. Or at the very least needs only a small update.

What I require some advice on is how to handle the alt_exits in order to create `preheader with "header" PHIs`.

With the last refactoring we've started using redirect_edge_and_branch to redirect the exits. But for this to work
all the exits need to have the PHI nodes in the same order.  In the above scheme the alt exits need to.

However every exit can have varying amount of PHI nodes due to live values.

Consider:

#define N 1024
unsigned vect_a[N];
unsigned vect_b[N];
 
unsigned test4(unsigned x, unsigned y)
{
 unsigned ret = 0;
 unsigned sum = 0;
 for (int i = 0; i < N; i++)
 {
   vect_b[i] = x + i;
   if (vect_a[i] > x)
     return vect_a[i];
  
  vect_b[i] += x + i;
  if (vect_a[i] > y)
      return sum;
  sum += vect_a[i];
  vect_a[i] = x;
 }
 return ret;
}

The CFG for this looks like https://tinyurl.com/2p9p2he8

The first exit contains: _10, .MEM
The second exit contains: .MEM, sum
The main exit contains: .MEM

This goes wrong when using redirect_edge_and_branch because alt_exit1 and alt_exit2 have the nodes in different orders.

This patch deals with it by using the rename map to recreate the PHI nodes but I'm not sure how to get this to work with the
redirect_edge_and_branch approach.

Cheers,
Tamar
> > >
> > > >  	}
> > > > -      redirect_edge_and_branch_force (e, new_preheader);
> > > > -      flush_pending_stmts (e);
> > > > +
> > > >        set_immediate_dominator (CDI_DOMINATORS, new_preheader, e-
> >src);
> > > > -      if (was_imm_dom || duplicate_outer_loop)
> > > > +
> > > > +      if ((was_imm_dom || duplicate_outer_loop) && !multiple_exits_p)
> > > >  	set_immediate_dominator (CDI_DOMINATORS, exit_dest, new_exit-
> > > >src);
> > > >
> > > >        /* And remove the non-necessary forwarder again.  Keep the other
> > > > @@ -1647,9 +1756,42 @@ slpeel_tree_duplicate_loop_to_edge_cfg
> (class
> > > loop *loop,
> > > >        delete_basic_block (preheader);
> > > >        set_immediate_dominator (CDI_DOMINATORS, scalar_loop->header,
> > > >  			       loop_preheader_edge (scalar_loop)->src);
> > > > +
> > > > +      /* Finally after wiring the new epilogue we need to update its main
> exit
> > > > +	 to the original function exit we recorded.  Other exits are already
> > > > +	 correct.  */
> > > > +      if (multiple_exits_p)
> > > > +	{
> > > > +	  for (edge e : get_loop_exit_edges (loop))
> > > > +	    doms.safe_push (e->dest);
> > > > +	  update_loop = new_loop;
> > > > +	  doms.safe_push (exit_dest);
> > > > +
> > > > +	  /* Likely a fall-through edge, so update if needed.  */
> > > > +	  if (single_succ_p (exit_dest))
> > > > +	    doms.safe_push (single_succ (exit_dest));
> > > > +	}
> > > >      }
> > > >    else /* Add the copy at entry.  */
> > > >      {
> > > > +      /* Copy the current loop LC PHI nodes between the original loop exit
> > > > +	 block and the new loop header.  This allows us to later split the
> > > > +	 preheader block and still find the right LC nodes.  */
> > > > +      edge old_latch_loop = loop_latch_edge (loop);
> > > > +      edge old_latch_init = loop_preheader_edge (loop);
> > > > +      edge new_latch_loop = loop_latch_edge (new_loop);
> > > > +      edge new_latch_init = loop_preheader_edge (new_loop);
> > > > +      for (auto gsi_from = gsi_start_phis (new_latch_init->dest),
> > >
> > > see above
> > >
> > > > +	   gsi_to = gsi_start_phis (old_latch_loop->dest);
> > > > +	   flow_loops && !gsi_end_p (gsi_from) && !gsi_end_p (gsi_to);
> > > > +	   gsi_next (&gsi_from), gsi_next (&gsi_to))
> > > > +	{
> > > > +	  gimple *from_phi = gsi_stmt (gsi_from);
> > > > +	  gimple *to_phi = gsi_stmt (gsi_to);
> > > > +	  tree new_arg = PHI_ARG_DEF_FROM_EDGE (from_phi,
> > > new_latch_loop);
> > > > +	  adjust_phi_and_debug_stmts (to_phi, old_latch_init, new_arg);
> > > > +	}
> > > > +
> > > >        if (scalar_loop != loop)
> > > >  	{
> > > >  	  /* Remove the non-necessary forwarder of scalar_loop again.  */
> > > > @@ -1677,31 +1819,36 @@ slpeel_tree_duplicate_loop_to_edge_cfg
> (class
> > > loop *loop,
> > > >        delete_basic_block (new_preheader);
> > > >        set_immediate_dominator (CDI_DOMINATORS, new_loop->header,
> > > >  			       loop_preheader_edge (new_loop)->src);
> > > > +
> > > > +      if (multiple_exits_p)
> > > > +	update_loop = loop;
> > > >      }
> > > >
> > > > -  if (scalar_loop != loop)
> > > > +  if (multiple_exits_p)
> > > >      {
> > > > -      /* Update new_loop->header PHIs, so that on the preheader
> > > > -	 edge they are the ones from loop rather than scalar_loop.  */
> > > > -      gphi_iterator gsi_orig, gsi_new;
> > > > -      edge orig_e = loop_preheader_edge (loop);
> > > > -      edge new_e = loop_preheader_edge (new_loop);
> > > > -
> > > > -      for (gsi_orig = gsi_start_phis (loop->header),
> > > > -	   gsi_new = gsi_start_phis (new_loop->header);
> > > > -	   !gsi_end_p (gsi_orig) && !gsi_end_p (gsi_new);
> > > > -	   gsi_next (&gsi_orig), gsi_next (&gsi_new))
> > > > +      for (edge e : get_loop_exit_edges (update_loop))
> > > >  	{
> > > > -	  gphi *orig_phi = gsi_orig.phi ();
> > > > -	  gphi *new_phi = gsi_new.phi ();
> > > > -	  tree orig_arg = PHI_ARG_DEF_FROM_EDGE (orig_phi, orig_e);
> > > > -	  location_t orig_locus
> > > > -	    = gimple_phi_arg_location_from_edge (orig_phi, orig_e);
> > > > -
> > > > -	  add_phi_arg (new_phi, orig_arg, new_e, orig_locus);
> > > > +	  edge ex;
> > > > +	  edge_iterator ei;
> > > > +	  FOR_EACH_EDGE (ex, ei, e->dest->succs)
> > > > +	    {
> > > > +	      /* Find the first non-fallthrough block as fall-throughs can't
> > > > +		 dominate other blocks.  */
> > > > +	      while ((ex->flags & EDGE_FALLTHRU)
> 
> For the prologue peeling any early exit we take would skip all other
> loops so we can simply leave them and their LC PHI nodes in place.
> We need extra PHIs only on the path to the main vector loop.  I
> think the comment isn't accurately reflecting what we do.  In
> fact we do not add any LC PHI nodes here but simply adjust the
> main loop header PHI arguments?
> 
> > > I don't think EDGE_FALLTHRU is set correctly, what's wrong with
> > > just using single_succ_p here?  A fallthru edge src dominates the
> > > fallthru edge dest, so the sentence above doesn't make sense.
> >
> > I wanted to say, that the immediate dominator of a block is never
> > an fall through block.  At least from what I understood from how
> > the dominators are calculated in the code, though may have missed
> > something.
> 
>  BB1
>   |
>  BB2
>   |
>  BB3
> 
> here the immediate dominator of BB3 is BB2 and that of BB2 is BB1.
> 
> > >
> > > > +		     && single_succ_p (ex->dest))
> > > > +		{
> > > > +		  doms.safe_push (ex->dest);
> > > > +		  ex = single_succ_edge (ex->dest);
> > > > +		}
> > > > +	      doms.safe_push (ex->dest);
> > > > +	    }
> > > > +	  doms.safe_push (e->dest);
> > > >  	}
> > > > -    }
> > > >
> > > > +      iterate_fix_dominators (CDI_DOMINATORS, doms, false);
> > > > +      if (updated_doms)
> > > > +	updated_doms->safe_splice (doms);
> > > > +    }
> > > >    free (new_bbs);
> > > >    free (bbs);
> > > >
> > > > @@ -1777,6 +1924,9 @@ slpeel_can_duplicate_loop_p (const
> > > loop_vec_info loop_vinfo, const_edge e)
> > > >    gimple_stmt_iterator loop_exit_gsi = gsi_last_bb (exit_e->src);
> > > >    unsigned int num_bb = loop->inner? 5 : 2;
> > > >
> > > > +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > +    num_bb += LOOP_VINFO_ALT_EXITS (loop_vinfo).length ();
> > > > +
> > >
> > > I think checking the number of BBs is odd, I don't remember anything
> > > in slpeel is specifically tied to that?  I think we can simply drop
> > > this or do you remember anything that would depend on ->num_nodes
> > > being only exactly 5 or 2?
> >
> > Never actually seemed to require it, but they're used as some check to
> > see if there are unexpected control flow in the loop.
> >
> > i.e. this would say no if you have an if statement in the loop that wasn't
> > converted.  The other part of this and the accompanying explanation is in
> > vect_analyze_loop_form.  In the patch series I had to remove the hard
> > num_nodes == 2 check from there because number of nodes restricted
> > things too much.  If you have an empty fall through block, which seems to
> > happen often between the main exit and the latch block then we'd not
> > vectorize.
> >
> > So instead I now rejects loops after analyzing the gcond.  So think this check
> > can go/needs to be different.
> 
> Lets remove it from this function then.
> 
> > >
> > > >    /* All loops have an outer scope; the only case loop->outer is NULL is for
> > > >       the function itself.  */
> > > >    if (!loop_outer (loop)
> > > > @@ -2044,6 +2194,11 @@ vect_update_ivs_after_vectorizer
> > > (loop_vec_info loop_vinfo,
> > > >    class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > > >    basic_block update_bb = update_e->dest;
> > > >
> > > > +  /* For early exits we'll update the IVs in
> > > > +     vect_update_ivs_after_early_break.  */
> > > > +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > +    return;
> > > > +
> > > >    basic_block exit_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
> > > >
> > > >    /* Make sure there exists a single-predecessor exit bb:  */
> > > > @@ -2131,6 +2286,208 @@ vect_update_ivs_after_vectorizer
> > > (loop_vec_info loop_vinfo,
> > > >        /* Fix phi expressions in the successor bb.  */
> > > >        adjust_phi_and_debug_stmts (phi1, update_e, ni_name);
> > > >      }
> > > > +  return;
> > >
> > > we don't usually place a return at the end of void functions
> > >
> > > > +}
> > > > +
> > > > +/*   Function vect_update_ivs_after_early_break.
> > > > +
> > > > +     "Advance" the induction variables of LOOP to the value they should
> take
> > > > +     after the execution of LOOP.  This is currently necessary because the
> > > > +     vectorizer does not handle induction variables that are used after the
> > > > +     loop.  Such a situation occurs when the last iterations of LOOP are
> > > > +     peeled, because of the early exit.  With an early exit we always peel
> the
> > > > +     loop.
> > > > +
> > > > +     Input:
> > > > +     - LOOP_VINFO - a loop info structure for the loop that is going to be
> > > > +		    vectorized. The last few iterations of LOOP were peeled.
> > > > +     - LOOP - a loop that is going to be vectorized. The last few iterations
> > > > +	      of LOOP were peeled.
> > > > +     - VF - The loop vectorization factor.
> > > > +     - NITERS_ORIG - the number of iterations that LOOP executes (before
> it is
> > > > +		     vectorized). i.e, the number of times the ivs should be
> > > > +		     bumped.
> > > > +     - NITERS_VECTOR - The number of iterations that the vector LOOP
> > > executes.
> > > > +     - UPDATE_E - a successor edge of LOOP->exit that is on the (only)
> path
> > > > +		  coming out from LOOP on which there are uses of the LOOP
> > > ivs
> > > > +		  (this is the path from LOOP->exit to epilog_loop->preheader).
> > > > +
> > > > +		  The new definitions of the ivs are placed in LOOP->exit.
> > > > +		  The phi args associated with the edge UPDATE_E in the bb
> > > > +		  UPDATE_E->dest are updated accordingly.
> > > > +
> > > > +     Output:
> > > > +       - If available, the LCSSA phi node for the loop IV temp.
> > > > +
> > > > +     Assumption 1: Like the rest of the vectorizer, this function assumes
> > > > +     a single loop exit that has a single predecessor.
> > > > +
> > > > +     Assumption 2: The phi nodes in the LOOP header and in update_bb
> are
> > > > +     organized in the same order.
> > > > +
> > > > +     Assumption 3: The access function of the ivs is simple enough (see
> > > > +     vect_can_advance_ivs_p).  This assumption will be relaxed in the
> future.
> > > > +
> > > > +     Assumption 4: Exactly one of the successors of LOOP exit-bb is on a
> path
> > > > +     coming out of LOOP on which the ivs of LOOP are used (this is the
> path
> > > > +     that leads to the epilog loop; other paths skip the epilog loop).  This
> > > > +     path starts with the edge UPDATE_E, and its destination (denoted
> > > update_bb)
> > > > +     needs to have its phis updated.
> > > > + */
> > > > +
> > > > +static tree
> > > > +vect_update_ivs_after_early_break (loop_vec_info loop_vinfo, class
> loop *
> > > epilog,
> > > > +				   poly_int64 vf, tree niters_orig,
> > > > +				   tree niters_vector, edge update_e)
> > > > +{
> > > > +  if (!LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > +    return NULL;
> > > > +
> > > > +  gphi_iterator gsi, gsi1;
> > > > +  tree ni_name, ivtmp = NULL;
> > > > +  basic_block update_bb = update_e->dest;
> > > > +  vec<edge> alt_exits = LOOP_VINFO_ALT_EXITS (loop_vinfo);
> > > > +  edge loop_iv = LOOP_VINFO_IV_EXIT (loop_vinfo);
> > > > +  basic_block exit_bb = loop_iv->dest;
> > > > +  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > > > +  gcond *cond = LOOP_VINFO_LOOP_IV_COND (loop_vinfo);
> > > > +
> > > > +  gcc_assert (cond);
> > > > +
> > > > +  for (gsi = gsi_start_phis (loop->header), gsi1 = gsi_start_phis
> (update_bb);
> > > > +       !gsi_end_p (gsi) && !gsi_end_p (gsi1);
> > > > +       gsi_next (&gsi), gsi_next (&gsi1))
> > > > +    {
> > > > +      tree init_expr, final_expr, step_expr;
> > > > +      tree type;
> > > > +      tree var, ni, off;
> > > > +      gimple_stmt_iterator last_gsi;
> > > > +
> > > > +      gphi *phi = gsi1.phi ();
> > > > +      tree phi_ssa = PHI_ARG_DEF_FROM_EDGE (phi,
> loop_preheader_edge
> > > (epilog));
> > >
> > > I'm confused about the setup.  update_bb looks like the block with the
> > > loop-closed PHI nodes of 'loop' and the exit (update_e)?  How does
> > > loop_preheader_edge (epilog) come into play here?  That would feed into
> > > epilog->header PHIs?!
> >
> > We can't query the type of the phis in the block with the LC PHI nodes, so the
> > Typical pattern seems to be that we iterate over a block that's part of the
> loop
> > and that would have the PHIs in the same order, just so we can get to the
> > stmt_vec_info.
> >
> > >
> > > It would be nice to name 'gsi[1]', 'update_e' and 'update_bb' in a
> > > better way?  Is update_bb really epilog->header?!
> > >
> > > We're missing checking in PHI_ARG_DEF_FROM_EDGE, namely that
> > > E->dest == gimple_bb (PHI) - we're just using E->dest_idx there
> > > which "works" even for totally unrelated edges.
> > >
> > > > +      gphi *phi1 = dyn_cast <gphi *> (SSA_NAME_DEF_STMT (phi_ssa));
> > > > +      if (!phi1)
> > >
> > > shouldn't that be an assert?
> > >
> > > > +	continue;
> > > > +      stmt_vec_info phi_info = loop_vinfo->lookup_stmt (gsi.phi ());
> > > > +      if (dump_enabled_p ())
> > > > +	dump_printf_loc (MSG_NOTE, vect_location,
> > > > +			 "vect_update_ivs_after_early_break: phi: %G",
> > > > +			 (gimple *)phi);
> > > > +
> > > > +      /* Skip reduction and virtual phis.  */
> > > > +      if (!iv_phi_p (phi_info))
> > > > +	{
> > > > +	  if (dump_enabled_p ())
> > > > +	    dump_printf_loc (MSG_NOTE, vect_location,
> > > > +			     "reduc or virtual phi. skip.\n");
> > > > +	  continue;
> > > > +	}
> > > > +
> > > > +      /* For multiple exits where we handle early exits we need to carry on
> > > > +	 with the previous IV as loop iteration was not done because we exited
> > > > +	 early.  As such just grab the original IV.  */
> > > > +      phi_ssa = PHI_ARG_DEF_FROM_EDGE (gsi.phi (), loop_latch_edge
> > > (loop));
> > >
> > > but this should be taken care of by LC SSA?
> >
> > It is, the comment is probably missing details, this part just scales the
> counter
> > from VF to scalar counts.  It's just a reminder that this scaling is done
> differently
> > from normal single exit vectorization.
> >
> > >
> > > OK, have to continue tomorrow from here.
> >
> > Cheers, Thank you!
> >
> > Tamar
> >
> > >
> > > Richard.
> > >
> > > > +      if (gimple_cond_lhs (cond) != phi_ssa
> > > > +	  && gimple_cond_rhs (cond) != phi_ssa)
> 
> so this is a way to avoid touching the main IV?  Looks a bit fragile to
> me.  Hmm, we're iterating over the main loop header PHIs here?
> Can't you check, say, the relevancy of the PHI node instead?  Though
> it might also be used as induction.  Can't it be used as alternate
> exit like
> 
>   for (i)
>    {
>      if (i & bit)
>        break;
>    }
> 
> and would we need to adjust 'i' then?
> 
> > > > +	{
> > > > +	  type = TREE_TYPE (gimple_phi_result (phi));
> > > > +	  step_expr = STMT_VINFO_LOOP_PHI_EVOLUTION_PART (phi_info);
> > > > +	  step_expr = unshare_expr (step_expr);
> > > > +
> > > > +	  /* We previously generated the new merged phi in the same BB as
> > > the
> > > > +	     guard.  So use that to perform the scaling on rather than the
> > > > +	     normal loop phi which don't take the early breaks into account.  */
> > > > +	  final_expr = gimple_phi_result (phi1);
> > > > +	  init_expr = PHI_ARG_DEF_FROM_EDGE (gsi.phi (),
> > > loop_preheader_edge (loop));
> > > > +
> > > > +	  tree stype = TREE_TYPE (step_expr);
> > > > +	  /* For early break the final loop IV is:
> > > > +	     init + (final - init) * vf which takes into account peeling
> > > > +	     values and non-single steps.  */
> > > > +	  off = fold_build2 (MINUS_EXPR, stype,
> > > > +			     fold_convert (stype, final_expr),
> > > > +			     fold_convert (stype, init_expr));
> > > > +	  /* Now adjust for VF to get the final iteration value.  */
> > > > +	  off = fold_build2 (MULT_EXPR, stype, off, build_int_cst (stype, vf));
> > > > +
> > > > +	  /* Adjust the value with the offset.  */
> > > > +	  if (POINTER_TYPE_P (type))
> > > > +	    ni = fold_build_pointer_plus (init_expr, off);
> > > > +	  else
> > > > +	    ni = fold_convert (type,
> > > > +			       fold_build2 (PLUS_EXPR, stype,
> > > > +					    fold_convert (stype, init_expr),
> > > > +					    off));
> > > > +	  var = create_tmp_var (type, "tmp");
> 
> so how does the non-early break code deal with updating inductions?
> And how do you avoid altering this when we flow in from the normal
> exit?  That is, you are updating the value in the epilog loop
> header but don't you need to instead update the value only on
> the alternate exit edges from the main loop (and keep the not
> updated value on the main exit edge)?
> 
> > > > +	  last_gsi = gsi_last_bb (exit_bb);
> > > > +	  gimple_seq new_stmts = NULL;
> > > > +	  ni_name = force_gimple_operand (ni, &new_stmts, false, var);
> > > > +	  /* Exit_bb shouldn't be empty.  */
> > > > +	  if (!gsi_end_p (last_gsi))
> > > > +	    gsi_insert_seq_after (&last_gsi, new_stmts, GSI_SAME_STMT);
> > > > +	  else
> > > > +	    gsi_insert_seq_before (&last_gsi, new_stmts, GSI_SAME_STMT);
> > > > +
> > > > +	  /* Fix phi expressions in the successor bb.  */
> > > > +	  adjust_phi_and_debug_stmts (phi, update_e, ni_name);
> > > > +	}
> > > > +      else
> > > > +	{
> > > > +	  type = TREE_TYPE (gimple_phi_result (phi));
> > > > +	  step_expr = STMT_VINFO_LOOP_PHI_EVOLUTION_PART (phi_info);
> > > > +	  step_expr = unshare_expr (step_expr);
> > > > +
> > > > +	  /* We previously generated the new merged phi in the same BB as
> > > the
> > > > +	     guard.  So use that to perform the scaling on rather than the
> > > > +	     normal loop phi which don't take the early breaks into account.  */
> > > > +	  init_expr = PHI_ARG_DEF_FROM_EDGE (phi1, loop_preheader_edge
> > > (loop));
> > > > +	  tree stype = TREE_TYPE (step_expr);
> > > > +
> > > > +	  if (vf.is_constant ())
> > > > +	    {
> > > > +	      ni = fold_build2 (MULT_EXPR, stype,
> > > > +				fold_convert (stype,
> > > > +					      niters_vector),
> > > > +				build_int_cst (stype, vf));
> > > > +
> > > > +	      ni = fold_build2 (MINUS_EXPR, stype,
> > > > +				fold_convert (stype,
> > > > +					      niters_orig),
> > > > +				fold_convert (stype, ni));
> > > > +	    }
> > > > +	  else
> > > > +	    /* If the loop's VF isn't constant then the loop must have been
> > > > +	       masked, so at the end of the loop we know we have finished
> > > > +	       the entire loop and found nothing.  */
> > > > +	    ni = build_zero_cst (stype);
> > > > +
> > > > +	  ni = fold_convert (type, ni);
> > > > +	  /* We don't support variable n in this version yet.  */
> > > > +	  gcc_assert (TREE_CODE (ni) == INTEGER_CST);
> > > > +
> > > > +	  var = create_tmp_var (type, "tmp");
> > > > +
> > > > +	  last_gsi = gsi_last_bb (exit_bb);
> > > > +	  gimple_seq new_stmts = NULL;
> > > > +	  ni_name = force_gimple_operand (ni, &new_stmts, false, var);
> > > > +	  /* Exit_bb shouldn't be empty.  */
> > > > +	  if (!gsi_end_p (last_gsi))
> > > > +	    gsi_insert_seq_after (&last_gsi, new_stmts, GSI_SAME_STMT);
> > > > +	  else
> > > > +	    gsi_insert_seq_before (&last_gsi, new_stmts, GSI_SAME_STMT);
> > > > +
> > > > +	  adjust_phi_and_debug_stmts (phi1, loop_iv, ni_name);
> > > > +
> > > > +	  for (edge exit : alt_exits)
> > > > +	    adjust_phi_and_debug_stmts (phi1, exit,
> > > > +					build_int_cst (TREE_TYPE (step_expr),
> > > > +						       vf));
> > > > +	  ivtmp = gimple_phi_result (phi1);
> > > > +	}
> > > > +    }
> > > > +
> > > > +  return ivtmp;
> > > >  }
> > > >
> > > >  /* Return a gimple value containing the misalignment (measured in
> vector
> > > > @@ -2632,137 +2989,34 @@ vect_gen_vector_loop_niters_mult_vf
> > > (loop_vec_info loop_vinfo,
> > > >
> > > >  /* LCSSA_PHI is a lcssa phi of EPILOG loop which is copied from LOOP,
> > > >     this function searches for the corresponding lcssa phi node in exit
> > > > -   bb of LOOP.  If it is found, return the phi result; otherwise return
> > > > -   NULL.  */
> > > > +   bb of LOOP following the LCSSA_EDGE to the exit node.  If it is found,
> > > > +   return the phi result; otherwise return NULL.  */
> > > >
> > > >  static tree
> > > >  find_guard_arg (class loop *loop, class loop *epilog
> ATTRIBUTE_UNUSED,
> > > > -		gphi *lcssa_phi)
> > > > +		gphi *lcssa_phi, int lcssa_edge = 0)
> > > >  {
> > > >    gphi_iterator gsi;
> > > >    edge e = loop->vec_loop_iv;
> > > >
> > > > -  gcc_assert (single_pred_p (e->dest));
> > > >    for (gsi = gsi_start_phis (e->dest); !gsi_end_p (gsi); gsi_next (&gsi))
> > > >      {
> > > >        gphi *phi = gsi.phi ();
> > > > -      if (operand_equal_p (PHI_ARG_DEF (phi, 0),
> > > > -			   PHI_ARG_DEF (lcssa_phi, 0), 0))
> > > > -	return PHI_RESULT (phi);
> > > > -    }
> > > > -  return NULL_TREE;
> > > > -}
> > > > -
> > > > -/* Function slpeel_tree_duplicate_loop_to_edge_cfg duplciates
> > > FIRST/SECOND
> > > > -   from SECOND/FIRST and puts it at the original loop's preheader/exit
> > > > -   edge, the two loops are arranged as below:
> > > > -
> > > > -       preheader_a:
> > > > -     first_loop:
> > > > -       header_a:
> > > > -	 i_1 = PHI<i_0, i_2>;
> > > > -	 ...
> > > > -	 i_2 = i_1 + 1;
> > > > -	 if (cond_a)
> > > > -	   goto latch_a;
> > > > -	 else
> > > > -	   goto between_bb;
> > > > -       latch_a:
> > > > -	 goto header_a;
> > > > -
> > > > -       between_bb:
> > > > -	 ;; i_x = PHI<i_2>;   ;; LCSSA phi node to be created for FIRST,
> > > > -
> > > > -     second_loop:
> > > > -       header_b:
> > > > -	 i_3 = PHI<i_0, i_4>; ;; Use of i_0 to be replaced with i_x,
> > > > -				 or with i_2 if no LCSSA phi is created
> > > > -				 under condition of
> > > CREATE_LCSSA_FOR_IV_PHIS.
> > > > -	 ...
> > > > -	 i_4 = i_3 + 1;
> > > > -	 if (cond_b)
> > > > -	   goto latch_b;
> > > > -	 else
> > > > -	   goto exit_bb;
> > > > -       latch_b:
> > > > -	 goto header_b;
> > > > -
> > > > -       exit_bb:
> > > > -
> > > > -   This function creates loop closed SSA for the first loop; update the
> > > > -   second loop's PHI nodes by replacing argument on incoming edge with
> the
> > > > -   result of newly created lcssa PHI nodes.  IF
> CREATE_LCSSA_FOR_IV_PHIS
> > > > -   is false, Loop closed ssa phis will only be created for non-iv phis for
> > > > -   the first loop.
> > > > -
> > > > -   This function assumes exit bb of the first loop is preheader bb of the
> > > > -   second loop, i.e, between_bb in the example code.  With PHIs updated,
> > > > -   the second loop will execute rest iterations of the first.  */
> > > > -
> > > > -static void
> > > > -slpeel_update_phi_nodes_for_loops (loop_vec_info loop_vinfo,
> > > > -				   class loop *first, class loop *second,
> > > > -				   bool create_lcssa_for_iv_phis)
> > > > -{
> > > > -  gphi_iterator gsi_update, gsi_orig;
> > > > -  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > > > -
> > > > -  edge first_latch_e = EDGE_SUCC (first->latch, 0);
> > > > -  edge second_preheader_e = loop_preheader_edge (second);
> > > > -  basic_block between_bb = single_exit (first)->dest;
> > > > -
> > > > -  gcc_assert (between_bb == second_preheader_e->src);
> > > > -  gcc_assert (single_pred_p (between_bb) && single_succ_p
> (between_bb));
> > > > -  /* Either the first loop or the second is the loop to be vectorized.  */
> > > > -  gcc_assert (loop == first || loop == second);
> > > > -
> > > > -  for (gsi_orig = gsi_start_phis (first->header),
> > > > -       gsi_update = gsi_start_phis (second->header);
> > > > -       !gsi_end_p (gsi_orig) && !gsi_end_p (gsi_update);
> > > > -       gsi_next (&gsi_orig), gsi_next (&gsi_update))
> > > > -    {
> > > > -      gphi *orig_phi = gsi_orig.phi ();
> > > > -      gphi *update_phi = gsi_update.phi ();
> > > > -
> > > > -      tree arg = PHI_ARG_DEF_FROM_EDGE (orig_phi, first_latch_e);
> > > > -      /* Generate lcssa PHI node for the first loop.  */
> > > > -      gphi *vect_phi = (loop == first) ? orig_phi : update_phi;
> > > > -      stmt_vec_info vect_phi_info = loop_vinfo->lookup_stmt (vect_phi);
> > > > -      if (create_lcssa_for_iv_phis || !iv_phi_p (vect_phi_info))
> > > > +      /* Nested loops with multiple exits can have different no# phi node
> > > > +	 arguments between the main loop and epilog as epilog falls to the
> > > > +	 second loop.  */
> > > > +      if (gimple_phi_num_args (phi) > e->dest_idx)
> > > >  	{
> > > > -	  tree new_res = copy_ssa_name (PHI_RESULT (orig_phi));
> > > > -	  gphi *lcssa_phi = create_phi_node (new_res, between_bb);
> > > > -	  add_phi_arg (lcssa_phi, arg, single_exit (first),
> > > UNKNOWN_LOCATION);
> > > > -	  arg = new_res;
> > > > -	}
> > > > -
> > > > -      /* Update PHI node in the second loop by replacing arg on the loop's
> > > > -	 incoming edge.  */
> > > > -      adjust_phi_and_debug_stmts (update_phi, second_preheader_e,
> arg);
> > > > -    }
> > > > -
> > > > -  /* For epilogue peeling we have to make sure to copy all LC PHIs
> > > > -     for correct vectorization of live stmts.  */
> > > > -  if (loop == first)
> > > > -    {
> > > > -      basic_block orig_exit = single_exit (second)->dest;
> > > > -      for (gsi_orig = gsi_start_phis (orig_exit);
> > > > -	   !gsi_end_p (gsi_orig); gsi_next (&gsi_orig))
> > > > -	{
> > > > -	  gphi *orig_phi = gsi_orig.phi ();
> > > > -	  tree orig_arg = PHI_ARG_DEF (orig_phi, 0);
> > > > -	  if (TREE_CODE (orig_arg) != SSA_NAME || virtual_operand_p
> > > (orig_arg))
> > > > -	    continue;
> > > > -
> > > > -	  /* Already created in the above loop.   */
> > > > -	  if (find_guard_arg (first, second, orig_phi))
> > > > +	  tree var = PHI_ARG_DEF (phi, e->dest_idx);
> > > > +	  if (TREE_CODE (var) != SSA_NAME)
> > > >  	    continue;
> > > >
> > > > -	  tree new_res = copy_ssa_name (orig_arg);
> > > > -	  gphi *lcphi = create_phi_node (new_res, between_bb);
> > > > -	  add_phi_arg (lcphi, orig_arg, single_exit (first),
> > > UNKNOWN_LOCATION);
> > > > +	  if (operand_equal_p (get_current_def (var),
> > > > +			       PHI_ARG_DEF (lcssa_phi, lcssa_edge), 0))
> > > > +	    return PHI_RESULT (phi);
> > > >  	}
> > > >      }
> > > > +  return NULL_TREE;
> > > >  }
> > > >
> > > >  /* Function slpeel_add_loop_guard adds guard skipping from the
> beginning
> > > > @@ -2910,13 +3164,11 @@ slpeel_update_phi_nodes_for_guard2
> (class
> > > loop *loop, class loop *epilog,
> > > >    gcc_assert (single_succ_p (merge_bb));
> > > >    edge e = single_succ_edge (merge_bb);
> > > >    basic_block exit_bb = e->dest;
> > > > -  gcc_assert (single_pred_p (exit_bb));
> > > > -  gcc_assert (single_pred (exit_bb) == single_exit (epilog)->dest);
> > > >
> > > >    for (gsi = gsi_start_phis (exit_bb); !gsi_end_p (gsi); gsi_next (&gsi))
> > > >      {
> > > >        gphi *update_phi = gsi.phi ();
> > > > -      tree old_arg = PHI_ARG_DEF (update_phi, 0);
> > > > +      tree old_arg = PHI_ARG_DEF (update_phi, e->dest_idx);
> > > >
> > > >        tree merge_arg = NULL_TREE;
> > > >
> > > > @@ -2928,7 +3180,7 @@ slpeel_update_phi_nodes_for_guard2 (class
> loop
> > > *loop, class loop *epilog,
> > > >        if (!merge_arg)
> > > >  	merge_arg = old_arg;
> > > >
> > > > -      tree guard_arg = find_guard_arg (loop, epilog, update_phi);
> > > > +      tree guard_arg = find_guard_arg (loop, epilog, update_phi, e-
> >dest_idx);
> > > >        /* If the var is live after loop but not a reduction, we simply
> > > >  	 use the old arg.  */
> > > >        if (!guard_arg)
> > > > @@ -2948,21 +3200,6 @@ slpeel_update_phi_nodes_for_guard2 (class
> > > loop *loop, class loop *epilog,
> > > >      }
> > > >  }
> > > >
> > > > -/* EPILOG loop is duplicated from the original loop for vectorizing,
> > > > -   the arg of its loop closed ssa PHI needs to be updated.  */
> > > > -
> > > > -static void
> > > > -slpeel_update_phi_nodes_for_lcssa (class loop *epilog)
> > > > -{
> > > > -  gphi_iterator gsi;
> > > > -  basic_block exit_bb = single_exit (epilog)->dest;
> > > > -
> > > > -  gcc_assert (single_pred_p (exit_bb));
> > > > -  edge e = EDGE_PRED (exit_bb, 0);
> > > > -  for (gsi = gsi_start_phis (exit_bb); !gsi_end_p (gsi); gsi_next (&gsi))
> > > > -    rename_use_op (PHI_ARG_DEF_PTR_FROM_EDGE (gsi.phi (), e));
> > > > -}
> > > > -
> 
> I wonder if we can still split these changes out to before early break
> vect?
> 
> > > >  /* EPILOGUE_VINFO is an epilogue loop that we now know would need
> to
> > > >     iterate exactly CONST_NITERS times.  Make a final decision about
> > > >     whether the epilogue loop should be used, returning true if so.  */
> > > > @@ -3138,6 +3375,14 @@ vect_do_peeling (loop_vec_info loop_vinfo,
> tree
> > > niters, tree nitersm1,
> > > >      bound_epilog += vf - 1;
> > > >    if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
> > > >      bound_epilog += 1;
> > > > +  /* For early breaks the scalar loop needs to execute at most VF times
> > > > +     to find the element that caused the break.  */
> > > > +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > +    {
> > > > +      bound_epilog = vf;
> > > > +      /* Force a scalar epilogue as we can't vectorize the index finding.  */
> > > > +      vect_epilogues = false;
> > > > +    }
> > > >    bool epilog_peeling = maybe_ne (bound_epilog, 0U);
> > > >    poly_uint64 bound_scalar = bound_epilog;
> > > >
> > > > @@ -3297,16 +3542,24 @@ vect_do_peeling (loop_vec_info
> loop_vinfo,
> > > tree niters, tree nitersm1,
> > > >  				  bound_prolog + bound_epilog)
> > > >  		      : (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
> > > >  			 || vect_epilogues));
> > > > +
> > > > +  /* We only support early break vectorization on known bounds at this
> > > time.
> > > > +     This means that if the vector loop can't be entered then we won't
> > > generate
> > > > +     it at all.  So for now force skip_vector off because the additional
> control
> > > > +     flow messes with the BB exits and we've already analyzed them.  */
> > > > + skip_vector = skip_vector && !LOOP_VINFO_EARLY_BREAKS
> (loop_vinfo);
> > > > +
> 
> I think it should be as "easy" as entering the epilog via the block taking
> the regular exit?
> 
> > > >    /* Epilog loop must be executed if the number of iterations for epilog
> > > >       loop is known at compile time, otherwise we need to add a check at
> > > >       the end of vector loop and skip to the end of epilog loop.  */
> > > >    bool skip_epilog = (prolog_peeling < 0
> > > >  		      || !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > > >  		      || !vf.is_constant ());
> > > > -  /* PEELING_FOR_GAPS is special because epilog loop must be executed.
> */
> > > > -  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
> > > > +  /* PEELING_FOR_GAPS and peeling for early breaks are special because
> > > epilog
> > > > +     loop must be executed.  */
> > > > +  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> > > > +      || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > >      skip_epilog = false;
> > > > -
> > > >    class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
> > > >    auto_vec<profile_count> original_counts;
> > > >    basic_block *original_bbs = NULL;
> > > > @@ -3344,13 +3597,13 @@ vect_do_peeling (loop_vec_info
> loop_vinfo,
> > > tree niters, tree nitersm1,
> > > >    if (prolog_peeling)
> > > >      {
> > > >        e = loop_preheader_edge (loop);
> > > > -      gcc_checking_assert (slpeel_can_duplicate_loop_p (loop, e));
> > > > -
> > > > +      gcc_checking_assert (slpeel_can_duplicate_loop_p (loop_vinfo, e));
> > > >        /* Peel prolog and put it on preheader edge of loop.  */
> > > > -      prolog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, scalar_loop,
> e);
> > > > +      prolog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, scalar_loop,
> e,
> > > > +						       true);
> > > >        gcc_assert (prolog);
> > > >        prolog->force_vectorize = false;
> > > > -      slpeel_update_phi_nodes_for_loops (loop_vinfo, prolog, loop, true);
> > > > +
> > > >        first_loop = prolog;
> > > >        reset_original_copy_tables ();
> > > >
> > > > @@ -3420,11 +3673,12 @@ vect_do_peeling (loop_vec_info
> loop_vinfo,
> > > tree niters, tree nitersm1,
> > > >  	 as the transformations mentioned above make less or no sense when
> > > not
> > > >  	 vectorizing.  */
> > > >        epilog = vect_epilogues ? get_loop_copy (loop) : scalar_loop;
> > > > -      epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, epilog, e);
> > > > +      auto_vec<basic_block> doms;
> > > > +      epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, epilog, e,
> true,
> > > > +						       &doms);
> > > >        gcc_assert (epilog);
> > > >
> > > >        epilog->force_vectorize = false;
> > > > -      slpeel_update_phi_nodes_for_loops (loop_vinfo, loop, epilog, false);
> > > >
> > > >        /* Scalar version loop may be preferred.  In this case, add guard
> > > >  	 and skip to epilog.  Note this only happens when the number of
> > > > @@ -3496,6 +3750,54 @@ vect_do_peeling (loop_vec_info loop_vinfo,
> tree
> > > niters, tree nitersm1,
> > > >        vect_update_ivs_after_vectorizer (loop_vinfo, niters_vector_mult_vf,
> > > >  					update_e);
> > > >
> > > > +      /* For early breaks we must create a guard to check how many
> iterations
> > > > +	 of the scalar loop are yet to be performed.  */
> 
> We have this check anyway, no?  In fact don't we know that we always enter
> the epilog (see above)?
> 
> > > > +      if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > +	{
> > > > +	  tree ivtmp =
> > > > +	    vect_update_ivs_after_early_break (loop_vinfo, epilog, vf, niters,
> > > > +					       *niters_vector, update_e);
> > > > +
> > > > +	  gcc_assert (ivtmp);
> > > > +	  tree guard_cond = fold_build2 (EQ_EXPR, boolean_type_node,
> > > > +					 fold_convert (TREE_TYPE (niters),
> > > > +						       ivtmp),
> > > > +					 build_zero_cst (TREE_TYPE (niters)));
> > > > +	  basic_block guard_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
> > > > +
> > > > +	  /* If we had a fallthrough edge, the guard will the threaded through
> > > > +	     and so we may need to find the actual final edge.  */
> > > > +	  edge final_edge = epilog->vec_loop_iv;
> > > > +	  /* slpeel_update_phi_nodes_for_guard2 expects an empty block in
> > > > +	     between the guard and the exit edge.  It only adds new nodes and
> > > > +	     doesn't update existing one in the current scheme.  */
> > > > +	  basic_block guard_to = split_edge (final_edge);
> > > > +	  edge guard_e = slpeel_add_loop_guard (guard_bb, guard_cond,
> > > guard_to,
> > > > +						guard_bb, prob_epilog.invert
> > > (),
> > > > +						irred_flag);
> > > > +	  doms.safe_push (guard_bb);
> > > > +
> > > > +	  iterate_fix_dominators (CDI_DOMINATORS, doms, false);
> > > > +
> > > > +	  /* We must update all the edges from the new guard_bb.  */
> > > > +	  slpeel_update_phi_nodes_for_guard2 (loop, epilog, guard_e,
> > > > +					      final_edge);
> > > > +
> > > > +	  /* If the loop was versioned we'll have an intermediate BB between
> > > > +	     the guard and the exit.  This intermediate block is required
> > > > +	     because in the current scheme of things the guard block phi
> > > > +	     updating can only maintain LCSSA by creating new blocks.  In this
> > > > +	     case we just need to update the uses in this block as well.  */
> > > > +	  if (loop != scalar_loop)
> > > > +	    {
> > > > +	      for (gphi_iterator gsi = gsi_start_phis (guard_to);
> > > > +		   !gsi_end_p (gsi); gsi_next (&gsi))
> > > > +		rename_use_op (PHI_ARG_DEF_PTR_FROM_EDGE (gsi.phi (),
> > > guard_e));
> > > > +	    }
> > > > +
> > > > +	  flush_pending_stmts (guard_e);
> > > > +	}
> > > > +
> > > >        if (skip_epilog)
> > > >  	{
> > > >  	  guard_cond = fold_build2 (EQ_EXPR, boolean_type_node,
> > > > @@ -3520,8 +3822,6 @@ vect_do_peeling (loop_vec_info loop_vinfo,
> tree
> > > niters, tree nitersm1,
> > > >  	    }
> > > >  	  scale_loop_profile (epilog, prob_epilog, 0);
> > > >  	}
> > > > -      else
> > > > -	slpeel_update_phi_nodes_for_lcssa (epilog);
> > > >
> > > >        unsigned HOST_WIDE_INT bound;
> > > >        if (bound_scalar.is_constant (&bound))
> > > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > > > index
> > >
> b4a98de80aa39057fc9b17977dd0e347b4f0fb5d..ab9a2048186f461f5ec49
> > > f21421958e7ee25eada 100644
> > > > --- a/gcc/tree-vect-loop.cc
> > > > +++ b/gcc/tree-vect-loop.cc
> > > > @@ -1007,6 +1007,8 @@ _loop_vec_info::_loop_vec_info (class loop
> > > *loop_in, vec_info_shared *shared)
> > > >      partial_load_store_bias (0),
> > > >      peeling_for_gaps (false),
> > > >      peeling_for_niter (false),
> > > > +    early_breaks (false),
> > > > +    non_break_control_flow (false),
> > > >      no_data_dependencies (false),
> > > >      has_mask_store (false),
> > > >      scalar_loop_scaling (profile_probability::uninitialized ()),
> > > > @@ -1199,6 +1201,14 @@ vect_need_peeling_or_partial_vectors_p
> > > (loop_vec_info loop_vinfo)
> > > >      th = LOOP_VINFO_COST_MODEL_THRESHOLD
> > > (LOOP_VINFO_ORIG_LOOP_INFO
> > > >  					  (loop_vinfo));
> > > >
> > > > +  /* When we have multiple exits and VF is unknown, we must require
> > > partial
> > > > +     vectors because the loop bounds is not a minimum but a maximum.
> > > That is to
> > > > +     say we cannot unpredicate the main loop unless we peel or use partial
> > > > +     vectors in the epilogue.  */
> > > > +  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
> > > > +      && !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ())
> > > > +    return true;
> > > > +
> > > >    if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > > >        && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) >= 0)
> > > >      {
> > > > @@ -1652,12 +1662,12 @@ vect_compute_single_scalar_iteration_cost
> > > (loop_vec_info loop_vinfo)
> > > >    loop_vinfo->scalar_costs->finish_cost (nullptr);
> > > >  }
> > > >
> > > > -
> > > >  /* Function vect_analyze_loop_form.
> > > >
> > > >     Verify that certain CFG restrictions hold, including:
> > > >     - the loop has a pre-header
> > > > -   - the loop has a single entry and exit
> > > > +   - the loop has a single entry
> > > > +   - nested loops can have only a single exit.
> > > >     - the loop exit condition is simple enough
> > > >     - the number of iterations can be analyzed, i.e, a countable loop.  The
> > > >       niter could be analyzed under some assumptions.  */
> > > > @@ -1693,11 +1703,6 @@ vect_analyze_loop_form (class loop *loop,
> > > vect_loop_form_info *info)
> > > >                             |
> > > >                          (exit-bb)  */
> > > >
> > > > -      if (loop->num_nodes != 2)
> > > > -	return opt_result::failure_at (vect_location,
> > > > -				       "not vectorized:"
> > > > -				       " control flow in loop.\n");
> > > > -
> > > >        if (empty_block_p (loop->header))
> > > >  	return opt_result::failure_at (vect_location,
> > > >  				       "not vectorized: empty loop.\n");
> > > > @@ -1768,11 +1773,13 @@ vect_analyze_loop_form (class loop *loop,
> > > vect_loop_form_info *info)
> > > >          dump_printf_loc (MSG_NOTE, vect_location,
> > > >  			 "Considering outer-loop vectorization.\n");
> > > >        info->inner_loop_cond = inner.loop_cond;
> > > > +
> > > > +      if (!single_exit (loop))
> > > > +	return opt_result::failure_at (vect_location,
> > > > +				       "not vectorized: multiple exits.\n");
> > > > +
> > > >      }
> > > >
> > > > -  if (!single_exit (loop))
> > > > -    return opt_result::failure_at (vect_location,
> > > > -				   "not vectorized: multiple exits.\n");
> > > >    if (EDGE_COUNT (loop->header->preds) != 2)
> > > >      return opt_result::failure_at (vect_location,
> > > >  				   "not vectorized:"
> > > > @@ -1788,11 +1795,36 @@ vect_analyze_loop_form (class loop *loop,
> > > vect_loop_form_info *info)
> > > >  				   "not vectorized: latch block not empty.\n");
> > > >
> > > >    /* Make sure the exit is not abnormal.  */
> > > > -  edge e = single_exit (loop);
> > > > -  if (e->flags & EDGE_ABNORMAL)
> > > > -    return opt_result::failure_at (vect_location,
> > > > -				   "not vectorized:"
> > > > -				   " abnormal loop exit edge.\n");
> > > > +  auto_vec<edge> exits = get_loop_exit_edges (loop);
> > > > +  edge nexit = loop->vec_loop_iv;
> > > > +  for (edge e : exits)
> > > > +    {
> > > > +      if (e->flags & EDGE_ABNORMAL)
> > > > +	return opt_result::failure_at (vect_location,
> > > > +				       "not vectorized:"
> > > > +				       " abnormal loop exit edge.\n");
> > > > +      /* Early break BB must be after the main exit BB.  In theory we should
> > > > +	 be able to vectorize the inverse order, but the current flow in the
> > > > +	 the vectorizer always assumes you update successor PHI nodes, not
> > > > +	 preds.  */
> > > > +      if (e != nexit && !dominated_by_p (CDI_DOMINATORS, nexit->src, e-
> > > >src))
> > > > +	return opt_result::failure_at (vect_location,
> > > > +				       "not vectorized:"
> > > > +				       " abnormal loop exit edge order.\n");
> 
> "unsupported loop exit order", but I don't understand the comment.
> 
> > > > +    }
> > > > +
> > > > +  /* We currently only support early exit loops with known bounds.   */
> 
> Btw, why's that?  Is that because we don't support the loop-around edge?
> IMHO this is the most serious limitation (and as said above it should be
> trivial to fix).
> 
> > > > +  if (exits.length () > 1)
> > > > +    {
> > > > +      class tree_niter_desc niter;
> > > > +      if (!number_of_iterations_exit_assumptions (loop, nexit, &niter,
> NULL)
> > > > +	  || chrec_contains_undetermined (niter.niter)
> > > > +	  || !evolution_function_is_constant_p (niter.niter))
> > > > +	return opt_result::failure_at (vect_location,
> > > > +				       "not vectorized:"
> > > > +				       " early breaks only supported on loops"
> > > > +				       " with known iteration bounds.\n");
> > > > +    }
> > > >
> > > >    info->conds
> > > >      = vect_get_loop_niters (loop, &info->assumptions,
> > > > @@ -1866,6 +1898,10 @@ vect_create_loop_vinfo (class loop *loop,
> > > vec_info_shared *shared,
> > > >    LOOP_VINFO_LOOP_CONDS (loop_vinfo).safe_splice (info-
> > > >alt_loop_conds);
> > > >    LOOP_VINFO_LOOP_IV_COND (loop_vinfo) = info->loop_cond;
> > > >
> > > > +  /* Check to see if we're vectorizing multiple exits.  */
> > > > +  LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
> > > > +    = !LOOP_VINFO_LOOP_CONDS (loop_vinfo).is_empty ();
> > > > +
> > > >    if (info->inner_loop_cond)
> > > >      {
> > > >        stmt_vec_info inner_loop_cond_info
> > > > @@ -3070,7 +3106,8 @@ start_over:
> > > >
> > > >    /* If an epilogue loop is required make sure we can create one.  */
> > > >    if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> > > > -      || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo))
> > > > +      || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)
> > > > +      || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > >      {
> > > >        if (dump_enabled_p ())
> > > >          dump_printf_loc (MSG_NOTE, vect_location, "epilog loop
> required\n");
> > > > @@ -5797,7 +5834,7 @@ vect_create_epilog_for_reduction
> (loop_vec_info
> > > loop_vinfo,
> > > >    basic_block exit_bb;
> > > >    tree scalar_dest;
> > > >    tree scalar_type;
> > > > -  gimple *new_phi = NULL, *phi;
> > > > +  gimple *new_phi = NULL, *phi = NULL;
> > > >    gimple_stmt_iterator exit_gsi;
> > > >    tree new_temp = NULL_TREE, new_name, new_scalar_dest;
> > > >    gimple *epilog_stmt = NULL;
> > > > @@ -6039,6 +6076,33 @@ vect_create_epilog_for_reduction
> > > (loop_vec_info loop_vinfo,
> > > >  	  new_def = gimple_convert (&stmts, vectype, new_def);
> > > >  	  reduc_inputs.quick_push (new_def);
> > > >  	}
> > > > +
> > > > +	/* Update the other exits.  */
> > > > +	if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > +	  {
> > > > +	    vec<edge> alt_exits = LOOP_VINFO_ALT_EXITS (loop_vinfo);
> > > > +	    gphi_iterator gsi, gsi1;
> > > > +	    for (edge exit : alt_exits)
> > > > +	      {
> > > > +		/* Find the phi node to propaget into the exit block for each
> > > > +		   exit edge.  */
> > > > +		for (gsi = gsi_start_phis (exit_bb),
> > > > +		     gsi1 = gsi_start_phis (exit->src);
> 
> exit->src == loop->header, right?  I think this won't work for multiple
> alternate exits.  It's probably easier to do this where we create the
> LC PHI node for the reduction result?
> 
> > > > +		     !gsi_end_p (gsi) && !gsi_end_p (gsi1);
> > > > +		     gsi_next (&gsi), gsi_next (&gsi1))
> > > > +		  {
> > > > +		    /* There really should be a function to just get the number
> > > > +		       of phis inside a bb.  */
> > > > +		    if (phi && phi == gsi.phi ())
> > > > +		      {
> > > > +			gphi *phi1 = gsi1.phi ();
> > > > +			SET_PHI_ARG_DEF (phi, exit->dest_idx,
> > > > +					 PHI_RESULT (phi1));
> 
> I think we know the header PHI of a reduction perfectly well, there
> shouldn't be the need to "search" for it.
> 
> > > > +			break;
> > > > +		      }
> > > > +		  }
> > > > +	      }
> > > > +	  }
> > > >        gsi_insert_seq_before (&exit_gsi, stmts, GSI_SAME_STMT);
> > > >      }
> > > >
> > > > @@ -10355,6 +10419,13 @@ vectorizable_live_operation (vec_info
> *vinfo,
> > > >  	   new_tree = lane_extract <vec_lhs', ...>;
> > > >  	   lhs' = new_tree;  */
> > > >
> > > > +      /* When vectorizing an early break, any live statement that is used
> > > > +	 outside of the loop are dead.  The loop will never get to them.
> > > > +	 We could change the liveness value during analysis instead but since
> > > > +	 the below code is invalid anyway just ignore it during codegen.  */
> > > > +      if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > > +	return true;
> 
> But what about the value that's live across the main exit when the
> epilogue is not entered?
> 
> > > > +
> > > >        class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > > >        basic_block exit_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
> > > >        gcc_assert (single_pred_p (exit_bb));
> > > > @@ -11277,7 +11348,7 @@ vect_transform_loop (loop_vec_info
> > > loop_vinfo, gimple *loop_vectorized_call)
> > > >    /* Make sure there exists a single-predecessor exit bb.  Do this before
> > > >       versioning.   */
> > > >    edge e = LOOP_VINFO_IV_EXIT (loop_vinfo);
> > > > -  if (! single_pred_p (e->dest))
> > > > +  if (e && ! single_pred_p (e->dest) && !LOOP_VINFO_EARLY_BREAKS
> > > (loop_vinfo))
> 
> e can be NULL here?  I think we should reject such loops earlier.
> 
> > > >      {
> > > >        split_loop_exit_edge (e, true);
> > > >        if (dump_enabled_p ())
> > > > @@ -11303,7 +11374,7 @@ vect_transform_loop (loop_vec_info
> > > loop_vinfo, gimple *loop_vectorized_call)
> > > >    if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo))
> > > >      {
> > > >        e = single_exit (LOOP_VINFO_SCALAR_LOOP (loop_vinfo));
> > > > -      if (! single_pred_p (e->dest))
> > > > +      if (e && ! single_pred_p (e->dest))
> > > >  	{
> > > >  	  split_loop_exit_edge (e, true);
> > > >  	  if (dump_enabled_p ())
> > > > @@ -11641,7 +11712,8 @@ vect_transform_loop (loop_vec_info
> > > loop_vinfo, gimple *loop_vectorized_call)
> > > >
> > > >    /* Loops vectorized with a variable factor won't benefit from
> > > >       unrolling/peeling.  */
> 
> update the comment?  Why would we unroll a VLA loop with early breaks?
> Or did you mean to use || LOOP_VINFO_EARLY_BREAKS (loop_vinfo)?
> 
> > > > -  if (!vf.is_constant ())
> > > > +  if (!vf.is_constant ()
> > > > +      && !LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
> > > >      {
> > > >        loop->unroll = 1;
> > > >        if (dump_enabled_p ())
> > > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > > > index
> > >
> 87c4353fa5180fcb7f60b192897456cf24f3fdbe..03524e8500ee06df42f82af
> > > e78ee2a7c627be45b 100644
> > > > --- a/gcc/tree-vect-stmts.cc
> > > > +++ b/gcc/tree-vect-stmts.cc
> > > > @@ -344,9 +344,34 @@ vect_stmt_relevant_p (stmt_vec_info
> stmt_info,
> > > loop_vec_info loop_vinfo,
> > > >    *live_p = false;
> > > >
> > > >    /* cond stmt other than loop exit cond.  */
> > > > -  if (is_ctrl_stmt (stmt_info->stmt)
> > > > -      && STMT_VINFO_TYPE (stmt_info) != loop_exit_ctrl_vec_info_type)
> > > > -    *relevant = vect_used_in_scope;
> 
> how was that ever hit before?  For outer loop processing with outer loop
> vectorization?
> 
> > > > +  if (is_ctrl_stmt (stmt_info->stmt))
> > > > +    {
> > > > +      /* Ideally EDGE_LOOP_EXIT would have been set on the exit edge,
> but
> > > > +	 it looks like loop_manip doesn't do that..  So we have to do it
> > > > +	 the hard way.  */
> > > > +      basic_block bb = gimple_bb (stmt_info->stmt);
> > > > +      bool exit_bb = false, early_exit = false;
> > > > +      edge_iterator ei;
> > > > +      edge e;
> > > > +      FOR_EACH_EDGE (e, ei, bb->succs)
> > > > +        if (!flow_bb_inside_loop_p (loop, e->dest))
> > > > +	  {
> > > > +	    exit_bb = true;
> > > > +	    early_exit = loop->vec_loop_iv->src != bb;
> > > > +	    break;
> > > > +	  }
> > > > +
> > > > +      /* We should have processed any exit edge, so an edge not an early
> > > > +	 break must be a loop IV edge.  We need to distinguish between the
> > > > +	 two as we don't want to generate code for the main loop IV.  */
> > > > +      if (exit_bb)
> > > > +	{
> > > > +	  if (early_exit)
> > > > +	    *relevant = vect_used_in_scope;
> > > > +	}
> 
> I wonder why you can't simply do
> 
>          if (is_ctrl_stmt (stmt_info->stmt)
>              && stmt_info->stmt != LOOP_VINFO_COND (loop_info))
> 
> ?
> 
> > > > +      else if (bb->loop_father == loop)
> > > > +	LOOP_VINFO_GENERAL_CTR_FLOW (loop_vinfo) = true;
> 
> so for control flow not exiting the loop you can check
> loop_exits_from_bb_p ().
> 
> > > > +    }
> > > >
> > > >    /* changing memory.  */
> > > >    if (gimple_code (stmt_info->stmt) != GIMPLE_PHI)
> > > > @@ -359,6 +384,11 @@ vect_stmt_relevant_p (stmt_vec_info
> stmt_info,
> > > loop_vec_info loop_vinfo,
> > > >  	*relevant = vect_used_in_scope;
> > > >        }
> > > >
> > > > +  auto_vec<edge> exits = get_loop_exit_edges (loop);
> > > > +  auto_bitmap exit_bbs;
> > > > +  for (edge exit : exits)
> > > > +    bitmap_set_bit (exit_bbs, exit->dest->index);
> > > > +
> > > >    /* uses outside the loop.  */
> > > >    FOR_EACH_PHI_OR_STMT_DEF (def_p, stmt_info->stmt, op_iter,
> > > SSA_OP_DEF)
> > > >      {
> > > > @@ -377,7 +407,7 @@ vect_stmt_relevant_p (stmt_vec_info stmt_info,
> > > loop_vec_info loop_vinfo,
> > > >  	      /* We expect all such uses to be in the loop exit phis
> > > >  		 (because of loop closed form)   */
> > > >  	      gcc_assert (gimple_code (USE_STMT (use_p)) == GIMPLE_PHI);
> > > > -	      gcc_assert (bb == single_exit (loop)->dest);
> > > > +	      gcc_assert (bitmap_bit_p (exit_bbs, bb->index));
> 
> That now becomes quite expensive checking already covered by the LC SSA
> verifier so I suggest to simply drop this assert instead.
> 
> > > >                *live_p = true;
> > > >  	    }
> > > > @@ -683,6 +713,13 @@ vect_mark_stmts_to_be_vectorized
> > > (loop_vec_info loop_vinfo, bool *fatal)
> > > >  	}
> > > >      }
> > > >
> > > > +  /* Ideally this should be in vect_analyze_loop_form but we haven't
> seen all
> > > > +     the conds yet at that point and there's no quick way to retrieve them.
> */
> > > > +  if (LOOP_VINFO_GENERAL_CTR_FLOW (loop_vinfo))
> > > > +    return opt_result::failure_at (vect_location,
> > > > +				   "not vectorized:"
> > > > +				   " unsupported control flow in loop.\n");
> 
> so we didn't do this before?  But see above where I wondered.  So when
> does this hit with early exits and why can't we check for this in
> vect_verify_loop_form?
> 
> > > > +
> > > >    /* 2. Process_worklist */
> > > >    while (worklist.length () > 0)
> > > >      {
> > > > @@ -778,6 +815,20 @@ vect_mark_stmts_to_be_vectorized
> > > (loop_vec_info loop_vinfo, bool *fatal)
> > > >  			return res;
> > > >  		    }
> > > >                   }
> > > > +	    }
> > > > +	  else if (gcond *cond = dyn_cast <gcond *> (stmt_vinfo->stmt))
> > > > +	    {
> > > > +	      enum tree_code rhs_code = gimple_cond_code (cond);
> > > > +	      gcc_assert (TREE_CODE_CLASS (rhs_code) == tcc_comparison);
> > > > +	      opt_result res
> > > > +		= process_use (stmt_vinfo, gimple_cond_lhs (cond),
> > > > +			       loop_vinfo, relevant, &worklist, false);
> > > > +	      if (!res)
> > > > +		return res;
> > > > +	      res = process_use (stmt_vinfo, gimple_cond_rhs (cond),
> > > > +				loop_vinfo, relevant, &worklist, false);
> > > > +	      if (!res)
> > > > +		return res;
> > > >              }
> > > >  	  else if (gcall *call = dyn_cast <gcall *> (stmt_vinfo->stmt))
> > > >  	    {
> > > > @@ -11919,11 +11970,15 @@ vect_analyze_stmt (vec_info *vinfo,
> > > >  			     node_instance, cost_vec);
> > > >        if (!res)
> > > >  	return res;
> > > > -   }
> > > > +    }
> > > > +
> > > > +  if (is_ctrl_stmt (stmt_info->stmt))
> > > > +    STMT_VINFO_DEF_TYPE (stmt_info) = vect_early_exit_def;
> > > >
> > > >    switch (STMT_VINFO_DEF_TYPE (stmt_info))
> > > >      {
> > > >        case vect_internal_def:
> > > > +      case vect_early_exit_def:
> > > >          break;
> > > >
> > > >        case vect_reduction_def:
> > > > @@ -11956,6 +12011,7 @@ vect_analyze_stmt (vec_info *vinfo,
> > > >      {
> > > >        gcall *call = dyn_cast <gcall *> (stmt_info->stmt);
> > > >        gcc_assert (STMT_VINFO_VECTYPE (stmt_info)
> > > > +		  || gimple_code (stmt_info->stmt) == GIMPLE_COND
> > > >  		  || (call && gimple_call_lhs (call) == NULL_TREE));
> > > >        *need_to_vectorize = true;
> > > >      }
> > > > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> > > > index
> > >
> ec65b65b5910e9cbad0a8c7e83c950b6168b98bf..24a0567a2f23f1b3d8b3
> > > 40baff61d18da8e242dd 100644
> > > > --- a/gcc/tree-vectorizer.h
> > > > +++ b/gcc/tree-vectorizer.h
> > > > @@ -63,6 +63,7 @@ enum vect_def_type {
> > > >    vect_internal_def,
> > > >    vect_induction_def,
> > > >    vect_reduction_def,
> > > > +  vect_early_exit_def,
> 
> can you avoid putting this inbetween reduction and double reduction
> please?  Just put it before vect_unknown_def_type.  In fact the COND
> isn't a def ... maybe we should have pattern recogized
> 
>  if (a < b) exit;
> 
> as
> 
>  cond = a < b;
>  if (cond != 0) exit;
> 
> so the part that we need to vectorize is more clear.
> 
> > > >    vect_double_reduction_def,
> > > >    vect_nested_cycle,
> > > >    vect_first_order_recurrence,
> > > > @@ -876,6 +877,13 @@ public:
> > > >       we need to peel off iterations at the end to form an epilogue loop.  */
> > > >    bool peeling_for_niter;
> > > >
> > > > +  /* When the loop has early breaks that we can vectorize we need to
> peel
> > > > +     the loop for the break finding loop.  */
> > > > +  bool early_breaks;
> > > > +
> > > > +  /* When the loop has a non-early break control flow inside.  */
> > > > +  bool non_break_control_flow;
> > > > +
> > > >    /* List of loop additional IV conditionals found in the loop.  */
> > > >    auto_vec<gcond *> conds;
> > > >
> > > > @@ -985,9 +993,11 @@ public:
> > > >  #define LOOP_VINFO_REDUCTION_CHAINS(L)     (L)->reduction_chains
> > > >  #define LOOP_VINFO_PEELING_FOR_GAPS(L)     (L)->peeling_for_gaps
> > > >  #define LOOP_VINFO_PEELING_FOR_NITER(L)    (L)->peeling_for_niter
> > > > +#define LOOP_VINFO_EARLY_BREAKS(L)         (L)->early_breaks
> > > >  #define LOOP_VINFO_EARLY_BRK_CONFLICT_STMTS(L) (L)-
> > > >early_break_conflict
> > > >  #define LOOP_VINFO_EARLY_BRK_DEST_BB(L)    (L)-
> >early_break_dest_bb
> > > >  #define LOOP_VINFO_EARLY_BRK_VUSES(L)      (L)->early_break_vuses
> > > > +#define LOOP_VINFO_GENERAL_CTR_FLOW(L)     (L)-
> > > >non_break_control_flow
> > > >  #define LOOP_VINFO_LOOP_CONDS(L)           (L)->conds
> > > >  #define LOOP_VINFO_LOOP_IV_COND(L)         (L)->loop_iv_cond
> > > >  #define LOOP_VINFO_NO_DATA_DEPENDENCIES(L) (L)-
> > > >no_data_dependencies
> > > > @@ -1038,8 +1048,8 @@ public:
> > > >     stack.  */
> > > >  typedef opt_pointer_wrapper <loop_vec_info> opt_loop_vec_info;
> > > >
> > > > -inline loop_vec_info
> > > > -loop_vec_info_for_loop (class loop *loop)
> > > > +static inline loop_vec_info
> > > > +loop_vec_info_for_loop (const class loop *loop)
> > > >  {
> > > >    return (loop_vec_info) loop->aux;
> > > >  }
> > > > @@ -1789,7 +1799,7 @@ is_loop_header_bb_p (basic_block bb)
> > > >  {
> > > >    if (bb == (bb->loop_father)->header)
> > > >      return true;
> > > > -  gcc_checking_assert (EDGE_COUNT (bb->preds) == 1);
> > > > +
> > > >    return false;
> > > >  }
> > > >
> > > > @@ -2176,9 +2186,10 @@ class auto_purge_vect_location
> > > >     in tree-vect-loop-manip.cc.  */
> > > >  extern void vect_set_loop_condition (class loop *, loop_vec_info,
> > > >  				     tree, tree, tree, bool);
> > > > -extern bool slpeel_can_duplicate_loop_p (const class loop *,
> const_edge);
> > > > +extern bool slpeel_can_duplicate_loop_p (const loop_vec_info,
> > > const_edge);
> > > >  class loop *slpeel_tree_duplicate_loop_to_edge_cfg (class loop *,
> > > > -						     class loop *, edge);
> > > > +						    class loop *, edge, bool,
> > > > +						    vec<basic_block> * = NULL);
> > > >  class loop *vect_loop_versioning (loop_vec_info, gimple *);
> > > >  extern class loop *vect_do_peeling (loop_vec_info, tree, tree,
> > > >  				    tree *, tree *, tree *, int, bool, bool,
> > > > diff --git a/gcc/tree-vectorizer.cc b/gcc/tree-vectorizer.cc
> > > > index
> > >
> a048e9d89178a37455bd7b83ab0f2a238a4ce69e..0dc5479dc92058b6c70c
> > > 67f29f5dc9a8d72235f4 100644
> > > > --- a/gcc/tree-vectorizer.cc
> > > > +++ b/gcc/tree-vectorizer.cc
> > > > @@ -1379,7 +1379,9 @@ pass_vectorize::execute (function *fun)
> > > >  	 predicates that need to be shared for optimal predicate usage.
> > > >  	 However reassoc will re-order them and prevent CSE from working
> > > >  	 as it should.  CSE only the loop body, not the entry.  */
> > > > -      bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
> > > > +      auto_vec<edge> exits = get_loop_exit_edges (loop);
> 
> seeing this more and more I think we want a simple way to iterate over
> all exits without copying to a vector when we have them recorded.  My
> C++ fu is too limited to support
> 
>   for (auto exit : recorded_exits (loop))
>     ...
> 
> (maybe that's enough for somebody to jump onto this ;))
> 
> Don't treat all review comments as change orders, but it should be clear
> the code isn't 100% obvious.  Maybe the patch can be simplified by
> splitting out the LC SSA cleanup parts.
> 
> Thanks,
> Richard.
> 
> > > > +      for (edge exit : exits)
> > > > +	bitmap_set_bit (exit_bbs, exit->dest->index);
> > > >
> > > >        edge entry = EDGE_PRED (loop_preheader_edge (loop)->src, 0);
> > > >        do_rpo_vn (fun, entry, exit_bbs);
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > > --
> > > Richard Biener <rguenther@suse.de>
> > > SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461
> > > Nuernberg,
> > > Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien
> > > Moerman;
> > > HRB 36809 (AG Nuernberg)
> >
  

Patch

--- a/gcc/tree-loop-distribution.cc
+++ b/gcc/tree-loop-distribution.cc
@@ -948,7 +948,7 @@  copy_loop_before (class loop *loop, bool redirect_lc_phi_defs)
   edge preheader = loop_preheader_edge (loop);
 
   initialize_original_copy_tables ();
-  res = slpeel_tree_duplicate_loop_to_edge_cfg (loop, NULL, preheader);
+  res = slpeel_tree_duplicate_loop_to_edge_cfg (loop, NULL, preheader, false);
   gcc_assert (res != NULL);
 
   /* When a not last partition is supposed to keep the LC PHIs computed
diff --git a/gcc/tree-ssa-loop-niter.cc b/gcc/tree-ssa-loop-niter.cc
index 5d398b67e68c7076760854119590f18b19c622b6..79686f6c4945b7139ba377300430c04b7aeefe6c 100644
--- a/gcc/tree-ssa-loop-niter.cc
+++ b/gcc/tree-ssa-loop-niter.cc
@@ -3072,7 +3072,12 @@  loop_only_exit_p (const class loop *loop, basic_block *body, const_edge exit)
   gimple_stmt_iterator bsi;
   unsigned i;
 
-  if (exit != single_exit (loop))
+  /* We need to check for alternative exits since exit can be NULL.  */
+  auto exits = get_loop_exit_edges (loop);
+  if (exits.length () != 1)
+    return false;
+
+  if (exit != exits[0])
     return false;
 
   for (i = 0; i < loop->num_nodes; i++)
diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
index 6b93fb3f9af8f2bbdf5dec28f0009177aa5171ab..550d7f40002cf0b58f8a927cb150edd7c2aa9999 100644
--- a/gcc/tree-vect-loop-manip.cc
+++ b/gcc/tree-vect-loop-manip.cc
@@ -252,6 +252,9 @@  adjust_phi_and_debug_stmts (gimple *update_phi, edge e, tree new_def)
 {
   tree orig_def = PHI_ARG_DEF_FROM_EDGE (update_phi, e);
 
+  gcc_assert (TREE_CODE (orig_def) != SSA_NAME
+	      || orig_def != new_def);
+
   SET_PHI_ARG_DEF (update_phi, e->dest_idx, new_def);
 
   if (MAY_HAVE_DEBUG_BIND_STMTS)
@@ -1292,7 +1295,8 @@  vect_set_loop_condition_normal (loop_vec_info loop_vinfo,
   gsi_insert_before (&loop_cond_gsi, cond_stmt, GSI_SAME_STMT);
 
   /* Record the number of latch iterations.  */
-  if (limit == niters)
+  if (limit == niters
+      || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
     /* Case A: the loop iterates NITERS times.  Subtract one to get the
        latch count.  */
     loop->nb_iterations = fold_build2 (MINUS_EXPR, niters_type, niters,
@@ -1303,7 +1307,13 @@  vect_set_loop_condition_normal (loop_vec_info loop_vinfo,
     loop->nb_iterations = fold_build2 (TRUNC_DIV_EXPR, niters_type,
 				       limit, step);
 
-  if (final_iv)
+  /* For multiple exits we've already maintained LCSSA form and handled
+     the scalar iteration update in the code that deals with the merge
+     block and its updated guard.  I could move that code here instead
+     of in vect_update_ivs_after_early_break but I have to still deal
+     with the updates to the counter `i`.  So for now I'll keep them
+     together.  */
+  if (final_iv && !LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
     {
       gassign *assign;
       edge exit = LOOP_VINFO_IV_EXIT (loop_vinfo);
@@ -1509,11 +1519,19 @@  vec_init_exit_info (class loop *loop)
    on E which is either the entry or exit of LOOP.  If SCALAR_LOOP is
    non-NULL, assume LOOP and SCALAR_LOOP are equivalent and copy the
    basic blocks from SCALAR_LOOP instead of LOOP, but to either the
-   entry or exit of LOOP.  */
+   entry or exit of LOOP.  If FLOW_LOOPS then connect LOOP to SCALAR_LOOP as a
+   continuation.  This is correct for cases where one loop continues from the
+   other like in the vectorizer, but not true for uses in e.g. loop distribution
+   where the loop is duplicated and then modified.
+
+   If UPDATED_DOMS is not NULL it is update with the list of basic blocks whoms
+   dominators were updated during the peeling.  */
 
 class loop *
 slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop,
-					class loop *scalar_loop, edge e)
+					class loop *scalar_loop, edge e,
+					bool flow_loops,
+					vec<basic_block> *updated_doms)
 {
   class loop *new_loop;
   basic_block *new_bbs, *bbs, *pbbs;
@@ -1602,6 +1620,19 @@  slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop,
   for (unsigned i = (at_exit ? 0 : 1); i < scalar_loop->num_nodes + 1; i++)
     rename_variables_in_bb (new_bbs[i], duplicate_outer_loop);
 
+  /* Rename the exit uses.  */
+  for (edge exit : get_loop_exit_edges (new_loop))
+    for (auto gsi = gsi_start_phis (exit->dest);
+	 !gsi_end_p (gsi); gsi_next (&gsi))
+      {
+	tree orig_def = PHI_ARG_DEF_FROM_EDGE (gsi.phi (), exit);
+	rename_use_op (PHI_ARG_DEF_PTR_FROM_EDGE (gsi.phi (), exit));
+	if (MAY_HAVE_DEBUG_BIND_STMTS)
+	  adjust_debug_stmts (orig_def, PHI_RESULT (gsi.phi ()), exit->dest);
+      }
+
+  /* This condition happens when the loop has been versioned. e.g. due to ifcvt
+     versioning the loop.  */
   if (scalar_loop != loop)
     {
       /* If we copied from SCALAR_LOOP rather than LOOP, SSA_NAMEs from
@@ -1616,28 +1647,106 @@  slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop,
 						EDGE_SUCC (loop->latch, 0));
     }
 
+  vec<edge> alt_exits = loop->vec_loop_alt_exits;
+  bool multiple_exits_p = !alt_exits.is_empty ();
+  auto_vec<basic_block> doms;
+  class loop *update_loop = NULL;
+
   if (at_exit) /* Add the loop copy at exit.  */
     {
-      if (scalar_loop != loop)
+      if (scalar_loop != loop && new_exit->dest != exit_dest)
 	{
-	  gphi_iterator gsi;
 	  new_exit = redirect_edge_and_branch (new_exit, exit_dest);
+	  flush_pending_stmts (new_exit);
+	}
 
-	  for (gsi = gsi_start_phis (exit_dest); !gsi_end_p (gsi);
-	       gsi_next (&gsi))
+      auto loop_exits = get_loop_exit_edges (loop);
+      for (edge exit : loop_exits)
+	redirect_edge_and_branch (exit, new_preheader);
+
+
+      /* Copy the current loop LC PHI nodes between the original loop exit
+	 block and the new loop header.  This allows us to later split the
+	 preheader block and still find the right LC nodes.  */
+      edge latch_new = single_succ_edge (new_preheader);
+      edge latch_old = loop_latch_edge (loop);
+      hash_set <tree> lcssa_vars;
+      for (auto gsi_from = gsi_start_phis (latch_old->dest),
+	   gsi_to = gsi_start_phis (latch_new->dest);
+	   flow_loops && !gsi_end_p (gsi_from) && !gsi_end_p (gsi_to);
+	   gsi_next (&gsi_from), gsi_next (&gsi_to))
+	{
+	  gimple *from_phi = gsi_stmt (gsi_from);
+	  gimple *to_phi = gsi_stmt (gsi_to);
+	  tree new_arg = PHI_ARG_DEF_FROM_EDGE (from_phi, latch_old);
+	  /* In all cases, even in early break situations we're only
+	     interested in the number of fully executed loop iters.  As such
+	     we discard any partially done iteration.  So we simply propagate
+	     the phi nodes from the latch to the merge block.  */
+	  tree new_res = copy_ssa_name (gimple_phi_result (from_phi));
+	  gphi *lcssa_phi = create_phi_node (new_res, e->dest);
+
+	  lcssa_vars.add (new_arg);
+
+	  /* Main loop exit should use the final iter value.  */
+	  add_phi_arg (lcssa_phi, new_arg, loop->vec_loop_iv, UNKNOWN_LOCATION);
+
+	  /* All other exits use the previous iters.  */
+	  for (edge e : alt_exits)
+	    add_phi_arg (lcssa_phi, gimple_phi_result (from_phi), e,
+			 UNKNOWN_LOCATION);
+
+	  adjust_phi_and_debug_stmts (to_phi, latch_new, new_res);
+	}
+
+      /* Copy over any live SSA vars that may not have been materialized in the
+	 loops themselves but would be in the exit block.  However when the live
+	 value is not used inside the loop then we don't need to do this,  if we do
+	 then when we split the guard block the branch edge can end up containing the
+	 wrong reference,  particularly if it shares an edge with something that has
+	 bypassed the loop.  This is not something peeling can check so we need to
+	 anticipate the usage of the live variable here.  */
+      auto exit_map = redirect_edge_var_map_vector (exit);
+      if (exit_map)
+        for (auto vm : exit_map)
+	{
+	  if (lcssa_vars.contains (vm.def)
+	      || TREE_CODE (vm.def) != SSA_NAME)
+	    continue;
+
+	  imm_use_iterator imm_iter;
+	  use_operand_p use_p;
+	  bool use_in_loop = false;
+
+	  FOR_EACH_IMM_USE_FAST (use_p, imm_iter, vm.def)
 	    {
-	      gphi *phi = gsi.phi ();
-	      tree orig_arg = PHI_ARG_DEF_FROM_EDGE (phi, e);
-	      location_t orig_locus
-		= gimple_phi_arg_location_from_edge (phi, e);
+	      basic_block bb = gimple_bb (USE_STMT (use_p));
+	      if (flow_bb_inside_loop_p (loop, bb)
+		  && !gimple_vuse (USE_STMT (use_p)))
+		{
+		  use_in_loop = true;
+		  break;
+		}
+	    }
 
-	      add_phi_arg (phi, orig_arg, new_exit, orig_locus);
+	  if (!use_in_loop)
+	    {
+	       /* Do a final check to see if it's perhaps defined in the loop.  This
+		  mirrors the relevancy analysis's used_outside_scope.  */
+	      gimple *stmt = SSA_NAME_DEF_STMT (vm.def);
+	      if (!stmt || !flow_bb_inside_loop_p (loop, gimple_bb (stmt)))
+		continue;
 	    }
+
+	  tree new_res = copy_ssa_name (vm.result);
+	  gphi *lcssa_phi = create_phi_node (new_res, e->dest);
+	  for (edge exit : loop_exits)
+	     add_phi_arg (lcssa_phi, vm.def, exit, vm.locus);
 	}
-      redirect_edge_and_branch_force (e, new_preheader);
-      flush_pending_stmts (e);
+
       set_immediate_dominator (CDI_DOMINATORS, new_preheader, e->src);
-      if (was_imm_dom || duplicate_outer_loop)
+
+      if ((was_imm_dom || duplicate_outer_loop) && !multiple_exits_p)
 	set_immediate_dominator (CDI_DOMINATORS, exit_dest, new_exit->src);
 
       /* And remove the non-necessary forwarder again.  Keep the other
@@ -1647,9 +1756,42 @@  slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop,
       delete_basic_block (preheader);
       set_immediate_dominator (CDI_DOMINATORS, scalar_loop->header,
 			       loop_preheader_edge (scalar_loop)->src);
+
+      /* Finally after wiring the new epilogue we need to update its main exit
+	 to the original function exit we recorded.  Other exits are already
+	 correct.  */
+      if (multiple_exits_p)
+	{
+	  for (edge e : get_loop_exit_edges (loop))
+	    doms.safe_push (e->dest);
+	  update_loop = new_loop;
+	  doms.safe_push (exit_dest);
+
+	  /* Likely a fall-through edge, so update if needed.  */
+	  if (single_succ_p (exit_dest))
+	    doms.safe_push (single_succ (exit_dest));
+	}
     }
   else /* Add the copy at entry.  */
     {
+      /* Copy the current loop LC PHI nodes between the original loop exit
+	 block and the new loop header.  This allows us to later split the
+	 preheader block and still find the right LC nodes.  */
+      edge old_latch_loop = loop_latch_edge (loop);
+      edge old_latch_init = loop_preheader_edge (loop);
+      edge new_latch_loop = loop_latch_edge (new_loop);
+      edge new_latch_init = loop_preheader_edge (new_loop);
+      for (auto gsi_from = gsi_start_phis (new_latch_init->dest),
+	   gsi_to = gsi_start_phis (old_latch_loop->dest);
+	   flow_loops && !gsi_end_p (gsi_from) && !gsi_end_p (gsi_to);
+	   gsi_next (&gsi_from), gsi_next (&gsi_to))
+	{
+	  gimple *from_phi = gsi_stmt (gsi_from);
+	  gimple *to_phi = gsi_stmt (gsi_to);
+	  tree new_arg = PHI_ARG_DEF_FROM_EDGE (from_phi, new_latch_loop);
+	  adjust_phi_and_debug_stmts (to_phi, old_latch_init, new_arg);
+	}
+
       if (scalar_loop != loop)
 	{
 	  /* Remove the non-necessary forwarder of scalar_loop again.  */
@@ -1677,31 +1819,36 @@  slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop,
       delete_basic_block (new_preheader);
       set_immediate_dominator (CDI_DOMINATORS, new_loop->header,
 			       loop_preheader_edge (new_loop)->src);
+
+      if (multiple_exits_p)
+	update_loop = loop;
     }
 
-  if (scalar_loop != loop)
+  if (multiple_exits_p)
     {
-      /* Update new_loop->header PHIs, so that on the preheader
-	 edge they are the ones from loop rather than scalar_loop.  */
-      gphi_iterator gsi_orig, gsi_new;
-      edge orig_e = loop_preheader_edge (loop);
-      edge new_e = loop_preheader_edge (new_loop);
-
-      for (gsi_orig = gsi_start_phis (loop->header),
-	   gsi_new = gsi_start_phis (new_loop->header);
-	   !gsi_end_p (gsi_orig) && !gsi_end_p (gsi_new);
-	   gsi_next (&gsi_orig), gsi_next (&gsi_new))
+      for (edge e : get_loop_exit_edges (update_loop))
 	{
-	  gphi *orig_phi = gsi_orig.phi ();
-	  gphi *new_phi = gsi_new.phi ();
-	  tree orig_arg = PHI_ARG_DEF_FROM_EDGE (orig_phi, orig_e);
-	  location_t orig_locus
-	    = gimple_phi_arg_location_from_edge (orig_phi, orig_e);
-
-	  add_phi_arg (new_phi, orig_arg, new_e, orig_locus);
+	  edge ex;
+	  edge_iterator ei;
+	  FOR_EACH_EDGE (ex, ei, e->dest->succs)
+	    {
+	      /* Find the first non-fallthrough block as fall-throughs can't
+		 dominate other blocks.  */
+	      while ((ex->flags & EDGE_FALLTHRU)
+		     && single_succ_p (ex->dest))
+		{
+		  doms.safe_push (ex->dest);
+		  ex = single_succ_edge (ex->dest);
+		}
+	      doms.safe_push (ex->dest);
+	    }
+	  doms.safe_push (e->dest);
 	}
-    }
 
+      iterate_fix_dominators (CDI_DOMINATORS, doms, false);
+      if (updated_doms)
+	updated_doms->safe_splice (doms);
+    }
   free (new_bbs);
   free (bbs);
 
@@ -1777,6 +1924,9 @@  slpeel_can_duplicate_loop_p (const loop_vec_info loop_vinfo, const_edge e)
   gimple_stmt_iterator loop_exit_gsi = gsi_last_bb (exit_e->src);
   unsigned int num_bb = loop->inner? 5 : 2;
 
+  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
+    num_bb += LOOP_VINFO_ALT_EXITS (loop_vinfo).length ();
+
   /* All loops have an outer scope; the only case loop->outer is NULL is for
      the function itself.  */
   if (!loop_outer (loop)
@@ -2044,6 +2194,11 @@  vect_update_ivs_after_vectorizer (loop_vec_info loop_vinfo,
   class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
   basic_block update_bb = update_e->dest;
 
+  /* For early exits we'll update the IVs in
+     vect_update_ivs_after_early_break.  */
+  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
+    return;
+
   basic_block exit_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
 
   /* Make sure there exists a single-predecessor exit bb:  */
@@ -2131,6 +2286,208 @@  vect_update_ivs_after_vectorizer (loop_vec_info loop_vinfo,
       /* Fix phi expressions in the successor bb.  */
       adjust_phi_and_debug_stmts (phi1, update_e, ni_name);
     }
+  return;
+}
+
+/*   Function vect_update_ivs_after_early_break.
+
+     "Advance" the induction variables of LOOP to the value they should take
+     after the execution of LOOP.  This is currently necessary because the
+     vectorizer does not handle induction variables that are used after the
+     loop.  Such a situation occurs when the last iterations of LOOP are
+     peeled, because of the early exit.  With an early exit we always peel the
+     loop.
+
+     Input:
+     - LOOP_VINFO - a loop info structure for the loop that is going to be
+		    vectorized. The last few iterations of LOOP were peeled.
+     - LOOP - a loop that is going to be vectorized. The last few iterations
+	      of LOOP were peeled.
+     - VF - The loop vectorization factor.
+     - NITERS_ORIG - the number of iterations that LOOP executes (before it is
+		     vectorized). i.e, the number of times the ivs should be
+		     bumped.
+     - NITERS_VECTOR - The number of iterations that the vector LOOP executes.
+     - UPDATE_E - a successor edge of LOOP->exit that is on the (only) path
+		  coming out from LOOP on which there are uses of the LOOP ivs
+		  (this is the path from LOOP->exit to epilog_loop->preheader).
+
+		  The new definitions of the ivs are placed in LOOP->exit.
+		  The phi args associated with the edge UPDATE_E in the bb
+		  UPDATE_E->dest are updated accordingly.
+
+     Output:
+       - If available, the LCSSA phi node for the loop IV temp.
+
+     Assumption 1: Like the rest of the vectorizer, this function assumes
+     a single loop exit that has a single predecessor.
+
+     Assumption 2: The phi nodes in the LOOP header and in update_bb are
+     organized in the same order.
+
+     Assumption 3: The access function of the ivs is simple enough (see
+     vect_can_advance_ivs_p).  This assumption will be relaxed in the future.
+
+     Assumption 4: Exactly one of the successors of LOOP exit-bb is on a path
+     coming out of LOOP on which the ivs of LOOP are used (this is the path
+     that leads to the epilog loop; other paths skip the epilog loop).  This
+     path starts with the edge UPDATE_E, and its destination (denoted update_bb)
+     needs to have its phis updated.
+ */
+
+static tree
+vect_update_ivs_after_early_break (loop_vec_info loop_vinfo, class loop * epilog,
+				   poly_int64 vf, tree niters_orig,
+				   tree niters_vector, edge update_e)
+{
+  if (!LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
+    return NULL;
+
+  gphi_iterator gsi, gsi1;
+  tree ni_name, ivtmp = NULL;
+  basic_block update_bb = update_e->dest;
+  vec<edge> alt_exits = LOOP_VINFO_ALT_EXITS (loop_vinfo);
+  edge loop_iv = LOOP_VINFO_IV_EXIT (loop_vinfo);
+  basic_block exit_bb = loop_iv->dest;
+  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  gcond *cond = LOOP_VINFO_LOOP_IV_COND (loop_vinfo);
+
+  gcc_assert (cond);
+
+  for (gsi = gsi_start_phis (loop->header), gsi1 = gsi_start_phis (update_bb);
+       !gsi_end_p (gsi) && !gsi_end_p (gsi1);
+       gsi_next (&gsi), gsi_next (&gsi1))
+    {
+      tree init_expr, final_expr, step_expr;
+      tree type;
+      tree var, ni, off;
+      gimple_stmt_iterator last_gsi;
+
+      gphi *phi = gsi1.phi ();
+      tree phi_ssa = PHI_ARG_DEF_FROM_EDGE (phi, loop_preheader_edge (epilog));
+      gphi *phi1 = dyn_cast <gphi *> (SSA_NAME_DEF_STMT (phi_ssa));
+      if (!phi1)
+	continue;
+      stmt_vec_info phi_info = loop_vinfo->lookup_stmt (gsi.phi ());
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "vect_update_ivs_after_early_break: phi: %G",
+			 (gimple *)phi);
+
+      /* Skip reduction and virtual phis.  */
+      if (!iv_phi_p (phi_info))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "reduc or virtual phi. skip.\n");
+	  continue;
+	}
+
+      /* For multiple exits where we handle early exits we need to carry on
+	 with the previous IV as loop iteration was not done because we exited
+	 early.  As such just grab the original IV.  */
+      phi_ssa = PHI_ARG_DEF_FROM_EDGE (gsi.phi (), loop_latch_edge (loop));
+      if (gimple_cond_lhs (cond) != phi_ssa
+	  && gimple_cond_rhs (cond) != phi_ssa)
+	{
+	  type = TREE_TYPE (gimple_phi_result (phi));
+	  step_expr = STMT_VINFO_LOOP_PHI_EVOLUTION_PART (phi_info);
+	  step_expr = unshare_expr (step_expr);
+
+	  /* We previously generated the new merged phi in the same BB as the
+	     guard.  So use that to perform the scaling on rather than the
+	     normal loop phi which don't take the early breaks into account.  */
+	  final_expr = gimple_phi_result (phi1);
+	  init_expr = PHI_ARG_DEF_FROM_EDGE (gsi.phi (), loop_preheader_edge (loop));
+
+	  tree stype = TREE_TYPE (step_expr);
+	  /* For early break the final loop IV is:
+	     init + (final - init) * vf which takes into account peeling
+	     values and non-single steps.  */
+	  off = fold_build2 (MINUS_EXPR, stype,
+			     fold_convert (stype, final_expr),
+			     fold_convert (stype, init_expr));
+	  /* Now adjust for VF to get the final iteration value.  */
+	  off = fold_build2 (MULT_EXPR, stype, off, build_int_cst (stype, vf));
+
+	  /* Adjust the value with the offset.  */
+	  if (POINTER_TYPE_P (type))
+	    ni = fold_build_pointer_plus (init_expr, off);
+	  else
+	    ni = fold_convert (type,
+			       fold_build2 (PLUS_EXPR, stype,
+					    fold_convert (stype, init_expr),
+					    off));
+	  var = create_tmp_var (type, "tmp");
+
+	  last_gsi = gsi_last_bb (exit_bb);
+	  gimple_seq new_stmts = NULL;
+	  ni_name = force_gimple_operand (ni, &new_stmts, false, var);
+	  /* Exit_bb shouldn't be empty.  */
+	  if (!gsi_end_p (last_gsi))
+	    gsi_insert_seq_after (&last_gsi, new_stmts, GSI_SAME_STMT);
+	  else
+	    gsi_insert_seq_before (&last_gsi, new_stmts, GSI_SAME_STMT);
+
+	  /* Fix phi expressions in the successor bb.  */
+	  adjust_phi_and_debug_stmts (phi, update_e, ni_name);
+	}
+      else
+	{
+	  type = TREE_TYPE (gimple_phi_result (phi));
+	  step_expr = STMT_VINFO_LOOP_PHI_EVOLUTION_PART (phi_info);
+	  step_expr = unshare_expr (step_expr);
+
+	  /* We previously generated the new merged phi in the same BB as the
+	     guard.  So use that to perform the scaling on rather than the
+	     normal loop phi which don't take the early breaks into account.  */
+	  init_expr = PHI_ARG_DEF_FROM_EDGE (phi1, loop_preheader_edge (loop));
+	  tree stype = TREE_TYPE (step_expr);
+
+	  if (vf.is_constant ())
+	    {
+	      ni = fold_build2 (MULT_EXPR, stype,
+				fold_convert (stype,
+					      niters_vector),
+				build_int_cst (stype, vf));
+
+	      ni = fold_build2 (MINUS_EXPR, stype,
+				fold_convert (stype,
+					      niters_orig),
+				fold_convert (stype, ni));
+	    }
+	  else
+	    /* If the loop's VF isn't constant then the loop must have been
+	       masked, so at the end of the loop we know we have finished
+	       the entire loop and found nothing.  */
+	    ni = build_zero_cst (stype);
+
+	  ni = fold_convert (type, ni);
+	  /* We don't support variable n in this version yet.  */
+	  gcc_assert (TREE_CODE (ni) == INTEGER_CST);
+
+	  var = create_tmp_var (type, "tmp");
+
+	  last_gsi = gsi_last_bb (exit_bb);
+	  gimple_seq new_stmts = NULL;
+	  ni_name = force_gimple_operand (ni, &new_stmts, false, var);
+	  /* Exit_bb shouldn't be empty.  */
+	  if (!gsi_end_p (last_gsi))
+	    gsi_insert_seq_after (&last_gsi, new_stmts, GSI_SAME_STMT);
+	  else
+	    gsi_insert_seq_before (&last_gsi, new_stmts, GSI_SAME_STMT);
+
+	  adjust_phi_and_debug_stmts (phi1, loop_iv, ni_name);
+
+	  for (edge exit : alt_exits)
+	    adjust_phi_and_debug_stmts (phi1, exit,
+					build_int_cst (TREE_TYPE (step_expr),
+						       vf));
+	  ivtmp = gimple_phi_result (phi1);
+	}
+    }
+
+  return ivtmp;
 }
 
 /* Return a gimple value containing the misalignment (measured in vector
@@ -2632,137 +2989,34 @@  vect_gen_vector_loop_niters_mult_vf (loop_vec_info loop_vinfo,
 
 /* LCSSA_PHI is a lcssa phi of EPILOG loop which is copied from LOOP,
    this function searches for the corresponding lcssa phi node in exit
-   bb of LOOP.  If it is found, return the phi result; otherwise return
-   NULL.  */
+   bb of LOOP following the LCSSA_EDGE to the exit node.  If it is found,
+   return the phi result; otherwise return NULL.  */
 
 static tree
 find_guard_arg (class loop *loop, class loop *epilog ATTRIBUTE_UNUSED,
-		gphi *lcssa_phi)
+		gphi *lcssa_phi, int lcssa_edge = 0)
 {
   gphi_iterator gsi;
   edge e = loop->vec_loop_iv;
 
-  gcc_assert (single_pred_p (e->dest));
   for (gsi = gsi_start_phis (e->dest); !gsi_end_p (gsi); gsi_next (&gsi))
     {
       gphi *phi = gsi.phi ();
-      if (operand_equal_p (PHI_ARG_DEF (phi, 0),
-			   PHI_ARG_DEF (lcssa_phi, 0), 0))
-	return PHI_RESULT (phi);
-    }
-  return NULL_TREE;
-}
-
-/* Function slpeel_tree_duplicate_loop_to_edge_cfg duplciates FIRST/SECOND
-   from SECOND/FIRST and puts it at the original loop's preheader/exit
-   edge, the two loops are arranged as below:
-
-       preheader_a:
-     first_loop:
-       header_a:
-	 i_1 = PHI<i_0, i_2>;
-	 ...
-	 i_2 = i_1 + 1;
-	 if (cond_a)
-	   goto latch_a;
-	 else
-	   goto between_bb;
-       latch_a:
-	 goto header_a;
-
-       between_bb:
-	 ;; i_x = PHI<i_2>;   ;; LCSSA phi node to be created for FIRST,
-
-     second_loop:
-       header_b:
-	 i_3 = PHI<i_0, i_4>; ;; Use of i_0 to be replaced with i_x,
-				 or with i_2 if no LCSSA phi is created
-				 under condition of CREATE_LCSSA_FOR_IV_PHIS.
-	 ...
-	 i_4 = i_3 + 1;
-	 if (cond_b)
-	   goto latch_b;
-	 else
-	   goto exit_bb;
-       latch_b:
-	 goto header_b;
-
-       exit_bb:
-
-   This function creates loop closed SSA for the first loop; update the
-   second loop's PHI nodes by replacing argument on incoming edge with the
-   result of newly created lcssa PHI nodes.  IF CREATE_LCSSA_FOR_IV_PHIS
-   is false, Loop closed ssa phis will only be created for non-iv phis for
-   the first loop.
-
-   This function assumes exit bb of the first loop is preheader bb of the
-   second loop, i.e, between_bb in the example code.  With PHIs updated,
-   the second loop will execute rest iterations of the first.  */
-
-static void
-slpeel_update_phi_nodes_for_loops (loop_vec_info loop_vinfo,
-				   class loop *first, class loop *second,
-				   bool create_lcssa_for_iv_phis)
-{
-  gphi_iterator gsi_update, gsi_orig;
-  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
-
-  edge first_latch_e = EDGE_SUCC (first->latch, 0);
-  edge second_preheader_e = loop_preheader_edge (second);
-  basic_block between_bb = single_exit (first)->dest;
-
-  gcc_assert (between_bb == second_preheader_e->src);
-  gcc_assert (single_pred_p (between_bb) && single_succ_p (between_bb));
-  /* Either the first loop or the second is the loop to be vectorized.  */
-  gcc_assert (loop == first || loop == second);
-
-  for (gsi_orig = gsi_start_phis (first->header),
-       gsi_update = gsi_start_phis (second->header);
-       !gsi_end_p (gsi_orig) && !gsi_end_p (gsi_update);
-       gsi_next (&gsi_orig), gsi_next (&gsi_update))
-    {
-      gphi *orig_phi = gsi_orig.phi ();
-      gphi *update_phi = gsi_update.phi ();
-
-      tree arg = PHI_ARG_DEF_FROM_EDGE (orig_phi, first_latch_e);
-      /* Generate lcssa PHI node for the first loop.  */
-      gphi *vect_phi = (loop == first) ? orig_phi : update_phi;
-      stmt_vec_info vect_phi_info = loop_vinfo->lookup_stmt (vect_phi);
-      if (create_lcssa_for_iv_phis || !iv_phi_p (vect_phi_info))
+      /* Nested loops with multiple exits can have different no# phi node
+	 arguments between the main loop and epilog as epilog falls to the
+	 second loop.  */
+      if (gimple_phi_num_args (phi) > e->dest_idx)
 	{
-	  tree new_res = copy_ssa_name (PHI_RESULT (orig_phi));
-	  gphi *lcssa_phi = create_phi_node (new_res, between_bb);
-	  add_phi_arg (lcssa_phi, arg, single_exit (first), UNKNOWN_LOCATION);
-	  arg = new_res;
-	}
-
-      /* Update PHI node in the second loop by replacing arg on the loop's
-	 incoming edge.  */
-      adjust_phi_and_debug_stmts (update_phi, second_preheader_e, arg);
-    }
-
-  /* For epilogue peeling we have to make sure to copy all LC PHIs
-     for correct vectorization of live stmts.  */
-  if (loop == first)
-    {
-      basic_block orig_exit = single_exit (second)->dest;
-      for (gsi_orig = gsi_start_phis (orig_exit);
-	   !gsi_end_p (gsi_orig); gsi_next (&gsi_orig))
-	{
-	  gphi *orig_phi = gsi_orig.phi ();
-	  tree orig_arg = PHI_ARG_DEF (orig_phi, 0);
-	  if (TREE_CODE (orig_arg) != SSA_NAME || virtual_operand_p  (orig_arg))
-	    continue;
-
-	  /* Already created in the above loop.   */
-	  if (find_guard_arg (first, second, orig_phi))
+	  tree var = PHI_ARG_DEF (phi, e->dest_idx);
+	  if (TREE_CODE (var) != SSA_NAME)
 	    continue;
 
-	  tree new_res = copy_ssa_name (orig_arg);
-	  gphi *lcphi = create_phi_node (new_res, between_bb);
-	  add_phi_arg (lcphi, orig_arg, single_exit (first), UNKNOWN_LOCATION);
+	  if (operand_equal_p (get_current_def (var),
+			       PHI_ARG_DEF (lcssa_phi, lcssa_edge), 0))
+	    return PHI_RESULT (phi);
 	}
     }
+  return NULL_TREE;
 }
 
 /* Function slpeel_add_loop_guard adds guard skipping from the beginning
@@ -2910,13 +3164,11 @@  slpeel_update_phi_nodes_for_guard2 (class loop *loop, class loop *epilog,
   gcc_assert (single_succ_p (merge_bb));
   edge e = single_succ_edge (merge_bb);
   basic_block exit_bb = e->dest;
-  gcc_assert (single_pred_p (exit_bb));
-  gcc_assert (single_pred (exit_bb) == single_exit (epilog)->dest);
 
   for (gsi = gsi_start_phis (exit_bb); !gsi_end_p (gsi); gsi_next (&gsi))
     {
       gphi *update_phi = gsi.phi ();
-      tree old_arg = PHI_ARG_DEF (update_phi, 0);
+      tree old_arg = PHI_ARG_DEF (update_phi, e->dest_idx);
 
       tree merge_arg = NULL_TREE;
 
@@ -2928,7 +3180,7 @@  slpeel_update_phi_nodes_for_guard2 (class loop *loop, class loop *epilog,
       if (!merge_arg)
 	merge_arg = old_arg;
 
-      tree guard_arg = find_guard_arg (loop, epilog, update_phi);
+      tree guard_arg = find_guard_arg (loop, epilog, update_phi, e->dest_idx);
       /* If the var is live after loop but not a reduction, we simply
 	 use the old arg.  */
       if (!guard_arg)
@@ -2948,21 +3200,6 @@  slpeel_update_phi_nodes_for_guard2 (class loop *loop, class loop *epilog,
     }
 }
 
-/* EPILOG loop is duplicated from the original loop for vectorizing,
-   the arg of its loop closed ssa PHI needs to be updated.  */
-
-static void
-slpeel_update_phi_nodes_for_lcssa (class loop *epilog)
-{
-  gphi_iterator gsi;
-  basic_block exit_bb = single_exit (epilog)->dest;
-
-  gcc_assert (single_pred_p (exit_bb));
-  edge e = EDGE_PRED (exit_bb, 0);
-  for (gsi = gsi_start_phis (exit_bb); !gsi_end_p (gsi); gsi_next (&gsi))
-    rename_use_op (PHI_ARG_DEF_PTR_FROM_EDGE (gsi.phi (), e));
-}
-
 /* EPILOGUE_VINFO is an epilogue loop that we now know would need to
    iterate exactly CONST_NITERS times.  Make a final decision about
    whether the epilogue loop should be used, returning true if so.  */
@@ -3138,6 +3375,14 @@  vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
     bound_epilog += vf - 1;
   if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
     bound_epilog += 1;
+  /* For early breaks the scalar loop needs to execute at most VF times
+     to find the element that caused the break.  */
+  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
+    {
+      bound_epilog = vf;
+      /* Force a scalar epilogue as we can't vectorize the index finding.  */
+      vect_epilogues = false;
+    }
   bool epilog_peeling = maybe_ne (bound_epilog, 0U);
   poly_uint64 bound_scalar = bound_epilog;
 
@@ -3297,16 +3542,24 @@  vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 				  bound_prolog + bound_epilog)
 		      : (!LOOP_REQUIRES_VERSIONING (loop_vinfo)
 			 || vect_epilogues));
+
+  /* We only support early break vectorization on known bounds at this time.
+     This means that if the vector loop can't be entered then we won't generate
+     it at all.  So for now force skip_vector off because the additional control
+     flow messes with the BB exits and we've already analyzed them.  */
+ skip_vector = skip_vector && !LOOP_VINFO_EARLY_BREAKS (loop_vinfo);
+
   /* Epilog loop must be executed if the number of iterations for epilog
      loop is known at compile time, otherwise we need to add a check at
      the end of vector loop and skip to the end of epilog loop.  */
   bool skip_epilog = (prolog_peeling < 0
 		      || !LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
 		      || !vf.is_constant ());
-  /* PEELING_FOR_GAPS is special because epilog loop must be executed.  */
-  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
+  /* PEELING_FOR_GAPS and peeling for early breaks are special because epilog
+     loop must be executed.  */
+  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
+      || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
     skip_epilog = false;
-
   class loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo);
   auto_vec<profile_count> original_counts;
   basic_block *original_bbs = NULL;
@@ -3344,13 +3597,13 @@  vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
   if (prolog_peeling)
     {
       e = loop_preheader_edge (loop);
-      gcc_checking_assert (slpeel_can_duplicate_loop_p (loop, e));
-
+      gcc_checking_assert (slpeel_can_duplicate_loop_p (loop_vinfo, e));
       /* Peel prolog and put it on preheader edge of loop.  */
-      prolog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, scalar_loop, e);
+      prolog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, scalar_loop, e,
+						       true);
       gcc_assert (prolog);
       prolog->force_vectorize = false;
-      slpeel_update_phi_nodes_for_loops (loop_vinfo, prolog, loop, true);
+
       first_loop = prolog;
       reset_original_copy_tables ();
 
@@ -3420,11 +3673,12 @@  vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 	 as the transformations mentioned above make less or no sense when not
 	 vectorizing.  */
       epilog = vect_epilogues ? get_loop_copy (loop) : scalar_loop;
-      epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, epilog, e);
+      auto_vec<basic_block> doms;
+      epilog = slpeel_tree_duplicate_loop_to_edge_cfg (loop, epilog, e, true,
+						       &doms);
       gcc_assert (epilog);
 
       epilog->force_vectorize = false;
-      slpeel_update_phi_nodes_for_loops (loop_vinfo, loop, epilog, false);
 
       /* Scalar version loop may be preferred.  In this case, add guard
 	 and skip to epilog.  Note this only happens when the number of
@@ -3496,6 +3750,54 @@  vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
       vect_update_ivs_after_vectorizer (loop_vinfo, niters_vector_mult_vf,
 					update_e);
 
+      /* For early breaks we must create a guard to check how many iterations
+	 of the scalar loop are yet to be performed.  */
+      if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
+	{
+	  tree ivtmp =
+	    vect_update_ivs_after_early_break (loop_vinfo, epilog, vf, niters,
+					       *niters_vector, update_e);
+
+	  gcc_assert (ivtmp);
+	  tree guard_cond = fold_build2 (EQ_EXPR, boolean_type_node,
+					 fold_convert (TREE_TYPE (niters),
+						       ivtmp),
+					 build_zero_cst (TREE_TYPE (niters)));
+	  basic_block guard_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
+
+	  /* If we had a fallthrough edge, the guard will the threaded through
+	     and so we may need to find the actual final edge.  */
+	  edge final_edge = epilog->vec_loop_iv;
+	  /* slpeel_update_phi_nodes_for_guard2 expects an empty block in
+	     between the guard and the exit edge.  It only adds new nodes and
+	     doesn't update existing one in the current scheme.  */
+	  basic_block guard_to = split_edge (final_edge);
+	  edge guard_e = slpeel_add_loop_guard (guard_bb, guard_cond, guard_to,
+						guard_bb, prob_epilog.invert (),
+						irred_flag);
+	  doms.safe_push (guard_bb);
+
+	  iterate_fix_dominators (CDI_DOMINATORS, doms, false);
+
+	  /* We must update all the edges from the new guard_bb.  */
+	  slpeel_update_phi_nodes_for_guard2 (loop, epilog, guard_e,
+					      final_edge);
+
+	  /* If the loop was versioned we'll have an intermediate BB between
+	     the guard and the exit.  This intermediate block is required
+	     because in the current scheme of things the guard block phi
+	     updating can only maintain LCSSA by creating new blocks.  In this
+	     case we just need to update the uses in this block as well.  */
+	  if (loop != scalar_loop)
+	    {
+	      for (gphi_iterator gsi = gsi_start_phis (guard_to);
+		   !gsi_end_p (gsi); gsi_next (&gsi))
+		rename_use_op (PHI_ARG_DEF_PTR_FROM_EDGE (gsi.phi (), guard_e));
+	    }
+
+	  flush_pending_stmts (guard_e);
+	}
+
       if (skip_epilog)
 	{
 	  guard_cond = fold_build2 (EQ_EXPR, boolean_type_node,
@@ -3520,8 +3822,6 @@  vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 	    }
 	  scale_loop_profile (epilog, prob_epilog, 0);
 	}
-      else
-	slpeel_update_phi_nodes_for_lcssa (epilog);
 
       unsigned HOST_WIDE_INT bound;
       if (bound_scalar.is_constant (&bound))
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index b4a98de80aa39057fc9b17977dd0e347b4f0fb5d..ab9a2048186f461f5ec49f21421958e7ee25eada 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -1007,6 +1007,8 @@  _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
     partial_load_store_bias (0),
     peeling_for_gaps (false),
     peeling_for_niter (false),
+    early_breaks (false),
+    non_break_control_flow (false),
     no_data_dependencies (false),
     has_mask_store (false),
     scalar_loop_scaling (profile_probability::uninitialized ()),
@@ -1199,6 +1201,14 @@  vect_need_peeling_or_partial_vectors_p (loop_vec_info loop_vinfo)
     th = LOOP_VINFO_COST_MODEL_THRESHOLD (LOOP_VINFO_ORIG_LOOP_INFO
 					  (loop_vinfo));
 
+  /* When we have multiple exits and VF is unknown, we must require partial
+     vectors because the loop bounds is not a minimum but a maximum.  That is to
+     say we cannot unpredicate the main loop unless we peel or use partial
+     vectors in the epilogue.  */
+  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
+      && !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ())
+    return true;
+
   if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
       && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) >= 0)
     {
@@ -1652,12 +1662,12 @@  vect_compute_single_scalar_iteration_cost (loop_vec_info loop_vinfo)
   loop_vinfo->scalar_costs->finish_cost (nullptr);
 }
 
-
 /* Function vect_analyze_loop_form.
 
    Verify that certain CFG restrictions hold, including:
    - the loop has a pre-header
-   - the loop has a single entry and exit
+   - the loop has a single entry
+   - nested loops can have only a single exit.
    - the loop exit condition is simple enough
    - the number of iterations can be analyzed, i.e, a countable loop.  The
      niter could be analyzed under some assumptions.  */
@@ -1693,11 +1703,6 @@  vect_analyze_loop_form (class loop *loop, vect_loop_form_info *info)
                            |
                         (exit-bb)  */
 
-      if (loop->num_nodes != 2)
-	return opt_result::failure_at (vect_location,
-				       "not vectorized:"
-				       " control flow in loop.\n");
-
       if (empty_block_p (loop->header))
 	return opt_result::failure_at (vect_location,
 				       "not vectorized: empty loop.\n");
@@ -1768,11 +1773,13 @@  vect_analyze_loop_form (class loop *loop, vect_loop_form_info *info)
         dump_printf_loc (MSG_NOTE, vect_location,
 			 "Considering outer-loop vectorization.\n");
       info->inner_loop_cond = inner.loop_cond;
+
+      if (!single_exit (loop))
+	return opt_result::failure_at (vect_location,
+				       "not vectorized: multiple exits.\n");
+
     }
 
-  if (!single_exit (loop))
-    return opt_result::failure_at (vect_location,
-				   "not vectorized: multiple exits.\n");
   if (EDGE_COUNT (loop->header->preds) != 2)
     return opt_result::failure_at (vect_location,
 				   "not vectorized:"
@@ -1788,11 +1795,36 @@  vect_analyze_loop_form (class loop *loop, vect_loop_form_info *info)
 				   "not vectorized: latch block not empty.\n");
 
   /* Make sure the exit is not abnormal.  */
-  edge e = single_exit (loop);
-  if (e->flags & EDGE_ABNORMAL)
-    return opt_result::failure_at (vect_location,
-				   "not vectorized:"
-				   " abnormal loop exit edge.\n");
+  auto_vec<edge> exits = get_loop_exit_edges (loop);
+  edge nexit = loop->vec_loop_iv;
+  for (edge e : exits)
+    {
+      if (e->flags & EDGE_ABNORMAL)
+	return opt_result::failure_at (vect_location,
+				       "not vectorized:"
+				       " abnormal loop exit edge.\n");
+      /* Early break BB must be after the main exit BB.  In theory we should
+	 be able to vectorize the inverse order, but the current flow in the
+	 the vectorizer always assumes you update successor PHI nodes, not
+	 preds.  */
+      if (e != nexit && !dominated_by_p (CDI_DOMINATORS, nexit->src, e->src))
+	return opt_result::failure_at (vect_location,
+				       "not vectorized:"
+				       " abnormal loop exit edge order.\n");
+    }
+
+  /* We currently only support early exit loops with known bounds.   */
+  if (exits.length () > 1)
+    {
+      class tree_niter_desc niter;
+      if (!number_of_iterations_exit_assumptions (loop, nexit, &niter, NULL)
+	  || chrec_contains_undetermined (niter.niter)
+	  || !evolution_function_is_constant_p (niter.niter))
+	return opt_result::failure_at (vect_location,
+				       "not vectorized:"
+				       " early breaks only supported on loops"
+				       " with known iteration bounds.\n");
+    }
 
   info->conds
     = vect_get_loop_niters (loop, &info->assumptions,
@@ -1866,6 +1898,10 @@  vect_create_loop_vinfo (class loop *loop, vec_info_shared *shared,
   LOOP_VINFO_LOOP_CONDS (loop_vinfo).safe_splice (info->alt_loop_conds);
   LOOP_VINFO_LOOP_IV_COND (loop_vinfo) = info->loop_cond;
 
+  /* Check to see if we're vectorizing multiple exits.  */
+  LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
+    = !LOOP_VINFO_LOOP_CONDS (loop_vinfo).is_empty ();
+
   if (info->inner_loop_cond)
     {
       stmt_vec_info inner_loop_cond_info
@@ -3070,7 +3106,8 @@  start_over:
 
   /* If an epilogue loop is required make sure we can create one.  */
   if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
-      || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo))
+      || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)
+      || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
     {
       if (dump_enabled_p ())
         dump_printf_loc (MSG_NOTE, vect_location, "epilog loop required\n");
@@ -5797,7 +5834,7 @@  vect_create_epilog_for_reduction (loop_vec_info loop_vinfo,
   basic_block exit_bb;
   tree scalar_dest;
   tree scalar_type;
-  gimple *new_phi = NULL, *phi;
+  gimple *new_phi = NULL, *phi = NULL;
   gimple_stmt_iterator exit_gsi;
   tree new_temp = NULL_TREE, new_name, new_scalar_dest;
   gimple *epilog_stmt = NULL;
@@ -6039,6 +6076,33 @@  vect_create_epilog_for_reduction (loop_vec_info loop_vinfo,
 	  new_def = gimple_convert (&stmts, vectype, new_def);
 	  reduc_inputs.quick_push (new_def);
 	}
+
+	/* Update the other exits.  */
+	if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
+	  {
+	    vec<edge> alt_exits = LOOP_VINFO_ALT_EXITS (loop_vinfo);
+	    gphi_iterator gsi, gsi1;
+	    for (edge exit : alt_exits)
+	      {
+		/* Find the phi node to propaget into the exit block for each
+		   exit edge.  */
+		for (gsi = gsi_start_phis (exit_bb),
+		     gsi1 = gsi_start_phis (exit->src);
+		     !gsi_end_p (gsi) && !gsi_end_p (gsi1);
+		     gsi_next (&gsi), gsi_next (&gsi1))
+		  {
+		    /* There really should be a function to just get the number
+		       of phis inside a bb.  */
+		    if (phi && phi == gsi.phi ())
+		      {
+			gphi *phi1 = gsi1.phi ();
+			SET_PHI_ARG_DEF (phi, exit->dest_idx,
+					 PHI_RESULT (phi1));
+			break;
+		      }
+		  }
+	      }
+	  }
       gsi_insert_seq_before (&exit_gsi, stmts, GSI_SAME_STMT);
     }
 
@@ -10355,6 +10419,13 @@  vectorizable_live_operation (vec_info *vinfo,
 	   new_tree = lane_extract <vec_lhs', ...>;
 	   lhs' = new_tree;  */
 
+      /* When vectorizing an early break, any live statement that is used
+	 outside of the loop are dead.  The loop will never get to them.
+	 We could change the liveness value during analysis instead but since
+	 the below code is invalid anyway just ignore it during codegen.  */
+      if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
+	return true;
+
       class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
       basic_block exit_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
       gcc_assert (single_pred_p (exit_bb));
@@ -11277,7 +11348,7 @@  vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
   /* Make sure there exists a single-predecessor exit bb.  Do this before 
      versioning.   */
   edge e = LOOP_VINFO_IV_EXIT (loop_vinfo);
-  if (! single_pred_p (e->dest))
+  if (e && ! single_pred_p (e->dest) && !LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
     {
       split_loop_exit_edge (e, true);
       if (dump_enabled_p ())
@@ -11303,7 +11374,7 @@  vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
   if (LOOP_VINFO_SCALAR_LOOP (loop_vinfo))
     {
       e = single_exit (LOOP_VINFO_SCALAR_LOOP (loop_vinfo));
-      if (! single_pred_p (e->dest))
+      if (e && ! single_pred_p (e->dest))
 	{
 	  split_loop_exit_edge (e, true);
 	  if (dump_enabled_p ())
@@ -11641,7 +11712,8 @@  vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
 
   /* Loops vectorized with a variable factor won't benefit from
      unrolling/peeling.  */
-  if (!vf.is_constant ())
+  if (!vf.is_constant ()
+      && !LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
     {
       loop->unroll = 1;
       if (dump_enabled_p ())
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 87c4353fa5180fcb7f60b192897456cf24f3fdbe..03524e8500ee06df42f82afe78ee2a7c627be45b 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -344,9 +344,34 @@  vect_stmt_relevant_p (stmt_vec_info stmt_info, loop_vec_info loop_vinfo,
   *live_p = false;
 
   /* cond stmt other than loop exit cond.  */
-  if (is_ctrl_stmt (stmt_info->stmt)
-      && STMT_VINFO_TYPE (stmt_info) != loop_exit_ctrl_vec_info_type)
-    *relevant = vect_used_in_scope;
+  if (is_ctrl_stmt (stmt_info->stmt))
+    {
+      /* Ideally EDGE_LOOP_EXIT would have been set on the exit edge, but
+	 it looks like loop_manip doesn't do that..  So we have to do it
+	 the hard way.  */
+      basic_block bb = gimple_bb (stmt_info->stmt);
+      bool exit_bb = false, early_exit = false;
+      edge_iterator ei;
+      edge e;
+      FOR_EACH_EDGE (e, ei, bb->succs)
+        if (!flow_bb_inside_loop_p (loop, e->dest))
+	  {
+	    exit_bb = true;
+	    early_exit = loop->vec_loop_iv->src != bb;
+	    break;
+	  }
+
+      /* We should have processed any exit edge, so an edge not an early
+	 break must be a loop IV edge.  We need to distinguish between the
+	 two as we don't want to generate code for the main loop IV.  */
+      if (exit_bb)
+	{
+	  if (early_exit)
+	    *relevant = vect_used_in_scope;
+	}
+      else if (bb->loop_father == loop)
+	LOOP_VINFO_GENERAL_CTR_FLOW (loop_vinfo) = true;
+    }
 
   /* changing memory.  */
   if (gimple_code (stmt_info->stmt) != GIMPLE_PHI)
@@ -359,6 +384,11 @@  vect_stmt_relevant_p (stmt_vec_info stmt_info, loop_vec_info loop_vinfo,
 	*relevant = vect_used_in_scope;
       }
 
+  auto_vec<edge> exits = get_loop_exit_edges (loop);
+  auto_bitmap exit_bbs;
+  for (edge exit : exits)
+    bitmap_set_bit (exit_bbs, exit->dest->index);
+
   /* uses outside the loop.  */
   FOR_EACH_PHI_OR_STMT_DEF (def_p, stmt_info->stmt, op_iter, SSA_OP_DEF)
     {
@@ -377,7 +407,7 @@  vect_stmt_relevant_p (stmt_vec_info stmt_info, loop_vec_info loop_vinfo,
 	      /* We expect all such uses to be in the loop exit phis
 		 (because of loop closed form)   */
 	      gcc_assert (gimple_code (USE_STMT (use_p)) == GIMPLE_PHI);
-	      gcc_assert (bb == single_exit (loop)->dest);
+	      gcc_assert (bitmap_bit_p (exit_bbs, bb->index));
 
               *live_p = true;
 	    }
@@ -683,6 +713,13 @@  vect_mark_stmts_to_be_vectorized (loop_vec_info loop_vinfo, bool *fatal)
 	}
     }
 
+  /* Ideally this should be in vect_analyze_loop_form but we haven't seen all
+     the conds yet at that point and there's no quick way to retrieve them.  */
+  if (LOOP_VINFO_GENERAL_CTR_FLOW (loop_vinfo))
+    return opt_result::failure_at (vect_location,
+				   "not vectorized:"
+				   " unsupported control flow in loop.\n");
+
   /* 2. Process_worklist */
   while (worklist.length () > 0)
     {
@@ -778,6 +815,20 @@  vect_mark_stmts_to_be_vectorized (loop_vec_info loop_vinfo, bool *fatal)
 			return res;
 		    }
                  }
+	    }
+	  else if (gcond *cond = dyn_cast <gcond *> (stmt_vinfo->stmt))
+	    {
+	      enum tree_code rhs_code = gimple_cond_code (cond);
+	      gcc_assert (TREE_CODE_CLASS (rhs_code) == tcc_comparison);
+	      opt_result res
+		= process_use (stmt_vinfo, gimple_cond_lhs (cond),
+			       loop_vinfo, relevant, &worklist, false);
+	      if (!res)
+		return res;
+	      res = process_use (stmt_vinfo, gimple_cond_rhs (cond),
+				loop_vinfo, relevant, &worklist, false);
+	      if (!res)
+		return res;
             }
 	  else if (gcall *call = dyn_cast <gcall *> (stmt_vinfo->stmt))
 	    {
@@ -11919,11 +11970,15 @@  vect_analyze_stmt (vec_info *vinfo,
 			     node_instance, cost_vec);
       if (!res)
 	return res;
-   }
+    }
+
+  if (is_ctrl_stmt (stmt_info->stmt))
+    STMT_VINFO_DEF_TYPE (stmt_info) = vect_early_exit_def;
 
   switch (STMT_VINFO_DEF_TYPE (stmt_info))
     {
       case vect_internal_def:
+      case vect_early_exit_def:
         break;
 
       case vect_reduction_def:
@@ -11956,6 +12011,7 @@  vect_analyze_stmt (vec_info *vinfo,
     {
       gcall *call = dyn_cast <gcall *> (stmt_info->stmt);
       gcc_assert (STMT_VINFO_VECTYPE (stmt_info)
+		  || gimple_code (stmt_info->stmt) == GIMPLE_COND
 		  || (call && gimple_call_lhs (call) == NULL_TREE));
       *need_to_vectorize = true;
     }
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index ec65b65b5910e9cbad0a8c7e83c950b6168b98bf..24a0567a2f23f1b3d8b340baff61d18da8e242dd 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -63,6 +63,7 @@  enum vect_def_type {
   vect_internal_def,
   vect_induction_def,
   vect_reduction_def,
+  vect_early_exit_def,
   vect_double_reduction_def,
   vect_nested_cycle,
   vect_first_order_recurrence,
@@ -876,6 +877,13 @@  public:
      we need to peel off iterations at the end to form an epilogue loop.  */
   bool peeling_for_niter;
 
+  /* When the loop has early breaks that we can vectorize we need to peel
+     the loop for the break finding loop.  */
+  bool early_breaks;
+
+  /* When the loop has a non-early break control flow inside.  */
+  bool non_break_control_flow;
+
   /* List of loop additional IV conditionals found in the loop.  */
   auto_vec<gcond *> conds;
 
@@ -985,9 +993,11 @@  public:
 #define LOOP_VINFO_REDUCTION_CHAINS(L)     (L)->reduction_chains
 #define LOOP_VINFO_PEELING_FOR_GAPS(L)     (L)->peeling_for_gaps
 #define LOOP_VINFO_PEELING_FOR_NITER(L)    (L)->peeling_for_niter
+#define LOOP_VINFO_EARLY_BREAKS(L)         (L)->early_breaks
 #define LOOP_VINFO_EARLY_BRK_CONFLICT_STMTS(L) (L)->early_break_conflict
 #define LOOP_VINFO_EARLY_BRK_DEST_BB(L)    (L)->early_break_dest_bb
 #define LOOP_VINFO_EARLY_BRK_VUSES(L)      (L)->early_break_vuses
+#define LOOP_VINFO_GENERAL_CTR_FLOW(L)     (L)->non_break_control_flow
 #define LOOP_VINFO_LOOP_CONDS(L)           (L)->conds
 #define LOOP_VINFO_LOOP_IV_COND(L)         (L)->loop_iv_cond
 #define LOOP_VINFO_NO_DATA_DEPENDENCIES(L) (L)->no_data_dependencies
@@ -1038,8 +1048,8 @@  public:
    stack.  */
 typedef opt_pointer_wrapper <loop_vec_info> opt_loop_vec_info;
 
-inline loop_vec_info
-loop_vec_info_for_loop (class loop *loop)
+static inline loop_vec_info
+loop_vec_info_for_loop (const class loop *loop)
 {
   return (loop_vec_info) loop->aux;
 }
@@ -1789,7 +1799,7 @@  is_loop_header_bb_p (basic_block bb)
 {
   if (bb == (bb->loop_father)->header)
     return true;
-  gcc_checking_assert (EDGE_COUNT (bb->preds) == 1);
+
   return false;
 }
 
@@ -2176,9 +2186,10 @@  class auto_purge_vect_location
    in tree-vect-loop-manip.cc.  */
 extern void vect_set_loop_condition (class loop *, loop_vec_info,
 				     tree, tree, tree, bool);
-extern bool slpeel_can_duplicate_loop_p (const class loop *, const_edge);
+extern bool slpeel_can_duplicate_loop_p (const loop_vec_info, const_edge);
 class loop *slpeel_tree_duplicate_loop_to_edge_cfg (class loop *,
-						     class loop *, edge);
+						    class loop *, edge, bool,
+						    vec<basic_block> * = NULL);
 class loop *vect_loop_versioning (loop_vec_info, gimple *);
 extern class loop *vect_do_peeling (loop_vec_info, tree, tree,
 				    tree *, tree *, tree *, int, bool, bool,
diff --git a/gcc/tree-vectorizer.cc b/gcc/tree-vectorizer.cc
index a048e9d89178a37455bd7b83ab0f2a238a4ce69e..0dc5479dc92058b6c70c67f29f5dc9a8d72235f4 100644
--- a/gcc/tree-vectorizer.cc
+++ b/gcc/tree-vectorizer.cc
@@ -1379,7 +1379,9 @@  pass_vectorize::execute (function *fun)
 	 predicates that need to be shared for optimal predicate usage.
 	 However reassoc will re-order them and prevent CSE from working
 	 as it should.  CSE only the loop body, not the entry.  */
-      bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
+      auto_vec<edge> exits = get_loop_exit_edges (loop);
+      for (edge exit : exits)
+	bitmap_set_bit (exit_bbs, exit->dest->index);
 
       edge entry = EDGE_PRED (loop_preheader_edge (loop)->src, 0);
       do_rpo_vn (fun, entry, exit_bbs);