[Unfinished] Add first-order recurrence autovectorization

Message ID 20220929110723.277330-1-juzhe.zhong@rivai.ai
State Accepted, archived
Headers
Series [Unfinished] Add first-order recurrence autovectorization |

Checks

Context Check Description
snail/gcc-patch-check success Github commit url

Commit Message

juzhe.zhong@rivai.ai Sept. 29, 2022, 11:07 a.m. UTC
  From: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>

gcc/ChangeLog:

        * tree-vect-loop.cc (vect_phi_first_order_recurrence_p): New function.
        (vect_analyze_scalar_cycles_1): Classify first-order recurrence phi.
        (vect_analyze_loop_operations): Add first-order recurrence autovectorization support.
        (vectorizable_dep_phi): New function.
        (vect_use_first_order_phi_result_p): New function.
        (vect_transform_loop): Add first-order recurrence autovectorization support.
        * tree-vect-stmts.cc (vect_transform_stmt): Ditto.
        (vect_is_simple_use): Ditto.
        * tree-vectorizer.h (enum vect_def_type): New enum.
        (enum stmt_vec_info_type): Ditto.
        (vectorizable_dep_phi): New function.

Hi, since Richard said I can post unfinished for help, I post it.
This patch is for fix issue:https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99409.
LLVM can vectorize this case using first-order recurrence loop-vectorizer.
This patch is inspired by first-order recurrence autovectorization support in LLVM:
https://reviews.llvm.org/D16197
There is a link that I can show you several cases that GCC fails vectorization
because no support of firs-order recurrence vectorization: https://godbolt.org/z/nzf1Wrd6T

Let's consider a simple case that I simplify:
void foo (int32_t * __restrict__ a, int32_t * __restrict__ b, int32_t * __restrict__ c, int n)
{
  int32_t t = *c;
  for (int i = 0; i < n; ++i)
    {
      b[i] = a[i] - t;
      t = a[i];
    }
}

Applying this patch, my downstream RVV GCC can vectorize with -fdump-tree-vect:

note: vect_is_simple_use: operand t_21 = PHI <_4(6), t_12(5)>, type of def: first order recurrence

However, it ICE in "dce6" when removing PHI node "t_21 = PHI <_4(6), t_12(5)>":
0x143c174 crash_signal
        ../../../riscv-gcc/gcc/toplev.cc:322
0x170d4fd delink_imm_use
        ../../../riscv-gcc/gcc/ssa-iterators.h:257
I was stuck by this issue. Besides, this patch has more 2 more things to do that I didn't implement:

1. insert VEC_PERM before the vector subtraction statement (Because I was stuck, I didn't continue
   implementing this patch and miss this.)
2. Support this vectorization in SLP autovectorizaiton.

To understand this patch, 2 functions are important:

1. vect_phi_first_order_recurrence_p, this function is used to forbid the cases that can not be vectorized
   by this vectorizer. The constraints there are strictly the same as LLVM.
2. vectorizable_dep_phi, the implementation of first-order recurrence vectorizer.

I hope someone can help me fix && finish && test && refine this patch.
Thanks.

---
 gcc/tree-vect-loop.cc  | 239 ++++++++++++++++++++++++++++++++++++++++-
 gcc/tree-vect-stmts.cc |  12 ++-
 gcc/tree-vectorizer.h  |   4 +
 3 files changed, 252 insertions(+), 3 deletions(-)
  

Comments

Richard Biener Sept. 29, 2022, 2:20 p.m. UTC | #1
On Thu, Sep 29, 2022 at 1:07 PM <juzhe.zhong@rivai.ai> wrote:
>
> From: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>
>
> gcc/ChangeLog:
>
>         * tree-vect-loop.cc (vect_phi_first_order_recurrence_p): New function.
>         (vect_analyze_scalar_cycles_1): Classify first-order recurrence phi.
>         (vect_analyze_loop_operations): Add first-order recurrence autovectorization support.
>         (vectorizable_dep_phi): New function.
>         (vect_use_first_order_phi_result_p): New function.
>         (vect_transform_loop): Add first-order recurrence autovectorization support.
>         * tree-vect-stmts.cc (vect_transform_stmt): Ditto.
>         (vect_is_simple_use): Ditto.
>         * tree-vectorizer.h (enum vect_def_type): New enum.
>         (enum stmt_vec_info_type): Ditto.
>         (vectorizable_dep_phi): New function.
>
> Hi, since Richard said I can post unfinished for help, I post it.
> This patch is for fix issue:https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99409.
> LLVM can vectorize this case using first-order recurrence loop-vectorizer.
> This patch is inspired by first-order recurrence autovectorization support in LLVM:
> https://reviews.llvm.org/D16197
> There is a link that I can show you several cases that GCC fails vectorization
> because no support of firs-order recurrence vectorization: https://godbolt.org/z/nzf1Wrd6T
>
> Let's consider a simple case that I simplify:
> void foo (int32_t * __restrict__ a, int32_t * __restrict__ b, int32_t * __restrict__ c, int n)
> {
>   int32_t t = *c;
>   for (int i = 0; i < n; ++i)
>     {
>       b[i] = a[i] - t;
>       t = a[i];
>     }
> }
>
> Applying this patch, my downstream RVV GCC can vectorize with -fdump-tree-vect:
>
> note: vect_is_simple_use: operand t_21 = PHI <_4(6), t_12(5)>, type of def: first order recurrence
>
> However, it ICE in "dce6" when removing PHI node "t_21 = PHI <_4(6), t_12(5)>":
> 0x143c174 crash_signal
>         ../../../riscv-gcc/gcc/toplev.cc:322
> 0x170d4fd delink_imm_use
>         ../../../riscv-gcc/gcc/ssa-iterators.h:257
> I was stuck by this issue. Besides, this patch has more 2 more things to do that I didn't implement:

The issue is you're using

      gimple *first_use = first_imm_use_stmt (&imm, phi_result);

but that's an internal function, you need to use

gimple *first_use;
use_operand_p use_p;
single_imm_use (phi_result, &use_p, &first_use);

Otherwise you are corrupting the immediate use list.  When that's
fixed it seems to work (in the partial way it's implemented).

> 1. insert VEC_PERM before the vector subtraction statement (Because I was stuck, I didn't continue
>    implementing this patch and miss this.)
> 2. Support this vectorization in SLP autovectorizaiton.
>
> To understand this patch, 2 functions are important:
>
> 1. vect_phi_first_order_recurrence_p, this function is used to forbid the cases that can not be vectorized
>    by this vectorizer. The constraints there are strictly the same as LLVM.
> 2. vectorizable_dep_phi, the implementation of first-order recurrence vectorizer.
>
> I hope someone can help me fix && finish && test && refine this patch.
>
> Thanks.
>
> ---
>  gcc/tree-vect-loop.cc  | 239 ++++++++++++++++++++++++++++++++++++++++-
>  gcc/tree-vect-stmts.cc |  12 ++-
>  gcc/tree-vectorizer.h  |   4 +
>  3 files changed, 252 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index 2536cc3cf49..adb48356c23 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -529,6 +529,57 @@ vect_inner_phi_in_double_reduction_p (loop_vec_info loop_vinfo, gphi *phi)
>    return false;
>  }
>
> +/* Returns true if Phi is a first-order recurrence. A first-order
> +   recurrence is a non-reduction recurrence relation in which the value of
> +   the recurrence in the current loop iteration equals a value defined in
> +   the previous iteration.  */
> +
> +static bool
> +vect_phi_first_order_recurrence_p (loop_vec_info loop_vinfo, class loop *loop,
> +                                  gphi *phi)
> +{
> +  /* Ensure the phi node is in the loop header and has two incoming values.  */
> +  if (gimple_bb (phi) != loop->header || gimple_phi_num_args (phi) != 2)
> +    return false;
> +
> +  /* Ensure the loop has a preheader and a single latch block. The loop
> +     vectorizer will need the latch to set up the next iteration of the loop. */
> +  edge preheader = loop_preheader_edge (loop);
> +  edge latch = loop_latch_edge (loop);
> +  if (!preheader || !latch)
> +    return false;
> +
> +  /* Ensure the phi node's incoming blocks are the loop preheader and latch.  */
> +  if (!PHI_ARG_DEF_FROM_EDGE (phi, preheader)
> +      || !PHI_ARG_DEF_FROM_EDGE (phi, latch))
> +    return false;
> +
> +  /* Get the previous value. The previous value comes from the latch edge while
> +     the initial value comes form the preheader edge.  */
> +  gimple *previous = SSA_NAME_DEF_STMT (PHI_ARG_DEF_FROM_EDGE (phi, latch));
> +  if (!previous)
> +    return false;
> +
> +  /* Ensure every use_stmt of the phi node is dominated by the previous value.
> +     The dominance requirement ensures the loop vectorizer will not need to
> +     vectorize the initial value prior to the first iteration of the loop.  */
> +  gimple *use_stmt;
> +  imm_use_iterator imm_iter;
> +  FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, gimple_phi_result (phi))
> +    if (use_stmt)
> +      if (!dominated_by_p (CDI_DOMINATORS, gimple_bb (use_stmt),
> +                          gimple_bb (previous)))
> +       return false;
> +
> +  /* First-order recurrence autovectorization needs shuffle vector.  */
> +  tree scalar_type = TREE_TYPE (PHI_RESULT (phi));
> +  tree vectype = get_vectype_for_scalar_type (loop_vinfo, scalar_type);
> +  if (!can_vec_perm_var_p (TYPE_MODE (vectype)))
> +    return false;
> +
> +  return true;
> +}
> +
>  /* Function vect_analyze_scalar_cycles_1.
>
>     Examine the cross iteration def-use cycles of scalar variables
> @@ -666,6 +717,8 @@ vect_analyze_scalar_cycles_1 (loop_vec_info loop_vinfo, class loop *loop,
>                  }
>              }
>          }
> +      else if (vect_phi_first_order_recurrence_p (loop_vinfo, loop, phi))
> +       STMT_VINFO_DEF_TYPE (stmt_vinfo) = vect_first_order_recurrence;
>        else
>          if (dump_enabled_p ())
>            dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> @@ -1808,9 +1861,13 @@ vect_analyze_loop_operations (loop_vec_info loop_vinfo)
>
>            gcc_assert (stmt_info);
>
> +          if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_first_order_recurrence)
> +           ok = vectorizable_dep_phi (loop_vinfo, stmt_info, NULL, NULL);
> +
>            if ((STMT_VINFO_RELEVANT (stmt_info) == vect_used_in_scope
>                 || STMT_VINFO_LIVE_P (stmt_info))
> -              && STMT_VINFO_DEF_TYPE (stmt_info) != vect_induction_def)
> +              && STMT_VINFO_DEF_TYPE (stmt_info) != vect_induction_def
> +              && STMT_VINFO_DEF_TYPE (stmt_info) != vect_first_order_recurrence)
>             /* A scalar-dependence cycle that we don't support.  */
>             return opt_result::failure_at (phi,
>                                            "not vectorized:"
> @@ -8267,6 +8324,120 @@ vectorizable_phi (vec_info *,
>    return true;
>  }
>
> +/* Vectorizes Dep PHIs.  */
> +
> +bool
> +vectorizable_dep_phi (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
> +                     gimple **vec_stmt, slp_tree slp_node)
> +{
> +  if (!loop_vinfo || !is_a<gphi *> (stmt_info->stmt))
> +    return false;
> +
> +  gphi *phi = as_a<gphi *> (stmt_info->stmt);
> +
> +  /* So far we only support first-order recurrence auto-vectorization.  */
> +  if (STMT_VINFO_DEF_TYPE (stmt_info) != vect_first_order_recurrence)
> +    return false;
> +
> +  if (!vec_stmt) /* transformation not required.  */
> +    {
> +      /* So far we don't support first-order recurrence for SLP
> +       * auto-vectorization.   */
> +      if (slp_node)
> +       {
> +         if (dump_enabled_p ())
> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                            "do not support first-order recurrence"
> +                            "autovectorization for SLP node\n");
> +         return false;
> +       }
> +      STMT_VINFO_TYPE (stmt_info) = dep_phi_info_type;
> +      return true;
> +    }
> +
> +  /* This is the second phase of vectorizing first-order rececurrences. An
> +  overview of the transformation is described below. Suppose we have the
> +  following loop.
> +
> +    int32_t t = 0;
> +    for (int i = 0; i < n; ++i)
> +      {
> +       b[i] = a[i] - t;
> +       t = a[i];
> +      }
> +
> +  There is a first-order recurrence on "a". For this loop, the shorthand
> +  scalar IR looks like:
> +
> +    scalar.preheader:
> +      init = a[-1]
> +      br loop.body
> +
> +    scalar.body:
> +      i = PHI <0(scalar.preheader), i+1(scalar.body)>
> +      _2 = PHI <(init(scalar.preheader), <_1(scalar.body)>
> +      _1 = a[i]
> +      b[i] = _1 - _2
> +      br cond, scalar.body, ...
> +
> +  In this example, _2 is a recurrence because it's value depends on the
> +  previous iteration. In the first phase of vectorization, we created a
> +  temporary value for _2. We now complete the vectorization and produce the
> +  shorthand vector IR shown below (VF = 4).
> +
> +    vector.preheader:
> +      vect_init = vect_cst(..., ..., ..., a[-1])
> +      br vector.body
> +
> +    vector.body
> +      i = PHI <0(vector.preheader), i+4(vector.body)>
> +      vect_1 = PHI <vect_init(vector.preheader), v2(vector.body)>
> +      vect_2 = a[i, i+1, i+2, i+3];
> +      vect_3 = vector(vect_1(3), vect_2(0, 1, 2))
> +      b[i, i+1, i+2, i+3] = vect_2 - vect_3
> +      br cond, vector.body, middle.block
> +
> +    middle.block:
> +      x = vect_2(3)
> +      br scalar.preheader
> +
> +    scalar.ph:
> +      s_init = PHI <x(middle.block), a[-1], otherwise>
> +      br scalar.body
> +
> +  After execution completes the vector loop, we extract the next value of
> +  the recurrence (x) to use as the initial value in the scalar loop.  */
> +
> +  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> +  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
> +  tree scalar_dest = gimple_phi_result (stmt_info->stmt);
> +  basic_block bb = gimple_bb (stmt_info->stmt);
> +  tree vec_dest = vect_create_destination_var (scalar_dest, vectype);
> +  auto_vec<tree> vec_oprnds;
> +  tree preheader = PHI_ARG_DEF_FROM_EDGE (phi, loop_preheader_edge (loop));
> +  tree latch = PHI_ARG_DEF_FROM_EDGE (phi, loop_latch_edge (loop));
> +  gimple_seq stmts = NULL;
> +
> +  tree vec_init = build_vector_from_val (vectype, preheader);
> +  vec_init = vect_init_vector (loop_vinfo, stmt_info,
> +                              vec_init, vectype, NULL);
> +  vect_get_vec_defs (loop_vinfo, stmt_info, slp_node,
> +                    !slp_node ? vect_get_num_copies (loop_vinfo, vectype) : 1,
> +                    gimple_phi_arg_def (stmt_info->stmt, 0), &vec_oprnds);
> +  /* Create the vectorized first-order PHI node.  */
> +  gphi *new_phi = create_phi_node (vec_dest, bb);
> +  add_phi_arg (new_phi, vec_oprnds[0], loop_latch_edge (loop),
> +              UNKNOWN_LOCATION);
> +  add_phi_arg (new_phi, vec_init, loop_preheader_edge (loop), UNKNOWN_LOCATION);
> +  if (slp_node)
> +    SLP_TREE_VEC_STMTS (slp_node).quick_push (new_phi);
> +  else
> +    STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_phi);
> +  if (!slp_node)
> +    *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
> +  return true;
> +}
> +
>  /* Return true if VECTYPE represents a vector that requires lowering
>     by the vector lowering pass.  */
>
> @@ -10476,6 +10647,26 @@ update_epilogue_loop_vinfo (class loop *epilogue, tree advance)
>    epilogue_vinfo->shared->save_datarefs ();
>  }
>
> +/* Return true if the stmt is the first statement that uses the result
> +   of a first-order recurrence phi node.  */
> +
> +static gphi *
> +vect_use_first_order_phi_result_p (auto_vec<gphi *> &first_order_phi_vec,
> +                                  gimple *stmt)
> +{
> +  for (unsigned int i = 0; i < first_order_phi_vec.length (); ++i)
> +    {
> +      imm_use_iterator imm;
> +      gphi *phi = first_order_phi_vec[i];
> +      tree phi_result = PHI_RESULT (phi);
> +      gimple *first_use = first_imm_use_stmt (&imm, phi_result);
> +      if (first_use && first_use == stmt)
> +       return phi;
> +    }
> +
> +  return NULL;
> +}
> +
>  /* Function vect_transform_loop.
>
>     The analysis phase has determined that the loop is vectorizable.
> @@ -10624,6 +10815,31 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
>        basic_block bb = bbs[i];
>        stmt_vec_info stmt_info;
>
> +      /* It's used to save the first-order recurrence phi node.
> +         Consider the following case:
> +         # t_19 = PHI <_4(6), 0(5)>
> +         ...
> +         _4 = *_3;
> +         ...
> +         _6 = _4 - t_19;
> +
> +         The vectorization sequence should be:
> +         1. Vectorize _4 = *_3;
> +         2. Vectorize # t_19 = PHI <_4(6), 0(5)>
> +         3, Vectorize _6 = _4 - t_19;
> +
> +         Flow:
> +         1. So we save the first-order recurrence PHI in first_order_phi_vec
> +           and skip vectorizing it tentatively when iterating phi statement
> +           (save t_19 = PHI <_4(6), 0(5)>).
> +         2. Vectorize all statements when iterating all statements
> +            until we reach the statement who is using the result of
> +            first-order recurrence PHI (vectorize _4 = *_3;).
> +         3. Stop iterating and return to vectorize the PHI saved
> +           in first_order_phi_vec (# t_19 = PHI <_4(6), 0(5)>).
> +         4. Conintue iterating and vectorize the residual statements.  */
> +      auto_vec<gphi *> first_order_phi_vec;
> +
>        for (gphi_iterator si = gsi_start_phis (bb); !gsi_end_p (si);
>            gsi_next (&si))
>         {
> @@ -10677,9 +10893,16 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
>                || STMT_VINFO_DEF_TYPE (stmt_info) == vect_reduction_def
>                || STMT_VINFO_DEF_TYPE (stmt_info) == vect_double_reduction_def
>                || STMT_VINFO_DEF_TYPE (stmt_info) == vect_nested_cycle
> -              || STMT_VINFO_DEF_TYPE (stmt_info) == vect_internal_def)
> +              || STMT_VINFO_DEF_TYPE (stmt_info) == vect_internal_def
> +              || STMT_VINFO_DEF_TYPE (stmt_info) == vect_first_order_recurrence)
>               && ! PURE_SLP_STMT (stmt_info))
>             maybe_set_vectorized_backedge_value (loop_vinfo, stmt_info);
> +
> +         if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_first_order_recurrence)
> +           {
> +             first_order_phi_vec.safe_push (phi);
> +             continue;
> +           }
>         }
>
>        for (gimple_stmt_iterator si = gsi_start_bb (bb);
> @@ -10698,6 +10921,18 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
>               /* Ignore vector stmts created in the outer loop.  */
>               stmt_info = loop_vinfo->lookup_stmt (stmt);
>
> +             gphi *first_order_phi
> +               = vect_use_first_order_phi_result_p (first_order_phi_vec, stmt);
> +             if (first_order_phi)
> +               {
> +                 stmt_vec_info first_order_stmt_info
> +                   = loop_vinfo->lookup_stmt (first_order_phi);
> +                 gimple_stmt_iterator first_order_si
> +                   = gsi_for_stmt (first_order_phi);
> +                 vect_transform_loop_stmt (loop_vinfo, first_order_stmt_info,
> +                                           &first_order_si, NULL);
> +               }
> +
>               /* vector stmts created in the outer-loop during vectorization of
>                  stmts in an inner-loop may not have a stmt_info, and do not
>                  need to be vectorized.  */
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index c8d1efc45e5..a8c0a8a4151 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -11404,6 +11404,12 @@ vect_transform_stmt (vec_info *vinfo,
>        gcc_assert (done);
>        break;
>
> +    case dep_phi_info_type:
> +      done = vectorizable_dep_phi (as_a <loop_vec_info> (vinfo),
> +                                 stmt_info, &vec_stmt, slp_node);
> +      gcc_assert (done);
> +      break;
> +
>      case phi_info_type:
>        done = vectorizable_phi (vinfo, stmt_info, &vec_stmt, slp_node, NULL);
>        gcc_assert (done);
> @@ -11804,6 +11810,9 @@ vect_is_simple_use (tree operand, vec_info *vinfo, enum vect_def_type *dt,
>         case vect_nested_cycle:
>           dump_printf (MSG_NOTE, "nested cycle\n");
>           break;
> +       case vect_first_order_recurrence:
> +         dump_printf (MSG_NOTE, "first order recurrence\n");
> +         break;
>         case vect_unknown_def_type:
>           dump_printf (MSG_NOTE, "unknown\n");
>           break;
> @@ -11852,7 +11861,8 @@ vect_is_simple_use (tree operand, vec_info *vinfo, enum vect_def_type *dt,
>        || *dt == vect_induction_def
>        || *dt == vect_reduction_def
>        || *dt == vect_double_reduction_def
> -      || *dt == vect_nested_cycle)
> +      || *dt == vect_nested_cycle
> +                       || *dt == vect_first_order_recurrence)
>      {
>        *vectype = STMT_VINFO_VECTYPE (def_stmt_info);
>        gcc_assert (*vectype != NULL_TREE);
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index 4870c754499..2c8e7660b4e 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -65,6 +65,7 @@ enum vect_def_type {
>    vect_reduction_def,
>    vect_double_reduction_def,
>    vect_nested_cycle,
> +  vect_first_order_recurrence,
>    vect_unknown_def_type
>  };
>
> @@ -1027,6 +1028,7 @@ enum stmt_vec_info_type {
>    cycle_phi_info_type,
>    lc_phi_info_type,
>    phi_info_type,
> +  dep_phi_info_type,
>    loop_exit_ctrl_vec_info_type
>  };
>
> @@ -2331,6 +2333,8 @@ extern bool vectorizable_lc_phi (loop_vec_info, stmt_vec_info,
>                                  gimple **, slp_tree);
>  extern bool vectorizable_phi (vec_info *, stmt_vec_info, gimple **, slp_tree,
>                               stmt_vector_for_cost *);
> +extern bool vectorizable_dep_phi (loop_vec_info, stmt_vec_info,
> +                                gimple **, slp_tree);
>  extern bool vect_emulated_vector_p (tree);
>  extern bool vect_can_vectorize_without_simd_p (tree_code);
>  extern bool vect_can_vectorize_without_simd_p (code_helper);
> --
> 2.36.1
>
  
Richard Sandiford Sept. 29, 2022, 4:53 p.m. UTC | #2
Thanks for posting the patch.

juzhe.zhong@rivai.ai writes:
> From: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>
>
> gcc/ChangeLog:
>
>         * tree-vect-loop.cc (vect_phi_first_order_recurrence_p): New function.
>         (vect_analyze_scalar_cycles_1): Classify first-order recurrence phi.
>         (vect_analyze_loop_operations): Add first-order recurrence autovectorization support.
>         (vectorizable_dep_phi): New function.
>         (vect_use_first_order_phi_result_p): New function.
>         (vect_transform_loop): Add first-order recurrence autovectorization support.
>         * tree-vect-stmts.cc (vect_transform_stmt): Ditto.
>         (vect_is_simple_use): Ditto.
>         * tree-vectorizer.h (enum vect_def_type): New enum.
>         (enum stmt_vec_info_type): Ditto.
>         (vectorizable_dep_phi): New function.
>
> Hi, since Richard said I can post unfinished for help, I post it.
> This patch is for fix issue:https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99409.
> LLVM can vectorize this case using first-order recurrence loop-vectorizer.
> This patch is inspired by first-order recurrence autovectorization support in LLVM:
> https://reviews.llvm.org/D16197
> There is a link that I can show you several cases that GCC fails vectorization
> because no support of firs-order recurrence vectorization: https://godbolt.org/z/nzf1Wrd6T
>
> Let's consider a simple case that I simplify:
> void foo (int32_t * __restrict__ a, int32_t * __restrict__ b, int32_t * __restrict__ c, int n)
> {
>   int32_t t = *c;
>   for (int i = 0; i < n; ++i)
>     {
>       b[i] = a[i] - t;
>       t = a[i];
>     }
> }

One thing that I wondered about the LLVM implementation is:
does reusing the loaded value really pay for itself?  E.g. for
the un-predictive-commoned version:

void foo (int32_t * __restrict__ a, int32_t * __restrict__ b, int32_t * __restr\
ict__ c, int n)
{
  b[0] = a[0] - *c;
  for (int i = 1; i < n; ++i)
    b[i] = a[i] - a[i - 1];
}

GCC generates:

L4:
        ldr     q0, [x6, x2]
        ldr     q1, [x0, x2]
        sub     v0.4s, v0.4s, v1.4s
        str     q0, [x5, x2]
        add     x2, x2, 16
        cmp     x2, x4
        bne     .L4

whereas LLVM (with -fno-unroll-loops) generates:

.LBB0_4:                                // %vector.body
        mov     v1.16b, v0.16b
        subs    x15, x15, #4
        ldr     q0, [x13], #16
        ext     v1.16b, v1.16b, v0.16b, #12
        sub     v1.4s, v0.4s, v1.4s
        str     q1, [x14], #16
        b.ne    .LBB0_4

Introducing the loop-carried dependency (via the ext) limits the
throughput of the loop to the latency of a permutation.

But I guess which approach is better depends on the amount of work
that is repeated by GCC's approach.  For a single load it's probably
better to repeat the work, but for something more complicated the
general recurrence approach probably wins out.

So perhaps we should first handle the general case (as for your patch)
and then, as a potential later follow-on patch, optimise the cases where
the loop-carried dependency is harmful?

This is all hand-wavy speculation, in case it wasn't obvious :-)

Thanks,
Richard

> Applying this patch, my downstream RVV GCC can vectorize with -fdump-tree-vect:
>
> note: vect_is_simple_use: operand t_21 = PHI <_4(6), t_12(5)>, type of def: first order recurrence
>
> However, it ICE in "dce6" when removing PHI node "t_21 = PHI <_4(6), t_12(5)>":
> 0x143c174 crash_signal
>         ../../../riscv-gcc/gcc/toplev.cc:322
> 0x170d4fd delink_imm_use
>         ../../../riscv-gcc/gcc/ssa-iterators.h:257
> I was stuck by this issue. Besides, this patch has more 2 more things to do that I didn't implement:
>
> 1. insert VEC_PERM before the vector subtraction statement (Because I was stuck, I didn't continue
>    implementing this patch and miss this.)
> 2. Support this vectorization in SLP autovectorizaiton.
>
> To understand this patch, 2 functions are important:
>
> 1. vect_phi_first_order_recurrence_p, this function is used to forbid the cases that can not be vectorized
>    by this vectorizer. The constraints there are strictly the same as LLVM.
> 2. vectorizable_dep_phi, the implementation of first-order recurrence vectorizer.
>
> I hope someone can help me fix && finish && test && refine this patch.
> Thanks.
>
> ---
>  gcc/tree-vect-loop.cc  | 239 ++++++++++++++++++++++++++++++++++++++++-
>  gcc/tree-vect-stmts.cc |  12 ++-
>  gcc/tree-vectorizer.h  |   4 +
>  3 files changed, 252 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index 2536cc3cf49..adb48356c23 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -529,6 +529,57 @@ vect_inner_phi_in_double_reduction_p (loop_vec_info loop_vinfo, gphi *phi)
>    return false;
>  }
>  
> +/* Returns true if Phi is a first-order recurrence. A first-order
> +   recurrence is a non-reduction recurrence relation in which the value of
> +   the recurrence in the current loop iteration equals a value defined in
> +   the previous iteration.  */
> +
> +static bool
> +vect_phi_first_order_recurrence_p (loop_vec_info loop_vinfo, class loop *loop,
> +				   gphi *phi)
> +{
> +  /* Ensure the phi node is in the loop header and has two incoming values.  */
> +  if (gimple_bb (phi) != loop->header || gimple_phi_num_args (phi) != 2)
> +    return false;
> +
> +  /* Ensure the loop has a preheader and a single latch block. The loop
> +     vectorizer will need the latch to set up the next iteration of the loop. */
> +  edge preheader = loop_preheader_edge (loop);
> +  edge latch = loop_latch_edge (loop);
> +  if (!preheader || !latch)
> +    return false;
> +
> +  /* Ensure the phi node's incoming blocks are the loop preheader and latch.  */
> +  if (!PHI_ARG_DEF_FROM_EDGE (phi, preheader)
> +      || !PHI_ARG_DEF_FROM_EDGE (phi, latch))
> +    return false;
> +
> +  /* Get the previous value. The previous value comes from the latch edge while
> +     the initial value comes form the preheader edge.  */
> +  gimple *previous = SSA_NAME_DEF_STMT (PHI_ARG_DEF_FROM_EDGE (phi, latch));
> +  if (!previous)
> +    return false;
> +
> +  /* Ensure every use_stmt of the phi node is dominated by the previous value.
> +     The dominance requirement ensures the loop vectorizer will not need to
> +     vectorize the initial value prior to the first iteration of the loop.  */
> +  gimple *use_stmt;
> +  imm_use_iterator imm_iter;
> +  FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, gimple_phi_result (phi))
> +    if (use_stmt)
> +      if (!dominated_by_p (CDI_DOMINATORS, gimple_bb (use_stmt),
> +			   gimple_bb (previous)))
> +	return false;
> +
> +  /* First-order recurrence autovectorization needs shuffle vector.  */
> +  tree scalar_type = TREE_TYPE (PHI_RESULT (phi));
> +  tree vectype = get_vectype_for_scalar_type (loop_vinfo, scalar_type);
> +  if (!can_vec_perm_var_p (TYPE_MODE (vectype)))
> +    return false;
> +
> +  return true;
> +}
> +
>  /* Function vect_analyze_scalar_cycles_1.
>  
>     Examine the cross iteration def-use cycles of scalar variables
> @@ -666,6 +717,8 @@ vect_analyze_scalar_cycles_1 (loop_vec_info loop_vinfo, class loop *loop,
>                  }
>              }
>          }
> +      else if (vect_phi_first_order_recurrence_p (loop_vinfo, loop, phi))
> +	STMT_VINFO_DEF_TYPE (stmt_vinfo) = vect_first_order_recurrence;
>        else
>          if (dump_enabled_p ())
>            dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> @@ -1808,9 +1861,13 @@ vect_analyze_loop_operations (loop_vec_info loop_vinfo)
>  
>            gcc_assert (stmt_info);
>  
> +          if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_first_order_recurrence)
> +	    ok = vectorizable_dep_phi (loop_vinfo, stmt_info, NULL, NULL);
> +
>            if ((STMT_VINFO_RELEVANT (stmt_info) == vect_used_in_scope
>                 || STMT_VINFO_LIVE_P (stmt_info))
> -              && STMT_VINFO_DEF_TYPE (stmt_info) != vect_induction_def)
> +              && STMT_VINFO_DEF_TYPE (stmt_info) != vect_induction_def
> +              && STMT_VINFO_DEF_TYPE (stmt_info) != vect_first_order_recurrence)
>  	    /* A scalar-dependence cycle that we don't support.  */
>  	    return opt_result::failure_at (phi,
>  					   "not vectorized:"
> @@ -8267,6 +8324,120 @@ vectorizable_phi (vec_info *,
>    return true;
>  }
>  
> +/* Vectorizes Dep PHIs.  */
> +
> +bool
> +vectorizable_dep_phi (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
> +		      gimple **vec_stmt, slp_tree slp_node)
> +{
> +  if (!loop_vinfo || !is_a<gphi *> (stmt_info->stmt))
> +    return false;
> +
> +  gphi *phi = as_a<gphi *> (stmt_info->stmt);
> +
> +  /* So far we only support first-order recurrence auto-vectorization.  */
> +  if (STMT_VINFO_DEF_TYPE (stmt_info) != vect_first_order_recurrence)
> +    return false;
> +
> +  if (!vec_stmt) /* transformation not required.  */
> +    {
> +      /* So far we don't support first-order recurrence for SLP
> +       * auto-vectorization.   */
> +      if (slp_node)
> +	{
> +	  if (dump_enabled_p ())
> +	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +			     "do not support first-order recurrence"
> +			     "autovectorization for SLP node\n");
> +	  return false;
> +	}
> +      STMT_VINFO_TYPE (stmt_info) = dep_phi_info_type;
> +      return true;
> +    }
> +
> +  /* This is the second phase of vectorizing first-order rececurrences. An
> +  overview of the transformation is described below. Suppose we have the
> +  following loop.
> +
> +    int32_t t = 0;
> +    for (int i = 0; i < n; ++i)
> +      {
> +	b[i] = a[i] - t;
> +	t = a[i];
> +      }
> +
> +  There is a first-order recurrence on "a". For this loop, the shorthand
> +  scalar IR looks like:
> +
> +    scalar.preheader:
> +      init = a[-1]
> +      br loop.body
> +
> +    scalar.body:
> +      i = PHI <0(scalar.preheader), i+1(scalar.body)>
> +      _2 = PHI <(init(scalar.preheader), <_1(scalar.body)>
> +      _1 = a[i]
> +      b[i] = _1 - _2
> +      br cond, scalar.body, ...
> +
> +  In this example, _2 is a recurrence because it's value depends on the
> +  previous iteration. In the first phase of vectorization, we created a
> +  temporary value for _2. We now complete the vectorization and produce the
> +  shorthand vector IR shown below (VF = 4).
> +
> +    vector.preheader:
> +      vect_init = vect_cst(..., ..., ..., a[-1])
> +      br vector.body
> +
> +    vector.body
> +      i = PHI <0(vector.preheader), i+4(vector.body)>
> +      vect_1 = PHI <vect_init(vector.preheader), v2(vector.body)>
> +      vect_2 = a[i, i+1, i+2, i+3];
> +      vect_3 = vector(vect_1(3), vect_2(0, 1, 2))
> +      b[i, i+1, i+2, i+3] = vect_2 - vect_3
> +      br cond, vector.body, middle.block
> +
> +    middle.block:
> +      x = vect_2(3)
> +      br scalar.preheader
> +
> +    scalar.ph:
> +      s_init = PHI <x(middle.block), a[-1], otherwise>
> +      br scalar.body
> +
> +  After execution completes the vector loop, we extract the next value of
> +  the recurrence (x) to use as the initial value in the scalar loop.  */
> +
> +  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> +  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
> +  tree scalar_dest = gimple_phi_result (stmt_info->stmt);
> +  basic_block bb = gimple_bb (stmt_info->stmt);
> +  tree vec_dest = vect_create_destination_var (scalar_dest, vectype);
> +  auto_vec<tree> vec_oprnds;
> +  tree preheader = PHI_ARG_DEF_FROM_EDGE (phi, loop_preheader_edge (loop));
> +  tree latch = PHI_ARG_DEF_FROM_EDGE (phi, loop_latch_edge (loop));
> +  gimple_seq stmts = NULL;
> +
> +  tree vec_init = build_vector_from_val (vectype, preheader);
> +  vec_init = vect_init_vector (loop_vinfo, stmt_info,
> +			       vec_init, vectype, NULL);
> +  vect_get_vec_defs (loop_vinfo, stmt_info, slp_node,
> +		     !slp_node ? vect_get_num_copies (loop_vinfo, vectype) : 1,
> +		     gimple_phi_arg_def (stmt_info->stmt, 0), &vec_oprnds);
> +  /* Create the vectorized first-order PHI node.  */
> +  gphi *new_phi = create_phi_node (vec_dest, bb);
> +  add_phi_arg (new_phi, vec_oprnds[0], loop_latch_edge (loop),
> +	       UNKNOWN_LOCATION);
> +  add_phi_arg (new_phi, vec_init, loop_preheader_edge (loop), UNKNOWN_LOCATION);
> +  if (slp_node)
> +    SLP_TREE_VEC_STMTS (slp_node).quick_push (new_phi);
> +  else
> +    STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_phi);
> +  if (!slp_node)
> +    *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
> +  return true;
> +}
> +
>  /* Return true if VECTYPE represents a vector that requires lowering
>     by the vector lowering pass.  */
>  
> @@ -10476,6 +10647,26 @@ update_epilogue_loop_vinfo (class loop *epilogue, tree advance)
>    epilogue_vinfo->shared->save_datarefs ();
>  }
>  
> +/* Return true if the stmt is the first statement that uses the result
> +   of a first-order recurrence phi node.  */
> +
> +static gphi *
> +vect_use_first_order_phi_result_p (auto_vec<gphi *> &first_order_phi_vec,
> +				   gimple *stmt)
> +{
> +  for (unsigned int i = 0; i < first_order_phi_vec.length (); ++i)
> +    {
> +      imm_use_iterator imm;
> +      gphi *phi = first_order_phi_vec[i];
> +      tree phi_result = PHI_RESULT (phi);
> +      gimple *first_use = first_imm_use_stmt (&imm, phi_result);
> +      if (first_use && first_use == stmt)
> +	return phi;
> +    }
> +
> +  return NULL;
> +}
> +
>  /* Function vect_transform_loop.
>  
>     The analysis phase has determined that the loop is vectorizable.
> @@ -10624,6 +10815,31 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
>        basic_block bb = bbs[i];
>        stmt_vec_info stmt_info;
>  
> +      /* It's used to save the first-order recurrence phi node.
> +         Consider the following case:
> +         # t_19 = PHI <_4(6), 0(5)>
> +         ...
> +         _4 = *_3;
> +         ...
> +         _6 = _4 - t_19;
> +         
> +         The vectorization sequence should be:
> +         1. Vectorize _4 = *_3;
> +         2. Vectorize # t_19 = PHI <_4(6), 0(5)>
> +         3, Vectorize _6 = _4 - t_19;
> +         
> +         Flow:
> +         1. So we save the first-order recurrence PHI in first_order_phi_vec
> +	    and skip vectorizing it tentatively when iterating phi statement 
> +	    (save t_19 = PHI <_4(6), 0(5)>).
> +         2. Vectorize all statements when iterating all statements
> +            until we reach the statement who is using the result of
> +            first-order recurrence PHI (vectorize _4 = *_3;).
> +         3. Stop iterating and return to vectorize the PHI saved
> +	    in first_order_phi_vec (# t_19 = PHI <_4(6), 0(5)>).
> +         4. Conintue iterating and vectorize the residual statements.  */
> +      auto_vec<gphi *> first_order_phi_vec;
> +
>        for (gphi_iterator si = gsi_start_phis (bb); !gsi_end_p (si);
>  	   gsi_next (&si))
>  	{
> @@ -10677,9 +10893,16 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
>  	       || STMT_VINFO_DEF_TYPE (stmt_info) == vect_reduction_def
>  	       || STMT_VINFO_DEF_TYPE (stmt_info) == vect_double_reduction_def
>  	       || STMT_VINFO_DEF_TYPE (stmt_info) == vect_nested_cycle
> -	       || STMT_VINFO_DEF_TYPE (stmt_info) == vect_internal_def)
> +	       || STMT_VINFO_DEF_TYPE (stmt_info) == vect_internal_def
> +	       || STMT_VINFO_DEF_TYPE (stmt_info) == vect_first_order_recurrence)
>  	      && ! PURE_SLP_STMT (stmt_info))
>  	    maybe_set_vectorized_backedge_value (loop_vinfo, stmt_info);
> +
> +	  if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_first_order_recurrence)
> +	    {
> +	      first_order_phi_vec.safe_push (phi);
> +	      continue;
> +	    }
>  	}
>  
>        for (gimple_stmt_iterator si = gsi_start_bb (bb);
> @@ -10698,6 +10921,18 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
>  	      /* Ignore vector stmts created in the outer loop.  */
>  	      stmt_info = loop_vinfo->lookup_stmt (stmt);
>  
> +	      gphi *first_order_phi
> +		= vect_use_first_order_phi_result_p (first_order_phi_vec, stmt);
> +	      if (first_order_phi)
> +		{
> +		  stmt_vec_info first_order_stmt_info
> +		    = loop_vinfo->lookup_stmt (first_order_phi);
> +		  gimple_stmt_iterator first_order_si
> +		    = gsi_for_stmt (first_order_phi);
> +		  vect_transform_loop_stmt (loop_vinfo, first_order_stmt_info,
> +					    &first_order_si, NULL);
> +		}
> +
>  	      /* vector stmts created in the outer-loop during vectorization of
>  		 stmts in an inner-loop may not have a stmt_info, and do not
>  		 need to be vectorized.  */
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index c8d1efc45e5..a8c0a8a4151 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -11404,6 +11404,12 @@ vect_transform_stmt (vec_info *vinfo,
>        gcc_assert (done);
>        break;
>  
> +    case dep_phi_info_type:
> +      done = vectorizable_dep_phi (as_a <loop_vec_info> (vinfo),
> +				  stmt_info, &vec_stmt, slp_node);
> +      gcc_assert (done);
> +      break;
> +
>      case phi_info_type:
>        done = vectorizable_phi (vinfo, stmt_info, &vec_stmt, slp_node, NULL);
>        gcc_assert (done);
> @@ -11804,6 +11810,9 @@ vect_is_simple_use (tree operand, vec_info *vinfo, enum vect_def_type *dt,
>  	case vect_nested_cycle:
>  	  dump_printf (MSG_NOTE, "nested cycle\n");
>  	  break;
> +	case vect_first_order_recurrence:
> +	  dump_printf (MSG_NOTE, "first order recurrence\n");
> +	  break;
>  	case vect_unknown_def_type:
>  	  dump_printf (MSG_NOTE, "unknown\n");
>  	  break;
> @@ -11852,7 +11861,8 @@ vect_is_simple_use (tree operand, vec_info *vinfo, enum vect_def_type *dt,
>        || *dt == vect_induction_def
>        || *dt == vect_reduction_def
>        || *dt == vect_double_reduction_def
> -      || *dt == vect_nested_cycle)
> +      || *dt == vect_nested_cycle
> +			|| *dt == vect_first_order_recurrence)
>      {
>        *vectype = STMT_VINFO_VECTYPE (def_stmt_info);
>        gcc_assert (*vectype != NULL_TREE);
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index 4870c754499..2c8e7660b4e 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -65,6 +65,7 @@ enum vect_def_type {
>    vect_reduction_def,
>    vect_double_reduction_def,
>    vect_nested_cycle,
> +  vect_first_order_recurrence,
>    vect_unknown_def_type
>  };
>  
> @@ -1027,6 +1028,7 @@ enum stmt_vec_info_type {
>    cycle_phi_info_type,
>    lc_phi_info_type,
>    phi_info_type,
> +  dep_phi_info_type,
>    loop_exit_ctrl_vec_info_type
>  };
>  
> @@ -2331,6 +2333,8 @@ extern bool vectorizable_lc_phi (loop_vec_info, stmt_vec_info,
>  				 gimple **, slp_tree);
>  extern bool vectorizable_phi (vec_info *, stmt_vec_info, gimple **, slp_tree,
>  			      stmt_vector_for_cost *);
> +extern bool vectorizable_dep_phi (loop_vec_info, stmt_vec_info,
> +				 gimple **, slp_tree);
>  extern bool vect_emulated_vector_p (tree);
>  extern bool vect_can_vectorize_without_simd_p (tree_code);
>  extern bool vect_can_vectorize_without_simd_p (code_helper);
  
juzhe.zhong@rivai.ai Sept. 29, 2022, 9:39 p.m. UTC | #3
Yeah, frankly, I already noticed this situation.
If we can manually rewrite some codes, GCC can solve data dependency in scalar passes 
by introducing repeating statement (It will remove PHI nodes) before loop vectorizer.
Which approach is winner, GCC or LLVM ? This is not point that I care about.
My goal is to fix cases that GCC failed to vectorize and make GCC loop vectorizer more powerful and can vectorize more cases.
Besides, In many situations user doesn't want to rewrite the codes and also we can't leave data dependency to scalar pass to handle it.

The same example I presented you, users could write codes in different styles will get different vectorization codegen (after applying my patch).
However, LLVM can not achieve that, no matter how you write the codes they always uses general first-order recurrence loop vectorizer. 
And I think this is the advantage GCC overcome LLVM after my patch is finished and merge into GCC upstream.
Which approach is better? Leave it to user choose it.

If you watched my presentation in GNU cauldron 2022. I have showed the comparison between RVV LLVM and RVV GCC.
After compiling and testing many benchmarks, I noticed LLVM can always vectorize more cases than GCC.
However, in case of cases that both GCC and LLVM can vectorize, some cases GCC wins, some cases GCC and LLVM are the same or LLVM wins,
but overal GCC can win more in most of cases.
I have analyzed most of them, because GCC is missing some general loop vectorizer that is what I want to do (translating LLVM loop vectorizer into GCC).

So, let's me first finish this patch and test it in the downstream RVV GCC. I can only test it in my downstream RVV GCC.
Because the RISC-V backend in upstream GCC is far from ready to support autovectorization even though my about 10 pathes of RVV support are merged into GCC upstream.
Then I post the finished version of this loop vectorizer to you, can you help me test it in ARM platform ? Thanks.


juzhe.zhong@rivai.ai
 
From: Richard Sandiford
Date: 2022-09-30 00:53
To: juzhe.zhong
CC: gcc-patches
Subject: Re: [Unfinished PATCH] Add first-order recurrence autovectorization
Thanks for posting the patch.
 
juzhe.zhong@rivai.ai writes:
> From: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>
>
> gcc/ChangeLog:
>
>         * tree-vect-loop.cc (vect_phi_first_order_recurrence_p): New function.
>         (vect_analyze_scalar_cycles_1): Classify first-order recurrence phi.
>         (vect_analyze_loop_operations): Add first-order recurrence autovectorization support.
>         (vectorizable_dep_phi): New function.
>         (vect_use_first_order_phi_result_p): New function.
>         (vect_transform_loop): Add first-order recurrence autovectorization support.
>         * tree-vect-stmts.cc (vect_transform_stmt): Ditto.
>         (vect_is_simple_use): Ditto.
>         * tree-vectorizer.h (enum vect_def_type): New enum.
>         (enum stmt_vec_info_type): Ditto.
>         (vectorizable_dep_phi): New function.
>
> Hi, since Richard said I can post unfinished for help, I post it.
> This patch is for fix issue:https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99409.
> LLVM can vectorize this case using first-order recurrence loop-vectorizer.
> This patch is inspired by first-order recurrence autovectorization support in LLVM:
> https://reviews.llvm.org/D16197
> There is a link that I can show you several cases that GCC fails vectorization
> because no support of firs-order recurrence vectorization: https://godbolt.org/z/nzf1Wrd6T
>
> Let's consider a simple case that I simplify:
> void foo (int32_t * __restrict__ a, int32_t * __restrict__ b, int32_t * __restrict__ c, int n)
> {
>   int32_t t = *c;
>   for (int i = 0; i < n; ++i)
>     {
>       b[i] = a[i] - t;
>       t = a[i];
>     }
> }
 
One thing that I wondered about the LLVM implementation is:
does reusing the loaded value really pay for itself?  E.g. for
the un-predictive-commoned version:
 
void foo (int32_t * __restrict__ a, int32_t * __restrict__ b, int32_t * __restr\
ict__ c, int n)
{
  b[0] = a[0] - *c;
  for (int i = 1; i < n; ++i)
    b[i] = a[i] - a[i - 1];
}
 
GCC generates:
 
L4:
        ldr     q0, [x6, x2]
        ldr     q1, [x0, x2]
        sub     v0.4s, v0.4s, v1.4s
        str     q0, [x5, x2]
        add     x2, x2, 16
        cmp     x2, x4
        bne     .L4
 
whereas LLVM (with -fno-unroll-loops) generates:
 
.LBB0_4:                                // %vector.body
        mov     v1.16b, v0.16b
        subs    x15, x15, #4
        ldr     q0, [x13], #16
        ext     v1.16b, v1.16b, v0.16b, #12
        sub     v1.4s, v0.4s, v1.4s
        str     q1, [x14], #16
        b.ne    .LBB0_4
 
Introducing the loop-carried dependency (via the ext) limits the
throughput of the loop to the latency of a permutation.
 
But I guess which approach is better depends on the amount of work
that is repeated by GCC's approach.  For a single load it's probably
better to repeat the work, but for something more complicated the
general recurrence approach probably wins out.
 
So perhaps we should first handle the general case (as for your patch)
and then, as a potential later follow-on patch, optimise the cases where
the loop-carried dependency is harmful?
 
This is all hand-wavy speculation, in case it wasn't obvious :-)
 
Thanks,
Richard
 
> Applying this patch, my downstream RVV GCC can vectorize with -fdump-tree-vect:
>
> note: vect_is_simple_use: operand t_21 = PHI <_4(6), t_12(5)>, type of def: first order recurrence
>
> However, it ICE in "dce6" when removing PHI node "t_21 = PHI <_4(6), t_12(5)>":
> 0x143c174 crash_signal
>         ../../../riscv-gcc/gcc/toplev.cc:322
> 0x170d4fd delink_imm_use
>         ../../../riscv-gcc/gcc/ssa-iterators.h:257
> I was stuck by this issue. Besides, this patch has more 2 more things to do that I didn't implement:
>
> 1. insert VEC_PERM before the vector subtraction statement (Because I was stuck, I didn't continue
>    implementing this patch and miss this.)
> 2. Support this vectorization in SLP autovectorizaiton.
>
> To understand this patch, 2 functions are important:
>
> 1. vect_phi_first_order_recurrence_p, this function is used to forbid the cases that can not be vectorized
>    by this vectorizer. The constraints there are strictly the same as LLVM.
> 2. vectorizable_dep_phi, the implementation of first-order recurrence vectorizer.
>
> I hope someone can help me fix && finish && test && refine this patch.
> Thanks.
>
> ---
>  gcc/tree-vect-loop.cc  | 239 ++++++++++++++++++++++++++++++++++++++++-
>  gcc/tree-vect-stmts.cc |  12 ++-
>  gcc/tree-vectorizer.h  |   4 +
>  3 files changed, 252 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index 2536cc3cf49..adb48356c23 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -529,6 +529,57 @@ vect_inner_phi_in_double_reduction_p (loop_vec_info loop_vinfo, gphi *phi)
>    return false;
>  }
>  
> +/* Returns true if Phi is a first-order recurrence. A first-order
> +   recurrence is a non-reduction recurrence relation in which the value of
> +   the recurrence in the current loop iteration equals a value defined in
> +   the previous iteration.  */
> +
> +static bool
> +vect_phi_first_order_recurrence_p (loop_vec_info loop_vinfo, class loop *loop,
> +    gphi *phi)
> +{
> +  /* Ensure the phi node is in the loop header and has two incoming values.  */
> +  if (gimple_bb (phi) != loop->header || gimple_phi_num_args (phi) != 2)
> +    return false;
> +
> +  /* Ensure the loop has a preheader and a single latch block. The loop
> +     vectorizer will need the latch to set up the next iteration of the loop. */
> +  edge preheader = loop_preheader_edge (loop);
> +  edge latch = loop_latch_edge (loop);
> +  if (!preheader || !latch)
> +    return false;
> +
> +  /* Ensure the phi node's incoming blocks are the loop preheader and latch.  */
> +  if (!PHI_ARG_DEF_FROM_EDGE (phi, preheader)
> +      || !PHI_ARG_DEF_FROM_EDGE (phi, latch))
> +    return false;
> +
> +  /* Get the previous value. The previous value comes from the latch edge while
> +     the initial value comes form the preheader edge.  */
> +  gimple *previous = SSA_NAME_DEF_STMT (PHI_ARG_DEF_FROM_EDGE (phi, latch));
> +  if (!previous)
> +    return false;
> +
> +  /* Ensure every use_stmt of the phi node is dominated by the previous value.
> +     The dominance requirement ensures the loop vectorizer will not need to
> +     vectorize the initial value prior to the first iteration of the loop.  */
> +  gimple *use_stmt;
> +  imm_use_iterator imm_iter;
> +  FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, gimple_phi_result (phi))
> +    if (use_stmt)
> +      if (!dominated_by_p (CDI_DOMINATORS, gimple_bb (use_stmt),
> +    gimple_bb (previous)))
> + return false;
> +
> +  /* First-order recurrence autovectorization needs shuffle vector.  */
> +  tree scalar_type = TREE_TYPE (PHI_RESULT (phi));
> +  tree vectype = get_vectype_for_scalar_type (loop_vinfo, scalar_type);
> +  if (!can_vec_perm_var_p (TYPE_MODE (vectype)))
> +    return false;
> +
> +  return true;
> +}
> +
>  /* Function vect_analyze_scalar_cycles_1.
>  
>     Examine the cross iteration def-use cycles of scalar variables
> @@ -666,6 +717,8 @@ vect_analyze_scalar_cycles_1 (loop_vec_info loop_vinfo, class loop *loop,
>                  }
>              }
>          }
> +      else if (vect_phi_first_order_recurrence_p (loop_vinfo, loop, phi))
> + STMT_VINFO_DEF_TYPE (stmt_vinfo) = vect_first_order_recurrence;
>        else
>          if (dump_enabled_p ())
>            dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> @@ -1808,9 +1861,13 @@ vect_analyze_loop_operations (loop_vec_info loop_vinfo)
>  
>            gcc_assert (stmt_info);
>  
> +          if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_first_order_recurrence)
> +     ok = vectorizable_dep_phi (loop_vinfo, stmt_info, NULL, NULL);
> +
>            if ((STMT_VINFO_RELEVANT (stmt_info) == vect_used_in_scope
>                 || STMT_VINFO_LIVE_P (stmt_info))
> -              && STMT_VINFO_DEF_TYPE (stmt_info) != vect_induction_def)
> +              && STMT_VINFO_DEF_TYPE (stmt_info) != vect_induction_def
> +              && STMT_VINFO_DEF_TYPE (stmt_info) != vect_first_order_recurrence)
>      /* A scalar-dependence cycle that we don't support.  */
>      return opt_result::failure_at (phi,
>     "not vectorized:"
> @@ -8267,6 +8324,120 @@ vectorizable_phi (vec_info *,
>    return true;
>  }
>  
> +/* Vectorizes Dep PHIs.  */
> +
> +bool
> +vectorizable_dep_phi (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
> +       gimple **vec_stmt, slp_tree slp_node)
> +{
> +  if (!loop_vinfo || !is_a<gphi *> (stmt_info->stmt))
> +    return false;
> +
> +  gphi *phi = as_a<gphi *> (stmt_info->stmt);
> +
> +  /* So far we only support first-order recurrence auto-vectorization.  */
> +  if (STMT_VINFO_DEF_TYPE (stmt_info) != vect_first_order_recurrence)
> +    return false;
> +
> +  if (!vec_stmt) /* transformation not required.  */
> +    {
> +      /* So far we don't support first-order recurrence for SLP
> +       * auto-vectorization.   */
> +      if (slp_node)
> + {
> +   if (dump_enabled_p ())
> +     dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +      "do not support first-order recurrence"
> +      "autovectorization for SLP node\n");
> +   return false;
> + }
> +      STMT_VINFO_TYPE (stmt_info) = dep_phi_info_type;
> +      return true;
> +    }
> +
> +  /* This is the second phase of vectorizing first-order rececurrences. An
> +  overview of the transformation is described below. Suppose we have the
> +  following loop.
> +
> +    int32_t t = 0;
> +    for (int i = 0; i < n; ++i)
> +      {
> + b[i] = a[i] - t;
> + t = a[i];
> +      }
> +
> +  There is a first-order recurrence on "a". For this loop, the shorthand
> +  scalar IR looks like:
> +
> +    scalar.preheader:
> +      init = a[-1]
> +      br loop.body
> +
> +    scalar.body:
> +      i = PHI <0(scalar.preheader), i+1(scalar.body)>
> +      _2 = PHI <(init(scalar.preheader), <_1(scalar.body)>
> +      _1 = a[i]
> +      b[i] = _1 - _2
> +      br cond, scalar.body, ...
> +
> +  In this example, _2 is a recurrence because it's value depends on the
> +  previous iteration. In the first phase of vectorization, we created a
> +  temporary value for _2. We now complete the vectorization and produce the
> +  shorthand vector IR shown below (VF = 4).
> +
> +    vector.preheader:
> +      vect_init = vect_cst(..., ..., ..., a[-1])
> +      br vector.body
> +
> +    vector.body
> +      i = PHI <0(vector.preheader), i+4(vector.body)>
> +      vect_1 = PHI <vect_init(vector.preheader), v2(vector.body)>
> +      vect_2 = a[i, i+1, i+2, i+3];
> +      vect_3 = vector(vect_1(3), vect_2(0, 1, 2))
> +      b[i, i+1, i+2, i+3] = vect_2 - vect_3
> +      br cond, vector.body, middle.block
> +
> +    middle.block:
> +      x = vect_2(3)
> +      br scalar.preheader
> +
> +    scalar.ph:
> +      s_init = PHI <x(middle.block), a[-1], otherwise>
> +      br scalar.body
> +
> +  After execution completes the vector loop, we extract the next value of
> +  the recurrence (x) to use as the initial value in the scalar loop.  */
> +
> +  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> +  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
> +  tree scalar_dest = gimple_phi_result (stmt_info->stmt);
> +  basic_block bb = gimple_bb (stmt_info->stmt);
> +  tree vec_dest = vect_create_destination_var (scalar_dest, vectype);
> +  auto_vec<tree> vec_oprnds;
> +  tree preheader = PHI_ARG_DEF_FROM_EDGE (phi, loop_preheader_edge (loop));
> +  tree latch = PHI_ARG_DEF_FROM_EDGE (phi, loop_latch_edge (loop));
> +  gimple_seq stmts = NULL;
> +
> +  tree vec_init = build_vector_from_val (vectype, preheader);
> +  vec_init = vect_init_vector (loop_vinfo, stmt_info,
> +        vec_init, vectype, NULL);
> +  vect_get_vec_defs (loop_vinfo, stmt_info, slp_node,
> +      !slp_node ? vect_get_num_copies (loop_vinfo, vectype) : 1,
> +      gimple_phi_arg_def (stmt_info->stmt, 0), &vec_oprnds);
> +  /* Create the vectorized first-order PHI node.  */
> +  gphi *new_phi = create_phi_node (vec_dest, bb);
> +  add_phi_arg (new_phi, vec_oprnds[0], loop_latch_edge (loop),
> +        UNKNOWN_LOCATION);
> +  add_phi_arg (new_phi, vec_init, loop_preheader_edge (loop), UNKNOWN_LOCATION);
> +  if (slp_node)
> +    SLP_TREE_VEC_STMTS (slp_node).quick_push (new_phi);
> +  else
> +    STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_phi);
> +  if (!slp_node)
> +    *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
> +  return true;
> +}
> +
>  /* Return true if VECTYPE represents a vector that requires lowering
>     by the vector lowering pass.  */
>  
> @@ -10476,6 +10647,26 @@ update_epilogue_loop_vinfo (class loop *epilogue, tree advance)
>    epilogue_vinfo->shared->save_datarefs ();
>  }
>  
> +/* Return true if the stmt is the first statement that uses the result
> +   of a first-order recurrence phi node.  */
> +
> +static gphi *
> +vect_use_first_order_phi_result_p (auto_vec<gphi *> &first_order_phi_vec,
> +    gimple *stmt)
> +{
> +  for (unsigned int i = 0; i < first_order_phi_vec.length (); ++i)
> +    {
> +      imm_use_iterator imm;
> +      gphi *phi = first_order_phi_vec[i];
> +      tree phi_result = PHI_RESULT (phi);
> +      gimple *first_use = first_imm_use_stmt (&imm, phi_result);
> +      if (first_use && first_use == stmt)
> + return phi;
> +    }
> +
> +  return NULL;
> +}
> +
>  /* Function vect_transform_loop.
>  
>     The analysis phase has determined that the loop is vectorizable.
> @@ -10624,6 +10815,31 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
>        basic_block bb = bbs[i];
>        stmt_vec_info stmt_info;
>  
> +      /* It's used to save the first-order recurrence phi node.
> +         Consider the following case:
> +         # t_19 = PHI <_4(6), 0(5)>
> +         ...
> +         _4 = *_3;
> +         ...
> +         _6 = _4 - t_19;
> +         
> +         The vectorization sequence should be:
> +         1. Vectorize _4 = *_3;
> +         2. Vectorize # t_19 = PHI <_4(6), 0(5)>
> +         3, Vectorize _6 = _4 - t_19;
> +         
> +         Flow:
> +         1. So we save the first-order recurrence PHI in first_order_phi_vec
> +     and skip vectorizing it tentatively when iterating phi statement 
> +     (save t_19 = PHI <_4(6), 0(5)>).
> +         2. Vectorize all statements when iterating all statements
> +            until we reach the statement who is using the result of
> +            first-order recurrence PHI (vectorize _4 = *_3;).
> +         3. Stop iterating and return to vectorize the PHI saved
> +     in first_order_phi_vec (# t_19 = PHI <_4(6), 0(5)>).
> +         4. Conintue iterating and vectorize the residual statements.  */
> +      auto_vec<gphi *> first_order_phi_vec;
> +
>        for (gphi_iterator si = gsi_start_phis (bb); !gsi_end_p (si);
>     gsi_next (&si))
>  {
> @@ -10677,9 +10893,16 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
>         || STMT_VINFO_DEF_TYPE (stmt_info) == vect_reduction_def
>         || STMT_VINFO_DEF_TYPE (stmt_info) == vect_double_reduction_def
>         || STMT_VINFO_DEF_TYPE (stmt_info) == vect_nested_cycle
> -        || STMT_VINFO_DEF_TYPE (stmt_info) == vect_internal_def)
> +        || STMT_VINFO_DEF_TYPE (stmt_info) == vect_internal_def
> +        || STMT_VINFO_DEF_TYPE (stmt_info) == vect_first_order_recurrence)
>        && ! PURE_SLP_STMT (stmt_info))
>      maybe_set_vectorized_backedge_value (loop_vinfo, stmt_info);
> +
> +   if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_first_order_recurrence)
> +     {
> +       first_order_phi_vec.safe_push (phi);
> +       continue;
> +     }
>  }
>  
>        for (gimple_stmt_iterator si = gsi_start_bb (bb);
> @@ -10698,6 +10921,18 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
>        /* Ignore vector stmts created in the outer loop.  */
>        stmt_info = loop_vinfo->lookup_stmt (stmt);
>  
> +       gphi *first_order_phi
> + = vect_use_first_order_phi_result_p (first_order_phi_vec, stmt);
> +       if (first_order_phi)
> + {
> +   stmt_vec_info first_order_stmt_info
> +     = loop_vinfo->lookup_stmt (first_order_phi);
> +   gimple_stmt_iterator first_order_si
> +     = gsi_for_stmt (first_order_phi);
> +   vect_transform_loop_stmt (loop_vinfo, first_order_stmt_info,
> +     &first_order_si, NULL);
> + }
> +
>        /* vector stmts created in the outer-loop during vectorization of
>  stmts in an inner-loop may not have a stmt_info, and do not
>  need to be vectorized.  */
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index c8d1efc45e5..a8c0a8a4151 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -11404,6 +11404,12 @@ vect_transform_stmt (vec_info *vinfo,
>        gcc_assert (done);
>        break;
>  
> +    case dep_phi_info_type:
> +      done = vectorizable_dep_phi (as_a <loop_vec_info> (vinfo),
> +   stmt_info, &vec_stmt, slp_node);
> +      gcc_assert (done);
> +      break;
> +
>      case phi_info_type:
>        done = vectorizable_phi (vinfo, stmt_info, &vec_stmt, slp_node, NULL);
>        gcc_assert (done);
> @@ -11804,6 +11810,9 @@ vect_is_simple_use (tree operand, vec_info *vinfo, enum vect_def_type *dt,
>  case vect_nested_cycle:
>    dump_printf (MSG_NOTE, "nested cycle\n");
>    break;
> + case vect_first_order_recurrence:
> +   dump_printf (MSG_NOTE, "first order recurrence\n");
> +   break;
>  case vect_unknown_def_type:
>    dump_printf (MSG_NOTE, "unknown\n");
>    break;
> @@ -11852,7 +11861,8 @@ vect_is_simple_use (tree operand, vec_info *vinfo, enum vect_def_type *dt,
>        || *dt == vect_induction_def
>        || *dt == vect_reduction_def
>        || *dt == vect_double_reduction_def
> -      || *dt == vect_nested_cycle)
> +      || *dt == vect_nested_cycle
> + || *dt == vect_first_order_recurrence)
>      {
>        *vectype = STMT_VINFO_VECTYPE (def_stmt_info);
>        gcc_assert (*vectype != NULL_TREE);
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index 4870c754499..2c8e7660b4e 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -65,6 +65,7 @@ enum vect_def_type {
>    vect_reduction_def,
>    vect_double_reduction_def,
>    vect_nested_cycle,
> +  vect_first_order_recurrence,
>    vect_unknown_def_type
>  };
>  
> @@ -1027,6 +1028,7 @@ enum stmt_vec_info_type {
>    cycle_phi_info_type,
>    lc_phi_info_type,
>    phi_info_type,
> +  dep_phi_info_type,
>    loop_exit_ctrl_vec_info_type
>  };
>  
> @@ -2331,6 +2333,8 @@ extern bool vectorizable_lc_phi (loop_vec_info, stmt_vec_info,
>  gimple **, slp_tree);
>  extern bool vectorizable_phi (vec_info *, stmt_vec_info, gimple **, slp_tree,
>        stmt_vector_for_cost *);
> +extern bool vectorizable_dep_phi (loop_vec_info, stmt_vec_info,
> + gimple **, slp_tree);
>  extern bool vect_emulated_vector_p (tree);
>  extern bool vect_can_vectorize_without_simd_p (tree_code);
>  extern bool vect_can_vectorize_without_simd_p (code_helper);
  
Richard Biener Sept. 30, 2022, 6:08 a.m. UTC | #4
On Thu, Sep 29, 2022 at 6:54 PM Richard Sandiford via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> Thanks for posting the patch.
>
> juzhe.zhong@rivai.ai writes:
> > From: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>
> >
> > gcc/ChangeLog:
> >
> >         * tree-vect-loop.cc (vect_phi_first_order_recurrence_p): New function.
> >         (vect_analyze_scalar_cycles_1): Classify first-order recurrence phi.
> >         (vect_analyze_loop_operations): Add first-order recurrence autovectorization support.
> >         (vectorizable_dep_phi): New function.
> >         (vect_use_first_order_phi_result_p): New function.
> >         (vect_transform_loop): Add first-order recurrence autovectorization support.
> >         * tree-vect-stmts.cc (vect_transform_stmt): Ditto.
> >         (vect_is_simple_use): Ditto.
> >         * tree-vectorizer.h (enum vect_def_type): New enum.
> >         (enum stmt_vec_info_type): Ditto.
> >         (vectorizable_dep_phi): New function.
> >
> > Hi, since Richard said I can post unfinished for help, I post it.
> > This patch is for fix issue:https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99409.
> > LLVM can vectorize this case using first-order recurrence loop-vectorizer.
> > This patch is inspired by first-order recurrence autovectorization support in LLVM:
> > https://reviews.llvm.org/D16197
> > There is a link that I can show you several cases that GCC fails vectorization
> > because no support of firs-order recurrence vectorization: https://godbolt.org/z/nzf1Wrd6T
> >
> > Let's consider a simple case that I simplify:
> > void foo (int32_t * __restrict__ a, int32_t * __restrict__ b, int32_t * __restrict__ c, int n)
> > {
> >   int32_t t = *c;
> >   for (int i = 0; i < n; ++i)
> >     {
> >       b[i] = a[i] - t;
> >       t = a[i];
> >     }
> > }
>
> One thing that I wondered about the LLVM implementation is:
> does reusing the loaded value really pay for itself?  E.g. for
> the un-predictive-commoned version:
>
> void foo (int32_t * __restrict__ a, int32_t * __restrict__ b, int32_t * __restr\
> ict__ c, int n)
> {
>   b[0] = a[0] - *c;
>   for (int i = 1; i < n; ++i)
>     b[i] = a[i] - a[i - 1];
> }
>
> GCC generates:
>
> L4:
>         ldr     q0, [x6, x2]
>         ldr     q1, [x0, x2]
>         sub     v0.4s, v0.4s, v1.4s
>         str     q0, [x5, x2]
>         add     x2, x2, 16
>         cmp     x2, x4
>         bne     .L4
>
> whereas LLVM (with -fno-unroll-loops) generates:
>
> .LBB0_4:                                // %vector.body
>         mov     v1.16b, v0.16b
>         subs    x15, x15, #4
>         ldr     q0, [x13], #16
>         ext     v1.16b, v1.16b, v0.16b, #12
>         sub     v1.4s, v0.4s, v1.4s
>         str     q1, [x14], #16
>         b.ne    .LBB0_4
>
> Introducing the loop-carried dependency (via the ext) limits the
> throughput of the loop to the latency of a permutation.
>
> But I guess which approach is better depends on the amount of work
> that is repeated by GCC's approach.  For a single load it's probably
> better to repeat the work, but for something more complicated the
> general recurrence approach probably wins out.
>
> So perhaps we should first handle the general case (as for your patch)
> and then, as a potential later follow-on patch, optimise the cases where
> the loop-carried dependency is harmful?

Note we for example avoid introducing loop-carried dependences in PRE
when vectorizing, but we currently cannot do anything about this when it
already exists.  That's what the patch fixes and I think that's good.  Another
strathegy also with larger order recurrences is to privatize them to dynamically
allocated storage since in the general case it's hard to "undo" predcom, esp.
when the loop entry values got optimized to constants (though that
mostly happens
for "testcases").

Richard.

> This is all hand-wavy speculation, in case it wasn't obvious :-)
>
> Thanks,
> Richard
>
> > Applying this patch, my downstream RVV GCC can vectorize with -fdump-tree-vect:
> >
> > note: vect_is_simple_use: operand t_21 = PHI <_4(6), t_12(5)>, type of def: first order recurrence
> >
> > However, it ICE in "dce6" when removing PHI node "t_21 = PHI <_4(6), t_12(5)>":
> > 0x143c174 crash_signal
> >         ../../../riscv-gcc/gcc/toplev.cc:322
> > 0x170d4fd delink_imm_use
> >         ../../../riscv-gcc/gcc/ssa-iterators.h:257
> > I was stuck by this issue. Besides, this patch has more 2 more things to do that I didn't implement:
> >
> > 1. insert VEC_PERM before the vector subtraction statement (Because I was stuck, I didn't continue
> >    implementing this patch and miss this.)
> > 2. Support this vectorization in SLP autovectorizaiton.
> >
> > To understand this patch, 2 functions are important:
> >
> > 1. vect_phi_first_order_recurrence_p, this function is used to forbid the cases that can not be vectorized
> >    by this vectorizer. The constraints there are strictly the same as LLVM.
> > 2. vectorizable_dep_phi, the implementation of first-order recurrence vectorizer.
> >
> > I hope someone can help me fix && finish && test && refine this patch.
> > Thanks.
> >
> > ---
> >  gcc/tree-vect-loop.cc  | 239 ++++++++++++++++++++++++++++++++++++++++-
> >  gcc/tree-vect-stmts.cc |  12 ++-
> >  gcc/tree-vectorizer.h  |   4 +
> >  3 files changed, 252 insertions(+), 3 deletions(-)
> >
> > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > index 2536cc3cf49..adb48356c23 100644
> > --- a/gcc/tree-vect-loop.cc
> > +++ b/gcc/tree-vect-loop.cc
> > @@ -529,6 +529,57 @@ vect_inner_phi_in_double_reduction_p (loop_vec_info loop_vinfo, gphi *phi)
> >    return false;
> >  }
> >
> > +/* Returns true if Phi is a first-order recurrence. A first-order
> > +   recurrence is a non-reduction recurrence relation in which the value of
> > +   the recurrence in the current loop iteration equals a value defined in
> > +   the previous iteration.  */
> > +
> > +static bool
> > +vect_phi_first_order_recurrence_p (loop_vec_info loop_vinfo, class loop *loop,
> > +                                gphi *phi)
> > +{
> > +  /* Ensure the phi node is in the loop header and has two incoming values.  */
> > +  if (gimple_bb (phi) != loop->header || gimple_phi_num_args (phi) != 2)
> > +    return false;
> > +
> > +  /* Ensure the loop has a preheader and a single latch block. The loop
> > +     vectorizer will need the latch to set up the next iteration of the loop. */
> > +  edge preheader = loop_preheader_edge (loop);
> > +  edge latch = loop_latch_edge (loop);
> > +  if (!preheader || !latch)
> > +    return false;
> > +
> > +  /* Ensure the phi node's incoming blocks are the loop preheader and latch.  */
> > +  if (!PHI_ARG_DEF_FROM_EDGE (phi, preheader)
> > +      || !PHI_ARG_DEF_FROM_EDGE (phi, latch))
> > +    return false;
> > +
> > +  /* Get the previous value. The previous value comes from the latch edge while
> > +     the initial value comes form the preheader edge.  */
> > +  gimple *previous = SSA_NAME_DEF_STMT (PHI_ARG_DEF_FROM_EDGE (phi, latch));
> > +  if (!previous)
> > +    return false;
> > +
> > +  /* Ensure every use_stmt of the phi node is dominated by the previous value.
> > +     The dominance requirement ensures the loop vectorizer will not need to
> > +     vectorize the initial value prior to the first iteration of the loop.  */
> > +  gimple *use_stmt;
> > +  imm_use_iterator imm_iter;
> > +  FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, gimple_phi_result (phi))
> > +    if (use_stmt)
> > +      if (!dominated_by_p (CDI_DOMINATORS, gimple_bb (use_stmt),
> > +                        gimple_bb (previous)))
> > +     return false;
> > +
> > +  /* First-order recurrence autovectorization needs shuffle vector.  */
> > +  tree scalar_type = TREE_TYPE (PHI_RESULT (phi));
> > +  tree vectype = get_vectype_for_scalar_type (loop_vinfo, scalar_type);
> > +  if (!can_vec_perm_var_p (TYPE_MODE (vectype)))
> > +    return false;
> > +
> > +  return true;
> > +}
> > +
> >  /* Function vect_analyze_scalar_cycles_1.
> >
> >     Examine the cross iteration def-use cycles of scalar variables
> > @@ -666,6 +717,8 @@ vect_analyze_scalar_cycles_1 (loop_vec_info loop_vinfo, class loop *loop,
> >                  }
> >              }
> >          }
> > +      else if (vect_phi_first_order_recurrence_p (loop_vinfo, loop, phi))
> > +     STMT_VINFO_DEF_TYPE (stmt_vinfo) = vect_first_order_recurrence;
> >        else
> >          if (dump_enabled_p ())
> >            dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > @@ -1808,9 +1861,13 @@ vect_analyze_loop_operations (loop_vec_info loop_vinfo)
> >
> >            gcc_assert (stmt_info);
> >
> > +          if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_first_order_recurrence)
> > +         ok = vectorizable_dep_phi (loop_vinfo, stmt_info, NULL, NULL);
> > +
> >            if ((STMT_VINFO_RELEVANT (stmt_info) == vect_used_in_scope
> >                 || STMT_VINFO_LIVE_P (stmt_info))
> > -              && STMT_VINFO_DEF_TYPE (stmt_info) != vect_induction_def)
> > +              && STMT_VINFO_DEF_TYPE (stmt_info) != vect_induction_def
> > +              && STMT_VINFO_DEF_TYPE (stmt_info) != vect_first_order_recurrence)
> >           /* A scalar-dependence cycle that we don't support.  */
> >           return opt_result::failure_at (phi,
> >                                          "not vectorized:"
> > @@ -8267,6 +8324,120 @@ vectorizable_phi (vec_info *,
> >    return true;
> >  }
> >
> > +/* Vectorizes Dep PHIs.  */
> > +
> > +bool
> > +vectorizable_dep_phi (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
> > +                   gimple **vec_stmt, slp_tree slp_node)
> > +{
> > +  if (!loop_vinfo || !is_a<gphi *> (stmt_info->stmt))
> > +    return false;
> > +
> > +  gphi *phi = as_a<gphi *> (stmt_info->stmt);
> > +
> > +  /* So far we only support first-order recurrence auto-vectorization.  */
> > +  if (STMT_VINFO_DEF_TYPE (stmt_info) != vect_first_order_recurrence)
> > +    return false;
> > +
> > +  if (!vec_stmt) /* transformation not required.  */
> > +    {
> > +      /* So far we don't support first-order recurrence for SLP
> > +       * auto-vectorization.   */
> > +      if (slp_node)
> > +     {
> > +       if (dump_enabled_p ())
> > +         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > +                          "do not support first-order recurrence"
> > +                          "autovectorization for SLP node\n");
> > +       return false;
> > +     }
> > +      STMT_VINFO_TYPE (stmt_info) = dep_phi_info_type;
> > +      return true;
> > +    }
> > +
> > +  /* This is the second phase of vectorizing first-order rececurrences. An
> > +  overview of the transformation is described below. Suppose we have the
> > +  following loop.
> > +
> > +    int32_t t = 0;
> > +    for (int i = 0; i < n; ++i)
> > +      {
> > +     b[i] = a[i] - t;
> > +     t = a[i];
> > +      }
> > +
> > +  There is a first-order recurrence on "a". For this loop, the shorthand
> > +  scalar IR looks like:
> > +
> > +    scalar.preheader:
> > +      init = a[-1]
> > +      br loop.body
> > +
> > +    scalar.body:
> > +      i = PHI <0(scalar.preheader), i+1(scalar.body)>
> > +      _2 = PHI <(init(scalar.preheader), <_1(scalar.body)>
> > +      _1 = a[i]
> > +      b[i] = _1 - _2
> > +      br cond, scalar.body, ...
> > +
> > +  In this example, _2 is a recurrence because it's value depends on the
> > +  previous iteration. In the first phase of vectorization, we created a
> > +  temporary value for _2. We now complete the vectorization and produce the
> > +  shorthand vector IR shown below (VF = 4).
> > +
> > +    vector.preheader:
> > +      vect_init = vect_cst(..., ..., ..., a[-1])
> > +      br vector.body
> > +
> > +    vector.body
> > +      i = PHI <0(vector.preheader), i+4(vector.body)>
> > +      vect_1 = PHI <vect_init(vector.preheader), v2(vector.body)>
> > +      vect_2 = a[i, i+1, i+2, i+3];
> > +      vect_3 = vector(vect_1(3), vect_2(0, 1, 2))
> > +      b[i, i+1, i+2, i+3] = vect_2 - vect_3
> > +      br cond, vector.body, middle.block
> > +
> > +    middle.block:
> > +      x = vect_2(3)
> > +      br scalar.preheader
> > +
> > +    scalar.ph:
> > +      s_init = PHI <x(middle.block), a[-1], otherwise>
> > +      br scalar.body
> > +
> > +  After execution completes the vector loop, we extract the next value of
> > +  the recurrence (x) to use as the initial value in the scalar loop.  */
> > +
> > +  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> > +  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
> > +  tree scalar_dest = gimple_phi_result (stmt_info->stmt);
> > +  basic_block bb = gimple_bb (stmt_info->stmt);
> > +  tree vec_dest = vect_create_destination_var (scalar_dest, vectype);
> > +  auto_vec<tree> vec_oprnds;
> > +  tree preheader = PHI_ARG_DEF_FROM_EDGE (phi, loop_preheader_edge (loop));
> > +  tree latch = PHI_ARG_DEF_FROM_EDGE (phi, loop_latch_edge (loop));
> > +  gimple_seq stmts = NULL;
> > +
> > +  tree vec_init = build_vector_from_val (vectype, preheader);
> > +  vec_init = vect_init_vector (loop_vinfo, stmt_info,
> > +                            vec_init, vectype, NULL);
> > +  vect_get_vec_defs (loop_vinfo, stmt_info, slp_node,
> > +                  !slp_node ? vect_get_num_copies (loop_vinfo, vectype) : 1,
> > +                  gimple_phi_arg_def (stmt_info->stmt, 0), &vec_oprnds);
> > +  /* Create the vectorized first-order PHI node.  */
> > +  gphi *new_phi = create_phi_node (vec_dest, bb);
> > +  add_phi_arg (new_phi, vec_oprnds[0], loop_latch_edge (loop),
> > +            UNKNOWN_LOCATION);
> > +  add_phi_arg (new_phi, vec_init, loop_preheader_edge (loop), UNKNOWN_LOCATION);
> > +  if (slp_node)
> > +    SLP_TREE_VEC_STMTS (slp_node).quick_push (new_phi);
> > +  else
> > +    STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_phi);
> > +  if (!slp_node)
> > +    *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
> > +  return true;
> > +}
> > +
> >  /* Return true if VECTYPE represents a vector that requires lowering
> >     by the vector lowering pass.  */
> >
> > @@ -10476,6 +10647,26 @@ update_epilogue_loop_vinfo (class loop *epilogue, tree advance)
> >    epilogue_vinfo->shared->save_datarefs ();
> >  }
> >
> > +/* Return true if the stmt is the first statement that uses the result
> > +   of a first-order recurrence phi node.  */
> > +
> > +static gphi *
> > +vect_use_first_order_phi_result_p (auto_vec<gphi *> &first_order_phi_vec,
> > +                                gimple *stmt)
> > +{
> > +  for (unsigned int i = 0; i < first_order_phi_vec.length (); ++i)
> > +    {
> > +      imm_use_iterator imm;
> > +      gphi *phi = first_order_phi_vec[i];
> > +      tree phi_result = PHI_RESULT (phi);
> > +      gimple *first_use = first_imm_use_stmt (&imm, phi_result);
> > +      if (first_use && first_use == stmt)
> > +     return phi;
> > +    }
> > +
> > +  return NULL;
> > +}
> > +
> >  /* Function vect_transform_loop.
> >
> >     The analysis phase has determined that the loop is vectorizable.
> > @@ -10624,6 +10815,31 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
> >        basic_block bb = bbs[i];
> >        stmt_vec_info stmt_info;
> >
> > +      /* It's used to save the first-order recurrence phi node.
> > +         Consider the following case:
> > +         # t_19 = PHI <_4(6), 0(5)>
> > +         ...
> > +         _4 = *_3;
> > +         ...
> > +         _6 = _4 - t_19;
> > +
> > +         The vectorization sequence should be:
> > +         1. Vectorize _4 = *_3;
> > +         2. Vectorize # t_19 = PHI <_4(6), 0(5)>
> > +         3, Vectorize _6 = _4 - t_19;
> > +
> > +         Flow:
> > +         1. So we save the first-order recurrence PHI in first_order_phi_vec
> > +         and skip vectorizing it tentatively when iterating phi statement
> > +         (save t_19 = PHI <_4(6), 0(5)>).
> > +         2. Vectorize all statements when iterating all statements
> > +            until we reach the statement who is using the result of
> > +            first-order recurrence PHI (vectorize _4 = *_3;).
> > +         3. Stop iterating and return to vectorize the PHI saved
> > +         in first_order_phi_vec (# t_19 = PHI <_4(6), 0(5)>).
> > +         4. Conintue iterating and vectorize the residual statements.  */
> > +      auto_vec<gphi *> first_order_phi_vec;
> > +
> >        for (gphi_iterator si = gsi_start_phis (bb); !gsi_end_p (si);
> >          gsi_next (&si))
> >       {
> > @@ -10677,9 +10893,16 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
> >              || STMT_VINFO_DEF_TYPE (stmt_info) == vect_reduction_def
> >              || STMT_VINFO_DEF_TYPE (stmt_info) == vect_double_reduction_def
> >              || STMT_VINFO_DEF_TYPE (stmt_info) == vect_nested_cycle
> > -            || STMT_VINFO_DEF_TYPE (stmt_info) == vect_internal_def)
> > +            || STMT_VINFO_DEF_TYPE (stmt_info) == vect_internal_def
> > +            || STMT_VINFO_DEF_TYPE (stmt_info) == vect_first_order_recurrence)
> >             && ! PURE_SLP_STMT (stmt_info))
> >           maybe_set_vectorized_backedge_value (loop_vinfo, stmt_info);
> > +
> > +       if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_first_order_recurrence)
> > +         {
> > +           first_order_phi_vec.safe_push (phi);
> > +           continue;
> > +         }
> >       }
> >
> >        for (gimple_stmt_iterator si = gsi_start_bb (bb);
> > @@ -10698,6 +10921,18 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
> >             /* Ignore vector stmts created in the outer loop.  */
> >             stmt_info = loop_vinfo->lookup_stmt (stmt);
> >
> > +           gphi *first_order_phi
> > +             = vect_use_first_order_phi_result_p (first_order_phi_vec, stmt);
> > +           if (first_order_phi)
> > +             {
> > +               stmt_vec_info first_order_stmt_info
> > +                 = loop_vinfo->lookup_stmt (first_order_phi);
> > +               gimple_stmt_iterator first_order_si
> > +                 = gsi_for_stmt (first_order_phi);
> > +               vect_transform_loop_stmt (loop_vinfo, first_order_stmt_info,
> > +                                         &first_order_si, NULL);
> > +             }
> > +
> >             /* vector stmts created in the outer-loop during vectorization of
> >                stmts in an inner-loop may not have a stmt_info, and do not
> >                need to be vectorized.  */
> > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > index c8d1efc45e5..a8c0a8a4151 100644
> > --- a/gcc/tree-vect-stmts.cc
> > +++ b/gcc/tree-vect-stmts.cc
> > @@ -11404,6 +11404,12 @@ vect_transform_stmt (vec_info *vinfo,
> >        gcc_assert (done);
> >        break;
> >
> > +    case dep_phi_info_type:
> > +      done = vectorizable_dep_phi (as_a <loop_vec_info> (vinfo),
> > +                               stmt_info, &vec_stmt, slp_node);
> > +      gcc_assert (done);
> > +      break;
> > +
> >      case phi_info_type:
> >        done = vectorizable_phi (vinfo, stmt_info, &vec_stmt, slp_node, NULL);
> >        gcc_assert (done);
> > @@ -11804,6 +11810,9 @@ vect_is_simple_use (tree operand, vec_info *vinfo, enum vect_def_type *dt,
> >       case vect_nested_cycle:
> >         dump_printf (MSG_NOTE, "nested cycle\n");
> >         break;
> > +     case vect_first_order_recurrence:
> > +       dump_printf (MSG_NOTE, "first order recurrence\n");
> > +       break;
> >       case vect_unknown_def_type:
> >         dump_printf (MSG_NOTE, "unknown\n");
> >         break;
> > @@ -11852,7 +11861,8 @@ vect_is_simple_use (tree operand, vec_info *vinfo, enum vect_def_type *dt,
> >        || *dt == vect_induction_def
> >        || *dt == vect_reduction_def
> >        || *dt == vect_double_reduction_def
> > -      || *dt == vect_nested_cycle)
> > +      || *dt == vect_nested_cycle
> > +                     || *dt == vect_first_order_recurrence)
> >      {
> >        *vectype = STMT_VINFO_VECTYPE (def_stmt_info);
> >        gcc_assert (*vectype != NULL_TREE);
> > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> > index 4870c754499..2c8e7660b4e 100644
> > --- a/gcc/tree-vectorizer.h
> > +++ b/gcc/tree-vectorizer.h
> > @@ -65,6 +65,7 @@ enum vect_def_type {
> >    vect_reduction_def,
> >    vect_double_reduction_def,
> >    vect_nested_cycle,
> > +  vect_first_order_recurrence,
> >    vect_unknown_def_type
> >  };
> >
> > @@ -1027,6 +1028,7 @@ enum stmt_vec_info_type {
> >    cycle_phi_info_type,
> >    lc_phi_info_type,
> >    phi_info_type,
> > +  dep_phi_info_type,
> >    loop_exit_ctrl_vec_info_type
> >  };
> >
> > @@ -2331,6 +2333,8 @@ extern bool vectorizable_lc_phi (loop_vec_info, stmt_vec_info,
> >                                gimple **, slp_tree);
> >  extern bool vectorizable_phi (vec_info *, stmt_vec_info, gimple **, slp_tree,
> >                             stmt_vector_for_cost *);
> > +extern bool vectorizable_dep_phi (loop_vec_info, stmt_vec_info,
> > +                              gimple **, slp_tree);
> >  extern bool vect_emulated_vector_p (tree);
> >  extern bool vect_can_vectorize_without_simd_p (tree_code);
> >  extern bool vect_can_vectorize_without_simd_p (code_helper);
  

Patch

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 2536cc3cf49..adb48356c23 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -529,6 +529,57 @@  vect_inner_phi_in_double_reduction_p (loop_vec_info loop_vinfo, gphi *phi)
   return false;
 }
 
+/* Returns true if Phi is a first-order recurrence. A first-order
+   recurrence is a non-reduction recurrence relation in which the value of
+   the recurrence in the current loop iteration equals a value defined in
+   the previous iteration.  */
+
+static bool
+vect_phi_first_order_recurrence_p (loop_vec_info loop_vinfo, class loop *loop,
+				   gphi *phi)
+{
+  /* Ensure the phi node is in the loop header and has two incoming values.  */
+  if (gimple_bb (phi) != loop->header || gimple_phi_num_args (phi) != 2)
+    return false;
+
+  /* Ensure the loop has a preheader and a single latch block. The loop
+     vectorizer will need the latch to set up the next iteration of the loop. */
+  edge preheader = loop_preheader_edge (loop);
+  edge latch = loop_latch_edge (loop);
+  if (!preheader || !latch)
+    return false;
+
+  /* Ensure the phi node's incoming blocks are the loop preheader and latch.  */
+  if (!PHI_ARG_DEF_FROM_EDGE (phi, preheader)
+      || !PHI_ARG_DEF_FROM_EDGE (phi, latch))
+    return false;
+
+  /* Get the previous value. The previous value comes from the latch edge while
+     the initial value comes form the preheader edge.  */
+  gimple *previous = SSA_NAME_DEF_STMT (PHI_ARG_DEF_FROM_EDGE (phi, latch));
+  if (!previous)
+    return false;
+
+  /* Ensure every use_stmt of the phi node is dominated by the previous value.
+     The dominance requirement ensures the loop vectorizer will not need to
+     vectorize the initial value prior to the first iteration of the loop.  */
+  gimple *use_stmt;
+  imm_use_iterator imm_iter;
+  FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, gimple_phi_result (phi))
+    if (use_stmt)
+      if (!dominated_by_p (CDI_DOMINATORS, gimple_bb (use_stmt),
+			   gimple_bb (previous)))
+	return false;
+
+  /* First-order recurrence autovectorization needs shuffle vector.  */
+  tree scalar_type = TREE_TYPE (PHI_RESULT (phi));
+  tree vectype = get_vectype_for_scalar_type (loop_vinfo, scalar_type);
+  if (!can_vec_perm_var_p (TYPE_MODE (vectype)))
+    return false;
+
+  return true;
+}
+
 /* Function vect_analyze_scalar_cycles_1.
 
    Examine the cross iteration def-use cycles of scalar variables
@@ -666,6 +717,8 @@  vect_analyze_scalar_cycles_1 (loop_vec_info loop_vinfo, class loop *loop,
                 }
             }
         }
+      else if (vect_phi_first_order_recurrence_p (loop_vinfo, loop, phi))
+	STMT_VINFO_DEF_TYPE (stmt_vinfo) = vect_first_order_recurrence;
       else
         if (dump_enabled_p ())
           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
@@ -1808,9 +1861,13 @@  vect_analyze_loop_operations (loop_vec_info loop_vinfo)
 
           gcc_assert (stmt_info);
 
+          if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_first_order_recurrence)
+	    ok = vectorizable_dep_phi (loop_vinfo, stmt_info, NULL, NULL);
+
           if ((STMT_VINFO_RELEVANT (stmt_info) == vect_used_in_scope
                || STMT_VINFO_LIVE_P (stmt_info))
-              && STMT_VINFO_DEF_TYPE (stmt_info) != vect_induction_def)
+              && STMT_VINFO_DEF_TYPE (stmt_info) != vect_induction_def
+              && STMT_VINFO_DEF_TYPE (stmt_info) != vect_first_order_recurrence)
 	    /* A scalar-dependence cycle that we don't support.  */
 	    return opt_result::failure_at (phi,
 					   "not vectorized:"
@@ -8267,6 +8324,120 @@  vectorizable_phi (vec_info *,
   return true;
 }
 
+/* Vectorizes Dep PHIs.  */
+
+bool
+vectorizable_dep_phi (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
+		      gimple **vec_stmt, slp_tree slp_node)
+{
+  if (!loop_vinfo || !is_a<gphi *> (stmt_info->stmt))
+    return false;
+
+  gphi *phi = as_a<gphi *> (stmt_info->stmt);
+
+  /* So far we only support first-order recurrence auto-vectorization.  */
+  if (STMT_VINFO_DEF_TYPE (stmt_info) != vect_first_order_recurrence)
+    return false;
+
+  if (!vec_stmt) /* transformation not required.  */
+    {
+      /* So far we don't support first-order recurrence for SLP
+       * auto-vectorization.   */
+      if (slp_node)
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "do not support first-order recurrence"
+			     "autovectorization for SLP node\n");
+	  return false;
+	}
+      STMT_VINFO_TYPE (stmt_info) = dep_phi_info_type;
+      return true;
+    }
+
+  /* This is the second phase of vectorizing first-order rececurrences. An
+  overview of the transformation is described below. Suppose we have the
+  following loop.
+
+    int32_t t = 0;
+    for (int i = 0; i < n; ++i)
+      {
+	b[i] = a[i] - t;
+	t = a[i];
+      }
+
+  There is a first-order recurrence on "a". For this loop, the shorthand
+  scalar IR looks like:
+
+    scalar.preheader:
+      init = a[-1]
+      br loop.body
+
+    scalar.body:
+      i = PHI <0(scalar.preheader), i+1(scalar.body)>
+      _2 = PHI <(init(scalar.preheader), <_1(scalar.body)>
+      _1 = a[i]
+      b[i] = _1 - _2
+      br cond, scalar.body, ...
+
+  In this example, _2 is a recurrence because it's value depends on the
+  previous iteration. In the first phase of vectorization, we created a
+  temporary value for _2. We now complete the vectorization and produce the
+  shorthand vector IR shown below (VF = 4).
+
+    vector.preheader:
+      vect_init = vect_cst(..., ..., ..., a[-1])
+      br vector.body
+
+    vector.body
+      i = PHI <0(vector.preheader), i+4(vector.body)>
+      vect_1 = PHI <vect_init(vector.preheader), v2(vector.body)>
+      vect_2 = a[i, i+1, i+2, i+3];
+      vect_3 = vector(vect_1(3), vect_2(0, 1, 2))
+      b[i, i+1, i+2, i+3] = vect_2 - vect_3
+      br cond, vector.body, middle.block
+
+    middle.block:
+      x = vect_2(3)
+      br scalar.preheader
+
+    scalar.ph:
+      s_init = PHI <x(middle.block), a[-1], otherwise>
+      br scalar.body
+
+  After execution completes the vector loop, we extract the next value of
+  the recurrence (x) to use as the initial value in the scalar loop.  */
+
+  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
+  tree scalar_dest = gimple_phi_result (stmt_info->stmt);
+  basic_block bb = gimple_bb (stmt_info->stmt);
+  tree vec_dest = vect_create_destination_var (scalar_dest, vectype);
+  auto_vec<tree> vec_oprnds;
+  tree preheader = PHI_ARG_DEF_FROM_EDGE (phi, loop_preheader_edge (loop));
+  tree latch = PHI_ARG_DEF_FROM_EDGE (phi, loop_latch_edge (loop));
+  gimple_seq stmts = NULL;
+
+  tree vec_init = build_vector_from_val (vectype, preheader);
+  vec_init = vect_init_vector (loop_vinfo, stmt_info,
+			       vec_init, vectype, NULL);
+  vect_get_vec_defs (loop_vinfo, stmt_info, slp_node,
+		     !slp_node ? vect_get_num_copies (loop_vinfo, vectype) : 1,
+		     gimple_phi_arg_def (stmt_info->stmt, 0), &vec_oprnds);
+  /* Create the vectorized first-order PHI node.  */
+  gphi *new_phi = create_phi_node (vec_dest, bb);
+  add_phi_arg (new_phi, vec_oprnds[0], loop_latch_edge (loop),
+	       UNKNOWN_LOCATION);
+  add_phi_arg (new_phi, vec_init, loop_preheader_edge (loop), UNKNOWN_LOCATION);
+  if (slp_node)
+    SLP_TREE_VEC_STMTS (slp_node).quick_push (new_phi);
+  else
+    STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_phi);
+  if (!slp_node)
+    *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
+  return true;
+}
+
 /* Return true if VECTYPE represents a vector that requires lowering
    by the vector lowering pass.  */
 
@@ -10476,6 +10647,26 @@  update_epilogue_loop_vinfo (class loop *epilogue, tree advance)
   epilogue_vinfo->shared->save_datarefs ();
 }
 
+/* Return true if the stmt is the first statement that uses the result
+   of a first-order recurrence phi node.  */
+
+static gphi *
+vect_use_first_order_phi_result_p (auto_vec<gphi *> &first_order_phi_vec,
+				   gimple *stmt)
+{
+  for (unsigned int i = 0; i < first_order_phi_vec.length (); ++i)
+    {
+      imm_use_iterator imm;
+      gphi *phi = first_order_phi_vec[i];
+      tree phi_result = PHI_RESULT (phi);
+      gimple *first_use = first_imm_use_stmt (&imm, phi_result);
+      if (first_use && first_use == stmt)
+	return phi;
+    }
+
+  return NULL;
+}
+
 /* Function vect_transform_loop.
 
    The analysis phase has determined that the loop is vectorizable.
@@ -10624,6 +10815,31 @@  vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
       basic_block bb = bbs[i];
       stmt_vec_info stmt_info;
 
+      /* It's used to save the first-order recurrence phi node.
+         Consider the following case:
+         # t_19 = PHI <_4(6), 0(5)>
+         ...
+         _4 = *_3;
+         ...
+         _6 = _4 - t_19;
+         
+         The vectorization sequence should be:
+         1. Vectorize _4 = *_3;
+         2. Vectorize # t_19 = PHI <_4(6), 0(5)>
+         3, Vectorize _6 = _4 - t_19;
+         
+         Flow:
+         1. So we save the first-order recurrence PHI in first_order_phi_vec
+	    and skip vectorizing it tentatively when iterating phi statement 
+	    (save t_19 = PHI <_4(6), 0(5)>).
+         2. Vectorize all statements when iterating all statements
+            until we reach the statement who is using the result of
+            first-order recurrence PHI (vectorize _4 = *_3;).
+         3. Stop iterating and return to vectorize the PHI saved
+	    in first_order_phi_vec (# t_19 = PHI <_4(6), 0(5)>).
+         4. Conintue iterating and vectorize the residual statements.  */
+      auto_vec<gphi *> first_order_phi_vec;
+
       for (gphi_iterator si = gsi_start_phis (bb); !gsi_end_p (si);
 	   gsi_next (&si))
 	{
@@ -10677,9 +10893,16 @@  vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
 	       || STMT_VINFO_DEF_TYPE (stmt_info) == vect_reduction_def
 	       || STMT_VINFO_DEF_TYPE (stmt_info) == vect_double_reduction_def
 	       || STMT_VINFO_DEF_TYPE (stmt_info) == vect_nested_cycle
-	       || STMT_VINFO_DEF_TYPE (stmt_info) == vect_internal_def)
+	       || STMT_VINFO_DEF_TYPE (stmt_info) == vect_internal_def
+	       || STMT_VINFO_DEF_TYPE (stmt_info) == vect_first_order_recurrence)
 	      && ! PURE_SLP_STMT (stmt_info))
 	    maybe_set_vectorized_backedge_value (loop_vinfo, stmt_info);
+
+	  if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_first_order_recurrence)
+	    {
+	      first_order_phi_vec.safe_push (phi);
+	      continue;
+	    }
 	}
 
       for (gimple_stmt_iterator si = gsi_start_bb (bb);
@@ -10698,6 +10921,18 @@  vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
 	      /* Ignore vector stmts created in the outer loop.  */
 	      stmt_info = loop_vinfo->lookup_stmt (stmt);
 
+	      gphi *first_order_phi
+		= vect_use_first_order_phi_result_p (first_order_phi_vec, stmt);
+	      if (first_order_phi)
+		{
+		  stmt_vec_info first_order_stmt_info
+		    = loop_vinfo->lookup_stmt (first_order_phi);
+		  gimple_stmt_iterator first_order_si
+		    = gsi_for_stmt (first_order_phi);
+		  vect_transform_loop_stmt (loop_vinfo, first_order_stmt_info,
+					    &first_order_si, NULL);
+		}
+
 	      /* vector stmts created in the outer-loop during vectorization of
 		 stmts in an inner-loop may not have a stmt_info, and do not
 		 need to be vectorized.  */
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index c8d1efc45e5..a8c0a8a4151 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -11404,6 +11404,12 @@  vect_transform_stmt (vec_info *vinfo,
       gcc_assert (done);
       break;
 
+    case dep_phi_info_type:
+      done = vectorizable_dep_phi (as_a <loop_vec_info> (vinfo),
+				  stmt_info, &vec_stmt, slp_node);
+      gcc_assert (done);
+      break;
+
     case phi_info_type:
       done = vectorizable_phi (vinfo, stmt_info, &vec_stmt, slp_node, NULL);
       gcc_assert (done);
@@ -11804,6 +11810,9 @@  vect_is_simple_use (tree operand, vec_info *vinfo, enum vect_def_type *dt,
 	case vect_nested_cycle:
 	  dump_printf (MSG_NOTE, "nested cycle\n");
 	  break;
+	case vect_first_order_recurrence:
+	  dump_printf (MSG_NOTE, "first order recurrence\n");
+	  break;
 	case vect_unknown_def_type:
 	  dump_printf (MSG_NOTE, "unknown\n");
 	  break;
@@ -11852,7 +11861,8 @@  vect_is_simple_use (tree operand, vec_info *vinfo, enum vect_def_type *dt,
       || *dt == vect_induction_def
       || *dt == vect_reduction_def
       || *dt == vect_double_reduction_def
-      || *dt == vect_nested_cycle)
+      || *dt == vect_nested_cycle
+			|| *dt == vect_first_order_recurrence)
     {
       *vectype = STMT_VINFO_VECTYPE (def_stmt_info);
       gcc_assert (*vectype != NULL_TREE);
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 4870c754499..2c8e7660b4e 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -65,6 +65,7 @@  enum vect_def_type {
   vect_reduction_def,
   vect_double_reduction_def,
   vect_nested_cycle,
+  vect_first_order_recurrence,
   vect_unknown_def_type
 };
 
@@ -1027,6 +1028,7 @@  enum stmt_vec_info_type {
   cycle_phi_info_type,
   lc_phi_info_type,
   phi_info_type,
+  dep_phi_info_type,
   loop_exit_ctrl_vec_info_type
 };
 
@@ -2331,6 +2333,8 @@  extern bool vectorizable_lc_phi (loop_vec_info, stmt_vec_info,
 				 gimple **, slp_tree);
 extern bool vectorizable_phi (vec_info *, stmt_vec_info, gimple **, slp_tree,
 			      stmt_vector_for_cost *);
+extern bool vectorizable_dep_phi (loop_vec_info, stmt_vec_info,
+				 gimple **, slp_tree);
 extern bool vect_emulated_vector_p (tree);
 extern bool vect_can_vectorize_without_simd_p (tree_code);
 extern bool vect_can_vectorize_without_simd_p (code_helper);