[V4] VECT: Support LEN_MASK_{LOAD,STORE} ifn && optabs

Message ID 20230615131435.10323-1-juzhe.zhong@rivai.ai
State Accepted
Series [V4] VECT: Support LEN_MASK_{LOAD,STORE} ifn && optabs

Checks

Context Check Description
snail/gcc-patch-check success Github commit url

Commit Message

juzhe.zhong@rivai.ai June 15, 2023, 1:14 p.m. UTC
  From: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>

This patch passes bootstrap on x86. OK for trunk?

Following Richi's comments, I split the first patch so that it only adds the
ifn and optabs for LEN_MASK_{LOAD,STORE}; this patch does not yet use them in
the vectorizer. It also adds a BIAS argument for possible future use by s390.

The descriptions of the patterns in the documentation come from Robin.

Once this patch is approved, I will send a second patch that applies the
len_mask_* patterns in the vectorizer.

Targets like ARM SVE in GCC have an elegant way to handle loop control
and flow control simultaneously:

loop_control_mask = WHILE_ULT
flow_control_mask = comparison
control_mask = loop_control_mask & flow_control_mask;
MASK_LOAD (control_mask)
MASK_STORE (control_mask)
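
As a rough illustration of the mask-only scheme above, here is a scalar C
sketch. It is not GCC code: VF, while_ult and masked_add are hypothetical
names chosen for this example, and each "vector" is modelled as a short loop
over VF lanes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define VF 4 /* hypothetical number of lanes per vector */

/* Scalar model of WHILE_ULT: lane k is active while i + k < n.  */
static void
while_ult (size_t i, size_t n, bool mask[VF])
{
  for (size_t k = 0; k < VF; k++)
    mask[k] = i + k < n;
}

/* Model of: for (i = 0; i < n; i++) if (cond[i]) a[i] = b[i] + c[i];
   processed VF lanes at a time with a single combined mask.  */
void
masked_add (int *a, const int *b, const int *c, const int *cond, size_t n)
{
  for (size_t i = 0; i < n; i += VF)
    {
      bool loop_control_mask[VF];
      bool control_mask[VF];

      while_ult (i, n, loop_control_mask);

      /* control_mask = loop_control_mask & flow_control_mask.  The
	 comparison is only evaluated on lanes that are in bounds.  */
      for (size_t k = 0; k < VF; k++)
	control_mask[k] = loop_control_mask[k] && cond[i + k] != 0;

      /* MASK_LOAD and MASK_STORE only touch the active lanes.  */
      for (size_t k = 0; k < VF; k++)
	if (control_mask[k])
	  a[i + k] = b[i + k] + c[i + k];
    }
}
```

With n = 5 and VF = 4, the second iteration has only lane 0 active in
loop_control_mask, so elements at index 5 and beyond are never accessed.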

However, targets like RVV (RISC-V Vector) cannot use this approach in
auto-vectorization, since RVV uses a length for loop control.

This patch adds LEN_MASK_{LOAD,STORE} to support flow control for targets
like RISC-V that use a length for loop control.
Loads and stores are normalized into LEN_MASK_{LOAD,STORE} as long as either
the length or the mask is valid. The length is the result of SELECT_VL or
MIN_EXPR. The mask is the result of a comparison.

The LEN_MASK_{LOAD,STORE} format is defined as follows:
1). LEN_MASK_LOAD (ptr, align, length, mask).
2). LEN_MASK_STORE (ptr, align, length, mask, vec).
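
The intended lane-by-lane behaviour can be sketched with a small scalar
reference model in C. This is only an illustration under stated assumptions:
vec_t, mask_t, VF and the *_model functions are hypothetical names, the align
and bias arguments are left out, and inactive lanes of a load (which the
pattern leaves undefined) are arbitrarily set to 0 here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define VF 4 /* hypothetical number of lanes per vector */

typedef struct { int lane[VF]; } vec_t;
typedef struct { bool lane[VF]; } mask_t;

/* Model of LEN_MASK_LOAD: lane k is loaded only if k < length and the
   mask bit for lane k is set.  Other lanes are undefined in the real
   pattern; this model fills them with 0.  */
vec_t
len_mask_load_model (const int *ptr, size_t length, mask_t mask)
{
  assert (length <= VF); /* a larger length is undefined behaviour */
  vec_t v;
  for (size_t k = 0; k < VF; k++)
    v.lane[k] = (k < length && mask.lane[k]) ? ptr[k] : 0;
  return v;
}

/* Model of LEN_MASK_STORE: lane k is stored only if k < length and the
   mask bit for lane k is set.  Other memory elements are unchanged.  */
void
len_mask_store_model (int *ptr, size_t length, mask_t mask, vec_t vec)
{
  assert (length <= VF);
  for (size_t k = 0; k < VF; k++)
    if (k < length && mask.lane[k])
      ptr[k] = vec.lane[k];
}
```

In this model a lane is accessed only when both conditions hold, so a length
of VF reduces it to a plain masked access, and an all-ones mask reduces it to
a plain length-controlled access.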

Consider the following four cases:

VLA: Variable-length auto-vectorization
VLS: Specific-length auto-vectorization

Case 1 (VLS): -mrvv-vector-bits=128   IR (Does not use LEN_MASK_*):
Code:                                   v1 = MEM (...)
  for (int i = 0; i < 4; i++)           v2 = MEM (...)
    a[i] = b[i] + c[i];                 v3 = v1 + v2 
                                        MEM[...] = v3

Case 2 (VLS): -mrvv-vector-bits=128   IR (LEN_MASK_* with length = VF, mask = comparison):
Code:                                   mask = comparison
  for (int i = 0; i < 4; i++)           v1 = LEN_MASK_LOAD (length = VF, mask)
    if (cond[i])                        v2 = LEN_MASK_LOAD (length = VF, mask) 
      a[i] = b[i] + c[i];               v3 = v1 + v2
                                        LEN_MASK_STORE (length = VF, mask, v3)
           
Case 3 (VLA):
Code:                                   loop_len = SELECT_VL or MIN
  for (int i = 0; i < n; i++)           v1 = LEN_MASK_LOAD (length = loop_len, mask = {-1,-1,...})
    a[i] = b[i] + c[i];                 v2 = LEN_MASK_LOAD (length = loop_len, mask = {-1,-1,...})
                                        v3 = v1 + v2                            
                                        LEN_MASK_STORE (length = loop_len, mask = {-1,-1,...}, v3)

Case 4 (VLA):
Code:                                   loop_len = SELECT_VL or MIN
  for (int i = 0; i < n; i++)           mask = comparison
    if (cond[i])                        v1 = LEN_MASK_LOAD (length = loop_len, mask)
      a[i] = b[i] + c[i];               v2 = LEN_MASK_LOAD (length = loop_len, mask)
                                        v3 = v1 + v2                            
                                        LEN_MASK_STORE (length = loop_len, mask, v3)
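
Case 4 can be modelled in scalar C as follows. This is a sketch, not what the
vectorizer emits: VF, min_size and case4_model are hypothetical names,
loop_len is computed with MIN rather than SELECT_VL, and inactive lanes of
the loads are set to 0 although the pattern leaves them undefined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define VF 4 /* hypothetical number of lanes per vector */

static size_t
min_size (size_t x, size_t y)
{
  return x < y ? x : y;
}

/* Model of: for (i = 0; i < n; i++) if (cond[i]) a[i] = b[i] + c[i];
   with a length for loop control and a mask for flow control.  */
void
case4_model (int *a, const int *b, const int *c, const int *cond, size_t n)
{
  for (size_t i = 0; i < n;)
    {
      /* Loop control: loop_len = MIN (remaining elements, VF).  */
      size_t loop_len = min_size (n - i, VF);
      assert (loop_len > 0 && loop_len <= VF);

      bool mask[VF] = { false };
      int v1[VF] = { 0 }, v2[VF] = { 0 }, v3[VF];

      /* Flow control: mask = comparison, on the first loop_len lanes.  */
      for (size_t k = 0; k < loop_len; k++)
	mask[k] = cond[i + k] != 0;

      /* v1, v2 = LEN_MASK_LOAD (length = loop_len, mask).  */
      for (size_t k = 0; k < VF; k++)
	if (k < loop_len && mask[k])
	  {
	    v1[k] = b[i + k];
	    v2[k] = c[i + k];
	  }

      /* v3 = v1 + v2 operates on whole vectors.  */
      for (size_t k = 0; k < VF; k++)
	v3[k] = v1[k] + v2[k];

      /* LEN_MASK_STORE (length = loop_len, mask, v3).  */
      for (size_t k = 0; k < VF; k++)
	if (k < loop_len && mask[k])
	  a[i + k] = v3[k];

      i += loop_len;
    }
}
```

Dropping the mask computation (all lanes true) gives Case 3, and fixing
loop_len to VF gives Case 2.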

Co-authored-by: Robin Dapp <rdapp.gcc@gmail.com>

gcc/ChangeLog:

        * doc/md.texi: Add len_mask{load,store}.
        * genopinit.cc (main): Ditto.
        (CMP_NAME): Ditto.
        * internal-fn.cc (len_maskload_direct): Ditto.
        (len_maskstore_direct): Ditto.
        (expand_call_mem_ref): Ditto.
        (expand_partial_load_optab_fn): Ditto.
        (expand_len_maskload_optab_fn): Ditto.
        (expand_partial_store_optab_fn): Ditto.
        (expand_len_maskstore_optab_fn): Ditto.
        (direct_len_maskload_optab_supported_p): Ditto.
        (direct_len_maskstore_optab_supported_p): Ditto.
        * internal-fn.def (LEN_MASK_LOAD): Ditto.
        (LEN_MASK_STORE): Ditto.
        * optabs.def (OPTAB_CD): Ditto.

---
 gcc/doc/md.texi     | 46 +++++++++++++++++++++++++++++++++++++++++++++
 gcc/genopinit.cc    |  6 ++++--
 gcc/internal-fn.cc  | 43 ++++++++++++++++++++++++++++++++++++++----
 gcc/internal-fn.def |  4 ++++
 gcc/optabs.def      |  2 ++
 5 files changed, 95 insertions(+), 6 deletions(-)
  

Comments

Richard Biener June 16, 2023, 9:04 a.m. UTC | #1
On Thu, 15 Jun 2023, juzhe.zhong@rivai.ai wrote:

> From: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>
> 
> This patch bootstrap pass on X86, ok for trunk ?

OK with me, please give Richard S. a chance to comment before pushing.

Thanks,
Richard.

> Accoding to comments from Richi, split the first patch to add ifn && optabs
> of LEN_MASK_{LOAD,STORE} only, we don't apply them into vectorizer in this
> patch. And also add BIAS argument for possible s390's future use.
> 
> The description of the patterns in doc are coming Robin.
> 
> After this patch is approved, will send the second patch to apply len_mask_*
> patterns into vectorizer.
> 
> Target like ARM SVE in GCC has an elegant way to handle both loop control
> and flow control simultaneously:
> 
> loop_control_mask = WHILE_ULT
> flow_control_mask = comparison
> control_mask = loop_control_mask & flow_control_mask;
> MASK_LOAD (control_mask)
> MASK_STORE (control_mask)
> 
> However, targets like RVV (RISC-V Vector) can not use this approach in
> auto-vectorization since RVV use length in loop control.
> 
> This patch adds LEN_MASK_ LOAD/STORE to support flow control for targets
> like RISC-V that uses length in loop control.
> Normalize load/store into LEN_MASK_ LOAD/STORE as long as either length
> or mask is valid. Length is the outcome of SELECT_VL or MIN_EXPR.
> Mask is the outcome of comparison.
> 
> LEN_MASK_ LOAD/STORE format is defined as follows:
> 1). LEN_MASK_LOAD (ptr, align, length, mask).
> 2). LEN_MASK_STORE (ptr, align, length, mask, vec).
> 
> Consider these 4 following cases:
> 
> VLA: Variable-length auto-vectorization
> VLS: Specific-length auto-vectorization
> 
> Case 1 (VLS): -mrvv-vector-bits=128   IR (Does not use LEN_MASK_*):
> Code:					v1 = MEM (...)
>   for (int i = 0; i < 4; i++)           v2 = MEM (...)
>     a[i] = b[i] + c[i];                 v3 = v1 + v2 
>                                         MEM[...] = v3
> 
> Case 2 (VLS): -mrvv-vector-bits=128   IR (LEN_MASK_* with length = VF, mask = comparison):
> Code:                                   mask = comparison
>   for (int i = 0; i < 4; i++)           v1 = LEN_MASK_LOAD (length = VF, mask)
>     if (cond[i])                        v2 = LEN_MASK_LOAD (length = VF, mask) 
>       a[i] = b[i] + c[i];               v3 = v1 + v2
>                                         LEN_MASK_STORE (length = VF, mask, v3)
>            
> Case 3 (VLA):
> Code:                                   loop_len = SELECT_VL or MIN
>   for (int i = 0; i < n; i++)           v1 = LEN_MASK_LOAD (length = loop_len, mask = {-1,-1,...})
>       a[i] = b[i] + c[i];               v2 = LEN_MASK_LOAD (length = loop_len, mask = {-1,-1,...})
>                                         v3 = v1 + v2                            
>                                         LEN_MASK_STORE (length = loop_len, mask = {-1,-1,...}, v3)
> 
> Case 4 (VLA):
> Code:                                   loop_len = SELECT_VL or MIN
>   for (int i = 0; i < n; i++)           mask = comparison
>       if (cond[i])                      v1 = LEN_MASK_LOAD (length = loop_len, mask)
>       a[i] = b[i] + c[i];               v2 = LEN_MASK_LOAD (length = loop_len, mask)
>                                         v3 = v1 + v2                            
>                                         LEN_MASK_STORE (length = loop_len, mask, v3)
> 
> Co-authored-by: Robin Dapp <rdapp.gcc@gmail.com>
> 
> gcc/ChangeLog:
> 
>         * doc/md.texi: Add len_mask{load,store}.
>         * genopinit.cc (main): Ditto.
>         (CMP_NAME): Ditto.
>         * internal-fn.cc (len_maskload_direct): Ditto.
>         (len_maskstore_direct): Ditto.
>         (expand_call_mem_ref): Ditto.
>         (expand_partial_load_optab_fn): Ditto.
>         (expand_len_maskload_optab_fn): Ditto.
>         (expand_partial_store_optab_fn): Ditto.
>         (expand_len_maskstore_optab_fn): Ditto.
>         (direct_len_maskload_optab_supported_p): Ditto.
>         (direct_len_maskstore_optab_supported_p): Ditto.
>         * internal-fn.def (LEN_MASK_LOAD): Ditto.
>         (LEN_MASK_STORE): Ditto.
>         * optabs.def (OPTAB_CD): Ditto.
> 
> ---
>  gcc/doc/md.texi     | 46 +++++++++++++++++++++++++++++++++++++++++++++
>  gcc/genopinit.cc    |  6 ++++--
>  gcc/internal-fn.cc  | 43 ++++++++++++++++++++++++++++++++++++++----
>  gcc/internal-fn.def |  4 ++++
>  gcc/optabs.def      |  2 ++
>  5 files changed, 95 insertions(+), 6 deletions(-)
> 
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index a43fd65a2b2..af23ec938d6 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -5136,6 +5136,52 @@ of @code{QI} elements.
>  
>  This pattern is not allowed to @code{FAIL}.
>  
> +@cindex @code{len_maskload@var{m}@var{n}} instruction pattern
> +@item @samp{len_maskload@var{m}@var{n}}
> +Perform a masked load (operand 2 - operand 4) elements from vector memory
> +operand 1 into vector register operand 0, setting the other elements of
> +operand 0 to undefined values.  This is a combination of len_load and maskload. 
> +Operands 0 and 1 have mode @var{m}, which must be a vector mode.  Operand 2
> +has whichever integer mode the target prefers.  A secondary mask is specified in
> +operand 3 which must be of type @var{n}.  Operand 4 conceptually has mode @code{QI}.
> +
> +Operand 2 can be a variable or a constant amount.  Operand 4 specifies a
> +constant bias: it is either a constant 0 or a constant -1.  The predicate on
> +operand 4 must only accept the bias values that the target actually supports.
> +GCC handles a bias of 0 more efficiently than a bias of -1.
> +
> +If (operand 2 - operand 4) exceeds the number of elements in mode
> +@var{m}, the behavior is undefined.
> +
> +If the target prefers the length to be measured in bytes
> +rather than elements, it should only implement this pattern for vectors
> +of @code{QI} elements.
> +
> +This pattern is not allowed to @code{FAIL}.
> +
> +@cindex @code{len_maskstore@var{m}@var{n}} instruction pattern
> +@item @samp{len_maskstore@var{m}@var{n}}
> +Perform a masked store (operand 2 - operand 4) vector elements from vector register
> +operand 1 into memory operand 0, leaving the other elements of operand 0 unchanged.
> +This is a combination of len_store and maskstore.
> +Operands 0 and 1 have mode @var{m}, which must be a vector mode.  Operand 2 has whichever
> +integer mode the target prefers.  A secondary mask is specified in operand 3 which must be
> +of type @var{n}.  Operand 4 conceptually has mode @code{QI}.
> +
> +Operand 2 can be a variable or a constant amount.  Operand 4 specifies a
> +constant bias: it is either a constant 0 or a constant -1.  The predicate on
> +operand 4 must only accept the bias values that the target actually supports.
> +GCC handles a bias of 0 more efficiently than a bias of -1.
> +
> +If (operand 2 - operand 4) exceeds the number of elements in mode
> +@var{m}, the behavior is undefined.
> +
> +If the target prefers the length to be measured in bytes
> +rather than elements, it should only implement this pattern for vectors
> +of @code{QI} elements.
> +
> +This pattern is not allowed to @code{FAIL}.
> +
>  @cindex @code{vec_perm@var{m}} instruction pattern
>  @item @samp{vec_perm@var{m}}
>  Output a (variable) vector permutation.  Operand 0 is the destination
> diff --git a/gcc/genopinit.cc b/gcc/genopinit.cc
> index 0c1b6859ca0..6bd8858a1d9 100644
> --- a/gcc/genopinit.cc
> +++ b/gcc/genopinit.cc
> @@ -376,7 +376,8 @@ main (int argc, const char **argv)
>  
>    fprintf (s_file,
>  	   "/* Returns TRUE if the target supports any of the partial vector\n"
> -	   "   optabs: while_ult_optab, len_load_optab or len_store_optab,\n"
> +	   "   optabs: while_ult_optab, len_load_optab, len_store_optab,\n"
> +	   "   len_maskload_optab or len_maskstore_optab,\n"
>  	   "   for any mode.  */\n"
>  	   "bool\npartial_vectors_supported_p (void)\n{\n");
>    bool any_match = false;
> @@ -386,7 +387,8 @@ main (int argc, const char **argv)
>      {
>  #define CMP_NAME(N) !strncmp (p->name, (N), strlen ((N)))
>        if (CMP_NAME("while_ult") || CMP_NAME ("len_load")
> -	  || CMP_NAME ("len_store"))
> +	  || CMP_NAME ("len_store")|| CMP_NAME ("len_maskload")
> +	  || CMP_NAME ("len_maskstore"))
>  	{
>  	  if (first)
>  	    fprintf (s_file, " HAVE_%s", p->name);
> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> index 208bdf497eb..c911ae790cb 100644
> --- a/gcc/internal-fn.cc
> +++ b/gcc/internal-fn.cc
> @@ -165,6 +165,7 @@ init_internal_fns ()
>  #define mask_load_lanes_direct { -1, -1, false }
>  #define gather_load_direct { 3, 1, false }
>  #define len_load_direct { -1, -1, false }
> +#define len_maskload_direct { -1, 3, false }
>  #define mask_store_direct { 3, 2, false }
>  #define store_lanes_direct { 0, 0, false }
>  #define mask_store_lanes_direct { 0, 0, false }
> @@ -172,6 +173,7 @@ init_internal_fns ()
>  #define vec_cond_direct { 2, 0, false }
>  #define scatter_store_direct { 3, 1, false }
>  #define len_store_direct { 3, 3, false }
> +#define len_maskstore_direct { 4, 3, false }
>  #define vec_set_direct { 3, 3, false }
>  #define unary_direct { 0, 0, true }
>  #define unary_convert_direct { -1, 0, true }
> @@ -2873,12 +2875,13 @@ expand_call_mem_ref (tree type, gcall *stmt, int index)
>    return fold_build2 (MEM_REF, type, addr, build_int_cst (alias_ptr_type, 0));
>  }
>  
> -/* Expand MASK_LOAD{,_LANES} or LEN_LOAD call STMT using optab OPTAB.  */
> +/* Expand MASK_LOAD{,_LANES}, LEN_MASK_LOAD or LEN_LOAD call STMT using optab
> + * OPTAB.  */
>  
>  static void
>  expand_partial_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
>  {
> -  class expand_operand ops[4];
> +  class expand_operand ops[5];
>    tree type, lhs, rhs, maskt, biast;
>    rtx mem, target, mask, bias;
>    insn_code icode;
> @@ -2913,6 +2916,20 @@ expand_partial_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
>        create_input_operand (&ops[3], bias, QImode);
>        expand_insn (icode, 4, ops);
>      }
> +  else if (optab == len_maskload_optab)
> +    {
> +      create_convert_operand_from (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)),
> +				   TYPE_UNSIGNED (TREE_TYPE (maskt)));
> +      maskt = gimple_call_arg (stmt, 3);
> +      mask = expand_normal (maskt);
> +      create_input_operand (&ops[3], mask, TYPE_MODE (TREE_TYPE (maskt)));
> +      icode = convert_optab_handler (optab, TYPE_MODE (type),
> +				     TYPE_MODE (TREE_TYPE (maskt)));
> +      biast = gimple_call_arg (stmt, 4);
> +      bias = expand_normal (biast);
> +      create_input_operand (&ops[4], bias, QImode);
> +      expand_insn (icode, 5, ops);
> +    }
>    else
>      {
>        create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
> @@ -2926,13 +2943,15 @@ expand_partial_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
>  #define expand_mask_load_optab_fn expand_partial_load_optab_fn
>  #define expand_mask_load_lanes_optab_fn expand_mask_load_optab_fn
>  #define expand_len_load_optab_fn expand_partial_load_optab_fn
> +#define expand_len_maskload_optab_fn expand_partial_load_optab_fn
>  
> -/* Expand MASK_STORE{,_LANES} or LEN_STORE call STMT using optab OPTAB.  */
> +/* Expand MASK_STORE{,_LANES}, LEN_MASK_STORE or LEN_STORE call STMT using optab
> + * OPTAB.  */
>  
>  static void
>  expand_partial_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
>  {
> -  class expand_operand ops[4];
> +  class expand_operand ops[5];
>    tree type, lhs, rhs, maskt, biast;
>    rtx mem, reg, mask, bias;
>    insn_code icode;
> @@ -2965,6 +2984,19 @@ expand_partial_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
>        create_input_operand (&ops[3], bias, QImode);
>        expand_insn (icode, 4, ops);
>      }
> +  else if (optab == len_maskstore_optab)
> +    {
> +      create_convert_operand_from (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)),
> +				   TYPE_UNSIGNED (TREE_TYPE (maskt)));
> +      maskt = gimple_call_arg (stmt, 3);
> +      mask = expand_normal (maskt);
> +      create_input_operand (&ops[3], mask, TYPE_MODE (TREE_TYPE (maskt)));
> +      biast = gimple_call_arg (stmt, 4);
> +      bias = expand_normal (biast);
> +      create_input_operand (&ops[4], bias, QImode);
> +      icode = convert_optab_handler (optab, TYPE_MODE (type), GET_MODE (mask));
> +      expand_insn (icode, 5, ops);
> +    }
>    else
>      {
>        create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
> @@ -2975,6 +3007,7 @@ expand_partial_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
>  #define expand_mask_store_optab_fn expand_partial_store_optab_fn
>  #define expand_mask_store_lanes_optab_fn expand_mask_store_optab_fn
>  #define expand_len_store_optab_fn expand_partial_store_optab_fn
> +#define expand_len_maskstore_optab_fn expand_partial_store_optab_fn
>  
>  /* Expand VCOND, VCONDU and VCONDEQ optab internal functions.
>     The expansion of STMT happens based on OPTAB table associated.  */
> @@ -3928,6 +3961,7 @@ multi_vector_optab_supported_p (convert_optab optab, tree_pair types,
>  #define direct_mask_load_lanes_optab_supported_p multi_vector_optab_supported_p
>  #define direct_gather_load_optab_supported_p convert_optab_supported_p
>  #define direct_len_load_optab_supported_p direct_optab_supported_p
> +#define direct_len_maskload_optab_supported_p convert_optab_supported_p
>  #define direct_mask_store_optab_supported_p convert_optab_supported_p
>  #define direct_store_lanes_optab_supported_p multi_vector_optab_supported_p
>  #define direct_mask_store_lanes_optab_supported_p multi_vector_optab_supported_p
> @@ -3935,6 +3969,7 @@ multi_vector_optab_supported_p (convert_optab optab, tree_pair types,
>  #define direct_vec_cond_optab_supported_p convert_optab_supported_p
>  #define direct_scatter_store_optab_supported_p convert_optab_supported_p
>  #define direct_len_store_optab_supported_p direct_optab_supported_p
> +#define direct_len_maskstore_optab_supported_p convert_optab_supported_p
>  #define direct_while_optab_supported_p convert_optab_supported_p
>  #define direct_fold_extract_optab_supported_p direct_optab_supported_p
>  #define direct_fold_left_optab_supported_p direct_optab_supported_p
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> index 9da5f31636e..bc947c0fde7 100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -50,12 +50,14 @@ along with GCC; see the file COPYING3.  If not see
>     - mask_load_lanes: currently just vec_mask_load_lanes
>     - gather_load: used for {mask_,}gather_load
>     - len_load: currently just len_load
> +   - len_maskload: currently just len_maskload
>  
>     - mask_store: currently just maskstore
>     - store_lanes: currently just vec_store_lanes
>     - mask_store_lanes: currently just vec_mask_store_lanes
>     - scatter_store: used for {mask_,}scatter_store
>     - len_store: currently just len_store
> +   - len_maskstore: currently just len_maskstore
>  
>     - unary: a normal unary optab, such as vec_reverse_<mode>
>     - binary: a normal binary optab, such as vec_interleave_lo_<mode>
> @@ -157,6 +159,7 @@ DEF_INTERNAL_OPTAB_FN (MASK_GATHER_LOAD, ECF_PURE,
>  		       mask_gather_load, gather_load)
>  
>  DEF_INTERNAL_OPTAB_FN (LEN_LOAD, ECF_PURE, len_load, len_load)
> +DEF_INTERNAL_OPTAB_FN (LEN_MASK_LOAD, ECF_PURE, len_maskload, len_maskload)
>  
>  DEF_INTERNAL_OPTAB_FN (SCATTER_STORE, 0, scatter_store, scatter_store)
>  DEF_INTERNAL_OPTAB_FN (MASK_SCATTER_STORE, 0,
> @@ -175,6 +178,7 @@ DEF_INTERNAL_OPTAB_FN (VCOND_MASK, 0, vcond_mask, vec_cond_mask)
>  DEF_INTERNAL_OPTAB_FN (VEC_SET, 0, vec_set, vec_set)
>  
>  DEF_INTERNAL_OPTAB_FN (LEN_STORE, 0, len_store, len_store)
> +DEF_INTERNAL_OPTAB_FN (LEN_MASK_STORE, 0, len_maskstore, len_maskstore)
>  
>  DEF_INTERNAL_OPTAB_FN (WHILE_ULT, ECF_CONST | ECF_NOTHROW, while_ult, while)
>  DEF_INTERNAL_OPTAB_FN (SELECT_VL, ECF_CONST | ECF_NOTHROW, select_vl, binary)
> diff --git a/gcc/optabs.def b/gcc/optabs.def
> index 22b31be0f72..9533eb11565 100644
> --- a/gcc/optabs.def
> +++ b/gcc/optabs.def
> @@ -91,6 +91,8 @@ OPTAB_CD(vec_cmpu_optab, "vec_cmpu$a$b")
>  OPTAB_CD(vec_cmpeq_optab, "vec_cmpeq$a$b")
>  OPTAB_CD(maskload_optab, "maskload$a$b")
>  OPTAB_CD(maskstore_optab, "maskstore$a$b")
> +OPTAB_CD(len_maskload_optab, "len_maskload$a$b")
> +OPTAB_CD(len_maskstore_optab, "len_maskstore$a$b")
>  OPTAB_CD(gather_load_optab, "gather_load$a$b")
>  OPTAB_CD(mask_gather_load_optab, "mask_gather_load$a$b")
>  OPTAB_CD(scatter_store_optab, "scatter_store$a$b")
>
  
juzhe.zhong@rivai.ai June 16, 2023, 9:06 a.m. UTC | #2
Thanks a lot! I will wait for Richard S.'s final approval.



juzhe.zhong@rivai.ai
 
From: Richard Biener
Date: 2023-06-16 17:04
To: Ju-Zhe Zhong
CC: gcc-patches; richard.sandiford; rdapp.gcc
Subject: Re: [PATCH V4] VECT: Support LEN_MASK_{LOAD,STORE} ifn && optabs
On Thu, 15 Jun 2023, juzhe.zhong@rivai.ai wrote:
 
> From: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>
> 
> This patch bootstrap pass on X86, ok for trunk ?
 
OK with me, please give Richard S. a chance to comment before pushing.
 
Thanks,
Richard.
 
> Accoding to comments from Richi, split the first patch to add ifn && optabs
> of LEN_MASK_{LOAD,STORE} only, we don't apply them into vectorizer in this
> patch. And also add BIAS argument for possible s390's future use.
> 
> The description of the patterns in doc are coming Robin.
> 
> After this patch is approved, will send the second patch to apply len_mask_*
> patterns into vectorizer.
> 
> Target like ARM SVE in GCC has an elegant way to handle both loop control
> and flow control simultaneously:
> 
> loop_control_mask = WHILE_ULT
> flow_control_mask = comparison
> control_mask = loop_control_mask & flow_control_mask;
> MASK_LOAD (control_mask)
> MASK_STORE (control_mask)
> 
> However, targets like RVV (RISC-V Vector) can not use this approach in
> auto-vectorization since RVV use length in loop control.
> 
> This patch adds LEN_MASK_ LOAD/STORE to support flow control for targets
> like RISC-V that uses length in loop control.
> Normalize load/store into LEN_MASK_ LOAD/STORE as long as either length
> or mask is valid. Length is the outcome of SELECT_VL or MIN_EXPR.
> Mask is the outcome of comparison.
> 
> LEN_MASK_ LOAD/STORE format is defined as follows:
> 1). LEN_MASK_LOAD (ptr, align, length, mask).
> 2). LEN_MASK_STORE (ptr, align, length, mask, vec).
> 
> Consider these 4 following cases:
> 
> VLA: Variable-length auto-vectorization
> VLS: Specific-length auto-vectorization
> 
> Case 1 (VLS): -mrvv-vector-bits=128   IR (Does not use LEN_MASK_*):
> Code: v1 = MEM (...)
>   for (int i = 0; i < 4; i++)           v2 = MEM (...)
>     a[i] = b[i] + c[i];                 v3 = v1 + v2 
>                                         MEM[...] = v3
> 
> Case 2 (VLS): -mrvv-vector-bits=128   IR (LEN_MASK_* with length = VF, mask = comparison):
> Code:                                   mask = comparison
>   for (int i = 0; i < 4; i++)           v1 = LEN_MASK_LOAD (length = VF, mask)
>     if (cond[i])                        v2 = LEN_MASK_LOAD (length = VF, mask) 
>       a[i] = b[i] + c[i];               v3 = v1 + v2
>                                         LEN_MASK_STORE (length = VF, mask, v3)
>            
> Case 3 (VLA):
> Code:                                   loop_len = SELECT_VL or MIN
>   for (int i = 0; i < n; i++)           v1 = LEN_MASK_LOAD (length = loop_len, mask = {-1,-1,...})
>       a[i] = b[i] + c[i];               v2 = LEN_MASK_LOAD (length = loop_len, mask = {-1,-1,...})
>                                         v3 = v1 + v2                            
>                                         LEN_MASK_STORE (length = loop_len, mask = {-1,-1,...}, v3)
> 
> Case 4 (VLA):
> Code:                                   loop_len = SELECT_VL or MIN
>   for (int i = 0; i < n; i++)           mask = comparison
>       if (cond[i])                      v1 = LEN_MASK_LOAD (length = loop_len, mask)
>       a[i] = b[i] + c[i];               v2 = LEN_MASK_LOAD (length = loop_len, mask)
>                                         v3 = v1 + v2                            
>                                         LEN_MASK_STORE (length = loop_len, mask, v3)
> 
> Co-authored-by: Robin Dapp <rdapp.gcc@gmail.com>
> 
> gcc/ChangeLog:
> 
>         * doc/md.texi: Add len_mask{load,store}.
>         * genopinit.cc (main): Ditto.
>         (CMP_NAME): Ditto.
>         * internal-fn.cc (len_maskload_direct): Ditto.
>         (len_maskstore_direct): Ditto.
>         (expand_call_mem_ref): Ditto.
>         (expand_partial_load_optab_fn): Ditto.
>         (expand_len_maskload_optab_fn): Ditto.
>         (expand_partial_store_optab_fn): Ditto.
>         (expand_len_maskstore_optab_fn): Ditto.
>         (direct_len_maskload_optab_supported_p): Ditto.
>         (direct_len_maskstore_optab_supported_p): Ditto.
>         * internal-fn.def (LEN_MASK_LOAD): Ditto.
>         (LEN_MASK_STORE): Ditto.
>         * optabs.def (OPTAB_CD): Ditto.
> 
> ---
>  gcc/doc/md.texi     | 46 +++++++++++++++++++++++++++++++++++++++++++++
>  gcc/genopinit.cc    |  6 ++++--
>  gcc/internal-fn.cc  | 43 ++++++++++++++++++++++++++++++++++++++----
>  gcc/internal-fn.def |  4 ++++
>  gcc/optabs.def      |  2 ++
>  5 files changed, 95 insertions(+), 6 deletions(-)
> 
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index a43fd65a2b2..af23ec938d6 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -5136,6 +5136,52 @@ of @code{QI} elements.
>  
>  This pattern is not allowed to @code{FAIL}.
>  
> +@cindex @code{len_maskload@var{m}@var{n}} instruction pattern
> +@item @samp{len_maskload@var{m}@var{n}}
> +Perform a masked load (operand 2 - operand 4) elements from vector memory
> +operand 1 into vector register operand 0, setting the other elements of
> +operand 0 to undefined values.  This is a combination of len_load and maskload. 
> +Operands 0 and 1 have mode @var{m}, which must be a vector mode.  Operand 2
> +has whichever integer mode the target prefers.  A secondary mask is specified in
> +operand 3 which must be of type @var{n}.  Operand 4 conceptually has mode @code{QI}.
> +
> +Operand 2 can be a variable or a constant amount.  Operand 4 specifies a
> +constant bias: it is either a constant 0 or a constant -1.  The predicate on
> +operand 4 must only accept the bias values that the target actually supports.
> +GCC handles a bias of 0 more efficiently than a bias of -1.
> +
> +If (operand 2 - operand 4) exceeds the number of elements in mode
> +@var{m}, the behavior is undefined.
> +
> +If the target prefers the length to be measured in bytes
> +rather than elements, it should only implement this pattern for vectors
> +of @code{QI} elements.
> +
> +This pattern is not allowed to @code{FAIL}.
> +
> +@cindex @code{len_maskstore@var{m}@var{n}} instruction pattern
> +@item @samp{len_maskstore@var{m}@var{n}}
> +Perform a masked store (operand 2 - operand 4) vector elements from vector register
> +operand 1 into memory operand 0, leaving the other elements of operand 0 unchanged.
> +This is a combination of len_store and maskstore.
> +Operands 0 and 1 have mode @var{m}, which must be a vector mode.  Operand 2 has whichever
> +integer mode the target prefers.  A secondary mask is specified in operand 3 which must be
> +of type @var{n}.  Operand 4 conceptually has mode @code{QI}.
> +
> +Operand 2 can be a variable or a constant amount.  Operand 4 specifies a
> +constant bias: it is either a constant 0 or a constant -1.  The predicate on
> +operand 4 must only accept the bias values that the target actually supports.
> +GCC handles a bias of 0 more efficiently than a bias of -1.
> +
> +If (operand 2 - operand 4) exceeds the number of elements in mode
> +@var{m}, the behavior is undefined.
> +
> +If the target prefers the length to be measured in bytes
> +rather than elements, it should only implement this pattern for vectors
> +of @code{QI} elements.
> +
> +This pattern is not allowed to @code{FAIL}.
> +
>  @cindex @code{vec_perm@var{m}} instruction pattern
>  @item @samp{vec_perm@var{m}}
>  Output a (variable) vector permutation.  Operand 0 is the destination
> diff --git a/gcc/genopinit.cc b/gcc/genopinit.cc
> index 0c1b6859ca0..6bd8858a1d9 100644
> --- a/gcc/genopinit.cc
> +++ b/gcc/genopinit.cc
> @@ -376,7 +376,8 @@ main (int argc, const char **argv)
>  
>    fprintf (s_file,
>     "/* Returns TRUE if the target supports any of the partial vector\n"
> -    "   optabs: while_ult_optab, len_load_optab or len_store_optab,\n"
> +    "   optabs: while_ult_optab, len_load_optab, len_store_optab,\n"
> +    "   len_maskload_optab or len_maskstore_optab,\n"
>     "   for any mode.  */\n"
>     "bool\npartial_vectors_supported_p (void)\n{\n");
>    bool any_match = false;
> @@ -386,7 +387,8 @@ main (int argc, const char **argv)
>      {
>  #define CMP_NAME(N) !strncmp (p->name, (N), strlen ((N)))
>        if (CMP_NAME("while_ult") || CMP_NAME ("len_load")
> -   || CMP_NAME ("len_store"))
> +   || CMP_NAME ("len_store")|| CMP_NAME ("len_maskload")
> +   || CMP_NAME ("len_maskstore"))
>  {
>    if (first)
>      fprintf (s_file, " HAVE_%s", p->name);
> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> index 208bdf497eb..c911ae790cb 100644
> --- a/gcc/internal-fn.cc
> +++ b/gcc/internal-fn.cc
> @@ -165,6 +165,7 @@ init_internal_fns ()
>  #define mask_load_lanes_direct { -1, -1, false }
>  #define gather_load_direct { 3, 1, false }
>  #define len_load_direct { -1, -1, false }
> +#define len_maskload_direct { -1, 3, false }
>  #define mask_store_direct { 3, 2, false }
>  #define store_lanes_direct { 0, 0, false }
>  #define mask_store_lanes_direct { 0, 0, false }
> @@ -172,6 +173,7 @@ init_internal_fns ()
>  #define vec_cond_direct { 2, 0, false }
>  #define scatter_store_direct { 3, 1, false }
>  #define len_store_direct { 3, 3, false }
> +#define len_maskstore_direct { 4, 3, false }
>  #define vec_set_direct { 3, 3, false }
>  #define unary_direct { 0, 0, true }
>  #define unary_convert_direct { -1, 0, true }
> @@ -2873,12 +2875,13 @@ expand_call_mem_ref (tree type, gcall *stmt, int index)
>    return fold_build2 (MEM_REF, type, addr, build_int_cst (alias_ptr_type, 0));
>  }
>  
> -/* Expand MASK_LOAD{,_LANES} or LEN_LOAD call STMT using optab OPTAB.  */
> +/* Expand MASK_LOAD{,_LANES}, LEN_MASK_LOAD or LEN_LOAD call STMT using optab
> + * OPTAB.  */
>  
>  static void
>  expand_partial_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
>  {
> -  class expand_operand ops[4];
> +  class expand_operand ops[5];
>    tree type, lhs, rhs, maskt, biast;
>    rtx mem, target, mask, bias;
>    insn_code icode;
> @@ -2913,6 +2916,20 @@ expand_partial_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
>        create_input_operand (&ops[3], bias, QImode);
>        expand_insn (icode, 4, ops);
>      }
> +  else if (optab == len_maskload_optab)
> +    {
> +      create_convert_operand_from (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)),
> +    TYPE_UNSIGNED (TREE_TYPE (maskt)));
> +      maskt = gimple_call_arg (stmt, 3);
> +      mask = expand_normal (maskt);
> +      create_input_operand (&ops[3], mask, TYPE_MODE (TREE_TYPE (maskt)));
> +      icode = convert_optab_handler (optab, TYPE_MODE (type),
> +      TYPE_MODE (TREE_TYPE (maskt)));
> +      biast = gimple_call_arg (stmt, 4);
> +      bias = expand_normal (biast);
> +      create_input_operand (&ops[4], bias, QImode);
> +      expand_insn (icode, 5, ops);
> +    }
>    else
>      {
>        create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
> @@ -2926,13 +2943,15 @@ expand_partial_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
>  #define expand_mask_load_optab_fn expand_partial_load_optab_fn
>  #define expand_mask_load_lanes_optab_fn expand_mask_load_optab_fn
>  #define expand_len_load_optab_fn expand_partial_load_optab_fn
> +#define expand_len_maskload_optab_fn expand_partial_load_optab_fn
>  
> -/* Expand MASK_STORE{,_LANES} or LEN_STORE call STMT using optab OPTAB.  */
> +/* Expand MASK_STORE{,_LANES}, LEN_MASK_STORE or LEN_STORE call STMT using optab
> + * OPTAB.  */
>  
>  static void
>  expand_partial_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
>  {
> -  class expand_operand ops[4];
> +  class expand_operand ops[5];
>    tree type, lhs, rhs, maskt, biast;
>    rtx mem, reg, mask, bias;
>    insn_code icode;
> @@ -2965,6 +2984,19 @@ expand_partial_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
>        create_input_operand (&ops[3], bias, QImode);
>        expand_insn (icode, 4, ops);
>      }
> +  else if (optab == len_maskstore_optab)
> +    {
> +      create_convert_operand_from (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)),
> +    TYPE_UNSIGNED (TREE_TYPE (maskt)));
> +      maskt = gimple_call_arg (stmt, 3);
> +      mask = expand_normal (maskt);
> +      create_input_operand (&ops[3], mask, TYPE_MODE (TREE_TYPE (maskt)));
> +      biast = gimple_call_arg (stmt, 4);
> +      bias = expand_normal (biast);
> +      create_input_operand (&ops[4], bias, QImode);
> +      icode = convert_optab_handler (optab, TYPE_MODE (type), GET_MODE (mask));
> +      expand_insn (icode, 5, ops);
> +    }
>    else
>      {
>        create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
> @@ -2975,6 +3007,7 @@ expand_partial_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
>  #define expand_mask_store_optab_fn expand_partial_store_optab_fn
>  #define expand_mask_store_lanes_optab_fn expand_mask_store_optab_fn
>  #define expand_len_store_optab_fn expand_partial_store_optab_fn
> +#define expand_len_maskstore_optab_fn expand_partial_store_optab_fn
>  
>  /* Expand VCOND, VCONDU and VCONDEQ optab internal functions.
>     The expansion of STMT happens based on OPTAB table associated.  */
> @@ -3928,6 +3961,7 @@ multi_vector_optab_supported_p (convert_optab optab, tree_pair types,
>  #define direct_mask_load_lanes_optab_supported_p multi_vector_optab_supported_p
>  #define direct_gather_load_optab_supported_p convert_optab_supported_p
>  #define direct_len_load_optab_supported_p direct_optab_supported_p
> +#define direct_len_maskload_optab_supported_p convert_optab_supported_p
>  #define direct_mask_store_optab_supported_p convert_optab_supported_p
>  #define direct_store_lanes_optab_supported_p multi_vector_optab_supported_p
>  #define direct_mask_store_lanes_optab_supported_p multi_vector_optab_supported_p
> @@ -3935,6 +3969,7 @@ multi_vector_optab_supported_p (convert_optab optab, tree_pair types,
>  #define direct_vec_cond_optab_supported_p convert_optab_supported_p
>  #define direct_scatter_store_optab_supported_p convert_optab_supported_p
>  #define direct_len_store_optab_supported_p direct_optab_supported_p
> +#define direct_len_maskstore_optab_supported_p convert_optab_supported_p
>  #define direct_while_optab_supported_p convert_optab_supported_p
>  #define direct_fold_extract_optab_supported_p direct_optab_supported_p
>  #define direct_fold_left_optab_supported_p direct_optab_supported_p
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> index 9da5f31636e..bc947c0fde7 100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -50,12 +50,14 @@ along with GCC; see the file COPYING3.  If not see
>     - mask_load_lanes: currently just vec_mask_load_lanes
>     - gather_load: used for {mask_,}gather_load
>     - len_load: currently just len_load
> +   - len_maskload: currently just len_maskload
>  
>     - mask_store: currently just maskstore
>     - store_lanes: currently just vec_store_lanes
>     - mask_store_lanes: currently just vec_mask_store_lanes
>     - scatter_store: used for {mask_,}scatter_store
>     - len_store: currently just len_store
> +   - len_maskstore: currently just len_maskstore
>  
>     - unary: a normal unary optab, such as vec_reverse_<mode>
>     - binary: a normal binary optab, such as vec_interleave_lo_<mode>
> @@ -157,6 +159,7 @@ DEF_INTERNAL_OPTAB_FN (MASK_GATHER_LOAD, ECF_PURE,
>         mask_gather_load, gather_load)
>  
>  DEF_INTERNAL_OPTAB_FN (LEN_LOAD, ECF_PURE, len_load, len_load)
> +DEF_INTERNAL_OPTAB_FN (LEN_MASK_LOAD, ECF_PURE, len_maskload, len_maskload)
>  
>  DEF_INTERNAL_OPTAB_FN (SCATTER_STORE, 0, scatter_store, scatter_store)
>  DEF_INTERNAL_OPTAB_FN (MASK_SCATTER_STORE, 0,
> @@ -175,6 +178,7 @@ DEF_INTERNAL_OPTAB_FN (VCOND_MASK, 0, vcond_mask, vec_cond_mask)
>  DEF_INTERNAL_OPTAB_FN (VEC_SET, 0, vec_set, vec_set)
>  
>  DEF_INTERNAL_OPTAB_FN (LEN_STORE, 0, len_store, len_store)
> +DEF_INTERNAL_OPTAB_FN (LEN_MASK_STORE, 0, len_maskstore, len_maskstore)
>  
>  DEF_INTERNAL_OPTAB_FN (WHILE_ULT, ECF_CONST | ECF_NOTHROW, while_ult, while)
>  DEF_INTERNAL_OPTAB_FN (SELECT_VL, ECF_CONST | ECF_NOTHROW, select_vl, binary)
> diff --git a/gcc/optabs.def b/gcc/optabs.def
> index 22b31be0f72..9533eb11565 100644
> --- a/gcc/optabs.def
> +++ b/gcc/optabs.def
> @@ -91,6 +91,8 @@ OPTAB_CD(vec_cmpu_optab, "vec_cmpu$a$b")
>  OPTAB_CD(vec_cmpeq_optab, "vec_cmpeq$a$b")
>  OPTAB_CD(maskload_optab, "maskload$a$b")
>  OPTAB_CD(maskstore_optab, "maskstore$a$b")
> +OPTAB_CD(len_maskload_optab, "len_maskload$a$b")
> +OPTAB_CD(len_maskstore_optab, "len_maskstore$a$b")
>  OPTAB_CD(gather_load_optab, "gather_load$a$b")
>  OPTAB_CD(mask_gather_load_optab, "mask_gather_load$a$b")
>  OPTAB_CD(scatter_store_optab, "scatter_store$a$b")
> 
 
-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)
  
Robin Dapp June 16, 2023, 9:21 a.m. UTC | #3
Hi Juzhe,

> +@cindex @code{len_maskload@var{m}@var{n}} instruction pattern
> +@item @samp{len_maskload@var{m}@var{n}}
> +Perform a masked load (operand 2 - operand 4) elements from vector memory
> +operand 1 into vector register operand 0, setting the other elements of
> +operand 0 to undefined values.  This is a combination of len_load and maskload. 
> +Operands 0 and 1 have mode @var{m}, which must be a vector mode.  Operand 2
> +has whichever integer mode the target prefers.  A secondary mask is specified in
> +operand 3 which must be of type @var{n}.  Operand 4 conceptually has mode @code{QI}.
> +
> +Operand 2 can be a variable or a constant amount.  Operand 4 specifies a
> +constant bias: it is either a constant 0 or a constant -1.  The predicate on
> +operand 4 must only accept the bias values that the target actually supports.
> +GCC handles a bias of 0 more efficiently than a bias of -1.
> +
> +If (operand 2 - operand 4) exceeds the number of elements in mode
> +@var{m}, the behavior is undefined.
> +
> +If the target prefers the length to be measured in bytes
> +rather than elements, it should only implement this pattern for vectors
> +of @code{QI} elements.
> +
> +This pattern is not allowed to @code{FAIL}.

Please still change
"Perform a masked load (operand 2 - operand 4) elements"
to
"Perform a masked load of (operand 2 + operand 4) elements".

"vector memory operand" -> "memory operand"

As Richi has mentioned we are adding the negative bias not subtracting a positive
one.  You could also change the len_load and len_store comments while at it so
as to not introduce more confusion.

The "secondary" can also be omitted now because we don't have a primary mask
somewhere.  Maybe, for clarification, even if it's implicit:
"A mask is specified in operand 3 which must... The mask has lower precedence
than the length and is itself subject to length masking, i.e. only mask indices
<= (operand 2 + operand 4) are used."

> +
> +@cindex @code{len_maskstore@var{m}@var{n}} instruction pattern
> +@item @samp{len_maskstore@var{m}@var{n}}
> +Perform a masked store (operand 2 - operand 4) vector elements from vector register
> +operand 1 into memory operand 0, leaving the other elements of operand 0 unchanged.
> +This is a combination of len_store and maskstore.
> +Operands 0 and 1 have mode @var{m}, which must be a vector mode.  Operand 2 has whichever
> +integer mode the target prefers.  A secondary mask is specified in operand 3 which must be
> +of type @var{n}.  Operand 4 conceptually has mode @code{QI}.

The same applies here: "store of (operand 2 + operand 4) vector elements",
and likewise for the "secondary".

Thanks.  No V5 necessary IMHO for those but let's see what Richard says.

Regards
 Robin
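
To illustrate the bias point above: a minimal sketch in C, assuming a hypothetical helper named effective_length that is not part of GCC. It models operand 2 as len and operand 4 as bias, and shows why the text should say "+": the bias is 0 or -1 and is added to the length.

```c
#include <assert.h>

/* Hypothetical helper, not GCC code.  LEN models operand 2 and BIAS
   models operand 4, which is either 0 or -1.  The number of elements
   the pattern may touch is LEN + BIAS, i.e. a negative bias is added
   rather than a positive one subtracted.  */
static int
effective_length (int len, int bias)
{
  assert (bias == 0 || bias == -1);
  return len + bias;
}
```

With len = 4, a bias of 0 gives 4 elements and a bias of -1 gives 3. Writing the expression as (operand 2 - operand 4) would give 5 for a bias of -1, which is not what is intended.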
  
juzhe.zhong@rivai.ai June 16, 2023, 9:33 a.m. UTC | #4
Thanks Robin. 

I have sent V5 to make the eventual merge easier.
I didn't change the len_load/len_store descriptions since I think that belongs in a separate patch.
This patch only adds len_maskload/len_maskstore.

I will wait for Richard S's final comments.


juzhe.zhong@rivai.ai
 
  
Robin Dapp June 16, 2023, 10:10 a.m. UTC | #5
>     <= (operand 2 + operand 4) are used."

Sorry it's really minor (and my mistake) but it should be < and
not <=, right?  Mask index 0 is inactive when the length is 0.

> +Perform a masked store (operand 2 + operand 4)

Even more minor but as mentioned the "of" is still missing ;)
Same with the <=.

Regards
 Robin
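
The combined length and mask semantics discussed in this thread can be sketched as a scalar model in C. This is only an illustration under stated assumptions: the function names model_len_maskload and model_len_maskstore, the int element type, and the bool mask array are all hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical scalar model of len_maskload, not GCC code.  Element I
   is loaded only when I < LEN + BIAS and MASK[I] is set.  The real
   pattern leaves inactive elements undefined; this model simply does
   not write them.  */
static void
model_len_maskload (int *dst, const int *src, int nunits,
                    int len, int bias, const bool *mask)
{
  int active = len + bias;
  /* Exceeding the number of elements is undefined in the pattern
     description, so the model rejects it.  */
  assert (active >= 0 && active <= nunits);
  for (int i = 0; i < active; i++)
    if (mask[i])
      dst[i] = src[i];
}

/* Hypothetical scalar model of len_maskstore.  The index test is the
   same; here inactive elements of the memory destination must stay
   unchanged.  */
static void
model_len_maskstore (int *mem, const int *src, int nunits,
                     int len, int bias, const bool *mask)
{
  int active = len + bias;
  assert (active >= 0 && active <= nunits);
  for (int i = 0; i < active; i++)
    if (mask[i])
      mem[i] = src[i];
}
```

The loop condition uses a strict i < active, matching the correction from <= to <: with a length of 0 and a bias of 0, no element is active, including index 0.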
  
juzhe.zhong@rivai.ai June 16, 2023, 10:29 a.m. UTC | #6
Addressed the comments and sent V6.



juzhe.zhong@rivai.ai
 
  

Patch

diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index a43fd65a2b2..af23ec938d6 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -5136,6 +5136,52 @@  of @code{QI} elements.
 
 This pattern is not allowed to @code{FAIL}.
 
+@cindex @code{len_maskload@var{m}@var{n}} instruction pattern
+@item @samp{len_maskload@var{m}@var{n}}
+Perform a masked load (operand 2 - operand 4) elements from vector memory
+operand 1 into vector register operand 0, setting the other elements of
+operand 0 to undefined values.  This is a combination of len_load and maskload. 
+Operands 0 and 1 have mode @var{m}, which must be a vector mode.  Operand 2
+has whichever integer mode the target prefers.  A secondary mask is specified in
+operand 3 which must be of type @var{n}.  Operand 4 conceptually has mode @code{QI}.
+
+Operand 2 can be a variable or a constant amount.  Operand 4 specifies a
+constant bias: it is either a constant 0 or a constant -1.  The predicate on
+operand 4 must only accept the bias values that the target actually supports.
+GCC handles a bias of 0 more efficiently than a bias of -1.
+
+If (operand 2 - operand 4) exceeds the number of elements in mode
+@var{m}, the behavior is undefined.
+
+If the target prefers the length to be measured in bytes
+rather than elements, it should only implement this pattern for vectors
+of @code{QI} elements.
+
+This pattern is not allowed to @code{FAIL}.
+
+@cindex @code{len_maskstore@var{m}@var{n}} instruction pattern
+@item @samp{len_maskstore@var{m}@var{n}}
+Perform a masked store (operand 2 - operand 4) vector elements from vector register
+operand 1 into memory operand 0, leaving the other elements of operand 0 unchanged.
+This is a combination of len_store and maskstore.
+Operands 0 and 1 have mode @var{m}, which must be a vector mode.  Operand 2 has whichever
+integer mode the target prefers.  A secondary mask is specified in operand 3 which must be
+of type @var{n}.  Operand 4 conceptually has mode @code{QI}.
+
+Operand 2 can be a variable or a constant amount.  Operand 4 specifies a
+constant bias: it is either a constant 0 or a constant -1.  The predicate on
+operand 4 must only accept the bias values that the target actually supports.
+GCC handles a bias of 0 more efficiently than a bias of -1.
+
+If (operand 2 - operand 4) exceeds the number of elements in mode
+@var{m}, the behavior is undefined.
+
+If the target prefers the length to be measured in bytes
+rather than elements, it should only implement this pattern for vectors
+of @code{QI} elements.
+
+This pattern is not allowed to @code{FAIL}.
+
 @cindex @code{vec_perm@var{m}} instruction pattern
 @item @samp{vec_perm@var{m}}
 Output a (variable) vector permutation.  Operand 0 is the destination
diff --git a/gcc/genopinit.cc b/gcc/genopinit.cc
index 0c1b6859ca0..6bd8858a1d9 100644
--- a/gcc/genopinit.cc
+++ b/gcc/genopinit.cc
@@ -376,7 +376,8 @@  main (int argc, const char **argv)
 
   fprintf (s_file,
 	   "/* Returns TRUE if the target supports any of the partial vector\n"
-	   "   optabs: while_ult_optab, len_load_optab or len_store_optab,\n"
+	   "   optabs: while_ult_optab, len_load_optab, len_store_optab,\n"
+	   "   len_maskload_optab or len_maskstore_optab,\n"
 	   "   for any mode.  */\n"
 	   "bool\npartial_vectors_supported_p (void)\n{\n");
   bool any_match = false;
@@ -386,7 +387,8 @@  main (int argc, const char **argv)
     {
 #define CMP_NAME(N) !strncmp (p->name, (N), strlen ((N)))
       if (CMP_NAME("while_ult") || CMP_NAME ("len_load")
-	  || CMP_NAME ("len_store"))
+	  || CMP_NAME ("len_store")|| CMP_NAME ("len_maskload")
+	  || CMP_NAME ("len_maskstore"))
 	{
 	  if (first)
 	    fprintf (s_file, " HAVE_%s", p->name);
diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
index 208bdf497eb..c911ae790cb 100644
--- a/gcc/internal-fn.cc
+++ b/gcc/internal-fn.cc
@@ -165,6 +165,7 @@  init_internal_fns ()
 #define mask_load_lanes_direct { -1, -1, false }
 #define gather_load_direct { 3, 1, false }
 #define len_load_direct { -1, -1, false }
+#define len_maskload_direct { -1, 3, false }
 #define mask_store_direct { 3, 2, false }
 #define store_lanes_direct { 0, 0, false }
 #define mask_store_lanes_direct { 0, 0, false }
@@ -172,6 +173,7 @@  init_internal_fns ()
 #define vec_cond_direct { 2, 0, false }
 #define scatter_store_direct { 3, 1, false }
 #define len_store_direct { 3, 3, false }
+#define len_maskstore_direct { 4, 3, false }
 #define vec_set_direct { 3, 3, false }
 #define unary_direct { 0, 0, true }
 #define unary_convert_direct { -1, 0, true }
@@ -2873,12 +2875,13 @@  expand_call_mem_ref (tree type, gcall *stmt, int index)
   return fold_build2 (MEM_REF, type, addr, build_int_cst (alias_ptr_type, 0));
 }
 
-/* Expand MASK_LOAD{,_LANES} or LEN_LOAD call STMT using optab OPTAB.  */
+/* Expand MASK_LOAD{,_LANES}, LEN_MASK_LOAD or LEN_LOAD call STMT using optab
+ * OPTAB.  */
 
 static void
 expand_partial_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
 {
-  class expand_operand ops[4];
+  class expand_operand ops[5];
   tree type, lhs, rhs, maskt, biast;
   rtx mem, target, mask, bias;
   insn_code icode;
@@ -2913,6 +2916,20 @@  expand_partial_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
       create_input_operand (&ops[3], bias, QImode);
       expand_insn (icode, 4, ops);
     }
+  else if (optab == len_maskload_optab)
+    {
+      create_convert_operand_from (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)),
+				   TYPE_UNSIGNED (TREE_TYPE (maskt)));
+      maskt = gimple_call_arg (stmt, 3);
+      mask = expand_normal (maskt);
+      create_input_operand (&ops[3], mask, TYPE_MODE (TREE_TYPE (maskt)));
+      icode = convert_optab_handler (optab, TYPE_MODE (type),
+				     TYPE_MODE (TREE_TYPE (maskt)));
+      biast = gimple_call_arg (stmt, 4);
+      bias = expand_normal (biast);
+      create_input_operand (&ops[4], bias, QImode);
+      expand_insn (icode, 5, ops);
+    }
   else
     {
       create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
@@ -2926,13 +2943,15 @@  expand_partial_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
 #define expand_mask_load_optab_fn expand_partial_load_optab_fn
 #define expand_mask_load_lanes_optab_fn expand_mask_load_optab_fn
 #define expand_len_load_optab_fn expand_partial_load_optab_fn
+#define expand_len_maskload_optab_fn expand_partial_load_optab_fn
 
-/* Expand MASK_STORE{,_LANES} or LEN_STORE call STMT using optab OPTAB.  */
+/* Expand MASK_STORE{,_LANES}, LEN_MASK_STORE or LEN_STORE call STMT using optab
+ * OPTAB.  */
 
 static void
 expand_partial_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
 {
-  class expand_operand ops[4];
+  class expand_operand ops[5];
   tree type, lhs, rhs, maskt, biast;
   rtx mem, reg, mask, bias;
   insn_code icode;
@@ -2965,6 +2984,19 @@  expand_partial_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
       create_input_operand (&ops[3], bias, QImode);
       expand_insn (icode, 4, ops);
     }
+  else if (optab == len_maskstore_optab)
+    {
+      create_convert_operand_from (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)),
+				   TYPE_UNSIGNED (TREE_TYPE (maskt)));
+      maskt = gimple_call_arg (stmt, 3);
+      mask = expand_normal (maskt);
+      create_input_operand (&ops[3], mask, TYPE_MODE (TREE_TYPE (maskt)));
+      biast = gimple_call_arg (stmt, 4);
+      bias = expand_normal (biast);
+      create_input_operand (&ops[4], bias, QImode);
+      icode = convert_optab_handler (optab, TYPE_MODE (type), GET_MODE (mask));
+      expand_insn (icode, 5, ops);
+    }
   else
     {
       create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
@@ -2975,6 +3007,7 @@  expand_partial_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
 #define expand_mask_store_optab_fn expand_partial_store_optab_fn
 #define expand_mask_store_lanes_optab_fn expand_mask_store_optab_fn
 #define expand_len_store_optab_fn expand_partial_store_optab_fn
+#define expand_len_maskstore_optab_fn expand_partial_store_optab_fn
 
 /* Expand VCOND, VCONDU and VCONDEQ optab internal functions.
    The expansion of STMT happens based on OPTAB table associated.  */
@@ -3928,6 +3961,7 @@  multi_vector_optab_supported_p (convert_optab optab, tree_pair types,
 #define direct_mask_load_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_gather_load_optab_supported_p convert_optab_supported_p
 #define direct_len_load_optab_supported_p direct_optab_supported_p
+#define direct_len_maskload_optab_supported_p convert_optab_supported_p
 #define direct_mask_store_optab_supported_p convert_optab_supported_p
 #define direct_store_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_mask_store_lanes_optab_supported_p multi_vector_optab_supported_p
@@ -3935,6 +3969,7 @@  multi_vector_optab_supported_p (convert_optab optab, tree_pair types,
 #define direct_vec_cond_optab_supported_p convert_optab_supported_p
 #define direct_scatter_store_optab_supported_p convert_optab_supported_p
 #define direct_len_store_optab_supported_p direct_optab_supported_p
+#define direct_len_maskstore_optab_supported_p convert_optab_supported_p
 #define direct_while_optab_supported_p convert_optab_supported_p
 #define direct_fold_extract_optab_supported_p direct_optab_supported_p
 #define direct_fold_left_optab_supported_p direct_optab_supported_p
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index 9da5f31636e..bc947c0fde7 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -50,12 +50,14 @@  along with GCC; see the file COPYING3.  If not see
    - mask_load_lanes: currently just vec_mask_load_lanes
    - gather_load: used for {mask_,}gather_load
    - len_load: currently just len_load
+   - len_maskload: currently just len_maskload
 
    - mask_store: currently just maskstore
    - store_lanes: currently just vec_store_lanes
    - mask_store_lanes: currently just vec_mask_store_lanes
    - scatter_store: used for {mask_,}scatter_store
    - len_store: currently just len_store
+   - len_maskstore: currently just len_maskstore
 
    - unary: a normal unary optab, such as vec_reverse_<mode>
    - binary: a normal binary optab, such as vec_interleave_lo_<mode>
@@ -157,6 +159,7 @@  DEF_INTERNAL_OPTAB_FN (MASK_GATHER_LOAD, ECF_PURE,
 		       mask_gather_load, gather_load)
 
 DEF_INTERNAL_OPTAB_FN (LEN_LOAD, ECF_PURE, len_load, len_load)
+DEF_INTERNAL_OPTAB_FN (LEN_MASK_LOAD, ECF_PURE, len_maskload, len_maskload)
 
 DEF_INTERNAL_OPTAB_FN (SCATTER_STORE, 0, scatter_store, scatter_store)
 DEF_INTERNAL_OPTAB_FN (MASK_SCATTER_STORE, 0,
@@ -175,6 +178,7 @@  DEF_INTERNAL_OPTAB_FN (VCOND_MASK, 0, vcond_mask, vec_cond_mask)
 DEF_INTERNAL_OPTAB_FN (VEC_SET, 0, vec_set, vec_set)
 
 DEF_INTERNAL_OPTAB_FN (LEN_STORE, 0, len_store, len_store)
+DEF_INTERNAL_OPTAB_FN (LEN_MASK_STORE, 0, len_maskstore, len_maskstore)
 
 DEF_INTERNAL_OPTAB_FN (WHILE_ULT, ECF_CONST | ECF_NOTHROW, while_ult, while)
 DEF_INTERNAL_OPTAB_FN (SELECT_VL, ECF_CONST | ECF_NOTHROW, select_vl, binary)
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 22b31be0f72..9533eb11565 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -91,6 +91,8 @@  OPTAB_CD(vec_cmpu_optab, "vec_cmpu$a$b")
 OPTAB_CD(vec_cmpeq_optab, "vec_cmpeq$a$b")
 OPTAB_CD(maskload_optab, "maskload$a$b")
 OPTAB_CD(maskstore_optab, "maskstore$a$b")
+OPTAB_CD(len_maskload_optab, "len_maskload$a$b")
+OPTAB_CD(len_maskstore_optab, "len_maskstore$a$b")
 OPTAB_CD(gather_load_optab, "gather_load$a$b")
 OPTAB_CD(mask_gather_load_optab, "mask_gather_load$a$b")
 OPTAB_CD(scatter_store_optab, "scatter_store$a$b")
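
As a side note on the genopinit.cc hunk above, the CMP_NAME macro is a prefix test built on strncmp. The sketch below reproduces that test in standalone C to show why the two new names have to be listed explicitly. The helper names has_prefix and is_partial_vector_pattern are hypothetical, and the pattern name strings used in the follow-up are made-up examples, not names taken from any target.

```c
#include <stdbool.h>
#include <string.h>

/* Same test as CMP_NAME in the hunk: does NAME start with PREFIX?  */
static bool
has_prefix (const char *name, const char *prefix)
{
  return !strncmp (name, prefix, strlen (prefix));
}

/* Hypothetical stand-in for the condition in the patched loop: true if
   NAME begins with one of the partial-vector optab names.  */
static bool
is_partial_vector_pattern (const char *name)
{
  return (has_prefix (name, "while_ult")
          || has_prefix (name, "len_load")
          || has_prefix (name, "len_store")
          || has_prefix (name, "len_maskload")
          || has_prefix (name, "len_maskstore"));
}
```

For a made-up pattern name such as "len_maskloadv4siv4bi", the "len_load" prefix does not match because the names diverge after "len_", so the existing checks alone would miss it. A plain "maskloadv4siv4si" is still not counted, since none of the listed prefixes match it.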