
Extend fold_vec_perm to fold VEC_PERM_EXPR in VLA manner

Message ID	CAAgBjM=CUm=fOVWCDu4YVydKQzRxkmw9fa_2hiZuX5pYungj6Q@mail.gmail.com
State	New, archived
Headers	Received-SPF: pass (google.com: domain of gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org designates 2620:52:3:1:0:246e:9693:128c as permitted sender) client-ip=2620:52:3:1:0:246e:9693:128c; DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 078593858436 MIME-Version: 1.0 Date: Wed, 17 Aug 2022 18:09:01 +0530 Message-ID: <CAAgBjM=CUm=fOVWCDu4YVydKQzRxkmw9fa_2hiZuX5pYungj6Q@mail.gmail.com> Subject: Extend fold_vec_perm to fold VEC_PERM_EXPR in VLA manner To: gcc Patches <gcc-patches@gcc.gnu.org>, Richard Biener <richard.guenther@gmail.com>, Richard Sandiford <richard.sandiford@arm.com> Content-Type: multipart/mixed; boundary="0000000000005ff77105e66f2846" Precedence: list From: Prathamesh Kulkarni via Gcc-patches <gcc-patches@gcc.gnu.org> Reply-To: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> Errors-To: gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org Sender: "Gcc-patches" <gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org> X-getmail-retrieved-from-mailbox: =?utf-8?q?INBOX?=
Series	Extend fold_vec_perm to fold VEC_PERM_EXPR in VLA manner \| Extend fold_vec_perm to fold VEC_PERM_EXPR in VLA manner

Commit Message

Prathamesh Kulkarni Aug. 17, 2022, 12:39 p.m. UTC

  Hi,
The attached prototype patch extends fold_vec_perm to fold VEC_PERM_EXPR
in VLA manner, and currently handles the following cases:
(a) fixed len arg0, arg1 and fixed len sel.
(b) fixed len arg0, arg1 and vla sel
(c) vla arg0, arg1 and vla sel with arg0, arg1 being VECTOR_CST.

It seems to work for the VLA tests written in
test_vec_perm_vla_folding (), and I am working through the fallout observed in
regression testing.

Does the approach taken in the patch look like the right direction?
I am not sure whether I have got the conversion from "sel_index"
to an index of either arg0 or arg1 entirely correct.
I would be grateful for suggestions on the patch.

Thanks,
Prathamesh

Comments

Prathamesh Kulkarni Aug. 29, 2022, 6:08 a.m. UTC | #1

ping https://gcc.gnu.org/pipermail/gcc-patches/2022-August/599888.html

Thanks,
Prathamesh

Prathamesh Kulkarni Sept. 5, 2022, 8:53 a.m. UTC | #2

ping * 2 https://gcc.gnu.org/pipermail/gcc-patches/2022-August/599888.html

Thanks,
Prathamesh

Richard Sandiford Sept. 5, 2022, 10:21 a.m. UTC | #3

Sorry for the slow reply.  I wrote a response a couple of weeks ago
but I think it got lost in a machine outage.

Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> index 4f4ec81c8d4..5e12260211e 100644
> --- a/gcc/fold-const.cc
> +++ b/gcc/fold-const.cc
> @@ -85,6 +85,9 @@ along with GCC; see the file COPYING3.  If not see
>  #include "vec-perm-indices.h"
>  #include "asan.h"
>  #include "gimple-range.h"
> +#include "tree-pretty-print.h"
> +#include "gimple-pretty-print.h"
> +#include "print-tree.h"
>  
>  /* Nonzero if we are folding constants inside an initializer or a C++
>     manifestly-constant-evaluated context; zero otherwise.
> @@ -10496,40 +10499,6 @@ fold_mult_zconjz (location_t loc, tree type, tree expr)
>  			  build_zero_cst (itype));
>  }
>  
> -
> -/* Helper function for fold_vec_perm.  Store elements of VECTOR_CST or
> -   CONSTRUCTOR ARG into array ELTS, which has NELTS elements, and return
> -   true if successful.  */
> -
> -static bool
> -vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
> -{
> -  unsigned HOST_WIDE_INT i, nunits;
> -
> -  if (TREE_CODE (arg) == VECTOR_CST
> -      && VECTOR_CST_NELTS (arg).is_constant (&nunits))
> -    {
> -      for (i = 0; i < nunits; ++i)
> -	elts[i] = VECTOR_CST_ELT (arg, i);
> -    }
> -  else if (TREE_CODE (arg) == CONSTRUCTOR)
> -    {
> -      constructor_elt *elt;
> -
> -      FOR_EACH_VEC_SAFE_ELT (CONSTRUCTOR_ELTS (arg), i, elt)
> -	if (i >= nelts || TREE_CODE (TREE_TYPE (elt->value)) == VECTOR_TYPE)
> -	  return false;
> -	else
> -	  elts[i] = elt->value;
> -    }
> -  else
> -    return false;
> -  for (; i < nelts; i++)
> -    elts[i]
> -      = fold_convert (TREE_TYPE (TREE_TYPE (arg)), integer_zero_node);
> -  return true;
> -}
> -
>  /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL
>     selector.  Return the folded VECTOR_CST or CONSTRUCTOR if successful,
>     NULL_TREE otherwise.  */
> @@ -10537,45 +10506,149 @@ vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
>  tree
>  fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
>  {
> -  unsigned int i;
> -  unsigned HOST_WIDE_INT nelts;
> -  bool need_ctor = false;
> +  poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +  poly_uint64 arg1_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1));
> +
> +  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type),
> +			sel.length ()));
> +  gcc_assert (known_eq (arg0_len, arg1_len));
>  
> -  if (!sel.length ().is_constant (&nelts))
> -    return NULL_TREE;
> -  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts)
> -	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts)
> -	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts));
>    if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type)
>        || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type))
>      return NULL_TREE;
>  
> -  tree *in_elts = XALLOCAVEC (tree, nelts * 2);
> -  if (!vec_cst_ctor_to_array (arg0, nelts, in_elts)
> -      || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts))
> +  unsigned input_npatterns = 0;
> +  unsigned out_npatterns = sel.encoding ().npatterns ();
> +  unsigned out_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
> +
> +  /* FIXME: How to reshape fixed length vector_cst, so that
> +     npatterns == vector.length () and nelts_per_pattern == 1 ?
> +     It seems the vector is canonicalized to minimize npatterns.  */
> +
> +  if (arg0_len.is_constant ())
> +    {
> +      /* If arg0, arg1 are fixed width vectors, and sel is VLA,
> +         ensure that it is a dup sequence and has same period
> +	 as input vector.  */
> +
> +      if (!sel.length ().is_constant ()
> +	  && (sel.encoding ().nelts_per_pattern () > 2
> +	      || !known_eq (arg0_len, sel.encoding ().npatterns ())))
> +	return NULL_TREE;
> +
> +      input_npatterns = arg0_len.to_constant ();
> +
> +      if (sel.length ().is_constant ())
> +	{
> +	  out_npatterns = sel.length ().to_constant ();
> +	  out_nelts_per_pattern = 1;
> +	}
> +    }
> +  else if (TREE_CODE (arg0) == VECTOR_CST
> +	   && TREE_CODE (arg1) == VECTOR_CST)
> +    {
> +      unsigned npatterns = VECTOR_CST_NPATTERNS (arg0);
> +      unsigned input_nelts_per_pattern = VECTOR_CST_NELTS_PER_PATTERN (arg0);
> +
> +      /* If arg0, arg1 are VLA, then ensure that,
> +	 (a) sel also has same length as input vectors.
> +	 (b) arg0 and arg1 have same encoding.
> +	 (c) sel has same number of patterns as input vectors.
> +	 (d) if sel is a stepped sequence, then it has same
> +	     encoding as input vectors.  */
> +
> +      if (!known_eq (arg0_len, sel.length ())
> +	  || npatterns != VECTOR_CST_NPATTERNS (arg1)
> +	  || input_nelts_per_pattern != VECTOR_CST_NELTS_PER_PATTERN (arg1)
> +	  || npatterns != sel.encoding ().npatterns ()
> +	  || (sel.encoding ().nelts_per_pattern () > 2
> +	      && sel.encoding ().nelts_per_pattern () != input_nelts_per_pattern))
> +	return NULL_TREE;

This seems too restrictive.  More below.

> +
> +      input_npatterns = npatterns;
> +    }
> +  else
>      return NULL_TREE;
>  
> -  tree_vector_builder out_elts (type, nelts, 1);
> -  for (i = 0; i < nelts; i++)
> +  tree_vector_builder out_elts_builder (type, out_npatterns,
> +					out_nelts_per_pattern);
> +  bool need_ctor = false;
> +  unsigned out_encoded_nelts = out_npatterns * out_nelts_per_pattern;
> +
> +  for (unsigned i = 0; i < out_encoded_nelts; i++)
>      {
> -      HOST_WIDE_INT index;
> -      if (!sel[i].is_constant (&index))
> +      HOST_WIDE_INT sel_index;
> +      if (!sel[i].is_constant (&sel_index))
>  	return NULL_TREE;
> -      if (!CONSTANT_CLASS_P (in_elts[index]))
> -	need_ctor = true;
> -      out_elts.quick_push (unshare_expr (in_elts[index]));
> +
> +      /* Convert sel_index to index of either arg0 or arg1.
> +	 For eg:
> +	 arg0: {a0, b0, a1, b1, a1 + S, b1 + S, ...}
> +	 arg1: {c0, d0, c1, d1, c1 + S, d1 + S, ...}
> +	 Both have npatterns == 2, nelts_per_pattern == 3.
> +	 Then the combined vector would be:
> +	 {a0, b0, c0, d0, a1, b1, c1, d1, a1 + S, b1 + S, c1 + S, d1 + S, ... }
> +	 This combined vector will have,
> +	 npatterns = 2 * input_npatterns == 4.
> +	 sel_index is used to index this above combined vector.

There's no interleaving of the arguments though.  The selector selects from:

{a0, b0, a1, b1, a1 + S, b1 + S, ..., c0, d0, c1, d1, c1 + S, d1 + S, ...}
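
To make the indexing concrete, here is a minimal standalone sketch (plain C++, not GCC code) of how a selector index maps onto the concatenation of the two arguments.  The helper name select_elt and the use of strings as elements are assumptions for illustration only.  The sketch also assumes that both arguments have the same known length n and that the index is already below 2n.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not a GCC function.  View arg0 and arg1 (both of
// length n) as one concatenated vector { arg0[0..n-1], arg1[0..n-1] } and
// return the element that selector index `sel` refers to.
static std::string select_elt(const std::vector<std::string> &arg0,
                              const std::vector<std::string> &arg1,
                              std::size_t sel)
{
  const std::size_t n = arg0.size();
  assert(arg1.size() == n);  // Both inputs have the same length.
  assert(sel < 2 * n);       // Assumption: index already reduced below 2n.
  return sel < n ? arg0[sel] : arg1[sel - n];
}
```

With arg0 = {a0, b0, a1, b1} and arg1 = {c0, d0, c1, d1}, index 2 picks a1 and index 4 picks c0, so the arguments are not interleaved.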

The VLA encoding encodes the first N patterns explicitly.  The
npatterns/nelts_per_pattern values then describe how to extend that
initial sequence to an arbitrary number of elements.  So when performing
an operation on (potentially) variable-length vectors, the question is:

* Can we work out an initial sequence and npatterns/nelts_per_pattern
  pair that will be correct for all elements of the result?

This depends on the operation that we're performing.  E.g. it's
different for unary operations (vector_builder::new_unary_operation)
and binary operations (vector_builder::new_binary_operation).  It also
varies between unary operations and between binary operations, hence
the allow_stepped_p parameters.

For VEC_PERM_EXPR, I think the key requirement is that:

(R) Each individual selector pattern must always select from the same vector.

Whether this condition is met depends both on the pattern itself and on
the number of patterns that it's combined with.

E.g. suppose we had the selector pattern:

  { 0, 1, 4, ... }   i.e. 3x - 2 for x > 0

If the arguments and selector each have n elements then this pattern on its
own would select from more than one argument if 3(n-1) - 2 >= n.
This is clearly true for large enough n.  So if n is variable then
we cannot represent this.

If the pattern above is one of two patterns, so interleaved as:

     { 0, _, 1, _, 4, _, ... }  o=0
  or { _, 0, _, 1, _, 4, ... }  o=1

then the pattern would select from more than one argument if
3(n/2-1) - 2 + o >= n.  This too would be a problem for variable n.

But if the pattern above is one of four patterns then it selects
from more than one argument if 3(n/4-1) - 2 + o >= n.  This is not
true for any valid n or o, so the pattern is OK.
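
As a rough way to experiment with these inequalities, the sketch below (standalone C++, not GCC code) evaluates the stepped part of one selector pattern at a concrete length n and reports whether all of its indices fall in the same input.  The helper name and its parameters are assumptions for illustration.  It only looks at the pattern's own values and only at the lengths it is given, so it cannot prove the property for every n; that still needs an argument like the one above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not a GCC function.  A stepped selector pattern is
// described by its second element `e1` and its step, so that element x of
// the pattern (x >= 1) is e1 + (x - 1) * step.  With `npatterns` interleaved
// patterns and `n` selector elements, each pattern has n / npatterns
// elements.  Indices [0, n) refer to the first input and [n, 2n) to the
// second.  Return true if every stepped element indexes the same input.
static bool stepped_part_in_one_input(std::uint64_t e1, std::uint64_t step,
                                      std::uint64_t npatterns,
                                      std::uint64_t n)
{
  assert(npatterns != 0 && n % npatterns == 0 && n / npatterns >= 2);
  const std::uint64_t per_pattern = n / npatterns;
  // Last element of the pattern, at x = per_pattern - 1.
  const std::uint64_t last = e1 + (per_pattern - 2) * step;
  // The step is non-negative, so comparing both ends is enough.
  return e1 / n == last / n;
}
```

For the pattern { 0, 1, 4, ... } (e1 = 1, step = 3) and n = 16, the check fails with one or two patterns and passes with four, which agrees with the three cases worked through above.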

So let's define some ad hoc terminology:

* Px is the number of patterns in x
* Ex is the number of elements per pattern in x

where x can be:

* 1: first argument
* 2: second argument
* s: selector
* r: result

Then:

(1) The number of elements encoded explicitly for x is Ex*Px

(2) The explicit encoding can be used to produce a sequence of N*Ex*Px
    elements for any integer N.  This extended sequence can be reencoded
    as having N*Px patterns, with Ex staying the same.

(3) If Ex < 3, Ex can be increased by 1 by repeating the final Px elements
    of the explicit encoding.
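
Facts (1) and (2) can be tried out with a small standalone model of the encoding.  This is only a sketch in plain C++: encoded_vec, decode_elt and widen_patterns are hypothetical names, elements are plain integers, and the extension rule used here (one element per pattern duplicates, two repeat the second, three continue with a constant step) is the assumption the model is built on.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a pattern-encoded vector: `elts` holds the
// npatterns * nelts_per_pattern explicitly encoded elements (fact (1)).
struct encoded_vec
{
  unsigned npatterns;
  unsigned nelts_per_pattern;
  std::vector<std::int64_t> elts;
};

// Return element i of the full sequence implied by the encoding.
static std::int64_t decode_elt(const encoded_vec &v, std::size_t i)
{
  assert(v.nelts_per_pattern >= 1 && v.nelts_per_pattern <= 3);
  assert(v.elts.size() == std::size_t(v.npatterns) * v.nelts_per_pattern);
  if (i < v.elts.size())
    return v.elts[i];
  const std::size_t pattern = i % v.npatterns;
  const std::size_t count = i / v.npatterns;  // Position within the pattern.
  // Index of the last explicitly encoded element of this pattern.
  const std::size_t last = (v.nelts_per_pattern - 1) * v.npatterns + pattern;
  if (v.nelts_per_pattern < 3)
    return v.elts[last];  // The final encoded element repeats.
  const std::int64_t step = v.elts[last] - v.elts[last - v.npatterns];
  return v.elts[last] + std::int64_t(count - 2) * step;
}

// Re-encode v with factor * npatterns patterns, keeping the number of
// elements per pattern unchanged (fact (2)).
static encoded_vec widen_patterns(const encoded_vec &v, unsigned factor)
{
  encoded_vec r{v.npatterns * factor, v.nelts_per_pattern, {}};
  const std::size_t encoded = std::size_t(r.npatterns) * r.nelts_per_pattern;
  for (std::size_t i = 0; i < encoded; ++i)
    r.elts.push_back(decode_elt(v, i));
  return r;
}
```

Decoding the widened vector should give the same sequence as decoding the original, which is what makes it safe to move all operands to a common number of patterns.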

So let's assume (optimistically) that we can produce the result
by calculating the first Pr*Er elements and using the Pr,Er encoding
to imply the rest.  Then:

* (2) means that, when combining multiple input operands with potentially
  different encodings, we can set the number of patterns in the result
  to the least common multiple of the number of patterns in the inputs.
  In this case:

  Pr = least_common_multiple(P1, P2, Ps)

  is a valid number of patterns.

* (3) means that the number of elements per pattern of the result can
  be the maximum of the number of elements per pattern in the inputs.
  (Alternatively, we could always use 3.)  In this case:

  Er = max(E1, E2, Es)

  is a valid number of elements per pattern.
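
The two formulas can be sketched as a small standalone helper.  This is plain C++ rather than GCC code, and the names encoding_shape, lcm_u and result_shape are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical description of an encoding: (Px, Ex) in the notation above.
struct encoding_shape
{
  unsigned npatterns;
  unsigned nelts_per_pattern;
};

// Least common multiple of two non-zero values, via Euclid's algorithm.
static unsigned lcm_u(unsigned a, unsigned b)
{
  assert(a != 0 && b != 0);
  unsigned x = a, y = b;
  while (y != 0)
    {
      unsigned t = x % y;
      x = y;
      y = t;
    }
  return a / x * b;  // x now holds gcd (a, b).
}

// Pr = lcm (P1, P2, Ps) and Er = max (E1, E2, Es).
static encoding_shape result_shape(encoding_shape arg1, encoding_shape arg2,
                                   encoding_shape sel)
{
  unsigned pr = lcm_u(lcm_u(arg1.npatterns, arg2.npatterns), sel.npatterns);
  unsigned er = std::max({arg1.nelts_per_pattern, arg2.nelts_per_pattern,
                          sel.nelts_per_pattern});
  return {pr, er};
}
```

For instance, inputs with (P, E) of (2, 3) and (6, 1) and a selector with (4, 1) would give Pr = 12 and Er = 3 under these formulas.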

So if (R) holds we can compute the result -- for both VLA and VLS -- by
calculating the first Pr*Er elements of the result and using the
encoding to derive the rest.  If (R) doesn't hold then we need the
selector to be constant-length.  We should then fill in the result
based on:

- Pr == number of elements in the result
- Er == 1

But this should be the fallback option, even for VLS.

As far as the arguments go: we should reject CONSTRUCTORs for
variable-length types.  After doing that, we can treat a CONSTRUCTOR
for an N-element vector type by setting the number of patterns to N
and the number of elements per pattern to 1.

Thanks,
Richard

> +	 Since we don't explicitly build the combined vector, we convert
> +	 sel_index to corresponding index for either arg0 or arg1.
> +	 For eg, if sel_index == 7,
> +	 pattern = 7 % 4 == 3.
> +	 Since pattern > input_npatterns, the elem will come from:
> +	 pattern = 3 - input_npatterns ie, pattern 1 from arg1.
> +	 elem_index_in_pattern = 7 / 4 == 1.
> +	 So the actual index of the element in arg1 would be: 1 + (1 * 2) == 3.
> +	 So, sel_index == 7 corresponds to arg1[3], ie, d1.  */
> +
> +      unsigned pattern = sel_index % (2 * input_npatterns);
> +      unsigned elem_index_in_pattern = sel_index / (2 * input_npatterns);
> +      tree arg;
> +      if (pattern < input_npatterns)
> +	arg = arg0;
> +      else
> +	{
> +	  arg = arg1;
> +	  pattern -= input_npatterns;
> +	}
> +
> +      unsigned elem_index = (elem_index_in_pattern * input_npatterns) + pattern;
> +      tree elem;
> +      if (TREE_CODE (arg) == VECTOR_CST)
> +	{
> +	  /* If arg is fixed width vector, and elem_index goes out of range,
> +	     then return NULL_TREE.  */
> +	  if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg)).is_constant ()
> +	      && elem_index > vector_cst_encoded_nelts (arg))
> +	    return NULL_TREE;
> +	  elem = vector_cst_elt (arg, elem_index);
> +	}
> +      else
> +	{
> +	  gcc_assert (TREE_CODE (arg) == CONSTRUCTOR);
> +	  if (elem_index >= CONSTRUCTOR_NELTS (arg))
> +	    return NULL_TREE;
> +	  elem = CONSTRUCTOR_ELT (arg, elem_index)->value;
> +	  if (VECTOR_TYPE_P (TREE_TYPE (elem)))
> +	    return NULL_TREE;
> +	  need_ctor = true;
> +	}
> +
> +      out_elts_builder.quick_push (unshare_expr (elem));
>      }
>  
>    if (need_ctor)
>      {
>        vec<constructor_elt, va_gc> *v;
> -      vec_alloc (v, nelts);
> -      for (i = 0; i < nelts; i++)
> -	CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, out_elts[i]);
> +      vec_alloc (v, out_encoded_nelts);
> +
> +      for (unsigned i = 0; i < out_encoded_nelts; i++)
> +	CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, out_elts_builder[i]);
>        return build_constructor (type, v);
>      }
> -  else
> -    return out_elts.build ();
> +
> +  return out_elts_builder.build ();
>  }
>  
>  /* Try to fold a pointer difference of type TYPE two address expressions of
> @@ -16912,6 +16985,91 @@ test_vec_duplicate_folding ()
>    ASSERT_TRUE (operand_equal_p (dup5_expr, dup5_cst, 0));
>  }
>  
> +static tree
> +build_vec_int_cst (unsigned npatterns, unsigned nelts_per_pattern,
> +		   int *encoded_elems)
> +{
> +  scalar_int_mode int_mode = SCALAR_INT_TYPE_MODE (integer_type_node);
> +  machine_mode vmode = targetm.vectorize.preferred_simd_mode (int_mode);
> +  poly_uint64 nunits = GET_MODE_NUNITS (vmode);
> +  tree vectype = build_vector_type (integer_type_node, nunits);
> +
> +  tree_vector_builder builder (vectype, npatterns, nelts_per_pattern);
> +  for (unsigned i = 0; i < npatterns * nelts_per_pattern; i++)
> +    builder.quick_push (build_int_cst (integer_type_node, encoded_elems[i]));
> +  return builder.build ();
> +}
> +
> +static void
> +vpe_verify_res (tree res, unsigned npatterns, unsigned nelts_per_pattern,
> +		int *encoded_elems)
> +{
> +  ASSERT_TRUE (res != NULL_TREE);
> +  ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == npatterns);
> +  ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == nelts_per_pattern);
> +
> +  for (unsigned i = 0; i < npatterns * nelts_per_pattern; i++)
> +    ASSERT_TRUE (wi::to_wide (VECTOR_CST_ELT (res, i))
> +			      == encoded_elems[i]);
> +}
> +
> +static void
> +test_vec_perm_vla_folding ()
> +{
> +  /* For all cases
> +     arg0: {1, 11, 21, 31, 2, 12, 22, 32, 3, 13, 23, 33, ...}, npatterns == 4, nelts_per_pattern == 3.
> +     arg1: {41, 51, 61, 71, 42, 52, 62, 72, 43, 53, 63, 73 ...}, npatterns == 4, nelts_per_pattern == 3.  */
> +
> +  int arg0_elems[] = { 1, 11, 21, 31, 2, 12, 22, 32, 3, 13, 23, 33 };
> +  tree arg0 = build_vec_int_cst (4, 3, arg0_elems);
> +
> +  int arg1_elems[] = { 41, 51, 61, 71, 42, 52, 62, 72, 43, 53, 63, 73 };
> +  tree arg1 = build_vec_int_cst (4, 3, arg1_elems);
> +
> +  if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
> +      || TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)).is_constant ())
> +    return;
> +
> +  /* Case 1: Dup mask sequence.
> +     mask = {0, 9, 3, 12, ...}, npatterns == 4, nelts_per_pattern == 1.
> +     expected result: {1, 12, 31, 42, ...}, npatterns == 4, nelts_per_pattern == 1.  */
> +  {
> +    int mask_elems[] = {0, 9, 3, 12};
> +    tree mask = build_vec_int_cst (4, 1, mask_elems);
> +    if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask)).is_constant ())
> +      return;
> +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> +    int res_encoded_elems[] = {1, 12, 31, 42};
> +    vpe_verify_res (res, 4, 1, res_encoded_elems);
> +  }
> +
> +  /* Case 2:
> +     mask = {0, 4, 1, 5, 8, 12, 9, 13 ...}, npatterns == 4, nelts_per_pattern == 2.
> +     expected result: {1, 41, 11, 51, 2, 42, 12, 52, ...}, npatterns == 4, nelts_per_pattern == 2.  */
> +  {
> +    int mask_elems[] = {0, 4, 1, 5, 8, 12, 9, 13};
> +    tree mask = build_vec_int_cst (4, 2, mask_elems);
> +    if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask)).is_constant ())
> +      return;
> +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> +    int res_encoded_elems[] = {1, 41, 11, 51, 2, 42, 12, 52};
> +    vpe_verify_res (res, 4, 2, res_encoded_elems);
> +  }
> +
> +  /* Case 3: Stepped mask sequence.
> +     mask = {0, 4, 1, 5, 8, 12, 9, 13, 16, 20, 17, 21}, npatterns == 4, nelts_per_pattern == 3.
> +     expected result = {1, 41, 11, 51, 2, 42, 12, 52, 3, 43, 13, 53 ...}, npatterns == 4, nelts_per_pattern == 3.  */
> +  {
> +    int mask_elems[] = {0, 4, 1, 5, 8, 12, 9, 13, 16, 20, 17, 21};
> +    tree mask = build_vec_int_cst (4, 3, mask_elems);
> +    if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask)).is_constant ())
> +      return;
> +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> +    int res_encoded_elems[] = {1, 41, 11, 51, 2, 42, 12, 52, 3, 43, 13, 53};
> +    vpe_verify_res (res, 4, 3, res_encoded_elems);
> +  }
> +}
> +
>  /* Run all of the selftests within this file.  */
>  
>  void
> @@ -16920,6 +17078,7 @@ fold_const_cc_tests ()
>    test_arithmetic_folding ();
>    test_vector_folding ();
>    test_vec_duplicate_folding ();
> +  test_vec_perm_vla_folding ();
>  }
>  
>  } // namespace selftest

Prathamesh Kulkarni Sept. 9, 2022, 1:59 p.m. UTC | #4

Hi Richard,
Thanks for the suggestions, and sorry for the late response.
I have a couple of very elementary questions:

1. Consider the following inputs to VEC_PERM_EXPR:
op1: P_op1 == 4, E_op1 == 1
{1, 2, 3, 4, ...}

op2: P_op2 == 2, E_op2 == 2
{11, 21, 12, 22, ...}

sel: P_sel == 3, E_sel == 1
{0, 4, 5, ...}

What should the result be in this case?
P_res = lcm(4, 2, 3) == 12
E_res = max(1, 2, 1) == 2.

2. How should we specify the index of an element in sel when it is not
explicitly encoded in the operand?
For example:
op1: npatterns == 2, nelts_per_pattern == 3
{ 1, 0, 2, 0, 3, 0, ... }
op2: npatterns == 6, nelts_per_pattern == 1
{ 11, 12, 13, 14, 15, 16, ...}

In sel, how do we refer to the element with value 4, which would be the
4th element of the first pattern in op1 but is not explicitly encoded?
In op1, 4 would come at index == 6.
However, in sel, index 6 would refer to 11, i.e. op2[0]?
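
To make the question concrete, here is a minimal standalone sketch of how an
encoding implies the elements beyond the explicitly encoded ones.  It is only
an illustration under simplifying assumptions: `encoded_vec` and `elt_at` are
hypothetical names, elements are plain integers, and this is not GCC's own
implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an encoded vector constant: only the first
// npatterns * nelts_per_pattern elements are stored explicitly.
struct encoded_vec
{
  unsigned npatterns;
  unsigned nelts_per_pattern;   // 1 (repeat), 2 (base + repeat) or 3 (stepped)
  std::vector<long> elts;
};

// Return element I of the sequence implied by V.
long
elt_at (const encoded_vec &v, std::size_t i)
{
  assert (v.nelts_per_pattern >= 1 && v.nelts_per_pattern <= 3);
  assert (v.elts.size () == (std::size_t) v.npatterns * v.nelts_per_pattern);

  // Explicitly encoded elements are read directly.
  if (i < v.elts.size ())
    return v.elts[i];

  // Otherwise find the pattern and the position within that pattern.
  std::size_t pattern = i % v.npatterns;
  std::size_t count = i / v.npatterns;
  std::size_t final_idx = (v.nelts_per_pattern - 1) * v.npatterns + pattern;
  long final_elt = v.elts[final_idx];

  // With one or two elements per pattern, the last encoded element repeats.
  if (v.nelts_per_pattern < 3)
    return final_elt;

  // With three elements per pattern, the series continues with a fixed step.
  long step = final_elt - v.elts[final_idx - v.npatterns];
  return final_elt + (long) (count - 2) * step;
}
```

With op1 encoded as `{2, 3, {1, 0, 2, 0, 3, 0}}`, `elt_at (op1, 6)` yields 4,
which is the element asked about above.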

Thanks,
Prathamesh
>
> Thanks,
> Richard
>
> > +      Since we don't explicitly build the combined vector, we convert
> > +      sel_index to corresponding index for either arg0 or arg1.
> > +      For eg, if sel_index == 7,
> > +      pattern = 7 % 4 == 3.
> > +      Since pattern > input_npatterns, the elem will come from:
> > +      pattern = 3 - input_npatterns ie, pattern 1 from arg1.
> > +      elem_index_in_pattern = 7 / 4 == 1.
> > +      So the actual index of the element in arg1 would be: 1 + (1 * 2) == 3.
> > +      So, sel_index == 7 corresponds to arg1[3], ie, d1.  */
> > +
> > +      unsigned pattern = sel_index % (2 * input_npatterns);
> > +      unsigned elem_index_in_pattern = sel_index / (2 * input_npatterns);
> > +      tree arg;
> > +      if (pattern < input_npatterns)
> > +     arg = arg0;
> > +      else
> > +     {
> > +       arg = arg1;
> > +       pattern -= input_npatterns;
> > +     }
> > +
> > +      unsigned elem_index = (elem_index_in_pattern * input_npatterns) + pattern;
> > +      tree elem;
> > +      if (TREE_CODE (arg) == VECTOR_CST)
> > +     {
> > +       /* If arg is fixed width vector, and elem_index goes out of range,
> > +          then return NULL_TREE.  */
> > +       if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg)).is_constant ()
> > +           && elem_index > vector_cst_encoded_nelts (arg))
> > +         return NULL_TREE;
> > +       elem = vector_cst_elt (arg, elem_index);
> > +     }
> > +      else
> > +     {
> > +       gcc_assert (TREE_CODE (arg) == CONSTRUCTOR);
> > +       if (elem_index >= CONSTRUCTOR_NELTS (arg))
> > +         return NULL_TREE;
> > +       elem = CONSTRUCTOR_ELT (arg, elem_index)->value;
> > +       if (VECTOR_TYPE_P (TREE_TYPE (elem)))
> > +         return NULL_TREE;
> > +       need_ctor = true;
> > +     }
> > +
> > +      out_elts_builder.quick_push (unshare_expr (elem));
> >      }
> >
> >    if (need_ctor)
> >      {
> >        vec<constructor_elt, va_gc> *v;
> > -      vec_alloc (v, nelts);
> > -      for (i = 0; i < nelts; i++)
> > -     CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, out_elts[i]);
> > +      vec_alloc (v, out_encoded_nelts);
> > +
> > +      for (unsigned i = 0; i < out_encoded_nelts; i++)
> > +     CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, out_elts_builder[i]);
> >        return build_constructor (type, v);
> >      }
> > -  else
> > -    return out_elts.build ();
> > +
> > +  return out_elts_builder.build ();
> >  }
> >
> >  /* Try to fold a pointer difference of type TYPE two address expressions of
> > @@ -16912,6 +16985,91 @@ test_vec_duplicate_folding ()
> >    ASSERT_TRUE (operand_equal_p (dup5_expr, dup5_cst, 0));
> >  }
> >
> > +static tree
> > +build_vec_int_cst (unsigned npatterns, unsigned nelts_per_pattern,
> > +                int *encoded_elems)
> > +{
> > +  scalar_int_mode int_mode = SCALAR_INT_TYPE_MODE (integer_type_node);
> > +  machine_mode vmode = targetm.vectorize.preferred_simd_mode (int_mode);
> > +  poly_uint64 nunits = GET_MODE_NUNITS (vmode);
> > +  tree vectype = build_vector_type (integer_type_node, nunits);
> > +
> > +  tree_vector_builder builder (vectype, npatterns, nelts_per_pattern);
> > +  for (unsigned i = 0; i < npatterns * nelts_per_pattern; i++)
> > +    builder.quick_push (build_int_cst (integer_type_node, encoded_elems[i]));
> > +  return builder.build ();
> > +}
> > +
> > +static void
> > +vpe_verify_res (tree res, unsigned npatterns, unsigned nelts_per_pattern,
> > +             int *encoded_elems)
> > +{
> > +  ASSERT_TRUE (res != NULL_TREE);
> > +  ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == npatterns);
> > +  ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == nelts_per_pattern);
> > +
> > +  for (unsigned i = 0; i < npatterns * nelts_per_pattern; i++)
> > +    ASSERT_TRUE (wi::to_wide (VECTOR_CST_ELT (res, i))
> > +                           == encoded_elems[i]);
> > +}
> > +
> > +static void
> > +test_vec_perm_vla_folding ()
> > +{
> > +  /* For all cases
> > +     arg0: {1, 11, 21, 31, 2, 12, 22, 32, 3, 13, 23, 33, ...}, npatterns == 4, nelts_per_pattern == 3.
> > +     arg1: {41, 51, 61, 71, 42, 52, 62, 72, 43, 53, 63, 73 ...}, npatterns == 4, nelts_per_pattern == 3.  */
> > +
> > +  int arg0_elems[] = { 1, 11, 21, 31, 2, 12, 22, 32, 3, 13, 23, 33 };
> > +  tree arg0 = build_vec_int_cst (4, 3, arg0_elems);
> > +
> > +  int arg1_elems[] = { 41, 51, 61, 71, 42, 52, 62, 72, 43, 53, 63, 73 };
> > +  tree arg1 = build_vec_int_cst (4, 3, arg1_elems);
> > +
> > +  if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
> > +      || TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)).is_constant ())
> > +    return;
> > +
> > +  /* Case 1: Dup mask sequence.
> > +     mask = {0, 9, 3, 11, ...}, npatterns == 4, nelts_per_pattern == 1.
> > +     expected result: {1, 21, 31, 32, ...}, npatterns == 4, nelts_per_pattern == 1.  */
> > +  {
> > +    int mask_elems[] = {0, 9, 3, 12};
> > +    tree mask = build_vec_int_cst (4, 1, mask_elems);
> > +    if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask)).is_constant ())
> > +      return;
> > +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> > +    int res_encoded_elems[] = {1, 12, 31, 42};
> > +    vpe_verify_res (res, 4, 1, res_encoded_elems);
> > +  }
> > +
> > +  /* Case 2:
> > +     mask = {0, 4, 1, 5, 8, 12, 9, 13 ...}, npatterns == 4, nelts_per_pattern == 2.
> > +     expected result: {1, 41, 11, 51, 2, 42, 12, 52, ...}, npatterns == 4, nelts_per_pattern == 2.  */
> > +  {
> > +    int mask_elems[] = {0, 4, 1, 5, 8, 12, 9, 13};
> > +    tree mask = build_vec_int_cst (4, 2, mask_elems);
> > +    if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask)).is_constant ())
> > +      return;
> > +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> > +    int res_encoded_elems[] = {1, 41, 11, 51, 2, 42, 12, 52};
> > +    vpe_verify_res (res, 4, 2, res_encoded_elems);
> > +  }
> > +
> > +  /* Case 3: Stepped mask sequence.
> > +     mask = {0, 4, 1, 5, 8, 12, 9, 13, 16, 20, 17, 21}, npatterns == 4, nelts_per_pattern == 3.
> > +     expected result = {1, 41, 11, 51, 2, 42, 12, 52, 3, 43, 13, 53 ...}, npatterns == 4, nelts_per_pattern == 3.  */
> > +  {
> > +    int mask_elems[] = {0, 4, 1, 5, 8, 12, 9, 13, 16, 20, 17, 21};
> > +    tree mask = build_vec_int_cst (4, 3, mask_elems);
> > +    if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask)).is_constant ())
> > +      return;
> > +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> > +    int res_encoded_elems[] = {1, 41, 11, 51, 2, 42, 12, 52, 3, 43, 13, 53};
> > +    vpe_verify_res (res, 4, 3, res_encoded_elems);
> > +  }
> > +}
> > +
> >  /* Run all of the selftests within this file.  */
> >
> >  void
> > @@ -16920,6 +17078,7 @@ fold_const_cc_tests ()
> >    test_arithmetic_folding ();
> >    test_vector_folding ();
> >    test_vec_duplicate_folding ();
> > +  test_vec_perm_vla_folding ();
> >  }
> >
> >  } // namespace selftest

Richard Sandiford Sept. 12, 2022, 2:27 p.m. UTC | #5

Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> On Mon, 5 Sept 2022 at 15:51, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Sorry for the slow reply.  I wrote a response a couple of weeks ago
>> but I think it got lost in a machine outage.
>>
>> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> > Hi,
>> > The attached prototype patch extends fold_vec_perm to fold VEC_PERM_EXPR
>> > in VLA manner, and currently handles the following cases:
>> > (a) fixed len arg0, arg1 and fixed len sel.
>> > (b) fixed len arg0, arg1 and vla sel
>> > (c) vla arg0, arg1 and vla sel with arg0, arg1 being VECTOR_CST.
>> >
>> > It seems to work for the VLA tests written in
>> > test_vec_perm_vla_folding (), and am working thru the fallout observed in
>> > regression testing.
>> >
>> > Does the approach taken in the patch look in the right direction ?
>> > I am not sure if I have got the conversion from "sel_index"
>> > to index of either arg0, or arg1 entirely correct.
>> > I would be grateful for suggestions on the patch.
>> >
>> > Thanks,
>> > Prathamesh
>> >
>> > diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
>> > index 4f4ec81c8d4..5e12260211e 100644
>> > --- a/gcc/fold-const.cc
>> > +++ b/gcc/fold-const.cc
>> > @@ -85,6 +85,9 @@ along with GCC; see the file COPYING3.  If not see
>> >  #include "vec-perm-indices.h"
>> >  #include "asan.h"
>> >  #include "gimple-range.h"
>> > +#include "tree-pretty-print.h"
>> > +#include "gimple-pretty-print.h"
>> > +#include "print-tree.h"
>> >
>> >  /* Nonzero if we are folding constants inside an initializer or a C++
>> >     manifestly-constant-evaluated context; zero otherwise.
>> > @@ -10496,40 +10499,6 @@ fold_mult_zconjz (location_t loc, tree type, tree expr)
>> >                         build_zero_cst (itype));
>> >  }
>> >
>> > -
>> > -/* Helper function for fold_vec_perm.  Store elements of VECTOR_CST or
>> > -   CONSTRUCTOR ARG into array ELTS, which has NELTS elements, and return
>> > -   true if successful.  */
>> > -
>> > -static bool
>> > -vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
>> > -{
>> > -  unsigned HOST_WIDE_INT i, nunits;
>> > -
>> > -  if (TREE_CODE (arg) == VECTOR_CST
>> > -      && VECTOR_CST_NELTS (arg).is_constant (&nunits))
>> > -    {
>> > -      for (i = 0; i < nunits; ++i)
>> > -     elts[i] = VECTOR_CST_ELT (arg, i);
>> > -    }
>> > -  else if (TREE_CODE (arg) == CONSTRUCTOR)
>> > -    {
>> > -      constructor_elt *elt;
>> > -
>> > -      FOR_EACH_VEC_SAFE_ELT (CONSTRUCTOR_ELTS (arg), i, elt)
>> > -     if (i >= nelts || TREE_CODE (TREE_TYPE (elt->value)) == VECTOR_TYPE)
>> > -       return false;
>> > -     else
>> > -       elts[i] = elt->value;
>> > -    }
>> > -  else
>> > -    return false;
>> > -  for (; i < nelts; i++)
>> > -    elts[i]
>> > -      = fold_convert (TREE_TYPE (TREE_TYPE (arg)), integer_zero_node);
>> > -  return true;
>> > -}
>> > -
>> >  /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL
>> >     selector.  Return the folded VECTOR_CST or CONSTRUCTOR if successful,
>> >     NULL_TREE otherwise.  */
>> > @@ -10537,45 +10506,149 @@ vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
>> >  tree
>> >  fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
>> >  {
>> > -  unsigned int i;
>> > -  unsigned HOST_WIDE_INT nelts;
>> > -  bool need_ctor = false;
>> > +  poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
>> > +  poly_uint64 arg1_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1));
>> > +
>> > +  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type),
>> > +                     sel.length ()));
>> > +  gcc_assert (known_eq (arg0_len, arg1_len));
>> >
>> > -  if (!sel.length ().is_constant (&nelts))
>> > -    return NULL_TREE;
>> > -  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts)
>> > -           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts)
>> > -           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts));
>> >    if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type)
>> >        || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type))
>> >      return NULL_TREE;
>> >
>> > -  tree *in_elts = XALLOCAVEC (tree, nelts * 2);
>> > -  if (!vec_cst_ctor_to_array (arg0, nelts, in_elts)
>> > -      || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts))
>> > +  unsigned input_npatterns = 0;
>> > +  unsigned out_npatterns = sel.encoding ().npatterns ();
>> > +  unsigned out_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
>> > +
>> > +  /* FIXME: How to reshape fixed length vector_cst, so that
>> > +     npatterns == vector.length () and nelts_per_pattern == 1 ?
>> > +     It seems the vector is canonicalized to minimize npatterns.  */
>> > +
>> > +  if (arg0_len.is_constant ())
>> > +    {
>> > +      /* If arg0, arg1 are fixed width vectors, and sel is VLA,
>> > +         ensure that it is a dup sequence and has same period
>> > +      as input vector.  */
>> > +
>> > +      if (!sel.length ().is_constant ()
>> > +       && (sel.encoding ().nelts_per_pattern () > 2
>> > +           || !known_eq (arg0_len, sel.encoding ().npatterns ())))
>> > +     return NULL_TREE;
>> > +
>> > +      input_npatterns = arg0_len.to_constant ();
>> > +
>> > +      if (sel.length ().is_constant ())
>> > +     {
>> > +       out_npatterns = sel.length ().to_constant ();
>> > +       out_nelts_per_pattern = 1;
>> > +     }
>> > +    }
>> > +  else if (TREE_CODE (arg0) == VECTOR_CST
>> > +        && TREE_CODE (arg1) == VECTOR_CST)
>> > +    {
>> > +      unsigned npatterns = VECTOR_CST_NPATTERNS (arg0);
>> > +      unsigned input_nelts_per_pattern = VECTOR_CST_NELTS_PER_PATTERN (arg0);
>> > +
>> > +      /* If arg0, arg1 are VLA, then ensure that,
>> > +      (a) sel also has same length as input vectors.
>> > +      (b) arg0 and arg1 have same encoding.
>> > +      (c) sel has same number of patterns as input vectors.
>> > +      (d) if sel is a stepped sequence, then it has same
>> > +          encoding as input vectors.  */
>> > +
>> > +      if (!known_eq (arg0_len, sel.length ())
>> > +       || npatterns != VECTOR_CST_NPATTERNS (arg1)
>> > +       || input_nelts_per_pattern != VECTOR_CST_NELTS_PER_PATTERN (arg1)
>> > +       || npatterns != sel.encoding ().npatterns ()
>> > +       || (sel.encoding ().nelts_per_pattern () > 2
>> > +           && sel.encoding ().nelts_per_pattern () != input_nelts_per_pattern))
>> > +     return NULL_TREE;
>>
>> This seems too restrictive.  More below.
>>
>> > +
>> > +      input_npatterns = npatterns;
>> > +    }
>> > +  else
>> >      return NULL_TREE;
>> >
>> > -  tree_vector_builder out_elts (type, nelts, 1);
>> > -  for (i = 0; i < nelts; i++)
>> > +  tree_vector_builder out_elts_builder (type, out_npatterns,
>> > +                                     out_nelts_per_pattern);
>> > +  bool need_ctor = false;
>> > +  unsigned out_encoded_nelts = out_npatterns * out_nelts_per_pattern;
>> > +
>> > +  for (unsigned i = 0; i < out_encoded_nelts; i++)
>> >      {
>> > -      HOST_WIDE_INT index;
>> > -      if (!sel[i].is_constant (&index))
>> > +      HOST_WIDE_INT sel_index;
>> > +      if (!sel[i].is_constant (&sel_index))
>> >       return NULL_TREE;
>> > -      if (!CONSTANT_CLASS_P (in_elts[index]))
>> > -     need_ctor = true;
>> > -      out_elts.quick_push (unshare_expr (in_elts[index]));
>> > +
>> > +      /* Convert sel_index to index of either arg0 or arg1.
>> > +      For eg:
>> > +      arg0: {a0, b0, a1, b1, a1 + S, b1 + S, ...}
>> > +      arg1: {c0, d0, c1, d1, c1 + S, d1 + S, ...}
>> > +      Both have npatterns == 2, nelts_per_pattern == 3.
>> > +      Then the combined vector would be:
>> > +      {a0, b0, c0, d0, a1, b1, c1, d1, a1 + S, b1 + S, c1 + S, d1 + S, ... }
>> > +      This combined vector will have,
>> > +      npatterns = 2 * input_npatterns == 4.
>> > +      sel_index is used to index this above combined vector.
>>
>> There's no interleaving of the arguments though.  The selector selects from:
>>
>> {a0, b0, a1, b1, a1 + S, b1 + S, ..., c0, d0, c1, d1, c1 + S, d1 + S, ...}
>>
>> The VLA encoding encodes the first N patterns explicitly.  The
>> npatterns/nelts_per_pattern values then describe how to extend that
>> initial sequence to an arbitrary number of elements.  So when performing
>> an operation on (potentially) variable-length vectors, the question is:
>>
>> * Can we work out an initial sequence and npatterns/nelts_per_pattern
>>   pair that will be correct for all elements of the result?
>>
>> This depends on the operation that we're performing.  E.g. it's
>> different for unary operations (vector_builder::new_unary_operation)
>> and binary operations (vector_builder::new_binary_operation).  It also
>> varies between unary operations and between binary operations, hence
>> the allow_stepped_p parameters.
>>
>> For VEC_PERM_EXPR, I think the key requirement is that:
>>
>> (R) Each individual selector pattern must always select from the same vector.
>>
>> Whether this condition is met depends both on the pattern itself and on
>> the number of patterns that it's combined with.
>>
>> E.g. suppose we had the selector pattern:
>>
>>   { 0, 1, 4, ... }   i.e. 3x - 2 for x > 0
>>
>> If the arguments and selector are n elements then this pattern on its
>> own would select from more than one argument if 3(n-1) - 2 >= n.
>> This is clearly true for large enough n.  So if n is variable then
>> we cannot represent this.
>>
>> If the pattern above is one of two patterns, so interleaved as:
>>
>>      { 0, _, 1, _, 4, _, ... }  o=0
>>   or { _, 0, _, 1, _, 4, ... }  o=1
>>
>> then the pattern would select from more than one argument if
>> 3(n/2-1) - 2 + o >= n.  This too would be a problem for variable n.
>>
>> But if the pattern above is one of four patterns then it selects
>> from more than one argument if 3(n/4-1) - 2 + o >= n.  This is not
>> true for any valid n or o, so the pattern is OK.
>>
>> So let's define some ad hoc terminology:
>>
>> * Px is the number of patterns in x
>> * Ex is the number of elements per pattern in x
>>
>> where x can be:
>>
>> * 1: first argument
>> * 2: second argument
>> * s: selector
>> * r: result
>>
>> Then:
>>
>> (1) The number of elements encoded explicitly for x is Ex*Px
>>
>> (2) The explicit encoding can be used to produce a sequence of N*Ex*Px
>>     elements for any integer N.  This extended sequence can be reencoded
>>     as having N*Px patterns, with Ex staying the same.
>>
>> (3) If Ex < 3, Ex can be increased by 1 by repeating the final Px elements
>>     of the explicit encoding.
>>
>> So let's assume (optimistically) that we can produce the result
>> by calculating the first Pr*Er elements and using the Pr,Er encoding
>> to imply the rest.  Then:
>>
>> * (2) means that, when combining multiple input operands with potentially
>>   different encodings, we can set the number of patterns in the result
>>   to the least common multiple of the number of patterns in the inputs.
>>   In this case:
>>
>>   Pr = least_common_multiple(P1, P2, Ps)
>>
>>   is a valid number of patterns.
>>
>> * (3) means that the number of elements per pattern of the result can
>>   be the maximum of the number of elements per pattern in the inputs.
>>   (Alternatively, we could always use 3.)  In this case:
>>
>>   Er = max(E1, E2, Es)
>>
>>   is a valid number of elements per pattern.
>>
>> So if (R) holds we can compute the result -- for both VLA and VLS -- by
>> calculating the first Pr*Er elements of the result and using the
>> encoding to derive the rest.  If (R) doesn't hold then we need the
>> selector to be constant-length.  We should then fill in the result
>> based on:
>>
>> - Pr == number of elements in the result
>> - Er == 1
>>
>> But this should be the fallback option, even for VLS.
>>
>> As far as the arguments go: we should reject CONSTRUCTORs for
>> variable-length types.  After doing that, we can treat a CONSTRUCTOR
>> for an N-element vector type by setting the number of patterns to N
>> and the number of elements per pattern to 1.
> Hi Richard,
> Thanks for the suggestions, and sorry for the late response.
> I have a couple of very elementary questions:
>
> 1. Consider the following inputs to VEC_PERM_EXPR:
> op1: P_op1 == 4, E_op1 == 1
> {1, 2, 3, 4, ...}
>
> op2: P_op2 == 2, E_op2 == 2
> {11, 21, 12, 22, ...}
>
> sel: P_sel == 3, E_sel == 1
> {0, 4, 5, ...}
>
> What should the result be in this case?
> P_res = lcm(4, 2, 3) == 12
> E_res = max(1, 2, 1) == 2.

Yeah, that looks right.  Of course, since sel is just repeating
every three elements, it could just be P_res==3, E_res==1,
but the vector_builder would do that optimisation for us.

(I'm not sure whether we'd see a P==3 encoding in practice,
but perhaps it's possible.)

If sel was P_sel==1, E_sel==3 (so a stepped encoding rather than
repeating every three elements) then:

P_res = lcm(4, 2) == 4
E_res = max(1, 2, 3) == 3

which also looks like it would give the right encoding.
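
As a rough sketch of the two rules used above (Pr as the least common multiple
of the input pattern counts, Er as the maximum of the elements per pattern),
something like the following would compute the candidate result encoding.
The `encoding` struct and `result_encoding` function are hypothetical names
for illustration only and are not part of GCC; the sketch assumes C++17 for
`std::lcm`.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>

// Hypothetical description of a vector encoding.
struct encoding
{
  unsigned npatterns;           // Px
  unsigned nelts_per_pattern;   // Ex
};

// Candidate encoding for the result of permuting OP1 and OP2 with SEL.
// This only picks Pr and Er; it does not check requirement (R).
encoding
result_encoding (encoding op1, encoding op2, encoding sel)
{
  assert (op1.npatterns > 0 && op2.npatterns > 0 && sel.npatterns > 0);

  // Pr = least_common_multiple (P1, P2, Ps)
  unsigned pr = std::lcm (std::lcm (op1.npatterns, op2.npatterns),
                          sel.npatterns);

  // Er = max (E1, E2, Es)
  unsigned er = std::max ({op1.nelts_per_pattern, op2.nelts_per_pattern,
                           sel.nelts_per_pattern});
  return {pr, er};
}
```

For the two cases discussed here, `result_encoding ({4, 1}, {2, 2}, {3, 1})`
gives Pr == 12, Er == 2, and `result_encoding ({4, 1}, {2, 2}, {1, 3})` gives
Pr == 4, Er == 3.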

> 2. How should we specify the index of an element in sel when it is not
> explicitly encoded in the operand?
> For example:
> op1: npatterns == 2, nelts_per_pattern == 3
> { 1, 0, 2, 0, 3, 0, ... }
> op2: npatterns == 6, nelts_per_pattern == 1
> { 11, 12, 13, 14, 15, 16, ...}
>
> In sel, how do we refer to the element with value 4, which would be the
> 4th element of the first pattern in op1 but is not explicitly encoded?
> In op1, 4 would come at index == 6.
> However, in sel, index 6 would refer to 11, i.e. op2[0]?

What index 6 refers to depends on the length of op1.
If the length of op1 is 4 at runtime then index 6 refers to op2[2].
If the length of op1 is 6 then index 6 refers to op2[0].
If the length of op1 is 8 then index 6 refers to op1[6], etc.
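
The three cases above follow from treating the selector as an index into op1
followed by op2.  A minimal sketch, where `selected` and `resolve` are
hypothetical names and LEN stands for the runtime length of each input:

```cpp
#include <cassert>

// Which input a selector index refers to, and the index within that input.
struct selected
{
  unsigned operand;        // 1 for op1, 2 for op2
  unsigned long index;
};

// Resolve SEL_INDEX against two inputs that each have LEN elements at
// runtime.  Assumes SEL_INDEX is already in the range [0, 2 * LEN).
selected
resolve (unsigned long sel_index, unsigned long len)
{
  assert (len > 0 && sel_index < 2 * len);
  if (sel_index < len)
    return {1, sel_index};
  return {2, sel_index - len};
}
```

So `resolve (6, 4)` is op2[2], `resolve (6, 6)` is op2[0] and `resolve (6, 8)`
is op1[6]: the same selector value lands in different places depending on a
length that is not known at compile time.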

This comes back to (R) above.  We need to be able to prove at compile
time that each pattern selects from the same input vectors (for all
elements, not just the encoded elements).  If we can't prove that
then we can't fold for variable-length vectors.
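
To illustrate what (R) means for one stepped selector pattern, the sketch
below checks a single concrete length.  That is not a compile-time proof; a
real implementation would need to reason about all possible lengths at once.
The names `stepped_pattern`, `pattern_elt` and `selects_one_input_p` are
hypothetical, and selector wrap-around is ignored for simplicity.

```cpp
#include <cassert>

// One stepped selector pattern: a0, a1, a2, a2 + step, a2 + 2 * step, ...
// where step == a2 - a1.
struct stepped_pattern
{
  long a0, a1, a2;
};

// Element X of the pattern, counting only this pattern's own elements.
long
pattern_elt (const stepped_pattern &p, unsigned long x)
{
  if (x == 0)
    return p.a0;
  if (x == 1)
    return p.a1;
  return p.a2 + (long) (x - 2) * (p.a2 - p.a1);
}

// Return true if, for the concrete length LEN shared by the selector and
// both inputs, every element of P selects from the same input.  NPATTERNS
// is the number of interleaved selector patterns, so P contributes
// LEN / NPATTERNS elements.
bool
selects_one_input_p (const stepped_pattern &p, unsigned npatterns,
                     unsigned long len)
{
  assert (npatterns > 0 && len % npatterns == 0);
  unsigned long count = len / npatterns;
  bool first_from_op2 = false;
  for (unsigned long x = 0; x < count; ++x)
    {
      bool from_op2 = pattern_elt (p, x) >= (long) len;
      if (x == 0)
        first_from_op2 = from_op2;
      else if (from_op2 != first_from_op2)
        return false;
    }
  return true;
}
```

Using the { 0, 1, 4, ... } pattern from earlier in the thread: as the only
pattern it already crosses into the second input at length 4; as one of two
patterns it is fine at length 8 but crosses at length 10; as one of four
patterns it stays within the first input for the lengths tried.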

Thanks,
Richard

Prathamesh Kulkarni Sept. 15, 2022, 12:26 p.m. UTC | #6

On Mon, 12 Sept 2022 at 19:57, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > On Mon, 5 Sept 2022 at 15:51, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Sorry for the slow reply.  I wrote a response a couple of weeks ago
> >> but I think it got lost in a machine outage.
> >>
> >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> > Hi,
> >> > The attached prototype patch extends fold_vec_perm to fold VEC_PERM_EXPR
> >> > in VLA manner, and currently handles the following cases:
> >> > (a) fixed len arg0, arg1 and fixed len sel.
> >> > (b) fixed len arg0, arg1 and vla sel
> >> > (c) vla arg0, arg1 and vla sel with arg0, arg1 being VECTOR_CST.
> >> >
> >> > It seems to work for the VLA tests written in
> >> > test_vec_perm_vla_folding (), and am working thru the fallout observed in
> >> > regression testing.
> >> >
> >> > Does the approach taken in the patch look in the right direction ?
> >> > I am not sure if I have got the conversion from "sel_index"
> >> > to index of either arg0, or arg1 entirely correct.
> >> > I would be grateful for suggestions on the patch.
> >> >
> >> > Thanks,
> >> > Prathamesh
> >> >
> >> > diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> >> > index 4f4ec81c8d4..5e12260211e 100644
> >> > --- a/gcc/fold-const.cc
> >> > +++ b/gcc/fold-const.cc
> >> > @@ -85,6 +85,9 @@ along with GCC; see the file COPYING3.  If not see
> >> >  #include "vec-perm-indices.h"
> >> >  #include "asan.h"
> >> >  #include "gimple-range.h"
> >> > +#include "tree-pretty-print.h"
> >> > +#include "gimple-pretty-print.h"
> >> > +#include "print-tree.h"
> >> >
> >> >  /* Nonzero if we are folding constants inside an initializer or a C++
> >> >     manifestly-constant-evaluated context; zero otherwise.
> >> > @@ -10496,40 +10499,6 @@ fold_mult_zconjz (location_t loc, tree type, tree expr)
> >> >                         build_zero_cst (itype));
> >> >  }
> >> >
> >> > -
> >> > -/* Helper function for fold_vec_perm.  Store elements of VECTOR_CST or
> >> > -   CONSTRUCTOR ARG into array ELTS, which has NELTS elements, and return
> >> > -   true if successful.  */
> >> > -
> >> > -static bool
> >> > -vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
> >> > -{
> >> > -  unsigned HOST_WIDE_INT i, nunits;
> >> > -
> >> > -  if (TREE_CODE (arg) == VECTOR_CST
> >> > -      && VECTOR_CST_NELTS (arg).is_constant (&nunits))
> >> > -    {
> >> > -      for (i = 0; i < nunits; ++i)
> >> > -     elts[i] = VECTOR_CST_ELT (arg, i);
> >> > -    }
> >> > -  else if (TREE_CODE (arg) == CONSTRUCTOR)
> >> > -    {
> >> > -      constructor_elt *elt;
> >> > -
> >> > -      FOR_EACH_VEC_SAFE_ELT (CONSTRUCTOR_ELTS (arg), i, elt)
> >> > -     if (i >= nelts || TREE_CODE (TREE_TYPE (elt->value)) == VECTOR_TYPE)
> >> > -       return false;
> >> > -     else
> >> > -       elts[i] = elt->value;
> >> > -    }
> >> > -  else
> >> > -    return false;
> >> > -  for (; i < nelts; i++)
> >> > -    elts[i]
> >> > -      = fold_convert (TREE_TYPE (TREE_TYPE (arg)), integer_zero_node);
> >> > -  return true;
> >> > -}
> >> > -
> >> >  /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL
> >> >     selector.  Return the folded VECTOR_CST or CONSTRUCTOR if successful,
> >> >     NULL_TREE otherwise.  */
> >> > @@ -10537,45 +10506,149 @@ vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
> >> >  tree
> >> >  fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
> >> >  {
> >> > -  unsigned int i;
> >> > -  unsigned HOST_WIDE_INT nelts;
> >> > -  bool need_ctor = false;
> >> > +  poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> >> > +  poly_uint64 arg1_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1));
> >> > +
> >> > +  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type),
> >> > +                     sel.length ()));
> >> > +  gcc_assert (known_eq (arg0_len, arg1_len));
> >> >
> >> > -  if (!sel.length ().is_constant (&nelts))
> >> > -    return NULL_TREE;
> >> > -  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts)
> >> > -           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts)
> >> > -           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts));
> >> >    if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type)
> >> >        || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type))
> >> >      return NULL_TREE;
> >> >
> >> > -  tree *in_elts = XALLOCAVEC (tree, nelts * 2);
> >> > -  if (!vec_cst_ctor_to_array (arg0, nelts, in_elts)
> >> > -      || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts))
> >> > +  unsigned input_npatterns = 0;
> >> > +  unsigned out_npatterns = sel.encoding ().npatterns ();
> >> > +  unsigned out_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
> >> > +
> >> > +  /* FIXME: How to reshape fixed length vector_cst, so that
> >> > +     npatterns == vector.length () and nelts_per_pattern == 1 ?
> >> > +     It seems the vector is canonicalized to minimize npatterns.  */
> >> > +
> >> > +  if (arg0_len.is_constant ())
> >> > +    {
> >> > +      /* If arg0, arg1 are fixed width vectors, and sel is VLA,
> >> > +         ensure that it is a dup sequence and has same period
> >> > +      as input vector.  */
> >> > +
> >> > +      if (!sel.length ().is_constant ()
> >> > +       && (sel.encoding ().nelts_per_pattern () > 2
> >> > +           || !known_eq (arg0_len, sel.encoding ().npatterns ())))
> >> > +     return NULL_TREE;
> >> > +
> >> > +      input_npatterns = arg0_len.to_constant ();
> >> > +
> >> > +      if (sel.length ().is_constant ())
> >> > +     {
> >> > +       out_npatterns = sel.length ().to_constant ();
> >> > +       out_nelts_per_pattern = 1;
> >> > +     }
> >> > +    }
> >> > +  else if (TREE_CODE (arg0) == VECTOR_CST
> >> > +        && TREE_CODE (arg1) == VECTOR_CST)
> >> > +    {
> >> > +      unsigned npatterns = VECTOR_CST_NPATTERNS (arg0);
> >> > +      unsigned input_nelts_per_pattern = VECTOR_CST_NELTS_PER_PATTERN (arg0);
> >> > +
> >> > +      /* If arg0, arg1 are VLA, then ensure that,
> >> > +      (a) sel also has same length as input vectors.
> >> > +      (b) arg0 and arg1 have same encoding.
> >> > +      (c) sel has same number of patterns as input vectors.
> >> > +      (d) if sel is a stepped sequence, then it has same
> >> > +          encoding as input vectors.  */
> >> > +
> >> > +      if (!known_eq (arg0_len, sel.length ())
> >> > +       || npatterns != VECTOR_CST_NPATTERNS (arg1)
> >> > +       || input_nelts_per_pattern != VECTOR_CST_NELTS_PER_PATTERN (arg1)
> >> > +       || npatterns != sel.encoding ().npatterns ()
> >> > +       || (sel.encoding ().nelts_per_pattern () > 2
> >> > +           && sel.encoding ().nelts_per_pattern () != input_nelts_per_pattern))
> >> > +     return NULL_TREE;
> >>
> >> This seems too restrictive.  More below.
> >>
> >> > +
> >> > +      input_npatterns = npatterns;
> >> > +    }
> >> > +  else
> >> >      return NULL_TREE;
> >> >
> >> > -  tree_vector_builder out_elts (type, nelts, 1);
> >> > -  for (i = 0; i < nelts; i++)
> >> > +  tree_vector_builder out_elts_builder (type, out_npatterns,
> >> > +                                     out_nelts_per_pattern);
> >> > +  bool need_ctor = false;
> >> > +  unsigned out_encoded_nelts = out_npatterns * out_nelts_per_pattern;
> >> > +
> >> > +  for (unsigned i = 0; i < out_encoded_nelts; i++)
> >> >      {
> >> > -      HOST_WIDE_INT index;
> >> > -      if (!sel[i].is_constant (&index))
> >> > +      HOST_WIDE_INT sel_index;
> >> > +      if (!sel[i].is_constant (&sel_index))
> >> >       return NULL_TREE;
> >> > -      if (!CONSTANT_CLASS_P (in_elts[index]))
> >> > -     need_ctor = true;
> >> > -      out_elts.quick_push (unshare_expr (in_elts[index]));
> >> > +
> >> > +      /* Convert sel_index to index of either arg0 or arg1.
> >> > +      For eg:
> >> > +      arg0: {a0, b0, a1, b1, a1 + S, b1 + S, ...}
> >> > +      arg1: {c0, d0, c1, d1, c1 + S, d1 + S, ...}
> >> > +      Both have npatterns == 2, nelts_per_pattern == 3.
> >> > +      Then the combined vector would be:
> >> > +      {a0, b0, c0, d0, a1, b1, c1, d1, a1 + S, b1 + S, c1 + S, d1 + S, ... }
> >> > +      This combined vector will have,
> >> > +      npatterns = 2 * input_npatterns == 4.
> >> > +      sel_index is used to index this above combined vector.
> >>
> >> There's no interleaving of the arguments though.  The selector selects from:
> >>
> >> {a0, b0, a1, b1, a1 + S, b1 + S, ..., c0, d0, c1, d1, c1 + S, d1 + S, ...}
> >>
> >> The VLA encoding encodes the first N patterns explicitly.  The
> >> npatterns/nelts_per_pattern values then describe how to extend that
> >> initial sequence to an arbitrary number of elements.  So when performing
> >> an operation on (potentially) variable-length vectors, the question is:
> >>
> >> * Can we work out an initial sequence and npatterns/nelts_per_pattern
> >>   pair that will be correct for all elements of the result?
> >>
> >> This depends on the operation that we're performing.  E.g. it's
> >> different for unary operations (vector_builder::new_unary_operation)
> >> and binary operations (vector_builder::new_binary_operation).  It also
> >> varies between unary operations and between binary operations, hence
> >> the allow_stepped_p parameters.
> >>
> >> For VEC_PERM_EXPR, I think the key requirement is that:
> >>
> >> (R) Each individual selector pattern must always select from the same vector.
> >>
> >> Whether this condition is met depends both on the pattern itself and on
> >> the number of patterns that it's combined with.
> >>
> >> E.g. suppose we had the selector pattern:
> >>
> >>   { 0, 1, 4, ... }   i.e. 3x - 2 for x > 0
> >>
> >> If the arguments and selector are n elements then this pattern on its
> >> own would select from more than one argument if 3(n-1) - 2 >= n.
> >> This is clearly true for large enough n.  So if n is variable then
> >> we cannot represent this.
> >>
> >> If the pattern above is one of two patterns, so interleaved as:
> >>
> >>      { 0, _, 1, _, 4, _, ... }  o=0
> >>   or { _, 0, _, 1, _, 4, ... }  o=1
> >>
> >> then the pattern would select from more than one argument if
> >> 3(n/2-1) - 2 + o >= n.  This too would be a problem for variable n.
> >>
> >> But if the pattern above is one of four patterns then it selects
> >> from more than one argument if 3(n/4-1) - 2 + o >= n.  This is not
> >> true for any valid n or o, so the pattern is OK.
> >>
> >> So let's define some ad hoc terminology:
> >>
> >> * Px is the number of patterns in x
> >> * Ex is the number of elements per pattern in x
> >>
> >> where x can be:
> >>
> >> * 1: first argument
> >> * 2: second argument
> >> * s: selector
> >> * r: result
> >>
> >> Then:
> >>
> >> (1) The number of elements encoded explicitly for x is Ex*Px
> >>
> >> (2) The explicit encoding can be used to produce a sequence of N*Ex*Px
> >>     elements for any integer N.  This extended sequence can be reencoded
> >>     as having N*Px patterns, with Ex staying the same.
> >>
> >> (3) If Ex < 3, Ex can be increased by 1 by repeating the final Px elements
> >>     of the explicit encoding.
> >>
> >> So let's assume (optimistically) that we can produce the result
> >> by calculating the first Pr*Er elements and using the Pr,Er encoding
> >> to imply the rest.  Then:
> >>
> >> * (2) means that, when combining multiple input operands with potentially
> >>   different encodings, we can set the number of patterns in the result
> >>   to the least common multiple of the number of patterns in the inputs.
> >>   In this case:
> >>
> >>   Pr = least_common_multiple(P1, P2, Ps)
> >>
> >>   is a valid number of patterns.
> >>
> >> * (3) means that the number of elements per pattern of the result can
> >>   be the maximum of the number of elements per pattern in the inputs.
> >>   (Alternatively, we could always use 3.)  In this case:
> >>
> >>   Er = max(E1, E2, Es)
> >>
> >>   is a valid number of elements per pattern.
> >>
> >> So if (R) holds we can compute the result -- for both VLA and VLS -- by
> >> calculating the first Pr*Er elements of the result and using the
> >> encoding to derive the rest.  If (R) doesn't hold then we need the
> >> selector to be constant-length.  We should then fill in the result
> >> based on:
> >>
> >> - Pr == number of elements in the result
> >> - Er == 1
> >>
> >> But this should be the fallback option, even for VLS.
> >>
> >> As far as the arguments go: we should reject CONSTRUCTORs for
> >> variable-length types.  After doing that, we can treat a CONSTRUCTOR
> >> for an N-element vector type by setting the number of patterns to N
> >> and the number of elements per pattern to 1.
> > Hi Richard,
> > Thanks for the suggestions, and sorry for late response.
> > I have a couple of very elementary questions:
> >
> > 1: Consider following inputs to VEC_PERM_EXPR:
> > op1: P_op1 == 4, E_op1 == 1
> > {1, 2, 3, 4, ...}
> >
> > op2: P_op2 == 2, E_op2 == 2
> > {11, 21, 12, 22, ...}
> >
> > sel: P_sel == 3, E_sel == 1
> > {0, 4, 5, ...}
> >
> > What shall be the result in this case ?
> > P_res = lcm(4, 2, 3) == 12
> > E_res = max(1, 2, 1) == 2.
>
> Yeah, that looks right.  Of course, since sel is just repeating
> > every three elements, it could just be P_res==3, E_res==1,
> but the vector_builder would do that optimisation for us.
>
> (I'm not sure whether we'd see a P==3 encoding in practice,
> but perhaps it's possible.)
>
> If sel was P_sel==1, E_sel==3 (so a stepped encoding rather than
> repeating every three elements) then:
>
> P_res = lcm(4, 2) == 4
> E_res = max(1, 2, 3) == 3
>
> which also looks like it would give the right encoding.
>
> > 2. How should we specify index of element in sel when it is not
> > explicitly encoded in the operand ?
> > For eg:
> > op1: npatterns == 2, nelts_per_pattern == 3
> > { 1, 0, 2, 0, 3, 0, ... }
> > op2: npatterns == 6, nelts_per_pattern == 1
> > { 11, 12, 13, 14, 15, 16, ...}
> >
> > In sel, how do we refer to element with value 4, that would be 4th element
> > of first pattern in op1, but not explicitly encoded ?
> > In op1, 4 will come at index == 6.
> > However in sel, index 6 would refer to 11, ie op2[0] ?
>
> What index 6 refers to depends on the length of op1.
> If the length of op1 is 4 at runtime the index 6 refers to op2[2].
> If the length of op1 is 6 then index 6 refers to op2[0].
> If the length of op1 is 8 then index 6 refers to op1[6], etc.
>
> This comes back to (R) above.  We need to be able to prove at compile
> time that each pattern selects from the same input vectors (for all
> elements, not just the encoded elements).  If we can't prove that
> then we can't fold for variable-length vectors.
Hi Richard,
Thanks for the clarification!
I have come up with an approach to verify (R):

Consider the following pattern:
a0, a1, a1 + S, ...,
nelts_per_pattern would be n / Psel, where n == actual length of the vector.
And the last element of the pattern will be given by:
a1 + (n/Psel - 2) * S

Rearranging the above term, we can think of pattern
as a line with following equation:
y = (S/Psel) * n + (a1 - 2S)
where (S/Psel) is the slope, and (a1 - 2S) is the y-intercept.

At,
n = 2*Psel, y = a1
n = 3*Psel, y = a1 + S,
n = 4*Psel, y = a1 + 2S ...

To compare with n, we compare the following lines:
y1 = (S/Psel) * n + (a1 - 2S)
y2 = n

So to check if elements always come from first vector,
we want to check y1 < y2 for n > 0.
Likewise, if elements always come from second vector,
we want to check if y1 >= y2, for n > 0.

If both lines are parallel, i.e. S/Psel == 1,
then we choose the first or second vector depending on the y-intercept a1 - 2S.
If a1 - 2S >= 0, then y1 >= y2 for all values of n, so select the second vector.
If a1 - 2S < 0, then y1 < y2 for all values of n, so select the first vector.

For example, if we have the following pattern:
{0, 1, 3, ...}
where a1 = 1, S = 2, and consider Psel = 2.

y1 = n - 3
y2 = n

In this case, y1 < y2 for all values of n, so we select the first vector.

Since y2 = n passes through the origin with slope 1,
a line can intersect it in either the 1st or the 3rd quadrant.
Calculate the point of intersection:
n_int = Psel * (a1 - 2S) / (Psel - S);

(a) n_int > 0
n_int > 0 => intersecting in 1st quadrant.
In this case there will be a cross-over at n_int.

For example, consider the pattern { 0, 1, 4, ...}
a1 = 1, S = 3, and let's take Psel = 2

y1 = (3/2)n - 5
y2 = n

Both intersect at (10, 10).
So for n < 10, y1 < y2
and for n > 10, y1 > y2,
so in this case we can't fold since we would select elements from both vectors.

(b) n_int <= 0
In this case, the lines will intersect in 3rd quadrant,
so depending upon the slope we can choose either vector.
If (S/Psel) < 1, ie y1 has a gentler slope than y2,
then y1 < y2 for n > 0
If (S/Psel) > 1, ie, y1 has a steeper slope than y2,
then y1 > y2 for n > 0.

For eg, in the above pattern {0, 1, 4, ...}
a1 = 1, S = 3, and let's take PSel = 4

y1 = (3/4)n - 5
y2 = n
Both intersect at (-20, -20).
y1's slope = (S/Psel) = (3/4) < 1
So y1 < y2 for n > 0.
Graph: https://www.desmos.com/calculator/ct7edqbr9d
So we pick first vector.
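
The two worked examples above can be cross-checked by brute force. The following is a minimal sketch in plain C++ (not GCC code): first_crossover is a hypothetical helper, it uses ordinary integers rather than poly_int, and it looks only at the last index of the pattern, ignoring both the per-pattern offset and any wrap-around of indices.

```cpp
#include <cstdint>

// Smallest selector length n (a multiple of psel, at least 2 * psel and at
// most max_n) for which the last index of the pattern {a0, a1, a1 + S, ...}
// reaches n, i.e. leaves the first input vector.  Returns -1 if there is no
// such n up to max_n.  Hypothetical helper, for illustration only.
std::int64_t first_crossover (std::int64_t a1, std::int64_t s,
                              std::int64_t psel, std::int64_t max_n)
{
  for (std::int64_t n = 2 * psel; n <= max_n; n += psel)
    {
      // Last element of the pattern when it has n / psel elements.
      std::int64_t last = a1 + (n / psel - 2) * s;
      if (last >= n)
        return n;
    }
  return -1;
}
```

For { 0, 1, 4, ... } this reports a cross-over at n = 10 when Psel = 2 and none when Psel = 4, which agrees with the intersection points computed above.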

The following pseudo code attempts to capture this:

tree select_vector_for_pattern (op1, op2, a1, S, Psel)
{
  if (S == Psel)
    {
      /* If y1 intercept >= 0, then y1 >= y2
         for all values of n.  */
      if (a1 - 2*S >= 0)
        return op2;
      return op1;
    }

  n_int = Psel * (a1 - 2*S) / (Psel - S);
  /* If intersecting in 1st quadrant, there will be cross over,
     bail out.  */
  if (n_int > 0)
    return NULL_TREE;
  /* If S/Psel < 1, ie y1 has gentler slope than y2,
     then y1 < y2 for n > 0.  */
  if (S < Psel)
    return op1;
  /* If S/Psel > 1, ie y1 has steeper slope than y2,
     then y1 > y2 for n > 0.  */
  return op2;
}
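
For experimenting with the pseudo code, here is a self-contained sketch in plain C++. It is an illustration under assumptions, not GCC code: the Pick enum stands in for returning op1, op2 or NULL_TREE, and the sign of n_int is tested without dividing so that integer truncation cannot hide a small positive intersection. It models only the two-line comparison described above; index wrap-around modulo twice the vector length is not modelled.

```cpp
#include <cstdint>

// Stand-in for returning op1, op2 or NULL_TREE (hypothetical, for illustration).
enum class Pick { Op1, Op2, None };

Pick select_vector_for_pattern (std::int64_t a1, std::int64_t s,
                                std::int64_t psel)
{
  // y-intercept of y1 = (S/Psel) * n + (a1 - 2S).
  std::int64_t intercept = a1 - 2 * s;

  // Parallel lines: the intercept alone decides.
  if (s == psel)
    return intercept >= 0 ? Pick::Op2 : Pick::Op1;

  // n_int = psel * intercept / (psel - s).  With psel > 0 this is positive
  // exactly when intercept and (psel - s) are nonzero with the same sign.
  std::int64_t denom = psel - s;
  if ((intercept > 0 && denom > 0) || (intercept < 0 && denom < 0))
    return Pick::None;

  // No cross-over for n > 0: the slope decides.
  return s < psel ? Pick::Op1 : Pick::Op2;
}
```

With a1 = 1 and S = 3 this gives None for Psel = 2 and Op1 for Psel = 4, matching cases (a) and (b).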

Does this look reasonable ?

Thanks,
Prathamesh
>
> Thanks,
> Richard

Richard Sandiford Sept. 20, 2022, 12:39 p.m. UTC | #7

Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> On Mon, 12 Sept 2022 at 19:57, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> Hi Richard,
> Thanks for the clarification!
> I have come up with an approach to verify R:
>
> Consider following pattern:
> a0, a1, a1 + S, ...,
> nelts_per_pattern would be n / Psel, where n == actual length of the vector.
> And last element of pattern will be given by:
> a1 + (n/Psel - 2) * S

(I think this is just a terminology thing, but in the source,
nelts_per_pattern is a compile-time constant that describes the
encoding.  It always has the value 1, 2 or 3, regardless of the
runtime length.)
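
To make the distinction concrete, the sketch below expands an encoding to an arbitrary element index. It is plain C++ written for illustration, not the GCC implementation: encoded_elt is a hypothetical name, and elements are modelled as plain integers.

```cpp
#include <cstdint>
#include <vector>

// Value of element I of a vector described by NPATTERNS interleaved patterns
// with NELTS_PER_PATTERN (1, 2 or 3) explicitly encoded elements each.
// ENC holds the npatterns * nelts_per_pattern encoded elements.
std::int64_t encoded_elt (const std::vector<std::int64_t> &enc,
                          unsigned npatterns, unsigned nelts_per_pattern,
                          unsigned i)
{
  unsigned pattern = i % npatterns;  // which pattern element I belongs to
  unsigned k = i / npatterns;        // position of element I in that pattern
  if (k < nelts_per_pattern)
    return enc[i];                   // explicitly encoded

  // Final encoded element of this pattern.
  std::int64_t last = enc[(nelts_per_pattern - 1) * npatterns + pattern];
  if (nelts_per_pattern < 3)
    return last;                     // 1 or 2: the final element repeats

  // 3: continue the series with step = third element - second element.
  std::int64_t step = last - enc[npatterns + pattern];
  return last + static_cast<std::int64_t> (k - 2) * step;
}
```

With the op1 example from earlier in the thread ({ 1, 0, 2, 0, 3, 0, ... }, npatterns == 2, nelts_per_pattern == 3), element 6 evaluates to 4 whatever the runtime length, while nelts_per_pattern itself stays 3.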

> Rearranging the above term, we can think of pattern
> as a line with following equation:
> y = (S/Psel) * n + (a1 - 2S)
> where (S/Psel) is the slope, and (a1 - 2S) is the y-intercept.
>
> At,
> n = 2*Psel, y = a1
> n = 3*Psel, y = a1 + S,
> n = 4*Psel, y = a1 + 2S ...
>
> To compare with n, we compare the following lines:
> y1 = (S/Psel) * n + (a1 - 2S)
> y2 = n
>
> So to check if elements always come from first vector,
> we want to check y1 < y2 for n > 0.
> Likewise, if elements always come from second vector,
> we want to check if y1 >= y2, for n > 0.

One difficulty here is that the indices wrap around, so an index value of
2n selects from the first vector rather than the second.  (This is pretty
awkward for VLA and doesn't match the native SVE TBL behaviour.)  So...
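
As an illustration of the wrap-around, the following plain C++ sketch resolves a selector index for a concrete runtime length n. The names Source and resolve are hypothetical, and operands are numbered 1 and 2 to match op1 and op2 in this thread.

```cpp
#include <cstdint>

// Which input operand (1 or 2) and which lane of it a selector index reads.
struct Source
{
  int operand;
  std::uint64_t lane;
};

// Resolve INDEX against two input vectors of N elements each.  Indices wrap
// modulo 2 * N, so an index of 2 * N reads lane 0 of the first operand again.
Source resolve (std::uint64_t index, std::uint64_t n)
{
  std::uint64_t wrapped = index % (2 * n);
  if (wrapped < n)
    return {1, wrapped};
  return {2, wrapped - n};
}
```

For index 6 this gives op2[2] when n is 4, op2[0] when n is 6 and op1[6] when n is 8, so the same constant index can land in different operands depending on the runtime length.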

> If both lines are parallel, ie S/PSel == 1,
> then we choose first or second vector depending on the y-intercept a1 - 2S.
> If a1 - 2S >= 0, then y1 >= y2 for all values of n, so select second vector.
> If a1 - 2S < 0, then y1 < y2 for all values of n, so select first vector.
>
> For eg, if we have following pattern:
> {0, 1, 3, ...}
> where a1 = 1, S = 2, and consider PSel = 2.
>
> y1 = n - 3
> y2 = n
>
> In this case, y1 < y2 for all values of n,  so we select first vector.
>
> Since y2 = n, passes thru origin with slope = 1,
> a line can intersect it either in 1st or 3rd quadrant.
> Calculate point of intersection:
> n_int = Psel * (a1 - 2S) / (Psel - S);
>
> (a) n_int > 0
> n_int > 0 => intersecting in 1st quadrant.
> In this case there will be a cross-over at n_int.
>
> For eg, consider pattern { 0, 1, 4, ...}
> a1 = 1, S = 3, and let's take PSel = 2
>
> y1 = (3/2)n - 5
> y2 = n
>
> Both intersect at (10, 10).
> So for n < 10, y1 < y2
> and for n > 10, y1 > y2.
> so in this case we can't fold since we will select elements from both vectors.
>
> (b) n_int <= 0
> In this case, the lines will intersect in 3rd quadrant,
> so depending upon the slope we can choose either vector.
> If (S/Psel) < 1, ie y1 has a gentler slope than y2,
> then y1 < y2 for n > 0
> If (S/Psel) > 1, ie, y1 has a steeper slope than y2,
> then y1 > y2 for n > 0.
>
> For eg, in the above pattern {0, 1, 4, ...}
> a1 = 1, S = 3, and let's take PSel = 4
>
> y1 = (3/4)n - 5
> y2 = n
> Both intersect at (-20, -20).
> y1's slope = (S/Psel) = (3/4) < 1
> So y1 < y2 for n > 0.
> Graph: https://www.desmos.com/calculator/ct7edqbr9d
> So we pick first vector.
>
> The following pseudo code attempts to capture this:
>
> tree select_vector_for_pattern (op1, op2, a1, S, Psel)
> {
>   if (S == Psel)
>     {
>       /* If y1 intercept >= 0, then y1 >= y2
>           for all values of n.  */
>       if (a1 - 2*S >= 0)
>         return op2;
>       return op1;
>     }
>
>    n_int = Psel * (a1 - 2*S) / (Psel - S)
>    /* If intersecting in 1st quadrant, there will be cross over,
>        bail out.  */
>    if (n_int > 0)
>      return NULL_TREE;
>    /* If S/Psel < 1, ie y1 has gentler slope than y2,
>       then y1 < y2 for n > 0.  */
>    if (S < Psel)
>      return op1;
>    /* If S/Psel > 1, ie y1 has steeper slope than y2,
>       then y1 > y2 for n > 0.  */
>    return op2;
> }
>
> Does this look reasonable ?

...I think we need to be more conservative.  I think we also need to
distinguish n1 (the number of elements in the input vectors) and
nsel (the number of elements in the selector).

If nsel is a multiple of Psel and nsel >= 2 * Psel then like you say
there will be (nsel /exact Psel) - 1 index elements from the stepped
encoding and the final index value will be:

  ae = a1 + (nsel /exact Psel - 2) * S

Because of wrap-around, we need to ensure that that doesn't run
into an adjoining vector.  I think the easiest way of doing that
is to calculate a1 /trunc n1 and ae /trunc n1 (using can_div_trunc_p)
and check that the quotients are equal.
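
For fixed lengths, the quotient test can be sketched with ordinary integers. This is only an illustration: same_vector_p is a hypothetical name, real code would operate on poly_int values through can_div_trunc_p, and rejecting a negative ae is an assumption of this sketch rather than something stated above.

```cpp
#include <cstdint>

// Return true if the stepped pattern {a0, a1, a1 + S, ...} provably stays in
// one input vector, for a selector of NSEL elements made of PSEL patterns and
// input vectors of N1 elements.  Returns false whenever the sketch punts.
bool same_vector_p (std::int64_t a1, std::int64_t s, std::int64_t psel,
                    std::int64_t nsel, std::int64_t n1)
{
  // Only handle the case where nsel is a multiple of psel with a stepped part.
  if (nsel % psel != 0 || nsel < 2 * psel)
    return false;

  // Final index value of the pattern.
  std::int64_t ae = a1 + (nsel / psel - 2) * s;
  if (a1 < 0 || ae < 0)
    return false;  // assumption of this sketch: be conservative

  // Truncating division; equal quotients mean no vector boundary is crossed.
  return a1 / n1 == ae / n1;
}
```

For { 0, 1, 4, ... } with nsel == n1 == 16, the quotients are equal when Psel is 4 (ae is 7) and differ when Psel is 2 (ae is 19).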

However, I now realise that there's a wrinkle.  If S < 0 then we
also need to check that either:

(a) the chosen input vector (given by the quotient above) has either:

    (i) nelts_per_pattern == 1
    (ii) nelts_per_pattern == 3 and the difference between the
         first and second elements in each pattern is the same as
         the difference between the second and third elements
         (i.e. every pattern is a natural stepped one).

(b) ae % n1 >= the number of patterns in the input vector.
    (ae % n1 is calculated as a side-effect of can_div_trunc_p).

Otherwise the index vector has the effect of moving the "foreground"
from the front of the input vector to the end of the result vector.
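
Conditions (a) and (b) can be written down directly for a fixed length. The sketch below is a plain C++ transcription under assumptions: both function names are hypothetical, the chosen input's encoding is passed as a vector of integers, and ae is taken to be non-negative.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// True if, in an encoding with 3 elements per pattern, every pattern has the
// same difference between its first and second elements as between its
// second and third (a "natural" stepped pattern).
bool natural_stepped_p (const std::vector<std::int64_t> &enc,
                        std::size_t npatterns)
{
  for (std::size_t p = 0; p < npatterns; ++p)
    if (enc[npatterns + p] - enc[p]
        != enc[2 * npatterns + p] - enc[npatterns + p])
      return false;
  return true;
}

// Extra check for S < 0: ENC, NPATTERNS and NELTS_PER_PATTERN describe the
// chosen input vector, AE is the final index value and N1 the input length.
bool negative_step_ok_p (const std::vector<std::int64_t> &enc,
                         std::size_t npatterns, unsigned nelts_per_pattern,
                         std::int64_t ae, std::int64_t n1)
{
  if (nelts_per_pattern == 1)                                     // (a)(i)
    return true;
  if (nelts_per_pattern == 3 && natural_stepped_p (enc, npatterns))  // (a)(ii)
    return true;
  return ae % n1 >= static_cast<std::int64_t> (npatterns);        // (b)
}
```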

If nsel == Psel then the stepped part of the sequence doesn't matter.
Thus, the same condition works whenever nsel is a multiple of Psel.

If nsel is not a multiple of Psel then I think we should punt for now.
There are some cases that we could handle when n1 == nsel, but "nsel
is a multiple of Psel" will be the normal case.

Thanks,
Richard

Prathamesh Kulkarni Sept. 23, 2022, 11:59 a.m. UTC | #8

On Tue, 20 Sept 2022 at 18:09, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > On Mon, 12 Sept 2022 at 19:57, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> ...I think we need to be more conservative.  I think we also need to
> distinguish n1 (the number of elements in the input vectors) and
> nsel (the number of elements in the selector).
>
> If nsel is a multiple of Psel and nsel >= 2 * Psel then like you say
> there will be (nsel /exact Psel) - 1 index elements from the stepped
> encoding and the final index value will be:
>
>   ae = a1 + (nsel /exact Psel - 2) * S
>
> Because of wrap-around, we need to ensure that that doesn't run
> into an adjoining vector.  I think the easiest way of doing that
> is to calculate a1 /trunc n1 and ae /trunc n1 (using can_div_trunc_p)
> and check that the quotients are equal.
IIUC, if a1/n1 == ae/n1, then the sequence selects from a single
vector, since ae is the last element, and the quotient identifies
which vector, because it will be either 0 or 1 (since indices wrap
around after 2n).
Could you please elaborate a bit on how can_div_trunc_p calculates
the quotients when n1 and nsel are unknown at compile time?

To calculate the quotients for a hard coded pattern,
with a1 = 1, nsel = n1 = len(VNx4SI), S = 3, Psel = 4,
I tried the following:

  poly_uint64 n1 = GET_MODE_NUNITS (VNx4SImode);
  poly_uint64 nsel = n1;
  poly_uint64 a1 = 1;
  poly_uint64 Esel = exact_div (nsel, Psel);
  poly_uint64 ae = a1 + (Esel - 2) * S;

  int q1, qe;
  poly_uint64 r1, re;

  bool div1_p = can_div_trunc_p (a1, n1, &q1, &r1);
  bool dive_p = can_div_trunc_p (ae, n1, &qe, &re);

This gave strange values for qe and 0 for q1: the first call succeeded
and the second call returned false.
Am I calling it incorrectly?

Thanks,
Prathamesh

>
> However, I now realise that there's a wrinkle.  If S < 0 then we
> also need to check that either:
>
> (a) the chosen input vector (given by the quotient above) has either:
>
>     (i) nelts_per_pattern == 1
>     (ii) nelts_per_pattern == 3 and the difference between the
>          first and second elements in each pattern is the same as
>          the difference between the second and third elements
>          (i.e. every pattern is a natural stepped one).
>
> (b) ae % n1 >= the number of patterns in the input vector.
>     (ae % n1 is calculated as a side-effect of can_div_trunc_p).
>
> Otherwise the index vector has the effect of moving the "foreground"
> from the front of the input vector to the end of the result vector.
>
> If nsel == Psel then the stepped part of the sequence doesn't matter.
> Thus, the same condition works whenever nsel is a multiple of Psel.
>
> If nsel is not a multiple of Psel then I think we should punt for now.
> There are some cases that we could handle when n1 == nsel, but "nsel
> is a multiple of Psel" will be the normal case.
>
> Thanks,
> Richard

Richard Sandiford Sept. 23, 2022, 4:03 p.m. UTC | #9

Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> On Tue, 20 Sept 2022 at 18:09, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> > On Mon, 12 Sept 2022 at 19:57, Richard Sandiford
>> > <richard.sandiford@arm.com> wrote:
>> >>
>> >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> >> >> The VLA encoding encodes the first N patterns explicitly.  The
>> >> >> npatterns/nelts_per_pattern values then describe how to extend that
>> >> >> initial sequence to an arbitrary number of elements.  So when performing
>> >> >> an operation on (potentially) variable-length vectors, the question is:
>> >> >>
>> >> >> * Can we work out an initial sequence and npatterns/nelts_per_pattern
>> >> >>   pair that will be correct for all elements of the result?
>> >> >>
>> >> >> This depends on the operation that we're performing.  E.g. it's
>> >> >> different for unary operations (vector_builder::new_unary_operation)
>> >> >> and binary operations (vector_builder::new_binary_operation).  It also
>> >> >> varies between unary operations and between binary operations, hence
>> >> >> the allow_stepped_p parameters.
>> >> >>
>> >> >> For VEC_PERM_EXPR, I think the key requirement is that:
>> >> >>
>> >> >> (R) Each individual selector pattern must always select from the same vector.
>> >> >>
>> >> >> Whether this condition is met depends both on the pattern itself and on
>> >> >> the number of patterns that it's combined with.
>> >> >>
>> >> >> E.g. suppose we had the selector pattern:
>> >> >>
>> >> >>   { 0, 1, 4, ... }   i.e. 3x - 2 for x > 0
>> >> >>
>> >> >> If the arguments and selector are n elements then this pattern on its
>> >> >> own would select from more than one argument if 3(n-1) - 2 >= n.
>> >> >> This is clearly true for large enough n.  So if n is variable then
>> >> >> we cannot represent this.
>> >> >>
>> >> >> If the pattern above is one of two patterns, so interleaved as:
>> >> >>
>> >> >>      { 0, _, 1, _, 4, _, ... }  o=0
>> >> >>   or { _, 0, _, 1, _, 4, ... }  o=1
>> >> >>
>> >> >> then the pattern would select from more than one argument if
>> >> >> 3(n/2-1) - 2 + o >= n.  This too would be a problem for variable n.
>> >> >>
>> >> >> But if the pattern above is one of four patterns then it selects
>> >> >> from more than one argument if 3(n/4-1) - 2 + o >= n.  This is not
>> >> >> true for any valid n or o, so the pattern is OK.
>> >> >>
>> >> >> So let's define some ad hoc terminology:
>> >> >>
>> >> >> * Px is the number of patterns in x
>> >> >> * Ex is the number of elements per pattern in x
>> >> >>
>> >> >> where x can be:
>> >> >>
>> >> >> * 1: first argument
>> >> >> * 2: second argument
>> >> >> * s: selector
>> >> >> * r: result
>> >> >>
>> >> >> Then:
>> >> >>
>> >> >> (1) The number of elements encoded explicitly for x is Ex*Px
>> >> >>
>> >> >> (2) The explicit encoding can be used to produce a sequence of N*Ex*Px
>> >> >>     elements for any integer N.  This extended sequence can be reencoded
>> >> >>     as having N*Px patterns, with Ex staying the same.
>> >> >>
>> >> >> (3) If Ex < 3, Ex can be increased by 1 by repeating the final Px elements
>> >> >>     of the explicit encoding.
>> >> >>
>> >> >> So let's assume (optimistically) that we can produce the result
>> >> >> by calculating the first Pr*Er elements and using the Pr,Er encoding
>> >> >> to imply the rest.  Then:
>> >> >>
>> >> >> * (2) means that, when combining multiple input operands with potentially
>> >> >>   different encodings, we can set the number of patterns in the result
>> >> >>   to the least common multiple of the number of patterns in the inputs.
>> >> >>   In this case:
>> >> >>
>> >> >>   Pr = least_common_multiple(P1, P2, Ps)
>> >> >>
>> >> >>   is a valid number of patterns.
>> >> >>
>> >> >> * (3) means that the number of elements per pattern of the result can
>> >> >>   be the maximum of the number of elements per pattern in the inputs.
>> >> >>   (Alternatively, we could always use 3.)  In this case:
>> >> >>
>> >> >>   Er = max(E1, E2, Es)
>> >> >>
>> >> >>   is a valid number of elements per pattern.
>> >> >>
>> >> >> So if (R) holds we can compute the result -- for both VLA and VLS -- by
>> >> >> calculating the first Pr*Er elements of the result and using the
>> >> >> encoding to derive the rest.  If (R) doesn't hold then we need the
>> >> >> selector to be constant-length.  We should then fill in the result
>> >> >> based on:
>> >> >>
>> >> >> - Pr == number of elements in the result
>> >> >> - Er == 1
>> >> >>
>> >> >> But this should be the fallback option, even for VLS.
>> >> >>
>> >> >> As far as the arguments go: we should reject CONSTRUCTORs for
>> >> >> variable-length types.  After doing that, we can treat a CONSTRUCTOR
>> >> >> for an N-element vector type by setting the number of patterns to N
>> >> >> and the number of elements per pattern to 1.
>> >> > Hi Richard,
>> >> > Thanks for the suggestions, and sorry for late response.
>> >> > I have a couple of very elementary questions:
>> >> >
>> >> > 1: Consider following inputs to VEC_PERM_EXPR:
>> >> > op1: P_op1 == 4, E_op1 == 1
>> >> > {1, 2, 3, 4, ...}
>> >> >
>> >> > op2: P_op2 == 2, E_op2 == 2
>> >> > {11, 21, 12, 22, ...}
>> >> >
>> >> > sel: P_sel == 3, E_sel == 1
>> >> > {0, 4, 5, ...}
>> >> >
>> >> > What shall be the result in this case ?
>> >> > P_res = lcm(4, 2, 3) == 12
>> >> > E_res = max(1, 2, 1) == 2.
>> >>
>> >> Yeah, that looks right.  Of course, since sel is just repeating
>> >> every three elements, it could just be P_res==3, E_sel==1,
>> >> but the vector_builder would do that optimisation for us.
>> >>
>> >> (I'm not sure whether we'd see a P==3 encoding in practice,
>> >> but perhaps it's possible.)
>> >>
>> >> If sel was P_sel==1, E_sel==3 (so a stepped encoding rather than
>> >> repeating every three elements) then:
>> >>
>> >> P_res = lcm(4, 2) == 4
>> >> E_res = max(1, 2, 3) == 3
>> >>
>> >> which also looks like it would give the right encoding.
>> >>
>> >> > 2. How should we specify index of element in sel when it is not
>> >> > explicitly encoded in the operand ?
>> >> > For eg:
>> >> > op1: npatterns == 2, nelts_per_pattern == 3
>> >> > { 1, 0, 2, 0, 3, 0, ... }
>> >> > op2: npatterns == 6, nelts_per_pattern == 1
>> >> > { 11, 12, 13, 14, 15, 16, ...}
>> >> >
>> >> > In sel, how do we refer to element with value 4, that would be 4th element
>> >> > of first pattern in op1, but not explicitly encoded ?
>> >> > In op1, 4 will come at index == 6.
>> >> > However in sel, index 6 would refer to 11, ie op2[0] ?
>> >>
>> >> What index 6 refers to depends on the length of op1.
>> >> If the length of op1 is 4 at runtime the index 6 refers to op2[2].
>> >> If the length of op1 is 6 then index 6 refers to op2[0].
>> >> If the length of op1 is 8 then index 6 refers to op1[6], etc.
>> >>
>> >> This comes back to (R) above.  We need to be able to prove at compile
>> >> time that each pattern selects from the same input vectors (for all
>> >> elements, not just the encoded elements).  If we can't prove that
>> >> then we can't fold for variable-length vectors.
>> > Hi Richard,
>> > Thanks for the clarification!
>> > I have come up with an approach to verify R:
>> >
>> > Consider following pattern:
>> > a0, a1, a1 + S, ...,
>> > nelts_per_pattern would be n / Psel, where n == actual length of the vector.
>> > And last element of pattern will be given by:
>> > a1 + (n/Psel - 2) * S
>>
>> (I think this is just a terminology thing, but in the source,
>> nelts_per_pattern is a compile-time constant that describes the
>> encoding.  It always has the value 1, 2 or 3, regardless of the
>> runtime length.)
>>
>> > Rearranging the above term, we can think of pattern
>> > as a line with following equation:
>> > y = (S/Psel) * n + (a1 - 2S)
>> > where (S/Psel) is the slope, and (a1 - 2S) is the y-intercept.
>> >
>> > At,
>> > n = 2*Psel, y = a1
>> > n = 3*Psel, y = a1 + S,
>> > n = 4*Psel, y = a1 + 2S ...
>> >
>> > To compare with n, we compare the following lines:
>> > y1 = (S/Psel) * n + (a1 - 2S)
>> > y2 = n
>> >
>> > So to check if elements always come from first vector,
>> > we want to check y1 < y2 for n > 0.
>> > Likewise, if elements always come from second vector,
>> > we want to check if y1 >= y2, for n > 0.
>>
>> One difficulty here is that the indices wrap around, so an index value of
>> 2n selects from the first vector rather than the second.  (This is pretty
>> awkward for VLA and doesn't match the native SVE TBL behaviour.)  So...
>>
>> > If both lines are parallel, ie S/PSel == 1,
>> > then we choose first or second vector depending on the y-intercept a1 - 2S.
>> > If a1 - 2S >= 0, then y1 >= y2 for all values of n, so select second vector.
>> > If a1 - 2S < 0, then y1 < y2 for all values of n, so select first vector.
>> >
>> > For eg, if we have following pattern:
>> > {0, 1, 3, ...}
>> > where a1 = 1, S = 2, and consider PSel = 2.
>> >
>> > y1 = n - 3
>> > y2 = n
>> >
>> > In this case, y1 < y2 for all values of n,  so we select first vector.
>> >
>> > Since y2 = n, passes thru origin with slope = 1,
>> > a line can intersect it either in 1st or 3rd quadrant.
>> > Calculate point of intersection:
>> > n_int = Psel * (a1 - 2S) / (Psel - S);
>> >
>> > (a) n_int > 0
>> > n_int > 0 => intersecting in 1st quadrant.
>> > In this case there will be a cross-over at n_int.
>> >
>> > For eg, consider pattern { 0, 1, 4, ...}
>> > a1 = 1, S = 3, and let's take PSel = 2
>> >
>> > y1 = (3/2)n - 5
>> > y2 = n
>> >
>> > Both intersect at (10, 10).
>> > So for n < 10, y1 < y2
>> > and for n > 10, y1 > y2.
>> > so in this case we can't fold since we will select elements from both vectors.
>> >
>> > (b) n_int <= 0
>> > In this case, the lines will intersect in 3rd quadrant,
>> > so depending upon the slope we can choose either vector.
>> > If (S/Psel) < 1, ie y1 has a gentler slope than y2,
>> > then y1 < y2 for n > 0
>> > If (S/Psel) > 1, ie, y1 has a steeper slope than y2,
>> > then y1 > y2 for n > 0.
>> >
>> > For eg, in the above pattern {0, 1, 4, ...}
>> > a1 = 1, S = 3, and let's take PSel = 4
>> >
>> > y1 = (3/4)n - 5
>> > y2 = n
>> > Both intersect at (-20, -20).
>> > y1's slope = (S/Psel) = (3/4) < 1
>> > So y1 < y2 for n > 0.
>> > Graph: https://www.desmos.com/calculator/ct7edqbr9d
>> > So we pick first vector.
>> >
>> > The following pseudo code attempts to capture this:
>> >
>> > tree select_vector_for_pattern (op1, op2, a1, S, Psel)
>> > {
>> >   if (S == Psel)
>> >     {
>> >       /* If y1 intercept >= 0, then y1 >= y2
>> >           for all values of n.  */
>> >       if (a1 - 2*S >= 0)
>> >         return op2;
>> >       return op1;
>> >     }
>> >
>> >    n_int = Psel * (a1 - 2*S) / (Psel - S)
>> >    /* If intersecting in 1st quadrant, there will be cross over,
>> >        bail out.  */
>> >    if (n_int > 0)
>> >      return NULL_TREE;
>> >    /* If S/Psel < 1, ie y1 has gentler slope than y2,
>> >       then y1 < y2 for n > 0.  */
>> >    if (S < Psel)
>> >      return op1;
>> >    /* If S/Psel > 1, ie y1 has steeper slope than y2,
>> >       then y1 > y2 for n > 0.  */
>> >    return op2;
>> > }
>> >
>> > Does this look reasonable ?
>>
>> ...I think we need to be more conservative.  I think we also need to
>> distinguish n1 (the number of elements in the input vectors) and
>> nsel (the number of elements in the selector).
>>
>> If nsel is a multiple of Psel and nsel >= 2 * Psel then like you say
>> there will be (nsel /exact Psel) - 1 index elements from the stepped
>> encoding and the final index value will be:
>>
>>   ae = a1 + (nsel /exact Psel - 2) * S
>>
>> Because of wrap-around, we need to ensure that that doesn't run
>> into an adjoining vector.  I think the easiest way of doing that
>> is to calculate a1 /trunc n1 and ae /trunc n1 (using can_div_trunc_p)
>> and check that the quotients are equal.
> IIUC, If a1/n1 == ae/n1, then the sequence will choose from the same
> vector since ae is last elem,
> and the quotient can choose the vector because it will be either 0 or
> 1 (since indices wrap around after 2n).

Right.
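
The quotient test can be sketched for a single, fixed runtime length as
below.  This is a minimal illustration, not the patch: the function name
vector_for_pattern and its plain-integer interface are hypothetical, and
it deliberately punts on S < 0 and on nsel < 2 * Psel, which the thread
treats separately.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical helper, not part of GCC.  For a fixed input length n1 and
// selector length nsel, decide which input vector (0 or 1) the stepped
// part a1, a1 + S, a1 + 2S, ... of one selector pattern reads from.
// Returns std::nullopt when this sketch cannot prove a single vector.
std::optional<int>
vector_for_pattern (int64_t a1, int64_t S, int64_t Psel,
                    int64_t nsel, int64_t n1)
{
  assert (n1 > 0 && Psel > 0);
  // Negative steps need extra checks on the input encoding; not modelled.
  if (S < 0 || a1 < 0)
    return std::nullopt;
  // nsel must be a multiple of Psel, and the stepped part must be non-empty.
  if (nsel % Psel != 0 || nsel < 2 * Psel)
    return std::nullopt;
  // Final index produced by this pattern.
  int64_t ae = a1 + (nsel / Psel - 2) * S;
  // Truncating division: equal quotients mean a1..ae stay in one vector.
  int64_t q1 = a1 / n1, qe = ae / n1;
  if (q1 != qe)
    return std::nullopt;
  // Indices wrap around after 2 * n1, so only the parity matters.
  return static_cast<int> (q1 % 2);
}
```

For the pattern { 0, 1, 4, ... } with 16 elements this reports the first
vector when Psel is 4 and gives up when Psel is 2, because the last index
is then 19 and crosses into the second vector.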

> Um, could you please elaborate a bit on how will can_div_trunc_p
> calculate quotients, when n1 and nsel are unknown
> at compile time ?
>
> To calculate the quotients for a hard coded pattern,
> with a1 = 1, nsel = n1 = len(VNx4SI), S = 3, Psel = 4,
> I tried the following:
>
>   poly_uint64 n1 = GET_MODE_NUNITS (VNx4SImode);
>   poly_uint64 nsel = n1;
>   poly_uint64 a1 = 1;
>   poly_uint64 Esel = exact_div (nsel, Psel);

We can't use exact_div here.  We need to test that nsel is a multiple
of Psel (e.g. using multiple_p).
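
A toy model of "test first, then divide" is sketched below.  It is not
GCC's poly_int: toy_poly, toy_multiple_p and toy_exact_quotient are
hypothetical names, and the model assumes a value of the form c0 + c1 * x
with a single runtime indeterminate x >= 0.

```cpp
#include <cassert>
#include <cstdint>

// Toy degree-1 polynomial c0 + c1 * x, where x >= 0 is only known at
// run time.  An illustration only, not GCC's poly_int.
struct toy_poly
{
  int64_t c0, c1;
  int64_t at (int64_t x) const { return c0 + c1 * x; }
};

// The value is a multiple of d for every x >= 0 iff both coefficients
// are: x = 0 forces c0, and the step from x = 0 to x = 1 forces c1.
bool
toy_multiple_p (toy_poly a, int64_t d)
{
  assert (d > 0);
  return a.c0 % d == 0 && a.c1 % d == 0;
}

// Divide only after divisibility has been established; otherwise report
// failure instead of asserting.
bool
toy_exact_quotient (toy_poly a, int64_t d, toy_poly *quotient)
{
  if (!toy_multiple_p (a, d))
    return false;
  *quotient = { a.c0 / d, a.c1 / d };
  return true;
}
```

With nsel modelled as 4 + 4x, dividing by Psel == 4 yields 1 + x, while
Psel == 8 is rejected because the x == 0 length (4) is not a multiple of 8.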

>   poly_uint64 ae = a1 + (Esel - 2) * S;
>
>   int q1, qe;
>   poly_uint64 r1, re;
>
>   bool div1_p = can_div_trunc_p (a1, n1, &q1, &r1);
>   bool dive_p = can_div_trunc_p (ae, n1, &qe, &re);
>
> Which gave strange values for qe and 0 for q1, with first call succeeding,
> and second call returning false.
> Am I calling it incorrectly ?

No, that looks right.  I guess the issue is that ae < a1 for Psel == nsel
and so for min(nsel) the index wraps around to the other input vector.
IMO it's OK to punt on that.  We don't have interfaces for applying
ranges to the indeterminates in a poly_int.
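
To make the wrap-around concrete, the sketch below writes ae as a
polynomial in the runtime indeterminate x, assuming VNx4SI has 4 + 4x
elements so that nsel / Psel is 1 + x for Psel == 4.  The toy_poly type
and the helper names are hypothetical and use signed coefficients, unlike
poly_uint64.

```cpp
#include <cassert>
#include <cstdint>

// Toy degree-1 polynomial c0 + c1 * x with signed coefficients.
// An illustration only, not GCC's poly_int.
struct toy_poly
{
  int64_t c0, c1;
  int64_t at (int64_t x) const { return c0 + c1 * x; }
};

// ae = a1 + (nelems - 2) * S, where nelems = nsel / Psel is itself a
// polynomial in x.
toy_poly
final_index (int64_t a1, int64_t S, toy_poly nelems)
{
  assert (nelems.c1 >= 0);
  return { a1 + (nelems.c0 - 2) * S, nelems.c1 * S };
}

// What a negative constant term looks like once stored as unsigned.
uint64_t
as_unsigned (int64_t v)
{
  return static_cast<uint64_t> (v);
}
```

For a1 = 1, S = 3 and nelems = 1 + x this gives ae = 3x - 2, which is -2
at the minimum length (x == 0) and only reaches a1 at x == 1.  Stored in
an unsigned coefficient, the -2 becomes 2^64 - 2.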

I guess if we want to be fancy, we could look for a1 = a0 + S and,
if true, do the calculation based on a0 rather than a1.  The combination
of Psel >= min(nsel) and a1 != a0 + S, although valid in theory, should
be a very niche case.

But it might be better to stick to the a1-based case first and
get that working.  We can then see whether it's worth extending.
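
For reference, the a0-based variant could look roughly like the sketch
below.  It is only an illustration under the same toy model as before
(one indeterminate x, signed coefficients); natural_stepped_p and
final_index_from_a0 are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Toy degree-1 polynomial c0 + c1 * x; an illustration, not GCC's poly_int.
struct toy_poly
{
  int64_t c0, c1;
  int64_t at (int64_t x) const { return c0 + c1 * x; }
};

// True if the three encoded elements already form an arithmetic
// sequence, i.e. a1 == a0 + S with S == a2 - a1.
bool
natural_stepped_p (int64_t a0, int64_t a1, int64_t a2)
{
  return a1 - a0 == a2 - a1;
}

// For a natural stepped pattern the i-th element is a0 + i * S, so the
// last of nelems elements is a0 + (nelems - 1) * S.
toy_poly
final_index_from_a0 (int64_t a0, int64_t S, toy_poly nelems)
{
  assert (nelems.c0 >= 1);
  return { a0 + (nelems.c0 - 1) * S, nelems.c1 * S };
}
```

With a0 = 0, S = 3 and nelems = 1 + x the last index is 3x, whose
constant term no longer goes negative at the minimum length.  The pattern
{ 0, 1, 4, ... } does not qualify, since 1 - 0 != 4 - 1.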

Thanks,
Richard

>
> Thanks,
> Prathamesh
>
>>
>> However, I now realise that there's a wrinkle.  If S < 0 then we
>> also need to check that either:
>>
>> (a) the chosen input vector (given by the quotient above) has either:
>>
>>     (i) nelts_per_pattern == 1
>>     (ii) nelts_per_pattern == 3 and the difference between the
>>          first and second elements in each pattern is the same as
>>          the difference between the second and third elements
>>          (i.e. every pattern is a natural stepped one).
>>
>> (b) ae % n1 >= the number of patterns in the input vector.
>>     (ae % n1 is calculated as a side-effect of can_div_trunc_p).
>>
>> Otherwise the index vector has the effect of moving the "foreground"
>> from the front of the input vector to the end of the result vector.
>>
>> If nsel == Psel then the stepped part of the sequence doesn't matter.
>> Thus, the same condition works whenever nsel is a multiple of Psel.
>>
>> If nsel is not a multiple of Psel then I think we should punt for now.
>> There are some cases that we could handle when n1 == nsel, but "nsel
>> is a multiple of Psel" will be the normal case.
>>
>> Thanks,
>> Richard

Prathamesh Kulkarni Sept. 26, 2022, 7:33 p.m. UTC | #10

On Fri, 23 Sept 2022 at 21:33, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > On Tue, 20 Sept 2022 at 18:09, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> > On Mon, 12 Sept 2022 at 19:57, Richard Sandiford
> >> > <richard.sandiford@arm.com> wrote:
> >> >>
> >> >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> >> >> The VLA encoding encodes the first N patterns explicitly.  The
> >> >> >> npatterns/nelts_per_pattern values then describe how to extend that
> >> >> >> initial sequence to an arbitrary number of elements.  So when performing
> >> >> >> an operation on (potentially) variable-length vectors, the question is:
> >> >> >>
> >> >> >> * Can we work out an initial sequence and npatterns/nelts_per_pattern
> >> >> >>   pair that will be correct for all elements of the result?
> >> >> >>
> >> >> >> This depends on the operation that we're performing.  E.g. it's
> >> >> >> different for unary operations (vector_builder::new_unary_operation)
> >> >> >> and binary operations (vector_builder::new_binary_operation).  It also
> >> >> >> varies between unary operations and between binary operations, hence
> >> >> >> the allow_stepped_p parameters.
> >> >> >>
> >> >> >> For VEC_PERM_EXPR, I think the key requirement is that:
> >> >> >>
> >> >> >> (R) Each individual selector pattern must always select from the same vector.
> >> >> >>
> >> >> >> Whether this condition is met depends both on the pattern itself and on
> >> >> >> the number of patterns that it's combined with.
> >> >> >>
> >> >> >> E.g. suppose we had the selector pattern:
> >> >> >>
> >> >> >>   { 0, 1, 4, ... }   i.e. 3x - 2 for x > 0
> >> >> >>
> >> >> >> If the arguments and selector are n elements then this pattern on its
> >> >> >> own would select from more than one argument if 3(n-1) - 2 >= n.
> >> >> >> This is clearly true for large enough n.  So if n is variable then
> >> >> >> we cannot represent this.
> >> >> >>
> >> >> >> If the pattern above is one of two patterns, so interleaved as:
> >> >> >>
> >> >> >>      { 0, _, 1, _, 4, _, ... }  o=0
> >> >> >>   or { _, 0, _, 1, _, 4, ... }  o=1
> >> >> >>
> >> >> >> then the pattern would select from more than one argument if
> >> >> >> 3(n/2-1) - 2 + o >= n.  This too would be a problem for variable n.
> >> >> >>
> >> >> >> But if the pattern above is one of four patterns then it selects
> >> >> >> from more than one argument if 3(n/4-1) - 2 + o >= n.  This is not
> >> >> >> true for any valid n or o, so the pattern is OK.
> >> >> >>
> >> >> >> So let's define some ad hoc terminology:
> >> >> >>
> >> >> >> * Px is the number of patterns in x
> >> >> >> * Ex is the number of elements per pattern in x
> >> >> >>
> >> >> >> where x can be:
> >> >> >>
> >> >> >> * 1: first argument
> >> >> >> * 2: second argument
> >> >> >> * s: selector
> >> >> >> * r: result
> >> >> >>
> >> >> >> Then:
> >> >> >>
> >> >> >> (1) The number of elements encoded explicitly for x is Ex*Px
> >> >> >>
> >> >> >> (2) The explicit encoding can be used to produce a sequence of N*Ex*Px
> >> >> >>     elements for any integer N.  This extended sequence can be reencoded
> >> >> >>     as having N*Px patterns, with Ex staying the same.
> >> >> >>
> >> >> >> (3) If Ex < 3, Ex can be increased by 1 by repeating the final Px elements
> >> >> >>     of the explicit encoding.
> >> >> >>
> >> >> >> So let's assume (optimistically) that we can produce the result
> >> >> >> by calculating the first Pr*Er elements and using the Pr,Er encoding
> >> >> >> to imply the rest.  Then:
> >> >> >>
> >> >> >> * (2) means that, when combining multiple input operands with potentially
> >> >> >>   different encodings, we can set the number of patterns in the result
> >> >> >>   to the least common multiple of the number of patterns in the inputs.
> >> >> >>   In this case:
> >> >> >>
> >> >> >>   Pr = least_common_multiple(P1, P2, Ps)
> >> >> >>
> >> >> >>   is a valid number of patterns.
> >> >> >>
> >> >> >> * (3) means that the number of elements per pattern of the result can
> >> >> >>   be the maximum of the number of elements per pattern in the inputs.
> >> >> >>   (Alternatively, we could always use 3.)  In this case:
> >> >> >>
> >> >> >>   Er = max(E1, E2, Es)
> >> >> >>
> >> >> >>   is a valid number of elements per pattern.
> >> >> >>
> >> >> >> So if (R) holds we can compute the result -- for both VLA and VLS -- by
> >> >> >> calculating the first Pr*Er elements of the result and using the
> >> >> >> encoding to derive the rest.  If (R) doesn't hold then we need the
> >> >> >> selector to be constant-length.  We should then fill in the result
> >> >> >> based on:
> >> >> >>
> >> >> >> - Pr == number of elements in the result
> >> >> >> - Er == 1
> >> >> >>
> >> >> >> But this should be the fallback option, even for VLS.
> >> >> >>
> >> >> >> As far as the arguments go: we should reject CONSTRUCTORs for
> >> >> >> variable-length types.  After doing that, we can treat a CONSTRUCTOR
> >> >> >> for an N-element vector type by setting the number of patterns to N
> >> >> >> and the number of elements per pattern to 1.
> >> >> > Hi Richard,
> >> >> > Thanks for the suggestions, and sorry for late response.
> >> >> > I have a couple of very elementary questions:
> >> >> >
> >> >> > 1: Consider following inputs to VEC_PERM_EXPR:
> >> >> > op1: P_op1 == 4, E_op1 == 1
> >> >> > {1, 2, 3, 4, ...}
> >> >> >
> >> >> > op2: P_op2 == 2, E_op2 == 2
> >> >> > {11, 21, 12, 22, ...}
> >> >> >
> >> >> > sel: P_sel == 3, E_sel == 1
> >> >> > {0, 4, 5, ...}
> >> >> >
> >> >> > What shall be the result in this case ?
> >> >> > P_res = lcm(4, 2, 3) == 12
> >> >> > E_res = max(1, 2, 1) == 2.
> >> >>
> >> >> Yeah, that looks right.  Of course, since sel is just repeating
> >> >> every three elements, it could just be P_res==3, E_sel==1,
> >> >> but the vector_builder would do that optimisation for us.
> >> >>
> >> >> (I'm not sure whether we'd see a P==3 encoding in practice,
> >> >> but perhaps it's possible.)
> >> >>
> >> >> If sel was P_sel==1, E_sel==3 (so a stepped encoding rather than
> >> >> repeating every three elements) then:
> >> >>
> >> >> P_res = lcm(4, 2) == 4
> >> >> E_res = max(1, 2, 3) == 3
> >> >>
> >> >> which also looks like it would give the right encoding.
> >> >>
> >> >> > 2. How should we specify index of element in sel when it is not
> >> >> > explicitly encoded in the operand ?
> >> >> > For eg:
> >> >> > op1: npatterns == 2, nelts_per_pattern == 3
> >> >> > { 1, 0, 2, 0, 3, 0, ... }
> >> >> > op2: npatterns == 6, nelts_per_pattern == 1
> >> >> > { 11, 12, 13, 14, 15, 16, ...}
> >> >> >
> >> >> > In sel, how do we refer to element with value 4, that would be 4th element
> >> >> > of first pattern in op1, but not explicitly encoded ?
> >> >> > In op1, 4 will come at index == 6.
> >> >> > However in sel, index 6 would refer to 11, ie op2[0] ?
> >> >>
> >> >> What index 6 refers to depends on the length of op1.
> >> >> If the length of op1 is 4 at runtime the index 6 refers to op2[2].
> >> >> If the length of op1 is 6 then index 6 refers to op2[0].
> >> >> If the length of op1 is 8 then index 6 refers to op1[6], etc.
> >> >>
> >> >> This comes back to (R) above.  We need to be able to prove at compile
> >> >> time that each pattern selects from the same input vectors (for all
> >> >> elements, not just the encoded elements).  If we can't prove that
> >> >> then we can't fold for variable-length vectors.
> >> > Hi Richard,
> >> > Thanks for the clarification!
> >> > I have come up with an approach to verify R:
> >> >
> >> > Consider following pattern:
> >> > a0, a1, a1 + S, ...,
> >> > nelts_per_pattern would be n / Psel, where n == actual length of the vector.
> >> > And last element of pattern will be given by:
> >> > a1 + (n/Psel - 2) * S
> >>
> >> (I think this is just a terminology thing, but in the source,
> >> nelts_per_pattern is a compile-time constant that describes the
> >> encoding.  It always has the value 1, 2 or 3, regardless of the
> >> runtime length.)
> >>
> >> > Rearranging the above term, we can think of pattern
> >> > as a line with following equation:
> >> > y = (S/Psel) * n + (a1 - 2S)
> >> > where (S/Psel) is the slope, and (a1 - 2S) is the y-intercept.
> >> >
> >> > At,
> >> > n = 2*Psel, y = a1
> >> > n = 3*Psel, y = a1 + S,
> >> > n = 4*Psel, y = a1 + 2S ...
> >> >
> >> > To compare with n, we compare the following lines:
> >> > y1 = (S/Psel) * n + (a1 - 2S)
> >> > y2 = n
> >> >
> >> > So to check if elements always come from first vector,
> >> > we want to check y1 < y2 for n > 0.
> >> > Likewise, if elements always come from second vector,
> >> > we want to check if y1 >= y2, for n > 0.
> >>
> >> One difficulty here is that the indices wrap around, so an index value of
> >> 2n selects from the first vector rather than the second.  (This is pretty
> >> awkward for VLA and doesn't match the native SVE TBL behaviour.)  So...
> >>
> >> > If both lines are parallel, ie S/PSel == 1,
> >> > then we choose first or second vector depending on the y-intercept a1 - 2S.
> >> > If a1 - 2S >= 0, then y1 >= y2 for all values of n, so select second vector.
> >> > If a1 - 2S < 0, then y1 < y2 for all values of n, so select first vector.
> >> >
> >> > For eg, if we have following pattern:
> >> > {0, 1, 3, ...}
> >> > where a1 = 1, S = 2, and consider PSel = 2.
> >> >
> >> > y1 = n - 3
> >> > y2 = n
> >> >
> >> > In this case, y1 < y2 for all values of n,  so we select first vector.
> >> >
> >> > Since y2 = n, passes thru origin with slope = 1,
> >> > a line can intersect it either in 1st or 3rd quadrant.
> >> > Calculate point of intersection:
> >> > n_int = Psel * (a1 - 2S) / (Psel - S);
> >> >
> >> > (a) n_int > 0
> >> > n_int > 0 => intersecting in 1st quadrant.
> >> > In this case there will be a cross-over at n_int.
> >> >
> >> > For eg, consider pattern { 0, 1, 4, ...}
> >> > a1 = 1, S = 3, and let's take PSel = 2
> >> >
> >> > y1 = (3/2)n - 5
> >> > y2 = n
> >> >
> >> > Both intersect at (10, 10).
> >> > So for n < 10, y1 < y2
> >> > and for n > 10, y1 > y2.
> >> > so in this case we can't fold since we will select elements from both vectors.
> >> >
> >> > (b) n_int <= 0
> >> > In this case, the lines will intersect in 3rd quadrant,
> >> > so depending upon the slope we can choose either vector.
> >> > If (S/Psel) < 1, ie y1 has a gentler slope than y2,
> >> > then y1 < y2 for n > 0
> >> > If (S/Psel) > 1, ie, y1 has a steeper slope than y2,
> >> > then y1 > y2 for n > 0.
> >> >
> >> > For eg, in the above pattern {0, 1, 4, ...}
> >> > a1 = 1, S = 3, and let's take PSel = 4
> >> >
> >> > y1 = (3/4)n - 5
> >> > y2 = n
> >> > Both intersect at (-20, -20).
> >> > y1's slope = (S/Psel) = (3/4) < 1
> >> > So y1 < y2 for n > 0.
> >> > Graph: https://www.desmos.com/calculator/ct7edqbr9d
> >> > So we pick first vector.
> >> >
> >> > The following pseudo code attempts to capture this:
> >> >
> >> > tree select_vector_for_pattern (op1, op2, a1, S, Psel)
> >> > {
> >> >   if (S == Psel)
> >> >     {
> >> >       /* If y1 intercept >= 0, then y1 >= y2
> >> >           for all values of n.  */
> >> >       if (a1 - 2*S >= 0)
> >> >         return op2;
> >> >       return op1;
> >> >     }
> >> >
> >> >    n_int = Psel * (a1 - 2*S) / (Psel - S)
> >> >    /* If intersecting in 1st quadrant, there will be cross over,
> >> >        bail out.  */
> >> >    if (n_int > 0)
> >> >      return NULL_TREE;
> >> >    /* If S/Psel < 1, ie y1 has gentler slope than y2,
> >> >       then y1 < y2 for n > 0.  */
> >> >    if (S < Psel)
> >> >      return op1;
> >> >    /* If S/Psel > 1, ie y1 has steeper slope than y2,
> >> >       then y1 > y2 for n > 0.  */
> >> >    return op2;
> >> > }
> >> >
> >> > Does this look reasonable ?
> >>
> >> ...I think we need to be more conservative.  I think we also need to
> >> distinguish n1 (the number of elements in the input vectors) and
> >> nsel (the number of elements in the selector).
> >>
> >> If nsel is a multiple of Psel and nsel >= 2 * Psel then like you say
> >> there will be (nsel /exact Psel) - 1 index elements from the stepped
> >> encoding and the final index value will be:
> >>
> >>   ae = a1 + (nsel /exact Psel - 2) * S
> >>
> >> Because of wrap-around, we need to ensure that that doesn't run
> >> into an adjoining vector.  I think the easiest way of doing that
> >> is to calculate a1 /trunc n1 and ae /trunc n1 (using can_div_trunc_p)
> >> and check that the quotients are equal.
> > IIUC, If a1/n1 == ae/n1, then the sequence will choose from the same
> > vector since ae is last elem,
> > and the quotient can choose the vector because it will be either 0 or
> > 1 (since indices wrap around after 2n).
>
> Right.
>
> > Um, could you please elaborate a bit on how will can_div_trunc_p
> > calculate quotients, when n1 and nsel are unknown
> > at compile time ?
> >
> > To calculate the quotients for a hard coded pattern,
> > with a1 = 1, nsel = n1 = len(VNx4SI), S = 3, Psel = 4,
> > I tried the following:
> >
> >   poly_uint64 n1 = GET_MODE_NUNITS (VNx4SImode);
> >   poly_uint64 nsel = n1;
> >   poly_uint64 a1 = 1;
> >   poly_uint64 Esel = exact_div (nsel, Psel);
>
> We can't use exact_div here.  We need to test that nsel is a multiple
> of Psel (e.g. using multiple_p).
>
> >   poly_uint64 ae = a1 + (Esel - 2) * S;
> >
> >   int q1, qe;
> >   poly_uint64 r1, re;
> >
> >   bool div1_p = can_div_trunc_p (a1, n1, &q1, &r1);
> >   bool dive_p = can_div_trunc_p (ae, n1, &qe, &re);
> >
> > Which gave strange values for qe and 0 for q1, with first call succeeding,
> > and second call returning false.
> > Am I calling it incorrectly ?
>
> No, that looks right.  I guess the issue is that ae < a1 for Psel == nsel
> and so for min(nsel) the index wraps around to the other input vector.
> IMO it's OK to punt on that.  We don't have interfaces for applying
> ranges to the indeterminates in a poly_int.
Hi Richard,
Thanks for the suggestions.
I tried the following to test for wrap around:

  poly_uint64 n1 = GET_MODE_NUNITS (VNx4SImode);
  poly_uint64 nsel = n1;
  poly_uint64 a1 = 1;
  unsigned Psel = 4;
  int S = 3;

  if (multiple_p (nsel, Psel))
    {
      poly_uint64 nelems = exact_div (nsel, Psel);
      poly_uint64 ae = a1 + (nelems - 2) * S;

      if (known_gt (ae, a1))
        {
          int q1, qe;
          poly_uint64 r1, re;

          bool ok1 = can_div_trunc_p (a1, n1, &q1, &r1);
          bool oke = can_div_trunc_p (ae, n1, &qe, &re);
        }
    }

However, the second call to can_div_trunc_p still returns false, with
strange values for qe.

Thanks,
Prathamesh

>
> I guess if we want to be fancy, we could look for a1 = a0 + S and,
> if true, do the calculation based on a0 rather than a1.  The combination
> of Psel >= min(nsel) and a1 != a0 + S, although valid in theory, should
> be a very niche case.
>
> But it might be better to stick to the a1-based case first and
> get that working.  We can then see whether it's worth extending.
>
> Thanks,
> Richard
>
> >
> > Thanks,
> > Prathamesh
> >
> >>
> >> However, I now realise that there's a wrinkle.  If S < 0 then we
> >> also need to check that either:
> >>
> >> (a) the chosen input vector (given by the quotient above) has either:
> >>
> >>     (i) nelts_per_pattern == 1
> >>     (ii) nelts_per_pattern == 3 and the difference between the
> >>          first and second elements in each pattern is the same as
> >>          the difference between the second and third elements
> >>          (i.e. every pattern is a natural stepped one).
> >>
> >> (b) ae % n1 >= the number of patterns in the input vector.
> >>     (ae % n1 is calculated as a side-effect of can_div_trunc_p).
> >>
> >> Otherwise the index vector has the effect of moving the "foreground"
> >> from the front of the input vector to the end of the result vector.
> >>
> >> If nsel == Psel then the stepped part of the sequence doesn't matter.
> >> Thus, the same condition works whenever nsel is a multiple of Psel.
> >>
> >> If nsel is not a multiple of Psel then I think we should punt for now.
> >> There are some cases that we could handle when n1 == nsel, but "nsel
> >> is a multiple of Psel" will be the normal case.
> >>
> >> Thanks,
> >> Richard

Richard Sandiford Sept. 26, 2022, 8:29 p.m. UTC | #11

Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> On Fri, 23 Sept 2022 at 21:33, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> > On Tue, 20 Sept 2022 at 18:09, Richard Sandiford
>> > <richard.sandiford@arm.com> wrote:
>> >>
>> >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> >> > On Mon, 12 Sept 2022 at 19:57, Richard Sandiford
>> >> > <richard.sandiford@arm.com> wrote:
>> >> >>
>> >> >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> >> >> >> The VLA encoding encodes the first N patterns explicitly.  The
>> >> >> >> npatterns/nelts_per_pattern values then describe how to extend that
>> >> >> >> initial sequence to an arbitrary number of elements.  So when performing
>> >> >> >> an operation on (potentially) variable-length vectors, the question is:
>> >> >> >>
>> >> >> >> * Can we work out an initial sequence and npatterns/nelts_per_pattern
>> >> >> >>   pair that will be correct for all elements of the result?
>> >> >> >>
>> >> >> >> This depends on the operation that we're performing.  E.g. it's
>> >> >> >> different for unary operations (vector_builder::new_unary_operation)
>> >> >> >> and binary operations (vector_builder::new_binary_operation).  It also
>> >> >> >> varies between unary operations and between binary operations, hence
>> >> >> >> the allow_stepped_p parameters.
>> >> >> >>
>> >> >> >> For VEC_PERM_EXPR, I think the key requirement is that:
>> >> >> >>
>> >> >> >> (R) Each individual selector pattern must always select from the same vector.
>> >> >> >>
>> >> >> >> Whether this condition is met depends both on the pattern itself and on
>> >> >> >> the number of patterns that it's combined with.
>> >> >> >>
>> >> >> >> E.g. suppose we had the selector pattern:
>> >> >> >>
>> >> >> >>   { 0, 1, 4, ... }   i.e. 3x - 2 for x > 0
>> >> >> >>
>> >> >> >> If the arguments and selector are n elements then this pattern on its
>> >> >> >> own would select from more than one argument if 3(n-1) - 2 >= n.
>> >> >> >> This is clearly true for large enough n.  So if n is variable then
>> >> >> >> we cannot represent this.
>> >> >> >>
>> >> >> >> If the pattern above is one of two patterns, so interleaved as:
>> >> >> >>
>> >> >> >>      { 0, _, 1, _, 4, _, ... }  o=0
>> >> >> >>   or { _, 0, _, 1, _, 4, ... }  o=1
>> >> >> >>
>> >> >> >> then the pattern would select from more than one argument if
>> >> >> >> 3(n/2-1) - 2 + o >= n.  This too would be a problem for variable n.
>> >> >> >>
>> >> >> >> But if the pattern above is one of four patterns then it selects
>> >> >> >> from more than one argument if 3(n/4-1) - 2 + o >= n.  This is not
>> >> >> >> true for any valid n or o, so the pattern is OK.
>> >> >> >>
>> >> >> >> So let's define some ad hoc terminology:
>> >> >> >>
>> >> >> >> * Px is the number of patterns in x
>> >> >> >> * Ex is the number of elements per pattern in x
>> >> >> >>
>> >> >> >> where x can be:
>> >> >> >>
>> >> >> >> * 1: first argument
>> >> >> >> * 2: second argument
>> >> >> >> * s: selector
>> >> >> >> * r: result
>> >> >> >>
>> >> >> >> Then:
>> >> >> >>
>> >> >> >> (1) The number of elements encoded explicitly for x is Ex*Px
>> >> >> >>
>> >> >> >> (2) The explicit encoding can be used to produce a sequence of N*Ex*Px
>> >> >> >>     elements for any integer N.  This extended sequence can be reencoded
>> >> >> >>     as having N*Px patterns, with Ex staying the same.
>> >> >> >>
>> >> >> >> (3) If Ex < 3, Ex can be increased by 1 by repeating the final Px elements
>> >> >> >>     of the explicit encoding.
>> >> >> >>
>> >> >> >> So let's assume (optimistically) that we can produce the result
>> >> >> >> by calculating the first Pr*Er elements and using the Pr,Er encoding
>> >> >> >> to imply the rest.  Then:
>> >> >> >>
>> >> >> >> * (2) means that, when combining multiple input operands with potentially
>> >> >> >>   different encodings, we can set the number of patterns in the result
>> >> >> >>   to the least common multiple of the number of patterns in the inputs.
>> >> >> >>   In this case:
>> >> >> >>
>> >> >> >>   Pr = least_common_multiple(P1, P2, Ps)
>> >> >> >>
>> >> >> >>   is a valid number of patterns.
>> >> >> >>
>> >> >> >> * (3) means that the number of elements per pattern of the result can
>> >> >> >>   be the maximum of the number of elements per pattern in the inputs.
>> >> >> >>   (Alternatively, we could always use 3.)  In this case:
>> >> >> >>
>> >> >> >>   Er = max(E1, E2, Es)
>> >> >> >>
>> >> >> >>   is a valid number of elements per pattern.
>> >> >> >>
>> >> >> >> So if (R) holds we can compute the result -- for both VLA and VLS -- by
>> >> >> >> calculating the first Pr*Er elements of the result and using the
>> >> >> >> encoding to derive the rest.  If (R) doesn't hold then we need the
>> >> >> >> selector to be constant-length.  We should then fill in the result
>> >> >> >> based on:
>> >> >> >>
>> >> >> >> - Pr == number of elements in the result
>> >> >> >> - Er == 1
>> >> >> >>
>> >> >> >> But this should be the fallback option, even for VLS.
>> >> >> >>
>> >> >> >> As far as the arguments go: we should reject CONSTRUCTORs for
>> >> >> >> variable-length types.  After doing that, we can treat a CONSTRUCTOR
>> >> >> >> for an N-element vector type by setting the number of patterns to N
>> >> >> >> and the number of elements per pattern to 1.
>> >> >> > Hi Richard,
>> >> >> > Thanks for the suggestions, and sorry for late response.
>> >> >> > I have a couple of very elementary questions:
>> >> >> >
>> >> >> > 1: Consider following inputs to VEC_PERM_EXPR:
>> >> >> > op1: P_op1 == 4, E_op1 == 1
>> >> >> > {1, 2, 3, 4, ...}
>> >> >> >
>> >> >> > op2: P_op2 == 2, E_op2 == 2
>> >> >> > {11, 21, 12, 22, ...}
>> >> >> >
>> >> >> > sel: P_sel == 3, E_sel == 1
>> >> >> > {0, 4, 5, ...}
>> >> >> >
>> >> >> > What shall be the result in this case ?
>> >> >> > P_res = lcm(4, 2, 3) == 12
>> >> >> > E_res = max(1, 2, 1) == 2.
>> >> >>
>> >> >> Yeah, that looks right.  Of course, since sel is just repeating
>> >> >> every three elements, it could just be P_res==3, E_sel==1,
>> >> >> but the vector_builder would do that optimisation for us.
>> >> >>
>> >> >> (I'm not sure whether we'd see a P==3 encoding in practice,
>> >> >> but perhaps it's possible.)
>> >> >>
>> >> >> If sel was P_sel==1, E_sel==3 (so a stepped encoding rather than
>> >> >> repeating every three elements) then:
>> >> >>
>> >> >> P_res = lcm(4, 2) == 4
>> >> >> E_res = max(1, 2, 3) == 3
>> >> >>
>> >> >> which also looks like it would give the right encoding.
>> >> >>
>> >> >> > 2. How should we specify index of element in sel when it is not
>> >> >> > explicitly encoded in the operand ?
>> >> >> > For eg:
>> >> >> > op1: npatterns == 2, nelts_per_pattern == 3
>> >> >> > { 1, 0, 2, 0, 3, 0, ... }
>> >> >> > op2: npatterns == 6, nelts_per_pattern == 1
>> >> >> > { 11, 12, 13, 14, 15, 16, ...}
>> >> >> >
>> >> >> > In sel, how do we refer to element with value 4, that would be 4th element
>> >> >> > of first pattern in op1, but not explicitly encoded ?
>> >> >> > In op1, 4 will come at index == 6.
>> >> >> > However in sel, index 6 would refer to 11, ie op2[0] ?
>> >> >>
>> >> >> What index 6 refers to depends on the length of op1.
>> >> >> If the length of op1 is 4 at runtime the index 6 refers to op2[2].
>> >> >> If the length of op1 is 6 then index 6 refers to op2[0].
>> >> >> If the length of op1 is 8 then index 6 refers to op1[6], etc.
>> >> >>
>> >> >> This comes back to (R) above.  We need to be able to prove at compile
>> >> >> time that each pattern selects from the same input vectors (for all
>> >> >> elements, not just the encoded elements).  If we can't prove that
>> >> >> then we can't fold for variable-length vectors.
>> >> > Hi Richard,
>> >> > Thanks for the clarification!
>> >> > I have come up with an approach to verify R:
>> >> >
>> >> > Consider following pattern:
>> >> > a0, a1, a1 + S, ...,
>> >> > nelts_per_pattern would be n / Psel, where n == actual length of the vector.
>> >> > And last element of pattern will be given by:
>> >> > a1 + (n/Psel - 2) * S
>> >>
>> >> (I think this is just a terminology thing, but in the source,
>> >> nelts_per_pattern is a compile-time constant that describes the
>> >> encoding.  It always has the value 1, 2 or 3, regardless of the
>> >> runtime length.)
>> >>
>> >> > Rearranging the above term, we can think of pattern
>> >> > as a line with following equation:
>> >> > y = (S/Psel) * n + (a1 - 2S)
>> >> > where (S/Psel) is the slope, and (a1 - 2S) is the y-intercept.
>> >> >
>> >> > At,
>> >> > n = 2*Psel, y = a1
>> >> > n = 3*Psel, y = a1 + S,
>> >> > n = 4*Psel, y = a1 + 2S ...
>> >> >
>> >> > To compare with n, we compare the following lines:
>> >> > y1 = (S/Psel) * n + (a1 - 2S)
>> >> > y2 = n
>> >> >
>> >> > So to check if elements always come from first vector,
>> >> > we want to check y1 < y2 for n > 0.
>> >> > Likewise, if elements always come from second vector,
>> >> > we want to check if y1 >= y2, for n > 0.
>> >>
>> >> One difficulty here is that the indices wrap around, so an index value of
>> >> 2n selects from the first vector rather than the second.  (This is pretty
>> >> awkward for VLA and doesn't match the native SVE TBL behaviour.)  So...
>> >>
>> >> > If both lines are parallel, ie S/PSel == 1,
>> >> > then we choose first or second vector depending on the y-intercept a1 - 2S.
>> >> > If a1 - 2S >= 0, then y1 >= y2 for all values of n, so select second vector.
>> >> > If a1 - 2S < 0, then y1 < y2 for all values of n, so select first vector.
>> >> >
>> >> > For eg, if we have following pattern:
>> >> > {0, 1, 3, ...}
>> >> > where a1 = 1, S = 2, and consider PSel = 2.
>> >> >
>> >> > y1 = n - 3
>> >> > y2 = n
>> >> >
>> >> > In this case, y1 < y2 for all values of n,  so we select first vector.
>> >> >
>> >> > Since y2 = n, passes thru origin with slope = 1,
>> >> > a line can intersect it either in 1st or 3rd quadrant.
>> >> > Calculate point of intersection:
>> >> > n_int = Psel * (a1 - 2S) / (Psel - S);
>> >> >
>> >> > (a) n_int > 0
>> >> > n_int > 0 => intersecting in 1st quadrant.
>> >> > In this case there will be a cross-over at n_int.
>> >> >
>> >> > For eg, consider pattern { 0, 1, 4, ...}
>> >> > a1 = 1, S = 3, and let's take PSel = 2
>> >> >
>> >> > y1 = (3/2)n - 5
>> >> > y2 = n
>> >> >
>> >> > Both intersect at (10, 10).
>> >> > So for n < 10, y1 < y2
>> >> > and for n > 10, y1 > y2.
>> >> > so in this case we can't fold since we will select elements from both vectors.
>> >> >
>> >> > (b) n_int <= 0
>> >> > In this case, the lines will intersect in 3rd quadrant,
>> >> > so depending upon the slope we can choose either vector.
>> >> > If (S/Psel) < 1, ie y1 has a gentler slope than y2,
>> >> > then y1 < y2 for n > 0
>> >> > If (S/Psel) > 1, ie, y1 has a steeper slope than y2,
>> >> > then y1 > y2 for n > 0.
>> >> >
>> >> > For eg, in the above pattern {0, 1, 4, ...}
>> >> > a1 = 1, S = 3, and let's take PSel = 4
>> >> >
>> >> > y1 = (3/4)n - 5
>> >> > y2 = n
>> >> > Both intersect at (-20, -20).
>> >> > y1's slope = (S/Psel) = (3/4) < 1
>> >> > So y1 < y2 for n > 0.
>> >> > Graph: https://www.desmos.com/calculator/ct7edqbr9d
>> >> > So we pick first vector.
>> >> >
>> >> > The following pseudo code attempts to capture this:
>> >> >
>> >> > tree select_vector_for_pattern (op1, op2, a1, S, Psel)
>> >> > {
>> >> >   if (S == Psel)
>> >> >     {
>> >> >       /* If y1 intercept >= 0, then y1 >= y2
>> >> >           for all values of n.  */
>> >> >       if (a1 - 2*S >= 0)
>> >> >         return op2;
>> >> >       return op1;
>> >> >     }
>> >> >
>> >> >    n_int = Psel * (a1 - 2*S) / (Psel - S)
>> >> >    /* If intersecting in 1st quadrant, there will be cross over,
>> >> >        bail out.  */
>> >> >    if (n_int > 0)
>> >> >      return NULL_TREE;
>> >> >    /* If S/Psel < 1, ie y1 has gentler slope than y2,
>> >> >       then y1 < y2 for n > 0.  */
>> >> >    if (S < Psel)
>> >> >      return op1;
>> >> >    /* If S/Psel > 1, ie y1 has steeper slope than y2,
>> >> >       then y1 > y2 for n > 0.  */
>> >> >    return op2;
>> >> > }
>> >> >
>> >> > Does this look reasonable ?
>> >>
>> >> ...I think we need to be more conservative.  I think we also need to
>> >> distinguish n1 (the number of elements in the input vectors) and
>> >> nsel (the number of elements in the selector).
>> >>
>> >> If nsel is a multiple of Psel and nsel >= 2 * Psel then like you say
>> >> there will be (nsel /exact Psel) - 1 index elements from the stepped
>> >> encoding and the final index value will be:
>> >>
>> >>   ae = a1 + (nsel /exact Psel - 2) * S
>> >>
>> >> Because of wrap-around, we need to ensure that that doesn't run
>> >> into an adjoining vector.  I think the easiest way of doing that
>> >> is to calculate a1 /trunc n1 and ae /trunc n1 (using can_div_trunc_p)
>> >> and check that the quotients are equal.
>> > IIUC, If a1/n1 == ae/n1, then the sequence will choose from the same
>> > vector since ae is last elem,
>> > and the quotient can choose the vector because it will be either 0 or
>> > 1 (since indices wrap around after 2n).
>>
>> Right.
>>
>> > Um, could you please elaborate a bit on how will can_div_trunc_p
>> > calculate quotients, when n1 and nsel are unknown
>> > at compile time ?
>> >
>> > To calculate the quotients for a hard coded pattern,
>> > with a1 = 1, nsel = n1 = len(VNx4SI), S = 3, Psel = 4,
>> > I tried the following:
>> >
>> >   poly_uint64 n1 = GET_MODE_NUNITS (VNx4SImode);
>> >   poly_uint64 nsel = n1;
>> >   poly_uint64 a1 = 1;
>> >   poly_uint64 Esel = exact_div (nsel, Psel);
>>
>> We can't use exact_div here.  We need to test that nsel is a multiple
>> of Psel (e.g. using multiple_p).
>>
>> >   poly_uint64 ae = a1 + (Esel - 2) * S;
>> >
>> >   int q1, qe;
>> >   poly_uint64 r1, re;
>> >
>> >   bool div1_p = can_div_trunc_p (a1, n1, &q1, &r1);
>> >   bool dive_p = can_div_trunc_p (ae, n1, &qe, &re);
>> >
>> > Which gave strange values for qe and 0 for q1, with first call succeeding,
>> > and second call returning false.
>> > Am I calling it incorrectly ?
>>
>> No, that looks right.  I guess the issue is that ae < a1 for Psel == nsel
>> and so for min(nsel) the index wraps around to the other input vector.
>> IMO it's OK to punt on that.  We don't have interfaces for applying
>> ranges to the indeterminates in a poly_int.
> Hi Richard,
> Thanks for the suggestions.
> I tried the following to test for wrap around:
>
>   poly_uint64 n1 = GET_MODE_NUNITS (VNx4SImode);
>   poly_uint64 nsel = n1;
>   poly_uint64 a1 = 1;
>   unsigned Psel = 4;
>   int S = 3;
>
>   if (multiple_p (nsel, Psel))
>     {
>       poly_uint64 nelems = exact_div (nsel, Psel);
>       poly_uint64 ae = a1 + (nelems - 2) * S;
>
>       if (known_gt (ae, a1))
>         {
>           int q1, qe;
>           poly_uint64 r1, re;
>
>           bool ok1 = can_div_trunc_p (a1, n1, &q1, &r1);
>           bool oke = can_div_trunc_p (ae, n1, &qe, &re);
>         }
>     }
>
> However, the second call to can_div_trunc_p still returns false, with
> strange values for qe.

The known_gt (ae, a1) part isn't required.  It's OK for S to be negative.

What I meant above is that Psel is a valid value of nsel.  When nsel
*does* equal Psel, ae will select from a different input vector from a1.
(Although I said ae < a1, that isn't the important bit, sorry.
The important bit is that a1 > 0 and ae < 0.)

That's why not having conditions that apply ranges to the indeterminates
is a problem.  I guess what we want to ask is "does ae select from the
same input as a1 for all nsel >= Psel*2?".  But we don't have a way of
asking that.  We can only ask for all nsel, and the answer to "does ae
select from the same input as a1 for all nsel?" is "no".

Perhaps another way of saying this is: if the selector was a natural
stepped vector, the first value in the pattern (a0) would be -2.  The
selector { -2, 1, 4, ... } *would* cross input vectors.

This means that a selector with the above values of a1 and S could only
be valid if a0 is in the range [0, nsel).  It's valid for a0 to be in
that range, because it can be arbitrarily different from a1 and above,
but it means that the test for a1 and above is no longer a simple linear
one, of the type we're trying to use here.

So what I meant was, we should accept that the fold will be rejected
for this case.  I don't think that matters in practice.
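
To make the nsel == Psel problem concrete, here is a minimal fixed-length
sketch.  It is not GCC code: floor_div, last_index and same_input_p are
names made up for this illustration, and the lengths are plain integers
rather than poly_ints.  It just plugs constant values of nsel into the
formula for ae and compares the quotients of a1 and ae by n1.

```cpp
#include <cassert>
#include <cstdint>

// Division rounding towards negative infinity, for a positive divisor.
// (Illustrative helper, not a GCC function.)
static int64_t
floor_div (int64_t a, int64_t b)
{
  assert (b > 0);
  int64_t q = a / b;
  // C++ division truncates towards zero, so adjust for negative,
  // inexact dividends.
  if (a % b < 0)
    --q;
  return q;
}

// Final index of a stepped selector pattern whose second element is A1
// and whose step is S, when the selector has NSEL elements spread over
// PSEL patterns: ae = a1 + (nsel / psel - 2) * s.
static int64_t
last_index (int64_t a1, int64_t s, int64_t psel, int64_t nsel)
{
  assert (psel > 0 && nsel % psel == 0);
  return a1 + (nsel / psel - 2) * s;
}

// True if A1 and the final index fall within the same N1-element input,
// i.e. if the two quotients by N1 are equal.
static bool
same_input_p (int64_t a1, int64_t s, int64_t psel, int64_t nsel, int64_t n1)
{
  int64_t ae = last_index (a1, s, psel, nsel);
  return floor_div (a1, n1) == floor_div (ae, n1);
}
```

With a1 = 1, S = 3, Psel = 4 and n1 == nsel, same_input_p is false for
nsel == 4 (ae is -2) and true for nsel == 8, 12 and 16.  The poly_int
query has to hold for every possible nsel, including nsel == 4, which is
why it fails.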

On the last bit, the values returned by reference from can_div_trunc_p
are only meaningful when the function returns true.  The variables
aren't updated otherwise (which is by design).
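
As a rough model of that calling convention (again not GCC code), the
sketch below uses a hypothetical two-coefficient type poly2 standing for
c0 + c1 * x with x >= 0, and a toy function toy_div_trunc that only
handles non-negative dividends.  Both names are invented for this
example, and the real can_div_trunc_p is more general.  The assumption
is that the length of VNx4SI has the form 4 + 4x, so that with Psel = 4
and S = 3 the value of ae is -2 + 3x.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a poly_int with two coefficients:
// the value is c0 + c1 * x for some runtime x >= 0.
struct poly2
{
  int64_t c0;
  int64_t c1;
};

// True if P is known to be >= 0 for every x >= 0.
static bool
known_nonneg (poly2 p)
{
  return p.c0 >= 0 && p.c1 >= 0;
}

// True if P is known to be > 0 for every x >= 0.
static bool
known_pos (poly2 p)
{
  return p.c0 > 0 && p.c1 >= 0;
}

// Toy model: try to find a constant Q such that A == Q * B + R with
// 0 <= R < B for every x >= 0.  On success, store Q and R through the
// pointers and return true.  On failure, return false and leave
// *QUOTIENT and *REMAINDER untouched.
static bool
toy_div_trunc (poly2 a, poly2 b, int64_t *quotient, poly2 *remainder)
{
  assert (known_pos (b));
  // Guess the quotient from the constant coefficients, then check that
  // the same quotient works for every x.
  int64_t q = a.c0 / b.c0;
  poly2 r = { a.c0 - q * b.c0, a.c1 - q * b.c1 };
  poly2 slack = { b.c0 - r.c0, b.c1 - r.c1 };
  if (!known_nonneg (r) || !known_pos (slack))
    return false;
  *quotient = q;
  *remainder = r;
  return true;
}
```

In this model, dividing a1 = 1 by n1 = 4 + 4x succeeds with quotient 0
and remainder 1.  Dividing ae = -2 + 3x by the same n1 fails, because no
single quotient is right for both x == 0 and larger x, and the caller's
variables keep whatever values they had before the call.  So a caller
should only read the outputs after checking the return value.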

Thanks,
Richard


Prathamesh Kulkarni Sept. 30, 2022, 2:41 p.m. UTC | #12

On Tue, 27 Sept 2022 at 01:59, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > On Fri, 23 Sept 2022 at 21:33, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> > On Tue, 20 Sept 2022 at 18:09, Richard Sandiford
> >> > <richard.sandiford@arm.com> wrote:
> >> >>
> >> >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> >> > On Mon, 12 Sept 2022 at 19:57, Richard Sandiford
> >> >> > <richard.sandiford@arm.com> wrote:
> >> >> >>
> >> >> >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> >> >> >> The VLA encoding encodes the first N patterns explicitly.  The
> >> >> >> >> npatterns/nelts_per_pattern values then describe how to extend that
> >> >> >> >> initial sequence to an arbitrary number of elements.  So when performing
> >> >> >> >> an operation on (potentially) variable-length vectors, the question is:
> >> >> >> >>
> >> >> >> >> * Can we work out an initial sequence and npatterns/nelts_per_pattern
> >> >> >> >>   pair that will be correct for all elements of the result?
> >> >> >> >>
> >> >> >> >> This depends on the operation that we're performing.  E.g. it's
> >> >> >> >> different for unary operations (vector_builder::new_unary_operation)
> >> >> >> >> and binary operations (vector_builder::new_binary_operation).  It also
> >> >> >> >> varies between unary operations and between binary operations, hence
> >> >> >> >> the allow_stepped_p parameters.
> >> >> >> >>
> >> >> >> >> For VEC_PERM_EXPR, I think the key requirement is that:
> >> >> >> >>
> >> >> >> >> (R) Each individual selector pattern must always select from the same vector.
> >> >> >> >>
> >> >> >> >> Whether this condition is met depends both on the pattern itself and on
> >> >> >> >> the number of patterns that it's combined with.
> >> >> >> >>
> >> >> >> >> E.g. suppose we had the selector pattern:
> >> >> >> >>
> >> >> >> >>   { 0, 1, 4, ... }   i.e. 3x - 2 for x > 0
> >> >> >> >>
> >> >> >> >> If the arguments and selector are n elements then this pattern on its
> >> >> >> >> own would select from more than one argument if 3(n-1) - 2 >= n.
> >> >> >> >> This is clearly true for large enough n.  So if n is variable then
> >> >> >> >> we cannot represent this.
> >> >> >> >>
> >> >> >> >> If the pattern above is one of two patterns, so interleaved as:
> >> >> >> >>
> >> >> >> >>      { 0, _, 1, _, 4, _, ... }  o=0
> >> >> >> >>   or { _, 0, _, 1, _, 4, ... }  o=1
> >> >> >> >>
> >> >> >> >> then the pattern would select from more than one argument if
> >> >> >> >> 3(n/2-1) - 2 + o >= n.  This too would be a problem for variable n.
> >> >> >> >>
> >> >> >> >> But if the pattern above is one of four patterns then it selects
> >> >> >> >> from more than one argument if 3(n/4-1) - 2 + o >= n.  This is not
> >> >> >> >> true for any valid n or o, so the pattern is OK.
> >> >> >> >>
> >> >> >> >> So let's define some ad hoc terminology:
> >> >> >> >>
> >> >> >> >> * Px is the number of patterns in x
> >> >> >> >> * Ex is the number of elements per pattern in x
> >> >> >> >>
> >> >> >> >> where x can be:
> >> >> >> >>
> >> >> >> >> * 1: first argument
> >> >> >> >> * 2: second argument
> >> >> >> >> * s: selector
> >> >> >> >> * r: result
> >> >> >> >>
> >> >> >> >> Then:
> >> >> >> >>
> >> >> >> >> (1) The number of elements encoded explicitly for x is Ex*Px
> >> >> >> >>
> >> >> >> >> (2) The explicit encoding can be used to produce a sequence of N*Ex*Px
> >> >> >> >>     elements for any integer N.  This extended sequence can be reencoded
> >> >> >> >>     as having N*Px patterns, with Ex staying the same.
> >> >> >> >>
> >> >> >> >> (3) If Ex < 3, Ex can be increased by 1 by repeating the final Px elements
> >> >> >> >>     of the explicit encoding.
> >> >> >> >>
> >> >> >> >> So let's assume (optimistically) that we can produce the result
> >> >> >> >> by calculating the first Pr*Er elements and using the Pr,Er encoding
> >> >> >> >> to imply the rest.  Then:
> >> >> >> >>
> >> >> >> >> * (2) means that, when combining multiple input operands with potentially
> >> >> >> >>   different encodings, we can set the number of patterns in the result
> >> >> >> >>   to the least common multiple of the number of patterns in the inputs.
> >> >> >> >>   In this case:
> >> >> >> >>
> >> >> >> >>   Pr = least_common_multiple(P1, P2, Ps)
> >> >> >> >>
> >> >> >> >>   is a valid number of patterns.
> >> >> >> >>
> >> >> >> >> * (3) means that the number of elements per pattern of the result can
> >> >> >> >>   be the maximum of the number of elements per pattern in the inputs.
> >> >> >> >>   (Alternatively, we could always use 3.)  In this case:
> >> >> >> >>
> >> >> >> >>   Er = max(E1, E2, Es)
> >> >> >> >>
> >> >> >> >>   is a valid number of elements per pattern.
> >> >> >> >>
> >> >> >> >> So if (R) holds we can compute the result -- for both VLA and VLS -- by
> >> >> >> >> calculating the first Pr*Er elements of the result and using the
> >> >> >> >> encoding to derive the rest.  If (R) doesn't hold then we need the
> >> >> >> >> selector to be constant-length.  We should then fill in the result
> >> >> >> >> based on:
> >> >> >> >>
> >> >> >> >> - Pr == number of elements in the result
> >> >> >> >> - Er == 1
> >> >> >> >>
> >> >> >> >> But this should be the fallback option, even for VLS.
> >> >> >> >>
> >> >> >> >> As far as the arguments go: we should reject CONSTRUCTORs for
> >> >> >> >> variable-length types.  After doing that, we can treat a CONSTRUCTOR
> >> >> >> >> for an N-element vector type by setting the number of patterns to N
> >> >> >> >> and the number of elements per pattern to 1.
> >> >> >> > Hi Richard,
> >> >> >> > Thanks for the suggestions, and sorry for late response.
> >> >> >> > I have a couple of very elementary questions:
> >> >> >> >
> >> >> >> > 1: Consider following inputs to VEC_PERM_EXPR:
> >> >> >> > op1: P_op1 == 4, E_op1 == 1
> >> >> >> > {1, 2, 3, 4, ...}
> >> >> >> >
> >> >> >> > op2: P_op2 == 2, E_op2 == 2
> >> >> >> > {11, 21, 12, 22, ...}
> >> >> >> >
> >> >> >> > sel: P_sel == 3, E_sel == 1
> >> >> >> > {0, 4, 5, ...}
> >> >> >> >
> >> >> >> > What shall be the result in this case ?
> >> >> >> > P_res = lcm(4, 2, 3) == 12
> >> >> >> > E_res = max(1, 2, 1) == 2.
> >> >> >>
> >> >> >> Yeah, that looks right.  Of course, since sel is just repeating
> >> >> >> every three elements, it could just be P_res==3, E_sel==1,
> >> >> >> but the vector_builder would do that optimisation for us.
> >> >> >>
> >> >> >> (I'm not sure whether we'd see a P==3 encoding in practice,
> >> >> >> but perhaps it's possible.)
> >> >> >>
> >> >> >> If sel was P_sel==1, E_sel==3 (so a stepped encoding rather than
> >> >> >> repeating every three elements) then:
> >> >> >>
> >> >> >> P_res = lcm(4, 2) == 4
> >> >> >> E_res = max(1, 2, 3) == 3
> >> >> >>
> >> >> >> which also looks like it would give the right encoding.
> >> >> >>
> >> >> >> > 2. How should we specify index of element in sel when it is not
> >> >> >> > explicitly encoded in the operand ?
> >> >> >> > For eg:
> >> >> >> > op1: npatterns == 2, nelts_per_pattern == 3
> >> >> >> > { 1, 0, 2, 0, 3, 0, ... }
> >> >> >> > op2: npatterns == 6, nelts_per_pattern == 1
> >> >> >> > { 11, 12, 13, 14, 15, 16, ...}
> >> >> >> >
> >> >> >> > In sel, how do we refer to element with value 4, that would be 4th element
> >> >> >> > of first pattern in op1, but not explicitly encoded ?
> >> >> >> > In op1, 4 will come at index == 6.
> >> >> >> > However in sel, index 6 would refer to 11, ie op2[0] ?
> >> >> >>
> >> >> >> What index 6 refers to depends on the length of op1.
> >> >> >> If the length of op1 is 4 at runtime the index 6 refers to op2[2].
> >> >> >> If the length of op1 is 6 then index 6 refers to op2[0].
> >> >> >> If the length of op1 is 8 then index 6 refers to op1[6], etc.
> >> >> >>
> >> >> >> This comes back to (R) above.  We need to be able to prove at compile
> >> >> >> time that each pattern selects from the same input vectors (for all
> >> >> >> elements, not just the encoded elements).  If we can't prove that
> >> >> >> then we can't fold for variable-length vectors.
> >> >> > Hi Richard,
> >> >> > Thanks for the clarification!
> >> >> > I have come up with an approach to verify R:
> >> >> >
> >> >> > Consider following pattern:
> >> >> > a0, a1, a1 + S, ...,
> >> >> > nelts_per_pattern would be n / Psel, where n == actual length of the vector.
> >> >> > And last element of pattern will be given by:
> >> >> > a1 + (n/Psel - 2) * S
> >> >>
> >> >> (I think this is just a terminology thing, but in the source,
> >> >> nelts_per_pattern is a compile-time constant that describes the
> >> >> encoding.  It always has the value 1, 2 or 3, regardless of the
> >> >> runtime length.)
> >> >>
> >> >> > Rearranging the above term, we can think of pattern
> >> >> > as a line with following equation:
> >> >> > y = (S/Psel) * n + (a1 - 2S)
> >> >> > where (S/Psel) is the slope, and (a1 - 2S) is the y-intercept.
> >> >> >
> >> >> > At,
> >> >> > n = 2*Psel, y = a1
> >> >> > n = 3*Psel, y = a1 + S,
> >> >> > n = 4*Psel, y = a1 + 2S ...
> >> >> >
> >> >> > To compare with n, we compare the following lines:
> >> >> > y1 = (S/Psel) * n + (a1 - 2S)
> >> >> > y2 = n
> >> >> >
> >> >> > So to check if elements always come from first vector,
> >> >> > we want to check y1 < y2 for n > 0.
> >> >> > Likewise, if elements always come from second vector,
> >> >> > we want to check if y1 >= y2, for n > 0.
> >> >>
> >> >> One difficulty here is that the indices wrap around, so an index value of
> >> >> 2n selects from the first vector rather than the second.  (This is pretty
> >> >> awkward for VLA and doesn't match the native SVE TBL behaviour.)  So...
> >> >>
> >> >> > If both lines are parallel, ie S/PSel == 1,
> >> >> > then we choose first or second vector depending on the y-intercept a1 - 2S.
> >> >> > If a1 - 2S >= 0, then y1 >= y2 for all values of n, so select second vector.
> >> >> > If a1 - 2S < 0, then y1 < y2 for all values of n, so select first vector.
> >> >> >
> >> >> > For eg, if we have following pattern:
> >> >> > {0, 1, 3, ...}
> >> >> > where a1 = 1, S = 2, and consider PSel = 2.
> >> >> >
> >> >> > y1 = n - 3
> >> >> > y2 = n
> >> >> >
> >> >> > In this case, y1 < y2 for all values of n,  so we select first vector.
> >> >> >
> >> >> > Since y2 = n, passes thru origin with slope = 1,
> >> >> > a line can intersect it either in 1st or 3rd quadrant.
> >> >> > Calculate point of intersection:
> >> >> > n_int = Psel * (a1 - 2S) / (Psel - S);
> >> >> >
> >> >> > (a) n_int > 0
> >> >> > n_int > 0 => intersecting in 1st quadrant.
> >> >> > In this case there will be a cross-over at n_int.
> >> >> >
> >> >> > For eg, consider pattern { 0, 1, 4, ...}
> >> >> > a1 = 1, S = 3, and let's take PSel = 2
> >> >> >
> >> >> > y1 = (3/2)n - 5
> >> >> > y2 = n
> >> >> >
> >> >> > Both intersect at (10, 10).
> >> >> > So for n < 10, y1 < y2
> >> >> > and for n > 10, y1 > y2.
> >> >> > so in this case we can't fold since we will select elements from both vectors.
> >> >> >
> >> >> > (b) n_int <= 0
> >> >> > In this case, the lines will intersect in 3rd quadrant,
> >> >> > so depending upon the slope we can choose either vector.
> >> >> > If (S/Psel) < 1, ie y1 has a gentler slope than y2,
> >> >> > then y1 < y2 for n > 0
> >> >> > If (S/Psel) > 1, ie, y1 has a steeper slope than y2,
> >> >> > then y1 > y2 for n > 0.
> >> >> >
> >> >> > For eg, in the above pattern {0, 1, 4, ...}
> >> >> > a1 = 1, S = 3, and let's take PSel = 4
> >> >> >
> >> >> > y1 = (3/4)n - 5
> >> >> > y2 = n
> >> >> > Both intersect at (-20, -20).
> >> >> > y1's slope = (S/Psel) = (3/4) < 1
> >> >> > So y1 < y2 for n > 0.
> >> >> > Graph: https://www.desmos.com/calculator/ct7edqbr9d
> >> >> > So we pick first vector.
> >> >> >
> >> >> > The following pseudo code attempts to capture this:
> >> >> >
> >> >> > tree select_vector_for_pattern (op1, op2, a1, S, Psel)
> >> >> > {
> >> >> >   if (S == Psel)
> >> >> >     {
> >> >> >       /* If y1 intercept >= 0, then y1 >= y2
> >> >> >           for all values of n.  */
> >> >> >       if (a1 - 2*S >= 0)
> >> >> >         return op2;
> >> >> >       return op1;
> >> >> >     }
> >> >> >
> >> >> >    n_int = Psel * (a1 - 2*S) / (Psel - S)
> >> >> >    /* If intersecting in 1st quadrant, there will be cross over,
> >> >> >        bail out.  */
> >> >> >    if (n_int > 0)
> >> >> >      return NULL_TREE;
> >> >> >    /* If S/Psel < 1, ie y1 has gentler slope than y2,
> >> >> >       then y1 < y2 for n > 0.  */
> >> >> >    if (S < Psel)
> >> >> >      return op1;
> >> >> >    /* If S/Psel > 1, ie y1 has steeper slope than y2,
> >> >> >       then y1 > y2 for n > 0.  */
> >> >> >    return op2;
> >> >> > }
> >> >> >
> >> >> > Does this look reasonable ?
> >> >>
> >> >> ...I think we need to be more conservative.  I think we also need to
> >> >> distinguish n1 (the number of elements in the input vectors) and
> >> >> nsel (the number of elements in the selector).
> >> >>
> >> >> If nsel is a multiple of Psel and nsel >= 2 * Psel then like you say
> >> >> there will be (nsel /exact Psel) - 1 index elements from the stepped
> >> >> encoding and the final index value will be:
> >> >>
> >> >>   ae = a1 + (nsel /exact Psel - 2) * S
> >> >>
> >> >> Because of wrap-around, we need to ensure that that doesn't run
> >> >> into an adjoining vector.  I think the easiest way of doing that
> >> >> is to calculate a1 /trunc n1 and ae /trunc n1 (using can_div_trunc_p)
> >> >> and check that the quotients are equal.
> >> > IIUC, If a1/n1 == ae/n1, then the sequence will choose from the same
> >> > vector since ae is last elem,
> >> > and the quotient can choose the vector because it will be either 0 or
> >> > 1 (since indices wrap around after 2n).
> >>
> >> Right.
> >>
> >> > Um, could you please elaborate a bit on how will can_div_trunc_p
> >> > calculate quotients, when n1 and nsel are unknown
> >> > at compile time ?
> >> >
> >> > To calculate the quotients for a hard coded pattern,
> >> > with a1 = 1, nsel = n1 = len(VNx4SI), S = 3, Psel = 4,
> >> > I tried the following:
> >> >
> >> >   poly_uint64 n1 = GET_MODE_NUNITS (VNx4SImode);
> >> >   poly_uint64 nsel = n1;
> >> >   poly_uint64 a1 = 1
> >> >   poly_uint64 Esel = exact_div (nsel / Psel);
> >>
> >> We can't use exact_div here.  We need to test that nsel is a multiple
> >> of Psel (e.g. using multiple_p).
> >>
> >> >   poly_uint64 ae = a1 + (Esel - 2) * S;
> >> >
> >> >   int q1, qe;
> >> >   poly_uint64 r1, re;
> >> >
> >> >   bool div1_p = can_div_trunc_p (a1, n1, &q1, &r1);
> >> >   bool dive_p = can_div_trunc_p (ae, n1, &qe, &re);
> >> >
> >> > Which gave strange values for qe and 0 for q1, with first call succeeding,
> >> > and second call returning false.
> >> > Am I calling it incorrectly ?
> >>
> >> No, that looks right.  I guess the issue is that ae < a1 for Psel == nsel
> >> and so for min(nsel) the index wraps around to the other input vector.
> >> IMO it's OK to punt on that.  We don't have interfaces for applying
> >> ranges to the indeterminates in a poly_int.
> > Hi Richard,
> > Thanks for the suggestions.
> > I tried the following to test for wrap around:
> >
> >   poly_uint64 n1 = GET_MODE_NUNITS (VNx4SImode);
> >   poly_uint64 nsel = n1;
> >   poly_uint64 a1 = 1;
> >   unsigned Psel = 4;
> >   int S = 3;
> >
> >   if (multiple_p (nsel, Psel))
> >     {
> >       poly_uint64 nelems = exact_div (nsel, Psel);
> >       poly_uint64 ae = a1 + (nelems - 2) * S;
> >
> >       if (known_gt (ae, a1))
> >         {
> >           int q1, qe;
> >           poly_uint64 r1, re;
> >
> >           bool ok1 = can_div_trunc_p (a1, n1, &q1, &r1);
> >           bool oke = can_div_trunc_p (ae, n1, &qe, &re);
> >         }
> >     }
> >
> > However, the second call to can_div_trunc_p still returns false, with
> > strange values for qe.
>
> The known_gt (ae, a1) part isn't required.  It's OK for S to be negative.
>
> What I meant above is that Psel is a valid value of nsel.  When nsel
> *does* equal Psel, ae will select from a different input vector from a1.
> (Although I said ae < a1, that isn't the important bit, sorry.
> The important bit is that a1 > 0 and ae < 0.)
>
> That's why not having conditions that apply ranges to the indeterminates
> is a problem.  I guess what we want to ask is "does ae select from the
> same input as a1 for all nsel >= Psel*2?".  But we don't have a way of
> asking that.  We can only ask for all nsel, and the answer to "does ae
> select from the same input as a1 for all nsel?" is "no".
>
> Perhaps another way of saying this is: if the selector was a natural
> stepped vector, the first value in the pattern (a0) would be -2.  The
> selector { -2, 1, 4, ... } *would* cross input vectors.
>
> This means that a selector with the above values of a1 and S could only
> be valid if a0 is in the range [0, nsel).  It's valid for a0 to be in
> that range, because it can be arbitrarily different from a1 and above,
> but it means that the test for a1 and above is no longer a simple linear
> one, of the type we're trying to use here.
>
> So what I meant was, we should accept that the fold will be rejected
> for this case.  I don't think that matters in practice.
>
> On the last bit, the values returned by reference from can_div_trunc_p
> are only meaningful when the function returns true.  The variables
> aren't updated otherwise (which is by design).
Hi Richard,
Thanks for the explanation!
IIUC, for can_div_trunc_p (a, b, &q, &r) to return true when
q0 = a.coeffs[0] / b.coeffs[0] is nonzero, the quotient of every other
pair of corresponding coeffs must also equal q0:

  C q = NCa (a.coeffs[0]) / NCb (b.coeffs[0]);
   /* Otherwise just check for the case in which ai / bi == Q.  */
   if (NCa (a.coeffs[i]) / NCb (b.coeffs[i]) != q)
     return false;

Just to reiterate, for the above case,
n1 = len(VNx4SI) = 4 + 4x
nsel = n1 = 4 + 4x
S = 3
Psel = 4
a1 = 1
ae = 3x - 2

a1 /trunc n1 == (1 + 0x) / (4 + 4x)
Since 1/4 == 0/4, can_div_trunc_p returns true, and sets q1 to 0.

ae /trunc n1 == (-2 + 3x) / (4 + 4x)
Since the coeff type is unsigned long, -2 wraps to a huge value,
so -2/4 != 3/4 and it returns false,
and we refuse to fold this case.
Is this correct?

Sorry to ask a silly question, but in which case would we select the 2nd vector?
For num_poly_int_coeffs == 2,
a1 /trunc n1 == (a1 + 0x) / (n1.coeffs[0] + n1.coeffs[1]*x)
If a1/trunc n1 succeeds,
0 / n1.coeffs[1] == a1/n1.coeffs[0] == 0.
So, a1 has to be < n1.coeffs[0] ?

For eg, if n1 = 4 + 4x, and a1 = 5,
a1 /trunc n1 will return false since 5/4 != 0/4.
At runtime,
if x == 0, n1 == 4, so we will select op1[1];
for x > 0, n1 >= 8, so we will select op0[5].
So we cannot determine this at compile time?
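
To make the runtime behaviour in this example concrete, here is a small
standalone sketch (plain C++, not GCC code; resolve_index and
runtime_length are made-up names) that maps a canonical selector index
to an input and an element for n1 = 4 + 4x:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical helper, not a GCC function: map a canonical VEC_PERM_EXPR
// selector index (0 <= index < 2 * n) to {operand, element}, where
// operand 0 is the first input and operand 1 is the second.
std::pair<unsigned, uint64_t>
resolve_index (uint64_t index, uint64_t n)
{
  assert (n > 0 && index < 2 * n);
  return { static_cast<unsigned> (index / n), index % n };
}

// Runtime length of a vector with n1 = 4 + 4x elements.
uint64_t
runtime_length (uint64_t x)
{
  return 4 + 4 * x;
}
```

With index 5 this yields the second input for x == 0 and the first
input for x == 1, so the chosen input really does depend on x.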

I have tried to come up with the following pseudo code to verify (R):

tree get_vector_for_pattern(tree op0, tree op1, vec_perm_indices sel,
int pattern)
{
  poly_uint64 nsel = sel.length ();
  poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (op0));
  if (!multiple_p (nsel, sel_npatterns))
    return NULL_TREE;

  poly_uint64 Esel = exact_div (nsel, sel_npatterns);
  poly_uint64 a1 = sel[pattern + sel_npatterns];
  poly_uint64 ae = a1 + (Esel - 2) * S;

  if (known_lt (ae, 0))
    return NULL_TREE;

  uint64_t q1, qe;
  poly_uint64 r1, re;
  if (!can_div_trunc_p (a1, n1, &q1, &r1)
      || !can_div_trunc_p (ae, n1, &qe, &re))
    return NULL_TREE;

  if (q1 != qe)
    return NULL_TREE;
  tree op_vec = (q1 == 0) ? op0 : op1;

 sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
 if (sel_nelts_per_pattern == 3)
   {
     a2 = sel[pattern + 2 * sel_npatterns];
     S = a2 - a1;
     if (S < 0)
       {
         /* Check for natural stepped pattern.  */
         if ((a1 - sel[pattern]) != S)
           return NULL_TREE;
         if (!known_ge (re, VECTOR_CST_NPATTERNS (op_vec)))
           return NULL_TREE;
       }
   }
 return op_vec;
}

Does this look like the right direction?

Um, could you please give an example where nsel is *not* a multiple
of Psel?
I had (incorrectly) assumed nsel = Psel * Esel,
so nsel would always be a multiple of Psel.

Thanks,
Prathamesh

>
> Thanks,
> Richard
>
> >
> > Thanks,
> > Prathamesh
> >
> >>
> >> I guess if we want to be fancy, we could look for a1 = a0 + S and,
> >> if true, do the calculation based on a0 rather than a1.  The combination
> >> of Psel >= min(nsel) and a0 != a1 + S, although valid in theory, should
> >> be a very niche case.
> >>
> >> But it might be better to stick to the a1-based case first and
> >> get that working.  We can then see whether it's worth extending.
> >>
> >> Thanks,
> >> Richard
> >>
> >> >
> >> > Thanks,
> >> > Prathamesh
> >> >
> >> >>
> >> >> However, I now realise that there's a wrinkle.  If S < 0 then we
> >> >> also need to check that either:
> >> >>
> >> >> (a) the chosen input vector (given by the quotient above) has either:
> >> >>
> >> >>     (i) nelts_per_pattern == 1
> >> >>     (ii) nelts_per_pattern == 3 and the difference between the
> >> >>          first and second elements in each pattern is the same as
> >> >>          the difference between the second and third elements
> >> >>          (i.e. every pattern is a natural stepped one).
> >> >>
> >> >> (b) ae % n1 >= the number of patterns in the input vector.
> >> >>     (ae % n1 is calculated as a side-effect of can_div_trunc_p).
> >> >>
> >> >> Otherwise the index vector has the effect of moving the "foreground"
> >> >> from the front of the input vector to the end of the result vector.
> >> >>
> >> >> If nsel == Psel then the stepped part of the sequence doesn't matter.
> >> >> Thus, the same condition works whenever nsel is a multiple of Psel.
> >> >>
> >> >> If nsel is not a multiple of Psel then I think we should punt for now.
> >> >> There are some cases that we could handle when n1 == nsel, but "nsel
> >> >> is a multiple of Psel" will be the normal case.
> >> >>
> >> >> Thanks,
> >> >> Richard

Richard Sandiford Sept. 30, 2022, 4 p.m. UTC | #13

Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> On Tue, 27 Sept 2022 at 01:59, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> > On Fri, 23 Sept 2022 at 21:33, Richard Sandiford
>> > <richard.sandiford@arm.com> wrote:
>> >>
>> >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> >> > On Tue, 20 Sept 2022 at 18:09, Richard Sandiford
>> >> > <richard.sandiford@arm.com> wrote:
>> >> >>
>> >> >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> >> >> > On Mon, 12 Sept 2022 at 19:57, Richard Sandiford
>> >> >> > <richard.sandiford@arm.com> wrote:
>> >> >> >>
>> >> >> >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> >> >> >> >> The VLA encoding encodes the first N patterns explicitly.  The
>> >> >> >> >> npatterns/nelts_per_pattern values then describe how to extend that
>> >> >> >> >> initial sequence to an arbitrary number of elements.  So when performing
>> >> >> >> >> an operation on (potentially) variable-length vectors, the question is:
>> >> >> >> >>
>> >> >> >> >> * Can we work out an initial sequence and npatterns/nelts_per_pattern
>> >> >> >> >>   pair that will be correct for all elements of the result?
>> >> >> >> >>
>> >> >> >> >> This depends on the operation that we're performing.  E.g. it's
>> >> >> >> >> different for unary operations (vector_builder::new_unary_operation)
>> >> >> >> >> and binary operations (vector_builder::new_binary_operation).  It also
>> >> >> >> >> varies between unary operations and between binary operations, hence
>> >> >> >> >> the allow_stepped_p parameters.
>> >> >> >> >>
>> >> >> >> >> For VEC_PERM_EXPR, I think the key requirement is that:
>> >> >> >> >>
>> >> >> >> >> (R) Each individual selector pattern must always select from the same vector.
>> >> >> >> >>
>> >> >> >> >> Whether this condition is met depends both on the pattern itself and on
>> >> >> >> >> the number of patterns that it's combined with.
>> >> >> >> >>
>> >> >> >> >> E.g. suppose we had the selector pattern:
>> >> >> >> >>
>> >> >> >> >>   { 0, 1, 4, ... }   i.e. 3x - 2 for x > 0
>> >> >> >> >>
>> >> >> >> >> If the arguments and selector are n elements then this pattern on its
>> >> >> >> >> own would select from more than one argument if 3(n-1) - 2 >= n.
>> >> >> >> >> This is clearly true for large enough n.  So if n is variable then
>> >> >> >> >> we cannot represent this.
>> >> >> >> >>
>> >> >> >> >> If the pattern above is one of two patterns, so interleaved as:
>> >> >> >> >>
>> >> >> >> >>      { 0, _, 1, _, 4, _, ... }  o=0
>> >> >> >> >>   or { _, 0, _, 1, _, 4, ... }  o=1
>> >> >> >> >>
>> >> >> >> >> then the pattern would select from more than one argument if
>> >> >> >> >> 3(n/2-1) - 2 + o >= n.  This too would be a problem for variable n.
>> >> >> >> >>
>> >> >> >> >> But if the pattern above is one of four patterns then it selects
>> >> >> >> >> from more than one argument if 3(n/4-1) - 2 + o >= n.  This is not
>> >> >> >> >> true for any valid n or o, so the pattern is OK.
>> >> >> >> >>
>> >> >> >> >> So let's define some ad hoc terminology:
>> >> >> >> >>
>> >> >> >> >> * Px is the number of patterns in x
>> >> >> >> >> * Ex is the number of elements per pattern in x
>> >> >> >> >>
>> >> >> >> >> where x can be:
>> >> >> >> >>
>> >> >> >> >> * 1: first argument
>> >> >> >> >> * 2: second argument
>> >> >> >> >> * s: selector
>> >> >> >> >> * r: result
>> >> >> >> >>
>> >> >> >> >> Then:
>> >> >> >> >>
>> >> >> >> >> (1) The number of elements encoded explicitly for x is Ex*Px
>> >> >> >> >>
>> >> >> >> >> (2) The explicit encoding can be used to produce a sequence of N*Ex*Px
>> >> >> >> >>     elements for any integer N.  This extended sequence can be reencoded
>> >> >> >> >>     as having N*Px patterns, with Ex staying the same.
>> >> >> >> >>
>> >> >> >> >> (3) If Ex < 3, Ex can be increased by 1 by repeating the final Px elements
>> >> >> >> >>     of the explicit encoding.
>> >> >> >> >>
>> >> >> >> >> So let's assume (optimistically) that we can produce the result
>> >> >> >> >> by calculating the first Pr*Er elements and using the Pr,Er encoding
>> >> >> >> >> to imply the rest.  Then:
>> >> >> >> >>
>> >> >> >> >> * (2) means that, when combining multiple input operands with potentially
>> >> >> >> >>   different encodings, we can set the number of patterns in the result
>> >> >> >> >>   to the least common multiple of the number of patterns in the inputs.
>> >> >> >> >>   In this case:
>> >> >> >> >>
>> >> >> >> >>   Pr = least_common_multiple(P1, P2, Ps)
>> >> >> >> >>
>> >> >> >> >>   is a valid number of patterns.
>> >> >> >> >>
>> >> >> >> >> * (3) means that the number of elements per pattern of the result can
>> >> >> >> >>   be the maximum of the number of elements per pattern in the inputs.
>> >> >> >> >>   (Alternatively, we could always use 3.)  In this case:
>> >> >> >> >>
>> >> >> >> >>   Er = max(E1, E2, Es)
>> >> >> >> >>
>> >> >> >> >>   is a valid number of elements per pattern.
>> >> >> >> >>
>> >> >> >> >> So if (R) holds we can compute the result -- for both VLA and VLS -- by
>> >> >> >> >> calculating the first Pr*Er elements of the result and using the
>> >> >> >> >> encoding to derive the rest.  If (R) doesn't hold then we need the
>> >> >> >> >> selector to be constant-length.  We should then fill in the result
>> >> >> >> >> based on:
>> >> >> >> >>
>> >> >> >> >> - Pr == number of elements in the result
>> >> >> >> >> - Er == 1
>> >> >> >> >>
>> >> >> >> >> But this should be the fallback option, even for VLS.
>> >> >> >> >>
>> >> >> >> >> As far as the arguments go: we should reject CONSTRUCTORs for
>> >> >> >> >> variable-length types.  After doing that, we can treat a CONSTRUCTOR
>> >> >> >> >> for an N-element vector type by setting the number of patterns to N
>> >> >> >> >> and the number of elements per pattern to 1.
>> >> >> >> > Hi Richard,
>> >> >> >> > Thanks for the suggestions, and sorry for late response.
>> >> >> >> > I have a couple of very elementary questions:
>> >> >> >> >
>> >> >> >> > 1: Consider following inputs to VEC_PERM_EXPR:
>> >> >> >> > op1: P_op1 == 4, E_op1 == 1
>> >> >> >> > {1, 2, 3, 4, ...}
>> >> >> >> >
>> >> >> >> > op2: P_op2 == 2, E_op2 == 2
>> >> >> >> > {11, 21, 12, 22, ...}
>> >> >> >> >
>> >> >> >> > sel: P_sel == 3, E_sel == 1
>> >> >> >> > {0, 4, 5, ...}
>> >> >> >> >
>> >> >> >> > What shall be the result in this case ?
>> >> >> >> > P_res = lcm(4, 2, 3) == 12
>> >> >> >> > E_res = max(1, 2, 1) == 2.
>> >> >> >>
>> >> >> >> Yeah, that looks right.  Of course, since sel is just repeating
>> >> >> >> every three elements, it could just be P_res==3, E_res==1,
>> >> >> >> but the vector_builder would do that optimisation for us.
>> >> >> >>
>> >> >> >> (I'm not sure whether we'd see a P==3 encoding in practice,
>> >> >> >> but perhaps it's possible.)
>> >> >> >>
>> >> >> >> If sel was P_sel==1, E_sel==3 (so a stepped encoding rather than
>> >> >> >> repeating every three elements) then:
>> >> >> >>
>> >> >> >> P_res = lcm(4, 2) == 4
>> >> >> >> E_res = max(1, 2, 3) == 3
>> >> >> >>
>> >> >> >> which also looks like it would give the right encoding.
>> >> >> >>
>> >> >> >> > 2. How should we specify index of element in sel when it is not
>> >> >> >> > explicitly encoded in the operand ?
>> >> >> >> > For eg:
>> >> >> >> > op1: npatterns == 2, nelts_per_pattern == 3
>> >> >> >> > { 1, 0, 2, 0, 3, 0, ... }
>> >> >> >> > op2: npatterns == 6, nelts_per_pattern == 1
>> >> >> >> > { 11, 12, 13, 14, 15, 16, ...}
>> >> >> >> >
>> >> >> >> > In sel, how do we refer to element with value 4, that would be 4th element
>> >> >> >> > of first pattern in op1, but not explicitly encoded ?
>> >> >> >> > In op1, 4 will come at index == 6.
>> >> >> >> > However in sel, index 6 would refer to 11, ie op2[0] ?
>> >> >> >>
>> >> >> >> What index 6 refers to depends on the length of op1.
>> >> >> >> If the length of op1 is 4 at runtime the index 6 refers to op2[2].
>> >> >> >> If the length of op1 is 6 then index 6 refers to op2[0].
>> >> >> >> If the length of op1 is 8 then index 6 refers to op1[6], etc.
>> >> >> >>
>> >> >> >> This comes back to (R) above.  We need to be able to prove at compile
>> >> >> >> time that each pattern selects from the same input vectors (for all
>> >> >> >> elements, not just the encoded elements).  If we can't prove that
>> >> >> >> then we can't fold for variable-length vectors.
>> >> >> > Hi Richard,
>> >> >> > Thanks for the clarification!
>> >> >> > I have come up with an approach to verify R:
>> >> >> >
>> >> >> > Consider following pattern:
>> >> >> > a0, a1, a1 + S, ...,
>> >> >> > nelts_per_pattern would be n / Psel, where n == actual length of the vector.
>> >> >> > And last element of pattern will be given by:
>> >> >> > a1 + (n/Psel - 2) * S
>> >> >>
>> >> >> (I think this is just a terminology thing, but in the source,
>> >> >> nelts_per_pattern is a compile-time constant that describes the
>> >> >> encoding.  It always has the value 1, 2 or 3, regardless of the
>> >> >> runtime length.)
>> >> >>
>> >> >> > Rearranging the above term, we can think of pattern
>> >> >> > as a line with following equation:
>> >> >> > y = (S/Psel) * n + (a1 - 2S)
>> >> >> > where (S/Psel) is the slope, and (a1 - 2S) is the y-intercept.
>> >> >> >
>> >> >> > At,
>> >> >> > n = 2*Psel, y = a1
>> >> >> > n = 3*Psel, y = a1 + S,
>> >> >> > n = 4*Psel, y = a1 + 2S ...
>> >> >> >
>> >> >> > To compare with n, we compare the following lines:
>> >> >> > y1 = (S/Psel) * n + (a1 - 2S)
>> >> >> > y2 = n
>> >> >> >
>> >> >> > So to check if elements always come from first vector,
>> >> >> > we want to check y1 < y2 for n > 0.
>> >> >> > Likewise, if elements always come from second vector,
>> >> >> > we want to check if y1 >= y2, for n > 0.
>> >> >>
>> >> >> One difficulty here is that the indices wrap around, so an index value of
>> >> >> 2n selects from the first vector rather than the second.  (This is pretty
>> >> >> awkward for VLA and doesn't match the native SVE TBL behaviour.)  So...
>> >> >>
>> >> >> > If both lines are parallel, ie S/PSel == 1,
>> >> >> > then we choose first or second vector depending on the y-intercept a1 - 2S.
>> >> >> > If a1 - 2S >= 0, then y1 >= y2 for all values of n, so select second vector.
>> >> >> > If a1 - 2S < 0, then y1 < y2 for all values of n, so select first vector.
>> >> >> >
>> >> >> > For eg, if we have following pattern:
>> >> >> > {0, 1, 3, ...}
>> >> >> > where a1 = 1, S = 2, and consider PSel = 2.
>> >> >> >
>> >> >> > y1 = n - 3
>> >> >> > y2 = n
>> >> >> >
>> >> >> > In this case, y1 < y2 for all values of n,  so we select first vector.
>> >> >> >
>> >> >> > Since y2 = n, passes thru origin with slope = 1,
>> >> >> > a line can intersect it either in 1st or 3rd quadrant.
>> >> >> > Calculate point of intersection:
>> >> >> > n_int = Psel * (a1 - 2S) / (Psel - S);
>> >> >> >
>> >> >> > (a) n_int > 0
>> >> >> > n_int > 0 => intersecting in 1st quadrant.
>> >> >> > In this case there will be a cross-over at n_int.
>> >> >> >
>> >> >> > For eg, consider pattern { 0, 1, 4, ...}
>> >> >> > a1 = 1, S = 3, and let's take PSel = 2
>> >> >> >
>> >> >> > y1 = (3/2)n - 5
>> >> >> > y2 = n
>> >> >> >
>> >> >> > Both intersect at (10, 10).
>> >> >> > So for n < 10, y1 < y2
>> >> >> > and for n > 10, y1 > y2.
>> >> >> > so in this case we can't fold since we will select elements from both vectors.
>> >> >> >
>> >> >> > (b) n_int <= 0
>> >> >> > In this case, the lines will intersect in 3rd quadrant,
>> >> >> > so depending upon the slope we can choose either vector.
>> >> >> > If (S/Psel) < 1, ie y1 has a gentler slope than y2,
>> >> >> > then y1 < y2 for n > 0
>> >> >> > If (S/Psel) > 1, ie, y1 has a steeper slope than y2,
>> >> >> > then y1 > y2 for n > 0.
>> >> >> >
>> >> >> > For eg, in the above pattern {0, 1, 4, ...}
>> >> >> > a1 = 1, S = 3, and let's take PSel = 4
>> >> >> >
>> >> >> > y1 = (3/4)n - 5
>> >> >> > y2 = n
>> >> >> > Both intersect at (-20, -20).
>> >> >> > y1's slope = (S/Psel) = (3/4) < 1
>> >> >> > So y1 < y2 for n > 0.
>> >> >> > Graph: https://www.desmos.com/calculator/ct7edqbr9d
>> >> >> > So we pick first vector.
>> >> >> >
>> >> >> > The following pseudo code attempts to capture this:
>> >> >> >
>> >> >> > tree select_vector_for_pattern (op1, op2, a1, S, Psel)
>> >> >> > {
>> >> >> >   if (S == Psel)
>> >> >> >     {
>> >> >> >       /* If y1 intercept >= 0, then y1 >= y2
>> >> >> >           for all values of n.  */
>> >> >> >       if (a1 - 2*S >= 0)
>> >> >> >         return op2;
>> >> >> >       return op1;
>> >> >> >     }
>> >> >> >
>> >> >> >    n_int = Psel * (a1 - 2*S) / (Psel - S)
>> >> >> >    /* If intersecting in 1st quadrant, there will be cross over,
>> >> >> >        bail out.  */
>> >> >> >    if (n_int > 0)
>> >> >> >      return NULL_TREE;
>> >> >> >    /* If S/Psel < 1, ie y1 has gentler slope than y2,
>> >> >> >       then y1 < y2 for n > 0.  */
>> >> >> >    if (S < Psel)
>> >> >> >      return op1;
>> >> >> >    /* If S/Psel > 1, ie y1 has steeper slope than y2,
>> >> >> >       then y1 > y2 for n > 0.  */
>> >> >> >    return op2;
>> >> >> > }
>> >> >> >
>> >> >> > Does this look reasonable ?
>> >> >>
>> >> >> ...I think we need to be more conservative.  I think we also need to
>> >> >> distinguish n1 (the number of elements in the input vectors) and
>> >> >> nsel (the number of elements in the selector).
>> >> >>
>> >> >> If nsel is a multiple of Psel and nsel >= 2 * Psel then like you say
>> >> >> there will be (nsel /exact Psel) - 1 index elements from the stepped
>> >> >> encoding and the final index value will be:
>> >> >>
>> >> >>   ae = a1 + (nsel /exact Psel - 2) * S
>> >> >>
>> >> >> Because of wrap-around, we need to ensure that that doesn't run
>> >> >> into an adjoining vector.  I think the easiest way of doing that
>> >> >> is to calculate a1 /trunc n1 and ae /trunc n1 (using can_div_trunc_p)
>> >> >> and check that the quotients are equal.
>> >> > IIUC, If a1/n1 == ae/n1, then the sequence will choose from the same
>> >> > vector since ae is last elem,
>> >> > and the quotient can choose the vector because it will be either 0 or
>> >> > 1 (since indices wrap around after 2n).
>> >>
>> >> Right.
>> >>
>> >> > Um, could you please elaborate a bit on how will can_div_trunc_p
>> >> > calculate quotients, when n1 and nsel are unknown
>> >> > at compile time ?
>> >> >
>> >> > To calculate the quotients for a hard coded pattern,
>> >> > with a1 = 1, nsel = n1 = len(VNx4SI), S = 3, Psel = 4,
>> >> > I tried the following:
>> >> >
>> >> >   poly_uint64 n1 = GET_MODE_NUNITS (VNx4SImode);
>> >> >   poly_uint64 nsel = n1;
>> >> >   poly_uint64 a1 = 1
>> >> >   poly_uint64 Esel = exact_div (nsel / Psel);
>> >>
>> >> We can't use exact_div here.  We need to test that nsel is a multiple
>> >> of Psel (e.g. using multiple_p).
>> >>
>> >> >   poly_uint64 ae = a1 + (Esel - 2) * S;
>> >> >
>> >> >   int q1, qe;
>> >> >   poly_uint64 r1, re;
>> >> >
>> >> >   bool div1_p = can_div_trunc_p (a1, n1, &q1, &r1);
>> >> >   bool dive_p = can_div_trunc_p (ae, n1, &qe, &re);
>> >> >
>> >> > Which gave strange values for qe and 0 for q1, with first call succeeding,
>> >> > and second call returning false.
>> >> > Am I calling it incorrectly ?
>> >>
>> >> No, that looks right.  I guess the issue is that ae < a1 for Psel == nsel
>> >> and so for min(nsel) the index wraps around to the other input vector.
>> >> IMO it's OK to punt on that.  We don't have interfaces for applying
>> >> ranges to the indeterminates in a poly_int.
>> > Hi Richard,
>> > Thanks for the suggestions.
>> > I tried the following to test for wrap around:
>> >
>> >   poly_uint64 n1 = GET_MODE_NUNITS (VNx4SImode);
>> >   poly_uint64 nsel = n1;
>> >   poly_uint64 a1 = 1;
>> >   unsigned Psel = 4;
>> >   int S = 3;
>> >
>> >   if (multiple_p (nsel, Psel))
>> >     {
>> >       poly_uint64 nelems = exact_div (nsel, Psel);
>> >       poly_uint64 ae = a1 + (nelems - 2) * S;
>> >
>> >       if (known_gt (ae, a1))
>> >         {
>> >           int q1, qe;
>> >           poly_uint64 r1, re;
>> >
>> >           bool ok1 = can_div_trunc_p (a1, n1, &q1, &r1);
>> >           bool oke = can_div_trunc_p (ae, n1, &qe, &re);
>> >         }
>> >     }
>> >
>> > However, the second call to can_div_trunc_p still returns false, with
>> > strange values for qe.
>>
>> The known_gt (ae, a1) part isn't required.  It's OK for S to be negative.
>>
>> What I meant above is that Psel is a valid value of nsel.  When nsel
>> *does* equal Psel, ae will select from a different input vector from a1.
>> (Although I said ae < a1, that isn't the important bit, sorry.
>> The important bit is that a1 > 0 and ae < 0.)
>>
>> That's why not having conditions that apply ranges to the indeterminates
>> is a problem.  I guess what we want to ask is "does ae select from the
>> same input as a1 for all nsel >= Psel*2?".  But we don't have a way of
>> asking that.  We can only ask for all nsel, and the answer to "does ae
>> select from the same input as a1 for all nsel?" is "no".
>>
>> Perhaps another way of saying this is: if the selector was a natural
>> stepped vector, the first value in the pattern (a0) would be -2.  The
>> selector { -2, 1, 4, ... } *would* cross input vectors.
>>
>> This means that a selector with the above values of a1 and S could only
>> be valid if a0 is in the range [0, nsel).  It's valid for a0 to be in
>> that range, because it can be arbitrarily different from a1 and above,
>> but it means that the test for a1 and above is no longer a simple linear
>> one, of the type we're trying to use here.
>>
>> So what I meant was, we should accept that the fold will be rejected
>> for this case.  I don't think that matters in practice.
>>
>> On the last bit, the values returned by reference from can_div_trunc_p
>> are only meaningful when the function returns true.  The variables
>> aren't updated otherwise (which is by design).
> Hi Richard,
> Thanks for the explanation!
> IIUC, for can_div_trunc_p (a, b, &q, &r);
> to return true, quotients obtained by division of respective
> coeffs should be equal to q0, if q0 = a.coeffs[0] / b.coeffs[0] != 0.
>
>   C q = NCa (a.coeffs[0]) / NCb (b.coeffs[0]);
>    /* Otherwise just check for the case in which ai / bi == Q.  */
>    if (NCa (a.coeffs[i]) / NCb (b.coeffs[i]) != q)
>      return false;
>
> Just to iterate, for above case,
> n1 = len(Vnx4SI) = 4 + 4x
> nsel = n1 = 4 + 4x
> S = 3
> Psel = 4
> a1 = 1
> ae = 3x - 2
>
> a1 /trunc n1 == (1 + 0x) / (4 + 4x)
> Since 1/4 == 0/4, can_div_trunc_p returns true, and sets q1 to 0.
>
> ae /trunc n1 == (-2 + 3x) / (4 + 4x)
> Since the coeff type is unsigned long,
> -2/4 != 3/4, so it returns false.
> and we reject to fold for this case.
> Is this correct ?

Yeah.  Like I say, if this were a simple linear pattern, a0 would be -2,
and so a0 and a1 would select from different inputs.  So a simple
linear test cannot handle this situation.
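
As a rough illustration of why the test has to fail, the following
standalone sketch (plain C++ with ordinary integers instead of
poly_ints; input_selected and first_mismatch are made-up names) plugs
concrete values of x into the running example and compares the input
picked by a1 with the one picked by ae, treating a negative index as
wrapping modulo 2 * n1:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not a GCC function): return 0 or 1 according to
// which input a selector index picks once wrapped into [0, 2 * n).
unsigned
input_selected (int64_t index, int64_t n)
{
  assert (n > 0);
  int64_t wrapped = ((index % (2 * n)) + 2 * n) % (2 * n);
  return static_cast<unsigned> (wrapped / n);
}

// Scan x in [min_x, max_x] for the running example (n1 = nsel = 4 + 4x,
// Psel = 4, S = 3, a1 = 1) and return the first x at which a1 and ae
// pick different inputs, or -1 if there is none in the range.
int64_t
first_mismatch (int64_t min_x, int64_t max_x)
{
  const int64_t a1 = 1, S = 3, Psel = 4;
  for (int64_t x = min_x; x <= max_x; ++x)
    {
      const int64_t n1 = 4 + 4 * x;
      const int64_t ae = a1 + (n1 / Psel - 2) * S;   // 3x - 2
      if (input_selected (a1, n1) != input_selected (ae, n1))
        return x;
    }
  return -1;
}
```

Under these assumptions the only mismatch in the scanned range is at
x == 0 (nsel == Psel).  Starting the scan at x == 1 (nsel >= 2 * Psel)
finds none, which is exactly the range condition that cannot be
expressed on the indeterminate.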

> Sorry to ask a silly question but in which case shall we select 2nd vector ?
> For num_poly_int_coeffs == 2,
> a1 /trunc n1 == (a1 + 0x) / (n1.coeffs[0] + n1.coeffs[1]*x)
> If a1/trunc n1 succeeds,
> 0 / n1.coeffs[1] == a1/n1.coeffs[0] == 0.
> So, a1 has to be < n1.coeffs[0] ?

Remember that a1 is itself a poly_int.  It's not necessarily a constant.

E.g. the TRN1 .D instruction maps to a VEC_PERM_EXPR with the selector:

  { 0, 2 + 2x, 1, 4 + 2x, 2, 6 + 2x, ... }

which is an interleaving of the two patterns:

  { 0, 2, 4, ... }                  a0 = 0, a1 = 2, S = 2
  { 2 + 2x, 4 + 2x, 6 + 2x, ... }   a0 = 2 + 2x, a1 = 4 + 2x, S = 2

> For eg, if n1 = 4 + 4x, and a1 = 5,
> a1 /trunc n1 will return false since 5/4 != 0 /4.
> At runtime,
> if x == 0, we will select op1[1];
> for x > 0, we will select op0[5];
> But we cannot determine this at compile time ?

Right.  Punting on that case is the right thing to do.

> I have tried to come up with following pseudo code to verify R:
>
> tree get_vector_for_pattern(tree op0, tree op1, vec_perm_indices sel,
> int pattern)
> {
>   poly_uint64 nsel = sel.length ();
>   poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (op0));
>   if (!multiple_p (nsel, sel_npatterns))
>     return NULL_TREE;
>
>   poly_uint64 Esel = exact_div (nsel, sel_npatterns);

Realise it's only pseudo code, but FWIW, this can be produced
as a side-effect of the multiple_p.

>   poly_uint64 a1 = sel[pattern + sel_npatterns];
>   poly_uint64 ae = a1 + (Esel - 2) * S;
>
>   if (known_lt (ae, 0))
>     return NULL_TREE;

I don't think we need this.

>   uint64_t q1, qe;
>   poly_uint64 r1, re;
>   if (!can_div_trunc_p (a1, n1, &q1, &r1)
>       || !can_div_trunc_p (ae, n1, &qe, &re))
>     return NULL_TREE;
>
>   if (q1 != qe)
>     return NULL_TREE;
>   tree op_vec = (q1 == 0) ? op0 : op1;

This should be testing the low bit of q1.  2 selects from the first
input, 3 from the second, etc.  (Not that we should see that,
since the selector should already have been canonicalised.
But better safe than sorry.)
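
In other words, assuming q1 is the truncated quotient of the index by
n1, the choice could be sketched as follows (standalone C++, not GCC
code; input_for_quotient is a made-up name, and the two inputs are
represented by 0 and 1 rather than by trees):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: given q = index /trunc n1, return 0 for the first
// input and 1 for the second.  Only the low bit matters because selector
// indices wrap around modulo 2 * n1.
unsigned
input_for_quotient (uint64_t q)
{
  return static_cast<unsigned> (q & 1);
}
```

Quotients 0 and 2 then map to the first input and quotients 1 and 3 to
the second, whereas a plain q1 == 0 test would send quotient 2 to the
second input.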

>  sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
>  if (sel_nelts_per_pattern == 3)
>    {

I suppose all of the above is dependent on sel_nelts_per_pattern == 3 too.
S would be zero otherwise.

>      a2 = sel[pattern + 2 * sel_npatterns];
>      S = a2 - a1;
>      if (S < 0)
>        {
>          /* Check for natural stepped pattern.  */
>          if ((a1 - sel[pattern]) != S)
>            return NULL_TREE;
>          if (!known_ge (re, VECTOR_CST_NPATTERNS (op_vec)))
>            return NULL_TREE;
>        }
>    }
>  return op_vec;
> }
>
> Does this look in right direction ?

Yeah, looks like it.

> Um, could you please give an example when nsel is *not* a multiple
> of Psel ?
> I had (incorrectly) assumed, nsel = Psel * Esel,
> so nsel would always be a multiple of Psel.

I was using nsel to mean TYPE_VECTOR_SUBPARTS, i.e. the total number
of elements in the vector at runtime.  Psel and Esel are instead
compile-time constants (number of patterns and explicitly-encoded
elements per pattern respectively).

Although I think in practical use the Psel will be a power of 2, that's
not guaranteed by construction.  Also, there's no reason in principle
why you couldn't have Psel = 4 for nsel = 2 + 2x.  Two of the patterns
will be dropped for x==0, but that's OK.

So all in all, it's better to check.  Like I said in the above comment,
it's a single operation either way: a three-argument multiple_p rather
than an exact_div.
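
As an illustration of the check (a toy model, not the poly-int.h implementation), the sketch below treats a length as c0 + c1 * x and accepts a division by a constant only when it is exact for every x. `Poly2` and `multiple_p_model` are made-up names for this example.

```cpp
#include <cassert>
#include <cstdint>

// Toy stand-in for a two-coefficient poly_uint64: value = c0 + c1 * x.
struct Poly2 {
  uint64_t c0, c1;
};

// Model of a three-argument multiple_p with a constant divisor:
// succeed only if A is a multiple of B for every x >= 0, which holds
// exactly when both coefficients are multiples of B.  On success the
// quotient is stored in *QUOTIENT, so no separate exact_div is needed.
bool multiple_p_model (Poly2 a, uint64_t b, Poly2 *quotient)
{
  assert (b != 0);
  if (a.c0 % b != 0 || a.c1 % b != 0)
    return false;
  *quotient = Poly2 { a.c0 / b, a.c1 / b };
  return true;
}
```

For nsel = 4 + 4x and Psel = 2 this yields Esel = 2 + 2x; for nsel = 2 + 2x and Psel = 4 it fails, which is the case where the caller should punt.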

Thanks,
Richard

Richard Sandiford Sept. 30, 2022, 4:08 p.m. UTC | #14

Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> Sorry to ask a silly question but in which case shall we select 2nd vector ?
>> For num_poly_int_coeffs == 2,
>> a1 /trunc n1 == (a1 + 0x) / (n1.coeffs[0] + n1.coeffs[1]*x)
>> If a1/trunc n1 succeeds,
>> 0 / n1.coeffs[1] == a1/n1.coeffs[0] == 0.
>> So, a1 has to be < n1.coeffs[0] ?
>
> Remember that a1 is itself a poly_int.  It's not necessarily a constant.
>
> E.g. the TRN1 .D instruction maps to a VEC_PERM_EXPR with the selector:
>
>   { 0, 2 + 2x, 1, 4 + 2x, 2, 6 + 2x, ... }

Sorry, should have been:

  { 0, 2 + 2x, 2, 4 + 2x, 4, 6 + 2x, ... }

> which is an interleaving of the two patterns:
>
>   { 0, 2, 4, ... }                  a0 = 0, a1 = 2, S = 2
>   { 2 + 2x, 4 + 2x, 6 + 2x }        a0 = 2 + 2x, a1 = 4 + 2x, S = 2
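
For reference, the corrected selector can be expanded for a concrete x with a short standalone sketch. It assumes the vector length is n = 2 + 2x, which is inferred from the second pattern starting at 2 + 2x; `trn1_selector` and `check_trn1_example` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Expand { 0, 2 + 2x, 2, 4 + 2x, 4, 6 + 2x, ... } for a concrete x,
// assuming the vector length is n = 2 + 2x (always even).
std::vector<uint64_t> trn1_selector (uint64_t x)
{
  uint64_t n = 2 + 2 * x;
  std::vector<uint64_t> sel (n);
  for (uint64_t i = 0; i < n; i += 2)
    {
      sel[i] = i;          // pattern 0: { 0, 2, 4, ... }
      sel[i + 1] = n + i;  // pattern 1: { 2 + 2x, 4 + 2x, 6 + 2x, ... }
    }
  return sel;
}

// For x == 1 the selector is { 0, 4, 2, 6 }.
void check_trn1_example ()
{
  std::vector<uint64_t> sel = trn1_selector (1);
  assert (sel.size () == 4);
  assert (sel[0] == 0 && sel[1] == 4 && sel[2] == 2 && sel[3] == 6);
}
```

Every element of pattern 0 is below n and every element of pattern 1 is in [n, 2n), so each pattern stays within one input vector for all x.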

Prathamesh Kulkarni Oct. 10, 2022, 10:48 a.m. UTC | #15

On Fri, 30 Sept 2022 at 21:38, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> Sorry to ask a silly question but in which case shall we select 2nd vector ?
> >> For num_poly_int_coeffs == 2,
> >> a1 /trunc n1 == (a1 + 0x) / (n1.coeffs[0] + n1.coeffs[1]*x)
> >> If a1/trunc n1 succeeds,
> >> 0 / n1.coeffs[1] == a1/n1.coeffs[0] == 0.
> >> So, a1 has to be < n1.coeffs[0] ?
> >
> > Remember that a1 is itself a poly_int.  It's not necessarily a constant.
> >
> > E.g. the TRN1 .D instruction maps to a VEC_PERM_EXPR with the selector:
> >
> >   { 0, 2 + 2x, 1, 4 + 2x, 2, 6 + 2x, ... }
>
> Sorry, should have been:
>
>   { 0, 2 + 2x, 2, 4 + 2x, 4, 6 + 2x, ... }
Hi Richard,
Thanks for the clarifications, and sorry for the late reply.
I have attached a POC patch that tries to implement the above approach.
It passes bootstrap+test on x86_64-linux-gnu and aarch64-linux-gnu for VLS vectors.

For VLA vectors, I have only done limited testing so far.
It seems to pass a couple of tests written in the patch for
nelts_per_pattern == 3,
and folds the following svld1rq test:
int32x4_t v = {1, 2, 3, 4};
return svld1rq_s32 (svptrue_b8 (), &v[0]);
into:
return {1, 2, 3, 4, ...};
I will try to bootstrap+test it on an SVE machine to test VLA folding further.

I have a couple of questions:
1] When the mask selects elements from the same vector but from different patterns.
For example:
arg0 = {1, 11, 2, 12, 3, 13, ...},
arg1 = {21, 31, 22, 32, 23, 33, ...},
mask = {0, 0, 0, 1, 0, 2, ... },
All have npatterns = 2, nelts_per_pattern = 3.

With the above mask,
Pattern {0, ...} selects arg0[0], i.e. {1, ...}
Pattern {0, 1, 2, ...} selects arg0[0], arg0[1], arg0[2], i.e. {1, 11, 2, ...}
While arg0[0] and arg0[2] belong to the same pattern, arg0[1] belongs to a
different pattern in arg0.
The result is:
res = {1, 1, 1, 11, 1, 2, ...}
In this case, res's 2nd pattern {1, 11, 2, ...} is encoded
with a0 = 1, a1 = 11, S = -9.
Is that expected, though? It seems to create a new encoding which
wasn't present in the input vector. For instance, the next element in the
sequence would be -7,
which is not present originally in arg0.
I suppose it's fine, since if the user defines the mask to have the pattern
{0, 1, 2, ...}, they intended the result to have a pattern with the above
encoding.
Just wanted to confirm that this is correct.

2] Could you please suggest a test case for S < 0?
I am not able to come up with one :/
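
To spell out the arithmetic behind question 1, here is a minimal sketch of how a single stepped pattern (nelts_per_pattern == 3) extends beyond its three encoded elements. `stepped_elt` and `check_question_1_example` are hypothetical helpers written for this mail, not GCC functions; the rule assumed is that elements after the first continue with the step S = a2 - a1.

```cpp
#include <cassert>
#include <cstdint>

// Element K of one pattern encoded as { a0, a1, a2, ... }.
// a0 is stored as-is; from a1 onwards the series advances by S = a2 - a1.
int64_t stepped_elt (int64_t a0, int64_t a1, int64_t a2, unsigned k)
{
  if (k == 0)
    return a0;
  int64_t step = a2 - a1;
  return a1 + static_cast<int64_t> (k - 1) * step;
}

// The pattern {1, 11, 2, ...} from question 1 has S = -9,
// so the element after 2 is 11 + 2 * (-9) = -7.
void check_question_1_example ()
{
  assert (stepped_elt (1, 11, 2, 2) == 2);
  assert (stepped_elt (1, 11, 2, 3) == -7);
}
```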

Thanks,
Prathamesh
>
> > which is an interleaving of the two patterns:
> >
> >   { 0, 2, 4, ... }                  a0 = 0, a1 = 2, S = 2
> >   { 2 + 2x, 4 + 2x, 6 + 2x }        a0 = 2 + 2x, a1 = 4 + 2x, S = 2
diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index 9f7beae14e5..a150a75faf5 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -85,6 +85,9 @@ along with GCC; see the file COPYING3.  If not see
 #include "vec-perm-indices.h"
 #include "asan.h"
 #include "gimple-range.h"
+#include <algorithm>
+#include "tree-pretty-print.h"
+#include "print-tree.h"
 
 /* Nonzero if we are folding constants inside an initializer or a C++
    manifestly-constant-evaluated context; zero otherwise.
@@ -10494,38 +10497,53 @@ fold_mult_zconjz (location_t loc, tree type, tree expr)
 			  build_zero_cst (itype));
 }
 
+/* Check if PATTERN in SEL selects either ARG0 or ARG1,
+   and return the selected arg, otherwise return NULL_TREE.  */
 
-/* Helper function for fold_vec_perm.  Store elements of VECTOR_CST or
-   CONSTRUCTOR ARG into array ELTS, which has NELTS elements, and return
-   true if successful.  */
-
-static bool
-vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
+static tree
+get_vector_for_pattern (tree arg0, tree arg1,
+			const vec_perm_indices &sel, unsigned pattern)
 {
-  unsigned HOST_WIDE_INT i, nunits;
+  unsigned sel_npatterns = sel.encoding ().npatterns ();
+  unsigned sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
 
-  if (TREE_CODE (arg) == VECTOR_CST
-      && VECTOR_CST_NELTS (arg).is_constant (&nunits))
+  poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+  poly_uint64 nsel = sel.length ();
+  poly_uint64 esel;
+
+  if (!multiple_p (nsel, sel_npatterns, &esel))
+    return NULL_TREE;
+
+  poly_uint64 a1 = sel[pattern + sel_npatterns];
+  int S = 0;
+  if (sel_nelts_per_pattern == 3)
     {
-      for (i = 0; i < nunits; ++i)
-	elts[i] = VECTOR_CST_ELT (arg, i);
+      poly_uint64 a2 = sel[pattern + 2 * sel_npatterns];
+      S = (a2 - a1).to_constant ();
     }
-  else if (TREE_CODE (arg) == CONSTRUCTOR)
+  
+  poly_uint64 ae = a1 + (esel - 2) * S;
+  uint64_t q1, qe;
+  poly_uint64 r1, re;
+
+  if (!can_div_trunc_p (a1, n1, &q1, &r1)
+      || !can_div_trunc_p (ae, n1, &qe, &re)
+      || (q1 != qe))
+    return NULL_TREE;
+
+  tree arg = ((q1 & 1) == 0) ? arg0 : arg1;
+
+  if (S < 0)
     {
-      constructor_elt *elt;
+      poly_uint64 a0 = sel[pattern];
+      if (!known_eq (S, a1 - a0))
+        return NULL_TREE;
 
-      FOR_EACH_VEC_SAFE_ELT (CONSTRUCTOR_ELTS (arg), i, elt)
-	if (i >= nelts || TREE_CODE (TREE_TYPE (elt->value)) == VECTOR_TYPE)
-	  return false;
-	else
-	  elts[i] = elt->value;
+      if (!known_gt (re, VECTOR_CST_NPATTERNS (arg)))
+        return NULL_TREE;
     }
-  else
-    return false;
-  for (; i < nelts; i++)
-    elts[i]
-      = fold_convert (TREE_TYPE (TREE_TYPE (arg)), integer_zero_node);
-  return true;
+  
+  return arg;
 }
 
 /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL
@@ -10539,41 +10557,112 @@ fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
   unsigned HOST_WIDE_INT nelts;
   bool need_ctor = false;
 
-  if (!sel.length ().is_constant (&nelts))
-    return NULL_TREE;
-  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts)
-	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts)
-	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts));
+  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), sel.length ())
+	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)),
+			   TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1))));
   if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type)
       || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type))
     return NULL_TREE;
 
-  tree *in_elts = XALLOCAVEC (tree, nelts * 2);
-  if (!vec_cst_ctor_to_array (arg0, nelts, in_elts)
-      || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts))
+  unsigned res_npatterns = 0;
+  unsigned res_nelts_per_pattern = 0;
+  unsigned sel_npatterns = 0;
+  tree *vector_for_pattern = NULL;
+
+  if (TREE_CODE (arg0) == VECTOR_CST
+      && TREE_CODE (arg1) == VECTOR_CST
+      && !sel.length ().is_constant ())
+    {
+      sel_npatterns = sel.encoding ().npatterns ();
+      vector_for_pattern = XALLOCAVEC (tree, sel_npatterns);
+      for (unsigned i = 0; i < sel_npatterns; i++)
+	{
+	  tree op = get_vector_for_pattern (arg0, arg1, sel, i);
+	  if (!op)
+	    return NULL_TREE;
+	  vector_for_pattern[i] = op;
+	}
+
+      unsigned arg0_npatterns = VECTOR_CST_NPATTERNS (arg0);
+      unsigned arg1_npatterns = VECTOR_CST_NPATTERNS (arg1);
+
+      res_npatterns
+        = least_common_multiple (sel_npatterns,
+				 least_common_multiple (arg0_npatterns,
+				 			arg1_npatterns));
+      res_nelts_per_pattern
+	= std::max(sel.encoding ().nelts_per_pattern (),
+		   std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
+			     VECTOR_CST_NELTS_PER_PATTERN (arg1)));
+    }
+  else if (sel.length ().is_constant (&nelts)
+	   && TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
+	   && TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).to_constant () == nelts)
+    {
+      /* For VLS vectors, treat all vectors with
+	 npatterns = nelts, nelts_per_pattern = 1. */
+      res_npatterns = sel_npatterns = nelts;
+      res_nelts_per_pattern = 1;
+      vector_for_pattern = XALLOCAVEC (tree, nelts);
+      for (unsigned i = 0; i < nelts; i++)
+        {
+	  HOST_WIDE_INT index;
+	  if (!sel[i].is_constant (&index))
+	    return NULL_TREE;
+	  vector_for_pattern[i] = (index < nelts) ? arg0 : arg1;	
+	}
+    }
+  else
     return NULL_TREE;
 
-  tree_vector_builder out_elts (type, nelts, 1);
-  for (i = 0; i < nelts; i++)
+  tree_vector_builder out_elts (type, res_npatterns,
+				res_nelts_per_pattern);
+  unsigned res_nelts = res_npatterns * res_nelts_per_pattern;
+  for (unsigned i = 0; i < res_nelts; i++)
     {
-      HOST_WIDE_INT index;
-      if (!sel[i].is_constant (&index))
+      poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+      uint64_t q;
+      poly_uint64 r;
+
+      /* Divide sel[i] by input vector length, to obtain remainder,
+	 which would be the index for either input vector.  */
+      if (!can_div_trunc_p (sel[i], n1, &q, &r))
 	return NULL_TREE;
-      if (!CONSTANT_CLASS_P (in_elts[index]))
-	need_ctor = true;
-      out_elts.quick_push (unshare_expr (in_elts[index]));
+
+      unsigned HOST_WIDE_INT index;
+      if (!r.is_constant (&index))
+	return NULL_TREE;
+
+      /* For VLA vectors, i % sel_npatterns would give the pattern
+         in sel that ith elem belongs to.
+	 For VLS vectors, sel_npatterns == res_nelts == nelts,
+	 so i % sel_npatterns == i since i < nelts */
+      tree arg = vector_for_pattern[i % sel_npatterns];
+      tree elem;
+      if (TREE_CODE (arg) == CONSTRUCTOR)
+        {
+	  gcc_assert (index < nelts);
+	  if (index >= vec_safe_length (CONSTRUCTOR_ELTS (arg)))
+	    return NULL_TREE;
+	  elem = CONSTRUCTOR_ELT (arg, index)->value;
+	  if (VECTOR_TYPE_P (TREE_TYPE (elem)))
+	    return NULL_TREE;
+	  need_ctor = true;
+	}
+      else
+        elem = vector_cst_elt (arg, index);
+      out_elts.quick_push (elem);
     }
 
   if (need_ctor)
     {
       vec<constructor_elt, va_gc> *v;
-      vec_alloc (v, nelts);
-      for (i = 0; i < nelts; i++)
+      vec_alloc (v, res_nelts);
+      for (i = 0; i < res_nelts; i++)
 	CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, out_elts[i]);
       return build_constructor (type, v);
     }
-  else
-    return out_elts.build ();
+  return out_elts.build ();
 }
 
 /* Try to fold a pointer difference of type TYPE two address expressions of
@@ -16910,6 +16999,97 @@ test_vec_duplicate_folding ()
   ASSERT_TRUE (operand_equal_p (dup5_expr, dup5_cst, 0));
 }
 
+static tree
+build_vec_int_cst (unsigned npatterns, unsigned nelts_per_pattern,
+		   int *encoded_elems)
+{
+  scalar_int_mode int_mode = SCALAR_INT_TYPE_MODE (integer_type_node);
+  machine_mode vmode = targetm.vectorize.preferred_simd_mode (int_mode);
+  //machine_mode vmode = VNx4SImode;
+  poly_uint64 nunits = GET_MODE_NUNITS (vmode);
+  tree vectype = build_vector_type (integer_type_node, nunits);
+
+  tree_vector_builder builder (vectype, npatterns, nelts_per_pattern);
+  for (unsigned i = 0; i < npatterns * nelts_per_pattern; i++)
+    builder.quick_push (build_int_cst (integer_type_node, encoded_elems[i]));
+  return builder.build ();
+}
+
+static void
+test_vec_perm_vla_folding ()
+{
+  int arg0_elems[] = { 1, 11, 2, 12, 3, 13 };
+  tree arg0 = build_vec_int_cst (2, 3, arg0_elems);
+
+  int arg1_elems[] = { 21, 31, 22, 32, 23, 33 };
+  tree arg1 = build_vec_int_cst (2, 3, arg1_elems);
+
+  if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
+      || TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)).is_constant ())
+    return;
+
+  /* Case 1: For mask: {0, 1, 2, ...}, npatterns == 1, nelts_per_pattern == 3,
+     should select arg0.  */
+  {
+    int mask_elems[] = {0, 1, 2};
+    tree mask = build_vec_int_cst (1, 3, mask_elems);
+    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
+    ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == 2);
+    ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == 3);
+
+    unsigned res_nelts = vector_cst_encoded_nelts (res);
+    for (unsigned i = 0; i < res_nelts; i++)
+      ASSERT_TRUE (operand_equal_p (VECTOR_CST_ELT (res, i),
+				    VECTOR_CST_ELT (arg0, i), 0));
+  }
+
+  /* Case 2: For mask: {4, 5, 6, ...}, npatterns == 1, nelts_per_pattern == 3,
+     should return NULL because for len = 4 + 4x,
+     if x == 0, we select from arg1
+     if x > 0, we select from arg0
+     and thus cannot determine result at compile time.  */
+  {
+    int mask_elems[] = {4, 5, 6};
+    tree mask = build_vec_int_cst (1, 3, mask_elems);
+    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
+    ASSERT_TRUE (res == NULL_TREE);
+  }
+
+  /* Case 3:
+     mask: {0, 0, 0, 1, 0, 2, ...} 
+     npatterns == 2, nelts_per_pattern == 3
+     Pattern {0, ...} should select arg0[0], ie, 1.
+     Pattern {0, 1, 2, ...} should select arg0: {1, 11, 2, ...},
+     so res = {1, 1, 1, 11, 1, 2, ...}.  */
+  {
+    int mask_elems[] = {0, 0, 0, 1, 0, 2};
+    tree mask = build_vec_int_cst (2, 3, mask_elems);
+    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
+
+    ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == 2);
+    ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == 3);
+
+    /* Check encoding: {1, 11, 2, ...} */
+    int res_encoded_elems[] = {1, 1, 1, 11, 1, 2};
+    for (unsigned i = 0; i < vector_cst_encoded_nelts (res); i++)
+      ASSERT_TRUE (wi::to_wide(VECTOR_CST_ELT (res, i)) == res_encoded_elems[i]);
+  }
+
+  /* Case 4:
+     mask: {0, 4 + 4x, 0, 5 + 4x, 0, 6 + 4x, ...}
+     npatterns == 2, nelts_per_pattern == 3
+     Pattern {0, ...} should select arg0[0]
+     Pattern {4 + 4x, 5 + 4x, 6 + 4x, ...} should select from arg1, since:
+     a1 = 5 + 4x
+     ae = (5 + 4x) + ((4 + 4x) / 2 - 2) * 1
+        = 5 + 6x
+     Since a1/4+4x == ae/4+4x == 1, we select arg1[0], arg1[1], arg1[2], ...
+     res: {1, 21, 1, 31, 1, 22, ... }
+     FIXME: How to build vector with poly_int elems ?  */
+
+  /* Case 5: S < 0.  */
+}
+
 /* Run all of the selftests within this file.  */
 
 void
@@ -16918,6 +17098,7 @@ fold_const_cc_tests ()
   test_arithmetic_folding ();
   test_vector_folding ();
   test_vec_duplicate_folding ();
+  test_vec_perm_vla_folding ();
 }
 
 } // namespace selftest

Prathamesh Kulkarni Oct. 17, 2022, 10:32 a.m. UTC | #16

On Mon, 10 Oct 2022 at 16:18, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Fri, 30 Sept 2022 at 21:38, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
> >
> > Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > >> Sorry to ask a silly question but in which case shall we select 2nd vector ?
> > >> For num_poly_int_coeffs == 2,
> > >> a1 /trunc n1 == (a1 + 0x) / (n1.coeffs[0] + n1.coeffs[1]*x)
> > >> If a1/trunc n1 succeeds,
> > >> 0 / n1.coeffs[1] == a1/n1.coeffs[0] == 0.
> > >> So, a1 has to be < n1.coeffs[0] ?
> > >
> > > Remember that a1 is itself a poly_int.  It's not necessarily a constant.
> > >
> > > E.g. the TRN1 .D instruction maps to a VEC_PERM_EXPR with the selector:
> > >
> > >   { 0, 2 + 2x, 1, 4 + 2x, 2, 6 + 2x, ... }
> >
> > Sorry, should have been:
> >
> >   { 0, 2 + 2x, 2, 4 + 2x, 4, 6 + 2x, ... }
> Hi Richard,
> Thanks for the clarifications, and sorry for late reply.
> I have attached POC patch that tries to implement the above approach.
> Passes bootstrap+test on x86_64-linux-gnu and aarch64-linux-gnu for VLS vectors.
>
> For VLA vectors, I have only done limited testing so far.
> It seems to pass couple of tests written in the patch for
> nelts_per_pattern == 3,
> and folds the following svld1rq test:
> int32x4_t v = {1, 2, 3, 4};
> return svld1rq_s32 (svptrue_b8 (), &v[0])
> into:
> return {1, 2, 3, 4, ...};
> I will try to bootstrap+test it on SVE machine to test further for VLA folding.
With the attached patch, it seems to pass bootstrap+test with SVE enabled.
The only difference w.r.t. the previous patch is that
get_vector_for_pattern
now checks whether S is constant, and returns NULL_TREE otherwise.

I added this check because 930325-1.c ICE'd with the previous patch:
it had the following VEC_PERM_EXPR,
where S is non-constant:
vect__16.13_70 = VEC_PERM_EXPR <vect__16.12_69, vect__16.12_69, {
POLY_INT_CST [3, 4], POLY_INT_CST [6, 8], POLY_INT_CST [9, 12], ...
}>;
I am not sure how to proceed in this case, so I chose to bail out.
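
To show why S is non-constant for that selector, here is a toy model of the step computation (not the code in the patch). `Poly2` stands for c0 + c1 * x, matching POLY_INT_CST [c0, c1]; `Poly2` and `constant_step` are hypothetical names used only here.

```cpp
#include <cassert>
#include <cstdint>

// Toy two-coefficient polynomial c0 + c1 * x, signed so that
// differences may be negative.
struct Poly2 {
  int64_t c0, c1;
  bool is_constant () const { return c1 == 0; }
};

Poly2 operator- (Poly2 a, Poly2 b)
{
  return Poly2 { a.c0 - b.c0, a.c1 - b.c1 };
}

// Compute S = a2 - a1 for a stepped pattern.  Return false when the
// difference still depends on x, i.e. is not a compile-time constant.
bool constant_step (Poly2 a1, Poly2 a2, int64_t *step)
{
  Poly2 diff = a2 - a1;
  if (!diff.is_constant ())
    return false;
  *step = diff.c0;
  return true;
}
```

For a1 = [6, 8] and a2 = [9, 12] the difference is 3 + 4x, so the sketch reports failure, which corresponds to the bail-out added in this version of the patch.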

Thanks,
Prathamesh

>
> I have a couple of questions:
> 1] When mask selects elements from same vector but from different patterns:
> For eg:
> arg0 = {1, 11, 2, 12, 3, 13, ...},
> arg1 = {21, 31, 22, 32, 23, 33, ...},
> mask = {0, 0, 0, 1, 0, 2, ... },
> All have npatterns = 2, nelts_per_pattern = 3.
>
> With above mask,
> Pattern {0, ...} selects arg0[0], ie {1, ...}
> Pattern {0, 1, 2, ...} selects arg0[0], arg0[1], arg0[2], ie {1, 11, 2, ...}
> While arg0[0] and arg0[2] belong to same pattern, arg0[1] belongs to different
> pattern in arg0.
> The result is:
> res = {1, 1, 1, 11, 1, 2, ...}
> In this case, res's 2nd pattern {1, 11, 2, ...} is encoded with:
> with a0 = 1, a1 = 11, S = -9.
> Is that expected tho ? It seems to create a new encoding which
> wasn't present in the input vector. For instance, the next elem in
> sequence would be -7,
> which is not present originally in arg0.
> I suppose it's fine since if the user defines mask to have pattern {0,
> 1, 2, ...}
> they intended result to have pattern with above encoding.
> Just wanted to confirm if this is correct ?
>
> 2] Could you please suggest a test-case for S < 0 ?
> I am not able to come up with one :/
>
> Thanks,
> Prathamesh
> >
> > > which is an interleaving of the two patterns:
> > >
> > >   { 0, 2, 4, ... }                  a0 = 0, a1 = 2, S = 2
> > >   { 2 + 2x, 4 + 2x, 6 + 2x }        a0 = 2 + 2x, a1 = 4 + 2x, S = 2
diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index 9f7beae14e5..e93f2c7b592 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -85,6 +85,9 @@ along with GCC; see the file COPYING3.  If not see
 #include "vec-perm-indices.h"
 #include "asan.h"
 #include "gimple-range.h"
+#include <algorithm>
+#include "tree-pretty-print.h"
+#include "print-tree.h"
 
 /* Nonzero if we are folding constants inside an initializer or a C++
    manifestly-constant-evaluated context; zero otherwise.
@@ -10494,38 +10497,56 @@ fold_mult_zconjz (location_t loc, tree type, tree expr)
 			  build_zero_cst (itype));
 }
 
+/* Check if PATTERN in SEL selects either ARG0 or ARG1,
+   and return the selected arg, otherwise return NULL_TREE.  */
 
-/* Helper function for fold_vec_perm.  Store elements of VECTOR_CST or
-   CONSTRUCTOR ARG into array ELTS, which has NELTS elements, and return
-   true if successful.  */
-
-static bool
-vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
+static tree
+get_vector_for_pattern (tree arg0, tree arg1,
+			const vec_perm_indices &sel, unsigned pattern)
 {
-  unsigned HOST_WIDE_INT i, nunits;
+  unsigned sel_npatterns = sel.encoding ().npatterns ();
+  unsigned sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
 
-  if (TREE_CODE (arg) == VECTOR_CST
-      && VECTOR_CST_NELTS (arg).is_constant (&nunits))
+  poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+  poly_uint64 nsel = sel.length ();
+  poly_uint64 esel;
+
+  if (!multiple_p (nsel, sel_npatterns, &esel))
+    return NULL_TREE;
+
+  poly_uint64 a1 = sel[pattern + sel_npatterns];
+  int64_t S = 0;
+  if (sel_nelts_per_pattern == 3)
     {
-      for (i = 0; i < nunits; ++i)
-	elts[i] = VECTOR_CST_ELT (arg, i);
+      poly_uint64 a2 = sel[pattern + 2 * sel_npatterns];
+      poly_uint64 diff = a2 - a1;
+      if (!diff.is_constant ())
+	return NULL_TREE;
+      S = diff.to_constant ();
     }
-  else if (TREE_CODE (arg) == CONSTRUCTOR)
+  
+  poly_uint64 ae = a1 + (esel - 2) * S;
+  uint64_t q1, qe;
+  poly_uint64 r1, re;
+
+  if (!can_div_trunc_p (a1, n1, &q1, &r1)
+      || !can_div_trunc_p (ae, n1, &qe, &re)
+      || (q1 != qe))
+    return NULL_TREE;
+
+  tree arg = ((q1 & 1) == 0) ? arg0 : arg1;
+
+  if (S < 0)
     {
-      constructor_elt *elt;
+      poly_uint64 a0 = sel[pattern];
+      if (!known_eq (S, a1 - a0))
+        return NULL_TREE;
 
-      FOR_EACH_VEC_SAFE_ELT (CONSTRUCTOR_ELTS (arg), i, elt)
-	if (i >= nelts || TREE_CODE (TREE_TYPE (elt->value)) == VECTOR_TYPE)
-	  return false;
-	else
-	  elts[i] = elt->value;
+      if (!known_gt (re, VECTOR_CST_NPATTERNS (arg)))
+        return NULL_TREE;
     }
-  else
-    return false;
-  for (; i < nelts; i++)
-    elts[i]
-      = fold_convert (TREE_TYPE (TREE_TYPE (arg)), integer_zero_node);
-  return true;
+  
+  return arg;
 }
 
 /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL
@@ -10539,41 +10560,112 @@ fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
   unsigned HOST_WIDE_INT nelts;
   bool need_ctor = false;
 
-  if (!sel.length ().is_constant (&nelts))
-    return NULL_TREE;
-  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts)
-	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts)
-	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts));
+  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), sel.length ())
+	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)),
+			   TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1))));
   if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type)
       || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type))
     return NULL_TREE;
 
-  tree *in_elts = XALLOCAVEC (tree, nelts * 2);
-  if (!vec_cst_ctor_to_array (arg0, nelts, in_elts)
-      || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts))
+  unsigned res_npatterns = 0;
+  unsigned res_nelts_per_pattern = 0;
+  unsigned sel_npatterns = 0;
+  tree *vector_for_pattern = NULL;
+
+  if (TREE_CODE (arg0) == VECTOR_CST
+      && TREE_CODE (arg1) == VECTOR_CST
+      && !sel.length ().is_constant ())
+    {
+      sel_npatterns = sel.encoding ().npatterns ();
+      vector_for_pattern = XALLOCAVEC (tree, sel_npatterns);
+      for (unsigned i = 0; i < sel_npatterns; i++)
+	{
+	  tree op = get_vector_for_pattern (arg0, arg1, sel, i);
+	  if (!op)
+	    return NULL_TREE;
+	  vector_for_pattern[i] = op;
+	}
+
+      unsigned arg0_npatterns = VECTOR_CST_NPATTERNS (arg0);
+      unsigned arg1_npatterns = VECTOR_CST_NPATTERNS (arg1);
+
+      res_npatterns
+        = least_common_multiple (sel_npatterns,
+				 least_common_multiple (arg0_npatterns,
+				 			arg1_npatterns));
+      res_nelts_per_pattern
+	= std::max(sel.encoding ().nelts_per_pattern (),
+		   std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
+			     VECTOR_CST_NELTS_PER_PATTERN (arg1)));
+    }
+  else if (sel.length ().is_constant (&nelts)
+	   && TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
+	   && TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).to_constant () == nelts)
+    {
+      /* For VLS vectors, treat all vectors with
+	 npatterns = nelts, nelts_per_pattern = 1. */
+      res_npatterns = sel_npatterns = nelts;
+      res_nelts_per_pattern = 1;
+      vector_for_pattern = XALLOCAVEC (tree, nelts);
+      for (unsigned i = 0; i < nelts; i++)
+        {
+	  HOST_WIDE_INT index;
+	  if (!sel[i].is_constant (&index))
+	    return NULL_TREE;
+	  vector_for_pattern[i] = (index < nelts) ? arg0 : arg1;	
+	}
+    }
+  else
     return NULL_TREE;
 
-  tree_vector_builder out_elts (type, nelts, 1);
-  for (i = 0; i < nelts; i++)
+  tree_vector_builder out_elts (type, res_npatterns,
+				res_nelts_per_pattern);
+  unsigned res_nelts = res_npatterns * res_nelts_per_pattern;
+  for (unsigned i = 0; i < res_nelts; i++)
     {
-      HOST_WIDE_INT index;
-      if (!sel[i].is_constant (&index))
+      poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+      uint64_t q;
+      poly_uint64 r;
+
+      /* Divide sel[i] by input vector length, to obtain remainder,
+	 which would be the index for either input vector.  */
+      if (!can_div_trunc_p (sel[i], n1, &q, &r))
 	return NULL_TREE;
-      if (!CONSTANT_CLASS_P (in_elts[index]))
-	need_ctor = true;
-      out_elts.quick_push (unshare_expr (in_elts[index]));
+
+      unsigned HOST_WIDE_INT index;
+      if (!r.is_constant (&index))
+	return NULL_TREE;
+
+      /* For VLA vectors, i % sel_npatterns would give the pattern
+         in sel that ith elem belongs to.
+	 For VLS vectors, sel_npatterns == res_nelts == nelts,
+	 so i % sel_npatterns == i since i < nelts */
+      tree arg = vector_for_pattern[i % sel_npatterns];
+      tree elem;
+      if (TREE_CODE (arg) == CONSTRUCTOR)
+        {
+	  gcc_assert (index < nelts);
+	  if (index >= vec_safe_length (CONSTRUCTOR_ELTS (arg)))
+	    return NULL_TREE;
+	  elem = CONSTRUCTOR_ELT (arg, index)->value;
+	  if (VECTOR_TYPE_P (TREE_TYPE (elem)))
+	    return NULL_TREE;
+	  need_ctor = true;
+	}
+      else
+        elem = vector_cst_elt (arg, index);
+      out_elts.quick_push (elem);
     }
 
   if (need_ctor)
     {
       vec<constructor_elt, va_gc> *v;
-      vec_alloc (v, nelts);
-      for (i = 0; i < nelts; i++)
+      vec_alloc (v, res_nelts);
+      for (i = 0; i < res_nelts; i++)
 	CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, out_elts[i]);
       return build_constructor (type, v);
     }
-  else
-    return out_elts.build ();
+  return out_elts.build ();
 }
 
 /* Try to fold a pointer difference of type TYPE two address expressions of
@@ -16910,6 +17002,97 @@ test_vec_duplicate_folding ()
   ASSERT_TRUE (operand_equal_p (dup5_expr, dup5_cst, 0));
 }
 
+static tree
+build_vec_int_cst (unsigned npatterns, unsigned nelts_per_pattern,
+		   int *encoded_elems)
+{
+  scalar_int_mode int_mode = SCALAR_INT_TYPE_MODE (integer_type_node);
+  machine_mode vmode = targetm.vectorize.preferred_simd_mode (int_mode);
+  //machine_mode vmode = VNx4SImode;
+  poly_uint64 nunits = GET_MODE_NUNITS (vmode);
+  tree vectype = build_vector_type (integer_type_node, nunits);
+
+  tree_vector_builder builder (vectype, npatterns, nelts_per_pattern);
+  for (unsigned i = 0; i < npatterns * nelts_per_pattern; i++)
+    builder.quick_push (build_int_cst (integer_type_node, encoded_elems[i]));
+  return builder.build ();
+}
+
+static void
+test_vec_perm_vla_folding ()
+{
+  int arg0_elems[] = { 1, 11, 2, 12, 3, 13 };
+  tree arg0 = build_vec_int_cst (2, 3, arg0_elems);
+
+  int arg1_elems[] = { 21, 31, 22, 32, 23, 33 };
+  tree arg1 = build_vec_int_cst (2, 3, arg1_elems);
+
+  if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
+      || TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)).is_constant ())
+    return;
+
+  /* Case 1: For mask: {0, 1, 2, ...}, npatterns == 1, nelts_per_pattern == 3,
+     should select arg0.  */
+  {
+    int mask_elems[] = {0, 1, 2};
+    tree mask = build_vec_int_cst (1, 3, mask_elems);
+    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
+    ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == 2);
+    ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == 3);
+
+    unsigned res_nelts = vector_cst_encoded_nelts (res);
+    for (unsigned i = 0; i < res_nelts; i++)
+      ASSERT_TRUE (operand_equal_p (VECTOR_CST_ELT (res, i),
+				    VECTOR_CST_ELT (arg0, i), 0));
+  }
+
+  /* Case 2: For mask: {4, 5, 6, ...}, npatterns == 1, nelts_per_pattern == 3,
+     should return NULL because for len = 4 + 4x,
+     if x == 0, we select from arg1
+     if x > 0, we select from arg0
+     and thus cannot determine result at compile time.  */
+  {
+    int mask_elems[] = {4, 5, 6};
+    tree mask = build_vec_int_cst (1, 3, mask_elems);
+    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
+    ASSERT_TRUE (res == NULL_TREE);
+  }
+
+  /* Case 3:
+     mask: {0, 0, 0, 1, 0, 2, ...} 
+     npatterns == 2, nelts_per_pattern == 3
+     Pattern {0, ...} should select arg0[0], ie, 1.
+     Pattern {0, 1, 2, ...} should select arg0: {1, 11, 2, ...},
+     so res = {1, 1, 1, 11, 1, 2, ...}.  */
+  {
+    int mask_elems[] = {0, 0, 0, 1, 0, 2};
+    tree mask = build_vec_int_cst (2, 3, mask_elems);
+    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
+
+    ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == 2);
+    ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == 3);
+
+    /* Check encoding: {1, 11, 2, ...} */
+    int res_encoded_elems[] = {1, 1, 1, 11, 1, 2};
+    for (unsigned i = 0; i < vector_cst_encoded_nelts (res); i++)
+      ASSERT_TRUE (wi::to_wide(VECTOR_CST_ELT (res, i)) == res_encoded_elems[i]);
+  }
+
+  /* Case 4:
+     mask: {0, 4 + 4x, 0, 5 + 4x, 0, 6 + 4x, ...}
+     npatterns == 2, nelts_per_pattern == 3
+     Pattern {0, ...} should select arg0[0]
+     Pattern {4 + 4x, 5 + 4x, 6 + 4x, ...} should select from arg1, since:
+     a1 = 5 + 4x
+     ae = (5 + 4x) + ((4 + 4x) / 2 - 2) * 1
+        = 5 + 6x
+     Since a1/4+4x == ae/4+4x == 1, we select arg1[0], arg1[1], arg1[2], ...
+     res: {1, 21, 1, 31, 1, 22, ... }
+     FIXME: How to build vector with poly_int elems ?  */
+
+  /* Case 5: S < 0.  */
+}
+
 /* Run all of the selftests within this file.  */
 
 void
@@ -16918,6 +17101,7 @@ fold_const_cc_tests ()
   test_arithmetic_folding ();
   test_vector_folding ();
   test_vec_duplicate_folding ();
+  test_vec_perm_vla_folding ();
 }
 
 } // namespace selftest

Prathamesh Kulkarni Oct. 24, 2022, 8:12 a.m. UTC | #17

On Mon, 17 Oct 2022 at 16:02, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Mon, 10 Oct 2022 at 16:18, Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> >
> > On Fri, 30 Sept 2022 at 21:38, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> > >
> > > Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > > > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > > >> Sorry to ask a silly question but in which case shall we select 2nd vector ?
> > > >> For num_poly_int_coeffs == 2,
> > > >> a1 /trunc n1 == (a1 + 0x) / (n1.coeffs[0] + n1.coeffs[1]*x)
> > > >> If a1/trunc n1 succeeds,
> > > >> 0 / n1.coeffs[1] == a1/n1.coeffs[0] == 0.
> > > >> So, a1 has to be < n1.coeffs[0] ?
> > > >
> > > > Remember that a1 is itself a poly_int.  It's not necessarily a constant.
> > > >
> > > > E.g. the TRN1 .D instruction maps to a VEC_PERM_EXPR with the selector:
> > > >
> > > >   { 0, 2 + 2x, 1, 4 + 2x, 2, 6 + 2x, ... }
> > >
> > > Sorry, should have been:
> > >
> > >   { 0, 2 + 2x, 2, 4 + 2x, 4, 6 + 2x, ... }
> > Hi Richard,
> > Thanks for the clarifications, and sorry for late reply.
> > I have attached POC patch that tries to implement the above approach.
> > Passes bootstrap+test on x86_64-linux-gnu and aarch64-linux-gnu for VLS vectors.
> >
> > For VLA vectors, I have only done limited testing so far.
> > It seems to pass couple of tests written in the patch for
> > nelts_per_pattern == 3,
> > and folds the following svld1rq test:
> > int32x4_t v = {1, 2, 3, 4};
> > return svld1rq_s32 (svptrue_b8 (), &v[0])
> > into:
> > return {1, 2, 3, 4, ...};
> > I will try to bootstrap+test it on SVE machine to test further for VLA folding.
> With the attached patch it seems to pass bootstrap+test with SVE enabled.
> The only difference w.r.t previous patch is it adds check in
> get_vector_for_pattern
> if S is constant otherwise returns NULL_TREE.
>
> I added this check because 930325-1.c ICE'd with previous patch
> because it had following vec_perm_expr,
> where S was non-constant:
> vect__16.13_70 = VEC_PERM_EXPR <vect__16.12_69, vect__16.12_69, {
> POLY_INT_CST [3, 4], POLY_INT_CST [6, 8], POLY_INT_CST [9, 12], ...
> }>;
> I am not sure how to proceed in this case, so chose to bail out.
Hi Richard,
ping https://gcc.gnu.org/pipermail/gcc-patches/2022-October/603717.html

Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
>
> >
> > I have a couple of questions:
> > 1] When mask selects elements from same vector but from different patterns:
> > For eg:
> > arg0 = {1, 11, 2, 12, 3, 13, ...},
> > arg1 = {21, 31, 22, 32, 23, 33, ...},
> > mask = {0, 0, 0, 1, 0, 2, ... },
> > All have npatterns = 2, nelts_per_pattern = 3.
> >
> > With above mask,
> > Pattern {0, ...} selects arg0[0], ie {1, ...}
> > Pattern {0, 1, 2, ...} selects arg0[0], arg0[1], arg0[2], ie {1, 11, 2, ...}
> > While arg0[0] and arg0[2] belong to same pattern, arg0[1] belongs to different
> > pattern in arg0.
> > The result is:
> > res = {1, 1, 1, 11, 1, 2, ...}
> > In this case, res's 2nd pattern {1, 11, 2, ...} is encoded with:
> > with a0 = 1, a1 = 11, S = -9.
> > Is that expected tho ? It seems to create a new encoding which
> > wasn't present in the input vector. For instance, the next elem in
> > sequence would be -7,
> > which is not present originally in arg0.
> > I suppose it's fine since if the user defines mask to have pattern {0,
> > 1, 2, ...}
> > they intended result to have pattern with above encoding.
> > Just wanted to confirm if this is correct ?
> >
> > 2] Could you please suggest a test-case for S < 0 ?
> > I am not able to come up with one :/
> >
> > Thanks,
> > Prathamesh
> > >
> > > > which is an interleaving of the two patterns:
> > > >
> > > >   { 0, 2, 4, ... }                  a0 = 0, a1 = 2, S = 2
> > > >   { 2 + 2x, 4 + 2x, 6 + 2x }        a0 = 2 + 2x, a1 = 4 + 2x, S = 2

Richard Sandiford Oct. 26, 2022, 3:37 p.m. UTC | #18

Sorry for the slow response.  I wanted to find some time to think
about this a bit more.

Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> On Fri, 30 Sept 2022 at 21:38, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> >> Sorry to ask a silly question but in which case shall we select 2nd vector ?
>> >> For num_poly_int_coeffs == 2,
>> >> a1 /trunc n1 == (a1 + 0x) / (n1.coeffs[0] + n1.coeffs[1]*x)
>> >> If a1/trunc n1 succeeds,
>> >> 0 / n1.coeffs[1] == a1/n1.coeffs[0] == 0.
>> >> So, a1 has to be < n1.coeffs[0] ?
>> >
>> > Remember that a1 is itself a poly_int.  It's not necessarily a constant.
>> >
>> > E.g. the TRN1 .D instruction maps to a VEC_PERM_EXPR with the selector:
>> >
>> >   { 0, 2 + 2x, 1, 4 + 2x, 2, 6 + 2x, ... }
>>
>> Sorry, should have been:
>>
>>   { 0, 2 + 2x, 2, 4 + 2x, 4, 6 + 2x, ... }
> Hi Richard,
> Thanks for the clarifications, and sorry for late reply.
> I have attached POC patch that tries to implement the above approach.
> Passes bootstrap+test on x86_64-linux-gnu and aarch64-linux-gnu for VLS vectors.
>
> For VLA vectors, I have only done limited testing so far.
> It seems to pass couple of tests written in the patch for
> nelts_per_pattern == 3,
> and folds the following svld1rq test:
> int32x4_t v = {1, 2, 3, 4};
> return svld1rq_s32 (svptrue_b8 (), &v[0])
> into:
> return {1, 2, 3, 4, ...};
> I will try to bootstrap+test it on SVE machine to test further for VLA folding.
>
> I have a couple of questions:
> 1] When mask selects elements from same vector but from different patterns:
> For eg:
> arg0 = {1, 11, 2, 12, 3, 13, ...},
> arg1 = {21, 31, 22, 32, 23, 33, ...},
> mask = {0, 0, 0, 1, 0, 2, ... },
> All have npatterns = 2, nelts_per_pattern = 3.
>
> With above mask,
> Pattern {0, ...} selects arg0[0], ie {1, ...}
> Pattern {0, 1, 2, ...} selects arg0[0], arg0[1], arg0[2], ie {1, 11, 2, ...}
> While arg0[0] and arg0[2] belong to same pattern, arg0[1] belongs to different
> pattern in arg0.
> The result is:
> res = {1, 1, 1, 11, 1, 2, ...}
> In this case, res's 2nd pattern {1, 11, 2, ...} is encoded with:
> with a0 = 1, a1 = 11, S = -9.
> Is that expected tho ? It seems to create a new encoding which
> wasn't present in the input vector. For instance, the next elem in
> sequence would be -7,
> which is not present originally in arg0.

Yeah, you're right, sorry.  Going back to:

(2) The explicit encoding can be used to produce a sequence of N*Ex*Px
    elements for any integer N.  This extended sequence can be reencoded
    as having N*Px patterns, with Ex staying the same.

I guess we need to pick an N for the selector such that each new
selector pattern (each one out of the N*Px patterns) selects from
the *same pattern* of the same data input.

So if a particular pattern in the selector has a step S, and the data
input it selects from has Pi patterns, N*S must be a multiple of Pi.
N must be a multiple of least_common_multiple(S,Pi)/S.

I think that means that the total number of patterns in the result
(Pr from previous messages) can safely be:

  Ps * least_common_multiple(
    least_common_multiple(S[1], P[input(1)]) / S[1],
    ...
    least_common_multiple(S[Ps], P[input(Ps)]) / S[Ps]
  )

where:

  Ps = the number of patterns in the selector
  S[I] = the step for selector pattern I (I being 1-based)
  input(I) = the data input selected by selector pattern I (I being 1-based)
  P[I] = the number of patterns in data input I

That's getting quite complicated :-)  If we allow arbitrary P[...]
and S[...] then it could also get large.  Perhaps we should finally
give up on the general case and limit this to power-of-2 patterns and
power-of-2 steps, so that least_common_multiple becomes MAX.  Maybe that
simplifies other things as well.
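
A minimal sketch of the pattern-count formula above, for illustration only. The names sel_pattern, unroll_factor and result_npatterns are hypothetical and do not come from the patch; steps are taken by absolute value, and a pattern with a zero step is assumed to impose no constraint.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// One selector pattern: its step S and the number of patterns Pi in the
// data input it selects from.  Hypothetical type, not part of the patch.
struct sel_pattern
{
  int64_t step;
  uint64_t input_npatterns;
};

// Smallest N such that N * |S| is a multiple of Pi, that is
// least_common_multiple (|S|, Pi) / |S|.
// Assumption: a pattern with S == 0 imposes no constraint (factor 1).
static uint64_t
unroll_factor (const sel_pattern &p)
{
  assert (p.input_npatterns > 0);
  if (p.step == 0)
    return 1;
  uint64_t s = p.step < 0 ? -uint64_t (p.step) : uint64_t (p.step);
  return std::lcm (s, p.input_npatterns) / s;
}

// Pr = Ps * least_common_multiple of the per-pattern unroll factors.
static uint64_t
result_npatterns (const std::vector<sel_pattern> &sel)
{
  uint64_t n = 1;
  for (const sel_pattern &p : sel)
    n = std::lcm (n, unroll_factor (p));
  return sel.size () * n;
}
```

When both S and Pi are powers of 2, unroll_factor reduces to MAX (S, Pi) / S, which is the simplification suggested above.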

What do you think?

> I suppose it's fine since if the user defines mask to have pattern {0,
> 1, 2, ...}
> they intended result to have pattern with above encoding.
> Just wanted to confirm if this is correct ?
>
> 2] Could you please suggest a test-case for S < 0 ?
> I am not able to come up with one :/

svrev is one way of creating negative steps.

Thanks,
Richard

>
> Thanks,
> Prathamesh
>>
>> > which is an interleaving of the two patterns:
>> >
>> >   { 0, 2, 4, ... }                  a0 = 0, a1 = 2, S = 2
>> >   { 2 + 2x, 4 + 2x, 6 + 2x }        a0 = 2 + 2x, a1 = 4 + 2x, S = 2

Prathamesh Kulkarni Oct. 28, 2022, 2:46 p.m. UTC | #19

On Wed, 26 Oct 2022 at 21:07, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Sorry for the slow response.  I wanted to find some time to think
> about this a bit more.
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > On Fri, 30 Sept 2022 at 21:38, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> >> Sorry to ask a silly question but in which case shall we select 2nd vector ?
> >> >> For num_poly_int_coeffs == 2,
> >> >> a1 /trunc n1 == (a1 + 0x) / (n1.coeffs[0] + n1.coeffs[1]*x)
> >> >> If a1/trunc n1 succeeds,
> >> >> 0 / n1.coeffs[1] == a1/n1.coeffs[0] == 0.
> >> >> So, a1 has to be < n1.coeffs[0] ?
> >> >
> >> > Remember that a1 is itself a poly_int.  It's not necessarily a constant.
> >> >
> >> > E.g. the TRN1 .D instruction maps to a VEC_PERM_EXPR with the selector:
> >> >
> >> >   { 0, 2 + 2x, 1, 4 + 2x, 2, 6 + 2x, ... }
> >>
> >> Sorry, should have been:
> >>
> >>   { 0, 2 + 2x, 2, 4 + 2x, 4, 6 + 2x, ... }
> > Hi Richard,
> > Thanks for the clarifications, and sorry for late reply.
> > I have attached POC patch that tries to implement the above approach.
> > Passes bootstrap+test on x86_64-linux-gnu and aarch64-linux-gnu for VLS vectors.
> >
> > For VLA vectors, I have only done limited testing so far.
> > It seems to pass couple of tests written in the patch for
> > nelts_per_pattern == 3,
> > and folds the following svld1rq test:
> > int32x4_t v = {1, 2, 3, 4};
> > return svld1rq_s32 (svptrue_b8 (), &v[0])
> > into:
> > return {1, 2, 3, 4, ...};
> > I will try to bootstrap+test it on SVE machine to test further for VLA folding.
> >
> > I have a couple of questions:
> > 1] When mask selects elements from same vector but from different patterns:
> > For eg:
> > arg0 = {1, 11, 2, 12, 3, 13, ...},
> > arg1 = {21, 31, 22, 32, 23, 33, ...},
> > mask = {0, 0, 0, 1, 0, 2, ... },
> > All have npatterns = 2, nelts_per_pattern = 3.
> >
> > With above mask,
> > Pattern {0, ...} selects arg0[0], ie {1, ...}
> > Pattern {0, 1, 2, ...} selects arg0[0], arg0[1], arg0[2], ie {1, 11, 2, ...}
> > While arg0[0] and arg0[2] belong to same pattern, arg0[1] belongs to different
> > pattern in arg0.
> > The result is:
> > res = {1, 1, 1, 11, 1, 2, ...}
> > In this case, res's 2nd pattern {1, 11, 2, ...} is encoded with:
> > with a0 = 1, a1 = 11, S = -9.
> > Is that expected tho ? It seems to create a new encoding which
> > wasn't present in the input vector. For instance, the next elem in
> > sequence would be -7,
> > which is not present originally in arg0.
>
> Yeah, you're right, sorry.  Going back to:
>
> (2) The explicit encoding can be used to produce a sequence of N*Ex*Px
>     elements for any integer N.  This extended sequence can be reencoded
>     as having N*Px patterns, with Ex staying the same.
>
> I guess we need to pick an N for the selector such that each new
> selector pattern (each one out of the N*Px patterns) selects from
> the *same pattern* of the same data input.
>
> So if a particular pattern in the selector has a step S, and the data
> input it selects from has Pi patterns, N*S must be a multiple of Pi.
> N must be a multiple of least_common_multiple(S,Pi)/S.
>
> I think that means that the total number of patterns in the result
> (Pr from previous messages) can safely be:
>
>   Ps * least_common_multiple(
>     least_common_multiple(S[1], P[input(1)]) / S[1],
>     ...
>     least_common_multiple(S[Ps], P[input(Ps)]) / S[Ps]
>   )
>
> where:
>
>   Ps = the number of patterns in the selector
>   S[I] = the step for selector pattern I (I being 1-based)
>   input(I) = the data input selected by selector pattern I (I being 1-based)
>   P[I] = the number of patterns in data input I
>
> That's getting quite complicated :-)  If we allow arbitrary P[...]
> and S[...] then it could also get large.  Perhaps we should finally
> give up on the general case and limit this to power-of-2 patterns and
> power-of-2 steps, so that least_common_multiple becomes MAX.  Maybe that
> simplifies other things as well.
>
> What do you think?
Hi Richard,
Thanks for the suggestions. Yeah I suppose we can initially add support for
power-of-2 patterns and power-of-2 steps and try to generalize it in
follow up patches if possible.

Sorry if this sounds like a silly question: if each pattern in the
selector is going to select from the *same pattern of the same input vector*,
then instead of re-encoding the selector to have N * Ps patterns, would it
make sense for the elements of the selector to denote the pattern number
itself, rather than an element index,
when the input vectors are VLA?

For example:
op0 = {1, 2, 3, 4, 1, 2, 3, 5, 1, 2, 3, 6, ...}
op1 = {...}
with npatterns == 4, nelts_per_pattern == 3,
sel = {0, 3} should pick pattern 0 and pattern 3 from op0,
so, res = {1, 4, 1, 5, 1, 6, ...}
Not sure if this is correct though.
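
To make the example concrete, here is a small sketch of what "selecting by pattern number" would compute on the encoded elements. It illustrates only the alternative asked about here, not how VEC_PERM_EXPR selectors work today; select_patterns is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Build the encoding obtained by keeping only the patterns listed in
// PATTERN_SEL, in that order.  ENC holds NPATTERNS * NELTS_PER_PATTERN
// encoded elements with the patterns interleaved, so element K of
// pattern P lives at index K * NPATTERNS + P.
static std::vector<int64_t>
select_patterns (const std::vector<int64_t> &enc, unsigned npatterns,
                 unsigned nelts_per_pattern,
                 const std::vector<unsigned> &pattern_sel)
{
  assert (enc.size () == std::size_t (npatterns) * nelts_per_pattern);
  std::vector<int64_t> res;
  for (unsigned k = 0; k < nelts_per_pattern; ++k)
    for (unsigned p : pattern_sel)
      {
        assert (p < npatterns);
        res.push_back (enc[std::size_t (k) * npatterns + p]);
      }
  return res;
}
```

With op0's twelve encoded elements, npatterns == 4, nelts_per_pattern == 3 and pattern_sel = {0, 3}, this yields the encoded elements {1, 4, 1, 5, 1, 6}, with two patterns in the result.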

Thanks,
Prathamesh

Richard Sandiford Oct. 31, 2022, 9:57 a.m. UTC | #20

Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> On Wed, 26 Oct 2022 at 21:07, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Sorry for the slow response.  I wanted to find some time to think
>> about this a bit more.
>>
>> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> > On Fri, 30 Sept 2022 at 21:38, Richard Sandiford
>> > <richard.sandiford@arm.com> wrote:
>> >>
>> >> Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> >> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> >> >> Sorry to ask a silly question but in which case shall we select 2nd vector ?
>> >> >> For num_poly_int_coeffs == 2,
>> >> >> a1 /trunc n1 == (a1 + 0x) / (n1.coeffs[0] + n1.coeffs[1]*x)
>> >> >> If a1/trunc n1 succeeds,
>> >> >> 0 / n1.coeffs[1] == a1/n1.coeffs[0] == 0.
>> >> >> So, a1 has to be < n1.coeffs[0] ?
>> >> >
>> >> > Remember that a1 is itself a poly_int.  It's not necessarily a constant.
>> >> >
>> >> > E.g. the TRN1 .D instruction maps to a VEC_PERM_EXPR with the selector:
>> >> >
>> >> >   { 0, 2 + 2x, 1, 4 + 2x, 2, 6 + 2x, ... }
>> >>
>> >> Sorry, should have been:
>> >>
>> >>   { 0, 2 + 2x, 2, 4 + 2x, 4, 6 + 2x, ... }
>> > Hi Richard,
>> > Thanks for the clarifications, and sorry for late reply.
>> > I have attached POC patch that tries to implement the above approach.
>> > Passes bootstrap+test on x86_64-linux-gnu and aarch64-linux-gnu for VLS vectors.
>> >
>> > For VLA vectors, I have only done limited testing so far.
>> > It seems to pass couple of tests written in the patch for
>> > nelts_per_pattern == 3,
>> > and folds the following svld1rq test:
>> > int32x4_t v = {1, 2, 3, 4};
>> > return svld1rq_s32 (svptrue_b8 (), &v[0])
>> > into:
>> > return {1, 2, 3, 4, ...};
>> > I will try to bootstrap+test it on SVE machine to test further for VLA folding.
>> >
>> > I have a couple of questions:
>> > 1] When mask selects elements from same vector but from different patterns:
>> > For eg:
>> > arg0 = {1, 11, 2, 12, 3, 13, ...},
>> > arg1 = {21, 31, 22, 32, 23, 33, ...},
>> > mask = {0, 0, 0, 1, 0, 2, ... },
>> > All have npatterns = 2, nelts_per_pattern = 3.
>> >
>> > With above mask,
>> > Pattern {0, ...} selects arg0[0], ie {1, ...}
>> > Pattern {0, 1, 2, ...} selects arg0[0], arg0[1], arg0[2], ie {1, 11, 2, ...}
>> > While arg0[0] and arg0[2] belong to same pattern, arg0[1] belongs to different
>> > pattern in arg0.
>> > The result is:
>> > res = {1, 1, 1, 11, 1, 2, ...}
>> > In this case, res's 2nd pattern {1, 11, 2, ...} is encoded with:
>> > with a0 = 1, a1 = 11, S = -9.
>> > Is that expected tho ? It seems to create a new encoding which
>> > wasn't present in the input vector. For instance, the next elem in
>> > sequence would be -7,
>> > which is not present originally in arg0.
>>
>> Yeah, you're right, sorry.  Going back to:
>>
>> (2) The explicit encoding can be used to produce a sequence of N*Ex*Px
>>     elements for any integer N.  This extended sequence can be reencoded
>>     as having N*Px patterns, with Ex staying the same.
>>
>> I guess we need to pick an N for the selector such that each new
>> selector pattern (each one out of the N*Px patterns) selects from
>> the *same pattern* of the same data input.
>>
>> So if a particular pattern in the selector has a step S, and the data
>> input it selects from has Pi patterns, N*S must be a multiple of Pi.
>> N must be a multiple of least_common_multiple(S,Pi)/S.
>>
>> I think that means that the total number of patterns in the result
>> (Pr from previous messages) can safely be:
>>
>>   Ps * least_common_multiple(
>>     least_common_multiple(S[1], P[input(1)]) / S[1],
>>     ...
>>     least_common_multiple(S[Ps], P[input(Ps)]) / S[Ps]
>>   )
>>
>> where:
>>
>>   Ps = the number of patterns in the selector
>>   S[I] = the step for selector pattern I (I being 1-based)
>>   input(I) = the data input selected by selector pattern I (I being 1-based)
>>   P[I] = the number of patterns in data input I
>>
>> That's getting quite complicated :-)  If we allow arbitrary P[...]
>> and S[...] then it could also get large.  Perhaps we should finally
>> give up on the general case and limit this to power-of-2 patterns and
>> power-of-2 steps, so that least_common_multiple becomes MAX.  Maybe that
>> simplifies other things as well.
>>
>> What do you think?
> Hi Richard,
> Thanks for the suggestions. Yeah I suppose we can initially add support for
> power-of-2 patterns and power-of-2 steps and try to generalize it in
> follow up patches if possible.
>
> Sorry if this sounds like a silly ques -- if we are going to have
> pattern in selector, select *same pattern from same input vector*,
> instead of re-encoding the selector to have N * Ps patterns, would it
> make sense for elements in selector to denote pattern number itself
> instead of element index
> if input vectors are VLA ?
>
> For eg:
> op0 = {1, 2, 3, 4, 1, 2, 3, 5, 1, 2, 3, 6, ...}
> op1 = {...}
> with npatterns == 4, nelts_per_pattern == 3,
> sel = {0, 3} should pick pattern 0 and pattern 3 from op0,
> so, res = {1, 4, 1, 5, 1, 6, ...}
> Not sure if this is correct tho.

This wouldn't allow us to represent things like a "duplicate one
element", or "copy the leading N elements from the first input and
the other elements from elements N+ of the second input", which we
can with the current scheme.

The restriction about each (unwound) selector pattern selecting from the
same input pattern only applies to the case where the selector pattern is
stepped (and only applies to the stepped part of the pattern, not the
leading element).  The restriction is also local to this code; it
doesn't make other VEC_PERM_EXPRs invalid.

Thanks,
Richard


Prathamesh Kulkarni Nov. 4, 2022, 8:30 a.m. UTC | #21

On Mon, 31 Oct 2022 at 15:27, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > On Wed, 26 Oct 2022 at 21:07, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Sorry for the slow response.  I wanted to find some time to think
> >> about this a bit more.
> >>
> >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> > On Fri, 30 Sept 2022 at 21:38, Richard Sandiford
> >> > <richard.sandiford@arm.com> wrote:
> >> >>
> >> >> Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> >> >> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> >> >> Sorry to ask a silly question but in which case shall we select 2nd vector ?
> >> >> >> For num_poly_int_coeffs == 2,
> >> >> >> a1 /trunc n1 == (a1 + 0x) / (n1.coeffs[0] + n1.coeffs[1]*x)
> >> >> >> If a1/trunc n1 succeeds,
> >> >> >> 0 / n1.coeffs[1] == a1/n1.coeffs[0] == 0.
> >> >> >> So, a1 has to be < n1.coeffs[0] ?
> >> >> >
> >> >> > Remember that a1 is itself a poly_int.  It's not necessarily a constant.
> >> >> >
> >> >> > E.g. the TRN1 .D instruction maps to a VEC_PERM_EXPR with the selector:
> >> >> >
> >> >> >   { 0, 2 + 2x, 1, 4 + 2x, 2, 6 + 2x, ... }
> >> >>
> >> >> Sorry, should have been:
> >> >>
> >> >>   { 0, 2 + 2x, 2, 4 + 2x, 4, 6 + 2x, ... }
> >> > Hi Richard,
> >> > Thanks for the clarifications, and sorry for late reply.
> >> > I have attached POC patch that tries to implement the above approach.
> >> > Passes bootstrap+test on x86_64-linux-gnu and aarch64-linux-gnu for VLS vectors.
> >> >
> >> > For VLA vectors, I have only done limited testing so far.
> >> > It seems to pass couple of tests written in the patch for
> >> > nelts_per_pattern == 3,
> >> > and folds the following svld1rq test:
> >> > int32x4_t v = {1, 2, 3, 4};
> >> > return svld1rq_s32 (svptrue_b8 (), &v[0])
> >> > into:
> >> > return {1, 2, 3, 4, ...};
> >> > I will try to bootstrap+test it on SVE machine to test further for VLA folding.
> >> >
> >> > I have a couple of questions:
> >> > 1] When mask selects elements from same vector but from different patterns:
> >> > For eg:
> >> > arg0 = {1, 11, 2, 12, 3, 13, ...},
> >> > arg1 = {21, 31, 22, 32, 23, 33, ...},
> >> > mask = {0, 0, 0, 1, 0, 2, ... },
> >> > All have npatterns = 2, nelts_per_pattern = 3.
> >> >
> >> > With above mask,
> >> > Pattern {0, ...} selects arg0[0], ie {1, ...}
> >> > Pattern {0, 1, 2, ...} selects arg0[0], arg0[1], arg0[2], ie {1, 11, 2, ...}
> >> > While arg0[0] and arg0[2] belong to same pattern, arg0[1] belongs to different
> >> > pattern in arg0.
> >> > The result is:
> >> > res = {1, 1, 1, 11, 1, 2, ...}
> >> > In this case, res's 2nd pattern {1, 11, 2, ...} is encoded with:
> >> > with a0 = 1, a1 = 11, S = -9.
> >> > Is that expected tho ? It seems to create a new encoding which
> >> > wasn't present in the input vector. For instance, the next elem in
> >> > sequence would be -7,
> >> > which is not present originally in arg0.
> >>
> >> Yeah, you're right, sorry.  Going back to:
> >>
> >> (2) The explicit encoding can be used to produce a sequence of N*Ex*Px
> >>     elements for any integer N.  This extended sequence can be reencoded
> >>     as having N*Px patterns, with Ex staying the same.
> >>
> >> I guess we need to pick an N for the selector such that each new
> >> selector pattern (each one out of the N*Px patterns) selects from
> >> the *same pattern* of the same data input.
> >>
> >> So if a particular pattern in the selector has a step S, and the data
> >> input it selects from has Pi patterns, N*S must be a multiple of Pi.
> >> N must be a multiple of least_common_multiple(S,Pi)/S.
> >>
> >> I think that means that the total number of patterns in the result
> >> (Pr from previous messages) can safely be:
> >>
> >>   Ps * least_common_multiple(
> >>     least_common_multiple(S[1], P[input(1)]) / S[1],
> >>     ...
> >>     least_common_multiple(S[Ps], P[input(Ps)]) / S[Ps]
> >>   )
> >>
> >> where:
> >>
> >>   Ps = the number of patterns in the selector
> >>   S[I] = the step for selector pattern I (I being 1-based)
> >>   input(I) = the data input selected by selector pattern I (I being 1-based)
> >>   P[I] = the number of patterns in data input I
> >>
> >> That's getting quite complicated :-)  If we allow arbitrary P[...]
> >> and S[...] then it could also get large.  Perhaps we should finally
> >> give up on the general case and limit this to power-of-2 patterns and
> >> power-of-2 steps, so that least_common_multiple becomes MAX.  Maybe that
> >> simplifies other things as well.
> >>
> >> What do you think?
> > Hi Richard,
> > Thanks for the suggestions. Yeah I suppose we can initially add support for
> > power-of-2 patterns and power-of-2 steps and try to generalize it in
> > follow up patches if possible.
> >
> > Sorry if this sounds like a silly ques -- if we are going to have
> > pattern in selector, select *same pattern from same input vector*,
> > instead of re-encoding the selector to have N * Ps patterns, would it
> > make sense for elements in selector to denote pattern number itself
> > instead of element index
> > if input vectors are VLA ?
> >
> > For eg:
> > op0 = {1, 2, 3, 4, 1, 2, 3, 5, 1, 2, 3, 6, ...}
> > op1 = {...}
> > with npatterns == 4, nelts_per_pattern == 3,
> > sel = {0, 3} should pick pattern 0 and pattern 3 from op0,
> > so, res = {1, 4, 1, 5, 1, 6, ...}
> > Not sure if this is correct tho.
>
> This wouldn't allow us to represent things like a "duplicate one
> element", or "copy the leading N elements from the first input and
> the other elements from elements N+ of the second input", which we
> can with the current scheme.
>
> The restriction about each (unwound) selector pattern selecting from the
> same input pattern only applies to case where the selector pattern is
> stepped (and only applies to the stepped part of the pattern, not the
> leading element).  The restriction is also local to this code; it
> doesn't make other VEC_PERM_EXPRs invalid.
Hi Richard,
Thanks for the clarifications.
Just to check my understanding of your approach with an example:
Let the selected input vector be:
arg0: {a0, b0, c0, d0,
          a0 + S, b0 + S, c0 + S, d0 + S,
          a0 + 2S, b0 + 2S, c0 + 2S, d0 + 2S, ...}
where arg0 has npatterns = 4, and nelts_per_pattern = 3.

Let sel = {0, 0, 1, 2, 2, 4, ...}
where sel_npatterns = 2 and sel_nelts_per_pattern = 3

So, the first pattern in sel:
p1: {0, 1, 2, ...} would select {a0, b0, c0, ...},
which would be incorrect, since these elements belong to different patterns in arg0.
So to select elements from the same pattern in arg0, we need to divide p1
into at least N1 = P_arg0 / S0 = 4 / 1 = 4 distinct patterns.

Similarly, for the second pattern in sel:
p2: {0, 2, 4, ...}, we need to divide it into
at least N2 = P_arg0 / S1 = 4 / 2 = 2 distinct patterns.

Select N = max(N1, N2) = 4
So, the selector will be extended to N * Ps * Es = 4 * 2 * 3 == 24 elements,
and will be re-encoded with N*Ps = 8 patterns:

re-encoded sel:
{a0, b0, c0, d0, a0 + S, b0 + S, c0 + S, d0 + S,
a0 + 2S, b0 + 2S, c0 + 2S, d0 + 2S, a0 + 3S, b0 + 3S, c0 + 3S, d0 + 3S,
a0 + 4S, b0 + 4S, c0 + 4S, d0 + 4S, a0 + 5S, b0 + 5S, c0 + 5S, d0 + 5S,
...}

with 8 patterns,
p1: {a0, a0 + 2S, a0 + 4S, ...}
p2: {b0, b0 + 2S, b0 + 4S, ...}
...
which select elements from same pattern from same input vector.
Does this look correct ?
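
As a cross-check of the example, the sketch below expands the selector's index values themselves (the listing above is written in terms of arg0 elements) and re-encodes them with N times as many patterns. The helper names are hypothetical. encoded_elt assumes the usual stepped extension of an encoding: after the encoded elements, each pattern continues with the step between its last two encoded elements.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Element I of the sequence described by ENC (NPATTERNS interleaved
// patterns, NELTS_PER_PATTERN encoded elements each).
static int64_t
encoded_elt (const std::vector<int64_t> &enc, unsigned npatterns,
             unsigned nelts_per_pattern, uint64_t i)
{
  uint64_t encoded_nelts = uint64_t (npatterns) * nelts_per_pattern;
  assert (npatterns > 0 && enc.size () == encoded_nelts);
  if (i < encoded_nelts)
    return enc[i];
  uint64_t pattern = i % npatterns;
  uint64_t count = i / npatterns;
  int64_t final_elt = enc[encoded_nelts - npatterns + pattern];
  if (nelts_per_pattern < 3)
    return final_elt;
  int64_t prev_elt = enc[encoded_nelts - 2 * npatterns + pattern];
  return final_elt + int64_t (count - 2) * (final_elt - prev_elt);
}

// Re-encode with N * NPATTERNS patterns, keeping NELTS_PER_PATTERN.
static std::vector<int64_t>
reencode (const std::vector<int64_t> &enc, unsigned npatterns,
          unsigned nelts_per_pattern, unsigned n)
{
  std::vector<int64_t> out;
  uint64_t new_nelts = uint64_t (n) * npatterns * nelts_per_pattern;
  for (uint64_t i = 0; i < new_nelts; ++i)
    out.push_back (encoded_elt (enc, npatterns, nelts_per_pattern, i));
  return out;
}

// True if the stepped part (encoded elements 1 and 2) of every pattern
// in ENC indexes a single pattern of an input with INPUT_NPATTERNS
// patterns.  Assumes non-negative indices into one input.
static bool
stepped_part_stays_in_one_pattern (const std::vector<int64_t> &enc,
                                   unsigned npatterns,
                                   unsigned nelts_per_pattern,
                                   unsigned input_npatterns)
{
  if (nelts_per_pattern < 3)
    return true;
  for (unsigned j = 0; j < npatterns; ++j)
    {
      int64_t e1 = enc[npatterns + j];
      int64_t e2 = enc[2 * npatterns + j];
      assert (e1 >= 0 && e2 >= 0);
      if (e1 % input_npatterns != e2 % input_npatterns)
        return false;
    }
  return true;
}
```

For sel = {0, 0, 1, 2, 2, 4} with N = 4, the first of the 8 new selector patterns is {0, 4, 8}, which selects a0, a0 + S, a0 + 2S. With N = 2 the first new pattern would be {0, 2, 4}, whose stepped part alternates between two patterns of arg0, so the check fails.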

For feasibility, we can check initially that sel_npatterns, arg0_npatterns,
and arg1_npatterns are powers of 2, and that for each stepped pattern
its step size S is a power of 2. I suppose this will be sufficient
to ensure that sel can be re-encoded with N*Ps patterns
such that each new pattern selects elements from the same pattern
of the input vector?

Then compute N:
N = 1;
for (every pattern p in sel)
  {
     op = corresponding input vector for pattern p;
     S = step_size (p);
     N_pattern = max (S, npatterns (op)) / S;
     N = max (N, N_pattern);
  }

and re-encode selector with N*Ps patterns.
I guess rest of the patch will mostly stay the same.
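
The loop above, together with the power-of-2 feasibility checks, could look roughly like the following stand-alone sketch. It is not code from the patch: sel_pattern, pow2_p and selector_unroll_factor are hypothetical names, negative steps are handled by absolute value, and non-stepped patterns (S == 0) are assumed to impose no constraint.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Step of one selector pattern and the number of patterns in the input
// vector it selects from.  Hypothetical type, for illustration only.
struct sel_pattern
{
  int64_t step;
  uint64_t input_npatterns;
};

static bool
pow2_p (uint64_t x)
{
  return x != 0 && (x & (x - 1)) == 0;
}

// Return N, or nullopt if the selector falls outside the power-of-2
// subset, in which case the caller would give up on folding.
static std::optional<uint64_t>
selector_unroll_factor (const std::vector<sel_pattern> &sel)
{
  if (!pow2_p (sel.size ()))
    return std::nullopt;
  uint64_t n = 1;
  for (const sel_pattern &p : sel)
    {
      if (!pow2_p (p.input_npatterns))
        return std::nullopt;
      if (p.step == 0)
        continue;  // Not stepped: no constraint (assumption).
      uint64_t s = p.step < 0 ? -uint64_t (p.step) : uint64_t (p.step);
      if (!pow2_p (s))
        return std::nullopt;
      n = std::max (n, std::max (s, p.input_npatterns) / s);
    }
  return n;
}
```

For the example in this message (steps 1 and 2, both selecting from an input with 4 patterns) this gives N = 4, so the selector would be re-encoded with N * Ps = 8 patterns.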

Thanks,
Prathamesh


Prathamesh Kulkarni Nov. 21, 2022, 9:07 a.m. UTC | #22

On Fri, 4 Nov 2022 at 14:00, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Mon, 31 Oct 2022 at 15:27, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
> >
> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > > On Wed, 26 Oct 2022 at 21:07, Richard Sandiford
> > > <richard.sandiford@arm.com> wrote:
> > >>
> > >> Sorry for the slow response.  I wanted to find some time to think
> > >> about this a bit more.
> > >>
> > >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > >> > On Fri, 30 Sept 2022 at 21:38, Richard Sandiford
> > >> > <richard.sandiford@arm.com> wrote:
> > >> >>
> > >> >> Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > >> >> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > >> >> >> Sorry to ask a silly question but in which case shall we select 2nd vector ?
> > >> >> >> For num_poly_int_coeffs == 2,
> > >> >> >> a1 /trunc n1 == (a1 + 0x) / (n1.coeffs[0] + n1.coeffs[1]*x)
> > >> >> >> If a1/trunc n1 succeeds,
> > >> >> >> 0 / n1.coeffs[1] == a1/n1.coeffs[0] == 0.
> > >> >> >> So, a1 has to be < n1.coeffs[0] ?
> > >> >> >
> > >> >> > Remember that a1 is itself a poly_int.  It's not necessarily a constant.
> > >> >> >
> > >> >> > E.g. the TRN1 .D instruction maps to a VEC_PERM_EXPR with the selector:
> > >> >> >
> > >> >> >   { 0, 2 + 2x, 1, 4 + 2x, 2, 6 + 2x, ... }
> > >> >>
> > >> >> Sorry, should have been:
> > >> >>
> > >> >>   { 0, 2 + 2x, 2, 4 + 2x, 4, 6 + 2x, ... }
> > >> > Hi Richard,
> > >> > Thanks for the clarifications, and sorry for late reply.
> > >> > I have attached POC patch that tries to implement the above approach.
> > >> > Passes bootstrap+test on x86_64-linux-gnu and aarch64-linux-gnu for VLS vectors.
> > >> >
> > >> > For VLA vectors, I have only done limited testing so far.
> > >> > It seems to pass couple of tests written in the patch for
> > >> > nelts_per_pattern == 3,
> > >> > and folds the following svld1rq test:
> > >> > int32x4_t v = {1, 2, 3, 4};
> > >> > return svld1rq_s32 (svptrue_b8 (), &v[0])
> > >> > into:
> > >> > return {1, 2, 3, 4, ...};
> > >> > I will try to bootstrap+test it on SVE machine to test further for VLA folding.
> > >> >
> > >> > I have a couple of questions:
> > >> > 1] When mask selects elements from same vector but from different patterns:
> > >> > For eg:
> > >> > arg0 = {1, 11, 2, 12, 3, 13, ...},
> > >> > arg1 = {21, 31, 22, 32, 23, 33, ...},
> > >> > mask = {0, 0, 0, 1, 0, 2, ... },
> > >> > All have npatterns = 2, nelts_per_pattern = 3.
> > >> >
> > >> > With above mask,
> > >> > Pattern {0, ...} selects arg0[0], ie {1, ...}
> > >> > Pattern {0, 1, 2, ...} selects arg0[0], arg0[1], arg0[2], ie {1, 11, 2, ...}
> > >> > While arg0[0] and arg0[2] belong to same pattern, arg0[1] belongs to different
> > >> > pattern in arg0.
> > >> > The result is:
> > >> > res = {1, 1, 1, 11, 1, 2, ...}
> > >> > In this case, res's 2nd pattern {1, 11, 2, ...} is encoded with:
> > >> > with a0 = 1, a1 = 11, S = -9.
> > >> > Is that expected tho ? It seems to create a new encoding which
> > >> > wasn't present in the input vector. For instance, the next elem in
> > >> > sequence would be -7,
> > >> > which is not present originally in arg0.
> > >>
> > >> Yeah, you're right, sorry.  Going back to:
> > >>
> > >> (2) The explicit encoding can be used to produce a sequence of N*Ex*Px
> > >>     elements for any integer N.  This extended sequence can be reencoded
> > >>     as having N*Px patterns, with Ex staying the same.
> > >>
> > >> I guess we need to pick an N for the selector such that each new
> > >> selector pattern (each one out of the N*Px patterns) selects from
> > >> the *same pattern* of the same data input.
> > >>
> > >> So if a particular pattern in the selector has a step S, and the data
> > >> input it selects from has Pi patterns, N*S must be a multiple of Pi.
> > >> N must be a multiple of least_common_multiple(S,Pi)/S.
> > >>
> > >> I think that means that the total number of patterns in the result
> > >> (Pr from previous messages) can safely be:
> > >>
> > >>   Ps * least_common_multiple(
> > >>     least_common_multiple(S[1], P[input(1)]) / S[1],
> > >>     ...
> > >>     least_common_multiple(S[Ps], P[input(Ps)]) / S[Ps]
> > >>   )
> > >>
> > >> where:
> > >>
> > >>   Ps = the number of patterns in the selector
> > >>   S[I] = the step for selector pattern I (I being 1-based)
> > >>   input(I) = the data input selected by selector pattern I (I being 1-based)
> > >>   P[I] = the number of patterns in data input I
> > >>
> > >> That's getting quite complicated :-)  If we allow arbitrary P[...]
> > >> and S[...] then it could also get large.  Perhaps we should finally
> > >> give up on the general case and limit this to power-of-2 patterns and
> > >> power-of-2 steps, so that least_common_multiple becomes MAX.  Maybe that
> > >> simplifies other things as well.
> > >>
> > >> What do you think?
> > > Hi Richard,
> > > Thanks for the suggestions. Yeah I suppose we can initially add support for
> > > power-of-2 patterns and power-of-2 steps and try to generalize it in
> > > follow up patches if possible.
> > >
> > > Sorry if this sounds like a silly ques -- if we are going to have
> > > pattern in selector, select *same pattern from same input vector*,
> > > instead of re-encoding the selector to have N * Ps patterns, would it
> > > make sense for elements in selector to denote pattern number itself
> > > instead of element index
> > > if input vectors are VLA ?
> > >
> > > For eg:
> > > op0 = {1, 2, 3, 4, 1, 2, 3, 5, 1, 2, 3, 6, ...}
> > > op1 = {...}
> > > with npatterns == 4, nelts_per_pattern == 3,
> > > sel = {0, 3} should pick pattern 0 and pattern 3 from op0,
> > > so, res = {1, 4, 1, 5, 1, 6, ...}
> > > Not sure if this is correct tho.
> >
> > This wouldn't allow us to represent things like a "duplicate one
> > element", or "copy the leading N elements from the first input and
> > the other elements from elements N+ of the second input", which we
> > can with the current scheme.
> >
> > The restriction about each (unwound) selector pattern selecting from the
> > same input pattern only applies to the case where the selector pattern is
> > stepped (and only applies to the stepped part of the pattern, not the
> > leading element).  The restriction is also local to this code; it
> > doesn't make other VEC_PERM_EXPRs invalid.
> Hi Richard,
> Thanks for the clarifications.
> Just to clarify your approach with an eg:
> Let selected input vector be:
> arg0: {a0, b0, c0, d0,
>           a0 + S, b0 + S, c0 + S, d0 + S,
>           a0 + 2S, b0 + 2S, c0 + 2S, d0 + 2S, ...}
> where arg0 has npatterns = 4, and nelts_per_pattern = 3.
>
> Let sel = {0, 0, 1, 2, 2, 4, ...}
> where sel_npatterns = 2 and sel_nelts_per_pattern = 3
>
> So, the first pattern in sel:
> p1: {0, 1, 2, ...} which will select {a0, b0, c0, ...}
> which would be incorrect, since they belong to different patterns in arg0.
> So to select elements from same pattern in arg0, we need to divide p1
> into at least N1 = P_arg0 / S0 = 4 distinct patterns.
>
> Similarly for second pattern in sel:
> p2: {0, 2, 4, ...}, we need to divide it into
> at least N2 = P_arg0 / S1 = 2 distinct patterns.
>
> Select N = max(N1, N2) = 4
> So, the selector will be extended to N * Ps * Es = 4 * 2 * 3 == 24 elements,
> and will be re-encoded with N*Ps = 8 patterns:
>
> re-encoded sel:
> {a0, b0, c0, d0, a0 + S, b0 + S, c0 + S, d0 + S,
> a0 + 2S, b0 + 2S, c0 + 2S, d0 + 2S, a0 + 3S, b0 + 3S, c0 + 3S, d0 + 3S,
> a0 + 4S, b0 + 4S, c0 + 4S, d0 + 4S, a0 + 5S, b0 + 5S, c0 + 5S, d0 + 5S,
> ...}
>
> with 8 patterns,
> p1: {a0, a0 + 2S, a0 + 4S, ...}
> p2: {b0, b0 + 2S, b0 + 4S, ...}
> ...
> which select elements from same pattern from same input vector.
> Does this look correct ?
>
> For feasibility, we can check initially that sel_npatterns, arg0_npatterns,
> arg1_npatterns are powers of 2 and for each stepped pattern,
> its step size S is a power of 2. I suppose this will be sufficient
> to ensure that sel can be re-encoded with N*Ps npatterns
> such that each new pattern selects elements from same pattern
> of the input vector ?
>
> Then compute N:
> N = 1;
> for (every pattern p in sel)
>   {
>      op = corresponding input vector for pattern;
>      S = step_size (p);
>      N_pattern = max (S, npatterns (op)) / S;
>      N = max(N, N_pattern)
>   }
>
> and re-encode selector with N*Ps patterns.
> I guess rest of the patch will mostly stay the same.
Hi,
I have attached a POC patch based on the above approach.
For the above eg:
arg0 = {1, 11, 2, 12, 3, 13, ...} // npatterns = 2, nelts_per_pattern = 3,
and
sel = {0, 0, 0, 1, 0, 2, ...}
with sel_npatterns == 2 and sel_nelts_per_pattern == 3.

The pattern {0, 1, 2, ...} selects elements from different patterns
of arg0, which is incorrect.
So we choose N = P1/S = 2/1 = 2, where P1 is the number of patterns in arg0.
So re-encoded sel = { 0, 0, 0, 1, 0, 2, 0, 3, 0, 4, 0, 5, ...}
with following patterns:
p1 = { 0, ... }
p2 = { 0, 2, 4, ... }
p3 = { 0, ... }
p4 = { 1, 3, 5, ... }
which should be correct, since each of these patterns in sel chooses
elements from a single pattern of arg0.
So, res = { 1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13, ... }
Does this look correct ?
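
To make the example above easier to check by hand, here is a small
standalone sketch of the re-encoding step. It is not the patch: Encoded,
elt, step, multiplier and fold are hypothetical names for a toy model that
uses plain integers instead of trees and poly_ints, assumes every selector
index refers to arg0, and assumes power-of-2 steps and pattern counts so
that max (S, P) / S can stand in for least_common_multiple (S, P) / S.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy model of a stepped vector encoding: ELEMS holds the leading
// npatterns * nelts_per_pattern elements, interleaved by pattern.
struct Encoded {
  long long npatterns;
  long long nelts_per_pattern;
  std::vector<long long> elems;
};

// Element I of the conceptually unbounded vector described by V.
long long elt(const Encoded &v, long long i) {
  assert(i >= 0);
  long long encoded_nelts = v.npatterns * v.nelts_per_pattern;
  if (i < encoded_nelts)
    return v.elems[i];
  long long pattern = i % v.npatterns;
  long long count = i / v.npatterns;  // position of I within its pattern
  long long last = v.elems[encoded_nelts - v.npatterns + pattern];
  if (v.nelts_per_pattern < 3)
    return last;                      // no step: repeat the final element
  long long prev = v.elems[encoded_nelts - 2 * v.npatterns + pattern];
  return last + (count - (v.nelts_per_pattern - 1)) * (last - prev);
}

// Step of selector pattern P, or 0 if the selector is not stepped.
long long step(const Encoded &sel, long long p) {
  if (sel.nelts_per_pattern < 3)
    return 0;
  return sel.elems[p + 2 * sel.npatterns] - sel.elems[p + sel.npatterns];
}

// N from the pseudo-code in the thread: the factor by which the number of
// selector patterns is multiplied.  Only positive steps are handled here.
long long multiplier(const Encoded &sel, long long arg_npatterns) {
  long long n = 1;
  for (long long p = 0; p < sel.npatterns; ++p) {
    long long s = step(sel, p);
    if (s > 0)
      n = std::max(n, std::max(s, arg_npatterns) / s);
  }
  return n;
}

// Fold a permutation whose selector only refers to ARG0.
Encoded fold(const Encoded &arg0, const Encoded &sel) {
  long long n = multiplier(sel, arg0.npatterns);
  Encoded res{std::max(sel.npatterns * n, arg0.npatterns),
              std::max(sel.nelts_per_pattern, arg0.nelts_per_pattern),
              {}};
  for (long long i = 0; i < res.npatterns * res.nelts_per_pattern; ++i)
    res.elems.push_back(elt(arg0, elt(sel, i)));
  return res;
}
```

With arg0 = {2, 3, {1, 11, 2, 12, 3, 13}} and sel = {2, 3, {0, 0, 0, 1, 0, 2}},
multiplier gives N = 2 and fold gives a 4-pattern encoding whose leading
elements are the res shown above. Decoding the unextended encoding
{1, 1, 1, 11, 1, 2} one element further yields -7 for the second pattern,
which is the problem raised earlier in the thread.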

Thanks,
Prathamesh

>
> Thanks,
> Prathamesh
>
> >
> > Thanks,
> > Richard
> >
> > >
> > > Thanks,
> > > Prathamesh
> > >>
> > >> > I suppose it's fine since if the user defines mask to have pattern {0,
> > >> > 1, 2, ...}
> > >> > they intended result to have pattern with above encoding.
> > >> > Just wanted to confirm if this is correct ?
> > >> >
> > >> > 2] Could you please suggest a test-case for S < 0 ?
> > >> > I am not able to come up with one :/
> > >>
> > >> svrev is one way of creating negative steps.
> > >>
> > >> Thanks,
> > >> Richard
> > >>
> > >> >
> > >> > Thanks,
> > >> > Prathamesh
> > >> >>
> > >> >> > which is an interleaving of the two patterns:
> > >> >> >
> > >> >> >   { 0, 2, 4, ... }                  a0 = 0, a1 = 2, S = 2
> > >> >> >   { 2 + 2x, 4 + 2x, 6 + 2x }        a0 = 2 + 2x, a1 = 4 + 2x, S = 2
diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index 9f7beae14e5..2f45979d4ac 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -85,6 +85,9 @@ along with GCC; see the file COPYING3.  If not see
 #include "vec-perm-indices.h"
 #include "asan.h"
 #include "gimple-range.h"
+#include <algorithm>
+#include "tree-pretty-print.h"
+#include "print-tree.h"
 
 /* Nonzero if we are folding constants inside an initializer or a C++
    manifestly-constant-evaluated context; zero otherwise.
@@ -10494,38 +10497,55 @@ fold_mult_zconjz (location_t loc, tree type, tree expr)
 			  build_zero_cst (itype));
 }
 
+/* If PATTERN in SEL selects only from ARG0 or only from ARG1, return that arg
+   and store the pattern's step in S; otherwise return NULL_TREE.  */
 
-/* Helper function for fold_vec_perm.  Store elements of VECTOR_CST or
-   CONSTRUCTOR ARG into array ELTS, which has NELTS elements, and return
-   true if successful.  */
-
-static bool
-vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
+static tree
+get_vector_for_pattern (tree arg0, tree arg1,
+			const vec_perm_indices &sel, unsigned pattern,
+			unsigned sel_npatterns, int &S)
 {
-  unsigned HOST_WIDE_INT i, nunits;
+  unsigned sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
 
-  if (TREE_CODE (arg) == VECTOR_CST
-      && VECTOR_CST_NELTS (arg).is_constant (&nunits))
+  poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+  poly_uint64 nsel = sel.length ();
+  poly_uint64 esel;
+
+  if (!multiple_p (nsel, sel_npatterns, &esel))
+    return NULL_TREE;
+
+  poly_uint64 a1 = sel[pattern + sel_npatterns];
+  S = 0;
+  if (sel_nelts_per_pattern == 3)
     {
-      for (i = 0; i < nunits; ++i)
-	elts[i] = VECTOR_CST_ELT (arg, i);
+      poly_uint64 a2 = sel[pattern + 2 * sel_npatterns];
+      S = (a2 - a1).to_constant ();
+      if (S != 0 && !pow2p_hwi (S))
+	return NULL_TREE;
     }
-  else if (TREE_CODE (arg) == CONSTRUCTOR)
+
+  poly_uint64 ae = a1 + (esel - 2) * S;
+  uint64_t q1, qe;
+  poly_uint64 r1, re;
+
+  if (!can_div_trunc_p (a1, n1, &q1, &r1)
+      || !can_div_trunc_p (ae, n1, &qe, &re)
+      || (q1 != qe))
+    return NULL_TREE;
+
+  tree arg = ((q1 & 1) == 0) ? arg0 : arg1;
+
+  if (S < 0)
     {
-      constructor_elt *elt;
+      poly_uint64 a0 = sel[pattern];
+      if (!known_eq (S, a1 - a0))
+        return NULL_TREE;
 
-      FOR_EACH_VEC_SAFE_ELT (CONSTRUCTOR_ELTS (arg), i, elt)
-	if (i >= nelts || TREE_CODE (TREE_TYPE (elt->value)) == VECTOR_TYPE)
-	  return false;
-	else
-	  elts[i] = elt->value;
+      if (!known_gt (re, VECTOR_CST_NPATTERNS (arg)))
+        return NULL_TREE;
     }
-  else
-    return false;
-  for (; i < nelts; i++)
-    elts[i]
-      = fold_convert (TREE_TYPE (TREE_TYPE (arg)), integer_zero_node);
-  return true;
+  
+  return arg;
 }
 
 /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL
@@ -10539,41 +10559,135 @@ fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
   unsigned HOST_WIDE_INT nelts;
   bool need_ctor = false;
 
-  if (!sel.length ().is_constant (&nelts))
-    return NULL_TREE;
-  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts)
-	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts)
-	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts));
+  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), sel.length ())
+	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)),
+			   TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1))));
   if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type)
       || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type))
     return NULL_TREE;
 
-  tree *in_elts = XALLOCAVEC (tree, nelts * 2);
-  if (!vec_cst_ctor_to_array (arg0, nelts, in_elts)
-      || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts))
+  unsigned res_npatterns = 0;
+  unsigned res_nelts_per_pattern = 0;
+  unsigned sel_npatterns = 0;
+  tree *vector_for_pattern = NULL;
+
+  if (TREE_CODE (arg0) == VECTOR_CST
+      && TREE_CODE (arg1) == VECTOR_CST
+      && !sel.length ().is_constant ())
+    {
+      unsigned arg0_npatterns = VECTOR_CST_NPATTERNS (arg0);
+      unsigned arg1_npatterns = VECTOR_CST_NPATTERNS (arg1);
+      sel_npatterns = sel.encoding ().npatterns ();
+
+      if (!pow2p_hwi (arg0_npatterns)
+	  || !pow2p_hwi (arg1_npatterns)
+	  || !pow2p_hwi (sel_npatterns))
+        return NULL_TREE;
+
+      unsigned N = 1;
+      vector_for_pattern = XALLOCAVEC (tree, sel_npatterns);
+      for (unsigned i = 0; i < sel_npatterns; i++)
+	{
+	  int S = 0;
+	  tree op = get_vector_for_pattern (arg0, arg1, sel, i, sel_npatterns, S);
+	  if (!op)
+	    return NULL_TREE;
+	  vector_for_pattern[i] = op;
+	  unsigned N_pattern =
+	    (S > 0) ? std::max<int>(S, VECTOR_CST_NPATTERNS (op)) / S : 1;
+	  N = std::max (N, N_pattern);
+	}
+      
+      res_npatterns
+        = std::max (sel_npatterns * N, std::max (arg0_npatterns, arg1_npatterns));
+
+      res_nelts_per_pattern
+	= std::max(sel.encoding ().nelts_per_pattern (),
+		   std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
+			     VECTOR_CST_NELTS_PER_PATTERN (arg1)));
+    }
+  else if (sel.length ().is_constant (&nelts)
+	   && TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
+	   && TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).to_constant () == nelts)
+    {
+      /* For VLS vectors, treat every vector as having
+	 npatterns = nelts, nelts_per_pattern = 1.  */
+      res_npatterns = sel_npatterns = nelts;
+      res_nelts_per_pattern = 1;
+      vector_for_pattern = XALLOCAVEC (tree, nelts);
+      for (unsigned i = 0; i < nelts; i++)
+        {
+	  HOST_WIDE_INT index;
+	  if (!sel[i].is_constant (&index))
+	    return NULL_TREE;
+	  vector_for_pattern[i] = (index < nelts) ? arg0 : arg1;	
+	}
+    }
+  else
     return NULL_TREE;
 
-  tree_vector_builder out_elts (type, nelts, 1);
-  for (i = 0; i < nelts; i++)
+  tree_vector_builder out_elts (type, res_npatterns,
+				res_nelts_per_pattern);
+  unsigned res_nelts = res_npatterns * res_nelts_per_pattern;
+  for (unsigned i = 0; i < res_nelts; i++)
     {
-      HOST_WIDE_INT index;
-      if (!sel[i].is_constant (&index))
-	return NULL_TREE;
-      if (!CONSTANT_CLASS_P (in_elts[index]))
-	need_ctor = true;
-      out_elts.quick_push (unshare_expr (in_elts[index]));
+      /* For VLA vectors, i % sel_npatterns would give the original
+         pattern the element belongs to, which is sufficient to get the arg.
+	 Even if sel_npatterns has been multiplied by N,
+	 they will always come from the same input vector.
+	 For VLS vectors, sel_npatterns == res_nelts == nelts,
+	 so i % sel_npatterns == i since i < nelts */
+       
+      tree arg = vector_for_pattern[i % sel_npatterns];
+      unsigned HOST_WIDE_INT index;
+
+      if (arg == arg0)
+	{
+	  if (!sel[i].is_constant ())
+	    return NULL_TREE;
+	  index = sel[i].to_constant ();
+	}
+      else
+        {
+	  gcc_assert (arg == arg1);
+	  poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+	  uint64_t q;
+	  poly_uint64 r;
+
+	  /* Divide sel[i] by input vector length, to obtain remainder,
+	     which would be the index for either input vector.  */
+	  if (!can_div_trunc_p (sel[i], n1, &q, &r))
+	    return NULL_TREE;
+
+	  if (!r.is_constant (&index))
+	    return NULL_TREE;
+	}
+
+      tree elem;
+      if (TREE_CODE (arg) == CONSTRUCTOR)
+        {
+	  gcc_assert (index < nelts);
+	  if (index >= vec_safe_length (CONSTRUCTOR_ELTS (arg)))
+	    return NULL_TREE;
+	  elem = CONSTRUCTOR_ELT (arg, index)->value;
+	  if (VECTOR_TYPE_P (TREE_TYPE (elem)))
+	    return NULL_TREE;
+	  need_ctor = true;
+	}
+      else
+        elem = vector_cst_elt (arg, index);
+      out_elts.quick_push (elem);
     }
 
   if (need_ctor)
     {
       vec<constructor_elt, va_gc> *v;
-      vec_alloc (v, nelts);
-      for (i = 0; i < nelts; i++)
+      vec_alloc (v, res_nelts);
+      for (i = 0; i < res_nelts; i++)
 	CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, out_elts[i]);
       return build_constructor (type, v);
     }
-  else
-    return out_elts.build ();
+  return out_elts.build ();
 }
 
 /* Try to fold a pointer difference of type TYPE two address expressions of
@@ -16910,6 +17024,97 @@ test_vec_duplicate_folding ()
   ASSERT_TRUE (operand_equal_p (dup5_expr, dup5_cst, 0));
 }
 
+static tree
+build_vec_int_cst (unsigned npatterns, unsigned nelts_per_pattern,
+		   int *encoded_elems)
+{
+  scalar_int_mode int_mode = SCALAR_INT_TYPE_MODE (integer_type_node);
+  machine_mode vmode = targetm.vectorize.preferred_simd_mode (int_mode);
+  //machine_mode vmode = VNx4SImode;
+  poly_uint64 nunits = GET_MODE_NUNITS (vmode);
+  tree vectype = build_vector_type (integer_type_node, nunits);
+
+  tree_vector_builder builder (vectype, npatterns, nelts_per_pattern);
+  for (unsigned i = 0; i < npatterns * nelts_per_pattern; i++)
+    builder.quick_push (build_int_cst (integer_type_node, encoded_elems[i]));
+  return builder.build ();
+}
+
+static void
+test_vec_perm_vla_folding ()
+{
+  int arg0_elems[] = { 1, 11, 2, 12, 3, 13 };
+  tree arg0 = build_vec_int_cst (2, 3, arg0_elems);
+
+  int arg1_elems[] = { 21, 31, 22, 32, 23, 33 };
+  tree arg1 = build_vec_int_cst (2, 3, arg1_elems);
+
+  if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
+      || TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)).is_constant ())
+    return;
+
+  /* Case 1: For mask: {0, 1, 2, ...}, npatterns == 1, nelts_per_pattern == 3,
+     should select arg0.  */
+  {
+    int mask_elems[] = {0, 1, 2};
+    tree mask = build_vec_int_cst (1, 3, mask_elems);
+    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
+    ASSERT_TRUE (res != NULL_TREE);
+    ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == 2);
+    ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == 3);
+
+    unsigned res_nelts = vector_cst_encoded_nelts (res);
+    for (unsigned i = 0; i < res_nelts; i++)
+      ASSERT_TRUE (operand_equal_p (VECTOR_CST_ELT (res, i),
+				    VECTOR_CST_ELT (arg0, i), 0));
+  }
+
+  /* Case 2: For mask: {4, 5, 6, ...}, npatterns == 1, nelts_per_pattern == 3,
+     should return NULL because for len = 4 + 4x,
+     if x == 0, we select from arg1
+     if x > 0, we select from arg0
+     and thus cannot determine result at compile time.  */
+  {
+    int mask_elems[] = {4, 5, 6};
+    tree mask = build_vec_int_cst (1, 3, mask_elems);
+    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
+    ASSERT_TRUE (res == NULL_TREE);
+  }
+
+  /* Case 3:
+     mask: {0, 0, 0, 1, 0, 2, ...} 
+     npatterns == 2, nelts_per_pattern == 3
+     Pattern {0, ...} should select arg0[0], ie, 1.
+     Pattern {0, 1, 2, ...} should select arg0: {1, 11, 2, ...},
+     so res = {1, 1, 1, 11, 1, 2, ...}.  */
+  {
+    int mask_elems[] = {0, 0, 0, 1, 0, 2};
+    tree mask = build_vec_int_cst (2, 3, mask_elems);
+    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
+    ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == 4);
+    ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == 3);
+
+    /* Check encoding: {1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13, ...}  */
+    int res_encoded_elems[] = {1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13};
+    for (unsigned i = 0; i < vector_cst_encoded_nelts (res); i++)
+      ASSERT_TRUE (wi::to_wide(VECTOR_CST_ELT (res, i)) == res_encoded_elems[i]);
+  }
+
+  /* Case 4:
+     mask: {0, 4 + 4x, 0, 5 + 4x, 0, 6 + 4x, ...}
+     npatterns == 2, nelts_per_pattern == 3
+     Pattern {0, ...} should select arg0[0]
+     Pattern {4 + 4x, 5 + 4x, 6 + 4x, ...} should select from arg1, since:
+     a1 = 5 + 4x
+     ae = (5 + 4x) + ((4 + 4x) / 2 - 2) * 1
+        = 5 + 6x
+     As a1 / (4 + 4x) == ae / (4 + 4x) == 1, we select arg1[0], arg1[1], ...
+     res: {1, 21, 1, 31, 1, 22, ... }
+     FIXME: How to build vector with poly_int elems ?  */
+
+  /* Case 5: S < 0.  */
+}
+
 /* Run all of the selftests within this file.  */
 
 void
@@ -16918,6 +17123,7 @@ fold_const_cc_tests ()
   test_arithmetic_folding ();
   test_vector_folding ();
   test_vec_duplicate_folding ();
+  test_vec_perm_vla_folding ();
 }
 
 } // namespace selftest
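
The reasoning behind Case 2 and Case 4 in the selftest comments can be
sanity-checked with a brute-force sketch. This is only an illustration:
Poly, input_at and input_for_all are hypothetical names, Poly is a toy
stand-in for a two-coefficient poly_int, and sampling a few values of x is
not how can_div_trunc_p decides the question in the patch.

```cpp
#include <cassert>
#include <cstdint>

// Toy stand-in for a two-coefficient poly_int: value = c0 + c1 * x, where
// x >= 0 is the runtime vector-length multiplier.
struct Poly {
  int64_t c0, c1;
  int64_t at(int64_t x) const { return c0 + c1 * x; }
};

// Which input (0 = arg0, 1 = arg1) does selector index IDX refer to when
// each input has LEN elements, for one concrete x?
int input_at(Poly idx, Poly len, int64_t x) {
  return static_cast<int>(idx.at(x) / len.at(x));
}

// Return 0 or 1 if IDX refers to the same input for every sampled x in
// [0, max_x], otherwise -1.  A sampled check, not a proof.
int input_for_all(Poly idx, Poly len, int64_t max_x) {
  int first = input_at(idx, len, 0);
  for (int64_t x = 1; x <= max_x; ++x)
    if (input_at(idx, len, x) != first)
      return -1;
  return first;
}
```

For a vector length of 4 + 4x, the constant index 4 lands in arg1 when
x == 0 and in arg0 when x > 0, so input_for_all returns -1, matching the
expectation in Case 2 that the fold fails. The indices 5 + 4x (a1) and
5 + 6x (ae) from Case 4 both stay in arg1 for every sampled x.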

Prathamesh Kulkarni Nov. 28, 2022, 11:44 a.m. UTC | #23

On Mon, 21 Nov 2022 at 14:37, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
> [...]
Hi Richard,
ping https://gcc.gnu.org/pipermail/gcc-patches/2022-November/606850.html

Thanks,
Prathamesh

Richard Sandiford Dec. 6, 2022, 3:30 p.m. UTC | #24

Prathamesh Kulkarni via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> On Fri, 4 Nov 2022 at 14:00, Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
>>
>> On Mon, 31 Oct 2022 at 15:27, Richard Sandiford
>> <richard.sandiford@arm.com> wrote:
>> >
>> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> > > On Wed, 26 Oct 2022 at 21:07, Richard Sandiford
>> > > <richard.sandiford@arm.com> wrote:
>> > >>
>> > >> Sorry for the slow response.  I wanted to find some time to think
>> > >> about this a bit more.
>> > >>
>> > >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> > >> > On Fri, 30 Sept 2022 at 21:38, Richard Sandiford
>> > >> > <richard.sandiford@arm.com> wrote:
>> > >> >>
>> > >> >> Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> > >> >> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> > >> >> >> Sorry to ask a silly question but in which case shall we select 2nd vector ?
>> > >> >> >> For num_poly_int_coeffs == 2,
>> > >> >> >> a1 /trunc n1 == (a1 + 0x) / (n1.coeffs[0] + n1.coeffs[1]*x)
>> > >> >> >> If a1/trunc n1 succeeds,
>> > >> >> >> 0 / n1.coeffs[1] == a1/n1.coeffs[0] == 0.
>> > >> >> >> So, a1 has to be < n1.coeffs[0] ?
>> > >> >> >
>> > >> >> > Remember that a1 is itself a poly_int.  It's not necessarily a constant.
>> > >> >> >
>> > >> >> > E.g. the TRN1 .D instruction maps to a VEC_PERM_EXPR with the selector:
>> > >> >> >
>> > >> >> >   { 0, 2 + 2x, 1, 4 + 2x, 2, 6 + 2x, ... }
>> > >> >>
>> > >> >> Sorry, should have been:
>> > >> >>
>> > >> >>   { 0, 2 + 2x, 2, 4 + 2x, 4, 6 + 2x, ... }
>> > >> > Hi Richard,
>> > >> > Thanks for the clarifications, and sorry for late reply.
>> > >> > I have attached POC patch that tries to implement the above approach.
>> > >> > Passes bootstrap+test on x86_64-linux-gnu and aarch64-linux-gnu for VLS vectors.
>> > >> >
>> > >> > For VLA vectors, I have only done limited testing so far.
>> > >> > It seems to pass a couple of tests written in the patch for
>> > >> > nelts_per_pattern == 3,
>> > >> > and folds the following svld1rq test:
>> > >> > int32x4_t v = {1, 2, 3, 4};
>> > >> > return svld1rq_s32 (svptrue_b8 (), &v[0]);
>> > >> > into:
>> > >> > return {1, 2, 3, 4, ...};
>> > >> > I will try to bootstrap+test it on SVE machine to test further for VLA folding.
>> > >> >
>> > >> > I have a couple of questions:
>> > >> > 1] When mask selects elements from same vector but from different patterns:
>> > >> > For eg:
>> > >> > arg0 = {1, 11, 2, 12, 3, 13, ...},
>> > >> > arg1 = {21, 31, 22, 32, 23, 33, ...},
>> > >> > mask = {0, 0, 0, 1, 0, 2, ... },
>> > >> > All have npatterns = 2, nelts_per_pattern = 3.
>> > >> >
>> > >> > With above mask,
>> > >> > Pattern {0, ...} selects arg0[0], ie {1, ...}
>> > >> > Pattern {0, 1, 2, ...} selects arg0[0], arg0[1], arg0[2], ie {1, 11, 2, ...}
>> > >> > While arg0[0] and arg0[2] belong to same pattern, arg0[1] belongs to different
>> > >> > pattern in arg0.
>> > >> > The result is:
>> > >> > res = {1, 1, 1, 11, 1, 2, ...}
>> > >> > In this case, res's 2nd pattern {1, 11, 2, ...} is encoded
>> > >> > with a0 = 1, a1 = 11, S = -9.
>> > >> > Is that expected tho ? It seems to create a new encoding which
>> > >> > wasn't present in the input vector. For instance, the next elem in
>> > >> > sequence would be -7,
>> > >> > which is not present originally in arg0.
>> > >>
>> > >> Yeah, you're right, sorry.  Going back to:
>> > >>
>> > >> (2) The explicit encoding can be used to produce a sequence of N*Ex*Px
>> > >>     elements for any integer N.  This extended sequence can be reencoded
>> > >>     as having N*Px patterns, with Ex staying the same.
>> > >>
>> > >> I guess we need to pick an N for the selector such that each new
>> > >> selector pattern (each one out of the N*Px patterns) selects from
>> > >> the *same pattern* of the same data input.
>> > >>
>> > >> So if a particular pattern in the selector has a step S, and the data
>> > >> input it selects from has Pi patterns, N*S must be a multiple of Pi.
>> > >> N must be a multiple of least_common_multiple(S,Pi)/S.
>> > >>
>> > >> I think that means that the total number of patterns in the result
>> > >> (Pr from previous messages) can safely be:
>> > >>
>> > >>   Ps * least_common_multiple(
>> > >>     least_common_multiple(S[1], P[input(1)]) / S[1],
>> > >>     ...
>> > >>     least_common_multiple(S[Ps], P[input(Ps)]) / S[Ps]
>> > >>   )
>> > >>
>> > >> where:
>> > >>
>> > >>   Ps = the number of patterns in the selector
>> > >>   S[I] = the step for selector pattern I (I being 1-based)
>> > >>   input(I) = the data input selected by selector pattern I (I being 1-based)
>> > >>   P[I] = the number of patterns in data input I
>> > >>
>> > >> That's getting quite complicated :-)  If we allow arbitrary P[...]
>> > >> and S[...] then it could also get large.  Perhaps we should finally
>> > >> give up on the general case and limit this to power-of-2 patterns and
>> > >> power-of-2 steps, so that least_common_multiple becomes MAX.  Maybe that
>> > >> simplifies other things as well.
>> > >>
>> > >> What do you think?
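
The pattern-count formula above can be sketched in standalone C++. This is only an illustration, assuming every selector pattern is stepped (S > 0); `stepped_pattern`, `unwind_factor` and `result_npatterns` are hypothetical names and are not part of the patch or of GCC.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helpers: plain Euclid gcd and the lcm derived from it.
uint64_t gcd_u64(uint64_t a, uint64_t b) {
  while (b != 0) {
    uint64_t t = a % b;
    a = b;
    b = t;
  }
  return a;
}

uint64_t lcm_u64(uint64_t a, uint64_t b) { return a / gcd_u64(a, b) * b; }

// One stepped selector pattern: its step S[I] and the number of patterns
// P[input(I)] in the data input it selects from.  Assumes step > 0.
struct stepped_pattern {
  uint64_t step;
  uint64_t input_npatterns;
};

// Smallest N such that N * S is a multiple of Pi, i.e. lcm(S, Pi) / S.
uint64_t unwind_factor(const stepped_pattern &p) {
  return lcm_u64(p.step, p.input_npatterns) / p.step;
}

// Ps * lcm over all selector patterns of (lcm(S[I], P[input(I)]) / S[I]).
uint64_t result_npatterns(uint64_t sel_npatterns,
                          const std::vector<stepped_pattern> &patterns) {
  uint64_t n = 1;
  for (const stepped_pattern &p : patterns)
    n = lcm_u64(n, unwind_factor(p));
  return sel_npatterns * n;
}
```

With steps 1 and 2 selecting from an input with 4 patterns, the per-pattern factors are 4 and 2, so a 2-pattern selector unwinds to 8 patterns. With non-power-of-2 values such as (S, Pi) = (3, 4) and (2, 6), the factors are 4 and 3 and the result grows to 2 * 12 = 24 patterns, which is the growth the power-of-2 restriction avoids.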
>> > > Hi Richard,
>> > > Thanks for the suggestions. Yeah I suppose we can initially add support for
>> > > power-of-2 patterns and power-of-2 steps and try to generalize it in
>> > > follow up patches if possible.
>> > >
>> > > Sorry if this sounds like a silly ques -- if we are going to have
>> > > pattern in selector, select *same pattern from same input vector*,
>> > > instead of re-encoding the selector to have N * Ps patterns, would it
>> > > make sense for elements in selector to denote pattern number itself
>> > > instead of element index
>> > > if input vectors are VLA ?
>> > >
>> > > For eg:
>> > > op0 = {1, 2, 3, 4, 1, 2, 3, 5, 1, 2, 3, 6, ...}
>> > > op1 = {...}
>> > > with npatterns == 4, nelts_per_pattern == 3,
>> > > sel = {0, 3} should pick pattern 0 and pattern 3 from op0,
>> > > so, res = {1, 4, 1, 5, 1, 6, ...}
>> > > Not sure if this is correct tho.
>> >
>> > This wouldn't allow us to represent things like a "duplicate one
>> > element", or "copy the leading N elements from the first input and
>> > the other elements from elements N+ of the second input", which we
>> > can with the current scheme.
>> >
>> > The restriction about each (unwound) selector pattern selecting from the
>> > same input pattern only applies to case where the selector pattern is
>> > stepped (and only applies to the stepped part of the pattern, not the
>> > leading element).  The restriction is also local to this code; it
>> > doesn't make other VEC_PERM_EXPRs invalid.
>> Hi Richard,
>> Thanks for the clarifications.
>> Just to clarify your approach with an eg:
>> Let selected input vector be:
>> arg0: {a0, b0, c0, d0,
>>           a0 + S, b0 + S, c0 + S, d0 + S,
>>           a0 + 2S, b0 + 2S, c0 + 2S, d0 + 2S, ...}
>> where arg0 has npatterns = 4, and nelts_per_pattern = 3.
>>
>> Let sel = {0, 0, 1, 2, 2, 4, ...}
>> where sel_npatterns = 2 and sel_nelts_per_pattern = 3
>>
>> So, the first pattern in sel:
>> p1: {0, 1, 2, ...} which will select {a0, b0, c0, ...}
>> which would be incorrect, since they belong to different patterns in arg0.
>> So to select elements from same pattern in arg0, we need to divide p1
>> into at least N1 = P_arg0 / S0 = 4 distinct patterns.
>>
>> Similarly for second pattern in sel:
>> p2: {0, 2, 4, ...}, we need to divide it into
>> at least N2 = P_arg0 / S1 = 2 distinct patterns.
>>
>> Select N = max(N1, N2) = 4
>> So, the selector will be extended to N * Ps * Es = 4 * 2 * 3 == 24 elements,
>> and will be re-encoded with N*Ps = 8 patterns:
>>
>> re-encoded sel:
>> {a0, b0, c0, d0, a0 + S, b0 + S, c0 + S, d0 + S,
>> a0 + 2S, b0 + 2S, c0 + 2S, d0 + 2S, a0 + 3S, b0 + 3S, c0 + 3S, d0 + 3S,
>> a0 + 4S, b0 + 4S, c0 + 4S, d0 + 4S, a0 + 5S, b0 + 5S, c0 + 5S, d0 + 5S,
>> ...}
>>
>> with 8 patterns,
>> p1: {a0, a0 + 2S, a0 + 4S, ...}
>> p2: {b0, b0 + 2S, b0 + 4S, ...}
>> ...
>> which select elements from same pattern from same input vector.
>> Does this look correct ?
>>
>> For feasibility, we can check initially that sel_npatterns, arg0_npatterns,
>> arg1_npatterns are powers of 2 and for each stepped pattern,
>> its step size S is a power of 2. I suppose this will be sufficient
>> to ensure that sel can be re-encoded with N*Ps npatterns
>> such that each new pattern selects elements from same pattern
>> of the input vector ?
>>
>> Then compute N:
>> N = 1;
>> for (every pattern p in sel)
>>   {
>>      op = corresponding input vector for pattern;
>>      S = step_size (p);
>>      N_pattern = max (S, npatterns (op)) / S;
>>      N = max(N, N_pattern)
>>   }
>>
>> and re-encode selector with N*Ps patterns.
>> I guess rest of the patch will mostly stay the same.
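
A minimal standalone version of the "compute N" loop above, under the power-of-2 restriction. It is a sketch, not the patch code: `sel_pattern` and `unwind_factor_pow2` are hypothetical names, a step of 0 stands for a non-stepped pattern, and a return value of 0 is used here to signal that the feasibility check failed.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical description of one selector pattern.
struct sel_pattern {
  unsigned step;             // S; 0 means the pattern is not stepped
  unsigned input_npatterns;  // npatterns of the input vector it selects from
};

bool is_pow2(unsigned v) { return v != 0 && (v & (v - 1)) == 0; }

// Returns N such that re-encoding the selector with N * Ps patterns makes
// every stepped pattern stay within one pattern of its input.
// Returns 0 if an input pattern count or a step is not a power of 2.
unsigned unwind_factor_pow2(const std::vector<sel_pattern> &patterns) {
  unsigned n = 1;
  for (const sel_pattern &p : patterns) {
    if (!is_pow2(p.input_npatterns))
      return 0;
    if (p.step == 0)
      continue;  // the restriction only applies to stepped patterns
    if (!is_pow2(p.step))
      return 0;
    // For powers of 2, lcm (S, P) == max (S, P).
    unsigned n_pattern = std::max(p.step, p.input_npatterns) / p.step;
    n = std::max(n, n_pattern);
  }
  return n;
}
```

For the example above (steps 1 and 2, arg0 with 4 patterns) this gives N = max (4, 2) = 4, so the selector is re-encoded with 4 * 2 = 8 patterns.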
> Hi,
> I have attached a POC patch based on the above approach.
> For the above eg:
> arg0 = {1, 11, 2, 12, 3, 13, ...} // npatterns = 2, nelts_per_pattern = 3,
> and
> sel = {0, 0, 0, 1, 0, 2, ...}
> with sel_npatterns == 2 and sel_nelts_per_pattern == 3.
>
> For pattern, {0, 1, 2, ...} it will select elements from different
> patterns from arg0, which is incorrect.
> So we choose N = P1/S = 2/1 = 2, where P1 is number of patterns in arg0.
> So re-encoded sel = { 0, 0, 0, 1, 0, 2, 0, 3, 0, 4, 0, 5, ...}
> with following patterns:
> p1 = { 0, ... }
> p2 = { 0, 2, 4, ... }
> p3 = { 0, ... }
> p4 = { 1, 3, 5, ... }
> which should be correct since each element from the respective
> patterns in sel chooses
> elements from same pattern from arg0.
> So, res = { 1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13, ... }
> Does this look correct ?

Yeah.  But like I said above:

  The restriction about each (unwound) selector pattern selecting from the
  same input pattern only applies to case where the selector pattern is
  stepped (and only applies to the stepped part of the pattern, not the
  leading element).

If the selector nelts-per-pattern is 1 or 2 then we can support all
power-of-2 cases, with the final npatterns being the maximum of the
source nelts-per-patterns.

Also, going back to an earlier part of the discussion, I think we
should use this technique for both VLA and VLS, and only fall back
to the VLS-specific approach if the VLA approach fails.

So I suggest we put the VLA code in its own function and have
the VLS-only path kick in when the VLA code fails.  If the code is
having to pass a lot of state around, it might make sense to define
a local class, store the state in member variables, and use member
functions for the various subroutines.  I don't know if that will
work out neater though.

> @@ -10494,38 +10497,55 @@ fold_mult_zconjz (location_t loc, tree type, tree expr)
>  			  build_zero_cst (itype));
>  }
>  
> +/* Check if PATTERN in SEL selects either ARG0 or ARG1,
> +   and return the selected arg, otherwise return NULL_TREE.  */
>  
> -/* Helper function for fold_vec_perm.  Store elements of VECTOR_CST or
> -   CONSTRUCTOR ARG into array ELTS, which has NELTS elements, and return
> -   true if successful.  */
> -
> -static bool
> -vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
> +static tree
> +get_vector_for_pattern (tree arg0, tree arg1,
> +			const vec_perm_indices &sel, unsigned pattern,
> +			unsigned sel_npatterns, int &S)
>  {
> -  unsigned HOST_WIDE_INT i, nunits;
> +  unsigned sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
>  
> -  if (TREE_CODE (arg) == VECTOR_CST
> -      && VECTOR_CST_NELTS (arg).is_constant (&nunits))
> +  poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +  poly_uint64 nsel = sel.length ();
> +  poly_uint64 esel;
> +
> +  if (!multiple_p (nsel, sel_npatterns, &esel))
> +    return NULL_TREE;
> +
> +  poly_uint64 a1 = sel[pattern + sel_npatterns];
> +  S = 0;
> +  if (sel_nelts_per_pattern == 3)
>      {
> -      for (i = 0; i < nunits; ++i)
> -	elts[i] = VECTOR_CST_ELT (arg, i);
> +      poly_uint64 a2 = sel[pattern + 2 * sel_npatterns];
> +      S = (a2 - a1).to_constant ();

The code hasn't proven that this to_constant is safe.

> +      if (S != 0 && !pow2p_hwi (S))
> +	return NULL_TREE;
>      }
> -  else if (TREE_CODE (arg) == CONSTRUCTOR)
> +
> +  poly_uint64 ae = a1 + (esel - 2) * S;
> +  uint64_t q1, qe;
> +  poly_uint64 r1, re;
> +
> +  if (!can_div_trunc_p (a1, n1, &q1, &r1)
> +      || !can_div_trunc_p (ae, n1, &qe, &re)
> +      || (q1 != qe))
> +    return NULL_TREE;

Going back to the above: this check doesn't make sense for
sel_nelts_per_pattern != 3.

Thanks,
Richard

> +
> +  tree arg = ((q1 & 1) == 0) ? arg0 : arg1;
> +
> +  if (S < 0)
>      {
> -      constructor_elt *elt;
> +      poly_uint64 a0 = sel[pattern];
> +      if (!known_eq (S, a1 - a0))
> +        return NULL_TREE;
>  
> -      FOR_EACH_VEC_SAFE_ELT (CONSTRUCTOR_ELTS (arg), i, elt)
> -	if (i >= nelts || TREE_CODE (TREE_TYPE (elt->value)) == VECTOR_TYPE)
> -	  return false;
> -	else
> -	  elts[i] = elt->value;
> +      if (!known_gt (re, VECTOR_CST_NPATTERNS (arg)))
> +        return NULL_TREE;
>      }
> -  else
> -    return false;
> -  for (; i < nelts; i++)
> -    elts[i]
> -      = fold_convert (TREE_TYPE (TREE_TYPE (arg)), integer_zero_node);
> -  return true;
> +  
> +  return arg;
>  }
>  
>  /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL
> @@ -10539,41 +10559,135 @@ fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
>    unsigned HOST_WIDE_INT nelts;
>    bool need_ctor = false;
>  
> -  if (!sel.length ().is_constant (&nelts))
> -    return NULL_TREE;
> -  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts)
> -	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts)
> -	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts));
> +  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), sel.length ())
> +	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)),
> +			   TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1))));
>    if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type)
>        || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type))
>      return NULL_TREE;
>  
> -  tree *in_elts = XALLOCAVEC (tree, nelts * 2);
> -  if (!vec_cst_ctor_to_array (arg0, nelts, in_elts)
> -      || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts))
> +  unsigned res_npatterns = 0;
> +  unsigned res_nelts_per_pattern = 0;
> +  unsigned sel_npatterns = 0;
> +  tree *vector_for_pattern = NULL;
> +
> +  if (TREE_CODE (arg0) == VECTOR_CST
> +      && TREE_CODE (arg1) == VECTOR_CST
> +      && !sel.length ().is_constant ())
> +    {
> +      unsigned arg0_npatterns = VECTOR_CST_NPATTERNS (arg0);
> +      unsigned arg1_npatterns = VECTOR_CST_NPATTERNS (arg1);
> +      sel_npatterns = sel.encoding ().npatterns ();
> +
> +      if (!pow2p_hwi (arg0_npatterns)
> +	  || !pow2p_hwi (arg1_npatterns)
> +	  || !pow2p_hwi (sel_npatterns))
> +        return NULL_TREE;
> +
> +      unsigned N = 1;
> +      vector_for_pattern = XALLOCAVEC (tree, sel_npatterns);
> +      for (unsigned i = 0; i < sel_npatterns; i++)
> +	{
> +	  int S = 0;
> +	  tree op = get_vector_for_pattern (arg0, arg1, sel, i, sel_npatterns, S);
> +	  if (!op)
> +	    return NULL_TREE;
> +	  vector_for_pattern[i] = op;
> +	  unsigned N_pattern =
> +	    (S > 0) ? std::max<int>(S, VECTOR_CST_NPATTERNS (op)) / S : 1;
> +	  N = std::max (N, N_pattern);
> +	}
> +      
> +      res_npatterns
> +        = std::max (sel_npatterns * N, std::max (arg0_npatterns, arg1_npatterns));
> +
> +      res_nelts_per_pattern
> +	= std::max(sel.encoding ().nelts_per_pattern (),
> +		   std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
> +			     VECTOR_CST_NELTS_PER_PATTERN (arg1)));
> +    }
> +  else if (sel.length ().is_constant (&nelts)
> +	   && TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
> +	   && TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).to_constant () == nelts)
> +    {
> +      /* For VLS vectors, treat all vectors with
> +	 npatterns = nelts, nelts_per_pattern = 1. */
> +      res_npatterns = sel_npatterns = nelts;
> +      res_nelts_per_pattern = 1;
> +      vector_for_pattern = XALLOCAVEC (tree, nelts);
> +      for (unsigned i = 0; i < nelts; i++)
> +        {
> +	  HOST_WIDE_INT index;
> +	  if (!sel[i].is_constant (&index))
> +	    return NULL_TREE;
> +	  vector_for_pattern[i] = (index < nelts) ? arg0 : arg1;	
> +	}
> +    }
> +  else
>      return NULL_TREE;
>  
> -  tree_vector_builder out_elts (type, nelts, 1);
> -  for (i = 0; i < nelts; i++)
> +  tree_vector_builder out_elts (type, res_npatterns,
> +				res_nelts_per_pattern);
> +  unsigned res_nelts = res_npatterns * res_nelts_per_pattern;
> +  for (unsigned i = 0; i < res_nelts; i++)
>      {
> -      HOST_WIDE_INT index;
> -      if (!sel[i].is_constant (&index))
> -	return NULL_TREE;
> -      if (!CONSTANT_CLASS_P (in_elts[index]))
> -	need_ctor = true;
> -      out_elts.quick_push (unshare_expr (in_elts[index]));
> +      /* For VLA vectors, i % sel_npatterns would give the original
> +         pattern the element belongs to, which is sufficient to get the arg.
> +	 Even if sel_npatterns has been multiplied by N,
> +	 they will always come from the same input vector.
> +	 For VLS vectors, sel_npatterns == res_nelts == nelts,
> +	 so i % sel_npatterns == i since i < nelts */
> +       
> +      tree arg = vector_for_pattern[i % sel_npatterns];
> +      unsigned HOST_WIDE_INT index;
> +
> +      if (arg == arg0)
> +	{
> +	  if (!sel[i].is_constant ())
> +	    return NULL_TREE;
> +	  index = sel[i].to_constant ();
> +	}
> +      else
> +        {
> +	  gcc_assert (arg == arg1);
> +	  poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +	  uint64_t q;
> +	  poly_uint64 r;
> +
> +	  /* Divide sel[i] by input vector length, to obtain remainder,
> +	     which would be the index for either input vector.  */
> +	  if (!can_div_trunc_p (sel[i], n1, &q, &r))
> +	    return NULL_TREE;
> +
> +	  if (!r.is_constant (&index))
> +	    return NULL_TREE;
> +	}
> +
> +      tree elem;
> +      if (TREE_CODE (arg) == CONSTRUCTOR)
> +        {
> +	  gcc_assert (index < nelts);
> +	  if (index >= vec_safe_length (CONSTRUCTOR_ELTS (arg)))
> +	    return NULL_TREE;
> +	  elem = CONSTRUCTOR_ELT (arg, index)->value;
> +	  if (VECTOR_TYPE_P (TREE_TYPE (elem)))
> +	    return NULL_TREE;
> +	  need_ctor = true;
> +	}
> +      else
> +        elem = vector_cst_elt (arg, index);
> +      out_elts.quick_push (elem);
>      }
>  
>    if (need_ctor)
>      {
>        vec<constructor_elt, va_gc> *v;
> -      vec_alloc (v, nelts);
> -      for (i = 0; i < nelts; i++)
> +      vec_alloc (v, res_nelts);
> +      for (i = 0; i < res_nelts; i++)
>  	CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, out_elts[i]);
>        return build_constructor (type, v);
>      }
> -  else
> -    return out_elts.build ();
> +  return out_elts.build ();
>  }
>  
>  /* Try to fold a pointer difference of type TYPE two address expressions of
> @@ -16910,6 +17024,97 @@ test_vec_duplicate_folding ()
>    ASSERT_TRUE (operand_equal_p (dup5_expr, dup5_cst, 0));
>  }
>  
> +static tree
> +build_vec_int_cst (unsigned npatterns, unsigned nelts_per_pattern,
> +		   int *encoded_elems)
> +{
> +  scalar_int_mode int_mode = SCALAR_INT_TYPE_MODE (integer_type_node);
> +  machine_mode vmode = targetm.vectorize.preferred_simd_mode (int_mode);
> +  //machine_mode vmode = VNx4SImode;
> +  poly_uint64 nunits = GET_MODE_NUNITS (vmode);
> +  tree vectype = build_vector_type (integer_type_node, nunits);
> +
> +  tree_vector_builder builder (vectype, npatterns, nelts_per_pattern);
> +  for (unsigned i = 0; i < npatterns * nelts_per_pattern; i++)
> +    builder.quick_push (build_int_cst (integer_type_node, encoded_elems[i]));
> +  return builder.build ();
> +}
> +
> +static void
> +test_vec_perm_vla_folding ()
> +{
> +  int arg0_elems[] = { 1, 11, 2, 12, 3, 13 };
> +  tree arg0 = build_vec_int_cst (2, 3, arg0_elems);
> +
> +  int arg1_elems[] = { 21, 31, 22, 32, 23, 33 };
> +  tree arg1 = build_vec_int_cst (2, 3, arg1_elems);
> +
> +  if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
> +      || TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)).is_constant ())
> +    return;
> +
> +  /* Case 1: For mask: {0, 1, 2, ...}, npatterns == 1, nelts_per_pattern == 3,
> +     should select arg0.  */
> +  {
> +    int mask_elems[] = {0, 1, 2};
> +    tree mask = build_vec_int_cst (1, 3, mask_elems);
> +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> +    ASSERT_TRUE (res != NULL_TREE);
> +    ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == 2);
> +    ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == 3);
> +
> +    unsigned res_nelts = vector_cst_encoded_nelts (res);
> +    for (unsigned i = 0; i < res_nelts; i++)
> +      ASSERT_TRUE (operand_equal_p (VECTOR_CST_ELT (res, i),
> +				    VECTOR_CST_ELT (arg0, i), 0));
> +  }
> +
> +  /* Case 2: For mask: {4, 5, 6, ...}, npatterns == 1, nelts_per_pattern == 3,
> +     should return NULL because for len = 4 + 4x,
> +     if x == 0, we select from arg1
> +     if x > 0, we select from arg0
> +     and thus cannot determine result at compile time.  */
> +  {
> +    int mask_elems[] = {4, 5, 6};
> +    tree mask = build_vec_int_cst (1, 3, mask_elems);
> +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> +    gcc_assert (res == NULL_TREE);
> +  }
> +
> +  /* Case 3:
> +     mask: {0, 0, 0, 1, 0, 2, ...} 
> +     npatterns == 2, nelts_per_pattern == 3
> +     Pattern {0, ...} should select arg0[0], ie, 1.
> +     Pattern {0, 1, 2, ...} should select arg0: {1, 11, 2, ...},
> +     so res = {1, 1, 1, 11, 1, 2, ...}.  */
> +  {
> +    int mask_elems[] = {0, 0, 0, 1, 0, 2};
> +    tree mask = build_vec_int_cst (2, 3, mask_elems);
> +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> +    ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == 4);
> +    ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == 3);
> +
> +    /* Check encoding: {1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13, ...}  */
> +    int res_encoded_elems[] = {1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13};
> +    for (unsigned i = 0; i < vector_cst_encoded_nelts (res); i++)
> +      ASSERT_TRUE (wi::to_wide(VECTOR_CST_ELT (res, i)) == res_encoded_elems[i]);
> +  }
> +
> +  /* Case 4:
> +     mask: {0, 4 + 4x, 0, 5 + 4x, 0, 6 + 4x, ...}
> +     npatterns == 2, nelts_per_pattern == 3
> +     Pattern {0, ...} should select arg0[0]
> +     Pattern {4 + 4x, 5 + 4x, 6 + 4x, ...} should select from arg1, since:
> +     a1 = 5 + 4x
> +     ae = (5 + 4x) + ((4 + 4x) / 2 - 2) * 1
> +        = 5 + 6x
> +     Since a1 / (4+4x) == ae / (4+4x) == 1, we select arg1[0], arg1[1], arg1[2], ...
> +     res: {1, 21, 1, 31, 1, 22, ... }
> +     FIXME: How to build vector with poly_int elems ?  */
> +
> +  /* Case 5: S < 0.  */
> +}
> +
>  /* Run all of the selftests within this file.  */
>  
>  void
> @@ -16918,6 +17123,7 @@ fold_const_cc_tests ()
>    test_arithmetic_folding ();
>    test_vector_folding ();
>    test_vec_duplicate_folding ();
> +  test_vec_perm_vla_folding ();
>  }
>  
>  } // namespace selftest

Prathamesh Kulkarni Dec. 13, 2022, 6:05 a.m. UTC | #25

On Tue, 6 Dec 2022 at 21:00, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
>
> Yeah.  But like I said above:
>
>   The restriction about each (unwound) selector pattern selecting from the
>   same input pattern only applies to case where the selector pattern is
>   stepped (and only applies to the stepped part of the pattern, not the
>   leading element).
>
> If the selector nelts-per-pattern is 1 or 2 then we can support all
> power-of-2 cases, with the final npatterns being the maximum of the
> source nelts-per-patterns.
>
> Also, going back to an earlier part of the discussion, I think we
> should use this technique for both VLA and VLS, and only fall back
> to the VLS-specific approach if the VLA approach fails.
>
> So I suggest we put the VLA code in its own function and have
> the VLS-only path kick in when the VLA code fails.  If the code is
> having to pass a lot of state around, it might make sense to define
> a local class, store the state in member variables, and use member
> functions for the various subroutines.  I don't know if that will
> work out neater though.
Hi Richard,
Thanks for the suggestions. I have attached an updated POC patch
that does the following:
(a) Uses the VLA approach by default, and falls back to VLS-specific
folding if the VLA approach fails for VLS vectors.
(b) Separates the cases sel_nelts_per_pattern < 3 and
sel_nelts_per_pattern == 3.
(c) Allows a0 to select from a different vector than a1 .. ae.
I have written a few unit tests in the patch to exercise these cases.
Does the patch look like it is heading in the right direction ?
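
For reference, the way the patch derives the number of result patterns for
sel_nelts_per_pattern == 3 can be sketched in isolation. This is a standalone
model rather than GCC code: sel_pattern, unroll_factor and result_npatterns
are hypothetical names, and the sketch assumes power-of-2 steps and pattern
counts, as the patch does.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical standalone model; these names are not GCC APIs.
struct sel_pattern
{
  int step;               // S = a2 - a1 for this selector pattern, 0 for a dup
  unsigned arg_npatterns; // npatterns of the input vector it selects from
};

// Factor by which one selector pattern is unrolled so that each unrolled
// pattern stays within a single pattern of its input.
// Assumes step and arg_npatterns are powers of 2 whenever step > 0.
static unsigned
unroll_factor (const sel_pattern &p)
{
  if (p.step <= 0)
    return 1;
  return std::max<unsigned> (p.step, p.arg_npatterns) / p.step;
}

// Mirrors the res_npatterns computation in the patch:
// max (sel_npatterns * N, max (arg0_npatterns, arg1_npatterns)).
static unsigned
result_npatterns (const std::vector<sel_pattern> &sel,
                  unsigned arg0_npatterns, unsigned arg1_npatterns)
{
  unsigned n = 1;
  for (const sel_pattern &p : sel)
    n = std::max (n, unroll_factor (p));
  return std::max<unsigned> (sel.size () * n,
                             std::max (arg0_npatterns, arg1_npatterns));
}
```

With sel = {0, 0, 0, 1, 0, 2, ...} (steps 0 and 1) and both inputs having
npatterns = 2, this gives N = 2 and four result patterns, matching the
re-encoded selector discussed earlier in the thread.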

The patch has an issue for the following case marked as "case 9"
in test_vec_perm_vla_folding:
arg0 = { 1, 11, 2, 12, 3, 13, ... }
arg1 = { 21, 31, 22, 32, 23, 33, ... }
arg0 and arg1 have npatterns = 2, nelts_per_pattern = 3.

mask = { 4 + 4x, 5 + 4x, 6 + 4x, ... }
where 4 + 4x is the runtime vector length.
npatterns = 1, nelts_per_pattern = 3.

a1 = 5 + 4x
ae = a1 + (esel - 2) * S
     = (5 + 4x) + (4 + 4x - 2) * 1
     = 7 + 8x

Since can_div_trunc_p fails for (7 + 8x) /trunc (4 + 4x), fold_vec_perm
returns NULL_TREE.
Is that expected for the above mask ?
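
To make the arithmetic concrete, the standalone sketch below evaluates
(7 + 8x) / (4 + 4x) for specific x and contrasts it with a coefficient-wise
test. Everything here is hypothetical: poly2 is not GCC's poly_int, and
coeffwise_quotient is not can_div_trunc_p. It is only an assumed, simplified
model of a conservative compile-time check, which accepts a quotient only
when the constant terms and the x coefficients agree on it.

```cpp
#include <cstdint>

// Hypothetical model of a value c0 + c1 * x with x >= 0; not GCC's poly_int.
struct poly2
{
  uint64_t c0, c1;
  uint64_t at (uint64_t x) const { return c0 + c1 * x; }
};

// Truncating quotient a / b for one concrete x.
static uint64_t
quotient_at (poly2 a, poly2 b, uint64_t x)
{
  return a.at (x) / b.at (x);
}

// Assumed, simplified compile-time test (requires b.c0 > 0): succeed only
// when the constant terms and the x coefficients give the same truncated
// quotient.  If both agree on Q, then Q * b <= a < (Q + 1) * b for all x.
static bool
coeffwise_quotient (poly2 a, poly2 b, uint64_t *q)
{
  uint64_t q0 = a.c0 / b.c0;
  if (b.c1 == 0)
    {
      if (a.c1 != 0)
        return false;
    }
  else if (a.c1 / b.c1 != q0)
    return false;
  *q = q0;
  return true;
}
```

For ae = 7 + 8x and len = 4 + 4x the constant terms give 7 / 4 = 1 while the
x coefficients give 8 / 4 = 2, so a test of this shape rejects the division
even though 7 + 8x = 2 * len - 1 has quotient 1 for every concrete x >= 0.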

I intended it to select the second vector, similar to how
sel = { 0, 1, 2, ... } selects the first vector: sel is re-encoded
as { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ... }
with two patterns, {0, 2, 4, ...} and {1, 3, 5, ...}.
The first pattern selects elements from the first pattern of arg0,
while the second pattern selects elements from the second pattern of arg0,
so the result effectively has the same encoding as arg0.
Shouldn't sel = { 4 + 4x, 5 + 4x, 6 + 4x, ... } similarly select arg1 ?
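
The intended behaviour can be checked for concrete vector lengths with a small
standalone model. This is only a sketch: encoded_vec, elt and permute are
hypothetical names rather than GCC's VECTOR_CST or vec_perm_indices, and the
model expands the encodings for one fixed length instead of folding in a
length-agnostic way.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model of a pattern-encoded vector; not GCC's VECTOR_CST.
struct encoded_vec
{
  unsigned npatterns;
  unsigned nelts_per_pattern;
  std::vector<long> elems; // npatterns * nelts_per_pattern leading elements
};

// Element I of the full vector described by V.
static long
elt (const encoded_vec &v, size_t i)
{
  size_t encoded = (size_t) v.npatterns * v.nelts_per_pattern;
  if (i < encoded)
    return v.elems[i];
  size_t pattern = i % v.npatterns;
  size_t count = i / v.npatterns;             // position within the pattern
  size_t last = encoded - v.npatterns + pattern;
  if (v.nelts_per_pattern < 3)
    return v.elems[last];                     // duplicate the final element
  long step = v.elems[last] - v.elems[last - v.npatterns];
  return v.elems[last] + step * (long) (count - (v.nelts_per_pattern - 1));
}

// Permute for one concrete vector length LEN: indices below LEN select from
// ARG0, indices in [LEN, 2 * LEN) select from ARG1.
static std::vector<long>
permute (const encoded_vec &arg0, const encoded_vec &arg1,
         const encoded_vec &sel, size_t len)
{
  std::vector<long> res;
  for (size_t i = 0; i < len; i++)
    {
      size_t idx = (size_t) elt (sel, i);
      res.push_back (idx < len ? elt (arg0, idx) : elt (arg1, idx - len));
    }
  return res;
}
```

With arg0 and arg1 encoded as in case 9, a selector {0, 1, 2, ...} reproduces
the leading elements of arg0 and a selector {len, len + 1, len + 2, ...}
reproduces those of arg1 for each concrete len that was tried.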

PS: I will be on vacation next week.

Thanks,
Prathamesh

>
> > @@ -10494,38 +10497,55 @@ fold_mult_zconjz (location_t loc, tree type, tree expr)
> >                         build_zero_cst (itype));
> >  }
> >
> > +/* Check if PATTERN in SEL selects either ARG0 or ARG1,
> > +   and return the selected arg, otherwise return NULL_TREE.  */
> >
> > -/* Helper function for fold_vec_perm.  Store elements of VECTOR_CST or
> > -   CONSTRUCTOR ARG into array ELTS, which has NELTS elements, and return
> > -   true if successful.  */
> > -
> > -static bool
> > -vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
> > +static tree
> > +get_vector_for_pattern (tree arg0, tree arg1,
> > +                     const vec_perm_indices &sel, unsigned pattern,
> > +                     unsigned sel_npatterns, int &S)
> >  {
> > -  unsigned HOST_WIDE_INT i, nunits;
> > +  unsigned sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
> >
> > -  if (TREE_CODE (arg) == VECTOR_CST
> > -      && VECTOR_CST_NELTS (arg).is_constant (&nunits))
> > +  poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +  poly_uint64 nsel = sel.length ();
> > +  poly_uint64 esel;
> > +
> > +  if (!multiple_p (nsel, sel_npatterns, &esel))
> > +    return NULL_TREE;
> > +
> > +  poly_uint64 a1 = sel[pattern + sel_npatterns];
> > +  S = 0;
> > +  if (sel_nelts_per_pattern == 3)
> >      {
> > -      for (i = 0; i < nunits; ++i)
> > -     elts[i] = VECTOR_CST_ELT (arg, i);
> > +      poly_uint64 a2 = sel[pattern + 2 * sel_npatterns];
> > +      S = (a2 - a1).to_constant ();
>
> The code hasn't proven that this to_constant is safe.
>
> > +      if (S != 0 && !pow2p_hwi (S))
> > +     return NULL_TREE;
> >      }
> > -  else if (TREE_CODE (arg) == CONSTRUCTOR)
> > +
> > +  poly_uint64 ae = a1 + (esel - 2) * S;
> > +  uint64_t q1, qe;
> > +  poly_uint64 r1, re;
> > +
> > +  if (!can_div_trunc_p (a1, n1, &q1, &r1)
> > +      || !can_div_trunc_p (ae, n1, &qe, &re)
> > +      || (q1 != qe))
> > +    return NULL_TREE;
>
> Going back to the above: this check doesn't make sense for
> sel_nelts_per_pattern != 3.
>
> Thanks,
> Richard
>
> > +
> > +  tree arg = ((q1 & 1) == 0) ? arg0 : arg1;
> > +
> > +  if (S < 0)
> >      {
> > -      constructor_elt *elt;
> > +      poly_uint64 a0 = sel[pattern];
> > +      if (!known_eq (S, a1 - a0))
> > +        return NULL_TREE;
> >
> > -      FOR_EACH_VEC_SAFE_ELT (CONSTRUCTOR_ELTS (arg), i, elt)
> > -     if (i >= nelts || TREE_CODE (TREE_TYPE (elt->value)) == VECTOR_TYPE)
> > -       return false;
> > -     else
> > -       elts[i] = elt->value;
> > +      if (!known_gt (re, VECTOR_CST_NPATTERNS (arg)))
> > +        return NULL_TREE;
> >      }
> > -  else
> > -    return false;
> > -  for (; i < nelts; i++)
> > -    elts[i]
> > -      = fold_convert (TREE_TYPE (TREE_TYPE (arg)), integer_zero_node);
> > -  return true;
> > +
> > +  return arg;
> >  }
> >
> >  /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL
> > @@ -10539,41 +10559,135 @@ fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
> >    unsigned HOST_WIDE_INT nelts;
> >    bool need_ctor = false;
> >
> > -  if (!sel.length ().is_constant (&nelts))
> > -    return NULL_TREE;
> > -  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts)
> > -           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts)
> > -           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts));
> > +  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), sel.length ())
> > +           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)),
> > +                        TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1))));
> >    if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type)
> >        || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type))
> >      return NULL_TREE;
> >
> > -  tree *in_elts = XALLOCAVEC (tree, nelts * 2);
> > -  if (!vec_cst_ctor_to_array (arg0, nelts, in_elts)
> > -      || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts))
> > +  unsigned res_npatterns = 0;
> > +  unsigned res_nelts_per_pattern = 0;
> > +  unsigned sel_npatterns = 0;
> > +  tree *vector_for_pattern = NULL;
> > +
> > +  if (TREE_CODE (arg0) == VECTOR_CST
> > +      && TREE_CODE (arg1) == VECTOR_CST
> > +      && !sel.length ().is_constant ())
> > +    {
> > +      unsigned arg0_npatterns = VECTOR_CST_NPATTERNS (arg0);
> > +      unsigned arg1_npatterns = VECTOR_CST_NPATTERNS (arg1);
> > +      sel_npatterns = sel.encoding ().npatterns ();
> > +
> > +      if (!pow2p_hwi (arg0_npatterns)
> > +       || !pow2p_hwi (arg1_npatterns)
> > +       || !pow2p_hwi (sel_npatterns))
> > +        return NULL_TREE;
> > +
> > +      unsigned N = 1;
> > +      vector_for_pattern = XALLOCAVEC (tree, sel_npatterns);
> > +      for (unsigned i = 0; i < sel_npatterns; i++)
> > +     {
> > +       int S = 0;
> > +       tree op = get_vector_for_pattern (arg0, arg1, sel, i, sel_npatterns, S);
> > +       if (!op)
> > +         return NULL_TREE;
> > +       vector_for_pattern[i] = op;
> > +       unsigned N_pattern =
> > +         (S > 0) ? std::max<int>(S, VECTOR_CST_NPATTERNS (op)) / S : 1;
> > +       N = std::max (N, N_pattern);
> > +     }
> > +
> > +      res_npatterns
> > +        = std::max (sel_npatterns * N, std::max (arg0_npatterns, arg1_npatterns));
> > +
> > +      res_nelts_per_pattern
> > +     = std::max(sel.encoding ().nelts_per_pattern (),
> > +                std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
> > +                          VECTOR_CST_NELTS_PER_PATTERN (arg1)));
> > +    }
> > +  else if (sel.length ().is_constant (&nelts)
> > +        && TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
> > +        && TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).to_constant () == nelts)
> > +    {
> > +      /* For VLS vectors, treat all vectors with
> > +      npatterns = nelts, nelts_per_pattern = 1. */
> > +      res_npatterns = sel_npatterns = nelts;
> > +      res_nelts_per_pattern = 1;
> > +      vector_for_pattern = XALLOCAVEC (tree, nelts);
> > +      for (unsigned i = 0; i < nelts; i++)
> > +        {
> > +       HOST_WIDE_INT index;
> > +       if (!sel[i].is_constant (&index))
> > +         return NULL_TREE;
> > +       vector_for_pattern[i] = (index < nelts) ? arg0 : arg1;
> > +     }
> > +    }
> > +  else
> >      return NULL_TREE;
> >
> > -  tree_vector_builder out_elts (type, nelts, 1);
> > -  for (i = 0; i < nelts; i++)
> > +  tree_vector_builder out_elts (type, res_npatterns,
> > +                             res_nelts_per_pattern);
> > +  unsigned res_nelts = res_npatterns * res_nelts_per_pattern;
> > +  for (unsigned i = 0; i < res_nelts; i++)
> >      {
> > -      HOST_WIDE_INT index;
> > -      if (!sel[i].is_constant (&index))
> > -     return NULL_TREE;
> > -      if (!CONSTANT_CLASS_P (in_elts[index]))
> > -     need_ctor = true;
> > -      out_elts.quick_push (unshare_expr (in_elts[index]));
> > +      /* For VLA vectors, i % sel_npatterns would give the original
> > +         pattern the element belongs to, which is sufficient to get the arg.
> > +      Even if sel_npatterns has been multiplied by N,
> > +      they will always come from the same input vector.
> > +      For VLS vectors, sel_npatterns == res_nelts == nelts,
> > +      so i % sel_npatterns == i since i < nelts */
> > +
> > +      tree arg = vector_for_pattern[i % sel_npatterns];
> > +      unsigned HOST_WIDE_INT index;
> > +
> > +      if (arg == arg0)
> > +     {
> > +       if (!sel[i].is_constant ())
> > +         return NULL_TREE;
> > +       index = sel[i].to_constant ();
> > +     }
> > +      else
> > +        {
> > +       gcc_assert (arg == arg1);
> > +       poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +       uint64_t q;
> > +       poly_uint64 r;
> > +
> > +       /* Divide sel[i] by input vector length, to obtain remainder,
> > +          which would be the index for either input vector.  */
> > +       if (!can_div_trunc_p (sel[i], n1, &q, &r))
> > +         return NULL_TREE;
> > +
> > +       if (!r.is_constant (&index))
> > +         return NULL_TREE;
> > +     }
> > +
> > +      tree elem;
> > +      if (TREE_CODE (arg) == CONSTRUCTOR)
> > +        {
> > +       gcc_assert (index < nelts);
> > +       if (index >= vec_safe_length (CONSTRUCTOR_ELTS (arg)))
> > +         return NULL_TREE;
> > +       elem = CONSTRUCTOR_ELT (arg, index)->value;
> > +       if (VECTOR_TYPE_P (TREE_TYPE (elem)))
> > +         return NULL_TREE;
> > +       need_ctor = true;
> > +     }
> > +      else
> > +        elem = vector_cst_elt (arg, index);
> > +      out_elts.quick_push (elem);
> >      }
> >
> >    if (need_ctor)
> >      {
> >        vec<constructor_elt, va_gc> *v;
> > -      vec_alloc (v, nelts);
> > -      for (i = 0; i < nelts; i++)
> > +      vec_alloc (v, res_nelts);
> > +      for (i = 0; i < res_nelts; i++)
> >       CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, out_elts[i]);
> >        return build_constructor (type, v);
> >      }
> > -  else
> > -    return out_elts.build ();
> > +  return out_elts.build ();
> >  }
> >
> >  /* Try to fold a pointer difference of type TYPE two address expressions of
> > @@ -16910,6 +17024,97 @@ test_vec_duplicate_folding ()
> >    ASSERT_TRUE (operand_equal_p (dup5_expr, dup5_cst, 0));
> >  }
> >
> > +static tree
> > +build_vec_int_cst (unsigned npatterns, unsigned nelts_per_pattern,
> > +                int *encoded_elems)
> > +{
> > +  scalar_int_mode int_mode = SCALAR_INT_TYPE_MODE (integer_type_node);
> > +  machine_mode vmode = targetm.vectorize.preferred_simd_mode (int_mode);
> > +  //machine_mode vmode = VNx4SImode;
> > +  poly_uint64 nunits = GET_MODE_NUNITS (vmode);
> > +  tree vectype = build_vector_type (integer_type_node, nunits);
> > +
> > +  tree_vector_builder builder (vectype, npatterns, nelts_per_pattern);
> > +  for (unsigned i = 0; i < npatterns * nelts_per_pattern; i++)
> > +    builder.quick_push (build_int_cst (integer_type_node, encoded_elems[i]));
> > +  return builder.build ();
> > +}
> > +
> > +static void
> > +test_vec_perm_vla_folding ()
> > +{
> > +  int arg0_elems[] = { 1, 11, 2, 12, 3, 13 };
> > +  tree arg0 = build_vec_int_cst (2, 3, arg0_elems);
> > +
> > +  int arg1_elems[] = { 21, 31, 22, 32, 23, 33 };
> > +  tree arg1 = build_vec_int_cst (2, 3, arg1_elems);
> > +
> > +  if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
> > +      || TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)).is_constant ())
> > +    return;
> > +
> > +  /* Case 1: For mask: {0, 1, 2, ...}, npatterns == 1, nelts_per_pattern == 3,
> > +     should select arg0.  */
> > +  {
> > +    int mask_elems[] = {0, 1, 2};
> > +    tree mask = build_vec_int_cst (1, 3, mask_elems);
> > +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> > +    ASSERT_TRUE (res != NULL_TREE);
> > +    ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == 2);
> > +    ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == 3);
> > +
> > +    unsigned res_nelts = vector_cst_encoded_nelts (res);
> > +    for (unsigned i = 0; i < res_nelts; i++)
> > +      ASSERT_TRUE (operand_equal_p (VECTOR_CST_ELT (res, i),
> > +                                 VECTOR_CST_ELT (arg0, i), 0));
> > +  }
> > +
> > +  /* Case 2: For mask: {4, 5, 6, ...}, npatterns == 1, nelts_per_pattern == 3,
> > +     should return NULL because for len = 4 + 4x,
> > +     if x == 0, we select from arg1
> > +     if x > 0, we select from arg0
> > +     and thus cannot determine result at compile time.  */
> > +  {
> > +    int mask_elems[] = {4, 5, 6};
> > +    tree mask = build_vec_int_cst (1, 3, mask_elems);
> > +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> > +    gcc_assert (res == NULL_TREE);
> > +  }
> > +
> > +  /* Case 3:
> > +     mask: {0, 0, 0, 1, 0, 2, ...}
> > +     npatterns == 2, nelts_per_pattern == 3
> > +     Pattern {0, ...} should select arg0[0], ie, 1.
> > +     Pattern {0, 1, 2, ...} should select arg0: {1, 11, 2, ...},
> > +     so res = {1, 1, 1, 11, 1, 2, ...}.  */
> > +  {
> > +    int mask_elems[] = {0, 0, 0, 1, 0, 2};
> > +    tree mask = build_vec_int_cst (2, 3, mask_elems);
> > +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> > +    ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == 4);
> > +    ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == 3);
> > +
> > +    /* Check encoding: {1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13, ...}  */
> > +    int res_encoded_elems[] = {1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13};
> > +    for (unsigned i = 0; i < vector_cst_encoded_nelts (res); i++)
> > +      ASSERT_TRUE (wi::to_wide(VECTOR_CST_ELT (res, i)) == res_encoded_elems[i]);
> > +  }
> > +
> > +  /* Case 4:
> > +     mask: {0, 4 + 4x, 0, 5 + 4x, 0, 6 + 4x, ...}
> > +     npatterns == 2, nelts_per_pattern == 3
> > +     Pattern {0, ...} should select arg0[1]
> > +     Pattern {4 + 4x, 5 + 4x, 6 + 4x, ...} should select from arg1, since:
> > +     a1 = 5 + 4x
> > +     ae = (5 + 4x) + ((4 + 4x) / 2 - 2) * 1
> > +        = 5 + 6x
> > +     Since a1/4+4x == ae/4+4x == 1, we select arg1[0], arg1[1], arg1[2], ...
> > +     res: {1, 21, 1, 31, 1, 22, ... }
> > +     FIXME: How to build vector with poly_int elems ?  */
> > +
> > +  /* Case 5: S < 0.  */
> > +}
> > +
> >  /* Run all of the selftests within this file.  */
> >
> >  void
> > @@ -16918,6 +17123,7 @@ fold_const_cc_tests ()
> >    test_arithmetic_folding ();
> >    test_vector_folding ();
> >    test_vec_duplicate_folding ();
> > +  test_vec_perm_vla_folding ();
> >  }
> >
> >  } // namespace selftest
diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index e80be8049e1..b1ed90e629b 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -85,6 +85,9 @@ along with GCC; see the file COPYING3.  If not see
 #include "vec-perm-indices.h"
 #include "asan.h"
 #include "gimple-range.h"
+#include <algorithm>
+#include "gimple-pretty-print.h"
+#include "tree-pretty-print.h"
 
 /* Nonzero if we are folding constants inside an initializer or a C++
    manifestly-constant-evaluated context; zero otherwise.
@@ -10544,6 +10547,124 @@ vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
   return true;
 }
 
+static std::pair<tree, int>
+get_vector_for_pattern (tree arg0, tree arg1, const vec_perm_indices &sel,
+			unsigned pattern)
+{
+  unsigned sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
+  gcc_assert (sel_nelts_per_pattern == 3);
+
+  unsigned sel_npatterns = sel.encoding ().npatterns ();
+  poly_uint64 len = sel.length ();
+  poly_uint64 esel;
+
+  if (!multiple_p (len, sel_npatterns, &esel))
+    return std::make_pair (NULL_TREE, 0);
+
+  poly_uint64 a1 = sel[pattern + sel_npatterns];
+  poly_uint64 a2 = sel[pattern + 2 * sel_npatterns];
+  poly_uint64 diff = a2 - a1;
+  if (!diff.is_constant ())
+    return std::make_pair (NULL_TREE, 0);
+  int S = diff.to_constant ();
+  if (!pow2p_hwi (S))
+    return std::make_pair (NULL_TREE, 0);
+
+  poly_uint64 ae = a1 + (esel - 2) * S;
+  uint64_t q1, qe;
+  poly_uint64 r1, re;
+  if (!can_div_trunc_p (a1, len, &q1, &r1)
+      || !can_div_trunc_p (ae, len, &qe, &re)
+      || (q1 != qe))
+    return std::make_pair (NULL_TREE, 0);
+
+  tree arg = ((q1 & 1) == 0) ? arg0 : arg1;
+
+  if (S < 0
+      && !known_eq (S, a1 - sel[pattern])
+      && !known_gt (re, VECTOR_CST_NPATTERNS (arg)))
+    return std::make_pair (NULL_TREE, 0);
+
+  return std::make_pair (arg, S);
+}
+
+static tree
+fold_vec_perm_vla (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
+{
+  if (TREE_CODE (arg0) != VECTOR_CST
+      || TREE_CODE (arg1) != VECTOR_CST)
+    return NULL_TREE;
+
+  unsigned arg0_npatterns = VECTOR_CST_NPATTERNS (arg0);
+  unsigned arg1_npatterns = VECTOR_CST_NPATTERNS (arg1);
+  unsigned sel_npatterns = sel.encoding ().npatterns ();
+  unsigned sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
+  poly_uint64 nelts = sel.length ();
+
+  if (!pow2p_hwi (arg0_npatterns)
+      || !pow2p_hwi (arg1_npatterns)
+      || !pow2p_hwi (sel_npatterns))
+    return NULL_TREE;
+
+  unsigned N = 1;
+  if (sel_nelts_per_pattern == 3)
+    for (unsigned i = 0; i < sel_npatterns; i++)
+      {
+	std::pair<tree, int> ret = get_vector_for_pattern (arg0, arg1, sel, i);
+	tree arg = ret.first;
+	if (!arg)
+	  return NULL_TREE;
+
+	int S = ret.second;
+	/* S can be 0 if one of the patterns is a dup but
+	   other is a stepped sequence. For eg: {0, 0, 0, 1, 0, 2, ...} */
+	unsigned N_pattern
+	  = (S > 0) ? std::max<int> (S, VECTOR_CST_NPATTERNS (arg)) / S : 1;
+	N = std::max (N, N_pattern);
+      }
+
+  unsigned res_npatterns
+    = std::max (sel_npatterns * N, std::max (arg0_npatterns, arg1_npatterns));
+
+  unsigned res_nelts_per_pattern
+    = std::max (sel_nelts_per_pattern, std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
+						 VECTOR_CST_NELTS_PER_PATTERN (arg1)));
+
+  tree_vector_builder out_elts (type, res_npatterns, res_nelts_per_pattern);
+  for (unsigned i = 0; i < res_npatterns * res_nelts_per_pattern; i++)
+    {
+      /* Get pattern corresponding to sel[i] and use that to figure out
+	 the input vector.
+	 For a stepped sequence, a0 may choose different vector,
+	 however a1 ... ae must select from same pattern from same vector.
+	 So if i < sel_npatterns, set pattern_index to index of a0,
+	 and if i >= sel_npatterns, set pattern_index to index of a1.  */
+
+      unsigned pattern_index = i % sel_npatterns;
+      if (i >= sel_npatterns)
+	pattern_index += sel_npatterns;
+
+      uint64_t q;
+      poly_uint64 r;
+      if (!can_div_trunc_p (sel[pattern_index], nelts, &q, &r))
+	return NULL_TREE;
+      tree arg = ((q & 1) == 0) ? arg0 : arg1;
+
+      unsigned HOST_WIDE_INT index;
+      if (sel[i].is_constant ())
+	index = sel[i].to_constant ();
+      else
+	{
+	  poly_uint64 diff = sel[i] - nelts;
+	  if (!diff.is_constant (&index))
+	    return NULL_TREE;
+	}
+      tree elem = vector_cst_elt (arg, index);
+      out_elts.quick_push (elem);
+    }
+  return out_elts.build ();
+}
+
 /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL
    selector.  Return the folded VECTOR_CST or CONSTRUCTOR if successful,
    NULL_TREE otherwise.  */
@@ -10555,15 +10676,20 @@ fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
   unsigned HOST_WIDE_INT nelts;
   bool need_ctor = false;
 
-  if (!sel.length ().is_constant (&nelts))
-    return NULL_TREE;
-  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts)
-	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts)
-	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts));
+  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), sel.length ())
+	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)),
+			   TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1))));
+
   if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type)
       || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type))
     return NULL_TREE;
 
+  if (tree res = fold_vec_perm_vla (type, arg0, arg1, sel))
+    return res;
+
+  if (!sel.length ().is_constant (&nelts))
+    return NULL_TREE;
+
   tree *in_elts = XALLOCAVEC (tree, nelts * 2);
   if (!vec_cst_ctor_to_array (arg0, nelts, in_elts)
       || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts))
@@ -16958,6 +17084,168 @@ test_vec_duplicate_folding ()
   ASSERT_TRUE (operand_equal_p (dup5_expr, dup5_cst, 0));
 }
 
+static tree
+build_vec_int_cst (unsigned npatterns, unsigned nelts_per_pattern,
+		   int *encoded_elems)
+{
+  scalar_int_mode int_mode = SCALAR_INT_TYPE_MODE (integer_type_node);
+  //machine_mode vmode = targetm.vectorize.preferred_simd_mode (int_mode);
+  machine_mode vmode = VNx4SImode;
+  poly_uint64 nunits = GET_MODE_NUNITS (vmode);
+  tree vectype = build_vector_type (integer_type_node, nunits);
+
+  tree_vector_builder builder (vectype, npatterns, nelts_per_pattern);
+  for (unsigned i = 0; i < npatterns * nelts_per_pattern; i++)
+    builder.quick_push (build_int_cst (integer_type_node, encoded_elems[i]));
+  return builder.build ();
+}
+
+static tree
+fold_vec_perm_vla_mask (tree type, tree arg0, tree arg1,
+			unsigned mask_npatterns,
+			unsigned mask_nelts_per_pattern,
+			poly_uint64 *mask_elems)
+{
+  poly_uint64 len = TYPE_VECTOR_SUBPARTS (type);
+  vec_perm_builder builder (len, mask_npatterns, mask_nelts_per_pattern);
+  for (unsigned i = 0; i < mask_npatterns * mask_nelts_per_pattern; i++)
+    builder.quick_push (mask_elems[i]);
+  vec_perm_indices sel (builder, (arg0 == arg1) ? 1 : 2, len);
+  return fold_vec_perm_vla (type, arg0, arg1, sel);
+}
+
+static void
+check_vec_perm_vla_result (tree res, int *res_elems,
+			  unsigned npatterns, unsigned nelts_per_pattern)
+{
+  ASSERT_TRUE (TREE_CODE (res) == VECTOR_CST);
+  ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == npatterns);
+  ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == nelts_per_pattern);
+
+  for (unsigned i = 0; i < vector_cst_encoded_nelts (res); i++)
+    ASSERT_TRUE (wi::to_wide (VECTOR_CST_ELT (res, i)) == res_elems[i]);
+}
+
+static void
+test_vec_perm_vla_folding ()
+{
+  int arg0_elems[] = { 1, 11, 2, 12, 3, 13 };
+  tree arg0 = build_vec_int_cst (2, 3, arg0_elems);
+
+  int arg1_elems[] = { 21, 31, 22, 32, 23, 33 };
+  tree arg1 = build_vec_int_cst (2, 3, arg1_elems);
+
+  if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
+      || TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)).is_constant ())
+    return;
+
+  /* Case 1: mask is {0, 1, 2, 3, ... }.
+     npatterns = 4, nelts_per_pattern = 1.
+     All elements in mask < 4 + 4x.  */
+  {
+    poly_uint64 mask_elems[] = {0, 1, 2, 3};
+    tree res = fold_vec_perm_vla_mask (TREE_TYPE (arg0), arg0, arg1, 4, 1, mask_elems);
+    int res_elems[] = { arg0_elems[0], arg0_elems[1], arg0_elems[2], arg0_elems[3] };
+    check_vec_perm_vla_result (res, res_elems, 4, 1);
+  }
+
+  /* Case 2: mask = {0, 4, 1, 5, ...}
+     npatterns = 4, nelts_per_pattern = 1.
+     Should return NULL, because result cannot be determined at compile time,
+     since len = 4 + 4x and thus {4, 5} can select either from arg0 or arg1
+     depending on runtime length of the vector.  */
+  {
+    poly_uint64 mask_elems[] = {0, 4, 1, 5};
+    tree res = fold_vec_perm_vla_mask (TREE_TYPE (arg0), arg0, arg1, 4, 1, mask_elems);
+    ASSERT_TRUE (res == NULL_TREE);
+  }
+
+  /* Case 3: mask = { 4 + 4x, 5 + 4x, 6 + 4x, 7 + 4x, ... }
+     npatterns = 4, nelts_per_pattern = 1
+     All elements in mask >= 4 + 4x.  */
+  {
+    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+    poly_uint64 mask_elems[] = { len, len + 1, len + 2, len + 3 };
+    tree res = fold_vec_perm_vla_mask (TREE_TYPE (arg0), arg0, arg1, 4, 1, mask_elems);
+    int res_elems[] = { arg1_elems[0], arg1_elems[1], arg1_elems[2], arg1_elems[3] };
+    check_vec_perm_vla_result (res, res_elems, 4, 1);
+  }
+
+  /* Case 4: mask = {0, 1, 4 + 4x, 5 + 4x, ... }
+     npatterns = 4, nelts_per_pattern = 1
+     res = { arg0[0], arg0[1], arg1[0], arg1[1], ... }  */
+  {
+    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+    poly_uint64 mask_elems[] = { 0, 1, len, len + 1 };
+    tree res = fold_vec_perm_vla_mask (TREE_TYPE (arg0), arg0, arg1, 4, 1, mask_elems);
+    int res_elems[] = { arg0_elems[0], arg0_elems[1], arg1_elems[0], arg1_elems[1] };
+    check_vec_perm_vla_result (res, res_elems, 4, 1);
+  }
+
+  /* Case 5: mask = {0, 1, 2, 3, 4 + 4x, 5 + 4x, 6 + 4x, 7 + 4x, ... }
+     npatterns = 4, nelts_per_pattern = 2.
+     res = { arg0[0], arg0[1], arg0[2], arg0[3], arg1[0], arg1[1], arg1[2], arg1[3], ... }  */
+  {
+    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+    poly_uint64 mask_elems[] = { 0, 1, 2, 3, len, len + 1, len + 2, len + 3 };
+    tree res = fold_vec_perm_vla_mask (TREE_TYPE (arg0), arg0, arg1, 4, 2, mask_elems);
+    int res_elems[] = { arg0_elems[0], arg0_elems[1], arg0_elems[2], arg0_elems[3],
+			arg1_elems[0], arg1_elems[1], arg1_elems[2], arg1_elems[3] };
+    check_vec_perm_vla_result (res, res_elems, 4, 2);
+  }
+
+  /* Case 6: mask = {0, ... }.
+     npatterns = 1, nelts_per_pattern = 1.
+     Test for npatterns(mask) < npatterns(arg0)  */
+  {
+    poly_uint64 mask_elems[] = {0};
+    tree res = fold_vec_perm_vla_mask (TREE_TYPE (arg0), arg0, arg1, 1, 1, mask_elems);
+    int res_elems[] = { arg0_elems[0] };
+    check_vec_perm_vla_result (res, res_elems, 1, 1);
+  }
+
+  /* Case 7: mask = { 0, 1, 2, ... }.
+     npatterns = 1, nelts_per_pattern = 3.
+     Since {0, 1, 2} will select {1, 11, 2} it will be incorrect.
+     Re-encode sel such that each pattern of sel selects elements from
+     same pattern from arg0.
+     Thus the pattern must be divided into
+     npatterns(arg0) / S = 2 / 1 = 2 distinct patterns.
+     Re-encoded sel: {0, 1, 2, 3, 4, 5, ...}
+     with patterns: {0, 2, 4, ...} and {1, 3, 5, ...}
+     Now each pattern selects elements only from same pattern
+     of arg0.
+     Expected res: {1, 11, 2, 12, 3, 13, ...}  */
+  {
+    poly_uint64 mask_elems[] = { 0, 1, 2 };
+    tree res = fold_vec_perm_vla_mask (TREE_TYPE (arg0), arg0, arg1, 1, 3, mask_elems);
+    int res_elems[] = { arg0_elems[0], arg0_elems[1], arg0_elems[2], arg0_elems[3],
+			arg0_elems[4], arg0_elems[5] };
+    check_vec_perm_vla_result (res, res_elems, 2, 3);
+  }
+
+  /* Case 8: mask = {len, 0, 1, ... }
+     npatterns = 1, nelts_per_pattern = 3.
+     Test for case when a0 selects a different vector from a1 ... ae.  */
+  {
+    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+    poly_uint64 mask_elems[] = {len, 0, 1};
+    int res_elems[] = { arg1_elems[0], arg0_elems[0], arg0_elems[1], arg0_elems[2],
+			arg0_elems[3], arg0_elems[4] };
+    tree res = fold_vec_perm_vla_mask (TREE_TYPE (arg0), arg0, arg1, 1, 3, mask_elems);
+    check_vec_perm_vla_result (res, res_elems, 2, 3);
+  }
+
+  /* Case 9: mask = {len, len + 1, len + 2, ...}
+     npatterns = 1, nelts_per_pattern = 3.  */
+  {
+    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+    poly_uint64 mask_elems[] = { len, len + 1, len + 2 };
+    tree res = fold_vec_perm_vla_mask (TREE_TYPE (arg0), arg0, arg1, 1, 3, mask_elems);
+  }
+
+}
+
 /* Run all of the selftests within this file.  */
 
 void
@@ -16966,6 +17254,7 @@ fold_const_cc_tests ()
   test_arithmetic_folding ();
   test_vector_folding ();
   test_vec_duplicate_folding ();
+  test_vec_perm_vla_folding ();
 }
 
 } // namespace selftest

Prathamesh Kulkarni Dec. 26, 2022, 4:26 a.m. UTC | #26

On Tue, 13 Dec 2022 at 11:35, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Tue, 6 Dec 2022 at 21:00, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
> >
> > Prathamesh Kulkarni via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > > On Fri, 4 Nov 2022 at 14:00, Prathamesh Kulkarni
> > > <prathamesh.kulkarni@linaro.org> wrote:
> > >>
> > >> On Mon, 31 Oct 2022 at 15:27, Richard Sandiford
> > >> <richard.sandiford@arm.com> wrote:
> > >> >
> > >> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > >> > > On Wed, 26 Oct 2022 at 21:07, Richard Sandiford
> > >> > > <richard.sandiford@arm.com> wrote:
> > >> > >>
> > >> > >> Sorry for the slow response.  I wanted to find some time to think
> > >> > >> about this a bit more.
> > >> > >>
> > >> > >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > >> > >> > On Fri, 30 Sept 2022 at 21:38, Richard Sandiford
> > >> > >> > <richard.sandiford@arm.com> wrote:
> > >> > >> >>
> > >> > >> >> Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > >> > >> >> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > >> > >> >> >> Sorry to ask a silly question but in which case shall we select 2nd vector ?
> > >> > >> >> >> For num_poly_int_coeffs == 2,
> > >> > >> >> >> a1 /trunc n1 == (a1 + 0x) / (n1.coeffs[0] + n1.coeffs[1]*x)
> > >> > >> >> >> If a1/trunc n1 succeeds,
> > >> > >> >> >> 0 / n1.coeffs[1] == a1/n1.coeffs[0] == 0.
> > >> > >> >> >> So, a1 has to be < n1.coeffs[0] ?
> > >> > >> >> >
> > >> > >> >> > Remember that a1 is itself a poly_int.  It's not necessarily a constant.
> > >> > >> >> >
> > >> > >> >> > E.g. the TRN1 .D instruction maps to a VEC_PERM_EXPR with the selector:
> > >> > >> >> >
> > >> > >> >> >   { 0, 2 + 2x, 1, 4 + 2x, 2, 6 + 2x, ... }
> > >> > >> >>
> > >> > >> >> Sorry, should have been:
> > >> > >> >>
> > >> > >> >>   { 0, 2 + 2x, 2, 4 + 2x, 4, 6 + 2x, ... }
> > >> > >> > Hi Richard,
> > >> > >> > Thanks for the clarifications, and sorry for late reply.
> > >> > >> > I have attached POC patch that tries to implement the above approach.
> > >> > >> > Passes bootstrap+test on x86_64-linux-gnu and aarch64-linux-gnu for VLS vectors.
> > >> > >> >
> > >> > >> > For VLA vectors, I have only done limited testing so far.
> > >> > >> > It seems to pass couple of tests written in the patch for
> > >> > >> > nelts_per_pattern == 3,
> > >> > >> > and folds the following svld1rq test:
> > >> > >> > int32x4_t v = {1, 2, 3, 4};
> > >> > >> > return svld1rq_s32 (svptrue_b8 (), &v[0])
> > >> > >> > into:
> > >> > >> > return {1, 2, 3, 4, ...};
> > >> > >> > I will try to bootstrap+test it on SVE machine to test further for VLA folding.
> > >> > >> >
> > >> > >> > I have a couple of questions:
> > >> > >> > 1] When mask selects elements from same vector but from different patterns:
> > >> > >> > For eg:
> > >> > >> > arg0 = {1, 11, 2, 12, 3, 13, ...},
> > >> > >> > arg1 = {21, 31, 22, 32, 23, 33, ...},
> > >> > >> > mask = {0, 0, 0, 1, 0, 2, ... },
> > >> > >> > All have npatterns = 2, nelts_per_pattern = 3.
> > >> > >> >
> > >> > >> > With above mask,
> > >> > >> > Pattern {0, ...} selects arg0[0], ie {1, ...}
> > >> > >> > Pattern {0, 1, 2, ...} selects arg0[0], arg0[1], arg0[2], ie {1, 11, 2, ...}
> > >> > >> > While arg0[0] and arg0[2] belong to same pattern, arg0[1] belongs to different
> > >> > >> > pattern in arg0.
> > >> > >> > The result is:
> > >> > >> > res = {1, 1, 1, 11, 1, 2, ...}
> > >> > >> > In this case, res's 2nd pattern {1, 11, 2, ...} is encoded with:
> > >> > >> > with a0 = 1, a1 = 11, S = -9.
> > >> > >> > Is that expected tho ? It seems to create a new encoding which
> > >> > >> > wasn't present in the input vector. For instance, the next elem in
> > >> > >> > sequence would be -7,
> > >> > >> > which is not present originally in arg0.
> > >> > >>
> > >> > >> Yeah, you're right, sorry.  Going back to:
> > >> > >>
> > >> > >> (2) The explicit encoding can be used to produce a sequence of N*Ex*Px
> > >> > >>     elements for any integer N.  This extended sequence can be reencoded
> > >> > >>     as having N*Px patterns, with Ex staying the same.
> > >> > >>
> > >> > >> I guess we need to pick an N for the selector such that each new
> > >> > >> selector pattern (each one out of the N*Px patterns) selects from
> > >> > >> the *same pattern* of the same data input.
> > >> > >>
> > >> > >> So if a particular pattern in the selector has a step S, and the data
> > >> > >> input it selects from has Pi patterns, N*S must be a multiple of Pi.
> > >> > >> N must be a multiple of least_common_multiple(S,Pi)/S.
> > >> > >>
> > >> > >> I think that means that the total number of patterns in the result
> > >> > >> (Pr from previous messages) can safely be:
> > >> > >>
> > >> > >>   Ps * least_common_multiple(
> > >> > >>     least_common_multiple(S[1], P[input(1)]) / S[1],
> > >> > >>     ...
> > >> > >>     least_common_multiple(S[Ps], P[input(Ps)]) / S[Ps]
> > >> > >>   )
> > >> > >>
> > >> > >> where:
> > >> > >>
> > >> > >>   Ps = the number of patterns in the selector
> > >> > >>   S[I] = the step for selector pattern I (I being 1-based)
> > >> > >>   input(I) = the data input selected by selector pattern I (I being 1-based)
> > >> > >>   P[I] = the number of patterns in data input I
> > >> > >>
> > >> > >> That's getting quite complicated :-)  If we allow arbitrary P[...]
> > >> > >> and S[...] then it could also get large.  Perhaps we should finally
> > >> > >> give up on the general case and limit this to power-of-2 patterns and
> > >> > >> power-of-2 steps, so that least_common_multiple becomes MAX.  Maybe that
> > >> > >> simplifies other things as well.
> > >> > >>
> > >> > >> What do you think?
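The pattern-count formula above can be sketched as follows. This is a minimal illustration and not code from the patch: SelPattern and result_npatterns are hypothetical names, steps are assumed to be non-negative, and a step of 0 is treated as a non-stepped pattern that adds no constraint.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// One selector pattern: its step S and the number of patterns Pi in the
// data input it selects from.  (Hypothetical type, for illustration only.)
struct SelPattern {
  std::uint64_t step;
  std::uint64_t input_npatterns;
};

// Returns Ps * lcm_i (lcm (S[i], P[input (i)]) / S[i]), where Ps is the
// number of selector patterns.  Patterns with step 0 are skipped because
// the same-input-pattern restriction only applies to stepped patterns.
std::uint64_t result_npatterns(const std::vector<SelPattern> &sel) {
  std::uint64_t n = 1;
  for (const SelPattern &p : sel) {
    assert(p.input_npatterns > 0);
    if (p.step == 0)
      continue;
    n = std::lcm(n, std::lcm(p.step, p.input_npatterns) / p.step);
  }
  return sel.size() * n;
}
```

For a two-pattern selector with steps 1 and 2 that reads from a 4-pattern input, this gives 2 * lcm (4, 2) = 8 patterns.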
> > >> > > Hi Richard,
> > >> > > Thanks for the suggestions. Yeah I suppose we can initially add support for
> > >> > > power-of-2 patterns and power-of-2 steps and try to generalize it in
> > >> > > follow up patches if possible.
> > >> > >
> > >> > > Sorry if this sounds like a silly ques -- if we are going to have
> > >> > > pattern in selector, select *same pattern from same input vector*,
> > >> > > instead of re-encoding the selector to have N * Ps patterns, would it
> > >> > > make sense for elements in selector to denote pattern number itself
> > >> > > instead of element index
> > >> > > if input vectors are VLA ?
> > >> > >
> > >> > > For eg:
> > >> > > op0 = {1, 2, 3, 4, 1, 2, 3, 5, 1, 2, 3, 6, ...}
> > >> > > op1 = {...}
> > >> > > with npatterns == 4, nelts_per_pattern == 3,
> > >> > > sel = {0, 3} should pick pattern 0 and pattern 3 from op0,
> > >> > > so, res = {1, 4, 1, 5, 1, 6, ...}
> > >> > > Not sure if this is correct tho.
> > >> >
> > >> > This wouldn't allow us to represent things like a "duplicate one
> > >> > element", or "copy the leading N elements from the first input and
> > >> > the other elements from elements N+ of the second input", which we
> > >> > can with the current scheme.
> > >> >
> > >> > The restriction about each (unwound) selector pattern selecting from the
> > >> > same input pattern only applies to case where the selector pattern is
> > >> > stepped (and only applies to the stepped part of the pattern, not the
> > >> > leading element).  The restriction is also local to this code; it
> > >> > doesn't make other VEC_PERM_EXPRs invalid.
> > >> Hi Richard,
> > >> Thanks for the clarifications.
> > >> Just to clarify your approach with an eg:
> > >> Let selected input vector be:
> > >> arg0: {a0, b0, c0, d0,
> > >>           a0 + S, b0 + S, c0 + S, d0 + S,
> > >>           a0 + 2S, b0 + 2S, c0 + 2S, d0 + 2S, ...}
> > >> where arg0 has npatterns = 4, and nelts_per_pattern = 3.
> > >>
> > >> Let sel = {0, 0, 1, 2, 2, 4, ...}
> > >> where sel_npatterns = 2 and sel_nelts_per_pattern = 3
> > >>
> > >> So, the first pattern in sel:
> > >> p1: {0, 1, 2, ...} which will select {a0, b0, c0, ...}
> > >> which would be incorrect, since they belong to different patterns in arg0.
> > >> So to select elements from same pattern in arg0, we need to divide p1
> > >> into at least N1 = P_arg0 / S0 = 4 distinct patterns.
> > >>
> > >> Similarly for second pattern in sel:
> > >> p2: {0, 2, 4, ...}, we need to divide it into
> > >> at least N2 = P_arg0 / S1 = 2 distinct patterns.
> > >>
> > >> Select N = max(N1, N2) = 4
> > >> So, the selector will be extended to N * Ps * Es = 4 * 2 * 3 == 24 elements,
> > >> and will be re-encoded with N*Ps = 8 patterns:
> > >>
> > >> re-encoded sel:
> > >> {a0, b0, c0, d0, a0 + S, b0 + S, c0 + S, d0 + S,
> > >> a0 + 2S, b0 + 2S, c0 + 2S, d0 + 2S, a0 + 3S, b0 + 3S, c0 + 3S, d0 + 3S,
> > >> a0 + 4S, b0 + 4S, c0 + 4S, d0 + 4S, a0 + 5S, b0 + 5S, c0 + 5S, d0 + 5S,
> > >> ...}
> > >>
> > >> with 8 patterns,
> > >> p1: {a0, a0 + 2S, a0 + 4S, ...}
> > >> p2: {b0, b0 + 2S, b0 + 4S, ...}
> > >> ...
> > >> which select elements from same pattern from same input vector.
> > >> Does this look correct ?
> > >>
> > >> For feasibility, we can check initially that sel_npatterns, arg0_npatterns,
> > >> arg1_npatterns are powers of 2 and for each stepped pattern,
> > >> its step size S is a power of 2. I suppose this will be sufficient
> > >> to ensure that sel can be re-encoded with N*Ps npatterns
> > >> such that each new pattern selects elements from same pattern
> > >> of the input vector ?
> > >>
> > >> Then compute N:
> > >> N = 1;
> > >> for (every pattern p in sel)
> > >>   {
> > >>      op = corresponding input vector for pattern;
> > >>      S = step_size (p);
> > >>      N_pattern = max (S, npatterns (op)) / S;
> > >>      N = max(N, N_pattern)
> > >>   }
> > >>
> > >> and re-encode selector with N*Ps patterns.
> > >> I guess rest of the patch will mostly stay the same.
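The loop above can be written as a small standalone function. This is only a sketch under the power-of-2 restriction discussed in the thread: SteppedPattern, is_pow2 and unwind_factor are hypothetical names, steps are assumed to be non-negative, and a return value of 0 is used here to mean "restriction not met, give up".

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Step of one selector pattern and the npatterns of the input it reads.
// (Hypothetical type, for illustration only.)
struct SteppedPattern {
  unsigned step;
  unsigned input_npatterns;
};

static bool is_pow2(unsigned v) { return v != 0 && (v & (v - 1)) == 0; }

// Returns the factor N by which the selector's npatterns must be multiplied,
// or 0 if a step or an input npatterns is not a power of 2.
unsigned unwind_factor(const std::vector<SteppedPattern> &sel) {
  unsigned n = 1;
  for (const SteppedPattern &p : sel) {
    if (!is_pow2(p.input_npatterns))
      return 0;
    if (p.step == 0)
      continue;  // Non-stepped pattern: no constraint on N.
    if (!is_pow2(p.step))
      return 0;
    // For powers of 2, lcm (S, P) / S equals max (S, P) / S.
    n = std::max(n, std::max(p.step, p.input_npatterns) / p.step);
  }
  return n;
}
```

With steps 1 and 2 over a 4-pattern arg0, as in the example above, N_pattern is 4 and 2, so N = 4 and the selector is re-encoded with 4 * 2 = 8 patterns.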
> > > Hi,
> > > I have attached a POC patch based on the above approach.
> > > For the above eg:
> > > arg0 = {1, 11, 2, 12, 3, 13, ...} // npatterns = 2, nelts_per_pattern = 3,
> > > and
> > > sel = {0, 0, 0, 1, 0, 2, ...}
> > > with sel_npatterns == 2 and sel_nelts_per_pattern == 3.
> > >
> > > For pattern, {0, 1, 2, ...} it will select elements from different
> > > patterns from arg0, which is incorrect.
> > > So we choose N = P1/S = 2/1 = 2, where P1 is the number of patterns in arg0.
> > > So re-encoded sel = { 0, 0, 0, 1, 0, 2, 0, 3, 0, 4, 0, 5, ...}
> > > with following patterns:
> > > p1 = { 0, ... }
> > > p2 = { 0, 2, 4, ... }
> > > p3 = { 0, ... }
> > > p4 = { 1, 3, 5, ... }
> > > which should be correct since each element from the respective
> > > patterns in sel chooses
> > > elements from same pattern from arg0.
> > > So, res = { 1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13, ... }
> > > Does this look correct ?
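The re-encoded selector and the result in this example can be checked by expanding the stepped encoding by hand. The sketch below assumes nelts_per_pattern == 3 and a selector that only indexes arg0; Encoded, element_at and permute_prefix are hypothetical names and do not come from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A vector in (npatterns, nelts_per_pattern == 3) encoded form.  ELTS holds
// the 3 * npatterns leading elements, interleaved by pattern.
struct Encoded {
  unsigned npatterns;
  std::vector<long> elts;
};

// Element I of the (conceptually unbounded) vector described by V.  Beyond
// the encoded elements, each pattern continues with step a2 - a1.
long element_at(const Encoded &v, std::size_t i) {
  assert(v.elts.size() == 3u * v.npatterns);
  std::size_t pattern = i % v.npatterns;
  std::size_t k = i / v.npatterns;  // Position within the pattern.
  if (k < 3)
    return v.elts[i];
  long a1 = v.elts[pattern + v.npatterns];
  long a2 = v.elts[pattern + 2 * v.npatterns];
  return a2 + static_cast<long>(k - 2) * (a2 - a1);
}

// First COUNT elements of the permutation of ARG0 by SEL, assuming every
// selector index is non-negative and refers to ARG0.
std::vector<long> permute_prefix(const Encoded &arg0, const Encoded &sel,
                                 std::size_t count) {
  std::vector<long> res;
  for (std::size_t i = 0; i < count; ++i)
    res.push_back(element_at(arg0, static_cast<std::size_t>(element_at(sel, i))));
  return res;
}
```

Expanding 12 elements (N * Ps * Es = 2 * 2 * 3) with arg0 = {1, 11, 2, 12, 3, 13} and sel = {0, 0, 0, 1, 0, 2} reproduces the selector and the result listed above.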
> >
> > Yeah.  But like I said above:
> >
> >   The restriction about each (unwound) selector pattern selecting from the
> >   same input pattern only applies to case where the selector pattern is
> >   stepped (and only applies to the stepped part of the pattern, not the
> >   leading element).
> >
> > If the selector nelts-per-pattern is 1 or 2 then we can support all
> > power-of-2 cases, with the final npatterns being the maximum of the
> > source nelts-per-patterns.
> >
> > Also, going back to an earlier part of the discussion, I think we
> > should use this technique for both VLA and VLS, and only fall back
> > to the VLS-specific approach if the VLA approach fails.
> >
> > So I suggest we put the VLA code in its own function and have
> > the VLS-only path kick in when the VLA code fails.  If the code is
> > having to pass a lot of state around, it might make sense to define
> > a local class, store the state in member variables, and use member
> > functions for the various subroutines.  I don't know if that will
> > work out neater though.
> Hi Richard,
> Thanks for the suggestions. I have attached an updated POC patch,
> that does the following:
> (a) Uses VLA approach by default, and falls back to VLS specific
> folding if VLA approach fails for VLS vectors.
> (b) Separates cases for sel_nelts_per_pattern < 3 and
> sel_nelts_per_pattern == 3.
> (c) Allows a0 to select a different vector from a1 .. ae.
> I have written a few unit tests in the patch for testing the same.
> Does the patch look in the right direction ?
>
> The patch has an issue for the following case marked as "case 9"
> in test_vec_perm_vla_folding:
> arg0 = { 1, 11, 2, 12, 3, 13, ... }
> arg1 = { 21, 31, 22, 32, 23, 33, ... }
> arg0 and arg1 have npatterns = 2, nelts_per_pattern = 3.
>
> mask = { 4 + 4x, 5 + 4x, 6 + 4x, ... }
> where 4 + 4x is runtime vector length.
> npatterns = 1, nelts_per_pattern = 3.
>
> a1 = 5 + 4x
> ae = a1 + (esel - 2) * S
>      = (5 + 4x) + (4 + 4x - 2) * 1
>      = 7 + 8x
>
> Since (7 + 8x) /trunc (4 + 4x) returns false, fold_vec_perm returns NULL_TREE.
> Is that expected for the above mask ?
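The arithmetic for ae in this case can be reproduced with a toy two-coefficient polynomial type. This is only a sketch of the calculation: Poly, last_stepped_index and eval_at are hypothetical names, and the code does not model GCC's poly_uint64 or can_div_trunc_p.

```cpp
#include <cassert>

// c0 + c1 * x, where x is the runtime vector-length multiplier.
// (Hypothetical stand-in for a two-coefficient poly_int.)
struct Poly {
  long c0;
  long c1;
};

Poly add(Poly a, Poly b) { return Poly{a.c0 + b.c0, a.c1 + b.c1}; }
Poly scale(Poly a, long s) { return Poly{a.c0 * s, a.c1 * s}; }
long eval_at(Poly a, long x) { return a.c0 + a.c1 * x; }

// ae = a1 + (esel - 2) * S: the last index reached by a stepped selector
// pattern that has ESEL elements, second element A1 and step S.
Poly last_stepped_index(Poly a1, Poly esel, long step) {
  assert(step != 0);
  return add(a1, scale(add(esel, Poly{-2, 0}), step));
}
```

With a1 = 5 + 4x, esel = 4 + 4x and S = 1 this yields 7 + 8x, matching the value computed above.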
>
> I intended it to select the second vector similar to,
> sel = { 0, 1, 2 .. }, which would select the first vector
> by re-encoding sel as { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ... }
> with two patterns: {0, 2, 4, ...} and {1, 3, 5, ...}
> The first would select elements from first pattern from arg0,
> while the second pattern would select elements from second pattern from arg0.
> with result effectively having same encoding as arg0.
> Shouldn't sel = { 4 + 4x, 5 + 4x, 6 + 4x, ... } similarly select arg1 ?
Hi Richard,
ping https://gcc.gnu.org/pipermail/gcc-patches/2022-December/608363.html

Thanks,
Prathamesh
>
> PS: I will be on vacation next week.
>
> Thanks,
> Prathamesh
>
> >
> > > @@ -10494,38 +10497,55 @@ fold_mult_zconjz (location_t loc, tree type, tree expr)
> > >                         build_zero_cst (itype));
> > >  }
> > >
> > > +/* Check if PATTERN in SEL selects either ARG0 or ARG1,
> > > +   and return the selected arg, otherwise return NULL_TREE.  */
> > >
> > > -/* Helper function for fold_vec_perm.  Store elements of VECTOR_CST or
> > > -   CONSTRUCTOR ARG into array ELTS, which has NELTS elements, and return
> > > -   true if successful.  */
> > > -
> > > -static bool
> > > -vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
> > > +static tree
> > > +get_vector_for_pattern (tree arg0, tree arg1,
> > > +                     const vec_perm_indices &sel, unsigned pattern,
> > > +                     unsigned sel_npatterns, int &S)
> > >  {
> > > -  unsigned HOST_WIDE_INT i, nunits;
> > > +  unsigned sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
> > >
> > > -  if (TREE_CODE (arg) == VECTOR_CST
> > > -      && VECTOR_CST_NELTS (arg).is_constant (&nunits))
> > > +  poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > > +  poly_uint64 nsel = sel.length ();
> > > +  poly_uint64 esel;
> > > +
> > > +  if (!multiple_p (nsel, sel_npatterns, &esel))
> > > +    return NULL_TREE;
> > > +
> > > +  poly_uint64 a1 = sel[pattern + sel_npatterns];
> > > +  S = 0;
> > > +  if (sel_nelts_per_pattern == 3)
> > >      {
> > > -      for (i = 0; i < nunits; ++i)
> > > -     elts[i] = VECTOR_CST_ELT (arg, i);
> > > +      poly_uint64 a2 = sel[pattern + 2 * sel_npatterns];
> > > +      S = (a2 - a1).to_constant ();
> >
> > The code hasn't proven that this to_constant is safe.
> >
> > > +      if (S != 0 && !pow2p_hwi (S))
> > > +     return NULL_TREE;
> > >      }
> > > -  else if (TREE_CODE (arg) == CONSTRUCTOR)
> > > +
> > > +  poly_uint64 ae = a1 + (esel - 2) * S;
> > > +  uint64_t q1, qe;
> > > +  poly_uint64 r1, re;
> > > +
> > > +  if (!can_div_trunc_p (a1, n1, &q1, &r1)
> > > +      || !can_div_trunc_p (ae, n1, &qe, &re)
> > > +      || (q1 != qe))
> > > +    return NULL_TREE;
> >
> > Going back to the above: this check doesn't make sense for
> > sel_nelts_per_pattern != 3.
> >
> > Thanks,
> > Richard
> >
> > > +
> > > +  tree arg = ((q1 & 1) == 0) ? arg0 : arg1;
> > > +
> > > +  if (S < 0)
> > >      {
> > > -      constructor_elt *elt;
> > > +      poly_uint64 a0 = sel[pattern];
> > > +      if (!known_eq (S, a1 - a0))
> > > +        return NULL_TREE;
> > >
> > > -      FOR_EACH_VEC_SAFE_ELT (CONSTRUCTOR_ELTS (arg), i, elt)
> > > -     if (i >= nelts || TREE_CODE (TREE_TYPE (elt->value)) == VECTOR_TYPE)
> > > -       return false;
> > > -     else
> > > -       elts[i] = elt->value;
> > > +      if (!known_gt (re, VECTOR_CST_NPATTERNS (arg)))
> > > +        return NULL_TREE;
> > >      }
> > > -  else
> > > -    return false;
> > > -  for (; i < nelts; i++)
> > > -    elts[i]
> > > -      = fold_convert (TREE_TYPE (TREE_TYPE (arg)), integer_zero_node);
> > > -  return true;
> > > +
> > > +  return arg;
> > >  }
> > >
> > >  /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL
> > > @@ -10539,41 +10559,135 @@ fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
> > >    unsigned HOST_WIDE_INT nelts;
> > >    bool need_ctor = false;
> > >
> > > -  if (!sel.length ().is_constant (&nelts))
> > > -    return NULL_TREE;
> > > -  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts)
> > > -           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts)
> > > -           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts));
> > > +  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), sel.length ())
> > > +           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)),
> > > +                        TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1))));
> > >    if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type)
> > >        || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type))
> > >      return NULL_TREE;
> > >
> > > -  tree *in_elts = XALLOCAVEC (tree, nelts * 2);
> > > -  if (!vec_cst_ctor_to_array (arg0, nelts, in_elts)
> > > -      || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts))
> > > +  unsigned res_npatterns = 0;
> > > +  unsigned res_nelts_per_pattern = 0;
> > > +  unsigned sel_npatterns = 0;
> > > +  tree *vector_for_pattern = NULL;
> > > +
> > > +  if (TREE_CODE (arg0) == VECTOR_CST
> > > +      && TREE_CODE (arg1) == VECTOR_CST
> > > +      && !sel.length ().is_constant ())
> > > +    {
> > > +      unsigned arg0_npatterns = VECTOR_CST_NPATTERNS (arg0);
> > > +      unsigned arg1_npatterns = VECTOR_CST_NPATTERNS (arg1);
> > > +      sel_npatterns = sel.encoding ().npatterns ();
> > > +
> > > +      if (!pow2p_hwi (arg0_npatterns)
> > > +       || !pow2p_hwi (arg1_npatterns)
> > > +       || !pow2p_hwi (sel_npatterns))
> > > +        return NULL_TREE;
> > > +
> > > +      unsigned N = 1;
> > > +      vector_for_pattern = XALLOCAVEC (tree, sel_npatterns);
> > > +      for (unsigned i = 0; i < sel_npatterns; i++)
> > > +     {
> > > +       int S = 0;
> > > +       tree op = get_vector_for_pattern (arg0, arg1, sel, i, sel_npatterns, S);
> > > +       if (!op)
> > > +         return NULL_TREE;
> > > +       vector_for_pattern[i] = op;
> > > +       unsigned N_pattern =
> > > +         (S > 0) ? std::max<int>(S, VECTOR_CST_NPATTERNS (op)) / S : 1;
> > > +       N = std::max (N, N_pattern);
> > > +     }
> > > +
> > > +      res_npatterns
> > > +        = std::max (sel_npatterns * N, std::max (arg0_npatterns, arg1_npatterns));
> > > +
> > > +      res_nelts_per_pattern
> > > +     = std::max(sel.encoding ().nelts_per_pattern (),
> > > +                std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
> > > +                          VECTOR_CST_NELTS_PER_PATTERN (arg1)));
> > > +    }
> > > +  else if (sel.length ().is_constant (&nelts)
> > > +        && TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
> > > +        && TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).to_constant () == nelts)
> > > +    {
> > > +      /* For VLS vectors, treat all vectors with
> > > +      npatterns = nelts, nelts_per_pattern = 1. */
> > > +      res_npatterns = sel_npatterns = nelts;
> > > +      res_nelts_per_pattern = 1;
> > > +      vector_for_pattern = XALLOCAVEC (tree, nelts);
> > > +      for (unsigned i = 0; i < nelts; i++)
> > > +        {
> > > +       HOST_WIDE_INT index;
> > > +       if (!sel[i].is_constant (&index))
> > > +         return NULL_TREE;
> > > +       vector_for_pattern[i] = (index < nelts) ? arg0 : arg1;
> > > +     }
> > > +    }
> > > +  else
> > >      return NULL_TREE;
> > >
> > > -  tree_vector_builder out_elts (type, nelts, 1);
> > > -  for (i = 0; i < nelts; i++)
> > > +  tree_vector_builder out_elts (type, res_npatterns,
> > > +                             res_nelts_per_pattern);
> > > +  unsigned res_nelts = res_npatterns * res_nelts_per_pattern;
> > > +  for (unsigned i = 0; i < res_nelts; i++)
> > >      {
> > > -      HOST_WIDE_INT index;
> > > -      if (!sel[i].is_constant (&index))
> > > -     return NULL_TREE;
> > > -      if (!CONSTANT_CLASS_P (in_elts[index]))
> > > -     need_ctor = true;
> > > -      out_elts.quick_push (unshare_expr (in_elts[index]));
> > > +      /* For VLA vectors, i % sel_npatterns would give the original
> > > +         pattern the element belongs to, which is sufficient to get the arg.
> > > +      Even if sel_npatterns has been multiplied by N,
> > > +      they will always come from the same input vector.
> > > +      For VLS vectors, sel_npatterns == res_nelts == nelts,
> > > +      so i % sel_npatterns == i since i < nelts */
> > > +
> > > +      tree arg = vector_for_pattern[i % sel_npatterns];
> > > +      unsigned HOST_WIDE_INT index;
> > > +
> > > +      if (arg == arg0)
> > > +     {
> > > +       if (!sel[i].is_constant ())
> > > +         return NULL_TREE;
> > > +       index = sel[i].to_constant ();
> > > +     }
> > > +      else
> > > +        {
> > > +       gcc_assert (arg == arg1);
> > > +       poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > > +       uint64_t q;
> > > +       poly_uint64 r;
> > > +
> > > +       /* Divide sel[i] by input vector length, to obtain remainder,
> > > +          which would be the index for either input vector.  */
> > > +       if (!can_div_trunc_p (sel[i], n1, &q, &r))
> > > +         return NULL_TREE;
> > > +
> > > +       if (!r.is_constant (&index))
> > > +         return NULL_TREE;
> > > +     }
> > > +
> > > +      tree elem;
> > > +      if (TREE_CODE (arg) == CONSTRUCTOR)
> > > +        {
> > > +       gcc_assert (index < nelts);
> > > +       if (index >= vec_safe_length (CONSTRUCTOR_ELTS (arg)))
> > > +         return NULL_TREE;
> > > +       elem = CONSTRUCTOR_ELT (arg, index)->value;
> > > +       if (VECTOR_TYPE_P (TREE_TYPE (elem)))
> > > +         return NULL_TREE;
> > > +       need_ctor = true;
> > > +     }
> > > +      else
> > > +        elem = vector_cst_elt (arg, index);
> > > +      out_elts.quick_push (elem);
> > >      }
> > >
> > >    if (need_ctor)
> > >      {
> > >        vec<constructor_elt, va_gc> *v;
> > > -      vec_alloc (v, nelts);
> > > -      for (i = 0; i < nelts; i++)
> > > +      vec_alloc (v, res_nelts);
> > > +      for (i = 0; i < res_nelts; i++)
> > >       CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, out_elts[i]);
> > >        return build_constructor (type, v);
> > >      }
> > > -  else
> > > -    return out_elts.build ();
> > > +  return out_elts.build ();
> > >  }
> > >
> > >  /* Try to fold a pointer difference of type TYPE two address expressions of
> > > @@ -16910,6 +17024,97 @@ test_vec_duplicate_folding ()
> > >    ASSERT_TRUE (operand_equal_p (dup5_expr, dup5_cst, 0));
> > >  }
> > >
> > > +static tree
> > > +build_vec_int_cst (unsigned npatterns, unsigned nelts_per_pattern,
> > > +                int *encoded_elems)
> > > +{
> > > +  scalar_int_mode int_mode = SCALAR_INT_TYPE_MODE (integer_type_node);
> > > +  machine_mode vmode = targetm.vectorize.preferred_simd_mode (int_mode);
> > > +  //machine_mode vmode = VNx4SImode;
> > > +  poly_uint64 nunits = GET_MODE_NUNITS (vmode);
> > > +  tree vectype = build_vector_type (integer_type_node, nunits);
> > > +
> > > +  tree_vector_builder builder (vectype, npatterns, nelts_per_pattern);
> > > +  for (unsigned i = 0; i < npatterns * nelts_per_pattern; i++)
> > > +    builder.quick_push (build_int_cst (integer_type_node, encoded_elems[i]));
> > > +  return builder.build ();
> > > +}
> > > +
> > > +static void
> > > +test_vec_perm_vla_folding ()
> > > +{
> > > +  int arg0_elems[] = { 1, 11, 2, 12, 3, 13 };
> > > +  tree arg0 = build_vec_int_cst (2, 3, arg0_elems);
> > > +
> > > +  int arg1_elems[] = { 21, 31, 22, 32, 23, 33 };
> > > +  tree arg1 = build_vec_int_cst (2, 3, arg1_elems);
> > > +
> > > +  if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
> > > +      || TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)).is_constant ())
> > > +    return;
> > > +
> > > +  /* Case 1: For mask: {0, 1, 2, ...}, npatterns == 1, nelts_per_pattern == 3,
> > > +     should select arg0.  */
> > > +  {
> > > +    int mask_elems[] = {0, 1, 2};
> > > +    tree mask = build_vec_int_cst (1, 3, mask_elems);
> > > +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> > > +    ASSERT_TRUE (res != NULL_TREE);
> > > +    ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == 2);
> > > +    ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == 3);
> > > +
> > > +    unsigned res_nelts = vector_cst_encoded_nelts (res);
> > > +    for (unsigned i = 0; i < res_nelts; i++)
> > > +      ASSERT_TRUE (operand_equal_p (VECTOR_CST_ELT (res, i),
> > > +                                 VECTOR_CST_ELT (arg0, i), 0));
> > > +  }
> > > +
> > > +  /* Case 2: For mask: {4, 5, 6, ...}, npatterns == 1, nelts_per_pattern == 3,
> > > +     should return NULL because for len = 4 + 4x,
> > > +     if x == 0, we select from arg1
> > > +     if x > 0, we select from arg0
> > > +     and thus cannot determine result at compile time.  */
> > > +  {
> > > +    int mask_elems[] = {4, 5, 6};
> > > +    tree mask = build_vec_int_cst (1, 3, mask_elems);
> > > +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> > > +    gcc_assert (res == NULL_TREE);
> > > +  }
> > > +
> > > +  /* Case 3:
> > > +     mask: {0, 0, 0, 1, 0, 2, ...}
> > > +     npatterns == 2, nelts_per_pattern == 3
> > > +     Pattern {0, ...} should select arg0[0], ie, 1.
> > > +     Pattern {0, 1, 2, ...} should select arg0: {1, 11, 2, ...},
> > > +     so res = {1, 1, 1, 11, 1, 2, ...}.  */
> > > +  {
> > > +    int mask_elems[] = {0, 0, 0, 1, 0, 2};
> > > +    tree mask = build_vec_int_cst (2, 3, mask_elems);
> > > +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> > > +    ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == 4);
> > > +    ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == 3);
> > > +
> > > +    /* Check encoding: {1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13, ...}  */
> > > +    int res_encoded_elems[] = {1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13};
> > > +    for (unsigned i = 0; i < vector_cst_encoded_nelts (res); i++)
> > > +      ASSERT_TRUE (wi::to_wide(VECTOR_CST_ELT (res, i)) == res_encoded_elems[i]);
> > > +  }
> > > +
> > > +  /* Case 4:
> > > +     mask: {0, 4 + 4x, 0, 5 + 4x, 0, 6 + 4x, ...}
> > > +     npatterns == 2, nelts_per_pattern == 3
> > > > +     Pattern {0, ...} should select arg0[0]
> > > +     Pattern {4 + 4x, 5 + 4x, 6 + 4x, ...} should select from arg1, since:
> > > +     a1 = 5 + 4x
> > > +     ae = (5 + 4x) + ((4 + 4x) / 2 - 2) * 1
> > > +        = 5 + 6x
> > > +     Since a1/4+4x == ae/4+4x == 1, we select arg1[0], arg1[1], arg1[2], ...
> > > +     res: {1, 21, 1, 31, 1, 22, ... }
> > > +     FIXME: How to build vector with poly_int elems ?  */
> > > +
> > > +  /* Case 5: S < 0.  */
> > > +}
> > > +
> > >  /* Run all of the selftests within this file.  */
> > >
> > >  void
> > > @@ -16918,6 +17123,7 @@ fold_const_cc_tests ()
> > >    test_arithmetic_folding ();
> > >    test_vector_folding ();
> > >    test_vec_duplicate_folding ();
> > > +  test_vec_perm_vla_folding ();
> > >  }
> > >
> > >  } // namespace selftest

Prathamesh Kulkarni Jan. 17, 2023, 11:54 a.m. UTC | #27

On Mon, 26 Dec 2022 at 09:56, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Tue, 13 Dec 2022 at 11:35, Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> >
> > On Tue, 6 Dec 2022 at 21:00, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> > >
> > > Prathamesh Kulkarni via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > > > On Fri, 4 Nov 2022 at 14:00, Prathamesh Kulkarni
> > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > >>
> > > >> On Mon, 31 Oct 2022 at 15:27, Richard Sandiford
> > > >> <richard.sandiford@arm.com> wrote:
> > > >> >
> > > >> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > > >> > > On Wed, 26 Oct 2022 at 21:07, Richard Sandiford
> > > >> > > <richard.sandiford@arm.com> wrote:
> > > >> > >>
> > > >> > >> Sorry for the slow response.  I wanted to find some time to think
> > > >> > >> about this a bit more.
> > > >> > >>
> > > >> > >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > > >> > >> > On Fri, 30 Sept 2022 at 21:38, Richard Sandiford
> > > >> > >> > <richard.sandiford@arm.com> wrote:
> > > >> > >> >>
> > > >> > >> >> Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > > >> > >> >> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > > >> > >> >> >> Sorry to ask a silly question but in which case shall we select 2nd vector ?
> > > >> > >> >> >> For num_poly_int_coeffs == 2,
> > > >> > >> >> >> a1 /trunc n1 == (a1 + 0x) / (n1.coeffs[0] + n1.coeffs[1]*x)
> > > >> > >> >> >> If a1/trunc n1 succeeds,
> > > >> > >> >> >> 0 / n1.coeffs[1] == a1/n1.coeffs[0] == 0.
> > > >> > >> >> >> So, a1 has to be < n1.coeffs[0] ?
> > > >> > >> >> >
> > > >> > >> >> > Remember that a1 is itself a poly_int.  It's not necessarily a constant.
> > > >> > >> >> >
> > > >> > >> >> > E.g. the TRN1 .D instruction maps to a VEC_PERM_EXPR with the selector:
> > > >> > >> >> >
> > > >> > >> >> >   { 0, 2 + 2x, 1, 4 + 2x, 2, 6 + 2x, ... }
> > > >> > >> >>
> > > >> > >> >> Sorry, should have been:
> > > >> > >> >>
> > > >> > >> >>   { 0, 2 + 2x, 2, 4 + 2x, 4, 6 + 2x, ... }
> > > >> > >> > Hi Richard,
> > > >> > >> > Thanks for the clarifications, and sorry for late reply.
> > > >> > >> > I have attached POC patch that tries to implement the above approach.
> > > >> > >> > Passes bootstrap+test on x86_64-linux-gnu and aarch64-linux-gnu for VLS vectors.
> > > >> > >> >
> > > >> > >> > For VLA vectors, I have only done limited testing so far.
> > > >> > >> > It seems to pass couple of tests written in the patch for
> > > >> > >> > nelts_per_pattern == 3,
> > > >> > >> > and folds the following svld1rq test:
> > > >> > >> > int32x4_t v = {1, 2, 3, 4};
> > > >> > >> > return svld1rq_s32 (svptrue_b8 (), &v[0])
> > > >> > >> > into:
> > > >> > >> > return {1, 2, 3, 4, ...};
> > > >> > >> > I will try to bootstrap+test it on SVE machine to test further for VLA folding.
> > > >> > >> >
> > > >> > >> > I have a couple of questions:
> > > >> > >> > 1] When mask selects elements from same vector but from different patterns:
> > > >> > >> > For eg:
> > > >> > >> > arg0 = {1, 11, 2, 12, 3, 13, ...},
> > > >> > >> > arg1 = {21, 31, 22, 32, 23, 33, ...},
> > > >> > >> > mask = {0, 0, 0, 1, 0, 2, ... },
> > > >> > >> > All have npatterns = 2, nelts_per_pattern = 3.
> > > >> > >> >
> > > >> > >> > With above mask,
> > > >> > >> > Pattern {0, ...} selects arg0[0], ie {1, ...}
> > > >> > >> > Pattern {0, 1, 2, ...} selects arg0[0], arg0[1], arg0[2], ie {1, 11, 2, ...}
> > > >> > >> > While arg0[0] and arg0[2] belong to same pattern, arg0[1] belongs to different
> > > >> > >> > pattern in arg0.
> > > >> > >> > The result is:
> > > >> > >> > res = {1, 1, 1, 11, 1, 2, ...}
> > > >> > >> > In this case, res's 2nd pattern {1, 11, 2, ...} is encoded
> > > >> > >> > with a0 = 1, a1 = 11, S = -9.
> > > >> > >> > Is that expected tho ? It seems to create a new encoding which
> > > >> > >> > wasn't present in the input vector. For instance, the next elem in
> > > >> > >> > sequence would be -7,
> > > >> > >> > which is not present originally in arg0.
> > > >> > >>
> > > >> > >> Yeah, you're right, sorry.  Going back to:
> > > >> > >>
> > > >> > >> (2) The explicit encoding can be used to produce a sequence of N*Ex*Px
> > > >> > >>     elements for any integer N.  This extended sequence can be reencoded
> > > >> > >>     as having N*Px patterns, with Ex staying the same.
> > > >> > >>
> > > >> > >> I guess we need to pick an N for the selector such that each new
> > > >> > >> selector pattern (each one out of the N*Px patterns) selects from
> > > >> > >> the *same pattern* of the same data input.
> > > >> > >>
> > > >> > >> So if a particular pattern in the selector has a step S, and the data
> > > >> > >> input it selects from has Pi patterns, N*S must be a multiple of Pi.
> > > >> > >> N must be a multiple of least_common_multiple(S,Pi)/S.
> > > >> > >>
> > > >> > >> I think that means that the total number of patterns in the result
> > > >> > >> (Pr from previous messages) can safely be:
> > > >> > >>
> > > >> > >>   Ps * least_common_multiple(
> > > >> > >>     least_common_multiple(S[1], P[input(1)]) / S[1],
> > > >> > >>     ...
> > > >> > >>     least_common_multiple(S[Ps], P[input(Ps)]) / S[Ps]
> > > >> > >>   )
> > > >> > >>
> > > >> > >> where:
> > > >> > >>
> > > >> > >>   Ps = the number of patterns in the selector
> > > >> > >>   S[I] = the step for selector pattern I (I being 1-based)
> > > >> > >>   input(I) = the data input selected by selector pattern I (I being 1-based)
> > > >> > >>   P[I] = the number of patterns in data input I
> > > >> > >>
> > > >> > >> That's getting quite complicated :-)  If we allow arbitrary P[...]
> > > >> > >> and S[...] then it could also get large.  Perhaps we should finally
> > > >> > >> give up on the general case and limit this to power-of-2 patterns and
> > > >> > >> power-of-2 steps, so that least_common_multiple becomes MAX.  Maybe that
> > > >> > >> simplifies other things as well.
> > > >> > >>
> > > >> > >> What do you think?
> > > >> > > Hi Richard,
> > > >> > > Thanks for the suggestions. Yeah I suppose we can initially add support for
> > > >> > > power-of-2 patterns and power-of-2 steps and try to generalize it in
> > > >> > > follow up patches if possible.
> > > >> > >
> > > >> > > Sorry if this sounds like a silly ques -- if we are going to have
> > > >> > > pattern in selector, select *same pattern from same input vector*,
> > > >> > > instead of re-encoding the selector to have N * Ps patterns, would it
> > > >> > > make sense for elements in selector to denote pattern number itself
> > > >> > > instead of element index
> > > >> > > if input vectors are VLA ?
> > > >> > >
> > > >> > > For eg:
> > > >> > > op0 = {1, 2, 3, 4, 1, 2, 3, 5, 1, 2, 3, 6, ...}
> > > >> > > op1 = {...}
> > > >> > > with npatterns == 4, nelts_per_pattern == 3,
> > > >> > > sel = {0, 3} should pick pattern 0 and pattern 3 from op0,
> > > >> > > so, res = {1, 4, 1, 5, 1, 6, ...}
> > > >> > > Not sure if this is correct tho.
> > > >> >
> > > >> > This wouldn't allow us to represent things like a "duplicate one
> > > >> > element", or "copy the leading N elements from the first input and
> > > >> > the other elements from elements N+ of the second input", which we
> > > >> > can with the current scheme.
> > > >> >
> > > >> > The restriction about each (unwound) selector pattern selecting from the
> > > >> > same input pattern only applies to case where the selector pattern is
> > > >> > stepped (and only applies to the stepped part of the pattern, not the
> > > >> > leading element).  The restriction is also local to this code; it
> > > >> > doesn't make other VEC_PERM_EXPRs invalid.
> > > >> Hi Richard,
> > > >> Thanks for the clarifications.
> > > >> Just to clarify your approach with an eg:
> > > >> Let selected input vector be:
> > > >> arg0: {a0, b0, c0, d0,
> > > >>           a0 + S, b0 + S, c0 + S, d0 + S,
> > > >>           a0 + 2S, b0 + 2S, c0 + 2S, d0 + 2S, ...}
> > > >> where arg0 has npatterns = 4, and nelts_per_pattern = 3.
> > > >>
> > > >> Let sel = {0, 0, 1, 2, 2, 4, ...}
> > > >> where sel_npatterns = 2 and sel_nelts_per_pattern = 3
> > > >>
> > > >> So, the first pattern in sel:
> > > >> p1: {0, 1, 2, ...} which will select {a0, b0, c0, ...}
> > > >> which would be incorrect, since they belong to different patterns in arg0.
> > > >> So to select elements from same pattern in arg0, we need to divide p1
> > > >> into at least N1 = P_arg0 / S0 = 4 distinct patterns.
> > > >>
> > > >> Similarly for second pattern in sel:
> > > >> p2: {0, 2, 4, ...}, we need to divide it into
> > > >> at least N2 = P_arg0 / S1 = 2 distinct patterns.
> > > >>
> > > >> Select N = max(N1, N2) = 4
> > > >> So, the selector will be extended to N * Ps * Es = 4 * 2 * 3 == 24 elements,
> > > >> and will be re-encoded with N*Ps = 8 patterns:
> > > >>
> > > >> re-encoded sel:
> > > >> {a0, b0, c0, d0, a0 + S, b0 + S, c0 + S, d0 + S,
> > > >> a0 + 2S, b0 + 2S, c0 + 2S, d0 + 2S, a0 + 3S, b0 + 3S, c0 + 3S, d0 + 3S,
> > > >> a0 + 4S, b0 + 4S, c0 + 4S, d0 + 4S, a0 + 5S, b0 + 5S, c0 + 5S, d0 + 5S,
> > > >> ...}
> > > >>
> > > >> with 8 patterns,
> > > >> p1: {a0, a0 + 2S, a0 + 4S, ...}
> > > >> p2: {b0, b0 + 2S, b0 + 4S, ...}
> > > >> ...
> > > >> which select elements from same pattern from same input vector.
> > > >> Does this look correct ?
> > > >>
> > > >> For feasibility, we can check initially that sel_npatterns, arg0_npatterns,
> > > >> arg1_npatterns are powers of 2 and for each stepped pattern,
> > > >> its step size S is a power of 2. I suppose this will be sufficient
> > > >> to ensure that sel can be re-encoded with N*Ps npatterns
> > > >> such that each new pattern selects elements from same pattern
> > > >> of the input vector ?
> > > >>
> > > >> Then compute N:
> > > >> N = 1;
> > > >> for (every pattern p in sel)
> > > >>   {
> > > >>      op = corresponding input vector for pattern;
> > > >>      S = step_size (p);
> > > >>      N_pattern = max (S, npatterns (op)) / S;
> > > >>      N = max(N, N_pattern)
> > > >>   }
> > > >>
> > > >> and re-encode selector with N*Ps patterns.
> > > >> I guess rest of the patch will mostly stay the same.
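The loop above can be sketched as standalone C++, assuming power-of-2 steps and npatterns as proposed. The `sel_pattern` struct and `compute_reencode_factor` are hypothetical names for illustration, not part of the patch. Treating a step of 0 as placing no constraint on N is also an assumption of this sketch.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical description of one selector pattern: its step S and the
// npatterns of the input vector that the pattern selects from.
struct sel_pattern
{
  unsigned step;             // 0 means the pattern is not stepped
  unsigned input_npatterns;  // npatterns of the selected input vector
};

static bool
is_pow2 (unsigned x)
{
  return x != 0 && (x & (x - 1)) == 0;
}

// Return the re-encoding factor N, or 0 if some step or npatterns is
// not a power of 2 (the case the thread proposes to reject).
static unsigned
compute_reencode_factor (const std::vector<sel_pattern> &patterns)
{
  unsigned n = 1;
  for (const sel_pattern &p : patterns)
    {
      if (!is_pow2 (p.input_npatterns))
        return 0;
      if (p.step == 0)
        continue;  // assumption: unstepped patterns do not constrain N
      if (!is_pow2 (p.step))
        return 0;
      // With powers of 2, lcm (S, P) / S reduces to max (S, P) / S.
      unsigned n_pattern = std::max (p.step, p.input_npatterns) / p.step;
      n = std::max (n, n_pattern);
    }
  return n;
}
```

For the example above (steps 1 and 2, both selecting from an input with 4 patterns) this gives N = 4, so the selector would be re-encoded with N * Ps = 8 patterns.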
> > > > Hi,
> > > > I have attached a POC patch based on the above approach.
> > > > For the above eg:
> > > > arg0 = {1, 11, 2, 12, 3, 13, ...} // npatterns = 2, nelts_per_pattern = 3,
> > > > and
> > > > sel = {0, 0, 0, 1, 0, 2, ...}
> > > > with sel_npatterns == 2 and sel_nelts_per_pattern == 3.
> > > >
> > > > For pattern, {0, 1, 2, ...} it will select elements from different
> > > > patterns from arg0, which is incorrect.
> > > > So we choose N = P1/S = 2/1 = 2, where P1 is the number of patterns in arg0.
> > > > So re-encoded sel = { 0, 0, 0, 1, 0, 2, 0, 3, 0, 4, 0, 5, ...}
> > > > with following patterns:
> > > > p1 = { 0, ... }
> > > > p2 = { 0, 2, 4, ... }
> > > > p3 = { 0, ... }
> > > > p4 = { 1, 3, 5, ... }
> > > > which should be correct since each element from the respective
> > > > patterns in sel chooses
> > > > elements from same pattern from arg0.
> > > > So, res = { 1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13, ... }
> > > > Does this look correct ?
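The worked example above can be checked with a small standalone model of the (npatterns, nelts_per_pattern) encoding. This is only a sketch: `encoded_vec`, `decode_elt` and `permute_prefix` are hypothetical helpers, not GCC functions, and the sketch assumes every selector index refers to arg0 (it does not model arg1 or runtime vector lengths).

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an encoded vector constant.
struct encoded_vec
{
  unsigned npatterns;
  unsigned nelts_per_pattern;  // 1, 2 or 3
  std::vector<int64_t> elems;  // npatterns * nelts_per_pattern values
};

// Return element I of the conceptually unbounded sequence V describes.
static int64_t
decode_elt (const encoded_vec &v, uint64_t i)
{
  unsigned pattern = static_cast<unsigned> (i % v.npatterns);
  uint64_t count = i / v.npatterns;
  if (count < v.nelts_per_pattern)
    return v.elems[count * v.npatterns + pattern];
  // Index of the last encoded element of this pattern.
  unsigned last = (v.nelts_per_pattern - 1) * v.npatterns + pattern;
  if (v.nelts_per_pattern < 3)
    return v.elems[last];  // the final encoded element repeats
  // Stepped pattern: continue the series a1, a2, a2 + S, ...
  int64_t step = v.elems[last] - v.elems[last - v.npatterns];
  return v.elems[last] + static_cast<int64_t> (count - 2) * step;
}

// Unwind the first NELTS selector elements and pick each one from ARG0.
static std::vector<int64_t>
permute_prefix (const encoded_vec &arg0, const encoded_vec &sel,
                unsigned nelts)
{
  std::vector<int64_t> res;
  for (unsigned i = 0; i < nelts; ++i)
    res.push_back (decode_elt (arg0, static_cast<uint64_t> (decode_elt (sel, i))));
  return res;
}
```

With arg0 = {2, 3, {1, 11, 2, 12, 3, 13}} and sel = {2, 3, {0, 0, 0, 1, 0, 2}}, unwinding 12 elements gives the selector { 0, 0, 0, 1, 0, 2, 0, 3, 0, 4, 0, 5 } and the result { 1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13 } quoted above.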
> > >
> > > Yeah.  But like I said above:
> > >
> > >   The restriction about each (unwound) selector pattern selecting from the
> > >   same input pattern only applies to case where the selector pattern is
> > >   stepped (and only applies to the stepped part of the pattern, not the
> > >   leading element).
> > >
> > > If the selector nelts-per-pattern is 1 or 2 then we can support all
> > > power-of-2 cases, with the final npatterns being the maximum of the
> > > source nelts-per-patterns.
> > >
> > > Also, going back to an earlier part of the discussion, I think we
> > > should use this technique for both VLA and VLS, and only fall back
> > > to the VLS-specific approach if the VLA approach fails.
> > >
> > > So I suggest we put the VLA code in its own function and have
> > > the VLS-only path kick in when the VLA code fails.  If the code is
> > > having to pass a lot of state around, it might make sense to define
> > > a local class, store the state in member variables, and use member
> > > functions for the various subroutines.  I don't know if that will
> > > work out neater though.
> > Hi Richard,
> > Thanks for the suggestions. I have attached an updated POC patch,
> > that does the following:
> > (a) Uses VLA approach by default, and falls back to VLS specific
> > folding if VLA approach fails for VLS vectors.
> > (b) Separates cases for sel_nelts_per_pattern < 3 and
> > sel_nelts_per_pattern == 3.
> > (c) Allows a0 to select a different vector from a1 .. ae.
> > I have written a few unit tests in the patch for testing the same.
> > Does the patch look in the right direction ?
> >
> > The patch has an issue for the following case marked as "case 9"
> > in test_vec_perm_vla_folding:
> > arg0 = { 1, 11, 2, 12, 3, 13, ... }
> > arg1 = { 21, 31, 22, 32, 23, 33, ... }
> > arg0 and arg1 have npatterns = 2, nelts_per_pattern = 3.
> >
> > mask = { 4 + 4x, 5 + 4x, 6 + 4x, ... }
> > where 4 + 4x is runtime vector length.
> > npatterns = 1, nelts_per_pattern = 3.
> >
> > a1 = 5 + 4x
> > ae = a1 + (esel - 2) * S
> >      = (5 + 4x) + (4 + 4x - 2) * 1
> >      = 7 + 8x
> >
> > Since (7 + 8x) /trunc (4 + 4x) returns false, fold_vec_perm returns NULL_TREE.
> > Is that expected for the above mask ?
> >
> > I intended it to select the second vector similar to,
> > sel = { 0, 1, 2 .. }, which would select the first vector
> > by re-encoding sel as { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ... }
> > with two patterns: {0, 2, 4, ...} and {1, 3, 5, ...}
> > The first would select elements from first pattern from arg0,
> > while the second pattern would select elements from second pattern from arg0.
> > with result effectively having same encoding as arg0.
> > Shouldn't sel = { 4 + 4x, 5 + 4x, 6 + 4x, ... } similarly select arg1 ?
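As a sanity check on the arithmetic in case 9, the quotient (7 + 8x) / (4 + 4x) can be evaluated for concrete values of x. The sketch below is plain integer arithmetic over a finite range of x; the helper names are hypothetical and it says nothing about how can_div_trunc_p itself decides. Since 4 + 4x <= 7 + 8x < 2 * (4 + 4x) for every x >= 0, the truncated quotient is 1 throughout, whereas for case 2 (a1 = 5 against length 4 + 4x) the quotient changes between x == 0 and x > 0.

```cpp
#include <cstdint>

// Truncated quotient of (a0 + a1*x) / (b0 + b1*x) for one concrete x.
static uint64_t
quotient_at (uint64_t a0, uint64_t a1, uint64_t b0, uint64_t b1, uint64_t x)
{
  return (a0 + a1 * x) / (b0 + b1 * x);
}

// Return true if the quotient is the same for every x in [0, max_x],
// storing that quotient in *Q.  This only samples a finite range.
static bool
quotient_constant_up_to (uint64_t a0, uint64_t a1, uint64_t b0, uint64_t b1,
                         uint64_t max_x, uint64_t *q)
{
  *q = quotient_at (a0, a1, b0, b1, 0);
  for (uint64_t x = 1; x <= max_x; ++x)
    if (quotient_at (a0, a1, b0, b1, x) != *q)
      return false;
  return true;
}
```

Calling quotient_constant_up_to (7, 8, 4, 4, 1000, &q) models ae = 7 + 8x against n1 = 4 + 4x, and quotient_constant_up_to (5, 0, 4, 4, 10, &q) models the constant index 5 from case 2.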
> Hi Richard,
> ping https://gcc.gnu.org/pipermail/gcc-patches/2022-December/608363.html
Hi Richard,
ping * 2: https://gcc.gnu.org/pipermail/gcc-patches/2022-December/608363.html

Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
> >
> > PS: I will be on vacation next week.
> >
> > Thanks,
> > Prathamesh
> >
> > >
> > > > @@ -10494,38 +10497,55 @@ fold_mult_zconjz (location_t loc, tree type, tree expr)
> > > >                         build_zero_cst (itype));
> > > >  }
> > > >
> > > > +/* Check if PATTERN in SEL selects either ARG0 or ARG1,
> > > > +   and return the selected arg, otherwise return NULL_TREE.  */
> > > >
> > > > -/* Helper function for fold_vec_perm.  Store elements of VECTOR_CST or
> > > > -   CONSTRUCTOR ARG into array ELTS, which has NELTS elements, and return
> > > > -   true if successful.  */
> > > > -
> > > > -static bool
> > > > -vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
> > > > +static tree
> > > > +get_vector_for_pattern (tree arg0, tree arg1,
> > > > +                     const vec_perm_indices &sel, unsigned pattern,
> > > > +                     unsigned sel_npatterns, int &S)
> > > >  {
> > > > -  unsigned HOST_WIDE_INT i, nunits;
> > > > +  unsigned sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
> > > >
> > > > -  if (TREE_CODE (arg) == VECTOR_CST
> > > > -      && VECTOR_CST_NELTS (arg).is_constant (&nunits))
> > > > +  poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > > > +  poly_uint64 nsel = sel.length ();
> > > > +  poly_uint64 esel;
> > > > +
> > > > +  if (!multiple_p (nsel, sel_npatterns, &esel))
> > > > +    return NULL_TREE;
> > > > +
> > > > +  poly_uint64 a1 = sel[pattern + sel_npatterns];
> > > > +  S = 0;
> > > > +  if (sel_nelts_per_pattern == 3)
> > > >      {
> > > > -      for (i = 0; i < nunits; ++i)
> > > > -     elts[i] = VECTOR_CST_ELT (arg, i);
> > > > +      poly_uint64 a2 = sel[pattern + 2 * sel_npatterns];
> > > > +      S = (a2 - a1).to_constant ();
> > >
> > > The code hasn't proven that this to_constant is safe.
> > >
> > > > +      if (S != 0 && !pow2p_hwi (S))
> > > > +     return NULL_TREE;
> > > >      }
> > > > -  else if (TREE_CODE (arg) == CONSTRUCTOR)
> > > > +
> > > > +  poly_uint64 ae = a1 + (esel - 2) * S;
> > > > +  uint64_t q1, qe;
> > > > +  poly_uint64 r1, re;
> > > > +
> > > > +  if (!can_div_trunc_p (a1, n1, &q1, &r1)
> > > > +      || !can_div_trunc_p (ae, n1, &qe, &re)
> > > > +      || (q1 != qe))
> > > > +    return NULL_TREE;
> > >
> > > Going back to the above: this check doesn't make sense for
> > > sel_nelts_per_pattern != 3.
> > >
> > > Thanks,
> > > Richard
> > >
> > > > +
> > > > +  tree arg = ((q1 & 1) == 0) ? arg0 : arg1;
> > > > +
> > > > +  if (S < 0)
> > > >      {
> > > > -      constructor_elt *elt;
> > > > +      poly_uint64 a0 = sel[pattern];
> > > > +      if (!known_eq (S, a1 - a0))
> > > > +        return NULL_TREE;
> > > >
> > > > -      FOR_EACH_VEC_SAFE_ELT (CONSTRUCTOR_ELTS (arg), i, elt)
> > > > -     if (i >= nelts || TREE_CODE (TREE_TYPE (elt->value)) == VECTOR_TYPE)
> > > > -       return false;
> > > > -     else
> > > > -       elts[i] = elt->value;
> > > > +      if (!known_gt (re, VECTOR_CST_NPATTERNS (arg)))
> > > > +        return NULL_TREE;
> > > >      }
> > > > -  else
> > > > -    return false;
> > > > -  for (; i < nelts; i++)
> > > > -    elts[i]
> > > > -      = fold_convert (TREE_TYPE (TREE_TYPE (arg)), integer_zero_node);
> > > > -  return true;
> > > > +
> > > > +  return arg;
> > > >  }
> > > >
> > > >  /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL
> > > > @@ -10539,41 +10559,135 @@ fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
> > > >    unsigned HOST_WIDE_INT nelts;
> > > >    bool need_ctor = false;
> > > >
> > > > -  if (!sel.length ().is_constant (&nelts))
> > > > -    return NULL_TREE;
> > > > -  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts)
> > > > -           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts)
> > > > -           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts));
> > > > +  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), sel.length ())
> > > > +           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)),
> > > > +                        TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1))));
> > > >    if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type)
> > > >        || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type))
> > > >      return NULL_TREE;
> > > >
> > > > -  tree *in_elts = XALLOCAVEC (tree, nelts * 2);
> > > > -  if (!vec_cst_ctor_to_array (arg0, nelts, in_elts)
> > > > -      || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts))
> > > > +  unsigned res_npatterns = 0;
> > > > +  unsigned res_nelts_per_pattern = 0;
> > > > +  unsigned sel_npatterns = 0;
> > > > +  tree *vector_for_pattern = NULL;
> > > > +
> > > > +  if (TREE_CODE (arg0) == VECTOR_CST
> > > > +      && TREE_CODE (arg1) == VECTOR_CST
> > > > +      && !sel.length ().is_constant ())
> > > > +    {
> > > > +      unsigned arg0_npatterns = VECTOR_CST_NPATTERNS (arg0);
> > > > +      unsigned arg1_npatterns = VECTOR_CST_NPATTERNS (arg1);
> > > > +      sel_npatterns = sel.encoding ().npatterns ();
> > > > +
> > > > +      if (!pow2p_hwi (arg0_npatterns)
> > > > +       || !pow2p_hwi (arg1_npatterns)
> > > > +       || !pow2p_hwi (sel_npatterns))
> > > > +        return NULL_TREE;
> > > > +
> > > > +      unsigned N = 1;
> > > > +      vector_for_pattern = XALLOCAVEC (tree, sel_npatterns);
> > > > +      for (unsigned i = 0; i < sel_npatterns; i++)
> > > > +     {
> > > > +       int S = 0;
> > > > +       tree op = get_vector_for_pattern (arg0, arg1, sel, i, sel_npatterns, S);
> > > > +       if (!op)
> > > > +         return NULL_TREE;
> > > > +       vector_for_pattern[i] = op;
> > > > +       unsigned N_pattern =
> > > > +         (S > 0) ? std::max<int>(S, VECTOR_CST_NPATTERNS (op)) / S : 1;
> > > > +       N = std::max (N, N_pattern);
> > > > +     }
> > > > +
> > > > +      res_npatterns
> > > > +        = std::max (sel_npatterns * N, std::max (arg0_npatterns, arg1_npatterns));
> > > > +
> > > > +      res_nelts_per_pattern
> > > > +     = std::max(sel.encoding ().nelts_per_pattern (),
> > > > +                std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
> > > > +                          VECTOR_CST_NELTS_PER_PATTERN (arg1)));
> > > > +    }
> > > > +  else if (sel.length ().is_constant (&nelts)
> > > > +        && TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
> > > > +        && TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).to_constant () == nelts)
> > > > +    {
> > > > +      /* For VLS vectors, treat all vectors with
> > > > +      npatterns = nelts, nelts_per_pattern = 1. */
> > > > +      res_npatterns = sel_npatterns = nelts;
> > > > +      res_nelts_per_pattern = 1;
> > > > +      vector_for_pattern = XALLOCAVEC (tree, nelts);
> > > > +      for (unsigned i = 0; i < nelts; i++)
> > > > +        {
> > > > +       HOST_WIDE_INT index;
> > > > +       if (!sel[i].is_constant (&index))
> > > > +         return NULL_TREE;
> > > > +       vector_for_pattern[i] = (index < nelts) ? arg0 : arg1;
> > > > +     }
> > > > +    }
> > > > +  else
> > > >      return NULL_TREE;
> > > >
> > > > -  tree_vector_builder out_elts (type, nelts, 1);
> > > > -  for (i = 0; i < nelts; i++)
> > > > +  tree_vector_builder out_elts (type, res_npatterns,
> > > > +                             res_nelts_per_pattern);
> > > > +  unsigned res_nelts = res_npatterns * res_nelts_per_pattern;
> > > > +  for (unsigned i = 0; i < res_nelts; i++)
> > > >      {
> > > > -      HOST_WIDE_INT index;
> > > > -      if (!sel[i].is_constant (&index))
> > > > -     return NULL_TREE;
> > > > -      if (!CONSTANT_CLASS_P (in_elts[index]))
> > > > -     need_ctor = true;
> > > > -      out_elts.quick_push (unshare_expr (in_elts[index]));
> > > > +      /* For VLA vectors, i % sel_npatterns would give the original
> > > > +         pattern the element belongs to, which is sufficient to get the arg.
> > > > +      Even if sel_npatterns has been multiplied by N,
> > > > +      they will always come from the same input vector.
> > > > +      For VLS vectors, sel_npatterns == res_nelts == nelts,
> > > > +      so i % sel_npatterns == i since i < nelts */
> > > > +
> > > > +      tree arg = vector_for_pattern[i % sel_npatterns];
> > > > +      unsigned HOST_WIDE_INT index;
> > > > +
> > > > +      if (arg == arg0)
> > > > +     {
> > > > +       if (!sel[i].is_constant ())
> > > > +         return NULL_TREE;
> > > > +       index = sel[i].to_constant ();
> > > > +     }
> > > > +      else
> > > > +        {
> > > > +       gcc_assert (arg == arg1);
> > > > +       poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > > > +       uint64_t q;
> > > > +       poly_uint64 r;
> > > > +
> > > > +       /* Divide sel[i] by input vector length, to obtain remainder,
> > > > +          which would be the index for either input vector.  */
> > > > +       if (!can_div_trunc_p (sel[i], n1, &q, &r))
> > > > +         return NULL_TREE;
> > > > +
> > > > +       if (!r.is_constant (&index))
> > > > +         return NULL_TREE;
> > > > +     }
> > > > +
> > > > +      tree elem;
> > > > +      if (TREE_CODE (arg) == CONSTRUCTOR)
> > > > +        {
> > > > +       gcc_assert (index < nelts);
> > > > +       if (index >= vec_safe_length (CONSTRUCTOR_ELTS (arg)))
> > > > +         return NULL_TREE;
> > > > +       elem = CONSTRUCTOR_ELT (arg, index)->value;
> > > > +       if (VECTOR_TYPE_P (TREE_TYPE (elem)))
> > > > +         return NULL_TREE;
> > > > +       need_ctor = true;
> > > > +     }
> > > > +      else
> > > > +        elem = vector_cst_elt (arg, index);
> > > > +      out_elts.quick_push (elem);
> > > >      }
> > > >
> > > >    if (need_ctor)
> > > >      {
> > > >        vec<constructor_elt, va_gc> *v;
> > > > -      vec_alloc (v, nelts);
> > > > -      for (i = 0; i < nelts; i++)
> > > > +      vec_alloc (v, res_nelts);
> > > > +      for (i = 0; i < res_nelts; i++)
> > > >       CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, out_elts[i]);
> > > >        return build_constructor (type, v);
> > > >      }
> > > > -  else
> > > > -    return out_elts.build ();
> > > > +  return out_elts.build ();
> > > >  }
> > > >
> > > >  /* Try to fold a pointer difference of type TYPE two address expressions of
> > > > @@ -16910,6 +17024,97 @@ test_vec_duplicate_folding ()
> > > >    ASSERT_TRUE (operand_equal_p (dup5_expr, dup5_cst, 0));
> > > >  }
> > > >
> > > > +static tree
> > > > +build_vec_int_cst (unsigned npatterns, unsigned nelts_per_pattern,
> > > > +                int *encoded_elems)
> > > > +{
> > > > +  scalar_int_mode int_mode = SCALAR_INT_TYPE_MODE (integer_type_node);
> > > > +  machine_mode vmode = targetm.vectorize.preferred_simd_mode (int_mode);
> > > > +  //machine_mode vmode = VNx4SImode;
> > > > +  poly_uint64 nunits = GET_MODE_NUNITS (vmode);
> > > > +  tree vectype = build_vector_type (integer_type_node, nunits);
> > > > +
> > > > +  tree_vector_builder builder (vectype, npatterns, nelts_per_pattern);
> > > > +  for (unsigned i = 0; i < npatterns * nelts_per_pattern; i++)
> > > > +    builder.quick_push (build_int_cst (integer_type_node, encoded_elems[i]));
> > > > +  return builder.build ();
> > > > +}
> > > > +
> > > > +static void
> > > > +test_vec_perm_vla_folding ()
> > > > +{
> > > > +  int arg0_elems[] = { 1, 11, 2, 12, 3, 13 };
> > > > +  tree arg0 = build_vec_int_cst (2, 3, arg0_elems);
> > > > +
> > > > +  int arg1_elems[] = { 21, 31, 22, 32, 23, 33 };
> > > > +  tree arg1 = build_vec_int_cst (2, 3, arg1_elems);
> > > > +
> > > > +  if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
> > > > +      || TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)).is_constant ())
> > > > +    return;
> > > > +
> > > > +  /* Case 1: For mask: {0, 1, 2, ...}, npatterns == 1, nelts_per_pattern == 3,
> > > > +     should select arg0.  */
> > > > +  {
> > > > +    int mask_elems[] = {0, 1, 2};
> > > > +    tree mask = build_vec_int_cst (1, 3, mask_elems);
> > > > +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> > > > +    ASSERT_TRUE (res != NULL_TREE);
> > > > +    ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == 2);
> > > > +    ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == 3);
> > > > +
> > > > +    unsigned res_nelts = vector_cst_encoded_nelts (res);
> > > > +    for (unsigned i = 0; i < res_nelts; i++)
> > > > +      ASSERT_TRUE (operand_equal_p (VECTOR_CST_ELT (res, i),
> > > > +                                 VECTOR_CST_ELT (arg0, i), 0));
> > > > +  }
> > > > +
> > > > +  /* Case 2: For mask: {4, 5, 6, ...}, npatterns == 1, nelts_per_pattern == 3,
> > > > +     should return NULL because for len = 4 + 4x,
> > > > +     if x == 0, we select from arg1
> > > > +     if x > 0, we select from arg0
> > > > +     and thus cannot determine result at compile time.  */
> > > > +  {
> > > > +    int mask_elems[] = {4, 5, 6};
> > > > +    tree mask = build_vec_int_cst (1, 3, mask_elems);
> > > > +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> > > > +    gcc_assert (res == NULL_TREE);
> > > > +  }
> > > > +
> > > > +  /* Case 3:
> > > > +     mask: {0, 0, 0, 1, 0, 2, ...}
> > > > +     npatterns == 2, nelts_per_pattern == 3
> > > > +     Pattern {0, ...} should select arg0[0], ie, 1.
> > > > +     Pattern {0, 1, 2, ...} should select arg0: {1, 11, 2, ...},
> > > > +     so res = {1, 1, 1, 11, 1, 2, ...}.  */
> > > > +  {
> > > > +    int mask_elems[] = {0, 0, 0, 1, 0, 2};
> > > > +    tree mask = build_vec_int_cst (2, 3, mask_elems);
> > > > +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> > > > +    ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == 4);
> > > > +    ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == 3);
> > > > +
> > > > +    /* Check encoding: {1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13, ...}  */
> > > > +    int res_encoded_elems[] = {1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13};
> > > > +    for (unsigned i = 0; i < vector_cst_encoded_nelts (res); i++)
> > > > +      ASSERT_TRUE (wi::to_wide(VECTOR_CST_ELT (res, i)) == res_encoded_elems[i]);
> > > > +  }
> > > > +
> > > > +  /* Case 4:
> > > > +     mask: {0, 4 + 4x, 0, 5 + 4x, 0, 6 + 4x, ...}
> > > > +     npatterns == 2, nelts_per_pattern == 3
> > > > > +     Pattern {0, ...} should select arg0[0]
> > > > +     Pattern {4 + 4x, 5 + 4x, 6 + 4x, ...} should select from arg1, since:
> > > > +     a1 = 5 + 4x
> > > > +     ae = (5 + 4x) + ((4 + 4x) / 2 - 2) * 1
> > > > +        = 5 + 6x
> > > > > +     Since a1 / (4 + 4x) == ae / (4 + 4x) == 1, we select arg1[0], arg1[1], arg1[2], ...
> > > > +     res: {1, 21, 1, 31, 1, 22, ... }
> > > > +     FIXME: How to build vector with poly_int elems ?  */
> > > > +
> > > > +  /* Case 5: S < 0.  */
> > > > +}
> > > > +
> > > >  /* Run all of the selftests within this file.  */
> > > >
> > > >  void
> > > > @@ -16918,6 +17123,7 @@ fold_const_cc_tests ()
> > > >    test_arithmetic_folding ();
> > > >    test_vector_folding ();
> > > >    test_vec_duplicate_folding ();
> > > > +  test_vec_perm_vla_folding ();
> > > >  }
> > > >
> > > >  } // namespace selftest

Prathamesh Kulkarni Feb. 1, 2023, 10:01 a.m. UTC | #28

On Tue, 17 Jan 2023 at 17:24, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Mon, 26 Dec 2022 at 09:56, Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> >
> > On Tue, 13 Dec 2022 at 11:35, Prathamesh Kulkarni
> > <prathamesh.kulkarni@linaro.org> wrote:
> > >
> > > On Tue, 6 Dec 2022 at 21:00, Richard Sandiford
> > > <richard.sandiford@arm.com> wrote:
> > > >
> > > > Prathamesh Kulkarni via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > > > > On Fri, 4 Nov 2022 at 14:00, Prathamesh Kulkarni
> > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > >>
> > > > >> On Mon, 31 Oct 2022 at 15:27, Richard Sandiford
> > > > >> <richard.sandiford@arm.com> wrote:
> > > > >> >
> > > > >> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > > > >> > > On Wed, 26 Oct 2022 at 21:07, Richard Sandiford
> > > > >> > > <richard.sandiford@arm.com> wrote:
> > > > >> > >>
> > > > >> > >> Sorry for the slow response.  I wanted to find some time to think
> > > > >> > >> about this a bit more.
> > > > >> > >>
> > > > >> > >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > > > >> > >> > On Fri, 30 Sept 2022 at 21:38, Richard Sandiford
> > > > >> > >> > <richard.sandiford@arm.com> wrote:
> > > > >> > >> >>
> > > > >> > >> >> Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > > > >> > >> >> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > > > >> > >> >> >> Sorry to ask a silly question but in which case shall we select 2nd vector ?
> > > > >> > >> >> >> For num_poly_int_coeffs == 2,
> > > > >> > >> >> >> a1 /trunc n1 == (a1 + 0x) / (n1.coeffs[0] + n1.coeffs[1]*x)
> > > > >> > >> >> >> If a1/trunc n1 succeeds,
> > > > >> > >> >> >> 0 / n1.coeffs[1] == a1/n1.coeffs[0] == 0.
> > > > >> > >> >> >> So, a1 has to be < n1.coeffs[0] ?
> > > > >> > >> >> >
> > > > >> > >> >> > Remember that a1 is itself a poly_int.  It's not necessarily a constant.
> > > > >> > >> >> >
> > > > >> > >> >> > E.g. the TRN1 .D instruction maps to a VEC_PERM_EXPR with the selector:
> > > > >> > >> >> >
> > > > >> > >> >> >   { 0, 2 + 2x, 1, 4 + 2x, 2, 6 + 2x, ... }
> > > > >> > >> >>
> > > > >> > >> >> Sorry, should have been:
> > > > >> > >> >>
> > > > >> > >> >>   { 0, 2 + 2x, 2, 4 + 2x, 4, 6 + 2x, ... }
> > > > >> > >> > Hi Richard,
> > > > >> > >> > Thanks for the clarifications, and sorry for late reply.
> > > > >> > >> > I have attached POC patch that tries to implement the above approach.
> > > > >> > >> > Passes bootstrap+test on x86_64-linux-gnu and aarch64-linux-gnu for VLS vectors.
> > > > >> > >> >
> > > > >> > >> > For VLA vectors, I have only done limited testing so far.
> > > > >> > >> > It seems to pass couple of tests written in the patch for
> > > > >> > >> > nelts_per_pattern == 3,
> > > > >> > >> > and folds the following svld1rq test:
> > > > >> > >> > int32x4_t v = {1, 2, 3, 4};
> > > > >> > >> > return svld1rq_s32 (svptrue_b8 (), &v[0])
> > > > >> > >> > into:
> > > > >> > >> > return {1, 2, 3, 4, ...};
> > > > >> > >> > I will try to bootstrap+test it on SVE machine to test further for VLA folding.
> > > > >> > >> >
> > > > >> > >> > I have a couple of questions:
> > > > >> > >> > 1] When mask selects elements from same vector but from different patterns:
> > > > >> > >> > For eg:
> > > > >> > >> > arg0 = {1, 11, 2, 12, 3, 13, ...},
> > > > >> > >> > arg1 = {21, 31, 22, 32, 23, 33, ...},
> > > > >> > >> > mask = {0, 0, 0, 1, 0, 2, ... },
> > > > >> > >> > All have npatterns = 2, nelts_per_pattern = 3.
> > > > >> > >> >
> > > > >> > >> > With above mask,
> > > > >> > >> > Pattern {0, ...} selects arg0[0], ie {1, ...}
> > > > >> > >> > Pattern {0, 1, 2, ...} selects arg0[0], arg0[1], arg0[2], ie {1, 11, 2, ...}
> > > > >> > >> > While arg0[0] and arg0[2] belong to same pattern, arg0[1] belongs to different
> > > > >> > >> > pattern in arg0.
> > > > >> > >> > The result is:
> > > > >> > >> > res = {1, 1, 1, 11, 1, 2, ...}
> > > > >> > >> > In this case, res's 2nd pattern {1, 11, 2, ...} is encoded with:
> > > > >> > >> > with a0 = 1, a1 = 11, S = -9.
> > > > >> > >> > Is that expected tho ? It seems to create a new encoding which
> > > > >> > >> > wasn't present in the input vector. For instance, the next elem in
> > > > >> > >> > sequence would be -7,
> > > > >> > >> > which is not present originally in arg0.
> > > > >> > >>
> > > > >> > >> Yeah, you're right, sorry.  Going back to:
> > > > >> > >>
> > > > >> > >> (2) The explicit encoding can be used to produce a sequence of N*Ex*Px
> > > > >> > >>     elements for any integer N.  This extended sequence can be reencoded
> > > > >> > >>     as having N*Px patterns, with Ex staying the same.
> > > > >> > >>
> > > > >> > >> I guess we need to pick an N for the selector such that each new
> > > > >> > >> selector pattern (each one out of the N*Px patterns) selects from
> > > > >> > >> the *same pattern* of the same data input.
> > > > >> > >>
> > > > >> > >> So if a particular pattern in the selector has a step S, and the data
> > > > >> > >> input it selects from has Pi patterns, N*S must be a multiple of Pi.
> > > > >> > >> N must be a multiple of least_common_multiple(S,Pi)/S.
> > > > >> > >>
> > > > >> > >> I think that means that the total number of patterns in the result
> > > > >> > >> (Pr from previous messages) can safely be:
> > > > >> > >>
> > > > >> > >>   Ps * least_common_multiple(
> > > > >> > >>     least_common_multiple(S[1], P[input(1)]) / S[1],
> > > > >> > >>     ...
> > > > >> > >>     least_common_multiple(S[Ps], P[input(Ps)]) / S[Ps]
> > > > >> > >>   )
> > > > >> > >>
> > > > >> > >> where:
> > > > >> > >>
> > > > >> > >>   Ps = the number of patterns in the selector
> > > > >> > >>   S[I] = the step for selector pattern I (I being 1-based)
> > > > >> > >>   input(I) = the data input selected by selector pattern I (I being 1-based)
> > > > >> > >>   P[I] = the number of patterns in data input I
> > > > >> > >>
> > > > >> > >> That's getting quite complicated :-)  If we allow arbitrary P[...]
> > > > >> > >> and S[...] then it could also get large.  Perhaps we should finally
> > > > >> > >> give up on the general case and limit this to power-of-2 patterns and
> > > > >> > >> power-of-2 steps, so that least_common_multiple becomes MAX.  Maybe that
> > > > >> > >> simplifies other things as well.
> > > > >> > >>
> > > > >> > >> What do you think?
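The general formula for Pr above can be sketched with std::lcm (C++17). The `stepped_pattern` struct and `result_npatterns` are hypothetical names used only for illustration. Skipping patterns with a step of 0 is an assumption of this sketch, since the formula as written would otherwise divide by zero.

```cpp
#include <numeric>  // std::lcm (C++17)
#include <vector>

// Hypothetical description of selector pattern I: its step S[I] and
// P[input(I)], the npatterns of the data input it selects from.
struct stepped_pattern
{
  unsigned step;
  unsigned input_npatterns;  // must be nonzero
};

// Pr = Ps * lcm over all I of (lcm (S[I], P[input(I)]) / S[I]).
static unsigned
result_npatterns (const std::vector<stepped_pattern> &sel)
{
  unsigned n = 1;
  for (const stepped_pattern &p : sel)
    {
      if (p.step == 0)
        continue;  // assumption: unstepped patterns do not constrain N
      unsigned need = std::lcm (p.step, p.input_npatterns) / p.step;
      n = std::lcm (n, need);
    }
  return static_cast<unsigned> (sel.size ()) * n;
}
```

For steps {3, 2} selecting from inputs with {4, 6} patterns this gives 2 * lcm (4, 3) = 24, which shows how quickly the count can grow in the general case. When every step and npatterns is a power of 2, each lcm reduces to a max, which is the simplification suggested above.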
> > > > >> > > Hi Richard,
> > > > >> > > Thanks for the suggestions. Yeah I suppose we can initially add support for
> > > > >> > > power-of-2 patterns and power-of-2 steps and try to generalize it in
> > > > >> > > follow up patches if possible.
> > > > >> > >
> > > > >> > > Sorry if this sounds like a silly ques -- if we are going to have
> > > > >> > > pattern in selector, select *same pattern from same input vector*,
> > > > >> > > instead of re-encoding the selector to have N * Ps patterns, would it
> > > > >> > > make sense for elements in selector to denote pattern number itself
> > > > >> > > instead of element index
> > > > >> > > if input vectors are VLA ?
> > > > >> > >
> > > > >> > > For eg:
> > > > >> > > op0 = {1, 2, 3, 4, 1, 2, 3, 5, 1, 2, 3, 6, ...}
> > > > >> > > op1 = {...}
> > > > >> > > with npatterns == 4, nelts_per_pattern == 3,
> > > > >> > > sel = {0, 3} should pick pattern 0 and pattern 3 from op0,
> > > > >> > > so, res = {1, 4, 1, 5, 1, 6, ...}
> > > > >> > > Not sure if this is correct tho.
> > > > >> >
> > > > >> > This wouldn't allow us to represent things like a "duplicate one
> > > > >> > element", or "copy the leading N elements from the first input and
> > > > >> > the other elements from elements N+ of the second input", which we
> > > > >> > can with the current scheme.
> > > > >> >
> > > > >> > The restriction about each (unwound) selector pattern selecting from the
> > > > >> > same input pattern only applies to case where the selector pattern is
> > > > >> > stepped (and only applies to the stepped part of the pattern, not the
> > > > >> > leading element).  The restriction is also local to this code; it
> > > > >> > doesn't make other VEC_PERM_EXPRs invalid.
> > > > >> Hi Richard,
> > > > >> Thanks for the clarifications.
> > > > >> Just to clarify your approach with an eg:
> > > > >> Let selected input vector be:
> > > > >> arg0: {a0, b0, c0, d0,
> > > > >>           a0 + S, b0 + S, c0 + S, d0 + S,
> > > > >>           a0 + 2S, b0 + 2S, c0 + 2S, dd + 2S, ...}
> > > > >> where arg0 has npatterns = 4, and nelts_per_pattern = 3.
> > > > >>
> > > > >> Let sel = {0, 0, 1, 2, 2, 4, ...}
> > > > >> where sel_npatterns = 2 and sel_nelts_per_pattern = 3
> > > > >>
> > > > >> So, the first pattern in sel:
> > > > >> p1: {0, 1, 2, ...} which will select {a0, b0, c0, ...}
> > > > >> which would be incorrect, since they belong to different patterns in arg0.
> > > > >> So to select elements from same pattern in arg0, we need to divide p1
> > > > >> into at least N1 = P_arg0 / S0 = 4 distinct patterns.
> > > > >>
> > > > >> Similarly for second pattern in sel:
> > > > >> p2: {0, 2, 4, ...}, we need to divide it into
> > > > >> at least N2 = P_arg0 / S1 = 2 distinct patterns.
> > > > >>
> > > > >> Select N = max(N1, N2) = 4
> > > > >> So, the selector will be extended to N * Ps * Es = 4 * 2 * 3 == 24 elements,
> > > > >> and will be re-encoded with N*Ps = 8 patterns:
> > > > >>
> > > > >> re-encoded sel:
> > > > >> {a0, b0, c0, d0, a0 + S, b0 + S, c0 + S, d0 + S,
> > > > >> a0 + 2S, b0 + 2S, c0 + 2S, d0 + 2S, a0 + 3S, b0 + 3S, c0 + 3S, d0 + 3S,
> > > > > >> a0 + 4S, b0 + 4S, c0 + 4S, d0 + 4S, a0 + 5S, b0 + 5S, c0 + 5S, d0 + 5S,
> > > > >> ...}
> > > > >>
> > > > >> with 8 patterns,
> > > > >> p1: {a0, a0 + 2S, a0 + 4S, ...}
> > > > >> p2: {b0, b0 + 2S, b0 + 4S, ...}
> > > > >> ...
> > > > >> which select elements from same pattern from same input vector.
> > > > >> Does this look correct ?
> > > > >>
> > > > >> For feasibility, we can check initially that sel_npatterns, arg0_npatterns,
> > > > >> arg1_npatterns are powers of 2 and for each stepped pattern,
> > > > > >> its step size S is a power of 2. I suppose this will be sufficient
> > > > >> to ensure that sel can be re-encoded with N*Ps npatterns
> > > > >> such that each new pattern selects elements from same pattern
> > > > >> of the input vector ?
> > > > >>
> > > > >> Then compute N:
> > > > >> N = 1;
> > > > >> for (every pattern p in sel)
> > > > >>   {
> > > > >>      op = corresponding input vector for pattern;
> > > > >>      S = step_size (p);
> > > > >>      N_pattern = max (S, npatterns (op)) / S;
> > > > >>      N = max(N, N_pattern)
> > > > >>   }
> > > > >>
> > > > >> and re-encode selector with N*Ps patterns.
> > > > >> I guess rest of the patch will mostly stay the same.
> > > > > Hi,
> > > > > I have attached a POC patch based on the above approach.
> > > > > For the above eg:
> > > > > arg0 = {1, 11, 2, 12, 3, 13, ...} // npatterns = 2, nelts_per_pattern = 3,
> > > > > and
> > > > > sel = {0, 0, 0, 1, 0, 2, ...}
> > > > > with sel_npatterns == 2 and sel_nelts_per_pattern == 3.
> > > > >
> > > > > For pattern, {0, 1, 2, ...} it will select elements from different
> > > > > patterns from arg0, which is incorrect.
> > > > > So we choose N = P1/S = 2/1 = 2, where P1 is the number of patterns in arg0.
> > > > > So re-encoded sel = { 0, 0, 0, 1, 0, 2, 0, 3, 0, 4, 0, 5, ...}
> > > > > with following patterns:
> > > > > p1 = { 0, ... }
> > > > > p2 = { 0, 2, 4, ... }
> > > > > p3 = { 0, ... }
> > > > > p4 = { 1, 3, 5, ... }
> > > > > which should be correct since each element from the respective
> > > > > patterns in sel chooses
> > > > > elements from same pattern from arg0.
> > > > > So, res = { 1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13, ... }
> > > > > Does this look correct ?
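The worked example above can be checked with a small standalone model of the (npatterns, nelts_per_pattern) encoding. This is a sketch, not GCC code: `encoded_vec`, `elt_at`, `expand_prefix` and `permute_prefix` are hypothetical names, and it assumes the runtime vector is long enough that every selector index shown refers to arg0.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model of an encoded vector constant.
struct encoded_vec
{
  unsigned npatterns;
  unsigned nelts_per_pattern; // 1, 2 or 3
  std::vector<int64_t> elts;  // npatterns * nelts_per_pattern values
};

// Return element I, extrapolating past the encoded elements: patterns
// with 1 or 2 elements repeat their last value, and patterns with 3
// elements continue with the step between their last two values.
int64_t
elt_at (const encoded_vec &v, uint64_t i)
{
  assert (v.nelts_per_pattern >= 1 && v.nelts_per_pattern <= 3);
  uint64_t encoded_nelts = uint64_t (v.npatterns) * v.nelts_per_pattern;
  assert (v.elts.size () == encoded_nelts);
  if (i < encoded_nelts)
    return v.elts[i];
  uint64_t pattern = i % v.npatterns;
  uint64_t count = i / v.npatterns;
  uint64_t final_i = encoded_nelts - v.npatterns + pattern;
  if (v.nelts_per_pattern <= 2)
    return v.elts[final_i];
  int64_t v1 = v.elts[final_i - v.npatterns];
  int64_t v2 = v.elts[final_i];
  return v2 + int64_t (count - 2) * (v2 - v1);
}

// Return the first N elements of V.
std::vector<int64_t>
expand_prefix (const encoded_vec &v, unsigned n)
{
  std::vector<int64_t> res;
  for (unsigned i = 0; i < n; ++i)
    res.push_back (elt_at (v, i));
  return res;
}

// Return the first N elements of the permutation, assuming every
// selector index is in range for ARG0.
std::vector<int64_t>
permute_prefix (const encoded_vec &arg0, const encoded_vec &sel, unsigned n)
{
  std::vector<int64_t> res;
  for (unsigned i = 0; i < n; ++i)
    res.push_back (elt_at (arg0, uint64_t (elt_at (sel, i))));
  return res;
}
```

With arg0 encoded as {1, 11, 2, 12, 3, 13} and sel as {0, 0, 0, 1, 0, 2}, both with npatterns = 2 and nelts_per_pattern = 3, the first 12 selector elements expand to the re-encoded sel quoted above, and the first 12 result elements match res.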
> > > >
> > > > Yeah.  But like I said above:
> > > >
> > > >   The restriction about each (unwound) selector pattern selecting from the
> > > >   same input pattern only applies to case where the selector pattern is
> > > >   stepped (and only applies to the stepped part of the pattern, not the
> > > >   leading element).
> > > >
> > > > If the selector nelts-per-pattern is 1 or 2 then we can support all
> > > > power-of-2 cases, with the final npatterns being the maximum of the
> > > > source nelts-per-patterns.
> > > >
> > > > Also, going back to an earlier part of the discussion, I think we
> > > > should use this technique for both VLA and VLS, and only fall back
> > > > to the VLS-specific approach if the VLA approach fails.
> > > >
> > > > So I suggest we put the VLA code in its own function and have
> > > > the VLS-only path kick in when the VLA code fails.  If the code is
> > > > having to pass a lot of state around, it might make sense to define
> > > > a local class, store the state in member variables, and use member
> > > > functions for the various subroutines.  I don't know if that will
> > > > work out neater though.
> > > Hi Richard,
> > > Thanks for the suggestions. I have attached an updated POC patch,
> > > that does the following:
> > > (a) Uses VLA approach by default, and falls back to VLS specific
> > > folding if VLA approach fails for VLS vectors.
> > > (b) Separates cases for sel_nelts_per_pattern < 3 and
> > > sel_nelts_per_pattern == 3.
> > > (c) Allows a0 to select a different vector from a1 .. ae.
> > > I have written a few unit tests in the patch for testing the same.
> > > Does the patch look in the right direction ?
> > >
> > > The patch has an issue for the following case marked as "case 9"
> > > in test_vec_perm_vla_folding:
> > > arg0 = { 1, 11, 2, 12, 3, 13, ... }
> > > arg1 = { 21, 31, 22, 32, 23, 33, ... }
> > > arg0 and arg1 have npatterns = 2, nelts_per_pattern = 3.
> > >
> > > mask = { 4 + 4x, 5 + 4x, 6 + 4x, ... }
> > > where 4 + 4x is runtime vector length.
> > > npatterns = 1, nelts_per_pattern = 3.
> > >
> > > a1 = 5 + 4x
> > > ae = a1 + (esel - 2) * S
> > >      = (5 + 4x) + (4 + 4x - 2) * 1
> > >      = 7 + 8x
> > >
> > > Since can_div_trunc_p fails for (7 + 8x) / (4 + 4x), fold_vec_perm returns NULL_TREE.
> > > Is that expected for the above mask ?
> > >
> > > I intended it to select the second vector similar to,
> > > sel = { 0, 1, 2 .. }, which would select the first vector
> > > by re-encoding sel as { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ... }
> > > with two patterns: {0, 2, 4, ...} and {1, 3, 5, ...}
> > > The first would select elements from first pattern from arg0,
> > > while the second pattern would select elements from second pattern from arg0,
> > > with the result effectively having the same encoding as arg0.
> > > Shouldn't sel = { 4 + 4x, 5 + 4x, 6 + 4x, ... } similarly select arg1 ?
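The arithmetic in the "case 9" question can be explored numerically. The sketch below uses a hypothetical two-coefficient type `poly2` (value = c0 + c1 * x) in place of poly_uint64. It samples the truncating quotient over a range of x, and separately applies a coefficient-by-coefficient comparison. That comparison is only one way a conservative check could reject 7 + 8x; whether it matches what can_div_trunc_p actually does is an assumption here, not something the thread confirms.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a poly_uint64 with two coefficients:
// value = c0 + c1 * x for a runtime x >= 0.
struct poly2
{
  uint64_t c0;
  uint64_t c1;
};

// Evaluate P at a given X.
uint64_t
eval (poly2 p, uint64_t x)
{
  return p.c0 + p.c1 * x;
}

// Sample x in [0, max_x].  Return true if A / B truncates to the same
// quotient for every sampled x, storing the quotient at x = 0 in *Q.
bool
sampled_quotient_is_constant (poly2 a, poly2 b, uint64_t max_x, uint64_t *q)
{
  assert (b.c0 != 0);
  *q = eval (a, 0) / eval (b, 0);
  for (uint64_t x = 1; x <= max_x; ++x)
    if (eval (a, x) / eval (b, x) != *q)
      return false;
  return true;
}

// A conservative test (assumption): require the quotient of each pair
// of coefficients to agree.
bool
coefficient_quotients_agree (poly2 a, poly2 b)
{
  assert (b.c0 != 0 && b.c1 != 0);
  return a.c0 / b.c0 == a.c1 / b.c1;
}
```

For a1 = 5 + 4x both tests agree on a quotient of 1. For ae = 7 + 8x the sampled quotient is 1 for every x tried, but the coefficient quotients are 7 / 4 = 1 and 8 / 4 = 2, so the conservative test rejects it. For the constant index 5 of the mask {4, 5, 6, ...}, the sampled quotient is 1 at x = 0 and 0 at x = 1, which is the ambiguity the Case 2 test comment describes.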
> > Hi Richard,
> > ping https://gcc.gnu.org/pipermail/gcc-patches/2022-December/608363.html
> Hi Richard,
> ping * 2: https://gcc.gnu.org/pipermail/gcc-patches/2022-December/608363.html
Hi Richard,
ping * 3: https://gcc.gnu.org/pipermail/gcc-patches/2022-December/608363.html

Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
> >
> > Thanks,
> > Prathamesh
> > >
> > > PS: I will be on vacation next week.
> > >
> > > Thanks,
> > > Prathamesh
> > >
> > > >
> > > > > @@ -10494,38 +10497,55 @@ fold_mult_zconjz (location_t loc, tree type, tree expr)
> > > > >                         build_zero_cst (itype));
> > > > >  }
> > > > >
> > > > > +/* Check if PATTERN in SEL selects either ARG0 or ARG1,
> > > > > +   and return the selected arg, otherwise return NULL_TREE.  */
> > > > >
> > > > > -/* Helper function for fold_vec_perm.  Store elements of VECTOR_CST or
> > > > > -   CONSTRUCTOR ARG into array ELTS, which has NELTS elements, and return
> > > > > -   true if successful.  */
> > > > > -
> > > > > -static bool
> > > > > -vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
> > > > > +static tree
> > > > > +get_vector_for_pattern (tree arg0, tree arg1,
> > > > > +                     const vec_perm_indices &sel, unsigned pattern,
> > > > > +                     unsigned sel_npatterns, int &S)
> > > > >  {
> > > > > -  unsigned HOST_WIDE_INT i, nunits;
> > > > > +  unsigned sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
> > > > >
> > > > > -  if (TREE_CODE (arg) == VECTOR_CST
> > > > > -      && VECTOR_CST_NELTS (arg).is_constant (&nunits))
> > > > > +  poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > > > > +  poly_uint64 nsel = sel.length ();
> > > > > +  poly_uint64 esel;
> > > > > +
> > > > > +  if (!multiple_p (nsel, sel_npatterns, &esel))
> > > > > +    return NULL_TREE;
> > > > > +
> > > > > +  poly_uint64 a1 = sel[pattern + sel_npatterns];
> > > > > +  S = 0;
> > > > > +  if (sel_nelts_per_pattern == 3)
> > > > >      {
> > > > > -      for (i = 0; i < nunits; ++i)
> > > > > -     elts[i] = VECTOR_CST_ELT (arg, i);
> > > > > +      poly_uint64 a2 = sel[pattern + 2 * sel_npatterns];
> > > > > +      S = (a2 - a1).to_constant ();
> > > >
> > > > The code hasn't proven that this to_constant is safe.
> > > >
> > > > > +      if (S != 0 && !pow2p_hwi (S))
> > > > > +     return NULL_TREE;
> > > > >      }
> > > > > -  else if (TREE_CODE (arg) == CONSTRUCTOR)
> > > > > +
> > > > > +  poly_uint64 ae = a1 + (esel - 2) * S;
> > > > > +  uint64_t q1, qe;
> > > > > +  poly_uint64 r1, re;
> > > > > +
> > > > > +  if (!can_div_trunc_p (a1, n1, &q1, &r1)
> > > > > +      || !can_div_trunc_p (ae, n1, &qe, &re)
> > > > > +      || (q1 != qe))
> > > > > +    return NULL_TREE;
> > > >
> > > > Going back to the above: this check doesn't make sense for
> > > > sel_nelts_per_pattern != 3.
> > > >
> > > > Thanks,
> > > > Richard
> > > >
> > > > > +
> > > > > +  tree arg = ((q1 & 1) == 0) ? arg0 : arg1;
> > > > > +
> > > > > +  if (S < 0)
> > > > >      {
> > > > > -      constructor_elt *elt;
> > > > > +      poly_uint64 a0 = sel[pattern];
> > > > > +      if (!known_eq (S, a1 - a0))
> > > > > +        return NULL_TREE;
> > > > >
> > > > > -      FOR_EACH_VEC_SAFE_ELT (CONSTRUCTOR_ELTS (arg), i, elt)
> > > > > -     if (i >= nelts || TREE_CODE (TREE_TYPE (elt->value)) == VECTOR_TYPE)
> > > > > -       return false;
> > > > > -     else
> > > > > -       elts[i] = elt->value;
> > > > > +      if (!known_gt (re, VECTOR_CST_NPATTERNS (arg)))
> > > > > +        return NULL_TREE;
> > > > >      }
> > > > > -  else
> > > > > -    return false;
> > > > > -  for (; i < nelts; i++)
> > > > > -    elts[i]
> > > > > -      = fold_convert (TREE_TYPE (TREE_TYPE (arg)), integer_zero_node);
> > > > > -  return true;
> > > > > +
> > > > > +  return arg;
> > > > >  }
> > > > >
> > > > >  /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL
> > > > > @@ -10539,41 +10559,135 @@ fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
> > > > >    unsigned HOST_WIDE_INT nelts;
> > > > >    bool need_ctor = false;
> > > > >
> > > > > -  if (!sel.length ().is_constant (&nelts))
> > > > > -    return NULL_TREE;
> > > > > -  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts)
> > > > > -           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts)
> > > > > -           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts));
> > > > > +  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), sel.length ())
> > > > > +           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)),
> > > > > +                        TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1))));
> > > > >    if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type)
> > > > >        || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type))
> > > > >      return NULL_TREE;
> > > > >
> > > > > -  tree *in_elts = XALLOCAVEC (tree, nelts * 2);
> > > > > -  if (!vec_cst_ctor_to_array (arg0, nelts, in_elts)
> > > > > -      || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts))
> > > > > +  unsigned res_npatterns = 0;
> > > > > +  unsigned res_nelts_per_pattern = 0;
> > > > > +  unsigned sel_npatterns = 0;
> > > > > +  tree *vector_for_pattern = NULL;
> > > > > +
> > > > > +  if (TREE_CODE (arg0) == VECTOR_CST
> > > > > +      && TREE_CODE (arg1) == VECTOR_CST
> > > > > +      && !sel.length ().is_constant ())
> > > > > +    {
> > > > > +      unsigned arg0_npatterns = VECTOR_CST_NPATTERNS (arg0);
> > > > > +      unsigned arg1_npatterns = VECTOR_CST_NPATTERNS (arg1);
> > > > > +      sel_npatterns = sel.encoding ().npatterns ();
> > > > > +
> > > > > +      if (!pow2p_hwi (arg0_npatterns)
> > > > > +       || !pow2p_hwi (arg1_npatterns)
> > > > > +       || !pow2p_hwi (sel_npatterns))
> > > > > +        return NULL_TREE;
> > > > > +
> > > > > +      unsigned N = 1;
> > > > > +      vector_for_pattern = XALLOCAVEC (tree, sel_npatterns);
> > > > > +      for (unsigned i = 0; i < sel_npatterns; i++)
> > > > > +     {
> > > > > +       int S = 0;
> > > > > +       tree op = get_vector_for_pattern (arg0, arg1, sel, i, sel_npatterns, S);
> > > > > +       if (!op)
> > > > > +         return NULL_TREE;
> > > > > +       vector_for_pattern[i] = op;
> > > > > +       unsigned N_pattern =
> > > > > +         (S > 0) ? std::max<int>(S, VECTOR_CST_NPATTERNS (op)) / S : 1;
> > > > > +       N = std::max (N, N_pattern);
> > > > > +     }
> > > > > +
> > > > > +      res_npatterns
> > > > > +        = std::max (sel_npatterns * N, std::max (arg0_npatterns, arg1_npatterns));
> > > > > +
> > > > > +      res_nelts_per_pattern
> > > > > +     = std::max(sel.encoding ().nelts_per_pattern (),
> > > > > +                std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
> > > > > +                          VECTOR_CST_NELTS_PER_PATTERN (arg1)));
> > > > > +    }
> > > > > +  else if (sel.length ().is_constant (&nelts)
> > > > > +        && TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
> > > > > +        && TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).to_constant () == nelts)
> > > > > +    {
> > > > > +      /* For VLS vectors, treat all vectors with
> > > > > +      npatterns = nelts, nelts_per_pattern = 1. */
> > > > > +      res_npatterns = sel_npatterns = nelts;
> > > > > +      res_nelts_per_pattern = 1;
> > > > > +      vector_for_pattern = XALLOCAVEC (tree, nelts);
> > > > > +      for (unsigned i = 0; i < nelts; i++)
> > > > > +        {
> > > > > +       HOST_WIDE_INT index;
> > > > > +       if (!sel[i].is_constant (&index))
> > > > > +         return NULL_TREE;
> > > > > +       vector_for_pattern[i] = (index < nelts) ? arg0 : arg1;
> > > > > +     }
> > > > > +    }
> > > > > +  else
> > > > >      return NULL_TREE;
> > > > >
> > > > > -  tree_vector_builder out_elts (type, nelts, 1);
> > > > > -  for (i = 0; i < nelts; i++)
> > > > > +  tree_vector_builder out_elts (type, res_npatterns,
> > > > > +                             res_nelts_per_pattern);
> > > > > +  unsigned res_nelts = res_npatterns * res_nelts_per_pattern;
> > > > > +  for (unsigned i = 0; i < res_nelts; i++)
> > > > >      {
> > > > > -      HOST_WIDE_INT index;
> > > > > -      if (!sel[i].is_constant (&index))
> > > > > -     return NULL_TREE;
> > > > > -      if (!CONSTANT_CLASS_P (in_elts[index]))
> > > > > -     need_ctor = true;
> > > > > -      out_elts.quick_push (unshare_expr (in_elts[index]));
> > > > > +      /* For VLA vectors, i % sel_npatterns would give the original
> > > > > +         pattern the element belongs to, which is sufficient to get the arg.
> > > > > +      Even if sel_npatterns has been multiplied by N,
> > > > > +      they will always come from the same input vector.
> > > > > +      For VLS vectors, sel_npatterns == res_nelts == nelts,
> > > > > +      so i % sel_npatterns == i since i < nelts */
> > > > > +
> > > > > +      tree arg = vector_for_pattern[i % sel_npatterns];
> > > > > +      unsigned HOST_WIDE_INT index;
> > > > > +
> > > > > +      if (arg == arg0)
> > > > > +     {
> > > > > +       if (!sel[i].is_constant ())
> > > > > +         return NULL_TREE;
> > > > > +       index = sel[i].to_constant ();
> > > > > +     }
> > > > > +      else
> > > > > +        {
> > > > > +       gcc_assert (arg == arg1);
> > > > > +       poly_uint64 n1 = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > > > > +       uint64_t q;
> > > > > +       poly_uint64 r;
> > > > > +
> > > > > +       /* Divide sel[i] by input vector length, to obtain remainder,
> > > > > +          which would be the index for either input vector.  */
> > > > > +       if (!can_div_trunc_p (sel[i], n1, &q, &r))
> > > > > +         return NULL_TREE;
> > > > > +
> > > > > +       if (!r.is_constant (&index))
> > > > > +         return NULL_TREE;
> > > > > +     }
> > > > > +
> > > > > +      tree elem;
> > > > > +      if (TREE_CODE (arg) == CONSTRUCTOR)
> > > > > +        {
> > > > > +       gcc_assert (index < nelts);
> > > > > +       if (index >= vec_safe_length (CONSTRUCTOR_ELTS (arg)))
> > > > > +         return NULL_TREE;
> > > > > +       elem = CONSTRUCTOR_ELT (arg, index)->value;
> > > > > +       if (VECTOR_TYPE_P (TREE_TYPE (elem)))
> > > > > +         return NULL_TREE;
> > > > > +       need_ctor = true;
> > > > > +     }
> > > > > +      else
> > > > > +        elem = vector_cst_elt (arg, index);
> > > > > +      out_elts.quick_push (elem);
> > > > >      }
> > > > >
> > > > >    if (need_ctor)
> > > > >      {
> > > > >        vec<constructor_elt, va_gc> *v;
> > > > > -      vec_alloc (v, nelts);
> > > > > -      for (i = 0; i < nelts; i++)
> > > > > +      vec_alloc (v, res_nelts);
> > > > > +      for (i = 0; i < res_nelts; i++)
> > > > >       CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, out_elts[i]);
> > > > >        return build_constructor (type, v);
> > > > >      }
> > > > > -  else
> > > > > -    return out_elts.build ();
> > > > > +  return out_elts.build ();
> > > > >  }
> > > > >
> > > > >  /* Try to fold a pointer difference of type TYPE two address expressions of
> > > > > @@ -16910,6 +17024,97 @@ test_vec_duplicate_folding ()
> > > > >    ASSERT_TRUE (operand_equal_p (dup5_expr, dup5_cst, 0));
> > > > >  }
> > > > >
> > > > > +static tree
> > > > > +build_vec_int_cst (unsigned npatterns, unsigned nelts_per_pattern,
> > > > > +                int *encoded_elems)
> > > > > +{
> > > > > +  scalar_int_mode int_mode = SCALAR_INT_TYPE_MODE (integer_type_node);
> > > > > +  machine_mode vmode = targetm.vectorize.preferred_simd_mode (int_mode);
> > > > > +  //machine_mode vmode = VNx4SImode;
> > > > > +  poly_uint64 nunits = GET_MODE_NUNITS (vmode);
> > > > > +  tree vectype = build_vector_type (integer_type_node, nunits);
> > > > > +
> > > > > +  tree_vector_builder builder (vectype, npatterns, nelts_per_pattern);
> > > > > +  for (unsigned i = 0; i < npatterns * nelts_per_pattern; i++)
> > > > > +    builder.quick_push (build_int_cst (integer_type_node, encoded_elems[i]));
> > > > > +  return builder.build ();
> > > > > +}
> > > > > +
> > > > > +static void
> > > > > +test_vec_perm_vla_folding ()
> > > > > +{
> > > > > +  int arg0_elems[] = { 1, 11, 2, 12, 3, 13 };
> > > > > +  tree arg0 = build_vec_int_cst (2, 3, arg0_elems);
> > > > > +
> > > > > +  int arg1_elems[] = { 21, 31, 22, 32, 23, 33 };
> > > > > +  tree arg1 = build_vec_int_cst (2, 3, arg1_elems);
> > > > > +
> > > > > +  if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
> > > > > +      || TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)).is_constant ())
> > > > > +    return;
> > > > > +
> > > > > +  /* Case 1: For mask: {0, 1, 2, ...}, npatterns == 1, nelts_per_pattern == 3,
> > > > > +     should select arg0.  */
> > > > > +  {
> > > > > +    int mask_elems[] = {0, 1, 2};
> > > > > +    tree mask = build_vec_int_cst (1, 3, mask_elems);
> > > > > +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> > > > > +    ASSERT_TRUE (res != NULL_TREE);
> > > > > +    ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == 2);
> > > > > +    ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == 3);
> > > > > +
> > > > > +    unsigned res_nelts = vector_cst_encoded_nelts (res);
> > > > > +    for (unsigned i = 0; i < res_nelts; i++)
> > > > > +      ASSERT_TRUE (operand_equal_p (VECTOR_CST_ELT (res, i),
> > > > > +                                 VECTOR_CST_ELT (arg0, i), 0));
> > > > > +  }
> > > > > +
> > > > > +  /* Case 2: For mask: {4, 5, 6, ...}, npatterns == 1, nelts_per_pattern == 3,
> > > > > +     should return NULL because for len = 4 + 4x,
> > > > > +     if x == 0, we select from arg1
> > > > > +     if x > 0, we select from arg0
> > > > > +     and thus cannot determine result at compile time.  */
> > > > > +  {
> > > > > +    int mask_elems[] = {4, 5, 6};
> > > > > +    tree mask = build_vec_int_cst (1, 3, mask_elems);
> > > > > +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> > > > > > +    ASSERT_TRUE (res == NULL_TREE);
> > > > > +  }
> > > > > +
> > > > > +  /* Case 3:
> > > > > +     mask: {0, 0, 0, 1, 0, 2, ...}
> > > > > +     npatterns == 2, nelts_per_pattern == 3
> > > > > +     Pattern {0, ...} should select arg0[0], ie, 1.
> > > > > +     Pattern {0, 1, 2, ...} should select arg0: {1, 11, 2, ...},
> > > > > +     so res = {1, 1, 1, 11, 1, 2, ...}.  */
> > > > > +  {
> > > > > +    int mask_elems[] = {0, 0, 0, 1, 0, 2};
> > > > > +    tree mask = build_vec_int_cst (2, 3, mask_elems);
> > > > > +    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
> > > > > +    ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == 4);
> > > > > +    ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == 3);
> > > > > +
> > > > > +    /* Check encoding: {1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13, ...}  */
> > > > > +    int res_encoded_elems[] = {1, 1, 1, 11, 1, 2, 1, 12, 1, 3, 1, 13};
> > > > > +    for (unsigned i = 0; i < vector_cst_encoded_nelts (res); i++)
> > > > > +      ASSERT_TRUE (wi::to_wide(VECTOR_CST_ELT (res, i)) == res_encoded_elems[i]);
> > > > > +  }
> > > > > +
> > > > > +  /* Case 4:
> > > > > +     mask: {0, 4 + 4x, 0, 5 + 4x, 0, 6 + 4x, ...}
> > > > > +     npatterns == 2, nelts_per_pattern == 3
> > > > > > +     Pattern {0, ...} should select arg0[0]
> > > > > > +     Pattern {4 + 4x, 5 + 4x, 6 + 4x, ...} should select from arg1, since:
> > > > > > +     a1 = 5 + 4x
> > > > > > +     ae = (5 + 4x) + ((4 + 4x) / 2 - 2) * 1
> > > > > > +        = 5 + 6x
> > > > > > +     Since a1 / (4 + 4x) == ae / (4 + 4x) == 1, we select arg1[0], arg1[1], arg1[2], ...
> > > > > +     res: {1, 21, 1, 31, 1, 22, ... }
> > > > > +     FIXME: How to build vector with poly_int elems ?  */
> > > > > +
> > > > > +  /* Case 5: S < 0.  */
> > > > > +}
> > > > > +
> > > > >  /* Run all of the selftests within this file.  */
> > > > >
> > > > >  void
> > > > > @@ -16918,6 +17123,7 @@ fold_const_cc_tests ()
> > > > >    test_arithmetic_folding ();
> > > > >    test_vector_folding ();
> > > > >    test_vec_duplicate_folding ();
> > > > > +  test_vec_perm_vla_folding ();
> > > > >  }
> > > > >
> > > > >  } // namespace selftest

Patch

diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index 4f4ec81c8d4..5e12260211e 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -85,6 +85,9 @@  along with GCC; see the file COPYING3.  If not see
 #include "vec-perm-indices.h"
 #include "asan.h"
 #include "gimple-range.h"
+#include "tree-pretty-print.h"
+#include "gimple-pretty-print.h"
+#include "print-tree.h"
 
 /* Nonzero if we are folding constants inside an initializer or a C++
    manifestly-constant-evaluated context; zero otherwise.
@@ -10496,40 +10499,6 @@  fold_mult_zconjz (location_t loc, tree type, tree expr)
 			  build_zero_cst (itype));
 }
 
-
-/* Helper function for fold_vec_perm.  Store elements of VECTOR_CST or
-   CONSTRUCTOR ARG into array ELTS, which has NELTS elements, and return
-   true if successful.  */
-
-static bool
-vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
-{
-  unsigned HOST_WIDE_INT i, nunits;
-
-  if (TREE_CODE (arg) == VECTOR_CST
-      && VECTOR_CST_NELTS (arg).is_constant (&nunits))
-    {
-      for (i = 0; i < nunits; ++i)
-	elts[i] = VECTOR_CST_ELT (arg, i);
-    }
-  else if (TREE_CODE (arg) == CONSTRUCTOR)
-    {
-      constructor_elt *elt;
-
-      FOR_EACH_VEC_SAFE_ELT (CONSTRUCTOR_ELTS (arg), i, elt)
-	if (i >= nelts || TREE_CODE (TREE_TYPE (elt->value)) == VECTOR_TYPE)
-	  return false;
-	else
-	  elts[i] = elt->value;
-    }
-  else
-    return false;
-  for (; i < nelts; i++)
-    elts[i]
-      = fold_convert (TREE_TYPE (TREE_TYPE (arg)), integer_zero_node);
-  return true;
-}
-
 /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL
    selector.  Return the folded VECTOR_CST or CONSTRUCTOR if successful,
    NULL_TREE otherwise.  */
@@ -10537,45 +10506,149 @@  vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
 tree
 fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
 {
-  unsigned int i;
-  unsigned HOST_WIDE_INT nelts;
-  bool need_ctor = false;
+  poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+  poly_uint64 arg1_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1));
+
+  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type),
+			sel.length ()));
+  gcc_assert (known_eq (arg0_len, arg1_len));
 
-  if (!sel.length ().is_constant (&nelts))
-    return NULL_TREE;
-  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts)
-	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts)
-	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts));
   if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type)
       || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type))
     return NULL_TREE;
 
-  tree *in_elts = XALLOCAVEC (tree, nelts * 2);
-  if (!vec_cst_ctor_to_array (arg0, nelts, in_elts)
-      || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts))
+  unsigned input_npatterns = 0;
+  unsigned out_npatterns = sel.encoding ().npatterns ();
+  unsigned out_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
+
+  /* FIXME: How to reshape fixed length vector_cst, so that
+     npatterns == vector.length () and nelts_per_pattern == 1 ?
+     It seems the vector is canonicalized to minimize npatterns.  */
+
+  if (arg0_len.is_constant ())
+    {
+      /* If arg0, arg1 are fixed width vectors, and sel is VLA,
+         ensure that it is a dup sequence and has same period
+	 as input vector.  */
+
+      if (!sel.length ().is_constant ()
+	  && (sel.encoding ().nelts_per_pattern () > 2
+	      || !known_eq (arg0_len, sel.encoding ().npatterns ())))
+	return NULL_TREE;
+
+      input_npatterns = arg0_len.to_constant ();
+
+      if (sel.length ().is_constant ())
+	{
+	  out_npatterns = sel.length ().to_constant ();
+	  out_nelts_per_pattern = 1;
+	}
+    }
+  else if (TREE_CODE (arg0) == VECTOR_CST
+	   && TREE_CODE (arg1) == VECTOR_CST)
+    {
+      unsigned npatterns = VECTOR_CST_NPATTERNS (arg0);
+      unsigned input_nelts_per_pattern = VECTOR_CST_NELTS_PER_PATTERN (arg0);
+
+      /* If arg0, arg1 are VLA, then ensure that,
+	 (a) sel also has same length as input vectors.
+	 (b) arg0 and arg1 have same encoding.
+	 (c) sel has same number of patterns as input vectors.
+	 (d) if sel is a stepped sequence, then it has same
+	     encoding as input vectors.  */
+
+      if (!known_eq (arg0_len, sel.length ())
+	  || npatterns != VECTOR_CST_NPATTERNS (arg1)
+	  || input_nelts_per_pattern != VECTOR_CST_NELTS_PER_PATTERN (arg1)
+	  || npatterns != sel.encoding ().npatterns ()
+	  || (sel.encoding ().nelts_per_pattern () > 2
+	      && sel.encoding ().nelts_per_pattern () != input_nelts_per_pattern))
+	return NULL_TREE;
+
+      input_npatterns = npatterns;
+    }
+  else
     return NULL_TREE;
 
-  tree_vector_builder out_elts (type, nelts, 1);
-  for (i = 0; i < nelts; i++)
+  tree_vector_builder out_elts_builder (type, out_npatterns,
+					out_nelts_per_pattern);
+  bool need_ctor = false;
+  unsigned out_encoded_nelts = out_npatterns * out_nelts_per_pattern;
+
+  for (unsigned i = 0; i < out_encoded_nelts; i++)
     {
-      HOST_WIDE_INT index;
-      if (!sel[i].is_constant (&index))
+      HOST_WIDE_INT sel_index;
+      if (!sel[i].is_constant (&sel_index))
 	return NULL_TREE;
-      if (!CONSTANT_CLASS_P (in_elts[index]))
-	need_ctor = true;
-      out_elts.quick_push (unshare_expr (in_elts[index]));
+
+      /* Convert sel_index to index of either arg0 or arg1.
+	 For eg:
+	 arg0: {a0, b0, a1, b1, a1 + S, b1 + S, ...}
+	 arg1: {c0, d0, c1, d1, c1 + S, d1 + S, ...}
+	 Both have npatterns == 2, nelts_per_pattern == 3.
+	 Then the combined vector would be:
+	 {a0, b0, c0, d0, a1, b1, c1, d1, a1 + S, b1 + S, c1 + S, d1 + S, ... }
+	 This combined vector will have,
+	 npatterns = 2 * input_npatterns == 4.
+	 sel_index is used to index this above combined vector.
+
+	 Since we don't explicitly build the combined vector, we convert
+	 sel_index to corresponding index for either arg0 or arg1.
+	 For eg, if sel_index == 7,
+	 pattern = 7 % 4 == 3.
+	 Since pattern >= input_npatterns, the elem will come from:
+	 pattern = 3 - input_npatterns ie, pattern 1 from arg1.
+	 elem_index_in_pattern = 7 / 4 == 1.
+	 So the actual index of the element in arg1 would be: 1 + (1 * 2) == 3.
+	 So, sel_index == 7 corresponds to arg1[3], ie, d1.  */
+
+      unsigned pattern = sel_index % (2 * input_npatterns);
+      unsigned elem_index_in_pattern = sel_index / (2 * input_npatterns);
+      tree arg;
+      if (pattern < input_npatterns)
+	arg = arg0;
+      else
+	{
+	  arg = arg1;
+	  pattern -= input_npatterns;
+	}
+
+      unsigned elem_index = (elem_index_in_pattern * input_npatterns) + pattern;
+      tree elem;
+      if (TREE_CODE (arg) == VECTOR_CST)
+	{
+	  /* If arg is fixed width vector, and elem_index goes out of range,
+	     then return NULL_TREE.  */
+	  if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg)).is_constant ()
+	      && elem_index > vector_cst_encoded_nelts (arg))
+	    return NULL_TREE;
+	  elem = vector_cst_elt (arg, elem_index);
+	}
+      else
+	{
+	  gcc_assert (TREE_CODE (arg) == CONSTRUCTOR);
+	  if (elem_index >= CONSTRUCTOR_NELTS (arg))
+	    return NULL_TREE;
+	  elem = CONSTRUCTOR_ELT (arg, elem_index)->value;
+	  if (VECTOR_TYPE_P (TREE_TYPE (elem)))
+	    return NULL_TREE;
+	  need_ctor = true;
+	}
+
+      out_elts_builder.quick_push (unshare_expr (elem));
     }
 
   if (need_ctor)
     {
       vec<constructor_elt, va_gc> *v;
-      vec_alloc (v, nelts);
-      for (i = 0; i < nelts; i++)
-	CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, out_elts[i]);
+      vec_alloc (v, out_encoded_nelts);
+
+      for (unsigned i = 0; i < out_encoded_nelts; i++)
+	CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, out_elts_builder[i]);
       return build_constructor (type, v);
     }
-  else
-    return out_elts.build ();
+
+  return out_elts_builder.build ();
 }
 
 /* Try to fold a pointer difference of type TYPE two address expressions of
@@ -16912,6 +16985,91 @@  test_vec_duplicate_folding ()
   ASSERT_TRUE (operand_equal_p (dup5_expr, dup5_cst, 0));
 }
 
+static tree
+build_vec_int_cst (unsigned npatterns, unsigned nelts_per_pattern,
+		   int *encoded_elems)
+{
+  scalar_int_mode int_mode = SCALAR_INT_TYPE_MODE (integer_type_node);
+  machine_mode vmode = targetm.vectorize.preferred_simd_mode (int_mode);
+  poly_uint64 nunits = GET_MODE_NUNITS (vmode);
+  tree vectype = build_vector_type (integer_type_node, nunits);
+
+  tree_vector_builder builder (vectype, npatterns, nelts_per_pattern);
+  for (unsigned i = 0; i < npatterns * nelts_per_pattern; i++)
+    builder.quick_push (build_int_cst (integer_type_node, encoded_elems[i]));
+  return builder.build ();
+}
+
+static void
+vpe_verify_res (tree res, unsigned npatterns, unsigned nelts_per_pattern,
+		int *encoded_elems)
+{
+  ASSERT_TRUE (res != NULL_TREE);
+  ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == npatterns);
+  ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == nelts_per_pattern);
+
+  for (unsigned i = 0; i < npatterns * nelts_per_pattern; i++)
+    ASSERT_TRUE (wi::to_wide (VECTOR_CST_ELT (res, i))
+			      == encoded_elems[i]);
+}
+
+static void
+test_vec_perm_vla_folding ()
+{
+  /* For all cases
+     arg0: {1, 11, 21, 31, 2, 12, 22, 32, 3, 13, 23, 33, ...}, npatterns == 4, nelts_per_pattern == 3.
+     arg1: {41, 51, 61, 71, 42, 52, 62, 72, 43, 53, 63, 73 ...}, npatterns == 4, nelts_per_pattern == 3.  */
+
+  int arg0_elems[] = { 1, 11, 21, 31, 2, 12, 22, 32, 3, 13, 23, 33 };
+  tree arg0 = build_vec_int_cst (4, 3, arg0_elems);
+
+  int arg1_elems[] = { 41, 51, 61, 71, 42, 52, 62, 72, 43, 53, 63, 73 };
+  tree arg1 = build_vec_int_cst (4, 3, arg1_elems);
+
+  if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant ()
+      || TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)).is_constant ())
+    return;
+
+  /* Case 1: Dup mask sequence.
+     mask = {0, 9, 3, 12, ...}, npatterns == 4, nelts_per_pattern == 1.
+     expected result: {1, 12, 31, 42, ...}, npatterns == 4, nelts_per_pattern == 1.  */
+  {
+    int mask_elems[] = {0, 9, 3, 12};
+    tree mask = build_vec_int_cst (4, 1, mask_elems);
+    if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask)).is_constant ())
+      return;
+    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
+    int res_encoded_elems[] = {1, 12, 31, 42};
+    vpe_verify_res (res, 4, 1, res_encoded_elems);
+  }
+
+  /* Case 2:
+     mask = {0, 4, 1, 5, 8, 12, 9, 13, ...}, npatterns == 4, nelts_per_pattern == 2.
+     expected result: {1, 41, 11, 51, 2, 42, 12, 52, ...}, npatterns == 4, nelts_per_pattern == 2.  */
+  {
+    int mask_elems[] = {0, 4, 1, 5, 8, 12, 9, 13};
+    tree mask = build_vec_int_cst (4, 2, mask_elems);
+    if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask)).is_constant ())
+      return;
+    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
+    int res_encoded_elems[] = {1, 41, 11, 51, 2, 42, 12, 52};
+    vpe_verify_res (res, 4, 2, res_encoded_elems);
+  }
+
+  /* Case 3: Stepped mask sequence.
+     mask = {0, 4, 1, 5, 8, 12, 9, 13, 16, 20, 17, 21}, npatterns == 4, nelts_per_pattern == 3.
+     expected result = {1, 41, 11, 51, 2, 42, 12, 52, 3, 43, 13, 53 ...}, npatterns == 4, nelts_per_pattern == 3.  */
+  {
+    int mask_elems[] = {0, 4, 1, 5, 8, 12, 9, 13, 16, 20, 17, 21};
+    tree mask = build_vec_int_cst (4, 3, mask_elems);
+    if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask)).is_constant ())
+      return;
+    tree res = fold_ternary (VEC_PERM_EXPR, TREE_TYPE (arg0), arg0, arg1, mask);
+    int res_encoded_elems[] = {1, 41, 11, 51, 2, 42, 12, 52, 3, 43, 13, 53};
+    vpe_verify_res (res, 4, 3, res_encoded_elems);
+  }
+}
+
 /* Run all of the selftests within this file.  */
 
 void
@@ -16920,6 +17078,7 @@  fold_const_cc_tests ()
   test_arithmetic_folding ();
   test_vector_folding ();
   test_vec_duplicate_folding ();
+  test_vec_perm_vla_folding ();
 }
 
 } // namespace selftest
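
The sel_index conversion described in the fold_vec_perm comment above can be isolated into a small standalone function for experimenting with the selftest masks. This is a sketch: `elem_ref` and `map_sel_index` are hypothetical names, and the mapping mirrors this prototype only. The cover letter itself flags the conversion as unverified, so it should not be read as a statement of VEC_PERM_EXPR semantics.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type: which input vector (0 for arg0, 1 for arg1)
// and the element index within that vector.
struct elem_ref
{
  unsigned arg;
  uint64_t index;
};

// Map SEL_INDEX, an index into the conceptual combined vector with
// 2 * INPUT_NPATTERNS patterns, to an element of arg0 or arg1, following
// the prototype's scheme.
elem_ref
map_sel_index (uint64_t sel_index, unsigned input_npatterns)
{
  assert (input_npatterns != 0);
  uint64_t combined_npatterns = 2 * uint64_t (input_npatterns);
  uint64_t pattern = sel_index % combined_npatterns;
  uint64_t elem_index_in_pattern = sel_index / combined_npatterns;
  unsigned arg = 0;
  if (pattern >= input_npatterns)
    {
      // Patterns in the upper half of the combined vector belong to arg1.
      arg = 1;
      pattern -= input_npatterns;
    }
  return { arg, elem_index_in_pattern * input_npatterns + pattern };
}
```

With input_npatterns = 2, sel_index 7 maps to arg1[3], as in the comment's example. With input_npatterns = 4, the Case 1 mask {0, 9, 3, 12} maps to arg0[0], arg0[5], arg0[3] and arg1[4], which are 1, 12, 31 and 42 in the selftest vectors.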