diff mbox series

[RFC,v2] Extend fold_vec_perm to handle VLA vectors

Message ID	CAAgBjMnk_N=tgPNBhUu91yt8YN0HcCoWgQQYpshHMqhU=6WgAQ@mail.gmail.com
State	Accepted
Headers	Received-SPF: pass (google.com: domain of gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org designates 2620:52:3:1:0:246e:9693:128c as permitted sender) client-ip=2620:52:3:1:0:246e:9693:128c; DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org EFC243858D33 MIME-Version: 1.0 Date: Mon, 17 Jul 2023 17:44:13 +0530 Message-ID: <CAAgBjMnk_N=tgPNBhUu91yt8YN0HcCoWgQQYpshHMqhU=6WgAQ@mail.gmail.com> Subject: [RFC] [v2] Extend fold_vec_perm to handle VLA vectors To: gcc Patches <gcc-patches@gcc.gnu.org>, Richard Sandiford <richard.sandiford@arm.com> Content-Type: multipart/mixed; boundary="000000000000bc630f0600adbec5" Precedence: list From: Prathamesh Kulkarni via Gcc-patches <gcc-patches@gcc.gnu.org> Reply-To: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> Errors-To: gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org Sender: "Gcc-patches" <gcc-patches-bounces+ouuuleilei=gmail.com@gcc.gnu.org> X-getmail-retrieved-from-mailbox: INBOX
Series	[RFC,v2] Extend fold_vec_perm to handle VLA vectors \| [RFC,v2] Extend fold_vec_perm to handle VLA vectors

Checks

Context	Check	Description
snail/gcc-patch-check	success	Github commit url

Commit Message

Prathamesh Kulkarni July 17, 2023, 12:14 p.m. UTC

  Hi Richard,
This is reworking of patch to extend fold_vec_perm to handle VLA vectors.
The attached patch unifies handling of VLS and VLA vector_csts, while
using fallback code
for ctors.

For VLS vector, the patch ignores underlying encoding, and
uses npatterns = nelts, and nelts_per_pattern = 1.

For VLA patterns, if sel has a stepped sequence, then it
only chooses elements from a particular pattern of a particular
input vector.

To make things simpler, the patch imposes following constraints:
(a) op0_npatterns, op1_npatterns and sel_npatterns are powers of 2.
(b) The step size for a stepped sequence is a power of 2, and
      multiple of npatterns of chosen input vector.
(c) Runtime vector length of sel is a multiple of sel_npatterns.
     So, we don't handle sel.length = 2 + 2x and npatterns = 4.

Eg:
op0, op1: npatterns = 2, nelts_per_pattern = 3
op0_len = op1_len = 16 + 16x.
sel = { 0, 0, 2, 0, 4, 0, ... }
npatterns = 2, nelts_per_pattern = 3.

For pattern {0, 2, 4, ...}
Let,
a1 = 2
S = step size = 2

Let Esel denote number of elements per pattern in sel at runtime.
Esel = (16 + 16x) / npatterns_sel
        = (16 + 16x) / 2
        = (8 + 8x)

So, last element of pattern:
ae = a1 + (Esel - 2) * S
     = 2 + (8 + 8x - 2) * 2
     = 14 + 16x

a1 /trunc arg0_len = 2 / (16 + 16x) = 0
ae /trunc arg0_len = (14 + 16x) / (16 + 16x) = 0
Since both are equal with quotient = 0, we select elements from op0.

Since step size (S) is a multiple of npatterns(op0), we select
all elements from same pattern of op0.

res_npatterns = max (op0_npatterns, max (op1_npatterns, sel_npatterns))
                       = max (2, max (2, 2)
                       = 2

res_nelts_per_pattern = max (op0_nelts_per_pattern,
                                                max (op1_nelts_per_pattern,
                                                         sel_nelts_per_pattern))
                                    = max (3, max (3, 3))
                                    = 3

So res has encoding with npatterns = 2, nelts_per_pattern = 3.
res: { op0[0], op0[0], op0[2], op0[0], op0[4], op0[0], ... }

Unfortunately, this results in an issue for poly_int_cst index:
For example,
op0, op1: npatterns = 1, nelts_per_pattern = 3
op0_len = op1_len = 4 + 4x

sel: { 4 + 4x, 5 + 4x, 6 + 4x, ... } // should choose op1

In this case,
a1 = 5 + 4x
S = (6 + 4x) - (5 + 4x) = 1
Esel = 4 + 4x

ae = a1 + (esel - 2) * S
     = (5 + 4x) + (4 + 4x - 2) * 1
     = 7 + 8x

IIUC, 7 + 8x will always be index for last element of op1 ?
if x = 0, len = 4, 7 + 8x = 7
if x = 1, len = 8, 7 + 8x = 15, etc.
So the stepped sequence will always choose elements
from op1 regardless of vector length for above case ?

However,
ae /trunc op0_len
= (7 + 8x) / (4 + 4x)
which is not defined because 7/4 != 8/4
and we return NULL_TREE, but I suppose the expected result would be:
res: { op1[0], op1[1], op1[2], ... } ?

The patch passes bootstrap+test on aarch64-linux-gnu with and without sve,
and on x86_64-unknown-linux-gnu.
I would be grateful for suggestions on how to proceed.

Thanks,
Prathamesh

Comments

Prathamesh Kulkarni July 25, 2023, 9:26 a.m. UTC | #1

On Mon, 17 Jul 2023 at 17:44, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> Hi Richard,
> This is reworking of patch to extend fold_vec_perm to handle VLA vectors.
> The attached patch unifies handling of VLS and VLA vector_csts, while
> using fallback code
> for ctors.
>
> For VLS vector, the patch ignores underlying encoding, and
> uses npatterns = nelts, and nelts_per_pattern = 1.
>
> For VLA patterns, if sel has a stepped sequence, then it
> only chooses elements from a particular pattern of a particular
> input vector.
>
> To make things simpler, the patch imposes following constraints:
> (a) op0_npatterns, op1_npatterns and sel_npatterns are powers of 2.
> (b) The step size for a stepped sequence is a power of 2, and
>       multiple of npatterns of chosen input vector.
> (c) Runtime vector length of sel is a multiple of sel_npatterns.
>      So, we don't handle sel.length = 2 + 2x and npatterns = 4.
>
> Eg:
> op0, op1: npatterns = 2, nelts_per_pattern = 3
> op0_len = op1_len = 16 + 16x.
> sel = { 0, 0, 2, 0, 4, 0, ... }
> npatterns = 2, nelts_per_pattern = 3.
>
> For pattern {0, 2, 4, ...}
> Let,
> a1 = 2
> S = step size = 2
>
> Let Esel denote number of elements per pattern in sel at runtime.
> Esel = (16 + 16x) / npatterns_sel
>         = (16 + 16x) / 2
>         = (8 + 8x)
>
> So, last element of pattern:
> ae = a1 + (Esel - 2) * S
>      = 2 + (8 + 8x - 2) * 2
>      = 14 + 16x
>
> a1 /trunc arg0_len = 2 / (16 + 16x) = 0
> ae /trunc arg0_len = (14 + 16x) / (16 + 16x) = 0
> Since both are equal with quotient = 0, we select elements from op0.
>
> Since step size (S) is a multiple of npatterns(op0), we select
> all elements from same pattern of op0.
>
> res_npatterns = max (op0_npatterns, max (op1_npatterns, sel_npatterns))
>                        = max (2, max (2, 2)
>                        = 2
>
> res_nelts_per_pattern = max (op0_nelts_per_pattern,
>                                                 max (op1_nelts_per_pattern,
>                                                          sel_nelts_per_pattern))
>                                     = max (3, max (3, 3))
>                                     = 3
>
> So res has encoding with npatterns = 2, nelts_per_pattern = 3.
> res: { op0[0], op0[0], op0[2], op0[0], op0[4], op0[0], ... }
>
> Unfortunately, this results in an issue for poly_int_cst index:
> For example,
> op0, op1: npatterns = 1, nelts_per_pattern = 3
> op0_len = op1_len = 4 + 4x
>
> sel: { 4 + 4x, 5 + 4x, 6 + 4x, ... } // should choose op1
>
> In this case,
> a1 = 5 + 4x
> S = (6 + 4x) - (5 + 4x) = 1
> Esel = 4 + 4x
>
> ae = a1 + (esel - 2) * S
>      = (5 + 4x) + (4 + 4x - 2) * 1
>      = 7 + 8x
>
> IIUC, 7 + 8x will always be index for last element of op1 ?
> if x = 0, len = 4, 7 + 8x = 7
> if x = 1, len = 8, 7 + 8x = 15, etc.
> So the stepped sequence will always choose elements
> from op1 regardless of vector length for above case ?
>
> However,
> ae /trunc op0_len
> = (7 + 8x) / (4 + 4x)
> which is not defined because 7/4 != 8/4
> and we return NULL_TREE, but I suppose the expected result would be:
> res: { op1[0], op1[1], op1[2], ... } ?
>
> The patch passes bootstrap+test on aarch64-linux-gnu with and without sve,
> and on x86_64-unknown-linux-gnu.
> I would be grateful for suggestions on how to proceed.
Hi Richard,
ping: https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624675.html

Thanks,
Prathamesh
>
> Thanks,
> Prathamesh

Richard Sandiford July 25, 2023, 12:55 p.m. UTC | #2

Hi,

Thanks for the rework and sorry for the slow review.

Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> Hi Richard,
> This is reworking of patch to extend fold_vec_perm to handle VLA vectors.
> The attached patch unifies handling of VLS and VLA vector_csts, while
> using fallback code
> for ctors.
>
> For VLS vector, the patch ignores underlying encoding, and
> uses npatterns = nelts, and nelts_per_pattern = 1.
>
> For VLA patterns, if sel has a stepped sequence, then it
> only chooses elements from a particular pattern of a particular
> input vector.
>
> To make things simpler, the patch imposes following constraints:
> (a) op0_npatterns, op1_npatterns and sel_npatterns are powers of 2.
> (b) The step size for a stepped sequence is a power of 2, and
>       multiple of npatterns of chosen input vector.
> (c) Runtime vector length of sel is a multiple of sel_npatterns.
>      So, we don't handle sel.length = 2 + 2x and npatterns = 4.
>
> Eg:
> op0, op1: npatterns = 2, nelts_per_pattern = 3
> op0_len = op1_len = 16 + 16x.
> sel = { 0, 0, 2, 0, 4, 0, ... }
> npatterns = 2, nelts_per_pattern = 3.
>
> For pattern {0, 2, 4, ...}
> Let,
> a1 = 2
> S = step size = 2
>
> Let Esel denote number of elements per pattern in sel at runtime.
> Esel = (16 + 16x) / npatterns_sel
>         = (16 + 16x) / 2
>         = (8 + 8x)
>
> So, last element of pattern:
> ae = a1 + (Esel - 2) * S
>      = 2 + (8 + 8x - 2) * 2
>      = 14 + 16x
>
> a1 /trunc arg0_len = 2 / (16 + 16x) = 0
> ae /trunc arg0_len = (14 + 16x) / (16 + 16x) = 0
> Since both are equal with quotient = 0, we select elements from op0.
>
> Since step size (S) is a multiple of npatterns(op0), we select
> all elements from same pattern of op0.
>
> res_npatterns = max (op0_npatterns, max (op1_npatterns, sel_npatterns))
>                        = max (2, max (2, 2)
>                        = 2
>
> res_nelts_per_pattern = max (op0_nelts_per_pattern,
>                                                 max (op1_nelts_per_pattern,
>                                                          sel_nelts_per_pattern))
>                                     = max (3, max (3, 3))
>                                     = 3
>
> So res has encoding with npatterns = 2, nelts_per_pattern = 3.
> res: { op0[0], op0[0], op0[2], op0[0], op0[4], op0[0], ... }
>
> Unfortunately, this results in an issue for poly_int_cst index:
> For example,
> op0, op1: npatterns = 1, nelts_per_pattern = 3
> op0_len = op1_len = 4 + 4x
>
> sel: { 4 + 4x, 5 + 4x, 6 + 4x, ... } // should choose op1
>
> In this case,
> a1 = 5 + 4x
> S = (6 + 4x) - (5 + 4x) = 1
> Esel = 4 + 4x
>
> ae = a1 + (esel - 2) * S
>      = (5 + 4x) + (4 + 4x - 2) * 1
>      = 7 + 8x
>
> IIUC, 7 + 8x will always be index for last element of op1 ?
> if x = 0, len = 4, 7 + 8x = 7
> if x = 1, len = 8, 7 + 8x = 15, etc.
> So the stepped sequence will always choose elements
> from op1 regardless of vector length for above case ?
>
> However,
> ae /trunc op0_len
> = (7 + 8x) / (4 + 4x)
> which is not defined because 7/4 != 8/4
> and we return NULL_TREE, but I suppose the expected result would be:
> res: { op1[0], op1[1], op1[2], ... } ?
>
> The patch passes bootstrap+test on aarch64-linux-gnu with and without sve,
> and on x86_64-unknown-linux-gnu.
> I would be grateful for suggestions on how to proceed.
>
> Thanks,
> Prathamesh
>
> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> index a02ede79fed..8028b3e8e9a 100644
> --- a/gcc/fold-const.cc
> +++ b/gcc/fold-const.cc
> @@ -85,6 +85,10 @@ along with GCC; see the file COPYING3.  If not see
>  #include "vec-perm-indices.h"
>  #include "asan.h"
>  #include "gimple-range.h"
> +#include <algorithm>
> +#include "tree-pretty-print.h"
> +#include "gimple-pretty-print.h"
> +#include "print-tree.h"
>  
>  /* Nonzero if we are folding constants inside an initializer or a C++
>     manifestly-constant-evaluated context; zero otherwise.
> @@ -10493,15 +10497,9 @@ fold_mult_zconjz (location_t loc, tree type, tree expr)
>  static bool
>  vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
>  {
> -  unsigned HOST_WIDE_INT i, nunits;
> +  unsigned HOST_WIDE_INT i;
>  
> -  if (TREE_CODE (arg) == VECTOR_CST
> -      && VECTOR_CST_NELTS (arg).is_constant (&nunits))
> -    {
> -      for (i = 0; i < nunits; ++i)
> -	elts[i] = VECTOR_CST_ELT (arg, i);
> -    }
> -  else if (TREE_CODE (arg) == CONSTRUCTOR)
> +  if (TREE_CODE (arg) == CONSTRUCTOR)
>      {
>        constructor_elt *elt;
>  
> @@ -10519,6 +10517,230 @@ vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
>    return true;
>  }
>  
> +/* Return a vector with (NPATTERNS, NELTS_PER_PATTERN) encoding.  */
> +
> +static tree
> +vector_cst_reshape (tree vec, unsigned npatterns, unsigned nelts_per_pattern)
> +{
> +  gcc_assert (pow2p_hwi (npatterns));
> +
> +  if (VECTOR_CST_NPATTERNS (vec) == npatterns
> +      && VECTOR_CST_NELTS_PER_PATTERN (vec) == nelts_per_pattern)
> +    return vec;
> +
> +  tree v = make_vector (exact_log2 (npatterns), nelts_per_pattern);
> +  TREE_TYPE (v) = TREE_TYPE (vec);
> +
> +  unsigned nelts = npatterns * nelts_per_pattern;
> +  for (unsigned i = 0; i < nelts; i++)
> +    VECTOR_CST_ENCODED_ELT(v, i) = vector_cst_elt (vec, i);
> +  return v;
> +}
> +
> +/* Helper routine for fold_vec_perm_vla to check if ARG is a suitable
> +   operand for VLA vec_perm folding. If arg is VLS, then set
> +   NPATTERNS = nelts and NELTS_PER_PATTERN = 1.  */
> +
> +static tree
> +valid_operand_for_fold_vec_perm_cst_p (tree arg)
> +{
> +  if (TREE_CODE (arg) != VECTOR_CST)
> +    return NULL_TREE;
> +
> +  unsigned HOST_WIDE_INT nelts;
> +  unsigned npatterns, nelts_per_pattern;
> +  if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg)).is_constant (&nelts))
> +    {
> +      npatterns = nelts;
> +      nelts_per_pattern = 1;
> +    }
> +  else
> +    {
> +      npatterns = VECTOR_CST_NPATTERNS (arg);
> +      nelts_per_pattern = VECTOR_CST_NELTS_PER_PATTERN (arg);
> +    }
> +
> +  if (!pow2p_hwi (npatterns))
> +    return NULL_TREE;
> +
> +  return vector_cst_reshape (arg, npatterns, nelts_per_pattern);
> +}

I don't think we should reshape the vectors for VLS, since it would
create more nodes for GC to clean up later.  Also, the "compact" encoding
is canonical even for VLS, so the reshaping would effectively create
noncanonical constants (even if only temporarily).

Instead, I think we should change the later:

> +  if (!valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, sel_npatterns,
> +					   sel_nelts_per_pattern, reason, verbose))
> +    return NULL_TREE;

so that it comes after the computation of res_npatterns and
res_nelts_per_pattern.  Then, if valid_mask_for_fold_vec_perm_cst_p
returns false, and if the result type has a constant number of elements,
we should:

* set res_npatterns to that number of elements
* set res_nelts_per_pattern to 1
* continue instead of returning null

The loop that follows will then do the correct thing for each element.

The check for a power of 2 would then go in
valid_mask_for_fold_vec_perm_cst_p rather than
valid_operand_for_fold_vec_perm_cst_p.

With that change, I think:

> +
> +/* Helper routine for fold_vec_perm_cst to check if SEL is a suitable
> +   mask for VLA vec_perm folding. Set SEL_NPATTERNS and SEL_NELTS_PER_PATTERN
> +   similarly.  */
> +
> +static bool
> +valid_mask_for_fold_vec_perm_cst_p (tree arg0, tree arg1,
> +				    const vec_perm_indices &sel,
> +				    unsigned& sel_npatterns,
> +				    unsigned& sel_nelts_per_pattern,
> +				    char *reason = NULL,
> +				    bool verbose = false)
> +{
> +  unsigned HOST_WIDE_INT nelts;
> +  if (sel.length ().is_constant (&nelts))
> +    {
> +      sel_npatterns = nelts;
> +      sel_nelts_per_pattern = 1;
> +    }
> +  else
> +    {
> +      sel_npatterns = sel.encoding ().npatterns ();
> +      sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
> +    }

...we should use the "else" code unconditionally.

The function comment should describe "reason" and "verbose".  It looks
like these are debug parameters, is that right?

> +
> +  if (!pow2p_hwi (sel_npatterns))
> +    {
> +      if (reason)
> +	strcpy (reason, "sel_npatterns is not power of 2");
> +      return false;
> +    }
> +
> +  /* We want to avoid cases where sel.length is not a multiple of npatterns.
> +     For eg: sel.length = 2 + 2x, and sel npatterns = 4.  */
> +  poly_uint64 esel;
> +  if (!multiple_p (sel.length (), sel_npatterns, &esel))
> +    {
> +      if (reason)
> +	strcpy (reason, "sel.length is not multiple of sel_npatterns");
> +      return false;
> +    }
> +
> +  if (sel_nelts_per_pattern < 3)
> +    return true;
> +
> +  for (unsigned pattern = 0; pattern < sel_npatterns; pattern++)
> +    {
> +      poly_uint64 a1 = sel[pattern + sel_npatterns];
> +      poly_uint64 a2 = sel[pattern + 2 * sel_npatterns];
> +
> +      poly_uint64 step = a2 - a1;
> +      if (!step.is_constant ())
> +	{
> +	  if (reason)
> +	    strcpy (reason, "step is not constant");
> +	  return false;
> +	}
> +      int S = step.to_constant ();
> +      if (S == 0)
> +	continue;

Might be simpler as:

      HOST_WIDE_INT step;
      if (!(a2 - a1).is_constant (&step))
        {
          ...
        }
      if (step == 0)
        continue;

> +
> +      // FIXME: Punt on S < 0 for now, revisit later.
> +      if (S < 0)
> +	return false;
> +
> +      if (!pow2p_hwi (S))
> +	{
> +	  if (reason)
> +	    strcpy (reason, "step is not power of 2");
> +	  return false;
> +	}
> +
> +      poly_uint64 ae = a1 + (esel - 2) * S;
> +      poly_uint64 arg_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +      uint64_t q1, qe;
> +      poly_uint64 r1, re;
> +
> +      /* Ensure that stepped sequence of the pattern selects elements
> +	 only from the same input vector.  */
> +      if (!(can_div_trunc_p (a1, arg_len, &q1, &r1)
> +	    && can_div_trunc_p (ae, arg_len, &qe, &re)
> +	    && (q1 == qe)))

Nit: redundant brackets around q1 == qe

> +	{
> +	  if (reason)
> +	    strcpy (reason, "crossed input vectors");
> +	  return false;
> +	}
> +      unsigned arg_npatterns
> +	= ((q1 & 0) == 0) ? VECTOR_CST_NPATTERNS (arg0)
> +			  : VECTOR_CST_NPATTERNS (arg1);
> +
> +      gcc_assert (pow2p_hwi (arg_npatterns));
> +      if (S < arg_npatterns)

Since (as the reason string says) the condition is logically that the
step is a multiple of the number of patterns, rather than simply bigger,
I think it's more obvious to write it as:

      if (!multiple_p (step, arg_npatterns))

without the gcc_assert.

> +	{
> +	  if (reason)
> +	    strcpy (reason, "S is not multiple of npatterns");
> +	  return false;
> +	}
> +    }
> +
> +  return true;
> +}
> +
> +/* Try to fold permutation of ARG0 and ARG1 with SEL selector when
> +   the input vectors are VECTOR_CST. Return NULL_TREE otherwise.  */
> +
> +static tree
> +fold_vec_perm_cst (tree type, tree arg0, tree arg1, const vec_perm_indices &sel,
> +		   char *reason = NULL, bool verbose = false)
> +{
> +  /* Allow cases where:
> +     (1) arg0, arg1 and sel are VLS.
> +     (2) arg0, arg1, and sel are VLA.
> +     Punt if input vectors are VLA but sel is VLS or vice-versa.  */
> +  poly_uint64 arg_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +  if (!((arg_len.is_constant () && sel.length ().is_constant ())
> +	 || (!arg_len.is_constant () && !sel.length ().is_constant ())))
> +    return NULL_TREE;

With the changes above, I'm not sure we need to enforce this.

> +  unsigned sel_npatterns, sel_nelts_per_pattern;
> +
> +  arg0 = valid_operand_for_fold_vec_perm_cst_p (arg0);
> +  if (!arg0)
> +    return NULL_TREE;
> +
> +  arg1 = valid_operand_for_fold_vec_perm_cst_p (arg1);
> +  if (!arg1)
> +    return NULL_TREE;
> +
> +  if (!valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, sel_npatterns,
> +					   sel_nelts_per_pattern, reason, verbose))
> +    return NULL_TREE;
> +
> +  unsigned res_npatterns
> +    = std::max (VECTOR_CST_NPATTERNS (arg0),
> +		std::max (VECTOR_CST_NPATTERNS (arg1), sel_npatterns));
> +
> +  unsigned res_nelts_per_pattern
> +    = std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
> +		std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1),
> +			  sel_nelts_per_pattern));
> +
> +

Nit: too much vertical space.

> +  tree_vector_builder out_elts (type, res_npatterns, res_nelts_per_pattern);
> +  unsigned res_nelts = res_npatterns * res_nelts_per_pattern;
> +  for (unsigned i = 0; i < res_nelts; i++)
> +    {
> +      poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +      uint64_t q;
> +      poly_uint64 r;
> +      unsigned HOST_WIDE_INT index;
> +
> +      /* Punt if sel[i] /trunc_div len cannot be determined,
> +	 because the input vector to be chosen will depend on
> +	 runtime vector length.
> +	 For example if len == 4 + 4x, and sel[i] == 4,
> +	 If len at runtime equals 4, we choose arg1[0].
> +	 For any other value of len > 4 at runtime, we choose arg0[4].
> +	 which makes the element choice dependent on runtime vector length.  */
> +      if (!can_div_trunc_p (sel[i], len, &q, &r))
> +	return NULL_TREE;
> +
> +      /* sel[i] % len will give the index of element in the chosen input
> +	 vector. For example if sel[i] == 5 + 4x and len == 4 + 4x,
> +	 we will choose arg1[1] since (5 + 4x) % (4 + 4x) == 1.  */
> +      if (!r.is_constant (&index))
> +	return NULL_TREE;
> +
> +      tree arg = ((q & 1) == 0) ? arg0 : arg1;
> +      tree elem = vector_cst_elt (arg, index);
> +      out_elts.quick_push (elem);
> +    }
> +
> +  return out_elts.build ();
> +}
> +
>  /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL
>     selector.  Return the folded VECTOR_CST or CONSTRUCTOR if successful,
>     NULL_TREE otherwise.  */
> @@ -10528,43 +10750,39 @@ fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
>  {
>    unsigned int i;
>    unsigned HOST_WIDE_INT nelts;
> -  bool need_ctor = false;
>  
> -  if (!sel.length ().is_constant (&nelts))
> -    return NULL_TREE;
> -  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts)
> -	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts)
> -	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts));
> +  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), sel.length ())
> +	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)),
> +			   TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1))));
> +
>    if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type)
>        || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type))
>      return NULL_TREE;
>  
> +  if (TREE_CODE (arg0) == VECTOR_CST
> +      && TREE_CODE (arg1) == VECTOR_CST)
> +    return fold_vec_perm_cst (type, arg0, arg1, sel);
> +
> +  /* For fall back case, we want to ensure arg and sel have same len.  */
> +  if (!(sel.length ().is_constant (&nelts)
> +	&& known_eq (sel.length (), TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)))))
> +    return NULL_TREE;

I think the known_eq should be an unconditional assert (after the call
to fold_vec_perm_cst but before the return NULL_TREE).

Looks good otherwise.  I'll do a separate review for the tests (but they
look pretty extensive, thanks).

Richard

> +
>    tree *in_elts = XALLOCAVEC (tree, nelts * 2);
>    if (!vec_cst_ctor_to_array (arg0, nelts, in_elts)
>        || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts))
>      return NULL_TREE;
>  
> -  tree_vector_builder out_elts (type, nelts, 1);
> +  vec<constructor_elt, va_gc> *v;
> +  vec_alloc (v, nelts);
>    for (i = 0; i < nelts; i++)
>      {
>        HOST_WIDE_INT index;
>        if (!sel[i].is_constant (&index))
>  	return NULL_TREE;
> -      if (!CONSTANT_CLASS_P (in_elts[index]))
> -	need_ctor = true;
> -      out_elts.quick_push (unshare_expr (in_elts[index]));
> +      CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, in_elts[index]);
>      }
> -
> -  if (need_ctor)
> -    {
> -      vec<constructor_elt, va_gc> *v;
> -      vec_alloc (v, nelts);
> -      for (i = 0; i < nelts; i++)
> -	CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, out_elts[i]);
> -      return build_constructor (type, v);
> -    }
> -  else
> -    return out_elts.build ();
> +  return build_constructor (type, v);
>  }
>  
>  /* Try to fold a pointer difference of type TYPE two address expressions of
> @@ -16891,6 +17109,388 @@ test_arithmetic_folding ()
>  				   x);
>  }
>  
> +namespace test_fold_vec_perm_cst {
> +
> +static tree
> +get_preferred_vectype (tree inner_type)
> +{
> +  scalar_int_mode int_mode = SCALAR_INT_TYPE_MODE (inner_type);
> +  machine_mode vmode = targetm.vectorize.preferred_simd_mode (int_mode);
> +  poly_uint64 nunits = GET_MODE_NUNITS (vmode);
> +  return build_vector_type (inner_type, nunits);
> +}
> +
> +static tree
> +build_vec_cst_rand (tree inner_type, unsigned npatterns,
> +		    unsigned nelts_per_pattern, int S = 0)
> +{
> +  tree vectype = get_preferred_vectype (inner_type);
> +  tree_vector_builder builder (vectype, npatterns, nelts_per_pattern);
> +
> +  // Fill a0 for each pattern
> +  for (unsigned i = 0; i < npatterns; i++)
> +    builder.quick_push (build_int_cst (inner_type, rand () % 100));
> +
> +  if (nelts_per_pattern == 1)
> +    return builder.build ();
> +
> +  // Fill a1 for each pattern
> +  for (unsigned i = 0; i < npatterns; i++)
> +    builder.quick_push (build_int_cst (inner_type, rand () % 100));
> +
> +  if (nelts_per_pattern == 2)
> +    return builder.build ();
> +
> +  for (unsigned i = npatterns * 2; i < npatterns * nelts_per_pattern; i++)
> +    {
> +      tree prev_elem = builder[i - npatterns];
> +      int prev_elem_val = TREE_INT_CST_LOW (prev_elem);
> +      int val = prev_elem_val + S;
> +      builder.quick_push (build_int_cst (inner_type, val));
> +    }
> +
> +  return builder.build ();
> +}
> +
> +static void
> +validate_res (unsigned npatterns, unsigned nelts_per_pattern,
> +	      tree res, tree *expected_res)
> +{
> +  ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == npatterns);
> +  ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == nelts_per_pattern);
> +
> +  for (unsigned i = 0; i < vector_cst_encoded_nelts (res); i++)
> +    ASSERT_TRUE (operand_equal_p (VECTOR_CST_ELT (res, i), expected_res[i], 0));
> +}
> +
> +/* Verify VLA vec_perm folding.  */
> +
> +static void
> +test_stepped ()
> +{
> +  /* Case 1: sel = {0, 1, 2, ...}
> +     npatterns = 1, nelts_per_pattern = 3  */
> +  {
> +    tree arg0 = build_vec_cst_rand (char_type_node, 1, 3, 2);
> +    tree arg1 = build_vec_cst_rand (char_type_node, 1, 3, 2);
> +    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    vec_perm_builder builder (arg0_len, 1, 3);
> +    builder.quick_push (0);
> +    builder.quick_push (1);
> +    builder.quick_push (2);
> +
> +    vec_perm_indices sel (builder, 2, arg0_len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg0, 1),
> +			    vector_cst_elt (arg0, 2) };
> +    validate_res (1, 3, res, expected_res);
> +  }
> +
> +#if 0
> +  /* Case 2: sel = {len, len + 1, len + 2, ... }
> +     npatterns = 1, nelts_per_pattern = 3
> +     FIXME: This should return
> +     expected res: { op1[0], op1[1], op1[2], ... }
> +     however it returns NULL_TREE.  */
> +  {
> +    vec_perm_builder builder (arg0_len, 1, 3);
> +    builder.quick_push (arg0_len);
> +    builder.quick_push (arg0_len + 1);
> +    builder.quick_push (arg0_len + 2);
> +
> +    vec_perm_indices sel (builder, 2, arg0_len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +  }
> +#endif
> +
> +  /* Case 3: Leading element of arg1, stepped sequence: pattern 0 of arg0.
> +     sel = {len, 0, 0, 0, 2, 0, ...}
> +     npatterns = 2, nelts_per_pattern = 3.
> +     Use extra pattern {0, ...} to lower number of elements per pattern.  */
> +  {
> +    tree arg0 = build_vec_cst_rand (char_type_node, 1, 3, 2);
> +    tree arg1 = build_vec_cst_rand (char_type_node, 1, 3, 2);
> +    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    vec_perm_builder builder (arg0_len, 2, 3);
> +    builder.quick_push (arg0_len);
> +    int mask_elems[] = { 0, 0, 0, 2, 0 };
> +    for (int i = 0; i < 5; i++)
> +      builder.quick_push (mask_elems[i]);
> +
> +    vec_perm_indices sel (builder, 2, arg0_len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +    gcc_assert (res);
> +  }
> +
> +  /* Case 4:
> +     sel = { len, 0, 2, ... } npatterns = 1, nelts_per_pattern = 3.
> +     This should return NULL because we cross the input vectors.
> +     Because,
> +     arg0_len = 16 + 16x
> +     a1 = 0
> +     S = 2
> +     esel = arg0_len / npatterns_sel = 16+16x/1 = 16 + 16x
> +     ae = a1 + (esel - 2) * S
> +	= 0 + (16 + 16x - 2) * 2
> +	= 28 + 32x
> +     a1 / arg0_len = 0 /trunc (16 + 16x) = 0
> +     ae / arg0_len = (28 + 32x) /trunc (16 + 16x), which is not defined,
> +     since 28/16 != 32/16.
> +     So return NULL_TREE.  */
> +  {
> +    tree arg0 = build_vec_cst_rand (char_type_node, 1, 3, 2);
> +    tree arg1 = build_vec_cst_rand (char_type_node, 1, 3, 2);
> +    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    poly_uint64 ae = (arg0_len - 2) * 2;
> +    uint64_t qe;
> +    poly_uint64 re;
> +
> +    vec_perm_builder builder (arg0_len, 1, 3);
> +    builder.quick_push (arg0_len);
> +    builder.quick_push (0);
> +    builder.quick_push (2);
> +
> +    vec_perm_indices sel (builder, 2, arg0_len);
> +    char reason[100] = "\0";
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel, reason, false);
> +    gcc_assert (res == NULL_TREE);
> +    gcc_assert (!strcmp (reason, "crossed input vectors"));
> +  }
> +
> +  /* Case 5: Select elements from different patterns.
> +     Should return NULL.  */
> +  {
> +    tree op0 = build_vec_cst_rand (char_type_node, 2, 3, 2);
> +    tree op1 = build_vec_cst_rand (char_type_node, 2, 3, 2);
> +    poly_uint64 op0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (op0));
> +
> +    vec_perm_builder builder (op0_len, 2, 3);
> +    builder.quick_push (op0_len);
> +    int mask_elems[] = { 0, 0, 0, 1, 0 };
> +    for (int i = 0; i < 5; i++)
> +      builder.quick_push (mask_elems[i]);
> +
> +    vec_perm_indices sel (builder, 2, op0_len);
> +    char reason[100] = "\0";
> +    tree res = fold_vec_perm_cst (TREE_TYPE (op0), op0, op1, sel, reason, false);
> +    gcc_assert (res == NULL_TREE);
> +    gcc_assert (!strcmp (reason, "S is not multiple of npatterns"));
> +  }
> +
> +  /* Case 6: Select pattern 0 of op0 and dup of op0[0]
> +     op0, op1, sel: npatterns = 2, nelts_per_pattern = 3
> +     sel = { 0, 0, 2, 0, 4, 0, ... }.
> +
> +     For pattern {0, 2, 4, ...}:
> +     a1 = 2
> +     len = 16 + 16x
> +     S = 2
> +     esel = len / npatterns_sel = (16 + 16x) / 2 = (8 + 8x)
> +     ae = a1 + (esel - 2) * S
> +	= 2 + (8 + 8x - 2) * 2
> +	= 14 + 16x
> +     a1 / arg0_len = 2 / (16 + 16x) = 0
> +     ae / arg0_len = (14 + 16x) / (16 + 16x) = 0
> +     So a1/arg0_len = ae/arg0_len = 0
> +     Hence we select from first vector op0
> +     S = 2, npatterns = 2.
> +     Since S is multiple of npatterns(op0), we are selecting from
> +     same pattern of op0.
> +
> +     For pattern {0, ...}, we are choosing { op0[0] ... }
> +     So res will be combination of above patterns:
> +     res: { op0[0], op0[0], op0[2], op0[0], op0[4], op0[0], ... }
> +     with npatterns = 2, nelts_per_pattern = 3.  */
> +  {
> +    tree op0 = build_vec_cst_rand (char_type_node, 2, 3, 2);
> +    tree op1 = build_vec_cst_rand (char_type_node, 2, 3, 2);
> +    poly_uint64 op0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (op0));
> +
> +    vec_perm_builder builder (op0_len, 2, 3);
> +    int mask_elems[] = { 0, 0, 2, 0, 4, 0 };
> +    for (int i = 0; i < 6; i++)
> +      builder.quick_push (mask_elems[i]);
> +
> +    vec_perm_indices sel (builder, 2, op0_len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (op0), op0, op1, sel);
> +    tree expected_res[] = { vector_cst_elt (op0, 0), vector_cst_elt (op0, 0),
> +			    vector_cst_elt (op0, 2), vector_cst_elt (op0, 0),
> +			    vector_cst_elt (op0, 4), vector_cst_elt (op0, 0) };
> +    validate_res (2, 3, res, expected_res);
> +  }
> +
> +  /* Case 7: sel_npatterns > op_npatterns;
> +     op0, op1: npatterns = 2, nelts_per_pattern = 3
> +     sel: { 0, 0, 1, len, 2, 0, 3, len, 4, 0, 5, len, ...},
> +     with npatterns = 4, nelts_per_pattern = 3.  */
> +  {
> +    tree op0 = build_vec_cst_rand (char_type_node, 2, 3, 2);
> +    tree op1 = build_vec_cst_rand (char_type_node, 2, 3, 2);
> +    poly_uint64 op0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (op0));
> +
> +    vec_perm_builder builder(op0_len, 4, 3);
> +    // -1 is used as place holder for poly_int_cst
> +    int mask_elems[] = { 0, 0, 1, -1, 2, 0, 3, -1, 4, 0, 5, -1 };
> +    for (int i = 0; i < 12; i++)
> +      builder.quick_push ((mask_elems[i] == -1) ? op0_len : mask_elems[i]);
> +
> +    vec_perm_indices sel (builder, 2, op0_len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (op0), op0, op1, sel);
> +    tree expected_res[] = { vector_cst_elt (op0, 0), vector_cst_elt (op0, 0),
> +			    vector_cst_elt (op0, 1), vector_cst_elt (op1, 0),
> +			    vector_cst_elt (op0, 2), vector_cst_elt (op0, 0),
> +			    vector_cst_elt (op0, 3), vector_cst_elt (op1, 0),
> +			    vector_cst_elt (op0, 4), vector_cst_elt (op0, 0),
> +			    vector_cst_elt (op0, 5), vector_cst_elt (op1, 0) };
> +    validate_res (4, 3, res, expected_res);
> +  }
> +}
> +
> +static void
> +test_dup ()
> +{
> +  /* Case 1: mask = {0, ...} */
> +  {
> +    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> +    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    vec_perm_builder builder (len, 1, 1);
> +    builder.quick_push (0);
> +    vec_perm_indices sel (builder, 2, len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +    tree expected_res[] = { vector_cst_elt (res, 0) };
> +    validate_res (1, 1, res, expected_res);
> +  }
> +
> +  /* Case 2: mask = {len, ...} */
> +  {
> +    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> +    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    vec_perm_builder builder (len, 1, 1);
> +    builder.quick_push (len);
> +    vec_perm_indices sel (builder, 2, len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +    tree expected_res[] = { vector_cst_elt (arg1, 0) };
> +    validate_res (1, 1, res, expected_res);
> +  }
> +
> +  /* Case 3: mask = { 0, len, ... } */
> +  {
> +    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> +    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    vec_perm_builder builder (len, 2, 1);
> +    builder.quick_push (0);
> +    builder.quick_push (len);
> +    vec_perm_indices sel (builder, 2, len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0) };
> +    validate_res (2, 1, res, expected_res);
> +  }
> +
> +  /* Case 4: mask = { 0, len, 1, len+1, ... } */
> +  {
> +    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> +    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    vec_perm_builder builder (len, 2, 2);
> +    builder.quick_push (0);
> +    builder.quick_push (len);
> +    builder.quick_push (1);
> +    builder.quick_push (len + 1);
> +    vec_perm_indices sel (builder, 2, len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0),
> +			    vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
> +			  };
> +    validate_res (2, 2, res, expected_res);
> +  }
> +
> +  /* Case 5: mask = { 0, len, 1, len+1, .... }
> +     npatterns = 4, nelts_per_pattern = 1 */
> +  {
> +    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> +    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    vec_perm_builder builder (len, 4, 1);
> +    builder.quick_push (0);
> +    builder.quick_push (len);
> +    builder.quick_push (1);
> +    builder.quick_push (len + 1);
> +    vec_perm_indices sel (builder, 2, len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0),
> +			    vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
> +			  };
> +    validate_res (4, 1, res, expected_res);
> +  }
> +
> +  /* Case 6: mask = {0, 4, ...}
> +     npatterns = 1, nelts_per_pattern = 2.
> +     This should return NULL_TREE because the index 4 may choose
> +     from either arg0 or arg1 depending on vector length.  */
> +  {
> +    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> +    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    vec_perm_builder builder (len, 1, 2);
> +    builder.quick_push (0);
> +    builder.quick_push (4);
> +    vec_perm_indices sel (builder, 2, len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +    ASSERT_TRUE (res == NULL_TREE);
> +  }
> +
> +  /* Case 7: npatterns(arg0) = 4 > npatterns(sel) = 2
> +     mask = {0, len, 1, len + 1, ...}
> +     sel_npatterns = 2, sel_nelts_per_pattern = 2.  */
> +  {
> +    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> +    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> +    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    vec_perm_builder builder (arg0_len, 2, 2);
> +    builder.quick_push (0);
> +    builder.quick_push (arg0_len);
> +    builder.quick_push (1);
> +    builder.quick_push (arg0_len + 1);
> +    vec_perm_indices sel (builder, 2, arg0_len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0),
> +			    vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
> +			  };
> +    validate_res (2, 2, res, expected_res);
> +  }
> +}
> +
> +static void
> +test ()
> +{
> +  tree vectype = get_preferred_vectype (integer_type_node);
> +  if (TYPE_VECTOR_SUBPARTS (vectype).is_constant ())
> +    return;
> +
> +  test_dup ();
> +  test_stepped ();
> +}
> +};
> +
>  /* Verify that various binary operations on vectors are folded
>     correctly.  */
>  
> @@ -16942,6 +17542,7 @@ fold_const_cc_tests ()
>    test_arithmetic_folding ();
>    test_vector_folding ();
>    test_vec_duplicate_folding ();
> +  test_fold_vec_perm_cst::test ();
>  }
>  
>  } // namespace selftest

Prathamesh Kulkarni July 28, 2023, 12:57 p.m. UTC | #3

On Tue, 25 Jul 2023 at 18:25, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Hi,
>
> Thanks for the rework and sorry for the slow review.
Hi Richard,
Thanks for the suggestions!  Please find my responses inline below.
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > Hi Richard,
> > This is reworking of patch to extend fold_vec_perm to handle VLA vectors.
> > The attached patch unifies handling of VLS and VLA vector_csts, while
> > using fallback code
> > for ctors.
> >
> > For VLS vector, the patch ignores underlying encoding, and
> > uses npatterns = nelts, and nelts_per_pattern = 1.
> >
> > For VLA patterns, if sel has a stepped sequence, then it
> > only chooses elements from a particular pattern of a particular
> > input vector.
> >
> > To make things simpler, the patch imposes following constraints:
> > (a) op0_npatterns, op1_npatterns and sel_npatterns are powers of 2.
> > (b) The step size for a stepped sequence is a power of 2, and
> >       multiple of npatterns of chosen input vector.
> > (c) Runtime vector length of sel is a multiple of sel_npatterns.
> >      So, we don't handle sel.length = 2 + 2x and npatterns = 4.
> >
> > Eg:
> > op0, op1: npatterns = 2, nelts_per_pattern = 3
> > op0_len = op1_len = 16 + 16x.
> > sel = { 0, 0, 2, 0, 4, 0, ... }
> > npatterns = 2, nelts_per_pattern = 3.
> >
> > For pattern {0, 2, 4, ...}
> > Let,
> > a1 = 2
> > S = step size = 2
> >
> > Let Esel denote number of elements per pattern in sel at runtime.
> > Esel = (16 + 16x) / npatterns_sel
> >         = (16 + 16x) / 2
> >         = (8 + 8x)
> >
> > So, last element of pattern:
> > ae = a1 + (Esel - 2) * S
> >      = 2 + (8 + 8x - 2) * 2
> >      = 14 + 16x
> >
> > a1 /trunc arg0_len = 2 / (16 + 16x) = 0
> > ae /trunc arg0_len = (14 + 16x) / (16 + 16x) = 0
> > Since both are equal with quotient = 0, we select elements from op0.
> >
> > Since step size (S) is a multiple of npatterns(op0), we select
> > all elements from same pattern of op0.
> >
> > res_npatterns = max (op0_npatterns, max (op1_npatterns, sel_npatterns))
> >                        = max (2, max (2, 2)
> >                        = 2
> >
> > res_nelts_per_pattern = max (op0_nelts_per_pattern,
> >                                                 max (op1_nelts_per_pattern,
> >                                                          sel_nelts_per_pattern))
> >                                     = max (3, max (3, 3))
> >                                     = 3
> >
> > So res has encoding with npatterns = 2, nelts_per_pattern = 3.
> > res: { op0[0], op0[0], op0[2], op0[0], op0[4], op0[0], ... }
> >
> > Unfortunately, this results in an issue for poly_int_cst index:
> > For example,
> > op0, op1: npatterns = 1, nelts_per_pattern = 3
> > op0_len = op1_len = 4 + 4x
> >
> > sel: { 4 + 4x, 5 + 4x, 6 + 4x, ... } // should choose op1
> >
> > In this case,
> > a1 = 5 + 4x
> > S = (6 + 4x) - (5 + 4x) = 1
> > Esel = 4 + 4x
> >
> > ae = a1 + (esel - 2) * S
> >      = (5 + 4x) + (4 + 4x - 2) * 1
> >      = 7 + 8x
> >
> > IIUC, 7 + 8x will always be index for last element of op1 ?
> > if x = 0, len = 4, 7 + 8x = 7
> > if x = 1, len = 8, 7 + 8x = 15, etc.
> > So the stepped sequence will always choose elements
> > from op1 regardless of vector length for above case ?
> >
> > However,
> > ae /trunc op0_len
> > = (7 + 8x) / (4 + 4x)
> > which is not defined because 7/4 != 8/4
> > and we return NULL_TREE, but I suppose the expected result would be:
> > res: { op1[0], op1[1], op1[2], ... } ?
> >
> > The patch passes bootstrap+test on aarch64-linux-gnu with and without sve,
> > and on x86_64-unknown-linux-gnu.
> > I would be grateful for suggestions on how to proceed.
> >
> > Thanks,
> > Prathamesh
> >
> > diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> > index a02ede79fed..8028b3e8e9a 100644
> > --- a/gcc/fold-const.cc
> > +++ b/gcc/fold-const.cc
> > @@ -85,6 +85,10 @@ along with GCC; see the file COPYING3.  If not see
> >  #include "vec-perm-indices.h"
> >  #include "asan.h"
> >  #include "gimple-range.h"
> > +#include <algorithm>
> > +#include "tree-pretty-print.h"
> > +#include "gimple-pretty-print.h"
> > +#include "print-tree.h"
> >
> >  /* Nonzero if we are folding constants inside an initializer or a C++
> >     manifestly-constant-evaluated context; zero otherwise.
> > @@ -10493,15 +10497,9 @@ fold_mult_zconjz (location_t loc, tree type, tree expr)
> >  static bool
> >  vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
> >  {
> > -  unsigned HOST_WIDE_INT i, nunits;
> > +  unsigned HOST_WIDE_INT i;
> >
> > -  if (TREE_CODE (arg) == VECTOR_CST
> > -      && VECTOR_CST_NELTS (arg).is_constant (&nunits))
> > -    {
> > -      for (i = 0; i < nunits; ++i)
> > -     elts[i] = VECTOR_CST_ELT (arg, i);
> > -    }
> > -  else if (TREE_CODE (arg) == CONSTRUCTOR)
> > +  if (TREE_CODE (arg) == CONSTRUCTOR)
> >      {
> >        constructor_elt *elt;
> >
> > @@ -10519,6 +10517,230 @@ vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
> >    return true;
> >  }
> >
> > +/* Return a vector with (NPATTERNS, NELTS_PER_PATTERN) encoding.  */
> > +
> > +static tree
> > +vector_cst_reshape (tree vec, unsigned npatterns, unsigned nelts_per_pattern)
> > +{
> > +  gcc_assert (pow2p_hwi (npatterns));
> > +
> > +  if (VECTOR_CST_NPATTERNS (vec) == npatterns
> > +      && VECTOR_CST_NELTS_PER_PATTERN (vec) == nelts_per_pattern)
> > +    return vec;
> > +
> > +  tree v = make_vector (exact_log2 (npatterns), nelts_per_pattern);
> > +  TREE_TYPE (v) = TREE_TYPE (vec);
> > +
> > +  unsigned nelts = npatterns * nelts_per_pattern;
> > +  for (unsigned i = 0; i < nelts; i++)
> > +    VECTOR_CST_ENCODED_ELT(v, i) = vector_cst_elt (vec, i);
> > +  return v;
> > +}
> > +
> > +/* Helper routine for fold_vec_perm_vla to check if ARG is a suitable
> > +   operand for VLA vec_perm folding. If arg is VLS, then set
> > +   NPATTERNS = nelts and NELTS_PER_PATTERN = 1.  */
> > +
> > +static tree
> > +valid_operand_for_fold_vec_perm_cst_p (tree arg)
> > +{
> > +  if (TREE_CODE (arg) != VECTOR_CST)
> > +    return NULL_TREE;
> > +
> > +  unsigned HOST_WIDE_INT nelts;
> > +  unsigned npatterns, nelts_per_pattern;
> > +  if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg)).is_constant (&nelts))
> > +    {
> > +      npatterns = nelts;
> > +      nelts_per_pattern = 1;
> > +    }
> > +  else
> > +    {
> > +      npatterns = VECTOR_CST_NPATTERNS (arg);
> > +      nelts_per_pattern = VECTOR_CST_NELTS_PER_PATTERN (arg);
> > +    }
> > +
> > +  if (!pow2p_hwi (npatterns))
> > +    return NULL_TREE;
> > +
> > +  return vector_cst_reshape (arg, npatterns, nelts_per_pattern);
> > +}
>
> I don't think we should reshape the vectors for VLS, since it would
> create more nodes for GC to clean up later.  Also, the "compact" encoding
> is canonical even for VLS, so the reshaping would effectively create
> noncanonical constants (even if only temporarily).
Removed in the attached patch.
>
> Instead, I think we should change the later:
>
> > +  if (!valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, sel_npatterns,
> > +                                        sel_nelts_per_pattern, reason, verbose))
> > +    return NULL_TREE;
>
> so that it comes after the computation of res_npatterns and
> res_nelts_per_pattern.  Then, if valid_mask_for_fold_vec_perm_cst_p
> returns false, and if the result type has a constant number of elements,
> we should:
>
> * set res_npatterns to that number of elements
> * set res_nelts_per_pattern to 1
> * continue instead of returning null
Assuming we don't enforce only VLA or only VLS for input vectors and sel,
won't that be still an issue if res (and sel) is VLS, and input
vectors are VLA ?
For eg:
arg0, arg1 are type VNx4SI with npatterns = 2, nelts_per_pattern = 3, step = 2
sel is V4SI constant with encoding { 0, 2, 4, ... }
and res_type is V4SI.
In this case, when it comes to index 4, the vector selection becomes ambiguous,
since it can be arg1 for len = 4 + 4x, and arg0 for lengths > 4 + 4x ?

In the attached patch if sel is not a suitable mask, and input vectors
have constant length, then it sets:
res_npatterns = nelts of input vector
res_nelts_per_pattern = 1
Does that look OK ?
This part of the code has few tests in test_mixed().
>
> The loop that follows will then do the correct thing for each element.
>
> The check for a power of 2 would then go in
> valid_mask_for_fold_vec_perm_cst_p rather than
> valid_operand_for_fold_vec_perm_cst_p.
>
> With that change, I think:
>
> > +
> > +/* Helper routine for fold_vec_perm_cst to check if SEL is a suitable
> > +   mask for VLA vec_perm folding. Set SEL_NPATTERNS and SEL_NELTS_PER_PATTERN
> > +   similarly.  */
> > +
> > +static bool
> > +valid_mask_for_fold_vec_perm_cst_p (tree arg0, tree arg1,
> > +                                 const vec_perm_indices &sel,
> > +                                 unsigned& sel_npatterns,
> > +                                 unsigned& sel_nelts_per_pattern,
> > +                                 char *reason = NULL,
> > +                                 bool verbose = false)
> > +{
> > +  unsigned HOST_WIDE_INT nelts;
> > +  if (sel.length ().is_constant (&nelts))
> > +    {
> > +      sel_npatterns = nelts;
> > +      sel_nelts_per_pattern = 1;
> > +    }
> > +  else
> > +    {
> > +      sel_npatterns = sel.encoding ().npatterns ();
> > +      sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
> > +    }
>
> ...we should use the "else" code unconditionally.
Done.
>
> The function comment should describe "reason" and "verbose".  It looks
> like these are debug parameters, is that right?
Yes, verbose is just for printf debugging, and reason is used in unit-tests when
result is NULL_TREE, to verify that it's NULL for the intended reason,
and not due to something else.
>
> > +
> > +  if (!pow2p_hwi (sel_npatterns))
> > +    {
> > +      if (reason)
> > +     strcpy (reason, "sel_npatterns is not power of 2");
> > +      return false;
> > +    }
> > +
> > +  /* We want to avoid cases where sel.length is not a multiple of npatterns.
> > +     For eg: sel.length = 2 + 2x, and sel npatterns = 4.  */
> > +  poly_uint64 esel;
> > +  if (!multiple_p (sel.length (), sel_npatterns, &esel))
> > +    {
> > +      if (reason)
> > +     strcpy (reason, "sel.length is not multiple of sel_npatterns");
> > +      return false;
> > +    }
> > +
> > +  if (sel_nelts_per_pattern < 3)
> > +    return true;
> > +
> > +  for (unsigned pattern = 0; pattern < sel_npatterns; pattern++)
> > +    {
> > +      poly_uint64 a1 = sel[pattern + sel_npatterns];
> > +      poly_uint64 a2 = sel[pattern + 2 * sel_npatterns];
> > +
> > +      poly_uint64 step = a2 - a1;
> > +      if (!step.is_constant ())
> > +     {
> > +       if (reason)
> > +         strcpy (reason, "step is not constant");
> > +       return false;
> > +     }
> > +      int S = step.to_constant ();
> > +      if (S == 0)
> > +     continue;
>
> Might be simpler as:
>
>       HOST_WIDE_INT step;
>       if (!(a2 - a1).is_constant (&step))
>         {
>           ...
>         }
This resulted in following error:
../../gcc/gcc/fold-const.cc:10563:34: error: no matching function for
call to ‘poly_int<2, long unsigned int>::is_constant(long int*)’
10563 |       if (!(a2 - a1).is_constant (&S))
           |  ~~~~~~~~~~~~~~~~~~~~~~^~~~
>       if (step == 0)
>         continue;
>
> > +
> > +      // FIXME: Punt on S < 0 for now, revisit later.
> > +      if (S < 0)
> > +     return false;
> > +
> > +      if (!pow2p_hwi (S))
> > +     {
> > +       if (reason)
> > +         strcpy (reason, "step is not power of 2");
> > +       return false;
> > +     }
> > +
> > +      poly_uint64 ae = a1 + (esel - 2) * S;
> > +      poly_uint64 arg_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +      uint64_t q1, qe;
> > +      poly_uint64 r1, re;
> > +
> > +      /* Ensure that stepped sequence of the pattern selects elements
> > +      only from the same input vector.  */
> > +      if (!(can_div_trunc_p (a1, arg_len, &q1, &r1)
> > +         && can_div_trunc_p (ae, arg_len, &qe, &re)
> > +         && (q1 == qe)))
>
> Nit: redundant brackets around q1 == qe
Fixed, thanks.
>
> > +     {
> > +       if (reason)
> > +         strcpy (reason, "crossed input vectors");
> > +       return false;
> > +     }
> > +      unsigned arg_npatterns
> > +     = ((q1 & 0) == 0) ? VECTOR_CST_NPATTERNS (arg0)
> > +                       : VECTOR_CST_NPATTERNS (arg1);
> > +
> > +      gcc_assert (pow2p_hwi (arg_npatterns));
> > +      if (S < arg_npatterns)
>
> Since (as the reason string says) the condition is logically that the
> step is a multiple of the number of patterns, rather than simply bigger,
> I think it's more obvious to write it as:
>
>       if (!multiple_p (step, arg_npatterns))
>
> without the gcc_assert.
Fixed, thanks.
>
> > +     {
> > +       if (reason)
> > +         strcpy (reason, "S is not multiple of npatterns");
> > +       return false;
> > +     }
> > +    }
> > +
> > +  return true;
> > +}
> > +
> > +/* Try to fold permutation of ARG0 and ARG1 with SEL selector when
> > +   the input vectors are VECTOR_CST. Return NULL_TREE otherwise.  */
> > +
> > +static tree
> > +fold_vec_perm_cst (tree type, tree arg0, tree arg1, const vec_perm_indices &sel,
> > +                char *reason = NULL, bool verbose = false)
> > +{
> > +  /* Allow cases where:
> > +     (1) arg0, arg1 and sel are VLS.
> > +     (2) arg0, arg1, and sel are VLA.
> > +     Punt if input vectors are VLA but sel is VLS or vice-versa.  */
> > +  poly_uint64 arg_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +  if (!((arg_len.is_constant () && sel.length ().is_constant ())
> > +      || (!arg_len.is_constant () && !sel.length ().is_constant ())))
> > +    return NULL_TREE;
>
> With the changes above, I'm not sure we need to enforce this.
Removed.
>
> > +  unsigned sel_npatterns, sel_nelts_per_pattern;
> > +
> > +  arg0 = valid_operand_for_fold_vec_perm_cst_p (arg0);
> > +  if (!arg0)
> > +    return NULL_TREE;
> > +
> > +  arg1 = valid_operand_for_fold_vec_perm_cst_p (arg1);
> > +  if (!arg1)
> > +    return NULL_TREE;
> > +
> > +  if (!valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, sel_npatterns,
> > +                                        sel_nelts_per_pattern, reason, verbose))
> > +    return NULL_TREE;
> > +
> > +  unsigned res_npatterns
> > +    = std::max (VECTOR_CST_NPATTERNS (arg0),
> > +             std::max (VECTOR_CST_NPATTERNS (arg1), sel_npatterns));
> > +
> > +  unsigned res_nelts_per_pattern
> > +    = std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
> > +             std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1),
> > +                       sel_nelts_per_pattern));
> > +
> > +
>
> Nit: too much vertical space.
Fixed, thanks.
>
> > +  tree_vector_builder out_elts (type, res_npatterns, res_nelts_per_pattern);
> > +  unsigned res_nelts = res_npatterns * res_nelts_per_pattern;
> > +  for (unsigned i = 0; i < res_nelts; i++)
> > +    {
> > +      poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +      uint64_t q;
> > +      poly_uint64 r;
> > +      unsigned HOST_WIDE_INT index;
> > +
> > +      /* Punt if sel[i] /trunc_div len cannot be determined,
> > +      because the input vector to be chosen will depend on
> > +      runtime vector length.
> > +      For example if len == 4 + 4x, and sel[i] == 4,
> > +      If len at runtime equals 4, we choose arg1[0].
> > +      For any other value of len > 4 at runtime, we choose arg0[4].
> > +      which makes the element choice dependent on runtime vector length.  */
> > +      if (!can_div_trunc_p (sel[i], len, &q, &r))
> > +     return NULL_TREE;
> > +
> > +      /* sel[i] % len will give the index of element in the chosen input
> > +      vector. For example if sel[i] == 5 + 4x and len == 4 + 4x,
> > +      we will choose arg1[1] since (5 + 4x) % (4 + 4x) == 1.  */
> > +      if (!r.is_constant (&index))
> > +     return NULL_TREE;
> > +
> > +      tree arg = ((q & 1) == 0) ? arg0 : arg1;
> > +      tree elem = vector_cst_elt (arg, index);
> > +      out_elts.quick_push (elem);
> > +    }
> > +
> > +  return out_elts.build ();
> > +}
> > +
> >  /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL
> >     selector.  Return the folded VECTOR_CST or CONSTRUCTOR if successful,
> >     NULL_TREE otherwise.  */
> > @@ -10528,43 +10750,39 @@ fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
> >  {
> >    unsigned int i;
> >    unsigned HOST_WIDE_INT nelts;
> > -  bool need_ctor = false;
> >
> > -  if (!sel.length ().is_constant (&nelts))
> > -    return NULL_TREE;
> > -  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts)
> > -           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts)
> > -           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts));
> > +  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), sel.length ())
> > +           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)),
> > +                        TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1))));
> > +
> >    if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type)
> >        || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type))
> >      return NULL_TREE;
> >
> > +  if (TREE_CODE (arg0) == VECTOR_CST
> > +      && TREE_CODE (arg1) == VECTOR_CST)
> > +    return fold_vec_perm_cst (type, arg0, arg1, sel);
> > +
> > +  /* For fall back case, we want to ensure arg and sel have same len.  */
> > +  if (!(sel.length ().is_constant (&nelts)
> > +     && known_eq (sel.length (), TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)))))
> > +    return NULL_TREE;
>
> I think the known_eq should be an unconditional assert (after the call
> to fold_vec_perm_cst but before the return NULL_TREE).
Um sorry I am not sure if I understood this correctly.
In the patch, I placed the assert unconditionally after checking sel
is constant.
Does that look OK ?
>
> Looks good otherwise.  I'll do a separate review for the tests (but they
> look pretty extensive, thanks).
Thanks. The attached patch passes bootstrap+test on aarch64-linux-gnu
with and without SVE enabled,
and on x86_64-linux-gnu.

Could you please elaborate on the following issue, because it still
stands in this patch,
which is in Case 2 of test_fold_vec_perm_cst::test_stepped.

For example,
op0, op1: npatterns = 1, nelts_per_pattern = 3
op0_len = op1_len = 4 + 4x

sel: { 4 + 4x, 5 + 4x, 6 + 4x, ... } // should choose op1

In this case,
a1 = 5 + 4x
S = (6 + 4x) - (5 + 4x) = 1
Esel = 4 + 4x

ae = a1 + (esel - 2) * S
     = (5 + 4x) + (4 + 4x - 2) * 1
     = 7 + 8x

IIUC, 7 + 8x will always be index for last element of op1 ?
if x = 0, len = 4, 7 + 8x = 7
if x = 1, len = 8, 7 + 8x = 15, etc.
So the stepped sequence will always choose elements
from op1 regardless of vector length for above case ?

However,
ae /trunc op0_len
= (7 + 8x) / (4 + 4x)
which is not defined because 7/4 != 8/4
and we return NULL_TREE, but I suppose the expected result would be:
res: { op1[0], op1[1], op1[2], ... } ?

I was wondering if the following approach made sense:
Since indices wrap around, take index % 2*arg_len, so the remainder is
in interval (0, 2 * arg_len].
If both remainders are in interval [0, arg_len) -> choose arg0
If both remainders are in interval [arg_len, 2 * arg_len) -> choose arg1
If comparison is not determined at compile time or both lie in
different intervals, return NULL_TREE.

So for above example,
r1 = (5 + 4x) % (8 + 8x) = 5 + 4x
re = (7 + 8x) % (8 + 8x) = 7 + 8x
Since both lie in interval [4+4x, 8+8x) --> choose arg1

If we have say a1 = 4, and arg_len = 4 + 4x, then the comparison
becomes ambiguous,
and known_lt (4, 4+4x) will return 0. In that case, we return NULL_TREE.

Taking Case 4 eg from test_fold_vec_perm_cst::test_stepped():
op0, op1 -> len = 16+16x, npatterns = 1, nelts_per_pattern = 3
sel = { len, 0, 2, ... } npatterns = 1, nelts_per_pattern = 3

a1 = 0
esel = (16 + 16x) / 1 = 16 + 16x
S = 2

ae = a1 + (esel - 2) * S
     = 28 + 32x

r1 = 0 % (32 + 32x) = 0
re = (28 + 32x) % (32 + 32x) = 28 + 32x

However r1 is in interval [0, 16+16x) and re is in interval [16+16x, 32+32x),
so we cross input vectors here and return NULL_TREE.

Thanks,
Prathamesh



>
> Richard
>
> > +
> >    tree *in_elts = XALLOCAVEC (tree, nelts * 2);
> >    if (!vec_cst_ctor_to_array (arg0, nelts, in_elts)
> >        || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts))
> >      return NULL_TREE;
> >
> > -  tree_vector_builder out_elts (type, nelts, 1);
> > +  vec<constructor_elt, va_gc> *v;
> > +  vec_alloc (v, nelts);
> >    for (i = 0; i < nelts; i++)
> >      {
> >        HOST_WIDE_INT index;
> >        if (!sel[i].is_constant (&index))
> >       return NULL_TREE;
> > -      if (!CONSTANT_CLASS_P (in_elts[index]))
> > -     need_ctor = true;
> > -      out_elts.quick_push (unshare_expr (in_elts[index]));
> > +      CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, in_elts[index]);
> >      }
> > -
> > -  if (need_ctor)
> > -    {
> > -      vec<constructor_elt, va_gc> *v;
> > -      vec_alloc (v, nelts);
> > -      for (i = 0; i < nelts; i++)
> > -     CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, out_elts[i]);
> > -      return build_constructor (type, v);
> > -    }
> > -  else
> > -    return out_elts.build ();
> > +  return build_constructor (type, v);
> >  }
> >
> >  /* Try to fold a pointer difference of type TYPE two address expressions of
> > @@ -16891,6 +17109,388 @@ test_arithmetic_folding ()
> >                                  x);
> >  }
> >
> > +namespace test_fold_vec_perm_cst {
> > +
> > +static tree
> > +get_preferred_vectype (tree inner_type)
> > +{
> > +  scalar_int_mode int_mode = SCALAR_INT_TYPE_MODE (inner_type);
> > +  machine_mode vmode = targetm.vectorize.preferred_simd_mode (int_mode);
> > +  poly_uint64 nunits = GET_MODE_NUNITS (vmode);
> > +  return build_vector_type (inner_type, nunits);
> > +}
> > +
> > +static tree
> > +build_vec_cst_rand (tree inner_type, unsigned npatterns,
> > +                 unsigned nelts_per_pattern, int S = 0)
> > +{
> > +  tree vectype = get_preferred_vectype (inner_type);
> > +  tree_vector_builder builder (vectype, npatterns, nelts_per_pattern);
> > +
> > +  // Fill a0 for each pattern
> > +  for (unsigned i = 0; i < npatterns; i++)
> > +    builder.quick_push (build_int_cst (inner_type, rand () % 100));
> > +
> > +  if (nelts_per_pattern == 1)
> > +    return builder.build ();
> > +
> > +  // Fill a1 for each pattern
> > +  for (unsigned i = 0; i < npatterns; i++)
> > +    builder.quick_push (build_int_cst (inner_type, rand () % 100));
> > +
> > +  if (nelts_per_pattern == 2)
> > +    return builder.build ();
> > +
> > +  for (unsigned i = npatterns * 2; i < npatterns * nelts_per_pattern; i++)
> > +    {
> > +      tree prev_elem = builder[i - npatterns];
> > +      int prev_elem_val = TREE_INT_CST_LOW (prev_elem);
> > +      int val = prev_elem_val + S;
> > +      builder.quick_push (build_int_cst (inner_type, val));
> > +    }
> > +
> > +  return builder.build ();
> > +}
> > +
> > +static void
> > +validate_res (unsigned npatterns, unsigned nelts_per_pattern,
> > +           tree res, tree *expected_res)
> > +{
> > +  ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == npatterns);
> > +  ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == nelts_per_pattern);
> > +
> > +  for (unsigned i = 0; i < vector_cst_encoded_nelts (res); i++)
> > +    ASSERT_TRUE (operand_equal_p (VECTOR_CST_ELT (res, i), expected_res[i], 0));
> > +}
> > +
> > +/* Verify VLA vec_perm folding.  */
> > +
> > +static void
> > +test_stepped ()
> > +{
> > +  /* Case 1: sel = {0, 1, 2, ...}
> > +     npatterns = 1, nelts_per_pattern = 3  */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (char_type_node, 1, 3, 2);
> > +    tree arg1 = build_vec_cst_rand (char_type_node, 1, 3, 2);
> > +    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (arg0_len, 1, 3);
> > +    builder.quick_push (0);
> > +    builder.quick_push (1);
> > +    builder.quick_push (2);
> > +
> > +    vec_perm_indices sel (builder, 2, arg0_len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg0, 1),
> > +                         vector_cst_elt (arg0, 2) };
> > +    validate_res (1, 3, res, expected_res);
> > +  }
> > +
> > +#if 0
> > +  /* Case 2: sel = {len, len + 1, len + 2, ... }
> > +     npatterns = 1, nelts_per_pattern = 3
> > +     FIXME: This should return
> > +     expected res: { op1[0], op1[1], op1[2], ... }
> > +     however it returns NULL_TREE.  */
> > +  {
> > +    vec_perm_builder builder (arg0_len, 1, 3);
> > +    builder.quick_push (arg0_len);
> > +    builder.quick_push (arg0_len + 1);
> > +    builder.quick_push (arg0_len + 2);
> > +
> > +    vec_perm_indices sel (builder, 2, arg0_len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +  }
> > +#endif
> > +
> > +  /* Case 3: Leading element of arg1, stepped sequence: pattern 0 of arg0.
> > +     sel = {len, 0, 0, 0, 2, 0, ...}
> > +     npatterns = 2, nelts_per_pattern = 3.
> > +     Use extra pattern {0, ...} to lower number of elements per pattern.  */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (char_type_node, 1, 3, 2);
> > +    tree arg1 = build_vec_cst_rand (char_type_node, 1, 3, 2);
> > +    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (arg0_len, 2, 3);
> > +    builder.quick_push (arg0_len);
> > +    int mask_elems[] = { 0, 0, 0, 2, 0 };
> > +    for (int i = 0; i < 5; i++)
> > +      builder.quick_push (mask_elems[i]);
> > +
> > +    vec_perm_indices sel (builder, 2, arg0_len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +    gcc_assert (res);
> > +  }
> > +
> > +  /* Case 4:
> > +     sel = { len, 0, 2, ... } npatterns = 1, nelts_per_pattern = 3.
> > +     This should return NULL because we cross the input vectors.
> > +     Because,
> > +     arg0_len = 16 + 16x
> > +     a1 = 0
> > +     S = 2
> > +     esel = arg0_len / npatterns_sel = 16+16x/1 = 16 + 16x
> > +     ae = a1 + (esel - 2) * S
> > +     = 0 + (16 + 16x - 2) * 2
> > +     = 28 + 32x
> > +     a1 / arg0_len = 0 /trunc (16 + 16x) = 0
> > +     ae / arg0_len = (28 + 32x) /trunc (16 + 16x), which is not defined,
> > +     since 28/16 != 32/16.
> > +     So return NULL_TREE.  */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (char_type_node, 1, 3, 2);
> > +    tree arg1 = build_vec_cst_rand (char_type_node, 1, 3, 2);
> > +    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    poly_uint64 ae = (arg0_len - 2) * 2;
> > +    uint64_t qe;
> > +    poly_uint64 re;
> > +
> > +    vec_perm_builder builder (arg0_len, 1, 3);
> > +    builder.quick_push (arg0_len);
> > +    builder.quick_push (0);
> > +    builder.quick_push (2);
> > +
> > +    vec_perm_indices sel (builder, 2, arg0_len);
> > +    char reason[100] = "\0";
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel, reason, false);
> > +    gcc_assert (res == NULL_TREE);
> > +    gcc_assert (!strcmp (reason, "crossed input vectors"));
> > +  }
> > +
> > +  /* Case 5: Select elements from different patterns.
> > +     Should return NULL.  */
> > +  {
> > +    tree op0 = build_vec_cst_rand (char_type_node, 2, 3, 2);
> > +    tree op1 = build_vec_cst_rand (char_type_node, 2, 3, 2);
> > +    poly_uint64 op0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (op0));
> > +
> > +    vec_perm_builder builder (op0_len, 2, 3);
> > +    builder.quick_push (op0_len);
> > +    int mask_elems[] = { 0, 0, 0, 1, 0 };
> > +    for (int i = 0; i < 5; i++)
> > +      builder.quick_push (mask_elems[i]);
> > +
> > +    vec_perm_indices sel (builder, 2, op0_len);
> > +    char reason[100] = "\0";
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (op0), op0, op1, sel, reason, false);
> > +    gcc_assert (res == NULL_TREE);
> > +    gcc_assert (!strcmp (reason, "S is not multiple of npatterns"));
> > +  }
> > +
> > +  /* Case 6: Select pattern 0 of op0 and dup of op0[0]
> > +     op0, op1, sel: npatterns = 2, nelts_per_pattern = 3
> > +     sel = { 0, 0, 2, 0, 4, 0, ... }.
> > +
> > +     For pattern {0, 2, 4, ...}:
> > +     a1 = 2
> > +     len = 16 + 16x
> > +     S = 2
> > +     esel = len / npatterns_sel = (16 + 16x) / 2 = (8 + 8x)
> > +     ae = a1 + (esel - 2) * S
> > +     = 2 + (8 + 8x - 2) * 2
> > +     = 14 + 16x
> > +     a1 / arg0_len = 2 / (16 + 16x) = 0
> > +     ae / arg0_len = (14 + 16x) / (16 + 16x) = 0
> > +     So a1/arg0_len = ae/arg0_len = 0
> > +     Hence we select from first vector op0
> > +     S = 2, npatterns = 2.
> > +     Since S is multiple of npatterns(op0), we are selecting from
> > +     same pattern of op0.
> > +
> > +     For pattern {0, ...}, we are choosing { op0[0] ... }
> > +     So res will be combination of above patterns:
> > +     res: { op0[0], op0[0], op0[2], op0[0], op0[4], op0[0], ... }
> > +     with npatterns = 2, nelts_per_pattern = 3.  */
> > +  {
> > +    tree op0 = build_vec_cst_rand (char_type_node, 2, 3, 2);
> > +    tree op1 = build_vec_cst_rand (char_type_node, 2, 3, 2);
> > +    poly_uint64 op0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (op0));
> > +
> > +    vec_perm_builder builder (op0_len, 2, 3);
> > +    int mask_elems[] = { 0, 0, 2, 0, 4, 0 };
> > +    for (int i = 0; i < 6; i++)
> > +      builder.quick_push (mask_elems[i]);
> > +
> > +    vec_perm_indices sel (builder, 2, op0_len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (op0), op0, op1, sel);
> > +    tree expected_res[] = { vector_cst_elt (op0, 0), vector_cst_elt (op0, 0),
> > +                         vector_cst_elt (op0, 2), vector_cst_elt (op0, 0),
> > +                         vector_cst_elt (op0, 4), vector_cst_elt (op0, 0) };
> > +    validate_res (2, 3, res, expected_res);
> > +  }
> > +
> > +  /* Case 7: sel_npatterns > op_npatterns;
> > +     op0, op1: npatterns = 2, nelts_per_pattern = 3
> > +     sel: { 0, 0, 1, len, 2, 0, 3, len, 4, 0, 5, len, ...},
> > +     with npatterns = 4, nelts_per_pattern = 3.  */
> > +  {
> > +    tree op0 = build_vec_cst_rand (char_type_node, 2, 3, 2);
> > +    tree op1 = build_vec_cst_rand (char_type_node, 2, 3, 2);
> > +    poly_uint64 op0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (op0));
> > +
> > +    vec_perm_builder builder(op0_len, 4, 3);
> > +    // -1 is used as place holder for poly_int_cst
> > +    int mask_elems[] = { 0, 0, 1, -1, 2, 0, 3, -1, 4, 0, 5, -1 };
> > +    for (int i = 0; i < 12; i++)
> > +      builder.quick_push ((mask_elems[i] == -1) ? op0_len : mask_elems[i]);
> > +
> > +    vec_perm_indices sel (builder, 2, op0_len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (op0), op0, op1, sel);
> > +    tree expected_res[] = { vector_cst_elt (op0, 0), vector_cst_elt (op0, 0),
> > +                         vector_cst_elt (op0, 1), vector_cst_elt (op1, 0),
> > +                         vector_cst_elt (op0, 2), vector_cst_elt (op0, 0),
> > +                         vector_cst_elt (op0, 3), vector_cst_elt (op1, 0),
> > +                         vector_cst_elt (op0, 4), vector_cst_elt (op0, 0),
> > +                         vector_cst_elt (op0, 5), vector_cst_elt (op1, 0) };
> > +    validate_res (4, 3, res, expected_res);
> > +  }
> > +}
> > +
> > +static void
> > +test_dup ()
> > +{
> > +  /* Case 1: mask = {0, ...} */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> > +    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> > +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (len, 1, 1);
> > +    builder.quick_push (0);
> > +    vec_perm_indices sel (builder, 2, len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +
> > +    tree expected_res[] = { vector_cst_elt (res, 0) };
> > +    validate_res (1, 1, res, expected_res);
> > +  }
> > +
> > +  /* Case 2: mask = {len, ...} */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> > +    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> > +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (len, 1, 1);
> > +    builder.quick_push (len);
> > +    vec_perm_indices sel (builder, 2, len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +
> > +    tree expected_res[] = { vector_cst_elt (arg1, 0) };
> > +    validate_res (1, 1, res, expected_res);
> > +  }
> > +
> > +  /* Case 3: mask = { 0, len, ... } */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> > +    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> > +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (len, 2, 1);
> > +    builder.quick_push (0);
> > +    builder.quick_push (len);
> > +    vec_perm_indices sel (builder, 2, len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +
> > +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0) };
> > +    validate_res (2, 1, res, expected_res);
> > +  }
> > +
> > +  /* Case 4: mask = { 0, len, 1, len+1, ... } */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> > +    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> > +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (len, 2, 2);
> > +    builder.quick_push (0);
> > +    builder.quick_push (len);
> > +    builder.quick_push (1);
> > +    builder.quick_push (len + 1);
> > +    vec_perm_indices sel (builder, 2, len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +
> > +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0),
> > +                         vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
> > +                       };
> > +    validate_res (2, 2, res, expected_res);
> > +  }
> > +
> > +  /* Case 5: mask = { 0, len, 1, len+1, .... }
> > +     npatterns = 4, nelts_per_pattern = 1 */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> > +    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> > +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (len, 4, 1);
> > +    builder.quick_push (0);
> > +    builder.quick_push (len);
> > +    builder.quick_push (1);
> > +    builder.quick_push (len + 1);
> > +    vec_perm_indices sel (builder, 2, len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +
> > +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0),
> > +                         vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
> > +                       };
> > +    validate_res (4, 1, res, expected_res);
> > +  }
> > +
> > +  /* Case 6: mask = {0, 4, ...}
> > +     npatterns = 1, nelts_per_pattern = 2.
> > +     This should return NULL_TREE because the index 4 may choose
> > +     from either arg0 or arg1 depending on vector length.  */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> > +    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> > +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (len, 1, 2);
> > +    builder.quick_push (0);
> > +    builder.quick_push (4);
> > +    vec_perm_indices sel (builder, 2, len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +    ASSERT_TRUE (res == NULL_TREE);
> > +  }
> > +
> > +  /* Case 7: npatterns(arg0) = 4 > npatterns(sel) = 2
> > +     mask = {0, len, 1, len + 1, ...}
> > +     sel_npatterns = 2, sel_nelts_per_pattern = 2.  */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> > +    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> > +    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (arg0_len, 2, 2);
> > +    builder.quick_push (0);
> > +    builder.quick_push (arg0_len);
> > +    builder.quick_push (1);
> > +    builder.quick_push (arg0_len + 1);
> > +    vec_perm_indices sel (builder, 2, arg0_len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +
> > +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0),
> > +                         vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
> > +                       };
> > +    validate_res (2, 2, res, expected_res);
> > +  }
> > +}
> > +
> > +static void
> > +test ()
> > +{
> > +  tree vectype = get_preferred_vectype (integer_type_node);
> > +  if (TYPE_VECTOR_SUBPARTS (vectype).is_constant ())
> > +    return;
> > +
> > +  test_dup ();
> > +  test_stepped ();
> > +}
> > +};
> > +
> >  /* Verify that various binary operations on vectors are folded
> >     correctly.  */
> >
> > @@ -16942,6 +17542,7 @@ fold_const_cc_tests ()
> >    test_arithmetic_folding ();
> >    test_vector_folding ();
> >    test_vec_duplicate_folding ();
> > +  test_fold_vec_perm_cst::test ();
> >  }
> >
> >  } // namespace selftest
diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index 7e5494dfd39..71857e16185 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -85,6 +85,10 @@ along with GCC; see the file COPYING3.  If not see
 #include "vec-perm-indices.h"
 #include "asan.h"
 #include "gimple-range.h"
+#include <algorithm>
+#include "tree-pretty-print.h"
+#include "gimple-pretty-print.h"
+#include "print-tree.h"
 
 /* Nonzero if we are folding constants inside an initializer or a C++
    manifestly-constant-evaluated context; zero otherwise.
@@ -10494,15 +10498,9 @@ fold_mult_zconjz (location_t loc, tree type, tree expr)
 static bool
 vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
 {
-  unsigned HOST_WIDE_INT i, nunits;
+  unsigned HOST_WIDE_INT i;
 
-  if (TREE_CODE (arg) == VECTOR_CST
-      && VECTOR_CST_NELTS (arg).is_constant (&nunits))
-    {
-      for (i = 0; i < nunits; ++i)
-	elts[i] = VECTOR_CST_ELT (arg, i);
-    }
-  else if (TREE_CODE (arg) == CONSTRUCTOR)
+  if (TREE_CODE (arg) == CONSTRUCTOR)
     {
       constructor_elt *elt;
 
@@ -10520,6 +10518,192 @@ vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
   return true;
 }
 
+/* Helper routine for fold_vec_perm_cst to check if SEL is a suitable
+   mask for VLA vec_perm folding.
+   REASON if specified, will contain the reason why SEL is not suitable.
+   Used only for debugging and unit-testing.
+   VERBOSE if enabled is used for debugging output.  */
+
+static bool
+valid_mask_for_fold_vec_perm_cst_p (tree arg0, tree arg1,
+				    const vec_perm_indices &sel,
+				    char *reason = NULL,
+				    ATTRIBUTE_UNUSED bool verbose = false)
+{
+  unsigned sel_npatterns = sel.encoding ().npatterns ();
+  unsigned sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
+
+  if (!(pow2p_hwi (sel_npatterns)
+	&& pow2p_hwi (VECTOR_CST_NPATTERNS (arg0))
+	&& pow2p_hwi (VECTOR_CST_NPATTERNS (arg1))))
+    {
+      if (reason)
+	strcpy (reason, "npatterns is not power of 2");
+      return false;
+    }
+
+  /* We want to avoid cases where sel.length is not a multiple of npatterns.
+     For eg: sel.length = 2 + 2x, and sel npatterns = 4.  */
+  poly_uint64 esel;
+  if (!multiple_p (sel.length (), sel_npatterns, &esel))
+    {
+      if (reason)
+	strcpy (reason, "sel.length is not multiple of sel_npatterns");
+      return false;
+    }
+
+  if (sel_nelts_per_pattern < 3)
+    return true;
+
+  for (unsigned pattern = 0; pattern < sel_npatterns; pattern++)
+    {
+      poly_uint64 a1 = sel[pattern + sel_npatterns];
+      poly_uint64 a2 = sel[pattern + 2 * sel_npatterns];
+      poly_uint64 step = a2 - a1;
+      if (!step.is_constant ())
+	{
+	  if (reason)
+	    strcpy (reason, "step is not constant");
+	  return false;
+	}
+      HOST_WIDE_INT S = step.to_constant ();
+
+      // FIXME: Punt on S < 0 for now, revisit later.
+      if (S < 0)
+	return false;
+      if (S == 0)
+	continue;
+
+      if (!pow2p_hwi (S))
+	{
+	  if (reason)
+	    strcpy (reason, "step is not power of 2");
+	  return false;
+	}
+
+      /* Ensure that stepped sequence of the pattern selects elements
+	 only from the same input vector if it's VLA.  */
+      uint64_t q1, qe;
+      poly_uint64 r1, re;
+      poly_uint64 ae = a1 + (esel - 2) * S;
+      poly_uint64 arg_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+      if (!(can_div_trunc_p (a1, arg_len, &q1, &r1)
+	    && can_div_trunc_p (ae, arg_len, &qe, &re)
+	    && q1 == qe))
+	{
+	  if (reason)
+	    strcpy (reason, "crossed input vectors");
+	  return false;
+	}
+
+      unsigned arg_npatterns
+	= ((q1 & 0) == 0) ? VECTOR_CST_NPATTERNS (arg0)
+			  : VECTOR_CST_NPATTERNS (arg1);
+
+      if (!multiple_p (S, arg_npatterns))
+	{
+	  if (reason)
+	    strcpy (reason, "S is not multiple of npatterns");
+	  return false;
+	}
+    }
+
+  return true;
+}
+
+/* Try to fold permutation of ARG0 and ARG1 with SEL selector when
+   the input vectors are VECTOR_CST. Return NULL_TREE otherwise.
+   REASON and VERBOSE have same purpose as described in
+   valid_mask_for_fold_vec_perm_cst_p.
+
+   (1) If SEL is a suitable mask as determined by
+       valid_mask_for_fold_vec_perm_cst_p, then:
+       res_npatterns = max of npatterns between ARG0, ARG1, and SEL
+       res_nelts_per_pattern = max of nelts_per_pattern between
+			       ARG0, ARG1 and SEL.
+   (2) If SEL is not a suitable mask, and ARG0, ARG1 are VLS,
+       then:
+       res_npatterns = nelts in input vector.
+       res_nelts_per_pattern = 1.
+       This exception is made so that VLS ARG0, ARG1 and SEL work as before.  */
+
+static tree
+fold_vec_perm_cst (tree type, tree arg0, tree arg1, const vec_perm_indices &sel,
+		   char *reason = NULL, bool verbose = false)
+{
+  unsigned res_npatterns, res_nelts_per_pattern;
+  unsigned HOST_WIDE_INT arg_nelts;
+
+  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason, verbose))
+    {
+      res_npatterns
+	= std::max (VECTOR_CST_NPATTERNS (arg0),
+		    std::max (VECTOR_CST_NPATTERNS (arg1),
+			      sel.encoding ().npatterns ()));
+
+      res_nelts_per_pattern
+	= std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
+		    std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1),
+			      sel.encoding ().nelts_per_pattern ()));
+    }
+  else if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant (&arg_nelts))
+    {
+      res_npatterns = arg_nelts;
+      res_nelts_per_pattern = 1;
+    }
+  else
+    return NULL_TREE;
+
+  tree_vector_builder out_elts (type, res_npatterns, res_nelts_per_pattern);
+  unsigned res_nelts = res_npatterns * res_nelts_per_pattern;
+  for (unsigned i = 0; i < res_nelts; i++)
+    {
+      poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+      uint64_t q;
+      poly_uint64 r;
+      unsigned HOST_WIDE_INT index;
+
+      if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant (&arg_nelts)
+	  && known_ge (sel[i], 2 * arg_nelts))
+	{
+	  if (reason)
+	    strcpy (reason, "out of bounds access");
+	  return NULL_TREE;
+	}
+
+      /* Punt if sel[i] /trunc_div len cannot be determined,
+	 because the input vector to be chosen will depend on
+	 runtime vector length.
+	 For example if len == 4 + 4x, and sel[i] == 4,
+	 If len at runtime equals 4, we choose arg1[0].
+	 For any other value of len > 4 at runtime, we choose arg0[4].
+	 which makes the element choice dependent on runtime vector length.  */
+      if (!can_div_trunc_p (sel[i], len, &q, &r))
+	{
+	  if (reason)
+	    strcpy (reason, "cannot divide selector element by arg len");
+	  return NULL_TREE;
+	}
+
+      /* sel[i] % len will give the index of element in the chosen input
+	 vector. For example if sel[i] == 5 + 4x and len == 4 + 4x,
+	 we will choose arg1[1] since (5 + 4x) % (4 + 4x) == 1.  */
+      if (!r.is_constant (&index))
+	{
+	  if (reason)
+	    strcpy (reason, "remainder is not constant");
+	  return NULL_TREE;
+	}
+
+      tree arg = ((q & 1) == 0) ? arg0 : arg1;
+      tree elem = vector_cst_elt (arg, index);
+      out_elts.quick_push (elem);
+    }
+
+  return out_elts.build ();
+}
+
 /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL
    selector.  Return the folded VECTOR_CST or CONSTRUCTOR if successful,
    NULL_TREE otherwise.  */
@@ -10529,43 +10713,40 @@ fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
 {
   unsigned int i;
   unsigned HOST_WIDE_INT nelts;
-  bool need_ctor = false;
 
-  if (!sel.length ().is_constant (&nelts))
-    return NULL_TREE;
-  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts)
-	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts)
-	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts));
+  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), sel.length ())
+	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)),
+			   TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1))));
+
   if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type)
       || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type))
     return NULL_TREE;
 
+  if (TREE_CODE (arg0) == VECTOR_CST
+      && TREE_CODE (arg1) == VECTOR_CST)
+    return fold_vec_perm_cst (type, arg0, arg1, sel);
+
+  /* For fall back case, we want to ensure we have VLS vectors
+     with equal length.  */
+  if (!sel.length ().is_constant (&nelts))
+    return NULL_TREE;
+
+  gcc_assert (known_eq (sel.length (), TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0))));
   tree *in_elts = XALLOCAVEC (tree, nelts * 2);
   if (!vec_cst_ctor_to_array (arg0, nelts, in_elts)
       || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts))
     return NULL_TREE;
 
-  tree_vector_builder out_elts (type, nelts, 1);
+  vec<constructor_elt, va_gc> *v;
+  vec_alloc (v, nelts);
   for (i = 0; i < nelts; i++)
     {
       HOST_WIDE_INT index;
       if (!sel[i].is_constant (&index))
 	return NULL_TREE;
-      if (!CONSTANT_CLASS_P (in_elts[index]))
-	need_ctor = true;
-      out_elts.quick_push (unshare_expr (in_elts[index]));
-    }
-
-  if (need_ctor)
-    {
-      vec<constructor_elt, va_gc> *v;
-      vec_alloc (v, nelts);
-      for (i = 0; i < nelts; i++)
-	CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, out_elts[i]);
-      return build_constructor (type, v);
+      CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, in_elts[index]);
     }
-  else
-    return out_elts.build ();
+  return build_constructor (type, v);
 }
 
 /* Try to fold a pointer difference of type TYPE two address expressions of
@@ -16892,6 +17073,528 @@ test_arithmetic_folding ()
 				   x);
 }
 
+namespace test_fold_vec_perm_cst {
+
+static tree
+get_preferred_vectype (tree inner_type)
+{
+  scalar_int_mode int_mode = SCALAR_INT_TYPE_MODE (inner_type);
+  machine_mode vmode = targetm.vectorize.preferred_simd_mode (int_mode);
+  poly_uint64 nunits = GET_MODE_NUNITS (vmode);
+  return build_vector_type (inner_type, nunits);
+}
+
+static tree
+build_vec_cst_rand (tree inner_type, unsigned npatterns,
+		    unsigned nelts_per_pattern, int S = 0,
+		    tree vectype = NULL_TREE)
+{
+  if (!vectype)
+    vectype = get_preferred_vectype (inner_type);
+  tree_vector_builder builder (vectype, npatterns, nelts_per_pattern);
+
+  // Fill a0 for each pattern
+  for (unsigned i = 0; i < npatterns; i++)
+    builder.quick_push (build_int_cst (inner_type, rand () % 100));
+
+  if (nelts_per_pattern == 1)
+    return builder.build ();
+
+  // Fill a1 for each pattern
+  for (unsigned i = 0; i < npatterns; i++)
+    builder.quick_push (build_int_cst (inner_type, rand () % 100));
+
+  if (nelts_per_pattern == 2)
+    return builder.build ();
+
+  for (unsigned i = npatterns * 2; i < npatterns * nelts_per_pattern; i++)
+    {
+      tree prev_elem = builder[i - npatterns];
+      int prev_elem_val = TREE_INT_CST_LOW (prev_elem);
+      int val = prev_elem_val + S;
+      builder.quick_push (build_int_cst (inner_type, val));
+    }
+
+  return builder.build ();
+}
+
+static void
+validate_res (unsigned npatterns, unsigned nelts_per_pattern,
+	      tree res, tree *expected_res)
+{
+  ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == npatterns);
+  ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == nelts_per_pattern);
+
+  for (unsigned i = 0; i < vector_cst_encoded_nelts (res); i++)
+    ASSERT_TRUE (operand_equal_p (VECTOR_CST_ELT (res, i), expected_res[i], 0));
+}
+
+static void
+validate_res_vls (tree res, tree *expected_res, unsigned expected_nelts)
+{
+  ASSERT_TRUE (known_eq (VECTOR_CST_NELTS (res), expected_nelts));
+  for (unsigned i = 0; i < expected_nelts; i++)
+    ASSERT_TRUE (operand_equal_p (VECTOR_CST_ELT (res, i), expected_res[i], 0));
+}
+
+/* Verify VLA vec_perm folding.  */
+
+static void
+test_stepped ()
+{
+  /* Case 1: sel = {0, 1, 2, ...}
+     npatterns = 1, nelts_per_pattern = 3  */
+  {
+    tree arg0 = build_vec_cst_rand (char_type_node, 1, 3, 2);
+    tree arg1 = build_vec_cst_rand (char_type_node, 1, 3, 2);
+    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (arg0_len, 1, 3);
+    builder.quick_push (0);
+    builder.quick_push (1);
+    builder.quick_push (2);
+
+    vec_perm_indices sel (builder, 2, arg0_len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg0, 1),
+			    vector_cst_elt (arg0, 2) };
+    validate_res (1, 3, res, expected_res);
+  }
+
+#if 0
+  /* Case 2: sel = {len, len + 1, len + 2, ... }
+     npatterns = 1, nelts_per_pattern = 3
+     FIXME: This should return
+     expected res: { op1[0], op1[1], op1[2], ... }
+     however it returns NULL_TREE.  */
+  {
+    vec_perm_builder builder (arg0_len, 1, 3);
+    builder.quick_push (arg0_len);
+    builder.quick_push (arg0_len + 1);
+    builder.quick_push (arg0_len + 2);
+
+    vec_perm_indices sel (builder, 2, arg0_len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+  }
+#endif
+
+  /* Case 3: Leading element of arg1, stepped sequence: pattern 0 of arg0.
+     sel = {len, 0, 0, 0, 2, 0, ...}
+     npatterns = 2, nelts_per_pattern = 3.
+     Use extra pattern {0, ...} to lower number of elements per pattern.  */
+  {
+    tree arg0 = build_vec_cst_rand (char_type_node, 1, 3, 2);
+    tree arg1 = build_vec_cst_rand (char_type_node, 1, 3, 2);
+    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (arg0_len, 2, 3);
+    builder.quick_push (arg0_len);
+    int mask_elems[] = { 0, 0, 0, 2, 0 };
+    for (int i = 0; i < 5; i++)
+      builder.quick_push (mask_elems[i]);
+
+    vec_perm_indices sel (builder, 2, arg0_len);
+    char reason[100] = "\0";
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel, reason);
+
+    tree expected_res[] = { vector_cst_elt (arg1, 0), vector_cst_elt (arg0, 0),
+			    vector_cst_elt (arg0, 0), vector_cst_elt (arg0, 0),
+			    vector_cst_elt (arg0, 2), vector_cst_elt (arg0, 0)
+			  };
+    validate_res (2, 3, res, expected_res);
+  }
+
+  /* Case 4:
+     sel = { len, 0, 2, ... } npatterns = 1, nelts_per_pattern = 3.
+     This should return NULL because we cross the input vectors.
+     Because,
+     arg0_len = 16 + 16x
+     a1 = 0
+     S = 2
+     esel = arg0_len / npatterns_sel = 16+16x/1 = 16 + 16x
+     ae = 0 + (esel - 2) * S
+	= 0 + (16 + 16x - 2) * 2
+	= 28 + 32x
+     a1 / arg0_len = 0 /trunc (16 + 16x) = 0
+     ae / arg0_len = (28 + 32x) /trunc (16 + 16x), which is not defined,
+     since 28/16 != 32/16.
+     So return NULL_TREE.  */
+  {
+    tree arg0 = build_vec_cst_rand (char_type_node, 1, 3, 2);
+    tree arg1 = build_vec_cst_rand (char_type_node, 1, 3, 2);
+    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (arg0_len, 1, 3);
+    builder.quick_push (arg0_len);
+    builder.quick_push (0);
+    builder.quick_push (2);
+
+    vec_perm_indices sel (builder, 2, arg0_len);
+    char reason[100] = "\0";
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel, reason, false);
+    gcc_assert (res == NULL_TREE);
+    gcc_assert (!strcmp (reason, "crossed input vectors"));
+  }
+
+  /* Case 5: Select elements from different patterns.
+     Should return NULL.  */
+  {
+    tree op0 = build_vec_cst_rand (char_type_node, 2, 3, 2);
+    tree op1 = build_vec_cst_rand (char_type_node, 2, 3, 2);
+    poly_uint64 op0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (op0));
+
+    vec_perm_builder builder (op0_len, 2, 3);
+    builder.quick_push (op0_len);
+    int mask_elems[] = { 0, 0, 0, 1, 0 };
+    for (int i = 0; i < 5; i++)
+      builder.quick_push (mask_elems[i]);
+
+    vec_perm_indices sel (builder, 2, op0_len);
+    char reason[100] = "\0";
+    tree res = fold_vec_perm_cst (TREE_TYPE (op0), op0, op1, sel, reason, false);
+    gcc_assert (res == NULL_TREE);
+    gcc_assert (!strcmp (reason, "S is not multiple of npatterns"));
+  }
+
+  /* Case 6: Select pattern 0 of op0 and dup of op0[0]
+     op0, op1, sel: npatterns = 2, nelts_per_pattern = 3
+     sel = { 0, 0, 2, 0, 4, 0, ... }.
+
+     For pattern {0, 2, 4, ...}:
+     a1 = 2
+     len = 16 + 16x
+     S = 2
+     esel = len / npatterns_sel = (16 + 16x) / 2 = (8 + 8x)
+     ae = a1 + (esel - 2) * S
+	= 2 + (8 + 8x - 2) * 2
+	= 14 + 16x
+     a1 / arg0_len = 2 / (16 + 16x) = 0
+     ae / arg0_len = (14 + 16x) / (16 + 16x) = 0
+     So a1/arg0_len = ae/arg0_len = 0
+     Hence we select from first vector op0
+     S = 2, npatterns = 2.
+     Since S is multiple of npatterns(op0), we are selecting from
+     same pattern of op0.
+
+     For pattern {0, ...}, we are choosing { op0[0] ... }
+     So res will be combination of above patterns:
+     res: { op0[0], op0[0], op0[2], op0[0], op0[4], op0[0], ... }
+     with npatterns = 2, nelts_per_pattern = 3.  */
+  {
+    tree op0 = build_vec_cst_rand (char_type_node, 2, 3, 2);
+    tree op1 = build_vec_cst_rand (char_type_node, 2, 3, 2);
+    poly_uint64 op0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (op0));
+
+    vec_perm_builder builder (op0_len, 2, 3);
+    int mask_elems[] = { 0, 0, 2, 0, 4, 0 };
+    for (int i = 0; i < 6; i++)
+      builder.quick_push (mask_elems[i]);
+
+    vec_perm_indices sel (builder, 2, op0_len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (op0), op0, op1, sel);
+    tree expected_res[] = { vector_cst_elt (op0, 0), vector_cst_elt (op0, 0),
+			    vector_cst_elt (op0, 2), vector_cst_elt (op0, 0),
+			    vector_cst_elt (op0, 4), vector_cst_elt (op0, 0) };
+    validate_res (2, 3, res, expected_res);
+  }
+
+  /* Case 7: sel_npatterns > op_npatterns;
+     op0, op1: npatterns = 2, nelts_per_pattern = 3
+     sel: { 0, 0, 1, len, 2, 0, 3, len, 4, 0, 5, len, ...},
+     with npatterns = 4, nelts_per_pattern = 3.  */
+  {
+    tree op0 = build_vec_cst_rand (char_type_node, 2, 3, 2);
+    tree op1 = build_vec_cst_rand (char_type_node, 2, 3, 2);
+    poly_uint64 op0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (op0));
+
+    vec_perm_builder builder(op0_len, 4, 3);
+    // -1 is used as place holder for poly_int_cst
+    int mask_elems[] = { 0, 0, 1, -1, 2, 0, 3, -1, 4, 0, 5, -1 };
+    for (int i = 0; i < 12; i++)
+      builder.quick_push ((mask_elems[i] == -1) ? op0_len : mask_elems[i]);
+
+    vec_perm_indices sel (builder, 2, op0_len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (op0), op0, op1, sel);
+    tree expected_res[] = { vector_cst_elt (op0, 0), vector_cst_elt (op0, 0),
+			    vector_cst_elt (op0, 1), vector_cst_elt (op1, 0),
+			    vector_cst_elt (op0, 2), vector_cst_elt (op0, 0),
+			    vector_cst_elt (op0, 3), vector_cst_elt (op1, 0),
+			    vector_cst_elt (op0, 4), vector_cst_elt (op0, 0),
+			    vector_cst_elt (op0, 5), vector_cst_elt (op1, 0) };
+    validate_res (4, 3, res, expected_res);
+  }
+}
+
+static void
+test_dup ()
+{
+  /* Case 1: mask = {0, ...} */
+  {
+    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (len, 1, 1);
+    builder.quick_push (0);
+    vec_perm_indices sel (builder, 2, len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+    tree expected_res[] = { vector_cst_elt (res, 0) };
+    validate_res (1, 1, res, expected_res);
+  }
+
+  /* Case 2: mask = {len, ...} */
+  {
+    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (len, 1, 1);
+    builder.quick_push (len);
+    vec_perm_indices sel (builder, 2, len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+    tree expected_res[] = { vector_cst_elt (arg1, 0) };
+    validate_res (1, 1, res, expected_res);
+  }
+
+  /* Case 3: mask = { 0, len, ... } */
+  {
+    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (len, 2, 1);
+    builder.quick_push (0);
+    builder.quick_push (len);
+    vec_perm_indices sel (builder, 2, len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0) };
+    validate_res (2, 1, res, expected_res);
+  }
+
+  /* Case 4: mask = { 0, len, 1, len+1, ... } */
+  {
+    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (len, 2, 2);
+    builder.quick_push (0);
+    builder.quick_push (len);
+    builder.quick_push (1);
+    builder.quick_push (len + 1);
+    vec_perm_indices sel (builder, 2, len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0),
+			    vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
+			  };
+    validate_res (2, 2, res, expected_res);
+  }
+
+  /* Case 5: mask = { 0, len, 1, len+1, .... }
+     npatterns = 4, nelts_per_pattern = 1 */
+  {
+    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (len, 4, 1);
+    builder.quick_push (0);
+    builder.quick_push (len);
+    builder.quick_push (1);
+    builder.quick_push (len + 1);
+    vec_perm_indices sel (builder, 2, len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0),
+			    vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
+			  };
+    validate_res (4, 1, res, expected_res);
+  }
+
+  /* Case 6: mask = {0, 4, ...}
+     npatterns = 1, nelts_per_pattern = 2.
+     This should return NULL_TREE because the index 4 may choose
+     from either arg0 or arg1 depending on vector length.  */
+  {
+    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (len, 1, 2);
+    builder.quick_push (0);
+    builder.quick_push (4);
+    vec_perm_indices sel (builder, 2, len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+    ASSERT_TRUE (res == NULL_TREE);
+  }
+
+  /* Case 7: npatterns(arg0) = 4 > npatterns(sel) = 2
+     mask = {0, len, 1, len + 1, ...}
+     sel_npatterns = 2, sel_nelts_per_pattern = 2.  */
+  {
+    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (arg0_len, 2, 2);
+    builder.quick_push (0);
+    builder.quick_push (arg0_len);
+    builder.quick_push (1);
+    builder.quick_push (arg0_len + 1);
+    vec_perm_indices sel (builder, 2, arg0_len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0),
+			    vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
+			  };
+    validate_res (2, 2, res, expected_res);
+  }
+}
+
+static void
+test_mixed ()
+{
+  /* Case 1: op0, op1 -> VLS, sel -> VLA and selects from both input vectors.
+     In this case, we treat res_npatterns = nelts in input vector
+     and res_nelts_per_pattern = 1, and create a dup pattern.
+     sel = { 0, 4, 1, 5, ... }
+     res = { op0[0], op1[0], op0[1], op1[1], ...} // (4, 1)
+     res_npatterns = 4, res_nelts_per_pattern = 1.  */
+  {
+    tree arg_vectype = build_vector_type (integer_type_node, 4);
+    tree arg0 = build_vec_cst_rand (integer_type_node, 4, 1, 0, arg_vectype);
+    tree arg1 = build_vec_cst_rand (integer_type_node, 4, 1, 0, arg_vectype);
+
+    tree res_type = get_preferred_vectype (integer_type_node);
+    poly_uint64 res_len = TYPE_VECTOR_SUBPARTS (res_type);
+    vec_perm_builder builder (res_len, 4, 1);
+    builder.quick_push (0);
+    builder.quick_push (4);
+    builder.quick_push (1);
+    builder.quick_push (5);
+
+    vec_perm_indices sel (builder, 2, res_len);
+    tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel);
+    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0),
+			    vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
+			  };
+    validate_res (4, 1, res, expected_res);
+  }
+
+  /* Case 2: Same as Case 1, but sel contains an out of bounds index.
+     result should be NULL_TREE.  */
+  {
+    tree arg_vectype = build_vector_type (integer_type_node, 4);
+    tree arg0 = build_vec_cst_rand (integer_type_node, 4, 1, 0, arg_vectype);
+    tree arg1 = build_vec_cst_rand (integer_type_node, 4, 1, 0, arg_vectype);
+
+    tree res_type = get_preferred_vectype (integer_type_node);
+    poly_uint64 res_len = TYPE_VECTOR_SUBPARTS (res_type);
+    vec_perm_builder builder (res_len, 4, 1);
+    builder.quick_push (0);
+    builder.quick_push (8);
+    builder.quick_push (1);
+    builder.quick_push (5);
+
+    vec_perm_indices sel (builder, 2, res_len);
+    char reason[100] = "\0";
+    tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel, reason);
+    gcc_assert (res == NULL_TREE);
+    gcc_assert (!strcmp (reason, "out of bounds access"));
+  }
+
+  /* Case 3: op0, op1 are VLS and sel is VLA but contains stepped sequence
+     and crosses input vectors.
+     op0, op1 = V4SI vectors.
+     sel = { 0, 2, 4, ... }
+     res: { op0[0], op0[2], op1[0], op1[2], ... } (4, 1)  */
+  {
+    tree arg_vectype = build_vector_type (integer_type_node, 4);
+    tree arg0 = build_vec_cst_rand (integer_type_node, 4, 1, 0, arg_vectype);
+    tree arg1 = build_vec_cst_rand (integer_type_node, 4, 1, 0, arg_vectype);
+
+    tree res_type = get_preferred_vectype (integer_type_node);
+    poly_uint64 res_len = TYPE_VECTOR_SUBPARTS (res_type);
+    vec_perm_builder builder (res_len, 1, 3);
+    builder.quick_push (0);
+    builder.quick_push (2);
+    builder.quick_push (4);
+
+    vec_perm_indices sel (builder, 2, res_len);
+    tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel);
+
+    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg0, 2),
+			    vector_cst_elt (arg1, 0), vector_cst_elt (arg1, 2)
+			  };
+    validate_res (4, 1, res, expected_res);
+  }
+
+  /* Case 4: op0, op1 are VLA and sel is VLS.
+     op0, op1: VNx16QI with shape (2, 3)
+     sel = V4SI with values {0, 2, 4, 6}
+     res: V4SI with values { op0[0], op0[2], op0[4], op0[6] }.  */
+  {
+    tree arg0 = build_vec_cst_rand (char_type_node, 2, 3, 2);
+    tree arg1 = build_vec_cst_rand (char_type_node, 2, 3, 2);
+
+    poly_uint64 res_len = 4;
+    tree res_type = build_vector_type (char_type_node, res_len);
+    vec_perm_builder builder (res_len, 4, 1);
+    builder.quick_push (0);
+    builder.quick_push (2);
+    builder.quick_push (4);
+    builder.quick_push (6);
+
+    vec_perm_indices sel (builder, 2, res_len);
+    tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel);
+
+    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg0, 2),
+			    vector_cst_elt (arg0, 4), vector_cst_elt (arg0, 6)
+			  };
+    validate_res_vls (res, expected_res, 4);
+  }
+
+  /* Case 5: Same as case 4, but op0, op1 are VNx4SI with shape (2, 3) and step = 2
+     sel = V4SI with values {0, 2, 4, 6}
+     In this case result should be NULL_TREE because we cross input vector
+     boundary at index 4.  */
+  {
+    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 2);
+    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 2);
+
+    poly_uint64 res_len = 4;
+    tree res_type = build_vector_type (char_type_node, res_len);
+    vec_perm_builder builder (res_len, 4, 1);
+    builder.quick_push (0);
+    builder.quick_push (2);
+    builder.quick_push (4);
+    builder.quick_push (6);
+
+    vec_perm_indices sel (builder, 2, res_len);
+    char reason[100] = "\0";
+    tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel, reason);
+    gcc_assert (res == NULL_TREE);
+    gcc_assert (!strcmp (reason, "crossed input vectors"));
+  }
+}
+
+static void
+test ()
+{
+  tree vectype = get_preferred_vectype (integer_type_node);
+  if (TYPE_VECTOR_SUBPARTS (vectype).is_constant ())
+    return;
+
+  test_dup ();
+  test_stepped ();
+  test_mixed ();
+}
+};
+
 /* Verify that various binary operations on vectors are folded
    correctly.  */
 
@@ -16943,6 +17646,7 @@ fold_const_cc_tests ()
   test_arithmetic_folding ();
   test_vector_folding ();
   test_vec_duplicate_folding ();
+  test_fold_vec_perm_cst::test ();
 }
 
 } // namespace selftest

Richard Sandiford Aug. 3, 2023, 1:01 p.m. UTC | #4

Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> On Tue, 25 Jul 2023 at 18:25, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Hi,
>>
>> Thanks for the rework and sorry for the slow review.
> Hi Richard,
> Thanks for the suggestions!  Please find my responses inline below.
>>
>> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> > Hi Richard,
>> > This is reworking of patch to extend fold_vec_perm to handle VLA vectors.
>> > The attached patch unifies handling of VLS and VLA vector_csts, while
>> > using fallback code
>> > for ctors.
>> >
>> > For VLS vector, the patch ignores underlying encoding, and
>> > uses npatterns = nelts, and nelts_per_pattern = 1.
>> >
>> > For VLA patterns, if sel has a stepped sequence, then it
>> > only chooses elements from a particular pattern of a particular
>> > input vector.
>> >
>> > To make things simpler, the patch imposes following constraints:
>> > (a) op0_npatterns, op1_npatterns and sel_npatterns are powers of 2.
>> > (b) The step size for a stepped sequence is a power of 2, and
>> >       multiple of npatterns of chosen input vector.
>> > (c) Runtime vector length of sel is a multiple of sel_npatterns.
>> >      So, we don't handle sel.length = 2 + 2x and npatterns = 4.
>> >
>> > Eg:
>> > op0, op1: npatterns = 2, nelts_per_pattern = 3
>> > op0_len = op1_len = 16 + 16x.
>> > sel = { 0, 0, 2, 0, 4, 0, ... }
>> > npatterns = 2, nelts_per_pattern = 3.
>> >
>> > For pattern {0, 2, 4, ...}
>> > Let,
>> > a1 = 2
>> > S = step size = 2
>> >
>> > Let Esel denote number of elements per pattern in sel at runtime.
>> > Esel = (16 + 16x) / npatterns_sel
>> >         = (16 + 16x) / 2
>> >         = (8 + 8x)
>> >
>> > So, last element of pattern:
>> > ae = a1 + (Esel - 2) * S
>> >      = 2 + (8 + 8x - 2) * 2
>> >      = 14 + 16x
>> >
>> > a1 /trunc arg0_len = 2 / (16 + 16x) = 0
>> > ae /trunc arg0_len = (14 + 16x) / (16 + 16x) = 0
>> > Since both are equal with quotient = 0, we select elements from op0.
>> >
>> > Since step size (S) is a multiple of npatterns(op0), we select
>> > all elements from same pattern of op0.
>> >
>> > res_npatterns = max (op0_npatterns, max (op1_npatterns, sel_npatterns))
>> >                        = max (2, max (2, 2)
>> >                        = 2
>> >
>> > res_nelts_per_pattern = max (op0_nelts_per_pattern,
>> >                                                 max (op1_nelts_per_pattern,
>> >                                                          sel_nelts_per_pattern))
>> >                                     = max (3, max (3, 3))
>> >                                     = 3
>> >
>> > So res has encoding with npatterns = 2, nelts_per_pattern = 3.
>> > res: { op0[0], op0[0], op0[2], op0[0], op0[4], op0[0], ... }
>> >
>> > Unfortunately, this results in an issue for poly_int_cst index:
>> > For example,
>> > op0, op1: npatterns = 1, nelts_per_pattern = 3
>> > op0_len = op1_len = 4 + 4x
>> >
>> > sel: { 4 + 4x, 5 + 4x, 6 + 4x, ... } // should choose op1
>> >
>> > In this case,
>> > a1 = 5 + 4x
>> > S = (6 + 4x) - (5 + 4x) = 1
>> > Esel = 4 + 4x
>> >
>> > ae = a1 + (esel - 2) * S
>> >      = (5 + 4x) + (4 + 4x - 2) * 1
>> >      = 7 + 8x
>> >
>> > IIUC, 7 + 8x will always be index for last element of op1 ?
>> > if x = 0, len = 4, 7 + 8x = 7
>> > if x = 1, len = 8, 7 + 8x = 15, etc.
>> > So the stepped sequence will always choose elements
>> > from op1 regardless of vector length for above case ?
>> >
>> > However,
>> > ae /trunc op0_len
>> > = (7 + 8x) / (4 + 4x)
>> > which is not defined because 7/4 != 8/4
>> > and we return NULL_TREE, but I suppose the expected result would be:
>> > res: { op1[0], op1[1], op1[2], ... } ?
>> >
>> > The patch passes bootstrap+test on aarch64-linux-gnu with and without sve,
>> > and on x86_64-unknown-linux-gnu.
>> > I would be grateful for suggestions on how to proceed.
>> >
>> > Thanks,
>> > Prathamesh
>> >
>> > diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
>> > index a02ede79fed..8028b3e8e9a 100644
>> > --- a/gcc/fold-const.cc
>> > +++ b/gcc/fold-const.cc
>> > @@ -85,6 +85,10 @@ along with GCC; see the file COPYING3.  If not see
>> >  #include "vec-perm-indices.h"
>> >  #include "asan.h"
>> >  #include "gimple-range.h"
>> > +#include <algorithm>
>> > +#include "tree-pretty-print.h"
>> > +#include "gimple-pretty-print.h"
>> > +#include "print-tree.h"
>> >
>> >  /* Nonzero if we are folding constants inside an initializer or a C++
>> >     manifestly-constant-evaluated context; zero otherwise.
>> > @@ -10493,15 +10497,9 @@ fold_mult_zconjz (location_t loc, tree type, tree expr)
>> >  static bool
>> >  vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
>> >  {
>> > -  unsigned HOST_WIDE_INT i, nunits;
>> > +  unsigned HOST_WIDE_INT i;
>> >
>> > -  if (TREE_CODE (arg) == VECTOR_CST
>> > -      && VECTOR_CST_NELTS (arg).is_constant (&nunits))
>> > -    {
>> > -      for (i = 0; i < nunits; ++i)
>> > -     elts[i] = VECTOR_CST_ELT (arg, i);
>> > -    }
>> > -  else if (TREE_CODE (arg) == CONSTRUCTOR)
>> > +  if (TREE_CODE (arg) == CONSTRUCTOR)
>> >      {
>> >        constructor_elt *elt;
>> >
>> > @@ -10519,6 +10517,230 @@ vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
>> >    return true;
>> >  }
>> >
>> > +/* Return a vector with (NPATTERNS, NELTS_PER_PATTERN) encoding.  */
>> > +
>> > +static tree
>> > +vector_cst_reshape (tree vec, unsigned npatterns, unsigned nelts_per_pattern)
>> > +{
>> > +  gcc_assert (pow2p_hwi (npatterns));
>> > +
>> > +  if (VECTOR_CST_NPATTERNS (vec) == npatterns
>> > +      && VECTOR_CST_NELTS_PER_PATTERN (vec) == nelts_per_pattern)
>> > +    return vec;
>> > +
>> > +  tree v = make_vector (exact_log2 (npatterns), nelts_per_pattern);
>> > +  TREE_TYPE (v) = TREE_TYPE (vec);
>> > +
>> > +  unsigned nelts = npatterns * nelts_per_pattern;
>> > +  for (unsigned i = 0; i < nelts; i++)
>> > +    VECTOR_CST_ENCODED_ELT(v, i) = vector_cst_elt (vec, i);
>> > +  return v;
>> > +}
>> > +
>> > +/* Helper routine for fold_vec_perm_vla to check if ARG is a suitable
>> > +   operand for VLA vec_perm folding. If arg is VLS, then set
>> > +   NPATTERNS = nelts and NELTS_PER_PATTERN = 1.  */
>> > +
>> > +static tree
>> > +valid_operand_for_fold_vec_perm_cst_p (tree arg)
>> > +{
>> > +  if (TREE_CODE (arg) != VECTOR_CST)
>> > +    return NULL_TREE;
>> > +
>> > +  unsigned HOST_WIDE_INT nelts;
>> > +  unsigned npatterns, nelts_per_pattern;
>> > +  if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg)).is_constant (&nelts))
>> > +    {
>> > +      npatterns = nelts;
>> > +      nelts_per_pattern = 1;
>> > +    }
>> > +  else
>> > +    {
>> > +      npatterns = VECTOR_CST_NPATTERNS (arg);
>> > +      nelts_per_pattern = VECTOR_CST_NELTS_PER_PATTERN (arg);
>> > +    }
>> > +
>> > +  if (!pow2p_hwi (npatterns))
>> > +    return NULL_TREE;
>> > +
>> > +  return vector_cst_reshape (arg, npatterns, nelts_per_pattern);
>> > +}
>>
>> I don't think we should reshape the vectors for VLS, since it would
>> create more nodes for GC to clean up later.  Also, the "compact" encoding
>> is canonical even for VLS, so the reshaping would effectively create
>> noncanonical constants (even if only temporarily).
> Removed in the attached patch.
>>
>> Instead, I think we should change the later:
>>
>> > +  if (!valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, sel_npatterns,
>> > +                                        sel_nelts_per_pattern, reason, verbose))
>> > +    return NULL_TREE;
>>
>> so that it comes after the computation of res_npatterns and
>> res_nelts_per_pattern.  Then, if valid_mask_for_fold_vec_perm_cst_p
>> returns false, and if the result type has a constant number of elements,
>> we should:
>>
>> * set res_npatterns to that number of elements
>> * set res_nelts_per_pattern to 1
>> * continue instead of returning null
> Assuming we don't enforce only VLA or only VLS for input vectors and sel,
> won't that be still an issue if res (and sel) is VLS, and input
> vectors are VLA ?
> For eg:
> arg0, arg1 are type VNx4SI with npatterns = 2, nelts_per_pattern = 3, step = 2
> sel is V4SI constant with encoding { 0, 2, 4, ... }
> and res_type is V4SI.
> In this case, when it comes to index 4, the vector selection becomes ambiguous,
> since it can be arg1 for len = 4 + 4x, and arg0 for lengths > 4 + 4x ?

Ah, right.  So the condition is whether the result type and the data
input type have a constant number of elements, rather than just the result.

> In the attached patch if sel is not a suitable mask, and input vectors
> have constant length, then it sets:
> res_npatterns = nelts of input vector
> res_nelts_per_pattern = 1
> Does that look OK ?
> This part of the code has few tests in test_mixed().

res_npatterns should be nelts of the result vector, not the input.

>> The loop that follows will then do the correct thing for each element.
>>
>> The check for a power of 2 would then go in
>> valid_mask_for_fold_vec_perm_cst_p rather than
>> valid_operand_for_fold_vec_perm_cst_p.
>>
>> With that change, I think:
>>
>> > +
>> > +/* Helper routine for fold_vec_perm_cst to check if SEL is a suitable
>> > +   mask for VLA vec_perm folding. Set SEL_NPATTERNS and SEL_NELTS_PER_PATTERN
>> > +   similarly.  */
>> > +
>> > +static bool
>> > +valid_mask_for_fold_vec_perm_cst_p (tree arg0, tree arg1,
>> > +                                 const vec_perm_indices &sel,
>> > +                                 unsigned& sel_npatterns,
>> > +                                 unsigned& sel_nelts_per_pattern,
>> > +                                 char *reason = NULL,
>> > +                                 bool verbose = false)
>> > +{
>> > +  unsigned HOST_WIDE_INT nelts;
>> > +  if (sel.length ().is_constant (&nelts))
>> > +    {
>> > +      sel_npatterns = nelts;
>> > +      sel_nelts_per_pattern = 1;
>> > +    }
>> > +  else
>> > +    {
>> > +      sel_npatterns = sel.encoding ().npatterns ();
>> > +      sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
>> > +    }
>>
>> ...we should use the "else" code unconditionally.
> Done.
>>
>> The function comment should describe "reason" and "verbose".  It looks
>> like these are debug parameters, is that right?
> Yes, verbose is just for printf debugging, and reason is used in unit-tests when
> result is NULL_TREE, to verify that it's NULL for the intended reason,
> and not due to something else.

Could we make it a "const char **" instead, and just store the string
pointers there?  It looks like all the reasons are fixed strings rather
than dynamic.

>>
>> > +
>> > +  if (!pow2p_hwi (sel_npatterns))
>> > +    {
>> > +      if (reason)
>> > +     strcpy (reason, "sel_npatterns is not power of 2");
>> > +      return false;
>> > +    }
>> > +
>> > +  /* We want to avoid cases where sel.length is not a multiple of npatterns.
>> > +     For eg: sel.length = 2 + 2x, and sel npatterns = 4.  */
>> > +  poly_uint64 esel;
>> > +  if (!multiple_p (sel.length (), sel_npatterns, &esel))
>> > +    {
>> > +      if (reason)
>> > +     strcpy (reason, "sel.length is not multiple of sel_npatterns");
>> > +      return false;
>> > +    }
>> > +
>> > +  if (sel_nelts_per_pattern < 3)
>> > +    return true;
>> > +
>> > +  for (unsigned pattern = 0; pattern < sel_npatterns; pattern++)
>> > +    {
>> > +      poly_uint64 a1 = sel[pattern + sel_npatterns];
>> > +      poly_uint64 a2 = sel[pattern + 2 * sel_npatterns];
>> > +
>> > +      poly_uint64 step = a2 - a1;
>> > +      if (!step.is_constant ())
>> > +     {
>> > +       if (reason)
>> > +         strcpy (reason, "step is not constant");
>> > +       return false;
>> > +     }
>> > +      int S = step.to_constant ();
>> > +      if (S == 0)
>> > +     continue;
>>
>> Might be simpler as:
>>
>>       HOST_WIDE_INT step;
>>       if (!(a2 - a1).is_constant (&step))
>>         {
>>           ...
>>         }
> This resulted in following error:
> ../../gcc/gcc/fold-const.cc:10563:34: error: no matching function for
> call to ‘poly_int<2, long unsigned int>::is_constant(long int*)’
> 10563 |       if (!(a2 - a1).is_constant (&S))
>            |  ~~~~~~~~~~~~~~~~~~~~~~^~~~

OK:

       HOST_WIDE_INT step;
       if (!poly_int64 (a2 - a1).is_constant (&step))

> Could you please elaborate on the following issue, because it still
> stands in this patch,
> which is in Case 2 of test_fold_vec_perm_cst::test_stepped.
>
> For example,
> op0, op1: npatterns = 1, nelts_per_pattern = 3
> op0_len = op1_len = 4 + 4x
>
> sel: { 4 + 4x, 5 + 4x, 6 + 4x, ... } // should choose op1
>
> In this case,
> a1 = 5 + 4x
> S = (6 + 4x) - (5 + 4x) = 1
> Esel = 4 + 4x
>
> ae = a1 + (esel - 2) * S
>      = (5 + 4x) + (4 + 4x - 2) * 1
>      = 7 + 8x
>
> IIUC, 7 + 8x will always be index for last element of op1 ?
> if x = 0, len = 4, 7 + 8x = 7
> if x = 1, len = 8, 7 + 8x = 15, etc.
> So the stepped sequence will always choose elements
> from op1 regardless of vector length for above case ?
>
> However,
> ae /trunc op0_len
> = (7 + 8x) / (4 + 4x)
> which is not defined because 7/4 != 8/4

Ah, that was me being lazy and not handling a boundary case that
seemed difficult to keep overflow-free.  I've now pushed a fix.

Thanks,
Richard

Richard Sandiford Aug. 3, 2023, 1:16 p.m. UTC | #5

Richard Sandiford <richard.sandiford@arm.com> writes:
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> On Tue, 25 Jul 2023 at 18:25, Richard Sandiford
>> <richard.sandiford@arm.com> wrote:
>>>
>>> Hi,
>>>
>>> Thanks for the rework and sorry for the slow review.
>> Hi Richard,
>> Thanks for the suggestions!  Please find my responses inline below.
>>>
>>> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>>> > Hi Richard,
>>> > This is reworking of patch to extend fold_vec_perm to handle VLA vectors.
>>> > The attached patch unifies handling of VLS and VLA vector_csts, while
>>> > using fallback code
>>> > for ctors.
>>> >
>>> > For VLS vector, the patch ignores underlying encoding, and
>>> > uses npatterns = nelts, and nelts_per_pattern = 1.
>>> >
>>> > For VLA patterns, if sel has a stepped sequence, then it
>>> > only chooses elements from a particular pattern of a particular
>>> > input vector.
>>> >
>>> > To make things simpler, the patch imposes following constraints:
>>> > (a) op0_npatterns, op1_npatterns and sel_npatterns are powers of 2.
>>> > (b) The step size for a stepped sequence is a power of 2, and
>>> >       multiple of npatterns of chosen input vector.
>>> > (c) Runtime vector length of sel is a multiple of sel_npatterns.
>>> >      So, we don't handle sel.length = 2 + 2x and npatterns = 4.
>>> >
>>> > Eg:
>>> > op0, op1: npatterns = 2, nelts_per_pattern = 3
>>> > op0_len = op1_len = 16 + 16x.
>>> > sel = { 0, 0, 2, 0, 4, 0, ... }
>>> > npatterns = 2, nelts_per_pattern = 3.
>>> >
>>> > For pattern {0, 2, 4, ...}
>>> > Let,
>>> > a1 = 2
>>> > S = step size = 2
>>> >
>>> > Let Esel denote number of elements per pattern in sel at runtime.
>>> > Esel = (16 + 16x) / npatterns_sel
>>> >         = (16 + 16x) / 2
>>> >         = (8 + 8x)
>>> >
>>> > So, last element of pattern:
>>> > ae = a1 + (Esel - 2) * S
>>> >      = 2 + (8 + 8x - 2) * 2
>>> >      = 14 + 16x
>>> >
>>> > a1 /trunc arg0_len = 2 / (16 + 16x) = 0
>>> > ae /trunc arg0_len = (14 + 16x) / (16 + 16x) = 0
>>> > Since both are equal with quotient = 0, we select elements from op0.
>>> >
>>> > Since step size (S) is a multiple of npatterns(op0), we select
>>> > all elements from same pattern of op0.
>>> >
>>> > res_npatterns = max (op0_npatterns, max (op1_npatterns, sel_npatterns))
>>> >                        = max (2, max (2, 2)
>>> >                        = 2
>>> >
>>> > res_nelts_per_pattern = max (op0_nelts_per_pattern,
>>> >                                                 max (op1_nelts_per_pattern,
>>> >                                                          sel_nelts_per_pattern))
>>> >                                     = max (3, max (3, 3))
>>> >                                     = 3
>>> >
>>> > So res has encoding with npatterns = 2, nelts_per_pattern = 3.
>>> > res: { op0[0], op0[0], op0[2], op0[0], op0[4], op0[0], ... }
>>> >
>>> > Unfortunately, this results in an issue for poly_int_cst index:
>>> > For example,
>>> > op0, op1: npatterns = 1, nelts_per_pattern = 3
>>> > op0_len = op1_len = 4 + 4x
>>> >
>>> > sel: { 4 + 4x, 5 + 4x, 6 + 4x, ... } // should choose op1
>>> >
>>> > In this case,
>>> > a1 = 5 + 4x
>>> > S = (6 + 4x) - (5 + 4x) = 1
>>> > Esel = 4 + 4x
>>> >
>>> > ae = a1 + (esel - 2) * S
>>> >      = (5 + 4x) + (4 + 4x - 2) * 1
>>> >      = 7 + 8x
>>> >
>>> > IIUC, 7 + 8x will always be index for last element of op1 ?
>>> > if x = 0, len = 4, 7 + 8x = 7
>>> > if x = 1, len = 8, 7 + 8x = 15, etc.
>>> > So the stepped sequence will always choose elements
>>> > from op1 regardless of vector length for above case ?
>>> >
>>> > However,
>>> > ae /trunc op0_len
>>> > = (7 + 8x) / (4 + 4x)
>>> > which is not defined because 7/4 != 8/4
>>> > and we return NULL_TREE, but I suppose the expected result would be:
>>> > res: { op1[0], op1[1], op1[2], ... } ?
>>> >
>>> > The patch passes bootstrap+test on aarch64-linux-gnu with and without sve,
>>> > and on x86_64-unknown-linux-gnu.
>>> > I would be grateful for suggestions on how to proceed.
>>> >
>>> > Thanks,
>>> > Prathamesh
>>> >
>>> > diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
>>> > index a02ede79fed..8028b3e8e9a 100644
>>> > --- a/gcc/fold-const.cc
>>> > +++ b/gcc/fold-const.cc
>>> > @@ -85,6 +85,10 @@ along with GCC; see the file COPYING3.  If not see
>>> >  #include "vec-perm-indices.h"
>>> >  #include "asan.h"
>>> >  #include "gimple-range.h"
>>> > +#include <algorithm>
>>> > +#include "tree-pretty-print.h"
>>> > +#include "gimple-pretty-print.h"
>>> > +#include "print-tree.h"
>>> >
>>> >  /* Nonzero if we are folding constants inside an initializer or a C++
>>> >     manifestly-constant-evaluated context; zero otherwise.
>>> > @@ -10493,15 +10497,9 @@ fold_mult_zconjz (location_t loc, tree type, tree expr)
>>> >  static bool
>>> >  vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
>>> >  {
>>> > -  unsigned HOST_WIDE_INT i, nunits;
>>> > +  unsigned HOST_WIDE_INT i;
>>> >
>>> > -  if (TREE_CODE (arg) == VECTOR_CST
>>> > -      && VECTOR_CST_NELTS (arg).is_constant (&nunits))
>>> > -    {
>>> > -      for (i = 0; i < nunits; ++i)
>>> > -     elts[i] = VECTOR_CST_ELT (arg, i);
>>> > -    }
>>> > -  else if (TREE_CODE (arg) == CONSTRUCTOR)
>>> > +  if (TREE_CODE (arg) == CONSTRUCTOR)
>>> >      {
>>> >        constructor_elt *elt;
>>> >
>>> > @@ -10519,6 +10517,230 @@ vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
>>> >    return true;
>>> >  }
>>> >
>>> > +/* Return a vector with (NPATTERNS, NELTS_PER_PATTERN) encoding.  */
>>> > +
>>> > +static tree
>>> > +vector_cst_reshape (tree vec, unsigned npatterns, unsigned nelts_per_pattern)
>>> > +{
>>> > +  gcc_assert (pow2p_hwi (npatterns));
>>> > +
>>> > +  if (VECTOR_CST_NPATTERNS (vec) == npatterns
>>> > +      && VECTOR_CST_NELTS_PER_PATTERN (vec) == nelts_per_pattern)
>>> > +    return vec;
>>> > +
>>> > +  tree v = make_vector (exact_log2 (npatterns), nelts_per_pattern);
>>> > +  TREE_TYPE (v) = TREE_TYPE (vec);
>>> > +
>>> > +  unsigned nelts = npatterns * nelts_per_pattern;
>>> > +  for (unsigned i = 0; i < nelts; i++)
>>> > +    VECTOR_CST_ENCODED_ELT(v, i) = vector_cst_elt (vec, i);
>>> > +  return v;
>>> > +}
>>> > +
>>> > +/* Helper routine for fold_vec_perm_vla to check if ARG is a suitable
>>> > +   operand for VLA vec_perm folding. If arg is VLS, then set
>>> > +   NPATTERNS = nelts and NELTS_PER_PATTERN = 1.  */
>>> > +
>>> > +static tree
>>> > +valid_operand_for_fold_vec_perm_cst_p (tree arg)
>>> > +{
>>> > +  if (TREE_CODE (arg) != VECTOR_CST)
>>> > +    return NULL_TREE;
>>> > +
>>> > +  unsigned HOST_WIDE_INT nelts;
>>> > +  unsigned npatterns, nelts_per_pattern;
>>> > +  if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg)).is_constant (&nelts))
>>> > +    {
>>> > +      npatterns = nelts;
>>> > +      nelts_per_pattern = 1;
>>> > +    }
>>> > +  else
>>> > +    {
>>> > +      npatterns = VECTOR_CST_NPATTERNS (arg);
>>> > +      nelts_per_pattern = VECTOR_CST_NELTS_PER_PATTERN (arg);
>>> > +    }
>>> > +
>>> > +  if (!pow2p_hwi (npatterns))
>>> > +    return NULL_TREE;
>>> > +
>>> > +  return vector_cst_reshape (arg, npatterns, nelts_per_pattern);
>>> > +}
>>>
>>> I don't think we should reshape the vectors for VLS, since it would
>>> create more nodes for GC to clean up later.  Also, the "compact" encoding
>>> is canonical even for VLS, so the reshaping would effectively create
>>> noncanonical constants (even if only temporarily).
>> Removed in the attached patch.
>>>
>>> Instead, I think we should change the later:
>>>
>>> > +  if (!valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, sel_npatterns,
>>> > +                                        sel_nelts_per_pattern, reason, verbose))
>>> > +    return NULL_TREE;
>>>
>>> so that it comes after the computation of res_npatterns and
>>> res_nelts_per_pattern.  Then, if valid_mask_for_fold_vec_perm_cst_p
>>> returns false, and if the result type has a constant number of elements,
>>> we should:
>>>
>>> * set res_npatterns to that number of elements
>>> * set res_nelts_per_pattern to 1
>>> * continue instead of returning null
>> Assuming we don't enforce only VLA or only VLS for input vectors and sel,
>> won't that be still an issue if res (and sel) is VLS, and input
>> vectors are VLA ?
>> For eg:
>> arg0, arg1 are type VNx4SI with npatterns = 2, nelts_per_pattern = 3, step = 2
>> sel is V4SI constant with encoding { 0, 2, 4, ... }
>> and res_type is V4SI.
>> In this case, when it comes to index 4, the vector selection becomes ambiguous,
>> since it can be arg1 for len = 4 + 4x, and arg0 for lengths > 4 + 4x ?
>
> Ah, right.  So the condition is whether the result type and the data
> input type have a constant number of elements, rather than just the result.

Actually, I take that back.  The reason:

>>> The loop that follows will then do the correct thing for each element.

is true is that:

+      if (!can_div_trunc_p (sel[i], len, &q, &r))
+	{
+	  if (reason)
+	    strcpy (reason, "cannot divide selector element by arg len");
+	  return NULL_TREE;
+	}

will return false if q isn't computable at compile time (that is,
if we can't decide at compile time which input the element comes from).

So I think checking the result is enough.

Thanks,
Richard

Prathamesh Kulkarni Aug. 4, 2023, 10:06 a.m. UTC | #6

On Thu, 3 Aug 2023 at 18:46, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Sandiford <richard.sandiford@arm.com> writes:
> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> On Tue, 25 Jul 2023 at 18:25, Richard Sandiford
> >> <richard.sandiford@arm.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> Thanks for the rework and sorry for the slow review.
> >> Hi Richard,
> >> Thanks for the suggestions!  Please find my responses inline below.
> >>>
> >>> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >>> > Hi Richard,
> >>> > This is reworking of patch to extend fold_vec_perm to handle VLA vectors.
> >>> > The attached patch unifies handling of VLS and VLA vector_csts, while
> >>> > using fallback code
> >>> > for ctors.
> >>> >
> >>> > For VLS vector, the patch ignores underlying encoding, and
> >>> > uses npatterns = nelts, and nelts_per_pattern = 1.
> >>> >
> >>> > For VLA patterns, if sel has a stepped sequence, then it
> >>> > only chooses elements from a particular pattern of a particular
> >>> > input vector.
> >>> >
> >>> > To make things simpler, the patch imposes following constraints:
> >>> > (a) op0_npatterns, op1_npatterns and sel_npatterns are powers of 2.
> >>> > (b) The step size for a stepped sequence is a power of 2, and
> >>> >       multiple of npatterns of chosen input vector.
> >>> > (c) Runtime vector length of sel is a multiple of sel_npatterns.
> >>> >      So, we don't handle sel.length = 2 + 2x and npatterns = 4.
> >>> >
> >>> > Eg:
> >>> > op0, op1: npatterns = 2, nelts_per_pattern = 3
> >>> > op0_len = op1_len = 16 + 16x.
> >>> > sel = { 0, 0, 2, 0, 4, 0, ... }
> >>> > npatterns = 2, nelts_per_pattern = 3.
> >>> >
> >>> > For pattern {0, 2, 4, ...}
> >>> > Let,
> >>> > a1 = 2
> >>> > S = step size = 2
> >>> >
> >>> > Let Esel denote number of elements per pattern in sel at runtime.
> >>> > Esel = (16 + 16x) / npatterns_sel
> >>> >         = (16 + 16x) / 2
> >>> >         = (8 + 8x)
> >>> >
> >>> > So, last element of pattern:
> >>> > ae = a1 + (Esel - 2) * S
> >>> >      = 2 + (8 + 8x - 2) * 2
> >>> >      = 14 + 16x
> >>> >
> >>> > a1 /trunc arg0_len = 2 / (16 + 16x) = 0
> >>> > ae /trunc arg0_len = (14 + 16x) / (16 + 16x) = 0
> >>> > Since both are equal with quotient = 0, we select elements from op0.
> >>> >
> >>> > Since step size (S) is a multiple of npatterns(op0), we select
> >>> > all elements from same pattern of op0.
> >>> >
> >>> > res_npatterns = max (op0_npatterns, max (op1_npatterns, sel_npatterns))
> >>> >                        = max (2, max (2, 2)
> >>> >                        = 2
> >>> >
> >>> > res_nelts_per_pattern = max (op0_nelts_per_pattern,
> >>> >                                                 max (op1_nelts_per_pattern,
> >>> >                                                          sel_nelts_per_pattern))
> >>> >                                     = max (3, max (3, 3))
> >>> >                                     = 3
> >>> >
> >>> > So res has encoding with npatterns = 2, nelts_per_pattern = 3.
> >>> > res: { op0[0], op0[0], op0[2], op0[0], op0[4], op0[0], ... }
> >>> >
> >>> > Unfortunately, this results in an issue for poly_int_cst index:
> >>> > For example,
> >>> > op0, op1: npatterns = 1, nelts_per_pattern = 3
> >>> > op0_len = op1_len = 4 + 4x
> >>> >
> >>> > sel: { 4 + 4x, 5 + 4x, 6 + 4x, ... } // should choose op1
> >>> >
> >>> > In this case,
> >>> > a1 = 5 + 4x
> >>> > S = (6 + 4x) - (5 + 4x) = 1
> >>> > Esel = 4 + 4x
> >>> >
> >>> > ae = a1 + (esel - 2) * S
> >>> >      = (5 + 4x) + (4 + 4x - 2) * 1
> >>> >      = 7 + 8x
> >>> >
> >>> > IIUC, 7 + 8x will always be index for last element of op1 ?
> >>> > if x = 0, len = 4, 7 + 8x = 7
> >>> > if x = 1, len = 8, 7 + 8x = 15, etc.
> >>> > So the stepped sequence will always choose elements
> >>> > from op1 regardless of vector length for above case ?
> >>> >
> >>> > However,
> >>> > ae /trunc op0_len
> >>> > = (7 + 8x) / (4 + 4x)
> >>> > which is not defined because 7/4 != 8/4
> >>> > and we return NULL_TREE, but I suppose the expected result would be:
> >>> > res: { op1[0], op1[1], op1[2], ... } ?
> >>> >
> >>> > The patch passes bootstrap+test on aarch64-linux-gnu with and without sve,
> >>> > and on x86_64-unknown-linux-gnu.
> >>> > I would be grateful for suggestions on how to proceed.
> >>> >
> >>> > Thanks,
> >>> > Prathamesh
> >>> >
> >>> > diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> >>> > index a02ede79fed..8028b3e8e9a 100644
> >>> > --- a/gcc/fold-const.cc
> >>> > +++ b/gcc/fold-const.cc
> >>> > @@ -85,6 +85,10 @@ along with GCC; see the file COPYING3.  If not see
> >>> >  #include "vec-perm-indices.h"
> >>> >  #include "asan.h"
> >>> >  #include "gimple-range.h"
> >>> > +#include <algorithm>
> >>> > +#include "tree-pretty-print.h"
> >>> > +#include "gimple-pretty-print.h"
> >>> > +#include "print-tree.h"
> >>> >
> >>> >  /* Nonzero if we are folding constants inside an initializer or a C++
> >>> >     manifestly-constant-evaluated context; zero otherwise.
> >>> > @@ -10493,15 +10497,9 @@ fold_mult_zconjz (location_t loc, tree type, tree expr)
> >>> >  static bool
> >>> >  vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
> >>> >  {
> >>> > -  unsigned HOST_WIDE_INT i, nunits;
> >>> > +  unsigned HOST_WIDE_INT i;
> >>> >
> >>> > -  if (TREE_CODE (arg) == VECTOR_CST
> >>> > -      && VECTOR_CST_NELTS (arg).is_constant (&nunits))
> >>> > -    {
> >>> > -      for (i = 0; i < nunits; ++i)
> >>> > -     elts[i] = VECTOR_CST_ELT (arg, i);
> >>> > -    }
> >>> > -  else if (TREE_CODE (arg) == CONSTRUCTOR)
> >>> > +  if (TREE_CODE (arg) == CONSTRUCTOR)
> >>> >      {
> >>> >        constructor_elt *elt;
> >>> >
> >>> > @@ -10519,6 +10517,230 @@ vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
> >>> >    return true;
> >>> >  }
> >>> >
> >>> > +/* Return a vector with (NPATTERNS, NELTS_PER_PATTERN) encoding.  */
> >>> > +
> >>> > +static tree
> >>> > +vector_cst_reshape (tree vec, unsigned npatterns, unsigned nelts_per_pattern)
> >>> > +{
> >>> > +  gcc_assert (pow2p_hwi (npatterns));
> >>> > +
> >>> > +  if (VECTOR_CST_NPATTERNS (vec) == npatterns
> >>> > +      && VECTOR_CST_NELTS_PER_PATTERN (vec) == nelts_per_pattern)
> >>> > +    return vec;
> >>> > +
> >>> > +  tree v = make_vector (exact_log2 (npatterns), nelts_per_pattern);
> >>> > +  TREE_TYPE (v) = TREE_TYPE (vec);
> >>> > +
> >>> > +  unsigned nelts = npatterns * nelts_per_pattern;
> >>> > +  for (unsigned i = 0; i < nelts; i++)
> >>> > +    VECTOR_CST_ENCODED_ELT(v, i) = vector_cst_elt (vec, i);
> >>> > +  return v;
> >>> > +}
> >>> > +
> >>> > +/* Helper routine for fold_vec_perm_vla to check if ARG is a suitable
> >>> > +   operand for VLA vec_perm folding. If arg is VLS, then set
> >>> > +   NPATTERNS = nelts and NELTS_PER_PATTERN = 1.  */
> >>> > +
> >>> > +static tree
> >>> > +valid_operand_for_fold_vec_perm_cst_p (tree arg)
> >>> > +{
> >>> > +  if (TREE_CODE (arg) != VECTOR_CST)
> >>> > +    return NULL_TREE;
> >>> > +
> >>> > +  unsigned HOST_WIDE_INT nelts;
> >>> > +  unsigned npatterns, nelts_per_pattern;
> >>> > +  if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg)).is_constant (&nelts))
> >>> > +    {
> >>> > +      npatterns = nelts;
> >>> > +      nelts_per_pattern = 1;
> >>> > +    }
> >>> > +  else
> >>> > +    {
> >>> > +      npatterns = VECTOR_CST_NPATTERNS (arg);
> >>> > +      nelts_per_pattern = VECTOR_CST_NELTS_PER_PATTERN (arg);
> >>> > +    }
> >>> > +
> >>> > +  if (!pow2p_hwi (npatterns))
> >>> > +    return NULL_TREE;
> >>> > +
> >>> > +  return vector_cst_reshape (arg, npatterns, nelts_per_pattern);
> >>> > +}
> >>>
> >>> I don't think we should reshape the vectors for VLS, since it would
> >>> create more nodes for GC to clean up later.  Also, the "compact" encoding
> >>> is canonical even for VLS, so the reshaping would effectively create
> >>> noncanonical constants (even if only temporarily).
> >> Removed in the attached patch.
> >>>
> >>> Instead, I think we should change the later:
> >>>
> >>> > +  if (!valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, sel_npatterns,
> >>> > +                                        sel_nelts_per_pattern, reason, verbose))
> >>> > +    return NULL_TREE;
> >>>
> >>> so that it comes after the computation of res_npatterns and
> >>> res_nelts_per_pattern.  Then, if valid_mask_for_fold_vec_perm_cst_p
> >>> returns false, and if the result type has a constant number of elements,
> >>> we should:
> >>>
> >>> * set res_npatterns to that number of elements
> >>> * set res_nelts_per_pattern to 1
> >>> * continue instead of returning null
> >> Assuming we don't enforce only VLA or only VLS for input vectors and sel,
> >> won't that be still an issue if res (and sel) is VLS, and input
> >> vectors are VLA ?
> >> For eg:
> >> arg0, arg1 are type VNx4SI with npatterns = 2, nelts_per_pattern = 3, step = 2
> >> sel is V4SI constant with encoding { 0, 2, 4, ... }
> >> and res_type is V4SI.
> >> In this case, when it comes to index 4, the vector selection becomes ambiguous,
> >> since it can be arg1 for len = 4 + 4x, and arg0 for lengths > 4 + 4x ?
> >
> > Ah, right.  So the condition is whether the result type and the data
> > input type have a constant number of elements, rather than just the result.
>
> Actually, I take that back.  The reason:
>
> >>> The loop that follows will then do the correct thing for each element.
>
> is true is that:
>
> +      if (!can_div_trunc_p (sel[i], len, &q, &r))
> +       {
> +         if (reason)
> +           strcpy (reason, "cannot divide selector element by arg len");
> +         return NULL_TREE;
> +       }
>
> will return false if q isn't computable at compile time (that is,
> if we can't decide at compile time which input the element comes from).
>
> So I think checking the result is enough.
Ah yes, thanks for pointing it out! I verified that's indeed the case
(test 4 in test_fold_vec_perm_cst::test_mixed in attached patch).
Does the attached patch look OK ?
Bootstrapped+tested on aarch64-linux-gnu with and without SVE, and on
x86_64-linux-gnu.

Thanks,
Prathamesh
>
> Thanks,
> Richard
diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index 7e5494dfd39..680d0e54fd4 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -85,6 +85,10 @@ along with GCC; see the file COPYING3.  If not see
 #include "vec-perm-indices.h"
 #include "asan.h"
 #include "gimple-range.h"
+#include <algorithm>
+#include "tree-pretty-print.h"
+#include "gimple-pretty-print.h"
+#include "print-tree.h"
 
 /* Nonzero if we are folding constants inside an initializer or a C++
    manifestly-constant-evaluated context; zero otherwise.
@@ -10494,15 +10498,9 @@ fold_mult_zconjz (location_t loc, tree type, tree expr)
 static bool
 vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
 {
-  unsigned HOST_WIDE_INT i, nunits;
+  unsigned HOST_WIDE_INT i;
 
-  if (TREE_CODE (arg) == VECTOR_CST
-      && VECTOR_CST_NELTS (arg).is_constant (&nunits))
-    {
-      for (i = 0; i < nunits; ++i)
-	elts[i] = VECTOR_CST_ELT (arg, i);
-    }
-  else if (TREE_CODE (arg) == CONSTRUCTOR)
+  if (TREE_CODE (arg) == CONSTRUCTOR)
     {
       constructor_elt *elt;
 
@@ -10520,6 +10518,192 @@ vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
   return true;
 }
 
+/* Helper routine for fold_vec_perm_cst to check if SEL is a suitable
+   mask for VLA vec_perm folding.
+   REASON if specified, will contain the reason why SEL is not suitable.
+   Used only for debugging and unit-testing.
+   VERBOSE if enabled is used for debugging output.  */
+
+static bool
+valid_mask_for_fold_vec_perm_cst_p (tree arg0, tree arg1,
+				    const vec_perm_indices &sel,
+				    const char **reason = NULL,
+				    ATTRIBUTE_UNUSED bool verbose = false)
+{
+  unsigned sel_npatterns = sel.encoding ().npatterns ();
+  unsigned sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
+
+  if (!(pow2p_hwi (sel_npatterns)
+	&& pow2p_hwi (VECTOR_CST_NPATTERNS (arg0))
+	&& pow2p_hwi (VECTOR_CST_NPATTERNS (arg1))))
+    {
+      if (reason)
+	*reason = "npatterns is not power of 2";
+      return false;
+    }
+
+  /* We want to avoid cases where sel.length is not a multiple of npatterns.
+     For eg: sel.length = 2 + 2x, and sel npatterns = 4.  */
+  poly_uint64 esel;
+  if (!multiple_p (sel.length (), sel_npatterns, &esel))
+    {
+      if (reason)
+	*reason = "sel.length is not multiple of sel_npatterns";
+      return false;
+    }
+
+  if (sel_nelts_per_pattern < 3)
+    return true;
+
+  for (unsigned pattern = 0; pattern < sel_npatterns; pattern++)
+    {
+      poly_uint64 a1 = sel[pattern + sel_npatterns];
+      poly_uint64 a2 = sel[pattern + 2 * sel_npatterns];
+      HOST_WIDE_INT S; 
+      if (!poly_int64 (a2 - a1).is_constant (&S))
+	{
+	  if (reason)
+	    *reason = "step is not constant";
+	  return false;
+	}
+      // FIXME: Punt on S < 0 for now, revisit later.
+      if (S < 0)
+	return false;
+      if (S == 0)
+	continue;
+
+      if (!pow2p_hwi (S))
+	{
+	  if (reason)
+	    *reason = "step is not power of 2";
+	  return false;
+	}
+
+      /* Ensure that stepped sequence of the pattern selects elements
+	 only from the same input vector if it's VLA.  */
+      uint64_t q1, qe;
+      poly_uint64 r1, re;
+      poly_uint64 ae = a1 + (esel - 2) * S;
+      poly_uint64 arg_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+      if (!(can_div_trunc_p (a1, arg_len, &q1, &r1)
+	    && can_div_trunc_p (ae, arg_len, &qe, &re)
+	    && q1 == qe))
+	{
+	  if (reason)
+	    *reason = "crossed input vectors";
+	  return false;
+	}
+
+      unsigned arg_npatterns
+	= ((q1 & 0) == 0) ? VECTOR_CST_NPATTERNS (arg0)
+			  : VECTOR_CST_NPATTERNS (arg1);
+
+      if (!multiple_p (S, arg_npatterns))
+	{
+	  if (reason)
+	    *reason = "S is not multiple of npatterns";
+	  return false;
+	}
+    }
+
+  return true;
+}
+
+/* Try to fold permutation of ARG0 and ARG1 with SEL selector when
+   the input vectors are VECTOR_CST. Return NULL_TREE otherwise.
+   REASON and VERBOSE have same purpose as described in
+   valid_mask_for_fold_vec_perm_cst_p.
+
+   (1) If SEL is a suitable mask as determined by
+       valid_mask_for_fold_vec_perm_cst_p, then:
+       res_npatterns = max of npatterns between ARG0, ARG1, and SEL
+       res_nelts_per_pattern = max of nelts_per_pattern between
+			       ARG0, ARG1 and SEL.
+   (2) If SEL is not a suitable mask, and ARG0, ARG1 are VLS,
+       then:
+       res_npatterns = nelts in input vector.
+       res_nelts_per_pattern = 1.
+       This exception is made so that VLS ARG0, ARG1 and SEL work as before.  */
+
+static tree
+fold_vec_perm_cst (tree type, tree arg0, tree arg1, const vec_perm_indices &sel,
+		   const char **reason = NULL, bool verbose = false)
+{
+  unsigned res_npatterns, res_nelts_per_pattern;
+  unsigned HOST_WIDE_INT res_nelts;
+
+  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason, verbose))
+    {
+      res_npatterns
+	= std::max (VECTOR_CST_NPATTERNS (arg0),
+		    std::max (VECTOR_CST_NPATTERNS (arg1),
+			      sel.encoding ().npatterns ()));
+
+      res_nelts_per_pattern
+	= std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
+		    std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1),
+			      sel.encoding ().nelts_per_pattern ()));
+
+      res_nelts = res_npatterns * res_nelts_per_pattern;
+    }
+  else if (TYPE_VECTOR_SUBPARTS (type).is_constant (&res_nelts))
+    {
+      res_npatterns = res_nelts;
+      res_nelts_per_pattern = 1;
+    }
+  else
+    return NULL_TREE;
+
+  tree_vector_builder out_elts (type, res_npatterns, res_nelts_per_pattern);
+  for (unsigned i = 0; i < res_nelts; i++)
+    {
+      poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+      uint64_t q;
+      poly_uint64 r;
+      unsigned HOST_WIDE_INT index;
+
+      unsigned HOST_WIDE_INT arg_nelts;
+      if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant (&arg_nelts)
+	  && known_ge (sel[i], poly_int64 (2 * arg_nelts)))
+	{
+	  if (reason)
+	    *reason = "out of bounds access";
+	  return NULL_TREE;
+	}
+
+      /* Punt if sel[i] /trunc_div len cannot be determined,
+	 because the input vector to be chosen will depend on
+	 runtime vector length.
+	 For example if len == 4 + 4x, and sel[i] == 4,
+	 If len at runtime equals 4, we choose arg1[0].
+	 For any other value of len > 4 at runtime, we choose arg0[4].
+	 which makes the element choice dependent on runtime vector length.  */
+      if (!can_div_trunc_p (sel[i], len, &q, &r))
+	{
+	  if (reason)
+	    *reason = "cannot divide selector element by arg len";
+	  return NULL_TREE;
+	}
+
+      /* sel[i] % len will give the index of element in the chosen input
+	 vector. For example if sel[i] == 5 + 4x and len == 4 + 4x,
+	 we will choose arg1[1] since (5 + 4x) % (4 + 4x) == 1.  */
+      if (!r.is_constant (&index))
+	{
+	  if (reason)
+	    *reason = "remainder is not constant";
+	  return NULL_TREE;
+	}
+
+      tree arg = ((q & 1) == 0) ? arg0 : arg1;
+      tree elem = vector_cst_elt (arg, index);
+      out_elts.quick_push (elem);
+    }
+
+  return out_elts.build ();
+}
+
 /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL
    selector.  Return the folded VECTOR_CST or CONSTRUCTOR if successful,
    NULL_TREE otherwise.  */
@@ -10529,43 +10713,40 @@ fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
 {
   unsigned int i;
   unsigned HOST_WIDE_INT nelts;
-  bool need_ctor = false;
 
-  if (!sel.length ().is_constant (&nelts))
-    return NULL_TREE;
-  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts)
-	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts)
-	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts));
+  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), sel.length ())
+	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)),
+			   TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1))));
+
   if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type)
       || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type))
     return NULL_TREE;
 
+  if (TREE_CODE (arg0) == VECTOR_CST
+      && TREE_CODE (arg1) == VECTOR_CST)
+    return fold_vec_perm_cst (type, arg0, arg1, sel);
+
+  /* For fall back case, we want to ensure we have VLS vectors
+     with equal length.  */
+  if (!sel.length ().is_constant (&nelts))
+    return NULL_TREE;
+
+  gcc_assert (known_eq (sel.length (), TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0))));
   tree *in_elts = XALLOCAVEC (tree, nelts * 2);
   if (!vec_cst_ctor_to_array (arg0, nelts, in_elts)
       || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts))
     return NULL_TREE;
 
-  tree_vector_builder out_elts (type, nelts, 1);
+  vec<constructor_elt, va_gc> *v;
+  vec_alloc (v, nelts);
   for (i = 0; i < nelts; i++)
     {
       HOST_WIDE_INT index;
       if (!sel[i].is_constant (&index))
 	return NULL_TREE;
-      if (!CONSTANT_CLASS_P (in_elts[index]))
-	need_ctor = true;
-      out_elts.quick_push (unshare_expr (in_elts[index]));
-    }
-
-  if (need_ctor)
-    {
-      vec<constructor_elt, va_gc> *v;
-      vec_alloc (v, nelts);
-      for (i = 0; i < nelts; i++)
-	CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, out_elts[i]);
-      return build_constructor (type, v);
+      CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, in_elts[index]);
     }
-  else
-    return out_elts.build ();
+  return build_constructor (type, v);
 }
 
 /* Try to fold a pointer difference of type TYPE two address expressions of
@@ -16892,6 +17073,508 @@ test_arithmetic_folding ()
 				   x);
 }
 
+namespace test_fold_vec_perm_cst {
+
+static tree
+get_preferred_vectype (tree inner_type)
+{
+  scalar_int_mode int_mode = SCALAR_INT_TYPE_MODE (inner_type);
+  machine_mode vmode = targetm.vectorize.preferred_simd_mode (int_mode);
+  poly_uint64 nunits = GET_MODE_NUNITS (vmode);
+  return build_vector_type (inner_type, nunits);
+}
+
+static tree
+build_vec_cst_rand (tree inner_type, unsigned npatterns,
+		    unsigned nelts_per_pattern, int S = 0,
+		    tree vectype = NULL_TREE)
+{
+  if (!vectype)
+    vectype = get_preferred_vectype (inner_type);
+  tree_vector_builder builder (vectype, npatterns, nelts_per_pattern);
+
+  // Fill a0 for each pattern
+  for (unsigned i = 0; i < npatterns; i++)
+    builder.quick_push (build_int_cst (inner_type, rand () % 100));
+
+  if (nelts_per_pattern == 1)
+    return builder.build ();
+
+  // Fill a1 for each pattern
+  for (unsigned i = 0; i < npatterns; i++)
+    builder.quick_push (build_int_cst (inner_type, rand () % 100));
+
+  if (nelts_per_pattern == 2)
+    return builder.build ();
+
+  for (unsigned i = npatterns * 2; i < npatterns * nelts_per_pattern; i++)
+    {
+      tree prev_elem = builder[i - npatterns];
+      int prev_elem_val = TREE_INT_CST_LOW (prev_elem);
+      int val = prev_elem_val + S;
+      builder.quick_push (build_int_cst (inner_type, val));
+    }
+
+  return builder.build ();
+}
+
+static void
+validate_res (unsigned npatterns, unsigned nelts_per_pattern,
+	      tree res, tree *expected_res)
+{
+  ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == npatterns);
+  ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == nelts_per_pattern);
+
+  for (unsigned i = 0; i < vector_cst_encoded_nelts (res); i++)
+    ASSERT_TRUE (operand_equal_p (VECTOR_CST_ELT (res, i), expected_res[i], 0));
+}
+
+static void
+validate_res_vls (tree res, tree *expected_res, unsigned expected_nelts)
+{
+  ASSERT_TRUE (known_eq (VECTOR_CST_NELTS (res), expected_nelts));
+  for (unsigned i = 0; i < expected_nelts; i++)
+    ASSERT_TRUE (operand_equal_p (VECTOR_CST_ELT (res, i), expected_res[i], 0));
+}
+
+/* Verify VLA vec_perm folding.  */
+
+static void
+test_stepped ()
+{
+  /* Case 1: sel = {0, 1, 2, ...}
+     npatterns = 1, nelts_per_pattern = 3
+     expected res: { arg0[0], arg0[1], arg0[2], ... } */
+  {
+    tree arg0 = build_vec_cst_rand (integer_type_node, 1, 3, 2);
+    tree arg1 = build_vec_cst_rand (integer_type_node, 1, 3, 2);
+    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (arg0_len, 1, 3);
+    builder.quick_push (0);
+    builder.quick_push (1);
+    builder.quick_push (2);
+
+    vec_perm_indices sel (builder, 2, arg0_len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg0, 1),
+			    vector_cst_elt (arg0, 2) };
+    validate_res (1, 3, res, expected_res);
+  }
+
+  /* Case 2: sel = {len, len + 1, len + 2, ... }
+     npatterns = 1, nelts_per_pattern = 3
+     FIXME: This should return
+     expected res: { op1[0], op1[1], op1[2], ... }
+     however it returns NULL_TREE.  */
+  {
+    tree arg0 = build_vec_cst_rand (integer_type_node, 1, 3, 2);
+    tree arg1 = build_vec_cst_rand (integer_type_node, 1, 3, 2);
+    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+    
+    vec_perm_builder builder (arg0_len, 1, 3);
+    builder.quick_push (arg0_len);
+    builder.quick_push (arg0_len + 1);
+    builder.quick_push (arg0_len + 2);
+
+    vec_perm_indices sel (builder, 2, arg0_len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel, NULL, true);
+    tree expected_res[] = { vector_cst_elt (arg1, 0), vector_cst_elt (arg1, 1),
+			    vector_cst_elt (arg1, 2) };
+    validate_res (1, 3, res, expected_res);
+  }
+
+  /* Case 3: Leading element of arg1, stepped sequence: pattern 0 of arg0.
+     sel = {len, 0, 0, 0, 2, 0, ...}
+     npatterns = 2, nelts_per_pattern = 3.
+     Use extra pattern {0, ...} to lower number of elements per pattern.  */
+  {
+    tree arg0 = build_vec_cst_rand (char_type_node, 1, 3, 2);
+    tree arg1 = build_vec_cst_rand (char_type_node, 1, 3, 2);
+    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (arg0_len, 2, 3);
+    builder.quick_push (arg0_len);
+    int mask_elems[] = { 0, 0, 0, 2, 0 };
+    for (int i = 0; i < 5; i++)
+      builder.quick_push (mask_elems[i]);
+
+    vec_perm_indices sel (builder, 2, arg0_len);
+    const char *reason; 
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel, &reason);
+
+    tree expected_res[] = { vector_cst_elt (arg1, 0), vector_cst_elt (arg0, 0),
+			    vector_cst_elt (arg0, 0), vector_cst_elt (arg0, 0),
+			    vector_cst_elt (arg0, 2), vector_cst_elt (arg0, 0)
+			  };
+    validate_res (2, 3, res, expected_res);
+  }
+
+  /* Case 4:
+     sel = { len, 0, 2, ... } npatterns = 1, nelts_per_pattern = 3.
+     This should return NULL because we cross the input vectors.
+     Because,
+     arg0_len = 16 + 16x
+     a1 = 0
+     S = 2
+     esel = arg0_len / npatterns_sel = 16+16x/1 = 16 + 16x
+     ae = 0 + (esel - 2) * S
+	= 0 + (16 + 16x - 2) * 2
+	= 28 + 32x
+     a1 / arg0_len = 0 /trunc (16 + 16x) = 0
+     ae / arg0_len = (28 + 32x) /trunc (16 + 16x), which is not defined,
+     since 28/16 != 32/16.
+     So return NULL_TREE.  */
+  {
+    tree arg0 = build_vec_cst_rand (char_type_node, 1, 3, 2);
+    tree arg1 = build_vec_cst_rand (char_type_node, 1, 3, 2);
+    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (arg0_len, 1, 3);
+    builder.quick_push (arg0_len);
+    builder.quick_push (0);
+    builder.quick_push (2);
+
+    vec_perm_indices sel (builder, 2, arg0_len);
+    const char *reason;
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel, &reason, false);
+    gcc_assert (res == NULL_TREE);
+    gcc_assert (!strcmp (reason, "crossed input vectors"));
+  }
+
+  /* Case 5: Select elements from different patterns.
+     Should return NULL.  */
+  {
+    tree op0 = build_vec_cst_rand (char_type_node, 2, 3, 2);
+    tree op1 = build_vec_cst_rand (char_type_node, 2, 3, 2);
+    poly_uint64 op0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (op0));
+
+    vec_perm_builder builder (op0_len, 2, 3);
+    builder.quick_push (op0_len);
+    int mask_elems[] = { 0, 0, 0, 1, 0 };
+    for (int i = 0; i < 5; i++)
+      builder.quick_push (mask_elems[i]);
+
+    vec_perm_indices sel (builder, 2, op0_len);
+    const char *reason;
+    tree res = fold_vec_perm_cst (TREE_TYPE (op0), op0, op1, sel, &reason, false);
+    gcc_assert (res == NULL_TREE);
+    gcc_assert (!strcmp (reason, "S is not multiple of npatterns"));
+  }
+
+  /* Case 6: Select pattern 0 of op0 and dup of op0[0]
+     op0, op1, sel: npatterns = 2, nelts_per_pattern = 3
+     sel = { 0, 0, 2, 0, 4, 0, ... }.
+
+     For pattern {0, 2, 4, ...}:
+     a1 = 2
+     len = 16 + 16x
+     S = 2
+     esel = len / npatterns_sel = (16 + 16x) / 2 = (8 + 8x)
+     ae = a1 + (esel - 2) * S
+	= 2 + (8 + 8x - 2) * 2
+	= 14 + 16x
+     a1 / arg0_len = 2 / (16 + 16x) = 0
+     ae / arg0_len = (14 + 16x) / (16 + 16x) = 0
+     So a1/arg0_len = ae/arg0_len = 0
+     Hence we select from first vector op0
+     S = 2, npatterns = 2.
+     Since S is multiple of npatterns(op0), we are selecting from
+     same pattern of op0.
+
+     For pattern {0, ...}, we are choosing { op0[0] ... }
+     So res will be combination of above patterns:
+     res: { op0[0], op0[0], op0[2], op0[0], op0[4], op0[0], ... }
+     with npatterns = 2, nelts_per_pattern = 3.  */
+  {
+    tree op0 = build_vec_cst_rand (char_type_node, 2, 3, 2);
+    tree op1 = build_vec_cst_rand (char_type_node, 2, 3, 2);
+    poly_uint64 op0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (op0));
+
+    vec_perm_builder builder (op0_len, 2, 3);
+    int mask_elems[] = { 0, 0, 2, 0, 4, 0 };
+    for (int i = 0; i < 6; i++)
+      builder.quick_push (mask_elems[i]);
+
+    vec_perm_indices sel (builder, 2, op0_len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (op0), op0, op1, sel);
+    tree expected_res[] = { vector_cst_elt (op0, 0), vector_cst_elt (op0, 0),
+			    vector_cst_elt (op0, 2), vector_cst_elt (op0, 0),
+			    vector_cst_elt (op0, 4), vector_cst_elt (op0, 0) };
+    validate_res (2, 3, res, expected_res);
+  }
+
+  /* Case 7: sel_npatterns > op_npatterns;
+     op0, op1: npatterns = 2, nelts_per_pattern = 3
+     sel: { 0, 0, 1, len, 2, 0, 3, len, 4, 0, 5, len, ...},
+     with npatterns = 4, nelts_per_pattern = 3.  */
+  {
+    tree op0 = build_vec_cst_rand (char_type_node, 2, 3, 2);
+    tree op1 = build_vec_cst_rand (char_type_node, 2, 3, 2);
+    poly_uint64 op0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (op0));
+
+    vec_perm_builder builder(op0_len, 4, 3);
+    // -1 is used as place holder for poly_int_cst
+    int mask_elems[] = { 0, 0, 1, -1, 2, 0, 3, -1, 4, 0, 5, -1 };
+    for (int i = 0; i < 12; i++)
+      builder.quick_push ((mask_elems[i] == -1) ? op0_len : mask_elems[i]);
+
+    vec_perm_indices sel (builder, 2, op0_len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (op0), op0, op1, sel);
+    tree expected_res[] = { vector_cst_elt (op0, 0), vector_cst_elt (op0, 0),
+			    vector_cst_elt (op0, 1), vector_cst_elt (op1, 0),
+			    vector_cst_elt (op0, 2), vector_cst_elt (op0, 0),
+			    vector_cst_elt (op0, 3), vector_cst_elt (op1, 0),
+			    vector_cst_elt (op0, 4), vector_cst_elt (op0, 0),
+			    vector_cst_elt (op0, 5), vector_cst_elt (op1, 0) };
+    validate_res (4, 3, res, expected_res);
+  }
+}
+
+static void
+test_dup ()
+{
+  /* Case 1: mask = {0, ...} */
+  {
+    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (len, 1, 1);
+    builder.quick_push (0);
+    vec_perm_indices sel (builder, 2, len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+    tree expected_res[] = { vector_cst_elt (res, 0) };
+    validate_res (1, 1, res, expected_res);
+  }
+
+  /* Case 2: mask = {len, ...} */
+  {
+    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (len, 1, 1);
+    builder.quick_push (len);
+    vec_perm_indices sel (builder, 2, len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+    tree expected_res[] = { vector_cst_elt (arg1, 0) };
+    validate_res (1, 1, res, expected_res);
+  }
+
+  /* Case 3: mask = { 0, len, ... } */
+  {
+    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (len, 2, 1);
+    builder.quick_push (0);
+    builder.quick_push (len);
+    vec_perm_indices sel (builder, 2, len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0) };
+    validate_res (2, 1, res, expected_res);
+  }
+
+  /* Case 4: mask = { 0, len, 1, len+1, ... } */
+  {
+    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (len, 2, 2);
+    builder.quick_push (0);
+    builder.quick_push (len);
+    builder.quick_push (1);
+    builder.quick_push (len + 1);
+    vec_perm_indices sel (builder, 2, len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0),
+			    vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
+			  };
+    validate_res (2, 2, res, expected_res);
+  }
+
+  /* Case 5: mask = { 0, len, 1, len+1, .... }
+     npatterns = 4, nelts_per_pattern = 1 */
+  {
+    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (len, 4, 1);
+    builder.quick_push (0);
+    builder.quick_push (len);
+    builder.quick_push (1);
+    builder.quick_push (len + 1);
+    vec_perm_indices sel (builder, 2, len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0),
+			    vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
+			  };
+    validate_res (4, 1, res, expected_res);
+  }
+
+  /* Case 6: mask = {0, 4, ...}
+     npatterns = 1, nelts_per_pattern = 2.
+     This should return NULL_TREE because the index 4 may choose
+     from either arg0 or arg1 depending on vector length.  */
+  {
+    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (len, 1, 2);
+    builder.quick_push (0);
+    builder.quick_push (4);
+    vec_perm_indices sel (builder, 2, len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+    ASSERT_TRUE (res == NULL_TREE);
+  }
+
+  /* Case 7: npatterns(arg0) = 4 > npatterns(sel) = 2
+     mask = {0, len, 1, len + 1, ...}
+     sel_npatterns = 2, sel_nelts_per_pattern = 2.  */
+  {
+    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (arg0_len, 2, 2);
+    builder.quick_push (0);
+    builder.quick_push (arg0_len);
+    builder.quick_push (1);
+    builder.quick_push (arg0_len + 1);
+    vec_perm_indices sel (builder, 2, arg0_len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0),
+			    vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
+			  };
+    validate_res (2, 2, res, expected_res);
+  }
+}
+
+static void
+test_mixed ()
+{
+  /* Case 1: op0, op1 -> VLS, sel -> VLA and selects from both input vectors.
+     In this case, we treat res_npatterns = nelts in input vector
+     and res_nelts_per_pattern = 1, and create a dup pattern.
+     sel = { 0, 4, 1, 5, ... }
+     res = { op0[0], op1[0], op0[1], op1[1], ...} // (4, 1)
+     res_npatterns = 4, res_nelts_per_pattern = 1.  */
+  {
+    tree arg_vectype = build_vector_type (integer_type_node, 4);
+    tree arg0 = build_vec_cst_rand (integer_type_node, 4, 1, 0, arg_vectype);
+    tree arg1 = build_vec_cst_rand (integer_type_node, 4, 1, 0, arg_vectype);
+
+    tree res_type = get_preferred_vectype (integer_type_node);
+    poly_uint64 res_len = TYPE_VECTOR_SUBPARTS (res_type);
+    vec_perm_builder builder (res_len, 4, 1);
+    builder.quick_push (0);
+    builder.quick_push (4);
+    builder.quick_push (1);
+    builder.quick_push (5);
+
+    vec_perm_indices sel (builder, 2, res_len);
+    tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel);
+    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0),
+			    vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
+			  };
+    validate_res (4, 1, res, expected_res);
+  }
+
+  /* Case 2: Same as Case 1, but sel contains an out of bounds index.
+     result should be NULL_TREE.  */
+  {
+    tree arg_vectype = build_vector_type (integer_type_node, 4);
+    tree arg0 = build_vec_cst_rand (integer_type_node, 4, 1, 0, arg_vectype);
+    tree arg1 = build_vec_cst_rand (integer_type_node, 4, 1, 0, arg_vectype);
+
+    tree res_type = get_preferred_vectype (integer_type_node);
+    poly_uint64 res_len = TYPE_VECTOR_SUBPARTS (res_type);
+    vec_perm_builder builder (res_len, 4, 1);
+    builder.quick_push (0);
+    builder.quick_push (8);
+    builder.quick_push (1);
+    builder.quick_push (5);
+
+    vec_perm_indices sel (builder, 2, res_len);
+    const char *reason; 
+    tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel, &reason);
+    gcc_assert (res == NULL_TREE);
+    gcc_assert (!strcmp (reason, "out of bounds access"));
+  }
+
+  /* Case 3: op0, op1 are VLA and sel is VLS.
+     op0, op1: VNx16QI with shape (2, 3)
+     sel = V4SI with values {0, 2, 4, 6}
+     res: V4SI with values { op0[0], op0[2], op0[4], op0[6] }.  */
+  {
+    tree arg0 = build_vec_cst_rand (char_type_node, 2, 3, 2);
+    tree arg1 = build_vec_cst_rand (char_type_node, 2, 3, 2);
+
+    poly_uint64 res_len = 4;
+    tree res_type = build_vector_type (char_type_node, res_len);
+    vec_perm_builder builder (res_len, 4, 1);
+    builder.quick_push (0);
+    builder.quick_push (2);
+    builder.quick_push (4);
+    builder.quick_push (6);
+
+    vec_perm_indices sel (builder, 2, res_len);
+    tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel);
+
+    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg0, 2),
+			    vector_cst_elt (arg0, 4), vector_cst_elt (arg0, 6)
+			  };
+    validate_res_vls (res, expected_res, 4);
+  }
+
+  /* Case 4: Same as case 4, but op0, op1 are VNx4SI with shape (2, 3) and step = 2
+     sel = V4SI with values {0, 2, 4, 6}
+     In this case result should be NULL_TREE because we cross input vector
+     boundary at index 4.  */
+  {
+    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 2);
+    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 2);
+
+    poly_uint64 res_len = 4;
+    tree res_type = build_vector_type (char_type_node, res_len);
+    vec_perm_builder builder (res_len, 4, 1);
+    builder.quick_push (0);
+    builder.quick_push (2);
+    builder.quick_push (4);
+    builder.quick_push (6);
+
+    vec_perm_indices sel (builder, 2, res_len);
+    const char *reason;
+    tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel, &reason);
+    gcc_assert (res == NULL_TREE);
+    gcc_assert (!strcmp (reason, "cannot divide selector element by arg len"));
+  }
+}
+
+static void
+test ()
+{
+  tree vectype = get_preferred_vectype (integer_type_node);
+  if (TYPE_VECTOR_SUBPARTS (vectype).is_constant ())
+    return;
+
+  test_dup ();
+  test_stepped ();
+  test_mixed ();
+}
+};
+
 /* Verify that various binary operations on vectors are folded
    correctly.  */
 
@@ -16943,6 +17626,7 @@ fold_const_cc_tests ()
   test_arithmetic_folding ();
   test_vector_folding ();
   test_vec_duplicate_folding ();
+  test_fold_vec_perm_cst::test ();
 }
 
 } // namespace selftest

Richard Sandiford Aug. 4, 2023, 3:06 p.m. UTC | #7

Full review this time, sorry for the skipping the tests earlier.

Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> index 7e5494dfd39..680d0e54fd4 100644
> --- a/gcc/fold-const.cc
> +++ b/gcc/fold-const.cc
> @@ -85,6 +85,10 @@ along with GCC; see the file COPYING3.  If not see
>  #include "vec-perm-indices.h"
>  #include "asan.h"
>  #include "gimple-range.h"
> +#include <algorithm>

This should be included by defining INCLUDE_ALGORITHM instead.

> +#include "tree-pretty-print.h"
> +#include "gimple-pretty-print.h"
> +#include "print-tree.h"

Are these still needed, or were they for debugging?

>  
>  /* Nonzero if we are folding constants inside an initializer or a C++
>     manifestly-constant-evaluated context; zero otherwise.
> @@ -10494,15 +10498,9 @@ fold_mult_zconjz (location_t loc, tree type, tree expr)
>  static bool
>  vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
>  {
> -  unsigned HOST_WIDE_INT i, nunits;
> +  unsigned HOST_WIDE_INT i;
>  
> -  if (TREE_CODE (arg) == VECTOR_CST
> -      && VECTOR_CST_NELTS (arg).is_constant (&nunits))
> -    {
> -      for (i = 0; i < nunits; ++i)
> -	elts[i] = VECTOR_CST_ELT (arg, i);
> -    }
> -  else if (TREE_CODE (arg) == CONSTRUCTOR)
> +  if (TREE_CODE (arg) == CONSTRUCTOR)
>      {
>        constructor_elt *elt;
>  
> @@ -10520,6 +10518,192 @@ vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
>    return true;
>  }
>  
> +/* Helper routine for fold_vec_perm_cst to check if SEL is a suitable
> +   mask for VLA vec_perm folding.
> +   REASON if specified, will contain the reason why SEL is not suitable.
> +   Used only for debugging and unit-testing.
> +   VERBOSE if enabled is used for debugging output.  */
> +
> +static bool
> +valid_mask_for_fold_vec_perm_cst_p (tree arg0, tree arg1,
> +				    const vec_perm_indices &sel,
> +				    const char **reason = NULL,
> +				    ATTRIBUTE_UNUSED bool verbose = false)

Since verbose is no longer needed (good!), I think we should just remove it.

> +{
> +  unsigned sel_npatterns = sel.encoding ().npatterns ();
> +  unsigned sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
> +
> +  if (!(pow2p_hwi (sel_npatterns)
> +	&& pow2p_hwi (VECTOR_CST_NPATTERNS (arg0))
> +	&& pow2p_hwi (VECTOR_CST_NPATTERNS (arg1))))
> +    {
> +      if (reason)
> +	*reason = "npatterns is not power of 2";
> +      return false;
> +    }
> +
> +  /* We want to avoid cases where sel.length is not a multiple of npatterns.
> +     For eg: sel.length = 2 + 2x, and sel npatterns = 4.  */
> +  poly_uint64 esel;
> +  if (!multiple_p (sel.length (), sel_npatterns, &esel))
> +    {
> +      if (reason)
> +	*reason = "sel.length is not multiple of sel_npatterns";
> +      return false;
> +    }
> +
> +  if (sel_nelts_per_pattern < 3)
> +    return true;
> +
> +  for (unsigned pattern = 0; pattern < sel_npatterns; pattern++)
> +    {
> +      poly_uint64 a1 = sel[pattern + sel_npatterns];
> +      poly_uint64 a2 = sel[pattern + 2 * sel_npatterns];
> +      HOST_WIDE_INT S; 

Trailing whitespace.  The convention is to use lowercase variable
names, so please call this "step".

> +      if (!poly_int64 (a2 - a1).is_constant (&S))
> +	{
> +	  if (reason)
> +	    *reason = "step is not constant";
> +	  return false;
> +	}
> +      // FIXME: Punt on S < 0 for now, revisit later.
> +      if (S < 0)
> +	return false;
> +      if (S == 0)
> +	continue;
> +
> +      if (!pow2p_hwi (S))
> +	{
> +	  if (reason)
> +	    *reason = "step is not power of 2";
> +	  return false;
> +	}
> +
> +      /* Ensure that stepped sequence of the pattern selects elements
> +	 only from the same input vector if it's VLA.  */

s/ if it's VLA//

> +      uint64_t q1, qe;
> +      poly_uint64 r1, re;
> +      poly_uint64 ae = a1 + (esel - 2) * S;
> +      poly_uint64 arg_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +      if (!(can_div_trunc_p (a1, arg_len, &q1, &r1)
> +	    && can_div_trunc_p (ae, arg_len, &qe, &re)
> +	    && q1 == qe))
> +	{
> +	  if (reason)
> +	    *reason = "crossed input vectors";
> +	  return false;
> +	}
> +

Probably worth a comment above the following code too:

  /* Ensure that the stepped sequence always selects from the same
     input pattern.  */

> +      unsigned arg_npatterns
> +	= ((q1 & 0) == 0) ? VECTOR_CST_NPATTERNS (arg0)
> +			  : VECTOR_CST_NPATTERNS (arg1);
> +
> +      if (!multiple_p (S, arg_npatterns))
> +	{
> +	  if (reason)
> +	    *reason = "S is not multiple of npatterns";
> +	  return false;
> +	}
> +    }
> +
> +  return true;
> +}
> +
> +/* Try to fold permutation of ARG0 and ARG1 with SEL selector when
> +   the input vectors are VECTOR_CST. Return NULL_TREE otherwise.
> +   REASON and VERBOSE have same purpose as described in
> +   valid_mask_for_fold_vec_perm_cst_p.
> +
> +   (1) If SEL is a suitable mask as determined by
> +       valid_mask_for_fold_vec_perm_cst_p, then:
> +       res_npatterns = max of npatterns between ARG0, ARG1, and SEL
> +       res_nelts_per_pattern = max of nelts_per_pattern between
> +			       ARG0, ARG1 and SEL.
> +   (2) If SEL is not a suitable mask, and ARG0, ARG1 are VLS,
> +       then:
> +       res_npatterns = nelts in input vector.

s/input vector/result vector/

> +       res_nelts_per_pattern = 1.
> +       This exception is made so that VLS ARG0, ARG1 and SEL work as before.  */

Guess this is personal preference, but (1) and (2) seem more like
implementation details, so I think they belong...

> +
> +static tree
> +fold_vec_perm_cst (tree type, tree arg0, tree arg1, const vec_perm_indices &sel,
> +		   const char **reason = NULL, bool verbose = false)
> +{
> +  unsigned res_npatterns, res_nelts_per_pattern;
> +  unsigned HOST_WIDE_INT res_nelts;
> +

...here instead.

> +  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason, verbose))
> +    {
> +      res_npatterns
> +	= std::max (VECTOR_CST_NPATTERNS (arg0),
> +		    std::max (VECTOR_CST_NPATTERNS (arg1),
> +			      sel.encoding ().npatterns ()));
> +
> +      res_nelts_per_pattern
> +	= std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
> +		    std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1),
> +			      sel.encoding ().nelts_per_pattern ()));
> +
> +      res_nelts = res_npatterns * res_nelts_per_pattern;
> +    }
> +  else if (TYPE_VECTOR_SUBPARTS (type).is_constant (&res_nelts))
> +    {
> +      res_npatterns = res_nelts;
> +      res_nelts_per_pattern = 1;
> +    }
> +  else
> +    return NULL_TREE;
> +
> +  tree_vector_builder out_elts (type, res_npatterns, res_nelts_per_pattern);
> +  for (unsigned i = 0; i < res_nelts; i++)
> +    {
> +      poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +      uint64_t q;
> +      poly_uint64 r;
> +      unsigned HOST_WIDE_INT index;
> +
> +      unsigned HOST_WIDE_INT arg_nelts;
> +      if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant (&arg_nelts)
> +	  && known_ge (sel[i], poly_int64 (2 * arg_nelts)))
> +	{
> +	  if (reason)
> +	    *reason = "out of bounds access";
> +	  return NULL_TREE;
> +	}

I don't think this is needed.  The selector indices wrap, and the code
below should handle the wrapping correctly.

> +
> +      /* Punt if sel[i] /trunc_div len cannot be determined,
> +	 because the input vector to be chosen will depend on
> +	 runtime vector length.
> +	 For example if len == 4 + 4x, and sel[i] == 4,
> +	 If len at runtime equals 4, we choose arg1[0].
> +	 For any other value of len > 4 at runtime, we choose arg0[4].
> +	 which makes the element choice dependent on runtime vector length.  */
> +      if (!can_div_trunc_p (sel[i], len, &q, &r))
> +	{
> +	  if (reason)
> +	    *reason = "cannot divide selector element by arg len";
> +	  return NULL_TREE;
> +	}
> +
> +      /* sel[i] % len will give the index of element in the chosen input
> +	 vector. For example if sel[i] == 5 + 4x and len == 4 + 4x,
> +	 we will choose arg1[1] since (5 + 4x) % (4 + 4x) == 1.  */
> +      if (!r.is_constant (&index))
> +	{
> +	  if (reason)
> +	    *reason = "remainder is not constant";
> +	  return NULL_TREE;
> +	}
> +
> +      tree arg = ((q & 1) == 0) ? arg0 : arg1;
> +      tree elem = vector_cst_elt (arg, index);
> +      out_elts.quick_push (elem);
> +    }
> +
> +  return out_elts.build ();
> +}
> +
>  /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL
>     selector.  Return the folded VECTOR_CST or CONSTRUCTOR if successful,
>     NULL_TREE otherwise.  */
> @@ -10529,43 +10713,40 @@ fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
>  {
>    unsigned int i;
>    unsigned HOST_WIDE_INT nelts;
> -  bool need_ctor = false;
>  
> -  if (!sel.length ().is_constant (&nelts))
> -    return NULL_TREE;
> -  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts)
> -	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts)
> -	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts));
> +  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), sel.length ())
> +	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)),
> +			   TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1))));
> +
>    if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type)
>        || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type))
>      return NULL_TREE;
>  
> +  if (TREE_CODE (arg0) == VECTOR_CST
> +      && TREE_CODE (arg1) == VECTOR_CST)
> +    return fold_vec_perm_cst (type, arg0, arg1, sel);
> +
> +  /* For fall back case, we want to ensure we have VLS vectors
> +     with equal length.  */
> +  if (!sel.length ().is_constant (&nelts))
> +    return NULL_TREE;
> +
> +  gcc_assert (known_eq (sel.length (), TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0))));

Nit: long line.

>    tree *in_elts = XALLOCAVEC (tree, nelts * 2);
>    if (!vec_cst_ctor_to_array (arg0, nelts, in_elts)
>        || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts))
>      return NULL_TREE;
>  
> -  tree_vector_builder out_elts (type, nelts, 1);
> +  vec<constructor_elt, va_gc> *v;
> +  vec_alloc (v, nelts);
>    for (i = 0; i < nelts; i++)
>      {
>        HOST_WIDE_INT index;
>        if (!sel[i].is_constant (&index))
>  	return NULL_TREE;
> -      if (!CONSTANT_CLASS_P (in_elts[index]))
> -	need_ctor = true;
> -      out_elts.quick_push (unshare_expr (in_elts[index]));
> -    }
> -
> -  if (need_ctor)
> -    {
> -      vec<constructor_elt, va_gc> *v;
> -      vec_alloc (v, nelts);
> -      for (i = 0; i < nelts; i++)
> -	CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, out_elts[i]);
> -      return build_constructor (type, v);
> +      CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, in_elts[index]);
>      }
> -  else
> -    return out_elts.build ();
> +  return build_constructor (type, v);
>  }
>  
>  /* Try to fold a pointer difference of type TYPE two address expressions of
> @@ -16892,6 +17073,508 @@ test_arithmetic_folding ()
>  				   x);
>  }
>  
> +namespace test_fold_vec_perm_cst {
> +
> +static tree
> +get_preferred_vectype (tree inner_type)
> +{
> +  scalar_int_mode int_mode = SCALAR_INT_TYPE_MODE (inner_type);
> +  machine_mode vmode = targetm.vectorize.preferred_simd_mode (int_mode);
> +  poly_uint64 nunits = GET_MODE_NUNITS (vmode);
> +  return build_vector_type (inner_type, nunits);
> +}
> +
> +static tree
> +build_vec_cst_rand (tree inner_type, unsigned npatterns,
> +		    unsigned nelts_per_pattern, int S = 0,

Similar comment about lowercase variable names here.

> +		    tree vectype = NULL_TREE)
> +{
> +  if (!vectype)
> +    vectype = get_preferred_vectype (inner_type);

I'm not sure how portable this is.  It looks like the tests rely on
the integer_type_node vectors being 4 + 4x, but that isn't necessarily
true on all VLA targets.

Perhaps instead the tests could be classified based on the vector
lengths that they assume.  Then we can iterate through the vector
modes and call the appropriate function based on GET_MODE_NUNITS
and GET_MODE_INNER.

> +  tree_vector_builder builder (vectype, npatterns, nelts_per_pattern);
> +
> +  // Fill a0 for each pattern
> +  for (unsigned i = 0; i < npatterns; i++)
> +    builder.quick_push (build_int_cst (inner_type, rand () % 100));
> +
> +  if (nelts_per_pattern == 1)
> +    return builder.build ();
> +
> +  // Fill a1 for each pattern
> +  for (unsigned i = 0; i < npatterns; i++)
> +    builder.quick_push (build_int_cst (inner_type, rand () % 100));
> +
> +  if (nelts_per_pattern == 2)
> +    return builder.build ();
> +
> +  for (unsigned i = npatterns * 2; i < npatterns * nelts_per_pattern; i++)
> +    {
> +      tree prev_elem = builder[i - npatterns];
> +      int prev_elem_val = TREE_INT_CST_LOW (prev_elem);
> +      int val = prev_elem_val + S;
> +      builder.quick_push (build_int_cst (inner_type, val));
> +    }
> +
> +  return builder.build ();
> +}
> +
> +static void
> +validate_res (unsigned npatterns, unsigned nelts_per_pattern,
> +	      tree res, tree *expected_res)
> +{
> +  ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == npatterns);
> +  ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == nelts_per_pattern);

I don't think this is safe when the inputs are randomised.  E.g. we
could by chance end up with a vector of all zeros, which would have
a single pattern and a single element per pattern, regardless of the
shapes of the inputs.

Given the way that vector_builder<T, Shape, Derived>::finalize
canonicalises the encoding, it should be safe to use:

* VECTOR_CST_NPATTERNS (res) <= npatterns
* vector_cst_encoded_nelts (res) <= npatterns * nelts_per_pattern

If we do that then...

> +
> +  for (unsigned i = 0; i < vector_cst_encoded_nelts (res); i++)

...this loop bound should be npatterns * nelts_per_pattern instead.

> +    ASSERT_TRUE (operand_equal_p (VECTOR_CST_ELT (res, i), expected_res[i], 0));
> +}
> +
> +static void
> +validate_res_vls (tree res, tree *expected_res, unsigned expected_nelts)
> +{
> +  ASSERT_TRUE (known_eq (VECTOR_CST_NELTS (res), expected_nelts));
> +  for (unsigned i = 0; i < expected_nelts; i++)
> +    ASSERT_TRUE (operand_equal_p (VECTOR_CST_ELT (res, i), expected_res[i], 0));
> +}
> +
> +/* Verify VLA vec_perm folding.  */
> +
> +static void
> +test_stepped ()
> +{
> +  /* Case 1: sel = {0, 1, 2, ...}
> +     npatterns = 1, nelts_per_pattern = 3
> +     expected res: { arg0[0], arg0[1], arg0[2], ... } */
> +  {
> +    tree arg0 = build_vec_cst_rand (integer_type_node, 1, 3, 2);
> +    tree arg1 = build_vec_cst_rand (integer_type_node, 1, 3, 2);
> +    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    vec_perm_builder builder (arg0_len, 1, 3);
> +    builder.quick_push (0);
> +    builder.quick_push (1);
> +    builder.quick_push (2);
> +
> +    vec_perm_indices sel (builder, 2, arg0_len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg0, 1),
> +			    vector_cst_elt (arg0, 2) };
> +    validate_res (1, 3, res, expected_res);
> +  }
> +
> +  /* Case 2: sel = {len, len + 1, len + 2, ... }
> +     npatterns = 1, nelts_per_pattern = 3
> +     FIXME: This should return
> +     expected res: { op1[0], op1[1], op1[2], ... }
> +     however it returns NULL_TREE.  */

Looks like the comment is out of date.

> +  {
> +    tree arg0 = build_vec_cst_rand (integer_type_node, 1, 3, 2);
> +    tree arg1 = build_vec_cst_rand (integer_type_node, 1, 3, 2);
> +    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +    
> +    vec_perm_builder builder (arg0_len, 1, 3);
> +    builder.quick_push (arg0_len);
> +    builder.quick_push (arg0_len + 1);
> +    builder.quick_push (arg0_len + 2);
> +
> +    vec_perm_indices sel (builder, 2, arg0_len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel, NULL, true);
> +    tree expected_res[] = { vector_cst_elt (arg1, 0), vector_cst_elt (arg1, 1),
> +			    vector_cst_elt (arg1, 2) };
> +    validate_res (1, 3, res, expected_res);
> +  }
> +
> +  /* Case 3: Leading element of arg1, stepped sequence: pattern 0 of arg0.
> +     sel = {len, 0, 0, 0, 2, 0, ...}
> +     npatterns = 2, nelts_per_pattern = 3.
> +     Use extra pattern {0, ...} to lower number of elements per pattern.  */
> +  {
> +    tree arg0 = build_vec_cst_rand (char_type_node, 1, 3, 2);
> +    tree arg1 = build_vec_cst_rand (char_type_node, 1, 3, 2);
> +    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    vec_perm_builder builder (arg0_len, 2, 3);
> +    builder.quick_push (arg0_len);
> +    int mask_elems[] = { 0, 0, 0, 2, 0 };
> +    for (int i = 0; i < 5; i++)
> +      builder.quick_push (mask_elems[i]);

This leaves one of the elements unspecified.

> +
> +    vec_perm_indices sel (builder, 2, arg0_len);
> +    const char *reason; 
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel, &reason);
> +
> +    tree expected_res[] = { vector_cst_elt (arg1, 0), vector_cst_elt (arg0, 0),
> +			    vector_cst_elt (arg0, 0), vector_cst_elt (arg0, 0),
> +			    vector_cst_elt (arg0, 2), vector_cst_elt (arg0, 0)
> +			  };
> +    validate_res (2, 3, res, expected_res);
> +  }
> +
> +  /* Case 4:
> +     sel = { len, 0, 2, ... } npatterns = 1, nelts_per_pattern = 3.
> +     This should return NULL because we cross the input vectors.
> +     Because,
> +     arg0_len = 16 + 16x
> +     a1 = 0
> +     S = 2
> +     esel = arg0_len / npatterns_sel = 16+16x/1 = 16 + 16x
> +     ae = 0 + (esel - 2) * S
> +	= 0 + (16 + 16x - 2) * 2
> +	= 28 + 32x
> +     a1 / arg0_len = 0 /trunc (16 + 16x) = 0
> +     ae / arg0_len = (28 + 32x) /trunc (16 + 16x), which is not defined,
> +     since 28/16 != 32/16.
> +     So return NULL_TREE.  */

The division should succeed now, so as the test says, the reason should
instead be that ae is in the second input.

> +  {
> +    tree arg0 = build_vec_cst_rand (char_type_node, 1, 3, 2);
> +    tree arg1 = build_vec_cst_rand (char_type_node, 1, 3, 2);
> +    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    vec_perm_builder builder (arg0_len, 1, 3);
> +    builder.quick_push (arg0_len);
> +    builder.quick_push (0);
> +    builder.quick_push (2);
> +
> +    vec_perm_indices sel (builder, 2, arg0_len);
> +    const char *reason;
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel, &reason, false);
> +    gcc_assert (res == NULL_TREE);
> +    gcc_assert (!strcmp (reason, "crossed input vectors"));

The tests should use ASSERT_* macros rather than gcc_assert.

> +  }
> +
> +  /* Case 5: Select elements from different patterns.
> +     Should return NULL.  */
> +  {
> +    tree op0 = build_vec_cst_rand (char_type_node, 2, 3, 2);
> +    tree op1 = build_vec_cst_rand (char_type_node, 2, 3, 2);
> +    poly_uint64 op0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (op0));
> +
> +    vec_perm_builder builder (op0_len, 2, 3);
> +    builder.quick_push (op0_len);
> +    int mask_elems[] = { 0, 0, 0, 1, 0 };
> +    for (int i = 0; i < 5; i++)
> +      builder.quick_push (mask_elems[i]);

Should be 6 elements here too.

> +
> +    vec_perm_indices sel (builder, 2, op0_len);
> +    const char *reason;
> +    tree res = fold_vec_perm_cst (TREE_TYPE (op0), op0, op1, sel, &reason, false);
> +    gcc_assert (res == NULL_TREE);
> +    gcc_assert (!strcmp (reason, "S is not multiple of npatterns"));
> +  }
> +
> +  /* Case 6: Select pattern 0 of op0 and dup of op0[0]
> +     op0, op1, sel: npatterns = 2, nelts_per_pattern = 3
> +     sel = { 0, 0, 2, 0, 4, 0, ... }.
> +
> +     For pattern {0, 2, 4, ...}:
> +     a1 = 2
> +     len = 16 + 16x
> +     S = 2
> +     esel = len / npatterns_sel = (16 + 16x) / 2 = (8 + 8x)
> +     ae = a1 + (esel - 2) * S
> +	= 2 + (8 + 8x - 2) * 2
> +	= 14 + 16x
> +     a1 / arg0_len = 2 / (16 + 16x) = 0
> +     ae / arg0_len = (14 + 16x) / (16 + 16x) = 0
> +     So a1/arg0_len = ae/arg0_len = 0
> +     Hence we select from first vector op0
> +     S = 2, npatterns = 2.
> +     Since S is multiple of npatterns(op0), we are selecting from
> +     same pattern of op0.
> +
> +     For pattern {0, ...}, we are choosing { op0[0] ... }
> +     So res will be combination of above patterns:
> +     res: { op0[0], op0[0], op0[2], op0[0], op0[4], op0[0], ... }
> +     with npatterns = 2, nelts_per_pattern = 3.  */
> +  {
> +    tree op0 = build_vec_cst_rand (char_type_node, 2, 3, 2);
> +    tree op1 = build_vec_cst_rand (char_type_node, 2, 3, 2);
> +    poly_uint64 op0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (op0));
> +
> +    vec_perm_builder builder (op0_len, 2, 3);
> +    int mask_elems[] = { 0, 0, 2, 0, 4, 0 };
> +    for (int i = 0; i < 6; i++)
> +      builder.quick_push (mask_elems[i]);
> +
> +    vec_perm_indices sel (builder, 2, op0_len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (op0), op0, op1, sel);
> +    tree expected_res[] = { vector_cst_elt (op0, 0), vector_cst_elt (op0, 0),
> +			    vector_cst_elt (op0, 2), vector_cst_elt (op0, 0),
> +			    vector_cst_elt (op0, 4), vector_cst_elt (op0, 0) };
> +    validate_res (2, 3, res, expected_res);
> +  }
> +
> +  /* Case 7: sel_npatterns > op_npatterns;
> +     op0, op1: npatterns = 2, nelts_per_pattern = 3
> +     sel: { 0, 0, 1, len, 2, 0, 3, len, 4, 0, 5, len, ...},
> +     with npatterns = 4, nelts_per_pattern = 3.  */
> +  {
> +    tree op0 = build_vec_cst_rand (char_type_node, 2, 3, 2);
> +    tree op1 = build_vec_cst_rand (char_type_node, 2, 3, 2);
> +    poly_uint64 op0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (op0));
> +
> +    vec_perm_builder builder(op0_len, 4, 3);
> +    // -1 is used as place holder for poly_int_cst
> +    int mask_elems[] = { 0, 0, 1, -1, 2, 0, 3, -1, 4, 0, 5, -1 };
> +    for (int i = 0; i < 12; i++)
> +      builder.quick_push ((mask_elems[i] == -1) ? op0_len : mask_elems[i]);
> +
> +    vec_perm_indices sel (builder, 2, op0_len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (op0), op0, op1, sel);
> +    tree expected_res[] = { vector_cst_elt (op0, 0), vector_cst_elt (op0, 0),
> +			    vector_cst_elt (op0, 1), vector_cst_elt (op1, 0),
> +			    vector_cst_elt (op0, 2), vector_cst_elt (op0, 0),
> +			    vector_cst_elt (op0, 3), vector_cst_elt (op1, 0),
> +			    vector_cst_elt (op0, 4), vector_cst_elt (op0, 0),
> +			    vector_cst_elt (op0, 5), vector_cst_elt (op1, 0) };
> +    validate_res (4, 3, res, expected_res);
> +  }
> +}
> +
> +static void
> +test_dup ()
> +{
> +  /* Case 1: mask = {0, ...} */
> +  {
> +    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> +    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    vec_perm_builder builder (len, 1, 1);
> +    builder.quick_push (0);
> +    vec_perm_indices sel (builder, 2, len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +    tree expected_res[] = { vector_cst_elt (res, 0) };
> +    validate_res (1, 1, res, expected_res);
> +  }
> +
> +  /* Case 2: mask = {len, ...} */
> +  {
> +    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> +    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    vec_perm_builder builder (len, 1, 1);
> +    builder.quick_push (len);
> +    vec_perm_indices sel (builder, 2, len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +    tree expected_res[] = { vector_cst_elt (arg1, 0) };
> +    validate_res (1, 1, res, expected_res);
> +  }
> +
> +  /* Case 3: mask = { 0, len, ... } */
> +  {
> +    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> +    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    vec_perm_builder builder (len, 2, 1);
> +    builder.quick_push (0);
> +    builder.quick_push (len);
> +    vec_perm_indices sel (builder, 2, len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0) };
> +    validate_res (2, 1, res, expected_res);
> +  }
> +
> +  /* Case 4: mask = { 0, len, 1, len+1, ... } */
> +  {
> +    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> +    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    vec_perm_builder builder (len, 2, 2);
> +    builder.quick_push (0);
> +    builder.quick_push (len);
> +    builder.quick_push (1);
> +    builder.quick_push (len + 1);
> +    vec_perm_indices sel (builder, 2, len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0),
> +			    vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
> +			  };
> +    validate_res (2, 2, res, expected_res);
> +  }
> +
> +  /* Case 5: mask = { 0, len, 1, len+1, .... }
> +     npatterns = 4, nelts_per_pattern = 1 */
> +  {
> +    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> +    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    vec_perm_builder builder (len, 4, 1);
> +    builder.quick_push (0);
> +    builder.quick_push (len);
> +    builder.quick_push (1);
> +    builder.quick_push (len + 1);
> +    vec_perm_indices sel (builder, 2, len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0),
> +			    vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
> +			  };
> +    validate_res (4, 1, res, expected_res);
> +  }
> +
> +  /* Case 6: mask = {0, 4, ...}
> +     npatterns = 1, nelts_per_pattern = 2.
> +     This should return NULL_TREE because the index 4 may choose
> +     from either arg0 or arg1 depending on vector length.  */
> +  {
> +    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> +    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    vec_perm_builder builder (len, 1, 2);
> +    builder.quick_push (0);
> +    builder.quick_push (4);
> +    vec_perm_indices sel (builder, 2, len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +    ASSERT_TRUE (res == NULL_TREE);
> +  }
> +
> +  /* Case 7: npatterns(arg0) = 4 > npatterns(sel) = 2
> +     mask = {0, len, 1, len + 1, ...}
> +     sel_npatterns = 2, sel_nelts_per_pattern = 2.  */

This is a good test to have, but it doesn't seem to match the
name of the containing function (test_dup).

> +  {
> +    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> +    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> +    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    vec_perm_builder builder (arg0_len, 2, 2);
> +    builder.quick_push (0);
> +    builder.quick_push (arg0_len);
> +    builder.quick_push (1);
> +    builder.quick_push (arg0_len + 1);
> +    vec_perm_indices sel (builder, 2, arg0_len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0),
> +			    vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
> +			  };
> +    validate_res (2, 2, res, expected_res);
> +  }
> +}
> +
> +static void
> +test_mixed ()
> +{
> +  /* Case 1: op0, op1 -> VLS, sel -> VLA and selects from both input vectors.
> +     In this case, we treat res_npatterns = nelts in input vector
> +     and res_nelts_per_pattern = 1, and create a dup pattern.
> +     sel = { 0, 4, 1, 5, ... }
> +     res = { op0[0], op1[0], op0[1], op1[1], ...} // (4, 1)
> +     res_npatterns = 4, res_nelts_per_pattern = 1.  */
> +  {
> +    tree arg_vectype = build_vector_type (integer_type_node, 4);
> +    tree arg0 = build_vec_cst_rand (integer_type_node, 4, 1, 0, arg_vectype);
> +    tree arg1 = build_vec_cst_rand (integer_type_node, 4, 1, 0, arg_vectype);
> +
> +    tree res_type = get_preferred_vectype (integer_type_node);
> +    poly_uint64 res_len = TYPE_VECTOR_SUBPARTS (res_type);
> +    vec_perm_builder builder (res_len, 4, 1);
> +    builder.quick_push (0);
> +    builder.quick_push (4);
> +    builder.quick_push (1);
> +    builder.quick_push (5);
> +
> +    vec_perm_indices sel (builder, 2, res_len);
> +    tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel);
> +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0),
> +			    vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
> +			  };
> +    validate_res (4, 1, res, expected_res);
> +  }
> +
> +  /* Case 2: Same as Case 1, but sel contains an out of bounds index.
> +     result should be NULL_TREE.  */
> +  {
> +    tree arg_vectype = build_vector_type (integer_type_node, 4);
> +    tree arg0 = build_vec_cst_rand (integer_type_node, 4, 1, 0, arg_vectype);
> +    tree arg1 = build_vec_cst_rand (integer_type_node, 4, 1, 0, arg_vectype);
> +
> +    tree res_type = get_preferred_vectype (integer_type_node);
> +    poly_uint64 res_len = TYPE_VECTOR_SUBPARTS (res_type);
> +    vec_perm_builder builder (res_len, 4, 1);
> +    builder.quick_push (0);
> +    builder.quick_push (8);
> +    builder.quick_push (1);
> +    builder.quick_push (5);
> +
> +    vec_perm_indices sel (builder, 2, res_len);
> +    const char *reason; 
> +    tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel, &reason);
> +    gcc_assert (res == NULL_TREE);
> +    gcc_assert (!strcmp (reason, "out of bounds access"));
> +  }
> +
> +  /* Case 3: op0, op1 are VLA and sel is VLS.
> +     op0, op1: VNx16QI with shape (2, 3)
> +     sel = V4SI with values {0, 2, 4, 6}
> +     res: V4SI with values { op0[0], op0[2], op0[4], op0[6] }.  */
> +  {
> +    tree arg0 = build_vec_cst_rand (char_type_node, 2, 3, 2);
> +    tree arg1 = build_vec_cst_rand (char_type_node, 2, 3, 2);
> +
> +    poly_uint64 res_len = 4;
> +    tree res_type = build_vector_type (char_type_node, res_len);
> +    vec_perm_builder builder (res_len, 4, 1);
> +    builder.quick_push (0);
> +    builder.quick_push (2);
> +    builder.quick_push (4);
> +    builder.quick_push (6);
> +
> +    vec_perm_indices sel (builder, 2, res_len);
> +    tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel);
> +
> +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg0, 2),
> +			    vector_cst_elt (arg0, 4), vector_cst_elt (arg0, 6)
> +			  };
> +    validate_res_vls (res, expected_res, 4);
> +  }
> +
> +  /* Case 4: Same as case 4, but op0, op1 are VNx4SI with shape (2, 3) and step = 2

Same as case 3?

Thanks,
Richard

> +     sel = V4SI with values {0, 2, 4, 6}
> +     In this case result should be NULL_TREE because we cross input vector
> +     boundary at index 4.  */
> +  {
> +    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 2);
> +    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 2);
> +
> +    poly_uint64 res_len = 4;
> +    tree res_type = build_vector_type (char_type_node, res_len);
> +    vec_perm_builder builder (res_len, 4, 1);
> +    builder.quick_push (0);
> +    builder.quick_push (2);
> +    builder.quick_push (4);
> +    builder.quick_push (6);
> +
> +    vec_perm_indices sel (builder, 2, res_len);
> +    const char *reason;
> +    tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel, &reason);
> +    gcc_assert (res == NULL_TREE);
> +    gcc_assert (!strcmp (reason, "cannot divide selector element by arg len"));
> +  }
> +}
> +
> +static void
> +test ()
> +{
> +  tree vectype = get_preferred_vectype (integer_type_node);
> +  if (TYPE_VECTOR_SUBPARTS (vectype).is_constant ())
> +    return;
> +
> +  test_dup ();
> +  test_stepped ();
> +  test_mixed ();
> +}
> +};
> +
>  /* Verify that various binary operations on vectors are folded
>     correctly.  */
>  
> @@ -16943,6 +17626,7 @@ fold_const_cc_tests ()
>    test_arithmetic_folding ();
>    test_vector_folding ();
>    test_vec_duplicate_folding ();
> +  test_fold_vec_perm_cst::test ();
>  }
>  
>  } // namespace selftest

Prathamesh Kulkarni Aug. 6, 2023, 12:25 p.m. UTC | #8

On Fri, 4 Aug 2023 at 20:36, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Full review this time, sorry for the skipping the tests earlier.
Thanks for the detailed review! Please find my responses inline below.
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> > index 7e5494dfd39..680d0e54fd4 100644
> > --- a/gcc/fold-const.cc
> > +++ b/gcc/fold-const.cc
> > @@ -85,6 +85,10 @@ along with GCC; see the file COPYING3.  If not see
> >  #include "vec-perm-indices.h"
> >  #include "asan.h"
> >  #include "gimple-range.h"
> > +#include <algorithm>
>
> This should be included by defining INCLUDE_ALGORITHM instead.
Done. Just curious, why do we use this macro instead of directly
including <algorithm> ?
>
> > +#include "tree-pretty-print.h"
> > +#include "gimple-pretty-print.h"
> > +#include "print-tree.h"
>
> Are these still needed, or were they for debugging?
Just for debugging, removed.
>
> >
> >  /* Nonzero if we are folding constants inside an initializer or a C++
> >     manifestly-constant-evaluated context; zero otherwise.
> > @@ -10494,15 +10498,9 @@ fold_mult_zconjz (location_t loc, tree type, tree expr)
> >  static bool
> >  vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
> >  {
> > -  unsigned HOST_WIDE_INT i, nunits;
> > +  unsigned HOST_WIDE_INT i;
> >
> > -  if (TREE_CODE (arg) == VECTOR_CST
> > -      && VECTOR_CST_NELTS (arg).is_constant (&nunits))
> > -    {
> > -      for (i = 0; i < nunits; ++i)
> > -     elts[i] = VECTOR_CST_ELT (arg, i);
> > -    }
> > -  else if (TREE_CODE (arg) == CONSTRUCTOR)
> > +  if (TREE_CODE (arg) == CONSTRUCTOR)
> >      {
> >        constructor_elt *elt;
> >
> > @@ -10520,6 +10518,192 @@ vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
> >    return true;
> >  }
> >
> > +/* Helper routine for fold_vec_perm_cst to check if SEL is a suitable
> > +   mask for VLA vec_perm folding.
> > +   REASON if specified, will contain the reason why SEL is not suitable.
> > +   Used only for debugging and unit-testing.
> > +   VERBOSE if enabled is used for debugging output.  */
> > +
> > +static bool
> > +valid_mask_for_fold_vec_perm_cst_p (tree arg0, tree arg1,
> > +                                 const vec_perm_indices &sel,
> > +                                 const char **reason = NULL,
> > +                                 ATTRIBUTE_UNUSED bool verbose = false)
>
> Since verbose is no longer needed (good!), I think we should just remove it.
Done.
>
> > +{
> > +  unsigned sel_npatterns = sel.encoding ().npatterns ();
> > +  unsigned sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
> > +
> > +  if (!(pow2p_hwi (sel_npatterns)
> > +     && pow2p_hwi (VECTOR_CST_NPATTERNS (arg0))
> > +     && pow2p_hwi (VECTOR_CST_NPATTERNS (arg1))))
> > +    {
> > +      if (reason)
> > +     *reason = "npatterns is not power of 2";
> > +      return false;
> > +    }
> > +
> > +  /* We want to avoid cases where sel.length is not a multiple of npatterns.
> > +     For eg: sel.length = 2 + 2x, and sel npatterns = 4.  */
> > +  poly_uint64 esel;
> > +  if (!multiple_p (sel.length (), sel_npatterns, &esel))
> > +    {
> > +      if (reason)
> > +     *reason = "sel.length is not multiple of sel_npatterns";
> > +      return false;
> > +    }
> > +
> > +  if (sel_nelts_per_pattern < 3)
> > +    return true;
> > +
> > +  for (unsigned pattern = 0; pattern < sel_npatterns; pattern++)
> > +    {
> > +      poly_uint64 a1 = sel[pattern + sel_npatterns];
> > +      poly_uint64 a2 = sel[pattern + 2 * sel_npatterns];
> > +      HOST_WIDE_INT S;
>
> Trailing whitespace.  The convention is to use lowercase variable
> names, so please call this "step".
Fixed, thanks.
>
> > +      if (!poly_int64 (a2 - a1).is_constant (&S))
> > +     {
> > +       if (reason)
> > +         *reason = "step is not constant";
> > +       return false;
> > +     }
> > +      // FIXME: Punt on S < 0 for now, revisit later.
> > +      if (S < 0)
> > +     return false;
> > +      if (S == 0)
> > +     continue;
> > +
> > +      if (!pow2p_hwi (S))
> > +     {
> > +       if (reason)
> > +         *reason = "step is not power of 2";
> > +       return false;
> > +     }
> > +
> > +      /* Ensure that stepped sequence of the pattern selects elements
> > +      only from the same input vector if it's VLA.  */
>
> s/ if it's VLA//
Oops sorry, that was a relic of something else I was trying :)
Fixed, thanks.
>
> > +      uint64_t q1, qe;
> > +      poly_uint64 r1, re;
> > +      poly_uint64 ae = a1 + (esel - 2) * S;
> > +      poly_uint64 arg_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +      if (!(can_div_trunc_p (a1, arg_len, &q1, &r1)
> > +         && can_div_trunc_p (ae, arg_len, &qe, &re)
> > +         && q1 == qe))
> > +     {
> > +       if (reason)
> > +         *reason = "crossed input vectors";
> > +       return false;
> > +     }
> > +
>
> Probably worth a comment above the following code too:
>
>   /* Ensure that the stepped sequence always selects from the same
>      input pattern.  */
Done.
>
> > +      unsigned arg_npatterns
> > +     = ((q1 & 0) == 0) ? VECTOR_CST_NPATTERNS (arg0)
> > +                       : VECTOR_CST_NPATTERNS (arg1);
> > +
> > +      if (!multiple_p (S, arg_npatterns))
> > +     {
> > +       if (reason)
> > +         *reason = "S is not multiple of npatterns";
> > +       return false;
> > +     }
> > +    }
> > +
> > +  return true;
> > +}
> > +
> > +/* Try to fold permutation of ARG0 and ARG1 with SEL selector when
> > +   the input vectors are VECTOR_CST. Return NULL_TREE otherwise.
> > +   REASON and VERBOSE have same purpose as described in
> > +   valid_mask_for_fold_vec_perm_cst_p.
> > +
> > +   (1) If SEL is a suitable mask as determined by
> > +       valid_mask_for_fold_vec_perm_cst_p, then:
> > +       res_npatterns = max of npatterns between ARG0, ARG1, and SEL
> > +       res_nelts_per_pattern = max of nelts_per_pattern between
> > +                            ARG0, ARG1 and SEL.
> > +   (2) If SEL is not a suitable mask, and ARG0, ARG1 are VLS,
> > +       then:
> > +       res_npatterns = nelts in input vector.
>
> s/input vector/result vector/
Fixed, thanks.
>
> > +       res_nelts_per_pattern = 1.
> > +       This exception is made so that VLS ARG0, ARG1 and SEL work as before.  */
>
> Guess this is personal preference, but (1) and (2) seem more like
> implementation details, so I think they belong...
>
> > +
> > +static tree
> > +fold_vec_perm_cst (tree type, tree arg0, tree arg1, const vec_perm_indices &sel,
> > +                const char **reason = NULL, bool verbose = false)
> > +{
> > +  unsigned res_npatterns, res_nelts_per_pattern;
> > +  unsigned HOST_WIDE_INT res_nelts;
> > +
>
> ...here instead.
Done.
>
> > +  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason, verbose))
> > +    {
> > +      res_npatterns
> > +     = std::max (VECTOR_CST_NPATTERNS (arg0),
> > +                 std::max (VECTOR_CST_NPATTERNS (arg1),
> > +                           sel.encoding ().npatterns ()));
> > +
> > +      res_nelts_per_pattern
> > +     = std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
> > +                 std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1),
> > +                           sel.encoding ().nelts_per_pattern ()));
> > +
> > +      res_nelts = res_npatterns * res_nelts_per_pattern;
> > +    }
> > +  else if (TYPE_VECTOR_SUBPARTS (type).is_constant (&res_nelts))
> > +    {
> > +      res_npatterns = res_nelts;
> > +      res_nelts_per_pattern = 1;
> > +    }
> > +  else
> > +    return NULL_TREE;
> > +
> > +  tree_vector_builder out_elts (type, res_npatterns, res_nelts_per_pattern);
> > +  for (unsigned i = 0; i < res_nelts; i++)
> > +    {
> > +      poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +      uint64_t q;
> > +      poly_uint64 r;
> > +      unsigned HOST_WIDE_INT index;
> > +
> > +      unsigned HOST_WIDE_INT arg_nelts;
> > +      if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)).is_constant (&arg_nelts)
> > +       && known_ge (sel[i], poly_int64 (2 * arg_nelts)))
> > +     {
> > +       if (reason)
> > +         *reason = "out of bounds access";
> > +       return NULL_TREE;
> > +     }
>
> I don't think this is needed.  The selector indices wrap, and the code
> below should handle the wrapping correctly.
Removed, thanks.
>
> > +
> > +      /* Punt if sel[i] /trunc_div len cannot be determined,
> > +      because the input vector to be chosen will depend on
> > +      runtime vector length.
> > +      For example if len == 4 + 4x, and sel[i] == 4,
> > +      If len at runtime equals 4, we choose arg1[0].
> > +      For any other value of len > 4 at runtime, we choose arg0[4].
> > +      which makes the element choice dependent on runtime vector length.  */
> > +      if (!can_div_trunc_p (sel[i], len, &q, &r))
> > +     {
> > +       if (reason)
> > +         *reason = "cannot divide selector element by arg len";
> > +       return NULL_TREE;
> > +     }
> > +
> > +      /* sel[i] % len will give the index of element in the chosen input
> > +      vector. For example if sel[i] == 5 + 4x and len == 4 + 4x,
> > +      we will choose arg1[1] since (5 + 4x) % (4 + 4x) == 1.  */
> > +      if (!r.is_constant (&index))
> > +     {
> > +       if (reason)
> > +         *reason = "remainder is not constant";
> > +       return NULL_TREE;
> > +     }
> > +
> > +      tree arg = ((q & 1) == 0) ? arg0 : arg1;
> > +      tree elem = vector_cst_elt (arg, index);
> > +      out_elts.quick_push (elem);
> > +    }
> > +
> > +  return out_elts.build ();
> > +}
> > +
> >  /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL
> >     selector.  Return the folded VECTOR_CST or CONSTRUCTOR if successful,
> >     NULL_TREE otherwise.  */
> > @@ -10529,43 +10713,40 @@ fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
> >  {
> >    unsigned int i;
> >    unsigned HOST_WIDE_INT nelts;
> > -  bool need_ctor = false;
> >
> > -  if (!sel.length ().is_constant (&nelts))
> > -    return NULL_TREE;
> > -  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts)
> > -           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts)
> > -           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts));
> > +  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), sel.length ())
> > +           && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)),
> > +                        TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1))));
> > +
> >    if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type)
> >        || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type))
> >      return NULL_TREE;
> >
> > +  if (TREE_CODE (arg0) == VECTOR_CST
> > +      && TREE_CODE (arg1) == VECTOR_CST)
> > +    return fold_vec_perm_cst (type, arg0, arg1, sel);
> > +
> > +  /* For fall back case, we want to ensure we have VLS vectors
> > +     with equal length.  */
> > +  if (!sel.length ().is_constant (&nelts))
> > +    return NULL_TREE;
> > +
> > +  gcc_assert (known_eq (sel.length (), TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0))));
>
> Nit: long line.
Fixed, thanks.
>
> >    tree *in_elts = XALLOCAVEC (tree, nelts * 2);
> >    if (!vec_cst_ctor_to_array (arg0, nelts, in_elts)
> >        || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts))
> >      return NULL_TREE;
> >
> > -  tree_vector_builder out_elts (type, nelts, 1);
> > +  vec<constructor_elt, va_gc> *v;
> > +  vec_alloc (v, nelts);
> >    for (i = 0; i < nelts; i++)
> >      {
> >        HOST_WIDE_INT index;
> >        if (!sel[i].is_constant (&index))
> >       return NULL_TREE;
> > -      if (!CONSTANT_CLASS_P (in_elts[index]))
> > -     need_ctor = true;
> > -      out_elts.quick_push (unshare_expr (in_elts[index]));
> > -    }
> > -
> > -  if (need_ctor)
> > -    {
> > -      vec<constructor_elt, va_gc> *v;
> > -      vec_alloc (v, nelts);
> > -      for (i = 0; i < nelts; i++)
> > -     CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, out_elts[i]);
> > -      return build_constructor (type, v);
> > +      CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, in_elts[index]);
> >      }
> > -  else
> > -    return out_elts.build ();
> > +  return build_constructor (type, v);
> >  }
> >
> >  /* Try to fold a pointer difference of type TYPE two address expressions of
> > @@ -16892,6 +17073,508 @@ test_arithmetic_folding ()
> >                                  x);
> >  }
> >
> > +namespace test_fold_vec_perm_cst {
> > +
> > +static tree
> > +get_preferred_vectype (tree inner_type)
> > +{
> > +  scalar_int_mode int_mode = SCALAR_INT_TYPE_MODE (inner_type);
> > +  machine_mode vmode = targetm.vectorize.preferred_simd_mode (int_mode);
> > +  poly_uint64 nunits = GET_MODE_NUNITS (vmode);
> > +  return build_vector_type (inner_type, nunits);
> > +}
> > +
> > +static tree
> > +build_vec_cst_rand (tree inner_type, unsigned npatterns,
> > +                 unsigned nelts_per_pattern, int S = 0,
>
> Similar comment about lowercase variable names here.
>
> > +                 tree vectype = NULL_TREE)
> > +{
> > +  if (!vectype)
> > +    vectype = get_preferred_vectype (inner_type);
>
> I'm not sure how portable this is.  It looks like the tests rely on
> the integer_type_node vectors being 4 + 4x, but that isn't necessarily
> true on all VLA targets.
>
> Perhaps instead the tests could be classified based on the vector
> lengths that they assume.  Then we can iterate through the vector
> modes and call the appropriate function based on GET_MODE_NUNITS
> and GET_MODE_INNER.
I tried this approach in the attached patch.
Does it look OK ?
>
> > +  tree_vector_builder builder (vectype, npatterns, nelts_per_pattern);
> > +
> > +  // Fill a0 for each pattern
> > +  for (unsigned i = 0; i < npatterns; i++)
> > +    builder.quick_push (build_int_cst (inner_type, rand () % 100));
> > +
> > +  if (nelts_per_pattern == 1)
> > +    return builder.build ();
> > +
> > +  // Fill a1 for each pattern
> > +  for (unsigned i = 0; i < npatterns; i++)
> > +    builder.quick_push (build_int_cst (inner_type, rand () % 100));
> > +
> > +  if (nelts_per_pattern == 2)
> > +    return builder.build ();
> > +
> > +  for (unsigned i = npatterns * 2; i < npatterns * nelts_per_pattern; i++)
> > +    {
> > +      tree prev_elem = builder[i - npatterns];
> > +      int prev_elem_val = TREE_INT_CST_LOW (prev_elem);
> > +      int val = prev_elem_val + S;
> > +      builder.quick_push (build_int_cst (inner_type, val));
> > +    }
> > +
> > +  return builder.build ();
> > +}
> > +
> > +static void
> > +validate_res (unsigned npatterns, unsigned nelts_per_pattern,
> > +           tree res, tree *expected_res)
> > +{
> > +  ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == npatterns);
> > +  ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == nelts_per_pattern);
>
> I don't think this is safe when the inputs are randomised.  E.g. we
> could by chance end up with a vector of all zeros, which would have
> a single pattern and a single element per pattern, regardless of the
> shapes of the inputs.
>
> Given the way that vector_builder<T, Shape, Derived>::finalize
> canonicalises the encoding, it should be safe to use:
>
> * VECTOR_CST_NPATTERNS (res) <= npatterns
> * vector_cst_encoded_nelts (res) <= npatterns * nelts_per_pattern
>
> If we do that then...
>
> > +
> > +  for (unsigned i = 0; i < vector_cst_encoded_nelts (res); i++)
>
> ...this loop bound should be npatterns * nelts_per_pattern instead.
Ah indeed. Fixed, thanks.
>
> > +    ASSERT_TRUE (operand_equal_p (VECTOR_CST_ELT (res, i), expected_res[i], 0));
> > +}
> > +
> > +static void
> > +validate_res_vls (tree res, tree *expected_res, unsigned expected_nelts)
> > +{
> > +  ASSERT_TRUE (known_eq (VECTOR_CST_NELTS (res), expected_nelts));
> > +  for (unsigned i = 0; i < expected_nelts; i++)
> > +    ASSERT_TRUE (operand_equal_p (VECTOR_CST_ELT (res, i), expected_res[i], 0));
> > +}
> > +
> > +/* Verify VLA vec_perm folding.  */
> > +
> > +static void
> > +test_stepped ()
> > +{
> > +  /* Case 1: sel = {0, 1, 2, ...}
> > +     npatterns = 1, nelts_per_pattern = 3
> > +     expected res: { arg0[0], arg0[1], arg0[2], ... } */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (integer_type_node, 1, 3, 2);
> > +    tree arg1 = build_vec_cst_rand (integer_type_node, 1, 3, 2);
> > +    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (arg0_len, 1, 3);
> > +    builder.quick_push (0);
> > +    builder.quick_push (1);
> > +    builder.quick_push (2);
> > +
> > +    vec_perm_indices sel (builder, 2, arg0_len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg0, 1),
> > +                         vector_cst_elt (arg0, 2) };
> > +    validate_res (1, 3, res, expected_res);
> > +  }
> > +
> > +  /* Case 2: sel = {len, len + 1, len + 2, ... }
> > +     npatterns = 1, nelts_per_pattern = 3
> > +     FIXME: This should return
> > +     expected res: { op1[0], op1[1], op1[2], ... }
> > +     however it returns NULL_TREE.  */
>
> Looks like the comment is out of date.
Fixed, thanks.
>
> > +  {
> > +    tree arg0 = build_vec_cst_rand (integer_type_node, 1, 3, 2);
> > +    tree arg1 = build_vec_cst_rand (integer_type_node, 1, 3, 2);
> > +    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (arg0_len, 1, 3);
> > +    builder.quick_push (arg0_len);
> > +    builder.quick_push (arg0_len + 1);
> > +    builder.quick_push (arg0_len + 2);
> > +
> > +    vec_perm_indices sel (builder, 2, arg0_len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel, NULL, true);
> > +    tree expected_res[] = { vector_cst_elt (arg1, 0), vector_cst_elt (arg1, 1),
> > +                         vector_cst_elt (arg1, 2) };
> > +    validate_res (1, 3, res, expected_res);
> > +  }
> > +
> > +  /* Case 3: Leading element of arg1, stepped sequence: pattern 0 of arg0.
> > +     sel = {len, 0, 0, 0, 2, 0, ...}
> > +     npatterns = 2, nelts_per_pattern = 3.
> > +     Use extra pattern {0, ...} to lower number of elements per pattern.  */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (char_type_node, 1, 3, 2);
> > +    tree arg1 = build_vec_cst_rand (char_type_node, 1, 3, 2);
> > +    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (arg0_len, 2, 3);
> > +    builder.quick_push (arg0_len);
> > +    int mask_elems[] = { 0, 0, 0, 2, 0 };
> > +    for (int i = 0; i < 5; i++)
> > +      builder.quick_push (mask_elems[i]);
>
> This leaves one of the elements unspecified.
Sorry, I didn't understand.
It first pushes len in:
builder.quick_push (arg0_len)
and then pushes the remaining indices in the loop:
for (int i = 0; i < 5; i++)
  builder.quick_push (mask_elems[i])
So overall, builder will have 6 elements: {len, 0, 0, 0, 2, 0}
>
> > +
> > +    vec_perm_indices sel (builder, 2, arg0_len);
> > +    const char *reason;
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel, &reason);
> > +
> > +    tree expected_res[] = { vector_cst_elt (arg1, 0), vector_cst_elt (arg0, 0),
> > +                         vector_cst_elt (arg0, 0), vector_cst_elt (arg0, 0),
> > +                         vector_cst_elt (arg0, 2), vector_cst_elt (arg0, 0)
> > +                       };
> > +    validate_res (2, 3, res, expected_res);
> > +  }
> > +
> > +  /* Case 4:
> > +     sel = { len, 0, 2, ... } npatterns = 1, nelts_per_pattern = 3.
> > +     This should return NULL because we cross the input vectors.
> > +     Because,
> > +     arg0_len = 16 + 16x
> > +     a1 = 0
> > +     S = 2
> > +     esel = arg0_len / npatterns_sel = 16+16x/1 = 16 + 16x
> > +     ae = 0 + (esel - 2) * S
> > +     = 0 + (16 + 16x - 2) * 2
> > +     = 28 + 32x
> > +     a1 / arg0_len = 0 /trunc (16 + 16x) = 0
> > +     ae / arg0_len = (28 + 32x) /trunc (16 + 16x), which is not defined,
> > +     since 28/16 != 32/16.
> > +     So return NULL_TREE.  */
>
> The division should succeed now, so as the test says, the reason should
> instead be that ae is in the second input.
Fixed, thanks.
>
> > +  {
> > +    tree arg0 = build_vec_cst_rand (char_type_node, 1, 3, 2);
> > +    tree arg1 = build_vec_cst_rand (char_type_node, 1, 3, 2);
> > +    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (arg0_len, 1, 3);
> > +    builder.quick_push (arg0_len);
> > +    builder.quick_push (0);
> > +    builder.quick_push (2);
> > +
> > +    vec_perm_indices sel (builder, 2, arg0_len);
> > +    const char *reason;
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel, &reason, false);
> > +    gcc_assert (res == NULL_TREE);
> > +    gcc_assert (!strcmp (reason, "crossed input vectors"));
>
> The tests should use ASSERT_* macros rather than gcc_assert.
Fixed, thanks.
>
> > +  }
> > +
> > +  /* Case 5: Select elements from different patterns.
> > +     Should return NULL.  */
> > +  {
> > +    tree op0 = build_vec_cst_rand (char_type_node, 2, 3, 2);
> > +    tree op1 = build_vec_cst_rand (char_type_node, 2, 3, 2);
> > +    poly_uint64 op0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (op0));
> > +
> > +    vec_perm_builder builder (op0_len, 2, 3);
> > +    builder.quick_push (op0_len);
> > +    int mask_elems[] = { 0, 0, 0, 1, 0 };
> > +    for (int i = 0; i < 5; i++)
> > +      builder.quick_push (mask_elems[i]);
>
> Should be 6 elements here too.
>
> > +
> > +    vec_perm_indices sel (builder, 2, op0_len);
> > +    const char *reason;
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (op0), op0, op1, sel, &reason, false);
> > +    gcc_assert (res == NULL_TREE);
> > +    gcc_assert (!strcmp (reason, "S is not multiple of npatterns"));
> > +  }
> > +
> > +  /* Case 6: Select pattern 0 of op0 and dup of op0[0]
> > +     op0, op1, sel: npatterns = 2, nelts_per_pattern = 3
> > +     sel = { 0, 0, 2, 0, 4, 0, ... }.
> > +
> > +     For pattern {0, 2, 4, ...}:
> > +     a1 = 2
> > +     len = 16 + 16x
> > +     S = 2
> > +     esel = len / npatterns_sel = (16 + 16x) / 2 = (8 + 8x)
> > +     ae = a1 + (esel - 2) * S
> > +     = 2 + (8 + 8x - 2) * 2
> > +     = 14 + 16x
> > +     a1 / arg0_len = 2 / (16 + 16x) = 0
> > +     ae / arg0_len = (14 + 16x) / (16 + 16x) = 0
> > +     So a1/arg0_len = ae/arg0_len = 0
> > +     Hence we select from first vector op0
> > +     S = 2, npatterns = 2.
> > +     Since S is multiple of npatterns(op0), we are selecting from
> > +     same pattern of op0.
> > +
> > +     For pattern {0, ...}, we are choosing { op0[0] ... }
> > +     So res will be combination of above patterns:
> > +     res: { op0[0], op0[0], op0[2], op0[0], op0[4], op0[0], ... }
> > +     with npatterns = 2, nelts_per_pattern = 3.  */
> > +  {
> > +    tree op0 = build_vec_cst_rand (char_type_node, 2, 3, 2);
> > +    tree op1 = build_vec_cst_rand (char_type_node, 2, 3, 2);
> > +    poly_uint64 op0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (op0));
> > +
> > +    vec_perm_builder builder (op0_len, 2, 3);
> > +    int mask_elems[] = { 0, 0, 2, 0, 4, 0 };
> > +    for (int i = 0; i < 6; i++)
> > +      builder.quick_push (mask_elems[i]);
> > +
> > +    vec_perm_indices sel (builder, 2, op0_len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (op0), op0, op1, sel);
> > +    tree expected_res[] = { vector_cst_elt (op0, 0), vector_cst_elt (op0, 0),
> > +                         vector_cst_elt (op0, 2), vector_cst_elt (op0, 0),
> > +                         vector_cst_elt (op0, 4), vector_cst_elt (op0, 0) };
> > +    validate_res (2, 3, res, expected_res);
> > +  }
> > +
> > +  /* Case 7: sel_npatterns > op_npatterns;
> > +     op0, op1: npatterns = 2, nelts_per_pattern = 3
> > +     sel: { 0, 0, 1, len, 2, 0, 3, len, 4, 0, 5, len, ...},
> > +     with npatterns = 4, nelts_per_pattern = 3.  */
> > +  {
> > +    tree op0 = build_vec_cst_rand (char_type_node, 2, 3, 2);
> > +    tree op1 = build_vec_cst_rand (char_type_node, 2, 3, 2);
> > +    poly_uint64 op0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (op0));
> > +
> > +    vec_perm_builder builder(op0_len, 4, 3);
> > +    // -1 is used as place holder for poly_int_cst
> > +    int mask_elems[] = { 0, 0, 1, -1, 2, 0, 3, -1, 4, 0, 5, -1 };
> > +    for (int i = 0; i < 12; i++)
> > +      builder.quick_push ((mask_elems[i] == -1) ? op0_len : mask_elems[i]);
> > +
> > +    vec_perm_indices sel (builder, 2, op0_len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (op0), op0, op1, sel);
> > +    tree expected_res[] = { vector_cst_elt (op0, 0), vector_cst_elt (op0, 0),
> > +                         vector_cst_elt (op0, 1), vector_cst_elt (op1, 0),
> > +                         vector_cst_elt (op0, 2), vector_cst_elt (op0, 0),
> > +                         vector_cst_elt (op0, 3), vector_cst_elt (op1, 0),
> > +                         vector_cst_elt (op0, 4), vector_cst_elt (op0, 0),
> > +                         vector_cst_elt (op0, 5), vector_cst_elt (op1, 0) };
> > +    validate_res (4, 3, res, expected_res);
> > +  }
> > +}
> > +
> > +static void
> > +test_dup ()
> > +{
> > +  /* Case 1: mask = {0, ...} */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> > +    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> > +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (len, 1, 1);
> > +    builder.quick_push (0);
> > +    vec_perm_indices sel (builder, 2, len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +
> > +    tree expected_res[] = { vector_cst_elt (res, 0) };
> > +    validate_res (1, 1, res, expected_res);
> > +  }
> > +
> > +  /* Case 2: mask = {len, ...} */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> > +    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> > +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (len, 1, 1);
> > +    builder.quick_push (len);
> > +    vec_perm_indices sel (builder, 2, len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +
> > +    tree expected_res[] = { vector_cst_elt (arg1, 0) };
> > +    validate_res (1, 1, res, expected_res);
> > +  }
> > +
> > +  /* Case 3: mask = { 0, len, ... } */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> > +    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> > +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (len, 2, 1);
> > +    builder.quick_push (0);
> > +    builder.quick_push (len);
> > +    vec_perm_indices sel (builder, 2, len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +
> > +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0) };
> > +    validate_res (2, 1, res, expected_res);
> > +  }
> > +
> > +  /* Case 4: mask = { 0, len, 1, len+1, ... } */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> > +    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> > +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (len, 2, 2);
> > +    builder.quick_push (0);
> > +    builder.quick_push (len);
> > +    builder.quick_push (1);
> > +    builder.quick_push (len + 1);
> > +    vec_perm_indices sel (builder, 2, len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +
> > +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0),
> > +                         vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
> > +                       };
> > +    validate_res (2, 2, res, expected_res);
> > +  }
> > +
> > +  /* Case 5: mask = { 0, len, 1, len+1, .... }
> > +     npatterns = 4, nelts_per_pattern = 1 */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> > +    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> > +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (len, 4, 1);
> > +    builder.quick_push (0);
> > +    builder.quick_push (len);
> > +    builder.quick_push (1);
> > +    builder.quick_push (len + 1);
> > +    vec_perm_indices sel (builder, 2, len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +
> > +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0),
> > +                         vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
> > +                       };
> > +    validate_res (4, 1, res, expected_res);
> > +  }
> > +
> > +  /* Case 6: mask = {0, 4, ...}
> > +     npatterns = 1, nelts_per_pattern = 2.
> > +     This should return NULL_TREE because the index 4 may choose
> > +     from either arg0 or arg1 depending on vector length.  */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> > +    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> > +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (len, 1, 2);
> > +    builder.quick_push (0);
> > +    builder.quick_push (4);
> > +    vec_perm_indices sel (builder, 2, len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +    ASSERT_TRUE (res == NULL_TREE);
> > +  }
> > +
> > +  /* Case 7: npatterns(arg0) = 4 > npatterns(sel) = 2
> > +     mask = {0, len, 1, len + 1, ...}
> > +     sel_npatterns = 2, sel_nelts_per_pattern = 2.  */
>
> This is a good test to have, but it doesn't seem to match the
> name of the containing function (test_dup).
Well since the selector has nelts_per_pattern = 2, ie, dup of a1, I
chose to put it in test_dup.
Anyway, the functions are now re-classified based on vector length in the patch.
>
> > +  {
> > +    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> > +    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
> > +    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (arg0_len, 2, 2);
> > +    builder.quick_push (0);
> > +    builder.quick_push (arg0_len);
> > +    builder.quick_push (1);
> > +    builder.quick_push (arg0_len + 1);
> > +    vec_perm_indices sel (builder, 2, arg0_len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +
> > +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0),
> > +                         vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
> > +                       };
> > +    validate_res (2, 2, res, expected_res);
> > +  }
> > +}
> > +
> > +static void
> > +test_mixed ()
> > +{
> > +  /* Case 1: op0, op1 -> VLS, sel -> VLA and selects from both input vectors.
> > +     In this case, we treat res_npatterns = nelts in input vector
> > +     and res_nelts_per_pattern = 1, and create a dup pattern.
> > +     sel = { 0, 4, 1, 5, ... }
> > +     res = { op0[0], op1[0], op0[1], op1[1], ...} // (4, 1)
> > +     res_npatterns = 4, res_nelts_per_pattern = 1.  */
> > +  {
> > +    tree arg_vectype = build_vector_type (integer_type_node, 4);
> > +    tree arg0 = build_vec_cst_rand (integer_type_node, 4, 1, 0, arg_vectype);
> > +    tree arg1 = build_vec_cst_rand (integer_type_node, 4, 1, 0, arg_vectype);
> > +
> > +    tree res_type = get_preferred_vectype (integer_type_node);
> > +    poly_uint64 res_len = TYPE_VECTOR_SUBPARTS (res_type);
> > +    vec_perm_builder builder (res_len, 4, 1);
> > +    builder.quick_push (0);
> > +    builder.quick_push (4);
> > +    builder.quick_push (1);
> > +    builder.quick_push (5);
> > +
> > +    vec_perm_indices sel (builder, 2, res_len);
> > +    tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel);
> > +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0),
> > +                         vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
> > +                       };
> > +    validate_res (4, 1, res, expected_res);
> > +  }
> > +
> > +  /* Case 2: Same as Case 1, but sel contains an out of bounds index.
> > +     result should be NULL_TREE.  */
> > +  {
> > +    tree arg_vectype = build_vector_type (integer_type_node, 4);
> > +    tree arg0 = build_vec_cst_rand (integer_type_node, 4, 1, 0, arg_vectype);
> > +    tree arg1 = build_vec_cst_rand (integer_type_node, 4, 1, 0, arg_vectype);
> > +
> > +    tree res_type = get_preferred_vectype (integer_type_node);
> > +    poly_uint64 res_len = TYPE_VECTOR_SUBPARTS (res_type);
> > +    vec_perm_builder builder (res_len, 4, 1);
> > +    builder.quick_push (0);
> > +    builder.quick_push (8);
> > +    builder.quick_push (1);
> > +    builder.quick_push (5);
> > +
> > +    vec_perm_indices sel (builder, 2, res_len);
> > +    const char *reason;
> > +    tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel, &reason);
> > +    gcc_assert (res == NULL_TREE);
> > +    gcc_assert (!strcmp (reason, "out of bounds access"));
> > +  }
> > +
> > +  /* Case 3: op0, op1 are VLA and sel is VLS.
> > +     op0, op1: VNx16QI with shape (2, 3)
> > +     sel = V4SI with values {0, 2, 4, 6}
> > +     res: V4SI with values { op0[0], op0[2], op0[4], op0[6] }.  */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (char_type_node, 2, 3, 2);
> > +    tree arg1 = build_vec_cst_rand (char_type_node, 2, 3, 2);
> > +
> > +    poly_uint64 res_len = 4;
> > +    tree res_type = build_vector_type (char_type_node, res_len);
> > +    vec_perm_builder builder (res_len, 4, 1);
> > +    builder.quick_push (0);
> > +    builder.quick_push (2);
> > +    builder.quick_push (4);
> > +    builder.quick_push (6);
> > +
> > +    vec_perm_indices sel (builder, 2, res_len);
> > +    tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel);
> > +
> > +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg0, 2),
> > +                         vector_cst_elt (arg0, 4), vector_cst_elt (arg0, 6)
> > +                       };
> > +    validate_res_vls (res, expected_res, 4);
> > +  }
> > +
> > +  /* Case 4: Same as case 4, but op0, op1 are VNx4SI with shape (2, 3) and step = 2
>
> Same as case 3?
Oops sorry, fixed.

The attached patch passes bootstrap+test on aarch64-linux-gnu with and
without SVE, and on x86_64-linux-gnu.

Thanks,
Prathamesh
>

> Thanks,
> Richard
>
> > +     sel = V4SI with values {0, 2, 4, 6}
> > +     In this case result should be NULL_TREE because we cross input vector
> > +     boundary at index 4.  */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 2);
> > +    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 2);
> > +
> > +    poly_uint64 res_len = 4;
> > +    tree res_type = build_vector_type (char_type_node, res_len);
> > +    vec_perm_builder builder (res_len, 4, 1);
> > +    builder.quick_push (0);
> > +    builder.quick_push (2);
> > +    builder.quick_push (4);
> > +    builder.quick_push (6);
> > +
> > +    vec_perm_indices sel (builder, 2, res_len);
> > +    const char *reason;
> > +    tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel, &reason);
> > +    gcc_assert (res == NULL_TREE);
> > +    gcc_assert (!strcmp (reason, "cannot divide selector element by arg len"));
> > +  }
> > +}
> > +
> > +static void
> > +test ()
> > +{
> > +  tree vectype = get_preferred_vectype (integer_type_node);
> > +  if (TYPE_VECTOR_SUBPARTS (vectype).is_constant ())
> > +    return;
> > +
> > +  test_dup ();
> > +  test_stepped ();
> > +  test_mixed ();
> > +}
> > +};
> > +
> >  /* Verify that various binary operations on vectors are folded
> >     correctly.  */
> >
> > @@ -16943,6 +17626,7 @@ fold_const_cc_tests ()
> >    test_arithmetic_folding ();
> >    test_vector_folding ();
> >    test_vec_duplicate_folding ();
> > +  test_fold_vec_perm_cst::test ();
> >  }
> >
> >  } // namespace selftest
diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index 7e5494dfd39..648ef5c647e 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -40,6 +40,7 @@ along with GCC; see the file COPYING3.  If not see
    gimple code, we need to handle GIMPLE tuples as well as their
    corresponding tree equivalents.  */
 
+#define INCLUDE_ALGORITHM
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
@@ -10494,15 +10495,9 @@ fold_mult_zconjz (location_t loc, tree type, tree expr)
 static bool
 vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
 {
-  unsigned HOST_WIDE_INT i, nunits;
+  unsigned HOST_WIDE_INT i;
 
-  if (TREE_CODE (arg) == VECTOR_CST
-      && VECTOR_CST_NELTS (arg).is_constant (&nunits))
-    {
-      for (i = 0; i < nunits; ++i)
-	elts[i] = VECTOR_CST_ELT (arg, i);
-    }
-  else if (TREE_CODE (arg) == CONSTRUCTOR)
+  if (TREE_CODE (arg) == CONSTRUCTOR)
     {
       constructor_elt *elt;
 
@@ -10520,6 +10515,182 @@ vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
   return true;
 }
 
+/* Helper routine for fold_vec_perm_cst to check if SEL is a suitable
+   mask for VLA vec_perm folding.
+   REASON if specified, will contain the reason why SEL is not suitable.
+   Used only for debugging and unit-testing.  */
+
+static bool
+valid_mask_for_fold_vec_perm_cst_p (tree arg0, tree arg1,
+				    const vec_perm_indices &sel,
+				    const char **reason = NULL)
+{
+  unsigned sel_npatterns = sel.encoding ().npatterns ();
+  unsigned sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
+
+  if (!(pow2p_hwi (sel_npatterns)
+	&& pow2p_hwi (VECTOR_CST_NPATTERNS (arg0))
+	&& pow2p_hwi (VECTOR_CST_NPATTERNS (arg1))))
+    {
+      if (reason)
+	*reason = "npatterns is not power of 2";
+      return false;
+    }
+
+  /* We want to avoid cases where sel.length is not a multiple of npatterns.
+     For eg: sel.length = 2 + 2x, and sel npatterns = 4.  */
+  poly_uint64 esel;
+  if (!multiple_p (sel.length (), sel_npatterns, &esel))
+    {
+      if (reason)
+	*reason = "sel.length is not multiple of sel_npatterns";
+      return false;
+    }
+
+  if (sel_nelts_per_pattern < 3)
+    return true;
+
+  for (unsigned pattern = 0; pattern < sel_npatterns; pattern++)
+    {
+      poly_uint64 a1 = sel[pattern + sel_npatterns];
+      poly_uint64 a2 = sel[pattern + 2 * sel_npatterns];
+      HOST_WIDE_INT step;
+      if (!poly_int64 (a2 - a1).is_constant (&step))
+	{
+	  if (reason)
+	    *reason = "step is not constant";
+	  return false;
+	}
+      // FIXME: Punt on step < 0 for now, revisit later.
+      if (step < 0)
+	return false;
+      if (step == 0)
+	continue;
+
+      if (!pow2p_hwi (step))
+	{
+	  if (reason)
+	    *reason = "step is not power of 2";
+	  return false;
+	}
+
+      /* Ensure that stepped sequence of the pattern selects elements
+	 only from the same input vector.  */
+      uint64_t q1, qe;
+      poly_uint64 r1, re;
+      poly_uint64 ae = a1 + (esel - 2) * step;
+      poly_uint64 arg_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+      if (!(can_div_trunc_p (a1, arg_len, &q1, &r1)
+	    && can_div_trunc_p (ae, arg_len, &qe, &re)
+	    && q1 == qe))
+	{
+	  if (reason)
+	    *reason = "crossed input vectors";
+	  return false;
+	}
+
+      /* Ensure that the stepped sequence always selects from the same
+	 input pattern.  */
+      unsigned arg_npatterns
+	= ((q1 & 0) == 0) ? VECTOR_CST_NPATTERNS (arg0)
+			  : VECTOR_CST_NPATTERNS (arg1);
+
+      if (!multiple_p (step, arg_npatterns))
+	{
+	  if (reason)
+	    *reason = "step is not multiple of npatterns";
+	  return false;
+	}
+    }
+
+  return true;
+}
+
+/* Try to fold permutation of ARG0 and ARG1 with SEL selector when
+   the input vectors are VECTOR_CST. Return NULL_TREE otherwise.
+   REASON has same purpose as described in
+   valid_mask_for_fold_vec_perm_cst_p.  */
+
+
+static tree
+fold_vec_perm_cst (tree type, tree arg0, tree arg1, const vec_perm_indices &sel,
+		   const char **reason = NULL)
+{
+  unsigned res_npatterns, res_nelts_per_pattern;
+  unsigned HOST_WIDE_INT res_nelts;
+
+  /* (1) If SEL is a suitable mask as determined by
+     valid_mask_for_fold_vec_perm_cst_p, then:
+     res_npatterns = max of npatterns between ARG0, ARG1, and SEL
+     res_nelts_per_pattern = max of nelts_per_pattern between
+			     ARG0, ARG1 and SEL.
+     (2) If SEL is not a suitable mask, and TYPE is VLS then:
+     res_npatterns = nelts in result vector.
+     res_nelts_per_pattern = 1.
+     This exception is made so that VLS ARG0, ARG1 and SEL work as before.  */
+  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
+    {
+      res_npatterns
+	= std::max (VECTOR_CST_NPATTERNS (arg0),
+		    std::max (VECTOR_CST_NPATTERNS (arg1),
+			      sel.encoding ().npatterns ()));
+
+      res_nelts_per_pattern
+	= std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
+		    std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1),
+			      sel.encoding ().nelts_per_pattern ()));
+
+      res_nelts = res_npatterns * res_nelts_per_pattern;
+    }
+  else if (TYPE_VECTOR_SUBPARTS (type).is_constant (&res_nelts))
+    {
+      res_npatterns = res_nelts;
+      res_nelts_per_pattern = 1;
+    }
+  else
+    return NULL_TREE;
+
+  tree_vector_builder out_elts (type, res_npatterns, res_nelts_per_pattern);
+  for (unsigned i = 0; i < res_nelts; i++)
+    {
+      poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+      uint64_t q;
+      poly_uint64 r;
+      unsigned HOST_WIDE_INT index;
+
+      /* Punt if sel[i] /trunc_div len cannot be determined,
+	 because the input vector to be chosen will depend on
+	 runtime vector length.
+	 For example if len == 4 + 4x, and sel[i] == 4,
+	 If len at runtime equals 4, we choose arg1[0].
+	 For any other value of len > 4 at runtime, we choose arg0[4].
+	 which makes the element choice dependent on runtime vector length.  */
+      if (!can_div_trunc_p (sel[i], len, &q, &r))
+	{
+	  if (reason)
+	    *reason = "cannot divide selector element by arg len";
+	  return NULL_TREE;
+	}
+
+      /* sel[i] % len will give the index of element in the chosen input
+	 vector. For example if sel[i] == 5 + 4x and len == 4 + 4x,
+	 we will choose arg1[1] since (5 + 4x) % (4 + 4x) == 1.  */
+      if (!r.is_constant (&index))
+	{
+	  if (reason)
+	    *reason = "remainder is not constant";
+	  return NULL_TREE;
+	}
+
+      tree arg = ((q & 1) == 0) ? arg0 : arg1;
+      tree elem = vector_cst_elt (arg, index);
+      out_elts.quick_push (elem);
+    }
+
+  return out_elts.build ();
+}
+
 /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL
    selector.  Return the folded VECTOR_CST or CONSTRUCTOR if successful,
    NULL_TREE otherwise.  */
@@ -10529,43 +10700,41 @@ fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
 {
   unsigned int i;
   unsigned HOST_WIDE_INT nelts;
-  bool need_ctor = false;
 
-  if (!sel.length ().is_constant (&nelts))
-    return NULL_TREE;
-  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts)
-	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts)
-	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts));
+  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), sel.length ())
+	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)),
+			   TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1))));
+
   if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type)
       || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type))
     return NULL_TREE;
 
+  if (TREE_CODE (arg0) == VECTOR_CST
+      && TREE_CODE (arg1) == VECTOR_CST)
+    return fold_vec_perm_cst (type, arg0, arg1, sel);
+
+  /* For fall back case, we want to ensure we have VLS vectors
+     with equal length.  */
+  if (!sel.length ().is_constant (&nelts))
+    return NULL_TREE;
+
+  gcc_assert (known_eq (sel.length (),
+			TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0))));
   tree *in_elts = XALLOCAVEC (tree, nelts * 2);
   if (!vec_cst_ctor_to_array (arg0, nelts, in_elts)
       || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts))
     return NULL_TREE;
 
-  tree_vector_builder out_elts (type, nelts, 1);
+  vec<constructor_elt, va_gc> *v;
+  vec_alloc (v, nelts);
   for (i = 0; i < nelts; i++)
     {
       HOST_WIDE_INT index;
       if (!sel[i].is_constant (&index))
 	return NULL_TREE;
-      if (!CONSTANT_CLASS_P (in_elts[index]))
-	need_ctor = true;
-      out_elts.quick_push (unshare_expr (in_elts[index]));
-    }
-
-  if (need_ctor)
-    {
-      vec<constructor_elt, va_gc> *v;
-      vec_alloc (v, nelts);
-      for (i = 0; i < nelts; i++)
-	CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, out_elts[i]);
-      return build_constructor (type, v);
+      CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, in_elts[index]);
     }
-  else
-    return out_elts.build ();
+  return build_constructor (type, v);
 }
 
 /* Try to fold a pointer difference of type TYPE two address expressions of
@@ -16892,6 +17061,554 @@ test_arithmetic_folding ()
 				   x);
 }
 
+namespace test_fold_vec_perm_cst {
+
+/* Build a VECTOR_CST corresponding to VMODE, and has
+   encoding given by NPATTERNS, NELTS_PER_PATTERN and STEP.
+   Fill it with randomized elements, using rand() % THRESHOLD.  */
+
+static tree
+build_vec_cst_rand (machine_mode vmode, unsigned npatterns,
+		    unsigned nelts_per_pattern,
+		    int step = 0, int threshold = 100)
+{
+  tree inner_type = lang_hooks.types.type_for_mode (GET_MODE_INNER (vmode), 1);
+  tree vectype = build_vector_type_for_mode (inner_type, vmode);
+  tree_vector_builder builder (vectype, npatterns, nelts_per_pattern);
+
+  // Fill a0 for each pattern
+  for (unsigned i = 0; i < npatterns; i++)
+    builder.quick_push (build_int_cst (inner_type, rand () % threshold));
+
+  if (nelts_per_pattern == 1)
+    return builder.build ();
+
+  // Fill a1 for each pattern
+  for (unsigned i = 0; i < npatterns; i++)
+    builder.quick_push (build_int_cst (inner_type, rand () % threshold));
+
+  if (nelts_per_pattern == 2)
+    return builder.build ();
+
+  for (unsigned i = npatterns * 2; i < npatterns * nelts_per_pattern; i++)
+    {
+      tree prev_elem = builder[i - npatterns];
+      int prev_elem_val = TREE_INT_CST_LOW (prev_elem);
+      int val = prev_elem_val + step;
+      builder.quick_push (build_int_cst (inner_type, val));
+    }
+
+  return builder.build ();
+}
+
+/* Validate result of VEC_PERM_EXPR folding for the unit-tests below,
+   when result is VLA.  */
+
+static void
+validate_res (unsigned npatterns, unsigned nelts_per_pattern,
+	      tree res, tree *expected_res)
+{
+  /* Actual npatterns / nelts_per_pattern in res may be less than expected due
+     to canonicalization.  */
+  ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) <= npatterns);
+  ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) <= nelts_per_pattern);
+
+  for (unsigned i = 0; i < npatterns * nelts_per_pattern; i++)
+    ASSERT_TRUE (operand_equal_p (VECTOR_CST_ELT (res, i), expected_res[i], 0));
+}
+
+/* Validate result of VEC_PERM_EXPR folding for the unit-tests below,
+   when the result is VLS.  */
+
+static void
+validate_res_vls (tree res, tree *expected_res, unsigned expected_nelts)
+{
+  ASSERT_TRUE (known_eq (VECTOR_CST_NELTS (res), expected_nelts));
+  for (unsigned i = 0; i < expected_nelts; i++)
+    ASSERT_TRUE (operand_equal_p (VECTOR_CST_ELT (res, i), expected_res[i], 0));
+}
+
+/* Test cases where result and input vectors are VNx4SI  */
+
+static void
+test_vnx4si (machine_mode vmode)
+{
+  /* Case 1: mask = {0, ...} */
+  {
+    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
+    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
+    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (len, 1, 1);
+    builder.quick_push (0);
+    vec_perm_indices sel (builder, 2, len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+    tree expected_res[] = { vector_cst_elt (res, 0) };
+    validate_res (1, 1, res, expected_res);
+  }
+
+  /* Case 2: mask = {len, ...} */
+  {
+    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
+    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
+    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (len, 1, 1);
+    builder.quick_push (len);
+    vec_perm_indices sel (builder, 2, len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+    tree expected_res[] = { vector_cst_elt (arg1, 0) };
+    validate_res (1, 1, res, expected_res);
+  }
+
+  /* Case 3: mask = { 0, len, ... } */
+  {
+    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
+    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
+    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (len, 2, 1);
+    builder.quick_push (0);
+    builder.quick_push (len);
+    vec_perm_indices sel (builder, 2, len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0) };
+    validate_res (2, 1, res, expected_res);
+  }
+
+  /* Case 4: mask = { 0, len, 1, len+1, ... } */
+  {
+    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
+    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
+    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (len, 2, 2);
+    builder.quick_push (0);
+    builder.quick_push (len);
+    builder.quick_push (1);
+    builder.quick_push (len + 1);
+    vec_perm_indices sel (builder, 2, len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0),
+			    vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
+			  };
+    validate_res (2, 2, res, expected_res);
+  }
+
+  /* Case 5: mask = { 0, len, 1, len+1, .... }
+     npatterns = 4, nelts_per_pattern = 1 */
+  {
+    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
+    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
+    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (len, 4, 1);
+    builder.quick_push (0);
+    builder.quick_push (len);
+    builder.quick_push (1);
+    builder.quick_push (len + 1);
+    vec_perm_indices sel (builder, 2, len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0),
+			    vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
+			  };
+    validate_res (4, 1, res, expected_res);
+  }
+
+  /* Case 6: mask = {0, 4, ...}
+     npatterns = 1, nelts_per_pattern = 2.
+     This should return NULL_TREE because the index 4 may choose
+     from either arg0 or arg1 depending on vector length.  */
+  {
+    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
+    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
+    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (len, 1, 2);
+    builder.quick_push (0);
+    builder.quick_push (4);
+    vec_perm_indices sel (builder, 2, len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+    ASSERT_TRUE (res == NULL_TREE);
+  }
+
+  /* Case 7: npatterns(arg0) = 4 > npatterns(sel) = 2
+     mask = {0, len, 1, len + 1, ...}
+     sel_npatterns = 2, sel_nelts_per_pattern = 2.  */
+  {
+    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
+    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
+    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (arg0_len, 2, 2);
+    builder.quick_push (0);
+    builder.quick_push (arg0_len);
+    builder.quick_push (1);
+    builder.quick_push (arg0_len + 1);
+    vec_perm_indices sel (builder, 2, arg0_len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0),
+			    vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
+			  };
+    validate_res (2, 2, res, expected_res);
+  }
+
+  /* Case 8: sel = {0, 1, 2, ...}
+     npatterns = 1, nelts_per_pattern = 3
+     expected res: { arg0[0], arg0[1], arg0[2], ... } */
+  {
+    tree arg0 = build_vec_cst_rand (vmode, 1, 3, 2);
+    tree arg1 = build_vec_cst_rand (vmode, 1, 3, 2);
+    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (arg0_len, 1, 3);
+    builder.quick_push (0);
+    builder.quick_push (1);
+    builder.quick_push (2);
+
+    vec_perm_indices sel (builder, 2, arg0_len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg0, 1),
+			    vector_cst_elt (arg0, 2) };
+    validate_res (1, 3, res, expected_res);
+  }
+
+  /* Case 9: sel = {len, len + 1, len + 2, ... }
+     npatterns = 1, nelts_per_pattern = 3
+     expected res: { op1[0], op1[1], op1[2], ... }  */
+  {
+    tree arg0 = build_vec_cst_rand (vmode, 1, 3, 2);
+    tree arg1 = build_vec_cst_rand (vmode, 1, 3, 2);
+    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (arg0_len, 1, 3);
+    builder.quick_push (arg0_len);
+    builder.quick_push (arg0_len + 1);
+    builder.quick_push (arg0_len + 2);
+
+    vec_perm_indices sel (builder, 2, arg0_len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel, NULL);
+    tree expected_res[] = { vector_cst_elt (arg1, 0), vector_cst_elt (arg1, 1),
+			    vector_cst_elt (arg1, 2) };
+    validate_res (1, 3, res, expected_res);
+  }
+}
+
+/* Test cases where result and input vectors are VNx16QI  */
+
+static void
+test_vnx16qi (machine_mode vmode)
+{
+  /* Case 1: Leading element of arg1, stepped sequence: pattern 0 of arg0.
+     sel = {len, 0, 0, 0, 2, 0, ...}
+     npatterns = 2, nelts_per_pattern = 3.
+     Use extra pattern {0, ...} to lower number of elements per pattern.  */
+  {
+    tree arg0 = build_vec_cst_rand (vmode, 1, 3, 2);
+    tree arg1 = build_vec_cst_rand (vmode, 1, 3, 2);
+    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (arg0_len, 2, 3);
+    builder.quick_push (arg0_len);
+    int mask_elems[] = { 0, 0, 0, 2, 0 };
+    for (int i = 0; i < 5; i++)
+      builder.quick_push (mask_elems[i]);
+
+    vec_perm_indices sel (builder, 2, arg0_len);
+    const char *reason;
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel, &reason);
+
+    tree expected_res[] = { vector_cst_elt (arg1, 0), vector_cst_elt (arg0, 0),
+			    vector_cst_elt (arg0, 0), vector_cst_elt (arg0, 0),
+			    vector_cst_elt (arg0, 2), vector_cst_elt (arg0, 0)
+			  };
+    validate_res (2, 3, res, expected_res);
+  }
+
+  /* Case 2:
+     sel = { len, 0, 2, ... } npatterns = 1, nelts_per_pattern = 3.
+     This should return NULL because we cross the input vectors.
+     Because,
+     arg0_len = 16 + 16x
+     a1 = 0
+     S = 2
+     esel = arg0_len / npatterns_sel = 16+16x/1 = 16 + 16x
+     ae = 0 + (esel - 2) * S
+	= 0 + (16 + 16x - 2) * 2
+	= 28 + 32x
+     Let q1 = a1 / arg0_len = 0 /trunc (16 + 16x) = 0
+     Let qe = ae / arg0_len = (28 + 32x) /trunc (16 + 16x) = 1.
+     Since q1 != qe, we cross input vectors.
+     So return NULL_TREE.  */
+  {
+    tree arg0 = build_vec_cst_rand (vmode, 1, 3, 2);
+    tree arg1 = build_vec_cst_rand (vmode, 1, 3, 2);
+    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (arg0_len, 1, 3);
+    builder.quick_push (arg0_len);
+    builder.quick_push (0);
+    builder.quick_push (2);
+
+    vec_perm_indices sel (builder, 2, arg0_len);
+    const char *reason;
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel, &reason);
+    ASSERT_TRUE (res == NULL_TREE);
+    ASSERT_TRUE (!strcmp (reason, "crossed input vectors"));
+  }
+
+  /* Case 3: Select elements from different patterns.
+     Should return NULL.  */
+  {
+    tree op0 = build_vec_cst_rand (vmode, 2, 3, 2);
+    tree op1 = build_vec_cst_rand (vmode, 2, 3, 2);
+    poly_uint64 op0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (op0));
+
+    vec_perm_builder builder (op0_len, 2, 3);
+    builder.quick_push (op0_len);
+    int mask_elems[] = { 0, 0, 0, 1, 0 };
+    for (int i = 0; i < 5; i++)
+      builder.quick_push (mask_elems[i]);
+
+    vec_perm_indices sel (builder, 2, op0_len);
+    const char *reason;
+    tree res = fold_vec_perm_cst (TREE_TYPE (op0), op0, op1, sel, &reason);
+    ASSERT_TRUE (res == NULL_TREE);
+    ASSERT_TRUE (!strcmp (reason, "step is not multiple of npatterns"));
+  }
+
+  /* Case 4: Select pattern 0 of op0 and dup of op0[0]
+     op0, op1, sel: npatterns = 2, nelts_per_pattern = 3
+     sel = { 0, 0, 2, 0, 4, 0, ... }.
+
+     For pattern {0, 2, 4, ...}:
+     a1 = 2
+     len = 16 + 16x
+     S = 2
+     esel = len / npatterns_sel = (16 + 16x) / 2 = (8 + 8x)
+     ae = a1 + (esel - 2) * S
+	= 2 + (8 + 8x - 2) * 2
+	= 14 + 16x
+     a1 / arg0_len = 2 / (16 + 16x) = 0
+     ae / arg0_len = (14 + 16x) / (16 + 16x) = 0
+     So a1/arg0_len = ae/arg0_len = 0
+     Hence we select from first vector op0
+     S = 2, npatterns = 2.
+     Since S is multiple of npatterns(op0), we are selecting from
+     same pattern of op0.
+
+     For pattern {0, ...}, we are choosing { op0[0] ... }
+     So res will be combination of above patterns:
+     res: { op0[0], op0[0], op0[2], op0[0], op0[4], op0[0], ... }
+     with npatterns = 2, nelts_per_pattern = 3.  */
+  {
+    tree op0 = build_vec_cst_rand (vmode, 2, 3, 2);
+    tree op1 = build_vec_cst_rand (vmode, 2, 3, 2);
+    poly_uint64 op0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (op0));
+
+    vec_perm_builder builder (op0_len, 2, 3);
+    int mask_elems[] = { 0, 0, 2, 0, 4, 0 };
+    for (int i = 0; i < 6; i++)
+      builder.quick_push (mask_elems[i]);
+
+    vec_perm_indices sel (builder, 2, op0_len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (op0), op0, op1, sel);
+    tree expected_res[] = { vector_cst_elt (op0, 0), vector_cst_elt (op0, 0),
+			    vector_cst_elt (op0, 2), vector_cst_elt (op0, 0),
+			    vector_cst_elt (op0, 4), vector_cst_elt (op0, 0) };
+    validate_res (2, 3, res, expected_res);
+  }
+
+  /* Case 7: sel_npatterns > op_npatterns;
+     op0, op1: npatterns = 2, nelts_per_pattern = 3
+     sel: { 0, 0, 1, len, 2, 0, 3, len, 4, 0, 5, len, ...},
+     with npatterns = 4, nelts_per_pattern = 3.  */
+  {
+    tree op0 = build_vec_cst_rand (vmode, 2, 3, 2);
+    tree op1 = build_vec_cst_rand (vmode, 2, 3, 2);
+    poly_uint64 op0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (op0));
+
+    vec_perm_builder builder(op0_len, 4, 3);
+    // -1 is used as place holder for poly_int_cst
+    int mask_elems[] = { 0, 0, 1, -1, 2, 0, 3, -1, 4, 0, 5, -1 };
+    for (int i = 0; i < 12; i++)
+      builder.quick_push ((mask_elems[i] == -1) ? op0_len : mask_elems[i]);
+
+    vec_perm_indices sel (builder, 2, op0_len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (op0), op0, op1, sel);
+    tree expected_res[] = { vector_cst_elt (op0, 0), vector_cst_elt (op0, 0),
+			    vector_cst_elt (op0, 1), vector_cst_elt (op1, 0),
+			    vector_cst_elt (op0, 2), vector_cst_elt (op0, 0),
+			    vector_cst_elt (op0, 3), vector_cst_elt (op1, 0),
+			    vector_cst_elt (op0, 4), vector_cst_elt (op0, 0),
+			    vector_cst_elt (op0, 5), vector_cst_elt (op1, 0) };
+    validate_res (4, 3, res, expected_res);
+  }
+}
+
+/* Test cases where result is VNx4SI and input vectors are V4SI.  */
+
+static void
+test_vnx4si_v4si (machine_mode vnx4si_mode, machine_mode v4si_mode)
+{
+  /* Case 1:
+     sel = { 0, 4, 1, 5, ... }
+     res = { op0[0], op1[0], op0[1], op1[1], ...} // (4, 1)  */
+  {
+    tree arg0 = build_vec_cst_rand (v4si_mode, 4, 1, 0);
+    tree arg1 = build_vec_cst_rand (v4si_mode, 4, 1, 0);
+
+    tree inner_type
+      = lang_hooks.types.type_for_mode (GET_MODE_INNER (vnx4si_mode), 1);
+    tree res_type = build_vector_type_for_mode (inner_type, vnx4si_mode);
+
+    poly_uint64 res_len = TYPE_VECTOR_SUBPARTS (res_type);
+    vec_perm_builder builder (res_len, 4, 1);
+    builder.quick_push (0);
+    builder.quick_push (4);
+    builder.quick_push (1);
+    builder.quick_push (5);
+
+    vec_perm_indices sel (builder, 2, res_len);
+    tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel);
+    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0),
+			    vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
+			  };
+    validate_res (4, 1, res, expected_res);
+  }
+
+  /* Case 2: Same as case 1, but contains an out of bounds access which
+     should wrap around.
+     sel = {0, 8, 4, 12, ...} (4, 1)
+     res = { op0[0], op0[0], op1[0], op1[0], ... } (4, 1).  */
+  {
+    tree arg0 = build_vec_cst_rand (v4si_mode, 4, 1, 0);
+    tree arg1 = build_vec_cst_rand (v4si_mode, 4, 1, 0);
+
+    tree inner_type
+      = lang_hooks.types.type_for_mode (GET_MODE_INNER (vnx4si_mode), 1);
+    tree res_type = build_vector_type_for_mode (inner_type, vnx4si_mode);
+
+    poly_uint64 res_len = TYPE_VECTOR_SUBPARTS (res_type);
+    vec_perm_builder builder (res_len, 4, 1);
+    builder.quick_push (0);
+    builder.quick_push (8);
+    builder.quick_push (4);
+    builder.quick_push (12);
+
+    vec_perm_indices sel (builder, 2, res_len);
+    tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel);
+    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg0, 0),
+			    vector_cst_elt (arg1, 0), vector_cst_elt (arg1, 0)
+			  };
+    validate_res (4, 1, res, expected_res);
+  }
+}
+
+/* Test cases where result is V4SI and input vectors are VNx4SI.  */
+
+static void
+test_v4si_vnx4si (machine_mode v4si_mode, machine_mode vnx4si_mode)
+{
+  /* Case 1:
+     sel = { 0, 1, 2, 3}
+     res = { op0[0], op0[1], op0[2], op0[3] }.  */
+  {
+    tree arg0 = build_vec_cst_rand (vnx4si_mode, 4, 1);
+    tree arg1 = build_vec_cst_rand (vnx4si_mode, 4, 1);
+
+    tree inner_type
+      = lang_hooks.types.type_for_mode (GET_MODE_INNER (v4si_mode), 1);
+    tree res_type = build_vector_type_for_mode (inner_type, v4si_mode);
+
+    poly_uint64 res_len = TYPE_VECTOR_SUBPARTS (res_type);
+    vec_perm_builder builder (res_len, 4, 1);
+    for (int i = 0; i < 4; i++)
+      builder.quick_push (i);
+
+    vec_perm_indices sel (builder, 2, res_len);
+    tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel);
+    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg0, 1),
+			    vector_cst_elt (arg0, 2), vector_cst_elt (arg0, 3) };
+    validate_res_vls (res, expected_res, 4);
+  }
+
+  /* Case 2: Same as Case 1, but crossing input vector.
+     sel = {0, 2, 4, 6}
+     In this case,the index 4 is ambiguous since len = 4 + 4x.
+     If x = 0 at runtime, we choose op1[0]
+     For x > 0 at runtime, we choose op0[4]
+     Since we cannot determine, which vector to choose from during compile time,
+     should return NULL_TREE.  */
+  {
+    tree arg0 = build_vec_cst_rand (vnx4si_mode, 4, 1);
+    tree arg1 = build_vec_cst_rand (vnx4si_mode, 4, 1);
+
+    tree inner_type
+      = lang_hooks.types.type_for_mode (GET_MODE_INNER (v4si_mode), 1);
+    tree res_type = build_vector_type_for_mode (inner_type, v4si_mode);
+
+    poly_uint64 res_len = TYPE_VECTOR_SUBPARTS (res_type);
+    vec_perm_builder builder (res_len, 4, 1);
+    for (int i = 0; i < 8; i += 2)
+      builder.quick_push (i);
+
+    vec_perm_indices sel (builder, 2, res_len);
+    const char *reason;
+    tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel, &reason);
+    ASSERT_TRUE (res == NULL_TREE);
+    ASSERT_TRUE (!strcmp (reason, "cannot divide selector element by arg len"));
+  }
+}
+
+/* Helper function to get a vector mode that has element
+   mode as INNER_MODE and GET_MODE_NUNITS equal to len.  */
+
+static machine_mode
+get_vmode (machine_mode inner_mode, poly_uint64 len)
+{
+  machine_mode vmode;
+  FOR_EACH_MODE_IN_CLASS (vmode, MODE_VECTOR_INT)
+    if (GET_MODE_INNER (vmode) == inner_mode
+	&& known_eq (GET_MODE_NUNITS (vmode), len))
+      return vmode;
+  return E_VOIDmode;
+}
+
+/* Invoke tests for fold_vec_perm_cst.  */
+
+static void
+test ()
+{
+  /* Conditionally execute fold_vec_perm_cst tests, if target supports
+     VLA vectors. Use a compile time check so we avoid instantiating
+     poly_uint64 with N > 1 on targets that do not support VLA vectors.  */
+  if constexpr (poly_int_traits<poly_uint64>::num_coeffs > 1)
+    {
+      machine_mode vnx4si_mode = get_vmode (SImode, poly_uint64 (4, 4));
+      if (vnx4si_mode == E_VOIDmode)
+	return;
+      machine_mode vnx16qi_mode = get_vmode (QImode, poly_uint64 (16, 16));
+      if (vnx16qi_mode == E_VOIDmode)
+	return;
+      machine_mode v4si_mode = get_vmode (SImode, 4);
+      if (v4si_mode == E_VOIDmode)
+	return;
+
+      test_vnx4si (vnx4si_mode);
+      test_vnx16qi (vnx16qi_mode);
+      test_vnx4si_v4si (vnx4si_mode, v4si_mode);
+      test_v4si_vnx4si (v4si_mode, vnx4si_mode);
+    }
+}
+}; // end of test_fold_vec_perm_cst namespace
+
 /* Verify that various binary operations on vectors are folded
    correctly.  */
 
@@ -16943,6 +17660,7 @@ fold_const_cc_tests ()
   test_arithmetic_folding ();
   test_vector_folding ();
   test_vec_duplicate_folding ();
+  test_fold_vec_perm_cst::test ();
 }
 
 } // namespace selftest

Richard Sandiford Aug. 8, 2023, 9:57 a.m. UTC | #9

Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> On Fri, 4 Aug 2023 at 20:36, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Full review this time, sorry for the skipping the tests earlier.
> Thanks for the detailed review! Please find my responses inline below.
>>
>> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> > diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
>> > index 7e5494dfd39..680d0e54fd4 100644
>> > --- a/gcc/fold-const.cc
>> > +++ b/gcc/fold-const.cc
>> > @@ -85,6 +85,10 @@ along with GCC; see the file COPYING3.  If not see
>> >  #include "vec-perm-indices.h"
>> >  #include "asan.h"
>> >  #include "gimple-range.h"
>> > +#include <algorithm>
>>
>> This should be included by defining INCLUDE_ALGORITHM instead.
> Done. Just curious, why do we use this macro instead of directly
> including <algorithm> ?

AIUI, one of the reasons for having every file start with includes
of config.h and (b)system.h, in that order, is to ensure that a small
and predictable amount of GCC-specific stuff happens before including
the system header files.  That helps to avoid OS-specific clashes between
GCC code and system headers.

But another major reason is that system.h ends by poisoning a lot of
stuff that system headers would be entitled to use.

>> > +  tree_vector_builder builder (vectype, npatterns, nelts_per_pattern);
>> > +
>> > +  // Fill a0 for each pattern
>> > +  for (unsigned i = 0; i < npatterns; i++)
>> > +    builder.quick_push (build_int_cst (inner_type, rand () % 100));
>> > +
>> > +  if (nelts_per_pattern == 1)
>> > +    return builder.build ();
>> > +
>> > +  // Fill a1 for each pattern
>> > +  for (unsigned i = 0; i < npatterns; i++)
>> > +    builder.quick_push (build_int_cst (inner_type, rand () % 100));
>> > +
>> > +  if (nelts_per_pattern == 2)
>> > +    return builder.build ();
>> > +
>> > +  for (unsigned i = npatterns * 2; i < npatterns * nelts_per_pattern; i++)
>> > +    {
>> > +      tree prev_elem = builder[i - npatterns];
>> > +      int prev_elem_val = TREE_INT_CST_LOW (prev_elem);
>> > +      int val = prev_elem_val + S;
>> > +      builder.quick_push (build_int_cst (inner_type, val));
>> > +    }
>> > +
>> > +  return builder.build ();
>> > +}
>> > +
>> > +static void
>> > +validate_res (unsigned npatterns, unsigned nelts_per_pattern,
>> > +           tree res, tree *expected_res)
>> > +{
>> > +  ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == npatterns);
>> > +  ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == nelts_per_pattern);
>>
>> I don't think this is safe when the inputs are randomised.  E.g. we
>> could by chance end up with a vector of all zeros, which would have
>> a single pattern and a single element per pattern, regardless of the
>> shapes of the inputs.
>>
>> Given the way that vector_builder<T, Shape, Derived>::finalize
>> canonicalises the encoding, it should be safe to use:
>>
>> * VECTOR_CST_NPATTERNS (res) <= npatterns
>> * vector_cst_encoded_nelts (res) <= npatterns * nelts_per_pattern
>>
>> If we do that then...
>>
>> > +
>> > +  for (unsigned i = 0; i < vector_cst_encoded_nelts (res); i++)
>>
>> ...this loop bound should be npatterns * nelts_per_pattern instead.
> Ah indeed. Fixed, thanks.

The patch instead does:

  ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) <= npatterns);
  ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) <= nelts_per_pattern);

I think the version I suggested is safer.  It's not the goal of the
canonicalisation algorithm to reduce both npattners and nelts_per_pattern
individually.  The algorithm can increase nelts_per_pattern in order
to decrease npatterns.

>> > +  {
>> > +    tree arg0 = build_vec_cst_rand (integer_type_node, 1, 3, 2);
>> > +    tree arg1 = build_vec_cst_rand (integer_type_node, 1, 3, 2);
>> > +    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
>> > +
>> > +    vec_perm_builder builder (arg0_len, 1, 3);
>> > +    builder.quick_push (arg0_len);
>> > +    builder.quick_push (arg0_len + 1);
>> > +    builder.quick_push (arg0_len + 2);
>> > +
>> > +    vec_perm_indices sel (builder, 2, arg0_len);
>> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel, NULL, true);
>> > +    tree expected_res[] = { vector_cst_elt (arg1, 0), vector_cst_elt (arg1, 1),
>> > +                         vector_cst_elt (arg1, 2) };
>> > +    validate_res (1, 3, res, expected_res);
>> > +  }
>> > +
>> > +  /* Case 3: Leading element of arg1, stepped sequence: pattern 0 of arg0.
>> > +     sel = {len, 0, 0, 0, 2, 0, ...}
>> > +     npatterns = 2, nelts_per_pattern = 3.
>> > +     Use extra pattern {0, ...} to lower number of elements per pattern.  */
>> > +  {
>> > +    tree arg0 = build_vec_cst_rand (char_type_node, 1, 3, 2);
>> > +    tree arg1 = build_vec_cst_rand (char_type_node, 1, 3, 2);
>> > +    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
>> > +
>> > +    vec_perm_builder builder (arg0_len, 2, 3);
>> > +    builder.quick_push (arg0_len);
>> > +    int mask_elems[] = { 0, 0, 0, 2, 0 };
>> > +    for (int i = 0; i < 5; i++)
>> > +      builder.quick_push (mask_elems[i]);
>>
>> This leaves one of the elements unspecified.
> Sorry, I didn't understand.
> It first pushes len in:
> builder.quick_push (arg0_len)
> and then pushes the remaining indices in the loop:
> for (int i = 0; i < 5; i++)
>   builder.quick_push (mask_elems[i])
> So overall, builder will have 6 elements: {len, 0, 0, 0, 2, 0}

Ah, right.  But in that case I think it would be clearer to put arg0_len
in mask_elems.

> +/* Try to fold permutation of ARG0 and ARG1 with SEL selector when
> +   the input vectors are VECTOR_CST. Return NULL_TREE otherwise.
> +   REASON has same purpose as described in
> +   valid_mask_for_fold_vec_perm_cst_p.  */
> +
> +

Nit: too much vertical space.

> +/* Invoke tests for fold_vec_perm_cst.  */
> +
> +static void
> +test ()
> +{
> +  /* Conditionally execute fold_vec_perm_cst tests, if target supports
> +     VLA vectors. Use a compile time check so we avoid instantiating
> +     poly_uint64 with N > 1 on targets that do not support VLA vectors.  */
> +  if constexpr (poly_int_traits<poly_uint64>::num_coeffs > 1)

That's a C++17 feature, so we can't use it.

Instead, how about:

static bool
is_simple_vla_size (poly_uint64 size)
{
  if (size.is_constant ())
    return false;
  for (int i = 1; i < ARRAY_SIZE (size.coeffs); ++i)
    if (size[i] != (i <= 1 ? size[0] : 0))
      return false;
  return true;
}


  FOR_EACH_MODE_IN_CLASS (mode, MODE_VECTOR_INT)
    {
      auto nunits = GET_MODE_NUNITS (mode);
      if (!is_simple_vla_size (nunits))
        continue;
      if (nunits[0] ...)
        test_... (mode);
      ...

    }

test_vnx4si_v4si and test_v4si_vnx4si look good.  But with the
loop structure above, I think we can apply the test_vnx4si and
test_vnx16qi to more cases.  So the classification isn't the
exact number of elements, but instead a limit.

I think the nunits[0] conditions for test_vnx4si are as follows
(inspection only, so could be wrong):

> +/* Test cases where result and input vectors are VNx4SI  */
> +
> +static void
> +test_vnx4si (machine_mode vmode)
> +{
> +  /* Case 1: mask = {0, ...} */
> +  {
> +    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
> +    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
> +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    vec_perm_builder builder (len, 1, 1);
> +    builder.quick_push (0);
> +    vec_perm_indices sel (builder, 2, len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +    tree expected_res[] = { vector_cst_elt (res, 0) };
> +    validate_res (1, 1, res, expected_res);
> +  }

nunits[0] >= 2 (could be all nunits if the inputs had nelts_per_pattern==1,
which I think would be better)

> +
> +  /* Case 2: mask = {len, ...} */
> +  {
> +    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
> +    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
> +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    vec_perm_builder builder (len, 1, 1);
> +    builder.quick_push (len);
> +    vec_perm_indices sel (builder, 2, len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +    tree expected_res[] = { vector_cst_elt (arg1, 0) };
> +    validate_res (1, 1, res, expected_res);
> +  }

same

> +
> +  /* Case 3: mask = { 0, len, ... } */
> +  {
> +    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
> +    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
> +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    vec_perm_builder builder (len, 2, 1);
> +    builder.quick_push (0);
> +    builder.quick_push (len);
> +    vec_perm_indices sel (builder, 2, len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0) };
> +    validate_res (2, 1, res, expected_res);
> +  }

nunits[0] >= 2

> +
> +  /* Case 4: mask = { 0, len, 1, len+1, ... } */
> +  {
> +    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
> +    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
> +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    vec_perm_builder builder (len, 2, 2);
> +    builder.quick_push (0);
> +    builder.quick_push (len);
> +    builder.quick_push (1);
> +    builder.quick_push (len + 1);
> +    vec_perm_indices sel (builder, 2, len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0),
> +			    vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
> +			  };
> +    validate_res (2, 2, res, expected_res);
> +  }

nunits[0] >= 2

> +
> +  /* Case 5: mask = { 0, len, 1, len+1, .... }
> +     npatterns = 4, nelts_per_pattern = 1 */
> +  {
> +    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
> +    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
> +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    vec_perm_builder builder (len, 4, 1);
> +    builder.quick_push (0);
> +    builder.quick_push (len);
> +    builder.quick_push (1);
> +    builder.quick_push (len + 1);
> +    vec_perm_indices sel (builder, 2, len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0),
> +			    vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
> +			  };
> +    validate_res (4, 1, res, expected_res);
> +  }

nunits[0] >= 4

> +
> +  /* Case 6: mask = {0, 4, ...}
> +     npatterns = 1, nelts_per_pattern = 2.
> +     This should return NULL_TREE because the index 4 may choose
> +     from either arg0 or arg1 depending on vector length.  */
> +  {
> +    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
> +    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
> +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    vec_perm_builder builder (len, 1, 2);
> +    builder.quick_push (0);
> +    builder.quick_push (4);
> +    vec_perm_indices sel (builder, 2, len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +    ASSERT_TRUE (res == NULL_TREE);
> +  }

nunits[0] == 2 or == 4 (could be <= 4 if the inputs had nelts_per_pattern==1)

> +
> +  /* Case 7: npatterns(arg0) = 4 > npatterns(sel) = 2
> +     mask = {0, len, 1, len + 1, ...}
> +     sel_npatterns = 2, sel_nelts_per_pattern = 2.  */
> +  {
> +    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
> +    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
> +    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    vec_perm_builder builder (arg0_len, 2, 2);
> +    builder.quick_push (0);
> +    builder.quick_push (arg0_len);
> +    builder.quick_push (1);
> +    builder.quick_push (arg0_len + 1);
> +    vec_perm_indices sel (builder, 2, arg0_len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0),
> +			    vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
> +			  };
> +    validate_res (2, 2, res, expected_res);
> +  }

nunits[0] >= 2


> +
> +  /* Case 8: sel = {0, 1, 2, ...}
> +     npatterns = 1, nelts_per_pattern = 3
> +     expected res: { arg0[0], arg0[1], arg0[2], ... } */
> +  {
> +    tree arg0 = build_vec_cst_rand (vmode, 1, 3, 2);
> +    tree arg1 = build_vec_cst_rand (vmode, 1, 3, 2);
> +    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    vec_perm_builder builder (arg0_len, 1, 3);
> +    builder.quick_push (0);
> +    builder.quick_push (1);
> +    builder.quick_push (2);
> +
> +    vec_perm_indices sel (builder, 2, arg0_len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg0, 1),
> +			    vector_cst_elt (arg0, 2) };
> +    validate_res (1, 3, res, expected_res);
> +  }

all nunits[0]

> +
> +  /* Case 9: sel = {len, len + 1, len + 2, ... }
> +     npatterns = 1, nelts_per_pattern = 3
> +     expected res: { op1[0], op1[1], op1[2], ... }  */
> +  {
> +    tree arg0 = build_vec_cst_rand (vmode, 1, 3, 2);
> +    tree arg1 = build_vec_cst_rand (vmode, 1, 3, 2);
> +    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    vec_perm_builder builder (arg0_len, 1, 3);
> +    builder.quick_push (arg0_len);
> +    builder.quick_push (arg0_len + 1);
> +    builder.quick_push (arg0_len + 2);
> +
> +    vec_perm_indices sel (builder, 2, arg0_len);
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel, NULL);
> +    tree expected_res[] = { vector_cst_elt (arg1, 0), vector_cst_elt (arg1, 1),
> +			    vector_cst_elt (arg1, 2) };
> +    validate_res (1, 3, res, expected_res);
> +  }

all nunits[0]

Same idea for the others.

Thanks,
Richard

Prathamesh Kulkarni Aug. 10, 2023, 2:33 p.m. UTC | #10

On Tue, 8 Aug 2023 at 15:27, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > On Fri, 4 Aug 2023 at 20:36, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Full review this time, sorry for the skipping the tests earlier.
> > Thanks for the detailed review! Please find my responses inline below.
> >>
> >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> > diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> >> > index 7e5494dfd39..680d0e54fd4 100644
> >> > --- a/gcc/fold-const.cc
> >> > +++ b/gcc/fold-const.cc
> >> > @@ -85,6 +85,10 @@ along with GCC; see the file COPYING3.  If not see
> >> >  #include "vec-perm-indices.h"
> >> >  #include "asan.h"
> >> >  #include "gimple-range.h"
> >> > +#include <algorithm>
> >>
> >> This should be included by defining INCLUDE_ALGORITHM instead.
> > Done. Just curious, why do we use this macro instead of directly
> > including <algorithm> ?
>
> AIUI, one of the reasons for having every file start with includes
> of config.h and (b)system.h, in that order, is to ensure that a small
> and predictable amount of GCC-specific stuff happens before including
> the system header files.  That helps to avoid OS-specific clashes between
> GCC code and system headers.
>
> But another major reason is that system.h ends by poisoning a lot of
> stuff that system headers would be entitled to use.
Ah OK, thanks for the clarification!
>
> >> > +  tree_vector_builder builder (vectype, npatterns, nelts_per_pattern);
> >> > +
> >> > +  // Fill a0 for each pattern
> >> > +  for (unsigned i = 0; i < npatterns; i++)
> >> > +    builder.quick_push (build_int_cst (inner_type, rand () % 100));
> >> > +
> >> > +  if (nelts_per_pattern == 1)
> >> > +    return builder.build ();
> >> > +
> >> > +  // Fill a1 for each pattern
> >> > +  for (unsigned i = 0; i < npatterns; i++)
> >> > +    builder.quick_push (build_int_cst (inner_type, rand () % 100));
> >> > +
> >> > +  if (nelts_per_pattern == 2)
> >> > +    return builder.build ();
> >> > +
> >> > +  for (unsigned i = npatterns * 2; i < npatterns * nelts_per_pattern; i++)
> >> > +    {
> >> > +      tree prev_elem = builder[i - npatterns];
> >> > +      int prev_elem_val = TREE_INT_CST_LOW (prev_elem);
> >> > +      int val = prev_elem_val + S;
> >> > +      builder.quick_push (build_int_cst (inner_type, val));
> >> > +    }
> >> > +
> >> > +  return builder.build ();
> >> > +}
> >> > +
> >> > +static void
> >> > +validate_res (unsigned npatterns, unsigned nelts_per_pattern,
> >> > +           tree res, tree *expected_res)
> >> > +{
> >> > +  ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == npatterns);
> >> > +  ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == nelts_per_pattern);
> >>
> >> I don't think this is safe when the inputs are randomised.  E.g. we
> >> could by chance end up with a vector of all zeros, which would have
> >> a single pattern and a single element per pattern, regardless of the
> >> shapes of the inputs.
> >>
> >> Given the way that vector_builder<T, Shape, Derived>::finalize
> >> canonicalises the encoding, it should be safe to use:
> >>
> >> * VECTOR_CST_NPATTERNS (res) <= npatterns
> >> * vector_cst_encoded_nelts (res) <= npatterns * nelts_per_pattern
> >>
> >> If we do that then...
> >>
> >> > +
> >> > +  for (unsigned i = 0; i < vector_cst_encoded_nelts (res); i++)
> >>
> >> ...this loop bound should be npatterns * nelts_per_pattern instead.
> > Ah indeed. Fixed, thanks.
>
> The patch instead does:
>
>   ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) <= npatterns);
>   ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) <= nelts_per_pattern);
>
> I think the version I suggested is safer.  It's not the goal of the
> canonicalisation algorithm to reduce both npattners and nelts_per_pattern
> individually.  The algorithm can increase nelts_per_pattern in order
> to decrease npatterns.
Oops, sorry I misread, will fix in the next patch.
>
> >> > +  {
> >> > +    tree arg0 = build_vec_cst_rand (integer_type_node, 1, 3, 2);
> >> > +    tree arg1 = build_vec_cst_rand (integer_type_node, 1, 3, 2);
> >> > +    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> >> > +
> >> > +    vec_perm_builder builder (arg0_len, 1, 3);
> >> > +    builder.quick_push (arg0_len);
> >> > +    builder.quick_push (arg0_len + 1);
> >> > +    builder.quick_push (arg0_len + 2);
> >> > +
> >> > +    vec_perm_indices sel (builder, 2, arg0_len);
> >> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel, NULL, true);
> >> > +    tree expected_res[] = { vector_cst_elt (arg1, 0), vector_cst_elt (arg1, 1),
> >> > +                         vector_cst_elt (arg1, 2) };
> >> > +    validate_res (1, 3, res, expected_res);
> >> > +  }
> >> > +
> >> > +  /* Case 3: Leading element of arg1, stepped sequence: pattern 0 of arg0.
> >> > +     sel = {len, 0, 0, 0, 2, 0, ...}
> >> > +     npatterns = 2, nelts_per_pattern = 3.
> >> > +     Use extra pattern {0, ...} to lower number of elements per pattern.  */
> >> > +  {
> >> > +    tree arg0 = build_vec_cst_rand (char_type_node, 1, 3, 2);
> >> > +    tree arg1 = build_vec_cst_rand (char_type_node, 1, 3, 2);
> >> > +    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> >> > +
> >> > +    vec_perm_builder builder (arg0_len, 2, 3);
> >> > +    builder.quick_push (arg0_len);
> >> > +    int mask_elems[] = { 0, 0, 0, 2, 0 };
> >> > +    for (int i = 0; i < 5; i++)
> >> > +      builder.quick_push (mask_elems[i]);
> >>
> >> This leaves one of the elements unspecified.
> > Sorry, I didn't understand.
> > It first pushes len in:
> > builder.quick_push (arg0_len)
> > and then pushes the remaining indices in the loop:
> > for (int i = 0; i < 5; i++)
> >   builder.quick_push (mask_elems[i])
> > So overall, builder will have 6 elements: {len, 0, 0, 0, 2, 0}
>
> Ah, right.  But in that case I think it would be clearer to put arg0_len
> in mask_elems.
Right of course, sorry. For some reason I thought earlier I couldn't
initialize array of poly_uint64 :/
>
> > +/* Try to fold permutation of ARG0 and ARG1 with SEL selector when
> > +   the input vectors are VECTOR_CST. Return NULL_TREE otherwise.
> > +   REASON has same purpose as described in
> > +   valid_mask_for_fold_vec_perm_cst_p.  */
> > +
> > +
>
> Nit: too much vertical space.
Will fix in next patch, thanks.
>
> > +/* Invoke tests for fold_vec_perm_cst.  */
> > +
> > +static void
> > +test ()
> > +{
> > +  /* Conditionally execute fold_vec_perm_cst tests, if target supports
> > +     VLA vectors. Use a compile time check so we avoid instantiating
> > +     poly_uint64 with N > 1 on targets that do not support VLA vectors.  */
> > +  if constexpr (poly_int_traits<poly_uint64>::num_coeffs > 1)
>
> That's a C++17 feature, so we can't use it.
>
> Instead, how about:
>
> static bool
> is_simple_vla_size (poly_uint64 size)
> {
>   if (size.is_constant ())
>     return false;
>   for (int i = 1; i < ARRAY_SIZE (size.coeffs); ++i)
>     if (size[i] != (i <= 1 ? size[0] : 0))
Just wondering is this should be (i == 1 ? size[0] : 0) since i is
initialized to 1 ?
IIUC, is_simple_vla_size should return true for polynomials of first
degree and having same coeff like 4 + 4x ?
>       return false;
>   return true;
> }
>
>
>   FOR_EACH_MODE_IN_CLASS (mode, MODE_VECTOR_INT)
>     {
>       auto nunits = GET_MODE_NUNITS (mode);
>       if (!is_simple_vla_size (nunits))
>         continue;
>       if (nunits[0] ...)
>         test_... (mode);
>       ...
>
>     }
>
> test_vnx4si_v4si and test_v4si_vnx4si look good.  But with the
> loop structure above, I think we can apply the test_vnx4si and
> test_vnx16qi to more cases.  So the classification isn't the
> exact number of elements, but instead a limit.
>
> I think the nunits[0] conditions for test_vnx4si are as follows
> (inspection only, so could be wrong):
>
> > +/* Test cases where result and input vectors are VNx4SI  */
> > +
> > +static void
> > +test_vnx4si (machine_mode vmode)
> > +{
> > +  /* Case 1: mask = {0, ...} */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
> > +    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
> > +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (len, 1, 1);
> > +    builder.quick_push (0);
> > +    vec_perm_indices sel (builder, 2, len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +
> > +    tree expected_res[] = { vector_cst_elt (res, 0) };
This should be { vector_cst_elt (arg0, 0) }; will fix in next patch.
> > +    validate_res (1, 1, res, expected_res);
> > +  }
>
> nunits[0] >= 2 (could be all nunits if the inputs had nelts_per_pattern==1,
> which I think would be better)
IIUC, the vectors that can be used for a particular test should have
nunits[0] >= res_npatterns,
where res_npatterns is as computed in fold_vec_perm_cst without the
canonicalization ?
For above test -- res_npatterns = max(2, max (2, 1)) == 2, so we
require nunits[0] >= 2 ?
Which implies we can use above test for vectors with length 2 + 2x, 4 + 4x, etc.

Sorry if this sounds like a silly question -- Won't nunits[0] >= 2
cover all nunits,
since a vector, at a minimum, will contain 2 elements ?

I will send a patch shortly addressing above suggestions.

Thanks,
Prathamesh
>
> > +
> > +  /* Case 2: mask = {len, ...} */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
> > +    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
> > +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (len, 1, 1);
> > +    builder.quick_push (len);
> > +    vec_perm_indices sel (builder, 2, len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +
> > +    tree expected_res[] = { vector_cst_elt (arg1, 0) };
> > +    validate_res (1, 1, res, expected_res);
> > +  }
>
> same
>
> > +
> > +  /* Case 3: mask = { 0, len, ... } */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
> > +    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
> > +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (len, 2, 1);
> > +    builder.quick_push (0);
> > +    builder.quick_push (len);
> > +    vec_perm_indices sel (builder, 2, len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +
> > +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0) };
> > +    validate_res (2, 1, res, expected_res);
> > +  }
>
> nunits[0] >= 2
>
> > +
> > +  /* Case 4: mask = { 0, len, 1, len+1, ... } */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
> > +    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
> > +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (len, 2, 2);
> > +    builder.quick_push (0);
> > +    builder.quick_push (len);
> > +    builder.quick_push (1);
> > +    builder.quick_push (len + 1);
> > +    vec_perm_indices sel (builder, 2, len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +
> > +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0),
> > +                         vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
> > +                       };
> > +    validate_res (2, 2, res, expected_res);
> > +  }
>
> nunits[0] >= 2
>
> > +
> > +  /* Case 5: mask = { 0, len, 1, len+1, .... }
> > +     npatterns = 4, nelts_per_pattern = 1 */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
> > +    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
> > +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (len, 4, 1);
> > +    builder.quick_push (0);
> > +    builder.quick_push (len);
> > +    builder.quick_push (1);
> > +    builder.quick_push (len + 1);
> > +    vec_perm_indices sel (builder, 2, len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +
> > +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0),
> > +                         vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
> > +                       };
> > +    validate_res (4, 1, res, expected_res);
> > +  }
>
> nunits[0] >= 4
>
> > +
> > +  /* Case 6: mask = {0, 4, ...}
> > +     npatterns = 1, nelts_per_pattern = 2.
> > +     This should return NULL_TREE because the index 4 may choose
> > +     from either arg0 or arg1 depending on vector length.  */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
> > +    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
> > +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (len, 1, 2);
> > +    builder.quick_push (0);
> > +    builder.quick_push (4);
> > +    vec_perm_indices sel (builder, 2, len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +    ASSERT_TRUE (res == NULL_TREE);
> > +  }
>
> nunits[0] == 2 or == 4 (could be <= 4 if the inputs had nelts_per_pattern==1)
>
> > +
> > +  /* Case 7: npatterns(arg0) = 4 > npatterns(sel) = 2
> > +     mask = {0, len, 1, len + 1, ...}
> > +     sel_npatterns = 2, sel_nelts_per_pattern = 2.  */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
> > +    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
> > +    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (arg0_len, 2, 2);
> > +    builder.quick_push (0);
> > +    builder.quick_push (arg0_len);
> > +    builder.quick_push (1);
> > +    builder.quick_push (arg0_len + 1);
> > +    vec_perm_indices sel (builder, 2, arg0_len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +
> > +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0),
> > +                         vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
> > +                       };
> > +    validate_res (2, 2, res, expected_res);
> > +  }
>
> nunits[0] >= 2
>
>
> > +
> > +  /* Case 8: sel = {0, 1, 2, ...}
> > +     npatterns = 1, nelts_per_pattern = 3
> > +     expected res: { arg0[0], arg0[1], arg0[2], ... } */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (vmode, 1, 3, 2);
> > +    tree arg1 = build_vec_cst_rand (vmode, 1, 3, 2);
> > +    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (arg0_len, 1, 3);
> > +    builder.quick_push (0);
> > +    builder.quick_push (1);
> > +    builder.quick_push (2);
> > +
> > +    vec_perm_indices sel (builder, 2, arg0_len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg0, 1),
> > +                         vector_cst_elt (arg0, 2) };
> > +    validate_res (1, 3, res, expected_res);
> > +  }
>
> all nunits[0]
>
> > +
> > +  /* Case 9: sel = {len, len + 1, len + 2, ... }
> > +     npatterns = 1, nelts_per_pattern = 3
> > +     expected res: { op1[0], op1[1], op1[2], ... }  */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (vmode, 1, 3, 2);
> > +    tree arg1 = build_vec_cst_rand (vmode, 1, 3, 2);
> > +    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (arg0_len, 1, 3);
> > +    builder.quick_push (arg0_len);
> > +    builder.quick_push (arg0_len + 1);
> > +    builder.quick_push (arg0_len + 2);
> > +
> > +    vec_perm_indices sel (builder, 2, arg0_len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel, NULL);
> > +    tree expected_res[] = { vector_cst_elt (arg1, 0), vector_cst_elt (arg1, 1),
> > +                         vector_cst_elt (arg1, 2) };
> > +    validate_res (1, 3, res, expected_res);
> > +  }
>
> all nunits[0]
>
> Same idea for the others.
>
> Thanks,
> Richard

Richard Sandiford Aug. 10, 2023, 3:57 p.m. UTC | #11

Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> static bool
>> is_simple_vla_size (poly_uint64 size)
>> {
>>   if (size.is_constant ())
>>     return false;
>>   for (int i = 1; i < ARRAY_SIZE (size.coeffs); ++i)
>>     if (size[i] != (i <= 1 ? size[0] : 0))
> Just wondering is this should be (i == 1 ? size[0] : 0) since i is
> initialized to 1 ?

Both work.  I prefer <= 1 because it doesn't depend on the micro
optimisation to start at coefficient 1.  In a theoretical 3-indeterminate
poly_int, we want the first 2 coefficients to be nonzero and the rest to
be zero.

> IIUC, is_simple_vla_size should return true for polynomials of first
> degree and having same coeff like 4 + 4x ?

FWIW, poly_int only supports first-degree polynomials at the moment.
coeffs>2 means there is more than one indeterminate, rather than a
higher power.

>>       return false;
>>   return true;
>> }
>>
>>
>>   FOR_EACH_MODE_IN_CLASS (mode, MODE_VECTOR_INT)
>>     {
>>       auto nunits = GET_MODE_NUNITS (mode);
>>       if (!is_simple_vla_size (nunits))
>>         continue;
>>       if (nunits[0] ...)
>>         test_... (mode);
>>       ...
>>
>>     }
>>
>> test_vnx4si_v4si and test_v4si_vnx4si look good.  But with the
>> loop structure above, I think we can apply the test_vnx4si and
>> test_vnx16qi to more cases.  So the classification isn't the
>> exact number of elements, but instead a limit.
>>
>> I think the nunits[0] conditions for test_vnx4si are as follows
>> (inspection only, so could be wrong):
>>
>> > +/* Test cases where result and input vectors are VNx4SI  */
>> > +
>> > +static void
>> > +test_vnx4si (machine_mode vmode)
>> > +{
>> > +  /* Case 1: mask = {0, ...} */
>> > +  {
>> > +    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
>> > +    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
>> > +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
>> > +
>> > +    vec_perm_builder builder (len, 1, 1);
>> > +    builder.quick_push (0);
>> > +    vec_perm_indices sel (builder, 2, len);
>> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
>> > +
>> > +    tree expected_res[] = { vector_cst_elt (res, 0) };
> This should be { vector_cst_elt (arg0, 0) }; will fix in next patch.
>> > +    validate_res (1, 1, res, expected_res);
>> > +  }
>>
>> nunits[0] >= 2 (could be all nunits if the inputs had nelts_per_pattern==1,
>> which I think would be better)
> IIUC, the vectors that can be used for a particular test should have
> nunits[0] >= res_npatterns,
> where res_npatterns is as computed in fold_vec_perm_cst without the
> canonicalization ?
> For above test -- res_npatterns = max(2, max (2, 1)) == 2, so we
> require nunits[0] >= 2 ?
> Which implies we can use above test for vectors with length 2 + 2x, 4 + 4x, etc.

Right, that's what I meant.  With the inputs as they stand it has to be
nunits[0] >= 2.  We need that form the inputs correctly.  But if the
inputs instead had nelts_per_pattern == 1, the test would work for all
nunits.

> Sorry if this sounds like a silly question -- Won't nunits[0] >= 2
> cover all nunits,
> since a vector, at a minimum, will contain 2 elements ?

Not necessarily.  VNx1TI makes conceptual sense.  We just don't use it
currently (although that'll change with SME).  And we do have single-element
VLS vectors like V1DI and V1DF.

Thanks,
Richard

Prathamesh Kulkarni Aug. 13, 2023, 11:49 a.m. UTC | #12

On Thu, 10 Aug 2023 at 21:27, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> static bool
> >> is_simple_vla_size (poly_uint64 size)
> >> {
> >>   if (size.is_constant ())
> >>     return false;
> >>   for (int i = 1; i < ARRAY_SIZE (size.coeffs); ++i)
> >>     if (size[i] != (i <= 1 ? size[0] : 0))
> > Just wondering is this should be (i == 1 ? size[0] : 0) since i is
> > initialized to 1 ?
>
> Both work.  I prefer <= 1 because it doesn't depend on the micro
> optimisation to start at coefficient 1.  In a theoretical 3-indeterminate
> poly_int, we want the first 2 coefficients to be nonzero and the rest to
> be zero.
>
> > IIUC, is_simple_vla_size should return true for polynomials of first
> > degree and having same coeff like 4 + 4x ?
>
> FWIW, poly_int only supports first-degree polynomials at the moment.
> coeffs>2 means there is more than one indeterminate, rather than a
> higher power.
Oh OK, thanks for the clarification.
>
> >>       return false;
> >>   return true;
> >> }
> >>
> >>
> >>   FOR_EACH_MODE_IN_CLASS (mode, MODE_VECTOR_INT)
> >>     {
> >>       auto nunits = GET_MODE_NUNITS (mode);
> >>       if (!is_simple_vla_size (nunits))
> >>         continue;
> >>       if (nunits[0] ...)
> >>         test_... (mode);
> >>       ...
> >>
> >>     }
> >>
> >> test_vnx4si_v4si and test_v4si_vnx4si look good.  But with the
> >> loop structure above, I think we can apply the test_vnx4si and
> >> test_vnx16qi to more cases.  So the classification isn't the
> >> exact number of elements, but instead a limit.
> >>
> >> I think the nunits[0] conditions for test_vnx4si are as follows
> >> (inspection only, so could be wrong):
> >>
> >> > +/* Test cases where result and input vectors are VNx4SI  */
> >> > +
> >> > +static void
> >> > +test_vnx4si (machine_mode vmode)
> >> > +{
> >> > +  /* Case 1: mask = {0, ...} */
> >> > +  {
> >> > +    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
> >> > +    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
> >> > +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> >> > +
> >> > +    vec_perm_builder builder (len, 1, 1);
> >> > +    builder.quick_push (0);
> >> > +    vec_perm_indices sel (builder, 2, len);
> >> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> >> > +
> >> > +    tree expected_res[] = { vector_cst_elt (res, 0) };
> > This should be { vector_cst_elt (arg0, 0) }; will fix in next patch.
> >> > +    validate_res (1, 1, res, expected_res);
> >> > +  }
> >>
> >> nunits[0] >= 2 (could be all nunits if the inputs had nelts_per_pattern==1,
> >> which I think would be better)
> > IIUC, the vectors that can be used for a particular test should have
> > nunits[0] >= res_npatterns,
> > where res_npatterns is as computed in fold_vec_perm_cst without the
> > canonicalization ?
> > For above test -- res_npatterns = max(2, max (2, 1)) == 2, so we
> > require nunits[0] >= 2 ?
> > Which implies we can use above test for vectors with length 2 + 2x, 4 + 4x, etc.
>
> Right, that's what I meant.  With the inputs as they stand it has to be
> nunits[0] >= 2.  We need that form the inputs correctly.  But if the
> inputs instead had nelts_per_pattern == 1, the test would work for all
> nunits.
In the attached patch, I have reordered the tests based on min or max limit.
For tests where sel_npatterns < 3 (ie dup sequence), I have kept input
npatterns = 1,
so we can test more vector modes, and also input npatterns matter only
for stepped sequence in sel
(Since for a dup pattern we don't enforce the constraint of selecting
elements from same input pattern).
Does it look OK ?

For the following tests with input vectors having shape (1, 3)
sel = {0, 1, 2, ...}  // (1, 3)
res = { arg0[0], arg0[1], arg0[2], ... } // (1, 3)

and sel = {len, len + 1, len + 2, ... }  // (1, 3)
res = { arg1[0], arg1[1], arg1[2], ... } // (1, 3)

Altho res_npatterns = 1, I suppose these will need to be tested with
vectors with length >= 4 + 4x,
since index 2 can be ambiguous for length 2 + 2x  ?
(In the patch, these are cases 2 and 3 in test_nunits_min_4)

Patch is bootstrapped+tested on aarch64-linux-gnu with and without SVE
and on x86_64-linux-gnu
(altho I suppose bootstrapping won't be necessary for changes to unit-tests?)
>
> > Sorry if this sounds like a silly question -- Won't nunits[0] >= 2
> > cover all nunits,
> > since a vector, at a minimum, will contain 2 elements ?
>
> Not necessarily.  VNx1TI makes conceptual sense.  We just don't use it
> currently (although that'll change with SME).  And we do have single-element
> VLS vectors like V1DI and V1DF.
Thanks for the explanation, I wasn't aware of that.

Thanks,
Prathamesh
>
> Thanks,
> Richard
diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index 7e5494dfd39..5eacb1d147e 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -40,6 +40,7 @@ along with GCC; see the file COPYING3.  If not see
    gimple code, we need to handle GIMPLE tuples as well as their
    corresponding tree equivalents.  */
 
+#define INCLUDE_ALGORITHM
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
@@ -10494,15 +10495,9 @@ fold_mult_zconjz (location_t loc, tree type, tree expr)
 static bool
 vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
 {
-  unsigned HOST_WIDE_INT i, nunits;
+  unsigned HOST_WIDE_INT i;
 
-  if (TREE_CODE (arg) == VECTOR_CST
-      && VECTOR_CST_NELTS (arg).is_constant (&nunits))
-    {
-      for (i = 0; i < nunits; ++i)
-	elts[i] = VECTOR_CST_ELT (arg, i);
-    }
-  else if (TREE_CODE (arg) == CONSTRUCTOR)
+  if (TREE_CODE (arg) == CONSTRUCTOR)
     {
       constructor_elt *elt;
 
@@ -10520,6 +10515,181 @@ vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
   return true;
 }
 
+/* Helper routine for fold_vec_perm_cst to check if SEL is a suitable
+   mask for VLA vec_perm folding.
+   REASON if specified, will contain the reason why SEL is not suitable.
+   Used only for debugging and unit-testing.  */
+
+static bool
+valid_mask_for_fold_vec_perm_cst_p (tree arg0, tree arg1,
+				    const vec_perm_indices &sel,
+				    const char **reason = NULL)
+{
+  unsigned sel_npatterns = sel.encoding ().npatterns ();
+  unsigned sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
+
+  if (!(pow2p_hwi (sel_npatterns)
+	&& pow2p_hwi (VECTOR_CST_NPATTERNS (arg0))
+	&& pow2p_hwi (VECTOR_CST_NPATTERNS (arg1))))
+    {
+      if (reason)
+	*reason = "npatterns is not power of 2";
+      return false;
+    }
+
+  /* We want to avoid cases where sel.length is not a multiple of npatterns.
+     For eg: sel.length = 2 + 2x, and sel npatterns = 4.  */
+  poly_uint64 esel;
+  if (!multiple_p (sel.length (), sel_npatterns, &esel))
+    {
+      if (reason)
+	*reason = "sel.length is not multiple of sel_npatterns";
+      return false;
+    }
+
+  if (sel_nelts_per_pattern < 3)
+    return true;
+
+  for (unsigned pattern = 0; pattern < sel_npatterns; pattern++)
+    {
+      poly_uint64 a1 = sel[pattern + sel_npatterns];
+      poly_uint64 a2 = sel[pattern + 2 * sel_npatterns];
+      HOST_WIDE_INT step;
+      if (!poly_int64 (a2 - a1).is_constant (&step))
+	{
+	  if (reason)
+	    *reason = "step is not constant";
+	  return false;
+	}
+      // FIXME: Punt on step < 0 for now, revisit later.
+      if (step < 0)
+	return false;
+      if (step == 0)
+	continue;
+
+      if (!pow2p_hwi (step))
+	{
+	  if (reason)
+	    *reason = "step is not power of 2";
+	  return false;
+	}
+
+      /* Ensure that stepped sequence of the pattern selects elements
+	 only from the same input vector.  */
+      uint64_t q1, qe;
+      poly_uint64 r1, re;
+      poly_uint64 ae = a1 + (esel - 2) * step;
+      poly_uint64 arg_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+      if (!(can_div_trunc_p (a1, arg_len, &q1, &r1)
+	    && can_div_trunc_p (ae, arg_len, &qe, &re)
+	    && q1 == qe))
+	{
+	  if (reason)
+	    *reason = "crossed input vectors";
+	  return false;
+	}
+
+      /* Ensure that the stepped sequence always selects from the same
+	 input pattern.  */
+      unsigned arg_npatterns
+	= ((q1 & 0) == 0) ? VECTOR_CST_NPATTERNS (arg0)
+			  : VECTOR_CST_NPATTERNS (arg1);
+
+      if (!multiple_p (step, arg_npatterns))
+	{
+	  if (reason)
+	    *reason = "step is not multiple of npatterns";
+	  return false;
+	}
+    }
+
+  return true;
+}
+
+/* Try to fold permutation of ARG0 and ARG1 with SEL selector when
+   the input vectors are VECTOR_CST. Return NULL_TREE otherwise.
+   REASON has same purpose as described in
+   valid_mask_for_fold_vec_perm_cst_p.  */
+
+static tree
+fold_vec_perm_cst (tree type, tree arg0, tree arg1, const vec_perm_indices &sel,
+		   const char **reason = NULL)
+{
+  unsigned res_npatterns, res_nelts_per_pattern;
+  unsigned HOST_WIDE_INT res_nelts;
+
+  /* (1) If SEL is a suitable mask as determined by
+     valid_mask_for_fold_vec_perm_cst_p, then:
+     res_npatterns = max of npatterns between ARG0, ARG1, and SEL
+     res_nelts_per_pattern = max of nelts_per_pattern between
+			     ARG0, ARG1 and SEL.
+     (2) If SEL is not a suitable mask, and TYPE is VLS then:
+     res_npatterns = nelts in result vector.
+     res_nelts_per_pattern = 1.
+     This exception is made so that VLS ARG0, ARG1 and SEL work as before.  */
+  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
+    {
+      res_npatterns
+	= std::max (VECTOR_CST_NPATTERNS (arg0),
+		    std::max (VECTOR_CST_NPATTERNS (arg1),
+			      sel.encoding ().npatterns ()));
+
+      res_nelts_per_pattern
+	= std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
+		    std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1),
+			      sel.encoding ().nelts_per_pattern ()));
+
+      res_nelts = res_npatterns * res_nelts_per_pattern;
+    }
+  else if (TYPE_VECTOR_SUBPARTS (type).is_constant (&res_nelts))
+    {
+      res_npatterns = res_nelts;
+      res_nelts_per_pattern = 1;
+    }
+  else
+    return NULL_TREE;
+
+  tree_vector_builder out_elts (type, res_npatterns, res_nelts_per_pattern);
+  for (unsigned i = 0; i < res_nelts; i++)
+    {
+      poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+      uint64_t q;
+      poly_uint64 r;
+      unsigned HOST_WIDE_INT index;
+
+      /* Punt if sel[i] /trunc_div len cannot be determined,
+	 because the input vector to be chosen will depend on
+	 runtime vector length.
+	 For example if len == 4 + 4x, and sel[i] == 4,
+	 If len at runtime equals 4, we choose arg1[0].
+	 For any other value of len > 4 at runtime, we choose arg0[4].
+	 which makes the element choice dependent on runtime vector length.  */
+      if (!can_div_trunc_p (sel[i], len, &q, &r))
+	{
+	  if (reason)
+	    *reason = "cannot divide selector element by arg len";
+	  return NULL_TREE;
+	}
+
+      /* sel[i] % len will give the index of element in the chosen input
+	 vector. For example if sel[i] == 5 + 4x and len == 4 + 4x,
+	 we will choose arg1[1] since (5 + 4x) % (4 + 4x) == 1.  */
+      if (!r.is_constant (&index))
+	{
+	  if (reason)
+	    *reason = "remainder is not constant";
+	  return NULL_TREE;
+	}
+
+      tree arg = ((q & 1) == 0) ? arg0 : arg1;
+      tree elem = vector_cst_elt (arg, index);
+      out_elts.quick_push (elem);
+    }
+
+  return out_elts.build ();
+}
+
 /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL
    selector.  Return the folded VECTOR_CST or CONSTRUCTOR if successful,
    NULL_TREE otherwise.  */
@@ -10529,43 +10699,41 @@ fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
 {
   unsigned int i;
   unsigned HOST_WIDE_INT nelts;
-  bool need_ctor = false;
 
-  if (!sel.length ().is_constant (&nelts))
-    return NULL_TREE;
-  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts)
-	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts)
-	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts));
+  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), sel.length ())
+	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)),
+			   TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1))));
+
   if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type)
       || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type))
     return NULL_TREE;
 
+  if (TREE_CODE (arg0) == VECTOR_CST
+      && TREE_CODE (arg1) == VECTOR_CST)
+    return fold_vec_perm_cst (type, arg0, arg1, sel);
+
+  /* For fall back case, we want to ensure we have VLS vectors
+     with equal length.  */
+  if (!sel.length ().is_constant (&nelts))
+    return NULL_TREE;
+
+  gcc_assert (known_eq (sel.length (),
+			TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0))));
   tree *in_elts = XALLOCAVEC (tree, nelts * 2);
   if (!vec_cst_ctor_to_array (arg0, nelts, in_elts)
       || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts))
     return NULL_TREE;
 
-  tree_vector_builder out_elts (type, nelts, 1);
+  vec<constructor_elt, va_gc> *v;
+  vec_alloc (v, nelts);
   for (i = 0; i < nelts; i++)
     {
       HOST_WIDE_INT index;
       if (!sel[i].is_constant (&index))
 	return NULL_TREE;
-      if (!CONSTANT_CLASS_P (in_elts[index]))
-	need_ctor = true;
-      out_elts.quick_push (unshare_expr (in_elts[index]));
+      CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, in_elts[index]);
     }
-
-  if (need_ctor)
-    {
-      vec<constructor_elt, va_gc> *v;
-      vec_alloc (v, nelts);
-      for (i = 0; i < nelts; i++)
-	CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, out_elts[i]);
-      return build_constructor (type, v);
-    }
-  else
-    return out_elts.build ();
+  return build_constructor (type, v);
 }
 
 /* Try to fold a pointer difference of type TYPE two address expressions of
@@ -16892,6 +17060,588 @@ test_arithmetic_folding ()
 				   x);
 }
 
+namespace test_fold_vec_perm_cst {
+
+/* Build a VECTOR_CST corresponding to VMODE, and has
+   encoding given by NPATTERNS, NELTS_PER_PATTERN and STEP.
+   Fill it with randomized elements, using rand() % THRESHOLD.  */
+
+static tree
+build_vec_cst_rand (machine_mode vmode, unsigned npatterns,
+		    unsigned nelts_per_pattern,
+		    int step = 0, int threshold = 100)
+{
+  tree inner_type = lang_hooks.types.type_for_mode (GET_MODE_INNER (vmode), 1);
+  tree vectype = build_vector_type_for_mode (inner_type, vmode);
+  tree_vector_builder builder (vectype, npatterns, nelts_per_pattern);
+
+  // Fill a0 for each pattern
+  for (unsigned i = 0; i < npatterns; i++)
+    builder.quick_push (build_int_cst (inner_type, rand () % threshold));
+
+  if (nelts_per_pattern == 1)
+    return builder.build ();
+
+  // Fill a1 for each pattern
+  for (unsigned i = 0; i < npatterns; i++)
+    builder.quick_push (build_int_cst (inner_type, rand () % threshold));
+
+  if (nelts_per_pattern == 2)
+    return builder.build ();
+
+  for (unsigned i = npatterns * 2; i < npatterns * nelts_per_pattern; i++)
+    {
+      tree prev_elem = builder[i - npatterns];
+      int prev_elem_val = TREE_INT_CST_LOW (prev_elem);
+      int val = prev_elem_val + step;
+      builder.quick_push (build_int_cst (inner_type, val));
+    }
+
+  return builder.build ();
+}
+
+/* Validate result of VEC_PERM_EXPR folding for the unit-tests below,
+   when result is VLA.  */
+
+static void
+validate_res (unsigned npatterns, unsigned nelts_per_pattern,
+	      tree res, tree *expected_res)
+{
+  /* Actual npatterns and encoded_elts in res may be less than expected due
+     to canonicalization.  */
+  ASSERT_TRUE (res != NULL_TREE);
+  ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) <= npatterns);
+  ASSERT_TRUE (vector_cst_encoded_nelts (res) <= npatterns * nelts_per_pattern);
+
+  for (unsigned i = 0; i < npatterns * nelts_per_pattern; i++)
+    ASSERT_TRUE (operand_equal_p (VECTOR_CST_ELT (res, i), expected_res[i], 0));
+}
+
+/* Validate result of VEC_PERM_EXPR folding for the unit-tests below,
+   when the result is VLS.  */
+
+static void
+validate_res_vls (tree res, tree *expected_res, unsigned expected_nelts)
+{
+  ASSERT_TRUE (known_eq (VECTOR_CST_NELTS (res), expected_nelts));
+  for (unsigned i = 0; i < expected_nelts; i++)
+    ASSERT_TRUE (operand_equal_p (VECTOR_CST_ELT (res, i), expected_res[i], 0));
+}
+
+/* Helper routine to push multiple elements into BUILDER.  */
+
+static void
+builder_push_elems (vec_perm_builder& builder, poly_uint64 *elems)
+{
+  for (unsigned i = 0; i < builder.encoded_nelts (); i++)
+    builder.quick_push (elems[i]);
+}
+
+#define ARG0(index) vector_cst_elt (arg0, index)
+#define ARG1(index) vector_cst_elt (arg1, index)
+
+/* Test cases where result is VNx4SI and input vectors are V4SI.  */
+
+static void
+test_vnx4si_v4si (machine_mode vnx4si_mode, machine_mode v4si_mode)
+{
+  for (int i = 0; i < 10; i++)
+    {
+      /* Case 1:
+	 sel = { 0, 4, 1, 5, ... }
+	 res = { arg[0], arg1[0], arg0[1], arg1[1], ...} // (4, 1)  */
+      {
+	tree arg0 = build_vec_cst_rand (v4si_mode, 4, 1, 0);
+	tree arg1 = build_vec_cst_rand (v4si_mode, 4, 1, 0);
+
+	tree inner_type
+	  = lang_hooks.types.type_for_mode (GET_MODE_INNER (vnx4si_mode), 1);
+	tree res_type = build_vector_type_for_mode (inner_type, vnx4si_mode);
+
+	poly_uint64 res_len = TYPE_VECTOR_SUBPARTS (res_type);
+	vec_perm_builder builder (res_len, 4, 1);
+	poly_uint64 mask_elems[] = { 0, 4, 1, 5 };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, res_len);
+	tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel);
+
+	tree expected_res[] = { ARG0(0), ARG1(0), ARG0(1), ARG1(1) };
+	validate_res (4, 1, res, expected_res);
+      }
+
+      /* Case 2: Same as case 1, but contains an out of bounds access which
+	 should wrap around.
+	 sel = {0, 8, 4, 12, ...} (4, 1)
+	 res = { arg0[0], arg0[0], arg1[0], arg1[0], ... } (4, 1).  */
+      {
+	tree arg0 = build_vec_cst_rand (v4si_mode, 4, 1, 0);
+	tree arg1 = build_vec_cst_rand (v4si_mode, 4, 1, 0);
+
+	tree inner_type
+	  = lang_hooks.types.type_for_mode (GET_MODE_INNER (vnx4si_mode), 1);
+	tree res_type = build_vector_type_for_mode (inner_type, vnx4si_mode);
+
+	poly_uint64 res_len = TYPE_VECTOR_SUBPARTS (res_type);
+	vec_perm_builder builder (res_len, 4, 1);
+	poly_uint64 mask_elems[] = { 0, 8, 4, 12 };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, res_len);
+	tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel);
+
+	tree expected_res[] = { ARG0(0), ARG0(0), ARG1(0), ARG1(0) };
+	validate_res (4, 1, res, expected_res);
+      }
+    }
+}
+
+/* Test cases where result is V4SI and input vectors are VNx4SI.  */
+
+static void
+test_v4si_vnx4si (machine_mode v4si_mode, machine_mode vnx4si_mode)
+{
+  for (int i = 0; i < 10; i++)
+    {
+      /* Case 1:
+	 sel = { 0, 1, 2, 3}
+	 res = { arg0[0], arg0[1], arg0[2], arg0[3] }.  */
+      {
+	tree arg0 = build_vec_cst_rand (vnx4si_mode, 4, 1);
+	tree arg1 = build_vec_cst_rand (vnx4si_mode, 4, 1);
+
+	tree inner_type
+	  = lang_hooks.types.type_for_mode (GET_MODE_INNER (v4si_mode), 1);
+	tree res_type = build_vector_type_for_mode (inner_type, v4si_mode);
+
+	poly_uint64 res_len = TYPE_VECTOR_SUBPARTS (res_type);
+	vec_perm_builder builder (res_len, 4, 1);
+	poly_uint64 mask_elems[] = {0, 1, 2, 3};
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, res_len);
+	tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel);
+
+	tree expected_res[] = { ARG0(0), ARG0(1), ARG0(2), ARG0(3) };
+	validate_res_vls (res, expected_res, 4);
+      }
+
+      /* Case 2: Same as Case 1, but crossing input vector.
+	 sel = {0, 2, 4, 6}
+	 In this case,the index 4 is ambiguous since len = 4 + 4x.
+	 Since we cannot determine, which vector to choose from during
+	 compile time, should return NULL_TREE.  */
+      {
+	tree arg0 = build_vec_cst_rand (vnx4si_mode, 4, 1);
+	tree arg1 = build_vec_cst_rand (vnx4si_mode, 4, 1);
+
+	tree inner_type
+	  = lang_hooks.types.type_for_mode (GET_MODE_INNER (v4si_mode), 1);
+	tree res_type = build_vector_type_for_mode (inner_type, v4si_mode);
+
+	poly_uint64 res_len = TYPE_VECTOR_SUBPARTS (res_type);
+	vec_perm_builder builder (res_len, 4, 1);
+	poly_uint64 mask_elems[] = {0, 2, 4, 6};
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, res_len);
+	const char *reason;
+	tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel, &reason);
+
+	ASSERT_TRUE (res == NULL_TREE);
+	ASSERT_TRUE (!strcmp (reason, "cannot divide selector element by arg len"));
+      }
+    }
+}
+
+/* Test all input vectors.  */
+
+static void
+test_all_nunits (machine_mode vmode)
+{
+  /* Test with 10 different inputs.  */
+  for (int i = 0; i < 10; i++)
+    {
+      tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
+      tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
+      poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+      /* Case 1: mask = {0, ...} // (1, 1)
+	 res = { arg0[0], ... } // (1, 1)  */
+      {
+	vec_perm_builder builder (len, 1, 1);
+	builder.quick_push (0);
+	vec_perm_indices sel (builder, 2, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+	tree expected_res[] = { ARG0(0) };
+	validate_res (1, 1, res, expected_res);
+      }
+
+      /* Case 2: mask = {len, ...} // (1, 1)
+	 res = { arg1[0], ... } // (1, 1)  */
+      {
+	vec_perm_builder builder (len, 1, 1);
+	builder.quick_push (len);
+	vec_perm_indices sel (builder, 2, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+	tree expected_res[] = { ARG1(0) };
+	validate_res (1, 1, res, expected_res);
+      }
+
+      /* Case 3: mask = {len, 0, 1, ...} // (1, 3)
+	 Test that stepped sequence of the pattern selects from arg0.
+	 res = { arg1[0], arg0[0], arg0[1], ... } // (1, 3)  */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
+	tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
+	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (len, 1, 3);
+	poly_uint64 mask_elems[] = { len, 0, 1 };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+	tree expected_res[] = { ARG1(0), ARG0(0), ARG0(1) };
+	validate_res (1, 3, res, expected_res);
+      }
+    }
+}
+
+/* Test all vectors which contain at-least 2 elements.  */
+
+static void
+test_nunits_min_2 (machine_mode vmode)
+{
+  for (int i = 0; i < 10; i++)
+    {
+      /* Case 1: mask = { 0, len, ... }  // (2, 1)
+	 res = { arg0[0], arg1[0], ... } // (2, 1)  */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
+	tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
+	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (len, 2, 1);
+	poly_uint64 mask_elems[] = { 0, len };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+	tree expected_res[] = { ARG0(0), ARG1(0) };
+	validate_res (2, 1, res, expected_res);
+      }
+
+      /* Case 2: mask = { 0, len, 1, len+1, ... } // (2, 2)
+	 res = { arg0[0], arg1[0], arg0[1], arg1[1], ... } // (2, 2)  */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
+	tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
+	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (len, 2, 2);
+	poly_uint64 mask_elems[] = { 0, len, 1, len + 1 };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+	tree expected_res[] = { ARG0(0), ARG1(0), ARG0(1), ARG1(1) };
+	validate_res (2, 2, res, expected_res);
+      }
+
+      /* Case 4: mask = {0, 0, 1, ...} // (1, 3)
+	 Test that the stepped sequence of the pattern selects from
+	 same input pattern. Since input vectors have npatterns = 2,
+	 and step (a2 - a1) = 1, step is not a multiple of npatterns
+	 in input vector. So return NULL_TREE.  */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
+	tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
+	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (len, 1, 3);
+	poly_uint64 mask_elems[] = { 0, 0, 1 };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, len);
+	const char *reason;
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel,
+				      &reason);
+	ASSERT_TRUE (res == NULL_TREE);
+	ASSERT_TRUE (!strcmp (reason, "step is not multiple of npatterns"));
+      }
+    }
+}
+
+/* Test all vectors which contain at-least 4 elements.  */
+
+static void
+test_nunits_min_4 (machine_mode vmode)
+{
+  for (int i = 0; i < 10; i++)
+    {
+      /* Case 1: mask = { 0, len, 1, len+1, ... } // (4, 1)
+	 res: { arg0[0], arg1[0], arg0[1], arg1[1], ... } // (4, 1)  */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
+	tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
+	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (len, 4, 1);
+	poly_uint64 mask_elems[] = { 0, len, 1, len + 1 };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+	tree expected_res[] = { ARG0(0), ARG1(0), ARG0(1), ARG1(1) };
+	validate_res (4, 1, res, expected_res);
+      }
+
+      /* Case 2: sel = {0, 1, 2, ...}  // (1, 3)
+	 res: { arg0[0], arg0[1], arg0[2], ... } // (1, 3) */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 1, 3, 2);
+	tree arg1 = build_vec_cst_rand (vmode, 1, 3, 2);
+	poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (arg0_len, 1, 3);
+	poly_uint64 mask_elems[] = {0, 1, 2};
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, arg0_len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+	tree expected_res[] = { ARG0(0), ARG0(1), ARG0(2) };
+	validate_res (1, 3, res, expected_res);
+      }
+
+      /* Case 3: sel = {len, len+1, len+2, ...} // (1, 3)
+	 res: { arg1[0], arg1[1], arg1[2], ... } // (1, 3) */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 1, 3, 2);
+	tree arg1 = build_vec_cst_rand (vmode, 1, 3, 2);
+	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (len, 1, 3);
+	poly_uint64 mask_elems[] = {len, len + 1, len + 2};
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+	tree expected_res[] = { ARG1(0), ARG1(1), ARG1(2) };
+	validate_res (1, 3, res, expected_res);
+      }
+
+      /* Case 4:
+	sel = { len, 0, 2, ... } // (1, 3) 
+	This should return NULL because we cross the input vectors.
+	Because,
+	Let's assume len = C + Cx
+	a1 = 0
+	S = 2
+	esel = arg0_len / sel_npatterns = C + Cx
+	ae = 0 + (esel - 2) * S
+	   = 0 + (C + Cx - 2) * 2
+	   = 2(C-2) + 2Cx
+
+	For C >= 4:
+	Let q1 = a1 / arg0_len = 0 / (C + Cx) = 0
+	Let qe = ae / arg0_len = (2(C-2) + 2Cx) / (C + Cx) = 1
+	Since q1 != qe, we cross input vectors.
+	So return NULL_TREE.  */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 1, 3, 2);
+	tree arg1 = build_vec_cst_rand (vmode, 1, 3, 2);
+	poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (arg0_len, 1, 3);
+	poly_uint64 mask_elems[] = { arg0_len, 0, 2 };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, arg0_len);
+	const char *reason;
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel, &reason);
+	ASSERT_TRUE (res == NULL_TREE);
+	ASSERT_TRUE (!strcmp (reason, "crossed input vectors"));
+      }
+
+      /* Case 5: npatterns(arg0) = 4 > npatterns(sel) = 2
+	 mask = { 0, len, 1, len + 1, ...} // (2, 2)
+	 res = { arg0[0], arg1[0], arg0[1], arg1[1], ... } // (2, 2)
+
+	 Note that fold_vec_perm_cst will set
+	 res_npatterns = max(4, max(4, 2)) = 4
+	 However after canonicalizing, we will end up with shape (2, 2).  */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 4, 1);
+	tree arg1 = build_vec_cst_rand (vmode, 4, 1);
+	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (len, 2, 2);
+	poly_uint64 mask_elems[] = { 0, len, 1, len + 1 };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+	tree expected_res[] = { ARG0(0), ARG1(0), ARG0(1), ARG1(1) };
+	validate_res (2, 2, res, expected_res);
+      }
+
+      /* Case 6: Test combination in sel, where one pattern is dup and other
+	 is stepped sequence.
+	 sel = { 0, 0, 0, 1, 0, 2, ... } // (2, 3)
+	 res = { arg0[0], arg0[0], arg0[0],
+		 arg0[1], arg0[0], arg0[2], ... } // (2, 3)  */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
+	tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
+	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (len, 2, 3);
+	poly_uint64 mask_elems[] = { 0, 0, 0, 1, 0, 2 };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+	tree expected_res[] = { ARG0(0), ARG0(0), ARG0(0),
+				ARG0(1), ARG0(0), ARG0(2) };
+	validate_res (2, 3, res, expected_res);
+      }
+    }
+}
+
+/* Test all vectors which contain at-least 8 elements.  */
+
+static void
+test_nunits_min_8 (machine_mode vmode)
+{
+  for (int i = 0; i < 10; i++)
+    {
+      /* Case 1: sel_npatterns (4) > input npatterns (2)
+	 sel: { 0, 0, 1, len, 2, 0, 3, len, 4, 0, 5, len, ...} // (4, 3)
+	 res: { arg0[0], arg0[0], arg0[0], arg1[0],
+		arg0[2], arg0[0], arg0[3], arg1[0],
+		arg0[4], arg0[0], arg0[5], arg1[0], ... } // (4, 3)  */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 2, 3, 2);
+	tree arg1 = build_vec_cst_rand (vmode, 2, 3, 2);
+	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder(len, 4, 3);
+	poly_uint64 mask_elems[] = { 0, 0, 1, len, 2, 0, 3, len,
+				     4, 0, 5, len };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+	tree expected_res[] = { ARG0(0), ARG0(0), ARG0(1), ARG1(0),
+				ARG0(2), ARG0(0), ARG0(3), ARG1(0),
+				ARG0(4), ARG0(0), ARG0(5), ARG1(0) };
+	validate_res (4, 3, res, expected_res);
+      }
+    }
+}
+
+/* Test vectors for which nunits[0] <= 4.  */
+
+static void
+test_nunits_max_4 (machine_mode vmode)
+{
+  /* Case 1: mask = {0, 4, ...} // (1, 2)
+     This should return NULL_TREE because the index 4 may choose
+     from either arg0 or arg1 depending on vector length.  */
+  {
+    tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
+    tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
+    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (len, 1, 2);
+    poly_uint64 mask_elems[] = {0, 4};
+    builder_push_elems (builder, mask_elems);
+
+    vec_perm_indices sel (builder, 2, len);
+    const char *reason;
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel, &reason);
+    ASSERT_TRUE (res == NULL_TREE);
+    ASSERT_TRUE (reason != NULL);
+    ASSERT_TRUE (!strcmp (reason, "cannot divide selector element by arg len"));
+  }
+}
+
+#undef ARG0
+#undef ARG1
+
+/* Return true if SIZE is of the form C + Cx and C is power of 2.  */
+
+static bool
+is_simple_vla_size (poly_uint64 size)
+{
+  if (size.is_constant ()
+      || !pow2p_hwi (size.coeffs[0]))
+    return false;
+  for (unsigned i = 1; i < ARRAY_SIZE (size.coeffs); ++i)
+    if (size.coeffs[i] != (i <= 1 ? size.coeffs[0] : 0))
+      return false;
+  return true;
+}
+
+/* Execute fold_vec_perm_cst unit tests.  */
+
+static void
+test ()
+{
+  machine_mode vnx4si_mode = E_VOIDmode;
+  machine_mode v4si_mode = E_VOIDmode;
+
+  machine_mode vmode;
+  FOR_EACH_MODE_IN_CLASS (vmode, MODE_VECTOR_INT)
+    {
+      /* Obtain modes corresponding to VNx4SI and V4SI,
+	 to call mixed mode tests below.
+	 FIXME: Is there a better way to do this ?  */
+      if (GET_MODE_INNER (vmode) == SImode)
+	{
+	  poly_uint64 nunits = GET_MODE_NUNITS (vmode);
+	  if (is_simple_vla_size (nunits)
+	      && nunits.coeffs[0] == 4)
+	    vnx4si_mode = vmode;
+	  else if (known_eq (nunits, poly_uint64 (4)))
+	    v4si_mode = vmode;
+	}
+
+      if (!is_simple_vla_size (GET_MODE_NUNITS (vmode))
+	  || !targetm.vector_mode_supported_p (vmode))
+	continue;
+
+      poly_uint64 nunits = GET_MODE_NUNITS (vmode);
+      test_all_nunits (vmode);
+      if (nunits.coeffs[0] >= 2)
+	test_nunits_min_2 (vmode);
+      if (nunits.coeffs[0] >= 4)
+	test_nunits_min_4 (vmode);
+      if (nunits.coeffs[0] >= 8)
+	test_nunits_min_8 (vmode);
+
+      if (nunits.coeffs[0] <= 4)
+	test_nunits_max_4 (vmode);
+    }
+
+  if (vnx4si_mode != E_VOIDmode && v4si_mode != E_VOIDmode
+      && targetm.vector_mode_supported_p (vnx4si_mode)
+      && targetm.vector_mode_supported_p (v4si_mode))
+    {
+      test_vnx4si_v4si (vnx4si_mode, v4si_mode);
+      test_v4si_vnx4si (v4si_mode, vnx4si_mode);
+    }
+}
+}; // end of test_fold_vec_perm_cst namespace
+
 /* Verify that various binary operations on vectors are folded
    correctly.  */
 
@@ -16943,6 +17693,7 @@ fold_const_cc_tests ()
   test_arithmetic_folding ();
   test_vector_folding ();
   test_vec_duplicate_folding ();
+  test_fold_vec_perm_cst::test ();
 }
 
 } // namespace selftest

Richard Sandiford Aug. 14, 2023, 12:53 p.m. UTC | #13

Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> On Thu, 10 Aug 2023 at 21:27, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> >> static bool
>> >> is_simple_vla_size (poly_uint64 size)
>> >> {
>> >>   if (size.is_constant ())
>> >>     return false;
>> >>   for (int i = 1; i < ARRAY_SIZE (size.coeffs); ++i)
>> >>     if (size[i] != (i <= 1 ? size[0] : 0))
>> > Just wondering is this should be (i == 1 ? size[0] : 0) since i is
>> > initialized to 1 ?
>>
>> Both work.  I prefer <= 1 because it doesn't depend on the micro
>> optimisation to start at coefficient 1.  In a theoretical 3-indeterminate
>> poly_int, we want the first 2 coefficients to be nonzero and the rest to
>> be zero.
>>
>> > IIUC, is_simple_vla_size should return true for polynomials of first
>> > degree and having same coeff like 4 + 4x ?
>>
>> FWIW, poly_int only supports first-degree polynomials at the moment.
>> coeffs>2 means there is more than one indeterminate, rather than a
>> higher power.
> Oh OK, thanks for the clarification.
>>
>> >>       return false;
>> >>   return true;
>> >> }
>> >>
>> >>
>> >>   FOR_EACH_MODE_IN_CLASS (mode, MODE_VECTOR_INT)
>> >>     {
>> >>       auto nunits = GET_MODE_NUNITS (mode);
>> >>       if (!is_simple_vla_size (nunits))
>> >>         continue;
>> >>       if (nunits[0] ...)
>> >>         test_... (mode);
>> >>       ...
>> >>
>> >>     }
>> >>
>> >> test_vnx4si_v4si and test_v4si_vnx4si look good.  But with the
>> >> loop structure above, I think we can apply the test_vnx4si and
>> >> test_vnx16qi to more cases.  So the classification isn't the
>> >> exact number of elements, but instead a limit.
>> >>
>> >> I think the nunits[0] conditions for test_vnx4si are as follows
>> >> (inspection only, so could be wrong):
>> >>
>> >> > +/* Test cases where result and input vectors are VNx4SI  */
>> >> > +
>> >> > +static void
>> >> > +test_vnx4si (machine_mode vmode)
>> >> > +{
>> >> > +  /* Case 1: mask = {0, ...} */
>> >> > +  {
>> >> > +    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
>> >> > +    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
>> >> > +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
>> >> > +
>> >> > +    vec_perm_builder builder (len, 1, 1);
>> >> > +    builder.quick_push (0);
>> >> > +    vec_perm_indices sel (builder, 2, len);
>> >> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
>> >> > +
>> >> > +    tree expected_res[] = { vector_cst_elt (res, 0) };
>> > This should be { vector_cst_elt (arg0, 0) }; will fix in next patch.
>> >> > +    validate_res (1, 1, res, expected_res);
>> >> > +  }
>> >>
>> >> nunits[0] >= 2 (could be all nunits if the inputs had nelts_per_pattern==1,
>> >> which I think would be better)
>> > IIUC, the vectors that can be used for a particular test should have
>> > nunits[0] >= res_npatterns,
>> > where res_npatterns is as computed in fold_vec_perm_cst without the
>> > canonicalization ?
>> > For above test -- res_npatterns = max(2, max (2, 1)) == 2, so we
>> > require nunits[0] >= 2 ?
>> > Which implies we can use above test for vectors with length 2 + 2x, 4 + 4x, etc.
>>
>> Right, that's what I meant.  With the inputs as they stand it has to be
>> nunits[0] >= 2.  We need that form the inputs correctly.  But if the
>> inputs instead had nelts_per_pattern == 1, the test would work for all
>> nunits.
> In the attached patch, I have reordered the tests based on min or max limit.
> For tests where sel_npatterns < 3 (ie dup sequence), I have kept input
> npatterns = 1,
> so we can test more vector modes, and also input npatterns matter only
> for stepped sequence in sel
> (Since for a dup pattern we don't enforce the constraint of selecting
> elements from same input pattern).
> Does it look OK ?
>
> For the following tests with input vectors having shape (1, 3)
> sel = {0, 1, 2, ...}  // (1, 3)
> res = { arg0[0], arg0[1], arg0[2], ... } // (1, 3)
>
> and sel = {len, len + 1, len + 2, ... }  // (1, 3)
> res = { arg1[0], arg1[1], arg1[2], ... } // (1, 3)
>
> Altho res_npatterns = 1, I suppose these will need to be tested with
> vectors with length >= 4 + 4x,
> since index 2 can be ambiguous for length 2 + 2x  ?
> (In the patch, these are cases 2 and 3 in test_nunits_min_4)

Ah, yeah, fair point.  I guess that means:

+      /* Case 3: mask = {len, 0, 1, ...} // (1, 3)
+	 Test that stepped sequence of the pattern selects from arg0.
+	 res = { arg1[0], arg0[0], arg0[1], ... } // (1, 3)  */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
+	tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
+	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (len, 1, 3);
+	poly_uint64 mask_elems[] = { len, 0, 1 };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+	tree expected_res[] = { ARG1(0), ARG0(0), ARG0(1) };
+	validate_res (1, 3, res, expected_res);
+      }

needs to be min_2 after all.

Also:

> +/* Helper routine to push multiple elements into BUILDER.  */
> +
> +static void
> +builder_push_elems (vec_perm_builder& builder, poly_uint64 *elems)
> +{
> +  for (unsigned i = 0; i < builder.encoded_nelts (); i++)
> +    builder.quick_push (elems[i]);
> +}

I think it'd be safer to make this:

template<unsigned N>
builder_push_elems (vec_perm_builder& builder, poly_uint64 (&elems)[N])
{
  for (unsigned i = 0; i < N; i++)
    builder.quick_push (elems[i]);
}

so that we only push elements that are in the array.

OK for trunk with those changes, thanks.

Richard

> +
> +#define ARG0(index) vector_cst_elt (arg0, index)
> +#define ARG1(index) vector_cst_elt (arg1, index)
> +
> +/* Test cases where result is VNx4SI and input vectors are V4SI.  */
> +
> +static void
> +test_vnx4si_v4si (machine_mode vnx4si_mode, machine_mode v4si_mode)
> +{
> +  for (int i = 0; i < 10; i++)
> +    {
> +      /* Case 1:
> +	 sel = { 0, 4, 1, 5, ... }
> +	 res = { arg[0], arg1[0], arg0[1], arg1[1], ...} // (4, 1)  */
> +      {
> +	tree arg0 = build_vec_cst_rand (v4si_mode, 4, 1, 0);
> +	tree arg1 = build_vec_cst_rand (v4si_mode, 4, 1, 0);
> +
> +	tree inner_type
> +	  = lang_hooks.types.type_for_mode (GET_MODE_INNER (vnx4si_mode), 1);
> +	tree res_type = build_vector_type_for_mode (inner_type, vnx4si_mode);
> +
> +	poly_uint64 res_len = TYPE_VECTOR_SUBPARTS (res_type);
> +	vec_perm_builder builder (res_len, 4, 1);
> +	poly_uint64 mask_elems[] = { 0, 4, 1, 5 };
> +	builder_push_elems (builder, mask_elems);
> +
> +	vec_perm_indices sel (builder, 2, res_len);
> +	tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel);
> +
> +	tree expected_res[] = { ARG0(0), ARG1(0), ARG0(1), ARG1(1) };
> +	validate_res (4, 1, res, expected_res);
> +      }
> +
> +      /* Case 2: Same as case 1, but contains an out of bounds access which
> +	 should wrap around.
> +	 sel = {0, 8, 4, 12, ...} (4, 1)
> +	 res = { arg0[0], arg0[0], arg1[0], arg1[0], ... } (4, 1).  */
> +      {
> +	tree arg0 = build_vec_cst_rand (v4si_mode, 4, 1, 0);
> +	tree arg1 = build_vec_cst_rand (v4si_mode, 4, 1, 0);
> +
> +	tree inner_type
> +	  = lang_hooks.types.type_for_mode (GET_MODE_INNER (vnx4si_mode), 1);
> +	tree res_type = build_vector_type_for_mode (inner_type, vnx4si_mode);
> +
> +	poly_uint64 res_len = TYPE_VECTOR_SUBPARTS (res_type);
> +	vec_perm_builder builder (res_len, 4, 1);
> +	poly_uint64 mask_elems[] = { 0, 8, 4, 12 };
> +	builder_push_elems (builder, mask_elems);
> +
> +	vec_perm_indices sel (builder, 2, res_len);
> +	tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel);
> +
> +	tree expected_res[] = { ARG0(0), ARG0(0), ARG1(0), ARG1(0) };
> +	validate_res (4, 1, res, expected_res);
> +      }
> +    }
> +}
> +
> +/* Test cases where result is V4SI and input vectors are VNx4SI.  */
> +
> +static void
> +test_v4si_vnx4si (machine_mode v4si_mode, machine_mode vnx4si_mode)
> +{
> +  for (int i = 0; i < 10; i++)
> +    {
> +      /* Case 1:
> +	 sel = { 0, 1, 2, 3}
> +	 res = { arg0[0], arg0[1], arg0[2], arg0[3] }.  */
> +      {
> +	tree arg0 = build_vec_cst_rand (vnx4si_mode, 4, 1);
> +	tree arg1 = build_vec_cst_rand (vnx4si_mode, 4, 1);
> +
> +	tree inner_type
> +	  = lang_hooks.types.type_for_mode (GET_MODE_INNER (v4si_mode), 1);
> +	tree res_type = build_vector_type_for_mode (inner_type, v4si_mode);
> +
> +	poly_uint64 res_len = TYPE_VECTOR_SUBPARTS (res_type);
> +	vec_perm_builder builder (res_len, 4, 1);
> +	poly_uint64 mask_elems[] = {0, 1, 2, 3};
> +	builder_push_elems (builder, mask_elems);
> +
> +	vec_perm_indices sel (builder, 2, res_len);
> +	tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel);
> +
> +	tree expected_res[] = { ARG0(0), ARG0(1), ARG0(2), ARG0(3) };
> +	validate_res_vls (res, expected_res, 4);
> +      }
> +
> +      /* Case 2: Same as Case 1, but crossing input vector.
> +	 sel = {0, 2, 4, 6}
> +	 In this case,the index 4 is ambiguous since len = 4 + 4x.
> +	 Since we cannot determine, which vector to choose from during
> +	 compile time, should return NULL_TREE.  */
> +      {
> +	tree arg0 = build_vec_cst_rand (vnx4si_mode, 4, 1);
> +	tree arg1 = build_vec_cst_rand (vnx4si_mode, 4, 1);
> +
> +	tree inner_type
> +	  = lang_hooks.types.type_for_mode (GET_MODE_INNER (v4si_mode), 1);
> +	tree res_type = build_vector_type_for_mode (inner_type, v4si_mode);
> +
> +	poly_uint64 res_len = TYPE_VECTOR_SUBPARTS (res_type);
> +	vec_perm_builder builder (res_len, 4, 1);
> +	poly_uint64 mask_elems[] = {0, 2, 4, 6};
> +	builder_push_elems (builder, mask_elems);
> +
> +	vec_perm_indices sel (builder, 2, res_len);
> +	const char *reason;
> +	tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel, &reason);
> +
> +	ASSERT_TRUE (res == NULL_TREE);
> +	ASSERT_TRUE (!strcmp (reason, "cannot divide selector element by arg len"));
> +      }
> +    }
> +}
> +
> +/* Test all input vectors.  */
> +
> +static void
> +test_all_nunits (machine_mode vmode)
> +{
> +  /* Test with 10 different inputs.  */
> +  for (int i = 0; i < 10; i++)
> +    {
> +      tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
> +      tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
> +      poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +      /* Case 1: mask = {0, ...} // (1, 1)
> +	 res = { arg0[0], ... } // (1, 1)  */
> +      {
> +	vec_perm_builder builder (len, 1, 1);
> +	builder.quick_push (0);
> +	vec_perm_indices sel (builder, 2, len);
> +	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +	tree expected_res[] = { ARG0(0) };
> +	validate_res (1, 1, res, expected_res);
> +      }
> +
> +      /* Case 2: mask = {len, ...} // (1, 1)
> +	 res = { arg1[0], ... } // (1, 1)  */
> +      {
> +	vec_perm_builder builder (len, 1, 1);
> +	builder.quick_push (len);
> +	vec_perm_indices sel (builder, 2, len);
> +	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +	tree expected_res[] = { ARG1(0) };
> +	validate_res (1, 1, res, expected_res);
> +      }
> +
> +      /* Case 3: mask = {len, 0, 1, ...} // (1, 3)
> +	 Test that stepped sequence of the pattern selects from arg0.
> +	 res = { arg1[0], arg0[0], arg0[1], ... } // (1, 3)  */
> +      {
> +	tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
> +	tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
> +	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +	vec_perm_builder builder (len, 1, 3);
> +	poly_uint64 mask_elems[] = { len, 0, 1 };
> +	builder_push_elems (builder, mask_elems);
> +
> +	vec_perm_indices sel (builder, 2, len);
> +	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +	tree expected_res[] = { ARG1(0), ARG0(0), ARG0(1) };
> +	validate_res (1, 3, res, expected_res);
> +      }
> +    }
> +}
> +
> +/* Test all vectors which contain at-least 2 elements.  */
> +
> +static void
> +test_nunits_min_2 (machine_mode vmode)
> +{
> +  for (int i = 0; i < 10; i++)
> +    {
> +      /* Case 1: mask = { 0, len, ... }  // (2, 1)
> +	 res = { arg0[0], arg1[0], ... } // (2, 1)  */
> +      {
> +	tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
> +	tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
> +	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +	vec_perm_builder builder (len, 2, 1);
> +	poly_uint64 mask_elems[] = { 0, len };
> +	builder_push_elems (builder, mask_elems);
> +
> +	vec_perm_indices sel (builder, 2, len);
> +	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +	tree expected_res[] = { ARG0(0), ARG1(0) };
> +	validate_res (2, 1, res, expected_res);
> +      }
> +
> +      /* Case 2: mask = { 0, len, 1, len+1, ... } // (2, 2)
> +	 res = { arg0[0], arg1[0], arg0[1], arg1[1], ... } // (2, 2)  */
> +      {
> +	tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
> +	tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
> +	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +	vec_perm_builder builder (len, 2, 2);
> +	poly_uint64 mask_elems[] = { 0, len, 1, len + 1 };
> +	builder_push_elems (builder, mask_elems);
> +
> +	vec_perm_indices sel (builder, 2, len);
> +	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +	tree expected_res[] = { ARG0(0), ARG1(0), ARG0(1), ARG1(1) };
> +	validate_res (2, 2, res, expected_res);
> +      }
> +
> +      /* Case 4: mask = {0, 0, 1, ...} // (1, 3)
> +	 Test that the stepped sequence of the pattern selects from
> +	 same input pattern. Since input vectors have npatterns = 2,
> +	 and step (a2 - a1) = 1, step is not a multiple of npatterns
> +	 in input vector. So return NULL_TREE.  */
> +      {
> +	tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
> +	tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
> +	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +	vec_perm_builder builder (len, 1, 3);
> +	poly_uint64 mask_elems[] = { 0, 0, 1 };
> +	builder_push_elems (builder, mask_elems);
> +
> +	vec_perm_indices sel (builder, 2, len);
> +	const char *reason;
> +	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel,
> +				      &reason);
> +	ASSERT_TRUE (res == NULL_TREE);
> +	ASSERT_TRUE (!strcmp (reason, "step is not multiple of npatterns"));
> +      }
> +    }
> +}
> +
> +/* Test all vectors which contain at-least 4 elements.  */
> +
> +static void
> +test_nunits_min_4 (machine_mode vmode)
> +{
> +  for (int i = 0; i < 10; i++)
> +    {
> +      /* Case 1: mask = { 0, len, 1, len+1, ... } // (4, 1)
> +	 res: { arg0[0], arg1[0], arg0[1], arg1[1], ... } // (4, 1)  */
> +      {
> +	tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
> +	tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
> +	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +	vec_perm_builder builder (len, 4, 1);
> +	poly_uint64 mask_elems[] = { 0, len, 1, len + 1 };
> +	builder_push_elems (builder, mask_elems);
> +
> +	vec_perm_indices sel (builder, 2, len);
> +	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +	tree expected_res[] = { ARG0(0), ARG1(0), ARG0(1), ARG1(1) };
> +	validate_res (4, 1, res, expected_res);
> +      }
> +
> +      /* Case 2: sel = {0, 1, 2, ...}  // (1, 3)
> +	 res: { arg0[0], arg0[1], arg0[2], ... } // (1, 3) */
> +      {
> +	tree arg0 = build_vec_cst_rand (vmode, 1, 3, 2);
> +	tree arg1 = build_vec_cst_rand (vmode, 1, 3, 2);
> +	poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +	vec_perm_builder builder (arg0_len, 1, 3);
> +	poly_uint64 mask_elems[] = {0, 1, 2};
> +	builder_push_elems (builder, mask_elems);
> +
> +	vec_perm_indices sel (builder, 2, arg0_len);
> +	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +	tree expected_res[] = { ARG0(0), ARG0(1), ARG0(2) };
> +	validate_res (1, 3, res, expected_res);
> +      }
> +
> +      /* Case 3: sel = {len, len+1, len+2, ...} // (1, 3)
> +	 res: { arg1[0], arg1[1], arg1[2], ... } // (1, 3) */
> +      {
> +	tree arg0 = build_vec_cst_rand (vmode, 1, 3, 2);
> +	tree arg1 = build_vec_cst_rand (vmode, 1, 3, 2);
> +	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +	vec_perm_builder builder (len, 1, 3);
> +	poly_uint64 mask_elems[] = {len, len + 1, len + 2};
> +	builder_push_elems (builder, mask_elems);
> +
> +	vec_perm_indices sel (builder, 2, len);
> +	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +	tree expected_res[] = { ARG1(0), ARG1(1), ARG1(2) };
> +	validate_res (1, 3, res, expected_res);
> +      }
> +
> +      /* Case 4:
> +	sel = { len, 0, 2, ... } // (1, 3) 
> +	This should return NULL because we cross the input vectors.
> +	Because,
> +	Let's assume len = C + Cx
> +	a1 = 0
> +	S = 2
> +	esel = arg0_len / sel_npatterns = C + Cx
> +	ae = 0 + (esel - 2) * S
> +	   = 0 + (C + Cx - 2) * 2
> +	   = 2(C-2) + 2Cx
> +
> +	For C >= 4:
> +	Let q1 = a1 / arg0_len = 0 / (C + Cx) = 0
> +	Let qe = ae / arg0_len = (2(C-2) + 2Cx) / (C + Cx) = 1
> +	Since q1 != qe, we cross input vectors.
> +	So return NULL_TREE.  */
> +      {
> +	tree arg0 = build_vec_cst_rand (vmode, 1, 3, 2);
> +	tree arg1 = build_vec_cst_rand (vmode, 1, 3, 2);
> +	poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +	vec_perm_builder builder (arg0_len, 1, 3);
> +	poly_uint64 mask_elems[] = { arg0_len, 0, 2 };
> +	builder_push_elems (builder, mask_elems);
> +
> +	vec_perm_indices sel (builder, 2, arg0_len);
> +	const char *reason;
> +	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel, &reason);
> +	ASSERT_TRUE (res == NULL_TREE);
> +	ASSERT_TRUE (!strcmp (reason, "crossed input vectors"));
> +      }
> +
> +      /* Case 5: npatterns(arg0) = 4 > npatterns(sel) = 2
> +	 mask = { 0, len, 1, len + 1, ...} // (2, 2)
> +	 res = { arg0[0], arg1[0], arg0[1], arg1[1], ... } // (2, 2)
> +
> +	 Note that fold_vec_perm_cst will set
> +	 res_npatterns = max(4, max(4, 2)) = 4
> +	 However after canonicalizing, we will end up with shape (2, 2).  */
> +      {
> +	tree arg0 = build_vec_cst_rand (vmode, 4, 1);
> +	tree arg1 = build_vec_cst_rand (vmode, 4, 1);
> +	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +	vec_perm_builder builder (len, 2, 2);
> +	poly_uint64 mask_elems[] = { 0, len, 1, len + 1 };
> +	builder_push_elems (builder, mask_elems);
> +
> +	vec_perm_indices sel (builder, 2, len);
> +	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +	tree expected_res[] = { ARG0(0), ARG1(0), ARG0(1), ARG1(1) };
> +	validate_res (2, 2, res, expected_res);
> +      }
> +
> +      /* Case 6: Test combination in sel, where one pattern is dup and other
> +	 is stepped sequence.
> +	 sel = { 0, 0, 0, 1, 0, 2, ... } // (2, 3)
> +	 res = { arg0[0], arg0[0], arg0[0],
> +		 arg0[1], arg0[0], arg0[2], ... } // (2, 3)  */
> +      {
> +	tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
> +	tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
> +	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +	vec_perm_builder builder (len, 2, 3);
> +	poly_uint64 mask_elems[] = { 0, 0, 0, 1, 0, 2 };
> +	builder_push_elems (builder, mask_elems);
> +
> +	vec_perm_indices sel (builder, 2, len);
> +	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +	tree expected_res[] = { ARG0(0), ARG0(0), ARG0(0),
> +				ARG0(1), ARG0(0), ARG0(2) };
> +	validate_res (2, 3, res, expected_res);
> +      }
> +    }
> +}
> +
> +/* Test all vectors which contain at-least 8 elements.  */
> +
> +static void
> +test_nunits_min_8 (machine_mode vmode)
> +{
> +  for (int i = 0; i < 10; i++)
> +    {
> +      /* Case 1: sel_npatterns (4) > input npatterns (2)
> +	 sel: { 0, 0, 1, len, 2, 0, 3, len, 4, 0, 5, len, ...} // (4, 3)
> +	 res: { arg0[0], arg0[0], arg0[0], arg1[0],
> +		arg0[2], arg0[0], arg0[3], arg1[0],
> +		arg0[4], arg0[0], arg0[5], arg1[0], ... } // (4, 3)  */
> +      {
> +	tree arg0 = build_vec_cst_rand (vmode, 2, 3, 2);
> +	tree arg1 = build_vec_cst_rand (vmode, 2, 3, 2);
> +	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +	vec_perm_builder builder(len, 4, 3);
> +	poly_uint64 mask_elems[] = { 0, 0, 1, len, 2, 0, 3, len,
> +				     4, 0, 5, len };
> +	builder_push_elems (builder, mask_elems);
> +
> +	vec_perm_indices sel (builder, 2, len);
> +	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +	tree expected_res[] = { ARG0(0), ARG0(0), ARG0(1), ARG1(0),
> +				ARG0(2), ARG0(0), ARG0(3), ARG1(0),
> +				ARG0(4), ARG0(0), ARG0(5), ARG1(0) };
> +	validate_res (4, 3, res, expected_res);
> +      }
> +    }
> +}
> +
> +/* Test vectors for which nunits[0] <= 4.  */
> +
> +static void
> +test_nunits_max_4 (machine_mode vmode)
> +{
> +  /* Case 1: mask = {0, 4, ...} // (1, 2)
> +     This should return NULL_TREE because the index 4 may choose
> +     from either arg0 or arg1 depending on vector length.  */
> +  {
> +    tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
> +    tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
> +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +    vec_perm_builder builder (len, 1, 2);
> +    poly_uint64 mask_elems[] = {0, 4};
> +    builder_push_elems (builder, mask_elems);
> +
> +    vec_perm_indices sel (builder, 2, len);
> +    const char *reason;
> +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel, &reason);
> +    ASSERT_TRUE (res == NULL_TREE);
> +    ASSERT_TRUE (reason != NULL);
> +    ASSERT_TRUE (!strcmp (reason, "cannot divide selector element by arg len"));
> +  }
> +}
> +
> +#undef ARG0
> +#undef ARG1
> +
> +/* Return true if SIZE is of the form C + Cx and C is power of 2.  */
> +
> +static bool
> +is_simple_vla_size (poly_uint64 size)
> +{
> +  if (size.is_constant ()
> +      || !pow2p_hwi (size.coeffs[0]))
> +    return false;
> +  for (unsigned i = 1; i < ARRAY_SIZE (size.coeffs); ++i)
> +    if (size.coeffs[i] != (i <= 1 ? size.coeffs[0] : 0))
> +      return false;
> +  return true;
> +}
> +
> +/* Execute fold_vec_perm_cst unit tests.  */
> +
> +static void
> +test ()
> +{
> +  machine_mode vnx4si_mode = E_VOIDmode;
> +  machine_mode v4si_mode = E_VOIDmode;
> +
> +  machine_mode vmode;
> +  FOR_EACH_MODE_IN_CLASS (vmode, MODE_VECTOR_INT)
> +    {
> +      /* Obtain modes corresponding to VNx4SI and V4SI,
> +	 to call mixed mode tests below.
> +	 FIXME: Is there a better way to do this ?  */
> +      if (GET_MODE_INNER (vmode) == SImode)
> +	{
> +	  poly_uint64 nunits = GET_MODE_NUNITS (vmode);
> +	  if (is_simple_vla_size (nunits)
> +	      && nunits.coeffs[0] == 4)
> +	    vnx4si_mode = vmode;
> +	  else if (known_eq (nunits, poly_uint64 (4)))
> +	    v4si_mode = vmode;
> +	}
> +
> +      if (!is_simple_vla_size (GET_MODE_NUNITS (vmode))
> +	  || !targetm.vector_mode_supported_p (vmode))
> +	continue;
> +
> +      poly_uint64 nunits = GET_MODE_NUNITS (vmode);
> +      test_all_nunits (vmode);
> +      if (nunits.coeffs[0] >= 2)
> +	test_nunits_min_2 (vmode);
> +      if (nunits.coeffs[0] >= 4)
> +	test_nunits_min_4 (vmode);
> +      if (nunits.coeffs[0] >= 8)
> +	test_nunits_min_8 (vmode);
> +
> +      if (nunits.coeffs[0] <= 4)
> +	test_nunits_max_4 (vmode);
> +    }
> +
> +  if (vnx4si_mode != E_VOIDmode && v4si_mode != E_VOIDmode
> +      && targetm.vector_mode_supported_p (vnx4si_mode)
> +      && targetm.vector_mode_supported_p (v4si_mode))
> +    {
> +      test_vnx4si_v4si (vnx4si_mode, v4si_mode);
> +      test_v4si_vnx4si (v4si_mode, vnx4si_mode);
> +    }
> +}
> +}; // end of test_fold_vec_perm_cst namespace
> +
>  /* Verify that various binary operations on vectors are folded
>     correctly.  */
>  
> @@ -16943,6 +17693,7 @@ fold_const_cc_tests ()
>    test_arithmetic_folding ();
>    test_vector_folding ();
>    test_vec_duplicate_folding ();
> +  test_fold_vec_perm_cst::test ();
>  }
>  
>  } // namespace selftest

Prathamesh Kulkarni Aug. 15, 2023, 11:29 a.m. UTC | #14

On Mon, 14 Aug 2023 at 18:23, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > On Thu, 10 Aug 2023 at 21:27, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> >> static bool
> >> >> is_simple_vla_size (poly_uint64 size)
> >> >> {
> >> >>   if (size.is_constant ())
> >> >>     return false;
> >> >>   for (int i = 1; i < ARRAY_SIZE (size.coeffs); ++i)
> >> >>     if (size[i] != (i <= 1 ? size[0] : 0))
> >> > Just wondering is this should be (i == 1 ? size[0] : 0) since i is
> >> > initialized to 1 ?
> >>
> >> Both work.  I prefer <= 1 because it doesn't depend on the micro
> >> optimisation to start at coefficient 1.  In a theoretical 3-indeterminate
> >> poly_int, we want the first 2 coefficients to be nonzero and the rest to
> >> be zero.
> >>
> >> > IIUC, is_simple_vla_size should return true for polynomials of first
> >> > degree and having same coeff like 4 + 4x ?
> >>
> >> FWIW, poly_int only supports first-degree polynomials at the moment.
> >> coeffs>2 means there is more than one indeterminate, rather than a
> >> higher power.
> > Oh OK, thanks for the clarification.
> >>
> >> >>       return false;
> >> >>   return true;
> >> >> }
> >> >>
> >> >>
> >> >>   FOR_EACH_MODE_IN_CLASS (mode, MODE_VECTOR_INT)
> >> >>     {
> >> >>       auto nunits = GET_MODE_NUNITS (mode);
> >> >>       if (!is_simple_vla_size (nunits))
> >> >>         continue;
> >> >>       if (nunits[0] ...)
> >> >>         test_... (mode);
> >> >>       ...
> >> >>
> >> >>     }
> >> >>
> >> >> test_vnx4si_v4si and test_v4si_vnx4si look good.  But with the
> >> >> loop structure above, I think we can apply the test_vnx4si and
> >> >> test_vnx16qi to more cases.  So the classification isn't the
> >> >> exact number of elements, but instead a limit.
> >> >>
> >> >> I think the nunits[0] conditions for test_vnx4si are as follows
> >> >> (inspection only, so could be wrong):
> >> >>
> >> >> > +/* Test cases where result and input vectors are VNx4SI  */
> >> >> > +
> >> >> > +static void
> >> >> > +test_vnx4si (machine_mode vmode)
> >> >> > +{
> >> >> > +  /* Case 1: mask = {0, ...} */
> >> >> > +  {
> >> >> > +    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
> >> >> > +    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
> >> >> > +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> >> >> > +
> >> >> > +    vec_perm_builder builder (len, 1, 1);
> >> >> > +    builder.quick_push (0);
> >> >> > +    vec_perm_indices sel (builder, 2, len);
> >> >> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> >> >> > +
> >> >> > +    tree expected_res[] = { vector_cst_elt (res, 0) };
> >> > This should be { vector_cst_elt (arg0, 0) }; will fix in next patch.
> >> >> > +    validate_res (1, 1, res, expected_res);
> >> >> > +  }
> >> >>
> >> >> nunits[0] >= 2 (could be all nunits if the inputs had nelts_per_pattern==1,
> >> >> which I think would be better)
> >> > IIUC, the vectors that can be used for a particular test should have
> >> > nunits[0] >= res_npatterns,
> >> > where res_npatterns is as computed in fold_vec_perm_cst without the
> >> > canonicalization ?
> >> > For above test -- res_npatterns = max(2, max (2, 1)) == 2, so we
> >> > require nunits[0] >= 2 ?
> >> > Which implies we can use above test for vectors with length 2 + 2x, 4 + 4x, etc.
> >>
> >> Right, that's what I meant.  With the inputs as they stand it has to be
> >> nunits[0] >= 2.  We need that form the inputs correctly.  But if the
> >> inputs instead had nelts_per_pattern == 1, the test would work for all
> >> nunits.
> > In the attached patch, I have reordered the tests based on min or max limit.
> > For tests where sel_npatterns < 3 (ie dup sequence), I have kept input
> > npatterns = 1,
> > so we can test more vector modes, and also input npatterns matter only
> > for stepped sequence in sel
> > (Since for a dup pattern we don't enforce the constraint of selecting
> > elements from same input pattern).
> > Does it look OK ?
> >
> > For the following tests with input vectors having shape (1, 3)
> > sel = {0, 1, 2, ...}  // (1, 3)
> > res = { arg0[0], arg0[1], arg0[2], ... } // (1, 3)
> >
> > and sel = {len, len + 1, len + 2, ... }  // (1, 3)
> > res = { arg1[0], arg1[1], arg1[2], ... } // (1, 3)
> >
> > Altho res_npatterns = 1, I suppose these will need to be tested with
> > vectors with length >= 4 + 4x,
> > since index 2 can be ambiguous for length 2 + 2x  ?
> > (In the patch, these are cases 2 and 3 in test_nunits_min_4)
>
> Ah, yeah, fair point.  I guess that means:
>
> +      /* Case 3: mask = {len, 0, 1, ...} // (1, 3)
> +        Test that stepped sequence of the pattern selects from arg0.
> +        res = { arg1[0], arg0[0], arg0[1], ... } // (1, 3)  */
> +      {
> +       tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
> +       tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
> +       poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +       vec_perm_builder builder (len, 1, 3);
> +       poly_uint64 mask_elems[] = { len, 0, 1 };
> +       builder_push_elems (builder, mask_elems);
> +
> +       vec_perm_indices sel (builder, 2, len);
> +       tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +       tree expected_res[] = { ARG1(0), ARG0(0), ARG0(1) };
> +       validate_res (1, 3, res, expected_res);
> +      }
>
> needs to be min_2 after all.
Ah indeed. Fixed, thanks.
>
> Also:
>
> > +/* Helper routine to push multiple elements into BUILDER.  */
> > +
> > +static void
> > +builder_push_elems (vec_perm_builder& builder, poly_uint64 *elems)
> > +{
> > +  for (unsigned i = 0; i < builder.encoded_nelts (); i++)
> > +    builder.quick_push (elems[i]);
> > +}
>
> I think it'd be safer to make this:
>
> template<unsigned N>
> builder_push_elems (vec_perm_builder& builder, poly_uint64 (&elems)[N])
> {
>   for (unsigned i = 0; i < N; i++)
>     builder.quick_push (elems[i]);
> }
>
> so that we only push elements that are in the array.
Done, thanks.
>
> OK for trunk with those changes, thanks.
Unfortunately, the patch regressed following tests on ppc64le and
armhf respectively:
gcc.target/powerpc/vec-perm-ctor.c scan-tree-dump-not optimized
"VIEW_CONVERT_EXPR"
gcc.dg/tree-ssa/forwprop-20.c scan-tree-dump-not forwprop1 "VEC_PERM_EXPR"

This happens because of the change to vect_cst_ctor_array which
removes handling of VECTOR_CST,
and thus we return NULL_TREE for cases where VEC_PERM_EXPR has
vector_cst, ctor input operands.

For eg we fail to fold VEC_PERM_EXPR for the following test taken from
forwprop-20.c:
void f (double d, vecf* r)
{
  vecf x = { -d, 5 };
  vecf y = {  1, 4 };
  veci m = {  2, 0 };
  *r = __builtin_shuffle (x, y, m); // { 1, -d }
}
because vect_cst_ctor_to_array will now return NULL_TREE for vector_cst {1, 4}.

The attached patch thus reverts the changes to vect_cst_ctor_to_array,
which makes the tests pass again.
I have put the patch for another round of bootstrap+test on the above
targets (aarch64, aarch64-sve, x86_64, armhf, ppc64le).
OK to commit if it passes ?

Thanks,
Prathamesh
>
> Richard
>
> > +
> > +#define ARG0(index) vector_cst_elt (arg0, index)
> > +#define ARG1(index) vector_cst_elt (arg1, index)
> > +
> > +/* Test cases where result is VNx4SI and input vectors are V4SI.  */
> > +
> > +static void
> > +test_vnx4si_v4si (machine_mode vnx4si_mode, machine_mode v4si_mode)
> > +{
> > +  for (int i = 0; i < 10; i++)
> > +    {
> > +      /* Case 1:
> > +      sel = { 0, 4, 1, 5, ... }
> > +      res = { arg[0], arg1[0], arg0[1], arg1[1], ...} // (4, 1)  */
> > +      {
> > +     tree arg0 = build_vec_cst_rand (v4si_mode, 4, 1, 0);
> > +     tree arg1 = build_vec_cst_rand (v4si_mode, 4, 1, 0);
> > +
> > +     tree inner_type
> > +       = lang_hooks.types.type_for_mode (GET_MODE_INNER (vnx4si_mode), 1);
> > +     tree res_type = build_vector_type_for_mode (inner_type, vnx4si_mode);
> > +
> > +     poly_uint64 res_len = TYPE_VECTOR_SUBPARTS (res_type);
> > +     vec_perm_builder builder (res_len, 4, 1);
> > +     poly_uint64 mask_elems[] = { 0, 4, 1, 5 };
> > +     builder_push_elems (builder, mask_elems);
> > +
> > +     vec_perm_indices sel (builder, 2, res_len);
> > +     tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel);
> > +
> > +     tree expected_res[] = { ARG0(0), ARG1(0), ARG0(1), ARG1(1) };
> > +     validate_res (4, 1, res, expected_res);
> > +      }
> > +
> > +      /* Case 2: Same as case 1, but contains an out of bounds access which
> > +      should wrap around.
> > +      sel = {0, 8, 4, 12, ...} (4, 1)
> > +      res = { arg0[0], arg0[0], arg1[0], arg1[0], ... } (4, 1).  */
> > +      {
> > +     tree arg0 = build_vec_cst_rand (v4si_mode, 4, 1, 0);
> > +     tree arg1 = build_vec_cst_rand (v4si_mode, 4, 1, 0);
> > +
> > +     tree inner_type
> > +       = lang_hooks.types.type_for_mode (GET_MODE_INNER (vnx4si_mode), 1);
> > +     tree res_type = build_vector_type_for_mode (inner_type, vnx4si_mode);
> > +
> > +     poly_uint64 res_len = TYPE_VECTOR_SUBPARTS (res_type);
> > +     vec_perm_builder builder (res_len, 4, 1);
> > +     poly_uint64 mask_elems[] = { 0, 8, 4, 12 };
> > +     builder_push_elems (builder, mask_elems);
> > +
> > +     vec_perm_indices sel (builder, 2, res_len);
> > +     tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel);
> > +
> > +     tree expected_res[] = { ARG0(0), ARG0(0), ARG1(0), ARG1(0) };
> > +     validate_res (4, 1, res, expected_res);
> > +      }
> > +    }
> > +}
> > +
> > +/* Test cases where result is V4SI and input vectors are VNx4SI.  */
> > +
> > +static void
> > +test_v4si_vnx4si (machine_mode v4si_mode, machine_mode vnx4si_mode)
> > +{
> > +  for (int i = 0; i < 10; i++)
> > +    {
> > +      /* Case 1:
> > +      sel = { 0, 1, 2, 3}
> > +      res = { arg0[0], arg0[1], arg0[2], arg0[3] }.  */
> > +      {
> > +     tree arg0 = build_vec_cst_rand (vnx4si_mode, 4, 1);
> > +     tree arg1 = build_vec_cst_rand (vnx4si_mode, 4, 1);
> > +
> > +     tree inner_type
> > +       = lang_hooks.types.type_for_mode (GET_MODE_INNER (v4si_mode), 1);
> > +     tree res_type = build_vector_type_for_mode (inner_type, v4si_mode);
> > +
> > +     poly_uint64 res_len = TYPE_VECTOR_SUBPARTS (res_type);
> > +     vec_perm_builder builder (res_len, 4, 1);
> > +     poly_uint64 mask_elems[] = {0, 1, 2, 3};
> > +     builder_push_elems (builder, mask_elems);
> > +
> > +     vec_perm_indices sel (builder, 2, res_len);
> > +     tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel);
> > +
> > +     tree expected_res[] = { ARG0(0), ARG0(1), ARG0(2), ARG0(3) };
> > +     validate_res_vls (res, expected_res, 4);
> > +      }
> > +
> > +      /* Case 2: Same as Case 1, but crossing input vector.
> > +      sel = {0, 2, 4, 6}
> > +      In this case,the index 4 is ambiguous since len = 4 + 4x.
> > +      Since we cannot determine, which vector to choose from during
> > +      compile time, should return NULL_TREE.  */
> > +      {
> > +     tree arg0 = build_vec_cst_rand (vnx4si_mode, 4, 1);
> > +     tree arg1 = build_vec_cst_rand (vnx4si_mode, 4, 1);
> > +
> > +     tree inner_type
> > +       = lang_hooks.types.type_for_mode (GET_MODE_INNER (v4si_mode), 1);
> > +     tree res_type = build_vector_type_for_mode (inner_type, v4si_mode);
> > +
> > +     poly_uint64 res_len = TYPE_VECTOR_SUBPARTS (res_type);
> > +     vec_perm_builder builder (res_len, 4, 1);
> > +     poly_uint64 mask_elems[] = {0, 2, 4, 6};
> > +     builder_push_elems (builder, mask_elems);
> > +
> > +     vec_perm_indices sel (builder, 2, res_len);
> > +     const char *reason;
> > +     tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel, &reason);
> > +
> > +     ASSERT_TRUE (res == NULL_TREE);
> > +     ASSERT_TRUE (!strcmp (reason, "cannot divide selector element by arg len"));
> > +      }
> > +    }
> > +}
> > +
> > +/* Test all input vectors.  */
> > +
> > +static void
> > +test_all_nunits (machine_mode vmode)
> > +{
> > +  /* Test with 10 different inputs.  */
> > +  for (int i = 0; i < 10; i++)
> > +    {
> > +      tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
> > +      tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
> > +      poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +      /* Case 1: mask = {0, ...} // (1, 1)
> > +      res = { arg0[0], ... } // (1, 1)  */
> > +      {
> > +     vec_perm_builder builder (len, 1, 1);
> > +     builder.quick_push (0);
> > +     vec_perm_indices sel (builder, 2, len);
> > +     tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +     tree expected_res[] = { ARG0(0) };
> > +     validate_res (1, 1, res, expected_res);
> > +      }
> > +
> > +      /* Case 2: mask = {len, ...} // (1, 1)
> > +      res = { arg1[0], ... } // (1, 1)  */
> > +      {
> > +     vec_perm_builder builder (len, 1, 1);
> > +     builder.quick_push (len);
> > +     vec_perm_indices sel (builder, 2, len);
> > +     tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +
> > +     tree expected_res[] = { ARG1(0) };
> > +     validate_res (1, 1, res, expected_res);
> > +      }
> > +
> > +      /* Case 3: mask = {len, 0, 1, ...} // (1, 3)
> > +      Test that stepped sequence of the pattern selects from arg0.
> > +      res = { arg1[0], arg0[0], arg0[1], ... } // (1, 3)  */
> > +      {
> > +     tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
> > +     tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
> > +     poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +     vec_perm_builder builder (len, 1, 3);
> > +     poly_uint64 mask_elems[] = { len, 0, 1 };
> > +     builder_push_elems (builder, mask_elems);
> > +
> > +     vec_perm_indices sel (builder, 2, len);
> > +     tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +
> > +     tree expected_res[] = { ARG1(0), ARG0(0), ARG0(1) };
> > +     validate_res (1, 3, res, expected_res);
> > +      }
> > +    }
> > +}
> > +
> > +/* Test all vectors which contain at-least 2 elements.  */
> > +
> > +static void
> > +test_nunits_min_2 (machine_mode vmode)
> > +{
> > +  for (int i = 0; i < 10; i++)
> > +    {
> > +      /* Case 1: mask = { 0, len, ... }  // (2, 1)
> > +      res = { arg0[0], arg1[0], ... } // (2, 1)  */
> > +      {
> > +     tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
> > +     tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
> > +     poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +     vec_perm_builder builder (len, 2, 1);
> > +     poly_uint64 mask_elems[] = { 0, len };
> > +     builder_push_elems (builder, mask_elems);
> > +
> > +     vec_perm_indices sel (builder, 2, len);
> > +     tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +
> > +     tree expected_res[] = { ARG0(0), ARG1(0) };
> > +     validate_res (2, 1, res, expected_res);
> > +      }
> > +
> > +      /* Case 2: mask = { 0, len, 1, len+1, ... } // (2, 2)
> > +      res = { arg0[0], arg1[0], arg0[1], arg1[1], ... } // (2, 2)  */
> > +      {
> > +     tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
> > +     tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
> > +     poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +     vec_perm_builder builder (len, 2, 2);
> > +     poly_uint64 mask_elems[] = { 0, len, 1, len + 1 };
> > +     builder_push_elems (builder, mask_elems);
> > +
> > +     vec_perm_indices sel (builder, 2, len);
> > +     tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +
> > +     tree expected_res[] = { ARG0(0), ARG1(0), ARG0(1), ARG1(1) };
> > +     validate_res (2, 2, res, expected_res);
> > +      }
> > +
> > +      /* Case 4: mask = {0, 0, 1, ...} // (1, 3)
> > +      Test that the stepped sequence of the pattern selects from
> > +      same input pattern. Since input vectors have npatterns = 2,
> > +      and step (a2 - a1) = 1, step is not a multiple of npatterns
> > +      in input vector. So return NULL_TREE.  */
> > +      {
> > +     tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
> > +     tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
> > +     poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +     vec_perm_builder builder (len, 1, 3);
> > +     poly_uint64 mask_elems[] = { 0, 0, 1 };
> > +     builder_push_elems (builder, mask_elems);
> > +
> > +     vec_perm_indices sel (builder, 2, len);
> > +     const char *reason;
> > +     tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel,
> > +                                   &reason);
> > +     ASSERT_TRUE (res == NULL_TREE);
> > +     ASSERT_TRUE (!strcmp (reason, "step is not multiple of npatterns"));
> > +      }
> > +    }
> > +}
> > +
> > +/* Test all vectors which contain at-least 4 elements.  */
> > +
> > +static void
> > +test_nunits_min_4 (machine_mode vmode)
> > +{
> > +  for (int i = 0; i < 10; i++)
> > +    {
> > +      /* Case 1: mask = { 0, len, 1, len+1, ... } // (4, 1)
> > +      res: { arg0[0], arg1[0], arg0[1], arg1[1], ... } // (4, 1)  */
> > +      {
> > +     tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
> > +     tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
> > +     poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +     vec_perm_builder builder (len, 4, 1);
> > +     poly_uint64 mask_elems[] = { 0, len, 1, len + 1 };
> > +     builder_push_elems (builder, mask_elems);
> > +
> > +     vec_perm_indices sel (builder, 2, len);
> > +     tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +
> > +     tree expected_res[] = { ARG0(0), ARG1(0), ARG0(1), ARG1(1) };
> > +     validate_res (4, 1, res, expected_res);
> > +      }
> > +
> > +      /* Case 2: sel = {0, 1, 2, ...}  // (1, 3)
> > +      res: { arg0[0], arg0[1], arg0[2], ... } // (1, 3) */
> > +      {
> > +     tree arg0 = build_vec_cst_rand (vmode, 1, 3, 2);
> > +     tree arg1 = build_vec_cst_rand (vmode, 1, 3, 2);
> > +     poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +     vec_perm_builder builder (arg0_len, 1, 3);
> > +     poly_uint64 mask_elems[] = {0, 1, 2};
> > +     builder_push_elems (builder, mask_elems);
> > +
> > +     vec_perm_indices sel (builder, 2, arg0_len);
> > +     tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +     tree expected_res[] = { ARG0(0), ARG0(1), ARG0(2) };
> > +     validate_res (1, 3, res, expected_res);
> > +      }
> > +
> > +      /* Case 3: sel = {len, len+1, len+2, ...} // (1, 3)
> > +      res: { arg1[0], arg1[1], arg1[2], ... } // (1, 3) */
> > +      {
> > +     tree arg0 = build_vec_cst_rand (vmode, 1, 3, 2);
> > +     tree arg1 = build_vec_cst_rand (vmode, 1, 3, 2);
> > +     poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +     vec_perm_builder builder (len, 1, 3);
> > +     poly_uint64 mask_elems[] = {len, len + 1, len + 2};
> > +     builder_push_elems (builder, mask_elems);
> > +
> > +     vec_perm_indices sel (builder, 2, len);
> > +     tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +     tree expected_res[] = { ARG1(0), ARG1(1), ARG1(2) };
> > +     validate_res (1, 3, res, expected_res);
> > +      }
> > +
> > +      /* Case 4:
> > +     sel = { len, 0, 2, ... } // (1, 3)
> > +     This should return NULL because we cross the input vectors.
> > +     Because,
> > +     Let's assume len = C + Cx
> > +     a1 = 0
> > +     S = 2
> > +     esel = arg0_len / sel_npatterns = C + Cx
> > +     ae = 0 + (esel - 2) * S
> > +        = 0 + (C + Cx - 2) * 2
> > +        = 2(C-2) + 2Cx
> > +
> > +     For C >= 4:
> > +     Let q1 = a1 / arg0_len = 0 / (C + Cx) = 0
> > +     Let qe = ae / arg0_len = (2(C-2) + 2Cx) / (C + Cx) = 1
> > +     Since q1 != qe, we cross input vectors.
> > +     So return NULL_TREE.  */
> > +      {
> > +     tree arg0 = build_vec_cst_rand (vmode, 1, 3, 2);
> > +     tree arg1 = build_vec_cst_rand (vmode, 1, 3, 2);
> > +     poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +     vec_perm_builder builder (arg0_len, 1, 3);
> > +     poly_uint64 mask_elems[] = { arg0_len, 0, 2 };
> > +     builder_push_elems (builder, mask_elems);
> > +
> > +     vec_perm_indices sel (builder, 2, arg0_len);
> > +     const char *reason;
> > +     tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel, &reason);
> > +     ASSERT_TRUE (res == NULL_TREE);
> > +     ASSERT_TRUE (!strcmp (reason, "crossed input vectors"));
> > +      }
> > +
> > +      /* Case 5: npatterns(arg0) = 4 > npatterns(sel) = 2
> > +      mask = { 0, len, 1, len + 1, ...} // (2, 2)
> > +      res = { arg0[0], arg1[0], arg0[1], arg1[1], ... } // (2, 2)
> > +
> > +      Note that fold_vec_perm_cst will set
> > +      res_npatterns = max(4, max(4, 2)) = 4
> > +      However after canonicalizing, we will end up with shape (2, 2).  */
> > +      {
> > +     tree arg0 = build_vec_cst_rand (vmode, 4, 1);
> > +     tree arg1 = build_vec_cst_rand (vmode, 4, 1);
> > +     poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +     vec_perm_builder builder (len, 2, 2);
> > +     poly_uint64 mask_elems[] = { 0, len, 1, len + 1 };
> > +     builder_push_elems (builder, mask_elems);
> > +
> > +     vec_perm_indices sel (builder, 2, len);
> > +     tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +     tree expected_res[] = { ARG0(0), ARG1(0), ARG0(1), ARG1(1) };
> > +     validate_res (2, 2, res, expected_res);
> > +      }
> > +
> > +      /* Case 6: Test combination in sel, where one pattern is dup and other
> > +      is stepped sequence.
> > +      sel = { 0, 0, 0, 1, 0, 2, ... } // (2, 3)
> > +      res = { arg0[0], arg0[0], arg0[0],
> > +              arg0[1], arg0[0], arg0[2], ... } // (2, 3)  */
> > +      {
> > +     tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
> > +     tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
> > +     poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +     vec_perm_builder builder (len, 2, 3);
> > +     poly_uint64 mask_elems[] = { 0, 0, 0, 1, 0, 2 };
> > +     builder_push_elems (builder, mask_elems);
> > +
> > +     vec_perm_indices sel (builder, 2, len);
> > +     tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +
> > +     tree expected_res[] = { ARG0(0), ARG0(0), ARG0(0),
> > +                             ARG0(1), ARG0(0), ARG0(2) };
> > +     validate_res (2, 3, res, expected_res);
> > +      }
> > +    }
> > +}
> > +
> > +/* Test all vectors which contain at-least 8 elements.  */
> > +
> > +static void
> > +test_nunits_min_8 (machine_mode vmode)
> > +{
> > +  for (int i = 0; i < 10; i++)
> > +    {
> > +      /* Case 1: sel_npatterns (4) > input npatterns (2)
> > +      sel: { 0, 0, 1, len, 2, 0, 3, len, 4, 0, 5, len, ...} // (4, 3)
> > +      res: { arg0[0], arg0[0], arg0[0], arg1[0],
> > +             arg0[2], arg0[0], arg0[3], arg1[0],
> > +             arg0[4], arg0[0], arg0[5], arg1[0], ... } // (4, 3)  */
> > +      {
> > +     tree arg0 = build_vec_cst_rand (vmode, 2, 3, 2);
> > +     tree arg1 = build_vec_cst_rand (vmode, 2, 3, 2);
> > +     poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +     vec_perm_builder builder(len, 4, 3);
> > +     poly_uint64 mask_elems[] = { 0, 0, 1, len, 2, 0, 3, len,
> > +                                  4, 0, 5, len };
> > +     builder_push_elems (builder, mask_elems);
> > +
> > +     vec_perm_indices sel (builder, 2, len);
> > +     tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +
> > +     tree expected_res[] = { ARG0(0), ARG0(0), ARG0(1), ARG1(0),
> > +                             ARG0(2), ARG0(0), ARG0(3), ARG1(0),
> > +                             ARG0(4), ARG0(0), ARG0(5), ARG1(0) };
> > +     validate_res (4, 3, res, expected_res);
> > +      }
> > +    }
> > +}
> > +
> > +/* Test vectors for which nunits[0] <= 4.  */
> > +
> > +static void
> > +test_nunits_max_4 (machine_mode vmode)
> > +{
> > +  /* Case 1: mask = {0, 4, ...} // (1, 2)
> > +     This should return NULL_TREE because the index 4 may choose
> > +     from either arg0 or arg1 depending on vector length.  */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
> > +    tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
> > +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (len, 1, 2);
> > +    poly_uint64 mask_elems[] = {0, 4};
> > +    builder_push_elems (builder, mask_elems);
> > +
> > +    vec_perm_indices sel (builder, 2, len);
> > +    const char *reason;
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel, &reason);
> > +    ASSERT_TRUE (res == NULL_TREE);
> > +    ASSERT_TRUE (reason != NULL);
> > +    ASSERT_TRUE (!strcmp (reason, "cannot divide selector element by arg len"));
> > +  }
> > +}
> > +
> > +#undef ARG0
> > +#undef ARG1
> > +
> > +/* Return true if SIZE is of the form C + Cx and C is power of 2.  */
> > +
> > +static bool
> > +is_simple_vla_size (poly_uint64 size)
> > +{
> > +  if (size.is_constant ()
> > +      || !pow2p_hwi (size.coeffs[0]))
> > +    return false;
> > +  for (unsigned i = 1; i < ARRAY_SIZE (size.coeffs); ++i)
> > +    if (size.coeffs[i] != (i <= 1 ? size.coeffs[0] : 0))
> > +      return false;
> > +  return true;
> > +}
> > +
> > +/* Execute fold_vec_perm_cst unit tests.  */
> > +
> > +static void
> > +test ()
> > +{
> > +  machine_mode vnx4si_mode = E_VOIDmode;
> > +  machine_mode v4si_mode = E_VOIDmode;
> > +
> > +  machine_mode vmode;
> > +  FOR_EACH_MODE_IN_CLASS (vmode, MODE_VECTOR_INT)
> > +    {
> > +      /* Obtain modes corresponding to VNx4SI and V4SI,
> > +      to call mixed mode tests below.
> > +      FIXME: Is there a better way to do this ?  */
> > +      if (GET_MODE_INNER (vmode) == SImode)
> > +     {
> > +       poly_uint64 nunits = GET_MODE_NUNITS (vmode);
> > +       if (is_simple_vla_size (nunits)
> > +           && nunits.coeffs[0] == 4)
> > +         vnx4si_mode = vmode;
> > +       else if (known_eq (nunits, poly_uint64 (4)))
> > +         v4si_mode = vmode;
> > +     }
> > +
> > +      if (!is_simple_vla_size (GET_MODE_NUNITS (vmode))
> > +       || !targetm.vector_mode_supported_p (vmode))
> > +     continue;
> > +
> > +      poly_uint64 nunits = GET_MODE_NUNITS (vmode);
> > +      test_all_nunits (vmode);
> > +      if (nunits.coeffs[0] >= 2)
> > +     test_nunits_min_2 (vmode);
> > +      if (nunits.coeffs[0] >= 4)
> > +     test_nunits_min_4 (vmode);
> > +      if (nunits.coeffs[0] >= 8)
> > +     test_nunits_min_8 (vmode);
> > +
> > +      if (nunits.coeffs[0] <= 4)
> > +     test_nunits_max_4 (vmode);
> > +    }
> > +
> > +  if (vnx4si_mode != E_VOIDmode && v4si_mode != E_VOIDmode
> > +      && targetm.vector_mode_supported_p (vnx4si_mode)
> > +      && targetm.vector_mode_supported_p (v4si_mode))
> > +    {
> > +      test_vnx4si_v4si (vnx4si_mode, v4si_mode);
> > +      test_v4si_vnx4si (v4si_mode, vnx4si_mode);
> > +    }
> > +}
> > +}; // end of test_fold_vec_perm_cst namespace
> > +
> >  /* Verify that various binary operations on vectors are folded
> >     correctly.  */
> >
> > @@ -16943,6 +17693,7 @@ fold_const_cc_tests ()
> >    test_arithmetic_folding ();
> >    test_vector_folding ();
> >    test_vec_duplicate_folding ();
> > +  test_fold_vec_perm_cst::test ();
> >  }
> >
> >  } // namespace selftest
Extend fold_vec_perm to handle VLA vector_cst.

gcc/ChangeLog:
	* fold-const.cc (INCLUDE_ALGORITHM): Add Include.
	(valid_mask_for_fold_vec_perm_cst_p): New function.
	(fold_vec_perm_cst): Likewise.
	(fold_vec_perm): Adjust assert and call fold_vec_perm_cst.
	(test_fold_vec_perm_cst): New namespace.
	(test_fold_vec_perm_cst::build_vec_cst_rand): New function.
	(test_fold_vec_perm_cst::validate_res): Likewise.
	(test_fold_vec_perm_cst::validate_res_vls): Likewise.
	(test_fold_vec_perm_cst::builder_push_elems): Likewise.
	(test_fold_vec_perm_cst::test_vnx4si_v4si): Likewise.
	(test_fold_vec_perm_cst::test_v4si_vnx4si): Likewise.
	(test_fold_vec_perm_cst::test_all_nunits): Likewise.
	(test_fold_vec_perm_cst::test_nunits_min_2): Likewise.
	(test_fold_vec_perm_cst::test_nunits_min_4): Likewise.
	(test_fold_vec_perm_cst::test_nunits_min_8): Likewise.
	(test_fold_vec_perm_cst::test_nunits_max_4): Likewise.
	(test_fold_vec_perm_cst::is_simple_vla_size): Likewise.
	(test_fold_vec_perm_cst::test): Likewise.
	(fold_const_cc_tests): Call test_fold_vec_perm_cst::test.

Co-authored-by: Richard Sandiford <richard.sandiford@arm.com>

diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index 7e5494dfd39..c6fb083027d 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -40,6 +40,7 @@ along with GCC; see the file COPYING3.  If not see
    gimple code, we need to handle GIMPLE tuples as well as their
    corresponding tree equivalents.  */
 
+#define INCLUDE_ALGORITHM
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
@@ -10520,6 +10521,181 @@ vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
   return true;
 }
 
+/* Helper routine for fold_vec_perm_cst to check if SEL is a suitable
+   mask for VLA vec_perm folding.
+   REASON if specified, will contain the reason why SEL is not suitable.
+   Used only for debugging and unit-testing.  */
+
+static bool
+valid_mask_for_fold_vec_perm_cst_p (tree arg0, tree arg1,
+				    const vec_perm_indices &sel,
+				    const char **reason = NULL)
+{
+  unsigned sel_npatterns = sel.encoding ().npatterns ();
+  unsigned sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
+
+  if (!(pow2p_hwi (sel_npatterns)
+	&& pow2p_hwi (VECTOR_CST_NPATTERNS (arg0))
+	&& pow2p_hwi (VECTOR_CST_NPATTERNS (arg1))))
+    {
+      if (reason)
+	*reason = "npatterns is not power of 2";
+      return false;
+    }
+
+  /* We want to avoid cases where sel.length is not a multiple of npatterns.
+     For eg: sel.length = 2 + 2x, and sel npatterns = 4.  */
+  poly_uint64 esel;
+  if (!multiple_p (sel.length (), sel_npatterns, &esel))
+    {
+      if (reason)
+	*reason = "sel.length is not multiple of sel_npatterns";
+      return false;
+    }
+
+  if (sel_nelts_per_pattern < 3)
+    return true;
+
+  for (unsigned pattern = 0; pattern < sel_npatterns; pattern++)
+    {
+      poly_uint64 a1 = sel[pattern + sel_npatterns];
+      poly_uint64 a2 = sel[pattern + 2 * sel_npatterns];
+      HOST_WIDE_INT step;
+      if (!poly_int64 (a2 - a1).is_constant (&step))
+	{
+	  if (reason)
+	    *reason = "step is not constant";
+	  return false;
+	}
+      // FIXME: Punt on step < 0 for now, revisit later.
+      if (step < 0)
+	return false;
+      if (step == 0)
+	continue;
+
+      if (!pow2p_hwi (step))
+	{
+	  if (reason)
+	    *reason = "step is not power of 2";
+	  return false;
+	}
+
+      /* Ensure that stepped sequence of the pattern selects elements
+	 only from the same input vector.  */
+      uint64_t q1, qe;
+      poly_uint64 r1, re;
+      poly_uint64 ae = a1 + (esel - 2) * step;
+      poly_uint64 arg_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+      if (!(can_div_trunc_p (a1, arg_len, &q1, &r1)
+	    && can_div_trunc_p (ae, arg_len, &qe, &re)
+	    && q1 == qe))
+	{
+	  if (reason)
+	    *reason = "crossed input vectors";
+	  return false;
+	}
+
+      /* Ensure that the stepped sequence always selects from the same
+	 input pattern.  */
+      unsigned arg_npatterns
+	= ((q1 & 0) == 0) ? VECTOR_CST_NPATTERNS (arg0)
+			  : VECTOR_CST_NPATTERNS (arg1);
+
+      if (!multiple_p (step, arg_npatterns))
+	{
+	  if (reason)
+	    *reason = "step is not multiple of npatterns";
+	  return false;
+	}
+    }
+
+  return true;
+}
+
+/* Try to fold permutation of ARG0 and ARG1 with SEL selector when
+   the input vectors are VECTOR_CST. Return NULL_TREE otherwise.
+   REASON has same purpose as described in
+   valid_mask_for_fold_vec_perm_cst_p.  */
+
+static tree
+fold_vec_perm_cst (tree type, tree arg0, tree arg1, const vec_perm_indices &sel,
+		   const char **reason = NULL)
+{
+  unsigned res_npatterns, res_nelts_per_pattern;
+  unsigned HOST_WIDE_INT res_nelts;
+
+  /* (1) If SEL is a suitable mask as determined by
+     valid_mask_for_fold_vec_perm_cst_p, then:
+     res_npatterns = max of npatterns between ARG0, ARG1, and SEL
+     res_nelts_per_pattern = max of nelts_per_pattern between
+			     ARG0, ARG1 and SEL.
+     (2) If SEL is not a suitable mask, and TYPE is VLS then:
+     res_npatterns = nelts in result vector.
+     res_nelts_per_pattern = 1.
+     This exception is made so that VLS ARG0, ARG1 and SEL work as before.  */
+  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
+    {
+      res_npatterns
+	= std::max (VECTOR_CST_NPATTERNS (arg0),
+		    std::max (VECTOR_CST_NPATTERNS (arg1),
+			      sel.encoding ().npatterns ()));
+
+      res_nelts_per_pattern
+	= std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
+		    std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1),
+			      sel.encoding ().nelts_per_pattern ()));
+
+      res_nelts = res_npatterns * res_nelts_per_pattern;
+    }
+  else if (TYPE_VECTOR_SUBPARTS (type).is_constant (&res_nelts))
+    {
+      res_npatterns = res_nelts;
+      res_nelts_per_pattern = 1;
+    }
+  else
+    return NULL_TREE;
+
+  tree_vector_builder out_elts (type, res_npatterns, res_nelts_per_pattern);
+  for (unsigned i = 0; i < res_nelts; i++)
+    {
+      poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+      uint64_t q;
+      poly_uint64 r;
+      unsigned HOST_WIDE_INT index;
+
+      /* Punt if sel[i] /trunc_div len cannot be determined,
+	 because the input vector to be chosen will depend on
+	 runtime vector length.
+	 For example if len == 4 + 4x, and sel[i] == 4,
+	 If len at runtime equals 4, we choose arg1[0].
+	 For any other value of len > 4 at runtime, we choose arg0[4].
+	 which makes the element choice dependent on runtime vector length.  */
+      if (!can_div_trunc_p (sel[i], len, &q, &r))
+	{
+	  if (reason)
+	    *reason = "cannot divide selector element by arg len";
+	  return NULL_TREE;
+	}
+
+      /* sel[i] % len will give the index of element in the chosen input
+	 vector. For example if sel[i] == 5 + 4x and len == 4 + 4x,
+	 we will choose arg1[1] since (5 + 4x) % (4 + 4x) == 1.  */
+      if (!r.is_constant (&index))
+	{
+	  if (reason)
+	    *reason = "remainder is not constant";
+	  return NULL_TREE;
+	}
+
+      tree arg = ((q & 1) == 0) ? arg0 : arg1;
+      tree elem = vector_cst_elt (arg, index);
+      out_elts.quick_push (elem);
+    }
+
+  return out_elts.build ();
+}
+
 /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL
    selector.  Return the folded VECTOR_CST or CONSTRUCTOR if successful,
    NULL_TREE otherwise.  */
@@ -10529,43 +10705,41 @@ fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
 {
   unsigned int i;
   unsigned HOST_WIDE_INT nelts;
-  bool need_ctor = false;
 
-  if (!sel.length ().is_constant (&nelts))
-    return NULL_TREE;
-  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts)
-	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts)
-	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts));
+  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), sel.length ())
+	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)),
+			   TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1))));
+
   if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type)
       || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type))
     return NULL_TREE;
 
+  if (TREE_CODE (arg0) == VECTOR_CST
+      && TREE_CODE (arg1) == VECTOR_CST)
+    return fold_vec_perm_cst (type, arg0, arg1, sel);
+
+  /* For fall back case, we want to ensure we have VLS vectors
+     with equal length.  */
+  if (!sel.length ().is_constant (&nelts))
+    return NULL_TREE;
+
+  gcc_assert (known_eq (sel.length (),
+			TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0))));
   tree *in_elts = XALLOCAVEC (tree, nelts * 2);
   if (!vec_cst_ctor_to_array (arg0, nelts, in_elts)
       || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts))
     return NULL_TREE;
 
-  tree_vector_builder out_elts (type, nelts, 1);
+  vec<constructor_elt, va_gc> *v;
+  vec_alloc (v, nelts);
   for (i = 0; i < nelts; i++)
     {
       HOST_WIDE_INT index;
       if (!sel[i].is_constant (&index))
 	return NULL_TREE;
-      if (!CONSTANT_CLASS_P (in_elts[index]))
-	need_ctor = true;
-      out_elts.quick_push (unshare_expr (in_elts[index]));
-    }
-
-  if (need_ctor)
-    {
-      vec<constructor_elt, va_gc> *v;
-      vec_alloc (v, nelts);
-      for (i = 0; i < nelts; i++)
-	CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, out_elts[i]);
-      return build_constructor (type, v);
+      CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, in_elts[index]);
     }
-  else
-    return out_elts.build ();
+  return build_constructor (type, v);
 }
 
 /* Try to fold a pointer difference of type TYPE two address expressions of
@@ -16892,6 +17066,588 @@ test_arithmetic_folding ()
 				   x);
 }
 
+namespace test_fold_vec_perm_cst {
+
+/* Build a VECTOR_CST corresponding to VMODE, and has
+   encoding given by NPATTERNS, NELTS_PER_PATTERN and STEP.
+   Fill it with randomized elements, using rand() % THRESHOLD.  */
+
+static tree
+build_vec_cst_rand (machine_mode vmode, unsigned npatterns,
+		    unsigned nelts_per_pattern,
+		    int step = 0, int threshold = 100)
+{
+  tree inner_type = lang_hooks.types.type_for_mode (GET_MODE_INNER (vmode), 1);
+  tree vectype = build_vector_type_for_mode (inner_type, vmode);
+  tree_vector_builder builder (vectype, npatterns, nelts_per_pattern);
+
+  // Fill a0 for each pattern
+  for (unsigned i = 0; i < npatterns; i++)
+    builder.quick_push (build_int_cst (inner_type, rand () % threshold));
+
+  if (nelts_per_pattern == 1)
+    return builder.build ();
+
+  // Fill a1 for each pattern
+  for (unsigned i = 0; i < npatterns; i++)
+    builder.quick_push (build_int_cst (inner_type, rand () % threshold));
+
+  if (nelts_per_pattern == 2)
+    return builder.build ();
+
+  for (unsigned i = npatterns * 2; i < npatterns * nelts_per_pattern; i++)
+    {
+      tree prev_elem = builder[i - npatterns];
+      int prev_elem_val = TREE_INT_CST_LOW (prev_elem);
+      int val = prev_elem_val + step;
+      builder.quick_push (build_int_cst (inner_type, val));
+    }
+
+  return builder.build ();
+}
+
+/* Validate result of VEC_PERM_EXPR folding for the unit-tests below,
+   when result is VLA.  */
+
+static void
+validate_res (unsigned npatterns, unsigned nelts_per_pattern,
+	      tree res, tree *expected_res)
+{
+  /* Actual npatterns and encoded_elts in res may be less than expected due
+     to canonicalization.  */
+  ASSERT_TRUE (res != NULL_TREE);
+  ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) <= npatterns);
+  ASSERT_TRUE (vector_cst_encoded_nelts (res) <= npatterns * nelts_per_pattern);
+
+  for (unsigned i = 0; i < npatterns * nelts_per_pattern; i++)
+    ASSERT_TRUE (operand_equal_p (VECTOR_CST_ELT (res, i), expected_res[i], 0));
+}
+
+/* Validate result of VEC_PERM_EXPR folding for the unit-tests below,
+   when the result is VLS.  */
+
+static void
+validate_res_vls (tree res, tree *expected_res, unsigned expected_nelts)
+{
+  ASSERT_TRUE (known_eq (VECTOR_CST_NELTS (res), expected_nelts));
+  for (unsigned i = 0; i < expected_nelts; i++)
+    ASSERT_TRUE (operand_equal_p (VECTOR_CST_ELT (res, i), expected_res[i], 0));
+}
+
+/* Helper routine to push multiple elements into BUILDER.  */
+template<unsigned N>
+static void builder_push_elems (vec_perm_builder& builder,
+				poly_uint64 (&elems)[N])
+{
+  for (unsigned i = 0; i < N; i++)
+    builder.quick_push (elems[i]);
+}
+
+#define ARG0(index) vector_cst_elt (arg0, index)
+#define ARG1(index) vector_cst_elt (arg1, index)
+
+/* Test cases where result is VNx4SI and input vectors are V4SI.  */
+
+static void
+test_vnx4si_v4si (machine_mode vnx4si_mode, machine_mode v4si_mode)
+{
+  for (int i = 0; i < 10; i++)
+    {
+      /* Case 1:
+	 sel = { 0, 4, 1, 5, ... }
+	 res = { arg[0], arg1[0], arg0[1], arg1[1], ...} // (4, 1)  */
+      {
+	tree arg0 = build_vec_cst_rand (v4si_mode, 4, 1, 0);
+	tree arg1 = build_vec_cst_rand (v4si_mode, 4, 1, 0);
+
+	tree inner_type
+	  = lang_hooks.types.type_for_mode (GET_MODE_INNER (vnx4si_mode), 1);
+	tree res_type = build_vector_type_for_mode (inner_type, vnx4si_mode);
+
+	poly_uint64 res_len = TYPE_VECTOR_SUBPARTS (res_type);
+	vec_perm_builder builder (res_len, 4, 1);
+	poly_uint64 mask_elems[] = { 0, 4, 1, 5 };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, res_len);
+	tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel);
+
+	tree expected_res[] = { ARG0(0), ARG1(0), ARG0(1), ARG1(1) };
+	validate_res (4, 1, res, expected_res);
+      }
+
+      /* Case 2: Same as case 1, but contains an out of bounds access which
+	 should wrap around.
+	 sel = {0, 8, 4, 12, ...} (4, 1)
+	 res = { arg0[0], arg0[0], arg1[0], arg1[0], ... } (4, 1).  */
+      {
+	tree arg0 = build_vec_cst_rand (v4si_mode, 4, 1, 0);
+	tree arg1 = build_vec_cst_rand (v4si_mode, 4, 1, 0);
+
+	tree inner_type
+	  = lang_hooks.types.type_for_mode (GET_MODE_INNER (vnx4si_mode), 1);
+	tree res_type = build_vector_type_for_mode (inner_type, vnx4si_mode);
+
+	poly_uint64 res_len = TYPE_VECTOR_SUBPARTS (res_type);
+	vec_perm_builder builder (res_len, 4, 1);
+	poly_uint64 mask_elems[] = { 0, 8, 4, 12 };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, res_len);
+	tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel);
+
+	tree expected_res[] = { ARG0(0), ARG0(0), ARG1(0), ARG1(0) };
+	validate_res (4, 1, res, expected_res);
+      }
+    }
+}
+
+/* Test cases where result is V4SI and input vectors are VNx4SI.  */
+
+static void
+test_v4si_vnx4si (machine_mode v4si_mode, machine_mode vnx4si_mode)
+{
+  for (int i = 0; i < 10; i++)
+    {
+      /* Case 1:
+	 sel = { 0, 1, 2, 3}
+	 res = { arg0[0], arg0[1], arg0[2], arg0[3] }.  */
+      {
+	tree arg0 = build_vec_cst_rand (vnx4si_mode, 4, 1);
+	tree arg1 = build_vec_cst_rand (vnx4si_mode, 4, 1);
+
+	tree inner_type
+	  = lang_hooks.types.type_for_mode (GET_MODE_INNER (v4si_mode), 1);
+	tree res_type = build_vector_type_for_mode (inner_type, v4si_mode);
+
+	poly_uint64 res_len = TYPE_VECTOR_SUBPARTS (res_type);
+	vec_perm_builder builder (res_len, 4, 1);
+	poly_uint64 mask_elems[] = {0, 1, 2, 3};
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, res_len);
+	tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel);
+
+	tree expected_res[] = { ARG0(0), ARG0(1), ARG0(2), ARG0(3) };
+	validate_res_vls (res, expected_res, 4);
+      }
+
+      /* Case 2: Same as Case 1, but crossing input vector.
+	 sel = {0, 2, 4, 6}
+	 In this case,the index 4 is ambiguous since len = 4 + 4x.
+	 Since we cannot determine, which vector to choose from during
+	 compile time, should return NULL_TREE.  */
+      {
+	tree arg0 = build_vec_cst_rand (vnx4si_mode, 4, 1);
+	tree arg1 = build_vec_cst_rand (vnx4si_mode, 4, 1);
+
+	tree inner_type
+	  = lang_hooks.types.type_for_mode (GET_MODE_INNER (v4si_mode), 1);
+	tree res_type = build_vector_type_for_mode (inner_type, v4si_mode);
+
+	poly_uint64 res_len = TYPE_VECTOR_SUBPARTS (res_type);
+	vec_perm_builder builder (res_len, 4, 1);
+	poly_uint64 mask_elems[] = {0, 2, 4, 6};
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, res_len);
+	const char *reason;
+	tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel, &reason);
+
+	ASSERT_TRUE (res == NULL_TREE);
+	ASSERT_TRUE (!strcmp (reason, "cannot divide selector element by arg len"));
+      }
+    }
+}
+
+/* Test all input vectors.  */
+
+static void
+test_all_nunits (machine_mode vmode)
+{
+  /* Test with 10 different inputs.  */
+  for (int i = 0; i < 10; i++)
+    {
+      tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
+      tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
+      poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+      /* Case 1: mask = {0, ...} // (1, 1)
+	 res = { arg0[0], ... } // (1, 1)  */
+      {
+	vec_perm_builder builder (len, 1, 1);
+	builder.quick_push (0);
+	vec_perm_indices sel (builder, 2, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+	tree expected_res[] = { ARG0(0) };
+	validate_res (1, 1, res, expected_res);
+      }
+
+      /* Case 2: mask = {len, ...} // (1, 1)
+	 res = { arg1[0], ... } // (1, 1)  */
+      {
+	vec_perm_builder builder (len, 1, 1);
+	builder.quick_push (len);
+	vec_perm_indices sel (builder, 2, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+	tree expected_res[] = { ARG1(0) };
+	validate_res (1, 1, res, expected_res);
+      }
+    }
+}
+
+/* Test all vectors which contain at-least 2 elements.  */
+
+static void
+test_nunits_min_2 (machine_mode vmode)
+{
+  for (int i = 0; i < 10; i++)
+    {
+      /* Case 1: mask = { 0, len, ... }  // (2, 1)
+	 res = { arg0[0], arg1[0], ... } // (2, 1)  */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
+	tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
+	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (len, 2, 1);
+	poly_uint64 mask_elems[] = { 0, len };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+	tree expected_res[] = { ARG0(0), ARG1(0) };
+	validate_res (2, 1, res, expected_res);
+      }
+
+      /* Case 2: mask = { 0, len, 1, len+1, ... } // (2, 2)
+	 res = { arg0[0], arg1[0], arg0[1], arg1[1], ... } // (2, 2)  */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
+	tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
+	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (len, 2, 2);
+	poly_uint64 mask_elems[] = { 0, len, 1, len + 1 };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+	tree expected_res[] = { ARG0(0), ARG1(0), ARG0(1), ARG1(1) };
+	validate_res (2, 2, res, expected_res);
+      }
+
+      /* Case 4: mask = {0, 0, 1, ...} // (1, 3)
+	 Test that the stepped sequence of the pattern selects from
+	 same input pattern. Since input vectors have npatterns = 2,
+	 and step (a2 - a1) = 1, step is not a multiple of npatterns
+	 in input vector. So return NULL_TREE.  */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
+	tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
+	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (len, 1, 3);
+	poly_uint64 mask_elems[] = { 0, 0, 1 };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, len);
+	const char *reason;
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel,
+				      &reason);
+	ASSERT_TRUE (res == NULL_TREE);
+	ASSERT_TRUE (!strcmp (reason, "step is not multiple of npatterns"));
+      }
+
+      /* Case 5: mask = {len, 0, 1, ...} // (1, 3)
+	 Test that stepped sequence of the pattern selects from arg0.
+	 res = { arg1[0], arg0[0], arg0[1], ... } // (1, 3)  */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
+	tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
+	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (len, 1, 3);
+	poly_uint64 mask_elems[] = { len, 0, 1 };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+	tree expected_res[] = { ARG1(0), ARG0(0), ARG0(1) };
+	validate_res (1, 3, res, expected_res);
+      }
+    }
+}
+
+/* Test all vectors which contain at-least 4 elements.  */
+
+static void
+test_nunits_min_4 (machine_mode vmode)
+{
+  for (int i = 0; i < 10; i++)
+    {
+      /* Case 1: mask = { 0, len, 1, len+1, ... } // (4, 1)
+	 res: { arg0[0], arg1[0], arg0[1], arg1[1], ... } // (4, 1)  */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
+	tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
+	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (len, 4, 1);
+	poly_uint64 mask_elems[] = { 0, len, 1, len + 1 };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+	tree expected_res[] = { ARG0(0), ARG1(0), ARG0(1), ARG1(1) };
+	validate_res (4, 1, res, expected_res);
+      }
+
+      /* Case 2: sel = {0, 1, 2, ...}  // (1, 3)
+	 res: { arg0[0], arg0[1], arg0[2], ... } // (1, 3) */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 1, 3, 2);
+	tree arg1 = build_vec_cst_rand (vmode, 1, 3, 2);
+	poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (arg0_len, 1, 3);
+	poly_uint64 mask_elems[] = {0, 1, 2};
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, arg0_len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+	tree expected_res[] = { ARG0(0), ARG0(1), ARG0(2) };
+	validate_res (1, 3, res, expected_res);
+      }
+
+      /* Case 3: sel = {len, len+1, len+2, ...} // (1, 3)
+	 res: { arg1[0], arg1[1], arg1[2], ... } // (1, 3) */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 1, 3, 2);
+	tree arg1 = build_vec_cst_rand (vmode, 1, 3, 2);
+	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (len, 1, 3);
+	poly_uint64 mask_elems[] = {len, len + 1, len + 2};
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+	tree expected_res[] = { ARG1(0), ARG1(1), ARG1(2) };
+	validate_res (1, 3, res, expected_res);
+      }
+
+      /* Case 4:
+	sel = { len, 0, 2, ... } // (1, 3)
+	This should return NULL because we cross the input vectors.
+	Because,
+	Let's assume len = C + Cx
+	a1 = 0
+	S = 2
+	esel = arg0_len / sel_npatterns = C + Cx
+	ae = 0 + (esel - 2) * S
+	   = 0 + (C + Cx - 2) * 2
+	   = 2(C-2) + 2Cx
+
+	For C >= 4:
+	Let q1 = a1 / arg0_len = 0 / (C + Cx) = 0
+	Let qe = ae / arg0_len = (2(C-2) + 2Cx) / (C + Cx) = 1
+	Since q1 != qe, we cross input vectors.
+	So return NULL_TREE.  */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 1, 3, 2);
+	tree arg1 = build_vec_cst_rand (vmode, 1, 3, 2);
+	poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (arg0_len, 1, 3);
+	poly_uint64 mask_elems[] = { arg0_len, 0, 2 };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, arg0_len);
+	const char *reason;
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel, &reason);
+	ASSERT_TRUE (res == NULL_TREE);
+	ASSERT_TRUE (!strcmp (reason, "crossed input vectors"));
+      }
+
+      /* Case 5: npatterns(arg0) = 4 > npatterns(sel) = 2
+	 mask = { 0, len, 1, len + 1, ...} // (2, 2)
+	 res = { arg0[0], arg1[0], arg0[1], arg1[1], ... } // (2, 2)
+
+	 Note that fold_vec_perm_cst will set
+	 res_npatterns = max(4, max(4, 2)) = 4
+	 However after canonicalizing, we will end up with shape (2, 2).  */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 4, 1);
+	tree arg1 = build_vec_cst_rand (vmode, 4, 1);
+	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (len, 2, 2);
+	poly_uint64 mask_elems[] = { 0, len, 1, len + 1 };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+	tree expected_res[] = { ARG0(0), ARG1(0), ARG0(1), ARG1(1) };
+	validate_res (2, 2, res, expected_res);
+      }
+
+      /* Case 6: Test combination in sel, where one pattern is dup and other
+	 is stepped sequence.
+	 sel = { 0, 0, 0, 1, 0, 2, ... } // (2, 3)
+	 res = { arg0[0], arg0[0], arg0[0],
+		 arg0[1], arg0[0], arg0[2], ... } // (2, 3)  */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
+	tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
+	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (len, 2, 3);
+	poly_uint64 mask_elems[] = { 0, 0, 0, 1, 0, 2 };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+	tree expected_res[] = { ARG0(0), ARG0(0), ARG0(0),
+				ARG0(1), ARG0(0), ARG0(2) };
+	validate_res (2, 3, res, expected_res);
+      }
+    }
+}
+
+/* Test all vectors which contain at-least 8 elements.  */
+
+static void
+test_nunits_min_8 (machine_mode vmode)
+{
+  for (int i = 0; i < 10; i++)
+    {
+      /* Case 1: sel_npatterns (4) > input npatterns (2)
+	 sel: { 0, 0, 1, len, 2, 0, 3, len, 4, 0, 5, len, ...} // (4, 3)
+	 res: { arg0[0], arg0[0], arg0[0], arg1[0],
+		arg0[2], arg0[0], arg0[3], arg1[0],
+		arg0[4], arg0[0], arg0[5], arg1[0], ... } // (4, 3)  */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 2, 3, 2);
+	tree arg1 = build_vec_cst_rand (vmode, 2, 3, 2);
+	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder(len, 4, 3);
+	poly_uint64 mask_elems[] = { 0, 0, 1, len, 2, 0, 3, len,
+				     4, 0, 5, len };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+	tree expected_res[] = { ARG0(0), ARG0(0), ARG0(1), ARG1(0),
+				ARG0(2), ARG0(0), ARG0(3), ARG1(0),
+				ARG0(4), ARG0(0), ARG0(5), ARG1(0) };
+	validate_res (4, 3, res, expected_res);
+      }
+    }
+}
+
+/* Test vectors for which nunits[0] <= 4.  */
+
+static void
+test_nunits_max_4 (machine_mode vmode)
+{
+  /* Case 1: mask = {0, 4, ...} // (1, 2)
+     This should return NULL_TREE because the index 4 may choose
+     from either arg0 or arg1 depending on vector length.  */
+  {
+    tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
+    tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
+    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (len, 1, 2);
+    poly_uint64 mask_elems[] = {0, 4};
+    builder_push_elems (builder, mask_elems);
+
+    vec_perm_indices sel (builder, 2, len);
+    const char *reason;
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel, &reason);
+    ASSERT_TRUE (res == NULL_TREE);
+    ASSERT_TRUE (reason != NULL);
+    ASSERT_TRUE (!strcmp (reason, "cannot divide selector element by arg len"));
+  }
+}
+
+#undef ARG0
+#undef ARG1
+
+/* Return true if SIZE is of the form C + Cx and C is power of 2.  */
+
+static bool
+is_simple_vla_size (poly_uint64 size)
+{
+  if (size.is_constant ()
+      || !pow2p_hwi (size.coeffs[0]))
+    return false;
+  for (unsigned i = 1; i < ARRAY_SIZE (size.coeffs); ++i)
+    if (size.coeffs[i] != (i <= 1 ? size.coeffs[0] : 0))
+      return false;
+  return true;
+}
+
+/* Execute fold_vec_perm_cst unit tests.  */
+
+static void
+test ()
+{
+  machine_mode vnx4si_mode = E_VOIDmode;
+  machine_mode v4si_mode = E_VOIDmode;
+
+  machine_mode vmode;
+  FOR_EACH_MODE_IN_CLASS (vmode, MODE_VECTOR_INT)
+    {
+      /* Obtain modes corresponding to VNx4SI and V4SI,
+	 to call mixed mode tests below.
+	 FIXME: Is there a better way to do this ?  */
+      if (GET_MODE_INNER (vmode) == SImode)
+	{
+	  poly_uint64 nunits = GET_MODE_NUNITS (vmode);
+	  if (is_simple_vla_size (nunits)
+	      && nunits.coeffs[0] == 4)
+	    vnx4si_mode = vmode;
+	  else if (known_eq (nunits, poly_uint64 (4)))
+	    v4si_mode = vmode;
+	}
+
+      if (!is_simple_vla_size (GET_MODE_NUNITS (vmode))
+	  || !targetm.vector_mode_supported_p (vmode))
+	continue;
+
+      poly_uint64 nunits = GET_MODE_NUNITS (vmode);
+      test_all_nunits (vmode);
+      if (nunits.coeffs[0] >= 2)
+	test_nunits_min_2 (vmode);
+      if (nunits.coeffs[0] >= 4)
+	test_nunits_min_4 (vmode);
+      if (nunits.coeffs[0] >= 8)
+	test_nunits_min_8 (vmode);
+
+      if (nunits.coeffs[0] <= 4)
+	test_nunits_max_4 (vmode);
+    }
+
+  if (vnx4si_mode != E_VOIDmode && v4si_mode != E_VOIDmode
+      && targetm.vector_mode_supported_p (vnx4si_mode)
+      && targetm.vector_mode_supported_p (v4si_mode))
+    {
+      test_vnx4si_v4si (vnx4si_mode, v4si_mode);
+      test_v4si_vnx4si (v4si_mode, vnx4si_mode);
+    }
+}
+} // end of test_fold_vec_perm_cst namespace
+
 /* Verify that various binary operations on vectors are folded
    correctly.  */
 
@@ -16943,6 +17699,7 @@ fold_const_cc_tests ()
   test_arithmetic_folding ();
   test_vector_folding ();
   test_vec_duplicate_folding ();
+  test_fold_vec_perm_cst::test ();
 }
 
 } // namespace selftest

Prathamesh Kulkarni Aug. 16, 2023, 8:53 a.m. UTC | #15

On Tue, 15 Aug 2023 at 16:59, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Mon, 14 Aug 2023 at 18:23, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
> >
> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > > On Thu, 10 Aug 2023 at 21:27, Richard Sandiford
> > > <richard.sandiford@arm.com> wrote:
> > >>
> > >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > >> >> static bool
> > >> >> is_simple_vla_size (poly_uint64 size)
> > >> >> {
> > >> >>   if (size.is_constant ())
> > >> >>     return false;
> > >> >>   for (int i = 1; i < ARRAY_SIZE (size.coeffs); ++i)
> > >> >>     if (size[i] != (i <= 1 ? size[0] : 0))
> > >> > Just wondering is this should be (i == 1 ? size[0] : 0) since i is
> > >> > initialized to 1 ?
> > >>
> > >> Both work.  I prefer <= 1 because it doesn't depend on the micro
> > >> optimisation to start at coefficient 1.  In a theoretical 3-indeterminate
> > >> poly_int, we want the first 2 coefficients to be nonzero and the rest to
> > >> be zero.
> > >>
> > >> > IIUC, is_simple_vla_size should return true for polynomials of first
> > >> > degree and having same coeff like 4 + 4x ?
> > >>
> > >> FWIW, poly_int only supports first-degree polynomials at the moment.
> > >> coeffs>2 means there is more than one indeterminate, rather than a
> > >> higher power.
> > > Oh OK, thanks for the clarification.
> > >>
> > >> >>       return false;
> > >> >>   return true;
> > >> >> }
> > >> >>
> > >> >>
> > >> >>   FOR_EACH_MODE_IN_CLASS (mode, MODE_VECTOR_INT)
> > >> >>     {
> > >> >>       auto nunits = GET_MODE_NUNITS (mode);
> > >> >>       if (!is_simple_vla_size (nunits))
> > >> >>         continue;
> > >> >>       if (nunits[0] ...)
> > >> >>         test_... (mode);
> > >> >>       ...
> > >> >>
> > >> >>     }
> > >> >>
> > >> >> test_vnx4si_v4si and test_v4si_vnx4si look good.  But with the
> > >> >> loop structure above, I think we can apply the test_vnx4si and
> > >> >> test_vnx16qi to more cases.  So the classification isn't the
> > >> >> exact number of elements, but instead a limit.
> > >> >>
> > >> >> I think the nunits[0] conditions for test_vnx4si are as follows
> > >> >> (inspection only, so could be wrong):
> > >> >>
> > >> >> > +/* Test cases where result and input vectors are VNx4SI  */
> > >> >> > +
> > >> >> > +static void
> > >> >> > +test_vnx4si (machine_mode vmode)
> > >> >> > +{
> > >> >> > +  /* Case 1: mask = {0, ...} */
> > >> >> > +  {
> > >> >> > +    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
> > >> >> > +    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
> > >> >> > +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > >> >> > +
> > >> >> > +    vec_perm_builder builder (len, 1, 1);
> > >> >> > +    builder.quick_push (0);
> > >> >> > +    vec_perm_indices sel (builder, 2, len);
> > >> >> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > >> >> > +
> > >> >> > +    tree expected_res[] = { vector_cst_elt (res, 0) };
> > >> > This should be { vector_cst_elt (arg0, 0) }; will fix in next patch.
> > >> >> > +    validate_res (1, 1, res, expected_res);
> > >> >> > +  }
> > >> >>
> > >> >> nunits[0] >= 2 (could be all nunits if the inputs had nelts_per_pattern==1,
> > >> >> which I think would be better)
> > >> > IIUC, the vectors that can be used for a particular test should have
> > >> > nunits[0] >= res_npatterns,
> > >> > where res_npatterns is as computed in fold_vec_perm_cst without the
> > >> > canonicalization ?
> > >> > For above test -- res_npatterns = max(2, max (2, 1)) == 2, so we
> > >> > require nunits[0] >= 2 ?
> > >> > Which implies we can use above test for vectors with length 2 + 2x, 4 + 4x, etc.
> > >>
> > >> Right, that's what I meant.  With the inputs as they stand it has to be
> > >> nunits[0] >= 2.  We need that form the inputs correctly.  But if the
> > >> inputs instead had nelts_per_pattern == 1, the test would work for all
> > >> nunits.
> > > In the attached patch, I have reordered the tests based on min or max limit.
> > > For tests where sel_npatterns < 3 (ie dup sequence), I have kept input
> > > npatterns = 1,
> > > so we can test more vector modes, and also input npatterns matter only
> > > for stepped sequence in sel
> > > (Since for a dup pattern we don't enforce the constraint of selecting
> > > elements from same input pattern).
> > > Does it look OK ?
> > >
> > > For the following tests with input vectors having shape (1, 3)
> > > sel = {0, 1, 2, ...}  // (1, 3)
> > > res = { arg0[0], arg0[1], arg0[2], ... } // (1, 3)
> > >
> > > and sel = {len, len + 1, len + 2, ... }  // (1, 3)
> > > res = { arg1[0], arg1[1], arg1[2], ... } // (1, 3)
> > >
> > > Altho res_npatterns = 1, I suppose these will need to be tested with
> > > vectors with length >= 4 + 4x,
> > > since index 2 can be ambiguous for length 2 + 2x  ?
> > > (In the patch, these are cases 2 and 3 in test_nunits_min_4)
> >
> > Ah, yeah, fair point.  I guess that means:
> >
> > +      /* Case 3: mask = {len, 0, 1, ...} // (1, 3)
> > +        Test that stepped sequence of the pattern selects from arg0.
> > +        res = { arg1[0], arg0[0], arg0[1], ... } // (1, 3)  */
> > +      {
> > +       tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
> > +       tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
> > +       poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +       vec_perm_builder builder (len, 1, 3);
> > +       poly_uint64 mask_elems[] = { len, 0, 1 };
> > +       builder_push_elems (builder, mask_elems);
> > +
> > +       vec_perm_indices sel (builder, 2, len);
> > +       tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +
> > +       tree expected_res[] = { ARG1(0), ARG0(0), ARG0(1) };
> > +       validate_res (1, 3, res, expected_res);
> > +      }
> >
> > needs to be min_2 after all.
> Ah indeed. Fixed, thanks.
> >
> > Also:
> >
> > > +/* Helper routine to push multiple elements into BUILDER.  */
> > > +
> > > +static void
> > > +builder_push_elems (vec_perm_builder& builder, poly_uint64 *elems)
> > > +{
> > > +  for (unsigned i = 0; i < builder.encoded_nelts (); i++)
> > > +    builder.quick_push (elems[i]);
> > > +}
> >
> > I think it'd be safer to make this:
> >
> > template<unsigned N>
> > builder_push_elems (vec_perm_builder& builder, poly_uint64 (&elems)[N])
> > {
> >   for (unsigned i = 0; i < N; i++)
> >     builder.quick_push (elems[i]);
> > }
> >
> > so that we only push elements that are in the array.
> Done, thanks.
> >
> > OK for trunk with those changes, thanks.
> Unfortunately, the patch regressed following tests on ppc64le and
> armhf respectively:
> gcc.target/powerpc/vec-perm-ctor.c scan-tree-dump-not optimized
> "VIEW_CONVERT_EXPR"
> gcc.dg/tree-ssa/forwprop-20.c scan-tree-dump-not forwprop1 "VEC_PERM_EXPR"
>
> This happens because of the change to vect_cst_ctor_array which
> removes handling of VECTOR_CST,
> and thus we return NULL_TREE for cases where VEC_PERM_EXPR has
> vector_cst, ctor input operands.
>
> For eg we fail to fold VEC_PERM_EXPR for the following test taken from
> forwprop-20.c:
> void f (double d, vecf* r)
> {
>   vecf x = { -d, 5 };
>   vecf y = {  1, 4 };
>   veci m = {  2, 0 };
>   *r = __builtin_shuffle (x, y, m); // { 1, -d }
> }
> because vect_cst_ctor_to_array will now return NULL_TREE for vector_cst {1, 4}.
>
> The attached patch thus reverts the changes to vect_cst_ctor_to_array,
> which makes the tests pass again.
> I have put the patch for another round of bootstrap+test on the above
> targets (aarch64, aarch64-sve, x86_64, armhf, ppc64le).
> OK to commit if it passes ?
The patch now passes bootstrap+test on all these targets.

Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
> >
> > Richard
> >
> > > +
> > > +#define ARG0(index) vector_cst_elt (arg0, index)
> > > +#define ARG1(index) vector_cst_elt (arg1, index)
> > > +
> > > +/* Test cases where result is VNx4SI and input vectors are V4SI.  */
> > > +
> > > +static void
> > > +test_vnx4si_v4si (machine_mode vnx4si_mode, machine_mode v4si_mode)
> > > +{
> > > +  for (int i = 0; i < 10; i++)
> > > +    {
> > > +      /* Case 1:
> > > +      sel = { 0, 4, 1, 5, ... }
> > > +      res = { arg[0], arg1[0], arg0[1], arg1[1], ...} // (4, 1)  */
> > > +      {
> > > +     tree arg0 = build_vec_cst_rand (v4si_mode, 4, 1, 0);
> > > +     tree arg1 = build_vec_cst_rand (v4si_mode, 4, 1, 0);
> > > +
> > > +     tree inner_type
> > > +       = lang_hooks.types.type_for_mode (GET_MODE_INNER (vnx4si_mode), 1);
> > > +     tree res_type = build_vector_type_for_mode (inner_type, vnx4si_mode);
> > > +
> > > +     poly_uint64 res_len = TYPE_VECTOR_SUBPARTS (res_type);
> > > +     vec_perm_builder builder (res_len, 4, 1);
> > > +     poly_uint64 mask_elems[] = { 0, 4, 1, 5 };
> > > +     builder_push_elems (builder, mask_elems);
> > > +
> > > +     vec_perm_indices sel (builder, 2, res_len);
> > > +     tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel);
> > > +
> > > +     tree expected_res[] = { ARG0(0), ARG1(0), ARG0(1), ARG1(1) };
> > > +     validate_res (4, 1, res, expected_res);
> > > +      }
> > > +
> > > +      /* Case 2: Same as case 1, but contains an out of bounds access which
> > > +      should wrap around.
> > > +      sel = {0, 8, 4, 12, ...} (4, 1)
> > > +      res = { arg0[0], arg0[0], arg1[0], arg1[0], ... } (4, 1).  */
> > > +      {
> > > +     tree arg0 = build_vec_cst_rand (v4si_mode, 4, 1, 0);
> > > +     tree arg1 = build_vec_cst_rand (v4si_mode, 4, 1, 0);
> > > +
> > > +     tree inner_type
> > > +       = lang_hooks.types.type_for_mode (GET_MODE_INNER (vnx4si_mode), 1);
> > > +     tree res_type = build_vector_type_for_mode (inner_type, vnx4si_mode);
> > > +
> > > +     poly_uint64 res_len = TYPE_VECTOR_SUBPARTS (res_type);
> > > +     vec_perm_builder builder (res_len, 4, 1);
> > > +     poly_uint64 mask_elems[] = { 0, 8, 4, 12 };
> > > +     builder_push_elems (builder, mask_elems);
> > > +
> > > +     vec_perm_indices sel (builder, 2, res_len);
> > > +     tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel);
> > > +
> > > +     tree expected_res[] = { ARG0(0), ARG0(0), ARG1(0), ARG1(0) };
> > > +     validate_res (4, 1, res, expected_res);
> > > +      }
> > > +    }
> > > +}
> > > +
> > > +/* Test cases where result is V4SI and input vectors are VNx4SI.  */
> > > +
> > > +static void
> > > +test_v4si_vnx4si (machine_mode v4si_mode, machine_mode vnx4si_mode)
> > > +{
> > > +  for (int i = 0; i < 10; i++)
> > > +    {
> > > +      /* Case 1:
> > > +      sel = { 0, 1, 2, 3}
> > > +      res = { arg0[0], arg0[1], arg0[2], arg0[3] }.  */
> > > +      {
> > > +     tree arg0 = build_vec_cst_rand (vnx4si_mode, 4, 1);
> > > +     tree arg1 = build_vec_cst_rand (vnx4si_mode, 4, 1);
> > > +
> > > +     tree inner_type
> > > +       = lang_hooks.types.type_for_mode (GET_MODE_INNER (v4si_mode), 1);
> > > +     tree res_type = build_vector_type_for_mode (inner_type, v4si_mode);
> > > +
> > > +     poly_uint64 res_len = TYPE_VECTOR_SUBPARTS (res_type);
> > > +     vec_perm_builder builder (res_len, 4, 1);
> > > +     poly_uint64 mask_elems[] = {0, 1, 2, 3};
> > > +     builder_push_elems (builder, mask_elems);
> > > +
> > > +     vec_perm_indices sel (builder, 2, res_len);
> > > +     tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel);
> > > +
> > > +     tree expected_res[] = { ARG0(0), ARG0(1), ARG0(2), ARG0(3) };
> > > +     validate_res_vls (res, expected_res, 4);
> > > +      }
> > > +
> > > +      /* Case 2: Same as Case 1, but crossing input vector.
> > > +      sel = {0, 2, 4, 6}
> > > +      In this case,the index 4 is ambiguous since len = 4 + 4x.
> > > +      Since we cannot determine, which vector to choose from during
> > > +      compile time, should return NULL_TREE.  */
> > > +      {
> > > +     tree arg0 = build_vec_cst_rand (vnx4si_mode, 4, 1);
> > > +     tree arg1 = build_vec_cst_rand (vnx4si_mode, 4, 1);
> > > +
> > > +     tree inner_type
> > > +       = lang_hooks.types.type_for_mode (GET_MODE_INNER (v4si_mode), 1);
> > > +     tree res_type = build_vector_type_for_mode (inner_type, v4si_mode);
> > > +
> > > +     poly_uint64 res_len = TYPE_VECTOR_SUBPARTS (res_type);
> > > +     vec_perm_builder builder (res_len, 4, 1);
> > > +     poly_uint64 mask_elems[] = {0, 2, 4, 6};
> > > +     builder_push_elems (builder, mask_elems);
> > > +
> > > +     vec_perm_indices sel (builder, 2, res_len);
> > > +     const char *reason;
> > > +     tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel, &reason);
> > > +
> > > +     ASSERT_TRUE (res == NULL_TREE);
> > > +     ASSERT_TRUE (!strcmp (reason, "cannot divide selector element by arg len"));
> > > +      }
> > > +    }
> > > +}
> > > +
> > > +/* Test all input vectors.  */
> > > +
> > > +static void
> > > +test_all_nunits (machine_mode vmode)
> > > +{
> > > +  /* Test with 10 different inputs.  */
> > > +  for (int i = 0; i < 10; i++)
> > > +    {
> > > +      tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
> > > +      tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
> > > +      poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > > +
> > > +      /* Case 1: mask = {0, ...} // (1, 1)
> > > +      res = { arg0[0], ... } // (1, 1)  */
> > > +      {
> > > +     vec_perm_builder builder (len, 1, 1);
> > > +     builder.quick_push (0);
> > > +     vec_perm_indices sel (builder, 2, len);
> > > +     tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > > +     tree expected_res[] = { ARG0(0) };
> > > +     validate_res (1, 1, res, expected_res);
> > > +      }
> > > +
> > > +      /* Case 2: mask = {len, ...} // (1, 1)
> > > +      res = { arg1[0], ... } // (1, 1)  */
> > > +      {
> > > +     vec_perm_builder builder (len, 1, 1);
> > > +     builder.quick_push (len);
> > > +     vec_perm_indices sel (builder, 2, len);
> > > +     tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > > +
> > > +     tree expected_res[] = { ARG1(0) };
> > > +     validate_res (1, 1, res, expected_res);
> > > +      }
> > > +
> > > +      /* Case 3: mask = {len, 0, 1, ...} // (1, 3)
> > > +      Test that stepped sequence of the pattern selects from arg0.
> > > +      res = { arg1[0], arg0[0], arg0[1], ... } // (1, 3)  */
> > > +      {
> > > +     tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
> > > +     tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
> > > +     poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > > +
> > > +     vec_perm_builder builder (len, 1, 3);
> > > +     poly_uint64 mask_elems[] = { len, 0, 1 };
> > > +     builder_push_elems (builder, mask_elems);
> > > +
> > > +     vec_perm_indices sel (builder, 2, len);
> > > +     tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > > +
> > > +     tree expected_res[] = { ARG1(0), ARG0(0), ARG0(1) };
> > > +     validate_res (1, 3, res, expected_res);
> > > +      }
> > > +    }
> > > +}
> > > +
> > > +/* Test all vectors which contain at-least 2 elements.  */
> > > +
> > > +static void
> > > +test_nunits_min_2 (machine_mode vmode)
> > > +{
> > > +  for (int i = 0; i < 10; i++)
> > > +    {
> > > +      /* Case 1: mask = { 0, len, ... }  // (2, 1)
> > > +      res = { arg0[0], arg1[0], ... } // (2, 1)  */
> > > +      {
> > > +     tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
> > > +     tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
> > > +     poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > > +
> > > +     vec_perm_builder builder (len, 2, 1);
> > > +     poly_uint64 mask_elems[] = { 0, len };
> > > +     builder_push_elems (builder, mask_elems);
> > > +
> > > +     vec_perm_indices sel (builder, 2, len);
> > > +     tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > > +
> > > +     tree expected_res[] = { ARG0(0), ARG1(0) };
> > > +     validate_res (2, 1, res, expected_res);
> > > +      }
> > > +
> > > +      /* Case 2: mask = { 0, len, 1, len+1, ... } // (2, 2)
> > > +      res = { arg0[0], arg1[0], arg0[1], arg1[1], ... } // (2, 2)  */
> > > +      {
> > > +     tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
> > > +     tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
> > > +     poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > > +
> > > +     vec_perm_builder builder (len, 2, 2);
> > > +     poly_uint64 mask_elems[] = { 0, len, 1, len + 1 };
> > > +     builder_push_elems (builder, mask_elems);
> > > +
> > > +     vec_perm_indices sel (builder, 2, len);
> > > +     tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > > +
> > > +     tree expected_res[] = { ARG0(0), ARG1(0), ARG0(1), ARG1(1) };
> > > +     validate_res (2, 2, res, expected_res);
> > > +      }
> > > +
> > > +      /* Case 4: mask = {0, 0, 1, ...} // (1, 3)
> > > +      Test that the stepped sequence of the pattern selects from
> > > +      same input pattern. Since input vectors have npatterns = 2,
> > > +      and step (a2 - a1) = 1, step is not a multiple of npatterns
> > > +      in input vector. So return NULL_TREE.  */
> > > +      {
> > > +     tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
> > > +     tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
> > > +     poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > > +
> > > +     vec_perm_builder builder (len, 1, 3);
> > > +     poly_uint64 mask_elems[] = { 0, 0, 1 };
> > > +     builder_push_elems (builder, mask_elems);
> > > +
> > > +     vec_perm_indices sel (builder, 2, len);
> > > +     const char *reason;
> > > +     tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel,
> > > +                                   &reason);
> > > +     ASSERT_TRUE (res == NULL_TREE);
> > > +     ASSERT_TRUE (!strcmp (reason, "step is not multiple of npatterns"));
> > > +      }
> > > +    }
> > > +}
> > > +
> > > +/* Test all vectors which contain at-least 4 elements.  */
> > > +
> > > +static void
> > > +test_nunits_min_4 (machine_mode vmode)
> > > +{
> > > +  for (int i = 0; i < 10; i++)
> > > +    {
> > > +      /* Case 1: mask = { 0, len, 1, len+1, ... } // (4, 1)
> > > +      res: { arg0[0], arg1[0], arg0[1], arg1[1], ... } // (4, 1)  */
> > > +      {
> > > +     tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
> > > +     tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
> > > +     poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > > +
> > > +     vec_perm_builder builder (len, 4, 1);
> > > +     poly_uint64 mask_elems[] = { 0, len, 1, len + 1 };
> > > +     builder_push_elems (builder, mask_elems);
> > > +
> > > +     vec_perm_indices sel (builder, 2, len);
> > > +     tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > > +
> > > +     tree expected_res[] = { ARG0(0), ARG1(0), ARG0(1), ARG1(1) };
> > > +     validate_res (4, 1, res, expected_res);
> > > +      }
> > > +
> > > +      /* Case 2: sel = {0, 1, 2, ...}  // (1, 3)
> > > +      res: { arg0[0], arg0[1], arg0[2], ... } // (1, 3) */
> > > +      {
> > > +     tree arg0 = build_vec_cst_rand (vmode, 1, 3, 2);
> > > +     tree arg1 = build_vec_cst_rand (vmode, 1, 3, 2);
> > > +     poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > > +
> > > +     vec_perm_builder builder (arg0_len, 1, 3);
> > > +     poly_uint64 mask_elems[] = {0, 1, 2};
> > > +     builder_push_elems (builder, mask_elems);
> > > +
> > > +     vec_perm_indices sel (builder, 2, arg0_len);
> > > +     tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > > +     tree expected_res[] = { ARG0(0), ARG0(1), ARG0(2) };
> > > +     validate_res (1, 3, res, expected_res);
> > > +      }
> > > +
> > > +      /* Case 3: sel = {len, len+1, len+2, ...} // (1, 3)
> > > +      res: { arg1[0], arg1[1], arg1[2], ... } // (1, 3) */
> > > +      {
> > > +     tree arg0 = build_vec_cst_rand (vmode, 1, 3, 2);
> > > +     tree arg1 = build_vec_cst_rand (vmode, 1, 3, 2);
> > > +     poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > > +
> > > +     vec_perm_builder builder (len, 1, 3);
> > > +     poly_uint64 mask_elems[] = {len, len + 1, len + 2};
> > > +     builder_push_elems (builder, mask_elems);
> > > +
> > > +     vec_perm_indices sel (builder, 2, len);
> > > +     tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > > +     tree expected_res[] = { ARG1(0), ARG1(1), ARG1(2) };
> > > +     validate_res (1, 3, res, expected_res);
> > > +      }
> > > +
> > > +      /* Case 4:
> > > +     sel = { len, 0, 2, ... } // (1, 3)
> > > +     This should return NULL because we cross the input vectors.
> > > +     Because,
> > > +     Let's assume len = C + Cx
> > > +     a1 = 0
> > > +     S = 2
> > > +     esel = arg0_len / sel_npatterns = C + Cx
> > > +     ae = 0 + (esel - 2) * S
> > > +        = 0 + (C + Cx - 2) * 2
> > > +        = 2(C-2) + 2Cx
> > > +
> > > +     For C >= 4:
> > > +     Let q1 = a1 / arg0_len = 0 / (C + Cx) = 0
> > > +     Let qe = ae / arg0_len = (2(C-2) + 2Cx) / (C + Cx) = 1
> > > +     Since q1 != qe, we cross input vectors.
> > > +     So return NULL_TREE.  */
> > > +      {
> > > +     tree arg0 = build_vec_cst_rand (vmode, 1, 3, 2);
> > > +     tree arg1 = build_vec_cst_rand (vmode, 1, 3, 2);
> > > +     poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > > +
> > > +     vec_perm_builder builder (arg0_len, 1, 3);
> > > +     poly_uint64 mask_elems[] = { arg0_len, 0, 2 };
> > > +     builder_push_elems (builder, mask_elems);
> > > +
> > > +     vec_perm_indices sel (builder, 2, arg0_len);
> > > +     const char *reason;
> > > +     tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel, &reason);
> > > +     ASSERT_TRUE (res == NULL_TREE);
> > > +     ASSERT_TRUE (!strcmp (reason, "crossed input vectors"));
> > > +      }
> > > +
> > > +      /* Case 5: npatterns(arg0) = 4 > npatterns(sel) = 2
> > > +      mask = { 0, len, 1, len + 1, ...} // (2, 2)
> > > +      res = { arg0[0], arg1[0], arg0[1], arg1[1], ... } // (2, 2)
> > > +
> > > +      Note that fold_vec_perm_cst will set
> > > +      res_npatterns = max(4, max(4, 2)) = 4
> > > +      However after canonicalizing, we will end up with shape (2, 2).  */
> > > +      {
> > > +     tree arg0 = build_vec_cst_rand (vmode, 4, 1);
> > > +     tree arg1 = build_vec_cst_rand (vmode, 4, 1);
> > > +     poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > > +
> > > +     vec_perm_builder builder (len, 2, 2);
> > > +     poly_uint64 mask_elems[] = { 0, len, 1, len + 1 };
> > > +     builder_push_elems (builder, mask_elems);
> > > +
> > > +     vec_perm_indices sel (builder, 2, len);
> > > +     tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > > +     tree expected_res[] = { ARG0(0), ARG1(0), ARG0(1), ARG1(1) };
> > > +     validate_res (2, 2, res, expected_res);
> > > +      }
> > > +
> > > +      /* Case 6: Test combination in sel, where one pattern is dup and other
> > > +      is stepped sequence.
> > > +      sel = { 0, 0, 0, 1, 0, 2, ... } // (2, 3)
> > > +      res = { arg0[0], arg0[0], arg0[0],
> > > +              arg0[1], arg0[0], arg0[2], ... } // (2, 3)  */
> > > +      {
> > > +     tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
> > > +     tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
> > > +     poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > > +
> > > +     vec_perm_builder builder (len, 2, 3);
> > > +     poly_uint64 mask_elems[] = { 0, 0, 0, 1, 0, 2 };
> > > +     builder_push_elems (builder, mask_elems);
> > > +
> > > +     vec_perm_indices sel (builder, 2, len);
> > > +     tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > > +
> > > +     tree expected_res[] = { ARG0(0), ARG0(0), ARG0(0),
> > > +                             ARG0(1), ARG0(0), ARG0(2) };
> > > +     validate_res (2, 3, res, expected_res);
> > > +      }
> > > +    }
> > > +}
> > > +
> > > +/* Test all vectors which contain at-least 8 elements.  */
> > > +
> > > +static void
> > > +test_nunits_min_8 (machine_mode vmode)
> > > +{
> > > +  for (int i = 0; i < 10; i++)
> > > +    {
> > > +      /* Case 1: sel_npatterns (4) > input npatterns (2)
> > > +      sel: { 0, 0, 1, len, 2, 0, 3, len, 4, 0, 5, len, ...} // (4, 3)
> > > +      res: { arg0[0], arg0[0], arg0[0], arg1[0],
> > > +             arg0[2], arg0[0], arg0[3], arg1[0],
> > > +             arg0[4], arg0[0], arg0[5], arg1[0], ... } // (4, 3)  */
> > > +      {
> > > +     tree arg0 = build_vec_cst_rand (vmode, 2, 3, 2);
> > > +     tree arg1 = build_vec_cst_rand (vmode, 2, 3, 2);
> > > +     poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > > +
> > > +     vec_perm_builder builder(len, 4, 3);
> > > +     poly_uint64 mask_elems[] = { 0, 0, 1, len, 2, 0, 3, len,
> > > +                                  4, 0, 5, len };
> > > +     builder_push_elems (builder, mask_elems);
> > > +
> > > +     vec_perm_indices sel (builder, 2, len);
> > > +     tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > > +
> > > +     tree expected_res[] = { ARG0(0), ARG0(0), ARG0(1), ARG1(0),
> > > +                             ARG0(2), ARG0(0), ARG0(3), ARG1(0),
> > > +                             ARG0(4), ARG0(0), ARG0(5), ARG1(0) };
> > > +     validate_res (4, 3, res, expected_res);
> > > +      }
> > > +    }
> > > +}
> > > +
> > > +/* Test vectors for which nunits[0] <= 4.  */
> > > +
> > > +static void
> > > +test_nunits_max_4 (machine_mode vmode)
> > > +{
> > > +  /* Case 1: mask = {0, 4, ...} // (1, 2)
> > > +     This should return NULL_TREE because the index 4 may choose
> > > +     from either arg0 or arg1 depending on vector length.  */
> > > +  {
> > > +    tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
> > > +    tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
> > > +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > > +
> > > +    vec_perm_builder builder (len, 1, 2);
> > > +    poly_uint64 mask_elems[] = {0, 4};
> > > +    builder_push_elems (builder, mask_elems);
> > > +
> > > +    vec_perm_indices sel (builder, 2, len);
> > > +    const char *reason;
> > > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel, &reason);
> > > +    ASSERT_TRUE (res == NULL_TREE);
> > > +    ASSERT_TRUE (reason != NULL);
> > > +    ASSERT_TRUE (!strcmp (reason, "cannot divide selector element by arg len"));
> > > +  }
> > > +}
> > > +
> > > +#undef ARG0
> > > +#undef ARG1
> > > +
> > > +/* Return true if SIZE is of the form C + Cx and C is power of 2.  */
> > > +
> > > +static bool
> > > +is_simple_vla_size (poly_uint64 size)
> > > +{
> > > +  if (size.is_constant ()
> > > +      || !pow2p_hwi (size.coeffs[0]))
> > > +    return false;
> > > +  for (unsigned i = 1; i < ARRAY_SIZE (size.coeffs); ++i)
> > > +    if (size.coeffs[i] != (i <= 1 ? size.coeffs[0] : 0))
> > > +      return false;
> > > +  return true;
> > > +}
> > > +
> > > +/* Execute fold_vec_perm_cst unit tests.  */
> > > +
> > > +static void
> > > +test ()
> > > +{
> > > +  machine_mode vnx4si_mode = E_VOIDmode;
> > > +  machine_mode v4si_mode = E_VOIDmode;
> > > +
> > > +  machine_mode vmode;
> > > +  FOR_EACH_MODE_IN_CLASS (vmode, MODE_VECTOR_INT)
> > > +    {
> > > +      /* Obtain modes corresponding to VNx4SI and V4SI,
> > > +      to call mixed mode tests below.
> > > +      FIXME: Is there a better way to do this ?  */
> > > +      if (GET_MODE_INNER (vmode) == SImode)
> > > +     {
> > > +       poly_uint64 nunits = GET_MODE_NUNITS (vmode);
> > > +       if (is_simple_vla_size (nunits)
> > > +           && nunits.coeffs[0] == 4)
> > > +         vnx4si_mode = vmode;
> > > +       else if (known_eq (nunits, poly_uint64 (4)))
> > > +         v4si_mode = vmode;
> > > +     }
> > > +
> > > +      if (!is_simple_vla_size (GET_MODE_NUNITS (vmode))
> > > +       || !targetm.vector_mode_supported_p (vmode))
> > > +     continue;
> > > +
> > > +      poly_uint64 nunits = GET_MODE_NUNITS (vmode);
> > > +      test_all_nunits (vmode);
> > > +      if (nunits.coeffs[0] >= 2)
> > > +     test_nunits_min_2 (vmode);
> > > +      if (nunits.coeffs[0] >= 4)
> > > +     test_nunits_min_4 (vmode);
> > > +      if (nunits.coeffs[0] >= 8)
> > > +     test_nunits_min_8 (vmode);
> > > +
> > > +      if (nunits.coeffs[0] <= 4)
> > > +     test_nunits_max_4 (vmode);
> > > +    }
> > > +
> > > +  if (vnx4si_mode != E_VOIDmode && v4si_mode != E_VOIDmode
> > > +      && targetm.vector_mode_supported_p (vnx4si_mode)
> > > +      && targetm.vector_mode_supported_p (v4si_mode))
> > > +    {
> > > +      test_vnx4si_v4si (vnx4si_mode, v4si_mode);
> > > +      test_v4si_vnx4si (v4si_mode, vnx4si_mode);
> > > +    }
> > > +}
> > > +}; // end of test_fold_vec_perm_cst namespace
> > > +
> > >  /* Verify that various binary operations on vectors are folded
> > >     correctly.  */
> > >
> > > @@ -16943,6 +17693,7 @@ fold_const_cc_tests ()
> > >    test_arithmetic_folding ();
> > >    test_vector_folding ();
> > >    test_vec_duplicate_folding ();
> > > +  test_fold_vec_perm_cst::test ();
> > >  }
> > >
> > >  } // namespace selftest
Extend fold_vec_perm to handle VLA vector_cst.

gcc/ChangeLog:
	* fold-const.cc (INCLUDE_ALGORITHM): Add Include.
	(valid_mask_for_fold_vec_perm_cst_p): New function.
	(fold_vec_perm_cst): Likewise.
	(fold_vec_perm): Adjust assert and call fold_vec_perm_cst.
	(test_fold_vec_perm_cst): New namespace.
	(test_fold_vec_perm_cst::build_vec_cst_rand): New function.
	(test_fold_vec_perm_cst::validate_res): Likewise.
	(test_fold_vec_perm_cst::validate_res_vls): Likewise.
	(test_fold_vec_perm_cst::builder_push_elems): Likewise.
	(test_fold_vec_perm_cst::test_vnx4si_v4si): Likewise.
	(test_fold_vec_perm_cst::test_v4si_vnx4si): Likewise.
	(test_fold_vec_perm_cst::test_all_nunits): Likewise.
	(test_fold_vec_perm_cst::test_nunits_min_2): Likewise.
	(test_fold_vec_perm_cst::test_nunits_min_4): Likewise.
	(test_fold_vec_perm_cst::test_nunits_min_8): Likewise.
	(test_fold_vec_perm_cst::test_nunits_max_4): Likewise.
	(test_fold_vec_perm_cst::is_simple_vla_size): Likewise.
	(test_fold_vec_perm_cst::test): Likewise.
	(fold_const_cc_tests): Call test_fold_vec_perm_cst::test.

Co-authored-by: Richard Sandiford <richard.sandiford@arm.com>

diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index 7e5494dfd39..c6fb083027d 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -40,6 +40,7 @@ along with GCC; see the file COPYING3.  If not see
    gimple code, we need to handle GIMPLE tuples as well as their
    corresponding tree equivalents.  */
 
+#define INCLUDE_ALGORITHM
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
@@ -10520,6 +10521,181 @@ vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
   return true;
 }
 
+/* Helper routine for fold_vec_perm_cst to check if SEL is a suitable
+   mask for VLA vec_perm folding.
+   REASON if specified, will contain the reason why SEL is not suitable.
+   Used only for debugging and unit-testing.  */
+
+static bool
+valid_mask_for_fold_vec_perm_cst_p (tree arg0, tree arg1,
+				    const vec_perm_indices &sel,
+				    const char **reason = NULL)
+{
+  unsigned sel_npatterns = sel.encoding ().npatterns ();
+  unsigned sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
+
+  if (!(pow2p_hwi (sel_npatterns)
+	&& pow2p_hwi (VECTOR_CST_NPATTERNS (arg0))
+	&& pow2p_hwi (VECTOR_CST_NPATTERNS (arg1))))
+    {
+      if (reason)
+	*reason = "npatterns is not power of 2";
+      return false;
+    }
+
+  /* We want to avoid cases where sel.length is not a multiple of npatterns.
+     For eg: sel.length = 2 + 2x, and sel npatterns = 4.  */
+  poly_uint64 esel;
+  if (!multiple_p (sel.length (), sel_npatterns, &esel))
+    {
+      if (reason)
+	*reason = "sel.length is not multiple of sel_npatterns";
+      return false;
+    }
+
+  if (sel_nelts_per_pattern < 3)
+    return true;
+
+  for (unsigned pattern = 0; pattern < sel_npatterns; pattern++)
+    {
+      poly_uint64 a1 = sel[pattern + sel_npatterns];
+      poly_uint64 a2 = sel[pattern + 2 * sel_npatterns];
+      HOST_WIDE_INT step;
+      if (!poly_int64 (a2 - a1).is_constant (&step))
+	{
+	  if (reason)
+	    *reason = "step is not constant";
+	  return false;
+	}
+      // FIXME: Punt on step < 0 for now, revisit later.
+      if (step < 0)
+	return false;
+      if (step == 0)
+	continue;
+
+      if (!pow2p_hwi (step))
+	{
+	  if (reason)
+	    *reason = "step is not power of 2";
+	  return false;
+	}
+
+      /* Ensure that stepped sequence of the pattern selects elements
+	 only from the same input vector.  */
+      uint64_t q1, qe;
+      poly_uint64 r1, re;
+      poly_uint64 ae = a1 + (esel - 2) * step;
+      poly_uint64 arg_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+      if (!(can_div_trunc_p (a1, arg_len, &q1, &r1)
+	    && can_div_trunc_p (ae, arg_len, &qe, &re)
+	    && q1 == qe))
+	{
+	  if (reason)
+	    *reason = "crossed input vectors";
+	  return false;
+	}
+
+      /* Ensure that the stepped sequence always selects from the same
+	 input pattern.  */
+      unsigned arg_npatterns
+	= ((q1 & 0) == 0) ? VECTOR_CST_NPATTERNS (arg0)
+			  : VECTOR_CST_NPATTERNS (arg1);
+
+      if (!multiple_p (step, arg_npatterns))
+	{
+	  if (reason)
+	    *reason = "step is not multiple of npatterns";
+	  return false;
+	}
+    }
+
+  return true;
+}
+
+/* Try to fold permutation of ARG0 and ARG1 with SEL selector when
+   the input vectors are VECTOR_CST. Return NULL_TREE otherwise.
+   REASON has same purpose as described in
+   valid_mask_for_fold_vec_perm_cst_p.  */
+
+static tree
+fold_vec_perm_cst (tree type, tree arg0, tree arg1, const vec_perm_indices &sel,
+		   const char **reason = NULL)
+{
+  unsigned res_npatterns, res_nelts_per_pattern;
+  unsigned HOST_WIDE_INT res_nelts;
+
+  /* (1) If SEL is a suitable mask as determined by
+     valid_mask_for_fold_vec_perm_cst_p, then:
+     res_npatterns = max of npatterns between ARG0, ARG1, and SEL
+     res_nelts_per_pattern = max of nelts_per_pattern between
+			     ARG0, ARG1 and SEL.
+     (2) If SEL is not a suitable mask, and TYPE is VLS then:
+     res_npatterns = nelts in result vector.
+     res_nelts_per_pattern = 1.
+     This exception is made so that VLS ARG0, ARG1 and SEL work as before.  */
+  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
+    {
+      res_npatterns
+	= std::max (VECTOR_CST_NPATTERNS (arg0),
+		    std::max (VECTOR_CST_NPATTERNS (arg1),
+			      sel.encoding ().npatterns ()));
+
+      res_nelts_per_pattern
+	= std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
+		    std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1),
+			      sel.encoding ().nelts_per_pattern ()));
+
+      res_nelts = res_npatterns * res_nelts_per_pattern;
+    }
+  else if (TYPE_VECTOR_SUBPARTS (type).is_constant (&res_nelts))
+    {
+      res_npatterns = res_nelts;
+      res_nelts_per_pattern = 1;
+    }
+  else
+    return NULL_TREE;
+
+  tree_vector_builder out_elts (type, res_npatterns, res_nelts_per_pattern);
+  for (unsigned i = 0; i < res_nelts; i++)
+    {
+      poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+      uint64_t q;
+      poly_uint64 r;
+      unsigned HOST_WIDE_INT index;
+
+      /* Punt if sel[i] /trunc_div len cannot be determined,
+	 because the input vector to be chosen will depend on
+	 runtime vector length.
+	 For example if len == 4 + 4x, and sel[i] == 4,
+	 If len at runtime equals 4, we choose arg1[0].
+	 For any other value of len > 4 at runtime, we choose arg0[4].
+	 which makes the element choice dependent on runtime vector length.  */
+      if (!can_div_trunc_p (sel[i], len, &q, &r))
+	{
+	  if (reason)
+	    *reason = "cannot divide selector element by arg len";
+	  return NULL_TREE;
+	}
+
+      /* sel[i] % len will give the index of element in the chosen input
+	 vector. For example if sel[i] == 5 + 4x and len == 4 + 4x,
+	 we will choose arg1[1] since (5 + 4x) % (4 + 4x) == 1.  */
+      if (!r.is_constant (&index))
+	{
+	  if (reason)
+	    *reason = "remainder is not constant";
+	  return NULL_TREE;
+	}
+
+      tree arg = ((q & 1) == 0) ? arg0 : arg1;
+      tree elem = vector_cst_elt (arg, index);
+      out_elts.quick_push (elem);
+    }
+
+  return out_elts.build ();
+}
+
 /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL
    selector.  Return the folded VECTOR_CST or CONSTRUCTOR if successful,
    NULL_TREE otherwise.  */
@@ -10529,43 +10705,41 @@ fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
 {
   unsigned int i;
   unsigned HOST_WIDE_INT nelts;
-  bool need_ctor = false;
 
-  if (!sel.length ().is_constant (&nelts))
-    return NULL_TREE;
-  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts)
-	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts)
-	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts));
+  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), sel.length ())
+	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)),
+			   TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1))));
+
   if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type)
       || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type))
     return NULL_TREE;
 
+  if (TREE_CODE (arg0) == VECTOR_CST
+      && TREE_CODE (arg1) == VECTOR_CST)
+    return fold_vec_perm_cst (type, arg0, arg1, sel);
+
+  /* For fall back case, we want to ensure we have VLS vectors
+     with equal length.  */
+  if (!sel.length ().is_constant (&nelts))
+    return NULL_TREE;
+
+  gcc_assert (known_eq (sel.length (),
+			TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0))));
   tree *in_elts = XALLOCAVEC (tree, nelts * 2);
   if (!vec_cst_ctor_to_array (arg0, nelts, in_elts)
       || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts))
     return NULL_TREE;
 
-  tree_vector_builder out_elts (type, nelts, 1);
+  vec<constructor_elt, va_gc> *v;
+  vec_alloc (v, nelts);
   for (i = 0; i < nelts; i++)
     {
       HOST_WIDE_INT index;
       if (!sel[i].is_constant (&index))
 	return NULL_TREE;
-      if (!CONSTANT_CLASS_P (in_elts[index]))
-	need_ctor = true;
-      out_elts.quick_push (unshare_expr (in_elts[index]));
-    }
-
-  if (need_ctor)
-    {
-      vec<constructor_elt, va_gc> *v;
-      vec_alloc (v, nelts);
-      for (i = 0; i < nelts; i++)
-	CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, out_elts[i]);
-      return build_constructor (type, v);
+      CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, in_elts[index]);
     }
-  else
-    return out_elts.build ();
+  return build_constructor (type, v);
 }
 
 /* Try to fold a pointer difference of type TYPE two address expressions of
@@ -16892,6 +17066,588 @@ test_arithmetic_folding ()
 				   x);
 }
 
+namespace test_fold_vec_perm_cst {
+
+/* Build a VECTOR_CST corresponding to VMODE, and has
+   encoding given by NPATTERNS, NELTS_PER_PATTERN and STEP.
+   Fill it with randomized elements, using rand() % THRESHOLD.  */
+
+static tree
+build_vec_cst_rand (machine_mode vmode, unsigned npatterns,
+		    unsigned nelts_per_pattern,
+		    int step = 0, int threshold = 100)
+{
+  tree inner_type = lang_hooks.types.type_for_mode (GET_MODE_INNER (vmode), 1);
+  tree vectype = build_vector_type_for_mode (inner_type, vmode);
+  tree_vector_builder builder (vectype, npatterns, nelts_per_pattern);
+
+  // Fill a0 for each pattern
+  for (unsigned i = 0; i < npatterns; i++)
+    builder.quick_push (build_int_cst (inner_type, rand () % threshold));
+
+  if (nelts_per_pattern == 1)
+    return builder.build ();
+
+  // Fill a1 for each pattern
+  for (unsigned i = 0; i < npatterns; i++)
+    builder.quick_push (build_int_cst (inner_type, rand () % threshold));
+
+  if (nelts_per_pattern == 2)
+    return builder.build ();
+
+  for (unsigned i = npatterns * 2; i < npatterns * nelts_per_pattern; i++)
+    {
+      tree prev_elem = builder[i - npatterns];
+      int prev_elem_val = TREE_INT_CST_LOW (prev_elem);
+      int val = prev_elem_val + step;
+      builder.quick_push (build_int_cst (inner_type, val));
+    }
+
+  return builder.build ();
+}
+
+/* Validate result of VEC_PERM_EXPR folding for the unit-tests below,
+   when result is VLA.  */
+
+static void
+validate_res (unsigned npatterns, unsigned nelts_per_pattern,
+	      tree res, tree *expected_res)
+{
+  /* Actual npatterns and encoded_elts in res may be less than expected due
+     to canonicalization.  */
+  ASSERT_TRUE (res != NULL_TREE);
+  ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) <= npatterns);
+  ASSERT_TRUE (vector_cst_encoded_nelts (res) <= npatterns * nelts_per_pattern);
+
+  for (unsigned i = 0; i < npatterns * nelts_per_pattern; i++)
+    ASSERT_TRUE (operand_equal_p (VECTOR_CST_ELT (res, i), expected_res[i], 0));
+}
+
+/* Validate result of VEC_PERM_EXPR folding for the unit-tests below,
+   when the result is VLS.  */
+
+static void
+validate_res_vls (tree res, tree *expected_res, unsigned expected_nelts)
+{
+  ASSERT_TRUE (known_eq (VECTOR_CST_NELTS (res), expected_nelts));
+  for (unsigned i = 0; i < expected_nelts; i++)
+    ASSERT_TRUE (operand_equal_p (VECTOR_CST_ELT (res, i), expected_res[i], 0));
+}
+
+/* Helper routine to push multiple elements into BUILDER.  */
+template<unsigned N>
+static void builder_push_elems (vec_perm_builder& builder,
+				poly_uint64 (&elems)[N])
+{
+  for (unsigned i = 0; i < N; i++)
+    builder.quick_push (elems[i]);
+}
+
+#define ARG0(index) vector_cst_elt (arg0, index)
+#define ARG1(index) vector_cst_elt (arg1, index)
+
+/* Test cases where result is VNx4SI and input vectors are V4SI.  */
+
+static void
+test_vnx4si_v4si (machine_mode vnx4si_mode, machine_mode v4si_mode)
+{
+  for (int i = 0; i < 10; i++)
+    {
+      /* Case 1:
+	 sel = { 0, 4, 1, 5, ... }
+	 res = { arg[0], arg1[0], arg0[1], arg1[1], ...} // (4, 1)  */
+      {
+	tree arg0 = build_vec_cst_rand (v4si_mode, 4, 1, 0);
+	tree arg1 = build_vec_cst_rand (v4si_mode, 4, 1, 0);
+
+	tree inner_type
+	  = lang_hooks.types.type_for_mode (GET_MODE_INNER (vnx4si_mode), 1);
+	tree res_type = build_vector_type_for_mode (inner_type, vnx4si_mode);
+
+	poly_uint64 res_len = TYPE_VECTOR_SUBPARTS (res_type);
+	vec_perm_builder builder (res_len, 4, 1);
+	poly_uint64 mask_elems[] = { 0, 4, 1, 5 };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, res_len);
+	tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel);
+
+	tree expected_res[] = { ARG0(0), ARG1(0), ARG0(1), ARG1(1) };
+	validate_res (4, 1, res, expected_res);
+      }
+
+      /* Case 2: Same as case 1, but contains an out of bounds access which
+	 should wrap around.
+	 sel = {0, 8, 4, 12, ...} (4, 1)
+	 res = { arg0[0], arg0[0], arg1[0], arg1[0], ... } (4, 1).  */
+      {
+	tree arg0 = build_vec_cst_rand (v4si_mode, 4, 1, 0);
+	tree arg1 = build_vec_cst_rand (v4si_mode, 4, 1, 0);
+
+	tree inner_type
+	  = lang_hooks.types.type_for_mode (GET_MODE_INNER (vnx4si_mode), 1);
+	tree res_type = build_vector_type_for_mode (inner_type, vnx4si_mode);
+
+	poly_uint64 res_len = TYPE_VECTOR_SUBPARTS (res_type);
+	vec_perm_builder builder (res_len, 4, 1);
+	poly_uint64 mask_elems[] = { 0, 8, 4, 12 };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, res_len);
+	tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel);
+
+	tree expected_res[] = { ARG0(0), ARG0(0), ARG1(0), ARG1(0) };
+	validate_res (4, 1, res, expected_res);
+      }
+    }
+}
+
+/* Test cases where result is V4SI and input vectors are VNx4SI.  */
+
+static void
+test_v4si_vnx4si (machine_mode v4si_mode, machine_mode vnx4si_mode)
+{
+  for (int i = 0; i < 10; i++)
+    {
+      /* Case 1:
+	 sel = { 0, 1, 2, 3}
+	 res = { arg0[0], arg0[1], arg0[2], arg0[3] }.  */
+      {
+	tree arg0 = build_vec_cst_rand (vnx4si_mode, 4, 1);
+	tree arg1 = build_vec_cst_rand (vnx4si_mode, 4, 1);
+
+	tree inner_type
+	  = lang_hooks.types.type_for_mode (GET_MODE_INNER (v4si_mode), 1);
+	tree res_type = build_vector_type_for_mode (inner_type, v4si_mode);
+
+	poly_uint64 res_len = TYPE_VECTOR_SUBPARTS (res_type);
+	vec_perm_builder builder (res_len, 4, 1);
+	poly_uint64 mask_elems[] = {0, 1, 2, 3};
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, res_len);
+	tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel);
+
+	tree expected_res[] = { ARG0(0), ARG0(1), ARG0(2), ARG0(3) };
+	validate_res_vls (res, expected_res, 4);
+      }
+
+      /* Case 2: Same as Case 1, but crossing input vector.
+	 sel = {0, 2, 4, 6}
+	 In this case,the index 4 is ambiguous since len = 4 + 4x.
+	 Since we cannot determine, which vector to choose from during
+	 compile time, should return NULL_TREE.  */
+      {
+	tree arg0 = build_vec_cst_rand (vnx4si_mode, 4, 1);
+	tree arg1 = build_vec_cst_rand (vnx4si_mode, 4, 1);
+
+	tree inner_type
+	  = lang_hooks.types.type_for_mode (GET_MODE_INNER (v4si_mode), 1);
+	tree res_type = build_vector_type_for_mode (inner_type, v4si_mode);
+
+	poly_uint64 res_len = TYPE_VECTOR_SUBPARTS (res_type);
+	vec_perm_builder builder (res_len, 4, 1);
+	poly_uint64 mask_elems[] = {0, 2, 4, 6};
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, res_len);
+	const char *reason;
+	tree res = fold_vec_perm_cst (res_type, arg0, arg1, sel, &reason);
+
+	ASSERT_TRUE (res == NULL_TREE);
+	ASSERT_TRUE (!strcmp (reason, "cannot divide selector element by arg len"));
+      }
+    }
+}
+
+/* Test all input vectors.  */
+
+static void
+test_all_nunits (machine_mode vmode)
+{
+  /* Test with 10 different inputs.  */
+  for (int i = 0; i < 10; i++)
+    {
+      tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
+      tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
+      poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+      /* Case 1: mask = {0, ...} // (1, 1)
+	 res = { arg0[0], ... } // (1, 1)  */
+      {
+	vec_perm_builder builder (len, 1, 1);
+	builder.quick_push (0);
+	vec_perm_indices sel (builder, 2, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+	tree expected_res[] = { ARG0(0) };
+	validate_res (1, 1, res, expected_res);
+      }
+
+      /* Case 2: mask = {len, ...} // (1, 1)
+	 res = { arg1[0], ... } // (1, 1)  */
+      {
+	vec_perm_builder builder (len, 1, 1);
+	builder.quick_push (len);
+	vec_perm_indices sel (builder, 2, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+	tree expected_res[] = { ARG1(0) };
+	validate_res (1, 1, res, expected_res);
+      }
+    }
+}
+
+/* Test all vectors which contain at-least 2 elements.  */
+
+static void
+test_nunits_min_2 (machine_mode vmode)
+{
+  for (int i = 0; i < 10; i++)
+    {
+      /* Case 1: mask = { 0, len, ... }  // (2, 1)
+	 res = { arg0[0], arg1[0], ... } // (2, 1)  */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
+	tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
+	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (len, 2, 1);
+	poly_uint64 mask_elems[] = { 0, len };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+	tree expected_res[] = { ARG0(0), ARG1(0) };
+	validate_res (2, 1, res, expected_res);
+      }
+
+      /* Case 2: mask = { 0, len, 1, len+1, ... } // (2, 2)
+	 res = { arg0[0], arg1[0], arg0[1], arg1[1], ... } // (2, 2)  */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
+	tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
+	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (len, 2, 2);
+	poly_uint64 mask_elems[] = { 0, len, 1, len + 1 };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+	tree expected_res[] = { ARG0(0), ARG1(0), ARG0(1), ARG1(1) };
+	validate_res (2, 2, res, expected_res);
+      }
+
+      /* Case 4: mask = {0, 0, 1, ...} // (1, 3)
+	 Test that the stepped sequence of the pattern selects from
+	 same input pattern. Since input vectors have npatterns = 2,
+	 and step (a2 - a1) = 1, step is not a multiple of npatterns
+	 in input vector. So return NULL_TREE.  */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
+	tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
+	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (len, 1, 3);
+	poly_uint64 mask_elems[] = { 0, 0, 1 };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, len);
+	const char *reason;
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel,
+				      &reason);
+	ASSERT_TRUE (res == NULL_TREE);
+	ASSERT_TRUE (!strcmp (reason, "step is not multiple of npatterns"));
+      }
+
+      /* Case 5: mask = {len, 0, 1, ...} // (1, 3)
+	 Test that stepped sequence of the pattern selects from arg0.
+	 res = { arg1[0], arg0[0], arg0[1], ... } // (1, 3)  */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
+	tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
+	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (len, 1, 3);
+	poly_uint64 mask_elems[] = { len, 0, 1 };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+	tree expected_res[] = { ARG1(0), ARG0(0), ARG0(1) };
+	validate_res (1, 3, res, expected_res);
+      }
+    }
+}
+
+/* Test all vectors which contain at-least 4 elements.  */
+
+static void
+test_nunits_min_4 (machine_mode vmode)
+{
+  for (int i = 0; i < 10; i++)
+    {
+      /* Case 1: mask = { 0, len, 1, len+1, ... } // (4, 1)
+	 res: { arg0[0], arg1[0], arg0[1], arg1[1], ... } // (4, 1)  */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
+	tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
+	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (len, 4, 1);
+	poly_uint64 mask_elems[] = { 0, len, 1, len + 1 };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+	tree expected_res[] = { ARG0(0), ARG1(0), ARG0(1), ARG1(1) };
+	validate_res (4, 1, res, expected_res);
+      }
+
+      /* Case 2: sel = {0, 1, 2, ...}  // (1, 3)
+	 res: { arg0[0], arg0[1], arg0[2], ... } // (1, 3) */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 1, 3, 2);
+	tree arg1 = build_vec_cst_rand (vmode, 1, 3, 2);
+	poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (arg0_len, 1, 3);
+	poly_uint64 mask_elems[] = {0, 1, 2};
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, arg0_len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+	tree expected_res[] = { ARG0(0), ARG0(1), ARG0(2) };
+	validate_res (1, 3, res, expected_res);
+      }
+
+      /* Case 3: sel = {len, len+1, len+2, ...} // (1, 3)
+	 res: { arg1[0], arg1[1], arg1[2], ... } // (1, 3) */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 1, 3, 2);
+	tree arg1 = build_vec_cst_rand (vmode, 1, 3, 2);
+	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (len, 1, 3);
+	poly_uint64 mask_elems[] = {len, len + 1, len + 2};
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+	tree expected_res[] = { ARG1(0), ARG1(1), ARG1(2) };
+	validate_res (1, 3, res, expected_res);
+      }
+
+      /* Case 4:
+	sel = { len, 0, 2, ... } // (1, 3)
+	This should return NULL because we cross the input vectors.
+	Because,
+	Let's assume len = C + Cx
+	a1 = 0
+	S = 2
+	esel = arg0_len / sel_npatterns = C + Cx
+	ae = 0 + (esel - 2) * S
+	   = 0 + (C + Cx - 2) * 2
+	   = 2(C-2) + 2Cx
+
+	For C >= 4:
+	Let q1 = a1 / arg0_len = 0 / (C + Cx) = 0
+	Let qe = ae / arg0_len = (2(C-2) + 2Cx) / (C + Cx) = 1
+	Since q1 != qe, we cross input vectors.
+	So return NULL_TREE.  */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 1, 3, 2);
+	tree arg1 = build_vec_cst_rand (vmode, 1, 3, 2);
+	poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (arg0_len, 1, 3);
+	poly_uint64 mask_elems[] = { arg0_len, 0, 2 };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, arg0_len);
+	const char *reason;
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel, &reason);
+	ASSERT_TRUE (res == NULL_TREE);
+	ASSERT_TRUE (!strcmp (reason, "crossed input vectors"));
+      }
+
+      /* Case 5: npatterns(arg0) = 4 > npatterns(sel) = 2
+	 mask = { 0, len, 1, len + 1, ...} // (2, 2)
+	 res = { arg0[0], arg1[0], arg0[1], arg1[1], ... } // (2, 2)
+
+	 Note that fold_vec_perm_cst will set
+	 res_npatterns = max(4, max(4, 2)) = 4
+	 However after canonicalizing, we will end up with shape (2, 2).  */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 4, 1);
+	tree arg1 = build_vec_cst_rand (vmode, 4, 1);
+	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (len, 2, 2);
+	poly_uint64 mask_elems[] = { 0, len, 1, len + 1 };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+	tree expected_res[] = { ARG0(0), ARG1(0), ARG0(1), ARG1(1) };
+	validate_res (2, 2, res, expected_res);
+      }
+
+      /* Case 6: Test combination in sel, where one pattern is dup and other
+	 is stepped sequence.
+	 sel = { 0, 0, 0, 1, 0, 2, ... } // (2, 3)
+	 res = { arg0[0], arg0[0], arg0[0],
+		 arg0[1], arg0[0], arg0[2], ... } // (2, 3)  */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
+	tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
+	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (len, 2, 3);
+	poly_uint64 mask_elems[] = { 0, 0, 0, 1, 0, 2 };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+	tree expected_res[] = { ARG0(0), ARG0(0), ARG0(0),
+				ARG0(1), ARG0(0), ARG0(2) };
+	validate_res (2, 3, res, expected_res);
+      }
+    }
+}
+
+/* Test all vectors which contain at-least 8 elements.  */
+
+static void
+test_nunits_min_8 (machine_mode vmode)
+{
+  for (int i = 0; i < 10; i++)
+    {
+      /* Case 1: sel_npatterns (4) > input npatterns (2)
+	 sel: { 0, 0, 1, len, 2, 0, 3, len, 4, 0, 5, len, ...} // (4, 3)
+	 res: { arg0[0], arg0[0], arg0[0], arg1[0],
+		arg0[2], arg0[0], arg0[3], arg1[0],
+		arg0[4], arg0[0], arg0[5], arg1[0], ... } // (4, 3)  */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 2, 3, 2);
+	tree arg1 = build_vec_cst_rand (vmode, 2, 3, 2);
+	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder(len, 4, 3);
+	poly_uint64 mask_elems[] = { 0, 0, 1, len, 2, 0, 3, len,
+				     4, 0, 5, len };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+	tree expected_res[] = { ARG0(0), ARG0(0), ARG0(1), ARG1(0),
+				ARG0(2), ARG0(0), ARG0(3), ARG1(0),
+				ARG0(4), ARG0(0), ARG0(5), ARG1(0) };
+	validate_res (4, 3, res, expected_res);
+      }
+    }
+}
+
+/* Test vectors for which nunits[0] <= 4.  */
+
+static void
+test_nunits_max_4 (machine_mode vmode)
+{
+  /* Case 1: mask = {0, 4, ...} // (1, 2)
+     This should return NULL_TREE because the index 4 may choose
+     from either arg0 or arg1 depending on vector length.  */
+  {
+    tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
+    tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
+    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (len, 1, 2);
+    poly_uint64 mask_elems[] = {0, 4};
+    builder_push_elems (builder, mask_elems);
+
+    vec_perm_indices sel (builder, 2, len);
+    const char *reason;
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel, &reason);
+    ASSERT_TRUE (res == NULL_TREE);
+    ASSERT_TRUE (reason != NULL);
+    ASSERT_TRUE (!strcmp (reason, "cannot divide selector element by arg len"));
+  }
+}
+
+#undef ARG0
+#undef ARG1
+
+/* Return true if SIZE is of the form C + Cx and C is power of 2.  */
+
+static bool
+is_simple_vla_size (poly_uint64 size)
+{
+  if (size.is_constant ()
+      || !pow2p_hwi (size.coeffs[0]))
+    return false;
+  for (unsigned i = 1; i < ARRAY_SIZE (size.coeffs); ++i)
+    if (size.coeffs[i] != (i <= 1 ? size.coeffs[0] : 0))
+      return false;
+  return true;
+}
+
+/* Execute fold_vec_perm_cst unit tests.  */
+
+static void
+test ()
+{
+  machine_mode vnx4si_mode = E_VOIDmode;
+  machine_mode v4si_mode = E_VOIDmode;
+
+  machine_mode vmode;
+  FOR_EACH_MODE_IN_CLASS (vmode, MODE_VECTOR_INT)
+    {
+      /* Obtain modes corresponding to VNx4SI and V4SI,
+	 to call mixed mode tests below.
+	 FIXME: Is there a better way to do this ?  */
+      if (GET_MODE_INNER (vmode) == SImode)
+	{
+	  poly_uint64 nunits = GET_MODE_NUNITS (vmode);
+	  if (is_simple_vla_size (nunits)
+	      && nunits.coeffs[0] == 4)
+	    vnx4si_mode = vmode;
+	  else if (known_eq (nunits, poly_uint64 (4)))
+	    v4si_mode = vmode;
+	}
+
+      if (!is_simple_vla_size (GET_MODE_NUNITS (vmode))
+	  || !targetm.vector_mode_supported_p (vmode))
+	continue;
+
+      poly_uint64 nunits = GET_MODE_NUNITS (vmode);
+      test_all_nunits (vmode);
+      if (nunits.coeffs[0] >= 2)
+	test_nunits_min_2 (vmode);
+      if (nunits.coeffs[0] >= 4)
+	test_nunits_min_4 (vmode);
+      if (nunits.coeffs[0] >= 8)
+	test_nunits_min_8 (vmode);
+
+      if (nunits.coeffs[0] <= 4)
+	test_nunits_max_4 (vmode);
+    }
+
+  if (vnx4si_mode != E_VOIDmode && v4si_mode != E_VOIDmode
+      && targetm.vector_mode_supported_p (vnx4si_mode)
+      && targetm.vector_mode_supported_p (v4si_mode))
+    {
+      test_vnx4si_v4si (vnx4si_mode, v4si_mode);
+      test_v4si_vnx4si (v4si_mode, vnx4si_mode);
+    }
+}
+} // end of test_fold_vec_perm_cst namespace
+
 /* Verify that various binary operations on vectors are folded
    correctly.  */
 
@@ -16943,6 +17699,7 @@ fold_const_cc_tests ()
   test_arithmetic_folding ();
   test_vector_folding ();
   test_vec_duplicate_folding ();
+  test_fold_vec_perm_cst::test ();
 }
 
 } // namespace selftest

Richard Sandiford Aug. 16, 2023, 9:51 a.m. UTC | #16

Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> Unfortunately, the patch regressed following tests on ppc64le and
>> armhf respectively:
>> gcc.target/powerpc/vec-perm-ctor.c scan-tree-dump-not optimized
>> "VIEW_CONVERT_EXPR"
>> gcc.dg/tree-ssa/forwprop-20.c scan-tree-dump-not forwprop1 "VEC_PERM_EXPR"
>>
>> This happens because of the change to vect_cst_ctor_array which
>> removes handling of VECTOR_CST,
>> and thus we return NULL_TREE for cases where VEC_PERM_EXPR has
>> vector_cst, ctor input operands.
>>
>> For eg we fail to fold VEC_PERM_EXPR for the following test taken from
>> forwprop-20.c:
>> void f (double d, vecf* r)
>> {
>>   vecf x = { -d, 5 };
>>   vecf y = {  1, 4 };
>>   veci m = {  2, 0 };
>>   *r = __builtin_shuffle (x, y, m); // { 1, -d }
>> }
>> because vect_cst_ctor_to_array will now return NULL_TREE for vector_cst {1, 4}.
>>
>> The attached patch thus reverts the changes to vect_cst_ctor_to_array,
>> which makes the tests pass again.
>> I have put the patch for another round of bootstrap+test on the above
>> targets (aarch64, aarch64-sve, x86_64, armhf, ppc64le).
>> OK to commit if it passes ?
> The patch now passes bootstrap+test on all these targets.

OK, thanks.

Richard

Prathamesh Kulkarni Aug. 16, 2023, 11:28 a.m. UTC | #17

On Wed, 16 Aug 2023 at 15:21, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> Unfortunately, the patch regressed following tests on ppc64le and
> >> armhf respectively:
> >> gcc.target/powerpc/vec-perm-ctor.c scan-tree-dump-not optimized
> >> "VIEW_CONVERT_EXPR"
> >> gcc.dg/tree-ssa/forwprop-20.c scan-tree-dump-not forwprop1 "VEC_PERM_EXPR"
> >>
> >> This happens because of the change to vect_cst_ctor_array which
> >> removes handling of VECTOR_CST,
> >> and thus we return NULL_TREE for cases where VEC_PERM_EXPR has
> >> vector_cst, ctor input operands.
> >>
> >> For eg we fail to fold VEC_PERM_EXPR for the following test taken from
> >> forwprop-20.c:
> >> void f (double d, vecf* r)
> >> {
> >>   vecf x = { -d, 5 };
> >>   vecf y = {  1, 4 };
> >>   veci m = {  2, 0 };
> >>   *r = __builtin_shuffle (x, y, m); // { 1, -d }
> >> }
> >> because vect_cst_ctor_to_array will now return NULL_TREE for vector_cst {1, 4}.
> >>
> >> The attached patch thus reverts the changes to vect_cst_ctor_to_array,
> >> which makes the tests pass again.
> >> I have put the patch for another round of bootstrap+test on the above
> >> targets (aarch64, aarch64-sve, x86_64, armhf, ppc64le).
> >> OK to commit if it passes ?
> > The patch now passes bootstrap+test on all these targets.
>
> OK, thanks.
Thanks a lot for the helpful reviews! Committed in:
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=a7dba4a1c05a76026d88dcccc0b519cf83bff9a2

Thanks,
Prathamesh
>
> Richard

diff mbox series

Patch

diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index a02ede79fed..8028b3e8e9a 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -85,6 +85,10 @@  along with GCC; see the file COPYING3.  If not see
 #include "vec-perm-indices.h"
 #include "asan.h"
 #include "gimple-range.h"
+#include <algorithm>
+#include "tree-pretty-print.h"
+#include "gimple-pretty-print.h"
+#include "print-tree.h"
 
 /* Nonzero if we are folding constants inside an initializer or a C++
    manifestly-constant-evaluated context; zero otherwise.
@@ -10493,15 +10497,9 @@  fold_mult_zconjz (location_t loc, tree type, tree expr)
 static bool
 vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
 {
-  unsigned HOST_WIDE_INT i, nunits;
+  unsigned HOST_WIDE_INT i;
 
-  if (TREE_CODE (arg) == VECTOR_CST
-      && VECTOR_CST_NELTS (arg).is_constant (&nunits))
-    {
-      for (i = 0; i < nunits; ++i)
-	elts[i] = VECTOR_CST_ELT (arg, i);
-    }
-  else if (TREE_CODE (arg) == CONSTRUCTOR)
+  if (TREE_CODE (arg) == CONSTRUCTOR)
     {
       constructor_elt *elt;
 
@@ -10519,6 +10517,230 @@  vec_cst_ctor_to_array (tree arg, unsigned int nelts, tree *elts)
   return true;
 }
 
+/* Return a vector with (NPATTERNS, NELTS_PER_PATTERN) encoding.  */
+
+static tree
+vector_cst_reshape (tree vec, unsigned npatterns, unsigned nelts_per_pattern)
+{
+  gcc_assert (pow2p_hwi (npatterns));
+
+  if (VECTOR_CST_NPATTERNS (vec) == npatterns
+      && VECTOR_CST_NELTS_PER_PATTERN (vec) == nelts_per_pattern)
+    return vec;
+
+  tree v = make_vector (exact_log2 (npatterns), nelts_per_pattern);
+  TREE_TYPE (v) = TREE_TYPE (vec);
+
+  unsigned nelts = npatterns * nelts_per_pattern;
+  for (unsigned i = 0; i < nelts; i++)
+    VECTOR_CST_ENCODED_ELT(v, i) = vector_cst_elt (vec, i);
+  return v;
+}
+
+/* Helper routine for fold_vec_perm_vla to check if ARG is a suitable
+   operand for VLA vec_perm folding. If arg is VLS, then set
+   NPATTERNS = nelts and NELTS_PER_PATTERN = 1.  */
+
+static tree
+valid_operand_for_fold_vec_perm_cst_p (tree arg)
+{
+  if (TREE_CODE (arg) != VECTOR_CST)
+    return NULL_TREE;
+
+  unsigned HOST_WIDE_INT nelts;
+  unsigned npatterns, nelts_per_pattern;
+  if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg)).is_constant (&nelts))
+    {
+      npatterns = nelts;
+      nelts_per_pattern = 1;
+    }
+  else
+    {
+      npatterns = VECTOR_CST_NPATTERNS (arg);
+      nelts_per_pattern = VECTOR_CST_NELTS_PER_PATTERN (arg);
+    }
+
+  if (!pow2p_hwi (npatterns))
+    return NULL_TREE;
+
+  return vector_cst_reshape (arg, npatterns, nelts_per_pattern);
+}
+
+/* Helper routine for fold_vec_perm_cst to check if SEL is a suitable
+   mask for VLA vec_perm folding. Set SEL_NPATTERNS and SEL_NELTS_PER_PATTERN
+   similarly.  */
+
+static bool
+valid_mask_for_fold_vec_perm_cst_p (tree arg0, tree arg1,
+				    const vec_perm_indices &sel,
+				    unsigned& sel_npatterns,
+				    unsigned& sel_nelts_per_pattern,
+				    char *reason = NULL,
+				    bool verbose = false)
+{
+  unsigned HOST_WIDE_INT nelts;
+  if (sel.length ().is_constant (&nelts))
+    {
+      sel_npatterns = nelts;
+      sel_nelts_per_pattern = 1;
+    }
+  else
+    {
+      sel_npatterns = sel.encoding ().npatterns ();
+      sel_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
+    }
+
+  if (!pow2p_hwi (sel_npatterns))
+    {
+      if (reason)
+	strcpy (reason, "sel_npatterns is not power of 2");
+      return false;
+    }
+
+  /* We want to avoid cases where sel.length is not a multiple of npatterns.
+     For eg: sel.length = 2 + 2x, and sel npatterns = 4.  */
+  poly_uint64 esel;
+  if (!multiple_p (sel.length (), sel_npatterns, &esel))
+    {
+      if (reason)
+	strcpy (reason, "sel.length is not multiple of sel_npatterns");
+      return false;
+    }
+
+  if (sel_nelts_per_pattern < 3)
+    return true;
+
+  for (unsigned pattern = 0; pattern < sel_npatterns; pattern++)
+    {
+      poly_uint64 a1 = sel[pattern + sel_npatterns];
+      poly_uint64 a2 = sel[pattern + 2 * sel_npatterns];
+
+      poly_uint64 step = a2 - a1;
+      if (!step.is_constant ())
+	{
+	  if (reason)
+	    strcpy (reason, "step is not constant");
+	  return false;
+	}
+      int S = step.to_constant ();
+      if (S == 0)
+	continue;
+
+      // FIXME: Punt on S < 0 for now, revisit later.
+      if (S < 0)
+	return false;
+
+      if (!pow2p_hwi (S))
+	{
+	  if (reason)
+	    strcpy (reason, "step is not power of 2");
+	  return false;
+	}
+
+      poly_uint64 ae = a1 + (esel - 2) * S;
+      poly_uint64 arg_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+      uint64_t q1, qe;
+      poly_uint64 r1, re;
+
+      /* Ensure that stepped sequence of the pattern selects elements
+	 only from the same input vector.  */
+      if (!(can_div_trunc_p (a1, arg_len, &q1, &r1)
+	    && can_div_trunc_p (ae, arg_len, &qe, &re)
+	    && (q1 == qe)))
+	{
+	  if (reason)
+	    strcpy (reason, "crossed input vectors");
+	  return false;
+	}
+      unsigned arg_npatterns
+	= ((q1 & 0) == 0) ? VECTOR_CST_NPATTERNS (arg0)
+			  : VECTOR_CST_NPATTERNS (arg1);
+
+      gcc_assert (pow2p_hwi (arg_npatterns));
+      if (S < arg_npatterns)
+	{
+	  if (reason)
+	    strcpy (reason, "S is not multiple of npatterns");
+	  return false;
+	}
+    }
+
+  return true;
+}
+
+/* Try to fold permutation of ARG0 and ARG1 with SEL selector when
+   the input vectors are VECTOR_CST. Return NULL_TREE otherwise.  */
+
+static tree
+fold_vec_perm_cst (tree type, tree arg0, tree arg1, const vec_perm_indices &sel,
+		   char *reason = NULL, bool verbose = false)
+{
+  /* Allow cases where:
+     (1) arg0, arg1 and sel are VLS.
+     (2) arg0, arg1, and sel are VLA.
+     Punt if input vectors are VLA but sel is VLS or vice-versa.  */
+  poly_uint64 arg_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+  if (!((arg_len.is_constant () && sel.length ().is_constant ())
+	 || (!arg_len.is_constant () && !sel.length ().is_constant ())))
+    return NULL_TREE;
+
+  unsigned sel_npatterns, sel_nelts_per_pattern;
+
+  arg0 = valid_operand_for_fold_vec_perm_cst_p (arg0);
+  if (!arg0)
+    return NULL_TREE;
+
+  arg1 = valid_operand_for_fold_vec_perm_cst_p (arg1);
+  if (!arg1)
+    return NULL_TREE;
+
+  if (!valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, sel_npatterns,
+					   sel_nelts_per_pattern, reason, verbose))
+    return NULL_TREE;
+
+  unsigned res_npatterns
+    = std::max (VECTOR_CST_NPATTERNS (arg0),
+		std::max (VECTOR_CST_NPATTERNS (arg1), sel_npatterns));
+
+  unsigned res_nelts_per_pattern
+    = std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
+		std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1),
+			  sel_nelts_per_pattern));
+
+
+  tree_vector_builder out_elts (type, res_npatterns, res_nelts_per_pattern);
+  unsigned res_nelts = res_npatterns * res_nelts_per_pattern;
+  for (unsigned i = 0; i < res_nelts; i++)
+    {
+      poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+      uint64_t q;
+      poly_uint64 r;
+      unsigned HOST_WIDE_INT index;
+
+      /* Punt if sel[i] /trunc_div len cannot be determined,
+	 because the input vector to be chosen will depend on
+	 runtime vector length.
+	 For example if len == 4 + 4x, and sel[i] == 4,
+	 If len at runtime equals 4, we choose arg1[0].
+	 For any other value of len > 4 at runtime, we choose arg0[4].
+	 which makes the element choice dependent on runtime vector length.  */
+      if (!can_div_trunc_p (sel[i], len, &q, &r))
+	return NULL_TREE;
+
+      /* sel[i] % len will give the index of element in the chosen input
+	 vector. For example if sel[i] == 5 + 4x and len == 4 + 4x,
+	 we will choose arg1[1] since (5 + 4x) % (4 + 4x) == 1.  */
+      if (!r.is_constant (&index))
+	return NULL_TREE;
+
+      tree arg = ((q & 1) == 0) ? arg0 : arg1;
+      tree elem = vector_cst_elt (arg, index);
+      out_elts.quick_push (elem);
+    }
+
+  return out_elts.build ();
+}
+
 /* Attempt to fold vector permutation of ARG0 and ARG1 vectors using SEL
    selector.  Return the folded VECTOR_CST or CONSTRUCTOR if successful,
    NULL_TREE otherwise.  */
@@ -10528,43 +10750,39 @@  fold_vec_perm (tree type, tree arg0, tree arg1, const vec_perm_indices &sel)
 {
   unsigned int i;
   unsigned HOST_WIDE_INT nelts;
-  bool need_ctor = false;
 
-  if (!sel.length ().is_constant (&nelts))
-    return NULL_TREE;
-  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), nelts)
-	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)), nelts)
-	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1)), nelts));
+  gcc_assert (known_eq (TYPE_VECTOR_SUBPARTS (type), sel.length ())
+	      && known_eq (TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)),
+			   TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg1))));
+
   if (TREE_TYPE (TREE_TYPE (arg0)) != TREE_TYPE (type)
       || TREE_TYPE (TREE_TYPE (arg1)) != TREE_TYPE (type))
     return NULL_TREE;
 
+  if (TREE_CODE (arg0) == VECTOR_CST
+      && TREE_CODE (arg1) == VECTOR_CST)
+    return fold_vec_perm_cst (type, arg0, arg1, sel);
+
+  /* For fall back case, we want to ensure arg and sel have same len.  */
+  if (!(sel.length ().is_constant (&nelts)
+	&& known_eq (sel.length (), TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0)))))
+    return NULL_TREE;
+
   tree *in_elts = XALLOCAVEC (tree, nelts * 2);
   if (!vec_cst_ctor_to_array (arg0, nelts, in_elts)
       || !vec_cst_ctor_to_array (arg1, nelts, in_elts + nelts))
     return NULL_TREE;
 
-  tree_vector_builder out_elts (type, nelts, 1);
+  vec<constructor_elt, va_gc> *v;
+  vec_alloc (v, nelts);
   for (i = 0; i < nelts; i++)
     {
       HOST_WIDE_INT index;
       if (!sel[i].is_constant (&index))
 	return NULL_TREE;
-      if (!CONSTANT_CLASS_P (in_elts[index]))
-	need_ctor = true;
-      out_elts.quick_push (unshare_expr (in_elts[index]));
+      CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, in_elts[index]);
     }
-
-  if (need_ctor)
-    {
-      vec<constructor_elt, va_gc> *v;
-      vec_alloc (v, nelts);
-      for (i = 0; i < nelts; i++)
-	CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, out_elts[i]);
-      return build_constructor (type, v);
-    }
-  else
-    return out_elts.build ();
+  return build_constructor (type, v);
 }
 
 /* Try to fold a pointer difference of type TYPE two address expressions of
@@ -16891,6 +17109,388 @@  test_arithmetic_folding ()
 				   x);
 }
 
+namespace test_fold_vec_perm_cst {
+
+static tree
+get_preferred_vectype (tree inner_type)
+{
+  scalar_int_mode int_mode = SCALAR_INT_TYPE_MODE (inner_type);
+  machine_mode vmode = targetm.vectorize.preferred_simd_mode (int_mode);
+  poly_uint64 nunits = GET_MODE_NUNITS (vmode);
+  return build_vector_type (inner_type, nunits);
+}
+
+static tree
+build_vec_cst_rand (tree inner_type, unsigned npatterns,
+		    unsigned nelts_per_pattern, int S = 0)
+{
+  tree vectype = get_preferred_vectype (inner_type);
+  tree_vector_builder builder (vectype, npatterns, nelts_per_pattern);
+
+  // Fill a0 for each pattern
+  for (unsigned i = 0; i < npatterns; i++)
+    builder.quick_push (build_int_cst (inner_type, rand () % 100));
+
+  if (nelts_per_pattern == 1)
+    return builder.build ();
+
+  // Fill a1 for each pattern
+  for (unsigned i = 0; i < npatterns; i++)
+    builder.quick_push (build_int_cst (inner_type, rand () % 100));
+
+  if (nelts_per_pattern == 2)
+    return builder.build ();
+
+  for (unsigned i = npatterns * 2; i < npatterns * nelts_per_pattern; i++)
+    {
+      tree prev_elem = builder[i - npatterns];
+      int prev_elem_val = TREE_INT_CST_LOW (prev_elem);
+      int val = prev_elem_val + S;
+      builder.quick_push (build_int_cst (inner_type, val));
+    }
+
+  return builder.build ();
+}
+
+static void
+validate_res (unsigned npatterns, unsigned nelts_per_pattern,
+	      tree res, tree *expected_res)
+{
+  ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == npatterns);
+  ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == nelts_per_pattern);
+
+  for (unsigned i = 0; i < vector_cst_encoded_nelts (res); i++)
+    ASSERT_TRUE (operand_equal_p (VECTOR_CST_ELT (res, i), expected_res[i], 0));
+}
+
+/* Verify VLA vec_perm folding.  */
+
+static void
+test_stepped ()
+{
+  /* Case 1: sel = {0, 1, 2, ...}
+     npatterns = 1, nelts_per_pattern = 3  */
+  {
+    tree arg0 = build_vec_cst_rand (char_type_node, 1, 3, 2);
+    tree arg1 = build_vec_cst_rand (char_type_node, 1, 3, 2);
+    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (arg0_len, 1, 3);
+    builder.quick_push (0);
+    builder.quick_push (1);
+    builder.quick_push (2);
+
+    vec_perm_indices sel (builder, 2, arg0_len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg0, 1),
+			    vector_cst_elt (arg0, 2) };
+    validate_res (1, 3, res, expected_res);
+  }
+
+#if 0
+  /* Case 2: sel = {len, len + 1, len + 2, ... }
+     npatterns = 1, nelts_per_pattern = 3
+     FIXME: This should return
+     expected res: { op1[0], op1[1], op1[2], ... }
+     however it returns NULL_TREE.  */
+  {
+    vec_perm_builder builder (arg0_len, 1, 3);
+    builder.quick_push (arg0_len);
+    builder.quick_push (arg0_len + 1);
+    builder.quick_push (arg0_len + 2);
+
+    vec_perm_indices sel (builder, 2, arg0_len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+  }
+#endif
+
+  /* Case 3: Leading element of arg1, stepped sequence: pattern 0 of arg0.
+     sel = {len, 0, 0, 0, 2, 0, ...}
+     npatterns = 2, nelts_per_pattern = 3.
+     Use extra pattern {0, ...} to lower number of elements per pattern.  */
+  {
+    tree arg0 = build_vec_cst_rand (char_type_node, 1, 3, 2);
+    tree arg1 = build_vec_cst_rand (char_type_node, 1, 3, 2);
+    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (arg0_len, 2, 3);
+    builder.quick_push (arg0_len);
+    int mask_elems[] = { 0, 0, 0, 2, 0 };
+    for (int i = 0; i < 5; i++)
+      builder.quick_push (mask_elems[i]);
+
+    vec_perm_indices sel (builder, 2, arg0_len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+    gcc_assert (res);
+  }
+
+  /* Case 4:
+     sel = { len, 0, 2, ... } npatterns = 1, nelts_per_pattern = 3.
+     This should return NULL because we cross the input vectors.
+     Because,
+     arg0_len = 16 + 16x
+     a1 = 0
+     S = 2
+     esel = arg0_len / npatterns_sel = 16+16x/1 = 16 + 16x
+     ae = a1 + (esel - 2) * S
+	= 0 + (16 + 16x - 2) * 2
+	= 28 + 32x
+     a1 / arg0_len = 0 /trunc (16 + 16x) = 0
+     ae / arg0_len = (28 + 32x) /trunc (16 + 16x), which is not defined,
+     since 28/16 != 32/16.
+     So return NULL_TREE.  */
+  {
+    tree arg0 = build_vec_cst_rand (char_type_node, 1, 3, 2);
+    tree arg1 = build_vec_cst_rand (char_type_node, 1, 3, 2);
+    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    poly_uint64 ae = (arg0_len - 2) * 2;
+    uint64_t qe;
+    poly_uint64 re;
+
+    vec_perm_builder builder (arg0_len, 1, 3);
+    builder.quick_push (arg0_len);
+    builder.quick_push (0);
+    builder.quick_push (2);
+
+    vec_perm_indices sel (builder, 2, arg0_len);
+    char reason[100] = "\0";
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel, reason, false);
+    gcc_assert (res == NULL_TREE);
+    gcc_assert (!strcmp (reason, "crossed input vectors"));
+  }
+
+  /* Case 5: Select elements from different patterns.
+     Should return NULL.  */
+  {
+    tree op0 = build_vec_cst_rand (char_type_node, 2, 3, 2);
+    tree op1 = build_vec_cst_rand (char_type_node, 2, 3, 2);
+    poly_uint64 op0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (op0));
+
+    vec_perm_builder builder (op0_len, 2, 3);
+    builder.quick_push (op0_len);
+    int mask_elems[] = { 0, 0, 0, 1, 0 };
+    for (int i = 0; i < 5; i++)
+      builder.quick_push (mask_elems[i]);
+
+    vec_perm_indices sel (builder, 2, op0_len);
+    char reason[100] = "\0";
+    tree res = fold_vec_perm_cst (TREE_TYPE (op0), op0, op1, sel, reason, false);
+    gcc_assert (res == NULL_TREE);
+    gcc_assert (!strcmp (reason, "S is not multiple of npatterns"));
+  }
+
+  /* Case 6: Select pattern 0 of op0 and dup of op0[0]
+     op0, op1, sel: npatterns = 2, nelts_per_pattern = 3
+     sel = { 0, 0, 2, 0, 4, 0, ... }.
+
+     For pattern {0, 2, 4, ...}:
+     a1 = 2
+     len = 16 + 16x
+     S = 2
+     esel = len / npatterns_sel = (16 + 16x) / 2 = (8 + 8x)
+     ae = a1 + (esel - 2) * S
+	= 2 + (8 + 8x - 2) * 2
+	= 14 + 16x
+     a1 / arg0_len = 2 / (16 + 16x) = 0
+     ae / arg0_len = (14 + 16x) / (16 + 16x) = 0
+     So a1/arg0_len = ae/arg0_len = 0
+     Hence we select from first vector op0
+     S = 2, npatterns = 2.
+     Since S is multiple of npatterns(op0), we are selecting from
+     same pattern of op0.
+
+     For pattern {0, ...}, we are choosing { op0[0] ... }
+     So res will be combination of above patterns:
+     res: { op0[0], op0[0], op0[2], op0[0], op0[4], op0[0], ... }
+     with npatterns = 2, nelts_per_pattern = 3.  */
+  {
+    tree op0 = build_vec_cst_rand (char_type_node, 2, 3, 2);
+    tree op1 = build_vec_cst_rand (char_type_node, 2, 3, 2);
+    poly_uint64 op0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (op0));
+
+    vec_perm_builder builder (op0_len, 2, 3);
+    int mask_elems[] = { 0, 0, 2, 0, 4, 0 };
+    for (int i = 0; i < 6; i++)
+      builder.quick_push (mask_elems[i]);
+
+    vec_perm_indices sel (builder, 2, op0_len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (op0), op0, op1, sel);
+    tree expected_res[] = { vector_cst_elt (op0, 0), vector_cst_elt (op0, 0),
+			    vector_cst_elt (op0, 2), vector_cst_elt (op0, 0),
+			    vector_cst_elt (op0, 4), vector_cst_elt (op0, 0) };
+    validate_res (2, 3, res, expected_res);
+  }
+
+  /* Case 7: sel_npatterns > op_npatterns;
+     op0, op1: npatterns = 2, nelts_per_pattern = 3
+     sel: { 0, 0, 1, len, 2, 0, 3, len, 4, 0, 5, len, ...},
+     with npatterns = 4, nelts_per_pattern = 3.  */
+  {
+    tree op0 = build_vec_cst_rand (char_type_node, 2, 3, 2);
+    tree op1 = build_vec_cst_rand (char_type_node, 2, 3, 2);
+    poly_uint64 op0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (op0));
+
+    vec_perm_builder builder(op0_len, 4, 3);
+    // -1 is used as place holder for poly_int_cst
+    int mask_elems[] = { 0, 0, 1, -1, 2, 0, 3, -1, 4, 0, 5, -1 };
+    for (int i = 0; i < 12; i++)
+      builder.quick_push ((mask_elems[i] == -1) ? op0_len : mask_elems[i]);
+
+    vec_perm_indices sel (builder, 2, op0_len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (op0), op0, op1, sel);
+    tree expected_res[] = { vector_cst_elt (op0, 0), vector_cst_elt (op0, 0),
+			    vector_cst_elt (op0, 1), vector_cst_elt (op1, 0),
+			    vector_cst_elt (op0, 2), vector_cst_elt (op0, 0),
+			    vector_cst_elt (op0, 3), vector_cst_elt (op1, 0),
+			    vector_cst_elt (op0, 4), vector_cst_elt (op0, 0),
+			    vector_cst_elt (op0, 5), vector_cst_elt (op1, 0) };
+    validate_res (4, 3, res, expected_res);
+  }
+}
+
+static void
+test_dup ()
+{
+  /* Case 1: mask = {0, ...} */
+  {
+    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (len, 1, 1);
+    builder.quick_push (0);
+    vec_perm_indices sel (builder, 2, len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+    tree expected_res[] = { vector_cst_elt (res, 0) };
+    validate_res (1, 1, res, expected_res);
+  }
+
+  /* Case 2: mask = {len, ...} */
+  {
+    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (len, 1, 1);
+    builder.quick_push (len);
+    vec_perm_indices sel (builder, 2, len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+    tree expected_res[] = { vector_cst_elt (arg1, 0) };
+    validate_res (1, 1, res, expected_res);
+  }
+
+  /* Case 3: mask = { 0, len, ... } */
+  {
+    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (len, 2, 1);
+    builder.quick_push (0);
+    builder.quick_push (len);
+    vec_perm_indices sel (builder, 2, len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0) };
+    validate_res (2, 1, res, expected_res);
+  }
+
+  /* Case 4: mask = { 0, len, 1, len+1, ... } */
+  {
+    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (len, 2, 2);
+    builder.quick_push (0);
+    builder.quick_push (len);
+    builder.quick_push (1);
+    builder.quick_push (len + 1);
+    vec_perm_indices sel (builder, 2, len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0),
+			    vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
+			  };
+    validate_res (2, 2, res, expected_res);
+  }
+
+  /* Case 5: mask = { 0, len, 1, len+1, .... }
+     npatterns = 4, nelts_per_pattern = 1 */
+  {
+    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (len, 4, 1);
+    builder.quick_push (0);
+    builder.quick_push (len);
+    builder.quick_push (1);
+    builder.quick_push (len + 1);
+    vec_perm_indices sel (builder, 2, len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0),
+			    vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
+			  };
+    validate_res (4, 1, res, expected_res);
+  }
+
+  /* Case 6: mask = {0, 4, ...}
+     npatterns = 1, nelts_per_pattern = 2.
+     This should return NULL_TREE because the index 4 may choose
+     from either arg0 or arg1 depending on vector length.  */
+  {
+    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (len, 1, 2);
+    builder.quick_push (0);
+    builder.quick_push (4);
+    vec_perm_indices sel (builder, 2, len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+    ASSERT_TRUE (res == NULL_TREE);
+  }
+
+  /* Case 7: npatterns(arg0) = 4 > npatterns(sel) = 2
+     mask = {0, len, 1, len + 1, ...}
+     sel_npatterns = 2, sel_nelts_per_pattern = 2.  */
+  {
+    tree arg0 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    tree arg1 = build_vec_cst_rand (integer_type_node, 2, 3, 1);
+    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+    vec_perm_builder builder (arg0_len, 2, 2);
+    builder.quick_push (0);
+    builder.quick_push (arg0_len);
+    builder.quick_push (1);
+    builder.quick_push (arg0_len + 1);
+    vec_perm_indices sel (builder, 2, arg0_len);
+    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt (arg1, 0),
+			    vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
+			  };
+    validate_res (2, 2, res, expected_res);
+  }
+}
+
+static void
+test ()
+{
+  tree vectype = get_preferred_vectype (integer_type_node);
+  if (TYPE_VECTOR_SUBPARTS (vectype).is_constant ())
+    return;
+
+  test_dup ();
+  test_stepped ();
+}
+};
+
 /* Verify that various binary operations on vectors are folded
    correctly.  */
 
@@ -16942,6 +17542,7 @@  fold_const_cc_tests ()
   test_arithmetic_folding ();
   test_vector_folding ();
   test_vec_duplicate_folding ();
+  test_fold_vec_perm_cst::test ();
 }
 
 } // namespace selftest