Add pattern to convert vector shift + bitwise and + multiply to vector compare in some cases.

Message ID 20220813095843.1452308-1-manolis.tsamis@vrull.eu
State New, archived

Commit Message

Manolis Tsamis Aug. 13, 2022, 9:58 a.m. UTC
  When using SWAR (SIMD in a register) techniques, a comparison operation within
such a register can be made with a combination of shifts, bitwise and, and
multiplication. If code using this scheme is vectorized, there is potential
to replace all these operations with a single vector comparison by
reinterpreting the vector types to match the width of the SWAR register.
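
(For reference, the scalar idiom being recognized can be sketched as below;
packed_cmp_16_32 is the same function as in the new test, while
reference_cmp_16_32 and main are only a hand-written illustrative cross-check
and not part of the patch.)

  #include <assert.h>
  #include <stdint.h>

  /* SWAR: a uint32_t holds two packed 16-bit lanes.  The shift/and/multiply
     idiom produces an all-ones lane wherever that lane's sign bit is set,
     i.e. a per-lane signed "< 0" comparison.  */
  static uint32_t packed_cmp_16_32(uint32_t a)
  {
    return ((a >> 15) & 0x10001U) * 0xffffU;
  }

  /* Per-lane formulation of the same operation, for comparison.  */
  static uint32_t reference_cmp_16_32(uint32_t a)
  {
    uint32_t lo = (a & 0x8000U) ? 0xffffU : 0;
    uint32_t hi = (a & 0x80000000U) ? 0xffff0000U : 0;
    return hi | lo;
  }

  int main(void)
  {
    uint32_t tests[] = { 0x00000000U, 0x00018000U, 0x8000ffffU, 0x7fff7fffU };
    for (unsigned i = 0; i < sizeof tests / sizeof tests[0]; i++)
      assert(packed_cmp_16_32(tests[i]) == reference_cmp_16_32(tests[i]));
    return 0;
  }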

For example, for the test function packed_cmp_16_32, the original generated code is:

        ldr     q0, [x0]
        add     w1, w1, 1
        ushr    v0.4s, v0.4s, 15
        and     v0.16b, v0.16b, v2.16b
        shl     v1.4s, v0.4s, 16
        sub     v0.4s, v1.4s, v0.4s
        str     q0, [x0], 16
        cmp     w2, w1
        bhi     .L20

with this pattern the above can be optimized to:

        ldr     q0, [x0]
        add     w1, w1, 1
        cmlt    v0.8h, v0.8h, #0
        str     q0, [x0], 16
        cmp     w2, w1
        bhi     .L20

The effect is similar for x86-64.

gcc/ChangeLog:

	* match.pd: Simplify vector shift + bit_and + multiply in some cases.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/swar_to_vec_cmp.c: New test.

Signed-off-by: mtsamis <manolis.tsamis@vrull.eu>
---
 gcc/match.pd                                  | 57 +++++++++++++++
 .../gcc.target/aarch64/swar_to_vec_cmp.c      | 72 +++++++++++++++++++
 2 files changed, 129 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c
  

Comments

Hans-Peter Nilsson Aug. 18, 2022, 1:18 a.m. UTC | #1
On Sat, 13 Aug 2022, mtsamis wrote:
> When using SWAR (SIMD in a register) techniques a comparison operation within

*within

> such a register can be made by using a combination of shifts, bitwise and and
> multiplication. If code using this scheme is vectorized then there is potential
> to replace all these operations with a single vector comparison, by reinterpreting
> the vector types to match the width of the SWAR register.

Had it not been for "all" modern processors these days having
SIMD instructions, adding general SWAR handling to the
vectorizer would be worth promoting to someone's favorite
decision-maker.  (This is a wonderful first step.)

Random typo spotted:

> +++ b/gcc/match.pd
> @@ -301,6 +301,63 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>      (view_convert (bit_and:itype (view_convert @0)
>  				 (ne @1 { build_zero_cst (type); })))))))
>
> +/* In SWAR (SIMD in a register) code a comparison of packed data can
> +   be consturcted with a particular combination of shift, bitwise and,

*constructed

brgds, H-P
  
Andrew Pinski Aug. 18, 2022, 2:08 a.m. UTC | #2
On Sat, Aug 13, 2022 at 2:59 AM mtsamis <manolis.tsamis@vrull.eu> wrote:
>
> When using SWAR (SIMD in a register) techniques a comparison operation within
> such a register can be made by using a combination of shifts, bitwise and and
> multiplication. If code using this scheme is vectorized then there is potential
> to replace all these operations with a single vector comparison, by reinterpreting
> the vector types to match the width of the SWAR register.
>
> For example, for the test function packed_cmp_16_32, the original generated code is:
>
>         ldr     q0, [x0]
>         add     w1, w1, 1
>         ushr    v0.4s, v0.4s, 15
>         and     v0.16b, v0.16b, v2.16b
>         shl     v1.4s, v0.4s, 16
>         sub     v0.4s, v1.4s, v0.4s
>         str     q0, [x0], 16
>         cmp     w2, w1
>         bhi     .L20
>
> with this pattern the above can be optimized to:
>
>         ldr     q0, [x0]
>         add     w1, w1, 1
>         cmlt    v0.8h, v0.8h, #0
>         str     q0, [x0], 16
>         cmp     w2, w1
>         bhi     .L20
>
> The effect is similar for x86-64.
>
> gcc/ChangeLog:
>
>         * match.pd: Simplify vector shift + bit_and + multiply in some cases.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/aarch64/swar_to_vec_cmp.c: New test.
>
> Signed-off-by: mtsamis <manolis.tsamis@vrull.eu>
> ---
>  gcc/match.pd                                  | 57 +++++++++++++++
>  .../gcc.target/aarch64/swar_to_vec_cmp.c      | 72 +++++++++++++++++++
>  2 files changed, 129 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c
>
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 8bbc0dbd5cd..5c768a94846 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -301,6 +301,63 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>      (view_convert (bit_and:itype (view_convert @0)
>                                  (ne @1 { build_zero_cst (type); })))))))
>
> +/* In SWAR (SIMD in a register) code a comparison of packed data can
> +   be consturcted with a particular combination of shift, bitwise and,
> +   and multiplication by constants.  If that code is vectorized we can
> +   convert this pattern into a more efficient vector comparison.  */
> +(simplify
> + (mult (bit_and (rshift @0 @1) @2) @3)
> + (with {
> +   tree op_type = TREE_TYPE (@0);
> +   tree rshift_cst = NULL_TREE;
> +   tree bit_and_cst = NULL_TREE;
> +   tree mult_cst = NULL_TREE;
> +  }
> +  /* Make sure we're working with vectors and uniform vector constants.  */
> +  (if (VECTOR_TYPE_P (op_type)

I think you can use type here and don't need op_type.

Maybe change this to:
(simplify
 (mult (bit_and (rshift @0 uniform_integer_cst_p@1)
                uniform_integer_cst_p@2)
       uniform_integer_cst_p@3)
 (if (VECTOR_TYPE_P (type))
  (with {
   tree rshift_cst = uniform_integer_cst_p (@1);
   tree bit_and_cst = uniform_integer_cst_p (@2);
   tree mult_cst = uniform_integer_cst_p (@3);
  }
  ...

Similar to the patterns under "Simplifications of comparisons".

Thanks,
Andrew Pinski

> +       && (rshift_cst = uniform_integer_cst_p (@1))
> +       && (bit_and_cst = uniform_integer_cst_p (@2))
> +       && (mult_cst = uniform_integer_cst_p (@3)))
> +   /* Compute what constants would be needed for this to represent a packed
> +      comparison based on the shift amount denoted by RSHIFT_CST.  */
> +   (with {
> +     HOST_WIDE_INT vec_elem_bits = vector_element_bits (op_type);
> +     HOST_WIDE_INT vec_nelts = TYPE_VECTOR_SUBPARTS (op_type).to_constant ();
> +     HOST_WIDE_INT vec_bits = vec_elem_bits * vec_nelts;
> +
> +     unsigned HOST_WIDE_INT cmp_bits_i, bit_and_i, mult_i;
> +     unsigned HOST_WIDE_INT target_mult_i, target_bit_and_i;
> +     cmp_bits_i = tree_to_uhwi (rshift_cst) + 1;
> +     target_mult_i = (HOST_WIDE_INT_1U << cmp_bits_i) - 1;
> +
> +     mult_i = tree_to_uhwi (mult_cst);
> +     bit_and_i = tree_to_uhwi (bit_and_cst);
> +     target_bit_and_i = 0;
> +
> +     for (unsigned i = 0; i < vec_elem_bits / cmp_bits_i; i++)
> +       target_bit_and_i = (target_bit_and_i << cmp_bits_i) | 1U;
> +    }
> +    (if ((exact_log2 (cmp_bits_i)) >= 0
> +        && cmp_bits_i < HOST_BITS_PER_WIDE_INT
> +        && vec_elem_bits <= HOST_BITS_PER_WIDE_INT
> +        && tree_fits_uhwi_p (rshift_cst)
> +        && tree_fits_uhwi_p (mult_cst)
> +        && tree_fits_uhwi_p (bit_and_cst)
> +        && target_mult_i == mult_i
> +        && target_bit_and_i == bit_and_i)
> +     /* Compute the vector shape for the comparison and check if the target is
> +       able to expand the comparison with that type.  */
> +     (with {
> +       tree bool_type = build_nonstandard_boolean_type (cmp_bits_i);
> +       int vector_type_nelts = vec_bits / cmp_bits_i;
> +       tree vector_type = build_vector_type (bool_type, vector_type_nelts);
> +       tree zeros = build_zero_cst (vector_type);
> +       tree mask_type = truth_type_for (vector_type);
> +      }
> +      (if (expand_vec_cmp_expr_p (vector_type, mask_type, LT_EXPR))
> +       (view_convert:op_type (lt:mask_type (view_convert:vector_type @0)
> +                                          { zeros; })))))))))
> +
>  (for cmp (gt ge lt le)
>       outp (convert convert negate negate)
>       outn (negate negate convert convert)
> diff --git a/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c
> new file mode 100644
> index 00000000000..26f9ad9ef28
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c
> @@ -0,0 +1,72 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -ftree-vectorize" } */
> +
> +typedef unsigned char uint8_t;
> +typedef unsigned short uint16_t;
> +typedef unsigned int uint32_t;
> +
> +/* 8-bit SWAR tests.  */
> +
> +static uint8_t packed_cmp_8_8(uint8_t a)
> +{
> +  return ((a >> 7) & 0x1U) * 0xffU;
> +}
> +
> +/* 16-bit SWAR tests.  */
> +
> +static uint16_t packed_cmp_8_16(uint16_t a)
> +{
> +  return ((a >> 7) & 0x101U) * 0xffU;
> +}
> +
> +static uint16_t packed_cmp_16_16(uint16_t a)
> +{
> +  return ((a >> 15) & 0x1U) * 0xffffU;
> +}
> +
> +/* 32-bit SWAR tests.  */
> +
> +static uint32_t packed_cmp_8_32(uint32_t a)
> +{
> +  return ((a >> 7) & 0x1010101U) * 0xffU;
> +}
> +
> +static uint32_t packed_cmp_16_32(uint32_t a)
> +{
> +  return ((a >> 15) & 0x10001U) * 0xffffU;
> +}
> +
> +static uint32_t packed_cmp_32_32(uint32_t a)
> +{
> +  return ((a >> 31) & 0x1U) * 0xffffffffU;
> +}
> +
> +/* Driver function to test the vectorized code generated for the different
> +   packed_cmp variants.  */
> +
> +#define VECTORIZED_PACKED_CMP(T, FUNC)         \
> +  void vectorized_cmp_##FUNC(T* a, int n)      \
> +  {                                            \
> +    n = (n / 32) * 32;                         \
> +    for(int i = 0; i < n; i += 4)              \
> +    {                                          \
> +      a[i + 0] = FUNC(a[i + 0]);               \
> +      a[i + 1] = FUNC(a[i + 1]);               \
> +      a[i + 2] = FUNC(a[i + 2]);               \
> +      a[i + 3] = FUNC(a[i + 3]);               \
> +    }                                          \
> +  }
> +
> +VECTORIZED_PACKED_CMP(uint8_t, packed_cmp_8_8);
> +
> +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_8_16);
> +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_16_16);
> +
> +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_8_32);
> +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_16_32);
> +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_32_32);
> +
> +/* { dg-final { scan-assembler {\tcmlt\t} } } */
> +/* { dg-final { scan-assembler-not {\tushr\t} } } */
> +/* { dg-final { scan-assembler-not {\tshl\t} } } */
> +/* { dg-final { scan-assembler-not {\tmul\t} } } */
> --
> 2.34.1
>
  
Richard Biener Aug. 26, 2022, 9:08 a.m. UTC | #3
On Sat, Aug 13, 2022 at 11:59 AM mtsamis <manolis.tsamis@vrull.eu> wrote:
>
> When using SWAR (SIMD in a register) techniques a comparison operation within
> such a register can be made by using a combination of shifts, bitwise and and
> multiplication. If code using this scheme is vectorized then there is potential
> to replace all these operations with a single vector comparison, by reinterpreting
> the vector types to match the width of the SWAR register.
>
> For example, for the test function packed_cmp_16_32, the original generated code is:
>
>         ldr     q0, [x0]
>         add     w1, w1, 1
>         ushr    v0.4s, v0.4s, 15
>         and     v0.16b, v0.16b, v2.16b
>         shl     v1.4s, v0.4s, 16
>         sub     v0.4s, v1.4s, v0.4s
>         str     q0, [x0], 16
>         cmp     w2, w1
>         bhi     .L20
>
> with this pattern the above can be optimized to:
>
>         ldr     q0, [x0]
>         add     w1, w1, 1
>         cmlt    v0.8h, v0.8h, #0
>         str     q0, [x0], 16
>         cmp     w2, w1
>         bhi     .L20
>
> The effect is similar for x86-64.
>
> gcc/ChangeLog:
>
>         * match.pd: Simplify vector shift + bit_and + multiply in some cases.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/aarch64/swar_to_vec_cmp.c: New test.
>
> Signed-off-by: mtsamis <manolis.tsamis@vrull.eu>
> ---
>  gcc/match.pd                                  | 57 +++++++++++++++
>  .../gcc.target/aarch64/swar_to_vec_cmp.c      | 72 +++++++++++++++++++
>  2 files changed, 129 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c
>
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 8bbc0dbd5cd..5c768a94846 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -301,6 +301,63 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>      (view_convert (bit_and:itype (view_convert @0)
>                                  (ne @1 { build_zero_cst (type); })))))))
>
> +/* In SWAR (SIMD in a register) code a comparison of packed data can
> +   be consturcted with a particular combination of shift, bitwise and,
> +   and multiplication by constants.  If that code is vectorized we can
> +   convert this pattern into a more efficient vector comparison.  */
> +(simplify
> + (mult (bit_and (rshift @0 @1) @2) @3)

You should restrict the pattern a bit more, below you use uniform_integer_cst_p
and also require a vector type thus

  (simplify
   (mult (bit_and (rshift @0 VECTOR_CST@1) VECTOR_CST@2) VECTOR_CST@3)


> + (with {
> +   tree op_type = TREE_TYPE (@0);

that's the same as 'type' which is already available.

> +   tree rshift_cst = NULL_TREE;
> +   tree bit_and_cst = NULL_TREE;
> +   tree mult_cst = NULL_TREE;
> +  }
> +  /* Make sure we're working with vectors and uniform vector constants.  */
> +  (if (VECTOR_TYPE_P (op_type)
> +       && (rshift_cst = uniform_integer_cst_p (@1))
> +       && (bit_and_cst = uniform_integer_cst_p (@2))
> +       && (mult_cst = uniform_integer_cst_p (@3)))
> +   /* Compute what constants would be needed for this to represent a packed
> +      comparison based on the shift amount denoted by RSHIFT_CST.  */
> +   (with {
> +     HOST_WIDE_INT vec_elem_bits = vector_element_bits (op_type);
> +     HOST_WIDE_INT vec_nelts = TYPE_VECTOR_SUBPARTS (op_type).to_constant ();

you need to check that this isn't a VLA vector operation.

> +     HOST_WIDE_INT vec_bits = vec_elem_bits * vec_nelts;
> +
> +     unsigned HOST_WIDE_INT cmp_bits_i, bit_and_i, mult_i;
> +     unsigned HOST_WIDE_INT target_mult_i, target_bit_and_i;
> +     cmp_bits_i = tree_to_uhwi (rshift_cst) + 1;

and that the rshift_cst and others actually fit an uhwi.

> +     target_mult_i = (HOST_WIDE_INT_1U << cmp_bits_i) - 1;
> +
> +     mult_i = tree_to_uhwi (mult_cst);
> +     bit_and_i = tree_to_uhwi (bit_and_cst);
> +     target_bit_and_i = 0;
> +
> +     for (unsigned i = 0; i < vec_elem_bits / cmp_bits_i; i++)
> +       target_bit_and_i = (target_bit_and_i << cmp_bits_i) | 1U;

it would be nice to have a comment on what this actually does ...
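
(For reference, the loop builds the expected bit_and mask: one set bit at the
bottom of every cmp_bits_i-wide group within a single vector element, e.g.
0x01010101 for 8-bit groups in a 32-bit element and 0x00010001 for 16-bit
groups.  A standalone sketch of the same computation, with made-up names,
just to illustrate:)

  #include <stdio.h>

  /* One set bit at the bottom of each cmp_bits-wide group of an element.  */
  static unsigned long long replicated_ones(unsigned elem_bits, unsigned cmp_bits)
  {
    unsigned long long mask = 0;
    for (unsigned i = 0; i < elem_bits / cmp_bits; i++)
      mask = (mask << cmp_bits) | 1ULL;
    return mask;
  }

  int main(void)
  {
    printf("%llx %llx\n", replicated_ones(32, 8), replicated_ones(32, 16));
    /* prints: 1010101 10001 */
    return 0;
  }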

> +    }
> +    (if ((exact_log2 (cmp_bits_i)) >= 0
> +        && cmp_bits_i < HOST_BITS_PER_WIDE_INT
> +        && vec_elem_bits <= HOST_BITS_PER_WIDE_INT
> +        && tree_fits_uhwi_p (rshift_cst)
> +        && tree_fits_uhwi_p (mult_cst)
> +        && tree_fits_uhwi_p (bit_and_cst)
> +        && target_mult_i == mult_i
> +        && target_bit_and_i == bit_and_i)
> +     /* Compute the vector shape for the comparison and check if the target is
> +       able to expand the comparison with that type.  */
> +     (with {
> +       tree bool_type = build_nonstandard_boolean_type (cmp_bits_i);
> +       int vector_type_nelts = vec_bits / cmp_bits_i;
> +       tree vector_type = build_vector_type (bool_type, vector_type_nelts);

why do you build a bool vector type here and then ...

> +       tree zeros = build_zero_cst (vector_type);
> +       tree mask_type = truth_type_for (vector_type);

... its truth type?  Note both might not be actually supported by the target
and thus receive BLKmode or an integer mode.  The latter is a problem
for expand_vec_cmp_expr_p as that might pick up a pattern not suitable
here.   Also note that truth_type_for can result in a mask mode, aka
QImode with AVX512 or some VnBImode on other archs - those are
not OK to be simply view_converted back to op_type.  In general
a vector compare operation yields a mask and you can convert that
to a -1/0 value using a vec_cond_expr.  I think we have a pattern
that can then properly simplify the case where this can be expressed
as a view_convert, but of course you then also need to check for
vec_cond_expr support.

I would suggest you make 'vector_type' an integer element type
(that also properly specifies the sign of the comparison!) and check
you end up with a vector mode and the mode of the mask_type
agrees with that if you don't want to go the vec_cond_expr route.


> +      }
> +      (if (expand_vec_cmp_expr_p (vector_type, mask_type, LT_EXPR))
> +       (view_convert:op_type (lt:mask_type (view_convert:vector_type @0)
> +                                          { zeros; })))))))))
> +
>  (for cmp (gt ge lt le)
>       outp (convert convert negate negate)
>       outn (negate negate convert convert)
> diff --git a/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c
> new file mode 100644
> index 00000000000..26f9ad9ef28
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c
> @@ -0,0 +1,72 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -ftree-vectorize" } */
> +
> +typedef unsigned char uint8_t;
> +typedef unsigned short uint16_t;
> +typedef unsigned int uint32_t;
> +
> +/* 8-bit SWAR tests.  */
> +
> +static uint8_t packed_cmp_8_8(uint8_t a)
> +{
> +  return ((a >> 7) & 0x1U) * 0xffU;
> +}
> +
> +/* 16-bit SWAR tests.  */
> +
> +static uint16_t packed_cmp_8_16(uint16_t a)
> +{
> +  return ((a >> 7) & 0x101U) * 0xffU;
> +}
> +
> +static uint16_t packed_cmp_16_16(uint16_t a)
> +{
> +  return ((a >> 15) & 0x1U) * 0xffffU;
> +}
> +
> +/* 32-bit SWAR tests.  */
> +
> +static uint32_t packed_cmp_8_32(uint32_t a)
> +{
> +  return ((a >> 7) & 0x1010101U) * 0xffU;
> +}
> +
> +static uint32_t packed_cmp_16_32(uint32_t a)
> +{
> +  return ((a >> 15) & 0x10001U) * 0xffffU;
> +}
> +
> +static uint32_t packed_cmp_32_32(uint32_t a)
> +{
> +  return ((a >> 31) & 0x1U) * 0xffffffffU;
> +}
> +
> +/* Driver function to test the vectorized code generated for the different
> +   packed_cmp variants.  */
> +
> +#define VECTORIZED_PACKED_CMP(T, FUNC)         \
> +  void vectorized_cmp_##FUNC(T* a, int n)      \
> +  {                                            \
> +    n = (n / 32) * 32;                         \
> +    for(int i = 0; i < n; i += 4)              \
> +    {                                          \
> +      a[i + 0] = FUNC(a[i + 0]);               \
> +      a[i + 1] = FUNC(a[i + 1]);               \
> +      a[i + 2] = FUNC(a[i + 2]);               \
> +      a[i + 3] = FUNC(a[i + 3]);               \
> +    }                                          \
> +  }
> +
> +VECTORIZED_PACKED_CMP(uint8_t, packed_cmp_8_8);
> +
> +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_8_16);
> +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_16_16);
> +
> +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_8_32);
> +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_16_32);
> +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_32_32);
> +
> +/* { dg-final { scan-assembler {\tcmlt\t} } } */
> +/* { dg-final { scan-assembler-not {\tushr\t} } } */
> +/* { dg-final { scan-assembler-not {\tshl\t} } } */
> +/* { dg-final { scan-assembler-not {\tmul\t} } } */
> --
> 2.34.1
>
  
Tamar Christina Aug. 26, 2022, 1:29 p.m. UTC | #4
> -----Original Message-----
> From: Gcc-patches <gcc-patches-
> bounces+tamar.christina=arm.com@gcc.gnu.org> On Behalf Of Richard
> Biener via Gcc-patches
> Sent: Friday, August 26, 2022 10:08 AM
> To: mtsamis <manolis.tsamis@vrull.eu>
> Cc: GCC Patches <gcc-patches@gcc.gnu.org>;
> jiangning.liu@amperecomputing.com; Philipp Tomsich
> <philipp.tomsich@vrull.eu>
> Subject: Re: [PATCH] Add pattern to convert vector shift + bitwise and +
> multiply to vector compare in some cases.
> 
> On Sat, Aug 13, 2022 at 11:59 AM mtsamis <manolis.tsamis@vrull.eu> wrote:
> >
> > When using SWAR (SIMD in a register) techniques a comparison operation
> > within such a register can be made by using a combination of shifts,
> > bitwise and and multiplication. If code using this scheme is
> > vectorized then there is potential to replace all these operations
> > with a single vector comparison, by reinterpreting the vector types to
> match the width of the SWAR register.
> >
> > For example, for the test function packed_cmp_16_32, the original
> generated code is:
> >
> >         ldr     q0, [x0]
> >         add     w1, w1, 1
> >         ushr    v0.4s, v0.4s, 15
> >         and     v0.16b, v0.16b, v2.16b
> >         shl     v1.4s, v0.4s, 16
> >         sub     v0.4s, v1.4s, v0.4s
> >         str     q0, [x0], 16
> >         cmp     w2, w1
> >         bhi     .L20
> >
> > with this pattern the above can be optimized to:
> >
> >         ldr     q0, [x0]
> >         add     w1, w1, 1
> >         cmlt    v0.8h, v0.8h, #0
> >         str     q0, [x0], 16
> >         cmp     w2, w1
> >         bhi     .L20
> >
> > The effect is similar for x86-64.
> >
> > gcc/ChangeLog:
> >
> >         * match.pd: Simplify vector shift + bit_and + multiply in some cases.
> >
> > gcc/testsuite/ChangeLog:
> >
> >         * gcc.target/aarch64/swar_to_vec_cmp.c: New test.
> >
> > Signed-off-by: mtsamis <manolis.tsamis@vrull.eu>
> > ---
> >  gcc/match.pd                                  | 57 +++++++++++++++
> >  .../gcc.target/aarch64/swar_to_vec_cmp.c      | 72
> +++++++++++++++++++
> >  2 files changed, 129 insertions(+)
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c
> >
> > diff --git a/gcc/match.pd b/gcc/match.pd index
> > 8bbc0dbd5cd..5c768a94846 100644
> > --- a/gcc/match.pd
> > +++ b/gcc/match.pd
> > @@ -301,6 +301,63 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> >      (view_convert (bit_and:itype (view_convert @0)
> >                                  (ne @1 { build_zero_cst (type);
> > })))))))
> >
> > +/* In SWAR (SIMD in a register) code a comparison of packed data can
> > +   be consturcted with a particular combination of shift, bitwise and,
> > +   and multiplication by constants.  If that code is vectorized we can
> > +   convert this pattern into a more efficient vector comparison.  */
> > +(simplify  (mult (bit_and (rshift @0 @1) @2) @3)
> 
> You should restrict the pattern a bit more, below you use
> uniform_integer_cst_p and also require a vector type thus
> 
>   (simplify
>    (mult (bit_and (rshift @0 VECTOR_CST@1) VECTOR_CST@2)
> VECTOR_CST@3)
> 
> 
> > + (with {
> > +   tree op_type = TREE_TYPE (@0);
> 
> that's the same as 'type' which is already available.
> 
> > +   tree rshift_cst = NULL_TREE;
> > +   tree bit_and_cst = NULL_TREE;
> > +   tree mult_cst = NULL_TREE;
> > +  }
> > +  /* Make sure we're working with vectors and uniform vector
> > + constants.  */  (if (VECTOR_TYPE_P (op_type)
> > +       && (rshift_cst = uniform_integer_cst_p (@1))
> > +       && (bit_and_cst = uniform_integer_cst_p (@2))
> > +       && (mult_cst = uniform_integer_cst_p (@3)))
> > +   /* Compute what constants would be needed for this to represent a
> packed
> > +      comparison based on the shift amount denoted by RSHIFT_CST.  */
> > +   (with {
> > +     HOST_WIDE_INT vec_elem_bits = vector_element_bits (op_type);
> > +     HOST_WIDE_INT vec_nelts = TYPE_VECTOR_SUBPARTS
> > + (op_type).to_constant ();
> 
> you need to check that this isn't a VLA vector operation.

Seems like this pattern should be applicable to VLA as well, no?
So could we not keep vec_nelts as a poly and just use exact_div
below in the division?  The pattern is only valid if vec_elem_bits is a
multiple of cmp_bits_i anyway.  The build_vector_* should then
do the right thing.

Regards,
Tamar

> 
> > +     HOST_WIDE_INT vec_bits = vec_elem_bits * vec_nelts;
> > +
> > +     unsigned HOST_WIDE_INT cmp_bits_i, bit_and_i, mult_i;
> > +     unsigned HOST_WIDE_INT target_mult_i, target_bit_and_i;
> > +     cmp_bits_i = tree_to_uhwi (rshift_cst) + 1;
> 
> and that the rshift_cst and others actually fit an uhwi.
> 
> > +     target_mult_i = (HOST_WIDE_INT_1U << cmp_bits_i) - 1;
> > +
> > +     mult_i = tree_to_uhwi (mult_cst);
> > +     bit_and_i = tree_to_uhwi (bit_and_cst);
> > +     target_bit_and_i = 0;
> > +
> > +     for (unsigned i = 0; i < vec_elem_bits / cmp_bits_i; i++)
> > +       target_bit_and_i = (target_bit_and_i << cmp_bits_i) | 1U;
> 
> it would be nice to have a comment on what this actually does ...
> 
> > +    }
> > +    (if ((exact_log2 (cmp_bits_i)) >= 0
> > +        && cmp_bits_i < HOST_BITS_PER_WIDE_INT
> > +        && vec_elem_bits <= HOST_BITS_PER_WIDE_INT
> > +        && tree_fits_uhwi_p (rshift_cst)
> > +        && tree_fits_uhwi_p (mult_cst)
> > +        && tree_fits_uhwi_p (bit_and_cst)
> > +        && target_mult_i == mult_i
> > +        && target_bit_and_i == bit_and_i)
> > +     /* Compute the vector shape for the comparison and check if the
> target is
> > +       able to expand the comparison with that type.  */
> > +     (with {
> > +       tree bool_type = build_nonstandard_boolean_type (cmp_bits_i);
> > +       int vector_type_nelts = vec_bits / cmp_bits_i;
> > +       tree vector_type = build_vector_type (bool_type,
> > + vector_type_nelts);
> 
> why do you build a bool vector type here and then ...
> 
> > +       tree zeros = build_zero_cst (vector_type);
> > +       tree mask_type = truth_type_for (vector_type);
> 
> ... its truth type?  Note both might not be actually supported by the target
> and thus receive BLKmode or an integer mode.  The latter is a problem for
> expand_vec_cmp_expr_p as that might pick up a pattern not suitable
> here.   Also note that truth_type_for can result in a mask mode, aka
> QImode with AVX512 or some VnBImode on other archs - those are not OK
> to be simply view_converted back to op_type.  In general a vector compare
> operation yields a mask and you can convert that to a -1/0 value using a
> vec_cond_expr.  I think we have a pattern that can then properly simplify the
> case where this can be expressed as a view_convert, but of course you then
> also need to check for vec_cond_expr support.
> 
> I would suggest you make 'vector_type' an integer element type (that also
> properly specifies the sign of the comparison!) and check you end up with a
> vector mode and the mode of the mask_type agrees with that if you don't
> want to go the vec_cond_expr route.
> 
> 
> > +      }
> > +      (if (expand_vec_cmp_expr_p (vector_type, mask_type, LT_EXPR))
> > +       (view_convert:op_type (lt:mask_type (view_convert:vector_type
> @0)
> > +                                          { zeros; })))))))))
> > +
> >  (for cmp (gt ge lt le)
> >       outp (convert convert negate negate)
> >       outn (negate negate convert convert) diff --git
> > a/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c
> > b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c
> > new file mode 100644
> > index 00000000000..26f9ad9ef28
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c
> > @@ -0,0 +1,72 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -ftree-vectorize" } */
> > +
> > +typedef unsigned char uint8_t;
> > +typedef unsigned short uint16_t;
> > +typedef unsigned int uint32_t;
> > +
> > +/* 8-bit SWAR tests.  */
> > +
> > +static uint8_t packed_cmp_8_8(uint8_t a) {
> > +  return ((a >> 7) & 0x1U) * 0xffU;
> > +}
> > +
> > +/* 16-bit SWAR tests.  */
> > +
> > +static uint16_t packed_cmp_8_16(uint16_t a) {
> > +  return ((a >> 7) & 0x101U) * 0xffU; }
> > +
> > +static uint16_t packed_cmp_16_16(uint16_t a) {
> > +  return ((a >> 15) & 0x1U) * 0xffffU; }
> > +
> > +/* 32-bit SWAR tests.  */
> > +
> > +static uint32_t packed_cmp_8_32(uint32_t a) {
> > +  return ((a >> 7) & 0x1010101U) * 0xffU; }
> > +
> > +static uint32_t packed_cmp_16_32(uint32_t a) {
> > +  return ((a >> 15) & 0x10001U) * 0xffffU; }
> > +
> > +static uint32_t packed_cmp_32_32(uint32_t a) {
> > +  return ((a >> 31) & 0x1U) * 0xffffffffU; }
> > +
> > +/* Driver function to test the vectorized code generated for the different
> > +   packed_cmp variants.  */
> > +
> > +#define VECTORIZED_PACKED_CMP(T, FUNC)         \
> > +  void vectorized_cmp_##FUNC(T* a, int n)      \
> > +  {                                            \
> > +    n = (n / 32) * 32;                         \
> > +    for(int i = 0; i < n; i += 4)              \
> > +    {                                          \
> > +      a[i + 0] = FUNC(a[i + 0]);               \
> > +      a[i + 1] = FUNC(a[i + 1]);               \
> > +      a[i + 2] = FUNC(a[i + 2]);               \
> > +      a[i + 3] = FUNC(a[i + 3]);               \
> > +    }                                          \
> > +  }
> > +
> > +VECTORIZED_PACKED_CMP(uint8_t, packed_cmp_8_8);
> > +
> > +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_8_16);
> > +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_16_16);
> > +
> > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_8_32);
> > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_16_32);
> > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_32_32);
> > +
> > +/* { dg-final { scan-assembler {\tcmlt\t} } } */
> > +/* { dg-final { scan-assembler-not {\tushr\t} } } */
> > +/* { dg-final { scan-assembler-not {\tshl\t} } } */
> > +/* { dg-final { scan-assembler-not {\tmul\t} } } */
> > --
> > 2.34.1
> >
  
Philipp Tomsich Nov. 22, 2022, 10:34 a.m. UTC | #5
Richard & Tamar,

On Fri, 26 Aug 2022 at 15:29, Tamar Christina <Tamar.Christina@arm.com> wrote:
>
> > -----Original Message-----
> > From: Gcc-patches <gcc-patches-
> > bounces+tamar.christina=arm.com@gcc.gnu.org> On Behalf Of Richard
> > Biener via Gcc-patches
> > Sent: Friday, August 26, 2022 10:08 AM
> > To: mtsamis <manolis.tsamis@vrull.eu>
> > Cc: GCC Patches <gcc-patches@gcc.gnu.org>;
> > jiangning.liu@amperecomputing.com; Philipp Tomsich
> > <philipp.tomsich@vrull.eu>
> > Subject: Re: [PATCH] Add pattern to convert vector shift + bitwise and +
> > multiply to vector compare in some cases.
> >
> > On Sat, Aug 13, 2022 at 11:59 AM mtsamis <manolis.tsamis@vrull.eu> wrote:
> > >
> > > When using SWAR (SIMD in a register) techniques a comparison operation
> > > within such a register can be made by using a combination of shifts,
> > > bitwise and and multiplication. If code using this scheme is
> > > vectorized then there is potential to replace all these operations
> > > with a single vector comparison, by reinterpreting the vector types to
> > match the width of the SWAR register.
> > >
> > > For example, for the test function packed_cmp_16_32, the original
> > generated code is:
> > >
> > >         ldr     q0, [x0]
> > >         add     w1, w1, 1
> > >         ushr    v0.4s, v0.4s, 15
> > >         and     v0.16b, v0.16b, v2.16b
> > >         shl     v1.4s, v0.4s, 16
> > >         sub     v0.4s, v1.4s, v0.4s
> > >         str     q0, [x0], 16
> > >         cmp     w2, w1
> > >         bhi     .L20
> > >
> > > with this pattern the above can be optimized to:
> > >
> > >         ldr     q0, [x0]
> > >         add     w1, w1, 1
> > >         cmlt    v0.8h, v0.8h, #0
> > >         str     q0, [x0], 16
> > >         cmp     w2, w1
> > >         bhi     .L20
> > >
> > > The effect is similar for x86-64.
> > >
> > > gcc/ChangeLog:
> > >
> > >         * match.pd: Simplify vector shift + bit_and + multiply in some cases.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >         * gcc.target/aarch64/swar_to_vec_cmp.c: New test.
> > >
> > > Signed-off-by: mtsamis <manolis.tsamis@vrull.eu>
> > > ---
> > >  gcc/match.pd                                  | 57 +++++++++++++++
> > >  .../gcc.target/aarch64/swar_to_vec_cmp.c      | 72
> > +++++++++++++++++++
> > >  2 files changed, 129 insertions(+)
> > >  create mode 100644 gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c
> > >
> > > diff --git a/gcc/match.pd b/gcc/match.pd index
> > > 8bbc0dbd5cd..5c768a94846 100644
> > > --- a/gcc/match.pd
> > > +++ b/gcc/match.pd
> > > @@ -301,6 +301,63 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> > >      (view_convert (bit_and:itype (view_convert @0)
> > >                                  (ne @1 { build_zero_cst (type);
> > > })))))))
> > >
> > > +/* In SWAR (SIMD in a register) code a comparison of packed data can
> > > +   be consturcted with a particular combination of shift, bitwise and,
> > > +   and multiplication by constants.  If that code is vectorized we can
> > > +   convert this pattern into a more efficient vector comparison.  */
> > > +(simplify  (mult (bit_and (rshift @0 @1) @2) @3)
> >
> > You should restrict the pattern a bit more, below you use
> > uniform_integer_cst_p and also require a vector type thus
> >
> >   (simplify
> >    (mult (bit_and (rshift @0 VECTOR_CST@1) VECTOR_CST@2)
> > VECTOR_CST@3)
> >
> >
> > > + (with {
> > > +   tree op_type = TREE_TYPE (@0);
> >
> > that's the same as 'type' which is already available.
> >
> > > +   tree rshift_cst = NULL_TREE;
> > > +   tree bit_and_cst = NULL_TREE;
> > > +   tree mult_cst = NULL_TREE;
> > > +  }
> > > +  /* Make sure we're working with vectors and uniform vector
> > > + constants.  */  (if (VECTOR_TYPE_P (op_type)
> > > +       && (rshift_cst = uniform_integer_cst_p (@1))
> > > +       && (bit_and_cst = uniform_integer_cst_p (@2))
> > > +       && (mult_cst = uniform_integer_cst_p (@3)))
> > > +   /* Compute what constants would be needed for this to represent a
> > packed
> > > +      comparison based on the shift amount denoted by RSHIFT_CST.  */
> > > +   (with {
> > > +     HOST_WIDE_INT vec_elem_bits = vector_element_bits (op_type);
> > > +     HOST_WIDE_INT vec_nelts = TYPE_VECTOR_SUBPARTS
> > > + (op_type).to_constant ();
> >
> > you need to check that this isn't a VLA vector operation.
>
> Seems like this pattern should be applicable to VLA as well no?
> So could we not keep vec_nelts as a poly and just use exact_div
> Below in the division? The pattern is only valid if cmp_bits_i is a
> multiple of vec_elem_bits anyway.  The build_vector_* should then
> do the right thing.

Seems like we never agreed on what should go into the next version.
Am I right in assuming that applicability to VLA is ok and that we
should primarily focus on addressing the below comments for v2?

Cheers,
Philipp.

> >
> > > +     HOST_WIDE_INT vec_bits = vec_elem_bits * vec_nelts;
> > > +
> > > +     unsigned HOST_WIDE_INT cmp_bits_i, bit_and_i, mult_i;
> > > +     unsigned HOST_WIDE_INT target_mult_i, target_bit_and_i;
> > > +     cmp_bits_i = tree_to_uhwi (rshift_cst) + 1;
> >
> > and that the rshift_cst and others actually fit an uhwi.
> >
> > > +     target_mult_i = (HOST_WIDE_INT_1U << cmp_bits_i) - 1;
> > > +
> > > +     mult_i = tree_to_uhwi (mult_cst);
> > > +     bit_and_i = tree_to_uhwi (bit_and_cst);
> > > +     target_bit_and_i = 0;
> > > +
> > > +     for (unsigned i = 0; i < vec_elem_bits / cmp_bits_i; i++)
> > > +       target_bit_and_i = (target_bit_and_i << cmp_bits_i) | 1U;
> >
> > it would be nice to have a comment on what this actually does ...
> >
> > > +    }
> > > +    (if ((exact_log2 (cmp_bits_i)) >= 0
> > > +        && cmp_bits_i < HOST_BITS_PER_WIDE_INT
> > > +        && vec_elem_bits <= HOST_BITS_PER_WIDE_INT
> > > +        && tree_fits_uhwi_p (rshift_cst)
> > > +        && tree_fits_uhwi_p (mult_cst)
> > > +        && tree_fits_uhwi_p (bit_and_cst)
> > > +        && target_mult_i == mult_i
> > > +        && target_bit_and_i == bit_and_i)
> > > +     /* Compute the vector shape for the comparison and check if the
> > target is
> > > +       able to expand the comparison with that type.  */
> > > +     (with {
> > > +       tree bool_type = build_nonstandard_boolean_type (cmp_bits_i);
> > > +       int vector_type_nelts = vec_bits / cmp_bits_i;
> > > +       tree vector_type = build_vector_type (bool_type,
> > > + vector_type_nelts);
> >
> > why do you build a bool vector type here and then ...
> >
> > > +       tree zeros = build_zero_cst (vector_type);
> > > +       tree mask_type = truth_type_for (vector_type);
> >
> > ... its truth type?  Note both might not be actually supported by the target
> > and thus receive BLKmode or an integer mode.  The latter is a problem for
> > expand_vec_cmp_expr_p as that might pick up a pattern not suitable
> > here.   Also note that truth_type_for can result in a mask mode, aka
> > QImode with AVX512 or some VnBImode on other archs - those are not OK
> > to be simply view_converted back to op_type.  In general a vector compare
> > operation yields a mask and you can convert that to a -1/0 value using a
> > vec_cond_expr.  I think we have a pattern that can then properly simplify the
> > case where this can be expressed as a view_convert, but of course you then
> > also need to check for vec_cond_expr support.
> >
> > I would suggest you make 'vector_type' an integer element type (that also
> > properly specifies the sign of the comparison!) and check you end up with a
> > vector mode and the mode of the mask_type agrees with that if you don't
> > want to go the vec_cond_expr route.
> >
> >
> > > +      }
> > > +      (if (expand_vec_cmp_expr_p (vector_type, mask_type, LT_EXPR))
> > > +       (view_convert:op_type (lt:mask_type (view_convert:vector_type
> > @0)
> > > +                                          { zeros; })))))))))
> > > +
> > >  (for cmp (gt ge lt le)
> > >       outp (convert convert negate negate)
> > >       outn (negate negate convert convert) diff --git
> > > a/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c
> > > b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c
> > > new file mode 100644
> > > index 00000000000..26f9ad9ef28
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c
> > > @@ -0,0 +1,72 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -ftree-vectorize" } */
> > > +
> > > +typedef unsigned char uint8_t;
> > > +typedef unsigned short uint16_t;
> > > +typedef unsigned int uint32_t;
> > > +
> > > +/* 8-bit SWAR tests.  */
> > > +
> > > +static uint8_t packed_cmp_8_8(uint8_t a) {
> > > +  return ((a >> 7) & 0x1U) * 0xffU;
> > > +}
> > > +
> > > +/* 16-bit SWAR tests.  */
> > > +
> > > +static uint16_t packed_cmp_8_16(uint16_t a) {
> > > +  return ((a >> 7) & 0x101U) * 0xffU; }
> > > +
> > > +static uint16_t packed_cmp_16_16(uint16_t a) {
> > > +  return ((a >> 15) & 0x1U) * 0xffffU; }
> > > +
> > > +/* 32-bit SWAR tests.  */
> > > +
> > > +static uint32_t packed_cmp_8_32(uint32_t a) {
> > > +  return ((a >> 7) & 0x1010101U) * 0xffU; }
> > > +
> > > +static uint32_t packed_cmp_16_32(uint32_t a) {
> > > +  return ((a >> 15) & 0x10001U) * 0xffffU; }
> > > +
> > > +static uint32_t packed_cmp_32_32(uint32_t a) {
> > > +  return ((a >> 31) & 0x1U) * 0xffffffffU; }
> > > +
> > > +/* Driver function to test the vectorized code generated for the different
> > > +   packed_cmp variants.  */
> > > +
> > > +#define VECTORIZED_PACKED_CMP(T, FUNC)         \
> > > +  void vectorized_cmp_##FUNC(T* a, int n)      \
> > > +  {                                            \
> > > +    n = (n / 32) * 32;                         \
> > > +    for(int i = 0; i < n; i += 4)              \
> > > +    {                                          \
> > > +      a[i + 0] = FUNC(a[i + 0]);               \
> > > +      a[i + 1] = FUNC(a[i + 1]);               \
> > > +      a[i + 2] = FUNC(a[i + 2]);               \
> > > +      a[i + 3] = FUNC(a[i + 3]);               \
> > > +    }                                          \
> > > +  }
> > > +
> > > +VECTORIZED_PACKED_CMP(uint8_t, packed_cmp_8_8);
> > > +
> > > +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_8_16);
> > > +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_16_16);
> > > +
> > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_8_32);
> > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_16_32);
> > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_32_32);
> > > +
> > > +/* { dg-final { scan-assembler {\tcmlt\t} } } */
> > > +/* { dg-final { scan-assembler-not {\tushr\t} } } */
> > > +/* { dg-final { scan-assembler-not {\tshl\t} } } */
> > > +/* { dg-final { scan-assembler-not {\tmul\t} } } */
> > > --
> > > 2.34.1
> > >
  
Tamar Christina Nov. 22, 2022, 10:57 a.m. UTC | #6
Hi,

> -----Original Message-----
> From: Philipp Tomsich <philipp.tomsich@vrull.eu>
> Sent: Tuesday, November 22, 2022 10:35 AM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: Richard Biener <richard.guenther@gmail.com>; mtsamis
> <manolis.tsamis@vrull.eu>; GCC Patches <gcc-patches@gcc.gnu.org>;
> jiangning.liu@amperecomputing.com
> Subject: Re: [PATCH] Add pattern to convert vector shift + bitwise and +
> multiply to vector compare in some cases.
> 
> Richard & Tamar,
> 
> On Fri, 26 Aug 2022 at 15:29, Tamar Christina <Tamar.Christina@arm.com>
> wrote:
> >
> > > -----Original Message-----
> > > From: Gcc-patches <gcc-patches-
> > > bounces+tamar.christina=arm.com@gcc.gnu.org> On Behalf Of Richard
> > > Biener via Gcc-patches
> > > Sent: Friday, August 26, 2022 10:08 AM
> > > To: mtsamis <manolis.tsamis@vrull.eu>
> > > Cc: GCC Patches <gcc-patches@gcc.gnu.org>;
> > > jiangning.liu@amperecomputing.com; Philipp Tomsich
> > > <philipp.tomsich@vrull.eu>
> > > Subject: Re: [PATCH] Add pattern to convert vector shift + bitwise
> > > and + multiply to vector compare in some cases.
> > >
> > > On Sat, Aug 13, 2022 at 11:59 AM mtsamis <manolis.tsamis@vrull.eu>
> wrote:
> > > >
> > > > When using SWAR (SIMD in a register) techniques a comparison
> > > > operation within such a register can be made by using a
> > > > combination of shifts, bitwise and and multiplication. If code
> > > > using this scheme is vectorized then there is potential to replace
> > > > all these operations with a single vector comparison, by
> > > > reinterpreting the vector types to
> > > match the width of the SWAR register.
> > > >
> > > > For example, for the test function packed_cmp_16_32, the original
> > > generated code is:
> > > >
> > > >         ldr     q0, [x0]
> > > >         add     w1, w1, 1
> > > >         ushr    v0.4s, v0.4s, 15
> > > >         and     v0.16b, v0.16b, v2.16b
> > > >         shl     v1.4s, v0.4s, 16
> > > >         sub     v0.4s, v1.4s, v0.4s
> > > >         str     q0, [x0], 16
> > > >         cmp     w2, w1
> > > >         bhi     .L20
> > > >
> > > > with this pattern the above can be optimized to:
> > > >
> > > >         ldr     q0, [x0]
> > > >         add     w1, w1, 1
> > > >         cmlt    v0.8h, v0.8h, #0
> > > >         str     q0, [x0], 16
> > > >         cmp     w2, w1
> > > >         bhi     .L20
> > > >
> > > > The effect is similar for x86-64.
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > >         * match.pd: Simplify vector shift + bit_and + multiply in some cases.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >
> > > >         * gcc.target/aarch64/swar_to_vec_cmp.c: New test.
> > > >
> > > > Signed-off-by: mtsamis <manolis.tsamis@vrull.eu>
> > > > ---
> > > >  gcc/match.pd                                  | 57 +++++++++++++++
> > > >  .../gcc.target/aarch64/swar_to_vec_cmp.c      | 72
> > > +++++++++++++++++++
> > > >  2 files changed, 129 insertions(+)  create mode 100644
> > > > gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c
> > > >
> > > > diff --git a/gcc/match.pd b/gcc/match.pd index
> > > > 8bbc0dbd5cd..5c768a94846 100644
> > > > --- a/gcc/match.pd
> > > > +++ b/gcc/match.pd
> > > > @@ -301,6 +301,63 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> > > >      (view_convert (bit_and:itype (view_convert @0)
> > > >                                  (ne @1 { build_zero_cst (type);
> > > > })))))))
> > > >
> > > > +/* In SWAR (SIMD in a register) code a comparison of packed data can
> > > > +   be consturcted with a particular combination of shift, bitwise and,
> > > > +   and multiplication by constants.  If that code is vectorized we can
> > > > +   convert this pattern into a more efficient vector comparison.
> > > > +*/ (simplify  (mult (bit_and (rshift @0 @1) @2) @3)
> > >
> > > You should restrict the pattern a bit more, below you use
> > > uniform_integer_cst_p and also require a vector type thus
> > >
> > >   (simplify
> > >    (mult (bit_and (rshift @0 VECTOR_CST@1) VECTOR_CST@2)
> > > VECTOR_CST@3)
> > >
> > >
> > > > + (with {
> > > > +   tree op_type = TREE_TYPE (@0);
> > >
> > > that's the same as 'type' which is already available.
> > >
> > > > +   tree rshift_cst = NULL_TREE;
> > > > +   tree bit_and_cst = NULL_TREE;
> > > > +   tree mult_cst = NULL_TREE;
> > > > +  }
> > > > +  /* Make sure we're working with vectors and uniform vector
> > > > + constants.  */  (if (VECTOR_TYPE_P (op_type)
> > > > +       && (rshift_cst = uniform_integer_cst_p (@1))
> > > > +       && (bit_and_cst = uniform_integer_cst_p (@2))
> > > > +       && (mult_cst = uniform_integer_cst_p (@3)))
> > > > +   /* Compute what constants would be needed for this to
> > > > + represent a
> > > packed
> > > > +      comparison based on the shift amount denoted by RSHIFT_CST.  */
> > > > +   (with {
> > > > +     HOST_WIDE_INT vec_elem_bits = vector_element_bits (op_type);
> > > > +     HOST_WIDE_INT vec_nelts = TYPE_VECTOR_SUBPARTS
> > > > + (op_type).to_constant ();
> > >
> > > you need to check that this isn't a VLA vector operation.
> >
> > Seems like this pattern should be applicable to VLA as well no?
> > So could we not keep vec_nelts as a poly and just use exact_div Below
> > in the division? The pattern is only valid if cmp_bits_i is a multiple
> > of vec_elem_bits anyway.  The build_vector_* should then do the right
> > thing.
> 
> Seems like we never agreed on what should go into the next version.
> Am I right in assuming that applicability to VLA is ok and that we should
> primarily focus on addressing the below comments for v2?

I don't think there was a disagreement here per se.  Richard's point was that in the
current form the patch is unsafe for VLA, as to_constant () would fail.  That is,
if the idea was not to support VLA then this needs a check, or to use the checking
version of to_constant.

I was saying that we don’t need to make this ignore VLA, that the pattern should
be applicable for VLA as well.  That is to say, properly handling VLA would address
Richard's concerns as well.

Cheers,
Tamar

> 
> Cheers,
> Philipp.
> 
> > >
> > > > +     HOST_WIDE_INT vec_bits = vec_elem_bits * vec_nelts;
> > > > +
> > > > +     unsigned HOST_WIDE_INT cmp_bits_i, bit_and_i, mult_i;
> > > > +     unsigned HOST_WIDE_INT target_mult_i, target_bit_and_i;
> > > > +     cmp_bits_i = tree_to_uhwi (rshift_cst) + 1;
> > >
> > > and that the rshift_cst and others actually fit an uhwi.
> > >
> > > > +     target_mult_i = (HOST_WIDE_INT_1U << cmp_bits_i) - 1;
> > > > +
> > > > +     mult_i = tree_to_uhwi (mult_cst);
> > > > +     bit_and_i = tree_to_uhwi (bit_and_cst);
> > > > +     target_bit_and_i = 0;
> > > > +
> > > > +     for (unsigned i = 0; i < vec_elem_bits / cmp_bits_i; i++)
> > > > +       target_bit_and_i = (target_bit_and_i << cmp_bits_i) | 1U;
> > >
> > > it would be nice to have a comment on what this actually does ...
> > >
> > > > +    }
> > > > +    (if ((exact_log2 (cmp_bits_i)) >= 0
> > > > +        && cmp_bits_i < HOST_BITS_PER_WIDE_INT
> > > > +        && vec_elem_bits <= HOST_BITS_PER_WIDE_INT
> > > > +        && tree_fits_uhwi_p (rshift_cst)
> > > > +        && tree_fits_uhwi_p (mult_cst)
> > > > +        && tree_fits_uhwi_p (bit_and_cst)
> > > > +        && target_mult_i == mult_i
> > > > +        && target_bit_and_i == bit_and_i)
> > > > +     /* Compute the vector shape for the comparison and check if
> > > > + the
> > > target is
> > > > +       able to expand the comparison with that type.  */
> > > > +     (with {
> > > > +       tree bool_type = build_nonstandard_boolean_type (cmp_bits_i);
> > > > +       int vector_type_nelts = vec_bits / cmp_bits_i;
> > > > +       tree vector_type = build_vector_type (bool_type,
> > > > + vector_type_nelts);
> > >
> > > why do you build a bool vector type here and then ...
> > >
> > > > +       tree zeros = build_zero_cst (vector_type);
> > > > +       tree mask_type = truth_type_for (vector_type);
> > >
> > > ... its truth type?  Note both might not be actually supported by
> > > the target and thus receive BLKmode or an integer mode.  The latter
> > > is a problem for expand_vec_cmp_expr_p as that might pick up a pattern
> not suitable
> > > here.   Also note that truth_type_for can result in a mask mode, aka
> > > QImode with AVX512 or some VnBImode on other archs - those are not
> > > OK to be simply view_converted back to op_type.  In general a vector
> > > compare operation yields a mask and you can convert that to a -1/0
> > > value using a vec_cond_expr.  I think we have a pattern that can
> > > then properly simplify the case where this can be expressed as a
> > > view_convert, but of course you then also need to check for
> vec_cond_expr support.
> > >
> > > I would suggest you make 'vector_type' an integer element type (that
> > > also properly specifies the sign of the comparison!) and check you
> > > end up with a vector mode and the mode of the mask_type agrees with
> > > that if you don't want to go the vec_cond_expr route.
> > >
> > >
> > > > +      }
> > > > +      (if (expand_vec_cmp_expr_p (vector_type, mask_type, LT_EXPR))
> > > > +       (view_convert:op_type (lt:mask_type
> > > > + (view_convert:vector_type
> > > @0)
> > > > +                                          { zeros; })))))))))
> > > > +
> > > >  (for cmp (gt ge lt le)
> > > >       outp (convert convert negate negate)
> > > >       outn (negate negate convert convert) diff --git
> > > > a/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c
> > > > b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c
> > > > new file mode 100644
> > > > index 00000000000..26f9ad9ef28
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c
> > > > @@ -0,0 +1,72 @@
> > > > +/* { dg-do compile } */
> > > > +/* { dg-options "-O2 -ftree-vectorize" } */
> > > > +
> > > > +typedef unsigned char uint8_t;
> > > > +typedef unsigned short uint16_t;
> > > > +typedef unsigned int uint32_t;
> > > > +
> > > > +/* 8-bit SWAR tests.  */
> > > > +
> > > > +static uint8_t packed_cmp_8_8(uint8_t a) {
> > > > +  return ((a >> 7) & 0x1U) * 0xffU; }
> > > > +
> > > > +/* 16-bit SWAR tests.  */
> > > > +
> > > > +static uint16_t packed_cmp_8_16(uint16_t a) {
> > > > +  return ((a >> 7) & 0x101U) * 0xffU; }
> > > > +
> > > > +static uint16_t packed_cmp_16_16(uint16_t a) {
> > > > +  return ((a >> 15) & 0x1U) * 0xffffU; }
> > > > +
> > > > +/* 32-bit SWAR tests.  */
> > > > +
> > > > +static uint32_t packed_cmp_8_32(uint32_t a) {
> > > > +  return ((a >> 7) & 0x1010101U) * 0xffU; }
> > > > +
> > > > +static uint32_t packed_cmp_16_32(uint32_t a) {
> > > > +  return ((a >> 15) & 0x10001U) * 0xffffU; }
> > > > +
> > > > +static uint32_t packed_cmp_32_32(uint32_t a) {
> > > > +  return ((a >> 31) & 0x1U) * 0xffffffffU; }
> > > > +
> > > > +/* Driver function to test the vectorized code generated for the
> different
> > > > +   packed_cmp variants.  */
> > > > +
> > > > +#define VECTORIZED_PACKED_CMP(T, FUNC)         \
> > > > +  void vectorized_cmp_##FUNC(T* a, int n)      \
> > > > +  {                                            \
> > > > +    n = (n / 32) * 32;                         \
> > > > +    for(int i = 0; i < n; i += 4)              \
> > > > +    {                                          \
> > > > +      a[i + 0] = FUNC(a[i + 0]);               \
> > > > +      a[i + 1] = FUNC(a[i + 1]);               \
> > > > +      a[i + 2] = FUNC(a[i + 2]);               \
> > > > +      a[i + 3] = FUNC(a[i + 3]);               \
> > > > +    }                                          \
> > > > +  }
> > > > +
> > > > +VECTORIZED_PACKED_CMP(uint8_t, packed_cmp_8_8);
> > > > +
> > > > +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_8_16);
> > > > +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_16_16);
> > > > +
> > > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_8_32);
> > > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_16_32);
> > > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_32_32);
> > > > +
> > > > +/* { dg-final { scan-assembler {\tcmlt\t} } } */
> > > > +/* { dg-final { scan-assembler-not {\tushr\t} } } */
> > > > +/* { dg-final { scan-assembler-not {\tshl\t} } } */
> > > > +/* { dg-final { scan-assembler-not {\tmul\t} } } */
> > > > --
> > > > 2.34.1
> > > >
  
Manolis Tsamis Nov. 29, 2022, 10:11 a.m. UTC | #7
Hi all,

Based on everyone's comments I have sent a v2 of this patch, which can
be found here:
https://gcc.gnu.org/pipermail/gcc-patches/2022-November/607472.html

As per Richard's comments the pattern now uses vec_cond_expr instead and
includes other fixes as requested.

Also based on Tamar's suggestion I have made it work with poly_int instead of
aborting for VLA vectors.

I would appreciate any further feedback on the new version.

Manolis

On Tue, Nov 22, 2022 at 12:57 PM Tamar Christina
<Tamar.Christina@arm.com> wrote:
>
> Hi,
>
> > -----Original Message-----
> > From: Philipp Tomsich <philipp.tomsich@vrull.eu>
> > Sent: Tuesday, November 22, 2022 10:35 AM
> > To: Tamar Christina <Tamar.Christina@arm.com>
> > Cc: Richard Biener <richard.guenther@gmail.com>; mtsamis
> > <manolis.tsamis@vrull.eu>; GCC Patches <gcc-patches@gcc.gnu.org>;
> > jiangning.liu@amperecomputing.com
> > Subject: Re: [PATCH] Add pattern to convert vector shift + bitwise and +
> > multiply to vector compare in some cases.
> >
> > Richard & Tamar,
> >
> > On Fri, 26 Aug 2022 at 15:29, Tamar Christina <Tamar.Christina@arm.com>
> > wrote:
> > >
> > > > -----Original Message-----
> > > > From: Gcc-patches <gcc-patches-
> > > > bounces+tamar.christina=arm.com@gcc.gnu.org> On Behalf Of Richard
> > > > Biener via Gcc-patches
> > > > Sent: Friday, August 26, 2022 10:08 AM
> > > > To: mtsamis <manolis.tsamis@vrull.eu>
> > > > Cc: GCC Patches <gcc-patches@gcc.gnu.org>;
> > > > jiangning.liu@amperecomputing.com; Philipp Tomsich
> > > > <philipp.tomsich@vrull.eu>
> > > > Subject: Re: [PATCH] Add pattern to convert vector shift + bitwise
> > > > and + multiply to vector compare in some cases.
> > > >
> > > > On Sat, Aug 13, 2022 at 11:59 AM mtsamis <manolis.tsamis@vrull.eu>
> > wrote:
> > > > >
> > > > > When using SWAR (SIMD in a register) techniques a comparison
> > > > > operation within such a register can be made by using a
> > > > > combination of shifts, bitwise and and multiplication. If code
> > > > > using this scheme is vectorized then there is potential to replace
> > > > > all these operations with a single vector comparison, by
> > > > > reinterpreting the vector types to
> > > > match the width of the SWAR register.
> > > > >
> > > > > For example, for the test function packed_cmp_16_32, the original
> > > > generated code is:
> > > > >
> > > > >         ldr     q0, [x0]
> > > > >         add     w1, w1, 1
> > > > >         ushr    v0.4s, v0.4s, 15
> > > > >         and     v0.16b, v0.16b, v2.16b
> > > > >         shl     v1.4s, v0.4s, 16
> > > > >         sub     v0.4s, v1.4s, v0.4s
> > > > >         str     q0, [x0], 16
> > > > >         cmp     w2, w1
> > > > >         bhi     .L20
> > > > >
> > > > > with this pattern the above can be optimized to:
> > > > >
> > > > >         ldr     q0, [x0]
> > > > >         add     w1, w1, 1
> > > > >         cmlt    v0.8h, v0.8h, #0
> > > > >         str     q0, [x0], 16
> > > > >         cmp     w2, w1
> > > > >         bhi     .L20
> > > > >
> > > > > The effect is similar for x86-64.
> > > > >
> > > > > gcc/ChangeLog:
> > > > >
> > > > >         * match.pd: Simplify vector shift + bit_and + multiply in some cases.
> > > > >
> > > > > gcc/testsuite/ChangeLog:
> > > > >
> > > > >         * gcc.target/aarch64/swar_to_vec_cmp.c: New test.
> > > > >
> > > > > Signed-off-by: mtsamis <manolis.tsamis@vrull.eu>
> > > > > ---
> > > > >  gcc/match.pd                                  | 57 +++++++++++++++
> > > > >  .../gcc.target/aarch64/swar_to_vec_cmp.c      | 72
> > > > +++++++++++++++++++
> > > > >  2 files changed, 129 insertions(+)  create mode 100644
> > > > > gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c
> > > > >
> > > > > diff --git a/gcc/match.pd b/gcc/match.pd index
> > > > > 8bbc0dbd5cd..5c768a94846 100644
> > > > > --- a/gcc/match.pd
> > > > > +++ b/gcc/match.pd
> > > > > @@ -301,6 +301,63 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> > > > >      (view_convert (bit_and:itype (view_convert @0)
> > > > >                                  (ne @1 { build_zero_cst (type);
> > > > > })))))))
> > > > >
> > > > > +/* In SWAR (SIMD in a register) code a comparison of packed data can
> > > > > +   be consturcted with a particular combination of shift, bitwise and,
> > > > > +   and multiplication by constants.  If that code is vectorized we can
> > > > > +   convert this pattern into a more efficient vector comparison.
> > > > > +*/ (simplify  (mult (bit_and (rshift @0 @1) @2) @3)
> > > >
> > > > You should restrict the pattern a bit more, below you use
> > > > uniform_integer_cst_p and also require a vector type thus
> > > >
> > > >   (simplify
> > > >    (mult (bit_and (rshift @0 VECTOR_CST@1) VECTOR_CST@2)
> > > > VECTOR_CST@3)
> > > >
> > > >
> > > > > + (with {
> > > > > +   tree op_type = TREE_TYPE (@0);
> > > >
> > > > that's the same as 'type' which is already available.
> > > >
> > > > > +   tree rshift_cst = NULL_TREE;
> > > > > +   tree bit_and_cst = NULL_TREE;
> > > > > +   tree mult_cst = NULL_TREE;
> > > > > +  }
> > > > > +  /* Make sure we're working with vectors and uniform vector
> > > > > + constants.  */  (if (VECTOR_TYPE_P (op_type)
> > > > > +       && (rshift_cst = uniform_integer_cst_p (@1))
> > > > > +       && (bit_and_cst = uniform_integer_cst_p (@2))
> > > > > +       && (mult_cst = uniform_integer_cst_p (@3)))
> > > > > +   /* Compute what constants would be needed for this to
> > > > > + represent a
> > > > packed
> > > > > +      comparison based on the shift amount denoted by RSHIFT_CST.  */
> > > > > +   (with {
> > > > > +     HOST_WIDE_INT vec_elem_bits = vector_element_bits (op_type);
> > > > > +     HOST_WIDE_INT vec_nelts = TYPE_VECTOR_SUBPARTS
> > > > > + (op_type).to_constant ();
> > > >
> > > > you need to check that this isn't a VLA vector operation.
> > >
> > > Seems like this pattern should be applicable to VLA as well no?
> > > So could we not keep vec_nelts as a poly and just use exact_div below
> > > in the division? The pattern is only valid if cmp_bits_i is a multiple
> > > of vec_elem_bits anyway.  The build_vector_* should then do the right
> > > thing.
> >
> > Seems like we never agreed on what should go into the next version.
> > Am I right in assuming that applicability to VLA is ok and that we should
> > primarily focus on addressing the below comments for v2?
>
> I don't think there was a disagreement here per se.  Richard's point was that in the
> current form the patch is unsafe for VLA, as to_constant () would fail.  That is
> if the idea was not to support VLA then this needs a check, or use the checking
> version of to_constant.
>
> I was saying that we don’t need to make this ignore VLA, that the pattern should
> be applicable for VLA as well.  That is to say, properly handling VLA would address
> Richard's concerns as well.
>
> Cheers,
> Tamar
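
A minimal sketch of the poly-int-friendly computation being discussed, assuming the usual poly-int.h / tree.h helpers (illustrative only, not code from a posted revision; `elem_type` stands for whatever element type the final pattern settles on):

    poly_uint64 vec_nelts = TYPE_VECTOR_SUBPARTS (type);
    poly_uint64 vec_bits = vec_elem_bits * vec_nelts;

    /* The transform only applies when the SWAR lane width divides the
       vector width exactly; checking with multiple_p and dividing with
       exact_div keeps this valid for VLA vectors, with no to_constant ().  */
    if (multiple_p (vec_bits, cmp_bits_i))
      {
        poly_uint64 cmp_nelts = exact_div (vec_bits, cmp_bits_i);
        tree cmp_vec_type = build_vector_type (elem_type, cmp_nelts);
        /* ... build the comparison against build_zero_cst (cmp_vec_type) ... */
      }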
>
> >
> > Cheers,
> > Philipp.
> >
> > > >
> > > > > +     HOST_WIDE_INT vec_bits = vec_elem_bits * vec_nelts;
> > > > > +
> > > > > +     unsigned HOST_WIDE_INT cmp_bits_i, bit_and_i, mult_i;
> > > > > +     unsigned HOST_WIDE_INT target_mult_i, target_bit_and_i;
> > > > > +     cmp_bits_i = tree_to_uhwi (rshift_cst) + 1;
> > > >
> > > > and that the rshift_cst and others actually fit an uhwi.
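
Put differently (a sketch of the reordering being asked for, as I read it, rather than code from a later revision), the fits-uhwi checks would have to come before any tree_to_uhwi call:

    if (tree_fits_uhwi_p (rshift_cst)
        && tree_fits_uhwi_p (bit_and_cst)
        && tree_fits_uhwi_p (mult_cst))
      {
        cmp_bits_i = tree_to_uhwi (rshift_cst) + 1;
        bit_and_i = tree_to_uhwi (bit_and_cst);
        mult_i = tree_to_uhwi (mult_cst);
        /* ... compute target_mult_i and target_bit_and_i ... */
      }

rather than converting first inside the (with { ... }) block and only testing tree_fits_uhwi_p in the following (if ...).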
> > > >
> > > > > +     target_mult_i = (HOST_WIDE_INT_1U << cmp_bits_i) - 1;
> > > > > +
> > > > > +     mult_i = tree_to_uhwi (mult_cst);
> > > > > +     bit_and_i = tree_to_uhwi (bit_and_cst);
> > > > > +     target_bit_and_i = 0;
> > > > > +
> > > > > +     for (unsigned i = 0; i < vec_elem_bits / cmp_bits_i; i++)
> > > > > +       target_bit_and_i = (target_bit_and_i << cmp_bits_i) | 1U;
> > > >
> > > > it would be nice to have a comment on what this actually does ...
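
For reference, a worked trace of what that loop computes (my reading, not wording from the thread): with cmp_bits_i == 8 and vec_elem_bits == 32 it runs four times,

    i = 0:  target_bit_and_i = 0x00000001
    i = 1:  target_bit_and_i = 0x00000101
    i = 2:  target_bit_and_i = 0x00010101
    i = 3:  target_bit_and_i = 0x01010101

i.e. it places a 1 in the lowest bit of every 8-bit SWAR lane, which is the bit_and mask the multiply trick needs in order to act as a per-lane comparison (and matches the 0x1010101U constant in packed_cmp_8_32 in the testcase).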
> > > >
> > > > > +    }
> > > > > +    (if ((exact_log2 (cmp_bits_i)) >= 0
> > > > > +        && cmp_bits_i < HOST_BITS_PER_WIDE_INT
> > > > > +        && vec_elem_bits <= HOST_BITS_PER_WIDE_INT
> > > > > +        && tree_fits_uhwi_p (rshift_cst)
> > > > > +        && tree_fits_uhwi_p (mult_cst)
> > > > > +        && tree_fits_uhwi_p (bit_and_cst)
> > > > > +        && target_mult_i == mult_i
> > > > > +        && target_bit_and_i == bit_and_i)
> > > > > +     /* Compute the vector shape for the comparison and check if
> > > > > + the
> > > > target is
> > > > > +       able to expand the comparison with that type.  */
> > > > > +     (with {
> > > > > +       tree bool_type = build_nonstandard_boolean_type (cmp_bits_i);
> > > > > +       int vector_type_nelts = vec_bits / cmp_bits_i;
> > > > > +       tree vector_type = build_vector_type (bool_type,
> > > > > + vector_type_nelts);
> > > >
> > > > why do you build a bool vector type here and then ...
> > > >
> > > > > +       tree zeros = build_zero_cst (vector_type);
> > > > > +       tree mask_type = truth_type_for (vector_type);
> > > >
> > > > ... its truth type?  Note both might not be actually supported by
> > > > the target and thus receive BLKmode or an integer mode.  The latter
> > > > is a problem for expand_vec_cmp_expr_p as that might pick up a pattern
> > not suitable
> > > > here.   Also note that truth_type_for can result in a mask mode, aka
> > > > QImode with AVX512 or some VnBImode on other archs - those are not
> > > > OK to be simply view_converted back to op_type.  In general a vector
> > > > compare operation yields a mask and you can convert that to a -1/0
> > > > value using a vec_cond_expr.  I think we have a pattern that can
> > > > then properly simplify the case where this can be expressed as a
> > > > view_convert, but of course you then also need to check for
> > vec_cond_expr support.
> > > >
> > > > I would suggest you make 'vector_type' an integer element type (that
> > > > also properly specifies the sign of the comparison!) and check you
> > > > end up with a vector mode and the mode of the mask_type agrees with
> > > > that if you don't want to go the vec_cond_expr route.
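
A rough sketch of that suggestion (my interpretation of the review, using tree.h / optabs-tree.h helpers; the names are illustrative):

    /* Use a signed integer element so the signedness of the comparison is
       explicit, and only proceed when the target provides a real vector
       mode for it and the mask type shares that mode.  */
    tree int_elem = build_nonstandard_integer_type (cmp_bits_i, /*unsignedp=*/0);
    tree cmp_vec_type = build_vector_type (int_elem, cmp_nelts);
    tree mask_type = truth_type_for (cmp_vec_type);

    if (VECTOR_MODE_P (TYPE_MODE (cmp_vec_type))
        && TYPE_MODE (mask_type) == TYPE_MODE (cmp_vec_type)
        && expand_vec_cmp_expr_p (cmp_vec_type, mask_type, LT_EXPR))
      /* ... emit the lt against build_zero_cst (cmp_vec_type) ... */

The alternative route mentioned above would instead wrap the comparison mask in a vec_cond_expr selecting between all-ones and zero, at the cost of also checking vec_cond support.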
> > > >
> > > >
> > > > > +      }
> > > > > +      (if (expand_vec_cmp_expr_p (vector_type, mask_type, LT_EXPR))
> > > > > +       (view_convert:op_type (lt:mask_type
> > > > > + (view_convert:vector_type
> > > > @0)
> > > > > +                                          { zeros; })))))))))
> > > > > +
> > > > >  (for cmp (gt ge lt le)
> > > > >       outp (convert convert negate negate)
> > > > >       outn (negate negate convert convert) diff --git
> > > > > a/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c
> > > > > b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c
> > > > > new file mode 100644
> > > > > index 00000000000..26f9ad9ef28
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c
> > > > > @@ -0,0 +1,72 @@
> > > > > +/* { dg-do compile } */
> > > > > +/* { dg-options "-O2 -ftree-vectorize" } */
> > > > > +
> > > > > +typedef unsigned char uint8_t;
> > > > > +typedef unsigned short uint16_t;
> > > > > +typedef unsigned int uint32_t;
> > > > > +
> > > > > +/* 8-bit SWAR tests.  */
> > > > > +
> > > > > +static uint8_t packed_cmp_8_8(uint8_t a) {
> > > > > +  return ((a >> 7) & 0x1U) * 0xffU; }
> > > > > +
> > > > > +/* 16-bit SWAR tests.  */
> > > > > +
> > > > > +static uint16_t packed_cmp_8_16(uint16_t a) {
> > > > > +  return ((a >> 7) & 0x101U) * 0xffU; }
> > > > > +
> > > > > +static uint16_t packed_cmp_16_16(uint16_t a) {
> > > > > +  return ((a >> 15) & 0x1U) * 0xffffU; }
> > > > > +
> > > > > +/* 32-bit SWAR tests.  */
> > > > > +
> > > > > +static uint32_t packed_cmp_8_32(uint32_t a) {
> > > > > +  return ((a >> 7) & 0x1010101U) * 0xffU; }
> > > > > +
> > > > > +static uint32_t packed_cmp_16_32(uint32_t a) {
> > > > > +  return ((a >> 15) & 0x10001U) * 0xffffU; }
> > > > > +
> > > > > +static uint32_t packed_cmp_32_32(uint32_t a) {
> > > > > +  return ((a >> 31) & 0x1U) * 0xffffffffU; }
> > > > > +
> > > > > +/* Driver function to test the vectorized code generated for the
> > different
> > > > > +   packed_cmp variants.  */
> > > > > +
> > > > > +#define VECTORIZED_PACKED_CMP(T, FUNC)         \
> > > > > +  void vectorized_cmp_##FUNC(T* a, int n)      \
> > > > > +  {                                            \
> > > > > +    n = (n / 32) * 32;                         \
> > > > > +    for(int i = 0; i < n; i += 4)              \
> > > > > +    {                                          \
> > > > > +      a[i + 0] = FUNC(a[i + 0]);               \
> > > > > +      a[i + 1] = FUNC(a[i + 1]);               \
> > > > > +      a[i + 2] = FUNC(a[i + 2]);               \
> > > > > +      a[i + 3] = FUNC(a[i + 3]);               \
> > > > > +    }                                          \
> > > > > +  }
> > > > > +
> > > > > +VECTORIZED_PACKED_CMP(uint8_t, packed_cmp_8_8);
> > > > > +
> > > > > +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_8_16);
> > > > > +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_16_16);
> > > > > +
> > > > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_8_32);
> > > > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_16_32);
> > > > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_32_32);
> > > > > +
> > > > > +/* { dg-final { scan-assembler {\tcmlt\t} } } */
> > > > > +/* { dg-final { scan-assembler-not {\tushr\t} } } */
> > > > > +/* { dg-final { scan-assembler-not {\tshl\t} } } */
> > > > > +/* { dg-final { scan-assembler-not {\tmul\t} } } */
> > > > > --
> > > > > 2.34.1
> > > > >
  

Patch

diff --git a/gcc/match.pd b/gcc/match.pd
index 8bbc0dbd5cd..5c768a94846 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -301,6 +301,63 @@  DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
     (view_convert (bit_and:itype (view_convert @0)
 				 (ne @1 { build_zero_cst (type); })))))))
 
+/* In SWAR (SIMD in a register) code a comparison of packed data can
+   be constructed with a particular combination of shift, bitwise and,
+   and multiplication by constants.  If that code is vectorized we can
+   convert this pattern into a more efficient vector comparison.  */
+(simplify
+ (mult (bit_and (rshift @0 @1) @2) @3)
+ (with {
+   tree op_type = TREE_TYPE (@0);
+   tree rshift_cst = NULL_TREE;
+   tree bit_and_cst = NULL_TREE;
+   tree mult_cst = NULL_TREE;
+  }
+  /* Make sure we're working with vectors and uniform vector constants.  */
+  (if (VECTOR_TYPE_P (op_type)
+       && (rshift_cst = uniform_integer_cst_p (@1))
+       && (bit_and_cst = uniform_integer_cst_p (@2))
+       && (mult_cst = uniform_integer_cst_p (@3)))
+   /* Compute what constants would be needed for this to represent a packed
+      comparison based on the shift amount denoted by RSHIFT_CST.  */
+   (with {
+     HOST_WIDE_INT vec_elem_bits = vector_element_bits (op_type);
+     HOST_WIDE_INT vec_nelts = TYPE_VECTOR_SUBPARTS (op_type).to_constant ();
+     HOST_WIDE_INT vec_bits = vec_elem_bits * vec_nelts;
+
+     unsigned HOST_WIDE_INT cmp_bits_i, bit_and_i, mult_i;
+     unsigned HOST_WIDE_INT target_mult_i, target_bit_and_i;
+     cmp_bits_i = tree_to_uhwi (rshift_cst) + 1;
+     target_mult_i = (HOST_WIDE_INT_1U << cmp_bits_i) - 1;
+
+     mult_i = tree_to_uhwi (mult_cst);
+     bit_and_i = tree_to_uhwi (bit_and_cst);
+     target_bit_and_i = 0;
+
+     for (unsigned i = 0; i < vec_elem_bits / cmp_bits_i; i++)
+       target_bit_and_i = (target_bit_and_i << cmp_bits_i) | 1U;
+    }
+    (if ((exact_log2 (cmp_bits_i)) >= 0
+	 && cmp_bits_i < HOST_BITS_PER_WIDE_INT
+	 && vec_elem_bits <= HOST_BITS_PER_WIDE_INT
+	 && tree_fits_uhwi_p (rshift_cst)
+	 && tree_fits_uhwi_p (mult_cst)
+	 && tree_fits_uhwi_p (bit_and_cst)
+	 && target_mult_i == mult_i
+	 && target_bit_and_i == bit_and_i)
+     /* Compute the vector shape for the comparison and check if the target is
+	able to expand the comparison with that type.  */
+     (with {
+       tree bool_type = build_nonstandard_boolean_type (cmp_bits_i);
+       int vector_type_nelts = vec_bits / cmp_bits_i;
+       tree vector_type = build_vector_type (bool_type, vector_type_nelts);
+       tree zeros = build_zero_cst (vector_type);
+       tree mask_type = truth_type_for (vector_type);
+      }
+      (if (expand_vec_cmp_expr_p (vector_type, mask_type, LT_EXPR))
+       (view_convert:op_type (lt:mask_type (view_convert:vector_type @0)
+					   { zeros; })))))))))
+
 (for cmp (gt ge lt le)
      outp (convert convert negate negate)
      outn (negate negate convert convert)
diff --git a/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c
new file mode 100644
index 00000000000..26f9ad9ef28
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c
@@ -0,0 +1,72 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize" } */
+
+typedef unsigned char uint8_t;
+typedef unsigned short uint16_t;
+typedef unsigned int uint32_t;
+
+/* 8-bit SWAR tests.  */
+
+static uint8_t packed_cmp_8_8(uint8_t a)
+{
+  return ((a >> 7) & 0x1U) * 0xffU;
+}
+
+/* 16-bit SWAR tests.  */
+
+static uint16_t packed_cmp_8_16(uint16_t a)
+{
+  return ((a >> 7) & 0x101U) * 0xffU;
+}
+
+static uint16_t packed_cmp_16_16(uint16_t a)
+{
+  return ((a >> 15) & 0x1U) * 0xffffU;
+}
+
+/* 32-bit SWAR tests.  */
+
+static uint32_t packed_cmp_8_32(uint32_t a)
+{
+  return ((a >> 7) & 0x1010101U) * 0xffU;
+}
+
+static uint32_t packed_cmp_16_32(uint32_t a)
+{
+  return ((a >> 15) & 0x10001U) * 0xffffU;
+}
+
+static uint32_t packed_cmp_32_32(uint32_t a)
+{
+  return ((a >> 31) & 0x1U) * 0xffffffffU;
+}
+
+/* Driver function to test the vectorized code generated for the different
+   packed_cmp variants.  */
+
+#define VECTORIZED_PACKED_CMP(T, FUNC)  	\
+  void vectorized_cmp_##FUNC(T* a, int n)	\
+  {						\
+    n = (n / 32) * 32;				\
+    for(int i = 0; i < n; i += 4)		\
+    {						\
+      a[i + 0] = FUNC(a[i + 0]);		\
+      a[i + 1] = FUNC(a[i + 1]);		\
+      a[i + 2] = FUNC(a[i + 2]);		\
+      a[i + 3] = FUNC(a[i + 3]);		\
+    }						\
+  }
+
+VECTORIZED_PACKED_CMP(uint8_t, packed_cmp_8_8);
+
+VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_8_16);
+VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_16_16);
+
+VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_8_32);
+VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_16_32);
+VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_32_32);
+
+/* { dg-final { scan-assembler {\tcmlt\t} } } */
+/* { dg-final { scan-assembler-not {\tushr\t} } } */
+/* { dg-final { scan-assembler-not {\tshl\t} } } */
+/* { dg-final { scan-assembler-not {\tmul\t} } } */
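
For anyone who wants to eyeball the transformation locally, a quick manual check (assuming an aarch64 cross compiler is installed; the flags simply mirror the dg-options above) could be:

    aarch64-linux-gnu-gcc -O2 -ftree-vectorize -S -o - swar_to_vec_cmp.c \
      | grep -E 'cmlt|ushr|shl|mul'

With the pattern applied the grep should show cmlt and none of ushr/shl/mul for the vectorized loops, matching the scan-assembler directives above.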