Add pattern to convert vector shift + bitwise and + multiply to vector compare in some cases.
Message ID | 20220813095843.1452308-1-manolis.tsamis@vrull.eu |
---|---|
State | New, archived |
Headers | From: mtsamis <manolis.tsamis@vrull.eu> To: gcc-patches@gcc.gnu.org Cc: philipp.tomsich@vrull.eu, jiangning.liu@amperecomputing.com Date: Sat, 13 Aug 2022 11:58:43 +0200 Subject: [PATCH] Add pattern to convert vector shift + bitwise and + multiply to vector compare in some cases. Message-Id: <20220813095843.1452308-1-manolis.tsamis@vrull.eu> |
Series | Add pattern to convert vector shift + bitwise and + multiply to vector compare in some cases. |
Commit Message
Manolis Tsamis
Aug. 13, 2022, 9:58 a.m. UTC
When using SWAR (SIMD in a register) techniques, a comparison operation within
such a register can be made with a combination of shifts, bitwise AND, and
multiplication. If code using this scheme is vectorized, there is potential to
replace all these operations with a single vector comparison by reinterpreting
the vector types to match the width of the SWAR register.
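As a concrete illustration, the scalar SWAR idiom the pattern targets looks like the sketch below. The packed_cmp_8_32 function is taken verbatim from the patch's testcase; the surrounding swar_cmp_loop is only an assumed example of code the vectorizer would turn into the vector shift + AND + multiply sequence that the new rule rewrites.

```c
#include <stdint.h>

/* Four 8-bit lanes packed in a uint32_t: each lane becomes 0xff if its
   sign bit is set and 0 otherwise, using only shift, bitwise AND and
   multiply (packed_cmp_8_32 from the patch's testcase).  */
static uint32_t packed_cmp_8_32 (uint32_t a)
{
  return ((a >> 7) & 0x01010101U) * 0xffU;
}

/* A loop of this shape is what the vectorizer turns into vector
   shift/and/multiply; the proposed match.pd rule then folds that
   sequence into a single signed "< 0" vector comparison.  */
void swar_cmp_loop (uint32_t *a, int n)
{
  for (int i = 0; i < n; i++)
    a[i] = packed_cmp_8_32 (a[i]);
}
```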
For example, for the test function packed_cmp_16_32, the original generated code is:
ldr q0, [x0]
add w1, w1, 1
ushr v0.4s, v0.4s, 15
and v0.16b, v0.16b, v2.16b
shl v1.4s, v0.4s, 16
sub v0.4s, v1.4s, v0.4s
str q0, [x0], 16
cmp w2, w1
bhi .L20
With this pattern, the above can be optimized to:
ldr q0, [x0]
add w1, w1, 1
cmlt v0.8h, v0.8h, #0
str q0, [x0], 16
cmp w2, w1
bhi .L20
The effect is similar for x86-64.
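To see why the rewrite is valid for the packed_cmp_16_32 case shown above, the following standalone check (an illustration, not part of the patch) compares the SWAR expression against an explicit per-lane signed "< 0" comparison. Because each partial product of the multiply fits within one 16-bit lane, there is no carry between lanes and the two forms agree for every input.

```c
#include <stdint.h>
#include <stdio.h>

/* SWAR form, as in the patch's packed_cmp_16_32 test function.  */
static uint32_t packed_cmp_16_32 (uint32_t a)
{
  return ((a >> 15) & 0x10001U) * 0xffffU;
}

/* Lane-wise equivalent: each 16-bit lane becomes 0xffff if its sign bit
   is set, i.e. a per-lane signed "< 0" comparison.  */
static uint32_t lanewise_cmp_16_32 (uint32_t a)
{
  uint32_t lo = (int16_t) (a & 0xffff) < 0 ? 0xffffU : 0;
  uint32_t hi = (int16_t) (a >> 16) < 0 ? 0xffffU : 0;
  return (hi << 16) | lo;
}

int main (void)
{
  /* Spot-check a few values; an exhaustive loop over all uint32_t
     inputs also passes.  */
  uint32_t tests[] = { 0, 0x8000, 0x80000000U, 0x80008000U,
                       0x7fff7fffU, 0x12348765U };
  for (unsigned i = 0; i < sizeof tests / sizeof tests[0]; i++)
    if (packed_cmp_16_32 (tests[i]) != lanewise_cmp_16_32 (tests[i]))
      {
        printf ("mismatch at %#x\n", tests[i]);
        return 1;
      }
  printf ("SWAR form matches the per-lane signed compare\n");
  return 0;
}
```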
gcc/ChangeLog:
* match.pd: Simplify vector shift + bit_and + multiply in some cases.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/swar_to_vec_cmp.c: New test.
Signed-off-by: mtsamis <manolis.tsamis@vrull.eu>
---
gcc/match.pd | 57 +++++++++++++++
.../gcc.target/aarch64/swar_to_vec_cmp.c | 72 +++++++++++++++++++
2 files changed, 129 insertions(+)
create mode 100644 gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c
Comments
On Sat, 13 Aug 2022, mtsamis wrote: > When using SWAR (SIMD in a register) techniques a comparison operation within *within > such a register can be made by using a combination of shifts, bitwise and and > multiplication. If code using this scheme is vectorized then there is potential > to replace all these operations with a single vector comparison, by reinterpreting > the vector types to match the width of the SWAR register. Hadn't it been for "all" modern processors these days having SIMD instructions, adding general SWAR handling to the vectorizer would be worth promoting to someones favorite decision-maker. (This is a wonderful first step.) Random typo spotted: > +++ b/gcc/match.pd > @@ -301,6 +301,63 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) > (view_convert (bit_and:itype (view_convert @0) > (ne @1 { build_zero_cst (type); }))))))) > > +/* In SWAR (SIMD in a register) code a comparison of packed data can > + be consturcted with a particular combination of shift, bitwise and, *constructed brgds, H-P
On Sat, Aug 13, 2022 at 2:59 AM mtsamis <manolis.tsamis@vrull.eu> wrote: > > When using SWAR (SIMD in a register) techniques a comparison operation within > such a register can be made by using a combination of shifts, bitwise and and > multiplication. If code using this scheme is vectorized then there is potential > to replace all these operations with a single vector comparison, by reinterpreting > the vector types to match the width of the SWAR register. > > For example, for the test function packed_cmp_16_32, the original generated code is: > > ldr q0, [x0] > add w1, w1, 1 > ushr v0.4s, v0.4s, 15 > and v0.16b, v0.16b, v2.16b > shl v1.4s, v0.4s, 16 > sub v0.4s, v1.4s, v0.4s > str q0, [x0], 16 > cmp w2, w1 > bhi .L20 > > with this pattern the above can be optimized to: > > ldr q0, [x0] > add w1, w1, 1 > cmlt v0.8h, v0.8h, #0 > str q0, [x0], 16 > cmp w2, w1 > bhi .L20 > > The effect is similar for x86-64. > > gcc/ChangeLog: > > * match.pd: Simplify vector shift + bit_and + multiply in some cases. > > gcc/testsuite/ChangeLog: > > * gcc.target/aarch64/swar_to_vec_cmp.c: New test. > > Signed-off-by: mtsamis <manolis.tsamis@vrull.eu> > --- > gcc/match.pd | 57 +++++++++++++++ > .../gcc.target/aarch64/swar_to_vec_cmp.c | 72 +++++++++++++++++++ > 2 files changed, 129 insertions(+) > create mode 100644 gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > > diff --git a/gcc/match.pd b/gcc/match.pd > index 8bbc0dbd5cd..5c768a94846 100644 > --- a/gcc/match.pd > +++ b/gcc/match.pd > @@ -301,6 +301,63 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) > (view_convert (bit_and:itype (view_convert @0) > (ne @1 { build_zero_cst (type); }))))))) > > +/* In SWAR (SIMD in a register) code a comparison of packed data can > + be consturcted with a particular combination of shift, bitwise and, > + and multiplication by constants. If that code is vectorized we can > + convert this pattern into a more efficient vector comparison. */ > +(simplify > + (mult (bit_and (rshift @0 @1) @2) @3) > + (with { > + tree op_type = TREE_TYPE (@0); > + tree rshift_cst = NULL_TREE; > + tree bit_and_cst = NULL_TREE; > + tree mult_cst = NULL_TREE; > + } > + /* Make sure we're working with vectors and uniform vector constants. */ > + (if (VECTOR_TYPE_P (op_type) I think you can use type here and don't need op_type. Maybe change this to: (simplify (mult (bit_and (rshift @0 uniform_integer_cst_p@1) uniform_integer_cst_p@2) uniform_integer_cst_p@3) (if (VECTOR_TYPE_P (type)) (with { tree rshift_cst = uniform_integer_cst_p (@1); tree bit_and_cst = uniform_integer_cst_p (@2); tree mult_cst = uniform_integer_cst_p (@3); } ... Similar to the patterns under "Simplifications of comparisons". Thanks, Andrew Pinski > + && (rshift_cst = uniform_integer_cst_p (@1)) > + && (bit_and_cst = uniform_integer_cst_p (@2)) > + && (mult_cst = uniform_integer_cst_p (@3))) > + /* Compute what constants would be needed for this to represent a packed > + comparison based on the shift amount denoted by RSHIFT_CST. 
*/ > + (with { > + HOST_WIDE_INT vec_elem_bits = vector_element_bits (op_type); > + HOST_WIDE_INT vec_nelts = TYPE_VECTOR_SUBPARTS (op_type).to_constant (); > + HOST_WIDE_INT vec_bits = vec_elem_bits * vec_nelts; > + > + unsigned HOST_WIDE_INT cmp_bits_i, bit_and_i, mult_i; > + unsigned HOST_WIDE_INT target_mult_i, target_bit_and_i; > + cmp_bits_i = tree_to_uhwi (rshift_cst) + 1; > + target_mult_i = (HOST_WIDE_INT_1U << cmp_bits_i) - 1; > + > + mult_i = tree_to_uhwi (mult_cst); > + bit_and_i = tree_to_uhwi (bit_and_cst); > + target_bit_and_i = 0; > + > + for (unsigned i = 0; i < vec_elem_bits / cmp_bits_i; i++) > + target_bit_and_i = (target_bit_and_i << cmp_bits_i) | 1U; > + } > + (if ((exact_log2 (cmp_bits_i)) >= 0 > + && cmp_bits_i < HOST_BITS_PER_WIDE_INT > + && vec_elem_bits <= HOST_BITS_PER_WIDE_INT > + && tree_fits_uhwi_p (rshift_cst) > + && tree_fits_uhwi_p (mult_cst) > + && tree_fits_uhwi_p (bit_and_cst) > + && target_mult_i == mult_i > + && target_bit_and_i == bit_and_i) > + /* Compute the vector shape for the comparison and check if the target is > + able to expand the comparison with that type. */ > + (with { > + tree bool_type = build_nonstandard_boolean_type (cmp_bits_i); > + int vector_type_nelts = vec_bits / cmp_bits_i; > + tree vector_type = build_vector_type (bool_type, vector_type_nelts); > + tree zeros = build_zero_cst (vector_type); > + tree mask_type = truth_type_for (vector_type); > + } > + (if (expand_vec_cmp_expr_p (vector_type, mask_type, LT_EXPR)) > + (view_convert:op_type (lt:mask_type (view_convert:vector_type @0) > + { zeros; }))))))))) > + > (for cmp (gt ge lt le) > outp (convert convert negate negate) > outn (negate negate convert convert) > diff --git a/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > new file mode 100644 > index 00000000000..26f9ad9ef28 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > @@ -0,0 +1,72 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -ftree-vectorize" } */ > + > +typedef unsigned char uint8_t; > +typedef unsigned short uint16_t; > +typedef unsigned int uint32_t; > + > +/* 8-bit SWAR tests. */ > + > +static uint8_t packed_cmp_8_8(uint8_t a) > +{ > + return ((a >> 7) & 0x1U) * 0xffU; > +} > + > +/* 16-bit SWAR tests. */ > + > +static uint16_t packed_cmp_8_16(uint16_t a) > +{ > + return ((a >> 7) & 0x101U) * 0xffU; > +} > + > +static uint16_t packed_cmp_16_16(uint16_t a) > +{ > + return ((a >> 15) & 0x1U) * 0xffffU; > +} > + > +/* 32-bit SWAR tests. */ > + > +static uint32_t packed_cmp_8_32(uint32_t a) > +{ > + return ((a >> 7) & 0x1010101U) * 0xffU; > +} > + > +static uint32_t packed_cmp_16_32(uint32_t a) > +{ > + return ((a >> 15) & 0x10001U) * 0xffffU; > +} > + > +static uint32_t packed_cmp_32_32(uint32_t a) > +{ > + return ((a >> 31) & 0x1U) * 0xffffffffU; > +} > + > +/* Driver function to test the vectorized code generated for the different > + packed_cmp variants. 
*/ > + > +#define VECTORIZED_PACKED_CMP(T, FUNC) \ > + void vectorized_cmp_##FUNC(T* a, int n) \ > + { \ > + n = (n / 32) * 32; \ > + for(int i = 0; i < n; i += 4) \ > + { \ > + a[i + 0] = FUNC(a[i + 0]); \ > + a[i + 1] = FUNC(a[i + 1]); \ > + a[i + 2] = FUNC(a[i + 2]); \ > + a[i + 3] = FUNC(a[i + 3]); \ > + } \ > + } > + > +VECTORIZED_PACKED_CMP(uint8_t, packed_cmp_8_8); > + > +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_8_16); > +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_16_16); > + > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_8_32); > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_16_32); > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_32_32); > + > +/* { dg-final { scan-assembler {\tcmlt\t} } } */ > +/* { dg-final { scan-assembler-not {\tushr\t} } } */ > +/* { dg-final { scan-assembler-not {\tshl\t} } } */ > +/* { dg-final { scan-assembler-not {\tmul\t} } } */ > -- > 2.34.1 >
On Sat, Aug 13, 2022 at 11:59 AM mtsamis <manolis.tsamis@vrull.eu> wrote: > > When using SWAR (SIMD in a register) techniques a comparison operation within > such a register can be made by using a combination of shifts, bitwise and and > multiplication. If code using this scheme is vectorized then there is potential > to replace all these operations with a single vector comparison, by reinterpreting > the vector types to match the width of the SWAR register. > > For example, for the test function packed_cmp_16_32, the original generated code is: > > ldr q0, [x0] > add w1, w1, 1 > ushr v0.4s, v0.4s, 15 > and v0.16b, v0.16b, v2.16b > shl v1.4s, v0.4s, 16 > sub v0.4s, v1.4s, v0.4s > str q0, [x0], 16 > cmp w2, w1 > bhi .L20 > > with this pattern the above can be optimized to: > > ldr q0, [x0] > add w1, w1, 1 > cmlt v0.8h, v0.8h, #0 > str q0, [x0], 16 > cmp w2, w1 > bhi .L20 > > The effect is similar for x86-64. > > gcc/ChangeLog: > > * match.pd: Simplify vector shift + bit_and + multiply in some cases. > > gcc/testsuite/ChangeLog: > > * gcc.target/aarch64/swar_to_vec_cmp.c: New test. > > Signed-off-by: mtsamis <manolis.tsamis@vrull.eu> > --- > gcc/match.pd | 57 +++++++++++++++ > .../gcc.target/aarch64/swar_to_vec_cmp.c | 72 +++++++++++++++++++ > 2 files changed, 129 insertions(+) > create mode 100644 gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > > diff --git a/gcc/match.pd b/gcc/match.pd > index 8bbc0dbd5cd..5c768a94846 100644 > --- a/gcc/match.pd > +++ b/gcc/match.pd > @@ -301,6 +301,63 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) > (view_convert (bit_and:itype (view_convert @0) > (ne @1 { build_zero_cst (type); }))))))) > > +/* In SWAR (SIMD in a register) code a comparison of packed data can > + be consturcted with a particular combination of shift, bitwise and, > + and multiplication by constants. If that code is vectorized we can > + convert this pattern into a more efficient vector comparison. */ > +(simplify > + (mult (bit_and (rshift @0 @1) @2) @3) You should restrict the pattern a bit more, below you use uniform_integer_cst_p and also require a vector type thus (simplify (mult (bit_and (rshift @0 VECTOR_CST@1) VECTOR_CST@2) VECTOR_CST@3) > + (with { > + tree op_type = TREE_TYPE (@0); that's the same as 'type' which is already available. > + tree rshift_cst = NULL_TREE; > + tree bit_and_cst = NULL_TREE; > + tree mult_cst = NULL_TREE; > + } > + /* Make sure we're working with vectors and uniform vector constants. */ > + (if (VECTOR_TYPE_P (op_type) > + && (rshift_cst = uniform_integer_cst_p (@1)) > + && (bit_and_cst = uniform_integer_cst_p (@2)) > + && (mult_cst = uniform_integer_cst_p (@3))) > + /* Compute what constants would be needed for this to represent a packed > + comparison based on the shift amount denoted by RSHIFT_CST. */ > + (with { > + HOST_WIDE_INT vec_elem_bits = vector_element_bits (op_type); > + HOST_WIDE_INT vec_nelts = TYPE_VECTOR_SUBPARTS (op_type).to_constant (); you need to check that this isn't a VLA vector operation. > + HOST_WIDE_INT vec_bits = vec_elem_bits * vec_nelts; > + > + unsigned HOST_WIDE_INT cmp_bits_i, bit_and_i, mult_i; > + unsigned HOST_WIDE_INT target_mult_i, target_bit_and_i; > + cmp_bits_i = tree_to_uhwi (rshift_cst) + 1; and that the rshift_cst and others actually fit an uhwi. 
> + target_mult_i = (HOST_WIDE_INT_1U << cmp_bits_i) - 1; > + > + mult_i = tree_to_uhwi (mult_cst); > + bit_and_i = tree_to_uhwi (bit_and_cst); > + target_bit_and_i = 0; > + > + for (unsigned i = 0; i < vec_elem_bits / cmp_bits_i; i++) > + target_bit_and_i = (target_bit_and_i << cmp_bits_i) | 1U; it would be nice to have a comment on what this actually does ... > + } > + (if ((exact_log2 (cmp_bits_i)) >= 0 > + && cmp_bits_i < HOST_BITS_PER_WIDE_INT > + && vec_elem_bits <= HOST_BITS_PER_WIDE_INT > + && tree_fits_uhwi_p (rshift_cst) > + && tree_fits_uhwi_p (mult_cst) > + && tree_fits_uhwi_p (bit_and_cst) > + && target_mult_i == mult_i > + && target_bit_and_i == bit_and_i) > + /* Compute the vector shape for the comparison and check if the target is > + able to expand the comparison with that type. */ > + (with { > + tree bool_type = build_nonstandard_boolean_type (cmp_bits_i); > + int vector_type_nelts = vec_bits / cmp_bits_i; > + tree vector_type = build_vector_type (bool_type, vector_type_nelts); why do you build a bool vector type here and then ... > + tree zeros = build_zero_cst (vector_type); > + tree mask_type = truth_type_for (vector_type); ... its truth type? Note both might not be actually supported by the target and thus receive BLKmode or an integer mode. The latter is a problem for expand_vec_cmp_expr_p as that might pick up a pattern not suitable here. Also note that truth_type_for can result in a mask mode, aka QImode with AVX512 or some VnBImode on other archs - those are not OK to be simply view_converted back to op_type. In general a vector compare operation yields a mask and you can convert that to a -1/0 value using a vec_cond_expr. I think we have a pattern that can then properly simplify the case where this can be expressed as a view_convert, but of course you then also need to check for vec_cond_expr support. I would suggest you make 'vector_type' an integer element type (that also properly specifies the sign of the comparison!) and check you end up with a vector mode and the mode of the mask_type agrees with that if you don't want to go the vec_cond_expr route. > + } > + (if (expand_vec_cmp_expr_p (vector_type, mask_type, LT_EXPR)) > + (view_convert:op_type (lt:mask_type (view_convert:vector_type @0) > + { zeros; }))))))))) > + > (for cmp (gt ge lt le) > outp (convert convert negate negate) > outn (negate negate convert convert) > diff --git a/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > new file mode 100644 > index 00000000000..26f9ad9ef28 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > @@ -0,0 +1,72 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -ftree-vectorize" } */ > + > +typedef unsigned char uint8_t; > +typedef unsigned short uint16_t; > +typedef unsigned int uint32_t; > + > +/* 8-bit SWAR tests. */ > + > +static uint8_t packed_cmp_8_8(uint8_t a) > +{ > + return ((a >> 7) & 0x1U) * 0xffU; > +} > + > +/* 16-bit SWAR tests. */ > + > +static uint16_t packed_cmp_8_16(uint16_t a) > +{ > + return ((a >> 7) & 0x101U) * 0xffU; > +} > + > +static uint16_t packed_cmp_16_16(uint16_t a) > +{ > + return ((a >> 15) & 0x1U) * 0xffffU; > +} > + > +/* 32-bit SWAR tests. 
*/ > + > +static uint32_t packed_cmp_8_32(uint32_t a) > +{ > + return ((a >> 7) & 0x1010101U) * 0xffU; > +} > + > +static uint32_t packed_cmp_16_32(uint32_t a) > +{ > + return ((a >> 15) & 0x10001U) * 0xffffU; > +} > + > +static uint32_t packed_cmp_32_32(uint32_t a) > +{ > + return ((a >> 31) & 0x1U) * 0xffffffffU; > +} > + > +/* Driver function to test the vectorized code generated for the different > + packed_cmp variants. */ > + > +#define VECTORIZED_PACKED_CMP(T, FUNC) \ > + void vectorized_cmp_##FUNC(T* a, int n) \ > + { \ > + n = (n / 32) * 32; \ > + for(int i = 0; i < n; i += 4) \ > + { \ > + a[i + 0] = FUNC(a[i + 0]); \ > + a[i + 1] = FUNC(a[i + 1]); \ > + a[i + 2] = FUNC(a[i + 2]); \ > + a[i + 3] = FUNC(a[i + 3]); \ > + } \ > + } > + > +VECTORIZED_PACKED_CMP(uint8_t, packed_cmp_8_8); > + > +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_8_16); > +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_16_16); > + > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_8_32); > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_16_32); > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_32_32); > + > +/* { dg-final { scan-assembler {\tcmlt\t} } } */ > +/* { dg-final { scan-assembler-not {\tushr\t} } } */ > +/* { dg-final { scan-assembler-not {\tshl\t} } } */ > +/* { dg-final { scan-assembler-not {\tmul\t} } } */ > -- > 2.34.1 >
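For reference, the sketch below (a standalone illustration with made-up variable names, not GCC-internal code) shows what the pattern's constant checks compute, including the loop Richard asked to have commented: for a given shift amount the multiplier must be all ones within one lane, and the bitwise-AND mask must have the lowest bit of every lane set.

```c
#include <stdio.h>

int main (void)
{
  unsigned vec_elem_bits = 32;   /* width of each vector element      */
  unsigned rshift = 15;          /* shift amount, giving 16-bit lanes */

  /* The lane width implied by the shift, and the expected multiplier:
     all ones within one lane.  */
  unsigned cmp_bits = rshift + 1;
  unsigned long long target_mult = (1ULL << cmp_bits) - 1;

  /* This mirrors the loop in the pattern: it builds a mask with the
     lowest bit of every cmp_bits-wide lane set, e.g. 0x10001 for
     16-bit lanes in a 32-bit element, or 0x01010101 for 8-bit lanes.  */
  unsigned long long target_bit_and = 0;
  for (unsigned i = 0; i < vec_elem_bits / cmp_bits; i++)
    target_bit_and = (target_bit_and << cmp_bits) | 1U;

  /* Prints: mult = 0xffff, bit_and = 0x10001 -- the constants used by
     packed_cmp_16_32 in the testcase.  */
  printf ("mult = %#llx, bit_and = %#llx\n", target_mult, target_bit_and);
  return 0;
}
```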
> -----Original Message----- > From: Gcc-patches <gcc-patches- > bounces+tamar.christina=arm.com@gcc.gnu.org> On Behalf Of Richard > Biener via Gcc-patches > Sent: Friday, August 26, 2022 10:08 AM > To: mtsamis <manolis.tsamis@vrull.eu> > Cc: GCC Patches <gcc-patches@gcc.gnu.org>; > jiangning.liu@amperecomputing.com; Philipp Tomsich > <philipp.tomsich@vrull.eu> > Subject: Re: [PATCH] Add pattern to convert vector shift + bitwise and + > multiply to vector compare in some cases. > > On Sat, Aug 13, 2022 at 11:59 AM mtsamis <manolis.tsamis@vrull.eu> wrote: > > > > When using SWAR (SIMD in a register) techniques a comparison operation > > within such a register can be made by using a combination of shifts, > > bitwise and and multiplication. If code using this scheme is > > vectorized then there is potential to replace all these operations > > with a single vector comparison, by reinterpreting the vector types to > match the width of the SWAR register. > > > > For example, for the test function packed_cmp_16_32, the original > generated code is: > > > > ldr q0, [x0] > > add w1, w1, 1 > > ushr v0.4s, v0.4s, 15 > > and v0.16b, v0.16b, v2.16b > > shl v1.4s, v0.4s, 16 > > sub v0.4s, v1.4s, v0.4s > > str q0, [x0], 16 > > cmp w2, w1 > > bhi .L20 > > > > with this pattern the above can be optimized to: > > > > ldr q0, [x0] > > add w1, w1, 1 > > cmlt v0.8h, v0.8h, #0 > > str q0, [x0], 16 > > cmp w2, w1 > > bhi .L20 > > > > The effect is similar for x86-64. > > > > gcc/ChangeLog: > > > > * match.pd: Simplify vector shift + bit_and + multiply in some cases. > > > > gcc/testsuite/ChangeLog: > > > > * gcc.target/aarch64/swar_to_vec_cmp.c: New test. > > > > Signed-off-by: mtsamis <manolis.tsamis@vrull.eu> > > --- > > gcc/match.pd | 57 +++++++++++++++ > > .../gcc.target/aarch64/swar_to_vec_cmp.c | 72 > +++++++++++++++++++ > > 2 files changed, 129 insertions(+) > > create mode 100644 gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > > > > diff --git a/gcc/match.pd b/gcc/match.pd index > > 8bbc0dbd5cd..5c768a94846 100644 > > --- a/gcc/match.pd > > +++ b/gcc/match.pd > > @@ -301,6 +301,63 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) > > (view_convert (bit_and:itype (view_convert @0) > > (ne @1 { build_zero_cst (type); > > }))))))) > > > > +/* In SWAR (SIMD in a register) code a comparison of packed data can > > + be consturcted with a particular combination of shift, bitwise and, > > + and multiplication by constants. If that code is vectorized we can > > + convert this pattern into a more efficient vector comparison. */ > > +(simplify (mult (bit_and (rshift @0 @1) @2) @3) > > You should restrict the pattern a bit more, below you use > uniform_integer_cst_p and also require a vector type thus > > (simplify > (mult (bit_and (rshift @0 VECTOR_CST@1) VECTOR_CST@2) > VECTOR_CST@3) > > > > + (with { > > + tree op_type = TREE_TYPE (@0); > > that's the same as 'type' which is already available. > > > + tree rshift_cst = NULL_TREE; > > + tree bit_and_cst = NULL_TREE; > > + tree mult_cst = NULL_TREE; > > + } > > + /* Make sure we're working with vectors and uniform vector > > + constants. */ (if (VECTOR_TYPE_P (op_type) > > + && (rshift_cst = uniform_integer_cst_p (@1)) > > + && (bit_and_cst = uniform_integer_cst_p (@2)) > > + && (mult_cst = uniform_integer_cst_p (@3))) > > + /* Compute what constants would be needed for this to represent a > packed > > + comparison based on the shift amount denoted by RSHIFT_CST. 
*/ > > + (with { > > + HOST_WIDE_INT vec_elem_bits = vector_element_bits (op_type); > > + HOST_WIDE_INT vec_nelts = TYPE_VECTOR_SUBPARTS > > + (op_type).to_constant (); > > you need to check that this isn't a VLA vector operation. Seems like this pattern should be applicable to VLA as well no? So could we not keep vec_nelts as a poly and just use exact_div Below in the division? The pattern is only valid if cmp_bits_i is a multiple of vec_elem_bits anyway. The build_vector_* should then do the right thing. Regards, Tamar > > > + HOST_WIDE_INT vec_bits = vec_elem_bits * vec_nelts; > > + > > + unsigned HOST_WIDE_INT cmp_bits_i, bit_and_i, mult_i; > > + unsigned HOST_WIDE_INT target_mult_i, target_bit_and_i; > > + cmp_bits_i = tree_to_uhwi (rshift_cst) + 1; > > and that the rshift_cst and others actually fit an uhwi. > > > + target_mult_i = (HOST_WIDE_INT_1U << cmp_bits_i) - 1; > > + > > + mult_i = tree_to_uhwi (mult_cst); > > + bit_and_i = tree_to_uhwi (bit_and_cst); > > + target_bit_and_i = 0; > > + > > + for (unsigned i = 0; i < vec_elem_bits / cmp_bits_i; i++) > > + target_bit_and_i = (target_bit_and_i << cmp_bits_i) | 1U; > > it would be nice to have a comment on what this actually does ... > > > + } > > + (if ((exact_log2 (cmp_bits_i)) >= 0 > > + && cmp_bits_i < HOST_BITS_PER_WIDE_INT > > + && vec_elem_bits <= HOST_BITS_PER_WIDE_INT > > + && tree_fits_uhwi_p (rshift_cst) > > + && tree_fits_uhwi_p (mult_cst) > > + && tree_fits_uhwi_p (bit_and_cst) > > + && target_mult_i == mult_i > > + && target_bit_and_i == bit_and_i) > > + /* Compute the vector shape for the comparison and check if the > target is > > + able to expand the comparison with that type. */ > > + (with { > > + tree bool_type = build_nonstandard_boolean_type (cmp_bits_i); > > + int vector_type_nelts = vec_bits / cmp_bits_i; > > + tree vector_type = build_vector_type (bool_type, > > + vector_type_nelts); > > why do you build a bool vector type here and then ... > > > + tree zeros = build_zero_cst (vector_type); > > + tree mask_type = truth_type_for (vector_type); > > ... its truth type? Note both might not be actually supported by the target > and thus receive BLKmode or an integer mode. The latter is a problem for > expand_vec_cmp_expr_p as that might pick up a pattern not suitable > here. Also note that truth_type_for can result in a mask mode, aka > QImode with AVX512 or some VnBImode on other archs - those are not OK > to be simply view_converted back to op_type. In general a vector compare > operation yields a mask and you can convert that to a -1/0 value using a > vec_cond_expr. I think we have a pattern that can then properly simplify the > case where this can be expressed as a view_convert, but of course you then > also need to check for vec_cond_expr support. > > I would suggest you make 'vector_type' an integer element type (that also > properly specifies the sign of the comparison!) and check you end up with a > vector mode and the mode of the mask_type agrees with that if you don't > want to go the vec_cond_expr route. 
> > > > + } > > + (if (expand_vec_cmp_expr_p (vector_type, mask_type, LT_EXPR)) > > + (view_convert:op_type (lt:mask_type (view_convert:vector_type > @0) > > + { zeros; }))))))))) > > + > > (for cmp (gt ge lt le) > > outp (convert convert negate negate) > > outn (negate negate convert convert) diff --git > > a/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > > b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > > new file mode 100644 > > index 00000000000..26f9ad9ef28 > > --- /dev/null > > +++ b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > > @@ -0,0 +1,72 @@ > > +/* { dg-do compile } */ > > +/* { dg-options "-O2 -ftree-vectorize" } */ > > + > > +typedef unsigned char uint8_t; > > +typedef unsigned short uint16_t; > > +typedef unsigned int uint32_t; > > + > > +/* 8-bit SWAR tests. */ > > + > > +static uint8_t packed_cmp_8_8(uint8_t a) { > > + return ((a >> 7) & 0x1U) * 0xffU; > > +} > > + > > +/* 16-bit SWAR tests. */ > > + > > +static uint16_t packed_cmp_8_16(uint16_t a) { > > + return ((a >> 7) & 0x101U) * 0xffU; } > > + > > +static uint16_t packed_cmp_16_16(uint16_t a) { > > + return ((a >> 15) & 0x1U) * 0xffffU; } > > + > > +/* 32-bit SWAR tests. */ > > + > > +static uint32_t packed_cmp_8_32(uint32_t a) { > > + return ((a >> 7) & 0x1010101U) * 0xffU; } > > + > > +static uint32_t packed_cmp_16_32(uint32_t a) { > > + return ((a >> 15) & 0x10001U) * 0xffffU; } > > + > > +static uint32_t packed_cmp_32_32(uint32_t a) { > > + return ((a >> 31) & 0x1U) * 0xffffffffU; } > > + > > +/* Driver function to test the vectorized code generated for the different > > + packed_cmp variants. */ > > + > > +#define VECTORIZED_PACKED_CMP(T, FUNC) \ > > + void vectorized_cmp_##FUNC(T* a, int n) \ > > + { \ > > + n = (n / 32) * 32; \ > > + for(int i = 0; i < n; i += 4) \ > > + { \ > > + a[i + 0] = FUNC(a[i + 0]); \ > > + a[i + 1] = FUNC(a[i + 1]); \ > > + a[i + 2] = FUNC(a[i + 2]); \ > > + a[i + 3] = FUNC(a[i + 3]); \ > > + } \ > > + } > > + > > +VECTORIZED_PACKED_CMP(uint8_t, packed_cmp_8_8); > > + > > +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_8_16); > > +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_16_16); > > + > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_8_32); > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_16_32); > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_32_32); > > + > > +/* { dg-final { scan-assembler {\tcmlt\t} } } */ > > +/* { dg-final { scan-assembler-not {\tushr\t} } } */ > > +/* { dg-final { scan-assembler-not {\tshl\t} } } */ > > +/* { dg-final { scan-assembler-not {\tmul\t} } } */ > > -- > > 2.34.1 > >
Richard & Tamar, On Fri, 26 Aug 2022 at 15:29, Tamar Christina <Tamar.Christina@arm.com> wrote: > > > -----Original Message----- > > From: Gcc-patches <gcc-patches- > > bounces+tamar.christina=arm.com@gcc.gnu.org> On Behalf Of Richard > > Biener via Gcc-patches > > Sent: Friday, August 26, 2022 10:08 AM > > To: mtsamis <manolis.tsamis@vrull.eu> > > Cc: GCC Patches <gcc-patches@gcc.gnu.org>; > > jiangning.liu@amperecomputing.com; Philipp Tomsich > > <philipp.tomsich@vrull.eu> > > Subject: Re: [PATCH] Add pattern to convert vector shift + bitwise and + > > multiply to vector compare in some cases. > > > > On Sat, Aug 13, 2022 at 11:59 AM mtsamis <manolis.tsamis@vrull.eu> wrote: > > > > > > When using SWAR (SIMD in a register) techniques a comparison operation > > > within such a register can be made by using a combination of shifts, > > > bitwise and and multiplication. If code using this scheme is > > > vectorized then there is potential to replace all these operations > > > with a single vector comparison, by reinterpreting the vector types to > > match the width of the SWAR register. > > > > > > For example, for the test function packed_cmp_16_32, the original > > generated code is: > > > > > > ldr q0, [x0] > > > add w1, w1, 1 > > > ushr v0.4s, v0.4s, 15 > > > and v0.16b, v0.16b, v2.16b > > > shl v1.4s, v0.4s, 16 > > > sub v0.4s, v1.4s, v0.4s > > > str q0, [x0], 16 > > > cmp w2, w1 > > > bhi .L20 > > > > > > with this pattern the above can be optimized to: > > > > > > ldr q0, [x0] > > > add w1, w1, 1 > > > cmlt v0.8h, v0.8h, #0 > > > str q0, [x0], 16 > > > cmp w2, w1 > > > bhi .L20 > > > > > > The effect is similar for x86-64. > > > > > > gcc/ChangeLog: > > > > > > * match.pd: Simplify vector shift + bit_and + multiply in some cases. > > > > > > gcc/testsuite/ChangeLog: > > > > > > * gcc.target/aarch64/swar_to_vec_cmp.c: New test. > > > > > > Signed-off-by: mtsamis <manolis.tsamis@vrull.eu> > > > --- > > > gcc/match.pd | 57 +++++++++++++++ > > > .../gcc.target/aarch64/swar_to_vec_cmp.c | 72 > > +++++++++++++++++++ > > > 2 files changed, 129 insertions(+) > > > create mode 100644 gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > > > > > > diff --git a/gcc/match.pd b/gcc/match.pd index > > > 8bbc0dbd5cd..5c768a94846 100644 > > > --- a/gcc/match.pd > > > +++ b/gcc/match.pd > > > @@ -301,6 +301,63 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) > > > (view_convert (bit_and:itype (view_convert @0) > > > (ne @1 { build_zero_cst (type); > > > }))))))) > > > > > > +/* In SWAR (SIMD in a register) code a comparison of packed data can > > > + be consturcted with a particular combination of shift, bitwise and, > > > + and multiplication by constants. If that code is vectorized we can > > > + convert this pattern into a more efficient vector comparison. */ > > > +(simplify (mult (bit_and (rshift @0 @1) @2) @3) > > > > You should restrict the pattern a bit more, below you use > > uniform_integer_cst_p and also require a vector type thus > > > > (simplify > > (mult (bit_and (rshift @0 VECTOR_CST@1) VECTOR_CST@2) > > VECTOR_CST@3) > > > > > > > + (with { > > > + tree op_type = TREE_TYPE (@0); > > > > that's the same as 'type' which is already available. > > > > > + tree rshift_cst = NULL_TREE; > > > + tree bit_and_cst = NULL_TREE; > > > + tree mult_cst = NULL_TREE; > > > + } > > > + /* Make sure we're working with vectors and uniform vector > > > + constants. 
*/ (if (VECTOR_TYPE_P (op_type) > > > + && (rshift_cst = uniform_integer_cst_p (@1)) > > > + && (bit_and_cst = uniform_integer_cst_p (@2)) > > > + && (mult_cst = uniform_integer_cst_p (@3))) > > > + /* Compute what constants would be needed for this to represent a > > packed > > > + comparison based on the shift amount denoted by RSHIFT_CST. */ > > > + (with { > > > + HOST_WIDE_INT vec_elem_bits = vector_element_bits (op_type); > > > + HOST_WIDE_INT vec_nelts = TYPE_VECTOR_SUBPARTS > > > + (op_type).to_constant (); > > > > you need to check that this isn't a VLA vector operation. > > Seems like this pattern should be applicable to VLA as well no? > So could we not keep vec_nelts as a poly and just use exact_div > Below in the division? The pattern is only valid if cmp_bits_i is a > multiple of vec_elem_bits anyway. The build_vector_* should then > do the right thing. Seems like we never agreed on what should go into the next version. Am I right in assuming that applicability to VLA is ok and that we should primarily focus on addressing the below comments for v2? Cheers, Philipp. > > > > > + HOST_WIDE_INT vec_bits = vec_elem_bits * vec_nelts; > > > + > > > + unsigned HOST_WIDE_INT cmp_bits_i, bit_and_i, mult_i; > > > + unsigned HOST_WIDE_INT target_mult_i, target_bit_and_i; > > > + cmp_bits_i = tree_to_uhwi (rshift_cst) + 1; > > > > and that the rshift_cst and others actually fit an uhwi. > > > > > + target_mult_i = (HOST_WIDE_INT_1U << cmp_bits_i) - 1; > > > + > > > + mult_i = tree_to_uhwi (mult_cst); > > > + bit_and_i = tree_to_uhwi (bit_and_cst); > > > + target_bit_and_i = 0; > > > + > > > + for (unsigned i = 0; i < vec_elem_bits / cmp_bits_i; i++) > > > + target_bit_and_i = (target_bit_and_i << cmp_bits_i) | 1U; > > > > it would be nice to have a comment on what this actually does ... > > > > > + } > > > + (if ((exact_log2 (cmp_bits_i)) >= 0 > > > + && cmp_bits_i < HOST_BITS_PER_WIDE_INT > > > + && vec_elem_bits <= HOST_BITS_PER_WIDE_INT > > > + && tree_fits_uhwi_p (rshift_cst) > > > + && tree_fits_uhwi_p (mult_cst) > > > + && tree_fits_uhwi_p (bit_and_cst) > > > + && target_mult_i == mult_i > > > + && target_bit_and_i == bit_and_i) > > > + /* Compute the vector shape for the comparison and check if the > > target is > > > + able to expand the comparison with that type. */ > > > + (with { > > > + tree bool_type = build_nonstandard_boolean_type (cmp_bits_i); > > > + int vector_type_nelts = vec_bits / cmp_bits_i; > > > + tree vector_type = build_vector_type (bool_type, > > > + vector_type_nelts); > > > > why do you build a bool vector type here and then ... > > > > > + tree zeros = build_zero_cst (vector_type); > > > + tree mask_type = truth_type_for (vector_type); > > > > ... its truth type? Note both might not be actually supported by the target > > and thus receive BLKmode or an integer mode. The latter is a problem for > > expand_vec_cmp_expr_p as that might pick up a pattern not suitable > > here. Also note that truth_type_for can result in a mask mode, aka > > QImode with AVX512 or some VnBImode on other archs - those are not OK > > to be simply view_converted back to op_type. In general a vector compare > > operation yields a mask and you can convert that to a -1/0 value using a > > vec_cond_expr. I think we have a pattern that can then properly simplify the > > case where this can be expressed as a view_convert, but of course you then > > also need to check for vec_cond_expr support. 
> > > > I would suggest you make 'vector_type' an integer element type (that also > > properly specifies the sign of the comparison!) and check you end up with a > > vector mode and the mode of the mask_type agrees with that if you don't > > want to go the vec_cond_expr route. > > > > > > > + } > > > + (if (expand_vec_cmp_expr_p (vector_type, mask_type, LT_EXPR)) > > > + (view_convert:op_type (lt:mask_type (view_convert:vector_type > > @0) > > > + { zeros; }))))))))) > > > + > > > (for cmp (gt ge lt le) > > > outp (convert convert negate negate) > > > outn (negate negate convert convert) diff --git > > > a/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > > > b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > > > new file mode 100644 > > > index 00000000000..26f9ad9ef28 > > > --- /dev/null > > > +++ b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > > > @@ -0,0 +1,72 @@ > > > +/* { dg-do compile } */ > > > +/* { dg-options "-O2 -ftree-vectorize" } */ > > > + > > > +typedef unsigned char uint8_t; > > > +typedef unsigned short uint16_t; > > > +typedef unsigned int uint32_t; > > > + > > > +/* 8-bit SWAR tests. */ > > > + > > > +static uint8_t packed_cmp_8_8(uint8_t a) { > > > + return ((a >> 7) & 0x1U) * 0xffU; > > > +} > > > + > > > +/* 16-bit SWAR tests. */ > > > + > > > +static uint16_t packed_cmp_8_16(uint16_t a) { > > > + return ((a >> 7) & 0x101U) * 0xffU; } > > > + > > > +static uint16_t packed_cmp_16_16(uint16_t a) { > > > + return ((a >> 15) & 0x1U) * 0xffffU; } > > > + > > > +/* 32-bit SWAR tests. */ > > > + > > > +static uint32_t packed_cmp_8_32(uint32_t a) { > > > + return ((a >> 7) & 0x1010101U) * 0xffU; } > > > + > > > +static uint32_t packed_cmp_16_32(uint32_t a) { > > > + return ((a >> 15) & 0x10001U) * 0xffffU; } > > > + > > > +static uint32_t packed_cmp_32_32(uint32_t a) { > > > + return ((a >> 31) & 0x1U) * 0xffffffffU; } > > > + > > > +/* Driver function to test the vectorized code generated for the different > > > + packed_cmp variants. */ > > > + > > > +#define VECTORIZED_PACKED_CMP(T, FUNC) \ > > > + void vectorized_cmp_##FUNC(T* a, int n) \ > > > + { \ > > > + n = (n / 32) * 32; \ > > > + for(int i = 0; i < n; i += 4) \ > > > + { \ > > > + a[i + 0] = FUNC(a[i + 0]); \ > > > + a[i + 1] = FUNC(a[i + 1]); \ > > > + a[i + 2] = FUNC(a[i + 2]); \ > > > + a[i + 3] = FUNC(a[i + 3]); \ > > > + } \ > > > + } > > > + > > > +VECTORIZED_PACKED_CMP(uint8_t, packed_cmp_8_8); > > > + > > > +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_8_16); > > > +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_16_16); > > > + > > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_8_32); > > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_16_32); > > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_32_32); > > > + > > > +/* { dg-final { scan-assembler {\tcmlt\t} } } */ > > > +/* { dg-final { scan-assembler-not {\tushr\t} } } */ > > > +/* { dg-final { scan-assembler-not {\tshl\t} } } */ > > > +/* { dg-final { scan-assembler-not {\tmul\t} } } */ > > > -- > > > 2.34.1 > > >
Hi, > -----Original Message----- > From: Philipp Tomsich <philipp.tomsich@vrull.eu> > Sent: Tuesday, November 22, 2022 10:35 AM > To: Tamar Christina <Tamar.Christina@arm.com> > Cc: Richard Biener <richard.guenther@gmail.com>; mtsamis > <manolis.tsamis@vrull.eu>; GCC Patches <gcc-patches@gcc.gnu.org>; > jiangning.liu@amperecomputing.com > Subject: Re: [PATCH] Add pattern to convert vector shift + bitwise and + > multiply to vector compare in some cases. > > Richard & Tamar, > > On Fri, 26 Aug 2022 at 15:29, Tamar Christina <Tamar.Christina@arm.com> > wrote: > > > > > -----Original Message----- > > > From: Gcc-patches <gcc-patches- > > > bounces+tamar.christina=arm.com@gcc.gnu.org> On Behalf Of Richard > > > Biener via Gcc-patches > > > Sent: Friday, August 26, 2022 10:08 AM > > > To: mtsamis <manolis.tsamis@vrull.eu> > > > Cc: GCC Patches <gcc-patches@gcc.gnu.org>; > > > jiangning.liu@amperecomputing.com; Philipp Tomsich > > > <philipp.tomsich@vrull.eu> > > > Subject: Re: [PATCH] Add pattern to convert vector shift + bitwise > > > and + multiply to vector compare in some cases. > > > > > > On Sat, Aug 13, 2022 at 11:59 AM mtsamis <manolis.tsamis@vrull.eu> > wrote: > > > > > > > > When using SWAR (SIMD in a register) techniques a comparison > > > > operation within such a register can be made by using a > > > > combination of shifts, bitwise and and multiplication. If code > > > > using this scheme is vectorized then there is potential to replace > > > > all these operations with a single vector comparison, by > > > > reinterpreting the vector types to > > > match the width of the SWAR register. > > > > > > > > For example, for the test function packed_cmp_16_32, the original > > > generated code is: > > > > > > > > ldr q0, [x0] > > > > add w1, w1, 1 > > > > ushr v0.4s, v0.4s, 15 > > > > and v0.16b, v0.16b, v2.16b > > > > shl v1.4s, v0.4s, 16 > > > > sub v0.4s, v1.4s, v0.4s > > > > str q0, [x0], 16 > > > > cmp w2, w1 > > > > bhi .L20 > > > > > > > > with this pattern the above can be optimized to: > > > > > > > > ldr q0, [x0] > > > > add w1, w1, 1 > > > > cmlt v0.8h, v0.8h, #0 > > > > str q0, [x0], 16 > > > > cmp w2, w1 > > > > bhi .L20 > > > > > > > > The effect is similar for x86-64. > > > > > > > > gcc/ChangeLog: > > > > > > > > * match.pd: Simplify vector shift + bit_and + multiply in some cases. > > > > > > > > gcc/testsuite/ChangeLog: > > > > > > > > * gcc.target/aarch64/swar_to_vec_cmp.c: New test. > > > > > > > > Signed-off-by: mtsamis <manolis.tsamis@vrull.eu> > > > > --- > > > > gcc/match.pd | 57 +++++++++++++++ > > > > .../gcc.target/aarch64/swar_to_vec_cmp.c | 72 > > > +++++++++++++++++++ > > > > 2 files changed, 129 insertions(+) create mode 100644 > > > > gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > > > > > > > > diff --git a/gcc/match.pd b/gcc/match.pd index > > > > 8bbc0dbd5cd..5c768a94846 100644 > > > > --- a/gcc/match.pd > > > > +++ b/gcc/match.pd > > > > @@ -301,6 +301,63 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) > > > > (view_convert (bit_and:itype (view_convert @0) > > > > (ne @1 { build_zero_cst (type); > > > > }))))))) > > > > > > > > +/* In SWAR (SIMD in a register) code a comparison of packed data can > > > > + be consturcted with a particular combination of shift, bitwise and, > > > > + and multiplication by constants. If that code is vectorized we can > > > > + convert this pattern into a more efficient vector comparison. 
> > > > +*/ (simplify (mult (bit_and (rshift @0 @1) @2) @3) > > > > > > You should restrict the pattern a bit more, below you use > > > uniform_integer_cst_p and also require a vector type thus > > > > > > (simplify > > > (mult (bit_and (rshift @0 VECTOR_CST@1) VECTOR_CST@2) > > > VECTOR_CST@3) > > > > > > > > > > + (with { > > > > + tree op_type = TREE_TYPE (@0); > > > > > > that's the same as 'type' which is already available. > > > > > > > + tree rshift_cst = NULL_TREE; > > > > + tree bit_and_cst = NULL_TREE; > > > > + tree mult_cst = NULL_TREE; > > > > + } > > > > + /* Make sure we're working with vectors and uniform vector > > > > + constants. */ (if (VECTOR_TYPE_P (op_type) > > > > + && (rshift_cst = uniform_integer_cst_p (@1)) > > > > + && (bit_and_cst = uniform_integer_cst_p (@2)) > > > > + && (mult_cst = uniform_integer_cst_p (@3))) > > > > + /* Compute what constants would be needed for this to > > > > + represent a > > > packed > > > > + comparison based on the shift amount denoted by RSHIFT_CST. */ > > > > + (with { > > > > + HOST_WIDE_INT vec_elem_bits = vector_element_bits (op_type); > > > > + HOST_WIDE_INT vec_nelts = TYPE_VECTOR_SUBPARTS > > > > + (op_type).to_constant (); > > > > > > you need to check that this isn't a VLA vector operation. > > > > Seems like this pattern should be applicable to VLA as well no? > > So could we not keep vec_nelts as a poly and just use exact_div Below > > in the division? The pattern is only valid if cmp_bits_i is a multiple > > of vec_elem_bits anyway. The build_vector_* should then do the right > > thing. > > Seems like we never agreed on what should go into the next version. > Am I right in assuming that applicability to VLA is ok and that we should > primarily focus on addressing the below comments for v2? I don't think there was a disagreement here per say. Richard's point was that in the current form the patch is unsafe for VLA, as to_constant () would fail. That is if the idea was not to support VLA then this needs a check, or use the checking version of to_constant. I was saying that we don’t need to make this ignore VLA, that the pattern should be applicable for VLA as well. That is to say, properly handling VLA would address Richard's concerns as well. Cheers, Tamar > > Cheers, > Philipp. > > > > > > > > + HOST_WIDE_INT vec_bits = vec_elem_bits * vec_nelts; > > > > + > > > > + unsigned HOST_WIDE_INT cmp_bits_i, bit_and_i, mult_i; > > > > + unsigned HOST_WIDE_INT target_mult_i, target_bit_and_i; > > > > + cmp_bits_i = tree_to_uhwi (rshift_cst) + 1; > > > > > > and that the rshift_cst and others actually fit an uhwi. > > > > > > > + target_mult_i = (HOST_WIDE_INT_1U << cmp_bits_i) - 1; > > > > + > > > > + mult_i = tree_to_uhwi (mult_cst); > > > > + bit_and_i = tree_to_uhwi (bit_and_cst); > > > > + target_bit_and_i = 0; > > > > + > > > > + for (unsigned i = 0; i < vec_elem_bits / cmp_bits_i; i++) > > > > + target_bit_and_i = (target_bit_and_i << cmp_bits_i) | 1U; > > > > > > it would be nice to have a comment on what this actually does ... 
> > > > > > > + } > > > > + (if ((exact_log2 (cmp_bits_i)) >= 0 > > > > + && cmp_bits_i < HOST_BITS_PER_WIDE_INT > > > > + && vec_elem_bits <= HOST_BITS_PER_WIDE_INT > > > > + && tree_fits_uhwi_p (rshift_cst) > > > > + && tree_fits_uhwi_p (mult_cst) > > > > + && tree_fits_uhwi_p (bit_and_cst) > > > > + && target_mult_i == mult_i > > > > + && target_bit_and_i == bit_and_i) > > > > + /* Compute the vector shape for the comparison and check if > > > > + the > > > target is > > > > + able to expand the comparison with that type. */ > > > > + (with { > > > > + tree bool_type = build_nonstandard_boolean_type (cmp_bits_i); > > > > + int vector_type_nelts = vec_bits / cmp_bits_i; > > > > + tree vector_type = build_vector_type (bool_type, > > > > + vector_type_nelts); > > > > > > why do you build a bool vector type here and then ... > > > > > > > + tree zeros = build_zero_cst (vector_type); > > > > + tree mask_type = truth_type_for (vector_type); > > > > > > ... its truth type? Note both might not be actually supported by > > > the target and thus receive BLKmode or an integer mode. The latter > > > is a problem for expand_vec_cmp_expr_p as that might pick up a pattern > not suitable > > > here. Also note that truth_type_for can result in a mask mode, aka > > > QImode with AVX512 or some VnBImode on other archs - those are not > > > OK to be simply view_converted back to op_type. In general a vector > > > compare operation yields a mask and you can convert that to a -1/0 > > > value using a vec_cond_expr. I think we have a pattern that can > > > then properly simplify the case where this can be expressed as a > > > view_convert, but of course you then also need to check for > vec_cond_expr support. > > > > > > I would suggest you make 'vector_type' an integer element type (that > > > also properly specifies the sign of the comparison!) and check you > > > end up with a vector mode and the mode of the mask_type agrees with > > > that if you don't want to go the vec_cond_expr route. > > > > > > > > > > + } > > > > + (if (expand_vec_cmp_expr_p (vector_type, mask_type, LT_EXPR)) > > > > + (view_convert:op_type (lt:mask_type > > > > + (view_convert:vector_type > > > @0) > > > > + { zeros; }))))))))) > > > > + > > > > (for cmp (gt ge lt le) > > > > outp (convert convert negate negate) > > > > outn (negate negate convert convert) diff --git > > > > a/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > > > > b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > > > > new file mode 100644 > > > > index 00000000000..26f9ad9ef28 > > > > --- /dev/null > > > > +++ b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > > > > @@ -0,0 +1,72 @@ > > > > +/* { dg-do compile } */ > > > > +/* { dg-options "-O2 -ftree-vectorize" } */ > > > > + > > > > +typedef unsigned char uint8_t; > > > > +typedef unsigned short uint16_t; > > > > +typedef unsigned int uint32_t; > > > > + > > > > +/* 8-bit SWAR tests. */ > > > > + > > > > +static uint8_t packed_cmp_8_8(uint8_t a) { > > > > + return ((a >> 7) & 0x1U) * 0xffU; } > > > > + > > > > +/* 16-bit SWAR tests. */ > > > > + > > > > +static uint16_t packed_cmp_8_16(uint16_t a) { > > > > + return ((a >> 7) & 0x101U) * 0xffU; } > > > > + > > > > +static uint16_t packed_cmp_16_16(uint16_t a) { > > > > + return ((a >> 15) & 0x1U) * 0xffffU; } > > > > + > > > > +/* 32-bit SWAR tests. 
*/ > > > > + > > > > +static uint32_t packed_cmp_8_32(uint32_t a) { > > > > + return ((a >> 7) & 0x1010101U) * 0xffU; } > > > > + > > > > +static uint32_t packed_cmp_16_32(uint32_t a) { > > > > + return ((a >> 15) & 0x10001U) * 0xffffU; } > > > > + > > > > +static uint32_t packed_cmp_32_32(uint32_t a) { > > > > + return ((a >> 31) & 0x1U) * 0xffffffffU; } > > > > + > > > > +/* Driver function to test the vectorized code generated for the > different > > > > + packed_cmp variants. */ > > > > + > > > > +#define VECTORIZED_PACKED_CMP(T, FUNC) \ > > > > + void vectorized_cmp_##FUNC(T* a, int n) \ > > > > + { \ > > > > + n = (n / 32) * 32; \ > > > > + for(int i = 0; i < n; i += 4) \ > > > > + { \ > > > > + a[i + 0] = FUNC(a[i + 0]); \ > > > > + a[i + 1] = FUNC(a[i + 1]); \ > > > > + a[i + 2] = FUNC(a[i + 2]); \ > > > > + a[i + 3] = FUNC(a[i + 3]); \ > > > > + } \ > > > > + } > > > > + > > > > +VECTORIZED_PACKED_CMP(uint8_t, packed_cmp_8_8); > > > > + > > > > +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_8_16); > > > > +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_16_16); > > > > + > > > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_8_32); > > > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_16_32); > > > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_32_32); > > > > + > > > > +/* { dg-final { scan-assembler {\tcmlt\t} } } */ > > > > +/* { dg-final { scan-assembler-not {\tushr\t} } } */ > > > > +/* { dg-final { scan-assembler-not {\tshl\t} } } */ > > > > +/* { dg-final { scan-assembler-not {\tmul\t} } } */ > > > > -- > > > > 2.34.1 > > > >
Hi all, based on everyone's comments I have sent a v2 of this patch that can be found here https://gcc.gnu.org/pipermail/gcc-patches/2022-November/607472.html As per Richard's comments the pattern now uses vec_cond_expr instead and includes other fixes as requested. Also based on Tamar's suggestion I have made it work with poly_int instead of aborting for VLA vectors. I would appreciate any further feedback for the new version. Manolis On Tue, Nov 22, 2022 at 12:57 PM Tamar Christina <Tamar.Christina@arm.com> wrote: > > Hi, > > > -----Original Message----- > > From: Philipp Tomsich <philipp.tomsich@vrull.eu> > > Sent: Tuesday, November 22, 2022 10:35 AM > > To: Tamar Christina <Tamar.Christina@arm.com> > > Cc: Richard Biener <richard.guenther@gmail.com>; mtsamis > > <manolis.tsamis@vrull.eu>; GCC Patches <gcc-patches@gcc.gnu.org>; > > jiangning.liu@amperecomputing.com > > Subject: Re: [PATCH] Add pattern to convert vector shift + bitwise and + > > multiply to vector compare in some cases. > > > > Richard & Tamar, > > > > On Fri, 26 Aug 2022 at 15:29, Tamar Christina <Tamar.Christina@arm.com> > > wrote: > > > > > > > -----Original Message----- > > > > From: Gcc-patches <gcc-patches- > > > > bounces+tamar.christina=arm.com@gcc.gnu.org> On Behalf Of Richard > > > > Biener via Gcc-patches > > > > Sent: Friday, August 26, 2022 10:08 AM > > > > To: mtsamis <manolis.tsamis@vrull.eu> > > > > Cc: GCC Patches <gcc-patches@gcc.gnu.org>; > > > > jiangning.liu@amperecomputing.com; Philipp Tomsich > > > > <philipp.tomsich@vrull.eu> > > > > Subject: Re: [PATCH] Add pattern to convert vector shift + bitwise > > > > and + multiply to vector compare in some cases. > > > > > > > > On Sat, Aug 13, 2022 at 11:59 AM mtsamis <manolis.tsamis@vrull.eu> > > wrote: > > > > > > > > > > When using SWAR (SIMD in a register) techniques a comparison > > > > > operation within such a register can be made by using a > > > > > combination of shifts, bitwise and and multiplication. If code > > > > > using this scheme is vectorized then there is potential to replace > > > > > all these operations with a single vector comparison, by > > > > > reinterpreting the vector types to > > > > match the width of the SWAR register. > > > > > > > > > > For example, for the test function packed_cmp_16_32, the original > > > > generated code is: > > > > > > > > > > ldr q0, [x0] > > > > > add w1, w1, 1 > > > > > ushr v0.4s, v0.4s, 15 > > > > > and v0.16b, v0.16b, v2.16b > > > > > shl v1.4s, v0.4s, 16 > > > > > sub v0.4s, v1.4s, v0.4s > > > > > str q0, [x0], 16 > > > > > cmp w2, w1 > > > > > bhi .L20 > > > > > > > > > > with this pattern the above can be optimized to: > > > > > > > > > > ldr q0, [x0] > > > > > add w1, w1, 1 > > > > > cmlt v0.8h, v0.8h, #0 > > > > > str q0, [x0], 16 > > > > > cmp w2, w1 > > > > > bhi .L20 > > > > > > > > > > The effect is similar for x86-64. > > > > > > > > > > gcc/ChangeLog: > > > > > > > > > > * match.pd: Simplify vector shift + bit_and + multiply in some cases. > > > > > > > > > > gcc/testsuite/ChangeLog: > > > > > > > > > > * gcc.target/aarch64/swar_to_vec_cmp.c: New test. 
> > > > > > > > > > Signed-off-by: mtsamis <manolis.tsamis@vrull.eu> > > > > > --- > > > > > gcc/match.pd | 57 +++++++++++++++ > > > > > .../gcc.target/aarch64/swar_to_vec_cmp.c | 72 > > > > +++++++++++++++++++ > > > > > 2 files changed, 129 insertions(+) create mode 100644 > > > > > gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > > > > > > > > > > diff --git a/gcc/match.pd b/gcc/match.pd index > > > > > 8bbc0dbd5cd..5c768a94846 100644 > > > > > --- a/gcc/match.pd > > > > > +++ b/gcc/match.pd > > > > > @@ -301,6 +301,63 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) > > > > > (view_convert (bit_and:itype (view_convert @0) > > > > > (ne @1 { build_zero_cst (type); > > > > > }))))))) > > > > > > > > > > +/* In SWAR (SIMD in a register) code a comparison of packed data can > > > > > + be consturcted with a particular combination of shift, bitwise and, > > > > > + and multiplication by constants. If that code is vectorized we can > > > > > + convert this pattern into a more efficient vector comparison. > > > > > +*/ (simplify (mult (bit_and (rshift @0 @1) @2) @3) > > > > > > > > You should restrict the pattern a bit more, below you use > > > > uniform_integer_cst_p and also require a vector type thus > > > > > > > > (simplify > > > > (mult (bit_and (rshift @0 VECTOR_CST@1) VECTOR_CST@2) > > > > VECTOR_CST@3) > > > > > > > > > > > > > + (with { > > > > > + tree op_type = TREE_TYPE (@0); > > > > > > > > that's the same as 'type' which is already available. > > > > > > > > > + tree rshift_cst = NULL_TREE; > > > > > + tree bit_and_cst = NULL_TREE; > > > > > + tree mult_cst = NULL_TREE; > > > > > + } > > > > > + /* Make sure we're working with vectors and uniform vector > > > > > + constants. */ (if (VECTOR_TYPE_P (op_type) > > > > > + && (rshift_cst = uniform_integer_cst_p (@1)) > > > > > + && (bit_and_cst = uniform_integer_cst_p (@2)) > > > > > + && (mult_cst = uniform_integer_cst_p (@3))) > > > > > + /* Compute what constants would be needed for this to > > > > > + represent a > > > > packed > > > > > + comparison based on the shift amount denoted by RSHIFT_CST. */ > > > > > + (with { > > > > > + HOST_WIDE_INT vec_elem_bits = vector_element_bits (op_type); > > > > > + HOST_WIDE_INT vec_nelts = TYPE_VECTOR_SUBPARTS > > > > > + (op_type).to_constant (); > > > > > > > > you need to check that this isn't a VLA vector operation. > > > > > > Seems like this pattern should be applicable to VLA as well no? > > > So could we not keep vec_nelts as a poly and just use exact_div Below > > > in the division? The pattern is only valid if cmp_bits_i is a multiple > > > of vec_elem_bits anyway. The build_vector_* should then do the right > > > thing. > > > > Seems like we never agreed on what should go into the next version. > > Am I right in assuming that applicability to VLA is ok and that we should > > primarily focus on addressing the below comments for v2? > > I don't think there was a disagreement here per say. Richard's point was that in the > current form the patch is unsafe for VLA, as to_constant () would fail. That is > if the idea was not to support VLA then this needs a check, or use the checking > version of to_constant. > > I was saying that we don’t need to make this ignore VLA, that the pattern should > be applicable for VLA as well. That is to say, properly handling VLA would address > Richard's concerns as well. > > Cheers, > Tamar > > > > > Cheers, > > Philipp. 
> > > > > > > > > > > + HOST_WIDE_INT vec_bits = vec_elem_bits * vec_nelts; > > > > > + > > > > > + unsigned HOST_WIDE_INT cmp_bits_i, bit_and_i, mult_i; > > > > > + unsigned HOST_WIDE_INT target_mult_i, target_bit_and_i; > > > > > + cmp_bits_i = tree_to_uhwi (rshift_cst) + 1; > > > > > > > > and that the rshift_cst and others actually fit an uhwi. > > > > > > > > > + target_mult_i = (HOST_WIDE_INT_1U << cmp_bits_i) - 1; > > > > > + > > > > > + mult_i = tree_to_uhwi (mult_cst); > > > > > + bit_and_i = tree_to_uhwi (bit_and_cst); > > > > > + target_bit_and_i = 0; > > > > > + > > > > > + for (unsigned i = 0; i < vec_elem_bits / cmp_bits_i; i++) > > > > > + target_bit_and_i = (target_bit_and_i << cmp_bits_i) | 1U; > > > > > > > > it would be nice to have a comment on what this actually does ... > > > > > > > > > + } > > > > > + (if ((exact_log2 (cmp_bits_i)) >= 0 > > > > > + && cmp_bits_i < HOST_BITS_PER_WIDE_INT > > > > > + && vec_elem_bits <= HOST_BITS_PER_WIDE_INT > > > > > + && tree_fits_uhwi_p (rshift_cst) > > > > > + && tree_fits_uhwi_p (mult_cst) > > > > > + && tree_fits_uhwi_p (bit_and_cst) > > > > > + && target_mult_i == mult_i > > > > > + && target_bit_and_i == bit_and_i) > > > > > + /* Compute the vector shape for the comparison and check if > > > > > + the > > > > target is > > > > > + able to expand the comparison with that type. */ > > > > > + (with { > > > > > + tree bool_type = build_nonstandard_boolean_type (cmp_bits_i); > > > > > + int vector_type_nelts = vec_bits / cmp_bits_i; > > > > > + tree vector_type = build_vector_type (bool_type, > > > > > + vector_type_nelts); > > > > > > > > why do you build a bool vector type here and then ... > > > > > > > > > + tree zeros = build_zero_cst (vector_type); > > > > > + tree mask_type = truth_type_for (vector_type); > > > > > > > > ... its truth type? Note both might not be actually supported by > > > > the target and thus receive BLKmode or an integer mode. The latter > > > > is a problem for expand_vec_cmp_expr_p as that might pick up a pattern > > not suitable > > > > here. Also note that truth_type_for can result in a mask mode, aka > > > > QImode with AVX512 or some VnBImode on other archs - those are not > > > > OK to be simply view_converted back to op_type. In general a vector > > > > compare operation yields a mask and you can convert that to a -1/0 > > > > value using a vec_cond_expr. I think we have a pattern that can > > > > then properly simplify the case where this can be expressed as a > > > > view_convert, but of course you then also need to check for > > vec_cond_expr support. > > > > > > > > I would suggest you make 'vector_type' an integer element type (that > > > > also properly specifies the sign of the comparison!) and check you > > > > end up with a vector mode and the mode of the mask_type agrees with > > > > that if you don't want to go the vec_cond_expr route. 
> > > > > > > > > > > > > + } > > > > > + (if (expand_vec_cmp_expr_p (vector_type, mask_type, LT_EXPR)) > > > > > + (view_convert:op_type (lt:mask_type > > > > > + (view_convert:vector_type > > > > @0) > > > > > + { zeros; }))))))))) > > > > > + > > > > > (for cmp (gt ge lt le) > > > > > outp (convert convert negate negate) > > > > > outn (negate negate convert convert) diff --git > > > > > a/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > > > > > b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > > > > > new file mode 100644 > > > > > index 00000000000..26f9ad9ef28 > > > > > --- /dev/null > > > > > +++ b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c > > > > > @@ -0,0 +1,72 @@ > > > > > +/* { dg-do compile } */ > > > > > +/* { dg-options "-O2 -ftree-vectorize" } */ > > > > > + > > > > > +typedef unsigned char uint8_t; > > > > > +typedef unsigned short uint16_t; > > > > > +typedef unsigned int uint32_t; > > > > > + > > > > > +/* 8-bit SWAR tests. */ > > > > > + > > > > > +static uint8_t packed_cmp_8_8(uint8_t a) { > > > > > + return ((a >> 7) & 0x1U) * 0xffU; } > > > > > + > > > > > +/* 16-bit SWAR tests. */ > > > > > + > > > > > +static uint16_t packed_cmp_8_16(uint16_t a) { > > > > > + return ((a >> 7) & 0x101U) * 0xffU; } > > > > > + > > > > > +static uint16_t packed_cmp_16_16(uint16_t a) { > > > > > + return ((a >> 15) & 0x1U) * 0xffffU; } > > > > > + > > > > > +/* 32-bit SWAR tests. */ > > > > > + > > > > > +static uint32_t packed_cmp_8_32(uint32_t a) { > > > > > + return ((a >> 7) & 0x1010101U) * 0xffU; } > > > > > + > > > > > +static uint32_t packed_cmp_16_32(uint32_t a) { > > > > > + return ((a >> 15) & 0x10001U) * 0xffffU; } > > > > > + > > > > > +static uint32_t packed_cmp_32_32(uint32_t a) { > > > > > + return ((a >> 31) & 0x1U) * 0xffffffffU; } > > > > > + > > > > > +/* Driver function to test the vectorized code generated for the > > different > > > > > + packed_cmp variants. */ > > > > > + > > > > > +#define VECTORIZED_PACKED_CMP(T, FUNC) \ > > > > > + void vectorized_cmp_##FUNC(T* a, int n) \ > > > > > + { \ > > > > > + n = (n / 32) * 32; \ > > > > > + for(int i = 0; i < n; i += 4) \ > > > > > + { \ > > > > > + a[i + 0] = FUNC(a[i + 0]); \ > > > > > + a[i + 1] = FUNC(a[i + 1]); \ > > > > > + a[i + 2] = FUNC(a[i + 2]); \ > > > > > + a[i + 3] = FUNC(a[i + 3]); \ > > > > > + } \ > > > > > + } > > > > > + > > > > > +VECTORIZED_PACKED_CMP(uint8_t, packed_cmp_8_8); > > > > > + > > > > > +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_8_16); > > > > > +VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_16_16); > > > > > + > > > > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_8_32); > > > > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_16_32); > > > > > +VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_32_32); > > > > > + > > > > > +/* { dg-final { scan-assembler {\tcmlt\t} } } */ > > > > > +/* { dg-final { scan-assembler-not {\tushr\t} } } */ > > > > > +/* { dg-final { scan-assembler-not {\tshl\t} } } */ > > > > > +/* { dg-final { scan-assembler-not {\tmul\t} } } */ > > > > > -- > > > > > 2.34.1 > > > > >
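To see why the comparison-based form suggested in the review above is equivalent to the shift/and/multiply arithmetic, the rewrite can be modelled with GCC's generic vector extensions: reinterpret the vector as signed lanes of the SWAR width and compare against zero, which already yields the -1/0 lanes that a VEC_COND_EXPR selecting between -1 and 0 would materialize. The sketch below is written for this discussion only (it is neither the v1 nor the v2 pattern); it assumes a compiler supporting __attribute__((vector_size)), and the names swar_form / compare_form are illustrative.

#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t u32x4 __attribute__ ((vector_size (16)));
typedef int8_t s8x16 __attribute__ ((vector_size (16)));

/* Vectorized SWAR sequence the pattern matches, applied per 32-bit element.  */
static u32x4 swar_form (u32x4 v)
{
  return ((v >> 7) & 0x1010101U) * 0xffU;
}

/* Compare form: view the same 16 bytes as signed 8-bit lanes and compare
   against zero.  The vector comparison already produces -1/0 per lane,
   which is exactly the mask a vec_cond_expr with -1/0 arms would produce.  */
static u32x4 compare_form (u32x4 v)
{
  s8x16 b, zero = { 0 };
  memcpy (&b, &v, sizeof b);   /* the VIEW_CONVERT_EXPR step */
  s8x16 m = b < zero;          /* -1 where the lane's sign bit is set, else 0 */
  u32x4 r;
  memcpy (&r, &m, sizeof r);
  return r;
}

int main (void)
{
  u32x4 v = { 0x00000000U, 0xffffffffU, 0x80008000U, 0x12d4e056U };
  u32x4 a = swar_form (v), b = compare_form (v);
  for (int i = 0; i < 4; i++)
    assert (a[i] == b[i]);
  return 0;
}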
diff --git a/gcc/match.pd b/gcc/match.pd
index 8bbc0dbd5cd..5c768a94846 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -301,6 +301,63 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
    (view_convert (bit_and:itype (view_convert @0)
                                 (ne @1 { build_zero_cst (type); })))))))
 
+/* In SWAR (SIMD in a register) code a comparison of packed data can
+   be constructed with a particular combination of shift, bitwise and,
+   and multiplication by constants.  If that code is vectorized we can
+   convert this pattern into a more efficient vector comparison.  */
+(simplify
+ (mult (bit_and (rshift @0 @1) @2) @3)
+ (with {
+   tree op_type = TREE_TYPE (@0);
+   tree rshift_cst = NULL_TREE;
+   tree bit_and_cst = NULL_TREE;
+   tree mult_cst = NULL_TREE;
+  }
+  /* Make sure we're working with vectors and uniform vector constants.  */
+  (if (VECTOR_TYPE_P (op_type)
+       && (rshift_cst = uniform_integer_cst_p (@1))
+       && (bit_and_cst = uniform_integer_cst_p (@2))
+       && (mult_cst = uniform_integer_cst_p (@3)))
+   /* Compute what constants would be needed for this to represent a packed
+      comparison based on the shift amount denoted by RSHIFT_CST.  */
+   (with {
+     HOST_WIDE_INT vec_elem_bits = vector_element_bits (op_type);
+     HOST_WIDE_INT vec_nelts = TYPE_VECTOR_SUBPARTS (op_type).to_constant ();
+     HOST_WIDE_INT vec_bits = vec_elem_bits * vec_nelts;
+
+     unsigned HOST_WIDE_INT cmp_bits_i, bit_and_i, mult_i;
+     unsigned HOST_WIDE_INT target_mult_i, target_bit_and_i;
+     cmp_bits_i = tree_to_uhwi (rshift_cst) + 1;
+     target_mult_i = (HOST_WIDE_INT_1U << cmp_bits_i) - 1;
+
+     mult_i = tree_to_uhwi (mult_cst);
+     bit_and_i = tree_to_uhwi (bit_and_cst);
+     target_bit_and_i = 0;
+
+     for (unsigned i = 0; i < vec_elem_bits / cmp_bits_i; i++)
+       target_bit_and_i = (target_bit_and_i << cmp_bits_i) | 1U;
+    }
+    (if ((exact_log2 (cmp_bits_i)) >= 0
+         && cmp_bits_i < HOST_BITS_PER_WIDE_INT
+         && vec_elem_bits <= HOST_BITS_PER_WIDE_INT
+         && tree_fits_uhwi_p (rshift_cst)
+         && tree_fits_uhwi_p (mult_cst)
+         && tree_fits_uhwi_p (bit_and_cst)
+         && target_mult_i == mult_i
+         && target_bit_and_i == bit_and_i)
+     /* Compute the vector shape for the comparison and check if the target is
+        able to expand the comparison with that type.  */
+     (with {
+       tree bool_type = build_nonstandard_boolean_type (cmp_bits_i);
+       int vector_type_nelts = vec_bits / cmp_bits_i;
+       tree vector_type = build_vector_type (bool_type, vector_type_nelts);
+       tree zeros = build_zero_cst (vector_type);
+       tree mask_type = truth_type_for (vector_type);
+      }
+      (if (expand_vec_cmp_expr_p (vector_type, mask_type, LT_EXPR))
+       (view_convert:op_type (lt:mask_type (view_convert:vector_type @0)
+                                           { zeros; })))))))))
+
 (for cmp (gt ge lt le)
      outp (convert convert negate negate)
      outn (negate negate convert convert)
diff --git a/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c
new file mode 100644
index 00000000000..26f9ad9ef28
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/swar_to_vec_cmp.c
@@ -0,0 +1,72 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize" } */
+
+typedef unsigned char uint8_t;
+typedef unsigned short uint16_t;
+typedef unsigned int uint32_t;
+
+/* 8-bit SWAR tests.  */
+
+static uint8_t packed_cmp_8_8(uint8_t a)
+{
+  return ((a >> 7) & 0x1U) * 0xffU;
+}
+
+/* 16-bit SWAR tests.  */
+
+static uint16_t packed_cmp_8_16(uint16_t a)
+{
+  return ((a >> 7) & 0x101U) * 0xffU;
+}
+
+static uint16_t packed_cmp_16_16(uint16_t a)
+{
+  return ((a >> 15) & 0x1U) * 0xffffU;
+}
+
+/* 32-bit SWAR tests.  */
+
+static uint32_t packed_cmp_8_32(uint32_t a)
+{
+  return ((a >> 7) & 0x1010101U) * 0xffU;
+}
+
+static uint32_t packed_cmp_16_32(uint32_t a)
+{
+  return ((a >> 15) & 0x10001U) * 0xffffU;
+}
+
+static uint32_t packed_cmp_32_32(uint32_t a)
+{
+  return ((a >> 31) & 0x1U) * 0xffffffffU;
+}
+
+/* Driver function to test the vectorized code generated for the different
+   packed_cmp variants.  */
+
+#define VECTORIZED_PACKED_CMP(T, FUNC)          \
+  void vectorized_cmp_##FUNC(T* a, int n)       \
+  {                                             \
+    n = (n / 32) * 32;                          \
+    for(int i = 0; i < n; i += 4)               \
+      {                                         \
+        a[i + 0] = FUNC(a[i + 0]);              \
+        a[i + 1] = FUNC(a[i + 1]);              \
+        a[i + 2] = FUNC(a[i + 2]);              \
+        a[i + 3] = FUNC(a[i + 3]);              \
+      }                                         \
+  }
+
+VECTORIZED_PACKED_CMP(uint8_t, packed_cmp_8_8);
+
+VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_8_16);
+VECTORIZED_PACKED_CMP(uint16_t, packed_cmp_16_16);
+
+VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_8_32);
+VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_16_32);
+VECTORIZED_PACKED_CMP(uint32_t, packed_cmp_32_32);
+
+/* { dg-final { scan-assembler {\tcmlt\t} } } */
+/* { dg-final { scan-assembler-not {\tushr\t} } } */
+/* { dg-final { scan-assembler-not {\tshl\t} } } */
+/* { dg-final { scan-assembler-not {\tmul\t} } } */
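For reference, the constants the match.pd pattern derives (cmp_bits_i, target_mult_i, target_bit_and_i) line up with the literal constants used in the tests above; e.g. for packed_cmp_16_32 a shift of 15 gives cmp_bits_i = 16, target_mult_i = 0xffff and target_bit_and_i = 0x10001, so the rewrite fires. The standalone program below re-implements that derivation in plain C purely for illustration; derive_constants is a hypothetical helper mirroring the pattern's locals, not GCC-internal code.

#include <assert.h>
#include <stdint.h>

/* Given the uniform shift amount and the element width in bits, compute the
   bit_and and mult constants that a packed compare of that shape must use,
   the same way the pattern does.  */
static void derive_constants (uint64_t rshift_cst, unsigned vec_elem_bits,
                              uint64_t *target_bit_and, uint64_t *target_mult)
{
  uint64_t cmp_bits = rshift_cst + 1;             /* SWAR lane width */
  *target_mult = (UINT64_C (1) << cmp_bits) - 1;  /* all-ones lane value */
  uint64_t mask = 0;
  for (unsigned i = 0; i < vec_elem_bits / cmp_bits; i++)
    mask = (mask << cmp_bits) | 1U;               /* one set bit per lane */
  *target_bit_and = mask;
}

int main (void)
{
  uint64_t and_c, mult_c;

  /* packed_cmp_8_32: (a >> 7) & 0x1010101U, then * 0xffU, 32-bit elements.  */
  derive_constants (7, 32, &and_c, &mult_c);
  assert (and_c == 0x1010101U && mult_c == 0xffU);

  /* packed_cmp_16_32: (a >> 15) & 0x10001U, then * 0xffffU.  */
  derive_constants (15, 32, &and_c, &mult_c);
  assert (and_c == 0x10001U && mult_c == 0xffffU);

  /* packed_cmp_8_16: (a >> 7) & 0x101U, then * 0xffU, 16-bit elements.  */
  derive_constants (7, 16, &and_c, &mult_c);
  assert (and_c == 0x101U && mult_c == 0xffU);

  return 0;
}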