From patchwork Tue Jun 20 08:54:52 2023
X-Patchwork-Submitter: "juzhe.zhong@rivai.ai"
X-Patchwork-Id: 110355
From: Juzhe-Zhong
To: gcc-patches@gcc.gnu.org
Cc: kito.cheng@gmail.com, kito.cheng@sifive.com, palmer@dabbelt.com, palmer@rivosinc.com, jeffreyalaw@gmail.com, rdapp.gcc@gmail.com, Juzhe-Zhong
Subject: [PATCH V2] RISC-V: Optimize codegen of VLA SLP
Date: Tue, 20 Jun 2023 16:54:52 +0800
Message-Id: <20230620085452.132059-1-juzhe.zhong@rivai.ai>

This V2 patch adds a comment for Robin:

/* As NPATTERNS is always a power of two, we can AND -NPATTERNS
   to simplify the codegen.  */

Recently, I figured out a better approach to the codegen for VLA stepped vectors.
Here are the detailed descriptions:

Case 1:
void f (uint8_t *restrict a, uint8_t *restrict b)
{
  for (int i = 0; i < 100; ++i)
    {
      a[i * 8] = b[i * 8 + 37] + 1;
      a[i * 8 + 1] = b[i * 8 + 37] + 2;
      a[i * 8 + 2] = b[i * 8 + 37] + 3;
      a[i * 8 + 3] = b[i * 8 + 37] + 4;
      a[i * 8 + 4] = b[i * 8 + 37] + 5;
      a[i * 8 + 5] = b[i * 8 + 37] + 6;
      a[i * 8 + 6] = b[i * 8 + 37] + 7;
      a[i * 8 + 7] = b[i * 8 + 37] + 8;
    }
}

We need to generate the stepped vector (NPATTERNS = 8):
{ 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 8, 8, 8, 8 }

Before this patch:
vid.v v4               ;; {0,1,2,3,4,5,6,7,...}
vsrl.vi v4,v4,3        ;; {0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,...}
li a3,8                ;; {8}
vmul.vx v4,v4,a3       ;; {0,0,0,0,0,0,0,0,8,8,8,8,8,8,8,8,...}

After this patch:
vid.v v4               ;; {0,1,2,3,4,5,6,7,...}
vand.vi v4,v4,-8 (-NPATTERNS) ;; {0,0,0,0,0,0,0,0,8,8,8,8,8,8,8,8,...}

Case 2:
void f (uint8_t *restrict a, uint8_t *restrict b)
{
  for (int i = 0; i < 100; ++i)
    {
      a[i * 8] = b[i * 8 + 3] + 1;
      a[i * 8 + 1] = b[i * 8 + 2] + 2;
      a[i * 8 + 2] = b[i * 8 + 1] + 3;
      a[i * 8 + 3] = b[i * 8 + 0] + 4;
      a[i * 8 + 4] = b[i * 8 + 7] + 5;
      a[i * 8 + 5] = b[i * 8 + 6] + 6;
      a[i * 8 + 6] = b[i * 8 + 5] + 7;
      a[i * 8 + 7] = b[i * 8 + 4] + 8;
    }
}

We need to generate the stepped vector (NPATTERNS = 4):
{ 3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12, ... }

Before this patch:
li a6,134221824
slli a6,a6,5
addi a6,a6,3           ;; 64-bit: 0x0003000200010000
vmv.v.x v6,a6          ;; {3, 2, 1, 0, ...}
vid.v v4               ;; {0, 1, 2, 3, 4, 5, 6, 7, ...}
vsrl.vi v4,v4,2        ;; {0, 0, 0, 0, 1, 1, 1, 1, ...}
li a3,4                ;; {4}
vmul.vx v4,v4,a3       ;; {0, 0, 0, 0, 4, 4, 4, 4, ...}
vadd.vv v4,v4,v6       ;; {3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12, ...}

After this patch:
li a3,-536875008
slli a3,a3,4
addi a3,a3,1
slli a3,a3,16
vmv.v.x v2,a3          ;; {3, 1, -1, -3, ...}
vid.v v4               ;; {0, 1, 2, 3, 4, 5, 6, 7, ...}
vadd.vv v4,v4,v2       ;; {3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12, ...}

gcc/ChangeLog:

	* config/riscv/riscv-v.cc (expand_const_vector): Optimize codegen.
gcc/testsuite/ChangeLog:

	* gcc.target/riscv/rvv/autovec/partial/slp-1.c: Adapt testcase.
	* gcc.target/riscv/rvv/autovec/partial/slp-16.c: New test.
	* gcc.target/riscv/rvv/autovec/partial/slp_run-16.c: New test.

---
 gcc/config/riscv/riscv-v.cc                   | 80 ++++++++-----------
 .../riscv/rvv/autovec/partial/slp-1.c         |  2 +
 .../riscv/rvv/autovec/partial/slp-16.c        | 24 ++++++
 .../riscv/rvv/autovec/partial/slp_run-16.c    | 66 +++++++++++++++
 4 files changed, 127 insertions(+), 45 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp-16.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp_run-16.c

diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index 79c0337327d..5d61187a848 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -1128,7 +1128,7 @@ expand_const_vector (rtx target, rtx src)
 	    builder.quick_push (CONST_VECTOR_ELT (src, i * npatterns + j));
 	}
       builder.finalize ();
-      
+
       if (CONST_VECTOR_DUPLICATE_P (src))
 	{
 	  /* Handle the case with repeating sequence that NELTS_PER_PATTERN = 1
@@ -1204,61 +1204,51 @@
       if (builder.single_step_npatterns_p ())
 	{
 	  /* Describe the case by choosing NPATTERNS = 4 as an example.  */
-	  rtx base, step;
+	  insn_code icode;
+
+	  /* Step 1: Generate vid = { 0, 1, 2, 3, 4, 5, 6, 7, ... }.  */
+	  rtx vid = gen_reg_rtx (builder.mode ());
+	  rtx vid_ops[] = {vid};
+	  icode = code_for_pred_series (builder.mode ());
+	  emit_vlmax_insn (icode, RVV_MISC_OP, vid_ops);
+
 	  if (builder.npatterns_all_equal_p ())
 	    {
 	      /* Generate the variable-length vector following this rule:
 		 { a, a, a + step, a + step, a + step * 2, a + step * 2, ...}
		   E.g. { 0, 0, 8, 8, 16, 16, ... } */
-	      /* Step 1: Generate base = { 0, 0, 0, 0, 0, 0, 0, ... }.  */
-	      base = expand_vector_broadcast (builder.mode (), builder.elt (0));
+	      /* Step 2: VID AND -NPATTERNS:
+		 { 0&-4, 1&-4, 2&-4, 3&-4, 4&-4, 5&-4, 6&-4, 7&-4, ...
+		 }
+	      */
+	      rtx imm
+		= gen_int_mode (-builder.npatterns (), builder.inner_mode ());
+	      /* As NPATTERNS is always a power of two, we can AND -NPATTERNS
+		 to simplify the codegen.  */
+	      rtx and_ops[] = {target, vid, imm};
+	      icode = code_for_pred_scalar (AND, builder.mode ());
+	      emit_vlmax_insn (icode, RVV_BINOP, and_ops);
 	    }
 	  else
 	    {
 	      /* Generate the variable-length vector following this rule:
 		 { a, b, a, b, a + step, b + step, a + step*2, b + step*2, ...}
-		 E.g. { 0, 6, 0, 6, 8, 14, 8, 14, 16, 22, 16, 22, ... } */
-	      /* Step 1: Generate base = { 0, 6, 0, 6, ... }.  */
-	      rvv_builder new_builder (builder.mode (), builder.npatterns (),
-				       1);
-	      for (unsigned int i = 0; i < builder.npatterns (); ++i)
-		new_builder.quick_push (builder.elt (i));
-	      rtx new_vec = new_builder.build ();
-	      base = gen_reg_rtx (builder.mode ());
-	      emit_move_insn (base, new_vec);
+		 E.g. { 3, 2, 1, 0, 7, 6, 5, 4, ... } */
+	      /* Step 2: Generate diff = TARGET - VID:
+		 { 3-0, 2-1, 1-2, 0-3, 7-4, 6-5, 5-6, 4-7, ... }  */
+	      rvv_builder v (builder.mode (), builder.npatterns (), 1);
+	      for (unsigned int i = 0; i < v.npatterns (); ++i)
+		{
+		  /* Calculate the diff between the target sequence and
+		     vid sequence.  */
+		  HOST_WIDE_INT diff = INTVAL (builder.elt (i)) - i;
+		  v.quick_push (gen_int_mode (diff, v.inner_mode ()));
+		}
+	      /* Step 3: Generate result = VID + diff.  */
+	      rtx vec = v.build ();
+	      rtx add_ops[] = {target, vid, vec};
+	      emit_vlmax_insn (code_for_pred (PLUS, builder.mode ()), RVV_BINOP,
+			       add_ops);
 	    }
-
-	  /* Step 2: Generate step = gen_int_mode (diff, mode).  */
-	  poly_int64 value1 = rtx_to_poly_int64 (builder.elt (0));
-	  poly_int64 value2
-	    = rtx_to_poly_int64 (builder.elt (builder.npatterns ()));
-	  poly_int64 diff = value2 - value1;
-	  step = gen_int_mode (diff, builder.inner_mode ());
-
-	  /* Step 3: Generate vid = { 0, 1, 2, 3, 4, 5, 6, 7, ... }.
-	   */
-	  rtx vid = gen_reg_rtx (builder.mode ());
-	  rtx op[] = {vid};
-	  emit_vlmax_insn (code_for_pred_series (builder.mode ()), RVV_MISC_OP,
-			   op);
-
-	  /* Step 4: Generate factor = { 0, 0, 0, 0, 1, 1, 1, 1, ... }.  */
-	  rtx factor = gen_reg_rtx (builder.mode ());
-	  rtx shift_ops[]
-	    = {factor, vid,
-	       gen_int_mode (exact_log2 (builder.npatterns ()), Pmode)};
-	  emit_vlmax_insn (code_for_pred_scalar (LSHIFTRT, builder.mode ()),
-			   RVV_BINOP, shift_ops);
-
-	  /* Step 5: Generate adjusted step = { 0, 0, 0, 0, diff, diff, ... } */
-	  rtx adjusted_step = gen_reg_rtx (builder.mode ());
-	  rtx mul_ops[] = {adjusted_step, factor, step};
-	  emit_vlmax_insn (code_for_pred_scalar (MULT, builder.mode ()),
-			   RVV_BINOP, mul_ops);
-
-	  /* Step 6: Generate the final result.  */
-	  rtx add_ops[] = {target, base, adjusted_step};
-	  emit_vlmax_insn (code_for_pred (PLUS, builder.mode ()), RVV_BINOP,
-			   add_ops);
 	}
       else
 	/* TODO: We will enable more variable-length vector in the future.  */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp-1.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp-1.c
index befb518e2dd..0bce8361327 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp-1.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp-1.c
@@ -20,3 +20,5 @@ f (int8_t *restrict a, int8_t *restrict b, int n)
 }
 
 /* { dg-final { scan-tree-dump-times "\.VEC_PERM" 1 "optimized" } } */
+/* { dg-final { scan-assembler {\tvid\.v} } } */
+/* { dg-final { scan-assembler {\tvand} } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp-16.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp-16.c
new file mode 100644
index 00000000000..1a35bba013b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp-16.c
@@ -0,0 +1,24 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-march=rv32gcv -mabi=ilp32d --param riscv-autovec-preference=scalable -fdump-tree-optimized-details" } */
+
+#include <stdint.h>
+
+void
+f (uint8_t *restrict a,
+   uint8_t *restrict b, int n)
+{
+  for (int i = 0; i < n; ++i)
+    {
+      a[i * 8] = b[i * 8 + 3] + 1;
+      a[i * 8 + 1] = b[i * 8 + 2] + 2;
+      a[i * 8 + 2] = b[i * 8 + 1] + 3;
+      a[i * 8 + 3] = b[i * 8 + 0] + 4;
+      a[i * 8 + 4] = b[i * 8 + 7] + 5;
+      a[i * 8 + 5] = b[i * 8 + 6] + 6;
+      a[i * 8 + 6] = b[i * 8 + 5] + 7;
+      a[i * 8 + 7] = b[i * 8 + 4] + 8;
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "\.VEC_PERM" 1 "optimized" } } */
+/* { dg-final { scan-assembler {\tvid\.v} } } */
+/* { dg-final { scan-assembler-not {\tvmul} } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp_run-16.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp_run-16.c
new file mode 100644
index 00000000000..765ec5181a4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/slp_run-16.c
@@ -0,0 +1,66 @@
+/* { dg-do run { target { riscv_vector } } } */
+/* { dg-additional-options "--param riscv-autovec-preference=scalable" } */
+
+#include "slp-16.c"
+
+#define LIMIT 128
+void __attribute__ ((optimize (0)))
+f_golden (int8_t *restrict a, int8_t *restrict b, int n)
+{
+  for (int i = 0; i < n; ++i)
+    {
+      a[i * 8] = b[i * 8 + 3] + 1;
+      a[i * 8 + 1] = b[i * 8 + 2] + 2;
+      a[i * 8 + 2] = b[i * 8 + 1] + 3;
+      a[i * 8 + 3] = b[i * 8 + 0] + 4;
+      a[i * 8 + 4] = b[i * 8 + 7] + 5;
+      a[i * 8 + 5] = b[i * 8 + 6] + 6;
+      a[i * 8 + 6] = b[i * 8 + 5] + 7;
+      a[i * 8 + 7] = b[i * 8 + 4] + 8;
+    }
+}
+
+int
+main (void)
+{
+#define RUN(NUM)                                                               \
+  int8_t a_##NUM[NUM * 8 + 8] = {0};                                           \
+  int8_t a_golden_##NUM[NUM * 8 + 8] = {0};                                    \
+  int8_t b_##NUM[NUM * 8 + 8] = {0};                                           \
+  for (int i = 0; i < NUM * 8 + 8; i++)                                        \
+    {                                                                          \
+      if (i % NUM == 0)                                                        \
+	b_##NUM[i] = (i + NUM) % LIMIT;                                        \
+      else                                                                     \
+	b_##NUM[i] = (i - NUM) % (-LIMIT);                                     \
+    }                                                                          \
+  f (a_##NUM, b_##NUM, NUM);                                                   \
+  f_golden (a_golden_##NUM, b_##NUM, NUM);                                     \
+  for (int i = 0; i < NUM * 8 + 8; i++)                                        \
+    {                                                                          \
+      if (a_##NUM[i] != a_golden_##NUM[i])                                     \
+	__builtin_abort ();                                                    \
+    }
+
+  RUN (3);
+  RUN (5);
+  RUN (15);
+  RUN (16);
+  RUN (17);
+  RUN (31);
+  RUN (32);
+  RUN (33);
+  RUN (63);
+  RUN (64);
+  RUN (65);
+  RUN (127);
+  RUN (128);
+  RUN (129);
+  RUN (239);
+  RUN (359);
+  RUN (498);
+  RUN (799);
+  RUN (977);
+  RUN (5789);
+  return 0;
+}