From patchwork Fri Nov 17 04:43:19 2023
From: Juzhe-Zhong <juzhe.zhong@rivai.ai>
To: gcc-patches@gcc.gnu.org
Cc: kito.cheng@gmail.com, kito.cheng@sifive.com, jeffreyalaw@gmail.com, rdapp.gcc@gmail.com
Subject: [PATCH] RISC-V: Optimize VLA SLP with duplicate VLA shuffle indices
Date: Fri, 17 Nov 2023 12:43:19 +0800
Message-Id: <20231117044319.3912782-1-juzhe.zhong@rivai.ai>
X-Mailer: git-send-email 2.36.3
When evaluating dynamic LMUL, we noticed we can do better on VLA SLP with
duplicate VLA shuffle indices.  Consider the following case:

void
foo (uint16_t *restrict a, uint16_t *restrict b, int n)
{
  for (int i = 0; i < n; ++i)
    {
      a[i * 8] = b[i * 8 + 3] + 1;
      a[i * 8 + 1] = b[i * 8 + 6] + 1;
      a[i * 8 + 2] = b[i * 8 + 0] + 1;
      a[i * 8 + 3] = b[i * 8 + 2] + 1;
      a[i * 8 + 4] = b[i * 8 + 1] + 1;
      a[i * 8 + 5] = b[i * 8 + 7] + 1;
      a[i * 8 + 6] = b[i * 8 + 5] + 1;
      a[i * 8 + 7] = b[i * 8 + 4] + 1;
    }
}

We want to generate the following shuffle indices:

{ 3, 6, 0, 2, 1, 7, 5, 4,
  11, 14, 8, 10, 9, 15, 13, 12,
  19, 22, 16, 18, 17, 23, 21, 20, ... }

Before this patch, we used a chain of vmseq + vmerge pairs to create such a
vector in the prologue: https://godbolt.org/z/r919WPabK

foo:
	ble	a2,zero,.L5
	vsetvli	a5,zero,e16,m8,ta,ma
	vid.v	v16
	vand.vi	v24,v16,7
	vmseq.vi	v0,v24,1
	vmseq.vi	v1,v24,2
	vmv.v.i	v8,3
	vmerge.vim	v8,v8,5,v0
	vmv1r.v	v0,v1
	vmseq.vi	v2,v24,3
	vmerge.vim	v8,v8,-2,v0
	vmv1r.v	v0,v2
	vmseq.vi	v1,v24,4
	csrr	a4,vlenb
	vmerge.vim	v8,v8,-1,v0
	csrr	a6,vlenb
	vmv1r.v	v0,v1
	vmseq.vi	v2,v24,5
	slli	a3,a4,2
	vmerge.vim	v8,v8,-3,v0
	slli	a6,a6,2
	vmv1r.v	v0,v2
	vmseq.vi	v1,v24,6
	vmerge.vim	v8,v8,2,v0
	addi	a6,a6,-1
	vmv1r.v	v0,v1
	slli	a2,a2,3
	slli	a4,a4,3
	neg	a7,a3
	vmseq.vi	v2,v24,7
	vmerge.vim	v8,v8,-1,v0
	vmv.v.x	v24,a6
	vmv1r.v	v0,v2
	vmerge.vim	v8,v8,-3,v0
	vadd.vv	v16,v16,v8
	vand.vv	v24,v16,v24
.L3:
	minu	a5,a2,a3
	vsetvli	zero,a5,e16,m8,ta,ma
	mv	a6,a2
	vle16.v	v16,0(a1)
	vrgather.vv	v8,v16,v24
	vadd.vi	v8,v8,1
	vse16.v	v8,0(a0)
	add	a1,a1,a4
	add	a0,a0,a4
	add	a2,a2,a7
	bgtu	a6,a3,.L3
.L5:
	ret

After this patch:

foo:
	ble	a2,zero,.L5
	li	a6,-536875008
	slli	a6,a6,4
	addi	a6,a6,3
	csrr	a4,vlenb
	csrr	a5,vlenb
	li	t1,-536850432
	slli	a6,a6,16
	slli	a3,a4,1
	addi	a6,a6,-3
	slli	a5,a5,1
	slli	t1,t1,4
	vsetvli	t3,zero,e64,m4,ta,ma
	addi	a5,a5,-1
	vmv.v.x	v4,a6
	addi	t1,t1,3
	slli	a2,a2,3
	slli	a4,a4,2
	neg	a7,a3
	vslide1up.vx	v16,v4,t1
	vsetvli	a6,zero,e16,m4,ta,ma
	vid.v	v12
	vmv.v.x	v4,a5
	vand.vi	v20,v12,7
	vrgather.vv	v8,v16,v20
	vadd.vv	v12,v12,v8
	vand.vv	v12,v12,v4
.L3:
	minu	a5,a2,a3
	vsetvli	zero,a5,e16,m4,ta,ma
	mv	a6,a2
	vle16.v	v8,0(a1)
	vrgather.vv	v4,v8,v12
	vadd.vi	v4,v4,1
	vse16.v	v4,0(a0)
	add	a1,a1,a4
	add	a0,a0,a4
	add	a2,a2,a7
	bgtu	a6,a3,.L3
.L5:
	ret

Optimizations:
1. Removes 9 dynamic instructions from the prologue.
2. Fewer vector instructions reduce hardware pipeline pressure.

gcc/ChangeLog:

	* config/riscv/riscv-v.cc (rvv_builder::merge_pattern): New function.
	(expand_vector_init_trailing_same_elem): Adapt function.
	(expand_const_vector): Ditto.
	(expand_vec_init): Ditto.

gcc/testsuite/ChangeLog:

	* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-6.c: Adapt test.
	* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-8.c: Ditto.
	* gcc.dg/vect/costmodel/riscv/rvv/slp-1.c: New test.

---
 gcc/config/riscv/riscv-v.cc                        | 174 +++++++++++++-----
 .../costmodel/riscv/rvv/dynamic-lmul4-6.c          |   7 +-
 .../costmodel/riscv/rvv/dynamic-lmul4-8.c          |   7 +-
 .../gcc.dg/vect/costmodel/riscv/rvv/slp-1.c        |  40 ++++
 4 files changed, 174 insertions(+), 54 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/slp-1.c

diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index 000d0c2c721..6a2009ffb05 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -420,6 +420,7 @@ public:
   bool single_step_npatterns_p () const;
   bool npatterns_all_equal_p () const;
 
+  void merge_pattern (rtx);
   machine_mode new_mode () const { return m_new_mode; }
   scalar_mode inner_mode () const { return m_inner_mode; }
@@ -679,6 +680,50 @@ rvv_builder::npatterns_all_equal_p () const
   return true;
 }
 
+/* Merge patterns to reduce the number of elements we need to process.
+
+   E.g. v = { 0, 1, 2, 3 }, mode = V4SI, and we can use EEW = 64
+   RVV instructions.  Transform it into:
+
+   v = { 1 << 32, 3 << 32 | 2 }, mode = V2DI.
+  */
+void
+rvv_builder::merge_pattern (rtx src)
+{
+  if (this->inner_bits_size () >= GET_MODE_BITSIZE (Pmode))
+    return;
+  unsigned int ele_num = GET_MODE_BITSIZE (Pmode) / this->inner_bits_size ();
+  poly_uint64 nunits = exact_div (this->full_nelts (), ele_num);
+  machine_mode mode;
+  if (!get_vector_mode (Pmode, nunits).exists (&mode))
+    return;
+  this->new_vector (mode, this->npatterns () / ele_num,
+		    this->nelts_per_pattern ());
+  rtx imm = gen_int_mode ((1ULL << this->inner_bits_size ()) - 1, Pmode);
+  for (unsigned int i = 0; i < this->nelts_per_pattern (); i++)
+    {
+      for (unsigned int j = 0; j < this->npatterns (); j++)
+	{
+	  rtx val = const0_rtx;
+	  for (unsigned int k = 0; k < ele_num; k++)
+	    {
+	      rtx e
+		= gen_lowpart (Pmode, CONST_VECTOR_ELT (src, k + j * ele_num));
+	      e = expand_simple_binop (Pmode, AND, e, imm, NULL_RTX, false,
+				       OPTAB_DIRECT);
+	      e = expand_simple_binop (
+		Pmode, ASHIFT, e,
+		gen_int_mode (this->inner_bits_size () * k, Pmode), NULL_RTX,
+		false, OPTAB_DIRECT);
+	      val = expand_simple_binop (Pmode, IOR, e, val, NULL_RTX, false,
+					 OPTAB_DIRECT);
+	    }
+	  this->quick_push (val);
+	}
+    }
+  this->finalize ();
+}
+
 static unsigned
 get_sew (machine_mode mode)
 {
@@ -1040,6 +1085,39 @@ expand_vec_series (rtx dest, rtx base, rtx step)
   emit_move_insn (dest, result);
 }
 
+/* Subroutine of expand_vec_init and expand_const_vector to handle the case
+   when all trailing elements of the builder are the same.
+   This works as follows:
+   (a) Use the expand_insn interface to broadcast the last vector element
+       in TARGET.
+   (b) Insert the remaining elements in TARGET using vslide1up.
+
+   ??? The heuristic used is to do the above if the number of same trailing
+   elements is greater than leading_ndups, loosely based on the heuristic
+   from mostly_zeros_p.  May need fine-tuning.  */
+
+static void
+expand_vector_init_trailing_same_elem (rtx target,
+				       const rtx_vector_builder &builder,
+				       int start, int nelts_reqd)
+{
+  machine_mode mode = GET_MODE (target);
+
+  rtx dup = expand_vector_broadcast (mode, builder.elt (nelts_reqd - 1));
+  for (int i = start - 1; i >= 0; i--)
+    {
+      unsigned int unspec
+	= FLOAT_MODE_P (mode) ? UNSPEC_VFSLIDE1UP : UNSPEC_VSLIDE1UP;
+      insn_code icode = code_for_pred_slide (unspec, mode);
+      rtx tmp = gen_reg_rtx (mode);
+      rtx ops[] = {tmp, dup, builder.elt (i)};
+      emit_vlmax_insn (icode, BINARY_OP, ops);
+      /* slide1up needs source and dest to be different REGs.  */
+      dup = tmp;
+    }
+
+  emit_move_insn (target, dup);
+}
+
 static void
 expand_const_vector (rtx target, rtx src)
 {
@@ -1139,6 +1217,43 @@ expand_const_vector (rtx target, rtx src)
 	  rtx dup = expand_vector_broadcast (builder.new_mode (), ele);
 	  emit_move_insn (target, gen_lowpart (mode, dup));
 	}
+      /* We don't apply this approach for LMUL = 8 since vrgather.vv does
+	 not allow the dest to overlap any source register, and the VLA
+	 repeating vector always needs an addition.  Such a VLA constant
+	 vector would consume 32 registers at LMUL = 8, causing serious
+	 register pressure.  */
+      else if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT && npatterns > 2)
+	{
+	  /* We handle the case where we can't find a vector container to
+	     hold element bitsize = NPATTERNS * ele_bitsize.
+
+	       NPATTERNS = 8, element width = 16
+	       v = { 3, 5, -2, -1, -3, 2, -1, -3, ... }
+	     Since NPATTERNS * element width = 128, we can't find a
+	     container to hold it.
+
+	     In this case, we use NPATTERNS merge operations to generate
+	     such a vector.  */
+
+	  /* Generate index = { 0, 1, 2, 3, 4, 5, 6, 7, ... }.
+	  */
+	  rtx vid = gen_reg_rtx (builder.int_mode ());
+	  expand_vec_series (vid, const0_rtx, const1_rtx);
+	  rtx index = gen_reg_rtx (builder.int_mode ());
+	  rtx range = gen_int_mode (builder.npatterns () - 1,
+				    GET_MODE_INNER (builder.int_mode ()));
+	  rtx and_ops[] = {index, vid, range};
+	  emit_vlmax_insn (code_for_pred_scalar (AND, builder.int_mode ()),
+			   BINARY_OP, and_ops);
+
+	  /* Create vector = { 3, 5, -2, -1, -3, 2, -1, -3, X, ... , X }.  */
+	  builder.merge_pattern (src);
+	  rtx init = gen_reg_rtx (builder.mode ());
+	  expand_vector_init_trailing_same_elem (init, builder,
+						 builder.npatterns () - 1,
+						 builder.npatterns ());
+
+	  /* Shuffle vector with index.  */
+	  emit_vlmax_gather_insn (target, gen_lowpart (mode, init), index);
+	}
       else
 	{
 	  /* We handle the case that we can't find a vector containter to hold
@@ -2268,47 +2383,6 @@ expand_vector_init_merge_combine_sequence (rtx target,
   emit_vlmax_insn (icode, MERGE_OP, merge_ops);
 }
 
-/* Subroutine of expand_vec_init to handle case
-   when all trailing elements of builder are same.
-   This works as follows:
-   (a) Use expand_insn interface to broadcast last vector element in TARGET.
-   (b) Insert remaining elements in TARGET using insr.
-
-   ??? The heuristic used is to do above if number of same trailing elements
-   is greater than leading_ndups, loosely based on
-   heuristic from mostly_zeros_p.  May need fine-tuning.  */
-
-static bool
-expand_vector_init_trailing_same_elem (rtx target,
-				       const rtx_vector_builder &builder,
-				       int nelts_reqd)
-{
-  int leading_ndups = builder.count_dups (0, nelts_reqd - 1, 1);
-  int trailing_ndups = builder.count_dups (nelts_reqd - 1, -1, -1);
-  machine_mode mode = GET_MODE (target);
-
-  if (trailing_ndups > leading_ndups)
-    {
-      rtx dup = expand_vector_broadcast (mode, builder.elt (nelts_reqd - 1));
-      for (int i = nelts_reqd - trailing_ndups - 1; i >= 0; i--)
-	{
-	  unsigned int unspec
-	    = FLOAT_MODE_P (mode) ? UNSPEC_VFSLIDE1UP : UNSPEC_VSLIDE1UP;
-	  insn_code icode = code_for_pred_slide (unspec, mode);
-	  rtx tmp = gen_reg_rtx (mode);
-	  rtx ops[] = {tmp, dup, builder.elt (i)};
-	  emit_vlmax_insn (icode, BINARY_OP, ops);
-	  /* slide1up need source and dest to be different REG.  */
-	  dup = tmp;
-	}
-
-      emit_move_insn (target, dup);
-      return true;
-    }
-
-  return false;
-}
-
 /* Initialize register TARGET from the elements in PARALLEL rtx VALS.  */
 
 void
@@ -2378,11 +2452,19 @@ expand_vec_init (rtx target, rtx vals)
 
   /* Optimize trailing same elements sequence:
      v = {y, y2, y3, y4, y5, x, x, x, x, x, x, x, x, x, x, x};  */
-  if (!expand_vector_init_trailing_same_elem (target, v, nelts))
-    /* Handle common situation by vslide1down. This function can handle any
-       situation of vec_init. Only the cases that are not optimized above
-       will fall through here.  */
-    expand_vector_init_insert_elems (target, v, nelts);
+  int leading_ndups = v.count_dups (0, nelts - 1, 1);
+  int trailing_ndups = v.count_dups (nelts - 1, -1, -1);
+  if (trailing_ndups > leading_ndups)
+    {
+      expand_vector_init_trailing_same_elem (target, v, nelts - trailing_ndups,
+					     nelts);
+      return;
+    }
+
+  /* Handle the common situation by vslide1down.  This function can handle
+     any situation of vec_init.  Only the cases that are not optimized above
+     will fall through here.  */
+  expand_vector_init_insert_elems (target, v, nelts);
 }
 
 /* Get insn code for corresponding comparison.
  */
diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-6.c b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-6.c
index c7b88635923..d3149052b13 100644
--- a/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-6.c
+++ b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-6.c
@@ -4,7 +4,7 @@
 #include <stdint.h>
 
 void
-foo (uint8_t *restrict a, uint8_t *restrict b, int n)
+foo (uint16_t *restrict a, uint16_t *restrict b, int n)
 {
   for (int i = 0; i < n; ++i)
     {
@@ -19,9 +19,8 @@ foo (uint8_t *restrict a, uint8_t *restrict b, int n)
     }
 }
 
-/* { dg-final { scan-assembler {e8,m4} } } */
-/* { dg-final { scan-assembler-times {csrr} 1 } } */
-/* { dg-final { scan-tree-dump-not "Maximum lmul = 8" "vect" } } */
+/* { dg-final { scan-assembler {e16,m4} } } */
+/* { dg-final { scan-tree-dump-times "Maximum lmul = 8" 1 "vect" } } */
 /* { dg-final { scan-tree-dump-times "Maximum lmul = 4" 1 "vect" } } */
 /* { dg-final { scan-tree-dump-not "Maximum lmul = 2" "vect" } } */
 /* { dg-final { scan-tree-dump-not "Maximum lmul = 1" "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-8.c b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-8.c
index 4d9175e86f8..4c428f33a79 100644
--- a/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-8.c
+++ b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-8.c
@@ -4,7 +4,7 @@
 #include <stdint.h>
 
 void
-foo (uint8_t *restrict a, uint8_t *restrict b, int n)
+foo (uint16_t *restrict a, uint16_t *restrict b, int n)
 {
   for (int i = 0; i < n; ++i)
     {
@@ -28,9 +28,8 @@ foo (uint8_t *restrict a, uint8_t *restrict b, int n)
     }
 }
 
-/* { dg-final { scan-assembler {e8,m4} } } */
-/* { dg-final { scan-assembler-times {csrr} 1 } } */
-/* { dg-final { scan-tree-dump-not "Maximum lmul = 8" "vect" } } */
+/* { dg-final { scan-assembler {e16,m4} } } */
+/* { dg-final { scan-tree-dump-times "Maximum lmul = 8" 1 "vect" } } */
 /* { dg-final { scan-tree-dump-times "Maximum lmul = 4" 1 "vect" } } */
 /* { dg-final { scan-tree-dump-not "Maximum lmul = 2" "vect" } } */
 /* { dg-final { scan-tree-dump-not "Maximum lmul = 1" "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/slp-1.c b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/slp-1.c
new file mode 100644
index 00000000000..23bf59b0cb3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/slp-1.c
@@ -0,0 +1,40 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gcv -mabi=lp64d -O3 -ftree-vectorize --param riscv-autovec-lmul=dynamic -fno-schedule-insns -fdump-tree-vect-details" } */
+/* { dg-final { check-function-bodies "**" "" } } */
+
+#include <stdint.h>
+
+/*
+** foo:
+** ...
+** vslide1up\.vx\s+v[0-9]+,\s*v[0-9]+,\s*[a-x0-9]+
+** ...
+** vrgather\.vv\s+v[0-9]+,\s*v[0-9]+,\s*v[0-9]+
+** ...
+** vle16\.v\s+v8,\s*0\([a-x0-9]+\)
+** ...
+** vrgather\.vv\s+v[0-9]+,\s*v[0-9]+,\s*v[0-9]+
+** ...
+*/
+void
+foo (uint16_t *restrict a, uint16_t *restrict b, int n)
+{
+  for (int i = 0; i < n; ++i)
+    {
+      a[i * 8] = b[i * 8 + 3] + 1;
+      a[i * 8 + 1] = b[i * 8 + 6] + 1;
+      a[i * 8 + 2] = b[i * 8 + 0] + 1;
+      a[i * 8 + 3] = b[i * 8 + 2] + 1;
+      a[i * 8 + 4] = b[i * 8 + 1] + 1;
+      a[i * 8 + 5] = b[i * 8 + 7] + 1;
+      a[i * 8 + 6] = b[i * 8 + 5] + 1;
+      a[i * 8 + 7] = b[i * 8 + 4] + 1;
+    }
+}
+
+/* { dg-final { scan-assembler-times {vslide1up\.vx} 1 } } */
+/* { dg-final { scan-assembler-times {vrgather\.vv} 2 } } */
+/* { dg-final { scan-assembler-not {vmv1r\.v} } } */
+/* { dg-final { scan-assembler-not {vmv2r\.v} } } */
+/* { dg-final { scan-assembler-not {vmv4r\.v} } } */
+/* { dg-final { scan-assembler-not {vmv8r\.v} } } */